Download
Information about the resources
All the following resources, with the exception of the 'Frequency dictionary...' corpus, are currently available under the terms of this Licence, free of charge. Starting the download implies the acceptance of these terms. The 'Frequency dictionary...' corpus is available under the terms of the GNU General Public License.
Most of these resources are available in the .tar.bz2 format. On Windows, these files can be unpacked with, for example, the free program 7-Zip.
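For those who would rather unpack the archives from a script, the following is a minimal sketch using Python's standard tarfile module. The function name unpack_corpus and the target directory are illustrative assumptions, not part of the distribution; the archive name in the usage comment is one of the files listed on this page.

```python
import tarfile
from pathlib import Path


def unpack_corpus(archive_path, target_dir):
    """Unpack a bzip2-compressed tar archive into target_dir.

    Returns the list of member names found in the archive.
    (Hypothetical helper, written for illustration only.)
    """
    root = Path(target_dir).resolve()
    root.mkdir(parents=True, exist_ok=True)

    # "r:bz2" opens the file as a tar archive compressed with bzip2.
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        members = archive.getmembers()

        # Refuse members that would be written outside the target directory.
        for member in members:
            destination = (root / member.name).resolve()
            if destination != root and root not in destination.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")

        archive.extractall(path=root)

    return [member.name for member in members]


# Example call, assuming the archive has already been downloaded
# to the current directory:
# unpack_corpus("2.sample.30.bin.tar.bz2", "corpora")
```

The path check is a precaution that applies to any downloaded archive; it does not imply anything about the contents of these particular files.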
Corpora
Binary versions of corpora searchable with Poliqarp, all as tar archives compressed with bzip2:
2nd edition of the IPI PAN Corpus (March 2006)
- 2.all.250.bin.tar.bz2 — the full IPI PAN Corpus, over 250 million segments.
- 2.sample.30.bin.tar.bz2 — the IPI PAN Corpus Sample searchable at http://korpus.pl/; over 30 million segments.
Just as in the case of the 1st edition of the IPI PAN Corpus, this version of the sample contains a variety of texts representing different genres:
- contemporary prose: over 10%
- older prose: almost 10%
- non-fiction: 10%
- newspapers: 50%
- parliamentary proceedings: 15%
- law: 5%
1st edition of the IPI PAN Corpus (June 2004)
- 1.sources.100.bin.tar.bz2 — a sample of the IPI PAN Corpus also available (for non-commercial purposes) as source texts. (Please get in touch with Adam Przepiórkowski in order to obtain these sources.) It contains over 100 million segments corresponding to over 286 thousand different lemmata. This subcorpus was created by taking the full 1st edition of the IPI PAN Corpus and removing all newspaper texts, as well as a random selection of about 10% of the paragraphs of each copyrighted text.
- 1.wstepny.70.bin.tar.bz2 — a sample of the IPI PAN Corpus from the CD "The IPI PAN Corpus. Preliminary Version", over 70 million segments corresponding to over 364 thousand different lemmata.
- 1.sample.15.bin.tar.bz2 — the IPI PAN Corpus Sample searchable at http://korpus.pl/, over 15 million segments corresponding to 217 thousand different lemmata. The sample corpus, although it perhaps does not deserve to be called a balanced corpus, contains a variety of texts representing different genres:
- contemporary prose: 10%
- older prose: 10%
- science: 10%
- newspapers: 50%
- parliamentary proceedings: 15%
- law: 5%
- frek.bin.tar.bz2 — yet another version of the corpus of the Frequency dictionary of contemporary Polish (Słownik frekwencyjny polszczyzny współczesnej, Kurcz, Lewicki, Sambor, Szafran and Woronczak, 1990, Instytut Języka Polskiego PAN, Cracow), developed in the 1960s and containing 500 thousand words divided evenly into five genres:
- popular science,
- news dispatches,
- editorials and longer articles,
- artistic prose, and
- artistic drama.
A source (XML) version of the "Frequency dictionary..." corpus as a tar archive compressed with bzip2:
Poliqarp
See here.

