Poliqarp's query syntax is based on that of the Corpus Query Processor (CQP), perhaps the most popular program of this kind, created at the University of Stuttgart, but it adds a number of features and improvements. The present section describes the syntax of Poliqarp queries and illustrates it with numerous examples.
In the simplest case, a query is just a sequence of segments, e.g.:
There are three segments in the latter query above, corresponding to two words: przyszedłem and rano. In the case of simple queries like the two queries above, Poliqarp attempts to identify those words which might consist of smaller segments and to handle them properly, so the following queries will also give the expected results:
By default, queries are interpreted in a case-sensitive manner, so the following queries will produce different results:
Queries may contain standard regular expressions over characters, specified with the help of the following special characters: ?, *, +, ., ,, |, {, }, [, ], (, ), as well as natural numbers; segment specifications containing regular expressions must be enclosed in quotes ". Since the formal introduction of regular expressions lies far outside the scope of the current publication, we will be content with discussing just a few examples, which, nevertheless, should allow the user to understand the syntax and semantics of such regular expressions.
The specifications of segments given above must match complete segments, rather than only their parts, hence the necessity of flanking the sequence (la){3,} in query 13. above with the regular expression .*, which matches any sequence of characters (including the empty sequence). The same effect can be achieved with the help of the flag /x, which means that the given specification must be matched by some part of the segment, not necessarily by the complete segment:
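The difference between the two matching modes can be sketched with Python's `re` module, whose regular expressions are similar to, though not necessarily identical with, those accepted by Poliqarp. In this sketch, `re.fullmatch` stands in for the default behaviour and `re.search` for the behaviour with the flag /x; the helper names and the sample segments are invented for illustration.

```python
import re

def matches_whole(pattern, segment):
    """Default behaviour: the pattern must match the complete segment."""
    return re.fullmatch(pattern, segment) is not None

def matches_part(pattern, segment):
    """Behaviour assumed for /x: the pattern may match any part of the segment."""
    return re.search(pattern, segment) is not None

# Invented sample segments, not taken from the corpus.
segments = ["lalala", "tralalala", "lala"]

whole = [s for s in segments if matches_whole("(la){3,}", s)]
flanked = [s for s in segments if matches_whole(".*(la){3,}.*", s)]
part = [s for s in segments if matches_part("(la){3,}", s)]

print(whole)    # segments consisting only of three or more "la"
print(flanked)  # segments containing such a sequence anywhere
print(part)     # same selection as the flanked pattern
```

Flanking the pattern with .* and relaxing the match to a part of the segment select the same segments here, which is the equivalence described above.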
The following query may be used in order to find all forms of the lexeme KORPUS:
The base attribute is one of many attributes that may be used in a query. The value of this attribute should specify the base form (the lemma), so a query like [base=pisać] can be used to find forms such as pisać `write' (infinitive), piszę (non-past form), pisała (l-participle), piszcie (imperative), pisanie (gerund), pisano (impersonal), pisane (adjectival participle), etc.
Another attribute that may be used in queries is orth. The values of this attribute specify segments, so each of the following pairs contains queries which are equivalent.
On the other hand, the two queries below are not equivalent:
The values of base and orth may contain regular expressions of the kind described in §3.1 above, e.g.:
Queries about segments and about base forms may be combined. For example, the following query may be used to find all occurrences of the segment minę understood as a form of the lexeme MINA `mine, face' (and not, say, as a form of the lexeme MIJAĆ, `to pass'):
A similar effect can be achieved with the help of the following query, about those occurrences of the segment minę which are not interpreted as forms of MIJAĆ.
The condition that the base form be different from mijać may also be specified by putting the negation (the exclamation mark) before the name of the attribute, so the query below is equivalent to the query above.
Just as in the propositional calculus, double negation is equivalent to no negation, so the following queries about the segment nie understood as a form of the pronoun ON are fully equivalent:
In Poliqarp queries, the operator & plays the role of logical conjunction. The operator dual to & is |, which plays the role of logical disjunction, e.g.:
In order to better understand the difference between the operators & and |, let us compare the effect of the following two queries:
As the examples above show, specifications of corpus positions, enclosed in square brackets, may contain any number of conditions of the type attribute=value, combined with the operators !, & and |. It is also possible to completely omit any conditions -- the query below could be used to find all segments in the corpus.
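The way such conditions combine can be modelled in Python as predicates over segments. This is only an illustrative sketch, not Poliqarp's implementation: the `Segment` class, the helper functions and the three sample segments are assumptions, and each segment is given exactly one base form for simplicity.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Segment:
    orth: str  # the segment as it appears in the text
    base: str  # its base form (lemma); one per segment in this sketch

Condition = Callable[[Segment], bool]

def orth(value: str) -> Condition:
    return lambda seg: seg.orth == value

def base(value: str) -> Condition:
    return lambda seg: seg.base == value

def neg(cond: Condition) -> Condition:     # plays the role of !
    return lambda seg: not cond(seg)

def conj(*conds: Condition) -> Condition:  # plays the role of &
    return lambda seg: all(c(seg) for c in conds)

def disj(*conds: Condition) -> Condition:  # plays the role of |
    return lambda seg: any(c(seg) for c in conds)

# Invented miniature corpus.
corpus = [
    Segment("minę", "mina"),
    Segment("minę", "mijać"),
    Segment("miny", "mina"),
]

def find(cond: Condition) -> list:
    return [seg for seg in corpus if cond(seg)]

# minę as a form of MINA, stated positively and via negation.
print(find(conj(orth("minę"), base("mina"))))
print(find(conj(orth("minę"), neg(base("mijać")))))
# A conjunction of no conditions is true of every segment.
print(len(find(conj())))
```

In this model, a specification without any conditions corresponds to `conj()`, which holds for every segment, and applying `neg` twice gives back the original condition.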
This trivial specification of corpus positions, matching any segment, may be useful for finding two forms in a certain distance from each other, e.g., two segments separated by two other segments, as in the following query:
It would perhaps be more interesting to specify the upper limit on the number of segments which may intervene between two forms, not just the exact number of such intervening positions. Poliqarp makes it possible to pose such queries, as it also allows regular expressions over corpus positions. For example, the following query may be used to find a form of the lexeme BAĆ occurring two, three or four positions after the segment się:
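The idea of a bounded gap between two corpus positions can be sketched in Python as follows. The function `find_with_gap`, its parameters and the sample sentence with its lemmas are invented for illustration; the sketch counts the segments lying between the two forms and says nothing about how Poliqarp evaluates such queries internally.

```python
def find_with_gap(tokens, first_orth, second_base, min_gap, max_gap):
    """Return (i, j) pairs such that tokens[i] has the given orth,
    tokens[j] has the given base form, and between min_gap and max_gap
    segments lie strictly between positions i and j."""
    hits = []
    for i, (orth, _base) in enumerate(tokens):
        if orth != first_orth:
            continue
        for gap in range(min_gap, max_gap + 1):
            j = i + gap + 1
            if j < len(tokens) and tokens[j][1] == second_base:
                hits.append((i, j))
    return hits

# Invented sample: (orth, base) pairs; the lemmas are illustrative.
tokens = [
    ("Ja", "ja"),
    ("się", "się"),
    ("bardzo", "bardzo"),
    ("tego", "to"),
    ("boję", "bać"),
]

# Up to two intervening segments: the pair (się, boję) is found.
print(find_with_gap(tokens, "się", "bać", 0, 2))
# At most one intervening segment: nothing is found.
print(find_with_gap(tokens, "się", "bać", 0, 1))
```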
A more accurate query concerning various occurrences of the inherently reflexive verb BAĆ SIĘ should find się within a certain window before a form of the lexeme BAĆ, but without any intervening punctuation (intervening punctuation will often indicate clause boundary), or immediately after a form of bać, separated from that form by at most a single personal pronoun:
The rather baroque query above can be simplified by replacing the condition orth!="[.!?,:]" with a direct reference to the `grammatical class' interp:
In general, the values of the pos attribute are the abbreviations of names of grammatical classes discussed in §2.2 (cf. the table there). For example, a query about a sequence of two nominal forms beginning with an a may be formulated as follows:
The specifications of the values of pos may, just as in the case of orth and base, contain regular expressions. For example, taking into account the fact that personal pronouns are split between the class of 3rd person pronouns ppron3 and non-3rd person pronouns ppron12, the following queries may be used to find any form of any personal pronoun:
Apart from the specifications of segments (with the help of orth), base forms (base) and grammatical classes (pos), queries may contain specifications of particular grammatical categories, such as case or gender. The following attributes may be used to this end (cf. §2.1):
| attribute | possible values |
|---|---|
| number | sg pl |
| case | nom gen dat acc inst loc voc |
| gender | m1 m2 m3 f n |
| person | pri sec ter |
| degree | pos comp sup |
| aspect | imperf perf |
| negation | aff neg |
| accentability | akc nakc |
| post-prepositionality | npraep praep |
| accommodability | congr rec |
| agglutination | agl nagl |
| vocalicity | nwok wok |
Hence, it is possible to pose the following queries:
The following three-letter abbreviations may be used instead of the full names of the attributes:
| attribute | abbreviation |
|---|---|
| number | nmb |
| case | cas |
| gender | gnd |
| person | per |
| degree | deg |
| aspect | asp |
| negation | neg |
| accommodability | acm |
| accentability | acn |
| post-prepositionality | ppr |
| agglutination | agg |
| vocalicity | vcl |
In the graphical and text versions of Poliqarp, it is possible to define so-called aliases, i.e., abbreviations for alternative values of a given attribute, which may themselves be used as if they were possible values of attributes. The current version of the IPI PAN Corpus has four such aliases already pre-defined:
| alias | definition |
|---|---|
| masc | m1 m2 m3 |
| noun | subst depr ger xxs ppron12 ppron3 |
| pron | ppron12 ppron3 siebie |
| verb | fin praet aglt bedzie inf imps impt pact ppas pcon pant ger winien |
With the definitions of the aliases noun and masc given above, the following two queries are equivalent:
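One way to picture aliases is as named sets of plain attribute values. The following Python sketch copies the four pre-defined aliases from the table above; the functions `expand` and `satisfies` are hypothetical helpers introduced only for this illustration and do not reflect how Poliqarp stores or resolves aliases.

```python
# The four pre-defined aliases, copied from the table above.
ALIASES = {
    "masc": {"m1", "m2", "m3"},
    "noun": {"subst", "depr", "ger", "xxs", "ppron12", "ppron3"},
    "pron": {"ppron12", "ppron3", "siebie"},
    "verb": {"fin", "praet", "aglt", "bedzie", "inf", "imps", "impt",
             "pact", "ppas", "pcon", "pant", "ger", "winien"},
}

def expand(value):
    """Return the set of plain values that an alias, or a plain value, stands for."""
    return ALIASES.get(value, {value})

def satisfies(actual, requested):
    """True if the actual attribute value is covered by the requested value or alias."""
    return actual in expand(requested)

print(sorted(expand("masc")))     # the three masculine genders
print(satisfies("m2", "masc"))    # covered by the alias
print(satisfies("f", "masc"))     # not covered by the alias
```

Under this view, a condition using an alias is simply a disjunction over the values listed in its definition.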
The values of grammatical classes and categories may be specified jointly, with the use of the tag attribute. For example, the following query may be used to find singular nominative neuter nouns:
Just as in the case of other attributes, the specification of the value of tag may also contain regular expressions, e.g.:
One of the features that distinguish the IPI PAN Corpus and Poliqarp from other corpora and search tools is the representation and processing of ambiguities. There are cases where it is impossible to tell which of a number of interpretations is the right one, as in 1. below.
Since it is impossible to resolve the grammatical case of pijaną in 1., both interpretations, accusative and instrumental, should be marked in the corpus as correct in this context. However, given that after disambiguation a single segment may contain more than one interpretation, the question arises whether such ambiguous segments, e.g., pijaną in 1., should be included in the result of a query which matches only some of these interpretations, e.g., in the result of the query [case=acc]. On the one hand, the segment pijaną should be included in the result of [case=acc], as accusative is one of the correct interpretations of this segment in this context, but on the other hand, this segment should not be included, as it is not absolutely certain that this is an accusative form.
Instead of choosing between these interpretations of a query like [case=acc], Poliqarp allows the user to pose both kinds of queries. When a single equality sign is used, as in [case=acc], all segments with at least one interpretation matching the given condition will be returned, so both pijaną and ją in 1. will be included in the result of this query. On the other hand, when two equality signs are used, as in [case==acc], only those segments will be returned all of whose interpretations satisfy the condition expressed with ==, i.e., in 1., only the form ją will match the query.
With this distinction in hand, it is possible to search for forms which, e.g., may in a given context be interpreted as either accusative or genitive, so -- given a properly tagged corpus -- the following query should give non-empty results.
The queries above pertain to interpretations which are the result of morphosyntactic disambiguation. The IPI PAN Corpus also contains all other interpretations assigned to a given segment by the morphological analyser. In some situations it is useful to have access to such interpretations rejected by the disambiguator, e.g., for the task of finding all syncretic forms of a certain kind in the corpus, or when investigating disambiguation errors. For example, in order to find all syncretic accusative/genitive forms in the corpus, regardless of their interpretation in the contexts in which they occur, the following query may be posed:
The final equality operator available in Poliqarp queries is ~~. The following query may be used for finding those forms which are unambiguously accusative, again, regardless of the context in which they occur.
The table below summarises the four equality operators put at the user's disposal in Poliqarp.
| | in the results of morphological analysis | in the results of disambiguation |
|---|---|---|
| at least one interpretation | ~ | = |
| each interpretation | ~~ | == |
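The four operators can be modelled in Python with `any` and `all` over two lists of interpretations. This is a simplified sketch under stated assumptions: each interpretation is reduced to a case value alone, the class and function names are invented, and the interpretation lists given for the three sample segments are illustrative rather than taken from the corpus.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    orth: str
    analysed: list       # cases proposed by the morphological analyser
    disambiguated: list  # cases kept after disambiguation

def tilde(seg, case):        # ~  : at least one analysed interpretation
    return any(c == case for c in seg.analysed)

def tilde_tilde(seg, case):  # ~~ : each analysed interpretation
    return all(c == case for c in seg.analysed)

def eq(seg, case):           # =  : at least one disambiguated interpretation
    return any(c == case for c in seg.disambiguated)

def eq_eq(seg, case):        # == : each disambiguated interpretation
    return all(c == case for c in seg.disambiguated)

# Illustrative data only.
pijana = Segment("pijaną", ["acc", "inst"], ["acc", "inst"])
ja = Segment("ją", ["acc"], ["acc"])
go = Segment("go", ["acc", "gen"], ["gen"])

for seg in (pijana, ja, go):
    print(seg.orth,
          tilde(seg, "acc"), tilde_tilde(seg, "acc"),
          eq(seg, "acc"), eq_eq(seg, "acc"))
```

With these sample values, pijaną satisfies = but not == for the accusative, ją satisfies all four operators, and go is accusative only at the level of morphological analysis.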
It should be clear that the following implications hold:
Texts contained in the IPI PAN Corpus are divided into sentences and paragraphs. This information may be taken into account in queries, in order to constrain a query to a sentence or a paragraph, as in the query below, which may be used to find the form się separated from a form of the verb BAĆ by any positive number of (non-się) segments, but within a sentence.
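The effect of confining a match to a sentence can be sketched in Python by searching each sentence separately. The function name, the data layout (a list of sentences, each a list of (orth, base) pairs) and the sample sentences with their lemmas are all assumptions made for this illustration.

```python
def sie_before_bac(sentences):
    """Return (sentence_index, i, j) triples where 'się' stands at position i,
    a form of BAĆ at position j, at least one segment lies between them,
    no intervening segment is 'się', and both belong to the same sentence."""
    hits = []
    for n, sentence in enumerate(sentences):
        for i, (orth, _base) in enumerate(sentence):
            if orth != "się":
                continue
            for j in range(i + 1, len(sentence)):
                other_orth, other_base = sentence[j]
                if other_orth == "się":
                    break  # intervening segments must not be 'się'
                if other_base == "bać" and j > i + 1:
                    hits.append((n, i, j))
    return hits

# Invented sample sentences; the lemmas are illustrative.
sentences = [
    [("Zdziwił", "zdziwić"), ("się", "się"), (".", ".")],
    [("Dzieci", "dziecko"), ("często", "często"), ("boją", "bać"),
     ("się", "się"), ("burzy", "burza"), (".", ".")],
    [("Ona", "on"), ("się", "się"), ("tego", "to"), ("boi", "bać"), (".", ".")],
]

# Only the third sentence contains a match.
print(sie_before_bac(sentences))

# Without sentence boundaries, 'się' from the first sentence would also be
# paired with 'boją' from the second one.
flat = [seg for sentence in sentences for seg in sentence]
print(sie_before_bac([flat]))
```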
Each text in the IPI PAN Corpus comes with a set of data about that text, such as its title and author, publisher, date of publication, etc. Some of such metadata are accessible through Poliqarp and may be used to constrain the scope of a query, e.g., to texts by a given author or published between certain dates.
There are three types of metadata available in the 1st edition of the IPI PAN Corpus, and there are five meta-attributes which correspond to those three types:
In order to constrain the scope of a query with metadata, the keyword meta should be placed at the end of the query and it should be followed by specifications of values of meta-attributes. In case the scope of the query is also constrained to a sentence or to a paragraph, the specification of metadata should follow the structural constraint, e.g.:
Regular expressions are not allowed in case of the date-valued attributes created, first_published and published. On the other hand, it is possible to use the lesser/greater signs < and >, e.g.:
Constraints on meta-attributes may be combined with the operators &, | and !, e.g.:
In the first version of the IPI PAN Corpus, many texts do not have complete metadata associated with them. The results of the queries involving metadata specifications above will only come from those texts which have values of the relevant meta-attributes defined. That means that, perhaps contrary to expectations, the result of the first of the two queries below will be a small subset of the result of the second query.
In case a given meta-attribute does not have a value defined, it is assumed that its value is the empty string, i.e., "", so the query below is equivalent to the latter query above.
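The treatment of undefined meta-attributes as empty strings can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the attribute name `author`, the two sample texts, the helper functions, and the choice of `re.fullmatch` to stand in for the way a meta-attribute value is compared with a pattern.

```python
import re

def meta_value(text_meta, attribute):
    """An undefined meta-attribute is treated as the empty string."""
    return text_meta.get(attribute, "")

def meta_matches(text_meta, attribute, pattern):
    """True if the whole value of the attribute matches the pattern."""
    return re.fullmatch(pattern, meta_value(text_meta, attribute)) is not None

# Invented sample texts; the second one has no author recorded.
texts = [
    {"id": "t1", "author": "Jan Kowalski"},
    {"id": "t2"},
]

with_author = [t["id"] for t in texts if meta_matches(t, "author", ".+")]
without_author = [t["id"] for t in texts if meta_matches(t, "author", "")]

print(with_author)     # texts whose author is defined
print(without_author)  # texts whose author is undefined
```

In this model, a pattern that requires at least one character selects only texts with the attribute defined, while the empty pattern selects exactly the texts that lack it.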
In the 2nd edition of the IPI PAN Corpus, metadata has been revised, corrected and extended by new meta-attributes. Currently, the following attributes are available (note that the names of the attributes have changed with respect to the first edition of the IPI PAN Corpus; Polish names are used now):
For example, the following query may be used to find a sequence of 5 nouns in any scientific or educational text published as a book:
In order to make the results of a query more readable, it is possible to place within the query proper, i.e., before the qualifiers within and meta, a special alignment marker, ^, as in: