1 Segmentation
Tags are assigned to segments (tokens; roughly, words). A segment is
never longer than an orthographic word (`from space to space'), but
in the following cases a segment is shorter than an orthographic word:
- Agglutinative forms of the lexeme BYĆ
`to be' are separate segments, so the following words consist of two
segments each: [łgał][eś] `lied-you',
[długo][śmy] `long time-we', [tak][em]
`so-I'.
- The particles by (subjunctive particle), -ż(e) (emphatic
particle) and -li (question particle) are also
separate segments, so the following words consist of two or more
segments: [przyszedł][by] `come-would',
[napisała][by][m] `write-would-I',
[chodź][że] `come-Emph',
[potrzebował][że][by][ś]
`need-Emph-would-you', [znasz][li] `know-Q'.
- The post-prepositional weak pronominal form
-ń, as in [do][ń] `to-him' or
[ze][ń] `with-him', is also a separate segment.
- Some words containing a hyphen are also split into segments,
namely:
  - words such as [polsko][-][niemiecki] `Polish-German',
  - double names, e.g., [Kowalska][-][Nowakowska].
On the other hand, inflected acronyms such as PRL-u are not
split into smaller segments.
- Sentence-final words that end in a full stop, such as abbreviations (e.g., itp. `etc.'), ordinal numbers written in digits, and
initials, are also split into smaller segments, e.g.,
[itp][.] or [George] [W][.].
The reason is the double role of the full stop in
such cases: it is part of the word and at the same time it serves
as the sentence-final punctuation mark. When such words
do not occur in sentence-final position, they are considered to
be single segments.
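The two purely orthographic rules above (hyphens and sentence-final full stops) can be sketched in Python. This is a minimal illustration, not the procedure actually used for the corpus: the function name `split_word`, the `sentence_final` flag, and the heuristic that an all-uppercase part before the hyphen marks an inflected acronym are all assumptions made for this example. Only the first hyphen in a word is handled, and the morphological rules (agglutinates, particles, -ń) are not attempted, since they would require lexical knowledge.

```python
from typing import List


def split_word(word: str, sentence_final: bool = False) -> List[str]:
    """Split one orthographic word using only the hyphen and full-stop rules.

    Hypothetical helper: `word` is assumed to carry no punctuation other
    than an optional hyphen and an optional word-final full stop.
    """
    # A word-final full stop is split off only in sentence-final position.
    trailing_stop = sentence_final and len(word) > 1 and word.endswith(".")
    core = word[:-1] if trailing_stop else word

    segments: List[str] = []
    head, hyphen, tail = core.partition("-")
    # Assumed heuristic: an all-uppercase head marks an inflected acronym
    # such as PRL-u, which stays a single segment.
    if hyphen and head and tail and not head.isupper():
        segments.extend([head, "-", tail])
    else:
        segments.append(core)

    if trailing_stop:
        segments.append(".")
    return segments


print(split_word("polsko-niemiecki"))           # ['polsko', '-', 'niemiecki']
print(split_word("PRL-u"))                      # ['PRL-u']
print(split_word("itp.", sentence_final=True))  # ['itp', '.']
print(split_word("M."))                         # ['M.']
```

Deciding whether a word is in sentence-final position is left to the caller here, because that decision depends on sentence boundary detection rather than on the word itself.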
Applying the segmentation principles given above to the text
in 1. (translated into English in 2.) yields the segmentation
presented in 3.
1. Pojechalibyśmy z Janem M. Rokitą i Janem
Nowakiem-Jeziorańskim na sesję polsko-amerykańską, gdyby nas
zaprosił George W. Byłaby to nasza już 2. doń podróż od czasów
PRL-u, a może i 3., czy nawet 4.
2. `We would go with Jan M. Rokita and Jan
Nowak-Jeziorański to the Polish-American session, if we were invited by
George W. That would already be our 2nd trip to him since the times
of PRL, and perhaps 3rd, or even 4th.'
3. [Pojechali][by][śmy]
[z] [Janem] [M.] [Rokitą] [i] [Janem]
[Nowakiem][-][Jeziorańskim] [na] [sesję]
[polsko][-][amerykańską][,] [gdyby] [nas]
[zaprosił] [George] [W][.]
[Była][by] [to] [nasza] [już] [2.]
[do][ń] [podróż] [od] [czasów]
[PRL-u][,] [a] [może] [i] [3.][,]
[czy] [nawet] [4][.]