Computer-Assisted Description of the Old Bulgarian Lexica for an e-Based Derivational Dictionary of Old Bulgarian
(Institute for Bulgarian Language, Bulgarian Academy of Sciences)
The project Computer-Assisted Description of the Old Bulgarian Lexica for an e-Based Derivational Dictionary of Old Bulgarian deals with the synchronic morphological description of the derivational patterns observed in the word stock of Старобългарски речник (Dictionary of Old Bulgarian; Ivanova-Mircheva et al. 1999; 2009). The patterns will be made available as an electronically accessible derivational dictionary at the end of the project. Here, I briefly discuss some of the principles behind the description of the lexical material in the context of computational morphology.
Проектът „Компютърно описание на старобългарското словно богатство (с оглед на създаване на електронен словообразувателен речник)“ разработва синхронно морфологично описание на словообразувателните модели при лексикалния фонд, представен в Старобългарския речник (Иванова-Мирчева 1999; 2009), което ще бъде достъпно в края на проекта под формата на електронен словообразувателен речник на старобългарския език. Настоящият текст представя накратко някои принципи, по които част от езиковия материал може да бъде описан в контекста на компютърната морфология.
The project Computer-Assisted Description of the Old Bulgarian Lexica for an e-Based Derivational Dictionary of Old Bulgarian aims at providing a synchronic morphological description of the derivational patterns attested in Старобългарски речник (СР from now on; Ivanova-Mircheva et al. 1999, 2009), which was made available online at http://histdict.uni-sofia.bg/ within the project Computer and Interactive Tools for Historical Linguistic Research (BG051PO001-3.3-04-0011).
Old Bulgarian is the earliest Slavic language, attested in a number of manuscripts from the 10th century on. The term Old Church Slavonic, widely used in the literature, reflects the status of the language as being used for religious purposes in the Slavic Orthodox communities, while the term Old Bulgarian reflects the ethnic origin of the language found in the earliest Slavic written texts (cf. Duridanov et al. 1991). In this project review, I refer to the language as Old Bulgarian because the lexica of the earliest manuscripts used in the description of the derivational patterns are excerpted from Старобългарски речник, in English: Dictionary of Old Bulgarian.
The description of Old Bulgarian (OB) derivation is a ready resource for the development of an experimental tool for derivational pattern recognition, in line with the mechanisms implemented in morphological annotators and taggers. Besides, information about derivation and lexical features may improve the performance of taggers (cf. Denis and Sagot 2009). Thus, an additional task of our project is to investigate, theoretically and experimentally, the case for an automatic morphological analysis of diachronic language data with sequentially added derivational segments.
The aim of the project is to provide a description of the data for further use rather than a fully accurate linguistic analysis. The modelled patterns may not be purely linguistically motivated: the notion of insertion/deletion, for instance, is purely descriptive, but this strategy is often taken in computational morphology.
This project review briefly motivates our research while discussing a number of denominal and deadjectival derivational patterns constructed from the data, with a focus on problematic issues in language data analysis and modelling. After a discussion of computational morphology, the review proceeds to present instances of denominal and deadjectival word formation patterns that are provided within the project.
1. Computational morphology
Morphology studies the structure of individual words and multi-word expressions and the rules of word formation. Computational morphology deals with the (computational) processing of words in order to develop approaches to the computational analysis and synthesis of wordforms.
Words are usually formed by combining smaller units of linguistic information called morphemes, which consist of phonemes, often represented by orthographic symbols. Morphemes can be classified into free morphemes (roots or root words) and bound morphemes (affixes such as prefixes and suffixes). A free morpheme can occur as a single word, while bound morphemes do not occur alone but are attached to a root and may transform a word of one part-of-speech into a word of another. A word can consist of a single morpheme that conveys multiple pieces of information, or of two or more morphemes combined in various ways and subject to changes.
Words are built through three types of morphological processes: inflectional morphology, derivational morphology, and compounding. Inflectional morphology introduces information that can be used in syntactic contexts; inflectional morphemes usually do not change the part-of-speech of a word. Derivational morphology produces a new word from another word, and the derived word is often of a different class than the word it originates from (derivational operations generate denominal adjectives, denominal verbs, deadjectival nouns, among others). Derivational analysis may cover both productive morphological processes that apply to almost all members of a word class and unproductive processes that apply to only a few members. Compounding is the concatenation of two or more root words into a new word, although the boundary between words and compounds is not always clear.
Further, words encode different pieces of information such as referentiality (the object in the real world a word refers to), classification type (declensional and inflectional class), and functionality in a sentence (the word's relation to other words). Semantically, words can be divided into a concept node, a syntactic representation that specifies combinatorial properties (word class, subcategorisation, syntactic realisation, and others), and a semantic representation that specifies aspects of meaning (cf. Schreuder and Baayen 1995).
There are different implementations of computational morphology, ranging from simple concatenation of morphemes to finite state automata embedded in different applications (cf. Oflazer 2009).
One of the simple approaches builds a list of all wordforms, with sublists of roots and affixes, and uses heuristics or rule-based affix-stripping mechanisms that apply language-specific rules to split words into morphemes. If a part of the wordform is in the affix sublist, it can be stripped to check whether the resulting form is a valid word (checks can be performed against a lexicon available from a dictionary such as СР, or by matching wordforms in a corpus). If a match is not found, the mechanism adds any of the other affixes to check whether some combination would produce a valid word. This approach needs many language-specific heuristics and rules and is rather clumsy when it comes to processing ambiguous forms and homonyms and to modeling diachronic changes and alternations.
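The affix-stripping idea can be sketched in a few lines of Python. This is only a toy illustration: the lexicon, the suffix list, the ending list and the function name `strip_affix` are assumptions made for the example and do not reproduce the project's actual lists or the СР lexicon.

```python
# Toy data: illustrative assumptions, not the project's actual lists.
LEXICON = {"пророкъ", "лъвъ", "плодъ", "плодьнъ"}
SUFFIXES = ["ьнъ", "ица", "овъ"]  # suffix plus ending, written as one string
ENDINGS = ["ъ", "ь", "а"]         # endings tried when restoring a base form


def strip_affix(wordform):
    """Return (base, suffix) pairs whose base is a valid lexicon entry."""
    analyses = []
    for suffix in SUFFIXES:
        if not wordform.endswith(suffix):
            continue
        stem = wordform[: -len(suffix)]
        # Re-attach each ending and check the candidate against the lexicon.
        for ending in ENDINGS:
            candidate = stem + ending
            if candidate in LEXICON:
                analyses.append((candidate, suffix))
    return analyses


print(strip_affix("плодьнъ"))    # base плодъ is found
print(strip_affix("лъвица"))     # base лъвъ is found
print(strip_affix("пророчица"))  # nothing is found: к/ч alternation is not handled
```

The last call shows the weakness mentioned above: without an extra rule for the alternation к/ч, plain stripping cannot relate пророчица to пророкъ.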
Finite state approaches involve a lexicon (consisting of free and bound morphemes) and morphographemic rules modeled as finite state transducers. The assumption is that the set of words in a (natural) language forms a regular (formal) language, thus instead of listing all words, one needs to abstract and describe the generation mechanisms. The lexicon can be employed in the form of a finite state transducer with the assumption that a wordform shall cover the pattern of prefix+root+suffix where the prefix and the suffix are optional. The structure can be further refined to a point where only valid forms are accepted and others are rejected. The morphological structures are licensed through a morphographemic transducer that generates all possible patterns in which the input word can be segmented as sanctioned by the alternation rules, graphemic conventions, and morphophonological processes reflected in the orthography.
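As a rough sketch of the prefix+root+suffix pattern, the following Python code implements a small finite-state acceptor over morpheme classes. It is a simplification under stated assumptions: it works on already segmented input, it only accepts or rejects (a real system would use a transducer with morphographemic rules), and the morpheme lists, state names and the function `accepts` are hypothetical.

```python
# Toy morpheme classes: illustrative assumptions only.
PREFIXES = {"при", "не"}
ROOTS = {"чист", "плод", "крас"}
SUFFIXES = {"ьн", "от", "ов", "ит"}
ENDINGS = {"ъ", "а"}

# Transition table: state -> list of (morpheme class, next state).
# The prefix is optional, any number of suffixes may follow the root,
# and an ending is required to reach the final state.
TRANSITIONS = {
    "START": [(PREFIXES, "PREFIXED"), (ROOTS, "ROOT")],
    "PREFIXED": [(ROOTS, "ROOT")],
    "ROOT": [(SUFFIXES, "ROOT"), (ENDINGS, "END")],
    "END": [],
}


def accepts(morphemes):
    """Return True if the morpheme sequence follows the licensed pattern."""
    state = "START"
    for morpheme in morphemes:
        for morpheme_class, next_state in TRANSITIONS[state]:
            if morpheme in morpheme_class:
                state = next_state
                break
        else:
            return False  # no transition licenses this morpheme here
    return state == "END"


print(accepts(["не", "чист", "ъ"]))         # prefix + root + ending
print(accepts(["плод", "ов", "ит", "ъ"]))   # root + two suffixes + ending
print(accepts(["чист", "не", "ъ"]))         # prefix in the wrong position
```

Refining the table (for example, restricting which suffixes may follow which roots) is how such a structure is narrowed down until only valid forms are accepted.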
These mechanisms are usually implemented in a tool known as morphological tagger or automatic morphological analyser. Although development of a tagger for old languages with rich and changing morphology is a difficult task, we assume that a pilot morphological analyser may recognize about half of the wordforms in a mediaeval text (mostly closed class words). Further, the implementation may detect and analyse derivational patterns through recognition and analysis of morphemes (and/or complex endings) to achieve a high rate of recognition of part-of-speech and morphological information encoded into words, even if the application has no information about a given wordform in a given text. For instance, the implementation may enable users to search for all possessive adjectives formed through the suffix -ов-, or all deverbal nouns and denominal verbs. Besides, natural language processing applications may extract and use the information encoded in the words. A tool may guess the part-of-speech of a wordform containing, for example, the denominal suffix 'ьств' to further check whether the knowledge of derivational patterns would enhance the tagger's performance with unprocessed diachronic language data.
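The two uses mentioned here, guessing a part-of-speech from a derivational suffix and searching for words formed with a given suffix, might look as follows for hyphen-segmented entries. The suffix-to-class table, the sample entries and the function names are assumptions for illustration; a real table would have to encode ambiguity as well.

```python
# Hypothetical table mapping a derivational suffix to the class it derives.
SUFFIX_CLASS = {
    "ьств": "noun",
    "тел": "noun",
    "ов": "adjective",
    "ьн": "adjective",
}


def guess_pos(segmented):
    """Guess the part-of-speech from the last suffix before the ending."""
    morphemes = segmented.split("-")
    if len(morphemes) < 3:  # root + ending only: no derivational suffix
        return None
    return SUFFIX_CLASS.get(morphemes[-2])


def find_by_suffix(entries, suffix):
    """Return entries whose last derivational suffix equals `suffix`."""
    result = []
    for entry in entries:
        morphemes = entry.split("-")
        if len(morphemes) >= 3 and morphemes[-2] == suffix:
            result.append(entry)
    return result


entries = ["павьл-ов-ъ", "плод-ов-ит-ъ", "плод-ъ", "добр-о-род-ьств-о"]
print(guess_pos("добр-о-род-ьств-о"))
print(guess_pos("плод-ъ"))
print(find_by_suffix(entries, "ов"))
```

Looking only at the last suffix is a deliberate simplification; it keeps плод-ов-ит-ъ out of the results for -ов- because its final derivational suffix is -ит-.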
These mechanisms should also provide for duplication and ambiguity of the derivational processes reflected in diachronic data. Morphological description and analysis of natural language may still be applied to them but one needs to take into consideration various factors.
Firstly, the description of the diachronic language data relies exclusively on written records of unclear representativeness.
Secondly, diachronic language data reflect processes that changed the language, but the description necessarily involves observation of a series of synchronic states reflected in written documents dated to different periods of language history.
Thirdly, research is complicated by the fact that language resources are not readily available. In the computational description of inflectional languages, the same graphical sign may mark different classes (parts-of-speech), so software needs to be trained to identify the right one; this task has already been achieved for modern languages with available lexicons and neatly structured derivational and inflectional patterns.
Fourthly, attempts at computational analysis of wordforms suggest that the lexical information denoted by words can be separated from the grammatical information and that the lexical base can be easily differentiated from endings. However, the task is much more complex with respect to OB wordforms, which reflect a historical shift of vocabulary and grammar. In the more common approaches to computational description of language resources, root(s) and derivational morphemes are considered a single stem to which inflectional endings are added. In other approaches, derivational and inflectional morphology (derivational and inflectional affixes, respectively) have the same status and are added (simultaneously at the descriptive level) to the root morpheme. In either case, however, the computational representation of language may come at the cost of losing information that is essential for a language with a rich derivational system.
Fifthly, the patterns involved in diachronic derivation do not distinguish between production and recognition. They describe word classes and productive (and unproductive) derivational rules that can be isolated from the data based on observations such as that certain morphemes generate words of a certain part-of-speech (for instance, denominal suffixes such as -тел- and -оба-/-ьба- in OB).
To sum up, we use computational morphology to deal with derivational analysis in order to obtain information that is encoded in a wordform and can be employed in later stages of language data processing. The following section discusses a sample of the nominal and adjectival derivational patterns that our project shall provide.
2. On modeling patterns of nominal and adjectival derivation in OB
A couple of issues need to be addressed in developing a morphological analyzer for the OB language data, such as what type of data are to be compiled, what ambiguities can occur, what solutions to suggest and what implementation to use. A computational model may include derivational rules to connect roots and affixes to model morphotactic representation of how morphemes are combined and form morpheme sequences. Usually, patterns form paradigms based on (broad) classes of root words. Additionally, spelling rules in morphographemics build a (comprehensive) inventory of what happens when morphemes are combined.
First, we generate sublists of morphemes – free and bound (root list and affix list, respectively) – that are modeled after our observation of the OB data as attested in СР.
The root list contains concept nodes (cf. Schreuder and Baayen 1995) that encode essential information about the referent that may be needed in modeling morphotactic and morphographemic phenomena. These are, for example, хвал- in хвал-а 'praise (n.)' and хвал-и-ти 'to praise (inf.)'; and плод- in: плод-ъ 'fruit (n.)'; плод-и-ти сѧ 'to bear fruit (inf.)'; плод-ьн-ъ 'fruitful (attr.)'; плод-ов-ит-ъ 'fertile, fruitful, productive (attr.)'; плод-о-пр-нош-ен-ь (a compound noun with two roots) 'contribution of fruits'.
The affix list contains suffixes that encode combinatorial information (syntactic and semantic) about the class (part-of-speech) of the derived word, classificatory information (inflectional and paradigmatic), and, possibly, the root to which the bound morpheme is attached, among others. For instance, the list of bound morphemes contains the denominal suffix -тел- generating род-и-тел-ь 'parent' (derived from the verb род-и-ти 'to bear (inf.)') where the suffixation produces Nomina agentis with a male referent through attachment to the preceding complex of the root род- and the verbal suffix -и-. The suffix is followed by a thematic vowel -ь (*i) marking the inflectional paradigm of the derived noun (-i-) (on patterns of nominal derivation: Duridanov et al. 1991: 176-191). Another example is the adjectival suffix 'ьн' as in гроз-ьн-ъ 'ugly' (on patterns of adjectival derivation: Duridanov et al. 1991: 212-224).
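One possible way to hold the two sublists in memory is sketched below. The class names, field names and the helper `derive` are hypothetical and chosen only for this illustration; the entries repeat examples given in the text, and a real affix entry would carry more information (for instance, restrictions on the base).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Root:
    form: str   # root as segmented in the description, e.g. "плод"
    gloss: str  # concept the root refers to


@dataclass(frozen=True)
class Affix:
    form: str      # suffix without hyphens, e.g. "тел"
    derives: str   # part-of-speech of the derived word
    thematic: str  # thematic vowel/ending marking the inflectional paradigm


# Toy sublists built from examples in the text.
ROOTS = [Root("хвал", "praise"), Root("плод", "fruit"), Root("род", "origin")]
AFFIXES = {
    "тел": Affix("тел", derives="noun", thematic="ь"),
    "ьн": Affix("ьн", derives="adjective", thematic="ъ"),
}


def derive(base_morphemes, affix):
    """Attach the affix and its thematic vowel to a list of base morphemes."""
    return "-".join([*base_morphemes, affix.form, affix.thematic])


print(derive(["род", "и"], AFFIXES["тел"]))  # root plus verbal suffix -и-, then -тел-
print(derive(["плод"], AFFIXES["ьн"]))
```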
OB patterns of derivation employ both prefixation where bound morphemes are attached before the root (при-и-ти 'to come (inf.)' derived from и-ти 'to go (inf.)' with the prefix при-; не-чист-ъ 'untidy (attr.)' derived from чист-ъ 'tidy (attr.)' through the negative prefix не-), and suffixation where bound morphemes are attached after the root (павьл-ов-ъ 'Paul's (poss.)' is derived from павьл-ъ 'Paul' through the possessive adjectival suffix -ов-; крас-от-а 'beauty (n.)' can be analysed as derived from крас-ьн-ъ 'beautiful (attr.)' with the adjectival suffix -ьн- being replaced by the denominal suffix -от-).
Below, I will present a couple of problematic issues in the description and interpretation of OB patterns of denominal and deadjectival derivation.
2.1. Root invariants
OB derivation employs a process where bound morphemes are attached before or after the free morpheme (or any other intervening morpheme). This can trigger spelling and/or phonological changes at the linking boundary, as in пророч-ица 'prophetess (n.)' derived from пророк-ъ 'prophet (n.)' (with the final consonant of the stem -к- alternating with -ч- according to the rules of the first palatalization), вожд-ь 'leader (n.)' derived from вод-и-ти 'to lead (inf.)', and the diminutive дъшт-ица 'small board (n.)' derived from дъск-а 'board (n.)'. We can either list пророк- and пророч-; вод- and вожд-; and дъск- and дъшт- as root invariants or model them using spelling rules for modification. There are also instances of alternations inside the root, as in гром-ъ 'thunder (n.)' and гръм-ѣ-ти 'to thunder (inf.)', and твар-ъ 'creation (n.)' and твор-и-ти 'to create (inf.)'.
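The two options, listing the variants of a root or deriving them by rule, can be contrasted in a short sketch. Everything here is an illustrative assumption: the table only contains the pairs quoted above, and the rule set covers just the two alternations seen before -ица in those examples (ск/шт and к/ч), not OB morphophonology in general.

```python
# Option 1: list the variants of each root explicitly.
ROOT_VARIANTS = {
    "пророк": {"пророк", "пророч"},
    "вод": {"вод", "вожд"},
    "дъск": {"дъск", "дъшт"},
}


def canonical_root(surface_root):
    """Map a surface variant back to the root it is listed under."""
    for canonical, variants in ROOT_VARIANTS.items():
        if surface_root in variants:
            return canonical
    return None


# Option 2: model the change with ordered spelling rules.
# The longer context (ск) must be tried before the shorter one (к).
ALTERNATIONS = [("ск", "шт"), ("к", "ч")]


def attach_ica(root):
    """Attach -ица, applying the first matching alternation to the root."""
    for old, new in ALTERNATIONS:
        if root.endswith(old):
            root = root[: -len(old)] + new
            break
    return root + "-ица"


print(canonical_root("вожд"))
print(attach_ica("пророк"))
print(attach_ica("дъск"))
print(attach_ica("лъв"))
```

Listing is simpler and handles irregular pairs such as вод-/вожд-, while rules generalise to unseen roots but need careful ordering.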
The simplest OB pattern for nominal and adjectival word formation involves a root – a concept node, followed by a thematic vowel (*а, *o, *i, *u) that provides information about the gender and declensional class (gender on adjectives licenses agreement with the noun they modify). Even the simplest pattern, however, involves a derivational path.
An analysis can segment a non-transparent root, as in вод-а 'water (n.)' and бог-ъ 'God (n.)', or a derivational one, as in the deverbal adjective люб-ив-ъ 'loving (attr.)' derived from люб-и-ти 'to love (inf.)' through the attachment of the adjectival suffix -ив- to the root. In some instances, however, the direction of derivation is not transparent and we need to postulate a pattern to follow. For instance, хвал-а 'praise (n.)' can be interpreted as a deverbal noun motivated by the verb хвал-и-ти 'to praise (inf.)' but also as a noun that generates a denominal verb (as pointed out in Duridanov et al. 1991: 177, citing Meillet 1951: 276).
2.3. Derivational chain
Affixation in OB exhibits a variety of patterns of which only the ones involving more than one affix are discussed below.
A derived wordform can be a product of both prefixation (where the bound morpheme goes before the free one) and suffixation. As prefixation is not a noun deriving operation, a wordform such as при-ш-ьств-и-ѥ 'advent (n.)' is assumed to be motivated by a verbal root – the one in при-и-ти 'to come (inf.)'. The prefixation first occurs on the verbal root, generating a prefixed verb from which the noun can be derived through attaching denominal suffixes.
Second, the derivation chain can involve more than one suffix – denominal and/or adjectival. Thus, пророч-иц-а 'prophetess (n.)' is derived from пророк-ъ 'prophet (n.)' through the attachment of a nominal suffix and a thematic vowel, while двьр-ьн-ик-ъ 'door-keeper (n.)' and двьр-ьн-иц-а '(female) door-keeper (n.)' can be modeled either as derived via двьр-ьн-ъ 'door (attr.)' from двьр-ь 'door (n.)' with a first adjectival suffix (-ьн-) and a second nominal one (-ик-), or as directly motivated by двьр-ь 'door (n.)' and derived through the compound suffix -ьник- (cf. Duridanov et al. 1991).
A wordform can involve a much more complicated derivation chain including a series of suffixes, as in род-и-тел-ьн-иц-а 'woman in child birth (n.)' (derived from род-и-тел-ь 'parent (n.)'), which possibly exhibits the following multi-step derivational chain: *rod- / род-ъ 'origin' > род-и-ти 'to bear (inf.)' > род-и-тел-ь 'parent (n.)' [> *род-и-тел-ьн- '(unattested) parental, child bearing (attr.)' > *род-и-тел-ьн-ик- '(unattested) ?man in child birth (n.)'] > род-и-тел-ьн-иц-а 'woman in child birth'. OB data attest neither an adjective with a stem *родительн-, nor, expectedly, a noun with a stem *родительник-. Thus, we may analyse род-и-тел-ьн-иц-а 'woman in child birth' as a denominal noun derived from родител-ь via the compound suffix -ьник- (cf. Duridanov et al. 1991; Gladney 2006), as in двьр-ьн-иц-а '(female) door-keeper (n.)'. Alternatively, the analysis may model родительница as derived from *родительник- through the attachment of a denominal suffix generating a noun with a female referent.
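A derivational chain with unattested intermediate steps could be represented as an ordered list in which each step records whether it is attested. The sketch below uses the родительница chain from the text; the class `Step` and the helpers `attested_base` and `bridging_suffixes` are hypothetical names introduced for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    form: str       # hyphen-segmented form
    gloss: str
    attested: bool  # False for reconstructed (starred) steps


CHAIN = [
    Step("род-ъ", "origin", True),
    Step("род-и-ти", "to bear", True),
    Step("род-и-тел-ь", "parent", True),
    Step("род-и-тел-ьн-", "parental", False),
    Step("род-и-тел-ьн-ик-", "man in child birth", False),
    Step("род-и-тел-ьн-иц-а", "woman in child birth", True),
]


def attested_base(chain, index):
    """Return the nearest attested step before chain[index], if any."""
    for step in reversed(chain[:index]):
        if step.attested:
            return step
    return None


def bridging_suffixes(base, target):
    """Suffixes the target adds to the base stem (endings are ignored)."""
    base_stem = base.form.split("-")[:-1]
    target_morphemes = target.form.split("-")
    return target_morphemes[len(base_stem):-1]


target = CHAIN[-1]
base = attested_base(CHAIN, len(CHAIN) - 1)
print(base.form)
print(bridging_suffixes(base, target))
```

Skipping the unattested steps yields the base родитель and the suffix sequence that a compound-suffix analysis would treat as a single unit.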
2.4. Segmentation ambiguity
Segmentation ambiguity may arise in various stages in the derivation chain and involve various constituents: stem ambiguity – semantic or compositional, suffix ambiguity, and others.
One of the most productive suffixes for nominal derivation /ij/ (a constituent of the compound suffix /ij-e/ -и/-ь as in падени 'downfall (n.)', съпасени 'salvation (n.)') can be attached to (de)verbal and adjectival stems. With deverbal stems, the suffix can be preceded by the adjectival suffixes -н- or -т- that also occur in passive participles. We can analyse гнѣван 'enragement (n.)' as a deverbal noun (derived from гнѣвати (сѧ) 'to rage, be angry (inf.)'). Evidence also comes from the syntactic satellites of deverbal nominalisation, as the operation in deverbal derivation requires tuning of the argument structure (Gladney 2006). Meanwhile, the same suffix /ij/ is employed in deadjectival nouns such as съдрав-ь 'health (n.)' (derived from the adjective съдрав-ъ 'healthy (attr.)').
A root can be homophonous, as with гор- in гор-ьн-ъ 'upland, mountain (attr.)' motivated by гор-а 'mountain (n.)', and гор- in гор-ък-ъ 'sorrowful (attr.)' motivated by *гор-ь 'pity, sorrow' (unattested in СР).
An ambiguity also occurs when a word can be legitimately divided into morphemes in more than one way, generating a number of patterns. If the division is to be motivated by existing separate non-derived word(forms), the suffix -ица, employed in the derivation of Nomina agentis with a female referent, may have a number of interpretations. As mentioned above, the suffix is found in двьрьница '(female) door-keeper (n.)' derived from двьрьникъ '(male) door-keeper (n.)' (motivated by двьрь 'door (n.)'). The same structure is found in лъв-ица 'lioness (n.)', пророч-ица 'prophetess (n.)' and родительн-ица 'woman in child birth (n.)', with corresponding nouns denoting the male referent: лъв-ъ 'lion (n.)', пророк-ъ 'prophet (n.)', and родител-ь 'parent (n.)'. However, none of *лъвик-ъ, *пророчик-ъ or *родительник-ъ is attested.
OB data also show homographous morphemes attached to different stems, such as -ин-, attested as a denominal suffix in властел-ин-ъ 'ruler (n.)' and as an adjectival suffix in иоан-ин-ъ 'John's (attr.)'.
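Segmentation ambiguity of this kind can be made explicit by enumerating every division of a wordform that the morpheme lists license. The following sketch is a minimal illustration; the lists are toy assumptions, and the entry "ьниц" is a hypothetical compound-suffix form added only to show how two competing analyses arise.

```python
# Toy lists: illustrative assumptions only.
ROOTS = {"двьр", "лъв"}
SUFFIXES = {"ьн", "иц", "ьниц"}  # "ьниц" is a hypothetical compound suffix
ENDINGS = {"а", "ъ"}


def suffix_splits(rest):
    """All ways to split `rest` into a sequence of listed suffixes."""
    if not rest:
        return [[]]
    results = []
    for i in range(1, len(rest) + 1):
        head = rest[:i]
        if head in SUFFIXES:
            for tail in suffix_splits(rest[i:]):
                results.append([head] + tail)
    return results


def segmentations(wordform):
    """All root + suffix(es) + ending analyses licensed by the lists."""
    analyses = []
    for ending in ENDINGS:
        if not wordform.endswith(ending):
            continue
        body = wordform[: -len(ending)]
        for root in ROOTS:
            if body.startswith(root):
                for suffixes in suffix_splits(body[len(root):]):
                    analyses.append("-".join([root, *suffixes, ending]))
    return sorted(analyses)


print(segmentations("двьрьница"))  # two competing analyses
print(segmentations("лъвица"))     # a single analysis
```

Choosing between the competing analyses is then a matter of the descriptive policy discussed here, for instance preferring divisions motivated by attested non-derived words.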
2.5. Unproductive suffixes
Many unproductive suffixes are involved in OB derivation without being interpreted as separate constituents. However, we can segment them if the stemming is transparent, as in the deverbal nouns из-бꙑ-т-ък-ъ 'surplus, redundancy (n.)' (derived from a prefixed passive participle), and пи-в-о 'a drink (n.)' (derived from пити 'to drink (inf.)'). Despite interpretational variants, the derivational pattern of some words is not easily motivated, either semantically, as in праздьникъ 'feast, free day (n.)' (from the adjective празд-ьн-ъ 'empty, free (attr.)'), or technically, as with глагол-ъ 'word, speech (n.)' whose derivation can be reconstructed through reduplication of *găl-(găl)-. Similarly, the suffixes *-t- and *-(o)s- that form the (compound) denominal suffix -ость are not analysed separately.
2.6. Compounding
As mentioned above, compounding can be modeled as the concatenation of two or more free morphemes (respectively, two concept nodes) into a new word (a complex concept). Compounding is a highly productive derivational process in OB and is frequent in our data in both nominal and adjectival derivation.
Morphologically, the first part is unchangeable, while the second one attracts the inflection determining the part-of-speech and declensional class of the compound and supporting the constituent's relation to other constituents in the syntax. Semantically, the first concept node modifies the second one.
Compound adjectives can be derived from nouns as the second (inflectable) component, as in благ-о-образ-ьн-ъ 'fine-looking (attr.)' (motivated by образъ 'face, image (n.)'), from other adjectives, as in благ-о-лѣп-ьн-ъ 'wonderful (attr.)' (motivated by лѣпъ 'good, good-looking (attr.)'), and from verbal roots (participles), as in благ-о-твор-и-в-ъ 'beneficial' (from the participle of the verb творити 'to create (inf.)'). The first constituent can be adjectival, as благо-, nominal, as бог-о- in бог-о-ѹ-eн-ъ 'God-taught', or adverbial, as тожде- in тожде-имен-и-т-ъ 'eponymous' (following Duridanov et al. 1991: 224-227).
Compound nouns can be derived by various suffixes. Their first constituent is often nominal, as in бог-о-вид-ьц-ь 'God seeing' and вино-пи-иц-а 'drunkard' (вино 'wine (n.)'), or adjectival, as in добр-о-род-ьств-о 'noble descent (n.)', but it can also be adverbial, as in мало-мощ-ь 'cripple (n.)'.
Compounds following a transparent derivational pattern are analysed as having two roots connected either by an interfix (-о-, -е-) or by a zero affix (compare грѣхъ-пад-ан-ь- 'the original sin' vs. грѣх-о-пад-ан-ь- 'the original sin').
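A two-root analysis with an interfix or a zero link can be sketched as a search over a root list. This is a toy illustration: the constituent list and the function `split_compound` are assumptions, the input is the compound base without suffixes and endings, and мало is listed as a whole first constituent because the text segments мало-мощ-ь without an interfix.

```python
# Toy list of first/second constituents, taken from examples in the text.
ROOTS = {"благ", "образ", "бог", "вид", "добр", "род", "мало", "мощ"}
INTERFIXES = ("о", "е", "")  # "" stands for the zero link


def split_compound(base):
    """Return (first, interfix, second) analyses of a two-root base."""
    analyses = []
    for first in sorted(ROOTS):
        if not base.startswith(first):
            continue
        rest = base[len(first):]
        for interfix in INTERFIXES:
            if rest.startswith(interfix) and rest[len(interfix):] in ROOTS:
                analyses.append((first, interfix, rest[len(interfix):]))
    return analyses


print(split_compound("благообраз"))  # linked by the interfix -о-
print(split_compound("добрород"))    # linked by the interfix -о-
print(split_compound("маломощ"))     # zero link
```

Words that go back to old compounds but function as one-root words in OB would simply be listed as single roots and would not pass through such a splitter.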
On the other hand, lexemes such as глагол-ъ 'word, speech (n.)', медвѣдъ 'bear (n.)' and a few others were formed by concatenation of two roots in Proto-Slavic but in OB they function as simple one-root words and are modeled as such in our patterns of derivation.
The project Computer-Assisted Description of the Old Bulgarian Lexica for an e-Based Derivational Dictionary of Old Bulgarian shall provide online information about derivational pairs and derivational chains, together with a detailed segmentation of morphological constituents. Information about the frequency of morphemes (roots and affixes) and a classification of the derivational patterns as attested in the СР lexicon will also be made available (plus some additional remarks about the frequency of derivational patterns in texts).
References
- Denis, Pascal and Benoît Sagot. 2009. Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort. In: Proceedings of The Pacific Asia Conference on Language, Information and Computation (PACLIC 23), Hong Kong, China. http://researchers.lille.inria.fr/~pdenis/papers/paclic09.pdf (10.11.2012).
- Duridanov, Ivan (ed.). 1991. A Grammar of Old Bulgarian Language. Sofia: Bulgarian Academy of Sciences Publishing House. [In Bulgarian: Дуриданов, Иван. 1991. Граматика на старобългарския език. С., Издателство на Българската академия на науките, 606 с.]
- Ganeva, Gergana. 2012. Verbal Suffixation in the History of Bulgarian Language in the Context of the Development of Aspectual Opposition. Ph.D. Thesis. Sofia. Manuscript. [In Bulgarian: Ганева, Гергана. 2012. Глаголната суфиксация в историята на българския език с оглед на изграждането на видовата опозиция. Дисертация. София.]
- Gladney, Frank. 2006. Slavic morphology. Glossos 8 (Slavic Linguistics 2000: The Future of Slavic Linguistics in America). At: http://www.seelrc.org/glossos/issues/8/gladney.pdf [08/11/2012]
- Meillet, Antoine. 1934. Le slave commun. Paris: H. Champion. [In Russian: Мейе, Aнтуан. Общеславянский язык. Москва, Издательство иностранной литературы, 1951, 491 с.].
- Ivanova-Mircheva, Dora (еd.). 1999. Dictionary of Old Bulgarian. Vol. 1 (A-I). Sofia: Valentin Trayanov Publishing House. [In Bulgarian: Старобългарски речник. Т. 1 (А – И). София, Изд. „Валентин Траянов“, 1999, 1027 с.]
- Ivanova-Mircheva, Dora (еd.). 2009. Dictionary of Old Bulgarian. Vol. 2 (O-Х). Sofia: Valentin Trayanov Publishing House. [In Bulgarian: Старобългарски речник. Т. 2 (О – X). София, Изд. „Валентин Траянов“, 2009, 1320 с.]
- Oflazer, Kemal. 2009. Computational Morphology (University course notes). http://fsmnlp2009.fastar.org/Program_files/Oflazer%20-%20slides.pdf (10.11.2012).
- Schreuder, Robert and R. Harald Baayen. 1995. Modeling morphological processing. In: L. B. Feldman (ed.), Morphological Aspects of Language Processing. Hillsdale, NJ: Lawrence Erlbaum Associates.
About the author
Tsvetana Dimitrova, Ph.D., is an Assistant Professor in the Department of Computational Linguistics at the Institute for Bulgarian Language (Bulgarian Academy of Sciences). Her research interests include theoretical and applied linguistics – corpus linguistics, diachronic linguistics, syntax, diachronic syntax, generative grammar, language change; computational lexicons, lexical semantic nets, linguistic annotation.
Electronic address: cvetana at dcl dot bas dot bg