HPSG-based Bulgarian-English Statistical Machine Translation

Kiril Simov¹, Petya Osenova², Laska Laskova², Stanislava Kancheva², Aleksandar Savkov³, Rui Wang⁴

¹Linguistic Modeling Department, Institute of Information and Communication Technologies, BAS, Bulgaria

²Department for Bulgarian Language, SU “St. Kl. Ohridski”, Bulgaria

³Department of Informatics, University of Sussex, England

⁴DFKI, Saarbrucken, Germany

Abstract

The paper presents the resources and processing for Bulgarian that are needed for statistical machine translation from Bulgarian to English. The resources considered include various language corpora, lexicons and grammars. With respect to processing, we describe the language analyzer for Bulgarian, which comprises the following components: text segmentation, morphological tagging, lemmatization and dependency parsing. A mechanism is presented by which the syntactic analysis is projected onto a semantic one. Various language models and their relation to the final result are discussed. The translation is evaluated in two ways: quantitatively and qualitatively.

1. Introduction

Recently a number of machine translation efforts have focused on grammatical formalisms in performing source language analysis, transfer rule application and target language generation. It is worth mentioning several works, such as (Bond et al. 2005), exploiting the DELPH-IN ¹ infrastructure for developing HPSG grammars; (Riezler and Maxwell III 2006), using an LFG grammar; (Oepen et al. 2007), working on a hybrid architecture consisting of an LFG grammar, an HPSG grammar and partial parsing; and (Bojar and Hajic 2008), using the Functional Generative Description framework for language analysis on the analytical and tectogrammatical levels. All these approaches rely on advances in deep-grammar natural language parsing. They share a similar architecture and similar techniques for overcoming the drawbacks of deep processing in comparison to statistical shallow methods.

Manually created word-aligned bi- or multilingual corpora have proven to be useful resources in a variety of tasks, e.g. for the development of automatic alignment tools, but also for lexicon extraction, word sense disambiguation, machine translation, annotation transfer and others.

Within the EuroMatrixPlus project our aim was to prepare and align a Bulgarian-English corpus and a treebank in order to test statistical MT tools later on. The aim of constructing such a treebank is to use it as a source for learning statistical transfer rules for Bulgarian-English machine translation along the lines of (Bond et al. 2005). The transfer rules in this framework are rewriting rules over MRS (Minimal Recursion Semantics) structures (Copestake et al. 2005). The basic format of the transfer rules is:

[C:] I [!F] → O

where I is the input of the rule and O is the output. C determines the context and F is the filter of the rule: C selects the positive context and F the negative context for the application of a rule. For more details on the transfer rules see (Oepen 2007). This type of rule allows for an extremely flexible transfer of factual and linguistic knowledge between the source and the target languages. Thus, the treebank has to contain parallel sentences, their syntactic and semantic analyses, as well as correspondences on the level of MRS. In the development of such a parallel treebank we rely on the Bulgarian HPSG resource grammar BURGER² (Osenova 2010) and on a dependency parser (Malt Parser (Nivre et al. 2006)) trained on the BulTreeBank data. Both parsers produce semantic representations in terms of MRS. BURGER constructs them automatically, while the output of the Malt Parser is augmented with rules for constructing MRS from the dependency trees. The treebank is a parallel resource aligned first on the sentence level. Then the alignment is done on the level of MRS. This level of abstraction makes it possible to use different tools for producing these alignments, since MRS is meant to be compatible with various syntactic frameworks. The chosen procedure is as follows: first, the Bulgarian sentences are parsed with BURGER. If it succeeds, the produced MRSes are used for the alignment. If BURGER fails, the sentences are parsed with the Malt Parser, and then RMRSes (Robust Minimal Recursion Semantics) are constructed over the dependency parse. The RMRSes are created via a set of transformation rules (Simov and Osenova 2011). In both cases we keep the syntactic analyses for the parallel sentences.
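
The rule format [C:] I [!F] → O can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that an MRS is flattened to a set of elementary predicate names. The class `TransferRule`, its field names and the predicate names are hypothetical and do not reproduce the actual transfer formalism described in (Oepen 2007).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferRule:
    """Toy model of a rule [C:] I [!F] -> O over sets of predicate names."""
    inp: frozenset                 # I: predicates rewritten by the rule
    out: frozenset                 # O: predicates that replace I
    ctx: frozenset = frozenset()   # C: positive context, must be present
    flt: frozenset = frozenset()   # F: negative context, blocks the rule

    def applies_to(self, mrs: frozenset) -> bool:
        # The rule fires when I and C are found and the filter F is not.
        blocked = bool(self.flt) and self.flt <= mrs
        return self.inp <= mrs and self.ctx <= mrs and not blocked

    def apply(self, mrs: frozenset) -> frozenset:
        # Replace I by O; context predicates are left untouched.
        if not self.applies_to(mrs):
            return mrs
        return (mrs - self.inp) | self.out


# Hypothetical predicate names, used only for this example.
rule = TransferRule(inp=frozenset({"_куче_n"}),
                    out=frozenset({"_dog_n"}),
                    flt=frozenset({"_горещ_a"}))
source_mrs = frozenset({"_куче_n", "_лая_v"})
print(sorted(rule.apply(source_mrs)))
```

In this toy setting a rule leaves the structure unchanged whenever its input or context is missing or its filter matches, which mirrors the roles of C and F described above.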

With respect to the MRS alignments, a very pragmatic approach has been adopted – namely, the MRS alignments originated from the word level alignment. This approach is based on the following observations and requirements:

Both approaches for generation of MRS over the sentences are lexicalized;
Non-experts in linguistics can do the alignments successfully;
Different rules for generation/testing are possible.

Both parsers (for Bulgarian and English) that we use for the creation of MRSes are lexicalized in nature. Thus, they first assign elementary predicates to the lexical elements in the sentences; then, on the basis of the syntactic analysis, these elementary predicates are composed into the MRSes of the corresponding phrases, and finally of the whole sentence.

Our belief is that, having alignments on the word level, syntactic analyses and the rules for composition of MRS, we will be able to determine correspondences between MRSes larger than lexical-level MRSes, using the ideas of (Tinsley et al. 2009). They first establish the mapping on the word level (automatically), and then for candidate phrases they calculate the rank of the correspondences on the basis of the word-level alignment. Thus, our idea is to score the correspondences between two MRSes on the basis of the elementary predicates involved as well as the syntactic structure of the parallel sentences. Additionally, the constructed MRS structures (or RMRS; in many places in the rest of the paper the two terms are used interchangeably) will also allow us to use state-of-the-art statistical machine translation engines in order to improve pure word/phrase-based statistical machine translation. The first experiments in this direction are presented in the second part of the paper.

In recent years, machine translation (MT) has achieved significant improvement in terms of translation quality. Both data-driven approaches (e.g., statistical MT (SMT)) and knowledge-based approaches (e.g., rule-based MT (RBMT)) have achieved comparable results in the evaluation campaigns (Callison-Burch et al. 2011). However, according to human evaluation, the final outputs of MT systems are still far from satisfactory.

Fortunately, recent error analysis shows that the two trends of MT approaches tend to be complementary to each other in terms of the types of errors they make (Thurmair 2005; Chen et al. 2009). Roughly speaking, RBMT systems often do not have a lexicon and thus they lack robustness. At the same time, they handle better the linguistic phenomena that require syntactic information. SMT systems, on the contrary, are more robust in general, but sometimes output ungrammatical sentences.

In fact, instead of competing with each other, there is also a line of research trying to combine the advantages of the two sides through a hybrid framework. Although many systems can be put under the umbrella of ‘hybrid’ systems, there are various ways to do the combination/integration. (Thurmair 2009) summarized several different architectures of hybrid systems using SMT and RBMT systems. Some widely spread ones are: 1) using an SMT to post-edit the outputs of an RBMT; 2) selecting the best translations from several hypotheses coming from different SMT/RBMT systems; and 3) selecting the best segments (phrases or words) from different hypotheses.

The language pair Bulgarian-English has not been studied very well, mainly due to the lack of resources, including corpora, preprocessors, etc. A system published by (Koehn et al. 2009) was trained and tested on European Union law data, but not on other domains such as news. They reported a very high BLEU score (Papineni et al. 2002) for the Bulgarian-English translation direction (61.3), which inspired us to investigate this problem further.

In this paper, we present the guidelines for alignment of Bulgarian-English sentences on word and semantic level. The semantic level of alignment builds on the word level and the syntactic processing. We also focus on the Bulgarian-to-English translation and mainly explore the approach of annotating the SMT baseline with linguistic features, derived from the preprocessing and hand-crafted grammars. There are three motivations behind our approach: 1) the SMT baseline trained on a decent amount of parallel corpora outputs surprisingly good results, in terms of both statistical evaluation metrics and preliminary manual evaluation; 2) the augmented model gives us more space for experimenting with different linguistic features without losing the ‘basic’ robustness; 3) the MT system can profit from continued advances in the development of the deep grammars thereby opening up further integration possibilities.

The paper is organized as follows: Section 2 describes the parallel Bulgarian-English Corpus. Section 3 outlines the Language Processing Pipeline for Bulgarian. In Section 4 the word level alignment strategy is presented in detail. Section 5 focuses on the semantic level alignment. Section 6 reports on the statistical machine translation model and related experiments. Section 7 concludes the paper and gives insights for future work.

2. Preparation of the Parallel Corpus

The parallel corpora used in our work were selected with respect to the following criteria: (1) availability of parallel texts, (2) quality of translations, (3) parsed by the relevant parser for Bulgarian and/or English, (4) availability of aligned parallel texts on sentence level. We use three main sources of data:

The Bulgarian HPSG-based treebank (BulTreeBank).
Datasets distributed together with the English Resource Grammar.
SETIMES parallel corpus (part of OPUS parallel corpus ³).

First, we started with the data already present in the Bulgarian HPSG-based treebank, BulTreeBank. These data have been used in two respects: (1) as a source for the creation of an HPSG grammar for Bulgarian, and (2) as a source for the parallel Bulgarian-English data where English translations are available. One such example is the Bulgarian constitution. These data had been syntactically analysed, but not aligned with any English counterparts, and they did not contain MRS analyses. Thus, two tasks were performed. First, a set of rules was formulated for building MRS analyses for the sentences on the basis of the output of the dependency parser. Second, the parallel texts were aligned on the sentence and word levels.

The datasets distributed together with the English Resource Grammar (ERG) became the second source for the parallel treebank. We have started to translate these sets into Bulgarian. These datasets include two parts: a domain-oriented one and a grammar-centered one. The domain part consists of real texts in the domain of tourism, while the grammar-centered one focuses predominantly on the word order and the syntactic variety of constructed sentences. Thus, these datasets include not only development sets for ERG, but also data from the Norwegian-English parallel treebank. Our motivation for selecting them is that ERG has already been tuned to process them. At the moment there is a small set of 200 sentences parsed by both grammars, BURGER and ERG; the rest are parsed by ERG only. The Bulgarian counterparts were parsed by BURGER and, since its coverage is restricted, the dependency parser with transfer rules to MRS analyses was also used.

The third source is the SETIMES parallel corpus, which is part of the OPUS parallel corpus. The data in the corpus was aligned automatically on sentence level. Thus, we first checked the consistency of the automatic alignments. It turned out that about 25 % of the sentence alignments were not correct.

As can be seen, our parallel Bulgarian-English treebank covers various genres: news, tourism, administrative texts and constructed examples. All of them are aligned on the sentence level, and some of them also on the word level and on the semantic level.

Since SETIMES appeared to be the noisiest data set, our effort was directed at cleaning it as much as possible before the start of the experiments. We have cleaned manually about 25 000 sentences. The rest of the data set includes around 135 000 sentences. Altogether the data set is about 160 000 sentences, when the manually checked part is added. Thus, two actions were taken:

Improving the tokenization of the Bulgarian part. The observations from the manual check of the set of 25 000 sentences showed systematic errors in the tokenized text. Hence, these cases have been detected and fixed semi-automatically.
Correcting and removing the suspicious alignments. Initially, the ratio of the lengths of the English and Bulgarian sentences was calculated in the set of the 25 000 manually annotated sentences. As a rule, the Bulgarian sentences are longer than the English ones. The ratio is 1.34. Then we calculated the ratio for each pair of sentences. After this, the optimal interval was manually determined, such that if the ratio for a given pair of sentences is within the interval, we assume that the pair is a good one. The interval for these experiments is set to [0.7; 1.8]. All the pairs with a ratio outside the interval have been deleted. Similarly, we have cleaned the EMEA data set, which is also part of the OPUS parallel corpus but contains domain text in the area of drugs.
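
The ratio-based filtering step can be sketched in Python as follows. This is only an illustration under stated assumptions: sentence length is counted in whitespace-separated tokens and the ratio is taken as Bulgarian length divided by English length. The text does not specify either choice, and the function names are hypothetical.

```python
def length_ratio(bg_sentence: str, en_sentence: str) -> float:
    """Bulgarian/English length ratio, counted in whitespace tokens
    (assumption: the paper does not state the unit of length)."""
    bg_len = len(bg_sentence.split())
    en_len = len(en_sentence.split())
    if en_len == 0:
        return float("inf")  # an empty English side can never be accepted
    return bg_len / en_len


def filter_pairs(pairs, low=0.7, high=1.8):
    """Keep only the sentence pairs whose ratio lies inside [low, high]."""
    return [(bg, en) for bg, en in pairs
            if low <= length_ratio(bg, en) <= high]


pairs = [
    ("Това е куче .", "This is a dog ."),                 # ratio 4/5
    ("Да .", "Yes , of course , I will come tomorrow ."),  # ratio 2/10
]
kept = filter_pairs(pairs)
print(len(kept))
```

The default bounds are the interval [0.7; 1.8] reported above; a different length unit would most likely require a different interval.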

The sizes of the resulting datasets are: 151,718 sentence pairs for the SETIMES dataset and 704,631 sentence pairs for the EMEA dataset. Thus, the size of the original datasets was decreased by 10 %.

3. Linguistic Processing Pipeline for Bulgarian

In this section we present the linguistic processing pipeline (BTB-LPP ⁴) for Bulgarian, which we used for analyzing the datasets in integration with the BURGER grammar. BTB-LPP comprises three main modules: a Morphological Tagger, a Lemmatizer and a Dependency Parser.

3.1 Morphological Tagger

The morphological tagger is constructed as a pipeline of three modules - two statistical taggers trained on the Morphologically Annotated Part of BulTreeBank (BulTreeBank-Morph)⁵ and a rule-based module exploiting a large Bulgarian Morphological Lexicon and manually crafted disambiguation rules.

SVM Tagger

The first statistical tagger uses the SVMTool (Giménez and Márquez 2004), which is an SVM-based statistical sequential classifier. It is built on top of the SVMLight (Joachims and Schölkopf 1999) implementation of the Support Vector Machine algorithm (Vapnik 1999). Its flexibility allows it to be trained for an arbitrary language as long as it is provided with enough annotated data. The accuracy achieved with the optimal training configuration ranged from 89 % to 91 %, depending on the text genre. After the morphological lexicon was applied as a filter on the possible tags for each word form, together with the set of disambiguation rules, the best result achieved was 94.65 % accuracy.

Rule-based Component

The task of this component is to correct some of the erroneous analyses made by the SVM Tagger. The correction of the wrong suggestions relies on two sources of linguistic knowledge: the morphological lexicon and the set of context-based rules. In the process of repairing we use as much as possible of the information provided by the SVM Tagger. The context rules are designed to achieve high precision, even at the cost of low recall. The lexicon look-up is implemented as cascaded regular grammars within the CLaRK system (Simov et al. 2001). The lexicon is an extended version of (Popov et al. 2003) and covers more than 110 000 lemmas. Additionally, a set of gazetteers was incorporated within the regular grammars. Here is an example of a rule: if a word form is ambiguous between a masculine count noun (Ncmt) and a singular short definite masculine noun (Ncmsh), the Ncmt tag should be chosen if the previous token is a numeral or a number.
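
The example rule can be written as a short Python sketch. It is not the CLaRK implementation; it assumes that each token carries a list of candidate tags, that numeral tags start with the prefix "M", and that a "number" is a string of digits. The tag "Mc" in the example and all function names are illustrative assumptions.

```python
NUMERAL_PREFIX = "M"  # assumption: numeral tags in the tagset start with "M"


def is_numeral_or_number(word, tags):
    """True for digit strings and for tokens whose candidates are all numerals."""
    all_numeral = bool(tags) and all(t.startswith(NUMERAL_PREFIX) for t in tags)
    return word.isdigit() or all_numeral


def apply_count_noun_rule(tokens):
    """tokens: list of (word_form, candidate_tags) pairs.

    If a token is ambiguous between Ncmt and Ncmsh and the previous
    token is a numeral or a number, keep only Ncmt.
    """
    result = []
    for i, (word, tags) in enumerate(tokens):
        ambiguous = {"Ncmt", "Ncmsh"} <= set(tags)
        if ambiguous and i > 0 and is_numeral_or_number(*tokens[i - 1]):
            tags = ["Ncmt"]
        result.append((word, list(tags)))
    return result


# "пет стола" ('five chairs'): the count form is selected after a numeral.
print(apply_count_noun_rule([("пет", ["Mc"]), ("стола", ["Ncmsh", "Ncmt"])]))
```

The rule only removes candidates when its condition holds and otherwise leaves the ambiguity for later components, in line with the high-precision design described above.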

Guided Learning System: GTagger

GTagger is based on the guided learning system of (Georgiev et al. 2012). Its best result is 97.98 % accuracy, which can be considered the state of the art for Bulgarian. However, this result is achieved when the input to GTagger is already tagged with the list of all possible tags for each token, as in the morphological dataset BulTreeBank-Morph. BTB-LPP provides such input for GTagger by exploiting the SVM Tagger as well as the rule-based component, which tags some tokens with a list of the best possible candidate tags according to the morphological lexicon. Additionally, the set of rules is applied in order to resolve some of the ambiguities.

The combination of the three components implements the morphological tagger of BTB-LPP. The SVM Tagger plays the role of a guesser for the unknown words. The rule-based component provides an accurate annotation of the known words, leaving some unsolved cases. GTagger provides the final result. This result is used by the lemmatizer and the dependency parser.

3.2 Lemmatizer

The second processing module of BTB-LPP is a functional lemmatization module, based on the morphological lexicon, mentioned above. The functions are defined via two operations on word forms: remove and concatenate. The rules have the following form:

if tag = Tag then {remove OldEnd; concatenate NewEnd}

where Tag is the tag of the word form, OldEnd is the string which has to be removed from the end of the word form, and NewEnd is the string which has to be concatenated to the end of the remaining string in order to produce the lemma. Here is an example of such a rule:

if tag = Vpitf-o1s then {remove ох; concatenate а}

The application of the rule to the past simple verb form четох (remove: ох; concatenate: а) gives the lemma чета (to read). Additionally, we encode rules for unknown words in the form of guesser word forms: #ох and tag=Vpitf-o1s. In these cases the rules are ordered.
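
A minimal Python sketch of how such remove/concatenate rules can be applied is given below. It assumes that rules are stored as an ordered list of (tag, OldEnd, NewEnd) triples and that the first matching rule wins; the function name and this storage format are illustrative and are not taken from the BTB-LPP implementation.

```python
def lemmatize(word_form, tag, rules):
    """Apply the first rule whose tag and ending match the word form.

    rules: ordered list of (tag, old_end, new_end) triples.
    Returns the lemma, or None when no rule applies.
    """
    for rule_tag, old_end, new_end in rules:
        if rule_tag == tag and word_form.endswith(old_end):
            stem = word_form[:len(word_form) - len(old_end)]  # remove OldEnd
            return stem + new_end                             # concatenate NewEnd
    return None


RULES = [("Vpitf-o1s", "ох", "а")]  # the example rule from the text

print(lemmatize("четох", "Vpitf-o1s", RULES))
```

With the example rule, the ending ох is stripped from четох and а is appended to the remaining stem, which yields the lemma чета.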

In order to facilitate the application of the rules, we attach them to the word forms in the lexicon. In this way, we gain two things: (1) we implement the lemmatization tool as a part of the regular grammar for lexicon look-up discussed above, and (2) the level of ambiguity is less than 2% for the correctly tagged word forms. In case of ambiguity we produce all the lemmas. After the morphosyntactic tagging, the rules that correspond to the selected tags are applied.

3.3 Dependency Parser

Many parsers have been trained on data from BulTreeBank. Especially successful was the MaltParser of Joakim Nivre (Nivre et al. 2006). It works with 87.6 % accuracy. The following text describes the dependency relations produced by the parser.

Here is a table with the dependency tagset, related to the Dependency part of the BulTreeBank. This part has been used for training of the dependency parser:

Tag	Number of occurrences in BulTreeBank	Meaning
adjunct	12009	Adjunct (optional verbal argument)
clitic	2263	Short forms of the possessive pronouns
comp	18043	Complement (arguments of non-verbal heads, non-finite verbal heads, copula, auxiliaries)
conj	6342	Conjunction in coordination
conjarg	7005	Argument (second, third, ...) of coordination
indobj	4232	Indirect Object (indirect argument of a non-auxiliary verbal head)
marked	2650	Marked (clauses, introduced by a subordinator)
mod	42706	Modifier (dependants which modify nouns, adjectives, adverbs; also the negative and interrogative particles)
obj	7248	Object (direct argument of a non-auxiliary verbal head)
subj	14064	Subject
pragadjunct	1612	Pragmatic adjunct
punct	28134	Punctuation
xadjunct	1826	Clausal adjunct
xcomp	4651	Clausal complement
xmod	2219	Clausal modifier
xprepcomp	168	Clausal complement of preposition
xsubj	504	Clausal subject

Table 1. The Dependency Tagset

In addition to the dependency tags, the morphosyntactic tags have also been attached to each word (Simov et al. 2004). For each lexical node the lemma was assigned. The number next to the name of each relation indicates how many times the relation appears in the dependency version of BulTreeBank. We also have statistics for the triples <DependentWordForm, Relation, HeadWordForm>. They are used for defining the rules for constructing RMRS structures over the dependency parses produced by the Malt Parser.

Here is an example of a processed sentence. The sentence is 'Според одита в електрическите компании политиците злоупотребяват с държавните предприятия.' The glosses for the words in the Bulgarian sentence are: Според (according) одита (audit-the) в (in) електрическите (electrical-the) компании (companies) политиците (politicians-the) злоупотребяват (abuse) с (with) държавните (state-the) предприятия (enterprises). The translation in the original source is: ‘Electricity audits prove politicians abusing public companies.’

After the application of the language pipeline, the result is represented in a table form following the CoNLL shared task format. It is given in Table 2.

No	WF	Lemma	POS	POSex	Ling	DepHead	DepRel
1	според	според	R	R	-	7	adjunct
2	одита	одит	N	Nc	npd	1	prepcomp
3	в	в	R	R	-	2	mod
4	електрическите	електрически	A	A	pd	5	mod
5	компании	компания	N	Nc	fpi	3	prepcomp
6	политиците	политик	N	Nc	mpd	7	subj
7	злоупотребяват	злоупотребявам	V	Vp	tir3p	0	root
8	с	с	R	R	-	7	indobj
9	държавните	държавен	A	A	pd	10	mod
10	предприятия	предприятие	N	Nc	npi	8	prepcomp

Table 2. The analysis of the Bulgarian sentence in CoNLL format.

The column No gives the position of each word form in the sentence, and the column WF the word form itself. The information in the Ling column is the suffix of the corresponding tag (according to the BulTreeBank morphosyntactic tagset) after removing the prefix represented in the column POSex (extended POS). The elements in DepHead point to the number of the dependency head of the given word form. DepRel is the dependency relation between the two word forms. On the basis of this kind of analysis we add the RMRS analysis, see below.
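
To make the column layout concrete, the following Python sketch reads the rows of Table 2, rebuilds the full morphosyntactic tag from POSex and Ling, and extracts <DependentWordForm, Relation, HeadWordForm> triples. The function names are hypothetical, and the rows are written whitespace-separated here, which works only because no value in the table contains a space.

```python
# Rows of Table 2, whitespace-separated.
TABLE_2 = """
1 според според R R - 7 adjunct
2 одита одит N Nc npd 1 prepcomp
3 в в R R - 2 mod
4 електрическите електрически A A pd 5 mod
5 компании компания N Nc fpi 3 prepcomp
6 политиците политик N Nc mpd 7 subj
7 злоупотребяват злоупотребявам V Vp tir3p 0 root
8 с с R R - 7 indobj
9 държавните държавен A A pd 10 mod
10 предприятия предприятие N Nc npi 8 prepcomp
"""

COLUMNS = ["No", "WF", "Lemma", "POS", "POSex", "Ling", "DepHead", "DepRel"]


def parse_rows(text):
    """Turn each row into a dict keyed by the column names of Table 2."""
    tokens = []
    for line in text.strip().splitlines():
        token = dict(zip(COLUMNS, line.split()))
        token["No"] = int(token["No"])
        token["DepHead"] = int(token["DepHead"])
        tokens.append(token)
    return tokens


def full_tag(token):
    """POSex is the prefix and Ling the suffix of the full tag ('-' = empty)."""
    ling = token["Ling"]
    return token["POSex"] + ("" if ling == "-" else ling)


def dependency_triples(tokens):
    """Return (dependent word form, relation, head word form) triples."""
    by_number = {t["No"]: t for t in tokens}
    triples = []
    for t in tokens:
        head = by_number.get(t["DepHead"])  # None for the root (head 0)
        triples.append((t["WF"], t["DepRel"], head["WF"] if head else None))
    return triples


tokens = parse_rows(TABLE_2)
print(full_tag(tokens[1]))
print(dependency_triples(tokens)[5])
```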

4. Word Level Alignment

In this section we present our guidelines for manual alignment between Bulgarian and English. These rules were applied by three annotators to a parallel corpus of more than 100 000 words, which was included in the training set of a Bulgarian-English automatic aligner. Thus, we expect them to have an impact on the automatic alignment as well.

4.1 Introduction to Word Level Alignment

Alignment is the process of identifying and marking up the correspondences between textual segments (sentences, phrases, words, even letters) of a parallel or target text and the corresponding textual segments of a source text. These two texts presumably share the same content (there is a translational equivalence between them), and their differences are due to language specificity and/or translation variation. The target text is not necessarily a translation of the source text; it might even be the case that the direction of the translation is unknown or unspecified. A pair of two such texts is called a bi-text, and a collection of bi-texts forms a parallel corpus; to put it differently, a parallel corpus is any collection of source and target-language versions of a given text.

Parallel corpora have a vast range of applications, such as lexicon extraction and rule induction, chunk alignment in example-based machine translation, extraction and transfer of lexico-semantic relations and many others.

Prior to and during the first stage of the alignment process annotation guidelines are devised to promote the consistency of the corpus. Providing rigorous, high quality alignment rules documentation is very important especially when there are several annotators involved in the task. We followed the standard practice in the guidelines development. One of the annotators (the so called super-annotator) drafted a pilot version that other annotators used to align a small set of randomly selected bi-texts (Bulgarian-English sentence pairs). Then the annotations were compared and each annotation variation was discussed. As a result the pilot guidelines version was modified and enriched with some additional rules.

Among the main factors that affect rules formulation are: task specificity, linguistic theoretical backgrounds, available support software and possible links usage.

Task specificity. The annotation guidelines presented here were developed and adopted during the creation of the Bulgarian-English Parallel Treebank. The treebank itself has several annotation layers, and the mapping between the texts is done on two levels: the semantic level and the word level. The word-level alignments were made as a means to support the semantic-level alignment, which in turn is represented by Minimal Recursion Semantics structures corresponding to each sentence in the corpus (for more on MRS construction see Simov and Osenova 2011).

Linguistic framework. The linguistic framework usually affects phrase alignment rules as well as the mapping between synsemantic words (like prepositions, determiners, particles, auxiliary verbs) and synsemantic and/or autosemantic words (Macken 2010). Some of the decisions in the guidelines were made so that the alignments comply with the dependency-based syntactic annotation of the Treebank, but not at the expense of their intuitiveness: the annotators do not have to be well grounded in Minimal Recursion Semantics or dependency analysis.

Available support software. In the context of building Bulgarian-English Parallel Treebank manual alignment served to establish parallels between terminal nodes (lexical elements or tokens). This is one of the main reasons why we use Word Aligner ⁶ – a web-based version of the tool developed by C. Callison-Burch, and modified by us for our purposes (Fig. 1). In contrast to other tools (e.g. Hand Align ⁷), it provides no special means for differentiation between the word-level and phrase-level alignments. However, the phrase alignment is simulated by block alignments (Fig. 2). The data can be introduced to the aligner in two ways: by typing in the parallel text boxes or by uploading files. Input files have the following format: one sentence pair per line, source and target sentences separated with an @@@ sequence.
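
The input format described above (one sentence pair per line, with the two sentences separated by an @@@ sequence) can be read with a few lines of Python. This is a sketch, not part of the Word Aligner tool. Whether whitespace surrounds the separator is not stated in the text, so the sketch strips it, and the function names are hypothetical.

```python
SEPARATOR = "@@@"


def parse_pair(line):
    """Split one input line into (source sentence, target sentence)."""
    source, sep, target = line.rstrip("\n").partition(SEPARATOR)
    if not sep:
        raise ValueError("missing @@@ separator: " + repr(line))
    return source.strip(), target.strip()


def read_pairs(lines):
    """Parse all non-empty lines of an input file or list of strings."""
    return [parse_pair(line) for line in lines if line.strip()]


example = ["The car was bought . @@@ Колата беше купена .\n", "\n"]
print(read_pairs(example))
```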

Usage of possible links. Originally, the two types of links (sure link or S link, and possible link or P link) were introduced to mark the annotator’s degree of certainty about the alignment decision (Kruijff-Korbayová et al. 2006). In addition, P links are used when the aligned tokens belong to semantically identical or similar phrases and have no corresponding counterparts, but excluding them from the phrase would change its meaning or make it ungrammatical.

Fig. 1. WordAlign interface. Each pair of sentences is represented as a grid of squares. Correspondence between two tokens is marked by clicking on a square – once (black square) or twice (grey square). The Comments field proved to be very useful for keeping track of decision motivation.

Fig. 2. Block (phrase to phrase) aligning. In this example two complex verb forms are aligned as a whole. There is no straightforward correspondence between Bulgarian and English auxiliary components (беше [be.3SG.PST] – had been); only the lexical verbs are sure aligned (заловен [capture.PPP.SG.M] – captured).

4.2 Word Level Alignment Rules

The annotation guidelines for the Bulgarian-English word alignment follow the tradition established by the guidelines used in similar projects aiming at the creation of gold standards for different language pairs, such as the Blinker project for English-French alignment (Melamed 1998), the alignment task for the Prague Czech-English Dependency Treebank 1.0 (Kruijff-Korbayová et al. 2006) and the Dutch Parallel Corpus project (Macken 2010), among others.

4.2.1 General Alignment Rules

We adopt the general rules that have proven to be shared by the different annotation tasks and alignment strategies:

1. Mark as many tokens as necessary in the source and in the target sentence to ensure a two-way equivalence.

2. Mark as few tokens as possible in the source and in the target sentence, but preserve the two-way equivalence (Veronis 1998; Merkel 1999; Macken 2010).

3. If a token or a phrase has no corresponding counterpart in the other language, and bears no structural and/or semantic significance, it should be left unlinked (NULL link, square with no fill) (Melamed 1998).

We avoid weak alignment when the propositions conveyed by the source and the target sentence are equivalent in terms of truth conditions, but differ in form of expression, lexically and/or syntactically. The same holds for the cases where one of the propositions is derived as a (cancellable) inference from the other. Consider the classical example The car was sold – The car was bought, with the second sentence rendered in Bulgarian: Колата беше купена. The verb купена [buy.PPP.SG.F] should not be linked to sold (Fig. 3). This prevents the attribution of the same elementary predicate to words with opposite meanings.

Fig. 3. Word alignment of logically equivalent, but lexically different sentences. The two NPs, the punctuation mark and the auxiliary verbs are linked strongly, but the two participles derived from antonymous verbs are left unaligned.
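
One simple way to store such an alignment is a mapping from (source index, target index) cells of the grid to a link type, with unlinked tokens corresponding to NULL links. The Python sketch below encodes the example of Fig. 3 in this way. The representation and the function name are assumptions made for illustration, and the exact cells reflect our reading of the figure description rather than the annotated data.

```python
def unlinked_tokens(source, target, links):
    """Return the source and target tokens that take part in no link."""
    linked_src = {i for i, _ in links}
    linked_tgt = {j for _, j in links}
    return ([w for i, w in enumerate(source) if i not in linked_src],
            [w for j, w in enumerate(target) if j not in linked_tgt])


source = ["The", "car", "was", "sold", "."]
target = ["Колата", "беше", "купена", "."]

# (source index, target index) -> "S" (sure) or "P" (possible)
links = {
    (0, 0): "S",  # The – Колата
    (1, 0): "S",  # car – Колата
    (2, 1): "S",  # was – беше
    (4, 3): "S",  # .   – .
}

print(unlinked_tokens(source, target, links))
```

The two participles come out as unlinked, which corresponds to the NULL links of rule 3.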

Idioms and free translations present a special case. If two autosemantic words or phrases refer to the same object, but do not share the same meaning, they are aligned with a P link (Fig. 4).

Fig. 4. This animal – това куче [‘this dog’] alignment. The meanings of the two weakly aligned words are in superconcept – subconcept relation.

The same rule holds when there is a synsemantic – autosemantic correspondence, for example if the translator chooses to use a pronoun instead of a noun (Fig. 5).

Fig. 5. Synsemantic – autosemantic word alignment: Ivan's mother called. - Неговата майка се oбади. [His.DET mother REFL called. – ‘His mother called.’].

P link is also used when a lexical item is paraphrased in the other language, which is another very frequent phenomenon in bi-texts. The example, given below (Fig. 6), shows a case where one English word is translated periphrastically with a noun phrase, postmodified by a prepositional phrase.

Fig. 6. Paraphrase: these non-Serbs - тези лица от несръбски произход [‘these persons from non-Serbian origin’].

And finally, idioms are linked with an S link; each token from the idiom in the source sentence is aligned with an S link to each token from the idiom in the target sentence (Fig. 7).

Fig. 7. Idioms are block (phrase) aligned: She'll marry him when pigs begin to fly. – Тя ще се омъжи за него на куково лято.

4.2.2 Language Specific Rules

These rules are primarily language specific, and they concern predominantly, but not exclusively, the function words. During the alignment process we encountered several recurring types of discrepancies ⁸. Some of them reflect the grammatical differences between English and Bulgarian, while others are due to the translator’s choice to use an alternative phrasing regarded by him/her as more appropriate, but structurally and morphosyntactically more distant from the original text. Some typical hard problems concern double negation, analytical tense forms, prepositional phrases vs. noun phrases, and active vs. passive voice.

The last version of the annotation guidelines consists of approximately 70 specific rules. They are organized in the following sections:

1. Noun phrases

   Determiners. Articles, demonstratives and possessive pronouns

   Prepositional phrases

   Expressing possession. Whose and of whom

   Substitution with one(s)

2. Verb vs. Noun

3. Verb phrases

   Expletive subject and pro-drop

   Reflexive pronouns in a verb complex

   To and да particles

   Negation

   Do support

   Active vs. passive voice constructions

   Phrasal verbs and “stretched” verbs

4. Subordination. Conjunctions, wh-words and relatives

5. Interrogative sentences

6. Numerals

7. Punctuation and conjunctions

The structure of the guidelines resembles that of a grammar, but there are some topics that one would not usually find in a textbook, e.g. “Punctuation and conjunctions”. However, the rules stated in this part of the guidelines are designed to be in compliance with the dependency-based analysis of coordination. For example, the comma, followed by and, resp. и (‘and’) is P aligned to the corresponding conjunction. In this case the comma itself is not a conjunction and it is “attached” to the coordinative и.

Fig. 8. Punctuation alignment and dependency analysis: Greeks and Bulgarians are both Orthodox people – и гърците, и българите са православни народи [and Greeks, and Bulgarians are Orthodox people].

Some of the language specific rules will be presented below:

Noun phrases. English determiners like a(n) or the correspond either to the Bulgarian determiner един (one) or to a bare NP (Fig. 9), or to the so-called full/short definite article, which is a morpheme.

Fig. 9. Determiners: both - the article a, and the noun house, are S aligned to the bare Bulgarian noun къща (‘house’).

Usually if one of the two corresponding NPs has no modifier, the determiner and the head of the phrase are aligned together to the head of the other phrase (Kruijff-Korbayová 2006) or (Macken 2010). Since in Bulgarian the article could be a morpheme attached to the first modifier, we decided to link both the article and the modifier from the English sentence to the corresponding Bulgarian modifier with an S link (Fig. 10).

Fig. 10. Determiners and modifiers: the lovely old house – хубавата стара къща [lovely.F.SG.DET old.F.SG house.F.SG – ‘the lovely old house’].

English possessive pronouns and Bulgarian possessive pronouns are S aligned. Features like reflexivity or form type, full or short, are ignored (Fig. 11 a, b, c, d).

Fig. 11a. Possessive forms alignment: their son – техния син [their.M.SG.DET son].

Fig. 11b. Possessive forms alignment: their son – синът им [son.DET their].

Fig. 11c. Possessive forms alignment: their son – своя син [POSS.REFL.M.SG.DET son].

Fig. 11d. Possessive forms alignment: their son – сина си [son.M.SG.DET POSS.REFL].

Prepositional phrases. Very often English noun premodifiers are translated into prepositional phrases in Bulgarian. If that is the case, the preposition is aligned with a P link to the head noun (Fig. 12).

Fig. 12. Prepositional phrases: Justice Minister Cemil Cicek – Министърът на правосъдието Джемил Чичек [Minister.DET of justice Cemil Cicek]

English possessive noun forms are translated into Bulgarian either with a на prepositional phrase (John’s – на Иван), or with an adjective that has possessive meaning (John’s – Иванов). In the case of a PP translation, the preposition itself is aligned to the possessive ’s (for singular) or ’ (for plural) marker with a P link to reflect the fact that the two possessive markers are morphosyntactically different (Fig. 13).

Fig. 13. Possessive noun forms: my elder sister's life – животът на по-голямата ми сестра [life.DET of more-old.DET my sister]

Bulgarian adjectives with possessive meaning and English possessive noun forms are S linked (Fig. 14); despite the fact that the former are noun-derived adjectives while the latter are noun forms, they are semantically identical regardless of the sentential context.

Fig. 14. Possessive noun forms: Ivan's wife – Ивановата жена [Ivan-ADJ-F.SG-DET wife]

Expletive subject and pro-drop

Expletive subjects (it, there) rarely have correspondence in Bulgarian sentences, but they are obligatory for English. That is why we decided to link it or there with an S link to the Bulgarian verb components, i.e. to the whole verb complex (Fig. 15).

Fig. 15. Expletive subject: It is raining – Вали [‘Rain-3.SG.PRS’]. The auxiliary is is P linked to the lexical verb in Bulgarian.

On the other hand, Bulgarian is a pro-drop language (Fig. 16). If the subject is unexpressed, then the English subject should be linked with a P link to all Bulgarian verb components that express one of the agreement categories: person, gender, number, and the main verb form itself (see Lambert et al. 2006).

Fig. 16. Unexpressed subject in Bulgarian: They would not dare – Не биха посмели [Not would-3.PL dare-PPT-PL].

Verbs. We try to follow the rules as they were first formulated in (Melamed 1998): if possible, link main (lexical) verb to main verb and auxiliary verb to auxiliary verb (Fig. 17).

Fig. 17. Verb alignment. A rare case with one-to-one correspondence of the verb complex components, in this example - auxiliary verbs (was – беше) and past passive participles (elected – избрана): She was elected in 1997. – Тя беше избрана през 1997.

Whenever the auxiliary form is not present or different in the source or target phrase, it should be aligned to the main verb weakly or the two verb forms should be phrase aligned (Fig. 18). This is why in one of the examples, given above (Fig. 15), the auxiliary (is) is weakly aligned to the synthetic Bulgarian verb form (вали). In addition we use a list of fixed strongly or weakly aligned couples of auxiliary verbs or particles that annotators consult.

Fig. 18. Phrase alignment of complex verb forms: he is laughing – той се смее [he REFL laugh.3.SG.PRS]. Here two peculiarities are captured: 1) the Bulgarian verb смея се (to laugh) is inherently reflexive, which accounts for the S linking between laughing and се смее; 2) the habitual – continuous event type opposition is not grammatically presented in Bulgarian, thus both Present Simple and Present Continuous tense forms are translated alike. The translational equivalence is not metrical, so the auxiliary English verb is P linked.

Double negation. The last topic we will address here concerns some negation phenomena. The problem they pose is twofold: 1) the morphosyntactic patterns involved are very diverse, and 2) their linguistic modeling and theoretical representation is neither exhaustive nor uncontroversial. Devising rules for each possible translation is not achievable, so although the guidelines cover the most common cases, they are by no means comprehensive. Some basic principles will be illustrated below.

Verb negation. In Czech the verb itself has a morphologically marked negative form that is weakly aligned with the positive form in English (Kruijff-Korbayová et. al 2006). In Bulgarian the negative marker is not a morpheme, but either a particle – не, or a verb with negative lexical meaning – няма, нямаше. This allows for one-to-one sure alignment of negative words (Fig. 19).

Fig. 19. Negative verb forms alignment: I wouldn’t come. – Нямаше да дойд-а. [Not.have-PST to come-1.SG.PRS]. Infinitival да particle is weakly aligned to the English main verb, I is weakly aligned to the component that expresses the grammatical meanings person and number, дойда.

Double negation is typical for Slavic languages like Czech and Bulgarian, but not for English. Often it is the case that one or more negative or indefinite pronouns from the Bulgarian sentence correspond to indefinite English pronouns. They should be mapped with an S link (Fig. 20).

Fig. 20. Double negation and pronoun correspondence: I couldn’t see anything. – Не можах да видя нищо. [Not can-PST.1.SG to see nothing]. In negative context indefinite English pronouns are always translated with negative Bulgarian pronouns.

If the English verb does not have a negative form, the Bulgarian negative particle is aligned with a P link to the English word that bears the negative meaning (Fig. 21). The same rule applies to all other negative pronouns present in the Bulgarian sentence – all of them should be P linked to the negative English word.

Fig. 21. Double negation – linking negative words: I felt nothing. – Нищо не почувствах. [Nothing not feel.PST.1.SG].

Active vs. passive voice constructions

Voice shifts are among the most frequent sources of grammatical mismatch between the aligned sentences. If the source and target verbs are in different voices, the following rule applies: link the main verbs with an S link and the rest of the verb complexes with a P link (Fig. 22).

Fig. 22. Aligning verbs in different voices: Ivan interviewed Tanya. (active voice) – Таня била интервюирана от Иван. (passive voice) [Tanya be.PTCP.F.SG interview.PASS.PTCP.F.SG by Ivan – ‘Tanya was interviewed by Ivan.’].

Verb arguments that show no morphosyntactic differences are sure aligned. When the syntactic realization differs due to the verb, e.g. noun phrase vs. prepositional phrase, the preposition is P aligned, whereas the mapping type is S for the noun phrase component (Fig. 22 above) and P for the pronouns that have different cases, Nominative vs. Accusative (Fig. 23).

Fig. 23. Aligning arguments with different realisation: He interviewed her. – Тя била интервюирана от него. [She be.PTCP.F.SG interview.PASS.PTCP.F.SG by him. – ‘She was interviewed by him.’]

The validation checks over the corpus showed that most of the discrepancies between the alignments are due to the following factors:

Complexity of translation variation. There are many phrases with differing internal structure, especially analytical tense forms.

Rules proliferation. Some grammatical phenomena, such as double negation in Bulgarian, which affects not only the VP, but also the argument realisation, are too diverse to be handled by a reasonable number of exact rules.

Underspecification. The alignment in such cases remains underspecified and the decisions made by different annotators or even by the same person but for a different sentence pair, are rarely identical.

Different annotation “styles”. The lack of alignment instructions or underspecification may trigger the emergence of individual, implicit alignment patterns that reflect different annotators’ interpretations and solutions to problems that are not discussed in the guidelines.

As mentioned above, the word level alignment supports the semantic level alignment, which is presented in the next section.

5. Semantic Level Alignment

Our approach is inspired by the work on MRS and RMRS (see Copestake 2003; 2007) and the previous work on transfer of dependency analyses into RMRS structures described in (Spreyer and Frank 2005) and (Jakob et. al 2010).

MRS is introduced as an underspecified semantic formalism (Copestake et al. 2005). It is used to support semantic analyses in the HPSG English Resource Grammar (Copestake and Flickinger 2000), but also in other grammar formalisms like LFG. The main idea is for the formalism to rule out spurious analyses resulting from the representation of logical operators and the scope of quantifiers. Here we present only the basic definitions from (Copestake et al. 2005); for more details the cited publication should be consulted. An MRS structure is a tuple <GT, R, C>, where GT is the top handle, R is a bag of EPs (elementary predications) and C is a bag of handle constraints, such that there is no handle h that outscopes GT. Each elementary predication contains exactly four components: (1) a handle which is the label of the EP; (2) a relation; (3) a list of zero or more ordinary variable arguments of the relation; and (4) a list of zero or more handles corresponding to scopal arguments of the relation (i.e., holes). Here is an example of an MRS structure for the sentence “Every dog chases some white cat.”

<h0, {h1: every(x,h2,h3), h2: dog(x), h4: chase(x, y), h5: some(y,h6,h7), h6: white(y), h6: cat(y)}, {}>

The top handle is h0. The two quantifiers are represented as relations every(x, y, z) and some(x, y, z) where x is the bound variable, y and z are handles determining the restriction and the body of the quantifier. The conjunction of two or more relations is represented by sharing the same handle (h6 above). The outscope relation is defined as a transitive closure of the immediate outscope relation between two elementary predications – EP immediately outscopes EP' iff one of the scopal arguments of EP is the label of EP'. In this example the set of handle constraints is empty, which means that the representation is underspecified with respect to the scope of both quantifiers. Here we finish with the brief introduction of the MRS formalism. The rest of the definitions will be introduced when necessary in the text.
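The outscope relation defined above can be made concrete with a small sketch. The Python fragment below is only an illustration under our own encoding assumptions (the list EPS and the function names are hypothetical and do not come from any MRS toolkit): it stores the EPs of the example together with their holes and computes the transitive closure of the immediate outscope relation.

```python
from itertools import product

# EPs of the example as (label, relation, scopal-argument handles).
# Two EPs may share a label (conjunction), so a list is used, not a dict.
EPS = [
    ("h1", "every", ["h2", "h3"]),
    ("h2", "dog", []),
    ("h4", "chase", []),
    ("h5", "some", ["h6", "h7"]),
    ("h6", "white", []),
    ("h6", "cat", []),
]

def immediately_outscopes(eps):
    """Pairs (i, j) of EP positions: a scopal argument of EP i is the label of EP j."""
    return {
        (i, j)
        for (i, (_, _, holes)), (j, (label, _, _)) in product(enumerate(eps), repeat=2)
        if label in holes
    }

def outscopes(eps):
    """Transitive closure of the immediate outscope relation."""
    closure = set(immediately_outscopes(eps))
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure

names = [rel for _, rel, _ in EPS]
pairs = sorted((names[i], names[j]) for i, j in outscopes(EPS))
print(pairs)
```

In this underspecified structure only the restrictions are outscoped by their quantifiers; chase is outscoped by neither of them, because the body holes h3 and h7 are not yet filled.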

RMRS is introduced as a modification of MRS intended to capture the semantics resulting from shallow analysis. The following assumption is taken into account: the shallow processor does not have access to a lexicon, and thus it does not have access to the arity of the relations in EPs. Therefore, the representation has to be underspecified with respect to the number of arguments of the relations. Additionally, the relation names are formed according to conventions that make it possible to construct a correct semantic representation only on the basis of information provided by a POS tagger, for example. The arguments are introduced separately by argument relations between the label of a relation and the argument. The names of the argument relations follow some standardized convention like RSTR, BODY, ARG1, ARG2, etc. These argument relations are grouped in a separate set in a given RMRS structure. Both representations, MRS and RMRS, could be transferred to each other under certain conditions. In the paper we follow the representation of RMRS used in (Jakob et al. 2010), which defines an RMRS structure as a quadruple < hook, EP-bag, argument set, handle constraints >, where a hook consists of three elements l:a:i, l is a label, a is an anchor and i is an index. Each elementary predication is additionally marked with an anchor – l:a:r(i), where l is a label, a is an anchor and r(i) is a relation with one argument of appropriate kind – referential index or event index. The argument set contains argument statements of the following kind a:ARG(x), where a is an anchor which determines for which relation the argument is defined, ARG is the name of the argument, and x is an index or a hole variable or handle (h) for scopal predicates. The handle constraints are of the form h =q l, where h is a handle, l is a label and =q is the relation expressing the constraint similarly to MRS. =q is sometimes written as qeq.
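To illustrate the quadruple, here is a minimal sketch of one possible encoding. The class RMRS, the function split_ep and the choice of e1 as the event index of chase are assumptions made for this illustration only; the sketch shows how the arguments of an MRS-style EP are moved into separate argument statements keyed by the anchor.

```python
from dataclasses import dataclass, field

@dataclass
class RMRS:
    hook: tuple                                  # (label, anchor, index)
    eps: list = field(default_factory=list)      # (label, anchor, relation, arg0)
    args: list = field(default_factory=list)     # (anchor, argument name, value)
    hcons: list = field(default_factory=list)    # (handle, "qeq", label)

def split_ep(label, anchor, relation, arg0, other_args, names=None):
    """Split one MRS-style EP into an RMRS EP plus separate argument statements."""
    if names is None:
        # Default naming convention ARG1, ARG2, ... (assumed here).
        names = [f"ARG{i}" for i in range(1, len(other_args) + 1)]
    ep = (label, anchor, relation, arg0)
    statements = [(anchor, name, value) for name, value in zip(names, other_args)]
    return ep, statements

# chase(x, y) from the MRS example, with a hypothetical event index e1 and anchor a4.
ep, statements = split_ep("h4", "a4", "chase", "e1", ["x", "y"])
rmrs = RMRS(hook=("h4", "a4", "e1"), eps=[ep], args=statements)
print(rmrs.eps)
print(rmrs.args)
```

Because the argument statements live in their own set, a shallow processor that does not know the arity of chase can simply emit fewer of them.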

RMRS was used in analyses of two dependency treebanks – the TIGER treebank of German and the Prague Dependency Treebank of Czech. The work on the Prague Dependency Treebank presented in (Jakob et al. 2010) first assigns elementary predications to each node in the tectogrammatical tree. Then the elementary predications for the nodes are combined on the basis of the dependency annotation in the trees. We take a similar approach, except that the analyses from which we start are not trees on the tectogrammatical level. Thus, our trees contain nodes for each token in the sentences.

5.1 RMRS for Bulgarian Dependency Parses

Here we present a set of rules for the transfer of dependency parses into RMRS representations. The input information for the RMRS structures is based on the following linguistic annotation – the lemma (Lemma) for the given wordform, the morphosyntactic tag (MSTag) of the wordform, and the dependency relations in the dependency tree. In the case of quantifiers we have access to the lexicon used in BURGER. Here we present the rules for some of the most important combinations. We take into account the MRS structures produced by BURGER in order to be able to compare them to RMRS structures produced over the dependency trees. Thus, the algorithm for producing RMRS from a dependency parse is implemented via two types of rules:

<Lemma, MSTag> → EP-RMRS

The rules of this type produce an RMRS including an elementary predicate.

<DRMRS, Rel, HRMRS> → HRMRS'

The rules of this type unite the RMRS constructed for a dependent node (DRMRS) into the current RMRS for a head node (HRMRS). The union (HRMRS') is determined by the relation (Rel) between the two nodes. In the rest of the section we present examples of these rules.

First, we start with assigning EPs for each lemma in the dependency tree. These EPs are similar to node EPs of (Jakob et al 2010). Each EP for a given lemma consists of a predicate generated on the basis of the lemma string. When the lemma is a quantifier and thus it is a part of the BURGER lexicon, we copy the related information about its relation and arguments – RESTRICTION (RSTR) and BODY. Additionally, the morphosyntactic features of the wordform are presented. On the basis of the part-of-speech tag the type of ARG0 is determined – referential index or event index. After this initial step the basic RMRS structure for each lemma in the sentence is compiled. Below we discuss the exploitation of the rest of the information in the dependency tree – the types of links to the other lemmas as well as the further contribution of the morphosyntactic features. Here is an example for the verb ‘чета’ (to read):

< l1:a1:e1, { l1:a1:чета_v_rel(e1) }, { a1:ARG1(x1) }, {} >

In this example we also include information for the unexpressed subject (ARG1) which is always incorporated in the verb form. In case the subject is expressed, it will be connected to the same referential index. For some types of nodes the EP RMRS will include information only for arguments of the predicate of the head node.
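A rule of the first kind can be sketched as follows. This is a simplified illustration, not the implemented rule set: the table POS_INFO, the dictionary layout and the assumption that the first letter of the morphosyntactic tag encodes the part of speech are ours, and the tag Vpitf-r1s is used only as an illustrative value.

```python
from itertools import count

# Enumerators for labels, referential and event variables (cf. setEnumerators()).
labels, refs, events = count(1), count(1), count(1)

# Hypothetical table: first letter of the tag -> (predicate suffix, kind of ARG0).
POS_INFO = {"V": ("v", "e"), "N": ("n", "x")}

def node_ep(lemma, mstag, token_no):
    """<Lemma, MSTag> -> EP-RMRS for a single token."""
    suffix, kind = POS_INFO[mstag[0]]
    label = f"l{next(labels)}"
    anchor = f"a{token_no}"                      # anchors reuse the token number
    index = f"{kind}{next(events if kind == 'e' else refs)}"
    rmrs = {
        "hook": (label, anchor, index),
        "eps": [(label, anchor, f"{lemma}_{suffix}_rel", index)],
        "args": [],
        "hcons": [],
    }
    if suffix == "v":
        # Pro-drop: the verb form always contributes an (unexpressed) subject slot.
        rmrs["args"].append((anchor, "ARG1", f"x{next(refs)}"))
    return rmrs

example = node_ep("чета", "Vpitf-r1s", 1)
print(example)
```

With fresh enumerators the call reproduces the hook, the elementary predication and the ARG1 statement of the structure shown for чета.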

The short forms of pronouns (clitics) do not introduce a semantic relation. The semantic relation is introduced only by their full counterparts. The transfer is rather straightforward, since the short forms are annotated as clitics, while the full forms are assigned grammatical roles – object or indirect object. Thus, the full forms in the verbal domain are automatically transferred as ARG2 and ARG3 of the corresponding verb. In this transfer we always connect the object to the ARG2 slot and the indirect object to the ARG3 slot. For example, the sentence чета му я (Read-I him-dative her-accusative, ‘I read it to him’) will have the following representation:

< l1:a1:e1,

{l1:a1:чета_v_rel(e1) },

{ a1:ARG1(x1), a1:ARG2(x2), a1:ARG3(x3) },

{} >

The EP RMRS for the accusative clitic introduces only the information for ARG2 and appropriate grammatical features for the variable x2 (third person, singular, feminine). Similarly EP RMRS for the dative clitic provides ARG3 and its grammatical features (third person, singular, masculine). When this information is incorporated into the head RMRS, the anchors for the ARG2 and ARG3 are changed with respect to the anchor of the head.

The subject is mapped to ARG1. It is worth noting that the Subject argument is partially determined during the previous step in building EPs, because Bulgarian is a pro-drop language, and the main subject properties are considered part of the verb form. Here is an example for the sentence момче му я чете ([Boy him-dative her-accusative read], ‘A boy reads it to him’):

< l2:a4:e1,

{ l1:a1:момче_n_rel(x1), l2:a4:чета_v_rel(e1) },

{ a4:ARG1(x1), a4:ARG2(x2), a4:ARG3(x3) },

{} >

Another example with an explicit direct object for the sentence момче му чете книга ([Boy him-dative reads book], ‘A boy reads a book to him’):

< l2:a3:e1,

{ l1:a1:момче_n_rel(x1), l2:a3:чета_v_rel(e1), l3:a4:книга_n_rel(x2) },

{ a3:ARG1(x1), a3:ARG2(x2), a3:ARG3(x3) },

{} >
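The way such structures arise from rules of the second kind can be sketched as below. The relation names subj, obj, indobj and clitic, the placeholder variable x0 and the dictionary layout are assumptions of this illustration; in particular, replacing the placeholder of the unexpressed subject is a simplification of connecting the expressed subject to the same referential index.

```python
def union(head, dep, rel):
    """<DRMRS, Rel, HRMRS> -> HRMRS': merge a dependent RMRS into its head."""
    h_anchor = head["hook"][1]
    merged = {
        "hook": head["hook"],
        "eps": head["eps"] + dep["eps"],
        "args": list(head["args"]),
        "hcons": head["hcons"] + dep["hcons"],
    }
    slot = {"subj": "ARG1", "obj": "ARG2", "indobj": "ARG3"}.get(rel)
    if slot:
        # Drop any placeholder for this slot, then bind it to the dependent's index.
        merged["args"] = [a for a in merged["args"] if a[:2] != (h_anchor, slot)]
        merged["args"].append((h_anchor, slot, dep["hook"][2]))
        merged["args"] += dep["args"]
    elif rel == "clitic":
        # Clitics contribute only argument statements; re-anchor them to the head.
        merged["args"] += [(h_anchor, name, var) for _, name, var in dep["args"]]
    else:
        merged["args"] += dep["args"]            # catch-all: plain union
    merged["args"].sort(key=lambda a: (a[0], a[1]))
    return merged

def rmrs(hook, eps=(), args=()):
    return {"hook": hook, "eps": list(eps), "args": list(args), "hcons": []}

# момче му чете книга: tokens 1..4, the verb is token 3.
verb = rmrs(("l2", "a3", "e1"), [("l2", "a3", "чета_v_rel", "e1")], [("a3", "ARG1", "x0")])
boy = rmrs(("l1", "a1", "x1"), [("l1", "a1", "момче_n_rel", "x1")])
clitic = rmrs((None, "a2", None), args=[("a2", "ARG3", "x3")])
book = rmrs(("l3", "a4", "x2"), [("l3", "a4", "книга_n_rel", "x2")])

sentence = verb
for dep, rel in [(boy, "subj"), (clitic, "clitic"), (book, "obj")]:
    sentence = union(sentence, dep, rel)
print(sentence["args"])
```

The resulting argument set corresponds to the one in the representation above: all three statements are anchored at the verb, and the dative clitic contributes ARG3 without adding an elementary predication.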

A problematic case is the passive construction in which the arguments are represented as alternating dependency relations. In this case the lemma is consulted for the semantic representation, and the indirect object relation is assigned as a PP-relation, which introduces the Subject.

The modifying words (mod) – adjectives, adverbs or nouns introduce a modifier relation. When the modifier is definite, then the information is treated only on the syntactic level. Thus, the head is considered semantically definite, and the information is divided between the two levels of analysis.

The complements of the copula need the information from the morphosyntactic tag, since the adjective, adverb and PPs raise their INDEX to the semantically vacuous copula. In contrast to them, the nouns introduce a referential INDEX, which, however, is not raised to the copula.

When an auxiliary verb that takes a participle as a complement is recognized, the transfer depends on the participle. For example, if the participle is aorist, then it is in active voice. If it is passive, then the semantics follows the strategy from above.

The transfer of the impersonal verbs into RMRS also relies on the morphosyntactic tags. They introduce a restriction on their subject to be pronominal, 3rd person, singular, neuter. The relation xcomp is transformed into a constraint, which ensures that the ARG1 of the modal qeqs the label of the verb in the da-construction (analytical substitute form for the Old Bulgarian infinitive).

Here is a simplified representation of the sentence Трябва да му кажа ([Must to him-dat tell-I], ‘I have to tell him’):

< l1:a1:e1,

{ l1:a1:трябва_v_rel(e1), l2:a4:кажа_v_rel(e2) },

{ a1:ARG2(e2), a4:ARG1(x1), a4:ARG3(x2) },

{} >

The xmod relation connects a clause to a nominal head. When the clause is introduced by a relative pronoun, its RMRS is incorporated in the RMRS of the head and the index introduced by the relative pronoun is made the same as the index of the head. In cases when the clause is not introduced by a relative pronoun, the event index of the clause is nominalized and the new referential index is made the same as the index of the head.

The xsubj relation is incorporated in the head RMRS depending on the kind of the dependent clause. If it is a relative clause then the index of the relative pronoun is made equal to the index introduced by the unexpressed subject of the head. In the other cases the event represented by the clause is nominalized and the new referential index is made equal to the index of the unexpressed subject of the head.

The marked relation is always connected to a subordinate conjunction. The subordinate conjunction introduces a two argument relation, where both arguments are events. In this case the RMRS of the dependent clause is added to the RMRS assigned to the conjunction. Additionally, the index of the second argument is made equal to the index of the dependent clause.

The xprepcomp relation is treated as an ordinary prepcomp relation, but the index of ARG1 is an event.

The canonical coordination is handled relatively straightforwardly. The conj label introduces a coordination relation, and conjarg is mapped to the right index R-INDEX. Then, the left index L-INDEX is taken by the above level, which contains the grammatical role of the whole coordination phrase.

The pragadjunct introduces different types of modifiers on pragmatic level like vocatives, parenthetical expressions, etc. For the moment, we incorporate the RMRS of the dependent element in the RMRS of the head without additional constraints, but these cases require more work in future.

The relation punct is ignored.

The incorporation of the dependent RMRS into the head RMRS is done recursively from the leaves of the tree up. After the construction of the RMRS of the tree root, we need to add the missing quantifiers for the unbound referential indexes. For each such index the algorithm determines the handle with the widest scope and uses it as the RSTR argument.

Here is a pseudo code of the main algorithm RMRS which selects the root of the input tree and calls the recursive function which calculates the RMRS for the sentence:

algorithm rmrs

Input: DTree (dependency tree in CoNLL format)

Output: < hook, EP-bag, argument set, handle constraints >

(RMRS structure for the sentence)

RootNode → root(DTree)

setEnumerators()

RMRS → nodeRMRS(DTree, RootNode)

return addQuantifiers(RMRS)

end_algorithm

The function root(DTree) selects the root of the tree. The function nodeRMRS(DTree, Node) recursively constructs the RMRS structure for the subtree starting at node Node. The subtree is part of the whole tree for the sentence – DTree. The function setEnumerators() sets the initial numbers for labels, referential and event variables. For anchors we use the token numbers that are already in the CoNLL format of the dependency tree. The function addQuantifiers(RMRS) introduces the missing quantifiers in the final RMRS. Here is the pseudo code for the function nodeRMRS:

function nodeRMRS(DTree, CurrentNode)

NodeEP → nodeEP(DTree, CurrentNode)

for DNode in depNodes(DTree, CurrentNode)

DNodeRMRS → nodeRMRS(DTree, DNode)

DRel → nodeRel(DTree, DNode)

NodeEP → union(NodeEP, DNodeRMRS, DRel)

end_for

return NodeEP

end_function

This function first calls the function for constructing RMRS for the elementary predication for the current node in the dependency tree – nodeEP(DTree, CurrentNode). This function implements the first kind of rules mentioned above. It has access to the lemma and the grammatical information for the current node. The predicate name is constructed on the basis of the lemma and the part of speech (for example, чета_v_rel), and the argument type, event or referential index, is determined on the basis of the grammatical information. Additional information can be added for other arguments of the verbs as described above. In case of access to a lexicon, the function will be tuned to the information within the lexicon. This will be relevant for the case of the valency lexicon.

The function depNodes(DTree, CurrentNode) returns the set of nodes in the tree which are dependent on the current node. For each of them the function nodeRMRS(DTree, Node) is called. The result of this recursive call is incorporated within the current RMRS on the basis of the dependency relations. This is done by the function union(NodeEP, DNodeRMRS, DRel). This function is defined by the second kind of rules described above. Note that all the relevant information is available in the already constructed RMRS structures for the head node as well as the dependent nodes and the type of the relations.
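The control flow of the pseudo code can be turned into a short runnable sketch. The function names mirror those of the pseudo code, but everything else is an assumption made for illustration: the tuple layout of the tokens, the simplified node_ep, and a union that ignores the relation and only unites the sets of the two structures. The step addQuantifiers is omitted.

```python
# Tokens as (id, lemma, coarse POS, head id, relation); head 0 marks the root.
# The sentence is момче чете книга ('A boy reads a book').
DTREE = [
    (1, "момче", "N", 2, "subj"),
    (2, "чета", "V", 0, "root"),
    (3, "книга", "N", 2, "obj"),
]

def root(dtree):
    """Select the root node of the dependency tree."""
    return next(tok for tok in dtree if tok[3] == 0)

def dep_nodes(dtree, node):
    """Nodes whose head is the given node."""
    return [tok for tok in dtree if tok[3] == node[0]]

def node_ep(dtree, node):
    """Simplified first-kind rule: one EP named after lemma and POS, anchored at the token."""
    tok_id, lemma, pos, _, _ = node
    return {"eps": {f"a{tok_id}:{lemma}_{pos.lower()}_rel"}, "args": set()}

def union(head_rmrs, dep_rmrs, rel):
    """Simplified second-kind rule: unite the sets of both structures, ignoring rel."""
    return {key: head_rmrs[key] | dep_rmrs[key] for key in head_rmrs}

def node_rmrs(dtree, current):
    """Build the structure for the subtree rooted at current, leaves first."""
    result = node_ep(dtree, current)
    for dnode in dep_nodes(dtree, current):
        result = union(result, node_rmrs(dtree, dnode), dnode[4])
    return result

def rmrs(dtree):
    return node_rmrs(dtree, root(dtree))

print(sorted(rmrs(DTREE)["eps"]))
```

Passing the relation to union even though this simplified version ignores it keeps the signature of the pseudo code, so relation-specific rules can replace the body without touching the traversal.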

There are 118 rules of the first kind. They correspond to a reduced morphosyntactic tagset of (Simov et al. 2004). There are 53 rules of the second kind. The construction of these rules follows the statistics presented in section 3. We first implemented rules for the most frequent combinations. Since we cannot be sure that the treebank contains examples of all possible combinations, we also implemented a ‘catch-all’ rule which simply constructs the union of the sets within the two RMRSes.

5.2 Alignment of RMRS Structures

As mentioned above, we use word level alignment in order to establish alignment on the level of the RMRS. For both languages the phrases are assigned an RMRS structure which represents the semantic value of the phrase (in the case of the dependency parse this MRS incorporates the semantic values of all dependent elements). The intuition behind our approach is that the lexical data of each structure in the syntactic analysis for a pair of sentences are aligned on the word level. Then we assume that their MRS structures are equivalent modulo the meaning of the language-specific elementary predicates. We exploit this intuition in constructing the semantic alignment in our treebank.

First we establish correspondences on the lexical level. Each pair of lexical items in the corresponding analyses are made equivalent on the basis of word alignment. The next step is to traverse the trees bottom-up. For each phrase or head for which the components are aligned, a correspondence on the MRS level is established. It should be explicitly noted that a correspondence on the sentence level is also established. Here we present an example. Let us consider the following pair of sentences from the English Resource Grammar datasets:

Кучето на Браун лае .

Dog-the(neut) of Browne barks.

Browne's dog barks.

The word level alignment is:

(Кучето = dog)

(на = 's)

(на Браун = Browne 's)

(лае = barks)

(Браун = Browne)

Here are the MRS structures assigned to both sentences by ERG and BURGER. Some details are hidden for readability:

ERG:

<h1, { h3: proper_q_rel(x5,h4,h6),

h7: named_rel(x5,"Browne"),

h8: def_explicit_q_rel(x10, h9, h11),

h12: poss_rel(e13,x10,x5),

h12: dog_n_1_rel(x10),

h14: bark_v_1_rel(e2,x10)},

{ h4 qeq h7, h9 qeq h12 }>

BURGER:

<h1, { h3: куче_n_1_rel(x4),

h3: на_p_1_rel(e5,x4,x6),

h7: named_rel(x6, "Браун"),

h8: exist_q_rel(x6, h9, h10),

h11: exist_q_rel(x4, h12, h13),

h1: лая_v_rel(e2,x4)},

{ h12 qeq h3, h9 qeq h7 }>

Establishing correspondences between the MRSes on the basis of the word level alignment yields the following mappings between lists of elementary predications:

(m1)

(Браун = Browne)

{ h3: proper_q_rel(x5, h4, h6), h7: named_rel(x5, "Browne") }

is mapped to

{ h7: named_rel(x6, "Браун"), h8: exist_q_rel(x6, h9, h10) }

(m2)

(на = 's)

{ h12: poss_rel(e13, x10, x5) }

is mapped to

{ h3: на_p_1_rel(e5, x4, x6) }

(m3)

(на Браун = Browne 's)

{ h3: proper_q_rel(x5, h4, h6), h7: named_rel(x5, "Browne"),

h8: def_explicit_q_rel(x10, h9, h11), h12: poss_rel(e13, x10, x5) }

is mapped to

{ h3: на_p_1_rel(e5, x4, x6), h7: named_rel(x6, "Браун"), h8: exist_q_rel(x6, h9, h10) }

(m4)

(Кучето = dog)

{ h12: dog_n_1_rel(x10) }

is mapped to

{ h3: куче_n_1_rel(x4), h11: exist_q_rel(x4, h12, h13) }

(m5)

(лае = barks)

{ h14: bark_v_1_rel(e2, x10) }

is mapped to

{ h1: лая_v_rel(e2, x4) }

As we mentioned above, our goal is to have MRS alignment not just on word level, but also on phrase level in the sentence. Thus, using the correspondences described in the previous section and the syntactic analyses of both sentences we can infer the following mapping:

(m6)

(Кучето на Браун = Browne 's dog)

{ h3: proper_q_rel(x5, h4, h6),

h7: named_rel(x5, "Browne"),

h8: def_explicit_q_rel(x10, h9, h11),

h12: poss_rel(e13, x10, x5),

h12: dog_n_1_rel(x10) }

is mapped to

{ h3: на_p_1_rel(e5, x4, x6),

h7: named_rel(x6, "Браун"),

h8: exist_q_rel(x6, h9, h10),

h3: куче_n_1_rel(x4),

h11: exist_q_rel(x4, h12, h13) }
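The inference of the phrase level mapping can be sketched as a union of the mappings of the aligned components. In the fragment below the function combine and the encoding of EPs as plain strings are assumptions of the illustration; the data are the mappings (m3) and (m4) given above.

```python
def combine(*mappings):
    """Unite the English and the Bulgarian EP sets of already aligned components."""
    en, bg = set(), set()
    for en_eps, bg_eps in mappings:
        en |= en_eps
        bg |= bg_eps
    return en, bg

# (m3) на Браун = Browne 's
m3 = (
    {
        'h3: proper_q_rel(x5, h4, h6)',
        'h7: named_rel(x5, "Browne")',
        'h8: def_explicit_q_rel(x10, h9, h11)',
        'h12: poss_rel(e13, x10, x5)',
    },
    {
        'h3: на_p_1_rel(e5, x4, x6)',
        'h7: named_rel(x6, "Браун")',
        'h8: exist_q_rel(x6, h9, h10)',
    },
)

# (m4) Кучето = dog
m4 = (
    {'h12: dog_n_1_rel(x10)'},
    {'h3: куче_n_1_rel(x4)', 'h11: exist_q_rel(x4, h12, h13)'},
)

# (m6) Кучето на Браун = Browne 's dog
m6 = combine(m3, m4)
print(len(m6[0]), len(m6[1]))
```

The two sides are kept in separate sets because handle names such as h3 are local to each MRS and would otherwise clash.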

This automatic procedure for inferring semantic correspondences on the phrasal level provides flexibility for different strategies for such alignments. For example, such correspondences might be equipped with similarity scores on the basis of word alignment types involved in the corresponding phrase, as well as the type of the phrase itself. For instance, if the word alignment of two corresponding phrases involves only sure links, then the MRS level alignment for these phrases is also assumed to be sure. Respectively, if on the word level there are unsure links, then the MRS level alignment could be assumed to be unsure. This idea could be developed further depending on the application. Also, in some cases the MRS level alignment could be assumed to be sure, although it includes some unsure links on the word level. For example, in the case of analytical verb forms many elements will be aligned only by possible links, but the whole forms are linked as a sure correspondence. We believe that such pairs of sentences with appropriate syntactic and semantic analyses and word alignment are a valuable source for construction of alignments on the semantic level.
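One possible strategy of this kind can be written down in a few lines. The function name, the flag analytic_verb_form and the encoding of link types as the strings "S" and "P" are hypothetical; the sketch only restates the policy described above and is not a description of an implemented scoring scheme.

```python
def phrase_link_type(word_links, analytic_verb_form=False):
    """Derive the type of an MRS level link from the word level links of a phrase.

    word_links is a list of "S" (sure) and "P" (possible) links.
    """
    if analytic_verb_form:
        # Whole analytical verb forms count as a sure correspondence,
        # even if their parts are only P linked.
        return "S"
    return "S" if all(link == "S" for link in word_links) else "P"

print(phrase_link_type(["S", "S"]))
print(phrase_link_type(["S", "P"]))
print(phrase_link_type(["S", "P"], analytic_verb_form=True))
```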

After processing the SETIMES dataset on the level of RMRS analysis, we expect the statistical machine translation methods to learn phrase correspondences similar to those described above. Some experiments in this direction are presented in the rest of the paper.

6. Bulgarian-English Machine Translation - Experiments

In this section we present a model for using the result from the (R)MRS analysis of Bulgarian text in a statistical machine translation system.

6.1 Factor-based SMT Model

Our approach is built on top of the factor-based SMT model proposed by (Birch et al. 2007) as an extension of the traditional phrase-based SMT framework. Instead of using only the word form of the text, it allows the system to take a vector of factors to represent each token, both for the source and target languages. The vector of factors can be used for different levels of linguistic annotation, such as lemma, part-of-speech (POS), or other linguistic features. Furthermore, this extension allows us to incorporate various kinds of features, provided they can be represented as annotations on the tokens.

The process is quite similar to supertagging (Bangalore and Joshi 1999), which assigns ‘rich descriptions (supertags) that impose complex constraints in a local context’. In our case, all the linguistic features (factors) associated with each token form a supertag for that token. (Singh and Bandyopadhyay 2010) had a similar idea of incorporating linguistic features, although they worked on Manipuri-English bidirectional translation. Our approach is slightly different from (Birch et al. 2007) and (Hassan et al. 2007), who mainly used the supertags on the target language side, English. We will experiment with both sides, but primarily with the source language side, Bulgarian. This potentially huge feature space provides us with various possibilities of using our linguistic resources developed in and out of this project.

In particular, we consider the following factors on the source language side (Bulgarian):

WF - word form is just the original text token.
Lemma is the lexical invariant of the original word form. We use a lemmatizer, which operates on the output from the POS tagging. Thus, the 3rd person, plural, imperfect tense verb form вървяха (['walking-were'], They were walking) is lemmatized as the 1st person, present tense verb вървя.
POS - part-of-speech of the word. We use the positional POS tagset of the BulTreeBank, where the first letter of the tag indicates the POS itself, while the next letters refer to semantic and/or morphosyntactic features, such as: Dm - where D stands for adverb and m stands for modal; Ncmsi - where N stands for noun, c means common, m is masculine, s is singular, and i is indefinite.
Ling - other linguistic features derived from the POS tag in the BulTreeBank tagset.
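To illustrate how a positional tag can be unpacked, here is a small sketch that decodes only the two example tags given above. The attribute names (type, gender, number, definiteness) and the table layout are assumptions made for this example; the full BulTreeBank tagset contains many more codes.

```python
# Coarse POS letters and positional feature values. Only the codes named in
# the text are listed; the attribute names are assumptions of this sketch.
POS_LETTERS = {"N": "noun", "D": "adverb"}
POSITIONS = {
    "N": [
        ("type", {"c": "common"}),
        ("gender", {"m": "masculine"}),
        ("number", {"s": "singular"}),
        ("definiteness", {"i": "indefinite"}),
    ],
    "D": [("type", {"m": "modal"})],
}


def decode_tag(tag):
    """Expand a positional tag into a feature dictionary."""
    features = {"pos": POS_LETTERS.get(tag[0], tag[0])}
    for (name, values), code in zip(POSITIONS.get(tag[0], []), tag[1:]):
        features[name] = values.get(code, code)  # keep unknown codes as-is
    return features


print(decode_tag("Ncmsi"))
print(decode_tag("Dm"))
```

Note that the same letter (here m) means different things in different positions and for different parts of speech, which is why the lookup is keyed by the first letter of the tag.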

In addition to these, we can also incorporate syntactic structure of the sentence by breaking down the tree into dependency relations. For instance, a dependency tree can be represented as a set of triples in the form of <parent, relation, child>. <loves, subject, John> and <loves, object, Mary> will represent the sentence ‘John loves Mary’. Consequently, three additional factors are included for both languages:

DepRel - is the dependency relation between the current word and the parent node.
HLemma is the lemma of the current word's parent node.
HPOS is the POS tag of the current word's parent node.

Let us consider the following sentence: Високият мъж чете интересна книга ([Tall-the man reads interesting book], The tall man is reading an interesting book). In the phrase високият мъж, the HLemma of the adjective високият is the noun мъж. Its HPOS is Noun, and its DepRel is modification (mod). However, the HLemma of the noun мъж is the verb чете. Hence, its HPOS is Verb, and the DepRel is subject (subj).
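The derivation of the three head-related factors from a dependency tree can be sketched as below. This is only an illustration under stated assumptions: the Token class, the coarse tags A, N and V, the relation labels root and obj, and the convention of marking the root's head factors with "-" are choices made for this example, not the project's actual data format.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    lemma: str
    pos: str
    head: int    # 1-based index of the parent token, 0 for the root
    deprel: str  # dependency relation to the parent


def head_factors(sentence):
    """Return (form, DepRel, HLemma, HPOS) for every token."""
    rows = []
    for tok in sentence:
        if tok.head == 0:
            rows.append((tok.form, tok.deprel, "-", "-"))
        else:
            parent = sentence[tok.head - 1]
            rows.append((tok.form, tok.deprel, parent.lemma, parent.pos))
    return rows


sentence = [
    Token("Високият", "висок", "A", 2, "mod"),
    Token("мъж", "мъж", "N", 3, "subj"),
    Token("чете", "чета", "V", 0, "root"),
    Token("интересна", "интересен", "A", 5, "mod"),
    Token("книга", "книга", "N", 3, "obj"),
]

for row in head_factors(sentence):
    print("\t".join(row))
```

Since HLemma is taken from the lemma of the parent, the head of мъж appears here as чета, the lemma of the verb form чете.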

Let us return to the example sentence from above. In the construction of the new factor representation of the corpus we do not use the POS and DepHead columns. Additions are given in columns: HLemma and HPOS. The new columns are presented in Table 3.

WF	Lemma	POSex	Ling	HLemma	HPOS
според	според	R	-	злоупотребявам	Vp
одита	одит	Nc	npd	според	R
в	в	R	-	одит	Nc
електрическите	електрически	A	pd	компания	Nc
компании	компания	Nc	fpi	в	R
политиците	политик	Nc	mpd	злоупотребявам	Vp
злоупотребяват	злоупотребявам	Vp	tir3p	-	-
с	с	R	-	злоупотребявам	Vp
държавните	държавен	A	pd	предприятие	Nc
предприятия	предприятие	Nc	npi	с	R

Table 3. The sentence analysis with added head information - HLemma and HPOS (some columns from Table 1 are deleted here)

The information from the table format is represented with respect to the requirements of the Moses system. We extended the grammatical features to have the same size. All the information is concatenated to the wordforms in the text.
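As an illustration of this encoding, the sketch below joins the factors of three rows of Table 3 with the vertical bar that Moses uses as its factor separator. The padding of the Ling factor to a fixed width of five characters with "-" is an assumption about what 'the same size' means here, and the helper names are hypothetical.

```python
def pad(feature, width=5, fill="-"):
    """Pad a grammatical feature string to a fixed width (assumed scheme)."""
    value = "" if feature == "-" else feature
    return value.ljust(width, fill)


def to_factored(rows):
    """Encode (WF, Lemma, POSex, Ling, HLemma, HPOS) rows as one line."""
    tokens = []
    for wf, lemma, pos, ling, hlemma, hpos in rows:
        tokens.append("|".join([wf, lemma, pos, pad(ling), hlemma, hpos]))
    return " ".join(tokens)


rows = [
    ("политиците", "политик", "Nc", "mpd", "злоупотребявам", "Vp"),
    ("злоупотребяват", "злоупотребявам", "Vp", "tir3p", "-", "-"),
    ("с", "с", "R", "-", "злоупотребявам", "Vp"),
]

print(to_factored(rows))
```

Each whitespace-separated token then carries all six factors, and the training configuration selects which of them are actually used.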

6.2 MRS Supertagging

We first perform a fuzzy match between the surface tokens and the MRS elementary predicates (EPs) and then extract the following features as extra factors:

EP - the name of the elementary predicate, which usually indicates an event or an entity semantically.
EoV indicates whether the current EP is an event or a variable.
ARGnEP indicates the elementary predicate of the argument which belongs to the predicate. n is usually from 1 to 3.
ARGnPOS indicates the POS tag of the argument which belongs to the predicate.

Note that we do not take all the information provided by the MRS; for example, we discard the scopal information. This kind of information is not straightforward to represent in such ‘tagging’-style models, and we leave it for future work.

The RMRS analysis of the example sentence is presented here, followed by its encoding in the corpus.

< a7:e1, { a1:според_r(e6), a2:одит_n(x4), a3:в_r(e5), a4:електрически_a(e4),

a5:компания_n(x3), a6:политик_n(x1), a7:злоупотребявам_v(e1),

a8:с_r(e2), a9:държавен_a(e3), a10:предприятие_n(x2) },

{ a1:ARG1(e1), a1:ARG2(x4), a3:ARG1(x4), a3:ARG2(x3), a4:ARG1(x3),

a7:ARG1(x1), a7:ARG3(e2), a8:ARG1(e1), a8:ARG2(x2), a9:ARG1(x2) },

{} >

This information is represented for each wordform in Table 4. Most of the columns from the previous tables are missing here due to lack of space. The information for the argument elementary predicate and its part of speech are given in the same column. There are several special cases. First, some wordforms might introduce more than one elementary predicate. In such cases, we represent the same wordform more than once in the corpus (as well as in the table) - a new copy for each additional elementary predicate. Second, in some cases like clitic doubling there is more than one candidate for the POS of the corresponding argument. In this case we use the POS of the full-fledged direct or indirect object. Third, in some cases (third person, singular) the subject argument is constructed on the basis of the grammatical features of the verb (because of pro-drop in Bulgarian). In such cases there is more than one possible pronoun for the corresponding subject. Then we leave the predicate of ARG1 underspecified.

No	EP	EoV	EP_1/POS_1	EP_2/POS_2	EP_3/POS_3
1	според_r	e	злоупотребявам_v/Vp	одит_n/Nc	-
2	одит_n	v	-	-	-
3	в_r	e	одит_n/Nc	компания_n/Nc	-
4	електрически_a	e	компания_n/Nc	-	-
5	компания_n	v	-	-	-
6	политик_n	v	-	-	-
7	злоупотребявам_v	e	политик_n/Nc	-	с_r/R
8	с_r	e	злоупотребявам_v/Vp	предприятие_n/Nc	-
9	държавен_a	e	предприятие_n/Nc	-	-
10	предприятие_n	v	-	-	-

Table 4. Representation of MRS factors for each wordform in the sentence.
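The way the rows of Table 4 follow from the RMRS analysis can be sketched as follows. The data structures and the function name are assumptions made for this example; the predicates follow the spelling of Table 4 and the POS values are the POSex values of Table 3. Each argument variable is resolved to the elementary predicate that introduces it.

```python
# Elementary predicates: anchor -> (predicate, characteristic variable).
EPS = {
    "a1": ("според_r", "e6"),
    "a2": ("одит_n", "x4"),
    "a3": ("в_r", "e5"),
    "a4": ("електрически_a", "e4"),
    "a5": ("компания_n", "x3"),
    "a6": ("политик_n", "x1"),
    "a7": ("злоупотребявам_v", "e1"),
    "a8": ("с_r", "e2"),
    "a9": ("държавен_a", "e3"),
    "a10": ("предприятие_n", "x2"),
}
# Argument relations: (anchor, argument number, argument variable).
ARGS = [
    ("a1", 1, "e1"), ("a1", 2, "x4"), ("a3", 1, "x4"), ("a3", 2, "x3"),
    ("a4", 1, "x3"), ("a7", 1, "x1"), ("a7", 3, "e2"), ("a8", 1, "e1"),
    ("a8", 2, "x2"), ("a9", 1, "x2"),
]
EP_POS = {
    "според_r": "R", "одит_n": "Nc", "в_r": "R", "електрически_a": "A",
    "компания_n": "Nc", "политик_n": "Nc", "злоупотребявам_v": "Vp",
    "с_r": "R", "държавен_a": "A", "предприятие_n": "Nc",
}


def mrs_factor_rows(eps, args, ep_pos):
    """Return (EP, EoV, EP_1/POS_1, EP_2/POS_2, EP_3/POS_3) per predicate."""
    var_to_pred = {var: pred for pred, var in eps.values()}
    rows = []
    for anchor, (pred, var) in eps.items():
        slots = ["-", "-", "-"]
        for arg_anchor, n, target in args:
            if arg_anchor == anchor and target in var_to_pred:
                arg_pred = var_to_pred[target]
                slots[n - 1] = f"{arg_pred}/{ep_pos[arg_pred]}"
        eov = "e" if var.startswith("e") else "v"  # event vs. variable
        rows.append((pred, eov, *slots))
    return rows


for number, row in enumerate(mrs_factor_rows(EPS, ARGS, EP_POS), start=1):
    print(number, *row, sep="\t")
```

The special cases discussed above (several predicates per word form, clitic doubling, pro-drop subjects) are not handled in this sketch.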

All these factors encoded within the corpus provide us with a rich selection of factors for different experiments. The model of encoding MRS information in the corpus as additional features does not depend on the actual semantic analysis - MRS or RMRS, because both of them provide enough semantic information. This ensures the adequacy of both analysis types.

6.3 Experiments with the Bulgarian raw corpus

To run the experiments, we use the phrase-based translation model provided by the open-source statistical machine translation system Moses (Birch et al. 2007). For training the translation model, the parallel corpora are preprocessed with the tokenizer and lowercase converter provided by Moses. The rest of the procedure is quite standard:

We run GIZA++ (Och and Ney 2003) for bi-directional word alignment, and then obtain the lexical translation table and phrase table.
A tri-gram language model is estimated using the SRILM toolkit (Stolcke 2002).
Minimum error rate training (MERT) (Och and Ney 2003) is applied to tune the feature weights so as to maximize the official f-score evaluation metric on the development set.

For the rest of the parameters we use the default settings provided by Moses.

We split the corpora into the training set, the development set and the test set. For SETIMES, the split is 100,000/500/1,000 and for EMEA, it is 700,000/500/1,000. For reference, we also run tests on the JRC-Acquis corpus ⁹. The final results under the standard evaluation metrics are shown in the following table in terms of BLEU (Papineni et al. 2002):

Corpora	Test	Dev	Final	Drop
SETIMES → SETIMES	34.69	37.82	36.49	/
EMEA → EMEA	51.75	54.77	51.62	/
SETIMES → EMEA	13.37	/	/	61.5 %
SETIMES → JRC-Acquis	7.19	/	/	79.3 %
EMEA → SETIMES	7.37	/	/	85.8 %
EMEA → JRC-Acquis	9.21	/	/	82.2 %

Table 5. Results of the baseline SMT system (Bulgarian-English)

As we mentioned before, the EMEA corpus mainly consists of descriptions of medicine usage, and the format is quite fixed. Therefore, it is not surprising to see high performance on the in-domain test (2nd row in Table 5). SETIMES, consisting of news articles, is a less controlled setting, and its BLEU score is lower. The results on the out-of-domain tests are in general much lower, with a drop of more than 60 % in BLEU score (the last column). For the JRC-Acquis corpus, in contrast to the in-domain score reported by (Callison-Burch et al. 2011) (61.3), the low out-of-domain results show a situation very similar to that of EMEA. A brief manual check of the results indicates that the out-of-domain tests suffer severely from missing lexical items, while the in-domain test for the news articles contains more interesting issues to look into. The better translation quality also makes the system outputs human readable.
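The values in the Drop column can be reproduced as a relative decrease with respect to the in-domain Test score of the corpus the system was trained on. The short computation below is our reading of the table rather than a formula stated explicitly in the experiments; the function name is hypothetical.

```python
def relative_drop(in_domain, out_of_domain):
    """Relative BLEU decrease in percent."""
    return 100.0 * (in_domain - out_of_domain) / in_domain


# In-domain Test scores from Table 5.
in_domain = {"SETIMES": 34.69, "EMEA": 51.75}
# Out-of-domain Test scores: (training corpus, test corpus, BLEU).
cross_domain = [
    ("SETIMES", "EMEA", 13.37),
    ("SETIMES", "JRC-Acquis", 7.19),
    ("EMEA", "SETIMES", 7.37),
    ("EMEA", "JRC-Acquis", 9.21),
]

for train, test, bleu in cross_domain:
    drop = relative_drop(in_domain[train], bleu)
    print(f"{train} -> {test}: {drop:.1f} %")
```

Under this reading the four rows come out as 61.5 %, 79.3 %, 85.8 % and 82.2 %, matching the table.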

6.4 Experiments with the Linguistically-Augmented Bulgarian Corpus

Using the factor-based model described in Section 6.1, we also perform experiments to test the effectiveness of different linguistic annotations. The different configurations we considered are shown in the Model column of Table 6¹⁰.

ID	Model	BLEU	1-gram	2-gram	3-gram	4-gram
1	WF	38.61	69.9	44.6	31.5	22.7
2	WF, POS	38.85	69.9	44.8	31.7	23.0
3	WF, Lemma, POS, Ling	38.84	69.9	44.7	31.7	23.0
4	Lemma	37.22	68.8	43.0	30.1	21.5
5	Lemma, POS	37.49	68.9	43.2	30.4	21.8
6	Lemma, POS, Ling	38.70	69.7	44.6	31.6	22.8
7	WF, DepRel	36.87	68.4	42.8	29.9	21.1
8	WF, DepRel, HPOS	36.21	67.6	42.1	29.3	20.7
9	WF, Lemma, POS, Ling, DepRel	36.97	68.2	42.9	30.0	21.3
10	WF, Lemma, POS, Ling, DepRel, HLemma	29.57	60.8	34.9	23.0	15.7
11	WF, POS, EP	38.74	69.8	44.6	31.6	22.9
12	WF, POS, Ling, EP	38.76	69.8	44.6	31.7	22.9
13	WF, EP, EoV	38.74	69.8	44.6	31.6	22.9
14	WF, POS, EP, EoV	38.74	69.8	44.6	31.6	22.9
15	WF, Ling, EP, EoV	38.76	69.8	44.6	31.7	22.9
16	WF, POS, Ling, EP, EoV	38.76	69.8	44.6	31.7	22.9
17	EP, EoV	37.22	68.5	42.9	30.2	21.6
18	EP, EoV, Ling	38.38	69.3	44.2	31.3	22.7

Table 6. Results of the factor-based model (Bulgarian-English, SETIMES 150,000)

These models can be roughly grouped into four categories: word form with linguistic features; lemma with linguistic features; models with dependency features; and models with MRS elementary predicates (EP) and the type of the main argument of the predicate (EoV). The setting of the system is mostly the same as in the previous experiment, except for 1) increasing the training data from 100,000 to 150,000 sentence pairs; 2) specifying the factors during training and decoding; and 3) omitting MERT ¹¹. We run the finer-grained models only on the SETIMES data, as its language is more diverse than that of the other two corpora. The results are shown in Table 6.

The first model serves as the baseline here. We show all the n-gram scores besides the final BLEU, since some of the differences are very small. In terms of the numbers, POS seems to be an effective factor, as Model 2 has the highest score. Model 3 indicates that linguistic features also improve the performance. Models 4-6 show the necessity of including the word form as one of the factors, in terms of BLEU scores. Model 10 shows a significant decrease after incorporating the HLemma feature. This may be due to data sparsity, as we are actually aligning and translating bi-grams instead of tokens. This may also indicate that increasing the number of factors does not guarantee performance enhancement. After replacing HLemma with HPOS, the result is close to the others (Model 8). The experiments with features from the MRS analyses (Models 11-16) show consistent improvements over the baseline, and using only the MRS features (Models 17-18) also delivers decent results. In future experiments we will consider including more features from the MRS analyses.
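The relation between the n-gram columns and the BLEU column of Table 6 can be checked with a short computation. BLEU (Papineni et al. 2002) is the brevity penalty multiplied by the geometric mean of the modified n-gram precisions, so the geometric mean of the four columns is an upper bound that should lie close to the reported score whenever the brevity penalty is near 1. The sketch below assumes uniform weights over 1- to 4-grams and works from the rounded values in the table.

```python
import math


def geometric_mean(precisions):
    """Geometric mean of n-gram precisions given in percent."""
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))


# (1-gram, 2-gram, 3-gram, 4-gram) and reported BLEU from Table 6.
models = {
    "1 WF": ((69.9, 44.6, 31.5, 22.7), 38.61),
    "10 WF, Lemma, POS, Ling, DepRel, HLemma": ((60.8, 34.9, 23.0, 15.7), 29.57),
}

for name, (precisions, bleu) in models.items():
    gm = geometric_mean(precisions)
    print(f"{name}: geometric mean {gm:.2f}, reported BLEU {bleu:.2f}")
```

For both rows the geometric mean is within a few hundredths of the reported BLEU, which suggests that the brevity penalty plays almost no role in these results.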

So far, incorporating additional linguistic knowledge has not shown huge improvement in terms of statistical evaluation metrics. However, this does not mean that the translations delivered are the same. In order to fully evaluate the system, manual analysis is absolutely necessary. We are still far from drawing a conclusion at this point, but the preliminary scores calculated already indicate that the system can deliver decent translation quality consistently.

6.5 Manual Evaluation

We manually validated the output for all the models mentioned in Table 6. The guideline includes two aspects of the quality of the translation: Grammaticality and Content. Grammaticality can be evaluated solely on the system output and Content by comparison with the reference translation. We use a 1-5 score for each aspect as follows:

Grammaticality

1. The translation is not understandable.
2. The evaluator can somehow guess the meaning, but cannot fully understand the whole text.
3. The translation is understandable, but with some effort.
4. The translation is quite fluent with some minor mistakes or re-ordering of the words.
5. The translation is perfectly readable and grammatical.

Content

1. The translation is totally different from the reference.
2. About 20 % of the content is translated, missing the major content/topic.
3. About 50 % of the content is translated, with some missing parts.
4. About 80 % of the content is translated, missing only minor things.
5. All the content is translated.

For missing lexicon entries or untranslated Cyrillic tokens, we ask the evaluators to give a score of 2 if the output translation contains one Cyrillic token and a score of 1 if it contains more than one.

The results are shown in the following two tables, Table 7 and Table 8, respectively. The current results from the manual validation are on the basis of 150 sentence pairs. The numbers shown in the tables are the number of sentences given the corresponding scores. The 'Total' column sums up the scores of all the output sentences by each model.

ID	Model	1	2	3	4	5	Total
1	WF	20	47	5	32	46	487
2	WF, POS	20	48	5	37	40	479
3	WF, Lemma, POS, Ling	20	47	6	34	43	483
4	Lemma	15	34	11	46	44	520
5	Lemma, POS	15	38	12	51	34	501
6	Lemma, POS, Ling	20	48	5	34	43	482
7	WF, DepRel	32	48	3	29	38	443
8	WF, DepRel, HPOS	45	41	7	23	34	410
9	WF, Lemma, POS, Ling, DepRel	34	47	5	30	34	433
10	WF, Lemma, POS, Ling, DepRel, HLemma	101	32	0	8	9	242
11	WF, POS, EP	19	49	4	34	44	485
12	WF, POS, Ling, EP	19	49	3	39	40	482
13	WF, EP, EoV	20	49	2	41	38	478
14	WF, POS, EP, EoV	19	50	3	31	47	487
15	WF, Ling, EP, EoV	19	48	5	37	41	483
16	WF, POS, Ling, EP, EoV	19	49	5	37	40	480
17	EP, EoV	15	41	10	44	40	503
18	EP, EoV, Ling	20	49	7	38	36	471
19	Google	0	2	20	52	76	652
20	Reference	0	0	5	51	94	689

Table 7. Manual evaluation of the grammaticality

ID	Model	1	2	3	4	5	Total
1	WF	20	46	5	23	56	499
2	WF, POS	20	48	5	24	53	492
3	WF, Lemma, POS, Ling	20	47	1	24	58	503
4	Lemma	15	32	5	33	65	551
5	Lemma, POS	15	35	9	32	59	535
6	Lemma, POS, Ling	20	48	5	22	55	494
7	WF, DepRel	32	49	4	14	51	453
8	WF, DepRel, HPOS	45	41	2	21	41	422
9	WF, Lemma, POS, Ling, DepRel	34	48	3	20	45	444
10	WF, Lemma, POS, Ling, DepRel, HLemma	101	32	0	6	11	244
11	WF, POS, EP	19	49	3	20	59	501
12	WF, POS, Ling, EP	19	50	2	20	59	500
13	WF, EP, EoV	19	50	4	16	61	500
14	WF, POS, EP, EoV	19	50	2	23	56	497
15	WF, Ling, EP, EoV	19	48	4	18	61	504
16	WF, POS, Ling, EP, EoV	19	50	3	24	54	494
17	EP, EoV	14	38	7	31	60	535
18	EP, EoV, Ling	19	49	7	20	55	493
19	Google	1	0	9	42	98	686
20	Reference	1	0	5	37	107	699

Table 8. Manual evaluation of the content
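The 'Total' column of Tables 7 and 8 is the sum of the scores over all evaluated sentences, i.e. each score from 1 to 5 weighted by the number of sentences that received it. A small check, with a hypothetical function name, is given below for three rows of the tables.

```python
def total_score(counts):
    """Sum of scores, given the number of sentences per score 1..5."""
    return sum(score * n for score, n in enumerate(counts, start=1))


wf_grammaticality = [20, 47, 5, 32, 46]     # Table 7, Model 1
wf_content = [20, 46, 5, 23, 56]            # Table 8, Model 1
lemma_grammaticality = [15, 34, 11, 46, 44]  # Table 7, Model 4

print(sum(wf_grammaticality))             # 150 sentences evaluated
print(total_score(wf_grammaticality))     # 487
print(total_score(wf_content))            # 499
print(total_score(lemma_grammaticality))  # 520
```

The maximum possible total for 150 sentences is 750, which gives a scale for reading the differences between the models.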

The results show that linguistic and semantic analyses can improve the quality of the translation. Exploiting the linguistic processing on word level - Lemma, POS and Ling - produces the best results (Model 4, with Lemma only, reaches totals of 520 for grammaticality and 551 for content, against 487 and 499 for the WF baseline). However, the model with only EP and EoV features also delivers very good results (503 and 535), which indicates the effectiveness of the MRS features from the deep hand-crafted grammars. Including more factors (especially the information from the dependency parsing) lowers the results, presumably because of the sparseness effect over the dataset, which is consistent with the automatic BLEU evaluation. The last two rows are shown for reference. 'Google' shows the results of using the online translation service provided by http://translate.google.com/. The high score (very close to the reference translation) may be because our test data are not excluded from their training data. In the future we plan to do the same evaluation with a larger dataset.

The problem with the untranslated Cyrillic tokens could, in our view, be solved in most cases by providing additional lexical information from a Bulgarian-English lexicon. Thus, we also evaluated the possible impact of such a lexicon, had it been available. To do this, we substituted each copied Cyrillic token with its translation when there was only one possible translation. We made such substitutions in 189 sentence pairs. Then we evaluated the result by classifying the translations as acceptable or unacceptable. The number of acceptable translations in this case is 140.
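The substitution step can be sketched as follows. This is an illustration only: the lexicon entries and the example output are invented, the lexicon is keyed by lowercased word form (a real lexicon would more likely be keyed by lemma), and the Cyrillic test is a simple Unicode range check.

```python
import re

# Tokens consisting only of characters from the basic Cyrillic block.
CYRILLIC = re.compile(r"^[\u0400-\u04FF]+$")


def substitute_untranslated(tokens, lexicon):
    """Replace copied Cyrillic tokens that have exactly one translation."""
    output = []
    for token in tokens:
        translations = lexicon.get(token.lower(), [])
        if CYRILLIC.match(token) and len(translations) == 1:
            output.append(translations[0])
        else:
            output.append(token)  # ambiguous or unknown: leave untouched
    return output


# Hypothetical lexicon and system output.
lexicon = {"одит": ["audit"], "компания": ["company", "firm"]}
tokens = "according to the одит the компания failed".split()

print(" ".join(substitute_untranslated(tokens, lexicon)))
# according to the audit the компания failed
```

With 140 of the 189 substituted sentence pairs judged acceptable, the acceptance rate in this experiment is about 74 %.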

The manual evaluation of the translation models on a bigger scale is in progress. The current results are promising. Statistical evaluation metrics can give us a brief overview of the system performance, but the actual translation quality is much more interesting to us, as in many cases, the different surface translations can convey exactly the same meaning in the context.

6.6 Related Work

Our work is also inspired by another line of research, transfer-based MT models, which are seemingly different but actually very close. In this section, before we mention some previous work in this research direction, we first introduce the background of the development of the deep HPSG grammars.

The MRSes are usually delivered together with the HPSG analyses of the text. There already exist quite extensive implemented formal HPSG grammars for English (Copestake and Flickinger 2000), German (Müller and Kasper 2000), and Japanese (Siegel 2000; Siegel and Bender 2002). HPSG is the underlying theory of the international initiative LinGO Grammar Matrix (Bender et al. 2002). At the moment, precise and linguistically motivated grammars, customized on the base of the Grammar Matrix, have been or are being developed for Norwegian, French, Korean, Italian, Modern Greek, Spanish, Portuguese, Chinese, etc. There also exists a first version of the Bulgarian Resource Grammar - BURGER. In the research reported here, we use the linguistically modeled knowledge from the existing English and Bulgarian grammars. Since the Bulgarian grammar has limited coverage on news data, dependency parsing has been performed instead. Then, mapping rules have been defined for the construction of RMRSes.

However, the MRS representation is still quite close to the syntactic level, which is not fully language independent. This requires a transfer at the MRS level if we want to translate from the source language to the target language. The transfer is usually implemented in the form of rewriting rules. For instance, in the Norwegian LOGON project (Oepen et al. 2004), the transfer rules were hand-written (Bond et al. 2005; Oepen et al. 2007), which involved a large amount of manual work. Graham (2008; 2009) explored the automatic rule induction approach in a transfer-based MT setting involving two lexical functional grammars (LFGs), which was still restricted by the performance of both the parser and the generator. Lack of robustness in target side generation is one of the main issues when various ill-formed or fragmented structures come out after transfer. Oepen et al. (2007) use their generator to generate text fragments instead of full sentences in order to increase the robustness. We want to make use of the grammar resources while keeping the robustness; therefore, we experiment with another way of transfer involving information derived from the grammars.

In our approach, we take an SMT system as our 'backbone', which robustly delivers some translation for any given input. Then, we augment SMT with deep linguistic knowledge. In general, what we are doing is still along the lines of previous work utilizing deep grammars, but we build a more 'lightweight' transfer model.

7. Conclusion and Future Work

In this paper, we presented the preparation phase of Bulgarian language resources and tools to support Bulgarian-English hybrid machine translation. The main result is a manually aligned Bulgarian-English corpus, part of which is also augmented with linguistic information on syntactic and semantic level turning it into a parallel treebank. All these resources are under further development in the direction of coverage of more parallel data and in the direction of enriching the resources with syntactic and semantic frames of verbs, semantic annotation of the text, bilingual lexicons.

We also reported our work on building a linguistically-augmented statistical machine translation model from Bulgarian to English. Based on our observations over the previous approaches on transfer-based MT models, we decided to build a hybrid system by combining an SMT system with deep linguistic resources. We performed a preliminary evaluation on several configurations of the system (with different combinations of linguistic knowledge). The high BLEU score shows the high quality of the translation delivered by the SMT baseline; and manual analysis confirms the consistency of the system.

The detailed manual evaluation of the translation models is in progress, which is the focus of the rest of the project lifetime. The current results are promising. Statistical evaluation metrics can give us a brief overview of the system performance, but the actual translation quality is much more interesting to us, as in many cases, the different surface translations can convey exactly the same meaning in the context.

There are various aspects in which we can improve the current approach: 1) The MRSes are not fully explored yet, since we have only considered the EP and EoV features. 2) We would like to add factors on the target language side (English) as well. 3) The guidelines of the manual evaluation need further refinement to account for missing lexical items as well as for how much of the content is truly conveyed. 4) We also need more experiments to evaluate the robustness of our approach in terms of out-of-domain tests.

8. Acknowledgements

This work has been supported by the European project EuroMatrixPlus (IST-231720).

Notes

1. http://www.delph-in.net/

2. BURGER is constructed along the lines of English Resource Grammar – ERG (Copestake and Flickinger 2000).

3. OPUS – an open source parallel corpus, http://opus.lingfil.uu.se/

4. The pipeline is developed on the basis of the language resources, created within BulTreeBank project. The prefix BTB stands for BulTreeBank.

5. http://www.bultreebank.org/btbmorf/

6. A web-based tool, available at http://www.bultreebank.bas.bg/aligner/index.php.

7. http://www.umiacs.umd.edu/~hal/HandAlign/index.html

8. Not surprisingly they are similar to the translation divergences in JRC-Acquis, described in (Vertan and Gavrila 2011).

9. http://optima.jrc.it/Acquis/

10. As one can see not all combinations of the available linguistic knowledge are explored. Some of them are planned for future work.

11. This is mainly due to the large amount of computation required. We will perform MERT on the better-performing configurations in the future.

References

Bangalore, Srinivas and Aravind Joshi. 1999. Supertagging: an approach to almost parsing. In: Computational Linguistics 25(2), June. 237–265.
Bender, E., D. Flickinger and S. Oepen. 2002. The Grammar Matrix: An Open-Source Starter-Kit for the Rapid Development of Cross-Linguistically Consistent Broad-Coverage Precision Grammars. In: John Carroll, Nelleke Oostdijk, and Richard Sutcliffe (eds.), Proceedings of the Workshop on Grammar Engineering and Evaluation at the 19th International Conference on Computational Linguistics. Taipei, Taiwan. 8-14.
Bender, Emily, Scott Drellishak, Antske Fokkens, Laurie Poulson and Safiyyah Saleem. 2010. Grammar Customization. Research on Language and Computation 8(1). 23–72.
Birch, Alexandra, Miles Osborne and Philipp Koehn. 2007. CCG supertags in factored statistical machine translation. In: Chris Callison-Burch, Philipp Koehn, Christof Monz and Cameron Shaw Fordyce (eds.), Proceedings of the Second Workshop on Statistical Machine Translation. Prague. 9–16.
Bojar, Ondrej and Jan Hajic. 2008. Phrase-based and deep syntactic English-to-Czech statistical machine translation. In: Chris Callison-Burch, Philipp Koehn, Christof Monz, Josh Schroeder and Cameron Shaw Fordyce (eds.), StatMT '08: Proceedings of the Third Workshop on Statistical Machine Translation. 143–146.
Bond, Francis, Stephan Oepen, Melanie Siegel, Ann Copestake and Dan Flickinger. 2005. Open Source Machine Translation with DELPH-IN. In: Proceedings of the Open-Source Machine Translation Workshop at the 10th Machine Translation Summit. 15–22.
Callison-Burch, Chris, Philipp Koehn, Christof Monz, and Omar-F. Zaidan. 2011. Findings of the 2011 workshop on statistical machine translation. In: Proceedings of the 6th Workshop on SMT. 22–64.
Chen, Yu, M. Jellinghaus, A. Eisele, Yi Zhang, S. Hunsicker, S. Theison, Ch. Federmann, and H. Uszkoreit. 2009. Combining multiengine translations with Moses. In: Proceedings of the 4th Workshop on SMT. 43–46.
Copestake, Ann and Dan Flickinger. 2000. Open source grammar development environment and broad-coverage English grammar using HPSG. Gavrilidou M., Crayannis G., Markantonatu S., Piperidis S. and Stainhaouer G. (eds.), Proceedings of the 2nd International Conference on Language Resources and Evaluation. 591–598.
Copestake, Ann. 2003. Robust Minimal Recursion Semantics (working paper). http://www.cl.cam.ac.uk/~aac10/papers
Copestake, Ann, Dan Flickinger, Carl Pollard, and Ivan A. Sag. 2005. Minimal Recursion Semantics: An Introduction. In: Research on Language and Computation, 3(4). 281–332.
Copestake, Ann. 2007. Applying Robust Semantics. In: Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PA-CLING). 1–12, Melbourne, Australia.
Georgiev, Georgi, V. Zhikov, P. Osenova, K. Simov, and P. Nakov. 2012. Feature-rich part-of-speech tagging for morphologically complex languages: Application to Bulgarian. In: Walter Daelemans, Mirella Lapata and Luis Marquez (eds.), Proceedings of EACL 2012. 492-502.
Giménez, J. and L. Márquez. 2004. SVMTool: A general POS tagger generator based on Support Vector Machines. In: Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa, Raquel Silva (eds.), Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC'04). Lisbon, Portugal. 43–46.
Graça, J., J. P. Pardal, L. Coheur, D. Caserio. 2008. Multi-Language Word Alignments Annotation Guidelines (version 0.9). Spoken Language Systems Laboratory (L²F). May 25, 2008. http://www.inesc-id.pt/pt/indicadores/Ficheiros/4734.pdf
Graham, Yvette and Josef van Genabith. 2008. Packed rules for automatic transfer-rule induction. In: Proceedings of the European Association of Machine Translation Conference (EAMT 2008). Hamburg, Germany, September. 57–65.
Graham, Yvette, A. Bryl, and J. van Genabith. 2009. F-structure transfer-based statistical machine translation. In: Miriam Butt and Tracy Holloway King (eds.), Proceedings of the Lexical Functional Grammar Conference. 317–328, Cambridge, UK. CSLI Publications, Stanford University, USA.
Hassan, Hany, Khalil Sima'an, and Andy Way. 2007. Supertagged phrase-based statistical machine translation. In: John Carroll (ed.), Proceedings of ACL, Prague, Czech Republic. 288-295.
Jakob, Max, M. Lopatková, V. Kordoni. 2010. Mapping between Dependency Structures and Compositional Semantic Representations. In: Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner and Daniel Tapias (eds.), Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), ELRA. 2491-2497.
Joachims, T. and B. Schölkopf. 1999. Making Large-Scale SVM Learning Practical. In: Burges, C. and Smola, A. (eds.), Advances in Kernel Methods - Support Vector Learning. Cambridge, MA, USA: MIT Press. 169-184.
Kancheva, Stanislava. 2010. Representation of the Grammatical Roles for Bulgarian in the Dependency Grammar. (unpublished Master Thesis). In Bulgarian.
Koehn, P., A. Birch, and R. Steinberger. 2009. 462 machine translation systems for Europe. In: Laurie Gerber, Pierre Isabelle, Roland Kuhn, Nick Bemish, Mike Dillinger, Marie-Josée Goulet (eds.), Proceedings of MT Summit XII. http://www.mt-archive.info/MTS-2009-Koehn-1.pdf.
Kruijff-Korbayová, Ivana, K. Chvátalová, and O. Postolache. 2006. Annotation Guidelines for Czech-English Word Alignment. In: Nicoletta Calzolari, Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk and Daniel Tapias (eds.), Proceedings of the Fifth Language Resources and Evaluation Conference (LREC).
Lambert, P, A. De Gispert, R. Banchs and J. B. Mariño. 2006. Guidelines for Word Alignment Evaluation and Manual Alignment. Language Resources and Evaluation 39. 267–285.
Macken, L. 2010. Annotation Guidelines for Dutch-English Word Alignment. Version 1.0. Technical report, Language and Translation Technology Team, Faculty of Translation Studies, University College Ghent. http://webs.hogent.be/~lmac139/publicaties/SubsententialAnnotationGuidelines.pdf
Melamed, D. 1998. Annotation Style Guide for the Blinker Project. Version 1.0.4. Philadelphia, University of Pennsylvania. http://repository.upenn.edu/ircs_reports/53/
Merkel, M. 1999. Annotation Style Guide for the PLUG Link Annotator. http://www.ida.liu.se/~magme/publications/pluglinkannot.pdf
Müller, Stefan and Walter Kasper. 2000. HPSG analysis of German. In: Wahlster, Wolfgang (ed.), Verbmobil. Foundations of Speech-to-Speech Translation. 238–253. Springer, Berlin, Germany, Artificial Intelligence edition.
Nivre, Joakim, Johan Hall, Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In: Nicoletta Calzolari, Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk and Daniel Tapias (eds.), Proc. of LREC-2006. 2216-2219.
Och, Franz-Josef and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. In: Computational Linguistics, 29(1). 19–51.
Oepen, Stephan, Helge Dyvik, Jan Tore Lønning, Erik Velldal, Dorothee Beermann, John Carroll, Dan Flickinger, Lars Hellan, Janne Bondi Johannessen, Paul Meurer, Torbjørn Nordgård, and Victoria Rosén. 2004. Som å kappete med trollet? Towards MRS-based Norwegian to English machine translation. In: Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation, Baltimore, MD. 11–20.
Oepen, Stephan, Erik Velldal, Jan Tore Loenning, Paul Meurer, Victoria Rosén and Dan Flickinger. 2007. Towards Hybrid Quality-Oriented Machine Translation - On Linguistics and Probabilities in MT. In: Proc. of the 11th Conference on Theoretical and Methodological Issues in MT. 144–153.
Osenova, Petya. 2010. The Bulgarian Resource Grammar. Saarbrücken: Verlag Dr. Müller. 71 pp.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In: Proceedings of ACL. 311–318.
Popov, Dimitar, Kiril Simov, Svetlomira Vidinska, and Petya Osenova. 2003. Spelling Dictionary of Bulgarian. Sofia: Nauka i izkustvo (In Bulgarian: Попов, Димитър, Кирил Симов, Светломира Видинска и Петя Осенова. 2003. Правописен речник на българския език. София: Наука и изкуство).
Riezler, Stefan and John T. Maxwell III. 2006. Grammatical machine translation. In: Robert C. Moore, Jennifer Chu-Carroll, Jeff Bilmes and Mark Sanderson (eds.), HLT-NAACL. 248–255.
Siegel, Melanie and Emily M. Bender. 2002. Efficient Deep Processing of Japanese. In: Shu-Chuan Tseng, Tsuei-Er Chen and Yi-Fen Liu (eds.), Proceedings of the 19th International Conference on Computational Linguistics, Taipei, Taiwan. 31–38.
Siegel, Melanie. 2000. HPSG analysis of Japanese. In: Wahlster, Wolfgang (ed.), Verbmobil. Foundations of Speech-to-Speech Translation. 265–280. Springer, Berlin, Germany, Artificial Intelligence edition.
Simov, Kiril, Z. Peev, M. Kouylekov, A. Simov, M. Dimitrov, A. Kiryakov. 2001. CLaRK - an XML-based System for Corpora Development. In: Paul Rayson, Andrew Wilson, Tony McEnery, Andrew Hardie and Shereen Khoja (eds.), Proc. of the Corpus Linguistics 2001 Conference. Lancaster, UK. 553–560.
Simov, Kiril, Petya Osenova and Milena Slavcheva. 2004. BTB-TR03: BulTreeBank Morphosyntactic Tagset. BulTreeBank Technical Report № 03 http://www.bultreebank.org/TechRep/BTB-TR03.pdf.
Simov, Kiril and P. Osenova. 2011. Towards Minimal Recursion Semantics over Bulgarian Dependency Parsing. In: Galia Angelova, Kalina Bontcheva, Ruslan Mitkov, Nikolai Nikolov (eds.), Proc. of RANLP’11. 471–478.
Singh, Thoudam Doren and Sivaji Bandyopadhyay. 2010. Manipuri-English bidirectional statistical machine translation. In: Dekai Wu (ed.), Proceedings of the Fourth Workshop on Syntax and Structure in Statistical Translation. Beijing, China. 83–91.
Spreyer, Kathrin and Anette Frank. 2005. Projecting RMRS from TIGER Dependencies. In: Stefan Müller (ed.), Proceedings of the 12th HPSG Conference. Lisbon, Portugal. 354–363.
Stolcke, Andreas. 2002. SRILM - an extensible language modeling toolkit. In: Proceedings of the International Conference on Spoken Language Processing, volume 2. Denver. 901–904.
Thurmair, Gregor. 2005. Hybrid architectures for machine translation systems. In: Language Resources and Evaluation, 39(1). 91–108.
Thurmair, Gregor. 2009. Comparing different architectures of hybrid machine translation systems. In: Laurie Gerber, Pierre Isabelle, Roland Kuhn, Nick Bemish, Mike Dillinger and Marie-Josée Goulet (eds.), Proceedings of MT Summit XII. 340–347.
Tinsley, J., M. Hearne and A. Way. 2009. Exploiting Parallel Treebanks to Improve Phrase-Based Statistical Machine Translation. In: De Smedt, K., Hajič, J. and Kübler, S. (eds.), Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories (2007). 175–187.
Vapnik, V. N. 1999. The nature of statistical learning theory (2nd ed.). New York: Springer.
Veronis, J. 1998. Arcade. Tagging guidelines for word alignment. Version 1.0. http://aune.lpl.univ-aix.fr/projects/arcade/2nd/word/guide/index.html
Vertan, K. and M. Gavrila. 2011. Using Manual and Parallel Aligned Corpora for Machine Translation Services within an On-line Content Management System. In: Kiril Simov, Petya Osenova, Jorg Tiedemann, and Radovan Garabik (eds.), Proceedings of the RANLP-2011 Second Workshop on Annotation and Exploitation of Parallel Corpora, Hissar, Bulgaria, 15 September 2011. 53–58.

About the Authors

Dr. Kiril Simov is an associate professor at the Linguistic Modeling Department, Institute of Information and Communication Technologies, Bulgarian Academy of Sciences. His research interests include Natural Language Processing, Ontologies, Computational Lexicography, and Machine Translation.

Email: kivs at bultreebank dot org

Dr. Petya Osenova is an associate professor at the Department of Bulgarian Language, Sofia University “St. Kl. Ohridski”. Her scientific interests are in the areas of Morphology, Syntax, Formal Grammars, Corpus Linguistics and Natural Language Processing.

Email: petya at bultreebank dot org

Dr. Laska Laskova is an assistant professor at the Department of Bulgarian Language, Sofia University “St. Kl. Ohridski”. Her research interests include Morphology and Syntax, Language Resources and Processing, and the Formalization of Temporal Relations.

Email: laska dot laskova at gmail dot com

Stanislava Kancheva is a PhD student at the Department of Bulgarian Language, Sofia University “St. Kl. Ohridski”. Her scientific interests are in the areas of Morphology and Syntax, Dependency Grammars, and Language Resources.

Email: stanislava_kuncheva at abv dot bg

Aleksandar Savkov is a PhD student at the Department of Informatics, University of Sussex, Great Britain. He holds an MA degree in Computational Linguistics from the University of Tuebingen, Germany. His scientific interests are in the areas of Machine Translation, Natural Language Processing and Text Mining.

Email: A.Savkov at sussex dot ac dot uk

Dr. Rui Wang is a researcher in the Language Technology Lab of the German Research Center for Artificial Intelligence (DFKI GmbH). His main research interests include Textual Inference, Machine Translation, and Parsing.

Email: wang dot rui at dfki dot de

Littera et Lingua