Teaching Computer and Translation at New Bulgarian University
(Department of English Studies, New Bulgarian University)
The paper presents the structure of a successful Computer and Translation course, offered to students at the Department of Foreign Languages and Literatures of New Bulgarian University. The course presents information on the history, sub-areas and recent developments in computational linguistics, focusing on problems of translation. The aim of the course is to familiarise students with tools facilitating the process of translation (corpus-analysis tools, translators’ workbenches, systems for machine translation allowing the creation of customer databases) and to give them hands-on experience of using the software. Emphasis is placed on the combined use of tools, and on developing the ability to make the best use of knowledge and skills in problem solving and in the efficient performance of translation tasks.
Статията представя структурата на един успешен курс, „Компютър и превод”, предлаган на студенти от Департамент Чужди езици и литератури на Нов български университет. Курсът включва информация за историята, разклоненията и новите развития в областта на компютърната лингвистика, като поставя акцент върху проблемите на превода и съществуващите категории системи. Студентите се научават да използват програми за обработка на електронен текст, работни места на преводача и системи за автоматичен превод, позволяващи предварителна намеса на преводача чрез създаване на потребителски бази данни. Сред основните цели на курса е създаването на умения за комбиниране на системи от различен тип и ефективното им използване за целите на превода и при решаване на различни приложни задачи.
1. Aims of the course
“Computer and Translation” (C&T) is a new NBU course, first offered to students four years ago and in constant development since. The main aim of the course is to introduce future translators to basic text processing tools and translation support software, including parallel text aligners, translation workbenches and machine translation programmes. Particular attention is paid to ways of combining ideas, methods and output from text analysis, computer-assisted human translation and machine translation tools to improve the quality and general efficiency of translators’ work. Along with this main aim, the course presents: a/ an overview of developments in the field of computational linguistics; b/ a classification of translation methods and tools; c/ basic concepts and terms from formal theories of language. In the four years of existence of C&T, the number of students taking the course has increased 4.5 times.
2. A Workbench for Text Analysis
The term “Workbench” is a relatively new one for computational linguistics. Workbenches provide software environments facilitating particular applied linguistics tasks.
The Workbench for Text Analysis (The Linguist’s Workbench, henceforth LW), used in the course as support to a general introduction to computer-assisted text analysis, is an updated version of the first Bulgarian software environment for processing electronic texts. Its basic modules were created in 1992, as part of a project funded by the Bulgarian Ministry of Education and Science (Stambolieva 1992). The general design of the environment, the major functions and their ordering are modelled after basic text analysis operations in traditional corpus linguistics. LW includes several modules which can be combined and ordered in different ways. The system is open and allows the addition of modules to support new applications.
LW includes: a/ a set of basic modules for text analysis; b/ a set of supplementary modules for corpus development. The modules covered by C&T are: BUILD (text segmentation and indexing), LEM (lemmatization and POS tagging), CONC (a concordancer with ten context positions) and MIX (a computer-assisted aligner).
LW processes files in plain text format. BUILD segments files into paragraphs, sentences and graphic words; it simultaneously provides each segment with an individual index defining its position in the text.
An immediate output of BUILD is the presentation of quantitative information. This information, provided by the feature STATISTICS, has the following form (for a randomly chosen text A-Gen):
Text length in running words: 7609
Different forms: 1878
Average length of paragraph in sentences: 9
Average length of paragraph in words: 155
Average length of sentence in words: 15
Maximum number of words in sentence: 64
Maximum number of sentences in paragraph: 37
Students are thus given a first impression of the type of text they will be translating (quantitative parameters can vary considerably from register to register and from one author to another) and learn to relate quantitative information to text type/source.
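The segmentation and the quantitative summary described above can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not the LW implementation: the function names `build` and `statistics`, the blank-line paragraph rule and the punctuation-based sentence rule are all illustrative choices.

```python
import re
from collections import defaultdict


def build(text):
    """Segment plain text and index every graphic word.

    Returns a list of (paragraph_no, sentence_no, word_no, word) tuples.
    All three indices are 1-based; sentence_no restarts in each paragraph.
    """
    index = []
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    for p_no, paragraph in enumerate(paragraphs, start=1):
        # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        for s_no, sentence in enumerate(sentences, start=1):
            # A graphic word: letters/digits, optionally joined by - or '.
            words = re.findall(r"\w+(?:[-'’]\w+)*", sentence)
            for w_no, word in enumerate(words, start=1):
                index.append((p_no, s_no, w_no, word))
    return index


def statistics(index):
    """Compute a quantitative summary from a non-empty index.

    Sentences that contain no graphic words are not counted.
    """
    words_per_sentence = defaultdict(int)       # (paragraph, sentence) -> words
    sentences_per_paragraph = defaultdict(set)  # paragraph -> sentence numbers
    words_per_paragraph = defaultdict(int)      # paragraph -> words
    for p_no, s_no, _, _ in index:
        words_per_sentence[(p_no, s_no)] += 1
        sentences_per_paragraph[p_no].add(s_no)
        words_per_paragraph[p_no] += 1
    n_paragraphs = len(words_per_paragraph)
    n_sentences = len(words_per_sentence)
    return {
        "running_words": len(index),
        "different_forms": len({w.lower() for _, _, _, w in index}),
        "avg_paragraph_sentences": n_sentences / n_paragraphs,
        "avg_paragraph_words": len(index) / n_paragraphs,
        "avg_sentence_words": len(index) / n_sentences,
        "max_sentence_words": max(words_per_sentence.values()),
        "max_paragraph_sentences": max(
            len(s) for s in sentences_per_paragraph.values()
        ),
    }
```

Calling `statistics(build(text))` on a plain-text file read into `text` yields the same seven parameters as the listing above, here as unrounded averages. Real texts need a more careful sentence splitter (abbreviations, direct speech), which is one reason the sketch should be read as an illustration only.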
Another direct output of BUILD made available by STATISTICS is a list of graphic words, which can be arranged alphabetically and/or by order of frequency. Students go through the list to identify 1/ high frequency lexis, esp. forms belonging to lexical category words, and 2/ unknown words. They save their word lists and edit them to obtain two text files, for each of the above groups. Unknown words are immediately looked up in online dictionaries and translation equivalents are added to the unknown words file. The list of high frequency lexis is concordanced at a later stage.
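A word list of this kind is easy to reproduce outside LW. The sketch below assumes a simple tokenisation on word characters and lower-casing of all forms; the function name `word_list` and the order labels `abc` and `fre` are hypothetical.

```python
import re
from collections import Counter


def word_list(text, order="fre"):
    """Return (form, count) pairs for all graphic words in the text.

    order="abc" sorts alphabetically; order="fre" sorts by descending
    frequency, with ties broken alphabetically.
    """
    counts = Counter(re.findall(r"\w+", text.lower()))
    if order == "abc":
        return sorted(counts.items())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

The frequency-ordered list is the natural starting point for picking out high-frequency lexis, while the alphabetical list is more convenient when scanning for unknown words.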
Students are given a number of text types to process with BUILD and are asked to analyse the output. Text types include students’ essays and examples of the belles-lettres style (by different authors), the publicistic style, newspaper style, scientific prose style and the style of official documents. Stylistic variation is studied in terms of its quantitative parameters. To take two examples:
1. A first assessment of students’ essays is made, based on the following criteria:
- number of running words (deviations from assigned length being penalised);
- number of paragraphs (lack of paragraph structure, excessive number of paragraphs, unbalanced structure being penalised);
- length of paragraphs in words (shorter first and last paragraph expected);
- average length of sentences in words;
- number of different forms used;
- presence of expected lexis;
- presence of expected linking words, etc.
2. Students study proceedings of the Bulgarian National Assembly, placing the information obtained against the background of average values for the literary language. This text type can easily be recognized on the basis of:
- length of text in running words (one session contains, on average, 20 000 running words);
- different word forms – appr. 3500 – 4000, i.e. between 850 and 1200 lexemes;
- deviations in word rank: the Bulgarian preposition за, for instance, has a much higher frequency here. Note is made of such deviations and the word forms are looked up at the concordancing stage of analysis; other deviations are high frequency ranks for the nominal forms: господин (Mister) / госпожо (Mrs - Vocative), председател (chairman), представители (representatives), събрание (assembly), комисията (the committee), думата (the word / the floor), съвет (council), закон (law), гласуване (vote), група (group), колеги (colleagues), комисия (committee), докладчик (speaker), предложения (proposals), решение (decision). In no other text type does the 1st person singular personal pronoun have such a high rank, more than twice as high as the average language values; the same is true of the negative particle не, the verb forms благодаря (thank), мисля (think) and моля (ask), the adjective forms уважаеми (respected) and народен (people’s).
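Deviations of this kind can be flagged automatically by comparing relative frequencies in the text with reference values for the language as a whole. The sketch below is one possible way of doing this, not the procedure used in LW; the function name `overrepresented`, the factor of 2 and the reference values used in any call are assumptions to be replaced by real corpus data.

```python
from collections import Counter


def overrepresented(tokens, reference_freq, factor=2.0):
    """List forms whose relative frequency exceeds the reference value.

    tokens: the graphic words of the text under study.
    reference_freq: dict mapping a form to its relative frequency in a
        reference corpus (hypothetical values, supplied by the user).
    Returns (form, ratio) pairs with ratio >= factor, highest ratio first.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    flagged = []
    for form, count in counts.items():
        expected = reference_freq.get(form)
        if not expected:
            continue  # no reference value: the form cannot be compared
        ratio = (count / total) / expected
        if ratio >= factor:
            flagged.append((form, round(ratio, 2)))
    return sorted(flagged, key=lambda item: -item[1])
```

The flagged forms are then natural candidates for the concordancing stage, where the reasons for their high frequency can be examined in context.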
Texts processed with BUILD can be passed on to the LEM module and/or to MIX, in either order.
The LW lemmatiser/POS tagger is a tool for computer-assisted human lemmatization / analysis with, as a newly developed feature, optional transfer of knowledge from one lemmatised file to other files. Lemmatisation and POS tagging are two interconnected submodules with a similar interface, where POS tagging can benefit from the output of lemmatisation.
Graphic words are listed on the left-hand side of the screen, repeated as many times as they appear in the file. In most cases forms representative of a given lemma appear in groups (e.g. follow, followed, following, follows), which allows selection of the whole group. In case of possible ambiguity due to homonymy, and only if the lemmas are not themselves homonymous, students check appearances of the forms one by one, using the Show feature on top of the screen. Show displays forms in the context of the sentence, providing information on their place in the file (paragraph number, sentence number within the paragraph, word position in the sentence). Lemmas are typed in the right-hand side slot; they are added to the list of lemmas, while the selected form or forms are Set as forms of the currently marked lemma. Lemmatisation can be edited and corrected with the Clear set feature of the module. Although full lemmatisation is, in most cases, not necessary, and is recommended mainly for high-frequency lexical class forms, students are asked to lemmatise at least two files: a file in their native language and its translation equivalent in the language studied. This exercise helps develop awareness of the degree of ambiguity, and the types of ambiguity, of graphic words in the two languages. The output of this submodule of LEM is passed on to the POS tagger, to the concordancer and back to the STATISTICS feature of BUILD, which now also offers information on the number of lexemes.
Fig. 1. LEM/Submodule: Lemmatiser
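The optional transfer of knowledge from one lemmatised file to another can be pictured as a dictionary lookup in which only unambiguous forms are assigned automatically. The following sketch assumes this simple model; the names `transfer_lemmas` and `count_lexemes` and the data layout are hypothetical and do not describe the internals of LEM.

```python
def transfer_lemmas(known, forms):
    """Reuse lemma assignments from an earlier, hand-lemmatised file.

    known: dict mapping a lower-cased form to the set of lemmas it
        received in the earlier file.
    forms: graphic words of the new file.
    Returns (assigned, to_check): forms with exactly one known lemma are
    assigned automatically; unknown or ambiguous forms are left for the
    human analyst, each listed once.
    """
    assigned, to_check = {}, []
    for form in forms:
        lemmas = known.get(form.lower(), set())
        if len(lemmas) == 1:
            assigned[form] = next(iter(lemmas))
        elif form not in to_check:
            to_check.append(form)
    return assigned, to_check


def count_lexemes(assigned):
    """Number of different lemmas among the assigned forms."""
    return len(set(assigned.values()))
```

With `known` containing the group follow, followed, following, follows under the lemma follow, and an illustrative ambiguous form such as left (leave or left), the new forms of follow are assigned at once, while left and any unseen form are returned for manual checking in context.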
The POS tagger can, but need not, make use of lemmatisation. The procedure is very similar to the lemmatisation procedure. Students make use of the knowledge and analysis skills acquired in their morphology classes and extend them. Working with this submodule, they are faced with a different type of ambiguity. Where the texts have previously been lemmatised, the lemma may be sufficient to resolve ambiguity (cf. BG били with lemmas съм (to be), бия (to beat), били (Billy), hence V, V, N), but not always (cf. EN works or working). It is for such cases that the Show feature is provided here too. POS tagging is carried out by activating one of the options provided on the right-hand side of the screen. The latest version of the tagger allows several levels of subcategorisation.
Following the first stage of POS analysis, quantitative information on parts of speech appears in the respective slots of BUILD/STATISTICS. Students can now re-analyse the text making use of this information. They are taught to use quantitative information on parts of speech to decide on text register, functional style or discourse topic.
Fig. 2. LEM/Submodule: POS Tagger.
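The two ideas in this section, resolving the part of speech from the lemma where possible and summarising the distribution of tags, can be sketched as follows. The table `LEMMA_POS` contains only the examples mentioned in the text; the table itself, the function names and the tag labels are illustrative assumptions rather than features of the LW tagger.

```python
from collections import Counter

# Hypothetical lemma-to-POS table built from earlier manual tagging.
LEMMA_POS = {
    "съм": {"V"},         # to be
    "бия": {"V"},         # to beat
    "били": {"N"},        # Billy
    "work": {"N", "V"},   # EN works: noun or verb, lemma alone is not enough
}


def tag_from_lemma(lemma, lemma_pos=LEMMA_POS):
    """Return the POS tag if the lemma determines it uniquely, else None."""
    options = lemma_pos.get(lemma, set())
    return next(iter(options)) if len(options) == 1 else None


def pos_profile(tags):
    """Relative frequency of each POS tag in a tagged text."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}
```

A return value of `None` corresponds to the cases in which the analyst has to inspect the form in context. The profile returned by `pos_profile` is the kind of quantitative information that can then be related to register or functional style.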
2.3. CONC
Concordancing with LW is one of the exercises that students taking the course enjoy most. The concordancer offers better context analysis possibilities than the simple KWIC format (Cf. Sinclair 1991). CONC allows the study of context for five positions on the left and five positions on the right of the key word. Each of these ten positions can be rearranged by alphabetical or frequency order. This feature of the module allows precise extraction of collocations and idiom chunks, even for scientific texts where multiple-word lexemes often extend to 5, 6 or more graphic words. Concordancing may, but need not, be preceded by lemmatisation or POS tagging. Of course, previous lemmatisation will allow concordancing more than one word form, and this is the option to be taken if the aim is the contextual characterisation of a lexeme. On the other hand, our experience shows that collocations / idiom chunks more often than not make use of only one form of the lexeme. The same can be said, more generally, of sense disambiguation: word senses are often realised by one or more, but not all, forms of the lemma. Stambolieva (2001) demonstrates this for the three senses of BG алкохол (alcohol), where the plural definite and indefinite forms can only represent the specialised sense of chemistry. Concordancing is carried out in the following stages: 1. All occurrences, in the context described above, of a selected word form or lemma appear on the main screen after pressing GET. 2. Each of the positions on either side of the key form can then be rearranged by alphabetical or frequency order using the abc/fre option and the position location (from -5 to +5), followed by Sort.
Fig. 3. Concordancing with LW.
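The positional sorting that distinguishes CONC from a plain KWIC display can be illustrated with a short Python sketch. It assumes a tokenised text held in a list; the function names `concordance` and `sort_rows` are hypothetical, and the `abc`/`fre` labels merely echo the option names mentioned above.

```python
from collections import Counter


def concordance(tokens, key, span=5):
    """Collect every occurrence of `key` with `span` positions of context.

    Each row is a dict mapping a position (-span .. +span, 0 = key word)
    to the token found there, or "" beyond the edges of the text.
    """
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() != key.lower():
            continue
        row = {}
        for pos in range(-span, span + 1):
            j = i + pos
            row[pos] = tokens[j] if 0 <= j < len(tokens) else ""
        rows.append(row)
    return rows


def sort_rows(rows, position, order="abc"):
    """Rearrange rows by the token at one context position.

    order="abc": alphabetical; order="fre": most frequent filler first,
    ties broken alphabetically.
    """
    if order == "abc":
        return sorted(rows, key=lambda r: r[position].lower())
    freq = Counter(r[position].lower() for r in rows)
    return sorted(
        rows, key=lambda r: (-freq[r[position].lower()], r[position].lower())
    )
```

Sorting by frequency at position +1 or -1 brings recurrent neighbours of the key word to the top of the display, which is how collocation candidates become visible; repeating the sort at positions further out helps with longer multiple-word lexemes.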
Students are made aware that enriching a linguistic database with the precisely analysed context of the forms of a lexeme is one of the important conditions for its successful functioning. Collocations take up a central place not only in building translation systems, but also in lexicographic work and foreign language teaching, especially at the higher levels of language learning. It is not by chance that more and more attention is being paid to patterns of collocation in both monolingual and bilingual dictionaries, and that dictionaries of collocation are becoming increasingly popular. Many language processing systems rely largely on collocation for ambiguity resolution of different types.
C&T students use the LW concordancer to work on a number of different tasks. In one such task, students are asked to look at their notes on deviations of words and forms from overall frequency rank (Cf. 2.1. above) and explain these deviations.
Thus, the high frequency of the preposition за in parliamentary proceedings can easily be accounted for considering its participation in the following set phrases, typical of parliamentary jargon: изменение /допълнение, приемането, etc./ на закона за /държавния бюджет/; устройствения закон за /бюджета/; правилника за /организацията и дейността на народното събрание/; държавния бюджет за ... година; става дума за; програмата за; краен срок за; има думата за реплика/дуплика/; предложението за; за второ четене. In this discourse type, за often appears as an adverb or particle (to be for or against): за, против, въздържали се; народни представители – за.
The high frequency rank of the 1st person singular personal pronoun аз can be explained if the following are considered: аз не знам/не зная; аз не мога да не изразя; аз ще помоля; аз ще обърна внимание на; аз бих искал да; аз бих могъл да; аз съм съгласен с; аз мисля, че; аз смятам, че; аз моля да; аз разчитам на.
The negative particle is a constituent in the following high-frequency strings: да не забравяме, че; да не фиксираме; да не ограничаваме; не може да; не можем да; не става дума за; а не да; аз не знам; аз не мога да; аз не зная как; аз не казвам да; предложението не се приема.
2.4. Aligning with MIX
This module can be used for aligning any two files, whether in the same language or in different languages. Thus, MIX can be a tool for comparing two different translations of the same text, two versions of a song, stylistic versions of the same content, etc. Its main function, however, is to create parallel texts. The source text and the translation are processed with BUILD, they are loaded as File 1 and File 2, respectively, and aligned upon activation of Mix. The tool offers for alignment the first source text sentence and the first translation text sentence. Pressing OK will enter these sentences as a translation pair and the next sentences will be offered for alignment. In the course of aligning their set texts, students learn very soon that translations do not necessarily follow the source text structuring into sentences: long sentences can be split in two or more; alternatively, two or more source text sentences can appear in the translation merged in one. MIX allows for such asymmetrical alignment.
The result of the alignment process can be used in different applications, e.g. in creating or enriching the translation memory (TM) of a Translator’s Workbench. It can also be converted into a bilingual text file, where the translation sentence or sentences follow the source text they are aligned to – Cf. part of Orwell’s Nineteen Eighty-Four (Orwell 1949), aligned to its Bulgarian translation (Orwell 1989):
#From somewhere at the bottom of a passage the smell of roasting coffee – real coffee, not Victory Coffee – came floating out into the street. # Някъде от дъното на входа в улицата нахлу аромат на печено кафе – истинско кафе, не кафе „Победа”.
#Winston paused involuntarily. # Уинстън неволно замря.
# For perhaps two seconds he was back in the half-forgotten world of his childhood. # Може би за секунда-две се върна в полузабравения свят на детството си.
#Then a door banged, seeming to cut off the smell as abruptly as though it had been a sound. # После се затръшна врата и ароматът секна внезапно, сякаш звук.
Fig. 4. Aligning translation equivalents with MIX.
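Asymmetrical alignment and the conversion to a bilingual text file can be modelled very simply once the human decisions are recorded. In the sketch below each decision is a pair (number of source sentences, number of translation sentences); the function names and this way of recording decisions are assumptions made for illustration, since MIX itself is operated interactively.

```python
def align(source, target, decisions):
    """Build translation pairs from recorded alignment decisions.

    source, target: lists of sentences (already segmented).
    decisions: list of (n_source, n_target) pairs, e.g. (1, 1) for a
        one-to-one match, (1, 2) when the translator split a sentence,
        (2, 1) when two source sentences were merged.
    """
    pairs, i, j = [], 0, 0
    for n_src, n_tgt in decisions:
        pairs.append(
            (" ".join(source[i:i + n_src]), " ".join(target[j:j + n_tgt]))
        )
        i += n_src
        j += n_tgt
    if i != len(source) or j != len(target):
        raise ValueError("decisions do not cover both texts exactly")
    return pairs


def to_bilingual_text(pairs):
    """Render pairs one per line, in the '#source # translation' layout."""
    return "\n".join(f"#{src} # {tgt}" for src, tgt in pairs)
```

The list of pairs returned by `align` is also a convenient intermediate form for loading aligned text into a translation memory.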
Students are advised to prepare for all types of translation and interpretation by processing monolingual texts or aligning texts with previous translations using LW or similar tools. Obtaining relevant documents, analysing their formal structure and concordancing them is useful for all, but is a near must for consecutive interpreters in preparing their abbreviations. Aligning documents with earlier translations and looking up translation equivalents of text units of various lengths is also an exercise worth every minute of the effort.
3. Machine Translation Tools
Like most people, students of language have, at one time or other, made use of the available tools for machine translation. For Bulgarian, apart from Google Translate, such tools are WebTrance Translator (and the more recent SkypeTrance), Babylon and BULTRA – a product of the Bulgarian software developers ProLangs. Two of these tools are analysed as part of the course: Google Translate and BULTRA.
While students know how to access translation tools, they are not familiar with all of them; further, while they realise that the translations leave much to be desired, their initial assessment does not go beyond qualifications such as “awful” or “abominable”. An important task is to help them analyse this performance in the context of the classification of machine translation tools. Both tools are found to be examples of local, direct translation. They cope well with definiteness and basic syntactic structures but fail with syntactic transformations and the translation of Tense and Aspect – Cf. some Bulgarian strings generated by Google Translate:
1. He sang the song – Той изпя песента. (Perfective Aspect, Aorist)
2. He sang for an hour – *Той пееше за един час. (Imperfective Aspect, Imperfect Tense)
3. They ate the sandwich. – **Те яде сандвич.(Imperfective Aspect, Aorist/Present?)
4. Did you eat the sandwich? – ***Знаете ли, яде сандвич?
Students define the errors and try to formulate rules which would improve the system’s performance.
It is a publication in the proceedings of an international workshop (Paskaleva, Netcheva 2005) that drew my attention to the BULTRA translation tool, above all with the information that the system provides for pre- and post-processing and for the creation of customer data bases. These latter allow the addition of both one-word units and multiple-word lexemes and collocations.
In this part of the course, students are taught to improve the output of systems for automatic translation by doing pre-processing analysis, creating customer databases and post-editing the output. A text, preferably from a subject area which is not covered by the BULTRA thematic glossaries, is uploaded, translated using the general database, and the resulting file is saved. The source language text is then processed with BUILD and CONC. High-frequency vocabulary and collocations are noted and provided with appropriate translations. These one-word and multi-word units are entered in the customer database together with their translations. The source text is then translated with BULTRA a second time, with the customer database added to the general one. The two translations are compared.
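Candidate multi-word entries for a customer database can be shortlisted by counting recurrent word sequences in the source text. The sketch below is a rough substitute for the manual inspection of concordances, under the assumption that frequent bigrams or trigrams are worth a translator's attention; the function name `frequent_ngrams` and the thresholds are hypothetical, and nothing here reflects the BULTRA interface.

```python
import re
from collections import Counter


def frequent_ngrams(text, n=2, min_count=2):
    """Return (n-gram, count) pairs that occur at least min_count times.

    Results are ordered by descending count, then alphabetically.
    """
    tokens = re.findall(r"\w+", text.lower())
    grams = Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    frequent = [(g, c) for g, c in grams.items() if c >= min_count]
    return sorted(frequent, key=lambda item: (-item[1], item[0]))
```

Many frequent sequences are not true collocations (they may simply contain function words), so the list is only a starting point: the translator still decides which units deserve an entry and supplies the translation.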
4. Translators’ Workbenches
After a short introduction to the notion of computer-assisted human translation, students are familiarised with the principles and main features of two workbenches: Wordfast (in its standard and Anywhere.Wordfast versions) and TRADOS. These will not be described here, as they are very popular tools, descriptions and trial versions of which can be downloaded for free. I will focus on some specific approaches to their use.
Translators’ workbenches are based on the following principle. While the translator has access to a larger section of text, what he sees highlighted is a sentence: like MIX, the tools offer sentence segments, one by one, with a possibility to add more segments in the highlighted zone. The translator composes his text in a “translation line”, highlighted in a different colour. Once the sentence(s) has (have) been translated, the result is entered into the memory of the tool, to form part of a Translation Memory (TM). At later stages in the translation of the text, or during subsequent translation sessions, whenever identical or very similar segments appear in the source text line, the tool will offer translation equivalents based on earlier translations, i.e. on the TM.
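The principle can be made concrete with a toy translation memory. The sketch below uses the character-based similarity ratio of Python's standard `difflib.SequenceMatcher` as the measure of "very similar"; this choice, the class name `TranslationMemory` and the threshold of 0.75 are assumptions for illustration and do not describe how Wordfast or TRADOS compute their matches.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """A toy TM: stored source segments with their translations."""

    def __init__(self, threshold=0.75):
        self.pairs = {}             # source segment -> translation
        self.threshold = threshold  # minimum similarity for a suggestion

    def add(self, source, target):
        """Enter a translated segment into the memory."""
        self.pairs[source] = target

    def lookup(self, segment):
        """Return (translation, score) for the best match, or None."""
        best_target, best_score = None, 0.0
        for source, target in self.pairs.items():
            score = SequenceMatcher(
                None, segment.lower(), source.lower()
            ).ratio()
            if score > best_score:
                best_target, best_score = target, score
        if best_target is not None and best_score >= self.threshold:
            return best_target, best_score
        return None
```

After `tm.add("Winston paused involuntarily.", "Уинстън неволно замря.")`, an identical segment is returned with a score of 1.0, a segment differing only in its final punctuation mark is still offered as a close match, and an unrelated segment yields no suggestion.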
The course relates translation workbenches to corpus analysis tools in the following ways:
1. Students are advised to process their texts with BUILD prior to uploading them in the translator’s workbench. As noted above, for any process of translation, obtaining information on the structure of the text, its quantitative parameters and its high-frequency or specific vocabulary is a desirable first step.
2. Students are advised and taught to refer to parallel (aligned) texts and other available reference tools for help, along with the use of TM. A TM is only useful if it consists of legitimate translation pairs. Failing that, it could turn into a powerful instrument for the generation of consistently erroneous translation equivalents.
3. The possibility to load parallel text as translation memory is explored.
For the time being, our university environment is not sufficient to give students a real feeling of reaping the fruit of a translation memory; this part of the course therefore ends with a short hands-on translation session at a translation firm working with TRADOS.
Computational linguistics has travelled a long way from the linguistic wars of the 1960s and 1970s to the realisation that compromise approaches, hybrid methods and, generally, open-mindedness and flexibility yield better results. Guided by this understanding, the NBU Computer and Translation course presents to students a number of traditionally opposed approaches to the processing of natural language texts, placing emphasis on strong points, areas of contact and possibilities for combining tools from the arsenal of different schools. The excellent technical equipment of the NBU laboratories, together with the cooperation with translation firms, ensures the necessary balance and link between learning and doing, knowledge and skills – a feature of the course that students clearly find attractive.
- Orwell, George. 1949. Nineteen Eighty-Four. Penguin Books.
- Paskaleva, Elena, Tanya Natcheva. 2005. BULTRA (English-Bulgarian Machine Translation System). Basic Modules and Development Prospects. In: Stelios Piperidis and Elena Paskaleva (eds.). Language and Speech. Infrastructure for Information Access in the Balkan Countries. Borovets, Bulgaria.
- Sinclair, John. 1991. Corpus, Concordance, Collocation. Oxford University Press.
- Stambolieva, Maria. 1992. A Linguist’s Workbench. Research Project, funded by the Bulgarian Ministry of Education and Science, National Research Foundation, № 208-Ч.
- Оруел, Джордж 1989. 1984. Превод от английски Лидия Божилова. София: Профиздат.
- Стамболиева, Мария 2001. Електронните архиви в езиковедската работа. Български език 1 (2-3). 121-30.
About the author
Dr. Maria Stambolieva is Associate Professor at the Department of Foreign Languages and Literatures of New Bulgarian University and Lecturer at IFAG, Sofia, where she teaches Morphology, Syntax, Computer and Translation, General English and Business English. M. Stambolieva is the pioneer of Bulgarian computational corpus linguistics, the author of the first monolingual corpus of Modern Literary Bulgarian, of a multilingual parallel corpus of Bulgarian, English and French, of corpus-processing software. She has participated in and directed many international and national research and educational projects. Her interests are in the field of formal linguistics, corpus linguistics and computational linguistics; she is the author of books and articles in her areas of interest. M. Stambolieva co-chairs the Bulgarian national association of corpus and computational linguists ANABELA.
E-mail address: mstambolieva at nbu dot org