A Modular Approach to Turkish Noun Compounding: The Integration of a Finite-State Model

Aysenur Akyuz Birturk (1) and Sandiway Fong (2)

(1) Department of Computer Engineering, Middle East Technical University, Ankara
birturk@ceng.metu.edu.tr

(2) NEC Research Institute, Princeton, New Jersey
Sandiway@research.nj.nec.com

 

Abstract

In this paper, we describe the design and integration of a three-level cascaded non-deterministic finite state model of Turkish compounding into Turkish PAPPI, a comprehensive syntactic parser in the principles-and-parameters (P&P) framework. Our approach is to handle compounding as an intermediate stage between morphological analysis and syntactic parsing. We discuss how the compounding machine handles bracketing paradoxes and adjectives in compounds, and how the non-determinism allows for ambiguity due to different bracketings and Turkish subject/object pro-drop.

1 Introduction

Noun compounding poses a special challenge for natural language processing (NLP) systems because of its productivity and the wide range of possible semantic relations that can be encoded through compounding. Compounds such as tea bag, tea leaf, tea garden, tea cake, tea break, and tea service behave like common nouns with respect to syntax but encode different semantic relations between the head and the modifier elements. Furthermore, Downing (1977) claims that there are no linguistic restrictions on the possible relations implicit in nominal compounds, and newly created compounds may be interpreted in many ways in the absence of context, for example, cousin chair. Thus compounding is semantically unpredictable.

Semantic compositionality means that the semantics of a complex expression is a function of the semantics of its parts and the mode of combination (Partee, 1984). However, semantic compositionality appears to be violated in many examples:

(1) (a) school girl        girl school
    (b) summer school      *school summer

One controversy with respect to compounding is whether compounds are formed by transformation, e.g. ‘girl friend’ from ‘friend who is a girl’, or by special word formation rules. Important early work on a comprehensive but purely transformational treatment of compounding, encoding a variety of sentence types and grammatical relations, was done by Lees (1966).
In the same framework, Botha (1968), arguing from Afrikaans compounding data, showed that phonological considerations must also come into play.

Arguments against the transformational approach include violation of recoverability of deletion and the observation that compounding is acquired before relative clause formation (see Hoeksema (1985) for further discussion).

Levi (1978) attempted to characterize compounding using a small set of basic semantic relations, e.g. make, for or cause. However, this reductionist approach is of limited practical import given the polysemy inherent in compounding; issues of pragmatics and real-world knowledge must also come into play. Along similar lines, Johnston and Busa (1999) attempted to treat core cases of compounding in the Generative Lexicon framework (Pustejovsky, 1991) by co-opting qualia information present in the lexical entry for the head noun.

Most compounds obey the morphological island constraint of Botha (1980):

    The individual constituents of the complex words formed by means of word formation rules lose the ability to interact with inflectional, derivational and syntactic processes.
For example:

(2) (a) bus ticket
    (b) *buses ticket (plural number insertion)

Turkish constitutes a special case for compounding. In particular, the indefinite ‘izafet’ construction exhibits many interesting properties, as we will see in Section 2. However, indefinite ‘izafet’ constructions in Turkish also obey the morphological island constraint in most cases. This allows us to treat compounding in a modular fashion, as a distinct process separate from the rest of syntax. Our approach is to handle compounding as an intermediate stage between morphological analysis and syntactic parsing. Our goal is to package and identify compounds for syntactic analysis without analyzing the semantic relations between elements of a compound.

The paper is organized as follows. In Section 2, we give the analysis of noun compounds in Turkish. The problem description is given in Section 3. Sections 4 and 5 describe our approach and the design of the compounding machine. We discuss the limitations of the approach in Section 6, and our concluding remarks are in Section 7.

2 Compound Nouns in Turkish

Compound nouns in Turkish are divided into two types (Spencer, 1991; Lewis, 1967; Lapointe, Brentari and Farrell, 1998):

a. Single-word construction: These are phonologically single words and have idiosyncratic (i.e. non-compositional) meaning. The types of single-word compounds are Noun+Noun, Adj+Noun, Noun+Adj, Verb+Verb and Noun+Verb. The construction is not productive. An example from Spencer (1991) is:

(3) başbakan
    head + minister
    ‘prime minister’

b. ‘izafet’ construction: There are two types of ‘izafet’ constructions.¹

1. Definite (possessive construction): It takes the form ‘Noun-GEN Noun-POSS’ and generally corresponds to the English ‘Noun’s Noun’ or ‘Noun of the Noun’ type syntactic phrases.

(4) bahçenin kapısı
    garden-GEN gate-POSS
    ‘the gate of the garden’

2. Indefinite: It takes the form ‘Noun Noun-POSS’ and corresponds to the English ‘Noun Noun’ compounds.

(5) bahçe kapısı
    garden gate-POSS
    ‘garden gate’

Two types of branching are possible in indefinite constructions, and branching is explicitly signaled by overt morphology:

1. Right branching ([N [N … [N N-POSS]…]])

(6) Türk Dil Kurumu
    Turk Language Organization-POSS
    ‘Turkish Language Council’

2. Left branching ([…[[N N-POSS] N-POSS]…])

(7) Dil Kurumu Sözlüğü
    Language Organization-POSS dictionary-POSS
    ‘Language Council Dictionary’

Both types of branching may occur in one construction:

(8) Türk Dil Kurumu Sözlüğü
    Turk Language Organization-POSS dictionary-POSS
    ‘Turkish Language Council Dictionary’

In general, the head, i.e. the last noun in the compound, cannot be modified directly and non-heads lose referential and other syntactic properties (see Section 5.1.2 for a counterexample):

(9) (a) bahçe kapısı
        garden gate-POSS
        ‘garden gate’
    (b) *bahçe-ler kapısı
        garden-PLR gate-POSS
    (c) *bahçe yeni kapısı
        garden new gate-POSS

¹ In the following constructions, GEN stands for the genitive suffix, POSS stands for the possessive suffix, and PLR stands for the plural suffix.

3 Problem Description

As we have described in the previous section, Turkish noun compounding consists of single-word constructions and ‘izafet’ constructions, the latter either definite or indefinite. Our goal is to extend an existing, comprehensive principles-and-parameters (P&P) parser for Turkish, Turkish PAPPI (Birturk, 1998), to accept and parse Turkish noun compounds.

Single-word compounds are lexicalized and thus are inserted whole in the lexicon. The definite ‘izafet’ construction can be shown to be a syntactic construction and is already handled by the current syntactic engine employed by Turkish PAPPI. Our main concern in this study is therefore the analysis of indefinite ‘izafet’ constructions, which are also very productive. Hereafter, we restrict our attention to indefinite compound constructions only.
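The division of labour described in Sections 2 and 3 rests on the marking of the two nouns: ‘Noun-GEN Noun-POSS’ is left to the syntactic parser, while ‘Noun Noun-POSS’ goes to the compounding component. A minimal Python sketch of that decision follows. The marker-set representation and the function name classify_izafet are illustrative assumptions, not part of Turkish PAPPI.

```python
# Illustrative sketch only: the marker sets and the function name are
# assumptions, not the representation used in Turkish PAPPI.

def classify_izafet(modifier_markers, head_markers):
    """Classify a two-noun sequence by the markers carried by each noun.

    'Noun-GEN Noun-POSS' -> "definite"   (handled by the syntactic parser)
    'Noun Noun-POSS'     -> "indefinite" (handled by the compounding stage)
    """
    if "POSS" not in head_markers:
        return "not izafet"
    if modifier_markers == {"GEN"}:
        return "definite"
    if not modifier_markers:
        return "indefinite"
    # Any other marking on the non-head, e.g. PLR as in (9b), is rejected.
    return "not izafet"


print(classify_izafet({"GEN"}, {"POSS"}))   # (4) bahçenin kapısı
print(classify_izafet(set(), {"POSS"}))     # (5) bahçe kapısı
print(classify_izafet({"PLR"}, {"POSS"}))   # (9b) *bahçe-ler kapısı
```

The third call mirrors the morphological island constraint: a plural-marked non-head is not accepted as an indefinite compound.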
4 Pre-parsing Compounds

Our approach is to handle compounding before syntactic parsing begins but after initial morphological analysis (Fig. 1). Thus the compounding component serves as a preprocessing phase in Turkish PAPPI. Since compounds are opaque to syntactic processes, i.e. they function as a single syntactic unit once the head has been identified, we can simply treat them as specially-marked nouns inheriting the features of the head noun. We simply ‘encapsulate’ them as nouns, marked with a special feature ‘compound’. In other words, compounds are seen and treated by syntactic components as ordinary nouns. For example, they can take part in syntactic constructions such as possessive constructions, -ki constructions or adjective phrases:

(11) (a) evin [bahçe kapısı]
         house-GEN garden gate-POSS
         ‘the garden gate of the house’
     (b) tahta [bahçe kapısı]
         wooden garden gate-POSS
         ‘wooden garden gate’

The task of initial morphological analysis is to expand or decompose input words into their constituent morphemes or tokens in the lexicon. Such a stage of analysis is especially important for agglutinating languages like Turkish, where it is impractical to store all forms of a word.

As an example, ‘elmalarımı’ is expanded into ‘elma PLR POSS1SG ACC’. Here, PLR, POSS1SG, and ACC are abstract morphemes (PLR: plural, POSS1SG: first person singular possessive, ACC: accusative case) that are implemented as ‘markers’ in the lexicon (Birturk, 1998). Markers are a special class of morphemes that do not project structure like regular heads or other morphemes such as verbal causatives or passives. Instead, markers in the PAPPI system are realizations of feature elements that are attached to regular categories (Fong, 1998), in the sense of the Case insertion mechanism described in Chomsky (1996). Hence, markers are applied before syntactic parsing; in other words, they are resolved as features or modifications on existing features of heads.
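The resolution of markers into features can be illustrated with a small Python sketch. The dictionary-based feature structure, the noun category label, and the names MARKER_FEATURES and apply_markers are hypothetical; the actual lexicon format of PAPPI is not shown in this paper.

```python
# Hypothetical sketch: MARKER_FEATURES and apply_markers are illustrative
# names, and the feature inventory is an assumption made for this example.

MARKER_FEATURES = {
    "PLR": ("number", "plural"),
    "POSS1SG": ("possessive", "1sg"),
    "ACC": ("case", "accusative"),
}


def apply_markers(expansion):
    """Fold the markers of an expanded word into features of its head.

    `expansion` is the output of morphological analysis, for example
    'elma PLR POSS1SG ACC': the first token is the head, the rest are markers.
    """
    head, *markers = expansion.split()
    features = {"stem": head, "category": "noun"}  # category assumed here
    for marker in markers:
        name, value = MARKER_FEATURES[marker]
        features[name] = value  # a marker sets a feature; it projects no structure
    return features


print(apply_markers("elma PLR POSS1SG ACC"))
```

Because every marker ends up as a feature on the head, the syntactic parser only ever sees a single noun with its number, possessive and case values filled in.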
In our example, PLR, POSS1SG and ACC simply instantiate the number, possessive and case features of the head noun ‘elma’.

Figure 1. Levels of Linguistic Analysis.

5 A Finite State Machine for Compounding

Chomsky (1957) demonstrated that finite state methods are incapable of representing the full richness of natural languages. However, there are many subsets of natural languages that are adequately covered by finite state methods (Roche and Schabes, 1997; Kornai, 1999; Karttunen and Oflazer, 2000). In this section, we assume familiarity with finite state automata; readers without it are referred to the introduction in Lewis and Papadimitriou (1998).

We demonstrate a successful application of finite state methods to Turkish noun compounding. A cascaded nondeterministic finite state machine (FSM) is designed for compounding. The compounding machine permits left and right branching in compounds, and handles bracketing paradoxes and adjectives in the compounds. The non-deterministic design of the machine allows for ambiguous parses due to subject/object pro-drop in Turkish and different bracketings due to syntax. The cascaded finite state machine has three levels:

1. Simple Compounding
2. Nested Compounding
3. Title-ProperNoun Compounding

We describe each of the three levels in the following sub-sections.

5.1 Simple Compounding

The initial FSM designed for compounding is shown in Fig. 2. Note that this machine permits left and right branching in compounds. Whenever a compound is parsed, the features of the head, i.e. the last noun in the compound, are copied and instantiate the compound’s features as a whole. As a consequence, markers (described in Section 4) that modify the head noun apply also to the entire compound.

As a rule, we do not employ a compound marker in the lexicon; the POSS marker serves for both compounding and agreement in possessive constructions.
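The head-feature copying described in this subsection can be sketched in Python as follows. The dictionary layout, the encapsulate function and the feature values are assumptions made for illustration; only the ‘compound’ feature and the rule that the last noun supplies the features come from the text.

```python
# Illustrative sketch: the data layout is assumed, not taken from Turkish PAPPI.

def encapsulate(nouns):
    """Package a sequence of analysed nouns as a single compound noun.

    The head is the last noun; its features (including any markers already
    resolved on it) become the features of the whole compound.
    """
    head = nouns[-1]
    compound = dict(head["features"])   # copy, do not alias, the head features
    compound["compound"] = True         # special feature marking the unit
    compound["form"] = " ".join(noun["form"] for noun in nouns)
    return compound


bahce = {"form": "bahçe", "features": {"category": "noun"}}
kapisi = {"form": "kapısı", "features": {"category": "noun", "possessive": "POSS"}}
print(encapsulate([bahce, kapisi]))
```

The result is an ordinary noun from the point of view of later stages: it carries the head's possessive feature once, plus the extra ‘compound’ flag.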
Using POSS for both purposes also prevents double marking for compound possessees in possessive constructions:

(12) (a) evin bahçe kapısı
         house-GEN garden gate-POSS
         ‘the garden gate of the house’
     (b) *evin bahçe kapısısı
         house-GEN garden gate-POSS-POSS

In order to save branching information, we employ two features, namely RBRANCH and LBRANCH, for the compound noun. They are initially set to 0. RBRANCH is incremented each
time state 0 or 1 is visited and LBRANCH is incremented each time state 2 is visited.

Figure 2. The initial FSM for compounding.

5.1.1 Handling Bracketing Paradoxes

Compounds in Turkish may involve bracketing paradoxes that cannot be resolved by lexical rules (Sehitoglu and Bozsahin, 1999). In the following example, taken from Goksel (1993), the plural marker has semantic scope over the nominal compound marker; however, the opposite is true with respect to the morphological bracketing:

(13) otobüs bilet-ler-i
     bus ticket-PLR-POSS
     ‘bus tickets’

We solved this bracketing paradox by considering -leri as a composite plural compound marker and replacing N-POSS labels by N-{PL}-POSS, i.e. N-POSS or N-PLR-POSS.

5.1.2 Adjectives in the compound

Adjectives may also be a part of some noun compounds, as in the English examples [[short story] competition] or [[natural language] parser]. Similar compounds can be found in Turkish:

(14) kısa film yarışması
     short movie competition-POSS
     ‘short movie competition’

The adjective ‘kısa’ modifies ‘film’, not ‘[film yarışması]’. Here we put aside the question of internal bracketing for semantics and simply encapsulate the entire sequence as [kısa film yarışması]. We handle this case in the machine by adding an Adj-Noun link between states 0 and 1 in Fig. 2.

5.1.3 Allowing Ambiguity

Examples like white door handle or American history teacher are ambiguous since one may (a) start at N, and get [Adj white][N door handle], i.e. the handle is white; or (b) start at Adj, and get [N white door handle], interpreted as the door is white. Thus compounding may start with the adjective or noun.

The situation in Turkish is further complicated by the possibility of subject and object-drop (indicated by the zero element ø below). For example, there are three possible parses for the sentence:

(15) ₁NATO ₂yaz okulu düzenledi.
     NATO summer school organize-PAST3SG
     i. ‘NATO organized [summer school].’
     ii. ‘ø organized [NATO summer school].’
     iii. ‘NATO summer school organized ø.’

Thus compounding may start at positions 1 or 2 (given by subscripting) in the sentence above. This is accommodated by adding a nondeterministic component to the finite state machine. The revised FSM given in Fig. 3 handles bracketing paradoxes, adjectival insertion in compounds and ambiguity of the aforementioned kind. The symbol * stands for any terminal symbol, so the machine may stay at the initial state or start compounding with an N or Adj. In general, for a sequence of m unmarked nouns followed by n POSS-marked nouns, (m*n) compounds may be generated.

Figure 3. Level 1: Compounding Machine.

As a consequence of the nondeterminism necessary to handle ambiguity, controlling overgeneration becomes an issue. In some cases, illicit compounds can be eliminated by appealing to selectional restrictions. However, in many cases, there are no clear criteria for elimination. Consider, for example:

(16) Zeynep otobüs bileti verdi.
     Zeynep bus ticket-POSS give-PAST3SG
     Note: Zeynep is a personal name in Turkish.
     i. ‘Zeynep gave bus ticket to ø’
     ii. ‘ø gave [Zeynep bus ticket]’
     iii. ‘[Zeynep bus ticket] gave ø to ø’

Here, (iii) can be eliminated by selectional restrictions of the verb give. However, (ii) can only be eliminated by appealing to extra-grammatical information such as real-world knowledge or discourse context.

Consider also the following example:

(17) Kennedy havaalanına gitti.
     Kennedy airport-POSS-DAT go-PAST3SG
     i. ‘Kennedy went to airport.’
     ii. ‘ø went to [Kennedy airport].’

(17ii) is a parallel example to (16ii) above. Hence, in principle, the compounding module is correct in allowing for both possibilities.

5.2 Nested Compounding

Multiple compounding is also a common phenomenon in natural languages. For example:
(18) Middle East Technical University Computer Engineering Department

In Turkish, nested compounds have the nested structure [[COMPOUND-1] … [COMPOUND-N]].

(19) [[Orta Doğu Teknik Üniversitesi] [Bilgisayar Mühendisliği Bölümü] [yaz okulu kayıtları]] başladı.
     [[Middle East Technical University-POSS] [Computer Engineering-POSS Department-POSS] [Summer School registration-PLR-POSS]] start-PAST3SG
     ‘Middle East Technical University Computer Engineering Department Summer School registrations started.’

This is the second step in compounding. After the compounding FSM in Fig. 3 has completed the first level of compounding, the newly formed compounds are sent to the FSM in Fig. 4 for nested compounds. This machine is also nondeterministic. Thus, we may obtain different bracketings for the following sentence:

(20) Orta Doğu Teknik Üniversitesi Bilgisayar Mühendisliği Bölümü açtı.
     Middle East Technical University-POSS Computer Engineering-POSS Department-POSS establish-PAST3SG
     i. ‘[Middle East Technical University Computer Engineering Department] established ø.’
     ii. ‘[Middle East Technical University] established [Computer Engineering Department].’

Figure 4. Level 2: Nested Compounding Machine.

5.3 Title-ProperNoun Compounding

Another phenomenon related to compounding is Title-ProperNoun sequences. The title can be a simple noun or a compound, as indicated in the following examples:

(21) (a) Başkan Bush
         President Bush
         ‘President Bush’
     (b) [Turizm Bakanı] Mumcu
         Tourism Minister-POSS Mumcu
         ‘Minister of Tourism Mumcu’
         Note: Mumcu is the surname of the Minister of Tourism.

This is handled by a separate compounding level. After the completion of nested compounding, the FSM given in Fig. 5 handles Title-ProperNoun sequences. This machine is also nondeterministic.
Thus, we may obtain different parses for the following sentence:

(22) Turizm Bakanı Mumcu’ya bir hediye verdi.
     Tourism Minister-POSS Mumcu-DAT a present give-PAST3SG
     i. ‘ø gave a present to Minister of Tourism Mumcu’
     ii. ‘Minister of Tourism gave a present to Mumcu’

Figure 5. Level 3: Title-ProperN Compounding.

6 Limitations of the Approach

In this section, we discuss limitations of the approach presented so far.

6.1 Conjunction in compounds

Putative compounds like NY and NJ trains, NY buses and trains, Democracy and Human Rights Report, France and England Territories involve coordinating conjunctions and ambiguity:

(23) NY ve NJ trenleri
     NY and NJ train-PLR-POSS
     i. ‘NY and [NJ trains]’
     ii. ‘[NY and NJ] trains’

Coordination in general is a process at the level of syntax and cannot be handled with our approach at present. It is not even entirely clear whether such sequences should be treated as compounds or handled within a general treatment of conjunction. We leave such cases for future research.

6.2 Other problematic cases

There are some compound-like structures which violate the morphological island constraint. For example:

(24) Sokak çocuklarına yardım derneği
     Street child-PLR-POSS-DAT help foundation-POSS
     ‘Foundation to help street children’

This is an apparent nested compound in which the first sub-compound is marked by a dative case marker (DAT). The case marker indicates and triggers a syntactic phrase boundary. Thus, instead of modifying the nested compounding machine to accommodate case markers, the example can be shunted aside and handled in the syntactic component after the first level of compounding.
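One way to picture the case marker acting as a phrase boundary is to cut the tagged input after every case-marked token, so that level-1 compounding only ever sees the material inside each piece. The Python sketch below does only that cutting step. The (form, tags) token layout, the CASE_MARKERS inventory and the function name are assumptions, and nothing here is claimed to be how Turkish PAPPI implements the step.

```python
# Sketch under assumptions: token layout and marker inventory are illustrative.

CASE_MARKERS = {"ACC", "DAT", "GEN"}  # only the case markers named in this paper


def split_at_case_boundaries(tokens):
    """Split a list of (form, tags) tokens after each case-marked token."""
    segments, current = [], []
    for form, tags in tokens:
        current.append((form, tags))
        if CASE_MARKERS & set(tags):
            segments.append(current)   # a case marker closes the segment
            current = []
    if current:
        segments.append(current)       # trailing material without a case marker
    return segments


# (24) Sokak çocuklarına yardım derneği
example_24 = [
    ("sokak", ["N"]),
    ("çocuklarına", ["N", "PLR", "POSS", "DAT"]),
    ("yardım", ["N"]),
    ("derneği", ["N", "POSS"]),
]
for segment in split_at_case_boundaries(example_24):
    print([form for form, _ in segment])
```

On example (24) the dative marker separates ‘sokak çocuklarına’ from ‘yardım derneği’, leaving the relation between the two pieces to the syntactic component.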
There is a special type of noun compounding in Turkish: [N ADJ-POSS].

(25) dut kurusu
     mulberry dry-POSS
     ‘dried mulberry’

The difference in noun-adjective order implies that the level-1 FSM must be revised or perhaps another separate level may be introduced to handle these cases.

7 Conclusion

The cascaded finite state machine described in this paper is a descriptively adequate, powerful and simple yet flexible mechanism for handling simple and multiple compounds before real syntactic parsing begins. Compounds constructed by the machine interact with later syntactic processes as common nouns. Ambiguous parses due to different bracketings and subject/object pro-drop in Turkish are obtained through nondeterminism and the cascaded design of the FSM.

An important point is that we do not analyze the semantic relations between elements of a compound. As we have pointed out previously, compounding may encode arbitrarily complex and deep semantic relations, requiring both real-world and discourse knowledge. A comprehensive treatment of this topic is beyond the purview of simple finite state machinery (for further discussion, see for example Johnston and Busa (1999)).

The machine has been tested on a variety of different types of compounds found in Turkish and it demonstrably integrates well as a new component of the existing Turkish PAPPI parser. For example, syntactic constraints such as selectional restrictions help filter out unwanted cases of compounding. However, the possibility of overgeneration due to nondeterminism also points to the need for extra-grammatical constraints. The existing machine cannot by itself deal with overgeneration where discourse or real-world knowledge is involved. To handle these cases, the results of compounding must be further filtered after syntactic analysis at the level of syntactic logical form (LF), the interface to semantic interpretation.
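The nondeterministic behaviour of the level-1 machine, and the overgeneration it brings, can be imitated by enumerating every span of the form N+ (N-POSS)+ in a tagged sentence. The Python sketch below is an approximation under stated assumptions: the states and transitions of Figs. 2 and 3 are not reproduced, adjectives and the N-PLR-POSS label are left out, and the tag names and the function compound_spans are hypothetical.

```python
# Approximation only: this enumerates candidate compounds directly instead of
# simulating the states of the level-1 FSM, whose diagram is not available here.

def compound_spans(tags):
    """Return every (start, end) span matching N+ POSS+, with end exclusive.

    `tags` holds one label per token: "N" for an unmarked noun, "POSS" for a
    POSS-marked noun, and any other label for remaining material.
    """
    spans = []
    for start in range(len(tags)):
        if tags[start] != "N":
            continue                    # a compound must start at an unmarked noun
        i = start
        while i < len(tags) and tags[i] == "N":
            i += 1                      # consume the unmarked nouns
        while i < len(tags) and tags[i] == "POSS":
            i += 1
            spans.append((start, i))    # each POSS-marked noun may close a compound
    return spans


# (15) NATO yaz okulu düzenledi, tagged N N POSS V
words = ["NATO", "yaz", "okulu", "düzenledi"]
for start, end in compound_spans(["N", "N", "POSS", "V"]):
    print(" ".join(words[start:end]))
```

For m unmarked nouns followed by n POSS-marked nouns, each of the m start positions pairs with each of the n end positions, which reproduces the m*n figure given in Section 5.1.3. Choosing among these candidates is exactly the filtering problem that the conclusion leaves to selectional restrictions and extra-grammatical knowledge.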
Acknowledgements

The authors thank David Lebeaux and Piroska Csuri for their help in the analysis of Turkish compounds. We would also like to thank Cem Bozsahin and David Lebeaux for valuable comments on this paper. Of course, the responsibility for any errors and omissions lies with us.

References

Birturk, A. 1998. A Computational Analysis of Turkish using the Government-Binding Approach. Ph.D. thesis, Middle East Technical University, Ankara.

Botha, R.P. 1968. The Function of the Lexicon in Transformational Generative Grammar. Mouton, The Hague.

Botha, R.P. 1980. Word-based Morphology and Synthetic Compounding. Stellenbosch Papers in Linguistics 5, Stellenbosch University.

Chomsky, N. 1957. Syntactic Structures. Mouton, The Hague.

Chomsky, N. 1996. Knowledge of Language. Praeger.

Downing, P. 1977. On the Creation and Use of English Compounds. Language 53.

Fong, S. 1998. The PAPPI Reference Manual. Available at http://www.neci.nj.nec.com/homepages/sandiway/pappi/doc/refman.

Goksel, A. 1993. Levels of Representation and Argument Structure in Turkish. Ph.D. thesis, SOAS.

Hoeksema, J. 1985. Categorial Morphology. Garland, New York.

Johnston, M. and Busa, F. 1999. Qualia Structure and the Compositional Interpretation of Compounds. In Breadth and Depth of Semantic Lexicons, E. Viegas (ed), Kluwer.

Karttunen, L. and Oflazer, K. (eds). 2000. Computational Linguistics: Special Issue on Finite-State Methods in NLP, Vol. 26, No. 1. MIT Press.

Kornai, A. (ed). 1999. Extended Finite State Models of Language. Cambridge University Press.

Lapointe, S.G., Brentari, D.K. and Farrell, P.M. (eds). 1998. Morphology and Its Relation to Phonology and Syntax. CSLI Publications.

Lees, R.B. 1966. The Grammar of English Nominalizations. Bloomington and The Hague.

Levi, J.N. 1978. The Syntax and Semantics of Complex Nominals. Academic Press, NY.

Lewis, G.L. 1967. Turkish Grammar. Oxford Press.

Lewis, H. and Papadimitriou, C. 1998. Elements of the Theory of Computation. 2nd edition. Prentice-Hall.

Partee, B. 1984. Compositionality. In Varieties of Formal Semantics, F. Landman and F. Veltman (eds), Foris Publications.

Pustejovsky, J. 1991. The Generative Lexicon. Computational Linguistics, 17.4.

Roche, E. and Schabes, Y. (eds). 1997. Finite-State Language Processing. MIT Press, Cambridge, MA.

Sehitoglu, O. and Bozsahin, C. 1999. Lexical Rules and Lexical Organization. In Breadth and Depth of Semantic Lexicons, E. Viegas (ed), Kluwer.

Spencer, A. 1991. Morphological Theory. Blackwell Publishers.

Sproat, R. 1992. Morphology and Computation. The MIT Press.

