
Von der Form zur Bedeutung: Texte automatisch verarbeiten - From Form to Meaning: Processing Texts Automatically

Proceedings of the Biennial GSCL Conference 2009

2009
978-3-8233-7511-1
978-3-8233-6511-2
Gunter Narr Verlag 
Christian Chiarcos
Richard Eckart de Castilho
Manfred Stede

The book contains the papers presented at the 2009 biennial conference of the German Society for Computational Linguistics and Language Technology (GSCL), as well as those of the associated 2nd UIMA@GSCL workshop. The main theme of the conference was computational approaches to text processing, and in conjunction with the UIMA (Unstructured Information Management Architecture) workshop, this volume offers an overview of current work on both theoretical and practical aspects of text document processing.

Christian Chiarcos / Richard Eckart de Castilho / Manfred Stede (eds.)
Von der Form zur Bedeutung: Texte automatisch verarbeiten
From Form to Meaning: Processing Texts Automatically
Proceedings of the Biennial GSCL Conference 2009
Gunter Narr Verlag Tübingen

Bibliographic information of the Deutsche Nationalbibliothek: the Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available online at http://dnb.d-nb.de.

© 2009 Narr Francke Attempto Verlag GmbH + Co. KG, Dischingerweg 5, D-72070 Tübingen
This work, including all of its parts, is protected by copyright. Any use outside the narrow limits of copyright law is inadmissible and punishable without the publisher's consent. This applies in particular to reproduction, translation, microfilming, and storage and processing in electronic systems. Printed on chlorine-free, acid-free book paper.
Internet: http://www.narr.de · E-Mail: info@narr.de
Printing and binding: Laupp & Göbel, Nehren
Printed in Germany
ISBN 978-3-8233-6511-2

Table of Contents

Programme Committee ix
Preface xi

Invited Talks
Towards a Large-Scale Formal Semantic Lexicon for Text Processing (Johan Bos) 3
Who Decides What a Text Means? (Graeme Hirst) 15
eChemistry: Science, Citations and Sentiment (Simone Teufel) 17

Main Conference
Normalized (Pointwise) Mutual Information in Collocation Extraction (Gerlof Bouma) 31
Hypernymy Extraction Based on Shallow and Deep Patterns (Tim vor der Brück) 41
Stand-off-Annotation für Textdokumente: Vom Konzept zur Implementierung (Manuel Burghardt, Christian Wolff) 53
Annotating Arabic Words with English Wordnet Synsets (Ernesto William De Luca, Farag Ahmed, Andreas Nürnberger) 61
The Role of the German Vorfeld for Local Coherence (Stefanie Dipper, Heike Zinsmeister) 69
Proposition oder Temporalangabe? Disambiguierung von -ung-Nominalisierungen von verba dicendi in nach-PPs (Kurt Eberle, Gertrud Faaß, Ulrich Heid) 81
"Süße Beklommenheit, schmerzvolle Ekstase" - Automatische Sentimentanalyse in den Werken von Eduard von Keyserling (Manfred Klenner) 93
TMT: Ein Text-Mining-System für die Inhaltsanalyse (Peter Kolb) 101
Integration of Light-Weight Semantics into a Syntax Query Formalism (Torsten Marek) 109
A New Hybrid Dependency Parser for German (Rico Sennrich, Gerold Schneider, Martin Volk, Martin Warin) 115
Dependenz-basierte Relationsextraktion mit der UIMA-basierten Textmining-Pipeline UTEMPL (Jannik Strötgen, Juliane Fluck, Anke Holler) 125
From Proof Texts to Logic (Jip Veldman, Bernhard Fisseni, Bernhard Schröder, Peter Koepke) 137
Social Semantics and Its Evaluation by Means of Closed Topic Models: An SVM-Classification Approach Using Semantic Feature Replacement by Topic Generalization (Ulli Waltinger, Alexander Mehler, Rüdiger Gleim) 147
From Parallel Syntax Towards Parallel Semantics: Porting an English LFG-Based Semantics to German (Sina Zarrieß) 159

Nominations for GSCL Award
Using Linguistic Annotations in Statistical Machine Translation of Film Subtitles (Christian Hardmeier) 173
Robust Processing of Situated Spoken Dialogue (Pierre Lison) 185
Ein Verfahren zur Ermittlung der relativen Chronologie der vorgotischen Lautgesetze (Roland Mittmann) 199

UIMA Workshop
Programme Committee 213
Foreword from the Workshop Chairs 215
LUCAS - A LUCENE CAS Indexer (Erik Faessler, Rico Landefeld, Katrin Tomanek, Udo Hahn) 217
Multimedia Feature Extraction in the SAPIR Project (Aaron Kaplan, Jonathan Mamou, Francesco Gallo, Benjamin Sznajder) 225
TextMarker: A Tool for Rule-Based Information Extraction (Peter Kluegl, Martin Atzmueller, Frank Puppe) 233
ClearTK: A Framework for Statistical Natural Language Processing (Philip V. Ogren, Philipp G. Wetzler, Steven J. Bethard) 241
Abstracting UIMA Types away (Karin Verspoor, William Baumgartner Jr., Christophe Roeder, Lawrence Hunter) 249
Simplifying UIMA Component Development and Testing (Christophe Roeder, Philip V. Ogren, William Baumgartner Jr., Lawrence Hunter) 257
UIMA-Based Focused Crawling (Daniel Trümper, Matthias Wendt, Christian Herta) 261
Annotation Interchange with XSLT (Graham Wilcock) 265

Appendix
List of Contributors 271

Programme Committee

• Maja Bärenfänger, Justus-Liebig-Universität Gießen
• Stefan Busemann, DFKI Saarbrücken
• Irene Cramer, TU Dortmund
• Stefanie Dipper, Ruhr-Universität Bochum
• Anette Frank, Universität Heidelberg
• Roland Hausser, Universität Erlangen-Nürnberg
• Wolfgang Höppner, Universität Duisburg-Essen
• Claudia Kunze, Universität Tübingen
• Lothar Lemnitzer, BBAW Berlin, Universität Tübingen
• Henning Lobin, Justus-Liebig-Universität Gießen
• Alexander Mehler, Universität Bielefeld
• Georg Rehm, vionto GmbH Berlin
• David Schlangen, Universität Potsdam
• Thomas Schmidt, Universität Hamburg
• Ulrich Schmitz, Universität Duisburg-Essen
• Roman Schneider, IDS Mannheim
• Bernhard Schröder, Universität Duisburg-Essen
• Uta Seewald-Heeg, Hochschule Anhalt
• Angelika Storrer, TU Dortmund
• Maik Stührenberg, Universität Bielefeld
• Andreas Witt, IDS Mannheim
• Christian Wolff, Universität Regensburg
• Heike Zinsmeister, Universität Konstanz

Preface
The biennial conferences of the German Society for Computational Linguistics and Language Technology (GSCL) traditionally designate a main conference theme, for which submissions are particularly encouraged. For the 2009 conference, we chose text processing as the main theme, encompassing both the theoretical aspects of ascribing structure to text and the practical issues of computational applications targeting textual information. This choice reflects the research focus of the team in charge of organizing the conference this year: the Applied Computational Linguistics Group at Potsdam University. We are very happy to host the conference on our campus 'Neues Palais', right beside some of the major historical attractions offered by this beautiful city.

For automatic text processing, inter-operability and standardization are of great importance, and recent years have seen a steadily growing interest in UIMA, an architecture for flexibly creating analysis systems capable of dealing with documents representing unstructured information - of which text is a very prominent type. Therefore, this volume also contains the proceedings of the 2nd UIMA@GSCL workshop (one of the seven workshops held at the conference).

We are particularly happy to present three invited talks treating different aspects of the conference theme. Johan Bos presents his work on Boxer, a system that demonstrates how formal semantic analysis - which for a long time had often been seen as a laboratory exercise involving selected "example sentences" - can scale up to the robust analysis of text, building on the output of a state-of-the-art syntactic parser. Graeme Hirst discusses the different perspectives that can (or should) be taken on text meaning and its computation. Finally, Simone Teufel describes an interesting practical application of text meaning analysis: she gives an overview of her work on analyzing scientific papers, focusing on the problems of named-entity recognition and computation of rhetorical document structure. Again, a robust semantic representation plays a central role for both tasks.

Furthermore, this volume includes three short versions of recent students' theses. These are the finalists of the GSCL Award, which is given every two years to the best student thesis. Students or their supervisors can nominate entrants, and a group of referees from the conference's programme committee decides on the best three. These are invited to the conference to present their work, and the winner is then chosen based on both the written text and the oral presentation. While at the time of printing the winner is not yet known, we hereby wish to extend our warm congratulations to all three finalists whose work is presented in this book.

As a last word, we wish to acknowledge the generous support given by the Sonderforschungsbereich 632 Information Structure in terms of funding and helping with the overall organization. For their work in planning and administering the conference, we thank the members of our organizing committee: Annett Esslinger, Peter Kolb and Florian Kuhn. Regarding the production of this proceedings volume in particular, we thank the members of both programme committees for their help with reviewing the submissions; Georg Rehm for providing us with the LaTeX framework; and Andreas Peldszus for patiently proofreading chapters and helping with harmonizing their presentation styles. The editors, however, remain responsible for any errors and oversights.
The editors
Potsdam, August 2009

Invited Talks

Towards a Large-Scale Formal Semantic Lexicon for Text Processing *
Johan Bos
Department of Computer Science, University of Rome "La Sapienza", Italy
bos@di.uniroma1.it

* Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 3-14.

Abstract. We define a first-order fragment of Discourse Representation Theory (DRT) and present a syntax-semantics interface based on Combinatory Categorial Grammar (CCG) (with five basic categories N, NP, S, PP and T) for a large fragment of English. In order to incorporate a neo-Davidsonian view of event semantics in a compositional DRT framework we argue that the method of continuation is the best approach to deal with event modifiers. Within this theoretical framework, the lexical categories of CCGbank were associated with the formal semantic representations of our DRT variant, reaching a high coverage from which practical text processing systems can benefit.

1 Introduction

Formal approaches in semantics have long been restricted to small or medium-sized fragments of natural language grammars. In this article we present a large-scale, formally specified lexicon backed up by a model-theoretic semantics. The aim of the lexicon is to provide wide-coverage parsers with the possibility to produce interpretable structures in a robust yet principled way.

The semantic lexicon that we present is situated in Discourse Representation Theory (Kamp & Reyle, 1993), and it follows to a great extent the theory as formulated by its originator Hans Kamp and colleagues. However, it deviates on certain points, as it comprises:

• a neo-Davidsonian view on representing event structures;
• a syntax-semantics interface based on categorial grammar and type-theory;
• a DRS language compatible with first-order logic.

In a neo-Davidsonian take on event semantics, events are first-order entities characterised by one-place predicate symbols. An inventory of thematic roles, encoded as two-place relations between the events and the subcategorised arguments or modifiers, completes this picture. We choose this way of representing events because it yields a more consistent analysis of event structure. As inventory of thematic roles we take VerbNet (Kipper et al., 2008).

As a preliminary to semantics, we need syntax. The syntax-semantics interface illustrated here is based on a categorial grammar, namely Combinatory Categorial Grammar, or CCG for short (Steedman, 2001). A categorial grammar lends itself extremely well to this task because it is lexically driven and has only a few "grammar" rules, and not least because of its type-transparency principle, which says that each syntactic type (a grammar category) corresponds to a unique semantic type. The existence of CCGbank (Hockenmaier, 2003), a large collection of CCG derivations for a newswire corpus, and the availability of robust parsers trained on it (Clark & Curran, 2004), make CCG a practically motivated choice as well.

The choice of first-order logic for the semantic interpretation of text is a restriction in terms of expressiveness. However, it opens the way to implement logical reasoning in practical systems by including automated deduction tools such as theorem provers and finite model builders for inference tasks such as consistency and informativeness checking (Blackburn & Bos, 2005).
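To make this combination of choices concrete: under a neo-Davidsonian analysis, a simple sentence such as "A manager smoked" (a running example later in this paper) ends up, after translation to first-order logic, as a formula along the following lines. This is an illustrative sketch only; the role label Agent stands in for whichever VerbNet role would actually be assigned.

    \[
      \exists x\,\exists e\,\bigl(\mathit{manager}(x) \wedge \mathit{smoke}(e) \wedge \mathit{Agent}(e,x)\bigr)
    \]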
In the scope of this article we are primarily interested in defining proper semantic representations for lexical categories. Hence resolving scope ambiguities, word sense disambiguation, thematic role labelling, and anaphora resolution are tasks outside the scope of this article. They are, of course, essential in a complete system for computational semantics, but these are tasks orthogonal to the objectives of producing a formally grounded semantic lexicon. Instead, the challenges in the context of this article are providing well-formed, interpretable, lexical semantic representations for a broad variety of linguistic phenomena, and doing this on a large-scale, producing a lexicon suited for wide-coverage, high precision natural language processing grammars for open-domain text analysis. 2 Discourse Representation Theory (DRT) DRT is a theory of natural language meaning, and was originally designed by Kamp to study the relationship between indefinite noun phrases and anaphoric pronouns as well as temporal relations (Kamp, 1981). Since its publication in the early 1980s, DRT has grown in depth and coverage, establishing itself as a well-documented formal theory of meaning, dealing with a stunning number of semantic phenomena ranging from pronouns, abstract anaphora, presupposition, tense and aspect, propositional attitudes, ellipsis, to plurals (Kamp & Reyle, 1993; Klein, 1987; van der Sandt, 1992; Asher, 1993; van Eijck & Kamp, 1997; Geurts, 1999; Kadmon, 2001). DRT can be divided into three components. The central component is a formal language defining Discourse Representation Structures (DRSs), the meaning representations for texts. The second component deals with the semantic interpretation of DRSs. The third component constitutes an algorithm that systematically maps natural language text into DRSs, the syntax-semantics interface. Let’s consider these components in more detail in our version of DRT. One of the main principles of the theory is that a DRS can play both the role of semantic content, and the role of discourse context (van Eijck & Kamp, 1997). The content gives us the precise model-theoretic meaning of a natural language expression, and the context it sets up aids in the interpretation of subsequent anaphoric expressions occuring in the discourse. A key ingredient of a DRS is the discourse referent, a concept going back to at least the work of Karttunen (1976). A discourse referent is an explicit formal object storing individuals introduced by the discourse, along Towards a Large-Scale Formal Semantic Lexicon 5 with its properties, for future reference. The recursive structure of DRSs determines which discourse referents are available for anaphoric binding. Semantic interpretation of DRSs is carried out by translation to first-order logic. The DRS language employed in our large-scale lexicon is nearly identical to that formulated in Kamp & Reyle (1993). It is, on the one hand, more restrictive, leaving out the so-called duplex conditions because they do not all permit a translation to first-order logic. Our DRS language forms, on the other hand, an extension, as it includes a number of modal operators on DRSs not found in Kamp & Reyle (1993) nor van Eijck & Kamp (1997). This DRS language is known to have a translation to ordinary first-order formulas. Examples of such translations are given in Kamp & Reyle (1993), Muskens (1996), and Blackburn et al. (2001), disregarding the modal operators. A translation incorporating the modal operators is given by Bos (2004). 
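As a reminder of the flavour of this translation, its core clauses for the non-modal fragment run roughly as follows, writing a basic DRS linearly as [x1 ... xn | γ1, ..., γm]. This is a standard sketch in the spirit of Kamp & Reyle (1993) and Blackburn & Bos (2005), not reproduced from this paper.

    \[
    \begin{array}{rcl}
    ([\,x_1 \ldots x_n \mid \gamma_1, \ldots, \gamma_m\,])^{fo} & = &
      \exists x_1 \ldots \exists x_n\,(\gamma_1^{fo} \wedge \ldots \wedge \gamma_m^{fo})\\[2pt]
    (\neg B)^{fo} & = & \neg\, B^{fo}\\[2pt]
    ([\,x_1 \ldots x_n \mid \gamma_1, \ldots, \gamma_m\,] \Rightarrow B)^{fo} & = &
      \forall x_1 \ldots \forall x_n\,\bigl((\gamma_1^{fo} \wedge \ldots \wedge \gamma_m^{fo}) \rightarrow B^{fo}\bigr)
    \end{array}
    \]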
We won’t give the translation in all its detail here, as it is not the major concern of this article, and interested readers are referred to the articles cited above. Various techniques have been proposed to map natural language text into DRS: the top-down algorithm (Kamp & Reyle, 1993), DRS-threading (Johnson & Klein, 1986), or compositional methods (Zeevat, 1991; Muskens, 1996; Kuschert, 1999; van Eijck & Kamp, 1997; de Groote, 2006). We will follow the latter tradition, because it enables a principle way of constructing meaning permitting broad coverage — it also fits very well with the grammar formalism of our choice, CCG. Hence, in addition to the formal language of DRSs, we need a “glue language” for partial DRSs — we will define one using machinery borrowed from type theory. This will give us a formal handle on the construction of DRSs in a bottom-up fashion, using function application and β-conversion to reduce complex expressions into simpler ones. Let’s turn to the definition of partial DRSs. 1 The basic semantic types in our inventory are e (individuals) and t (truth value) 2 . The set of all types is recursively defined in the usual way: if τ 1 and τ 2 are types, then so is 〈 τ 1 , τ 2 〉 , and nothing except the basic types or what can be constructed via the recursive rule are types. Expressions of type e are either discourse referents, or variables. Expressions of type t are either basic DRSs, DRSs composed with the merge (; ), or DRSs formed by function application (@): <exp e > : : = <ref> | <var e > <exp t > : : = <ref> ∗ <condition> ∗ | (<exp t >; <exp t >) | (<exp 〈α,t〉 > @ <exp α >) Following the conventions in the DRT literature, we will visualise DRSs in their usual box-like format. As the definition above shows, basic DRSs consist of two parts: a set of discourse referents, and a set of conditions. The discourse referents can be seen as a record of topics mentioned in a sentence or text. The conditions tell us how the discourse referents relate to each other, and put further semantic constraints on their interpretation. We distinguish between basic and complex conditions. The basic conditions express properties of discourse referents or relations between them: <condition> : : = <basic> | <complex> <basic> : : = <sym 1 >(<exp e >) | <sym 2 >(<exp e >,<exp e >) | <exp e >=<exp e > | card(<exp e >)=<num> | named(<exp e >,<sym 0 >) 1 We employ Backus-Naur form in the definitions following. In using this notation, non-terminal symbols are enclosed in angle brackets, choices are marked by vertical bars, and the asterix suffix denotes zero or more repeating items. 2 Type t corresponds to truth value in static logic, however in a dynamic logic like DRT one might want to read it as a state transition, following van Eijck & Kamp (1997). 6 Johan Bos Here <sym n > denotes an n-place predicate symbol, and <num> a cardinal number. Most nouns and adjectives introduce a one-place relation; prepositions, modifiers, and thematic roles introduce two-place relations. (In our neo-Davidsonian DRS language we don’t need ternary or higher-place relations.) The cardinality condition is used for counting quantifiers, the naming condition for proper names. The equality condition explicitly states that two discourse referent denote the same individual. Now we turn to complex conditions. For convenience, we split them into unary and binary complex conditions. 
Now we turn to complex conditions. For convenience, we split them into unary and binary complex conditions. The unary complex conditions have one DRS as argument and cover negation, the modal operators expressing possibility and necessity, and a "hybrid" condition connecting a discourse referent with a DRS. The binary conditions have two DRSs as arguments and form implicational, disjunctive, and interrogative conditions:

  <complex> ::= <unary> | <binary>
  <unary>   ::= ¬<exp_t> | ◇<exp_t> | □<exp_t> | <ref>:<exp_t>
  <binary>  ::= <exp_t> ⇒ <exp_t> | <exp_t> ∨ <exp_t> | <exp_t> ? <exp_t>

The unary complex conditions are mostly activated by negation particles or modal adverbs. The hybrid condition is used for the interpretation of verbs expressing propositional content. The binary complex conditions are triggered by conditional statements, certain determiners, coordination, and questions.

Finally, we turn to expressions with complex types. Here we have three possibilities: variables, λ-abstraction, and function application:

  <exp_〈α,β〉> ::= <var_〈α,β〉> | λ<var_α>.<exp_β> | (<exp_〈γ,〈α,β〉〉> @ <exp_γ>)

To graphically distinguish between the various types of variables we will make use of different (indexed) letters denoting different types (Table 1). This table also illustrates the language of partial DRSs with some first examples.

Table 1: Examples of (partial) DRSs

  Type        Variable                Expression                                   Example
  e           x, x1, x2, ...
              e, e1, e2, ...
  t                                   [x e | manager(x), smoke(e), theme(e,x)]     a manager smoked
  〈e,t〉       p, p1, p2, ...          λx.[ | company(x)]                           company
  〈〈e,t〉,t〉   n, n1, n2, ...          λp.([x | bank(x)] ; (p@x))                   a bank

Note that variables can range over all possible types, except type t. This restriction permits us to use the standard definition of β-conversion to avoid accidental capturing of free variables by forcing the functor to undergo the process of α-conversion. Applying a functor expression to an argument expression of type t could result in breaking existing bindings, because DRSs can bind variables outside their syntactic scope. Since we don't have variables ranging over type t in our language of partial DRSs, this can and will never happen. In practical terms, the syntax-semantics interface is not affected by this limitation, and therefore the price to pay for this restriction of expressiveness is low.
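As a small worked example of how function application and β-conversion operate on partial DRSs, consider applying the entry for "a bank" from Table 1 to an illustrative property; the predicate small is our own toy example and not taken from the paper.

    \[
    \begin{array}{l}
    (\lambda p.([\,x \mid \mathit{bank}(x)\,]\,;\,(p\,@\,x)) \;@\; \lambda y.[\;\mid \mathit{small}(y)\,])\\[2pt]
    \quad\Rightarrow_{\beta}\;\; ([\,x \mid \mathit{bank}(x)\,]\,;\,(\lambda y.[\;\mid \mathit{small}(y)\,]\,@\,x))\\[2pt]
    \quad\Rightarrow_{\beta}\;\; ([\,x \mid \mathit{bank}(x)\,]\,;\,[\;\mid \mathit{small}(x)\,])
    \end{array}
    \]

Resolving the merge then yields the single DRS [x | bank(x), small(x)].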
3 Combinatory Categorial Grammar (CCG)

Every semantic analysis presupposes a syntactic analysis that tells us how meaning representations of smaller expressions can be combined into meaning representations of larger expressions. The theory of syntax that we adopt is CCG, a member of the family of categorial grammars (Steedman, 2001). In a categorial grammar, all constituents, both lexical and composed ones, are associated with a syntactic category. Categories are either basic categories or functor categories. Functor categories indicate the nature of their arguments and their directionality: whether they appear on the left or on the right. In CCG, directionality of arguments is indicated within the functor category by slashes: a forward slash / tells us that an argument is to be found on its immediate right; a backward slash \ that it is to be found on its immediate left. The inventory of basic categories in a CCG grammar usually comprises S, N, PP, and NP, denoting categories of type sentence, noun, prepositional phrase, and noun phrase, respectively (Steedman, 2001). Functor categories are recursively defined over categories: if α is a category and β is a category, then so are (α/β) and (α\β).³ Note that, in theory, there is no limit to the number of categories that we are able to generate. In practice, however, given the way natural language works, the number of arguments in a functor category rarely exceeds four.

³ Outermost brackets of a functor category are usually suppressed.

Consider for instance S\NP, a functor category looking for a category of type NP on its immediate left, yielding a category of type S. This category, of course, corresponds to what traditionally would be called an intransitive verb or verb phrase in a grammar for English. Other examples of functor categories are N/N, a pre-nominal adjective, and (S\NP)/NP, a transitive verb. These simple examples demonstrate a crucial concept of a categorial grammar: functor categories encode the subcategorisation information directly and straightforwardly. All such syntactic dependencies, local as well as long-range, are specified in the categories corresponding to lexical phrases. This process of "lexicalisation" is a trademark property of categorial grammars and manifests itself in a lexicon exhibiting a large variety of categories complemented with a small set of grammar rules — the essence of a categorial grammar.

The simplest categorial grammar, pure categorial grammar, has just two combinatory rules: forward and backward application. CCG can be viewed as a generalisation of categorial grammar, and the initial C it is blessed with is due to the variety of combinatory rules that it adds to the grammar of pure categorial grammar. The complete collection of binary combinatory rules in CCG consists of the application rules, the rules for composition and their crossing variants (including the generalised versions), and the substitution rules (not covered in this article). In addition, CCG comes with rules for type raising. Some variations of CCG have special rules for dealing with coordination.

3.1 The Application Rules

Forward application (> in Steedman's notation) and backward application (<) are the two basic combinatory rules of classic categorial grammar. Below we give the schemata of these rules, using a colon to associate a category with its semantic interpretation. Following the partial-DRS language defined in the previous section, we use @ to denote function application.

  X/Y: φ    Y: ψ    ⇒    X: (φ@ψ)    (>)
  Y: φ    X\Y: ψ    ⇒    X: (ψ@φ)    (<)

Here X and Y are variables ranging over CCG categories, and φ and ψ are partial DRSs. The > rule can be read as follows: given an expression of category X/Y with interpretation φ, followed by an expression of category Y with interpretation ψ, we can deduce a new category X, with interpretation φ applied to ψ. Along the same lines, the < rule can be paraphrased as follows: given a category X\Y with interpretation ψ, preceded by a category Y with interpretation φ, we can deduce a new category X, with interpretation ψ applied to φ.
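The two application rules are mechanical enough to be stated in a few lines of code. The following Python sketch is our own illustration, not the paper's implementation; the interpretations are modelled as arbitrary Python values and callables standing in for partial DRSs, with @ rendered as function application.

    # A minimal sketch of forward/backward application over CCG categories.
    from dataclasses import dataclass
    from typing import Callable, Optional, Tuple, Union

    @dataclass(frozen=True)
    class Basic:
        name: str                      # e.g. "S", "NP", "N", "PP"

    @dataclass(frozen=True)
    class Functor:
        result: "Category"
        slash: str                     # "/" (argument to the right) or "\\" (to the left)
        argument: "Category"

    Category = Union[Basic, Functor]
    Sign = Tuple[Category, object]     # a category paired with its interpretation

    def forward_apply(left: Sign, right: Sign) -> Optional[Sign]:
        """X/Y: phi  +  Y: psi  =>  X: (phi @ psi)."""
        cat, phi = left
        if isinstance(cat, Functor) and cat.slash == "/" and cat.argument == right[0]:
            return cat.result, phi(right[1])
        return None

    def backward_apply(left: Sign, right: Sign) -> Optional[Sign]:
        """Y: phi  +  X\\Y: psi  =>  X: (psi @ phi)."""
        cat, psi = right
        if isinstance(cat, Functor) and cat.slash == "\\" and cat.argument == left[0]:
            return cat.result, psi(left[1])
        return None

    # Example: an intransitive verb S\NP applied backward to its NP subject.
    NP, S = Basic("NP"), Basic("S")
    subject = (NP, "NP-semantics")                       # placeholder interpretation
    verb = (Functor(S, "\\", NP), lambda np: ("smoke", np))
    print(backward_apply(subject, verb))
    # -> (Basic(name='S'), ('smoke', 'NP-semantics'))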
3.2 The Composition Rules

Here we have forward and backward composition, rules that were originally introduced to deal with various cases of natural language coordination. Steedman refers to these rules as >B and <B, named after the Bluebird from Raymond Smullyan's tale of the "logical forest" and its feathered friends.

  X/Y: φ    Y/Z: ψ    ⇒    X/Z: λx.(φ@(ψ@x))    (>B)
  Y\Z: φ    X\Y: ψ    ⇒    X\Z: λx.(ψ@(φ@x))    (<B)

The >B rule can be paraphrased as follows: given an expression of category X/Y, we are looking for an expression of category Y on our right, but find an expression of category Y/Z instead; then we can deduce X/Z (because, after all, we did encounter Y, if we can promise to find a Z later as well). A similar explanation can be given for the <B rule, reversing the direction of subcategorisation. Semantically, something interesting is happening here. Since we haven't found Z yet, we need to postpone its semantic application. What we do is introduce the variable x, apply the interpretation of Y/Z (or Y\Z in the <B rule) to it, and then abstract over it again using the λ-operator.

Both composition rules have so-called "crossing" variants: backward crossing composition (<Bx) and forward crossing composition (>Bx). Backward crossing occurs in cases of adjunctal modification in English. Forward crossing is not needed for an English grammar, but supports languages with freer word order such as Italian. The crossing rules are specified as follows:

  X/Y: φ    Y\Z: ψ    ⇒    X\Z: λx.(φ@(ψ@x))    (>Bx)
  Y/Z: φ    X\Y: ψ    ⇒    X/Z: λx.(ψ@(φ@x))    (<Bx)

3.3 The Generalised Composition Rules

The composition rules also have generalised variants, in order to guarantee a larger variety of modification types. Steedman introduces a technical device to abstract over categories, the $, a placeholder for a bounded number of directional arguments (Steedman, 2001).

  X/Y: φ    (Y/Z)$: ψ    ⇒    (X/Z)$: λx⃗.(φ@(ψ@x⃗))    (>B$)
  (Y\Z)$: φ    X\Y: ψ    ⇒    (X\Z)$: λx⃗.(ψ@(φ@x⃗))    (<B$)
  X/Y: φ    (Y\Z)$: ψ    ⇒    (X\Z)$: λx⃗.(φ@(ψ@x⃗))    (>Bx$)
  (Y/Z)$: φ    X\Y: ψ    ⇒    (X/Z)$: λx⃗.(ψ@(φ@x⃗))    (<Bx$)

Here x⃗ is a vectorised variable, used as short-hand notation for the number of abstractions and applications required, respectively. This number, of course, depends on the number of arguments of the functor category that is involved.

3.4 Type Raising

CCG also comes with two type raising rules, forward and backward type raising. Steedman calls these >T and <T, respectively, after Smullyan's Thrush. They are unary rules, and effectively what they do is change an argument category into a functor category. The standard CCG example illustrating the need for type raising is non-constituent coordination such as right-node raising, a phenomenon that can be accounted for in CCG by combining type raising with forward composition. The type raising rules are defined as follows:

  X: φ    ⇒    Y/(Y\X): λx.(x@φ)    (>T)
  X: φ    ⇒    Y\(Y/X): λx.(x@φ)    (<T)

Type-raising is an extremely powerful mechanism, because it can generate an infinitely large number of new categories. Usually, in practical grammars, restrictions are put on the use of this rule and the categories it can apply to.
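As a small illustration of how type raising and composition interact in the right-node-raising analyses mentioned above, a subject NP can be raised and then composed with a transitive verb that is still waiting for its object. This is a sketch only; φ and ψ stand for the subject's and the verb's partial DRSs, and the final interpretation is shown after β-conversion.

    \[
    \begin{array}{l}
    \mathrm{NP}:\ \varphi \;\;\Rightarrow_{>T}\;\; \mathrm{S}/(\mathrm{S}\backslash\mathrm{NP}):\ \lambda x.(x\,@\,\varphi)\\[4pt]
    \mathrm{S}/(\mathrm{S}\backslash\mathrm{NP}):\ \lambda x.(x\,@\,\varphi)\quad
    (\mathrm{S}\backslash\mathrm{NP})/\mathrm{NP}:\ \psi
    \;\;\Rightarrow_{>B}\;\; \mathrm{S}/\mathrm{NP}:\ \lambda z.((\psi\,@\,z)\,@\,\varphi)
    \end{array}
    \]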
4 Building a Formal Semantic Lexicon

The method we follow to construct a large-scale semantic lexicon has two parts: a theoretical part, defining a mapping between the base categories of CCG and semantic types; and a practical part, assigning to each lexical category a suitable partial DRS of the right type, guided by CCG's type transparency principle (Steedman, 2001). As we have seen in the previous section, the combinatory rules of CCG are equipped with a direct semantic interpretation, completing the syntax-semantics interface.

Recall that the main basic categories in CCG are N (noun), S (sentence), NP (noun phrase) and PP (prepositional phrase). In addition we will take on board the basic category T (text) and motivate this addition below. Our main task here is to map these basic categories to the two basic semantic types we have at our disposal: e (entities) and t (truth value).

4.1 The Category N

The CCG category N is assigned the semantic type 〈e,t〉, a type corresponding to properties. This makes sense, as what nouns essentially do is express kinds of properties. This decision allows us to view some first examples of associating lexical items with λ-DRSs. So let's consider the word squirrel of category N, and the adjective red of category N/N, and their semantic representations:

  N: 〈e,t〉: λx.[ | squirrel(x)]
  N/N: 〈〈e,t〉,〈e,t〉〉: λp.λx.([ | red(x)] ; (p@x))

4.2 The Category PP

We also assign the type 〈e,t〉 to the category PP. It is perhaps confusing and misleading to connect two different syntactic categories with the same semantic type (because it seems to obscure CCG's principle of type transparency (Steedman, 2001)). However, from a semantic perspective, both N and PP denote properties, and it seems just to assign them both 〈e,t〉 (see the example for at a table below). One could attempt to make a semantic distinction by sorting entities into individuals and eventualities, because in CCG a PP usually plays the role of a subcategorised argument of a verb. But this wouldn't lead us anywhere, as the PP could, in principle, also be an argument of a relational noun, as illustrated below for wife:

  PP: 〈e,t〉: λx1.[x2 | table(x2), at(x1,x2)]
  N/PP: 〈〈e,t〉,〈e,t〉〉: λp.λx1.([x2 | person(x1), wife(x2), role(x1,x2)] ; (p@x2))

4.3 The Category NP

At first thought it seems to make sense to associate the type e with the category NP, as for example Steedman (2001) does. After all, a noun phrase denotes an entity. However, this is not what we propose, because it would require a type-raised analysis in the lexicon for determiners to get proper meaning representations for quantifiers. In CCGbank (Hockenmaier, 2003), on which our practical implementation is based, determiners such as every are categorised as NP/N rather than their type-raised variants (T/(T\NP))/N and (T\(T/NP))/N, which would permit us to assign the type e to NP and still give a proper generalised quantifier interpretation. The motivation of CCGbank to refrain from doing this is obviously of a practical nature: it would increase the number of categories drastically. Nevertheless, by associating the category NP with the type 〈〈e,t〉,t〉 we are able to yield the required interpretation for determiners. Essentially, what we are proposing is a uniform type-raised analysis for NPs, denoting functions from properties to truth values. This is, of course, an idea that goes back to Montague's formalisation of a fragment of English grammar (Montague, 1973), illustrated by the following example for someone:

  NP: 〈〈e,t〉,t〉: λp.([x | person(x)] ; (p@x))
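Under this type assignment, determiners of category NP/N can be given lexical entries directly. Purely as an illustration (these two entries are our own reconstruction in the spirit of the scheme, not quoted from the paper), an indefinite and a universal determiner might look as follows:

    \[
    \begin{array}{lll}
    \textit{a/an} & \mathrm{NP}/\mathrm{N}:\ \langle\langle e,t\rangle,\langle\langle e,t\rangle,t\rangle\rangle: &
      \lambda n.\lambda p.\,(([\,x \mid\;]\,;\,(n\,@\,x))\,;\,(p\,@\,x))\\[4pt]
    \textit{every} & \mathrm{NP}/\mathrm{N}:\ \langle\langle e,t\rangle,\langle\langle e,t\rangle,t\rangle\rangle: &
      \lambda n.\lambda p.\,[\;\mid\ (([\,x \mid\;]\,;\,(n\,@\,x)) \Rightarrow (p\,@\,x))\,]
    \end{array}
    \]

Applying the first entry to λy.[ | person(y)] and resolving the merge reproduces the representation given above for someone; the second uses the implicational condition of Section 2, which the text notes is triggered by certain determiners.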
4.4 The Category S

Now we turn to the category for sentences, S. Normally, one would associate the type t with S: after all, sentences correspond to propositions and therefore denote truth values. Yet, the semantics of events that we will assign to verbal structures, following Davidson, requires us to connect the category S with the type 〈〈e,t〉,t〉. As this is slightly unconventional, we will motivate this choice in what follows.

The neo-Davidsonian approach that we are following yields DRSs with discourse referents denoting event-like entities. For instance, the DRS for the sentence a manager smoked would be similar to the one shown in Table 1. This would be a fine DRS. But what if we continue the sentence with a modifier, as in a manager smoked in a bar, or with a sequence of modifiers, as in a manager smoked in a bar yesterday? As we wouldn't have any λ-bound variables at our disposal, it would be impossible to connect the modifiers in a bar and yesterday to the correct discourse referent for the event (e). In CCG, such modifiers would typically be of category S\S or S/S (i.e. sentence modifiers) and (S\NP)\(S\NP) or (S\NP)/(S\NP) (i.e. verb phrase modifiers).

It is not difficult to see that associating a DRS with category S would lead us astray from the principles of compositional semantics when trying to maintain a neo-Davidsonian first-order modelling of event structures. We would need an ad-hoc mechanism to ensure the correct binding of the event discourse referent. Several "tricks" could be performed here. Event discourse referents could always have the same name (say e0), and modifiers would be represented with a free variable of the same name. This would require a modification of the definition of closed expressions and variable renaming — it would also exclude a DRS with two different event discourse referents. Another possibility is to treat modifiers as anaphoric, linking to the event discourse referent closest in proximity. This would establish a dichotomy in the analysis of modifiers, and would moreover require modification of the β-conversion procedure too, as variables ranging over type t would be introduced. No further comments are needed — such ad-hoc methods aren't welcome in any systematic syntax-semantics interface.

There are two basic directions to surmount this problem while maintaining a compositional system. The first is to abstract over the event variable, and introduce the discourse referent when the parse of the sentence is completed. We call this the method of delay, and it would yield the type 〈e,t〉 for the S category. The second approach is to introduce the discourse referent for the event in the lexicon, but reserve a place for any modifier that comes along later in the parse. This option, which we dub the method of continuation, would yield the type 〈〈e,t〉,t〉 for S. We show the partial DRSs for both options, the method of delay first and the method of continuation second:

  λe.[x | manager(x), smoke(e), agent(e,x)]
  λp.([x e | manager(x), smoke(e), agent(e,x)] ; (p@e))

Is there a practical difference between the two options? Yes, there is. The delayed approach (shown first) will always introduce the discourse referent at the outermost level of the DRS. This will give the incorrect prediction for sentences with scopal operators such as quantifiers in subject position. The continuation approach (shown second) doesn't suffer from this limitation, as the discourse referent is already introduced. Moreover, the continuation approach also gives us control over where in the DRS modifiers are situated, which the delay approach doesn't. This motivates us to favour the continuation approach.
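Taken together, the category-to-type assignments of this section and CCG's type-transparency principle suggest a simple recursive mapping from syntactic categories to semantic types. The following Python sketch is our own illustration (not the paper's code); it anticipates the type t for the category T argued for in Section 4.5 and assumes fully bracketed CCGbank-style category strings.

    # Map CCG categories to the semantic types used in this paper (a sketch).
    BASIC_TYPES = {
        "N": ("e", "t"),            # nouns: properties, <e,t>
        "PP": ("e", "t"),           # subcategorised PPs: also <e,t>
        "NP": (("e", "t"), "t"),    # noun phrases: generalised quantifiers, <<e,t>,t>
        "S": (("e", "t"), "t"),     # sentences: continuation-style, <<e,t>,t>
        "T": "t",                   # texts (Section 4.5 below): plain DRSs, type t
    }

    def balanced(s: str) -> bool:
        depth = 0
        for ch in s:
            depth += ch == "("
            depth -= ch == ")"
            if depth < 0:
                return False
        return depth == 0

    def semantic_type(category: str):
        """Very simplified: strip one pair of outer brackets, split on the single
        top-level slash, ignore features such as [dcl]; a real implementation
        would use a proper category parser."""
        category = category.strip()
        if category.startswith("(") and category.endswith(")") and balanced(category[1:-1]):
            category = category[1:-1]
        depth = 0
        for i, ch in enumerate(category):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch in "/\\" and depth == 0:
                result, argument = category[:i], category[i + 1:]
                # functor X/Y or X\Y: a function from type(Y) to type(X)
                return (semantic_type(argument), semantic_type(result))
        return BASIC_TYPES[category.split("[")[0]]

    print(semantic_type("S[dcl]\\NP"))
    # -> ((('e', 't'), 't'), (('e', 't'), 't')), i.e. <<<e,t>,t>,<<e,t>,t>>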
To illustrate the impact of the method of continuation on the lexical categories, consider the partial DRS for the intransitive verb smoked:

  (S[dcl]\NP): 〈〈〈e,t〉,t〉,〈〈e,t〉,t〉〉: λn1.λp2.(n1@λx3.([e4 | smoke(e4), agent(e4,x3)] ; (p2@e4)))

4.5 The Category T

The category T (for text) corresponds to the semantic type t and, as a consequence, to an ordinary DRS (or an expression that can be reduced to a DRS) which can be translated to first-order logic. It is not usually a lexical category, but is typically introduced by punctuation symbols that signal the end of the sentence. For instance, the full stop maps to a CCG category that turns a sentence into a text: T\S. This makes sense from a linguistic point of view, because a full stop (or other punctuation symbols such as exclamation and question marks) signals that the sentence is finished and no more sentence or verb phrase modifiers will be encountered.

5 Practical Results

Our inventory of lexical categories is based on those found in CCGbank (Hockenmaier, 2003) and used by the C&C parser for CCG (Clark & Curran, 2004). CCGbank is a version of the Penn Treebank (Marcus et al., 1993) and consists of a set of CCG derivations covering 25 sections of texts taken from the Wall Street Journal, comprising a total of over one million tagged words. CCGbank contains nearly 49,000 sentences annotated with a total of 1,286 different CCG categories, of which 847 appear more than once.

Given the theoretical foundations presented in the previous sections, we manually encoded the majority of the lexical categories found in CCGbank with partial DRSs (categories not used by the C&C parser, and some extremely rare categories, were not taken into account). Even though there is a one-to-one mapping between the CCG categories and semantic types — and this must be the case to ensure the semantic composition process proceeds without type clashes — the actual ingredients of a partial DRS can differ even within the scope of a single CCG category. A case in point is the lexical category N/N, which can correspond to an adjective, a superlative, a cardinal expression, or even common nouns and proper names (in compound expressions). In the latter two cases the lexical entry introduces a new discourse referent, in the former cases it does not. To deal with these differences we also took into account the part of speech assigned by the C&C parser to a token to determine an appropriate partial DRS.

This system for semantic interpretation was implemented and, when used in combination with the C&C parser, achieves a coverage of more than 99% on re-parsed Wall Street Journal texts, and similar figures on newswire text. By coverage we mean here the percentage of sentences for which a well-formed DRS could be produced (not necessarily a DRS that is semantically adequate). The robustness of the overall framework for deep text processing is good enough to make it usable for practical tasks with non-restricted domains, such as open-domain question answering (Bos et al., 2007) and recognising textual entailment (Bos & Markert, 2005).

Summing up: formal semantics isn't a paper-and-pencil game anymore, nor is it limited to implementations of toy fragments of natural language. It has genuinely matured to a level useful for applications in the real world.

References

Asher, N. (1993). Reference to Abstract Objects in Discourse. Kluwer Academic Publishers.
Blackburn, P. & Bos, J. (2005). Representation and Inference for Natural Language: A First Course in Computational Semantics. CSLI.
Blackburn, P., Bos, J., Kohlhase, M., & de Nivelle, H. (2001). Inference and Computational Semantics. In H. Bunt, R. Muskens, & E. Thijsse, editors, Computing Meaning Vol. 2, pages 11-28. Kluwer.
Bos, J. (2004). Computational Semantics in Discourse: Underspecification, Resolution, and Inference. Journal of Logic, Language and Information, 13(2), 139-157.
Bos, J. & Markert, K. (2005). Recognising Textual Entailment with Logical Inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 628-635.
Bos, J., Guzzetti, E., & Curran, J. R. (2007). The Pronto QA System at TREC 2007: Harvesting Hyponyms, Using Nominalisation Patterns, and Computing Answer Cardinality. In E. Voorhees & L. P. Buckland, editors, Proceedings of the Sixteenth Text REtrieval Conference, TREC-2007, pages 726-732, Gaithersburg, MD.
Clark, S. & Curran, J. (2004). Parsing the WSJ using CCG and Log-Linear Models. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL '04), Barcelona, Spain.
de Groote, P. (2006). Towards a Montagovian account of dynamics. In Proceedings of Semantics and Linguistic Theory XVI (SALT 16).
Geurts, B. (1999). Presuppositions and Pronouns. Elsevier, London.
Hockenmaier, J. (2003). Data and Models for Statistical Parsing with Combinatory Categorial Grammar. Ph.D. thesis, University of Edinburgh.
Johnson, M. & Klein, E. (1986). Discourse, anaphora and parsing. In 11th International Conference on Computational Linguistics, Proceedings of Coling '86, pages 669-675, University of Bonn.
Kadmon, N. (2001). Formal Pragmatics. Blackwell.
Kamp, H. (1981). A Theory of Truth and Semantic Representation. In J. Groenendijk, T. M. Janssen, & M. Stokhof, editors, Formal Methods in the Study of Language, pages 277-322. Mathematical Centre, Amsterdam.
Kamp, H. & Reyle, U. (1993). From Discourse to Logic: An Introduction to Modeltheoretic Semantics of Natural Language, Formal Logic and DRT. Kluwer, Dordrecht.
Karttunen, L. (1976). Discourse Referents. In J. McCawley, editor, Syntax and Semantics 7: Notes from the Linguistic Underground, pages 363-385. Academic Press, New York.
Kipper, K., Korhonen, A., Ryant, N., & Palmer, M. (2008). A large-scale classification of English verbs. Language Resources and Evaluation, 42(1), 21-40.
Klein, E. (1987). VP Ellipsis in DR Theory. In Studies in Discourse Representation Theory and the Theory of Generalised Quantifiers.
Kuschert, S. (1999). Dynamic Meaning and Accommodation. Ph.D. thesis, Universität des Saarlandes.
Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313-330.
Montague, R. (1973). The proper treatment of quantification in ordinary English. In J. Hintikka, J. Moravcsik, & P. Suppes, editors, Approaches to Natural Language, pages 221-242. Reidel, Dordrecht.
Muskens, R. (1996). Combining Montague Semantics and Discourse Representation. Linguistics and Philosophy, 19, 143-186.
Steedman, M. (2001). The Syntactic Process. The MIT Press.
van der Sandt, R. (1992). Presupposition Projection as Anaphora Resolution. Journal of Semantics, 9, 333-377.
van Eijck, J. & Kamp, H. (1997). Representing Discourse in Context. In J. van Benthem & A. ter Meulen, editors, Handbook of Logic and Language, pages 179-240. Elsevier, MIT.
Zeevat, H. W. (1991). Aspects of Discourse Semantics and Unification Grammar. Ph.D. thesis, University of Amsterdam.

Who Decides What a Text Means?
Graeme Hirst
Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 3G4
gh@cs.toronto.edu

Abstract. Writer-based and reader-based views of text-meaning are reflected by the respective questions "What is the author trying to tell me?" and "What does this text mean to me personally?" Contemporary computational linguistics, however, generally takes neither view. But this is not adequate for the development of sophisticated applications such as intelligence gathering and question answering. I discuss different views of text-meaning from the perspective of the needs of computational text analysis and the collaborative repair of misunderstanding.

This paper was originally published under the title The Future of Text-Meaning in Computational Linguistics in Petr Sojka, Aleš Horák, Ivan Kopecek, and Karel Pala (eds.) (2008), Proceedings of the 11th International Conference on Text, Speech and Dialogue (TSD 2008) (Lecture Notes in Artificial Intelligence 5246, Springer-Verlag), September 2008, Brno, Czech Republic, pp. 1-9.

eChemistry: Science, Citations and Sentiment *
Simone Teufel
Cambridge University, JJ Thomson Avenue, Cambridge CB3 0FD, UK
Simone.Teufel@cl.cam.ac.uk

* Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 17-27.

Abstract. In the eChemistry project SciBorg, we address challenges in computational linguistics of two kinds: those that are specific to chemistry (e.g., how to recognise chemical compounds robustly in text) and those which are discipline-independent (e.g., the ways in which the citation system and the scientific argumentation interact and can be recognised). This paper introduces the project and its goal of providing a robust semantics-based infrastructure for the interpretation of scientific text, upon which several eChemistry tasks are built. In particular, I will concentrate on the application of finding rhetorical structure in the paper text.

1 Introduction: The Project SciBorg

Unlike in the biomedical disciplines, the scientific content contained in chemistry papers has only recently begun to be processed in such a way that scientists can access it more freely. SciBorg (Copestake et al., 2006) is a 4-year, EPSRC-sponsored project in eChemistry held by Cambridge University (the Computer Laboratory, the Chemistry Department and the Cambridge eScience Centre), in collaboration with three scientific publishers (the Royal Society of Chemistry, Nature Publishing and the International Union of Crystallography).

SciBorg addresses several natural language tasks. Firstly, all chemical compounds contained in the scientific text are recognised and connected to a database of compounds. Beyond simple linking, one can also extract new ontological information from the text, e.g., the information that an alkaloid is a subtype of azacycle:

  . . . alkaloids and other complex polycyclic azacycles . . .

The project also provides support to chemists in the form of a "semantically fine-tuned" search. For instance, a chemist may search for papers which describe the synthesis of Tröger's base from anilines. Such a search can be satisfied by the following snippets:

  The synthesis of 2,8-dimethyl-6H,12H-5,11-methanodibenzo[b,f]diazocine (Troeger's base) from p-toluidine and of two other Troeger's base analogs from other anilines

  Tröger's base (TB)... The TBs are usually prepared from para-substituted anilines
Similar search environments have been provided for the biomedical domain, e.g., for gene expression in Kim et al. (2008). They can be realised on the basis of a syntactic parse, if one automatically treats syntactic variation, lexical variation and nominalisation. What is far harder is the search for papers describing synthesis of Tröger's base which do not involve anilines, e.g., using a search expression such as the following (where X is a paper, TB stands for "Tröger's base" and y is the compound from which the synthesis starts):

  X: Goal(X,h), h: synth, result(h,<TB>), Source(h,y) & NOT(aniline(y))

From a computational linguistics perspective, the novelty in SciBorg is its use of a deep semantic representation, RMRS (Copestake, 2009), in all stages of the processing (see Section 2). Processing in SciBorg is broken up into the component tasks given in Fig. 1 (Copestake et al., 2006). Our three project partners provide text in their publisher-specific XML, which is transformed into our XML vocabulary SciXML (Rupp et al., 2008); our group already holds scientific texts from other disciplines in SciXML (computational linguistics, cardiology and genetics). Named entity recognition (NER) by a component called OSCAR-3 is the first processing step (see Section 3), followed by POS-tagging and parsing by the ERG/PET (Copestake & Flickinger, 2000) and RASP (Briscoe & Carroll, 2002). RMRSs from different parsers are then merged. Our representation is SAF (Waldron & Copestake, 2006) standoff annotation, which forms a lattice of partial analyses at each level. Anaphora are recognised, word sense disambiguation is performed, and a rhetorical analysis in the form of Argumentative Zoning (Teufel et al., 2009) is carried out. This information supports ontology enrichment, along with the various search tasks we support.

This paper is mainly focused on "rhetorical search", another type of search supported in SciBorg. When researching a particular synthesis route, a chemist may be very interested in descriptions of problems or failures encountered in similar syntheses. Such "rhetorical searches" will be described in Section 4.

2 Robust Minimal Recursion Semantics

A central goal of SciBorg is the development of a representation which enables the tight integration of partial information from a wide variety of language processing tools and has a sound logical basis compatible with the emerging semantic web standards. Robust Minimal Recursion Semantics (RMRS) is an application-independent representation for compositional semantics. It is an extension of the Minimal Recursion Semantics (MRS: Copestake et al., 2006) approach, which is well established in deep processing in NLP. MRS is compatible with RMRS, but RMRS can also be used with shallow processing techniques, such as part-of-speech tagging, noun phrase chunking and stochastic parsers which operate without detailed lexicons. RMRS output from the shallower systems is less fully specified than the output from the deeper systems, but in principle fully compatible. RMRS can be interconverted with a dependency-style representation (DMRS), which is designed to be more human-readable, and it is this we show in Figure 2.
The example is a slightly simplified version of the DMRS analysis given by the English Resource Grammar (ERG) for the following sentence:

  The photodecomposition is independent of the presence of oxygen in the solution.

In our approach, general-purpose language processing tools, such as the ERG, are used with minimal modification in the analysis of the chemistry texts. The chemistry-specific named entities (such as names for compounds and elements) are replaced by placeholders. Thus the ERG actually analyses the sentence:

  The photodecomposition is independent of the presence of OSCARCOMPOUND in the solution.

OSCARCOMPOUND corresponds to the predicate 'ocmpd' in Figure 2. The ERG assumes a lexicon but has a mechanism for treating unknown words which are not NEs, such as photodecomposition.

[Figure 1 (not reproduced): Overall architecture of the SciBorg system. Publisher XML (RSC, Nature and IUCr papers; biology and computational linguistics PDFs) is converted to SciXML, then sentence-split and tokenised for OSCAR-3, the ERG/PET and RASP (with POS tagging); the resulting RMRSs are merged and feed word sense disambiguation, anaphora resolution, rhetorical analysis and downstream tasks.]

[Figure 2 (not reproduced): Dependency MRS representation of the example sentence, linking the predicates _the_q, _photodecomposition_n, _independent_a, _presence_n_1, _of_p, ocmpd_n, _in_p and _solution_n_1 via RSTR/H, ARG1/EQ, ARG1/NEQ and ARG2/NEQ dependencies.]

3 OSCAR3: Recognition of Chemical Named Entities

One of the first processing tasks necessary is the recognition of chemical named entities - their unusual typographic form can make them interfere with other processing (such as POS tagging) unless they are packaged up in XML. They are recognised and connected to the ChEBI database (http://www.ebi.ac.uk/chebi/), as in the following example:

  The slides were washed with glacial acetic acid.  [ChEBI: 15366]

The challenge is to recognise named entities of many types, e.g., systematic names (2-acetoxybenzoic acid), semi-systematic names (acetylsalicylic acid), trivial names (aspirin), formulae (C6H12DO6), as well as acronyms/abbreviations (NADH, 5-HT) and catalogue numbers (A23187 (calcium ionophore)).

Corbett et al. (2007) present an annotation scheme for chemical entities which is based on reliability studies with three annotators, all of them domain experts, on the basis of extensive guidelines. Independent annotation of 14 papers (40,000 tokens, 3,000 named entities) by the 3 experts showed good agreement (pairwise F-measures 92.8%, 90.0%, 86.1%). Corbett & Copestake's (2008) automatic chemical NER (OSCAR) is based on a maximum entropy model with several features, including two HMMs (one based on characters, the other on the IOB notation), token shapes and regular expressions. Classification is cascaded. Multiword items receive special treatment, including look-up in the lexicon ("dry ice"), recognition of suffixes -yl, -ate ("ethyl acetate"), and use of stop patterns (C. elegans). Whenever there are several overlapping candidates, the longest leftmost string is chosen. Recognition results are ∼60% recall at 95% precision, and ∼60% precision at 90% recall.
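The "longest leftmost" resolution of overlapping candidates mentioned above is simple enough to sketch in code. The following Python snippet is our own illustration, not OSCAR3's implementation; candidates are character spans, and the label "CM" (chemical) is purely hypothetical.

    def resolve_overlaps(candidates):
        """Keep a non-overlapping subset of candidate spans, preferring the
        leftmost span and, among spans starting at the same offset, the longest."""
        chosen = []
        for start, end, label in sorted(candidates, key=lambda c: (c[0], -(c[1] - c[0]))):
            if all(end <= s or start >= e for s, e, _ in chosen):
                chosen.append((start, end, label))
        return chosen

    text = "ethyl acetate was added"
    # candidate spans for "ethyl", "ethyl acetate" and "acetate"
    candidates = [(0, 5, "CM"), (0, 13, "CM"), (6, 13, "CM")]
    print(resolve_overlaps(candidates))   # -> [(0, 13, 'CM')]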
4 Use Cases: Searches for Chemists

Let us now turn to rhetorical searches. In our application field, synthetic chemistry, there are several searches which would be attractive to specialists but which are not currently supported by search engines; examples follow.

Use case 1: How is a given article being used? Determination of the importance and function of cited work can be relevant to searches in the scientific literature. A searcher surveying the literature, whether for a review article or on moving into a new field of research, may be interested in how a particular article has been taken up in the field, or how it fits into the research programme of a given group or laboratory. One of the functions recognised in SciBorg is statements of use of materials, machinery, experimental methodologies, parameters, theories, etc. This information could also be used to compile fairer bibliometric measures of the impact of individual work. Schemes for citation classification according to motivation exist in the area of citation content classification (Moravcsik & Murugesan, 1975; Garfield, 1965), and have been replicated automatically (Nanba & Okumura, 1999; Garzone & Mercer, 2000; Teufel et al., 2006). However, the raw context of a citation in itself is often not sufficient to describe what the relation between the citation and the citing paper is. The approach described below assumes that discourse context (combined with techniques from sentiment analysis) can help in determining the sentiment between a citing and a cited paper.

Use case 2: Identifying paradigm shifts. Lisacek et al. (2005) argued that information retrieval can profit from the identification of papers with paradigm shift statements, as such papers have a high impact in an area. Consider the following example of such a paradigm shift sentence:

  In contrast with previous hypotheses, compact plaques form before significant deposition of diffuse A beta, suggesting that different mechanisms are involved in the deposition of diffuse amyloid and the aggregation into plaques.

Paradigm shift sentences also occur in the chemical literature, and we reserve a special category in our analysis for them.

Use case 3: Identifying failures. For a research chemist planning to synthesize a given kind of compound, it is invaluable to know what methods have been shown not to work during similar attempts. Mining such failures from the literature is sufficiently valuable that there is a commercial database of failed reactions. There are two particularly characteristic rhetorical structures in synthetic chemistry papers which we shall address. The first is that, in a description of a long synthetic procedure, authors might helpfully mention in passing steps which were found not to work. These will generally be followed by a "recovery" statement, which explains how the problem can be avoided. The second is that many papers report a dozen or more very similar experiments in a table of figures, which work to differing degrees. Particularly successful or unsuccessful experiments will be described in the body text, often identified by their entry number in the relevant table. What is interesting from a linguistic point of view is that what emerges are patterns of sentiment-related statements (positive and negative, failures and successes), some of which can be predicted from the discourse context.

Use case 4: Identifying practical applications and future/ongoing work. University press offices, funding bodies, science journalists, and other popular accounts of science often look out for practical applications mentioned in the literature. In a similar vein, authors also often mention possibilities for future or ongoing work at the end of a paper, as one of the conclusions drawn from the work. This is often clearly linguistically signalled but not marked explicitly by the paper structure.
This is often clearly linguistically signalled but not marked explicitly by the paper structure. A list of future research possibilities can give research administrators an indication of trends and current "hot topics", and it can provide young researchers with ideas about where to start their research. In our scheme, explicit statements of applications and future work receive particular attention (and thus their own category in the annotation scheme below).

5 Argumentative Zoning for Chemistry

Argumentative Zoning is the name of a shallow discourse analysis that can be applied to scientific articles (Teufel & Moens, 2002). Sentences in scientific text are classified into 7 categories, each corresponding to a different rhetorical status. This level of analysis can be replicated automatically by a supervised machine learning approach, based on a set of shallower features (see Section 5.2 below) which can be easily determined in text. Argumentative Zoning II (AZ-II, Teufel et al., 2009) is an elaboration of the original AZ scheme (Teufel, 2000) with more fine-grained distinctions. There were two reasons for refining AZ: we wanted to bring AZ closer to contemporary citation function schemes, and we also wanted to make distinctions recently found useful by other researchers. For instance, Lisacek et al. (2005) show that [ANTISUPP] is particularly important in biomedical searches. The scheme is given in Table 1. Note that the finer grain in AZ-II has been accomplished purely by splitting existing AZ categories; hence, the coarser AZ categories are recoverable (with the exception of the [TEXTUAL] category, which was discontinued, because the application it was designed for, navigation in electronic papers, is not a central task in SciBorg). Despite the similarity of AZ-II to citation classification schemes, there are differences: in AZ and AZ-II, the occurrence of a citation is not necessarily a decisive cue for a sentence to belong to a particular zone, although it is an important cue for several categories. Our annotation guidelines are 111 sides of A4 long and contain a decision tree, a detailed description of the semantics of the 15 categories, 75 rules for pairwise distinction of the categories and copious examples from both chemistry and computational linguistics (CL). They are written in such a way that experts and non-experts should be able to perform the annotation consistently. Guideline development started with papers from CL and then added material specific to chemistry. It took 3 months of part-time work to prepare the guidelines for CL, but substantially less time to adapt them for chemistry. During guideline development, 70 chemistry papers were used, which are distinct from the ones used for annotation. The large neutral AZ category [OWN] has been split into methods, conclusions and results for AZ-II.

Table 1: AZ-II Annotation Scheme and Correspondence to AZ (Teufel, 2000)

AZ-II Category | Description | AZ Category
[AIM] | Statement of specific research goal, or hypothesis of current paper | [AIM]
[NOV_ADV] | Novelty or advantage of own approach | [AIM]; ([OWN])
[CO_GRO] | Nobody owns knowledge claim for this statement (or knowledge claim not significant for the paper) | [BACKGROUND]
[OTHER] | Neutral description of somebody else's work | [OTH]
[PREV_OWN] | Neutral description of authors' previous own work | [OTH]
[OWN_MTHD] | New work presented in paper: methods | ([OWN])
[OWN_FAIL] | A solution/method/experiment in the paper that did not work | ([OWN])
[OWN_RES] | Measurable/objective outcome of own work | ([OWN])
[OWN_CONC] | Findings, conclusions (non-measurable) of own work | ([OWN])
[CODI] | Comparison, contrast, difference to other solution (neutral) | [CONTRAST]
[GAP_WEAK] | Lack of solution in field, problem with other solutions | [CONTRAST]
[ANTISUPP] | Clash with somebody else's results or theory; superiority of own work | [CONTRAST]
[SUPPORT] | Other work supports current work or is supported by current work | [BASIS]
[USE] | Other work is used in own work | [BASIS]
[FUT] | Statements/suggestions about future work (own or general) | [OWN]

Other AZ-like schemes for scientific discourse created for the biomedical domain (Mizuta & Collier, 2004), for computer science (Feltrim et al., 2005) and for astrophysics (Merity et al., 2009) also made the decision to subdivide [OWN], mostly in ways similar to the one we propose here (Merity et al., 2009, divide into "data", "observation" and "technique"). However, our work presents the first experimental proof that humans can make such distinctions reliably, in our case even for two disciplines (CL and chemistry). An important principle of AZ and AZ-II is that category membership should be decidable without domain knowledge. The reason for this is that the human annotation, which acts as a gold standard, should simulate the best possible output that a realistic text-understanding system (which cannot model arbitrary domain knowledge) could theoretically create. This rule is anchored in the guidelines: when choosing a category, no reasoning about the scientific facts is allowed. Annotators may use only general, rhetorical or linguistic knowledge. For instance, lexical and syntactic parallelism in a text can mean that the authors were setting up a comparison between themselves and some other approach (category [CODI]). There is, however, a problem with annotator expertise and with the exact implementation of the "no domain knowledge" principle. This problem does not become apparent until one starts working with disciplines where at least some of the annotators or guideline developers are not domain experts (chemistry, in our case). Domain experts naturally use scientific knowledge and inference when they make annotation decisions; it would be unrealistic to expect them to be able to disregard their domain knowledge simply because they were instructed to do so. We therefore artificially created a situation where all annotators are "semi-informed non-experts", which forces them to comply with the principle:

Justification: Annotators have to justify all annotation decisions by pointing to some text-based evidence. There are rules prescribing what type of justification is allowed; general discipline-specific knowledge is explicitly excluded.

Discipline-specific generics: The guidelines contain a section with high-level facts about the general research practices in the discipline. These generics are aimed at helping non-expert annotators recognise how a paper might relate to already established scientific knowledge. For instance, the better annotators are able to distinguish what is commonly known from what is newly claimed by the authors, the more consistent their annotation will be. Guidelines are therefore split into a domain-general and a domain-specific part.
The discipline-specific generics constitute the only chemistry-specific knowledge which is acceptable as justification. In the case of chemistry, they come in the form of a "chemistry primer", a 10-page collection of high-level scientific domain knowledge. It is not an attempt to summarise all methods and experimentation types in chemistry; this would be impossible to do. Rather, it tries to answer many of the high-level AZ-relevant questions a non-expert would put to an expert. It contains: a glossary of words a non-chemist would not have heard of or would not necessarily recognise as chemical terminology; a list of possible types of experiments performed in chemistry; a list of commonly used machinery; a list of non-obvious negative characterisations of experiments and compounds ("sluggish", "inert"); and specific instructions per category. For instance, if a compound or process is considered to be so commonly used that it is in the "general domain" (e.g., "the Stern-Volmer equation" or "the Grignard reaction"), it is no longer associated with somebody's specific work, and as a result its usage is not to be marked with category [USE]. Annotation with expert-trained non-expert annotators means that a domain expert must be available initially, during the development of the annotation scheme and the guidelines. Their job is to describe scientific knowledge in that domain in a general way, in so far as it is necessary for the scheme's distinctions. Once this is done, our methodology allows us to hire expert and non-expert annotators and still produce reliable results. We believe our guidelines could be expanded relatively easily to many other disciplines, using domain experts who create similar primers for genetics, experimental physics or cell biology, while re-using the bulk of the existing material.

5.1 Human Annotation Results

For chemistry, 30 randomly sampled papers from journals published between 2004 and 2007 by the Royal Society of Chemistry were used for annotation, with a total of 3745 sentences. The papers cover all areas of chemistry and some areas close to chemistry, such as climate modelling, process engineering, and a double-blind medical trial. For CL, 9 papers published at ACL, EACL or EMNLP conferences between 1998 and 2001 were annotated, with a total of 1629 sentences. Average article length of the chemistry journal articles (3650 words/paper) and the CL conference articles (4219 words/paper) is comparable. Both chemistry and CL papers were automatically sentence-split, with manual correction of errors. A web-based annotation tool was used for guideline definition and for annotation. Every sentence is assigned a category. The annotators were the three authors of Teufel et al. (2009), only two of whom are domain experts: Colin Batchelor (Annotator A) is a PhD-level chemist, Advaith Siddharthan (Annotator B) a BSc-level chemist, whereas I (Annotator C) do not have any chemistry knowledge. No communication between the annotators was allowed. The inter-annotator agreement for chemistry was κ = 0.71 (N=3745, n=15, k=3). For CL, the inter-annotator agreement was κ = 0.65 (N=1629, n=15, k=3). For comparison, the inter-annotator agreement for the original, CL-specific AZ with 7 categories was κ = 0.71 (N=3420, n=7, k=3).
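The agreement figures above are reported over N sentences, n categories and k annotators. The exact coefficient is not spelled out here beyond these parameters; the following minimal Python sketch assumes a Fleiss-style multi-annotator kappa (chance agreement estimated from pooled category proportions), and the toy labels are invented purely for illustration.

```python
from collections import Counter

def fleiss_kappa(labels_per_item, categories, k):
    """Fleiss-style kappa for N items, each labelled by k annotators.

    labels_per_item: list of length N; each element is the list of the
    k category labels assigned to that sentence.
    """
    N = len(labels_per_item)
    pooled = Counter()          # total count per category over all judgements
    agreement_sum = 0.0
    for labels in labels_per_item:
        counts = Counter(labels)
        pooled.update(counts)
        # per-item observed agreement among the k annotators
        agreement_sum += (sum(c * c for c in counts.values()) - k) / (k * (k - 1))
    p_obs = agreement_sum / N
    p_exp = sum((pooled[c] / (N * k)) ** 2 for c in categories)
    return (p_obs - p_exp) / (1 - p_exp)

# toy usage: 3 annotators, a handful of AZ-II categories
items = [["OWN_MTHD", "OWN_MTHD", "OWN_RES"],
         ["AIM", "AIM", "AIM"],
         ["USE", "SUPPORT", "USE"]]
cats = ["AIM", "OWN_MTHD", "OWN_RES", "USE", "SUPPORT"]
print(round(fleiss_kappa(items, cats, k=3), 3))
```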
A direct comparison of these annotation results with those from the original AZ scheme (by collapsing categories into the 6 shared categories between the schemes) shows inter-annotator agreement of κ = 0.75 (N=3745, n=6, k=3) for AZ-II, which is higher than for collapsed AZ (κ = 0.71, N=3420, n=6, k=3), even though AZ only covered one discipline. This is a positive result for the domainindependence of AZ-II, and also for the feasibility of using trained non-experts as annotators. With respect to the distribution of categories for the two disciplines, the most striking differences between chemistry and CL concern the distribution of [O WN _M THD ], which is more than twice as common in CL (56% v. 25%), and [O WN _R ES ], which is far more common in chemistry overall (24% v 5.6%). Usage of other people’s work or materials also seems to be more common in chemistry, or at least more explicitly expressed (7.9% vs 2.7%). With respect to the shorter, rarer categories, there is a marked difference in [O WN _F AIL ] (0.1% in CL, but 0.8% in chemistry) and [S UPPORT ], which is more common in chemistry (1.5% vs 0.7%). However, this effect is not present for [A NTISUPP ] (contradiction of results), the “reverse” category to [S UPPORT ] (0.6% in CL vs 0.5% in chemistry). We used Krippendorff’s test for category distinction (Krippendorff, 1980), which reveals that the categories [U SE ], [A IM ], [O WN _M THD ], [O WN _R ES ] and [F UT ] are particularly easy to distinguish in both disciplines, whereas [A NTISUPP ], [O WN _F AIL ] and [P REV _O WN ] are particularly hard to distinguish in both disciplines - interestingly, all of these are negative categories. On the one hand, [U SE ], [A IM ] and [F UT ] are important for several types of the use case searches mentioned above; but as [A NTISUPP ] and [O WN _F AIL ] are also crucial for the envisaged downstream tasks, the problems with their definition should be identified and solved in future versions of the guidelines. There are also discipline-specific problems with distinguishability; for instance, the definition of the categories [S UPPORT ] and [N OV _A DV ] seem to be harder in CL than in chemistry, whereas [C O D I ] seems to be easier in chemistry than in CL. We believe this is due to the fact that comparisons of methods and approaches are more common in CL and are clearly expressed, whereas in chemistry the objects that are involved in comparisons are more varied and at a lower grade of abstraction (e.g., compounds, properties of compounds, coefficients, etc). As far as the chemistry annotation is concerned, it is interesting to see whether Annotator A was influenced during annotation by domain knowledge which Annotator C did not have, and Annotator B had to a lower degree. We therefore calculated pairwise agreement, which was κ AC = 0 . 66 , κ BC = 0 . 73 and κ AB = 0 . 73 (all: N=3745,n=15,k=2). That means that the largest disagreements were between the non-expert (C) and the expert (A), though the differences are modest. This might point to the fact that Annotators A and B might have used a certain amount of domain-knowledge which the guidelines do not yet, but should, cover. It is possible that discourse annotation of chemistry is intrinsically easier than discourse annotation of CL, because it is a more established discipline. 
For instance, it is likely that the problemsolving categories [O WN _F AIL ], [O WN _M THD ], [O WN _R ES ] and [O WN _C ONC ] are easier to describe in a discipline with an established methodology (such as chemistry), than they are in a younger, developing discipline such as computational linguistics. eChemistry: Science, Citations and Sentiment 25 5.2 Automatic Annotation We report here some preliminary results of automating the classification. Like in Teufel (2000), the task is cast as a classification problem to be solved in a supervised machine learning framework using features extracted by shallow processing (POS tagging and pattern matching). The pool of features we used comes from Teufel & Moens (2002) and includes lexical features, content word features, verb syntactic features, and citation features, location features and the history feature. Three lexical (cue phrase) features model meta-discourse (Hyland, 1998); the features were manually selected from CL papers, and used unchanged for chemistry. The first records the occurrence of about 1700 manually identified scientific cue phrases (such as “in this paper”, Teufel, 2000). The second models the main verb of the sentence, by look-up in a verb lexicon organised by 13 main clusters of verb types (e.g. “change verbs”), and the third models the likely subject of the sentence, by classifying them either as the authors, or other researchers, or none of the above, using an extensive lexicon of regular expressions. Content word features model occurrence and density of content words in the sentences, where content words are either defined as non-stoplist words in the subsection heading preceding the sentence, or as words with a high TF*IDF score. Verb syntax features include complex tenses, voice, and presence of an auxiliary. Citation features record whether and where a citation occurs in text, and whether it is a self-citation. Context can be modelled in terms of location of a sentence with respect to its paragraph, within a section, and within the document; in the latter case, the text is split into ten segments and each sentence to the segment it is located in. It can also be modelled in terms of history with respect to the previous label. We estimate the most likely AZ category of the previous sentence by beam search, and hand this feature directly to the machine learner. 1 We use the W EKA toolkit (Witten & Frank, 2000) with 30-fold cross-validation and the Naive Bayes algorithm, as did Teufel & Moens (2002). The performance for the full 15 category classification task is P(A)= 51.42%, κ= 0.41, Macro-F = 34%; The results per category are as follows: O WN _M THD : 61.4%; O WN _R ES : 56.6%; A IM : 56.5%; C O _G RO 49.9%; O TH 48.5%; O WN _C ONC : 43.6%; U SE : 39.4%; S UPPORT : 39.2%; P REV _O WN : 35.1%; G AP _W EAK : 24.7%; O WN _F AIL : 20.7%; N OV _A DV : 12.9%; F UT : 7.3%. Two categories that were difficult for human annotators proved impossible for the machine learner: no instances were classified as [C O D I ] and [A NTISUPP ]. We think this might be due to both the hedged signalling of negative sentiment in scientific discourse, and the small number of examples for these classes. Clearly more effort is needed to achieve acceptable results, in analysis, feature definition and selection, and choice of machine learner. Results for the conflated set with six categories were P(A)=76.0, κ=0.51, Macro-F=51% (F-measures: A IM : 55.1%; B ASIS : 44.2%; C ONTRAST : 19.8%; O WN : 88.8%; O TH : 49.2%; B ACKGROUND : 48.5%). 
This compares favourably with the results reported in Teufel & Moens (2002), namely P(A)=73.0% , κ=0.45, Macro-F=0.50, in as far as the comparison is justified (the domains are different, and Teufel & Moens (2002) used an additional T EXTUAL category). 6 Conclusions This paper introduced project SciBorg, which is a novel e-science project in that it uses a semantic representation (RMRS) throughout for various tasks. Of these, I mainly discuss rhetorical document structure recognition here, and give some recent results. In particular, I have argued that one can 1 In Teufel (2000), I applied a HMM model post-hoc to AZ categories classified in a first pass run, with no positive effect. 26 Simone Teufel in principle define rhetorical categories in such a way that expert knowledge is not required during annotation, which allows for annotation by non-experts. Other applications for AZ-II than specialist searches in the literature are possible, for instance the teaching of scientific writing to novices, and in particular the teaching of how to structure one’s scientific argument. Such a tool could recognise rhetorical faux-pas of unpracticed writers in science, suggest alternative meta-discourse formulations and ordering. The tool by Feltrim et al. (2005) is an early prototype for such a system. It critiques students’ introductions of Brazilian CS theses written in Portuguese. Other applications for AZ include improved citation indexing (Teufel et al., 2009) and tailored summarisation (Teufel, 2000). 7 Acknowledgements I would like to thank my colleagues on project SciBorg, Ann Copestake, Peter Corbett, Peter Murray- Rust, Andy Parker, Advaith Siddharthan and C.J. Rupp, our collaborator Colin Batchelor from the Royal Society of Chemistry, and our three publisher project partners. Project SciBorg is funded by EPSRC (EP/ C010035/ 1). References Briscoe, T. & Carroll, J. (2002). Robust accurate statistical annotation of general text. In Proc. of LREC-02, Las Palmas, Gran Canaria. Copestake, A. (2009). Slacker Semantics: Why superficiality, dependency and avoidance of commitment can be the right way to go. In Proc. of EACL-09, Athens, Greece. Copestake, A. & Flickinger, D. (2000). An open-source grammar development environment and broadcoverage english grammar using HPSG. In Proceedings of LREC-00. Copestake, A., Corbett, P., Murray-Rust, P., Rupp, C., Siddharthan, A., Teufel, S., & Waldron, B. (2006). An architecture for language processing for scientific texts. In Proc. of the UK e-Science Programme All Hands Meeting 2006 (AHM2006), Nottingham, UK. Corbett, P. & Copestake, A. (2008). Cascaded classifiers for confidence-based chemical named entity recognition. BMC Bioinformatics, 9(Suppl 11), S4. Corbett, P., Batchelor, C., & Teufel, S. (2007). Annotation of Chemical Named Entities. In Proc. of ACL workshop BioNLP-2007: Biological, Translational, and Clinical Language Processing, Prague, Czech Republic. Feltrim, V., Teufel, S., Nunes, G., & Alusio, S. (2005). Argumentative zoning applied to critiquing novices’ scientific abstracts. In J. G. Shanahan, Y. Qu, & J. Wiebe, editors, Computing Attitude and Affect in Text: Theory and Applications. Springer, Dordrecht, The Netherlands. Garfield, E. (1965). Can Citation Indexing be Automated? In M. e. a. Stevens, editor, Statistical Association Methods for Mechanical Documentation (NBS Misc. Pub. 269). National Bureau of Standards, Washington. Garzone, M. & Mercer, R. E. (2000). Towards an Automated Citation Classifier. In Proc. 
of the 13th Biennial Conference of the CSCI/ SCEIO (AI-2000). Hyland, K. (1998). Persuasion and context: The pragmatics of academic metadiscourse. Journal of Pragmatics, 30(4), 437-455. eChemistry: Science, Citations and Sentiment 27 Kim, J.-D., Ohta, T., & Tsujii, J. (2008). Corpus annotation for mining biomedical events from literature. BMC Bioinformatics, 9, 10. Krippendorff, K. (1980). Content Analysis: An Introduction to its Methodology. Sage Publications, Beverly Hills, CA. Lisacek, F., Chichester, C., Kaplan, A., & Sandor, A. (2005). Discovering paradigm shift patterns in biomedical abstracts: Application to neurodegenerative diseases. In Proc. of the SMBM. Merity, S., Murphy, T., & Curran, J. R. (2009). Accurate Argumentative Zoning with Maximum Entropy models. In Proceedings of ACL-IJCNLP-09 Workshop on text and citation analysis for scholarly digital libraries (NLPIR4DL), Singapore. Mizuta, Y. & Collier, N. (2004). An annotation scheme for rhetorical analysis of biology articles. In Proc. of LREC-04. Moravcsik, M. J. & Murugesan, P. (1975). Some Results on the Function and Quality of Citations. Social Studies of Science, 5, 88-91. Nanba, H. & Okumura, M. (1999). Towards multi-paper summarization using reference information. In Proc. of IJCAI-99, pages 926-931. Rupp, C., Copestake, A., Corbett, P., Murray-Rust, P., Siddharthan, A., Teufel, S., & Waldron, B. (2008). Language Resources and Chemical Informatics. In Proc. of LREC-08, Marrakech, Morocco. Teufel, S. (2000). Argumentative Zoning: Information Extraction from Scientific Text. Ph.D. thesis, School of Cognitive Science, University of Edinburgh. Teufel, S. & Moens, M. (2002). Summarising scientific articles — experiments with relevance and rhetorical status. Computational Linguistics, 28(4), 409-446. Teufel, S., Siddharthan, A., & Tidhar, D. (2006). Automatic classification of citation function. In Proc. of EMNLP-06. Teufel, S., Siddharthan, A., & Batchelor, C. (2009). Towards discipline-independent argumentative zoning: Evidence from chemistry and computational linguistics. In Proc. of EMNLP-09, Singapore. Waldron, B. & Copestake, A. (2006). A stand-off annotation interface between delph-in components. In The fifth workshop on NLP and XML: Multidimensional Markup in Natural Language Processing (NLPXML- 2006). Witten, I. & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann. Main Conference Normalized (Pointwise) Mutual Information in Collocation Extraction * Gerlof Bouma Department Linguistik Universität Potsdam gerlof.bouma@uni-potsdam.de Abstract In this paper, we discuss the related information theoretical association measures of mutual information and pointwise mutual information, in the context of collocation extraction. We introduce normalized variants of these measures in order to make them more easily interpretable and at the same time less sensitive to occurrence frequency. We also provide a small empirical study to give more insight into the behaviour of these new measures in a collocation extraction setup. 1 Introduction In collocation extraction, the task is to identify in a corpus combinations of words that show some idiosyncrasy in their linguistic distribution. This idiosyncrasy may be reduced semantic compositionality, reduced syntactic modifiability or simply a sense that the combination is habitual or even fixed. 
Typically but not exclusively, this task concentrates on two-part multi-word units and involves comparing the statistical distribution of the combination to the distribution of its constituents through an association measure. This measure is used to rank candidates extracted from a corpus and the top ranking candidates are then selected for further consideration as collocations. 1 There are literally dozens of association measures available and an important part of the existing collocation extraction literature has consisted of finding new and more effective measures. For an extreme example see Pecina (2008a), who in one paper compares 55 different (existing) association measures and in addition several machine learning techniques for collocation extraction. A recent development in the collocation literature is the creation and exploitation of gold standards to evaluate collocation extraction methods - something which is for instance standard practice in information retrieval. Evaluation of a method, say, a certain association measure, involves ranking the data points in the gold standard after this measure. An effective method is then one that ranks the actual * Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 31-40. 1 In the context of this paper, we will not attempt a more profound definition of the concept of collocation and the related task of collocation extraction. For this we refer the interested reader to Manning and Schütze (1999, Ch. 5) and especially Evert (2007). A comprehensive study of all aspects of collocation extraction with a focus on mathematical properties of association measures and statistical methodology is Evert (2005). 32 Gerlof Bouma collocations in this list above the non-collocations. Four such resources, compiled for the shared task of the MWE 2008 workshop, are described in Baldwin (2008), Evert (2008a), Krenn (2008), and Pecina (2008b). One of the lessons taught by systematic evaluation of association measures against different gold standards is that there is not one association measure that is best in all situations. Rather, different target collocations may be found most effectively with different methods and measures. It is therefore useful to have access to a wide array of association measures coupled with an understanding of their behaviour if we want to do collocation extraction. As Evert (2007, Sect. 6), in discussing the selection of an association measure, points out, choosing the best association measure for the job involves empirical evaluation as well as a theoretical understanding of the measure. In this paper, we add to the large body of collocation extraction literature by introducing two new association measures, both normalized variants of the commonly used information theoretical measures of mutual information and pointwise mutual information. The introduction of the normalized variants is motivated by the desire to (a) use association measures whose values have a fixed interpretation; and (b), in the case of pointwise mutual information, reduce a known sensitivity for low frequency data. Since it is important to understand the nature of an association measure, we will discuss some theoretical properties of the new measures and try to gain insight in the relation between them and the original measures through a short empirical study. 
The rest of this paper is structured as follows: Section 2 discusses mutual information and pointwise mutual information. We then introduce their normalized variants (Sect. 3). Finally, we present an empirical study of the effectiveness of these normalized variants (Sect. 4).

2 Mutual Information

2.1 Definitions

Mutual information (MI) is a measure of the information overlap between two random variables. In this section I will review definitions and properties of MI. A textbook introduction can be found in Cover and Thomas (1991). Readers familiar with the topic may want to skip to Sect. 3. The MI between random variables $X$ and $Y$, whose values have marginal probabilities $p(x)$ and $p(y)$, and joint probabilities $p(x,y)$, is defined as follows [2]:

$$I(X;Y) = \sum_{x,y} p(x,y) \ln \frac{p(x,y)}{p(x)p(y)}. \qquad (1)$$

The information overlap between $X$ and $Y$ is 0 when the two variables are independent, as $p(x)p(y) = p(x,y)$. When $X$ determines $Y$, $I(X;Y) = H(Y)$, where $H(Y)$ is the entropy of, or lack of information about, $Y$, defined as:

$$H(Y) = -\sum_{y} p(y) \ln p(y). \qquad (2)$$

When $X$ and $Y$ are perfectly correlated (they determine each other), $I(X;Y)$ reaches its maximum of $H(X) = H(Y) = H(X,Y)$, where $H(X,Y)$ is the joint entropy of $X$ and $Y$, which we get by replacing the marginal distribution in (2) with the joint distribution $p(x,y)$.

[Footnote 2: In this paper, I will always use the natural logarithm. Changing the base of the logarithm changes the unit of measurement of information, but this is not relevant in the context of this paper. Further, capital variable names refer to random variables, whereas lowercase ones refer to the values of their capitalized counterparts. Finally, $0 \cdot \ln 0$ is defined to be 0, which means that in a contingency table, cells with zero counts/probability do not contribute to MI, entropy, etc.]

Other ways to look at MI are as a sum of entropies (3) or as the expected or average value of pointwise mutual information (4):

$$I(X;Y) = H(X) + H(Y) - H(X,Y) \qquad (3)$$

$$I(X;Y) = E_{p(X,Y)}\big[\,i(X,Y)\,\big] = \sum_{x,y} p(x,y)\, i(x,y) \qquad (4)$$

$$i(x,y) = \ln \frac{p(x,y)}{p(x)p(y)} \qquad (5)$$

Pointwise mutual information (PMI, 5) is a measure of how much the actual probability of a particular co-occurrence of events, $p(x,y)$, differs from what we would expect it to be on the basis of the probabilities of the individual events and the assumption of independence, $p(x)p(y)$. Note that even though PMI may be negative or positive, its expected outcome over all joint events (i.e., MI) is positive.

2.2 Mutual Information in Collocation Extraction

Mutual information can be used to perform collocation extraction by considering the MI of the indicator variables of the two parts of the potential collocation. [3] In Table 1, I have given counts and probabilities (maximum likelihood estimates: $p = f/N$) for the collocation candidate Mr President, extracted from the Europarl corpus (Koehn, 2005). The MI between the two indicator variables, $I(L_{mr}; R_{president})$, is in this case 0.0093. The Europarl sample consists of about 20k bigramme types with frequencies above 20. An MI of 0.0093 puts Mr President at rank 2 when these types are sorted by MI. For a recent application of MI in collocation extraction see Ramisch et al. (2008).
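As a concrete companion to the Mr President example, here is a small Python sketch (mine, not the author's code) that recomputes PMI and MI from the raw counts in Table 1 below, using the MLE probabilities $p = f/N$; up to rounding it should reproduce the values quoted in the text (PMI ≈ 4.972, MI ≈ 0.0093).

```python
import math

# Counts for the bigramme "Mr President" from Table 1 (Europarl fragment)
N = 3_478_657
counts = {("yes", "yes"): 6_899, ("yes", "no"): 3_849,
          ("no", "yes"): 8_559, ("no", "no"): 3_459_350}

p = {cell: f / N for cell, f in counts.items()}                 # joint MLE probabilities
p_l = {v: sum(p[(v, w)] for w in ("yes", "no")) for v in ("yes", "no")}  # marginals of L_mr
p_r = {w: sum(p[(v, w)] for v in ("yes", "no")) for w in ("yes", "no")}  # marginals of R_president

def pmi(x, y):
    """Pointwise mutual information, eq. (5)."""
    return math.log(p[(x, y)] / (p_l[x] * p_r[y]))

def mi():
    """Mutual information of the two indicator variables, eq. (1)."""
    return sum(p[(x, y)] * pmi(x, y) for (x, y) in p if p[(x, y)] > 0)

print(round(pmi("yes", "yes"), 3))   # ~4.972
print(round(mi(), 4))                # ~0.0093
```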
More common than MI as defined above is the use of the test statistic for the log-likelihood ratio, $G^2$, first proposed as a collocation extraction measure by Dunning (1993). For $G^2$ it has been observed that it is equivalent to MI in collocation extraction (e.g., Evert, 2005, Sect. 3.1.7). [4]

Pointwise MI is also one of the standard association measures in collocation extraction. PMI was introduced into lexicography by Church and Hanks (1990). Confusingly, in the computational linguistic literature, PMI is often referred to as simply MI, whereas in the information theoretic literature, MI refers to the averaged measure. In our example in Table 1, the bigramme Mr President receives a score of $i(L_{mr}{=}\text{yes}, R_{president}{=}\text{yes}) = 4.972$. In our Europarl sample of 20k types, Mr President comes 1573rd in terms of PMI.

Table 1: Counts (left) and MLE probabilities (right) for the bigramme Mr President in a fragment of the English part of the Europarl corpus

Counts:
L_mr \ R_president | yes | no | Total
yes | 6 899 | 3 849 | 10 748
no | 8 559 | 3 459 350 | 3 467 909
Total | 15 458 | 3 463 199 | 3 478 657

MLE probabilities:
L_mr \ R_president | yes | no | Total
yes | .0020 | .0011 | .0031
no | .0025 | .9944 | .9969
Total | .0044 | .9956 |

Although MI and PMI are theoretically related, their behaviour as association measures is not very similar. An observation often made about PMI is that low frequency events receive relatively high scores. For instance, infrequent word pairs tend to dominate the top of bigramme lists that are ranked by PMI. One way this behaviour can be understood is by looking at the PMI value of extreme cases. When the two parts of a bigramme only occur together (the indicator variables of the words are perfectly correlated), we have $p(x,y) = p(x) = p(y)$. In this situation, PMI has a value of $-\ln p(x,y)$. This means that the PMI of perfectly correlated words is higher when the combination is less frequent. Even though these facts about the upper bound do not automatically mean that all low frequency events receive high scores, the upper bound of PMI is not very intuitive for an association measure. [5] Furthermore, the lack of a fixed upper bound means that by looking at PMI alone, we do not know how close a bigramme is to perfect correlation. In contrast, we do know how close it is to independence, since a completely uncorrelated word pair receives a PMI of 0.

A sensitivity to low frequency material is not necessarily a disadvantage. As mentioned in the introduction, different collocation extraction tasks may have different effective association measures. If we look at the MWE 2008 shared task results (Evert, 2008b), we can conclude that PMI performs relatively well as an association measure in those cases where bare occurrence frequency does not.

[Footnote 3: In this paper, we shall use two-word collocations as our running example. The indicator variable $L_w$ maps to yes when the leftmost word in a candidate is $w$ and to no otherwise. Similarly for $R_w$ and the rightmost word.]

[Footnote 4: As mentioned, we use association measures to rank candidates. A measure is thus equivalent to any monotonic transformation. $G^2$ and MI differ by a constant factor $2N$, where $N$ is the corpus size, if we assume a maximum likelihood estimate for probabilities ($f/N$), since
$$G^2 = 2 \sum_{x,y} f(x,y) \ln \frac{f(x,y)}{f_e(x,y)} = 2N \sum_{x,y} p(x,y) \ln \frac{p(x,y)}{p(x)p(y)} = 2N \cdot \text{MI},$$
where the expected frequency $f_e(x,y) = f(x)/N \cdot f(y)/N \cdot N$.]

[Footnote 5: The unintuitive moving upper bound behaviour of PMI is related to the use of a ratio of probabilities. The statistical measure of effect size relative risk has a similar problem. Figuratively, there is a 'probability roof' that one cannot go through, e.g., $p(x)$ can be twice as high as $p(y)$ when $p(y) = .05$, but not when $p(y) = .55$. The probability roof of $p(a,b)$ is $\min(p(a), p(b))$, which, in terms of ratios, becomes further away from $p(a)p(b)$ as $p(a)$ and $p(b)$ get smaller.]
That is, there are collocation extraction tasks in which the relative lack of a correlation with occurrence frequency is an attractive property.

MI does not suffer from a sensitivity to low frequency data, as it is an average of PMIs weighted by $p(x,y)$: as $p(x,y)$ goes down, the impact of the increasing PMI on the average becomes smaller. In fact, in the kind of data we have in collocation extraction, we may expect the upper bound of MI to be positively correlated with frequency. MI equals the entropy of the two indicator variables when they are perfectly correlated. Its maximum is thus higher for more evenly distributed variables. In contingency tables from corpus data like in Table 1, by far the most probability mass is in the bottom right cell ($L_w$ = no, $R_v$ = no). It follows that entropy, and thus maximal MI, is (slightly) higher for combinations that occur more frequently. As with PMI, however, the lack of a fixed upper bound for MI does mean that it is easier to interpret it as a measure of independence (distance to 0) than as a measure of correlation.

3 Normalizing MI and PMI

To give MI and PMI a fixed upper bound, we normalize the measures to have a maximum value of 1 in the case of perfect (positive) association. For PMI, it is hoped that this move will also reduce some of the low frequency bias. There are several ways of normalizing MI and PMI, as in both cases the maximum value of the measure coincides with several other measures.

3.1 Normalized PMI

When two words only occur together, the chance of seeing one equals the chance of seeing the other, which equals the chance of seeing them together. PMI is then:

$$i(x,y) = -\ln p(x) = -\ln p(y) = -\ln p(x,y) \qquad (6)$$

(when $X$ and $Y$ are perfectly correlated and $p(x,y) > 0$). This gives us several natural options for normalization: normalizing by some combination of $-\ln p(x)$ and $-\ln p(y)$, or by $-\ln p(x,y)$. We choose the latter option, as it has the pleasant property that it normalizes the upper as well as the lower bound. We therefore define normalized PMI as:

$$i_n(x,y) = \left( \ln \frac{p(x,y)}{p(x)p(y)} \right) \Big/ -\ln p(x,y). \qquad (7)$$

Some orientation values of NPMI are as follows: when two words only occur together, $i_n(x,y) = 1$; when they are distributed as expected under independence, $i_n(x,y) = 0$, as the numerator is 0; finally, when two words occur separately but not together, we define $i_n(x,y)$ to be $-1$, as it approaches this value when $p(x,y)$ approaches 0 and $p(x)$, $p(y)$ are fixed. For comparison, these orientation values for PMI are respectively $-\ln p(x,y)$, 0 and $-\infty$. [6]

[Footnote 6: One of the alternatives, which we would like to mention here but reserve for future investigations, is to normalize by $-\ln \max(p(x), p(y))$. This will cause the measure to take its maximum of 1 in cases of positive dependence, i.e., when one word only occurs in the context of another, but not necessarily the other way around. It seems plausible that there are collocation extraction tasks where this is a desired property, for instance in cases where the variation in one part of the collocation is much more important than in the other. See Evert (2007, Sect. 7.1) for some remarks about asymmetry in collocations.]

3.2 An Aside: PMI²

Since the part of the PMI definition inside the logarithm has an upper bound of $1/p(x,y)$, one may also consider 'normalizing' this part. The result is called PMI², defined in (8):

$$\ln \left( \frac{p(x,y)}{p(x)p(y)} \Big/ \frac{1}{p(x,y)} \right) = \ln \frac{p(x,y)^2}{p(x)p(y)}. \qquad (8)$$
The orientation values of PMI² are not so neat as NPMI's: 0, $\ln p(x,y)$, and $-\infty$ respectively. As a normalization, NPMI seems to be preferable. However, PMI² is part of a family of heuristic association measures defined in Daille (1994). The PMI^k family was proposed in an attempt to investigate how one could improve upon PMI by introducing one or more factors of $p(x,y)$ inside the logarithm. Interestingly, Evert (2005) has already shown PMI² to be a monotonic transformation of the geometric mean association measure. [7] Here we see that there is a third way of understanding PMI²: as the result of normalizing the upper bound before taking the logarithm. [8]

[Footnote 7: The geometric mean association measure is: $\text{gmean}(x,y) = \frac{f(x,y)}{\sqrt{f(x)f(y)}}$.]

[Footnote 8: We have further noticed that in practice PMI² is nearly a monotone transformation of $X^2$. To see why this may be so, consider one of the simplifications of $X^2$ valid in the case of two indicator variables (Evert, 2005, Lemma A.2):
$$\frac{N \cdot \big[f(L_w{=}\text{yes}, R_v{=}\text{yes}) - f_e(L_w{=}\text{yes}, R_v{=}\text{yes})\big]^2}{f_e(L_w{=}\text{yes}, R_v{=}\text{yes}) \cdot f_e(L_w{=}\text{no}, R_v{=}\text{no})}$$
It is not uncommon for $f_e(L_w{=}\text{no}, R_v{=}\text{no})$ to be nearly $N$ and for $f_e(L_w{=}\text{yes}, R_v{=}\text{yes})$ to be orders of magnitude smaller than $f(L_w{=}\text{yes}, R_v{=}\text{yes})$ in a co-occurrence table. If we 'round off' the formula accordingly and take its logarithm, we arrive at PMI².]

3.3 Normalized MI

We know that in general $0 \leq I(X;Y) \leq H(X), H(Y) \leq H(X,Y)$. In addition, when $X$ and $Y$ correlate perfectly, it is also the case that $I(X;Y) = H(X) = H(Y) = H(X,Y)$. As in the case of PMI before, this gives us more than one way to normalize MI. In analogy to NPMI, we normalize MI by the joint entropy:

$$I_n(X,Y) = \frac{\sum_{x,y} p(x,y) \ln \frac{p(x,y)}{p(x)p(y)}}{-\sum_{x,y} p(x,y) \ln p(x,y)} \qquad (9)$$

MI is the expected value of PMI. Likewise, the normalizing function in NMI is the expected value of the normalizing function in NPMI: $-\sum_{x,y} p(x,y) \ln p(x,y) = E_{p(X,Y)}[-\ln p(X,Y)]$. The orientation values of NMI are 1 for perfect positive and negative correlation, and 0 for independence. It is possible to define a signed version of (N)MI by multiplying by $\pm 1$ depending on the sign of $p(x,y) - p(x)p(y)$. This does not make a practical difference for the extraction results, however: the observed dispreferred bigrammes typically do not get very high scores and therefore do not get interspersed with preferred combinations.

3.4 Previous Work on Normalizing (P)MI

The practice of normalizing MI, whether as in (9) or by alternative factors, is common in data mining and information retrieval. An overview of definitions and data mining references can be found in Yao (2003). As mentioned above, PMI², as a special case of PMI^k, was introduced and studied in Daille (1994), together with a range of other association measures. PMI² and PMI³ were re-proposed as (log frequency biased) mutual dependency in Thanopoulos et al. (2002), in an attempt to get a more intuitive relation between PMI's upper bound and occurrence frequency.
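To make the normalized measures concrete, the following Python sketch (not from the paper) computes NPMI, eq. (7), and NMI, eq. (9), for a single bigramme from its frequency, the frequencies of its parts and the corpus size, using MLE probabilities as in Sect. 2. The function name and the construction of the 2x2 table are my own choices.

```python
import math

def normalized_measures(f_xy, f_x, f_y, N):
    """NPMI (eq. 7) and NMI (eq. 9) for one bigramme, from corpus counts.

    f_xy: frequency of the bigramme, f_x / f_y: frequencies of its parts,
    N: number of bigramme tokens. Probabilities are MLE (p = f/N).
    """
    # joint distribution over the 2x2 indicator-variable table
    joint = {
        ("yes", "yes"): f_xy / N,
        ("yes", "no"): (f_x - f_xy) / N,
        ("no", "yes"): (f_y - f_xy) / N,
        ("no", "no"): (N - f_x - f_y + f_xy) / N,
    }
    marg_l = {"yes": f_x / N, "no": (N - f_x) / N}
    marg_r = {"yes": f_y / N, "no": (N - f_y) / N}

    # NPMI: PMI of the (yes, yes) cell divided by -ln p(x,y)
    p_xy = joint[("yes", "yes")]
    npmi = math.log(p_xy / (marg_l["yes"] * marg_r["yes"])) / -math.log(p_xy)

    # NMI: MI divided by the joint entropy
    mi = sum(p * math.log(p / (marg_l[l] * marg_r[r]))
             for (l, r), p in joint.items() if p > 0)
    h_joint = -sum(p * math.log(p) for p in joint.values() if p > 0)
    return npmi, mi / h_joint

# Mr President counts from Table 1
print(normalized_measures(6_899, 10_748, 15_458, 3_478_657))
```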
4 A Preliminary Empirical Investigation

To get a better feeling for the effect of normalizing MI and PMI, we present results of evaluating NMI and NPMI against three parts of the MWE 2008 shared task. The procedure is as follows: the collocation candidates are ranked according to the association measure. These lists are then compared to the gold standards by calculating average precision. Average precision takes a value of 100% when all collocations are ranked before the non-collocations. Its value equals the percentage of collocations in the dataset when the candidates are randomly ordered.

The first dataset contains 1212 adjective-noun bigrammes sampled from the Frankfurter Rundschau (Evert, 2008a). We consider three different subtasks on the basis of this dataset, depending on how narrowly we define collocation in terms of the annotation. The second dataset is described in Krenn (2008) and contains 5102 German verb-PP combinations, also taken from the Frankfurter Rundschau. Here, too, we look at three subtasks by considering either or both of the annotated collocation types as actual collocations. For the third and final dataset, we look at 12232 Czech bigrammes, described in Pecina (2008b).

[Figure 1: Normalization of PMI (left) and MI (right) per frequency group (German AN data). The plots show PMI rank against NPMI rank and MI rank against NMI rank for roughly 1200 candidates, with the points divided into three frequency groups (f ≤ 20, 20 < f ≤ 100, f > 100).]

Before evaluating (N)(P)MI against the gold standards, it is instructive to look at the effects of normalization on the ranking of bigrammes produced by each measure. To this end, we plotted the rankings according to the original measures against their normalized counterparts in Fig. 1. From the left plot, we conclude that PMI and NPMI agree well in the ranking: the ranks fall rather closely to the diagonal. For MI and NMI, to the right, we see that normalization has more impact, as the points deviate far from the diagonal. In addition, the plotted data has been divided into three groups: high, medium, and low frequency. Normalizing PMI should reduce the impact of low frequency on ranking. Indeed, we see that the low frequency points fall above the diagonal - i.e., they are ranked lower by NPMI than by PMI, if we consider 1 to be the highest rank - and high frequency points fall below it. Normalizing MI, on the other hand, on average moves high frequency data points down and low frequency points up. All in all, we can see that in practice normalization does what we wanted: normalizing PMI makes it slightly less biased towards low frequency collocations, normalizing MI makes it less biased towards high frequency ones.

Although not as clearly observable as the effect of normalization, the graphs in Fig. 1 also show the relation of the un-normalized measures to simple occurrence frequency. For MI, high frequency combinations tend to appear in the upper half of the ranked bigramme list. If we rank after PMI, however, the high frequency bigrammes are more evenly spread out. PMI's alleged sensitivity to low frequency is perhaps more accurately described as a lack of sensitivity to high frequency.
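Average precision itself is not given as a formula in the paper; the sketch below (my own, for illustration) assumes the usual definition, the mean of the precision values at the ranks of the true collocations, which yields 100% when all collocations outrank all non-collocations.

```python
def average_precision(ranked_candidates, gold):
    """Average precision of a ranked candidate list against a gold set.

    ranked_candidates: candidates sorted by association score, best first.
    gold: set of candidates annotated as true collocations.
    Returns a percentage.
    """
    hits, precision_sum = 0, 0.0
    for rank, cand in enumerate(ranked_candidates, start=1):
        if cand in gold:
            hits += 1
            precision_sum += hits / rank     # precision at this recall point
    return 100.0 * precision_sum / hits if hits else 0.0

# toy usage: rank bigrammes by a score dictionary and evaluate
scores = {("Mr", "President"): 4.97, ("of", "the"): 0.2, ("dry", "ice"): 3.1}
gold = {("Mr", "President"), ("dry", "ice")}
ranking = sorted(scores, key=scores.get, reverse=True)
print(average_precision(ranking, gold))   # 100.0: both collocations are ranked on top
```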
Table 2: Evaluation of (P)MI and their normalized counterparts on three datasets. Reported are the average precision scores in percent.

Measure | German AN cat 1 | German AN cat 1-2 | German AN cat 1-3 | German V-PP figur | German V-PP support | German V-PP both | Czech bigrammes
random | 28.6 | 42.0 | 51.6 | 5.4 | 5.8 | 11.1 | 21.2
frequency | 32.2 | 47.0 | 56.3 | 13.6 | 21.9 | 34.1 | 21.8
pmi | 44.6 | 54.7 | 61.3 | 15.5 | 10.5 | 24.4 | 64.9
npmi | 45.4 | 56.1 | 62.7 | 16.0 | 11.8 | 26.8 | 65.6
pmi² | 45.4 | 56.8 | 63.5 | 17.0 | 13.6 | 29.9 | 65.1
mi | 42.0 | 56.1 | 64.1 | 17.3 | 22.9 | 39.0 | 42.5
nmi | 46.1 | 58.6 | 65.3 | 14.9 | 10.6 | 24.6 | 64.0

Table 2 contains the results of the evaluation of the measures on the three data sets. The 'random' and 'frequency' measures have been included as baselines. The reported numbers should only be taken as indications of effectiveness, as no attempt has been made to estimate the statistical significance of the differences in the table. Also, the results do not in any sense represent state-of-the-art performance: Pecina (2008a) has shown it is possible to reach much higher levels of effectiveness on these datasets with machine learning techniques.

Table 2 shows that NPMI and PMI² consistently perform slightly above PMI. The trio performs below the frequency baseline on the German V-PP data in the 'support' and 'both' subtasks. This is to be expected, at least for PMI and NPMI: the frequency baseline is high in these data (much higher than random), suggesting that measures that show more frequency influence (and thus not (N)PMI) will perform better. The behaviour of NMI is rather different from that of MI. In fact it seems that NMI behaves more like one of the pointwise measures. Most dramatically this is seen when MI is effective but the pointwise trio is not: in the German V-PP data normalizing MI has a disastrous effect on average precision. In the other cases, normalizing MI has a positive effect on average precision. Summarizing, we can say that, throughout, normalizing PMI has a moderate but positive effect on its effectiveness in collocation extraction. We speculate that it may be worth using NPMI instead of PMI in general. NMI, however, is a very different measure from MI, and it makes more sense to use the original and the normalized variant alongside each other.

5 Conclusion and Future Work

In this paper, we have tried to introduce into the collocation extraction research field the normalized variants of two commonly used association measures: mutual information and pointwise mutual information. The normalized variants NMI and NPMI have the advantage that their values have fixed interpretations. In addition, a pilot experimental study suggests that NPMI may serve as a more effective replacement for PMI. NMI and MI, on the other hand, differ more strongly from each other. However, as the collocation literature has shown that the effectiveness of a measure is strongly tied to the task, much more, and more profound, empirical study is needed before NPMI can be declared always more effective than PMI. In the experiments discussed above, we have relied on MLE in the calculation of the association scores. Since the measures are functions of probabilities, and not frequencies directly, it is straightforward to replace MLE with other ways of estimating probabilities, for instance some smoothing method.
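As one entirely illustrative instantiation of this remark, the sketch below replaces the MLE cell probabilities with add-k smoothed ones before computing NPMI; add-k is my own choice of example, not a method proposed in the paper.

```python
import math

def smoothed_probs(f_xy, f_x, f_y, N, k=0.5):
    """Add-k smoothing over the 2x2 contingency table of a bigramme.

    Each of the four cells receives an extra pseudo-count k, so the
    marginals gain 2k and the total gains 4k.
    """
    total = N + 4 * k
    p_xy = (f_xy + k) / total
    p_x = (f_x + 2 * k) / total
    p_y = (f_y + 2 * k) / total
    return p_xy, p_x, p_y

def npmi(p_xy, p_x, p_y):
    """Normalized PMI, eq. (7), on (possibly smoothed) probabilities."""
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

p_xy, p_x, p_y = smoothed_probs(6_899, 10_748, 15_458, 3_478_657)
print(round(npmi(p_xy, p_x, p_y), 3))
```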
A more radical further step would be to use a different reference distribution in the association measures, i.e., to measure p ( x, y ) ’s deviation from something else than p ( x ) p ( y ) . A change of reference distribution may, however, force us to adopt other normalization strategies. Finally, as indicated in Section 3, there is more than one way to Rome when it comes to normalization. We hope to have demonstrated in this paper that investigating the proposed normalized measures as well as alternative ones is worth the effort in the context of collocation research. References Baldwin, T. (2008). A resource for evaluating the deep lexical acquisition of English verb-particle constructions. In Proceedings of the LREC 2008 Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), pages 1-2, Marrakech. Church, K. W. & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22-29. Cover, T. & Thomas, J. (1991). Elements of Information Theory. Wiley & Sons, New York. Daille, B. (1994). Approche mixte pour l’extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. Ph.D. thesis, Université Paris 7. Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61-74. Evert, S. (2004/ 2005). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, IMS Stuttgart. Evert, S. (2007). Corpora and collocations. Extended Manuscript of Chapter 58 of A. Lüdeling and M. Kytö, 2008, Corpus Linguistics. An International Handbook, Mouton de Gruyter, Berlin. Evert, S. (2008a). A lexicographic evaluation of German adjective-noun collocations. In Proceedings of the LREC 2008 Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), pages 3-6, Marrakech. Evert, S. (2008b). The MWE 2008 shared task: Ranking MWE candidates. Slides presented at MWE 2008. http: / / multiword.sourceforge.net/ download/ SharedTask2008.pdf . Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In MT Summit 2005. Krenn, B. (2008). Description of evaluation resource - German PP-verb data. In Proceedings of the LREC 2008 Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), pages 7-10, Marrakech. Manning, C. & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA. Pecina, P. (2008a). A machine learning approach to multiword expression extraction. In Proceedings of the LREC 2008 Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), pages 54-57, Marrakech. 40 Gerlof Bouma Pecina, P. (2008b). Reference Data for Czech Collocation Extraction. In Proceedings of the LREC 2008 Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), pages 11-14, Marrakech. Ramisch, C., Schreiner, P., Idiart, M., & Villavicencio, A. (2008). An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC 2008 Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), pages 50-53, Marrakech. Thanopoulos, A., Fakotakis, N., & Kokkinakis, G. (2002). Comparative evaluation of collocation extraction metrics. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), pages 620-625, Las Palmas. Yao, Y. (2003). Information-theoretic measures for knowledge discovery and data mining. In Karmeshu, editor, Entropy Measures, Maximum Entropy and Emerging Applications, pages 115-136. 
Springer, Berlin. Hypernymy Extraction Based on Shallow and Deep Patterns * Tim vor der Brück Intelligent Information and Communication Systems (IICS) FernUniversität in Hagen 58084 Hagen, Germany tim.vorderbrueck@fernuni-hagen.de Abstract There exist various approaches to construct taxonomies by text mining. Usually these approaches are based on supervised learning and extract in a first step several patterns. These patterns are then applied to previously unseen texts and used to recognize hypernym/ hyponym pairs. Normally these approaches are only based on a surface representation or a syntactic tree structure, i.e., a constituency or dependency tree derived by a syntactical parser. In this work we present an approach which, additionally to shallow patterns, directly operates on semantic networks which are derived by a deep linguistic syntactico-semantic analysis. Furthermore, the shallow approach heavily depends on semantic information, too. It is shown that either recall or precision can be improved considerably than by relying on shallow patterns alone. 1 Introduction A large knowledge base is needed by many tasks in the area of natural language processing, including question answering, textual entailment or information retrieval. One of the most important relations stored in a knowledge base is hypernymy which is often referred to as the is-a relation. Quite a lot of effort was spent on hypernymy extraction from natural language texts. The approaches can be divided into three different types of methods: • Analyzing the syntagmatic relations in a sentence • Analyzing the paradigmatic relations in a sentence • Document Clustering A quite popular approach of the first type of algorithms was proposed by Hearst and consists of the usage of so-called Hearst patterns (Hearst, 1992). These patterns are applied on arbitrary texts and the instantiated pairs are then extracted as hypernymy relations. Several approaches were developed to extract such patterns automatically from a * Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 41-52. 42 Tim vor der Brück text corpus by either employing a surface (Morin & Jaquemin, 2004; Maria Ruiz-Casado & Castells, 2005) or a syntactical tree representation (Snow et al., 2005). Paradigmatic approaches expect that words in the textual context of the hypernym (e.g., neighboring words) can also occur in the context of the hyponym. The textual context can be represented by a set of the words which frequently occur together with the hypernym (or hyponym). Whether a word is the hypernym of a second word can then be determined by a semantic similarity measure on the two sets (Cimiano et al., 2005). If Word Sense Disambiguation is used, those approaches can operate directly on concepts instead of words which is currently rather rarely done. A further method to extract hypernymy relations is document clustering. For that, the documents are hierarchically clustered. Each document is assigned a concept or word it describes. The document hierarchy is then transferred to a concept or word hierarchy (Quan et al., 2004). In this work 1 we will follow a hybrid approach. On the one side, we apply shallow patterns which do not require parsing but only need a tokenization of the analyzed sentence. 
In contrast to most common approaches our shallow method extracts pairs of concepts, not of words, as determined by Word Sense Disambiguation. On the other side, we employ deep patterns directly on the semantic networks (SN) which are created by a deep semantic parser. These patterns are partly learned by text mining on the SN representations and partly manually defined. We use for the extraction of hypernyms the German Wikipedia corpus from November 2006 which consists of about 500 000 articles. 2 System Architecture Fig. 2 shows the architecture of our system SemQuire (SemQuire relating to Acquire Semantic Knowledge). In the first step, the Wikipedia corpus is parsed by the deep analyzer WOCADI 2 (Hartrumpf, 2002). The parsing process does not employ a grammar but is based on a word class functional analysis. For that it uses a semantic lexicon (Hartrumpf et al., 2003) containing currently 28 000 deep and 75 000 shallow entries. For each sentence, WOCADI tries to create a token list, a dependency tree and a SN. In contrast to the SN and the dependency tree, the token list is always created even if the analyzed sentence is ill-formed and not syntactically correct. Both types of patterns (shallow and deep) are applied to the parse result of Wikipedia. In particular, the shallow patterns are applied on the token information while the deep patterns are applied on the SNs. If an application of such a pattern is successful, the variables occurring in the patterns are instantiated with concepts of the SN (or with concepts occurring in the token list for shallow patterns) and a hypernymy relation is extracted. Furthermore, a first validation is made which is based on semantic features and ontological sorts (a description of semantic features and ontological sorts can be found in Helbig (2006)). If the validation is successful, the extracted relation is stored in the knowledge base. We currently develop an approach to generate for each relation a quality score by the combination of several features. Relations assigned a low quality score should then be seen with caution and have to be validated before using them in any operational system. 1 Note that this work is related to the DFG-project: Semantische Duplikatserkennung mithilfe von Textual Entailment (HE 2847/ 11-1) 2 WOCADI is the abbreviation for WOrd ClAss DIsambiguation. Hypernymy Extraction Based on Shallow and Deep Patterns 43 Tokens SN Validation (Score) Text Shallow patterns Deep patterns WOCADI Validation (Filter) ledge base Know− Figure 1: System architecture of SemQuire 3 Application of Shallow Patterns The information for a single token as returned by the WOCADI parser consists of • word-id: the number of the token • char-id: the character position of the token in the surface string • cat: the grammatical category • lemma: a list of possible lemmas • reading: a list of possible concepts • parse-reading/ lemma: a concept and lemma determined by Word Sense Disambiguation (see Fig. 2). The chosen concept must be contained in the concept list, analogously for the lemma. Note that concepts are marked by trailing numbers indicating the intended reading (e.g., house.1.1). A pattern is given by a premise and a conclusion SU B ( a, b ) . The premise consists of a regular expression containing variables and feature value structures (see Fig. 3) where the variables are restricted to the two appearing in the conclusion (a,b). 
As usual, a question mark denotes the fact that the following expression is optional, a wildcard denotes the fact that zero or more of the following expression are allowed. The variables are instantiated with concepts relating to nouns from the token list (parse-lemma) as returned by WOCADI. The feature value structures are tried to be unified with token information from the token list. Since all variables of the conclusion must show up in the premise too, the conclusion variables are fully instantiated if a match is successful. The instantiated 44 Tim vor der Brück (analysis-ml ( ((word “Der”) (word-id 1) (char-id 0) (cat (art dempro)) (lemma (“der”)) (reading (“der.1" “der.4.1”) (parse-lemma “der”) (parse-reading “der.1”))) ((word “Bundeskanzler”) (word-id 2) (char-id 4) (cat (n)) (lemma (“Bundeskanzler”)) (reading (“bundeskanzler.1.1”)) (parse-lemma “bundeskanzler”) (parse-reading “bundeskanzler.1.1”)) ((word “und”) (word-id 3) (char-id 18) (cat (conjc)) (lemma (“und”)) (reading (“und.1”))) ((word “andere”) (word-id 4) (char-id 22) (cat (a indefpro)) (lemma (“ander”)) (reading (“ander.1.1" “ander.2.1”)) (parse-lemma “ander”) (parse-reading “ander.1.1”)) ((word “Politiker”) (word-id 5) (char-id 29) (cat (n)) (lemma (“Politiker”)) (reading (“politiker.1.1”)) (parse-lemma “politiker”) (parse-reading “politiker.1.1”)) ((word “kritisierten”) (word-id 6) (char-id 39) (cat (a v)) (lemma (“kritisieren" “kritisiert”)) (reading (“kritisieren.1.1”)) (parse-lemma “kritisieren”) (parse-reading “kritisieren.1.1”)) ((word “das”) (word-id 7) (char-id 52) (cat (art dempro)) (lemma (“der”)) (reading (“der.1" “der.4.1”)) (parse-lemma “der”) (parse-reading “der.1”)) ((word “Gesetz”) (word-id 8) (char-id 59) (cat (n)) (lemma (“Gesetz”)) (reading (“gesetz.1.1”)) (parse-lemma “Gesetz”) (parse-reading “gesetz.1.1”)) ((word “.”) (word-id 9) (char-id 67) (cat (period)) (lemma (“.”)) (reading (“period.1”))))) Figure 2: Token information for the sentence Der Bundeskanzler und andere Politiker kritisierten das Gesetz. ‘The chancellor and other politicians criticized the law.’ as returned by the WOCADI parser Hypernymy Extraction Based on Shallow and Deep Patterns 45 (((SUB a b)) ( a * ( ((word “,”)) ? (((cat (art)))) a) ((word “und”)) ? (((cat (art)))) ((word “andere”)) ? (((cat (a)))) b)) Figure 3: One shallow pattern used to extract hypernymy relations conclusion is then extracted as a hypernymy relation. Note that if a parse is not successful, a disambiguation to a single concept for a token is usually not possible. In this case the concept is chosen from the token’s concept list which occurs in the corpus most often. The entire procedure is illustrated in Fig. 4. If a variable appears several times in the premise part it is bound to several constants and, in the case a match could be established, the elements of the Cartesian product of all variable bindings for the two variables are extracted as relation pairs. Example: The chancellor, the secretary and other politicians criticized the law. If the pattern specified in Fig. 3 is applied on the sentence above, the variable a can be bound to chancellor.1.1 and secretary.1.1, b can be bound to politician.1.1. Thus, the two relations SUB ( chancellor .1 .1 , politician.1 .1 ) and SUB ( secretary .1 .1 , politician.1 .1 ) are extracted. We employed 20 shallow patterns. A selection of them is displayed in Table 1 3 . 
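The matching procedure can be made concrete with a small sketch that mimics the application of the pattern from Fig. 3 to a WOCADI-style token list. The token dictionaries and the function match_und_andere are hypothetical simplifications (only the fields word, cat and parse-reading are kept, and the optional article and adjective elements of the premise are handled only implicitly); the actual system unifies full feature value structures.

```python
# Much simplified matcher for the pattern of Fig. 3 ("A1, ..., An und andere B"):
# it collects the noun concepts bound to the variable a, skips commas and articles,
# and binds the noun following "und andere" to b.
def match_und_andere(tokens):
    a_bindings, i, n = [], 0, len(tokens)
    while i < n and not (tokens[i]["word"] == "und" and a_bindings):
        tok = tokens[i]
        if "n" in tok["cat"]:
            a_bindings.append(tok["parse-reading"])   # candidate hyponym concept
        elif tok["word"] != "," and "art" not in tok["cat"]:
            return []                                 # unexpected token: no match
        i += 1
    # expect "und andere", then skip anything up to the final noun
    # (this covers the optional adjective allowed by the pattern)
    if i + 2 >= n or tokens[i]["word"] != "und" or tokens[i + 1]["word"] != "andere":
        return []
    j = i + 2
    while j < n and "n" not in tokens[j]["cat"]:
        j += 1
    if j == n:
        return []
    b = tokens[j]["parse-reading"]                    # hypernym concept
    return [("SUB", a, b) for a in a_bindings]        # one pair per a-binding

tokens = [
    {"word": "Der", "cat": ["art"], "parse-reading": "der.1"},
    {"word": "Bundeskanzler", "cat": ["n"], "parse-reading": "bundeskanzler.1.1"},
    {"word": "und", "cat": ["conjc"], "parse-reading": "und.1"},
    {"word": "andere", "cat": ["a", "indefpro"], "parse-reading": "ander.1.1"},
    {"word": "Politiker", "cat": ["n"], "parse-reading": "politiker.1.1"},
]
print(match_und_andere(tokens))
# [('SUB', 'bundeskanzler.1.1', 'politiker.1.1')]
```

With several nouns before "und andere" (as in the chancellor/secretary example above), the list a_bindings contains several concepts and one relation is produced per element, which corresponds to the Cartesian product of the variable bindings.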
Each pattern in this table is accompanied by a precision value which specifies the relative frequency with which a relation extracted by this pattern is actually correct. Relations which are automatically filtered out by the validation component (see Sect. 2) are disregarded for determining the precision. The patterns s3, s5, s7, s8, s9, and s10 are basically German translations of Hearst patterns. Note that pattern s2, in order to obtain an acceptable precision, is only applied to the first sentence of each Wikipedia article, since such sentences usually contain concepts related by hypernymy. 4 Application of Deep Patterns In addition to the shallow patterns we also employ several deep patterns. A selection of deep patterns is shown in Table 2. Fig. 5 shows an example of an SN following the MultiNet paradigm (Helbig, 2006). An SN consists of nodes representing concepts and edges representing relations between concepts. In addition, nodes can also be connected by means of functions (marked by a preceding *). In contrast to relations, the number of arguments is often variable for functions. The result and the arguments of a function correspond to MultiNet nodes. 3 Note that the patterns are actually defined as attribute value structures. However, for better readability and due to space constraints, we use a more compact representation in this table. The following MultiNet relations and functions are used in the diagram shown in Fig. 5: • SUB: relation of conceptual subordination for objects (hypernymy) • TEMP: relation specifying the temporal embedding of a situation • PROP: relation between object and property • SUBS: relation of conceptual subordination for situations (troponymy) • SCAR: cognitive role: carrier of a state, associated with a situation • OBJ: cognitive role: neutral object, associated with a situation • *MODP: function modifying properties Additionally, each node in the SN is associated with a list of layer features, i.e., degree of generality (GENER), determination of reference (REFER), variability (VARIA), facticity (FACT), intensional quantification (QUANT), pre-extensional cardinality (CARD) and entity type (ETYPE). The patterns can refer to the layer features, too. Currently, only the layer feature REFER is used in our patterns (see pattern d3 in Table 2). This layer feature specifies whether a concept is determinate (for instance, by use of a definite article or a demonstrative determiner) or indeterminate. The pattern SUB(A, B) ← SCAR(C, D) ∧ SUB(D, A) ∧ SUBS(C, denote.1.1) ∧ OBJ(C, E) ∧ SUB(E, B) can be matched to the SN displayed in Fig. 5 to extract the relation SUB(skyscraper.1.1, house.1.1), as illustrated in Fig. 6. Note that different sentences can lead to the same SN. For instance, the semantically equivalent sentences He owns a piano, a cello and other instruments. and He owns a piano, a cello as well as other instruments. lead to the same SN. Thus, pattern d4 of Table 2 can be used to extract the relations SUB(piano.1.1, instrument.1.1) and SUB(cello.1.1, instrument.1.1) from both sentences. In general, the number of patterns can be considerably reduced by using an SN in comparison to a surface or a syntactic representation. Patterns d1, d2, and d3 in Table 2 are stricter versions of pattern d4, which leads to a slight increase in precision for patterns d1 and d2. Practically no improvement was observed for pattern d3.
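Before looking at these restrictions in detail, the matching of such a deep pattern can be sketched as a conjunctive query over the edge set of an SN. The following is a schematic re-implementation, not the system's actual matcher; the SN is a simplified rendering of Fig. 5 written as relation triples (the *MODP part is omitted), and the premise corresponds to the pattern just discussed (cf. pattern d7 in Table 2).

```python
# Schematic matcher for deep patterns over a MultiNet-style SN given as
# relation triples. Single upper-case letters are variables; everything
# else is treated as a constant node or concept.
SN = {  # simplified rendering of the SN in Fig. 5
    ("SUBS", "c1", "denote.1.1"), ("TEMP", "c1", "present.0"),
    ("SCAR", "c1", "c2"), ("SUB", "c2", "skyscraper.1.1"),
    ("OBJ", "c1", "c3"), ("SUB", "c3", "house.1.1"),
}

def is_var(x):
    return len(x) == 1 and x.isupper()

def match(premise, sn, binding=None):
    """Yield all variable bindings under which every premise literal is in sn."""
    binding = binding or {}
    if not premise:
        yield dict(binding)
        return
    rel, x, y = premise[0]
    for (r, a, b) in sn:
        if r != rel:
            continue
        new, ok = dict(binding), True
        for var, val in ((x, a), (y, b)):
            if is_var(var):
                if new.setdefault(var, val) != val:
                    ok = False
            elif var != val:
                ok = False
        if ok:
            yield from match(premise[1:], sn, new)

premise = [("SCAR", "C", "D"), ("SUB", "D", "A"),
           ("SUBS", "C", "denote.1.1"), ("OBJ", "C", "E"), ("SUB", "E", "B")]
for b in match(premise, SN):
    print("SUB(%s, %s)" % (b["A"], b["B"]))   # SUB(skyscraper.1.1, house.1.1)
```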
Pattern d1 requires the hypernym node to immediately follow the hyponym node. This prevents the extraction of SUB(cookies.1.1, milk_product.1.1) from the sentence: We bought cookies, butter, and other milk products. Pattern d2 disallows other concept nodes attached to the hyponym node which could be used to further specialize the hyponym candidate, as in the sentence: His father and other gangsters .... Here, the concept node belonging to his father is subordinated to gangster, but this is not the case for father. Thus, in contrast to pattern d4, pattern d2 would not extract SUB(father.1.1, gangster.1.1) from this sentence. Pattern d3, finally, requires that the hypernym not be referentially determined.
Figure 4: Matching a pattern with a token list by unification. The pattern is displayed below the token information. Each variable is set to the value of the parse-reading attribute (pr = parse-reading, read = reading, pl = parse-lemma).
Figure 5: SN for the sentence: A skyscraper denotes a very tall house.
Figure 6: SN matched with the pattern SUB(A, B) ← SCAR(C, D) ∧ SUB(D, A) ∧ SUBS(C, denote.1.1) ∧ OBJ(C, E) ∧ SUB(E, B). Matching edges are printed in bold. The dashed arc is the inferred new edge.
Table 1: The shallow patterns (name, definition, precision) employed for hypernymy extraction, where A/Ai (1 ≤ i ≤ n+1) is the hyponym of B. The subscript l denotes that the lemma is referred to instead of the word form. The precision is not given for patterns which could not be matched often enough for a reliable estimation.
Name | Definition | English Translation | Precision
s1 | als_l A (...) bezeichnet man B | A (...) is called B | 0.79
s2* | A (...) ist ein B | A (...) is a B | 0.75
s3 | A1, ..., An und ander_l B | A1, ..., An and other B | 0.71
s4 | B wie A | B like A | 0.70
s5 | A1, ..., An oder ander_l B | A1, ..., An or other B | 0.66
s6 | B wie beispielsweise A1, ..., An und | oder An+1 | B like for example A1, ..., An and | or An+1 | 0.63
s7 | B, insbesondere A1, ..., An und | oder An+1 | B, especially A1, ..., An and | or An+1 | 0.57
s8 | B, einschließlich A1, ..., An und | oder An+1 | B, including A1, ..., An and | or An+1 | 0.28
s9 | solch ein B wie A1, ..., An und | oder An+1 | such a B like A1, ..., An and | or An+1 | -
s10 | solch_l B wie A1, ..., An und | oder An+1 | such a B like A1, ..., An and | or An+1 | -
s11 | alle B außer A1, ..., An und An+1 | all B except A1, ..., An and An+1 | -
s12 | alle B bis auf A1, ..., An und An+1 | all B except A1, ..., An and An+1 | -
*: pattern is only matched to the first sentence of each Wikipedia article.
Table 2: A selection of deep patterns. F_r(a1, a2): a1 is the first argument of function r and precedes a2 in the argument list; G_r(a1, a2): a1 precedes a2 in the argument list of function r; H_r(a1, a2): a1 immediately precedes a2 in the argument list of function r.
Name | Definition | Precision
d1 | SUB(A, B) ← SUB(C, A) ∧ PRED(E, B) ∧ F_*ITMS(D, C) ∧ F_*ITMS(D, E) ∧ H_*ITMS(C, E) ∧ PROP(E, ander.1.1 (other.1.1)) | 0.74
d2 | SUB(A, B) ← SUB(C, A) ∧ PRED(E, B) ∧ F_*ITMS(D, C) ∧ F_*ITMS(D, E) ∧ G_*ITMS(C, E) ∧ PROP(E, ander.1.1 (other.1.1)) ∧ ¬ATTCH(J, C) | 0.74
d3 | SUB(A, B) ← SUB(C, A) ∧ PRED(E, B) ∧ F_*ITMS(D, C) ∧ F_*ITMS(D, E) ∧ G_*ITMS(C, E) ∧ PROP(E, ander.1.1 (other.1.1)) ∧ ¬REFER(E, DET) | 0.73
d4 | SUB(A, B) ← SUB(C, A) ∧ PRED(E, B) ∧ F_*ITMS(D, C) ∧ F_*ITMS(D, E) ∧ G_*ITMS(C, E) ∧ PROP(E, ander.1.1 (other.1.1)) | 0.73
d5 | SUB(A, B) ← PRED(C, B) ∧ SUB(E, A) ∧ F_*ALTN1(D, C) ∧ F_*ALTN1(D, E) ∧ PROP(C, ander.1.1 (other.1.1)) | 0.71
d6 | SUB(A, B) ← SUB(C, B) ∧ SUB(D, A) ∧ SUB(D, C) | 0.66
d7 | SUB(A, B) ← SCAR(C, D) ∧ SUB(D, A) ∧ OBJ(C, E) ∧ SUB(E, B) ∧ SUBS(C, bezeichnen.1.1 (denote.1.1)) | 0.60
d8 | SUB(A, B) ← ARG2(D, C) ∧ SUB(C, A) ∧ MCONT(E, D) ∧ SUB(F, man.1.1 (one.1.1)) ∧ SUBS(E, bezeichnen.1.1 (denote.1.1)) ∧ ARG1(D, G) ∧ PRED(G, B) ∧ AGT(E, F) | 0.51
d9 | SUB(A, B) ← ARG1(D, E) ∧ ARG2(D, F) ∧ SUBR(D, equ.0) ∧ SUB(E, A) ∧ SUB(F, B) | 0.17
5 Evaluation We applied the patterns to the German Wikipedia corpus from 2005, which contains 500 000 articles. In total, we extracted 160 410 hypernymy relations employing 12 deep and 20 shallow patterns. The deep patterns were matched against the SN representations, the shallow patterns against the token lists. Concept pairs which are also recognized by the morphological compound analysis (a compound is normally a hyponym of its primary concept) were excluded from the results, since such pairs can be recognized on the fly and need not be stored in the knowledge base. Otherwise, the number of extracted concept pairs would be much larger than 160 410. Naturally, shallow patterns have the advantage that they are applicable if the parse fails. On the other hand, deep patterns are still applicable if there are additional constituents and subclauses between hypo- and hypernyms, which usually cannot be covered by shallow patterns. The following sentences from the Wikipedia corpus are typical examples where the hypernymy relationship could only be extracted using deep patterns (hyponym and hypernym are underlined): Das typisch nordhessische Haufendorf liegt am Emsbach im historischen Chattengau, wurde im Zuge der hessischen Gebiets- und Verwaltungsreform am 1.
Februar 1971 Stadtteil von Gudensberg, und hatte 2005 1 400 Einwohner. 'Haufendorf, a typical north Hessian village, is located at the Emsbach in the historical Chattengau, became, during the Hessian area and administration reform, a district of Gudensberg and had 1 400 inhabitants in 2005.' Auf jeden Fall sind nicht alle Vorfälle aus dem Bermudadreieck oder aus anderen Weltgegenden vollständig geklärt. 'In any case, not all incidents from the Bermuda Triangle or from other world areas are fully explained.' From the last example, a hypernymy pair can be extracted by application of rule d5 from Table 2, but not by any of the shallow patterns. The application of pattern s5 fails due to the word aus 'from', which cannot be matched. To extract this relation by means of shallow patterns, an additional pattern would have to be introduced. This would also be the case if deep syntactic patterns were used instead, since the coordination of Bermudadreieck 'Bermuda Triangle' and Weltgegenden 'world areas' is not represented in the syntactic dependency tree but only on the semantic level. For every pattern, we evaluated the portion of extracted relations which are regarded as correct. Obvious mismatches are recognized automatically by checking the ontological sorts and semantic features of hyponym and hypernym for subsumption and are filtered out beforehand. An extracted relation is only considered correct if it makes sense to store this relation without modifications in an ontology or a name list. This means that extracted relations assumed to express hypernymy are considered incorrect if • multi-token expressions are not correctly recognized, • the singular forms of unknown concepts appearing in plural form are not determined correctly, • the hypernym is too general, e.g., word or concept, or • the wrong reading is chosen by the Word Sense Disambiguation. The precision of each pattern is shown in Table 1 for the shallow patterns and in Table 2 for the deep patterns. 77 870 of the extracted relations were determined only by the deep but not by the shallow patterns. If relations extracted by the rather unreliable pattern d9 are disregarded, this number reduces to 27 999. Conversely, 61 998 of the relations were determined by the shallow but not by the deep patterns. 20 542 of the relations were recognized by both the deep and the shallow patterns. Naturally, only a small fraction of the relations were manually checked for correctness. The accuracy of the annotated relations extracted by the shallow patterns is 0.62, that of the relations extracted by the deep ones 0.51. The accuracy of the relations extracted by both the deep and the shallow patterns is 0.80, considerably higher than the other two values. 6 Conclusion and Outlook We introduced an approach for extracting hypernymy relations by a combination of shallow and deep patterns, where the shallow patterns are applied to the token lists and the deep patterns to the SNs representing the meaning of the sentences. By using a semantic representation, the number of patterns can be reduced in comparison to a syntactic or surface representation. Furthermore, by combining shallow and deep patterns, the precision or the recall of the extracted relations can be improved considerably. If a parse was not successful, we can still extract relations by employing the shallow patterns.
In contrast, if additional constituents show up between the hyponym and the hypernym, the application of shallow patterns often fails, whereas the hypernymy relation can still be extracted by the application of a deep pattern. In order to further improve recall and precision, we are currently working on assigning a quality score to the extracted hypernymy pairs. Furthermore, the possibility for shallow patterns to require certain lemmas or concepts to show up in the token list is only rarely used and should be exploited more often. In strongly inflecting languages like German, the use of lemmas or concepts instead of word forms can improve the applicability of patterns considerably. Extracting relations using shallow patterns currently leads to a higher recall than using the deep ones. Thus, the collection of applied deep patterns should be further extended. Finally, it is planned to transfer the entire relation extraction approach to relations other than hyponymy, especially meronymy. To do this, the shallow and deep patterns would have to be replaced. Several patterns for meronymy extraction are, for instance, described by Girju et al. (2006). Furthermore, the two validation components (see Sect. 2) need to be modified for this purpose. Acknowledgements We wish to thank all people who contributed to this work, especially Sven Hartrumpf for proofreading this document. References Cimiano, P., Pivk, A., Schmidt-Thieme, L., & Staab, S. (2005). Learning Taxonomic Relations from Heterogeneous Sources of Evidence. In P. Buitelaar, P. Cimiano, & B. Magnini, editors, Ontology Learning from Text: Methods, Evaluation and Applications, pages 59-73. IOS Press, Amsterdam, The Netherlands. Girju, R., Badulescu, A., & Moldovan, D. (2006). Automatic Discovery of Part-Whole Relations. Computational Linguistics, 32(1), 83-135. Hartrumpf, S. (2002). Hybrid Disambiguation in Natural Language Analysis. Ph.D. thesis, FernUniversität in Hagen, Fachbereich Informatik, Hagen, Germany. Hartrumpf, S., Helbig, H., & Osswald, R. (2003). The semantically based computer lexicon HaGenLex - Structure and technological environment. Traitement automatique des langues, 44(2), 81-105. Hearst, M. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING), Nantes, France. Helbig, H. (2006). Knowledge Representation and the Semantics of Natural Language. Springer, Berlin, Germany. Morin, E. & Jacquemin, C. (2004). Automatic Acquisition and Expansion of Hypernym Links. Computers and the Humanities, 38(4), 363-396. Quan, T. T., Hui, S. C., Fong, A. C. M., & Cao, T. H. (2004). Automatic Generation of Ontology for Scholarly Semantic Web. In The Semantic Web - ISWC 2004, volume 4061 of LNCS, pages 726-740. Springer, Berlin, Germany. Ruiz-Casado, M., Alfonseca, E., & Castells, P. (2005). Automatic extraction of semantic relationships for WordNet by means of pattern learning from Wikipedia. In 10th International Conference on Applications of Natural Language to Information Systems, pages 67-79, Alicante, Spain. Snow, R., Jurafsky, D., & Ng, A. Y. (2005). Learning syntactic patterns for automatic hypernym discovery. In Advances in Neural Information Processing Systems 17, pages 1297-1304. MIT Press, Cambridge, Massachusetts. Stand off-Annotation für Textdokumente: Vom Konzept zur Implementierung (zur Standardisierung?
) * Manuel Burghardt & Christian Wolff Institut für Information und Medien, Sprache und Kultur (I: IMSK) Universität Regensburg 93040 Regensburg, Germany {manuel.burghardt,christian.wolff}@sprachlit.uni-regensburg.de Zusammenfassung Stand off -Annotation beschreibt die logische Trennung von Primärdaten und Annotation. Dieses Konzept läßt sich bis in die 90er Jahre zurückverfolgen und ist seitdem auf vielfältige Weise interpretiert und implementiert worden. Der vorliegende Beitrag untersucht, wie sich die verschiedenen Umsetzungen der stand off -Annotation voneinander unterscheiden und versucht Vor- und Nachteile der einzelnen Ansätze herauszuarbeiten, um künftigen Standardisierungsansätzen im Bereich stand off -Annotation den Weg zu ebnen. Bereits bestehende Standardisierungsansätze werden abschließend diskutiert. 1 Motivation: Stand off -Annotation heute Seit Mitte der 90er Jahre erstmals die Idee einer Trennung von Markup und Primärdaten durch semantische Hyperlinks (Thompson & McKelvie, 1997) formuliert wurde, hat sich stand off -Annotation als de facto-Standard für die Metadatenanreicherung im Bereich des literary and linguistic computing etabliert, insbesondere bei Mehrebenen- oder Zeitleistenannotationen. Trotz der Verbreitung und Anwendung des stand off-Konzepts ist man von einer standardisierten Architektur noch weit entfernt. Tatsächlich wird das stand off-Konzept sehr unterschiedlich interpretiert. Diese Studie versucht, das Spektrum vorhandener stand off-Implementierungen zu erfassen und sie miteinander zu vergleichen, um Vor- und Nachteile der einzelnen Ansätze aufzuzeigen. * Erschienen in: C. Chiarcos, R. Eckart de Castilho, M. Stede (Hrsg.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, S. 53-59. 54 Manuel Burghardt, Christian Wolff 2 Konzept und Entwicklung der stand off -Annotation Um Texte über eine Abfrageschnittstelle quantitativ und qualitativ auswerten zu können, müssen sie zuerst annotiert werden. Annotation bezeichnet dabei die Beifügung von Metadaten zu einer definierten Annotationsbasis. Ferner unterscheidet man bei der Textauszeichnung zwischen Header- Annotation, die z. B. bibliographische Metadaten zum (Gesamt-)Text enthält, struktureller Annotation, also der Auszeichnung von physischer und logischer Textstruktur und positioneller Annotation zur inhaltlichen Auszeichnung der einzelnen Annotationseinheiten. In der Vergangenheit wurde die Textauszeichnung vor allem durch so genannte inline-Annotationen realisiert. Bei der inline- Annotation bilden Primärtext und Annotation eine Einheit und werden in ein und derselben Datei gespeichert. Dieses Vorgehen birgt jedoch Unzulänglichkeiten: So wird durch die direkte Annotation der Primärtext in gewisser Weise zerstört, zumindest aber manipuliert. Dies ist vor allem dann problematisch, wenn der Text nicht frei verfügbar ist, sondern beispielsweise nur online oder auf einem ROM-Medium (read only memory) vorliegt. Die Lesbarkeit des Originaltextes nimmt zunehmend ab, je intensiver das Dokument mit Metadaten z. B. im XML-Format annotiert wird. Der größte Nachteil des inline-Ansatzes wird jedoch in den stark begrenzten Möglichkeiten zur Annotation von Überschneidungen und konkurrierenden Hierarchien deutlich. Die parallele Annotation eines Textes auf mehreren Annotationsebenen ist mit dem inline-Ansatz kaum zu bewerkstelligen. 
Mit dem stand off-Konzept wird eine strikte logische Trennung von Primär- und Sekundärdaten gefordert (Thompson & McKelvie, 1997; Dipper, 2005; Rodríguez et al., 2007). Diese Trennung beseitigt die oben genannten Einschränkungen der inline-Annotation weitestgehend. Der Originaltext bleibt unverändert, da er noch vor der eigentlichen Annotation auf Zeichen- oder Wortebene indexiert wird, um so über externe Annotationsdateien referenzierbar zu sein (Dybkjær & Bernsen, 2000). Überschneidungen und Mehrfachannotationen können über diesen Referenzierungsmechanismus ebenso realisiert werden, wie die nachträgliche Hinzufügung oder Löschung von Annotationsebenen. Ein „Hund“ kann gleichzeitig als Substantiv und als Säugetier annotiert werden. Die zwei entsprechenden Annotationsebenen könnten „Wortart“ und „Ontologie“ lauten. Für jede neue Ebene wird eine weitere Annotationsdatei angelegt. Im Falle einer Löschung wird einfach die Datei mitsamt der Referenzierungen gelöscht. Probleme ergeben sich bei dieser Art der Annotation lediglich, wenn der Originaltext nachträglich geändert wird, da auf diese Weise der Index durcheinander gerät. Nach derartigen Änderungen muss der Primärtext in jedem Fall mit bereits erstellten Annotationen synchronisiert werden. 3 XML und stand off-Annotation Obwohl sich die stand off-Formate verschiedener Projekte teilweise stark voneinander unterscheiden, wird in fast allen Fällen die Ebene der physischen Datenstruktur (Dipper et al., 2007) mit Hilfe der eXtensible Markup Language (XML) realisiert. Die Idee hinter XML ist es, die implizite Struktur eines Textdokuments durch das Hinzufügen von strukturell-beschreibenden Markuptags zu explizieren. So ist es möglich, die Struktur eines Textes unabhängig von seiner physischen Erscheinung, wie etwa Fettschrift oder Kursivschrift, zu definieren (Bryan, 1992). Diese Markuptags werden bei der inline-Annotation, die auch heute noch vor allem für die Repräsentation hierarchischer Baumstrukturen eingesetzt wird, verwendet, um beliebige Informationen wie etwa Parts of Speech (PoS) oder Satzarten direkt in den Primärtext zu annotieren. Beim stand off-Ansatz werden meist die Möglichkeiten der XML-Techologien XPointer und XLink (DeRose et al., 2002, 2001) genutzt, um die Annotationen über Referenzen vom Originaltext zu trennen. Mit XLink ist es möglich, Stand off-Annotation für Textdokumente: Vom Konzept zur Implementierung 55 über die Attribute eines Elements uni- und multidirektionale Links in XML-Dokumenten definieren. Mit der Anfragesprache XML Pointer Language können darüber hinaus bestimmte Teile eines XML-Dokuments referenziert werden, indem die entsprechenden Knoten in der XML-Baumstruktur adressiert werden. Die TEI Standoff Markup Workgroup 1 empfiehlt zur Erstellung von stand off -Inhalten XML Includes, eine weitere XML-Technologie, die es ermöglicht, innerhalb von XML-Dokumenten auf Teile anderer Dokumente zu verweisen. Aufgrund des hohen Verbreitungsgrades des XML-Standards als Instrument zur Modellierung von strukturierter Information (Lobin, 1998), und aufgrund der Verfügbarkeit von Mechanismen wie XLink, XPointer und XML Include sollte XML auch als Grundlage für jegliche Standardisierungsbestrebungen im Bereich des stand off -Markup herangezogen werden. XML stellt deshalb den kleinsten gemeinsamen Nenner der untersuchten stand off-Formate dar. 4 Implementierungsansätze des stand off-Konzepts Nachfolgend vergleichen wir unterschiedliche Implementierungen des stand off-Konzepts auf Basis von XML. 
Hierfür werden exemplarisch die stand off-Formate ausgewählter Textannotationswerkzeuge analysiert, die alle die parallele Annotation auf mehreren Ebenen unterstützen. In einer projektbezogenen Vorstudie zur Eignung von Annotationswerkzeugen für diachrone Korpora wurde unter anderem das Kriterium „Unterstützung von stand off -Annotation“ untersucht. Die vier Tools Callisto 2 , GATE 3 , MMAX2 4 und UAMCorpusTool 5 fielen bei der Evaluation durch ihre teilweise stark divergierenden Umsetzungen des stand off -Konzepts auf. In diesem Abschnitt sollen deshalb die stand off -Annotationsformate der vier oben genannten Tools anhand folgender Parameter verglichen werden: (a) Speicherung und Konservierung des Originaltextes (b) Synchronisierungsmechanismen bei Änderungen im Originaltext (c) Indexierung und Tokenisierung des Originaltexts (d) Realisierung der logischen Trennung von Originaltext und Annotation 4.1 Speicherung und Konservierung des Originaltextes Nur beim UAMCorpusTool wird der Originaltext ohne jegliche Manipulation in einem separaten Ordner gespeichert und so für spätere Anwendungen konserviert. Callisto speichert den Originaltext als Base64-kodierten Signalstrom ab. Base64 beschreibt ein Verfahren bei dem Daten als ASCII- Zeichenstrom kodiert werden (Josefsson, 2006). Die Entwickler greifen auf diesen Mechanismus zurück, um unerwünschte Zeilenumbrüche, welche bei der Portierung von Texten zwischen UNIX- und PC-Systemen entstehen können, zu umgehen, da dies die eindeutige Referenzierung des Originaltextes unmöglich macht. Durch die Base64-Verschlüsselung kann der Zeichenstrom auf beiden Systemen konsistent dargestellt werden. Nachteile dieser Lösung sind eine Zunahme der Dateigröße, sowie der völlige Verlust der Lesbarkeit des Originaltextes ohne entsprechende Decodersoftware. 1 http: / / www.tei-c.org/ Activities/ Workgroups/ SO/ , Zugriff Juli 2009 2 http: / / callisto.mitre.org/ , Zugriff Juli 2009 3 http: / / gate.ac.uk/ index.html , Zugriff Juli 2009 4 http: / / www.eml-research.de/ english/ research/ nlp/ download/ mmax.php , Zugriff Juli 2009 5 http: / / wagsoft.com/ CorpusTool/ index.html , Zugriff Juli 2009 56 Manuel Burghardt, Christian Wolff Abbildung 1: Zusammenspiel von XLink und XPointer Bei den stand off-Formaten von GATE und MMAX2 (Müller, 2005) wird der Originaltext insofern manipuliert, als die Indizes direkt in die Primärdaten geschrieben werden. Dies beeinträchtigt nicht nur die Lesbarkeit des Textes, sondern verstößt auch gegen die grundlegende Forderung nach Unversehrtheit der Originaldaten. 4.2 Synchronisierungsmechanismen bei Änderungen im Originaltext Bei zwei von vier Formaten beinhaltet das Annotationswerkzeug Synchronisierungsmechanismen, die auch nachträgliche Änderungen am Originaltext erlauben. GATE und MMAX2 ermöglichen die Korrektur orthografischer Fehler im Originaltext während des laufenden Annotationsprozesses. Der geänderte Index wird über das Annotationstool mit den bisherigen Annotationen synchronisiert. Die Software von Callisto und UAMCorpusTool unterstützt eine solche Synchronisierung nicht. Änderungen im Originaltext machen alle vorherigen Annotationen zu diesem Text unbrauchbar. 4.3 Indexierung und Tokenisierung des Originaltexts Bis auf das MMAX2-Tool wird der Primärtext bei allen anderen Werkzeugen zeichenweise zerlegt. Die Annotationseinheiten werden dann durch zwei Zahlen beschrieben, welche die Start- und Endpunkte der jeweiligen Einheit im laufenden Zeichenstrom beschreiben. 
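Zur Veranschaulichung der zeichenweisen Indexierung zeigt die folgende Skizze, wie der Originaltext unverändert abgelegt und eine Annotationsebene über Zeichenoffsets in einer separaten XML-Datei realisiert werden kann. Datei-, Element- und Attributnamen sind frei erfunden und entsprechen keinem der untersuchten Werkzeuge; die Skizze illustriert lediglich das Prinzip.

```python
# Skizze: zeichenbasierte stand off-Annotation des Satzes "Der Hund bellt."
# Der Primärtext wird unverändert gespeichert; jede Annotationsebene liegt
# in einer eigenen XML-Datei und verweist über Zeichenoffsets auf den Text.
import xml.etree.ElementTree as ET

primaertext = "Der Hund bellt."
with open("primaertext.txt", "w", encoding="utf-8") as f:
    f.write(primaertext)

def ebene(name, segmente):
    """Erzeugt eine Annotationsebene als eigenes XML-Dokument."""
    root = ET.Element("ebene", name=name, basis="primaertext.txt")
    for start, ende, wert in segmente:
        ET.SubElement(root, "segment", start=str(start), ende=str(ende), wert=wert)
    return ET.ElementTree(root)

# Zwei parallele Ebenen, die u. a. dasselbe Token "Hund" (Offsets 4-8) annotieren:
ebene("wortart", [(0, 3, "ART"), (4, 8, "NN"), (9, 14, "VVFIN")]).write(
    "wortart.xml", encoding="utf-8", xml_declaration=True)
ebene("ontologie", [(4, 8, "Säugetier")]).write(
    "ontologie.xml", encoding="utf-8", xml_declaration=True)

print(primaertext[4:8])  # "Hund" -- der Originaltext bleibt unangetastet
```

Eine zusätzliche Ebene bedeutet lediglich eine weitere Datei; wird eine Ebene gelöscht, bleibt der Primärtext samt Index unberührt.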
Beim Format von MMAX2 kann über eine grafische Oberfläche ein Tokenisierer konfiguriert werden, über welchen sich die Grenzen zwischen den Annotationseinheiten beliebig feinkörnig bestimmen lassen. Will man einen Text beispielsweise nur hinsichtlich seiner Wortarten annotieren, so kann man über white space und Satzzeichen den Tokenizer anweisen, den Text wortweise zu zerlegen. Dies hat zwar den Vorteil, dass während des Annotationsprozesses die gewünschten Annotationseinheiten mit einem Klick selektiert werden können, birgt aber Probleme wenn man beispielsweise nachträglich den Text noch auf Morphemebene annotieren möchte. In diesem Fall muss der Text nochmals neu tokenisiert und mit bereits vorhandenen Annotationen synchronisiert werden. 4.4 Realisierung der logischen Trennung von Originaltext und Annotation Große Unterschiede zeigen sich bei der Implementierung der Trennung von Originaltext und Annotation. Callisto und GATE trennen Primär- und Sekundärdaten zwar logisch voneinander, speichern die Daten allerdings in ein und derselben Datei. Dabei stehen die Originaldaten in einer Art Header- Bereich, die Annotationen im Body-Bereich. Bei MMAX2 und UAMCorpusTool werden Originaltext und Annotation auch physisch rigoros getrennt und in unterschiedlichen Dateien gespeichert. Der Vorteil dieser konsequenten Trennung liegt in der Konservierung des Originaltextes, welche bei MMAX2 wegen der Inline-Annotation der Indizes trotzdem verletzt wird. Der Nachteil liegt im Verwaltungsaufwand der einzelnen Dateien. Mit jeder neuen Annotationsebene kommt eine weitere Datei hinzu. Allerdings fördert die dateiweise Trennung der Annotation alles in allem die Lesbarkeit. Stand off-Annotation für Textdokumente: Vom Konzept zur Implementierung 57 Tabelle 1: Indexierung und Referenzierung beim UAMCorpusTool Das also war des Pudels. . . D a s a l s o w a r d e s P u d e l s 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 start: start: 5 start: 10 start: 14 start: 18 end: 3 end: 8 end: 12 end: 16 end: 23 <segment id=’1’ start=’1’ end=’3’ features=’POS; ART’ state=’active’/ > 5 Empfehlungen für ein standardisiertes stand off-Format Die Untersuchung der Realisierungsformen von stand off -Formaten bei Annotationswerkzeugen macht deutlich, dass das Konzept einer stand off -Annotation auf unterschiedlichste Weise interpretiert und umgesetzt wird. Dabei scheinen einige Implementierungen Vorteile gegenüber anderen Ansätzen zu haben. In diesem Teil der Studie werden die besten Implementierungen der einzelnen Untersuchungsparameter zu einer kurzen Empfehlungsliste für künftige Standardisierungsansätze auf dem Gebiet der stand off -Annotation zusammengefasst: (a) Der Originaltext sollte in seinem ursprünglichen Zustand im Dateisystem der Annotationsdatei gesichert werden, um die Lesbarkeit und Wiederverwendbarkeit zu gewährleisten. (b) Die Indexierung des Originaltextes sollte in einer gesonderten Datei gespeichert werden. (c) Die Annotationssoftware, die das stand off-Format generiert, sollte Synchronisierungsmechanismen enthalten, die es erlauben den Originaltext auch während des laufenden Annotationsprozesses zu ändern. (d) Die Software sollte Versionskontrolle und Änderungshistorie der Primärdaten unterstützen. (e) Bei der Indexierung sollte der Text am besten zeichenweise erfasst werden, da so später beliebig feinkörnige Annotationen hinzugefügt werden können. 
(f) Die Speicherung von Originaltext und Annotation in unterschiedlichen Dateien erhöht die Lesbarkeit und ermöglicht die Konservierung der Primärdaten. 6 Von der Implementierung zur Standardisierung? Bei den untersuchten stand off -Formaten handelt es sich durchweg um konkrete Implementierungen innerhalb eines Annotationswerkzeugs. Ein standardisiertes stand off -Format sollte jedoch als Meta- Format konzipiert werden, welches innerhalb eines definierten Rahmens und unter Berücksichtigung der oben formulierten Empfehlungen verschiedene Implementierungen erlaubt (Dipper et al., 2007). Eine grundlegende Forderung in Hinblick auf ein standardisiertes Format ist somit die Entflechtung von Annotationswerkzeugen und stand off -Formaten. In diesem Bereich wurden mit dem stand off - Format PAULA 6 (Potsdam Interchange Format for Linguistic Annotation), einem generisches Format, das stand off -Annotationen verschiedener Annotationswerkzeuge vereinheitlichen kann und 6 http: / / www.sfb632.uni-potsdam.de/ ~d1/ paula/ doc/ index.html , Zugriff Juli 2009 58 Manuel Burghardt, Christian Wolff für eine große Datenbank namens ANNIS (ANNotation of Information Structure) verfügbar macht, erste wichtige Schritte auf dem Weg zu einer weiterreichenden stand off -Standardisierung unternommen. Das Hauptproblem bei der Entwicklung eines stand-off -Annotationsstandards dürfte eher die Schnittstelle zu bestehenden Datenbanken und Korpora sowie die Vielzahl an unterschiedlichen Annotationsformaten darstellen. Wenn man also ein Format schaffen möchte - egal ob nun stand off oder inline - so muss dieses Format nicht nur verschiedene Annotationswerkzeuge unterstützen, sondern auch eine Schnittstelle zu bereits bestehenden Annotationen schaffen. Mit GrAF (Ide & Suderman, 2007) wird ein solch generisches Austauschformat auf der Basis von Graphen beschrieben. GrAF ensteht im Rahmen des Linguistic Annotation Framework (LAF), einem großangelegten Standardisierungsprojekt, welches durch die Zusammenführung einzelner Teilstandards (TEI, CES, XCES, etc.) einen internationalen Standard für die Erstellung, Annotation und Manipulation von linguistischen Daten definiert. Aufgrund des äußerst heterogenen Feldes an Annotationswerkzeugen und Formaten scheint ein standardisiertes Format in naher Zukunft wenig realistisch. Allerdings könnte eine großangelegtes Projekt wie das LAF, welches der Forschungsgruppe ISO/ TC37/ SC4 (Normierung von Sprachressourcen) angehört, wohl am ehesten den Annotationsbereich homogenisieren. Um vorläufig zumindest ein Mindestmaß an Benutzbarkeit, Qualität und Konsistenz konkreter stand off -Implementierungen zu gewährleisten, sollten bei der Konzeption neuer Formate die unter Punkt 5 formulierten Empfehlungen berücksichtigt werden. Literatur Bryan, M. (1992). SGML. An Author’s Guide to the Standard Generalized Markup Language. Addison- Wesley, Bonn. DeRose, S., Maler, E., & Orchard, D. (2001). XML Linking Language (XLink) Version 1.0. W3C Recommendation, June 27, 2001. DeRose, S., Jr., R. D., Grosso, P., Maler, E., Marsh, J., & Walsh, N. (2002). XML Pointer Language (XPointer). W3C Working Draft, August 16, 2002. Dipper, S. (2005). XML-based Stand off-Representation and Exploitation of Multi-Level Linguistic Annotation. In Proceedings of Berliner XML Tage 2005 (BXML 2005), pages 39-50. Dipper, S., Götze, M., Küssner, U., & Stede, M. (2007). Representing and Querying Standoff XML. In G. Rehm, A. Witt, & L. Lemnitzer, editors, Data structures for linguistic resources and applications. 
Proceedings of the Biennial GLDV Conference 2007, pages 337-346, Tübingen. Narr. Dybkjær, L. & Bernsen, N. O. (2000). The MATE markup framework. In Annual meeting of the ACL, Proceedings of the 1st SIGdial Workshop on Discourse and Dialogue, pages 19 -28, Morristown/ NJ. Association for Computational Linguistics, Association for Computational Linguistics. Ide, N. & Suderman, K. (2007). GrAF: A Graph-based Format for Linguistic Annotations. In Proceedings Linguistic Annotation Workshop held in conjunction with ACL 2007, pages 1-8, Morristown/ NJ. Association for Computational Linguistics, Association for Computational Linguistics. Josefsson, S. (2006). RFC4648. The Base16, Base32, and Base64 Data Encodings (proposed standard). http: / / tools.ietf.org/ html/ rfc4648 , http: / / www.rfc-editor.org/ rfc/ rfc4648.txt . Internet Engineering Task Force (IETF), RFC4648, Fremont/ CA, Oktober 2006. Lobin, H. (1998). Informationsmodellierung in XML und SGML. Springer, Heidelberg / Berlin. Stand off-Annotation für Textdokumente: Vom Konzept zur Implementierung 59 Müller, C. (2005). A flexible stand off-data model with query language for multi-level annotation. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 109-112, Morristown/ NJ. Association for Computational Linguistics. Rodríguez, K. J., Dipper, S., Götze, M., Poesio, M., Riccardi, G., Raymond, C., & Rabiega-Wi´ sniewska, J. (2007). Standoff Coordination for Multi-Tool Annotation in a Dialogue Corpus. In Proceedings of the ACL Linguistic Annotation Workshop, pages 148-155, Morristown/ NJ. Association for Computational Linguistics. Thompson, H. S. & McKelvie, D. (1997). Hyperlink semantics for standoff markup of read-only documents. In Proceedings SGML Europe 1997. Annotating Arabic Words with English Wordnet Synsets: An Arabic Wordnet Interface * Ernesto William De Luca, Farag Ahmed, Andreas Nürnberger Otto-von-Guericke-University of Magdeburg, Germany {ernesto.deluca,farag.ahmed,andreas.nuernberger}@ovgu.de Abstract So far, only few work has been done in order to create and maintain an Arabic Wordnet resource. In this paper we present a tool to support lexicographers in creating Arabic Wordnet synsets. This creation is done query-oriented, where an arabic word is searched and secondly annotated with English synsets. Parallel corpora are then used to create glosses for every new created Arabic synsets. A user interface including the functionalities described in our approach is presented and discussed. 1 Introduction Arabic is a Semitic language based on the Arabic alphabet containing 28 letters. One of the main problems in retrieving Arabic language texts concerns the word form variations. Considering the different properties of this language, in the following, we clarify the differences of the Arabic language to other languages (see below). Then, a brief review of the well known Arabic morphological analyzer (see Section 1.1) is given. Last, after an introduction about Wordnet and its Arabic variant (see Section 2), we describe our approach (see Section 3) and give some concluding remarks in Section 4. Let us consider the Arabic word (kAtb; author) as an example. This word is built up from the root (ktb; write). Prefixes and suffixes can be added to the words that have been built up from common roots to add number or gender, for example, adding the Arabic suffix (an; the) to the word (kAtb; author) will lead to the word (kAtbAn; authors) which represents the masculine dual, while the plural form is (ktAb; authors). 
Arabic nouns and verbs are heavily prefixed, so that it is complicated to process them computationally. The definite article (al; the) is always attached to nouns, and many conjunctions and prepositions are also attached as prefixes to nouns and verbs, hindering the retrieval of morphological variants of words (Moukdad, 2004). Table 1 shows an example for the word form variants that share the same principal concept whose English translation contain the word author or authors. Arabic is different from English and other Indo-European languages with respect to a number of important aspects: words are written from right to left; it is mainly a consonantal language in its * Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 61-68. 62 Ernesto William De Luca, Farag Ahmed, Andreas Nürnberger Table 1: Extract of some word form variations built up from the arabic root (ktb; write) written forms, i.e. it excludes vowels; its two main parts of speech are the verb and the noun in that word order, and these consist, for the main part, of triliteral roots (three consonants forming the basis of noun forms that are derived from them); it is a morphologically complex language, in that it provides flexibility in word formation: as briefly mentioned above, complex rules govern the creation of morphological variations, so that hundreds of words can be formed from one single root (Moukdad & Large, 2001). Furthermore, the letter shapes are changeable in form, depending on the location of the letter at the beginning, middle or at the end of the word. Arabic poses a real natural language processing challenge for many reasons; Arabic sentences are usually long and punctuation has no or little affect on interpretation of the text. Contextual analysis is important in Arabic in order to understand the exact meaning of some words. Characters are sometimes stretched for justified text, i.e. a word will be spread over a bigger space than usual, which prevents a (character based) exact match for the same word. In Arabic synonyms are very common, for example, year has three synonyms in Arabic (EAm) , (Hwl) , (snp) that are all widely used in everyday communication. Despite the previous issues and the complexity of Arabic morphology, which impedes the matching of the Arabic word, another real issue for the Arabic language is the absence of diacritization (sometimes called vocalization or voweling). Diacritization can be defined as signs over and under letters that are used to indicate the proper pronunciations. The absence of diacritization in Arabic texts poses a real challenge for Arabic natural language processing, leading to high ambiguity. Even though, the use of diacritization is extremely important for readability and understanding, diacritization is very rarely used in real life situations. Diacritization signs do not appear in most printed media in Arabic regions nor on Arabic internet web sites. For native speakers the absence of diacritization is not an issue. They can easily understand the exact meaning of the word from the context, but for inexperienced learners as well as in computer usage, the absence of the diacritization is a real issue (Ahmed & Nürnberger, 2008). 
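To illustrate why plain string matching fails on such word form variants, a naive light stemmer over Buckwalter-transliterated forms might simply strip frequent prefixes and suffixes before comparison. This sketch is only an illustration of the problem and is not the approach taken in this paper, which relies on full morphological analysis (see Sec. 1.1); the prefix and suffix lists are deliberately small and incomplete.

```python
# Naive light stemming over Buckwalter-transliterated Arabic word forms.
# Real systems use dictionary-based analyzers such as BAMA/araMorph instead.
PREFIXES = ["wAl", "Al", "wa", "w", "b", "l"]   # e.g. the article "Al" (al; the)
SUFFIXES = ["At", "An", "yn", "wn", "p", "A"]   # e.g. dual/plural/feminine endings

def light_stem(word):
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) > len(p) + 2:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) > len(s) + 2:
            word = word[:-len(s)]
            break
    return word

# "AlkAtb" (the author) and the dual "kAtbAn" collapse to the same stem,
# but the broken plural "ktAb" (authors) does not -- hence the need for a
# dictionary-based morphological analyzer.
print(light_stem("AlkAtb"), light_stem("kAtbAn"), light_stem("ktAb"))
# kAtb kAtb ktAb
```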
1.1 Arabic Morphological Analyzers In the past few years, several studies have addressed the automatic morphological analysis of Arabic (Abderrahim & Reguig, 2008). In the following, we restrict our discussion to the two most important Arabic morphological analyzers: the finite-state Arabic morphological analyzer developed at Xerox and the Tim Buckwalter Arabic Morphological Analyzer (BAMA). This is important because our approach is based on the recognition of morphemes, which are then used to determine the correct word sense. 1.1.1 Finite-State Arabic Morphological Analyzer at Xerox In 1996, the Xerox Research Centre Europe produced a morphological analyzer for Modern Standard Arabic. In 1998, a finite-state morphological analyzer of written Modern Standard Arabic words was implemented that is available for testing on the Internet. The system accepts orthographic Arabic words with full diacritics, partial diacritics or without diacritics, and has wide dictionary coverage. After receiving the words, the system analyzes them in order to identify affixes and roots from patterns. Beesley (2001) reported that Xerox has several lexicons: the root lexicon contains about 4 390 entries; the second one is a dictionary of patterns which contains about 400 entries. Each root entry is hand-encoded and associated with patterns. The average root participates in about 18 morphologically distinct stems, producing 90 000 Arabic stems. When these stems are combined with possible prefixes and/or suffixes by composition, 72 000 000 abstract words are generated. 1.1.2 Tim Buckwalter Arabic Morphological Analyzer (BAMA) BAMA is the best-known tool for analyzing Arabic texts. It consists of a large database of word forms which interacts with further concatenation databases. An Arabic word is considered a concatenation of three regions: a prefix region, a stem region and a suffix region; the prefix and suffix regions can be omitted. Prefix and suffix lexicon entries cover all possible concatenations of Arabic prefixes and suffixes, respectively. Every word form is entered separately, with the stem taken as the base form; the analyzer also provides information about the root. BAMA reconstructs vowel marks and provides an English glossary, and it delivers all possible compositions of stems and affixes for a word. BAMA groups together stems with similar meaning and associates them with a lemmaID; it contains 38,600 lemmas. For more details about the construction of BAMA we refer the reader to Habash (2004). 2 Wordnet For a better understanding of how to create an Arabic lexical resource, we first present Wordnet and then give a short introduction to the already existing Arabic Wordnet. Wordnet is one of the most important English lexical resources available to researchers in the field of text analysis and many related areas. Fellbaum (1998) discussed the design of this electronic lexical database, which is based on psycholinguistic and computational theories of the human lexical memory. Wordnet can be used for different applications, like word sense identification, information retrieval, and particularly for a variety of content-based tasks, such as semantic query expansion or conceptual indexing, in order to improve information retrieval performance (Vintar et al., 2003).
It provides a list of word senses for each word, organized into synonym sets (synsets), each representing one constitutional lexicalized concept. Every element of a synset is uniquely identified by its synset identifier (synsetID). It is unambiguous and a carrier of exactly one meaning. Furthermore, different relations link these elements of synonym sets to semantically related terms (e.g. hyperonyms, hyponyms, etc.). All related terms are also represented as synset entries. Wordnet also contains descriptions of nouns, verbs, adjectives, and adverbs. Wordnet distinguishes two types of linguistic relations. The first type is represented by lexical relations (e.g. synonymy, antonymy and polysemy) and the second by semantic relations (e.g. hyponymy and meronymy). Glosses (human descriptions) are often (about 70% of the time) associated with a synset (Ciravegna et al., 1994). Wordnet has been upgraded in several versions. In Wordnet version 2.0, nominalizations, which link verbs and nouns pertaining to the same semantic class, were introduced, as well as domain links, based on an "ontology" that should help with the disambiguation process. In the newest version, Wordnet 3.0, some changes were made to the graphical interface and the Wordnet library with regard to adjective and adverb searches, adding "Related nouns" and "Stem Adjectives". 2.1 Arabic Wordnet Black et al. (2006) discuss an approach to developing an Arabic Wordnet lexical resource for the Standard Arabic language. The Arabic Wordnet project (AWN) is based on the design of the (Princeton) Wordnet described above and can be mapped to its version 2.0 and to EuroWordNet. The Suggested Upper Merged Ontology (SUMO) and the related domain ontologies are used as the basis for its semantics. The authors already described the "manual" extension and translation of the already existing synsets from one language (e.g. English) to Arabic (Elkateb, 2005). But it is not clear if and how this manual annotation process is supported by an interactive system. 3 Our Approach In the following we discuss the Arabic Wordnet Interface that we implemented in order to support authors in annotating Arabic words with English synsets.
The system can be described by the following steps: • Arabic Synset Creation - The user types an Arabic query word - A list of English translations is retrieved - The user checks the English translations - If a translation is not included, the user can add it through the "other translation" check box - A list of English Wordnet synsets related to the chosen translation is retrieved - The user checks the Wordnet synsets and chooses the correct matching synsets - The synsetIDs of the chosen synsets are retrieved and assigned to the Arabic word • Arabic Synset Gloss Creation - Every word contained in the glosses of every English synset is retrieved individually - The best matching sentences are retrieved from parallel corpora using semantic similarity measures (Patwardhan et al., 2003) - An Arabic list of possible glosses related to the chosen translation is retrieved from the parallel corpora - The user chooses the best matching sentences and thus an Arabic gloss is created - If no gloss matches the chosen word sense, the user can describe the synset with a new gloss 3.1 Arabic Synset Creation The process starts after the user has submitted a query word by means of a client interface (see Figure 1). Figure 1: Arabic Wordnet Interface - Possible English Translation In this example the user is searching for the Arabic word (mwq). The system retrieves all matching translations and presents them with check boxes that the user can activate. The choice of the translations is made using the araMorph package, a sophisticated Java-based implementation of the Buckwalter analyzer (Buckwalter, 2002). This tool includes Java classes for the morphological analysis of Arabic text files and supports the principal Arabic encodings (UTF-8, ISO-8859-6 and CP1256). Afterwards the user can decide to keep all automatically selected translations suggested by the system or to choose only the adequate translations from the list, if these conform better to the intended concept described by the query word; the system also offers the possibility to add a new translation (using the "other translation" check box) that may not be available in the Wordnet resource. When these words are selected, the related Wordnet synsets are retrieved and a list of synsets is presented to the user. Again, in this phase the user has to choose the synset that best describes the searched word (see Figure 2). Figure 2: Arabic Wordnet Interface - Selecting synsetIDs for Arabic Word This step is important in order to retrieve the correct English synset that will be representative of the Arabic word typed at the beginning of the search process. The last step is performed when the user has chosen the correct synset; the corresponding synsetID is retrieved and stored together with the Arabic query word. Within this process, we can enrich every Arabic word given as a query by the user in a semi-automatic way, creating a new parallel Arabic synset with the same synsetID used in the English Wordnet. In this way, we can extend the English Wordnet and create interlingual access through the synsetID (see Figure 3). Figure 3: Arabic Wordnet Interface - synsetID Assignment for Arabic Word 3.2 Arabic Synset Gloss Creation For creating the glosses related to the newly created synsets, several steps have to be considered.
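As an illustration of the synset lookup underlying Sec. 3.1 and of the gloss material that feeds the gloss creation step, the following sketch retrieves candidate synsets, identifiers and glosses for a chosen English translation using NLTK's WordNet interface. NLTK and the helper function candidate_synsets are used purely for illustration here; the system described above accesses Wordnet through its own components.

```python
# Illustration of the synset lookup for a chosen English translation.
# Requires: pip install nltk; then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def candidate_synsets(english_translation):
    """Return (synset identifier, gloss) pairs the user can choose from."""
    candidates = []
    for syn in wn.synsets(english_translation):
        # offset + part of speech serves as a stable synset identifier (synsetID)
        synset_id = "%08d-%s" % (syn.offset(), syn.pos())
        candidates.append((synset_id, syn.definition()))
    return candidates

# The chosen synsetID would then be stored together with the Arabic query word,
# and the gloss words are later matched against sentences of a parallel corpus.
for sid, gloss in candidate_synsets("author"):
    print(sid, gloss)
```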
The algorithm starts by exploiting the English Wordnet Glosses and the parallel corpora in which at least one word contained in source language (English) matches to the translated word (Arabic). Every word contained in the glosses of every English Synset is retrieved and compared with the text included in the relevant English sentences in the “Arabic English Parallel News Part 1” corpora (Consortium LDC, 2004). Semantic similarity measures (Patwardhan et al., 2003) are applied to compare all words related to the Wordnet synsets with the one contained in the corpora. The best matching sentences retrieved from the parallel corpora are presented and an arabic list of possible glosses related to the chosen translation are presented to the user that can choose the best matching sentences. These sentences are then added as an Arabic Gloss. The user could also have the possibility to add new gloss entries related to the new translation, just selecting the respective synset. 4 Conclusions We presented a tool for supporting lexicographers in creating Arabic Wordnet synsets. After the discussion of related work, we explain the query-oriented creation of the Arabic synsets, where an arabic word is searched and then annotated with English synsets. Parallel corpora are used to create glosses for every new created Arabic synsets. Currently, we study how the proposed approach for creating an Arabic Wordnet resource can be combined with the approaches presented in Black et al. (2006). Furthermore, a small user study is planned in order to evaluate the interface and especially the semi-automatic synset and gloss creation process. References Abderrahim, M. E. A. & Reguig, F. B. (2008). A Morphological Analyzer for Vocalized or Not Vocalized Arabic Language. Journal of Applied Sciences, 8, 984-991. Ahmed, F. & Nürnberger, A. (2008). Arabic/ English Word Translation Disambiguation Approach based on Naive Bayesian Classifier. In Proceedings of the 3th IEEE International Multiconference on Computer Science and Information Technology (IMCSIT08), pages 331-338. Beesley, K. (2001). Finite-state morphological analysis and generation of Arabic at Xerox Research: Status and plans in 2001. In ACL Workshop on Arabic Language Processing, pages 1-8, Toulouse,France. Black, W., Elkateb, S., Rodriguez, H., Alkhalifa, M., Vossen, P., Pease, A., & Fellbaum, C. (2006). Introducing the Arabic WordNet Project. In Proceedings of the 3rd International WordNet Conference 2006. Buckwalter, T. (2002). Arabic Morphological Analyzer Version 1.0. In LDC Catalog No.: LDC2002L49. 68 Ernesto William De Luca, Farag Ahmed, Andreas Nürnberger Ciravegna, F., Magnini, B., Pianta, E., & Strapparava, C. (1994). A Project for the Construction of an Italian Lexical Knowledge Base in the Framework of WordNet. Technical Report 9406-15, IRST-ITC. Consortium LDC (2004). Arabic English Parallel News Part 1. In LDC Catalog No.: LDC2004T18. Elkateb, S. (2005). Design and implementation of an English Arabic dictionary/ editor. Ph.D. thesis, Manchester University. Fellbaum, C. (1998). WordNet, an electronic lexical database. MIT Press. Habash, N. (2004). Large scale lexeme based arabic morphological generation. In Proc. of TALN-04, Fez, Morocco. Moukdad, H. (2004). Lost in Cyberspace: How do search engines handle Arabic queries? In Proceedings of the 32nd Annual Conference of the Canadian Association for Information Science, Winnipeg. Moukdad, H. & Large, A. (2001). 
Information retrieval from full-text Arabic databases: Can search engines designed for English do the job? Libri, 51(2), 63-74. Patwardhan, S., Banerjee, S., & Pedersen, T. (2003). Using Measures of Semantic Relatedness for Word Sense Disambiguation. In Proc. of the Fourth Int. Conf. on Intell. Text Processing and Computational Linguistics, pages 241-257, Mexico City, Mexico. Vintar, S., Buitelaar, P., & Volk, M. (2003). Semantic Relations in Concept-Based Cross-Language Medical Information Retrieval. In Proc. of the Workshop on Adapt. Text Extraction and Mining, Croatia. The Role of the German Vorfeld for Local Coherence: A Pilot Study * Stefanie Dipper 1 and Heike Zinsmeister 2 1 Institute of Linguistics Bochum University dipper@linguistics.rub.de 2 Institute of Linguistics Konstanz University Heike.Zinsmeister@uni-konstanz.de Abstract This paper investigates the contribution of the German Vorfeld to local coherence. We report on the annotation of a corpus of parliament debates with a small set of coarse-grained labels, marking the functions of the Vorfeld constituents. The labels encode referential and discourse relations as well as non-relational functions. We achieve inter-annotator agreement of κ = 0 . 66 . Based on the annotations, we investigate different features and feature correlations that could be of use for automatic text processing. Finally, we perform an experiment, consisting of an insertion task, to assess the individual impact of different types of Vorfeld on local coherence. 1 1 Introduction A text is said to be coherent if it is easy to read and understand. Global coherence is achieved on different levels of the text: (i) on the content level, topics are shared across the sentences (Halliday & Hasan, 1976); (ii) on the level of information structure, the focus of attention is shifted smoothly in course of the text (Grosz & Sidner, 1986); (iii) finally on the logical level, discourse relations mediate between sentences and other parts of the text (Mann & Thompson, 1988). Local coherence, the correlate of global coherence at sentence level, is determined by the smoothness of the transition from one sentence to the next. (i) Topic continuity manifests itself in chains of coreferent entities. (ii) The way these entities are realized in a sequence of sentences (position, grammatical role, choice of determiner, etc.) relates to their salience, which, in turn, determines the reader’s focus of attention when shifting from one sentence to the next. (iii) Discourse relations become visible in the occurrence of connectives such as conjunctions, adverbials, or other fixed expressions. Models of coherence refer to textual, cohesive means to approximate coherence. These means include (i) referential relations, which approximate shared topics (Barzilay & Lapata, 2008; Filippova * Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 69-79. 1 We would like to thank three anonymous reviewers of a previous version of this paper for valuable comments as well as our student annotators. The research reported in this paper was partly financed by Europäischer Sozialfonds in Baden-Württemberg. 
70 Stefanie Dipper, Heike Zinsmeister & Strube, 2007b; Elsner & Charniak, 2008), (ii) referential relations in combination with grammatical role assignment or information status, which approximate attentional focus shift (Grosz et al., 1995; Strube & Hahn, 1999; Barzilay & Lapata, 2008), and (iii) discourse connectives, which are explicit markers of discourse relations (Stede & Umbach, 1998; Knott & Dale, 1994; Prasad et al., 2008). In general, the beginning of a sentence seems to be a naturally distinguished position for relating the sentence to its preceding context. For instance, it is well-known that old information, which takes up information from the prior context, tends to occur early in the sentence. In a language such as German, the first obligatory position in a declarative sentence, the Vorfeld (“pre-field”) is not restricted to a specific grammatical function, such as the subject. For instance, in the sequence Max ist krank. Deshalb wird er im Bett bleiben. (‘Max is sick. Therefore, he will stay in bed.’), these positions are occupied by the subject and a discourse connective, respectively. The sequence is locally coherent, due to two textual means. First, there is a coreference link between the sentences (Max, er), and second, the connective (deshalb ‘therefore’) signals a cause-consequence relation between the sentences. Due to its flexibility, the Vorfeld is a highly suitable position for all kinds of coherence-inducing elements. Hence, our working hypothesis is that the majority of the Vorfeld constituents should be related to the prior context. In the present study, we investigate the contribution of the Vorfeld constituent to the emergence of local coherence. We define a classification of functions —some contributing to local coherence, some not—and report on an annotation experiment in which 113 argumentative texts were annotated with coreference and discourse relations, among others (Sec. 3). The annotated texts serve as an empirical base for investigating (i) what kind of relations occur in the Vorfeld position in spoken monologue texts; (ii) whether the different kinds can be distinguished by means of textual properties; and (iii) to what extend they contribute to local coherence (Sec. 4). The last issue is supplemented by a human insertion experiment in which a sentence that has been extracted from a text has to be re-inserted in the remaining text. Accuracy and ease of insertion is matched with the local Vorfeld contexts of the extracted sentence (Sec. 5). 2 Related Work The classical centering approach (Grosz et al., 1995) models local coherence in terms of sentence transition types that describe whether the focus of attention continues, is about to change or is changed. The focus of attention is defined on the basis of “centers”—referential expressions—and a salience hierarchy of the involved expressions in terms of grammatical roles; transition types take into account cross-sentential pairs of salient expressions and whether they are referentially related. A similar approach but more robust is the entity-grid model (Barzilay & Lapata, 2008; Filippova & Strube, 2007a). It takes all kinds of equivalence sets and referential chains into account and combines them with grammatical-role information. For each referential entity and each sentence, it is recorded whether the entity occurs in that sentence or not. The entity-grid model learns patterns of occurrences. 
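As a rough illustration of the entity-grid idea (not the original implementation of Barzilay & Lapata), the sketch below builds a toy grid in which rows are discourse entities, columns are sentences, and cells record the grammatical role of the entity in that sentence. The three-sentence discourse, the entity names and the role labels are invented for the example; coreference resolution is assumed to have been done already.

```python
# Toy entity grid: rows = entities, columns = sentences; cell values are the
# grammatical role of the entity in that sentence ("S" subject, "O" object,
# "X" other, "-" absent). Input sentences are assumed coreference-resolved.
def build_entity_grid(sentences):
    entities = sorted({entity for sent in sentences for entity in sent})
    return {entity: [sent.get(entity, "-") for sent in sentences]
            for entity in entities}

# Hypothetical three-sentence discourse: each dict maps an entity to its role.
sentences = [
    {"Max": "S"},
    {"Max": "S", "Bett": "X"},
    {"Arzt": "S", "Max": "O"},
]
for entity, roles in build_entity_grid(sentences).items():
    print(f"{entity:5s} {' '.join(roles)}")
# Role-transition patterns per entity (e.g. "S S O" for Max) are the kind of
# occurrence patterns such a model learns from coherent texts.
```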
Functional centering (Strube & Hahn, 1999) modifies the classical centering approach by employing information status—which characterizes information as old/ new, familiar/ unfamiliar (Prince, 1981)—for determining salience of referential expressions. The importance of information status is also emphasized in the coreference-inspired model of local coherence by Elsner & Charniak (2008). The Role of the German Vorfeld for Local Coherence 71 Discourse connectives indicate discourse relations (Prasad et al., 2008), e.g., daher, deswegen ‘therefore’ indicate a cause function (Stede & Umbach, 1998), and strengthen local coherence. Kibble & Power (2004) combine constraints on centering and discourse connectives, among others, to model local coherence. A summary of discourse annotation projects dealing with discourse relations that are explicitly marked as well as implicit ones is given in Stede et al. (2007). In contrast to previous studies we focus on the information provided by the Vorfeld constituent. In the case of discourse relations, we consider relations explicitly marked by discourse connectives as well as implicit relations, provided they are indicated by the Vorfeld constituent as such. The exceptional status of the Vorfeld constituent is also acknowledged in the corpus-based study of Filippova & Strube (2006), which served as the empirical base for a two-step generation model in which one classifier is trained specifically to pick the Vorfeld constituent (Filippova & Strube, 2007b). Among a number of general features, they use grammatical functions, coreferential information as well as textual hints on information status. They do not include discourse relations. Another recent corpus study is Speyer (2005) and Speyer (2007). He finds that the initial position is usually occupied by brand-new elements or scene-setting elements. Contrastive topics are less preferred, and salient centers—which he equals with non-contrastive topics-are even lower on the preference scale. His findings imply that the most preferred occupants of the Vorfeld are not related to the previous context but have a sentence-internal function only. Most models of local coherence provide algorithms to order discourse units like sentences or clauses. A recent overview of centering approaches and sentence ordering is provided in Karamanis et al. (2009). Chen et al. (2007) use such a model to insert new information into existing documents. Elsner & Charniak (2008) adopt this task to test the quality of their coherence model. 3 The Corpus and Its Annotations To investigate the role of the Vorfeld constituents, we created and annotated a corpus of selected debates from the European Parliament. The corpus served three purposes: First, we wanted to know the ratio of the sentences that are related to the previous context by virtue of some cohesive item located in the Vorfeld position. Speyer (2007) showed that in his corpus, which consisted of texts from different genres, almost half of the Vorfelds were not related to the previous context. However, it is well known that discourse structure depends on the text type and genre; see, e.g., Berzlánovich et al. (2008), and the figures that Speyer (2007) presents for the genres contained in his corpus. Hence, we were interested in the ratios that would show up in our corpus. Our working hypothesis was that the majority of the Vorfeld constituents should be related to the prior context. 
Second, we wanted to investigate the types of expressions that occur in the German Vorfeld, be they cohesive or non-cohesive, and search for correlations between morpho-syntactic and discourse properties. Third, we wanted to examine the influence of different types of cohesive expressions on coherence; for this, we designed an experiment that we present in Sec. 5. 3.1 The Corpus The texts that we included in our corpus are part of the Europarl corpus (Koehn, 2005). The Europarl corpus consists of protocols of debates in the European Parliament, both in the original language, as delivered by the speaker, as well as in translations into ten other languages, as delivered by the translation services of the European Union. In the Europarl corpus, individual contributions (“turns”) are 72 Stefanie Dipper, Heike Zinsmeister marked by SGML elements, along with the names, parties and languages of the respective speakers. As our basis, we selected all contributions whose original language is German (including Austrian German). We transformed the SGML representation into XML. After tokenization, we applied the German chunker of the TreeTagger (Schmid, 1994) to the corpus. Based on the chunk and part-of-speech annotations, we used a heuristics to automatically determine the location of the Vorfeld for each sentence, if any: Our script reads in the analyses of the TreeTagger and searches for the first occurrence of a finite verb, which marks the Vorfeld boundary. If the first word of the current sentence is a subjunction, the finite verb is included in the Vorfeld; otherwise it is excluded from the Vorfeld; leading conjunctions are skipped. For the annotation task, we isolated medium-sized turns, consisting of 15-20 sentences. This was done (i) to guarantee that the turns contained a sufficient number of Vorfeld sentences, and, hence, would realize a coherent discourse, (ii) to guarantee that the turns would be long enough to allow us to look for cross-sentential patterns, and (iii), at the same time, to avoid turns that are too lengthy and (maybe) more difficult to annotate than shorter ones; also, some long turns contain written reports that are read out by the speakers. The turns were presented to the annotators without further context information. 3.2 Annotation Tagset We aimed at a tagset that would be rather easy to apply. Hence, we decided not to use a detailed tagset as, e.g., provided by the classical set of RST relations (Rhetorical Structure Theory, Mann & Thompson 1988). Moreover, we decided not to differentiate between discourse relations at the semantic or pragmatic level, since this distinction often poses problems for non-expert annotators. Instead, we opted for coarse-grained labels that are intuitively accessible (cf. Marcu & Echihabi 2002). We finally adopted a small set of coherence labels that can be grouped into five groups. In the following, we describe each group; annotated examples are provided in the Appendix. Reference relations: coreference, bridging, reference to the global theme We distinguish between real coreference, and indirect relations or bridging, as in I am reading [an interesting paper]. [The author] claims that . . . . Sometimes, the Vorfeld constituent refers to the (implicit) general global topic of the parliament’s session. Coreference and bridging relations are prime examples cohesive means. Discourse relations: cause, result, continue, contrast, text The Vorfeld can be occupied by a constituent that indicates a discourse relation to the preceding context. 
The labels cause and result are complementary relations: they are designed to cope with any kind of motivational relation (semantic or pragmatic) that holds between two statements, one representing the “premise”, the other the “conclusion”. The label cause indicates that the current sentence represents the premise. The label result is the complementary relation: it marks sentences that represent the conclusion. A clear indicator of a causal relations is, e.g., the phrase der Grund (hierfür) ist, dass . . . ‘the reason (for this) is’; charateristic indicators of result relations are the discourse connectives daher, deswegen ‘therefore’. The continue and contrast relations are opposed to each other in a similar way: the first one, continue, deals with sentences that continue the discourse in the same vein as before, e.g., by elaborating a topic or by drawing comparisons; the second one, contrast, introduces a contrastive sentence, which deviates from the previous discourse. The text relation is for references to text structure. The Role of the German Vorfeld for Local Coherence 73 A discourse relation is only annotated if the respective indicator is located in the Vorfeld position. The discourse relation, however, holds between the entire current sentence and the underlined preceding segment. Like reference relations, discourse relations are cohesive means. Situational relations: deictic, address The label deictic marks references to the situation of the speaking event. Usually it is used for the pronouns ich, wir ‘I, we’. The label address marks those Vorfeld constituents that are used to address the audience, e.g., the chair of the parliament. Vorfeld deictics are very frequent in our type of corpus. They relate the current sentence to the external situation, hence they could be considered as non-cohesive (at the textual level). However, successive use of deictics could be interpreted as an instance of a coreference relation, i.e. it would be a cohesive means. Internal functions Some Vorfeld constituents do not refer to the previous context but to (some segment within) the current sentence. Often, these constituents are instances of frame-setting topics, setting the context of interpretation for the current sentence (Jacobs, 2001). Expletive Function This label is reserved for the so-called Vorfeld-es (Pütz, 1986): under certain circumstances, the pronoun es ‘it’ occupies the Vorfeld position without having the status of an argument. The function of the Vorfeld-es is not entirely clear but it is often said to be associated with presentational sentences: similar to the there is-construction in English, expletive es would serve as a means to place all other constituents behind the finite verb, which is the canonical position of new information. In our context, we decided to analyze expletive es as non-cohesive. 3.3 Annotation Process The corpus was annotated by 15 students of a seminar about information and text structure and by two paid student assistants. They had a short training period with sample texts, followed by a round of discussion. For the annotation, the tool MMAX2 was used. 2 The Vorfeld constituents were marked in advance and highlighted, to ease the annotation task. Manual annotation consisted of picking the correct label among a set of 11 predefined features. In case of coreference relations, the antecedent had to be marked. 
Similarly, in case of discourse relations, the discourse segment that represents the “sister node” or “antecedent” of the annotated relation (causal, result, etc.) was marked. On average, the annotation took about 10 min/text. In total, the corpus currently consists of 113 annotated texts; the average length is 17.2 sentences. We computed inter-annotator agreement on the basis of 18 texts and found κ = 0.66, which is an acceptable value for a task like ours (Artstein & Poesio, 2008).
2 http://mmax2.sourceforge.net/, accessed April 10, 2009.
4 Corpus Exploration
The corpus consists of 1,940 sentences in total. 86% of the sentences feature a Vorfeld (that could be recognized automatically). Ignoring the first sentence of each text, 91% have a Vorfeld. Among these Vorfeld sentences, 45.1% have a Vorfeld that is clearly related to the previous context (via a reference relation or discourse relation). 21.9% contain a reference to the situation, which could be special instances of coreference relations (see discussion in Sec. 3). The Vorfeld “types” occur with the frequencies displayed in the table below. 3
Related Vorfeld    Deictic Vorfeld     Unrelated Vorfeld
22.8% Reference    21.9% Situational    4.8% Expletives
22.3% Discourse                        28.2% Internal
45.1% Total        21.9% Total         33.0% Total
3 Situational relations: 19.5% deictic, 2.4% address; Reference relations: 16.0% coreference, 5.0% bridging, 1.8% global theme; Discourse relations: 11.9% continue, 4.0% cause, 3.0% contrast, 1.9% result, 1.5% text.
Figure 1: Distribution of selected Vorfeld functions across the text (left), and selected correlations between morpho-syntactic categories and Vorfeld functions (right)
We decided to focus on related Vorfeld types for this study. To get a first impression of the data, we started by looking at the distribution of the different Vorfeld types within each text. The chart in Fig. 1 (left) displays, for selected Vorfeld types, the relative number of occurrences and their relative text positions. Relative numbers are computed separately for each type. Relative positions are accumulated in blocks of 20%. Vorfeld types that show a rather even distribution across the entire text have been omitted from the chart. As can be expected, address relations usually occur at the beginning and the end of a turn. The majority of the discourse relations show similar distributions; interestingly, the peak of the result relation occurs rather late. This could be a characteristic of argumentative text such as the Europarl debates. Another feature that we looked at is the distance between relational Vorfelds and their “antecedents”. The medians of the distances show that discourse relations typically relate to the immediately preceding sentence or, less frequently, to the penultimate sentence. Coreference relations tolerate longer distances to their antecedents than discourse relations do. Still, coreference relations have a distance median of 1, but show large variation. The bridging relation has a distance median of 2. A further feature is the correlation between Vorfeld types and morpho-syntactic categories, cf. Fig. 1 (right). 4 Not surprisingly, there is a positive correlation between reference relations and noun chunks, and similarly between deictic relations and noun chunks (not displayed). Another positive correlation can be observed between discourse relations and adverbs.
In contrast to ordinary reference relations, bridging relations show a positive correlation both with noun chunks and prepositional chunks. 5 This shows that Vorfeld constituents that stand in a bridging relation are often not realized as a prominent grammatical function but mainly seem to serve as a bridge between the current sentence and the prior context. Finally, sentence internal functions also correlate with prepositional chunks; it is still to be confirmed whether these are frame-setting elements. 5 Experiment: Insertion Task As mentioned in Sec. 3, our third goal was to examine different types of cohesive expressions according to their influence on coherence. Our hypothesis was that all types of coherence relations link sentences together, but to a different degree. To validate this hypothesis, we designed an experiment consisting of an insertion task (cf. Elsner & Charniak 2008): Procedure From each turn of our annotated corpus, an arbitrary sentence was extracted. Turninitial sentences and sentences without Vorfeld marking were skipped. The extracted sentence, followed by the text block consisting of the remaining sentences, was presented to the annotators, who had to guess the original location of the extracted sentence. Results Annotators correctly marked the location in 54% of the texts (53 in total). If we also consider locations as correct that differ from the actual position by just one sentence, accuracy increases to 73%. A closer look at the results reveals that incorrect insertions are located more frequently in front of the actual position than behind it. Next, we were interested in the question whether the relations that occur in the immediate contexts of the extraction sites are relevant. That is, we investigated whether there are contexts that facilitate or complicate the localization task. According to the annotators themselves, the preceding context was most helpful. According to the results of the experiment, we observe the following tendencies: (i) If the dislocated sentence has a Vorfeld that is referentially related to the previous context, insertion accuracy improves. (ii) Similarly, if the sentence following the dislocated one has a Vorfeld with an internal function, accuracy goes up. (iii) Finally, if the dislocated sentence is the first one of a paragraph, accuracy improves as well.—For preceding sentences, no clear picture emerges from our experiment. Likewise, negative effects do not show up as clearly as positive effects. Based on the results of the insertion experiment, we could hypothesize that referential relations expressed in the Vorfeld have a stronger contribution to local coherence than discourse relations. However, to be able to really interpret the results and assess the impact of the context, we certainly need more annotated data and have to scrutinize individual cases. 4 “ADV”: adverb; “NC”: noun chunk, “PC”: prepositional chunk; “subord”: subordinate clause; “unknown” means that the category could not be determined automatically. 5 Annotators have been instructed to mark reference relations also in cases where the embedded NP rather than the entire PC is related. 76 Stefanie Dipper, Heike Zinsmeister 6 Conclusion Coherence is an important issue for all NLP tasks that depend on text generation, such as text summarization or machine translation. It is a crucial step to generate sentences that fit smoothly into the local context, which is defined as the transition of one sentence to the next one. 
Contrary to our working hypothesis, the majority of Vorfeld constituents in our corpus is not related to the prior context (by a relation reference or discourse). A further interesting finding is that most discourse relations connect adjacent sentences. This insight could facilitate the task of discourse analysis considerably. Possibly, this result is related to the fact that we took only relations into account that were indicated by the Vorfeld constituent. Comparing real coreference and bridging relations, we found that coreference relations are predominantly established by noun chunks while bridging relations are often instantiated by prepositional chunks, which correspond to less-focused grammatical functions (such as prepositional objects or adjuncts). Potentially, this shows that bridging elements rather serve to connect two sentences than to provide for the next focus of attention. In our next steps, we will explore the Vorfeld types that have not been in our focus so far: unrelated and situation-deictic Vorfeld functions. In addition, we want to find out whether related and nonrelated Vorfeld constituents can be distinguished automatically, and whether they can be classified into meaningful subclasses. The corpus study indicates that features such as (morpho-)syntactic category, relative position in the text, and distance to the antecedent could be employed in this classification task. The training of a system would require a large database to get a larger number of significant results. Appendix: Example Sentences In the examples, the Vorfeld constituents are marked in boldface. In the case of reference and discourse relations, the segment in the preceding context that is referred to by the Vorfeld constituent is also marked (underlined). Usually, just the semantic head of the antecedent is underlined, as in Ex. 4; in case of propositional antecedents, suitable fragments or entire sentences are marked, cf. Ex. 3 and 6. 6 Example 1: Expletive G: Es sind alle Versuche gestartet worden, wir müssen jetzt zu Entscheidungen kommen. E: ‘We have tried everything (lit: It has been tried everything); now we need to make decisions.’ Example 2: Deictic G: Ich würde mir eine bessere Verständigung mit dem britischen Außenministerium wünschen. E: ‘I would like to see a closer understanding between the British Foreign Office and ourselves.’ 6 The line labeled “E” presents an English translation that is based on the original translations from Europarl. We used the tool OPUS ( http: / / urd.let.rug.nl/ tiedeman/ OPUS ) to retrieve the English translations. The Role of the German Vorfeld for Local Coherence 77 Example 3: Coreference G: Nun haben aber die Juristen gesprochen — ich muss zugeben, dass ich nicht verstanden habe, was Herr Lehne gesagt hat, aber das liegt vielleicht an mir. Die Überlegungen von Frau Kaufmann und von Herrn Lehne will ich jetzt nicht bewerten. E: ‘Now, though, the lawyers have spoken, and I have to concede that I did not understand what Mr Lehne said, but perhaps that is my fault. I do not, right now, want to weigh up the pros and cons of Mrs Kaufmann’s and Mr Lehne’s arguments.’ Example 4: Bridging G: Deshalb sollte es in punkto Handel und Investitionen ein attraktiver Partner für die Union sein. Dieses Potenzial wird von Unternehmen der Europäischen Union nicht voll ausgeschöpft, die eine starke Präferenz für China zu haben scheinen. E: ‘Therefore, it ought to be an attractive partner for the Union where trade and investment are concerned. 
This potential is not fully exploited by European Union companies, which seem to have a strong preference for China.’ Example 5: Theme (The parliament debates an action plan for animal protection) G: Der nun diskutierte Aktionsplan ist sicherlich ein weiterer wichtiger Schritt in die richtige Richtung. E: ‘The Action Plan now under discussion is certainly a further important step in the right direction.’ Example 6: Contrast G: Bereits im Jahr 2003 hat die EU bekanntlich ihrer Besorgnis über Hunde-, Stier- und Hahnenkämpfe Ausdruck verliehen, was auch erfreulicherweise im vorliegenden Dokument seinen Niederschlag gefunden hat. Seltsamerweise wurde allerdings die Fuchsjagd vergessen. E: ‘As we know, the EU expressed its concern about dog, bull and cock fighting back in 2003 and I am pleased to note that this is also reflected in the present document. Strangely, however, fox-hunting has been overlooked.’ Example 7: Text G: Der zweite Punkt betrifft die berühmten 500 Millionen. E: ‘My second point is that of the celebrated 500 million.’ Example 8: Internal G: Seit Anbeginn des bewussten Tierschutzes infolge der zunehmend technisierten Viehzucht im 19. Jahrhundert hat sich bekanntlich einiges getan. E: ‘As we know, a fair amount has happened since animal protection as a concept was born as a result of increasingly mechanised animal breeding in the 19th century.’ 78 Stefanie Dipper, Heike Zinsmeister References Artstein, R. & Poesio, M. (2008). Inter-Coder Agreement for Computational Linguistics. Computational Linguistics, 34(4), 555-596. Barzilay, R. & Lapata, M. (2008). Modeling Local Coherence: An Entity-based Approach. Computational Linguistics, 34(1), 1-34. Berzlánovich, I., Egg, M., & Redeker, G. (2008). Coherence structure and lexical cohesion in expository and persuasive texts. In A. Benz, P. Kühnlein, & M. Stede, editors, Proceedings of the Workshop on Constraints in Discourse III, pages 19-26. Chen, E., Snyder, B., & Barzilay, R. (2007). Incremental text structuring with online hierarchical ranking. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 83-91. Elsner, M. & Charniak, E. (2008). Coreference-inspired Coherence Modeling. In Proceedings of the Annual Meeting of the Association for Computational Linguistics and Human Language Technologies (ACL-HLT), pages 41-44, Columbus, Ohio. Filippova, K. & Strube, M. (2006). Improving Text Fluency by Reordering of Constituents. In Proceedings of the ESSLLI Workshop on Modelling Coherence for Generation and Dialogue Systems, pages 9-16, Málaga. Filippova, K. & Strube, M. (2007a). Extending the Entity-grid Coherence Model to Semantically Related Entities. In Proceedings of the Eleventh European Workshop on Natural Language Generation (ENLG 07). Filippova, K. & Strube, M. (2007b). Generating Constituent Order in German Clauses. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (ACL), pages 320-327, Prague, Czech Republic. Grosz, B. & Sidner, C. (1986). Attentions, Intentions and the Structure of Discourse. Computational Linguistics, 12, 175-204. Grosz, B., Joshi, A., & Weinstein, S. (1995). Centering: A Framework for Modeling the Local Coherence of Discourse. Computational Linguistics, 21, 203-225. Halliday, M. & Hasan, R. (1976). Cohesion in English. Longman, London. Jacobs, J. (2001). The dimensions of topic-comment. Linguistics, 39, 641-681. Karamanis, N., Mellish, C., Poesio, M., & Oberlander, J. 
(2009). Evaluating Centering for Information Ordering Using Corpora. Computational Linguistics, 35(1), 29-46. Kibble, R. & Power, R. (2004). Optimizing Referential Coherence in Text Generation. Computational Linguistics, 30(4), 401-416. Knott, A. & Dale, R. (1994). Using linguistic phenomena to motivate a set of coherence relations. discourse processes. Discourse Processes, 18(1), 35-62. Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the 10th Machine Translation Summit (MT Summit X). Mann, W. & Thompson, S. (1988). Rhetorical Structure Theory: Toward a functional theory of text organization. Text, 8(3), 243-281. The Role of the German Vorfeld for Local Coherence 79 Marcu, D. & Echihabi, A. (2002). An Unsupervised Approach to Recognizing Discourse Relations. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 368-375, Philadelphia, PA. Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., & Webber, B. (2008). The Penn Discourse TreeBank 2.0. In Proceedings of the Sixth International Language Resources and Evaluation (LREC’08), Marrakech, Morocco. Prince, E. F. (1981). Toward a taxonomy of given-new information. In P. Cole, editor, Radical Pragmatics, pages 223-255. Academic Press, New York. Pütz, H. (1986). Über die Syntax der Pronominalform ‘es’ im modernen Deutsch. Number 3 in Studien zur deutschen Grammatik. Tübingen: Narr, 2nd edition. Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing. Speyer, A. (2005). Competing Constraints on Vorfeldbesetzung in German. In Proceedings of the Constraints in Discourse Workshop, pages 79-87, Dortmund. Speyer, A. (2007). Die Bedeutung der Centering Theory für Fragen der Vorfeldbesetzung im Deutschen. Zeitschrift für Sprachwissenschaft, 26, 83-115. Stede, M. & Umbach, C. (1998). DiMLex: A Lexicon of Discourse Markers for Text Generation and Understanding. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics (ACL) and 17th International Conference on Computational Linguistics (COLING), pages 1238-1242, Montreal, Quebec, Canada. Stede, M., Wiebe, J., Hajiˇ cová, E., Reese, B., Teufel, S., Webber, B., & Wilson, T. (2007). Discourse annotation working group report. In Proceedings of the Linguistic Annotation Workshop (LAW) at ACL, pages 191-196, Prague. Strube, M. & Hahn, U. (1999). Functional Centering — Grounding Referential Coherence in Information Structures. Computational Linguistics, 25, 309-344. Proposition oder Temporalangabe? Disambiguierung von -ung-Nominalisierungen von verba dicendi in nach-PPs * Kurt Eberle, Gertrud Faaß, Ulrich Heid Universität Stuttgart, SFB-732 B3 Institut für maschinelle Sprachverarbeitung - Computerlinguistik - Azenbergstr. 12 D 70174 Stuttgart { eberle,faasz,heid } @ims.uni-stuttgart.de Zusammenfassung Wir schlagen im Folgenden Schemata für die semantische Repräsentation von -ung-Nominalisierungen von verba dicendi in mehrdeutigen nach-PPs vor. Der Rahmen ist die Diskursrepräsentationstheorie. Aus den Repräsentationen werden Kriterien für die automatische Disambiguierung in einem System zur flachen semantischen Analyse von Sätzen abgeleitet. Die Kriterien werden durch maschinell erkennbare Indikatoren realisiert. 
Neben harten Entscheidungskriterien werden auch solche berücksichtigt, die nur mehr oder weniger starke Empfehlungen für oder gegen eine Lesart darstellen. Auf Basis pragmatischer Erwägungen zu den erarbeiteten Lesart- Schemata werden Gewichtungen für solche Kriterien vorgenommen und es wird gezeigt, wie sich diese in einem Bootstrapping-Ansatz zur Approximation des Disambiguierungsverhaltens von Rezipienten an Korpora überprüfen und verbessern lassen. Exemplarische Ergebnisse aus der Implementierung werden vorgestellt. 1 1 Einleitung -ung-Nominalisierungen sind in der Regel mehrdeutig und können für das Ereignis stehen, auf das sich das Verb bezieht aus dem sie gebildet werden (nach der Begradigung der Elsenz. . . ), oder auf den Zustand der aus dem Ereignis resultiert (während der Teilung Deutschlands . . . ) oder auf ein Objekt das als Ergebnis dem Ereignis kausal zugeschrieben ist (die Übersetzung des Romans verkauft sich gut). Diese dreifache sortale Ambiguität ist nicht generell gegeben, sie hängt von den Realisierungsbedingungen des mit dem Verb bezeichneten Ereignistyps ab. Manche -ung-Nominalisierungen haben nur zwei Lesarten, manche sogar nur eine. Im Falle von Mehrdeutigkeit können die * Erschienen in: C. Chiarcos, R. Eckart de Castilho, M. Stede (Hrsg.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, S. 81-91. 1 Die Arbeit ist Teil der Untersuchungen im Teilprojekt B3 zur Disambiguierung von Nominalisierungen bei der Extraktion linguistischer Daten aus Corpustext im SFB-732 Incremental Specification in Context. 82 Kurt Eberle, Gertrud Faaß, Ulrich Heid verschiedenen Lesarten unterschiedlich präferiert sein. In der Regel ist die Präferenz oder Auswahl vom umgebenden Kontext abhängig. Hypothesen zur Mehrdeutigkeit von -ung-Nominalisierungen, zur Beziehung zu den zugrundeliegenden Verben und zur kontextuellen Auswahl finden sich etwa in Ehrich & Rapp (2000); Osswald & Helbig (2004); Roßdeutscher (2007); Spranger & Heid (2007); Eberle et al. (2008). -ung-Nominalisierungen von Verben, die die Darstellung einer Aussage beschreiben oder eine Einstellung gegenüber einer Aussage (verba dicendi wie schildern, mitteilen oder beurteilen, bewerten), sind eine in vielerlei Hinsicht interessante Teilgruppe der -ung-Nominalisierungen. Einmal sind die mit ihnen assoziierten Objektlesarten von einer besonderen Art: Mitteilungen, Bewertungen usf. sind Aussagen, beziehen sich also nicht auf physikalische, sondern auf ‘mentale/ logische’ Objekte, die in der Semantik als Inhalte von Einstellungen eine besondere Rolle spielen. Dann beziehen sich die assoziierten Ereignisse direkt oder indirekt auf die pragmatisch äußerst interessanten Äußerungshandlungen, speziell auf Repräsentativa; und drittens hängen die beiden Lesarttypen ontologisch besonders eng zusammen, wodurch es oft schwierig ist, die konkret vorliegende Lesart im Kontext zu diskriminieren, vgl. Beispiel (1). (1) Liegt für die Volksinitiative nach Mitteilung des Landeswahlleiters die erforderliche Zahl von gültigen Eintragungen vor, so wird der Antrag als Landtagsdrucksache verteilt. (1) ist ein Satz aus dem DeWaC-Korpus (Deutsches Web-as-Corpus, vgl. (Baroni & Kilgarriff, 2006)), das für die Untersuchungen als Datenbasis benutzt wird. Der Satz besagt, dass ein bestimmter Antrag verteilt wird, wenn entsprechend der Mitteilung des Landeswahlleiters eine bestimmte Bedingung erfüllt ist. 
Wird der Satz geringfügig modifiziert, indem beispielsweise nach Mitteilung ersetzt wird durch nach erfolgter Mitteilung, so wird deutlich, dass der Satz (als solcher) auch eine andere Lesart hat, wonach die Eintragungen vorliegen müssen, nachdem die Mitteilung erfolgt ist. Im ersten Fall ist das Vorliegen der Eintragungen ein Teil der Mitteilung, die entsprechend als Aussage oder Proposition verstanden wird. Wir nennen diese Lesart deshalb die propositionale Lesart. Bei ihr wird nach als Diskursrelation - im Sinne der Präposition entsprechend - interpretiert. Im zweiten Fall ist das Vorliegen der Eintragungen ein Zustand der temporal lokalisiert wird: nach der Mitteilung; d.h. hier geht es um die Mitteilung als Referenzereignis oder Referenzzeit. Wir nennen die Lesart deshalb die temporale Lesart. Dabei wird nach als temporale Relation interpretiert. In Satz (1) sind also sowohl die Präposition als auch die Nominalisierung mehrdeutig und ihre Auflösungen hängen voneinander ab oder sie hängen zumindest zusammen. Außerdem beinhaltet die hier als propositional skizzierte Lesart bei genauer Betrachtung ebenfalls eine temporale Relation: Wenn der Äußernde die von der Satzprädikation beschriebene Situation als “entsprechend einer Mitteilung” gegeben darstellt - als Inhalt oder logische Folge der Mitteilung verstanden als Proposition, dann setzt dies natürlich voraus, dass der Äußernde auch von einem Mitteilungsereignis ausgeht, das diese Proposition, den Mitteilungsinhalt, einführt und dieses Ereignis liegt offensichtlich zeitlich vor der Perspektivzeit des Satzes der über diesen Inhalt informiert. Vermutlich sind die beiden Lesarten aus diesem Grund für den Leser oft schwer zu unterscheiden. Das Mitteilungsereignis spielt aber weder dieselbe Rolle noch ist seine temporale Beziehung zum Satzereignis dieselbe. Wenn also potenziell disambiguierende Kontextelemente selber wieder ambig sind, wie nach (mit temporaler Interpretation und als Diskursrelation), und deren Lesarten unterschiedliche sortale Erwartungen über die modifizierten oder selegierten Beschreibungen auslösen, müssen die entsprechenden Interpretationsprozesse beim Rezipienten besonders nuanciert erfolgen. Das Herausarbei- Proposition oder Temporalangabe? Disambiguierung von -ung-Nominalisierungen . . . 83 ten von Regularitäten solcher nuancierter Interpretationsprozesse steht im Zentrum der folgenden Untersuchung. Als Beispiele dienen die skizzierten Verwendungen von -ung-Nominalisierungen von verba dicendi (wie Äußerung, Darstellung, Erklärung, Meldung, Mitteilung) als interne Argumente von nach- PPs. Wir gehen wie folgt vor: Zunächst werden die propositionale und die temporale Lesart im Rahmen der Diskursrepräsentationstheorie (DRT, Kamp, 1981) semantisch analysiert (Abschnitt 2). Aus den Analysen ergeben sich eine Reihe von Kriterien für die Disambiguierung, denen linguistische Einheiten zugeordnet werden können, “Indikatoren”, die sie realisieren (Abschnitt 3). Unter formalen Gesichtspunkten unterscheiden wir drei Arten von Indikatoren: (i) Modifikatoren und Selektoren einer Nominalisierung die per semantischer Selektionsrestriktion deren Lesarten in eindeutiger Weise disambiguieren (das sind z.B. 
Adjektive und Präpositionen oder Verben); (ii) Strukturelle Elemente im Satz die aufgrund ihrer sortalen Eigenschaften und der pragmatischen Erwartungen die sich den beiden Analysetypen zuordnen lassen eine bestimmte Lesart nahe legen oder als unwahrscheinlich erscheinen lassen und (iii) solche die ihren disambiguierenden Einfluss erst durch Einbeziehen von (mehr) Weltwissen und/ oder Wissen jenseits der Satzgrenze ausüben. Die Arbeit zielt auf die Identifizierung von Indikatoren, die sich in einem System zur semantischen Analyse von Texten als linguistische Einheiten maschinell leicht abfragen lassen, d.h. auf Indikatoren der ersten und der zweiten Art. Das Analysesystem soll sich, um handhabbar zu bleiben, neben strukturellem Wissen allein auf Sortenhierachien und im Lexikon verankerte Selektionsbeschränkungen stützen. In (Eberle et al., 2008) ist ein entsprechendes System beschrieben und ein Bootstrapping-Verfahren zur Extraktion von Indikatoren der ersten Art aus Korpora vorgestellt worden. Die aktuelle Studie setzt diese Arbeiten fort mit Fokus auf Indikatoren der zweiten Art. In Abschnitt 3 werden solche Indikatoren diskutiert und dabei tentativ Gewichtungen festgelegt. Diese Gewichtungen sind Input eines Korpus-Evaluationszyklus zur Überprüfung der Hypothesen und zur Justierung der Gewichtungen der in der Folge skizziert wird (Abschnitt 4). Zweck der Untersuchung ist es, einerseits ein tieferes Verständnis der betrachteten -ung-Nominalisierungen zu gewinnen und andererseits mit operationalisierbaren Indikatoren deren automatische Disambiguierung für computerlinguistische Anwendungen wie Textklassifikation und Maschinelle Übersetzung etc. zu erleichtern. Die Studie schließt mit einer ersten Ergebnissichtung (Abschnitt 4.2) und der Skizzierung nächster Schritte (Abschnitt 5). 2 Lesarten von nach-PPs mit Nominalisierungen von verba dicendi 2.1 Propositionale Lesart Wenn die nach-PP propositional verstanden wird, dann wird die Aussage des durch die PP modifizierten Satzes als logische Folge oder Inhalt der Aussage aus der PP verstanden. Das schematische Beispiel (2) illustriert diese Lesart. Wir repräsentieren es wie in (2 rep ). 84 Kurt Eberle, Gertrud Faaß, Ulrich Heid (2) Nach x’s Darstellung von P , SATZ (z.B. : Nach Jane’s Darstellung des Sachverhalts (/ des Problems/ der Situation), ist Y der Fall.) (2 rep ) 〈 e x p e: darstellen(x,p) e ≺ n P(p) p: K p , K p ⇒ K satz 〉 Dabei wird präsupponiert, dass der Sachverhalt (das Problem, etc.) K p (das mit p bezeichnet wird) irgendwann vor dem aktuellen jetzt (now) des Berichts n dargestellt worden ist (wobei P die Qualität von p als Sachverhalt, Problem, Fragestellung etc. charakterisiert), und es wird behauptet, dass die Aussage des modifizierten Satzes, K satz , gefolgert werden kann, falls der Inhalt der Darstellung, K p , richtig ist. 2 Von welcher sortalen Qualität die durch das Verb im Satz eingeführte Situation ist (d.h. welcher Sorte der ausgezeichnete Diskursreferent (DRF) von K satz zugeordnet ist), ist bei dieser Lesart ohne Belang. Der Satz kann die Existenz eines bestimmten Ereignisses (event), Prozesses (process) oder Zustands (state) als Teil der durch p beschriebenen Szene behaupten, letzteres einschließlich (modal modifizierter) Prädikationen (prop), vgl. (3): (3) Nach Jane’s Darstellung, i) hatte Freddy Hans erschlagen. (e@event) ii) hatte es gestern geregnet (a@process) iii) war die Frau blutüberströmt. 
(s@state) iv) war es möglich, dass Z (q@prop) Wir unterlassen Repräsentationen für die Sätze aus (3); sie explizieren das Schema aus (2 rep ), wobei bei (iv) die Repräsentation von Z unter einem Modaloperator eingebettet wird. 2.2 Temporale Lesart In Beispiel (3.i) kann die PP nach Jane’s Darstellung offensichtlich auch gut temporal verstanden werden. Auch in (3.ii), aber weniger offensichtlich. Insbesondere die Information gestern macht diese Lesart dort etwas schwierig, mit plötzlich statt gestern ist sie viel prominenter. (3.iii) lässt sie nur ganz schwer zu, wenn überhaupt; (3.iv) aber wieder relativ leicht. Was sind die Gründe für das unterschiedliche Verhalten? Bei der temporalen Lesart wird das Ereignis aus der PP als Referenzzeit zur Situierung des Ereignisses, des Prozesses, des Zustands aus dem modifizierten Satz benutzt. Solche Nachordnungsbezüge verlangen, dass es sich bei dem entsprechend modifizierten Satzereignis um ein Ereignis im engen Sinne (d.h. um eine temporale Einheit mit festen Grenzen) handeln muss. Das ist oft beobachtet und in entsprechende Aktionsart-Regeln abgebildet worden (vgl. fürs Deutsche Bäuerle (1988); Herweg (1990); Eberle (1991)). Wenn das (zunächst) nicht der Fall ist, wie bei Prozessen und Zuständen, ist ein entsprechender Satz trotzdem akzeptabel, vorausgesetzt, der Prozess oder der Zustand kann um- oder reinterpretiert werden als ein solches Ereignis. 3 Das kann u.a. durch inchoative Interpretation erfolgen. Für den Prozess aus (3.ii) und den Zustand aus (3.iv) liegt das 2 Die Strukturierung der Repräsentationen in Präsupposition und Assertion folgt dem Vorschlag in (Kamp, 2002). 3 Vgl. Eberle (1991) zu Uminterpretation und Egg & Herweg (1994) zu Reinterpretation. Proposition oder Temporalangabe? Disambiguierung von -ung-Nominalisierungen . . . 85 nahe: Nach dem Ende der Darstellung hatte der Regen eingesetzt oder war es, aufgrund der neuen Kenntnisse, möglich geworden, dass Z. (3 rep .ii) repräsentiert die entsprechende Lesart von (3.ii). (3 rep .ii) 〈 e j p jane(j) e: darstellen(j,p) e ≺ n P(p) p: K p , e’ e’: begin (λ a a a: regnen ) e ≺ e’ e’ ≺ n gestern(e,n) gestern(e’,n) 〉 In dieser Repräsentation werden sowohl das Darstellungsereignis als auch der Beginn des Regens in der Zeit von gestern verortet; vgl. dazu Abschnitt 3.3. Für (3.iii) liegt der temporale Interpretationstyp nicht nahe. Einerseits ist blutüberströmt sein kein Zustand der plötzlich, ohne weitere Ursache oder Begründung, einsetzt (wie Regen), andererseits legt der Kontext keine solche Begründung nahe (so wie eine Darstellung eine Möglichkeit begründen kann, (3.iv)). Eine andere Umwertung liegt ebenfalls nicht auf der Hand, beispielsweise eine Begrenzung wie sie mit Hinzufügen von Adverbialen der Dauer erfolgt, vgl. nach der Darstellung herrschte minutenlang Schweigen. Wir folgern, dass die temporale Lesart (höchstens dann) statthaft ist, wenn die Hauptsatz-VP einen Ereignistyp im engen Sinne repräsentiert oder wenn es eine “Uminterpretation” dieser VP zu einem solchen Ereignistyp gibt, die naheliegend ist. Letzteres macht deutlich, dass das Aktionsart-Kriterium, das wir als ein erstes Kriterium aus der Lesarten-Analyse ableiten können, kein ‘hartes’ Kriterium ist, jedenfalls nicht beschränkt auf die gegebene Analysesituation und ohne Weltwissen. 
Wir werden auf Basis der semantischen Hintergrundinformationen die dem Bewertungssystem zur Verfügung stehen (im wesentlichen Sortenhierachie und Selektionsbeschränkungen) in der Regel auch nicht bestimmen können, wann eine Umwertung naheliegend ist und wann nicht. Wir können aber sortales Wissen zur temporalen Seinsweise - als Ereignis, Prozess, typischerweise begrenzter historischer Zustand, typischerweise nicht begrenzter (z.B. logischer) Zustand - benutzen, um gemittelt Kosten für eine Umwertung zu schätzen und damit die Kosten für die temporale Lesart aus Aktionsart-Sicht zu approximieren (vgl. Abschnitt 3.2). 3 Identifizierung von Indikatoren 3.1 Eindeutig disambiguierende Restriktionen: Beispiel temporale Selektion Am einfachsten zu finden und zu behandeln sind disambiguierende Indikatoren, die eindeutige sortale Restriktionen ausüben. Diese Indikatoren müssen nicht die -ung-Nominalisierung selbst modifizieren (wie in (4.a)) oder selegieren. Sie können ihre Wirkung auch indirekt entfalten, indem sie beispielsweise die Präposition (oder Präpositionalphrase) modifizieren, die die Nominalisierung als Argument enthält. Formal können sie sehr unterschiedlich sein, wie die Beispiele aus (4) zeigen, die alle aus dem DeWaC-Korpus stammen. 86 Kurt Eberle, Gertrud Faaß, Ulrich Heid (4) a. . . . oder legt er diese nach erfolgter Meldung aus von ihm zu vertretenden Gründen nicht ab, so gilt die Wiederholungsprüfung als abgelegt und nicht bestanden. b. Bei der Anfahrt zum Besteller darf der Fahrpreisanzeiger erst nach Meldung des Fahrers beim Besteller eingeschaltet werden. c. Sollten Sie . . . einverstanden sein, haben Sie das Recht, der Änderung innerhalb eines Monats nach Mitteilung zu widersprechen. (4.a) ist ein Beispiel für direkte adjektivische Modifikation. (4.b) ist ein Beispiel für Modifikation durch ein Fokusadverb, wobei die Disambiguierung indirekt erfolgt: erst strukturiert den Satz in Fokus (die nach-PP) und Hintergrund (d.i. der restliche Satz), wobei nach temporal und in der Folge Meldung als Ereignis, interpretiert werden muss. In (4.c) nimmt eine temporale PP die nach-PP als Referenz und fügt deren temporaler Lokalisierung eine Präzisierung hinzu. 3.2 Lesart-Empfehlungen Neben den eindeutig-selegierenden Indikatoren gibt es schwächere Hinweise, ‘Default-Constraints’. Wir ordnen solchen schwachen Indikatoren Gewichtungen zu und benutzen dabei für Unwahrscheinlichkeit und Präferenz eine Skala von -3 bis +3, wobei die Werte inhaltlich für nahezu unmöglich, sehr unwahrscheinlich, nicht favorisiert, neutral, favorisiert, sehr wahrscheinlich, fast sicher stehen bzw. die Kosten für Umwertung schätzen, wie im besprochenen Aktionsart-Kriterium das wir unter Verwendung dieser Gewichte wie in Tab. 1 spezifizieren können. Die Berechnung der Default-Bewertung zu einem Satz erfolgt dann durch Integration der Werte der verschiedenen Hinweise. Wir diskutieren im Folgenden exemplarisch zwei weitere Kriterien für solche Indikatoren, eines, das sich auf die modifizierte VP, und eines, das sich auf das interne Argument der nach-PP bezieht. 3.3 Kriterien zum Ereignistyp der modifizierten Verbalphrase Beispielfall: Temporale Situierung Die temporale Lesart bedeutet, dass die nach-PP eine Referenzzeit t für das Satzereignis setzt. Das Vorkommen einer weiteren Temporalangabe ist unter diesen Umständen pragmatisch ungewöhnlich, außer sie wird als Präzisierung zu t und nicht als eigenständige ‘zweite’ Referenzzeit verstanden. Das Problem ist, zu erkennen, wann dies der Fall ist und wann nicht. 
Die schematischen Kontraste in (5) zeigen Hinweise: (5) (i) Nach Jane’s Darstellung, am Dienstag, diskutierte Hans mit Freddy. (ii) Nach Jane’s Darstellung diskutierte Hans mit Freddy am Dienstag. (iii) Am Dienstag diskutierte Hans mit Freddy nach Jane’s Darstellung. Tabelle 1: Gewichtungen für Lesart-Empfehlungen, Aktionsart-Kriterium Aktionsart-Kriterium: Ereignis Prozess Zustand Zustand historisch nicht-historisch Negation, Modaleinbettung Temporale Lesart +1 0 0 -2 Propositionale Lesart 0 0 +1 +2 Proposition oder Temporalangabe? Disambiguierung von -ung-Nominalisierungen . . . 87 In (5.i) zeigt die Vorfeld-Position, dass am Dienstag, als Apposition zu der nach-PP, mit dieser eine gemeinsame Referenzzeit spezifiziert: innerhalb des Dienstags und nach der Darstellung. In (5.ii) und (5.iii) sind die Beiträge dagegen getrennt. In diesen Fällen hängt es davon ab, inwieweit die Einführung der Informationen als fortschreitende Präzisierung einer gemeinsamen Referenzzeit verstanden werden kann, entsprechend den für den Satz als natürlich empfundenen Skopusverhältnissen. Für (5.ii) bedeutet dies, dass die Darstellung vermutlich vor der Diskussion am Dienstag erfolgt (z.B. am Montag), wohingegen in (5.iii) sie vermutlich auch am Dienstag stattfindet. Dabei wirkt (5.iii) aufgrund der erwarteten Informationskonventionen natürlicher. (5.ii) und das Beispiel (3.ii) oben in Abschnitt 2.1 haben dieselbe Struktur. Die Repräsentation (3 rep .ii) zeigt, dass ein Freiheitsgrad durch den Spielraum bei der Interpretation der Skopusverhältnisse entsteht. (3.ii) und (5.ii) können (wenn auch etwas markiert) so verstanden werden, dass die zweite Temporalangabe Skopus nimmt über die nach-PP (beide Ereignisse liegen am selben Tag, wie bei (5.iii)). Weil es schwierig ist, diese diversen Möglichkeiten bei getrennten Angaben genau zu bewerten, begnügen wir uns in einem ersten Zugang mit der pauschalen Annahme, dass eine im Satz räumlich von der nach-PP getrennte Temporalangabe eher nicht für die temporale Lesart spricht (Tab. 2). Bei den aktuell stattfindenden Tests der Kriterien am Korpus werden die Ergebnisse benutzt, um einerseits die inhaltliche Beschreibung der Kriterien sukzessive zu verfeinern und andererseits ihre Gewichtungen zu justieren. Ersteres hier z.B. durch Einbeziehen von Reihenfolgeerwartungen und Subklassifizierungsinformation zu Temporalangaben. 3.4 Kriterien zur nach-PP Beispielfall: Thema-Kennzeichnung Bei der propositionalen Lesart gibt die VP bzw. der Satz Auskunft über den Inhalt, das Thema, der Äußerung der nach-PP: In (2 rep ) war das Thema der Äußerung die Struktur K p , bezeichnet durch den DRF p. Unter pragmatischen Gesichtspunkten darf p nicht schon (in wesentlichen Teilen) geschildert sein, sonst ist der Beitrag aus der VP nicht informativ, d.h. K p muss in den Teilen die K satz berichtet akkommodiert sein: aus dem tatsächlich im Text Berichteten zu K p darf K satz nicht ableitbar sein (insofern ist der Zusammenhang K p ⇒ K satz ein abduktiver). Immer dann wenn das Thema der Nominalisierung substantieller charakterisiert ist als durch der Sachverhalt, das Problem, etc. scheint also die temporale Lesart naheliegender, vgl. Beispiel (6) aus DeWaC: (6) Eine anderweitige zumutbare Ersatzmöglichkeit für die Kläger besteht auch nicht in der Geltendmachung von Ansprüchen gemäß §. . . nach Erklärung eines Widerrufs ihrer auf Abschluß der Darlehensverträge gerichteten Willenserklärungen (§1 Abs. 1 HaustürWG). Daraus folgern wir das Thema-Kriterium mit tentativen Gewichtungen wie in Tab. 
3 gegeben (mit Genitv-Komplementen wie in (6) als Indikatoren). Tabelle 2: Gewichtungen für Lesart-Empfehlungen, Referenzzeit-Kriterium Referenzzeit-Kriterium: separate Referenzzeitangabe keine separate Angabe Temporale Lesart -2 0 Propositionale Lesart +2 0 88 Kurt Eberle, Gertrud Faaß, Ulrich Heid Tabelle 3: Gewichtungen für Lesart-Empfehlungen, Thema-Kriterium Thema-Kriterium: substantielles Thema kein substantielles Thema Temporale Lesart +2 0 Propositionale Lesart -2 0 3.5 Weitere ‘weiche’ Constraints Sowohl für die nach-PP als auch für die VP die sie modifiziert bzw. für beide zusammen sind eine Reihe weiterer Kriterien und ihrer Indikatoren im Test. Kriterien zu Tempus, zur Determination und Quantifikation nehmen die Möglichkeiten in den Blick, die Ereignisse als Folge in einer narrativen Sequenz zu betrachten (wie bei der Ereignis- Reinterpretation in Beispiel (3 rep .ii)) und verwenden dafür morphologische und Wortklassen-bezogene Indikatoren. Kriterien wie das Vorhandenseins eines Agens beziehen sich auf die erwartbare Informationsstrukturierung bei der propositionalen Lesart und verwenden syntaktisch-semantische Indikatoren (wobei etwa ein ‘typischer, öffentlich bekannter’ Verursacher von Äußerungen als Indikator für die propositionale Lesart betrachtet wird, wie in nach Meldungen der Süddeutschen Zeitung). Wie bei den beiden skizzierten Kriterien kommen auch in allen anderen Fällen Überlegungen zur pragmatischen Akzeptibilität im Sinne von (Grice, 1975) zum Tragen. 4 Korpusstudie Die identifizierten Indikatoren werden aktuell in einer Korpusstudie auf Tauglichkeit getestet und ihre Signifikanz im Sinne der attribuierten Gewichtung bestimmt. Wir benutzen dabei ein Korpus- Analysewerkzeug, das es für ein breites Fragment des Deutschen mit sehr umfangreichem Wortschatz erlaubt, Sätzen flache, unterspezifizierte Diskursrepräsentationen (FUDRSen; vgl. Eberle (2004)) auf der Basis syntaktischer Dependenzanalysen zuzuorden (vgl. Eberle et al. (2008)). Es basiert auf dem Lingenio-Forschungsprototypen für die Analyse des Deutschen 4 . Der Vorteil dieser Repräsentationen ist es, dass sie lexikalische und strukturelle syntaktische und semantische Mehrdeutigkeiten nur soweit auflösen, als das für die Behandlung eines betrachteten Phänomens notwendig ist. Damit können aus Texten Sätze extrahiert werden, die den Kriterien des untersuchten Phänomens genügen, in unserem Fall also -ung- Nominalisierungen zu verba dicendi als interne Argumente von nach-PPs enthalten. Aus den FUDRSen solcher Sätze sind leicht die möglichen Modifikatoren der betrachteten Nominalisierungen und die Modifikanden der sie enthaltenden PPs auszulesen, ohne dass mögliche Modifikatoren und Argumente durch strukturell falsche Auflösung von Mehrdeutigkeiten verloren gehen. Gleichzeitig können diese FUDRSen partiell disambiguiert werden, beispielsweise so, dass entschieden ist, wie genau die lokale Struktur aussieht, die die ung-Nominalisierung enthält. Auf diese Strukturen lassen sich die oben diskutierten Kontext- Constraints anwenden und die sortale Qualität des -ung-Diskursreferenten bestimmen. 
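Zur Veranschaulichung, wie die gewichteten Lesart-Empfehlungen zu einer Gesamtbewertung integriert werden können, folgt eine stark vereinfachte Skizze; sie ist keine Implementierung des beschriebenen Lingenio-Werkzeugs, Funktions- und Variablennamen sowie die Kodierung der Kriterien sind frei gewählt, die Gewichte entsprechen den tentativen Werten aus Tab. 1-3.

```python
# Stark vereinfachte Skizze der Default-Bewertung: pro Lesart werden die
# Gewichte (-3 .. +3) der im Satz zutreffenden Kriterien aufsummiert; die
# Lesart mit der höchsten Summe wird gewählt. Kriteriennamen und Struktur
# sind frei gewählt und nicht dem beschriebenen System entnommen.
GEWICHTE = {
    "temporal": {
        ("aktionsart", "ereignis"): +1,
        ("aktionsart", "zustand_nicht_historisch"): -2,   # inkl. Negation, Modaleinbettung
        ("referenzzeit", "separate_angabe"): -2,
        ("thema", "substantiell"): +2,
    },
    "propositional": {
        ("aktionsart", "zustand_historisch"): +1,
        ("aktionsart", "zustand_nicht_historisch"): +2,
        ("referenzzeit", "separate_angabe"): +2,
        ("thema", "substantiell"): -2,
    },
}

def bewerte(indikatoren):
    """indikatoren: Menge von (Kriterium, Ausprägung)-Paaren aus der Analyse."""
    summen = {
        lesart: sum(gewichte.get(ind, 0) for ind in indikatoren)
        for lesart, gewichte in GEWICHTE.items()
    }
    return max(summen, key=summen.get), summen

# Beispiel: eine Modaleinbettung (Tab. 1, Spalte "Zustand nicht-historisch,
# Negation, Modaleinbettung") spricht für die propositionale Lesart.
lesart, summen = bewerte({("aktionsart", "zustand_nicht_historisch")})
```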
Abb.1 zeigt den Output des Werkzeugs für den Satz aus (7) unter der Vorgabe, dessen Struktur unterspezifiziert zu repräsentieren, die Bewertung für die sortale Lesart der -ung-Nominalisierung durchzuführen und deren Selektor und Modifikatoren (für die spätere Verfeinerung ihrer Charakterisierungen) zu extrahieren: 4 Siehe http: / / lingenio.de/ Deutsch/ Forschung/ Projekte/ unis-sfb732-b3.htm , 28.07.2009 Proposition oder Temporalangabe? Disambiguierung von -ung-Nominalisierungen . . . 89 (7) Beweisstücke müssen nach Bekanntmachung der Staatsanwaltschaft vorgelegt werden. In Satz (7) kann die DP der Staatsanwaltschaft wegen der Kasusambiguität zwischen Genitiv und Abbildung 1: Unterspezifizierte Dependenzanalyse des Analyse-Tools mit (a) Selektor- und Modifikatoren-Extraktion und (b) Sortenberechnung Dativ Agens der Bekanntmachung sein oder Benefizient des Vorlegens. Das gibt die Repräsentation wieder (xmod). Extrahiert werden (im gewählten Analysemodus) Phrasen, die Selektor oder Modifikator sein können, aber nicht müssen und die sortale Berechnung erfolgt auf Basis der Informationen die strukturell beitragen können, aber nicht müssen. Möglicher Agens, Modaleinbettung und Präsens legen die propositionale Lesart nahe. 4.1 Bootstrapping Wie für ‘harte’ Indikatoren schon geschehen (vgl. (Eberle et al., 2008)), werden aktuell auch mögliche Indikatoren der ‘weichen’ Disambiguierungskriterien mit dem beschriebenen Werkzeug im Extraktionsmodus herausgeschrieben (vgl. Abb. 1 (a)) und nach Bewertung ihres Beitrags ins Lexikon des Systems aufgenommen. Ihre dort beschriebene Wirkungsweise wird dann im Berechnungsmodus des Systems bei der partiellen Disambguierung der einschlägigen Belegstellen im Korpus auf ihre Stimmigkeit getestet (vgl. Abb. 1 (b)) und bei Bedarf weiter justiert. Bei der bisher verwendeten Berechnung werden alle verwendeten Gewichte paritätisch zusammengerechnet. Die Gewichte kann man mittels der üblichen Maximierungsverfahren für Maximum- Entropie-Modelle optimieren, denn das Berechnungsverfahren beschreibt ein solches Modell mit linguistischen Features (vgl. Och & Ney (2002)): L(nach-PP) = argmax L i { 8 ∑ m=1 λ m h m ( L i | ( nach-PP,VP )) } D.h. die Lesart der nach-PP ist diejenige Lesart L i (die propositionale oder die temporale Lesart) des Satzes mit nach-PP und VP, der von der Summe der (insgesamt 8) Kriterien h m (Agens-, Aktionsart- Kriterium etc.) der höchste Wert zugeordnet wird, wobei λ m die zugeordneten Gewichte sind. Der ‘Gold-Standard’ für das Verfahren ist die ausschließlich Satz-bezogene manuell zugewiesene Lesartengewichtung. Die Vertiefung und Erweiterung der Kriterien folgt methodisch der üblichen Extraktion-Spezifikation-Test-Spirale, mit Analyse im Extraktionsmodus, lexikographischer Spezifikation und Test im Berechnungsmodus. 90 Kurt Eberle, Gertrud Faaß, Ulrich Heid 4.2 Ergebnisse In einer der bisher an DeWaC durchgeführten Studien wurde für die verba dicendi-Nominalisierungen Äußerung, Darstellung, Erklärung, Meldung, Mitteilung eine Teilmenge von 7864 Sätzen mit Verwendung unter nach-PPen extrahiert und analysiert. Diese Menge wurde weiter subklassifiziert nach morphologischen Kriterien die a priori als signifikant betrachtet worden sind (im Sinne eines auf die Nominalisierung bezogenen Aktionsartkriteriums): nach Vorkommen im Plural oder im Singular, und nach Determination (vorhandener Artikel und quantifizierende Einschränkungen versus bare singular/ plural). 
Diese Vorab-Unterteilung erwies sich jedoch nach ersten Evaluationen für sich allein genommen als wenig ergiebig; oder umgekehrt formuliert: der semantische Beitrag dieser morphologischen Kriterien ist zu wenig spezifisch, als dass er für sich allein eine Lesart- Präferenz auslösen könnte. Für ein kleines Fragment des Gesamt-Korpus (100 Sätze) wurden die Indikatoren für die 8 betrachteten Kriterien vollständig aufbereitet und damit die Kriterien getestet. Für dieses Fragment ergab sich eine über 80-prozentige Akzeptabilität der errechneten Ergebnisse. Momentan wird das Fragment sukzessive vergrößert, um über einen umfassenderen Bestand an Indikatoren zu verfügen. Ziel ist, das Gesamtfragment der extrahierten nach-PP-Sätze abzudecken, die Gewichtungen des Verfahrens darauf zu trainieren und dann an weiteren Korpora zu testen. Ein Problem der Berechnungen im Bewertungsmodus ist das ‘Rauschen’ das durch Fehler bei den zugrunde liegenden Analysen entsteht; es entsteht speziell bei der Unterscheidung von Agens und Thema bei synkretistischen Kasusformen. Das Tool verwendet hier mit gutem Erfolg semantische Klassifikationen seines reichhaltigen lexikalischen Bestands. 5 Zusammenfassung - nächste Schritte Wir haben einen Vorschlag gemacht, welche Inhalte die temporale und die propositionale Lesart von Nominalisierungen der verba dicendi umfassen und welche kontextuellen Informationen aufgrund dessen die eine oder die andere Lesart mehr oder weniger empfehlen oder unwahrscheinlich machen. Auf der Basis solcher Kriterien sind eine Anzahl von Indikatoren vorgeschlagen worden, die es erlauben sollen, die im Kontext präferierte Lesart automatisch zu ermitteln. Das dabei verwendete Korpusanalyse-Tool ist in der Lage, mögliche Indikatoren zu extrahieren und deren Bedeutung für die Disambiguierung - nach Aufarbeitung durch Annotation von Präferenzeigenschaften im Lexikon des Systems - bei der Berechnung der präferierten Lesart zu berücksichtigen und zu integrieren. Erste ermutigende Ergebnisse der Studie liegen vor. Die nächsten Schritte beinhalten die Ausarbeitung eines umfassenderen Kriterienkatalogs und die Festlegung von Präferenzwerten für die zugehörigen Indikatoren mittels der skizzierten Bootstrapping-Spirale. Besonderes Gewicht wird in der Folge darauf zu legen sein, inwieweit die Indikatoren für einzelne Vertreter des betrachteten Nominalisierungstyps unterschiedlich sind bzw. deren Gewichtungen, und inwieweit statistische Zusammenhänge zwischen einzelnen Indikatoren abgeleitet werden können bzw. inwieweit ein Sortiment statistisch möglichst unabhängiger Kriterien abgeleitet werden kann. Sukzessive soll die Methode auf größere Teilbereiche der Nominalisierungen ausgedehnt werden. Proposition oder Temporalangabe? Disambiguierung von -ung-Nominalisierungen . . . 91 Literatur Baroni, M. & Kilgarriff, A. (2006). Large linguistically-processed web corpora for multiple languages. In 11th conference of the European Association for Computational Linguistics, EACL 2006, pages 87- 90, http: / / acl.ldc.upenn.edu/ eacl2006/ companion/ pd/ 01_baronikilgarrif_69. pdf . Bäuerle, R. (1988). Ereignisse und Repräsentationen. Technical Report 43, IBM Deutschland, WT LILOG, Stuttgart. Wiederabdruck der Habilitationsschrift von 1987, Universität Konstanz. Eberle, K. (1991). Ereignisse: Ihre Logik und Ontologie aus textsemantischer Sicht. Dissertation, Universität Stuttgart. Eberle, K. (2004). Flat underspecified representation and its meaning for a fragment of German. Habilitationsschrift, Universität Stuttgart. 
Eberle, K., Heid, U., Kountz, M., & Eckart, K. (2008). A tool for corpus analysis using partial disambiguation and bootstrapping of the lexicon. In A. Storrer, A. Geyken, A. Siebert, & K.-M. Würzner, editors, Text Resources and Lexical Knowledge: Selected Papers from the 9th Conference on Natural Language Processing KONVENS 2008. De Gruyter, Berlin. Egg, M. & Herweg, M. (1994). A phase-theoretical semantics of aspectual classes. Report 11, Verbmobil, IBM, Wissenschaftliches Zentrum, Heidelberg. Ehrich, V. & Rapp, I. (2000). Sortale Bedeutung und Argumentstruktur: -ung -Nominalisierungen im Deutschen. Zeitschrift für Sprachwissenschaft, 19(2), 245-303. Grice, P. (1975). Logic and conversation. In P. Cole & J. L. Morgan, editors, Speech Acts, pages 41-58. Academic Press, New York. Herweg, M. (1990). Zeitaspekte. Die Bedeutung von Tempus, Aspekt und temporalen Konjunktionen. Dissertation, Universität Hamburg. Kamp, H. (1981). A theory of truth and semantic representation. In J. Groenendijk, T. Janssen, & M. Stokhof, editors, Formal Methods in the Study of Language. Mathematical Centre Tract, Amsterdam. Kamp, H. (2002). Einstellungszustände und Einstellungszuschreibungen in der Diskursrepräsentationstheorie. Forschungsberichte der DFG-Forschergruppe Logik in der Philosophie 89, Universität Konstanz. Och, F. J. & Ney, H. (2002). Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of the Annual Meeting of the ACL, pages 295-302, Philadelphia, PA. Osswald, R. & Helbig, H. (2004). Derivational semantics in HaGenLex an interim report. In D. Schnorbusch & S. Langer, editors, Semantik im Lexikon. Narr, Tübingen. Roßdeutscher, A. (2007). Syntactic and semantic constraints in the formation and interpretation of -ung nouns. Vortrag beim Workshop “Nominalizations across Languages”, Stuttgart, 11./ 12. Dezember 2007. Spranger, K. & Heid, U. (2007). Applying constraints derived from the context in the process of incremental sortal specification of german -ung-nominalizations. In Proceedings of the 4th International Workshop on Constraints and Language Processing, CSLP. “Süße Beklommenheit, schmerzvolle Ekstase” Automatische Sentimentanalyse in den Werken von Eduard von Keyserling * Manfred Klenner Institut für Computerlinguistik Universität Zürich, Schweiz klenner@cl.uzh.ch Zusammenfassung Es wird ein regelbasierter Ansatz zur Sentimentanalyse für das Deutsche vorgestellt. Dazu wurde anhand von GermaNet manuell ein Lexikon erstellt, das 8000 Lemmata als positiv oder negativ einordnet. Die Eingabetexte werden gechunkt, die Lemmata anhand des Polaritätslexikons markiert und mittels einer auf regulären Ausdrücken basierenden Kaskade von Umschreibaktionen kompositionell zu immer komplexeren, in ihrer Polarität bestimmten Phrasen gruppiert. Evaluiert wird anhand eines literarischen Textes von Eduard von Keyserling. 1 Einführung Eine Zielsetzung der Sentimentanalyse 1 ist die Bestimmung der semantischen Orientierung von Wörtern, Phrasen und Texten. Präziser gefasst ist es die Wortbedeutung, die eine positive oder negative semantische Orientierung oder auch Polarität aufweist. Das Adjektiv ‘billig’, zum Beispiel, trägt nur dann eine negative Polarität, wenn es im Sinne von ‘minderwertig’ verwendet wird. In der Bedeutung von ‘preiswert’ ist es positiv. In den meisten Systemen zur Sentimentanalyse wird auf eine Disambiguierung verzichtet, so auch im vorliegenden Fall. Die Polarität eines Wortes ist oft über seine Lesarten hinweg invariant, bzw. 
die Erfolgsquote heutiger Systeme zur Bedeutungsdisambiguierung ist nicht ausreichend, um die Qualität der Sentimentanalyse auf diesem Weg zu verbessern. Pro Wort wird daher meist nur eine Polarität vergeben, im Idealfall die der häufigsten Lesart. Die Einordnung eines Wortes als positiv oder negativ erfolgt anhand einer kulturell vorgegebenen Werteskala. ‘Freude’ ist positiv, ‘zu lieben’ ist positiv. ‘Hass’ ist negativ, ‘zu zerstören’ auch. Das Kriterium der Klassifikation wäre demnach, ob etwas wünschenswert/ angenehm/ moralisch ist oder eben nicht. Manchmal scheint jedoch ein Wort bzw. sein Denotat je nach Perspektive positiv oder negativ besetzt zu sein, etwa ‘Strafstoß’ oder ‘Spion’. Da mit der Tätigkeit eines Spions im Allgemeinen aber Handlungen verknüpft sind, die man gemeinhin als unmoralisch bezeichnet (Aushorchen, Fälschen, etc.) muss man ‘Spion’ negativ einordnen (ähnlich für ‘Strafstoß’). * Erschienen in: C. Chiarcos, R. Eckart de Castilho, M. Stede (Hrsg.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, S. 93-99. 1 Das Deutsche Universalwörterbuch des Dudenverlages definiert Sentiment als Empfindung, Gefühl. 94 Manfred Klenner Ein Wort kann aber nicht nur positiv, negativ oder neutral sein, sondern auch als Verstärker (Intensivierer) dienen. Unter Umständen hängt der Sentimenttypus eines Wortes sogar vom Kontext ab. So ist ‘perfekt’ in Kombination mit dem neutralen ‘Mahl’ positiv - ‘perfektes Mahl’. Hingegen fungiert es in Verbindung mit Katastrophe, einem negativen Nomen, als Intensivierer - ‘die perfekte Katastrophe’ ist in einem besonderen Maße katastrophal. Daneben gibt es noch Wörter, die kontextuell als Polaritätsumkehrer (Invertierer) wirken. Das Wort ‘Mangel’ ist negativ (’Es herrschte ein Mangel’), in einer NP-PP-Verbindung jedoch invertiert es die Polarität der PP. ‘Mangel ∗ an Feingefühl + ’ 2 ist negativ, ‘Mangel ∗ an Schmerz − ’ (88’300 Belege durch eine Suchmaschine) hingegen positiv. Die Komposition von Polaritäten, wenn also innerhalb einer Phrase mehr als eine Wortpolarität gegeben ist und eine Phrasen- oder gar Satzpolarität ermittelt werden soll, scheint regelhaft zu sein (vgl. aber die Diskussion in Kapitel 3). So ist zum Beispiel ‘enttäuschte − Hoffnung + ’ negativ. ‘[Ein Mangel ∗ an [enttäuschter − Hoffnung + ] − ] + ’ wieder positiv und so weiter bis auf die Satzebene. Bestimmte Verben setzen die Polaritätskomposition jedoch außer Kraft, z.B. ‘lieben’, ‘hassen’. Egal, was man hasst, Positives oder Negatives, das im Satz ausgedrückte Sentiment ist negativ (Ich hasse (Ehrungen + 3 | Schmähungen − )). Bislang fehlte ein umfangreiches Polaritätslexikon für das Deutsche, das den Ressourcen für das Englische wie z.B. dem ‘subjectivity lexicon’ (Wilson et al., 2005) vergleichbar wäre. Die vorliegende Arbeit schließt diese Lücke. Ein 8000 Wörter umfassendes Polaritätslexikon wurde manuell anhand von GermaNet erstellt. In der Literatur zum Thema Sentimentanalyse wurden bislang mehrheitlich nicht-literarische Texte zugrundegelegt: Zeitungstexte, Produktbewertungen und Filmbewertungen. Auch hier treten die oben diskutierten Probleme auf, jedoch ist echte Komposition (Phrasenpolaritäten aus zwei oder mehr Wortpolaritäten) obwohl nicht ausgeschlossen, so doch eher selten. In bestimmten literarischen Genres hingegen sind solche Konstruktionen erwartbar und werden zudem noch literarisch strapaziert. 
Wir haben uns daher für ein literarisches Textkorpus entschieden: die Werke von Eduard von Keyserling (aus der Gutenberg-Bibliothek 4 ). 2 Ein Polaritätslexikon für das Deutsche Jede kompositonelle Sentimentanalyse setzt ein Lexikon voraus, in dem Apriori-Polaritäten von Wörtern festgelegt sind. Für das Deutsche war ein solches bislang nicht verfügbar. Wie unsere Experimente für das Englische (Klenner et al., 2009) mit dem Subjektivitätslexikon (Wilson et al., 2005) gezeigt haben, kann man mit einen Lexikon von 8000 Wörtern bereits sehr gute Resultate erzielen. Anstatt, wie heute oft üblich, solche Daten automatisch zu akquirieren und damit meist eine stark verrauschte Ressource zu generieren, z.B. SentiWordNet (Esuli & Sebastiani, 2006), haben wir uns für manuelle Annotation entschieden. Da nur bestimmte Teile des deutschen Wortschatzes positive oder negative Polaritäten aufweisen, haben wir uns an den GermaNet-Klassen orientiert. Tabelle 1 listet die bislang annotierten Klassen auf, weitere Annotationen sind geplant. Tabelle 2 zeigt einige zufällig ausgewählte Einträge des Lexikons. Ergänzt haben wir das so geschaffene Lexikon um einige intensivierende Adverbien (’sehr’ etc.) und Invertierer wie ‘nicht’. 2 + kennzeichnet positive Polarität, − eine negative, ∗ bezeichnet einen Invertierer (Shifter). 3 Solche Kombinationen mit inversen Polaritäten sind u.U. ironisch gemeint. 4 http: / / gutenberg.spiegel.de/ “Süße Beklommenheit, schmerzvolle Ekstase” - Automatische Sentimentanalyse . . . 95 Tabelle 1: Annotierte GermaNet-Klassen Adjektive Nomen Verben Allgemein Koerper Koerperfunktion Geist, Privativ Mensch Kognition Gesellschaft Motiv Kommunikation Koerper, Verhalten Relation Kontakt Menge, Nat Attribut Gesellschaft Tabelle 2: Lemmata mit positiver (obere Hälfte) bzw. negativer Polarität beredt Mut überlegen angstfrei legendenhaft schön erprobt neidlos angesehen Gleichstellung barmherzig einfügen behend lieben Charme hilfreich Sich-wohl-Fühlen Glanz bewandert heroisch Gemeinschaftsgefühl gehoben menschenwürdig Seelenverwandtschaft sinnentleert Depression täuschen Suchtverhalten stümperhaft Graus unkalkulierbar intrigieren Instinktlosigkeit Bösartigkeit stigmatisieren nekrophil erzürnen Weh exaltieren unverdient bitter missliebig verärgern wankelmütig disqualifizieren Zuchtlosigkeit griesgrämig erschütternd 3 Kompositionalität der Polarität Wenn innerhalb einer Phrase mehr als ein polarisiertes Wort auftritt, dann ergibt sich die Polarität der Phrase kompositionell. Im einfachsten Fall weisen alle Wörter die gleiche semantische Orientierung auf, so dass sich durch die Komposition nichts ändert. ‘feierliche + Schönheit + ’ ist positiv, ‘langweilige − Kranke − ’ negativ. Überall dort, wo unterschiedliche Polaritäten aufeinandertreffen, wird eine Entscheidung erzwungen: ist das Ganze positiv oder negativ (niemals ist es neutral). Die Hoffnung wäre, dass es hier eine klare Tendenz gibt. Zum Beispiel wie in ‘törichte − Heirat + ’, wo ein negatives Adjektiv ein positives Nomen neutralisiert und eine negative Phrasenpolarität entsteht. In ähnlicher Weise könnte ein positives Adjektiv ein negatives Nomen neutralisieren wie in ‘süße + Beklommenheit − ’, das positiv ist. Wir haben aber argumentiert, dass in solchen Kombinationen das Adjektiv die Rolle eines Intensivierers einnimmt, der die negative Tendenz des Nomen weiter verstärkt, vgl. ‘perfekte + ’ Katastrophe − ’ 5 . Kombinationen aus einem positiven Adjektiv und einem negativen Nomen sind auch oft ironisch oder sarkastisch gemeint, vgl. 
‘grandioser Untergang’. Nach der Inspektion zahlreicher Beispiele sind wir zu der Überzeugung gelangt, dass es ein besonderes Stilmerkmal von Keyserling ist, dass er solche Kombinationen in einem positiven, wenn auch melancholischen Sinne verwendet. Tabelle 3 stellt die verschiedenen Kombinationen anhand von Beispielen einander gegenüber. 5 Der Unterschied zwischen beiden Fällen ist, dass ‘perfekt’ den Grad, ‘süß’ aber die Qualität des Nomens modifiziert. 96 Manfred Klenner Tabelle 3: NP-Komposition String NP pol String NP pol feierliche + Schönheit + + übermäßige − Liebe + poetische + Liebe + + törichte − Heirat + zärtliches + Mitleid + + schwermütiges − Lächeln + lustige + Ungeheuerlichkeit − + langweilige − Kranke − süße + Beklommenheit − + qualvolle − Schmerzen − - Beispiele von Kompositionalität durch Polaritätsumkehrung sind in Tabelle 4 aufgelistet. In der Erzählung ‘Fürstinnen’ konnten 55 Fälle (true positives) identifiziert werden. Fehler (false positives) erzeugte die Verwendung von ‘nicht’ im Fragekontext: ‘nicht wahr? ’ ist eben nicht negativ. Tabelle 4: Invertierung (Shift) durch Negation (bereute − nichts) + (war nicht langweilig − ) + (nichts einwandte + ) − (nicht schön + ) − (nicht sehr günstig + ) − (liebte + diese Einladung nicht) − (nicht zufrieden + ) − (hilft + nichts) − (nicht den Mut + ) − Argumentationsspielraum bietet folgendes Beispiel mit identischem Polaritätsmuster ‘neg + neg’: ‘an gebrochenem Herzen sterben’ ist eindeutig negativ, doch wie ist es bei ‘das boshafte Gerede der Leute zu verachten’. Ist es nicht positiv, etwas Negatives zu verachten? Aber Verachtung ist auch ein negatives Gefühl. Beide Sichtweisen scheinen vertretbar. Schliesslich soll nicht unerwähnt bleiben, dass es explizite und implizite Fälle von Sentiment gibt: ‘Sie war froh, dass er gegangen war’ ist explizit positiv, wobei implizit unter Umständen eine negative Einstellung zu der Person besteht, die gegangen ist. 4 Ein System zur Sentimentanalyse Das hier beschriebene System 6 basiert auf der Ausgabe des TreeTagger Chunkers (Schmid, 1994). Jedes vom TreeTagger identifizierte Lemma wird im Polaritätslexikon nachgeschlagen und erhält das dort aufgefundene Label: POS, NEG, NEUT, SHIFT (’nicht’) oder INT (Verstärker wie ‘sehr’). Regeln (reguläre Ausdrücke) operieren in einer Kaskade über denjenigen Chunks, die polarisierte Wörter enthalten und generieren auf diese Weise immer größere, polaritätsgetaggte Teilketten eines Satzes. In seltenen Fällen werden auch ganze Sätze markiert. Zur Inspektion der Ergebnisse haben wir ein Analysetool implementiert (vgl. Klenner et al. (2009) und Petrakis et al. (2009)). Die Polaritätsbestimmung erfolgt anhand einer Kaskade von Umschreibeaktionen. Die Anwendung der Regeln erfolgt von innen nach aussen. NP-Regeln vor NP-PP-Regeln vor VC-NP-Regeln usw. Innerhalb der einzelen Regelmengen gibt es wiederum eine Ordnung, z.B. Adjektiv-Nomen- Bestimmung vor der Anwendung der Negation (z.B. ‘kein schönes Geschenk’: zuerst [schönes Geschenk] + , dann Invertierung mit ‘kein’). Wir haben einen Compiler implementiert, der uns das Schreiben von Regeln vereinfacht. Die Mustersprache operiert über der Chunker-Ausgabe. 6 Eine dreisprachige Demoversion findet sich unter: www.cl.uzh.ch/ kitt/ polart/ “Süße Beklommenheit, schmerzvolle Ekstase” - Automatische Sentimentanalyse . . . 
97 Tabelle 5: Regeln # Regel Beispiel 1 nc_*=adja: POS,*=nn: NEUT; → POS ’perfektes Mahl’ 2 nc_; vc_l=sein; nd_pol=POS; → POS ‘sie ist sympathisch’ 3 nc_*=*: _; vc_wollen=vmfin: _; vc_pol=NEG; → NEG ‘sie wollte leiden’ 4 nc_*=pper: _; vc_l=dürfen; nd_l=nicht; vc_; → NEG ‘sie durfte nicht reiten’ Im folgenden sollen einige Regeln diskutiert werden (vgl. Tabelle 5). Regel 1 ersetzt ein positives Adjektiv (adja: POS) innerhalb eines Nomen-Chunks (nc) unmittelbar gefolgt von einem neutralen (NEUT) normalen Nomen (nn) im gleichen Chunk durch die Polarität POS (positiv)(vgl. ‘perfektes Mahl’). Regel 2 kombiniert ein beliebiges Nomen-Chunk (nc), gefolgt von einem Verb-Chunk (vc), das das Lemma ‘sein’ einhält (l=sein) und auf das ein positives prädikatives Chunk (nd) folgt und ersetzt das Ganze durch die Polarität POS (z.B. ‘Sie ist sympathisch’). Regel 3 erfasst Kontexte mit dem Modalverb ‘wollen’, gefolgt von einem negativen Vollverb und bildet sie auf die Polarität NEG (negativ) ab (z.B. ‘Sie wollte leiden’). Regel 4 markiert Teilketten als negativ, bei denen ein Personalpronomen (pper) gefolgt wird von ‘dürfen’ und einer Negation (’nicht’) und einem beliebigen Verb-Chunk (entsprechend ‘etwas nicht dürfen’). Die Regeln sind robust, aber nicht fehlerfrei. Regelinteraktionen können nicht generell ausgeschlossen werden. Im Moment hat die deutsche Version 65 Regeln, die englische 88 und die noch experimentelle französische 21. Ein weiterer, noch experimenteller Teil des Systems ist die Berechnung der Intensität einer Phrasenpolarität. Jedes Wort hat eine Polaritätsstärke, im Moment einheitlich 1. Bei der Phrasenkomposition werden die einzelnen Stärkewerte addiert, Intensivierer verdoppeln, Invertierer kehren um, ohne den Wert zu verändern. Zum Beispiel hätte die Phrase ‘misslungene Lobesrede’ den Wert 2 (negativ), ‘keine misslungene Lobesrede’ den Wert 2 (positiv) und ‘total misslungene Lobesrede’ den Wert 4 (negativ). 7 5 Empirische Evaluierung Eine umfassende Evaluation wurde bislang nur für die englische Variante unseres Systems durchgeführt (Klenner et al., 2009). Es liegt allerdings eine vorläufige Untersuchung für das Deutsche vor. Als Textkorpus wurden diejenigen Werke von Eduard von Keyserling verwendet, die in der Gutenberg-Bibliothek elektronisch zur Verfügung stehen (14 Erzählungen bzw. Romane). Der Tree- Tagger hat 88’628 Sätze identifiziert, die Anwendung der Regeln erbrachte die Polarität von 9’764 Phrasen (5’241 positiv und 4’523 negativ). Die Verarbeitung inklusive Chunking dauerte viereinhalb Minuten. Um die Güte der Sentimentanalyse zu ermitteln, haben wir die polaritätsgetaggten Phrasen aus Keyserlings ‘Fürstinnen’ (die gedruckte Version hat 120 Seiten) manuell evaluiert. Das System hat insgesamt 1’209 Phrasen klassifiziert, davon 635 positiv und 574 negativ. Unsere Auszählung ergab, dass von den 635 525 tatsächlich positiv und von den 574 462 tatsächlich negativ waren, so dass die Präzision der positiven Klasse bei 90.38% liegt, die der negativen bei 80.48%. Auf eine 7 Es ist interessant, dass in ‘keine total misslungene Lobesrede’ der Intensivierer durch die Negation nun abschwächend wirkt. Es resultiert eben nicht eine gelungene Lobrede, daher ist der Wert 4, der durch die Negation ja unberührt bleibt, hier unangebracht. 98 Manfred Klenner Ermittlung der Ausbeute wurde bewusst verzichtet. Viele verpassten Phrasen rühren von Lücken im Polaritätslexikon her, das wir noch mit GermaNet-Daten erweitern werden. 
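Die in Abschnitt 4 beschriebene Verrechnung der Polaritätsstärken lässt sich — losgelöst von der tatsächlichen Regelkaskade über der Chunker-Ausgabe und mit frei gewählten, hypothetischen Funktionsnamen — etwa so skizzieren; die Behandlung gemischter Polaritäten ist dabei auf den nicht-literarischen Normalfall vereinfacht:

```python
# Minimale Skizze (hypothetische Namen): Stärkewerte werden addiert,
# Intensivierer verdoppeln, Invertierer kehren die Polarität um,
# ohne den Wert zu verändern (vgl. Abschnitt 4).

def komponiere(adj, nomen):
    """Adjektiv-Nomen-Komposition über (Polarität, Stärke)-Paare."""
    (p_a, s_a), (p_n, s_n) = adj, nomen
    staerke = s_a + s_n
    if p_a == p_n:
        return p_a, staerke          # gleiche Orientierung bleibt erhalten
    # Bei gemischter Polarität setzt sich hier vereinfachend die negative
    # Tendenz durch (vgl. 'törichte Heirat', 'perfekte Katastrophe');
    # die Keyserling-Analyse wertet pos. Adjektiv + neg. Nomen dagegen positiv.
    return "NEG", staerke

def invertiere(phrase):
    """Negation ('kein', 'nicht'): Polarität kippt, Stärke bleibt."""
    pol, staerke = phrase
    return ("POS" if pol == "NEG" else "NEG"), staerke

def intensiviere(phrase):
    """Intensivierer ('sehr', 'total'): Stärke verdoppelt sich."""
    pol, staerke = phrase
    return pol, 2 * staerke

np = komponiere(("NEG", 1), ("POS", 1))   # 'misslungene Lobesrede'
print(np)                                 # ('NEG', 2)
print(invertiere(np))                     # 'keine misslungene Lobesrede' -> ('POS', 2)
print(intensiviere(np))                   # 'total misslungene Lobesrede' -> ('NEG', 4)
```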
Einige Fehler gehen auf den Chunker zurück und als Folge die inkorrekte Anwendung von Regeln. Ambiguität spielt eine Rolle (’die Turmuhr schlug zwölf’ mit ‘schlug’ = negativ), Fragekonstruktionen wurden nicht neutralisiert, daher rühren Fehler wie die negative Klassifikation von ‘nicht wahr? ’. 6 Literaturdiskussion Der Fokus der vorliegenden Arbeit ist die Sentimentkomposition. Zum Thema Komposition liegen sehr wenige Arbeiten vor. Eine vollständige kompositionelle Analyse, basierend auf einer normativen nicht-robusten Grammatik des Englischen, ist in Moilanen & Pulman (2007) beschrieben. Obgleich gerade in literarischen Texten in der Regel eine grammatikalische Struktur vorausgesetzt werden kann, sind heutige Parser und Grammatiken in der Regel nicht geignet, um z.B. komplexe Dialogstrukturen zu analysieren. Aus diesem Grund ist ein robuster, regelbasierter Ansatz vorzuziehen. 7 Zusammenfassung und Ausblick Wir haben ein System zur robusten Sentimentanalyse deutscher Texte vorgestellt, das regelbasiert und mit Hilfe eines manuell konstruierten, 8000 Einträge umfassenden Polaritätslexikons eine kompositionale Sentimentinterpretation erzeugt. Es war unsere Intention, zu überprüfen, ob ein solcher Ansatz auch für literarische Texte brauchbar ist, welche Probleme dabei auftreten und ob ein solches Vorgehen helfen kann, interessante literarische Fragestellungen zu bearbeiten oder gar in der Lage ist, bestimmte stilistische Phänomene aufzudecken. Letzteres muss nicht kategorisch verneint werden, wie unsere Experimente gezeigt haben. Das Aufeinandertreffen von positiven Adjektiven und negativen Nomen ist in Keyserlings Werken einerseits ein häufig gewähltes Stilmittel, das anderseits, nicht wie im nicht-literarischen Fall eine negative, sondern eine positive Charakterisierung darstellt. Obwohl der regelbasierte Ansatz in einer ersten Untersuchung auch für das Deutsche gute Ergebnisse geliefert hat, sind die Begrenzungen aufgrund der freien Wortstellung grösser als im Englischen. Wir planen daher die Regelanwendung auf die Phrasen-Ebene zu beschränken und mit einem Dependenzparser Subjekt- und Objektinformation verfügbar zu machen. Danksagung Mein Dank gilt Stefanos Petrakis, Ronny Peter und Angela Fahrni für die Zusammenarbeit. Diese Arbeit wird vom Schweizerischen Nationalfonds (SNF) unterstützt (Nr. 100015_122546/ 1). “Süße Beklommenheit, schmerzvolle Ekstase” - Automatische Sentimentanalyse . . . 99 Literatur Esuli, A. & Sebastiani, F. (2006). SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining. In Proc. of LREC-06, Genova, Italy. Klenner, M., Petrakis, S., & Fahrni, A. (2009). A Tool for Polarity Classification of Human Affect from Panel Group Texts. In Intern. Conference on Affective Computing & Intelligent Interaction, Amsterdam, The Netherlands. Moilanen, K. & Pulman, S. (2007). Sentiment Composition. In Proc. of RANLP-2007, pages 378-382, Borovets, Bulgaria. Petrakis, S., Klenner, M., Ailloud, E., & Fahrni, A. (2009). Composition multilingue de sentiments. In TALN (Traitement Automatique des Langues Naturelles) (Demo paper), Senlis, France. Schmid, H. (1994). Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proc. of Intern. Conf. on New Methods in Language Processing. Wilson, T., Wiebe, J., & Hoffmann, P. (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. In Proc. of HLT/ EMNLP 2005, Vancouver, CA. 
TMT: Ein Text-Mining-System für die Inhaltsanalyse * Peter Kolb Department Linguistik Universität Potsdam kolb@linguatools.de Zusammenfassung Mit der Verfügbarkeit umfangreicher elektronischer Daten eröffnet sich für Sozial- und Geisteswissenschaftler die Möglichkeit, ihre Fragestellungen auf größere Datenmengen anzuwenden, die Erfassung und Analyse teilweise zu automatisieren und den Aussagen damit größere (quantifizierte) Validität zu verleihen. Zugleich ergeben sich neue technische Herausforderungen für die Datenanalyse, die sich nur mit Hilfe der Computerlinguistik lösen lassen. Dieser Beitrag stellt ein Text Mining-System vor, welches die Potenziale computer- und korpuslinguistischer Verfahren für eine multilinguale Inhaltsanalyse aufzeigt. 1 Einführung Ein Forschungsprojekt am Otto-Suhr-Institut für Politikwissenschaft der Freien Universität Berlin untersucht die Frage, ob die unterschiedlichen nationalen Medienöffentlichkeiten in der Europäischen Union (EU) Problemsichten teilen, ähnliche politische Akteure - insbesondere die EU - als Handlungsträger sehen und ähnliche normative Kriterien zur Beurteilung von politischen Problemen heranziehen. Dazu wird eine multilinguale politikwissenschaftliche Medientext-Analyse durchgeführt. Grundlage dieser ländervergleichenden Längsschnittanalyse ist ein bereinigtes Vollsample von 489.500 Zeitungsartikeln, die in den Jahren 1990-2006 in je zwei großen Tageszeitungen in acht Ländern erschienen. Sie wurden mit Hilfe einer komplexen Suchanfrage aus Medienarchiven wie LexisNexis und aus den Archiven einzelner Zeitungen gewonnen. Die Untersuchungsländer sind: Deutschland, Frankreich, Großbritannien, Irland, Niederlande, Österreich, Polen und die USA. Angesichts dieser Datenmenge ergeben sich zwei wesentliche Herausforderungen: einerseits eine automatisierte Aufbereitung der Rohdaten für ein auf die Forschungsfragen zugeschnittenes dynamisches Datenmanagement; andererseits die automatische Erfassung von semantischen Teilmengen, die entweder entfernt oder aber für nähere Analysen herangezogen werden sollen (z.B. alle fälschlicherweise erfassten Duplikate und Samplingfehler, Artikel zu Interventionen im engeren Sinne, Artikel mit EU-Referenzen etc.). Als Antwort auf diese Herausforderungen entstand das “Text Mining Tool” (TMT). TMT integriert unterschiedliche Zugangsweisen zu einem Korpus in einer Anwendung. Es weist eine Client-Server-Architektur auf. Die gesamten Daten befinden sich auf einem Server-Rechner, * Erschienen in: C. Chiarcos, R. Eckart de Castilho, M. Stede (Hrsg.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, S. 101-107. 102 Peter Kolb der eine AJAX-Webschnittstelle 1 bereitstellt, über die mittels eines beliebigen Webbrowsers auf TMT zugegriffen werden kann. Dies bietet drei Vorteile: Erstens kann ein Client-Rechner von einem beliebigen Standort über das Internet auf das Tool zugreifen, zweitens spielt das jeweilige Betriebssystem, unter dem der Client läuft, keine Rolle, und drittens muss auf dem Client außer einem Webbrowser keine weitere Software installiert werden. Das Text-Mining-System umfasst drei Teilkomponenten: die Suchmaschine LiSCo, das Distributionsanalysewerkzeug DISCO und eine sogenannte “Themenmaschine”. Diese Bausteine werden im Folgenden beschrieben. 2 LiSCo Die Suchmaschine LiSCo (Linguistische Suche in Corpora) indexiert das jeweilige Korpus und stellt diverse Suchwerkzeuge bereit. 
LiSCo basiert auf dem Lucene-Index 2 , einem in Java implementierten leistungsfähigen Volltextindex, der frei verfügbar ist. Den ersten Schritt der Indexierung bildet die Aufbereitung und Vorverarbeitung des Korpus. Die Dokumente werden mit etwaigen Metadaten wie Quelle, Datum, Ursprungsland usw. versehen und in einem XML-Format gespeichert. Anschließend wird mit Hilfe des Tree-Taggers (Schmid, 1994) ein PoS-Tagging und eine Lemmatisierung durchgeführt. Zum Schluss werden alle Texte ggf. nach Unicode (UTF-8) konvertiert. Die Dokumente werden dann in den Lucene-Index eingelesen, wobei verschiedene durchsuchbare Felder für jedes Dokument gespeichert werden. So kann sowohl nach den ursprünglichen Wortformen, als auch nach den Lemmata gesucht werden, außerdem nach eventuell vorhandenen Metadaten wie Datum, Land, Quelle usw. Der gesamte Volltext jedes Dokuments wird ebenfalls in den Index aufgenommen. In den früheren Versionen von TMT war die zu indexierende Einheit das Dokument, d.h. der gegebene Zeitungsartikel. Dies wurde in der aktuellen Version auf eine Passage geändert. Grund war das Feedback der Nutzer hinsichtlich der Arbeit mit dem Relevanzfeedback und der Suche nach ähnlichen Dokumenten (siehe unten). Da die Auswahl der relevanten Dokumente aus einer Trefferliste aufgrund der Kurzzusammenfassung der Dokumente (nämlich dem jeweils ersten Absatz) erfolgte, die interne inhaltliche Repräsentation der Dokumente aber auf Grundlage des gesamten Dokuments erzeugt wurde, kam es regelmäßig zu starken Abweichungen gegenüber den Erwartungen der Nutzer. Zur automatischen Segmentierung der Dokumente in Passagen wird ein einfacher Fenster-Ansatz verfolgt: die Dokumente werden in gleichgroße Abschnitte aus zehn Sätzen zerlegt (Tiedemann & Mur, 2008). Zur Suche im Index stellt Lucene eine leistungsfähige Abfrage-Syntax bereit. Eine Boolesche Suche mit den Operatoren AND, OR und NOT ist ebenso möglich wie eine Suche mit Wildcards, eine trunkierte Suche, oder eine exakte Phrasensuche. Lucene bietet außerdem die Möglichkeit, Treffer in ihrem Kontext anzuzeigen und hervorzuheben. Damit lässt sich bereits eine einfache Konkordanzanzeige bereitstellen. Lucene implementiert neben einem Standard-Volltextindex auch das Vektormodell (Salton, 1971) des Information Retrievals. Dazu speichert Lucene zu jedem Term die Auftretenshäufigkeit in der jeweiligen Passage (Termfrequenz TF) sowie seine Häufigkeit in der gesamten Menge aller Passagen 1 Verwendet wird das Google Web Toolkit: http: / / code.google.com/ webtoolkit 2 http: / / lucene.apache.org Text-Mining für die Inhaltsanalyse 103 (Dokumentenfrequenz DF). Aus diesen Angaben kann das bekannte TF-IDF-Maß zur Bestimmung der Relevanz eines Terms berechnet werden. Ein Term ist dabei umso wichtiger, je häufiger er in der jeweiligen Passage vorkommt und je seltener er insgesamt in der Dokumentensammlung auftaucht. Auf dieser - von Lucene bereitgestellten - Grundlage haben wir ein Relevanzfeedback (Rocchio, 1971) und eine Suche nach inhaltlich ähnlichen Passagen (und auch Dokumenten) implementiert, die nachfolgend kurz beschrieben werden. Beim Relevanzfeedback kann der Benutzer die zu einem Suchergebnis angezeigten Trefferpassagen durch Anklicken als relevant oder nicht relevant bewerten, und die Anfrage dann per Mausklick wiederholen. Die ursprüngliche Suchanfrage wird automatisch um die relevantesten Terme aus den vom Nutzer als relevant bewerteten Passagen erweitert. Terme aus den als irrelevant bewerteten Passagen werden aus der Suchanfrage entfernt. 
Die Terme der automatisch erzeugten Suchanfrage können ausgegeben werden. Per Mausklick kann auch nach inhaltlich ähnlichen Passagen zu einer gegebenen Passage gesucht werden. Dabei wird die Ausgangspassage durch einen Vektor ihrer relevantesten Terme repräsentiert, die mit ihrem TF-IDF-Wert gewichtet sind. Dieser Vektor kann als Suchanfrage an den Lucene- Index geschickt werden, der dann die ähnlichsten Passagen als Treffer ausgibt. Ein weiteres in TMT implementiertes Suchverfahren bildet die automatische Kategorisierung von Passagen oder Dokumenten in ein vom Benutzer vorgegebenes Kategorienmodell. Die automatische Einordnung neuer Dokumente in das Kategorienmodell erfolgt über die zuvor beschriebene Ähnlichkeitssuche. Das neue Dokument wird mit allen prototypischen Dokumenten im Kategorienmodell auf Ähnlichkeit verglichen und in die Kategorie mit den ähnlichsten Dokumenten eingeordnet. Dazu wird das sogenannte k-nearest-neighbour-Verfahren (Sebastiani, 2002) eingesetzt, das sich durch Robustheit, Geschwindigkeit und eine gute Skalierbarkeit hinsichtlich der Kategorienanzahl auszeichnet. 3 DISCO Mit dem Distributionsanalyse-System DISCO (Kolb, 2008, 2009) lassen sich zu einem Suchwort die signifikanten Kookkurrenzen und die distributionell ähnlichen Wörter anzeigen. Zudem können auf Grundlage der distributionellen Ähnlichkeit Wortcluster berechnet und graphisch dargestellt werden. Die Kookkurrenzen vermitteln einen ersten Eindruck, in welchen Zusammenhängen das Suchwort im Korpus verwendet wird. Auf Basis der Kookkurrenzen berechnet DISCO die Wörter, die im Korpus eine ähnliche Distribution aufweisen. Besonders im Falle abstrakter Nomen erhält man hier semantisch ähnliche Wörter zum Ausgangswort, teilweise das ganze semantische Spektrum. Die Listen der signifikanten Kookkurrenzen und distributionell ähnlichen Wörter können durch ein String-Ähnlichkeitsmaß gefiltert werden. Auf diese Weise können sehr schnell Schreibvarianten und Flexionsformen zu einem gegebenen Wort identifiziert werden. Diese können dann per Mausklick in einen editierbaren Thesaurus übernommen werden, der zur Erweiterung von Suchanfragen genutzt werden kann. Der Einsatz von DISCO ist anhand zweier beispielhafter Anwendungsfälle in Kolb et al. (2009) beschrieben. 104 Peter Kolb 4 Themenmaschine Im Text Mining besteht eine grundsätzliche Kluft zwischen der konzeptuellen Ebene einerseits und der textlich-lexikalischen Ebene andererseits. Existiert eine manuell erstellte Ontologie aus interessierenden Konzepten, d.h. abstrakten semantischen Einheiten, müssen diese in konkret formulierten sprachlichen Ausdrücken wiedergefunden werden. Texte müssen also erst einmal mit Konzepten annotiert werden. Dabei treten die schon aus dem Information Retrieval bekannten Probleme der Lesartenambiguität und Paraphrase auf, d.h. ein Korpus mit einer vorhandenen Ontologie zu annotieren ist keineswegs trivial. Ein alternativer Ansatz beseht darin, textbasiert vorzugehen und aus relevanten Termen oder Phrasen im Text eine “Ontologie” aufzubauen. Hier wird versucht, von der konkreten Formulierung im Text zu einer semantischen, möglichst abstrakten Darstellung zu gelangen. Die Schwierigkeit ist dabei, einen hinreichenden Abstraktionsgrad zu erreichen, um gleichbedeutende, aber unterschiedlich ausgedrückte Inhalte überhaupt aufeinander beziehen zu können. Es ist nicht zu erwarten, dass es sich bei den automatisch extrahierten Themen um Konzepte der Art handelt, die denen in einer intellektuell erstellten Ontologie entsprechen. 
Die “Themenmaschine” stellt einen solchen Versuch des korpusgetriebenen Aufbaus eines Wissensnetzes dar. Dazu werden im Korpus automatisch Themen in Form relevanter Phrasen identifiziert. Diese werden mit Hilfe linguistischer Analysen auf eine Normalform gebracht. Themen und Teilthemen werden zu Hauptthemen und Themengruppen gebündelt, die eine hierarchische Darstellung - etwa in Form eines Baums - der Themen erlauben. Die Passagen, in denen die Themen gefunden wurden, bilden die Blätter des Baums. Außerdem werden mit Hilfe der semantischen Ähnlichkeit von DISCO verwandte Themen angeboten. Das Verfahren arbeitet wie folgt. Mit Hilfe des LoPar-Chunkers (Schmid & Schulte im Walde, 2000) werden zunächst Nominal- und Präpositionalphrasen erkannt. Zum Beispiel werden aus dem Satz Außenminister Joschka Fischer fordert militärische Intervention die Phrasen Außenminister Joschka Fischer und militärische Intervention extrahiert. Im anschließenden Normalisierungsschritt werden Phrasen mit unterschiedlichem syntaktischen Aufbau auf eine gemeinsame Form gebracht. Dadurch können inhaltliche Übereinstimmungen zwischen unterschiedlich formulierten Textabschnitten erkannt werden. Beispielsweise würden die drei Phrasen militärische Intervention, Intervention des Militärs, Militärintervention alle zu Intervention Militär normalisiert. Die Normalisierung erfolgt mit Hilfe sprachspezifischer, manuell erstellter syntaktischer Muster wie der folgenden: Bestimmungswort+Grundwort −→ Grundwort Bestimmungswort Adjektiv Nomen −→ Nomen Nom(Adjektiv) Nomen 1 { der, des } Nomen 2 −→ Nomen 1 Nomen 2 Nomen 1 Adjektiv Nomen 2 −→ Nomen 1 Nomen 2 Nom(Adjektiv) Nomen 1 { von, zum, in, ... } Nomen 2 −→ Nomen 1 Nomen 2 Das erste Muster betrifft Komposita und normalisiert z.B. Giftstoff zu Stoff Gift. Muster vier und fünf würden im Zusammenspiel mit Muster eins die Ausdrücke Transport giftiger Stoffe und Transport von Giftstoffen auf die gleiche Normalform Transport Stoff Gift zurückführen. Momentan existieren knapp 40 Muster für das Deutsche. Text-Mining für die Inhaltsanalyse 105 Um aus Nomina abgeleitete Adjektive, wie etwa Militär militärisch, Zeremonie zeremoniell usw., wieder auf die nominale Form zu bringen, wird auf eine Datenbank mit morphologisch verwandten Wörtern zurückgegriffen. Im Deutschen ist zusätzlich der Einsatz eines morphologischen Analyseschrittes zur Zerlegung von Komposita in ihre Konstituenten notwendig. Dazu wurde auf Basis eines großen Wörterbuchs eine entsprechende Komponente implementiert, die aber hier aus Platzgründen nicht näher beschrieben werden kann. Als nächstes werden Phrasen und Teilphrasen zusammengefasst. Hierbei würde eine Phrase wie Bundesaußenminister Fischer mit der oben aufgeführten Phrase Außenminister Joschka Fischer identifiziert werden. Durch diesen Prozess der Phrasennormalisierung und -zusammenfassung wird die referentielle Struktur eines Textes teilweise aufgedeckt. Wünschenswert wäre es, diese Methode zu erweitern, um auch anaphorische Pronomen in die Analyse einbeziehen zu können. Hier bietet sich der Einsatz eines Verfahrens zur Koreferenz-Auflösung an (wie z.B. Strube et al. (2002)), in das dann auch die von DISCO bereitgestellte semantische Ähnlichkeit einfließen könnte (Nilsson & Hjelm, 2009). Durch die Zusammenfassung ähnlich formulierter Ausdrücke über den ganzen Text hinweg wird erreicht, dass sich die anschließende Relevanzberechnung nach dem TF-IDF-Maß deutlich verbessert. 
Nur diejenigen Phrasen, die einen bestimmten Relevanzwert erreichen, gelangen in den Themenindex. Zu jedem Thema werden seine im Text realisierte Form (z.B. Transport von Giftstoffen), die Normalform (Transport Stoff Gift), sowie ein Zeiger auf das Ursprungsdokument gespeichert. Die Speicherung erfolgt im Lucene-Index. Zur Themensuche wird die Suchanfrage nach dem beschriebenen Verfahren normalisiert, dann werden diejenigen Themen aus dem Themenindex abgerufen, die die höchste Überschneidung ihrer Normalformen mit der Normalform der Anfrage aufweisen. Die Themen werden hierarchisch gegliedert und z.B. in Form eines Baums angezeigt. Die Knoten werden von der Wurzel des Baums zu den Blättern hin immer spezifischer und informativer. So findet sich an der Wurzel des Baums meist ein einzelnes Wort, das auf den unteren Ebenen in immer größere Kontexte eingebettet wird, bis man an den Blättern des Baums zu ganzen Textpassagen oder Dokumenten gelangt. Einen Beispielbaum für die Anfrage Giftstofftransport (Normalform: Transport Stoff Gift) zeigt Abbildung 1. 5 Schlussfolgerung und Ausblick Das vorgestellte Text-Mining-Werkzeug ist auf die Bedürfnisse der Politikwissenschaftler zugeschnitten, die es im erwähnten Forschungsprojekt für Inhaltsanalysen einsetzen. Die Evaluierung des Systems erfolgt bislang ausschließlich durch das Feedback der User und die Beobachtung der Systemnutzung (z.B. Auswerten von Logdateien). Ziel ist es zunächst, festzustellen, welche der angebotenen Suchmöglichkeiten von den Benutzern überhaupt angenommen werden. Resultat dieser Vorgehensweise war bereits der Umbau des Systems vom Dokumentenretrieval hin zum Retrieval einzelner Textpassagen. Zum Einsatz der Themenmaschine liegen leider zum gegenwärtigen Zeitpunkt noch keine Rückmeldungen vor. Auf eine quantitative Evaluierung von Funktionen wie Relevanzfeedback oder Kategorisierung wurde verzichtet, da aus der Literatur bekannte Standardverfahren angewandt werden; auch die zugrundeliegende Merkmalsextraktion bzw. -normalisierung und die Merkmalsgewichtung erfolgt über Standardkomponenten (Tree-Tagger) und -verfahren (TF-IDF). Hinsichtlich Zeitverhalten und Skalierbarkeit des Systems muss zwischen der Indexierung einer Dokumentensammlung und dem späteren Anfragezeitpunkt unterschieden werden. Die Indexierung 106 Peter Kolb Transport Transport lebender Tiere Transport von Explosivstoff Transport von Giftstoff Transport von Giftstoff − Unfall Transport von Giftstoff − Unfall auf der A100 Transport von Giftstoff − Unfall in Schwedt Transport von Giftstoff − Verschmutzung der Umwelt Verschmutzung Transport von Giftstoff − Austritt Transport von Giftstoff − Transport von Giftstoff − Verschmutzung des Grundwassers ... ... ... ... ... ... Abbildung 1: Beispielhafte Ergebnisdarstellung einer Themensuche nach Giftstofftransport einer großen Dokumentensammlung ist natürlich ein aufwändiger (und im jetzigen System noch nicht vollständig automatisierter) Schritt. Die Dokumente müssen vorverarbeitet, getaggt, gechunkt und durch Lucene indexiert werden, Themen müssen extrahiert werden usw. Besonders aufwändig ist die Berechnung der distributionellen Ähnlichkeiten durch DISCO. Um distributionelle Ähnlichkeiten für z.B. die 150.000 häufigsten Wortformen eines Korpus im Umfang von 100 Millionen Wortformen zu berechnen, benötigt DISCO etwa 24 Stunden. Prinzipiell ist der eingesetzten Technologie (Lucene, DISCO) nur durch die verfügbare Hardware (Rechenleistung, Plattenkapazität und Größe des Hauptspeichers) Grenzen gesetzt. 
LiSCo und Themenmaschine können inkrementell arbeiten und so mit wachsenden Dokumentensammlungen umgehen, DISCO bietet dagegen keine Möglichkeit Dokumente zu einem einmal berechneten Wortraum hinzuzufügen. Bei knapp 500.000 indexierten Zeitungsartikeln erreicht der Lucene-Index eine Größe von 4,6 Gigabyte. Suchanfragen an Lucene oder DISCO werden ohne spürbare Verzögerung beantwortet. Zeitaufwändig sind lediglich die Erstellung von Zeitleisten und insbesondere die Berechnung der DISCO- Graphen. Bei letzterer kommt es zu Wartezeiten von teilweise über einer Minute. Text-Mining für die Inhaltsanalyse 107 Literatur Kolb, P. (2008). DISCO: A Multilingual Database of Distributionally Similar Words. In Tagungsband der 9. KONVENS - Ergänzungsband, pages 37-44, Berlin. Kolb, P. (2009). Experiments on the difference between semantic similarity and relatedness. In Proceedings of the 17th Nordic Conference on Computational Linguistics - NODALIDA ’09, pages 81-88, Odense, Denmark. Kolb, P., Kutter, A., Kantner, C., & Stede, M. (2009). Computer- und korpuslinguistische Verfahren für die Analyse massenmedialer politischer Kommunikation: Humanitäre und militärische Interventionen im Spiegel der Presse. In Tagungsband des GSCL Symposiums “Sprachtechnologie und eHumanities”, Duisburg. Nilsson, K. & Hjelm, H. (2009). Using Semantic Features Derived from Word-Space Models for Swedish Coreference Resolution. In Proceedings of the 17th Nordic Conference on Computational Linguistics - NODALIDA ’09, pages 134-141, Odense, Denmark. Rocchio, J. (1971). Relevance Feedback in Information Retrieval. In G. Salton, editor, The SMART Retrieval System - Experiments in Automatic Document Processing, pages 313-323. Prentice-Hall, Upper Saddle River, NJ, USA. Salton, G. (1971). The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice- Hall, Upper Saddle River, NJ, USA. Schmid, H. (1994). Probabilistic Part-of-Speech Tagging Using Decision Trees. International Conference on New Methods in Language Processing. Schmid, H. & Schulte im Walde, S. (2000). Robust German Noun Chunking with a Probabilistic Context-Free Grammar. In Proceedings of the 18th International Conference on Computational Linguistics (COLING- 00), pages 726-732, Saarbrücken, Germany. Sebastiani, F. (2002). Machine learning in automated text categorization. ACM computing surveys (CSUR), 34(1), 1-47. Strube, M., Rapp, S., & Müller, C. (2002). The Influence of Minimum Edit Distance on Reference Resolution. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 312-319, Philadelphia, USA. Tiedemann, J. & Mur, J. (2008). Simple is Best: Experiments with Different Document Segmentation Strategies for Passage Retrieval. In Proceedings of the 2nd workshop on Information Retrieval for Question Answering (IR4QA), pages 17-25, Manchester, UK. Integration of Light-Weight Semantics into a Syntax Query Formalism * Torsten Marek Institute of Computational Linguistics University of Zürich Binzmühlestrasse 14, 8052 Zürich, Switzerland marek@ifi.uzh.ch Abstract In the Computational Linguistics community, much work is put into the creation of large, highquality linguistic resources, often with complex annotation. In order to make these resources accessible to non-technical audiences, formalisms for searching and filtering are needed, like the TIGER corpus query language. 
Recently, augmented treebanks have been published, including the SALSA corpus which features frame semantic annotation on top of syntactic structure. We design an extension for the TIGER language which allows searching for frame structures along with syntactic annotation. To achieve this, the TIGER object model is expanded to include frame semantics, while remaining fully backwards-compatible and add these extensions to our own implementation of TIGER. 1 Frame Semantics Frame semantics (Fillmore, 1976, 1985) is a formalism that aims to model predicates and their arguments as conceptual structures, called frames. Frames represent prototypical situations and provide an abstraction layer on the concrete syntactic realization of the predicate and its arguments as well as disambiguation of potentially polysemous words. Rather than describing the grammatical function of a phrase, frames relate phrases to predicates based on their semantic function, the role. The predicates which can evoke a frame are called lexical units and may be verbs, nouns, adjectives or adverbs. The actual instance of a lexical unit in the text is referred to as the target of a frame. Examples 1 to 3 show three instances of the frame SENDING , all evoked by different realizations of the lexical unit send.v. (1) [Alice] SENDER sent SENDING [Bob] RECIPIENT [a message] THEME . (2) [Alice] SENDER sends SENDING [a message] THEME [to Bob] RECIPIENT . (3) [To Bob] RECIPIENT , [the message] THEME had been sent SENDING [by Alice] SENDER . * Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 109-114. 110 Torsten Marek While the syntactic structure varies from example to example, all sentences describe the prototypical situation of SENDING a THEME (the message) from a SENDER (Alice) to a RECIPIENT (Bob). In all cases, the frame SENDING is evoked by the lexical unit send, independent of its tense and diathesis. The shallow analysis provided by frame semantics helps to generalize across syntactic alternations that fully or nearly preserve the meaning and provide a more abstract level than a syntactic analysis can. At the same time, it is more robust than a complete semantic analysis, because it is primarily concerned with representing valence and polysemy of predicates. 1.1 Berkeley FrameNet In order to allow for better generalization, relations between frames and roles are defined, which creates a frame net. An example relation is inheritance, which defines a conceptual hierarchy of frames and their roles. This relation encodes role equivalences explicitly, without the need to rely on a limited set of universal roles. The Berkeley FrameNet project (Baker et al., 1998) at the International Computer Science Institute at UC Berkeley has created a frame net for English. In the latest release 1.3, it contains 795 frame descriptions, which define the meaning of each frame, its roles and lexical units along with annotated example sentences. 1.2 TIGER & SALSA Based on the Berkeley FrameNet, the SALSA project at Saarland University started annotating the TIGER corpus (Brants et al., 2002), a treebank with ~50.000 phrase structure trees of German newspaper text, using frame semantics. A first version of this augmented corpus was eventually released as the SALSA corpus in 2007 (Erk et al., 2003; Burchardt et al., 2006). 
Roles and targets are not annotated on the surface text but on the syntactic structure. This results in several interacting structural layers, also known as multi-level annotation. 2 Motivation Corpora such as TIGER or SALSA are important resources for Natural Language Processing, both as training material for machine-learning systems and for linguistic research. However, linguistic interest is often limited to a single, specific phenomenon at a time, resulting in an exploratory rather than exhaustive usage pattern of the corpus. The TIGER corpus query language (Lezius, 2002) is a part of the TIGER project and has been developed to support exactly this kind of linguistically motivated text exploration. With the availability of a large corpus with frame semantic and syntactic annotation, it is natural to formulate queries like the one in example 4, matching the sample sentence in example 5. (4) Find all sentences where the role TOPIC in the frame STATEMENT is realized by a PP with the preposition “über”. (5) [Hotels und Gaststätten] SPEAKER klagen STATEMENT [über knauserige Gäste] TOPIC . The part of the query which describes the syntactic realization of the role can be expressed in a simple way using the TIGER query language, shown in example 6. Integration of Light-Weight Semantics into a Syntax Query Formalism 111 (6) [cat="PP"] >AC [word="über"] The search for a frame and its role instances and, even more important, the connection between a role and its syntactic material is inexpressible with TIGER. For the SALSA corpus, the XML storage format for TIGER corpora was extended (Erk & Padó, 2004) and the new elements and relations are not represented in the query language any more, which reduces the usefulness of the SALSA corpus to linguists. 2.1 Related Work The NITE object model (Carletta et al., 2003) defines an annotation model for multi-modal corpora that allows for any number of input and annotation layers, which again can be the basis for an arbitrary number of possibly intersecting hierarchies. NQL (Evert & Voormann, 2003) is a query language on top of the NITE object model that can be used to search both structural and timeline-based annotations in multi-modal corpora. Heid et al. (2004) have shown that it is possible to encode the SALSA corpus in the NITE object model and write linguistic queries like 7 in NQL, as shown in example 8. (7) Find words or syntactic categories which are the target of different semantic frames or which have more than one role, each role belonging to a different frame. (8) ($f1 frame) ($f2 frame) (exists $phrase syntax) (exists $target word): $f1 >"target" $target && $f2 >"target" $target && $f1 ^ $phrase && $f2 ^ $phrase && $f1 ! = $f2 Unfortunately, this work is only a proof of concept. The SALSA corpus is not available in the converted object model and development on NXT search, the principal implementation of NQL, has stopped several years ago. 3 Query Language Extensions Our extension does not interfere with the original parts of the query language. Especially, it does not introduce new matches—a syntax-only query on a corpus with frame semantic annotation yields exactly the same results as if the corpus consisted of only syntactic annotation as thus is completely backwards-compatible. 3.1 New Node Types The annotation of frame instances forms a new, additional structural layer on top of the existing syntax annotation, which remains unchanged. Each frame instance forms a small tree on its own (cf. 
figure 3.1), with a fixed depth and fixed node types on each level. The root is the frame instance itself, which has one exactly target (or frame-evoking element, FEE ) and any number of roles (or frame elements, FE ). FEE and FE nodes reference syntax nodes, which can be terminals ( T ) or nonterminals ( NT ). 112 Torsten Marek Figure 1: Annotation structure of a frame instance Until now, the syntactic types in TIGER were direct subtypes of the root type FREC . In our extension, they are modified to be derived from a new intermediate type SYNTAX . We introduce a new type SEMANTICS , which serves as the base type of all new node types for frame semantics annotation. We introduce a new type FRAME for frame instances and new types FE for roles and FEE for targets. The latter two types have a different feature set and a different interpretation with regard to their containing frame instance, but both are members of frames and reference syntactic material. This is the reason for the introduction of a new intermediate type SYNSEM (short for syntax-semantics connector), from which we derive FE and FEE . 3.2 New Node Features Each new node type has several features which can be used in queries: • FRAME frame : the name of a frame. • FE role : the role name • FEE lemma : the lemma of the target To support users with visual cues when writing queries that combine syntactic and semantic material and keep the extension backwards-compatible, we extend the syntax of TIGER. Node descriptions with a type that is a subtype of SEMANTICS must be surrounded by curly braces instead of square brackets. In example 9, it is clear that the node description refers to a semantic node. (9) {frame="Statement"} Since there is no common behavior between both node types as far as node relations are concerned, this requirement does not incur any restrictions. If a user writes a query in which frame semantics nodes are referenced with square brackets (or vice versa), the implementation can spot the mistake and fail with a meaningful error message. 3.3 Basic Node Relations To allow expression of queries on the structure of the new annotation elements in TIGER/ SALSA XML, we define several new basic relation constraints. The information for the constraints can be read directly from the annotation. To describe the structure of a frame annotation (cf. figure 3.1), we introduce two new relation constraints: Integration of Light-Weight Semantics into a Syntax Query Formalism 113 ! ! ! ! Figure 2: Annotation graph for the sentence from example 5, with result highlights Frame Members: FRAME > SYNSEM A SYNSEM node is a member of a the FRAME node. Syntactic Material: SYNSEM > SYNTAX A SYNTAX node is referenced by a SYNSEM node, the connection layer between syntax and frame semantics. 3.4 Advanced Features The advanced features of the extension support queries over: • hierarchies of frames and roles • underspecification relations between frames • morphologically complex words • semantic types of roles 4 Results We have designed an extension of the TIGER query language that makes it possible to formulate linguistic queries on syntactic and frame semantic annotation, like the one in example 4: 1. Find all instances of TOPIC : #r: {role="Topic"} 2. Find all instances of STATEMENT : #f: {frame="Statement"} 3. Role is a member of frame: #f > #r 4. 
Role is fully realized by a syntax node: #r > #pp & arity(#r, 1) 114 Torsten Marek Putting all these parts together with the original syntax query fragment from example 6, we get the query in example 10: (10) {frame="Statement"} > #r & #r: {role="Topic"} > #pp & arity(#r, 1) & #pp: [cat="PP"] >AC [word="über"] Figure 2 shows the annotation graph of the sentence in example 5, which is a result of query 10. The extensions described in this paper have been implemented in our own TIGER query engine (Mettler, 2007). A demo of this query engine is also available online. 1 References Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet Project. In Proceedings of COLING-ACL 1998, pages 86-90. Brants, S., Dipper, S., Hansen, S., Lezius, W., & Smith, G. (2002). The TIGER Treebank. In Proceedings of the First Workshop on Treebanks and Linguistic Theories, TLT 2002, Sozopol, Bulgaria. Burchardt, A., Erk, K., Frank, A., Kowalski, A., Padó, S., & Pinkal, M. (2006). The SALSA Corpus: A German Corpus Resource for Lexical Semantics. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy. Carletta, J., Kilgour, J., O’Donnell, T. J., Evert, S., & Voorman, H. (2003). The NITE object model library for handling structured linguistic annotation on multimodal data sets. In Proceedings of the EACL Workshop on Language Technology and the Semantic Web (NLPXML-2003), Budapest, Hungary. Erk, K. & Padó, S. (2004). A powerful and versatile XML format for representing role-semantic annotation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004, Lisbon, Portugal. Erk, K., Kowalski, A., Padó, S., & Pinkal, M. (2003). Towards a Resource for Lexical Semantics: A Large German Corpus with Extensive Semantic Annotation. In Proceedings of the ACL 2003, pages 537-544. Evert, S. & Voormann, H. (2003). NQL - A Query Language for Multi-Modal Language Data. Technical report, IMS, University of Stuttgart, Stuttgart, Germany. Fillmore, C. J. (1976). Frame Semantics and the Nature of Language. In Annals of the New York Academy of Sciences: Conference on the Origin and Development of Language and Speech, volume 280, pages 20-32. Fillmore, C. J. (1985). Frames and the semantics of understanding. Quaderni di Semantica, IV(2), 222-254. Heid, U., Voormann, H., Milde, J.-T., Gut, U., Erk, K., & Padó, S. (2004). Querying both time-aligned and hierarchical corpora with NXT search. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC 2004, Lisbon, Portugal. Lezius, W. (2002). Ein Suchwerkzeug für syntaktisch annotierte Textkorpora. Ph.D. thesis, IMS, University of Stuttgart, Stuttgart, Germany. Mettler, M. B. (2007). Parallel Treebank Search - The Implementation of the Stockholm TreeAligner Search. C-uppsats, Stockholm University, Stockholm, Sweden. 1 http: / / fnps.coli.uni-saarland.de: 8080/ A New Hybrid Dependency Parser for German * Rico Sennrich, Gerold Schneider, Martin Volk, and Martin Warin Universität Zürich, Institut für Computerlinguistik Binzmühlestrasse 14, CH-8050 Zürich Abstract We describe the development of a new German parser that uses a hybrid approach, combining a hand-written rule-based grammar with a probabilistic disambiguation system. Our evaluation shows that this parsing approach can compete with state-of-the-art systems both in terms of efficiency and parsing quality. 
The hybrid approach also allows for the integration of the morphology tool GERTWOL, which leads to a comparatively high precision for core syntactic relations. 1 Introduction Parsing German keeps attracting interest, both because German is a major European language, and because it has special characteristics such as a relatively free word order and a rich morphology. These characteristics mean that a parsing approach that is appropriate for English is not automatically so for German. While German parsers typically perform worse than English ones, the controversy whether parsing German is an inherently harder task than parsing English is still open (Kübler, 2006). Inter-language comparisons aside, it has been shown that even when only comparing German parsers, choice of treebank and evaluation measure have a considerable effect on reported results (Rehbein & van Genabith, 2008). An additional confounding factor is the varying amount of gold information used in different evaluations, ranging from POS-tags up to morphological analyses (Versley, 2005; Buchholz & Marsi, 2006). We report user-oriented evaluation results that are based on real-world conditions rather than ideal ones. Specifically, only plain text has been taken from the gold set, and all additional information required by the parsers has been predicted automatically. 2 Previous Work on German Dependency Parsing A number of comparative studies and workshops give estimates of the performance of current German parsers (Versley, 2005; Buchholz & Marsi, 2006; Kübler et al., 2006; Kübler, 2008). While comparing the results of different studies is not easily possible due to variations in the test setting and evaluation process, conclusions can be drawn from the individual studies. * Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 115-124. 116 Rico Sennrich, Gerold Schneider et al. Kübler, in addition to discussing the particular challenges of German parsing, show encouraging results obtained by statistical constituency parsing (89.18% labeled F-score for a lexicalised Stanford PCFG system, Kübler et al., 2006). One needs to bear in mind, however, that the “constituent structure for a German sentence will often not be sufficient for determining its intended meaning” (Kübler et al., 2006). This is especially true for noun phrases, which can serve as subjects, different kinds of objects, predicative nouns and genitive attributes, among others. When requiring the correct identification of grammatical function, the parser performs considerably worse (75.33%). In the PaGe 2008 Shared Task on Parsing German, the dependency version of the MaltParser is shown to be better at identifying grammatical functions than its constituency counterpart and other constituency parsers (Kübler, 2008; Hall & Nivre, 2008). The MaltParser is also among the topperforming parsers in both the PaGe 2008 and the CoNLL-X Shared Tasks, obtaining a labeled attachment score of 88.6% and 85.8%, respectively. The labeled attachment score (LAS) measures “the percentage of [non-punctuation] tokens for which the system has predicted the correct head and dependency relation” (Buchholz & Marsi, 2006). Of the 19 participating groups in the CoNLL-X Shared Task on Multilingual Dependency Parsing (Buchholz & Marsi, 2006), the average LAS for German is 78.6%. The best parser, described by McDonald et al. 
(2005), achieves 87.3%. Their system “can be formalized as the search for a maximum spanning tree in a directed graph” (McDonald et al., 2005). Versley compares a parser based on Weighted Constraint Dependency Grammar (WCDG) by Foth et al. (2004) to an unlexicalised PCFG parser across different text types, concluding that “statistical parsing for German is lacking not only in comparison with results in English, but also with manually constructed parsers for German” (Versley, 2005). The better performance of the hand-crafted WCDG parser (an LAS of 88.1% on the TüBa-D/Z test set, in contrast to 79.9% for the PCFG parser) comes at a cost of speed, though: parsing took approximately 68 seconds per sentence with the WCDG, and 2 seconds with the PCFG (Versley, 2005). We respond to the lack of fast German rule-based parsers by presenting a parser that combines a hand-written grammar with a statistical disambiguation system. We report a parsing speed of several sentences per second at 85.5% performance for gold-standard POS tags and morphology and 78.4% performance for automatic tagging and morphology. We use the parser Pro3Gres (Schneider, 2008), a fast and robust dependency parser that has been applied widely for English. We have developed a German grammar based on Foth (2005).

3 The Pro3Gres Parser

The Pro3Gres parser is a robust and fast bi-lexicalised dependency parser originally developed for English. It uses a hybrid architecture combining a manually written functional dependency grammar (FDG) with statistical lexical disambiguation obtained from the Penn Treebank. The original architecture is shown in figure 1. The disambiguation method extends the PP-attachment approach of Collins & Brooks (1995) to all major dependency types. The attachment probability for the syntactic relation R at distance dist, given the lexical items a and b, is estimated via MLE, including several backoff levels:

\[ P(R, dist \mid a, b) \cong p(R \mid a, b) \cdot p(dist \mid R) = \frac{f(R, a, b)}{\sum_{i=1}^{n} f(R_i, a, b)} \cdot \frac{f(R, dist)}{f_R} \qquad (1) \]

Figure 1: Pro3Gres architecture

The statistical disambiguation allows the parser to prune aggressively while parsing and to return likely analyses that are licensed by the grammar, ranked by their probability. The English Pro3Gres parser has been shown to achieve state-of-the-art performance (Schneider, 2008; Schneider et al., 2007; Haverinen et al., 2008). We have chosen a modified version of the Pro3Gres system because its architecture has been shown to be robust and efficient, making it a promising framework for creating a German parser.

4 Adaptation of Pro3Gres to German

Adapting parsers to new languages and domains has been recognised as an important research area (Nivre et al., 2007). We have adapted the Pro3Gres parser and its architecture to German in several ways. Taking into account the relatively free word order of German, we have chosen to include morphological and topological rules in the grammar to better identify noun phrase boundaries, in contrast to the English Pro3Gres parser, which uses a dedicated chunker. Existing linguistic resources used include the TreeTagger for POS-tagging, the GERTWOL system for morphological analysis, and part of the TüBa-D/Z corpus for the extraction of statistical data (Schmid, 1994; Haapalainen & Majorin, 1995; Telljohann et al., 2004). The TüBa-D/Z corpus is a German treebank of written newspaper texts containing approximately 36,000 sentences.
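To make the estimation in equation (1) above concrete, the following is a minimal sketch of how such a backed-off MLE score could be computed from treebank counts. It is not the actual Pro3Gres implementation: the count containers, function names and the single uniform backoff step are assumptions for illustration (the real system uses several, more fine-grained backoff levels).

```python
from collections import Counter

# Toy counts f(R, a, b) and f(R, dist); in Pro3GresDE these would be
# harvested from the converted TüBa-D/Z training section.
rel_pair_counts = Counter()   # (R, a, b) -> frequency
rel_dist_counts = Counter()   # (R, dist) -> frequency
rel_counts = Counter()        # R -> frequency


def add_observation(rel, head, dep, dist):
    """Collect the frequencies needed for equation (1)."""
    rel_pair_counts[(rel, head, dep)] += 1
    rel_dist_counts[(rel, dist)] += 1
    rel_counts[rel] += 1


def attachment_probability(rel, head, dep, dist, relations):
    """P(R, dist | a, b) ~ p(R | a, b) * p(dist | R), cf. equation (1).

    If the lexical pair (a, b) was never observed, back off to a uniform
    distribution over the candidate relations (a simplifying assumption).
    """
    pair_total = sum(rel_pair_counts[(r, head, dep)] for r in relations)
    if pair_total > 0:
        p_rel = rel_pair_counts[(rel, head, dep)] / pair_total
    else:
        p_rel = 1.0 / len(relations)                       # crude backoff
    p_dist = (rel_dist_counts[(rel, dist)] / rel_counts[rel]
              if rel_counts[rel] else 0.0)
    return p_rel * p_dist


# Example: score an accusative-object attachment at distance 2.
add_observation("obja", "lesen", "Buch", 2)
add_observation("subj", "lesen", "Buch", 1)
print(attachment_probability("obja", "lesen", "Buch", 2, ["subj", "obja"]))
```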
We have split the corpus into a training section (32,000 sentences), a development section (1000 sentences), and an evaluation section (3000 sentences). For the development of the hand-written, rule-based dependency grammar, we used the grammar framework described by Foth (2005) as a reference. We used a version of TüBa-D/ Z that was automatically converted into this dependency format, but with some conceptual differences (Versley, 2005). In case of inconsistency between Foth (2005) and TüBa-D/ Z, we adopted the labels of the latter. Non-projective structures are processed similarly to pseudo-projective parsing, involving a projective grammar combined with deprojectivization at a later stage (Nivre & Nilsson, 2005). Morphological analysis is an important step in reducing ambiguity, which both improves the speed of the parser and its results. We have used the GERTWOL system to lemmatise all tokens and get their possible morphological analyses. Lemmatising helps to alleviate the sparse data problem in our probabilistic system, and case information marks the grammatical function of a noun phrase. 118 Rico Sennrich, Gerold Schneider et al. Since many word forms are ambiguous, the lists of possible analyses are of little use in isolation. Only by enforcing agreement rules, both within noun phrases and between subject and verb, can we reduce the degree of ambiguity considerably. Apart from determining grammatical functions, these morphological constraints are also helpful in identifying phrase boundaries. With about two work-months devoted to the development of the German grammar and the probabilistic disambiguation module, Pro3GresDE works well with common grammatical phenomena, but cannot yet properly handle rarer ones such as genitive objects or noun phrases in vocative or adverbial function. 5 Method for Parser Evaluation We have already stated that different parsers cannot be directly compared on the basis of their respective performance in various publications. Differences in the test setting, most notably the evaluation measure used and the extent of manual annotation provided, have a considerable effect on the results. Hence, we have decided to conduct a ceteris paribus comparison of Pro3GresDE with two state-ofthe-art machine learning parsers, MaltParser (Nivre et al., 2006) and MSTParser (McDonald et al., 2005). We have evaluated all three parsers using a test set of 3000 sentences of the TüBa-D/ Z corpus, with approximately 32,000 sentences being used as a training set. For every token, our evaluation script tests if the parser predicts the right head and dependency relation. The dependency relation ROOT, which is used for all unattached tokens, including punctuation marks, is ignored in our evaluation. We will report precision, recall and F-scores for total parser performance and selected dependency relations. Additionally, we report the speed of all parsers as measured on an Opteron 8214 system. Since the parsers natively use between one and two of the eight available CPU cores, they have been restricted to using a single core in order to avoid a bias against single-threaded parsers. It is easily possible to run several instances of either parser in parallel to parse large text collections. In a first round of tests, we measure the performance of Pro3GresDE using different test settings to illustrate the effect of automatically predicting POS-tags (in lieu of gold tags) and morphological information on parser performance. 
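As a rough illustration of the evaluation procedure just described (a per-token comparison of predicted head and dependency label, with unattached ROOT tokens excluded), the sketch below computes precision, recall and F-score from aligned gold and system analyses. The token representation and function names are assumptions for illustration and do not reproduce the actual evaluation script.

```python
def score(gold, system, ignore_label="ROOT"):
    """gold, system: lists of (token_id, head_id, label) tuples, aligned token
    by token. Tokens labelled ROOT (unattached, incl. punctuation) are not
    counted, mirroring the evaluation setting described above."""
    correct = gold_attached = system_attached = 0
    for (g_id, g_head, g_lab), (s_id, s_head, s_lab) in zip(gold, system):
        assert g_id == s_id, "sentences must be aligned token by token"
        if g_lab != ignore_label:
            gold_attached += 1
        if s_lab != ignore_label:
            system_attached += 1
        if (g_lab != ignore_label and s_lab != ignore_label
                and g_head == s_head and g_lab == s_lab):
            correct += 1
    precision = correct / system_attached if system_attached else 0.0
    recall = correct / gold_attached if gold_attached else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


gold = [(1, 2, "SUBJ"), (2, 0, "ROOT"), (3, 2, "OBJA"), (4, 2, "ROOT")]
pred = [(1, 2, "SUBJ"), (2, 0, "ROOT"), (3, 2, "OBJD"), (4, 2, "ADV")]
print(score(gold, pred))   # precision 1/3, recall 1/2
```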
Also, the performance gain achieved by adding statistical disambiguation is shown. Subsequently, we extend our evaluation to other parsers. All three parsers are provided with the output from TreeTagger, a tokeniser and part-of-speech tagger that was not specifically trained on TüBa-D/ Z. 1 It achieves a rather low tagging accuracy of 93.3% on our test set. Tokens tagged with $. are considered sentence boundaries, even if this conflicts with the gold set. The sentence/ position numbering of the latter has been adjusted to ensure that it is aligned with the parser output for the evaluation. Also included in the test set are lemmas and morphological analyses provided by GERTWOL, although only Pro3GresDE makes use of these. We have not attempted to use this additional data to improve the two machine learning parsers, since lists of morphological analyses are incompatible with the training data of one analysis per token. 1 Instead, the default parameter file was used, which was trained on a newspaper corpus containing 90,000 tokens by IMS Stuttgart. A New Hybrid Dependency Parser for German 119 6 Evaluation of Pro3GresDE Table 1 shows total parser performance and speed of Pro3GresDE in different test settings. 2 The F-scores for some dependency relations that are of particular interest for tasks such as text mining, are shown in table 2. Unsurprisingly, the best results are achieved when using the part-of-speech tags and morphological information of the gold standard, the F-score for all relations being 85.5%. While these results are not attainable in a realistic setting with automatic tokenising, POS-tagging and morphological analysis, they come close to other evaluations that used similar test sets (Versley, 2005; Kübler, 2008). Since Pro3GresDE leaves more tokens unattached than the gold standard, total recall is typically about 6 percentage points lower than total precision. Compared to purely rule-based parsing, we can see that the inclusion of probabilities boosts parser performance by 10 percentage points, while at the same time speeding it up by 50%. While the effect on total performance of adding automatically extracted morphological information is not as big - still a considerable improvement of 3 percentage points - case information is very helpful in attributing the correct function to noun phrases, increasing the F-score for accusative objects by 16, for dative objects by 34, and for genitive modifiers by 24 percentage points. Additionally, it boosts the speed by further 40%. While it might seem counterintuitive that parsing speed increases as the system becomes more complex, the additional modules allow us to discard unlikely or morphologically unsound analyses at an early stage, which reduces the number of ambiguous structures that have to be built up. The parser is optimised for best results with both the statistics module enabled and GERTWOL morphology information available. If the statistics module is disabled, we have observed that increasing the complexity of the rule-based grammar resulted in a decrease in parser performance. This is due to rare dependency relations such as PRED (predicate) and OBJP (prepositional phrase as verb complement) being heavily overpredicted in this test setting. In both cases, the label is distinguished from structurally identical ones on a semantic level (Foth, 2005), and both relations only occur with certain verbs. Using a probabilistic disambiguation, we can improve the F-score for PRED from 19.0% to 67.2%, and for OBJP from 20.7% to 70.2%. 
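To illustrate how the case information discussed in this section constrains the grammatical function of a noun phrase, here is a small, purely hypothetical sketch: the ambiguous case sets of the tokens in an NP are intersected (agreement), and only functions whose case requirement survives the intersection remain candidates. The tiny lexicon and the feature representation are invented for illustration and do not reflect the actual GERTWOL output or the Pro3GresDE grammar.

```python
# Hypothetical, hand-built case lexicon; GERTWOL would deliver much richer,
# still ambiguous analyses (number, gender, etc. are ignored here).
CASES = {
    "den":  {"acc", "dat"},
    "Mann": {"nom", "dat", "acc"},
    "dem":  {"dat"},
    "ihn":  {"acc"},            # unambiguously accusative pronoun
}

# Simplified case requirements of a few grammatical functions.
FUNCTION_CASE = {"SUBJ": "nom", "OBJA": "acc", "OBJD": "dat", "GMOD": "gen"}


def np_cases(tokens):
    """Intersect the case sets of all tokens in a noun phrase (agreement)."""
    return set.intersection(*(CASES[t] for t in tokens))


def licensed_functions(tokens):
    """Return the grammatical functions the noun phrase may still fill."""
    cases = np_cases(tokens)
    return {f for f, c in FUNCTION_CASE.items() if c in cases}


print(licensed_functions(["den", "Mann"]))   # {'OBJA', 'OBJD'}: ruled out as subject
print(licensed_functions(["ihn"]))           # {'OBJA'} only
```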
Similarly, little time has been invested in improving parser performance when morphological information is missing. Hence, the parser will even consider morphologically unambiguous pronouns such as ihn to be possible subjects, a problem which could be solved by using full bilexicalization in the absence of morphological information. The dependency relation labeled APP, which covers proper appositions, is also used to link tokens within multi-noun chunks. Consequently, the APP results in table 2 are an indication of how well NP boundaries are recognised by the parser. Probabilistic rules result in a 33 percentage point increase in performance. Morphological information leads to a further improvement (3 and 6 points for automatic and gold morphological information, respectively). APP is one of the relations with a higher recall than precision (82.3% versus 76.8% with gold morphology and tagging), which indicates that too few phrase boundaries are predicted. Using automatic POS-tagging, parsing results are considerably worse, with an F-score of 78.4% for all relations. The performance drop of 5.4 percentage points is close to the error rate of the POS-tagger (6%), but we deem this to be coincidental. This is because tagging errors are of varying significance. Whereas the distinction between proper names and nouns is of relatively little importance in our grammar, erroneously tagged verbs may lead to all verbal dependents being incorrectly attached.

2 binary vs. probabilistic: shows whether dependency relations are modified with a (pseudo-)probability or not; nomorph vs. automorph vs. goldmorph: shows whether the test set contains morphological information, and if yes, whether it is automatically extracted (GERTWOL) or taken from the gold standard; goldPOS vs. autoPOS: shows whether the test set uses part-of-speech tags from the gold standard or from an automatic tagger (TreeTagger).

Table 1: Pro3GresDE precision and recall of the parser over all dependency relations

                        binary,    probabilistic, probabilistic, probabilistic, probabilistic,
                        nomorph,   nomorph,       automorph,     goldmorph,     automorph,
                        goldPOS    goldPOS        goldPOS        goldPOS        autoPOS
precision               68.7       82.9           86.5           88.6           81.5
recall                  65.3       79.0           81.2           82.6           75.5
F1                      67.0       80.9           83.8           85.5           78.4
speed (sentences/sec)   3.3        5.1            7.0            10.9           6.4

Table 2: Pro3GresDE: F1 for selected dependency relations (SUBJ: subject; OBJA: accusative object; OBJD: dative object; PRED: predicate; GMOD: genitive modifier; APP: apposition; PP: prepositional phrase as adjunct; OBJP: PP as complement)

        binary,    probabilistic, probabilistic, probabilistic, probabilistic,
        nomorph,   nomorph,       automorph,     goldmorph,     automorph,
        goldPOS    goldPOS        goldPOS        goldPOS        autoPOS
SUBJ    51.9       83.3           89.1           93.0           82.5
OBJA    25.2       64.6           81.2           90.0           74.2
OBJD    8.4        29.2           63.8           81.8           56.3
PRED    19.0       67.2           67.4           69.6           58.8
GMOD    50.0       62.6           86.2           94.2           81.0
APP     40.0       72.9           76.3           79.5           66.9
PP      53.8       70.0           69.6           70.4           64.0
OBJP    20.7       70.2           69.4           70.0           61.3

7 Comparing Pro3GresDE to MaltParser and MSTParser

While Pro3GresDE does not quite reach the performance that has been reported for other parsers, a comparison based on different corpora, evaluation scripts etc. is of little relevance. When parsing the test set described above with MaltParser and MSTParser, the two parsers obtained considerably lower scores than in CoNLL-X. This performance drop was to be expected due to differences in the test setting. Most importantly, the parser input has more noise, with automatically assigned POS-tags and sentence boundaries instead of the gold ones.
Another factor that might explain the relatively low performance of MaltParser and MSTParser is that both parsers, albeit trained on TüBa-D/Z, were not specifically tuned for best results on this corpus. For MaltParser, we used CoNLL-X settings, including pseudo-projective parsing (Nivre & Nilsson, 2005). The latter led to a 10 percentage point increase in recall, albeit with a 5 percentage point loss in precision. We have used standard settings for MSTParser, with the only exception that we chose the non-projective parsing algorithm, which in preliminary tests outperformed the projective one.

Table 3: Parser performance (total results and F1 scores of selected grammatical relations)

            Pro3GresDE   MaltParser   MSTParser
Precision   81.5         79.1         78.9
Recall      75.5         79.5         76.5
F1          78.4         79.3         77.7
SUBJ        82.5         77.1         75.5
OBJA        74.2         64.8         65.8
OBJD        56.3         29.4         31.4
GMOD        81.0         69.6         71.5
ADV         66.8         76.2         79.0

The fact that Pro3GresDE was developed on TüBa-D/Z puts it at an advantage in this evaluation. It is unclear how evaluation results would be affected by more neutral test data, with the training data staying the same. Versley, when comparing a rule-based WCDG parser and a statistical PCFG one, found that both were “equally sensitive to text type variation” (Versley, 2005). When using another treebank for both training and testing, we expect the statistical parsers to have the advantage. They can be retrained on a portion of the same treebank they are evaluated on, while rule-based parsers require an (often lossy) mapping between the different dependency representations (Schneider et al., 2007). Regarding general performance, we can observe that the total F-scores of MaltParser, MSTParser and Pro3GresDE seem very similar, with the gap between the best-performing and the worst-performing system being 1.6 percentage points. On closer inspection, however, the results of the parsers are clearly different. The variance between parsers is greater when considering precision and recall instead of the F-score. Pro3GresDE achieves the highest precision, but the lowest recall, while MaltParser features a recall that is slightly higher than its precision. A recall lower than precision indicates that the parser predicted more unattached tokens (root nodes) than exist in the gold set, and vice versa. Parser performance also varies when analysing single dependency relations. From this point of view, Pro3GresDE has some clear strengths and weaknesses. For the dependency relation ADV, the performance of Pro3GresDE is more than 10 percentage points worse than that of MSTParser (66.8% and 79.0% F-score, respectively). This is mainly due to the fact that the Pro3GresDE grammar attaches adverbs to the finite verb if possible, without considering all possible heads. This leads to a high number of tokens that are correctly identified as adverbs, but attached to the wrong head (which accounts for 60-70% of the errors). 3

3 A full disambiguation of adverb attachment can be computationally expensive with our approach. So far, we chose to focus on other syntactic relations.

Table 4: Parse time (for 3000 sentences) and speed (in sentences per second)

Parser        Time    Speed
Pro3GresDE    467s    6.4
MaltParser    604s    5.0
MSTParser     438s    6.8

On the other hand, Pro3GresDE performs better than the machine learning systems when it comes to the grammatical function of noun phrases.
For the dependency relations SUBJ, OBJA, OBJD and GMOD, Pro3GresDE outperforms the other parsers by 5 to 25 percentage points. This is possible through the inclusion of automatically extracted morphological information. The high ambiguity of a morphological analysis, which is only resolved in the parsing process itself through agreement rules, makes it unlikely that machine learning systems can successfully integrate this data to improve parsing performance. Hence, we consider the ability to use highly ambiguous morphological information to increase parser performance an advantage of our rule-based system. MSTParser was the fastest parser in our evaluation, parsing our test set 40% faster than MaltParser and 30% faster than Pro3GresDE. These differences are small, however, compared to the factor 30 speed difference reported in Versley (2005). Parser settings and the training set are likely to have a bigger effect on parsing speed than parser choice. 4 Still, all three parsers reach a parsing speed of several sentences per second and are thus applicable for large-scale parsing tasks. 8 Conclusions In summary, Pro3GresDE achieves competitive results, both in terms of efficiency and performance. Used in combination with GERTWOL, it outperforms MaltParser and MSTParser in the prediction of central grammatical relations such as subjects and objects, a property which makes the parser a suitable choice for tasks relying on this information. Future research will include an extension of statistical disambiguation rules and integration of additional linguistic resources to further improve parser performance. 4 Specifically, parsing time of MaltParser depends on the number of support vectors, which grows with training set size. For MSTParser, parsing time is independent of training set size, but depends on the number of features used (Joakim Nivre, personal communication, March 24, 2009). A New Hybrid Dependency Parser for German 123 References Buchholz, S. & Marsi, E. (2006). CoNLL-X Shared Task on Multilingual Dependency Parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pages 149-164, New York City. Association for Computational Linguistics. Collins, M. & Brooks, J. (1995). Prepositional Attachment through a Backed-off Model. In Proceedings of the Third Workshop on Very Large Corpora, Cambridge, MA. Foth, K. A. (2005). Eine umfassende Constraint-Dependenz-Grammatik des Deutschen. University of Hamburg. Foth, K. A., Daum, M., & Menzel, W. (2004). A Broad-coverage Parser for German Based on Defeasible Constraints. In KONVENS 2004, Beiträge zur 7. Konferenz zur Verarbeitung natürlicher Sprache, Vienna, Austria. Haapalainen, M. & Majorin, A. (1995). GERTWOL und Morphologische Disambiguierung für das Deutsche. In Proceedings of the 10th Nordic Conference of Computational Linguistics. University of Helsinki, Department of General Linguistics. Hall, J. & Nivre, J. (2008). A Dependency-Driven Parser for German Dependency and Constituency Representations. In Proceedings of the ACL 2008 Workshop on Parsing German, pages 47-54, Columbus, Ohio. Haverinen, K., Ginter, F., Pyysalo, S., & Salakoski, T. (2008). Accurate conversion of dependency parses: targeting the Stanford scheme. In Proceedings of Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008), Turku, Finland. Kübler, S. (2006). How Do Treebank Annotation Schemes Influence Parsing Results? Or How Not to Compare Apples And Oranges. In N. Nicolov, K. Boncheva, G. Angelova, & R. 
Mitkov, editors, Recent Advances in Natural Language Processing IV: Selected Papers from RANLP 2005, Amsterdam. John Benjamins. Kübler, S. (2008). The PaGe 2008 Shared Task on Parsing German. In Proceedings of the Workshop on Parsing German, pages 55-63, Columbus, Ohio. Association for Computational Linguistics. Kübler, S., Hinrichs, E. W., & Maier, W. (2006). Is it Really that Difficult to Parse German? In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP), Sydney, Australia. McDonald, R., Pereira, F., Ribarov, K., & Hajiˇ c, J. (2005). Non-Projective Dependency Parsing using Spanning Tree Algorithms. In Proceedings of HLT-EMNLP. Nivre, J. & Nilsson, J. (2005). Pseudo-Projective Dependency Parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 99-106, Ann Arbor, Michigan. Association for Computational Linguistics. Nivre, J., Hall, J., & Nilsson, J. (2006). MaltParser: A Data-Driven Parser-Generator for Dependency Parsing. In Proceedings of LREC, pages 2216 -2219, Genoa, Italy. Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., & Yuret, D. (2007). The CoNLL 2007 Shared Task on Dependency Parsing. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 915-932. Rehbein, I. & van Genabith, J. (2008). Why is It so Difficult to Compare Treebanks? TIGER and TüBa- D/ Z Revisited. In Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories, Bergen, Norway. 124 Rico Sennrich, Gerold Schneider et al. Schmid, H. (1994). Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the Conference on New Methods in Language Processing, Manchester, UK. Schneider, G. (2008). Hybrid Long-Distance Functional Dependency Parsing. Doctoral Thesis, Institute of Computational Linguistics, University of Zurich. Schneider, G., Kaljurand, K., Rinaldi, F., & Kuhn, T. (2007). Pro3Gres Parser in the CoNLL Domain Adaptation Shared Task. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, pages 1161-1165, Prague. Telljohann, H., Hinrichs, E. W., & Kübler, S. (2004). The TüBa-D/ Z Treebank: Annotating German with a Context-Free Backbone. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, Lisbon, Portugal. Versley, Y. (2005). Parser Evaluation Across Text Types. In Fourth Workshop on Treebanks and Linguistic Theories (TLT), Barcelona, Spain. Dependenz-basierte Relationsextraktion mit der UIMA-basierten Textmining-Pipeline UTEMPL * Jannik Strötgen 1,2 , Juliane Fluck 1 und Anke Holler 3 1 Fraunhofer Institut für Algorithmen und wissenschaftliches Rechnen SCAI juliane.fluck@scai.fraunhofer.de 2 Institut für Informatik, Ruprecht-Karls-Universität Heidelberg jannik.stroetgen@informatik.uni-heidelberg.de 3 Seminar für Deutsche Philologie, Georg-August-Universität Göttingen anke.holler@phil.uni-goettingen.de Zusammenfassung Der Artikel beschreibt das UIMA-basierte Textmining-System UTEMPL, das für die spezifischen Anforderungen der Verarbeitung biomedizinischer Fachliteratur entwickelt wurde. Anhand der sog. Protein-Protein-Interaktionen wird dargestellt, wie dieses flexible und modular aufgebaute System zur Relationsextraktion genutzt werden kann. Eine Evaluierung anhand verschiedener Korpora zeigt, dass ein linguistisch motivierter, dependenzbasierter Ansatz zur Relationsextraktion in seiner Leistungsfähigkeit einem einfachen Pattern-Matching-Ansatz meist überlegen ist. 
1 Einleitung In aktiven Forschungsbereichen wie der Biomedizin liegen neue Erkenntnisse vor allem als unstrukturierte Textdaten vor. Zugleich wächst die Zahl der wissenschaftlichen Publikationen exponentiell (Zhou & He, 2008). Eine rein stichwortbasierte Suche ist kaum noch ausreichend, um in den vorhandenen großen Beständen biomedizinischer Fachliteratur neu gewonnenes Wissen zu identifizieren. Zugleich stellt die Biomedizin domänenspezifische Anforderungen, da in den Dokumenten sowohl nicht-eindeutige Gen- und Proteinnamen als auch domänenspezifische Relationen zwischen verschiedenen Namensentitäten, wie z.B. Protein-Protein-Interaktionen (PPIs), erkannt werden müssen. Vor diesem Hintergrund gewinnt der Einsatz leistungsfähiger Textmining (TM)-Verfahren in der biomedizinischen Forschung zunehmend an Bedeutung. Im vorliegenden Artikel beschreiben wir eine modular aufgebaute, flexible Softwareumgebung, die die Kombination verschiedener TM-Komponenten erlaubt, und zeigen, wie diese zur Extraktion von PPIs aus fachwissenschaftlichen biomedizinischen Texten eingesetzt werden kann. Der Artikel ist folgendermaßen strukturiert: Im nachfolgenden Abschn. 2 stellen wir die für die biomedizinische Domäne entwickelte TM-Pipeline UTEMPL vor. Dabei gehen wir auf die Architektur UIMA * Erschienen in: C. Chiarcos, R. Eckart de Castilho, M. Stede (Hrsg.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, S. 125-136. 126 Jannik Strötgen, Juliane Fluck, Anke Holler und auf die Anforderungen der biomedizinischen Domäne ein. Abschn. 3 widmet sich Methoden der Relationsextraktion und Abschn. 4 präsentiert einen neuen, in UTEMPL realisierten Ansatz zur Extraktion von PPIs. Abschn. 5 diskutiert die auf verschiedenen Korpora erzielten Evaluationsergebnisse. 2 Die Textmining-Pipeline UTEMPL Das UIMA-basierte System UTEMPL wurde mit dem Ziel entwickelt, TM-Aufgaben mit wechselnden Anforderungen im Bereich der biomedizinischen Domäne zu lösen. Ein klarer Vorzug der erstellten TM-Pipeline liegt in ihrem modularen Aufbau, so dass einzelne Komponenten unaufwändig angepasst bzw. ausgetauscht werden können, falls (i) neue, bessere Komponenten zur Verfügung stehen oder (ii) die Anforderungen sich geändert haben. 2.1 UIMA UTEMPL basiert auf der frei verfügbaren, plattformunabhängigen Architektur und Software-Umgebung UIMA (Unstructured Information Management Architecture). Diese ermöglicht es, unstrukturierte Daten verschiedener Art (Text, Audiodaten, Bilder) zu verarbeiten 1 , und erlaubt zudem, verschiedene Komponenten und Suchtechnologien derart zu verknüpfen, dass eine Pipeline von interagierenden Tools entsteht. Dadurch, dass alle Komponenten der Pipeline auf eine gemeinsame Datenstruktur zugreifen, die sog. Common Analysis Structure (CAS), können auch Werkzeuge verbunden werden, die zunächst nicht für ein Zusammenspiel entwickelt wurden. Eine UIMA-Pipeline besteht aus Komponenten der folgenden drei Arten: mindestens einem Collection Reader, einer Analysis Engine und einem CAS Consumer. Der Collection Reader (CR) liest Daten, wie Textdokumente, von einer Datenquelle (Datenbank, Filesystem etc.) ein. Durch den CR wird festgelegt, wie über die einzelnen Dokumente iteriert wird und welche Teile der Dokumente weiter analysiert werden sollen. Der CR erstellt zu diesem Zweck für jedes Input-Dokument ein CAS-Objekt, das neben dem in unserem Fall wichtigen Dokumententext zusätzlich Metadaten enthalten kann. 
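Zur Veranschaulichung des hier beschriebenen Datenflusses (ein Collection Reader erzeugt pro Dokument ein CAS-Objekt, Analysis Engines reichern es mit Annotationen an, ein CAS Consumer wertet es aus) folgt eine stark vereinfachte, rein hypothetische Skizze in Python. Sie verwendet bewusst nicht die echte, Java-basierte UIMA-API; alle Klassen-, Funktions- und Typnamen sind frei gewählt.

```python
class CAS:
    """Stark vereinfachtes Gegenstück zur Common Analysis Structure:
    Dokumenttext plus Metadaten plus eine Liste von Annotationen."""
    def __init__(self, text, metadata=None):
        self.text = text
        self.metadata = metadata or {}
        self.annotations = []        # z.B. (Typ, Start, Ende, Merkmale)


def collection_reader(documents):
    """Erzeugt pro Eingabedokument ein CAS-Objekt (vgl. Collection Reader)."""
    for doc_id, text in documents.items():
        yield CAS(text, metadata={"id": doc_id})


def sentence_engine(cas):
    """Eine Analysis Engine: sehr naive Satzgrenzenerkennung."""
    start = 0
    for i, ch in enumerate(cas.text):
        if ch == ".":
            cas.annotations.append(("Sentence", start, i + 1, {}))
            start = i + 1


def protein_engine(cas):
    """Eine weitere Analysis Engine: primitive Namenserkennung per Wortliste."""
    for name in ("IRS-1", "calnexin"):
        pos = cas.text.find(name)
        if pos >= 0:
            cas.annotations.append(("Protein", pos, pos + len(name), {"name": name}))


def cas_consumer(cas):
    """CAS Consumer: gibt die gesammelten Analysis Results aus."""
    print(cas.metadata["id"], cas.annotations)


docs = {"doc1": "IRS-1 interacts with the insulin receptor. More text."}
for cas in collection_reader(docs):
    for engine in (sentence_engine, protein_engine):   # Pipeline von AEs
        engine(cas)
    cas_consumer(cas)
```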
Diese CAS-Objekte werden an die Analysis Engines weitergereicht. Die Analysis Engines (AEs) sind die Bestandteile der UIMA-Pipeline, die das jeweilige Dokument analysieren, Informationen finden und annotieren. Die erste AE einer Pipeline bekommt das CAS-Objekt direkt vom CR, spätere AEs von der jeweils vorigen. Die von AEs gefundenen oder abgeleiteten Informationen, die sog. Analysis Results, beinhalten typischerweise Metainformationen über den Inhalt des Dokuments. AEs können sowohl auf den Dokumententext innerhalb des CAS-Objektes zugreifen als auch auf die Analysis Results voriger AEs 2 . Der CAS Consumer ist das letzte Glied innerhalb der UIMA Pipeline und führt die abschließende Verarbeitung des CAS-Objektes durch. Anders als ein CR oder eine AE fügt der CAS Consumer dem CAS-Objekt keine weiteren Metainformationen hinzu. Typische Aufgaben eines CAS Consumers sind stattdessen, relevante Elemente aus dem CAS-Objekt zu extrahieren oder zu visualisieren, einen Suchindex für den Inhalt der CAS-Objekte aufzubauen oder mit Hilfe eines Goldstandards eine Evaluierung vorzunehmen. 1 UIMA wurde von IBM entwickelt ( http: / / www.research.ibm.com/ UIMA/ ); seit 2006 wird die Entwicklung bei der Apache Software Foundation ( http: / / incubator.apache.org/ uima/ ) als Open-Source-Projekt fortgeführt. 2 Bspw. Informationen über die Sätze eines Dokuments mit Positionsangaben. Dependenz-basierte Relationsextraktion mit UTEMPL 127 2.2 Anforderungen an die Verarbeitung biomedizinischer Texte Die Entwicklung von TM-Komponenten für die Extraktion von biomedizinischer Information muss der Tatsache Rechnung tragen, dass die biomedizinische Sprache eine Subsprache darstellt 3 und daher durch Besonderheiten im Bereich der Lexik und der Syntax gekennzeichnet ist (Grishman, 2001). So enthält das biomedizinische Vokabular vor allem nominale Ausdrücke, wie Protein- und Gennamen oder Bezeichnungen für biomedizinische Verfahren, sowie verbale Lexeme zur Beschreibung relationaler Beziehungen, wie z.B. to inhibit, to phosphorylate oder to bind (Cohen et al., 2008). Hinzu kommen terminologische Besonderheiten, wodurch Lexeme eine andere Bedeutung aufweisen als in der Standardsprache. Bspw. deutet das Verb to associate in der biomedizinischen Subsprache i.d.R. auf eine Interaktion im Sinne von binding hin (Friedman et al., 2002). Im Bereich der Syntax ist die biomedizinische Subsprache zum Einen durch häufig auftretende Passivierungen geprägt, die als verbale (A was activated by B), als adjektivische (B-activated A) und als nominale (A activation by B) Formen vorkommen. Zum Anderen lassen sich vor allem syntaktische Muster der Form Protein-Interaktionsverb-Protein beobachten. Aus den genannten Eigenschaften der biomedizinischen Subsprache ergeben sich spezifische Probleme bei der Verarbeitung biomedizinischer Textbestände, insbesondere weil gängige NLP- Werkzeuge vorrangig für die Verarbeitung standardsprachlicher Dokumente entwickelt und bzgl. standardsprachlicher Korpora evaluiert wurden. Beispielsweise werden überdurchschnittlich viele Lexeme bei der Verarbeitung biomedizinischer Texte nicht erkannt. Dies betrifft named entities wie Gen- und Proteinnamen oder Verben, die für die Domäne typische Interaktionen beschreiben (downregulate, upregulate etc.). Wie Wermter et al. (2005), Lease & Charniak (2005) und Cohen et al. (2008) unabhängig voneinander zeigen, können nur domänenspezifische Anpassungen der jeweiligen NLP-Werkzeuge zu besseren Verarbeitungsergebnissen führen. Die Pipeline UTEMPL, die im nächsten Abschn. 
beschrieben wird, ermöglicht es, solche Anpassungen mit angemessenem Aufwand vorzunehmen. 2.3 Der Aufbau von UTEMPL Die UIMA-basierte Pipeline UTEMPL, deren Komponenten in Abb.1 dargestellt sind, verknüpft verschiedene existierende TM-Werkzeuge in integrativer Weise, um den Belangen der biomedizinischen Domäne gerecht zu werden. Insbesondere können verschiedene Arten von Korpora von UTEMPL verarbeitet werden. UTEMPL nutzt die im Abschn. 2.1 beschriebenen Komponenten der UIMA-Architektur. Als Collection Reader stehen in UTEMPL mehrere Komponenten zur Verfügung, die jeweils unterschiedliche Formate der Eingabetexte verarbeiten. Beispielsweise wurde ein CR für Medline-Abstracts 4 eingebunden. 5 Zusätzlich sind CRs für Volltexte und Goldstandard- Korpora implementiert. So ist gewährleistet, dass alle in UTEMPL verknüpften Werkzeuge nach dem Einlesen der Daten unabhängig von der Korpusquelle verwendet werden können. 3 Unter einer Subsprache versteht man eine besondere, spezialisierte Form einer natürlichen Sprache, die in einer bestimmten Domäne oder einem bestimmten Fachgebiet verwendet wird. 4 Medline ist eine Datenbank für biomedizinische Publikationen mit über 18 Millionen Eintragungen, die neben Informationen zum Autor oder zur Zeitschrift zumeist das Abstract einer Publikation enthalten. Medline ist frei über das Interface PubMed ( http: / / www.ncbi.nlm.nih.gov/ pubmed/ ) zugänglich. Da Medline eine der Hauptquellen für die Recherche im biomedizinischen Bereich ist, steigt neben der Zahl an Eintragungen auch die Zahl der Suchanfragen rapide an. 5 Dieses Werkzeug wurde am Language and Information Engineering Lab der Universität Jena ( http: / / www. julielab.de ) entwickelt. 128 Jannik Strötgen, Juliane Fluck, Anke Holler ! " # $ % &' # $ & ( ! ! "# ! # $% & $% & ' & ( # % ' ' ' Abbildung 1: Relevante, in UTEMPL integrierte Komponenten mit ihren Aufgaben Alle Anwendungen, die die für die Relationsextraktion nötige Vorverarbeitung der Texte übernehmen, sind als Analysis Engines eingebunden. Dazu zählen Komponenten zur Satzgrenzenerkennung, zur Tokenisierung, zum Parsing und zur Named Entity Recognition. Auf Grund der UIMA- Architektur können in UTEMPL bereits existierende, speziell für die biomedizinische Domäne entwickelte Werkzeuge problemlos genutzt werden. 6 Ebenfalls als AEs sind die Komponenten zur Relationsextraktion realisiert. Wie im folgenden Abschn. genauer dargestellt wird, wurden zwei Ansätze, ein Pattern-Matching-Ansatz und ein dependenz-basierter Ansatz, umgesetzt und vergleichend evaluiert. Für die Verarbeitung der Analyseergebnisse enthält UTEMPL verschiedene CAS Consumer. Bspw. sind CAS Consumer für die Visualisierung, für den Aufbau eines Lucene-Index sowie für die Evaluierung realisiert. Ein Vorzug der beschriebenen UIMA-basierten Pipeline UTEMPL ist, dass die Evaluierung vollständig innerhalb der Pipeline vollzogen werden kann, wenn der Collection Reader für den Goldstandard sowie der CAS Consumer für die Evaluation gemeinsam verwendet werden. 3 Relationsextraktion In diesem Abschnitt werden zunächst drei methodische Ansätze zur Extraktion von Relationen in unstrukturierten Daten eingeführt. Danach werden zwei in UTEMPL integrierte Komponenten zur Extraktion von PPIs diskutiert. 3.1 Existierende methodische Ansätze Die einfachste Methode zur Relationsextraktion stellen auf Kookkurrenz beruhende Ansätze dar, bei denen innerhalb eines gewählten Fensters (z.B. 
Phrasen, Sätze, Satzpaare, Abschnitte bis hin zu vollständigen Dokumenten) gemeinsam auftretende Entitäten erfasst werden (Ding et al., 2002). Diesem Vorgehen liegt die Hypothese zugrunde, dass gemeinsam vorkommende Entitäten in irgendeiner Weise miteinander in Beziehung stehen. Ein Vorteil von Kookkurrenzansätzen ist, dass sie sehr effizient sind, da durch ihre Einfachheit nahezu keine Vorverarbeitung der Texte nötig ist. Zudem wird durch diese Ansätze ein hoher Recall erreicht, jedoch führt die fehlende Analyse der syntaktischen und/ oder semantischen Beziehungen zu einer geringen Precision. Eine zweite gängige Methode zur Bestimmung von Relationen zwischen Entitäten ist das Pattern- Matching. Bei diesem Verfahren wird zumeist regelbasiert mit Hilfe regulärer Ausdrücke in Texten 6 Bspw. ist der Julie Sentence Boundary Detector (Tomanek et al., 2007) frei wählbar, und für die Namenserkennung stehen die am Fraunhofer Institut SCAI entwickelten NER-Werkzeuge, wie z.B. der ProMiner (Hanisch et al., 2005), zur Verfügung. Dependenz-basierte Relationsextraktion mit UTEMPL 129 nach definierten Mustern gesucht, wobei diese Muster so spezifisch wie nötig und so allgemein wie möglich sein sollten. Dies ist nicht immer leicht umzusetzen, zumal einerseits desto mehr Muster benötigt werden, je näher diese an der syntaktischen Variation des Textes bleiben (McNaught & Black, 2006), andererseits aber kleine Regelsets zur Beschreibung der Muster leichter zu pflegen sind als große (Plake et al., 2005). Im Vergleich zu Kookkurrenzansätzen ist das Pattern Matching bei der Verarbeitung nur geringfügig zeitintensiver. Allerdings sind die Definition der Muster und die Regelerstellung zeit- und arbeitsaufwändig, setzen i.d.R. Domänenwissen voraus und müssen für neue Domänen jeweils angepasst werden. Pattern-Matching-Ansätze zielen auf eine Erhöhung der Precision, gehen daher aber oft mit einer Verschlechterung des Recalls einher. Eine dritte Methode zur Relationsextraktion stellen Ansätze dar, die auf einer tiefen linguistischen Analyse beruhen, was einen erhöhten, auch zeitlichen Aufwand für die Vorverarbeitung der Dokumente erfordert, da die Texte syntaktisch analysiert werden müssen. Dies kann durch flaches (oberflächliches) oder tiefes (vollständiges) Parsing geschehen. Relationen zwischen einzelnen Einheiten im Satz werden durch Regeln beschrieben, die auf die syntaktische Struktur rekurrieren. Dieses methodische Vorgehen zeichnet sich durch gute Recall- und Precision-Ergebnisse aus (Fundel et al., 2006). 3.2 Relationsextraktion mit UTEMPL In UTEMPL sind zwei selbst entwickelte Komponenten zur Extraktion von PPIs integriert worden: der sog. Pattern Relation Finder und der sog. Dependency Relation Finder. Während die erste Komponente den Pattern-Matching-Ansatz umsetzt, aber auch in der Lage ist, einfache Kookkurrenzen auszugeben, basiert die zweite Komponente auf einer syntaktischen Analyse von Dependenzrelationen. Für beide Komponenten wurden Listen mit domänenspezifischen nominalen und verbalen Lexemen zur Beschreibung von Relationen (Interaktionen) zwischen Entitäten erstellt. Die Lexeme wurden jeweils semantischen Kategorien zugeordnet. Die erstellten Listen (VERB4INT und NOUN4INT) können in die Analysis Engines für die Relationsextraktion als Ressource geladen werden. Die Entwicklung der Regeln für beide Ansätze erfolgte auf der Basis eines annotierten Korpus, dem AIMed Corpus 7 . 
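Das Grundprinzip eines solchen Pattern-Matching-Ansatzes lässt sich mit einer minimalen, hypothetischen Skizze verdeutlichen: Erkannte Entitäten werden im Satz durch Platzhalter ersetzt, und ein regulärer Ausdruck sucht nach dem Muster Protein-Interaktionsverb-Protein. Die kleine Verbliste steht nur stellvertretend für VERB4INT; die tatsächlichen Regeln des Pattern Relation Finders sind deutlich umfangreicher und anders formuliert.

```python
import re

# Ausschnitt einer hypothetischen VERB4INT-Liste (Interaktionsverben).
VERB4INT = ["interacts with", "binds to", "binds", "activates", "phosphorylates"]


def find_ppi(sentence, entities):
    """Sucht Protein-Interaktionsverb-Protein-Muster zwischen erkannten Entitäten."""
    # Entitäten durch Platzhalter ersetzen, damit das Muster einfach bleibt.
    marked = sentence
    for e in entities:
        marked = marked.replace(e, "<PROT>")
    verb_alt = "|".join(re.escape(v) for v in sorted(VERB4INT, key=len, reverse=True))
    pattern = re.compile(r"<PROT>\s+(?:%s)\s+(?:\w+\s+){0,3}?<PROT>" % verb_alt)
    hits = []
    if pattern.search(marked):
        # Vereinfachung: bei einem Treffer wird das erste Entitätenpaar gemeldet.
        hits.append((entities[0], entities[1]))
    return hits


sent = "IRS-1 interacts with the insulin receptor in these cells."
print(find_ppi(sent, ["IRS-1", "insulin receptor"]))  # [('IRS-1', 'insulin receptor')]
```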
Dieses Korpus wurde mittels des Fisher-Yates-Algorithmus reproduzierbar in ein Trainings- und ein Evaluierungsset unterteilt. 8 Das Trainingsset beinhaltet 1564 Sätze (80%) mit durchschnittlich 0,51 Interaktionen pro Satz. Das 391 Sätze umfassende Evaluierungsset enthält im Schnitt 0,64 Interaktionen pro Satz. Für beide methodischen Ansätze wurde eine Syntax entwickelt, die die Formulierung von Regeln außerhalb des Programmcodes erlaubt. 9 Für die Umsetzung des Pattern-Matching-Ansatzes in UTEMPL wurde in Anlehnung an Plake et al. (2005) ein Regelset erstellt, das syntaktisch alle Eigenschaften regulärer Ausdrücke aufweist. Der Pattern-Matching-Ansatz ist in UTEMPL als Analysis Engine realisiert. Im folgenden Abschn. wird detailliert erläutert, wie der dependenz-basierte Ansatz in UTEMPL realisiert wurde. 7 ftp: / / ftp.cs.utexas.edu/ pub/ mooney/ bio-data/ 8 Ein solches Vorgehen ermöglicht die Evaluierung auf Korpora, die bei der Regelentwicklung unberücksichtigt geblieben sind. 9 Anpassungen und Erweiterungen können daher entwicklerunabhängig ohne Änderung des Programmcodes vollzogen werden. 130 Jannik Strötgen, Juliane Fluck, Anke Holler 4 Dependenz-basierte Extraktion von PPIs Der dependenz-basierte Ansatz ist inspiriert durch das RelEx-System von Fundel et al. (2006). Die gemeinsame Grundidee ist, Regeln für die Extraktion von Protein-Protein-Interaktionen zu entwerfen, die sich auf die Ausgabe eines Dependency-Parsers stützen. Anders als RelEx, das mit drei sehr allgemeinen Regeln arbeitet, wird in UTEMPL versucht, für verschiedene syntaktische Phänomene spezifische Regeln zu formulieren. Auf diese Weise sollen zufällig gemeinsam auftretende Entitäten von miteinander interagierenden Entitäten unterschieden werden. 10 Es ist insgesamt zu erwarten, dass der dependenz-basierte Ansatz zumindest bzgl. der Precision einem einfachen Pattern Matching überlegen ist. Das Vorgehen bei der dependenz-basierten Relationsextraktion ist folgendermaßen: Nach der Bestimmung der Satzgrenzen und der Entitäten werden zunächst alle Sätze gesucht, die eine Kookkurrenz enthalten. 11 Diese Sätze werden dann an den Stanford Parser weitergeleitet, der ein Partof-Speech-Tagging und eine Syntaxanalyse vornimmt. Der UIMA-Wrapper des Stanford Parsers 12 wird so erweitert, dass dem Parser die durch eine NER-Anwendung gefundenen Entitäten mitgeteilt werden. Dadurch können Fehler vermieden und Verarbeitungszeit eingespart werden, da durch diese Erweiterung alle Entitäten als Eigennamen und als einzelne Token behandelt werden. Auf der Dependenzausgabe des Stanford Parsers werden die Regeln zur Extraktion von Relationen (Interaktionen) angewandt. Dabei kann sowohl auf die Relationen, die der Stanford-Parser als Typen für die Verbindungen zwischen den einzelnen Lexemen ausgibt (de Marneffe et al., 2006), als auch auf die bereits erwähnten Ressourcen VERB4INT und NOUN4INT zurückgegriffen werden. Um die einzelnen Komponenten von UTEMPL so flexibel wie möglich zu gestalten, wird der Dependency Relation Finder als eine eigene Analysis Engine entwickelt, die neben den Ressourcen für die Interaktionen ausdrückende Lexeme auf Annotationen im CAS zurückgreift. Zu diesen Annotationen gehören die Ausgaben einer NER Anwendung, des Stanford Parsers sowie des als AE entwickelten Cooccurrence Finders. Als weitere Ressource werden dem Dependency Relation Finder die erarbeiteten Regeln übergeben. 4.1 Regelsyntax Zur Verdeutlichung der Syntaxeigenschaften der Regelsprache dient Abb. 
2, in der die Dependenzausgaben des Stanford Parsers für die relevanten Bereiche zweier Beispielsätze aus dem AIMed Corpus in Baumstruktur dargestellt sind. Es werden folgende vier Entitätenpaare betrachtet, wobei die jeweilige Tokennummer der Entitäten in Klammern angegeben ist. Aufgrund der Übergabe der Entitäten durch den Namenserkenner behandelt der Dependency-Parser auch mehrwortige Entitäten als ein Token. A: calnexin (7) und calreticulin (9) in S1 B: calnexin (7) und Glut 1 (12) in S1 C: calreticulin (9) und Glut 1 (12) in S1 D: IRS-1 (23) und insulin receptor (26) in S2 10 Darüber hinaus sollte die lineare Distanz zwischen interagierenden Entitäten innerhalb eines Satzes von geringer Bedeutung sein, zumindest solange der Dependency-Parser den jeweiligen Satz weitgehend korrekt analysieren kann. 11 So wird vermieden, dass der Dependency-Parser Sätze analysiert, in denen aufgrund fehlender Entitäten keine Relationen gefunden werden können. 12 Dieser ist Teil des UIMA bioNLP Toolkits, http: / / bionlp-uima.sourceforge.net . Dependenz-basierte Relationsextraktion mit UTEMPL 131 ! "# $# # # # # # # # ###### ###### # % # ! "#%%%# # # # # # # % Abbildung 2: Dependenzausgabe des Stanford Parsers für Beispiele aus dem AIMed Corpus (Wörter mit Tokennummern in Ellipsen; Relationen in Rechtecken; Entitäten fett) Im ersten Beispielsatz (S1) sollen B und C, nicht jedoch A als Interaktionen erkannt werden, in S2 soll D als Interaktion extrahiert werden. Ein wichtiges Charakteristikum ist der gemeinsame Elternknoten, der entweder eine der Entitäten ist (A, D) oder ein anderes Wort (B, C). Diese Unterscheidung führt zu zwei Regelmengen, die e1isCP und otherCP genannt werden. Das Ziel der Syntax ist, dass mit Befehlen Bedingungen an den Dependenzbaum gestellt werden können. Die Positionen (Knoten) können von den Entitäten (E1, E2) und vom gemeinsamen Elternknoten (CP) aus angesprochen werden. Die Befehle sind CP, e1CP-1, e1CP-2, e2CP-1 und e2CP-2 (1 bzw. 2 unterhalb CP Richtung E1 bzw. E2) sowie pE1, gpE1, ggpE1 und gggpE1 (1, 2, 3 bzw. 4 oberhalb von E1). Als Attribute erhalten die Befehle einen String, ein Lexem aus INOUN4INT (INOUN) oder aus IVERB4INT (IVERB). Diese Attribute enthält auch der Befehl anywhere, der alle Positionen zwischen E1 bzw. E2 und CP überprüft. Die Relationen (Pfade) zwischen den Entitäten und CP oder dem Wurzelknoten werden mit e1RelToCP, e2RelToCP und e1RelToRoot angesprochen und enthalten Relationsangaben (z.B. nsubj, dobj), die der Dependency-Parser ausgibt. Mit e1RelType und e2RelType kann die Genauigkeit der Relationen bestimmt werden: equals, starts, ends oder contains. Außerdem können die maximalen Relationsentfernungen bestimmt (e1RelMax und e2RelMax), eine Negationsprüfung verlangt (checkNeg: yes) und eine Regel als Negation verwendet werden, wodurch nicht interagierende Entitätenpaare ausgeschlossen werden können (isNegation: yes). Mit diesen Befehlen und der Berücksichtigung der Regelgruppen otherCP und e1isCP können die Regeln B1 und B2 aus Abb. 3 geschrieben werden, die für die oben genannten Relationen B, C und D zutreffen, für A richtigerweise jedoch nicht. 4.2 Regelentwicklung Die Entwicklung der Regeln ist in einzelne Schritte aufgeteilt. Zunächst wird versucht, präzise Regeln zu schreiben, die eine hohe Precision erzielen. Dadurch können ein Regelset für hohe Precision und eines für hohe F-Score-Werte entwickelt werden. In Abb. 3 sind Regeln für einige der in den Entwicklungsschritten beschriebenen Konstruktionen aufgeführt und in Abb. 
4.2 sind die Ergebnisse der Entwicklungsschritte auf dem Trainings- und Evaluierungsset angegeben. 132 Jannik Strötgen, Juliane Fluck, Anke Holler Nr Type e1RelToCP/ e1RelToRoot e2RelToCP e1RelType e2RelType CP e2CP-1 pE1 gpE1 B1 otherCP nsubj pobj->prep ends / equals IVERB B2 e1isCP pobj->prep conj starts / equals between INOUN 1 otherCP nsubj dobj equals / eqauls IVREB 2 otherCP nsubj pobj->prep equals / eqauls IVREB with 4 otherCP nsubjpass pobj->prep equals / eqauls IVERB by 5 e1isCP pobj->prep pobj->prep starts / equals by of INOUN 8 e1isCP pobj->prep conj starts / equals between INOUN 7b otherCP pobj->prep pobj->prep equals / equals to of INOUN 9 otherCP nn pobj->prep equals / equals INOUN with Nr Type e1RelToCP/ e1RelToRoot e2RelToCP e1RelType e2RelType CP e2CP-1 isNegation 12a e1isCP nsubj->rcmod / ends IVERB yes 12b e1isCP rcmod / ends IVERB 12c otherCP nsubjpass ->xcomp ends / ends IVERB 12d otherCP nsubj ->xcomp ends / ends IVERB Nr Type e1RelToCP e1RelType anywhere e1RelMax e2RelMax 13 e1isCP IVERBorINOUN 6 14 otherCP nsubj ends IVERB 3 3 Abbildung 3: Einige der beschriebenen Regeln: Bei e1isCP Regeln wird e1RelToRoot, bei otherCP e1RelToCP verwendet; Nr entspricht der Bezeichnung der Beispiele im Text dep-1 Zunächst werden Regeln für einfache, aber häufig auftretende Relationen geschrieben, die teilweise in Abb. 3 aufgeführt sind. Die Beispiele sind Generalisierungen, es reicht aus, dass A und B der Kopf ihrer NP sind: (1) A binds B (4) A was activated by B (7) Binding of A to B (2) A interacts with B (5) Activation of A by B (8) Interaction between A and B (3) A binds to B (6) Interaction of A with B Dass bereits False Positives (FP) extrahiert werden (siehe Abb. 4.2), liegt daran, dass spekulative Interaktionen im AIMed Corpus meistens nicht annotiert sind, in UTEMPL jedoch weder ausgeschlossen noch als Negation betrachtet werden. Ein Beispiel aus dem Trainingscorpus ist: (2’) We also investigated whether A ... interacts with B ... dep-2 Für die ersten vier Regeln werden moderate Generalisierungen zugelassen. Während die Relationen zwischen E1 und CP identisch bleiben, dürfen die zwischen E2 und CP am Anfang zusätzlich eine Abkürzung, Konjunktion und Apposition sowie eine präpositionale Verknüfung oder einen Modifkator einer nominalen Komposition enthalten. Zusätzlich werden Konstruktionen mit mehreren Verben behandelt, die mit einer Konjunktion am Ende der Relation zwischen E2 und CP abgedeckt werden. Neu gefundene Konstruktionen mit Relationen zwischen A und B sind bspw.: (1b) A binds to C and interacts with B. (2b) A interacts with another protein, B ... dep-3 Je nach Kontext ist die Dependenzausgabe für Konstruktionen wie (5), (6) und (7) nicht immer wie bei S2 in Abb. 2. Stattdessen ist häufig das Interaktionsnomen der gemeinsame Elternknoten. Der Grund hierfür ist, dass der Parser eine PP-Anhängung verschieden durchführen kann. Als Beispiel ist in Abb. 3 Regel 7b angegeben. Zusätzlich werden auch für die Regeln 5, 6 und 7 Generalisierungen zugelassen, während für Regel 8 keine gefunden wurden. Dependenz-basierte Relationsextraktion mit UTEMPL 133 dep-4 Weitere häufig auftretende Konstruktionen werden berücksichtigt: (9) A interaction with B (10) A is responsible for B expression (11) A-dependent transcription of B Bei (9) ist A interaction, bei (10) B expression jeweils eine nominale Komposition. Die Regel für (9) (Abb. 3) existiert für with, to, by und of. Zusätzlich werden auch für diese Regeln Generalisierungen zugelassen. 
dep-5 Da der Recall noch immer bei unter 20% liegt, werden weitere Beschränkungen für die Regeln gelockert. Zusätzlich zu den Generalisierungen der Relationen zwischen E2 und CP werden nun auch welche zwischen E1 und CP zugelassen. Dies geschieht teilweise durch Lockerung der strikten Bindung (e1RelType: ends statt equals). Wird diese Relation gelockert, muss die zwischen E2 und CP strikt gelten. Alternativ kann die Relation zwischen E2 und CP gelockert werden, solange die zwischen E1 und CP strikt gilt. Bei den e1isCP-Regeln, die statt e1RelToCP die Relation zwischen E1 und dem Wurzelknoten betrachten (e1RelToRoot mit e1RelType: start), sind diese Änderungen nicht notwendig. Stattdessen wird erlaubt, dass zwischen E1 und der Präpositionalphrase (PP) mit dem Interaktionswort eine zusätzliche PP existieren kann. Ein Beispiel, auf das diese Änderung zutrifft, ist: (6b) Activation of the subunit of A with B Die relevanten Informationen in e1RelToRoot sind nun syntaktisch weiter entfernt von der Entität. Dadurch werden statt der ersten und der zweiten Stelle in der Relation zwischen E1 und dem Wurzelknoten (vgl. 5 in Abb. 3) nun die dritte (ggpE1 statt pE1) und die vierte Stelle (gggpE1 statt gpE1) bedeutsam. dep-6 Als nächstes werden komplexere syntaktische Formulierungen behandelt, wie Relativsätze, NcI- (Nominativus cum Infinitivo) und Modalverbkonstruktionen. Ist E2 im Relativsatz selbst in subjektivischer Verwendung, sollen die Relativsatzregeln nicht zutreffen. Deshalb werden vor den positiven (vgl. 12b in Abb. 3) negative Regeln (vgl. 12a) aufgerufen, die diese Konstruktionen filtern. Die Regeln für Infinitivkonstruktionen sind 12c und 12d in Abb. 3. Beispiele für diese komplexeren syntaktische Formulierungen sind: (12a) ... e.g. A, with which B binds C (12b) ... that A, which activates B ... (12c) A was found to interact with B (12d) A was able to interact with B Dieses Regelset wird bei der Evaluierung aller Korpora als High-Precision-Regelset (HP-Set) verwendet, um einen Eindruck zu bekommen, wie hoch Recall und Precision bei Verwendung relativ strikter Regeln sind. Das Regelset könnte auf großen Textquellen bereits zu guten Ergebnissen führen, da bei diesen die Precision oft stärker gewichtet wird als der Recall, denn es kann davon ausgegangen werden, dass Relationen mehrmals vorkommen und somit zumindest teilweise mit relativ einfachen Formulierungen. dep-7 Im letzten Schritt werden sehr allgemeine Regeln hinzugefügt, denn der Recall ist mit 32% noch niedrig und für die Optimierung des F-Scores sollten sich Precision und Recall annähern. Zunächst werden negative Regeln aufgerufen, damit möglichst viele nicht-interagierende Entitätenpaare von der weiteren Betrachtung ausgeschlossen werden. Dann folgen positive Regeln, die auf der Hypothese beruhen, dass Entitäten, die in Bezug auf ihre syntaktischen Relationen eine geringe Distanz aufweisen, miteinander interagieren, sofern zusätzlich ein Interaktionswort innerhalb dieser syntaktischen Relation auftritt. Bei den otherCP Regeln stellt sich heraus, dass für Aktivformulierungen andere Werte verwendet werden sollten als für Passivkonstruktionen. Diese betrachten IVERB und INOUN jeweils einzeln und können je nach Wert von e1RelMax verschiedene Werte für e2RelMax berücksichtigen, während für e1isCP eine sehr allgemeine Regel ausreicht (siehe Regeln 13 und 14 in Abb. 3). 
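Wie eine solche Regel auf der Dependenzausgabe ausgewertet werden kann, zeigt die folgende stark vereinfachte, hypothetische Skizze: Aus typisierten Dependenzen wird der gemeinsame Elternknoten (CP) zweier Entitäten bestimmt, und die Relationspfade zu ihm werden mit den Bedingungen einer Regel im Stil von B1 verglichen. Datenstrukturen, Namen und Regelformat sind frei gewählt und entsprechen nicht exakt der in Abschn. 4.1 beschriebenen Regelsyntax.

```python
# Typisierte Dependenzen als Kind -> (Elternknoten, Relation); die Wurzel hat Eltern None.
# Vereinfachtes Beispiel zu "calreticulin interacts with Glut 1" (vgl. S1 oben).
DEPS = {
    "calreticulin": ("interacts", "nsubj"),
    "Glut 1":       ("with", "pobj"),
    "with":         ("interacts", "prep"),
    "interacts":    (None, "root"),
}
IVERB = {"interacts", "binds", "activates"}   # Platzhalter für IVERB4INT


def ancestors(token):
    """Knoten oberhalb von token bis zur Wurzel."""
    chain, parent = [], DEPS[token][0]
    while parent is not None:
        chain.append(parent)
        parent = DEPS[parent][0]
    return chain


def rel_path(token, target):
    """Relationen entlang des Pfades von token aufwärts bis target."""
    rels, node = [], token
    while node != target:
        parent, rel = DEPS[node]
        rels.append(rel)
        node = parent
    return rels


def common_parent(e1, e2):
    """Erster gemeinsamer Knoten (CP); kann auch eine der Entitäten sein."""
    candidates = set([e2] + ancestors(e2))
    for node in [e1] + ancestors(e1):
        if node in candidates:
            return node
    return None


def matches_rule_b1(e1, e2):
    """Regel im Stil von B1: CP ist ein Interaktionsverb, E1 hängt als nsubj,
    E2 über den Pfad pobj->prep daran (z.B. "A interacts with B")."""
    cp = common_parent(e1, e2)
    if cp is None or cp in (e1, e2) or cp not in IVERB:   # nur otherCP-Fälle
        return False
    return (rel_path(e1, cp)[-1:] == ["nsubj"]
            and rel_path(e2, cp) == ["pobj", "prep"])


print(matches_rule_b1("calreticulin", "Glut 1"))   # True
print(matches_rule_b1("Glut 1", "calreticulin"))   # False (Rollen vertauscht)
```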
Abbildung 4: Stufen der Regelentwicklung für den Dependency Relation Finder auf dem AIMed Korpus: Trainingsset (Quadrate) und Evaluierungsset (Kreuze). Als Vergleich dienen die F-Scores von Cooccurrence und RelEx nach Pyysalo et al. (2008). (Aufgetragen sind Precision, Recall und F-Score in % je Regelentwicklungsschritt dep-1 bis dep-7 für Trainings- und Evaluierungsset.)

     BioInfer        HPRD50          LLL05
     P   R   F       P   R   F       P   R   F
PA   36  56  44      66  60  63      69  48  56
HP   64  25  36      95  38  54      98  41  58
DE   44  39  41      84  56  68      96  72  82
RE   39  45  41      76  64  69      82  72  77
KO   13  99  23      38  100 55      50  100 66

Abbildung 5: Evaluierungsergebnisse des UTEMPL Pattern Relation Finders (PA) sowie des Dependency Relation Finders mit High-Precision (HP) und vollständigem Regelset (DE). Als Vergleich dienen die von Pyysalo et al. (2008) angegebenen Werte für RelEx (RE) und einen Kookkurrenzansatz (KO).

Mit diesen allgemeinen Regeln werden auf dem Trainingscorpus mit dem vollständigen Regelset eine Precision von 48,7%, ein Recall von 50,7% und damit ein F-Score von 49,7% erreicht (siehe Abb. 4.2). Die Werte auf dem Validierungsset sind ähnlich, der Verlust an Precision ist von Schritt dep-6 zu dep-7 jedoch so groß, dass er nicht vollständig durch den Gewinn an Recall ausgeglichen werden kann.

5 Evaluierung

Für die Evaluierung wurden die von Pyysalo et al. (2008) analysierten Korpora (u.a. BioInfer, HPRD50 und LLL05) verwendet. Die Analyse erfolgte mit einem Kookkurrenzansatz und dem bereits erwähnten RelEx, deren Ergebnisse bei der Evaluierung der UTEMPL-Relationsextraktionsmethoden als Vergleichswerte dienen und mit den Ergebnissen in Abb. 4.2 dargestellt sind (13). Mit dem Dependency Relation Finder werden F-Score-Werte zwischen 41% und über 82% erreicht, woraus folgt, dass die Korpora sehr verschieden sind. Neben den unterschiedlichen Annotationskriterien für Interaktionen spielen vor allem die Anzahl an Entitäten und Interaktionen pro Satz eine wichtige Rolle für das Abschneiden der Systeme (Pyysalo et al., 2008). Die Gegenüberstellung mit den Ergebnissen von RelEx zeigt, dass mit UTEMPL wettbewerbsfähige Ergebnisse erzielt werden können. Auffällig ist vor allem, dass mit dem Dep-HP Regelset sehr gute Precision-Werte erzielt werden. Zusätzlich wurde mit einer Erweiterung des Pattern Relation Finders bei dem BioNLP'09 Shared Task on Event Extraction (14) teilgenommen, und es konnten wettbewerbsfähige Ergebnisse erzielt werden (Kim et al., 2009).

13 Die in den Korpora annotierten Entitäten werden von allen Systemen als gegeben angenommen, damit das Evaluationsergebnis durch Fehler der NER nicht verfälscht wird.
14 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/

6 Schlussfolgerung

In diesem Artikel haben wir die hoch flexible, UIMA-basierte Textmining-Pipeline UTEMPL präsentiert, die sich dadurch auszeichnet, dass Software-Komponenten frei kombinierbar eingebunden und Regelsets außerhalb der Software entwickelt werden können. Wir haben zudem aufgezeigt, wie UTEMPL zur Extraktion von Protein-Protein-Interaktionen aus großen Textbeständen biomedizinischer Fachliteratur eingesetzt werden kann. Insbesondere haben wir auf der Grundlage entsprechender Evaluationsergebnisse nachgewiesen, dass mit einer tiefen linguistischen dependenz-basierten Methode wettbewerbsfähige Resultate bei der Relationsextraktion erzielt werden können.
Danksagungen Wir danken Roman Klinger und Theo Mevissen für inhaltliche Diskussionen und den drei anonymen Reviewern für hilfreiche Kommentare. Literatur Cohen, K. B., Palmer, M., & Hunter, L. (2008). Nominalization and Alternations in Biomedical Language. PLoS ONE, 3(9), e3158. de Marneffe, M.-C., MacCartney, B., & Manning, C. D. (2006). Generating Typed Dependency Parses from Phrase Structure Parses. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages 449-454. Ding, J., Berleant, D., Nettleton, D., & Wurtele, E. S. (2002). Mining MEDLINE: Abstracts, Sentences, or Phrases? In Proceedings of the Pacific Symposium on Biocomputing, pages 326-337. Friedman, C., Kra, P., & Rzhetsky, A. (2002). Two Biomedical Sublanguages: A Description Based on the Theories of Zellig Harris. Journal of Biomedical Informatics, 35(4), pages 222-235. Fundel, K., Küffner, R., & Zimmer, R. (2006). RelEx - Relation Extraction Using Dependency Parse Trees. Bioinformatics, 23(3), pages 365-371. Grishman, R. (2001). Adaptive Information Extraction and Sublanguage Analysis. In Proceedings of Workshop on Adaptive Text Extraction and Mining at Seventeenth International Joint Conference on Artificial Intelligence, pages 77-79. Hanisch, D., Fundel, K., Mevissen, H.-T., Zimmer, R., & Fluck, J. (2005). ProMiner: Rule-based Protein and Gene Entity Recognition. BMC Bioinformatics, 6 Suppl 1, S14. Kim, J.-D., Ohta, T., Pyysalo, S., Kano, Y., & Tsujii, J. (2009). Overview of BioNLP’09 Shared Task on Event Extraction. In Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, pages 1-9. Lease, M. & Charniak, E. (2005). Parsing Biomedical Literature. In Second International Joint Conference on Natural Language Processing (IJCNLP’05), pages 58-69. McNaught, J. & Black, W. J. (2006). Information Extraction. In S. Ananiadou & J. McNaught, editors, Text Mining for Biology and Biomedicine, chapter 7, pages 143-177. Artech House. Plake, C., Hakenberg, J., & Leser, U. (2005). Optimizing Syntax Patterns for Discovering Protein-Protein Interactions. In Proceedings of the 2005 ACM Symposium on Applied Computing, pages 195-201. 136 Jannik Strötgen, Juliane Fluck, Anke Holler Pyysalo, S., Airola, A., Heimonen, J., Björne, J., Ginter, F., & Salakoski, T. (2008). Comparative Analysis of Five Protein-Protein Interaction Corpora. BMC Bioinformatics, 9 Suppl 3, S6. Tomanek, K., Wermter, J., & Hahn, U. (2007). Sentence and Token Splitting Based on Conditional Random Fields. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics (PACLING 2007), pages 49-57. Wermter, J., Fluck, J., Strötgen, J., Geißler, S., & Hahn, U. (2005). Recognizing Noun Phrases in Biomedical Text: An Evaluation of Lab Prototypes and Commercial Chunkers. In SMBM 2005 - Proceedings of the 1st International Symposium on Semantic Mining in Biomedicine, pages 25-33. Zhou, D. & He, Y. (2008). Extracting Interactions between Proteins from the Literature. Journal of Biomedical Informatics, 41(2), pages 393-407. 
From Proof Texts to Logic Discourse Representation Structures for Proof Texts in Mathematics * Jip Veldman 1 , Bernhard Fisseni 2 , Bernhard Schröder 2 , Peter Koepke 1 1 Rheinische Friedrich-Wilhelms-Universität Bonn, Mathematisches Institut 2 Universität Duisburg-Essen, Germanistik / Linguistik Abstract We present an extension to Discourse Representation Theory that can be used to analyze mathematical texts written in the commonly used semi-formal language of mathematics (or at least a subset of it). Moreover, we describe an algorithm that can be used to check the resulting Proof Representation Structures for their logical validity and adequacy as a proof. 1 Introduction The era of theorem provers (computer programs that try to prove that a mathematical statement is true) and proof checkers (computers programs that try to check the validity of a mathematical argument) started in the beginning of the sixties. While from the beginning (Abrahams, 1964), the aim was to process proofs as written by mathematicians, this goal was not attained. All the proof checkers developed to date only accept input written in peculiar formal languages that do not have much in common with the language that mathematicians normally use and are thus quite inaccessible or unattractive to most mathematicians. Meanwhile, computational linguistics has developed many techniques and formalisms. The aim of the Naproche project is to apply such techniques to proof checking systems; the resulting system will accept proof texts that can be read by humans as well as by machines, thus coming closer to the original goal of the pioneers in the field. The system will support wide range of applications in educational settings (teaching to write good proofs) and mathematics in general (supporting mathematicians in constructing proofs and ensuring comprehensibility). We describe in this paper the general architecture and formalism behind our approach to processing natural language mathematical texts, which relies on an extension of Discourse Representation Theory (DRT, Kamp & Reyle, 1993; for further reference on Naproche technicalities, see Cramer, 2009; Kühlwein, 2008). 1 * Erschienen in: C. Chiarcos, R. Eckart de Castilho, M. Stede (Hrsg.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, S. 137-145. 1 These papers and other material are available on the website, http: / / www.naproche.net 138 Jip Veldman, Bernhard Fisseni et al. This paper is organized as follows: Section 2 provides some background on proof checking. In Section 3, we will give some of the most striking features of the language of mathematics and describe to what extent they have been implemented in the Naproche system. In Section 4, we will describe how to extend DRT in order to reflect these features. Finally, in Section 5, we will show how we can check such a representation structure for its logical validity. 2 Proof Checking Not only computational linguistics, but also theorem proving has left infancy. During the early years, research concentrated on the development of general theorem provers. As we now know, to attain this general goal is an impossible task for reasons of computational complexity. This is one of the reasons which led to the development of proof checking; for practical purposes, the involvement of humans in proof checking makes proof checking ‘easier’ than fully automated proving. 
At the end of the sixties, Nicolas Govert de Bruijn was the first one to successfully implement a proof checker. His Automath system (de Bruijn, 1994) was the first proof checker in which a substantial mathematical text, namely Landau (1930), could be checked - after it was appropriately formalized by human operators. Today there is a large community of computer scientists developing and applying proof checking systems. Several complex mathematical theorems like the Four Colour Theorem (Gonthier, 2005, using the Coq system) and Gödel’s completeness theorem for first-order logic (Braselmann & Koepke, 2005, using the Mizar system) have been formalized and checked using a proof checking system. Thus, these programmes are attractive in their own way, but the fact that they demand peculiar input has hampered their adoption by general mathematicians; one of the aims of our approach is to allow the use of sophisticated modern provers and checkers to mathematical natural language texts. 3 The Language of Mathematics and the Language of Naproche Mathematicians use a small subset of written natural language enriched by a significant amount of formulaic notation to communicate mathematical theories. We call it the semi-formal language of mathematics (SFLM). As part of the Naproche project, a controlled natural language that captures a subset of the features of SFLM has been developed. Below, we describe the most important features of SFLM and the state of the implementation in the Naproche system. To illustrate the coverage of the system so far: The first chapter of Landau (1930) has been adapted to the Naproche language. The difficulties encountered were all solvable up to now. The coverage of Naproche is constantly being extended. The most stable core to date permits to formulate proofs in a style that looks much more like a normal SFLM proof than proofs formallized for any other proof checker. The processing of proof metastructure is already in place and quite complete, and now the focus is on extending the linguistic form of single sentences. The Naproche language is more formal than SFLM in some regards on which we comment below and more restricted than the language used by most mathematicians; this is also due to the fact that one of the uses for which Naproche is developed is teaching how to write proofs. As an example we give a proof of “ √ 2 is irrational” written in Naproche style. From Proof Texts to Logic 139 Theorem. √ 2 is irrational. Proof. Assume that √ 2 is rational. Then there are integers a, b such that a 2 = 2 · b 2 and gcd ( a, b ) = 1 . Hence a 2 is even, and therefore a is even. So there is an integer c such that a = 2 · c. Then 4 · c 2 = 2 · b 2 , 2 · c 2 = b 2 , and b is even. Contradiction. (QED) However, one has to bear in mind that SFLM is not a uniform phenomenon. The language used in textbooks is much more explicit and much more formal than the language of advanced journal articles. Therefore, Naproche language and the accepted textual structure ist much closer to the language of algebraically written textbooks than this language sometimes is to the language of highly specialised articles. An example for this are geometric argumentations (as in knot theory) and coarsegrained argumentations, which latter are typical for specialised papers. Both defy formalisation at the current state of the art. Cultural background. 
SFLM is considered so important by the community of mathematicians that mathematics students will spend a lot of time during the first months at university learning to communicate in it. However, it has not seen much attention from linguists (Eisenreich, 1998; Ranta, 1999; Zinn, 2006 being some of the few exceptions). As an example of SFLM we cite two small fragments from page 11 and 12 of Kunen (1980), a standard textbook in set theory. The first quotation shows that Zermelo-Fraenkel set theory proves that there is no universal set. Theorem. -∃ z ∀ x ( x ∈ z ) . Proof. If ∀ x ( x ∈ z ) , then, by Comprehension, form { x ∈ z : x ∈ x } = { x : x ∈ x } , which would yield a contradiction by the Russell paradox discussed above. (QED) The second quotation gives the definition of an ordered pair. . . . 〈 x, y 〉 = {{ x } , { x, y }} is the ordered pair of x and y. SFLM incorporates parts of the syntax and semantics of natural language, so that it takes over its complexity and some of its ambiguities. However, SFLM texts are distinguished from common natural language texts by several characteristics: Mixing formulas and natural language. Proofs combine natural language expressions with mathematical symbols and formulas, which can syntactically function like noun phrases, as grammatical predicates, as predicatives, as modifiers, or even as whole sentences. At the moment, Naproche supports such mixed language unless the formulaic part has scope over several sentences (which is often considered ‘bad style’) or, conversely, inside a formulaic expression, natural language fragments are used. Avoiding ambiguities. In SFML, constructions which are hard to disambiguate are generally avoided. An important example is coreference: Mathematical symbols ensure unambiguous reference, most importantly variables fulfill the task of anaphoric pronouns in other natural language. Therefore, referents are generally labelled explicitly if ambiguity could arise, as in the beginning of 140 Jip Veldman, Bernhard Fisseni et al. the first example, where it is assumed that the universal set exists and at the same time it is labelled with the variable name “z”. Therefore, Naproche treats mathematical referents differently from other referents, and also identifies free variables in formulas, as these are available as antecedents in natural language fragments. It is thus not sufficient to only semantically process formulas, but the formulaic representation must be taken into account because it retains important information for the proof representation. Variables are treated differently by Naproche depending on the place of introduction. For variables introduced explicitly (by There is an x), Naproche uses the dynamic semantics of DRS. Remember that using existentially quantified variables on the left side of an implication DRS leads to implicit universal quantification. Variables implicitly introduced in a proof are assumed to be existentially quantified, while variables introduced implicitly in theorems, lemmas and definitions are treated as universally quantified, even if the linguistic form is not that of an implication. In this, Naproche follows mathematical custom. Assumption management. It is characteristic of SFML texts that assumptions are introduced and retracted in the course of the argument. Contrary to general texts, in SFLM the scopes of assumptions are deeply nested. 
To take an example, the proof cited above is a proof by contradiction: At the beginning (in the if clause), it is assumed that the universal set z exists. All subsequent claims are relativised to this assumption. Finally, the assumption leads to a contradiction, and is retracted, concluding that the universal set does not exist. In Naproche, management of assumptions has been implemented. At the moment, assumptions can be introduced explicitly in various ways, e. g. by Assume. The scope of assumptions extends to the end of a proof (marked by QED), unless they are explicitly discharged using Thus. Extending the vocabulary. Definitions in SFLM texts add new symbols and expressions to the vocabulary and fix their meaning. Naproche supports definitions; if an expression or symbol is used before its definition, the user is informed of this mistake. Text structure and intratextual references. Mathematical texts are highly structured. At a global level, they are commonly divided into building blocks like definitions, lemmas, theorems and proofs. Inside a proof, assumptions can be nested into other assumptions, so that the scopes of assumptions define a hierarchical proof structure. Proof steps are then commonly justified by referring to results in other texts, or previous passages in the same text. In the first example two intratextual references are used: “Comprehension”, referring to the Axiom of Comprehension, and “the Russell paradox discussed above”; special identifiers representing the structure of the mathematical text, such as “Theorem 1.5” are also common. Proof structure is therefore an inherent feature of the Naproche language, including intratextual references and the corresponding reference markers (“Lemma 42” and later “by Lemma 42”). Contradiction. Proofs by contradiction are supported in Naproche; this is signalled by Contradiction at the end of the (sub)proof. From Proof Texts to Logic 141 Notation support. Syntactically, Naproche supports a wide range of mathematical formula notation, the semantics of which must be defined in the text or in a module containing background knowledge. Linguistic form of sentences. The linguistic form of sentences in SFLM is quite restricted. Naproche accepts simple declarative SFLM sentences consisting of a mix of natural language terms and formulaic notation (see above); furthermore, relative sentences with such that, which are very typical of SFLM, can be processed, as well as intratextual references (by Lemma 42). Natural language terms can be noun phrases, unary and binary predicates, consisting of a verb or alternatively a copula. At the moment, we are working toward support for adjectives and nouns with complements and argument alternation (A is equivalent to B and A and B are equivalent). Guiding the reader. Some linguistic forms used in SFLM can be safely excluded from the coverage of Naproche. The most important case are comments that guide the reader but do not contribute to the semantics of the proof. While limited coverage for such commentaries as “by Induction” in (1-1a) has been implemented, as the information can also guide the proof checking process, general coverage of such commentary is not planned because of the immense semantic complications: the commentary in (1-1b) speaks about the structure of the proof, temporarily suspending previously established variable bindings. (1) (a) Thus by induction, if x = 1 then there is a u such that x = u ′ . 2 (b) We now prove that x is unique in the formula given above. 
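As a rough illustration of the assumption management described above, the following Python sketch keeps a stack of open assumption scopes that are opened by an assumption, extended by subsequent claims, and discharged by Thus or closed at QED. It is a simplified, hypothetical rendering; class and method names are our own and do not reflect the actual Naproche implementation.

```python
# Illustrative sketch (not Naproche itself): a stack of open assumption scopes.
# "assume" opens a scope, "thus" discharges the innermost one, and "qed"
# closes whatever is still open at the end of a proof.

class AssumptionManager:
    def __init__(self):
        self.scopes = []          # stack of (assumption, claims made under it)

    def assume(self, assumption):
        self.scopes.append((assumption, []))

    def claim(self, statement):
        if self.scopes:
            self.scopes[-1][1].append(statement)   # claim is relativised to the innermost assumption
        else:
            print("top-level claim:", statement)

    def thus(self, conclusion):
        """Discharge the innermost assumption: its scope ends with this conclusion."""
        assumption, _claims = self.scopes.pop()
        self.claim(f"if {assumption} then {conclusion}")

    def qed(self):
        """At QED, all assumptions that are still open are closed implicitly."""
        while self.scopes:
            assumption, claims = self.scopes.pop()
            self.claim(f"if {assumption} then {' and '.join(claims) or 'trivially true'}")

# Example mirroring a proof by contradiction:
m = AssumptionManager()
m.assume("sqrt(2) is rational")
m.claim("a^2 = 2*b^2 and gcd(a,b) = 1")
m.claim("contradiction")
m.qed()   # prints: top-level claim: if sqrt(2) is rational then ... and contradiction
```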
4 From Discourse to Proof Representation Structures We use techniques based on Discourse Representation Theory (DRT, Kamp & Reyle, 1993) to analyze texts written in SFLM. Claus Zinn pioneered such an approach in Zinn (2003), Zinn (2004) and Zinn (2006). We call a discourse representation structure (DRS) whose structure is extended to suit mathematical discourse Proof Representation Structure (PRS). 3 A PRS has five constituents: [1] An identification number (i), [2] a list of discourse referents (d 1 , . . . , d m ), [3] a list of mathematical referents (m 1 , . . . , m n ), [4] a list of textual referents (r 1 , . . . , r p ) and [5] an ordered list of conditions (c 1 , . . . , c l ). Similar to DRSes, we can display PRSes as boxes. The following box illustrates the general structure of a PRS: i d 1 , . . . , d m m 1 , . . . , m n c 1 ... c l r 1 , . . . , r p The identification number (i above) can later be used in textual referents in intratextual and intertextual references (which are stored in the ‘drawer’ containing r 1 , . . . , r p above, and assigned using 2 This example is from the translation of Landau (1930) into the Naproche language. 3 The name, if not the structure itself, is due to Zinn’s work (Zinn, 2006). 142 Jip Veldman, Bernhard Fisseni et al. the use () condition, see below). If a PRS represents one of the building blocks mentioned above, we add a marker qualifying the type (theorem, proof, etc.) to its identification number. As in DRSes, discourse referents (d 1 , . . . , d m above) are used to identify objects in the domain of the discourse. However, the domain contains three kinds of objects: first, mathematical objects like numbers or sets, secondly, symbols and formulas that are used to make claims about mathematical objects, and finally, textual referents. Discourse referents can refer to all three kinds of objects. Mathematical referents (m 1 , . . . , m n ) are the terms and formulas which appear in the text; they must be available as their syntactic structure is exploited in mathematical reasoning and are bound using mathid () conditions (see below). In the current PRSes, free variables contained in a mathematical formula are identified and can thus be referred to in later discourse. Finally, there is an ordered list of conditions. Just as in the case of DRSes, PRSes and PRS conditions are defined recursively: let A, B be PRSes, X, X 1 , . . . , X n discourse referents, Y a mathematical referent, and Z a textual referent. Then 1. for any n-ary predicate p (e.g. expressed by adjectives and noun phrases in predicative use and verbs in SFLM), p ( X 1 , . . . , X n ) is a condition. 2. holds ( X ) is a condition representing the claim that the formula referenced by X is true. 3. mathid ( X, Y ) is a condition which binds a discourse referent X to a mathematical referent Y (a formula or a term). 4. use ( Z ) is a condition representing that the source of the textual referent Z explains the previous or the next proof step. 5. A is a condition, i. e. conditions can be nested. 6. - A is a condition representing the negation of A. 7. A ⇒ B is a condition representing an implication, a universal quantifier, or an assumption and its scope. 8. contradiction is a condition. 9. A : = B is a condition representing a definition. Note that contrary to the case of DRSes, a bare PRS can be a direct condition of a PRS, and PRSes are not merged in a way that the provenance of conditions becomes intransparent. 
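Before turning to the construction algorithm, the following sketch shows one possible, simplified encoding of the PRS structure and condition types just defined. All class names are illustrative assumptions and not Naproche's internal data model.

```python
# Minimal, illustrative encoding of a PRS as defined above (not Naproche's own data model).
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class PRS:
    prs_id: str                                                   # identification number, e.g. "theorem.1"
    discourse_refs: List[str] = field(default_factory=list)       # d_1 ... d_m
    math_refs: List[str] = field(default_factory=list)            # m_1 ... m_n (terms and formulas)
    text_refs: List[str] = field(default_factory=list)            # r_1 ... r_p
    conditions: List["Condition"] = field(default_factory=list)   # ordered list c_1 ... c_l

@dataclass
class Predicate:          # p(X1, ..., Xn)
    name: str
    args: List[str]

@dataclass
class Holds:              # holds(X): the formula referenced by X is true
    ref: str

@dataclass
class MathId:             # mathid(X, Y): binds discourse referent X to math referent Y
    ref: str
    formula: str

@dataclass
class Use:                # use(Z): textual referent Z explains a proof step
    text_ref: str

@dataclass
class Negation:           # the negation of a PRS A
    prs: PRS

@dataclass
class Implication:        # A => B: implication, universal quantifier, or assumption plus scope
    antecedent: PRS
    consequent: PRS

@dataclass
class Definition:         # A := B
    definiendum: PRS
    definiens: PRS

class Contradiction:      # the 'contradiction' condition
    pass

# A bare PRS can itself be a condition, reflecting the nesting described above.
Condition = Union[Predicate, Holds, MathId, Use, PRS, Negation, Implication,
                  Contradiction, Definition]
```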
This allows to represent in a PRS the structure of a text divided into building blocks (definitions, lemmas, theorems, proofs) by structure markers. The hierarchical structure of assumptions is represented by nesting conditions of the form A ⇒ B: A contains an assumption, and B contains the representation of all claims made inside the scope of that assumption. The algorithm creating PRSes from a text in SFLM proceeds sequentially: It starts with the empty PRS. Each sentence or structure marker in the discourse updates the PRS according to an algorithm similar to a standard DRS construction algorithm, but taking the nesting of assumptions into account. The following figures show simplified versions of the PRSes constructed from the examples from Kunen (1980) given above. From Proof Texts to Logic 143 theorem. 1 goal. 2 0 -∃ z ∀ x ( x ∈ z ) mathid (0 , -∃ z ∀ x ( x ∈ z )) holds (0) body. 3 id. 4 1,2 z, ∀ x ( x ∈ z ) mathid (1 , z ) mathid (2 , ∀ x ( x ∈ z )) holds (2) ⇒ id. 5 3,4,5 { x ∈ z : x ∈ x } , { x : x ∈ x } { x ∈ z : x ∈ x } = { x : x ∈ x } use ( comprehension ) mathid (3 , { x ∈ z : x ∈ x } ) mathid (4 , { x : x ∈ x } ) mathid (5 , { x ∈ z : x ∈ x } = { x : x ∈ x } ) holds (5) contradiction use ( Russell paradox ) comprehension, Russell paradox id. 0 〈 x, y 〉 : = id. 1 1,2,3,4,5 x, y, 〈 x, y 〉 , {{ x, } , { x, y }} , 〈 x, y 〉 = {{ x, } , { x, y }} mathid (1 , x ) mathid (2 , y ) mathid (3 , 〈 x, y 〉 ) mathid (4 , {{ x, } , { x, y }} ) mathid (5 , 〈 x, y 〉 = {{ x, } , { x, y }} ) holds (5) alternative.name (3 , ordered pair of ) 144 Jip Veldman, Bernhard Fisseni et al. Even in these easy examples there are various interesting phenomena in the PRS constructions: 1. While parsing the statement “ ∀ x ( x ∈ z ) ”, we compute the list of free variables of this formula. We add a new discourse referent for the free variable z and quantify this new variable existentially, because this variable is not bound be any assumption made before. 2. While parsing “ { x ∈ z : x ∈ x } = { x : x ∈ x } ”, we also extract the mathematical objects used in this formula and give them a discourse referent. 3. the PRS condition alternative.name codes a alternative description in natural language for the object defined formally. 5 Checking PRSes There is a natural translation from a PRS to first-order logic. So the correctness of the text represented in a PRS can be checked using a theorem prover. We proceed by first translating a PRS into first-order logic, and then a theorem prover for first-order logic can be used to check the result of this translation. For the first step we use a variant of the translation from DRSes to first-logic described by Blackburn & Bos (2003). We list the most important changes that have to be made to the algorithm in order to parse PRSes correctly: 1. mathid conditions are not translated into subformulas of the first-order translation. A mathidcondition, which binds a discourse referent n to a term t, triggers a substitution of n, by t, in the translation of subsequent conditions. A mathid-condition, which binds a discourse referent n to a formula φ, causes a subsequent holds ( n ) -condition to be translated to φ. 2. Definition conditions are also not translated into subformulas of the translation. Instead, a condition of the form A : = B triggers a substitution of the relation symbol it defines, by the first-order translation of B, in subsequent conditions. 3. The holds-conditions that occur in a goal-PRS are translated after parsing the following body- PRS. 4. 
use ( Z ) is not translated, but should be used as a hint for the proof checker. For example, the PRS of the first example can be translated as follows into first-order logic enriched with class terms: [ ∀ z ( ∀ x ( x ∈ z )) ⇒ ( { x ∈ z : x ∈ x } = { x : x ∈ x } ∧ ⊥ )] ⇒ -∃ z ∀ x ( x ∈ z ) The checking algorithm used in Naproche is based on, but more complex than the algorithm described above. For example we did not explain how use is used as a hint for the proof checker. 6 Conclusion We have sketched how a number of phenomena characteristic for the semi-formal language of mathematics are treated in our proof checking system Naproche. We extend the DRS approach to discourse semantics in a number of ways to deal with phenomena like variables, mathematical formula, explicit references to text parts, definitions and assumptions with wide scope. All extensions preserve the direct interpretability in first-order logic of standard DRT. From Proof Texts to Logic 145 References Abrahams, P. W. (1964). Application of lisp to checking mathematical proofs. In E. C. Berkeley & D. G. Bobrow, editors, The Programming Language LISP: Its Operations and Applications, pages 137-160. The MIT Press, Cambridge (USA), London. Blackburn, P. & Bos, J. (2003). Working with Discourse Representation Theory: An Advanced Course in Computational Linguistics. CSLI, Stanford. Braselmann, P. & Koepke, P. (2005). Gödel’s completeness theorem. Formalized Mathematics, 13, 49-53. Cramer, M. (2009). Mathematisch-logische Aspekte von Beweisrepräsentationsstrukturen. Master’s thesis, Rheinische Friedrich-Wilhelms-Universität Bonn. de Bruijn, N. G. (1994). Reflections on automath. In R. B. N. et al., editor, Selected Papers on Automath, volume 133 of Studies in Logic, pages 201-228. Elsevier. Eisenreich, G. (1998). Die neuere Fachsprache der Mathematik seit Carl Friedrich Gauß. In L. Hoffmann, H. Kalverkämper, H. E. Wiegand, C. Galinski, & W. Hüllen, editors, Fachsprachen / Languages for Special Purposes: ein internationales Handbuch zur Fachsprachenforschung und Terminologiewissenschaft / An International Handbook of Special-Language and Terminology Research, 1. Halbband, number 14 in Handbücher zur Sprach- und Kommunikationswissenschaft, chapter 136, pages 1222-1230. de Gruyter, Berlin, New York. Gonthier, G. (2005). A computer-checked proof of the four colour theorem. unpublished ms. http: / / research.microsoft.com/ en-us/ people/ gonthier/ 4colproof.pdf . Kamp, H. & Reyle, U. (1993). From Discourse to Logic. Kluwer, Dordrecht. Kühlwein, D. (2008). A calculus for proof representation structures. Diploma Thesis. Rheinische Friedrich- Wilhelms-Universität Bonn. Kunen, K. (1980). Set theory, volume 102 of Studies in Logic and the Foundations of Mathematics. North- Holland Publishing Co., Amsterdam. An introduction to independence proofs. Landau, E. (1930). Grundlagen der Analysis. Das Rechnen mit ganzen, rationalen, irrationalen, komplexen Zahlen. Akademische Buchgesellschaft, Leipzig. Ranta, A. (1999). Structures grammaticales dans le Français mathématique. Mathématiques, informatique et Sciences Humaines, pages 138: 5-56; 139: 5-36. Zinn, C. (2003). A computational framework for understanding mathematical discourse. Logic Journal of the IGP, 11(4), 457-484. Zinn, C. (2004). Understanding Informal Mathematical Discourse. Ph.D. thesis, Institut für Informatik, Universität Erlangen-Nürnberg. Zinn, C. (2006). Supporting the formal verification of mathematical texts. Journal of Applied Logic, 4(4), 592-621. 
Social Semantics and Its Evaluation by Means of Closed Topic Models: An SVM-Classification Approach Using Semantic Feature Replacement by Topic Generalization *

Ulli Waltinger 1, Alexander Mehler 1 and Rüdiger Gleim 2
1 Bielefeld University
2 Goethe University Frankfurt

* Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 147-158.

Abstract

Text categorization is a fundamental part of many NLP applications. In general, the Vector Space Model, Latent Semantic Analysis and Support Vector Machine implementations have been successfully applied within this area. However, feature extraction is the most challenging task when conducting categorization experiments. Moreover, sensitive feature reduction is needed in order to reduce time and space complexity, especially when dealing with singular value decomposition or larger text collections. In this paper we examine the task of feature reduction by means of closed topic models. We propose a feature replacement technique that conducts a topic generalization over user-generated concepts of a social ontology. Derived feature concepts are subsequently used to enhance and replace existing features, yielding a minimal representation of twenty social concepts. We examine the effect of each step in the classification process using a large corpus of 29,086 texts comprising 30 different categories. In addition, we offer an easy-to-use web interface as part of the eHumanities Desktop in order to test the proposed classifiers.

1 Introduction

In this paper we consider the problem of text categorization by means of Closed Topic Models (CTM). In contrast to Open Topic Models (OTM, Waltinger & Mehler, 2009; Mehler & Waltinger, 2009), where topic labels represent content categories that change over time as they are contributed by an open community, e.g. Wikipedia users, the topic labels of CTM are defined in advance. Therefore, traditional machine learning techniques such as document classification and clustering can be applied. Most commonly the classification or categorization is based on the Vector Space Model (VSM, Salton, 1989), representing textual data with the 'bag of words' (BOW) approach. Despite its variations, this method represents all words within a term-document matrix indexed by a feature weighting measure, e.g. term frequency (TF) or inverse document frequency (IDF) or a combination of both. Documents are then judged on the basis of the similarity of their term features (following the idea that similar documents will also share similar features) and can be clustered by, e.g., k-means algorithms, where k defines the number of predefined categories. Using Support Vector Machine (SVM) techniques, better results can often be achieved. However, feature selection is the crucial part when applying SVMs for categorizing a document collection. Apart from linguistic information such as part-of-speech, lemma or stem information, the selected features mainly comprise content and structure features from the training corpus. Raising the number of considered features also increases the complexity of computing the categorization. Therefore, the good performance of SVMs often applies only to a small set of categories or short texts. For larger documents, a feature reduction technique as proposed by Taira & Haruno (1999); Kim et al. (2005); Dasgupta et al.
(2007) has to be applied. Most often this is done by introducing a certain threshold of the feature weighting function. However, utilized features are still those retained out of the document collection. Following the idea that document categorization is not about the words occurring in the text, but about common concepts texts represent, concept enhancement and feature replacement are the keywords of this paper. Our focus is an alternative representation of individual texts in order to enhance existing classical BOW features and then reduce it to a minimum representation. In recent years a few approaches have been proposed regarding feature enhancement using different resources of knowledge. Andreas & Hotho (2004) proposed a method using background knowledge from an ontology by means of the lexical-semantic net WordNet Fellbaum (1998), to improve text classification. As one of the first in the field of social network driven methods, Gabrilovich & Markovitch (2005) used directory concepts of the Open Directory Project (ODP), to enhance textual data. Later Gabrilovich & Markovitch (2006) and Zalan Bodo (2007) used data from the Wikpedia project for support vector machine based text categorization experiments. Wang & Domeniconi (2008) proposed a semantic kernel technique for text classification also on the basis of the Wikipedia data set. In the field of biomedicine Xinghua et al. (2006) proposed a Latent Dirichlet Allocation model (LDA), evaluated on the TREC corpus for an enhanced representation of biomedical knowledge. As a commonality, all approaches have utilized the article title of Wikipedia to enhance their existing dataset. Evaluation was performed using the famous Reuters and the Movie-Review corpus - comprising the English language. Different to them, in this work we propose a ‘knowledge feature breath’ on basis of generalized category concepts out of a social network. We are labelling individual texts of a document collection with a fixed number of relevant category concept definitions instead of article namespaces. Utilized category information, considered as the key topic information, are subsequently used for a topic generalization (e.g. from topic tennis to a more general label sports). The idea behind this approach is, that topic related documents share similar generalized topic labels. These predicted labels pose as new, semantically related, features for the categorization process. In contrast to the approaches above, we focus on feature reduction rather than feature extension. Predicted labels act as a substitution of the initial textual feature set. The contribution of this paper is threefold: First, we evaluate the performance of text classification using Latent Semantic Analysis (LSA, Landauer & Dumais, 1997), Support Vector Machine (SVM, Joachims, 2002), and a feature-reduced SVM implementation, tested on a large corpus of 29,086 texts comprising 30 different categories. Second, we examine the effect in text classification using the proposed semantic-feature-replacement technique by means of topic generalization. Third, an online categorizer will be introduced that aims to combine a convenient user interface with a framework which is open to arbitrary categorization approaches. Social Semantics and Its Evaluation by Means of Closed Topic Models 149 2 Method Taking a ‘knowledge feature breath’ in order to extend or reduce an existing document representation with topic-related features, an external knowledge repository is needed. 
In this context we are utilizing the social ontology of the online encyclopedia Wikipedia as a source of terminological knowledge. Concepts are reflected through Wikipedia articles and more importantly their corresponding Wikipedia category information (Section 2.1). In particular, we make use of the category taxonomy in order to predict generalized topic labels, constructing additional semantically related feature concepts (Section 2.2). Predicted concepts are then subsequently used for the classification task either as feature enhancement or replacement candidates (Section 2.3). Figure 1: Text categorization by means of generalized topic concepts 2.1 Concept Generation The method of identifying category labels out of the Wikipeda, utilizes the article collection of the social network. Generally speaking, we first try to identify the most adequate article concepts for a given text fragment, and then use associated category information to predict more general topic labels (see Figure 1). The approach of mapping a given text fragment onto the Wikipedia article collection is done following Gabrilovich & Markovitch (2007), by building an inverted vector index. A detailed description of the used minimized representation of the German Wikipedia dataset and the method in aligning a given text fragment onto the article collection can be found in Waltinger & Mehler (2009). At large, we merely parse the entire Wikipedia article collection, and perform a tokenization and lemmatization. Each article and its corresponding lemmata are stored in a vector representation. Therefore, vector entries represent lemmata that occur in the respective article. Each lemma feature is weighted by the TF-IDF scheme Salton & McGill (1983), which reflects the association or affinity to the corresponding article concepts. In a next step, we invert this vector, defined as V art , using lemmata as the index, and article namespaces as the index entries. The reduction of the vector representation is done by sorting all article concepts (the vector entries) on basis of their affinity scores in descending order and remove those articles concepts whose affinity score is less than five percent of the highest feature weight. Having V art given, we can apply a standard text similarity algorithm, using the cosine metric, in order to identify relevant Wikipedia articles for a given text fragment. In order to access the topic-related category information, we follow Waltinger et al. 150 Ulli Waltinger, Alexander Mehler, Rüdiger Gleim (2008), using the assigned article-category hyperlinks within each Wikipedia article page. Thus, the second vector representation, defined as V cat , stores for each article entry its corresponding category concepts. Feature are also weighted through the TF-IDF scheme. Following this, we are able to retrieve the weighted number of unique category concepts K for a given text fragment by iterating over V art and collecting k j ∈ V cat . Since both vectors V art and V cat are sorted in descending order, the first entries of our vectors correspond to those concepts which fits best to a given input text on basis of our concept vector representation. 2.2 Topic Generalization Being able to request the most relevant Wikipedia articles - V art - and their corresponding category information - V cat - for a given text fragment, we are following Waltinger & Mehler (2009) in computing a topic generalization. 
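Before describing the generalization step in detail, the following sketch illustrates the concept generation of Section 2.1 in simplified form: a text fragment is mapped onto weighted article concepts via the inverted lemma index, and the corresponding category concepts are collected. Variable names, the additive scoring and the dictionary-based data structures are assumptions for illustration only, not the authors' implementation.

```python
# Simplified, hypothetical sketch of the concept generation step (Section 2.1).
# All names and data structures are illustrative, not the authors' code.
from collections import defaultdict

def retrieve_article_concepts(fragment_lemmas, inverted_index, top_n=10):
    """inverted_index: lemma -> {article: tf-idf weight}, i.e. a stand-in for V_art."""
    scores = defaultdict(float)
    for lemma in fragment_lemmas:
        for article, weight in inverted_index.get(lemma, {}).items():
            scores[article] += weight          # crude additive stand-in for the cosine scoring
    return sorted(scores.items(), key=lambda x: -x[1])[:top_n]

def collect_category_concepts(articles, article_categories, top_n=10):
    """article_categories: article -> {category: weight}, i.e. a stand-in for V_cat."""
    cat_scores = defaultdict(float)
    for article, art_weight in articles:
        for category, cat_weight in article_categories.get(article, {}).items():
            cat_scores[category] += art_weight * cat_weight
    return sorted(cat_scores.items(), key=lambda x: -x[1])[:top_n]
```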
The task to generalize certain topics is defined as making generalizations from specific concepts to a broader context. For example from the term BASKETBALL to the more general concept SPORT or from DELEGATE to POLITICS. Again, we are utilizing the category taxonomy of the Wikipedia for this task. The category taxonomy has been extracted in a top-down manner, from the root category namespace: Category: Contents, connecting all subordinated categories to its superordinate concepts. We therefore forced the taxonomy into the representation of a directed tree defined as D. Any given text fragment is first mapped to the most specific category concepts as an entry point - V art → V cat - and then tracked upwardly moving along the taxonomy. Each category concept we meet on the way up inherits the feature weight of the initial category. Therefore, for each edge we have passed a more general concept is derived - comprising the desired topic generalization vector V topic . V topic is also sorted in descending order, from the most general topics at the beginning to the most specific concepts at the end. See Table 1 and Table 2 for an example of the topic generalization for different domains. Table 1: Top-5 article and generalized topic concepts for closed topic stock market DAS GRÖSSTE KURSPLUS seit 1985 wurde an den acht hiesigen Börsen im vergangenen Jahr erzielt. Beispielsweise zog der Deutsche Aktienindex um 47 Prozent an (vgl. SZ Nr. 302). Trotz Rezession und Hiobsbotschaften von der Unternehmensfront hatten sich zunächst britische und amerikanische Fondsverwalter bei hiesigen Standardwerten engagiert, woraufhin in der zweiten Hälfte des vergangenen Jahres der SZ-Index um 31 Prozent hochgeschnellt war. . . . . . Related Articles Generalized Topics 1. Anlageklasse 1. Finanzierung 2. Bundesanleihe 2. Finanzmarkt 3. Nebenwert 3. Ökonomischer Markt 4. Bullen- und Bärenmarkt 4. Wirtschaft 5. Börsensegment 5. Rechnungswesen Social Semantics and Its Evaluation by Means of Closed Topic Models 151 Table 2: Top-5 article and generalized topic concepts for closed topic campus Berwerbungsfrist läuft ab: Bis zum 15. Januar müssen die Bewerbungen für die zulassungsbeschränkten Studienplätze bei der Zentralstelle für die Vergabe von Studienplätzen (ZVS) in Dortmund eingetroffen sein. Die notwendigen Unterlagen sind bei den örtlichen Arbeitsämtern, Universitäten . . . Weniger Habilitationen: 1992 wurden an den Hochschulen in Deutschland rund 1300 Habilitationsverfahren . . . . . Related Articles Generalized Topics 1. Provadis School of IMT 1. Bildung 2. Approbationsordnung 2. Deutschland 3. Private Hochschule 3. Bildung nach Staat 4. Hochschulabschluss 4. Akademische Bildung 5. Hochschule Merseburg 5. Wissenschaft 2.3 Feature Replacement The main focus of this paper is the task of feature replacement for text categorization. In a broader context we address the problem of high dimensionality of BOW approaches due to the amount of comprised features. Feature reduction techniques contribute to the removal of noise and lower the overfitting in a classification process. In order to judge the importance of comprised features to categories, a weighting function is needed. We make use of the well-known TF-IDF weighting function (see Equation 1), which measures the importance of a feature t i to the actual document d j in connection to the entire corpus size N . Therefore an input document d is represented as a data vector v d = [ w 1,d , w 2,d , . . . , w N,d ] T , where d i is defined as a set of features. 
w_ij = tf_ij · idf_i = ( freq_ij / max_l(freq_lj) ) · log( N / n_i )    (1)

Once all features are weighted, we conduct the concept construction on the basis of the topic generalization proposed in the previous section. In particular, for each text we generate twenty topic-related concepts: the ten best article concepts and the ten best category concepts. Note that each concept is tokenized and weighted by w ij. This is done in order to gain affinity scores for differently written concepts. Consider for example the category concept ACADEMIC EDUCATION: we resolve this multi-word concept into two individual concepts, ACADEMIC and EDUCATION. All resolved weighted features are then added to our semantic feature vector, defined as V sem. Feature reduction is performed in the first place by replacing those initial features of the data vector d that have a lower w ij by corresponding items in V sem. The main idea behind this approach is that related documents share related generalized concepts.

3 Empirical Evaluation

3.1 SVM Settings

Since we were interested in the performance of feature replacement, we conducted different experiments by varying the initial number of features. First, we used all textual noun, verb and adjective features (C-SVM). Second, we reduced the initial features to noun-only features, limited by a threshold (R-SVM). Third, we added all features of V sem to the reduced feature representation (G-SVM). Fourth, we replaced that amount of features by the size of V sem (M-SVM). Fifth, we used only the features gained from the topic generalization (V sem) (GO-SVM). Sixth, we used only the features of V sem and additionally reduced them with a certain threshold (MGO-SVM). For the actual text categorization we make use of the kernel-based classification algorithm of the support vector machine (SVM) implementation SVMlight (Joachims, 2002), version 6.02. We used SVMs because of their good performance in the task of text categorization. For each class an SVM classifier was trained using the linear kernel. The results were evaluated using the leave-one-out cross-validation estimation of SVMlight. For comparison with non-SVM approaches we computed various supervised and unsupervised baselines. First, a random clustering of all documents was performed. Second, we conducted a Latent Semantic Analysis (LSA, Deerwester et al., 1990). It is a dimensionality reduction technique based on singular value decomposition (SVD). The SVD is computed keeping the k best eigenvalues. In our experiments we defined k as 300. The resulting matrix was then used for the categorization experiments, applying different clustering techniques including k-means, hierarchical and average-linkage clustering.

3.2 Vector Settings

The calculation of our feature vector representation is based upon the German version of Wikipedia (February 2009). After parsing the XML dump comprising 756,444 articles we conducted the preprocessing by lemmatizing all input tokens and removing smaller concepts. We ignored those articles having fewer than five incoming and outgoing links and fewer than 100 non-stopwords. The final vector representation comprised 248,106 articles and 620,502 lemmata. The category tree representation consisted of 55,707 category entries utilizing 128,131 directed hyponymy edges.

3.3 Evaluation Corpus

The evaluation of the proposed methods was done using a large corpus of newspaper articles. We used data from the German newspaper Süddeutsche Zeitung (SZ).
The initial corpus comprised 135,546 texts within 96 categories. Due to its unbalanced category-text proportions, an adjusted subset was extracted consisting of 29,086 text, 30 categories and 232,270 unique textual features. 4 Results The aim of our experiment was to determine the effect of topic generalization in a SVM text classification environment. To which level does feature enhancement by means of semantic concepts boost the classical SVM categorization? To which level can we reduce the initial features set in order to still retain an acceptable performance? How doe unsupervised methods perform compared to supervised? As Figure 2 shows, all SVM (supervised) methods clearly outperform the unsupervised clustering results. With an average F-measure of 0 . 631 , clustering conducting a LSA (see Table 3) Social Semantics and Its Evaluation by Means of Closed Topic Models 153 obviously performs better than baseline approaches (F-measure of 0 . 15 ), but also confirms that for classical text categorization SVMs are most appropriate. Comparing the different SVM implementations (Table 4), we can identify that feature enhancement (G-SVM with an average F-measure of 0 . 424 ) boosts with a difference of up to 0 . 700 (at category gesp) the classical SVM implementation using all noun, verb and adjective features ( 0 . 778 ). Yet, in comparison to a much reduced SVM implementation (R-SVM: 0 . 914 ) - using only nouns and limited to 5,000 features overall only minor enhancement can be identified. But looking more closely at the data, we can observe that within categories which perform lower than an F-Measure of 0 . 900 in the reduced version the G-SVM improves the results. Nevertheless, since the average results of the R-SVM implementation are a priori very high not much improvement could have been expected. Much more interesting is the aspect of using only the topic generalization concepts for the classification, and discarding all actual features of the text. Using only twenty concepts per text, we still reach a promising F-measure of 0 . 884 (GO-SVM 1 ). Reducing this set to the MGO-SVM 2 implementation, that is only 1000 features for all 29,086 texts the average results show also a very promising 0 . 855 . When comparing thereby the number of used features (see Table 4) using the GO-SVM and the R-SVM, we can identify that we were able to reduce the actual reduced features additional with an average percentage of 80 . 10 %. Therefore, the results for using only the topic generalization concepts for text categorization seem to be very up-and-coming. 5 Online Categorization Using eHumanities Desktop We have presented an approach to text classification by incorporating social ontologies and showed an evaluation with different settings. In order to put interested users in a position to test the classifiers on their own, we have developed an easy-to-use web interface as part of the eHumanities Desktop Mehler et al. (2009); Gleim et al. (2009). The eHumanities Desktop is an online system for linguistic corpus management, processing and analysis. Based on a well-founded data model to manage resources, users and groups it offers application modules to perform tasks of text preprocessing, information retrieval and linguistic analysis. The set of functionality is easily extensible by new application modules as for example the Categorizer. The Categorizer aims to combine a convenient user interface with a framework which is open to arbitrary categorization approaches. 
The typical user does not want to bother with the details of a given method and what preprocessing needs to be done. Using the Categorizer, all that needs to be done is to pick a classifier, specify the input document (e.g. plain text, html or pdf) and start the categorization. Alternatively, it is also possible to enter the input text directly via cut&paste. The classifiers themselves are defined in terms of an eHumanities Desktop Classifier Description, an XML-based language to specify how to connect a given input document to a classifier. The language is kept as flexible as possible to be open to arbitrary algorithms. In the case of an SVM, a classifier description defines, among other things, which kind of text preprocessing needs to be done, what features and which models to use and how an implementation of the SVM needs to be called. Integrating a new classification algorithm usually does not take more than putting the program on the server and writing a proper classifier description. Please note that the category models are ‘normal’ documents of the corpus management system, as are the classifier descriptions. Thus they can easily be shared among users. Figure 3 shows an example of how the Categorizer can be used to categorize text. The content has been inserted directly via cut&paste from the online portal of a German newspaper. The classifier is SVM-based and trained on categories of the Süddeutsche Zeitung. The table shows the results of the categorization with the best performing category at the top.

Figure 2: a) F-Measure results of supervised classification comparing baseline, classical and topic-generalized enhanced SVM implementations; b) F-Measure results of unsupervised classification comparing baseline, average linking and best category stream clustering

Figure 3: Screenshot showing the eHumanities Desktop Categorizer

Table 3: Average F-Measure results of the entire classification experiments: C-SVM refers to the SVM implementation using all textual features; G-SVM to a feature enhanced, R-SVM and M-SVM to the feature reduced implementations; GO-SVM and MGO-SVM refer to the experiments using only topic-related concepts as the feature set
Implementation  F-Measure
G-SVM           0.915
R-SVM           0.914
M-SVM           0.913
C-SVM           0.836
GO-SVM          0.884 (1)
MGO-SVM         0.855 (2)
LSA             0.631
Random          0.15

1 The GO-SVM results are based upon eleven categories, since the classification process was still running by the end of the cfp deadline.
2 The MGO-SVM was reported on the basis of five computed categories, since the SVM was still computing by the end of the cfp deadline.

6 Conclusions

This paper presented a study on text categorization by means of closed topic models. We proposed an SVM-based approach using semantic feature replacement. New features are created on the basis of the social network Wikipedia by conducting a topic generalization using category information. Generalized concepts are then used as a replacement for conventional features. We examined different methods of enhancing, replacing and deleting features during the classification process. In addition, we offer an easy-to-use web interface as part of the eHumanities Desktop in order to test the proposed classifiers.
Acknowledgment We gratefully acknowledge financial support of the German Research Foundation (DFG) through the EC 277 Cognitive Interaction Technology, the SFB 673 Alignment in Communication (X1), the Research Group 437 Text Technological Information Modeling, the DFG-LIS-Project P2P-Agents for Thematic Structuring and Search Optimization in Digital Libraries and the Linguisitc Networks project funded by the German Federal Ministry of Education and Research (BMBF) at Bielefeld University. 156 Ulli Waltinger, Alexander Mehler, Rüdiger Gleim Table 4: Results of SVM-Classification comparing R-SVM and G-SVM (Imp. 1) and C-SVM and G-SVM (Imp. 2): No.T. refers to the number of texts within a category; Feat.Reduc. shows the amount of feature reduction, comparing the initialand the reduced feature set ID Name No.T. Feat.Reduc. C-SVM R-SVM G-SVM Imp. 1 Imp. 2 1 baro 465 -87.00% 0.995 0.998 0.998 +0.000 +0.003 2 camp 345 -87.33% 0.913 0.953 0.956 +0.003 +0.043 3 diew 276 -84.96% 0.925 0.994 0.995 +0.001 +0.070 4 fahr 313 -88.31% 0.924 0.978 0.989 +0.011 +0.065 5 film 2457 -87.57% 0.865 0.955 0.954 -0.001 +0.089 6 fird 2213 -71.87% 0.902 0.971 0.973 +0.002 +0.071 7 firm 1339 -83.59% 0.969 0.995 0.995 +0.000 +0.026 8 gesp 1234 -91.08% 0.393 0.751 0.817 +0.066 +0.424 9 inha 1933 -55.11% 0.974 0.988 0.987 -0.001 +0.013 10 kost 533 -90.57% 0.926 0.964 0.966 +0.002 +0.040 11 leut 911 -78.69% 0.903 0.994 0.996 +0.002 +0.093 12 loka 1953 -76.94% 0.728 0.772 0.781 +0.009 +0.053 13 mein 2240 -79.82% 0.923 0.961 0.951 -0.010 +0.028 14 mitt 677 -74.51% 0.171 0.485 0.495 +0.010 +0.324 15 nchg 1105 -84.99% 0.799 0.761 0.769 +0.008 -0.020 16 nrwk 349 -76.17% 0.871 0.957 0.936 -0.021 +0.065 17 nrwp 297 -85.21% 0.952 0.846 0.838 -0.008 -0.114 18 nrww 342 -76.46% 0.932 0.983 0.983 +0.000 +0.051 19 reit 286 -80.09% 0.933 0.946 0.953 +0.007 +0.020 20 schf 542 -61.92% 0.682 0.795 0.785 -0.010 +0.103 21 spek 375 -68.73% 0.712 0.975 0.950 -0.025 +0.238 22 spfi 318 -90.84% 0.908 0.984 0.983 -0.001 +0.075 23 stdt 700 -75.89% 0.767 0.850 0.853 +0.003 +0.086 24 szen 2314 -82.66% 0.827 0.869 0.878 +0.009 +0.051 25 szti 336 -56.82% 0.947 0.979 0.973 -0.006 +0.026 26 thkr 1613 -90.24% 0.817 0.939 0.945 +0.006 +0.128 27 tvkr 2355 -78.87% 0.846 0.949 0.951 +0.002 +0.105 28 woch2 375 -88.17% 1.00 1.00 1.00 +0.000 +0.000 29 zwif 409 -84.06% 0.666 0.862 0.829 +0.033 +0.163 30 zwiz 481 -84.69% 0.918 0.968 0.969 +0.001 +0.051 Social Semantics and Its Evaluation by Means of Closed Topic Models 157 References Andreas, S. B. & Hotho, A. (2004). Boosting for Text Classification with Semantic Features. In In Proceedings of the MSW 2004 Workshop at the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 70-87. Dasgupta, A., Drineas, P., Harb, B., Josifovski, V., & Mahoney, M. W. (2007). Feature selection methods for text classification. In KDD ’07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 230-239, New York, NY, USA. ACM. Deerwester, S., Dumais, S., Landauer, T., Furnas, G., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society of Information Science, 41(6), 391-407. Fellbaum, C., editor (1998). WordNet. An Electronic Lexical Database. The MIT Press. Gabrilovich & Markovitch (2006). Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. Proceedings of the Twenty-First National Conference on Artificial Intelligence, Boston, MA. Gabrilovich, E. & Markovitch, S. 
(2005). Feature generation for text categorization using world knowledge. In Proceedings of The Nineteenth International Joint Conference for Artificial Intelligence, pages 1048- 1053, Edinburgh, Scotland. Gabrilovich, E. & Markovitch, S. (2007). Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 6-12. Gleim, R., Waltinger, U., Ernst, A., Mehler, A., Feith, T., & Esch, D. (2009). eHumanities Desktop - An Online System for Corpus Management and Analysis in Support of Computing in the Humanities. In Proceedings of the Demonstrations Session at EACL 2009, pages 21-24, Athens, Greece. Association for Computational Linguistics. Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers, Norwell, MA, USA. Kim, H., Howland, P., & Park, H. (2005). Dimension Reduction in Text Classification with Support Vector Machines. J. Mach. Learn. Res., 6, 37-53. Landauer, T. & Dumais, S. (1997). A solution to Plato’s problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104(1), 211-240. Mehler, A. & Waltinger, U. (2009). Enhancing Document Modeling by Means of Open Topic Models: Crossing the Frontier of Classification Schemes in Digital Libraries by Example of the DDC. Appears in Library Hi Tech. Mehler, A., Gleim, R., Waltinger, U., Ernst, A., Esch, D., & Feith, T. (2009). eHumanities Desktop — eine webbasierte Arbeitsumgebung für die geisteswissenschaftliche Fachinformatik. In Proceedings of the Symposium “Sprachtechnologie und eHumanities”, 26.-27. Februar, Duisburg-Essen University. Salton, G. (1989). Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison Wesley, Reading, Massachusetts. Salton, G. & McGill, M. (1983). Introduction to Modern Information Retrieval. McGraw-Hill, New York. 158 Ulli Waltinger, Alexander Mehler, Rüdiger Gleim Taira, H. & Haruno, M. (1999). Feature selection in SVM text categorization. In AAAI ’99/ IAAI ’99: Proceedings of the 6.th national conference on AI, pages 480-486, Menlo Park, CA, USA. American Association for Artificial Intelligence. Waltinger, U. & Mehler, A. (2009). Social Semantics And Its Evaluation By Means Of Semantic Relatedness And Open Topic Models. In Proceedings of the 2009 IEEE/ WIC/ ACM International Conference on Web Intelligence. Waltinger, U., Mehler, A., & Heyer, G. (2008). Towards Automatic Content Tagging: Enhanced Web Services in Digital Libraries Using Lexical Chaining. In 4rd International Conference on Web Information Systems and Technologies (WEBIST ’08), 4-7 May, Funchal, Portugal, Barcelona. Wang, P. & Domeniconi, C. (2008). Building semantic kernels for text classification using wikipedia. In KDD08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 713-721, New York, NY, USA. ACM. Xinghua, L., Bin, Z., Atulya, V., & ChengXiang, Z. (2006). Enhancing Text Categorization with Semanticenriched Representation and Training Data Augmentation. Research Paper. Zalan Bodo, Zsolt Minier, L. C. (2007). Text categorization experiments using Wikipedia. In Proceedings of the 1st Knowledge Engineering: Principles and Techniques, Cluj-Napoca, Romania, 2007, pages 66-72. Research Paper. 
From Parallel Syntax Towards Parallel Semantics: Porting an English LFG-Based Semantics to German * Sina Zarrieß Institut für maschinelle Sprachverarbeitung (IMS) University of Stuttgart, Germany Abstract This paper reports on the development of a core semantics for German implemented on the basis of an English semantics that converts LFG F-structures to flat meaning representations. Thanks to the parallel design of the broad-coverage LFG grammars written in the context of the ParGram project (Butt et al., 2002) and the general surface independence of LFG F-structure analyses, the development process was extremely facilitated. We describe and discuss the overall architecture of the semantic conversion system as well as the basic properties of the semantic representation and the adaptation of the English to the German semantics. 1 Introduction This paper reports on the development of a core semantics for German which was implemented on the basis of an English semantics that converts LFG F-structures to flat meaning representations. The development strategy crucially relies on the parallel design of the broad-coverage LFG grammars written in the context of the ParGram project (Butt et al., 2002). We will first describe the overall architecture of the semantic conversion system as well as the basic properties of the semantic representation. Section 3 discusses the development strategy and the core semantic phenomena covered by the German semantics. Recently, the state of the art in parsing has made wide-coverage semantic processing come into the reach of research in computational semantics (Bos et al., 2004). This shift from the theoretical conception of semantic formalisms to wide-coverage semantic analysis raises questions about appropriate meaning representations as well as engineering problems concerning development and evaluation strategies. The general motivation of this work is to explore large-scale LFG syntax as a backbone for linguistically motivated semantic processing. Research in the framework of LFG has traditionally adopted a crosslingual perspective on linguistic theory (Bresnan, 2000). In the context of the ParGram project, a number of high quality, broadcoverage grammars for several languages have been produced over the years (Butt et al., 2002; Butt & King, 2007). 1 The project’s research methodology particularly focusses on parallelism which * Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 159-169. 1 The project’s webpage: http: / / www2.parc.com/ isl/ groups/ nltt/ pargram/ 160 Sina Zarrieß means that the researchers rely on a common syntactic theory as well as development tools, but which also concerns parallelism on the level of syntactic analyses. As the LFG formalism assumes a two-level syntax that divides the analysis into language and surface dependent constituent structure and a functional structure which basically represents the surface independent grammatical relations of a sentence, it constitutes a particularly appropriate basis for large-scale, multilingual syntax. Besides the theoretical implications, parallel grammar development bears the practical advantage that the resources developed for a particular language can often easily be ported to related languages. Kim et al. (2003) report that the Korean ParGram grammar was constructed in two months by adapting the Japanese grammar for Korean. 
The work presented in this paper describes an experience where a pair of parallel grammars substantially facilitated the development of a semantic resource based on them. We rely on the semantic conversion system presented in Crouch & King (2006), which we port to a German semantics that derives semantic representations from LFG F-structures. Since the syntactic F-structure input is largely parallel, the German core semantics could be implemented within a single month.

2 F-Structure Rewriting as an LFG Semantics

Since the early days of LFG, there has been research on interfacing LFG syntax with various semantic formalisms (Dalrymple, 1999). For the English and Japanese ParGram grammars, a broad-coverage glue semantic construction has been implemented by Crouch (1995) and Umemoto (2006). In contrast to these approaches, the semantic conversion described in Crouch & King (2006) is not driven by a specific semantic theory about meaning representation, nor by a theoretically motivated apparatus of meaning construction. Therefore, we will talk about "semantic conversion" instead of "construction" in this paper. The main idea of the system is to convert the surface-independent syntactic relations and features encoded in an F-structure to normalized semantic relations. At the current state of development, the representation simplifies many phenomena usually discussed in the formal semantics literature (see the next section), but it is tailored for use in Question Answering (Bobrow et al., 2007a) or Textual Entailment applications (Bobrow et al., 2007b). The semantic conversion was implemented by means of the XLE platform (Butt & King, 2007), which is used for grammar development in the ParGram project. It makes use of the built-in transfer module to convert LFG F-structures to semantic representations. Some formal aspects of the conversion are described in section 2.2. The idea of using transfer rules to model a semantic construction has also been pursued by Spreyer & Frank (2005), who use the transfer module to model an RMRS semantic construction for the German treebank TIGER.

2.1 The Semantic Representation

As a first example, a simplified F-structure analysis for the following sentence and the corresponding semantic representation are given in Figure 1.

1. In the afternoon, John was seen in the park.

The basic idea of the representation exemplified in Figure 1 is to represent the syntactic arguments and adjuncts of the main predicate in terms of semantic roles of the context introduced by the main predicate or some semantic operator. For the sake of readability, we visualize the contexts as boxes. The internal representation is a flat set of skolems where a context variable indicates the embedding (Crouch & King, 2006). This packed representation also allows for a compact processing of ambiguities, which is essential for the usability of deep grammars (Butt & King, 2007). Thus, the grammatical roles of the main passive verb in sentence 1 are semantically normalized such that the passive subject is assigned the Stimulus role and an unspecified Agent is introduced. The roles of the modifiers are specified in terms of their head preposition. This type of semantic representation is inspired by Neo-Davidsonian event semantics, introduced in Parsons (1990).
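To make the shape of this flat, context-indexed representation more concrete, the following Python fragment spells out one possible encoding of sentence 1. The skolem labels, role names and relation names are chosen for illustration only and are not the literal output format of the conversion system.

# Illustrative flat encoding of "In the afternoon, John was seen in the park."
# Labels and relation names are invented; only the overall shape matters.
representation = [
    ("context", "t"),                         # top context introduced by the main predicate
    ("head", "t", "see:1"),                   # event skolem heading the context
    ("role", "Stimulus", "see:1", "John:2"),  # normalized passive subject
    ("role", "Agent", "see:1", "agent:3"),    # unspecified agent added by depassivization
    ("mod", "in", "see:1", "park:4"),         # modifiers keyed by their head preposition
    ("mod", "in", "see:1", "afternoon:5"),
    ("tense", "see:1", "past"),
]

def roles_of(event, facts):
    """Collect the semantic roles attached to an event skolem."""
    return [(f[1], f[3]) for f in facts if f[0] == "role" and f[2] == event]

print(roles_of("see:1", representation))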
Other semantic properties of the event introduced by the main verb, such as tense, or nominal properties, such as quantification and cardinality, are explicitly encoded as a relation between the lexical item and the value of this property. The contexts can be thought of as propositions or possible worlds. They are headed by an operator that can recursively embed further contexts. Context embeddings can be induced by lexical items or syntactic constructions and include the following operators: (i) negation or question, (ii) sentential modifiers (possibly), (iii) coordination with or, (iv) conditionals, (v) some subordinating conjunctions (without), and (vi) clause-embedding verbs (doubt). The representation avoids many formal semantic complexities typically discussed in the literature, for instance the interpretation of quantifiers, by encoding them as conventionalized semantic predications. Given this skolemized first-order language, the task of textual entailment can be conceived as matching the hypothesis representation against the semantic representation of the text, where higher-order reasoning is approximated by explicit entailment rules (e.g. all entails some, past does not entail present); see Bobrow et al. (2007b) for a presentation of an RTE system based on this semantic representation.

2.2 The Semantic Conversion

The XLE transfer module, which we use for the implementation of the conversion of F-structures to semantic representations, is a term rewrite system that applies an ordered list of rewrite rules to a given F-structure input and yields, depending on the rewrite rules, new F-structures (e.g. translated F-structures) or semantic representations as described above. The technical features of the XLE transfer module are described in Crouch et al. (2006). An important feature for large-scale development is, for instance, the mechanism of packed rewriting, which allows for an efficient representation and processing of ambiguous F-structure analyses. The semantic conversion, as described in Crouch & King (2006), is not a priori constrained by a formally defined syntax-semantics mapping. The main intuition of the conversion is that the embeddings encoded in the syntactic analysis have to be re-encoded in such a way that they correspond to a semantic embedding. Then, the grammatical relations have to be converted to semantic relations such that, e.g., grammatical arguments correspond to semantic roles and adjuncts to modifiers of a certain semantic type. An example rewrite rule which converts a passive F-structure analysis to a normalized active analysis is given in Figure 2. In order to be maintainable and extensible, the set of transfer rules is organized in a modular way. The main steps of the semantic conversion are the following:

1. Flattening syntax-specific F-structure embeddings that do not correspond to semantic embeddings
2. Canonicalization of grammatical relations (e.g. depassivization)
3. Marking items that induce a semantic embedding
4. Linking F-structure scopes and the contexts of the semantic representation
5. Removing F-structure-specific features

An explicitly modular conception of the transfer procedure also facilitates its porting to other languages.
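As a rough illustration of what a single rewrite step does, the following Python toy mimics the depassivization rule of Figure 2 on a small set of F-structure facts. The tuple encoding and the handling of the former subject are simplifications invented for this sketch; the real rules operate on packed XLE structures and interact with many other rules.

# Toy emulation of a depassivization rewrite step (cf. Figure 2).
# Facts are (relation, verb, argument) triples; the encoding is invented.
def depassivize(facts):
    """Rewrite a passive frame with an oblique agent into an active frame:
    the oblique agent becomes SUBJ, the passive subject becomes OBJ."""
    facts = set(facts)
    for (rel, verb, arg) in list(facts):
        if rel == "OBL-AG" and ("PASSIVE", verb, "+") in facts:
            old_subj = next(a for (r, v, a) in facts if r == "SUBJ" and v == verb)
            facts -= {("OBL-AG", verb, arg), ("SUBJ", verb, old_subj), ("PASSIVE", verb, "+")}
            facts |= {("SUBJ", verb, arg), ("OBJ", verb, old_subj)}
    return facts

fstructure = {
    ("PRED", "see", "see"),
    ("PASSIVE", "see", "+"),
    ("SUBJ", "see", "John"),    # surface (passive) subject
    ("OBL-AG", "see", "Mary"),  # oblique agent phrase ("by Mary")
}
print(sorted(depassivize(fstructure)))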
Thus, steps 1 and 2 (and partly 3) may depend on the language-specific F-structure encoding, while the general steps 3 to 5 do not have to be changed at all when porting the transfer rules to another language. This is very similar to experiences from the manual engineering of syntactic grammars, where parallelism can only be maintained in practice if certain methodological conventions are respected during the development process (Butt et al., 2002; Dipper, 2003; Zinsmeister et al., 2001).

Figure 1: LFG F-structure analysis and corresponding semantic representation for example sentence 1

+VTYPE(%V, %%), +PASSIVE(%V,+),
OBL-AG(%V, %LogicalSUBJ),
PTYPE(%LogicalSUBJ,%%),
OBJ(%LogicalSUBJ,%P)
==>
SUBJ(%V, %P), arg(%V,%N,%P).

Figure 2: Example rewrite rule for passive normalization

3 From English to German Semantics

3.1 Semantic Grammar Development

In contrast to the various gold-standard treebanks available for the development and evaluation of parsers, gold standards for semantic representations are hardly available. This has a number of methodological implications for "semantic grammar" development. The only way to assess the accuracy of the semantic construction is to manually inspect the output of the system for a necessarily small set of input sentences. Moreover, the transfer scenario complicates the assessment of the system's coverage. While in Bos et al. (2004) the coverage of the meaning construction can be quantified by the number of syntactic analyses that the construction algorithm can process, the transfer conversion will never fail on a given syntactic input. Since the transfer rules just perform a greedy matching against the input, unmatched features pass unchanged to the output and will probably be deleted by one of the catch-all rules which remove remaining syntactic features in the final step of the conversion. Therefore, manual inspection is necessary to see whether the conversion has really processed all the input it was supposed to process. This limited evaluation scenario entails that the semantics developer has to think hard about defining the set of phenomena to be covered and about documenting precisely which types of syntactic phenomena the semantics is intended to assign an interpretation to. Therefore, in the rest of this section, we try to give a concrete overview of the type of phenomena that is covered by the English-German semantics.

3.2 A Parallel Testsuite

As a consequence of these considerations on the evaluation of the transfer semantic conversion, a central aspect of our development methodology is the composition of a testsuite of German sentences which represents the "core semantics" that our system covers. The multilingual perspective provided a major orientation for the composition of this testsuite. As the English semantics implicitly defines a set of core phenomena interpreted by the syntax-semantics interface, we have at our disposal a set of grammatical F-structure relations that receive a particular semantic representation. The developers of the English semantics had documented many "core" transfer rules (assuring the normalization and context embedding) with example phrases or sentences, such that one could easily reconstruct the type of phenomenon each transfer rule was intended to analyze. On the basis of this system documentation, we first conceived an English testsuite in which each sentence contained a construction that could be related to the application of a specific transfer rule.
Then, for each of the sentences, we selected a German sentence which exhibited the German counterpart of the phenomenon targeted in the English sentence. For instance, if a transfer rule for relative clauses fired on a given English sentence, the German sentence was translated such that it also contained a relative clause. As most of the test sentences target fairly general phenomena at the syntax-semantics interface, there was a parallel German realization of the construction in most cases. In cases where no straightforward parallel realization could be found, we tried to find a semantically parallel translation of the sentence. For instance, the English cleft construction exemplified by the following sentence from our testsuite does not have a syntactically parallel realization in German. In this case, the sentence was translated with a "semantic" equivalent that emphasizes the oblique argument.

(1) a. It is to the store that they went.
    b. Zum Markt sind sie gegangen.

During the development process, the test set was further extended. These extensions were due to cases where the English grammar assigns a uniform analysis to some constructions that the German grammar distinguishes. For instance, while the English grammar encodes oblique arguments the same way it encodes direct objects, the German grammar has a formally slightly different analysis, such that rules which fire on obliques in English do not fire on German input. The final parallel testsuite comprises 200 sentence pairs. The following enumeration lists the basic morpho-syntactic phenomena covered by our core semantics testsuite:

1. Sentence types (declaratives, interrogatives, quotations, etc.)
2. Coordination (of various phrase types)
3. Argument semantic role mapping, including argument realization normalization (depassivization etc.)
4. Sentential and verbal modification (discursive, propositional, temporal, etc.)
5. Nominal modification (measures, quantifiers, comparatives, etc.)
6. Tense and aspect
7. Appositions and titles
8. Clause embeddings, relative clauses, gerunds, etc.
9. Predicative and copula constructions
10. Topicalization

As it turns out, the abstract conception of LFG F-structure analysis already constitutes a major step towards semantic interpretation. Many global syntactic properties are explicitly represented as feature-value pairs, e.g. features for sentence type, mood, tense and aspect. Moreover, the F-structure already contains much information about, e.g., the type of nominal phrases (proper names, quantified phrases, etc.) or the types of modifiers (e.g. adverb types).

3.3 Parallel Core Semantics

The English core semantics developed by Crouch & King (2006) comprises 798 (ordered!) rewrite rules. As we hypothesized that a major part of the English rules would also apply to German F-structure input, we first copied all English transfer rules to the German semantics and then proceeded by manual error correction: for each German test sentence, we manually checked whether the transfer semantics produces an interpretation of the sentence which is parallel to the English analysis. If a mismatch was detected, the respective rules were changed or added in the German transfer rule set. To cover the 200 sentences in our parallel testsuite, 47 rewrite rules had to be changed out of the 798 rules which constitute the core English semantics. Of these 47 rules, 23 relate to real structural differences in the F-structure encoding of German and English.
The rest of the modifications are mainly due to renamings of the features or lexical items that are hard-coded in the transfer grammar. While in a more surface-oriented syntax it would hardly be possible to design largely parallel syntax-semantics interfaces for the range of phenomena listed in the last section, the surface independence of LFG F-structures ensures that a major part of the English core semantics straightforwardly applies to the German input. An impressive illustration of the language independence of LFG F-structure analyses in the ParGram grammars is the pair of analyses presented in Figure 3, produced by the semantic conversion for the example pair in (2).

(2) a. Wo hat Tom gestern geschlafen?
    b. Where did Tom sleep yesterday?

The representation for the German sentence was produced by running the English transfer semantics on German syntactic input. Although the word order of English and German questions is governed by distinct syntactic principles, the semantic representation of the German sentence is almost entirely correct, since the F-structure analyses abstract from the word-order differences. The only fault in the German representation in Figure 3 is the interpretation of the temporal adverb gestern 'yesterday'. The transfer rule for temporal verb modification did not fire because the adverb type features for English and German differ. Generally, three cases of divergences at the syntax-semantics interface for English and German were treated while porting the semantic conversion rules:

• Divergences at the level of F-structure encoding which basically amount to different namings of attributes (e.g. the case of temporal adverbials discussed in the preceding paragraph). This case can easily be detected by means of the parallel testsuite. The left-hand sides of the conversion rules then require minor adaptations.

• Divergences at the level of F-structure encoding which reflect different linguistic analyses of a parallel syntactic phenomenon. An example is the encoding of adverbs in German, which are assigned a semantic form that selects for a 'pro' subject, while English adverbs do not get a semantic form value (illustrated in Figure 4). This case can easily be detected by means of the parallel testsuite. The German F-structure was rewritten to be parallel to the English analysis, such that no additional semantic role figures in the semantic representation.

• Divergences at the level of syntax that should nevertheless yield a parallel semantic interpretation. An example is given in the sentence pair in (4). In the English sentence, the negation is syntactically an adjunct of the main verb and will be lifted to the main context. In German, however, negation can be syntactically deeply embedded, e.g. modifying another adverb, while still having semantic scope over the main verb. The corresponding F-structure analyses are illustrated in Figure 5. To assign a correct interpretation to the German F-structure, a new conversion rule needs to be written that lifts the scope of negation adverbials across several F-structure embeddings. Such cases of syntactic-semantic divergence are not guaranteed to be covered by just translating the English testsuite, but need to be added on the basis of monolingual knowledge. To ensure good coverage of the conversion, more future work needs to be invested in monolingual particularities.

(4) a. John wasn't seen anymore.
    b. John wurde nicht mehr gesehen.
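The scope-lifting rule mentioned for the third case can be pictured with a small Python toy that walks nested adjuncts and reattaches a negation to the clause level, as in example (4b). The nested-dictionary encoding and the attribute names are invented for this sketch and are far simpler than the actual F-structure input.

# Toy sketch of lifting a deeply embedded negation ("nicht" under "mehr")
# to the clause level; the encoding is invented for illustration.
def lift_negation(clause):
    """Move any ADJUNCT with PRED 'nicht', found at arbitrary depth,
    up to the top-level ADJUNCT list of the clause."""
    def collect(node):
        found, kept = [], []
        for adj in node.get("ADJUNCT", []):
            if adj.get("PRED") == "nicht":
                found.append(adj)
            else:
                found.extend(collect(adj))
                kept.append(adj)
        if "ADJUNCT" in node:
            node["ADJUNCT"] = kept
        return found

    negations = collect(clause)
    clause.setdefault("ADJUNCT", []).extend(negations)
    return clause

sentence_4b = {
    "PRED": "sehen",
    "SUBJ": {"PRED": "John"},
    "ADJUNCT": [{"PRED": "mehr", "ADJUNCT": [{"PRED": "nicht"}]}],
}
print(lift_negation(sentence_4b))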
3.4 Discussion

The crosslingual parallelism of the semantics presented in this paper is also due to the relatively coarse-grained level of representation, which interprets many phenomena prone to subtle crosslingual divergences (e.g. the interpretation of quantifiers or of tense and aspect) in terms of conventionalized predications. The actual semantic interpretation of these phenomena is deferred to later representation or processing layers, in this framework to the definition of entailment relations (Bobrow et al., 2007b).

Figure 3: Parallel semantic analyses for the sentence pair given in example (2)

(3) John singt schrecklich.
    'John sings terribly.'

Figure 4: German analysis of manner adverbials in the ParGram grammar

Figure 5: Non-parallel syntactic analyses for the sentence pair given in example (4)

Thus, we did not treat any type of divergence at the syntax-semantics interface of English and German where largely parallel syntactic analyses correspond to diverging semantic interpretations. Typical candidates for such semantic divergences would be the interpretation of tense and aspect across languages. While the flat representation we adopted in this work basically retains the morphosyntactic analysis coming from the F-structure (see also Figure 3), formal semantic theories usually adopt much more complex analyses of these phenomena. A very clear-cut example of such crosslingual differences is reported in the literature on aspect and modality (e.g. Bhatt, 2006). In languages that encode aspect in the morphological verb inflection system, modals seem to have different entailments according to their morphological aspect, like the perfective vs. the imperfective modal in French:

(5) Jean pouvait soulever cette table, mais il ne l'a pas fait.
    'Jean was able.IMP to lift this table, but he didn't do it.'

(6) Jean a pu soulever cette table, #mais il ne l'a pas fait.
    'Jean was able.PERF to lift this table, #but he didn't do it.'

Purely lexical entailment rules for this kind of complex interaction between aspect, modality and tense operators will be very hard to formulate. This points to a general trade-off between a representation that generalizes over many theoretical subtleties and a representation that does not capture certain generalizations which would lead to a more linguistically informed account of entailment relations. However, the formal semantics literature does not always discuss crosslingual differences on the level of entailments, so that differences between the German and English interpretation of tense and aspect seem to amount to very subtle distinctions not necessarily reflected in the entailments (as an example, see Arnim von Stechow's comparison of the English and German perfect (von Stechow, 2002)). Future work on the semantics presented in this paper will have to take such tensions into account and consider the general goals and applications of the semantic representation.

4 Conclusion

This work illustrates the positive practical implications of crosslingually parallel linguistic resources for large-scale linguistic engineering. Due to the abstract F-structure layer in LFG syntax and its parallel implementation in the ParGram project, further resources that build on F-structure representations can be ported to other languages very easily. Future research will have to investigate to what extent this also applies to more distant languages, such as Urdu and English for instance.
The paper also discussed some problematic aspects of the development of a large-scale semantic system. The crosslingual development perspective allowed us to define a set of core semantic phenomena covered by the representation. However, the rather flat representation might obstruct potential crosslingual differences in semantic interpretation. Future research has is needed to develop a more general development and evaluation methodology for the representation of meaning. References Bhatt, R. (2006). Covert Modality in Non-finite Contexts, volume 8 of Interface Explorations, chapter Ability Modals and their Actuality Entailments. Mouton de Gruyter. Bobrow, D. G., Cheslow, B., Condoravdi, C., Karttunen, L., King, T. H., Nairn, R., de Paiva, V., Price, C., & Zaenen, A. (2007a). PARC’s Bridge question answering system. In T. H. King & E. M. Bender, editors, Proceedings of the GEAF (Grammar Engineering Across Frameworks) 2007 Workshop, pages 13-15. Bobrow, D. G., Cheslow, B., Condoravdi, C., Karttunen, L., King, T. H., Nairn, R., de Paiva, V., Price, C., & Zaenen, A. (2007b). Precision-focused textual inference. In ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 28 - 29. Bos, J., Clark, S., Steedman, M., Curran, J. R., & Hockenmaier, J. (2004). Wide-coverage semantic representations from a CCG parser. In COLING ’04: Proceedings of the 20th international conference on Computational Linguistics, page 1240, Morristown, NJ, USA. Association for Computational Linguistics. Bresnan, J. (2000). Lexical-Functional Syntax. Blackwell, Oxford. Butt, M. & King, T. H. (2007). XLE and XFR: A Grammar Development Platform with a Parser/ Generator and Rewrite System. In International Conference on Natural Language Processing (ICON) Tutorial. Butt, M., Dyvik, H., King, T. H., Masuichi, H., & Rohrer, C. (2002). The Parallel Grammar Project. In Proceedings of COLING-2002 Workshop on Grammar Engineering and Evaluation, Taipei, Taiwan. Crouch, D. (1995). Packed Rewriting for Mapping Semantics to KR. In Proceedings of the International Workshop on Computational Semantics. Crouch, D., Dalrymple, M., King, T., Maxwell, J., & Newman, P. (2006). XLE Documentation. Crouch, R. & King, T. H. (2006). Semantics via F-Structure Rewriting. In M. Butt & T. H. King, editors, Proceedings of the LFG06 Conference. Dalrymple, M. (1999). Semantics and Syntax in Lexical Functional Grammar: The Resource Logic Approach . MIT Press, Cambridge, Mass. Dipper, S. (2003). Implementing and Documenting Large-Scale Grammars — German LFG. Ph.D. thesis, Universität Stuttgart, IMS. From Parallel Syntax Towards Parallel Semantics 169 Kim, R., Dalrymple, M., Kaplan, R. M., King, T. H., Masuichi, H., & Ohkuma, T. (2003). Multilingual Grammar Development via Grammar Porting . In ESSLLI 2003 Workshop on Ideas and Strategies for Multilingual Grammar Development . Parsons, T. (1990). Events in the Semantics of English. A Study in Subatomic Semantics, volume 19 of Current studies in linguistics series ; 19. MIT Pr., Cambridge, Mass. [u.a.]. Spreyer, K. & Frank, A. (2005). The TIGER 700 RMRS Bank: RMRS Construction from Dependencies. In Proceedings of LINC 2005, pages 1-10. Umemoto, H. (2006). Implementing a Japanese Semantic Parser Based on Glue Approach. In Proceedings of The 20th Pacific Asia Conference on Language, Information and Computation. von Stechow, A. (2002). German seit ‘since’ and the Ambiguity of the German Perfect. In I. Kaufmann & B. Stiebels, editors, More than Words: A Festschrift for Dieter Wunderlich, pages 393-432. 
Akademie Verlag, Berlin. Zinsmeister, H., Kuhn, J., Schrader, B., & Dipper, S. (2001). TIGER Transfer - From LFG Structures to the TIGER Treebank. Technical report, IMS, University of Stuttgart. Nominations for GSCL Award Using Linguistic Annotations in Statistical Machine Translation of Film Subtitles * Christian Hardmeier Fondazione Bruno Kessler Via Sommarive, 18 38123 Trento, Italia Abstract Statistical Machine Translation (SMT) has been successfully employed to support translation of film subtitles. We explore the integration of Constraint Grammar corpus annotations into a Swedish-Danish subtitle SMT system in the framework of factored SMT. While the usefulness of the annotations is limited with large amounts of parallel data, we show that linguistic annotations can increase the gains in translation quality when monolingual data in the target language is added to an SMT system based on a small parallel corpus. 1 Introduction In countries where foreign-language films and series on television are routinely subtitled rather than dubbed, there is a considerable demand for efficiently produced subtitle translations. Although it may seem that subtitles are not appropriate for automatic processing as a result of their literary character, it turns out that their typical text structure, characterised by brevity and syntactic simplicity, and the immense text volumes processed daily by specialised subtitling companies make it possible to produce raw translations of film subtitles with statistical methods quite effectively. If these raw translations are subsequently post-edited by skilled staff, production quality translations can be obtained with considerably less effort than if the subtitles were translated by human translators with no computer assistance. A successful Swedish-Danish Machine Translation system for subtitles, which has now entered into productive use, has been presented by Volk & Harder (2007). The goal of the present study is to explore whether and how the quality of a Statistical Machine Translation (SMT) system of film subtitles can be improved by using linguistic annotations. To this end, a subset of 1 million subtitles of the training corpus used by Volk and Harder was morphologically annotated with the DanGram parser (Bick, 2001). We integrated the annotations into the translation process using the methods of * Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 173-183. This paper summarises Christian Hardmeier’s Lizentiatsarbeit submitted to the University of Basel in 2008. An earlier version of the paper was published as “Hardmeier, C. & Volk, M. (2009). Using linguistic annotations in Statistical Machine Translation of film subtitles. In NODALIDA 2009 conference proceedings, pages 57-64, Odense.” 174 Christian Hardmeier factored Statistical Machine Translation (Koehn & Hoang, 2007) implemented in the widely used Moses software. After describing the corpus data and giving a short overview over the methods used, we present a number of experiments comparing different factored SMT setups. The experiments are then replicated with reduced training corpora which contain only part of the available training data. These series of experiments provide insights about the impact of corpus size on the effectivity of using linguistic abstractions for SMT. 
2 Machine Translation of Subtitles As a text genre, subtitles play a curious role in a complex environment of different media and modalities. They depend on the medium film, which combines a visual channel with an auditive component composed of spoken language and non-linguistic elements such as noise or music. Within this context, they render the spoken dialogue into written text, are blended in with the visual channel and displayed simultaneously as the original sound track is played back, which redundantly contains the same information in a form that may or may not be accessible to the viewer. In their linguistic form, subtitles should be faithful, both in contents and in style, to the film dialogue which they represent. This means in particular that they usually try to convey an impression of orality. On the other hand, they are constrained by the mode of their presentation: short, written captions superimposed on the picture frame. The characteristics of subtitles are governed by the interplay of two conflicting principles (Becquemont, 1996): unobtrusiveness (discrétion) and readability (lisibilité). In order to provide a satisfactory film experience, it is paramount that the subtitles help the viewers quickly understand the meaning of the dialogue without distracting them from enjoying the film. The amount of text that can be displayed at one time is limited by the area of the screen that may be covered by subtitles (usually no more than two lines) and by the minimum time the subtitle must remain on screen to ensure that it can actually be read. As a result, the subtitle text must be shortened with respect to the full dialogue text in the actors’ script. The extent of the reduction depends on the script and on the exact limitations imposed for a specific subtitling task, but may amount to as much as 30 % and reach 50 % in extreme cases (Tomaszkiewicz, 1993). As a result of this processing and the underlying considerations, subtitles have a number of properties that make them especially well suited for Statistical Machine Translation. Owing to their presentational constraints, they mainly consist of comparatively short and simple phrases. Current SMT systems, when trained on a sufficient amount of data, have reliable ways of handling word translation and local structure. By contrast, they are still fairly weak at modelling long-range dependencies and reordering. Compared to other text genres, this weakness is less of an issue in the Statistical Machine Translation of subtitles thanks to their brevity and simple structure. Indeed, half of the subtitles in the Swedish part of our parallel training corpus are no more than 11 tokens long, including two tokens to mark the beginning and the end of the segment and counting every punctuation mark as a separate token. A considerable number of subtitles only contains one or two words, besides punctuation, often consisting entirely of a few words of affirmation, negation or abuse. These subtitles can easily be translated by an SMT system that has seen similar examples before. The orientation of the genre towards spoken language also has some disadvantages for Machine Translation systems. It is possible that the language of the subtitles, influenced by characteristics of speech, contains unexpected features such as stutterings, word repetitions or renderings of Using Linguistic Annotations in Statistical Machine Translation of Film Subtitles 175 non-standard pronunciations that confuse the system. 
Such features are occasionally employed by subtitlers to lend additional colour to the text, but as they are in stark conflict with the ideals of unobtrusiveness and readability, they are not very frequent. It is worth noting that, unlike rule-based Machine Translation systems, a statistical system does not in general have any difficulties translating ungrammatical or fragmentary input: phrase-based SMT, operating entirely on the level of words and word sequences, does not require the input to be amenable to any particular kind of linguistic analysis such as parsing. Whilst this approach makes it difficult to handle some linguistic challenges such as long-distance dependencies, it has the advantage of making the system more robust to unexpected input, which is more important for subtitles. We have only been able to sketch the characteristics of the subtitle text genre in this paper. A detailed introduction, addressing also the linguistics of subtitling and translation issues, is provided by Díaz-Cintas & Remael (2007), and Pedersen (2007) discusses the peculiarities of subtitling in Scandinavia. 3 Constraint Grammar Annotations To explore the potential of linguistically annotated data, our complete subtitle corpus, both in Danish and in Swedish, was linguistically analysed with the DanGram Constraint Grammar (CG) parser (Bick, 2001), a system originally developed for the analysis of Danish for which there is also a Swedish grammar. Constraint Grammar (Karlsson, 1990) is a formalism for natural language parsing. Conceptually, a CG parser first produces possible analyses for each word by considering its morphological features and then applies constraining rules to filter out analyses that do not fit into the context. Thus, the word forms are gradually disambiguated, until only one analysis remains; multiple analyses may be retained if the sentence is ambiguous. The annotations produced by the DanGram parser were output as tags attached to individual words as in the following example: $- Vad [vad] <interr> INDP NEU S NOM @ACC> vet [veta] <mv> V PR AKT @FS-QUE du [du] PERS 2S UTR S NOM @<SUBJ om [om] PRP @<PIV det [den] <dem> PERS NEU 3S ACC @P< $? In addition to the word forms and the accompanying lemmas (in square brackets), the annotations contained part-of-speech (POS) tags such as INDP for “independent pronoun” or V for “verb”, a morphological analysis for each word (such as NEU S NOM for “neuter singular nominative”) and a tag specifying the syntactic function of the word in the sentence (such as @ACC> , indicating that the sentence-initial pronoun is an accusative object of the following verb). For some words, more fine-grained part-of-speech information was specified in angle brackets, such as <interr> for “interrogative pronoun” or <mv> for “verb of movement”. In our experiments, we used word forms, lemmas, POS tags and morphological analyses. The fine-grained POS tags and the syntax tags were not used. 176 Christian Hardmeier When processing news text, the DanGram parser achieves a precision of between 97.6 % (an Internet news source) and 98.5 % (newspaper text) for Danish morphological annotations (Bick, 2001). The actual precision obtained with our subtitle corpus is certainly lower, as the subtitles differ considerably from news text in terms of sentence structure, punctuation and style. 
In particular, the relatively frequent occurrence of grammatical sentences split into two subtitles harmed the performance of the parser by partially suppressing the context used to disambiguate words in the second half of the sentence. In most cases, however, the analyses provided by the parser seemed to be reasonable. 4 Factored Statistical Machine Translation Statistical Machine Translation formalises the translation process by modelling the probabilities of target language (TL) output strings T given a source language (SL) input string S, p ( T | S ) , and conducting a search for the output string ˆ T with the highest probability. In the Moses decoder (Koehn et al., 2007), which we used in our experiments, this probability is decomposed into a loglinear combination of a number of feature functions h i ( S, T ) , which map a pair of a source and a target language element to a score based on different submodels such as translation models or language models. Each feature function is associated with a weight λ i that specifies its contribution to the overall score: ˆ T = arg max T log p ( T | S ) = arg max T ∑ i λ i h i ( S, T ) The translation models employed in factored SMT are phrase-based. The phrases included in a translation model are extracted from a word-aligned parallel corpus (Koehn et al., 2003). The associated probabilities are estimated by the relative frequencies of the extracted phrase pairs in the same corpus. For language modelling, we used the SRILM toolkit (Stolcke, 2002); unless otherwise specified, 6-gram language models with modified Kneser-Ney smoothing were used. The SMT decoder tries to translate the words and phrases of the source language sentence in the order in which they occur in the input. If the target language requires a different word order, reordering is possible at the cost of a score penalty. The translation model has no notion of sequence, so it cannot control reordering. The language model can, but it has no access to the source language text, so it considers word order only from the point of view of TL grammaticality and cannot model systematic differences in word order between two languages. Lexical reordering models (Koehn et al., 2005) address this issue in a more explicit way by modelling the probability of certain changes in word order, such as swapping words, conditioned on the source and target language phrase pair that is being processed. In its basic form, Statistical Machine Translation treats word tokens as atomic and does not permit further decomposition or access to single features of the words. Factored SMT (Koehn & Hoang, 2007) extends this model by representing words as vectors composed of a number of features and makes it possible to integrate word-level annotations such as those produced by a Constraint Grammar parser into the translation process. The individual components of the feature vectors are called factors. In order to map between different factors on the target language side, the Moses decoder works with generation models, which are implemented as dictionaries and extracted from the targetlanguage side of the training corpus. They can be used, e. g., to generate word forms from lemmas Using Linguistic Annotations in Statistical Machine Translation of Film Subtitles 177 and morphology tags, or to transform word forms into part-of-speech tags, which could then be checked using a language model. 
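Stated without the layout noise above, the decision rule simply picks the candidate T that maximizes the weighted sum of feature-function scores, i.e. the sum over i of lambda_i * h_i(S, T). The Python toy below illustrates this combination for two invented feature functions over complete candidate strings, reusing the "mitt emot" example discussed in the next section. The probabilities and weights are made up, and a real decoder such as Moses of course searches over phrase segmentations rather than enumerating whole translations.

import math

# Toy illustration of the log-linear model: score whole candidate strings
# with a weighted sum of feature functions. All numbers are invented, and
# real decoders search over phrase segmentations instead of full candidates.
def h_translation(source, target):
    """Stand-in translation-model feature (log of a toy phrase probability)."""
    toy_phrase_table = {("mitt emot", "over for"): 0.6, ("mitt emot", "mit imod"): 0.1}
    return math.log(toy_phrase_table.get((source, target), 1e-6))

def h_language_model(source, target):
    """Stand-in language-model feature (log of a toy target-side probability)."""
    toy_lm = {"over for": 0.05, "mit imod": 0.001}
    return math.log(toy_lm.get(target, 1e-6))

FEATURES = {"tm": h_translation, "lm": h_language_model}
WEIGHTS = {"tm": 1.0, "lm": 0.7}   # in practice tuned by minimum error rate training

def best_translation(source, candidates):
    """argmax over candidates of sum_i lambda_i * h_i(source, candidate)."""
    return max(candidates,
               key=lambda t: sum(WEIGHTS[i] * FEATURES[i](source, t) for i in FEATURES))

print(best_translation("mitt emot", ["over for", "mit imod"]))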
5 Experiments with the Full Corpus We ran three series of experiments to study the effects of different SMT system setups on translation quality with three different configurations of training corpus sizes. For each condition, several Statistical Machine Translation systems were trained and evaluated. In the full data condition, the complete system was trained on a parallel corpus of some 900,000 subtitles with source language Swedish and target language Danish, corresponding to around 10 million tokens in each language. The feature weights were optimised using minimum error rate training (Och, 2003) on a development set of 1,000 subtitles that had not been used for training, then the system was evaluated on a 10,000 subtitle test set that had been held out during the whole development phase. The translations were evaluated with the widely used BLEU and NIST scores (Papineni et al., 2002; Doddington, 2002). The outcomes of different experiments were compared with a randomisation-based hypothesis test (Cohen, 1995). The test was two-sided, and the confidence level was fixed at 95 %. The results of the experiments can be found in table 1. The baseline system used only a translation model operating on word forms and a 6-gram language model on word forms. This is a standard setup for an unfactored SMT system. Two systems additionally included a 6-gram language model operating on part-of-speech tags and a 5-gram language model operating on morphology tags, respectively. The annotation factors required by these language models were produced from the word forms by suitable generation models. In the full data condition, both the part-of-speech and the morphology language model brought a slight, but statistically significant gain in terms of BLEU scores, which indicates that abstract information about grammar can in some cases help the SMT system choose the right words. The Table 1: Experimental results full data symmetric asymmetric BLEU NIST BLEU NIST BLEU NIST Baseline 53.67 % 8.18 42.12 % 6.83 44.85 % 7.10 Language models parts of speech 53.90 % 8.17 42.59 % 6.87 ◦ 44.71 % 7.08 morphology 54.07 % 8.18 42.86 % 6.92 44.95 % 7.09 Lexical reordering word forms 53.99 % 8.21 42.13 % 6.83 ◦ 44.72 % 7.05 lemmas 53.59 % 8.15 42.30 % 6.86 ◦ 44.71 % 7.06 parts of speech ◦ 53.36 % 8.13 42.33 % 6.86 ◦ 44.63 % 7.05 Analytical translation 53.73 % 8.18 42.28 % 6.90 46.73 % 7.34 BLEU score significantly above baseline (p < . 05 ) ◦ BLEU score significantly below baseline (p < . 05 ) 178 Christian Hardmeier improvement is small; indeed, it is not reflected in the NIST scores, but some beneficial effects of the additional language models can be observed in the individual output sentences. One thing that can be achieved by taking word class information into account is the disambiguation of ambiguous word forms. Consider the following example: Input: Ingen vill bo mitt emot en ismaskin. Reference: Ingen vil bo lige over for en ismaskine. Baseline: Ingen vil bo mit imod en ismaskin. POS/ Morphology: Ingen vil bo over for en ismaskin. Since the word ismaskin ‘ice machine’ does not occur in the Swedish part of the training corpus, none of the SMT systems was able to translate it. All of them copied the Swedish input word literally to the output, which is a mistake that cannot be fixed by a language model. However, there is a clear difference in the translation of the phrase mitt emot ‘opposite’. 
For some reason, the baseline system chose to translate the two words separately and mistakenly interpreted the adverb mitt, which is part of the Swedish expression, as the homonymous first person neuter possessive pronoun ‘my’, translating the Swedish phrase as ungrammatical Danish mit imod ‘my against’. Both of the additional language models helped to rule out this error and correctly translate mitt emot as over for, yielding a much better translation. Neither of them output the adverb lige ‘just’ found in the reference translation, for which there is no explicit equivalent in the input sentence. In the next example, the POS and the morphology language model produced different output: Input: Dåliga kontrakt, dålig ledning, dåliga agenter. Reference: Dårlige kontrakter, dårlig styring, dårlige agenter. Baseline: Dårlige kontrakt, dårlig forbindelse, dårlige agenter. POS: Dårlige kontrakt, dårlig ledelse, dårlige agenter. Morphology: Dårlige kontrakter, dårlig forbindelse, dårlige agenter. In Swedish, the indefinite singular and plural forms of the word kontrakt ‘contract(s)’ are homonymous. The two SMT systems without support for morphological analysis incorrectly produced the singular form of the noun in Danish. The morphology language model recognised that the plural adjective dårlige ‘bad’ is more likely to be followed by a plural noun and preferred the correct Danish plural form kontrakter ‘contracts’. The different translations of the word ledning as ‘management’ or ‘connection’ can be pinned down to a subtle influence of the generation model probability estimates. They illustrate how sensitive the system output is in the face of true ambiguity. None of the systems presented here has the capability of reliably choosing the right word based on the context in this case. In three experiments, the baseline configuration was extended by adding lexical reordering models conditioned on word forms, lemmas and part-of-speech tags, respectively. As in the language model experiments, the required annotation factors on the TL side were produced by generation models. The lexical reordering models turn out to be useful in the full data experiments only when conditioned on word forms. When conditioned on lemmas, the score is not significantly different from the baseline score, and when conditioned on part-of-speech tags, it is significantly lower. In this case, the most valuable information for lexical reordering lies in the word form itself. Lemma and part of speech are obviously not the right abstractions to model the reordering processes when sufficient data is available. Using Linguistic Annotations in Statistical Machine Translation of Film Subtitles 179 Another system, which we call the analytical translation system, was modelled on suggestions from the literature (Koehn & Hoang, 2007; Bojar, 2007). It used the lemmas and the output of the morphological analysis to decompose the translation process and use separate components to handle the transfer of lexical and grammatical information. In order to achieve this, the baseline system was extended with additional translation tables mapping SL lemmas to TL lemmas and SL morphology tags to TL morphology tags, respectively. In the target language, a generation model was used to transform lemmas and morphology tags into word forms. 
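A minimal sketch of this decomposed path, using the bröllopsfotona example discussed further below: when the word-form table has no entry, lemma and morphology are translated separately and a generation table produces the target word form. All table entries, including the morphological tag, are invented stand-ins for the models that Moses would learn from the corpus.

# Toy sketch of analytical translation as a back-off: prefer the word-form
# table, otherwise translate lemma and morphology separately and generate
# the target form. All entries (including the morph tag) are invented.
WORD_TABLE  = {"vist": "vist"}                                # word form -> word form
LEMMA_TABLE = {"bröllopsfoto": "bryllupsbillede"}             # lemma -> lemma
MORPH_TABLE = {"NEU PL DEF": "NEU PL DEF"}                    # morph tag -> morph tag
GENERATION  = {("bryllupsbillede", "NEU PL DEF"): "bryllupsbillederne"}

def translate(word, lemma, morph):
    """Prefer the word-form table; fall back to lemma + morphology translation."""
    if word in WORD_TABLE:
        return WORD_TABLE[word]
    tl_lemma, tl_morph = LEMMA_TABLE.get(lemma), MORPH_TABLE.get(morph)
    if tl_lemma and tl_morph:
        # generation model: produce the target word form from lemma and morphology
        return GENERATION.get((tl_lemma, tl_morph), tl_lemma)
    return word  # unknown word: copy the source form, as the baseline system does

print(translate("bröllopsfotona", "bröllopsfoto", "NEU PL DEF"))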
Previous results strongly indicate that this translation approach is not sufficient on its own; instead, the decomposed translation approach should be combined with a standard word form translation model so that one can be used in those cases where the other fails (Koehn & Hoang, 2007). This configuration was therefore adopted for our experiments. The analytical translation approach fails to achieve any significant score improvement with the full parallel corpus. Closer examination of the MT output reveals that the strategy of using lemmas and morphological information to translate unknown word forms works in principle, as shown by the following example: Input: Molly har visat mig bröllopsfotona. Reference: Molly har vist mig fotoene fra brylluppet. Baseline: Molly har vist mig bröllopsfotona. Analytical: Molly har vist mig bryllupsbillederne. In this sentence, there can be no doubt that the output produced by the analytical system is superior to that of the baseline system. Where the baseline system copied the Swedish word bröllopsfotona ‘wedding photos’ literally into the Danish text, the translation found by the analytical model, bryllupsbillederne ‘wedding pictures’, is both semantically and syntactically flawless. Unfortunately, the reference translation uses different words, so the evaluation scores will not reflect this improvement. The lack of success of analytical translation in terms of evaluation scores can be ascribed to at least three factors: Firstly, there are relatively few vocabulary gaps in our data, which is due to the size of training corpus. Only 1.19 % (1,311 of 109,823) of the input tokens are tagged as unknown by the decoder in the baseline system. As a result, there is not much room for improvement with an approach specifically designed to handle vocabulary coverage, especially if this approach itself fails in some of the cases missed by the baseline system: Analytical translation brings this figure down to 0.88 % (970 tokens), but no further. Secondly, employing generation tables trained on the same corpus as the translation tables used by the system limits the attainable gains from the outset, since a required word form that is not found in the translation table is likely to be missing from the generation table, too. Thirdly, in case of vocabulary gaps in the translation tables, chances are that the system will not be able to produce the optimal translation for the input sentence. Instead, an approach like analytical translation aims to find the best translation that can be derived from the available models, which is certainly a reasonable thing to do. However, when only one reference translation is used, current evaluation methods will not allow alternative solutions, uniformly penalising all deviating translations instead. While using more reference translations could potentially alleviate this problem, multiple references are expensive to produce and just not available in many situations. Consequently, there is a systematic bias against the kind of solutions analytical translation can provide: Often, the evaluation method will assign the same scores to untranslated gibberish as to valid attempts at translating an unknown word with the best means available. 180 Christian Hardmeier 6 Experiments with Reduced Corpora We tested SMT systems trained on reduced corpora in two experimental conditions. 
In the symmetric condition, the systems described in the previous section were trained on a parallel corpus of 9,000 subtitles, or around 100,000 tokens per language, only. This made it possible to study the behaviour of the systems with little data. In the asymmetric condition, the small 9,000 subtitle parallel corpus was used to train the translation models and lexical reordering models. The generation and language models, which only rely on monolingual data in the target language, were trained on the full 900,000 subtitle dataset in this condition. This setup simulates a situation in which it is difficult to find parallel data for a certain language pair, but monolingual data in the target language can be more easily obtained. This is not unlikely when translating from a language with few electronic resources into a language like English, for which large amounts of corpus data, even of the same text genre, may be readily available. The results of the experiments with reduced corpora follow a more interesting pattern. First of all, it should be noted that the experiments in the asymmetric condition consistently outperformed those in the symmetric condition. Evidently, Statistical Machine Translation benefits from additional data, even if it is only available in the target language. In comparison to the training sets used in most other studies, the training corpus of 9,000 segments or 100,000 tokens per language used in the symmetric experiments is tiny. Consequently, one would expect the translation quality to be severely impaired by data sparseness issues, making it difficult for the Machine Translation system to handle unseen data. This prediction is supported by the experiments: The scores are improved by all extensions that allow the model to deal with more abstract representations of the data and thus to generalise more easily. The highest gains in terms of BLEU and NIST scores result from the morphology language model, which helps to ensure that the TL sentences produced by the system are well-formed. Interestingly enough, the relative performance of the lexical reordering models runs contrary to the findings obtained with the full corpus. Lexical reordering models turn out to be helpful when conditioned on lemmas or POS tags, whereas lexical reordering conditioned on word forms neither helps nor hurts. This is probably due to the fact that it is more difficult to gather satisfactory information about reordering from the small corpus. The reordering probabilities can be estimated more reliably after abstracting to lemmas or POS tags. In the asymmetric condition, the same phrase tables and lexical reorderings as in the symmetric condition were used, but the generation tables and language models were trained on a TL corpus 100 times as large. The benefit of this larger corpus is obvious already in the baseline experiment, which is completely identical to the baseline experiment of the symmetric condition except for the language model. Clearly, using additional monolingual TL data for language modelling is an easy and effective way to improve an SMT system. Furthermore, the availability of a larger data set on the TL side brings about profound changes in the relative performance of the individual systems with respect to each other. The POS language model, which proved useful in the symmetric condition, is detrimental now. The morphology language model does improve the BLEU score, but only by a very small amount, and the effect on the NIST score is slightly negative. 
This indicates that the language model operating on word forms is superior to the abstract models when it is trained on sufficient data. Likewise, all three lexical reordering models hurt performance in the presence of a strong word form language model. Apparently, when the language model is good, nothing can be gained by having a doubtful reordering model trained on insufficient data compete against it. Using Linguistic Annotations in Statistical Machine Translation of Film Subtitles 181 The most striking result in the asymmetric condition, however, is the score of the analytical translation model, which achieved an improvement of impressive 1.9 percentage points in the BLEU score along with an equally noticeable increase of the NIST score. In the asymmetric setup, where the generation model has much better vocabulary coverage than the phrase tables, analytical translation realises its full potential and enables the SMT system to produce word forms it could not otherwise have found. In sum, enlarging the size of the target language corpus resulted in a gain of 2.7 percentage points BLEU on the baseline score of the symmetric condition, which is entirely due to the better language model on word forms and can be realised without linguistic analysis of the input. By integrating morphological analysis and lemmas for both the SL and the TL part of the corpus, the leverage of the additional data can be increased even further by analytical translation, realising another improvement of 1.9 percentage points, totalling 4.6 percentage points over the initial baseline. 7 Conclusion Subject to a set of peculiar practical constraints, the text genre of film subtitles is characterised by short sentences with a comparatively simple structure and frequent reuse of similar expressions. Moreover, film subtitles are a text genre designed for translation; they are translated between many different languages in huge numbers. Their structural properties and the availability of large amounts of data make them ideal for Statistical Machine Translation. The present report investigates the potential of incorporating information from linguistic analysis into an existing Swedish-Danish phrase-based SMT system for film subtitles (Volk & Harder, 2007). It is based on a subset of the data used by Volk and Harder, which has been extended with linguistic annotations in the Constraint Grammar framework produced by the DanGram parser (Bick, 2001). We integrated the annotations into the SMT system using the factored approach to SMT (Koehn & Hoang, 2007) as offered by the Moses decoder (Koehn et al., 2007) and explored the opportunities offered by factored SMT with a number of experiments, each adding a single additional component into the system. When a large training corpus of around 900,000 subtitles or 10 million tokens per language was used, the gains from adding linguistic information were generally small. Minor improvements were observed when using additional language models operating on part-of-speech tags and tags from morphological analysis. A technique called analytical translation, which enables the SMT system to back off to separate translation of lemmas and morphological tags when the main phrase table does not provide a satisfactory translation, afforded slightly improved vocabulary coverage. Lexical reordering conditioned on word forms also brought about a minor improvement, whereas conditioning lexical reordering on more abstract categories such as lemmas or POS tags had a detrimental effect. 
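To make the back-off behind analytical translation concrete, the following sketch shows the decision logic for a single token. It is not the factored Moses setup used in the experiments - the table names and the morphological tag are invented for illustration, and a real decoder interleaves these steps inside its search rather than applying them word by word:

def translate_token(word, lemma, morph, word_table, lemma_table, generation_table):
    # word_table:       source word form      -> target word form
    # lemma_table:      source lemma          -> target lemma
    # generation_table: (target lemma, morph) -> target word form
    # All three tables are hypothetical stand-ins for phrase/generation tables.
    if word in word_table:
        # Preferred path: a direct word-form translation exists.
        return word_table[word]
    if lemma in lemma_table:
        # Back-off: translate the lemma, then generate the surface form
        # from the translated lemma and the source morphological tag.
        target_lemma = lemma_table[lemma]
        generated = generation_table.get((target_lemma, morph))
        if generated is not None:
            return generated
    # Last resort: copy the source word unchanged (what the baseline does).
    return word

# Toy example mirroring the wedding-photo sentence above.
word_table = {"har": "har", "visat": "vist", "mig": "mig"}
lemma_table = {"bröllopsfoto": "bryllupsbillede"}
generation_table = {("bryllupsbillede", "PL.DEF"): "bryllupsbillederne"}

print(translate_token("bröllopsfotona", "bröllopsfoto", "PL.DEF",
                      word_table, lemma_table, generation_table))
# -> bryllupsbillederne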
On the whole, none of the gains was large enough to justify the cost and effort of producing the annotations. Moreover, there was a clear tendency for complex models to have a negative effect when the information employed was not selected carefully enough. When the corpus is large and its quality good, there is a danger of obstructing the statistical model from taking full advantage of the data by imposing clumsily chosen linguistic categories. Given sufficient data, enforcing manually selected categories which may not be fully appropriate for the task in question is not a promising approach. Better results could possibly be obtained if abstract categories specifically optimised for the task of modelling distributional characteristics of words were statistically induced from the corpus. 182 Christian Hardmeier The situation is different when the corpus is small. In a series of experiments with a corpus size of only 9,000 subtitles or 100,000 tokens per language, various manners of integrating linguistic information were consistently found to be beneficial, even though the improvements obtained were small. When the corpus is not large enough to afford reliable parameter estimates for the statistical models, adding abstract data with richer statistics stands to improve the behaviour of the system. Compared to the system trained on the full corpus, the effects involve a trade-off between the reliability and usefulness of the statistical estimates and of the linguistically motivated annotation, respectively; the difference in the results stems from the fact that the quality of the statistical models strongly depends on the amount of data available, whilst the quality of the linguistic annotation is about the same regardless of corpus size. The close relationship of Swedish and Danish may also have impact: For language pairs with greater grammatical differences, the critical corpus size at which the linguistic annotations we worked with stop being useful may be larger. Our most encouraging findings come from experiments in an asymmetric setting, where a very small SL corpus (9,000 subtitles) was combined with a much larger TL corpus (900,000 subtitles). A considerable improvement to the score was realised just by adding a language model trained on the larger corpus, which does not yet involve any linguistic annotations. With the help of analytical translation, however, the annotations could be successfully exploited to yield a further gain of almost 2 percentage points in the BLEU score. Unlike the somewhat dubious improvements in the other two conditions, this is clearly worth the effort, and it demonstrates that factored Statistical Machine Translation can be successfully used to improve translation quality by integrating additional monolingual data with linguistic annotations into an SMT system. References Becquemont, D. (1996). Le sous-titrage cinématographique: contraintes, sens, servitudes. In Y. Gambier, editor, Les transferts linguistiques dans les médias audiovisuels, pages 145-155. Presses universitaires du Septentrion, Villeneuve d’Ascq. Bick, E. (2001). En Constraint Grammar parser for dansk. In 8. Møde om udforskningen af dansk sprog, pages 40-50, Århus. Bojar, O. (2007). English-to-Czech factored Machine Translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 232-239, Prague. Cohen, P. R. (1995). Empirical methods for Artificial Intelligence. MIT Press, Cambridge (Mass.). Díaz-Cintas, J. & Remael, A. (2007). 
Audiovisual Translation: Subtitling, volume 11 of Translation Practices Explained. St. Jerome Publishing, Manchester. Doddington, G. (2002). Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the second International conference on Human Language Technology Research, pages 138-145, San Diego. Karlsson, F. (1990). Constraint Grammar as a framework for parsing running text. In COLING-90. Papers presented to the 13th International conference on Computational Linguistics, pages 168-173, Helsinki. Koehn, P. & Hoang, H. (2007). Factored translation models. In Conference on empirical methods in Natural Language Processing, pages 868-876, Prague. Using Linguistic Annotations in Statistical Machine Translation of Film Subtitles 183 Koehn, P., Och, F. J., & Marcu, D. (2003). Statistical phrase-based translation. In Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on Human Language Technology, pages 48-54, Edmonton. Koehn, P., Axelrod, A., et al. (2005). Edinburgh system description for the 2005 IWSLT speech translation evaluation. In International workshop on spoken language translation, Pittsburgh. Koehn, P., Hoang, H., et al. (2007). Moses: open source toolkit for statistical machine translation. In Annual meeting of the Association for Computational Linguistics: Demonstration session, pages 177-180, Prague. Och, F. J. (2003). Minimum error rate training in Statistical Machine Translation. In Proceedings of the 41st annual meeting of the Association for Computational Linguistics, pages 160-167, Sapporo (Japan). Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311-318, Philadelphia. ACL. Pedersen, J. (2007). Scandinavian subtitles. A comparative study of subtitling norms in Sweden and Denmark with a focus on extralinguistic cultural references. Ph.D. thesis, Stockholm University, Department of English. Stolcke, A. (2002). SRILM: an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing, Denver (Colorado). Tomaszkiewicz, T. (1993). Les opérations linguistiques qui sous-tendent le processus de sous-titrage des films. Wydawnictwo Naukowe UAM, Pozna ´ n. Volk, M. & Harder, S. (2007). Evaluating MT with translations or translators. What is the difference? In Proceedings of MT Summit XI, pages 499-506, Copenhagen. Robust Processing of Situated Spoken Dialogue * Pierre Lison Language Technology Lab German Research Centre for Artificial Intelligence (DFKI GmbH) Saarbrücken, Germany Abstract Spoken dialogue is notoriously hard to process with standard language processing technologies. Dialogue systems must indeed meet two major challenges. First, natural spoken dialogue is replete with disfluent, partial, elided or ungrammatical utterances. Second, speech recognition remains a highly error-prone task, especially for complex, open-ended domains. We present an integrated approach for addressing these two issues, based on a robust incremental parser. The parser takes word lattices as input and is able to handle ill-formed and misrecognised utterances by selectively relaxing its set of grammatical rules. The choice of the most relevant interpretation is then realised via a discriminative model augmented with contextual information. 
The approach is fully implemented in a dialogue system for autonomous robots. Evaluation results on a Wizard of Oz test suite demonstrate very significant improvements in accuracy and robustness compared to the baseline.

1 Introduction

Spoken dialogue is often considered to be one of the most natural means of interaction between a human and a robot. It is, however, notoriously hard to process with standard language processing technologies. Dialogue utterances are often incomplete or ungrammatical, and may contain numerous disfluencies like fillers (err, uh, mm), repetitions, self-corrections, fragments, etc. Moreover, even in the case where the utterance is perfectly well-formed and does not contain any kind of disfluencies, the dialogue system still needs to accommodate the various speech recognition errors that may arise. This problem is particularly acute for robots operating in real-world noisy environments and dealing with utterances pertaining to complex, open-ended domains. Spoken dialogue systems designed for human-robot interaction must therefore be robust to both ill-formed and ill-recognised inputs. In this paper, we present a new approach to address these two difficult issues. Our starting point is the work done by Zettlemoyer and Collins on parsing using relaxed CCG grammars (Zettlemoyer & Collins, 2007). In order to account for natural spoken language phenomena (more flexible word order, missing words, etc.), they augment their grammar framework with a small set of non-standard combinatory rules, leading to a relaxation of the grammatical constraints. A discriminative model over the parses is coupled with the parser, and is responsible for selecting the most likely interpretation(s) among the possible ones.

* Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 185-197.

In this paper, we extend their approach in two important ways. First, Zettlemoyer & Collins (2007) focused on the treatment of ill-formed input, ignoring the speech recognition issues. Our approach, however, deals with both ill-formed and misrecognised input, in an integrated fashion. This is done by augmenting the set of non-standard rules with new ones specifically tailored to deal with speech recognition errors. Second, we significantly extend the range of features included in the discriminative model, by incorporating not only syntactic, but also acoustic, semantic and contextual information into the model. An overview of the paper is as follows. We describe in Sect. 2 the architecture in which our system has been integrated. We then discuss the approach in Sect. 3. Finally, we present in Sect. 4 the evaluations on a WOZ test suite, and conclude.

2 Architecture

The approach we present in this paper is fully implemented and integrated into a cognitive architecture for autonomous robots (see Hawes et al. (2007)). It is capable of building up visuo-spatial models of a dynamic local scene, and of continuously planning and executing manipulation actions on objects within that scene. The robot can discuss objects and their material and spatial properties for the purpose of visual learning and manipulation tasks. Fig. 1 illustrates the architecture for the communication subsystem.
Figure 1: Architecture schema of the communication subsystem (only for comprehension)

Starting with speech recognition, we process the audio signal to establish a word lattice containing statistically ranked hypotheses about word sequences. Subsequently, parsing constructs grammatical analyses for the given word lattice. A grammatical analysis provides both a syntactic analysis of the utterance and a representation of its meaning. The analysis is based on an incremental chart parser¹ for Combinatory Categorial Grammar (Steedman & Baldridge, 2009). These meaning representations are ontologically richly sorted, relational structures, formulated in a (propositional) description logic, more precisely in HLDS (Baldridge & Kruijff, 2002). The parser then compacts all meaning representations into a single packed logical form (Carroll & Oepen, 2005; Kruijff et al., 2007). A packed logical form represents content similar across the different analyses as a single graph, using over- and underspecification of how different nodes can be connected to capture lexical and syntactic forms of ambiguity.

¹ Built using the OpenCCG API: http://openccg.sf.net

At the level of dialogue interpretation, the logical forms are resolved against an SDRS-like dialogue model (Asher & Lascarides, 2003) to establish co-reference and dialogue moves. Linguistic interpretations must finally be associated with extra-linguistic knowledge about the environment - dialogue comprehension hence needs to connect with other subarchitectures like vision, spatial reasoning or planning. We realise this information binding between different modalities via a specific module, called the “binder”, which is responsible for the ontology-based mediation across modalities (Jacobsson et al., 2008). Interpretation in context indeed plays a crucial role in the comprehension of an utterance as it unfolds. Human listeners continuously integrate linguistic information with scene understanding (foregrounded entities and events) and world knowledge. This contextual knowledge serves the double purpose of interpreting what has been said, and predicting/anticipating what is going to be said. Their integration is also closely time-locked, as evidenced by analyses of saccadic eye movements in visual scenes (Knoeferle & Crocker, 2006) and by neuroscience-based studies of event-related brain potentials (Van Berkum, 2004).

Figure 2: Context-sensitivity in processing situated dialogue understanding

Several approaches in situated dialogue for human-robot interaction have demonstrated that a robot’s understanding can be substantially improved by relating utterances to the situated context (Roy, 2005; Brick & Scheutz, 2007; Kruijff et al., 2007). By incorporating contextual information at the core of our model, our approach also seeks to exploit this important insight.

3 Approach

3.1 Grammar Relaxation

Our approach to robust processing of spoken dialogue rests on the idea of grammar relaxation: the grammatical constraints specified in the grammar are “relaxed” to handle slightly ill-formed or misrecognised utterances. Practically, the grammar relaxation is done via the introduction of non-standard CCG rules (Zettlemoyer & Collins, 2007).² We describe here three families of relaxation rules: the discourse-level composition rules, the ASR correction rules, and the paradigmatic heap rules (Lison, 2008).
2 In Combinatory Categorial Grammar, rules are used to assemble categories to form larger pieces of syntactic and semantic structure. The standard rules are application ( <, > ), composition ( B ), and type raising ( T ) (Steedman & Baldridge, 2009). 188 Pierre Lison 3.1.1 Discourse-level Composition Rules In natural spoken dialogue, we may encounter utterances containing several independent “chunks” without any explicit separation (or only a short pause or a slight change in intonation), such as “yes take the ball right and now put it in the box”. These chunks can be analysed as distinct “discourse units”. Syntactically speaking, a discourse unit can be any type of saturated atomic categories - from a simple discourse marker to a full sentence. The type-changing rule T du converts atomic categories into discourse units: A : @ i f ⇒ du : @ i f ( T du ) where A represents an arbitrary saturated atomic category ( s , np , pp , etc.). Rule T C then integrates two discourse units into a single structure: du : @ a x ⇒ du : @ c z / du : @ b y ( T C ) where the formula @ c z is defined as: @ { c: d-units } (list ∧ ( 〈 FIRST 〉 a ∧ x ) ∧ ( 〈 NEXT 〉 b ∧ y )) (1) 3.1.2 ASR Error Correction Rules Speech recognition is highly error-prone. It is however possible to partially alleviate this problem by inserting error-correction rules (more precisely, new lexical entries) for the most frequently misrecognised words. If we notice for instance that the ASR frequently substitutes the word “wrong” for “round” (because of their phonological proximity), we can introduce a new lexical entry to correct it: round adj : @ attitude (wrong) (2) A small set of new lexical entries of this type have been added to our lexicon to account for the most frequent recognition errors. 3.1.3 Paradigmatic Heap Rules The last family of relaxation rules is used to handle the numerous disfluencies evidenced in spoken language. The theoretical foundations of our approach can be found in Blanche-Benveniste et al. (1990); Guénot (2006), which offer an interesting perspective on the linguistic analysis of spoken language, based on an extensive corpus study of spoken transcripts. Two types of syntactic relations are distinguished: syntagmatic relations and paradigmatic relations. Syntagmatic constructions are primarily characterized by hypotactic (i.e. head-dependent) relations between their constituents, whereas paradigmatic ones do not have such head-dependent asymmetry. Together, constituents connected by such paradigmatic relations form what Blanche-Benveniste et al. (1990) calls a “paradigmatic heap”. A paradigmatic heap is defined as the position in a utterance where the “syntagmatic unfolding is interrupted”, and the same syntactic position hence occupied by several linguistic objects. Disfluencies can be conveniently analysed as paradigmatic heaps. Robust Processing of Situated Spoken Dialogue 189 Table 1: Example of grid analysis for three utterances containing disfluencies Example 1 Bob i’m at the uh south uh let’s say east-southeast rim of a uh oh thirty-meter crater Example 2 up on the uh Scarp and maybe three hundred err two hundred meters Example 3 it it probably shows up as a bright crater a bright crater on your map Consider the utterances in Table 1 3 . These utterances contain several hard-to-process disfluencies. The linguistic analysis of these examples is illustrated on two dimensions, the horizontal dimension being associated to the syntagmatic axis, and the vertical dimension to the paradigmatic axis. 
A vertical column therefore represents a paradigmatic heap. The disfluencies are indicated in bold characters. The rule T_PH is a type-changing rule which allows us to formalise the concept of paradigmatic heap in terms of a CCG rule, by “piling up” two constituents on a heap:

A : @_a x ⇒ A : @_c z / A : @_b y   (T_PH)

where the formula @_c z is defined as:

@_{c:heap-units}(heap ∧ (⟨FIRST⟩ a ∧ x) ∧ (⟨NEXT⟩ b ∧ y))   (3)

The category A stands for any category for which we want to allow this piling-up operation. For instance, the two heaps of example (3) are of category np.

3.2 Parse Selection

Using more powerful rules to relax the grammatical analysis tends to increase the number of parses. We hence need a mechanism to discriminate among the possible parses. The task of selecting the most likely interpretation among a set of possible ones is called parse selection. Once the parses for a given utterance are computed, they are filtered or selected in order to retain only the most likely interpretation(s). This is done via a (discriminative) statistical model covering a large number of features.

³ Transcript excerpts from the Apollo 17 Lunar Surface Journal [http://history.nasa.gov/alsj/a17/]

Formally, the task is defined as a function F : X → Y where X is the set of possible inputs (in our case, X is the space of word lattices), and Y the set of parses. We assume:
1. A function GEN(x) which enumerates all possible parses for an input x. In our case, the function represents the admissible parses of the CCG grammar.
2. A d-dimensional feature vector f(x, y) ∈ ℝ^d, representing specific features of the pair (x, y) (for instance, acoustic, syntactic, semantic or contextual features).
3. A parameter vector w ∈ ℝ^d.
The function F, mapping a word lattice to its most likely parse, is then defined as:

F(x) = argmax_{y ∈ GEN(x)} w^T · f(x, y)   (4)

where w^T · f(x, y) is the inner product ∑_{s=1}^{d} w_s f_s(x, y), and can be seen as a measure of the “quality” of the parse. Given the parameter vector w, the optimal parse of a given word lattice x can therefore be easily determined by enumerating all the parses generated by the grammar, extracting their features, computing the inner product w^T · f(x, y), and selecting the parse with the highest score. The task of parse selection is an example of a structured classification problem, which is the problem of predicting an output y from an input x, where the output y has a rich internal structure. In the specific case of parse selection, x is a word lattice, and y a logical form.

3.3 Learning

3.3.1 Training Data

To estimate the parameters w, we need a set of training examples. Since no corpus of situated dialogue adapted to our task domain is available to this day - let alone semantically annotated - we followed the approach advocated in Weilhammer et al. (2006) and generated a corpus from a handwritten task grammar. We first designed a small grammar covering our task domain, each rule being associated with an HLDS representation and a weight. Once specified, the grammar is then randomly traversed a large number of times, resulting in a large set of utterances along with their semantic representations.⁴ It is worth noting that, instead of annotating entire derivations, we only specify the resulting semantics of the utterance, i.e. its logical form. The training data is thus represented by a set of examples (x_i, z_i), where x_i is an utterance and z_i is an HLDS formula.
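To make the corpus generation just described more tangible, the following toy sketch mimics the idea of randomly traversing a small weighted grammar and pairing each generated utterance with a (here heavily simplified) logical form. The grammar, the weights and the semantic notation are invented for illustration; they are not the actual task grammar or the HLDS representations used in the system.

import random

# A toy weighted grammar in the spirit of the corpus generation described above.
# Each rule pairs a word sequence with a fragment of a simplified logical form.
VERBS   = [(("take",), "take", 0.6), (("pick", "up"), "take", 0.4)]
OBJECTS = [(("the", "ball"), "ball", 0.5), (("the", "red", "mug"), "mug", 0.5)]

def sample(rules):
    # Pick one expansion according to the rule weights.
    words, sem, _ = random.choices(rules, weights=[w for _, _, w in rules])[0]
    return words, sem

def generate_example():
    # Return one (utterance, logical form) training pair (x_i, z_i).
    verb_words, verb_sem = sample(VERBS)
    obj_words, obj_sem = sample(OBJECTS)
    utterance = " ".join(verb_words + obj_words)
    logical_form = f"@e:action({verb_sem} ^ <Patient>({obj_sem}))"
    return utterance, logical_form

random.seed(1)
for _ in range(3):
    print(generate_example())
# e.g. ('take the red mug', '@e:action(take ^ <Patient>(mug))')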
For a given training example (x_i, z_i), there may be several possible CCG parses leading to the same semantics z_i. The parameter estimation can therefore be seen as a hidden variable problem, where the training examples contain only partial information.

3.3.2 Perceptron Learning

The algorithm we use to estimate the parameters w using the training data is a perceptron. The algorithm is fully online: it visits each example in turn, in an incremental fashion, and updates w if necessary. Albeit simple, the algorithm has proven to be very efficient and accurate for the task of parse selection (Collins & Roark, 2004; Zettlemoyer & Collins, 2007).

⁴ Because of its relatively artificial character, the quality of such training data is naturally lower than what could be obtained with a genuine corpus. But, as the experimental results have shown, it remained sufficient for our purpose. In the near future, this generated training data will be progressively replaced by a real corpus of spoken dialogue transcripts.

The pseudo-code for the online learning algorithm is detailed in [Algorithm 1]. It works as follows: the parameters w are first initialised to arbitrary values. Then, for each pair (x_i, z_i) in the training set, the algorithm computes the parse y′ with the highest score according to the current model. If this parse happens to match the best parse associated with z_i (which we denote y*), we move to the next example. Else, we perform a perceptron update on the parameters:

w = w + f(x_i, y*) − f(x_i, y′)   (5)

The iteration on the training set is repeated T times, or until convergence. It is possible to prove that, provided the training set (x_i, z_i) is separable with margin δ > 0, the algorithm is assured to converge after a finite number of iterations to a model with zero training errors (Collins & Roark, 2004). See also Collins (2004) for convergence theorems and proofs.

3.4 Features

As we have seen, the parse selection operates by enumerating the possible parses and selecting the one with the highest score according to the linear model parametrised by w. The accuracy of our method crucially relies on the selection of “good” features f(x, y) for our model - that is, features which help discriminate the parses. In our model, the features are of four types: semantic features, syntactic features, contextual features, and speech recognition features.

3.4.1 Semantic Features

Semantic features are defined on substructures of the logical form. We define features on the following information sources: the nominals, the ontological sorts of the nominals, and the dependency relations (following Clark & Curran (2003)). The features on nominals and ontological sorts aim at modeling (aspects of) lexical semantics - e.g. which meanings are the most frequent for a given word -, whereas the features on relations and sequences of relations focus on sentential semantics - which dependencies are the most frequent. These features help us handle various forms of lexical and syntactic ambiguities.

3.4.2 Syntactic Features

Syntactic features are features associated with the derivational history of a specific parse. The main use of these features is to penalise to a correct extent the application of the non-standard rules introduced into the grammar. To this end, we include in the feature vector f(x, y) a new feature for each non-standard rule, which counts the number of times the rule was applied in the parse. In the derivation shown in Fig.
4, the rule T_PH (application of a paradigmatic heap to handle the disfluency) is applied once, so the corresponding feature value is set to 1. These syntactic features can be seen as a penalty given to the parses using these non-standard rules, thereby giving a preference to the “normal” parses over them. This mechanism ensures that the grammar relaxation is only applied “as a last resort” when the usual grammatical analysis fails to provide a full parse.

Figure 3: HLDS logical form for “I want you to take the mug”

Figure 4: CCG derivation for the utterance “take the ball the red ball”, containing a self-correction

Algorithm 1 Online perceptron learning
Require: set of n training examples {(x_i, z_i) : i = 1...n}
- T: number of iterations over the training set
- GEN(x): function enumerating the parses for an input x according to the grammar
- GEN(x, z): function enumerating the parses for an input x with semantics z
- L(y): maps a parse tree y to its logical form
- Initial parameter vector w_0

  % Initialise
  w ← w_0
  % Loop T times on the training examples
  for t = 1...T do
    for i = 1...n do
      % Compute best parse according to current model
      Let y′ = argmax_{y ∈ GEN(x_i)} w^T · f(x_i, y)
      % If the decoded parse ≠ expected parse, update the parameters
      if L(y′) ≠ z_i then
        % Search the best parse for utterance x_i with semantics z_i
        Let y* = argmax_{y ∈ GEN(x_i, z_i)} w^T · f(x_i, y)
        % Update parameter vector w
        Set w = w + f(x_i, y*) − f(x_i, y′)
      end if
    end for
  end for
  return parameter vector w

3.4.3 Contextual Features

As we already mentioned, one striking characteristic of spoken dialogue is the importance of context. Understanding the visual and discourse contexts is critical to resolve potential ambiguities and compute the most likely interpretation(s). The feature vector f(x, y) therefore includes various features related to the context:
• Activated words: our dialogue system maintains in its working memory a list of contextually activated words (cf. Lison & Kruijff (2008)). This list is continuously updated as the dialogue and the environment evolve. For each context-dependent word, we include one feature signaling its potential occurrence in the word lattice.
• Expected dialogue moves: for each dialogue move, we include one feature indicating if the move is consistent with the current discourse model. These features ensure for instance that the dialogue move following a QuestionYN is an Accept, Reject or another question (e.g. for clarification requests), but almost never an Opening.

3.4.4 Speech Recognition Features

Finally, the feature vector f(x, y) also includes features related to the speech recognition. The ASR module outputs a set of (partial) recognition hypotheses, packed in a word lattice. One example is given in Fig. 5. To favour the hypotheses with high confidence scores (which are, according to the ASR statistical models, more likely to reflect what was uttered), we introduce in the feature vector several acoustic features measuring the likelihood of each recognition hypothesis.

Figure 5: Example of word lattice

4 Evaluation

We performed a quantitative evaluation of our approach, using its implementation in a fully integrated system (cf. Sect. 2).
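For concreteness, Algorithm 1 above can be rendered in a few lines of code. The sketch below is illustrative only: GEN, GEN_constrained, L and the feature extractor are placeholders standing in for the CCG parser and the feature templates of Sect. 3.4, not components of the described implementation.

import numpy as np

def perceptron_train(examples, GEN, GEN_constrained, L, features, dim, T=10):
    # examples:              list of (x, z) pairs (word lattice, expected logical form)
    # GEN(x):                parses licensed by the grammar for input x
    # GEN_constrained(x, z): parses of x whose logical form equals z
    # L(y):                  logical form of parse y
    # features(x, y):        feature vector f(x, y) as a numpy array of length dim
    w = np.zeros(dim)                       # initial parameter vector w_0
    for _ in range(T):                      # T passes over the training set
        for x, z in examples:
            # best parse according to the current model
            y_pred = max(GEN(x), key=lambda y: w @ features(x, y))
            if L(y_pred) != z:              # decoded parse differs from the expected one
                # best parse among those yielding the expected semantics
                y_star = max(GEN_constrained(x, z),
                             key=lambda y: w @ features(x, y))
                w = w + features(x, y_star) - features(x, y_pred)
            # otherwise: no update, move on to the next example
    return w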
To set up the experiments for the evaluation, we have gathered a Wizard-of-Oz corpus of human-robot spoken dialogue for our task domain (Fig. 6), which we segmented and annotated manually with the expected semantic interpretations. The data set contains 195 individual utterances along with their complete logical forms. Three types of quantitative results are extracted from the evaluation results: exact-match, partial-match, and word error rate. Tables 2, 3 and 4 illustrate the results, broken down by use of grammar relaxation, use of parse selection, and number of recognition hypotheses considered. Each line in the tables corresponds to a possible configuration. Tables 2 and 3 give the precision, recall and F1 value for each configuration (for the exact and partial match, respectively), and Table 4 gives the Word Error Rate [WER].

Table 2: Exact-match accuracy results (in percent)
Size of word lattice (number of NBests) | Grammar relaxation | Parse selection | Precision | Recall | F1-value
(Baseline)       1 | No  | No  | 40.9 | 45.2 | 43.0
                 1 | No  | Yes | 59.0 | 54.3 | 56.6
                 1 | Yes | Yes | 52.7 | 70.8 | 60.4
                 3 | Yes | Yes | 55.3 | 82.9 | 66.3
                 5 | Yes | Yes | 55.6 | 84.0 | 66.9
(Full approach) 10 | Yes | Yes | 55.6 | 84.9 | 67.2

The baseline corresponds to the dialogue system with no grammar relaxation, no parse selection, and use of the first NBest recognition hypothesis. Both the partial- and exact-match accuracy results and the WER demonstrate statistically significant improvements over the baseline. We also observe that the inclusion of more ASR recognition hypotheses has a positive impact on the accuracy results.

5 Conclusions

We presented an integrated approach to the processing of (situated) spoken dialogue, suited to the specific needs and challenges encountered in human-robot interaction.

Figure 6: Wizard-of-Oz experiments for a task domain of object manipulation and visual learning

Table 3: Partial-match accuracy results (in percent)
Size of word lattice (number of NBests) | Grammar relaxation | Parse selection | Precision | Recall | F1-value
(Baseline)       1 | No  | No  | 86.2 | 56.2 | 68.0
                 1 | No  | Yes | 87.4 | 56.6 | 68.7
                 1 | Yes | Yes | 88.1 | 76.2 | 81.7
                 3 | Yes | Yes | 87.6 | 85.2 | 86.4
                 5 | Yes | Yes | 87.6 | 86.0 | 86.8
(Full approach) 10 | Yes | Yes | 87.7 | 87.0 | 87.3

Table 4: Word error rate (in percent)
Size of word lattice (NBests) | Grammar relaxation | Parse selection | Word Error Rate
 1 | No  | No  | 20.5
 1 | Yes | Yes | 19.4
 3 | Yes | Yes | 16.5
 5 | Yes | Yes | 15.7
10 | Yes | Yes | 15.7

In order to handle disfluent, partial, ill-formed or misrecognized utterances, the grammar used by the parser is “relaxed” via the introduction of a set of non-standard rules which allow for the combination of discourse fragments or the correction of speech recognition errors. The relaxed parser yields a (potentially large) set of parses, which are then retrieved by the parse selection module. The parse selection is based on a discriminative model exploring a set of relevant semantic, syntactic, contextual and acoustic features extracted for each parse. The outlined approach is currently being extended in new directions, such as the exploitation of parse selection during incremental parsing to improve the parsing efficiency (Lison, 2009), the introduction of more refined contextual features, or the use of more sophisticated learning algorithms, such as Support Vector Machines.

References

Asher, N. & Lascarides, A. (2003). Logics of Conversation. Cambridge University Press.
Baldridge, J. & Kruijff, G.-J. M. (2002). Coupling CCG and Hybrid Logic Dependency Semantics.
In ACL’02: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 319-326, Philadelphia, PA. Association for Computational Linguistics. Blanche-Benveniste, C., Bilger, M., Rouget, C., & van den Eynde, K. (1990). Le francais parlé : Etudes grammaticales. CNRS Editions, Paris. Brick, T. & Scheutz, M. (2007). Incremental Natural Language Processing for HRI. In Proceeding of the ACM/ IEEE international conference on Human-Robot Interaction (HRI’07), pages 263 - 270. Carroll, J. & Oepen, S. (2005). High Efficiency Realization for a Wide-Coverage Unification Grammar. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP’05), pages 165-176. Clark, S. & Curran, J. R. (2003). Log-linear models for wide-coverage CCG parsing. In Proceedings of the 2003 conference on Empirical methods in natural language processing, pages 97-104, Morristown, NJ, USA. Association for Computational Linguistics. Collins, M. (2004). Parameter estimation for statistical parsing models: theory and practice of distributionfree methods. In New developments in parsing technology, pages 19-55. Kluwer Academic Publishers. Collins, M. & Roark, B. (2004). Incremental parsing with the perceptron algorithm. In ACL ’04: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, page 111, Morristown, NJ, USA. Association for Computational Linguistics. Guénot, M.-L. (2006). Éléments de grammaire du francais: pour une théorie descriptive et formelle de la langue. Ph.D. thesis, Université de Provence. Hawes, N. A., Sloman, A., Wyatt, J., Zillich, M., Jacobsson, H., Kruijff, G.-J. M., Brenner, M., Berginc, G., & Skocaj, D. (2007). Towards an Integrated Robot with Multiple Cognitive Functions. In Proc. AAAI’07, pages 1548-1553. AAAI Press. Jacobsson, H., Hawes, N., Kruijff, G.-J., & Wyatt, J. (2008). Crossmodal content binding in informationprocessing architectures. In Proceedings of the 3rd ACM/ IEEE International Conference on Human-Robot Interaction (HRI), Amsterdam, The Netherlands. Knoeferle, P. & Crocker, M. (2006). The coordinated interplay of scene, utterance, and world knowledge: evidence from eye tracking. Cognitive Science. Kruijff, G.-J. M., Lison, P., Benjamin, T., Jacobsson, H., & Hawes, N. (2007). Incremental, multi-level processing for comprehending situated dialogue in human-robot interaction. In Language and Robots: Proceedings from the Symposium (LangRo’2007), pages 55-64, Aveiro, Portugal. Lison, P. (2008). Robust Processing of Situated Spoken Dialogue. Master’s thesis, Universität des Saarlandes, Saarbrücken. http: / / www.dfki.de/ ∼ plison/ pubs/ thesis/ main.thesis.plison2008.pdf. Lison, P. (2009). A Method to Improve the Efficiency of Deep Parsers with Incremental Chart Pruning. In Proceedings of the ESSLLI Workshop on Parsing with Categorial Grammars, Bordeaux, France. (in press). Lison, P. & Kruijff, G.-J. M. (2008). Salience-driven Contextual Priming of Speech Recognition for Human- Robot Interaction. In Proceedings of the 18th European Conference on Artificial Intelligence, Patras (Greece). Robust Processing of Situated Spoken Dialogue 197 Roy, D. (2005). Semiotic Schemas: A Framework for Grounding Language in Action and Perception. Artificial Intelligence, 167(1-2), 170-205. Steedman, M. & Baldridge, J. (2009). Combinatory Categorial Grammar. In R. Borsley & K. Börjars, editors, Nontransformational Syntax: A Guide to Current Models. Blackwell, Oxford. Van Berkum, J. (2004). 
Sentence comprehension in a wider discourse: Can we use ERPs to keep track of things? In M. Carreiras & C. C. Jr., editors, The on-line study of sentence comprehension: Eyetracking, ERPs and beyond, pages 229-270. Psychology Press, New York NY. Weilhammer, K., Stuttle, M. N., & Young, S. (2006). Bootstrapping Language Models for Dialogue Systems. In Proceedings of INTERSPEECH 2006, Pittsburgh, PA. Zettlemoyer, L. S. & Collins, M. (2007). Online Learning of Relaxed CCG Grammars for Parsing to Logical Form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 678-687. Ein Verfahren zur Ermittlung der relativen Chronologie der vorgotischen Lautgesetze * Roland Mittmann Johann Wolfgang Goethe-Universität Institut für Vergleichende Sprachwissenschaft Postfach 11 19 32, D-60054 Frankfurt am Main mittmann@em.uni-frankfurt.de Zusammenfassung Die Phoneme einer Sprache entwickeln sich nach Lautgesetzen, deren genaue Abfolge zwischen zwei aufeinanderfolgenden Sprachstufen allerdings häufig unklar ist. Wünschenswert wäre daher ein Programm, das alle möglichen Reihenfolgen der Lautgesetze in diesem Zeitraum testet und nur die korrekten ausgibt. Dieser Versuch ist im Folgenden am Beispiel der Entwicklung vom Urindogermanischen zum Gotischen beschrieben. Da die Zahl der Kombinationsmöglichkeiten jedoch sehr schnell anwächst, mussten aus Performanzgründen Lautgesetze zusammengefasst und der Untersuchungszeitraum verringert werden, um eine erfolgreiche Berechnung zu ermöglichen. Neben den Ergebnissen zweier exemplarischer Programmdurchführungen wird schließlich noch eine Idee für eine performantere Umsetzung dargestellt. 1 Einleitung Die historische Sprachwissenschaft sieht sich vielfach mit dem Problem konfrontiert, dass lautliche Entwicklungen zwischen zwei Entwicklungsstufen desselben sprachlichen Kontinuums durch die Beleglage nur unzureichend dokumentiert sind. So setzt etwa die Überlieferung der romanischen Sprachen erst im Mittelalter ein, als diese sich bereits weit vom Lateinischen entfernt haben. Da die Phoneme einer Sprache sich nach bestimmten Gesetzmäßigkeiten - den Lautgesetzen - verändern, besteht die klassische Vorgehensweise zur Ermittlung der Reihenfolge dieser Lautgesetze darin, sie an einer Vielzahl einzelner Wortformen zu testen, insbesondere solcher, die die von besonders vielen Lautgesetzen betroffen sind. Dieser Test lässt sich heutzutage allerdings auch mithilfe eines Computerprogramms durchführen - schneller und weniger fehleranfällig: Dazu benötigt das Programm ein doppeltes Wortkorpus - jede Wortform in den älteren und in der jüngeren Sprachstufe - sowie eine Liste der im * Erschienen in: C. Chiarcos, R. Eckart de Castilho, M. Stede (Hrsg.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, S. 199-209. Dieser Artikel ist eine Kurzfassung meiner samt dem Computerprogramm online veröffentlichten Magisterarbeit: “Mittmann, R. (2008). Ein computerbasiertes Verfahren zur Ermittlung der relativen Chronologie der vorgotischen Lautgesetze. Abschlussarbeit zur Erlangung des Magister Artium, Johann Wolfgang Goethe-Universität Frankfurt am Main. http: / / publikationen.ub.uni-frankfurt.de/ volltexte/ 2008/ 5724/ ” 200 Roland Mittmann dazwischenliegenden Zeitraum abgelaufenen Lautgesetze, also Suchausdruck und Ersetzungszeichenkette. 
Das Programm kann nun alle möglichen Reihenfolgen der Lautgesetze am Wortkorpus der älteren Sprachstufe testen, das so erzeugte Wortkorpus mit dem der jüngeren Sprachstufe abgleichen und die eine oder mehrere mögliche richtige Lautgesetzfolgen - also die, bei denen erzeugtes Wortkorpus und tatsächliches Wortkorpus der jüngeren Sprachstufe übereinstimmen - ausgeben. 2 Sprachliche Grundlagen 2.1 Der Begriff des Lautgesetzes Die Postulierung der Existenz von Lautgesetzen ist eng mit der Geschichte der Indogermanistik verknüpft. Nachdem William Jones Ende des 18. Jh. die Verwandtschaft des Sanskrit mit den europäischen Sprachen erkannt hatte, erforschten Rasmus Rask, Franz Bopp, Jacob Grimm und Karl Verner die historische Entwicklung der germanischen Sprachen, insbesondere die sog. Lautverschiebungen, denen sie jedoch noch keine ausnahmslose Geltung attestierten. Im Jahre 1878 veröffentlichten dann die Junggrammatiker Hermann Osthoff und Karl Brugman ihre These von der Ausnahmslosigkeit der Lautgesetze. Sie erklärten, dass alle Wörter, in denen ein der Lautbewegung unterworfener Laut unter gleichen Verhältnissen erscheine, ausnahmslos davon betroffen seien und dass die Richtung der Lautbewegung bei allen Angehörigen einer Sprachgemeinschaft dieselbe sei (Osthoff & Brugman, 1878). Da bei den Lautgesetzen vom Speziellen aufs Allgemeine geschlossen wird - vergleichbare Einzelentwicklungen werden zusammengefasst und für regelhaft abgelaufen erklärt -, lässt sich ihre Gültigkeit, wie die jedes induktiven Ansatzes, zwar nicht beweisen. Als Hypothesen haben sie jedoch so lange Bestand, bis sie widerlegt werden. Die Einführung des Begriffs des Phonems durch die in den zwanziger Jahren des 20. Jh. aufkommende Phonologie bestätigte noch einmal das Konzept des Lautgesetzes: Ändert sich ein Merkmal eines Phonems, so ist es wahrscheinlich, dass sich dieses Merkmal auch bei den anderen Phonemen ändert, die dasselbe Merkmal aufweisen. 2.2 Sprachliche Problemstellungen Jede Sprache ist fortwährend Veränderungen unterworfen, solange sie in Gebrauch ist. Es zeugt stets von einer gewissen Willkür, wenn man in ihrer Entwicklung einen Schnitt zieht und alles Vorangegangene und Folgende für nicht mehr zu der gewählten Sprachstufe gehörig erklärt. Hinzu kommt das Problem, dass jede Sprache an verschiedenen Orten und von verschiedenen Personen(gruppen) gesprochen wird. Den bestmöglichen Ausweg aus diesem Dilemma bietet daher ein in seiner Entstehungszeit möglichst eng begrenztes, auf einem einzelnen, von einem einzigen Autor verfassten Text basierendes Korpus. Die einzige Möglichkeit, Lautgesetze aufstellen zu können, die über jeden Zweifel an ihrer Gültigkeit erhaben sind, ist diejenige, für jeden einzelnen Fall ein eigenes Lautgesetz zu postulieren. Sobald man beginnt, die Entwicklung eines zweiten Wortes unter dem gleichen Lautgesetz zu subsumieren, besteht die Gefahr, dass auch die Entwicklung weiterer Wörter von diesem Gesetz betroffen sein müsste oder dem Gesetz gar widersprechen könnte. Auch hier kann die Festlegung eines klar begrenzten Textkorpus einen Ausweg darstellen. Will man ein Lautgesetz trotz einzelner widersprechender Fälle aufrechterhalten, so muss man das Postulat von deren Allgemeingültigkeit um eine Einschränkung durch Ausnahmen ergänzen, Verfahren zur Ermittlung relativer Chronologie 201 die gemeinhin mit dem Prinzip der Analogie zu begründen versucht werden. 
Vier weitere Phänomene, die bei der Aufstellung von Lautgesetzen hinderlich sein können, sind die bewusste Bewahrung von Wörtern vor lautgesetzlichen Veränderungen, die Entlehnung fremdsprachiger Wörter, die Neubildung von Wörtern aus bestehendem lexikalischem Material und die Aufgabe von Wörtern. Trifft nun also mindestens eines der fünf genannten Phänomene - Analogie, Lautkonservatismus, Entlehnung, Neubildung oder Wortaufgabe - auf ein im Untersuchungskorpus enthaltenes Wort zu, so ist es aus diesem auszuschließen. Ein weiterer Faktor, der die Untersuchung erschwert, ist die Frage der Aussprache. Die Alphabete, die bis ins 20. Jh. für die Aufzeichnung von Sprache verwendet wurden, waren i.d.R. nicht ausreichend differenziert, um alle phonetischen Unterschiede wiederzugeben. Schreibungsunterschiede auf synchroner und diachroner Ebene können hier aber ebenso Aufschlüsse geben wie Transkriptionen in andere Schriftsysteme oder Aussprachebeschreibungen zeitgenössischer Grammatiker. 2.3 Die Wahl zweier begrenzender Sprachstufen Da für eine Untersuchung wie die beabsichtigte ein Zeitraum ohne schriftliche Überlieferung den nötigen Anlass bot und diese Arbeit im Rahmen indogermanistischer Forschungen entstanden ist, lag es nahe, als jüngere Sprachstufe eine altindogermanische Sprache und als ältere Sprachstufe rekonstruiertes Gemeinurindogermanisch (im Folgenden: Urindogermanisch) zu wählen. Zwar ist das Urindogermanische in keiner Weise belegt, dennoch kann die Rekonstruktion vieler einzelner Wortformen nach knapp 150 Jahren der Forschung als relativ gesichert gelten. Die Tatsache, dass es sich dabei um eine rekonstruierte Sprachform handelt, ist sogar von gewissem Nutzen, da auch über ihr Phonemsystem ein hohes Maß an Einigkeit besteht. Das Gotische als gewählte jüngere Sprachstufe weist die in diesem Fall nützlichen Eigenschaften auf, dass es über eine eingrenzbare Zahl an belegten Wortformen verfügt und dass der größte Teil des belegten Gesamtkorpus aus einem einzigen umfangreichen Text besteht, einer Bibelübersetzung des Bischofs Wulfila. Um dennoch größtmögliche sprachliche Einheitlichkeit herzustellen, ist bei der Auswahl des Wortkorpus ausschließlich auf Wulfilas Bibelübersetzung zurückgegriffen worden. Ein Problem für die beabsichtigte Untersuchung stellt nur zuweilen die unsichere Aussprache einiger Grapheme dar. Da aber das Gotische mit einem eigenen, von der griechischen Schrift abgeleiteten Alphabet geschrieben wurde, bei dem phonematisch nicht benötigte Grapheme für Phoneme verwendet werden, die im Griechischen nicht existieren, kann davon ausgegangen werden, dass zumindest eine ausreichende Phonem-Graphem-Zuordnung vorliegt. 2.4 Die Lautgesetze Die Lautgesetze für die sprachlichen Veränderungen zwischen Urindogermanisch und Gotisch sind insbesondere aus Ringe (2006), Szulc (2002), Braune (1981) und Kieckers (1960) zusammengetragen worden, unter Zuhilfenahme von Meier-Brügger (2002), Schweikle (1996), Müller (2007) und Lindeman (1997). Aufgenommen wurden nur Lautgesetze, die auch im Gotischen belegt sind, in der einschlägigen Literatur allgemein anerkannt sind und nicht nur wenige Einzelwörter betreffen. Die Reihenfolge der Lautgesetze ist so gewählt, dass jedes der im Wortkorpus enthaltenen urindogermanischen Wörter diese durchlaufen kann, um dann die korrekte gotische Entsprechung zu liefern. 
Mag dies als Vorwegnahme der Aufgabe des Programms erscheinen, so sei darauf hingewiesen, dass es sich hierbei lediglich um eine der möglichen Reihenfolgen handelt und Zielsetzung des Programms ja sein soll, die verschiedenen möglichen Reihenfolgen zu ermitteln. 202 Roland Mittmann Nachfolgend nun einige Beispiele aus den 65 im Untersuchungszeitraum abgelaufenen Lautgesetzen: 2. Dehnung von auslautendem ¯ o Dehnung von / ¯ o/ im Wortauslaut zu überlangem (dreimorigem) schleiftonigem / ô/ 21. Grimms Gesetz II (2. Teil der Ersten Lautverschiebung) - Aufgabe der Stimmhaftigkeit der Mediae und Wandel zu Tenues, Zusammenfall mit den verbliebenen Tenues - teilweise regressive Kontaktassimilation vorangehender Mediae und Mediae aspiratae ebenfalls zu Tenues 31. Dissimilation von / mn/ zu / bn/ Regressive Dissimilation der Nasalsequenz / mn/ zur Folge von Plosiv und Nasal / bn/ 58. Spirantenverhärtung - Verlust des Stimmtons von / z/ im Wortauslaut sowie vor / t/ oder alveolarem Frikativ - Verlust des Stimmtons von hier frikativem / b/ oder / d/ - jedoch nur nach Vokal - in denselben Positionen 2.5 Das Wortkorpus Um die prinzipielle Möglichkeit der Ermittlung der tatsächlich möglichen Reihenfolgen von Lautgesetzen durch Anwendung sämtlicher theoretisch möglicher Reihenfolgen zu prüfen, soll bei der Untersuchung ein Wortkorpus von 75 Einträgen, das zumindest jeweils einen Beleg für die Gültigkeit sämtlicher der angeführten Lautgesetze liefert, genügen (vgl. Tabelle 1 mit Beispielen). Wenngleich es auf diese Weise an einer repräsentativen Zahl von Belegen für die Gültigkeit der Lautgesetze mangeln mag, so sei darauf hingewiesen, dass es, um Zweifel an der Gültigkeit der Lautgesetze tatsächlich ausschließen zu können, theoretisch nötig wäre, sämtliche belegten (und rein lautgesetzlich entwickelten) Wortformen ins Korpus aufzunehmen. Die gotischen Wortformen im Korpus wurden aus Streitberg (2000) ermittelt, zudem diente auch Streitberg (2000) zur Ermittlung im Gotischen belegter Wortformen. Von zusätzlichem Nutzen für die Rekonstruktion der urindogermanischen Formen waren darüber hinaus Holthausen (1934), Pokorny (1959), Mallory & Adams (2006) und Rix (2001). 3 Das Computerprogramm 3.1 Der Aufbau des Computerprogramms Dem Programm sollen zunächst die beiden angeführten Wortlisten übergeben werden, die eine davon in der älteren Sprachstufe - hier Urindogermanisch -, die andere in der jüngeren Sprachstufe - hier Gotisch. Hinzu kommen die in der Zwischenzeit gewirkt habenden Lautgesetze, die in ihrer Reihenfolge unabhängig voneinander sein müssen. Verfahren zur Ermittlung relativer Chronologie 203 Tabelle 1: Auszug aus dem Wortkorpus Nr. Gotische Form Urindogermanische Form Gotische Quelle 11. gilstra *g h éld h treh 2 Römer 13,6 27. faírra *pérh 2 eh 1 Lukas 14,32 57. fimf *pénk w e Johannes 6,13 58. figgrans *penk w róms Markus 7,33 75. °laílot *le-lóh 1 de Matthäus 8,15 Das Programm soll nun die Lautgesetze in sämtlichen Kombinationsmöglichkeiten auf die ältere Wortliste anwenden und die verschiedenen Ergebnisse speichern. Die Zahl der sämtlichen Kombinationsmöglichkeiten folgt dem Prinzip der Fakultät, sodass sich bei steigender Anzahl an Lautgesetzen schnell eine hohe Zahl an Kombinationsmöglichkeiten und somit auch eine lange Rechendauer des Prozessors ergeben wird. Daher soll zu Testzwecken zusätzlich noch die Möglichkeit geschaffen werden, die Lautgesetze in einer festen Reihenfolge auf das Wortkorpus anzuwenden. 
Hierzu soll die bereits bei der Beschreibung der Lautgesetze gewählte Reihenfolge dienen, da diese ja auf die Generierung korrekter Ergebnisse hin ausgerichtet ist. Erkennt das Programm unter den Kombinationsmöglichkeiten solche, die das tatsächliche Ergebnis zutage fördern, so sollen diese sämtlich ausgegeben werden, damit genau untersucht werden kann, inwieweit die einzelnen Lautgesetze in ihrer Reihenfolge voneinander unabhängig oder etwa auf eine bestimmte Position festgelegt sind. Findet sich keine Kombinationsmöglichkeit, die das richtige Ergebnis ausgibt, so soll es genügen, diejenigen fünf Möglichkeiten auszugeben, die dem richtigen Ergebnis am nächsten kommen. Die ausgegebenen Kombinationsmöglichkeiten sollen jeweils eine eigene Zeile einnehmen und zunächst eine laufende Nummer, dann den Anteil der richtigen Wortformen in Prozent, darauf eine Kurzform der Lautgesetze in der durchgeführten Reihenfolge und schließlich, sofern die Übereinstimmung der Wortformen nicht 100 % beträgt, eine Liste der falschen Wortformen enthalten. Die Wahl einer Programmiersprache fiel auf Perl, da diese Sprache speziell zur Verarbeitung von Textdateien konzipiert wurde und zugleich in vielen Fällen verschiedene Umsetzungsmöglichkeiten für ein bestimmtes Programmierproblem bietet. Die urindogermanische Wortliste wurde, durch Leerzeichen getrennt, als Skalar (Zeichenkette) angelegt, da Lautgesetze immer sämtliche Wörter betreffen. Das gotische Wortkorpus hingegen wurde als Array (Liste) angelegt, da ja das Resultat jedes einzelnen vom Programm veränderten Wortes mit dem tatsächlichen Ergebnis abgeglichen werden muss. Die Lautgesetze wurden als einfach verschachtelter Array umgesetzt: Jedes Array-Element enthält zunächst eine Kurzform des Lautgesetzes - zur Identifikation innerhalb der Sortierreihenfolge bei der Ausgabe -, gefolgt von der Beschreibung der sich verändernden Laute sowie, bei umgebungsabhängigen Lautgesetzen, auch der sie umgebenden Laute und anschließend der Beschreibung der geänderten Laute. Bei Lautgesetzen, die aus mehreren nacheinander abfolgenden Einzelprozessen bestehen, stehen die weiteren Einzelprozesse an den jeweils folgenden zwei Positionen. Die Kombination der Lautgesetze miteinander in sämtlichen möglichen Reihenfolgen geschieht mittels eines Permutationsalgorithmus in einer Subroutine, die sich rekursiv immer wieder selbst aufruft, bis alle möglichen Reihenfolgen der Lautgesetze durchgeführt worden sind. Sie bekommt die Anzahl der Lautgesetze übergeben, kann aber auch auf deren Inhalt zugreifen, da der verschach- 204 Roland Mittmann telte Array als globale Variable angelegt wurde. Es werden mehrere Kopien von der ursprünglichen urindogermanischen Wortliste angelegt, damit auch für Durchläufe nach Durchführung von Lautgesetzen die ursprüngliche Liste wieder verwendet werden kann. Die Berechnungen werden über mehrere Zählvariablen gesteuert, die sich bei ihrer Begrenzung an der übergebenen Anzahl der Lautgesetze orientieren. Da Perl es nicht ermöglicht, sämtliche Umgebungsbedingungen aus den Suchausdrücken auszulagern, müssen alle Lautgesetze so oft wiederholt werden, wie sie zu ändernde Phoneme vorfinden. Um die Lautgesetze zu Testzwecken in fester Reihenfolge anzuwenden, existiert eine alternative Subroutine, die im Wechsel mit der rekursiven Subroutine ein- und auskommentiert werden kann. 
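Zur Veranschaulichung des hier beschriebenen Kernablaufs - Lautgesetze als Suchen-und-Ersetzen-Paare, Permutation, Anwendung auf die ältere Wortliste und Abgleich mit der jüngeren - folgt eine stark vereinfachte Skizze. Sie ist nicht das beschriebene Perl-Programm, sondern eine Illustration in Python; Lautgesetze und Wortformen sind frei erfundene Platzhalter.

import re
from itertools import permutations

# Stark vereinfachte Beispiel-Lautgesetze als (Kurzform, Suchmuster, Ersetzung);
# die echten Lautgesetze und das Untersuchungskorpus sind hier nur angedeutet.
LAUTGESETZE = [
    ("e>i",  r"e",  "i"),   # Hebung von e zu i
    ("i#>Ø", r"i$", ""),    # Schwund von auslautendem i
]

AELTERE_STUFE  = ["gaste", "wini"]   # fiktive ältere Wortformen
JUENGERE_STUFE = ["gast",  "win"]    # fiktive jüngere Wortformen

def wende_an(woerter, reihenfolge):
    # Wendet die Lautgesetze in der angegebenen Reihenfolge auf alle Wörter an.
    ergebnis = list(woerter)
    for _, muster, ersetzung in reihenfolge:
        ergebnis = [re.sub(muster, ersetzung, wort) for wort in ergebnis]
    return ergebnis

# Alle Reihenfolgen testen, das erzeugte mit dem tatsächlichen jüngeren Korpus
# abgleichen und (wie im beschriebenen Programm) laufende Nummer, Trefferquote,
# Kurzformen und abweichende Wortformen ausgeben.
for nr, reihenfolge in enumerate(permutations(LAUTGESETZE), start=1):
    erzeugt = wende_an(AELTERE_STUFE, reihenfolge)
    falsch  = [e for e, z in zip(erzeugt, JUENGERE_STUFE) if e != z]
    quote   = 100 * (len(JUENGERE_STUFE) - len(falsch)) / len(JUENGERE_STUFE)
    kurz    = " ".join(k for k, _, _ in reihenfolge)
    print(f"{nr:07d}. {quote:5.1f} %  {kurz}  {' '.join(falsch)}")
# Ausgabe: nur die Reihenfolge "e>i i#>Ø" erreicht 100 %.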
Anschließend werden die noch als Skalare vorliegenden umgewandelten Wortlisten in Array-Elemente aufgeteilt und jedes Wort dann auf seine Übereinstimmung mit der tatsächlichen gotischen Wortliste hin verglichen, wobei die nicht übereinstimmenden Wortformen gespeichert und die Vergleichsergebnisse in prozentuale Anteile umgerechnet werden. Diese Prozentangabe wird dann mit den nach Reihenfolge der Durchführung sortierten Kurzformen der Lautgesetze sowie den nicht mit den tatsächlichen Ergebnissen übereinstimmenden Wortformen verkettet und die einzelnen Ergebnisse nach der größten Prozentzahl sortiert. Im letzten Teil wird schließlich die Ausgabe der Informationen wie oben beschrieben gesteuert und das Ergebnis des Programms in eine Datei geschrieben.

3.2 Durchführung des Computerprogramms

Um die Rechenzeit des fertigen Programms zu ermitteln, erweist sich das Auskommentieren von Lautgesetzen als hilfreich: Die Kombination von 65 Lautgesetzen miteinander ergibt 65! ≈ 8 · 10⁹⁰ Möglichkeiten und scheint somit gänzlich unberechenbar. Liegt die Rechenzeit auf einem Intel-Pentium-Laptop mit 1,73-GHz-Prozessor und 512 Mebibyte Arbeitsspeicher bei bis zu vier Lautgesetzen noch unter einer Sekunde, beträgt sie bei neun bereits mehrere Stunden. Versuche mit zehn Lautgesetzen scheitern dann am fehlenden Speicher. Daher empfiehlt sich ein Blick auf den Arbeitsspeicher eines solchen handelsüblichen Rechners: Dieser verfügt über einen x86-Prozessor, der einen Adressbus von 32 Bit aufweist. Damit können bis zu 2³² Byte = 4 Gibibyte = 4.194.304 Kibibyte an Speicherstellen adressiert werden. Das urindogermanische Wortkorpus hat eine Länge von 637 Zeichen, was einem Speichervolumen von 637 Byte entspricht. Bereits die Anfertigung einer Kopie dieser Liste hat einen Speicherbedarf von insgesamt mehr als einem Kibibyte zur Folge. Bei zehn Lautgesetzen ergeben sich 10! = 3.628.800 Kombinationsmöglichkeiten, für die ein Speicher von 4 Gibibyte deshalb nicht mehr ausreichen kann. Für Abhilfe sorgt ein Rechner mit x64-Prozessor und 64-Bit-Adressbus, der bis zu 2⁶⁴ Byte = 16 Exbibyte = 17.179.869.184 Gibibyte adressieren kann. Dieser Rechner benötigt etwa einen halben Tag zur Berechnung von zehn Lautgesetzen. Durch Multiplikation der vorherigen Zeiten mit der Anzahl der aktuell ausgewählten Lautgesetze ergibt sich, dass für elf Lautgesetze mehrere Tage Rechenzeit, für zwölf mehr als ein Monat zu erwarten wären. Es ist ausgeschlossen, dass selbst ein Großrechner 65! Kombinationsmöglichkeiten innerhalb einer realistischen Zeitspanne berechnen kann. Wie bereits angeführt, ergibt 65! ca. 8 · 10⁹⁰, und selbst bei einer Rechenzeit von einem Jahr müsste ein Rechner daher immer noch ca. 3 · 10⁸³ Kombinationsmöglichkeiten pro Sekunde ermitteln können. Nach Auskunft des Forschungszentrums Jülich ist dessen schnellster Supercomputer 2009 in der Lage, 10¹⁵ Rechenoperationen pro Sekunde durchzuführen. Daher bietet es sich als erste Maßnahme an, mehrere Lautgesetze zusammenzufassen. Im Folgenden werden auch Zusammenfassungen mehrerer Lautgesetze als einzelne Lautgesetze betrachtet, da sie für das Programm, aber auch für die Auswertung als solche zu gelten haben. Zunächst sollen Lautgesetze zusammengefasst werden, die in unmittelbarem zeitlichen Bezug zueinander stehen, danach solche, die die gleichen Laute in verschiedenen Umgebungen betreffen, ohne dass die davon betroffenen Lautumgebungen Hinweise auf unterschiedliche Wirkungszeitpunkte lieferten.
Anschließend sind Lautentwicklungen zu betrachten, die von ähnlicher Art sind und dabei kontextunabhängig und etwa zur gleichen Zeit abgelaufen sind. Schließlich bietet sich noch eine Zusammenfassung derjenigen germanisch-gotischen Lautgesetze an, die nur im Wortauslaut gewirkt haben, der sog. Auslautgesetze. Die Zahl der Lautgesetze lässt sich so von 65 auf 23 reduzieren.

Da eine weitere Zusammenfassung nicht sinnvoll erscheint, sollen in einem weiteren Schritt Lautgesetze aus der Untersuchung herausgenommen werden. Dabei ist zu beachten, dass auch die Wortkorpora dann entsprechend anzupassen sind. Da die Untersuchung sich ja in erster Linie auf die dem Gotischen vorausgehenden Lautgesetze beziehen soll, soll diese Ausklammerung nach Möglichkeit vor allem bei den frühen Lautgesetzen stattfinden. Einige Lautgesetze haben in mehreren indogermanischen Sprachzweigen gewirkt und sind daher kein zwingender Bestandteil der Untersuchung. Andere Lautgesetze sind dagegen aufgrund ihrer Kontextunabhängigkeit in ihrem genauen relativen Wirkungszeitpunkt kaum bestimmbar. Die verbleibenden 13 Lautgesetze zeigen allesamt, dass ihre relative Chronologie nicht beliebig sein kann. Da, wie beschrieben, eine Durchführung des Programms mit handelsüblichen Computern nur mit zehn Lautgesetzen möglich ist, ist die Ausklammerung dreier weiterer Lautgesetze vonnöten. Hierfür bieten sich schließlich Lautgesetze an, die nur fürs Gotische, nicht aber fürs Nord- und Westgermanische gelten und deren Wirkungszeitraum teilweise noch anhand von gotischen Wörtern in lateinischen und griechischen Quellen ermittelt werden kann.

Um das Wortkorpus in seiner Input- ebenso wie in seiner Output-Form anzupassen, muss bei den verbliebenen zehn Lautgesetzen (vgl. Tabelle 2) nun zunächst die Entscheidung getroffen werden, ob sie als vor oder als nach dem Untersuchungszeitraum des Programms abgelaufen betrachtet werden sollen. Für die Ergebnisse des Programms ist das nur insofern relevant, als dabei keine Wortformen angenommen werden dürfen, die nie existiert haben können.

Tabelle 2: Für die Untersuchung verbleibende Lautgesetze (Titel des Lautgesetzes, Nummer, Kurzform)
1. Erste Lautverschiebung (20-22): 1.LV
2. Verners Gesetz (23): F>B
3. Develarisierungen und Delabialisierungen von Labiovelaren samt den Kontraktionen zu Labiovelaren (31-37): Q>K/B
4. Konditionierte Hebungen von /e/ zu /i/ (38-39): e>i/_
5. Elisionen von Approximanten (40-41): J>Ø
6. Qualitativer Zusammenfall von /a/ und /o/ (42-44): a<>o
7. Nasalschwund mit Ersatzdehnung (45): Vnh>V:h
8. Auslautgesetze (47-56): C/V#>Ø#
9. Spirantenverhärtungen und -erweichungen (57-58): B<>P/F
10. Aufgabe des freien Akzents (59): ′>Ø

Mithilfe der alternativen Subroutine zur Anwendung der Lautgesetze in fester Reihenfolge können unter Auskommentierung von Lautgesetzen nun die neuen Input- und Output-Wortlisten ermittelt werden. Auch bei der eigentlichen Programmdurchführung müssen dann alle Lautgesetze außer den untersuchten auskommentiert bleiben.

3.3 Auswertung des Programmergebnisses

Die Ausgabedatei zeigt, dass das Programm aus den 10! = 3.628.800 Möglichkeiten zur Kombination der zehn ausgewählten Lautgesetze die dort dargestellten 28 möglichen Kombinationen ausgewählt hat, von denen die ersten drei in Abbildung 1 dargestellt sind.

0000001. 100 % e>i/_ 1.LV F>B Q>K/B J>Ø a<>o Vnh>V:h C/V#>Ø# B<>P/F ′>Ø -
0000002. 100 % e>i/_ 1.LV F>B Q>K/B J>Ø a<>o C/V#>Ø# Vnh>V:h B<>P/F ′>Ø -
0000003. 100 % e>i/_ 1.LV F>B Q>K/B J>Ø a<>o C/V#>Ø# B<>P/F Vnh>V:h ′>Ø -

Abbildung 1: Auszug aus der Ausgabedatei
Das Ergebnis belegt zunächst einmal, dass es prinzipiell möglich ist, die allein möglichen Reihenfolgen der zwischen zwei Sprachstufen, von denen die eine unmittelbar aus der anderen hervorgegangen ist, abgelaufenen Lautgesetze auf die eingangs beschriebene Weise zu ermitteln. Tabelle 3 stellt nun die durch das Programm ermittelten möglichen Reihenfolgen der zehn untersuchten Lautgesetze dar. In jedem Feld sind zunächst die für das Programm verwendeten und auch im Programmergebnis dargestellten Kurzformen der zusammengefassten Lautgesetze angegeben, gefolgt von den Nummern der betroffenen einzelnen Lautgesetze in runden Klammern. Der zeitliche Verlauf wird dabei auf der von links nach rechts verlaufenden Achse dargestellt. Wenn die genaue Reihenfolge von Lautgesetzen unerheblich ist, sind diese untereinander angeordnet. Dennoch müssen selbstverständlich sämtliche Lautgesetze durchlaufen werden.

Tabelle 3: Schematische Darstellung der Programmausgabe (erste Durchführung). Zeitliche Achse von links nach rechts: 1.LV (20-22), F>B (23), Q>K/B (24-30), a<>o (42-44), C/V#>Ø# (47-56), B<>P/F (57-58), ′>Ø (59); untereinander angeordnet: e>i/_ (38-39), J>Ø (40-41), Vnh>V:h (45).

3.4 Zweite Programmdurchführung

Um die Funktionalität des Programms zu belegen, soll eine zweite Programmdurchführung mit anderen Lautgesetzen erfolgen. Diesmal sollen sämtliche kontextabhängigen Lautgesetze vom Gotischen aus zurückblickend betrachtet berücksichtigt werden. Dabei ist auch das Ergebnis des ersten Durchlaufs von Nutzen: Die drei zuletzt ausgeklammerten Lautgesetze werden wieder aufgenommen und im Gegenzug drei in unmittelbarer zeitlicher Abfolge stehende früh abgelaufene Lautgesetze ausgeklammert. Nach Anpassung des Wortkorpus und Durchführung des Programms gibt das Programm 210 Kombinationsmöglichkeiten aus. Die möglichen Reihenfolgen zumindest einiger der untersuchten Lautgesetze sind also deutlich vielfältiger als bei der ersten Programmdurchführung. In der tabellarischen Darstellung (Tabelle 4) muss eine Lösung dafür gefunden werden, dass LG 62 vor LG 63 und LG 57-58 vor LG 59 abgelaufen sein muss - weitere Einschränkungen in der Reihenfolge der vier Lautgesetze gibt es jedoch nicht. Dieses Verhältnis soll durch die fehlenden senkrechten Linien dargestellt werden. Auch die tabellarische Darstellung zeigt deutlich, wie wenig genau die Reihenfolge auch der kontextabhängigen unter den unmittelbar vorgotischen Lautgesetzen mithilfe des Programms zu ermitteln ist. Zugleich demonstriert sie aber auch, dass diejenigen sieben Lautgesetze, die in beiden Untersuchungsdurchläufen betrachtet wurden, jeweils in der gleichen Reihenfolge aufeinanderfolgen müssen. Damit kann als belegt gelten, dass die Wahl des Urindogermanischen und des Gotischen als Bezugssprachen der Wahl zweier nur durch wenige Lautgesetze voneinander getrennter Sprachen kaum nachsteht.

Tabelle 4: Schematische Darstellung der Programmausgabe (zweite Durchführung). Zeitliche Achse von links nach rechts: e>i/_ (38-39), J>Ø (40-41), a<>o (42-44), C/V#>Ø# (47-56), aV>V (64-65); untereinander angeordnet: Vnh>V:h (45), e>i (62), i/u>æ/å (63), B<>P/F (57-58), ′>Ø (59).

3.5 Ein weitergehendes Anwendungskonzept

Dass es möglich ist, mithilfe des beschriebenen Programms die möglichen Reihenfolgen von bis zu zehn Lautgesetzen zu ermitteln, ist nun hinreichend gezeigt worden. Nachdem jedoch bisher stets eine maximalistische Lösung des Problems angestrebt wurde, wäre es auch denkbar, im Zuge eines minimalistischen Ansatzes immer nur jeweils zwei Lautgesetze auf ihre relative Chronologie hin zu betrachten.
Wiesen diese eine feste Reihenfolge auf, so gäbe das Programm nur ein Ergebnis aus; wäre die Reihenfolge zwischen beiden frei, so erhielte man zwei Ergebnisse. Die Zahl der Kombinationsmöglichkeiten lässt sich mithilfe des Binomialkoeffizienten (n über k) ermitteln. Im beschriebenen Fall (n = 65; k = 2) beträgt das Ergebnis 2.080 (= 65 · 64 / 2) und liegt damit durchaus im Rahmen des Umsetzbaren. Wie in Abschnitt 3.2 beschrieben, müssen allerdings bei der Selektion einer Anzahl von Lautgesetzen aus der Gesamtheit von 65 die Input- und Output-Wortlisten so angepasst werden, dass sie tatsächlich nur durch das Ablaufen der selektierten Lautgesetze voneinander getrennt sind. Hierbei ist auch die Entscheidung zu treffen, welche der verbleibenden Lautgesetze als vor und welche als nach dem Untersuchungszeitraum abgelaufen betrachtet werden sollen. Die theoretischen 2.080 Ergebnisse wiesen jedoch sicherlich eine gewisse Redundanz auf: Es muss nicht jedes Lautgesetz in Bezug zu jedem anderen gesetzt werden, um die insgesamt möglichen Reihenfolgen zu ermitteln. Unter dieser Prämisse wäre denkbar, diejenigen Programmabläufe, die keine Übereinstimmung von 100 % liefern, aus der Auswertung herauszunehmen und zudem die Entscheidung, welche Lautgesetze als vor und welche als nach dem Untersuchungszeitraum abgelaufen betrachtet werden sollen, trotz des damit verbundenen erhöhten Risikos, eine Übereinstimmung von nicht 100 % zu erhalten, vom Programm durch einen Automatismus mit zuvor festzulegenden Kriterien ermitteln zu lassen.

4 Ergebnisse und Ausblick

Die Untersuchung hat gezeigt, dass es prinzipiell möglich ist, mittels eines Computerprogramms die allein möglichen Reihenfolgen von Lautgesetzen zu ermitteln, wenn diese zusammen mit einem doppelten Wortkorpus in den vor sowie nach dem Untersuchungszeitraum vorliegenden Sprachstufen dem Programm übergeben werden. Dabei ist jedoch deutlich geworden, dass das Testen sämtlicher theoretisch möglicher Reihenfolgen durch das beschriebene Programm eine zu große Rechenzeit in Anspruch nähme. Um eine Berechnung zu ermöglichen, mussten daher Lautgesetze zusammengefasst oder ausgeklammert werden. Es ist zwar dafür Sorge getragen worden, die für die Sprachgeschichte wichtigsten und vermeintlich in zeitlicher Abhängigkeit vom Ablauf anderer Lautgesetze stehenden Lautgesetze auszuwählen; dennoch entbehrte diese Selektion nicht eines erheblichen Maßes an Subjektivität. Um die Ergebnisse daher noch präziser an den Stand der sprachhistorischen Forschung anzupassen, sollte zunächst die Untersuchung von Rechenmethoden fortgeführt werden, die in erheblich kürzerer Zeit als das beschriebene Programm zum Ergebnis führen. Eine Erfolg versprechende Methode könnte auch die zuletzt erörterte Betrachtung aller Kombinationsmöglichkeiten von zwei Lautgesetzen sein. Erst wenn die Möglichkeiten zur Verbesserung der Software voll ausgeschöpft sind, scheint ein detaillierter Abgleich der Programmresultate mit der einschlägigen Forschungsliteratur, der ja in der Arbeit nur angerissen wurde, angebracht. So wäre dann schließlich zu ermitteln, ob das Programm vielleicht bisher nicht aufgezeigte neue Aufschlüsse über die relative Chronologie der Lautgesetze zu bieten vermag.
Ein weiteres vor dem genannten Schritt noch umzusetzendes Desiderat wäre die Erstellung eines repräsentativen Wortkorpus. Schon eine Vergrößerung des Korpus auf tausend oder zweitausend Wörter würde das Maß an Verlässlichkeit der Formulierung ebenso wie der Reihenfolge der Lautgesetze noch deutlich anheben. Um eine deutliche Erhöhung der Rechenzeit zu vermeiden, wäre auch vorstellbar, ein mithilfe eines kleinen Korpus erhaltenes Programmergebnis im Anschluss noch einmal an einem großen Korpus zu testen. Über das Beschriebene und Eingeforderte hinaus wäre auch eine Verbesserung der Benutzerfreundlichkeit des Programms von Vorteil: Der Vorgang der Selektion bestimmter Lautgesetze sowie die darauf aufbauende Anpassung der Wortkorpora könnten automatisiert werden; in einem noch weitergehenden Schritt wäre eine Benutzeroberfläche zur Eingabe von Korpuseinträgen und Lautgesetzen denkbar. Auf diese Weise könnte schließlich ein für die Forschung nützliches Hilfsmittel entstehen, um die Entwicklung des Gotischen und anderer Sprachen in Zeiten ohne schriftliche Überlieferung fundiert zu erhellen.

Literatur

Braune, W. (1981). Gotische Grammatik. Mit Lesestücken und Wörterverzeichnis. Niemeyer, Tübingen. 19. Auflage, neu bearbeitet von Ernst Ebbinghaus.
Holthausen, F. (1934). Gotisches etymologisches Wörterbuch. Winter, Heidelberg.
Kieckers, E. (1960). Handbuch der vergleichenden gotischen Grammatik. Hueber, München. 2., unveränderte Auflage.
Lindeman, F. O. (1997). Introduction to the 'Laryngeal Theory'. Institut für Sprachwissenschaft der Universität Innsbruck, Innsbruck.
Mallory, J. P. & Adams, D. Q. (2006). The Oxford Introduction to Proto-Indo-European and the Proto-Indo-European World. Oxford University Press, Oxford.
Meier-Brügger, M. (2002). Indogermanische Sprachwissenschaft. De Gruyter, Berlin. Unter Mitarbeit von Matthias Fritz und Manfred Mayrhofer, 8., überarbeitete und ergänzte Auflage.
Müller, S. (2007). Zum Germanischen aus laryngaltheoretischer Sicht. De Gruyter, Berlin.
Osthoff, H. & Brugman, K. (1878). Morphologische Untersuchungen auf dem Gebiete der indogermanischen Sprachen. Erster Theil. Hirzel, Leipzig.
Pokorny, J. (1959). Indogermanisches etymologisches Wörterbuch. Francke, Bern. 2 Bände.
Ringe, D. (2006). A Linguistic History of English. Volume 1. From Proto-Indo-European to Germanic. Oxford University Press, Oxford/New York.
Rix, H., editor (2001). Lexikon der indogermanischen Verben. Reichert, Wiesbaden.
Schweikle, G. (1996). Germanisch-deutsche Sprachgeschichte im Überblick. Metzler, Stuttgart. 4. Auflage.
Streitberg, W., editor (2000). Die gotische Bibel. Der gotische Text sowie seine griechische Vorlage. Winter, Heidelberg. 7. Auflage.
Szulc, A. (2002). Geschichte des standarddeutschen Lautsystems. Edition Praesens, Wien.

2nd UIMA@GSCL Workshop

Programme Committee
• Sophia Ananiadou, University of Manchester, Great Britain
• Branimir K. Boguraev, IBM T.J. Watson Research Center, USA
• Philipp Cimiano, Delft University of Technology, Netherlands
• Anni R. Coden, IBM T.J. Watson Research Center, USA
• Richard Eckart de Castilho, Technische Universität Darmstadt, Germany
• Leo Ferres, University of Concepción, Chile
• Stefan Geißler, TEMIS GmbH, Germany
• Iryna Gurevych, Technische Universität Darmstadt, Germany
• Udo Hahn, Friedrich-Schiller-Universität Jena, Germany
• Nicolas Hernandez, Université de Nantes, France
• Dietmar Rösner, Universität Magdeburg, Germany
• Michael Tanenblatt, IBM T.J. Watson Research Center, USA
• Katrin Tomanek, Friedrich-Schiller-Universität Jena, Germany
• Graham Wilcock, University of Helsinki, Finland

Foreword from the Workshop Chairs

Originally started as an IBM initiative, the Unstructured Information Management Architecture (UIMA) has attracted the interest of an ever-growing community of researchers and developers whose efforts are devoted to imposing structure (in terms of multi-layered annotations) onto unstructured data objects, such as written or spoken language, audio or video data, adhering to strict software engineering principles. UIMA, in the meantime, has grown into an Apache incubator project 1 and has even become the subject of standardization efforts for semantic search and content analytics. 2 These advancements were paralleled by a number of scientific events which helped shape the UIMA community and provided fora for discussions centred around crucial UIMA topics. It all started with the first dedicated UIMA workshop held under the auspices of, at that time, the German Society for Linguistic Data Processing (Gesellschaft für Linguistische Datenverarbeitung - GLDV) in Tübingen, April 2007. 3 This inaugural national meeting was followed by the first fully international UIMA workshop collocated with the 6th "Language Resources and Evaluation Conference" (LREC 2008) in Marrakech, Morocco, May 2008. 4 Quite recently, the first "French-speaking Meeting around the Framework Apache UIMA" took place at the 10th "Libre Software Meeting" in Nantes in July 2009. 5 This volume now contains the proceedings from the 2nd UIMA workshop to be held under the auspices of the German Language Technology and Computational Linguistics Society (Gesellschaft für Sprachtechnologie und Computerlinguistik - GSCL) in Potsdam, October 1, 2009. From 13 submissions, the programme committee selected 5 full papers (eight proceedings pages) and 3 poster papers (four proceedings pages).

The organizers of the workshop wish to thank all people involved in this meeting - submitters of papers, reviewers, GSCL staff and representatives (in particular, Manfred Stede) - for their great support, rapid and reliable responses and willingness to act on very sharp time lines. We appreciate their enthusiasm and cooperation.

The members of the Organizing Committee:
Udo Hahn, Katrin Tomanek (JULIE Lab, FSU Jena)
Iryna Gurevych, Richard Eckart de Castilho (UKP Lab, TU Darmstadt)
August 2009

1 http://incubator.apache.org/uima/
2 http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=uima
3 http://incubator.apache.org/uima/gldv07.html
4 http://watchtower.coling.uni-jena.de/~coling/uimaws_lrec2008/
5 http://www.uima-fr.org/RMLL-cfp.html

LUCAS - A Lucene CAS Indexer *
Erik Faessler, Rico Landefeld, Katrin Tomanek, Udo Hahn
Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena
http://www.julielab.de

Abstract

LUCAS is a UIMA CAS consumer component which bridges the UIMA framework with the Lucene search engine library. LUCAS stores CAS data in a Lucene index and thus allows the results of collection processing to be exploited in a very efficient way. We describe LUCAS in detail and report on a large-scale application of LUCAS in the framework of SEMEDICO, a semantic search engine for the life sciences domain.

1 Introduction

As NLP systems increasingly operate on very large document collections, they generate massive amounts of linguistic meta data.
Hence, the requirement emerges to store analysis results in some kind of search engine index so that structured and efficient access is guaranteed. The UIMA framework 1 offers a solid middleware infrastructure for processing unstructured documents such that annotations at different linguistic levels can be generated. It lacks, however, suitable software tools for efficient storage and retrieval. This paper introduces LUCAS, a UIMA-compliant deployment component that stores UIMA processing results in a Lucene search engine index. LUCAS is highly flexible; a configuration file defines how the processing results are represented in the search index. LUCAS offers some useful add-ons, including hypernym expansion of indexed terms, index creation in parallelized processing scenarios, filtering of textual contents as a pre-processing step to indexing, and optimized ranking and highlighting facilities. LUCAS has recently been incorporated into the UIMA sandbox 2, which hosts analysis components and tooling around UIMA. All sandbox components are free to use and open-source-licensed under the Apache Software License. LUCAS can be downloaded from the sandbox website.

A similarly powerful and functionally related tool is part of the OmniFind 3 package. OmniFind's indexing component and LUCAS are both flexibly configurable by a configuration file which determines the appearance of the resulting search index. LUCAS employs the Lucene API, whereas OmniFind creates indexes using IBM's search and index API (SIAPI). 4 However, OmniFind is a purchasable product including much more functionality than only the indexer.

The rest of the paper is structured as follows: Section 2 introduces basic notions underlying UIMA and Lucene; LUCAS is described in detail in Section 3. In Section 4, SEMEDICO, a semantic search engine which builds upon UIMA text analytics and LUCAS indexing facilities, is presented as a use case for an application highly dependent on LUCAS.

* Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 217-224.
1 http://incubator.apache.org/uima/
2 http://incubator.apache.org/uima/sandbox.html
3 See IBM OmniFind Enterprise Edition V8.5.

2 Fundamental Notions Underlying UIMA and Lucene

This section introduces basic notions underlying UIMA and Lucene. A fundamental understanding of both is necessary to make full use of the features offered by LUCAS. For more detailed information, the reader is referred to UIMA's User Guide 5 and a textbook on Lucene (Gospodnetić & Hatcher, 2004).

2.1 UIMA

Apache UIMA is an Open Source framework for the processing of unstructured information with the ultimate goal of obtaining structured content as output. Written natural language text can be considered one example of unstructured information (from the perspective of computers, at least). Though UIMA can handle any type of unstructured information, including video or audio data, we here focus on textual data. Processing a collection of documents is done in a pipelined manner. Original documents are successively read by a CollectionReader from some input stream and then passed through several AnalysisEngine instances for linguistic analysis. Processing results can be deployed by a CASConsumer.
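As a minimal illustration of the CAS consumer role just described (this is not LUCAS itself), the following sketch iterates over all annotations in a CAS and dumps them to the console instead of feeding an index; it assumes the standard UIMA 2.x Java API, and the class name is invented.

```java
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.collection.CasConsumer_ImplBase;
import org.apache.uima.resource.ResourceProcessException;

// Illustrative CAS consumer: prints the type, offsets and covered text
// of every annotation found in the CAS handed over by the pipeline.
public class DebugDumpConsumer extends CasConsumer_ImplBase {

  @Override
  public void processCas(CAS cas) throws ResourceProcessException {
    FSIterator<AnnotationFS> it = cas.getAnnotationIndex().iterator();
    while (it.hasNext()) {
      AnnotationFS a = it.next();
      System.out.printf("%s [%d,%d]: %s%n",
          a.getType().getName(), a.getBegin(), a.getEnd(), a.getCoveredText());
    }
  }
}
```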
In the UIMA framework, AnalysisEngine instances add their processing results in the form of Annotation instances to the Common Analysis System (CAS). The CAS contains the original text of the document, which is subject to analysis, together with all Annotation instances incrementally added as processing results. An Annotation is of a particular type and has a set of attributes called features, e.g. information on its character offsets in the analyzed document. All Annotation instances of the same type are collected in a particular AnnotationIndex. Such an index is subsequently used to iterate over all Annotation instances of a specified type. Straightforward meta information might characterize some text span as a word or token, with its lemma and part-of-speech label each being a further feature. Features are not necessarily primitive data types but can also refer to another Annotation instance. An Annotation for an entity mention might have a feature enumerating all token Annotation instances covered by the same span.

UIMA is data-driven in that the components of a pipeline only need to exchange the CAS, which is thus used as a data storage and data transport facility. A CAS is initialized by the CollectionReader and then passed through the individual AnalysisEngine instances. This UIMA-specific data passing is illustrated in Figure 1. A CAS can contain several views of the document being analyzed. These views are implemented as Sofa (Subject of Analysis) instances, each of which is connected to a set of Sofa-specific AnnotationIndex instances.

Figure 1: Schematic view of a UIMA pipeline. CR stands for the CollectionReader, AE1 to AEn for a sequence of AnalysisEngine instances. C denotes a CASConsumer which feeds a database. A CAS object serves as data transport and storage facility.

4 http://www.ibm.com/developerworks/websphere/library/specs/0511_portal-siapi/0511_portal-siapi.html
5 http://incubator.apache.org/uima/documentation.html
6 http://lucene.apache.org

2.2 Lucene

Apache Lucene 6 is an Open Source, high-performance information retrieval library for generating and querying search engine indexes. As the central component for generating a search engine index, Lucene provides the IndexWriter. This class takes care of creating and populating a search index. The type of data written to such an index is a Lucene Document. Generally speaking, a Document constitutes a collection of Field instances, each of which represents some information about the original text document to be indexed. A Field can contain meta information such as the document's author name or creation date, which is not intended to be searchable, but also text input which has been indexed by text analytics (e.g., supplying token boundaries, POS or named entity tags) and might accordingly be queried. Prior to indexing, the Field text is converted into Lucene's final indexing representation, i.e., terms. To do so, text is first split into Lucene Token instances which can subsequently be refined or modified by a Lucene TokenFilter. When combining Lucene and UIMA, tokenization is typically done outside Lucene by respective AnalysisEngine instances. Further filtering might, however, still be necessary, e.g., stop word removal, to improve search performance. A Lucene index can consist of several searchable fields.
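To make the indexing side concrete, the following sketch shows how a document with two fields might be added to an index. It assumes a Lucene 2.x-era API (around version 2.4); the exact constructor signatures changed between releases, and the field names and contents are invented for the example rather than taken from LUCAS.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class IndexingSketch {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();                 // in-memory index, for illustration only
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
        true, IndexWriter.MaxFieldLength.UNLIMITED);    // create a fresh index

    Document doc = new Document();
    // stored, but not analyzed: bibliographic meta data
    doc.add(new Field("author", "Jane Doe", Field.Store.YES, Field.Index.NOT_ANALYZED));
    // analyzed full text, searchable term by term
    doc.add(new Field("text", "IL-2 signaling in T cells ...",
        Field.Store.YES, Field.Index.ANALYZED));

    writer.addDocument(doc);
    writer.close();
  }
}
```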
Given that previous analysis of the input text has found subsequences referring to different entity types, these subsequences can be stored in entity-type-related fields, thus allowing semantic (i.e., typed) search.

3 LUCAS

LUCAS automatically populates a Lucene search index by mapping Annotation instances stored in each document's CAS to Lucene Document instances according to a specific mapping definition. This allows documents to be retrieved not only by the information contained directly in the (unstructured) text itself but also by the meta data added as a result of executing the UIMA processing pipeline. LUCAS is highly flexible without the need to modify the source code. A mapping file specifies how each UIMA Annotation is mapped to a Lucene Field. In the following, we provide an in-depth description of LUCAS' core functionalities and its special features. A technical documentation of LUCAS, with further details and a reference for the mapping file, is also available.

3.1 Core Functionality of LUCAS

To store UIMA CAS data into a Lucene index, LUCAS traverses the CAS object's set of AnnotationIndex instances, looks for the Annotation types specified by the user, retrieves the associated text data and features, and feeds them into Lucene's IndexWriter. More technically, LUCAS transforms the document's UIMA Annotation instances into streams of Lucene Token instances which are then stored in Lucene Field instances. From an Annotation, the associated plain text, its direct feature values, as well as feature values contained in referenced Annotation instances can be extracted by LUCAS. The latter is realized through feature paths. When the feature value is another Annotation instance (which might again contain further Annotation instances), feature values anywhere in this hierarchy can be retrieved by the feature path. Assume we need to access city information stored in an "Address" Annotation which in turn is referenced by an "Affiliation" Annotation stored as a feature in an "Author" Annotation instance; the respective feature path is "affiliation.address", as illustrated in Figure 2.

<fields>
  <field name="cities" index="yes">
    <annotations>
      <annotation sofa="text" type="de.julielab.types.Author" featurePath="affiliation.address">
        <features>
          <feature name="city"/>
        </features>
      </annotation>
    </annotations>
  </field>
</fields>

Figure 2: A snippet of a mapping file showing the use of a feature path

If multiple features of an Annotation are to be indexed, a separate Lucene Token is generated for each value. When the content of several features should be stored in a single Token, LUCAS concatenates the feature values with a specified delimiter string. This concatenation is then stored in a single Token. By mapping AnnotationIndex instances to user-defined Lucene Field instances, LUCAS as one of its core functionalities supports Field-based search. That is, during search one can specify in which field a query should be evaluated. This is the key to semantic search. In a biomedical application (cf. Section 4), for example, it is often advantageous to restrict the search of a common, very ambiguous short form or acronym to its entity type, e.g., a cell or gene.

3.2 Additional Features of LUCAS

Token Filters. Mapping a CAS AnnotationIndex to a Lucene index Field results in a stream of Token instances. The contents of a stream can be controlled by token filters.
The abstract TokenFilter class is part of the Lucene API, and LUCAS offers a collection of predefined token filters, as listed in Table 1. In addition to streams of tokens generated on the basis of an AnnotationIndex, LUCAS is also able to apply Token filtering to all Token instances that are to be indexed in the same Lucene Field. The HypernymsFilter is of special importance for the use of LUCAS in semantic search engines. Given a text file specifying hypernyms of words that may occur, the HypernymsFilter additionally adds the respective hypernym whenever indexing a word occurring in the list. This enables the "upwards" use of taxonomic structures in a search application, as all documents can be found in which at least one hyponym of the queried word occurs. It is also possible to employ user-defined token filters by implementing a subclass of Lucene's TokenFilter and referencing this class by its fully qualified name in LUCAS' mapping file. No changes to LUCAS' source code are necessary.

Highlighting. The hit highlighting package available for Lucene allows text in the original document that matches a search query to be marked. If several Lucene Token instances refer to the same text span, the Lucene position increment of the concerned Token instances is adapted so that highlighting works properly. Assume, e.g., a text document is tokenized and tagged for named entities. Of course, the span of named entities overlaps with the span of tokens covering the same characters in the text. LUCAS is able to fill a single Field with Token instances from several AnnotationIndex instances. In addition, the tokens may be aligned. That is, the Token instances are sorted in ascending order by their offsets. This enables applications to make full use of the produced index, i.e., to employ Lucene's highlighting on all indexed tokens.

Multiple Sofa Support. As mentioned in Section 2.1, a CAS may contain different views of its document. Views are realized by Sofa instances. LUCAS offers full support for several Sofa instances within one CAS. Users can specify the Sofa to be considered in the mapping file.

Parallelized Processing. A UIMA pipeline can be parallelized by running multiple threads within one Java Virtual Machine. Typically, such a thread constitutes a single processing pipeline, and the single pipelines often share only one CollectionReader instance. To allow multiple instances of LUCAS to write to the same index without conflicts, instances of LUCAS within the same CPE can share an IndexWriter.

Table 1: Overview of predefined token filters in LUCAS
• AdditionFilter: Adds suffixes or prefixes to text tokens.
• HypernymsFilter: Adds hypernyms of a token term with the same offset and a position increment of 0.
• PositionFilter: Selects only the first or the last token of a token stream; all other tokens are discarded.
• ReplaceFilter: Allows tokens to be replaced.
• SnowballFilter: Integration of the Lucene Snowball stemmer. 7
• SplitterFilter: Splits token strings.
• ConcatenateFilter: Concatenates token strings with a certain delimiter string.
• StopwordFilter: Integration of the Lucene stop word filter.
• UniqueFilter: Filters string-identical tokens. The resulting token stream contains only tokens with unique strings.
• UpperCaseFilter: Turns the string of each token into upper case.
• LowerCaseFilter: Turns the string of each token into lower case.
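As an illustration of the user-defined filter mechanism mentioned above, the following sketch shows what a custom filter might look like against the older, Token-based Lucene TokenFilter API (pre-2.9); the class name and its upper-casing behaviour are invented for the example, and the exact method signatures varied across Lucene releases.

```java
import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical filter that upper-cases every term it sees, written
// against the old Token-returning TokenStream API of Lucene 2.x.
public class UpperCasingFilter extends TokenFilter {

  public UpperCasingFilter(TokenStream input) {
    super(input);
  }

  @Override
  public Token next() throws IOException {
    Token token = input.next();      // 'input' is the wrapped TokenStream
    if (token == null) {
      return null;                   // end of the stream
    }
    // emit a new token with the same offsets but an upper-cased term
    return new Token(token.termText().toUpperCase(),
        token.startOffset(), token.endOffset());
  }
}
```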
7 For information about Snowball see http://snowball.tartarus.org/

A shared IndexWriter is realized by the LucasIndexWriterProvider, which is in charge of creating an IndexWriter instance. This instance is then provided to each LUCAS instance through UIMA's ResourceManager. 8

4 SEMEDICO: LUCAS in Motion

SEMEDICO 9 is a broad-coverage semantic search engine targeting researchers from the life sciences and especially the biomedical domain. It provides deep semantic access to the contents of PubMed 10 abstracts. All semantic meta data accessible through SEMEDICO are automatically extracted by a comprehensive text mining engine built upon UIMA, utilizing mostly components of JCoRe (Hahn et al., 2008). Running on PubMed abstracts, this engine identifies terms from several crucial biomedical terminologies, including MeSH, 11 the Cell Ontology, 12 and UniProt, 13 as well as several biomedical entity types including, e.g., genes and proteins, different cell types and organisms. Through normalization of the extracted terms and entities, synonyms and spelling variants are also stored, allowing an exhaustive and "type-complete" search. The utilized terminologies contain hierarchies of varying depth (ranging from depth 2 to depth 10), thus making them suitable for browsing. Technically, this is accomplished by adding all parent terms to the search index. Bibliographic meta data already provided in the abstracts, such as author and journal names, is also added to the pool of meta data and thus made searchable.

The SEMEDICO search interface complements a classical ranked document list interface with the faceted search approach (Schneider et al., 2009). It currently contains about 20 categories (known as facets) with over 900,000 hierarchically organized concepts based on the text-mining-generated semantic meta data and the bibliographic meta data. The search interface allows the vast size of domain terminologies and category systems to be handled efficiently. SEMEDICO offers a collection of features supporting user-friendly search (Schneider et al., 2009). These features include:

• Automatic query term completion to map search terms directly to the controlled vocabulary,
• Category facets supplied at a dynamically modifiable level of depth to orient the user as to which area of terminology is currently being explored by the search,
• The consequences of term refinement are immediately displayed at the facet level by showing conceptually more specialized terms at the finer, drilled-down level of granularity,
• For the textual context of retrieved documents, matched terms (including synonyms and spelling variants) are highlighted.

8 See UIMA's User Guide for more detailed information on the ResourceManager.
9 http://www.semedico.de
10 http://www.ncbi.nlm.nih.gov/pubmed/ is the premier bibliographic database in the life sciences, with more than 18M bibliographic entries.
11 MeSH (http://www.nlm.nih.gov/mesh/) is a comprehensive and well-curated terminology with approximately 25,000 biomedical terms and 140,000 (non-curated) chemical substances.
12 https://lists.sourceforge.net/lists/listinfo/obo-cell-type
13 UniProt (http://www.uniprot.org) is the most comprehensive terminology for proteins across several species (350,000 entries).
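To illustrate the kind of typed, field-restricted search such an index supports, the following sketch builds a query that confines an ambiguous short form to a dedicated entity-type field. The field names ("genes", "text") and the query terms are invented for the example, and the code uses the plain Lucene query classes rather than any SEMEDICO-specific API.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class TypedQuerySketch {
  public static void main(String[] args) {
    // Interpret "il2" as a gene/protein mention rather than as free text ...
    TermQuery geneQuery = new TermQuery(new Term("genes", "il2"));
    // ... and additionally require a free-text term in the same document.
    TermQuery textQuery = new TermQuery(new Term("text", "immune"));

    BooleanQuery query = new BooleanQuery();
    query.add(geneQuery, BooleanClause.Occur.MUST);
    query.add(textQuery, BooleanClause.Occur.MUST);

    System.out.println(query);   // prints something like: +genes:il2 +text:immune
  }
}
```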
Figure 3: SEMEDICO in action: Searching for "Aldesleukin"

LUCAS was developed to efficiently make the results of text mining PubMed abstracts available for SEMEDICO through a search engine index. Figure 3 shows a screenshot of a search example. The original, user-defined query contains only the term "aldesleukin". "Aldesleukin" is an alternative name for "interleukin-2", a signaling molecule of the immune system. 14 As a search result, SEMEDICO lists all papers which refer to this protein, including those that contain spelling variants or synonyms of "aldesleukin". To the right, retrieved papers are shown, including title and author information, as well as a snippet of text where the relevant search term is highlighted. The left side shows facets which apply to this query. For all facets, currently the most general term is selected. Further refinement of the query can be done by selecting sub-terms. The first facet ("Genes and Proteins"), e.g., indicates that currently all papers containing mentions of "aldesleukin" (IL2) of any organism are shown. Technically speaking, this is where the hypernym filter comes into play. Restricting this protein to specific organisms, e.g., Homo sapiens, is achieved by selecting this sub-term from the facet.

5 Conclusions

We have described LUCAS, a piece of software that bridges two software worlds, viz. UIMA for text analytics and Lucene for information retrieval. The gain from this docking is a highly efficient indexing vehicle that allows large-scale semantic retrieval within the context of the UIMA framework. Also, the provision of advanced query features, e.g., frequency-truncated facets, would have been almost impossible without a sophisticated indexing solution. Another concern driving the development of LUCAS has been its proper integration into the Open Source community, which has been achieved by making LUCAS part of the UIMA sandbox.

14 Check http://www.uniprot.org/uniprot/P60568 for additional alternative names.

Acknowledgements

This work was partially funded by the German Ministry of Education and Research (BMBF) within the eScience funding framework under grant no. 01DS001 ("Wissensmanagement für die Stammzellbiologie").

References

Gospodnetić, O. & Hatcher, E. (2004). Lucene in Action. Manning.
Hahn, U., Buyko, E., Landefeld, R., Mühlhausen, M., Poprat, M., Tomanek, K., & Wermter, J. (2008). An Overview of JCoRe, the JULIE Lab UIMA Component Repository. In Proceedings of the LREC'08 Workshop 'Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP', pages 1-7.
Schneider, A., Landefeld, R., Wermter, J., & Hahn, U. (2009). Do users appreciate novel interface features for literature search? A user study in the life sciences domain. In SMC 2009 - Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics.

Multimedia Feature Extraction in the SAPIR Project *
Aaron Kaplan 1, Jonathan Mamou 2, Francesco Gallo 3, and Benjamin Sznajder 2
1 Xerox Research Centre Europe, 6 chemin de Maupertuis, 38240 Meylan, France
2 IBM Haifa Research Lab., 31905 Haifa, Israel
3 EURIX Group R&D Department, 26 via Carcano, 10153 Torino, Italy

Abstract

SAPIR is a peer-to-peer multimedia information retrieval system that can index structured and unstructured text, still and moving images, speech, and music.
The system's feature extraction component, which analyzes documents to prepare them for indexing, is implemented using UIMA. It handles compound documents using an architecture of (potentially nested) splitters and mergers within a UIMA aggregate. For example, the moving image from a video is split into a number of representative video frames, each of which is processed by the same analysis engine used for still images, and the results are then merged to form a unified representation of the video. The output of the feature extraction module is a document description in a representation based on the MPEG-7 standard.

1 Introduction

The SAPIR project 1 brought together nine industrial and academic partners to build a peer-to-peer multimedia search engine that supports search over audio (both speech and music), video, still images, and text, using the query-by-example paradigm. For example, a snapshot taken with a mobile phone can be used to search for images and videos of similar objects, or a short audio recording of some music can be used to search for performances of the same musical work. Multimedia documents, by definition, contain more than one type of information, and the SAPIR query mechanism supports queries over multiple types. For example, one could search for videos that have images similar to a given snapshot and contain given words in the audio.

We used UIMA to implement SAPIR's feature extraction component. This component takes a multimedia document as input, and returns a description of the document in a representation based on the MPEG-7 standard (see Section 3). The description contains features extracted from the different media in the document, and it is used by the indexing component to insert the document into a peer-to-peer distributed index. The feature extraction system breaks down a compound multimedia document (e.g. a video) into its component parts (e.g. the video's image frames and audio track), and routes the parts to media-specific analysis engines that do the feature extraction. These analysis engines include:

• video: processes video recordings; performs shot detection, generates a representation of the video's shot structure, and identifies a representative frame for each shot to be processed by the image annotator.
• image: processes both still images and video frames; extracts five MPEG-7 visual descriptors (ScalableColor, ColorStructure, ColorLayout, EdgeHistogram, HomogeneousTexture).
• music: processes audio recordings and MIDI files; extracts representations of melody, harmony, and rhythm.
• speech: processes audio recordings (which may be stand-alone documents or audio tracks extracted from videos); builds a word confusion network and a phoneme lattice.
• text: processes text, with or without XML or HTML markup; performs tokenization, lemmatization, named entity recognition, and summarization.

We have found no previous descriptions of systems in which UIMA was used to extract features from multiple media types in a single multimedia document. TALES 2 is a UIMA-based system which performs multimedia mining and translation of broadcast news and news Web sites.

* Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 225-232.
1 http://www.sapir.eu
For broadcast video news, TALES performs video capture, keyframe extraction, automatic speech-to-text conversion, machine translation of the foreign text to English, and information extraction. However, there is no publicly available description of the UIMA analytics approach.

In Section 2, we will first explain the hierarchical structure, implemented within a UIMA aggregate analysis engine, by which compound multimedia documents are decomposed, and the parts analyzed and then recomposed to form the final representation. Next, Section 3 briefly presents the MPEG-7 standard, describes the MPEG-7 based representation that is the output of our feature extraction system, and explains the corresponding UIMA feature structures. Since this workshop is attached to an NLP conference, we then briefly describe in Section 4 the analysis engines that process natural language, namely the speech and text components. We will not cover the music, image, or video analysis engines in detail here. The music component is based on research described in Di Buccio et al. (2009), Miotto & Orio (2008) and Kaplan et al. (2008). The image component is built around reference software from the MPEG-7 eXperimentation Model 3 and the ImageMagick 4 library; for research on the use of these features for image retrieval, see Stanchev et al. (2004) and Kaplan et al. (2008). The video analysis engine uses the ffmpeg 5 and MJPEG 6 open-source libraries; for details, see Allasia et al. (2008) and Kaplan et al. (2008). The focus of this paper is on the use of UIMA for composing single-medium analysis engines into a multimedia aggregate. The main scientific contributions of the SAPIR project are described elsewhere; see citations throughout the text.

2 Multimedia Splitting and Merging

A multimedia document is composed of parts that have different media types. For example, a video is composed of still frames and an audio track. The same basic media type may occur in different kinds of multimedia documents. For example, a frame of a video and a photograph are of the same basic media type, and as far as our feature extraction algorithms are concerned it makes no difference whether a given image comes from a snapshot or a frame of a video. We therefore thought it desirable to have a single UIMA analysis engine for each media type, e.g. a single image annotator that processes both video frames and still photographs. We implemented this idea by defining components called splitters and mergers that decompose and recompose compound documents.

A splitter is a CAS Multiplier that accepts a CAS of one media type, and outputs the original CAS plus one or more CASes of different media types. For example, the moving image splitter takes a moving image CAS (the difference between a video and a moving image will become clearer shortly) as input, and outputs that CAS plus a number of still image CASes representing selected frames of the moving image. Each splitter has a matching merger, which receives all of the CASes output by the splitter after they have been processed by other components, and assembles the extracted information into an appropriate structure in the original CAS.

2 http://domino.research.ibm.com/comm/research_projects.nsf/pages/tales.index.html
3 http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm
4 http://www.imagemagick.org
5 http://ffmpeg.org
6 http://mjpeg.sourceforge.net
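A minimal sketch of what such a splitter could look like with UIMA's CAS multiplier API follows; the frame-extraction helper, the MIME type and the class name are hypothetical placeholders standing in for SAPIR's actual components.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.AbstractCas;
import org.apache.uima.jcas.JCas;

// Illustrative splitter: for each representative frame of a video,
// emit a new CAS that only carries a reference to the frame image.
public class MovingImageSplitterSketch extends JCasMultiplier_ImplBase {

  private Iterator<String> pendingFrameUris;

  @Override
  public void process(JCas videoCas) throws AnalysisEngineProcessException {
    // Hypothetical helper: shot detection would yield one image URI per shot.
    List<String> frameUris = extractRepresentativeFrameUris(videoCas);
    pendingFrameUris = frameUris.iterator();
  }

  @Override
  public boolean hasNext() {
    return pendingFrameUris != null && pendingFrameUris.hasNext();
  }

  @Override
  public AbstractCas next() {
    JCas frameCas = getEmptyJCas();                         // provided by the framework
    frameCas.setSofaDataURI(pendingFrameUris.next(), "image/jpeg");
    return frameCas;
  }

  private List<String> extractRepresentativeFrameUris(JCas videoCas) {
    return Collections.emptyList();                         // placeholder, no real shot detection
  }
}
```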
In our example, the moving image merger copies image features from the individual image CASes into a feature structure array in the original moving image CAS. Only the original CAS (the moving image CAS in the example) leaves the aggregate, so seen from the outside the aggregate behaves like a normal analysis engine, not a CAS multiplier. A splitter and a merger are composed in an aggregate with the appropriate feature extraction modules and a custom flow controller, which directs a CAS to the appropriate delegate (or directly to the merger) based on its media type. The split/merge structure can be applied recursively.

Figure 1 illustrates the structure of our aggregate analysis engine for video. At the top level, the video splitter generates one CAS for the moving image and another for the audio track. The moving image CAS is processed by the moving image analysis engine, which is itself a split/merge aggregate as described above. The audio CAS is processed by the speech aggregate, which is a traditional aggregate that performs speech-to-text transcription followed by text processing on the resulting transcript. (The speech and text components will be described in more detail in Section 4.) Note that the flow of information inside the video aggregate forks and then joins. As always, the custom flow controller routes CASes to the appropriate annotators within the aggregate.

The split/merge approach allows a clean separation of concerns. It allows the image annotator to process frames extracted from videos without knowing anything about the structure of videos or the feature structures used to describe them. To add, for example, functionality for processing web pages composed of text and embedded images, we needn't modify the image annotator to add a case for handling web page CASes. Instead, we can write a new web page splitter and merger, and compose them with the original, unmodified image annotator.

The argument for the split/merge structure at the top level of the video aggregate is perhaps not as strong as it is at the level of the moving image aggregate. An alternative would be to put the moving image and audio parts into different views in the original video CAS, and use SOFA mappings to bind the moving image and speech analysis engines to the appropriate views. Splitting the image and audio parts into different CASes might give some advantage in terms of parallelizing the processing flow, but we have not yet pursued this idea.

After processing by the video aggregate, a video CAS contains the following information: a temporal decomposition of the video into shots; for each shot, the start time and duration, the URL of a still image that represents the shot, and five MPEG-7 visual descriptors extracted from that image; a word confusion network representing the results of speech-to-text processing; a textual transcription of the WCN annotated with lemmata, named entities, and temporal offsets; and a summary of the text.

Figure 1: The video aggregate

3 Feature Representations

It was decided at the outset of the SAPIR project to use the MPEG-7 standard (International Organization for Standardization, 2002; Manjunath et al., 2002) to represent extracted features and other metadata for all media. MPEG-7 provides an extremely rich XML-based formalism for representing the structure and contents of multimedia documents.
Despite its expressiveness, the standard did not cover all of the types of features we intended to extract, in particular for text and music, so we defined some SAPIR-specific extensions (Kaplan et al., 2007). Extracted features are first represented and manipulated as UIMA feature structures, and then in a final step the fully annotated CAS is transformed into an MPEG-7 description.

Initially we hoped to create the UIMA type system definition, and the code for translating between formats, automatically. The MPEG-7 representation is defined using XML Schema, and in principle it would be possible to map XML Schema to a UIMA type system definition automatically. We attempted to do this by going via Ecore: the Eclipse IDE 7 provides automatic mapping of XML Schema to Ecore, 8 and UIMA's Ecore2UimaTypeSystem goes from Ecore to UIMA. We encountered several problems with this approach. First, Ecore2UimaTypeSystem is not widely used, and is thus not as mature or well-tested as other parts of UIMA; we encountered a number of bugs in this code (the bugs we have discovered so far have since been fixed). Second, the full MPEG-7 standard is very large, and the corresponding UIMA type system was too big to be loaded into memory. Had we continued to pursue this approach, the next step would have been to prune the XML Schema definitions down to just the part of the standard that we actually use, which is only a small fraction of the total. In the end, we did not pursue this approach. We gave up on automating the mapping, and simply defined a UIMA type system by hand, and wrote code for translating UIMA feature structures to the subset of MPEG-7 that we use.

4 NLP Components

We will now present the feature extraction modules that handle natural language.

4.1 Spoken Information Retrieval

Search in spoken data is an emerging research area currently garnering a lot of attention from the natural language research community. We have developed a UIMA analysis engine that incorporates the state-of-the-art asset developed by IBM Research in the area of Automatic Speech Recognition (ASR, Soltau et al., 2005). The information produced by this analysis engine is used in a novel scheme for information retrieval from noisy transcripts. The scheme uses additional output from the transcription system to reduce the effect of recognition errors in the word transcripts (Mangu et al., 2000).

Although ASR technology is capable of transcribing speech to text, it suffers from deficiencies such as recognition errors and a limited vocabulary. For example, noisy spontaneous speech is typically transcribed with an accuracy of 60% to 70%. In some circumstances where there are noisy channels, foreign accents, or under-trained engines, accuracy may fall to 50% or lower. Our scheme shows a dramatic improvement in the quality of searches being conducted within transcript information (Mamou et al., 2006).

To overcome the limitations and high error rate associated with phonetic transcription and queries for terms not recognized by the ASR engines, we have developed a new technique that combines phone-based and word-based search. When people search through speech transcripts and query for terms that are outside the vocabulary domain on which the engine is trained, the engine may not return any results. The "out of vocabulary" (OOV) terms are those words missing from the ASR system vocabulary.

7 http://www.eclipse.org
8 http://www.eclipse.org/modeling/emf
Although phonetic transcription constitutes an alternative to word transcription for OOV search, it suffers from a high error rate and is therefore not viable on its own. We have developed algorithms specifically for fuzzy search on phonetic transcripts, thereby overcoming this problem (Mamou et al., 2007, 2008; Ramabhadran et al., 2009).

4.2 Text Processing

The text analysis engine provides tokenization, lemmatization, sentence boundary detection, recognition of dates and person and place names, and summarization, for English text. It is based on the Xerox Incremental Parser (XIP, Aït-Mokhtar et al., 2001), a tool that performs robust and deep syntactic analysis. XIP provides mechanisms for identifying major syntactic structures and major functional relations between words on large collections of unrestricted documents (e.g. web pages, newspapers, scientific literature, encyclopedias). It provides a formalism that smoothly integrates a number of description mechanisms for shallow and deep robust parsing, ranging from part-of-speech disambiguation, entity recognition, and chunking, to dependency grammars and extra-sentential processing. Named entity recognition relies on, and is also part of, the general parsing process (Brun & Hagege, 2004). Measured over entities of all types, the named entity recognition system has a precision of 94% and recall of 88%.

The summarizer uses sentence, lemma, and name annotations produced by the linguistic analysis, as well as other internal XIP information such as anaphoric information, to rank the sentences of the document by informativeness and choose the most informative ones to include in the summary. Summaries can be used to facilitate browsing of results retrieved for a query. Needless to say, the text processing functionality works better on "clean" text documents than on automatically transcribed speech. We have not attempted a formal evaluation, but our impression is that the quality of lemmatization and named entity recognition when applied to transcribed speech is degraded but remains acceptable, whereas the quality of summaries generated from speech is generally too poor to be useful, as recognition errors and errors in sentence boundary detection compound the already difficult summarization problem.

While for some media the input to the text processing module is transcribed from speech, in other pipelines the original document is textual. An analysis engine based on the open source nekohtml 9 and xerces 10 libraries prepares XML and HTML documents, including ill-formed HTML as it is often found on the web, for processing by the parser, which expects plain text. It creates a plain-text view in which the markup tags have been removed, and adds UIMA annotations to preserve the alignment between the plain-text view and the original view, so that at a later stage annotations added to the plain-text view by the text analysis engine can be copied back to the original view, with the offsets adjusted appropriately. Information about the location of tags is also used to influence sentence boundaries; for example, a sentence will not be allowed to span a location in the plain-text view that corresponds to a <p> (paragraph break) tag in the original view.

5 Conclusions and Future Work

To support multimedia indexing and query-by-example search, the SAPIR project has developed a feature extraction system for multimedia documents that contain combinations of text, images, video, speech, and music.
The system is implemented in UIMA, using a pattern of splitters and mergers in which a multimedia CAS is split into multiple simpler CASes containing one media type each, which are processed and then recombined in order to generate a representation of the original, compound document. In this paper we have described the system architecture and some of the design decisions behind it, as well as the speech transcription and text processing modules. Descriptions of the other feature extraction modules can be found in the references.

Indexing and search in SAPIR are distributed over a peer-to-peer network, but the current version of the feature extraction subsystem is not. Since it is built on UIMA it could of course be distributed using UIMA's distributed processing functionality, but only over a network with a fixed set of nodes known ahead of time. It would be interesting to explore how UIMA could be adapted to working in a peer-to-peer network, in order to distribute the computational load of feature extraction among all participants.

9 http://nekohtml.sourceforge.net
10 http://xerces.apache.org

Acknowledgements

The SAPIR project was funded by the European Commission Sixth Framework Programme. We would like to thank all of the SAPIR participants, and in particular Fabrizio Falchi and Paolo Bolettieri for the still image analysis engine, Nicola Orio and Riccardo Miotto for the music analysis engine, and Walter Allasia, Mouna Kacimi, and Yosi Mass for helpful discussions and comments.

References

Aït-Mokhtar, S., Chanod, J.-P., & Roux, C. (2001). A multi-input dependency parser. In Proceedings of the Seventh IWPT (International Workshop on Parsing Technologies), Beijing, China.
Allasia, W., Falchi, F., Gallo, F., Kacimi, M., Kaplan, A., Mamou, J., Mass, Y., & Orio, N. (2008). Audiovisual content analysis in P2P networks: The SAPIR approach. In Proceedings of the 19th International Workshop on Database and Expert Systems Applications (DEXA 2008), pages 610-614. IEEE Computer Society.
Brun, C. & Hagege, C. (2004). Intertwining deep syntactic processing and named entity detection. In ESTAL 2004, Alicante, Spain.
Di Buccio, E., Masiero, I., Mass, Y., Melucci, M., Miotto, R., Orio, N., & Sznajder, B. (2009). Towards an integrated approach to music retrieval. In Proceedings of the Fifth Italian Research Conference on Digital Library Management Systems, Padua, Italy.
International Organization for Standardization (2002). MPEG-7 The Generic Multimedia Content Description Standard.
Kaplan, A., Falchi, F., Allasia, W., Gallo, F., Mamou, J., Mass, Y., Miotto, R., Orio, N., & Hagège, C. (2007). Common schema for feature extraction (revised). Deliverable 3.1, SAPIR. http://www.sapir.eu/deliverables.html.
Kaplan, A., Bolettieri, P., Falchi, F., Lucchese, C., Allasia, W., Gallo, F., Mamou, J., Sznajder, B., Miotto, R., Orio, N., Brun, C., Coursimault, J.-M., & Hagège, C. (2008). Feature extraction modules for audio, video, music, and text. Combined deliverables 3.2, 3.3, 3.4, 3.5, SAPIR. http://www.sapir.eu/deliverables.html.
Mamou, J., Carmel, D., & Hoory, R. (2006). Spoken document retrieval from call-center conversations. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 51-58, New York, NY, USA. ACM.
Mamou, J., Ramabhadran, B., & Siohan, O. (2007). Vocabulary independent spoken term detection. In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 615-622, New York, NY, USA. ACM.
In SIGIR ’07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 615-622, New York, NY, USA. ACM. Mamou, J., Mass, Y., Ramabhadran, B., & Sznajder, B. (2008). Combination of multiple speech transcription methods for vocabulary independent search. In Search in Spontaneous Conversational Speech Workshop, SIGIR 2008. Mangu, L., Brill, E., & Stolcke, A. (2000). Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Computer Speech and Language, 14(4), 373-400. Manjunath, B. S., Salembier, P., & Sikora, T., editors (2002). Introduction to MPEG-7: Multimedia Content Description Interface. Wiley. 232 Aaron Kaplan, Jonathan Mamou et al. Miotto, R. & Orio, N. (2008). A music identification system based on chroma indexing and statistical modeling. In Proceedings of the 9th International Conference of Music Information Retrieval, pages 301-306, Philadelphia, USA. Ramabhadran, B., Sethy, A., Mamou, J., Kingsbury, B., & Chaudhari, U. (2009). Fast decoding for open vocabulary spoken term detection. In NAACL-HLT . Soltau, H., Kingsbury, B., Mangu, L., Povey, D., Saon, G., & Zweig, G. (March 2005). The IBM 2004 conversational telephony system for rich transcription. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Stanchev, P., Amato, G., Falchi, F., Gennaro, C., Rabitti, F., & Savino, P. (2004). Selection of MPEG-7 image features for improving image similarity search on specific data sets. In Proceedings of the 7-th IASTED International Conference on Computer Graphics and Imaging (CGIM 2004), pages 395-400. ACTA Press. TextMarker: A Tool for Rule-Based Information Extraction * Peter Kluegl, Martin Atzmueller, and Frank Puppe University of Würzburg Department of Computer Science VI Am Hubland, 97074 Würzburg, Germany {pkluegl, atzmueller, puppe}@informatik.uni-wuerzburg.de Abstract This paper presents T EXT M ARKER - a powerful toolkit for rule-based information extraction. T EXT M ARKER is based on UIMA and provides versatile information processing and advanced extraction techniques. We thoroughly describe the system and its capabilities for human-like information processing and rapid prototyping of information extraction applications. 1 Introduction Due to the abundance of unstructured and semi-structured information and data in the form of textual documents, methods for information extraction are key techniques in the information age. There already exist a variety of methods for information extraction. These can roughly be divided into machine learning and knowledge engineering approaches (Turmo et al., 2006): Often, machine learning approaches are applied for information extraction. However, in our experience rule-based techniques provide a viable alternative especially since these allow for rapid-prototyping capabilities, that is, by starting with a minimal rule set that can be extended as needed. Furthermore, modeling human-like information extraction and processing can be directly supported. The model for the extraction of information in the hybrid knowledge engineering approach is either handcrafted or learned by covering algorithms and similar learning methods. For the representation of the white-box knowledge, often sequence labeling or classification approaches are used, e.g., finite state transducers, concept patterns, lambda expressions, logic programs or proprietary rule languages. 
Appelt (1999) gives some reasons to prefer the knowledge engineering approach to the established black-box machine learning approaches that often use knowledge engineering for the feature extraction themselves. Additionally, the time spent on the annotation of a training corpus has to be compared to the effort of writing handcrafted rules. In some cases, there is even no machine learning method available that satisfies the required expressiveness for an optimal model. The Unstructured Information Management Architecture (UIMA, Ferrucci & Lally, 2004) is a flexible and extensible architecture for the analysis and processing of unstructured data, especially * Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 233-240. 234 Peter Kluegl, Martin Atzmueller, Frank Puppe text. The feature structures and the artifact (the text document and its annotations, respectively), are represented by the common analysis structure (CAS). Different components defined by descriptor files can process a CAS in a pipeline architecture. Whereas components like analysis engines add new annotations, CAS consumer process the contained information further. Additionally, UIMA supports multiple CAS and different views on a CAS. A major point for the interoperability of UIMA concerns the concept of a type system that defines the different types used by a component. Types themselves define the concept of a feature structure and contain a set of features, more precisely references to other feature structures or simple values. In this paper, we describe the T EXT M ARKER system as a UIMA-based toolkit for rule-based information extraction. T EXT M ARKER applies a knowledge engineering approach for acquiring rule sets and can be complemented by machine learning techniques. Due to its intuitive and extensible rule formalization language that also provides scripting-language like features, T EXT M ARKER provides for a powerful toolkit for rule-based information extraction and processing. The knowledge formalization mechanism also supports meta-level information extraction by improving and applying its extraction rules incrementally. Technically, the system is firmly grounded on UIMA. The rest of the paper is structured as follows: Section 2 provides a detailed introduction to the T EXT M ARKER system and gives an overview on current research directions. Next, we describe the integration and combination of T EXT M ARKER and UIMA in Section 3. After that, Section 4 provides an overview on applications using the T EXT M ARKER system. Finally, Section 5 concludes the paper with a discussion of the presented work and proposes several directions for future work. 2 The T EXT M ARKER System The T EXT M ARKER system 1 is a rule-based tool for the processing of unstructured information, especially for information extraction. It aims at supporting the knowledge engineer with respect to the rapid prototyping of information extraction applications and in providing the necessary elements for modeling human extraction knowledge and processes. The development environment is essential for the successful usage of a rule or scripting language and is therefore continually being improved. It is based on the DLTK framework 2 in order to provide a full-featured editor for the knowledge engineer. 
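For readers new to UIMA, the division of labour between analysis engines (which add annotations to the CAS) and CAS consumers (which only read them) can be caricatured in a few lines of plain Java; this is a drastic simplification with invented names, not the UIMA API:

import java.util.ArrayList;
import java.util.List;

public class PipelineSketch {
    // A toy stand-in for the CAS: the artifact plus a growing list of annotations.
    static class Cas {
        final String text;
        final List<String> annotations = new ArrayList<>();
        Cas(String text) { this.text = text; }
    }
    interface Component { void process(Cas cas); }

    public static void main(String[] args) {
        Cas cas = new Cas("Peter works for Frank.");
        List<Component> pipeline = new ArrayList<>();
        // An "analysis engine" adds new annotations to the shared CAS ...
        pipeline.add(c -> {
            int i = 0;
            for (String tok : c.text.split("\\s+")) {
                c.annotations.add("Token[" + (i++) + "]=" + tok);
            }
        });
        // ... while a "CAS consumer" only processes the contained information further.
        pipeline.add(c -> System.out.println(c.annotations));
        for (Component comp : pipeline) comp.process(cas);
    }
}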
Components for the explanation provide statistics and related information, e.g., about how often the blocks and rules tried to apply and match, how often they succeeded, and how their conditions evaluated. In the following we provide a short introduction to the core concepts and language elements of T EXT M ARKER . A more detailed description can be found in the T EXT M ARKER project wiki 3 . 2.1 The T EXT M ARKER language A rule file or Script in the language of the T EXT M ARKER system mainly consists of three major parts (cf. table 1): A declaration of the package of the script (PackageDecl), that determines the namespace of newly defined types. Import elements make external components available, for example by referencing UIMA descriptors of Analysis Engines and Type Systems or other rule sets of the T EXT M ARKER language. Five different kinds of a Statement can form the body of the script file. Whereas the TypeDecl creates new types of feature structures, VariableDecl defines a new variable 1 The source code is available at https: / / sourceforge.net/ projects/ textmarker/ 2 http: / / www.eclipse.org/ dltk/ 3 http: / / tmwiki.informatik.uni-wuerzburg.de/ TextMarker: A Tool for Rule-Based Information Extraction 235 for types, strings, booleans or numbers. The ResourceDecl loads external trie structures, word lists or tables that can be used by conditions and actions. The Block element groups other statements and provides some functionality commonly known in scripting languages. A Rule consists of a list of rule elements that are made up of three parts: The mandatory matching condition of a rule is given by a TypeExpr or a StringExpr and creates a connection to the document. The optional QuantifierPart defines greedy or reluctant repetitions of the rule element, similar to regular expressions. Then, additional conditions and actions in the ConditionActionPart add further requirements and consequences to the rule element. Both elements, a Condition and an Action, require commonly several arguments, respectively expressions of types, numbers, strings or booleans or specialized constructs like assignments. An expression itself can be a terminal element, e.g., a number, variables containing a terminal element, a function returning that kind of element or combinations of them, e.g., a multiplication of two number expressions. The T EXT M ARKER language currently provides 24 different conditions and 19 different actions and their number is constantly growing. The list of available conditions and actions can be extended by third parties in order to customize the T EXT M ARKER system for specialized domains. In addition to elements that facilitate the rule engineering, external libraries or databases can enrich the T EXT M ARKER language. Table 1: Extract of the T EXT M ARKER language definition in Backus-Naur-Form Script → PackageDecl Import* Statement* Import → (‘TYPESYSTEM’ | ‘SCRIPT’ | ‘ENGINE’) Identifier ‘; ’ Statement → TypeDecl | ResourceDecl | VariableDecl | Block | Rule TypeDecl → ‘DECLARE’ (AnnotationType)? Identifier (‘,’ Identifier )* | ‘DECLARE’ AnnotationType Identifier ( ‘(‘ FeatureDecl ‘)’ )? ResourceDecl → (‘LIST’|’TABLE’) Identfier ‘; ’ VariableDecl → (‘TYPE’|’STRING’|’INT’|’DOUBLE’|’BOOLEAN’) Identifier ‘; ’ Block → ‘BLOCK’ ‘(‘ Identifier ‘)’ RuleElementWithType ‘{’ Statement* ‘}’ Rule → RuleElement+ ‘; ’ RuleElement → (TypeExpr | StringExpr) QuantifierPart? ConditionActionPart? QuantifierPart → ‘*’ | ‘*? ’ | ‘+’ | ‘+? ’ | ‘? ’ | ‘? ? 
’ | ‘[’ NumberExpr ‘,’ NumberExpr ‘]’ (‘? ’)? ConditionActionPart → ‘{’ (Condition ( ‘,’ Condition )*)? ( ‘->’ Action ( ‘,’ Action)*)? ‘}’ Condition → ConditionName (‘(‘ Argument (‘,’ Argument)* ‘)’)? Action → ActionName (‘(‘ Argument (‘,’ Argument)* ‘)’)? Argument → TypeExpr | NumberExpr | StringExpr | BooleanExpr | . . . The characteristics of the T EXT M ARKER language are illustrated with two simple examples. In the first example, a rule with three rule elements is given that processes dates in a certain format, e.g., “Dec. 2004”, “July 85” or “11.2008”. The first rule element matches on a basic annotation of the type ANY (any token), if its covered text is contained in a dictionary named Month.twl. An optional PERIOD can follow the word. Then, a NUM (number) annotation has to come next, that has between two and four digits. If all three rule elements matched, then a new Month annotation for the text matched by the first rule element, a Year annotation for the text matched by the last rule element and a Date annotation for the completely matched text are created. ANY{INLIST(Months.twl) -> MARK(Month), MARK(Date,1,3)} PERIOD? NUM{REGEXP(".{2,4}") -> MARK(Year))}; The rule in the second example creates a new relation concerning an employment. The single rule element matches on all Sentence annotations that contain an annotation of the type EmploymentIndi- 236 Peter Kluegl, Martin Atzmueller, Frank Puppe cator. Then, a feature structure of the type EmplRelation is created and the values of its features EmployeeRef and EmployerRef are assigned to an annotation of the type Employee and Employer located within the matched annotation. The sentence “Peter works for Frank”, for example, has to be already annotated with the employee “Peter”, the employer “Frank” and the employment indicator “works for” that determines the role of persons in that sentence. Sentence{CONTAINS(EmploymentIndicator) -> CREATE(EmplRelation, "EmployeeRef" = Employee, "EmployerRef" = Employer)}; 2.2 Rule Inference The inference of the T EXT M ARKER system relies on a complete, disjunctive partition of the document. A basic (minimal) annotation for each element of the partition is assigned to a type of a hierarchy. These basic annotations are enriched for performance reasons with information about annotations that start at the same offset or overlap with the basic annotation. Normally, a scanner creates a basic annotation for each token, punctuation or whitespace, but can also be replaced with a different annotation seeding strategy. Unlike other rule-based information extraction language, the rules are executed in an imperative way. Experience has shown that the dependencies between rules, e.g., the same annotation types in the action and in the condition of a different rule, often form tree-like and not graph-like structures. Therefore, the sequencing and imperative processing did not cause disadvantages, but instead obvious advantages, e.g., the improved understandability of large rule sets. Algorithm 2 summarizes the rule inference of the T EXT M ARKER system. The rule elements can of course match on all kinds of annotations. Therefore the determination of the next basic annotation returns the first basic annotation after the last basic annotation of the complete, matched annotation. 2.3 Special Features The T EXT M ARKER language features some special characteristics that are not found in other information extraction systems. 
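Before turning to these special characteristics, readers unfamiliar with rule languages may find it helpful to see the effect of the date rule from Section 2.1 approximated with an ordinary Java regular expression; this is a rough, illustrative re-implementation that hard-codes the month list instead of using the Months.twl dictionary and ignores the annotation model:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateRuleSketch {
    // Roughly mimics: ANY{INLIST(Months.twl)} PERIOD? NUM{REGEXP(".{2,4}")}
    static final Pattern DATE = Pattern.compile(
        "\\b(Jan|Feb|Mar|Apr|May|June?|July?|Aug|Sep|Oct|Nov|Dec)(\\.?)\\s*(\\d{2,4})");

    public static void main(String[] args) {
        Matcher m = DATE.matcher("Released Dec. 2004, revised July 85.");
        while (m.find()) {
            System.out.println("Date  : " + m.group());   // the completely matched text
            System.out.println("Month : " + m.group(1));  // text of the first rule element
            System.out.println("Year  : " + m.group(3));  // text of the last rule element
        }
    }
}

The whole match corresponds to the Date annotation, while the first and last groups correspond to the Month and Year annotations created by the MARK actions of the original rule.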
The expressiveness of the TextMarker language is increased by elements commonly known from scripting languages. Conditions and actions can refer to expressions, in particular to variables, and either evaluate or modify the value of such an expression. As a consequence, a rule can alter the behavior of other rules, e.g., by changing the type of the created annotations. The block element introduces the functionality of procedures, conditional statements and loops. The identifier of the block element defines the name of the procedure and is used in the invocation by other rules. The single rule element in the definition of the block element creates a local view on the document for the inner statements. If the type expression of the rule element refers to a type that occurs several times in the current view on the document, then the inner statements are executed on each text fragment defined by these annotations. The conditions of the rule elements add additional requirements that need to be fulfilled before the inner statements are processed. Therefore, a match on the complete document with additional conditions is equivalent to an if statement. Some actions can modify the view on the document by filtering or retaining certain types of annotations or elements of the markup. Handcrafted rules are, for reasons of convenience, often based on simplifying assumptions; therefore, unintended token classes, e.g., markup or whitespace, or arbitrary types of annotations can be removed from the view on the document. The knowledge engineer is thus able to retain the important features and increase the robustness of the extraction process.

Algorithm 2 Rule Inference Algorithm
  collect all basic annotations that fulfill the first matching condition
  for all collected basic annotations do
    for all rule elements of the current rule do
      if quantifier wants to match then
        match the conditions of the rule element on the current basic annotation
        determine the next basic annotation after the current match
        if quantifier wants to continue then
          if there is a next basic annotation then
            continue with the current rule element and the next basic annotation
          else if rule element did not match then
            reset the next basic annotation to the current one
          end if
        end if
        set the current basic annotation to the next one
        if some rule elements did not match then
          stop and continue with the next collected basic annotation
        else if there is no current basic annotation and the quantifier wants to continue then
          set the current basic annotation to the previous one
        end if
      end if
    end for
    if all rule elements matched then
      execute the actions of all rule elements
    end if
  end for

Sometimes it is rather difficult to capture all relevant features for a type in a single rule. In addition to the creation of annotations, it is possible to add a positive or negative score: the features can thus be distributed among several rules that each weight their impact on the type with a heuristic score. The annotation is not created until enough rules have fired and the heuristic value has exceeded a defined threshold. This also makes the rule set more robust, since the absence of a few features can be tolerated. Besides the primary task of extracting information, the input document can also be modified, e.g., for anonymization. For this purpose, some actions are able to delete, replace or color the text of the matched text fragment.
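As a rough illustration of such a document-modifying action, the following stand-alone Java sketch (not TextMarker code; the name pattern and the placeholder string are invented) replaces matched person names with a placeholder for anonymization:

import java.util.regex.Pattern;

public class AnonymizeSketch {
    public static void main(String[] args) {
        String letter = "Patient Peter Meier was discharged on 12.03.2009.";
        // Replace every match of a (very naive) person-name pattern with a placeholder,
        // leaving the rest of the document unchanged.
        Pattern person = Pattern.compile("\\b[A-Z][a-z]+ (Meier|Mueller|Schmidt)\\b");
        String anonymized = person.matcher(letter).replaceAll("<PERSON>");
        System.out.println(anonymized);
    }
}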
The actual change will be performed by another analysis engine that creates a new view containing the modified document. 2.4 Research Directions The formalization of matching rules is usually difficult and time consuming. Often, machine learning methods can support the knowledge engineer in a semi-automatic process: Annotated documents form the input for rule learning methods complementing the handcrafted rules. Since the knowledge engineer usually has deep insights in the domain and the learning task, appropriate acquisition methods and their parameters can be selected. There are several options for utilizing the learned rules: The acquired rules can be modified and transferred, if the quality of the proposed rules is good enough. If this is not the case, then the settings of the learning task can be adapted by annotating more examples or by changing the applied methods and their parameters. However, the process can also be restarted by continuing with the next concept. Our integrated framework for this process (Kluegl et al., 2009a) currently contains prototypes of four learning methods, for example 238 Peter Kluegl, Martin Atzmueller, Frank Puppe LP 2 (Ciravegna, 2003), and is extended with well known and with more specialized methods in future. Since the methods are used in a semi-automatic process, the accuracy of extraction process is not as important as the comprehensibility, the extensibility, the capabilities of system integration, the usage of features and the overall result. According to the common process model in information extraction, features are extracted from the input document and are used by a model to identify information. But using already extracted information for further information extraction can often account for missing or ambiguous features and increase the accuracy in domains with repetitive structure. If the document was written, for example, by a single author, often the same layout for repetitive structures is used. If the corpus contains documents by different authors and these authors used different layout styles, then the relation between the features and information is contradictory. By identifying a confident information and analyzing its features, meta-features can be created that describe the relation of the information and its features for the current document. These meta-features and the transfer knowledge that projects the meta-features to other text fragments form a dynamic layer of the model. Our meta-level information extraction approach (Kluegl et al., 2009b) engineers parts of the human processing of documents and is able to considerably increase the accuracy of an information extraction application. 2.5 Related Systems The JAPE (Cunningham et al., 2000) system also applies patterns on annotations, uses a textual knowledge representation, but utilizes finite state machines for rule inference; the integration of additional java code is possible. While JAPE is not based on UIMA, an integration is enabled using a UIMA-bridge. L ANGUAGE W ARE4 is a comprehensive linguistic platform for semantic analysis that is also embedded into UIMA. It provides an integrated development environment with real-time testing capabilities, and several configurable components, e.g., for dictionary lookup, language identification, syntactic and semantic analysis, or entity and relation extraction. In contrast to T EXT M ARKER the rule construction is performed using a drag-and-drop paradigm. 
The rule inference is strongly based on conceptual text structures like sentences and is therefore especially useful for the processing of unstructured documents. TextMarker applies a different paradigm, i.e., it resembles a scripting language implemented using rules. Therefore, TextMarker supports several tasks directly ‘out of the box’ that would otherwise be implemented using separate UIMA components.

4 http: / / www.alphaworks.ibm.com/ tech/ lrw

3 TextMarker and UIMA

The development environment of the TextMarker system provides a build process for the automatic creation of UIMA descriptors. For each rule file, an analysis engine descriptor and a type system descriptor are created that include all dependencies and types of referenced rule files. For the analysis engine, a generic descriptor is extended that allows all generated descriptors in the project to be configured. The availability of generated descriptors eases the integration of TextMarker components in other UIMA pipelines for various tasks, like feature construction, document modification or information extraction. UIMA elements can directly be processed by the given rules; for example, the values of features can be transferred between compatible feature structures. Arbitrary UIMA type systems and analysis engines can be used directly in a rule file by importing the descriptors, and analysis engines can be executed by an action. In doing so, a new document is created with the current filtering settings: the HTML elements of semi-structured documents, for example, are removed before a part-of-speech tagger is executed on that dynamic document. Then, the newly created annotations with their offsets are transferred to the original document and can be used by TextMarker rules. There is ongoing work to integrate several component repositories like DKPro (Gurevych & Müller, 2008) and advanced machine learning toolkits like ClearTK (Ogren et al., 2008) directly into the TextMarker development environment in order to make linguistic approaches and arbitrary components easily usable with the complete expressiveness of the system.

4 Experiences

Although the TextMarker system is still at an early project stage, it has already been applied successfully in several projects, especially for semi-structured documents. For the task of creating structured data from work experiences in curricula vitae, more than 10,000 documents with heterogeneous layouts and structures were successfully processed. Feature structures containing, amongst other annotations, the complete description, the exact start date and end date, the employer and the title are extracted. The TextMarker system uses large dictionaries together with advanced approaches to reproduce the human perception of text fragments. In order to apply data mining to medical discharge letters, the TextMarker system first anonymizes the documents and then partitions them into different fragments for diagnoses, therapies, observations and so forth. These sections are further processed for the acquisition of structured data. The CaseTrain 5 project also uses the TextMarker system for its authoring component. Structured documents for the creation of e-learning cases are parsed, and error feedback for the authors is created.

5 Conclusions

In this paper, we have presented the TextMarker system as a UIMA-based tool for rule-based information extraction.
Furthermore, we also have discussed the tight integration of T EXT M ARKER within UIMA, and we have outlined several current research directions for T EXT M ARKER that provide for further advanced information processing and extraction techniques. With its intuitive and extensible rule formalization language, T EXT M ARKER provides for a powerful toolkit for the UIMA community, and is not limited to certain domains, but open for various applications. For future work, we aim to perform several improvements, e.g., embedding more UIMA structures into the language (lists and arrays), and to add more language elements, e.g., new expressions, conditions and actions. Additionally, we plan to further improve the development environment. One of the most important goals is the extension of the rapid-prototyping and rule learning capabilities that will also be utilized for the (automatic) acquisition of meta knowledge. We also aim for a combination with other approaches, e.g., advanced machine learning techniques or text mining methods. For example, textual subgroup mining (Atzmueller & Nalepa, 2009) is an interesting complement, e.g., for supporting rule formalization. 5 http: / / casetrain.uni-wuerzburg.de 240 Peter Kluegl, Martin Atzmueller, Frank Puppe Acknowledgements This work has been partially supported by the German Research Council (DFG) under grant Pu 129/ 8-2. References Appelt, D. E. (1999). Introduction to Information Extraction. AI Commun., 12(3), 161-172. Atzmueller, M. & Nalepa, G. J. (2009). A Textual Subgroup Mining Approach for Rapid ARD+ Model Capture. In Proc. 22nd Intl. Florida AI Research Soc. Conf. (FLAIRS). AAAI Press. Ciravegna, F. (2003). ( LP ) 2 , Rule Induction for Information Extraction Using Linguistic Constraints. Technical Report CS-03-07, University of Sheffield, Sheffield. Cunningham, H., Maynard, D., & Tablan, V. (2000). JAPE: A Java Annotation Patterns Engine (Second Edition). Research Memorandum CS-00-10, University of Sheffield. Ferrucci, D. & Lally, A. (2004). UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment. Nat. Lang. Eng., 10, 327-348. Gurevych, I. & Müller, M.-C. (2008). Information Extraction with the Darmstadt Knowledge Processing Software Repository (Extended Abstract). In Proceedings of the Workshop on Linguistic Processing Pipelines, Darmstadt, Germany. Kluegl, P., Atzmueller, M., & Puppe, F. (2009a). A Framework for Semi-Automatic Development of Rulebased Information Extraction Applications. In Proc. LWA 2009 (KDML - Special Track on Knowledge Discovery and Machine Learning). (accepted). Kluegl, P., Atzmueller, M., & Puppe, F. (2009b). Meta-Level Information Extraction. In The 32nd Annual Conference on Artificial Intelligence, Berlin. Springer. (accepted). Ogren, P. V., Wetzler, P. G., & Bethard, S. (2008). ClearTK: A UIMA Toolkit for Statistical Natural Language Processing. In UIMA for NLP workshop at LREC 08. Turmo, J., Ageno, A., & Català, N. (2006). Adaptive Information Extraction. ACM Comput. Surv., 38(2), 4. ClearTK: A Framework for Statistical Natural Language Processing * Philip V. Ogren 1 , Philipp G. Wetzler 1 , and Steven J. Bethard 2 1 University of Colorado at Boulder Boulder CO 80309, USA 2 Stanford University Stanford CA 94305, USA Abstract This paper describes a software package, ClearTK, that provides a framework for creating UIMA components that use statistical learning as a foundation for decision making and annotation creation. 
This framework provides a flexible and extensible feature extraction library and wrappers for several popular machine learning libraries. We provide an architectural overview of the core ideas implemented in this framework. 1 Introduction Natural Language Processing (NLP) systems written for research purposes are often difficult to install, use, compile, debug, and extend (Pedersen, 2008). As such, The Center for Computational Language and Education Research (CLEAR) 1 undertook to create a framework to facilitate statistical NLP that would be a foundation for future software development efforts that would encourage widespread use by being well documented, easily compiled, designed for reusability and extensibility, and extensively tested. The software we created, ClearTK 2 , provides a framework that supports statistical NLP through components that facilitate extracting features, generating training data, building classifiers, and classifying annotations 3 . ClearTK also provides a toolkit containing a number of ready-to-use components for performing specific NLP tasks, but this toolkit is not described here. ClearTK is available under the very permissive BSD license 4 . ClearTK is built on top of the Unstructured Information Management Architecture (UIMA, Ferrucci & Lally, 2004). Briefly, UIMA provides a set of interfaces for defining components for analyzing unstructured information and provides infrastructure for creating, configuring, running, debugging, and visualizing these components. In the context of ClearTK, we are focused on UIMA’s ability to process textual data. All components are organized around a type system which defines * Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 241-248. 1 http: / / clear.colorado.edu 2 http: / / cleartk.googlecode.com 3 For developers, the concepts discussed in this paper are implemented in the Java package org.cleartk.classifier 4 http: / / www.opensource.org/ licenses/ bsd-license.php 242 Philip V. Ogren, Philipp G. Wetzler, Steven J. Bethard the structure of the analysis data associated with documents 5 . This information is instantiated in a data structure called the Common Analysis Structure (CAS). There is one CAS per document that UIMA components can access and update. We chose UIMA for a number of reasons including its open source license, wide spread community adoption, strong developer community, elegant APIs that encourage reusability and interoperability, helpful development tools, and extensive documentation. While UIMA provides a solid foundation for processing text, it does not directly support machine learning based NLP. ClearTK aims to fill this gap. The outline of the paper is as follows. Section §2 through Section §6 lay out the core machine learning constructs that ClearTK provides. Section §7 describes how all of these constructs are brought together by UIMA components via annotation handlers. Section §8 details how ClearTK supports both sequential and non-sequential learners and Section §9 discusses characteristics and goals of ClearTK of importance and lists the supported machine learning libraries. 2 Feature Extraction The first step for most statistical NLP components is to convert raw or annotated text into features, which give a machine learning model a simpler, more focused view of the text. 
In ClearTK, a feature extractor takes an annotation (or set of annotations) and produces features from it. A feature in ClearTK is a simple object that contains a name and a value. For example, a pair of features extracted for a token-level annotation corresponding to the word “pony" might consist of the name=value pairs text=pony and part-of-speech=noun, respectively. Many features are created by querying the CAS for information about existing annotations. Because features are typically many in number, short lived, and dynamic in nature (e.g. features can derive from previous classifications), they are not added to the CAS but rather exist as simple Java objects. The SpannedTextExtractor 6 is a very simple example of a feature extractor that takes an annotation and returns a feature corresponding to the covered text of that annotation. The TypePathExtractor is a slightly more complicated feature extractor that extracts features based on a path that describes a location of a value defined by the type system with respect to the annotation type being examined. For example, consider a type system that has a type called Word with a single feature called partOfSpeech and a second type called Constituent with a single feature called headWord whose range is defined to be of type Word. A type path extractor initialized with the path headWord/ partOfSpeech can extract features corresponding to the part-of-speech of the head word of examined constituents. A much more sophisticated feature extractor is WindowFeatureExtractor . It operates in conjunction with a SimpleFeatureExtractor (such as the SpannedTextExtractor or TypePathExtractor ) and extracts features over some numerically bounded and oriented range of annotations (e.g. five tokens to the left) relative to a focus annotation (e.g. a named entity annotation or syntactic constituent) that are within some boundary annotation (e.g. a sentence or paragraph annotation.) The annotation types used by the window feature extractor are all configurable with respect to the type system which allows it to be used in many different situations. Table 1 lists some of the feature extractors provided by ClearTK. Note that all of these feature extractors are type system agnostic i.e. they are initialized with user specified types as needed and do not require importing ClearTK’s type system. ClearTK also provides feature extractors that are type system 5 We use the term document to refer to any unit of text that is analyzed 6 This font indicates class names found in ClearTK. ClearTK: A Framework for Statistical Natural Language Processing 243 dependent such as the SyntacticPathExtractor which creates a feature that describes the syntactic path between two syntactic constituents in the same sentence. Feature proliferators create features by taking as input features created by feature extractors and deriving new features. An example of a feature proliferator is the LowerCaseProliferator which takes features created by e.g. the SpannedTextExtractor and creates a feature that contains the lower cased value of the input feature. Another example of a feature proliferator is the NumericTypeProliferator which examines a feature value and characterizes it with respect the numeric properties of the stream, e.g. if it contains digits, contains only digits, looks like a year (e.g. 1964), looks like a Roman numeral, etc. While it is easy and common for developers to create their own feature extractors, ClearTK provides extraction logic for a number of commonly used features. 
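The division of labour between annotations and the light-weight feature objects can be pictured with a compact Java sketch; this is a simplification with invented classes, not the actual ClearTK API:

import java.util.Arrays;
import java.util.List;

public class FeatureExtractionSketch {
    // A feature is just a name/value pair, kept outside the CAS.
    record Feature(String name, Object value) {}
    // A minimal stand-in for a token-level annotation.
    record Token(String coveredText, String pos) {}

    // Mimics a spanned-text extractor plus two further simple extractors.
    static List<Feature> extract(Token t) {
        String text = t.coveredText();
        return Arrays.asList(
            new Feature("text", text),
            new Feature("part-of-speech", t.pos()),
            new Feature("3-char-suffix", text.length() >= 3 ? text.substring(text.length() - 3) : text));
    }

    public static void main(String[] args) {
        System.out.println(extract(new Token("pony", "noun")));
    }
}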
As an illustration, the feature extraction library provided by ClearTK allows one to extract features corresponding to the following (assuming that the CAS along with its type system contains these data): • three part-of-speech tags to the left of a word as three features. • three part-of-speech tags to the left of a word as a trigram. • part-of-speech tags of the head words of constituents in a sentence. • identifiers of previously recognized concepts to the left of an annotation. • penultimate word of a named entity mention annotation. • last three letters of the first two words of a named entity mention annotation. • lengths of the previous 10 sentences. • word counts of a document. • part-of-speech counts of a document. • classifications assigned to previous annotations in a sequence. Table 1: Feature extractors provided by ClearTK Extractor features extracted derived from . . . spanned text the text spanned by an annotation type path a value given by a path through the type system distance the “distance” between two annotations white space existence of whitespace before or after an annotation relative position the relative position of two annotations (e.g. left or overlap left) window annotations in or around the focus annotation n-gram n consecutive annotations in or around the focus annotation bag bag-of-words style feature extraction counts counts of values 244 Philip V. Ogren, Philipp G. Wetzler, Steven J. Bethard 3 Instances Once a piece of text has been characterized by a list of features, it is ready to be passed to the machine learning library. At training time, each list of features is paired with an outcome which tells the machine learning algorithm what to learn. An outcome will generally be a numeric or string value. For example, when building a part-of-speech tagger, the outcomes will be string values corresponding to part-of-speech tags, so for example, the features (word=pony, 1-word-left=rainbow, 2-wordsleft=The, 3-char-suffix=ony, 1-part-of-speech-left=adjective, 2-part-of-speech-left=determiner) would be paired with the outcome noun. These pairs of feature lists and outcomes are called Instance objects in ClearTK. Note that Instance outcomes must be provided to the classifier at training time, but are not provided at classification time when the classifier tries to predict the outcomes. Thus when creating training data from Instance s, the outcomes must be known in advance. Generally, this means that the outcomes are derived directly from the contents of the CAS (e.g. because gold-standard part-ofspeech data has been read in by a collection reader). When classifying instances the outcome is not known in advance and is determined by passing to the classifier an Instance that has only features and no outcome. The classifier, using a model that was built previously by the machine learning library using training data (see Section §5 below), determines the most appropriate outcome that it can and returns it. The returned outcome is then interpreted and updates to the CAS are made as appropriate. In our working example, if the classifier predicted the outcome “noun”, this string would then be used to set the part-of-speech tag of a token annotation. 4 Feature Encoding For many simple situations, once features are extracted it is a simple matter of passing those features off to the classifier for classification or to a training data writer to write the features (along with an outcome) to the appropriate data format. 
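That simple case can be made concrete with a short sketch (plain Java; the one-line serialization format is invented and does not correspond to the format of any particular learner):

import java.util.LinkedHashMap;
import java.util.Map;

public class TrainingDataLineSketch {
    public static void main(String[] args) {
        // A feature list paired with its known outcome (an instance at training time).
        Map<String, String> features = new LinkedHashMap<>();
        features.put("word", "pony");
        features.put("1-word-left", "rainbow");
        features.put("3-char-suffix", "ony");
        String outcome = "noun";

        // In the simple case, writing training data is just serializing outcome + features
        // into whatever format the chosen learner expects (a made-up format is used here).
        StringBuilder line = new StringBuilder(outcome);
        for (Map.Entry<String, String> f : features.entrySet()) {
            line.append(' ').append(f.getKey()).append('=').append(f.getValue());
        }
        System.out.println(line);
    }
}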
However, as we began using ClearTK for more complicated scenarios we recognized that there are two closely related tasks to consider when passing features to a classifier. We call these two tasks feature extraction (as described above) and feature encoding. They roughly correspond to the following questions, respectively: How is the value of a feature extracted? and How is this feature presented to a particular classifier? As an example, consider a scenario in which distance from a focus annotation to some other annotation is extracted as a feature whose value is negative (corresponding to distance to the left) or positive (to the right). The question of how to extract the value for this feature is likely to be a straightforward counting of e.g. intervening tokens or characters where the values might range from -100 to +100. However, such a feature is not straightforwardly presented to a classifier such as OpenNLP’s MaxEnt 7 implementation which does not allow negative-valued features. Here, it may be appropriate to present the values according to some binning scheme such as far-left , near-left , far-right , and near-right . Such considerations are handled by feature encoders which serve to simplify the code that performs feature extraction by not polluting it with classifier-specific considerations as the one just described. Another scenario in which feature encoding greatly simplifies feature extraction can be found, for example, when performing document classification. It is very common to use word counts as features and these are easily extracted. However, for many classifiers it will make sense to encode 7 http: / / maxent.sourceforge.net/ ClearTK: A Framework for Statistical Natural Language Processing 245 these features as TF-IDF values. In fact, it may be that all of the classifiers supported by ClearTK benefit from using TF-IDF. Even still, there are many variations of TF-IDF and one technique may work better for one classifier than for another. In this scenario, ClearTK allows one to mix and match various feature encoders that perform TF-IDF calculations with different classifiers without having to clutter up the code that performs the basic feature extraction task of word counting. This makes it much easier to write feature extraction code that is oblivious to the classifier that is being used. The boundary between feature extraction and feature encoding is not always well defined for many scenarios 8 . However, we have found that by splitting these two tasks into separate activities it simplifies the resulting code that performs them. 5 Training a Classifier In order to use one of the classifiers supported by ClearTK, one must first create a model using that classifier’s learner. Each learner requires as input training data in a format specific to each classifier. For this task, ClearTK provides DataWriter s which know how to write Instance objects to the correct data format which can then be used by a learner to train a model. ClearTK provides a DataWriter for each classifier that it supports which removes the need for a developer to worry about these details which vary from classifier to classifier. Once training data has been created for a particular classifier, it’s learner must be invoked to train a model. This can be accomplished in ClearTK by means of a ClassifierBuilder (there is one for each classifier) which is a class that invokes the learner with parameters specific to the learner which then proceeds to build a model. 
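Returning briefly to feature encoding, the distance-binning scheme described in Section 4 could be realised by an encoder along the following lines (an illustrative sketch with invented bin boundaries and names; real ClearTK encoders have a different interface):

public class DistanceEncoderSketch {
    // Encodes a signed token distance for a learner that cannot handle
    // negative-valued features, by mapping it onto a small set of bins.
    static String encode(int distance) {
        if (distance < -5)  return "far-left";
        if (distance < 0)   return "near-left";
        if (distance <= 5)  return "near-right";
        return "far-right";
    }

    public static void main(String[] args) {
        int[] raw = {-42, -3, 2, 17};
        for (int d : raw) {
            System.out.println("distance=" + d + " -> " + encode(d));
        }
    }
}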
A ClassifierBuilder is also responsible for packaging up the resulting model into a jar file along with other pertinent information such that ClearTK can use the model later when classification is performed. However, there are many situations in which it may be desirable to invoke a learner manually (e.g. from a command line interface) to build a model from training data. For example, in research settings it is common to train many different models using the same learner but with different parameters and evaluate the fitness of a model directly using the learner (e.g. performing 10-fold cross-validation is a training option in Mallet 9 described in McCallum (2002)). The training data generated by the DataWriter s (as described above) is suitable for direct use with the targeted learner. 6 Classification Classification is performed by a Classifier which is a wrapper class that handles the details of providing an Instance to a machine learning model so that it can classify it and return an outcome. ClearTK provides a Classifier wrapper for each machine learning library that is supported. 7 AnnotationHandlers Two analysis engines interface between UIMA and the classifier machinery: DataWriterAnnotator , which passes Instance s to a DataWriter object to write out the classifier training data, 8 For an extended discussion of feature extraction and encoding, please see http: / / code.google.com/ p/ cleartk/ issues/ detail? id=55 9 http: / / mallet.cs.umass.edu/ 246 Philip V. Ogren, Philipp G. Wetzler, Steven J. Bethard and ClassifierAnnotator which passes Instance s to a Classifier object to make outcome predictions. In both cases, Instance s must be derived from the CAS before they can be written or classified. Thus both analysis engines must perform the same logic of iterating over annotations of relevance, performing feature extraction, and then producing Instance s to be consumed. Because the code that determines how these instances are built is identical in both analysis engines, ClearTK consolidates this code into a single class called an AnnotationHandler which provides a single implementation for the logic of creating instances and performing feature extraction. This simplifies the code by eliminating redundancy. Because both the DataWriterAnnotator and the ClassifierAnnotator provide Instance s as input to their respective DataWriter or Classifier , they both implement a common interface, InstanceConsumer . An AnnotationHandler will provide logic for looping through annotations, performing feature extraction, and creating instances and will then hand off each Instance to its InstanceConsumer which will be either an object of type DataWriterAnnotator or ClassifierAnnotator . As an example, an AnnotationHandler for part-of-speech tagging would handle the logic of looping over each token annotation and performing feature extraction for each token to be tagged. This will result in a single Instance being created for each token. At runtime the AnnotationHandler ’s InstanceConsumer will be an object of type DataWriterAnnotator , in which case the Instance will be written to a file, or ClassifierAnnotator , in which case an outcome will be returned corresponding to a part-of-speech tag. 8 Sequential vs. Non-sequential Classification There are many situations, especially in text processing, in which classifying a single Instance at a time is inappropriate or suboptimal. For example, it is rare for a part-of-speech tagger to individually classify each token separately. 
Instead, it is much more common that all the tokens in a given sentence are classified together. Similarly, there are classifiers such as Conditional Random Fields (CRF) and Hidden Markov Models which classify a sequence of instances together rather than one at a time. For this reason, ClearTK provides a parallel set of classes which support sequential classification which are called e.g. SequentialDataWriter , SequentialClassifier , SequentialAnnotationHandler , SequentialDataWriterAnnotator , etc. One sequential learner that we support is Mallet’s CRF implementation. However, it is quite common to use non-sequential learners for sequential classification tasks. To this end we have created a ViterbiClassifier class which wraps a non-sequential Classifier up as a SequentialClassifier . This class has two primary functions. The first is to provide viterbi-style search through the possible classifications the non-sequential classifier provides for each instance. The second, is to provide feature extraction based on previous outcomes. Generally, some of the best features to use in a sequential classification task correspond to the previous outcomes of previously classified instances. The ViterbiClassifier can use either a configurable default OutcomeFeatureExtractor or a custom-made one. In this way, feature extraction as performed by a SequentialAnnotationHandler does not need to concern itself with features based on previous outcomes. This is good because such features are inappropriate for other SequentialClassifier s such as the one ClearTK provides for Mallet’s CRF (namely because previous outcomes are unavailable when a sequence of Instance s are passed to it.) ClearTK: A Framework for Statistical Natural Language Processing 247 9 Discussion In this paper we have described the core concepts that make up the “framework” half of ClearTK while largely ignoring the “toolkit” half of the library. Of course, these two sides of ClearTK work together to make a number of tasks much easier to implement. For example, ClearTK provides infrastructure in its toolkit for tokenization, named entity recognition, semantic role labeling, and BIO-style chunking, among others. It also provides a number of specialized collection readers that read in corpora commonly used for training and evaluating these tasks. However, one major distinction between the toolkit and the framework is that the former in many cases is dependent on a type system provided by ClearTK while the latter is completely type-system agnostic. For this reason, and because the framework is a more general foundation for creating new components we believe that it is ClearTK’s more important contribution to the community. One of the primary goals of ClearTK is to ease development of new machine learning based NLP components without sacrificing any power of the library by catering only to simple solutions. We hope that developers will find that ClearTK eases the burden of building NLP components by freeing them from having to wrestle with mundane details such as generating training data files in the appropriate formats or invoking a variety of third party libraries that all do roughly the same thing (at classification time) but with very different APIs. Also, we have aimed to provide a feature extraction library that provides many of the common features that are used so that developers do not have to rewrite them. It is our hope that in many scenarios a developer will have to do little more than write a new AnnotationHandler . 
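To give an impression of what that amounts to, the handler/consumer split described in Section 7 can be reduced to the following schematic Java sketch (the interfaces and names are simplified stand-ins rather than the ClearTK classes): one implementation of the feature-extraction logic serves both a data writer at training time and a classifier at tagging time.

import java.util.Arrays;
import java.util.List;

public class AnnotationHandlerSketch {
    interface InstanceConsumer { String consume(List<String> features); }

    // Writes training data; returns no outcome.
    static class DataWriter implements InstanceConsumer {
        public String consume(List<String> features) {
            System.out.println("write training line: " + features);
            return null;
        }
    }
    // Pretends to classify; returns an outcome.
    static class Classifier implements InstanceConsumer {
        public String consume(List<String> features) { return "noun"; }
    }

    // The handler: one shared implementation of instance creation and feature extraction.
    static void handle(String token, InstanceConsumer consumer) {
        List<String> features = Arrays.asList(
            "text=" + token,
            "suffix=" + token.substring(Math.max(0, token.length() - 3)));
        String outcome = consumer.consume(features);
        if (outcome != null) System.out.println(token + " tagged as " + outcome);
    }

    public static void main(String[] args) {
        handle("pony", new DataWriter());   // training time
        handle("pony", new Classifier());   // classification time
    }
}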
On the other hand, it has also been our goal to empower developers to tinker with the details that arise in end-to-end component development as they see fit or as need arises. For example, in many cases it will be possible for a developer to create a new component without ever worrying about the distinction between feature extraction and feature encoding as described above by simply using the default feature encoders assigned to each classifier. However, this may not be feasible for more complicated or competitive scenarios. Similarly, developers can exert full control over how a model is learned via library-specific parameter tuning (e.g. kernel selection for LibSVM). Currently, ClearTK supports LibSVM 10 described in Chang & Lin (2001), Mallet Classifiers (as non-sequential classifiers) and Mallet’s CRF (as a sequential classifier), OpenNLP MaxEnt, and SVM light11 described in Joachims (1999). ClearTK enjoys 73.4% test coverage for the framework portion of the code described in this paper. We have found ClearTK to be very useful for our own research and development. In Ogren et al. (2008) we showed that ClearTK can be used to create a biomedical part-of-speech tagger that performs at state-of-the-art levels. More recently, the work done by Wetzler et al. (2009) exploited ClearTK’s ability to create document classifiers. 10 http: / / www.csie.ntu.edu.tw/ ~cjlin/ libsvm/ 11 http: / / svmlight.joachims.org/ 248 Philip V. Ogren, Philipp G. Wetzler, Steven J. Bethard References Chang, C.-C. & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http: / / www.csie.ntu.edu.tw/ ~cjlin/ libsvm . Ferrucci, D. & Lally, A. (2004). UIMA: An Architectural Approach to Unstructured Information Processing in the Corporate Research Environment. Natural Language Engineering, 10(3-4), 327-348. Joachims, T. (1999). Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning. MIT-Press. McCallum, A. K. (2002). MALLET: A Machine Learning for Language Toolkit. http: / / mallet.cs. umass.edu . Ogren, P. V., Wetzler, P. G., & Bethard, S. (2008). ClearTK: A UIMA toolkit for statistical natural language processing. In UIMA for NLP workshop at Language Resources and Evaluation Conference (LREC). Pedersen, T. (2008). Empiricism is not a matter of faith. Computational Linguistics, 34(3), 465-470. Wetzler, P. G., Bethard, S., Butcher, K., Martin, J. H., & Sumner, T. (2009). Automatically assessing resource quality for educational digital libraries. In WICOW ’09: Proceedings of the 3rd workshop on Information credibility on the web, pages 3-10, New York, NY, USA. ACM. Abstracting the Types away from a UIMA Type System * Karin Verspoor, William Baumgartner Jr., Christophe Roeder, and Lawrence Hunter Center for Computational Pharmacology University of Colorado Denver School of Medicine MS 8303, PO Box 6511, Aurora, CO 80045 Karin.Verspoor@ucdenver.edu , William.Baumgartner@ucdenver.edu , Chris.Roeder@ucdenver.edu , Larry.Hunter@ucdenver.edu Abstract This paper discusses the design of a “generic” UIMA type system, in which the type system itself is a lightweight meta-model and genericity is achieved by referencing an external object model containing the full semantic complexity desired for the application. This allows arbitrarily complex semantic types to be manipulated and created by UIMA components, without requiring representation of domain-specific types within the type system itself. 
A meta-model type system further allows for the definition of a single type system that can be used in a wide array of contexts, supporting an even wider array of semantic types. 1 Introduction The Unstructured Information Management Architecture 1 (UIMA, Ferrucci et al., 2006) is a framework that supports the definition and integration of software modules that perform analysis on unstructured data such as text documents or videos. UIMA facilitates assignment of structure to these unstructured artifacts by providing a means to link the artifacts to meta-data that describes them (Ferrucci et al., 2009). Two key components of UIMA are the Common Analysis Structure, or CAS, and the UIMA type system. The CAS is the basic data structure in which both the unstructured information being analyzed and the meta-data inferred for that information are stored. The UIMA type system is a declarative definition of an object model, and serves two main purposes: (a) to define the kinds of meta-data that can be stored in the CAS; and (b) to support description of the behavior of a processing module, or analytic, through specification of the types it expects to be in an input CAS and the types it inserts as output. There are many strategies for defining a type system in UIMA. UIMA itself does not include a particular set of types that developers must use, other than the high-level notion of a uima.jcas.tcas.Annotation , the basic type of all meta-data that refers to a region of an artifact (e.g. a span of text). Users must define their own domain and application specific type * Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 249-256. 1 Apache UIMA website: http: / / incubator.apache.org/ uima 250 Karin Verspoor, William Baumgartner Jr. et al. systems. This leads to the problem of type system divergence, which in turn stands in the way of seamless plug and play compatibility among UIMA components. The irony here is that the ease with which UIMA can be customized actually works against the overall interoperability goals of the UIMA framework. Representing types in different forms, or even under different package names/ namespaces, can easily break interoperability between UIMA components. This flexibility can result in large type systems that tie analysis engines down to specific domain models, are cumbersome to maintain in the face of shifting domain semantics, and prevent straightforward reuse in diverse applications. In this paper, we will consider the semantics of the UIMA type system itself. We identify two extremes for the definition of a UIMA type system. Any given UIMA type system may lie at some point within these extremes. 1. Fully specified semantic model: A type system in which all of the relevant semantic types of the application domain for the meta-data of the system are specified within the UIMA type system. 2. Generic meta-model: A type system in which only very high-level semantic types are specified within the UIMA type system, while the meta-data semantics is defined externally to UIMA. We argue in this paper for preferring type systems that are strongly generic, that abstract the domain semantics away from the UIMA framework. The types in our UIMA type system allow us to model knowledge captured by publicly-curated ontologies and other external resources by referencing the terms in those resources. 
A strength of this approach is the ability to take advantage of community-wide consensus already achieved in the external resources we use. What we propose is essentially a type system without types, or more specifically with a minimal number of explicitly defined types. The approach purposefully creates a layer of abstraction between the type system itself and the semantics that it represents. We provide an example of such a type system, and show that this has the advantage of application flexibility while facilitating component reuse and sharing. 2 UIMA Type System as a Meta-Model The perspective we endorse is to consider the UIMA type system as a meta-model for the semantics of the structured content of an artifact. We adopt the term meta-model carefully, following its use in software engineering, where it indicates a model of a model. For instance, UML, the Unified Modeling Language, is a model defining how to express the structure of a software program. In the type system we describe here, the underlying model represents the domain semantics, and the meta-model is a description of how to characterize that domain semantics in UIMA. It follows from this perspective that the domain semantics itself should not be a part of the UIMA type system. Most UIMA type systems have a generic “upper level” in the class hierarchy extending the builtin Annotation type with classes that capture the basic distinctions among kinds of meta-data to be inserted in the CAS. For UIMA systems focused on representing the core information contained in textual documents for natural language processing, this normally consists of types such as SyntacticAnnotation , with subtypes for tokens, sentences, etc., and ConceptAnnotation as the umbrella class for specific kinds of entities and relations. These upper level types are Abstracting the Types away from a UIMA Type System 251 seen as forming the application-independent core of the type system, easily shareable among analytics, with extensions used to capture more application-specific distinctions (Hahn et al., 2007; Kano et al., 2008). Thus, an application in the biomedical domain, for instance, might add subtypes to ConceptAnnotation for particular entity types, such as proteins, organisms, and so forth. We propose a type system that is restricted to this upper level, where all domain type distinctions can be made in resources external to UIMA, such as ontologies or databases. These specific distinctions are then referenced from a generic type within UIMA, where the relevant domain type is indicated as the value of a feature in the generic type, rather than as a class in the type system model. This approach gives the advantage of having a UIMA application that is robust to, and up to date with, changes in the domain semantics. Others have also proposed UIMA type systems that include a connection to external resources (Hahn et al., 2007; Buyko & Hahn, 2008; Kano et al., 2009). However, these type systems are still presumed to make the core semantic distinctions of the application domain within the UIMA type system itself, so that for instance an entity that is linked to an identifier in the UniProtKB database 2 , a database which stores information about proteins, would be annotated with a UIMA type of Protein , extending ConceptAnnotation . In our view, such domain-specific types do not belong in the UIMA type system at all. 
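The contrast between the two extremes can be made concrete with a short, hypothetical Java sketch (the class and field names are invented for the illustration and are not taken from any of the cited type systems): instead of declaring one annotation subclass per domain concept, a single generic annotation carries a reference into an external resource.

public class MetaModelSketch {
    // Fully specified semantic model: one class per domain concept.
    static class ConceptAnnotation { int begin, end; }
    static class Protein extends ConceptAnnotation { }

    // Generic meta-model: one class, the concept lives in an external resource.
    static class ClassMention {
        int begin, end;
        String ontologySource;   // e.g. an ontology or database name
        String ontologyElement;  // e.g. an identifier within that resource
    }

    public static void main(String[] args) {
        // Fully specified alternative: the concept is the Java class itself.
        Protein p = new Protein();
        p.begin = 10; p.end = 15;
        System.out.println(p.getClass().getSimpleName() + " @ [" + p.begin + "," + p.end + ")");

        // Meta-model alternative: the concept is a reference into an external resource.
        ClassMention m = new ClassMention();
        m.begin = 10; m.end = 15;
        m.ontologySource = "UniProtKB";
        m.ontologyElement = "P01234";   // invented identifier
        System.out.println(m.ontologySource + ":" + m.ontologyElement + " @ [" + m.begin + "," + m.end + ")");
    }
}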
3 The CCP UIMA Type System In the Center for Computational Pharmacology (CCP), we have implemented a meta-model type system that is used in all of our UIMA applications, such as our tools for biomedical natural language processing 3 (Baumgartner et al., 2008), and particularly the OpenDMAP system (Hunter et al., 2008), a system for concept recognition in text. The CCP UIMA type system consists of a lightweight annotation hierarchy, where the domain semantics is captured through pointers into external resources. In OpenDMAP, we take advantage of community-curated ontologies that have been developed for and by experts in our application domain, biomedicine. An ontology is an explicit specification of the concepts and relations employed in a given domain of interest (Gruber, 1993). In the biomedical domain, there has been a concerted effort to develop large-scale, shared, representations of domain concepts for the purpose of enabling linkages across databases and supporting consistent labeling of diverse data. Large amounts of time and money have been invested in achieving community consensus in resources such as the Gene Ontology (The Gene Ontology Consortium, 2000). Our meta-model approach is able to directly build on these efforts. A prototype meta-model type system was used in Baumgartner et al. (2008), however our proposed CCP UIMA type system can be seen in Figure 1. This reflects the model we are working towards. It is very close to what we currently have implemented, with some modifications to support the new UIMA standard (Ferrucci et al., 2009), in particular the inclusion of a Referent . The existing implementation is also not a completely generic meta-model as we have maintained some specific classes (mostly for syntactic annotations) for more efficient indexing (see Section 6 below). It was previously noted by Heinze et al. (2008) that it can be important to separate mentions in a text from the entity those mentions refer to, for instance to link all mentions of a given entity together, and to normalize the representation of an entity. This observation led to the inclusion of the 2 UniProtKB database: http: / / www.uniprot.org . 3 Tools for BioNLP: http: / / bionlp.sourceforge.net 252 Karin Verspoor, William Baumgartner Jr. et al. Figure 1: UML diagram of the core CCP UIMA type system Referent type in the UIMA standard (Ferrucci et al., 2009), encapsulating a reference into a data source containing the domain semantics. The specification notes that the domain ontology can be defined within the UIMA type system, as is done by Heinze et al. (2008), or externally to UIMA, as we propose here. In either case, this representation more cleanly separates semantic types from tokens, that is, the abstract representation of a concept from its physical manifestation in text (or some other data source). The Referent type fits in well with our approach. Each Annotation may be an occurrence of a Referent , while each Referent is manifested through an Annotation . The primary UIMA annotation type in our type system is a CCPTextAnnotation , which augments the basic UIMA Annotation type with some additional features which we have found useful (e.g. an annotation ID and support for capturing non-contiguous spans with a single annotation). The CCPClassReferent type is a specialization of Referent , adding support for arbitrary slots, or attributes, to further characterize the reference. 
In our implementation, the CCPTextAnnotation occurrenceOf field is intended to be linked to a CCPClassReferent rather than a basic Referent object. A slot, implemented as CCPSlotReferent , also extends Referent and thus also refers to an element of an external ontology. Figure 2 provides an example of the sort of complex representations that can be expressed in our type system. This representation results from the application of our OpenDMAP system to information extraction of biological events in the context of the BioNLP09 shared task (Cohen et al., 2009). The strings in quotes correspond to the text of the CCPTextAnnotation associated with a given class referent ( CCPClassReferent ), while the data in brackets reflects the core Referent features. What we see in the example is a single positive_regulation event with multiple slots (arguments), one labeled as a cause and the other as a theme , where the slot value for each slot is in turn a CCPClassReferent labeled as a protein and associated with a particular string in the text, and a simple slot value indicating the identifier of the protein. Essentially, the representation corresponds to the predicate positive_regulation("gp41","IL-10") , with the additional information that the two arguments are proteins, and their roles. All of the labels for class referent and slot elements correspond to concepts defined in the BioNLP09 ontology we developed for the task. Note that for a different use case in which the external ontology specifies the relevant protein identifiers, each protein ID, rather than with a simple slot referent, could be represented as ClassReferent: [Elem: T2, Src: BioNLP09] . Abstracting the Types away from a UIMA Type System 253 ClassReferent: [Elem: positive_regulation, Src: BioNLP09] "up-regulation" ComplexSlotReferent: [Elem: cause, Src: BioNLP09] ClassReferent: [Elem: protein, Src: BioNLP09] "gp41" SlotReferent: [Elem: ID, Src: BioNLP09] with SLOT VALUE: T3 ComplexSlotReferent: [Elem: theme, Src: BioNLP09] ClassReferent: [Elem: protein, Src: BioNLP09] "IL-10" SlotReferent: [Elem: ID, Src: BioNLP09] with SLOT VALUE: T2 Figure 2: An example of a complex class referent identified by the OpenDMAP system in text: The portion within brackets [] corresponds to the core features of a Referent , ontologyElement (Elem) and ontologySource (Src). We utilize an ontology specific to our submission to the BioNLP09 shared task (Cohen et al., 2009), not shown. 4 Discussion of the Meta-Model Approach Our experiences using a meta-model type system have enabled us to use semantic data models that would have been impossible to handle using a traditional fully specified type system. A clear advantage in using a meta-model type system is the fact that the full set of types, and the relationships between them, do not need to be generated a priori. This benefit is crucial in a number of different ways. First, when dealing with very large semantic ontologies, there is no need to create, and subsequently maintain, a large collection of Java class files and corresponding type system descriptors. As an example, the Gene Ontology contains 27,742 terms as of this writing. Large numbers of classes may be daunting in regards to code maintenance, but it becomes an intractable problem because ontologies may support multiple inheritance, whereas Java and the UIMA type system implementation do not. Ontologies that support multiple inheritance cannot be accurately reflected in the UIMA type system, their size not withstanding. 
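Because the class hierarchy stays in the ontology rather than in the Java type system, even a nested event such as the one in Figure 2 can be assembled from a handful of instances of the same few generic types. The following sketch does so with the plain CAS API; the type and feature names (ontologyElement, ontologySource, slots, slotValue) follow the prose description above, but their exact spelling in the CCP descriptor and the use of an FSArray-valued slots feature are assumptions, and the ID slots and linked text annotations of Figure 2 are omitted for brevity.

import org.apache.uima.cas.ArrayFS;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FeatureStructure;
import org.apache.uima.cas.Type;

public class ClassReferentSketch {

    // Create a class referent that points to a term in the BioNLP09 ontology.
    static FeatureStructure classReferent(CAS cas, String term) {
        Type t = cas.getTypeSystem().getType("org.example.CCPClassReferent");
        FeatureStructure fs = cas.createFS(t);
        fs.setStringValue(t.getFeatureByBaseName("ontologyElement"), term);
        fs.setStringValue(t.getFeatureByBaseName("ontologySource"), "BioNLP09");
        return fs;
    }

    // Create a complex slot referent whose role and filler are both ontology-grounded.
    static FeatureStructure complexSlot(CAS cas, String role, FeatureStructure filler) {
        Type t = cas.getTypeSystem().getType("org.example.CCPComplexSlotReferent");
        FeatureStructure fs = cas.createFS(t);
        fs.setStringValue(t.getFeatureByBaseName("ontologyElement"), role);
        fs.setStringValue(t.getFeatureByBaseName("ontologySource"), "BioNLP09");
        fs.setFeatureValue(t.getFeatureByBaseName("slotValue"), filler);
        return fs;
    }

    // Assemble positive_regulation(cause: protein, theme: protein) as in Figure 2.
    static void buildEvent(CAS cas) {
        FeatureStructure event = classReferent(cas, "positive_regulation");
        FeatureStructure cause = complexSlot(cas, "cause", classReferent(cas, "protein"));
        FeatureStructure theme = complexSlot(cas, "theme", classReferent(cas, "protein"));
        ArrayFS slots = cas.createArrayFS(2);
        slots.set(0, cause);
        slots.set(1, theme);
        event.setFeatureValue(event.getType().getFeatureByBaseName("slots"), slots);
        cas.addFsToIndexes(event);
    }
}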
Equally as beneficial to representing large ontologies is the inherent domain and application independence gained with a meta-model type system. No extensions to the type system are needed when switching domains. Further, since relationships between semantic types have been abstracted away from the type system, we are no longer restricted to representing only the hierarchical parent/ child relationships as can be mirrored in the Java class hierarchy. We can take advantage of arbitrary relationships that might exist in a complex ontology. The meta-model approach is not free of faults however. Here we discuss several drawbacks and how we combat them. In regards to developer efficiency, perhaps the most significant drawback comes from the loss of compile-time semantic type checking when using a meta-model type system. This compile-time checking is present when using a fully specified type system. It provides valuable feedback to the developer that is lost when using a meta-model type system. We have replaced this compile-time checking with two runtime validation procedures. The first checks the validity of the annotation structure in general. This is necessary because the internal UIMA collections are not genericized, i.e. it is important to check that a ComplexSlotReferent is not inserted into an FSArray storing NonComplexSlotReferent s, or storing CCPSpan s for that matter. We turn back to the ontologies to validate the semantic consistency of the annotations and use the formal ontology specification as a basis for type checking. Another, somewhat less crucial in our experience, drawback to using the meta-model is the inability to use the integrated annotation indexes to extract specific annotation types. We come back to this point in Section 6. 254 Karin Verspoor, William Baumgartner Jr. et al. 5 Enabling Analysis Engine Interoperability One of the key goals of the UIMA framework is to facilitate interoperability of analytics. The type system plays a critical role in achieving this goal, as it specifies the types available for analytics to both operate on and produce. For instance, an analytic that performs part-of-speech tagging on a text may require that TokenAnnotation objects marking tokens have been previously stored in the CAS. A tokenizer that produces some other kind of token annotation, say Token objects, cannot be combined directly with the tagging analytic. Adopting an abstracted type system means that the annotations produced by an analytic will be grounded in an external semantic resource, which is more easily shared among research groups, and whose structure likely reflects a consensus regarding the entities and relationships defining the domain. While the development of a common type system shared among diverse research groups would also solve the interoperability problem, there has yet to emerge a widely adopted type system, despite several proposals (Hahn et al., 2007; Kano et al., 2008) and the diverse range of applications to which UIMA is applied is likely to preclude widespread adoption of one. Use of an abstracted type system gives more flexibility to UIMA system developers to add fields to their types as necessary, and at runtime, while adhering to a common semantics for domain entities. Also, because the type system is very shallow, it is more likely to be applicable to the full range of applications. 
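As a sketch of what this buys the consumer of annotations, assuming the same illustrative type and feature names as in the earlier sketch, a downstream analytic can select its input by the external term an annotation references rather than by its Java type, so that any producer grounding its output in the same resource is compatible with it.

import java.util.ArrayList;
import java.util.List;

import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.Feature;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.text.AnnotationFS;

public class SemanticSelection {

    // Collect every generic concept annotation that references the given term in the
    // given external resource, regardless of which analytic produced the annotation.
    static List<AnnotationFS> selectByTerm(CAS cas, String source, String termId) {
        Type conceptType = cas.getTypeSystem().getType("org.example.ConceptAnnotation");
        Feature elem = conceptType.getFeatureByBaseName("ontologyElement");
        Feature src = conceptType.getFeatureByBaseName("ontologySource");
        List<AnnotationFS> hits = new ArrayList<AnnotationFS>();
        FSIterator<AnnotationFS> it = cas.getAnnotationIndex(conceptType).iterator();
        while (it.hasNext()) {
            AnnotationFS a = it.next();
            if (source.equals(a.getStringValue(src)) && termId.equals(a.getStringValue(elem))) {
                hits.add(a);
            }
        }
        return hits;
    }
}

Section 6 below touches on how the framework's indexes and behavioral metadata could support this kind of feature-value access more directly.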
Another solution to the interoperability issue, the definition of analytics that convert from one type system to another (Kano et al., 2009), is also facilitated through the use of an abstracted type system. This is again because the domain semantics is external to the UIMA framework: what must be converted is limited to the meta-model, i.e. whatever specific features have been added to the high-level types, while the references to the external semantics remain unchanged. However, there does exist a limitation on interoperability using the meta-model approach. It still requires different modules referencing the same external data source to agree on the basic metamodel for that data source. For Referent objects which only need to refer to an external element identifier, there is little issue, but for the more complex objects we aim to capture with our CCP- ClassReferent object - including events and relations - there must be agreement on how to map the various ontology elements to the abstracted representation. For instance, conventions for naming slots and a specification of which slots are to be included in the representation must be decided for maximal compatibility. Essentially, this is a question of how the UIMA meta-model aligns to the meta-model of the underlying data source (e.g. a database schema). We feel that this level of consensus is more straightforward to achieve, and in future work we will examine automated methods for deriving the UIMA type system meta-model automatically from the meta-model of the data itself. 6 Implications for the UIMA framework Adoption of a meta-model representation for the UIMA type system leads to certain technical desiderata for the UIMA framework, in particular with regards to indexing of objects in the CAS. In the current Apache UIMA implementation, these objects are indexed by type. For systems which implement a full semantic model in the type system, this means that it is straightforward to access only those objects that correspond to a particular semantic class, as each class corresponds to a unique type. For the meta-model approach to be as efficient, it must be possible to index and access the CAS meta-data via, at least, the two core fields of the Referent object. It is currently possible to define a filtered iterator over objects in the CAS that achieves the result of only iterating Abstracting the Types away from a UIMA Type System 255 over objects that contain particular feature values, but this is not as efficient as direct support for the meta-model representation. While this has not so far been an issue for us, one could imagine a system requirement concerning optimization of index functionality. The same issue impacts representation of behavioral meta-data for analytics. Input/ output capabilities of analytics are specified in terms of type system types. Ideally, analytics would be able to declare capabilities in terms of feature values. This would allow analytics to refer to the domain types specified in the Referent fields. 7 Conclusion We have described an approach to constructing UIMA type systems in which the semantic complexity of the application domain is allowed to remain fully external to the type system definition itself. This has important implications for the robustness and interoperability of analytics developed in UIMA. Acknowledgements The authors acknowledge NIH grants R01LM009254, R01GM083649, and R01LM008111 to Lawrence Hunter for supporting this research. 
Thanks to Philip Ogren for contributing to early discussions on our type system implementation. References Baumgartner, W. J., Cohen, K., & Hunter, L. (2008). An open-source framework for large-scale, flexible evaluation of biomedical text mining systems. J Biomed Discov Collab 29; 3: 1. Buyko, E. & Hahn, U. (2008). Fully embedded type systems for the semantic annotation layer. In In Proceedings of First International Conference on Global Interoperability for Language Resources (ICGL). Cohen, K., Verspoor, K., Johnson, H., Roeder, C., Ogren, P., Baumgartner, W., White, E., Tipney, H., & Hunter, L. (2009). High-precision biological event extraction with a concept recognizer. In Proceedings of the BioNLP09 Shared Task Workshop, pages 50-58, Boulder, CO. Ferrucci, D., Murdock, W., & Welty, C. (2006). Overview of component services for knowledge integration in UIMA (a.k.a. SUKI). Technical Report RC24074, IBM Research. Ferrucci, D., Lally, A., Verspoor, K., & Nyberg, A. (2009). Unstructured information management architecture (UIMA) version 1.0. Oasis Standard http: / / docs.oasis-open.org/ uima/ v1.0/ uima-v1.0.pdf . Gruber, T. (1993). Toward principles for the design of ontologies used for knowledge sharing. Formal Ontology in Conceptual Analysis and Knnowledge Representation. Hahn, U., Buyko, E., Tomanek, K., Piao, S., Tsuruoka, Y., McNaught, J., & Ananiadou, S. (2007). An UIMA annotation type system for a generic text mining architecture. In UIMA Workshop, GLDV Conference. Heinze, T., Light, M., & Schilder, F. (2008). Experiences with UIMA for online information extraction at thomson corporation. In UIMA for NLP workshop at Language Resources and Evaluation Conference (LREC). 256 Karin Verspoor, William Baumgartner Jr. et al. Hunter, L., Lu, Z., Firby, J., Baumgartner, W., Johnson, H., Ogren, P., & Cohen, K. (2008). OpenDMAP: An open-source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinformatics, 9. Kano, Y., Nguyen, N., Saetre, R., Yoshida, K., Miyao, Y., Tsuruoka, Y., Matsubayashi, Y., Ananiadou, S., & Tsujii, J. (2008). Filling the gaps between tools and users: A tool comparator, using protein-protein interactions as an example. In Pacific Symposium on Biocomputing, volume 13, pages 616-627. Kano, Y., Baumgartner, W., McCrohon, L., Ananiadou, S., Cohen, K., & Hunter, L. (2009). U-compare: share and compare text mining tools with uima. Bioinformatics. The Gene Ontology Consortium (2000). Gene ontology: tool for the unification of biology. Nat. Genet., 25(1), 25-29. Simplifying UIMA Component Development and Testing with Java Annotations and Dependency Injection * Christophe Roeder, Philip V. Ogren, William A. Baumgartner Jr., and Lawrence Hunter Center for Computational Pharmacology University of Colorado Denver School of Medicine MS 8303, PO Box 6511, Aurora, CO 80045 USA Chris.Roeder@ucdenver.edu , philip@ogren.info , William.Baumgartner@ucdenver.edu , Larry.Hunter@ucdenver.edu Abstract Developing within the Apache UIMA project framework presents challenges when writing and testing components in Java. Challenges stem from the relationship between the Java source code implementing the components and the corresponding UIMA XML descriptor files describing configuration and deployment settings. 
Java Annotations and Dependency Injection can be used to establish a stronger separation of concerns between framework integration and core component implementation, thus freeing the developer from commonly repeated tasks and allowing simplified development and testing. 1 Introduction UIMA 1 components are defined by a pair of files: a source code file implementing component functionality and an XML descriptor file that stores component metadata consisting of component description information, e.g. information about input parameters (names, types, default values, etc.). At runtime, a third file specifies values for those input parameters: either the CPE descriptor when running in the CPM or an aggregate descriptor when running under AS. The current UIMA Java implementation burdens the developer with keeping the first two in sync, and it burdens the developer with writing and testing code to query the third. We propose two solutions: Java Annotations (Bloch et al., 2004) to eliminate the duplication of component descriptor metadata, and a completed Dependency Injection implementation (Fowler, 2004) to eliminate redundant configuration metadata extraction code. * Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 257-260. 1 Apache UIMA website: http://incubator.apache.org/uima 2 Java Annotations Java Annotations are a facility that appeared in Java 1.5. They allow metadata to be defined in the Java source code and retained during the compilation process. As they are part of the Java class files, Java Annotations are retrievable at runtime via the Java Reflection API 2. We propose that Java Annotations are a viable mechanism for representing the metadata currently defined in UIMA component descriptor XML files. Positioning the metadata within the source code would remove the metadata duplication that currently exists, thereby easing the burden of coding and testing on the developer. Under this scenario, changing an input parameter requires a change only to a single file, reducing the chance of creating inconsistencies. A UIMA version supporting Java Annotations would also support the current XML representation in support of older components. It is important to point out that the use of Java Annotations does not preclude the use of XML descriptor files in different components. Java Annotations require a dedicated class to represent the annotation. Below is an example Annotation class that could be used to represent data about a file parameter. The functions in the Annotation declaration work as getters for the values of the Annotation's parameters.

@Retention(RetentionPolicy.RUNTIME)
public @interface FileParameterAnnotation {
    String description() default "no description";
    boolean multiValued() default false;
    boolean mandatory() default false;
    String defaultValue() default "";
}

With the Annotation defined, it could be used to mark a Java class member variable as an input parameter for a particular UIMA component. The Java syntax for invoking an Annotation is the @ symbol followed by the name of the Annotation, optionally including named values for its parameters. To minimize duplication, the name appearing in the deployment configuration is assumed to be the same as the variable name.
Below is an example of how an input parameter indicating the path to an output file could be defined for a fictitious UIMA component.

public class Example_AE extends JCasAnnotator_ImplBase {
    @FileParameterAnnotation(description="output file path", multiValued=false, mandatory=true)
    private String outputFilePath;

    public void setOutputFilePath(String s) {
        outputFilePath = s;
    }

    public void process() {
        FileWriter fr = new FileWriter(outputFilePath);
    }
}

Once annotated, configuration information for a UIMA component can easily be extracted at runtime by reading from the Java Annotation using Reflection. 3 Dependency Injection Dependency Injection (Fowler, 2004) is a name given to the process of assembling a runtime configuration from a configuration file. Assembling the processing pipeline of components from a CPE descriptor is a form of Dependency Injection. The components are injected into the framework by a process driven by the CPE descriptor. In UIMA the injection doesn't include setting the parameter values as in other implementations; rather, the components query the UIMAContext, and this must be written for each component. A complete Dependency Injection implementation would have the framework set the parameters on the component objects, eliminating the need to write and test such code for each component.

2 Java Reflection API: http://java.sun.com/j2se/1.4.2/docs/api/java/lang/reflect/package-summary.html

With Dependency Injection, generic framework code searches the component for setter functions using Reflection based on the parameter name, and puts the configuration values obtained from the metadata into the component. For example, if you have the Example_AE from above, code in the depths of UIMA that creates it from the parsed deployment descriptor as represented in the UIMAContext object would look roughly like the following code. It would read the metadata (below), extracting the method name and the value (the path), inspect the annotator, and set the file path on it:

<casProcessor deployment="integrated" name="Example_AE">
    <configurationParameterSettings><nameValuePair>
        <name>OutputFilePath</name><value><string>/home/roederc/outputfile</string></value>
    </nameValuePair></configurationParameterSettings>

JCasAnnotator ae = new Example_AE();
// ae.setOutputFilePath("/home/roederc/outputfile");
Class aeClass = Class.forName("Example_AE");
Method method = aeClass.getMethod("set" + "OutputFilePath", String.class);
method.invoke(ae, "/home/roederc/outputfile");

The three lines that follow the commented call to setOutputFilePath() do essentially the same thing as the commented call, but in a way that can be driven by metadata. The function name is derived from the parameter name. Other code in UIMA would supply that string and this code would be used generically for any component. Since the framework is responsible for transferring the configuration data from the UIMAContext object to the component, the component does not have to do it, thereby relieving the developer of both writing and testing this code for each new component. 4 Related Work UUTUC (Ogren & Bethard, 2009) is a set of convenience classes written to simplify UIMA testing. It allows a component and related UIMA infrastructure to be created without the overhead of a UIMA pipeline and related XML 3.
It does this by providing factory classes with methods that create various UIMA framework components from configuration data supplied in the Java code. UUTUC was written to reduce the dependence of test code on XML configuration data from deployment descriptors. Although the configuration metadata is not required in XML form, components still need to be written with queries into the UIMAContext object to get configuration data when UUTUC alone is used. 5 Conclusion Adding Java Annotations to and improving the Dependency Injection in UIMA is the logical next step after UUTUC in making developing and testing components easier. The techniques described here not only reduce the XML required, but reduce the code and related testing required. Improving Dependency Injection is separate from XML issues, but reduces the amount of code involved in maintenance and testing when changing parameters. Java Annotations eliminate the need for XML component descriptors in the broader UIMAContext object. UUTUC, in contrast, while eliminating the need for component and deployment descriptors when testing, does not change the need for XML component descriptors during deployment under AS or CPM (UUTUC does not prevent AnalysisEngineDescription.toXML() from being called, so the descriptors can be generated in that environment). The component descriptors are used to accurately create deployment descriptors, so delivering a component without them hobbles deployment in these environments. Java Annotations provide an alternative way of representing this information, and with modification to some or all of CPM, AS, and tools used to create the deployment descriptors, can replace the XML form of component descriptors. Moving UIMA in this direction would involve two major steps. Incorporating Java Annotations requires identifying and modifying code that reads the component descriptors. This does not have to preclude continued use of component descriptors for existing components. Improving Dependency Injection would require modification of the CPM and/or AS to inject values from the UIMAContext object to the components. Such a modification, like that for Java Annotations, does not preclude existing components from querying the UIMAContext object as they do today. The changes can be made so the benefits are there for the future, while not requiring change for existing components.

3 UUTUC tutorial: http://code.google.com/p/uutuc/wiki/GettingStarted

Acknowledgments The authors thank Kevin Bretonnel Cohen, Helen L. Johnson, and Karin Verspoor for careful review. This work was supported by NIH grants R01LM009254, R01GM083649, and R01LM008111 to Lawrence Hunter and T15LM009451 to Philip Ogren. References Bloch, J. et al. (2004). JSR 175: A metadata facility for the Java programming language. Technical Report Final Release, Sun Microsystems Inc. Fowler, M. (2004). Inversion of control containers and the dependency injection pattern. http://www.martinfowler.com/articles/injection.html. Ogren, P. & Bethard, S. (2009). Building test suites for UIMA components. In Proceedings of the Workshop on Software Engineering, Testing and Quality Assurance for Natural Language Processing (SETQA-NLP 2009). UIMA-Based Focused Crawling * Daniel Trümper, Matthias Wendt, and Christian Herta neofonie GmbH Robert-Koch-Platz 4 10117 Berlin {truemper,matthias,herta}@neofonie.de Abstract In this paper, we describe our ongoing efforts to implement a Topic Specific Crawler.
We propose to integrate the classification into the crawler using UIMA and present our architecture for a distributed crawler based on UIMA-AS. We believe this will allow for a finer-grained scalability of the system by scaling the expensive components of the crawler. 1 Introduction and Related Work A crawler is one of the key components in any Web search engine. The current tasks of crawlers are limited to downloading and storing Web pages for processing. While we generally second the argument of improving the crawling speed, we want to emphasize that, for a special purpose, crawler-integrated processing is of value. State-of-the-art distributed crawlers focus on scaling based on the number of pages to be crawled. In contrast, we believe that for special-purpose crawlers it is also important to scale based on the amount of work needed for complex processing. This applies especially to applications like Focused Crawling or Spam Detection. The basic crawler architecture is described by Mercator (Heydon & Najork, 1999). A freely available crawler based on Mercator's design is the Internet Archive's Heritrix 1 crawler (Mohr et al., 2004). A distributed crawler design is described in Boldi et al. (2004). The crawler is started on several nodes. Each host is then crawled only by one node. The dynamic assignment of newly-extracted URLs is based on Consistent Hashing (Karger et al., 1997). A Focused Crawler is introduced in Chakrabarti et al. (1999). The basic idea is to follow only links from pages that belong to a given target topic. * Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 261-264. 1 http://crawler.archive.org 2 Current UIMA Usage and Motivation In order to conduct a Focused Crawl, we have created a proxy that contains a Topic Specific Crawling (TSC) module. The module is composed of feature extraction components and a trained SVM (Joachims, 1998) model implemented as Analysis Engines of a UIMA-based pipeline. We have measured the performance of our TSC component using the UIMA statistics accessible via the Java Management Extensions. Using 10,000 previously defined features, the classifier component can process up to 16 documents per second. A separate experiment showed that the Heritrix crawler is able to download up to 90 documents per second. Both tests were conducted on a single Quad-Core machine with 2.33 GHz processors, 8 GB of memory and a 200 GB hard disk. The proxy-based implementation of Focused Crawling has the distinct advantage that the classification process is independent of the crawler. The drawback of this method is the general loss of crawler performance due to the implementation of the politeness policies. These specify that at any given time only one connection to a single host is allowed to be open. The time interval between two requests is typically adapted to the server latency, which is now unknown to the crawler due to additional non-constant classification times. Furthermore, several timeout parameters (e.g. socket timeout, connection timeout) must be increased according to the complexity of the feature extraction. To overcome these disadvantages, we propose integrating the classification process directly into the crawler. First, we briefly introduce the policies that a crawler has to support: 1.
influence the way new sites are explored (selection policies), 2. assure scalability for conducting large-scale crawls (parallelization policy), 3. determine the behaviour of updating known pages (re-visit policy) and 4. constrain the bandwidth occupied (politeness policy). Integrating topic detection using classification could be beneficial for dynamically adapting the selection policies (1). Classification could also be useful in detecting typical (un)deliberate spider traps (e.g. calendars). Moreover, classification scores could then be used to assign links a priority for crawling (Baeza-Yates et al., 2005; Cho et al., 1998). For adapting the re-visit policy (3), classification could be used to identify link sources which are frequently updated, such as news ticker pages. Another issue which we want to address is the parallelization policy (2). We think that UIMA-AS could be valuable in distributing components which cause bottlenecks, thereby increasing the overall speed of a special-purpose (focused) crawler. Other benefits of this approach include the reuse of existing UIMA components, as well as facilities for monitoring and error-handling provided by UIMA-AS. 3 Crawler Architecture The crawler architecture is adopted from Mercator (Heydon & Najork, 1999). The components sketched in the UIMA-AS box in Figure 1 are implemented as Analysis Engines (AE) that can be deployed as singular or multiple instances on one or more nodes. These components are connected by UIMA in an Aggregate Analysis Engine. The TSC component also forms an aggregate AE.

Figure 1: Architecture of the UIMA-Based Crawler

The UrlFrontier is responsible for queueing new URLs such that the crawler adheres to the politeness policies. As long as a host's URL is being processed, no other URLs from the same host are available for processing. In a fully-distributed crawler, each node is assigned its own Frontier. For the reasons described in Boldi et al. (2004), this limits the possibility of influencing the selection policies. In order to have full control over the selection policies, a prerequisite for TSC, our architecture does not allow for a distributed Frontier. The CollectionReader is responsible for requesting URLs from the Frontier and adding cookies and authentication credentials to the CAS. From the resulting CAS, newly extracted links are queued within the Frontier. A Precondition component stores DNS resolutions and the host's robot rules 2 in the globally accessible ServerCache. Information about DNS entries is important for the Fetcher component, and the LinkScoper is guided by the robot rules. Several filters are applied to the processed URL. The ContentSeen filter marks a URL as a full duplicate if its content has been previously identified within another URL. The content's fingerprints are stored in a central database that is connected to each instance of the filter. The TSC component filters the content with a classifier and a previously-trained model. Components for extracting new links for different content types follow. Finally, the LinkScoper component tests whether the newly-extracted URLs adhere to the robot rules and the crawl scope. The remainder of the newly extracted URLs are normalized, i.e. session IDs and similar information are cleared from the URL. 4 Conclusion We have introduced a crawler architecture based on UIMA-AS. This architecture is designed for special-purpose crawlers.
We believe that UIMA-AS-based crawlers have several distinct advantages, such as integrating complex processing, scaling the components independently and reusing existing components. 2 Stored in the robots.txt 264 Daniel Trümper, Matthias Wendt, Christian Herta Acknowledgments We would like to thank Scott Robinson and Doris Maassen for helpful comments and their revision of this paper. This project was supported by the German Federal Ministry of Economics and Technology on the basis of a resolution by the German Bundestag. References Baeza-Yates, R. A., Castillo, C., Marín, M., & Rodríguez, A. (2005). Crawling a country: better strategies than breadth-first for web page ordering. In A. Ellis & T. Hagino, editors, WWW (Special interest tracks and posters), pages 864-872. ACM. Boldi, P., Codenotti, B., Santini, M., & Vigna, S. (2004). UbiCrawler: a scalable fully distributed Web crawler. Software-Practice and Experience, 34(8), 711-726. Chakrabarti, S., van den Berg, M., & Dom, B. (1999). Focused crawling: A new approach to topic-specific web resource discovery. Computer Networks, 31(11-16), 1623-1640. Cho, J., García-Molina, H., & Page, L. (1998). Efficient Crawling Through URL Ordering. In Proceedings of the Seventh International World Wide Web Conference (WWW7), pages 161-172. Heydon, A. & Najork, M. (1999). Mercator: A scalable, extensible web crawler. World Wide Web, pages 219-229. Joachims, T. (1998). Text categorization with support vector machines: learning with many relevant features. In C. Nédellec & C. Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, pages 137-142, Chemnitz, DE. Springer Verlag, Heidelberg, DE. Karger, D. R., Lehman, E., Leighton, F. T., Panigrahy, R., Levine, M. S., & Lewin, D. (1997). Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. In STOC, pages 654-663. Mohr, G., Kimpton, M., Stack, M., & Ranitovic, I. (2004). Introduction to heritrix, an archival quality web crawler. In 4th International Web Archiving Workshop (IWAW04). Annotation Interchange with XSLT * Graham Wilcock University of Helsinki Abstract The paper describes an XSLT stylesheet that transforms annotations from GATE XML to UIMA XML format. It extends an existing set of stylesheets that are freely available for download (Wilcock, 2009). 1 Introduction One of the reasons XML was rapidly adopted as a standard format for data interchange ten years ago was that the details can be changed. Whatever specific format the data is in, if it’s XML you can change it into the format you want using XSLT transformations. This was summed up in the joke XML means never having to say you’re sorry. For linguistic annotations, stand-off markup is normally used. The XML format for stand-off annotations is more complex than for in-line annotations. Nevertheless, the joke is not wrong: stand-off annotations in one specific format can be transformed into stand-off annotations in another specific format using XSLT stylesheets. To demonstrate the truth of this claim in practice, not merely in theory, we wrote a set of XSLT stylesheets that transform between specific stand-off annotation formats. The formats are Word- Freak XML (Morton & LaCivita, 2003), GATE XML (Cunningham et al., 2002) and UIMA XML (Götz & Suhre, 2004). Stylesheets that transform GATE XML to WordFreak XML, and WordFreak XML to UIMA XML, are described by Wilcock (2009). 
This paper describes another example stylesheet that transforms GATE XML to UIMA XML. 2 Transforming GATE XML to UIMA XML GATE XML format includes Feature elements that contain the feature’s Name and its Value . The fragment of GATE XML in Figure 1 shows an Annotation for a token that includes a Feature whose Name is string and whose Value is dun. The same Annotation includes another Feature whose Name is category and whose Value is NN. This shows that GATE has tagged the token dun as NN. The stylesheet gate2uima.xsl transforms GATE XML into UIMA XML (XMI). Namespace declarations are included for XMI, for the UIMA CAS (Common Analysis Structure) and for the type system used by the UIMA OpenNLP examples. * Published in: C. Chiarcos, R. Eckart de Castilho, M. Stede (eds.), Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically. Tübingen: Narr, 2009, pages 265-266. 266 Graham Wilcock <Annotation Id="100" Type="Token" StartNode="199" EndNode="202"> <Feature> <Name className="java.lang.String">category</ Name> <Value className="java.lang.String">NN</ Value> </ Feature> ... <Feature> <Name className="java.lang.String">string</ Name> <Value className="java.lang.String">dun</ Value> </ Feature> </ Annotation> Figure 1: GATE XML : An Annotation showing the token dun tagged as NN The overall structure of the UIMA XML file is built by the template in Figure 2. This creates the root element xmi: XMI , the cas: Sofa and cas: View elements in xmi: XMI , and uses <apply-templates select="AnnotationSet"/ > to produce the detailed annotations in between. The cas: View element includes a list of xmi: id numbers. Only annotations whose xmi: id numbers are listed as members of the view will be shown when the view is displayed by the UIMA Annotation Viewer. The stylesheet collects the list of xmi: id numbers for all annotation types (except file ) by means of an <xsl: for-each> loop. An opennlp: Sentence element is created for each GATE sentence annotation, and an opennlp: Token element for each GATE token annotation. The processing of tokens is shown in Figure 3. The values of the UIMA begin and end attributes are extracted from the GATE StartNode and EndNode attributes. The UIMA posTag attribute gets its value from the GATE Feature whose Name is category. For tokens, the UIMA componentId attribute is set to “GATE Tokenizer” and for sentences it is set to “GATE Sentence Splitter”. The GATE token dun (Figure 1) is shown after the transformation to UIMA XML in Figure 4, viewed in the UIMA Annotation Viewer. 3 Conclusion Stylesheets that transform GATE XML to WordFreak XML, and WordFreak XML to UIMA XML, are described by Wilcock (2009). The stylesheets are available for download from http: / / sites.morganclaypool.com/ wilcock . This paper describes another example stylesheet that transforms GATE XML to UIMA XML. References Cunningham, H., Maynard, D., Bontcheva, K., & Tablan, V. (2002). GATE: A framework and graphical development environment for robust NLP tools and applications. In 40th Anniversary Meeting of the Association for Computational Linguistics, Philadelphia. Götz, T. & Suhre, O. (2004). Design and implementation of the UIMA Common Analysis System. IBM Systems Journal, 43(3), 476-489. Morton, T. & LaCivita, J. (2003). WordFreak: An Open Tool for Linguistic Annotation. In Proceedings of HLT-NAACL 2003, Demonstrations, pages 17-18, Edmonton. Wilcock, G. (2009). Introduction to Linguistic Annotation and Text Analytics. Morgan and Claypool. 
Annotation Interchange with XSLT 267 <! -- Template to process top-level document element --> <xsl: template match="GateDocument"> <xsl: element name="xmi: XMI"> <xsl: attribute name="xmi: version">2.0</ xsl: attribute> <xsl: element name="cas: NULL"> <xsl: attribute name="xmi: id">0</ xsl: attribute> </ xsl: element> <! -- Make UIMA sofaString from GATE TextWithNodes --> <xsl: element name="cas: Sofa"> <xsl: attribute name="xmi: id">1</ xsl: attribute> <xsl: attribute name="sofaNum">1</ xsl: attribute> <xsl: attribute name="sofaID">_InitialView</ xsl: attribute> <xsl: attribute name="mimetype">text</ xsl: attribute> <xsl: attribute name="sofaString"> <xsl: value-of select="TextWithNodes"/ > </ xsl: attribute> </ xsl: element> <xsl: apply-templates select="AnnotationSet"/ > <xsl: element name="cas: View"> <xsl: attribute name="sofa">1</ xsl: attribute> <xsl: attribute name="members"> <xsl: text>999998</ xsl: text> <xsl: text> </ xsl: text> <xsl: text>999999</ xsl: text> <xsl: for-each select="/ / Annotation[@Type='Sentence']"> <xsl: text> </ xsl: text> <xsl: value-of select="@Id"/ > </ xsl: for-each> <xsl: for-each select="/ / Annotation[@Type='Token']"> <xsl: text> </ xsl: text> <xsl: value-of select="@Id"/ > </ xsl: for-each> </ xsl: attribute> </ xsl: element> </ xsl: element> </ xsl: template> Figure 2: gate2uima.xsl : Transforming the document from GATE XML to UIMA XML 268 Graham Wilcock <! -- Template to process tokens --> <xsl: template match="Annotation[@Type='Token']"> <xsl: element name="opennlp: Token"> <xsl: attribute name="xmi: id"> <xsl: value-of select="@Id"/ > </ xsl: attribute> <xsl: attribute name="sofa">1</ xsl: attribute> <xsl: attribute name="begin"> <xsl: value-of select="@StartNode"/ > </ xsl: attribute> <xsl: attribute name="end"> <xsl: value-of select="@EndNode"/ > </ xsl: attribute> <xsl: attribute name="posTag"> <xsl: value-of select="Feature[Name='category']/ Value"/ > </ xsl: attribute> <xsl: attribute name="componentId"> <xsl: text>GATE Tokenizer</ xsl: text> </ xsl: attribute> </ xsl: element> </ xsl: template> Figure 3: gate2uima.xsl : Transforming tokens from GATE XML to UIMA XML Figure 4: UIMA Annotation Viewer showing the token dun tagged NN by GATE Appendix List of Contributors 271 List of Contributors Farag Ahmed Department of Technical and Business Information Systems (ITI) Otto-von-Guericke University of Magdeburg Universitätsplatz 2 D-39106 Magdeburg farag.ahmed@ovgu.de Martin Atzmueller Department of Computer Science VI University of Würzburg Am Hubland D-97074 Würzburg atzmueller@informatik.uni-wuerzburg.de William Baumgartner Jr. Center for Computational Pharmacology University of Colorado Denver School of Medicine MS 8303, PO Box 6511 Aurora, CO 80045, USA William.Baumgartner@ucdenver.edu Steven J. Bethard Stanford University Stanford CA 94305, USA Johan Bos Dipartimento di Informatica University of Rome “La Sapienza” Via Salaria 113 00198 Roma, Italia bos@di.uniroma1.it Gerlof Bouma Department Linguistik Universität Potsdam Karl-Liebknecht-Str. 24/ 25 D-14476 Golm gerlof.bouma@uni-potsdam.de Manuel Burghardt Institut für Information und Medien, Sprache und Kultur (I: IMSK) Universität Regensburg Universitätsstr. 31 D-93053 Regensburg manuel.burghardt@sprachlit.uni-regensburg.de Ernesto W. 
DeLuca Department of Technical and Business Information Systems (ITI) Otto-von-Guericke University of Magdeburg Universitätsplatz 2 D-39106 Magdeburg ernesto.deluca@ovgu.de Stefanie Dipper Sprachwissenschaftliches Institut Ruhr-Universität Bochum D-44780 Bochum dipper@linguistics.rub.de Kurt Eberle SFB 732/ B3 Institut für maschinelle Sprachverarbeitung Computerlinguistik Universität Stuttgart Azenbergstr. 12 D-70174 Stuttgart eberle@ims.uni-stuttgart.de Gertrud Faaß SFB 732/ B3 Institut für maschinelle Sprachverarbeitung Computerlinguistik Universität Stuttgart Azenbergstr. 12 D-70174 Stuttgart faasz@ims.uni-stuttgart.de 272 List of Contributors Erik Faessler Jena University Language & Information Engeneering (JULIE) Lab Friedrich-Schiller-Universität Jena D-07743 Jena erik.faessler@uni-jena.de Bernhard Fisseni Fakultät für Geisteswissenschaften Germanistik/ Linguistik Universität Duisburg-Essen Universitätsstr. 12 D-45117 Essen bernhard.fisseni@uni-due.de Juliane Fluck Fraunhofer Institut für Algorithmen und wissenschaftliches Rechnen (SCAI) Schloss Birlinghoven D-53754 Sankt Augustin juliane.fluck@scai.fraunhofer.de Francesco Gallo EURIX Group R&D Department 26 via Carcano 10153 Torino, Italy Rüdiger Gleim Abteilung für geisteswissenschaftliche Fachinformatik Department for Computing in the Humanities Goethe-Universität Frankfurt am Main Georg-Voigt-Straße 4 D-60325 Frankfurt am Main Gleim@em.uni-frankfurt.de Udo Hahn Jena University Language & Information Engeneering (JULIE) Lab Friedrich-Schiller-Universität Jena D-07743 Jena Christian Hardmeier Fondazione Bruno Kessler Via Sommarive, 18 38123 Trento, Italia Ulrich Heid SFB 732/ B3 Institut für maschinelle Sprachverarbeitung Computerlinguistik Universität Stuttgart Azenbergstr. 12 D-70174 Stuttgart heid@ims.uni-stuttgart.de Christian Herta neofonie GmbH Robert-Koch-Platz 4 D-10117 Berlin herta@neofonie.de Graeme Hirst Department of Computer Science University of Toronto Toronto, Ontario Canada M5S 3G4 gh@cs.toronto.edu Anke Holler Seminar für Deutsche Philologie Georg-August-Universität Göttingen Käte-Hamburger-Weg 3 D-37073 Göttingen anke.holler@phil.uni-goettingen.de Lawrence Hunter Center for Computational Pharmacology University of Colorado Denver School of Medicine MS 8303, PO Box 6511 Aurora, CO 80045, USA Larry.Hunter@ucdenver.edu Aaron Kaplan Xerox Research Centre Europe 6 chemin de Maupertuis 38240 Meylan, France Manfred Klenner Institut für Computerlinguistik Universität Zürich Binzmühlestrasse 14 CH-8050 Zürich klenner@cl.uzh.ch List of Contributors 273 Peter Kluegl Department of Computer Science VI University of Würzburg Am Hubland D-97074 Würzburg pkluegl@informatik.uni-wuerzburg.de Peter Koepke Mathematisches Institut Rheinische Friedrich-Wilhelms-Universität Bonn Endenicher Allee 60 D-53115 Bonn koepke@math.uni-bonn.de Peter Kolb Department Linguistik Universität Potsdam Karl-Liebknecht-Str. 24/ 25 D-14476 Golm kolb@linguatools.de Rico Landefeld Jena University Language & Information Engeneering (JULIE) Lab Friedrich-Schiller-Universität Jena D-07743 Jena Pierre Lison Language Technology Lab German Research Centre for Artificial Intelligence (DFKI GmbH) D-66123 Saarbrücken pierre.lison@dfki.de Jonathan Mamou IBM Haifa Research Lab. 31905 Haifa, Israel Torsten Marek Institut für Computerlinguistik Universität Zürich Binzmühlestrasse 14 CH-8050 Zürich marek@ifi.uzh.ch Alexander Mehler Faculty of Technology Bielefeld University Universitätsstr. 
25 D-33615 Bielefeld alexander.mehler@uni-bielefeld.de Roland Mittmann Institut für Vergleichende Sprachwissenschaft Goethe-Universität Frankfurt am Main Postfach 11 19 32 D-60054 Frankfurt am Main mittmann@em.uni-frankfurt.de Andreas Nürnberger Department of Technical and Business Information Systems (ITI) Otto-von-Guericke University of Magdeburg Universitätsplatz 2 D-39106 Magdeburg andreas.nuernberger@ovgu.de Philip V. Ogren Center for Computational Pharmacology University of Colorado Denver School of Medicine MS 8303, PO Box 6511 Aurora, CO 80045, USA philip@ogren.info Frank Puppe Department of Computer Science VI University of Würzburg Am Hubland D-97074 Würzburg puppe@informatik.uni-wuerzburg.de Christophe Roeder Center for Computational Pharmacology University of Colorado Denver School of Medicine MS 8303, PO Box 6511 Aurora, CO 80045, USA Chris.Roeder@ucdenver.edu Gerold Schneider Institut für Computerlinguistik Universität Zürich Binzmühlestrasse 14 CH-8050 Zürich gschneid@cl.uzh.ch Bernhard Schröder Fakultät für Geisteswissenschaften Germanistik/ Linguistik Universität Duisburg-Essen Universitätsstr. 12 D-45117 Essen bernhard.schroeder@uni-due.de 274 List of Contributors Rico Sennrich Institut für Computerlinguistik Universität Zürich Binzmühlestrasse 14 CH-8050 Zürich rico.sennrich@gmx.ch Jannik Strötgen Institut für Informatik Ruprecht-Karls-Universität Heidelberg Im Neuenheimer Feld 348 D-69120 Heidelberg jannik.stroetgen@informatik.uni-heidelberg.de Benjamin Sznajder IBM Haifa Research Lab. 31905 Haifa, Israel Simone Teufel Computer Laboratory University of Cambridge JJ Thomson Avenue CB3 0FD Cambridge, UK Simone.Teufel@cl.cam.ac.uk Katrin Tomanek Jena University Language & Information Engeneering (JULIE) Lab Friedrich-Schiller-Universität Jena D-07743 Jena katrin.tomanek@uni-jena.de Daniel Trümper neofonie GmbH Robert-Koch-Platz 4 D-10117 Berlin truemper@neofonie.de Jip Veldman Mathematisches Institut Rheinische Friedrich-Wilhelms-Universität Bonn Endenicher Allee 60 D-53115 Bonn veldman@math.uni-bonn.de Karin Verspoor Center for Computational Pharmacology University of Colorado Denver School of Medicine MS 8303, PO Box 6511 Aurora, CO 80045, USA Karin.Verspoor@ucdenver.edu Martin Volk Institut für Computerlinguistik Universität Zürich Binzmühlestrasse 14 CH-8050 Zürich volk@cl.uzh.ch Tim vor der Brück Intelligent Information and Communication Systems (IICS) Fernuniversität in Hagen D-58084 Hagen tim.vorderbrueck@fernuni-hagen.de Ulli Waltinger Text Technology Bielefeld University Universitätsstraße 25 D-33615 Bielefeld ulli_marc.waltinger@uni-bielefeld.de Martin Warin Institut für Computerlinguistik Universität Zürich Binzmühlestrasse 14 CH-8050 Zürich martin@linguistlist.org Matthias Wendt neofonie GmbH Robert-Koch-Platz 4 D-10117 Berlin matthias@neofonie.de Philipp G. Wetzler University of Colorado at Boulder Boulder CO 80309, USA Graham Wilcock University of Helsinki 00014 Helsinki, Finland graham.wilcock@helsinki.fi Christian Wolff Institut für Information und Medien, Sprache und Kultur (I: IMSK) Universität Regensburg Universitätsstr. 31 D-93053 Regensburg christian.wolff@sprachlit.uni-regensburg.de List of Contributors 275 Sina Zarrieß Institut für maschinelle Sprachverarbeitung Universität Stuttgart D-70174 Stuttgart Heike Zinsmeister Fachbereich Sprachwissenschaft Universität Konstanz Fach 186 D D-78457 Konstanz Heike.Zinsmeister@uni-konstanz.de