
XXIII Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural (23rd Conference of the Spanish Society for Natural Language Processing)

Universidad de Sevilla, 10, 11 and 12 September 2007

EDITORS

Víctor J. Díaz Madrigal (Univ. de Sevilla)
Fernando Enríquez de Salamanca Ros (Univ. de Sevilla)

SCIENTIFIC COMMITTEE

CHAIR

Prof. Víctor Jesús Díaz Madrigal (Universidad de Sevilla)

MEMBERS

Prof. José Gabriel Amores Carredano (Universidad de Sevilla)
Prof. Toni Badia i Cardús (Universitat Pompeu Fabra)
Prof.ª Irene Castellón Masalles (Universitat de Barcelona)
Prof. Manuel de Buenaga Rodríguez (Universidad Europea de Madrid)
Prof. Ricardo de Córdoba (Universidad Politécnica de Madrid)
Prof.ª Arantza Díaz de Ilarraza (Euskal Herriko Unibertsitatea)
Prof. Antonio Ferrández Rodríguez (Universitat d'Alacant)
Prof. Mikel Forcada Zubizarreta (Universitat d'Alacant)
Prof.ª Ana María García Serrano (Universidad Politécnica de Madrid)
Prof. Koldo Gojenola Galletebeitia (Euskal Herriko Unibertsitatea)
Prof. Xavier Gómez Guinovart (Universidade de Vigo)
Prof. Julio Gonzalo Arroyo (Universidad Nacional de Educación a Distancia)
Prof. José Miguel Goñi Menoyo (Universidad Politécnica de Madrid)
Prof. Ramón López-Cózar Delgado (Universidad de Granada)
Prof. Javier Macías Guarasa (Universidad Politécnica de Madrid)
Prof. José B. Mariño Acebal (Universitat Politècnica de Catalunya)
Prof.ª M. Antonia Martí Antonín (Universitat de Barcelona)
Prof.ª Raquel Martínez (Universidad Nacional de Educación a Distancia)
Prof. Antonio Molina Marco (Universitat Politècnica de Valencia)
Prof. Juan Manuel Montero (Universidad Politécnica de Madrid)
Prof.ª Lidia Ana Moreno Boronat (Universitat Politècnica de Valencia)
Prof. Lluis Padró (Universitat Politècnica de Catalunya)
Prof. Manuel Palomar Sanz (Universitat d'Alacant)
Prof. Germán Rigau (Euskal Herriko Unibertsitatea)
Prof. Horacio Rodríguez Hontoria (Universitat Politècnica de Catalunya)
Prof. Emilio Sanchís (Universitat Politècnica de Valencia)
Prof. Kepa Sarasola Gabiola (Euskal Herriko Unibertsitatea)
Prof. L. Alfonso Ureña López (Universidad de Jaén)
Prof. Ferrán Pla (Universitat Politècnica de Valencia)
Prof.ª Mª Felisa Verdejo Maillo (Universidad Nacional de Educación a Distancia)
Prof. Manuel Vilares Ferro (Universidade de Vigo)

EXTERNAL REVIEWERS

Iñaki Alegria, Laura Alonso Alemany, Kepa Bengoetxea, Zoraida Callejas Carrión, Francisco Carrero, Vicente Carrillo Montero, Fermín Cruz Mata, Víctor Manuel Darriba Bilbao, César de Pablo Sánchez, Fernando Enríquez de Salamanca Ros, Milagros Fernández Gavilanes, Ana Fernández Montraveta, Óscar Ferrández, Sergio Ferrández, Miguel Ángel García Cumbreras, Manuel García Vega, Rubén Izquierdo Beviá, Zornitsa Kozareva, Sara Lana Serrano, Mikel Lersundi, Lluis Márquez, María Teresa Martín Valdivia, José Luis Martínez Fernández, Germán Montoro Manrique, Andrés Montoyo Guijarro, Iulia Nica, Francisco Javier Ortega Rodríguez, Jesús Peral Cortés, Enrique Puertas, Francisco José Ribadas Pena, Estela Saquete Boró, José Antonio Troyano Jiménez, Gloria Vázquez.

ORGANIZING COMMITTEE

CHAIR

Víctor Jesús Díaz Madrigal

MEMBERS

Adolfo Aumaitre del Rey
Rafael Borrego Ropero
José Miguel Cañete Valdeón
Vicente Carrillo Montero
Fermín Cruz Mata
Fernando Enríquez de Salamanca Ros
Francisco José Galán Morillo
Carlos García Vallejo
Fco. Javier Ortega Rodríguez
Luisa María Romero Moreno
José Antonio Troyano Jiménez

Preamble

Issue number 39 of the journal of the Sociedad Española para el Procesamiento del Lenguaje Natural contains the scientific papers, together with the summaries of research projects and tool demonstrations, accepted by the Scientific Committee for presentation at the XXIII Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN'07). This edition of the conference has been organized by members of the Departamento de Lenguajes y Sistemas Informáticos of the Universidad de Sevilla, at the Escuela Técnica Superior de Ingeniería Informática.

The number of research papers received, together with the continuity of the annual conference (this is the twenty-third uninterrupted edition), confirms the interest in, and the topicality of, research in the field of Language Technologies. These proceedings contain 32 scientific papers, which can be grouped, in a way that is neither categorical nor mutually exclusive, into the following thematic areas: Morphosyntactic Analysis (4 papers), Question Answering (2 papers), Text Categorization (3 papers), Information Extraction (5 papers), Computational Lexicography (4 papers), Corpus Linguistics (4 papers), Semantics (4 papers), Dialogue Systems (2 papers) and Machine Translation (4 papers). A total of 49 papers were received, of which only the 32 contributions mentioned (65 percent) obtained the overall approval of the Scientific Committee. Each submission was reviewed by 3 members of the Scientific Committee. In addition, as usual, the proceedings include two summaries presenting research projects and nine summaries presenting demonstrations of tools designed for specific tasks related to Natural Language Processing.

This edition of the conference features 2 invited talks, given by Dr. Antal van den Bosch (Tilburg University) and Dr. Anselmo Peñas (Universidad Nacional de Educación a Distancia). This year has the peculiarity that on 11 and 12 September, in parallel with the conference, the Jornadas de la Red Temática para el Tratamiento de la Información Multilingüe y Multimodal are being held. These sessions include the invited talk by Dr. Ralf Steinberger (Joint Research Centre).

I do not want to end these lines without thanking the sponsors of the conference, since without their financial or logistical support it would have been very difficult to organize. Nor can I fail to acknowledge the effort and the help I have received from all the members of the Scientific Committee and of the Governing Body of the Society. Finally, I would like to close by remembering all my colleagues in the ITÁLICA research group for the additional work that preparing this event has involved.

Víctor Jesús Díaz Madrigal
Chair of the Programme Committee of the XXIII Congreso de la SEPLN

Procesamiento del Lenguaje Natural, no. 39, September 2007. ISSN 1135-5948

© 2007 Sociedad Española para el Procesamiento del Lenguaje Natural


ARTICLES

Morphosyntactic Analysis

Desarrollo de un Analizador Sintáctico Estadístico basado en Dependencias para el Euskera
  Kepa Bengoetxea and Koldo Gojenola
Técnicas Deductivas para el Análisis Sintáctico con Corrección de Errores
  Carlos Gómez-Rodríguez, Miguel A. Alonso and Manuel Vilares
A Simple Formalism for Capturing Order and Co-Occurrence in Computational Morphology
  Mans Hulden and Shannon Bischoff
A Note on the Complexity of the Recognition Problem for the Minimalist Grammars with Unbounded Scrambling and Barriers
  Alexander Perekrestenko

Question Answering

Paraphrase Extraction from Validated Question Answering Corpora in Spanish
  Jesús Herrera, Anselmo Peñas and Felisa Verdejo
Evaluación de Sistemas de Búsqueda de Respuestas con restricción de tiempo
  Fernando Llopis, Elisa Noguera, Antonio Ferrández and Alberto Escapa

Text Categorization

Medidas Internas y Externas en el Agrupamiento de Resúmenes Científicos de Dominios Reducidos
  Diego Ingaramo, Marcelo Errecalde and Paolo Rosso
Integración de Conocimiento en un Dominio Específico para Categorización Multietiqueta
  María Teresa Martín, Manuel Carlos Díaz, Arturo Montejo and L. Alfonso Ureña-López
Similitud entre Documentos Multilingües de Carácter Científico-Técnico en un Entorno Web
  Xabier Saralegi and Iñaki Alegria

Information Extraction

The Influence of Context during the Categorization and Discrimination of Spanish and Portuguese Person Names
  Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo
Studying CSSR Algorithm Applicability on NLP Tasks
  Muntsa Padró and Lluis Padró
Aprendizaje Automático para el Reconocimiento Temporal Multilingüe basado en TiMBL
  Marcel Puchol-Blasco, Estela Saquete and Patricio Martínez-Barco
Alias Assignment in Information Extraction
  Emili Sapena, Lluis Padró and Jordi Turmo
Evaluación de un Sistema de Reconocimiento y Normalización de Expresiones Temporales en Español
  María Teresa Vicente-Díez, César de Pablo-Sánchez and Paloma Martínez

Computational Lexicography

Inducción de Clases de Comportamiento Verbal a partir del Corpus SENSEM
  Laura Alonso, Irene Castellón and Nevena Tinkova
An Open-Source Lexicon for Spanish
  Montserrat Marimon, Natalia Seghezzi and Núria Bel
Towards Quantitative Concept Analysis
  Rogelio Nazar, Jorge Vivaldi and Leo Wanner
Evaluación Automática de un Sistema Híbrido de Predicción de Palabras y Expansiones
  Sira Elena Palazuelos, José Luis Martín and Javier Macías


Corpus Linguistics

Specification of a General Linguistic Annotation Framework and its Use in a Real Context
  Xabier Artola, Arantza Díaz de Ilarraza, Aitor Sologaistoa and Aitor Soroa
Determinación del Umbral de Representatividad de un Corpus mediante el Algoritmo N-Cor
  Gloria Corpas and Miriam Seghiri
Generación Semiautomática de Recursos
  Fernando Enríquez, José Antonio Troyano, Fermín Cruz and F. Javier Ortega
Building Corpora for the Development of a Dependency Parser for Spanish Using Maltparser
  Jesús Herrera, Pablo Gervás, Pedro J. Moriano, Alfonso Muñoz and Luis Romero

Semantics

A Proposal of Automatic Selection of Coarse-grained Semantic Classes for WSD
  Rubén Izquierdo-Bevia, Armando Suárez and Germán Rigau
Cognitive Modules of an NLP Knowledge Base for Language Understanding
  Carlos Periñán-Pascual and Francisco Arcas-Túnez
Text as Scene: Discourse Deixis and Bridging Relations
  Marta Recasens, Antonia Martí Antonín and Mariona Taulé
Definición de una Metodología para la Construcción de Sistemas de Organización del Conocimiento a partir de un Corpus Documental en Lenguaje Natural
  Sonia Sánchez-Cuadrado, Jorge Morato, José Antonio Moreiro and Monica Marrero

Dialogue Systems

Prediction of Dialogue Acts on the Basis of the Previous Act
  Sergio R. Coria and Luis Alberto Pineda
Adaptación de un Gestor de Diálogo Estadístico a una Nueva Tarea
  David Griol, Lluís F. Hurtado, Encarna Segarra and Emilio Sanchís

Machine Translation

Un Método de Extracción de Equivalentes de Traducción a partir de un Corpus Comparable Castellano-Gallego
  Pablo Gamallo and José Ramom Pichel
Flexible Statistical Construction of Bilingual Dictionaries
  Ismael Pascual and Michael O'Donnell
Training Part-of-Speech Taggers to build Machine Translation Systems for Less-Resourced Language Pairs
  Felipe Sánchez-Martínez, Carme Armentano-Oller, Juan Antonio Pérez-Ortiz and Mikel L. Forcada
Parallel Corpora based Translation Resources Extraction
  Alberto Simões and José João Almeida

DEMONSTRATIONS

Una Herramienta para la Manipulación de Corpora Bilingüe usando Distancia Léxica
  Rafael Borrego and Víctor J. Díaz
MyVoice goes Spanish. Cross-lingual Adaptation of a Voice Controlled PC Tool for Handicapped People
  Zoraida Callejas, Jan Nouza, Petr Cerva and Ramón López-Cózar
HistoCat y DialCat: Extensiones de un Analizador Morfológico para tratar Textos Históricos y Dialectales del Catalán
  Jordi Duran, Mª Antonia Martí and Pilar Perea
MorphOz: Una Plataforma de Desarrollo de Analizadores Sintáctico-Semánticos Multilingüe
  Oscar García
Sistema de Diálogo Estadístico y Adquisición de un Nuevo Corpus de Diálogos
  David Griol, Encarna Segarra, Lluis F. Hurtado, Francisco Torres, María José Castro, Fernando García and Emilio Sanchís
JBeaver: Un Analizador de Dependencias para el Español
  Jesús Herrera, Pablo Gervás, Pedro J. Moriano, Alfonso Muñoz and Luis Romero
NowOnWeb: a NewsIR System
  Javier Parapar and Álvaro Barreiro
The Coruña Corpus Tool
  Javier Parapar and Isabel Moskowich-Spiegel
WebJspell, an Online Morphological Analyser and Spell Checker
  Rui Vilela

PROJECTS

El Proyecto Gari-Coter en el Seno del Proyecto RICOTERM2
  Fco. Mario Barcala, Eva Domínguez, Pablo Gamallo, Marisol López, Eduardo Miguel Moscoso, Guillermo Rojo, María Paula Santalla del Río and Susana Sotelo
Portal da Lingua Portuguesa
  Maarten Janssen

ARTICLES

Morphosyntactic Analysis

Development of a statistical dependency-based syntactic parser for Basque

Kepa Bengoetxea, Koldo Gojenola Universidad del País Vasco UPV/EHU

Escuela Universitaria de Ingeniería Técnica Industrial de Bilbao

{kepa.bengoetxea, koldo.gojenola}@ehu.es

Abstract: This paper presents the first steps towards a statistical syntactic parser for Basque. The system is based on a treebank syntactically annotated with dependencies and on an adaptation of the deterministic parser of Nivre et al. (2007), which combines shift/reduce parsing with a machine learning module that decides which one of 4 parsing actions to take, yielding a single syntactic dependency analysis of the input sentence. The results are close to those obtained by similar systems.

Keywords: Syntactic analysis. Dependency-based analysis. Treebank.

1 Introduction

This paper presents the first steps towards a statistical syntactic parser for Basque. The system is based on a treebank syntactically annotated with dependencies and on an adaptation of the deterministic parser MaltParser (Nivre et al., 2007). MaltParser combines shift/reduce parsing with a machine learning component that decides, at each parsing step, which of 4 actions to take, and it produces a single syntactic analysis of the sentence. The results obtained are close to those of other similar systems.

The rest of the paper is organized as follows. Section 2 presents the treebank used (3LB), which is the basis of the parser, and the modifications made so that it can be processed automatically. Section 3 gives the context of statistical syntactic parsing and presents the system chosen for this work, the deterministic parser MaltParser. Section 4 presents the experiments carried out together with the results obtained. Section 5 compares this work with similar systems that have been developed. The paper ends with the main conclusions and lines of future work.

2 3LB: a syntactically annotated treebank for Basque

The 3LB project developed corpora annotated at the morphological and syntactic levels for Catalan, Basque and Spanish (Palomar et al., 2004).

The annotation for Catalan and Spanish is constituency-based, whereas Basque is annotated with dependencies (Carroll, Minnen and Briscoe, 1998). The following subsections present first the general characteristics of the original treebank (Section 2.1) and then the adaptation of the treebank to a format suitable for automatic parsing (Section 2.2).

Procesamiento del Lenguaje Natural, no. 39 (2007), pp. 5-12. Received 18-05-2007; accepted 22-06-2007.

2.1 The 3LB treebank for Basque

The 3LB corpus (Palomar et al., 2004) contains 57,000 syntactically annotated words. The characteristics of Basque, such as the free order of sentence constituents, made dependency annotation advisable, in a way similar to that adopted for languages such as Czech (Hajic, 1999), although it has also been proposed for languages with less free word order such as English (Jarvinen and Tapanainen, 1998).

Figure 1 shows an example of the annotation of a sentence in the 3LB corpus. Basically, the annotation gives the dependency type (meta, ncsubj, …) followed by three attributes that represent:

• Useful morphosyntactic information, such as the case or the type of subordinate clause (konp¹ in the example). Although the figure shows that the annotation includes minimal morphosyntactic information, in general the annotation is word-based. This was a problem, since statistical parsers require morphosyntactic features (category, number, case, …) that are not present in the original corpus.

• Head of the dependency (with the special value root to mark the head of the sentence).

• Dependent element.

¹ Completive subordinate clause.

2.2 Adaptation of the treebank

The original annotation of the Basque treebank, although linguistically valid, poses several problems for computational processing:

• Phenomena such as the same word appearing more than once in a sentence require making explicit which sentence element each occurrence corresponds to, and this information is not present in the original annotation.

• Non-explicit elements. The original annotation allowed null elements corresponding to phenomena such as ellipsis or coordination. However, the great majority of current dependency-based parsers do not allow elements that do not correspond to words of the sentence.

• Morphosyntactic ambiguity. The original annotation was made by linking words to each other. The drawback of this choice is that, since words are morphologically ambiguous (each word has 2.81 interpretations on average), the correct interpretation is not known with certainty. Although the type of dependency linking two words provides useful information for disambiguation (for example, the dependency “ncsubj” generally links the head of a noun phrase, normally a noun, with a verb), a high degree of ambiguity remains that cannot be resolved automatically. Figure 1 shows that the words carry no morphosyntactic annotation other than the dependencies.

• Multiword terms. When tagging the corpus, the linguists had no systematic guidelines for annotating these elements, which include entities, complex postpositions and idiomatic expressions. As a result, it is difficult to match the words of the treebank with those of the original sentence. For example, Figure 1 shows that the complex postposition “baten ostean” has been grouped into a single unit.

@@00,06,2,1201,6
Ika-mika baten ostean, funtzionarioak 14:00etan itzultzeko esan zien.
(argument) (of a) (after), (the civil servant) (at 14:00) (to return) (say) (he to them/past)
After an argument, the civil servant told them to come back at 14:00.

meta (-, root, esan)
ncmod (gen_post_ine, esan, Ika-mika)
detmod (-, Ika-mika, baten_ostean)
ncsubj (erg, esan, funtzionarioak)
ncmod (ine, itzultzeko, 14:00etan)
xcomp_obj (konp, esan, itzultzeko)
auxmod (-, esan, zien)

Figure 1: Example of the annotation of a sentence.
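To make the structure of the original annotation concrete, the following minimal sketch reads dependency lines of the kind shown in Figure 1 into tuples of relation, morphosyntactic information, head and dependent. The regular expression and the names `Dependency` and `parse_annotation` are illustrative assumptions, not part of the 3LB tooling; the sketch only assumes the surface layout `relation (info, head, dependent)` visible in the figure.

```python
import re
from typing import List, NamedTuple, Optional


class Dependency(NamedTuple):
    relation: str         # dependency type, e.g. "ncsubj"
    info: Optional[str]   # morphosyntactic information, None when given as "-"
    head: str             # head word, or "root" for the head of the sentence
    dependent: str        # dependent word


# Hypothetical pattern for "relation (info, head, dependent)".
PATTERN = re.compile(r"(\w+)\s*\(([^,()]+),\s*([^,()]+),\s*([^,()]+)\)")


def parse_annotation(text: str) -> List[Dependency]:
    """Extract every dependency triple found in an annotation string."""
    deps = []
    for relation, info, head, dependent in PATTERN.findall(text):
        info = info.strip()
        deps.append(Dependency(relation, None if info == "-" else info,
                               head.strip(), dependent.strip()))
    return deps


FIGURE_1 = (
    "meta (-, root, esan) ncmod (gen_post_ine, esan, Ika-mika) "
    "detmod (-, Ika-mika, baten_ostean) ncsubj (erg, esan, funtzionarioak) "
    "ncmod (ine, itzultzeko, 14:00etan) xcomp_obj (konp, esan, itzultzeko) "
    "auxmod (-, esan, zien)"
)
dependencies = parse_annotation(FIGURE_1)
```

Because heads and dependents are identified only by their word form, a sentence containing the same word twice would yield indistinguishable entries, which illustrates the first problem listed above.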

For these reasons it became essential to re-annotate the corpus in order to obtain a computationally tractable version. Although helper programs were written to assist the re-annotation, the process was very costly, being mostly manual, and required a complete revision of the treebank. Figures 2 and 3 show the previous sentence annotated in a computationally usable dependency format, together with its graphical representation. The format chosen is that of the CoNLL² conference (CoNLL 2007), which has the following characteristics:

• Explicit components. All relations must be word to word, that is, elements may not be removed from or added to the sentence in the analysis.

• It is versatile enough to be converted automatically to other formats, such as the Penn format (Marcus, Santorini and Marcinkiewicz, 1993) or the format accepted by the parser of (Collins et al., 1999).

² Computational Natural Language Learning.

Figure 2 contains an example of the sentence in the new format. This format has eight fields: position (P), form, lemma, category (coarse POS tag), category + subcategory, morphosyntactic information, identifier of the head, and dependency relation.
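As an illustration of the eight fields just listed, the sketch below reads one token line of the new format into a small record. It assumes whitespace-separated fields, `|` as the separator inside the morphosyntactic information and `_` for an empty value, as they appear in Figure 2; the names `Token` and `parse_line` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    position: int      # P: position of the word in the sentence
    form: str          # word form
    lemma: str         # lemma
    cpos: str          # category (coarse POS tag)
    pos: str           # category + subcategory
    feats: List[str]   # morphosyntactic information
    head: int          # position of the head (0 for the root)
    deprel: str        # dependency relation


def parse_line(line: str) -> Token:
    """Split one eight-field token line into a Token record."""
    p, form, lemma, cpos, pos, feats, head, deprel = line.split()
    feature_list = [] if feats == "_" else feats.split("|")
    return Token(int(p), form, lemma, cpos, pos, feature_list, int(head), deprel)


subject = parse_line("4 funtzionarioak funtzionario IZE IZE_ARR ERG|NUMS|MUGM 7 ncsubj")
comma = parse_line("3 , , PUNT PUNT_KOMA _ 2 PUNC")
```

Since every token carries its own position and the position of its head, repeated words are no longer ambiguous and every relation is explicitly word to word.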

3 Statistical syntactic parsing

The popularity of treebanks is helping the development of statistical parsers. This line of work began with the Penn Treebank for English (Marcus, Santorini and Marcinkiewicz, 1993), for which reference parsers have been developed (Collins, 1996; Charniak, 2000) that set the current state of the art. Although the characteristics of English led to an initial constituency-based annotation, several factors, mainly the extension to languages very different from English and the difficulty of evaluating the underlying hierarchical structures, have led to the development of dependency-based syntactic models.

Section 3.1 briefly reviews dependency-based parsers. Section 3.2 describes the parser of Nivre et al. (2007), which has been used in this work.

P  Form            Lemma         Cat   Cat+subcat  Info                            Head  Dependency
1  Ika-mika        Ika-mika      IZE   IZE_ARR     ABS|MG                          7     ncmod
2  baten_ostean    bat           IZE   IZE_ARR     DEK|GEN_oste_INE|NUMS|MUGM|POS  1     ncmod
3  ,               ,             PUNT  PUNT_KOMA   _                               2     PUNC
4  funtzionarioak  funtzionario  IZE   IZE_ARR     ERG|NUMS|MUGM                   7     ncsubj
5  14:00etan       14:00         DET   DET_DZH     NMGP|INE|NUMP|MUGM              6     ncmod
6  itzultzeko      itzuli        ADI   ADI_SIN     ADIZE|KONPL|ABS|MG              7     xcomp_obj
7  esan            esan          ADI   ADI_SIN     PART|BURU                       0     ROOT
8  zien            *edun         ADL   ADL         B1|NR_HURA|NK_HARK|NI_HAIEI     7     auxmod
9  .               .             PUNT  PUNT_PUNT   _                               8     PUNC

Figure 2: Example of the annotation of a sentence.

Figure 3: Graphical representation of the dependency tree of the sentence "Ika-mika baten_ostean, funtzionarioak 14:00etan itzultzeko esan zien." (arc labels shown: xcomp_obj, ncmod, ncsubj, ncmod, ncmod).

3.1 Dependency-based syntactic parsing

Dependency-based parsers have been used in a variety of work, with proposals ranging from parsers that build dependency structures directly (Jarvinen and Tapanainen, 1998; Lin, 1998) to others that rely on traditional constituent structures and additionally allow dependencies to be extracted (Collins, 1999; Briscoe, Carroll and Watson, 2006).

Among statistical dependency-based parsers we can cite the experiments of Eisner (1996) and the work carried out for Turkish (Eryiğit and Oflazer, 2006), which shares with Basque the property of being an agglutinative language. In general, interest in this topic has been spurred in recent years by the shared task on dependency parsers held at the CoNLL³ conference (CoNLL, 2006, 2007), which poses the challenge of using different parsers to analyze a set of treebanks covering a wide range of languages.

3.2 MaltParser: a deterministic statistical parser

The deterministic parser of Nivre et al. (2007) is a language-independent system that allows a parser to be induced from a treebank using limited amounts of training data. The parser is based on:

• Deterministic algorithms for dependency parsing, using shift/reduce parsing and a system based on a stack and an input string.

• History-based feature models to predict the next action. In this particular algorithm, the system must choose among 4 options (link two words with a left arc, link them with a right arc, reduce, or shift), and to do so it uses features of the stack and/or the input string. Applying this step repeatedly yields a single syntactic analysis of the sentence.

• Discriminative machine learning techniques to map histories to actions. At present the system supports two of the most successful machine learning alternatives: memory-based learning (Daelemans and Van den Bosch, 2005) and Support Vector Machines (SVM; Chang and Lin, 2001).

³ CoNLL (Computational Natural Language Learning) shared task on dependency parsing.
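The interplay between the stack, the input string and the four actions can be sketched as follows. This is a minimal illustration, not MaltParser's implementation: it assumes that the four options behave as in the transition system usually called arc-eager, and it replaces the learned classifier with a hypothetical `oracle` function that reads the correct action off the gold heads of the sentence in Figure 2.

```python
# Gold heads of the sentence in Figure 2 (dependent position -> head position).
GOLD = {1: 7, 2: 1, 3: 2, 4: 7, 5: 6, 6: 7, 7: 0, 8: 7, 9: 8}


def oracle(stack, buffer, heads, gold):
    """Stand-in for the learned classifier: choose an action from gold heads."""
    top, nxt = stack[-1], buffer[0]
    if top != 0 and gold[top] == nxt:
        return "LEFT-ARC"            # the next input word is the head of the stack top
    if gold[nxt] == top:
        return "RIGHT-ARC"           # the stack top is the head of the next input word
    linked_below = any(gold[nxt] == k or gold.get(k) == nxt for k in stack[:-1])
    if top in heads and linked_below:
        return "REDUCE"              # stack top is finished and blocks a pending arc
    return "SHIFT"


def parse(n_tokens, gold):
    """Apply the four actions until the input string is exhausted."""
    stack, buffer, heads, actions = [0], list(range(1, n_tokens + 1)), {}, []
    while buffer:
        action = oracle(stack, buffer, heads, gold)
        actions.append(action)
        if action == "LEFT-ARC":
            heads[stack.pop()] = buffer[0]
        elif action == "RIGHT-ARC":
            heads[buffer[0]] = stack[-1]
            stack.append(buffer.pop(0))
        elif action == "REDUCE":
            stack.pop()
        else:  # SHIFT
            stack.append(buffer.pop(0))
    return heads, actions


predicted_heads, action_sequence = parse(9, GOLD)
```

In a trained parser the call to `oracle` would be replaced by a classifier that predicts the action from features of the current stack and input, which is the role of the history-based feature models described above.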

This parser has been tested on many languages of diverse typology, obtaining results close to the state of the art for English, which is generally taken as the reference and point of comparison. In the 2007 CoNLL shared task, a version of this system ranked first out of a total of 20 systems submitted.

4 Experiments and results

This section presents the experiments carried out together with the results obtained.

The first step is to select the attributes used for parsing. Although using more information can in principle help improve the results, the corpus used is small (57,000 words), so data sparseness problems may arise.

The parser allows different types of information to be specified for training, distinguishing:

• Lexical information. Both the form and the lemma of each word can be used.

• Category information. Both the syntactic category (noun, adjective, verb, …) and the subcategory (common noun, proper noun, …) can be selected.

• Morphosyntactic information. Basque has a great variety of information of this kind, including case and number for the elements of the noun phrase, agreement information with subject, direct object and indirect object on verbs, and different types of subordinate clauses. Among the languages presented at CoNLL (2007), it is by far the one with the largest number of morphosyntactic features (359).

• Dependency labels. A set of 35 labels has been defined.

The parser used is based on the shift-reduce technique and therefore uses a stack to which it adds elements from the input string. For this reason, elements of both the stack and the input string can be specified for use in the machine learning phase. In addition, since the parser builds the dependency tree as it goes, features of the ancestor or the descendants of a stack element, or of the first element of the input string still to be analyzed, can also be specified⁴.

   Specification  Description
1  p(σ0)          Category of the symbol on top of the stack
2  d(h(σ0))       Dependency label of the symbol on top of the stack with its head
3  p(τ0)          Category of the first word of the input string still to be analyzed
4  f(τ1)          Morphosyntactic features of the word following the first word of the input string
5  w(l(σ1))       Form of the word corresponding to the leftmost descendant of the element below the top of the stack

Table 1: Examples of parameter specifications for the learning system.

Table 1 shows examples of how the learning parameters of the system are specified. Elements of the stack (σ) or of the input string (τ) are addressed by their relative position, counting from zero. For example, specification 1 refers to the category, p(art of speech), of the symbol on top of the stack. The labels w(ord), L(emma), d(ependency), h(ead), l(eft) and r(ight) refer to the word form, the lemma, the dependency label, the head, the leftmost dependent and the rightmost dependent, respectively; f refers to the morphosyntactic features, as in specification 4. These labels can be combined to form more complex specifications, as in the examples of Table 1. For instance, specification 5 refers to the word form of the leftmost dependent of the symbol located below the top of the stack.

4 Since parsing proceeds from left to right, only the first symbol of the input can have a head or dependents.
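The addressing scheme of Table 1 can be illustrated with a small sketch. The code below is not the parser's implementation: the `Token` and `Config` classes, the `resolve` function and the English toy sentence are hypothetical names introduced only to show how a specification such as w(l(σ1)) could be read as "go to a stack or input position, follow h/l/r links, then read one attribute".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    form: str
    lemma: str
    pos: str
    feats: str
    head: Optional[int] = None    # index of the head token, once attached
    deprel: Optional[str] = None  # dependency label towards the head


@dataclass
class Config:
    tokens: List[Token]
    stack: List[int]   # token indices; the top of the stack is the last element
    buffer: List[int]  # token indices; the next input word is the first element


def resolve(cfg, attr, path, area, pos):
    """Evaluate a specification such as w(l(σ1)) -> resolve(cfg, "w", ["l"], "stack", 1)."""
    seq = cfg.stack[::-1] if area == "stack" else cfg.buffer
    if pos >= len(seq):
        return None
    i = seq[pos]
    for step in path:  # innermost link first
        if step == "h":
            i = cfg.tokens[i].head
        else:
            deps = [k for k, t in enumerate(cfg.tokens) if t.head == i]
            side = [k for k in deps if (k < i if step == "l" else k > i)]
            i = (min(side) if step == "l" else max(side)) if side else None
        if i is None:
            return None
    tok = cfg.tokens[i]
    return {"w": tok.form, "L": tok.lemma, "p": tok.pos,
            "f": tok.feats, "d": tok.deprel}[attr]


# Toy configuration (hypothetical data): "the" is already attached to "dog".
tokens = [Token("the", "the", "DET", "-", head=1, deprel="det"),
          Token("dog", "dog", "N", "sg"),
          Token("saw", "see", "V", "past"),
          Token("cats", "cat", "N", "pl")]
cfg = Config(tokens, stack=[1, 2], buffer=[3])

spec1 = resolve(cfg, "p", [], "stack", 0)     # p(σ0)
spec3 = resolve(cfg, "p", [], "input", 0)     # p(τ0)
spec5 = resolve(cfg, "w", ["l"], "stack", 1)  # w(l(σ1))
```

With this toy configuration, σ0 is "saw", σ1 is "dog" and τ0 is "cats", so the three specifications evaluate to the category of "saw", the category of "cats" and the form of "the", respectively. A specification that points outside the stack or to a missing head or dependent evaluates to `None`.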

The treebank data were split into a training set (50,123 words) and a final test set (gold test, 5,318 words5). The experiments were evaluated by 10-fold cross-validation on the training data and, finally, on the gold test data.
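As a rough illustration of the evaluation protocol, the following sketch generates the ten train/held-out partitions of a 10-fold cross-validation. It assumes the unit being split is the sentence and that sentences are assigned to folds in round-robin fashion; the source does not say how the folds were built, so both choices, as well as the name `k_fold_splits`, are assumptions.

```python
def k_fold_splits(n_items, k=10):
    """Yield (train, held_out) index lists for k-fold cross-validation."""
    # Assumption: item i goes to fold i mod k (round-robin assignment).
    folds = [list(range(f, n_items, k)) for f in range(k)]
    for f in range(k):
        held_out = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, held_out


# Example with 25 hypothetical sentences.
splits = list(k_fold_splits(25, k=10))
```

Each item is held out exactly once, and the reported cross-validation score is the average of the k scores obtained on the held-out folds.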

Features   Φ1   Φ2

[Table 2: feature specifications for models Φ1 and Φ2, defined over the stack positions σ0 and σ1 and the input positions τ0 to τ3; the individual cells are not recoverable.]

Table 2. Feature models.

The tests aimed to assess the importance of morphosyntactic information for training, checking whether its use significantly improves the parser's results. To select the features used by the parser, the parameters of Table 2 were specified following the notation of Table 1. Multiple tests were run with different classes of parameters.

Table 2 shows two of the classes of tests carried out. Column Φ1 presents the standard feature combination used by Nivre et al. (2007) for a wide variety of languages. Column Φ2 shows the most successful combination found across all the experiments, in which features corresponding to morphosyntactic information were added.

5 Owing to errors in the conversion of the original treebank, the number of words is lower than the total number of words in the corpus.

Table 3 shows that using morphosyntactic information improves the Labeled Attachment Score6 (LAS) of Φ2 over Φ1 by about 8 points (7.42 in cross-validation and 9.33 on the gold test).

                                   Φ1      Φ2
10-fold cross-validation average   67.64   75.06
Gold test                          65.08   74.41

Table 3. Results obtained (LAS).
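The metric reported in Table 3 can be computed with a few lines of code. The sketch below follows the definition of LAS as the percentage of words whose head and dependency label are both predicted correctly; the function name `las`, the data layout and the toy annotation are assumptions for illustration (only the label ncmod appears in the paper).

```python
def las(gold, predicted):
    """Labeled Attachment Score: % of words with correct head AND label.

    Both arguments are lists with one (head_index, label) pair per word.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same words")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)


# Toy four-word sentence (hypothetical labels); head 0 stands for the root.
gold = [(2, "ncsubj"), (0, "ROOT"), (2, "ncobj"), (3, "ncmod")]
pred = [(2, "ncsubj"), (0, "ROOT"), (2, "ncmod"), (3, "ncmod")]
score = las(gold, pred)
```

Here the third word has the right head but the wrong label, so it does not count as correct and the score is 3 out of 4 words.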

The previous experiments were run on the corpus in its original state, varying only the parameter specifications. Since the number of distinct morphological features for Basque is the largest of all the languages presented at CoNLL (359), we considered reducing it using Basque-specific knowledge: removing features judged to be of little significance and merging features considered to behave alike for parsing purposes (for example, a large subset of the case markers indicate the same dependency type, ncmod, non-clausal modifier, so we grouped them). The aim was to ease the learning task and to reduce training and parsing time. The result shows no improvement in accuracy (see Table 4), since it does not exceed the LAS of 74.41% obtained with the larger feature set, but it does improve training and parsing time, which become 3 and 8 times faster, respectively.

Although not shown in the tables, we verified, in agreement with the results of Nivre et al. (2007), that SVM improves on the results of MBL by close to 3%. The results presented therefore correspond to SVM.

6 Percentage of words for which the system correctly predicts both the head and the dependency relation between them.

No. of features   10-fold cross-validation average (Φ2)   Gold test
359               75.06                                    74.41
163               75.13                                    73.45

Table 4. Results (LAS) obtained when reducing the number of morphosyntactic features.

5 Comparison with other work

This work falls within statistical dependency-based parsing, whose most prominent showcase at present is the CoNLL 2006 and 2007 competitions. As for overall results, the Labeled Attachment Score (LAS) achieved (74.41%) places our system close to the best results reported (76.94%). In fact, this result matches those obtained with a single system, since the best CoNLL result comes from combining several parsers.

In other work, Cowan and Collins (2005) present the results of applying Collins's parser to Spanish, whose novelty with respect to English is its richer inflection. They experiment with different types of morphological information and conclude that this information helps to improve the parser's results.

Eryiğit, Nivre and Oflazer (2006) experiment with different types of morphological information for parsing Turkish, showing that richer initial information increases accuracy. In related work, Eryiğit and Oflazer (2006) show that using morphemes as the unit of analysis (instead of words) also improves the parser.

Aranzabe, Arriola and Díaz de Ilarraza (2004) are developing a dependency-based parser for Basque. This parser is based on linguistic knowledge, with a grammar written in the Constraint Grammar formalism (Tapanainen, 1996). No results on its precision and coverage have been published so far, so no direct comparison with the system presented here is possible.



6 Conclusions

This article has presented the preparation of the 3LB treebank for Basque for computational processing, as well as the adaptation of the parser of Nivre et al. (2007) to Basque. The main characteristics of this language are its free constituent order and its use of rich morphosyntactic information compared with other languages.

This work is the first approach to statistical parsing of Basque, carried out in parallel with the CoNLL 2007 competition, in whose data preparation phase we collaborated.

Different types of parameters and algorithms were tested, obtaining an accuracy above 74%, close to the results of the best systems at CoNLL 2007 for the same task. We have shown that incorporating different types of morphosyntactic information improves the results considerably. We propose the following lines of further research:

• Non-projective parsing. The algorithms used in this work require dependencies to be projective, that is, no arcs may cross. Analysis of the Basque data shows that 2.9% of the dependencies in the treebank are non-projective. For these cases, Nivre and Nilsson (2005) propose an algorithm that converts non-projective arcs into projective ones. Since the algorithm is reversible, the original configuration can be restored after parsing in order to carry out the final evaluation. This conversion makes it possible to use parsing algorithms that are in principle only valid for building projective trees.

• We have found that the noun is one of the syntactic categories with the worst results (LAS of 66%). Since the noun is one of the most frequent categories, it accounts for a large share of all errors (close to 50%). One hypothesis is that this is because the noun is commonly attached to the verb, but the dependency is determined by the grammatical case, which often belongs to another word7. We are therefore considering separating grammatical case as a distinct element, that is, taking morphemes as the unit of analysis. This idea, applied to text alignment in machine translation, has produced significant improvements (Agirre et al., 2006).

• Study of the effect of corpus type on the results. The corpus contains two classes of texts: literary and journalistic. Although the small size of the corpus did not allow separate tests for each class, we found that results improve (by close to 5%) when the training corpus consists of texts of a single type. The extension of the treebank, which will soon reach about 300,000 words, will allow these tests to be carried out more precisely. It will also make it possible to study the contribution of corpus size.

• Study of the effect of the morphosyntactic disambiguation phase. So far, the parser has been tested with a single interpretation per word, that is, with perfect input. The preceding disambiguation phase will introduce errors that add to those of the parser. Although morphological tagging errors are less important for other languages, the high ambiguity of Basque (2.81 interpretations per word, Ezeiza et al. 1998) poses an additional challenge.
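The projectivity condition mentioned in the first item above (no crossing arcs) can be checked directly on a list of heads. The following sketch is only an illustration of that condition, not the pseudo-projective transformation of Nivre and Nilsson (2005); the function name, the head-list encoding (words numbered from 1, head 0 for the root) and the decision to ignore arcs from the artificial root are assumptions.

```python
def crossing_arcs(heads):
    """Return the set of arcs (left, right) that cross some other arc.

    heads[d - 1] is the head of word d (words are numbered from 1);
    a head of 0 marks the root, whose arc is ignored here (assumption).
    """
    arcs = [(min(h, d), max(h, d))
            for d, h in enumerate(heads, start=1) if h != 0]
    crossing = set()
    for (a, b) in arcs:
        for (c, d) in arcs:
            if a < c < b < d:  # the two arcs overlap without nesting
                crossing.add((a, b))
                crossing.add((c, d))
    return crossing


projective = crossing_arcs([2, 0, 2])         # arcs 1-2 and 2-3
non_projective = crossing_arcs([3, 4, 0, 3])  # arcs 1-3 and 2-4 cross
```

A tree is projective under this check when the returned set is empty. Counting the arcs in the returned set over a whole treebank gives a figure comparable in spirit to the 2.9% reported above, although the exact counting convention used by the authors is not stated.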

Acknowledgements

This work is funded by the Department of Industry and Culture of the Basque Government (project AnHITZ 2006, IE06-185).

7 For example, in the noun phrase “etxe handi horrekin” (with that big house), the word etxe must be attached to the verb, but the dependency type is given by the suffix –ekin, which appears two words later.

Bibliography

Agirre E., A. Díaz de Ilarraza, G. Labaka, and K. Sarasola. 2006. Uso de información morfológica en el alineamiento Español-Euskara. XXII Congreso de la SEPLN.

Aranzabe M., J.M. Arriola, and A. Díaz de Ilarraza. 2004. Towards a Dependency Parser of Basque. Proceedings of the Coling 2004 Workshop on Recent Advances in Dependency Grammar. Geneva.

Briscoe, E., J. Carroll, and R. Watson. 2006. The Second Release of the RASP System. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, Sydney.

Carroll, J., G. Minnen, and E. Briscoe. 1999. Corpus annotation for parser evaluation. In Proceedings of the EACL-99 Post-Conference Workshop on Linguistically Interpreted Corpora, Bergen. 35-41.

Chang, C.-C. and C.-J. Lin. 2001. LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Collins M. 1999. Head-Driven Statistical Models for Natural Language Parsing. PhD Dissertation, University of Pennsylvania.

Collins M., J. Hajic, E. Brill, L. Ramshaw, and C. Tillmann. 1999. A Statistical Parser for Czech. In: Proceedings of the 37th Meeting of the ACL, pp. 505-512. University of Maryland, College Park, Maryland.

CoNLL 2006 and 2007. Proceedings of the Tenth/Eleventh Conference on Computational Natural Language Learning.

Cowan B. and M. Collins. 2005. Morphology and Reranking for the Statistical Parsing of Spanish. Proceedings of the Conference on Empirical Methods in NLP (EMNLP).

Daelemans, W. and A. Van den Bosch. 2005. Memory-Based Language Processing. Cambridge University Press.

Eryiğit G., J. Nivre, and K. Oflazer. 2006. The incremental use of morphological information and lexicalization in data-driven dependency parsing. In Proceedings of the 21st International Conference on the Computer Processing of Oriental Languages (ICCPOL), Springer LNAI 4285.

Eryiğit G., and K. Oflazer. 2006. Statistical Dependency Parsing for Turkish. Proceedings of EACL 2006 - The 11th Conference of the European Chapter of the Association for Computational Linguistics, April 2006, Trento, Italy.

Ezeiza N., I. Aduriz, I. Alegria, J.M. Arriola, and R. Urizar. 1998. Combining Stochastic and Rule-Based Methods for Disambiguation in Agglutinative Languages, COLING-ACL'98, Montreal (Canada). August 10-14, 1998.

Eisner J. 1996. Three new probabilistic models for dependency parsing: an exploration. Proceedings of COLING-1996, Copenhagen.

Hajič J. 1998. Building a Syntactically Annotated Corpus: The Prague Dependency Treebank. In: E. Hajičová (ed.): Issues of Valency and Meaning. Studies in Honour of Jarmila Panevová, Karolinum, Charles University Press, Prague, pp. 106-132.

Järvinen T., and P. Tapanainen. 1998. Towards an implementable dependency grammar. CoLing-ACL'98 workshop 'Processing of Dependency-Based Grammars', Kahane and Polguère (eds), p. 1-10, Montreal, Canada.

Tapanainen P. 1996. The Constraint Grammar Parser CG-2. Number 27 in Publications of the Department of General Linguistics, University of Helsinki.

Lin D. 1998. Dependency-based Evaluation of MINIPAR. In Workshop on the Evaluation of Parsing Systems, Granada, Spain, May, 1998.

Marcus M., B. Santorini and M. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19 (2), 313-330.

Nivre, J. and J. Nilsson. 2005. Pseudo-Projective Dependency Parsing. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), 99-106.

Nivre, J., J. Hall, J. Nilsson, A. Chanev, G. Eryiğit, S. Kübler, S. Marinov, and E. Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2).

Palomar M., M. Civit, A. Díaz de Ilarraza, L. Moreno, E. Bisbal, M. Aranzabe, A. Ageno, M.A. Martí, and B. Navarro. 2004. 3LB: Construcción de una base de árboles sintáctico-semánticos para el catalán, euskera y castellano. XX Congreso de la SEPLN.



Deductive techniques for error-correcting parsing∗

Carlos Gómez-Rodríguez and Miguel A. Alonso
Departamento de Computación
Universidade da Coruña
Campus de Elviña, s/n
15071 A Coruña, Spain
{cgomezr, alonso}@udc.es

Manuel Vilares
E. S. de Ingeniería Informática
Universidad de Vigo
Campus As Lagoas, s/n
32004 Ourense
[email protected]

Abstract: We introduce error-correcting parsing schemata, which allow us to define error-correcting parsers in a high-level, declarative way. This formalism can be used to describe error-correcting parsers in a simple and uniform manner, and provides a formal basis for proving their correctness and other properties. We also show how these schemata can be used to obtain different implementations of the parsers, including variants based on regional error correction.

Keywords: robust parsing, error correction, parsing schemata

1. Introduction

When parsing techniques are used in real applications, it is common to encounter sentences not covered by the grammar. This may be due to grammatical errors, errors in the input methods, or the presence of syntactic structures that are correct but not contemplated in the grammar. A conventional parser cannot return a parse tree in these cases. A robust parser is one that can provide useful results for such ungrammatical sentences. In particular, an error-correcting parser is a kind of robust parser that can obtain complete parse trees for sentences not covered by the grammar, by assuming that these ungrammatical sentences are corrupted versions of valid sentences.

∗ Partially funded by Ministerio de Educación y Ciencia (MEC) and FEDER (TIN2004-07246-C03-01, TIN2004-07246-C03-02), Xunta de Galicia (PGIDIT05PXIC30501PN, PGIDIT05PXIC10501PN, Rede Galega de Procesamento da Linguaxe e Recuperación de Información) and Programa de Becas FPU (MEC).

At present there is no formalism that allows error-correcting parsers to be described uniformly and their correctness to be proved, as parsing schemata do for conventional parsers. This article proposes a formalism that fills this gap and shows how it can be used to obtain practical implementations.

2. Conventional parsing schemata

Parsing schemata (Sikkel, 1997) provide a simple and uniform way to describe, analyse and compare different parsers. The notion of a parsing schema comes from viewing parsing as a deductive process that generates intermediate results called items. Starting from an initial set of items obtained directly from the input sentence, parsing consists of applying inference rules (deductive steps) that produce new items from existing ones. Each item contains information about the structure of the sentence, and every successful parse yields at least one final item that guarantees the existence of a complete parse tree for the sentence.

Let $G = (N, \Sigma, P, S)$1 be a context-free grammar2. The set of valid trees for $G$, denoted $Trees(G)$, is defined as the set of finite trees in which the children of each node are ordered from left to right, the nodes are labelled with symbols from $N \cup \Sigma \cup (\Sigma \times \mathbb{N}) \cup \{\varepsilon\}$, and every node $u$ satisfies one of the following conditions:

• $u$ is a leaf,

• $u$ is labelled $A$, the children of $u$ are labelled $X_1, \ldots, X_n$ and there is a production $A \to X_1 \ldots X_n \in P$,

• $u$ is labelled $A$, $u$ has a single child labelled $\varepsilon$ and there is a production $A \to \varepsilon \in P$,

• $u$ is labelled $a$ and $u$ has a single child labelled $(a, j)$ for some $j$.

The pairs $(a, j)$ will be called marked terminals, and when working with a string $a_1 \ldots a_n$ we will write $\underline{a_j}$ as shorthand for $(a_j, j)$. The natural number $j$ indicates the position of the symbol $a$ in the input, so that the input sentence $a_1 \ldots a_n$ can be viewed as a set of trees of the form $a_j(\underline{a_j})$ rather than as a string of symbols. From now on, we will refer to trees of this form as pseudo-productions.

Let $Trees(G)$ be the set of trees for a context-free grammar $G$. An item set is a set $I$ such that $I \subseteq \Pi(Trees(G)) \cup \{\emptyset\}$, where $\Pi$ is a partition of $Trees(G)$. If the set contains $\emptyset$ as an element, we call this element the empty item.

The valid parses of a string in the language defined by a grammar $G$ are represented by items containing marked parse trees for that string. Given a grammar $G$, a marked parse tree for a string $a_1 \ldots a_n$ is any tree $\tau \in Trees(G)$ such that $root(\tau) = S$ and $yield(\tau) = \underline{a_1} \ldots \underline{a_n}$, the sequence of marked terminals of the string. We call final item any item containing a marked parse tree for some string, and correct final item for a particular string $a_1 \ldots a_n$ any item containing a marked parse tree for that string.

1 Where $N$ is the set of nonterminal symbols, $\Sigma$ the alphabet of terminal symbols, $P$ the set of production rules, and $S$ the axiom or start symbol of the grammar.

2 Although this work focuses on context-free grammars, parsing schemata (both conventional and error-correcting) can be defined analogously for other grammatical formalisms.

Example: The Earley item set (Earley, 1970), $I_{Earley}$, associated with a grammar $G = (N, \Sigma, P, S)$ is

$$I_{Earley} = \{[A \to \alpha \bullet \beta, i, j] \mid A \to \alpha\beta \in P \wedge 0 \le i \le j\}$$

where the notation $[A \to \alpha \bullet \beta, i, j]$ used for items represents the set of trees with root $A$ such that the direct children of $A$ are $\alpha\beta$, the frontier nodes of the subtrees rooted at the nodes labelled $\alpha$ form a string of marked terminals of the form $\underline{a_{i+1}} \ldots \underline{a_j}$, and the nodes labelled $\beta$ are leaves. The set of final items in this case is

$$F_{Earley} = \{[S \to \gamma \bullet, 0, n]\}.$$

A parsing schema is a function that, given a string $a_1 \ldots a_n$ and a grammar $G$, yields a set of deductive steps $D$. Deductive steps are elements of $\mathcal{P}_{fin}(H \cup I) \times I$, that is, pairs formed by a finite set of antecedents and a consequent, where $I$ is an item set and $H$ (which we call the set of initial items or hypotheses) contains a set $\{a_i(\underline{a_i})\}$ for each pseudo-production associated with the string. Deductive steps establish an inference relation $\vdash$ between items, such that $Y \vdash x$ if $(Y', x) \in D$ for some $Y' \subseteq Y$. We call valid items of a given schema all those that can be deduced from the hypotheses by a chain of inferences.

A parsing schema is said to be sound if, for any grammar and input string, all valid final items are correct. If all correct final items are valid (that is, if whenever a marked parse tree exists for a string the system can deduce it), it is said to be complete. A schema that is both sound and complete is said to be correct.

A correct schema can be used to obtain an executable implementation of a parser by means of deductive engines such as those described in (Shieber, Schabes and Pereira, 1995; Gómez-Rodríguez, Vilares and Alonso, 2006), which compute the valid final items.
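To make the idea of a deductive engine concrete, the sketch below runs the Earley steps (Initter, Predictor, Scanner, Completer) with an agenda and a chart until no new items can be deduced. It is a minimal illustration under stated assumptions, not the engines cited above: items are encoded as tuples `(lhs, rhs, dot, i, j)`, the grammar is a dictionary from nonterminals to lists of right-hand sides, any symbol that is not a key of the dictionary is treated as a terminal, and the toy grammar is hypothetical.

```python
def earley_recognize(grammar, start, words):
    """Agenda-driven deduction of Earley items; True if a final item is valid."""
    n = len(words)
    # Initter: [S -> . gamma, 0, 0] for every production of the start symbol.
    agenda = [(start, rhs, 0, 0, 0) for rhs in grammar[start]]
    chart = set()
    while agenda:
        item = agenda.pop()
        if item in chart:
            continue
        chart.add(item)
        lhs, rhs, dot, i, j = item
        if dot < len(rhs):
            nxt = rhs[dot]
            if nxt in grammar:
                # Predictor: [B -> . gamma, j, j]
                agenda += [(nxt, r, 0, j, j) for r in grammar[nxt]]
                # Completer, with this item as the left antecedent.
                agenda += [(lhs, rhs, dot + 1, i, k)
                           for (b, r, d, s, k) in chart
                           if b == nxt and d == len(r) and s == j]
            elif j < n and words[j] == nxt:
                # Scanner: consume the terminal at position j.
                agenda.append((lhs, rhs, dot + 1, i, j + 1))
        else:
            # Completer, with this (complete) item as the right antecedent.
            agenda += [(a, r, d + 1, s, j)
                       for (a, r, d, s, k) in chart
                       if d < len(r) and r[d] == lhs and k == i]
    # Final items: [S -> gamma ., 0, n]
    return any(l == start and d == len(r) and i == 0 and j == n
               for (l, r, d, i, j) in chart)


# Hypothetical toy grammar: S -> NP VP, NP -> she | fish, VP -> eats NP
toy_grammar = {"S": [("NP", "VP")],
               "NP": [("she",), ("fish",)],
               "VP": [("eats", "NP")]}
accepted = earley_recognize(toy_grammar, "S", ["she", "eats", "fish"])
rejected = earley_recognize(toy_grammar, "S", ["she", "fish"])
```

The Completer is tried from both sides because either antecedent may be the last one to enter the chart. This engine scans the whole chart for each item, which keeps the sketch short; a practical engine would index items by position and symbol.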

Carlos Gómez-Rodríguez, Miguel A. Alonso y Manuel Vilares Ferro


3. Error-correcting schemata

The parsing schemata formalism described in the previous section is not sufficient to define error-correcting parsers that behave robustly on ungrammatical input, since final items are defined as those containing marked parse trees that belong to $Trees(G)$. In error-correcting parsing, however, we need items that represent "approximate parses" for sentences that have no exact parse. The approximate parses of these ungrammatical sentences cannot belong to $Trees(G)$, but they should be similar to some element of $Trees(G)$. In this context, if we measure "similarity" by means of a distance function, we can give a new definition of items that allows approximate parses to be generated, and thus extend parsing schemata to support error correction.

Given a context-free grammar $G = (N, \Sigma, P, S)$, we call $Trees'(G)$ the set of finite trees in which the children of each node are ordered from left to right and each node is labelled with a symbol from $N \cup \Sigma \cup (\Sigma \times \mathbb{N}) \cup \{\varepsilon\}$. Note that $Trees(G) \subset Trees'(G)$.

Let $d : Trees'(G) \times Trees'(G) \to \mathbb{N} \cup \{\infty\}$ be a distance function satisfying the usual axioms of strict positivity, symmetry and triangle inequality.

We call $Trees_e(G)$ the set $\{t \in Trees'(G) \mid \exists t' \in Trees(G) : d(t, t') \le e\}$, that is, $Trees_e(G)$ is the set of trees at distance $e$ or less from some valid tree of the grammar. Note that, by the strict positivity axiom, $Trees_0(G) = Trees(G)$.

Definition 1. (approximate trees)
The set of approximate trees for a grammar $G$ and a tree distance function $d$ is defined as $ApTrees(G) = \{(t, e) \in (Trees'(G) \times \mathbb{N}) \mid t \in Trees_e(G)\}$. An approximate tree is therefore the pair formed by a tree and its distance to some tree of $Trees(G)$ (or an upper bound on that distance, since $t \in Trees_e(G)$ only requires $d(t, t') \le e$).

This concept of approximate trees allows us to define precisely the problems that error-correcting parsing is meant to solve. Given a grammar $G$, a distance function $d$ and a string $a_1 \ldots a_n$, the approximate recognition problem consists of determining the minimum $e \in \mathbb{N}$ such that there exists an approximate tree $(t, e) \in ApTrees(G)$ where $t$ is a marked parse tree for the string. We call such an approximate tree an approximate marked parse tree for $a_1 \ldots a_n$.

Analogously, the approximate parsing problem consists of finding the minimum $e \in \mathbb{N}$ such that there exists an approximate marked parse tree $(t, e) \in ApTrees(G)$ for the input string, and finding all approximate marked parse trees of the form $(t, e)$ for the string.

Thus, just as the parsing problem can be seen as a problem of finding trees, the approximate parsing problem can be seen as a problem of finding approximate trees, which can be solved by a deductive system analogous to those used for conventional parsing, but whose items contain approximate trees.

Definition 2. (approximate items)
Given a grammar $G$ and a distance function $d$, we define an approximate item set as a set $I'$ such that

$$I' \subseteq \Big(\bigcup_{i=0}^{\infty} \Pi_i\Big) \cup \{\emptyset\}$$

where each $\Pi_i$ is a partition of the set $\{(t, e) \in ApTrees(G) \mid e = i\}$.

Note that the concept is defined so that each approximate item contains approximate trees with a single value of the distance $e$. Defining an approximate item set directly as a partition of $ApTrees(G)$ would not be practical, since our parsers need to keep track of how much discrepancy each partial parse accumulates with respect to the grammar, and that information would be lost if our items were not associated with a single value of $e$. This particular value of $e$ is what we call the parsing distance of an item $\iota$, or $dist(\iota)$:

Definition 3. (parsing distance)
Let $I' \subseteq \big(\bigcup_{i=0}^{\infty} \Pi_i\big) \cup \{\emptyset\}$ be an approximate item set as defined above, and $\iota \in I'$. The parsing distance associated with the nonempty approximate item $\iota$, $dist(\iota)$, is defined as the (trivially unique) value of $i \in \mathbb{N}$ such that $\iota \in \Pi_i$.

For the empty approximate item $\emptyset$, we say that $dist(\emptyset) = \infty$.

Técnicas Deductivas para el Análisis Sintáctico con Corrección de Errores


Definition 4. (error-correcting parsing schema)
Let $d$ be a distance function. An error-correcting parsing schema is a function that assigns to each context-free grammar $G$ a triple $(I', K, D)$, where $K$ is a function such that $(I', K(a_1 \ldots a_n), D)$ is an error-correcting instantiated parsing system for each $a_1 \ldots a_n \in \Sigma^*$. An error-correcting instantiated parsing system is a triple $(I', H, D)$ such that $I'$ is an approximate item set with distance function $d$, $H$ is a set of hypotheses such that $\{a_i(\underline{a_i})\} \in H$ for each $a_i$, $1 \le i \le n$, and $D$ is a set of deductive steps such that $D \subseteq \mathcal{P}_{fin}(H \cup I') \times I'$.

Definition 5. (final items)
The set of final items for a string of length $n$ in an approximate item set is defined as $F(I', n) = \{\iota \in I' \mid \exists (t, e) \in \iota : t$ is a marked parse tree for some string $a_1 \ldots a_n \in \Sigma^*\}$.

The set of correct final items for a string $a_1 \ldots a_n$ in an approximate item set is defined as $CF(I', a_1 \ldots a_n) = \{\iota \in I' \mid \exists (t, e) \in \iota : t$ is a marked parse tree for $a_1 \ldots a_n\}$.

Definition 6. (minimal parsing distance)
The minimal parsing distance for a string $a_1 \ldots a_n$ in an approximate item set $I'$ is defined as $MinDist(I', a_1 \ldots a_n) = \min\{e \in \mathbb{N} : \exists \iota \in CF(I', a_1 \ldots a_n) : dist(\iota) = e\}$.

Definition 7. (minimal final items)
The set of minimal final items for a string $a_1 \ldots a_n$ in an approximate item set $I'$ is defined as $MF(I', a_1 \ldots a_n) = \{\iota \in CF(I', a_1 \ldots a_n) \mid dist(\iota) = MinDist(I', a_1 \ldots a_n)\}$.

The concepts of valid items, soundness, completeness and correctness are analogous to those of conventional parsing schemata. Note that the approximate recognition and approximate parsing problems defined above for any sentence and grammar can be solved by obtaining the set of minimal final items in an approximate item set. Any correct error-correcting schema can deduce these items, since they are a subset of the correct final items.

4. A distance function based on edit distance

To specify a parser by means of an error-correcting parsing schema, we must first decide which distance function to use to define the approximate item set.

A correct schema will obtain the approximate parses whose distance to a correct parse is minimal. The distance function must therefore be chosen according to the kind of errors to be corrected.

Consider a generic situation in which we would like to correct errors according to edit distance. The edit distance or Levenshtein distance (Levenshtein, 1966) between two strings is the minimum number of insertions, deletions or substitutions of a single terminal needed to transform either string into the other.
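For reference, the edit distance between two strings can be computed with the standard dynamic-programming recurrence. The sketch below is the textbook algorithm, included only to fix the notion of distance used in the rest of the section; the function name is an arbitrary choice, and the inputs may be strings of characters or lists of terminals.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of terminals."""
    # prev[j] holds the distance between the prefix of a read so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]  # turning a[:i] into the empty sequence takes i deletions
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute, or match
        prev = cur
    return prev[-1]


d1 = edit_distance("kitten", "sitting")
d2 = edit_distance(["she", "fish"], ["she", "eats", "fish"])
```

The first call needs two substitutions and one insertion; the second needs a single insertion of the terminal "eats".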

A suitable distance $d$ for this case is given by the number of tree transformations needed to convert one tree into another, where the allowed transformations are inserting, deleting or relabelling frontier nodes labelled with marked terminals (or with $\varepsilon$). Therefore, $d(t_1, t_2) = e$ if $t_2$ can be obtained from $t_1$ by means of $e$ transformations on nodes corresponding to marked terminals in $t_1$, and $d(t_1, t_2) = \infty$ otherwise.

Note that, although this work uses this distance to exemplify the definition of error-correcting parsers, the formalism allows any other tree distance function to be used. For example, in some applications it may be useful to define a distance that compares the whole tree (rather than only the frontier nodes), allowing the insertion, deletion or modification of nonterminal symbols. This makes it possible to detect syntactic errors (such as using a transitive verb intransitively) regardless of the length of the phrases involved.

5. Lyon's algorithm

Lyon (1974) defines an error-correcting recognizer based on Earley's algorithm. Given a grammar $G$ and a string $a_1 \ldots a_n$, Lyon's algorithm returns the minimum edit distance to a valid string of $L(G)$.



In this section, we use our formalism to define an error-correcting parsing schema for Lyon's algorithm. This will serve as an example of an error-correcting schema, and will allow us to prove the correctness of the algorithm, implement it easily and create a variant with regional error correction, as will be seen later.

The schema for Lyon's algorithm is defined for the distance function $d$ of Section 4. Given a context-free grammar $G$ and an input string $a_1 \ldots a_n$, the schema Lyon is the one that provides an instantiated parsing system $(I', H, D)$ where $I'$ and $D$ are defined as follows:

$$I'_{Lyon} = \{[A \to \alpha \bullet \beta, i, j, e] \mid A \to \alpha\beta \in P \wedge i, j, e \in \mathbb{N} \wedge 0 \le i \le j\}$$

where we use $[A \to \alpha \bullet \beta, i, j, e]$ as notation for the set of approximate trees $(t, e)$ such that $t$ is a partial parse tree with root $A$ in which the direct children of $A$ are the symbols of the string $\alpha\beta$, and the frontier nodes of the subtrees rooted at the symbols of $\alpha$ form a string of marked terminals of the form $\underline{a_{i+1}} \ldots \underline{a_j}$, while the nodes labelled $\beta$ are leaves. Note that this approximate item set is defined using the distance $d$ of the previous section, which is what determines the values of $e$ in this notation.

The set of deductive steps, D, for Lyon's algorithm is defined as the union of the following:

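$$D^{Initter} = \left\{ \frac{}{[S \to \bullet\gamma, 0, 0, 0]} \right\}$$

$$D^{Scanner} = \left\{ \frac{[A \to \alpha \bullet x\beta, i, j, e],\ [x, j, j+1]}{[A \to \alpha x \bullet \beta, i, j+1, e]} \right\}$$

$$D^{Completer} = \left\{ \frac{[A \to \alpha \bullet B\beta, i, j, e_1],\ [B \to \gamma\bullet, j, k, e_2]}{[A \to \alpha B \bullet \beta, i, k, e_1 + e_2]} \right\}$$

$$D^{Predictor} = \left\{ \frac{[A \to \alpha \bullet B\beta, i, j, e]}{[B \to \bullet\gamma, j, j, 0]} \right\}$$

$$D^{ScanSubstituted} = \left\{ \frac{[A \to \alpha \bullet x\beta, i, j, e],\ [b, j, j+1]}{[A \to \alpha x \bullet \beta, i, j+1, e+1]} \right\}$$

$$D^{ScanDeleted} = \left\{ \frac{[A \to \alpha \bullet x\beta, i, j, e]}{[A \to \alpha x \bullet \beta, i, j, e+1]} \right\}$$

$$D^{ScanInserted} = \left\{ \frac{[A \to \alpha \bullet \beta, i, j, e],\ [b, j, j+1]}{[A \to \alpha \bullet \beta, i, j+1, e+1]} \right\}$$

$$D^{DistanceIncreaser} = \left\{ \frac{[A \to \alpha \bullet \beta, i, j, e]}{[A \to \alpha \bullet \beta, i, j, e+1]} \right\}$$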

The Initter, Scanner, Completer and Predictor steps are similar to those of Earley's algorithm, except that we have to keep track of the distance associated with the approximate trees in our items. Note that the Completer adds the distances of its antecedents, since its consequent item contains trees built by combining those of the two antecedent items, which will therefore contain discrepancies coming from both.

The ScanSubstituted, ScanDeleted and ScanInserted steps are error-correction steps, and allow unexpected symbols to be read from the string while the distance is incremented. ScanSubstituted corrects a substitution error in the input, ScanDeleted corrects a deletion error, and ScanInserted an insertion error.

The set of final items and the subset of correct final items are:

$$F = \{[S \to \gamma\bullet, 0, n, e]\}$$

$$CF = \{\iota = [S \to \gamma\bullet, 0, n, e] \mid \exists (t, e) \in \iota : t \text{ is a marked parse tree for } a_1 \ldots a_n\}$$

The DistanceIncreaser step ensures that all non-minimal final items are generated (which is required for completeness). In practical implementations of the parser, such as the original proposal by Lyon (1974), strict completeness is usually not of interest, only obtaining the minimal-distance parses, so the DistanceIncreaser is unnecessary and can simply be omitted.
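The deductive steps of the Lyon scheme can be sketched as a small recognizer. This is an illustrative sketch under several assumptions, not the authors' implementation: the function name `lyon_min_distance` and the grammar encoding (a dict from nonterminals to lists of right-hand-side tuples) are hypothetical, DistanceIncreaser is omitted as the text allows, only the smallest distance found for each item [A → α • β, i, j] is stored, and a simple worklist with relaxation stands in for a deductive engine.

```python
from collections import deque

def lyon_min_distance(grammar, start, words):
    """Smallest e such that a final item [S -> gamma., 0, n, e] is deducible."""
    n = len(words)
    best = {}         # (lhs, rhs, dot, i, j) -> smallest distance found so far
    agenda = deque()

    def add(item, e):
        # Keep an item only if it improves on the distance already known.
        if e < best.get(item, float("inf")):
            best[item] = e
            agenda.append(item)

    for rhs in grammar[start]:                          # Initter
        add((start, rhs, 0, 0, 0), 0)

    while agenda:
        item = agenda.popleft()
        lhs, rhs, dot, i, j = item
        e = best[item]
        if j < n:                                       # ScanInserted
            add((lhs, rhs, dot, i, j + 1), e + 1)
        if dot < len(rhs):
            x = rhs[dot]
            if x in grammar:                            # next symbol is a nonterminal
                for gamma in grammar[x]:                # Predictor
                    add((x, gamma, 0, j, j), 0)
                for (b, g, d, s, k), e2 in list(best.items()):
                    if b == x and d == len(g) and s == j:   # Completer
                        add((lhs, rhs, dot + 1, i, k), e + e2)
            else:                                       # next symbol is a terminal
                if j < n and words[j] == x:             # Scanner
                    add((lhs, rhs, dot + 1, i, j + 1), e)
                if j < n:                               # ScanSubstituted
                    add((lhs, rhs, dot + 1, i, j + 1), e + 1)
                add((lhs, rhs, dot + 1, i, j), e + 1)   # ScanDeleted
        else:                                           # complete item: Completer
            for (a, r, d, s, k), e1 in list(best.items()):
                if d < len(r) and r[d] == lhs and k == i:
                    add((a, r, d + 1, s, j), e1 + e)

    finals = [e for (a, r, d, i, j), e in best.items()
              if a == start and d == len(r) and i == 0 and j == n]
    return min(finals, default=None)
```

With the toy grammar `{"S": [("a", "S", "b"), ("a", "b")]}`, the call `lyon_min_distance(grammar, "S", ["a", "a", "b"])` looks for the cheapest way to repair a string with one `b` missing. Re-processing an item whenever its distance improves keeps the sketch short at the cost of efficiency.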

Proving the soundness of the Lyon scheme means showing that all valid final items in its associated parsing systems are correct. This is shown by proving the stronger proposition that all valid items are correct, which can be done by analysing each deductive step separately and showing that if its antecedents are correct, so is the consequent.

To prove the completeness of the Lyon scheme (that is, that all correct final items are valid in the scheme), we take into account that such final items are of the form [S → α•, 0, n, e], and proceed by induction on the distance e. The base case is proved from the completeness of the Earley scheme (Sikkel, 1998), and the inductive step is proved by means of a series of item transformation functions that allow the validity of any correct final item with distance e + 1 to be inferred from that of an item with distance e.

6. Implementation

A complete error-correcting scheme allows all valid approximate parses for a given string to be deduced. However, when implementing an error-correcting parser in practice, we do not want to obtain every possible approximate parse (which would be impossible in finite time, since there are infinitely many). What we are looking for, as mentioned in the definition of the approximate parsing problem, are the approximate parses with minimal distance.

Any correct scheme satisfying a property that we will call finite completeness can be adapted to solve the approximate parsing problem in finite time, generating only the minimal-distance parses, if we add some restrictions to it. To this end, we define some concepts that lead to the notion of a finitely complete scheme.

Definition 8. (bounded scheme) Let S be an error-correcting parsing scheme that assigns a triple (I′, K, D) to each grammar G. The bounded scheme associated with S with bound b, denoted Bb(S), is the one that assigns to each grammar G the parsing system Boundb(S(G)) = Boundb(I′, K, D) = (I′, K, Db), where Db = {((a1, a2, . . . , ac), c) ∈ D : dist(c) ≤ b}.

In other words, a bounded scheme is a variant of an error-correcting scheme that does not allow items with an associated distance greater than the bound b to be deduced.

Definition 9. (completeness up to a bound) We say that an error-correcting parsing scheme S is complete up to a bound b if, for any grammar and input string, all correct final items whose associated distance is not greater than b are valid.

Definition 10. (finite completeness) We say that an error-correcting parsing scheme S is finitely complete if, for every b ∈ N, the bounded scheme Bb(S) is complete up to the bound b.

Note that a finitely complete scheme is always complete, since we can make b arbitrarily large.

The Lyon scheme is finitely complete, which can be shown analogously to its completeness. Moreover, it is easy to see that, given a deductive engine able to execute parsing schemata, any finitely complete error-correcting scheme S can be used to build a parser that solves the approximate parsing problem in finite time, returning all valid approximate parses of minimal distance without generating any parse of non-minimal distance. The simplest way to do this is the following:

    function AnalizadorRobusto ( str : string ) : set of items
      b = 0;   // maximum allowed distance
      while ( true ) {
        compute validItems = v(Boundb(S(G)), str);
        finalItems = { i ∈ validItems / i is a final item };
        if ( finalItems ≠ ∅ ) return finalItems;
        b = b + 1;
      }

where the function v(sys, str) computes all valid items in the parsing system sys for the string str, and can be implemented as in (Shieber, Schabes, and Pereira, 1995; Gómez-Rodríguez, Vilares, and Alonso, 2006).

It is easy to show that, if the approximate parsing problem has a solution for a given string (which, under our definition of distance, is always the case), this algorithm finds it in finite time. In practice, several optimizations can be applied to improve running time, such as using the items generated in each iteration as hypotheses for the next one instead of inferring them again. Note that this variant of a deductive engine can execute any error-correcting scheme, not only Lyon's.
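The bounded loop can be illustrated with a runnable sketch. To stay self-contained it does not run a parsing scheme: as an assumption made purely for illustration, the deductive system is a toy one whose "language" is a single string `target`, with items (i, j, e) meaning that target[:i] has been aligned with text[:j] using e corrections. The names `bounded_valid_items` and `robust_parse` are hypothetical.

```python
def bounded_valid_items(target, text, b):
    """Toy stand-in for v(Bound_b(S(G)), str): the closure of the deductive
    steps, never deducing an item whose distance exceeds the bound b."""
    items, stack = set(), [(0, 0, 0)]
    while stack:
        item = stack.pop()
        i, j, e = item
        if e > b or item in items:
            continue
        items.add(item)
        if i < len(target) and j < len(text):
            if target[i] == text[j]:
                stack.append((i + 1, j + 1, e))    # correct scan
            stack.append((i + 1, j + 1, e + 1))    # substituted symbol
        if i < len(target):
            stack.append((i + 1, j, e + 1))        # symbol missing from the input
        if j < len(text):
            stack.append((i, j + 1, e + 1))        # extra symbol in the input
    return items


def robust_parse(target, text):
    """Raise the bound b until some final item becomes valid."""
    b = 0
    while True:
        valid = bounded_valid_items(target, text, b)
        finals = {(i, j, e) for (i, j, e) in valid
                  if i == len(target) and j == len(text)}
        if finals:
            return finals
        b += 1
```

Because the bound only grows after a search with no final items, the first non-empty result contains minimal-distance items only. This sketch recomputes the closure from scratch at every bound; reusing the items of the previous iteration, as suggested in the text, would avoid that.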

6.1. Implementation with regional correction

If an error-correcting parser is able to find all the minimal-distance approximate parses for any given string, like the one in Section 6, it is called a global error-correcting parser. In practice, global correctors can become very inefficient if we want to parse long strings or use grammars with thousands of productions, as is usual in natural language processing.

A more efficient alternative is regional error correction, which applies error correction to a region surrounding the point at which parsing cannot continue. Regional parsers guarantee that an optimal solution is always found; however, if several exist, they do not guarantee finding all of them.

State-based regional correction algorithms, such as those defined in (Vilares, Darriba, and Ribadas, 2001), tend to be tied to a particular implementation. Error-correcting parsing schemata allow us to define more general, item-based regional parsers, where the regions are sets of items. Regional parsers can be obtained from global ones in a general way, such that the regional parser will always return an optimal solution if the global parser it derives from is correct and finitely complete. To do so, we use the notion of a progress function:

Definition 11. (progress function) Let I′ be a set of approximate items. A progress function for I′ is a function fp : I′ → {p ∈ N / 0 ≤ p ≤ k}, where k is a natural number called the maximum progress.

Let S be a correct and finitely complete error-correcting parsing scheme, and fp a progress function for its item set. We can implement a regional error-correcting parser based on S as follows:

    function AnalizadorRegional ( str : string ) : set of items
      b = 0;          // maximum allowed distance
      maxProgr = 0;   // upper limit of the region
      minProgr = 0;   // lower limit of the region
      while ( true ) {
        compute validItems = v’(Boundb(S(G)), str, minProgr, maxProgr);
        finalItems = { i ∈ validItems / i is a final item };
        if ( finalItems ≠ ∅ ) return finalItems;
        newMax = max { p ∈ N / ∃ i ∈ validItems / fp(i) = p };
        if ( newMax > maxProgr ) {
          maxProgr = newMax; minProgr = newMax;
        }
        else if ( minProgr > 0 ) minProgr = minProgr − 1;
        else b = b + 1;
      }

where the function v’(ded, str, min, max) computes all valid items in the deductive system ded for the string str, with the restriction that error-correction deductive steps are only fired if at least one of their antecedents, ι, satisfies min ≤ fp(ι) ≤ max.

This regional parser always returns an optimal solution provided that S is correct and finitely complete. For the regional parser to be efficient as well, the progress function must be defined so that it is a good approximation of how "promising" an item is with respect to reaching a final item.³

A simple but adequate function in the case of the Lyon parser is fpj([A → α • β, i, j, e]) = j, which simply evaluates an item according to its index j. Another alternative is fpj−i([A → α • β, i, j, e]) = j − i. Both functions reward items that have advanced further to the right in the input string, and take maximal values for final items.
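The regional loop with the progress function fpj can be sketched in the same spirit. As an assumption made only to keep the example self-contained, the deductive system is a toy one that recognises a single string `target`, with items (i, j, e) meaning that target[:i] has been aligned with text[:j] using e corrections; the names `regional_valid_items` and `regional_parse` are hypothetical.

```python
def regional_valid_items(target, text, b, min_progr, max_progr):
    """Toy stand-in for v'(Bound_b(S(G)), str, min, max): error-correction steps
    fire only from items whose progress j lies inside the region."""
    items, stack = set(), [(0, 0, 0)]
    while stack:
        item = stack.pop()
        i, j, e = item
        if e > b or item in items:
            continue
        items.add(item)
        if i < len(target) and j < len(text) and target[i] == text[j]:
            stack.append((i + 1, j + 1, e))        # ordinary step, always allowed
        if min_progr <= j <= max_progr:            # error steps, region only
            if i < len(target) and j < len(text):
                stack.append((i + 1, j + 1, e + 1))
            if i < len(target):
                stack.append((i + 1, j, e + 1))
            if j < len(text):
                stack.append((i, j + 1, e + 1))
    return items


def regional_parse(target, text):
    b, max_progr, min_progr = 0, 0, 0
    while True:
        valid = regional_valid_items(target, text, b, min_progr, max_progr)
        finals = {it for it in valid
                  if it[0] == len(target) and it[1] == len(text)}
        if finals:
            return finals
        new_max = max(j for (_, j, _) in valid)    # progress function fp = j
        if new_max > max_progr:
            max_progr = min_progr = new_max        # move the region to the new frontier
        elif min_progr > 0:
            min_progr -= 1                         # widen the region to the left
        else:
            b += 1                                 # region is maximal: allow more errors
```

In this toy setting the bound is raised only when the region already covers every item reached, so the distance of the returned items is still minimal, while error steps far from the point of failure are tried last.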

7. Empirical results

To test our parsers and study their performance, we used the system described in (Gómez-Rodríguez, Vilares, and Alonso, 2006) to run the Lyon scheme with global and regional correction. The progress function used for the regional case is the function fpj defined above.

The grammar and sentences used for the tests come from the DARPA ATIS3 system. In particular, we used the same test sentences as (Moore, 2000). This test set is suitable for our purposes, since it comes from a real application and contains ungrammatical sentences. In particular, 28 of the 98 sentences in the set are ungrammatical. When running our error-correcting parsers, we found that the minimal edit distance to a grammatical sentence is 1 for 24 of them (that is, these 24 sentences have a possible correction with a single error), 2 for two of them, and 3 for the remaining two.

³ The criteria for a good progress function are similar to those that characterise a good heuristic in an informed search problem. Thus, the ideal progress function would be one such that f(ι) = 0 if ι is not needed to deduce a final item, and f(ι) > f(κ) if ι can lead to a final item in fewer steps than κ. Obviously this function cannot be used, since until the deductive process is complete we do not know whether a given item can lead to a final item; but functions that provide a good approximation to this ideal heuristic will give rise to efficient parsers. In the degenerate case in which f(ι) = 0 is returned for every item, the progress function provides no information and the regional error-correcting parser is equivalent to the global one.

Min.     No. of      Mean      Mean items    Mean items    Improvement
dist.    sentences   length    (global)      (regional)    (%)
0        70          11.04     37558         37558         0%
1        24          11.63     194249        63751         65.33%
2        2           18.50     739705        574534        22.33%
3        2           14.50     1117123       965137        13.61%
>3       0           n/a       n/a           n/a           n/a

Table 1: Performance of the global and regional parsers on sentences from the ATIS test set. Each row corresponds to a value of the minimal parsing distance (or error count).

As we can see, regional correction reduces item generation by a factor of three in sentences with a single error. In sentences with more than one error the improvements are smaller: this is because, before returning solutions with distance d + 1, the regional parser generates all valid items with distance d. In any case, it should be noted that running time grows faster than the number of generated items, so these relative improvements in items translate into larger relative improvements in running times. Moreover, in practical situations sentences with several errors can be expected to be less frequent than those with only one, as in this case. Therefore, the faster running times make item-based regional error-correcting parsers a good alternative to global correctors.

8. Conclusions and current work

We have presented error-correcting parsing schemata, a formalism that can be used to define, analyse and compare error-correcting parsers easily. These schemata are simple, declarative descriptions of the algorithms that capture their semantics and abstract away from implementation details.

In this work, we have used them to describe an Earley-based error-correcting parser (first described in (Lyon, 1974)), to prove its correctness, to generate a deductive implementation of the original algorithm, and to obtain a faster parser, based on regional correction, from the same scheme. The methods used to obtain these results are generic and can be applied to other parsers.

We are currently working on the definition of a function that transforms correct conventional schemata satisfying certain conditions into correct error-correcting schemata. This transformation makes it possible to obtain regional and global error-correcting parsers automatically from conventional schemata such as those corresponding to the CYK or Left-Corner parsers.

References

Earley, J. 1970. An efficient context-free parsing algorithm. Communications of the ACM, 13(2):94–102.

Gómez-Rodríguez, C., J. Vilares, and M. A. Alonso. 2006. Automatic generation of natural language parsers from declarative specifications. In Proc. of STAIRS 2006, Riva del Garda, Italy. Long version available at http://www.grupocole.org/GomVilAlo2006a_long.pdf.

Levenshtein, V. I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710.

Lyon, G. 1974. Syntax-directed least-errors analysis for context-free languages: a practical approach. Communications of the ACM, 17(1):3–14.

Moore, R. C. 2000. Improved left-corner chart parsing for large context-free grammars. In Proc. of the 6th IWPT, pages 171–182, Trento, Italy.

Shieber, S. M., Y. Schabes, and F. C. N. Pereira. 1995. Principles and implementation of deductive parsing. Journal of Logic Programming, 24(1–2):3–36, July-August.

Sikkel, K. 1997. Parsing Schemata: A Framework for Specification and Analysis of Parsing Algorithms. Springer-Verlag, Berlin/Heidelberg/New York.

Sikkel, K. 1998. Parsing schemata and correctness of parsing algorithms. Theoretical Computer Science, 199(1–2):87–103.

Vilares, M., V. M. Darriba, and F. J. Ribadas. 2001. Regional least-cost error repair. Lecture Notes in Computer Science, 2088:293–301.



A simple formalism for capturing order and co-occurrence in computational morphology

Mans Hulden
University of Arizona
Department of Linguistics
P.O. BOX 210028
Tucson AZ 85721-0028
USA
[email protected]

Shannon T. Bischoff
University of Arizona
Department of Linguistics
P.O. BOX 210028
Tucson AZ 85721-0028
USA
[email protected]

Resumen: Tradicionalmente, los modelos computacionales de morfología y fonología han venido asumiendo, como punto de partida, un modelo morfotáctico donde los morfemas se extraen de subléxicos y se van concatenando de izquierda a derecha. El modelo de ‘clase de continuación’ se ha venido utilizando como el sistema estándar de facto en la creación de diferentes cajas de herramientas de software. Tras estudiar lenguas de tipología diversa, proponemos aquí un modelo de rasgos ampliado. Nuestro modelo consta de varias operaciones diseñadas con el fin de que un buen número de restricciones de co-ocurrencia local y global puedan ser descritas de manera concisa. Aparte también sugerimos ciertas formas de implementar estos operadores en modelos de morfología basados en transductores de estado finito.
Palabras clave: morfología computacional, morfotáctica, unificación de rasgos.

Abstract: Computational models of morphology and phonology have traditionally assumed as a starting point a morphotactic model where morphemes are drawn from sublexicons and concatenated left-to-right. In defining the lexicon-morphotactic level of a system, this ‘continuation-class’ model has been the de facto standard implementation in various software toolkits. From a survey of a number of typologically different languages, we propose a more comprehensive feature-driven model of morphotactics that provides the linguist with various operations that are designed to concisely define a variety of local and global co-occurrence restrictions. We also sketch ways to implement these operators in finite-state-transducer-based models of morphology.
Keywords: computational morphology, morphotactics, feature unification.

1. Introduction

Morphotactics, that is, how morphemes combine to make well-formed words in languages, can be, and often is, treated as an isolated problem in computational morphological analysis and generation. This has been particularly true of two-level and finite-state morphological models, where grammars describe a mapping from an abstract morphotactic level to a surface level. In such models, the topmost level is often described not only as a mapping to some lower level of representation, but is also separately constrained to reflect only legal combinations of morphemes in a language.

Insofar as morphotactics is seen to be a problem of expressing combinatorial constraints, it would be desirable to develop a formalism that would allow for simple descriptions of such constraints on combinations of morphemes as frequently occur in various natural languages. Such models have indeed been proposed. By far the most popular model in computational morphology has been the ‘continuation class’ model (Koskenniemi, 1983; Beesley and Karttunen, 2003) and variants thereof. The underlying assumption (and the reason for its popularity) is that a majority of languages exhibit the kind of morphotactics that is easily expressed through such systems: left-to-right concatenative models where the allowability of a morpheme is primarily conditioned by the preceding morpheme. This assumption does not always hold, however, which has led to many proposals and implementations that augment this model with extensions that provide the expressive power to include phenomena not otherwise capturable.

While a variety of such extensions to the continuation-class model have been proposed (some quite comprehensive), we depart entirely from the continuation-class model in this proposal, and instead propose a formalism that is based on declarative constraints over both the order and co-occurrence of individual morphemes.¹ This approach to restricting morphotactics takes advantage of a fairly restricted set of operations on feature-value combinations in morphemes. The formalism allows us to express a variety of non-concatenative phenomena (complex co-occurrence patterns, free morpheme ordering, circumfixation, among others) concisely with a small number of statements.

2. Nonconcatenative phenomena

In the following, we give a few examples of nonconcatenative morphotactic phenomena that are difficult to capture with only a continuation-class model of morphotactics in order to motivate particular features of the notation we propose.²

2.1. Slot-and-filler morphotactics

The so-called slot-and-filler morphologies (also called templatic morphologies) tend to differ from concatenative processes or left-to-right agglutinative morphologies in that they feature abundant, often long-distance, restrictions on the co-occurrence of morphemes. An example of this type of language is Navajo (and other Athabaskan languages) where a strict template guides the order of morphemes. Some templatic slots may be empty, while others are obligatorily filled:

¹ The Xerox xfst/lexc (Beesley and Karttunen, 2003) toolkit is a particularly versatile toolkit that offers a variety of notational devices to capture the same phenomena we document here.

² We exclude two common patterns from this discussion: that of templatic root-and-pattern morphology (as seen in Arabic), as well as reduplication phenomena. These have been extensively treated in the literature and the most efficient solutions seem to treat these more as phonological phenomena not specified in the most abstract level of morphotactic description.

1       2     3     4     5     6     7     8
O       P     Obj.  In    Fut   S     Cl    Stem
ha      da    j     ∅     ∅     íí    ∅     geed
‘out’   Pl.   4p                4p          Imp. ‘dig’

hadajíígeed
‘Those guys dug them up’

In the above example, we have a template consisting of eight slots, where certain classes of morphemes are allowed to appear: slot 1 for ‘outer’ lexical prefixes, slot 2 for marking distributive plurals, etc.³

What is noteworthy is the complex co-occurrence constraints that govern the legal formation of Navajo verbs. To give a few examples with respect to the above templatic derivation: 1) the ‘outer’ prefix ha is allowed with stems that conjugate according to a certain pattern (the so-called yi-perfective), which geed fulfils; 2) the allomorph of the 4th person subject pronoun íí is selected on the basis of what slots 1 and 2 contain; 3) the 4th person subject pronoun is discontinuous in that a j must also appear in slot 3 (without this, the íí in slot 6 signals 3rd person); 4) the ‘classifier’ in slot 7 has four possibilities which together with the stem mode and prefixes in slots 1 and 2 determine what the subject allomorph can be.

Navajo is an extreme example of long-distance systematic patterns of co-occurrence restrictions. Some languages, such as the American Indian language Koasati, which features around 30 slots for its verbs, allow almost any co-occurrence pattern (Kimball, 1991). Nevertheless, a concise formalism for defining morphotactics needs to include the possibility of easily capturing the type of patterns Navajo and other similar languages have.

2.2. Free morpheme ordering

Although less documented among the world’s major languages, there also exist languages where certain classes of morphemes can appear in free relative order without affecting the semantics of a word. Recent examples of this include Aymara, an American Indian language spoken in the Andean region,⁴ and Chintang, a Tibeto-Burman language, from which the following example is drawn:

³ This simplified model follows Faltz (1998); the majority of analyses for Navajo assume 16 slots or more. See Young (2000) for details.

(1) u-kha-ma-cop-yokt-e
    3nsA-1nsP-NEG-see-NEG-PST
(2) u-ma-kha-cop-yokt-e
(3) kha-u-ma-cop-yokt-e
(4) ma-u-kha-cop-yokt-e
(5) kha-ma-u-cop-yokt-e
(6) ma-kha-u-cop-yokt-e

‘They didn’t see us’ (from Bickel et al. (2007))

Here, examples (1) through (6) are interchangeable and equally grammatical.

A concatenative model where order must be declared would require extra machinery to capture this phenomenon.⁵ As will be seen below, we will want to capture this phenomenon by simply leaving certain order constraints undeclared, from which the free order falls out naturally.

3. Constraining morphotactics

Given these phenomena, we now propose a simple formalism to capture morphotactics. First, we assume the existence of labeled sublexicons containing various morphemes in a given class. Also, we assume that each morpheme can be associated with feature-value combinations:

Class1              . . .    Classn
Morpheme1           . . .    Morpheme1
{Subclass}          . . .    {Subclass}
OP Feat Value       . . .    OP Feat Value
...                          ...
Morphemei           . . .    Morphemej
OP Feat Value       . . .    OP Feat Value

That is, we assume that a complete lexicon is a collection of sublexicons (or classes) that contain morphemes. These morphemes may carry any number of feature-value pairs, to which an operator is associated, and may be a member of a subclass as well.

⁴ See Hardman (2001) for examples of the free morpheme ordering in Aymara. Thanks to Ken Beesley and Mike Maxwell for pointing out these resources and the phenomenon.

⁵ Beesley and Karttunen (2003) hint at a solution that first declares a strict order with continuation classes and subsequently ‘shuffles’ the morphemes freely with a regular expression operator that is composed after the output of the strictly ordered morphotactic level.

3.1. Order

In a fashion similar to that of the continuation-class model, we propose that morphemes are drawn out of this finite number of sublexicons (classes) one at a time. However, instead of each sublexicon consisting of a statement guiding the choice of the next sublexicon, the order is to be governed by a number of statements over the sublexicons using two operators: > and ≫.

The operator C1 > C2 defines the patterns (languages) where each morpheme drawn out of the sublexicon named C1 must immediately precede each morpheme drawn out of C2. Likewise, C1 ≫ C2 expresses the constraint that morphemes drawn from C1 must precede (not necessarily immediately) those from C2. For the sake of completeness, we can also assume the existence of the reverse variants < and ≪.

In a templatic morphology, order constraints could simply be a single transitive statement C1 ≫ . . . ≫ Cn, and the majority of the grammar would consist of feature-based constraints regarding the possible co-occurrence of morphemes.

Likewise, the examples of free morpheme order are now easy to capture: let us suppose that there exists a number of prefixes that have free internal order (such as in the Chintang example above), C1 to Cn, followed by a number of morphemes with strict internal ordering, Cx . . . Cy. This could now be captured by the statements:

C1 ≫ Cx
. . .
Cn ≫ Cx
Cx ≫ . . . ≫ Cy

When modeled in this fashion there need not be any separate statements saying that C1 to Cn occur in free internal order; rather, this falls out of simply not specifying an order constraint for those morpheme classes, other than that they must occur before Cx.
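How free order falls out of undeclared constraints can be checked with a small sketch. It covers only the non-immediate precedence operator; immediate precedence is left out because its exact reading would require further assumptions. The function name `respects_precedence`, the class labels, and the assignment of the Chintang morphemes to classes are hypothetical choices made for illustration.

```python
from itertools import permutations

def respects_precedence(word, constraints):
    """word: list of (morpheme, class) pairs, in surface order.
    constraints: list of (c1, c2) pairs, each read as 'c1 must precede c2'."""
    classes = [cls for _, cls in word]
    for before, after in constraints:
        seen_after = False
        for cls in classes:
            if cls == after:
                seen_after = True
            elif cls == before and seen_after:
                return False    # a c1 morpheme occurs after a c2 morpheme
    return True

# Three prefix classes with no mutual order constraints (assumed labels).
PREFIXES = [("u", "C1"), ("kha", "C2"), ("ma", "C3")]
# Stem and suffixes in strict order (assumed labels).
REST = [("cop", "Cx"), ("yokt", "Cx1"), ("e", "Cy")]
CONSTRAINTS = [("C1", "Cx"), ("C2", "Cx"), ("C3", "Cx"),
               ("Cx", "Cx1"), ("Cx1", "Cy")]

# All six prefix orders are accepted without any extra statement.
accepted = [p for p in permutations(PREFIXES)
            if respects_precedence(list(p) + REST, CONSTRAINTS)]
```

Here `accepted` contains all six permutations of the prefixes, while a word that places a prefix after the stem is rejected.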

3.2. Co-occurrence

For defining the possible co-occurrence of morphemes, we take advantage of the basic idea of features and feature unification. We do not assume elaborate feature structures to exist; rather, we take unification to be an operator associated with features in the morpheme lexicon, such that conflicting feature-value pairs may not exist in the same word.

As mentioned, every morpheme in every sublexicon can carry OP [Feature Value] combinations, where OP is one of ⊔, +, or −.

3.2.1. Unification

The ‘unification’ operator ⊔ has the following semantics: a morpheme associated with ⊔[F X] disallows the presence of any other morpheme in the same word carrying the feature F with a value other than X.

3.2.2. Coercion

The operator + controls co-occurrence as follows: a +[F X] combination associated with a morpheme requires that there be another [F X] combination somewhere else in the word for the word to be legal.

3.2.3. Exclusion

Similarly, −[F V] requires that any [F V] combination be absent from the word in question.

For the sake of transparency, it is assumed that a +[F V] statement can be satisfied by ⊔[F V].

3.3. Examples

With these tools for defining morphotactics, we can now outline an example from English derivational morphology using order constraints and the feature-related operators.

3.3.1. Order constraints

A well-known generalization of English is that derivational suffixes often change parts of speech, and so must attach to the proper part of speech that the preceding morpheme ‘produces.’ Also, prefixes and suffixes are seen to fall into two strata: an inner stratum of (mostly) latinate affixes (such as ic and ity), which attach closest to the stem, and an outer stratum of (mostly native) affixes (such as ness and less) (Mohanan, 1986). Assuming the stem atom, and a vocabulary of suffixes ic, ity, ness and less, we should be able to form atom, atomic, atomicity, atomnessless, among others, but not ∗atomity, ∗atomlessity.

    Class {Stems}
    atom {toN}

    Class {LatinateSuffix}
    ic {fromN} {toA}
    ity {fromA} {toN}

    Class {NativeSuffix}
    ness {fromN} {toN}
    less {fromN} {toA}

    Constraints
    LatinateSuffix >> Stems
    NativeSuffix >> LatinateSuffix | Stems
    {fromN} > {toN}
    {fromA} > {toA}

In the above notation (reflecting an actual implementation), ic belongs to the head class LatinateSuffix but also to fromN and toA, reflecting that the suffix is latinate and changes a noun into an adjective. The relevant constraints are that latinate suffixes must follow stems, and that nonlatinate suffixes must follow both stems and latinate suffixes. The above snippet suffices to capture the general order constraints with respect to the strata-based derivational view mentioned previously.

3.3.2. Feature constraints: circumfixes

Circumfixes are a classical simple case of co-occurrence that can be captured using the feature constraints. To continue with English, an example of a circumfix is the combination em+adjective+en, as in embolden. However, the suffix en can occur on its own, as in redden, while the prefix em cannot.⁶ This can be modelled as follows:

    Class {LatinatePrefix}
    em
      +[Circ emen]

    Class {Stems}
    bold {toA}

    Class {NativeSuffix}
    en {fromA} {toV}
      U[Circ emen]

⁶ The prefix em is actually modeled to be underlyingly en, where the nasal assimilates in place to the following consonant.

Here, the prefix em carries +[Circ emen], requiring the presence of a feature-value pair [Circ emen] somewhere else in the derivation. This can be satisfied by the suffix en. However, this suffix can also surface on its own since it does not carry the coercion operator + on the feature-value pair, but only the unification operator. The interplay between these two operators yields the desired morphotactics.
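The interplay of the three operators on the circumfix example can be checked with a minimal sketch. The data encoding (each morpheme as a form plus a list of (operator, feature, value) triples, with "U" standing for the unification operator) and the function name `cooccurrence_ok` are assumptions made for illustration; order constraints are not checked here.

```python
def cooccurrence_ok(word):
    """word: list of (form, specs), where specs is a list of
    (op, feature, value) triples and op is 'U', '+' or '-'."""
    for k, (_, specs) in enumerate(word):
        # Feature-value pairs carried with U or + by the other morphemes.
        elsewhere = [(f, v)
                     for m, (_, other_specs) in enumerate(word) if m != k
                     for (op, f, v) in other_specs if op in ("U", "+")]
        for op, feat, val in specs:
            present = (feat, val) in elsewhere
            conflict = any(f == feat and v != val for f, v in elsewhere)
            if op == "U" and conflict:       # unification: no conflicting value
                return False
            if op == "+" and not present:    # coercion: pair required elsewhere
                return False
            if op == "-" and present:        # exclusion: pair must be absent
                return False
    return True

EM = ("em", [("+", "Circ", "emen")])
EN = ("en", [("U", "Circ", "emen")])
BOLD = ("bold", [])
RED = ("red", [])
```

Under these assumptions, `cooccurrence_ok([EM, BOLD, EN])` and `cooccurrence_ok([RED, EN])` hold, while `cooccurrence_ok([EM, BOLD])` fails because the coercion on em is left unsatisfied.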

4. Implementation

While we wish to remain somewhat agnostic as to the preferred computational models of morphological analysis and parsing, we shall here outline a possible implementation of the proposed formalism in terms of finite-state automata/transducers, since these are a popular mode of building morphological analyzers and generators.⁷

We assume the standard regular expression notations where Σ denotes the alphabet, L1 ∪ L2 is the union of two languages, $\overline{L}$ is the complement of language L, and # is an auxiliary boundary marker denoting a left or right edge of a string. Also, in our notation, symbol and language concatenation is implied whenever two symbols are placed adjacent to each other. Following this, our earlier notation +[FV] denotes the language that consists of one string with five elements concatenated (we assume F and V to represent features and values, respectively, and +, −, [, ], {, }, and ⊔ to be single symbols).

4.1. Context restriction

As an auxiliary notation, we shall assume the presence of a regular-expression context-restriction operator (⇒) in the compilation of automata and transducers, as this alleviates the task of defining many morphotactic restrictions. We take:

X ⇒ Y1 _ Z1, . . . , Yn _ Zn

to characterize the regular language where every instance of the language X is immediately preceded by the language Yi and immediately followed by Zi, for some i. The reader is urged to consult Yli-Jyrä and Koskenniemi (2004) for a very efficient method of compiling such statements into automata.

7 A parser for Navajo verbal morphology has been built this way: converting the contents of a grammar into regular expressions, and then building automata that constrain the morphotactic level (Hulden and Bischoff, 2007).

4.2. Unification

With the above, we can build ⊔[FV], for some feature-value combination present in our grammar, as:

$\sqcup[FV] \Rightarrow \overline{\#\Sigma^*(\sqcup \cup +)[FV_x]\Sigma^*}\ \_\ \overline{\Sigma^*(\sqcup \cup +)[FV_x]\Sigma^*\#}$

where Vx ranges over the values of F other than V. That is, the presence of a ⊔[FV] is allowed only in an environment where neither the left-hand nor the right-hand side contains a string op[FVx] such that Vx is not V and the operator op is either + or ⊔.

4.3. Coercion

Similarly, we can build the + operator as follows:

$+[FV] \Rightarrow \Sigma^*(\sqcup \cup +)[FV]\Sigma^*\ \_\ ,\ \ \_\ \Sigma^*(\sqcup \cup +)[FV]\Sigma^*$

Here, the statement implies that any presence of +[FV] is allowed only if the string also contains a similar [FV] somewhere to its left or right, where the operator is either + or ⊔.

4.4. Exclusion

The exclusion (−) operator is built similarly, as:

$-[FV] \Rightarrow \overline{\#\Sigma^*(\sqcup \cup +)[FV]\Sigma^*}\ \_\ \overline{\Sigma^*(\sqcup \cup +)[FV]\Sigma^*\#}$

This defines the language where an instance of some string −[FV], where F and V are features and values, respectively, is allowed only if surrounded by strings that do not contain [FV] with the operator either + or ⊔.
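The three co-occurrence operators can also be checked directly on the list of operator-feature-value triples collected from a word, without compiling automata. The following Python function is a minimal sketch under that assumption. The triple representation, the letter "u" standing in for the unification operator, and the name well_formed are illustrative choices made here, not part of the formalism or of any implementation described above.

```python
def well_formed(triples):
    """Check the co-occurrence constraints on one word.

    triples: list of (op, feature, value), one per feature annotation in
    the word, with op in {"+", "u", "-"} ("u" stands for unification).
    """
    for i, (op, feat, val) in enumerate(triples):
        # Positive annotations (+ or u) found elsewhere in the word.
        others = [(f, v) for j, (o, f, v) in enumerate(triples)
                  if j != i and o in ("+", "u")]
        if op == "u" and any(f == feat and v != val for f, v in others):
            return False  # unification: conflicting value for the same feature
        if op == "+" and (feat, val) not in others:
            return False  # coercion: the demanded pair is absent
        if op == "-" and (feat, val) in others:
            return False  # exclusion: the banned pair is present
    return True


embolden = [("+", "Circ", "emen"), ("u", "Circ", "emen")]  # em + bold + en
redden = [("u", "Circ", "emen")]                            # red + en
em_bold = [("+", "Circ", "emen")]                           # *em + bold
```

With these inputs, the circumfixed form and the bare suffix are accepted, while the prefix on its own is rejected because its coercion demand is never met.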

5. Order constraints

In order to address the compilation of the order constraints (<, > and ≪, ≫), one would have to make assumptions about exactly how the morphemes, features, values, and class labels are represented as automata. Supposing every morpheme is followed by its bundle of features, so that a word on the morphotactic level is represented as M1{Class}op[F1V1] . . . op[FnVn]M2{Class} . . ., where op is one of ⊔, +, −, the presence of a constraint Class1 ≪ Class2 can be represented as:

$\overline{\Sigma^*\{\mathrm{Class}_2\}\Sigma^*\{\mathrm{Class}_1\}\Sigma^*}$

that is, the language where no instance of the string {Class2} precedes {Class1}. The ≫ operator can be defined symmetrically.

The immediate precedence Class1 < Class2 can be defined as:

$\overline{\Sigma^*\{\mathrm{Class}_1\}\Sigma^*\{\Sigma^*\{\mathrm{Class}_2\}\Sigma^*}$

representing the language where no {Classn} string may intervene between a string {Class1} and {Class2}. Note that the brackets { and } are single symbols in Σ in the above.
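The two order constraints can be illustrated on a plain list of class labels, one per morpheme, read left to right. The sketch below assumes exactly that representation, and the function names are hypothetical. It mirrors the complement-language reading given above rather than any actual compilation into automata.

```python
def precedes(classes, c1, c2):
    """c1 ≪ c2: no occurrence of c2 may come before an occurrence of c1."""
    seen_c2 = False
    for c in classes:
        if c == c1 and seen_c2:
            return False
        if c == c2:
            seen_c2 = True
    return True


def immediately_precedes(classes, c1, c2):
    """c1 < c2: no class label may intervene between a c1 and a later c2."""
    for i, c in enumerate(classes):
        if c != c1:
            continue
        # Any c2 at distance two or more has at least one label in between.
        for j in range(i + 2, len(classes)):
            if classes[j] == c2:
                return False
    return True
```

For example, the label sequence Stems, LatinateSuffix, NativeSuffix satisfies LatinateSuffix ≪ NativeSuffix, while the sequence Stems, NativeSuffix, LatinateSuffix violates it.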

6. Conclusion

We have presented a formalism for specifying morphotactics that allows for separate description of morpheme order and morpheme co-occurrence. These are controlled by a small number of operators on features, or classes of morphemes. The order-related operators have the power to state that a class of morphemes must either precede, or immediately precede, some other class of morphemes, while the co-occurrence operators allow for unification of feature-value pairs, exclusion of feature-value pairs, or coercion, i.e. expression of a demand that some feature-value pair be present.

We have also sketched a way to implement the formalism as finite-state automata through first converting the notation into regular expressions, which can then be compiled into automata or transducers using standard methods.

References

Beesley, Kenneth and Lauri Karttunen. 2003. Finite-State Morphology. CSLI, Stanford.

Bickel, Balthasar, Goma Banjade, Martin Gaenszle, Elena Lieven, Netra Paudyal, Ichchha Purna Rai, Manoj Rai, Novel Kishor Rai, and Sabine Stoll. 2007. Free prefix ordering in Chintang. Language, 83.

Faltz, Leonard M. 1998. The Navajo Verb. University of New Mexico Press.

Hardman, M. J. 2001. Aymara: LINCOM Studies in Native American Linguistics. LINCOM Europa, München.

Hulden, Mans and Shannon T. Bischoff. 2007. An experiment in computational parsing of the Navajo verb. Coyote Papers: special issue dedicated to Navajo language studies, 16.

Kimball, Geoffrey D. 1991. Koasati Grammar. Univ. of Nebraska Press, London.

Koskenniemi, Kimmo. 1983. Two-level morphology: A general computational model for word-form recognition and production. Publication 11, University of Helsinki, Department of General Linguistics, Helsinki.

Mohanan, Karuvannur P. 1986. The theory of lexical phonology. Reidel, Dordrecht.

Yli-Jyrä, Anssi and Kimmo Koskenniemi. 2004. Compiling contextual restrictions on strings into finite-state automata. The Eindhoven FASTAR Days Proceedings.

Young, Robert W. 2000. The Navajo Verb System: An Overview. University of New Mexico Press, Albuquerque.

A note on the complexity of the recognition problem for the Minimalist Grammars with unbounded scrambling and barriers∗

Alexander Perekrestenko

Universidad Rovira i Virgili
Grupo de Investigación en Lingüística Matemática (Research Group on Mathematical Linguistics)
International PhD School in Formal Languages and Applications
Pl. Imperial Tarraco 1, 43005 - Tarragona

[email protected]

Resumen: Las Gramáticas Minimalistas fueron introducidas recientemente como un modelo para la descripción formal de la sintaxis de los lenguajes naturales. En este artículo, se investiga una extensión no local de este formalismo que permitiría la descripción del desplazamiento optativo ilimitado de constituyentes sintácticos (scrambling), un fenómeno que existe en muchos idiomas y presenta dificultades para la descripción formal. Se establece que la extensión de las Gramáticas Minimalistas con scrambling sin la llamada condición del movimiento más corto (shortest-move constraint, SMC) y con barreras hace que el problema de reconocimiento para el formalismo resultante pertenezca a la clase NP-hard de la complejidad computacional.
Palabras clave: Sintaxis, análisis sintáctico, Gramáticas Minimalistas, orden de palabras, scrambling, complejidad computacional, lenguajes formales

Abstract: Minimalist Grammars were proposed recently as a model for the formal description of natural-language syntax. This paper explores a nonlocal extension to this formalism that would make it possible to describe unbounded scrambling, a descriptionally problematic syntactic phenomenon attested in many languages. It is shown that extending Minimalist Grammars with scrambling without the shortest-move constraint (SMC) and with barriers makes the recognition problem for the resulting formalism NP-hard.
Keywords: Syntax, parsing, Minimalist Grammars, word order, scrambling, computational complexity, formal languages

1 Introduction

The formalization of natural-language syntax is important from both the theoretical and the practical point of view. It allows us to check the feasibility of existing syntactic theories as models of how we process language, and it provides a framework for creating practical applications: grammars and parsing systems. In the formalization of natural-language syntax, the following classes of grammars usually come into consideration.

Right-linear (regular) grammars. These grammars can only be used for so-called shallow parsing since their capacity to assign structural descriptions to sentences is too limited.

Context-free grammars. While these grammars can describe a large part of natural-language syntax in the weak sense, they fail to assign appropriate structural descriptions to sentences containing discontinuous constituents.

Mildly context-sensitive formalisms. Mildly context-sensitive grammars (MCSG) were proposed as a mathematical model of natural-language syntax that would be only as powerful as necessary for the correct description of the existing syntactic phenomena. The mildly context-sensitive formalisms best explored today are Tree-adjoining Grammars (TAGs) and Minimalist Grammars (MGs).

∗ This research work has been partially supported by the Russian Foundation for Humanities as a part of the project “The typology of free word order languages” (grant RGNF 06-04-00203a). The author would also like to express his utmost gratitude to the head of the Research Group on Mathematical Linguistics of the Rovira i Virgili University, Prof. Carlos Martín Vide, for his encouragement and advice.



Computationally unrestricted formalisms. Unification-based syntactic theories with unrestricted structure sharing, such as Head-driven Phrase Structure Grammar (HPSG), strictly speaking do not belong to the class of restricted grammars since they are based on unification formalisms which are Turing-equivalent. The problem of the computational universality of the formalism itself is here solved with the design of grammars that do not exploit the full power of the formalism.

Whatever the grammar or the class of formalisms, it is crucially important for it to allow parsing in deterministic polynomial time in the length of the input, for otherwise its high computational complexity (or incomputability) would disqualify it both as a feasible mathematical model of human language competence and as a technically applicable framework.

2 Linguistic data

One of the most problematic phenomena for the formalization of natural-language syntax is so-called scrambling, which is a non-obligatory reordering of syntactic constituents. Originally, the term scrambling was used to denote the argument permutation observed in the so-called middle field (Mittelfeld) in German. This phenomenon occurs in many other languages as well, for example, in Japanese, Russian, Turkish, etc. The descriptionally most problematic class of this phenomenon is the so-called unbounded scrambling, where the permuting arguments belong to different verbal heads. In this kind of scrambling, a linear reordering of the arguments leads to their displacement from the embedded infinitival clauses into the matrix clause. Since in theory there is no limit on the depth of the infinitival clause embedding, we can have any number of verbal heads with the arguments “jumping up” to the embedding clauses from an arbitrarily deeply embedded infinitival clause, as shown in the example below (all the sentences of this example mean ‘. . . that no-one has tried to promise the customer to repair the refrigerator’):1

. . . dass niemand [[dem Kunden] [[den Kühlschrank] zu reparieren] zu versprechen] versucht hat;
. . . dass niemand [den Kühlschrank]_i [[dem Kunden] [t_i zu reparieren] zu versprechen] versucht hat;
. . . dass [den Kühlschrank]_i niemand [[dem Kunden] [t_i zu reparieren] zu versprechen] versucht hat;
. . . dass [dem Kunden]_j niemand [t_j [[den Kühlschrank] zu reparieren] zu versprechen] versucht hat;
. . . dass [den Kühlschrank]_i [dem Kunden]_j niemand [t_j [t_i zu reparieren] zu versprechen] versucht hat;
. . . dass [dem Kunden]_j [den Kühlschrank]_i niemand [t_j [t_i zu reparieren] zu versprechen] versucht hat.

1 The sentences are based on the examples from German in (Rambow, 1994).

The string language of scrambled sentences can be seen as { n^i v^i | n, v ∈ Σ, i > 0 }, which is context-free. But what matters from the linguistic point of view is not so much the generated language as such, but rather the grammar’s capacity to assign linguistically correct structural descriptions to the sentences with scrambling. In (Becker, Rambow, and Niv, 1992) it was proved that unbounded scrambling cannot be derived by linear context-free rewriting systems (LCFRS) and, as a consequence, it cannot be derived by set-local multi-component tree-adjoining grammars (slMCTAG) either.

An important aspect of unbounded scrambling is that there are some syntactic categories, called barriers, beyond which no constituents can scramble. For German it is a tensed clause, for example.

Nonlocal vector TAGs with dominance links and integrity constraints (VTAG-Δ), introduced in (Rambow, 1994), are the only known TAG-based formalism that allows a generalized description of scrambling and is polynomially parsable if some restrictions external to the formalism itself are imposed on the derivation. In its lexicalized version these restrictions are satisfied as a consequence of the lexicalization. Other nonlocal versions of TAGs do not have acceptable computational properties. For example, the word recognition problem for nonlocal MCTAGs with such linguistically meaningful restrictions as lexicalization, limiting the number of trees in each tree set to two, and imposing dominance links on the trees belonging to one set is NP-complete (Champollion, 2007). This shows that nonlocality, which seems to be necessary for the adequate description of unbounded scrambling, is generally very dangerous for the computational properties of the formalism.

Another mildly context-sensitive formalism widely studied in the last ten years is that of Minimalist Grammars (MG), introduced in (Stabler, 1997) as a formalization of some central aspects of the structure-building component of the Minimalist Program, an approach to the description of syntax proposed in (Chomsky, 1995). In this formalism, discontinuous constituents are described as a result of the displacement of a part of a constituent into some other position in the tree. MGs are weakly equivalent to set-local MCTAGs. In MGs the locality is represented as the shortest-move constraint (SMC) forbidding competitive displacement of constituents. Lifting this constraint badly affects the computational properties of the formalism: for example, canceling the SMC but preserving the specifier island constraint (SPIC), which prohibits movement from within specifiers, produces a Turing-equivalent formalism (Kobele and Michaelis, 2005). In (Frey and Gärtner, 2002), a scrambling operator was introduced for MG, but it was restricted by the SMC, which made the generalized scrambling description impossible.

In the present paper we show that extending an MG with unbounded scrambling (i.e., scrambling without SMC) and with barriers, an analogue to the integrity constraints in VTAG-Δ, makes the recognition problem for the resulting formalism NP-hard.

3 MGs with unbounded scrambling and barriers

Below we give a definition of unrestricted Minimalist Grammars with unbounded scrambling and barriers, which is based on the original definition of MG in (Stabler, 1997) and (Michaelis, 2001).

Definition 1 (MG^scr_B) An unrestricted Minimalist Grammar with unbounded scrambling and barriers, MG^scr_B, is a tuple G = 〈NonSyn, Syn, c, |, Lex, Ω〉, such that

• NonSyn is a finite set of non-syntactic features partitioned into a set of phonetic (Phon) and semantic (Sem) features.

• Syn is a finite set of syntactic features disjoint from NonSyn and partitioned into
  – a set B = { n, v, d, c, t, . . . } of base (syntactic) categories,
  – a set of abstract features, A = { case, num, pers, . . . },
  – a set of merge selectors, M = { =x | x ∈ B },
  – a set of move licensees, E = { −f | f ∈ A },
  – a set of move licensors, R = { +f | f ∈ A },
  – a set of scramble licensees, S = { ∼x | x ∈ B },
  – a set of barrier markers, I = { −x | x ∈ B }.

• c is a distinguished element of B, the completeness category.

• ‘|’ is a special symbol (a bar).

• Lex is a lexicon: a finite set of simple expressions (see Definition 2) over NonSyn ∪ Syn, each of which is of the form τ = 〈Nτ, ◁∗, ≺, <, labelτ〉, with Nτ = {ε}.

• Ω is the set of the structure-building operations ‘merge’, ‘move’ and ‘scramble’.

In what follows, by [< a, b ] we will denote a binary tree consisting of the nodes a and b in this very linear order, where the node a is the head of (“projects over”) the structure represented by this tree, so that the expression associated with the tree is the same as the one associated with its head node. In the same way, by [> c, b ] we will denote a binary tree consisting of the nodes c and b in this very linear order, where the node b is the head of the structure represented by the tree:

[< a, b ]        [> c, b ]

A node represented by a single letter will be called a simple node. All nodes in the above examples are simple ones. If a node represents a subtree, it will be called a complex node, as in the following example, where b in the tree [< a, b ] is a complex node since it represents its subtree [> c, b ]:

[< a, [> c, b ] ]

The argument position to the right of a head node is called the complement position. Positions to the left of a head node, over which this node projects, are referred to as specifier positions. The maximal projection of a node a in a given tree is the maximal subtree headed by this node.

Definition 2 (Expression) An expression is a finite, binary, labeled ordered tree τ = (Nτ, ◁∗, ≺, <, labelτ), where
Nτ is the set of nodes;
◁ is the dominance relation between nodes;
≺ is the precedence relation between nodes;
< is the projection relation between nodes;
labelτ is the leaf-labeling function mapping the leaves of the tree onto an element from

{M∗ R? B E? S? Phon∗ Sem∗ | } ∪
{M∗ R? B^{−I} E? S? Phon∗ Sem∗ | } ∪
{E? S? Phon∗ Sem∗ | B} ∪
{E? S? Phon∗ Sem∗ | B^{−I}}

as introduced in the definition of MG^scr_B.

An expression is called complex if it has more than one node; otherwise it is called simple.

An expression τ over Syn ∪ NonSyn is called well-labeled if each leaf of τ is a string from Syn∗Phon∗Sem∗(|(B + B^{−I}))?. The label of a complex expression is that of its head leaf.

The phonetic yield of an expression is the concatenation of the phonetic yields of its subexpressions.

We will say that the expression e = f1 f2 . . . fn−1 | fn, where f1, f2, . . . , fn are features, has or contains these features and displays feature f1. We will say that a syntactic feature f is canceled from the expression e if it is removed from it. We will also say that a syntactic feature f is hidden in the expression e if it is moved to the right of the bar symbol in this expression. To make the notation shorter, we will omit the bar symbol if there are no features behind it.

Now we will define the structure-building operations with their domains.

Definition 3 (merge domain) Dom(merge) = { 〈τ0, τ〉 | τ0 and τ are well-labeled expressions, τ0 displays category x, and τ displays feature =x }.

Definition 4 (merge operator) merge(τ0, τ) = [< τ′, τ′0 ], such that
τ is a simple node displaying feature =x,
τ0 displays category x,
τ′ is like τ except that =x is canceled,
τ′0 is like τ0 except that x is hidden;
and merge(τ0, τ) = [> τ′0, τ′ ], such that
τ is a complex node that displays feature =x,
τ0 displays category x,
τ′ is like τ except that =x is canceled,
τ′0 is like τ0 except that x is hidden.

As an example of merge we will consider the derivation of the sentence John likes beer.
Lexicon: =d.=d.v.likes; d.John; d.beer
Derivation:
Step 1: =d.=d.v.likes + d.beer ⇒ [< =d.v.likes, beer|d ]
Step 2: [< =d.v.likes, beer|d ] + d.John ⇒ [> John|d, [< v.likes, beer|d ] ]
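The merge steps of this derivation can be replayed with a small Python sketch. It is only an illustration under simplifying assumptions: leaves carry a list of features before the bar, a phonetic string, and a list of hidden features behind the bar. The class and function names (Leaf, Node, head, merge, phonetic_yield) are hypothetical and chosen here. The argument order of merge follows the way the example writes its steps (selecting expression first), not the pair order of Definition 3.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Leaf:
    features: List[str]                 # syntactic features left of the bar
    phon: str = ""                      # phonetic content
    hidden: List[str] = field(default_factory=list)  # features right of the bar


@dataclass
class Node:
    head_left: bool                     # True for '<', False for '>'
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def head(t):
    """Follow the projection arrows down to the head leaf."""
    while isinstance(t, Node):
        t = t.left if t.head_left else t.right
    return t


def merge(selector, selectee):
    """selector displays =x, selectee displays x; inputs are left unchanged."""
    s, e = copy.deepcopy(selector), copy.deepcopy(selectee)
    hs, he = head(s), head(e)
    if not hs.features or not he.features or hs.features[0] != "=" + he.features[0]:
        raise ValueError("pair is outside Dom(merge)")
    hs.features.pop(0)                    # cancel =x on the selector
    he.hidden.append(he.features.pop(0))  # hide x behind the bar
    if isinstance(selector, Leaf):
        return Node(True, s, e)           # simple selector: [< selector', selectee']
    return Node(False, e, s)              # complex selector: [> selectee', selector']


def phonetic_yield(t):
    if isinstance(t, Leaf):
        return [t.phon] if t.phon else []
    return phonetic_yield(t.left) + phonetic_yield(t.right)


likes = Leaf(["=d", "=d", "v"], "likes")
step1 = merge(likes, Leaf(["d"], "beer"))   # [< =d.v.likes, beer|d ]
step2 = merge(step1, Leaf(["d"], "John"))   # [> John|d, [< v.likes, beer|d ] ]
```

After the second step the head leaf displays only the category v, and reading the leaves left to right gives the words John, likes, beer.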

Definition 5 (move domain) Dom(move) = { τ | τ is a well-labeled expression that displays feature +x and contains exactly one maximal projection τ0 displaying feature −x }.2

Definition 6 (move operator) move(τ) = [> τ′0, τ′ ], such that
τ displays feature +x,
τ0 is a proper subtree of τ displaying feature −x,
τ′0 is like τ0 except that −x is canceled, and
τ′ is like τ except that +x is canceled and the subtree τ0 is replaced by an empty leaf.

The operator move is illustrated below in the derivation of the subordinate clause what John likes from John likes what within the sentence she wonders what John likes.
Lexicon: =d.=d.v.likes; d.John; d.−wh.what; =v.+wh.c
Derivation:
[< +wh.c, [> John|d, [< v.likes, −wh.what|d ] ] ] ⇒ [> what|d, [< c, [> John|d, [< v.likes, λ ] ] ] ]

We say that a maximal projection τ′ is a barrier between the maximal projections τ and τ0 if τ0 is a proper subtree of τ′, τ′ is a proper subtree of τ, τ0 has the basic category b, and τ′ contains the barrier marker −b.

2 The restriction that τ cannot contain more than one movement candidate is the shortest-move condition, as it is used in MG.

Definition 7 (scrambling domain) Dom(scramble) = { τ | τ is a well-labeled expression that displays category x and contains at least one maximal projection τ0 displaying feature ∼x, and there is no barrier between τ and τ0 }.

Definition 8 (scrambling operator) scramble(τ) = [> τ′0, τ′ ], such that
τ displays category x,
τ0 is a proper subtree of τ displaying feature ∼x and there is no barrier between τ and τ0,
τ′0 is like τ0 except that ∼x is canceled,
τ′ is like τ except that the subtree τ0 is replaced with an empty leaf.

The scrambling so defined operates nondeterministically in the sense that it can displace any appropriate constituent. The difference between scrambling and movement is that scrambling is optional, that it allows a competitive displacement of constituents since it is not restricted by the SMC, and that it can be blocked by a barrier.

Definition 9 (Language of an MG^scr_B) The language L generated by an MG^scr_B G is the set of the phonetic yields of the expressions produced from the lexical entries by applying (some of) the structure-building operations, such that these expressions display the completeness category c and neither they themselves nor their subexpressions contain move licensees and move licensors (i.e., all movements have been performed).

4 MG^scr_B is NP-hard

4.1 Some preliminaries

A problem X is NP-hard if and only if anNP-complete problem N can be transformed(“reduced”) to X in polynomial time in sucha way that a (hypothetical) polynomial-timealgorithm solving X could also be used tosolve N in polynomial time.

For a language L, we will denote by L() the word recognition problem for L. Let L, L1 and L2 be languages such that L = L1 ∪ L2 and L1 ∩ L2 = ∅. Let p(w) be a polynomial-time computable function such that for any w ∈ L it returns true if w ∈ L1 and false otherwise. (For a w ∉ L, it can return either true or false.) We will need the following proposition:

Proposition 1 If L1() is NP-hard, then L()is also NP-hard.
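The reasoning behind Proposition 1 can be spelled out in one line. The following is a sketch under the definitions just given, where $A_L$ denotes a hypothetical polynomial-time decision procedure for L() (a name introduced here only for illustration):

```latex
\[
  w \in L_1
  \iff
  \bigl( A_L(w) = \mathit{true} \bigr) \wedge \bigl( p(w) = \mathit{true} \bigr)
\]
```

The left-to-right direction holds because L1 ⊆ L and p returns true on every word of L1. The converse holds because p returns false on every word of L that lies outside L1. Since both $A_L$ and p run in polynomial time, L1() would then be solvable in polynomial time as well, so any NP-complete problem that reduces to L1() can also be solved through L().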

4.2 The idea of the proof

The NP-hardness of the word recognition problem for MG^scr_B will be proved by constructing a grammar G ∈ MG^scr_B that generates a language L = L1 ∪ L2, L1 ∩ L2 = ∅, where L1() represents a known NP-complete problem, i.e., it is NP-hard, and the question whether a word w ∈ L belongs to L1 or to L2 can be resolved in deterministic polynomial time. In the proof we will use the 3-Partition Problem, which is known to be (strongly) NP-complete:

Given a set of 3k natural numbers {n1, n2, . . . , n3k} and a constant m, decide whether this set can be partitioned into k subsets of cardinality 3, each of which sums up to m.

This problem can be described as a language

L3P = { b^m a x^{n1} a x^{n2} . . . a x^{n3k} | a, b, x ∈ Σ }

such that it consists of all the words for which 〈m, n1, n2, . . . , n3k〉 represents an instance of the problem that has a solution. The word recognition problem for this language is NP-hard.3
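To make the problem and its unary word encoding concrete, the sketch below gives an exhaustive (exponential-time) solver and an encoder. Both are illustrative only: the function names are chosen here, and the solver is plain backtracking, not an algorithm taken from the paper or its sources.

```python
from itertools import combinations


def three_partition(numbers, m):
    """Decide by exhaustive search whether numbers split into triples summing to m."""
    nums = sorted(numbers)
    k, rem = divmod(len(nums), 3)
    if rem != 0 or sum(nums) != k * m:
        return False

    def solve(remaining):
        if not remaining:
            return True
        # The smallest remaining number must belong to some triple.
        first, rest = remaining[0], remaining[1:]
        for i, j in combinations(range(len(rest)), 2):
            if first + rest[i] + rest[j] == m:
                left = [x for idx, x in enumerate(rest) if idx not in (i, j)]
                if solve(left):
                    return True
        return False

    return solve(nums)


def encode(m, numbers):
    """Unary word b^m a x^{n1} a x^{n2} ... for the instance <m, n1, n2, ...>."""
    return "b" * m + "".join("a" + "x" * n for n in numbers)
```

A word produced by encode belongs to the language of the problem exactly when three_partition returns True for the same instance. The length of the word is linear in m plus the sum of the numbers, which is why the strong NP-completeness of the problem matters for the unary encoding.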

In MG^scr_B, scrambling allows syntactic constituents to move to the left in a competitive manner, while barriers set boundaries beyond which these constituents cannot move. This fact can be used to derive a language L^scr_B containing L3P such that for any word w ∈ L^scr_B it can be decided in deterministic polynomial time whether w ∈ L3P or not.

4.3 Proving NP-hardness

Let G = 〈NonSyn, Syn, p, |, Lex, Ω〉 be an MG^scr_B where

• Phon = {a, b, c, d}, Sem = ∅,

• A = { f }, and

• B = { a1, a2, a3, a′1, a′2, a′3, a′′1, a′′2, a′′3, b, b′, b0, c1, c2, c3, c′1, c′2, c′3, c′′1, c′′2, c′′3, d1, d2, d3, d′1, d′2, d′3, d′′1, d′′2, d′′3, e, g, s, p }.

The lexicon of the grammar, Lex, consists of the following entries (organized into groups according to which part of the structure they generate):

1. (a) =c′′3 . a′′3 . ∼s . a; =d′′3 . =b0 . c′′3 . c; =c′′3 . =b . d′′3 . d; =e . =b . d′′3 . d; e;
   (b) =c′′2 . a′′2 . ∼s . a; =d′′2 . =b0 . c′′2 . c; =c′′2 . =b . d′′2 . d; =a′′3 . =b . d′′2 . d;
   (c) =c′′1 . a′′1 . ∼s . a; =d′′1 . =b0 . c′′1 . c; =c′′1 . =b . d′′1 . d; =a′′2 . =b . d′′1 . d;
2. (a) =c3 . a3 . ∼s . a; =d3 . c3 . c; =c3 . =b′ . d3 . d; =a′1 . =b′ . d3 . d; =a′′1 . =b′ . d3 . d;
   (b) =c2 . a2 . ∼s . a; =d2 . c2 . c; =c2 . =b′ . d2 . d; =a3 . =b′ . d2 . d;
   (c) =c1 . a1^{−b} . ∼s . a; =d1 . c1 . c; =c1 . =b′ . d1 . d; =a2 . =b′ . d1 . d;
3. (a) =c′3 . a′3 . ∼s . a; =d′3 . c′3 . c; =c′3 . =b . d′3 . d; =a1 . =b . d′3 . d;
   (b) =c′2 . a′2 . ∼s . a; =d′2 . c′2 . c; =c′2 . =b . d′2 . d; =a′3 . =b . d′2 . d;
   (c) =c′1 . a′1^{−b′} . ∼s . a; =d′1 . c′1 . c; =c′1 . =b . d′1 . d; =a′2 . =b . d′1 . d;
4. =a1 . g . −f; =a′1 . g . −f; =a′′1 . g . −f;
5. =g . s;
6. =s . +f . p;
7. b . ∼c . b; b′ . ∼c′ . b; b . ∼g . b; b′ . ∼g . b; b0 . b

3 A language representation of the 3-Partition Problem was also used in (Champollion, 2007) to prove NP-hardness for a restricted version of nonlocal MCTAGs. It should be mentioned, though, that the relationship between nonlocal MCTAGs and MG^scr_B is not known, so we cannot apply the complexity result for nonlocal MCTAGs to MG^scr_B.

Proposition 2 The language L generated by the grammar G is a union of two disjoint languages, L = L3p ∪ L′, L3p ∩ L′ = ∅, such that L3p consists of all the words

b^m a(bcd)^{n1} a(bcd)^{n2} . . . a(bcd)^{n3k}

with a, b, c, d ∈ Σ, where 〈m, n1, n2, . . . , n3k〉 is an instance of the 3-Partition Problem, as described above, and there exists a polynomial-time computable function p(w) such that for any word w ∈ L it returns true if w ∈ L3p and false otherwise; for w ∉ L it returns either true or false.

We will prove Proposition 2 by following the bottom-up derivation of the language L. In the illustrations below, the symbols used in the tree structures are base category symbols.4 The derivation starts at step 1.

4 In the grammar G, the lexical entries are made in such a way that the phonetic (i.e., terminal) symbols can be obtained by stripping the base category symbols of indices and bars (except for the zero-yield entries headed by e, g, s and p).

Step 1. The derivation begins with the lexical entries (1a) generating the following (sub)tree:

[< a′′3, [> b0, [< c′′3, [> b, [< d′′3, [> b0, [< c′′3, [> b, [< d′′3, . . . [> b0, [< c′′3, [> b, [< d′′3, e ]]]] . . . ]]]]]]]]]

The yield of this subtree is a(bcbd)+. Each b located immediately between a c and a d (underlined in the original figure; these are the nodes of base category b rather than b0) is licensed for scrambling to a specifier position of a c or g introduced at a later point in the derivation, since every such b has the scrambling licensee ∼c or ∼g. The whole a′′3-headed subtree is licensed for scrambling to the s node to be introduced at a later point in the derivation, since the a′′3 node has the scrambling licensee ∼s.

After that, subtrees headed by a′′2 and a′′1 are generated by the entries (1b) and (1c) respectively. The generation proceeds in the same way as in the case of the a′′3 subtree; the b nodes are licensed for scrambling to c or g, and the a′′2 and a′′1 subtrees are themselves licensed for scrambling to s:

[< a′′1 (b0 c′′1 b d′′1)+ [< a′′2 (b0 c′′2 b d′′2)+ [< a′′3 (b0 c′′3 b d′′3)+ ]]]

The phonetic yield generated at this point is a(bcbd)+a(bcbd)+a(bcbd)+. The derivation continues to step 2 or 4.

Step 2. Analogously to the previously performed step, subtrees headed by a3, a2 and a1 are generated by the entries (2a), (2b) and (2c) respectively. All of them are licensed for scrambling to s. The b′ nodes inside these subtrees are licensed for scrambling to c′ or g. Some of the b nodes introduced in the previously performed step (this restriction is provided by barriers) scramble to some of the c nodes introduced at the present step. In the schema below, [b] marks such a scrambled b (boxed in the original figure):

[< a1^{−b} ([b] c1 b′ d1)+ [< a2 ([b] c2 b′ d2)+ [< a3 ([b] c3 b′ d3)+ [< a′1 or a′′1 . . . ]]]]

The derivation continues to step 3 or 4.

Step 3. Analogously to the previously performed step, subtrees headed by a′3, a′2 and a′1 are generated by the entries (3a), (3b) and (3c) respectively. All of them are licensed for scrambling to s. The b nodes inside these subtrees are licensed for scrambling to c or g. Some of the b′ nodes introduced in the previously performed step (this restriction is provided by barriers) scramble to some of the c′ nodes introduced at the present step. In the schema below, [b′] marks such a scrambled b′ (boxed in the original figure):

[< a′1^{−b′} ([b′] c′1 b d′1)+ [< a′2 ([b′] c′2 b d′2)+ [< a′3 ([b′] c′3 b d′3)+ [< a1 . . . ]]]]

The derivation continues to step 2 or 4.

Step 4. A subtree headed by g is generated by the entries (4). The g head takes as its complement a′1 or a′′1 (1), or a1 (2). It is licensed for movement to p. Some of the b′ or b nodes introduced in the previously performed step (this restriction is provided by barriers) scramble to g:

(1) [> b, [> b, . . . [> b, [< g, [< a′1 or a′′1 . . . ]]] . . . ]]

(2) [> b′, [> b′, . . . [> b′, [< g, [< a1 . . . ]]] . . . ]]

The derivation continues to step 5.

Step 5. A subtree headed by s is generated by the entry (5). The s head takes g as its complement. Further, some a subtrees generated at previous steps scramble to s. In the schema below, each A stands for a subtree headed by one of a1, a′1, a′′1, a2, a′2, a′′2, a3, a′3 or a′′3:

[> A, [> A, . . . [> A, [< s, g ]] . . . ]]

The derivation continues to step 6.

Step 6. A subtree headed by p is generated by the entry (6). The p head takes s as its complement. Further, the g subtree generated at a previous step is moved to the specifier position of p:

[> g, [< p, s ]]

The language generated by this grammar, L, is the union of two languages, L = L′1 ∪ L1, such that L′1 consists of all the words produced with all b and b′ nodes having scrambled and each c and c′ head having accepted exactly one scrambling b or b′ node, and L1 contains the rest of the words. The language L′1 consists of all the words

b^m a(bcd)^{n1} a(bcd)^{n2} . . . a(bcd)^{n3k}

where k and m are positive natural numbers and the multiset {n1, n2, . . . , n3k} can be partitioned into k multisets of cardinality 3, each of which sums to m. This will be explained following the generation of the words of the language. On the yield level, each “a-triple” a([b]cbd)+a([b]cbd)+a([b]cbd)+ generated at step (2) or (3) receives the scrambling symbols b from the neighbouring a-triple on the right (these symbols, written [b], are depicted in squares in the original) generated during the previous step, and later “gives away” through scrambling to the neighbouring left a-triple the symbols b located between c and d (underlined in the original). Barriers guarantee that these symbols can only scramble to the adjacent triple. The symbols b scrambling from the leftmost a-triple are stored as a “counter” at step (4). In case all b and b′ symbols have scrambled and each c and c′ head has received through scrambling exactly one b or b′, all a-triples will contain an equal number of bcd subwords, while the number of these subwords in each a(bcd)+ member of one and the same a-triple may vary. The “counter” will consist of as many symbols b as there are bcd subwords in each a-triple. At step (5), all the a(bcd)+ members of the a-triples are permuted arbitrarily, whereafter the “counter subword” is moved to the left at step (6).

Each word in L1 contains at least one of the following subwords in the part of the word that starts at the leftmost occurrence of a: bb (more than one b has scrambled to the same c head), ac or dc (omission of scrambling to a particular c head), or cb (a b has not scrambled), while no word in L′1 contains any of them in that part. This means that L′1 ∩ L1 = ∅, and there exists a polynomial-time computable function p(w) such that for any w ∈ L, p(w) = true if w ∈ L′1 and p(w) = false otherwise. For a w ∉ L, it will return true or false.
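The function p(w) described here amounts to a scan for four forbidden subwords in the part of the word that begins at the leftmost a. A minimal Python sketch follows. The name p comes from the text, while the constant FORBIDDEN and the choice to return False for words without any a are assumptions made for illustration (the value on words outside L is left open above).

```python
FORBIDDEN = ("bb", "ac", "dc", "cb")


def p(w):
    """Return True if no forbidden subword occurs from the leftmost 'a' onwards."""
    start = w.find("a")
    if start == -1:
        return False  # arbitrary choice: such a word is not in L anyway
    tail = w[start:]
    return not any(sub in tail for sub in FORBIDDEN)
```

Each subword test is a linear scan, so p runs in time linear in the length of w, which is well within the polynomial bound required by the argument.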

The language L′_1 can be seen as a union of two languages, L′_1 = L_2 ∪ L_3, such that {n_1, n_2, ..., n_{3k}} is a proper multiset for L_2 (i.e., it contains repeated elements) and a set for L_3. This means that L_2 ∩ L_3 = ∅, and, since the problem of whether a given multiset is a proper multiset or a set can be solved in deterministic polynomial time, there exists a polynomial-time computable function q(w) such that for any w ∈ L′_1, q(w) = true if w ∈ L_3 and q(w) = false otherwise. For w ∉ L′_1, it may return either true or false.

The language L_3 constitutes the unary encoding of the 3-Partition Problem⁵, whereby we have proved Proposition 2, which together with Proposition 1 gives us the following result:

Proposition 3 The word recognition problem for MG^scr_B is NP-hard.

⁵ Without loss of generality we consider only positive natural numbers and assume k ≥ 1.
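To make the decision problem behind the reduction concrete, the following is a minimal sketch of a brute-force 3-Partition check together with the set versus proper multiset test used to separate L_3 from L_2. The function names `three_partition` and `is_set` are illustrative and not taken from the paper; the search is exponential in the worst case, which is consistent with the hardness result rather than a practical algorithm.

```python
from itertools import combinations

def three_partition(numbers, m):
    """Return True if `numbers` can be split into triples that each sum to m."""
    if len(numbers) % 3 != 0:
        return False
    if not numbers:
        return True  # base case of the recursion: every element has been placed
    first, rest = numbers[0], numbers[1:]
    # The first element must belong to some triple; try every pair of partners.
    for i, j in combinations(range(len(rest)), 2):
        if first + rest[i] + rest[j] == m:
            remaining = [x for idx, x in enumerate(rest) if idx not in (i, j)]
            if three_partition(remaining, m):
                return True
    return False

def is_set(numbers):
    """True if no element is repeated (a set rather than a proper multiset)."""
    return len(set(numbers)) == len(numbers)
```

For instance, `three_partition([1, 2, 9, 3, 4, 5], 12)` holds through the triples {1, 2, 9} and {3, 4, 5}, and the same input passes `is_set` because it has no repeated elements.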

5 Conclusions

Since the recognition problem for MG^scr_B is NP-hard, a generalized description of scrambling is probably impossible in MG, at least if it is implemented in a straightforward way. On the other hand, MGs can provide a convenient framework for the practical implementation of some important results obtainable within the Minimalist Program. For this reason, further study of the proposed MG extensions is important, since a solution to the scrambling problem could make MGs a powerful formal language tool for grammar engineering. Additionally, it could provide insights into possible ways to tackle the nonlocality problem in this class of formalisms.

References

Becker, T., O. Rambow, and M. Niv. 1992. The Derivational Generative Power of Formal Systems or Scrambling is Beyond LCFRS. Technical Report IRCS-92-38, University of Pennsylvania, USA.

Champollion, L. 2007. Lexicalized non-local MCTAG with dominance links is NP-complete. In Proceedings of Mathematics of Language 10. To appear.

Chomsky, N. 1995. The Minimalist Program. The MIT Press, Cambridge, USA.

Frey, W. and H.-M. Gärtner. 2002. On the Treatment of Scrambling and Adjunction in Minimalist Grammars. In G. Jäger, P. Monachesi, G. Penn, and S. Wintner, editors, Proceedings of Formal Grammar 2002, pages 41–52, Trento, Italy.

Kobele, G. M. and J. Michaelis. 2005. Two Type 0-Variants of Minimalist Grammars. In Proceedings of the 10th conference on Formal Grammar and the 9th Meeting on Mathematics of Language, Edinburgh, Scotland.

Michaelis, J. 2001. On Formal Properties of Minimalist Grammars. Ph.D. thesis, Potsdam University, Germany.

Rambow, O. 1994. Formal and Computational Aspects of Natural Language Syntax. Ph.D. thesis, University of Pennsylvania, USA.

Stabler, E. 1997. Derivational minimalism. In Christian Retoré, editor, Logical Aspects of Computational Linguistics. Springer, pages 68–95.



Búsqueda de Respuestas

Paraphrase Extraction from Validated Question Answering Corpora in Spanish∗

Jesús Herrera, Anselmo Peñas, Felisa Verdejo
Departamento de Lenguajes y Sistemas Informáticos
Universidad Nacional de Educación a Distancia
C/ Juan del Rosal, 16, E-28040 Madrid

{jesus.herrera, anselmo, felisa}@lsi.uned.es

Resumen: Partiendo del debate sobre la definición de paráfrasis, este trabajo intenta clarificar lo que las personas consideran como paráfrasis. El experimento realizado parte de una de las distintas campañas que generan cada año grandes cantidades de datos validados, susceptibles de ser reutilizados con diferentes fines. En este artículo se describe con detalle un método simple –fundamentado en reconocimiento de patrones y operaciones de inserción y eliminación–, capaz de extraer una importante cantidad de paráfrasis de corpora de Pregunta–Respuesta evaluados. Se muestra además la evaluación realizada por expertos del corpus obtenido. Este trabajo ha sido realizado para el español.
Palabras clave: Extracción de paráfrasis, corpus de Pregunta–Respuesta, definición de paráfrasis

Abstract: Building on the debate around the definition of paraphrase, this work aims to clarify empirically what humans consider a paraphrase. The experiment has its starting point in one of the several campaigns that every year generate large amounts of validated textual data, which can be reused for different purposes. This paper describes in detail a simple method, based on pattern matching and deletion and insertion operations, that is able to extract a remarkable number of paraphrases from assessed Question Answering corpora. The corpus obtained was assessed by experts, and an analysis of this process is presented. This work has been developed for Spanish.
Keywords: Paraphrase extraction, Question Answering corpus, paraphrase definition

1 Introduction

The main idea of the present work is that, although several definitions of the concept of paraphrase have already been proposed, it is still important to determine what humans understand when they are asked to evaluate whether a pair of statements is related by a paraphrase relationship. For this purpose, it was decided to obtain a corpus containing pairs of statements that could be paraphrases; these pairs were assessed by experts in order to determine whether there was, effectively, a paraphrase relationship between them. In addition, it was considered that some corpora could successfully be reused in order to automatically extract these pairs of candidates for paraphrase. The corpus used was the corpus of assessed answers (in Spanish) from the Question Answering (QA) exercise proposed in the 2006 edition of the Cross Language Evaluation Forum (CLEF). The experiment suggests that with such a corpus it is viable to obtain a large number of paraphrases with a simple, fully automated process. Only shallow techniques were applied throughout this work for this first approach. This method adds to the set of proposals for paraphrase acquisition made so far, for example: (Barzilay and McKeown, 2001) and (Pang et al., 2003) used text alignment in different ways to obtain paraphrases; (Lin and Pantel, 2001) used mutual information of word distribution to calculate the similarity of expressions; (Ravichandran and Hovy, 2002) used pairs of questions and answers to obtain varied patterns which give the same answer; and (Shinyama et al., 2002) obtained paraphrases by means of named entities found in different news articles reporting the same event.

∗ We are very grateful to Sadi Amro Rodríguez, Monica Duran Manas and Rosa García–Gasco Villarrubia for their contribution by assessing the paraphrase corpus. We would also like to thank Claudia Toda Castan for revising this text. This work has been partially supported by the Spanish Ministry of Science and Technology within the project R2D2–SyEMBRA (TIC–2003–07158–C04–02), and by the Regional Government of Madrid under the auspices of MAVIR Research Network (S–0505/TIC–0267).

In section 2 an overview of the experiment is given. Section 3 describes all the steps carried out in order to transform the multilingual source corpus into a monolingual corpus of paraphrase candidates, ready to be assessed. Section 4 describes the activity carried out by the assessors and the results obtained; the problems detected in the process are listed, with suggestions for its improvement; and, finally, some ideas about what humans understand by the concept of paraphrase are outlined. In section 5 some conclusions and proposals for future work are given.

2 The experiment

Every year, QA campaigns like those of the CLEF (Magnini et al., 2006), the Text REtrieval Conference (TREC) (Voorhees and Dang, 2005) or the NII–NACSIS Test Collection for IR Systems (NTCIR) (Fukumoto et al., 2004) (Kato et al., 2004) generate a large amount of human-assessed textual corpora. These corpora, containing validated information, can be reused in order to obtain data useful to a wide range of systems. The idea, given by (Shinyama et al., 2002), that articles from different newspapers can contain paraphrases if they report the same event made us aware of the fact that in the QA campaign of the CLEF the participating systems usually obtain several answers for a given question; the answers, taken from a news corpus, are related by the common theme stated by this question. Thus, a remarkable number of these answers will probably compose one or more sets of paraphrases. But is it easy for a computer program to extract that information? This last question motivated a study of the corpora available after the assessments of the Question Answering exercise of the CLEF (QA@CLEF) 2006 campaign. The first action aimed at determining whether, by means of simple techniques, a corpus of candidates for paraphrases could be obtained in a fully automatic way. Afterwards, this corpus was evaluated by three philologists in order to detect the exact set of paraphrases obtained, i.e., the candidates that were, effectively, paraphrases; their judgements were combined by voting to obtain this final set. The output of this assessment process was used to try to identify what humans understand by “paraphrase”.

3 Building a corpus for the experiment

One of the objectives of the experiment was to determine the best way to obtain a paraphrase corpus from a QA assessed corpus using shallow techniques. It was accomplished as described in the following subsections.

3.1 The multilingual source corpus

The assessment process of the QA@CLEF produces a multilingual corpus with its results. This QA corpus contains, for every language involved in the exercise, the following data: the questions proposed, all the answers given to every question, and the human assessment given to every answer (right, wrong, unsupported, inexact) (Magnini et al., 2006). Our idea was to use this corpus as a source to obtain a paraphrase corpus in Spanish.

3.2 The Spanish corpus

Since the QA@CLEF is a multilingual campaign and the scope of our experiment covered only the Spanish language, we extracted from the source corpus all the questions and assessed answers in Spanish. Thus, a monolingual Spanish corpus, which is a subcorpus of the source one, was ready to be used. The assessed answers were represented in the format shown in figure 1; for every answer there is a record in the file consisting of the following fields, from left to right and separated by tabs: the assessment given by a human assessor, the number of the question, the identification of the run and the system, the confidence value, the identification of the document that supports the answer, the answer, and the snippet from the indicated document that contains the given answer.

This format follows the one established for the QA@CLEF 2006¹.
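The record layout described above can be read with a few lines of code. The following is a minimal sketch, assuming one record per line with exactly seven tab-separated fields; the type `AssessedAnswer`, its field names and the function `parse_record` are hypothetical and are not part of the QA@CLEF guidelines.

```python
from collections import namedtuple

# Field names are illustrative; the order follows the description in the text.
AssessedAnswer = namedtuple(
    "AssessedAnswer",
    "assessment question_id run_id confidence doc_id answer snippet",
)

def parse_record(line):
    """Split one tab-separated record into its seven fields."""
    fields = line.rstrip("\n").split("\t", 6)
    if len(fields) != 7:
        raise ValueError(f"expected 7 tab-separated fields, got {len(fields)}")
    assessment, question_id, run_id, confidence, doc_id, answer, snippet = fields
    return AssessedAnswer(
        assessment, question_id, run_id, float(confidence),
        doc_id, answer.strip(), snippet.strip(),
    )
```

The question number is kept as a string so that leading zeros, as in the excerpt of figure 1, are preserved.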

3.3 Extraction of validated data

The first action over the Spanish corpus was to select the records containing at least one answer assessed as correct. Thus, only

¹ Guidelines of QA@CLEF 2006: http://clefqa.itc.it/guidelines.html

Jesús Herrera de la Cruz, Anselmo Peñas y Felisa Verdejo


Figure 1: Excerpt of the Spanish corpus.

...
R 0065 inao061eses 1.00 EFE19940520−12031 moneda griega ...GRECIA−MONEDA INTERVENCION BANCO CENTRAL PARA SALVAR DRACMA Atenas, 20 may (EFE).− El Banco de Grecia (emisor) tuvo que intervernir hoy, viernes , en el mercado cambiario e inyectar 800 millones de marcos alemanes para mantener el valor del dracma , moneda griega , tras la liberacion de los movimientos del capital el pasado lunes ....
...

human-validated data were considered for the experiment. Of the 200 questions proposed to the systems participating in the QA@CLEF 2006, 153 received one or more correct answers from one or more systems. From every selected record, the answer and the snippet containing it were extracted, because all the textual information liable to contain paraphrases is included in them.
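The selection step can be sketched as a filter followed by a grouping by question. The sketch below assumes that each record has already been reduced to a tuple (assessment, question number, answer, snippet) and that the label "R", as seen in the excerpt of figure 1, marks an answer assessed as right; the function name `select_validated` is hypothetical.

```python
from collections import defaultdict

def select_validated(records, right_label="R"):
    """Keep only records assessed as right and group them by question number."""
    by_question = defaultdict(list)
    for assessment, question_id, answer, snippet in records:
        if assessment == right_label:
            # Only the answer and its snippet are kept for later processing.
            by_question[question_id].append((answer, snippet))
    return dict(by_question)
```

Questions with no right answer simply do not appear in the result, which mirrors the reduction from 200 proposed questions to 153 reported above.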

3.4 Data transformation and selection

Next, every question was turned into its affirmative version by means of very simple techniques, following the initial idea of high simplicity for this work. First of all, punctuation signs were deleted; the most frequent ones were ¿ and ?. Next, a list of frequencies of interrogative formulations in Spanish was made in order to establish a set of rules for turning them into the affirmative form. Two transformation operations were applied by means of these rules: deletion and insertion. These operations affect only the initial words of the questions. Thus, for example, if the first words of a question are “quién es”, they must just be deleted to obtain the affirmative version; but, if the first words of a question are “qué” + substantive + verb, the word “qué” must be deleted and the word “que” must be inserted after the substantive and before the verb. Thus, once the punctuation signs have been deleted and the previous rule applied, the question ¿qué organización dirige Yaser Arafat? (what organization does Yasser Arafat lead?) yields the following affirmative form: organización que dirige Yaser Arafat (organization led by Yasser Arafat). Some rules are very easy to obtain, such as the previous one, but others are quite difficult; for example, when a question starts with the word cuándo (when), it is not trivial to transform it into an affirmative form, because several options exist and it is not possible to decide which is the most appropriate without a semantic analysis. The question ¿cuándo murió Stalin? (when did Stalin die?) serves to illustrate this situation; it could be transformed into different affirmative forms: fecha en la que murió Stalin (date on which Stalin died), momento en el que murió Stalin (moment at which Stalin died), etcetera. Thus, it was decided to apply the following rule: if a question starts with the word cuándo, then delete cuándo; therefore, for the present example, the question ¿cuándo murió Stalin? is transformed into murió Stalin (Stalin died). This was considered the best approach that could be obtained using only surface techniques. Some of the 29 rules identified are shown in table 1. This list of rules arises from a study of the Spanish corpus described, and more rules could be identified in future related work with other corpora.

Once the previous rules had been applied to the corpus, a set of monograms and bigrams was identified that must be deleted when appearing at the beginning of the new statements obtained. The monograms are articles (“el”, “la”, “lo”, “los”, “las”), and the bigrams are combinations of the verb “ser” (to be) followed by an article, for example: “era el”, “es la”, “fue el”. Thus, for example, once the punctuation signs had been deleted, applying rule number 1 from table 1 to the question ¿qué es el tóner? (what is toner?) gave the following statement: el tóner (the toner); then, the article “el” is deleted and the definitive statement is tóner (toner).
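The deletion-only part of this procedure can be sketched as follows. This is a minimal illustration, not the implementation used in the experiment: it covers only the rules of table 1 that need no part-of-speech information (rules 1, 4, 6, 7 and 8), followed by the removal of a leading article or “ser” + article bigram. The names `PREFIX_RULES`, `SER_FORMS` and `to_affirmative` are hypothetical, and the list of “ser” forms contains only the three quoted in the text. Accents are ignored when matching, so that both “qué” and “que” are recognised.

```python
import unicodedata

# Prefix-deletion rules that need no part-of-speech tags (rules 1, 4, 6, 7, 8).
PREFIX_RULES = [("que", "es"), ("quien", "es"), ("cuando",), ("nombre",), ("de",)]
ARTICLES = {"el", "la", "lo", "los", "las"}
SER_FORMS = {"es", "era", "fue"}  # assumption: only the forms quoted in the text

def strip_accents(word):
    """Remove combining marks; used only for matching, never for the output."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def to_affirmative(question):
    # Delete the two most frequent punctuation signs, then split into words.
    words = question.replace("¿", " ").replace("?", " ").split()
    keys = [strip_accents(w).lower() for w in words]
    for prefix in PREFIX_RULES:
        if tuple(keys[:len(prefix)]) == prefix:
            words, keys = words[len(prefix):], keys[len(prefix):]
            break
    # Delete a leading "ser" + article bigram, or else a leading article.
    if len(keys) >= 2 and keys[0] in SER_FORMS and keys[1] in ARTICLES:
        words = words[2:]
    elif keys and keys[0] in ARTICLES:
        words = words[1:]
    return " ".join(words)
```

Rules that insert words, such as rule 2, depend on locating the substantive and the verb and are deliberately left out; a question matching only such a rule is returned with its punctuation removed and its words unchanged.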

Table 1: Some rules identified for automatic conversion into the affirmative form.

#   If the first words of the question are:   Then:
1   qué es                                     delete qué es
2   qué + substantive + verb                   delete qué
                                               insert que after the substantive and before the verb
3   a qué + substantive + verb                 delete a qué
                                               insert a que after the substantive and before the verb
4   quién es                                   delete quién es
5   cuántos + list of words + verb             delete cuántos
                                               insert número de at the beginning
                                               insert que after the list of words and before the verb
6   cuándo                                     delete cuándo
7   nombre                                     delete nombre
8   de                                         delete de

Since the techniques used for turning the questions into their affirmative form operated only at the lexical level, slightly ungrammatical statements were produced. Most of the errors consist of a missing article or relative pronoun. Nevertheless, a human can perfectly understand this kind of ungrammatical statement and, in addition, many systems do not consider stopwords (which usually include articles and/or relative pronouns). These errors could be avoided by applying a morphological analysis; but we preserved them, not only for the sake of simplicity but also in order to permit a future study of the importance of their presence in the corpus. For example: can systems using the corpus accomplish their tasks despite the presence of some grammatical errors in it? If so, the morphological analysis could be avoided when building this kind of corpus. At this point an interesting suggestion arises: campaigns such as the Answer Validation Exercise (AVE) (Peñas et al., 2006), developed for the first time within the 2006 CLEF, need considerable human effort for transforming the answers from the associated QA exercise into their affirmative form. Therefore, the method implemented for this experiment could be a useful tool for tasks such as the AVE.

After turning the questions into their affirmative form, a normalization and filtering step was applied to the corpus in order to avoid the frequent phenomenon of having a set of equal, or very similar, answers given by different systems to a given question. It consisted of the following steps:

1. Lowercase the affirmative version of all the questions, and all the answers.

2. Eliminate punctuation signs and particles such as articles or prepositions at the beginning and the end of every statement.

3. For the set of normalized answers associated with every question, eliminate the repeated ones and the ones contained in another. That is, if the string representing an answer is the same as, or is a substring of, another string representing an answer in the set of answers for a given question, the former is eliminated from the set of answers.
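The three steps above can be sketched as follows. The particle list and the punctuation set are assumptions made for illustration, since the text does not enumerate them; the function names `normalize` and `filter_answers` are likewise hypothetical.

```python
# Assumed lists; the paper does not enumerate the particles or punctuation signs.
PARTICLES = {"el", "la", "lo", "los", "las", "un", "una",
             "de", "del", "en", "a", "por", "para", "con"}
PUNCTUATION = ".,;:¿?¡!\"'"

def normalize(statement):
    """Steps 1 and 2: lowercase, then trim punctuation and particles at both ends."""
    text = statement.lower()
    while True:
        words = text.strip(PUNCTUATION + " ").split()
        if words and words[0] in PARTICLES:
            words = words[1:]
        if words and words[-1] in PARTICLES:
            words = words[:-1]
        trimmed = " ".join(words)
        if trimmed == text:  # nothing left to trim
            return trimmed
        text = trimmed

def filter_answers(answers):
    """Step 3: drop repeated answers and answers that are substrings of another."""
    unique = list(dict.fromkeys(normalize(a) for a in answers))
    return [a for a in unique
            if a and not any(a != other and a in other for other in unique)]
```

The containment test works on characters, following the wording “substring” in step 3, so “transbordador estadounidense” is not removed by the presence of “transbordador espacial estadounidense”.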

After the normalization and filtering, a first inspection of the corpus obtained was carried out in order to determine whether more operations should be performed to obtain paraphrases. At first it may seem that little work remains to be done with the questions in affirmative form and the answers. But previous work on paraphrase detection suggested that the longest common subsequence of a pair of sentences could be considered for the objectives of this work (Bosma and Callison-Burch, 2006) (Zhang and Patrick, 2005). A first set of tests using the longest common subsequence showed that some answers could be exploited to increase the number of paraphrases; for example, presidente de Brasil (president of Brazil) is a reformulation of presidente brasileño (Brazilian president) and, if the longest common subsequence is deleted from both statements, de Brasil (of Brazil) and brasileño (Brazilian) are the new statements obtained, and they are paraphrases of each other. The problem is that it is necessary to determine which statements are good candidates for such an operation, and this is not easy using simple techniques. In addition, few examples of this kind were found; thus, not much information could be added. This is why this operation was not considered for the present work.
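The operation discussed in this paragraph can be sketched with the standard dynamic-programming table for the longest common subsequence, computed here over words rather than characters, which is the reading that reproduces the presidente de Brasil example. The function names `lcs_table` and `remove_lcs` are hypothetical.

```python
def lcs_table(a, b):
    """table[i][j] = length of the longest common subsequence of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table

def remove_lcs(first, second):
    """Delete one longest common word subsequence from both statements."""
    a, b = first.split(), second.split()
    table = lcs_table(a, b)
    rest_a, rest_b = [], []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Common word: dropped from both statements.
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            rest_a.append(a[i])
            i += 1
        else:
            rest_b.append(b[j])
            j += 1
    rest_a.extend(a[i:])
    rest_b.extend(b[j:])
    return " ".join(rest_a), " ".join(rest_b)
```

When several longest common subsequences exist, this sketch removes only one of them, so the residue is not unique in general.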

3.5 What does not work?

The previous idea of deleting the longest common subsequence from a pair of strings in order to find paraphrases gave rise to the following intuition: when two texts contain the same information, if the common words are deleted, the remaining words form a pair of strings that could, perhaps, be a pair of paraphrases. The snippets of the corpus were tested to determine whether this intuition was correct. The test consisted of grouping all the snippets related to every question and then, for every possible pair of snippets within the same group, deleting the longest common subsequence of the pair. An examination of the output of this operation revealed that it was unproductive for obtaining paraphrases. At this point the value for the present work of the previous labour accomplished by the QA systems becomes patently clear, because they filter information from the snippets and there is virtually no need to treat it “again”. Therefore it was decided not to use the snippets for the paraphrase search, but only the questions in their affirmative form and the different given answers.

3.6 The final corpus

After applying the operations described in subsection 3.4 to the validated data from the Spanish subcorpus, the definitive corpus for this work was ready. It consisted of groups of related statements; each group contained the affirmative form of a question and all the different answers obtained from the participating systems. In numbers, this corpus contains 87 groups of statements in which 1 answer was given to the question, 47 groups with 2 different answers for the question, 12 groups with 3 answers, 5 groups with 4 answers, 1 group with 5 answers, no groups with 6 answers and 1 group with 7 answers. None of the considered questions (see subsection 3.3) received more than 7 different answers.
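These figures can be checked against the totals reported elsewhere in the paper. The short sketch below assumes that each group holds the affirmative question plus its answers, that every pair of statements within a group is a candidate (as done for the evaluation in section 4), and that the group sizes are 1, 2, 3, 4, 5 and 7 answers; the variable names are illustrative.

```python
from math import comb

# Number of answers in a group -> number of groups with that many answers.
groups_by_answers = {1: 87, 2: 47, 3: 12, 4: 5, 5: 1, 7: 1}

total_groups = sum(groups_by_answers.values())

# A group with k answers has m = k + 1 statements and C(m, 2) candidate pairs.
total_pairs = sum(count * comb(answers + 1, 2)
                  for answers, count in groups_by_answers.items())

print(total_groups, total_pairs)  # 153 393
```

Both totals agree with the paper: 153 questions with at least one correct answer (subsection 3.3) and 393 candidate pairs submitted for evaluation (section 4).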

4 Evaluation of the paraphrase corpus

The final corpus was assessed by three philologists in order to find real paraphrases among the candidates.

From every group of related statements in the corpus, all the possible pairs of statements among those of the group were considered for evaluation. Thus, from a group of m related statements, $C_{m,2} = \binom{m}{2}$ pairs must be evaluated. For the present case, 393 pairs were produced for evaluation.

The assessors were asked to consider the context of the statements and to admit some redundancies between the affirmative form of the question and its answers. For example, for the affirmative form of the question “¿Qué es el Atlantis?” (What is Atlantis?), that is “Atlantis”, four different answers are associated:

1. “transbordador estadounidense” (american shuttle)

2. “foro marítimo” (marine forum)

3. “transbordador espacial atlantis” (space shuttle atlantis)

4. “transbordador espacial estadounidense” (american space shuttle)

As can be observed, the answer “foro marítimo” does not belong to the same context as the other answers, but “Atlantis” and “foro marítimo” were considered a paraphrase, as were “Atlantis” and “transbordador espacial estadounidense”. But “foro marítimo” and “transbordador espacial estadounidense” were not, obviously, considered a paraphrase. Regarding redundancies, it can be observed that “transbordador espacial atlantis” contains “Atlantis”, but both statements express the same idea, i.e., they are a semantic paraphrase. In addition, this example illustrates the statement by (Shinyama et al., 2002) that the expressions considered paraphrases differ from domain to domain.

The evaluators labeled every single pair with a boolean value: YES if a paraphrase relationship was considered to hold between both statements, and NO otherwise. The assessments of the three experts were used as votes: a pair of statements was finally considered a paraphrase if at least two of the labels given by the assessors to the pair were YES. Following this criterion, from the 393 candidate pairs of statements, 291 were considered paraphrases, i.e., 74%. The inter-annotator agreement was 76%. The three experts simultaneously labeled 204 pairs with YES and 48 pairs with NO. Thus, total agreement was reached for 252 pairs, which corresponds to 86.6% of the number of pairs considered paraphrases.
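The decision rule can be sketched in a few lines. The sketch assumes that the three labels of each pair are available as a tuple of "YES"/"NO" strings; the function names `majority_vote` and `summarize` are hypothetical.

```python
def majority_vote(labels):
    """True if at least two of the three YES/NO labels are YES."""
    return sum(label == "YES" for label in labels) >= 2

def summarize(assessments):
    """Count pairs accepted as paraphrases and pairs with unanimous labels.

    assessments: list of 3-tuples of 'YES'/'NO' labels, one tuple per pair.
    """
    paraphrases = sum(majority_vote(labels) for labels in assessments)
    unanimous = sum(len(set(labels)) == 1 for labels in assessments)
    return paraphrases, unanimous
```

Applied to the full set of assessments, the two counts would correspond to the 291 accepted pairs and the 252 pairs with total agreement (204 unanimous YES plus 48 unanimous NO) reported above.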



4.1 Problems detected and suggestions for improvement

The biggest disagreements between annotators arose in “difficult” pairs such as, for example: “países que forman la OTAN actualmente” (countries that currently make up NATO) and “dieciséis” (sixteen); this is because, for some people, a number cannot substitute for a set of countries but, for other people, in a given context either expression can be used, for example: “... the countries that currently make up NATO held a meeting in Paris last week...” or “... the sixteen held a meeting in Paris last week...”.

This situation suggested an analysis of the pairs involved in disagreements. From it, several phenomena were detected. The most frequent ones are shown in the following list:

• Some errors are introduced by the annotators because they do not accurately consider the context of the statements. As an example, one of the annotators did not consider the pair “organización que dirige yaser arafat” (organization led by yasser arafat) and “autoridad nacional palestina” (palestinian national authority) a paraphrase because nowadays Yasser Arafat does not lead the Palestinian National Authority.

• When one of the statements of the pair comes from a factoid-type question of the QA exercise, and its answers are restricted to a date (see (Magnini et al., 2006) for more information about this kind of question and answer restriction), “difficult” pairs such as the following appear: “murió stalin” (stalin died) and “5 de marzo de 1953” (5th March 1953). Some annotators consider that there is a paraphrase, but only because they infer some words that are missing from the affirmative form of the question in order to complete the overall context of the pair. Thus, for this pair some annotators actually understand “fecha en la que murió stalin” (date on which stalin died) instead of “murió stalin”. This example shows that some disagreements can be induced by the transformation into affirmative form.

• Some annotators are stricter than others when considering the grammatical accuracy of the statements. QA systems sometimes introduce small grammatical errors in their responses, and this affects the judgement about the existence of a paraphrase. This is more frequent in answers given to date-type or location-type questions, because of the format given to them by the QA systems. The following two examples illustrate the case: first, in the pair “3 de abril de 1930” (3rd april 1930) and “3 abril 1930” (3 april 1930), the first statement is correct but in the second the preposition “de” is missing; despite the fact that it can be perfectly understood, some annotators think that it makes no sense; second, in the pair “lillehammer (noruega)” (lillehammer (norway)) and “lillehammer noruega” (lillehammer norway), the missing parentheses in the latter statement made some annotators consider that it could be interpreted as a compound name instead of a pair of names (the city and its country).

• Another source of disagreement is the fact that there is not a bidirectional entailment between the two statements of the pair. The pair “lepra” (leprosy) and “enfermedad infecciosa” (infectious disease) serves as an example. Leprosy is an infectious disease, but not every infectious disease is leprosy. Despite this fact, some annotators considered that there is a paraphrase, because in certain contexts both statements can be used interchangeably.

• Sometimes, errors inherited from the QA assessment process cause different opinions among the annotators. For example, the pair “deep blue” and “ordenador de ajedrez” (chess computer) is in the corpus because the assessors of the QA exercise considered “ordenador de ajedrez” (chess computer) an adequate answer for the question “¿qué es deep blue?” (what is deep blue?). Despite the fact that the annotators were asked to consider all the statements as validated, those of them who knew that, in fact, Deep Blue is not a computer devoted to playing chess did not label the pair as a paraphrase.

These problems suggest that the assessment process should be improved. Thus, not only a simple labelling action but a more complex process should be carried out. Two alternative proposals for a better assessment process are outlined here:

1. In a first round, the assessors not only label the pairs but also write an explanation for every decision. In a second round, independent assessors take a definitive decision taking into account both the vote among the labels given in the previous round and the written considerations.

2. In a first round, the assessors only label the pairs and, in a second round, they discuss the controversial cases, and everyone can reconsider their opinion and relabel the pair; if an agreement is not reached, the pair and the opinions are submitted to independent assessors.

In addition, the assessment process should be supervised in order to homogenize criteria about what kinds of small errors should be considered by the assessors; for example, the lack of parentheses or prepositions.

Of course, some errors cannot be avoided when applying a fully automated process. For example, meaningless pairs such as “deep blue” and “ordenador de ajedrez” (chess computer), which depend on the QA assessment process, cannot be identified with shallow techniques.

4.2 What do humans understand under paraphrase?

Several methods for recognizing paraphrases or obtaining them from corpora have been proposed so far, but a doubt arises: what exactly are these methods recognizing or obtaining? The definition of paraphrase is very fuzzy and context-dependent, as seen here; moreover, almost every author gives a definition of their own; for example, the one given by (Fabre and Jacquemin, 2000):

  Two sequences are said to be a paraphrase of each other if the user of an information system considers that they bring identical or similar information content.

Or the one by (Wan et al., 2006):

  [...] paraphrase pairs as bi-directional entailment,

where a definition for entailment can be found in (Dagan et al., 2006):

  Entailment: whether the meaning of one text can be inferred (entailed) from the other.

But these and the other definitions that can be found for paraphrase can be included in the simple concept given by (Shinyama et al., 2002):

  Expressing one thing in other words.

This last formulation is very useful because it is able to deal with the variety of human opinions, but it is not restrictive at all. The difficulty when working with paraphrases lies in its own definition, as shown by the relatively poor agreement when different persons have to say whether a pair of expressions can be considered paraphrases. Thus, paraphrase corpora could be built or paraphrase recognition systems could be developed, but every single system using such resources should be capable of discriminating the usefulness of the supplied sets of paraphrases.

5 Conclusions and future work

The annotated corpora from the assessment processes of campaigns like the CLEF, the TREC or the NTCIR grow year by year. This human work generates a great amount of validated data that could be successfully reused. This paper describes a very simple and inexpensive way to obtain paraphrases, but it is neither the only nor the most complex task that can be accomplished with such data. Thus, corpora aimed at different applications could be enlarged every year using the newest results of this kind of campaign. In addition, the rules proposed here for transforming questions into their affirmative form can be used for automatically building the corpora needed in future AVEs.

Despite the fact that the concept of paraphrase is human-dependent and, therefore, it is not easy to obtain high inter-annotator agreement, it has been shown that a large number of paraphrases can be obtained by means of shallow techniques. Nevertheless, the assessment process applied to the corpus of paraphrase candidates can be improved; several ideas for this have been outlined in this paper. As a result of this improvement, the inter-annotator agreement should increase and the percentage of identified paraphrases should decrease, but hopefully not to the point at which the proposed method should be considered useless. In the near future, new models for this assessment process should be evaluated in order to determine the most appropriate one. Apart from the accuracy of the assessment process, the results obtained so far suggest that it will be interesting to test whether paraphrase corpora such as the one presented in this paper are really useful for different applications, and whether it is worthwhile to implement more complex techniques or whether the small errors produced do not interfere with the performance of these applications. This will determine whether such corpora should be obtained every year after evaluation campaigns such as the one held at CLEF.

References

R. Barzilay and K.R. McKeown. 2001. Extracting Paraphrases from a Parallel Corpus. Proceedings of the ACL/EACL.

W. Bosma and C. Callison-Burch. 2006. Paraphrase Substitution for Recognizing Textual Entailment. Working Notes for the CLEF 2006 Workshop, 20–22 September, Alicante, Spain.

Ido Dagan, Oren Glickman and Bernardo Magnini. 2006. The PASCAL Recognising Textual Entailment Challenge. MLCW 2005. LNAI 3944. Springer, Heidelberg, Germany.

Cecile Fabre and Christian Jacquemin. 2000. Boosting Variant Recognition with Light Semantics. Proceedings of the 18th Conference on Computational Linguistics, Volume 1, Saarbrücken, Germany.

Junichi Fukumoto, Tsuneaki Kato and Fumito Masui. 2004. Question Answering Challenge for Five Ranked Answers and List Answers: Overview of NTCIR4 QAC2 Subtask 1 and 2. Working Notes of the Fourth NTCIR Workshop Meeting, National Institute of Informatics, Tokyo, Japan.

Tsuneaki Kato, Junichi Fukumoto and Fumito Masui. 2004. Question Answering Challenge for Information Access Dialogue: Overview of NTCIR4 QAC2 Subtask 3. Working Notes of the Fourth NTCIR Workshop Meeting, National Institute of Informatics, Tokyo, Japan.

D. Lin and P. Pantel. 2001. Discovery of Inference Rules for Question Answering. Natural Language Engineering, 7(4):343–360.

Bernardo Magnini, Danilo Giampiccolo, Pamela Forner, Christelle Ayache, Petya Osenova, Anselmo Peñas, Valentin Jijkoun, Bogdan Sacaleanu, Paulo Rocha and Richard Sutcliffe. 2006. Overview of the CLEF 2006 Multilingual Question Answering Track. Working Notes of the CLEF 2006 Workshop, 20–22 September, Alicante, Spain.

B. Pang, K. Knight and D. Marcu. 2003. Syntax-based Alignment of Multiple Translations: Extracting Paraphrases and Generating New Sentences. NAACL-HLT.

A. Peñas, A. Rodrigo, V. Sama and F. Verdejo. 2006. Overview of the Answer Validation Exercise 2006. Working Notes of the CLEF 2006 Workshop, 20–22 September, Alicante, Spain.

D. Ravichandran and E. Hovy. 2002. Learning Surface Text Patterns for a Question Answering System. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).

Y. Shinyama, S. Sekine, K. Sudo and R. Grishman. 2002. Automatic Paraphrase Acquisition from News Articles. Proceedings of HLT, pages 40–46.

E.M. Voorhees and H.T. Dang. 2005. Overview of the TREC 2005 Question Answering Track. NIST Special Publication 500-266: The Fourteenth Text REtrieval Conference Proceedings (TREC 2005), Gaithersburg, MD, USA.

Stephen Wan, Mark Dras, Robert Dale, and Cecile Paris. 2006. Using Dependency-Based Features to Take the “Para-farce” out of Paraphrase. Proceedings of the Australasian Language Technology Workshop 2006, Sydney, Australia.

Yitao Zhang and Jon Patrick. 2005. Paraphrase Identification by Text Canonicalization. Proceedings of the Australasian Language Technology Workshop 2005, Sydney, Australia.



Evaluation of Question Answering Systems with a Time Constraint

Fernando Llopis¹, Elisa Noguera¹, Antonio Ferrández¹ and Alberto Escapa²

¹ Grupo de Investigación en Procesamiento del Lenguaje Natural y Sistemas de Información, Departamento de Sistemas y Lenguajes Informáticos

² Departamento de Matemática Aplicada

Universidad de Alicante

{elisa,llopis,antonio}@dlsi.ua.es // [email protected]

Summary: Research on the evaluation of Question Answering (QA) systems has focused only on evaluating their precision. In this work we develop a mathematical procedure to explore new evaluation measures for QA systems that take the answer time into account. In addition, we carried out an exercise for the evaluation of QA systems in the CLEF-2006 campaign using the proposed measures. The main conclusion is that evaluating the answer time can be a new scenario for the evaluation of QA systems.
Keywords: Evaluation, Question Answering

Abstract: Previous work on evaluating the performance of Question Answering (QA) systems has focused on the evaluation of precision. Nevertheless, the importance of the answer time has never been evaluated. In this paper, we develop a mathematical procedure in order to explore new evaluation measures for QA systems that consider the answer time. We also carried out an exercise for the evaluation of QA systems within a time constraint in the CLEF-2006 campaign, using the proposed measures. The main conclusion is that the evaluation of QA systems in real time can be a new scenario for the evaluation of QA systems.
Keywords: Evaluation, Question Answering

1. Introduction

The goal of Question Answering (QA) systems is to locate concrete answers to questions in text collections. These systems are very useful because users do not need to read a whole document or text fragment to obtain the required information. Questions such as "How old is Nelson Mandela?", "Who is the president of the United States?" or "When did the Second World War take place?" could be answered by these systems. QA systems contrast with Information Retrieval (IR) systems, which try to retrieve the documents relevant to a query, where the query may be a simple set of keywords (e.g. Nelson Mandela age, president United States, Second World War, ...).

The annual Text REtrieval Conference (TREC, http://trec.nist.gov), organised by the National Institute of Standards and Technology (NIST), aims to advance the study of IR and to provide the infrastructure needed for a robust evaluation of text retrieval methodologies. This model has been adopted by the Cross-Language Evaluation Forum (CLEF, http://www.clef-campaign.org) in Europe and by the National Institute of Informatics Test Collection for IR Systems (NTCIR, http://research.nii.ac.jp/ntcir) in Asia, both of which investigate the problem of multilingual retrieval. Since 1999, TREC has had a specific task for the evaluation of QA systems (Voorhees and Dang, 2005). The evaluation of QA systems has also been introduced in the CLEF (Magnini et al., 2006) and NTCIR (Fukumoto et al., 2002) competitions. This evaluation consists of locating the answers to a set of questions in a document collection, analysing the documents automatically.

In these evaluations, systems have up to a week to answer the set of questions. This is a problem for the evaluation of QA systems because they are usually very precise but also very slow, which makes comparison between systems very difficult. For this reason, the aim of this work is to provide a new scenario for the evaluation of QA systems with a time constraint.

This article is organised as follows: Section 2 describes the evaluation of QA systems at CLEF-2006. Section 3 presents a new proposal of evaluation measures for QA systems. Section 4 describes the experiment carried out at CLEF-2006 within the QA context. Finally, Section 5 presents the conclusions and future work.

2. Evaluation of QA systems at CLEF-2006

The goal of the QA task at CLEF is to promote the development of QA systems by providing an infrastructure for their evaluation. This task is attracting growing interest from the scientific community. In this section we describe the main elements of the main QA task at CLEF-2006. For more information, see (Magnini et al., 2006).

2.1. Question collection

The question set consisted of 200 questions, of which 148 were factoid questions, 42 were definition questions and 10 were list questions.

A factoid question asks about facts or events, for example "What is the capital of Italy?". Six expected answer types were considered for these questions: PERSON, TIME, LOCATION, ORGANIZATION, MEASURE and OTHER.

Definition questions require information about definitions of people, things or organisations. An example of this kind of question is "Who is the president of Spain?". The expected answer types for definition questions are PERSON, ORGANIZATION, OBJECT and OTHER.

A list question requires information about different instances of people, objects or data, for example "List the countries of Eastern Europe".

Forty questions with a temporal restriction were included across the different question types (factoid, definition and list). Specifically, three types of temporal restriction were used: DATE, PERIOD and EVENT. "Who won the Nobel Peace Prize in 1992?" is an example of a question with a DATE restriction.

In addition, several questions had no answer in the collection. The answer to such questions is called NIL. These questions matter because systems must detect whether an answer exists in the collection and, if it does not, return NIL.

Participants had one week to submit their results. This means that systems can be very slow, which is not a desirable feature for QA systems.

2.2. Evaluation of the answers

The answers returned by each participant were judged manually by native-speaker assessors; each language was coordinated by a group of assessors. Each answer was judged as: R (right) if the answer was correct and supported by the returned text snippets; W (wrong) if the answer was not correct; X (inexact) if the answer contained less or more information than required by the question; and U (unsupported) if the text snippets did not contain the answer, were not included in the answer file, or did not come from the correct document.

2.3. Evaluation measures

Answers were evaluated mainly with the accuracy measure. Other measures were also considered: Mean Reciprocal Rank (MRR), K1 and Confidence Weighted Score (CWS).

$$\mathrm{accuracy} = \frac{r}{n} \qquad (1)$$

Accuracy is defined as the proportion of correct answers over the total number of questions. Only one answer per question is allowed. It is obtained with formula (1), where r is the number of correct answers returned by the system and n is the total number of questions. This measure has been used since CLEF-2004, mainly because usually only one answer per question is evaluated.


$$\mathrm{MRR} = \frac{1}{q} \sum_{i=1}^{q} \frac{1}{far_i} \qquad (2)$$

At QA@CLEF-2003 the MRR measure was used, since up to 3 answers per question were allowed on that occasion. This year, in contrast, it has been used only as an additional measure to evaluate the systems that return more than one answer per question. For each question, this measure assigns the reciprocal of the position at which the first correct answer was found, or zero if no correct answer was found. The final value is the mean of the values obtained for each question. MRR therefore gives a high value to systems whose correct answers appear at the top of the ranked list. The measure is defined by formula (2), where q is the number of questions and far_i is the first position at which a correct answer was returned for question i.
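Formulas (1) and (2) can be illustrated with a minimal Python sketch. The function names and the sample judgements below are illustrative assumptions and are not taken from the CLEF evaluation software; a question without any correct answer is represented by `None` and contributes zero to the sum.

```python
from typing import Optional, Sequence


def accuracy(correct: int, questions: int) -> float:
    """Formula (1): proportion of correct answers r over n questions."""
    if questions <= 0:
        raise ValueError("the number of questions must be positive")
    return correct / questions


def mean_reciprocal_rank(first_correct_ranks: Sequence[Optional[int]]) -> float:
    """Formula (2): each entry is far_i, the 1-based position of the first
    correct answer for question i, or None if no correct answer was returned."""
    if not first_correct_ranks:
        raise ValueError("at least one question is required")
    total = sum(1.0 / rank for rank in first_correct_ranks if rank is not None)
    return total / len(first_correct_ranks)


# Hypothetical judgements for four questions (not data from the paper).
ranks = [1, 3, None, 2]
mrr = mean_reciprocal_rank(ranks)
acc = accuracy(correct=1, questions=4)  # only the answers ranked first count
print(f"MRR = {mrr:.3f}, accuracy = {acc:.2f}")
```

With a single answer per question, MRR and accuracy coincide, which is why MRR only adds information when several answers per question are submitted.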

QA systems return their answers in no particular order (the order of the question set is simply kept). Optionally, some systems assign a confidence value (between 0 and 1) to each answer. This value is used to compute two additional measures, CWS and K1, which take both precision and confidence into account. Since the confidence value is optional and only some QA systems provide it, only those systems can be evaluated with these measures. For more information, see (Magnini et al., 2006).

2.4. Limitations of current QA evaluations

At present, several aspects of the evaluation of QA systems could be improved: (1) participants have several days to answer the questions; (2) the answer time is not evaluated, so systems may perform well while being far too slow; and (3) comparison between QA systems can be difficult if their answer times differ. Consequently, performance analysis should involve evaluating both the efficiency and the effectiveness of QA systems.

The motivation of this work is to study the evaluation of QA systems with a time constraint. Specifically, we propose new evaluation measures that combine the precision and the answer time of the systems. To evaluate the answer time of the systems, we carried out an experiment at CLEF-2006 that provides a new scenario for comparing QA systems. In view of the results obtained by the systems, we argue that this is a promising step towards changing the direction of QA evaluation.

3. New approaches to the evaluation of QA systems

The problem described above can be restated mathematically. We consider that the response of each system S_i can be characterised in this problem by an ordered pair of real numbers (x_i, t_i). The first element of each pair represents the precision of the system and the second its efficiency. In this way, the QA task can be represented geometrically as a set of points located in a subset D ⊆ ℝ². Our problem can be solved by providing a method that orders the systems S_i according to a prefixed criterion that takes both precision and efficiency into account. This problem is of the same nature as other problems dealt with in Decision Theory.

A solution to this problem can be obtained by introducing a total preorder, sometimes called a quasi-order, on D. A binary relation ⪯ on a set D is a total preorder if it is reflexive and transitive and if any two elements of D are comparable. In particular, we can define a quasi-order on D with the help of a real function of two variables f : D ⊆ ℝ² → I ⊆ ℝ, such that (a, b) ⪯ (c, d) ⇔ f(a, b) ≤ f(c, d), ∀ (a, b), (c, d) ∈ D.

We will refer to this function as the ranking function. One advantage of this procedure is that the ranking function contains all the information about the criterion chosen to rank the different systems S_i.
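The way a ranking function induces a total preorder can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: the names `precedes` and `rank_systems`, the function `example_f` (which is not one of the measures proposed in this paper) and the three systems.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]  # (precision x, answer time t)
RankingFunction = Callable[[float, float], float]


def precedes(f: RankingFunction, p: Point, q: Point) -> bool:
    """(a, b) precedes-or-ties (c, d) if and only if f(a, b) <= f(c, d)."""
    return f(*p) <= f(*q)


def rank_systems(f: RankingFunction, systems: Dict[str, Point]) -> List[str]:
    """Return the system names from best (highest f) to worst (lowest f)."""
    return sorted(systems, key=lambda name: f(*systems[name]), reverse=True)


def example_f(x: float, t: float) -> float:
    """Hypothetical ranking function: rewards precision, penalises time."""
    return x - 0.1 * t


# Hypothetical systems, not the CLEF-2006 participants.
systems = {"A": (0.50, 0.2), "B": (0.40, 0.9), "C": (0.45, 0.1)}
ranking = rank_systems(example_f, systems)
print(ranking)
```

Because the comparison is delegated to real numbers, reflexivity, transitivity and comparability of any two systems follow directly from the usual order on ℝ.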

Mathematically, all the elements placed at the same position in the ranking belong to the same level curve of the ranking function. Specifically, the iso-ranking curves are formed by all the elements of D that satisfy the equation f(x, t) = L, where L is a real number in the image of f, I.

The ranking procedure proposed for evaluating systems in the QA task is ordinal. This means that no conclusion should be drawn from the absolute numerical difference between the values that the ranking function assigns to two systems. The only relevant information is the relative position of the systems in the ranking of the QA evaluation task. In fact, if we build a new ranking function by composing the initial ranking function with a strictly increasing function, the numerical value assigned to each system will change, but the resulting ranking will be the same as the initial one.
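The ordinal nature of the procedure can be checked numerically. The sketch below is only an illustration under assumed inputs: `base_f` is a hypothetical ranking function, `math.exp` plays the role of the strictly increasing function, and the systems are invented.

```python
import math
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]  # (precision x, answer time t)


def rank(f: Callable[[float, float], float], systems: Dict[str, Point]) -> List[str]:
    """Names ordered from the highest to the lowest value of f."""
    return sorted(systems, key=lambda name: f(*systems[name]), reverse=True)


def base_f(x: float, t: float) -> float:
    """Hypothetical ranking function used only for this illustration."""
    return x - 0.1 * t


def rescaled_f(x: float, t: float) -> float:
    """base_f composed with exp, which is strictly increasing on the reals."""
    return math.exp(base_f(x, t))


# Hypothetical systems, not the CLEF-2006 participants.
systems = {"A": (0.50, 0.2), "B": (0.40, 0.9), "C": (0.45, 0.1)}

original = rank(base_f, systems)
rescaled = rank(rescaled_f, systems)
print(original, rescaled)  # the scores differ, the order does not
```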

In the approach developed in this article, the precision x_i of system S_i is computed with the Mean Reciprocal Rank (MRR) evaluation measure, so that x_i ∈ [0, 1]. Efficiency is measured by the answer time of each system, so that a small answer time means good efficiency.

To define a realistic ranking function, some additional requirements must be established. These properties are based on the intuitive behaviour that the function should exhibit. As an initial approach, we establish the following conditions:

1. The function f must be continuous on D.

2. The upper bound of I is given by $\lim_{t \to 0} f(1, t)$. If I has no upper bound, then $\lim_{t \to 0} f(1, t) = +\infty$.

3. The lower bound of I is given by f(0, 1).

The first condition is imposed for mathematical convenience, although it can also be interpreted as a way of simplifying the arguments. Note that this requirement excludes the possibility that, given two systems at different positions in the ranking, a small variation in precision or efficiency alters the ranking values abruptly. The second condition reflects the fact that a system defined by the pair (1, 0) should always be in the first position of the ranking. Finally, the last condition implies that the pair (0, 1) should be in the last position.

3.1. Time-independent ranking function (MRR2)

As a first example of a ranking function, we consider MRR2(x, t) = x. The preorder induced by this function is similar to the lexicographic order, sometimes called alphabetical order. For this ranking function we have:

1. The image of MRR2 is the interval [0, 1].

2. The function MRR2 is continuous on D.

3. $\lim_{t \to 0} \mathrm{MRR2}(1, t) = 1$.

4. MRR2(0, 1) = 0.

The function therefore satisfies the conditions established above. Its iso-ranking curves are of the form x = L, L ∈ [0, 1], whose graphical representation is a family of vertical segments of unit length (see Figure 1). The preorder built by this ranking function only takes the precision of the systems into account.

3.2. Ranking function with inverse time dependence (MRRT)

Since the first example of a ranking function does not take the efficiency of the systems into account, we now consider the function MRRT(x, t). We assume that in this case the ranking function is inversely proportional to the efficiency (answer time) and directly proportional to the precision. In particular, this function has the following properties:

1. The image of MRRT is the interval [0, +∞).

2. The function MRRT is continuous on D.

3. $\lim_{t \to 0} \mathrm{MRRT}(1, t) = +\infty$.

4. MRRT(0, 1) = 0.

The iso-ranking curves associated with this function are of the form x/t = L, L ∈ [0, +∞). Geometrically, these curves are a family of segments that pass through the point (0, 0) with slope 1/L (see Figure 2). Thus, systems with better efficiency, that is, a small answer time, obtain a better value of the function and a high position in the ranking. Moreover, although the ranking function is ordinal in nature, it is desirable for its image to be bounded between 0 and 1, since this makes its values easier to interpret intuitively; this function does not satisfy that condition.

3.3. Inverse exponential ranking function with time dependence (MRRTe)

Because of the drawbacks of the previous functions, we propose a new function that also depends on the precision and the efficiency of the system, but in which efficiency has less weight than precision. The function is defined as follows:

$$\mathrm{MRRTe}(x, t) = \frac{2x}{1 + e^{t}}, \qquad (3)$$

where $e^{t}$ is the exponential function of the efficiency. This function satisfies the following conditions:

1. The image of MRRTe is the interval [0, 1).

2. The function MRRTe is continuous on D.

3. $\lim_{t \to 0} \mathrm{MRRTe}(1, t) = 1$.

4. MRRTe(0, 1) = 0.

The iso-ranking curves are of the form 2x/(1 + e^t) = L, L ∈ [0, 1), and are shown in Figure 3. For an ideal system, that is, one that answers instantaneously (t = 0), the value of this function coincides with the value of the precision. Otherwise, the functional dependence on time modulates the value of the function, so that the function decreases as the time increases. This dependence is nevertheless smoother than in the previous function. Furthermore, given a system S, the same ranking position as S can only be obtained by systems whose precision and efficiency lie within a particular range; a system with an arbitrarily small precision cannot reach it.
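Formula (3) and the properties listed above can be checked with a short Python sketch. The function name `mrrte` and the input validation are our own choices for illustration; only the formula itself comes from the text.

```python
import math


def mrrte(x: float, t: float) -> float:
    """Formula (3): 2x / (1 + e^t), with precision x in [0, 1] and
    normalised answer time t >= 0."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("precision x must lie in [0, 1]")
    if t < 0.0:
        raise ValueError("answer time t must be non-negative")
    return 2.0 * x / (1.0 + math.exp(t))


# An instantaneous system (t = 0) keeps its precision unchanged.
print(mrrte(0.5, 0.0))

# For a fixed precision, the value decreases as the answer time grows.
for t in (0.0, 0.25, 0.5, 1.0):
    print(f"t = {t:.2f} -> MRRTe = {mrrte(0.5, t):.4f}")
```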

4. Evaluation at CLEF-2006

As described above, we consider time to be a fundamental part of the evaluation of QA systems. In agreement with the CLEF organisers, we carried out an experimental task at CLEF-2006 whose goal was to evaluate QA systems with a time constraint. This was an innovative experiment for the evaluation of QA systems and an initiative to provide a new scenario for that evaluation. The experiment follows the same guidelines as the main task, described in Section 2, but takes the answer time into account.

4.1. Participants

In total, 5 groups took part in this experimental exercise: daedalus (Spain) (de Pablo-Sánchez et al., 2006), tokyo (Japan) (Whittaker et al., 2006), priberam (Portugal) (Cassan et al., 2006), alicante (Spain) (Ferrández et al., 2006) and inaoe (Mexico) (Juárez-González et al., 2006). All these systems also took part in the main task of CLEF-2006, and their groups have research experience in QA systems.

4.2. Evaluation

This section presents the evaluation results of the 5 systems that took part in the experiment (six runs in total, since daedalus submitted two). First, we present the precision and the efficiency obtained by these systems. Then, we present the scores obtained by each of them with the different measures that combine precision and efficiency (presented in Section 3).

Table 1 summarises the results obtained with the different evaluation measures (MRR, t, MRRT, MRRTe). All the results are shown in a single table to make the comparison between measures easier. The position (pos) obtained by each system under each measure is also shown.

| Participant | MRR  | pos | t    | pos | MRRT  | pos | MRRTe | pos |
|-------------|------|-----|------|-----|-------|-----|-------|-----|
| daedalus1   | 0.41 | 1   | 0.10 | 4   | 3.83  | 4   | 0.38  | 1   |
| tokyo       | 0.38 | 2   | 1.00 | 6   | 0.38  | 6   | 0.20  | 6   |
| priberam    | 0.35 | 3   | 0.01 | 1   | 32.13 | 1   | 0.34  | 2   |
| daedalus2   | 0.33 | 4   | 0.03 | 3   | 8.56  | 3   | 0.32  | 3   |
| inaoe       | 0.3  | 5   | 0.38 | 5   | 0.78  | 5   | 0.24  | 4   |
| alicante    | 0.24 | 6   | 0.02 | 2   | 16.23 | 2   | 0.23  | 5   |

Table 1: Results obtained with the different evaluation measures

4.2.1. Evaluation of precision and answer time

The precision of the QA systems was evaluated in the experiment with the MRR measure (see Section 2.3). We used this measure because the systems submitted three answers per question. The evaluation of the systems with this measure is presented in Table 1. The answer times were measured in seconds (t_sec), although the table shows the answer time (t) of each system normalised with respect to t_max, the answer time of the slowest system. That is, t = t_sec / t_max.
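The normalisation t = t_sec / t_max can be sketched as follows. The raw times in seconds are not reported in this paper, so the values below are invented purely to show the computation, and the function name `normalise_times` is our own.

```python
from typing import Dict


def normalise_times(seconds: Dict[str, float]) -> Dict[str, float]:
    """Divide every answer time by the slowest one, so that t lies in (0, 1]."""
    t_max = max(seconds.values())
    if t_max <= 0.0:
        raise ValueError("answer times must be positive")
    return {name: value / t_max for name, value in seconds.items()}


# Hypothetical raw answer times in seconds (not the CLEF-2006 measurements).
raw_seconds = {"system_a": 50.0, "system_b": 5000.0, "system_c": 500.0}
normalised = normalise_times(raw_seconds)
for name, t in normalised.items():
    print(f"{name}: t = {t:.2f}")
```

One consequence of this choice is that the slowest system always receives t = 1, so the normalised values of all systems change whenever a slower system joins the comparison.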

4.2.2. Evaluation of the results with MRR2

The overall evaluation of the QA systems combining precision and answer time with the MRR2 measure (see Section 3.1) is the same as that obtained using only the MRR measure (see Section 2.3), because this measure considers precision first and only considers time when several systems have the same precision. In this case, since all the precisions differ, the systems are ordered by their MRR.

Figure 1: Comparison of the results obtained by each system with the MRR2 evaluation measure (lexicographic preorder).

Graphically, an iso-ranking curve contains all the systems with the same MRR value and any answer time. That is, the criterion used to establish the ranking is simply the precision obtained by the QA systems. The limitations of this procedure, which have been argued throughout this work, become clear if we consider, for example, the systems priberam and tokyo in Figure 1. The system tokyo is in second position in the ranking and priberam is third. However, the difference in precision between the two systems is very small, 0.38 vs. 0.35, while the efficiency of priberam is much better than that of tokyo. Consequently, it would be reasonable for priberam to precede tokyo. This is impossible with this kind of measure, which is independent of time.

4.2.3. Evaluation of the results with MRRT

The evaluation of the systems with the MRRT measure (see Section 3.2) is presented in Table 1. The position obtained by each system with this measure is also shown.

As the table shows, priberam obtained the best MRRT value (32.13), with a t of 0.01 and an MRR of 0.35. The first run submitted by daedalus (daedalus1) obtained the best MRR, 0.41, but it was not the fastest run (t = 0.10). Consequently, this run obtained a low MRRT (3.83). The second run submitted by daedalus (daedalus2) obtained a lower MRR than the first (0.33) but a better t (0.03), and for this reason it obtained a better MRRT than the first run.
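The MRRT ranking can be approximately reproduced from the rounded MRR and t columns of Table 1. The paper only states that MRRT is directly proportional to precision and inversely proportional to time; the sketch below assumes the simplest such form, MRRT(x, t) = x / t, which agrees with the tokyo row (0.38 / 1.00 = 0.38). Because the inputs are rounded to two decimals, the recomputed scores differ from the published MRRT column, but the resulting order is the same.

```python
from typing import Dict, List, Tuple

# (MRR, normalised t) as printed in Table 1, rounded to two decimals.
TABLE_1: Dict[str, Tuple[float, float]] = {
    "daedalus1": (0.41, 0.10),
    "tokyo": (0.38, 1.00),
    "priberam": (0.35, 0.01),
    "daedalus2": (0.33, 0.03),
    "inaoe": (0.30, 0.38),
    "alicante": (0.24, 0.02),
}


def mrrt(x: float, t: float) -> float:
    """Assumed form of MRRT: precision divided by normalised answer time."""
    if t <= 0.0:
        raise ValueError("answer time t must be positive")
    return x / t


def rank_by_mrrt(systems: Dict[str, Tuple[float, float]]) -> List[str]:
    """System names ordered from the highest to the lowest MRRT."""
    return sorted(systems, key=lambda name: mrrt(*systems[name]), reverse=True)


ranking = rank_by_mrrt(TABLE_1)
for position, name in enumerate(ranking, start=1):
    print(position, name, round(mrrt(*TABLE_1[name]), 2))
```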

Graphically, the different values obtained can be seen in Figure 2. For example, the system alicante, whose precision is 0.24 and whose t is 0.02, is placed next to priberam in the ranking (second and first position respectively), even though the precision of priberam is clearly better (0.35). The position of any system in the ranking can be matched by a system with lower precision but higher efficiency, and in particular this can happen even with a small precision value. This is a drawback because the efficiency of the systems is weighted too heavily and, in our opinion, the main factor should be precision, even though efficiency is also taken into account.

Figure 2: Comparison of the results obtained by each system with the MRRT evaluation measure on its iso-ranking curves.

4.2.4. Evaluation of the results with MRRTe

The MRRT evaluation measure, presented in the previous section, was used in the QA task with time constraint at CLEF-2006. We consider that this measure gives too much weight to time, so we have proposed an alternative measure that is more suitable for evaluating QA systems with a time constraint. The new measure, described in Section 3.3, has been designed to penalise systems with a high answer time.

As Table 1 shows, daedalus1 and priberam obtain the best results with the MRRTe measure (0.38 and 0.34 respectively). The decrease for priberam with respect to its MRR (from 0.35 to 0.34) is not significant because its answer time is very small (0.01); the same holds for alicante (from 0.24 to 0.23). In contrast, the MRRTe of daedalus1 falls further below its MRR (from 0.41 to 0.38) because its t is higher (0.10) than those of the previous systems. Finally, inaoe and tokyo are penalised significantly for their very high answer times.
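Dividing formula (3) by x shows that the fraction of its MRR that a system keeps under MRRTe is 2 / (1 + e^t), which depends only on the answer time. The following sketch evaluates this factor for the normalised times of Table 1; the name `time_penalty` is our own, and the t values are the rounded figures printed in the table.

```python
import math
from typing import Dict


def time_penalty(t: float) -> float:
    """MRRTe(x, t) / x = 2 / (1 + e^t): the fraction of MRR that is kept."""
    return 2.0 / (1.0 + math.exp(t))


# Normalised answer times as printed in Table 1 (rounded to two decimals).
NORMALISED_TIMES: Dict[str, float] = {
    "daedalus1": 0.10,
    "tokyo": 1.00,
    "priberam": 0.01,
    "daedalus2": 0.03,
    "inaoe": 0.38,
    "alicante": 0.02,
}

for name, t in sorted(NORMALISED_TIMES.items(), key=lambda item: item[1]):
    kept = time_penalty(t)
    print(f"{name:10s} t = {t:.2f}  keeps {kept:.3f} of its MRR")
```

Under this reading, the slowest system (t = 1) keeps roughly 54% of its MRR, while a system with t = 0.01 keeps more than 99%, which matches the qualitative discussion above.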

Figure 3: Comparison of the results obtained by each system with the MRRTe evaluation measure on its iso-ranking curves.

Graphically, the different MRRTe values can be compared in Figure 3. The figure also shows that, to obtain the same ranking position as, for example, a system with a precision of 0.4 and a t of 0.2, a system needs a precision between 0.36 and 0.67, with its t varying between 0 and 1 depending on that precision. These characteristics make the MRRTe evaluation measure suitable for the evaluation of QA systems with a time constraint.
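The range of precisions that tie with a given system can be obtained by solving formula (3) for x along an iso-ranking curve, x = L (1 + e^t) / 2. The sketch below does this for the example system with precision 0.4 and t = 0.2; the function names are illustrative assumptions.

```python
import math


def mrrte(x: float, t: float) -> float:
    """Formula (3): 2x / (1 + e^t)."""
    return 2.0 * x / (1.0 + math.exp(t))


def precision_needed(level: float, t: float) -> float:
    """Precision x that satisfies 2x / (1 + e^t) = level for a given t."""
    return level * (1.0 + math.exp(t)) / 2.0


level = mrrte(0.4, 0.2)                # iso-ranking level of the example system
lowest = precision_needed(level, 0.0)  # tie with an instantaneous system
highest = precision_needed(level, 1.0) # tie with the slowest system (t = 1)
print(f"level = {level:.4f}, precision range = [{lowest:.2f}, {highest:.2f}]")
```

Any system on this curve with t between 0 and 1 therefore needs a precision inside this interval, so a system with very low precision cannot reach the same position however fast it is.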

5. Conclusions and future work

The evaluation of QA systems has been studied in depth mainly in three research forums: TREC, CLEF and NTCIR. However, these forums have focused only on evaluating the precision of the systems and have never assessed their efficiency (we take the answer time as the measure of efficiency). In most cases, systems tend to be very effective but very inefficient. For this reason, in this work we have studied the evaluation of QA systems taking their answer time into account as well.

For the evaluation of QA systems with a time constraint, we have proposed three measures (MRR2, MRRT, MRRTe). These measures are based on the Mean Reciprocal Rank (MRR) measure and the answer time. As preliminary results, we have seen that MRR2 only takes precision into account and that MRRT gives too much weight to time. We have overcome this drawback by proposing a new measure called MRRTe. This measure combines the MRR and the answer time, penalising systems with a high answer time; it is based on an exponential function. In conclusion, the new MRRTe measure makes it possible to rank systems by considering both their precision and their answer time.

In addition, we carried out a task at CLEF-2006 to evaluate QA systems with a time constraint (the first time an evaluation of this kind has been organised). This experiment has allowed us to establish the criteria for the evaluation of QA systems in a new scenario. The experiment was received with great interest by both the participants and the organisers.

Finally, our future work will consider other variables, such as the hardware of the systems, and will introduce new control parameters to give more weight to either precision or efficiency.

References

Cassan, A., H. Figueira, A. Martins, A. Mendes, P. Mendes, C. Pinto, and D. Vidal. 2006. Priberam's Question Answering System in a Cross-Language Environment. In Working Notes, CLEF 2006 Workshop.

de Pablo-Sánchez, C., A. González-Ledesma, A. Moreno, J. Martínez-Fernández, and P. Martínez. 2006. MIRACLE at the Spanish CLEF@QA 2006 Track. In Working Notes, CLEF 2006 Workshop.

Fukumoto, J., T. Kato, and F. Masui. 2002. An Evaluation of Question Answering Task. In Third NTCIR Workshop on Research in Information Retrieval, Question Answering and Summarization, October.

Ferrández, S., P. López-Moreno, S. Roger, A. Ferrández, J. Peral, X. Alvarado, E. Noguera, and F. Llopis. 2006. AliQAn and BRILI QA Systems at CLEF 2006. In Working Notes, CLEF 2006 Workshop.

Juárez-González, A., A. Téllez-Valero, C. Denicia-Carral, M. Montes y Gómez, and L. Villaseñor Pineda. 2006. INAOE at CLEF 2006: Experiments in Spanish Question Answering. In Working Notes, CLEF 2006 Workshop.

Magnini, B., D. Giampiccolo, P. Forner, C. Ayache, P. Osenova, A. Peñas, V. Jijkoun, B. Sacaleanu, P. Rocha, and R. Sutcliffe. 2006. Overview of the CLEF 2006 Multilingual Question Answering Track. In Working Notes, CLEF 2006 Workshop.

Voorhees, E. and H. Trang Dang. 2005. Overview of the TREC 2005 Question Answering Track. In TREC.

Whittaker, E. W. D., J. R. Novak, P. Chatain, P. R. Dixon, M. H. Heie, and S. Furui. 2006. CLEF2006 Question Answering Experiments at Tokyo Institute of Technology. In Working Notes, CLEF 2006 Workshop.



Text Categorization

Internal and external measures in the clustering of narrow-domain scientific abstracts ∗

Diego A. Ingaramo, Marcelo L. Errecalde
LIDIC, UNSL, Argentina
Avda Ejército de los Andes 950, San Luis (5700)
{daingara,merreca}@unsl.edu.ar

Paolo Rosso
DSIC, UPV, Spain
Camino de Vera s/n [email protected]

Summary: Clustering algorithms are usually evaluated with, or make use of, different internal (or objective) measures such as the Davies-Bouldin index or the Dunn index, which try to reflect structural properties of the clustering result. However, the presence of these structural properties does not guarantee the usability of the results for the user, a subjective property reflected by external measures such as the F-measure, which determine to what extent the groups obtained resemble those that would have been produced by a real manual categorisation. In previous work, an interesting correlation was observed between the (internal) expected density measure and the traditional (external) F-measure in clustering tasks with documents from the standard RCV1 corpus. In this work, we analyse whether this relationship also holds in clustering tasks on abstracts from very narrow domains. This kind of task has proved to be highly complex, and an analysis of this type can therefore be useful for determining which fundamental structural properties should be taken into account when designing clustering algorithms for this kind of domain.
Keywords: clustering of abstracts, very narrow domains, evaluation measures

Abstract: Clustering algorithms are usually based on (and evaluated with) internal (or objective) measures such as the Davies-Bouldin index or the Dunn index, which attempt to evaluate particular structural properties of the clustering result. However, the presence of such structural properties does not guarantee the interestingness or usability of the results for the user, a subjective property usually captured by external measures like the F-measure, which determine to what extent the resulting groups resemble a real human classification. In previous work, an interesting correspondence has been observed between the (internal) expected density measure and the (external) F-measure in clustering tasks with documents from the standard corpus RCV1. In this work, we investigate whether that correspondence also holds when clustering narrow-domain abstracts. This is a challenging problem, and we think that this kind of study can be useful for detecting the most relevant structural properties to consider when designing clustering algorithms for these domains.

Keywords: clustering of abstracts, narrow domains, evaluation measures

1. Introduction

Text clustering is the unsupervised assignment of documents to different categories. Although this kind of task is commonly studied on standard document collections, in many cases only the descriptive summaries (abstracts) are available, as happens with many scientific publications. Clustering abstracts is a considerable challenge because terms occur with low frequency in the documents. The task becomes even harder when the abstracts address similar topics, because the vocabularies of the documents overlap significantly. This task, known as clustering abstracts on narrow domains, has begun to be addressed in several recent works that present different proposals for dealing with the complexities of this kind of domain (Makagonov, Alexandrov, and Gelbukh, 2004), (Alexandrov, Gelbukh, and Rosso, 2005), (Pinto, Jiménez, and Rosso, 2006).

∗ This work was partially funded by the research projects TIN2006-15265-C06-04 and ANPCyT-PICT-2005-34015.

On the other hand, Stein (Stein, Meyer, and Wißbrock, 2003) points out that the traditional cluster validity metrics (Davies-Bouldin index, Dunn index, expected density, and others) are internal (or objective) measures that take into account different structural properties of the resulting groups. However, these measures do not guarantee the quality of the clustering with respect to the classification a user would have produced for the same task. That kind of information is usually expressed by external (or subjective) measures such as precision or the F-measure, whose computation requires knowing the real classification made by a human. A clustering algorithm generally has no access to this information. For this reason, internal measures are usually taken as a reference, in the hope that they adequately predict the external ones. This is the case for methods such as MajorClust (Stein and Niggemann, 1999), which approximates the partial connectivity function, or the AAT (Adaptive AntTree) algorithm (Ingaramo, Leguizamón, and Errecalde, 2005b), (Ingaramo, Leguizamón, and Errecalde, 2005a), which uses the Davies-Bouldin index in one of its stages.

Stein analyzes to what extent different internal measures of a clustering can predict its usability (subjective measures), using in his study different samples of a standard labeled corpus (RCV1). That study reports interesting results on the correlation between the expected density measure (internal) and the F-measure (external).

The goal of our work is to determine whether this correspondence also holds in a harder setting, namely the clustering of abstracts in narrow domains. This information could be used to adapt clustering algorithms that explicitly rely on internal measures (Ingaramo, Leguizamón, and Errecalde, 2005b), (Ingaramo, Leguizamón, and Errecalde, 2005a) to the characteristics of this kind of domain. The experimental work considers 3 corpora of scientific abstracts in very specific domains and a subset of a traditional corpus. In all cases, different document encodings and different percentages of the vocabulary are used. The clustering methods used are k-means, MajorClust, and an "artificial" clustering algorithm.

The article is organized as follows. Section 2 briefly summarizes the particularities of clustering abstracts in narrow domains. Section 3 describes some of Stein's considerations regarding internal and external clustering measures and details the measures used in this work. Section 4 describes the experimental work and the results obtained. Finally, conclusions and possible future work are presented.

2. Clustering of abstracts in narrow domains

Text categorization is the grouping of documents with similar topics, and it is a key component in the organization, retrieval, and inspection of the large volumes of documents currently accessible on the Internet, in digital libraries, etc. Several research works have addressed automatic text categorization in situations where a predefined classification scheme is available together with a collection of already classified documents. In these cases, machine learning techniques have proved highly effective at producing classifiers with very good performance on diverse document collections (Sebastiani, 2002), (Montejo and Ureña, 2006).

The task is more complex when the categories are formed in an unsupervised way and no collection of labeled documents is available as a reference. This introduces difficulties that do not arise in the supervised case, such as correctly determining the number of classes or deciding how to evaluate the result of the clustering process.

Although clustering techniques have repeatedly been applied to full documents from publicly accessible collections, access to many scientific publications is often restricted to their abstracts. In these cases, traditional clustering techniques tend to yield unstable and imprecise results because of the low frequency of the words present in the abstract and the occurrence of complete stock phrases that contribute nothing to the meaning of the document (e.g. "In this paper we present..."). Here it is important to distinguish between:

- Abstracts concerning clearly differentiated topics (sports, politics, economics, etc.).

- Abstracts concerning a narrow domain, where all abstracts address a similar topic and their vocabularies overlap very significantly.

The difficulty of clustering in the latter case has already been observed in several recent works (Alexandrov, Gelbukh, and Rosso, 2005), (Pinto, Jiménez, and Rosso, 2006) that propose different approaches to it. In (Makagonov, Alexandrov, and Gelbukh, 2004), for example, a suitable selection of keywords and a better evaluation of the similarity between documents were used, with experiments on two collections of abstracts from the CICLing 2002 and IFCS 2000 conferences. In (Alexandrov, Gelbukh, and Rosso, 2005), Stein's MajorClust method is proposed for clustering keywords and documents, with experiments on the same CICLing collection.

More recently, in (Jiménez, Pinto, and Rosso, 2005) a new experiment with this collection yielded better results by using the transition point method. Finally, (Pinto, Jiménez, and Rosso, 2006), (Pinto et al., 2006) show that this term selection technique can perform better than other unsupervised techniques on collections of abstracts.

These latter works share the conclusion that vocabulary size can significantly influence the F-measure when the transition point technique is used. For this reason, we decided that our analysis of the relationship between internal and external measures would consider different percentages of the vocabulary size, selected with this term selection technique.

3. Clustering evaluation measures

The work of Stein (Stein, Meyer, and Wißbrock, 2003) tried to determine whether internal validity measures for a text clustering correspond to the criteria an end user would apply to the same task. Within this framework, several traditional internal measures were analyzed, such as the Dunn and Davies-Bouldin families of indices, together with density-based measures such as the partial connectivity measure and the expected density measure. The analysis assumed that the user's real criterion was reflected in the (external) F-measure.

The experiments considered samples of the Reuters Text Corpus Volume 1 collection (Rose, Stevenson, and Whitehead, 2002) and different clustering algorithms such as k-means and MajorClust. The results showed that the traditional internal measures behave consistently even when the groups found are not good with respect to the F-measure. The expected density measure, in contrast, behaves better, which according to Stein is due to its independence from the shape of the groups and from the distances between groups and between the elements of each group. The expected density measure and the F-measure analyzed in Stein's work are briefly described below.

3.1. Expected density measure

A weighted graph $\langle V, E, w \rangle$ is said to be sparse if $|E| = O(|V|)$ and dense if $|E| = O(|V|^2)$. The density $\theta$ of a graph can therefore be computed from the equation $|E| = |V|^{\theta}$. With $w(G) = |V| + \sum_{e \in E} w(e)$, the relation for weighted graphs is:

$$w(G) = |V|^{\theta} \iff \theta = \frac{\ln(w(G))}{\ln(|V|)} \qquad (1)$$

$\theta$ can be used to compare the density of each induced subgraph $G' = \langle V', E', w' \rangle$ of $G$: $G'$ is said to be sparse (dense) with respect to $G$ if the ratio $\frac{w(G')}{|V'|^{\theta}}$ is smaller (larger) than 1.


Definition (Stein, Meyer, and Wißbrock, 2003): Let $C = \{C_1, \ldots, C_k\}$ be the groups of a weighted graph $G = \langle V, E, w \rangle$ and let $G_i = \langle V_i, E_i, w_i \rangle$ be the subgraph of $G$ induced by cluster $C_i$. The expected density $\rho$ of a clustering $C$ is:

$$\rho(C) = \sum_{i=1}^{k} \frac{|V_i|}{|V|} \cdot \frac{w(G_i)}{|V_i|^{\theta}} \qquad (2)$$

A higher value of $\rho$ indicates a better clustering.
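Equations (1) and (2) can be illustrated with a minimal Python sketch. It assumes the document graph is given as a dictionary of undirected, weighted edges; the function names and the toy graph in the usage note are hypothetical and are not taken from the original experiments.

```python
import math


def graph_weight(num_nodes, edge_weights):
    """w(G) = |V| + sum of the edge weights."""
    return num_nodes + sum(edge_weights)


def density_exponent(num_nodes, edge_weights):
    """theta such that w(G) = |V|**theta (Eq. 1); needs at least 2 nodes."""
    return math.log(graph_weight(num_nodes, edge_weights)) / math.log(num_nodes)


def expected_density(similarity, clusters):
    """Expected density rho(C) of a clustering (Eq. 2).

    similarity: dict mapping frozenset({u, v}) -> edge weight (assumed format)
    clusters:   list of disjoint sets of nodes covering the graph
    """
    num_nodes = len(set().union(*clusters))
    theta = density_exponent(num_nodes, similarity.values())
    rho = 0.0
    for cluster in clusters:
        # Edges of the induced subgraph: both endpoints inside the cluster.
        inner = [w for edge, w in similarity.items() if edge <= cluster]
        w_gi = graph_weight(len(cluster), inner)
        rho += (len(cluster) / num_nodes) * w_gi / (len(cluster) ** theta)
    return rho
```

On a toy graph with nodes a, b, c, d and unit-weight edges a-b and c-d, the grouping {a, b}, {c, d} obtains a higher expected density than {a, c}, {b, d}, as the definition suggests.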

3.2. The F-measure

The F-measure combines the precision and recall measures.

Definition: Let $D$ be a set of documents, $C = \{C_1, \ldots, C_k\}$ a clustering of $D$, and $C^* = \{C^*_1, \ldots, C^*_l\}$ the real classification of the documents in $D$. The recall of a group $j$ with respect to class $i$, $rec(i, j)$, is defined as $|C_j \cap C^*_i| / |C^*_i|$. The precision of a group $j$ with respect to class $i$, $prec(i, j)$, is defined as $|C_j \cap C^*_i| / |C_j|$. The F-measure combines both functions as follows:

$$F_{i,j} = \frac{2}{\frac{1}{prec(i,j)} + \frac{1}{rec(i,j)}} \qquad (3)$$

and the global F-measure is defined as:

$$F = \sum_{i=1}^{l} \frac{|C^*_i|}{|D|} \cdot \max_{j=1,\ldots,k} \{F_{i,j}\} \qquad (4)$$
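As an illustration of Equations (3) and (4), the following Python sketch computes the global F-measure from two lists of document-id sets. It assumes that the reference classes partition $D$, so that $|D|$ is the sum of the class sizes; the function name is hypothetical.

```python
def f_measure(clusters, classes):
    """Global F-measure (Eq. 4) of `clusters` against the reference `classes`.

    Both arguments are lists of sets of document ids. The reference classes
    are assumed to partition the document set D.
    """
    total = sum(len(ref) for ref in classes)
    score = 0.0
    for ref in classes:
        best = 0.0
        for cluster in clusters:
            overlap = len(cluster & ref)
            if overlap == 0:
                continue  # F_ij is taken as 0 when the sets do not intersect
            prec = overlap / len(cluster)
            rec = overlap / len(ref)
            best = max(best, 2 / (1 / prec + 1 / rec))  # Eq. 3
        score += len(ref) / total * best
    return score
```

A clustering identical to the reference scores 1, while merging two equally sized classes into a single group scores 2/3, because each class then has recall 1 but precision 1/2.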

In our case, it is important to determine whether the correspondence observed by Stein between the two measures on the RCV1 collection still holds when clustering abstracts from narrow domains. If it does, clustering methods that explicitly use other internal measures could be adapted to this kind of domain. Otherwise, one could investigate whether other internal measures behave better in these cases.

4. Experiments

4.1. Data sets

The experiments used the 4 collections described below, which differ mainly in the number of documents and in how the documents are distributed among the groups.

4.1.1. The CICLing2002 collection

This corpus consists of a small number of abstracts (48), manually distributed in a balanced way among 4 groups that correspond to topics addressed at the CICLing 2002 conference. It is a small corpus (23,971 bytes) with 3382 terms in total and a vocabulary of size 953. Table 1 shows the distribution of the abstracts among the groups.

Category           No. of abstracts
Linguistics        11
Ambiguity          15
Lexicon            11
Text processing    11
TOTAL              48

Table 1: Distribution of CICLing2002

4.1.2. The Hep-Ex collection

This corpus, based on the collection of abstracts of the University of Jaén, Spain (Montejo, Ureña López, and Steinberg, 2005), is composed of 2922 abstracts from the field of physics, originally stored on the servers of the Conseil Européen pour la Recherche Nucléaire (CERN). The corpus is 962,802 bytes in size, has 135,969 terms in total and a vocabulary of size 6150, and distributes the 2922 abstracts among 9 categories as shown in Table 2. It has more groups than CICLing2002 and is also highly unbalanced, since a single group contains almost 90% of the documents.

Category                        No. of abstracts
Experimental results            2623
Detectors and exp. techniques   271
Accelerators                    18
Phenomenology                   3
Astronomy                       3
Information transfer            1
Nonlinear systems               1
Other fields of physics         1
XX                              1
TOTAL                           2922

Table 2: Distribution of Hep-Ex

4.1.3. The KnCr collection

This collection is a subset of the MEDLINE collection of scientific texts in the field of medicine, restricted to abstracts on cancer-related topics. It comprises 900 abstracts distributed among 16 categories, as shown in Table 3. The corpus is 834,212 bytes in size, with 113,822 terms in total and a vocabulary of size 11,958. Preliminary studies (Pinto and Rosso, 2006) show the high complexity of this collection and the challenge it poses.

Category           No. of abstracts
Blood              64
Bones              8
Brain              14
Breast             119
Colon              51
Genetic studies    66
Genitals           160
Lungs              29
Liver              99
Lymphoma           30
Renal              6
Skin               31
Stomach            12
Therapy            169
Thyroid            20
Other              22
TOTAL              900

Table 3: Distribution of KnCr

4.1.4. The 5-MNG collection

The 3 previous collections consist of scientific abstracts in very specific domains. To compare the results against a collection without these characteristics, a subset of the full-text collection MiniNewsGroups¹ was generated so that the selected groups correspond to clearly differentiated topics. This collection, which we call 5-MNG, is composed of 5 balanced groups of 100 documents each (see Table 4).

4.2. Experimental design

The experimental work analyzed whether there is a general correspondence between the expected density and the F-measure, while avoiding bias from factors such as the size of the vocabulary used or the encoding of the documents. For this reason, we sought a representative sample of results covering different scenarios.

¹ http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html. 20 Newsgroups, the original data set. Ken Lang, 1993.

For the document encoding, for example, results were obtained with most of the 20 SMART encodings (Salton, 1971). For vocabulary reduction, the most relevant terms were selected with the transition point technique, which has been shown in recent studies to have a significant impact on the F-measure in this kind of domain (Pinto et al., 2006). For each corpus, results were obtained with the following percentages of the vocabulary: 2%, 5%, 10%, 20%, 40%, 60%, 80%, and 100% (full vocabulary).

The clustering algorithms used were k-means and MajorClust. The former requires the number of clusters to be specified; the latter does not. An artificial clustering algorithm of the kind used by Stein in his experiments was also implemented. Since the reference categorization $C^*$ is known, it is possible to generate artificial clusterings $C_1, \ldots, C_n$ that differ in the amount of noise introduced. The noise is generated by a controlled exchange of pairs of document subsets between groups, where the subsets range from a single document to 50% of the documents of a group.
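The noise-injection idea can be sketched in Python as follows. This is only one plausible reading of the procedure, not the authors' implementation: it assumes that each artificial clustering is obtained from the reference grouping by a single swap of equally sized subsets between two randomly chosen groups, and the function names are hypothetical.

```python
import random


def noisy_clustering(reference, swap_size, rng):
    """Copy the reference grouping and swap `swap_size` documents
    between two randomly chosen groups (assumed noise model)."""
    groups = [sorted(g) for g in reference]
    a, b = rng.sample(range(len(groups)), 2)
    k = min(swap_size, len(groups[a]), len(groups[b]))
    from_a = rng.sample(groups[a], k)
    from_b = rng.sample(groups[b], k)
    groups[a] = [d for d in groups[a] if d not in from_a] + from_b
    groups[b] = [d for d in groups[b] if d not in from_b] + from_a
    return [set(g) for g in groups]


def noise_series(reference, rng):
    """Artificial clusterings with swaps from 1 document up to 50% of
    the smallest group."""
    half = min(len(g) for g in reference) // 2
    return [noisy_clustering(reference, k, rng) for k in range(1, half + 1)]
```

Each clustering in the series can then be scored with the expected density and the F-measure to obtain one (ρ, F) point per noise level.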

4.3. Results

Category       No. of documents
Graphics       100
Motorcycles    100
Baseball       100
Space          100
Politics       100
TOTAL          500

Table 4: Distribution of 5-MNG

Figures 1, 2, 3, and 4 show the results of the artificial clustering on the collections described above. In all cases, the x-axis values are the expected densities ρ of the clusterings produced by this algorithm, and the y-axis values are the F-measure of each clustering. Note that, in addition to the points for the artificial clustering results, a line labeled "Ideal curve of the sample" is also plotted. This line is the linear function through the points (ρ1, F1) and (ρ2, F2), where ρ1 and ρ2 are the minimum and maximum expected density found in the experiments for the corpus, and F1 and F2 are the minimum and maximum F-measure obtained for that corpus in our experiments. It represents an idealized result in which the F-measure grows linearly with the expected density. Since this would be a desirable pattern for the correlation between the two measures, the line is used in all subsequent figures as a reference against which the results of the different clustering algorithms are compared.
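The reference line described above can be reproduced with a few lines of Python. This is a minimal sketch with a hypothetical function name; it assumes that the sample contains at least two distinct density values.

```python
def ideal_line(points):
    """Return the linear function through (min rho, min F) and (max rho, max F).

    points: list of (rho, F) pairs observed for one corpus.
    """
    rhos = [rho for rho, _ in points]
    fs = [f for _, f in points]
    rho1, rho2 = min(rhos), max(rhos)
    f1, f2 = min(fs), max(fs)
    slope = (f2 - f1) / (rho2 - rho1)  # assumes rho2 > rho1
    return lambda rho: f1 + slope * (rho - rho1)
```

Note that the two anchor points are built from the extremes of each measure separately, so they need not correspond to any clustering that was actually observed.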

Figure 1: CICLing2002 (artificial clustering). [Plot "Results CICLing2002": F-measure (y-axis) against expected density (x-axis); series: artificial clustering, ideal curve of the sample.]

Figure 2: 5-MNG (artificial clustering). [Plot "Results 5MNG": F-measure (y-axis) against expected density (x-axis).]

All of these figures show a good correspondence between the expected density measure and the F-measure when noise is gradually introduced into the clustering. These results resemble those obtained by Stein with artificial clusterings on RCV1. In our case, however, two situations deserve attention. The first concerns CICLing2002 (Figure 1), where small variations in density are accompanied by significant variations in F. This seems to indicate that when there are few groups and few documents per group, the expected density does not provide a very stable estimate of F. This instability is not observed in a collection with few groups of full texts such as 5-MNG (Figure 2), whose curve closely resembles the ideal curve for that corpus. The second concerns Hep-Ex (Figure 3), where the F-measure remains almost unchanged as the expected density varies. This behavior may be explained by the fact that one group of this collection contains almost 90% of the documents and the artificial clustering starts from the perfect grouping of the documents. It is therefore to be expected that, although noise is gradually added by exchanging documents between groups, its impact on the F-measure does not become significant. The F-measure thus stays high regardless of the expected density values. The collection of abstracts that shows the best correspondence between the expected density and the F-measure is KnCr (Figure 4), whose curve is almost as close to the ideal curve as that of 5-MNG.

Figure 3: Hep-Ex (artificial clustering). [Plot "Results Hep-Ex": F-measure (y-axis) against expected density (x-axis).]

Figure 4: KnCr (artificial clustering). [Plot "Results Cancer": F-measure (y-axis) against expected density (x-axis).]

Figure 5: CICLing2002 (k-means). [Plot: F-measure (y-axis) against expected density (x-axis); series: k-means, ideal curve of the sample.]

Figure 6: 5-MNG (k-means). [Plot "Results 5MNG": F-measure (y-axis) against expected density (x-axis).]

The second set of results was obtained with the k-means algorithm (given the correct number of groups) and is shown in Figures 5, 6, 7, and 8. For Hep-Ex and KnCr, an increase in expected density is not seen to imply an increase in the corresponding F-measure. For 5-MNG, in contrast, there appears to be a more direct relationship between the growth of the expected density and the growth of F. Nevertheless, the values of F become more unstable for density values above 0.73. Since no clear relationship between density and the F-measure is visible for CICLing2002 either, we can infer that, although the results on a corpus of full documents with differentiated topics such as 5-MNG are consistent with those obtained by Stein, this relationship between the two measures does not seem to hold for collections of abstracts from narrow domains.

Figure 7: KnCr (k-means). [Plot "Results Cancer": F-measure (y-axis) against expected density (x-axis).]

Figure 8: Hep-Ex (k-means). [Plot "Results Hep-Ex": F-measure (y-axis) against expected density (x-axis).]

The results for the collections of abstracts did not improve with an algorithm such as MajorClust, which determines the number of groups automatically because, unlike the previous algorithms, it has no information about the correct number of groups. As a representative example, Figure 9 shows the performance of MajorClust on the CICLing2002 collection. The range of density values is wider than with the two previous algorithms, because variation in the number of groups makes the density values vary significantly. However, these higher expected density values do not bring any improvement in the F-measure either.



Figure 9: CICLing2002 (MajorClust). [Plot: F-measure (y-axis) against expected density (x-axis); series: MajorClust, ideal curve of the sample.]

5. Conclusions and future work

The results obtained in this work with the 5-MNG collection confirm Stein's observation that the expected density can be a good indicator of the F-measure when clustering full documents on dissimilar topics. However, this relationship between the two measures does not seem to hold when clustering abstracts from narrow domains. These results are further evidence of the intrinsic difficulty of this kind of domain. As future work, it would be interesting to analyze how other internal measures, such as the Davies-Bouldin index or the Dunn index, perform in this kind of domain and how they relate to the F-measure. Based on such studies, the most suitable internal measure could be incorporated into the algorithms that use one in some of their stages. In this way, an acceptable clustering algorithm adapted to the characteristics of this difficult domain might be achieved.

Bibliography

Alexandrov, M., A. Gelbukh, and P. Rosso. 2005. An Approach to Clustering Abstracts. In Proceedings of the 10th International Conference NLDB-05, LNCS, pages 275–285. Springer-Verlag.

Ingaramo, D., G. Leguizamón, and M. Errecalde. 2005a. Adaptive clustering with artificial ants. Journal of Computer Science and Technology, 5(04):264–271.

Ingaramo, D., G. Leguizamón, and M. Errecalde. 2005b. Clustering dinámico con hormigas artificiales. In Proceedings of the CACIC 2005.

Jiménez, H., D. Pinto, and P. Rosso. 2005. Uso del punto de transición en la selección de términos índice para agrupamiento de textos cortos. In Procesamiento del Lenguaje Natural, pages 383–390.

Makagonov, P., M. Alexandrov, and A. Gelbukh. 2004. Clustering abstracts instead of full texts. In Proc. of the TSD-2004, pages 129–135.

Montejo, A. and L. A. Ureña. 2006. Binary classifiers versus adaboost for labeling of digital documents. In Procesamiento del Lenguaje Natural, pages 319–326.

Pinto, D., H. Jiménez, and P. Rosso. 2006. Clustering Abstracts of Scientific Texts Using the Transition Point Technique. In A. Gelbukh, editor, Proceedings of the CICLing 2006, volume 3878 of LNCS, pages 536–546. Springer-Verlag.

Pinto, D. and P. Rosso. 2006. Kncr: A short-text narrow-domain sub-corpus of Medline, TLH 2006.

Pinto, D., P. Rosso, J. Alfons, and H. Jiménez. 2006. A comparative study of clustering algorithms on narrow-domain abstracts. In Procesamiento del Lenguaje Natural, pages 41–49.

Rose, T.G., M. Stevenson, and M. Whitehead. 2002. The reuters corpus volume 1: from yesterdays news to tomorrows language resources. In Proceedings of the Third ICLRE, pages 29–31.

Salton, Gerard. 1971. The Smart Retrieval System: Experiments in Automatic Document Processing. Prentice Hall.

Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1–47.

Stein, B., S. Meyer, and F. Wißbrock. 2003. On Cluster Validity and the Information Need of Users. In Proceedings of the 3rd IASTED, pages 216–221, Anaheim, Calgary, Zurich, September. ACTA Press.

Stein, B. and O. Niggemann. 1999. On the Nature of Structure and its Identification. Volume 1665 of Lecture Notes in Computer Science, pages 122–134. Springer, June.



Knowledge integration in a specific domain for multi-label categorization

María Teresa Martín Valdivia Universidad de Jaén

Campus Las Lagunillas, Edif. A3. E-23071 [email protected]

Arturo Montejo Ráez Universidad de Jaén


Manuel Carlos Díaz Galiano Universidad de Jaén


L. Alfonso Ureña López Universidad de Jaén


Resumen: En este artículo se presenta un estudio sobre el uso e integración de una ontología en un corpus biomédico. Nuestro objetivo es comprobar cómo afectan distintas maneras de enriquecimiento e integración de conocimiento sobre un corpus de dominio específico cuando se aplica sobre un sistema de categorización de textos multietiqueta. Se han realizado varios experimentos con distintos tipos de expansión y con diferentes algoritmos de aprendizaje. Los resultados obtenidos muestran una mejora en los experimentos que realizan expansión sobre todo en los casos en los que se utiliza el algoritmo SVM.

Palabras clave: Ontología MeSH, corpus biomédico (CCHMC), categorización multietiqueta, integración de conocimiento, aprendizaje automático

Abstract: In this paper, we present a study on the integration of an ontology into a biomedical corpus. Our aim is to assess how several approaches to textual enrichment and knowledge integration on a domain-specific corpus affect multi-label text categorization. The reported experiments vary the expansion strategy and the learning algorithms considered. Our results show that expansion improves performance, especially when the SVM algorithm is used.

Keywords: MeSH ontology, biomedical corpus (CCHMC), multi-label text categorization, knowledge integration, machine learning.

1 Introduction

Natural language processing techniques are being applied with increasing efficiency in the biomedical domain, and much recent research explores their use there (Karamanis 2007, Müller et al 2006). The need to label and categorize medical texts automatically is becoming ever more evident.

The importance of research on and development of information search and retrieval systems for biomedicine, which support specialists in their daily work, is undeniable.

This work presents a study of how an ontology specific to the biomedical domain, the MeSH ontology (MeSH 2007), influences a categorization system. Specifically, the ontology was used to expand the terms of a document to be categorized, with the aim of improving the results of a multi-label categorization system. We believe that incorporating knowledge by integrating resources such as ontologies can significantly improve the results obtained with information systems.

To carry out the experiments, different configurations were used, varying both the machine learning algorithm and the parameters of each one. Specifically, we used the SVM (Support Vector Machine) algorithm, a perceptron-type neural network called PLAUM, and the Bayesian regression algorithm BBR. The experiments show that SVM improves the results in practically all cases.

The article is organized as follows. First, the multi-label text categorization task and the categorization system used, TECAT, are briefly described. Next, the two integrated biomedical resources (the CCHMC corpus and the MeSH ontology) are presented. The following section shows the experiments and the results obtained. Finally, conclusions and future work are discussed.

2 Multi-label categorization

The automatic assignment of keywords to documents opens new possibilities for document exploration (Montejo, 2004), and it has prompted the scientific community to propose solutions. Information retrieval, together with natural language processing techniques and machine learning algorithms, is the substrate from which Automatic Text Categorization emerges (Sebastiani, 2002). The present work falls within this research area, and this is where its main contributions lie.

Three cases are distinguished in document classification:

1. Binary classification. The classifier must return one of two possible categories, or a YES/NO answer. These are the simplest systems, and also the best known in machine learning.

2. Multi-class classification. The classifier must return one category out of several proposed ones. Such a system can be built on the previous one.

3. Multi-label classification. The document is labeled not with a single class, as in the previous case, but with several of the available categories. This is the most complex problem, but it can be simplified by using binary classifiers whose answers can be combined (for example, through a ranking of classes) or by training a YES/NO binary classifier for each class (as in the system described in this work).

We used the TECAT [1] software, which implements a multi-label classification algorithm built on binary base classifiers. The algorithm, shown below (Algorithm 1), trains the candidate binary classifiers for each class and keeps the one that performs best under a given evaluation measure. Classes for which no classifier reaches a minimum performance are discarded.

[1] Available at http://sinai.ujaen.es/wiki/index.php/TeCat

Input:
  - a set Dt of multi-labeled documents for training
  - a set Dv of validation documents
  - a threshold α on a given evaluation measure
  - a set L of possible labels (classes)
  - a set C of candidate binary classifiers
Output:
  - a set C' = {c1, ..., ck, ..., c|L|} of trained binary classifiers
Pseudo-code:
  C' ← ∅
  For-each li in L:
    T ← ∅
    For-each cj in C:
      train(cj, li, Dt)
      T ← T ∪ {cj}
    End-for-each
    cbest ← best(T, Dv)
    If evaluate(cbest) > α:
      C' ← C' ∪ {cbest}
    End-if
  End-for-each

Algorithm 1. Training of base classifiers
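Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not TECAT's actual code: the evaluation measure is taken to be F1 on the validation set, and the two toy trainers (`keyword_trainer`, `always_no_trainer`), the label names and the documents are all hypothetical.

```python
from typing import Callable, Dict, Sequence, Set, Tuple

Doc = Tuple[str, Set[str]]                       # (text, gold labels)
Model = Callable[[str], bool]                    # trained binary classifier
Trainer = Callable[[str, Sequence[Doc]], Model]  # (label, Dt) -> trained model


def words(text: str):
    return text.lower().split()


def f1_on(model: Model, label: str, docs: Sequence[Doc]) -> float:
    """F1 of one binary classifier for one label (assumed evaluation measure)."""
    tp = fp = fn = 0
    for text, gold_labels in docs:
        pred, gold = model(text), label in gold_labels
        tp += int(pred and gold)
        fp += int(pred and not gold)
        fn += int(gold and not pred)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def train_base_classifiers(train_docs, valid_docs, labels, candidates, threshold):
    """Keep, for each label, the best candidate if it beats the threshold."""
    selected: Dict[str, Model] = {}
    for label in labels:
        trained = [trainer(label, train_docs) for trainer in candidates]
        best = max(trained, key=lambda m: f1_on(m, label, valid_docs))
        if f1_on(best, label, valid_docs) > threshold:
            selected[label] = best          # otherwise the label is discarded
    return selected


# Hypothetical toy trainers, standing in for SVM, PLAUM or BBR.
def keyword_trainer(label, docs):
    pos = {w for text, labs in docs if label in labs for w in words(text)}
    neg = {w for text, labs in docs if label not in labs for w in words(text)}
    cues = pos - neg                        # words seen only in positive docs
    return lambda text: any(w in cues for w in words(text))


def always_no_trainer(label, docs):
    return lambda text: False


Dt = [("cough and fever", {"cough"}), ("persistent cough", {"cough"}),
      ("wheezing noted", {"wheeze"}), ("normal study", {"normal"})]
Dv = [("dry cough", {"cough"}), ("normal study", {"normal"}),
      ("clear lungs", {"wheeze"})]

selected = train_base_classifiers(
    Dt, Dv, ["cough", "wheeze", "normal"],
    [keyword_trainer, always_no_trainer], threshold=0.5)
```

In this toy run no candidate for the "wheeze" label scores above the threshold on the validation set, so that label ends up without a classifier, which mirrors the discarding step of the algorithm.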

María Teresa Martín Valdivia, Manuel Carlos Díaz Galiano, Arturo Montejo Ráez y L. Alfonso Ureña-López


3 Resources used

Our main goal is to study how the use of a medical ontology on a biomedical corpus affects the development of an automatic system for multi-label text categorization. To this end, we used the two resources described below.

3.1 CCHMC corpus

This corpus was developed by "The Computational Medicine Center" [2]. It contains anonymized medical records collected at the Department of Radiology of the Cincinnati Children's Hospital Medical Center (CCHMC) (CMC, 2007).

The collection consists of 978 documents, all radiology reports labeled with ICD-9-CM [3] codes (International Classification of Diseases, 9th Revision, Clinical Modification). ICD-9-CM is a catalogue of diseases, each coded with a 3- to 5-digit number with a decimal point after the third digit. The codes are organized hierarchically, with runs of consecutive codes grouped together at the upper levels.

The number of codes assigned to each document ranges from 1 to 7. Table 1 shows the distribution of the number of labels per document. The total number of distinct labels used in the collection is 142.

[2] http://www.computationalmedicine.org/

Classes   Documents
1         389
2         368
3         162
4         46
5         12
7         1

Table 1. Number of classes assigned per document

Figure 1 shows an example document. As can be seen, each document provides very little information, but what it provides is highly relevant and well structured. The collection was annotated manually by three experts, so each document carries three sets of annotations, one per expert. In addition, a further set of labels was added that consolidates the majority vote of the three experts. Each report also contains two main text parts: the clinical history and the physician's impression or diagnosis.

3.2 MeSH ontology

The MeSH [4] (Medical Subject Headings) ontology is developed and maintained by the National Library of Medicine and is used as an indexing and search tool for topics related to medicine and health. It consists of a set of about 23,000 terms, called descriptors, arranged hierarchically so that searches can be made at several levels of specificity. A descriptor may appear in several branches.

[3] http://www.cdc.gov/nchs/icd9.htm

Figure 1. Example of a document from the CCHMC collection

<doc id="97636670" type="RADIOLOGY_REPORT">
  <codes>
    <code origin="CMC_MAJORITY" type="ICD-9-CM">786.2</code>
    <code origin="COMPANY3" type="ICD-9-CM">786.2</code>
    <code origin="COMPANY1" type="ICD-9-CM">204.0</code>
    <code origin="COMPANY1" type="ICD-9-CM">786.2</code>
    <code origin="COMPANY1" type="ICD-9-CM">V42.81</code>
    <code origin="COMPANY2" type="ICD-9-CM">204.00</code>
    <code origin="COMPANY2" type="ICD-9-CM">786.2</code>
  </codes>
  <texts>
    <text origin="CCHMC_RADIOLOGY" type="CLINICAL_HISTORY"> Eleven year old with ALL, bone marrow transplant on Jan. 2, now with three day history of cough.</text>
    <text origin="CCHMC_RADIOLOGY" type="IMPRESSION"> 1. No focal pneumonia. Likely chronic changes at the left lung base. 2. Mild anterior wedging of the thoracic vertebral bodies.</text>
  </texts>
</doc>

Integración de Conocimiento en un Dominio Específico para Categorización Multietiqueta

Several studies show that using and integrating information from ontologies and controlled-vocabulary resources can significantly improve information processing systems (Chevallet, Lim and Radhouani, 2006; Guyot, Radhouani and Falquet, 2005; Navigli, Velardi and Gangemi, 2003). Specifically, we use the MeSH ontology to expand the documents of the CCHMC corpus that are to be categorized. The aim is to add knowledge to the collection in order to improve the results of a multi-label categorization system.

4 Description of the experiments

4.1 Expansion with MeSH

Because each document in the collection contains little information, we used the MeSH ontology to expand the documents with medical information. The aim is to add high-quality information that helps improve document categorization.

[4] http://www.nlm.nih.gov/mesh/

However, indiscriminate use of all the terms extracted from the ontology can make the results worse, since it would introduce too much noise. This is shown, for example, in (Chevallet, Lim and Radhouani, 2006), where selecting the MeSH categories closest to the subject matter of the documents is shown to improve the quality of the expansion.

To limit the number of expanded terms, we restricted the categories used for the expansion. Although the first level of MeSH includes 16 general categories, only the following three were selected:

A: Anatomy

C: Diseases

E: Analytical, Diagnostic, and Therapeutic Techniques and Equipment

These three categories were chosen because the corpus contains clinical cases of children with diseases related to the respiratory system, so these categories should cover most of the terms used in the corpus.

Document: Fever x5 days. Findings consistent with viral or reactive airway disease.
Expansion ul adds: pathologic_processes body_temperature_changes
Expansion sl adds: genomic_instability acantholysis hyperplasia growth disorders
Expansion ll adds: fever_of_unknown_origin syndrome sweating_sickness
Expansion ul-ll adds: pathologic_processes fever_of_unknown_origin body_temperature_changes syndrome

Figure 2. Expansion strategies with MeSH

During expansion, we look for the first node of the ontology that matches the word to be expanded. Once the node is found, the terms that will form part of the expansion can be selected in three different ways (see Figure 2):

Upper level (ul): the term one level above the node is selected, that is, the parent node.

Same level (sl): the terms at the same level as the node are selected, that is, the sibling nodes.

Lower level (ll): the terms immediately below the node are selected, that is, the child nodes.

The words inside the nodes selected for the expansion are treated as entities. Therefore, if a node contains a multiword expression (several words separated by spaces), those words are added to the expansion as a single term.

To study how the system behaves with different kinds of expansion, we designed several combinations of the three expansions above, producing expansions such as ul+sl, ul+ll, ul+sl+ll... The first column of Table 3 lists all the expansions carried out.
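The three strategies and their combinations can be illustrated on a toy hierarchy. The sketch below is only an assumption-laden illustration: the `PARENT` table is a made-up fragment, not the real MeSH tree, each term has a single parent (so the "first matching node" rule and multi-branch descriptors are not modelled), and multiword node names are already joined with underscores so that they enter the expansion as one term.

```python
from typing import List

# Hypothetical child -> parent table; NOT the actual MeSH hierarchy.
PARENT = {
    "fever": "body_temperature_changes",
    "hypothermia": "body_temperature_changes",
    "fever_of_unknown_origin": "fever",
    "sweating_sickness": "fever",
}


def children(node: str) -> List[str]:
    return sorted(c for c, p in PARENT.items() if p == node)


def expansion_terms(word: str, strategies: List[str]) -> List[str]:
    """Terms added for one word under a combination such as ["ul", "ll"]."""
    terms: List[str] = []
    parent = PARENT.get(word)
    for strategy in strategies:
        if strategy == "ul" and parent:            # upper level: parent node
            terms.append(parent)
        elif strategy == "sl" and parent:          # same level: sibling nodes
            terms += [c for c in children(parent) if c != word]
        elif strategy == "ll":                     # lower level: child nodes
            terms += children(word)
    return terms


def expand(text: str, strategies: List[str]) -> str:
    """Append the expansion terms of every matching word to the document."""
    extra: List[str] = []
    for token in text.lower().split():
        for term in expansion_terms(token.strip(".,;:"), strategies):
            if term not in extra:
                extra.append(term)
    return text + " " + " ".join(extra) if extra else text


expanded_ul = expand("Fever x5 days.", ["ul"])
expanded_ul_ll = expand("Fever x5 days.", ["ul", "ll"])
```

Combined expansions are obtained simply by passing several strategy names, so `["ul", "sl", "ll"]` corresponds to the ul+sl+ll configuration.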

4.2 TECAT configurations

Once the expansion was done, each experiment was run with the following TECAT settings:

Stop words were removed.

Word stems were obtained with the Porter stemmer (Porter, 1980).

The resulting features were filtered by information gain (Shannon, 1948), limiting the set to 50,000 features.

Features were weighted with the TF.IDF scheme.

Vectors were normalized with the cosine function.

Because TECAT lets us apply several algorithms at once, we studied the following configurations:

SVM-multi: several configurations of the SVM algorithm (Joachims, 1998) are passed to TECAT at once. These configurations give additional weight to positive examples (which are usually scarce), with the values 1, 2, 5, 10 and 20; that is, 5 different SVM configurations that TECAT uses as independent base classifiers.

PLAUM-multi: likewise, several configurations of the PLAUM perceptron (Li et al., 2002), with weights for positive examples in {0, 1, 10, 100} and weights for negative examples in {-10, -1, 0, 1}. This means passing 16 different PLAUM configurations to TECAT at once.

BBR-multi: similarly, the BBR algorithm (Genkin et al., 2006) was parameterized with threshold values {0, 1, 2, 3, 4, 5} and utility values {0, 1, 2, 3}. Not all combinations were tested, so 10 configurations were considered for this algorithm.

Configurations that combine several algorithms were built either from the simple configuration of each algorithm or from the combination of the multiple parameterizations described above for each one.
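The sizes of these parameter grids are easy to check by enumeration. The dictionaries below are purely illustrative: the key names are hypothetical and do not reproduce TECAT's configuration syntax. For BBR only the full grid is listed, since the text does not say which 10 of the 24 combinations were used.

```python
from itertools import product

# One candidate per positive-example weight.
svm_multi = [{"algorithm": "SVM", "positive_weight": w}
             for w in (1, 2, 5, 10, 20)]

# Every pairing of a positive weight with a negative weight.
plaum_multi = [{"algorithm": "PLAUM", "positive_weight": p, "negative_weight": n}
               for p, n in product((0, 1, 10, 100), (-10, -1, 0, 1))]

# Full threshold x utility grid; the experiments used only 10 of these.
bbr_grid = list(product((0, 1, 2, 3, 4, 5), (0, 1, 2, 3)))

print(len(svm_multi), len(plaum_multi), len(bbr_grid))
```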

5 Evaluation

The results were evaluated using 10-fold cross-validation. That is, the collection was split into 10 different partitions, and each partition in turn was used for testing while the rest were used for training. The final evaluation results are the average over the runs, one per partition. This reduces the effect that choosing a particular group of documents for training or evaluation could have on the final result.
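The procedure can be sketched as follows. How the documents were actually assigned to partitions is not stated, so the interleaved assignment used here is an assumption, and `evaluate` is a placeholder for training on one index list and scoring on the other.

```python
from typing import Callable, Iterator, List, Tuple


def k_fold_splits(n_docs: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices), one pair per partition."""
    folds = [list(range(i, n_docs, k)) for i in range(k)]   # interleaved folds
    for i in range(k):
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, folds[i]


def cross_validate(n_docs: int,
                   evaluate: Callable[[List[int], List[int]], float],
                   k: int = 10) -> float:
    """Average the score of `evaluate` over the k train/test runs."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(n_docs, k)]
    return sum(scores) / len(scores)


# With the 978 documents of the collection, every document is tested once.
splits = list(k_fold_splits(978))
mean_test_size = cross_validate(978, lambda train, test: len(test))
```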



Given the answers of an automatic classification system and the true labels that a human expert would assign, we can build the following contingency table:

                     YES is correct   NO is correct
System says YES            A                B
System says NO             C                D

Table 2. Contingency table.

The measures considered are precision (P), recall (R) and F1, the last of which gives the most complete picture of the system's behavior. They were obtained by micro-averaging, that is, by accumulating the hits and misses over all classes and computing the final values from the accumulated counts, as the following equations show in terms of the entries of the contingency table above:

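$$P = \frac{\sum_{c \in C} A_c}{\sum_{c \in C} A_c + \sum_{c \in C} B_c}$$

$$R = \frac{\sum_{c \in C} A_c}{\sum_{c \in C} A_c + \sum_{c \in C} C_c}$$

$$F_1 = \frac{2PR}{P + R}$$

Here the index c runs over the set of classes, and A_c, B_c and C_c are the entries of the contingency table for class c.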
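As a small worked example of micro-averaging, the sketch below accumulates the contingency counts over classes before computing the three measures. The per-class counts are invented for illustration and are not taken from the experiments.

```python
from typing import Dict, Tuple


def micro_prf(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """counts maps each class to (A, B, C) from the contingency table:
    A = system YES / correct YES, B = system YES / correct NO,
    C = system NO / correct YES."""
    a = sum(c[0] for c in counts.values())
    b = sum(c[1] for c in counts.values())
    c_missed = sum(c[2] for c in counts.values())
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c_missed) if a + c_missed else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented counts: a frequent class handled well and a rare class handled badly.
example_counts = {"frequent": (8, 2, 0), "rare": (1, 0, 3)}
p, r, f1 = micro_prf(example_counts)
```

Because the counts are pooled, the frequent class dominates the result; here the accumulated totals are A = 9, B = 2 and C = 3.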

The results are shown in Tables 3, 4, 5 and 6. As can be seen, integrating the MeSH ontology improves the results in practically all cases except for PLAUM, and the improvement is largest with the SVM algorithm. In fact, as Table 3 shows, the SVM-multi configuration gives the best results regardless of the type of expansion.

From the point of view of document expansion, the method with the most consistent results is expansion with parent nodes (ul). This type of expansion yields more general terms, which can act as common ground between documents.

As for the learning algorithms, expansion works in all cases except with the PLAUM neural network, whose results are better without any expansion.

Expansion type    SVM simple   SVM-multi
ll                0.724912     0.7675
ul                0.739461     0.7957
sl                0.734283     0.7697
ul-ll             0.739327     0.7766
ul-sl             0.726128     0.7669
ul-sl-ll          0.713533     0.7557
No expansion      0.737024     0.7699

Table 3. Expansion with SVM

Expansion type    BBR simple    BBR multi

Table 4. Expansion with BBR

Expansion type    PLAUM simple    PLAUM multi

Table 5. Expansion with PLAUM

Expansion type    SVM-BBR-PLAUM simple    SVM-BBR-PLAUM multi

Table 6. Expansion combining the three algorithms used

6 Conclusions and future work

This paper has presented a study of multi-label categorization enriched with integrated knowledge. To this end, the corpus used in the categorization process (CCHMC) is expanded with the MeSH medical ontology. The study used TECAT, a freely available multi-label categorizer that allows several learning algorithms to be configured and used at once. Our work uses SVM, PLAUM and BBR, as well as a combination of them. The results show that it is worthwhile to integrate external knowledge from a domain-specific biomedical ontology. However, the differences between the algorithms used are not especially large.

In future work we intend to study other kinds of expansion with this ontology, such as automatic selection of the categories used for expansion, or the use of synonyms and similar words instead of parent and/or child nodes. We will also try to apply these expansion techniques to other kinds of text tasks to test how well they perform.

7 Acknowledgements

This work was partially funded by the Ministerio de Ciencia y Tecnología through the TIMOM project (TIN2006-15265-C06-03).

References

Chevallet, J. P., J. H. Lim and S. Radhouani. 2006. A Structured Visual Learning Approach Mixed with Ontology Dimensions for Medical Queries. Lecture Notes in Computer Science, Vol. 4022/2006, pp. 642-651.

CMC. 2007. The Computational Medicine Center's 2007 Medical Natural Language Processing Challenge. Available at http://www.computationalmedicine.org/challenge/cmcChallengeDetails.pdf

Genkin, A., D. D. Lewis and D. Madigan. 2006. Large-Scale Bayesian Logistic Regression for Text Categorization. Technometrics.

Guyot, J., S. Radhouani and G. Falquet. 2005. Ontology-based multilingual information retrieval. In CLEF Workshop, Working Notes Multilingual Track, Vienna, Austria, 21-23 September 2005.

Joachims, T. 1998. Text categorization with support vector machines: learning with many relevant features. Proceedings of ECML-98, 10th European Conference on Machine Learning, No. 1398, Springer Verlag, pp. 137-142.

Karamanis, N. 2007. Text Mining for Biology and Biomedicine. Computational Linguistics, Vol. 33, pp. 135-140.

Li, Y., H. Zaragoza, R. Herbrich, J. Shawe-Taylor and J. Kandola. 2002. The Perceptron Algorithm with Uneven Margins. Proceedings of the International Conference of Machine Learning (ICML'2002).

MeSH. 2007. Medical Subject Headings. Available at http://www.nlm.nih.gov/mesh/

Montejo-Ráez, A. and R. Steinberger. 2004. Why keywording matters. High Energy Physics Libraries Webzine, No. 10, December.

Müller, H., T. Deselaers, T. Lehmann, P. Clough and W. Hersh. 2006. Overview of the ImageCLEFmed 2006 medical retrieval and annotation tasks. Evaluation of Multilingual and Multi-modal Information Retrieval, Seventh Workshop of the Cross-Language Evaluation Forum, CLEF 2006. LNCS 2006.

Navigli, R., P. Velardi and A. Gangemi. 2003. Ontology learning and its application to automated terminology translation. Intelligent Systems, Vol. 18, No. 1, pp. 22-31.

Porter, M. 1980. An Algorithm for Suffix Stripping. Program, Vol. 14 (3), pp. 130-137.

Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47.

Shannon, C. E. 1948. A mathematical theory of communication. Bell System Technical Journal, Vol. 27, pp. 379-423 and 623-656.



Similitud entre documentos multilingües de carácter científico-técnico en un entorno Web

Xabier Saralegi Urizar
Elhuyar fundazioa
20170 Usurbil
[email protected]

Iñaki Alegria Loinaz
IXA taldea. UPV/EHU
649 p.k., 20080 Donostia
[email protected]

Resumen: En este artículo se presenta un sistema para la agrupación multilingüe de documentos que tratan temas similares. Para la representación de los documentos se ha empleado el modelo de espacio vectorial, utilizando criterios lingüísticos para la selección de las palabras clave, la fórmula tf-idf para el cálculo de sus relevancias, y RSS feedback y wrappers para actualizar el repositorio. Respecto al tratamiento multilingüe se ha seguido una estrategia basada en diccionarios bilingües con desambiguación. Debido al carácter científico-técnico de los textos se han empleado diccionarios técnicos combinados con diccionarios de carácter general. Los resultados obtenidos han sido evaluados manualmente.

Palabras clave: CLIR, similitud translingüe, enlazado translingüe, RSS

Abstract: In this paper we present a system to identify documents of similar content. To represent the documents we have used the vector space model, using linguistic knowledge to choose keywords and tf-idf to calculate their relevance. The document repository is updated through RSS and HTML wrappers. As for the multilingual treatment, we have used a strategy based on bilingual dictionaries. Due to the scientific-technical nature of the texts, the vectors have been translated using technical dictionaries combined with general dictionaries. The results obtained have been evaluated in order to estimate the precision of the system.

Keywords: CLIR, cross-lingual similarity, cross-lingual linking, RSS

1 Introduction

The amount of textual information published on the Internet keeps growing, yet in many cases it remains poorly organized and chaotic. In the context of the news media, for example, few services currently offer integrated navigation of information from different sources, and fewer still when the information is multilingual.

To address this problem, we propose navigation organized around the semantic similarity between contents, applied as a pilot experience to a multilingual environment of science news websites. Specifically, we focused our experiment on Zientzia.net, a popular science website in Basque, combining the following languages: Basque, Spanish and English. As a result, for each news item it publishes, Zientzia.net will offer links to other related news items, which may be published on different websites and in different languages. The ultimate goal of this service is to offer the reader a more complete and organized navigation, similar to that offered by NewsExplorer (Steinberger, Pouliquen and Ignat, 2005) but specialized in scientific and technical content.

To that end, we designed and developed a system (Fig. 1) that covers the automatic collection of news from different sources, its representation by means of an algebraic model, and the computation of similarities between documents written in the same or in different languages.

Fig. 1. Information flow diagram

The automatic collection of news, both local and remote, is carried out by a robot based on RSS aggregators and HTML wrappers. Documents are then represented according to the vector space model. To build the vectors, keywords are selected using linguistic criteria: specifically, common nouns, entities and multiword terms are chosen, and their relevance is computed with the tf-idf equation. Vectors generated from documents written in other languages are translated into Basque, using both technical and general dictionaries. A simple and effective method was designed to handle ambiguous translations. Finally, the degree of similarity is estimated by the cosine between the vectors.

To evaluate the system, a group of documents was chosen at random from a collection previously processed by it, and precision was computed by manually analyzing the four most similar documents detected automatically (cutoff 4).

2 Document acquisition

Our system specializes in collecting and interlinking documents from the scientific and technical domain within the journalistic or popular science genre. We compiled a list of leading popular science websites to serve as information sources.

To create and continuously update the collection of news from the different sources, we implemented a reader based on RSS syndication. Through RSS syndication we periodically obtain summaries of the news published on a given website. The summaries usually also include the title and URL of each news item. This means that, to access the content of a news item, we must fetch the HTML document and extract its content.

This last task, however, is not trivial, since the content text is usually mixed with other added textual elements such as navigation menus, advertising or corporate information [1]. Automatic techniques based on supervised learning are generally proposed for this cleaning (Lee, Kan and Lai, 2004), but their results are not optimal. For that reason, and given that the list of websites to handle is not very long, we decided to implement the wrappers manually. Specifically, we manually analyzed the HTML structure of the news published on each website and implemented parsers using the XPath model, based on the patterns observed on each site.

[1] To encourage work on cleaning web documents, SIGWAC has scheduled a task (CLEANEVAL), in competition format, for June 2007.

News items are therefore obtained in two steps. First, the RSS aggregator gives us the metadata of the news published on a given set of websites; then, we extract the textual content of the HTML document referenced in the metadata using the HTML wrapper for that website.
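The two-step acquisition can be sketched with the Python standard library. Everything site-specific here is hypothetical: the feed, the page, the host name and the XPath pattern are invented, and the page is well-formed XHTML so that `xml.etree.ElementTree` can parse it. Real news pages are rarely well-formed, so an HTML-tolerant parser would be needed in practice.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Invented RSS feed and page; the dictionary stands in for HTTP downloads.
RSS_SAMPLE = """<rss version="2.0"><channel>
  <title>Example science news</title>
  <item>
    <title>New exoplanet found</title>
    <link>http://news.example.org/exoplanet.html</link>
    <description>Short summary.</description>
  </item>
</channel></rss>"""

PAGES = {"http://news.example.org/exoplanet.html": """<html><body>
  <div class="menu">Home | About</div>
  <div class="article"><p>Astronomers report a new planet.</p><p>It orbits a nearby star.</p></div>
  <div class="ads">Buy now</div>
</body></html>"""}

# One hand-written XPath pattern per site (hypothetical wrapper table).
WRAPPERS = {"news.example.org": ".//div[@class='article']/p"}


def read_feed(rss_xml: str):
    """Step 1: metadata (title and URL) of every item in the feed."""
    root = ET.fromstring(rss_xml)
    return [{"title": item.findtext("title"), "url": item.findtext("link")}
            for item in root.iter("item")]


def extract_content(url: str, html: str) -> str:
    """Step 2: keep only the text matched by the wrapper of the site."""
    xpath = WRAPPERS[urlparse(url).netloc]
    page = ET.fromstring(html)
    return " ".join("".join(p.itertext()).strip() for p in page.findall(xpath))


articles = []
for item in read_feed(RSS_SAMPLE):
    html = PAGES[item["url"]]
    articles.append({**item, "text": extract_content(item["url"], html)})
```

The menu and advertising blocks are dropped simply because the site's pattern does not select them, which is the point of writing one wrapper per site by hand.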

As an additional step, since some websites publish news in several languages, we detect the language of the document using LangId [2]. This identification is needed to determine later the direction in which the generated vector will be translated.

3 Representation of multilingual documents

In this work we experimented only with the vector space model. Although more advanced models exist (Ponte and Croft, 1998), we considered that working with this model would give us a robust prototype that can be improved in the future.

To build the vectors, we started from the documents in text format, which are supplied to the system by the method explained in Section 2. As a first step, the representative vocabulary was selected according to linguistic criteria. To this end, each text was first tagged automatically. POS tagging and lemmatization were carried out with Eustagger for Basque and Freeling for Spanish and English. From the lemmatized text we identified the lexical units we considered most representative of the content, discarding vocabulary that would contribute nothing but noise to the task at hand: modelling semantic content. Thus, common nouns, entities and multiword terms were selected. The case of adjectives and verbs is not clear-cut (Chen and Hsi, 2002); in our case they were left out mainly because, being poorly represented in the bilingual technical dictionaries, their translation was limited. In any case, we ran a series of (inconclusive) experiments suggesting that leaving out verbs and adjectives made almost no difference to the detection of similar documents.

[2] A language identifier based on words and trigram frequencies, developed by the IXA group at UPV/EHU.

Multiword terms in all the languages handled (Basque, English and Spanish) were identified by matching a list of terms (Euskalterm [3], ZT hiztegia [4]) against the lemmatized text. We chose not to use automatic terminology detection techniques, both to avoid generating noise and to simplify the subsequent dictionary-based translation. To identify entities we used a heuristic that is simple yet effective in terms of precision, that is, in keeping noise out. Specifically, we marked as entities those sequences of capitalized words that are either unknown words or appear in a previously compiled repertoire of single-word entities.
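The entity heuristic might look roughly like the sketch below. It rests on assumptions that the text does not settle: the test is applied word by word, "unknown" means absent from a lexicon of known lowercase word forms, and the lexicon, the entity repertoire and the sentence are all invented.

```python
from typing import List, Set


def find_entities(tokens: List[str], lexicon: Set[str],
                  known_entities: Set[str]) -> List[str]:
    """Group consecutive capitalized tokens that are either unknown words
    or listed single-word entities."""
    entities: List[str] = []
    current: List[str] = []
    for token in tokens + [""]:                 # empty sentinel flushes the last run
        is_candidate = token[:1].isupper() and (
            token.lower() not in lexicon or token in known_entities)
        if is_candidate:
            current.append(token)
        elif current:
            entities.append(" ".join(current))
            current = []
    return entities


# Invented resources: a tiny lexicon and a one-entry entity repertoire.
lexicon = {"the", "probe", "left", "for", "mercury"}
known_entities = {"Mercury"}
tokens = "The probe left Mauna Kea for Mercury".split()
entities = find_entities(tokens, lexicon, known_entities)
```

In this example "The" is skipped because its lowercase form is a known word, "Mauna Kea" is kept as a run of unknown capitalized words, and "Mercury" is kept only because it appears in the repertoire.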

To compute the relevance of each keyword we experimented with different variants of tf-idf. In our experiments, the best results were obtained by applying the logarithm to tf, as in (1):

tf-idf = log(tf) · idf (1)

We observed that, without the logarithm, the similarity between documents sharing very few keywords (with high tf-idf values) scored too high, in many cases producing inaccurate similarities (false positives).

[3] Terminological dictionary containing around 100,000 term records in Basque with equivalents in Spanish, French, English and Latin.

[4] Encyclopedic dictionary of science and technology with approximately 15,000 entries in Basque and equivalents in Spanish, French and English.
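The damping effect of the logarithm can be seen in a small sketch of equation (1). The idf definition log(N/df), the natural logarithm and all the counts are assumptions made for illustration; the text only gives the form log(tf) · idf. Taken literally, the formula gives weight 0 to a keyword that occurs once.

```python
import math
from collections import Counter
from typing import Dict, List


def log_tfidf_vector(keywords: List[str], doc_freq: Dict[str, int],
                     n_docs: int) -> Dict[str, float]:
    """Weight each keyword as log(tf) * idf, with idf assumed to be log(N / df)."""
    tf = Counter(keywords)
    return {term: math.log(freq) * math.log(n_docs / doc_freq[term])
            for term, freq in tf.items()}


# Invented document: one keyword repeated 20 times, another twice, another once.
keywords = ["laser"] * 20 + ["cell"] * 2 + ["star"]
doc_freq = {"laser": 10, "cell": 100, "star": 50}
weights = log_tfidf_vector(keywords, doc_freq, n_docs=1000)

# Ratio between the two weights with raw tf and with log(tf), for comparison.
raw_ratio = (20 * math.log(1000 / 10)) / (2 * math.log(1000 / 100))
log_ratio = weights["laser"] / weights["cell"]
```

With raw tf the repeated keyword outweighs the other by a factor of 20; with the logarithm the factor drops below 9, which is the kind of damping that keeps a single shared high-weight keyword from dominating the similarity.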

4 Multilingual similarity

4.1 Similarity measures

Several metrics exist for computing the similarity between documents represented in the vector space model. The most widespread is the cosine. Other metrics in use include Jaccard and Dice. The OKAPI model takes the size of the document and of the collection into account, giving better results (Robertson et al., 1994).
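For reference, the cosine between two sparse keyword vectors, together with a ranking of the most similar documents, can be sketched as follows. The vectors and document identifiers are invented, and `most_similar` with k = 4 is only meant to echo the cutoff used in the evaluation.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]     # keyword -> weight


def cosine(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(query_id: str, vectors: Dict[str, Vector], k: int = 4) -> List[str]:
    """Identifiers of the k documents closest to the query document."""
    scores = [(cosine(vectors[query_id], vec), doc_id)
              for doc_id, vec in vectors.items() if doc_id != query_id]
    return [doc_id for _, doc_id in sorted(scores, reverse=True)[:k]]


# Invented toy collection.
vectors = {
    "d1": {"laser": 2.0, "cell": 1.0},
    "d2": {"laser": 1.0, "cell": 0.5},
    "d3": {"star": 1.0},
    "d4": {"cell": 1.0, "star": 1.0},
}
top2 = most_similar("d1", vectors, k=2)
```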

These metrics can be applied directly to vectors representing texts in the same language, but vectors in different languages must first go through a translation process. Two main strategies are proposed in the literature for this task: corpus-based translation, in which the vector is translated by a statistical model trained on a bilingual corpus (Hiemstra, 2001), and dictionary-based translation, in which the vector is translated with bilingual dictionaries (Pirkola, 1998).

With dictionary-based translation, the result can be very noisy, since the translation of a word is often ambiguous. If we accept all possible translations and compute their tf-idf from the frequency of the original word, we may introduce wrong translations that blur the representation of the original document. This is a real danger: odd translations tend to have a high idf, so they can easily distort the vector representation and, as a consequence, the similarity computation. One possible solution is "structured queries" (Pirkola, 1998). Originally designed to handle query expansion in a monolingual setting, they weight the possible translations of each word conservatively, penalizing the tf-idf weight of all of them if the df value of any one is high.

One kind of corpus-based translation is guided by statistical models (Hiemstra, 2001). The vectors are translated using a translation model trained on a bilingual corpus in the languages involved. In this way, we obtain the most probable translation of the vector according to the translation model and the language model of the target language.

In any case, neither the coverage nor the precision of these techniques is optimal. As a result, the translation process loses information or introduces noise, so the translated representation will always be poorer than the original. To strengthen the representation, query expansion techniques can be used to add new keywords semantically related to the set of terms in the vector.

Other techniques need no translation because they are language-independent, which makes them suitable when there are many language pairs to handle. In these techniques the keywords of a document are selected using multilingual lexicons or thesauri such as WordNet or Eurovoc. In (Steinberger, Pouliquen and Hagman, 2002), for example, language-independent descriptors from the Eurovoc thesaurus are assigned to each vector by a statistical model trained with supervised learning. WordNet, for its part, is used in (Stokes and Carthy, 2001) to represent documents by means of lexical chains.



4.2 Dictionaries

For vectors in different languages we opted for translation with bilingual dictionaries.

Given the scientific nature of the documents (a broad but bounded domain), we considered it appropriate to use domain-specific linguistic resources (Rogati and Yang, 2004). We combined technical dictionaries (Euskalterm, ZT hiztegia) with general dictionaries (Elhuyar [5], Morris [6]). We did not attempt statistical translation based on parallel corpora for lack of resources: we have neither bilingual scientific corpora for all the language pairs nor a sufficiently accurate word-level aligner.

                    Technical dict.   General dict.
Mean tf-idf   en        4.483            4.229
              es        5.036            4.871

Table 1: Arithmetic mean tf-idf for keywords

By using technical dictionaries we achieved a high degree of coverage of the specialized vocabulary, precisely the vocabulary that can be most representative of a document's topic. Table 1 shows the tf-idf values of the English keywords that have a translation in the technical dictionaries against those of the keywords whose translations are found in the general dictionaries. The keywords were grouped by lemma and come from a collection of real documents (Table 4). According to the arithmetic mean of tf-idf, the degree of representativeness is slightly higher for the specialized vocabulary. The use of technical dictionaries therefore seems an appropriate strategy, all the more so given the lower average ambiguity of their keyword translations (Table 2).

[5] Spanish/Basque dictionary with 88,000 entries, 144,000 senses and 19,000 subentries.

[6] English/Basque dictionary with 67,000 entries and 120,000 senses.

Translations per word   Technical dict.   General dict.
en->eu                  1.72              2.827
es->eu                  1.805             4.243

Table 2: Mean ambiguity in translations

Nevertheless, we observed that the coverage with respect to the total lexicon could have a negative impact on the representation of the texts, since some general words can play a representative role in the documents. In addition, including only technical words also distorted the dimension of the vector, because the remaining words of the document were not represented at all.

We decided to combine the technical dictionaries sequentially with general-purpose dictionaries. Table 3 shows the coverage of the keywords (grouped by lemma) of a collection (Table 4) obtained with the different dictionary combinations.

      Technical dict.   General dict.   Technical + general dict.
en    55.52%            61.65%          74.48%
es    77.12%            89.02%          91.57%

Table 3: Coverage of the keywords
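The sequential combination described above can be sketched as a lookup that tries the technical dictionary first and falls back to the general one. This is only an illustrative sketch: the dictionary contents, the function names `translate` and `coverage`, and the data layout are assumptions, not the resources or code used in the system.

```python
from typing import Dict, List, Optional

# Hypothetical toy entries; they are NOT taken from Euskalterm, ZT hiztegia,
# Elhuyar or Morris and only illustrate the lookup order.
TECHNICAL: Dict[str, List[str]] = {"enzyme": ["entzima"]}
GENERAL: Dict[str, List[str]] = {
    "enzyme": ["entzima", "hartzigarri"],
    "water": ["ur"],
}


def translate(lemma: str,
              dictionaries: List[Dict[str, List[str]]]) -> Optional[List[str]]:
    """Return the translations from the first dictionary that covers the lemma."""
    for dictionary in dictionaries:
        if lemma in dictionary:
            return dictionary[lemma]
    return None  # the lemma is not covered by any dictionary


def coverage(lemmas: List[str],
             dictionaries: List[Dict[str, List[str]]]) -> float:
    """Fraction of lemmas that receive at least one translation."""
    if not lemmas:
        return 0.0
    covered = sum(1 for lemma in lemmas
                  if translate(lemma, dictionaries) is not None)
    return covered / len(lemmas)
```

Because the technical dictionary is consulted first, a term covered by both resources keeps the less ambiguous technical translation, while the general dictionary only raises coverage for the remaining lemmas.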



4.3 Ambiguous translations

As mentioned above, translation by means of dictionaries entails possible ambiguity, which leads to incorrect translations that distort the translated vector.

Using technical dictionaries reduces this problem to some extent, since the level of polysemy and translation ambiguity is lower (Table 2). Even so, the noise generated remains a problem. To address it, and with the precision of the final system's results as the priority, we propose a simple translation-selection strategy.

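Selection is applied each time the (cosine) similarity between two vectors in different languages ($\vec{v}$ and $\vec{w}$) is computed. Based on the hypothesis that the probability of many incorrect translations ($(i,j) \in D$) occurring in the other vector is low, we resolve the ambiguity by choosing, for each ambiguous translation, the candidate that is present in the other vector:

$$\cos\left(\vec{v}, \overrightarrow{tr(w)}\right) = \frac{\sum_{(i,j) \in D} v_i w_j}{|\vec{v}|\,|\vec{w}|} \qquad (2)$$

Here $D$ is read as the set of pairs $(i,j)$ for which term $i$ of $\vec{v}$ is a dictionary translation of term $j$ of $\vec{w}$.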

In this way we avoid the noise that including the incorrect translations would generate. Compared with techniques that weight the translations equally, our technique should also prove more effective in terms of final precision, since any remaining noise will only affect pairs of documents with low mutual similarity. As stated above, we assume that the probability of many incorrect translations co-occurring in the other vector is low.
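A minimal sketch of this selection, assuming sparse vectors stored as term-to-weight mappings and a dictionary that maps each source term to its candidate translations (the function name `cross_lingual_cosine` and the data layout are illustrative assumptions, not the system's code). Only candidates that actually occur in the other vector contribute to the dot product, which corresponds to summing over the pairs in D:

```python
import math
from typing import Dict, List


def cross_lingual_cosine(v: Dict[str, float],
                         w: Dict[str, float],
                         translations: Dict[str, List[str]]) -> float:
    """Cosine between v and w, mapping the terms of w through `translations`.

    A candidate translation contributes only if it is present in v; the
    other candidates are ignored instead of being added as noise.
    """
    dot = 0.0
    for term_w, weight_w in w.items():
        for candidate in translations.get(term_w, []):
            if candidate in v:
                dot += v[candidate] * weight_w
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    if norm_v == 0.0 or norm_w == 0.0:
        return 0.0
    return dot / (norm_v * norm_w)
```

With equal weighting, every candidate translation would be added to the translated vector; here an incorrect candidate can only affect the score if it happens to occur in the other document.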

In the system, similarities between documents are computed each time the robot collects a new batch of news items. Distances are computed between the newly collected documents and the Zientzia.net documents, both new and previously stored.

5 Evaluation

In the evaluation we set out to analyse only the results obtained with the final system. Because recall is difficult to compute and precision is the main requirement of the system, we evaluated only the latter. Specifically, we computed precision by analysing, for each document in the collection, its four most similar documents according to the system (cutoff).

The base news collection was obtained and processed with the procedures described in the previous sections. It consists of all the articles published to date on Zientzia.net, and of the articles published on the other websites over a period of one month (Table 4). Although the idea of the system is to show similar documents while the user browses documents in Basque, the evaluation was carried out in the opposite direction because of the numerical superiority of the Zientzia.net articles. Otherwise, the probability of finding similar documents would drop considerably.

      # docs   # words     # words/doc
es    108      71,366      661
eu    3,146    1,249,255   397
en    550      284,317     517

Table 4: Processed news collection

For the evaluation we formed 3 groups (one per language) of 10 documents chosen at random from the base collection. After processing the whole collection with the system, we analysed for each document its 4 most similar documents (among those from Zientzia.net) according to the system. The proposed analysis method consisted of rating the degree of similarity of the content on a relevance scale divided into four categories and based on the scheme used in (Braschler and Schäuble, 1998).

(a) Share the main topic: the documents are about the same topic.

(b) Related main topic or shared topics: the documents deal with closely related topics or have secondary topics in common.

(c) Share an area: the documents belong to a particular area that is not too general.

(d) Remote resemblance: the relations between the documents are remote or non-existent.

In this way, relations of strong resemblance are meant to be rated most positively. We are aware that this scale is debatable, since for the user a reference that complements the current article may be more useful than another article on the same topic. Moreover, assigning each document a category on this scale is in many cases hard to do precisely.

The analysis was carried out by a professional in the field of science communication, and was performed for two different prototypes:

1) distributing the weight equally among the translations.

2) applying the disambiguation proposed above.

We wanted to check whether the method designed to resolve ambiguous translations improved the precision of the system.

Tables 5, 6 and 7 show the precision values (cutoff 4), accumulating the categories of the relevance scale described above. The results vary with the language, and the loss of information after translation is evident. This affects the English-Basque relations to a greater extent because of the lower coverage of the English-Basque bilingual dictionaries.

              (a)     (a+b)    (a+b+c)
Disamb.       10%     37.5%    82.5%
No disamb.    10%     30%      70%

Table 5: Cutoff 4, en-es

              (a)     (a+b)    (a+b+c)
Disamb.       30%     37.5%    60%
No disamb.    25%     32.5%    60%

Table 6: Cutoff 4, es-eu

              (a)     (a+b)    (a+b+c)
              17.5%   57.5%    85%

Table 7: Cutoff 4, eu-eu
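The accumulated figures can be reproduced from the per-document judgements with a short computation. The sketch below assumes each evaluated document comes with a list of category labels ("a" to "d") for its retrieved documents; the function name `cumulative_precision` and this layout are illustrative assumptions. With 10 documents and 4 judgements each there are 40 judgements per group, which is why the reported values move in steps of 2.5%.

```python
from typing import Dict, List


def cumulative_precision(judgments: List[List[str]]) -> Dict[str, float]:
    """Percentage of judged results in categories a, a+b and a+b+c."""
    labels = [label for per_doc in judgments for label in per_doc]
    total = len(labels)
    if total == 0:
        return {"a": 0.0, "a+b": 0.0, "a+b+c": 0.0}
    result: Dict[str, float] = {}
    accepted = ""   # name of the accumulated column, e.g. "a+b"
    running = 0     # judgements counted so far in the accepted categories
    for category in ("a", "b", "c"):
        running += labels.count(category)
        accepted = category if not accepted else accepted + "+" + category
        result[accepted] = 100.0 * running / total
    return result
```

Category (d) is never counted as a hit, so the last column is the share of results that are at least in the same area as the source document.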

We observed that, perhaps because of the small size of the collection, documents sharing few keywords were accepted as similar.

In any case, the method designed to resolve ambiguous translations improves or at least maintains precision in all the tests (Tables 5 and 6).

Regarding the size and variety of the content, we observed that the precision of the system is lower for documents on very specialised topics, where lexical comparison proves insufficient. This may be due to the small number of documents, but it could not be evaluated because recall is unknown.

6 Conclusions and future work

We have developed a system for grouping multilingual documents with similar content, with the aim of integrating it into a CLIR system. This has resulted in a navigation system for multilingual scientific and technical news, deployed on the Zientzia.net site.



The results obtained should lead us to carry out a more exhaustive evaluation. Regardless of this, we have verified that translation by means of dictionaries works well, particularly when technical dictionaries are used. The proposed disambiguation method has also been successful, but a new evaluation is needed to better quantify the improvement achieved.

It would be very interesting to evaluate the loss of precision when using only RSS summaries, since with a good result these techniques could be applied to a large number of sources without the need for wrappers. We also intend to run new experiments with language models, structured queries and different similarity measures. In addition, we want to improve the translation of entities through cognate detection, and general translation through the generation of multilingual thesauri from comparable corpora. For some of these tasks we plan to base the search engine on the Lemur toolkit (Ogilvie and Callan, 2001).

Acknowledgements

This work is funded by the Department of Industry of the Basque Government (projects Dokusare SA-2005/00272, Dokusare SA-2006/00167).

Bibliography

Braschler, M., and P. Schäuble. 1998. Multilingual Information Retrieval Based on Document Alignment Techniques. ECDL 1998, pp. 183-197.

Chen, Y., and H. Hsi. 2002. NLP and IR approaches to monolingual and multilingual link detection. The 19th Int'l Conf. Computational Linguistics. Taipei, Taiwan.

Hiemstra, D. Using language models for information retrieval. Ph.D. Thesis, University of Twente. Enschede.

Lee, C. H., M. Kan, and S. Lai. 2004. Stylistic and lexical co-training for web block classification. WIDM 2004, pp. 136-143.

Ogilvie, P., and J. Callan. 2001. Experiments using the Lemur toolkit. Proceedings of the Tenth Text Retrieval Conference (TREC-10).

Pirkola, A. 1998. The Effects of Query Structure and Dictionary Setups in Dictionary-Based Cross-language Information Retrieval. Proc. of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 55-63.

Ponte, J., and W. Croft. 1998. A Language Modeling Approach to Information Retrieval. In: Croft et al. (eds.): Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 275-281. ACM, New York.

Robertson, S. E., S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. 1994. Okapi at TREC-3. NIST Text Retrieval Conference.

Rogati, M., and Y. Yang. 2004. Resource Selection for Domain Specific Cross-Lingual IR. SIGIR 2004.

Steinberger, R., B. Pouliquen, and J. Hagman. 2002. Cross-lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC. Third International Conference on Intelligent Text.

Steinberger, R., B. Pouliquen, and C. Ignat. 2005. NewsExplorer: multilingual news analysis with cross-lingual linking. Information Technology Interfaces.

Stokes, N., and J. Carthy. 2001. Combining Semantic and Syntactic Document Classifiers to Improve First Story Detection. SIGIR 2001, pp. 424-425.



Information Extraction

The Influence of Context during the Categorization and Discrimination of Spanish and Portuguese Person Names

Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo
Departamento de Lenguajes y Sistemas Informáticos
Universidad de Alicante
zkozareva,svazquez,[email protected]

Resumen: Este artículo presenta un nuevo método para la categorización y la discriminación de nombres propios utilizando como fuente de información la similitud semántica. Para establecer las relaciones semánticas entre las palabras que forman el contexto donde aparece la entidad que queremos categorizar o discriminar, nuestro método utiliza la semántica latente. Se han realizado diferentes experimentos donde se ha estudiado la influencia del contexto y la robustez de nuestra aproximación sobre distintos números de ejemplos. La evaluación se ha realizado sobre textos en español y portugués. Los resultados obtenidos son 90% para español y 82% para portugués en categorización y un 80% para español y un 65% para portugués en discriminación.

Palabras clave: discriminación de nombres, categorización de nombres, información semántica

Abstract: This paper presents a method for fine-grained categorization and discrimination of person names on the basis of semantic similarity information. We employ latent semantic analysis, which establishes the semantic relations between the words of the context in which the named entities appear. We carry out several experimental studies in which we observe the influence of the context and the robustness of our approach with different numbers of examples. Our approach is evaluated on Spanish and Portuguese. The experimental results are encouraging, reaching 90% for Spanish and 82% for Portuguese person name categorization, and 80% for Spanish and 65% for Portuguese NE discrimination of six conflated names.

Keywords: name discrimination, name categorization, semantic information

1. Introduction and Related Work

Named Entity (NE) recognition concerns the detection and classification of names into a set of categories. At present, most successful NE approaches employ machine learning techniques and handle only the person, organization, location and miscellaneous categories. However, current Natural Language applications call for specialized NE extractors, which can help, for instance, an information retrieval system to determine that a query about “Jim Henriques guitars” is related to the person “Jim Henriques” with the semantic category musician, and not to “Jim Henriques” the composer. Such a classification can help the system rank or return relevant answers more accurately and appropriately.

So far, state-of-the-art NE recognizers identify that “Jim Henriques” is a person, but do not subcategorize it. There are numerous drawbacks related to this fine-grained NE issue. First, the systems need hand-annotated data, which is not available and whose creation is time-consuming and requires supervision by experts. Second, for languages other than English there is a significant lack of freely available or developed resources.

The World Wide Web is a vast, multilingual source of unstructured information which we consult daily to find out what the weather in our city is or how our favorite soccer team performed. Therefore, the need for multilingual and specialized NE extractors remains, and we have to focus on the development of language-independent approaches.

Together with specialized NE categorization, we face the problem of name ambiguity, which is related to queries for different people, locations or companies that share the same name. This problem is known as name discrimination (Pedersen and Kulkarni, 2005). For instance, Cambridge is a city in the United Kingdom, but also in the United States of America. ACL refers to “The Association of Computational Linguistics”, “The Association of Christian Librarians” and “Automotive Components Limited”, among others.

Previously, (Pedersen and Kulkarni, 2005) tackled the name discrimination task by developing a language-independent approach based on the context in which the ambiguous name occurred. They construct second-order co-occurrence features according to which the entities are clustered and associated with different underlying names. The performance of this method ranges from 51% to 73% depending on the pair of named entities that have to be disambiguated. A similar approach was developed by (Bagga and Baldwin, 1998), who created first-order context vectors that represent the instance in which the ambiguous name occurs. Their approach is evaluated on 35 different mentions of John Smith, and the f-score is 84%.

For fine-grained person NE categorization, (Fleischman and Hovy, 2002) carried out supervised learning, for which they derived features from the local context in which the entity resides, as well as semantic information derived from WordNet. According to their results, to improve on the 70% coverage for person name categorization, more sophisticated features are needed, together with a more solid data generation procedure. (Tanev and Magnini, 2006) classified geographic location and person names into several subclasses. They use syntactic information and observe how often a syntactic pattern co-occurs with certain members of a given class. Their method reaches 65% accuracy. (Pasca, 2004) presented a lightly supervised lexico-syntactic method for named entity categorization which reaches 76% when evaluated on unstructured text from Web documents.

(Mann, 2002) populated a fine-grained proper noun ontology using common noun patterns and following the hierarchy of WordNet. They studied the influence of the newly generated person ontology in a Question Answering system. According to the obtained results, the precision of the ontology is high, but it still suffers in coverage.

However, none of these approaches studied the text cohesion and semantic similarity between snippets with named entities. Therefore, we employ Latent Semantic Analysis (LSA), which allows us to establish the semantic relations among the words that surround the named entity. Our motivation is based on the word sense discrimination hypothesis of (Miller and Charles, 1991), according to which words with similar meaning are used in similar contexts. For instance, names that belong to the category sport will be more likely to appear with words such as championship, ball, team, whereas names of university students or professors will be more likely to appear with words such as book, library, homework.

2. NE categorization and discrimination with Latent Semantic Analysis

LSA has been applied successfully in many areas of Natural Language Processing, such as Information Retrieval (Deerwester and Harshman, 1990), Information Filtering (Dumais, 1995) and Word Sense Disambiguation (Schütze, 1998), among others. This is possible because LSA is a fully automatic mathematical/statistical technique for extracting and inferring relations of expected contextual usage of words in discourse. It uses no humanly constructed dictionaries or knowledge bases, semantic networks, or syntactic or morphological analyses, because it takes as input only raw text, which is parsed into words and separated into meaningful passages. On the basis of this information, NLP applications extract a list of semantically related word pairs or rank documents related to the same topic.

LSA explicitly represents terms and documents in a rich, high-dimensional space, allowing the underlying “latent” semantic relationships between terms and documents to be exploited. LSA relies on the constituent terms of a document to suggest the document’s semantic content. However, the LSA model views the terms in a document as somewhat unreliable indicators of the concepts contained in the document. It assumes that the variability of word choice partially obscures the semantic structure of the document. By reducing the dimensionality of the term-document space, the underlying semantic relationships between documents are revealed, and much of the “noise” (differences in word usage, terms that do not help distinguish documents, etc.) is eliminated. LSA statistically analyzes the patterns of word usage across the entire document collection, placing documents with similar word usage patterns near each other in the term-document space, and allowing semantically related documents to be closer even though they may not share terms.

Taking these properties of LSA into consideration, we reasoned that instead of constructing the traditional term-document matrix, we can construct a term-sentence matrix with which we can find a set of sentences that are semantically related and talk about the same person. The rows of the term-sentence matrix correspond to the words of the sentences in which the NEs have to be categorized or discriminated, while the columns correspond to sentences with different named entities. The cells of the matrix show the number of times a given word occurs in a given sentence. When two columns of the term-sentence matrix are similar, the two sentences contain similar words and are therefore likely to be semantically related. When two rows are similar, the corresponding words occur in most of the same sentences and are likely to be semantically related. In this way, we can obtain semantic evidence about the words which characterize a given person. For instance, a football player is related to words such as ball, match, soccer, goal, and is seen in phrases such as “X scores a goal”, “Y is penalized”. Meanwhile, a surgeon is related to words such as hospital, patient, operation, surgery and is seen in phrases such as “X operates Y”, “X transplants”. Evidently, the category football player can be distinguished easily from that of the surgeon, because the two person name categories co-occur with and relate semantically to different words.
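The construction can be illustrated with a small sketch that builds a term-sentence count matrix and projects the sentences onto the strongest latent dimensions with a truncated SVD. It assumes NumPy, whitespace tokenization and raw counts; the function names and the choice of `k` are illustrative assumptions and not the authors' implementation.

```python
import numpy as np


def term_sentence_matrix(sentences):
    """Rows are words, columns are sentences, cells are raw counts."""
    vocabulary = sorted({word for s in sentences for word in s.split()})
    index = {word: i for i, word in enumerate(vocabulary)}
    matrix = np.zeros((len(vocabulary), len(sentences)))
    for col, sentence in enumerate(sentences):
        for word in sentence.split():
            matrix[index[word], col] += 1.0
    return vocabulary, matrix


def lsa_sentence_vectors(matrix, k):
    """Represent each sentence (column) in the k strongest latent dimensions."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (np.diag(s[:k]) @ vt[:k, :]).T  # one row per sentence


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```

Keeping all dimensions reproduces the plain cosine between columns; choosing a smaller `k` is what lets sentences with related but different words move closer together.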

3. Named Entity Data Set

In order to evaluate our method, we have used two languages: Spanish and Portuguese. We collected large news corpora from the same time period for both languages and identified a predefined set of named entities with a machine-learning-based named entity recognizer (Kozareva and Gomez, 2007). The Spanish corpus we worked with is EFE94-95, containing 127,079,110 tokens. The Portuguese corpora are Folha94-95 and Publico94-95, containing 90,809,250 tokens. These corpora were previously used in the CLEF competitions¹.

For the NE categorization and discrimination experiments, we used six different named entities with low ambiguity, which we assume a priori to belong to one of the two fine-grained NE categories PERSON SINGER and PERSON PRESIDENT. The president names, for both Spanish and Portuguese, are Bill Clinton, George Bush and Fidel Castro. The singers for Spanish are Madonna, Julio Iglesias and Enrique Iglesias, while for Portuguese we have Michael Jackson, Madonna and Pedro Abrunhosa. Although we wanted to use the same singer names for both languages, this was impossible because the examples were too sparsely distributed.

Table 1 shows the original distribution of the extracted examples with different context windows that surround the named entity. The context windows we worked with are 10, 25, 50 and 100. They indicate the number of words² to the left and to the right of the identified named entity. Note that the NE data is obtained only from the content between the text tags in the XML documents. During the creation of the context windows, we used only words that belong to the document in which the NE is detected. This restriction is imposed because, if we used words from previous or following documents, the domain and the topic in which the NE is seen could change. Therefore, NE examples for which the number of words to the left or to the right did not reach the required number of context words were discarded.
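The window rule can be sketched as follows, under the simplifying assumptions that the document is already tokenized and that the named entity occupies a single token position (multi-word names would need a span instead); the function name `context_window` is hypothetical.

```python
from typing import List, Optional


def context_window(tokens: List[str], position: int,
                   size: int) -> Optional[List[str]]:
    """Return `size` tokens on each side of tokens[position].

    Returns None when the document is too short on either side, in which
    case the example is discarded rather than padded from other documents.
    """
    left_start = position - size
    right_end = position + size + 1
    if left_start < 0 or right_end > len(tokens):
        return None
    return tokens[left_start:position] + tokens[position + 1:right_end]
```

This also explains why the counts in Table 1 tend to shrink as the window grows: fewer occurrences have enough words on both sides inside the same document.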

To avoid imbalance in the experimental data during the evaluation, we decided to create two samples, one with 100 and another with 200 examples per named entity. Thus, every name has the same frequency of occurrence and no name dominates during the identification of a given name.
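A balanced sample of this kind could be drawn as in the sketch below. The paper does not say how the examples were selected, so uniform random sampling with a fixed seed is an assumption, as are the function name `balanced_sample` and the mapping from name label to examples.

```python
import random
from typing import Dict, List


def balanced_sample(examples_by_name: Dict[str, List[str]],
                    per_name: int, seed: int = 0) -> Dict[str, List[str]]:
    """Draw the same number of examples for every name, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample: Dict[str, List[str]] = {}
    for name, examples in examples_by_name.items():
        if len(examples) < per_name:
            raise ValueError(f"not enough examples for {name}")
        sample[name] = rng.sample(examples, per_name)
    return sample
```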

For the NE categorization data, each occurrence of the president and singer names is replaced with the obfuscated form President Singer, while for the NE discrimination task the names were replaced with M EI JI BC GB FC. The first label indicates that a given sentence can belong to the president or to the singer category, while the second label indicates that any one of the six named entities can stand behind it. The NE categorization and discrimination experiments are carried out in a completely unsupervised way, meaning that we did not use the correct name and name category until the evaluation stage.

name   lang   c10    c25    c50    c100
M      ES     280    266    245    206
M      PT     1008   975    893    758
JI     ES     426    405    367    295
EI     ES     407    392    360    305
MJ     PT     592    568    506    418
PA     PT     364    347    320    275
BC     ES     6928   5970   5271   5185
BC     PT     3055   2951   2786   2576
GB     ES     730    649    641    521
GB     PT     307    300    283    242
FC     ES     2865   2765   2779   2357
FC     PT     3050   2951   2777   2460

Table 1: NE distribution in the Spanish and Portuguese corpora

¹ http://www.clef-campaign.org/

² 10, 25, 50 and 100 respectively

4. Experimental Evaluation

To carry out the various experimental evaluations, we first construct the conceptual matrix and establish the semantic similarity relations among the sentences in the data set. For each sentence, LSA produces a list of the similarities between all sentences and the target one, i.e. the sentence to be classified. The list is sorted in descending order, where high values indicate strong similarity and cohesion between the texts of the two sentences, and vice versa. Therefore, we consider only the twenty highest-scoring sentences, since their NEs are very likely to belong to the same fine-grained category or person.
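The ranking step can be sketched independently of how the sentence vectors are obtained. The example below assumes each sentence is already represented as a list of floats and uses cosine similarity as the score; the function names and the default `k=20` mirror the description above but are otherwise illustrative.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def top_k_similar(vectors: Sequence[Sequence[float]], target: int,
                  k: int = 20) -> List[Tuple[int, float]]:
    """Indices and scores of the k sentences most similar to the target."""
    scored = [(i, cosine(vectors[target], vec))
              for i, vec in enumerate(vectors) if i != target]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # descending order
    return scored[:k]
```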

In order to evaluate the performance ofour approach, we use the standard precision,recall, f-score and accuracy measures whichcan be derived from Table 2.

number of        Correct PRES.   Correct SING.
assigned PRES.   a               b
assigned SING.   c               d

Table 2: Contingency table

$$\text{Accuracy} = \frac{a + d}{a + b + c + d} \qquad (1)$$

$$F_{\beta=1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (2)$$
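As a worked illustration of equations (1) and (2), the sketch below computes the measures for the PRESIDENT category from the four cells of Table 2. Precision and recall are taken in their standard sense, a / (a + b) and a / (a + c), which the paper does not spell out; the function name `president_metrics` is hypothetical.

```python
from typing import Dict


def president_metrics(a: int, b: int, c: int, d: int) -> Dict[str, float]:
    """Precision, recall, accuracy and F1 for the PRESIDENT category.

    a: assigned PRES., correct PRES.   b: assigned PRES., correct SING.
    c: assigned SING., correct PRES.   d: assigned SING., correct SING.
    """
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    total = a + b + c + d
    accuracy = (a + d) / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}
```

The measures for the SINGER category follow by swapping the roles of the cells (d as the correct assignments, c and b as the errors).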

For the assignment of the president and singer categories, we took LSA’s list and grouped together in a cluster all sentences among the 20 most similar ones. In contrast, for the NE discrimination task we did not use the whole list of returned sentences, since we were interested in a concrete NE with identical features and characteristics. For this reason, we decided that the most relevant information is contained in the first sentences at the top of LSA’s list and rejected the rest of the candidates. The information about the name category or class was not revealed or used until evaluation.

Our experiments are ordered according to the observations conducted. The first one concerns the effect of the context on NE categorization. This information is important and beneficial when an annotated corpus has to be created. In this way we can save time and labor for human annotators, or ease the supervision process after active learning or bootstrapping (Kozareva, 2006). Then, we observe the fine-grained NE classification and discrimination.

4.1. Influence of context

Figures 1 and 2 present the performance of our approach with different context windows. The evaluation is carried out with 100 and 200 examples per NE. For both samples and both languages (Spanish and Portuguese), the context windows perform almost the same.

This shows that on average 2-3 sentences are enough to capture the context in which the name resides, together with the particular words that characterize and co-occur with the name.

4.2. NE categorization

In Table 3, we show the results for the Spanish and Portuguese fine-grained NE categorization. Detailed results are given only for the window of 50 words with 100 and 200 examples. All runs outperform a simple baseline system which returns the fine-grained category PRESIDENT for half of the examples and SINGER for the rest. This 50% baseline performance is due to the balanced corpus we have created. The f-scores for the fine-grained NE categorization in Spanish reach around 90%, while for Portuguese the f-score varies around 92% for the 100 examples and 76% for the 200 examples.

Figure 1: Influence of context for Portuguese and Spanish with 100 examples

Figure 2: Influence of context for Portuguese and Spanish with 200 examples

SPANISH
cont/ex   Cat.    P.      R.      A.      F.
50/100    PRES.   90.38   87.67   88.83   89.00
50/100    SING.   87.94   90.00   88.33   88.96
50/200    PRES.   90.10   94.33   91.92   92.18
50/200    SING.   94.04   89.50   91.91   91.71

PORTUGUESE
cont/ex   Cat.    P.      R.      A.      F.
50/100    PRES.   93.56   92.00   92.50   92.53
50/100    SING.   92.07   56.50   77.17   71.29
50/200    PRES.   96.58   56.50   77.17   71.29
50/200    SING.   69.22   97.83   77.16   81.07

Table 3: NE categorization in Spanish and Portuguese

During the error analysis, we found that the PERSON PRESIDENT and PERSON SINGER categories are distinguishable and separable because of the well-established semantic similarity relation among the words with which the NE co-occurs. A pair of president sentences has many strongly related words such as president:meeting, president:government, which indicates high text cohesion, whereas the majority of words in a president-singer pair are weakly related, for instance president:famous, president:concert. There are still ambiguous pairs such as president:company, where the president refers to the president of a country, while the company refers to a musical enterprise. Such information confuses LSA’s categorization process.

4.3. NE discrimination

Next, we present in Table 4 the performance of LSA for the NE discrimination task. The results show that the semantic similarity method we employ is very reliable and suitable not only for NE categorization, but also for NE discrimination. A baseline which always returns one and the same person name during the NE discrimination task achieves 17%. The table shows that all names outperform the baseline. The f-score per individual name ranges from 32% as the lowest to 90% as the highest performance. The results are very good, as the conflated names (three presidents and three singers) can easily be confused because they share the same domain and co-occur with the same semantically related words.

The three best discriminated names for Spanish are Enrique Iglesias, Fidel Castro and Madonna, while for Portuguese we have Fidel Castro, Bill Clinton and Pedro Abrunhosa. For both languages, the name Fidel Castro was easily discriminated thanks to its characterizing words Cuba, CIA, Cuban president, revolution, tyrant. All sentences containing these words or synonyms related to them are associated with Fidel Castro. Bill Clinton co-occurred many times with the words democracy, Boris Yeltsin, Halifax, Chelsea (the daughter of Bill Clinton), White House, while George Bush appeared with republican, Ronald Reagan, Pentagon, war in Vietnam, Barbara Bush (the wife of George Bush).

Some of the examples for Enrique Iglesias, which during data compilation were assumed to refer to the Spanish singer, in reality talk about the president of a financial company in Uruguay or about political issues. Therefore, this name was confused with Bill Clinton, as they share semantically related words such as bank, general secretary, meeting, decision, appointment.

name               lang   10      25      50      100
Madonna            SP     63.63   61.61   63.16   79.45
Madonna            PT     59.05   47.37   46.15   55.29
Julio Iglesias     SP     58.96   56.68   66.00   79.19
Enrique Iglesias   SP     77.27   80.17   84.36   90.54
Pedro Abrunhosa    PT     51.26   61.97   69.63   80.17
Michael Jackson    PT     32.15   62.64   48.45   62.07
Bill Clinton       SP     52.72   48.81   74.74   73.91
Bill Clinton       PT     60.41   73.51   64.04   62.38
George Bush        SP     49.45   41.38   60.20   67.90
George Bush        PT     63.83   34.07   68.16   66.67
Fidel Castro       SP     61.20   62.44   77.08   82.41
Fidel Castro       PT     60.64   79.79   71.61   68.26

Table 4: NE discrimination for Spanish and Portuguese

The discrimination process was good even though Madonna and Julio Iglesias are both singers and appear in the context of concerts, famous, artist, magazine, scene, backstage. The characterizing words for Julio Iglesias are Chabeli (the daughter of Julio Iglesias), Spanish, Madrid, Iberoamerican. The name Madonna co-occurred with words related to a picture of Madonna, a statue of Madonna in a church, and the movie Evita.

Looking at the effect of the context window on the NE discrimination task, it can be seen that for Spanish the best performances of 90% for Enrique Iglesias, 82% for Fidel Castro and 79% for Madonna are achieved with 100 words to the left and to the right of the NE. In comparison, for the Portuguese data the highest scores of 80% for Fidel Castro, 73% for Bill Clinton and 62% for Michael Jackson are reached with the 25-word window. For the Spanish data the larger context had better discrimination power, while for Portuguese the more local context was better.

The error analysis shows that the performance of our method depends on the quality of the data source we work with. As there are no hand-annotated NE categorization and discrimination corpora, we had to develop our own corpus by choosing well-known named entities with low ambiguity. Even so, during our experiments we found that one and the same name referred to three different individuals. On the one hand this made the categorization and discrimination processes more difficult, but on the other it opens a new line of research.

In conclusion, the conducted experiments revealed a series of important observations. The first is that the different context windows perform similarly overall; for Spanish, however, better classification is obtained with larger contexts, which we relate to the expressiveness of the Spanish language. Second, we can claim that LSA is a very appropriate approach for resolving the NE categorization and discrimination tasks. In addition, it gives a logical explanation of the classification decision for the person names, by providing a set of words that characterize the individual persons or their fine-grained categories.

5. Conclusions and Work in Progress

In this paper, we present an approach for NE categorization and discrimination, which is based on semantic similarity information derived from LSA. The approach is evaluated with six different person names of low ambiguity, and around 3600 different examples for the Spanish and Portuguese languages. The obtained results are very good and outperform previously developed approaches by 15 %. For the president and singer NE categorization, LSA obtains 90 %, while for NE discrimination the results vary from 46 % to 90 % depending on the person name. The variability in the name discrimination power is related to the degree of name ambiguity. During the experimental evaluation, we found that the 100 % name purity (i.e. that one name belongs only to one and the same semantic category) which we assumed during data creation in reality contains from 5 to 9 % noise.

In (Kozareva, Vázquez and Montoyo, 2007a), we have evaluated the performance of the same approach for the Bulgarian language. This shows that the approach is language independent, because it only needs a set of contexts with ambiguous names. In that experimental study, we focused not only on the multilingual issues but also on the discrimination and classification of names from the location and organization categories. The obtained results demonstrate that the best performance is obtained with a context of 50 words and that the easiest category is location, which includes cities, mountains, rivers and countries. In general, the most difficult classification was for the organization names.

In an additional experimental study (Kozareva, Vázquez and Montoyo, 2007b), we have demonstrated that the combination of the name disambiguation and fine-grained categorization processes can improve the quality of the data needed for the evaluation of our approach.

In the future, we want to resolve cross-language NE discrimination and classification. We are interested in extracting pairs of words that describe and represent the concept of a fine-grained category such as president or singer, and in this way identifying new candidates for these categories. We will relate this process to the automatic population of an ontology. Finally, we want to relate this approach to our web people search approach (Kozareva, Vázquez and Montoyo, 2007c) in order to improve the detection of name ambiguity on the web.

Acknowledgements

This research has been funded by QALLME number FP6 IST-033860 and TEXT-MESS number TIN2006-15265-C06-01.

References

Bagga, Amit and Breck Baldwin. 1998. Entity-based cross-document coreferencing using the vector space model. In Proceedings of the Thirty-Sixth Annual Meeting of the ACL and Seventeenth International Conference on Computational Linguistics, pages 79–85.

Dumais, Susan. 1995. Using LSI for information filtering: TREC-3 experiments. In The Third Text Retrieval Conference (TREC-3), pages 219–230.

Fleischman, Michael and Eduard Hovy. 2002. Fine grained classification of named entities. In Proceedings of the 19th International Conference on Computational Linguistics, pages 1–7.

Kozareva, Zornitsa. 2006. Bootstrapping Spanish named entities with automatically generated gazetteers. In Proceedings of EACL, pages 17–25.

Mann, Gideon. 2002. Fine-grained proper noun ontologies for question answering. In COLING-02 on SEMANET, pages 1–7.

Miller, George and Walter Charles. 1991. Contextual correlates of semantic similarity. In Language and Cognitive Processes, pages 1–28.

Pasca, Marius. 2004. Acquisition of categorized named entities for web search. In CIKM '04: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pages 137–145.

Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer and Richard Harshman. 1990. Indexing by latent semantic analysis. In Journal of the American Society for Information Science, volume 41, pages 391–407.

Schütze, H. 1998. Automatic word sense discrimination. In Journal of Computational Linguistics, volume 24.

Tanev, Hristo and Bernardo Magnini. 2006. Weakly supervised approaches for ontology population. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 17–24.

Ted Pedersen, Amruta Purandare and Anagha Kulkarni. 2005. Name discrimination by clustering similar contexts. In CICLing, pages 226–237.

Zornitsa Kozareva, Óscar Ferrández, Andrés Montoyo, Rafael Muñoz, Armando Suárez and Jaime Gómez. 2007. Combining data-driven systems for improving named entity recognition. Data Knowl. Eng., 61(3):449–466.

Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. 2007a. A Language Independent Approach for Name Categorization and Discrimination. In Proceedings of the ACL 2007 Workshop on Balto-Slavonic Natural Language Processing.

Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. 2007b. Discovering the Underlying Meanings and Categories of a Name through Domain and Semantic Information. In Proceedings of Recent Advances in Natural Language Processing.

Zornitsa Kozareva, Sonia Vázquez and Andrés Montoyo. 2007c. UA-ZSA: Web Page Clustering on the basis of Name Disambiguation. In Proceedings of the 4th International Workshop on Semantic Evaluations.



Studying CSSR Algorithm Applicability on NLP Tasks

Muntsa Padró and Lluís Padró

TALP Research Center
Universitat Politècnica de Catalunya
Barcelona, Spain

{mpadro, padro}@lsi.upc.edu

Resumen: CSSR is an automata-learning algorithm that represents the patterns of a process from sequential data. This paper studies the applicability of CSSR to noun phrase recognition. We study the ability of CSSR to capture the patterns behind this task and the conditions under which the algorithm learns them best. We also present a method for applying the obtained models to noun phrase annotation. Given all the results, we discuss the applicability of CSSR to NLP tasks.
Palabras clave: Sequential NLP tasks, automata learning, noun phrase detection

Abstract: The CSSR algorithm learns automata representing the patterns of a process from sequential data. This paper studies the applicability of CSSR to Noun Phrase detection. The ability of the algorithm to capture the patterns behind this task and the conditions under which it performs better are studied. Also, an approach to use the acquired models to annotate new sentences is outlined and, in light of all results, the applicability of CSSR to NLP tasks is discussed.
Keywords: NLP sequential tasks, automata acquisition, Noun Phrase detection

1 Introduction

The Causal-State Splitting Reconstruction (CSSR) algorithm (Shalizi and Shalizi, 2004) builds deterministic automata from data sequences. This algorithm is based on Computational Mechanics and is conceived to model stationary processes by learning their causal states. These causal states build a minimum deterministic machine that models the process. Its main benefit is that it does not have a predefined structure (as HMMs do) and that, if the pattern to learn is simple enough, the obtained automaton is "intelligible", providing an explicit model for the training data.

CSSR has been applied to different research areas such as solid state physics (Varn and Crutchfield, 2004) and anomaly detection in dynamical systems (Ray, 2004). These applications use CSSR to capture patterns representing the obtained data. These patterns are then used for different purposes.

This algorithm has also been used in the field of Natural Language Processing (NLP) to learn automata that can afterwards be used to tag new data (Padró and Padró, 2005b; Padró and Padró, 2005a). This is a slightly different use, as it is necessary to introduce some hidden information into the automaton. Furthermore, the alphabets involved in NLP tasks tend to be bigger than those of the other CSSR applications presented. This is a handicap when using CSSR for NLP tasks, as we will discuss in this paper. Despite that, the results obtained in first experiments show that this technique can provide state-of-the-art results in some NLP tasks. Given these results, the challenge is to improve them, developing systems that rival the best state-of-the-art systems. To do so, more information should be incorporated into the system but, as will be discussed in this paper, this can lead to other problems given the nature of the algorithm.

The aim of this work is to study the ability of CSSR to capture a model for the patterns underlying the structure of NLP sequences, as well as the conditions under which it performs better. We focus on studying the models learned by CSSR in NP detection with different data rather than on using CSSR to perform the annotation task, which was done in previous work.



2 Theoretical Foundations of CSSR

The CSSR algorithm (Shalizi and Shalizi, 2004) infers the causal states of a process from data in the form of Markov Models. Thus, the many desirable features of HMMs are secured, without having to make a priori assumptions about the architecture of the system.

2.1 Causal States

Given a discrete alphabet $\Sigma$ of size $k$, consider a sequence $x^-$ (history) and a random variable $Z^+$ for its possible future sequences. $Z^+$ can be observed after $x^-$ with probability $P(Z^+ \mid x^-)$. Two histories, $x^-$ and $y^-$, are equivalent when $P(Z^+ \mid x^-) = P(Z^+ \mid y^-)$, i.e. when they have the same probability distribution for the future. The different future distributions determine the causal states of the process. Each causal state is a set of histories (suffixes of alphabet symbols up to a preestablished maximum length) with the same probability distribution for the future.

Causal state machines have many desirable properties that make them the best possible representation of a process. They are minimal and have sufficient statistics to represent a process; that is, from causal states it is possible to determine the future for a given past. For that reason we are interested in using this kind of machine in NLP tasks. For more theoretical foundations about causal states and their properties see (Shalizi and Crutchfield, 2001).

2.2 The Algorithm

The algorithm starts by assuming the process is an independent and identically distributed sequence with a single causal state, and then iteratively adds new states when statistical tests show that the current set of states is not sufficient. The causal state machine is built in three phases, briefly described below. For more details on the algorithm, see (Shalizi and Shalizi, 2004).

1. Initialize: Set the machine to one state containing only the null suffix. Set $l = 0$ (length of the longest suffix so far).

2. Sufficiency: Iteratively build new states depending on the future probability distribution of each possible suffix extension. Suffix sons ($ax$) for each longest suffix ($x$) are created by adding each alphabet symbol ($a$) at the beginning of each suffix. The future distribution for each son is computed and compared to the distributions of all other existing states. If the new distribution equals (with a certain confidence degree $\alpha$) the distribution of an existing state, the suffix son is added to this state. Otherwise, a new state for the suffix son is created.

   The suffix length $l$ is increased by one at each iteration. This phase goes on until $l$ reaches some fixed maximum value $l_{max}$, the maximum length to be considered for a suffix, which represents the longest histories taken into account. The results of the system will be significantly different depending on the chosen $l_{max}$ value: the larger this value is, the longer the patterns that CSSR will be able to capture, but also the more training data will be necessary to learn a correct automaton with statistical reliability.

3. Recursion: Since CSSR models stationary processes, first of all the transient states are removed. Then the states are split until a deterministic machine is reached. To do so, the transitions for each suffix in each state are computed, and if two suffixes in one state have different transitions for the same symbol, they are split into two different states.
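The sufficiency phase above can be illustrated with a minimal sketch. It is not the reference implementation: the statistical test with confidence degree $\alpha$ is replaced here by a plain threshold on the total variation distance between next-symbol distributions, symbols are assumed to be single characters, and the names `next_symbol_counts`, `total_variation`, `sufficiency` and `threshold` are hypothetical.

```python
from collections import Counter, defaultdict


def next_symbol_counts(sequence, max_len):
    """Count which symbol follows every history of length 0..max_len."""
    counts = defaultdict(Counter)
    for i, symbol in enumerate(sequence):
        for length in range(0, max_len + 1):
            if i - length < 0:
                break
            counts[sequence[i - length:i]][symbol] += 1
    return counts


def total_variation(c1, c2):
    """Distance between two count tables after normalising them."""
    n1, n2 = sum(c1.values()), sum(c2.values())
    symbols = set(c1) | set(c2)
    return 0.5 * sum(abs(c1[s] / n1 - c2[s] / n2) for s in symbols)


def sufficiency(sequence, alphabet, max_len, threshold=0.05):
    """Group suffixes into states by similarity of their future distribution.

    Assumption: the threshold test stands in for the hypothesis test
    with confidence degree alpha described in the text.
    """
    counts = next_symbol_counts(sequence, max_len)
    # Initialize: one state holding only the null suffix.
    states = [{"suffixes": {""}, "counts": Counter(counts[""])}]
    for length in range(1, max_len + 1):
        parents = [s for st in states for s in st["suffixes"]
                   if len(s) == length - 1]
        for parent in sorted(parents):
            for a in alphabet:
                child = a + parent          # suffix son "ax"
                if not counts[child]:       # history never observed
                    continue
                for st in states:
                    if total_variation(counts[child], st["counts"]) <= threshold:
                        st["suffixes"].add(child)
                        st["counts"].update(counts[child])
                        break
                else:
                    states.append({"suffixes": {child},
                                   "counts": Counter(counts[child])})
    return states


states = sufficiency("AB" * 50, "AB", max_len=2)
for st in states:
    print(sorted(st["suffixes"]))
```

For the alternating toy sequence, the sketch ends with three groups of suffixes: the null suffix, the histories ending in A, and the histories ending in B. The recursion phase (removal of transient states and determinization) is not covered.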

The main parameter of this algorithm is the maximum length ($l_{max}$) the suffixes can reach, that is, the maximum length of the considered histories. In terms of HMMs, $l_{max}$ would be the potential maximum order of the model (the learned automaton would be an HMM of order $l_{max}$ if all the suffixes belonged to different states).

When using CSSR, it is necessary to reach a trade-off between the amount of data ($N$), the vocabulary size ($k$) and the maximum length used ($l_{max}$). According to (Shalizi and Shalizi, 2004), the maximum length that can be used with statistical reliability is given by the ratio $\log N / \log k$.

3 Chunking and NP Detection

This work focuses on studying the behaviour of CSSR when applied to NP detection. This section presents an overview of this task.


Text Chunking consists of dividing sentences into non-recursive, non-overlapping phrases (chunks) and of classifying them into a closed set of grammatical classes (Abney, 1991) such as noun phrase, verb phrase, etc. Each chunk contains a set of correlative words that are syntactically related.

This task is usually seen as a step prior to full parsing, but for many NLP tasks, having the text correctly separated into chunks is preferable to having a full parse, which is more likely to contain mistakes. In fact, sometimes the only information needed is the noun phrase (NP) chunks or, at most, the NP and VP (verb phrase) chunks. For that reason, the first efforts devoted to Chunking were focused on NP-chunking (Church, 1988; Ramshaw and Marcus, 1995), while others deal with NP, VP and PP (prepositional phrase) (Veenstra, 1999). In (Buchholz, Veenstra, and Daelemans, 1999) an approach to perform text Chunking for NP, VP, PP, ADJP (adjective phrases) and ADVP (adverbial phrases) using Memory-Based Learning is presented.

Like most NLP tasks, Chunking can be approached using hand-built grammars and finite state techniques or via statistical models and Machine Learning techniques. Some of these approaches are framed in the CoNLL-2000 Shared Task (Tjong Kim Sang and Buchholz, 2000).

As the aim of this work is to study the viability of applying CSSR to NLP tasks, especially the patterns that CSSR is able to learn, the experiments performed focus on the task of detecting NPs, ignoring, for the moment, the other kinds of chunks.

4 Ability of CSSR to Capture NP Models

This section presents the experiments performed using CSSR to capture the patterns that form language subsequences such as NPs. The goal of these experiments is to see how well this method can infer automata that capture phrase patterns, as well as to study the influence of different $l_{max}$ values and amounts of training data on the learned automata.

The patterns that may be found in a phrase depend on the word features studied. For example, there are some orthographical patterns associated with punctuation marks (e.g. after a dot a capitalized word is expected), other more complex patterns associated with the syntactic structure of the sentence, etc. Depending on which patterns need to be captured, different features of the words in the sentence should be highlighted.

To use CSSR to learn these patterns, it is necessary to define an alphabet representing the desired features. These features may vary depending on which structures we are really interested in modelling. To learn NP patterns, the features used are the Part of Speech (PoS) tags of words, as the syntactic structure of sentences depends strongly on them.

The data used for NP detection are extracted from the English WSJ corpus (Charniak, 2000). This is a corpus with full parsing information, with eleven different chunk types and a complete analysis of sentences, though in this work just the NP chunk information will be used. The alphabet used to train CSSR consists of a symbol for each PoS tag used in the corpus. The total number of different tags is 44, but there are some PoS tags that never appear inside any NP, so these tags can be merged into one special symbol. With this reduction, the alphabet has 38 symbols.

This training corpus has about 1,000,000 words, which means that $l_{max} < \log N / \log k = 3.8$.
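As a quick check of this bound, the following sketch evaluates $\log N / \log k$ for the corpus size and alphabet size given above. The function name `max_reliable_length` is hypothetical.

```python
import math


def max_reliable_length(n_tokens, alphabet_size):
    """Upper bound log N / log k on the usable suffix length l_max."""
    return math.log(n_tokens) / math.log(alphabet_size)


# WSJ setting from the text: about 1,000,000 words, 38 symbols.
print(round(max_reliable_length(1_000_000, 38), 1))  # 3.8
```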

To learn an automaton representing NP patterns it is necessary to distinguish the words belonging and not belonging to an NP, even if the PoS tag is the same. To do so, each word belonging to an NP is represented by its PoS tag (a symbol of the alphabet) and the words not belonging to NP chunks are mapped into a special symbol. Figure 1 shows an example of how a sentence is translated into a sequence of alphabet symbols.

Word       PoS Tag   Chunk Type   Symbol
He         PRP       NP           PRP
succeeds   VBZ       VP           Out
Terrence   NNP       NP           NNP
Daniels    NNP       NP           NNP
,          ,         none         Out
formerly   RB        ADVP         Out
a          DT        NP           DT
Grace      NNP       NP           NNP
chairman   NN        NP           NN
.          .         none         Out

Figure 1: Example of a training sentence and its translation to the alphabet
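The translation shown in Figure 1 can be sketched in a few lines. This is only an illustration of the mapping described above; the function name `encode_sentence` and the tuple layout of the input are assumptions.

```python
def encode_sentence(tagged_words):
    """Map (word, pos_tag, chunk_type) triples to alphabet symbols.

    Words inside an NP keep their PoS tag; every other word becomes "Out".
    """
    return [pos if chunk == "NP" else "Out" for _, pos, chunk in tagged_words]


sentence = [
    ("He", "PRP", "NP"), ("succeeds", "VBZ", "VP"),
    ("Terrence", "NNP", "NP"), ("Daniels", "NNP", "NP"),
    (",", ",", "none"), ("formerly", "RB", "ADVP"),
    ("a", "DT", "NP"), ("Grace", "NNP", "NP"),
    ("chairman", "NN", "NP"), (".", ".", "none"),
]
print(encode_sentence(sentence))
# ['PRP', 'Out', 'NNP', 'NNP', 'Out', 'Out', 'DT', 'NNP', 'NN', 'Out']
```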

Sentences encoded in this way are the sequences used to train CSSR. The algorithm can then learn an automaton representing NP chunks in terms of PoS tags.

Different automata with $l_{max}$ from 1 to 4 were learned, but the obtained automata are not readable, even when minimized¹. The number of states of the minimized automata varies from 34 for $l_{max} = 1$ to 1,767 for $l_{max} = 4$.

Given the size of the obtained automata, even after minimization, it is not possible to determine by inspection whether the acquired automata appropriately model NP patterns, so another method to qualitatively evaluate how accurately the generated automaton represents the data was devised, as described in the next section.

4.1 Comparing Grammars to Determine the Quality of Learned Models

In order to obtain a qualitative evaluation of the automaton acquired by CSSR for NPs, we will compare it with the regular grammar directly extracted from the syntactic annotations available in the WSJ training corpus.

The grammar obtained from the annotated corpus is regular, since the NP chunks are never recursive and are formed only by terminal symbols in this corpus. So, the grammar consists of the different possible PoS sequences for NPs observed in the corpus, with their relative frequencies.

On the other hand, the automaton learned using CSSR can be used to generate the same kind of patterns: using the transitions and probabilities of the automaton, sequences of PoS tags are generated. The subsequences between two "Out" symbols are the NP patterns that CSSR has learned. These patterns, and their occurrence frequencies, are extracted and compared with the grammar acquired from WSJ annotations. The more similar the set of rules produced by CSSR is to the actual WSJ grammar behind the data, the better we can consider the automaton to be modelling NP patterns.

To perform the comparison between these two sets of patterns and their frequencies, Jensen-Shannon divergence² (Lin, 1991) is used. This divergence gives a measure of the distance between two distributions.

¹ To minimize the automaton, the probabilistic information of transitions is ignored and a normal minimizing algorithm is applied.

² A symmetric distance derived from Kullback-Leibler divergence.

There are two main differences between the rules generated by the CSSR automaton and the rules acquired from corpus annotations. On the one hand, there are rules generated by the CSSR automaton that are not present in the corpus. This is due to the fact that CSSR over-generalizes patterns from data. On the other hand, there are some differences in the frequencies of common rules, partially due to the probability mass given to wrong rules. Both differences are captured by Jensen-Shannon divergence. The smaller this divergence is, the more similar to the original corpus grammar the CSSR-acquired automata can be considered.
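As an illustration of how such a comparison can be computed, the sketch below evaluates the Jensen-Shannon divergence between two tables of pattern frequencies. The logarithm base (2 here, so that the value lies between 0 and 1) and the names `normalise`, `kl` and `jensen_shannon` are assumptions, as are the toy frequencies.

```python
import math


def normalise(freqs):
    """Turn raw pattern frequencies into a probability distribution."""
    total = sum(freqs.values())
    return {pattern: count / total for pattern, count in freqs.items()}


def kl(p, q):
    """Kullback-Leibler divergence (base 2); q must be non-zero where p is."""
    return sum(pv * math.log2(pv / q[key]) for key, pv in p.items() if pv > 0)


def jensen_shannon(freq_a, freq_b):
    """Jensen-Shannon divergence between two pattern-frequency tables."""
    p, q = normalise(freq_a), normalise(freq_b)
    keys = set(p) | set(q)
    p = {key: p.get(key, 0.0) for key in keys}
    q = {key: q.get(key, 0.0) for key in keys}
    mixture = {key: 0.5 * (p[key] + q[key]) for key in keys}
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


# Toy frequencies: the second table contains an over-generated pattern.
corpus_rules = {"DT NN": 60, "NNP NNP": 30, "PRP": 10}
generated_rules = {"DT NN": 55, "NNP NNP": 30, "PRP": 10, "DT DT": 5}
print(jensen_shannon(corpus_rules, generated_rules))
```

Patterns missing from one table contribute through the mixture distribution, so both over-generated rules and frequency mismatches raise the value.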

The line labelled "WSJ data" in Figure 2 shows the values of this divergence for different $l_{max}$ values. It can be seen how Jensen-Shannon divergence falls as $l_{max}$ grows. This is because the number of over-generated patterns falls, which means that CSSR generalizes better, as might be expected. The difference in frequencies of common rules is also lower when using longer histories. For $l_{max} = 4$ the divergence rises again because there are not enough data to learn an automaton with statistical reliability, so using CSSR with this length introduces incorrect patterns.

4.2 Generating Data to Study CSSR Performance

One of the limitations of the study presented in section 4.1 is that, given the size of the alphabet, there are too few available data to learn automata with large $l_{max}$. As discussed above, the largest $l_{max}$ that can be used with WSJ data is 3, which may be too small to capture long NP patterns.

In order to study the influence of the amount of training data when using such a big alphabet, new data was created in the following way: using the WSJ corpus, which has a complete syntactic analysis, a grammar can be extracted capturing the structure of sentences (divided into different kinds of chunks and PoS tags) and of chunks (divided into PoS tags). Each rule has a probability depending on how many times it appears in the training corpus. Using this grammar, new data can be generated by applying rules recursively until a whole sentence is created.

The generated sentences are parse trees with the same chunk distribution as the original corpus. Then, the same method to translate sentences to the NP alphabet described above is applied, and CSSR is used to learn automata.

[Figure 2 plot: "Distance between real and CSSR-generated grammar". x-axis: $l_{max}$ (1 to 5); y-axis: Jensen-Shannon divergence (0 to 0.08). Series: WSJ data; 1 million words, no filter; 50 million words, no filter; 1 million words, filter 1%; 50 million words, filter 1%; 1 million words, filter 10%; 50 million words, filter 10%.]

Figure 2: Jensen-Shannon divergence between the CSSR-generated set of rules and the real grammar for different values of $l_{max}$ when using different filter levels of the grammar

Note that the NP structures present in the generated data will be the same as the ones observed in the real corpus, so creating data in this way is quite similar to replicating the real corpus many times. The aim of this is to simulate that large amounts of data are available and to study the behaviour of the algorithm under these conditions. In fact, replicating the same data many times is equivalent to artificially simulating that the real data is more significant, and we are interested in studying the influence of doing so on CSSR automata.

Given the nature of the algorithm, repeating the observations $N$ times changes the decision of whether or not to split two histories, because the statistical significance of the observations changes. This decision is made using $\chi^2$ statistics, and the value of $\chi^2$ is multiplied by $N$ when the data is increased by this factor. Thus, generating more data in this way amounts to giving more weight to the available data, and the results will show that this leads to learning automata that reproduce data patterns more accurately. The same goal could theoretically be reached by adjusting the confidence level of the $\chi^2$ tests, but we found this parameter to be less influential on CSSR behaviour.
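The scaling of $\chi^2$ with replicated data can be verified numerically. The sketch below computes the Pearson $\chi^2$ statistic for two rows of next-symbol counts and for the same counts replicated 10 times; the function name `chi_square` and the toy counts are assumptions made for illustration.

```python
def chi_square(row_a, row_b):
    """Pearson chi-square statistic for a 2 x K table of next-symbol counts."""
    total = sum(row_a) + sum(row_b)
    statistic = 0.0
    for row in (row_a, row_b):
        for j, observed in enumerate(row):
            expected = sum(row) * (row_a[j] + row_b[j]) / total
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic


# Toy next-symbol counts for two histories over a 3-symbol alphabet.
history_x = [30, 10, 5]
history_y = [20, 20, 5]

base = chi_square(history_x, history_y)
replicated = chi_square([10 * c for c in history_x], [10 * c for c in history_y])
print(base, replicated)  # the second value is 10 times the first
```

Since both the observed and the expected counts grow by the replication factor, each term $(O - E)^2 / E$ grows by that same factor, which is why two histories that were merged on the original data may be split on the replicated data.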

The reason why in this work we generate data using the grammar rather than replicating the corpus many times is that in this way, experiments can be performed filtering low-frequency rules to get rid of some of the noise from the original corpus. Thus, before generating the data using the learned grammar, the least frequent rules can be filtered out and a less noisy corpus can be created. In this way the generated data is expected to be more easily reproduced using CSSR.

The experiments were conducted using different corpora generated with three different grammars: one with all rules learned from WSJ (no filter), which is expected to generate data similar to the WSJ corpus, and two grammars with 1% and 10% of the probability mass filtered. This means that just the most likely rules that sum to 99% or 90% of the mass are kept.
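A minimal sketch of this kind of filtering is given below. It keeps the most likely rules until the retained probability mass reaches the target and then renormalises the kept probabilities; the renormalisation step, the small tolerance for floating-point sums, the function name `filter_rules` and the toy grammar are all assumptions not stated in the text.

```python
def filter_rules(rule_probs, filtered_mass, tolerance=1e-9):
    """Keep the most likely rules covering 1 - filtered_mass of the mass.

    The kept probabilities are renormalised so that they sum to 1
    (an assumption; needed if the grammar is then used for sampling).
    """
    target = 1.0 - filtered_mass
    kept, cumulative = {}, 0.0
    ranked = sorted(rule_probs.items(), key=lambda item: item[1], reverse=True)
    for rule, prob in ranked:
        if cumulative >= target - tolerance:
            break
        kept[rule] = prob
        cumulative += prob
    return {rule: prob / cumulative for rule, prob in kept.items()}


# Toy NP grammar with made-up probabilities.
np_rules = {"DT NN": 0.5, "NNP": 0.25, "PRP": 0.15,
            "DT JJ NN": 0.07, "NN NN NN": 0.03}
print(filter_rules(np_rules, filtered_mass=0.10))
```

With 10% of the mass filtered, only the three most likely toy rules survive.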

Using these grammars, three different corpora of 50 million tokens were created. With this amount of data $l_{max} < \log N / \log k = 4.9$, so the maximum usable length is 5. Also, a subset of 1 million tokens from each corpus was used to perform more experiments, in order to better study the influence of the amount of training data.

Figure 2 shows the divergence between the learned automata and the grammar used to generate the corpus, without filtering and with each of the two filters. For each filter level there are two lines: one for the generated corpus of 1 million words and one for the corpus of 50 million words. It can be seen that the results obtained with both non-filtered corpora are very similar to those obtained with the WSJ corpus, especially the results obtained with the 1 million word corpus, as this is the size of WSJ. That means that the generated corpus accurately reproduces the NP patterns present in WSJ. Also, it can be seen that the more rules are filtered, the more similar the behaviour of the learned automaton is to the underlying grammar, since less noisy patterns are more easily captured by CSSR.

These results also show that using more training data enables CSSR to learn more accurate automata for larger $l_{max}$. While for low $l_{max}$ values increasing the amount of data does not introduce significant differences, if enough data is available CSSR can use larger $l_{max}$ and infer more informed automata that better reproduce the grammar behind the real corpus. Generating a corpus does not really introduce new patterns, but simulates that the patterns present in real data have more statistical significance.

4.3 Discussion

In light of the results, we can conclude that CSSR is a good method for learning patterns, even quite complicated patterns such as those of NPs, but it is highly dependent on the amount of available data. For each process, there is a necessary $l_{max}$ value that captures the patterns, and if this value is big, a large corpus will be necessary. Furthermore, as the minimum amount of data necessary to learn an automaton with a given $l_{max}$ depends exponentially on the alphabet size ($N > k^{l_{max}}$), to be able to increase $l_{max}$ by 1 it would be necessary to multiply the data size by the size of the alphabet $k$.

For NP detection, the automaton generated by CSSR is not readable, but that does not mean that it does not reproduce NP patterns correctly. The automaton can be qualitatively studied by comparing the patterns that it generates with the patterns observed in the training corpus. The more similar the two sets of patterns are, the better CSSR is reproducing the patterns of the task. This comparison shows that for real data CSSR can learn better patterns as $l_{max}$ grows but, due to the limited amount of available data, for $l_{max} = 4$ the divergence rises again, as there is not enough data to learn an automaton reproducing corpus patterns with this length. So, the performance of the system is limited by the size of the training corpus.

The generated, non-filtered data can be considered equivalent to the real corpus. Also, it can be seen that when using a big amount of generated data the performance is better than for the real data, as the system can deal with longer $l_{max}$. When using small $l_{max}$, the difference between using 1 million and 50 million words is not significant. Furthermore, as expected, as the number of filtered rules grows, the divergence falls, becoming really small when $l_{max}$ grows. This means that the easier the patterns to learn are, the better they are captured by CSSR. In the case of filtered rules, the system also performs better with large $l_{max}$ if enough data is available.

Furthermore, in (Padró and Padró, 2005b) experiments similar to those presented here were performed for Named Entity Recognition (NER). In that case, the learned automata were readable when minimized, and correctly captured the patterns of sentences given the chosen sets of features. The conclusion was that CSSR was able to correctly learn the patterns of NEs with the chosen alphabet, which, combined with the results presented in this work, can lead to the conclusion that CSSR is a good method to capture language structures if enough data is available.

5 Applying CSSR to Annotating Tasks

This work has focused on the ability of CSSR to learn phrase patterns in terms of some selected sets of features, and it has been seen that CSSR can correctly reproduce the patterns of some NLP structures.

Nevertheless, in these NLP tasks it is necessary not only to obtain generative phrase models, but also to develop systems able to annotate new sentences. To perform this tagging task, hidden information about where an NP begins and ends must be taken into account. A usual approach is to encode this information in "B-I-O" tags (Ramshaw and Marcus, 1995): each word has a B, I or O tag, where B stands for words at the Beginning of a phrase (chunk or NE), I for words Internal to a phrase, and O for words Outside a phrase.

When CSSR is to be used to annotate new text, it is necessary to introduce this hidden information into the system. In (Padró and Padró, 2005b; Padró and Padró, 2005a) an approach to use CSSR for NER and Chunking was presented, which will be summarized here in order to discuss the applicability of CSSR to NLP tasks.

The basic idea of the method is that it is necessary to introduce into the alphabet the hidden information of the tag (B, I or O). To do so, each symbol encoding the features previously selected (e.g. $\Sigma = \{DT, NN, NNP, \ldots\}$ for NP) is combined with each possible B-I-O tag ($\Sigma = \{DT_B, DT_I, DT_O, NN_B, NN_I, \ldots\}$). Thus, each word in the training corpus is translated to one of these symbols, forming the training sequence.

When a new sentence has to be tagged, the part of the symbol related to context features is known (e.g. "DT", "NN", etc.) but the information about the correct B-I-O tag is not available, so there are three possible alphabet symbols for each word (e.g. $DT_B$, $DT_I$, $DT_O$ if the visible part is a DT).
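The two uses of the extended alphabet can be sketched as follows: at training time each word maps to exactly one combined symbol, while at tagging time each word yields three candidates. The symbol spelling `DT_B` and the function names `training_symbols` and `candidate_symbols` are assumptions chosen for illustration.

```python
BIO_TAGS = ("B", "I", "O")


def training_symbols(tagged_words):
    """Training time: (pos_tag, bio_tag) pairs become single symbols."""
    return [f"{pos}_{bio}" for pos, bio in tagged_words]


def candidate_symbols(pos_tags):
    """Tagging time: the B-I-O part is unknown, so keep all three options."""
    return [[f"{pos}_{bio}" for bio in BIO_TAGS] for pos in pos_tags]


print(training_symbols([("DT", "B"), ("NN", "I"), ("VBZ", "O")]))
# ['DT_B', 'NN_I', 'VBZ_O']
print(candidate_symbols(["DT", "NN"]))
# [['DT_B', 'DT_I', 'DT_O'], ['NN_B', 'NN_I', 'NN_O']]
```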

To find the most likely tag for each word in a sentence (that is, to find the most likely symbol of the alphabet, e.g. $DT_B$, $DT_I$ or $DT_O$ for a DT word) a Viterbi algorithm is applied. For each word in a sentence, the possible states the automaton could reach if the current word had the tag B, I or O are computed, together with the probabilities of these paths. At the end of the sentence, the best probability is chosen and the optimal path is recovered backwards. In this way, the most likely sequence of B-I-O tags is obtained.
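The search described above can be sketched as a Viterbi-style pass over a deterministic automaton. This is a simplified illustration, not the authors' implementation: the automaton is assumed to be a dictionary mapping (state, symbol) to (next state, probability), the best path is stored per state instead of using explicit backpointers, and the toy automaton with its probabilities is made up.

```python
import math

BIO_TAGS = ("B", "I", "O")


def viterbi_bio(pos_tags, transitions, start_state):
    """Return the most likely B-I-O sequence, or None if no path exists."""
    # best[state] = (log-probability of the best path reaching it, its tags)
    best = {start_state: (0.0, [])}
    for pos in pos_tags:
        new_best = {}
        for state, (log_prob, path) in best.items():
            for bio in BIO_TAGS:
                step = transitions.get((state, f"{pos}_{bio}"))
                if step is None:
                    continue
                next_state, prob = step
                candidate = log_prob + math.log(prob)
                if (next_state not in new_best
                        or candidate > new_best[next_state][0]):
                    new_best[next_state] = (candidate, path + [bio])
        if not new_best:
            return None
        best = new_best
    return max(best.values(), key=lambda entry: entry[0])[1]


# Toy automaton (hypothetical states and probabilities).
toy_transitions = {
    ("out", "DT_B"): ("in", 0.4),
    ("out", "NN_B"): ("in", 0.2),
    ("out", "VBZ_O"): ("out", 0.4),
    ("in", "NN_I"): ("in", 0.5),
    ("in", "NN_B"): ("in", 0.1),
    ("in", "VBZ_O"): ("out", 0.4),
}
print(viterbi_bio(["DT", "NN", "VBZ"], toy_transitions, "out"))
# ['B', 'I', 'O']
```

Because the automaton is deterministic, each (state, symbol) pair leads to a single next state, so only one best path per reachable state has to be kept at each word.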

5.1 Results on NP Detection

For NP detection experiments, CoNLL-00 shared task (Tjong Kim Sang and Buchholz, 2000) data are used. The training corpus has about 200,000 words, and the best $F_1$ obtained is 89.11% with $l_{max} = 2$. In fact, in (Padró and Padró, 2005a) chunking with all chunk types was performed, obtaining an overall result of $F_1 = 88.20$, which is comparable to the last systems in the competition but quite far from the best systems.

Furthermore, following the strategy described in Section 4.2, we can force the statistical significance of the hypothesis test by replicating the data many times. Doing so improves the results, giving F1 = 90.96, also with l_max = 2, when the data are replicated 1000 times. So increasing the significance of the data also leads to better results in annotation tasks.

Also, in (Padro and Padro, 2005b), similar experiments (without replicating the corpus) using CSSR for NER were presented. In those experiments the best parametrization led to an F1 of 88.96%. The system with this parametrization, combined with the NEC system used by the winner of the CoNLL-2002 shared task (Carreras, Marquez, and Padro, 2002), would place our system fifth in the competition. This is not a bad result, especially considering the simplicity of the features used.

5.2 Discussion

The results obtained on the NP annotation task show that the problem of the amount of data needed becomes worse when CSSR is used to tag new sentences.

The first experiments with this kind of task were promising, as the approach was very simple and the results were comparable to state-of-the-art systems. Nevertheless, if more information is to be included in the system to try to improve the results, the amount of data needed becomes a limitation. Furthermore, even if enough data were available, a computational limitation would be reached, especially in tasks such as NP detection, where the alphabet is large and a lot of data has to be processed.

The main problem of this approach is that introducing the hidden information multiplies the alphabet size by 3, which means that the amount of data needed to use CSSR with the same l_max as without B-I-O information is $3^{l_{max}}$ times larger than before. If CSSR can learn an accurate automaton of length $l$ using a training corpus of $N = k^l$ words, then $N' = (3k)^l = N \cdot 3^l$ words will be needed to perform the tagging task under the B-I-O approach.
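As a small worked example of this growth, the following sketch simply evaluates N' = N · 3^l. The function name and the choice of inputs are ours; the 200,000-word figure is the CoNLL-00 training size quoted in Section 5.1.

```python
def required_words(n_words, l_max, n_tags=3):
    """Corpus size needed once each symbol is split into n_tags variants.

    Implements N' = N * n_tags ** l_max from the text above.
    """
    return n_words * n_tags ** l_max

# A corpus that suffices without B-I-O tags, for several history lengths.
for l_max in (1, 2, 3, 4):
    print(l_max, required_words(200_000, l_max))
```

With l_max = 2 the requirement is already nine times the original corpus, and each further increment of l_max triples it again.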

6 Conclusions and Future Work

A study of how CSSR is able to capture patterns in language has been presented. It has been shown that this algorithm can learn automata representing processes if enough data are available, or if the process is simple enough.

One of the main limitations of CSSR is that it is useful for learning patterns, but it is not directly prepared to introduce hidden information and to perform annotation tasks. The approach presented in (Padro and Padro, 2005b) gives reasonably good results for NER but not such good results for NP detection. This is because, as the alphabet grows, more data than are available would be needed to learn an accurate automaton, and the available corpus is not big enough.

The main conclusion of this work is that CSSR can correctly learn the patterns of sequential data, especially if the data are not very noisy, but that it is highly dependent on the amount of data, the size of the alphabet and l_max. Furthermore, this dependency is exponential, so a small increase in system performance requires a large increase in the amount of data. CSSR can therefore be useful for systems with small alphabets, as in other applications of CSSR such as those presented in (Varn and Crutchfield, 2004; Ray, 2004), but in systems where many features have to be taken into account, such as NLP annotation tasks, the amount of available data is likely to be a limitation.

Accordingly, the main line of future work is to modify CSSR so that more information can be introduced into the system. As the alphabet size has to be small, our proposal is to introduce all the features not encoded in the alphabet via Maximum Entropy (ME) models. The histories would then consist of sets of features instead of suffixes, and CSSR would build the causal states from the probability of seeing a symbol after a given history, computed using ME, instead of from the simple suffixes and their transition probabilities.

References

Abney, Steven. 1991. Parsing by Chunks. In R. Berwick, S. Abney and C. Tenny (eds.), Principle-based Parsing. Kluwer Academic Publishers, Dordrecht.

Buchholz, Sabine, Jorn Veenstra, and Walter Daelemans. 1999. Cascaded grammatical relation assignment. In Proceedings of EMNLP/VLC-99, pages 239–246, University of Maryland, USA.

Carreras, Xavier, Lluís Màrquez, and Lluís Padró. 2002. Named entity extraction using AdaBoost. In Proceedings of CoNLL Shared Task, pages 167–170, Taipei.

Charniak, Eugene. 2000. BLLIP 1987-89 WSJ Corpus Release 1. Linguistic Data Consortium, Philadelphia.

Church, Kenneth W. 1988. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the 1st Conference on Applied Natural Language Processing, ANLP, pages 136–143. ACL.

Lin, J. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.

Padró, Muntsa and Lluís Padró. 2005a. Approaching sequential NLP tasks with an automata acquisition algorithm. In Proceedings of International Conference on Recent Advances in NLP (RANLP'05), Bulgaria, September.

Padró, Muntsa and Lluís Padró. 2005b. A named entity recognition system based on a finite automata acquisition algorithm. Procesamiento del Lenguaje Natural, (35):319–326, September.

Ramshaw, L. and M. P. Marcus. 1995. Text chunking using transformation-based learning. In Proceedings of the Third ACL Workshop on Very Large Corpora.

Ray, Asok. 2004. Symbolic dynamic analysis of complex systems for anomaly detection. Signal Process., 84(7):1115–1130.

Shalizi, Cosma R. and James P. Crutchfield. 2001. Computational mechanics: Pattern and prediction, structure and simplicity. Journal of Statistical Physics, 104:817–879.

Shalizi, Cosma R. and Kristina L. Shalizi. 2004. Blind construction of optimal nonlinear recursive predictors for discrete sequences. In Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference.

Tjong Kim Sang, Erik F. and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Claire Cardie, Walter Daelemans, Claire Nedellec, and Erik Tjong Kim Sang, editors, Proceedings of CoNLL-2000 and LLL-2000, pages 127–132. Lisbon, Portugal.

Varn, D. P. and J. P. Crutchfield. 2004. From finite to infinite range order via annealing: The causal architecture of deformation faulting in annealed close-packed crystals. Physics Letters A, 324:299–307.

Veenstra, J. 1999. Memory-based text chunking. In Nikos Fakotakis (ed.), Machine learning in human language technology, workshop at ACAI 99, Chania, Greece.



Machine Learning for Multilingual Temporal Recognition Based on TiMBL∗

Marcel Puchol-Blasco, Estela Saquete, Patricio Martínez-Barco
Dept. de Lenguajes y Sistemas Informáticos (Universidad de Alicante)
Carretera San Vicente s/n, 03690 Alicante, España
{marcel,stela,patricio}@dlsi.ua.es

Abstract: This paper presents a machine-learning-based system for temporal expression recognition. The system uses TiMBL, a memory-based machine learning application. Porting the system to new languages has a very low cost, because it requires almost no language-dependent resources (only a tokenizer and a POS tagger, and the lack of a POS tagger has little effect on the final results). The system has been evaluated on three languages: English, Spanish and Italian. The evaluation results are satisfactory for corpora containing many examples, but quite poor for corpora with only a few examples.

Keywords: temporal information, temporal expression recognition, machine learning

1. Introduction

Temporal expression recognition is becoming increasingly important as a task within Natural Language Processing (NLP). Its importance lies in the fact that it is a prior step to temporal expression resolution, a task that can be used in other NLP fields such as temporal question answering, summarization, event ordering, etc.

As in almost every area of NLP, there are two approaches to temporal expression recognition: knowledge-based or rule-based systems, and machine learning (ML) systems.

∗ This research has been partially funded under the projects QALL-ME (FP6-IST-033860), TEXT-MESS (TIN-2006-15265-C06-01) and GV06/028, and under the research grant BF-PI06/182.

One of the most important properties that current NLP systems should have is ease of adaptation to new languages. In this respect, rule-based systems have a major drawback, since the entire rule set must be rewritten and adapted to the new language. ML methods, by contrast, have a clear advantage: adapting them to other languages costs less than adapting rule-based systems. To adapt several rule-based systems, the knowledge base of each system must be adapted, whereas to adapt several ML-based systems, producing a single annotated corpus is usually sufficient to adapt them all. Nevertheless, an important drawback of ML systems is the need for a corpus annotated with temporal expressions in the new language, which is not always available.

In previous publications we addressed the adaptation of a rule-based temporal resolution system for Spanish (TERSEO; see Saquete, Muñoz, and Martínez-Barco (2005)), based on translating its rules with machine translation methods.

In order to improve on the results obtained previously (an F-measure of 89% for English and 79% for Italian), and given the good results of the ML systems presented at different competitions (such as the Time Expression Recognition and Normalization Workshop, TERN 2004 [1]), we decided to change the methodology used in some TERSEO modules.

In this paper we present the adaptation of the temporal expression recognition module used by TERSEO to ML methods. For this purpose we chose the ML system TiMBL (Daelemans, Zavrel, and van der Sloot, 2004).

The paper is structured as follows: Section 2 describes the machine learning system used, and Section 3 describes the implemented system. Section 4 presents the evaluation of the system in three different languages and compares the results with other ML systems and with the rule-based system TERSEO. Finally, Section 5 presents the conclusions and the future work planned in this line of research.

2. Machine learning system

Machine learning systems are now widely used in NLP. As a result, many such systems have been developed, widening the range of options when selecting a system for a particular case.

One system that has obtained good results in NLP applications is TiMBL [2]. Because of these good results and the availability of its API (which made it possible to build some of the features used to train the system), it was selected as the machine learning application for our system. A brief summary of the characteristics of TiMBL follows.

[1] http://timex2.mitre.org/tern.html
[2] http://ilk.uvt.nl/timbl/

2.1. TiMBL

TiMBL (Tilburg Memory-based Learning Environment) is an application that implements several memory-based algorithms. What all of these algorithms have in common is that, during the training phase, they explicitly store some representation of the training set in memory. In the evaluation phase, new cases are classified by extrapolation from the most similar stored case.

Memory-based learning (MBL) rests on the hypothesis that performance in cognitive tasks is based on reasoning about new situations by analogy with situations stored from earlier experience, rather than on applying abstract mental rules derived from earlier experience.

An MBL system has two main components:

Memory-based learning component, which is responsible for storing the examples in memory.

Similarity-based interpretation component, which uses the output of the learning component to classify the proposed examples. The similarity between a proposed example and the examples stored in memory during the learning phase is computed with the distance metric Δ(X, Y) (see equations 1 and 2). Finally, the IB1 algorithm assigns the category to the proposed example by selecting the most frequent category within the set of most similar examples.

$$\Delta(X, Y) = \sum_{i=1}^{n} \delta(x_i, y_i) \qquad (1)$$


Figure 1: System diagram. Boxes: training and evaluation documents; segmenter + tokenizer; POS tagger; temporal trigger recognizer; adaptation of the TEs to BIO format; feature composition; TiMBL training (training documents); TiMBL model; TiMBL evaluation (evaluation documents); tagging post-process; evaluation documents tagged with the TEs.

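$$\delta(x_i, y_i) = \begin{cases} \left| \dfrac{x_i - y_i}{\max_i - \min_i} \right| & \text{if numeric; otherwise} \\ 0 & \text{if } x_i = y_i \\ 1 & \text{if } x_i \neq y_i \end{cases} \qquad (2)$$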
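Equations 1 and 2 and the IB1 decision rule can be illustrated with the short sketch below. It is a simplified stand-in written for this text, not TiMBL's code or API: it uses unweighted features, takes only the examples at the single smallest distance as "most similar", and the function names and toy memory are assumptions.

```python
from collections import Counter

def delta(x, y, value_range=None):
    """Per-feature distance of equation (2)."""
    if value_range is not None:              # numeric feature, scaled by range
        lo, hi = value_range
        return abs(x - y) / (hi - lo) if hi > lo else 0.0
    return 0.0 if x == y else 1.0            # symbolic feature: overlap

def distance(X, Y, ranges=None):
    """Equation (1): sum of the per-feature distances."""
    ranges = ranges or {}
    return sum(delta(x, y, ranges.get(i)) for i, (x, y) in enumerate(zip(X, Y)))

def ib1_classify(query, memory, ranges=None):
    """Most frequent class among the stored examples nearest to the query."""
    scored = [(distance(query, feats, ranges), label) for feats, label in memory]
    nearest = min(d for d, _ in scored)
    votes = Counter(label for d, label in scored if d == nearest)
    return votes.most_common(1)[0][0]

# Toy memory of (features, class) pairs: (token, POS tag) -> BIO class.
MEMORY = [
    (("cuatro", "Z"), "B"),
    (("horas", "NCFP000"), "I"),
    (("horas", "NCFS000"), "I"),
    (("alarma", "NCFS000"), "O"),
]
print(ib1_classify(("horas", "NCFP000"), MEMORY))
```

The `ranges` argument maps the index of each numeric feature to its (min, max) pair; features without an entry are compared symbolically.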

3. System description

The system proposed in this paper uses the ML system TiMBL (described in Section 2.1) to learn from the set of examples generated for the training phase from the selected features, and then to tag the set of examples generated for the evaluation phase.

The training and evaluation examples are generated following the methodology shown in Figure 1.

The steps followed to process the documents are:

1. Segmentation of the document into sentences.

2. Tokenization of the elements of each sentence.

3. Extraction of the POS of each token.

4. Adaptation of the temporal expressions to BIO format.

5. Recognition of the tokens that may be temporal triggers.

6. Composition of the training features if the document is intended for the training phase, or of the evaluation features if it is intended for the evaluation phase.

7. Classification of the examples with TiMBL.

8. Post-processing of the TiMBL output.

Take the following sentence as an example:

La alarma sonó <TIMEX2>cuatro horas antes de la explosión</TIMEX2>.

The sentence is tokenized, a POS tagger [3] is used to obtain the lexical category of each token, and the temporal expressions are converted to BIO format (Begin: start of the temporal expression; Inside: within the temporal expression; Outside: outside the temporal expression), producing a vertical layout like the following:

La         O  DA0FS0
alarma     O  NCFS000
sonó       O  VMIS3S0
cuatro     B  Z
horas      I  NCFP000
antes      I  RG
de         I  SPS00
la         I  DA0FS0
explosión  I  NCFS000
.          O  Fp
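The conversion from inline TIMEX2 markup to BIO tags can be sketched as follows. The regular-expression tokenizer is a deliberately naive assumption standing in for the real segmenter and tokenizer, and the POS column is omitted.

```python
import re

# Either a TIMEX2 tag (with optional attributes), a word, or a punctuation mark.
TOKEN_RE = re.compile(r"</?TIMEX2[^>]*>|\w+|[^\w\s]")

def to_bio(text):
    """Turn a sentence with inline TIMEX2 tags into (token, BIO tag) pairs."""
    rows = []
    inside = False   # currently within a temporal expression
    first = False    # next token is the first one of the expression
    for tok in TOKEN_RE.findall(text):
        if tok.startswith("</TIMEX2"):
            inside = False
        elif tok.startswith("<TIMEX2"):
            inside, first = True, True
        elif inside:
            rows.append((tok, "B" if first else "I"))
            first = False
        else:
            rows.append((tok, "O"))
    return rows

sentence = "La alarma sonó <TIMEX2>cuatro horas antes de la explosión</TIMEX2>."
for token, tag in to_bio(sentence):
    print(token, tag)
```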

Next, temporal triggers are recognized: each token is checked against the following ontology of temporal triggers:

Day of the week: lunes, martes, miércoles...

Months of the year: enero (ene.), febrero (feb.), marzo (mar.)...

Seasons of the year: primavera, otoño, invierno or verano.

Holidays: Navidad, Epifanía, Adviento, Halloween...

Temporal words: ayer, anteayer, hoy, mañana, tarde, noche, anteanoche, tiempo, presente, pasado, futuro, hora, minuto, segundo...

Possible temporal prepositions: durante, entre, hasta...

Possible temporal adverbs: antes, después...

Numbers

Simple dates: dd/mm/yyyy

[3] POS tagger (lexical category tagger).
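A trigger recognizer over such an ontology could look like the sketch below. The category names, the lookup structure and the rule that only digit strings count as numbers are assumptions made for illustration; the word lists contain only the examples given above, not the full ontology.

```python
import re

# Hypothetical category names; entries are the examples listed in the text.
TRIGGERS = {
    "weekday": {"lunes", "martes", "miércoles"},
    "month": {"enero", "ene.", "febrero", "feb.", "marzo", "mar."},
    "season": {"primavera", "otoño", "invierno", "verano"},
    "holiday": {"navidad", "epifanía", "adviento", "halloween"},
    "temporal_word": {"ayer", "anteayer", "hoy", "mañana", "tarde", "noche",
                      "anteanoche", "tiempo", "presente", "pasado", "futuro",
                      "hora", "minuto", "segundo"},
    "preposition": {"durante", "entre", "hasta"},
    "adverb": {"antes", "después"},
}
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")   # simple dd/mm/yyyy dates

def trigger_class(token):
    """Return the trigger category of a token, or None if it is not a trigger."""
    if DATE_RE.fullmatch(token):
        return "date"
    if token.isdigit():
        return "number"
    lowered = token.lower()
    for category, words in TRIGGERS.items():
        if lowered in words:
            return category
    return None

print([trigger_class(t) for t in ["La", "Navidad", "de", "2007"]])
```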

The next step is to generate the examples from which the ML system will learn. To do so, a set of features must be extracted from the sentences. The features considered in this system can be grouped as follows:

Token-related features (TOK): TOK_0, BIGR(TOK_-1 TOK_0), BIGR(TOK_0 TOK_1), BIGR(TOK_-2 TOK_-1), BIGR(TOK_1 TOK_2), SUF(TOK)_2, SUF(TOK)_3, PREF(TOK)_2, PREF(TOK)_3.

Trigger-related features (DISP): BIGR(DISP_-1 DISP_0), BIGR(DISP_0 DISP_1).

Features related to the already tagged examples of the sentence (ETIQ): ETIQ_-1, BIGR(ETIQ_-2 ETIQ_-1), ETIQ_1, BIGR(ETIQ_1 ETIQ_2) [4].

POS-related features: POS_1.

NOTES: TOK (token), DISP (trigger), ETIQ (already tagged element), BIGR (bigram).
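The token-related group can be illustrated with the sketch below, which builds the nine TOK features for one position of a tokenized sentence. The dictionary keys and the padding symbol "_" used beyond the sentence boundaries are our assumptions; the paper does not say how boundaries are handled.

```python
PAD = "_"   # assumed placeholder for positions outside the sentence

def token_features(tokens, i):
    """Token-related (TOK) features for the token at position i."""
    def tok(j):
        return tokens[j] if 0 <= j < len(tokens) else PAD

    current = tok(i)
    return {
        "TOK0": current,
        "BIGR(-1,0)": tok(i - 1) + " " + current,
        "BIGR(0,1)": current + " " + tok(i + 1),
        "BIGR(-2,-1)": tok(i - 2) + " " + tok(i - 1),
        "BIGR(1,2)": tok(i + 1) + " " + tok(i + 2),
        "SUF2": current[-2:],    # last two characters
        "SUF3": current[-3:],    # last three characters
        "PREF2": current[:2],    # first two characters
        "PREF3": current[:3],    # first three characters
    }

print(token_features(["cuatro", "horas", "antes"], 1))
```

The trigger and POS groups would follow the same pattern over the parallel sequences of trigger categories and POS tags.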

However, these were not the only features initially considered. The following features were evaluated as possible improvements but were discarded because they gave worse results:

Token-related features (TOK): BIGR(TOK_-3 TOK_-2), BIGR(TOK_2 TOK_3).

Trigger-related features (DISP): BIGR(TOK_-2 TOK_-1), BIGR(TOK_1 TOK_2).

Features related to already tagged examples in the sentence (ETIQ1): BIGR(ETIQ_2 ETIQ_3), BIGR(ETIQ_-2 ETIQ_-3).

Features related to already tagged examples in the sentence (ETIQ2): for every x in [ET_ini, 0], DISP_x if DISP_x exists, and TOK_x otherwise.

Features related to already tagged examples in the sentence (ETIQ3): for every x in [ET_ini, 0], TOK_x if TOK_x is not in STOPWORDS.

Acronyms used: TOK (token), DISP (trigger), ETIQ (already tagged element), BIGR (bigram). Positions used: 0 (current position), -x (x positions before), x (x positions after), ET_ini (start position of the current temporal expression).

[4] The treatment of this kind of feature is described later.

It is important to note that the features related to already tagged examples are treated differently in the training phase and in the evaluation phase. In the training phase this information is available, whereas in the evaluation phase the normal operation of TiMBL had to be modified in order to handle this kind of feature. The following algorithm describes how these features are handled:

First pass - Descending
For each example, in descending order
    @num = CLASS[POS-num]
    CAR[#num] = NONE
    Classify
    Save CA
End For

Second pass - Ascending
For each example, in ascending order
    @num = CLASS[POS-num]
    #num = CLASS[POS+num]
    Classify
    If CA ≠ CAA then
        Third pass - Descending
        POS3 = POS + 1
        Do
            Take example
            @num = CLASS[POS3-num]
            #num = CLASS[POS3+num]
            Classify
            POS3++
        While CA ≠ CAA
End For

NOTES: CA (assigned class), CAA (previously assigned class), @ (preceding class), # (following class), CAR (feature).
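One possible reading of this multi-pass procedure is sketched below. It is an interpretation under stated assumptions, not the authors' code: `classify(i, prev, nxt)` is a hypothetical callback standing in for a TiMBL classification with the preceding and following classes as extra features, "descending" is read as top to bottom, "ascending" as bottom to top, and a context window of two positions is assumed.

```python
def multipass_tag(n, classify, window=2, none="NONE"):
    """Tag n examples using classes of neighbouring examples as features."""
    labels = [None] * n

    def context(i):
        # Classes of the `window` preceding and following examples;
        # unknown or out-of-range positions are reported as `none`.
        prev = [labels[i - k] if i - k >= 0 and labels[i - k] is not None else none
                for k in range(1, window + 1)]
        nxt = [labels[i + k] if i + k < n and labels[i + k] is not None else none
               for k in range(1, window + 1)]
        return prev, nxt

    # First pass, top to bottom: following classes are still unknown.
    for i in range(n):
        prev, nxt = context(i)
        labels[i] = classify(i, prev, nxt)

    # Second pass, bottom to top: both contexts are now available.
    for i in range(n - 1, -1, -1):
        prev, nxt = context(i)
        new = classify(i, prev, nxt)
        if new != labels[i]:
            labels[i] = new
            # Third pass, top to bottom from the next example: re-classify
            # while the assigned class keeps changing.
            j = i + 1
            while j < n:
                prev, nxt = context(j)
                new_j = classify(j, prev, nxt)
                if new_j == labels[j]:
                    break
                labels[j] = new_j
                j += 1
    return labels

# Toy usage with a rule-based stand-in for the classifier.
TOKENS = ["sonó", "cuatro", "horas", "antes", "."]
TIME_WORDS = {"cuatro", "horas", "antes"}

def toy_classify(i, prev, nxt):
    if TOKENS[i] not in TIME_WORDS:
        return "O"
    return "I" if prev[0] in ("B", "I") else "B"

print(multipass_tag(len(TOKENS), toy_classify))
```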

Figure 2 shows an example trace of this algorithm.

Once all the examples have been tagged, a very simple post-process checks the coherence of the tags output by the ML system. It looks for any example classified as I whose preceding position is tagged O, and changes that I tag to B.
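This coherence check is simple enough to sketch directly. Treating the start of the sentence as if it were preceded by an O tag is our assumption; the text only mentions an I preceded by an O.

```python
def fix_bio(tags):
    """Replace any I that follows an O with B, so expressions start with B."""
    fixed = []
    previous = "O"   # assumption: the sentence start behaves like an O
    for tag in tags:
        if tag == "I" and previous == "O":
            tag = "B"
        fixed.append(tag)
        previous = tag
    return fixed

print(fix_bio(["O", "I", "I", "O", "B", "I"]))
```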

Once this whole process has been completed, the evaluation documents are tagged with the temporal expressions.

4. Experimental results

The system was tested on three languages: English, Spanish and Italian. For each language a corpus annotated with TIMEX2 tags was selected; these corpora are described below. Because the purpose of this evaluation is not comparison with existing systems but obtaining the best possible results, 3-fold cross validation was used. The scoring software used to measure system performance is the one officially provided by TERN, which is based on a script developed by MITRE for system evaluation. Results are reported as precision and recall, together with the Fβ=1 metric. Finally, the conclusions drawn from the results are presented.
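For reference, the Fβ measure combines precision P and recall R as (1 + β²)·P·R / (β²·P + R), which for β = 1 is their harmonic mean. The sketch below is only a restatement of that standard formula, not the TERN scorer; the example values are the English TOK row of Table 2.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure; with beta = 1 it is the harmonic mean of P and R."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

# English, TOK features only (Table 2): P = 0.654, R = 0.839.
print(round(f_beta(0.654, 0.839), 3))
```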

4.1. Corpora used

The corpus used for English is the one provided for TERN 2004 [5]. It consists of news documents drawn from newspapers, news broadcasts and news agencies. For the evaluation carried out here, the training and evaluation corpora were merged.

The corpus used for Spanish consists of a set of documents taken from Spanish-language online newspapers, used in earlier evaluations of the TERSEO system.

The corpus used for Italian is called I-CAB. It was created as part of the ONTOTEXT project [6] and consists of news documents taken from the local newspaper L'Adige. The annotation follows the standards of the ACE program (Automatic Content Extraction [7]) for the Temporal Expression Recognition and Normalization task (Ferro et al., 2005).

The most important characteristics of these three corpora are shown in Table 1.

[5] http://timex2.mitre.org/tern.html

Language  DOCS  TOK      TE
English   511   196,473  4,728
Spanish   100   39,719   431
Italian   528   204,185  4,548

Table 1: Information about the corpora used to evaluate the system (DOCS: documents; TOK: tokens; TE: temporal expressions)

4.2. Evaluation process

As mentioned above, an initial set of features was generated, from which the best ones were selected. The selection followed the method of Moreda and Palomar (2005) and yielded the features that make up the final system. For the features related to part-of-speech information, the FreeLing tool was used (Atserias et al., 2006).

The temporal expression recognition results for the different languages, using the features finally selected, are shown in Table 2 for the TERN scorer measure TIMEX2 [8] and in Table 3 for the TERN scorer measure TIMEX2:TEXT [9].

[6] http://tcc.itc.it/projects/ontotext
[7] http://www.nist.gov/speech/tests/ace
[8] Measure that checks the detection of temporal expressions.
[9] Measure of the extent of the TE (checks the boundaries of the TEs).



Figure 2: Example trace of the algorithm for features related to already tagged examples (panels: 1st ascending iteration; 3rd descending iteration, continued; columns: PAL, CAA, CA).

                     English              Spanish              Italian
Features             P      R      F      P      R      F      P      R      F
TOK                  0.654  0.839  0.735  0.503  0.683  0.579  0.630  0.755  0.687
TOK+DISP             0.713  0.872  0.784  0.541  0.795  0.642  0.661  0.792  0.721
TOK+DISP+ETIQ        0.861  0.823  0.841  0.742  0.673  0.705  0.791  0.740  0.765
TOK+DISP+ETIQ+POS    0.871  0.833  0.851  0.744  0.708  0.725  0.784  0.748  0.765

Table 2: System results for TIMEX2

As can be seen, the feature types were evaluated incrementally in order to show the contribution of each. The measures shown in the tables are P (precision), R (recall) and F (F-measure).

The languages whose corpora contain more examples for the system to learn from obtain better results.

Another important factor is the incorporation of earlier classifications, together with the multi-pass algorithm used to obtain both the preceding and the following classifications. Features of this kind can improve the precision of the system by more than 10%. By contrast, adding POS information improves the system by only about 1%. This raises the question of whether it is really worth adding a language-dependent resource (the POS tagger [10]) to the system for such a small improvement.

4.3. Comparison with other systems

Because of the evaluation method used (3-fold cross validation), a direct comparison with other machine learning systems is not possible, since those systems use different evaluation methods. Nevertheless, comparing this system with systems such as that of Hacioglu, Chen, and Douglas (2005), the system presented here yields lower precision and recall. However, considering the evaluation method, we regard the results presented here as more robust, since we consider that 3-fold cross validation provides more reliable results than the evaluation used in that paper. The requirements of the two systems must also be taken into account: this system needs only a segmenter, a tokenizer and a POS tagger, whereas the other system additionally needs a parser [11] and a chunker [12].

Comparing the results obtained in this paper with those obtained previously with TERSEO (Saquete et al., 2006), the results for English are quite similar, while for Italian precision drops slightly. However, for Spanish, the source language of TERSEO (see Saquete, Muñoz, and Martínez-Barco (2005)), the results favour TERSEO, which obtains 80% precision against the 72% obtained by this system. Nevertheless, considering the results obtained by this system without any language-dependent resource (an F-measure of 70%) and the fact that TERSEO requires language-dependent resources to work (TERSEO needs a POS tagger), the results of this system are quite satisfactory in this respect. Moreover, the cost of adapting TERSEO to languages other than Spanish is much higher than the cost of adapting this system.

[10] POS tagger (lexical category disambiguator).
[11] System that performs a full syntactic analysis of the sentence.
[12] System that performs a partial syntactic analysis of the sentence.

                     English              Spanish              Italian
Features             P      R      F      P      R      F      P      R      F
TOK                  0.563  0.722  0.633  0.360  0.487  0.413  0.524  0.628  0.571
TOK+DISP             0.596  0.731  0.657  0.387  0.572  0.462  0.546  0.655  0.596
TOK+DISP+ETIQ        0.756  0.723  0.739  0.585  0.531  0.556  0.667  0.625  0.646
TOK+DISP+ETIQ+POS    0.766  0.733  0.749  0.582  0.553  0.567  0.664  0.633  0.648

Table 3: System results for TIMEX2:TEXT

5. Conclusions and future work

A TiMBL-based machine learning system has been presented that has a low cost of adaptation to other languages, provided that a corpus annotated with TEs exists for the target language. The system has been tested on three languages: English, Spanish and Italian. For languages whose corpus contains many examples for the machine learning system to draw on, the results are satisfactory (for English, 85% for the TIMEX2 evaluation and 75% for the TIMEX2:TEXT evaluation; for Italian, 76% for TIMEX2 and 65% for TIMEX2:TEXT). However, for corpora with few examples to learn from, the results are rather poor (for Spanish, 72% for TIMEX2 and 57% for TIMEX2:TEXT).

These results are favourable and sufficient for incorporating this system into TERSEO as its temporal expression recognition module, even though the existing TERSEO module gives better results. It must be borne in mind that TERSEO depends on language-dependent linguistic resources, many of which do not exist for certain languages, whereas in this system such resources are dispensable.

As future work, we plan to experiment with further features that require a deeper understanding of the text, in particular syntactic and semantic information. We also intend to integrate this system fully as the temporal expression recognition module of TERSEO and, following a similar strategy, to test the adaptation of other language-dependent TERSEO modules to machine learning. Finally, we want to evaluate the complete combination of TERSEO, with machine-learning-based language-dependent modules and rule-based language-independent modules, measuring the final precision of TERSEO in both recognition and resolution of temporal expressions.

References

Atserias, J., B. Casas, E. Comelles, M. González, L. Padró, and M. Padró. 2006. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06), pages 48–55.

Daelemans, W., J. Zavrel, and K. van der Sloot. 2004. TiMBL: Tilburg Memory Based Learner, version 5.1, Reference Guide. ILK Research Group Technical Report Series, Tilburg. 60 pages.

Ferro, L., L. Gerber, I. Mani, B. Sundheim, and G. Wilson. 2005. TIDES 2005 standard for the annotation of temporal expressions. Technical report, MITRE.

Hacioglu, Kadri, Ying Chen, and Benjamin Douglas. 2005. Automatic time expression labeling for English and Chinese text. In Alexander F. Gelbukh, editor, CICLing, volume 3406 of Lecture Notes in Computer Science, pages 548–559. Springer.

Moreda, P. and M. Palomar. 2005. Selecting Features for Semantic Roles in QA Systems. In Proceedings of Recent Advances in Natural Language Processing (RANLP), pages 333–339, Borovets, Bulgaria, September.

Saquete, E., R. Muñoz, and P. Martínez-Barco. 2005. Event ordering using TERSEO system. Data and Knowledge Engineering Journal, (to be published).

Saquete, Estela, Óscar Ferrández, Patricio Martínez-Barco, and Rafael Muñoz. 2006. Reconocimiento temporal para el italiano combinando técnicas de aprendizaje automático y adquisición automática de conocimiento. In Proceedings of the 22nd International Conference of the Spanish Society for the Natural Language Processing (SEPLN).


Alias Assignment in Information Extraction

Emili Sapena, Lluís Padró and Jordi Turmo

TALP Research Center

Universitat Politècnica de Catalunya

Barcelona, Spain

{esapena, padro, turmo}@lsi.upc.edu

Resumen: Este artículo presenta un método general para la tarea de asignación de alias en extracción de información. Se comparan dos aproximaciones para encarar el problema y aprender un clasificador. La primera cuantifica una similaridad global entre el alias y todas las posibles entidades asignando pesos a las características sobre cada pareja alias-entidad. La segunda es el clásico clasificador donde cada instancia es una pareja alias-entidad y sus atributos son las características de esta. Ambas aproximaciones usan las mismas funciones de características sobre la pareja alias-entidad donde cada nivel de abstracción, desde los caracteres hasta el nivel semántico, se tratan de forma homogénea. Además, se proponen unas funciones extendidas de características que desglosan la información y permiten al algoritmo de aprendizaje automático determinar la contribución final de cada valor. El uso de funciones extendidas mejora los resultados de las funciones simples.

Palabras clave: asignación de alias, extracción de información, entity matching

Abstract: This paper presents a general method for the alias assignment task in information extraction. We compare two approaches to the problem, each of which learns a classifier. The first quantifies a global similarity between the alias and all the possible entities by weighting a set of features of each alias-entity pair. The second is a classical classifier in which each instance is an alias-entity pair and its attributes are the features of that pair. Both approaches use the same feature functions over the alias-entity pair, in which every level of abstraction, from raw characters up to the semantic level, is treated in a homogeneous way. In addition, we propose a set of extended feature functions that break down the information and let the machine learning algorithm determine the final contribution of each value. The extended features improve on the results of the simple ones.

Keywords: Alias Assignment, Information Extraction, Entity Matching

1 Introduction

Alias assignment is a variation of the entity matching problem. Entity matching decides whether two given named entities in the data, such as “George W. Bush” and “Bush”, refer to the same real-world entity. Variations in named entity expressions arise for multiple reasons: use of abbreviations, different naming conventions (for example “Name Surname” and “Surname, N.”), aliases, misspellings, or naming variations over time (for example “Leningrad” and “Saint Petersburg”). To keep extracted or processed data coherent for further analysis, it is mandatory to determine when different mentions refer to the same real entity.

This problem arises in many applications that integrate data from multiple sources. Consequently, it has been explored by a large number of communities, including statistics, information systems and artificial intelligence. In particular, many tasks related to natural language processing involve the problem, such as question answering, summarization and information extraction, among others. Depending on the area, variants of the problem are known by different names, such as identity uncertainty (Pasula et al., 2002), tuple matching, record linkage (Winkler, 1999), deduplication (Sarawagi and Bhamidipaty, 2002), merge/purge problem (Hernandez and Stolfo, 1995), data cleaning (Kalashnikov and Mehrotra, 2006), reference reconciliation (Dong, Halevy, and Madhavan, 2005), mention matching, instance identification and so on.

Alias assignment decides whether a mention in one source can refer to one or more entities in the data. The same alias can be shared by several entities or, on the contrary, it can refer to an unknown entity. For instance, the alias “Moore” would be assigned to the entity “Michael Moore” and also to “John Moore” if both are in the data. However, the alias “P. Moore” cannot be assigned to either of them. Therefore, while the entity matching problem consists of determining when two records are the same real entity, alias assignment focuses on finding out whether references in a text refer to known real entities in our database or not. After alias assignment, a disambiguation procedure is required to decide which real entity among the possible ones the alias points to in each context. The disambiguation procedure, however, is outside the scope of this paper.

There is little previous work that addresses alias assignment as its main focus, but many solutions have been developed for the related problem of entity matching. Early solutions employ manually specified rules (Hernandez and Stolfo, 1995), while subsequent works focus on learning the rules from training data (Tejada, Knoblock, and Minton, 2002; Bilenko and Mooney, 2003). Numerous solutions focus on efficient techniques to match strings, either manually specified (Cohen, Ravikumar, and Fienberg, 2003) or learned from training data (Bilenko and Mooney, 2003). Other solutions rely on techniques that take advantage of the database topology, such as clustering a large number of tuples (McCallum, Nigam, and Ungar, 2000), exploiting links (Bhattacharya and Getoor, 2004) or using a relational probability model to define a generative model (Pasula et al., 2002).

In recent years, some works have taken advantage of domain knowledge at the semantic level to improve the results. For example, Doan et al. (2003) show how semantic rules, either automatically learned or specified by a domain expert, can improve the results. Shen, Li, and Doan (2005) use probabilistic domain constraints in a more general model that employs a relaxation labeling algorithm to perform matching.

Some of the methods used for entity matching are not applicable to alias assignment because an alias-entity pair carries less information than an entity-entity pair. An alias is only a small group of words without attributes and, normally, without any useful contextual information. However, using some domain knowledge, some information about the entities and some information about the world, it is possible to improve the results of a system that uses only string similarity measures.

This paper presents a general method for the alias assignment task in information extraction. We compare two approaches to the problem, each of which learns a classifier. The first quantifies a global similarity between the alias and all the possible entities by weighting a set of features of each alias-entity pair. The algorithm employed to find the best weights is Hill Climbing. The second is a classical pairwise classification in which each instance is an alias-entity pair and its attributes are the features of that pair. The classifier is learned with Support Vector Machines. Both approaches use the same feature functions over the alias-entity pair, in which every level of abstraction, from raw characters up to the semantic level, is treated in a homogeneous way. In addition, we propose a set of extended feature functions that break down the information and let the machine learning algorithm determine the final contribution of each value. The extended features improve on the results of the simple ones.

The rest of the paper is structured as follows. Section 2 formalizes the problem of alias assignment and its representation. Section 3 introduces the machine learning algorithms used. Section 4 presents the experimental methodology and the data used in our evaluation. Section 5 describes the feature functions employed in our empirical evaluation. Section 6 shows the results obtained and, finally, section 7 presents our conclusions.

2 Problem definition and representation

The alias assignment problem can be formalized as pairwise classification: find a function f : N × N → {1, −1} which classifies the alias-entity pair as positive (1) if the alias represents the entity or negative (−1) if not. The alias and the entity are represented as strings in a name space N. We propose a variation of the classifier that can also use any useful attributes we have about the entity. In our case, the function to find is f : N × M → {1, −1}, where M represents a different space that includes all the entity’s attributes.

We define a feature function as a function that represents a property of the alias, the entity, or the alias-entity pair. Once an alias-entity pair is represented as a vector of features, one can combine them appropriately using machine learning algorithms to obtain a classifier. Section 3 explains how we learn classifiers using two different approaches. Most of the feature functions used here are similarity functions, which quantify the similarity of the alias-entity pair according to some criterion. The value r returned by a similarity function indicates greater similarity when it is larger and lower similarity (dissimilarity) when it is smaller.

Feature functions can be divided into four groups by their level of abstraction, from raw characters up to the semantic level. At the lowest level, the functions focus on character-based similarity between strings. These techniques rely on character edit operations, such as deletions, insertions and substitutions, and on subsequence comparison. Edit similarities detect typographical errors such as writing mistakes or OCR errors, abbreviations, similar lemmas and other intra-word differences.

The second level of abstraction is centered on vector-space based techniques and is also known as token-level or word-level. The two strings to compare are treated as groups of words (or tokens), disregarding the order in which the tokens occur in the strings. Token-based similarity metrics use set operations such as union or intersection.

At a higher level we find structural features similar to those in (Li, Morie, and Roth, 2004). Structural features encode information on the relative order of tokens between two strings, by recording the location of the participating tokens in the partition.

The highest level includes the functions with added knowledge. This extra knowledge can be obtained from other attributes of the entity or from an ontology, or it can be knowledge about the world. Some previous works (Shen, Li, and Doan, 2005; Doan et al., 2003) use this extra knowledge as rules to be satisfied. First, rules are specified manually or obtained from the data; then each rule must be assigned a weight or probability, and hard rules must be distinguished from soft ones. In (Shen, Li, and Doan, 2005) weights are established by an expert user or learned from the same data set that is to be classified. In our work, we present another way to use this information. We propose to add more feature functions to increase the number of attributes for our classifier. Each new feature function describes some characteristic of the alias, of the entity, or of the alias-entity pair that requires some extra knowledge. The contribution of each feature is learned like that of any other similarity function when a machine learning method is applied.

3 Learning classifiers

Two approaches are used and compared in order to obtain a good classifier from the feature functions introduced above: Hill Climbing (Skalak, 1994) and Support Vector Machines (Cortes and Vapnik, 1995). Each takes a different view of the problem. The first treats the problem as a nearest neighbor model and tries to determine a global Heterogeneous Euclidean-Overlap Metric (HEOM) from the target alias to all the entities in the database. The alias is assigned to the entities with a HEOM shorter than some cut-value. Each alias-entity pair has a HEOM composed of all the similarity values. The second is a classical classifier based on the instance’s attributes projected in a multidimensional space. The classifier consists of a hyperplane that separates the samples into two classes. Each alias-entity pair, with the values of the feature functions as attributes, is an instance that the classifier labels as positive (matching) or negative (not matching).

The first approach determines a HEOM composed of the values returned by the similarity functions. All the similarity functions are normalized and transformed into dissimilarities in order to obtain a small HEOM value when alias and entity are similar and a large value otherwise. The HEOM is the square root of the weighted sum of the squared dissimilarities:

$$\mathrm{HEOM} = \sqrt{\sum_{i} w_i \, (d_i)^2}$$

where $d_i$ is the dissimilarity corresponding to similarity function $i$ and $w_i$ is the weight assigned to this value. Using a training data set, Hill Climbing determines the best weight for each feature and the cut-value in order to achieve the best possible performance. In each step, the algorithm increases and decreases each weight by a small step-value and selects the modification with the best results. The process is repeated until no modification improves the result of the current solution. The method is executed several times starting from random weights. Among the advantages of Hill Climbing are that it is easy to develop and can achieve good results in a short time.
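The weight search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the step size, the number of restarts, the use of F1 as the objective and the treatment of the cut-value as one more tunable parameter are all choices made here for the example.

```python
import math
import random

def heom(dissimilarities, weights):
    """Weighted distance sqrt(sum_i w_i * d_i^2) between an alias and an entity."""
    return math.sqrt(sum(w * d * d for w, d in zip(weights, dissimilarities)))

def f1_score(samples, weights, cut):
    """F1 of the rule 'assign when HEOM < cut' over (dissimilarities, is_match) samples."""
    tp = fp = fn = 0
    for dissims, is_match in samples:
        predicted = heom(dissims, weights) < cut
        if predicted and is_match:
            tp += 1
        elif predicted:
            fp += 1
        elif is_match:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def hill_climb(samples, n_features, step=0.05, restarts=5, seed=0):
    """Random-restart hill climbing over the weights and the cut-value (assumed setup)."""
    rng = random.Random(seed)
    best_params, best_score = None, -1.0
    for _ in range(restarts):
        # Parameter vector: one weight per feature, followed by the cut-value.
        params = [rng.random() for _ in range(n_features + 1)]
        score = f1_score(samples, params[:-1], params[-1])
        while True:
            # Try increasing and decreasing every parameter by one step.
            candidates = []
            for i in range(len(params)):
                for delta in (step, -step):
                    trial = params.copy()
                    trial[i] = max(0.0, trial[i] + delta)
                    candidates.append((f1_score(samples, trial[:-1], trial[-1]), trial))
            cand_score, cand = max(candidates, key=lambda c: c[0])
            if cand_score <= score:
                break  # no modification improves the current solution
            score, params = cand_score, cand
        if score > best_score:
            best_score, best_params = score, params
    return best_params[:-1], best_params[-1], best_score
```

Because every accepted move strictly increases F1, and F1 can take only finitely many values on a fixed sample set, each restart terminates.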

The second approach consists of an alias-entity pair classifier using Support Vector Machines (SVM) (Cortes and Vapnik, 1995). SVMs have been widely used as classifiers (Osuna, Freund, and Girosi, 1997; Furey et al., 2000). This technique has the appealing features of having very few tunable parameters and of using structural risk minimization, which minimizes a bound on the generalization error. Theoretically, SVMs can achieve more precise values than Hill Climbing (for our task) because they search a continuous space while Hill Climbing searches discrete values. In addition, with kernels more complex than the linear one, they might combine attributes in a better way. Moreover, statistical learning avoids one of the problems of local search, namely falling into local minima. On the other hand, the computational cost of SVMs is higher than that of Hill Climbing.

4 Evaluation framework

We evaluated both algorithms on the alias assignment task with a corpus of organizations. While developing an IE system for the domain of football (soccer) over the Web, one of the problems we found is that clubs, federations, football players and many other football-related entities have official or real names that are too long. Consequently, nicknames or short names are widely used in both free and structured texts. Almost all texts use these short names to refer to the entities, assuming that everyone is able to tell which real entity is meant. For instance, to refer to “Futbol Club Barcelona”, it is typical to find “FC Barcelona” or “Barcelona”. The results of this paper are based on our study in the specific domain of football; however, we present a general method for the alias assignment task that is useful in any other domain.

The corpus consists of 900 football club aliases assigned by hand against a database of 500 football club entities. Some of the aliases are assigned to more than one club, while others are not assigned because the club they refer to is not in our database. Each algorithm is trained and tested with five-fold cross-validation. Some examples from the annotated corpus can be seen in Table 1.

Several aliases found across the Web refer to organizations not yet included in the database. Furthermore, for each matching alias-entity sample (classified as positive) we have almost 500 non-matching samples (classified as negative). This situation would drive accuracy close to 100% even for a blind classifier that always decides negative. To obtain a reasonable evaluation, only the set of positive predictions $M_p$ is used in the evaluation, and it is compared with the set $M_a$ of examples annotated as positive. The measures used are Precision (1), Recall (2) and F1 (3). Only F1 values are shown and compared in this paper.

$$P = \frac{|M_p \cap M_a|}{|M_p|} \qquad (1)$$

$$R = \frac{|M_p \cap M_a|}{|M_a|} \qquad (2)$$

$$F_1 = \frac{2PR}{P + R} \qquad (3)$$
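As a small worked illustration of measures (1) to (3), the sketch below computes them over sets of (alias, entity) pairs. The function name and the representation of predictions as Python sets of tuples are assumptions made for the example.

```python
def precision_recall_f1(predicted, annotated):
    """Precision, recall and F1 over the positive pairs only.

    predicted: pairs the classifier labeled positive (M_p)
    annotated: pairs annotated as positive in the corpus (M_a)
    """
    predicted, annotated = set(predicted), set(annotated)
    hits = len(predicted & annotated)
    p = hits / len(predicted) if predicted else 0.0
    r = hits / len(annotated) if annotated else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical predictions for two aliases taken from Table 1.
m_p = {("Man Utd", "Manchester United Football Club"),
       ("Nacional", "Club Nacional"),
       ("Nacional", "Sydney Football Club")}
m_a = {("Man Utd", "Manchester United Football Club"),
       ("Nacional", "Club Nacional"),
       ("Nacional", "Club Deportivo El Nacional")}
p, r, f1 = precision_recall_f1(m_p, m_a)
```

Negative pairs never enter the computation, which is why a classifier that always answers negative scores zero here instead of near 100%.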

5 Experiments

We evaluated the task of alias assignment in two experiments. In the first, we compared the performance of Hill Climbing and SVM using a set of similarity functions. The second focuses on an improvement of the feature functions, breaking them down into several values that represent more specific aspects of their characteristics.

5.1 Algorithm comparison

In the first approach, each function returns a similarity value that depends on some criterion. Here we try to simplify the classification process by including only the information we consider important. The more features are included, the longer an algorithm takes to train and achieve good results. Based on this principle, we tried to pack as much information as we could into a few values.

The feature functions used in this first experiment (example in Figure 1) are the following:



| Alias | Assigned entities |
|---|---|
| Sydney FC | Sydney Football Club |
| Man Utd | Manchester United Football Club |
| Nacional | Club Universidad Nacional AC UNAM, Club Deportivo El Nacional, Club Nacional, Club Nacional de Football |
| Steaua Bucharest | -not assigned- |
| Newcastle United | Newcastle United Jets Football Club, Newcastle United Football Club |
| Krylya Sovetov | Professional Football Club Krylya Sovetov Samara |

Table 1: Example of some pairs alias-entity in the football domain

5.1.1 Character-based

• Prefix and Suffix similarities count the words of the alias that are the beginning (prefix) or the end (suffix) of a word in the entity name.

• Abbreviations similarity. If a word s in the alias is shorter than a word t in the entity name, both start with the same character, and each character of s appears in t in the same order, the function concludes that s is an abbreviation of t. For example, “Utd” is an abbreviation of “United” and “St” is an abbreviation of “Saint”.
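The character-based checks above can be sketched as follows. This is an illustrative reading of the definitions, with some assumptions: comparison is case-insensitive, words are split on whitespace, and a word counts as a prefix or suffix only when it is strictly shorter than the entity word. The function names are hypothetical.

```python
def is_abbreviation(s, t):
    """True if s is shorter than t, shares its first character, and the
    characters of s appear in t in the same order."""
    s, t = s.lower(), t.lower()
    if not s or len(s) >= len(t) or s[0] != t[0]:
        return False
    remaining = iter(t)
    # 'ch in remaining' consumes the iterator, so order is enforced.
    return all(ch in remaining for ch in s)

def prefix_count(alias, entity):
    """Number of alias words that begin some word of the entity name."""
    entity_words = entity.lower().split()
    return sum(
        any(t.startswith(s) and s != t for t in entity_words)
        for s in alias.lower().split()
    )

def suffix_count(alias, entity):
    """Number of alias words that end some word of the entity name."""
    entity_words = entity.lower().split()
    return sum(
        any(t.endswith(s) and s != t for t in entity_words)
        for s in alias.lower().split()
    )
```

With the pair shown in Figure 1, both “Inter” and “Milan” are counted as prefixes (of “Internazionale” and “Milano”).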

5.1.2 Token-based

• Lexical similarity compares the words of alias A and entity name B without case sensitivity. A classical lexical similarity is:

$$\mathrm{Sim}(A,B) = \frac{|A \cap B|}{|A \cup B|}$$

where $|x \cap y|$ is the number of words that $x$ and $y$ have in common, and $|x \cup y|$ is the number of different words in the union of $x$ and $y$. However, in the case under study, we know that some words of the entity name may not occur in the alias but, almost always, if a word occurs in the alias, it must be in the entity name. In other words, an alias is usually a reduced set of the words of the entity name, and it is rare (although possible) to find an alias that uses words not occurring in the entity name. To take advantage of this asymmetry in our lexical similarity, words of the alias that do not appear in the entity name decrease the similarity, as shown below:

$$\mathrm{Sim}(A,B) = \max\left(0, \frac{|A \cap B| - |W_a|}{|A \cup B|}\right)$$

where $W_a$ represents the words appearing in A but not in B, and the max function ensures that the similarity function never returns a value lower than zero.

• Keywords similarity is another lexical similarity, but one that ignores typical domain-related words. Such words occur in many names and can produce a good lexical similarity even when the important words (keywords) do not match. For example, “Manchester United Football Club” and “Dundee United Football Club” have a good lexical similarity but a bad keyword similarity because “football” and “club” are considered typical domain-related words. It uses the same formula as Lexical similarity but excludes typical domain-related words from A and B. Lexical similarity and Keywords similarity could be combined into a lexical similarity weighted with TF-IDF. However, the true contribution of each token to similarity is domain-specific and not always proportional to TF-IDF. Some words have many occurrences but are still important, while others appear only a few times but are not helpful at all.
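The two token-based similarities can be sketched as follows. The sketch assumes that words are compared as lowercase sets split on whitespace and that the list of domain words is the small hypothetical set `DOMAIN_WORDS`; the paper does not give the actual list.

```python
# Hypothetical list of typical domain-related words; only "football" and
# "club" are named in the text.
DOMAIN_WORDS = frozenset({"football", "club"})

def lexical_similarity(alias, entity, ignore=frozenset()):
    """Asymmetric lexical similarity: max(0, (|A ∩ B| - |W_a|) / |A ∪ B|)."""
    a = {w for w in alias.lower().split() if w not in ignore}
    b = {w for w in entity.lower().split() if w not in ignore}
    union = a | b
    if not union:
        return 0.0
    alias_only = a - b  # W_a: words of the alias missing from the entity name
    return max(0.0, (len(a & b) - len(alias_only)) / len(union))

def keywords_similarity(alias, entity):
    """Same formula, after dropping typical domain-related words."""
    return lexical_similarity(alias, entity, ignore=DOMAIN_WORDS)
```

For the “Manchester United Football Club” and “Dundee United Football Club” example, the lexical value is (3 − 1)/5 = 0.4, while the keyword value is max(0, (1 − 1)/3) = 0.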

5.1.3 Structural

• Acronyms similarity looks for a correspondence between acronyms in the alias and capitalized words in the entity name. This feature takes word order into account, because the order of the characters in an acronym defines the order that the words must have in the entity name. An example of an acronym is “PSV”, which matches “Philips Sport Vereniging Eindhoven”.

[Figure 1: Example of a pair alias-entity and its active features. The figure shows the alias “Inter Milan”, the entity name “Football Club Internazionale Milano s.p.a.”, the website www.inter.it, and the feature labels prefix, abbreviation, city, web and typical word.]
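One possible reading of the acronym check is sketched below. It assumes that an acronym is an all-uppercase alphabetic token of at least two letters and that its letters must match the initials of consecutive capitalized words in the entity name; both are assumptions for illustration, since the text only states that the order must be respected.

```python
def acronym_similarity(alias, entity):
    """Return 1.0 if some acronym in the alias matches the initials of
    consecutive capitalized words in the entity name, else 0.0."""
    # Initials of the capitalized words, in order of appearance.
    initials = "".join(w[0] for w in entity.split() if w[0].isupper())
    for token in alias.split():
        is_acronym = len(token) > 1 and token.isalpha() and token.isupper()
        if is_acronym and token in initials:
            return 1.0
    return 0.0
```

Checking for a contiguous substring of the initials is what enforces the word order: “PSV” matches the initials “PSVE”, whereas “VSP” does not.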

5.1.4 Semantic

• City similarity returns 1 (maximum similarity) only when one word in the alias corresponds to a city, one word in the entity name corresponds to a city, and both are the same city. Otherwise it returns 0 (no similarity). It is useful because some cities have different names depending on the language. For instance, “Moscow” and “Moskva” are the same city, as are “Vienna” and “Wien”. This feature requires world knowledge about cities.

• Website similarity compares the alias with the URL of the organization’s website, if we have it. Leaving aside the first-level TLD (.com, .de, .es) and sometimes the second level (.co.uk, .com.mx), it is usual for an organization to register a domain name that contains its most typical alias. The return value of this function is the number of words of the alias included in the domain name divided by the total number of words in the alias. We can use this similarity function because we have more information about the entity than just its official name. If we do not have this information, the return value is zero.
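The two semantic features can be sketched as follows. The city table `CITY_IDS` is a tiny hypothetical stand-in for the world knowledge the feature requires, and the website function strips only a leading "www" and the last domain label; second-level suffixes such as .co.uk are not handled in this sketch.

```python
from urllib.parse import urlparse

# Hypothetical gazetteer mapping language variants to one city identifier.
CITY_IDS = {"moscow": "Moscow", "moskva": "Moscow",
            "vienna": "Vienna", "wien": "Vienna"}

def city_similarity(alias, entity):
    """1 if the alias and the entity name mention the same city, else 0."""
    alias_cities = {CITY_IDS[w] for w in alias.lower().split() if w in CITY_IDS}
    entity_cities = {CITY_IDS[w] for w in entity.lower().split() if w in CITY_IDS}
    return 1 if alias_cities & entity_cities else 0

def website_similarity(alias, url):
    """Fraction of alias words that occur in the domain name (0.0 without URL)."""
    words = alias.lower().split()
    if not url or not words:
        return 0.0
    # urlparse only fills in the host when the URL has a '//' prefix.
    host = urlparse(url if "//" in url else "//" + url).hostname or ""
    labels = host.split(".")
    if labels and labels[0] == "www":
        labels = labels[1:]
    # Drop the last label (the TLD); keep everything else as the domain name.
    name = ".".join(labels[:-1]) if len(labels) > 1 else host
    return sum(w in name for w in words) / len(words)
```

For the pair in Figure 1, only “inter” occurs in the domain name of www.inter.it, so the alias “Inter Milan” obtains 1/2.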

5.2 Extended features

The second experiment uses extended feature functions. This means that most of the feature functions used previously are modified so that they return more than one value, breaking down the information. The feature functions are the same, but they return a vector of values instead of a single value. The classifier may use this extra information if it is helpful for classification. For instance, lexical similarity now returns the number of words in the alias, the number of words in the entity name and the number of equal words. By combining these values, the classifier can arrive at a function like our original lexical similarity, or perhaps a better one.
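As an illustration of the idea, the extended lexical feature could return its components as a vector, following the Lex1 to Lex4 values listed in Table 2. The function name and the whitespace tokenization are assumptions made for this sketch.

```python
def extended_lexical(alias, entity):
    """Return [Lex1, Lex2, Lex3, Lex4] instead of a single similarity value."""
    a, b = alias.split(), entity.split()
    lex1 = len(a)                                               # words in the alias
    lex2 = len(b)                                               # words in the entity name
    lex3 = len({w.lower() for w in a} & {w.lower() for w in b}) # equal words
    lex4 = len(set(a) & set(b))                                 # equal words, case sensitive
    return [lex1, lex2, lex3, lex4]

features = extended_lexical("Newcastle united", "Newcastle United Football Club")
```

The learning algorithm then receives the raw counts and decides how to combine them, instead of receiving one precomputed ratio.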

The aim of this second experiment is to compare the original feature functions with the extended ones. We chose SVM for this experiment because SVM can use polynomial kernels, which may combine attributes in a better way than a linear classifier. Consequently, in this experiment we compare the best classifier obtained in the first experiment with two SVM classifiers that use the extended feature functions. One SVM uses a linear kernel, while the other tries to take advantage of a quadratic one.

Table 2 shows the modifications made to each feature function.

6 Results

In our first experiment, described in section 5.1, we tried the two algorithms mentioned above, Hill Climbing and SVM, with the feature functions described previously. Table 3 compares their results with a baseline consisting of some simple rules that use only the lexical, keywords, acronyms and abbreviations similarities.

The first aspect to emphasize is that the baseline, a simple rule-based classifier, achieves an F1 measure over 80%. This indicates that the alias assignment task has a high percentage of trivial examples. The use of machine learning and new features may help with the difficult ones. Indeed, the results show that the machine learning algorithms significantly outperform the baseline. On the other hand, the performance of Hill Climbing and SVM is similar. SVM seems to achieve better results, but the difference is not significant, since the confidence interval at the 95% level is 0.8%.

| Feature | Return values |
|---|---|
| Prefix | Pre1: # words in the alias that are prefixes in the entity name |
| Suffix | Suf1: # words in the alias that are suffixes in the entity name |
| Abbrev. | Abr1: # words in the alias that are an abbreviation of a word in the entity name |
| Lexical | Lex1: # words in the alias |
| | Lex2: # words in the entity name |
| | Lex3: # equal words |
| | Lex4: # equal words, case sensitive |
| Keywords | Key1: # keywords in the alias (words excluding typical domain words such as football, club, etc.) |
| | Key2: # keywords in the entity name |
| | Key3: # equal keywords |
| Acronym | Acr1: the alias has an acronym (boolean) |
| | Acr2: the alias acronym matches capitalized words in the entity name (boolean) |
| | Acr3: # words in the alias without acronyms |
| | Acr4: # words in the entity name without words involved in acronyms |
| | Acr5: # equal words without words involved in acronyms |
| City | Cit1: some word in the alias is a city (boolean) |
| | Cit2: some word in the entity name is a city (boolean) |
| | Cit3: both are the same city (boolean) |
| Website | Web1: the entity has a value in the website field (boolean) |
| | Web2: # words occurring both in the alias and in the URL of the entity |

Table 2: Extended features used in the second experiment

| | Baseline | Hill Climbing | SVM |
|---|---|---|---|
| F1 | 80.3 | 87.1 | 87.9 |

Table 3: Results of experiment (1) comparing simple rule-based baseline with hill climbing and SVM

In the second experiment we wanted to exploit the ability of SVM to combine features, so we broke down the components of the feature functions as explained in section 5.2. SVM may use this extra information if it is helpful for classification. In Table 4, two SVMs with different kernels using extended features are compared with the results obtained in the first experiment.

| Features | Simple | Extended | Extended |
|---|---|---|---|
| Algorithm | SVM | SVM | SVM |
| Kernel | linear | linear | quadratic |
| F1 | 87.9 | 93.0 | 93.0 |

Table 4: Results of experiment (2) comparing original features with extended features

The results indicate that the extended features outperform the original ones. On the other hand, the quadratic kernel does not improve on the results of the linear kernel.

7 Conclusions

In this paper we have proposed a homogeneous model to deal with the problem of classifying an alias-entity pair into true/false categories. The model consists of using a single set of feature functions instead of the state-of-the-art approach, which distinguishes between a set of lexico-orthographic similarity functions and a set of semantic rules.

Some experiments have been performed in order to compare different configurations of the proposed model. The configurations differ in the set of feature functions and in the discretization strategy for feature weights. Also, two learning techniques have been applied, namely Hill Climbing and SVMs.

We have seen that Hill Climbing and SVM perform similarly. Each algorithm has advantages and disadvantages. On the one hand, Hill Climbing is simple and fast but has two drawbacks. The first is that it searches for weights in steps, so the weights are always discrete values, which sometimes decreases the final accuracy. The other drawback is that local search can fall into local minima, although this can be mitigated by executing the algorithm several times starting from random values. On the other hand, SVMs work in a continuous space and learn statistically, which avoids the two drawbacks of Hill Climbing. However, SVMs take longer to be tuned correctly.

In the second experiment, since SVMs can handle richer combinations of features when using polynomial kernels, we tested SVMs with a linear kernel and with a quadratic one, obtaining similar results. The feature set used in this experiment was a refinement of the previous one; that is, the features contained the same information, but coded with finer granularity. The results showed that, although the similarity functions used in the first experiment produced accurate results, letting the SVM handle all the parameters yields a significant improvement.

References

Bhattacharya, Indrajit and Lise Getoor. 2004. Iterative record linkage for cleaning and integration. In DMKD ’04: Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, pages 11–18, New York, NY, USA. ACM Press.

Bilenko, Mikhail and Raymond J. Mooney. 2003. Adaptive duplicate detection using learnable string similarity measures. In KDD ’03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 39–48, New York, NY, USA. ACM Press.

Cohen, W., P. Ravikumar, and S. Fienberg. 2003. A comparison of string distance metrics for name-matching tasks.

Cortes, Corinna and Vladimir Vapnik. 1995. Support-vector networks. In Springer, editor, Machine Learning, pages 273–297. Kluwer Academic Publishers, Boston.

Doan, AnHai, Ying Lu, Yoonkyong Lee, and Jiawei Han. 2003. Profile-based object matching for information integration. IEEE Intelligent Systems, 18(5):54–59.

Dong, Xin, Alon Halevy, and Jayant Madhavan. 2005. Reference reconciliation in complex information spaces. In SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 85–96, New York, NY, USA. ACM Press.

Furey, T. S., N. Christianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Hauessler. 2000. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906–914.

Hernandez, Mauricio A. and Salvatore J. Stolfo. 1995. The merge/purge problem for large databases. In SIGMOD ’95: Proceedings of the 1995 ACM SIGMOD international conference on Management of data, pages 127–138, New York, NY, USA. ACM Press.

Kalashnikov, Dmitri V. and Sharad Mehrotra. 2006. Domain-independent data cleaning via analysis of entity-relationship graph. ACM Trans. Database Syst., 31(2):716–767.

Li, Xin, Paul Morie, and Dan Roth. 2004. Identification and tracing of ambiguous names: Discriminative and generative approaches. In Proceedings of the National Conference on Artificial Intelligence, pages 419–424. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999.

McCallum, Andrew, Kamal Nigam, and Lyle H. Ungar. 2000. Efficient clustering of high-dimensional data sets with application to reference matching. In KDD ’00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 169–178, New York, NY, USA. ACM Press.

Osuna, Edgar, Robert Freund, and Federico Girosi. 1997. Training support vector machines: an application to face detection. cvpr, 00:130.

Pasula, H., B. Marthi, B. Milch, S. Russell, and I. Shpitser. 2002. Identity uncertainty and citation matching.

Sarawagi, Sunita and Anuradha Bhamidipaty. 2002. Interactive deduplication using active learning. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 269–278, New York, NY, USA. ACM Press.

Shen, W., X. Li, and A. Doan. 2005. Constraint-based entity matching. In Proceedings of AAAI.

Skalak, David B. 1994. Prototype and feature selection by sampling and random mutation hill climbing algorithms. In International Conference on Machine Learning, pages 293–301.

Tejada, Sheila, Craig A. Knoblock, and Steven Minton. 2002. Learning domain-independent string transformation weights for high accuracy object identification. In KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 350–359, New York, NY, USA. ACM Press.

Winkler, W. 1999. The state of record linkage and current research problems.



Evaluation of a system for the recognition and normalization of temporal expressions in Spanish∗

María Teresa Vicente-Díez

César de Pablo-Sánchez

Paloma Martínez

Departamento de Informática. Universidad Carlos III de Madrid Avda. Universidad 30, 28911. Leganés, Madrid

{teresa.vicente, cesar.pablo, paloma.martinez}@uc3m.es

Resumen: El sistema de reconocimiento y normalización de expresiones temporales en español que se describe en este artículo fue presentado por la Universidad Carlos III de Madrid en la evaluación ACE07 llevada a cabo por el NIST. Dicho sistema se centra en la tarea de TERN para español, piloto en esta edición. Se detalla su arquitectura y módulos así como el enfoque basado en reglas implementado por un autómata finito en las etapas de reconocimiento y normalización. Se exponen también los resultados alcanzados en la evaluación y las conclusiones obtenidas a partir de los mismos.

Palabras clave: Reconocimiento de expresiones temporales, normalización temporal, timexes, procesamiento de lenguaje natural, PLN, español.

Abstract: This paper describes the system for recognizing and normalizing temporal expressions in Spanish that Universidad Carlos III de Madrid submitted to the NIST ACE07 evaluation. The system addresses the primary TERN task for Spanish, which was run as a pilot this year. Its architecture and modules are described, together with the rule-based approach, implemented as a finite-state automaton, used in the recognition and normalization stages. The results achieved in the evaluation and the conclusions drawn from their analysis are also presented.

Keywords: Temporal expression recognition, time normalization, timexes, natural language processing, NLP, Spanish language.

∗ This work was partially funded by the Comunidad de Madrid through the MAVIR Research Network (S-0505/TIC-0267).

1 Introduction

Automatically extracting temporal information from news and other electronic content is a significant linguistic challenge. Documents of this kind usually carry little temporal metadata (Llido, Berlanga and Aramburu, 2001), which makes it difficult to determine when the events they narrate take place.

“Temporal expressions (also called timexes) are natural language fragments that refer directly to points in time or to intervals. They not only convey temporal information on their own but also serve as anchors for locating the events referred to in a text” (Ahn, Fissaha, and de Rijke, 2005).

In most linguistic contexts temporal expressions are deictic. For example, for expressions such as “la pasada semana” (last week), “en abril” (in April) or “hace tres meses” (three months ago), the narrative reference time must be known in order to pin down the time interval the expression covers (Saquete, 2000). Moreover, to ease data exchange, the identified intervals must be translated into an established standard, that is, normalized. Accurate identification and normalization of temporal expressions is essential for the temporal reasoning (Allen, 1983) required by advanced NLP applications such as Information Extraction, Automatic Summarization and Question Answering (QA). In QA, for example, it is crucial to resolve references that help answer temporal questions (“¿En qué año murió Cervantes?”, in what year did Cervantes die?) or time-restricted questions (“¿Quién era el presidente de los EE.UU. en 2005?”, who was the president of the USA in 2005?) (Saquete, 2004) (de Pablo-Sánchez et al., 2006).

In QA in particular, integrating a temporal reasoning system that gives the application a new temporal dimension is of special interest (Moldovan, Bowden, and Tatu, 2006). Given the importance of temporal expression identification for this reasoning, the system presented here is intended to be incorporated into a QA environment. The introduction of inference rules is expected to improve question analysis and the quality of the extracted answers. For example, when resolving temporally ambiguous questions such as “¿Quién fue Ministro de Justicia en 2007?” (who was Minister of Justice in 2007?), efficient reasoning will make it possible either to detect the ambiguity or to extract the multiple possible answers.

The research community has several resources for handling timexes, but most of them are for English. They include various annotation guidelines and methods, such as the one proposed by Mani and Wilson (2000), specification languages such as TimeML (Pustejovsky et al., 2003), and temporally annotated corpora such as TimeBank (MITRE, 2007). However, some of these resources cannot be used directly for Spanish. Since Spanish is currently one of the most widely spoken languages in the world, investing in dedicated resources seems worthwhile.

The NIST 2007 Automatic Content Extraction Evaluation (ACE07) is part of a series of evaluations whose purpose is to develop technologies for information extraction and semantic inference from language.

The purpose of the Temporal Expression Recognition and Normalization (TERN) task evaluation is to advance the state of the art in the automatic detection and normalization of this kind of expression.

The system described in this paper targets the recognition and normalization of timexes in Spanish. Universidad Carlos III de Madrid (UC3M) submitted it to the ACE07 evaluation, taking part in the TERN task for Spanish, which was a pilot for this language. The proposal is an initial approach that implements simple rule-based techniques for both recognition and normalization. In this preliminary version the system handles simple temporal expressions and defers the treatment of expressions that occur less frequently in Spanish, even though they are identifiable under the TIDES standard (Ferro et al., 2005).

The paper is structured as follows. Section 2 describes the task in which the evaluated system took part. Section 3 presents the system architecture and its modules. Section 4 reports the evaluation results. Finally, Section 5 gives the conclusions and some lines of future work.

2 Task description

Systems taking part in the ACE07 TERN task for Spanish must process input data, in this case Spanish newswire, and identify dates, durations, reference times and intervals in it (recognition). The recognized expressions, both absolute and deictic, must then be processed and returned in a standard format that avoids semantic ambiguity when they are retrieved (normalization). These expressions are marked up with the TIMEX2 annotation scheme, following the TIDES standard (Ferro et al., 2005), which consists of a set of attributes, as shown in Table 1.

Table 2 gives some examples of TIMEX2 annotation of temporal expressions.


| ATTRIBUTE | DESCRIPTION |
|---|---|
| VAL | Normalized temporal expression |
| MOD | Modifier of the normalized temporal expression |
| ANCHOR_VAL | Normalized temporal reference point |
| ANCHOR_DIR | Temporal directionality |
| SET | Indicates that the VAL attribute refers to a set of temporal expressions (an interval) |

Table 1: TIMEX2 attributes

<TIMEX2 VAL="1991-10-06">6 de octubre de 1991</TIMEX2>
<TIMEX2 VAL="1993-08-01T17:00">5:00 p.m.</TIMEX2>
<TIMEX2 VAL="1992-FA">el pasado otoño</TIMEX2>
<TIMEX2 VAL="P9M" ANCHOR_VAL="1993-08" ANCHOR_DIR="ENDING">los últimos nueve meses</TIMEX2>
<TIMEX2 VAL="1994-01-20TEV">el jueves por la tarde</TIMEX2>
<TIMEX2 SET="YES" VAL="XXXX-XX-XX">diariamente</TIMEX2>
<TIMEX2 VAL="PRESENT_REF" ANCHOR_VAL="1994-01-21T08:29" ANCHOR_DIR="AS_OF">ahora</TIMEX2>
<TIMEX2 VAL="P25Y">25 años</TIMEX2>
<TIMEX2 VAL="1994">el pasado año</TIMEX2>

Table 2: Examples of TIMEX2 annotation
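As an illustration of how annotations like those in Table 2 could be emitted, the following minimal sketch wraps a text span in a TIMEX2 element. The function name `timex2_tag` and its signature are assumptions made for this example and are not part of the system described in the paper; only the attribute names come from Table 1.

```python
from xml.sax.saxutils import escape, quoteattr

# Attribute names from Table 1; the ordering used here is arbitrary.
TIMEX2_ATTRIBUTES = ("VAL", "MOD", "ANCHOR_VAL", "ANCHOR_DIR", "SET")


def timex2_tag(text, **attributes):
    """Wrap `text` in a TIMEX2 element carrying the given attributes (hypothetical helper)."""
    unknown = set(attributes) - set(TIMEX2_ATTRIBUTES)
    if unknown:
        raise ValueError(f"unsupported TIMEX2 attributes: {sorted(unknown)}")
    rendered = "".join(
        f" {name}={quoteattr(attributes[name])}"
        for name in TIMEX2_ATTRIBUTES
        if name in attributes
    )
    return f"<TIMEX2{rendered}>{escape(text)}</TIMEX2>"


print(timex2_tag("6 de octubre de 1991", VAL="1991-10-06"))
```

Attribute order has no meaning in XML, so the output may list attributes in a different order from Table 2 while remaining equivalent.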

Finally, one output must be generated for each source document, in a specific XML format (known as .apf files).

The Spanish documents in the ACE07 corpora come from three sources: Agence France-Presse, Associated Press Worldstream and Xinhua.

3 System description

Figure 1 shows the general architecture of the proposed system. Each input is processed in four sequential stages, from preprocessing of the source documents to returning the results in the appropriate format.

3.1 Preprocessor

This module converts the input documents into enriched intermediate files that include morphological, syntactic and semantic information. The conversion is carried out in two steps:

Input formatting: this submodule converts the source files to the encoding required by the linguistic processor and removes unnecessary characters (spaces, tabs, etc.).

Linguistic processing: for each input, it generates a file in which the whole original text is segmented and enriched with position information and with grammatical, morphosyntactic and semantic tags. This stage is carried out by Stilus, a commercial tool developed by DAEDALUS (2007).

Figure 1: General architecture of the system

3.2 Recognizer

This module detects the temporal expressions in the text of the input files. It consists of two submodules.

Token loading: loads into memory objects holding the linguistic information obtained from the files generated by the linguistic processor.

Timex automaton: at this point the system tries to identify timexes within each sentence of the input files. The search is performed by a finite-state automaton that follows the grammar defining it. The automaton has 25 states, 12 of which are final. Nineteen predicates have been defined to drive the transitions between states, as shown in Figure 2; Table 3 details them. When a final state is reached and no further transitions are possible, the recognized sentence fragment is sent to the Temporal expression selector, within the normalization module.


Figure 2: Description of the automaton

| PREDICATE | DESCRIPTION | EXAMPLES |
|---|---|---|
| 1. pBasicDate | {YYYYMMDD, YYYY-MM-DD, YYYY/MM/DD}; YYYY∈{1600-2050}, MM∈{1-12}, DD∈{1-31} | 20051202, 2005-12-02 |
| 2. pInvertedBasicDate | {DD-MM-YYYY, DD/MM/YYYY}; YYYY∈{1600-2050}, MM∈{1-12}, DD∈{1-31} | 02-12-2005, 02/12/2005 |
| 3. pArticle | {el, la, los, las} | la |
| 4. pDayAndMonth | DD de MONTH; DD∈{1-31}, MONTH = {enero, febrero, …, diciembre} | 5_de_marzo |
| 5. pDateConnector | {del, -, /, de} | de |
| 6. pYearNumber | YYYY∈{1600-2050} | 2005 |
| 7. pDayAndMonthAndYear | DD de MONTH de YYYY; DD∈{1-31}, YYYY∈{1600-2050}, MONTH = {enero, febrero, …, diciembre} | 5_de_marzo_de_2005 |
| 8. pDayNumber | DD∈{1-31} | 30 |
| 9. pMonth | {enero, febrero, …, diciembre, ene, feb, …, dic, ene., feb., …, dic.} | diciembre |
| 10. pPreposition | en | en |
| 11. pDeicticTempex | {hoy, ahora, anteayer, ayer, mañana, anoche, anteanoche, pasado_mañana, antes_de_ayer, antes_de_anoche, al_mediodía, por_la_noche, hoy_en_día, hoy_día} | ayer |
| 12. pDemostrative | {esta, este} | esta |
| 13. pPartsOfToday | {mañana, tarde, noche, mediodía, medianoche, madrugada, momento, período, actualidad, temporada, actualmente} | mañana |
| 14. pDayOfWeek | {lunes, martes, miércoles, jueves, viernes, sábado, domingo} | domingo |
| 15. pYearWord | {año} | año |
| 16. pPastVerb | {hace, hacía, hará, hacen} | hace |
| 17. pQuantity | {uno, una, dos, …, treinta, cuarenta, cincuenta, sesenta, setenta, ochenta, noventa, cien, ciento, mil, millar, millón} | veinte |
| 18. pNumericQuantity | NUMERIC_VALUE∈{0 - 99999999} | 25 |
| 19. pDateUnit | {día, semana, quincena, mes, bimestre, cuatrimestre, trimestre, semestre, año, bienio, trienio, lustro, quinquenio, sexenio, siglo} | mes |

Table 3: Predicates of the automaton
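To make the predicate-driven recognition more concrete, the sketch below implements five of the predicates from Table 3 and a small automaton fragment for expressions of the form [article] DD de MONTH [de YYYY]. The state numbering, the transition table and the longest-match policy (falling back to the last final state visited) are assumptions made for illustration; the actual 25-state automaton is the one depicted in Figure 2 and is not reproduced here.

```python
MONTHS = {
    "enero", "febrero", "marzo", "abril", "mayo", "junio", "julio",
    "agosto", "septiembre", "octubre", "noviembre", "diciembre",
}


def _int_in(token, low, high):
    """True if token is a decimal number within [low, high]."""
    return token.isdecimal() and low <= int(token) <= high


# A few predicates from Table 3 (the Python names are illustrative).
def p_article(token):
    return token.lower() in {"el", "la", "los", "las"}


def p_day_number(token):
    return _int_in(token, 1, 31)


def p_date_connector(token):
    return token.lower() in {"del", "-", "/", "de"}


def p_month(token):
    return token.lower() in MONTHS


def p_year_number(token):
    return _int_in(token, 1600, 2050)


# Hypothetical fragment: state -> [(predicate, next_state), ...]
TRANSITIONS = {
    0: [(p_article, 1), (p_day_number, 2)],
    1: [(p_day_number, 2)],
    2: [(p_date_connector, 3)],
    3: [(p_month, 4)],
    4: [(p_date_connector, 5)],
    5: [(p_year_number, 6)],
}
FINAL_STATES = {4, 6}


def recognize(tokens):
    """Return (start, end) token spans of the longest matches, scanning left to right."""
    spans = []
    i = 0
    while i < len(tokens):
        state, j, last_final = 0, i, None
        while j < len(tokens):
            nxt = next(
                (s for pred, s in TRANSITIONS.get(state, []) if pred(tokens[j])),
                None,
            )
            if nxt is None:
                break
            state, j = nxt, j + 1
            if state in FINAL_STATES:
                last_final = j  # remember the longest accepted prefix so far
        if last_final is not None:
            spans.append((i, last_final))
            i = last_final
        else:
            i += 1
    return spans


tokens = "Llegó el 5 de marzo de 2005 a Madrid".split()
for start, end in recognize(tokens):
    print(" ".join(tokens[start:end]))
```

In the second test case below, the automaton consumes a trailing "de" that is not followed by a year and then backs off to the span ending at the month, which is one plausible reading of "a final state is reached and no further transitions are possible".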



3.3 Normalizer

This module normalizes the previously recognized expressions. It consists of five submodules.

Temporal expression selector: receives the expressions and sends each one to the appropriate normalization submodule. Because there are different types of timexes, each must be handled in its own way.

Normalization of absolute expressions: handles absolute temporal expressions, that is, those that are fully defined by themselves and need no other point in time as a reference. They can be complete (“3 de abril de 2005”) or incomplete (“abril de 2005”).

Normalization of deictic expressions: handles deictic temporal expressions, that is, those that refer to another moment in time, which must be known before they can be fully defined.

In this case normalization is not immediate and requires some prior calculation. The reference date is taken from the document being analyzed: it can be obtained from the context, or the creation date of the document itself can be used. The normalizer uses the second approach to evaluate temporal expressions.

Normalization of intervals: handles the normalization of time periods, also known as intervals. An interval involves two timexes joined by a connector.

Normalization by direct translation: Spanish has certain expressions that are not strictly a temporal reference but rather a point in time, such as “Navidad” (Christmas). These expressions are translated directly through dictionaries that store the mapping between the expression and the normalized date it refers to.

3.4 Post-processor

At this stage the results of normalizing the expressions are written in the XML output format predefined for ACE07.

3.5 Classification of temporal expressions by normalization type

The Temporal expression selector submodule introduced above classifies the type of each recognized expression to be normalized. The classification follows the proposal defined in Tables 4 and 5.

Table 4 shows the types of absolute expressions the system handles, and Table 5 details the deictic expressions covered. In both cases timexes can be complete (they include day, month and year) or incomplete (they lack one of them). Finally, Table 6 lists the elements that make up the recognizable expressions.

Each expression type is labeled with an identifier. The tables also give the input format for each class and the value of the TIMEX2 VAL attribute once the expression has been normalized.

For deictic expressions an additional field is shown: the reference date. This value is needed to compute the normalized value of the expression. The approach taken by the system sets the reference date to the creation date of the documents it processes.

4 Results

4.1 Scoring in TERN

The score of a system taking part in the TERN task is defined as the sum of the values of all the TIMEX2 expressions output by that system, normalized by the sum of the values of all the reference TIMEX2 expressions, as shown in formula (1). The maximum possible score is 100%, while the minimum is unbounded.

\[
\mathit{TERN\_Value}_{sys} = \frac{\sum_{i} \mathit{value\_of\_sys\_token}_{i}}{\sum_{j} \mathit{value\_of\_ref\_token}_{j}} \tag{1}
\]
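Formula (1) can be sketched in a few lines. The function name and the numeric values below are invented for illustration; how the scorer assigns a value to each individual token is defined in the ACE07 evaluation plan and is not modeled here. The example assumes that spurious system outputs can carry negative values, which would explain why the score has no lower bound.

```python
def tern_value(sys_token_values, ref_token_values):
    """Sum of system token values divided by sum of reference token values (formula 1)."""
    total_ref = sum(ref_token_values)
    if total_ref <= 0:
        raise ValueError("the reference values must sum to a positive number")
    return sum(sys_token_values) / total_ref


# Toy values, not taken from the evaluation: four reference expressions worth 1.0
# each; the system gets one fully right, one half right, and emits one output
# that is penalized with a negative value.
print(tern_value([1.0, 0.5, -0.75], [1.0, 1.0, 1.0, 1.0]))
```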

The value of each expression depends on its attributes and on how well they match those of the reference (ACE, 2007).

4.2 Results obtained

Once the evaluation corpora had been processed, the results were submitted for scoring. The official results are published in (NIST, 2007).



| EXPRESSION CATEGORY | IDENTIFIER | INPUT FORMAT | INPUT EXAMPLE | NORMALIZED VAL ATTRIBUTE |
|---|---|---|---|---|
| Absolute expressions | ABS_COMPLETE_0 | DD-MM-YYYY | 31-12-2005 | 2005-12-31 |
| | ABS_COMPLETE_0 | DD/MM/YYYY | 31/12/2005 | 2005-12-31 |
| | ABS_COMPLETE_1 | YYYYMMDD | 20051231 | 2005-12-31 |
| | ABS_COMPLETE_2 | [DET]+DD+"de"+MES+"de"+YYYY | [el] 31 de diciembre de 2005 | 2005-12-31 |
| | ABS_INCOMPLETE_1 | MES + "de" + YYYY | diciembre de 2005 | 2005-12 |
| | ABS_INCOMPLETE_2 | [DET]+YYYY | [el] 2005 | 2005 |

Table 4: Proposed classification of absolute temporal expressions
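The mapping in Table 4 can be approximated with regular expressions. The sketch below is an assumption-laden illustration rather than the system's implementation (which relies on the automaton and on a linguistic processor): the pattern list, the function name `normalize_absolute` and the omission of range checks (years 1600 to 2050, valid days per month) are all simplifications.

```python
import re

MONTH_NAMES = [
    "enero", "febrero", "marzo", "abril", "mayo", "junio", "julio",
    "agosto", "septiembre", "octubre", "noviembre", "diciembre",
]
MONTH_NUMBERS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}
_MES = "|".join(MONTH_NAMES)
_DET = r"(?:(?:el|la|los|las|este|esta)\s+)?"  # optional determiner, as in Table 6

# (identifier, pattern) pairs following Table 4; the first full match wins.
PATTERNS = [
    ("ABS_COMPLETE_0", re.compile(r"(?P<d>\d{2})[-/](?P<m>\d{2})[-/](?P<y>\d{4})")),
    ("ABS_COMPLETE_1", re.compile(r"(?P<y>\d{4})(?P<m>\d{2})(?P<d>\d{2})")),
    ("ABS_COMPLETE_2", re.compile(
        rf"{_DET}(?P<d>\d{{1,2}})\s+de\s+(?P<mes>{_MES})\s+de\s+(?P<y>\d{{4}})")),
    ("ABS_INCOMPLETE_1", re.compile(rf"(?P<mes>{_MES})\s+de\s+(?P<y>\d{{4}})")),
    ("ABS_INCOMPLETE_2", re.compile(rf"{_DET}(?P<y>\d{{4}})")),
]


def normalize_absolute(expression):
    """Return (identifier, VAL) for an absolute expression, or None if none matches."""
    text = expression.strip().lower()
    for identifier, pattern in PATTERNS:
        match = pattern.fullmatch(text)
        if match is None:
            continue
        groups = match.groupdict()
        val = groups["y"]
        if "mes" in groups:
            val += f"-{MONTH_NUMBERS[groups['mes']]:02d}"
        elif "m" in groups:
            val += f"-{int(groups['m']):02d}"
        if "d" in groups:
            val += f"-{int(groups['d']):02d}"
        return identifier, val
    return None


for example in ["31/12/2005", "20051231", "el 31 de diciembre de 2005",
                "diciembre de 2005", "el 2005"]:
    print(example, "->", normalize_absolute(example))
```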

| EXPRESSION CATEGORY | IDENTIFIER | INPUT FORMAT | INPUT EXAMPLE | REFERENCE DATE | NORMALIZED VAL ATTRIBUTE |
|---|---|---|---|---|---|
| Deictic expressions | DEIC_COMPLETE_1 | REFERENCIA_PRESENTE | hoy | 2005-12-31 | 2005-12-31 |
| | DEIC_COMPLETE_1 | REFERENCIA_PASADO | ayer | 2005-12-31 | 2005-12-30 |
| | DEIC_COMPLETE_1 | REFERENCIA_FUTURO | mañana | 2005-12-31 | 2006-01-01 |
| | DEIC_COMPLETE_2 | VERBO "HACER" + CANTIDAD + UNIDAD_TIEMPO | hace un mes | 2005-12-31 | 2005-11-30 |
| | DEIC_INCOMPLETE_1 | [DET]+DD+"de"+MES | [el] 29 de diciembre | 2005-12-31 | 2005-12-29 |
| | DEIC_INCOMPLETE_1 | MES + DD | Diciembre 29 | 2005-12-31 | 2005-12-29 |
| | DEIC_INCOMPLETE_2 | DET + "año" | Este año | 2005-12-31 | 2005 |
| | DEIC_INCOMPLETE_3 | DET + DIA_SEMANA | El lunes | 2005-12-31 | 2006-01-02 |

Table 5: Proposed classification of deictic temporal expressions

DET = {el | la | los | las | este | esta}
MES = {enero | febrero | marzo | … | diciembre}
REFERENCIA_PRESENTE = {hoy | ahora | hoy_día | hoy_en_día | esta_mañana | esta_tarde | esta_noche | este_mediodía | esta_madrugada | este_momento | actualidad | actualmente}
REFERENCIA_PASADO = {ayer | anoche | anteayer | antes_de_ayer | anteanoche}
REFERENCIA_FUTURO = {mañana | pasado_mañana}
CANTIDAD = {CANTIDAD_NUMERICA | CANTIDAD_NO_NUMERICA}
CANTIDAD_NUMERICA = {1 | 2 | …}
CANTIDAD_NO_NUMERICA = {uno | dos | …}
UNIDAD_TIEMPO = {día | semana | quincena | mes | bimestre | trimestre | cuatrimestre | semestre | año | bienio | trienio | lustro | quinquenio | sexenio | siglo}
DIA_SEMANA = {lunes | martes | miércoles | jueves | viernes | sábado | domingo}

Table 6: Elements that make up the different types of recognizable temporal expressions
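The deictic cases of Table 5 all reduce to date arithmetic against the reference date. The following sketch reproduces the table's examples under several assumptions that the paper does not state: weekday expressions are resolved to the next occurrence after the reference date (as the "El lunes" row suggests), month subtraction clamps the day to the length of the target month (as the "hace un mes" row suggests), and only a toy subset of the Table 6 vocabularies is covered. All names are illustrative.

```python
import calendar
from datetime import date, timedelta

WEEKDAYS = ["lunes", "martes", "miércoles", "jueves", "viernes", "sábado", "domingo"]
QUANTITIES = {"un": 1, "una": 1, "dos": 2, "tres": 3}  # toy subset of CANTIDAD_NO_NUMERICA
UNITS = {"día": "day", "días": "day", "semana": "week", "semanas": "week",
         "mes": "month", "meses": "month", "año": "year", "años": "year"}


def months_before(ref, n):
    """Go back n months, clamping the day to the last day of the target month."""
    index = ref.year * 12 + (ref.month - 1) - n
    year, month0 = divmod(index, 12)
    month = month0 + 1
    day = min(ref.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def normalize_deictic(expression, ref):
    """Return the VAL string for a deictic expression, or None if unsupported."""
    words = expression.strip().lower().split()
    text = " ".join(words)
    if text in {"hoy", "ahora"}:                      # REFERENCIA_PRESENTE
        return ref.isoformat()
    if text == "ayer":                                # REFERENCIA_PASADO
        return (ref - timedelta(days=1)).isoformat()
    if text == "mañana":                              # REFERENCIA_FUTURO
        return (ref + timedelta(days=1)).isoformat()
    if text == "este año":                            # DEIC_INCOMPLETE_2
        return f"{ref.year:04d}"
    if len(words) == 2 and words[0] == "el" and words[1] in WEEKDAYS:
        target = WEEKDAYS.index(words[1])             # Monday == 0, as in date.weekday()
        delta = (target - ref.weekday() - 1) % 7 + 1  # strictly after ref (assumption)
        return (ref + timedelta(days=delta)).isoformat()
    if len(words) == 3 and words[0] == "hace" and words[2] in UNITS:
        n = int(words[1]) if words[1].isdecimal() else QUANTITIES.get(words[1])
        if n is None:
            return None
        unit = UNITS[words[2]]
        if unit == "day":
            return (ref - timedelta(days=n)).isoformat()
        if unit == "week":
            return (ref - timedelta(weeks=n)).isoformat()
        return months_before(ref, n if unit == "month" else 12 * n).isoformat()
    return None


reference = date(2005, 12, 31)  # document creation date, as in Table 5
for example in ["hoy", "ayer", "mañana", "hace un mes", "Este año", "El lunes"]:
    print(example, "->", normalize_deictic(example, reference))
```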

4.2.1 General results

Table 7 shows the overall quantitative results, including precision, recall and F-measure. The following points can be drawn from these figures:
a) 47% of the expressions were fully and correctly recognized and normalized
b) 34% of the expressions were not detected
c) 13% of the expressions were recognized with some error
d) false alarms, that is, expressions identified as temporal when they are not, account for approximately 6%
e) precision, recall and F-measure are above 50% in all cases.



Table 7: Quantitative percentages for the general results

| | OK | FA | miss | err | P | R | F |
|---|---|---|---|---|---|---|---|
| # | 680 | 94 | 493 | 190 | - | - | - |
| % | 0.47 | 0.06 | 0.34 | 0.13 | 0.73 | 0.53 | 0.62 |
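As a sanity check, the OK, FA, miss and err proportions in Table 7 are consistent with dividing each count by the sum of the four counts (1457). This reading is inferred from the numbers, not stated in the paper, and the helper name is made up for the example. The P, R and F columns come from the official scorer and are not recomputed here.

```python
def outcome_shares(counts):
    """Share of each outcome over the sum of all outcome counts, rounded to 2 decimals."""
    total = sum(counts.values())
    return {name: round(count / total, 2) for name, count in counts.items()}


table7_counts = {"OK": 680, "FA": 94, "miss": 493, "err": 190}
print(outcome_shares(table7_counts))
```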

4.2.2 Results for the VAL attribute

The system, still a preliminary version, does not use all the attributes provided by the TIMEX2 syntax. In fact, it uses only the VAL attribute to capture the full semantics of temporal expressions.

Table 8 shows the results for the VAL attribute, which are as follows:
a) 62% of the detected elements are correctly marked
b) 3% of the detections are false alarms
c) there are no detections without a corresponding VAL tag
d) 16% of the recognized expressions are not fully annotated, because the remaining TIMEX2 attributes are not used
e) 19% of the detections were wrong
f) precision, recall and F-measure all exceed 95%

Table 8: Quantitative percentages for the VAL attribute

| | OK | FA | miss | sub | err | P | R | F |
|---|---|---|---|---|---|---|---|---|
| # | 582 | 28 | 0 | 149 | 177 | - | - | - |
| % | 0.62 | 0.03 | 0 | 0.16 | 0.19 | 0.97 | 1 | 1 |

4.2.3 Results by data source

The results obtained on the corpora from each source, shown in Figure 3, were very similar. In fact, the lower score on the APW corpus is due to annotation errors in the reference files.

Figure 3: Results by data source

4.2.4 Analysis of the results

Overall, the results can be considered quite promising for a pilot task. Although preliminary, they give a global estimate of the number of temporal expressions identified and of the quality of those detections. False alarms account for a low percentage of all detections. Likewise, the number of expressions that were missed or wrongly recognized is acceptable for most of the documents analyzed.

Precision, recall and F-measure for the overall task are above 50%, and the final score obtained by the system is 47%.

In light of these results, two points stand out:
a) the main cause of lost score is the omission of expressions that were not recognized, which the scorer penalizes heavily
b) errors arise because the system cannot use all the attributes provided by TIMEX2

5 Conclusions and future work

Because this is the first time the TERN task has been run for Spanish, there are no earlier results against which to compare precisely those obtained by the system presented here. Nevertheless, although not strictly comparable, previous systems exist that address similar tasks for Spanish (Saquete, 2006), Italian or English (Negri et al., 2006).

In view of the evaluation figures, several aspects need to be improved in the future:
a) the tagging of recognized expressions should cover all the attributes provided by TIMEX2, in order to capture as much semantics as possible (durations, time periods, etc.)
b) the coverage of the automaton's grammar needs to be extended by adding expression types not currently considered
c) dictionaries covering a wider range of directly translatable expressions, such as festivities and holidays, need to be implemented
d) developing a guide for annotating temporal expressions in Spanish would be valuable. Such a tool would improve system performance (should “marzo” or “en marzo” be tagged?). In addition, each language has peculiarities that must be taken into account, and some Spanish expressions inherit a treatment from English that makes little sense. For example, according to (Ferro et al., 2005), in the expression “del 2 de marzo” only “el 2 de marzo” is to be tagged, which splits the contracted article.

Likewise, studying mechanisms for extracting contextual information, which would make deictic expressions easier to handle, is considered a relevant task for future work.

Finally, a priority line of work for future versions of the system is the introduction of machine learning techniques in the recognition and classification stages (Ahn, Fissaha, and de Rijke, 2005), so that they complement the current rule-based mechanisms.

References

ACE. 2007. The ACE 2007 (ACE07) Evaluation Plan.

Ahn, D., Fissaha, S. and de Rijke, M. 2005. Extracting Temporal Information from Open Domain Text: A Comparative Exploration. J. Digital Information Management, 3(1):14-20.

Allen, J.F. 1983. Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11):832-843.

DAEDALUS. 2007. Data, Decisions and Language, S. A. http://www.daedalus.es

Ferro, L., Gerber, L., Mani, I., Sundheim, B. and Wilson, G. 2005. TIDES 2005 Standard for the Annotation of Temporal Expressions.

Llido, D., Berlanga, R. and Aramburu, M.J. 2001. Extracting temporal references to assign document event-time periods. Lecture Notes in Computer Science, 2113:62-71.

Mani, I. and Wilson, G. 2000. Robust Temporal Processing of News. In Proceedings of the ACL’2000 Conference, Hong Kong.

MITRE Corporation. 2007. TimeBank. http://www.cs.brandeis.edu/~jamesp/arda/time/timebank.html

Moldovan, D., Bowden, M. and Tatu, M. 2006. A Temporally-Enhanced PowerAnswer in TREC 2006. In The Fifteenth Text REtrieval Conference (TREC 2006) Proceedings. Gaithersburg, MD (USA).

National Institute of Standards and Technology. 2007. NIST 2007 Automatic Content Extraction Evaluation Official Results (ACE07) v.2. http://www.nist.gov/speech/tests/ace/ace07/doc/ace07_eval_official_results_20070402.htm

de Pablo-Sánchez, C., González Ledesma, A., Moreno-Sandoval, A. and Vicente-Díez, M.T. 2006. MIRACLE experiments in QA@CLEF 2006 in Spanish: main task, real-time QA and exploratory QA using Wikipedia (WiQA). In CLEF 2006 Proceedings. To be published.

Negri, M., Saquete, E., Martinez-Barco, P. and Munoz, R. 2006. Evaluating Knowledge-based Approaches to the Multilingual Extension of a Temporal Expression Normalizer. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, Association for Computational Linguistics, pages 30-37.

Pustejovsky, J., Castaño, J., Ingria, R., Saurí, R., Gaizauskas, R., Setzer, A. and Katz, G. 2003. TimeML: Robust Specification of Event and Temporal Expressions in Text. In Proceedings of the IWCS-5 Fifth International Workshop on Computational Semantics.

Saquete, E. and Martinez-Barco, P. 2000. Grammar specification for the recognition of temporal expressions. In Proceedings of Machine Translation and multilingual applications in the new millennium, MT2000, pages 21.1-21.7, Exeter (UK).

Saquete, E., Martínez-Barco, P., Muñoz, R. and Viñedo, JL. 2004. Splitting Complex Temporal Questions for Question Answering Systems. In Proceedings of the ACL’2004 Conference, Barcelona.

Saquete, E., Martinez-Barco, P., Muñoz, R., Negri, M., Speranza, M. and Sprugnoli, R. 2006. Multilingual Extension of a Temporal Expression Normalizer using annotated corpora. In Proceedings of the Workshop Cross-language Knowledge Induction at EACL 2006. Trento.



Computational Lexicography

Inducing classes of verb behaviour from the SENSEM corpus

Laura Alonso Alemany
Universidad de la República, Uruguay
Universidad Nacional de Córdoba, Argentina
[email protected]

Irene Castellón Masalles
Universidad de Barcelona
[email protected]

Nevena Tinkova Tincheva
Universidad de Barcelona
[email protected]

Resumen: En este artículo presentamos la construcción de un clasificador con el objetivo final de asignar automáticamente patrones de subcategorización a piezas verbales no conocidas previamente, partiendo de una generalización de patrones anotados manualmente. A partir del banco de datos SENSEM (Fernández et al 2004) se han adquirido los esquemas de subcategorización de 1161 sentidos verbales. Estos esquemas se han agrupado en clases de equivalencia mediante técnicas de clustering. Cada clase representa una generalización sobre el comportamiento sintáctico-semántico de los verbos que contiene. Nuestro objetivo final es enriquecer un lexicón verbal con esquemas de subcategorización, asignando automáticamente cada pieza verbal a una de estas clases, a partir de ejemplos de corpus anotados automáticamente. Presentamos una evaluación preliminar de un clasificador que lleva a cabo esta tarea. Palabras clave: Adquisición de subcategorización, análisis sintáctico, clases sintácticas, sentidos verbales.

Abstract: In this paper we present the construction of a classifier whose ultimate goal is to automatically assign subcategorization frames to previously unseen verb senses of Spanish, starting from a generalization of manually annotated frames. Taking the SENSEM database (Fernández et al., 2004) as a starting point, the subcategorization frames of 1161 verb senses have been acquired. These frames have been grouped into equivalence classes by clustering techniques. Each class represents a generalization over the syntactico-semantic behaviour of the verbs in it. Our final aim is to enrich a verb lexicon with subcategorization frames, automatically assigning each verb to one of these classes based on corpus examples that have been automatically analyzed. We present a preliminary evaluation of a classifier that carries out this task.

Keywords: Acquiring verbal subcategorizations, parsing, syntactic classes, verb senses.

1 Introduction

In this paper we present the construction of a verb sense classifier whose ultimate aim is to establish a method for semi-automatically enriching a verb lexicon with subcategorization information, by extrapolating information from a manually annotated corpus to unannotated examples.

We start from the manually annotated corpus SENSEM (Fernández et al., 2004) and characterize the verbs in it using as features the syntactic frames in which they occur. We then generalize the behaviour of these verbs with clustering techniques. This yields groups of verbs with similar syntactic behaviour, since verbs that occur with similar syntactic frames end up in the same cluster.
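The idea of describing verbs by the frames they occur with can be illustrated with a toy similarity computation. Everything in the sketch below is invented for illustration: the frame labels, the counts and the choice of cosine similarity are assumptions, not the features, data or clustering techniques actually used with SENSEM.

```python
import math


def cosine(profile_a, profile_b):
    """Cosine similarity between two frame-count profiles given as dicts."""
    frames = set(profile_a) | set(profile_b)
    dot = sum(profile_a.get(f, 0) * profile_b.get(f, 0) for f in frames)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical counts of how often each verb occurs with each frame:
# "S" = subject only, "S+DO" = subject plus direct object.
profiles = {
    "dormir": {"S": 10},
    "soñar": {"S": 6, "S+DO": 4},
    "desear": {"S+DO": 10},
}

for verb in ("soñar", "desear"):
    print("dormir vs", verb, round(cosine(profiles["dormir"], profiles[verb]), 3))
```

A clustering algorithm run over such profiles would tend to place verbs with similar frame distributions in the same class, which is the intuition behind the grouping described above.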

We explore several options for obtaining these classes of similar verbs: different subsets of features for describing the verbs and different clustering techniques. We apply quantitative and qualitative metrics to analyze the resulting solutions, and finally choose to study in more detail a two-level solution with 5 initial classes and 11 classes at the second level. We evaluated how useful this solution is for assigning a class of syntactic behaviour to unseen verbs, using several automatically learned classifiers.



The rest of the paper is organized as follows. The next section argues for the usefulness of subcategorization information for improving automatic parsing, reviews related work and presents our approach. Section 3 describes how we prepared the data from the SENSEM corpus, the parameters of the clustering experiments and the metrics used to evaluate them. Section 4 shows how we analyzed the results of the experiments, with a brief description of the solutions obtained and a more detailed description of one of them. Section 5 evaluates the application of the selected classes to unseen examples by means of automatically learned classifiers. Finally, Section 6 presents the conclusions of this work and the outline of future work.

2 Motivation: subcategorization and parsing

Describing how a verb behaves at both the syntactic and the semantic level is a necessary task for approaching language 'understanding' in natural language processing. On the one hand, the verb is the semantic nucleus of the sentence: it assigns semantic roles and therefore helps to pin down the sense of the nominal elements and to determine the overall meaning of the scene. For example, in sentence (1) the verb entrar assigns the semantic role of path to “la puerta”, which favours the “opening” sense of the word puerta, whereas in sentence (2) the verb abrir assigns it the role of theme, which favours the “frame” sense of puerta.

(1) El viento entró por la puerta. (The wind came in through the door.)
(2) La puerta se abre sobre una explanada. (The door opens onto an esplanade.)

On the other hand, from a purely syntactic perspective, the verb tells us what kind of complements it requires for a sentence to be grammatical and whether this frame alternates with other complements, that is, it tells us about the different syntactic configurations of its arguments. The following examples show how the same syntactic construction yields an ungrammatical sentence with the verb dormir or desear, but not with soñar.

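(3) * Los niños duermen sueños tranquilos. (The children sleep peaceful dreams.)
Los niños duermen. (The children sleep.)
(4) Los niños desean sueños tranquilos. (The children wish for peaceful dreams.)
* Los niños desean. (The children wish.)
(5) Los niños sueñan sueños tranquilos. (The children dream peaceful dreams.)
Los niños sueñan. (The children dream.)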

Subcategorization structure can thus be regarded as the basic linguistic information that makes it possible to restrict the number of structures obtained in parsing.

This information is crucial for automatic parsers to work well, since some fundamental parsing problems depend on the idiosyncrasies of lexical heads. Among the hardest cases are determining which lexical head a prepositional phrase depends on (6), resolving coordination (7) and determining the function of certain noun phrases (8). For Spanish, these problems are compounded by the degree of freedom in constituent order (9), which makes the previous cases even harder to resolve. Knowing the subcategorization of the verb therefore helps avoid misidentifying categories.

(6) Y lo haremos defendiendo las libertades y los derechos ciudadanos en el combate contra sus enemigos.
(7) ... armaba sus modelos con pedazos de cartón, tablitas, goma, engrudo, cartulinas y lápices de colores.
(8) Macri anuncia esta tarde su postulación a jefe de gobierno.
(9) Papel fundamental han desempeñado en esta recuperación los evangelios llamados apócrifos, sobre todo los de carácter gnóstico.

2.1 Related work

Work on subcategorization acquisition ultimately aims to establish the realization patterns of each verbal unit. To this end, large corpora are used, from which information about sentence-level realizations is extracted. Several authors have addressed the automatic acquisition of this information, generally starting from a corpus parsed either automatically (Korhonen et al 2003, Briscoe et al 1997) or manually (Sarkar et al 2000) and applying filters to discard information about adjuncts, one of the main

Laura Alonso Alemany, Irene Castellón Masalles y Nevena Tinkova


problems in this task. These approaches have achieved varying degrees of success across languages. For Spanish there is work based on diatheses or verb classes that applies techniques similar to the above (Esteve 2004, Chrupala 2004), with fairly positive results. One of the hardest ambiguities to handle is prepositional phrase attachment. Some authors (Atserias 2006) propose using two models, one nominal and one verbal, which compete for particular arguments in an ambiguous situation on the basis of certain conditions.

2.2 Our approach

Unlike these works, our method starts from a set of patterns that have already been acquired and evaluated for the verb senses described in the SENSEM project (see Figure 1).

Figure 1. Subcategorization schemes acquired for the sense añadir_1 from the SENSEM verbal database.

Our ultimate goal is to associate subcategorization schemes with verb senses not described in SENSEM. We proceed in two steps:

1) we discover broad classes of distinguishable syntactic behaviour among the SENSEM verbs, and

2) we classify new verbal predicates into one of those classes.

To reach this goal we rely on a number of hypotheses that we think need to be stated. First, we assume that subcategorization is information associated with verb senses, not with lemmas. Some work on subcategorization acquisition has used the lemma as the unit of subcategorization (Manning 1993, Briscoe et al 1997). Consequently, applying the classifier to a corpus will require some form of word sense disambiguation.

Another starting hypothesis is that the SENSEM database already contains most of the subcategorization schemes that exist in Spanish, so it is very likely that the behaviour of a new verb sense can be characterized by extrapolating from one of the verbs already known.

3 Methodology

The initial goal, as stated above, is to induce classes of syntactic behaviour of verbs from the SENSEM data and to extrapolate these behaviours to unknown verbs by means of automatic classifiers. We describe the phases of the experiment below: characterization of the examples, induction of classes by clustering, and classification of unseen examples.

3.1 Characterization of the manually annotated examples

The procedure builds on the results of the SENSEM annotation. The examples in the SENSEM databank are sentences from a newspaper corpus annotated at the syntactic-semantic level (Castellón et al. 2006). The annotation consisted of manually tagging the verb and the constituents directly related to it, where each constituent is annotated with: its morphosyntactic category (e.g. noun phrase, adverbial clause), its syntactic function (e.g. subject, prepositional object), its relation to the verb (e.g. argument or adjunct), and its semantic role (e.g. initiator, affected theme, origin, time). A total of 250 lemmas were covered, selected by their frequency in a balanced corpus of the language (Davies 2005), and the number of senses is 1161.

To characterize the syntactic behaviour of the verb senses we proceed in the following steps:

1) syntactic realization scheme of each example: for each example in the corpus, its syntactic scheme is obtained by

Inducción de Clases de Comportamiento Verbal a partir del Corpus SENSEM


1.1) compaction of categories that have the same distribution, for example relative pronouns (subject or direct object) or elided subjects with noun phrases, among others.
1.2) selection of arguments, removing optional constituents (adjuncts).
1.3) removal of constituent order, by sorting the constituents alphabetically.

2) behaviour of each sense, characterized by the number of examples of the sense that occur with each possible syntactic realization scheme. In this way we obtain the empirical equivalent of the subcategorization scheme, from the data associated with the verb senses in the SENSEM verbal database (Fernández et al 2004).
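The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input format (a list of category/relation pairs per example), the merge table `COMPACT` and the labels `relative_pronoun`, `elided_subject` and `sadv` are hypothetical and are not taken from the SENSEM annotation scheme.

```python
from collections import Counter, defaultdict

# Hypothetical merge table: categories on the left are folded into the
# category on the right because they share the same distribution (step 1.1).
COMPACT = {"relative_pronoun": "sn", "elided_subject": "sn"}


def realization_scheme(constituents):
    """Build the order-free realization scheme of one example.

    `constituents` is a list of (category, relation) pairs, where relation
    is either "argument" or "adjunct" (an assumed input format).
    """
    kept = [
        COMPACT.get(category, category)   # 1.1 compact equivalent categories
        for category, relation in constituents
        if relation == "argument"         # 1.2 keep arguments, drop adjuncts
    ]
    return " ".join(sorted(kept))         # 1.3 discard constituent order


def sense_profiles(examples):
    """Step 2: count, for each sense, the examples seen with each scheme."""
    profiles = defaultdict(Counter)
    for sense, constituents in examples:
        profiles[sense][realization_scheme(constituents)] += 1
    return profiles


# Toy examples for one sense (invented for illustration).
examples = [
    ("añadir_1", [("sn", "argument"), ("sp", "argument"), ("sadv", "adjunct")]),
    ("añadir_1", [("sp", "argument"), ("elided_subject", "argument")]),
    ("añadir_1", [("relative_pronoun", "argument")]),
]
profiles = sense_profiles(examples)
print(dict(profiles["añadir_1"]))
```

The resulting counter per sense plays the role of the empirical subcategorization scheme described in step 2.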

We characterized the examples (and hence the subcategorization schemes of the verb senses) with different subsets of the available information:

- morphosyntactic category of the arguments;
- category and syntactic function;
- category, function and semantic role.

Moreover, inspection of the results showed that syntactic realization schemes with few occurrences in the corpus introduced a great deal of noise into the search space, causing odd groupings. We therefore decided to characterize the subcategorization schemes using as attributes only the realization schemes with more than 5, or with more than 10, occurrences in the corpus, which considerably reduced the number of attributes, as Table 1 shows.

                      all    > 5 occ.   > 10 occ.
cat                   240        98         69
func + cat            785       213        130
role + func + cat    2854       464        317

Table 1: Number of distinct syntactic realization schemes found in the corpus when the examples are characterized with the different approaches.
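The frequency cut-off behind the columns of Table 1 can be sketched as follows. The function names and the toy counts are assumptions made for illustration; only the thresholds (more than 5, more than 10 occurrences) come from the text.

```python
from collections import Counter


def frequent_schemes(profiles, min_count):
    """Schemes whose corpus-wide count is strictly greater than `min_count`."""
    totals = Counter()
    for counts in profiles.values():
        totals.update(counts)  # adds the per-sense counts to the running total
    return sorted(s for s, n in totals.items() if n > min_count)


def to_vector(counts, attributes):
    """Project one sense onto the retained schemes (0 when unseen)."""
    return [counts.get(scheme, 0) for scheme in attributes]


# Toy per-sense scheme counts (invented for illustration).
profiles = {
    "verb_a_1": {"sn": 8, "sn sn": 4, "sn sp": 1},
    "verb_b_1": {"sn": 5, "sn sn": 3},
}
attributes_5 = frequent_schemes(profiles, 5)
attributes_10 = frequent_schemes(profiles, 10)
print(attributes_5, attributes_10)
print(to_vector(profiles["verb_a_1"], attributes_5))
```

Rare schemes simply disappear from the attribute set, so the vectors become shorter and less sparse.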

3.2 Induction of verb classes

From the subcategorization schemes of the senses present in the corpus, with the different attribute subsets described above, we try to discover classes of senses with similar schemes. To do so we represent each sense as a vector whose dimensions are the possible realization schemes and whose value in each dimension is the number of examples of the sense that occur with that realization scheme. This gives us a representation of the senses in a mathematical space defined by the realization schemes, in which notions of distance (or similarity) can be applied. On this space we apply unsupervised classification methods (clustering) to find groups of vectors (senses) that are close in the space, that is, that tend to occur with the same syntactic schemes. We use the clustering algorithms provided by Weka (Witten et al 2005). Specifically, we chose Simple KMeans (Hartigan et al 1979) and clustering based on Expectation-Maximization (EM) (Dempster et al 1977).
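To make the vector-space view concrete, here is a minimal from-scratch k-means (Lloyd iterations with Euclidean distance). It is not Weka's Simple KMeans or EM implementation; the sense names, the three dimensions and the choice of initial centroids are assumptions made for the example.

```python
import math


def kmeans(vectors, initial_centroids, iterations=10):
    """Lloyd's k-means with Euclidean distance; returns (labels, centroids)."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(v, centroids[k]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Dimensions (hypothetical): example counts for the schemes "sn", "sn sn", "sn sp".
senses = ["sense_a", "sense_b", "sense_c", "sense_d"]
vectors = [[9, 1, 0], [8, 2, 0], [1, 9, 0], [0, 8, 1]]
labels, centroids = kmeans(vectors, initial_centroids=[vectors[0], vectors[2]])
clusters = dict(zip(senses, labels))
print(clusters)
```

Senses that mostly occur with the same scheme end up in the same cluster, which is the behaviour the clustering step relies on.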

In addition, many solutions yielded a majority class containing verbs with very different behaviours, typically verbs that share some very frequent subcategorization scheme. When we tried to increase the number of clusters requested from the clustering method (whether EM or KMeans), the population became very unevenly distributed. This led us to explore, in a preliminary way, a form of divisive hierarchical clustering: we applied clustering within the population of the classes obtained by each solution, so as to establish more classes with a smaller population and more specific subcategorization schemes. This approach proved adequate for obtaining classes with a well-distributed population. In the future we will apply a hierarchical clustering algorithm.
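The re-clustering idea can be sketched as a wrapper that runs any clustering function inside each class and produces labels of the form "5.2". The `min_size` cut-off and the toy `dominant_dimension` clusterer are assumptions added for illustration; the text does not say how classes were chosen for subdivision.

```python
from collections import defaultdict


def subdivide(labels, vectors, cluster_fn, min_size=3):
    """Cluster again inside each class and return labels such as '5.2'.

    `cluster_fn` takes a list of vectors and returns one integer label per
    vector; classes smaller than `min_size` are left undivided.
    """
    members = defaultdict(list)
    for sense, label in labels.items():
        members[label].append(sense)

    fine = {}
    for label, senses in members.items():
        if len(senses) < min_size:
            fine.update({s: str(label) for s in senses})
            continue
        sub_labels = cluster_fn([vectors[s] for s in senses])
        for sense, sub in zip(senses, sub_labels):
            fine[sense] = f"{label}.{sub + 1}"
    return fine


def dominant_dimension(vs):
    """Toy stand-in for a real clusterer: index of each vector's largest count."""
    return [v.index(max(v)) for v in vs]


coarse = {"sense_a": 5, "sense_b": 5, "sense_c": 5, "sense_d": 1}
vectors = {"sense_a": [9, 1], "sense_b": [8, 0], "sense_c": [1, 7], "sense_d": [3, 3]}
fine = subdivide(coarse, vectors, dominant_dimension)
print(fine)
```

In practice `cluster_fn` would be the same EM or k-means procedure used for the coarse classes.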

4 Selecting a suitable set of equivalence classes of verb senses

4.1 Methods for evaluating clustering solutions

The large number of parameters described in the previous section hints at the large number of experiments we carried out, with clustering solutions obtained with different methods and different attribute subsets for characterizing the verb senses. It therefore became necessary to establish systematic evaluation methods, described at length in (Alonso et al. 2007). These combine qualitative inspection of the resulting classes with the following metrics over the solutions:

- Given a hand-built list of pairs of very similar verbs, we check whether they are grouped in the same class (rewarded) or not (penalized).
- Overlap index of the schemes that characterize the different classes: a low overlap index indicates that the senses of different classes do indeed occur with different schemes.
- Distribution of the population across classes, penalizing solutions with sparsely populated classes (one or two senses), since these do not generalize behaviours.
- Sense distinguishability index, which indicates whether the different senses of a verb lemma are distributed across different clusters (rewarded) or fall in the same ones (penalized). Since different syntactic behaviour is only one of the ways in which verb senses may differ, this indicator is merely a guide.
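Three of these metrics can be sketched as below. The exact formulas are given in (Alonso et al. 2007) and are not reproduced here, so the definitions used in this sketch are assumptions: the pair test is taken as the fraction of similar pairs that share a class, the overlap index as the mean pairwise Jaccard overlap between the scheme sets of the classes, and sparsely populated classes as those with at most two senses. Verb senses and scheme sets are invented.

```python
from collections import Counter
from itertools import combinations


def pair_test(labels, similar_pairs):
    """Fraction of hand-picked similar pairs that land in the same class."""
    same = sum(labels[a] == labels[b] for a, b in similar_pairs)
    return same / len(similar_pairs)


def overlap_index(class_schemes):
    """Mean Jaccard overlap between the scheme sets of every pair of classes."""
    pairs = list(combinations(class_schemes.values(), 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


def sparse_classes(labels, max_size=2):
    """Classes with `max_size` senses or fewer (penalized in the text)."""
    sizes = Counter(labels.values())
    return sorted(c for c, n in sizes.items() if n <= max_size)


# Hypothetical clustering solution over six verb senses.
labels = {"comer_1": 1, "beber_1": 1, "tomar_1": 1,
          "ir_1": 2, "venir_1": 2, "dar_1": 3}
similar_pairs = [("comer_1", "beber_1"), ("ir_1", "venir_1"), ("comer_1", "dar_1")]
class_schemes = {1: {"sn", "sn sn"}, 2: {"sn", "sn sp"}, 3: {"sn sn sp"}}

print(pair_test(labels, similar_pairs))
print(overlap_index(class_schemes))
print(sparse_classes(labels))
```

A good solution under these assumed definitions has a high pair-test score, a low overlap index and few sparse classes.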

4.2 General description of the different solutions

In this section we briefly describe the clustering solutions obtained with different criteria for characterizing the verb senses, in order to motivate the final choice of one of them.

In general, KMeans, which requires a parameter specifying the number of classes to be established, gave worse results than EM, above all regarding the distribution of the population. Specifically, it tended to produce classes with a single verb sense in solutions with more than three classes. In solutions with three or fewer classes, the scheme overlap index and the pair test were considerably worse than for EM. For this reason we chose EM as the method for obtaining the clustering solutions.

Once we had settled on EM, we inspected in more detail the solutions obtained with the different kinds of information.

In the solutions with category, function and semantic roles, classes with different kinds of subcategorization schemes are clearly distinguished, especially in the solutions that only take into account realization schemes occurring more than 5 or 10 times, owing to a marked reduction in data sparseness when only frequent schemes are used. These solutions always contain 4 classes: a majority class containing verbs with practically any argument pattern but with a strong presence of intransitive diatheses, which would arise from the elision of one of the arguments in the corpus examples, together with properly intransitive verbs; a second, fairly large class of verbs strongly characterized as transitive, with few intransitive diatheses; and two small classes of verbs with some argument bearing a highly marked role (origin, destination), with few intransitive diatheses.

In the solutions where verbs are characterized by category and function, there is always a class with more than half of the population, containing verbs with very disparate behaviours whose common feature is that they have some intransitive diathesis, probably caused, as in the approaches with semantic roles, by the elision of one of the arguments. One or more classes of verbs with some prepositional or adverbial argument are usually also clearly distinguished, as is a class of ditransitive verbs with their transitive and intransitive diatheses.

Finally, the solutions where senses are characterized by category alone tend to produce many classes, but the population is well distributed across medium-sized classes, except in the solution that takes all schemes into account. In the solutions with patterns occurring more than 5 and more than 10 times, there is always one class with most of the population, two medium-sized classes and a variable number of smaller classes. It is difficult to generalize about the behaviour of the verbs in these classes because of the great ambiguity of patterns based on categories alone.



4.3 Selected solution: 5 classes, function + category, schemes occurring > 10 times

On the basis of the results and a comparison of the different evaluation measures, we finally chose to take some of the classes from the clustering solutions that use category and syntactic function information. This decision was partly conditioned by how the verbs that will ultimately be assigned a class automatically can be characterized: examples of those verbs can be analysed automatically at the syntactic level, but not at the level of semantic roles. For this reason, at this first stage we set aside the classes obtained with semantic role information.

We therefore take as our reference point the 5-class solution, obtained with schemes characterized by function and category with more than 10 occurrences in the corpus. Given how compact this solution is, we applied clustering within every class, to see whether more fine-grained classes could be obtained within the same approach. The 5 classes are subdivided into a total of 11 classes.

The largest class (class 5, 477 senses) consists of verb senses that alternate between transitive and intransitive schemes, in some cases with prepositional ones. The subclasses obtained from it are much more sharply characterized: classes 5.5, 5.3 and 5.2 group the senses that alternate between transitive and intransitive schemes, while classes 5.4, 5.6, 5.7 and 5.8 are characterized by the intransitive-prepositional alternation, with some differences due to the presence of predicative complements or of transitive schemes. At this level, associating a class with schemes such as sn v sn or sn v sp seems quite feasible. In the second class (class 2, 163 senses), prepositional and intransitive realizations predominate, the latter being explained by the omission of the prepositional arguments. In some cases we find ditransitive schemes alternating with prepositional ones. The subclasses obtained are very similar to each other, except that one contains ditransitive schemes (2.2) while the other lacks them and is characterized by schemes with circumstantial complements (2.1).

The next two classes (class 1, 103 senses, and class 3, 68 senses) are characterized by transitive-ditransitive-intransitive alternations, with omission of certain constituents. These classes have no subclasses. The last class (class 4, 63 senses) contains senses characterized by essentially prepositional schemes alternating with intransitive ones, and by the presence of attribute complements. Its three subclasses are distinguished by different schemes: class 4.1 is characterized by the prepositional-intransitive alternation with attribute complements, class 4.2 is entirely prepositional, and class 4.3 contains senses with transitive schemes alternating with prepositional ones. As can be seen, this solution has mixed classes and some classes containing senses whose behaviour is comparable to that of senses in other classes. The class induction method clearly needs further work, but the results so far are encouraging.

5 Evaluation for the final application

We trained several classifiers that, given a sense represented as a vector of its realization schemes, assign it to one of the broad classes of verbal behaviour induced in the previous step. We trained two Bayesian classifiers (classic and Naive Bayes), two decision-based ones (J48, based on decision trees, and JRip, based on decision rules), one based on the k nearest neighbours (IBk, with k=1), and a baseline, equivalent to the results obtained by chance (OneR). These classifiers were evaluated by ten-fold cross-validation on the SENSEM corpus.

Recall that the ultimate goal of our work is to assign a subcategorization class to verbs not previously described, from automatically analysed corpus examples. To evaluate how useful the equivalence classes described in the previous section are for this goal, we analysed the SENSEM corpus automatically with Freeling (Carreras et al 2004). The only information we use from the SENSEM corpus is the span of the constituents dominated by the verb in each example. We compared the performance of the classifiers on examples characterized with automatic analysis and on examples characterized with the manual SENSEM analysis. We also compared the performance of the classifiers on the broad classes described in the previous section (coarse classes) and on the finer-grained classes (fine classes). The results are shown in Table 2.
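As an illustration of this evaluation set-up, the following sketch implements a 1-nearest-neighbour classifier (the idea behind IBk with k=1) and a simple k-fold cross-validation loop. It is not Weka's implementation: folds are assigned round-robin rather than stratified, and the data and class names are invented.

```python
import math


def one_nn(train, vector):
    """Label of the training vector closest to `vector` (Euclidean distance)."""
    return min(train, key=lambda item: math.dist(item[0], vector))[1]


def cross_validate(data, folds=10):
    """Accuracy of 1-NN under k-fold cross-validation with round-robin folds."""
    correct = 0
    for f in range(folds):
        test = data[f::folds]
        train = [item for i, item in enumerate(data) if i % folds != f]
        correct += sum(one_nn(train, v) == y for v, y in test)
    return correct / len(data)


# Toy data: two well-separated classes of scheme-count vectors
# (class names are hypothetical).
data = [([10 + i, i % 3], "transitive") for i in range(10)]
data += [([i % 3, 10 + i], "prepositional") for i in range(10)]

accuracy = cross_validate(data, folds=10)
print(f"{accuracy:.0%}")
```

Each sense is held out exactly once, so the accuracy is an estimate of how well unseen senses would be classified.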

              coarse classes     fine classes
              manual   auto     manual   auto
NaiveBayes      78      63        41      25
IBk             76      53        64      24
Bayes           72      63        56      25
J48             70      52        58      26
JRip            69      60        54      31
OneR            11      19        11       8

Table 2. Percentage of senses correctly classified by the different classifiers, with examples annotated manually or automatically, and with fine or coarse classes (see section 4.3).
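The loss incurred by moving from manual to automatic analysis can be read off Table 2 directly. The short script below just performs that subtraction; the figures are copied from the table, and the function name is ours.

```python
# Accuracy (%) from Table 2: (coarse manual, coarse auto, fine manual, fine auto).
TABLE_2 = {
    "NaiveBayes": (78, 63, 41, 25),
    "IBk": (76, 53, 64, 24),
    "Bayes": (72, 63, 56, 25),
    "J48": (70, 52, 58, 26),
    "JRip": (69, 60, 54, 31),
    "OneR": (11, 19, 11, 8),
}


def drops(table):
    """Manual-minus-automatic accuracy, for coarse and for fine classes."""
    return {name: (cm - ca, fm - fa) for name, (cm, ca, fm, fa) in table.items()}


loss = drops(TABLE_2)
for name, (coarse, fine) in loss.items():
    print(f"{name:<11}{coarse:>4}{fine:>4}")
```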

All classifiers significantly outperform the OneR baseline. On coarse classes, simple classifiers such as Naive Bayes or IBk give the best results. Performance drops by about 15 points on average (between 9 and 23 points, depending on the classifier) when the examples are characterized by automatic analysis, a substantial deterioration that will have to be addressed in the future. On fine classes the performance of Naive Bayes plummets, while that of the other classifiers drops by some 10-15 points. This deterioration probably occurs because the data available for these classes, which have a smaller population, are sparser and the classifiers cannot generalize adequately. With automatically characterized examples the deterioration is very large and, although it does not reach the baseline level, the classification comes dangerously close to chance. The causes of error will have to be studied carefully in order to improve these results in the future.

We also ran another experiment simulating the absence of a word sense disambiguation algorithm. The unit to be learned and classified was therefore no longer the verb sense; instead, each individual example was represented as a vector. These vectors are very poorly characterized, since only one of the attributes has a non-zero value, namely the attribute corresponding to the realization scheme with which that particular example occurs. The results are shown in Table 3.

              coarse classes     fine classes
              manual   auto     manual   auto
NaiveBayes      40      30        33      22
IBk             48      32        37      23
Bayes           41      28        30      34
J48             41      31        34      24
JRip            30      27        28      22
OneR            26      26         2       2

Table 3. Percentage of examples correctly classified by the different classifiers, with examples annotated manually or automatically, and with fine or coarse classes (see section 4.3).
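The difference between the two experimental units can be made explicit with a small sketch: an individual example yields a vector with a single non-zero attribute, whereas summing the vectors of all examples of a sense recovers the count vector used in the sense-level experiment. The attribute list and the example schemes are invented for illustration.

```python
def one_hot(scheme, attributes):
    """Vector of a single example: 1 for its own scheme, 0 elsewhere."""
    return [1 if attribute == scheme else 0 for attribute in attributes]


def sense_vector(example_vectors):
    """Sum the vectors of a sense's examples to recover its count vector."""
    return [sum(column) for column in zip(*example_vectors)]


attributes = ["sn", "sn sn", "sn sp"]      # hypothetical attribute set
example_schemes = ["sn sn", "sn", "sn sn"]  # three examples of one sense

example_vectors = [one_hot(s, attributes) for s in example_schemes]
print(example_vectors)
print(sense_vector(example_vectors))
```

A single one-hot vector carries far less information than the aggregated profile, which is consistent with the lower figures in Table 3.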

As for the classification of examples (as opposed to senses), although the results are significantly better than the baseline on fine classes, on coarse classes the results do not differ significantly, especially when the examples are characterized with automatic analysis. Simple methods, especially the distance-based IBk, still give the best results. On fine classes, results are comparable for manual and automatic analysis, but the percentages of correctly classified examples are too low in both cases.

6 Conclusions and future work

We have presented an approach to the semi-automatic enrichment of a verb lexicon with subcategorization schemes. The approach consists of two steps: 1) induction of broad classes of verbal behaviour from manually annotated examples, and 2) learning of classifiers that label new examples with those classes. We presented a method for systematically evaluating the classes obtained with this approach, and showed a preliminary application of the whole process, with results that are promising but clearly open to improvement. At the linguistic level, we observe that the induced classes of verbal behaviour are characterized by the diathetic behaviour of the verbs, which encourages us to pursue this line of research.



Moreover, the results of compacting and classifying the already known senses into classes on the basis of automatic syntactic analysis are very promising, and provide crucial data on the importance of verb sense disambiguation for assigning a subcategorization frame. There is a great deal of interesting future work. First, we think it important to experiment further with the different clustering methods and parameters in order to induce the best classes from a linguistic perspective. In particular, we are considering the use of hierarchical clustering techniques. In addition, as explained above, applying the procedure in a real setting requires starting from corpora that are neither annotated nor semantically disambiguated. Given the complexity of the process, we have divided the task into two phases so that each situation can be evaluated independently. In the first phase, the one presented in this article, we use the SENSEM corpus, in which verb senses are disambiguated, but without the manual syntactic-semantic annotation. This experiment requires automatic morphosyntactic analysis and the application of the classifier. A second phase consists of evaluating the classifier on the same corpus but using WSD and automatic analysis, so as to test acquisition on a controlled corpus. This phase envisages applying the classifier to corpora of unknown verbs.

References

Alonso, L., I. Castellón and N. Tincheva. 2007. Obtaining coarse-grained classes of subcategorization patterns for Spanish. RANLP 2007, Borovets, Bulgaria.

Atserias, J. 2006. Towards Robustness in Natural Language Understanding. PhD thesis. Lengoaia eta Sistema Informatikoak Saila, Euskal Herriko Unibertsitatea, Donosti.

Atserias, J., B. Casas, E. Comelles, M. González, L. Padró and M. Padró. 2006. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. LREC'06, Genoa, Italy.

Brent, M. R. 1993. From Grammar to Lexicon: Unsupervised Learning of Lexical Syntax. Computational Linguistics, 19, p. 243-262.

Briscoe, T. and J. Carroll. 1997. Automatic extraction of subcategorization from corpora. Proceedings of the 5th Conference on Applied Natural Language Processing, p. 356-363.

Carreras, X., I. Chao, L. Padró and M. Padró. 2004. FreeLing: An Open-Source Suite of Language Analyzers. LREC'04, Lisbon, Portugal.

Castellón, I., A. Fernández, G. Vázquez, L. Alonso and J. A. Capilla. 2006. The SENSEM Corpus: a Corpus Annotated at the Syntactic and Semantic Level. LREC'06, Genoa, Italy, p. 355-359.

Chrupala, G. 2003. Acquiring Verb Subcategorization from Spanish Corpora. Research project presented for the Diploma d'Estudis Avançats. Universitat de Barcelona.

Davies, M. 2005. A Frequency Dictionary of Spanish. New York and London: Routledge.

Dempster, A., N. Laird and D. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39.

Esteve, E. 2004. Towards a semantic classification of Spanish verbs based on subcategorisation information. Proceedings of the ACL 2004 Workshop on Student Research. Barcelona.

Fernández, A., G. Vázquez and I. Castellón. 2004. SENSEM: base de datos verbal del español. In G. de Ita, O. Fuentes, M. Osorio (eds.), IX Ibero-American Workshop on Artificial Intelligence, IBERAMIA. Puebla de los Ángeles, Mexico, p. 155-163.

Hartigan, J. A. and M. A. Wong. 1979. Algorithm AS 136: A k-means clustering algorithm. Applied Statistics, 28, p. 100-108.

Korhonen, A. 2002. Subcategorization Acquisition. PhD thesis, Computer Laboratory, University of Cambridge.

Korhonen, A. and J. Preiss. 2003. Improving subcategorization acquisition using word sense disambiguation. ACL 2003.

Manning, Ch. 1993. Automatic acquisition of a large subcategorization dictionary from corpora. ACL'93, p. 235-242.

Sarkar, A. and D. Zeman. 2000. Automatic extraction of subcategorization frames for Czech. COLING 2000.

Witten, I. H. and E. Frank. 2005. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann.

Acknowledgements

This research was made possible by the KNOW project (TIN2006-1549-C03-02) of the Ministerio de Educación y Ciencia, by a Beatriu de Pinós postdoctoral fellowship from the Generalitat de Catalunya awarded to Laura Alonso, and by the FI-IQUC predoctoral fellowship, also from the Generalitat de Catalunya, awarded to Nevena Tinkova, file number 2004FI-IQUC1/00084.



An Open-source Lexicon for Spanish

Montserrat Marimon, Natalia Seghezzi, Núria Bel IULA – Universitat Pompeu Fabra

Pl. de la Mercè 10-12 08002-Barcelona

{montserrat.marimon,natalia.seghezzi,nuria.bel}@upf.edu

Resumen: En este artículo presentamos el componente léxico de una gramática para el español. Nuestro objetivo es describir la información lingüística que codificamos en las entradas léxicas mediante una jerarquía de tipos con herencia múltiple de la cual se pueden extraer subconjuntos de datos necesarios para aplicaciones concretas. Palabras clave: gramática, recursos léxicos, español.

Abstract: In this paper we describe the lexical component of a grammar for Spanish. Our aim is to depict the linguistic information we encode in the lexical entries by means of a multiple inheritance hierarchy of types from which subsets of data required for concrete applications could be extracted. Keywords: grammar, lexical resources, Spanish.

1 Introduction

The lexical component, the repository of knowledge about the words of a particular language, plays a major role in NLP systems. The level of linguistic information that the lexicon contains (morpho-syntactic, syntactic, semantic) is determined by the application where it is used. The construction of lexical resources, however, is expensive in terms of both money and time; hence, they should be reused by more than one application.

In this paper we describe the lexical component of the Spanish Resource Grammar (SRG), a wide-coverage open-source¹ unification-based grammar for Spanish. Ours is a large lexicon with fine-grained information encoded by means of a multiple inheritance hierarchy of types. This paper aims to depict the linguistic information we have encoded in the lexical entries, from which subsets of linguistic data required for concrete applications could be extracted.²

¹ The SRG may be downloaded from: http://www.upf.edu/pdi/iula/montserrat.marimon/.

2 The Spanish Resource Grammar

The SRG is grounded in the theoretical framework of HPSG (Head-driven Phrase Structure Grammar; Pollard and Sag, 1994) and uses Minimal Recursion Semantics (MRS) for the semantic representation (Copestake et al., 2006). The SRG is implemented within the Linguistic Knowledge Building (LKB) system (Copestake, 2002), based on the basic components of the Grammar Matrix, an open-source starter-kit for the development of HPSG grammars developed as part of the LinGO consortium's multilingual grammar engineering (Bender et al., 2002).

The SRG has full coverage of closed word classes and contains about 50,000 lexical entries for open classes. The grammar also has 40 lexical rules to perform valence-changing operations on lexical items and 150 structure rules to combine words and phrases into larger constituents and to compositionally build up the semantic representation.

² This research was supported by the Spanish Ministerio de Educación y Ciencia Juan de la Cierva and Ramon y Cajal programmes.

The SRG is part of the DELPH-IN open-source repository of linguistic resources and tools for writing (the LKB system), testing (the [incr tsdb()]; Oepen and Carroll, 2000) and efficiently processing HPSG grammars (the PET system; Callmeier, 2000). Further linguistic resources that are available in the DELPH-IN repository include broad-coverage grammars for English, German and Japanese as well as smaller grammars for French, Korean, Modern Greek, Norwegian and Portuguese.³

3 The lexicon of the SRG

The basic notion of the SRG is the sign. Briefly, a sign is a complex feature structure which conveys information about the orthographical realization of the lexical sign in STEM and syntactic and semantic information in SYNSEM. SYNSEM holds information related to the treatment of long-distance dependencies in NONLOCAL, and LOCAL information, which includes head information that percolates up the tree structure via HEAD, subcategorization information in VAL(ENCE), whose attributes are SUBJ, COMPS, SPR and SPEC (for subject, complements, specifier and specified element), and semantic information encoded in CONT.

The MRS, encoded in the feature SYNSEM.LOCAL.CONT, is a flat semantic representation which consists of: 1) RELS - a list of semantic relations each with a “handle” (used to express scope relations) and one or more roles. Relations are classified according to the number and type of arguments; lexical relations of the same type are distinguished by the feature PRED; 2) HCONS - a set of handle constraints reflecting syntactic limitations on possible scope relations among the semantic relations; and 3) HOOK - a group of distinguished semantic attributes of a sign. These attributes are: LTOP - the local top handle, INDEX - the salient nominal instance or event variable introduced by the lexical semantic head, and XARG - the semantic index of the sign's external argument.
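For readers more used to programming notation, the three components of an MRS can be mirrored with plain Python data classes. This is only a reading aid under stated assumptions: the class and field names are ours, the handle and index names (`h1`, `x1`) are invented, and none of this reflects how the LKB represents feature structures internally.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Relation:
    pred: str                                  # PRED: distinguishes lexical relations
    lbl: str                                   # the relation's handle
    args: dict = field(default_factory=dict)   # roles such as ARG0


@dataclass
class Hook:
    ltop: str                    # LTOP: local top handle
    index: str                   # INDEX: salient instance or event variable
    xarg: Optional[str] = None   # XARG: index of the external argument


@dataclass
class MRS:
    hook: Hook
    rels: list                                 # RELS: flat list of relations
    hcons: list = field(default_factory=list)  # HCONS: handle constraints


# A noun-like sign: the key relation shares its handle and ARG0 with the HOOK.
key = Relation(pred="_ejemplo_n_rel", lbl="h1", args={"ARG0": "x1"})
mrs = MRS(hook=Hook(ltop="h1", index="x1"), rels=[key])
print(mrs)
```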

Each entry of the lexicon consists of a unique identifier, a lexical type (one of about 400 leaf types defined by a type hierarchy of around 5,500 types), an orthography and a semantic relation. Figure 1 shows an example.⁴

³ See http://www.delph-in.net/.

ejemplo_n1 := n_intr_count_le &
  [ STEM < "ejemplo" >,
    SYNSEM.LKEYS.KEYREL.PRED "_ejemplo_n_rel" ].

Figure 1: Example of lexical entry.
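The four parts of a lexical entry (identifier, lexical type, orthography, semantic relation) can be pulled out of an entry shaped like the one in Figure 1 with a regular expression. This is a minimal sketch, not a TDL parser: the `LexEntry` class, the pattern and the function name are assumptions, and the pattern only covers entries with exactly this layout.

```python
import re
from dataclasses import dataclass


@dataclass
class LexEntry:
    identifier: str
    lex_type: str
    stem: str
    pred: str


# Matches only simple entries with a single STEM string and a KEYREL.PRED value.
ENTRY = re.compile(
    r'(?P<id>\S+)\s*:=\s*(?P<type>\S+)\s*&\s*\[\s*'
    r'STEM\s*<\s*"(?P<stem>[^"]*)"\s*>\s*,\s*'
    r'SYNSEM\.LKEYS\.KEYREL\.PRED\s+"(?P<pred>[^"]*)"\s*\]\s*\.'
)


def parse_entry(text):
    """Extract the four components of a Figure 1 style entry."""
    match = ENTRY.search(text)
    if match is None:
        raise ValueError("not a simple lexical entry")
    return LexEntry(match["id"], match["type"], match["stem"], match["pred"])


TEXT = '''ejemplo_n1 := n_intr_count_le &
  [ STEM < "ejemplo" >,
    SYNSEM.LKEYS.KEYREL.PRED "_ejemplo_n_rel" ].'''

entry = parse_entry(TEXT)
print(entry)
```

Extracting application-specific subsets of the lexicon, as the paper suggests, would start from records like `entry`.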

In the following subsections we focus on the lexical types we have defined for open classes (main verbs, common nouns, adjectives and adverbs) and we describe the linguistic information we have encoded in each type. Due to space limits, we will only present the most frequently used types. Note also that even though we will only show the most relevant LOCAL information, open class types are also defined by a set of NONLOCAL amalgamation types.

Through the type uninflected-lexeme, shown in Figure 2, types for open classes inherit information common to all of them. This type basically identifies the HOOK features LTOP and INDEX with the LBL and ARG0 of the key relation.

uninflected-lexeme := lex-item &
  [ SYNSEM [ LOCAL.CONT [ HOOK [ LTOP #handle,
                                 INDEX #ind ],
                          RELS.LIST < #key & relation & [ LBL #handle,
                                                          ARG0 #ind,
                                                          PRED predsort ], ... > ],
             LKEYS.KEYREL #key ] ].

Figure 2: Basic type for open classes.
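The coreference tags in Figure 2 (#handle, #ind, #key) express structure sharing: the tagged paths lead to one and the same value. A minimal Python sketch of that idea follows, using object identity to stand in for a tag; the `Var` class and the nested-dictionary layout are assumptions made for illustration and do not reproduce how the LKB or PET represent feature structures.

```python
class Var:
    """A variable node (handle or index); identity, not name, encodes sharing."""

    def __init__(self, kind: str):
        self.kind = kind


def make_uninflected_lexeme(pred: str) -> dict:
    handle = Var("handle")  # plays the role of #handle
    ind = Var("index")      # plays the role of #ind
    key = {"LBL": handle, "ARG0": ind, "PRED": pred}  # plays the role of #key
    return {
        "SYNSEM": {
            "LOCAL": {"CONT": {"HOOK": {"LTOP": handle, "INDEX": ind},
                               "RELS": [key]}},
            "LKEYS": {"KEYREL": key},
        }
    }


sign = make_uninflected_lexeme("_ejemplo_n_rel")
cont = sign["SYNSEM"]["LOCAL"]["CONT"]
# The HOOK's LTOP and the key relation's LBL are the same object.
print(cont["HOOK"]["LTOP"] is cont["RELS"][0]["LBL"])          # True
# LKEYS.KEYREL is a shortcut to the first relation in RELS.
print(sign["SYNSEM"]["LKEYS"]["KEYREL"] is cont["RELS"][0])    # True
```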

3.1 Common nouns

All common nouns are specified as taking an empty list for the valence features SUBJ and SPEC, and for MOD, since only temporal nouns and nouns in apposition may function as modifiers.⁵ Common nouns take a non-empty list value for SPR; here agreement between nouns and specifiers is dealt with by identifying the INDEX of the specifier and that of the noun (#ind), which is of type ref(erential)-ind(ex). Finally, common nouns get the semantic relation type basic-noun-relation. This information is encoded in the type basic-common-noun-lex, as we show in Figure 3.

4 The attribute SYNSEM.LKEYS.KEYREL provides a shortcut to the semantic relation in RELS with highest scope and it is only used in the lexicon (see Figure 2).

5 Modifying nouns are dealt with by a unary structure rule that generates a modifying nominal sign.

basic-common-noun-lex := uninflected-lexeme &
  [ SYNSEM.LOCAL [
      CAT [ HEAD noun & [ MOD < > ],
            VAL [ SUBJ < >,
                  SPEC < >,
                  SPR < [ OPT -,
                          LOCAL.CONT.HOOK.INDEX #ind ] > ] ],
      CONT nom-obj &
        [ HOOK.INDEX #ind & ref-ind & [ PNG.PN 3per ],
          RELS.LIST < basic-noun-relation & [ PRED nom_rel ], ... > ] ] ].

Figure 3: Basic type for common nouns.

Then, lexical subtypes for nouns are basically distinguished on the basis of valence information and the mass / countable / uncountable distinction. This semantic classification determines the syntactic behavior of nouns w.r.t. the specifiers they may co-occur with. Briefly, countable nouns require a specifier when they are in the singular (e.g. se sentó en *(la) silla ((s)he sat in (the) chair)), they may co-occur with cardinals (e.g. dos/tres sillas (two/three chairs)) and they only occur in the plural with quantifying pronouns such as poco (few) (e.g. *poca silla/pocas sillas (few chairs)); uncountable nouns cannot co-occur with partitives (e.g. *un trozo de paz (a piece of peace)), nor with distributional quantifiers such as cada (each) (e.g. *cada paz (each peace)), or with cardinals (e.g. *tres paces (three peaces)); finally, mass nouns cannot co-occur with cardinals (e.g. *tres aburrimientos (three boredoms)), but they may co-occur with partitives (e.g. un poco de aburrimiento (a bit of boredom)).

Non-argumental common nouns, i.e. nouns taking an empty list as value for COMPS, are classified as n_intr_count_le, n_intr_uncount_le or n_intr_mass_le. Nouns with both a count and a mass reading (e.g. manzana (apple); pastel de manzana (apple pie) vs tres manzanas (three apples)) are assigned the type n_intr_mass-or-count_le. Besides, we have two further subtypes: n_intr_coll_le for collective nouns (e.g. ejército (army)) and n_intr_plur_le for plural nouns (e.g. celos (jealousy)). Lexical semantic information is given to non-argumental nouns in the lexicon itself as value of the feature SYNSEM.LKEYS.KEYREL.ARG0.SORT. We have defined a hierarchy of types for dealing with nouns with more than one reading (e.g. cabo, which may be both human (corporal) and locative (cape), takes hum_loc as value).

Nouns taking complements are classified into three types. Then, each type is further sub-typed according to such linguistic properties as the number and category of subcategorized elements or the semantic relation type (i.e. the semantic roles of syntactic arguments). These three super-types distinguish:

1) quantifying nouns, which cover three subtypes: n_pseudo-part_le for pseudo-partitive nouns (e.g. montón (pile)), n_part_le for partitive nouns (e.g. mayoría (majority)) and n_group_le for group nouns (e.g. grupo (group)).

2) de-verbal nouns, which cover:

   - the type n_subj-nom_le for subject nominalizations (e.g. agresor (attacker)). Their syntactic argument is identified with the arg2. Lexical semantic information is given to subject nominalizations in the lexicon itself.

   - nouns derived from unaccusative verbs, which are typed either as n_event-result_intr_le, if they are intransitive (e.g. muerte (death)), or as n_event-result_intr_lcomp_le, if they take a locative complement (e.g. salto a/hacia (jump to/towards)). These types of nouns denote both events/processes and results (and get the lexical semantic type abs(tract)_pro(cess)), and they identify the syntactic argument with the arg2.

   - nouns denoting results derived from unergative verbs (e.g. gruñido (roar)) and intransitive verbs taking marked NPs (e.g. lucha contra (fight against)). These nouns are typed as n_result_intr_le and n_result_intr_ppcomp_le, respectively. Semantically, both classes of nouns are typed as abs(tract), and identify the first argument with arg1 and the second one with arg2. Marking prepositions are specified in the lexical entries.

   - nouns derived from transitive (or ditransitive) verbs denoting events/processes (e.g. construcción (construction), envío (dispatch)). These nouns are typed as n_trans_le. Semantically, they are typed as pro(cess), and identify the first argument with arg1 and the second one with arg2.

3) Non-derived argumental nouns, such as relational nouns (e.g. amigo (friend)), body parts (e.g. pierna (leg)), de-adjectival nouns (e.g. belleza (beauty), adicción a (addiction to)) and nouns derived from measure, psychological, inchoative and perception verbs (e.g. peso (weight), temor (fear)), are grouped together and distinguished according to the number and the category of the complements and countability features. Table 1 shows the subtypes we have defined for this class of nouns. The columns refer to the type name, the countability features (mass (f1), count (f2), uncount (f3)) and the subcategorized elements: de (of)-marked NPs (f4), NPs marked by prepositions other than de (f5), finite completive clauses (f6), infinitive clauses (f7) and interrogative clauses (f8). Lexical entries that belong to these types specify both their lexical semantic type and marking prepositions.

type                         f1  f2  f3  f4  f5  f6  f7  f8
n_ppde_count_le              -   +   -   +   -   -   -   -
n_ppde_uncount_le            -   -   +   +   -   -   -   -
n_ppde_mass_le               +   -   -   +   -   -   -   -
n_ppde_mass-or-count_le      +   +   -   +   -   -   -   -
n_cp_prop_count_le           -   +   -   -   -   +   -   -
n_cp_ques_count_le           -   +   -   -   -   -   -   +
n_ppde_ppcomp_count_le       -   +   -   +   +   -   -   -
n_ppde_ppcomp_uncount_le     -   -   +   +   +   -   -   -
n_ppde_ppcomp_mass_le        +   -   -   +   +   -   -   -
n_ppde_prop_fin_count_le     -   +   -   +   -   +   -   -
n_ppde_prop_fin_uncount_le   -   -   +   +   -   +   -   -
n_ppde_prop_inf_count_le     -   +   -   +   -   -   +   -
n_ppde_prop_inf_uncount_le   -   -   +   +   -   -   +   -
n_ppde_ques_count_le         -   +   -   +   -   -   -   +
n_ppde_ques_uncount_le       -   -   +   +   -   -   -   +

Table 1: Types for non-derived argumental common nouns.
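Read as data, each row of Table 1 is the set of features marked "+". The following Python sketch copies three rows of the table and looks up the types compatible with a given feature combination; the dictionary layout and the `types_with` helper are hypothetical conveniences for illustration, not part of the grammar.

```python
# Three rows of Table 1, stored as the set of features marked "+".
# f1 = mass, f2 = count, f3 = uncount, f4 = de-marked NP, f7 = infinitive clause.
TABLE1 = {
    "n_ppde_count_le": {"f2", "f4"},
    "n_ppde_mass-or-count_le": {"f1", "f2", "f4"},
    "n_ppde_prop_inf_uncount_le": {"f3", "f4", "f7"},
}


def types_with(*features: str) -> list:
    """Return the (sorted) types whose row has all the given features."""
    wanted = set(features)
    return sorted(t for t, row in TABLE1.items() if wanted <= row)


# Countable nouns taking a de-marked NP:
print(types_with("f2", "f4"))
# ['n_ppde_count_le', 'n_ppde_mass-or-count_le']
```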

The SRG has 35 types for common nouns and about 28,000 nominal entries.

3.2 Adjectives

All adjectival types inherit the information encoded in the type basic-adjective-lex, which we show in Figure 4. This type specifies that the value for HEAD is of type adj, the SUBJ-list is empty, and the feature MOD takes a non-empty list whose element is a nominal sign. The semantic index of the element in the MOD list is identified with the external argument of the adjective (#xarg). Finally, the basic-adjective-lex type assigns the basic-adj-relation type to adjectives.

basic-adjective-lex := uninflected-lexeme &
  [ SYNSEM.LOCAL [
      CAT [ HEAD adj & [ MOD < [ LOCAL [ CAT.HEAD noun,
                                         CONT.HOOK.INDEX #xarg ] ] > ],
            VAL.SUBJ < > ],
      CONT [ HOOK.XARG #xarg,
             RELS.LIST < basic-adj-relation & [ PRED basic_adj_rel ], ... > ] ] ].

Figure 4: Basic type for adjectives.

Then, adjectives in the SRG are cross-classified according to:

1) their position within the NP; i.e. whether they are pre and/or postmodifiers (e.g. el mero hecho (the simple fact) vs un chico listo (a clever guy));

2) whether they are predicative or non-predicative. Predicative adjectives are in turn distinguished on the basis of the copulative verb (ser or estar) they may co-occur with (e.g. ser listo (to be clever) vs estar listo para (to be ready for));

3) whether they are gradable or not. Gradable adjectives may be modified by intensifying adverbs (e.g. muy guapa (very pretty)) and may occur in comparative and measure constructions (e.g. más alto que Juan (taller than Juan), dos metros de largo (two meters long));

4) whether they are intersective (the property applies to the noun in its absolute sense, e.g. nieve blanca (white snow)) or scopal (the property only applies to the modified noun, e.g. excelente músico (excellent musician));

5) whether they are positive (e.g. bueno (good)), comparative (e.g. mejor (better)) or superlative (e.g. (el) mejor (best));

6) subcategorization, where we distinguish intransitive adjectives (e.g. guapa (pretty)), transitive adjectives taking marked NPs (e.g. harto de la situación (fed up with the situation)), adjectives taking finite completive clauses (e.g. contraria a que vengan (opposed to their coming)), adjectives taking interrogative clauses (e.g. seguro de si vendrán (sure whether they'll come)), control adjectives (e.g. capaz de hacerlo (capable of doing it)) and raising adjectives (e.g. difícil de tocar (difficult to play)).

Table 2 shows the types for adjectives in the SRG. The columns show the types and the values they take for: their position in the NP (f1), the copula verb with which they may co-occur (f2), whether they are gradable or not (f3), the type of modifier they are (f4; 's' for scopal, 'i' for intersective), their degree (f5; 'p' for positive, 'c' for comparative, 's' for superlative) and valence (f6); here, values are: 'i' (intransitive), 't' (transitive), 'cc' (completive clause), 'ic' (interrogative clause), 'sc' (subject control), 'oc' (object control), 'sr' (subject raising) and 'or' (object raising).

type                       f1        f2     f3  f4  f5  f6
a_adv_int_le               pre       none   -   s   p   i
a_adv_event_le             pre/post  none   -   s   p   i
a_rel_prd_le               post      ser    -   i   p   i
a_rel_nprd_intr_le         post      none   -   i   p   i
a_rel_nprd_trans_le        post      none   -   i   p   t
a_rel_nprd_prop_le         post      none   -   i   p   cc
a_rel_nprd_ques_le_le      post      none   -   i   p   ic
a_qual_intr_scopal_le      pre/post  ser    +   s   p   i
a_qual_intr_ser_le         pre/post  ser    +   i   p   i
a_qual_intr_ser_pstn_le    post      ser    +   i   p   i
a_qual_intr_estar_le       post      estar  +   i   p   i
a_qual_trans_ser_le        pre/post  ser    +   i   p   t
a_qual_trans_ser_pstn_le   post      ser    +   i   p   t
a_qual_trans_estar_le      post      estar  +   i   p   t
a_qual_prop_ser_le         pre/post  ser    +   i   p   cc
a_qual_prop_estar_le       post      estar  +   i   p   cc
a_qual_ques_ser_le         pre/post  ser    +   i   p   ic
a_qual_ques_estar_le       post      estar  +   i   p   ic
a_sr_le                    pre/post  ser    +   i   p   sr
a_sctrl_ser_le             pre/post  ser    +   i   p   sc
a_sctrl_estar_le           post      estar  +   i   p   sc
a_or_le                    pre/post  ser    +   i   p   or
a_octrl_le                 post      estar  +   i   p   oc
a_compar_le                pre       ser    +   i   c   t
a_super_le                 pre/post  both   +   i   s   i

Table 2: Some types of adjectives.

Optionality is encoded in the types, which means that all types for adjectives that take complements have been doubled. The marking preposition is specified in the lexical entries.

The SRG has 44 types for adjectives and about 11,200 adjectival entries.

3.3 Adverbs

Leaving aside closed classes of adverbs, i.e. deictic adverbs (e.g. aquí (here)), relative adverbs (e.g. donde (where)), interrogative adverbs (e.g. cómo (how)) and degree adverbs (e.g. casi (almost), más (more), ...), we distinguish two types of adverbs: scopal adverbs and intersective adverbs.

As we show in Figure 5, intersective adverbs identify their arg1 and the INDEX of the modified element, whereas scopal adverbs identify their own INDEX and that of the modified element. Scopal adverbs take the handle of the modified element as their argument, so that the modifier outscopes the head.

basic_intersective_adverb_lex := basic-adverb-lex &
  [ SYNSEM.LOCAL [
      CAT.HEAD.MOD < [ LOCAL intersective-mod & [ CONT.HOOK.INDEX #ind ] ] >,
      CONT.LKEYS.KEYREL.ARG1 #ind ] ].

basic_scopal_adverb_lex := basic-adverb-lex &
  [ SYNSEM.LOCAL [
      CAT.HEAD.MOD < [ LOCAL scopal-mod & [ CONT.HOOK [ LTOP #larg,
                                                        INDEX #index ] ] ] >,
      CONT [ HOOK.INDEX #index,
             HCONS <! qeq & [ HARG #harg, LARG #larg ] !> ],
      LKEYS.KEYREL.ARG1 #harg ] ].

Figure 5: Basic types for intersective and scopal adverbs.

Through their super-type basic-adverb-lex, as we show in Figure 6, both subtypes inherit information common to them, including the HEAD adv value, the empty-list values for both SUBJ and COMPS,⁶ and the identification of the external argument (XARG) of the adverb and that of the element within the MOD list (#xarg). The basic-adverb-lex type assigns the basic-adv-relation type to adverbs.

basic-adverb-lex := uninflected-lexeme &
  [ SYNSEM.LOCAL [
      CAT [ HEAD adv & [ MOD < [ LOCAL.CONT.HOOK.XARG #xarg ] > ],
            VAL [ SUBJ < >,
                  COMPS < > ] ],
      CONT [ HOOK.XARG #xarg,
             RELS.LIST < basic-adv-relation, ... > ] ] ].

Figure 6: Basic type for adverbs.

Scopal and intersective adverbs have subtypes specifying whether they may co-occur with degree adverbs (e.g. muy probablemente (very probably) vs *muy diariamente (very daily)) and the adverb placement (e.g. *no está en casa aparentemente ((he/she) is not at home apparently) vs sinceramente te digo/te digo sinceramente (frankly, I tell you)), giving the four subtypes we show in Table 3.

type                 ModType    G  Position
av_s_prhd_le         scopal     -  prehead
av_s_prhd_spec_le    scopal     +  prehead
av_i_psthd_le        intersect  -  posthead
av_i_psthd_spec_le   intersect  +  posthead

Table 3: Some types of adverbs.

In addition, we have: one type for scopal adverbs that only modify sentences (e.g. quizás (maybe)), and two types for focus intersective adverbs, which distinguish adverbs that may co-occur with degree adverbs (e.g. muy especialmente (very specially)) from those which may not (e.g. *muy solamente (very only)).

6 Adverbs taking complements, such as detrás de (behind) or antes de (before), are treated as multi-word constructions and they get the category preposition.

The SRG has 14 types for open classes of adverbs and about 4,000 entries of adverbs.

3.4 Main verbs

Figure 7 shows the basic-main-verb-lex type, the basic type for main verbs. This type specifies that the HEAD value of main verbs is of type verb and takes the negative value for the boolean feature AUX(ILIARY) and an empty list for MOD(IFIES), and it identifies the HEAD.TAM (tense, aspect and mood) feature with the semantic INDEX.E(VENT) (#tam). Main verbs also take an empty list as value for SPR and introduce an event semantic relation in the RELS-list.

basic-main-verb-lex := uninflected-lexeme &
  [ SYNSEM.LOCAL [
      CAT [ HEAD verb & [ AUX -,
                          MOD < >,
                          TAM #tam ],
            VAL.SPR < > ],
      CONT [ HOOK.INDEX event & [ E #tam ],
             RELS.LIST < event-relation & [ PRED v_event_rel ], ... > ] ] ].

Figure 7: Basic type for main verbs.

Types for main verbs are first distinguished on the value for the SUBJ-list. Thus, we have subtypes for impersonal verbs taking an empty SUBJ-list, verbs taking a verbal subject and verbs taking a nominal subject. Then, each type is sub-typed according to the value of the COMPS-list; i.e. the number and category of elements in the COMPS-list. Also, we distinguish different types of verbs according to: 1) the lexical semantic relation type in the RELS-list; thus, for instance, intransitive verbs are classified either as unaccusative verbs, whose subject is identified with the arg2 (e.g. morir (to die)), or as unergative verbs, whose subject is identified with the arg1 (e.g. nadar (to swim)); 2) the verb form (finite or infinitive), mood (indicative or subjunctive) and control relation of verbal complements; 3) valence changing processes they may undergo. Optionality is encoded in the types, which means that all types dealing with optional complements have been doubled. We also have types for pronominal verbs. Semantic lexical restrictions on syntactic arguments and marking prepositions are given in the lexicon itself.

The SRG has 170 types for main verbs and about 6,600 entries for verbs. Table 4 shows the most relevant types of verbs. The columns show the types and valence information: the category of the subject (f1; n(ominal), v(erbal), -(no subject)) and the complements they take: direct object (f2), indirect object (f3), finite completive clause (f4), infinitive (f5), interrogative clause (f6), locative complement (f7), prepositional complement (f8), marked completive clause (f9), marked infinitive (f10) and marked interrogative clause (f11).

type                        f1  f2  f3  f4  f5  f6  f7  f8  f9  f10  f11
iv_strict_intr_le           -   -   -   -   -   -   -   -   -   -    -
iv_non_pass_np_le           -   +   -   -   -   -   -   -   -   -    -
iv_cp_prop_le               -   -   -   +   +   -   -   -   -   -    -
iv_subj_prop_unacc_le       v   -   -   -   -   -   -   -   -   -    -
v_subj_prop_intr_io_le      v   -   +   -   -   -   -   -   -   -    -
v_subj_prop_intr_mrkd_np    v   -   -   -   -   -   -   +   -   -    -
v_subj_prop_trans_np_le     v   +   -   -   -   -   -   -   -   -    -
v_subj_prop_trans_prop_le   v   -   -   +   -   -   -   -   -   -    -
v_unacc_le                  n   -   -   -   -   -   -   -   -   -    -
v_unacc_lcomp_le            n   -   -   -   -   -   +   -   -   -    -
v_intr_le                   n   -   -   -   -   -   -   -   -   -    -
v_intr_mrkd_np_le           n   -   -   -   -   -   -   +   -   -    -
v_intr_mrkd_vinf_le         n   -   -   -   -   -   -   -   -   +    -
v_intr_mrkd_prop_fin_le     n   -   -   -   -   -   -   -   +   -    -
v_intr_mrkd_ques_le         n   -   -   -   -   -   -   -   -   -    +
v_intr_io_le                n   -   +   -   -   -   -   -   -   -    -
v_trans_np_le               n   +   -   -   -   -   -   -   -   -    -
v_trans_np_mrkd_np_le       n   +   -   -   -   -   -   +   -   -    -
v_trans_np_mrkd_vinf_le     n   +   -   -   -   -   -   -   -   +    -
v_trans_np_mrkd_prop_fin    n   +   -   -   -   -   -   -   +   -    -
v_trans_np_lcomp_le         n   +   -   -   -   -   +   -   -   -    -
v_ditrans_le                n   +   +   -   -   -   -   -   -   -    -
v_trans_prop_fin_le         n   +   -   +   -   -   -   -   -   -    -
v_sctrl_le                  n   +   -   -   +   -   -   -   -   -    -
v_trans_ques_le             n   +   -   -   -   +   -   -   -   -    -
v_ditrans_prop_fin_le       n   -   +   +   -   -   -   -   -   -    -
v_ditrans_vinf_le           n   -   +   -   +   -   -   -   -   -    -
v_ditrans_ques              n   -   +   -   -   +   -   -   -   -    -
v_osr_le                    n   +   -   -   -   -   -   -   -   -    -

Table 4: Some types of main verbs.

References

Emily M. Bender, Dan Flickinger and S. Oepen. 2002. The Grammar Matrix. An open-source starter-kit for the rapid development of cross-linguistically consistent broad-coverage precision grammars. In Proceedings of the Workshop on Grammar Engineering and Evaluation at the 19th International Conference on Computational Linguistics. Taipei, Taiwan.

Ulrich Callmeier. 2000. Pet – a platform for experimentation with efficient HPSG processing. Journal of Natural Language Engineering 6(1): Special Issue on Efficient Processing with HPSG: Methods, System, Evaluation, pages 99-108.

Ann Copestake, Dan Flickinger, Carl Pollard and Ivan A. Sag. 2006. Minimal Recursion Semantics: An Introduction. Research on Language and Computation 3.4:281-332.

Ann Copestake. 2002. Implementing Typed Feature Structure Grammars. CSLI Publications.

Stephan Oepen and John Carroll. 2000. Performance Profiling for Parser Engineering. Journal of Natural Language Engineering 6(1): Special Issue on Efficient Processing with HPSG: Methods, System, Evaluation, pages 81-97.

Carl J. Pollard and Ivan A. Sag. 1994. Head-driven Phrase Structure Grammar. The University of Chicago Press, Chicago.



Towards Quantitative Concept Analysis

Rogelio Nazar
Jorge Vivaldi
Leo Wanner

Institut Universitari de Lingüística Aplicada
Universitat Pompeu Fabra
Pl. de la Mercè 10-12
08002 Barcelona

ICREA and Dept. de Tecnologías de la Información y las Comunicaciones
Universitat Pompeu Fabra
Passeig de Circumval·lació 8
08003 Barcelona

Abstract: In this paper, we present an approach to the automatic extraction of conceptual structures from unorganized collections of documents using large scale lexical regularities in text. The technique maps a term to a constellation of other terms that captures the essential meaning of the term in question. The methodology is language independent. It involves an exploration of a document collection in which the initial term occurs (e.g., the collection returned by a search engine when queried with this term) and the building of a network in which each node is assigned to a term. The weights of the connections between nodes are strengthened each time the terms that these nodes represent appear together in a context of a predefined length. Possible applications are automatic concept map generation, terminology extraction, term retrieval, term translation, term localization, etc. The system is currently under development, although preliminary experiments show promising results.

Keywords: Corpus Linguistics; Concept Map Generation; Term Retrieval

Resumen: En este trabajo presentamos una aproximación a la extracción automática de estructuras conceptuales a partir de colecciones desordenadas de documentos, aprovechando regularidades léxicas a gran escala en los textos. Es una técnica para asociar un término con una constelación de otros términos que refleje lo esencial del significado. La metodología es independiente de la lengua. Se explora una colección de documentos donde el término inicial aparece (como la colección que devuelve un motor de búsqueda con esa interrogación) y se construye una red en la que cada nodo es asignado a un término. La ponderación de las conexiones entre nodos se incrementa cuando los términos que representan aparecen juntos en un contexto de extensión predefinida. Posibles aplicaciones son la generación automática de mapas conceptuales, la extracción de terminología, la recuperación de términos, su traducción, localización, etc. El sistema se encuentra actualmente en desarrollo, sin embargo experimentos preliminares muestran resultados prometedores.

Palabras clave: Lingüística de corpus; Generación de mapas conceptuales; Recuperación de términos

1 Introduction

In this paper, we describe a technique that, starting from a query term provided by the user and a document collection, generates a network of terms conceptually related to that query term. The resulting network is assumed to reflect the most pertinent information found in the collection in relation to the query term.

We call such networks concept maps since, in accordance with the relational paradigm of lexical memory (see, e.g., Miller, 1995), we presuppose that the meaning of a term (i.e., a concept) is given by all relevant relations that hold between this term and other terms – with the totality of these relations resulting in what is commonly known as a map.

The generation of the conceptual maps in our algorithm is guided by quantitative means. More precisely, it is based on the most recurrent combination patterns among terms in a given document collection.¹

The work presented here differs in both its theoretical assumption and its objective from the ontology generation field (cf. Buitelaar et al., 2005 for an overview). Our work is not ontological because we are not interested in what something IS. Rather, we are interested in what people usually SAY about something. We extract a synthesis of people's perception of a topic from a whole set of documents rather than information from individual sources. Furthermore, we analyze how concepts evolve in real time as a result of massive amounts of statements disseminated via the web. This is knowledge whose evolution is based on the same mechanism as self-organized complex systems.

Our intuition is that this is also how common knowledge is being developed. For instance, common knowledge tells us that alchemists wanted to transmute metals into gold. And it turns out that the word alchemist has a strong statistical association with words such as transmute and trigrams such as metals into gold. The present work is therefore less related to Artificial Intelligence (AI) than it is to linguistics. In fact, it is an example of "artificial"-AI, because it relies on social networks and the unconscious collaborative work of a collective of authors.

The remainder of the paper is structured as follows. In the next section, we present the hypothesis underlying our work. Section 3 outlines the methodology we adopt, and Section 4 illustrates our proposal by a couple of examples. In Section 5, a short overview of the related work is given, and in Section 6 some conclusions and directions for future work are drawn.

2 Hypothesis

The question underlying our work is: How is it possible to distinguish relevant information from irrelevant information with respect to a given specific term? In particular, how is it possible to make this distinction by means of a formal prediction instead of subjective or arbitrary judgment? From our point of view, this is possible through the study of large-scale regularities in the lexical organization of the discourse.

1 Henceforth, we use the terms "term" and "lexical unit" as equivalents in this paper.

Adopting the relational paradigm of the structure of lexical memory (see above) and assuming that the recurrent context of a term reflects the comprehension of this term by the speakers, we draw upon frequency distribution as the decisive means for the construction of a conceptual map. Further theoretical evidence supports the idea of systematic redundancy in the surrounding context of a term. Following Eco (1981) we assume that textual devices such as appositions, paraphrases or coreferences let the writer mention attributes of a referent without compromising assumptions on the knowledge of the reader. The writer has a model reader, an idea about what the reader may or may not already know. Consider an example:

(1) This is an image of Napoleon Bonaparte, Emperor of the French and King of Italy, looking unamused at...

(1) shows the use of an apposition that is equivalent to the plain proposition:

(2) Napoleon Bonaparte was the Emperor of the French and King of Italy.

There are myriads of utterances about Napoleon, all different at the surface, but there is also a space of convergence, which we perceive as patterns of recurrent key terms – including those that appear in (2). Thus, in the list of the most frequent terms that occurred on the web in the context of Napoleon on May 3rd, 2007, we encounter, among others: emperor, France, Bonaparte, invasion, Russia, king, Italy, … French, …

These units roughly follow a Zipfian distribution: only a relatively small number of them show a significant cooccurrence, and this is why we can apply statistics to grasp them.
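The skew described here can be inspected on any sample of context words by tabulating frequency against rank. The following Python sketch does so on a toy token list that is invented for illustration only (it is not data from the experiments); the `rank_frequency` helper is likewise a hypothetical name.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, term, frequency) triples, most frequent term first."""
    counts = Counter(tokens)
    return [(rank, term, freq)
            for rank, (term, freq) in enumerate(counts.most_common(), start=1)]


# Toy context words, invented for illustration only.
tokens = ("emperor emperor emperor emperor france france bonaparte "
          "russia invasion italy").split()
table = rank_frequency(tokens)
for rank, term, freq in table:
    print(rank, term, freq)
```

In a Zipfian sample, a few top-ranked terms account for most occurrences while the long tail of terms appears only once or twice, which is what makes a frequency threshold an effective filter.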

3 Algorithm

In this paper we propose an algorithm that accepts a term as input and uses it as query for an off-the-shelf search engine. From the document list retrieved by the engine, a parameterizable number of documents is downloaded. From these documents, the algorithm builds a conceptual map for that query. A vocabulary selection is performed and only the chosen units are considered during the map construction. The overall process consists of five major steps:

A. Extraction of the contexts of the occurrence of the query term in the document collection. The contexts consist of a parameterizable number of words (15 by default) to the left and to the right of the term (we are not interested in sentence boundary detection since semantic association transcends it).

B. Compilation of an index from the extracted contexts. In addition to single tokens, the index includes a list of bigrams and trigrams, henceforth, n-grams (n = 2, 3). From this index, items that begin or end with a member of a stopword list are excluded. This stoplist contains punctuation marks, hyphens, brackets, functional (i.e., closed class) words and optionally numbers. It was extracted from the first hundred positions in the list of word frequencies of nine languages obtained from Quasthoff et al. (2005).

C. Merging of different word forms considered to be similar.² This procedure identifies inflectional variations (e.g., animals and animal) and reduces them to the same word (namely, the most frequent form among the variations). Similarity is computed as a Dice coefficient with character trigrams as features, and only if both variants have the first trigram in common.

D. Elimination of irrelevant terms from the index. Further reduction of the vocabulary is executed by removing terms and n-grams of a frequency below a predefined threshold (usually 4 or 5). Also, terms that appear in only one document are eliminated. The rest is filtered using statistical measures such as Mutual Information (MI), t-score, and chi-square. The threshold score for the association is another parameter, but by default it is automatically adjusted to meet the best conditions. The expected probability of the occurrence of words has been extracted from the model of Quasthoff et al. (2005), which does not include data for low frequency words (f<6). As a result, if a term is not listed there, it is treated as if it were listed with the minimum frequency.

E. Construction of the conceptual map. The algorithm reads all contexts of the query term and, if the terms encountered in these contexts are in the selected vocabulary, each of them is assigned to a unique node in the network. The connections between these nodes are strengthened each time the terms associated with the nodes appear in a context. Every time an edge is stimulated, the rest are weakened. As the learning progresses, the weight of a node is weakened if it was assigned a particular term at the beginning but found no significant connections with neighbors afterwards. At the end of the learning process, the most interconnected nodes are key terms related to the meaning(s) of the query term. The nodes also have references to the original documents and contexts where their terms occur. The final number of nodes is determined by an initial parameter, and several pruning passes may be conducted to reduce nodes until this number is reached.

2 Note that we do not use lemmatization and POS-tagging. We were interested in measuring accuracy without this type of resources.
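Steps A, B, C and E above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, step D (statistical filtering) is omitted, the network is built over single tokens only, and the multiplicative `decay` is one assumed way to realize "the rest are weakened", since the text does not give the update rule.

```python
from collections import Counter
from itertools import combinations


def extract_contexts(tokens, query, window=15):
    """Step A: the `window` tokens on each side of every query occurrence.
    Left and right sides are concatenated, so n-grams may span the query."""
    contexts = []
    for i, tok in enumerate(tokens):
        if tok == query:
            contexts.append(tokens[max(0, i - window):i]
                            + tokens[i + 1:i + 1 + window])
    return contexts


def index_ngrams(contexts, stoplist, max_n=3):
    """Step B: count 1- to 3-grams that neither begin nor end with a stopword."""
    index = Counter()
    for ctx in contexts:
        for n in range(1, max_n + 1):
            for i in range(len(ctx) - n + 1):
                gram = ctx[i:i + n]
                if gram[0] not in stoplist and gram[-1] not in stoplist:
                    index[" ".join(gram)] += 1
    return index


def dice(a, b):
    """Step C: Dice coefficient over character trigrams; 0.0 unless both
    words have at least three characters and share their first trigram."""
    if len(a) < 3 or len(b) < 3 or a[:3] != b[:3]:
        return 0.0
    ta = {a[i:i + 3] for i in range(len(a) - 2)}
    tb = {b[i:i + 3] for i in range(len(b) - 2)}
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def build_network(contexts, vocabulary, decay=0.99):
    """Step E: strengthen the edge between two vocabulary terms whenever they
    share a context; every existing edge decays first (assumed update rule)."""
    weights = {}
    for ctx in contexts:
        for edge in weights:
            weights[edge] *= decay
        present = sorted(set(ctx) & vocabulary)
        for edge in combinations(present, 2):
            weights[edge] = weights.get(edge, 0.0) + 1.0
    return weights


# Toy text, invented for illustration only.
text = ("the emperor napoleon invaded russia and "
        "the french emperor napoleon ruled italy").split()
contexts = extract_contexts(text, "napoleon", window=3)
index = index_ngrams(contexts, stoplist={"the", "and"})
network = build_network(contexts,
                        vocabulary={"emperor", "russia", "french", "italy"})
print(index.most_common(3))
print(network)
```

Word forms whose `dice` score exceeds some threshold would then be merged into the most frequent variant before the network is built; the threshold itself is not stated in the text.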

4 Preliminary Results A few experiments with this algorithm showed that it performs as expected. Currently, we are about to carry out a more extensive and formal evaluation that will allow us to provide exact figures of accuracy.

To give the reader an overview of the algorithm’s potential and applicability, we briefly illustrate its performance in a few applications.

4.1 Concept Mapping

The most basic application is to obtain a map of terms conceptually related to the given query term. The terms captured in the network of the query term DUCKBILL PLATYPUS (Figure 1) are precisely its salient attributes: ornithorhynchus anatinus; fur; swimming animal; unique species; mammal; lay eggs; spiny anteaters; etc.

Figure 1: Network for DUCKBILL PLATYPUS



Note that the network contains most of the terms needed for the generation of the lexicographic definition for DUCKBILL PLATYPUS:

(3) Duckbill platypus: ornithorhynchus anatinus, furred swimming animal, unique species of mammals that lay eggs, along with the spiny anteaters.

4.2 Term Disambiguation

Given a polysemous term as a query, the network shows clustering effects for each sense. For instance, with the Spanish word HENO (hay), different clusters are visible. Figure 2 shows a fragment of this network.

At the left hand side there is one cluster about a pathology, well differentiated from the rest, which are about the use of hay in farming.

A similar clustering effect occurs with respect to VIRUS in its biological sense contrasted to the malicious code interpretation; PASCAL as person and as programming language; NLP as acronym for Natural Language Processing and as acronym for Neuro-Linguistic Programming, and so on.

4.3 Term Translation

A quite different application of the proposed technique is to obtain the translation of a given query term into another language. Let us assume that DUCKBILL PLATYPUS is a term not yet available in our bilingual dictionary.

The network that our algorithm produces for this entry includes frequent words that can be considered basic vocabulary (e.g., mammal: mamífero, swimming animal: animal acuático, eggs: huevos). A new search with these translations, this time on the Spanish web, yields ornitorrinco as the term with the most significant MI score. Applying the same strategy, we found the Spanish equivalent of West Nile Virus. Taking this term first as the query term on the English web, we obtain words that are easy to translate, such as mosquito, horse, infection, and transmitted. In a second search that uses the Spanish translations of these terms, the term virus del Nilo Occidental emerges. Analogously, we obtained model reader, in the context of semiotics, as the translation equivalent of the Spanish lector modelo, and receiver as the equivalent of Sp. destino in the context of communication theory (Figure 3).
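The two-step search described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the stopword list and the toy documents are hypothetical, and raw co-occurrence counts stand in for the MI score used in the paper.

```python
from collections import Counter

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"el", "la", "es", "un", "que", "en"}

def rank_translation_candidates(collocates, bilingual_dict, target_docs):
    """Rank target-language words that co-occur with translated collocates."""
    # Step 1: translate the collocates that the dictionary already covers.
    pivots = {bilingual_dict[c] for c in collocates if c in bilingual_dict}
    scores = Counter()
    # Step 2: search the target-language documents with those translations.
    for doc in target_docs:
        tokens = set(doc.lower().split())
        hits = len(pivots & tokens)
        if hits == 0:
            continue
        # Every other content word in the document is a candidate;
        # documents matching more pivots contribute more weight.
        for token in tokens - pivots - STOPWORDS:
            scores[token] += hits
    return scores.most_common()

# Toy data (assumed): collocates of the English query and a tiny Spanish "web".
collocates = ["mammal", "eggs", "furred"]
bilingual_dict = {"mammal": "mamífero", "eggs": "huevos"}
target_docs = [
    "el ornitorrinco es un mamífero que pone huevos",
    "el ornitorrinco es un mamífero acuático",
    "la gallina vive en la granja",
]
ranking = rank_translation_candidates(collocates, bilingual_dict, target_docs)
print(ranking[0][0])
```

The untranslated query term never enters the second search; only its easily translated collocates do, which is why no parallel corpus is needed.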

Figure 3: Network for SOURCE to find RECEIVER

Figure 2: Network for HENO

4.4 Term Localization

The same strategy applies to localization. Let us assume that a Spaniard wants to know the equivalent of aguacate (avocado) in Argentinean Spanish. Searching for AGUACATE, he or she will obtain the term persea americana as one of the most significant collocates. A second search with persea americana in combination with the words nombre (name) and Argentina suggests palta as the most obvious candidate (we can discard spp as a possible translation). Cf. Table 1 for the frequency rank.

Freq. rank   Term
1            aguacate
2            spp
3            palta
4            nombre
5            méxico
6            lauraceae
7            familia
8            argentina
...          ...

Table 1: Collocates of PERSEA AMERICANA - NOMBRE - ARGENTINA

(4) is a typical sentence encountered in the retrieved document collection:

(4) La palta, cuyo nombre científico es persea americana, es de la familia de las Laureáceas, tiene su origen en México, ...

4.5 Term retrieval

We also tested the algorithm for term retrieval, which addresses the well-known “tip-of-the-tongue” phenomenon: speakers often forget a term but still perfectly recall the purpose of the underlying concept or even the definition of the term in question.

Let us assume that a speaker searches for the name of the catalyst that helps to break down starch into glucose. Taking CATALYST and GLUCOSE as query terms, the user obtains a network that suggests that enzyme is a frequent collocate of both (Table 2).

MI rank   Term
1         acid
2         catalytic
3         enzyme
4         hydrogen
5         oxide
...       ...

Table 2: Collocates of CATALYST and GLUCOSE
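One way to picture the retrieval step is as an intersection of the collocate lists of the query terms. The sketch below assumes that each term already has a table of collocates with association scores; the function name and all scores are invented for illustration and are not the paper's data.

```python
def shared_collocates(collocates_by_term, query_terms):
    """Return collocates common to every query term, best combined score first."""
    common = set.intersection(*(set(collocates_by_term[t]) for t in query_terms))
    return sorted(
        common,
        key=lambda c: -sum(collocates_by_term[t][c] for t in query_terms),
    )

# Hypothetical association scores (not taken from the paper).
collocates_by_term = {
    "catalyst": {"acid": 6.0, "enzyme": 5.1, "platinum": 4.2},
    "glucose": {"acid": 5.0, "enzyme": 5.5, "insulin": 6.3},
}
candidates = shared_collocates(collocates_by_term, ["catalyst", "glucose"])
print(candidates)
```

Terms tied to only one of the query terms (platinum, insulin) drop out, leaving a short list from which the user can recognize the forgotten word.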

5 Preliminary Evaluation

Of all the envisaged tasks mentioned in the previous section, we are particularly interested in bilingual lexicon extraction because, despite its bilingual nature, it does not require parallel corpora. Given an entry in a source language, the system returns a ranked list of candidates for translation in a target language.

Since the tool is intended for translators, we do not worry if the correct translation is not the first candidate, because a user, with his or her knowledge, may choose an appropriate translation from a short list. It is easier to recognize a word than to remember it and, even if it is a word the user did not know before, he or she may notice morphological similarities as a clue in the case of cognates.

We thus conducted a preliminary evaluation, only to estimate overall accuracy, with a multilingual database of names of birds (Scory, 1997). We took a random sample of 25 entries from a total of 700 and entered the English names of the birds one by one to obtain, with our method, a list of the best candidates for translation in Spanish. The procedure is simple: it takes the best collocate of the query and repeats the search with it in the Spanish corpus. We checked whether the translation provided by the database was among the first three candidates in the list proposed by the system, and on that basis we determined the success or failure of the trial.
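The success criterion just described (reference translation among the first three candidates) can be written as a small scoring function. This is a minimal sketch: the function name is hypothetical, and the candidate lists are toy data loosely modeled on a few rows of Table 3, with an assumed ordering.

```python
def top_k_accuracy(trials, k=3):
    """Fraction of trials whose reference appears among the first k candidates."""
    hits = sum(1 for reference, candidates in trials if reference in candidates[:k])
    return hits / len(trials)

# Each trial: (reference translation from the database, ranked system candidates).
trials = [
    ("reyezuelo listado", ["reyezuelo listado"]),
    ("lavandera de yarrell", ["lavandera blanca"]),
    ("paloma torcaz", ["paloma torcaz"]),
    ("eider real", []),  # the system returned no usable candidate
]
accuracy = top_k_accuracy(trials)
print(f"{accuracy:.0%}")
```

With 25 trials, as in the sample used here, each trial accounts for 4 percentage points of the score.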

The study showed 72% agreement with the database. However, if we count the non-normative terms as correct (they can be adequate in some contexts), precision rises to 84%. Most often, failure was due to insufficient data: some of the species are very rare, and it is hard to find documents in Spanish about them. In some of the failed trials the correct candidate was too low in the list returned by the system, or was not present at all. Table 3 shows the results of the experiment. The first and second columns show the English and Spanish names provided by the database, and the third column shows the translation proposed by our method.



Scory's English names:     Scory's Spanish names:      Our method:
firecrest                  reyezuelo listado           reyezuelo listado
brent goose                barnacla carinegra          barnacla de cara negra; ganso de collar
curlew sandpiper           correlimos zarapitín        correlimos zarapitín
long-tailed duck           havelda                     pato havelda
short-eared owl            lechuza campestre           lechuza campestre; búho campestre
song thrush                zorzal común                zorzal común
pied wagtail               lavandera de yarrell        lavandera blanca
chaffinch                  pinzón del hierro           pinzón vulgar; pinzón común
stock dove                 paloma zurita               paloma zurita
montagu's harrier          aguilucho cenizo            aguilucho cenizo
oystercatcher              ostrero                     ostrero
white's thrush             zorzal                      zorzal
short-toed lark            terrera común               terrera común
kentish plover             chorlitejo patinegro        chorlitejo patinegro
twite                      pardillo piquigualdo        pardillo piquigualdo
wood pigeon                paloma torcaz               paloma torcaz
semi-collared flycatcher   papamoscas semicollarino    papamoscas semicollarino
coot                       focha común                 focha americana; gallareta americana
elegant tern               charrán elegante            charrán elegante
black-necked grebe         zampullín cuellinegro       zampullín cuellinegro
brown thrasher             sinsonte castaño            sinsonte
king eider                 eider real                  -
sombre tit                 carbonero lugubre           -
blyth's pipit              bisbita de blyth            -
lanceolated warbler        buscarla lanceolada         -

Table 3: Evaluation of the results

Scory's database is incomplete, and we were able to find some missing names as well as other variants from the different varieties of Spanish, a geographically widespread language. For example, Booted Eagle can be águila calzada or aguililla calzada; the Northern Oriole should be Ictérido anaranjado, but the variant turpial norteño is also used; the same holds for Dark-eyed Junco, which should be translated as Cingolo pizarroso but is called junco ojioscuro in some variants of Spanish. Grey-tailed Tattler is translated as Archibebe gris, but we found playero de siberia (in French it is Chevalier de Sibérie). This term variation is a problem for measuring precision, because we are then evaluating not only the performance of the algorithm but also the difference between normative terminology and real use.

6 Related Work

Many works represent the meaning of a term as a network of interdependent nodes labeled by terms and related by edges labeled by predicates. This is the idea behind Concept Maps (Novak and Cañas, 2006), Topic Maps (Rath, 1999; Park and Hunting, 2003) and the Semantic Web (Shadbolt et al., 2006), among others. Other formalisms, such as semantic networks, may be used to represent concepts and their relationships. A lexical database such as WordNet (Fellbaum, 1998) is a well-known example.

Given the popularity of a search engine such as Kartoo.com (Baleydier and Baleydier, 2006), of VisualThesaurus.com (Thinkmap Inc., 2004), of a graphical version of Google (Shapiro, 2001), and of a variety of other similar representations (Dodge, 2004; Lima, 2005), the idea of a conceptual structure as a net of interdependent nodes is already part of society's visual imagery. All these representations share the goal of transforming knowledge serially encoded in text into a topographic structure.

The work related to the automatic generation of conceptual structures involves two fields: term extraction and conceptual relation extraction. For the former, there are several techniques not mentioned in this paper (see Vivaldi, 2001, for an overview). For the latter, there is also a large body of work.

It is possible to extract semantic relations by searching for sentential patterns that provide evidence that the relation Z holds between the units X and Y. For example, when X is a hyponym of Y, common patterns of this type are <X> is a type of <Y>, <Y> such as <X>, or <W>, <X>, and other <Y>. It is also possible to infer taxonomies from patterns of term variation, for example by the inference that artificial intelligence is a kind of intelligence. Many authors advocate a symbolic approach of this kind; cf., among others, (Hearst, 1992; Godby et al., 1999; Sowa, 2000; Popping, 2000; Ibekwe-SanJuan and SanJuan, 2004).
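As an illustration of the pattern-based approach, the three patterns quoted above can be approximated with regular expressions. This is a simplified sketch, not the method of any cited author: the pattern list and function name are assumptions, and only single-word terms are handled.

```python
import re

# Each entry: (compiled pattern, hyponym group index, hypernym group index).
PATTERNS = [
    (re.compile(r"\b(\w+) is a type of (\w+)"), 1, 2),
    (re.compile(r"\b(\w+) such as (\w+)"), 2, 1),
    (re.compile(r"\b(\w+),? and other (\w+)"), 1, 2),
]

def extract_hyponym_pairs(sentence):
    """Return (hyponym, hypernym) pairs suggested by the patterns."""
    pairs = []
    lowered = sentence.lower()
    for pattern, hypo, hyper in PATTERNS:
        for match in pattern.finditer(lowered):
            pairs.append((match.group(hypo), match.group(hyper)))
    return pairs

print(extract_hyponym_pairs("A platypus is a type of mammal."))
print(extract_hyponym_pairs("They studied mammals such as whales."))
print(extract_hyponym_pairs("She keeps dogs, cats, and other pets."))
```

A realistic extractor would need multiword terms and syntactic filtering; the point here is only that each pattern fixes which side of the match is the hyponym.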

A different strand uses statistical methods to extract associations between terms. Studies of syntagmatic cooccurrence for collocation extraction include Church and Hanks (1991); Evert (2004); Kilgarriff et al. (2004); Wanner et al. (2006); among others. Studies of paradigmatic similarity based on vector comparison include Grefenstette (1994); Schütze and Pedersen (1997); Curran (2004). These studies are based on the distributional hypothesis that similar words appear in similar contexts. Studies on graphs drawn from cooccurrence data include Phillips (1985); Williams (1998); Magnusson and Vanharanta (2003); Böhm et al. (2004); Widdows (2004) and Veronis (2004). Graphs are an efficient method for tasks such as word disambiguation: by detecting hubs in the graphs, word senses can be determined in a text collection without resorting to dictionaries.

7 Conclusions and future work

We have presented a technique for analyzing concepts and their relations from a purely statistical point of view, without direct human judgment or any compiled knowledge of the domain or the language. As a useful metaphor, what we do is take a picture of the meaning of a term. However, it is also an explanatory model, as it proposes a reason why the technique works, and it is predictive, as it generalizes to different contexts and languages.

We contribute to the studies on word cooccurrence in several areas. Contrary to cited authors, our approach is language independent. In addition, we use it for concept map generation and a variety of new applications. We also extend it to experimentation with multilingual corpora.

The work offers prospective engineering applications, but it is also a study of terminology in itself, of the behavior of terms, and not of the terminology of a specific language or domain. It therefore remains within the scope of linguistics.

Future work will evolve in several directions. Foremost, an extensive evaluation is planned. At present we are about to evaluate our technique with an algorithm that automatically loops through all the records of the bird database and compares them with the translations provided by our system. This will yield better estimates. We also plan to evaluate the concept maps obtained from the queries with expert users from different areas. Another direction of improvement is a 3D interactive and navigable model of the concept maps, since the 2D model entails visualization difficulties. Finally, a web-based version of the prototype implementation of the technique will soon be made freely available.

Acknowledgments

We would like to thank the anonymous reviewers for their constructive comments. This paper was supported by the ADQUA scholarship granted to the first author by the Government of Catalonia, Spain, according to the resolution UNI/772/2003.

8 References

Baleydier, L. and N. Baleydier. 2006. Introducing Kartoo. KARTOO SA. http://www.kartoo.net/e/eng/doc/introducing_kartoo.pdf [accessed April 2007].

Böhm, K., L. Maicher, H. Witschel and A. Carradori. 2004. Moving Topic Maps to Mainstream - Integration of Topic Map Generation in the User's Working Environment. In J.UCS, Proceedings of I-KNOW'04, 241-251.

Buitelaar, P., P. Cimiano and B. Magnini. 2005. Ontology Learning from Text: An Overview. In Buitelaar, Cimiano and Magnini (Eds.), Ontology Learning from Text: Methods, Applications and Evaluation, 3-12. IOS Press.

Church, K. and P. Hanks. 1991. Word Association Norms, Mutual Information and Lexicography. Computational Linguistics, 16(1):22-29.

Curran, J. 2004. From Distributional to Semantic Similarity. PhD thesis, University of Edinburgh.

Dodge, M. 2007. An Atlas of Cyberspaces: Topology of Maps of Elements of Cyberspace. http://www.cybergeography.org/atlas/topology.html [accessed April 2007].

Eco, U. 1981. Lector in fabula: la cooperación interpretativa en el texto narrativo. Barcelona, Lumen.

Evert, S. 2004. The Statistics of Word Cooccurrences. PhD thesis, IMS, University of Stuttgart.

Fellbaum, C. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Godby, C., E. Miller and R. Reighart. 1999. Automatically Generated Topic Maps of World Wide Web Resources. OCLC Library.

Grefenstette, G. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, Norwell, MA.

Hearst, M. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the Fourteenth International Conference on Computational Linguistics.

Ibekwe-SanJuan, F. and E. SanJuan. 2004. Mapping the structure of research topics through term variant clustering: the TermWatch system. In JADT 2004: 7es Journées internationales d'Analyse statistique des Données Textuelles.

Kilgarriff, A., P. Rychly, P. Smrz and D. Tugwell. 2004. The Sketch Engine. In Proceedings of EURALEX 2004, Lorient, France.

Lima, M. 2005. Visualcomplexity. http://www.visualcomplexity.com/vc/ [accessed June 2007].

Magnusson, C. and H. Vanharanta. 2003. Visualizing Sequences of Texts Using Collocational Networks. In P. Perner and A. Rosenfeld (Eds.), 276-283. Springer-Verlag, Berlin, Heidelberg.

Miller, G.A. 1995. Virtual meaning. In Gothenburg Papers in Theoretical Linguistics 75:3-61.

Novak, J. and A. J. Cañas. 2006. The Theory Underlying Concept Maps and How To Construct Them. Technical Report IHMC CmapTools 2006-01, Florida Institute for Human and Machine Cognition.

Park, J. and S. Hunting. 2003. XML Topic Maps: creating and using topic maps for the Web. Boston, Addison-Wesley.

Phillips, M. 1985. Aspects of Text Structure: An Investigation of the Lexical Organization of Text. North-Holland, Amsterdam.

Popping, R. 2000. Computer-assisted Text Analysis. London, Sage.

Quasthoff, U., M. Richter and C. Biemann. 2006. Corpus portal for search in monolingual corpora. In Proceedings of LREC 2006, Genoa, Italy.

Rath, H. 1999. Technical Issues on Topic Maps. STEP Electronic Publishing Solutions GmbH.

Schütze, H. and J. Pedersen. 1997. A cooccurrence-based thesaurus and two applications to information retrieval. Information Processing and Management, 33(3):307-318.

Scory, S. 1997. Bird Names, A Translation Index. Management Unit of the North Sea Mathematical Models and the Scheldt estuary, Royal Belgian Institute of Natural Sciences (RBINS). http://www.mumm.ac.be/~serge/birds/ [accessed June 2007].

Shadbolt, N., T. Berners-Lee and W. Hall. 2006. The Semantic Web Revisited. IEEE Intelligent Systems, 21(3):96-101, May/June.

Shapiro, A. 2001. TouchGraph AmazonBrowser V1.01. TouchGraph. http://www.touchgraph.com/TGAmazonBrowser.html [accessed April 2007].

Sowa, J. 2000. Knowledge Representation: Logical, Philosophical, and Computational Foundations. Pacific Grove, Brooks/Cole.

Thinkmap Inc. 2004. VisualThesaurus.com. http://www.visualthesaurus.com [accessed April 2007].

Veronis, J. 2004. HyperLex: Lexical Cartography for Information Retrieval. Computer Speech & Language, 18(3):223-252.

Vivaldi, J. 2001. Extracción de candidatos a término mediante combinación de estrategias heterogéneas. Barcelona: IULA, Universitat Pompeu Fabra, Sèrie Tesis 9.

Wanner, L., B. Bohnet and M. Giereth. 2006. Making Sense of Collocations. Computer Speech & Language, 20(4):609-624.

Widdows, D. 2004. Geometry and Meaning. Center for the Study of Language and Information/SRI.

Williams, G. 1998. Collocational Networks: Interlocking Patterns of Lexis in a Corpus of Plant Biology Research Articles. International Journal of Corpus Linguistics, 3(1):151-71.



Automatic evaluation of a hybrid word and expansion prediction system

Sira E. Palazuelos Cagigas José L. Martín Sánchez

Universidad de Alcalá Escuela Politécnica Superior. Campus

Universitario s/n. 28805. Alcalá de Henares. {sira, jlmartin}@depeca.uah.es

Javier Macías Guarasa Grupo de Tecnología del Habla

Universidad Politécnica de Madrid Ciudad Universitaria s/n. 28040. Madrid.

[email protected]

Resumen: La predicción de palabras es uno de los sistemas más utilizados para ayudar a la escritura a personas con problemas físicos y/o lingüísticos. Últimamente la predicción de palabras se complementa con otras estrategias para mejorar su rendimiento como la expansión de abreviaturas o predicción de frases. En este artículo se presenta un sistema híbrido, de predicción de palabras y predicción de expansiones (es decir, se expande la abreviatura incluso antes de acabar de escribirla). En este sistema se permite al usuario abreviar o no cada palabra, y reducir la carga cognitiva requerida para su utilización, ya que no se necesita memorizar abreviaturas fijas para cada palabra. La eficiencia del sistema se evalúa en base al porcentaje de pulsaciones que ahorra con respecto a la escritura del mismo texto sin ayuda, mostrándose resultados de la predicción de palabras y de expansiones por separado y de la combinación de ambos. Palabras clave: Predicción de palabras, expansión de abreviaturas, predicción de expansiones, modelado del lenguaje, ayudas a la escritura y comunicación para personas con discapacidad.

Abstract: Word prediction is one of the most commonly used techniques to help people with physical and/or linguistic disabilities to write. In the newest systems, word prediction is complemented with other strategies to improve its performance, such as abbreviation expansion or phrase prediction. In this paper, a hybrid system combining word prediction and expansion prediction is presented. Expansion prediction consists of expanding the abbreviation even before the user finishes writing it. The system lets the user decide whether or not to abbreviate each word, and reduces the cognitive load required for its use because it is not necessary to remember a fixed abbreviation for each word. The efficiency of the system is evaluated by the percentage of keystrokes saved with respect to writing the text without help, and we include results for word prediction, for expansion prediction and for the combination of both. Keywords: Word prediction, abbreviation expansion, expansion prediction, language modeling, technical aids for writing and communication for people with disabilities.

1 Introduction

Word prediction consists of offering the user possible completions for the word fragment typed so far, so that, if the intended word is predicted, the user selects the prediction and does not need to finish typing the word. It is one of the most widely used techniques for helping people with various disabilities to write text and communicate.

Its initial goal was to reduce the number of keystrokes (and therefore the time) that users with physical disabilities needed to write a text, but later studies have shown that writing does not always actually become faster (at least in the initial stages of use), that users with physical impairments place more value on the reduction in the physical effort needed to produce the text, and that users with linguistic impairments could also use it to produce more correct texts (Magnuson and Hunnicutt, 2002).



Different language modeling techniques, such as those described in (Allen, 1994), are used to generate the list of predicted words. Many systems use n-gram models to generate the predicted words, for example the one described in (Lesher, Moulton and Higginbotham, 1999), which reports results of 54.7% for trigrams with prediction lists of 10 words for English. (Carlberger et al., 1997) present a prediction system for Swedish, English, Danish, Norwegian, French, Russian and Spanish based on n-grams and on information about the most recently used words (recency). Earlier versions, such as the one described in (Hunnicutt, 1989), also used semantic information in the prediction process. The following version incorporated Markov models for words and categories (Hunnicutt and Carlberger, 2001), reporting keystroke savings of 46% for Swedish with a list of 5 predicted words. (Garay-Vitoria and Gonzalez-Abascal, 1997) present a system based on a chart parser, which was later adapted to the particular characteristics of Basque, a highly inflected language, in (Garay-Vitoria, Abascal and Gardeazabal, 2002). This last article proposes using grammars with rules that describe the sequence of categories forming a compound category, and morpheme-based prediction with the possibility of accepting whole words. The result obtained for Basque with lists of 5 predicted words is approximately 43%.

Word prediction is currently being complemented with other techniques such as abbreviation expansion (Lesher and Moulton, 2005), (Willis et al., 2002) and (Willis, Pain and Trewin, 2005), and phrase prediction (Väyrynen, Noponen and Seppänen, 2007).

Abbreviation expansion algorithms can be divided into fixed and flexible ones. The great majority implement automatic expansion mechanisms and accept a certain margin of error, such as (Willis et al., 2002), (Willis et al., 2005). The fundamental difference between commercially available systems and the one described in (Palazuelos et al., 2006), which is evaluated in this article, is that the abbreviation expansion algorithms reviewed above expand an abbreviation after it has been completely written, whereas this article deals with flexible expansion prediction: expansions are proposed for the typed fragment of the current abbreviation (even if it has not been finished). Another difference is that, in this work, words do not have fixed abbreviations assigned; each person can abbreviate them as desired, as long as certain compression rules are followed. It also differs from the previous systems in that it proposes an expansion system directly supervised by the user: the expansions are predicted while the text is being written, and the candidates are shown to the user, who chooses the desired one and inserts it, thus obtaining a fully correct final text with no margin of error.

The article is structured as follows: first, the architecture of the word and expansion prediction system is briefly described. Next, the results of both prediction systems are shown, separately and combined. Finally, the conclusions are presented.

2 Description of the word and expansion prediction system

The prediction algorithm (for both words and abbreviations) basically consists of three blocks, which are explained in detail in (Palazuelos, 2001) and (Palazuelos et al., 2006):

Dictionaries.

Prediction module.

User interface.

The dictionaries contain words and multiword units, together with all the information (grammatical and probabilistic) required by the prediction methods. The system contains a general dictionary for Spanish with more than 150,000 entries, and thematic and personal dictionaries that adapt to the user and to the topic of the text being written; these increase the prediction probability of words that have already been written in the text or that have appeared in texts on the same topic. In addition, dictionaries for other languages, such as English or Portuguese, have been trained automatically.

The prediction methods use the text written by the user to propose constraints that the following words must satisfy (grammatical category and its probability, agreement, etc.). The available prediction methods are based on sequences of up to 6 words (n-grams), sequences of up to 3 categories (n-POS), and a parser based on a context-free grammar whose power has been significantly extended to support: management of rule probabilities; (grammatical) ambiguity of words; optional elements (terminal or non-terminal) in a rule; non-terminal symbols that may be grammatical categories as well as word forms or lemmas (imposing the appropriate feature agreement rules); the possibility of forbidding a given word form or lemma in a given position of a rule; and a powerful feature handling system, which makes it possible both to control agreement between the different symbols (terminal and non-terminal) and to impose or forbid features on any symbol of a rule.

The user interface is responsible for collecting the text being written, receiving the constraints that the prediction methods derive from that text, obtaining from the dictionaries the list of words that satisfy those constraints, and showing the most probable ones to the user as the list of predicted words.

Figure 1 shows a virtual keyboard that includes the word and expansion prediction algorithms. Prediction is also included in other assistive systems for people with disabilities, such as the communication system Comunicador, an application providing graphical access to messages described in (Palazuelos, 2005), or PredWin, a text editor with scanning access that is widely used in Spain by the community of people with severe physical disabilities (Palazuelos, 2001).

Figure 1: Editing window of the Comunicador application, including the list of predicted words and expansions after typing “Tel”

Based on the information in the dictionaries and the prediction methods, the word prediction algorithm shows the user the most probable words that begin exactly with the typed fragment of the current word and satisfy the constraints imposed by the prediction methods.
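The core of this step, stripped of the grammatical constraints, is a prefix lookup ranked by probability. The following is a minimal sketch under that simplification; the function name, the lexicon and its probabilities are hypothetical and do not reproduce the system's dictionaries.

```python
def predict_words(fragment, lexicon, max_candidates=5):
    """Return the most probable lexicon words that start with the fragment."""
    matches = [(word, prob) for word, prob in lexicon.items()
               if word.startswith(fragment)]
    # Highest probability first; alphabetical order breaks ties.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in matches[:max_candidates]]

# Toy lexicon with made-up unigram probabilities.
lexicon = {"teléfono": 0.004, "televisión": 0.006, "tela": 0.001, "mesa": 0.003}
print(predict_words("tel", lexicon))
```

In the full system the probabilities would be conditioned on the preceding words and categories rather than fixed per word.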

The proposed expansion prediction algorithm works similarly to the word prediction algorithm but, when comparing the typed fragment of the current word with the dictionary words, it applies a series of expansion rules, such as:

Application of the most frequent heuristics, for example phonetic or substitution heuristics (x=por).

Dictionary search by string similarity, taking into account that letters may have been deleted.

Fixed expansion by means of tables of abbreviation-expansion pairs.
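The second rule in the list above (dictionary search allowing for deleted letters) can be approximated by testing whether the typed fragment is a subsequence of a dictionary word. This sketch is an assumption about how such a match could work, not the published algorithm: it requires the first letter to coincide, ignores the heuristic and fixed-table rules, and uses a made-up lexicon.

```python
def is_subsequence(fragment, word):
    """True if the letters of fragment appear in word in the same order."""
    letters = iter(word)
    return all(ch in letters for ch in fragment)

def predict_expansions(fragment, lexicon, max_candidates=5):
    """Rank words that the (possibly unfinished) abbreviation could expand to."""
    matches = [(word, prob) for word, prob in lexicon.items()
               if word[:1] == fragment[:1] and is_subsequence(fragment, word)]
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [word for word, _ in matches[:max_candidates]]

# Toy lexicon with made-up probabilities.
lexicon = {"teléfono": 0.004, "televisión": 0.006, "tela": 0.001, "también": 0.02}
print(predict_expansions("tlf", lexicon))
print(predict_expansions("tb", lexicon))
```

Because the test also succeeds on partial abbreviations, candidates can be offered before the user has finished typing the abbreviation, which is the behavior the paper calls expansion prediction.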

The inclusion of automatic learning of abbreviations is being studied, although the fact that the system is flexible means that learning is limited to the heuristics and the fixed tables.

The expansion algorithm is explained in detail in (Palazuelos et al., 2006).

3 Automatic evaluation of the system

The importance of prediction lies not only in its ability to speed up the rate of writing or communication, but also in the increase in the quality of the text generated by a person and the reduction in the effort, both physical and cognitive, needed to write it. These and other results are reported in (Magnuson and Hunnicutt, 2002), a long-term study that confirmed both the reduction in the number of keystrokes and the acceleration in writing over the 13 months of the experiment.

The reduction in cognitive effort (especially for people with dyslexia, who make too many spelling mistakes, or with any other problem that causes them to produce low-quality texts) is very difficult to evaluate automatically, and its assessment is left to experts who can verify the increase in the quality of the generated texts. This increase in quality usually leads to an increase in quantity, since users feel more capable of writing correct texts, and positive feedback arises in the process.

As for evaluating the reduction in the physical effort involved in making the keystrokes needed to write the text, the metric that best reflects it is the percentage of keystrokes saved with respect to writing without the help of prediction. This parameter can indeed be evaluated automatically.

We must bear in mind that, besides the many factors that influence the effectiveness of prediction (the language, the configuration of the prediction system or of the interface in which it is installed (Palazuelos et al., 1999), and subjective factors arising from user preferences), if the prediction system is not able to predict the right word and reduce the number of keystrokes needed, the other factors will be irrelevant (Trnka et al., 2005). This is why an automatic evaluation of the percentage of keystrokes saved is so important.

To evaluate the system automatically, a user model is employed that simulates a person writing text and always choosing the correct predictions when they are shown (perfect user). The text is processed character by character, and after each letter is written the prediction algorithm is called to propose the possible predicted words. If any of these words matches the one being written, the system chooses it, counting it as a correctly predicted word and accumulating the keystroke savings it produces.
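The perfect-user model can be sketched as a loop over the words of the test text. The keystroke accounting below is an assumption made for illustration (one keystroke per letter, one for the trailing space, and one to select a prediction, which is taken to insert the space as well); the paper does not spell out these costs, and the predictor and lexicon are toy stand-ins.

```python
def simulate_perfect_user(words, predictor, list_size=5):
    """Return (keystrokes with prediction, keystrokes without help)."""
    with_help = 0
    without_help = 0
    for word in words:
        without_help += len(word) + 1        # letters plus trailing space
        cost = len(word) + 1                 # fallback: type the whole word
        for typed in range(1, len(word)):    # after each typed letter
            if word in predictor(word[:typed])[:list_size]:
                cost = typed + 1             # typed letters plus one selection
                break
        with_help += cost
    return with_help, without_help

# Toy predictor: prefix match over a tiny lexicon, most probable first.
LEXICON = {"casa": 0.5, "caso": 0.3, "cama": 0.2}

def toy_predictor(fragment):
    return sorted((w for w in LEXICON if w.startswith(fragment)),
                  key=lambda w: -LEXICON[w])

result = simulate_perfect_user(["casa", "cama"], toy_predictor, list_size=1)
print(result)
```

With a one-word prediction list, "casa" is selected after one letter and "cama" only after three, so the simulated user needs 6 keystrokes instead of 10.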

The selection of the training and test texts is one of the most important aspects of evaluating any natural language processing (NLP) technique, and it is carried out taking into account the aspects explained in (Palazuelos, 2001). In this series of experiments the aim was to evaluate quality in text writing (the usual use of PredWin, a text editor, and of the virtual keyboard, two of the applications in which prediction is included), not in conversation (as with Comunicador). The test text was a combination of several stories, 2000 words long, bearing in mind that the users of these systems, who have severe physical disabilities, do not normally write very long texts in each session.

As a reference, the number of keystrokes needed to write the text without any assistance algorithm is counted; the data are shown in Table 1.

Test text name                               “Cuentos variados”
Number of words                              2000
Keystrokes needed to write it without help   11969

Table 1: Data on the test text

3.1 Automatic evaluation of the word prediction algorithm

The first experiment uses word prediction with 5 candidates in the prediction list and no grammatical help of any kind, only the statistical information contained in the general dictionary (of more than 150,000 entries). The results are shown in Table 2.

Keystrokes with word prediction without grammatical help   7937
% keystrokes saved                                           33.68%

Table 2: Results of word prediction without grammatical help

Next, grammatical analysis based on sequences of grammatical categories (POS, parts of speech), bipos and tripos, is introduced (Allen, 1994).

Keystrokes needed to write it with word prediction using tripos   7701

Table 3: Results of word prediction using tripos

As can be seen in Table 3, keystroke savings improve by 1.97 percentage points with respect to the previous experiment.

If, in addition, we incorporate the use of n-grams and the dictionary of the current text, the savings are much greater, as can be seen in Table 4.

Keystrokes with word prediction using tripos, n-grams and the current-text dictionary   7243

Table 4: Word prediction with tripos, n-grams and the current-text dictionary

The results in Table 4 show that using n-grams, in addition to bipos, tripos and the current-text dictionary, yields an improvement of 3.83 percentage points over the results obtained with tripos alone, and of 5.8 percentage points over prediction without any grammatical help.
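The savings metric is simple arithmetic on the keystroke counts reported in Tables 1 to 4, and the following sketch recomputes it. The function name is ours; the counts are the ones given in the tables.

```python
def keystroke_saving(baseline, with_prediction):
    """Percentage of keystrokes saved relative to unaided writing."""
    return 100.0 * (baseline - with_prediction) / baseline

BASELINE = 11969  # keystrokes without help (Table 1)
experiments = {
    "no grammatical help": 7937,                        # Table 2
    "tripos": 7701,                                     # Table 3
    "tripos + n-grams + current-text dictionary": 7243, # Table 4
}
savings = {name: keystroke_saving(BASELINE, count)
           for name, count in experiments.items()}
for name, value in savings.items():
    print(f"{name}: {value:.2f}%")
```

The recomputed values agree with the reported 33.68% to within 0.01 percentage points, and their differences are consistent with the improvements of 1.97, 3.83 and 5.8 points quoted in the text.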

3.2 Automatic evaluation of the expansion prediction algorithm

The evaluation parameters are the same as for word prediction. The automatic evaluation uses a more complex user model than in word prediction, because the simulated user is assumed to write abbreviated text. Two files are therefore needed: the text used for the evaluation and its abbreviated version.

Because parallel corpora of abbreviated and unabbreviated text are hard to obtain, a process was implemented to compress the test files automatically. It applies the following compression techniques, which try to imitate as closely as possible the usual compression strategies of mobile phone users:

• The most frequent words are compressed by applying heuristics (phonetic, etc.).

• Letters whose share of occurrences in the text exceeds 2% are removed.

• A fixed, table-based compression strategy is included: if a word or sequence of words is in the table, it is replaced directly by the associated abbreviation.

The least frequent words are left uncompressed, since the probability that the system expands them is low. The target is a completely error-free text: if an abbreviation has just been typed and has not been expanded, the system simulates backspacing over it and retyping the word in full, adding the keystrokes that this whole process requires. Leaving the least frequent words unabbreviated when compressing the text removes, or at least reduces, this process, so the results are penalized less. Users also compress infrequent words little or not at all, to prevent the reader from taking the abbreviation for another, more frequent word.

This compressed file is the one used for the automatic evaluation. Future evaluations with real users are nevertheless planned, and better results are expected from them, since an intelligent user will adopt the strategy that best suits the way expansion works.

The evaluation also covers another situation that can arise in real use: a user who is using the system to communicate needs speed and a text that is understandable, even if not perfect, and will favour speed over full correctness. Such a user may leave unexpanded abbreviations uncorrected if the resulting text can still be understood without difficulty. This option has been added to the system, and the experiments also report results that allow a margin of error (abbreviations left unexpanded).
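The two evaluation modes can be sketched as a keystroke count over aligned (word, abbreviation) pairs. This is a minimal illustration, not the simulator used in the paper: the cost model (one keystroke per character, one backspace per abbreviation character, no cost for accepting an expansion), the `expand` lookup and the sample table are all assumptions made here.

```python
from typing import Callable, Iterable, Optional, Tuple


def count_keystrokes(
    pairs: Iterable[Tuple[str, str]],
    expand: Callable[[str], Optional[str]],
    allow_errors: bool,
) -> Tuple[int, int]:
    """Return (keystrokes, abbreviations left unexpanded in the text)."""
    keystrokes = 0
    errors = 0
    for word, abbreviation in pairs:
        keystrokes += len(abbreviation)        # the user types the abbreviation
        if abbreviation == word or expand(abbreviation) == word:
            continue                           # nothing to repair
        if allow_errors:
            errors += 1                        # abbreviation stays in the text
        else:
            keystrokes += len(abbreviation)    # backspace over the abbreviation
            keystrokes += len(word)            # retype the word in full
    return keystrokes, errors


# Hypothetical expansion table standing in for the real predictor.
table = {"q": "que", "tb": "también"}
text = [("que", "q"), ("también", "tb"), ("casa", "cs"), ("rara", "rara")]

strict = count_keystrokes(text, table.get, allow_errors=False)
lenient = count_keystrokes(text, table.get, allow_errors=True)
print(strict, lenient)
```

In the error-free mode the failed abbreviation "cs" costs its own keystrokes plus the backspaces and the full word, while in the tolerant mode it costs only the abbreviation and is counted as one error.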

This series of experiments uses the same test text as the previous ones. First, keystroke savings are evaluated applying only the expansion prediction algorithm, using tripos and n-grams applied to the general and personal dictionaries. The results are shown in Table 5.

Test text                                             “Cuentos variados”
Keystrokes with expansion prediction, error-free      6461
Keystroke savings                                     46.01%

Table 5: Expansion prediction without errors

If backspaces are not counted, that is, if a certain percentage of unexpanded abbreviations is accepted (margin of error), the results are those shown in Table 6.

Test text                                             “Cuentos variados”
Keystrokes with expansion prediction, with errors     6415

Table 6: Expansion prediction with errors

These results were obtained with an error rate of 0.6%, which is very low compared with other systems reviewed, such as the one described in (Shieber and Baker, 2003), which reports 3%.

As can be seen, error-free expansion prediction achieves keystroke savings of 46.01%, and 46.40% when errors are tolerated, an increase of 0.39 percentage points. The improvements over word prediction (Table 4) are 6.53 and 6.92 points respectively.

3.3 Effectiveness of combining the word prediction and expansion prediction algorithms

The user model can be configured so that both normal and abbreviated text can be entered; the program then generates a list of predicted words that combines the proposals of the word prediction and expansion prediction algorithms.

This section evaluates the effectiveness of combining the two algorithms compared with using each of them separately. The results are compared giving priority to each of the prediction algorithms in turn. In each experiment one of the two algorithms is chosen to propose predicted words first; if the list of candidate words is not full after that, the other prediction algorithm is called to complete it with its own proposals. Thus, for every letter the user types, a list of five candidate words is shown, coming either from the priority algorithm alone or from both.
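The priority scheme described above can be sketched as a simple list merge. The function name, the removal of duplicates and the sample candidates are assumptions made for illustration; the text only specifies that the priority algorithm fills the five-slot list first and the other algorithm completes it.

```python
from typing import List


def combine_predictions(primary: List[str], secondary: List[str], size: int = 5) -> List[str]:
    """Fill the list from the priority algorithm, then complete it with the other."""
    combined: List[str] = []
    for candidate in primary + secondary:
        if len(combined) == size:
            break
        if candidate not in combined:      # assumption: never show a word twice
            combined.append(candidate)
    return combined


words = ["casa", "caso", "casi"]                  # hypothetical word-prediction output
expansions = ["casa", "cosa", "cesar", "causa"]   # hypothetical expansion output

word_priority = combine_predictions(words, expansions)
expansion_priority = combine_predictions(expansions, words)
print(word_priority)
print(expansion_priority)
```

Swapping the argument order switches the priority, which is the only difference between the configurations compared in Tables 7 and 8.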

Table 7 shows the results obtained when expansion prediction is given priority over word prediction. The results improve by about 3 percentage points over expansion prediction on its own, without taking errors into account.

Test text                                                           “Cuentos variados”
Keystrokes with both algorithms, expansion priority, error-free     6094

Table 7: Combination of the algorithms giving priority to expansion prediction

If, on the other hand, word prediction is given priority, the results are those shown in Table 8.

Test text                                                                 “Cuentos variados”
Keystrokes with both algorithms, word prediction priority, error-free     5606

Table 8: Combination of the algorithms giving priority to word prediction

In this case keystroke savings improve by more than 4 percentage points over the results obtained giving priority to expansion prediction.

4 Conclusions

This paper evaluates the effectiveness of the expansion prediction and word prediction algorithms used in several writing and communication aids for people with disabilities. To evaluate the algorithms automatically, a user model capable of simulating text entry in each case was designed.

First, the results obtained with word prediction alone are presented. Introducing grammatical information prevents grammatically incorrect predictions from being shown to the user, which improves the results by 1.97 percentage points and also improves the quality perceived subjectively by the user. The next experiment adds n-grams and the personal dictionary to the tripos, reaching keystroke savings of 39.48%, an improvement of 3.83 points over the previous method.

Next, the results are evaluated assuming that the user writes abbreviated text and that the expansion prediction algorithm is applied, both with and without a margin of error in the text (the error rate obtained never exceeds 0.6%). The error-free keystroke savings were 46.01%, which is 6.53 percentage points better than the results of the word prediction algorithms.

Combining the two prediction algorithms lets the user enter either abbreviated or normal text, and gives the best results when word prediction has priority: error-free keystroke savings of about 53.16%. This is 4 percentage points better than the combination that gives priority to expansion prediction, more than 7 points better than the best expansion prediction algorithm, and almost 14 points better than the best word prediction algorithm.

Finally, introducing these algorithms into a writing and/or communication aid offers more than the quantitative advantage of keystroke savings. It also gives the user flexibility when abbreviating: words can be compressed differently each time, and the abbreviation assigned to each word need not be remembered, which reduces the cognitive load that memorizing abbreviations would entail.

References

Allen J. 1994. “Natural Language Understanding”. 2nd ed. Benjamin/Cummings Publishing Company Inc.

Carlberger A., Carlberger J., Magnuson T., Hunnicutt S., Palazuelos-Cagigas S., Aguilera Navarro S. 1997. “Profet, a new generation of word prediction: An evaluation study”. Proceedings of the ACL Workshop on Natural Language Processing for Communication Aids, pages 23–28, Madrid.

Garay-Vitoria N., Gonzalez-Abascal J. 1997. “Intelligent word prediction to enhance text input rate (a syntactic analysis-based word prediction aid for people with severe motor and speech disability)”. Proceedings of the Annual International Conference on Intelligent User Interfaces, pages 241–244.

Garay-Vitoria N., Abascal J., Gardeazabal L. 2002. “Evaluation of Prediction Methods Applied to an Inflected Language”. Lecture Notes in Computer Science, Vol. 2448. Proceedings of the 5th International Conference on Text, Speech and Dialogue, pages 389–396. ISBN: 3-540-44129-8.

Hunnicutt S. 1989. “Using Syntactic and Semantic Information in a Word Prediction Aid”. Proc. Europ. Conf. Speech Commun., Paris, France, September 1989, vol. 1, pages 191–193.

Hunnicutt S., Carlberger J. 2001. “Improving Word prediction using Markov models and heuristic methods”. Augmentative and Alternative Communication, Volume 17, Issue 4, December, pages 255–264.

Lesher G. W., Moulton B. J., Higginbotham D. J. 1999. “Effects of ngram order and training text size on word prediction”. Proceedings of the RESNA'99 Annual Conference, pages 52–54, Arlington, VA: RESNA Press.

Lesher G., Moulton B. 2005. “An introduction to the theoretical limits of abbreviation expansion performance”. 28th Annual RESNA Conference Proceedings. http://www.dynavoxtech.com/files/research/LeMo05.pdf

Magnuson T., Hunnicutt S. 2002. “Measuring the effectiveness of Word prediction: The advantage of long-term use”. Speech, Music and Hearing, KTH, Stockholm, Sweden. TMH-QPSR, Volume 43, pages 57–67.

Palazuelos S. E., Aguilera S., Rodrigo J. L., Godino J., Martín J. 1999. “Considerations on the Automatic Evaluation of word prediction systems”. Augmentative and Alternative Communication: New Directions in Research and Practice, pages 92–104. Whurr Publishers, London.

Palazuelos Cagigas S. 2001. “Aportación a la predicción de palabras en castellano y su integración en sistemas de ayuda a personas con discapacidad física”. PhD thesis.

Palazuelos Cagigas S. E., Martín Sánchez J. L., Arenas García J., Godino Llorente J. I., Aguilera Navarro S. 2001. “Communication strategies using PredWin for people with disabilities”. Conference and Workshop on Assistive Technology for Vision and Hearing Impaired, Castelvecchio Pascoli, Italy, August.

Palazuelos Cagigas S. E., Martín Sánchez J. L., Domínguez Olalla L. M. 2005. “Graphic Communicator with Optimum Message Access for Switch Users”. Assistive Technology: From Virtuality to Reality, pages 207–211. ISBN: 1-58603-543-6. ISSN: 1383-813X. IOS Press (eds. A. Pruski and H. Knops).

Palazuelos Cagigas S. E., Martín Sánchez J. L., Hierrezuelo Sabatela L., Macías Guarasa J. 2006. “Design and evaluation of a versatile architecture for a multilingual word prediction system”. Computers Helping People with Special Needs, LNCS (Lecture Notes in Computer Science) 4061, pages 894–901. Springer-Verlag. Editors: Klaus Miesenberger, Joachim Klaus, Wolfgang Zagler, Arthur Karshmer. ISBN: 3-540-36020-4.

Trnka K., Yarrington D., McCoy K., Pennington C. 2005. “The Keystroke Savings Limit in Word Prediction for AAC”. http://hdl.handle.net/123456789/149

Shieber S., Baker E. 2003. “Abbreviated Text Input”. IUI'03, Miami, Florida, USA, 12–15 January 2003. ACM 1-58113-586-6/03/0001. http://www.iuiconf.org/03pdf/2003-001-0064.pdf

Väyrynen P., Noponen K., Seppänen T. 2007. “Analysing performance in a word prediction system with multiple prediction methods”. Computer Speech & Language, Volume 21, Issue 3, pages 479–491, July.

Willis T., Pain H., Trewin S., Clark S. 2002. “Informing Flexible Abbreviation Expansion for users with motor disabilities”. Lecture Notes in Computer Science, Vol. 2398. Proceedings of the 8th International Conference on Computers Helping People with Special Needs, pages 251–258. ISBN: 3-540-43904-8.

Willis T., Pain H., Trewin S. 2005. “A Probabilistic Flexible Abbreviation Expansion System for Users With Motor Disabilities”. School of Informatics, University of Edinburgh.



Corpus Linguistics

Specification of a general linguistic annotation framework and its use in a real context

Xabier Artola, Arantza Díaz de Ilarraza, Aitor Sologaistoa, Aitor Soroa
IXA Taldea
Euskal Herriko Unibertsitatea (UPV/EHU)

Summary: AWA is a general architecture for representing the linguistic information produced by linguistic processors. Our aim is to define a coherent and flexible representation scheme that serves as the basis for exchanging information between linguistic tools of any kind. Linguistic analyses are represented by means of feature structures following the TEI-P4 guidelines. These structures, and their relation to the other elements that make up an analysis, are part of a data model designed under the object-oriented paradigm. AWA is responsible for representing information within a broader architecture that manages the whole process of analysing a corpus. As an example of the usefulness of the model, we explain how it has been applied in the processing of two corpora.
Key words: annotation model, integration architecture, TEI-P4

Abstract: In this paper we present AWA, a general architecture for representing the linguistic information produced by diverse linguistic processors. Our aim is to establish a coherent and flexible representation scheme that will be the basis for the exchange of information. We use TEI-P4 conformant feature structures as a representation schema for linguistic analyses. A consistent underlying data model, which captures the structure and relations contained in the information to be manipulated, has been identified and implemented by a set of classes following the object-oriented paradigm. As an example of the usefulness of the model, we will show the usage of the framework in a real context: two corpora have been annotated by means of an application whose aim is to exploit and manipulate the data created by the linguistic processors developed so far.
Keywords: Annotation model, integration architecture, TEI-P4

1 Introduction

In this paper we present AWA (Annotation Web Architecture), which forms part of LPAF, a multi-layered Language Processing and Annotation Framework. LPAF is a general framework for the management and integration of NLP components and resources. AWA defines a data representation schema whose aim is to facilitate communication among linguistic processors in a variety of NLP applications. The key design criteria for AWA are oriented towards making it possible to describe different phenomena in a homogeneous way.

The objective of AWA is to establish a coherent and flexible representation scheme that will be the basis for the exchange of information. We use TEI-P4 conformant feature structures¹ to represent linguistic analyses. We have also identified a consistent underlying data model which captures the structure and relations contained in the information to be manipulated.

This data model has been represented by classes that are encapsulated in several library modules (LibiXaML), following the object-oriented paradigm (Artola et al., 2005). The modules offer the necessary types and operations to manipulate the linguistic information according to the model. The class library has been implemented in C++ and contains about 100 classes. For the implementation of the different classes and methods we make use of the Libxml2² library.

¹ http://www.tei-c.org/P4X/DTD/
² http://xmlsoft.org/



The current release of LibiXaML works on Unix flavours as well as on Windows architectures.

As an example of the usefulness of the model we will show the usage of the framework in a real context. Two corpora have been tagged by means of an on-line application, called EULIA, whose aim is to exploit and manipulate the data created by the linguistic processors developed so far and integrated in a pipeline architecture. EULIA (Artola et al., 2004) offers help in data browsing, manual disambiguation, and annotation tasks by means of an intuitive and easy-to-use graphical user interface.

The rest of the paper is organized as follows. In section 2 we present some related work. Section 3 explains the proposed annotation architecture. In section 4 we describe the use of feature structures for representing linguistic information. Section 5 shows the use of the framework in two real contexts: the annotation of EPEC (Reference Corpus for the Processing of Basque) and ztC (Science and Technology Corpus) (Areta et al., 2006), and EULIA, an application implemented to facilitate work with the so-called annotation web. Finally, section 6 presents some conclusions and future work.

2 Related work

There is a general trend towards establishing standards for effective language resource management (ISO/TC 37/SC 4 (Ide and Romary, 2004)), the main objective of which is to provide a framework for language resource development and use. Besides, there is much work dealing with the use of XML-based technologies for annotating linguistic information. ATLAS (Bird et al., 2000), LT-TTT (Thompson et al., 1997) and WHAT are some of the projects where stand-off annotation is used in order to deal efficiently with the combination of multiple overlapping hierarchies that appear as a consequence of the multidimensional nature of linguistic information. LT-TTT (Thompson et al., 1997) is a general library developed within an XML processing paradigm whereby tools are combined in a pipeline, making it possible to add, modify and remove pieces of annotation. It provides linguistic components that operate over XML documents and permit the development of a broad range of NLP applications which share the annotated information. In ATLAS (Bird et al., 2000) the authors use XML technology as a format for the interchange of annotated information between linguistic applications (AIF). In its first version, ATLAS was fully based on a particular formalism for annotation, called Annotation Graphs (AGs). However, the authors extended the architecture in order to adopt a higher level of abstraction and provide an ontology where the conceptual model can be defined. For this reason MAIA (Meta Annotation Information for Atlas) was defined (Laprun et al., 2002). Although the ontology model is described in XML documents, no XML technology is used to semantically validate the information. Finally, in the WHAT project (Schäfer, 2003), the authors present an XSLT-based Whiteboard Annotation Transformer, a facility for integrating deep and shallow NLP components. They rely on XSLT technology for transforming shallow and deep annotations in an integrated architecture built on top of a standard XSL transformation engine. Linguistic applications communicate with the components through programming interfaces. These APIs are not isomorphic to the XML mark-up they are based on, but they define classes in a hierarchical way. Among other types of formalisms, they use typed feature structures for encoding deep annotations, although the correctness of these feature structures is not validated with XML tools.

Apart from the annotation infrastructure, several systems go further and define frameworks for rapid prototyping of linguistic applications that share the same data model (annotations) at different levels. GATE (Cunningham, Wilks, and Gaizauskas, 1996; Bontcheva et al., 2004), TALENT (Neff, Byrd, and Bougaraev, 2004), ATLAS and MAIA (Bird et al., 2000; Laprun et al., 2002), and UIMA (Ferrucci and Lally, 2004) are some of these systems.

The annotation architecture presented in this paper follows the stand-off markup approach and has been inspired by the TEI-P4 guidelines (Sperberg-McQueen and Burnard, 2002) to represent linguistic information obtained by a wide range of linguistic tools.

One reason for taking this approach is that our representation requirements, together with the characteristics of the language (Basque) we are dealing with, are not completely fulfilled by the annotation schemes proposed in the systems mentioned before. Basque being an agglutinative and free-order language, the complexity of the morphological information attached to linguistic elements (word-forms, morphemes, multiword expressions, etc.), as well as the need to represent discontinuous linguistic units, obliges us to use a rich representation model.

3 The annotation architecture in a language processing framework

In this section, the general annotation web architecture (AWA) is described from an abstract point of view, and situated within LPAF.

3.1 Language Processing and Annotation Framework

Figure 1 depicts the main components of LPAF. The framework has been organized in different layers.

The bottom layer defines the basic infrastructure shared by any LPAF component. In this layer we can find:

• The Annotation Web Architecture (AWA), including a set of class libraries which offer the necessary types and operations to manipulate the objects of the linguistic information model (Artola et al., 2005).

• The Linguistic Processing Infrastructure (LPI), which includes the set of classes needed to combine linguistic processes. It is the result of the characterization of the way the linguistic processes interact with each other.

The former will be thoroughly explained in this paper.

The middle layer is formed by the LPAF public services, which constitute the basic resources for defining new linguistic applications. LPAF services perform concrete and well-defined tasks necessary for defining complex linguistic applications such as Q/A systems, environments for manual annotation of corpora at different levels, etc.

On the top layer we can find final user applications. EULIA and the ztC Query System³ are two examples of this type of application, and will be explained throughout this paper.

Figure 1: The Language Processing and Annotation Framework

3.2 Annotation Web Architecture

The Annotation Web Architecture has been designed in a way general enough to be used in the annotation tasks of a very broad range of linguistic processing tools. Issues such as the representation of ambiguity or the attachment of linguistic information to units formed by discontinuous constituents have been taken into account in the annotation model.

An abstract view of this annotation architecture is represented in Figure 2. When a text unit undergoes a series of language processing steps, a corpus unit is created. Together with the raw text, this corpus unit includes the linguistic annotations resulting from each of these processing steps. So, each one of these annotations (LinguisticAnnotation class) represents, for instance, the set of annotations produced by a lemmatization process or the annotations produced by a dependency-based syntactic parser. Dependencies among different linguistic annotations belonging to the same processing chain are represented by the dependsOn association link in the diagram.

The model follows a stand-off annotation strategy: anchors set on the corpus (Anchor class) are attached to the corresponding linguistic information (LingInfo class) by means of “links” (AnnotationItem class). An annotation item always refers to one anchor and has a single associated feature structure containing linguistic information. Any annotation item can become an anchor in a subsequent annotation operation. As a result of each processing step (tokenization, morphological segmentation or analysis, lemmatization, syntactic parsing, etc.), what we call a “linguistic annotation”, consisting of a web of interlinked XML documents, is generated.

³ http://www.ztcorpusa.net

Figure 2: The annotation architecture

The model is physically represented by three different types of XML documents: anchor documents, link documents (annotation items) and documents containing linguistic information. Let us now look at each of these in more detail:

• Anchors: these elements range from physical elements found in the input corpus (textual references, represented by the TextRef class), such as typical character offset expressions or XPointer expressions pointing to specific points or ranges within an XML document, to annotation items resulting from previous annotation processes; in particular, morphemes and single- or multiword tokens, word spans, etc., or even “linguistic interpretations” of this kind of element can be taken as anchors of linguistic annotations. We have found that, in many cases, physical text elements are not adequate as annotation anchors, and linguistic interpretations issued from previous analysis steps (lemmatization and syntactic function combinations, or phrasal chunks to which only some of the possible interpretations of a word belong) have to be used as anchors in subsequent processing steps. Textual anchors are set mainly as a result of tokenization and of the identification of multiword expressions. Interpretational anchors, on the other hand, are annotation items or else special anchors (anchors specifically created as “elements” to which linguistic information is attached); in this case, they are expressed by XML elements which act as a join of several identifiers representing interpretations issued from previous processes. Examples of special anchors are word sequences, chunks, etc. Structural ambiguity is represented by overlapping anchors, i.e., when annotations refer to anchors which overlap.

• Annotation items (links): these constitute the actual annotations resulting from a linguistic analysis process. Each link ties a single linguistic interpretation to an anchor. Interpretation ambiguity is represented by several links attached to the same anchor, so disambiguation consists simply in marking one of these links as correct while discarding the rest.

• Linguistic information: typed feature structures are used to represent the different types of linguistic information resulting from the analysis processes. In some cases, such as morphological segmentation or lemmatization, the linguistic content corresponds to word forms (more specifically, token annotation items), and therefore huge common libraries containing these contents (feature structures) are used, allowing us to save processing time (and storage space) because previously analyzed word forms need not be analyzed again when they occur in new texts.

This data model captures the structure and relations contained in the information to be manipulated, and is represented by classes which are encapsulated in several library modules. These classes offer the operations or methods that the different tools need to perform their tasks when recognizing the input and producing their output.
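To make the relation between anchors, links and linguistic information concrete, the following is a minimal sketch in Python rather than in the C++ of the actual library. Only the class names Anchor, AnnotationItem, LingInfo and LinguisticAnnotation come from the model; the attributes, the methods and the second (adjectival) interpretation in the usage lines are assumptions made for illustration, and anchors are simplified to character ranges.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LingInfo:
    """Linguistic content, reduced here to a flat feature dictionary."""
    fs_type: str
    features: Dict[str, str]


@dataclass
class Anchor:
    """A textual anchor given as a character range over the raw text."""
    anchor_id: str
    start: int
    end: int


@dataclass
class AnnotationItem:
    """A link tying exactly one anchor to one feature structure."""
    item_id: str
    anchor: Anchor
    info: LingInfo
    correct: Optional[bool] = None   # None until a disambiguation decision is made


@dataclass
class LinguisticAnnotation:
    """All annotation items produced by one processing step."""
    step: str
    items: List[AnnotationItem] = field(default_factory=list)
    depends_on: List["LinguisticAnnotation"] = field(default_factory=list)

    def interpretations(self, anchor: Anchor) -> List[AnnotationItem]:
        # Several items on the same anchor represent interpretation ambiguity.
        return [item for item in self.items if item.anchor is anchor]

    def disambiguate(self, anchor: Anchor, chosen_id: str) -> None:
        # Mark one link as correct and flag the others as discarded.
        for item in self.interpretations(anchor):
            item.correct = item.item_id == chosen_id


token = Anchor("w1", 0, 5)
lemmas = LinguisticAnnotation("lemmatization")
lemmas.items.append(AnnotationItem("a1", token, LingInfo("lemmatization", {"Lemma": "esne", "POS": "NOUN"})))
lemmas.items.append(AnnotationItem("a2", token, LingInfo("lemmatization", {"Lemma": "esnea", "POS": "ADJ"})))
lemmas.disambiguate(token, "a1")
```

Keeping the decision as a flag on each link, instead of deleting the rejected links, keeps every analysis available for later inspection.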

4 Representing linguistic information: feature structures and Relax NG schemas

This section explains in more detail the use of feature structures in our model, their advantages and features, the representation of meta-information, and the exploitation of schemas in different tasks, such as information retrieval or automatic generation of GUIs.

The different types of linguistic information resulting from the analysis processes are represented as typed feature structures. In a multi-dimensional markup environment, typed feature structures are adequate for representing linguistic information because they serve as a general-purpose metalanguage and ensure the extensibility of the model to represent very complex information. Typed feature structures provide us with a formal semantics and a well-known set of logical operations over the represented linguistic information.

The feature structures we use follow the TEI guidelines for typed FSs, and they are compatible with ISO/TC 37/SC 4 (Ide and Romary, 2004). Furthermore, we have adopted Relax NG as a definition metalanguage for typed feature structures. Relax NG schemas define the legal building blocks of a feature structure type and semantically describe the represented information.

<TEI.2>
  ...
  <p>
    <fs id="fs1" type="morphosyntactic">
      <f name="Form"><str>esnea</str></f>
      <f name="Lemma"><str>esne</str></f>
      <f name="Morphological-Features">
        <fs type="Top-Feature-List">
          <f name="POS"><sym value="NOUN"/></f>
          <f name="SUBCAT"><sym value="COMMON"/></f>
        </fs>
      </f>
      <f name="Components"> ... </f>
    </fs>
  </p>
  <p>
    <fs id="fs2" type="lemmatization">
      <f name="Form"><str>esnea</str></f>
      <f name="Lemma"><str>esne</str></f>
      <f name="POS"><sym value="NOUN"/></f>
      <f name="SUBCAT"><sym value="COMMON"/></f>
    </fs>
  </p>
  ...
</TEI.2>

Figure 3: Typed feature structures

The type of the feature structure is encoded in XML by means of the type attribute (see Figure 3). This attribute allows us to understand the meaning of the information described in the feature structure by means of its link with the corresponding Relax NG schema, which specifies the content of the feature structure.
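As an illustration of how such a feature structure can be read by a program, the sketch below parses the second feature structure of Figure 3 with Python's standard xml.etree.ElementTree module and returns its type together with a feature dictionary. The function name and the handling of value elements are assumptions made here; the actual library is written in C++ on top of Libxml2.

```python
import xml.etree.ElementTree as ET

FS_XML = """
<fs id="fs2" type="lemmatization">
  <f name="Form"><str>esnea</str></f>
  <f name="Lemma"><str>esne</str></f>
  <f name="POS"><sym value="NOUN"/></f>
  <f name="SUBCAT"><sym value="COMMON"/></f>
</fs>
"""


def read_feature_structure(fs):
    """Turn an <fs> element into (type, {feature name: value})."""
    features = {}
    for f in fs.findall("f"):
        value = f[0]                         # the first child holds the value
        if value.tag == "fs":                # nested feature structure
            features[f.get("name")] = read_feature_structure(value)[1]
        elif value.tag == "sym":             # symbolic value kept in an attribute
            features[f.get("name")] = value.get("value")
        else:                                # <str> and similar: text content
            features[f.get("name")] = value.text
    return fs.get("type"), features


fs_type, features = read_feature_structure(ET.fromstring(FS_XML))
print(fs_type, features)
```

The value of the type attribute returned here is the key that links the feature structure to its schema.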

Relax NG schemas provide us with a formalism to express the syntax and semantics of XML documents but, unfortunately, they are not capable of interpreting the content of the feature structures represented in the document. Therefore, we have implemented some tools which, based on the Relax NG schema, arrange the data and automatically create the appropriate FS that encodes the associated linguistic information to be represented. These tools can be used to build GUIs for editing linguistic annotations, adapting the interface to the users' needs in such a way that they only have to specify the type of the information to be treated. Besides, thanks to these tools, we are able to build general front- and back-end modules for the integration of different linguistic engines in more complex linguistic applications. Once the input/output information of the linguistic engines is specified by means of these Relax NG schemas, the front-end module will provide the adequate data to each engine and the back-end module will produce the suitable output.

<define name="fs.lemma">
  <element name="fs">
    <attribute name="id"><data type="id"/></attribute>
    <attribute name="type"><value>lemmatization</value></attribute>
    <ref name="f.Form"/>
    <ref name="f.Lemma"/>
    <ref name="f.Pos-Subcat"/>
  </element>
</define>

<define name="f.Form">
  <element name="f">
    <attribute name="name"><value>Form</value></attribute>
    <element name="str"><data type="string"/></element>
  </element>
</define>

<define name="f.Pos-Subcat">
  <choice>
    <ref name="pos.Noun"/>
    <ref name="pos.Adj"/>
    ...
  </choice>
</define>

<define name="pos.Noun">
  <ref name="f.POS"/>
  <element name="f">
    <attribute name="name"><value>SUBCAT</value></attribute>
    <choice>
      <value>COMMON</value>
      <value>PERSON NAME</value>
      <value>PLACE NAME</value>
    </choice>
  </element>
</define>

Figure 4: RELAX NG schema mixing morphosyntax and lemmatization

Figure 3 shows a fragment of an XML document which mixes feature structures of two different linguistic levels (morphosyntactic and lemmatization) for the same word form. These FSs are defined by the partial Relax NG schema shown in Figure 4. The relation between FSs and the schema is established through the type attribute (in bold in both figures). Using these relations, our tools can access the corresponding schemas and exploit them.
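The lookup from a type attribute to its schema can be pictured as a small registry. The sketch below is a stand-in for real Relax NG processing, not a description of the authors' tools: the registry, the schema file name and the check on required features are hypothetical, and the feature list simply mirrors what Figure 4 requires of a lemmatization feature structure.

```python
# Hypothetical registry: feature structure type -> schema file and the
# features that schema requires (the feature list follows Figure 4).
SCHEMAS = {
    "lemmatization": {
        "schema": "lemmatization.rng",   # assumed file name
        "required": ("Form", "Lemma", "POS", "SUBCAT"),
    },
}


def schema_for(fs_type):
    """Resolve the registry entry linked to a type attribute value."""
    try:
        return SCHEMAS[fs_type]
    except KeyError:
        raise ValueError(f"no schema registered for type {fs_type!r}") from None


def missing_features(fs_type, features):
    """List the required features that a feature structure lacks."""
    return [name for name in schema_for(fs_type)["required"] if name not in features]


complete = {"Form": "esnea", "Lemma": "esne", "POS": "NOUN", "SUBCAT": "COMMON"}
partial = {"Form": "esnea", "Lemma": "esne"}

print(missing_features("lemmatization", complete))
print(missing_features("lemmatization", partial))
```

A GUI generator in this spirit could use the list of missing features to decide which input fields to present for a given annotation type.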

5 The use of the annotation architecture in a real context

In order to check the validity of the annotation architecture presented here, we have implemented a pipeline workflow which integrates natural language engines going from a tokenizer to a syntactic parser. Two text corpora have been processed through this pipeline with the aid of a tool named EULIA.

5.1 EULIA: an environment for managing annotated corpora

EULIA is a graphical environment which exploits and manipulates the data created by the linguistic processors. Designed to be used by general users and linguists, its implementation is based on a client-server architecture where the client is a Java Applet running on any Java-enabled web browser and the server is a combination of different modules implemented in Java, C++ and Perl.

The linguistic processors integrated so far in the mentioned architecture are:

• A tokenizer that identifies tokens and sentences from the input text.

• A segmentizer, which splits up a word into its constituent morphemes.

• A morphosyntactic analyzer whose goal is to process the morphological information associated with each morpheme, obtaining the morphosyntactic information of the word form considered as a unit.

• A recognizer of multiword lexical units, which performs the morphosyntactic analysis of the multiword expressions present in the text.

• A general-purpose tagger/lemmatizer.

• A chunker or shallow syntactic analyzer based on Constraint Grammar.

• A deep syntax analyzer.

EULIA provides different facilities which can be grouped into three main tasks:

• Query facility. It displays the answers to the user's requests according to a suitable stylesheet (XSLT). These stylesheets can be changed dynamically depending on both the user's choice and the type of answer.

• Manual disambiguation. Its goal is to help annotators identify the correct analysis and discard the wrong ones. The incorrect analyses are properly tagged but not removed.

• Manual annotation. It consists of assigning to each anchor its corresponding linguistic information. Depending on the annotation type, different kinds of information are needed. In order to get these data, EULIA's GUI generates a suitable form, based on the Relax NG schema which defines the document's format for that annotation type. Considering that linguistic information is encoded following the annotation architecture, the treatment at different levels of analysis is similar.
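The schema-driven form generation mentioned under manual annotation can be illustrated with a small sketch. It is not EULIA's implementation: the function `form_fields` and the inlined schema are hypothetical, and the sketch only shows how the closed value lists of a Relax NG definition could be turned into the options of a form field.

```python
import xml.etree.ElementTree as ET

# Standard RELAX NG namespace, in ElementTree's {uri}tag notation.
RNG = "{http://relaxng.org/ns/structure/1.0}"

# Hypothetical fragment modelled on the pos.Noun definition of Figure 4.
SCHEMA = """
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <define name="pos.Noun">
    <element name="f">
      <attribute name="name"><value>SUBCAT</value></attribute>
      <choice>
        <value>COMMON</value>
        <value>PERSON NAME</value>
        <value>PLACE NAME</value>
      </choice>
    </element>
  </define>
</grammar>
"""


def form_fields(schema_xml, define_name):
    """Return (feature name, allowed values) pairs for one <define>."""
    root = ET.fromstring(schema_xml)
    fields = []
    for define in root.iter(RNG + "define"):
        if define.get("name") != define_name:
            continue
        for feature in define.iter(RNG + "element"):
            if feature.get("name") != "f":
                continue
            # The feature label is the fixed value of its "name" attribute.
            label = None
            for attr in feature.findall(RNG + "attribute"):
                value = attr.find(RNG + "value")
                if attr.get("name") == "name" and value is not None:
                    label = value.text
            # The allowed values are the alternatives of the <choice>.
            options = [v.text for v in feature.findall(RNG + "choice/" + RNG + "value")]
            if label is not None:
                fields.append((label, options))
    return fields


for label, options in form_fields(SCHEMA, "pos.Noun"):
    print(label, options)
```

A GUI could render each pair as a labelled drop-down list, so that a change in the schema changes the form without touching the interface code.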

5.2 Annotating ztC and EPEC

Let us now briefly explain two real experiences that demonstrate the flexibility and robustness of the model, the architecture, and the environment built. These experiences have been carried out on two corpora created with different purposes:

• ztC Corpus (Science and Technology Corpus). ztC is an 8,000,000-word corpus of standard written Basque about Science and Technology whose aim is to be a reference for the use of the language in Science and Technology texts. Part of this corpus (1,600,000 words) has been automatically annotated and manually disambiguated. The manual disambiguation of the corpus is performed on the output of EUSTAGGER (Aduriz et al., 1996), a general lemmatizer/tagger that obtains for each word-form its lemma, POS, number, declension case, and the associated syntactic functions. In this case, the manual disambiguation and annotation has been restricted to the information about lemma and POS.

• EPEC Corpus (Reference Corpus for the Processing of Basque). EPEC is a 300,000-word corpus of standard written Basque intended as a training corpus for the development and improvement of several NLP tools. The first version of this corpus (50,000 words) has already been used for the construction of some tools such as a morphological analyzer, a lemmatizer, or a shallow syntactic analyzer, but we are now enhancing it by manually annotating 250,000 new words. Although EPEC has been manually annotated at different levels, the manual annotation to which we refer here has been performed on the output of MORPHEUS (Aduriz et al., 2000), a general analyzer that obtains for each word-form its possible morphosyntactic analyses. EULIA presents this information to the linguist, who has to choose the correct analysis and mark it by means of a facility provided by the application. If the analyzer does not offer any correct analysis, the annotator has to produce it by filling in a form obtained automatically in a schema-based way, as explained in section 4. Once the whole corpus is manually annotated and disambiguated at the segmentation level, the annotations are propagated to other levels (morphosyntax, lemmatization, syntax) automatically and revised again by means of the application. Currently, eight annotators are satisfactorily working in parallel using EULIA.

The flexibility EULIA gains from using Relax NG schemas makes it possible to visualize the information needed in each process in such a way that the linguist can focus only on the ambiguity problem relevant to the information given.

6 Conclusions and future work

In this paper we have presented AWA, a general architecture for representing the linguistic information produced by linguistic processors. It is integrated into LPAF, a language processing and annotation framework. Based on a common annotation schema, the proposed representation is coherent and flexible, and serves as a basis for exchanging information among a very broad range of linguistic processing tools, going from tokenization to syntactic parsing.

We have described our general annotation model, where any annotation can be used as an anchor for subsequent processes. The annotations are stand-off, so that we can deal efficiently with the combination of multiple overlapping hierarchies that appear as a consequence of the multidimensional nature of linguistic information. Based on our experience, the markup annotation model we propose can represent a great variety of linguistic information or structure.

XML is used as an underlying technology for sharing linguistic information. We have also defined Relax NG schemas to describe the different types of linguistic information the framework is able to work with. Furthermore, we use these schemas to automatically exploit the information encoded as typed feature structures.



We have also presented EULIA, a graphical environment the aim of which is to exploit and manipulate the data created by the linguistic processors. EULIA offers facilities to browse over the annotation architecture, pose queries and perform manual disambiguation/annotation of corpora.

Finally, we have briefly explained two real cases that show the flexibility and robustness of our annotation model as well as the benefits of an environment like EULIA in manual annotation and disambiguation processes.

References

Aduriz, Itziar, Eneko Agirre, Izaskun Aldezabal, Iñaki Alegria, Xabier Arregi, Jose Mari Arriola, Xabier Artola, Koldo Gojenola, Aitor Maritxalar, Kepa Sarasola, and Miriam Urkia. 2000. A Word-grammar based morphological analyzer for agglutinative languages. In Proc. of International Conference on Computational Linguistics. COLING'2000, Saarbrücken (Germany).

Aduriz, Itziar, Izaskun Aldezabal, Iñaki Alegria, Xabier Artola, Nerea Ezeiza, and Ruben Urizar. 1996. EUSLEM: A Lemmatiser / Tagger for Basque. In EURALEX'96, Part 1, 17-26, Göteborg.

Areta, Nerea, Antton Gurrutxaga, Igor Leturia, Ziortza Polin, Rafael Saiz, Iñaki Alegria, Xabier Artola, Arantza Díaz de Ilarraza, Nerea Ezeiza, Aitor Sologaistoa, Aitor Soroa, and Andoni Valverde. 2006. Structure, Annotation and Tools in the Basque ZT Corpus. In LREC 2006, Genoa (Italy).

Artola, Xabier, Arantza Díaz de Ilarraza, Nerea Ezeiza, Koldo Gojenola, Aitor Sologaistoa, and Aitor Soroa. 2004. EULIA: a graphical web interface for creating, browsing and editing linguistically annotated corpora. In LREC 2004. Workshop on XbRAC, Lisbon (Portugal).

Artola, Xabier, Arantza Díaz de Ilarraza, Nerea Ezeiza, Gorka Labaka, Koldo Gojenola, Aitor Sologaistoa, and Aitor Soroa. 2005. A framework for representing and managing linguistic annotations based on typed feature structures. In RANLP 2005, Borovets (Bulgaria).

Bird, Steven, David Day, John Garofolo, John Henderson, Christophe Laprun, and Mark Liberman. 2000. ATLAS: A flexible and extensible architecture for linguistic annotation. In Proc. of the Second International Conference on Language Resources and Evaluation, pages 1699–1706, Paris (France).

Bontcheva, Kalina, Valentin Tablan, Diana Maynard, and Hamish Cunningham. 2004. Evolving GATE to meet new challenges in language engineering. Natural Language Engineering, 10(3-4):349–373.

Cunningham, Hamish, Yorick Wilks, and Robert J. Gaizauskas. 1996. GATE: a General Architecture for Text Engineering. In Proceedings of the 16th conference on Computational linguistics, pages 1057–1060. Association for Computational Linguistics.

Ferrucci, David and Adam Lally. 2004. UIMA: an architectural approach to unstructured information processing in the corporate research environment. Natural Language Engineering, 10(3-4):327–348.

Ide, Nancy and Laurent Romary. 2004. International standard for a linguistic annotation framework. Natural Language Engineering, 10(3-4):211–225.

Laprun, Christophe, Jonathan Fiscus, John Garofolo, and Sylvain Pajot. 2002. A practical introduction to ATLAS. In Proceedings of the Third International Conference on Language Resources and Evaluation.

Neff, Mary S., Roy J. Byrd, and Branimir K. Boguraev. 2004. The Talent system: TEXTRACT architecture and data model. Natural Language Engineering, 10(3-4):307–326.

Schäfer, Ulrich. 2003. WHAT: An XSLT-based infrastructure for the integration of natural language processing components. In Proceedings of the Workshop on the Software Engineering and Architecture of Language Technology Systems (SEALTS), HLT-NAACL03, Edmonton (Canada).

Sperberg-McQueen, C. M. and L. Burnard, editors. 2002. TEI P4: Guidelines for Electronic Text Encoding and Interchange. Oxford, 4th edition.

Thompson, H.S., R. Tobin, D. McKelvie, and C. Brew. 1997. LT XML Software API and toolkit for XML processing. www.ltg.ed.ac.uk/software/xml/index.html.



Determining the representativeness threshold of a corpus using the N-Cor algorithm

Gloria Corpas Pastor Departamento de Traducción e Interpretación

Facultad de Filosofía y Letras Universidad de Málaga

[email protected]

Míriam Seghiri Domínguez Departamento de Traducción e Interpretación

Facultad de Filosofía y Letras Universidad de Málaga

[email protected]

Summary: In the following pages we describe a method for calculating the minimum representativeness threshold of a corpus by means of the N-Cor algorithm, which analyses lexical density as a function of the incremental growth of the corpus. It offers an effective way to determine a posteriori, for the first time objectively and quantifiably, the minimum size a corpus must reach in order to be considered statistically representative. The method has been implemented in the ReCor software application. We use this tool to check whether a corpus of travel insurance policies in Spanish that we have compiled is representative enough for text-linguistic studies and for use in translation. Keywords: Representativeness, corpus linguistics, corpus compilation, specialised corpus.

Abstract: In this paper we describe a method to determine the representativeness threshold for any given corpus. By using the N-Cor algorithm it is possible to quantify a posteriori the minimum number of documents and words that should be included in a specialised language corpus for it to be considered representative. This method has been implemented by means of a computer program (ReCor). This program will be used here to check whether a corpus of insurance policies in Spanish is representative enough to carry out text-linguistic studies and translation tasks. Keywords: Representativeness, corpus linguistics, corpus compilation, specialised corpus.

1 Introduction

To date, much has been written and researched about quantity as a criterion of representativeness, and about possible formulae for estimating the minimum number of words and documents from which a specialised corpus can be considered representative, without conclusive results.

There have been several attempts to fix a size, or at least a minimum size, for specialised corpora. Some of the most significant are those put forward by Heaps (1978), Young-Mi (1995) and Sánchez Pérez and Cantos Gómez (1997). According to Yang et al. (2000: 21), these proposals have serious shortcomings because they are based on Zipf's law. Determining the minimum size of a corpus remains one of the most controversial issues today (cf. Corpas Pastor and Seghiri Domínguez, 2007/in press). Widely differing figures have been put forward. By way of illustration, Biber (1993), in one of the most influential papers on corpora and representativeness, goes as far as to claim that practically all the features of a particular register can be represented with relatively few examples, a thousand words, and a small number of texts belonging to that register, specifically ten.



The question urgently needs an answer, since most linguistic and translation studies rely on small corpora suited to their specific research needs, collections of texts downloaded directly from electronic information sources. The World Wide Web is today one of the main suppliers of raw material for this "do-it-yourself" corpus linguistics. Moreover, this type of ad hoc corpus, compiled virtually, has proved extremely useful for linguistic studies (cf. Haan, 1989, 1992; Kock, 1997, 2001; Ghadessy, 2001), for second language teaching (Bernardini, 2000; Aston et al., 2004) and for translation (Corpas Pastor, 2001, 2004; Seghiri Domínguez, 2006).

The widely differing figures handled so far, together with the low reliability of the proposals for calculating them, led us to look for a possible solution, which has taken the form of the software application ReCor, described below.

2 Description of the ReCor program

Leaving aside the fact that the representativeness of a corpus depends, first of all, on having applied appropriate external and internal design criteria, in practice the minimum size a specialised corpus must have has not yet been quantified objectively. As shown above, there is no consensus on the minimum number of documents or words a given corpus needs in order to be considered valid and representative of the population it is meant to represent, and the figures vary from one author to another. None of these figures solves the problem of calculating the representativeness of a corpus, because they are set a priori and lack any empirical, objectifiable basis.

With this method we aim to offer an effective solution for determining, for the first time, a posteriori the minimum size of a corpus or text collection, regardless of the language or text type of the collection. The minimum representativeness threshold is thus established by an algorithm (N-Cor) that analyses lexical density as a function of the incremental growth of the corpus.

2.1. The N-Cor algorithm

The method calculates the minimum size of a corpus by analysing lexical density (d) in relation to the incremental growth of the corpus (C), document by document, as shown in the following equation:

$C_n = d_1 + d_2 + d_3 + \dots + d_n$

Figure 1: Base equation of the N-Cor algorithm

To do so, all the files that make up the corpus are analysed step by step, extracting information on the frequency of word types and of occurrences or instances (tokens) in each file. Two file selection criteria are used in this operation, alphabetical order and random order, to ensure that the order in which the files are selected does not affect the result. When the documents are selected in alphabetical order, the algorithm analyses the first file and calculates its tokens and types and the corresponding lexical density. This yields the first point of the graph to be plotted. Next, following the same selection criterion, the next document in the corpus is taken and its tokens and types are calculated, but the results are added to the tokens and types of the previous iteration (here, those of the first document analysed); the lexical density is calculated and a second point is obtained for the graph. The procedure continues until all the documents in the corpus under study have been processed. The second phase of the analysis is identical, but takes the documents in random order.
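The incremental procedure just described can be sketched in a few lines of Python. This is an illustrative reading of the description, not the ReCor source code: the tokenizer, the function names (`density_curve`, `n_cor_curves`), the fixed random seed and the toy corpus are all assumptions made for the example.

```python
import random
import re


def tokenize(text):
    """Lower-cased word tokens; a simple assumption, not ReCor's tokenizer."""
    return re.findall(r"\w+", text.lower())


def density_curve(documents):
    """Cumulative types/tokens ratio after each document is added."""
    seen_types = set()
    total_tokens = 0
    curve = []
    for text in documents:
        tokens = tokenize(text)
        total_tokens += len(tokens)   # tokens accumulate across iterations
        seen_types.update(tokens)     # so do the distinct word forms
        curve.append(len(seen_types) / total_tokens if total_tokens else 0.0)
    return curve


def n_cor_curves(corpus, seed=0):
    """corpus maps file name -> text; returns (alphabetical, random) curves."""
    names = sorted(corpus)
    alphabetical = density_curve(corpus[name] for name in names)
    shuffled = names[:]
    random.Random(seed).shuffle(shuffled)
    randomised = density_curve(corpus[name] for name in shuffled)
    return alphabetical, randomised


# Toy corpus, invented for illustration.
corpus = {
    "a.txt": "the policy covers the trip",
    "b.txt": "the policy covers the luggage",
    "c.txt": "the trip",
}
alphabetical, randomised = n_cor_curves(corpus)
print(alphabetical)
print(randomised)
```

Each list holds one point per document, so plotting both against the document count gives the two lines the method compares. Whatever the order, the last point is the same, because it describes the whole corpus.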

The same algorithm is used for the analysis of n-grams, that is, the option of analysing the frequency of word sequences (2-grams, 3-grams, ..., n-grams). The application can count these sequences over a range of sequence lengths (numbers of words) defined by the user. As with tokens, a graph shows the representativeness of the corpus both for a random ordering of the files and for an alphabetical ordering by file name. The horizontal axis shows the number of files processed, and the vertical axis the ratio (number of distinct n-grams)/(total number of n-grams). For this purpose, each instance of an n-gram is counted as a token. The output files generated likewise list the n-grams.
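The n-gram variant can be sketched in the same way. The code below is a hedged illustration rather than ReCor's implementation: the function names are hypothetical, the input is assumed to be already tokenized, and n-grams are assumed not to cross document boundaries, a detail the description leaves open.

```python
def ngrams(tokens, n):
    """All contiguous n-word sequences in one document, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_density_curve(token_lists, n):
    """Cumulative (distinct n-grams)/(total n-grams) after each document."""
    seen = set()
    total = 0
    curve = []
    for tokens in token_lists:
        grams = ngrams(tokens, n)
        total += len(grams)   # every n-gram instance counts as a token
        seen.update(grams)    # distinct n-grams play the role of types
        curve.append(len(seen) / total if total else 0.0)
    return curve


# Toy documents, invented for illustration.
docs = [
    ["the", "policy", "covers", "the", "trip"],
    ["the", "policy", "covers", "the", "luggage"],
]

# A user-defined range of sequence lengths, here 1 to 3 words.
for n in range(1, 4):
    print(n, ngram_density_curve(docs, n))
```

With n = 1 the function reduces to the word-level types/tokens ratio, which is why the same plotting and file ordering logic can be reused for any n.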

In both the alphabetical and the random n-gram analysis, a point will be reached at which a given document adds hardly any types to the corpus, which indicates that an adequate size has been reached, that is, that the corpus analysed can be considered a statistically representative sample of the population. On a graph, this is the point at which the types and tokens lines level off and approach zero. If the corpus is truly representative, the graph will tend to fall exponentially, because in each iteration the tokens grow much faster than the types: in theory, fewer and fewer new words appear that are not already stored in the program's data structures. We can therefore say that the corpus is representative when the graph remains constant at values close to zero. It does not reach zero because documents will always contain items such as numbers or proper names, which tend to be hapax legomena and therefore increase the lexical variability of the corpus. One possible solution would be to use regular expressions and shallow parsing techniques to detect proper names. In any case, it should be noted that in practice no document can be expected to add zero types to the corpus, although the rate at which new types are added does become very low, as Heaps' law predicts.
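As a reminder of why the rate of new types falls without ever reaching zero, Heaps' law can be written as below. This is the standard textbook form of the law, not a formula taken from ReCor; $V(N)$ denotes the number of types after $N$ tokens, and $K$ and $\beta$ are corpus-dependent constants introduced here only for illustration.

```latex
\begin{align*}
V(N) &= K N^{\beta}, \qquad K > 0,\; 0 < \beta < 1 \\
\frac{V(N)}{N} &= K N^{\beta - 1} \longrightarrow 0 \quad (N \to \infty) \\
\frac{dV}{dN} &= K \beta N^{\beta - 1} > 0 \quad \text{for every finite } N
\end{align*}
```

The second line corresponds to the types/tokens ratio approaching zero as the corpus grows; the third shows that the rate at which new types appear keeps shrinking but stays positive.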

2.2. Program specifications

ReCor is a software application created to estimate the representativeness of corpora as a function of their size. It is characterised above all by the simplicity of its user interface (cf. Figure 2), in contrast to the heavily mathematical and formula-laden treatment that abounds in work of this kind.

Figure 2: ReCor interface (version 2.1)

Three versions of ReCor have been implemented so far: 1.0, 2.0 and 2.1. They work in basically the same way, which corresponds to the general description given below. Version 2.0 differs from version 1.0 in that a) it allows a complete directory of documents to be selected automatically (instead of having to hold down the Shift key, as in the previous version) and b) it allows the size n of the n-grams used in the calculation to be selected, where 1 ≤ n ≤ 10. Both versions (1.0 and 2.0) generate statistical files in plain text (.txt). Version 2.1 differs from its predecessor in that it presents the statistical files simultaneously in .txt format and as Excel tables.


3 How the program works

In this section we show ReCor in operation (version 2.1). To illustrate how the program works, we have compiled a corpus of travel insurance policies in Spanish. By design, this corpus is monolingual, comparable, textual and specialised, and so meets the parameters for corpus creation; it could therefore be used independently for linguistic and translation studies of the formal features of this type of contract.

Thanks to its simple interface, ReCor is easy to use. We first select the files that make up the subcorpus of travel insurance policies in Spanish using the «Selección de los ficheros del corpus» (select corpus files) button. Once the files that make up the Spanish corpus have been selected, a «filtro de palabras» (word filter) can be added if desired. In our case, we included a filter containing Roman numerals. The program also generates three output files (Análisis estadístico, Palabras ord. alf. and Palabras ord. frec.), which are created by default in the location determined by the application. If another location is wanted for the output files, a new path can be given. The first file, «Análisis estadístico» (statistical analysis), contains the results of two separate analyses: one for the files in alphabetical order by name and one for the files in random order. The document is structured in five columns: types, tokens, the ratio of distinct words to total words (types/tokens), the number of words with one occurrence (V1) and the number of words with two occurrences (V2). The second, «Palabras ord. alf.» (words in alphabetical order), has two columns, with the words in alphabetical order in one and their number of occurrences in the other. The third, «Palabras ord. frec.» (words by frequency), gives the same information as the previous output file, but with the words ordered by frequency, that is, by rank.
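The contents of the three output files can be illustrated with a short sketch. It is not ReCor's code: the function name `corpus_statistics`, the assumption that filtered words are dropped before counting, and the sample tokens are all hypothetical.

```python
from collections import Counter


def corpus_statistics(tokens, stop_words=frozenset()):
    """Return one statistics row plus the two word lists.

    The row holds types, tokens, types/tokens, V1 and V2. Words in
    stop_words (for example Roman numerals) are assumed to be removed
    before counting.
    """
    counts = Counter(t for t in tokens if t not in stop_words)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    ratio = n_types / n_tokens if n_tokens else 0.0
    v1 = sum(1 for c in counts.values() if c == 1)  # words occurring once
    v2 = sum(1 for c in counts.values() if c == 2)  # words occurring twice
    alphabetical = sorted(counts.items())
    # Highest frequency first; ties broken alphabetically for stable output.
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return (n_types, n_tokens, ratio, v1, v2), alphabetical, by_frequency


# Sample tokens, invented for illustration.
tokens = ["clause", "iv", "the", "policy", "covers", "the",
          "trip", "and", "the", "luggage", "policy"]
row, alphabetical, by_frequency = corpus_statistics(tokens, stop_words={"iv"})
print(row)
print(alphabetical)
print(by_frequency)
```

In the incremental analysis one such row would be produced after each document is added, once for the alphabetical ordering and once for the random one.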

Finally, we specify «Grupo de palabras» (word group), that is, the n-gram size. For a first illustration we choose one (cf. Figure 3). We also select «sí» (yes) for the «Filtrar números» (filter numbers) option.

3.1. Graphical representations

Once the steps described above have been followed, the application is ready to run the analysis. The results are given as graphical representations and as .txt output files containing statistical data, which can be exported to tables, together with Excel tables. To generate graphical representations A and B, we click «Aceptar» (OK). In addition to the output files, ReCor creates graphs A and B, which allow us to determine whether our collection is indeed representative (cf. Figure 3). The time the program takes to generate the graphs and the analysis files depends on the n-gram size selected for the calculation, the size of the corpus analysed and the version used.

Figure 3: Representativeness of the travel insurance corpus (1-grams)

From the data produced by ReCor, we can conclude that the Spanish corpus of travel insurance policies (cf. Figure 3) is representative from 140 documents and 1.0 million words onwards.

To see the results for two or more grams, we repeat the steps above and enter the figure in «Grupo de palabras». The results produced by ReCor for 2-grams are shown next.



Figure 4: Representativeness of the travel insurance corpus (2-grams)

The data the program provides for 2-grams thus show that the Spanish corpus of travel insurance policies (cf. Figure 4) is representative from 150 documents and 1.25 million words onwards.

3.2. Statistical data

In addition to graphs A and B, the program simultaneously generates three types of output file, whose format (.txt and Excel) depends on the version used. The first presents an «Análisis estadístico» (statistical analysis) of the corpus, in both alphabetical and random order, structured in five columns: types, tokens, the ratio of distinct words to total words (types/tokens), the number of words with one occurrence (V1) and the number of words with two occurrences (V2):

Figure 5: Output file (Análisis estadístico), Spanish (v. 2.1)

This statistical analysis shows that the types (first column) stop increasing and remain stable at 9265.0, even though the volume of the corpus (tokens) keeps growing, as the second column shows (from 392012.0 to 540634.0). This confirms that the corpus is already representative for this specialised field and that including new texts will add hardly anything significantly new to the corpus.

The second type of file, «Palabras ord. alf.», shows the words in the corpus in alphabetical order (first column) together with their frequency (second column):

Figure 6: Output files (Palabras ord. alf.) for the travel insurance corpus (Spanish)

Finally, the third output file, «Palabras ord. frec.», presents the words of the corpus (first column) ordered by frequency (second column):



Figure 7: Output files (Palabras ord. frec.) for the travel insurance corpus (Spanish)

Finally, in addition to the .txt results above, version 2.1 simultaneously generates Excel tables. Figure 8 shows an Excel table of 2-grams, ordered by frequency, generated by version 2.1 for the Spanish corpus.

Figure 8: List of 2-grams by frequency, Spanish (v. 2.1)

4 Conclusions

One of the main characteristics of virtual or ad hoc corpora is that they tend to be markedly unbalanced, since their final size and composition are normally determined, especially in specialised languages, by availability (Giouli and Piperidis, 2002); it is therefore essential to have tools that ensure their representativeness. The problem, however, is that there is no agreement on how large a corpus must be to be considered "representative", even though "representativeness" is the key concept that distinguishes a corpus from other kinds of text collections and repositories. Moreover, the proposals made so far for calculating representativeness are not reliable, as already noted. Aware of these shortcomings, Yang et al. (2000) tried to overcome them with a new proposal, a mathematical formulation able to predict the relationship between the types in a corpus and its size (tokens). At the end of their paper, however, the authors acknowledge that their approach has serious limitations, among which they highlight the following: "the critical problem is, however, how to determine the value of tolerance error for positive predictions" (Yang et al. 2000: 30).

Our proposal improves on earlier ones in that it does not need to determine the constant C (= corpus size) in order then to calculate its representativeness (which is, in any case, almost tautological), as is usual in approaches based on Zipf's law. Nor does it need to determine the maximum tolerance error, which is the main weakness of the approaches of Biber (1993) and Yang et al. (2000). The N-Cor algorithm makes it possible to establish a posteriori, without preset values, the representativeness threshold of a well-built corpus, that is, one compiled according to qualitative (external and internal) design criteria. Specifically, it starts from the idea that the ratio of distinct words to total words in a text (types/tokens), which reflects the lexical density or richness of the text, no longer increases proportionally beyond a certain number of texts. The same holds when representativeness is calculated from the lexical density of word sequences (n-grams).

On this theoretical basis, a program (ReCor) has been implemented that shows graphically the point from which a corpus compiled according to qualitative criteria begins to be representative in quantitative terms. The graph consists of two lines, one for documents added in alphabetical order and one for documents added in random order, which level off as they approach zero and so show the minimum size the collection needs in order to be considered representative.

For small specialised corpora in specific domains, an optimal number of words or documents cannot be determined exactly a priori, since it depends on the constraints of the specialised field and of each country and language. Our method allows the estimate to be made a posteriori, that is, once the corpus has been compiled, during compilation, or during the analysis and verification phase.

So far the methodology has been tested successfully on specialised corpora of travel insurance policies and of general terms and conditions of package travel contracts in English, Spanish, German and Italian (cf. Corpas Pastor and Seghiri Domínguez, 2007/in press). It has also been used to check the representativeness of the multilingual corpus used by the Agencia Catalana de Noticias to feed its Spanish-English-French-Catalan-Aranese (Occitan) machine translation system.

We are currently working on a new version (ReCor 3.0) optimised to work quickly with many files or with very large files and, at the same time, able to extract phraseological units from the n-gram analysis (1 ≤ n ≤ 10) of the corpus.

References

Aston, G., S. Bernardini and D. Stewart. 2004. Corpora and Language Learners. Amsterdam and Philadelphia: John Benjamins.

Bernardini, S. 2000. Competence, capacity, corpora. Bologna: Cooperativa Libraria Universitaria Editrice.

Biber, D. 1993. «Representativeness in Corpus Design». Literary and Linguistic Computing. 8 (4). 243-257.

Corpas Pastor, G. 2001. «Compilación de un corpus ad hoc para la enseñanza de la traducción inversa especializada». TRANS: revista de traductología. 5. 155-184.

Corpas Pastor, G. 2004. «Localización de recursos y compilación de corpus vía Internet: Aplicaciones para la didáctica de la traducción médica especializada». In Consuelo Gonzalo García and Valentín García Yebra (eds.). Manual de documentación y terminología para la traducción especializada. Madrid: Arco/Libros. 223-257.

Corpas Pastor, G.; Seghiri Domínguez, S. 2007/in press. El concepto de representatividad en lingüística de corpus: aproximaciones teóricas y consecuencias para la traducción. Málaga: Servicio de Publicaciones de la Universidad.

Ghadessy, M., A. Henry and R. L. Roseberry (eds.). 2001. Small corpus studies and ELT: theory and practice. Amsterdam and Philadelphia: John Benjamins.

Giouli, V. and S. Piperidis. 2002. Corpora and HLT. Current trends in corpus processing and annotation. Bulgaria: Institute for Language and Speech Processing. N. pag. <http://www.larflast.bas.bg/balric/eng_files/corpora1.php> [Accessed: 18/05/2007].

Haan, P. 1989. Postmodifying clauses in the English noun phrase. A corpus-based study. Amsterdam: Rodopi.

Haan, P. 1992. «The optimum corpus sample size?». In Gerhard Leitner (ed.). New dimensions in English language corpora. Methodology, results, software development. Berlin and New York: Mouton de Gruyter. 3-19.



Heaps, H. S. 1978. Information Retrieval: Computational and Theoretical Aspects. New York: Academic Press.

Kock, J. 1997. «Gramática y corpus: los pronombres demostrativos». Revista de filología románica. 14 (1): 291-298. <http://www.ucm.es/BUCM/revistas/fll/0212999x/articulos/RFRM9797120291A.PDF> [Accessed: 18/05/2007].

Kock, J. 2001. «Un corpus informatizado para la enseñanza de la lengua española. Punto de partido y término». Hispanica Polonorum. 3: 60-86. <http://hispanismo.cervantes.es/documentos/kock.pdf> [Accessed: 18/05/2007].

Sánchez Pérez, A. and P. Cantos Gómez. 1997. «Predictability of Word Forms (Types) and Lemmas in Linguistic Corpora. A Case Study Based on the Analysis of the CUMBRE Corpus: An 8-Million-Word Corpus of Contemporary Spanish». International Journal of Corpus Linguistics. 2 (2): 259-280.

Seghiri Domínguez, M. 2006. Compilación de un corpus trilingüe de seguros turísticos (español-inglés-italiano): aspectos de evaluación, catalogación, diseño y representatividad. Doctoral thesis. Málaga: Universidad de Málaga.

Yang, D., P. Cantos Gómez and M. Song. 2000. «An Algorithm for Predicting the Relationship between Lemmas and Corpus Size». ETRI Journal. 22 (2): 20-31. <http://etrij.etri.re.kr/Cyber/servlet/GetFile?fileid=SPF-1042453354988> [Accessed: 18/05/2007].

Young-Mi, J. 1995. «Statistical Characteristics of Korean Vocabulary and Its Application». Lexicographic Study. 5 (6): 134-163.

1 This work was carried out within the project La contratación turística electrónica multilingüe como mediación intercultural: aspectos legales, traductológicos y terminológicos (Ref. no. HUM-892, 2006-2009. Proyecto de Excelencia, Junta de Andalucía).

2 The methodology described in this paper received the Translation Technologies Research Award (Premio de Investigación en Tecnologías de la Traducción, 3rd edition) granted by the Observatorio de Tecnologías de la Traducción. For further information, see <http://www.uem.es/web/ott/>.

3 This method has been awarded the Translation Technologies Research Award (Premio de Investigación en Tecnologías de la Traducción) by the Translation Technologies Watch (Observatorio de Tecnologías de la Traducción). Further information at the URL: <http://www.uem.es/web/ott/>.

4 For a broader view of the protocol for compiling specialised corpora, see Seghiri Domínguez (2006).

5 Although it is a monolingual (Spanish) corpus, it is diatopically delimited: the texts that make up the travel insurance corpus are formal elements of the contract drafted exclusively in Spain.

6 It is a comparable corpus because it consists of original texts for tourism contracting, specifically formal elements of the contract and legislation.

7 The compiled travel insurance corpus includes complete documents, since this type of corpus allows lexical and discourse-analysis research and also makes it possible to create a subcorpus, or component, from a selection of smaller fragments (Sinclair, 1991). Indeed, Sinclair (1991) and Alvar Ezquerra et al. (1994) have stressed the need to include whole texts, because doing so removes the debate about the representativeness of the different parts of a text and about the validity of sampling techniques.

8 The texts that make up the travel insurance corpus are, specifically, formal elements of the contract, namely insurance applications, proposals, letters of guarantee and policies.



Semi-automatic Resource Generation ∗

Fernando Enríquez, José A. Troyano, Fermín Cruz and F. Javier Ortega
Dep. de Lenguajes y Sistemas Informáticos
Universidad de Sevilla
Avda. Reina Mercedes s/n
41012 Sevilla

Abstract: The results of many algorithms applied to natural language processing tasks depend on the availability of large linguistic resources, from which they obtain the knowledge they need to do their work. The existence of these resources therefore determines the quality of the results, the overall performance of the system and, at times, both. We show several aspects of the effort required to create these resources, which justify the attempts to develop methods that lighten this task, together with several proposals that have been put forward to address the problem. These proposals can be regarded as alternative approaches to the problem we want to solve; they tackle it in very different ways, some of which we may be able to adapt to our own implementations in the near future.
Keywords: Resource generation, machine learning, system combination

1. Introduction

Without doubt, the greatest problem in creating linguistic resources is the effort required to obtain results large enough to be useful to the algorithms that need them. In general, a supervised learning algorithm that uses a corpus annotated for a given task requires a very large number of annotated words or sentences before its results can be considered good, although this depends on the algorithm and on the task at hand.

∗ Partially funded by the Ministerio de Educación y Ciencia (TIN2004-07246-C03-03).

If we focus on a widely known task in natural language processing, word sense disambiguation, we can get an idea of this effort. The task is to select the meaning of a word in a text from among all the meanings it has. Ambiguity is very common, although humans are so used to it, and so good at resolving it from the context of the words, that it goes almost unnoticed. Many algorithms with very good results have been developed for this task, but the availability of annotated corpora remains a problem. The study in (Ng, 1997) states that good accuracy requires at least 500 examples for each ambiguous word to be handled (this figure is an average, since there are considerable differences from one word to another). At a rate of one annotated example per minute, and assuming some 20,000 ambiguous words in the common English vocabulary, this leads to roughly 160,000 hours of annotation, which amounts to no less than 80 years of full-time work for one person. If we add that annotation is usually carried out by trained linguists or experts, there is no doubt that the process is very expensive and, in the vast majority of cases, prohibitive.
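The estimate above can be reproduced with a few lines of arithmetic. The sketch below uses the figures quoted in the text; the value of 2,000 working hours per year is an assumption introduced here to convert hours into person-years. The exact product is about 166,667 hours, which the text rounds to roughly 160,000 hours and 80 years.

```python
EXAMPLES_PER_WORD = 500    # average quoted from (Ng, 1997)
AMBIGUOUS_WORDS = 20_000   # ambiguous words in common English vocabulary
MINUTES_PER_EXAMPLE = 1    # annotation rate assumed in the text
HOURS_PER_YEAR = 2_000     # assumption: one full-time working year

total_examples = EXAMPLES_PER_WORD * AMBIGUOUS_WORDS
total_hours = total_examples * MINUTES_PER_EXAMPLE / 60
person_years = total_hours / HOURS_PER_YEAR

print(f"{total_examples:,} examples, {total_hours:,.0f} hours, "
      f"{person_years:.0f} person-years")
```

Changing any of the four constants shows how sensitive the total is to the assumed annotation rate and to the number of examples required per word.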

All this is a limitation that ends up reducing the number of examples available, affecting the task in general and possibly the development of new lines of research that could improve results. This is therefore the starting point of a line of future work that we wish to pursue, and from which we will try to draw satisfactory solutions to this problem.

In the following sections we review some techniques used to create linguistic resources, starting in section 2 with an algorithm that uses queries to web search engines. Section 3 discusses crowdsourcing techniques, whose use is spreading rapidly, while sections 4 and 5 discuss methods for combining and importing resources, respectively. Section 6 covers bootstrapping techniques, and a final section presents the conclusions.

2. Using Web Searches

One of the approaches that has emerged to mitigate the enormous effort required to create resources is the use of the Web. The content of the Web can be regarded as a huge corpus that can be exploited for various tasks, although its structure and contents are so heterogeneous that it is not always clear how to take advantage of all the information it holds.

(Mihalcea, 2002) offers an excellent example of how the Web can be used to obtain linguistic resources through the search engines at our disposal. The task addressed in that work is word sense disambiguation, and the proposed system uses several available resources, such as the SemCor corpus (Miller, 1993) and the WordNet lexical database (Miller, 1995). The algorithm is summarised in Figure 1.

The seeds are multi-word units that contain an ambiguous word, so that the expression itself constrains the possible meaning of the word of interest.

The algorithm uses WordNet to build queries that contain synonyms or definitions of the intended meaning of the words of interest, and runs those queries on Internet search engines to obtain texts related to those definitions. WordNet is first searched for monosemous synonyms; if there are none, definitions of the word are used instead. From the search results, the sentences that contain the definition or synonym are selected, and the definition or synonym is replaced by the original word, which yields an example of use of that word with its meaning.
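The query-building step just described can be sketched as follows. This is only an illustration under stated assumptions: `SENSES` and `SENSE_COUNT` are a hypothetical miniature lexicon standing in for WordNet (the entries are invented for the example), no real search engine is called, and the function names are not taken from (Mihalcea, 2002).

```python
# Hypothetical miniature lexicon standing in for WordNet (invented entries).
SENSES = {
    ("plant", 1): {"synonyms": ["works", "industrial plant"],
                   "gloss": "buildings for carrying on industrial labor"},
    ("plant", 2): {"synonyms": ["flora"],
                   "gloss": "a living organism lacking the power of locomotion"},
}
# Assumed number of senses of each candidate synonym (1 means monosemous).
SENSE_COUNT = {"works": 7, "industrial plant": 1, "flora": 2}

def build_query(word, sense):
    """Prefer a monosemous synonym; fall back to the definition of the sense."""
    entry = SENSES[(word, sense)]
    for synonym in entry["synonyms"]:
        if SENSE_COUNT.get(synonym, 0) == 1:
            return synonym
    return entry["gloss"]

def make_example(sentence, matched, word, sense):
    """Replace the matched synonym or definition by the original word and tag it."""
    if matched not in sentence:
        return None
    return sentence.replace(matched, word), sense

query = build_query("plant", 1)
example = make_example("The industrial plant closed in May.", query, "plant", 1)
```

In a real setting the sentence passed to `make_example` would come from the results returned by a search engine for `query`; here it is supplied by hand.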

1. Create a set of seeds, made up of:
   1.1 Examples from SemCor.
   1.2 Examples from WordNet.
   1.3 Annotated examples created by searching the Web for monosemous synonyms or definitions of the word.
   1.4 Additional manually annotated examples (if available).
2. Search the Web using the seed expressions.
3. Disambiguate the words in a context close to the text surrounding the seed expressions. Add the examples formed with the disambiguated words to the set of seeds.
4. Go back to step 2.

Figure 1: Web search algorithm.

Once we have the expressions found by exploring the Web with the seeds, an iterative disambiguation algorithm is applied that uses several procedures, whose key points are:

1. Locate entities, such as names of people, places and organisations, and mark their meaning.

2. Locate monosemous words and mark their meaning.

3. For each word, form pairs of the word with the preceding word and with the following word. If those pairs occur in the SemCor corpus often enough (more than a preset threshold) and always with the same meaning, assign that meaning to the word.

4. For nouns, build a context for each possible meaning, containing the nouns that usually occur nearby. Compare it with the current context of the noun and choose the most similar meaning.

5. Look for semantic connections between words: if a word has a meaning that makes it a synonym of another, already disambiguated word, assign it that meaning. Hypernymy and hyponymy relations are also examined, and connections are sought between pairs of words that are both still undisambiguated.
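The third procedure in the list (word pairs checked against a sense-tagged corpus) lends itself to a short sketch. The code below is an assumption-laden illustration, not the original implementation: the corpus is a hypothetical three-sentence stand-in for SemCor, sense labels such as `plant#1` are invented, and the threshold value is arbitrary.

```python
from collections import defaultdict

def build_pair_index(tagged_sentences):
    """Map (word, side, neighbour) to {sense: count} from a sense-tagged corpus."""
    index = defaultdict(lambda: defaultdict(int))
    for sent in tagged_sentences:          # sent: list of (word, sense or None)
        for i, (word, sense) in enumerate(sent):
            if sense is None:
                continue
            if i > 0:
                index[(word, "prev", sent[i - 1][0])][sense] += 1
            if i + 1 < len(sent):
                index[(word, "next", sent[i + 1][0])][sense] += 1
    return index

def sense_from_pairs(words, i, index, threshold=2):
    """Return a sense for words[i] if some pair is frequent and unambiguous."""
    keys = []
    if i > 0:
        keys.append((words[i], "prev", words[i - 1]))
    if i + 1 < len(words):
        keys.append((words[i], "next", words[i + 1]))
    for key in keys:
        senses = index.get(key)
        if senses and len(senses) == 1:    # always seen with the same sense
            (sense, count), = senses.items()
            if count > threshold:          # more often than the preset threshold
                return sense
    return None

# Hypothetical tagged corpus standing in for SemCor.
corpus = [
    [("power", None), ("plant", "plant#1"), ("closed", None)],
    [("the", None), ("power", None), ("plant", "plant#1")],
    [("a", None), ("power", None), ("plant", "plant#1"), ("opened", None)],
]
index = build_pair_index(corpus)
result = sense_from_pairs(["new", "power", "plant"], 2, index, threshold=2)
```

The pair ("power", "plant") occurs three times in the toy corpus, always with the same sense, so the word receives that sense; a pair never seen in the corpus leaves the word untagged for the later procedures.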

Experiments measuring the quality of the corpora obtained with this algorithm show results comparable to those obtained with manually annotated corpora. Specifically, the authors ran experiments with several semantic tagging tools, using a manually annotated corpus on the one hand and the corpus obtained automatically with this algorithm on the other. The accuracy achieved with the automatic corpus was sometimes even better than that obtained with the same tools using the manual corpus.

3. Crowdsourcing

Crowdsourcing is a recently coined term that represents a step beyond outsourcing. Outsourcing is based on delegating certain tasks to external entities in order to save costs and simplify the development process of a project (companies have generally looked to India or China). The new opportunities for savings may lie in the dispersed and anonymous work of a multitude of Internet users who carry out tasks of greater or lesser value for an organisation that manages to attract their attention in one of the many possible ways. This way of gathering effort and directing it towards a goal related to a specific task is called crowdsourcing 1.

1 From the English 'crowd' and 'source'.

The term was coined by Jeff Howe, who in (Howe, 2006) discusses several examples in which this way of working has been applied. The article begins with the case of a professional photographer who loses a client when the client discovers that photos can be bought through iStockPhoto at a much lower price (the client was only looking for photos of sick people for a project in progress). This site publishes a very large number of photos taken by amateurs, which are very useful in many cases and avoid the high price that a professional commissioned directly would charge. It is one more example in which the work of thousands of people can be harnessed, changing a business scenario that at first seemed unshakeable. Each participant can publish all kinds of photos, charging very little for each one but making them available to anyone connected to the Internet. This leads the author to say:

Welcome to the age of the crowd. Just as distributed computing projects like UC Berkeley’s SETI@home have tapped the unused processing power of millions of individual computers, so distributed labor networks are using the Internet to exploit the spare processing power of millions of human brains.

Along the same lines, there are many projects, systems and applications that try to take advantage of this potential. One example is Wikipedia, an encyclopedia that is rapidly gaining favour among Internet users and is built from the anonymous contributions of anyone who wants to add to this collection of knowledge. Another is television programmes based entirely on material created by the viewers themselves (broadcasting their home videos, musical compositions, etc.), which often achieve spectacular audience figures at almost no cost to the network. Other examples are the InnoCentive project, which publishes technically or scientifically difficult problems faced by all kinds of companies so that anyone can try to solve them (and receive large financial rewards), and Amazon Mechanical Turk, through which anyone can earn a small amount of money for carrying out very simple tasks that require little prior training.

The 'Open Mind' initiative (Stork, 1999) is the result of applying this idea to the generation of linguistic resources. The basic idea is to use the information and knowledge that can be obtained from millions of Internet users in order to create more intelligent applications. The initiative includes several projects related to natural language, such as Open Mind Word Expert (Mihalcea, 2003), which focuses on word sense disambiguation (producing corpora semantically annotated by users), and Open Mind Common Sense (Singh, 2002), which focuses on acquiring common sense knowledge to build a text corpus.

4. Combining Resources

Another strategy found in the literature for generating corpora is to combine existing resources so that they enrich one another and gain value when considered as a whole. A very illustrative example is (Shi, 2005), which combines FrameNet, VerbNet and WordNet. We briefly describe the content of these resources in order to understand how they are combined into a unified resource.

• The first piece of this puzzle is WordNet, a large lexical database with a great deal of information about words and concepts. It is the resource used to identify shallow semantic features that can be associated with lexical units. WordNet covers the vast majority of English nouns, verbs, adjectives and adverbs. Words are organised into sets of synonyms (called 'synsets') that represent concepts.

• FrameNet is a resource that contains information about different situations, called 'frames'. Each annotated sentence in FrameNet represents a possible syntactic construction for the semantic roles associated with a frame for a given word. The knowledge provided by WordNet is usually referred to as word-level knowledge, whereas FrameNet and VerbNet provide sentence-level knowledge.

• Finally, VerbNet is a lexical resource for verbs based on Levin's verb classes, which also provides selectional restrictions associated with semantic roles. By identifying the VerbNet class that corresponds to a FrameNet frame, sentences that include verbs not yet covered by FrameNet can be analysed. This is possible thanks to a transitive relation among the verbs of a VerbNet class: verbs that belong to the same VerbNet class are very likely to share the same FrameNet frame, and can therefore be analysed semantically even if they do not appear explicitly in FrameNet.

Given these three resources, they can be combined so that all of them can be used at once, instead of having to choose one and give up the information provided by the others. The characteristics that make this combination possible are the following:

• FrameNet does not explicitly define selectional restrictions for semantic roles. In addition, building FrameNet required a great deal of human effort, so its coverage and scalability have been seriously affected.

• VerbNet, by contrast, has much better coverage and defines syntactic-semantic relations more explicitly. VerbNet labels thematic roles and provides selectional restrictions for the arguments of syntactic frames.

• WordNet covers almost all English verbs and provides a great deal of information about the semantic relations between verb senses. However, WordNet is built around verb meaning and does not include the syntactic or semantic behaviour of verbs (such as predicate-argument structures).

Having analysed the content of these three resources, the information encoded in each of them is combined as follows:

• The frame semantics is augmented with VerbNet classes by labelling FrameNet frames and semantic roles with VerbNet entries and their corresponding arguments.

• The coverage of FrameNet verbs is extended using VerbNet classes and the synonymy and hyponymy relations of WordNet verbs.

• The explicit connections between semantic roles and semantic classes are identified, encoding selectional restrictions for semantic roles by means of the WordNet noun hierarchy.
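The coverage extension described above (verbs in the same VerbNet class are likely to share a FrameNet frame) can be illustrated with a small sketch. Everything in it is hypothetical toy data: the class names, the frame names and the verb lists are invented for the example and are not taken from the real resources, and the rule of skipping classes with conflicting evidence is an assumption of this sketch rather than a detail reported in (Shi, 2005).

```python
def extend_coverage(verb_frames, verb_classes):
    """Propose frames for verbs missing from verb_frames via shared class membership."""
    proposed = {}
    for members in verb_classes.values():
        frames = {verb_frames[v] for v in members if v in verb_frames}
        if len(frames) != 1:       # no evidence, or conflicting frames: skip the class
            continue
        frame = next(iter(frames))
        for verb in members:
            if verb not in verb_frames:
                proposed[verb] = frame
    return proposed

# Hypothetical toy data (invented names, not real FrameNet or VerbNet content).
verb_frames = {"buy": "FRAME_BUYING", "sell": "FRAME_SELLING"}
verb_classes = {
    "class_A": ["buy", "purchase"],
    "class_B": ["sell", "buy"],        # two different frames: ambiguous evidence
    "class_C": ["walk", "stroll"],     # no member has a known frame
}
proposed = extend_coverage(verb_frames, verb_classes)
```

Only "purchase" receives a frame here, because it shares a class with exactly one known frame; the proposals are probabilistic guesses in the sense of the text and would need checking before being added to a resource.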

Building linguistic resources requires a great deal of human effort, and each resource is designed to solve a particular kind of problem, with strengths in some respects and weaknesses in others. Combining these resources can therefore produce a larger and richer knowledge base. In (Shi, 2005) we have seen how FrameNet's coverage is improved, how VerbNet is enriched with frame semantics, and how selectional restrictions are implemented using the semantic classes available in WordNet.

5. Importing Nearby Resources

When we face the task of creating a linguistic resource, one option that is often within reach is to adapt another resource that is "close" to the one we want to create. This is the option chosen, for example, in (Carreras, 2003), where a named entity recogniser for Catalan is built from Spanish resources. Two routes are used: first, building the models for Spanish and then translating them into Catalan; and second, building bilingual models directly.

The closeness in this case comes from the fact that both are Romance languages with similar syntactic structures, whose social and cultural environments overlap to a large extent, so that many entities appear in the corpora of both languages. These characteristics make Spanish resources usable for tasks on Catalan such as named entity recognition.



The study makes two assumptions: entities appear in the same contexts in both languages, and entities follow the same patterns in both. In addition, a simple word-to-word dictionary is built without taking context into account (10 hours of work for the Catalan-Spanish version and an automatic system for the Spanish-Catalan version).

With these premises, several experiments are carried out on named entity recognition in Catalan starting from corpora annotated only in Spanish.

The first option is to translate the model produced by training on Spanish texts: the decision trees that are generated are analysed and then modified. If a tree node tests whether the word "calle" appears at position -2, the node is translated so that it tests for the word "carrer" instead (translation from Spanish into Catalan). In this way a model built from Spanish corpora can be applied to a Catalan text. Translation is applied to every node that tests lexical features of the text, while the remaining nodes are left untouched.
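The node-by-node translation of a decision tree can be sketched as follows. The tree representation (nested dictionaries), the feature kinds `word` and `capitalised`, the labels and the two-entry dictionary are all assumptions made for this illustration; they do not reproduce the models of (Carreras, 2003).

```python
ES_TO_CA = {"calle": "carrer", "señor": "senyor"}   # hypothetical word-to-word dictionary

def translate_tree(node, dictionary):
    """Return a copy of the tree with lexical tests translated; other nodes unchanged."""
    if "label" in node:                      # leaf node
        return dict(node)
    kind, position, value = node["feature"]
    if kind == "word":                       # lexical feature: translate when possible
        value = dictionary.get(value, value)
    return {
        "feature": (kind, position, value),
        "yes": translate_tree(node["yes"], dictionary),
        "no": translate_tree(node["no"], dictionary),
    }

def classify(node, tokens, i):
    """Walk the tree for the token at index i and return the leaf label."""
    while "label" not in node:
        kind, position, value = node["feature"]
        j = i + position
        token = tokens[j] if 0 <= j < len(tokens) else ""
        if kind == "word":
            hit = token.lower() == value
        else:                                # "capitalised" test, not lexical
            hit = token[:1].isupper()
        node = node["yes"] if hit else node["no"]
    return node["label"]

# Toy tree "learned" on Spanish: the word "calle" at position -2 signals a location.
es_tree = {
    "feature": ("word", -2, "calle"),
    "yes": {"label": "LOC"},
    "no": {"feature": ("capitalised", 0, None),
           "yes": {"label": "PER"}, "no": {"label": "O"}},
}
ca_tree = translate_tree(es_tree, ES_TO_CA)
tokens = ["al", "carrer", "de", "Balmes"]
```

On the Catalan tokens above, the translated tree reaches the `LOC` leaf for "Balmes", whereas the untranslated Spanish tree misses the lexical test and falls through to the capitalisation node.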

A second option is to use bilingual features (called cross-linguistic features) based on a dictionary entry "es_w ∼ ca_w" (assuming there is a parameter 'lang' with value 'es' for Spanish and 'ca' for Catalan). These binary features behave as follows:

\[
\text{X-Ling}_{es\_w \sim ca\_w}(w) =
\begin{cases}
1 & \text{if } w = es\_w \text{ and } lang = es \\
1 & \text{if } w = ca\_w \text{ and } lang = ca \\
0 & \text{otherwise}
\end{cases}
\]

In this way the model can be trained with mixed examples from both languages; the number of examples from each language can be chosen, which allows, for instance, a very small number of Catalan examples in this particular scenario. The result is a model that can recognise entities in both Spanish and Catalan.
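The definition of the cross-linguistic feature translates almost directly into code. The following is a minimal sketch in which the helper name `make_xling_feature` and the dictionary entry "calle ∼ carrer" are illustrative assumptions.

```python
def make_xling_feature(es_w, ca_w):
    """Build the binary feature X-Ling for one dictionary entry es_w ~ ca_w."""
    def feature(w, lang):
        if lang == "es" and w == es_w:
            return 1
        if lang == "ca" and w == ca_w:
            return 1
        return 0
    return feature

# One feature per dictionary entry; "calle ~ carrer" is used as the example entry.
x_calle = make_xling_feature("calle", "carrer")
values = [x_calle("calle", "es"), x_calle("carrer", "ca"),
          x_calle("carrer", "es"), x_calle("casa", "ca")]
```

Because the same feature fires for "calle" in Spanish text and for "carrer" in Catalan text, a learner trained mostly on Spanish examples can reuse what it learns about that feature when it sees Catalan input.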

The third and last option is to build the model by training on a small corpus in the language in which the recogniser is to be run, in this case Catalan. In this work it was done with the same effort that went into building the dictionary, that is, about 10 hours of work, which produced a small annotated corpus.

The results reported in (Carreras, 2003) show that the third option performs worst: it is preferable to translate the models or to build them bilingually than to learn from such a small number of examples. Of the other two options, the second turns out to be the more interesting: although the model trained only on Spanish examples obtains better results on Spanish, the bilingual model is not far behind on Spanish and considerably outperforms the others on Catalan.

These experiments show that resources "close" to the ones we need can be exploited to carry out tasks with good results at a fairly low cost (especially compared with the cost of creating new resources from scratch).

Specifically, the conclusions drawn by the authors of that work are the following:

• It is better to translate a model trained on Spanish than to create a small annotated corpus with which to train the model directly on Catalan.

• The translation can be carried out automatically without a considerable loss of effectiveness.

• The best option turned out to be the use of bilingual features, since it gives favourable results in both languages.

This idea could be extended through more complex support applications that help bring together resources that are not as closely related as those discussed here.

6. Bootstrapping Techniques

Other works apply another very interesting technique for obtaining resources: bootstrapping, which aims to obtain a large amount of material starting from a small "seed". In the creation of annotated corpora, the goal is to obtain a large number of automatically annotated sentences starting from a very small number of manually annotated sentences (so the cost is very low compared with fully manual annotation).

There are many bootstrapping techniques, which differ in how the seed is enlarged, how newly annotated sentences are handled, and which selection techniques are used, if any. In any case, all of them fit the definition:

"the elevation of a small initial effort into something larger and more significant".

Some of the most popular execution schemes among bootstrapping techniques are:

• Self-train: A corpus is used to build a model, which is applied to a new set of sentences; once annotated, these sentences become part of the original corpus, a new model is built, and the process continues iteratively.

Figure 2: Execution scheme for 'self-train'.

This is the definition of self-training that is generally adopted, as in (Clark, 2003), although there are others, such as the one in (Ng, 2003), which describes it as training a committee of classifiers using bagging and then using majority voting to select the final labels.
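The self-train loop in its generally adopted form can be sketched as below. To keep the example self-contained, the "model" is a deliberately trivial stand-in (a nearest-mean classifier over a single numeric feature) rather than a tagger, and the confidence measure, the threshold `min_confidence` and the data are assumptions of this sketch, which also assumes that only confidently labelled items are added in each round.

```python
def train(labelled):
    """Toy model: the mean of a single numeric feature for each label."""
    sums = {}
    for x, y in labelled:
        total, n = sums.get(y, (0.0, 0))
        sums[y] = (total + x, n + 1)
    return {y: total / n for y, (total, n) in sums.items()}

def predict(model, x):
    """Return (label, confidence); confidence is the margin between the two nearest means."""
    ranked = sorted(model, key=lambda y: abs(x - model[y]))
    margin = abs(x - model[ranked[1]]) - abs(x - model[ranked[0]])
    return ranked[0], margin

def self_train(labelled, unlabelled, min_confidence=1.0, max_rounds=10):
    corpus, pool = list(labelled), list(unlabelled)
    for _ in range(max_rounds):
        model = train(corpus)                      # 1. build a model from the corpus
        scored = [(x, predict(model, x)) for x in pool]
        accepted = [(x, label) for x, (label, conf) in scored
                    if conf >= min_confidence]     # 2. label new items, keep confident ones
        if not accepted:
            break
        corpus.extend(accepted)                    # 3. enlarge the corpus and iterate
        taken = {x for x, _ in accepted}
        pool = [x for x in pool if x not in taken]
    return train(corpus), corpus, pool

seed = [(1.0, "A"), (9.0, "B")]                    # small manually labelled seed
model, corpus, pool = self_train(seed, [2.0, 8.0, 4.8, 5.2])
```

With this data the two clear items are absorbed in the first round, while the two items near the decision boundary never reach the confidence threshold and remain unlabelled.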

• Collaborative-train: A single corpus is used to build different models with different learning techniques. A selection phase then chooses among the different opinions produced by applying these models to the set of new sentences, and the selected labels are used to enlarge the original corpus before the next iteration.

Figure 3: Execution scheme for 'collaborative-train'.
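A minimal sketch of the collaborative-train scheme follows. The two learners (nearest mean and nearest neighbour over one numeric feature) are toy stand-ins for "different learning techniques", and unanimous agreement is just one possible selection rule, chosen here as an assumption for the illustration.

```python
def train_centroid(labelled):
    """Learner 1: assign the label whose mean feature value is closest."""
    groups = {}
    for x, y in labelled:
        groups.setdefault(y, []).append(x)
    means = {y: sum(xs) / len(xs) for y, xs in groups.items()}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))

def train_nearest(labelled):
    """Learner 2: assign the label of the single closest training item."""
    data = list(labelled)
    return lambda x: min(data, key=lambda item: abs(x - item[0]))[1]

def collaborative_train(labelled, unlabelled, trainers, max_rounds=10):
    corpus, pool = list(labelled), list(unlabelled)
    for _ in range(max_rounds):
        models = [fit(corpus) for fit in trainers]   # several models, one corpus
        accepted, rest = [], []
        for x in pool:
            votes = {model(x) for model in models}
            if len(votes) == 1:                      # selection: unanimous agreement
                accepted.append((x, votes.pop()))
            else:
                rest.append(x)
        if not accepted:
            break
        corpus.extend(accepted)                      # enlarge the shared corpus
        pool = rest
    return corpus, pool

seed = [(0.0, "A"), (1.0, "A"), (4.0, "A"), (10.0, "B")]
corpus, pool = collaborative_train(seed, [2.0, 9.0, 6.0],
                                   [train_centroid, train_nearest])
```

The item 6.0 is labelled differently by the two learners in every round, so the selection step never admits it; the other two items are added to the corpus in the first round.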

• Co-train: Two initially identical corpora are used to build two models with different characteristics, and the results of applying these models to a set of new sentences are "crossed": the sentences annotated by the first model enlarge the corpus used to build the second model, and vice versa. In this way a model is not fed only by its own view of the corpus but receives information from another model that brings a different point of view to the same problem.

Figure 4: Execution scheme for 'co-train'.
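The crossing of labels in co-train can be sketched as follows. As an assumption for this illustration, each item has two numeric features, and the two models "of different characteristics" are nearest-mean classifiers that each look at only one of them; the margin threshold `min_margin` and the data are also invented for the example.

```python
def train_view(corpus, view):
    """Toy model for one view: mean of feature `view` for each label."""
    groups = {}
    for features, label in corpus:
        groups.setdefault(label, []).append(features[view])
    return {label: sum(v) / len(v) for label, v in groups.items()}

def predict(model, value):
    """Return (label, margin between the two nearest means)."""
    ranked = sorted(model, key=lambda label: abs(value - model[label]))
    margin = abs(value - model[ranked[1]]) - abs(value - model[ranked[0]])
    return ranked[0], margin

def co_train(seed, unlabelled, min_margin=3.0, max_rounds=10):
    corpus1, corpus2 = list(seed), list(seed)    # two initially identical corpora
    pool = list(unlabelled)
    for _ in range(max_rounds):
        model1 = train_view(corpus1, 0)
        model2 = train_view(corpus2, 1)
        remaining, moved = [], False
        for features in pool:
            label1, margin1 = predict(model1, features[0])
            label2, margin2 = predict(model2, features[1])
            used = False
            if margin1 >= min_margin:            # model 1 feeds the corpus of model 2
                corpus2.append((features, label1))
                used = True
            if margin2 >= min_margin:            # model 2 feeds the corpus of model 1
                corpus1.append((features, label2))
                used = True
            if used:
                moved = True
            else:
                remaining.append(features)
        pool = remaining
        if not moved:
            break
    return corpus1, corpus2, pool

seed = [((1.0, 1.0), "A"), ((9.0, 9.0), "B")]
corpus1, corpus2, pool = co_train(seed, [(2.0, 5.2), (4.9, 8.0), (5.0, 5.0)])
```

Each model ends up with one extra example that it could not have labelled confidently itself, which is the exchange of "points of view" described in the text; the item that neither model is sure about stays in the pool.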

(Jones, 1999) presents two case studies on the use of bootstrapping techniques to create resources: a location recogniser and a classifier of research papers. Very good results are obtained in both cases, which shows the usefulness of this kind of technique.

Another important point is that it is practically impossible to improve a classifier whose results are already very good. In such cases these techniques only introduce noise and degrade the quality of the result. They should therefore be reserved for "hard" tasks, such as enlarging a corpus that initially contains only a limited number of sentences, bearing in mind that if the initial size is enough to obtain good results, bootstrapping will hardly improve them.

7. Conclusions

The availability of resources is a crucial factor in many Natural Language Processing tasks that are solved mainly with supervised learning methods. Obtaining these resources is very costly, which is why efforts are being made to develop methods that do this work automatically or semi-automatically. We have presented several existing initiatives, showing the characteristics of each and reflecting different approaches that we believe can be brought together in an environment that makes resource generation easier. This is the starting point of a line of future work that we wish to pursue, and from which we will try to draw satisfactory solutions to this problem.

References

H.T. Ng: Getting serious about word sense disambiguation. In Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?. (1997) 1–7

R. Mihalcea: Bootstrapping Large Sense Tagged Corpora. In Proceedings of the 3rd International Conference on Languages Resources and Evaluations. (2002)

G. Miller, C. Leacock, T. Randee, R. Bunker: A semantic concordance. In Proceedings of the 3rd DARPA Workshop on Human Language Technology. (1993) 303–308

G. Miller: Wordnet: A lexical database. Communication of the ACM, 38(11). (1995) 39–41

J. Howe: The rise of crowdsourcing. Wired - 14.06 http://www.wired.com/wired/archive/14.06/crowds.html. (2006) 17–20

D. Stork: The Open Mind initiative. IEEE Expert Systems and Their Applications, 14(3). (1999) 19–20

R. Mihalcea, T. Chklovski: Open Mind Word Expert: Creating Large Annotated Data Collections with Web Users’ Help. In Proceedings of the EACL 2003 Workshop on Linguistically Annotated Corpora (LINC 2003). (2003) 17–20

P. Singh, T. Lin, E. Mueller, G. Lim, T. Perkins, W. Li Zhu: Open mind common sense: Knowledge acquisition from the general public. In Proceedings of the First International Conference on Ontologies, Databases, and Applications of Semantics for Large Scale Information Systems. (2002)

Lei Shi, Rada Mihalcea: Putting Pieces Together: Combining FrameNet, VerbNet and WordNet for Robust Semantic Parsing. In Proceedings of the Sixth International Conference on Intelligent Text Processing and Computational Linguistics. (2005)

Xavier Carreras, Lluís Màrquez, Lluís Padró: Named Entity Recognition for Catalan Using Spanish Resources. In 10th Conference of the European Chapter of the Association for Computational Linguistics. (2003)

S. Clark, J. R. Curran, M. Osborne: Bootstrapping POS taggers using Unlabelled Data. In Proceedings of CoNLL-2003. (2003) 49–55

V. Ng, C. Cardie: Weakly supervised natural language learning without redundant views. In Human Language Technology/Conference of the North American Chapter of the Association for Computational Linguistics. (2003)

Rosie Jones, Andrew McCallum, Kamal Nigam, Ellen Riloff: Bootstrapping for Text Learning Tasks. In IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications. (1999)



Building Corpora for the Development of a Dependency Parser for Spanish Using Maltparser∗

Jesús Herrera
Departamento de Lenguajes y Sistemas Informáticos


[email protected]

Pablo Gervás, Pedro J. Moriano, Alfonso Muñoz, Luis Romero
Departamento de Ingeniería del Software e Inteligencia Artificial
Universidad Complutense de Madrid
C/ Profesor José García Santesmases, s/n, E-28040 Madrid

[email protected], {pedrojmoriano, alfonsomm, luis.romero.tejera}@gmail.com

Resumen: En el presente artículo se detalla el proceso de creación de corpora para el entrenamiento y pruebas de un generador de analizadores de dependencias (Maltparser). Se parte del corpus Cast3LB, que contiene análisis de constituyentes de textos en español. Estos análisis de constituyentes se transforman automáticamente en análisis de dependencias. Además se describe cómo se obtiene, experimentalmente y de manera semiautomática, un conjunto de etiquetas de funcionalidad sintáctica para etiquetar adecuadamente el corpus de entrenamiento. El proceso seguido ha permitido obtener un analizador de dependencias para el español con una precisión del 91 % en la determinación de dependencias.
Palabras clave: Análisis de dependencias, corpus de entrenamiento, etiqueta de funcionalidad sintáctica, Maltparser, JBeaver

Abstract: The present paper details the process followed for creating training and test corpora for a dependency parser generator (Maltparser). The starting point is the Cast3LB corpus, which contains constituency analyses of Spanish texts. These constituency analyses are automatically transformed into dependency analyses. In addition, we describe how a set of syntactic function labels for the training corpus was obtained empirically and semiautomatically. As a result of this process, a dependency parser for Spanish was obtained that shows 91 % precision when determining dependencies.
Keywords: Dependency parsing, training corpus, syntactic function label, Maltparser, JBeaver

1. Introduction

The development of JBeaver, a dependency parser for Spanish (Herrera et al., 2007), is based on the use of Maltparser (Nivre et al., 2006), a machine learning tool for generating dependency parsers for virtually any language. Such development inherently entails the work of generating corpora for training and for subsequent evaluation.

∗ Partially supported by the Spanish Ministry of Education and Science (TIN2006-14433-C02-01 project).

The amount of work needed to develop from scratch a corpus annotated with dependency analyses, and of a suitable size for training Maltparser, exceeded the possibilities of the JBeaver project. Therefore, it was necessary to find an alternative way of generating such a corpus. A possible approach was to reuse available resources in order to build from them, in a semiautomatic way, a corpus annotated with dependency analyses. For this, the Cast3LB treebank (Navarro et al., 2003) was used. It consists of approximately 72 Mb of annotated Spanish texts and contains the constituency analysis of every sentence in it. Leaving aside certain subtleties (Gelbukh and Torres, 2006), constituency analyses and dependency analyses can be converted into each other in a systematic way. After studying the format and labels used for Cast3LB (Navarro et al., 2003) (Civit, 2002), a system capable of transforming the constituency analyses contained in Cast3LB into dependency analyses was developed by modifying an algorithm proposed by Gelbukh et al. (Gelbukh and Torres, 2006) (Gelbukh et al., 2005). The existence of Cast3LB and the possibility of transforming the analyses contained in it into dependency analyses were important reasons for using Maltparser in the JBeaver project.

On the other hand, the decision to make the JBeaver parser generally available to the public led us to consider additional requirements. For instance, we decided to make JBeaver as easy as possible to use for tools already adapted to Minipar (Lin, 1998), since Minipar has become a de facto standard in recent years after being used by a large number of applications. Thus, the notation used for JBeaver is, as far as possible, the same as the one used for Minipar.

2. The source corpus

A corpus of dependency analyses is needed for training Maltparser. Building such a corpus by hand implied a workload well beyond the constraints of the JBeaver project. Thus, it was decided to take advantage of existing resources. Taking into account that, except for some specific cases (such as non-projective constructions), the dependency analysis of a text can be automatically derived from its constituency analysis (Gelbukh and Torres, 2006), and that Cast3LB, which contains constituency analyses of Spanish texts, was available, Cast3LB became the best option as source corpus for the project. The training corpus was then obtained in a semiautomatic way from Cast3LB.

Cast3LB contains 100,000 words in approximately 3,700 sentences of Spanish text. 75,000 words of Cast3LB come from the ClicTALP corpus, which is a set of texts from several domains (literary, journalistic, scientific, etc.), and the other 25,000 words come from the EFE news agency's corpus from the year 2000 (Navarro et al., 2003). An excerpt from Cast3LB is shown in figure 1 as an example.

3. Building a training corpus

Maltparser requires for its training a corpus in which, for every word of the analyzed text, the following data are included: a unique identifier, its part of speech label, the identifier of the head of that word and a label indicating the syntactic function given in the dependency relationship. Maltparser accepts both an XML format and a tab format as input. In figure 2 two mutually equivalent examples are shown (the first one in XML format and the second one in tab format).

The numeric identifier 0 and the syntactic function label ROOT are used by convention to designate the dependency tree's root¹.
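As an illustration of the equivalence between the two input formats, the following minimal Python sketch converts one sentence from the XML format into tab-separated lines. It relies only on the attribute names visible in figure 2 (form, postag, head, deprel); the function name xml_to_tab is hypothetical and is not part of Maltparser.

```python
import xml.etree.ElementTree as ET


def xml_to_tab(xml_sentence: str) -> str:
    """Convert one <sentence> element into tab-separated lines.

    Assumes the attribute names seen in figure 2: form, postag, head, deprel.
    """
    sentence = ET.fromstring(xml_sentence)
    rows = []
    for word in sentence.iter("word"):
        fields = [word.get("form"), word.get("postag"),
                  word.get("head"), word.get("deprel")]
        rows.append("\t".join(fields))
    return "\n".join(rows)


# First three words of the sentence shown in figure 2.
example = """<sentence id="2" user="malt" date="">
<word id="1" form="Genom" postag="pp" head="3" deprel="ADV"/>
<word id="2" form="skattereformen" postag="nn.utr.sin.def.nom" head="1" deprel="PR"/>
<word id="3" form="infors" postag="vb.prs.sfo" head="0" deprel="ROOT"/>
</sentence>"""

print(xml_to_tab(example))
```

The word identifier does not appear in the tab output because, in the tab format of figure 2, it is implied by the position of the line within the sentence.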

All the information needed for the creation of the training corpus was contained in the Cast3LB corpus, but it had to be extracted and modified to suit the conventions followed by Maltparser. Two steps were carried out for this: obtaining the dependency relationships and obtaining the syntactic function labels.

3.1. Obtaining dependency relationships

In order to extract the dependency relationships between the words contained in the Cast3LB corpus, an automatic process was developed. It was based on an algorithm proposed by Gelbukh et al. (Gelbukh and Torres, 2006) (Gelbukh et al., 2005), modified as needed.
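The details of the modified algorithm are not given here. As a rough illustration of the general idea only, the sketch below derives dependencies from a constituency tree by choosing a head child in every constituent and attaching the heads of the remaining children to it. The tree encoding, the HEAD_RULES table and the toy sentence are assumptions made for this example; they are neither the rules of Gelbukh et al. nor those used in JBeaver.

```python
# Hypothetical head rules: for each constituent label, preferred head child labels.
HEAD_RULES = {"S": ["VP", "NP"], "VP": ["V"], "NP": ["N"]}


def head_and_deps(node, deps):
    """Return the index of the head word of `node`, adding (dependent, head) pairs to `deps`.

    A leaf is a triple (pos, word, index); an inner node is a pair (label, children).
    """
    if len(node) == 3:  # leaf: the word is its own head
        return node[2]
    label, children = node
    heads = [head_and_deps(child, deps) for child in children]
    # Choose the head child by rule, defaulting to the first child.
    head_pos = 0
    for preferred in HEAD_RULES.get(label, []):
        matches = [i for i, child in enumerate(children) if child[0] == preferred]
        if matches:
            head_pos = matches[0]
            break
    # Every non-head child depends on the head of the head child.
    for i, head in enumerate(heads):
        if i != head_pos:
            deps.append((head, heads[head_pos]))
    return heads[head_pos]


def to_dependencies(tree):
    """Return sorted (dependent, head) pairs; head 0 designates the root."""
    deps = []
    root = head_and_deps(tree, deps)
    deps.append((root, 0))
    return sorted(deps)


# Toy tree for "Medardo escribe cuentos" (word indices start at 1).
tree = ("S", [("NP", [("N", "Medardo", 1)]),
              ("VP", [("V", "escribe", 2), ("NP", [("N", "cuentos", 3)])])])
print(to_dependencies(tree))  # [(1, 2), (2, 0), (3, 2)]
```

A conversion of this kind cannot produce non-projective structures, which is one of the specific cases that the text mentions as exceptions.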

3.2. Obtaining syntactic function labels

¹ http://w3.msi.vxu.se/~nivre/research/MaltXML.html

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE FILE SYSTEM "3lb.dtd">
<FILE id="agset" language="es" wn="1.5" ewn="dic2002" parsing_state="process" semantic_state="process" last_modified="13-01-2006" project="3LB" about="3LB project annotation file">
  <LOG auto_file="a1-0-auto3.log" anno_file="a1-0-anno4.log" nosense_file="a1-0-nosense4.log" />
  <SENTENCE id="agset_1">
    <Anchor id="agset_1_ac1" offset="0"/>
    <Anchor id="agset_1_ac2" offset="15"/>
    <Anchor id="agset_1_ac3" offset="21"/>
    <Anchor id="agset_1_ac4" offset="23"/>
    <Anchor id="agset_1_ac5" offset="26"/>
    <Anchor id="agset_1_ac6" offset="34"/>
    <Anchor id="agset_1_ac7" offset="40"/>
    <Anchor id="agset_1_ac8" offset="42"/>
    <Anchor id="agset_1_ac9" offset="52"/>
    <Anchor id="agset_1_ac10" offset="54"/>
    <Annotation id="agset_1_an3" start="agset_1_ac1" end="agset_1_ac2" type="syn">
      <Feature name="roles">SUJ</Feature>
      <Feature name="label">sn</Feature>
      <Feature name="parent">agset_1_an2</Feature>
    </Annotation>
    <Annotation id="agset_1_an4" start="agset_1_ac1" end="agset_1_ac2" type="syn">
      <Feature name="label">grup.nom.ms</Feature>
      <Feature name="parent">agset_1_an3</Feature>
    </Annotation>
    <Annotation id="agset_1_an5" start="agset_1_ac1" end="agset_1_ac2" type="wrd">
      <Feature name="label">Medardo_Fraile</Feature>
      <Feature name="sense">C2S</Feature>
      <Feature name="parent">agset_1_an6</Feature>
    </Annotation>
    <Annotation id="agset_1_an6" start="agset_1_ac1" end="agset_1_ac2" type="pos">
      <Feature name="lema">Medardo_Fraile</Feature>
      <Feature name="label">np00000</Feature>
      <Feature name="parent">agset_1_an4</Feature>
    </Annotation>
    <Annotation id="agset_1_an1" start="agset_1_ac1" end="agset_1_ac10" type="dummy_root">
      <Feature name="label"/>
      <Feature name="parent"/>
    </Annotation>

Figure 1: Excerpt from Cast3LB

The great popularity achieved by Minipar in recent years led to the decision to use, in the JBeaver project, a set of syntactic function labels that followed, as far as possible, the nomenclature of Minipar. In this way, it would be easier to adapt systems currently using Minipar to the use of JBeaver. Since the Cast3LB corpus contains its own syntactic function labels, they must be translated into the ones used by Minipar in order to train Maltparser with the appropriate set of labels. The first step was therefore to obtain the set of syntactic function labels used by Minipar. Since an exhaustive list of these labels is not publicly available, it was necessary to approximate it as closely as possible from a large number of analyses made with Minipar. To this end, an empirical study was carried out, based on the idea that, given a large number of analyses made with Minipar, the set of different labels found would be very close to the real set of labels. The process employed was the following:

<sentence id="2" user="malt" date="">
  <word id="1" form="Genom" postag="pp" head="3" deprel="ADV"/>
  <word id="2" form="skattereformen" postag="nn.utr.sin.def.nom" head="1" deprel="PR"/>
  <word id="3" form="infors" postag="vb.prs.sfo" head="0" deprel="ROOT"/>
  <word id="4" form="individuell" postag="jj.pos.utr.sin.ind.nom" head="5" deprel="ATT"/>
  <word id="5" form="beskattning" postag="nn.utr.sin.ind.nom" head="3" deprel="SUB"/>
  <word id="6" form="(" postag="pad" head="5" deprel="IP"/>
  <word id="7" form="sarbeskattning" postag="nn.utr.sin.ind.nom" head="5" deprel="APP"/>
  <word id="8" form=")" postag="pad" head="5" deprel="IP"/>
  <word id="9" form="av" postag="pp" head="5" deprel="ATT"/>
  <word id="10" form="arbetsinkomster" postag="nn.utr.plu.ind.nom" head="9" deprel="PR"/>
  <word id="11" form="." postag="mad" head="3" deprel="IP"/>
</sentence>

Genom	pp	3	ADV
skattereformen	nn.utr.sin.def.nom	1	PR
infors	vb.prs.sfo	0	ROOT
individuell	jj.pos.utr.sin.ind.nom	5	ATT
beskattning	nn.utr.sin.ind.nom	3	SUB
(	pad	5	IP
sarbeskattning	nn.utr.sin.ind.nom	5	APP
)	pad	5	IP
av	pp	5	ATT
arbetsinkomster	nn.utr.plu.ind.nom	9	PR
.	mad	3	IP

Figure 2: Mutually equivalent training files for Maltparser (XML and tab)

1. A set of English texts obtained from the web was parsed with Minipar. It consisted of about 1 Mb of texts extracted from Project Gutenberg², covering the following domains: sport (197.1 Kb containing 1,854 phrases), economy (207.1 Kb containing 1,173 phrases), education (160.5 Kb containing 869 phrases), history (162.2 Kb containing 1,210 phrases), justice (98.2 Kb containing 453 phrases) and health (265.2 Kb containing 2,409 phrases).

2. The output files given by Minipar were processed in order to extract the set of all different syntactic function labels.

3. A set of analyses in which all the labels found were present was selected, and the following algorithm was applied to it:

for each syntactic function label identified do
    if this function may occur in Spanish then
        Set one or more rules for suitably transforming the syntactic function label from Cast3LB into the identified label;
    else
        Discard the identified label;
    end if
end for

² http://www.gutenberg.org/

The rules mentioned above were implemented in the program that transforms constituency analyses into dependency analyses. A special label was used to identify not yet discovered syntactic functions that might be found in the future.
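A rule set of this kind can be pictured as a lookup table with a fallback. In the sketch below, only the label SUJ (seen in figure 1) and the label subj (one of the Minipar labels identified) come from the paper; the pairing of the two, the second rule, and the fallback name UNKNOWN are hypothetical and serve only to illustrate the mechanism.

```python
# Hypothetical rules from Cast3LB function labels to Minipar-style labels.
RULES = {"SUJ": "subj", "CC": "mod"}

# Hypothetical name for the special "not yet discovered" label.
UNKNOWN = "UNKNOWN"


def translate_label(cast3lb_label):
    """Return the Minipar-style label, or the special label if no rule applies."""
    return RULES.get(cast3lb_label, UNKNOWN)


def needs_review(labels):
    """True if an analysis contains at least one label without a rule."""
    return any(translate_label(label) == UNKNOWN for label in labels)


print(translate_label("SUJ"))
print(needs_review(["SUJ", "XYZ"]))
```

Analyses flagged by a check like needs_review are the ones that would be passed on for manual inspection.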

After the set of syntactic rules had been established, a significant set of constituency analyses was transformed into dependency analyses. Once the dependency treebank had been obtained, all the analyses containing one or more special labels for not yet discovered syntactic functions were manually analyzed. Every case was then studied in order to determine whether a new syntactic function label should be added to the set or whether the syntactic function in question could be assimilated to one of the known labels. In figure 3 the complete list of syntactic function labels is shown, i.e., those from Minipar and those that were defined ad hoc.

Identified Minipar's syntactic function labels:

sc            neg        pcomp-n
pnmod         nn         gen
poss          lex-dep    appo
whn           mod        subj
aux           amod       guest
num           vrel       else
punc          det        neg
amount-value

New ad-hoc syntactic function labels:

ROOT          adj        fecha
descr         c-descr    comp
det

Figure 3: Syntactic function labels used in the training corpus

The set of syntactic function labels finally obtained was not necessarily complete, but it was reasonably valid for its purpose. Thus, it was used by the algorithm that transformed constituency analyses into dependency analyses to label the syntactic functions according to Minipar's nomenclature.

3.3. Part of speech tagging

One of JBeaver's features is that it is capable of parsing texts with no need for previous annotation. Since the model learned by MaltParser requires, for the parsing step, that every word be labeled with its part of speech, the tagging subtask is implemented in JBeaver by the part of speech tagger Treetagger (Schmid, 1994). The use of Treetagger was motivated by the fact that its set of part of speech labels was the one used for MaltParser's training.

3.4. The definitive corpus

Following the process described in this section, 280 XML files (72.9 Mb) containing constituency analyses from the Cast3LB corpus, consisting of 97,002 words, were transformed into dependency analyses suitable for processing by MaltParser (a tab training file of 1.6 Mb), labeled according to the requirements of the JBeaver project.

4. The test corpus and results obtained

For the evaluation of the trained model, the fraction of dependencies correctly found and labeled was computed. The gold standard was a fraction of the corpus described in section 3. This corpus was divided into three equal parts; two of them were used as the training corpus and the other one was used both as test corpus and as gold standard. For use as test corpus, the annotations concerning dependency relationships and syntactic functions were removed, i.e., it consisted only of the words and their part of speech tags, which is the format required by MaltParser when it is used as a parser. The output given by the trained model was then compared with the gold standard, and 91 % of the dependencies found by the trained model agreed with the gold standard (Herrera et al., 2007). This result is comparable to the one obtained by Nivre et al. when training MaltParser for Spanish (Nivre et al., 2006).
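The evaluation set-up described above can be sketched as follows. Each token is represented by a (head, label) pair. The function names, the choice of the last third as test data and the toy values are assumptions for illustration; they do not reproduce the scripts actually used.

```python
def split_in_thirds(sentences):
    """Return (training, test); the last third is used as test and gold data (an assumed choice)."""
    cut = 2 * len(sentences) // 3
    return sentences[:cut], sentences[cut:]


def labeled_accuracy(gold, predicted):
    """Fraction of tokens whose head and syntactic function label both match the gold standard."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Toy example: the head of the fourth token is wrong in the prediction.
gold = [(3, "ADV"), (1, "PR"), (0, "ROOT"), (5, "ATT")]
pred = [(3, "ADV"), (1, "PR"), (0, "ROOT"), (3, "ATT")]
print(labeled_accuracy(gold, pred))  # 0.75

train, test = split_in_thirds(["s1", "s2", "s3", "s4", "s5", "s6"])
print(len(train), len(test))  # 4 2
```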

5. Conclusions and future work

The process of building corpora for training and testing a specific tool for generating dependency parsers (Maltparser) has been presented. This process has particular features that stem from the requirements of the project in which it was developed (JBeaver). It was mandatory to use existing resources, and a corpus of constituency analyses was satisfactorily transformed into an equivalent corpus of dependency analyses. For this purpose, an algorithm previously proposed by Gelbukh et al. was modified and applied. In addition, in order to meet the needs of the project, the set of syntactic function labels of Minipar was empirically determined.

Future work includes the search for more syntactic function labels, both from Minipar and new ones not yet considered. Also, some research could be done to improve the algorithm that transforms constituency analyses into dependency analyses. By means of these improvements, it should be possible to learn better models for dependency parsing in Spanish.

In addition, development efforts similar to the one described here could be carried out for other languages.

References

M. Civit. 2002. Etiquetación de los Cuantificadores: Varias Propuestas. TALP Research Center–Universidad Politécnica de Cataluña. Technical Report.

A. Gelbukh and S. Torres. 2006. Tratamiento de Ciertos Pronombres y Conjunciones en la Transformación de un Corpus de Constituyentes a un Corpus de Dependencias. Avances en la Ciencia de la Computación. VII Encuentro Internacional de Computación ENC'06.

A. Gelbukh, S. Torres and H. Calvo. 2005. Transforming a Constituency Treebank into a Dependency Treebank. Procesamiento del Lenguaje Natural, No 35, September 2005. Sociedad Española para el Procesamiento de Lenguaje Natural (SEPLN).

J. Herrera, P. Gervás, P.J. Moriano, A. Muñoz, L. Romero. 2007. JBeaver: Un Analizador de Dependencias para el Español Basado en Aprendizaje. Under evaluation process for CAEPIA 2007.

D. Lin. 1998. Dependency-based Evaluation of MINIPAR. Proceedings of the Workshop on the Evaluation of Parsing Systems, Granada, Spain.

B. Navarro, M. Civit, M.A. Martí, R. Marcos, B. Fernández. 2003. Syntactic, Semantic and Pragmatic Annotation in Cast3LB. Proceedings of the Shallow Processing on Large Corpora (SproLaC), a Workshop on Corpus Linguistics, Lancaster, UK.

J. Nivre, J. Hall, J. Nilsson, G. Eryigit and S. Marinov. 2006. Labeled Pseudo-Projective Dependency Parsing with Support Vector Machines. Proceedings of the CoNLL-X Shared Task on Multilingual Dependency Parsing, New York, USA.

H. Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.



Semantics

A Proposal of Automatic Selection of Coarse-grained Semantic Classes for WSD∗

Ruben Izquierdo & Armando Suárez
GPLSI. Departament de LSI. UA. Alacant, Spain.

{ruben,armando}@dlsi.ua.es

Germán Rigau
IXA NLP Group. EHU. Donostia, Spain.

[email protected]

Resumen: Presentamos un método muy simple para seleccionar conceptos base (Base Level Concepts) usando algunas propiedades estructurales básicas de WordNet. Demostramos empíricamente que el conjunto de Base Level Concepts obtenido agrupa sentidos de palabras en un nivel de abstracción adecuado para la desambiguación del sentido de las palabras basada en clases. De hecho, un sencillo clasificador basado en el sentido más frecuente usando las clases generadas es capaz de alcanzar un acierto próximo a 75% para la tarea de etiquetado semántico.
Palabras clave: WordNet, Sentidos de las palabras, niveles de abstracción, Desambiguación del Sentido de las Palabras

Abstract: We present a very simple method for selecting Base Level Concepts using some basic structural properties of WordNet. We also empirically demonstrate that this automatically derived set of Base Level Concepts groups senses into an adequate level of abstraction in order to perform class-based Word Sense Disambiguation. In fact, a very naive Most Frequent classifier using the classes selected is able to perform semantic tagging with accuracy figures over 75%.
Keywords: WordNet, word-senses, levels of abstraction, Word Sense Disambiguation

1 Introduction

Word Sense Disambiguation (WSD) is an intermediate Natural Language Processing (NLP) task which consists in assigning the correct semantic interpretation to ambiguous words in context. One of the most successful approaches in recent years is supervised learning from examples, in which statistical or Machine Learning classification models are induced from semantically annotated corpora (Màrquez et al., 2006). Generally, supervised systems have obtained better results than unsupervised ones, as shown by experimental work and international evaluation exercises such as Senseval¹. These annotated corpora are usually manually tagged by lexicographers with word senses taken from a particular lexical semantic resource, most commonly WordNet (WN) (Fellbaum, 1998).

∗ This paper has been supported by the European Union under the project QALL-ME (FP6 IST-033860) and by the Spanish Government under the projects Text-Mess (TIN2006-15265-C06-01) and KNOW (TIN2006-15049-C03-01).

¹ http://www.senseval.org

WN has been widely criticised for being a sense repository that often offers too fine-grained sense distinctions for higher level applications like Machine Translation or Question & Answering. In fact, WSD at this level of granularity has resisted all attempts at inferring robust broad-coverage models. It seems that many word-sense distinctions are too subtle to be captured by automatic systems with the current small volumes of word-sense annotated examples. Possibly, building class-based classifiers would make it possible to avoid the data sparseness problem of the word-based approach. Recently, using WN as a sense repository, the organizers of the English all-words task at SensEval-3 reported an inter-annotator agreement of 72.5% (Snyder and Palmer, 2004). Interestingly, this result is difficult to outperform by state-of-the-art fine-grained WSD systems.

Thus, some research has focused on deriving different sense groupings to overcome the fine-grained distinctions of WN (Hearst and Schütze, 1993) (Peters, Peters, and Vossen, 1998) (Mihalcea and Moldovan, 2001) (Agirre, Aldezabal, and Pociello, 2003) and on using predefined sets of sense groupings for learning class-based classifiers for WSD (Segond et al., 1997) (Ciaramita and Johnson, 2003) (Villarejo, Màrquez, and Rigau, 2005) (Curran, 2005) (Ciaramita and Altun, 2006). However, most of the latter approaches used the original Lexicographer Files of WN (more recently called Supersenses) as very coarse-grained sense distinctions, and not much attention has been paid to learning class-based classifiers from other available sense groupings such as WordNet Domains (Magnini and Cavaglià, 2000), SUMO labels (Niles and Pease, 2001), EuroWordNet Base Concepts (Vossen et al., 1998) or Top Concept Ontology labels (Atserias et al., 2004). Obviously, these resources relate senses at some level of abstraction using different semantic criteria and properties that could be of interest for WSD. Possibly, their combination could improve the overall results, since they offer different semantic perspectives of the data. Furthermore, to our knowledge, no comparative evaluation exploring different sense groupings has been performed to date.

We present a very simple method for selecting Base Level Concepts (Rosch, 1977) using basic structural properties of WN. We also empirically demonstrate that this automatically derived set of Base Level Concepts groups senses into an adequate level of abstraction in order to perform class-based WSD.

This paper is organized as follows. Section 2 introduces the different levels of abstraction that are relevant for this study, and the available sets of semi-automatically derived Base Concepts. In section 3, we present the method for deriving fully automatically a number of Base Level Concepts from any WN version. Section 4 reports the resulting figures of a direct comparison of the resources studied. Section 5 provides an empirical evaluation of the performance of the different levels of abstraction. In section 6 we provide further insights into the results obtained and finally, in section 7, some concluding remarks are provided.

2 Levels of abstraction

WordNet² (WN) (Fellbaum, 1998) is an online lexical database of English which contains concepts represented by synsets, sets of synonyms of content words (nouns, verbs, adjectives and adverbs). In WN, different types of lexical and semantic relations interlink different synsets, creating in this way a very large structured lexical and semantic network. The most important relation encoded in WN is the subclass relation (for nouns the hyponymy relation and for verbs the troponymy relation). The latest version of WN, WN 3.0, was released in December 2006. It contains 117,097 nouns and 11,488 verbs, organized into 81,426 noun synsets and 13,650 verb synsets.

² http://wordnet.princeton.edu
³ http://www.illc.uva.nl/EuroWordNet/

EuroWordNet³ (EWN) (Vossen et al., 1998) is a multilingual database that contains wordnets for several languages (Dutch, Italian, Spanish, German, French, Czech and Estonian). Each of these single wordnets represents a unique language-internal system of lexicalizations, and it is structured following the approach of the English wordnet: synsets and relations between them. The different wordnets are linked to the Inter-Lingual-Index (ILI), based on the Princeton English WN. By means of the ILI, synsets and words of different languages are connected, allowing advanced multilingual natural language applications (Vossen et al., 2006).

The notion of Base Concepts (hereinafter BC) was introduced in EuroWordNet. The BC are supposed to be the concepts that play the most important role in the various wordnets of different languages. This role was measured in terms of two main criteria: a high position in the semantic hierarchy and having many relations to other concepts. Thus, the BC are the fundamental building blocks for establishing the relations in a wordnet. In that sense, the Lexicographer Files (or Supersenses) of WN could be considered the most basic set of BC.

Basic Level Concepts (Rosch, 1977) (hereinafter BLC) should not be confused with Base Concepts. BLC are a compromise between two conflicting principles of characterization: a) to represent as many concepts as possible (abstract concepts), and b) to represent as many distinctive features as possible (concrete concepts).

As a result of this, Basic Level Concepts typically occur in the middle of hierarchies and have less than the maximum number of relations. BC mostly involve only the first principle of the Basic Level Concepts. BC are generalizations of features or semantic components and thus apply to a maximum number of concepts. Our work focuses on devising simple methods for automatically selecting an accurate set of Basic Level Concepts from WN.

2.1 WordNet Base Concepts

WN synsets are organized in forty-five Lexicographer Files, or SuperSenses, based on syntactic categories (nouns, verbs, adjectives and adverbs) and logical groupings, such as person, phenomenon, feeling, location, etc. There are 26 basic categories for nouns, 15 for verbs, 3 for adjectives and 1 for adverbs. For instance, the Supersenses corresponding to the four senses of the noun church in WN1.6 are noun.group for the first Christian Church sense, noun.artifact for the second church building sense and noun.act for the third church service sense.

2.2 EuroWordNet Base Concepts

Within EuroWordNet, a set of Base Concepts was selected to reach maximum overlap and compatibility across wordnets in different languages following the two main criteria described above: a high position in the semantic hierarchy and having many relations to other concepts. Initially, a set of 1,024 Common Base Concepts from WN1.5 (concepts acting as BC in at least two languages) was selected, only considering English, Dutch, Spanish and Italian wordnets.


2.3 Balkanet Base Concepts

The Balkanet project⁴ followed a similar approach to EWN, but using other languages: Greek, Romanian, Serbian, Turkish and Bulgarian. The goal of Balkanet was to develop a multilingual lexical database for the new languages following the guidelines of EWN. Thus, the Balkanet project selected its own list of BC, extending the original set of BC of EWN to a final set of 4,698 ILI records from WN2.0⁵ (3,210 nouns, 1,442 verbs and 37 adjectives).

2.4 MEANING Base Concepts

The MEANING project⁶ also followed the architectural model proposed by the EWN to build the Multilingual Central Repository (Mcr) (Atserias et al., 2004). In this case, BC from EWN based on WN1.5 synsets were ported to WN1.6. The number of BC finally selected was 1,535 (793 for nouns and 742 for verbs).

3 Automatic Selection of Base Level Concepts

This section describes a simple method for deriving a set of Base Level Concepts (BLC) from WN. The method has been applied to different WN versions for nouns and verbs. Basically, to select the appropriate BLC of a particular synset, the algorithm only considers the relative number of relations of its hypernyms. We derived two different sets of BLC depending on the type of relations considered: a) all types of relations encoded in WN (All) and b) only the hyponymy relations encoded in WN (Hypo).

The process follows a bottom-up approach using the chain of hypernym relations. For each synset in WN, the process selects as its Base Level Concept the first local maximum according to the relative number of relations. For synsets having multiple hypernyms, the path having the local maximum with the higher number of relations is selected. Usually, this process ends up with a number of "fake" Base Level Concepts, that is, synsets having no descendants (or a very small number of them) but being the first local maximum according to the number of relations considered. Thus, the process finishes by checking whether the number of concepts subsumed by each BLC in the preliminary list is higher than a certain threshold. For those BLC not representing enough concepts according to that threshold, the process selects the next local maximum following the hypernym hierarchy. Thus, depending on the type of relations counted and the threshold established, different sets of BLC can be easily obtained for each WN version.
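The procedure just described can be sketched in a few lines of Python. This is a simplified reading, not the authors' implementation: it assumes a single hypernym per synset, stops the climb on ties, and treats a BLC as "fake" when fewer than threshold other synsets are assigned to it. The toy hierarchy and its relation counts are invented for the example.

```python
def first_local_max(node, hypernym, rels):
    """Climb the hypernym chain while the hypernym has strictly more relations."""
    while node in hypernym and rels[hypernym[node]] > rels[node]:
        node = hypernym[node]
    return node


def next_local_max(node, hypernym, rels):
    """Leave the current maximum, then climb to the following local maximum."""
    while node in hypernym and rels[hypernym[node]] <= rels[node]:
        node = hypernym[node]
    return first_local_max(node, hypernym, rels)


def select_blc(hypernym, rels, threshold):
    """Map every synset to its BLC, promoting BLC that subsume too few synsets."""
    blc = {s: first_local_max(s, hypernym, rels) for s in rels}
    changed = True
    while changed:
        changed = False
        # Count, for each candidate, the other synsets currently assigned to it.
        counts = {}
        for synset, concept in blc.items():
            if synset != concept:
                counts[concept] = counts.get(concept, 0) + 1
        for synset, concept in list(blc.items()):
            if counts.get(concept, 0) < threshold and concept in hypernym:
                blc[synset] = next_local_max(concept, hypernym, rels)
                changed = True
    return blc


# Invented toy hierarchy: child -> hypernym, and number of relations per synset.
hypernym = {"animal": "entity", "dog": "animal", "puppy": "dog",
            "bird": "animal", "sparrow": "bird"}
rels = {"entity": 5, "animal": 20, "dog": 8, "puppy": 2, "bird": 9, "sparrow": 12}

result = select_blc(hypernym, rels, threshold=1)
print(result["puppy"], result["sparrow"])
```

In this toy data, sparrow is initially its own local maximum but has no descendants, so it plays the role of a "fake" BLC and is promoted to the next local maximum above it.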

⁴ http://www.ceid.upatras.gr/Balkanet
⁵ http://www.globalwordnet.org/gwa/5000_bc.zip
⁶ http://www.lsi.upc.es/~nlp/meaning

An example is provided in table 1. This table shows the possible BLC for the noun "church" using WN1.6. The table presents the hypernym chain for each synset together with the number of relations encoded in WN for the synset. The local maxima along the hypernym chain of each synset appear in bold. For church 1, the synset faith 3, with 12 total relations, will be selected. The second sense of church, church 2, is a local maximum with 19 total relations. This synset will be selected if the number of descendant synsets having church 2 as a Base Level Concept is higher than a predefined threshold. Finally, the selected Base Level Concept for church 3 is religious ceremony 1. Obviously, different criteria will select a different set of Base Level Concepts.

#rel.  synset
18     group 1, grouping 1
19     social group 1
37     organisation 2, organization 1
10     establishment 2, institution 1
12     faith 3, religion 2
5      Christianity 2, church 1, Christian church 1

#rel.  synset
14     entity 1, something 1
29     object 1, physical object 1
39     artifact 1, artefact 1
63     construction 3, structure 1
79     building 1, edifice 1
11     place of worship 1, ...
19     church 2, church building 1

#rel.  synset
20     act 2, human action 1, human activity 1
69     activity 1
5      ceremony 3
11     religious ceremony 1, religious ritual 1
7      service 3, religious service 1, divine service 1
1      church 3, church service 1

Table 1: Possible Base Level Concepts for the noun Church in WN1.6

Instead of highly related concepts, we also considered highly frequent concepts as a possible indicator of a large set of features. Following the same basic algorithm, we also used the relative frequency of the synsets in the hypernym chain. That is, we derived two other sets of BLC depending on the source of relative frequencies considered: a) the frequency counts in SemCor (FreqSC) and b) the frequency counts appearing in WN (FreqWN). The frequency of a synset has been obtained by summing up the frequencies of its word senses. In fact, WN word senses were ranked using SemCor and other sense-annotated corpora. Thus, the frequencies of SemCor and WN are similar, but not equal.

4 Comparing Base Level Concepts

Different sets of Base Level Concepts (BLC) have been generated using different WN versions, types of relations (All and Hypo), sense frequencies (FreqSC and FreqWN) and thresholds.

Table 2 presents the total number of BLC and their average depth for WN1.6 [7], varying the threshold and the type of relations considered (All or Hypo).

As expected, when increasing the threshold, the total number of automatic BLC and their average depth decrease. For instance, using all relations on the nominal part of WN, the total number of BLC ranges from 3,094 (no threshold) to 253 (threshold 50). Using hyponym relations, the total number of BLC ranges from 2,490 (no threshold) to 248 (threshold 50). However, although the total number of BLC for nouns decreases dramatically (around 10 times), the average depth of the synsets selected only ranges from 7.09 (no threshold) to 5.21 (threshold 50) using both types of relations (All and Hypo). This possibly indicates the robustness of the approach.

Thres.  Rel.  PoS   #BLC   Av. depth
0       all   Noun  3,094  7.09
0       all   Verb  1,256  3.32
0       hypo  Noun  2,490  7.09
0       hypo  Verb  1,041  3.31
10      all   Noun    971  6.20
10      all   Verb    719  1.39
10      hypo  Noun    993  6.23
10      hypo  Verb    718  1.36
20      all   Noun    558  5.81
20      all   Verb    673  1.25
50      all   Noun    253  5.21
50      all   Verb    633  1.13

Table 2: Automatic Base Level Concepts for WN1.6 using All or Hypo relations

[7] WN1.6 has 66,025 nominal and 12,127 verbal synsets.

Also as expected, the verbal part of WN behaves differently. For verbs and using all relations, the total number of BLC ranges from 1,256 (no threshold) to 633 (threshold 50). Using hyponym relations, the total number of BLC ranges from 1,041 (no threshold) to 633 (threshold 50). In this case, since the verbal hierarchies are much shorter, the average depth of the synsets selected ranges from 3.32 (no threshold) to only 1.13 (threshold 50) using all relations, and from 3.31 (no threshold) to 1.10 (threshold 50) using hypo relations.

Table 3 presents the total number of BLC and their average depth for WN1.6, varying the threshold and the type of frequency (WN or SemCor).

In general, when using the frequency criteria, we can observe a behaviour similar to that obtained with the relation criteria. That is, when increasing the threshold, the total number of automatic BLC and their average depth decrease. However, now the effect of the threshold is more dramatic, especially for nouns. For instance, the total number of nominal BLC ranges from around 34,000 with no threshold to fewer than 100 with a threshold of 50 descendants. Again, although the total number of BLC for nouns decreases dramatically, the average depth of the synsets selected only ranges from 7.44 (no threshold) to 4.35 (threshold 50) using sense frequencies from SemCor and from 7.44 (no threshold) to 4.41 (threshold 50) using sense frequencies from WN.

Thres.  Freq.   PoS   #BLC    Av. depth
0       SemCor  Noun  34,865  7.44
0       SemCor  Verb   3,070  3.41
0       WN      Noun  34,183  7.44
0       WN      Verb   2,615  3.30
10      SemCor  Noun     690  5.74
10      SemCor  Verb     731  1.38
10      WN      Noun     691  5.77
10      WN      Verb     738  1.40
20      WN      Noun     340  5.47
20      WN      Verb     667  1.23
50      WN      Noun      99  4.41
50      WN      Verb     631  1.12

Table 3: Automatic Base Level Concepts for WN1.6 using SemCor or WN frequencies

As expected, verbs behave differently from nouns. The number of BLC (for both SemCor and WN frequencies) reaches a plateau of around 600. In fact, this number is very close to the number of verbal top beginners.

Table 4 summarizes the Balkanet Base Concepts including the total number of synsets and their average depth.

PoS   #BC    Av. depth
Noun  3,210  5.08
Verb  1,442  2.45

Table 4: Balkanet Base Concepts using WN2.0

In a similar way, table 5 presents the Meaning Base Concepts including the total number of synsets and their average depth.

PoS   #BC  Av. depth
Noun  793  4.93
Verb  742  1.36

Table 5: Meaning Base Concepts using WN1.6

For nouns, the set of Balkanet BC is four times larger than the Meaning BC, while the average depth is similar in both sets (5.08 vs. 4.93 respectively). The verbal set of Balkanet BC is twice as large as the Meaning one while, contrary to the nominal subsets, their average depth is quite different (2.45 vs. 1.36). However, when comparing these sets of BC to the automatically selected BLC, it seems clear that for similar volumes, the automatic BLC appear to be deeper in the hierarchies (both for nouns and verbs).

In contrast, the BC derived from the Lexicographic Files of WN (or Supersenses) represent a much more coarse-grained set (26 categories for nouns and 15 for verbs).



5 Sense–groupings as semantic classes

In order to study to what extent the different sense–groupings could be of interest for class–based WSD, we present a comparative evaluation of the different sense–groupings in a controlled framework. We tested the behaviour of the different sets of sense–groupings (WN senses, Balkanet BC, Meaning BC, automatic BLC and SuperSenses) using the English all–words task of SensEval–3. Obviously, different sense–groupings provide different abstractions of the semantic content of WN, and we expect a different behaviour when disambiguating nouns and verbs. The most common baseline used to test the performance of a WSD system is the Most Frequent Sense Classifier. In this study, we use this simple but robust heuristic to compare the performances of the different sense–groupings. Thus, we use SemCor [8] (Kucera and Francis, 1967) to train Most Frequent Classifiers for each word and sense–grouping. We only used the brown1 and brown2 parts of SemCor to train the classifiers. We used the standard Precision, Recall and F1 measures (F1 being the harmonic mean of Precision and Recall) to evaluate the performance of each classifier.
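As a rough illustration of this evaluation setup, the following Python sketch trains a Most Frequent Class classifier from (word, class) pairs and scores it with Precision, Recall and F1. The toy data, the class labels and the decision to leave unseen words unanswered are assumptions of the sketch, not details taken from the experiments.

```python
from collections import Counter, defaultdict


def train_most_frequent_class(tagged_tokens):
    """Map each word to the class it is most often tagged with."""
    counts = defaultdict(Counter)
    for word, cls in tagged_tokens:
        counts[word][cls] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def evaluate(model, test_tokens):
    """Return (precision, recall, f1) over a list of (word, gold) pairs."""
    attempted = correct = 0
    for word, gold in test_tokens:
        if word in model:  # unseen words are left unanswered (assumption)
            attempted += 1
            correct += model[word] == gold
    precision = correct / attempted if attempted else 0.0
    recall = correct / len(test_tokens) if test_tokens else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical training and test data (word, semantic class).
train = [
    ("church", "group"),
    ("church", "artifact"),
    ("church", "artifact"),
    ("bank", "group"),
]
test = [("church", "artifact"), ("bank", "artifact"), ("river", "object")]

model = train_most_frequent_class(train)
precision, recall, f1 = evaluate(model, test)
print(model, precision, recall, f1)
```

Changing the class inventory only changes the labels in the training pairs, which is what makes this baseline convenient for comparing sense–groupings.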

For WN senses, Meaning BC, the automatic BLC, and Lexicographic Files, we used WN1.6. For Balkanet BC we used the synset mappings provided by (Daude, Padro, and Rigau, 2003) [9], translating the BC from WN2.0 to WN1.6. For testing the Most Frequent Classifiers we also used these mappings to translate the sense–groupings from WN1.6 to WN1.7.1.

Table 6 presents the polysemy degree for nouns and verbs of the different words when grouping their senses with respect to the different semantic classes on SensEval–3. Senses stands for WN senses, BLC-A for automatic BLC derived using a threshold of 20 and all relations, BLC-S for automatic BLC derived using a threshold of 20 and frequencies from SemCor, and SS for the SuperSenses. As expected, while increasing the abstraction level (from the sense level to the SuperSense level, passing through intermediate levels) the polysemy degree decreases. For instance in SensEval–3, at the sense level, the polysemy degree for nouns is 4.93 (4.93 senses per word), while at the SuperSense level, the polysemy degree for nouns is 3.06 (3.06 classes per word). Notice that the reduction is dramatic for verbs (from 11.0 to only 4.08). Notice also that when using the Base Level Concept representations a high degree of polysemy is maintained for nouns and verbs.
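The polysemy degree under a given sense–grouping can be sketched as the average number of distinct classes per word. The snippet below is only an illustration: the sense inventory and the class mapping are hypothetical, and averaging over word types (rather than over occurrences) is an assumption.

```python
def polysemy_degree(word_senses, sense_to_class):
    """Average number of distinct classes per word."""
    per_word = [
        len({sense_to_class[sense] for sense in senses})
        for senses in word_senses.values()
    ]
    return sum(per_word) / len(per_word)


# Hypothetical sense inventory.
word_senses = {
    "church": ["church 1", "church 2", "church 3"],
    "service": ["service 1", "service 3"],
}

# At the sense level every sense is its own class.
identity = {s: s for senses in word_senses.values() for s in senses}

# Hypothetical coarse-grained grouping of the same senses.
coarse = {
    "church 1": "group",
    "church 2": "artifact",
    "church 3": "act",
    "service 1": "act",
    "service 3": "act",
}

print(polysemy_degree(word_senses, identity))
print(polysemy_degree(word_senses, coarse))
```

Grouping senses can only keep or lower the number of classes per word, which is why the figures in the table decrease from left to right.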

        Senses  BLC-A  BLC-S  SS
Nouns    4.93   4.07   4.00   3.06
Verbs   11.00   8.64   8.72   4.08
N + V    7.66   6.13   6.13   3.52

Table 6: Polysemy degree over SensEval–3

[8] Annotated using WN1.6.
[9] http://www.lsi.upc.edu/~nlp/

Tables 7 and 8 present, for polysemous words, the performance in terms of F1 measure of the different sense–groupings using the relation criteria (All and Hypo) when training the class–frequencies on SemCor and testing on SensEval–3. That is, for each polysemous word in SensEval–3 the Most Frequent Class is obtained from SemCor. Best results are marked using bold.

Class        Nouns  Verbs
Senses       63.69  49.78
Balkanet     65.15  50.84
Meaning      65.28  53.11
BLC–0        66.36  54.30
BLC–10       66.31  54.45
BLC–20       67.64  54.60
BLC–30       67.03  54.60
BLC–40       66.61  55.54
BLC–50       67.19  55.69
SuperSenses  73.05  76.41

Table 7: F1 measure for polysemous words using all relations for BLC

In table 7, we present the results of using all relations for selecting BLC. As expected, SuperSenses obtain very high F1 results for nouns and verbs with 73.05 and 76.41, respectively. Comparing the BC from Balkanet and Meaning, the best results seem to be achieved by Meaning BC for both nouns and verbs. Notice that the set of BC from Balkanet is larger than the one selected in Meaning, which suggests that the BC from Meaning provide a better level of abstraction.

Interestingly, all sets of automatic BLC perform better than the BC provided by Balkanet or Meaning. For nouns, the best result is obtained for BLC using a threshold of only 20, with an F1 of 67.64. We should highlight this result since this set of BLC obtains better WSD performance than the rest of the automatically derived BLC while maintaining more information about the original synsets. Interestingly, BLC–20 using 558 classes achieves an F1 of 67.64, while SuperSenses using a much smaller set (26 classes) achieves 73.05.

For verbs, it seems that the restriction on the minimum number of concepts for a Base Level Concept has a positive impact on the generalization selection.

These results suggest that intermediate levels of representation such as the automatically derived Base Level Concepts could be appropriate for learning class-based WSD classifiers. Recall that for nouns SuperSenses use only 26 classes, while BLC–20 uses 558 semantic classes (more than 20 times as many).

In table 8, we present the results of using hyponymy relations for selecting the BLC. Again, all sets of automatically derived BLC perform better than the BC provided by Balkanet or Meaning. In this case, the best results for nouns are obtained again for BLC using a threshold of 20 (F1 of 67.28 with 558 classes). We can also observe that, in general, using hyponymy relations we obtain slightly lower performance than using all relations. Possibly, this indicates that a higher number of hyponymy relations is required for a Base Level Concept to compensate for a smaller (but richer) number of relations.


Table 8: F1 measure for polysemous words using hyponym relations for BLC

Tables 9 and 10 present, for polysemous words, the performance in terms of F1 measure of the different sense–groupings using the frequency criteria (FreqSC and FreqWN) when training the class–frequencies on SemCor and testing on SensEval–3. That is, for each polysemous word in SensEval–3 the Most Frequent Class is obtained from SemCor. Best results are marked using bold.

In table 9, we present the results of using frequencies from SemCor for selecting the BLC. In this case, not all sets of automatic BLC surpass the BC from Balkanet and Meaning. For nouns, the best result for automatic BLC is obtained when using a threshold of 50 (F1 of 68.84 with 94 classes), while for verbs, the best result is obtained when using a threshold of 40. However, in this case, verbal BLC obtain slightly lower results than when using the relations criteria (both all and hypo).


Table 9: F1 measure for polysemous words using frequencies from SemCor for BLC

In table 10, we present the results of using frequencies from WN for selecting the BLC. Again, not all automatic sets of BLC surpass the BC from Balkanet and Meaning. For nouns, the best result for automatic BLC is obtained when using a threshold of 40 (F1 of 69.16 with 132 classes), while for verbs, the best result is obtained when using a threshold of 50. We can also observe that, in general, using SemCor frequencies we obtain slightly lower performance than using WN frequencies. Again, verbal BLC obtain slightly lower results than when using the relations criteria (both all and hypo).


Table 10: F1 measure for polysemous words using frequencies from WN for BLC

These results for polysemous words reinforce our initial observations. That is, the method for automatically deriving intermediate levels of representation such as the Base Level Concepts seems to be robust enough for learning class-based WSD classifiers. In particular, it seems that BLC could achieve high levels of accuracy while maintaining adequate levels of abstraction (with hundreds of BLC). Moreover, the automatic BLC obtained using the relations criteria (All or Hypo) surpass the BC from Balkanet and Meaning. For verbs, it seems that even the unique top beginners require an extra level of abstraction (that is, the SuperSense level) to be effective.

6 Discussion

We can put the current results in context, although indirectly, by comparison with the results of the English SensEval–3 all–words task systems. In this case, the best system presented an accuracy of 65.1%, while the “WN first sense” baseline would achieve 62.4% [10]. Furthermore, it is also worth mentioning that in this edition only a few systems were above the “WN first sense” baseline (4 out of 26 systems). Usually, this baseline is very competitive in WSD tasks, and it is extremely hard to improve upon even slightly.

Tables 11 and 12 present, for monosemous and polysemous nouns and verbs, the F1 measures of the different sense–groupings obtained when training the class–frequencies on SemCor and testing on SensEval–3. Best results are marked using bold. Table 11 presents the results using the all relations criteria and table 12 presents the same results but using the WN frequency criteria.

[10] This result could be different depending on the treatment of multiwords and hyphenated words.

Class        Nouns  Verbs  Nouns+Verbs
Senses       71.79  52.89  63.24
Balkanet     73.06  53.82  64.37
Meaning      73.40  56.40  65.71
BLC–0        74.80  58.32  67.35
BLC–10       74.99  58.46  67.52
BLC–20       76.12  58.60  68.20
BLC–30       75.99  58.60  68.14
BLC–40       75.76  59.70  68.51
BLC–50       76.22  59.83  68.82
SuperSenses  81.87  79.23  80.68

Table 11: F1 measure for nouns and verbs using all relations for BLC

Obviously, higher accuracy figures are obtained when monosemous words are also included. Note that this naive system achieves for Senses an F1 of 63.24, very similar to those reported in SensEval–3, and for SuperSenses a very high F1 of 80.68. Regarding the automatic BLC, the best results are obtained for BLC–50, but all of them outperform the BC from Balkanet and Meaning. However, for nouns, BLC–20 (with 558 classes) obtains only slightly lower F1 figures than BLC–50 (with 253 classes).


Table 12: F1 measure for nouns and verbs using WN frequencies for BLC

When using frequencies instead of relations, BLC achieve even higher results. Again, the best results are obtained for BLC–50. However, in this case, not all of them outperform the BC from Balkanet and Meaning.

Surprisingly, these naive Most Frequent WSD systems trained on SemCor are able to achieve very high levels of accuracy. For nouns, using BLC–20 (selected from all relations, 558 semantic labels) the system reaches 75.62, while using BLC–40 (selected from WN frequencies, 132 semantic labels) the system achieves 78.03. Finally, using SuperSenses for verbs (15 semantic labels) this naive system scores 79.23.

To our knowledge, the best results for class–based WSD are those reported by (Ciaramita and Altun, 2006). This system performs sequence tagging using a perceptron–trained HMM, using SuperSenses, training on SemCor and testing on SensEval–3. The system achieves an F1–score of 70.74, a significant improvement over a baseline system which scores only 64.09. In this case, the first sense baseline is the SuperSense of the most frequent synset for a word, according to the WN sense ranking.

Possibly, the origin of the discrepancies between our results and those reported by (Ciaramita and Altun, 2006) is twofold: first, they use a BIO sequence schema for annotation and, second, they use the brown-v part of SemCor to establish sense–frequencies.

In order to measure the real contribution of the automatic BLC to the WSD task, we also performed a final set of experiments. Once the Most Frequent Class of a word had been learned from SemCor, we tested on SensEval–3 the first sense of the word appearing in WN for that Class. In that way, we developed a very simple sense tagger which uses the frequency counts of more coarse-grained sense–groupings. Table 13 presents the F1 measures for all nouns and verbs of this naive class–based sense tagger when using WN frequencies for building the automatic BLC. Note that these results are different from the rest since they are evaluated at the sense level.
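The class–based sense tagger just described can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the authors' code: the function name, the fallback to the first WN sense for unseen words, and all the toy data (including the class assigned to each sense and the most frequent class) are hypothetical.

```python
def class_based_sense(word, most_frequent_class, ranked_senses, sense_to_class):
    """Return the first WN-ranked sense of `word` in its most frequent class."""
    senses = ranked_senses.get(word, [])
    target = most_frequent_class.get(word)
    if target is not None:
        for sense in senses:
            if sense_to_class.get(sense) == target:
                return sense
    # Fallback to the first WN sense when no class is known (assumption).
    return senses[0] if senses else None


# Senses in WN order and a hypothetical sense-to-class mapping.
ranked_senses = {"church": ["church 1", "church 2", "church 3"]}
sense_to_class = {
    "church 1": "faith 3",
    "church 2": "church 2",
    "church 3": "religious ceremony 1",
}

# Hypothetical Most Frequent Class learned from the training corpus.
most_frequent_class = {"church": "church 2"}

print(class_based_sense("church", most_frequent_class, ranked_senses, sense_to_class))
print(class_based_sense("chapel", most_frequent_class, {"chapel": ["chapel 1"]}, sense_to_class))
```

In this toy setting the tagger picks church 2 rather than the first WN sense, because the coarse class observed most often in training differs from the class of the first sense.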


Table 13: F1 measure for nouns and verbs of the class–based sense tagger.

Surprisingly, all these opportunistic class–based sense taggers surpass the Most Frequent Sense tagger. Interestingly, all automatic BLC using a threshold higher than 10 obtain equal or better performance than SuperSenses. In fact, the best results for nouns are those obtained using BLC–30, while for verbs the best are those obtained by BLC–40. That is, the sense–groupings seem to establish more robust sense frequencies.

7 Conclusions and further work

The WSD task seems to have reached its maximum accuracy figures with the usual framework. Some of its limitations could come from the sense–granularity of WordNet (WN). WN has often been criticised because of its fine–grained sense distinctions. Nevertheless, other problems arise for supervised systems, such as data sparseness caused by the lack of adequate and sufficient training examples. Moreover, it is not clear how WSD, with its current results, can contribute to improving other NLP tasks.

Changing the set of classes could be a solution to enrich training corpora with many more examples. In this manner, the classifiers generalize among a heterogeneous set of labeled examples. At the same time these classes are more easily learned because there are clearer semantic distinctions between them. In fact, our naive most frequent class systems are able to perform semantic tagging with accuracy figures over 75%.

Base Level Concepts (BLC) are concepts that are representative of a set of other concepts. In the present work, a simple method for automatically selecting BLC from WN, based on the hypernym hierarchy and on the stored frequencies or the number of relationships between synsets, has been shown. Although some sets of Base Concepts are available at this moment (e.g. EuroWordNet, Balkanet, Meaning), their development requires a huge manual effort. Other sets of Base Concepts, like the WN Lexicographer Files (or SuperSenses), are clearly insufficient to describe and distinguish between the enormous number of concepts that are used in a text. Using a very simple baseline, the Most Frequent Class, our approach empirically shows a clear improvement over such other sets. In addition, our method is capable of producing more or less detailed sets of BLC without losing semantic discrimination power. Obviously, other criteria for selecting BLC should be investigated.

We are also interested in the direct comparison between automatically and manually selected BLC. An in-depth study of their correlations deserves more attention.

Having defined an appropriate level of abstraction using the new sets of BLC, we plan to use them for supervised class–based WSD. We suspect that higher accuracy figures for WSD could be obtained with this approach.

References

Agirre, E., I. Aldezabal, and E. Pociello. 2003. A pilot study of English selectional preferences and their cross-lingual compatibility with Basque. In Proceedings of the International Conference on Text Speech and Dialogue (TSD’2003), České Budějovice, Czech Republic.

Atserias, J., L. Villarejo, G. Rigau, E. Agirre, J. Carroll, B. Magnini, and P. Vossen. 2004. The MEANING multilingual central repository. In Proceedings of Global WordNet Conference (GWC’04), Brno, Czech Republic.

Ciaramita, M. and Y. Altun. 2006. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’06), pages 594–602, Sydney, Australia. ACL.

Ciaramita, M. and M. Johnson. 2003. Supersense tagging of unknown nouns in WordNet. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP’03), pages 168–175. ACL.

Curran, J. 2005. Supersense tagging of unknown nouns using semantic similarity. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL’05), pages 26–33. ACL.

Daude, J., Ll. Padro, and G. Rigau. 2003. Validation and tuning of WordNet mapping techniques. In Proceedings of the International Conference on Recent Advances on Natural Language Processing (RANLP’03), Borovets, Bulgaria.

Fellbaum, C., editor. 1998. WordNet. An Electronic Lexical Database. The MIT Press.

Hearst, M. and H. Schutze. 1993. Customizing a lexicon to better suit a computational task. In Proceedings of the ACL SIGLEX Workshop on Lexical Acquisition, Stuttgart, Germany.

Kucera, H. and W. N. Francis. 1967. Computational Analysis of Present-Day American English. Brown University Press, Providence, RI, USA.

Magnini, B. and G. Cavaglia. 2000. Integrating subject fields codes into WordNet. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00).

Marquez, Ll., G. Escudero, D. Martínez, and G. Rigau. 2006. Supervised corpus-based methods for WSD. In E. Agirre and P. Edmonds (Eds.) Word Sense Disambiguation: Algorithms and applications, volume 33 of Text, Speech and Language Technology. Springer.

Mihalcea, R. and D. Moldovan. 2001. Automatic generation of coarse grained WordNet. In Proceedings of the NAACL workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, Pittsburg, USA.

Niles, I. and A. Pease. 2001. Towards a standard upper ontology. In Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), pages 17–19. Chris Welty and Barry Smith, eds.

Peters, W., I. Peters, and P. Vossen. 1998. Automatic sense clustering in EuroWordNet. In First International Conference on Language Resources and Evaluation (LREC’98), Granada, Spain.

Rosch, E. 1977. Human categorisation. Studies in Cross-Cultural Psychology, I(1):1–49.

Segond, F., A. Schiller, G. Greffenstette, and J. Chanod. 1997. An experiment in semantic tagging using hidden Markov model tagging. In ACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications. ACL, New Brunswick, New Jersey, pages 78–81.

Snyder, Benjamin and Martha Palmer. 2004. The English all-words task. In Rada Mihalcea and Phil Edmonds, editors, Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 41–43, Barcelona, Spain, July. Association for Computational Linguistics.

Villarejo, L., L. Marquez, and G. Rigau. 2005. Exploring the construction of semantic class classifiers for WSD. In Proceedings of the 21st Annual Meeting of Sociedad Española para el Procesamiento del Lenguaje Natural SEPLN’05, pages 195–202, Granada, Spain, September. ISSN 1136-5948.

Vossen, P., L. Bloksma, H. Rodriguez, S. Climent, N. Calzolari, A. Roventini, F. Bertagna, A. Alonge, and W. Peters. 1998. The EuroWordNet base concepts and top ontology. Technical report, Paris, France.

Vossen, P., G. Rigau, I. Alegria, E. Agirre, D. Farwell, and M. Fuentes. 2006. Meaningful results for information retrieval in the MEANING project. In Proceedings of the 3rd Global Wordnet Conference, Jeju Island, South Korea, January.



Cognitive Modules of an NLP Knowledge Base for Language Understanding

Summary: Some natural language processing applications, e.g. machine translation, require a knowledge base provided with conceptual representations that can reflect the structure of the human cognitive system. In contrast, tasks such as automatic indexing or information extraction can be performed with surface semantics. In any case, the construction of a robust knowledge base guarantees its reuse in most natural language processing tasks. The purpose of this paper is to describe the main cognitive modules of FunGramKB, a multipurpose lexico-conceptual knowledge base to be implemented in natural language processing systems.
Key words: Knowledge representation, ontology, reasoning, meaning postulate.

Abstract: Some natural language processing systems, e.g. machine translation, require a knowledge base with conceptual representations reflecting the structure of human beings’ cognitive system. In some other systems, e.g. automatic indexing or information extraction, surface semantics could be sufficient, but the construction of a robust knowledge base guarantees its use in most natural language processing tasks, thus consolidating the concept of resource reuse. The objective of this paper is to describe FunGramKB, a multipurpose lexico-conceptual knowledge base for natural language processing systems. Particular attention will be paid to the two main cognitive modules, i.e. the ontology and the cognicon.
Keywords: Knowledge representation, ontology, reasoning, meaning postulate.

1 FunGramKB

FunGramKB Suite [1] is a user-friendly environment for the semiautomatic construction of a multipurpose lexico-conceptual knowledge base for a natural language processing (NLP) system within the theoretical model of S.C. Dik’s Functional Grammar (1978, 1989, 1997). FunGramKB is not a literal implementation of Dik’s lexical database; rather, we depart from the functional model in some important aspects with the aim of building a more robust knowledge base.

[1] We use the name ‘FunGramKB Suite’ to refer to our knowledge engineering tool and ‘FunGramKB’ to the resulting knowledge base.

On the one hand, FunGramKB is multipurpose in the sense that it is both multifunctional and multilanguage. In other words, FunGramKB can be reused in various NLP tasks (e.g. information retrieval and extraction, machine translation, dialogue-based systems, etc.) and with several natural languages. [2]

[2] English, Spanish, German, French and Italian are supported in the current version of FunGramKB.

Carlos Periñán-Pascual
Universidad Católica San Antonio
Campus de los Jerónimos s/n
30107 Guadalupe - Murcia (Spain)
[email protected]

Francisco Arcas-Túnez
Universidad Católica San Antonio
Campus de los Jerónimos s/n
30107 Guadalupe - Murcia (Spain)
[email protected]



On the other hand, our knowledge base is lexico-conceptual, because it comprises two general levels of information: a lexical level and a cognitive level. In turn, these levels are made up of several independent but interrelated components:

Lexical level (i.e. linguistic knowledge):
• The lexicon stores morphosyntactic, pragmatic and collocational information of words.
• The morphicon helps our system to handle cases of inflectional morphology.

Cognitive level (i.e. non-linguistic knowledge):
• The ontology is presented as a hierarchical structure of all the concepts that a person has in mind when talking about everyday situations.
• The cognicon stores procedural knowledge by means of cognitive macrostructures, i.e. script-like schemata in which a sequence of stereotypical actions is organised on the basis of temporal continuity, and more particularly on James Allen's temporal model (Allen, 1983, 1991; Allen and Ferguson, 1994).
• The onomasticon stores information about instances of entities, such as people, cities, products, etc.

The main consequence of this two-level design is that every lexical module is language-dependent, while every cognitive module is shared by all languages. In other words, computational lexicographers must develop one lexicon and one morphicon for English, one lexicon and one morphicon for Spanish and so on, but knowledge engineers build just one ontology, one cognicon and one onomasticon to process any language input cognitively. Section 2 gives a brief account of the psychological foundation of the FunGramKB cognitive level, and sections 3 and 4 describe the two main cognitive modules in that level, i.e. the ontology and the cognicon.

2 Cognitive knowledge in natural language understanding

In cognitive psychology, common-sense knowledge is usually divided into three different types (Tulving, 1985):

• Semantic knowledge, which stores cognitive information about words; it is a kind of mental dictionary.

• Procedural knowledge, which stores information about how events are performed in ordinary situations, e.g. how to ride a bicycle or how to fry an egg; it is a kind of manual for everyday actions.

• Episodic knowledge, which stores information about specific biographic events or situations, e.g. our wedding day; it is a kind of personal scrapbook.

Therefore, if there are three types of knowledge involved in human reasoning, there must be three different kinds of knowledge schemata. These schemata are mapped in an integrated way into the cognitive component of FunGramKB:

• Semantic knowledge is represented in the form of meaning postulates in the ontology.

• Procedural knowledge is represented in the form of cognitive macrostructures in the cognicon.

• Episodic knowledge can be stored as a case base. [3]

A key factor for successful reasoning is that all these knowledge schemata (i.e. meaning postulates, cognitive macrostructures and cases) must be represented through the same formal language, so that information sharing can take place effectively among all cognitive modules. Our formal language is partially founded on Dik’s model of semantic representation (1978, 1989, 1997), which was initially devised for machine translation (Connolly and Dik, 1989).

Computationally speaking, when storing cognitive knowledge through FunGramKB Suite, a syntactic-semantic checker is triggered, so that consistent well-formed constructs can be stored. Moreover, a parser outputs an XML-formatted feature-value structure used as the input for the reasoning engine, so that inheritance and inference mechanisms can be applied. Both the syntactic-semantic validator of meaning postulates and the XML parser were written in C#.

[3] FunGramKB can be very useful in case-based reasoning, where problems are solved by remembering previous similar cases and reusing general knowledge.

3 FunGramKB ontology

Nowadays there is no single right methodology for ontology development. Ontology design tends to be a creative process, so two ontologies designed by different people are likely to be structured differently (Noy and McGuinness, 2001). To avoid this problem, the ontology model should be founded on a solid methodology. The remainder of this section describes five methodological criteria applied to the FunGramKB ontology, some of which are based on principles implemented in other NLP projects (Bouaud et al., 1995; Mahesh, 1996; Noy and McGuinness, 2001). Defining these criteria in the analysis and design phases of the ontology model and applying them strictly in the development phase helped to avoid some common errors in conceptual modelling.

3.1 Symbiosis between universality and linguistic motivation

FunGramKB ontology takes the form of a universal concept taxonomy, where ‘universal’ means that every concept we can imagine has an appropriate place in this ontology. On the other hand, our ontology is linguistically motivated, as a result of its involvement with the semantics of lexical units, although the knowledge stored in our ontology is not specific to any particular language.

3.2 Subsumption as the only taxonomic relation

At first sight, it can seem that the exclusive use of the IS-A relation can impoverish the ontological model. Indeed, a consequence of this restriction on the taxonomic relation is found in the modelling of the upper level, where the metaconcepts #ENTITY, #EVENT and #QUALITY arrange nouns, verbs and adjectives respectively in cognitive dimensions. However, the fact that concepts linked to lexical units of different grammatical categories are not explicitly connected in our ontological model does not prevent FunGramKB from relating those lexical units at the cognitive level through their meaning postulates. Indeed, our ontology establishes a high degree of connectivity among conceptual units by taking into account semantic components which are shared by their meaning postulates. In order to incorporate human beings’ commonsense, our ontology must identify the relations which can be established among conceptual units, and hence among lexical units. However, displaying semantic similarities and differences through taxonomic relations turns out to be more chaotic than through meaning postulates linked to conceptual units.

3.3 Three-layered ontological model

FunGramKB ontology distinguishes three different conceptual levels, each one of them with concepts of a different type: metaconcepts, basic concepts and terminals. Figure (1) illustrates these three types of concepts.

#ENTITY
  → #PHYSICAL
    → #OBJECT
      → #SELF_CONNECTED_OBJECT
        → +ARTIFICIAL_OBJECT
          → +CORPUSCULAR
            → +SOLID
              → +BALL
                → $FOOTBALL

Figure 1: Example of ontological structuring in FunGramKB

Metaconcepts, preceded by the symbol #, constitute the upper level in the taxonomy. The analysis of the upper level in the main linguistic ontologies, namely DOLCE (Gangemi et al., 2002; Masolo et al., 2003), Generalized Upper Model (Bateman, 1990; Bateman, Henschel and Rinaldi, 1995), Mikrokosmos (Beale, Nirenburg and Mahesh, 1995; Mahesh and Nirenburg, 1995; Nirenburg et al., 1996), SIMPLE (Lenci, 2000; Lenci et al., 2000; Pedersen and Keson, 1999; SIMPLE Specification Group, 2000; Villegas and Brosa, 1999) and SUMO (Niles and Pease, 2001a, 2001b), led to a metaconceptual model whose design contributes to the integration and exchange of information with other ontologies, thus providing standardization and uniformity. Since metaconcepts reflect cognitive dimensions, they are not assigned meaning postulates. Therefore, our metaconcepts play the role of ‘hidden categories’, i.e. concepts which are not linked to any lexical unit so that they can serve as hidden superordinates and avoid circularity.

Cognitive Modules of an NLP Knowledge Base for Language Understanding

Basic concepts, preceded by the symbol +, are used in FunGramKB as defining units which enable the construction of meaning postulates for basic concepts and terminals, and which also take part as selection preferences in thematic frames. The starting point for the identification of basic concepts was the defining vocabulary in the Longman Dictionary of Contemporary English (Procter, 1978), though deep revision was required in order to perform cognitive mapping.

Finally, terminals are headed by the symbol $. The borderline between basic concepts and terminals is based on their definitory potential to take part in meaning postulates.
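The three conceptual levels can be told apart purely from the leading symbol of a concept name. The following Python fragment is a minimal sketch of that idea, not part of FunGramKB itself: the names Level, level_of and is_well_layered are hypothetical, and the layering check (metaconcepts above basic concepts above terminals along a subsumption chain) is an assumption read off Figure 1.

```python
from enum import Enum


class Level(Enum):
    """Conceptual levels, keyed by the symbol that precedes the concept name."""
    METACONCEPT = "#"
    BASIC = "+"
    TERMINAL = "$"


def level_of(concept: str) -> Level:
    """Classify a concept name by its leading symbol."""
    for level in Level:
        if concept.startswith(level.value):
            return level
    raise ValueError(f"unknown concept prefix: {concept!r}")


# Subsumption chain copied from Figure 1, most general concept first.
FIGURE_1_CHAIN = [
    "#ENTITY", "#PHYSICAL", "#OBJECT", "#SELF_CONNECTED_OBJECT",
    "+ARTIFICIAL_OBJECT", "+CORPUSCULAR", "+SOLID", "+BALL", "$FOOTBALL",
]


def is_well_layered(chain):
    """Assumed constraint: going down a chain, the level never moves upward."""
    order = [Level.METACONCEPT, Level.BASIC, Level.TERMINAL]
    ranks = [order.index(level_of(concept)) for concept in chain]
    return ranks == sorted(ranks)
```

Under these assumptions, the chain in Figure 1 passes the check, whereas a chain placing a metaconcept below a basic concept would not.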

3.4 Non-atomicity of conceptual units

In FunGramKB, basic and terminal concepts are not stored as atomic symbols but are provided with a rich internal structure consisting of semantic properties such as the thematic frame or the meaning postulate.

On the one hand, every event in the ontology is assigned one thematic frame, i.e. a prototypical cognitive construct which states the number and type of participants involved in the cognitive situation portrayed by the event. In turn, predicate frames of verbs in the lexicon are constructed from thematic frames in the ontology. For instance, hundir and zozobrar are Spanish verbs which trigger the same thematic frame, since both of them are linked to the same concept (example 1).

(1) SINK (x1)Agent (x2)Theme (x3: LIQUID ^ MUD)Location (x4)Origin (x5)Goal (f1:SLOW)Speed

However, these verbs can differ in their predicate frames, since they show different profiled arguments (examples 2-3).

(2) hundir (x1)NP / S / Agent (x2)NP / DO / Theme
hundir (x2)NP / S / Theme

(3) zozobrar (x2)NP / S / Theme

In other words, these lexical units are linked to the same thematic frame at the cognitive level, but the instantiation of this thematic frame can make divergences occur in predicate frames at the lexical level.4
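The relation between one thematic frame and several predicate frames can be sketched as a small data structure. The Python fragment below is only an illustration based on examples (1) to (3): the dictionary layout and the function profiled_roles are hypothetical, and selection preferences, syntactic functions and the Speed satellite are left out for brevity.

```python
# Thematic frame of the concept SINK, reduced to its participant roles
# as given in example (1).
THEMATIC_FRAME = {
    "concept": "SINK",
    "roles": {"x1": "Agent", "x2": "Theme", "x3": "Location",
              "x4": "Origin", "x5": "Goal"},
}

# Predicate frames from examples (2) and (3): each verb lists, per frame,
# the participants of the shared thematic frame that it profiles.
PREDICATE_FRAMES = {
    "hundir": [["x1", "x2"], ["x2"]],
    "zozobrar": [["x2"]],
}


def profiled_roles(verb):
    """Return, for each predicate frame of the verb, the roles it profiles."""
    roles = THEMATIC_FRAME["roles"]
    return [[roles[var] for var in frame] for frame in PREDICATE_FRAMES[verb]]
```

In this sketch both verbs point to the same thematic frame, while profiled_roles("hundir") and profiled_roles("zozobrar") differ, mirroring the divergence between the cognitive and the lexical level.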

On the other hand, a meaning postulate is a set of one or more logically connected predications (e1, e2... en), which are cognitive constructs carrying the generic features of the concept.5 Concepts, and not words, are the building blocks for the formal description of meaning postulates, so a meaning postulate becomes a language-independent semantic knowledge representation. To illustrate, some predications in the meaning postulates of an entity, event and quality are presented in examples (4), (5) and (6) respectively:6

(4) BIRD
+(e1: BE (x1: BIRD)Theme (x2: VERTEBRATE)Referent)
*(e2: HAVE (x1)Theme (x3: m FEATHER & 2 LEG & 2 WING)Referent)
*(e3: FLY (x1)Theme)

(5) KISS
+(e1: TOUCH (x1: PERSON)Agent (x2)Theme (f1: 2 LIP)Instrument (f2: (e2: LOVE (x1)Agent (x2)Theme) | (e2: GREET (x1)Agent (x2)Theme))Reason)

(6) HUGE
+(e1: BE (x2)Theme (x1: HUGE)Attribute)
+(e2: BE (x1)Theme (x3: SIZE)Referent)
+(e3: BE (x2)Theme (x4: m BIG)Attribute)

For instance, the predications in example (4) have the following natural language equivalents:

Birds are always vertebrates.
A typical bird has many feathers, two legs and two wings.
A typical bird flies.
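The paraphrases above suggest that a predication headed by + always holds, whereas one headed by * holds only for typical instances. The Python fragment below is a minimal sketch under that reading, which is inferred from the paraphrases rather than stated as a formal definition; Predication, parse_predication and describe are hypothetical names, and the body of each predication is kept as an unparsed string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predication:
    strict: bool  # True for '+', False for '*' (assumed reading)
    body: str     # the bracketed predication, left unparsed


def parse_predication(text: str) -> Predication:
    """Split the leading operator from the rest of the predication."""
    text = text.strip()
    if not text or text[0] not in "+*":
        raise ValueError(f"missing '+' or '*' operator: {text!r}")
    return Predication(strict=(text[0] == "+"), body=text[1:])


def describe(predication: Predication) -> str:
    """Gloss the operator in plain words."""
    return "always holds" if predication.strict else "holds for typical instances"


# Predications of BIRD, copied from example (4).
BIRD = [
    "+(e1: BE (x1: BIRD)Theme (x2: VERTEBRATE)Referent)",
    "*(e2: HAVE (x1)Theme (x3: m FEATHER & 2 LEG & 2 WING)Referent)",
    "*(e3: FLY (x1)Theme)",
]
```

Applied to BIRD, only the first predication is classified as strict, in line with "Birds are always vertebrates".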

Dik (1997) proposes using words from the language itself when describing meaning postulates, since meaning definition is an internal issue of the language. However, this strategy contributes to lexical ambiguity due to the polysemic nature of the defining lexical units. In addition, describing the meaning of words in terms of other words leads to some linguistic dependency (Vossen, 1994). Instead, FunGramKB employs concepts for the formal description of meaning postulates, resulting in an interlanguage representation of meaning.

4 The difference between thematic frames and predicate frames is partially grounded on the distinction between argument roles and participant roles in Goldberg’s Construction Grammar (1995).

5 Periñán Pascual and Arcas Túnez (2004) describe the formal grammar of well-formed predications for meaning postulates in FunGramKB.

6 For the sake of clarity, the names of conceptual units have been oversimplified.

An alternative could have been to use second-order predicate logic for the formal representation of lexical meaning. However, the problem lies not only in the limited expressive power of predicate logic, but also in the fact that standard logics use monotonic reasoning, which is not robust enough for the simulation of human beings’ commonsense reasoning.

3.5 Meaning postulates as ontological organizers

Our ontology structuring complies with the similarity, specificity and opposition principles applied to the meaning postulates of concepts. Firstly, all subordinate concepts must share the meaning postulate of their superordinate concept (i.e. similarity principle). Secondly, every subordinate concept must have a meaning postulate which states a distinctive feature (or differentia) not present in the meaning postulate of its superordinate concept (i.e. specificity principle). Finally, the differentiae in the meaning postulates of sibling concepts must be incompatible with one another (i.e. opposition principle).
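The three principles can be checked mechanically once meaning postulates are simplified to sets of features. The Python fragment below is a sketch under strong assumptions: features are plain strings, incompatibility between features is supplied by hand as unordered pairs, and two siblings count as opposed as soon as one pair of their differentiae is declared incompatible. The function names and the toy features used to exercise it are hypothetical.

```python
from itertools import combinations


def differentiae(child, parent):
    """Features of the subordinate that the superordinate lacks."""
    return child - parent


def check_principles(parent, children, incompatible):
    """Report violations of the similarity, specificity and opposition principles.

    parent:       set of features of the superordinate concept
    children:     dict mapping each subordinate concept to its set of features
    incompatible: set of frozensets, each holding two incompatible features
    """
    problems = []
    for name, features in children.items():
        if not parent <= features:
            problems.append(f"{name}: violates similarity")
        if not differentiae(features, parent):
            problems.append(f"{name}: violates specificity")
    for (a, fa), (b, fb) in combinations(children.items(), 2):
        da, db = differentiae(fa, parent), differentiae(fb, parent)
        clash = any(frozenset((x, y)) in incompatible for x in da for y in db)
        if not clash:
            problems.append(f"{a}/{b}: violates opposition")
    return problems
```

With a superordinate {"is_animal"} and two siblings that add "has_feathers" and "has_scales" respectively, the check returns no problems provided that this pair of features is declared incompatible.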

4 FunGramKB cognicon

Text understanding must not be restricted to the comprehension of individual sentences, but must involve the integration of all this information into a ‘situation model’ (Zwaan and Radvansky, 1998) with the purpose of reconstructing the textual world underlying the literal sense of the linguistic realizations which make up the text surface. The task of reconstructing the situation model of an input text requires NLP systems to hold human beings' commonsense knowledge in the form of generic cognitive structures which can facilitate inferences and predictions as well as information selection and management. Since scripts were devised by Schank and Abelson (1977), little effort has been made to build a large-scale database of procedural-knowledge schemata. For example, both expectation packages (Gordon, 1999) and ThoughtTreasure (Mueller, 1999) are systems which contain facts and rules about ordinary situations, but it is very difficult to apply any case-based reasoning on them.

In FunGramKB, meaning postulates are not sufficient to describe commonsense knowledge, but they contribute actively to build ‘cognitive macrostructures’ in the cognicon. In other words, our knowledge base integrates semantic knowledge from the ontology with procedural knowledge from the cognicon, resulting in a correlation that almost no NLP system has achieved yet. These schemata are described as ‘macrostructures’ because they are more comprehensive constructions than meaning postulates. While meaning postulates are ontology-oriented knowledge representations, cognitive macrostructures organize knowledge in scenes according to temporality and causality parameters. On the other hand, these macrostructures are described as ‘cognitive’ because they are built with conceptual units from the ontology. Unlike most natural language understanding systems, expectations about what is about to happen in a particular situation are not lexical but conceptual, so different lexical realizations with the same meaning in the same or different languages correspond to the same expectation in FunGramKB.

In example (7), we present some predications of the cognitive macrostructure Eating_at_restaurants:

(7) (e1: ENTER (x1: CUSTOMER)Theme (x2: RESTAURANT)Goal (f1: (e2: BE (x1) (x3: HUNGRY)Attribute))Reason)
(e3: ACCOMPANY (x4: WAITER)Theme (x1)Referent (f2: TABLE)Goal)
(e4: SIT (x1)Theme (x5: f1)Location)
(e5: BRING (x4)Theme (x6: MENU ^ WINE_LIST)Referent (f3: x1)Goal)
(e6: REQUEST (x1)Theme (x7: FOOD | BEVERAGE)Referent (x4)Goal)
(e7: TELL (x4)Theme (x8: (e8: COOK (x9: COOK)Theme (x10: FOOD)Referent)Referent (x9)Goal)
(e9: BRING (x4)Theme (x11: BEVERAGE)Referent (f4: BAR)Source)

The main advantage of this approach is that meaning postulates and cognitive macrostructures are represented through the same formal language, so that knowledge can be shared more effectively between FunGramKB cognitive modules, particularly when reasoning mechanisms are triggered.



5 Reasoning engine in FunGramKB

An NLP application is actually a knowledge-based system, so it must be provided with a knowledge base and a reasoning engine. Two reasoning processes have been devised to work with FunGramKB cognitive modules: MicroKnowing and MacroKnowing.

MicroKnowing (Microconceptual-Knowledge Spreading) is a multi-level process performed by means of two types of reasoning mechanisms: inheritance and inference. Our inheritance mechanism strictly involves the transfer of one or several predications from a superordinate concept to a subordinate one in the ontology. On the other hand, our inference mechanism is based on the structures shared between predications linked to conceptual units which do not take part in the same subsumption relation within the ontology. Cyclical application of the inheritance and inference mechanisms on our meaning postulates allows FunGramKB to minimize redundancy while keeping our knowledge base as informative as possible. When the language engineer modifies an existing meaning postulate or builds a new one, FunGramKB Suite automatically performs the MicroKnowing for that meaning postulate just before it is stored, in order to check the compatibility of the newly-incorporated predications with other predications involved in the reasoning process. The language engineer is informed about any incompatibility with inferred or inherited predications. In addition, FunGramKB Suite displays the whole MicroKnowing process step by step, enabling us to verify inference and inheritance conditions in a transparent way.7
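The inheritance half of this process, together with the compatibility check performed before storage, can be illustrated with a toy model. The Python fragment below is a sketch and not the MicroKnowing algorithm itself: predications are reduced to hypothetical feature-value pairs, a conflict is assumed to be the same feature with a different value, the inference mechanism is omitted, and only a fragment of the chain in Figure 1 is used.

```python
# Fragment of the subsumption chain in Figure 1 (subordinate -> superordinate).
PARENT = {"$FOOTBALL": "+BALL", "+BALL": "+SOLID", "+SOLID": None}

# Hypothetical feature-value stand-ins for the predications of each concept.
POSTULATES = {
    "+SOLID": {"state": "solid"},
    "+BALL": {"shape": "round"},
    "$FOOTBALL": {"use": "football"},
}


def inherited(concept, parent=PARENT, postulates=POSTULATES):
    """Collect predications from all superordinates, nearest one first."""
    result = {}
    node = parent.get(concept)
    while node is not None:
        for feature, value in postulates.get(node, {}).items():
            result.setdefault(feature, value)  # nearer superordinates win
        node = parent.get(node)
    return result


def incompatibilities(concept, new_predications):
    """List features whose new value clashes with an inherited value."""
    base = inherited(concept)
    return sorted(
        feature
        for feature, value in new_predications.items()
        if feature in base and base[feature] != value
    )
```

In this toy model, adding {"shape": "oval"} to $FOOTBALL would be reported as incompatible with the predication inherited from +BALL, which corresponds to the kind of warning the language engineer receives.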

Currently we are working on the MacroKnowing (Macroconceptual-Knowledge Spreading), i.e. the process of integrating meaning postulates from the ontology with the cognitive macrostructures in the cognicon in order to spread the procedural knowledge stored in FunGramKB. This interaction of semantic and procedural knowledge, so distinctive of human reasoning, is hardly found in NLP systems to date.

7 Periñán Pascual and Arcas Túnez (2005) give an accurate description of MicroKnowing in FunGramKB.

6 Conclusion

In NLP, knowledge is usually applied to the input text for two main tasks: parsing (e.g. spell checking, syntactic ambiguity resolution, etc.) and partial understanding (e.g. lexical ambiguity resolution, document classification, etc.). Full natural language understanding is hardly ever performed. Indeed, deep semantics for NLP is currently very limited, perhaps because most applications exploit WordNet as a source of information. Moreover, researchers do not even agree on how much semantic information is sufficient to achieve the best outcome. However, it is thought that performance is improved if the system is provided with a robust knowledge base and a powerful inference component (Vossen, 2003). In fact, the main problem in the successful development of natural language understanding systems lies in the lack of an extensive commonsense knowledge base. Since commonsense is mainly made up of semantic and procedural knowledge, which FunGramKB stores in the form of meaning postulates and cognitive macrostructures respectively, we can conclude that FunGramKB can help language engineers to design more intelligent NLP applications.

Bibliography

Allen, J. 1983. Maintaining knowledge about temporal intervals. Communications of the ACM, 26 (11): 832-843.

Allen, J. 1991. Time and time again: the many ways to represent time. International Journal of Intelligent Systems, 6 (4): 341-355.

Allen, J. and G. Ferguson. 1994. Actions and events in interval temporal logic. Journal of Logic and Computation, 4 (5): 531-579.

Bateman, J.A. 1990. Upper modeling: a general organization of knowledge for natural language processing. Workshop on Standards for Knowledge Representation Systems. Santa Barbara.

Bateman, J.A., R. Henschel, and F. Rinaldi. 1995. The Generalized Upper Model 2.0. Technical report. IPSI/GMD, Darmstadt.

Beale, S., S. Nirenburg, and K. Mahesh. 1995. Semantic analysis in the Mikrokosmos machine translation project. Proceedings of the Symposium on NLP. Bangkok.

Bouaud, J., B. Bachimont, J. Charlet, and P. Zweigenbaum. 1995. Methodological principles for structuring an ontology. Proceedings of IJCAI'95: Workshop on Basic Ontological Issues in Knowledge Sharing. Montreal.

Connolly, J.H. and S.C. Dik. eds. 1989. Functional Grammar and the Computer. Foris, Dordrecht.

Dik, S.C. 1978. Functional Grammar. Foris, Dordrecht.

Dik, S.C. 1989. The Theory of Functional Grammar. Foris, Dordrecht.

Dik, S.C. 1997. The Theory of Functional Grammar. Mouton de Gruyter, Berlin-New York.

Gangemi, A., N. Guarino, C. Masolo, A. Oltramari, and L. Schneider. 2002. Sweetening ontologies with DOLCE. Proceedings of EKAW 2002. 166-181, Sigüenza.

Goldberg, A.E. 1995. Constructions: A Construction Grammar Approach to Argument Structure. The University of Chicago Press, Chicago.

Gordon, A.S. 1999. The design of knowledge-rich browsing interfaces for retrieval in digital libraries. Doctorate thesis. Northwestern University.

Lenci, A. 2000. Building an ontology for the lexicon: semantic types and word meaning. Workshop on Ontology-Based Interpretation of Noun Phrases. Kolding.

Lenci, A., N. Bel, F. Busa, N. Calzolari, E. Gola, M. Monachini, A. Ogonowski, I. Peters, W. Peters, N. Ruimy, M. Villegas, and A. Zampolli. 2000. SIMPLE: a general framework for the development of multilingual lexicons. International Journal of Lexicography, 13 (4): 249-263.

Mahesh, K. 1996. Ontology development for machine translation: ideology and methodology. Technical report MCCS-96-292. Computing Research Laboratory, New Mexico State University, Las Cruces.

Mahesh, K. and S. Nirenburg. 1995. Semantic classification for practical natural language processing. Proceedings of the 6th ASIS SIG/CR Classification Research Workshop: An Interdisciplinary Meeting. 79-94, Chicago.

Masolo, C., S. Borgo, A. Gangemi, N. Guarino, and A. Oltramari. 2003. WonderWeb deliverable D18: ontology library. Technical report. Laboratory for Applied Ontology, ISTC-CNR.

Mueller, E.T. 1999. A database and lexicon of scripts for ThoughtTreasure. [http://cogprints.ecs.soton.ac.uk/archive/00000555/]

Niles, I. and A. Pease. 2001a. Origins of the Standard Upper Merged Ontology: a proposal for the IEEE Standard Upper Ontology. Working Notes of the IJCAI-2001 Workshop on the IEEE Standard Upper Ontology. Seattle.

Niles, I. and A. Pease. 2001b. Towards a standard upper ontology. Proceedings of the 2nd International Conference on Formal Ontology in Information Systems. Ogunquit.

Nirenburg, S., S. Beale, K. Mahesh, B. Onyshkevych, V. Raskin, E. Viegas, Y. Wilks, and R. Zajac. 1996. Lexicons in the MikroKosmos project. Proceedings of the AISB’96 Workshop on Multilinguality in the Lexicon. Brighton.

Noy, N.F. and D.L. McGuinness. 2001. Ontology development 101: a guide to creating your first ontology. Technical report KSL-01-05. Stanford Knowledge Systems Laboratory, Stanford University.

Pedersen, B.S. and B. Keson. 1999. SIMPLE – semantic information for multifunctional plurilingual lexica: some examples of Danish concrete nouns. Proceedings of the SIGLEX-99 Workshop. Maryland.

Periñán Pascual, C. and F. Arcas Túnez. 2004. Meaning postulates in a lexico-conceptual knowledge base. Proceedings of the 15th International Workshop on Databases and Expert Systems Applications. 38-42, IEEE, Los Alamitos.

Periñán Pascual, C. and F. Arcas Túnez. 2005. Microconceptual-Knowledge Spreading in FunGramKB. Proceedings on the 9th IASTED International Conference on Artificial Intelligence and Soft Computing. 239-244, ACTA Press, Anaheim-Calgary-Zurich.



Procter, P. ed. 1978. Longman Dictionary of Contemporary English. Longman, Harlow.

Schank, R. and R.P. Abelson. 1977. Scripts, Plans, Goals and Understanding. Lawrence Erlbaum, Hillsdale.

SIMPLE Specification Group. 2000. Specification SIMPLE Work Package 2: linguistic specifications deliverable D2.1. Technical report.

Tulving, E. 1985. How many memory systems are there? American Psychologist, 40: 385-398.

Villegas, M. and I. Brosa. 1999. Spanish SIMPLE: lexicon documentation. Technical report.

Vossen, P. 1994. The end of the chain: where does decomposition of lexical knowledge lead us eventually? E. Engberg-Pedersen, L. Falster Jakobsen, and L. Schack Rasmussen. eds. Function and Expression in Functional Grammar. 11-39, Mouton de Gruyter, Berlin-New York.

Vossen, P. 2003. Ontologies. R. Mitkov. ed. The Oxford Handbook of Computational Linguistics. 464-482, Oxford University Press, Oxford.

Zwaan, R.A. and G.A. Radvansky. 1998. Situation models in language comprehension and memory. Psychological Bulletin, 123 (2): 162-185.



Text as Scene: Discourse Deixis and Bridging Relations

Marta Recasens Universitat de Barcelona

Gran Via Corts Catalanes, 585 08007 Barcelona

[email protected]

M. Antònia Martí Universitat de Barcelona

Gran Via Corts Catalanes,585 08007 Barcelona [email protected]

Mariona Taulé Universitat de Barcelona

Gran Via Corts Catalanes,585 08007 Barcelona [email protected]

Abstract: This paper presents a new framework, “text as scene”, which lays the foundations for the annotation of two coreferential links: discourse deixis and bridging relations. The incorporation of what we call textual and contextual scenes provides more flexible annotation guidelines, with broad type categories being clearly differentiated. Such a framework, capable of dealing with discourse deixis and bridging relations from a common perspective, aims at improving the poor reliability scores obtained by previous annotation schemes, which fail to capture the vague references inherent in both these links. The guidelines presented here complete the annotation scheme designed to enrich the Spanish CESS-ECE corpus with coreference information, thus building the CESS-Ancora corpus.

Keywords: corpus annotation, anaphora resolution, coreference resolution.

Resumen: En este artículo se presenta un nuevo marco, “el texto como escena”, que establece las bases para la anotación de dos relaciones de correferencia: la deixis discursiva y las relaciones de bridging. La incorporación de lo que llamamos escenas textuales y contextuales proporciona unas directrices de anotación más flexibles, que diferencian claramente entre tipos de categorías generales. Un marco como éste, capaz de tratar la deixis discursiva y las relaciones de bridging desde una perspectiva común, tiene como objetivo mejorar el bajo grado de acuerdo entre anotadores obtenido por esquemas de anotación anteriores, que son incapaces de captar las referencias vagas inherentes a estos dos tipos de relaciones. Las directrices aquí presentadas completan el esquema de anotación diseñado para enriquecer el corpus español CESS-ECE con información correferencial y así construir el corpus CESS-Ancora.

Palabras clave: anotación de corpus, resolución de la anáfora, resolución de la correferencia.

1 Introduction

Due to the lack of large annotated corpora with anaphoric information, the field of computational coreference resolution is still highly knowledge-based, especially for languages other than English. With a view to building a corpus-based coreference resolution system for Spanish, our project is to extend the morphologically, syntactically and semantically annotated CESS-ECE corpus (500,000 words) with pronominal and full noun-phrase (NP) coreference information (thus building the CESS-Ancora corpus). The design of the annotation guidelines is presented in (Recasens, Martí & Taulé, 2007), but two types of coreferential links, namely discourse deixis1 and bridging relations2, call for a specific analysis which takes into account their complex peculiarities so as to determine the most appropriate set of attributes and values.

1 We define discourse deixis (or abstract anaphora) as reference to a discourse segment, that is, to a non-nominal antecedent.

We believe that the more consistent the linguistic basis underlying the annotation scheme is, the easier it is to build a state-of-the-art coreference resolution system. On the other hand, coreferential –anaphoric in particular– relations are very much specific to each language. Unlike English, for instance, Spanish has three series of demonstratives and pronouns marked for neuter gender. The typology presented in this paper is the completion of a flexible annotation scheme rich enough to cover the cases of coreference in Spanish.

2 Our approach classifies as bridging (or associative anaphors) those definite or demonstrative NPs that are interpreted on the grounds of a metonymic relationship with a previous NP or VP.



Apart from being a useful resource for training and evaluating coreference resolution systems for Spanish, from a linguistic point of view, the annotated corpus will serve as a workbench to test for Spanish the hypotheses suggested by Ariel (1988) and Gundel, Hedberg & Zacharski (1993) about the cognitive factors governing the use of referring expressions. The only way theoretical claims coming from a single person’s intuitions can be proved is on the basis of empirical data that have been annotated in a reliable way.

As a follow-up, this paper places the emphasis on the annotation guidelines for discourse deixis and bridging relations. Both are considered from a common perspective: what we call “text as scene”, that is, the text taken as a scene in the sense that it builds up both a textual and a contextual framework as the result of an interaction between the discourse and the global context.

The rest of the paper proceeds as follows: Section 2 reviews previous work on abstract and bridging anaphora. A description of the “text as scene” framework is provided in Section 3. Specific guidelines for annotating discourse deixis and bridging relations are given in Section 4. Finally, Section 5 presents our conclusions and discussion of the guidelines.

2 Previous work

Given the difficulty of dealing with antecedents other than NPs, most of the work on anaphora resolution has ignored abstract anaphora and has been limited to individual anaphora. However, the work of Byron (2002) has emphasized that demonstrative pronouns referring to preceding clauses abound in natural discourse3. In this line, the corpus-based study of the use of demonstrative NPs in Portuguese and French conducted by Vieira et al. (2002) has pointed out that a system limited to the resolution of anaphors with a nominal antecedent is likely to fail on about 30% of the cases.

In her seminal study, Webber (1988) coins the term “discourse deixis” for reference to discourse segments and argues that these should be included in the discourse model as discourse entities, since they can be subsequently referenced via deictic expressions. Nevertheless, a discourse entity corresponding to a textual segment is not added to the discourse model until the listener finds a subsequent deictic pronoun, in the so-called accommodation process4. Works on parsing texts into discourse segments (Marcu, 1997) have not dealt with the problem of discourse deixis, i.e. delimiting the extent of the antecedent.

3 Byron’s anaphora resolution algorithm differentiates Mentioned Entities (those evoked by NPs) from Activated Entities (those evoked by linguistic constituents other than NPs, involving global focus entities).

With respect to corpus annotation, there are not many annotation schemes that annotate antecedents other than NPs. The MUC Task Definition (Hirschman & Chinchor, 1997) explicitly defines demonstratives as non-markables. Two notable exceptions are the MATE scheme by Poesio (2000) and the scheme by Tutin et al. (2000), although both point out the difficulty of delimiting the exact part of the text that counts as antecedent as well as the type of object the antecedent is. Tutin et al. (2000) decide to select the largest possible antecedent.

Similarly to discourse deixis, authors seem sceptical about the feasibility of the annotation task for bridging relations, especially since the empirical study conducted by Poesio & Vieira (1998), which reported an agreement of 31%. The issue under debate is where the boundary lies between a discourse-new NP and a bridging one, that is, between autonomous and non-autonomous definite NPs. Fraurud’s (1990) starting point for her corpus-based study is a two-way distinction between first-mentions and subsequent mentions (coreferential NPs). On realising that 60% of the definite NPs were first-mention uses, she concludes that in addition to the syntactic (in)definiteness of an NP, the lexico-encyclopaedic knowledge associated with the head noun of the NP interacts with the general knowledge associated with present anchors in order to select one or more anchors in relation to which a first-mention definite NP is interpreted. Anchors may be provided in the discourse itself –either explicitly or implicitly–, by the global context, or by a combination of the two. Although Fraurud does not use the term, the first-mention NPs that are interpreted in relation to an explicit anchor correspond to “bridging relations”.

4 Accommodation results from the use of a singular definite, which is felt to presuppose that there is already a unique entity in the context with the given description that will allow a truth value to be assigned to the utterance (Lewis, 1979).

In their analysis of the use of pronouns and demonstrative NPs in bridging relations, Gundel, Hedberg & Zacharski (2000) conclude that such cases are best analysed as minor violations of the Givenness Hierarchy, in that the listener gets away with an underspecified referent on the basis of what is predicated in the text.

What, then, do discourse deixis and bridging relations have in common? On the one hand, they are the anaphoric links with the poorest reliability scores. On the other hand –and probably a cause of the former–, their antecedents are rather fuzzy, either because their extension cannot be clearly determined or because the semantic relation that links them with their anaphor cannot be easily identified. Taking into account the low inter-annotator agreement together with the idea of vague reference, we propose viewing the text as a scene in order to provide a wider contextual framework that captures those cases in which a discourse entity alludes to something that is not literally mentioned in the discourse.

3 Text as scene

Previous attempts at annotating coreference have shown the need for reconsidering the annotation of both discourse deixis and bridging relations, since the reference of NPs such as esto, la cosa, and este mercado in (1), (2) and (3) respectively5 cannot be accounted for by approaches that insist on linking each anaphoric expression to an explicit textual antecedent.

(1) El Komercni Banka –Banco Comercial–, uno de los cuatro bancos más grandes de la República Checa, anunció hoy que despedirá a 2.300 empleados más antes de finales del año dentro del proceso de saneamiento de la entidad estatal. El director del banco, Radovan Vrava, señaló que el motivo principal es la reestructuración del banco. El Estado dispone del 60 por ciento de las acciones del Komercni Banka y el Gobierno checo quiere comenzar el proceso de privatización de este banco ya en este año y terminarlo en septiembre del 2001. Otro de los objetivos es evitar que se repitan los errores del pasado, que obligaron al Gobierno a comprar créditos dudosos por un valor de 60.000 millones de coronas –1.500 millones de dólares. Esto permitirá al banco sanear su portafolio...6

5 The reader is asked to please forgive the length of most of the examples used in this paper, but the anaphoric expressions we deal with make no sense unless the context is provided.

(2) “Las previsiones para los próximos diez días no son nada halagueñas”, pronosticó ayer Eduardo Coca, director del Instituto Nacional de Meteorología. Tan sólo un pequeño frente con poca agua debía cruzar el norte de la península entre ayer y hoy. Por lo demás, seguirá la situación anticiclónica. Pero la cosa no acaba ahí.7

(3) El presidente de la Comisión del Mercado de las Telecomunicaciones mostró su preocupación por la falta de competencia en la telefonía local, como consecuencia de que la liberalización de las telecomunicaciones se ha hecho por principios jurídicos y no técnicos y que “hay que abrir este mercado como sea”.8

6 (1) The Komercni Banka –Commercial Bank–, one of the four biggest banks in the Czech Republic, announced today that it will dismiss 2,300 more workers by the end of the year within the reform process of the state entity. The director of the bank, Radovan Vrava, pointed out that the main reason is the restructuring of the bank. The State holds 60 per cent of the shares of the Komercni Banka and the Czech Government wants to begin the privatisation process of this bank already this year and finish it in September 2001. Another of the goals is to avoid the repetition of past mistakes, which forced the Government to buy doubtful credits for the price of 60,000 million crowns –1,500 million dollars. This will allow the bank to reform its portfolio.

7 (2) “The forecasts for the next ten days are not favourable at all”, forecasted yesterday Eduardo Coca, director of the National Institute of Meteorology. Only a small front with little water should cross the north of the peninsula between yesterday and today. As for the rest, the anticyclonic situation will persist. But the thing does not end there.

8 (3) The president of the Commission of the Market of Telecommunications showed his concern for the lack of competition in local telephony, as a consequence of the fact that the liberalisation of telecommunications has been done by juridical and not technical principles and that “this market must be opened at all costs”.

Our coding scheme is defined from the consideration of the text as a scene in two different senses (see Figure 1), the scene being the cohesive element. On the one hand, discourse deixis captures those anaphoric expressions that refer back to the textual scene, that is, to a discourse segment –either at the sentence level or beyond the sentence– that builds up a scene as a whole. On the other hand, bridging captures those implicit relations (between two discourse entities) that are enabled by the contextual scene activated by the involved entities. A contextual scene is taken to be the knowledge which does not explicitly appear in the text, but that contributes to its comprehension. Bridging is treated within coreference in the sense that the two discourse entities share the reference point on the basis of a contextual scene.

Figure 1: Textual and contextual scenes

Back to example (1), the discourse segment picked up by the pronoun esto –that which is going to allow the Czech bank to reform its portfolio– results not only from the last discourse segment, but from combining the content of the events that form the entire textual scene: the dismissal of 2,300 workers, the restructuring of the bank, its privatisation, and the avoidance of past mistakes. Similarly, the definite NP la cosa in (2) makes reference to the textual scene previously described. It becomes a quasi-pronominal form in that it is almost semantically empty. Finally, example (3) shows a case of bridging, where the interpretation of the demonstrative NP este mercado is made possible by the contextual scene activated by a former NP, la telefonía local, namely, the market opened by local telephony.

Text as scene provides a common framework within which we are able to reach a consensus as to the typology of referring expressions that can code discourse deixis and bridging relations as well as the subtypes of links that need to be annotated with a view to achieving a level of inter-annotator agreement as high as possible.

4 Corpus annotation

The CESS-ECE corpus is the largest annotated corpus of Spanish, which contains 500,000 words mostly coming from newspaper articles. It has been annotated with morphological information (PoS), syntactic constituents and functions, argument structures and thematic roles, tagged with strong and weak named entities, and the 150 most frequent nouns have their WordNet synset.

Drawing from the MATE scheme (Poesio, 2000) and taking into account the information already annotated, the enrichment of the corpus with coreference annotation is divided into two steps: a first automatic stage, and a second manual one. The former marks up all NPs of the corpus as <de> (discourse entity) with an ID number, and fills in the TYPE attributes with morphological information (the kind of NP); the latter adds those <de> elements that the automatic annotation failed to identify and codes the coreferential relations by incorporating the <link> element.
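As a rough illustration of the automatic first stage, the sketch below wraps a list of noun phrases in <de> elements with running IDs. The input format (a list of text and PoS-tag pairs) and the function name are assumptions made for this example; only the element name, the TYPE and ID attributes and the de_NN pattern of the IDs are taken from the examples in this paper.

```python
import xml.etree.ElementTree as ET


def mark_discourse_entities(noun_phrases):
    """Wrap each (text, pos_tag) pair in a <de> element with a running ID.

    The (text, pos_tag) input format is hypothetical; the corpus tools
    may represent NPs quite differently.
    """
    elements = []
    for index, (text, pos_tag) in enumerate(noun_phrases, start=1):
        de = ET.Element("de", {"type": pos_tag, "ID": f"de_{index:02d}"})
        de.text = text
        elements.append(de)
    return elements


# Two NPs taken from examples (4) and (7); the numbering here is arbitrary.
nps = [("este camino", "dd0ms0"), ("Esto", "pd0ns00")]
for de in mark_discourse_entities(nps):
    print(ET.tostring(de, encoding="unicode"))
```

The manual second stage would then add any missing <de> elements and the <coref:link> elements on top of this output.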

It is at this second stage when antecedents expressed by phrases other than nominal are marked manually as <seg> elements when necessary. The <coref:link> elements serve to show coreferential relations holding between two discourse entities, and the embedded <coref:anchor> element points to the ID of the antecedent. Each <coref:link> has a TYPE attribute that specifies the kind of coreferential relation. We distinguish seven types of links:

(i) ident (identity)
(ii) dx (discourse deixis)
(iii) poss (possessor)
(iv) bridg (bridging)
(v) pred (predicative)
(vi) rank (ranking)
(vii) context (contextual)

Given that the marking of both discourse deixis and bridging relations is very useful for tasks such as question answering (answer fusion), information extraction (template merging) and text summarization, but that the annotation of these two links poses great difficulty, we consider it necessary to devote the two following sections to specifying their annotation guidelines, which are based on our conception of the text as scene.

[Figure 1 panels: discourse deixis, illustrated with the text of example (2); bridging relation through a contextual scene (ctx-sc), illustrated with the text of example (3).]

4.1 Discourse deixis (dx)

We consider an anaphoric NP to be in a dx relation when its antecedent is a textual scene expressed by a clause or a sequence of clauses. NPs that have the potential to participate in dx links are demonstrative pronouns, the neuter personal pronoun lo, the relative pronoun que, demonstrative full NPs, and definite descriptions (DD) of the kind la cosa, el fenómeno, la situación, etc. We call these NPs “quasi-pronominal DDs”, as they can be replaced by the pronoun esto and are almost empty of semantic content of their own.

Textual scenes are not constituted as such until a corresponding referring expression appears in the discourse. The pronouns lo and que tend to refer to textual scenes within the same discourse segment or introduced in the previous sentence, while demonstratives and quasi-pronominal DDs can refer to scenes that are more than one sentence away. Since it is not a trivial matter to decide the exact part of the text that serves as antecedent, we distinguish between two SUBTYPE attributes for dx:

(i) subtype=“sent” (sentential) This subclass covers the less problematic cases of discourse deixis, i.e. those anaphoric NPs that refer to a textual scene whose extent is no longer than one sentence (any discourse segment from period to period). We mark the non-nominal antecedent as a <seg> element with an ID number, which fills the <coref:anchor>. When in doubt about the exact delimitation of the text segment, the entire sentence is marked up. For ease of presentation, (4a) shows the extent of the antecedent for the anaphoric demonstrative NP este camino9, whereas (4b) codes the link as it is done in the annotation of the CESS-Ancora corpus.

Taking into account that the pronoun alone is not enough to pick up its referent, but that this is made clear from the predicate complement information (Byron, 2000), we further specify the “sent” value with the semantic type of the antecedent: “sent-ev” for events (4), “sent-fact” for facts (5), and “sent-prop” for propositions (6).

9 In the examples, underlines correspond to anaphoric expressions, while square brackets identify their antecedents.

(4) a. La ministra Anna Birulés animó a las pymes a [invertir en Investigación y Desarrollo] y *0* mostró a los empresarios presentes la disposición del Gobierno a facilitar este camino.10

b. La ministra Anna Birulés animó a las pymes a <seg ID="seg_03"> invertir en Investigación y Desarrollo </seg> y *0* mostró a los empresarios presentes la disposición del Gobierno a facilitar <de type="dd0ms0" ID="de_06">este camino </de>.
<coref:link ID="de_06" type="dx" subtype="sent-ev"> <coref:anchor ID="seg_03"/> </coref:link>

(5) Sin embargo, [los virus logran poner a su servicio al organismo vivo más desarrollado que existe: el ser humano.] Es éste un hecho que hace temblar el edificio que la humanidad ha construido.11

(6) [La Coordinadora de Organizaciones de Agricultores y Ganaderos teme que la falta de lluvia afecte también a los regadíos, dado que empieza a reducirse el volumen de agua embalsada.] Este temor es compartido por...12

(ii) subtype=“text” (textual scene) The textual scene subtype includes those cases discussed in Section 3 ((1) and (2)), where an anaphoric expression refers to the whole scene built up by the preceding text. These are cases that result from global discourse effects, so the precise anchor goes beyond the single sentence level and is usually vague in reference.

10 (4) The minister Anna Birulés encouraged the SMEs [to invest in Research and Development] and showed the businessmen present the Government’s willingness to facilitate this path.

11 (5) Nevertheless, [viruses manage to put at their service the most developed living organism that exists: the human being.] This is a fact that makes the edifice that humanity has built tremble.

12 (6) [The Coordinator of Organisation of Farmers fears that the lack of rain also affects irrigations, given that the volume of dammed water is starting to decrease.] This fear is shared by...



Therefore, as <coref:anchor> we indicate the ID of the paragraph (<par>) to which the anaphor belongs, thus indicating that the reference is made to the textual scene going from the beginning of the paragraph to the anaphor. As an example, (7) shows the annotation for the anaphoric NP in (1).

(7) <de type="pd0ns00" ID="de_09"> Esto </de> permitirá al banco sanear su portafolio.13
<coref:link ID="de_09" type="dx" subtype="text"> <coref:anchor ID="par_05"/> </coref:link>

Demonstratives which are part of idiomatic phrases, such as the connectors de esta forma or en este sentido, are not considered as markables, since they are mere linking phrases.
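To make the dx encoding concrete, the following sketch reads an annotated fragment and pairs each discourse-deictic anaphor with the text of its anchor. It is only an illustration: the namespace URI bound to the coref: prefix is invented (the scheme as described here does not state one), the sample is a shortened version of (4b) wrapped in a <par> element, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: the paper uses the "coref:" prefix without giving one.
COREF_NS = "http://example.org/coref"

# Shortened adaptation of example (4b), wrapped in a paragraph element.
SAMPLE = f"""<par ID="par_05" xmlns:coref="{COREF_NS}">
La ministra animó a las pymes a <seg ID="seg_03">invertir en Investigación y Desarrollo</seg>
y mostró la disposición del Gobierno a facilitar <de type="dd0ms0" ID="de_06">este camino</de>.
<coref:link ID="de_06" type="dx" subtype="sent-ev"><coref:anchor ID="seg_03"/></coref:link>
</par>"""


def resolve_dx_links(xml_text):
    """Return (anaphor text, subtype, anchor text) for every dx link."""
    root = ET.fromstring(xml_text)
    # Index <par>, <seg> and <de> elements by ID. Links and anchors are skipped
    # because they reuse the IDs of the elements they point to.
    by_id = {}
    for element in root.iter():
        if element.get("ID") and not element.tag.startswith("{" + COREF_NS + "}"):
            by_id[element.get("ID")] = element
    resolved = []
    for link in root.iter(f"{{{COREF_NS}}}link"):
        if link.get("type") != "dx":
            continue
        anchor = link.find(f"{{{COREF_NS}}}anchor")
        anaphor = by_id[link.get("ID")]
        antecedent = by_id[anchor.get("ID")]
        # Collapse whitespace so multi-line anchors (e.g. a whole <par>) read cleanly.
        anchor_text = " ".join("".join(antecedent.itertext()).split())
        resolved.append((anaphor.text.strip(), link.get("subtype"), anchor_text))
    return resolved


print(resolve_dx_links(SAMPLE))
```

For a link with subtype="text" the anchor ID would point at the <par> element itself, so the same lookup returns the paragraph text rather than a single <seg>.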

4.2 Bridging relations (bridg)

Bridging relations only make sense if we understand them as occurring within the contextual scene triggered by the interaction between two discourse entities. The set of bridging relations is still an open issue (see the classification schemes of Clark, 1977; Vieira, 1998; Poesio, 2000; Muñoz, 2001; Gardent, Manuélian & Kow, 2003), since rather than a binary distinction between first-mention and bridging NPs, there is a scale ranging from those definite NPs which are uniquely interpretable by means of world knowledge (i.e. self-sufficient definite descriptions (SD)14) to those definite NPs which depend on a previous anchor. Inevitably, however, many real examples remain in between, as in (8), where todas las administraciones does not mean “all administrations” (in the world), but just the subset relevant to this scene.

(8) La última edición de Barnasants, el ciclo de canción de autor, ha atraído, según su director, Pere Camps, a unas 2.000 personas. Camps destaca el apoyo unánime de todas las administraciones en la edición de este año.15

13 (7) This will allow the bank to reform its portfolio.

14 We consider as SD those NPs with the definite article that depend on no antecedent, but on world knowledge. Their autonomy can result from their generic reference, their containing an explanatory modifier, or their general uniqueness.

15 (8) The last edition of Barnasants, the singer-songwriter song cycle, has attracted, according to its director, Pere Camps, about 2,000 people. Camps emphasizes the unanimous support of all the administrations in the edition of this year.

In our annotation scheme, we consider NPs such as that in (8) as generic. They are framed by the textual scene, but do not require any anchor for their interpretation. Therefore, first-mentions of such NPs are considered to be SDs, while subsequent mentions are annotated as identity coreference.

We limit the term bridging to NPs (either definite or demonstrative) that are metonymically interpreted –to a greater or lesser extent– on the basis of a former NP or VP. The second discourse entity is anchored on the entity which contributes to activating the necessary scene for its interpretation. Within the “text as scene” approach, all bridging relations are taken to be contextual scene relations. So we only subspecify three very basic distinctions, which tend to be widely agreed upon. The three SUBTYPE attributes are:

(i) subtype=“part” (part-of) The antecedent of the anaphoric NP corresponds to the whole of which the anaphor is a part, as in (9).

(9) La reestructuración de [los otros bancos checos] se está acompañando por la reducción del personal.16

(ii) subtype=“member” (set-member) As illustrated by (10), the subsequent NP refers to one or more members of the set expressed by the NP anchor.

(10) a. [la tropa]...uno de los soldados.

b. Ante [unas mil personas], entre ellas la ministra de Ciencia y Tecnología, Anna Birulés, el alcalde de Barcelona, Joan Clos, la Delegada del Gobierno, Julia García Valdecasas, y una representación del gobierno catalán, Pujol dijo...17

16 (9) The restructuring of [the other Czech banks] is accompanied by the reduction of the staff.

17 (10) a. [the troop]...one of the soldiers. b. Before [about one thousand people], among them the minister of Science and Technology, Anna Birulés, the mayor of Barcelona, Joan Clos, the Delegate of the Government, Julia García Valdecasas, and a representation of the Catalan government, Pujol said...



(iii) subtype=“them” (thematic) The anaphoric NP is related to a VP (the anchor) via a thematic relation. In (11), for instance, estas inversiones is the patient of the previous verb invertir. Like sentential anchors in discourse deixis, antecedents corresponding to VPs are marked by hand with a <seg> tag.

(11) *0* podría hacer que la empresa dominante dejara de [invertir en la red] por no considerarla como una inversión atractiva, y el Gobierno debe incentivar estas inversiones.18

If no subtype is specified, it means that the anaphoric NP is interpreted on the basis of a contextual scene, but that it is not related to its anchor via a clear part-of, set-member or thematic relation. This includes cases commonly referred to as “discourse topic” or general “inference” bridging. Examples can be found in (3) and (12).

(12) El cambio de [17 acciones de Alcan]...los accionistas.19
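The type and subtype values introduced in Sections 4.1 and 4.2 can be summarised as a small consistency check, sketched below. The function name and the table layout are illustrative only. Two points are assumptions rather than statements of the guidelines: that a dx link always carries one of the four subtypes listed, and that the five link types not discussed in this paper are left unchecked.

```python
# The seven link types listed in Section 4.
LINK_TYPES = {"ident", "dx", "poss", "bridg", "pred", "rank", "context"}

# Subtypes described in Sections 4.1 and 4.2. None stands for "no subtype",
# which Section 4.2 allows for bridging. Requiring a subtype for dx is an
# assumption of this sketch.
ALLOWED_SUBTYPES = {
    "dx": {"sent-ev", "sent-fact", "sent-prop", "text"},
    "bridg": {"part", "member", "them", None},
}


def is_valid_link(link_type, subtype=None):
    """Check a (type, subtype) pair against the values described in the text."""
    if link_type not in LINK_TYPES:
        return False
    if link_type not in ALLOWED_SUBTYPES:
        # Subtypes of the remaining link types are not covered here,
        # so they are accepted without further checks.
        return True
    return subtype in ALLOWED_SUBTYPES[link_type]


print(is_valid_link("dx", "sent-ev"))   # link in example (4b)
print(is_valid_link("bridg"))           # underspecified bridging, as in (12)
print(is_valid_link("bridg", "text"))   # a dx subtype used on a bridging link
```

A check of this kind could run over the manual annotation to catch subtype values attached to the wrong link type.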

5 Conclusions and discussion

In this paper we have developed the specific framework, “text as scene”, on which we base the annotation guidelines for both discourse deixis and bridging relations. The former is annotated as coreferring with a certain textual scene, while the latter is coded on the basis of a contextual scene activated by the conjunction of two discourse entities.

Given the rather vague antecedents of the anaphoric expressions interpreted via either of these relations, the annotation of both discourse deixis and bridging relations has usually obtained considerably low inter-annotator agreement. Our annotation scheme is unique in that we deal with these two relations from a common framework. In contrast to other annotation schemes, ours assumes two additional sources for the referent to be interpreted –a textual and a contextual scene–, which allow broader categories and thus more flexible annotation guidelines. Other interesting contributions of our scheme are the consideration of what we call “quasi-pronominal DDs” as discourse deictics together with the inclusion of demonstrative NPs into the range of potential candidates for bridging relations.

18 (11) S/he could make the dominant company stop [investing in the network] for not considering it as an attractive investment, and the Government must encourage these investments.

19 (12) The change of [17 shares of Alcan]...the shareholders.

These guidelines complete the annotation scheme designed to enrich the Spanish CESS-ECE corpus with coreference information, thus giving birth to the CESS-Ancora corpus. It is a scheme rich enough to cover the different types of coreference in Spanish. Nevertheless, coreference annotation is such a complex task –involving several types of linguistic items and different factors responsible for linking two items as coreferential– that we are currently conducting a reliability study on a subset of the corpus to investigate the feasibility and validity of our annotation scheme. The results obtained might lead us to extend and refine it. One of the issues whose reliability needs to be proved is the extent to which abstract antecedents can be semantically classified into events, facts and propositions.

We believe that a 500,000-word corpus annotated from the morphological to the pragmatic level may shed new light on key factors about the nature and working of expressions creating coreference. It has not yet been determined, for instance, how contextual scenes come into play or what their scope is (Fraurud, 1990). The CESS-Ancora corpus will provide quantitative data from natural written discourse from which it will be possible to infer more precise and realistic linguistic generalisations about the use of coreferential and anaphoric expressions in Spanish.

On the other hand, the rich tagset that distinguishes seven types of coreferential relations and that separates individual from abstract anaphora (each with different attributes) makes the CESS-Ancora corpus a very fruitful language resource. Once publicly released, it can be used both for training and evaluating coreference resolution systems and in competitions such as ACE or ARE.

In brief, the goal of our project is twofold. From a computational perspective, the CESS-Ancora corpus will be used to construct an automatic corpus-based coreference resolution system for Spanish. From a linguistic point of view, hypotheses on the use of coreferential expressions (Ariel, 1988; Gundel et al., 1993) will be tested on the basis of the annotated data and new linguistic theories might emerge.



Acknowledgments

We would like to thank Mihai Surdeanu for his helpful advice and suggestions.

This paper has been supported by the FPU grant (AP2006-00994) from the Spanish Ministry of Education and Science. It is based on work supported by the CESS-ECE (HUM2004-21127), Lang2World (TIN2006-15265-C06-06), and Praxem (HUM2006-27378-E) projects.

References

Ariel, M. 1988. Referring and accessibility. Journal of Linguistics, 24(1):65-87.

Byron, D. K. 2000. Semantically enhanced pronouns. In Proceedings of the 3rd Discourse Anaphora and Anaphor Resolution Colloquium (DAARC2000), Lancaster.

Byron, D. K. 2002. Resolving pronominal reference to abstract entities. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 80-87.

Clark, H. 1977. Bridging. In P.N. Johnson-Laird and P.C. Wason (editors), Thinking: Readings in Cognitive Science, Cambridge University Press.

Fraurud, K. 1990. Definiteness and the processing of NPs in natural discourse. Journal of Semantics, 7:395-433.

Gardent, C., H. Manuélian, and E. Kow. 2003. Which bridges for bridging definite descriptions? In Proceedings of the EACL 2003 Workshop on Linguistically Interpreted Corpora, Budapest, 69-76.

Gundel, J., N. Hedberg, and R. Zacharski. 1993. Cognitive status and the form of referring expressions in discourse. Language, 69(2):274-307.

Gundel, J., N. Hedberg, and R. Zacharski. 2000. Statut cognitif et forme des anaphoriques indirects. Verbum, 22:79-102.

Hirschman, L. and N. Chinchor. 1997. MUC-7 coreference task definition. In MUC-7 Proceedings. Science Applications International Corporation.

Lewis, D. 1979. Score keeping in a language game. In R. Bäuerle et al. (editors), Semantics from a different point of view. Springer Verlag, Berlin.

Marcu, D. 1997. The Rhetorical Parsing, Summarization, and Generation of Natural Language Texts. PhD Thesis, Department of Computer Science, University of Toronto.

Muñoz, R. 2001. Tratamiento y resolución de las descripciones definidas y su aplicación en sistemas de extracción de información. PhD Thesis, Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante.

Poesio, M. 2000. MATE Dialogue Annotation Guidelines – Coreference. Deliverable D2.1. http://www.ims.uni-stuttgart.de/projekte/mate/mdag

Poesio, M. and R. Vieira. 1998. A corpus-based investigation of definite description use. Computational Linguistics, 24(2):183-216.

Recasens, M., M.A. Martí, and M. Taulé. 2007. Where anaphora and coreference meet. Annotation in the Spanish CESS-ECE corpus. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP2007), Borovets, Bulgaria, forthcoming.

Tutin, A., F. Trouilleux, C. Clouzot, E. Gaussier, A. Zaenen, S. Rayot, and G. Antoniadis. 2000. Annotating a large corpus with anaphoric links. In Proceedings of the 3rd Discourse Anaphora and Anaphor Resolution Colloquium (DAARC2000), Lancaster.

Vieira, R. 1998. Definite Description Processing in Unrestricted Texts. Ph.D. Thesis, University of Edinburgh, Centre for Cognitive Science.

Vieira, R., S. Salmon-Alt, C. Gasperin, E. Schang, and G. Othero. 2002. Coreference and anaphoric relations of demonstrative noun phrases in a multilingual corpus. In Proceedings of the 4th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC2002), Lisbon.

Webber, B. 1988. Discourse deixis: reference to discourse segments. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics (ACL'88), New York, 113-122.



Definition of a methodology for building Knowledge Organization Systems from a natural-language document corpus

Sonia Sánchez-Cuadrado Universidad Carlos III de Madrid

Avda. Universidad 30, 28911 Leganés [email protected]

Jorge Morato Lara Universidad Carlos III de Madrid


José Antonio Moreiro González Universidad Carlos III de Madrid

C/ Madrid 126, 28903 Getafe [email protected]

Mónica Marrero Linares Universidad Carlos III de Madrid


Summary: We propose a methodology for the automated construction of KOS that can be adapted to different environments, starting from a document corpus and a set of text-processing applications that support the whole process of automated KOS construction and maintenance. The methodology has been applied in several real environments, which confirmed that it is adaptable and yielded a significant reduction in the involvement of domain experts. Keywords: methodology, Knowledge Organization Systems, KOS, knowledge acquisition, NLP system, semantic relationships.

Abstract: A methodology for automatic KOS construction is proposed, based on information extraction from natural language documents. In addition, a set of NLP tools has been implemented to support the development and management process. The methodology has been tested in real-world projects. Results show that the methodology is highly adaptable and has a low dependence on domain experts. Keywords: Methodology, Knowledge Organization Systems, KOS, Knowledge acquisition, NLP tools, semantic relationships.

1 Introduction

The aim of this research is to propose an adaptable methodology for the automated construction of Knowledge Organization Systems from natural-language documents of specific domains, drawn from real environments and needs. It starts from the premise that most knowledge is made explicit in the documents of a domain through terms and relations, and that only the knowledge not expressed in the documents will have to be supplied by domain experts.

For authors such as Hodge (2000) or Zeng and Chan (2004), the term Knowledge Organization Systems, also known as KOS, covers different types of schemes for organizing information and promoting knowledge management, such as classification and categorization schemes, subject headings, authority files, thesauri, semantic networks and ontologies. KOS are currently an area of growing interest because of the variety of disciplines that have converged on the need for such resources. Each area of knowledge has proposed systems suited to its own needs, which therefore differ in name and in some characteristics, although a common model underlies them (Daconta et al., 2003: 157; Lassila and McGuinness, 2001; Gruninger and Uschold, 2002). Some of the characteristics shared by the different types of KOS are:

• Simplified representation of reality
• Concepts and relations of a domain
• Structures that are flexible in semantic richness
• Provision of a standardized, agreed vocabulary

KOS are a resource that benefits communication among experts and makes it possible to share knowledge of a domain or a language (ISO 2788:1986; NISO Z39.19: 2005). Applied to information retrieval (IR), they also improve the classification and description of documents through unambiguous terms, and make it possible to provide a system for query expansion and restriction (Foskett, 1971; Baeza-Yates and Ribeiro-Neto, 1992; Ingwersen, 1992). They have also been applied in Terminology (Cabré, 1993); in Software Engineering, through domain analysis for software reuse (Prieto-Díaz, 1991; Lloréns, 1996); in Artificial Intelligence, by incorporating ontologies that allow inferences to be made (Gómez-Pérez, 2003: 119-132); on the Semantic Web, through the construction of metadata vocabularies (Berners-Lee et al., 2001; Daconta et al., 2003); and even as concept maps for educational resources (Novak, 1994; 1998).

The various methodologies for KOS construction (Gómez-Pérez et al., 2003) agree that a KOS must have the following characteristics: clarity, coherence, independent specification, extensibility, and a minimal vocabulary with standardized definitions and names. Likewise, a set of common construction phases can be identified across the proposals:

• Determining a scope or domain
• Knowledge acquisition
• Checking for possible anomalies and inconsistencies
• Evaluation
• Application
• Maintenance

For some of these phases there are initiatives that use tools to help carry out the tasks; nevertheless, most of the workload falls on the expert in charge of building the KOS.

The proposals for manual KOS construction (Aitchison et al., 1972: 141; Lancaster, 1986; Van Slype, 1991; Noy and McGuinness, 2001) present significant problems. On the one hand, KOS consume large financial and human resources over a long period, and they entail an extra cost every time they have to be updated. To this must be added the difficulty of reconciling the experts' differing criteria for organizing knowledge. But without doubt one of the most worrying problems is the lack of availability of domain experts and their loss of motivation during the construction and update phases. For this reason, the main weaknesses are related to expert involvement and to knowledge acquisition (Antoniou and Harmelen, 2004: 211; Gómez-Pérez et al., 2004: 107).

Automated KOS construction, on the other hand, presents the following difficulties:

1. Defining the type of KOS and the knowledge structure. Clients and users are often unable to explain what characteristics and functionality they expect from the KOS.

2. Defining and collecting the material. The knowledge to be represented in the KOS directly conditions the results, the difficulty of building the knowledge structure and the quality of the outcome:

• The documents are in a language other than the one being processed
• The documents are in several languages
• The documents are multidisciplinary
• The documents show different degrees of specificity
• The documents are not correctly written (style, spelling)
• The documents have syntax that is not formally structured
• Problems extracting text from some formats (e.g. text in images)

3. Defining the functionality of the software tools for the phases that can be automated. There are two fundamental functions: knowledge extraction and knowledge identification. The first must select the information that can contribute significant knowledge to an organizational structure (hence a selective indexing). The indexing process, in turn, will tend to record as much information as possible (hence an exhaustive indexing).

4. Analysing the resulting KOS. The knowledge structure that has been built needs to be analysed, because information acquisition systems tend to be generic.

2 Definition of the Methodology

First, a set of roles is defined for KOS construction (Fraga et al., 2006): domain engineer (ID, ingeniero de dominio), domain expert (ED, experto de dominio) and domain manager (RD, responsable de dominio); then the methodology itself is defined. The methodology consists of KOS construction activities and support activities related to IT aspects, documentation and expert staff. The methodology developed (Sánchez-Cuadrado, 2007) uses software applications to assist the different phases, and also to support the KOS.

[Figure 1 diagram. Recoverable labels: KOS CONSTRUCTION ACTIVITIES; SUPPORT ACTIVITIES (IT, DOCUMENTATION, EXPERT STAFF); Requirements Definition; Knowledge Acquisition; Conceptualization; RSHP Coding; Integration; Evaluation; Validation; Validation and Refinement; Maintenance; Final KOS; Document Collection; Evaluation of Pre-existing KOS; Information Extraction (NLP); Development Tools; Schedules; Task Assignment; Follow-up Documentation.]

Figure 1: CAKE methodology for KOS construction

The CAKE methodology (Figure 1) rests on a series of activities for KOS construction (Sánchez-Cuadrado et al., 2006):

1. Requirements definition for identifying the domain
2. Knowledge acquisition
• Document gathering and filtering: selection of the specialized corpus
• Proposal of a small set of categories to serve as a seed for adding further nodes to the initial taxonomy
• Identification of the specialized vocabulary: extraction, assessment and validation of vocabulary
• Identification of specialized relations: extraction, assessment and validation of relations
3. Evaluation of the quality of the KOS
4. Maintenance of the KOS

The knowledge structure is defined, through the requirements definition and the definition of the document corpus, by means of interviews with the experts and document selection.

1. Interviews with the experts.

• Determine the domain

• Determine the questions that should be put to an expert: purpose, topic, subtopics, questions to be asked of the IR system

• Give the experts guidelines for building the corpus

• The result of this phase should be:

  • A taxonomic structure that represents, at a very high level, the basic components to be represented

  • A list of the questions and answers they want a query to resolve

  • A report of guidelines and recommendations for building a corpus

2. Document selection: distinguish the documents intended for building the knowledge structure from those intended to be indexed.

• For KOS construction, these documents are required to contain (even if only partially) the terms used in the documents (the more structured the documents, the better):

  o Lists of terms in use, or indexes of books or reports they hold

  o Partial thesauri, if they have any

  o Glossaries in use (or staff training material)

• The interview and the structured documents should yield a first draft of the knowledge structure. This draft should be evaluated by one or more experts, who confirm that it is on the right track so that it can be extended.

Definición de una Metodología para la Construcción de Sistemas de Organización del Conocimiento a partir de un Corpus ...


The vocabulary identification (3) and relation identification (4) phases rely on NLP systems (Figure 2) that identify concepts (simple and complex) and lexical-semantic relations from patterns and syntagmatic relations (Sánchez-Cuadrado et al., 2003).

Figure 2: Knowledge database and language technologies applied to knowledge acquisition

The assessment and validation of specialized terms and relations are carried out with a decision-support tool for terms or relations that may be contentious. The tools to be used depend on the factors below, each of which will therefore be analysed:

• The purpose of the system

• The characteristics of the corpus

• Text processing and the quality of its results

• The size of the corpus

• The involvement of the experts in the process

• The evaluation techniques

• The maintenance of the KOS

For tasks that must be performed by domain experts, the tools are simple and the time experts must devote to them should be minimal. Achieving this requires good results and automated filtering processes. In general, the analysis and assessment of the results in the different phases of building knowledge structures falls to the domain experts. The results must therefore be presented clearly and as concretely as possible; one way to achieve clarity and concreteness is through contextualized knowledge.

Maintenance processes must be coherent (no repeated information, no contradictory information, no erroneous information, etc.). They must be easy to use, and updates must cascade coherently.
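The coherence requirements for maintenance (no repeated, contradictory or erroneous information, and cascading updates) can be illustrated with a minimal sketch. The `Kos` class, its method names and the example terms are hypothetical and are not part of the CAKE tools; the sketch only assumes that a KOS stores broader/narrower term pairs.

```python
from dataclasses import dataclass, field


@dataclass
class Kos:
    """Toy KOS store: a set of (broader, narrower) term pairs (hypothetical)."""
    broader_of: set = field(default_factory=set)

    def add_relation(self, broader: str, narrower: str) -> str:
        """Insert a relation only if it keeps the structure coherent."""
        if broader == narrower:
            return "rejected: erroneous (a term cannot be broader than itself)"
        if (broader, narrower) in self.broader_of:
            return "rejected: duplicate"
        if (narrower, broader) in self.broader_of:
            return "rejected: contradicts an existing relation"
        self.broader_of.add((broader, narrower))
        return "accepted"

    def rename_term(self, old: str, new: str) -> None:
        """Cascade a term rename through every relation that mentions it."""
        self.broader_of = {
            (new if b == old else b, new if n == old else n)
            for b, n in self.broader_of
        }
```

Rejecting an update with an explicit reason, rather than silently ignoring it, keeps the tool easy to use for the person maintaining the KOS.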

3 Application of the methodology to real environments

This proposal is the outcome of building several KOS for real environments, according to the requirements stated by each institution.

The methodology has been used in the petrochemical sector (REPSOL-YPF), where different knowledge areas of the organization were built separately. Four KOS were built, applying the NLP tool to automate the knowledge acquisition phase and Web tools to support decisions by members of the organization during the assessment and validation of terms and relations. The purpose of the KOS obtained for the petrochemical sector was to index documents automatically so that they could be retrieved.

Some results of applying the proposed methods to the different domains are shown below. The automated methodology was applied to REPSOL-YPF, SAGE-SP, the Oficina del Defensor del Pueblo and a prototype for the Guardia Civil (P-GC); the methodology was also applied to the manual creation of domains in the project for the Archivo General de la Nación of the Dominican Republic (AGN). The RSHP model was used in all of these projects.

| Phase | REPSOL-YPF | SAGE | P-GC | Defensor del Pueblo | AGN |
|---|---|---|---|---|---|
| RSHP model | yes | yes | yes | yes | yes |
| General categories | no | yes | yes | yes | yes |
| Document resources | yes | no | no | no | yes |
| Analysis of structured resources | no | no | no | yes | yes |
| Analysis of semi-structured resources | yes | no | no | no | no |
| Analysis of unstructured resources | yes | yes | yes | no | no |
| Entity extraction | no | yes | yes | no | no |
| Term assessment by the organization | yes | no | no | yes | yes |
| Term validation by the organization | yes | no | no | yes | yes |
| Relation assessment by the organization | yes | no | no | yes | yes |
| Relation validation by the organization | yes | no | no | yes | yes |

Table 1: KOS construction phases applied to different environments

Table 1 summarizes these results, indicating whether or not each stage of the methodology was carried out in each project.

A general classification was established a priori in the SAGE domains, in the Guardia Civil prototype and in the AGN domain. These cases confirmed that it facilitates not only the early phases, in which terms are distributed into categories and the make-up of the domain becomes easier to understand, but also the definition of relations between term categories and specific terms. When no generic classification was used, a different kind of knowledge structure was produced; the fundamental difference is that it has a large number of general categories, which a machine can manage but a person cannot. Moreover, classifying terms by word category has allowed the review to be carried out by domain engineers, with a domain expert supervising the domain only in case of doubt or as a result of that classification. After various tests and studies of similar classifications, the number of initial categories was set at around 15. The proposed general classification for building knowledge organization systems has been applied to projects focused on automated construction and on manual construction (e.g. the various SAGE-SP subdomains). This structure allowed the domain engineers to incorporate vocabulary supplied by the company in the form of short lists. As the company's type of organization was confirmed, the terms were refined and families that were not relevant to the domain (for example, demonyms) were discarded.

For the REPSOL-YPF domain, glossaries were located for the different areas to be modelled. These glossaries contained specific, domain-particular terms and provided a standardized vocabulary. In addition, the company supplied its own documentation that, in the experts' judgement, adequately reflected the domains to be modelled. This information was delivered by topic, representing five different domains, albeit with some overlap.

For SAGE, the primary material for building the knowledge organization system consisted mainly of the help files of its software applications. The files of suggestions and errors collected from customers through the call centre were also supplied.

For the construction of the thesaurus, the Guardia Civil supplied the documentation that members of the unit record to track cases, which contained all the information to be modelled, although new concepts could appear depending on the investigation. The domain grew incrementally, mainly through the incorporation of new instances.

The Oficina del Defensor del Pueblo had a very specific, already structured resource, its thesaurus, containing the information to be handled. In addition, the reports to be indexed were available to the experts, which allowed adequate concept recognition and extraction.

For the AGN, the specification of document resources for gathering the primary material identified the following as necessary: indexes of place names, organization charts, internal classifications, document typologies, etc.

From the first experiences with the projects, the advantages of structured or semi-structured documents were apparent in terms of the quality and quantity of concepts and relations concentrated in such documents. However, in the environments where the methodology was applied, they could hardly be exploited as basic document resources, as Table 1 shows.

Special treatment of structured documents began with the import of the thesaurus used by the Oficina del Defensor del Pueblo into the thesaurus manager TmCake. This software enabled the evaluation and maintenance of the thesaurus. REPSOL-YPF and SAGE-SP have and use the updated version of this tool (currently called Domain Reuser), which is closer to the final proposed methodology.

For the AGN, the export functionality of Domain Reuser was used to reuse parts of knowledge organization systems, such as a general-purpose toponymic thesaurus.

Knowledge extraction based on word composition has been applied in the early phases of relation extraction and term organization in the knowledge organization system in the REPSOL-YPF, SAGE and Guardia Civil prototype projects. It can even be applied when the primary resource obtained is a list of simple and compound terms, such as a list of indexing terms.

The main problem with this mechanism is that it can establish generalization-specialization relations that do not hold, because the term taken as specific is a lexicalized compound that has lost the meaning of the term taken as generic. For REPSOL-YPF, SAGE and the Guardia Civil prototype, as well as for the term lists used in the AGN domain, a manual review was carried out to identify possible erroneous cases.
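The word-composition mechanism and its pitfall can be sketched as follows. This is an illustrative approximation, not the authors' NLP tool: it assumes that the head of a Spanish compound term is its first word, and both the sample terms and the `lexicalized` set (compounds flagged by a reviewer) are hypothetical.

```python
def candidate_hierarchy(terms):
    """Propose (broader, narrower) pairs from multiword terms.

    Assumption: the head of a Spanish noun phrase is usually its first word,
    so 'aceite lubricante' is proposed as narrower than 'aceite'.
    """
    vocabulary = {t.lower() for t in terms}
    pairs = []
    for term in sorted(vocabulary):
        words = term.split()
        if len(words) > 1 and words[0] in vocabulary:
            pairs.append((words[0], term))
    return pairs


def split_for_review(pairs, lexicalized):
    """Separate pairs whose narrower term is a known lexicalized compound."""
    accepted = [p for p in pairs if p[1] not in lexicalized]
    to_review = [p for p in pairs if p[1] in lexicalized]
    return accepted, to_review


# Hypothetical term list: 'oro negro' (petroleum) is not a kind of 'oro' (gold).
terms = ["aceite", "aceite lubricante", "torre", "torre de destilación",
         "oro", "oro negro"]
accepted, to_review = split_for_review(candidate_hierarchy(terms), {"oro negro"})
```

In this sketch the flagged pairs are only set aside; deciding whether they are genuinely wrong remains the manual review step described in the text.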

Furthermore, text processing of unstructured resources commonly produces a large number of terms that are not easy to classify as candidates, or not, for the KOS.

Experience with REPSOL-YPF, SAGE and the Guardia Civil prototype suggests that this should be a process aimed at verifying relations that may be affected by the context or the purpose of the knowledge organization system. This applies, for example, to synonymy or equivalence relations: two terms that are not synonyms may nevertheless be treated as such in a specific domain.

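| Environment | Domain | Families | Concepts | Mean relations |
|---|---|---|---|---|
| REPSOL-YPF | Environment | -- | 2224 | 1.08 |
| REPSOL-YPF | Chemistry | -- | 3758 | 1.61 |
| REPSOL-YPF | Refining | -- | 4279 | 1.07 |
| REPSOL-YPF | EyP | -- | 2234 | 1.24 |
| SAGE-SP | Accounting | 15 | 3894 | 1.17 |
| SAGE-SP | Payroll | 21 | 2410 | 2.83 |
| SAGE-SP | Invoicing | 15 | 5584 | 1.08 |
| GC | Guardia Civil | 12 | 603 | 2.63 |

Table 2: Characteristics of the KOS in different environments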
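As a rough reading aid for Table 2, the sketch below estimates the total number of relations per domain. It rests on an assumption the paper does not state explicitly, namely that the mean-relations column is the average number of relations per concept. The domain names are English renderings of the table labels, and the function name is hypothetical.

```python
# (environment, domain) -> (concepts, mean relations), transcribed from Table 2
TABLE_2 = {
    ("REPSOL-YPF", "Environment"): (2224, 1.08),
    ("REPSOL-YPF", "Chemistry"): (3758, 1.61),
    ("REPSOL-YPF", "Refining"): (4279, 1.07),
    ("REPSOL-YPF", "EyP"): (2234, 1.24),
    ("SAGE-SP", "Accounting"): (3894, 1.17),
    ("SAGE-SP", "Payroll"): (2410, 2.83),
    ("SAGE-SP", "Invoicing"): (5584, 1.08),
    ("GC", "Guardia Civil"): (603, 2.63),
}


def estimated_relations(table):
    """Estimate total relations as concepts x mean relations (assumed per concept)."""
    return {key: concepts * mean for key, (concepts, mean) in table.items()}


estimates = estimated_relations(TABLE_2)
# Domain with the densest relation structure under this assumption
densest = max(TABLE_2, key=lambda key: TABLE_2[key][1])
```

Under this reading, the smallest KOS (Guardia Civil) has one of the highest relation densities, which is consistent with the paper's remark on the effect of entity extraction.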

Across the projects carried out with the tools and methodologies for the automated construction of knowledge organization systems, the definition and development of the newer systems proved more efficient at relation extraction. Also notable is the quality of compound-term constructions and of the decomposition of those complex constructions. Entity extraction is undoubtedly another feature that considerably improves knowledge acquisition. The improvements affect the quality of specific terms and the specificity of relations.

[Figure 3: two percentage charts (0% to 100%). First chart: concepts and relations for the REPSOL-YPF domains (EyP, Refining, Chemistry, Environment). Second chart: families, concepts and relations for Guardia Civil, Invoicing, Payroll and Accounting.]

Figure 3: Visualization of the characteristics of the KOS in the different environments

In the Guardia Civil prototype, the entity extraction system has a direct impact on identifying candidate terms for the knowledge organization system and on extracting relations between some of those entities. Another positive effect reflected in the results is the significant increase in relations in the resulting knowledge organization systems (Figure 3), owing to the relation acquisition phase based on the identified units and to the flexibility of the text processing and indexing components of the NLP system.

4 Conclusions

This proposal focuses on improving results in the most problematic aspects of KOS construction. The first concerns the tasks assigned to domain experts and domain managers. To that end, the emphasis has been on:

• minimizing the number of tasks assigned

• reducing the time the tasks take

• assessing and validating knowledge with contextualized information

• training domain experts and managers on the final product

• improving the requirements specification

The second aspect on which this proposal has focused is improving the quality of the documents that make up the domain corpus for KOS construction, through criteria for building it and the reuse of existing controlled-vocabulary resources. This definition of the document corpus helps to:

• determine the topics and ease the Domain Engineer's task of determining the generic terms

• determine the customer's expectations

• determine an indexing corpus of better quality, adapted to the customer's needs

Finally, improving the software tools needed to obtain quality results reduces errors in indexing, information extraction and KOS construction. Any analysis and assessment process will therefore tend to be more effective and to be performed to a higher standard. Likewise, a quality result will encourage its use, its usefulness and the need for maintenance mechanisms for the knowledge structure. In this respect, the proposals have been aimed at:

• differentiating between entity types

• spelling correction for possible deficiencies in the documents

• organizing the extraction of terms and relations into distinct phases

• progressive evaluation of the acquired knowledge

• support from a pre-existing classification for distributing the knowledge

In summary, an environment for developing KOS has been proposed, based on a methodology that can be configured for different scenarios. Carrying it out requires a carefully built corpus containing the information needed to construct the KOS, specific applications for knowledge acquisition, and a model for representing, constructing and maintaining the KOS.

Undoubtedly, one of the advantages achieved is reduced dependence on domain experts, which lowers costs, possible inconsistencies between experts and the demotivation caused by the assigned tasks.

Bibliography

Aitchison, J.; Gilchrist, A.; Bawden, D. 1972. Thesaurus construction and use: a practical manual. 3rd ed. London: Aslib, 1997.

Antoniou, G. and Harmelen, F. van. A Semantic Web Primer. London: The MIT Press, 2004.

Baeza-Yates, R. and Ribeiro-Neto, B. Modern Information Retrieval. Massachusetts: Addison-Wesley, 1999.

Berners-Lee, T.; Hendler, J.; Lassila, O. The Semantic Web. Scientific American Magazine, May 2001.

Cabré Castellví, Mª. T. La Terminología: Teoría, metodología y aplicaciones. Barcelona: Antártida/Empúries, 1993.

Daconta, M. C.; Obrst, L. J. and Smith, K. T. The Semantic Web. A guide to the future of XML, Web Services, and Knowledge Management. Indianapolis: Wiley, 2003.



Foskett, D. J. Thesaurus. Encyclopaedia of Library and Information Science. In: Sparck Jones, K. and Willett, P. (eds.). Readings in Information Retrieval. San Francisco: Morgan Kaufmann, 1997. pp. 111-134.

Fraga, A.; Sánchez-Cuadrado, S. and Lloréns, J. Creación de un Tesauro Manual y Automático para el dominio de Arquitectura de Software. Jornadas Chilenas de Computación, V Workshop Ingeniería del Software (WIS2005) de las Jornadas Chilenas de Computación. Valdivia, Chile. 2005.

Gómez-Pérez, A.; Fernández-López, M.; Corcho, O. Ontological Engineering. London: Springer, 2004.

Hodge, G. Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. The Digital Library Federation Council on Library and Information Resources. 2000.

Ingwersen, P. Information Retrieval Interaction. London: Taylor Graham, 1992 P. 245.

ISO-2788: 1986. Guidelines for the Establishment and Development of Monolingual Thesauri. International Organization for Standardization, Second edition -11-15 UDC 025.48. Geneva: ISO, 1986.

Lancaster, F. W. Vocabulary control for information retrieval. 2nd ed. Arlington, Virginia: Information Resources Press, 1986.

Lloréns, J. Definición de una metodología y una estructura de repositorio orientadas a la reutilización: El tesauro de software. Universidad Carlos III de Madrid, Departamento de Ingeniería, 1996.

Morato, J.; Lloréns, J.; Génova, G.; et al. Experiments in Discourse Analysis Impact on Information Classification and Retrieval Systems. Information Processing and Management 2003, 38. pp. 825-851.

Novak, J. D. and Gowin, D. B. Learning how to Learn. New York: Cambridge University Press, 1984.

Novak, J. D. Learning, Creating, and Using Knowledge: Concept Maps as Facilitative Tools for Schools and Corporations. Mahwah, N. J.: Lawrence Erlbaum & Assoc, 1998.

Prieto-Díaz, R. Implementing Faceted Classification for Software Reuse. Comm. ACM 1991, 34 (5). pp. 88-97.

Sánchez-Cuadrado, S. Definición de una metodología para la construcción automatizada de sistemas de organización del conocimiento. Tesis Doctoral. Universidad Carlos III de Madrid. Dpto. Biblioteconomía y Documentación, 2007.

Sánchez-Cuadrado, S.; Lloréns, J. and Morato, J. Desarrollo de una aplicación para la gestión de relaciones en tesauros generados automáticamente. Jotri 2003. II Jornadas de Tratamiento y Recuperación de la Información. Madrid. 2003. pp. 151-156.

Sánchez-Cuadrado, S.; Lloréns, J. and Morato, J.; et al. Extracción Automática de Relaciones Semánticas. 2da. Conferencia Iberoamericana en Sistemas, Cibernética e Informática. CISCI 2003. Orlando, Florida. 2003a. pp. 265-268.

Sánchez-Cuadrado, S. and Morato Lara, J. Diseño de una herramienta para la Creación Asistida de KOS. VII Jornada de la Asociación Española de Terminología. Lenguas de especialidad y lenguajes documentales. 24 de noviembre de 2006.

Van Slype, G. Los lenguajes de indización. Concepción, construcción y utilización en los sistemas documentales. Madrid: Fundación Germán Sánchez Ruipérez, 1991.

Zeng, M. L. and Chan, L. M. Trends and Issues in Establishing Interoperability Among Knowledge Organization Systems. Journal of the American Society for Information Science and Technology, 2004, 55(5):377-395.



Dialogue Systems

Prediction of Dialogue Acts on the Basis of the Previous Act

Sergio R. Coria

Luis A. Pineda

Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas (IIMAS) Universidad Nacional Autónoma de México (UNAM)

Ciudad Universitaria, Coyoacán, México, D.F.

Resumen: En este trabajo se evalúa empíricamente el reconocimiento automático de actos de diálogo. Se usan datos provenientes de un corpus de diálogos con habla espontánea. En cada diálogo dos hablantes colaboran en el diseño de cocinas usando herramientas C.A.D.; uno de ellos desempeña el rol del Sistema y el otro el del Usuario. Los actos de diálogo se etiquetan con DIME-DAMSL, esquema que considera dos planos de expresión: obligaciones y common ground. La evaluación se realiza probando modelos clasificadores creados con algoritmos de aprendizaje máquina: uno para obligaciones y otro para common ground. El principal dato predictor analizado es el acto de diálogo correspondiente al enunciado inmediato anterior. Se pondera también la contribución de información adicional, como la entonación, etiquetada con INTSINT, la modalidad del enunciado, el rol del hablante y el tipo de acto de diálogo del plano complementario. Una aplicación práctica sería en sistemas de administración de diálogo. Palabras clave: Diálogos prácticos, acto de diálogo, DIME-DAMSL, aprendizaje máquina, entonación, INTSINT, corpus de diálogo, árbol de clasificación y regresión

Abstract: This paper empirically evaluates the automatic recognition of dialogue acts. Data from a dialogue corpus of spontaneous speech are used. In each dialogue two speakers collaborate to design a kitchen using a C.A.D. software tool; one of them plays the System’s role and the other plays the User’s role. Dialogue acts are annotated with DIME-DAMSL, a scheme that considers two expression planes: obligations and common ground. The evaluation is performed by testing classification models created with machine learning algorithms: one model for obligations and another for common ground. The main predictor analyzed is the dialogue act of the immediately preceding utterance. The contribution of other information sources is also evaluated, such as intonation (annotated with INTSINT), utterance mood, speaker role and the dialogue act type of the complementary expression plane. A practical application would be in dialogue management systems. Keywords: Practical dialogues, dialogue act, DIME-DAMSL, machine learning, intonation, INTSINT, dialogue corpus, classification and regression tree

Introduction

Automatic recognition of dialogue acts has been addressed in previous work, such as (Shriberg et al., 1998) and the VERBMOBIL Project (Wahlster, 1993). It is a relevant issue because it provides speech recognition and dialogue management systems with additional information, which tends to improve their accuracy and efficiency. Both efforts used intonational and lexical information to recognize dialogue acts, for English and German respectively. Another relevant reference is (Garrido, 1996), which addresses the relation between intonation and utterance mood in Spanish.

In (Coria and Pineda, 2006), dialogue act recognition in Spanish is addressed from an intonational perspective, also considering some non-prosodic features; those experimental settings are the immediate predecessors of the present work.

Machine learning algorithms, such as classification trees and neural networks, in addition to language models and polygrams, are commonly used to analyze the phenomenon and to identify the features that contribute most to recognition or prediction models. This work uses a classification tree algorithm to evaluate the contribution of the previous dialogue act to the prediction task, taking as baseline a recognition setting in which the previous act is not used as a predictor.
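The comparison between a baseline without the previous act and a model that uses it can be illustrated with a deliberately small sketch. It is not the classification tree used in this work: the "model" below is a one-level lookup from the previous act to the most frequent next act, the act labels are shorthand chosen for the example, and the toy sequence is invented and scored on its own training data.

```python
from collections import Counter, defaultdict


def train_baseline(acts):
    """Baseline: always predict the most frequent act, ignoring context."""
    return Counter(acts).most_common(1)[0][0]


def train_previous_act_model(acts):
    """Map each previous act to the act that most often follows it."""
    table = defaultdict(Counter)
    for prev, cur in zip(acts, acts[1:]):
        table[prev][cur] += 1
    return {prev: counts.most_common(1)[0][0] for prev, counts in table.items()}


def accuracy(acts, predict):
    """Fraction of acts (from the second one on) predicted from their predecessor."""
    pairs = list(zip(acts, acts[1:]))
    return sum(predict(prev) == cur for prev, cur in pairs) / len(pairs)


# Invented toy dialogue, for illustration only
acts = ["info-request", "answer", "action-dir", "commit", "action-dir",
        "commit", "info-request", "answer", "action-dir", "commit"]

majority = train_baseline(acts)
model = train_previous_act_model(acts)
baseline_acc = accuracy(acts, lambda prev: majority)
previous_act_acc = accuracy(acts, lambda prev: model.get(prev, majority))
```

A real evaluation would use held-out dialogues and a full tree learner with the additional predictors (intonation, utterance mood, speaker role); the sketch only shows how the two settings differ in the information they are given.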

A key issue in dialogue act recognition is the annotation of dialogue acts. The present work adopts the DIME-DAMSL scheme for this annotation.

1 Dialogue acts and the DIME-DAMSL scheme

1.1 Speech acts and dialogue acts

Searle’s theory of speech acts states that the production or emission of an utterance-instance under certain conditions constitutes a speech act, and that speech acts are the basic or minimal units of linguistic communication. The dialogue act is an adaptation of this notion and involves a speech act in the context of a dialogue (Bunt, 1994), or an act with internal structure specifically related to its dialogue function, as assumed in (Allen and Core, 1997), or a combination of the speech act and the semantic force of an utterance (Bunt, 1995). The present work is based on Allen and Core’s view.

1.2 DAMSL scheme

Allen and Core define a tag set and a series of tagging principles in order to produce a computational scheme for the annotation of dialogue acts in a particular class of dialogues: the so-called practical dialogues, where the interlocutors collaborate to achieve a common goal and do not need overly complex language because the conversation is simpler than general conversation.

The DAMSL scheme defines four tag sets for utterance annotation, as follows: communicative status, information level, forward-looking and backward-looking functions. One of the main purposes of the communicative status is to specify if an utterance is intelligible or not; the information level describes the general subject of the utterance, e.g. task, task-management, communication management.

The forward-looking functions resemble diverse categories defined in traditional speech act theory; e.g. action directives, commitments or affirms in DAMSL resemble directives, commissives or representatives, respectively, in Searle’s scheme.

The backward-looking functions specify how an utterance is related to the ones preceding it in the dialogue; e.g. to accept a proposal, to confirm understanding of a previous utterance, to answer a question.

1.3 DIME-DAMSL scheme

As the DAMSL scheme did not suffice to obtain a high enough inter-annotator agreement, it was not reliable enough to set up machine-learning experiments, which require consistent information. A source of low agreement in DAMSL is the lack of a higher-level structure to constrain the possible label(s) an utterance can be assigned; i.e. the scope of the DAMSL scheme is restricted to analyzing single utterances without considering the context within the dialogue where previous or following utterances occur. This allows a broad space to select and combine labels but, on the other hand, there is a high risk that inter-annotator agreement for dialogue act types is low because of the influence of subjectivity.

Evolving from DAMSL, DIME-DAMSL adopts its tag set and its dimensions and extends them by defining three additional notions: 1) two expression planes, the obligations and the common ground; 2) transaction structure; and 3) charge and credit contributions of dialogue acts in balanced transactions.

The obligations and the common ground planes are parallel structures along which dialogue acts flow. A dialogue act might contribute to either (or both) of the two planes.

In DIME-DAMSL the obligations plane is made up of dialogue acts that generate a responsibility, either on the speaker or on the listener, to perform a verbal or non-verbal action; e.g. the obligation to provide some piece of information or to perform a non-verbal action. Dialogue acts that mainly contribute to the obligations plane are: commit, offer (when it is accepted by the interlocutor), action directive and information request. For instance, in utterances from dialogues of the DIME corpus, okay is a commit (in certain contexts); can you move the stove to the left? is an action directive, and where do you want me to put it? is an information request.

The common ground is the set of dialogue acts that add, reinforce and repair the shared knowledge and beliefs of the interlocutors and preserve and repair the communication flow. DIME-DAMSL defines two sub-planes in the common ground: agreement and understanding; agreement is the set of dialogue acts that add knowledge or beliefs to be shared on the grounding of the dialogue participants; understanding is defined by acts that keep, reinforce or recreate the communication channel. Dialogue acts that mainly contribute to the agreement sub-plane are: open option (e.g. these are the cupboards we have), affirm (e.g. because I need a cabinet), hold (e.g. do you want me to move this cabinet to here?), accept (e.g. yes), reject (e.g. no, there is no design problem), accept part, reject part and maybe. Dialogue acts on the understanding sub-plane are acknowledgment (e.g. yeah, yes, okay, etc.), repeat-or-rephrase (e.g. do you want me to put this stove here?), and backchannel (e.g. mhum, okay, yes, etc.).

Charges and credits are the basic mechanism underlying the interaction between pairs of dialogue acts along each of the two expression planes. A charge generated by a dialogue act introduces an imbalance that requires satisfaction, and a credit is the item balancing that charge. Instances of balanced pairs are: on the obligations plane, an action directive (a charge), which can be balanced with a graphical action; on the agreement plane, a charge introduced by an open option, which can be balanced with an accept; on the understanding plane, an affirm, which creates a charge that can be satisfied with an acknowledgment; etc. These and other additional pairs guide a charge-credit annotation to identify and annotate the most prominent dialogue acts of the utterance. This annotation of dialogue acts is called Preliminary DIME-DAMSL and supports the completion of the dialogue act tagging in a subsequent stage, the so-called Detailed DIME-DAMSL, where the annotation is supplemented with other labels if necessary.
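The charge-credit mechanism can be sketched as a simple balancing check. Only the three pairs named in the text are encoded; the act and plane identifiers, the function name and the rule that a credit balances the most recent matching open charge are assumptions made for illustration, not part of the DIME-DAMSL specification.

```python
# Charge -> expected credit, keyed by (plane, charge); pairs taken from the text
BALANCING = {
    ("obligations", "action-directive"): "graphical-action",
    ("agreement", "open-option"): "accept",
    ("understanding", "affirm"): "acknowledgment",
}


def unbalanced_charges(acts):
    """Return the charges left without a credit.

    `acts` is a list of (plane, act) tuples in dialogue order.
    """
    pending = []  # open charges, most recent last
    for plane, act in acts:
        if (plane, act) in BALANCING:
            pending.append((plane, act))
            continue
        # Assumption: a credit balances the most recent open charge expecting it
        for i in range(len(pending) - 1, -1, -1):
            if pending[i][0] == plane and BALANCING[pending[i]] == act:
                del pending[i]
                break
    return pending
```

An annotator-support tool could use such a check to flag charges that remain open, which may point to a missing or mislabelled credit.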

A transaction is defined by a set of consecutive charge-credit pairs aimed at a sub-goal within a dialogue. A transaction has two phases: intention specification, where an intention is specified by a speaker and interpreted by the addressee, and intention satisfaction, where the addressee performs a verbal or non-verbal action attending to the intention and the interlocutor interprets that action.

2 The DIME Corpus

The DIME Corpus (Pineda, 2007) is the empirical information source for the experiments; it is a collection of 26 human-to-human dialogues with their corresponding video and audio recordings and their annotations on a series of levels. It was created to analyze phonetic, phonological and dialogue phenomena in Mexican Spanish. The speakers are approximately 15 individuals, male and female, most of them from Mexico City and aged between 22 and 30.

In each dialogue two speakers collaborate to design a kitchen using C.A.D. software; one of them plays the System’s role and the other plays the User’s role. The System is the same speaker in all dialogues. The speakers perform a task that consists of placing pieces of furniture in a virtual kitchen as specified by a drawing on a piece of paper.

Every User interacts with the System using the C.A.D. tool. The User commands the System to design the virtual kitchen. There is no written script, so the language spoken in the dialogue is spontaneous.

2.1 Annotation levels

The DIME corpus is segmented into utterances and annotated on these levels: orthographic transcription (transliteration), allophones, phonemes, phonetic syllables (considering the possible presence of re-syllabication), words, break indices from Sp-Tobi (Beckman et al., 2002), parts of speech (P.O.S.), discourse markers, speech repairs, intonation and utterance mood. The MexBet phonetic alphabet (Cuétara, 2004) is used to annotate allophones, phonemes, phonetic syllables and words.

2.1.1 Intonational annotation

Intonation is annotated with INTSINT (Hirst, Di Cristo and Espesser, 2000), implemented in the M.E.S. tool (Motif Environment for Speech). A stylized contour of the fundamental frequency is automatically obtained and its inflection points are detected, saving their respective frequency (Hz) and timestamp. A human annotator then performs a perceptual verification to ensure that the stylized contour is perceptually similar to the original speech signal; the annotator can relocate the inflection points on the frequency or time axis. Every inflection point is then automatically annotated with the INTSINT tag set according to its location relative to its predecessor and its successor. The tag set consists of 3 absolute tones: T (top, the absolute highest), B (bottom, the absolute lowest) and M (medium, the frequency average); and 5 iterative tones: H (higher, a local maximum), L (lower, a local minimum), U (up-step, a point on an ascending region), D (down-step, a point on a descending region) and S (same, a point at the same height as its predecessor). Absolute tones usually appear only once in the intonational annotation of an utterance, whereas iterative tones can appear an arbitrary number of times.

The original INTSINT tags and timestamps produced with M.E.S. are transformed into tag concatenations without timestamps in order to generate simple strings. This representation without time information provides a higher level of abstraction and allows intonational contours from different speakers to be compared without the normalization process that a numerical representation requires. In this way, the initial or final regions of a contour can be represented by sequences of the first or the last INTSINT tags of a string.
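The transformation from timestamped tags to plain strings can be illustrated with a minimal sketch. The input format (a list of timestamp and tag pairs) and the function names are assumptions made for illustration; the actual M.E.S. output format is not described here. The feature names follow the data dictionary used in the experiments.

```python
from typing import Dict, List, Tuple

# Hypothetical input format: (timestamp in seconds, INTSINT tag) pairs.
def to_tag_string(points: List[Tuple[float, str]]) -> str:
    """Concatenate INTSINT tags in temporal order, dropping the timestamps."""
    ordered = sorted(points, key=lambda point: point[0])
    return "".join(tag for _, tag in ordered)

def contour_features(tags: str) -> Dict[str, str]:
    """Slice the tag string into initial-region and final-region features."""
    return {
        "first_1": tags[:1],
        "first_2": tags[:2],
        "first_3": tags[:3],
        "last_2": tags[-2:],
    }

# Invented example contour with five inflection points.
points = [(0.12, "M"), (0.40, "H"), (0.75, "D"), (1.10, "L"), (1.32, "B")]
tags = to_tag_string(points)
features = contour_features(tags)
```

Because the strings carry no frequency or time values, two contours from different speakers can be compared with ordinary string operations.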

2.1.2 Utterance mood annotation

Utterance mood (interrogative, declarative, imperative, etc.) is annotated as specified by a series of formalized conventions, some of which are as follows:

The human annotator reads the orthographic transcription and listens to the audio file, focusing on the final region of the utterance.

The tag set is: dec (declarative), imp (imperative), int (interrogative) and other. The other label covers any mood that does not fit into the first three categories. It is also used in any of the following cases: the end of the utterance is too noisy; the end contains a silence longer than a pause; or the utterance contains no lexical information but instead a sound such as breathing, laughing, lip-clicks, etc.

Because a single annotator performs this tagging, inter-annotator agreement is not computed.

A machine-learning algorithm is used to create a model for automatic annotation of utterance mood by using the manual tagging as target data. The automatic annotation is later used as one of the inputs for dialogue act recognition because this would be the case in a real-world application.

3 Experimental settings and information features

The setting is implemented as a machine learning experiment in which a subset of the features is selected as targets and the others as predictors. Table 1 presents a data dictionary of the features involved in the prediction models for obligations and common ground dialogue acts. Its right-most column specifies whether a feature is used as predictor (P) or target (T); the value T/P indicates that the feature is used as target in one model and as predictor in another. Lexical information is not used in the predictor feature set. The last_2 feature is based on the notion of toneme (Navarro-Tomás, 1974).

Two recognition models are produced: one for obligations and another for common ground. The previous dialogue act refers to both the obligations_minus1 and commgr_minus1 features; i.e. both features are evaluated as predictors for obligations and also for common ground.

The machine learning algorithm used to generate the models is J48 (Witten and Frank, 2000), WEKA's implementation of the C4.5 decision tree learner; it builds classification trees using an approach similar to CART (Breiman et al., 1983). WEKA (Witten and Frank, 2000) is a free software tool.

The dataset for the experiment contains features corresponding to 1,043 utterances in 12 dialogues from the DIME corpus.

| Feature | Description | Why it is used | P or T |
|---|---|---|---|
| first_1 | The first INTSINT label of an utterance | The initial region of the intonational contour contributes to utterance mood recognition; each of the three features first_1, first_2 and first_3 is evaluated | P |
| first_2 | The first two INTSINT labels of an utterance | Same as first_1 | P |
| first_3 | The first three INTSINT labels of an utterance | Same as first_1 | P |
| last_2 | The last 2 INTSINT labels of an utterance | Preliminary experiments show that it contributes strongly to utterance mood recognition because it contains the utterance toneme | P |
| optimal_pred_mood | Utterance mood (e.g. declarative, interrogative, imperative) obtained by an automatic recognition task prior to dialogue act recognition. Its predictors are: speaker role, utterance duration, and the last 2 and the first 1, 2 and 3 INTSINT tags of the intonational contour | Particular utterance moods are related to dialogue act types. An automatically recognized mood is used instead of the manually annotated one because this is closer to a real-world application | T/P |
| utt_duration | Utterance duration in milliseconds; it is not normalized | Preliminary experiments show that it might contribute to the recognition of dialogue act type | P |
| speaker_role | Role of the speaker in the dialogue, either System or User | Statistical analyses show that speaker_role is correlated with dialogue act; e.g. System and commit, User and action directive | P |
| obligations | Manually annotated tag for the dialogue act on the obligations plane of an utterance | It is used as target data in the obligations recognition model and as one of the predictors in the common ground model | T/P |
| obligations_minus1 | Dialogue act tag (manually annotated) of obligations in utterance n-1, where n is the utterance whose dialogue act is the target | Its contribution as one of the predictors for dialogue act is evaluated | P |
| commgr | Manually annotated tag for the dialogue act on the common ground plane of an utterance; agreement and understanding tags are concatenated as one single feature | It is used as target in the common ground recognition model and as one of the predictors in the obligations model | T/P |
| commgr_minus1 | Dialogue act tag (manually annotated) of common ground in utterance n-1, where n is the utterance whose dialogue act is the target | Its contribution as one of the predictors for dialogue act is evaluated | P |

Table 1. Data dictionary of the features involved in the prediction models

Baselines to evaluate the results are determined by an experimental setting where the previous dialogue act is not used as one of the predictors. The baseline predictors are: optimal predicted mood, utterance duration (in milliseconds) and speaker role; in addition, the obligations model uses the common ground dialogue act and the common ground model uses the obligations dialogue act. Table 2 presents the baseline values, where accuracy is the percentage of correctly classified instances and kappa, introduced by Siegel and Castellan (1988) and Carletta (1996), is a consistency measure for manual (or automatic) tagging tasks. The number of labels, of instances to be annotated and of annotators determines a default agreement value that might artificially increase the actual inter-annotator agreement (or the model accuracy), so the default agreement value is computed and subtracted. Kappa in Table 2 and in the other machine-learning models is automatically computed by WEKA. Kappa of the manual annotations, except for utterance mood, is computed using Excel-style worksheets. Utterance mood was first manually annotated by a single human annotator, and automatic recognition models were then produced using the manual tagging data as target.

| | Acc. (%) | Kappa |
|---|---|---|
| Obligations | 66.2500 | 0.58120 |
| Comm. Ground | 68.4564 | 0.55510 |

Table 2. Baseline values of recognition without the previous act
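The idea of subtracting a default agreement value can be sketched as a chance-corrected agreement, kappa = (P_o - P_e) / (1 - P_e), where P_o is the observed agreement and P_e the agreement expected by chance. The code below is a minimal Cohen-style sketch under that assumption; it is not WEKA's implementation nor the worksheet procedure, whose details are not given in the text, and the example labels are invented.

```python
from collections import Counter
from typing import Sequence

def kappa(reference: Sequence[str], predicted: Sequence[str]) -> float:
    """Chance-corrected agreement: (P_o - P_e) / (1 - P_e)."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)
    # Observed agreement: fraction of items that received identical labels.
    p_o = sum(r == p for r, p in zip(reference, predicted)) / n
    # Expected (default) agreement from the two marginal label distributions.
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    p_e = sum(ref_counts[label] * pred_counts[label] for label in ref_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: the same single label everywhere
    return (p_o - p_e) / (1.0 - p_e)

# Invented toy data: 3 of 4 labels agree, expected agreement is 0.5.
gold = ["commit", "commit", "answer", "answer"]
pred = ["commit", "answer", "answer", "answer"]
k = kappa(gold, pred)
```

In this toy case the raw agreement is 0.75, but half of it is expected by chance, so the corrected value is lower than the accuracy.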

Dialogue act annotation was formatted and processed in order to manage utterances with more than one tag on any expression plane; e.g. if the tagging contains affirm and accept, meaning that the utterance simultaneously affirms and accepts, then the tags are concatenated as affirm_accept. Other instances are: info-request_graph-action or hold_repeat-rephrase.
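The concatenation of multiple tags on one plane can be sketched as follows. The function name is hypothetical, and mapping an empty tag list to the no-tag value is an assumption based on the no-tag value that appears in the rule sets; the authors' actual preprocessing script is not described.

```python
from typing import Iterable

NO_TAG = "no-tag"  # assumed value for a plane that carries no dialogue act tag

def join_tags(tags: Iterable[str]) -> str:
    """Join the tags of one expression plane with underscores, keeping their order."""
    present = [tag for tag in tags if tag]
    return "_".join(present) if present else NO_TAG

label = join_tags(["affirm", "accept"])
```

Treating each concatenation as a single class keeps the task a standard single-label classification problem, at the cost of a larger label set.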

4 Results and evaluation

Two classification trees were produced: one for obligations, containing 155 rules and one for common ground, containing 151 rules. Each tree was generated and tested by the 10-fold cross validation method. The complete rule sets are available on demand.

Results in Table 3 show that the accuracy and kappa of obligations recognition when using the previous dialogue act as one of the predictors are greater than their baselines: the improvement is +5.658 percentage points in accuracy and +0.0791 in kappa. For common ground recognition, there is a marginal decrease in accuracy (-0.1918 percentage points) and a marginal improvement in kappa (+0.0409).

| | Acc. (%) | Kappa |
|---|---|---|
| Obligations | 71.9080 | 0.6603 |
| Comm. Ground | 68.2646 | 0.5960 |

Table 3. Accuracies and kappas of recognition models

Confidence and support values were computed for every if-then rule in the two trees. Confidence is computed as (a-b)/a, and support as a/n, where a is the number of cases where the rule premise occurs, b is the number of non-satisfactory cases and n is the total number of instances in the data set, i.e. 1,043 utterances. Tables 4 and 5 present the 5 rules with highest supports in each model.
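The two rule statistics defined above can be checked with a short sketch. The function name is hypothetical; the formulas are the ones stated in the text, and the example values are those of the first obligations rule (a = 90, b = 52, n = 1,043).

```python
from typing import Tuple

def rule_stats(a: int, b: int, n: int) -> Tuple[float, float]:
    """Return (confidence %, support %) for an if-then rule.

    a: cases where the rule premise occurs
    b: non-satisfactory cases among them
    n: total number of instances in the data set
    """
    if a <= 0 or n <= 0 or not 0 <= b <= a:
        raise ValueError("require n > 0 and 0 <= b <= a with a > 0")
    confidence = (a - b) / a
    support = a / n
    return round(100 * confidence, 1), round(100 * support, 1)

stats = rule_stats(a=90, b=52, n=1043)
```

A rule with high support but low confidence, such as this one, covers many utterances but misclassifies a large share of them.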

In the rules, the no-tag value indicates that an utterance does not have a tag associated with a dialogue act feature, e.g. rule 1 in Table 4, where the utterance expresses a dialogue act on the obligations plane but not on the common ground plane. Features that do not contribute to the classification task are not present in the rules because they are automatically discarded by J48.

In the obligations model, the most important feature for dialogue act classification is the complementary dialogue act, i.e. commgr.

| Rule ID | Rule | a | b | Confidence (%) | Support (%) |
|---|---|---|---|---|---|
| 1 | IF commgr=no-tag AND commgr_minus1=accept AND utt_duration<=5792, THEN info-request | 90 | 52 | 42.2 | 8.6 |
| 2 | IF commgr=graph-action AND obligations_minus1=commit, THEN info-request_graph-action | 72 | 1 | 98.6 | 6.9 |
| 3 | IF commgr=accept AND speaker_role=system AND obligations_minus1=action-dir, THEN commit | 71 | 19 | 73.2 | 6.8 |
| 4 | IF commgr=hold_repeat-rephr, THEN info-request | 54 | 1 | 98.1 | 5.2 |
| 5 | IF commgr=accept AND speaker_role=user AND commgr_minus1=graph-action, THEN answer | 51 | 0 | 100.0 | 4.9 |

Table 4. The five rules with highest support for obligations prediction



| Rule ID | Rule | a | b | Confidence (%) | Support (%) |
|---|---|---|---|---|---|
| 1 | IF obligations=commit, THEN accept | 112 | 3 | 97.3 | 10.7 |
| 2 | IF obligations=info-request AND speaker_role=system, THEN hold_repeat-rephr | 99 | 47 | 52.5 | 9.5 |
| 3 | IF obligations=info-request_graph-action, THEN graph-action | 98 | 2 | 98.0 | 9.4 |
| 4 | IF obligations=answer AND commgr_minus1=graph-action, THEN accept | 56 | 5 | 91.1 | 5.4 |
| 5 | IF obligations=answer AND commgr_minus1=hold_repeat-rephr, THEN accept | 48 | 7 | 85.4 | 4.6 |

Table 5. The five rules with highest support for common ground prediction

Table 6 presents the ranking of the features according to their presence in the rule set. Features with higher percentages are associated with a higher contribution to the classification task because they have a higher discriminative capability.

| Feature | % of Rules |
|---|---|
| commgr | 100.0 |
| commgr_minus1 | 51.0 |
| obligations_minus1 | 29.0 |
| speaker_role | 26.5 |
| first_3 | 17.4 |
| utt_duration | 9.0 |
| first_2 | 5.2 |
| optimal_pred_mood | 2.6 |

Table 6. Presence of features in the obligations model rules

In the common ground model, the complementary dialogue act (i.e. obligations) is also the most contributing feature, as can be seen in Table 7. Optimal_pred_mood is not a contributing feature in this model.

Recognition rate per class is evaluated by three ratios: recall, precision and F measure. Recall is the number of cases of a class correctly recognized by the model divided by the number of cases actually belonging to that class; precision is the number of cases of a class correctly recognized by the model divided by the number of cases the model assigns to that class. F measure is computed as 2 x ((Precision x Recall) / (Precision + Recall)). An F measure is considered satisfactory if it is greater than or equal to 0.8. In the obligations model, the classes with satisfactory F measures are: info-request_graph-action, info-request_graph-action_answer, answer, commit and offer. In the common ground model, these are: graph-action and offer_conv-open.
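The per-class ratios can be computed as in the following minimal sketch, which uses the standard definitions of precision, recall and F measure. The function name and the toy label sequences are illustrative assumptions, not data from the experiments.

```python
from typing import Sequence, Tuple

def precision_recall_f(reference: Sequence[str], predicted: Sequence[str],
                       cls: str) -> Tuple[float, float, float]:
    """Per-class precision, recall and F measure."""
    # Cases of the class that the model recognized correctly.
    correct = sum(r == cls and p == cls for r, p in zip(reference, predicted))
    actual = sum(r == cls for r in reference)      # cases truly in the class
    assigned = sum(p == cls for p in predicted)    # cases the model put in the class
    precision = correct / assigned if assigned else 0.0
    recall = correct / actual if actual else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Invented toy data: 3 true "commit" cases, 3 predicted, 2 of them correct.
gold = ["commit", "commit", "commit", "answer", "offer"]
pred = ["commit", "commit", "answer", "answer", "commit"]
p, r, f = precision_recall_f(gold, pred, "commit")
```

With the 0.8 threshold stated above, this toy class (F of about 0.67) would not count as satisfactorily recognized.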

| Feature | % of Rules |
|---|---|
| obligations | 100.0 |
| commgr_minus1 | 91.4 |
| first_3 | 27.8 |
| speaker_role | 22.5 |
| obligations_minus1 | 11.9 |
| utt_duration | 9.9 |
| first_2 | 7.9 |
| last_2 | 2.0 |

Table 7. Presence of features in the common ground model rules

5 Conclusions

Using the dialogue act of the previous utterance as one of the predictors improves the accuracy of obligations recognition (+5.66 percentage points). The recognition of common ground dialogue acts does not benefit from this setting.

An automatic recognition process might be implemented as a two-step recognition, where the dialogue act on one of the two expression planes is recognized by a lexical-based algorithm and is then used as one of the inputs for recognizing the dialogue act on the complementary plane with a classification tree; i.e. obligations is used as one of the inputs for common ground, or vice versa.



A model for automatic recognition of dialogue acts is useful to implement dialogue management systems by providing information that complements the speech recognition processes.

Acknowledgments

The authors thank the anonymous reviewers of this paper. Sergio Coria also thanks Varinia Estrada for annotations and valuable comments, and CONACyT and DGEP-UNAM for support to this work.

References

Allen, J. and M. Core. 1997. Draft of DAMSL: Dialog Act Markup in Several Layers. Technical report, The Multiparty Discourse Group, University of Rochester, Rochester, USA, October.

Beckman, M.E., M. Diaz-Campos, J. Tevis-McGory, and T.A. Morgan. 2002. Intonation across Spanish, in the Tones and Break Indices framework. Probus 14, 9-36. Walter de Gruyter.

Breiman, L., J.H. Friedman, R.A. Olshen and C.J. Stone. 1983. Classification and Regression Trees. Pacific Grove, CA: Wadsworth & Brooks, USA.

Bunt, H. 1994. Context and Dialogue Control. THINK Quarterly.

Bunt, H. 1995. Dynamic interpretation and dialogue theory. The structure of multimodal dialogue, ed. by M. M. Taylor, F. Neel, and D. G. Bouwhuis. Amsterdam. John Benjamins

Carletta, Jean. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249-254.

Coria, S. and L. Pineda. 2006. Predicting Dialogue Acts from Prosodic Information. In Proceedings of the Seventh International Conference on Intelligent Text Processing and Computational Linguistics, CICLing (Mexico City), February.

Cuétara, J. 2004. Fonética de la ciudad de México. Aportaciones desde las tecnologías del habla. Tesis para obtener el título de Maestro en Lingüística Hispánica. Maestría en Lingüística Hispánica, Posgrado en Lingüística, Universidad Nacional Autónoma de México.

Garrido, J.M. 1996. Modelling Spanish Intonation for Text-to-Speech Applications. Doctoral Dissertation. Departament de Filologia Espanyola, Universitat Autònoma de Barcelona, Spain.

Hirst, D., A. Di Cristo and R. Espesser. 2000. Levels of representation and levels of analysis for the description of intonation systems. In M. Horne (ed) Prosody: Theory and Experiment (Kluwer, Dordrecht).

Navarro-Tomás, T. 1974. Manual de entonación española. New York: Hispanic Institute, 2ª edición corregida, 1948 .- México: Colección Málaga, 3ª edición, 1966. - Madrid: Guadarrama (Punto Omega, 175), 4ª edición, 1974.

Pineda, L. 2007. The DIME Corpus. Department of Computer Science, Institute of Applied Mathematics and Systems. National Autonomous University of Mexico. http://leibniz.iimas.unam.mx/~luis/DIME/CORPUS-DIME.html

Pineda, L., V. Estrada and S. Coria. 2006. The Obligations and Common Ground Structure of Task Oriented Conversations. In Proceedings of X Iberoamerican Artificial Intelligence Conference, Iberamia, Ribeirao Preto, Brazil, October.

Shriberg, E., R. Bates, A. Stolcke, P. Taylor, D. Jurafsky, K. Ries, N. Coccaro, R. Martin, M. Meteer, and C. Van EssDykema. 1998. Can Prosody Aid the Automatic Classification of Dialog Acts in Conversational Speech? Language and Speech 41(3-4), Special Issue on Prosody and Conversation, 439-487, USA.

Siegel, S. and N.J. Castellan, Jr. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, second edition.

Wahlster, W. 1993. VERBMOBIL: Translation of Spontaneous Face-to-Face Dialogs. In Proceedings of the 3rd EUROSPEECH, pp. 29-38, Berlin, Germany.

Witten, I. and E. Frank. 2000. Data Mining. Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, San Francisco, CA, USA: 89-97.



Adaptation of a Statistical Dialog Manager to a New Task∗

David Griol, Lluís F. Hurtado, Encarna Segarra, Emilio Sanchis

Departament de Sistemes Informàtics i Computació

Universitat Politècnica de Valencia. E-46022 Valencia, Spain

{dgriol,lhurtado,esegarra,esanchis}@dsic.upv.es

Abstract: In this paper, we present an approach for adapting a statistical methodology for dialog management to a new domain. The dialog model, which is automatically learned from a data corpus, is based on the use of a classification process to generate the next system answer. This methodology has previously been applied in a spoken dialog system that provides railway information. We summarize this approach and the work that we are currently carrying out to apply it to the development of a dialog system for booking sports facilities.

Keywords: Adaptation, Dialog Management, Statistical Models, Dialog Systems

1. Introduction

The use of statistical techniques to develop the different modules that make up a dialog system has attracted growing interest in recent years (Young, 2002). These approaches are usually based on modeling the different processes probabilistically and estimating the corresponding parameters from a corpus of dialogs.

The motivation for training statistical models from real data is clear. Advances in the field of dialog systems make the processes of designing, implementing and evaluating dialog management strategies increasingly complex, which has led the research community to shift its focus increasingly from empirical methods to techniques based on models learned from data. These models can be trained from real dialogs and can thus model the variability in user behavior. Although the construction and parameterization of the model depend on expert knowledge of the system domain, the final goal is to develop systems that behave more robustly, are easier to port, are scalable and offer more advantages with regard to their adaptation to the user or to new domains.

∗ This work has been carried out within the framework of the EDECAN project, funded by the MEC and FEDER under grant number TIN2005-08660-C04-02, the GVA grant ACOMP07-197 and the Vicerectorat d'Investigació, Desenvolupament i Innovació of the UPV.

Methodologies of this kind have traditionally been applied in the fields of automatic speech recognition and semantic language understanding (Segarra et al., 2002), (He and Young, 2003), (Esteve et al., 2003). More recently, the application of statistical methodologies to model the behavior of the dialog manager has been yielding interesting results (Williams and Young, 2007), (Lemon, Georgila, and Henderson, 2006), (Torres, Sanchis, and Segarra, 2003).

In this latter field, we have recently developed an approach to manage the dialog using a statistical model learned from a labeled dialog corpus (Hurtado et al., 2006). This work was carried out in the domain of the DIHANA project (Benedí et al., 2006). The task considered for this project was telephone access to a system that provides information in Spanish about train timetables, prices, travel times, train types and services. A corpus of 900 dialogs was acquired for this project using the Wizard of Oz technique. The corpus was labeled in terms of dialog acts in order to train the dialog model.

This paper presents the work we are currently carrying out to adapt this methodology in order to develop a dialog manager within a new project called EDECAN (Lleida et al., 2006). The goal of this project is to increase the robustness of a spoken dialog system by developing technologies that enable its adaptation and personalization to different acoustic or application contexts.

The task we have selected within the EDECAN project is the development of a booking system for sports facilities at the Universitat Politècnica de Valencia. Users can ask about the availability of facilities, book or cancel sports courts, or find out which bookings they currently hold. An initial dialog manager for this task has been designed from a corpus of human-human dialogs, and its evaluation is presented in this work.

The paper is structured as follows. Section 2 summarizes the dialog management methodology developed for the DIHANA project. Section 3 describes the adaptation of this methodology within the EDECAN project, as well as the definition of the task semantics. Section 4 presents the results of the evaluation of the dialog manager developed. Finally, Section 5 briefly summarizes the conclusions of the work presented and describes future work.

2. Dialog management in the DIHANA project

Within the DIHANA project, a dialog manager has been developed based on the statistical modeling of the sequences of system and user dialog acts. A detailed explanation of the dialog model can be found in (Hurtado et al., 2006). The goal was for the dialog manager to generate system turns based solely on the information supplied by the user turns and the information contained in the model. A formal description of the proposed statistical model is as follows:

We represent the dialog as a sequence of pairs (system turn, user turn):

$$(A_1, U_1), \cdots, (A_i, U_i), \cdots, (A_n, U_n)$$

where $A_1$ is the system's greeting turn and $U_n$ is the turn corresponding to the user's last intervention. We denote the pair $(A_i, U_i)$ as $S_i$, the state of the dialog sequence at time $i$.

The goal of the dialog manager at time $i$ is to select the best system response. This selection is a local process that takes into account the previous history of the dialog, that is, the sequence of dialog states preceding time $i$:

$$\hat{A}_i = \operatorname*{argmax}_{A_i \in \mathcal{A}} P(A_i \mid S_1, \cdots, S_{i-1})$$

where the set $\mathcal{A}$ contains all the possible responses considered for the system.

Since the number of possible state sequences is very large, we define a data structure to establish a partition of the space of state sequences, that is, of the dialog history preceding time $i$.

This data structure, which we call the Dialog Register (DR), contains the concepts and attributes provided by the user throughout the previous history of the dialog. With the DR, the order in which the user has provided the information is no longer taken into account, and the best system response is selected through the following maximization:

$$\hat{A}_i = \operatorname*{argmax}_{A_i \in \mathcal{A}} P(A_i \mid DR_{i-1}, S_{i-1})$$

The last state ($S_{i-1}$) is taken into account when selecting the system response because a user turn can provide information that is not contained in the DR but is important for deciding the next system response. This is the case for task-independent information (the dialog acts Afirmacion, Negacion and No-Entendido, i.e. affirmation, negation and not-understood).

Figure 1: Scheme of the dialog manager developed for the DIHANA project

The system response is selected through a classification process that uses a multilayer perceptron (MLP). The input layer receives the encoding of the pair $(DR_{i-1}, S_{i-1})$. The output generated by the perceptron can be seen as the probability of selecting each of the 51 different system responses defined for the DIHANA task.
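Reading the network output as a probability distribution over responses and picking the most probable one can be sketched as follows. The softmax normalization and the function names are assumptions made for illustration; the text does not specify the output activation or the training setup of the MLP, and the scores below are invented.

```python
import math
from typing import List, Sequence

NUM_RESPONSES = 51  # number of system responses defined for the DIHANA task

def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize raw output scores into a probability distribution (assumed)."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def select_response(scores: Sequence[float]) -> int:
    """Return the index of the most probable system response."""
    if len(scores) != NUM_RESPONSES:
        raise ValueError("expected one score per system response")
    probabilities = softmax(scores)
    return max(range(len(probabilities)), key=probabilities.__getitem__)

# Invented output scores: response 7 has the highest activation.
scores = [0.0] * NUM_RESPONSES
scores[7] = 2.5
best = select_response(scores)
```

The selected index would then be mapped to the corresponding system response template.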

Figure 1 shows how the dialog manager developed for DIHANA works in practice. The frames generated by the understanding module after each user intervention and the last response given by the system are used to generate the pair $(DR_{i-1}, S_{i-1})$. The encoding of this pair is the input to the multilayer perceptron, which provides the probability of selecting each of the responses defined in DIHANA given the dialog situation represented by this pair.

2.1. Representation of the Dialog Register

For the DIHANA task, the DR has been defined as a sequence of 15 fields, each associated with a particular semantic concept or attribute. The sequence of concept and attribute fields is shown in Figure 2.

Concepts: Hora, Precio, Tipo-Tren, Tiempo-Recorrido, Servicios

Attributes: Origen, Destino, Fecha-salida, Fecha-Llegada, Hora-Salida, Hora-Llegada, Clase, Tipo-tren, Numero-Orden, Servicios

Figure 2: Dialog register (DR) defined for the DIHANA task

For the dialog manager to determine the next response, we assume that the exact values of the attributes are not significant. These values are important for accessing the database and building the system response in natural language. However, the only information needed to determine the next system action is the presence or absence of concepts and attributes. Therefore, the information stored in the DR is an encoding of each of its fields in terms of three values, {0, 1, 2}, according to the following criterion:

0: The user has not supplied the corresponding concept or attribute value.

1: The concept or attribute is present with a confidence measure above a preset threshold (a value between 0 and 1). Confidence measures are generated during the recognition and understanding processes (García et al., 2003).

2: The concept or attribute is present with a confidence measure below the threshold.

In this way, each DR can be represented as a string of length 15 whose elements take values from the set {0, 1, 2}.
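The three-valued encoding of the DR can be sketched as follows. The field order follows Figure 2; the function name, the input format (a mapping from fields to confidence measures) and the threshold value of 0.5 are assumptions made for illustration. The text does not say how a confidence exactly equal to the threshold is coded; this sketch codes it as 2.

```python
from typing import Dict, List, Tuple

Field = Tuple[str, str]  # (kind, name), where kind is "concept" or "attribute"

CONCEPTS = ["Hora", "Precio", "Tipo-Tren", "Tiempo-Recorrido", "Servicios"]
ATTRIBUTES = ["Origen", "Destino", "Fecha-salida", "Fecha-Llegada", "Hora-Salida",
              "Hora-Llegada", "Clase", "Tipo-tren", "Numero-Orden", "Servicios"]
DR_FIELDS: List[Field] = ([("concept", name) for name in CONCEPTS]
                          + [("attribute", name) for name in ATTRIBUTES])

def encode_dr(confidences: Dict[Field, float], threshold: float = 0.5) -> str:
    """Encode the dialog register as a string over {0, 1, 2}, one symbol per field."""
    codes = []
    for field in DR_FIELDS:
        if field not in confidences:
            codes.append("0")  # not supplied by the user
        elif confidences[field] > threshold:
            codes.append("1")  # present, confidence above the threshold
        else:
            codes.append("2")  # present, confidence not above the threshold
    return "".join(codes)

# Invented example: a timetable query with a reliable origin and an unreliable destination.
dr = encode_dr({
    ("concept", "Hora"): 0.9,
    ("attribute", "Origen"): 0.8,
    ("attribute", "Destino"): 0.3,
})
```

Because only presence and confidence level are kept, dialogs that mention different cities or dates but the same kinds of information map to the same DR string.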


3. Dialog management in the EDECAN project

One of the tasks defined in the context of the EDECAN project is the design of a spoken interface for providing information about, and booking, sports facilities at our university. The main difference between this task and the one defined for the DIHANA project lies in how the information provided by the user is handled. The dialog system developed for DIHANA only provided information in response to the user's queries and never modified the information stored in the system database. The EDECAN task adds new functionalities that modify the information stored in the application databases, for example after a sports court is booked or a booking is cancelled.

The module defined in the EDECAN system architecture to manage the application-related information, called the Application Manager (AM), performs two main functions. First, it handles the queries to the application database. Second, it verifies that the query requested by the user complies with the regulations defined by the University for the management of sports courts (for example, a user cannot book more than one sports court per day, a sanctioned user cannot make bookings, etc.).

Thus, the result provided by the AM must be taken into account to generate the next system response. For example, when booking a sports court (e.g. a tennis court), several situations can arise:

After querying the application database, the system detects that the user is sanctioned. The system must inform the user that they will not be able to book sports courts until the sanction period has ended.

After querying the database, the system finds that no courts meet the requirements stated by the user, and informs the user of this.

The database query shows that exactly one court meets the user's requirements. The system must confirm that everything is correct before finally proceeding with the booking.

If two or more courts that meet the user's requirements are available, the system must ask which of them the user wants to book.

To take the information provided by the AM into account when selecting the next system response, we consider that two stages are required. In the first stage, the information contained in the DR and the last state $S_{i-1}$ are used to select the best query to make to the AM ($\hat{A}_{1i}$):

$$\hat{A}_{1i} = \operatorname*{argmax}_{A_{1i} \in \mathcal{A}_1} P(A_{1i} \mid DR_{i-1}, S_{i-1})$$

where $\mathcal{A}_1$ is the set of possible queries to the AM.

In the second stage, the final system response ($\hat{A}_{2i}$) is generated taking into account $\hat{A}_{1i}$ and the information provided by the AM ($AM_i$):

$$\hat{A}_{2i} = \operatorname*{argmax}_{A_{2i} \in \mathcal{A}_2} P(A_{2i} \mid AM_i, \hat{A}_{1i})$$

where $\mathcal{A}_2$ is the set of possible system responses.
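The two-stage selection can be sketched as two chained argmax operations with a call to the AM in between. All names, labels and probability values below are hypothetical placeholders; in the approach described, the two distributions would come from trained classifiers, and the four AM outcomes mirror the booking situations listed earlier in this section.

```python
from typing import Callable, Dict, Tuple

Distribution = Dict[str, float]

def argmax(dist: Distribution) -> str:
    """Return the label with the highest probability."""
    return max(dist, key=dist.get)

def two_stage_response(
    dr: str,
    last_state: Tuple[str, str],
    stage1: Callable[[str, Tuple[str, str]], Distribution],
    stage2: Callable[[str, str], Distribution],
    query_am: Callable[[str], str],
) -> str:
    am_query = argmax(stage1(dr, last_state))   # stage 1: best query to the AM
    am_result = query_am(am_query)              # result returned by the AM
    return argmax(stage2(am_result, am_query))  # stage 2: final system response

# Toy stand-ins for the two classifiers and the AM (all values invented).
def toy_stage1(dr: str, last_state: Tuple[str, str]) -> Distribution:
    return {"query-availability": 0.2, "query-booking": 0.8}

def toy_query_am(query: str) -> str:
    return "one-court"

def toy_stage2(am_result: str, am_query: str) -> Distribution:
    table = {
        "sanctioned": {"inform-sanction": 0.9, "confirm-booking": 0.1},
        "no-courts": {"inform-no-courts": 1.0},
        "one-court": {"confirm-booking": 0.7, "ask-which-court": 0.3},
        "several-courts": {"ask-which-court": 0.8, "confirm-booking": 0.2},
    }
    return table[am_result]

response = two_stage_response(
    "100001200000000", ("greeting", "Booking"), toy_stage1, toy_stage2, toy_query_am
)
```

Splitting the decision in this way lets the second stage react to database and regulation checks that the first stage cannot know in advance.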

Figure 3 shows the scheme proposed for developing the dialog manager for the EDECAN project, detailing the two stages described for generating the final system response.

3.1. Task semantics

The semantics of the EDECAN task were determined by taking into account the different functionalities the booking system is intended to offer and the information required to complete them. For this definition, we used a set of human-human dialogs provided by the staff of the University's Sports Area (Área de Deportes). Thus, the users taking part in these dialogs genuinely wanted to make the different queries that the automatic system will provide.



Figure 3: Scheme of the dialog manager proposed for the EDECAN project

This set of dialogs was extended with new dialogs generated by members of our research group. To generate them, the behavior of the system was simulated by a person, in a way similar to the Wizard of Oz technique. These dialogs include turns that ask for confirmation of attributes and concepts mentioned during the dialog. In total, the corpus contains 150 dialogs (873 user turns). Figure 4 shows one of the dialogs in this corpus. The dialogs were labeled using a dialog-act representation, which defines the task semantics.

3.1.1. Labeling of user turns

For user turns, the dialog acts correspond to the semantic interpretation of the user's utterance by means of frames (attributes and concepts). A total of seven concepts were defined for the EDECAN task:

Four task-dependent concepts, corresponding to the types of queries the user can make: checking court availability (Availability), making a booking (Booking), listing the user's current bookings (Booked), or cancelling one of them (Cancellation).

Three task-independent concepts, analogous to those used in the DIHANA task (Acceptance, Rejection and Not-Understood).

Six attributes were defined, covering the information the user must supply to complete the queries handled by the system: the sport to be played (Sport), the time for which the query is made (Hour), the date (Date), the type of sports court (Court-Type), the court identifier (Court-Number), and the order number of the court to be booked (Order-Number).

The following is an example of the semantic interpretation of a user utterance:

User turn:
I want to book a squash court for tomorrow afternoon
Semantic interpretation:
(Booking)
Sport: squash
Date: tomorrow
Hour: afternoon

S1: Welcome to the Sports Area information system. How can I help you?
U1: I want to book an outdoor padel court for next Wednesday.
S2: What time would you like to play?
U2: At 3 in the afternoon.
S3: On May 23 at 15:00, padel courts 1 and 4 are available. Which one would you like to book?
U3: Court 1, please.
S4: I am booking padel court 1 for Wednesday, May 23 at 15:00. Anything else?
U4: Yes, I wanted to release a court I had for tomorrow.
S5: Do you want to cancel the padel court on May 21 at seven in the evening?
U5: Yes.
S6: I am cancelling the requested court. Anything else?
U6: No, thank you.
S7: Thank you for using the Sports Area information system. Goodbye.

Figure 4: Example of a dialog for the EDECAN task

3.1.2. Labeling of system turns

System turns were labeled in a similar way to user turns. The concepts can likewise be classified as task-dependent or task-independent. A total of 18 task-dependent concepts were defined:

Concepts used to inform the user of the result of a query: court availability (Availability), the making of a booking (Booking), the user's current bookings (Booked), or the cancellation of a booking (Cancellation).

Concepts defined to request from the user the attributes needed for a query: sport (Sport), date (Date), time (Hour) and court type (Court-Type).

Concepts used to confirm concepts (Confirmation-Availability, Confirmation-Booking, Confirmation-Booked, Confirmation-Cancellation) and attributes (Confirmation-Sport, Confirmation-Date, Confirmation-Hour, Confirmation-CourtType).

Concepts related to the AM: violation of the booking regulations (Rule-Info) or indication that one of the available courts must be selected (Booking-Choice).

Six attributes were defined: five of those used for labeling user turns (Sport, Court-Type, Court-Number, Date, Hour) and one attribute for the number of courts that satisfy the user's requirements (Availability-Number).

An example of the labeling of a system response follows:

System turn:
Shall I book squash court 1 in the pavilion for June 25 from 20:00 to 20:30?
Labeling:
(Confirmation-Booking)
Sport: squash
Date: 25-06-2007
Hour: 20:00-20:30
Court-Type: pavilion
Court-Number: 1

3.2. Representation of the information sources

The representation defined for the input pair (DR_{i-1}, S_{i-1}) is as follows:

The encoding of the dialog acts of the last response generated by the system (A_{i-1}): this information is modeled with a variable that has as many bits as there are distinct system responses defined for the system (29).

\[ \vec{x}_1 = (x_{1_1}, x_{1_2}, x_{1_3}, \ldots, x_{1_{29}}) \in \{0,1\}^{29} \]

Dialog register (DR): the DR defined for the EDECAN task stores ten features, corresponding to the four task-dependent concepts and six attributes defined for labeling user turns (Figure 5). As in the DIHANA task, each of these features can take the values {0, 1, 2}. Each concept and attribute in the DR can therefore be modeled with a three-bit variable.

\[ \vec{x}_i = (x_{i_1}, x_{i_2}, x_{i_3}) \in \{0,1\}^{3}, \quad i = 2, \ldots, 11 \]



Concepts        Attributes
Availability    Sport
Booking         Court-Type
Booked          Court-Number
Cancellation    Date
                Hour
                Order-Number

Figure 5: Dialog register defined for the EDECAN task

Task-independent information (dialog acts Acceptance, Rejection and Not-Understood): these three dialog acts are encoded in the same way as the features stored in the DR. Each of them can thus take the values {0, 1, 2} and be modeled with a three-bit variable.

\[ \vec{x}_i = (x_{i_1}, x_{i_2}, x_{i_3}) \in \{0,1\}^{3}, \quad i = 12, \ldots, 14 \]

The variable (DR_{i-1}, S_{i-1}) can thus be represented by the vector of 14 features:

\[ (DR_{i-1}, S_{i-1}) = (\vec{x}_1, \vec{x}_2, \vec{x}_3, \ldots, \vec{x}_{14}) \]
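As an illustration, the 14 features can be concatenated into a single binary input vector of 29 + 13 × 3 = 68 bits. This is only a sketch: the text says each {0, 1, 2} feature uses three bits but does not say how values map to bits, so a one-hot mapping is assumed here, and `encode_input` and its argument names are hypothetical.

```python
def one_hot(index, size):
    """Return a list of `size` bits with only position `index` set."""
    bits = [0] * size
    bits[index] = 1
    return bits


def encode_input(last_response, dr_values, indep_values):
    """Encode the pair (DR, S) as a flat bit vector.

    last_response: index (0..28) of the last system response.
    dr_values: 10 values in {0, 1, 2}, one per DR concept/attribute.
    indep_values: 3 values in {0, 1, 2} for Acceptance, Rejection,
        Not-Understood.
    Assumption: each {0, 1, 2} value is one-hot encoded on 3 bits.
    """
    if len(dr_values) != 10 or len(indep_values) != 3:
        raise ValueError("expected 10 DR features and 3 task-independent acts")
    vector = one_hot(last_response, 29)
    for value in list(dr_values) + list(indep_values):
        vector += one_hot(value, 3)
    return vector


encoded = encode_input(4, [0, 1, 0, 0, 2, 0, 0, 1, 0, 0], [0, 0, 0])
```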

The response generated by the AM was encoded taking into account the set of possible responses found in the corpus after a query to the AM. This set covers the different situations that can result from a query to the AM developed for EDECAN and that are observed in the person-to-person corpus:

Case 1: The AM has not taken part in generating the final system response, for example when the system selects the confirmation of an attribute, the closing of the dialog, etc.

Cases 2-4: After a database query, the AM answers that no courts meet the user's requirements (case 2), that exactly one court does (case 3), or that more than one court is available (case 4).

Case 5: The AM warns that the user's query cannot be carried out because it violates the regulations established by the University.

The response generated by the AM is thus modeled with a five-bit variable, in which each bit flags one of these five situations:

\[ AM = (x_1, x_2, x_3, x_4, x_5) \in \{0,1\}^{5} \]
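A small sketch of this encoding, assuming that exactly one bit is active per AM answer and that bit k corresponds to case k; the short case descriptions and the name `encode_am` are illustrative, not taken from the paper.

```python
# Hypothetical labels for the five AM situations described above.
AM_CASES = {
    1: "AM not involved",
    2: "no court available",
    3: "exactly one court",
    4: "more than one court",
    5: "booking rules violated",
}


def encode_am(case):
    """Return the five-bit vector with only the bit for `case` set."""
    if case not in AM_CASES:
        raise ValueError(f"unknown AM case: {case}")
    return [1 if c == case else 0 for c in sorted(AM_CASES)]


am_vector = encode_am(3)
```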

4. Evaluation

Based on the labeling of the person-to-person dialog corpus, and applying the adaptation described in this paper, a dialog manager was developed in the context of the EDECAN project.

The MLPs were trained with software developed in our research group. A validation subset (20 %) was extracted from each of the test sets. The MLPs were trained using the backpropagation algorithm with momentum. The best topology had two hidden layers with 100 and 10 neurons, respectively.

The evaluation was carried out by cross-validation. In each experiment, the corpus was randomly split into five subsets. Each evaluation thus consisted of five runs; in each one, a different subset of the five was used as the test sample and the remaining 80 % of the corpus was used as the training partition. Three measures were defined to evaluate the manager:

Percentage of responses that match the reference response annotated in the corpus (%exacta).

Percentage of responses that are coherent with the current dialog state (%correcta).

Percentage of responses that are not compatible with the current dialog state (%error), causing the dialog to fail.

The last two measures were obtained through a manual review of the responses given by the manager. Table 1 shows the results of the evaluation.

%exacta     72.9 %
%correcta   86.7 %
%error       4.5 %

Table 1: Results of the evaluation of the developed dialog manager

The results show that the dialog manager adapts well to the requirements of the new task: 86.7 % of its responses are coherent with the current dialog state, and 72.9 % match the reference response annotated in the corpus.

The percentage of responses that can cause the dialog to fail is considerable (4.5 %). In addition, the remaining 8.8 % of responses, not covered by the three measures above, allow the dialog to continue but are not coherent with the current dialog state (for example, requesting information that is already available). Both percentages are expected to decrease as the initial dialog corpus is extended.
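The 8.8 % figure can be recovered from the reported values, on the reading that the exact matches are counted within the coherent responses (an interpretation of the results, not a formula given in the paper):

\[ 100\,\% - \underbrace{86.7\,\%}_{\text{\%correcta}} - \underbrace{4.5\,\%}_{\text{\%error}} = 8.8\,\% \]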

5. Conclusions

This paper has presented the process followed to adapt a statistical dialog management methodology so that it can operate in a system with a different domain. Methodologies of this kind are easy to adapt, and their behavior depends on the quality and size of the corpus available for learning the model. Starting from an initial dialog corpus, we developed a manager that performs well and whose initial model can be improved by adding new dialogs.

We are currently working on the different modules that will make up the EDECAN dialog system, with the aim of acquiring a corpus of dialogs with real users. This acquisition will be supervised, using the dialog manager presented in this work. The dialogs acquired will be used to improve the initial dialog model.

Bibliografıa

Benedı, J.M., E. Lleida, A. Varona, M.J. Cas-tro, I. Galiano, R. Justo, I. Lopez, y A. Mi-guel. 2006. Design and acquisition of a te-lephone spontaneous speech dialogue cor-

pus in Spanish: DIHANA. En Proc. ofLREC’06, Genove.

Esteve, Y., C. Raymond, F. Bechet, y R. DeMori. 2003. Conceptual Decoding forSpoken Dialog systems. En Proc. of Eu-roSpeech’03, paginas 617–620.

Garcıa, F., L.F. Hurtado, E.Sanchis, y E. Se-garra. 2003. The incorporation of Con-fidence Measures to Language Understan-ding. En Proc. of TSD’03, paginas 165–172, Ceske Budejovice.

He, Yulan y S. Young. 2003. A data-drivenspoken language understanding system.En Proc. of ASRU’03, paginas 583–588.

Hurtado, L.F., D. Griol, E. Segarra, y E. San-chis. 2006. A Stochastic Approach forDialog Management based on Neural Net-works. En Proc. of InterSpeech’06, Pitts-burgh.

Lemon, O., K. Georgila, y J. Henderson.2006. Evaluating Effectiveness and Por-tability of Reinforcement Learned Dia-logue Strategies with real users: theTALK TownInfo Evaluation. En Proc. ofSLT’06, Aruba.

Lleida, E., E. Segarra, M.I. Torres, yJ. Macıas-Guarasa. 2006. EDECAN: sis-tEma de Dialogo multidominio con adap-tacion al contExto aCustico y de Aplica-cioN. En Proc. IV Jornadas en Tecnologiadel Habla, paginas 291–296, Zaragoza.

Segarra, E., E. Sanchis, M. Galiano,F. Garcıa, y L. Hurtado. 2002. Ex-tracting Semantic Information ThroughAutomatic Learning Techniques. Inter-national Journal on Pattern Recognitionand Artificial Intelligence, 16(3):301–307.

Torres, F., E. Sanchis, y E. Segarra. 2003.Development of a stochastic dialog mana-ger driven by semantics. En Proc. EuroS-peech’03, paginas (1):605–608.

Williams, J. y S. Young. 2007. PartiallyObservable Markov Decision Processes forSpoken Dialog Systems. En ComputerSpeech and Language 21(2), paginas 393–422.

Young, S. 2002. The Statistical Approach tothe Design of Spoken Dialogue Systems.Informe tecnico.


238

Machine Translation

A method for extracting translation equivalents from a Spanish-Galician comparable corpus ∗

Pablo Gamallo Otero
Dept. de Língua Espanhola
Univ. de Santiago de [email protected]

José Ramom Pichel Campos
Dept. de Tecnologia Linguística da Imaxin|Software
Santiago de Compostela, Galiza
[email protected]

Abstract: So far, research on the extraction of word translations from comparable, non-parallel corpora has not been very popular. The main reason is the poor results when compared with those obtained from aligned parallel corpora. The method proposed in this paper, which relies on seed contexts generated from external bilingual dictionaries, achieves precision close to that of methods based on parallel corpora. The huge amount of comparable corpora available via the Web can thus be viewed as a never-ending source of lexicographic information. In this paper, we describe the experiments performed on a comparable Spanish-Galician corpus.
Keywords: multilingual lexical extraction, comparable corpora, automatic translation

1. Introduction

Over the last two decades, many studies have focused on the automatic extraction of bilingual lexicons from parallel corpora (Melamed, 1997; Ahrenberg, Andersson, and Merkel, 1998; Tiedemann, 1998; Kwong, Tsou, and Lai, 2004). These studies share a common strategy: they first organize the texts into pairs of aligned segments and then, based on this alignment, compute word co-occurrences within each pair of segments. In some of these experiments, word-level precision is very high: around 90 % at 90 % recall. Unfortunately, large amounts of parallel text are not yet available, especially for minority languages. To get around this problem, techniques for extracting bilingual lexicons from non-parallel comparable corpora have been developed in recent years. These techniques rest on the idea that the Web is a huge resource of multilingual texts that can easily be organized into non-parallel comparable corpora. A non-parallel comparable corpus (hereafter "comparable corpus") consists of texts in two languages that, without being translations of one another, deal with similar topics. However, the precision of such methods is still well below that of extraction algorithms for parallel corpora. The best results so far barely reach 72 % (Rapp, 1999), and this without reporting the coverage achieved.

∗ This work was funded by the Ministerio de Educación y Ciencia under the GARI-COTER project, ref: HUM2004-05658-D02-02

In this paper, we propose a new method for extracting bilingual lexicons from comparable corpora. The method relies on bilingual dictionaries to identify bilingual correspondences between pairs of lexico-syntactic contexts. Besides the dictionaries, cognates identified in the comparable texts are used for the same purpose. The bilingual lexicon is then extracted from the co-occurrences of single-word and multiword lemmas in the bilingual contexts previously identified. The results improve on the 72 % precision mark at 80 % coverage, which represents progress in extraction from comparable corpora. These results support the idea that the vast amount of comparable corpora available on the Web can become an almost inexhaustible source of lexicographic knowledge.

The paper is organized as follows. Section 2 situates our approach with respect to related work. Section 3 describes in detail the stages of the proposed method. Section 4 then analyzes the experiments carried out on a Spanish-Galician corpus and describes an evaluation protocol for the results. We close with a section of conclusions.

2. Related work

Few studies focus on extracting bilingual lexicons from comparable corpora, compared with those that use aligned parallel texts. The most effective method, on which most of the few studies in the area are based (Fung and McKeown, 1997; Fung and Yee, 1998; Rapp, 1999; Chiao and Zweigenbaum, 2002), can be described as follows: word or multiword w1 is a candidate translation of w2 if the words co-occurring with w1 within a window of size N are translations of the words co-occurring with w2 within the same window. This strategy therefore relies on a list of bilingual word pairs (called seed words) previously identified in an external bilingual dictionary. In short, w1 can be a candidate translation of w2 if both tend to co-occur with the same seed words. The main problem with this method is that, according to Harris's hypothesis (Harris, 1985), windows of size N are semantically less precise than local contexts of a lexico-syntactic nature. The most effective techniques for automatically generating semantic relations (Grefenstette, 1994; Lin, 1998) define contexts not as word windows but as syntactic dependencies. In this paper, we present a method for extracting bilingual lexicons based on the prior identification of bilingual lexico-syntactic contexts, rather than on windows of seed words, as is usual in the most representative work in the state of the art.

Other approaches to extracting bilingual lexicons from comparable corpora do not require external dictionaries (Fung, 1995; Rapp, 1995; Diab and Finch, 2001). However, (Fung, 1995) obtains very poor results, which severely restricts its potential applications, (Rapp, 1995) has serious computational limitations, and (Diab and Finch, 2001) has only been applied to monolingual corpora. Finally, the approach described in (Gamallo and Pichel, 2005; Gamallo, 2007) uses small fragments of parallel corpora as the basis for extracting seed contexts.

3. Description of the strategy

Our strategy is divided into three sequential stages: (1) text processing, (2) creation of a list of seed contexts by exploiting bilingual dictionaries and identifying cognates, and (3) extraction of translation equivalents from comparable texts, using the list of seed contexts as anchors.

3.1. Processing the comparable corpus

First, we lemmatize, tag and morphosyntactically disambiguate the comparable corpus using an open-source tool: Freeling (Carreras et al., 2004). During tagging, proper-name identification is enabled; proper names may be single-word or multiword. Once this is done, potential syntactic dependencies between lemmas are selected with a basic pattern-matching strategy. Determiners are removed. Each identified syntactic dependency is split into two complementary lexico-syntactic contexts. Table 1 shows some examples.

Binary dependencies             Contexts
de (venta, azúcar)              < venta de [NOUN] >
                                < [NOUN] de azúcar >
robj (ratificar, ley)           < ratificar [NOUN] >
                                < [VERB] ley >
lobj (ratificar, gobierno)      < gobierno [VERB] >
                                < [NOUN] ratificar >
iobj contra (luchar, pobreza)   < luchar contra [NOUN] >
                                < [VERB] contra pobreza >
modAdj (entrenador, adecuado)   < [NOUN] adecuado >
                                < entrenador [ADJ] >

Table 1: Binary dependencies and their associated lexico-syntactic contexts.

Given a syntactic dependency identified in the corpus, for example de (venta, azúcar), we extract two lexico-syntactic contexts: < venta de [NOUN] >, where NOUN stands for the set of nouns that can appear after "venta de", i.e. "azúcar", "producto", "aceite", etc., and < [NOUN] de azúcar >, where NOUN stands for the set of nouns that can appear before the complement "de azúcar": "venta", "importación", "transporte", etc. The characterization of contexts is based on the notion of co-requirement described in (Gamallo, Agustini, and Lopes, 2005). Besides prepositional dependencies between nouns, we also use the lobj dependency, which represents the likely relation between a verb and the noun appearing immediately to its left (left object); robj is the relation between a verb and the noun appearing to its right (right object); iobj prp represents the relation between a verb and a noun preceded by a preposition. Finally, modAdj is the relation between a noun and the adjective that modifies it.
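The split of one dependency into two complementary contexts can be sketched as below. This is an illustrative reading of the dependency examples above, not the authors' code: the function name, the relation labels "prep" and "iobj", and the argument order (head first, dependent second) are assumptions.

```python
def contexts_from_dependency(rel, head, dep, prep=None):
    """Split a binary dependency into its two lexico-syntactic contexts.

    rel: "prep" (noun-preposition-noun), "robj", "lobj", "iobj" or "modAdj".
    prep: the preposition, required for "prep" and "iobj".
    """
    if rel in ("prep", "iobj") and prep is None:
        raise ValueError("a preposition is required for this relation")
    if rel == "prep":      # de (venta, azúcar)
        return (f"<{head} {prep} [NOUN]>", f"<[NOUN] {prep} {dep}>")
    if rel == "robj":      # robj (ratificar, ley)
        return (f"<{head} [NOUN]>", f"<[VERB] {dep}>")
    if rel == "lobj":      # lobj (ratificar, gobierno)
        return (f"<{dep} [VERB]>", f"<[NOUN] {head}>")
    if rel == "iobj":      # iobj contra (luchar, pobreza)
        return (f"<{head} {prep} [NOUN]>", f"<[VERB] {prep} {dep}>")
    if rel == "modAdj":    # modAdj (entrenador, adecuado)
        return (f"<[NOUN] {dep}>", f"<{head} [ADJ]>")
    raise ValueError(f"unknown relation: {rel}")


pair = contexts_from_dependency("prep", "venta", "azúcar", prep="de")
```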

The bilingual lexicons we aim to extract consist not only of single-word lemmas and proper names, but also of multiword lemmas, that is, expressions with several lexemes and a certain degree of cohesion: "accidente de tráfico", "cadena de televisión", "dar a conocer", etc. To extract such expressions, we run a second processing phase that identifies multiword lemmas (other than proper names) and their contexts. For this task we use a basic automatic extractor, based on the instantiation of morphosyntactic patterns (e.g., NOUN-PRP-NOUN, NOUN-ADJ, VERB-NOUN, etc.), which identifies a large number of candidates. The extractor is run on the comparable corpus, so we obtain multiword lemmas in both languages. We then reduce the list of candidates with an elementary statistical filter that keeps only candidates with a high degree of cohesion (SCP measure), following a strategy similar to that described in (Silva et al., 1999). Once the list of multiword lemmas is built, we extract their lexico-syntactic contexts in the same way as for single-word lemmas and proper names above.
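The paper does not spell out the SCP formula. For a two-word candidate, the Symmetric Conditional Probability is commonly given as p(x|y) · p(y|x); the sketch below assumes that bigram form, and the threshold and counts are made up for illustration.

```python
def scp(count_xy, count_x, count_y):
    """Symmetric Conditional Probability of a two-word candidate.

    Assumed bigram form: p(y|x) * p(x|y) = c(x,y)^2 / (c(x) * c(y)).
    """
    if count_x == 0 or count_y == 0:
        return 0.0
    return count_xy ** 2 / (count_x * count_y)


def filter_candidates(candidates, min_scp):
    """Keep candidates whose cohesion reaches a hypothetical threshold.

    candidates: dict mapping a candidate to (c(x,y), c(x), c(y)).
    """
    return [c for c, counts in candidates.items() if scp(*counts) >= min_scp]


# Made-up counts: the first pair is cohesive, the second is not.
toy = {
    ("accidente", "tráfico"): (8, 10, 16),
    ("venta", "producto"): (2, 50, 40),
}
kept = filter_candidates(toy, min_scp=0.1)
```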

3.2. Generation of bilingual contexts

The main strategy we use to generate bilingual lexico-syntactic contexts relies on external bilingual dictionaries. Suppose that in a Spanish-Galician dictionary the Spanish entry "venta" is translated into Galician as "venda", both nouns. Lexico-syntactic contexts are generated from each of these nouns by following basic rules such as: a noun can be preceded by a preposition that is itself preceded by another noun or a verb, it can follow a noun or verb followed by a preposition, or it can come before or after an adjective. We focused generation on three categories: nouns, verbs and adjectives. For each syntactic category, we generated only a representative subset of all the contexts that could be generated. Table 2 shows the contexts generated from the bilingual correspondence between "venta" and "venda" and a limited set of rules.

Spanish                 Galician
<venta prp [NOUN]>      <venda prp [NOUN]>
<[NOUN] prp venta>      <[NOUN] prp venda>
<[VERB] venta>          <[VERB] venda>
<[VERB] prp venta>      <[VERB] prp venda>
<venta [VERB]>          <venda [VERB]>
<venta [ADJ]>           <venda [ADJ]>
<[ADJ] venta>           <[ADJ] venda>

Table 2: Bilingual contexts generated from the "venta-venda" correspondence.

Generation is completed by instantiating prp. For this we use a closed list of specific prepositions and their translations. In this way, we obtain pairs of bilingual contexts such as: <venta de [NOUN]> and <venda de [NOUN]>, <venta en [NOUN]> and <venda en [NOUN]>, etc.
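A minimal sketch of this generation step for a noun pair follows. The seven templates mirror the "venta-venda" contexts listed above; the preposition list contains only the two pairs mentioned in the text (the real closed list is not given), and all names are hypothetical.

```python
# Templates for nouns; {w} is the dictionary word, {p} the preposition slot.
NOUN_TEMPLATES = [
    "{w} {p} [NOUN]",
    "[NOUN] {p} {w}",
    "[VERB] {w}",
    "[VERB] {p} {w}",
    "{w} [VERB]",
    "{w} [ADJ]",
    "[ADJ] {w}",
]

# Assumed excerpt of the closed preposition list (Spanish -> Galician).
PREPOSITIONS = {"de": "de", "en": "en"}


def generate_context_pairs(src_noun, tgt_noun):
    """Generate bilingual context pairs from one dictionary noun pair."""
    pairs = []
    for template in NOUN_TEMPLATES:
        if "{p}" in template:
            # Instantiate the preposition slot with each translated pair.
            for p_src, p_tgt in PREPOSITIONS.items():
                pairs.append((
                    "<" + template.format(w=src_noun, p=p_src) + ">",
                    "<" + template.format(w=tgt_noun, p=p_tgt) + ">",
                ))
        else:
            pairs.append((
                "<" + template.format(w=src_noun) + ">",
                "<" + template.format(w=tgt_noun) + ">",
            ))
    return pairs


venta_pairs = generate_context_pairs("venta", "venda")
```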

In addition, we use a complementary strategy based on identifying cognates in the comparable texts. Here we call cognates two words in different languages that are spelled identically. We are interested only in those not found in the bilingual dictionary, which are mostly proper names. We generate the corresponding lexico-syntactic contexts and add them to the list of bilingual context pairs.

The bilingual pairs generated by these two strategies serve as anchors or references for marking the comparable corpus on which the last stage of the extraction process is carried out.

3.3. Identification of translation equivalents in the comparable corpus

The final stage consists of extracting translation equivalents with the help of the bilingual context pairs previously generated. It is divided into two sequential processes: context filtering and extraction of translation equivalents.

3.3.1. Filtering

Given the list of bilingual context pairs generated in the previous stage, we remove pairs with a high degree of dispersion and asymmetry in the comparable corpus. A bilingual pair of contexts is considered dispersed if the number of different lemmas appearing in the two contexts, divided by the total number of lemmas of the required category, exceeds a given threshold. A bilingual pair is considered asymmetric if one of its contexts has a high frequency in the corpus while the other has a low frequency. The dispersion and asymmetry thresholds are set empirically and may vary with the type and size of the corpus. Once dispersed and asymmetric pairs have been filtered out, we are left with a reduced list that we call seed contexts. This list is used in the extraction process that follows.
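One way to picture the filter is sketched below. The paper gives no formulas or threshold values, so everything specific here is an assumption: dispersion is taken as the number of distinct lemmas filling either context over the number of lemmas of that category, asymmetry as a low ratio between the two context frequencies, and the default thresholds are placeholders.

```python
def is_dispersed(n_lemmas_src, n_lemmas_tgt, n_category_lemmas, max_dispersion):
    """True if too large a share of the category's lemmas fills the pair."""
    return (n_lemmas_src + n_lemmas_tgt) / n_category_lemmas > max_dispersion


def is_asymmetric(freq_src, freq_tgt, min_ratio):
    """True if one context is frequent while its counterpart is rare."""
    low, high = sorted((freq_src, freq_tgt))
    return high == 0 or low / high < min_ratio


def select_seed_contexts(pairs, n_category_lemmas,
                         max_dispersion=0.1, min_ratio=0.2):
    """Keep the context pairs that are neither dispersed nor asymmetric.

    pairs: dict mapping a pair id to
        (n_lemmas_src, n_lemmas_tgt, freq_src, freq_tgt).
    """
    seeds = []
    for pair_id, (n_src, n_tgt, f_src, f_tgt) in pairs.items():
        if is_dispersed(n_src, n_tgt, n_category_lemmas, max_dispersion):
            continue
        if is_asymmetric(f_src, f_tgt, min_ratio):
            continue
        seeds.append(pair_id)
    return seeds


# Made-up statistics for three context pairs.
toy_pairs = {
    "estudio de [NOUN]": (30, 25, 123, 78),      # kept
    "[NOUN] en Lugo": (40, 35, 500, 3),          # asymmetric
    "[VERB] venta": (900, 800, 2000, 1800),      # dispersed
}
seeds = select_seed_contexts(toy_pairs, n_category_lemmas=10000)
```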

3.3.2. Extraction algorithm

To extract pairs of bilingual lemmas, we propose the following algorithm.

Given a list of seed context pairs:

(a) for each lemma wi of the source language, count the number of times it instantiates each seed context and build a context vector with this information;

(b) for each lemma wj of the target language, count the number of times it instantiates each seed context and build a context vector with this information;

(c) compute the DICE similarity between pairs of vectors: DICE(wi, wj); if wj is among the N most similar to wi, select wj as a candidate translation of wi.
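Steps (a) to (c) can be sketched as follows. This is an illustrative outline rather than the authors' implementation: the input format (pairs of lemma and seed-context index) and the function names are assumptions, and the similarity function is passed in as a parameter so that any measure can be plugged in.

```python
from collections import Counter, defaultdict


def build_vectors(observations):
    """Steps (a)/(b): count how often each lemma fills each seed context.

    observations: iterable of (lemma, seed_context_index) pairs.
    """
    vectors = defaultdict(Counter)
    for lemma, index in observations:
        vectors[lemma][index] += 1
    return vectors


def top_candidates(src_vector, tgt_vectors, similarity, n=10):
    """Step (c): the n target lemmas most similar to the source lemma."""
    ranked = sorted(tgt_vectors,
                    key=lambda w: similarity(src_vector, tgt_vectors[w]),
                    reverse=True)
    return ranked[:n]


# Toy data; the shared-count overlap stands in for the similarity measure.
src = build_vectors([("Bachillerato", 198), ("Bachillerato", 198),
                     ("Bachillerato", 234)])
tgt = build_vectors([("Bacharelato", 198), ("Bacharelato", 234),
                     ("concello", 2336)])
overlap = lambda a, b: sum(min(a[k], b[k]) for k in a)
best = top_candidates(src["Bachillerato"], tgt, overlap, n=1)
```

Because both languages index their vectors by the same seed-pair positions, the vectors are directly comparable without any translation step at this point.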

Let us look at an example. Table 3 shows some positions of the context vector associated with the Spanish noun "Bachillerato". The value at each position (third column of the table) is the number of times the noun co-occurs with the context in the comparable corpus. Every context in the vector of the Spanish entry must have a Galician counterpart, since it belongs to the list of seed context pairs. The first column of the table gives the index or position of the context in the vector.

Table 4, in turn, shows the values at the same positions in the vector of the Galician noun "Bacharelato". The contexts in the second column are the translations of the Spanish ones in Table 3. For example, position 00198 of the two vectors holds the contexts <estudio de [NOUN]> and <estudo de [NOUN]>. Since they form a seed context pair, they must appear at the same vector position.

index   context                     freq.
00198   <estudio de [NOUN]>         123
00234   <estudiante de [NOUN]>      218
00456   <curso de [NOUN]>           69
01223   <asignatura de [NOUN]>      35
02336   <[NOUN] en Lugo>            60
07789   <estudiar [NOUN]>           98
08121   <cursar [NOUN]>             56

Table 3: Extract of the vector associated with the Spanish noun Bachillerato.

index   context                     freq.
00198   <estudo de [NOUN]>          78
00234   <estudante de [NOUN]>       145
00456   <curso de [NOUN]>           45
01223   <materia de [NOUN]>         41
02336   <[NOUN] en Lugo>            35
07789   <estudar [NOUN]>            23
08121   <cursar [NOUN]>             13

Table 4: Extract of the vector associated with the Galician noun Bacharelato.

As Tables 3 and 4 show, the Galician noun "Bacharelato" co-occurs with many contexts that are translations of the contexts with which the Spanish noun "Bachillerato" also co-occurs. To compute the degree of similarity between two lemmas, w1 and w2, we use a version of the Dice coefficient:

\[ \mathrm{Dice}(w_1, w_2) = \frac{2 \sum_i \min\bigl(f(w_1, c_i), f(w_2, c_i)\bigr)}{f(w_1) + f(w_2)} \]

where f(w1, ci) is the number of co-occurrences between lemma w1 and context ci. As noted earlier, lemmas can be single-word or multiword. For each lemma of the source language (Spanish), we select the lemmas of the target language (Galician) with the highest Dice similarity, which makes them its possible translations. In our experiments, "Bacharelato" is the Galician lemma with the highest similarity to "Bachillerato".
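As a worked illustration, the coefficient can be computed on the seven vector positions shown in the two extracts. Two assumptions apply: f(w) is taken to be the sum of the lemma's counts over its contexts (the text does not define it), and only the seven listed positions are used, so the resulting value is not the similarity obtained in the actual experiments.

```python
def dice(v1, v2):
    """Dice-style similarity between two context vectors (dicts of counts).

    Assumption: f(w) is the sum of the counts in the lemma's vector.
    """
    shared = sum(min(v1[c], v2[c]) for c in v1.keys() & v2.keys())
    total = sum(v1.values()) + sum(v2.values())
    return 2 * shared / total if total else 0.0


# Counts copied from the extracts for "Bachillerato" and "Bacharelato",
# keyed by vector position.
bachillerato = {198: 123, 234: 218, 456: 69, 1223: 35,
                2336: 60, 7789: 98, 8121: 56}
bacharelato = {198: 78, 234: 145, 456: 45, 1223: 41,
               2336: 35, 7789: 23, 8121: 13}

similarity = dice(bachillerato, bacharelato)
print(round(similarity, 2))  # 2 * 374 / (659 + 380), i.e. about 0.72
```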

4. Experiments and evaluation

4.1. The comparable corpus

The comparable corpus consists of news from online daily and weekly newspapers published between late 2005 and late 2006. The Spanish corpus contains 13 million words of articles from La Voz de Galicia and El Correo Gallego. The Galician corpus contains 10 million words of articles from Galicia-Hoxe, Vieiros and A Nosa Terra. Most of the Galician texts follow the 2003 orthographic standard of the Real Academia Galega; corpora written in orthographies converging with Portuguese are left for other projects. The articles retrieved cover a broad range of topics: regional, national and international politics, culture, sport and communication.

4.2. The bilingual dictionary

The bilingual dictionary used to generate the seed contexts is the one used by the open-source machine translation system Opentrad, with the Apertium translation engine (Armentano-Oller et al., 2006), for the Spanish-Galician pair. Our experiments aim to update the dictionary, which currently contains about 30,000 entries, in order to improve the results of the Spanish-Galician translator deployed at La Voz de Galicia, the sixth most-read newspaper in Spain. This project was carried out in collaboration with the language engineering area of imaxin|software.

The number of bilingual contexts generated from the dictionary entries is 539,561. To these must be added the contexts generated by identifying cognates in the corpus that are not in the dictionary: 754,469. In total, we obtain 1,294,030 bilingual contexts. This number drops drastically after applying the filter that removes contexts behaving in a dispersed and asymmetric way in the comparable corpus. The final list contains 127,604 seed contexts.

4.3. Evaluation

Our evaluation protocol follows, in some respects, that of Melamed (1997), which was defined to evaluate a method for extracting lexicons from parallel corpora. The precision of the extracted lexicon is computed with respect to different levels of coverage. In our work, coverage is defined by relating the lexicon entries to their presence in the comparable corpus. Specifically, coverage is computed by summing the corpus frequencies of the lemmas that make up the extracted lexicon and dividing the result by the sum of the frequencies of all lemmas in the corpus. Coverage is computed separately for each of the grammatical categories under study: nouns, verbs and adjectives. It suffices to compute it using the lemmas and the corpus of the source language. Thus, we say that the extracted lexicon reaches a coverage level of 90% for nouns if, and only if, the nouns of the Spanish (source language) lexicon have a corpus frequency that reaches 90% of the frequency of all nouns in the same corpus.
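The coverage definition can be illustrated with a short Python sketch. The `coverage` function follows the definition in the text for a single grammatical category. The second function, `lexicon_at_coverage`, is an assumption added for illustration: it picks lemmas in decreasing order of frequency until the requested level is reached, a selection rule the text does not spell out. All names and counts are hypothetical.

```python
def coverage(lexicon, corpus_freq):
    """Share of corpus lemma occurrences accounted for by the lemmas in `lexicon`.

    corpus_freq: dict lemma -> frequency, for one grammatical category
    in the source-language corpus.
    """
    total = sum(corpus_freq.values())
    covered = sum(corpus_freq.get(lemma, 0) for lemma in set(lexicon))
    return covered / total if total else 0.0


def lexicon_at_coverage(corpus_freq, level):
    """Most frequent lemmas needed to reach `level` coverage (assumed rule)."""
    total = sum(corpus_freq.values())
    chosen, running = [], 0
    # Sort by descending frequency, breaking ties alphabetically.
    for lemma, freq in sorted(corpus_freq.items(), key=lambda kv: (-kv[1], kv[0])):
        if total and running / total >= level:
            break
        chosen.append(lemma)
        running += freq
    return chosen


# Invented noun frequencies, for illustration only.
noun_freq = {"gobierno": 50, "partido": 30, "ministro": 15, "bachillerato": 5}
top_nouns = lexicon_at_coverage(noun_freq, 0.9)
```

Under this rule, a lower coverage level keeps only the most frequent lemmas, which matches the shrinking lexicon sizes reported for the 90%, 80% and 50% levels.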

To compute precision, we fix a grammatical category and a coverage level of the lexicon, and randomly draw 150 test lemmas of that category. We actually compute two kinds of precision. Precision-1 is the number of times the top-ranked candidate translation is correct, divided by the number of test lemmas. Precision-10 is the number of correct candidates that appear in the list of the 10 most similar lemmas of each test lemma, divided by the number of test lemmas.
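A minimal sketch of the two precision measures follows. It assumes that precision-10 counts a test lemma as a hit when at least one correct translation appears among its 10 best candidates, which is one reading of the definition above. The function name, data layout and toy word pairs are hypothetical.

```python
def precision_at_k(ranked, gold, k):
    """Fraction of test lemmas with a correct translation among the top k candidates.

    ranked: dict lemma -> list of candidate translations, best first.
    gold:   dict lemma -> set of translations judged correct.
    """
    hits = sum(
        1
        for lemma, candidates in ranked.items()
        if any(c in gold.get(lemma, ()) for c in candidates[:k])
    )
    return hits / len(ranked) if ranked else 0.0


# Toy data, invented for illustration only.
ranked = {
    "bachillerato": ["bacharelato", "instituto"],
    "gobierno": ["goberno"],
    "partido": ["xogo", "partido"],
}
gold = {
    "bachillerato": {"bacharelato"},
    "gobierno": {"goberno"},
    "partido": {"partido"},
}
precision_1 = precision_at_k(ranked, gold, 1)
precision_10 = precision_at_k(ranked, gold, 10)
```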

Until now, the evaluation protocols of other methods for extracting bilingual lexicons from comparable corpora had not defined any kind of coverage. The only information given about the tested words or lemmas is their absolute frequency: words or lemmas with a frequency greater than N are tested, where N is usually ≥ 100 (Chiao and Zweigenbaum, 2002). The problem is that absolute frequencies, being entirely dependent on the size of the training corpus, are not useful for comparing the precision rates achieved by different methods. In our work, the notion of coverage level is intended to overcome this limitation.

4.4. Results

Table 5 shows the evaluation results. For each grammatical category, including multiword nouns, and for each coverage level (90%, 80% and 50%), we compute both kinds of precision.

For nouns, the three coverage levels of 90, 80 and 50 percent correspond to lexicons of 9,798, 3,534 and 597 nouns, respectively. The “Noun” category includes single-word and multiword proper nouns. Precision at the 90% level is relatively low (between 50 and 60 percent) because of the large number of proper nouns in the lexicon and the difficulty of finding the correct translation of a proper noun with the proposed method.1 Figure 1 illustrates how precision (1 and 10) evolves across the three coverage levels. At 80% coverage, precision is quite acceptable: between 80 and 90 percent. At this coverage level, the frequency of the evaluated entries is ≥ 129. This is therefore close to the level used in the evaluation of related work, where precision was computed for words with frequency ≥ 100. In that related work, however, precision rates are noticeably lower: around 72% in the best cases (Rapp, 1999). It should be noted that obtaining acceptable results only for frequent words or lemmas is not an insurmountable problem since, when working with comparable corpora, we can easily increase the size of the corpus and, with it, the number of lemmas that exceed the frequency threshold of 100. For example, by doubling the initial size of our corpus, we obtained one third more lemmas with a frequency above 100.

For adjectives and verbs, the disparity in the results stands out. While precision for verbs approaches 100% at 80% coverage, adjectives lie between 81 and 87 percent at the same level. The problems with adjectives stem mainly from the difficulty the part-of-speech tagger has in distinguishing adjectives from verbal participles. A lemma tagged as an adjective by the Spanish tagger may have its Galician translation tagged as a verb. As for coverage, at 80% the adjective lexicon consists of 639 lemmas and the verb lexicon of 401. The lexicons learned for these categories are therefore relatively small, but their size can and should grow as more comparable corpora are exploited.

1 We look for the translation of all kinds of proper nouns because the translator's bilingual dictionary needs this information. The Apertium 1.0 engine does not yet include an entity detector.

[Figure 1: Precision for nouns at 3 coverage levels. x-axis: coverage (90, 80, 50); y-axis: precision (40 to 100); series: precision-1 and precision-10.]

Category      Coverage  Precision-1  Precision-10  Lexicon size
Noun          90%       55%          60%           9798
Noun          80%       81%          90%           3534
Noun          50%       95%          99%           597
Adj           90%       61%          70%           1468
Adj           80%       81%          87%           639
Adj           50%       94%          98%           124
Verb          90%       92%          99%           745
Verb          80%       97%          100%          401
Verb          50%       100%         100%          86
N multi-word  50%       59%          62%           2013

Table 5: Evaluation results

Finally, we evaluated multiword nominal lemmas that are not proper nouns. Precision is around 60% at a coverage of 50% of the lexicon. The main problem with multiword lemmas is their low frequency in the corpus. The 2,013 lemmas evaluated at this coverage level start from relatively low frequencies, ≥ 40, which prevents satisfactory results. Even so, the results are noticeably better than those obtained in similar work on multiword terms (Fung and McKeown, 1997), which does not exceed 52% precision for small lexicons.2

5. Conclusions

Until now, work on extraction from comparable, non-parallel corpora has not been abundant. The main reason for this scarcity is undoubtedly the difficulty of achieving results good enough to build useful resources. The method proposed in this article yields results that, without reaching the precision rates of methods based on parallel corpora, make it clear that comparable corpora can be a very valuable source of lexicographic knowledge. There is still ample room to improve the results. Since comparable corpora grow daily with the remarkable growth of the Web, it would not be difficult to update and extend bilingual lexicons incrementally, taking into account, at each update, only those lemmas that together account for 80% of the total frequency in the source-language texts. This incremental updating of the lexicon is part of our ongoing work. In this way, we intend to extend and improve the bilingual dictionary of the Apertium translation system.

2 That said, the work of Fung and McKeown (1997) has the merit of extracting bilingual lexicons for two very different languages: English and Japanese.

References

Ahrenberg, Lars, Mikael Andersson, and Magnus Merkel. 1998. A simple hybrid aligner for generating lexical correspondences in parallel texts. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL’98), pages 29–35, Montreal.

Armentano-Oller, Carme, Rafael C. Carrasco, Antonio M. Corbí-Bellot, Mikel L. Forcada, Mireia Ginestí-Rosell, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Gema Ramírez-Sánchez, Felipe Sánchez-Martínez, and Miriam A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In Lecture Notes in Computer Science, 3960, pages 50–59.

Carreras, X., I. Chao, L. Padró, and M. Padró. 2004. An open-source suite of language analyzers. In 4th International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal.

Chiao, Y-C. and P. Zweigenbaum. 2002. Looking for candidate translational equivalents in specialized, comparable corpora. In 19th COLING’02.

Diab, Mona and Steve Finch. 2001. A statistical word-level translation model for comparable corpora. In Proceedings of the Conference on Content-Based Multimedia Information Access (RIAO).

Fung, Pascale. 1995. Compiling bilingual lexicon entries from a non-parallel English-Chinese corpus. In 14th Annual Meeting of Very Large Corpora, pages 173–183, Boston, Massachusetts.

Fung, Pascale and Kathleen McKeown. 1997. Finding terminology translation from non-parallel corpora. In 5th Annual Workshop on Very Large Corpora, pages 192–202, Hong Kong.

Fung, Pascale and Lo Yuen Yee. 1998. An IR approach for translating new words from nonparallel, comparable texts. In Coling’98, pages 414–420, Montreal, Canada.

Gamallo, Pablo. 2007. Learning bilingual lexicons from comparable English and Spanish corpora. In Machine Translation SUMMIT XI, Copenhagen, Denmark.

Gamallo, Pablo, Alexandre Agustini, and Gabriel Lopes. 2005. Clustering syntactic positions with similar semantic requirements. Computational Linguistics, 31(1):107–146.

Gamallo, Pablo and José Ramom Pichel. 2005. An approach to acquire word translations from non-parallel corpora. In 12th Portuguese Conference on Artificial Intelligence (EPIA’05), Évora, Portugal.

Grefenstette, Gregory. 1994. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, USA.

Harris, Z. 1985. Distributional structure. In J.J. Katz, editor, The Philosophy of Linguistics. New York: Oxford University Press, pages 26–47.

Kwong, Oi Yee, Benjamin K. Tsou, and Tom B. Lai. 2004. Alignment and extraction of bilingual legal terminology from context profiles. Terminology, 10(1):81–99.

Lin, Dekang. 1998. Automatic retrieval and clustering of similar words. In COLING-ACL’98, Montreal.

Melamed, Dan. 1997. A portable algorithm for mapping bitext correspondences. In 35th Conference of the Association of Computational Linguistics (ACL’97), pages 305–312, Madrid, Spain.

Rapp, Reinhard. 1995. Identifying word translations in non-parallel texts. In 33rd Conference of the ACL’95, pages 320–322.

Rapp, Reinhard. 1999. Automatic identification of word translations from unrelated English and German corpora. In ACL’99, pages 519–526.

Silva, J. F., G. Dias, S. Guilloré, and G. P. Lopes. 1999. Using LocalMaxs algorithm for the extraction of contiguous and non-contiguous multiword lexical units. In Progress in Artificial Intelligence. LNAI, Springer-Verlag, pages 113–132.

Tiedemann, Jörg. 1998. Extraction of translation equivalents from parallel corpora. In 11th Nordic Conference of Computational Linguistics, Copenhagen, Denmark.

Flexible statistical construction of bilingual dictionaries

Ismael Pascual Nieto Universidad Autónoma de Madrid

Escuela Politécnica Superior [email protected]

Mick O’Donnell Universidad Autónoma de Madrid

Escuela Politécnica Superior [email protected]

Resumen: La mayoría de los sistemas previos para construir un diccionario bilingüe a partir de un corpus paralelo dependen de un algoritmo iterativo, usando probabilidades de traducción de palabras para alinear palabras en el corpus y sus alineamientos para estimar probabilidades de traducción, repitiendo hasta la convergencia. Si bien este enfoque produce resultados razonables, es computacionalmente lento, limitando el tamaño del corpus que se puede analizar y el del diccionario producido. Nosotros proponemos una aproximación no iterativa para producir un diccionario bilingüe unidireccional que, si bien menos precisa que las aproximaciones iterativas, es mucho más rápida, permitiendo procesar córpora mayores en un tiempo razonable. Asimismo, permite una estimación en tiempo real de la probabilidad de traducción de un par de términos, lo que significa que permite obtener un diccionario de traducción con los n términos más frecuentes, y calcular las probabilidades de traducción de términos infrecuentes cuando se encuentren en documentos reales.
Palabras clave: diccionarios bilingües, modelos palabra-a-palabra, traducción automática estadística

Abstract: Most previous systems for constructing a bilingual dictionary from a parallel corpus have depended on an iterative algorithm, using word translation probabilities to align words in the corpus, and using word alignments to estimate word translation probabilities, and repeating until convergence. While this approach produces reasonable results, it is computationally slow, limiting the size of the corpus that can be analysed and the size of the dictionary produced. We propose a non-iterative approach for producing a uni-directional bilingual dictionary which, while less accurate than iterative approaches, is far quicker, allowing larger corpora to be processed in reasonable time. The approach also allows on-the-fly estimation of translation likelihoods between a pair of terms, meaning that a translation dictionary can be generated with the n most frequent terms in an initial pass, and the translation likelihood of infrequent terms can be calculated as encountered in real documents.
Keywords: bilingual dictionaries, word-to-word models, statistical machine translation

1 Introduction

Over the last 17 years, statistical models have been used to construct bilingual dictionaries from parallel corpora, with the goal of using the dictionaries for tasks such as Machine Translation or Cross-Lingual Information Retrieval.

Most of this work has involved an iterative method to construct the dictionary, which starts with an initial estimate of word translation probability, uses these probabilities to align the words of the corpus, and then uses the word alignments to re-estimate word translation probability. This approach cycles until convergence. Followers of this approach include Brown et al. (1990), Kay and Röscheisen (1993), Hiemstra (1996), Melamed (1997), Renders et al. (2003) and Tufis (2004).

However, the iterative approach is expensive in computing time, requiring extensive calculations on each iteration. Due to memory limitations, these approaches usually restrict consideration to the n most frequent terms in each language.

In this paper, we propose a non-iterative approach to building a uni-directional translation dictionary. While our approach initially produces dictionaries with lower precision, this should be seen in relation to the reduced time needed to build the dictionary. Additionally, our approach supports on-the-fly calculation of the translation suitability between a pair of words. When aligning words in two sentences and less frequent words are encountered, an estimate of the translation likelihood can be derived on the spot, avoiding the need to pre-calculate all possible translation likelihoods between the 76,000 unique terms in our English corpus and the 130,000 unique terms in our Spanish corpus.

The paper is organized as follows: Section 2 discusses the most representative iterative approaches. Sections 3 and 4 describe our corpus and how it is compiled into a word lookup table. Section 5 describes the derivation of our translation dictionaries. Section 6 evaluates the precision and recall of each of our models. Section 7 presents our conclusions.

2 Iterative Approaches

The first published work outlining the construction of bilingual dictionaries using statistical methods was Brown et al. (1990)1 at IBM. They used 40,000 aligned sentences from the Canadian HANSARDs corpus (parliament transcripts in English and French).

In their approach, the translation probability between any pair of words is initially set as equi-probable, as are the probabilities of each relative sentence position of a word and its translation. These probabilities are then used to estimate the probability of each possible alignment of the words in each sentence-pair. These probability-weighted alignments are then used to re-estimate the word-translation probabilities as well as the relative position probabilities. This approach cycles until convergence occurs. They used the Expectation Maximization (EM) algorithm.

Subsequent investigators found the IBM approach too computationally complex (requiring iterative re-estimation of 81 million parameters), and the approach did not scale up to larger parallel corpora. Various approaches were tried to improve the performance.

Hiemstra (1996) attempted to reduce complexity using a modified version of the EM algorithm. While the goal of the IBM work was a unidirectional dictionary, Hiemstra aimed to compile a bi-directional dictionary. Hiemstra claimed that the use of bidirectional dictionaries not only reduces the space needed for dictionary storage, but leads to better estimates of translation probabilities2. His results improve on those of IBM.

1 The first work of IBM on this was in 1988, but it was quite preliminary.

Melamed (1997) proposed an alternative approach, which, while still iterative, required the estimation of fewer parameters. Like the IBM team, he used the HANSARDs corpus, although using 300,000 aligned sentences. He reports 90% precision in real domains.

A key concept in these models is the term co-occurrence: two tokens u and v co-occur if u appears in one part of an aligned sentence pair and v appears in the other part.

In Melamed’s model, co-occurrence is estimated through likelihood ratios, L(u,v), each of which represents the likelihood that u and v are mutual translations. The process estimating these ratios is as follows:

1) Provide an initial estimate of L(u,v) using their co-occurrence frequencies.

2) Use the estimate of L(u,v) to align the words in the matched sentences of the parallel corpus.

3) Build a new estimate of L(u,v) using the word alignments from step (2).

4) Repeat steps (2) and (3) until convergence occurs (no or little change on each cycle).

Melamed aligns the terms in matched sentences using a competitive linking algorithm, which basically sorts the L(u,v) values in descending order and, taking these values in turn, links the u and v terms in aligned sentences. Linked terms are then disqualified from taking part in any other link.

This process also keeps count of the number of links made between each u, v pair, and these counts are used to re-estimate L(u,v).
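The competitive linking step described above can be sketched as a greedy one-to-one matching over a single aligned sentence pair. This is an illustrative reading of the description, not Melamed's implementation: the function name, the tie-breaking rule and the data layout are assumptions. In the example, the scores 125.50 and 9.41 are taken from Table 1 later in the paper, while 50.0 is invented.

```python
def competitive_linking(src_tokens, tgt_tokens, scores):
    """Greedy one-to-one linking of tokens in one aligned sentence pair.

    scores: dict (source_term, target_term) -> L(u,v) likelihood value.
    Returns the list of linked (source_term, target_term) pairs.
    """
    # Collect every scored token pair, best score first
    # (ties broken by token position, an arbitrary choice).
    candidates = sorted(
        (
            (scores[(u, v)], i, j)
            for i, u in enumerate(src_tokens)
            for j, v in enumerate(tgt_tokens)
            if (u, v) in scores
        ),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_src, used_tgt, links = set(), set(), []
    for _, i, j in candidates:
        # A token that is already linked cannot take part in another link.
        if i not in used_src and j not in used_tgt:
            links.append((src_tokens[i], tgt_tokens[j]))
            used_src.add(i)
            used_tgt.add(j)
    return links


scores = {
    ("absolutely", "absolutamente"): 125.50,
    ("absolutely", "esencial"): 9.41,
    ("essential", "esencial"): 50.0,  # invented value
}
links = competitive_linking(
    ["absolutely", "essential"], ["absolutamente", "esencial"], scores
)
```

Counting how often each (u, v) pair ends up linked across the whole corpus would then give the counts used to re-estimate L(u,v) in step 3.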

3 Our corpus

We used the EUROPARL corpus (Koehn, 2005), consisting of transcripts of sessions of the European Parliament between 1996 and 2003. Each transcript is provided in 11 languages. These transcripts are generally constructed by translators, as each speaker speaks in their native language. We used only the English and Spanish sections of the corpus.

2 This reference is not in the reference list.

Ismael Pascual Nieto y Michael O'Donnell


The corpus does not come in sentence aligned form, although each transcript is organised into speaker turns. We wrote software to align the sentences within each speaker turn, based on sequence in the turn, and also on approximate correspondence in number of words, similar to the approach of Gale and Church (1993). Sentences which could not be aligned were discarded. This gave us 730,191 correctly aligned sentences, roughly 20 million words in each language.

4 Compiling a Word Occurrence Index

One of our goals was to allow rapid calculation of translation likelihood between any two terms on the fly. This would not be possible if the entire 40 million word corpus had to be processed each time.

To alleviate this problem, we re-compiled the corpus into an index like those used by web search engines: a file is created for each unique token, detailing each occurrence of the token: the file-id (2 bytes) and sentence-id (2 bytes) of the hit, the position of the token within the sentence (1 byte), and the number of terms in the sentence (1 byte).

Once the index is compiled, it is possible to derive various statistics rapidly. The frequency of a token can be calculated quickly by dividing the file size by 6 (the record size). The relative co-occurrence of an English and Spanish term can be calculated solely by comparing the index files for those two terms. This allows us to calculate the relative co-occurrence between two terms on the fly, if we need to, rather than having to process the entire corpus to find such a result.

Kay and Röscheisen (1993) also build a word lookup index, but only store the sentence id.
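The index layout can be illustrated with Python's `struct` module. The sketch below assumes a little-endian 6-byte record (2-byte file-id, 2-byte sentence-id, 1-byte position, 1-byte sentence length) and works on in-memory bytes rather than on one file per token. The byte order, the function names and the sample hits are assumptions made for illustration.

```python
import struct

# One hit record: file-id, sentence-id, token position, sentence length.
RECORD = struct.Struct("<HHBB")  # 2 + 2 + 1 + 1 = 6 bytes


def pack_hits(hits):
    """Serialise a list of (file_id, sentence_id, position, length) tuples."""
    return b"".join(RECORD.pack(*hit) for hit in hits)


def frequency(index_bytes):
    """Token frequency = index size divided by the record size."""
    return len(index_bytes) // RECORD.size


def read_hits(index_bytes):
    """Decode all hit records stored in an index."""
    return list(RECORD.iter_unpack(index_bytes))


index = pack_hits([(1, 10, 3, 12), (1, 42, 0, 7)])
token_frequency = frequency(index)
```

With one such index per token, the frequency of a term is available from the index size alone, and co-occurrence can be found by comparing the (file-id, sentence-id) fields of two indexes.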

5 Compiling the Bilingual Dictionary

Melamed uses word co-occurrence scores only as an initial estimate of translation suitability. For our purposes, we have found that this initial estimate, if handled properly, provides adequate accuracy for many tasks, without the expense of iteratively recalculating translation probabilities through a word alignment process. Our likelihood formula is similar to Melamed’s, although modified to allow our method to work on the fly.

Melamed’s initial estimate of the translation likelihood of a source term u as a target term v is the ratio of the joint probability of u and v to the product of the marginal probabilities of u and v, as can be seen in equation 1.

$$L(u,v) = \frac{P(u,v)}{P(u)\,P(v)} \tag{1}$$

Basically, if u and v are not related, this ratio should approach 1.0. The stronger the co-occurrence between u and v, the higher the L value. Substituting in estimates for the probabilities, the formula can be re-expressed as equation 2:

$$L(u,v) = \frac{n(u,v)\,N}{n(u)\,n(v)} \tag{2}$$

where n(u,v) is the co-occurrence frequency of u and v, N is the total number of co-occurrences and n(u) is the marginal frequency of u, calculated as shown in equation 3:

$$n(u) = \sum_{v} n(u,v) \tag{3}$$

5.1 Our Basic model

The inclusion of n(u) and n(v) in Melamed’s formula basically requires all values for all u and v to be calculated at the same time, which means one must decide beforehand which terms will be included in the process. This excludes the calculation of likelihood values for other terms encountered while processing text, which is one of our goals.

We thus use a modified formula which can calculate the translation likelihood between a given u and a given v independently of other terms. Rather than asking what percent of all co-occurrences involve u and v, we ask what percent of sentence pairs contain u and v. In our approach, P(u,v) represents the probability that u occurs in a source sentence while v appears in a target sentence. P(u) is the probability that u will appear in a source sentence, and P(v) is the probability that v will appear in the target sentence.

The important point here is that we can now estimate L(u,v) solely by looking at occurrences of a given u and v, without needing to consider the whole range of possible u/v co-occurrences.

A second change from Melamed’s approach is that we desire a unidirectional dictionary. For this reason, we instead use formula 4:


$$L(v|u) = \frac{P(v|u)}{P(v)} \tag{4}$$

where P(v|u) is the probability of encountering v in a target sentence if u is in the source sentence, and P(v) is the probability of encountering v in a target sentence. As with Melamed’s formula, if u and v are unrelated, the L value will approach 1.0, and higher values indicate a relation between them. A value of 2.0 indicates that v is twice as likely to occur if u is in the corresponding sentence.

Given this simplification, we can calculate L(v|u) as follows:

$$P(v|u) = \frac{n_S(u,v)}{n_S(u)} \tag{5}$$

$$P(v) = \frac{n_S(v)}{S} \tag{6}$$

$$L(v|u) = \frac{n_S(u,v) / n_S(u)}{n_S(v) / S} \tag{7}$$

$$L(v|u) = \frac{n_S(u,v)\,S}{n_S(u)\,n_S(v)} \tag{8}$$

where n_S(u,v) is the count of sentence-pairs containing both u and v, n_S(u) is the count of sentence-pairs in which the source sentence contains u, n_S(v) is the count of sentence-pairs in which the target sentence contains v, and S is the total sentence count.

We make one further simplification to allow faster calculation. Because only a small percentage of sentences contain the same word more than once, the frequency of a word, n_w(u), will in general be quite close to n_S(u). Similarly, n_w(v) will approximate n_S(v). We thus use n_w(u) and n_w(v) in place of n_S(u) and n_S(v).

The advantage of this approach is that the frequency of each term is readily available: the size of the index file for the term divided by the record length.

We also choose to use n(u,v) to estimate n_S(u,v) and thus count the co-occurrences of u and v in sentence pairs. This statistic can be derived by scanning through the hit files for u and v, counting cases where the terms appear in the same sentence pair.
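Putting the pieces of this section together, the on-the-fly likelihood of equation 8 can be sketched as follows. The sketch assumes that each term's hits are available as a list of (file-id, sentence-id) pairs, that co-occurrence is counted once per distinct sentence pair, and that word frequency stands in for the sentence counts as described above. The function name and the sample hits are hypothetical.

```python
def likelihood(hits_u, hits_v, total_pairs):
    """L(v|u) = n(u,v) * S / (n_w(u) * n_w(v)), following equation 8.

    hits_u, hits_v: lists of (file_id, sentence_id) records, one per occurrence.
    total_pairs:    S, the total number of aligned sentence pairs.
    """
    # Co-occurrence: sentence pairs in which both terms appear.
    n_uv = len(set(hits_u) & set(hits_v))
    # Word frequencies are used in place of sentence counts.
    n_u, n_v = len(hits_u), len(hits_v)
    if n_u == 0 or n_v == 0:
        return 0.0
    return n_uv * total_pairs / (n_u * n_v)


# Invented hit lists, for illustration only.
hits_u = [(1, 1), (1, 2), (2, 5)]
hits_v = [(1, 1), (2, 5), (3, 3), (3, 4)]
value = likelihood(hits_u, hits_v, total_pairs=100)
```

Only the two hit lists are needed, so the value for any pair of terms can be computed without touching the rest of the corpus.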

For efficiency reasons, we initially compute the values of L(u,v) for the 5000 most frequent tokens in English and Spanish. Any value less than 2.0 is dropped.

We heuristically translate this co-occurrence metric to a translation probability by assuming that the probability of u being translated as v is proportional to the size of the L value. Thus, for each English term u, we collect all the Spanish terms v which were not eliminated, and sum their L values, and divide each by the sum, using this as the translation probability of the term.
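The thresholding and normalisation steps can be sketched as below. The function name and the default threshold argument are assumptions. In the example, the first two L values come from Table 1 and the third is an invented value below the 2.0 cut-off. Because only two candidates survive here, the resulting probabilities differ from those in Table 1, which were normalised over 25 candidates.

```python
def to_probabilities(l_values, threshold=2.0):
    """Turn the L values of one source term into translation probabilities.

    l_values: dict target_term -> L value for a single source term.
    Values below `threshold` are dropped; the rest are divided by their sum.
    """
    kept = {term: value for term, value in l_values.items() if value >= threshold}
    total = sum(kept.values())
    return {term: value / total for term, value in kept.items()}


candidates = {"absolutamente": 125.50, "absoluta": 26.67, "que": 1.10}
probabilities = to_probabilities(candidates)
```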

Table 1 shows the highest 9 alternatives for absolutely (another 16 were included in the list). Several of the Spanish terms (shown in italic) are present due to intra-language collocation between absolutely and essential, indispensable or crucial (the indirect association problem mentioned by Melamed). Removing these entries will be discussed below.

English     Spanish         L(v|u)   Prob
absolutely  absolutamente   125.50   0.33
absolutely  absoluta         26.67   0.07
absolutely  imprescindible   19.75   0.05
absolutely  absoluto         19.18   0.05
absolutely  indispensable    16.08   0.04
absolutely  crucial          10.84   0.03
absolutely  totalmente       10.77   0.03
absolutely  esencial          9.41   0.03
absolutely  increíble         9.29   0.03

Table 1: Translation dictionary alternatives

5.2 Adjusted model

A problem arises with the above formula when a term v nearly always occurs with term u. If this is the case, P(v|u) will approach P(v), and the L value will approach 1.0.

For this reason, we introduced the slightly modified formula 9 for likelihood, which instead contrasts those cases where v occurs with u against those cases where v occurs without u:

$$L(v|u) = \frac{P(v|u)}{P(v|\neg u)} \tag{9}$$

This basically magnifies the likelihood values, as previously the denominator was diluted by cases where u and v co-occur. However, the same interpretation is still valid: if u and v are not related, the ratio will approach 1.0, while the stronger the correlation, the higher the likelihood value.
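The effect of the adjusted formula can be illustrated numerically. The sketch below assumes that the probability of v without u is estimated as (n_S(v) − n_S(u,v)) / (S − n_S(u)). This estimator is our reading of "cases where v occurs without u" and is not given explicitly in the text. Function names and counts are hypothetical.

```python
def basic_likelihood(n_uv, n_u, n_v, total_pairs):
    """Equation 8: P(v|u) / P(v)."""
    return n_uv * total_pairs / (n_u * n_v)


def adjusted_likelihood(n_uv, n_u, n_v, total_pairs):
    """Equation 9: P(v|u) / P(v|not u), with an assumed estimator for the denominator."""
    if n_u == 0 or total_pairs == n_u:
        return 0.0
    p_v_given_u = n_uv / n_u
    # Sentence pairs containing v but not u, over sentence pairs without u.
    p_v_given_not_u = (n_v - n_uv) / (total_pairs - n_u)
    if p_v_given_not_u == 0:
        return float("inf")
    return p_v_given_u / p_v_given_not_u


# Invented counts: v appears 10 times, 8 of them together with u.
basic = basic_likelihood(8, 10, 10, 1000)
adjusted = adjusted_likelihood(8, 10, 10, 1000)
```

With these counts the adjusted value is several times larger than the basic one, which illustrates the magnifying effect described above.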

5.3 Using relative distance

Looking at translations between European languages, it is easy to see that a source term tends to appear at a similar relative position in its sentence as its translation does in the target sentence.

The probabilistic model of Brown et al. (1990) takes into account that a term in position i in a source sentence will translate as a term in position j in the target sentence with a given probability, conditioned by the length of the two sentences (l and m). These calculations however depend on an iterative method, which we are avoiding. It also requires large amounts of data to obtain realistic estimates for possible values of i, j, l and m.

We thus proposed a simple heuristic to account for the relative position between two terms. We penalise word co-occurrences in relation to the relative distance between the words in their respective sentences. Firstly, given that the source and target sentences may vary in length, we normalise the position of the term in the sentence by dividing its position by the length of the sentence. The relative distance (dR) between the terms can then be calculated as follows:

$$d_R(i,j) = \left| \frac{i}{l} - \frac{j}{m} \right| \tag{10}$$

The closer this value is to 0.0 (no relative distance), the more likely that the terms are translations of each other.

When calculating the co-occurrence of a source and target term, rather than just counting 1 each time the terms appear in the same sentence-pair, we discount the increment by subtracting the relative distance between terms, e.g.

$$n_S(u,v) = \sum_{s \in S_p;\; u,v \in s} \Bigl(1 - d_R\bigl(\mathrm{pos}(u), \mathrm{pos}(v)\bigr)\Bigr) \tag{11}$$

where Sp is the set of aligned sentence pairs and pos(u) is the absolute position of the term u in the corresponding part of an aligned sentence pair.

Basically, the further the two terms are away from each other, the less it counts as a viable co-occurrence. This heuristic step improves our results, and the calculation is far simpler than that used in the IBM work.
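The distance-discounted count of equation 11 can be sketched as follows. It assumes, for simplicity, that each term occurs at most once per sentence pair, and that each hit records the token position and the sentence length, as in the index of Section 4. Function names and sample data are hypothetical.

```python
def relative_distance(i, l, j, m):
    """Equation 10: |i/l - j/m| for positions i, j in sentences of length l, m."""
    return abs(i / l - j / m)


def discounted_cooccurrence(hits_u, hits_v):
    """Equation 11: sum of (1 - relative distance) over shared sentence pairs.

    hits_u, hits_v: dict (file_id, sentence_id) -> (position, sentence_length).
    """
    total = 0.0
    for key, (i, l) in hits_u.items():
        if key in hits_v:
            j, m = hits_v[key]
            total += 1.0 - relative_distance(i, l, j, m)
    return total


# Invented hits: same relative position in one pair, far apart in the other.
hits_u = {(1, 1): (0, 10), (1, 2): (5, 10)}
hits_v = {(1, 1): (0, 12), (1, 2): (0, 10)}
count = discounted_cooccurrence(hits_u, hits_v)
```

The first sentence pair contributes a full count of 1, while the second contributes only 0.5 because the two terms are half a sentence apart.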

6 Evaluation

Using the above methods, we produced four translation dictionaries, using both the basic and adjusted model, both with and without the distance metric.

We then evaluated the quality of these dictionaries against a gold-standard, G, a handcrafted dictionary of 50 terms with human-judged translations. The terms were taken from random positions throughout the word frequency list, and covering a range of syntactic classes.

We then used G to evaluate each of the four dictionaries. In terms of precision, for each English term in G, we collected the correct translations included in our dictionary, and summed their probability estimates. We then averaged the precision over the 50 terms in G. Results for the 4 models are shown in Figure 1.
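This precision measure can be sketched as below. One detail is an assumption: when only the top n candidates are kept, the sketch renormalises their probabilities so that they sum to 1, which the text does not state. The function name, the data layout and the toy entries are hypothetical.

```python
def dictionary_precision(dictionary, gold, top_n=None):
    """Average probability mass assigned to correct translations.

    dictionary: dict term -> list of (translation, probability), best first.
    gold:       dict term -> set of correct translations.
    top_n:      keep only the n best candidates (None keeps all).
    """
    scores = []
    for term, correct in gold.items():
        candidates = dictionary.get(term, [])[:top_n]
        mass = sum(prob for _, prob in candidates)
        good = sum(prob for trans, prob in candidates if trans in correct)
        # Renormalise over the candidates kept (assumption).
        scores.append(good / mass if mass else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy dictionary and gold standard, invented for illustration.
dictionary = {"absolutely": [("absolutamente", 0.6), ("absoluta", 0.3), ("esencial", 0.1)]}
gold = {"absolutely": {"absolutamente", "totalmente"}}
precision_all = dictionary_precision(dictionary, gold)
precision_top1 = dictionary_precision(dictionary, gold, top_n=1)
```

Restricting the list to the best candidates raises this score whenever the correct translations are ranked first, which is the trade-off discussed in the next paragraph.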

Our basic dictionary contains up to 25 translation candidates for each source term, with the higher ones being more probable. This list is good for some applications (e.g., word alignment), but produces poor precision (69.96% in the best case). Where precision is important, e.g., for machine translation, we can restrict the number of translation candidates. We achieve 91.94% precision if we just consider the top two candidates.

[Figure 1: bar chart of precision (%) for each model, by number of translation candidates considered]

Model            Top 2   Top 3   Top 10  Top 25
Basic            89.44   82.93   67.56   55.40
Basic + Dist     89.85   84.49   69.68   58.14
Adjusted         91.61   87.33   77.12   67.59
Adjusted + Dist  91.94   88.67   78.86   69.96

Figure 1: Precision results for the four models

We calculate the recall of a dictionary entry as the percentage of all the correct translations of a term which are in our dictionary. The global recall is then taken as the average over all 50 words. Figure 2 shows our results, again with various levels of cut-off. Our best result was 68.44%, which is quite good considering that many of the translations in the gold standard were not used in the corpus.
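The recall measure can be sketched in the same style as the precision measure. The function name, data layout and toy entries are hypothetical.

```python
def dictionary_recall(dictionary, gold, top_n=None):
    """Average share of gold translations found among the dictionary candidates.

    dictionary: dict term -> list of (translation, probability), best first.
    gold:       dict term -> set of correct translations.
    """
    scores = []
    for term, correct in gold.items():
        found = {trans for trans, _ in dictionary.get(term, [])[:top_n]}
        scores.append(len(found & correct) / len(correct) if correct else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: one of the two gold translations is in the dictionary.
dictionary = {"absolutely": [("absolutamente", 0.6), ("absoluta", 0.3), ("esencial", 0.1)]}
gold = {"absolutely": {"absolutamente", "totalmente"}}
recall = dictionary_recall(dictionary, gold)
```

A gold translation that never occurs in the corpus, such as the second one in this toy entry, cannot be recovered by any model, which is why recall is bounded in the way the paragraph above describes.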

[Figure 2: bar chart of recall (%) for each model at four cut-off levels]

Model          Cut-off 1  Cut-off 2  Cut-off 3  Cut-off 4
Basic          25.56      29.92      49.99      67.17
Adjusted       25.56      29.92      49.99      67.17
Basic + Dist   25.84      30.61      50.77      68.44
Adj. + Dist.   25.84      30.61      50.77      68.44

Figure 2: Recall results for the four models

It is clear that including more terms in our dictionary increases recall at the expense of precision. The choice of how many terms to include depends on the application, whether precision or recall is more important.

In terms of assessing which of our 4 models is best, it is clear that the adjusted formula and the inclusion of distance penalties both improve precision, and the distance metric improves recall. Our best model is thus the adjusted+distance one.

6.1 Removing Indirect Associations

One of Melamed’s main reasons for taking an iterative approach is to remove false translations due to collocations between source terms. For instance, English absolutely is frequently followed by essential, and for this reason, absolutely has strong co-occurrence with words which translate essential.

Melamed only uses co-occurrence values as the basis for aligning words in sentences, and the aligned words are then used to re-estimate word translation probabilities. Since the true translation of a word will generally have a higher co-occurrence value than the false translations, the collocation-induced mappings will be dropped from the data.

One of the prime uses of our translation dictionary is to support word alignment. When used for this purpose, the presence of indirect associations in our dictionary is generally not a problem, because the term with a direct association will be the preferred alignment choice.

However, when using our dictionary for other tasks, such as automatic sentence translation, the indirect associations will be a problem.

For this reason, we have developed a method to remove indirect associations from our dictionary, one that does not require the expensive step of word-aligning the entire corpus. We first derive collocation values between words of the same language. We then pass through our translation dictionary and, whenever a translation of a term is also the translation of a collocate of the term, we recalculate the co-occurrence value using only those cases where the collocate is not present.
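The correction step can be sketched as below. This is only an illustration under stated assumptions: co-occurrence is counted at sentence-pair level, the collocate lists are given rather than derived, all names and the toy corpus are hypothetical, and keeping the lowest count when several collocates share a translation is a choice made for the sketch, not something the paper specifies.

```python
def cooccurrence(corpus, u, v, exclude=None):
    """Count sentence pairs where source word u and target word v both occur.

    If `exclude` is given, pairs whose source side contains it are skipped.
    """
    count = 0
    for source, target in corpus:
        if exclude is not None and exclude in source:
            continue
        if u in source and v in target:
            count += 1
    return count


def corrected_counts(corpus, translations, collocates):
    """Recompute co-occurrence for translations shared with a collocate.

    translations: source word -> set of candidate target words
    collocates:   source word -> set of same-language collocates
    Returns {(u, v): count}.
    """
    result = {}
    for u, candidates in translations.items():
        for v in candidates:
            count = cooccurrence(corpus, u, v)
            for c in collocates.get(u, ()):
                if v in translations.get(c, ()):
                    # v also translates the collocate c: recount without c.
                    count = min(count, cooccurrence(corpus, u, v, exclude=c))
            result[(u, v)] = count
    return result


# Toy English-Spanish corpus (hypothetical).
corpus = [
    (["absolutely", "essential"], ["absolutamente", "esencial"]),
    (["absolutely", "essential"], ["absolutamente", "esencial"]),
    (["absolutely", "right"], ["absolutamente", "correcto"]),
    (["essential", "step"], ["paso", "esencial"]),
]
translations = {
    "absolutely": {"absolutamente", "esencial"},
    "essential": {"esencial"},
}
collocates = {"absolutely": {"essential"}}
print(corrected_counts(corpus, translations, collocates))
```

In the toy corpus the indirect pair (absolutely, esencial) loses all its support once sentences containing essential are set aside, while the direct pair (absolutely, absolutamente) is unaffected.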

We applied this process as a post-operation on the translation dictionaries produced earlier. Looking only at the adjusted+distance model with 25 translations, removing indirect associations increased precision from 69.96% to 74.85%, a significant increase. Recall also rose from 68.44% to 69.80%. See Figures 3 and 4.

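[Figure 3: bar chart. Vertical axis: Precision (%). Horizontal axis: Model. Series: number of words considered (Top 2, Top 3, Top 10, Top 25).]

Model                   Top 2   Top 3   Top 10  Top 25
Adjusted + Dist         91.94   88.67   78.86   69.96
Adj. + Dist Corrected   93.87   91.22   81.52   72.73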

Figure 3: Adjusted+distance model with and without collocation correction: Precision



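[Figure 4: bar chart. Vertical axis: Recall (%). Horizontal axis: Model. Series: number of words considered (Top 2, Top 3, Top 10, Top 25).]

Model                    Top 2   Top 3   Top 10  Top 25
Adj. + Dist.             25.84   30.61   50.77   68.44
Adj. + Dist. Corrected   25.48   31.25   51.71   69.80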


Figure 4: Adjusted+distance model with and without collocation correction: Recall.

7 Conclusions and future work

In this paper, we proposed an approach to building bilingual dictionaries from a parallel corpus which avoids the computational complexity of the iterative approaches. The approach allows the translation likelihood of a pair of words to be calculated without considering other words at the same time, as Melamed’s approach requires. This makes the approach suitable for on-the-fly estimation of the translation likelihood of a pair of words encountered during tasks such as aligning words in parallel sentences.

To avoid the problem of indirect association, we propose a method to eliminate such effects from the likelihood table without needing to word-align the corpus.

While our levels of precision and recall are not as high as those of the iterative approaches, the speed and flexibility of our approach make it a viable candidate for cases where computation time is an issue, or where larger dictionaries must be built in realistic timeframes.

In terms of the various models we have experimented with, we found that our adjusted model, using P(v|u)/P(v|¬u), gave higher precision than the more pure likelihood measure: P(v|u)/P(v). Also, including distance penalties improved both approaches.
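The difference between the two measures can be seen in a small worked sketch. It assumes that the probabilities are estimated from sentence-pair counts, which is an assumption of this example (the estimation details are given earlier in the paper); the function name, argument names and the counts are hypothetical.

```python
def likelihood_ratios(n_pairs, n_u, n_v, n_uv):
    """Return (pure, adjusted) association scores for a word pair (u, v).

    n_pairs: number of aligned sentence pairs
    n_u:     pairs whose source side contains u
    n_v:     pairs whose target side contains v
    n_uv:    pairs containing both u and v
    """
    p_v_given_u = n_uv / n_u
    p_v = n_v / n_pairs
    pure = p_v_given_u / p_v                      # P(v|u) / P(v)

    # P(v|not u): occurrences of v in the pairs that do not contain u.
    rest = n_pairs - n_u
    v_without_u = n_v - n_uv
    if rest == 0 or v_without_u == 0:
        adjusted = float("inf")                   # denominator undefined or zero
    else:
        adjusted = p_v_given_u / (v_without_u / rest)   # P(v|u) / P(v|not u)
    return pure, adjusted


# Hypothetical counts: u occurs in 50 of 1000 pairs, v in 60, both in 40.
pure, adjusted = likelihood_ratios(n_pairs=1000, n_u=50, n_v=60, n_uv=40)
print(round(pure, 2), round(adjusted, 2))
```

Because P(v) still includes the occurrences of v that appear together with u, the pure ratio is damped for strongly associated pairs; the adjusted ratio divides by the frequency of v outside the context of u and so separates such pairs more sharply.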

References

Brown, P.F., J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer and P. S. Roossin. 1990. A statistical approach to Machine Translation. Computational Linguistics, 16(2):79–85.

Gale, W.A. and K.W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1):75–102.

Hiemstra, D. 1996. Using statistical methods to create a bilingual dictionary. Master Thesis. University of Twente.

Kay, M. and M. Röscheisen. 1993. Text-Translation Alignment. Computational Linguistics, 19(1):121–142.

Koehn, P. 2005. Europarl: A parallel corpus for Statistical Machine Translation. In Proceedings of the 10th Machine Translation Summit, Phuket, Thailand, pp. 79–86.

Melamed, I.D. 1997. A word-to-word model of translational equivalence. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, Madrid, Spain, pp. 490–497.

Renders, J.-M., H. Déjean and É. Gaussier. 2003. Assessing automatically extracted bilingual lexicons for CLIR in vertical domains. Lecture Notes in Computer Science 2785, C. Peters, M. Braschler, J. Gonzalo and M. Kluck Editors, Springer-Verlag: Berlin, pp. 363–371.

Tufis, D., A.M. Barbu and R. Ion. 2004. Extracting multilingual lexicons from parallel corpora. Computers and the Humanities, 38(2):163–189.



Training Part-of-Speech Taggers to build Machine Translation Systems for Less-Resourced Language Pairs

Felipe Sánchez-Martínez, Carme Armentano-Oller, Juan Antonio Pérez-Ortiz, Mikel L. Forcada

Transducens Group
Departament de Llenguatges i Sistemes Informàtics
Universitat d’Alacant
E-03071 Alacant, Spain

{fsanchez,carmentano,japerez,mlf}@dlsi.ua.es

Summary: This paper reviews the use of an unsupervised method for obtaining part-of-speech taggers to be used within the open-source machine translation (MT) engine Apertium. The method uses the remaining modules of the MT system and a language model of the target language to obtain part-of-speech taggers that are then used within the Apertium MT platform to translate. Experiments with the Occitan–Catalan language pair (a case study for minoritized language pairs with few resources) show that the amount of corpus needed for training is small compared with the corpus sizes usually used with other unsupervised training methods such as the Baum-Welch algorithm. This makes the method especially appropriate for obtaining part-of-speech taggers for use in MT between minoritized language pairs. Moreover, the translation quality of the MT system that uses the resulting part-of-speech tagger is comparatively better.
Keywords: machine translation, minoritized languages, part-of-speech tagging, hidden Markov models

Abstract: In this paper we review an unsupervised method that can be used to train the hidden-Markov-model-based part-of-speech taggers used within the open-source shallow-transfer machine translation (MT) engine Apertium. This method uses the remaining modules of the MT engine and a target language model to obtain part-of-speech taggers that are then used within the Apertium MT engine in order to produce translations. The experimental results on the Occitan–Catalan language pair (a case study of a less-resourced language pair) show that the amount of corpora needed by this training method is small compared with the usual corpus sizes needed by the standard (unsupervised) Baum-Welch algorithm. This makes the method appropriate to train part-of-speech taggers to be used in MT for less-resourced language pairs. Moreover, the translation performance of the MT system embedding the resulting part-of-speech tagger is comparatively better.
Keywords: machine translation, less-resourced languages, part-of-speech tagging, hidden Markov models

1 Introduction

The growing availability of machine-readable (monolingual and parallel) corpora has given rise to the development of real applications such as corpus-based machine translation (MT). However, when MT involves less-resourced language pairs, such as Occitan–Catalan (see below), the amount of monolingual or parallel corpora, if available, is not enough to build a general-purpose open-domain MT system (Forcada, 2006). In these cases the only realistic approach to attain high performance in general translation is to follow a rule-based approach, but at the expense of the large costs needed for building the necessary linguistic resources (Arnold, 2003).



In this paper we focus on the training of the hidden Markov model (HMM)-based part-of-speech taggers used by a particular open-source Occitan–Catalan MT system (Armentano-Oller and Forcada, 2006), which has been built using Apertium, an open-source platform for building MT systems (see section 2). Occitan–Catalan is an interesting example of a less-resourced language pair. HMMs are a common statistical approach to part-of-speech tagging, but they usually demand large corpora, which are seldom available for less-resourced languages.

Catalan is a Romance language spoken by around 6 million people, mainly in Spain (where it is co-official in some regions), but also in Andorra (where it is the official language), in parts of Southern France and in the Sardinian city of l’Alguer (Alghero).

Occitan, also known as lenga d’oc or langue d’oc, is also a Romance language, but with a reduced community of native speakers. It is reported to have about one million speakers, mainly in Southern France, but also in some valleys of Italy and in the Val d’Aran, a small valley of the Pyrenees of Catalonia, inside the territory of Spain. This last variety is called Aranese; all of the experiments reported here have been performed with the Aranese variety of Occitan.

Although Occitan was one of the main literary languages in Medieval Europe, nowadays it is legally recognized only in the Val d’Aran, where it has a limited status of co-officiality. In addition, Occitan dialects have strong differences, and its standardization as a single language still faces a number of open issues. Furthermore, the lack of general-purpose machine-readable texts restricts the design and construction of natural-language processing applications such as part-of-speech taggers. The Apertium-based Occitan–Catalan MT system (Armentano-Oller and Forcada, 2006) mentioned throughout this paper has been built to translate into the Occitan variety spoken in the Val d’Aran, called Aranese, which is a sub-dialect of Gascon (one of the main dialects of Occitan).

When part-of-speech tagging is viewed as an intermediate task for the translation process, the use in an unsupervised manner of target-language (TL) information, in addition to the source language (SL), has been shown to give better results than the standard (also unsupervised) Baum-Welch algorithm (Sánchez-Martínez, Pérez-Ortiz, and Forcada, 2004b). Moreover, as the experimental results show, the amount of source language text needed is small compared with the corpus sizes needed by the standard Baum-Welch algorithm. Because of this, it may be said that this training method is especially suited to train part-of-speech taggers to be embedded in MT systems involving less-resourced language pairs.

Carbonell et al. (2006) proposed a new MT framework in which a large full-form bilingual dictionary and a huge TL corpus are used to carry out the translation; neither parallel corpora nor transfer rules are needed. The idea behind Carbonell’s paper and that of the method we present here share the same principle: if the goal is to get good translations into the TL, let the TL decide whether a given “construction” in the TL is good or not. However, Carbonell’s method uses TL information at translation time, while ours uses TL information only when training one module that is then used to carry out the translation; therefore, no TL information is used by our method at translation time.

The rest of the paper is organized as follows: section 2 overviews the open-source platform for building MT systems Apertium; next, in section 3 the TL-driven training method used to train the Occitan part-of-speech tagger is introduced; section 4 shows the experiments and the results achieved; finally in section 5 we discuss the method and the results achieved.

2 Overview of Apertium

Apertium[1] (Armentano-Oller et al., 2006; Corbí-Bellot et al., 2005) is an open-source platform for developing MT systems, initially intended for related language pairs. The Apertium MT engine follows a shallow transfer approach and may be seen as an assembly line consisting of the following modules (see figure 1):

• A de-formatter which separates the text to be translated from the format information (RTF and HTML tags, whitespace, etc.). Format information is encapsulated so that the rest of the modules treat it as blanks between words.

• A morphological analyzer which tokenizes the SL text in surface forms and delivers, for each surface form, one or more lexical forms consisting of lemma, lexical category and morphological inflection information.

• A part-of-speech tagger which chooses, using a first-order hidden Markov model (HMM) (Cutting et al., 1992), one of the lexical forms corresponding to an ambiguous surface form. This is the module whose training is discussed in section 3.

• A lexical transfer module which reads each SL lexical form and delivers the corresponding TL lexical form by looking it up in a bilingual dictionary.

• A structural shallow transfer module (parallel to the lexical transfer) which uses a finite-state chunker to detect patterns of lexical forms which need to be processed for word reorderings, agreement, etc., and then performs these operations.[2]

• A morphological generator which delivers a TL surface form for each TL lexical form, by suitably inflecting it.

• A post-generator which performs orthographic operations such as contractions (e.g. Spanish del=de+el) and apostrophations (e.g. Catalan l’institut=el+institut).

• A re-formatter which restores the format information encapsulated by the de-formatter into the translated text.

[1] The MT engine, documentation, and linguistic data for different language pairs can be downloaded from http://apertium.sf.net.

[2] This describes Apertium Level 1, used for the experiments in this paper; in Apertium Level 2, currently being used for less-related pairs, a three-stage structural transfer is used to perform inter-chunk operations.

[Figure 1: pipeline diagram. SL text → de-formatter → morphological analyzer → part-of-speech tagger → structural transfer (with lexical transfer) → morphological generator → post-generator → re-formatter → TL text]

Figure 1: Modules of the Apertium shallow-transfer MT platform (see section 2).

Modules use text to communicate, which makes it much easier to diagnose or modify the behavior of the system.
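The benefit of text-only communication can be illustrated with a toy assembly line. The sketch below does not use Apertium's actual modules or formats: the three stages are hypothetical stand-ins, named after the corresponding modules only for readability, and the tiny dictionary is invented.

```python
def run_pipeline(stages, text, trace=None):
    """Feed text through each stage in order.

    If `trace` is a list, every intermediate text is appended to it, which is
    what makes a text-based assembly line easy to inspect stage by stage.
    """
    for name, stage in stages:
        text = stage(text)
        if trace is not None:
            trace.append((name, text))
    return text


# Hypothetical word-for-word dictionary (English to Spanish).
toy_dictionary = {"the": "el", "book": "libro", "of": "de", "man": "hombre"}

# Toy stand-ins for three modules; each one maps text to text.
stages = [
    ("de-formatter", lambda t: t.replace("<b>", "").replace("</b>", "")),
    ("lexical transfer", lambda t: " ".join(toy_dictionary.get(w, w) for w in t.split())),
    ("post-generator", lambda t: t.replace("de el ", "del ")),
]

trace = []
result = run_pipeline(stages, "<b>the book</b> of the man", trace)
for name, text in trace:
    print(f"{name}: {text}")
```

Because every stage reads and writes plain text, a faulty stage can be located by reading the trace, and a stage can be replaced without touching the others.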

2.1 Linguistic data and compilers

The Apertium MT engine is completely independent from the linguistic data used for translating between a particular pair of languages.

Linguistic data is coded using XML-based formats;[3] this allows for interoperability, and for easy data transformation and maintenance. In particular, files coding linguistic data can be automatically generated by third-party tools.

Apertium provides compilers to convert the linguistic data into the corresponding efficient form used by each module of the engine. Two main compilers are used: one for the four lexical processing modules (morphological analyzer, lexical transfer, morphological generator, and post-generator) and another one for the structural transfer. The first one generates finite-state letter transducers (Garrido-Alenda, Forcada, and Carrasco, 2002) which efficiently code the lexical data; the last one uses finite-state machines to speed up pattern matching. The use of such efficient compiled data formats makes the engine capable of translating tens of thousands of words per second on a current desktop computer.

[3] The XML formats (http://www.w3.org/XML/) for each type of linguistic data are defined through conveniently-designed XML document-type definitions (DTDs) which may be found inside the apertium package.

3 Target-language-driven part-of-speech tagger training

This section overviews the TL-driven training method that has been used to train, in an unsupervised manner, the HMM-based Occitan part-of-speech tagger used within the Apertium-based Occitan–Catalan MT system (Armentano-Oller et al., 2006). For a deeper description we refer the reader to papers by Sánchez-Martínez et al. (Sánchez-Martínez, Pérez-Ortiz, and Forcada, 2004b; Sánchez-Martínez, Pérez-Ortiz, and Forcada, 2004a; Sánchez-Martínez, Pérez-Ortiz, and Forcada, 2006).

Typically, the training of general purpose HMM-based part-of-speech taggers is done using the maximum-likelihood estimate (MLE) method (Gale and Church, 1990) when tagged corpora[4] are available (supervised method), or using the Baum-Welch algorithm (Cutting et al., 1992; Baum, 1972) with untagged corpora[5] (unsupervised method). However, if the part-of-speech tagger is to be embedded as a module in an MT system, as is the case here, HMM training can be done in an unsupervised manner by using some modules of the MT system and information from both SL and TL.

The main idea behind the use of TL information is that the correct disambiguation (tag assignment) of a given SL segment will produce a more likely TL translation than any (or most) of the remaining wrong disambiguations. In order to apply this method these steps are followed:

• first the SL text is split into adequate segments (so that they are small and independently translated by the rest of the MT engine); then,

• all possible disambiguations for each text segment are generated and translated into the TL; after that,

• a statistical TL model is used to compute the likelihood of the translation of each disambiguation; and,

• these likelihoods are used to adjust the parameters of the SL HMM: the higher the likelihood, the higher the probability of the original SL tag sequence in the HMM being trained.

[4] In a tagged corpus each occurrence of each word (ambiguous or not) has been assigned the correct part-of-speech tag.

[5] In an untagged corpus all words are assigned (using, for instance, a morphological analyzer) the set of all possible part-of-speech tags independently of context without choosing one of them.

The way this training method works can be illustrated with the following example. Suppose that we are training an English PoS tagger to be used within a rule-based MT system translating from English to Spanish, and that we have the following segment in English, s = “He books the room”. The first step is to use a morphological analyzer to obtain the set of all possible part-of-speech tags for each word. Suppose that the morphological analysis of the previous segment according to the lexicon is: He (pronoun), books (verb or noun), the (article), and room (verb or noun). As there are two ambiguous words (books and room) we have, for the given segment, four disambiguation paths or part-of-speech combinations, that is to say:

• g1 = (pronoun, verb, article, noun),

• g2 = (pronoun, verb, article, verb),

• g3 = (pronoun, noun, article, noun), and

• g4 = (pronoun, noun, article, verb).

Let τ be the function representing the translation task. The next step is to translate the SL segment into the TL according to each disambiguation path gi:

• τ(g1, s) = “Él reserva la habitación”,

• τ(g2, s) = “Él reserva la aloja”,

• τ(g3, s) = “Él libros la habitación”, and

• τ(g4, s) = “Él libros la aloja”.

It is expected that a Spanish language model will assign a higher likelihood to translation τ(g1, s) than to the other ones, which make little sense in Spanish. As a result, the tag sequence g1 will have a higher probability than the other ones.
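The example above can be reproduced with a short sketch. The enumeration of disambiguation paths follows the text; the translation table and the likelihood values are invented stand-ins for the MT engine and the Spanish language model, and all function names are hypothetical.

```python
from itertools import product


def disambiguation_paths(analyses):
    """All tag sequences licensed by the per-word sets of possible tags."""
    return list(product(*analyses))


def path_probabilities(paths, translate, tl_likelihood):
    """Normalise the TL likelihood of each path's translation to sum to 1."""
    scores = [tl_likelihood(translate(path)) for path in paths]
    total = sum(scores)
    if total == 0:
        # No translation is licensed by the TL model: fall back to uniform.
        return {path: 1.0 / len(paths) for path in paths}
    return {path: score / total for path, score in zip(paths, scores)}


# Morphological analysis of "He books the room", as in the example.
analyses = [("pronoun",), ("verb", "noun"), ("article",), ("verb", "noun")]
paths = disambiguation_paths(analyses)

# Stand-in for the rest of the MT engine: one translation per path.
toy_translations = {
    ("pronoun", "verb", "article", "noun"): "Él reserva la habitación",
    ("pronoun", "verb", "article", "verb"): "Él reserva la aloja",
    ("pronoun", "noun", "article", "noun"): "Él libros la habitación",
    ("pronoun", "noun", "article", "verb"): "Él libros la aloja",
}
# Stand-in for the TL model; the likelihood values are invented.
toy_likelihoods = {
    "Él reserva la habitación": 0.008,
    "Él reserva la aloja": 0.001,
    "Él libros la habitación": 0.0005,
    "Él libros la aloja": 0.0005,
}

probs = path_probabilities(paths, toy_translations.get, toy_likelihoods.get)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

With n ambiguous words that each have k analyses the number of paths is k^n, so the path list grows exponentially with the segment length.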

To estimate the HMM parameters, the calculated probabilities are used as if fractional counts were available to a supervised training method based on the MLE method in conjunction with a smoothing technique (Sánchez-Martínez, Pérez-Ortiz, and Forcada, 2004b).
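A minimal sketch of the fractional-count idea follows. The path probabilities are invented, the function names are hypothetical, and additive smoothing is used only as a simple stand-in: the smoothing technique actually employed is described in the cited paper, not here.

```python
from collections import defaultdict


def accumulate(path_probs, words, transitions, emissions):
    """Add each path's probability as a fractional count to the HMM statistics."""
    for tags, prob in path_probs.items():
        for prev, cur in zip(tags, tags[1:]):
            transitions[prev][cur] += prob        # tag-to-tag counts
        for tag, word in zip(tags, words):
            emissions[tag][word] += prob          # tag-to-word counts


def estimate(counts, outcomes, alpha=0.1):
    """Relative-frequency (MLE) estimates with additive smoothing.

    Additive smoothing is an assumption of this sketch.
    """
    probs = {}
    for context, row in counts.items():
        total = sum(row.values()) + alpha * len(outcomes)
        probs[context] = {o: (row.get(o, 0.0) + alpha) / total for o in outcomes}
    return probs


words = ["He", "books", "the", "room"]
# Invented path probabilities for the four disambiguations of the segment.
path_probs = {
    ("pronoun", "verb", "article", "noun"): 0.8,
    ("pronoun", "verb", "article", "verb"): 0.1,
    ("pronoun", "noun", "article", "noun"): 0.05,
    ("pronoun", "noun", "article", "verb"): 0.05,
}

transitions = defaultdict(lambda: defaultdict(float))
emissions = defaultdict(lambda: defaultdict(float))
accumulate(path_probs, words, transitions, emissions)

tags = ["pronoun", "verb", "noun", "article"]
a = estimate(transitions, tags)       # transition probabilities
b = estimate(emissions, words)        # emission probabilities
print(round(a["pronoun"]["verb"], 3), round(b["verb"]["books"], 3))
```

Each segment contributes a total count of one per position, spread over its disambiguation paths in proportion to the likelihood of their translations.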

As expected, the number of possible disambiguations of a text segment grows exponentially with its length, the translation task being the most time-consuming one. This problem has been successfully addressed (Sánchez-Martínez, Pérez-Ortiz, and Forcada, 2006) by using a very simple pruning method that avoids performing more than 80% of the translations without loss in accuracy.

An implementation of the method described in this section can be downloaded from the Apertium project web page,[6] and may simplify the initial building of Apertium-based MT systems for new language pairs, yielding better tagging results than the Baum-Welch algorithm (Sánchez-Martínez, Pérez-Ortiz, and Forcada, 2004b).

[6] http://apertium.sourceforge.net. The method is implemented inside package apertium-tagger-training-tools which is licensed under the GNU GPL license.

4 Experiments

The method we present is aimed at producing part-of-speech taggers to be used in MT systems. In this section we report the results achieved when training the Occitan part-of-speech tagger of the Apertium-based Occitan–Catalan MT system.[7] Note that when training the Occitan part-of-speech tagger the whole MT engine, except for the part-of-speech tagger itself, is used to produce texts from which statistics about TL (Catalan) will be collected.

[7] The linguistic data for this language pair (package apertium-oc-ca-1.0.2) can be freely downloaded from http://apertium.sourceforge.net

Before training, the Occitan corpus is divided into small segments that can be independently translated by the rest of the translation engine. To this end, information about the structural transfer patterns is taken into account. The segmentation is performed at nonambiguous words whose part-of-speech tag is not present in any structural transfer pattern, or at nonambiguous words appearing in patterns that cannot be matched in the lexical context in which they appear. Unknown words are also treated as segmentation points, since the lexical transfer has no bilingual information for them and no structural transfer pattern is activated at all.

Once the SL (Occitan) corpus has been segmented, for each segment, all possible translations into TL (Catalan) according to every possible combination of disambiguations are obtained. Then, the likelihoods of these translations are computed through a Catalan trigram model trained from a 2-million-word raw-text Catalan corpus, and then normalized and used to estimate the HMM parameters as described in section 3.

We evaluated the evolution of the performance of the training method by updating the HMM parameters at every 1 000 words and testing the resulting part-of-speech tagger; this also helps in determining the amount of SL text required for the convergence.

Figure 2 shows the evolution of the word error rate (WER) when training the Occitan part-of-speech tagger from a 300 000-word raw-text Occitan corpus built from texts collected from the Internet. The results achieved when following the standard (unsupervised) Baum-Welch approach to train HMM-based part-of-speech taggers on the same corpus (no larger Occitan corpus was available to us in order to train with the Baum-Welch algorithm), and the results achieved when a TL model is used at translation time (instead of a SL part-of-speech tagger) to select always the most likely translation into TL (TLM-best) are given for comparison.

When reestimating the HMM parameters via the Baum-Welch algorithm, the log-likelihood of the training corpus was calculated after each iteration; the iterative reestimation process is finished when the difference between the log-likelihood of the last iteration and the previous one is below a certain threshold. Note that when training the HMM parameters via the Baum-Welch algorithm, the whole 300 000-word corpus is used, therefore the WER reported in figure 2 for the Baum-Welch algorithm is independent of the number of SL words in the horizontal axis.
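The stopping rule can be sketched independently of Baum-Welch itself. In the block below the re-estimation step and the log-likelihood are toy stand-in functions, not an HMM implementation; the threshold value, the iteration cap and all names are assumptions, and the sketch also evaluates the log-likelihood once before the first iteration so that the first comparison is defined.

```python
def train_until_converged(reestimate, log_likelihood, params,
                          threshold=1e-4, max_iter=100):
    """Repeat a re-estimation step until the log-likelihood change is small.

    Returns the final parameters and the number of iterations performed.
    """
    previous = log_likelihood(params)
    for iteration in range(1, max_iter + 1):
        params = reestimate(params)
        current = log_likelihood(params)
        if abs(current - previous) < threshold:
            return params, iteration
        previous = current
    return params, max_iter


# Toy stand-ins: each step halves the distance to the optimum at 2.0.
def toy_reestimate(x):
    return (x + 2.0) / 2.0


def toy_log_likelihood(x):
    return -(x - 2.0) ** 2


params, iterations = train_until_converged(toy_reestimate, toy_log_likelihood, 0.0)
print(params, iterations)
```

The iteration cap guards against a threshold that is never reached; the paper does not mention one.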

The WER is calculated as the edit distance (Levenshtein, 1965) between the translation of an independent 10 079-word Occitan corpus performed by the MT system when embedding the part-of-speech tagger being evaluated, and its human-corrected MT into Catalan. WERs are calculated at the document level; additions, deletions and substitutions being equally weighted.
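A word-level edit distance with unit costs can be computed with the standard dynamic-programming recurrence, as in the sketch below. Whitespace tokenisation and normalisation by the length of the corrected reference are assumptions of this example, and the two Catalan strings are invented.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two token lists (all edits cost 1)."""
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            current.append(min(
                previous[j] + 1,             # delete h from the hypothesis
                current[j - 1] + 1,          # insert r into the hypothesis
                previous[j - 1] + (h != r),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def wer(hypothesis, reference):
    """Word error rate, in % of reference words, over whole documents."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(hyp, ref) / len(ref)


# Invented example: raw MT output versus its human-corrected version.
print(wer("el gat menja peix fresc", "el gat menja el peix"))
```

Computing the distance over the whole document, rather than sentence by sentence, matches the document-level calculation described above.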

As can be seen in figure 2, our method does not need a large amount of SL text to converge, and the translation performance is better than that achieved by the Baum-Welch algorithm. Moreover, the translation performance achieved by our method is even better than that achieved when translating using the TLM-best setup. Although the TLM-best setup might be thought of as giving the best result that can be achieved by our method, the results reported in figure 2 suggest that our method has some generalization capability that makes it able to produce better part-of-speech taggers for MT than might initially be expected.

[Figure 2: line plot. Vertical axis: Word error rate (WER, % of words), ticks from 6.5 to 9.5. Horizontal axis: SL (Occitan) words, from 0 to 300000. Reference lines: Baum-Welch, TLM-best.]

Figure 2: Evolution of the word error rate (WER) when training the (SL) Occitan part-of-speech tagger, Catalan being the target language (TL). WERs reported are calculated at the document level. Baum-Welch and TLM-best (see below) results are given for comparison; thus, they are independent of the number of SL words. TLM-best corresponds to the results achieved when a TL model is used at translation time (instead of a SL part-of-speech tagger) to select always the most likely translation into TL.

It must be mentioned that analogous results on the Spanish–Catalan language pair have revealed that, although the part-of-speech tagging accuracy is better when the HMM is trained in a supervised way from a tagged corpus, the translation performance of the MT system when embedding the supervisedly trained part-of-speech taggers is quite similar to that of using a part-of-speech tagger trained through the TL-driven training method.[8]

[8] We plan to publish these results in the near future.

Concerning how the presented method behaves when the languages involved are less related than Occitan and Catalan, preliminary experiments on the French–Catalan language pair show results in agreement with those provided in this paper. Experiments on more unrelated language pairs such as English–Catalan will be conducted in the near future.

5 Discussion

In this paper we have reviewed the use of target language (TL) information to train hidden-Markov-model (HMM)-based part-of-speech taggers to be used in machine translation (MT); furthermore, we have presented experiments done with the Occitan–Catalan language pair, a case study of a less-resourced language pair.

Our training method has been proven to be appropriate to train part-of-speech taggers for MT between less-resourced language pairs because, on the one hand, the amount of SL text needed is very small compared with common corpus sizes (millions of words) used by the Baum-Welch algorithm; and, on the other hand, because no new resources must be built (such as tagged corpora) to get translation performances comparable to those achieved when training from tagged corpora.

Finally, it must be pointed out that the resulting part-of-speech tagger is tuned to improve the translation quality and intended to be used as a module in an MT system; for this reason, it may give less accurate results as a general purpose part-of-speech tagger for other natural language processing applications.

Acknowledgements

Work funded by the Spanish Ministry of Education and Science through project TIN2006-15071-C03-01, by the Spanish Ministry of Education and Science and the European Social Fund through research grant BES-2004-4711, and by the Spanish Ministry of Industry, Tourism and Commerce through project FIT-350401-2006-5. The development of the Occitan–Catalan linguistic data was supported by the Generalitat de Catalunya.

References

Armentano-Oller, C., R.C. Carrasco, A.M. Corbí-Bellot, M.L. Forcada, M. Ginestí-Rosell, S. Ortiz-Rojas, J.A. Pérez-Ortiz, G. Ramírez-Sánchez, F. Sánchez-Martínez, and M.A. Scalco. 2006. Open-source Portuguese-Spanish machine translation. In Computational Processing of the Portuguese Language, Proceedings of the 7th International Workshop on Computational Processing of Written and Spoken Portuguese, PROPOR 2006, volume 3960 of Lecture Notes in Computer Science. Springer-Verlag, pages 50–59. (http://www.dlsi.ua.es/~japerez/pub/pdf/propor2006.pdf).

Armentano-Oller, C. and M.L. Forcada. 2006. Open-source machine translation between small languages: Catalan and Aranese Occitan. In Strategies for developing machine translation for minority languages (5th SALTMIL workshop on Minority Languages), pages 51–54. (organized in conjunction with LREC 2006, http://www.dlsi.ua.es/~mlf/docum/armentano06p2.pdf).

Arnold, D., 2003. Computers and Translation: A translator’s guide, chapter Why translation is difficult for computers, pages 119–142. Benjamins Translation Library. Edited by H. Somers.

Baum, L.E. 1972. An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process. Inequalities, 3:1–8.

Carbonell, J., S. Klein, D. Miller, M. Steinbaum, T. Grassiany, and J. Frei. 2006. Context-based machine translation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, “Visions for the Future of Machine Translation”, pages 19–28, August.

Corbí-Bellot, A.M., M.L. Forcada, S. Ortiz-Rojas, J.A. Pérez-Ortiz, G. Ramírez-Sánchez, F. Sánchez-Martínez, I. Alegria, A. Mayor, and K. Sarasola. 2005. An open-source shallow-transfer machine translation engine for the Romance languages of Spain. In Proceedings of the 10th European Association for Machine Translation Conference, pages 79–86, Budapest, Hungary. (http://www.dlsi.ua.es/~mlf/docum/corbibellot05p.pdf).

Cutting, D., J. Kupiec, J. Pedersen, and P. Sibun. 1992. A practical part-of-speech tagger. In Third Conference on Applied Natural Language Processing. Association for Computational Linguistics. Proceedings of the Conference, pages 133–140, Trento, Italy.

Forcada, M.L. 2006. Open-source machine translation: an opportunity for minor languages. In Proceedings of Strategies for developing machine translation for minority languages (5th SALTMIL workshop on Minority Languages). (http://www.dlsi.ua.es/~mlf/docum/forcada06p2.pdf).

Gale, W.A. and K.W. Church. 1990. Poor estimates of context are worse than none. In Proceedings of a workshop on Speech and natural language, pages 283–287. Morgan Kaufmann Publishers Inc.

Garrido-Alenda, A., M. L. Forcada, and R. C. Carrasco. 2002. Incremental construction and maintenance of morphological analysers based on augmented letter transducers. In Proceedings of TMI 2002 (Theoretical and Methodological Issues in Machine Translation), pages 53–62.

Levenshtein, V.I. 1965. Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR, 163(4):845–848. English translation in Soviet Physics Doklady, 10(8):707–710, 1966.

Sánchez-Martínez, F., J.A. Pérez-Ortiz, and M.L. Forcada. 2004a. Cooperative unsupervised training of the part-of-speech taggers in a bidirectional machine translation system. In Proceedings of TMI, The Tenth Conference on Theoretical and Methodological Issues in Machine Translation, pages 135–144, October. (http://www.dlsi.ua.es/~fsanchez/pub/pdf/sanchez04b.pdf).

Sánchez-Martínez, F., J.A. Pérez-Ortiz, and M.L. Forcada. 2004b. Exploring the use of target-language information to train the part-of-speech tagger of machine translation systems. In Advances in Natural Language Processing, Proceedings of 4th International Conference EsTAL, volume 3230 of Lecture Notes in Computer Science. Springer-Verlag, pages 137–148. (http://www.dlsi.ua.es/~fsanchez/pub/pdf/sanchez04a.pdf).

Sánchez-Martínez, F., J.A. Pérez-Ortiz, and M.L. Forcada. 2006. Speeding up target-language driven part-of-speech tagger training for machine translation. In Advances in Artificial Intelligence, Proceedings of the 5th Mexican International Conference on Artificial Intelligence, volume 4293 of Lecture Notes in Computer Science. Springer-Verlag, pages 844–854. (http://www.dlsi.ua.es/~fsanchez/pub/pdf/sanchez06b.pdf).

264

Parallel Corpora based Translation Resources Extraction

Alberto Simões
Departamento de Informática
Universidade do Minho
Braga, Portugal
[email protected]

José João Almeida
Departamento de Informática
Universidade do Minho
Braga, Portugal
[email protected]

Resumen: Este artículo describe NATools, un conjunto de herramientas de procesamiento, análisis y extracción de recursos de traducción de corpora paralelos. Entre las distintas herramientas disponibles se destacan herramientas de alineamiento de frases y palabras, un extractor de diccionarios probabilísticos de traducción, un servidor de corpus, un conjunto de herramientas de interrogación de corpora y diccionarios y asimismo un conjunto de herramientas de extracción de recursos bilingües.
Palabras clave: corpora paralelos, recursos bilingües, traducción automática

Abstract: This paper describes NATools, a toolkit to process, analyze and extract translation resources from parallel corpora. It includes tools such as a sentence aligner, a probabilistic translation dictionary extractor, a word aligner, a corpus server, a set of tools to query corpora and dictionaries, and a set of tools to extract bilingual resources.
Keywords: parallel corpora, bilingual resources, machine translation

1 Introduction

NATools is a package of tools for processing parallel corpora. It includes tools that help prepare parallel corpora, from sentence alignment and tokenization to full probabilistic translation dictionary extraction, word alignment, and the extraction of translation examples for machine translation.

Some of the available tools are listed below:

• a simple parallel corpora sentence aligner based on the algorithm proposed by (Gale and Church, 1991) and on the Vanilla Aligner implementation by (Danielsson and Ridings, 1997);

• a probabilistic translation dictionary (PTD) extractor (Simões and Almeida, 2003; Simões, 2004) based on work by (Hiemstra, August 1996; Hiemstra, 1998);

• a parallel corpora word aligner (Simões and Almeida, 2006a) based on probabilistic translation dictionaries;

• NatServer (Simões and Almeida, 2006b), a parallel corpora server for quick concordances and probabilistic translation dictionary querying;

• a set of web clients to query parallel corpora using NatServer;

• tools for machine translation example extraction (Simões and Almeida, 2006a) based on probabilistic translation dictionaries and alignment pattern rules;

• a full C and Perl API for quick prototyping of parallel corpora tools;

• a StarDict dictionary generator;

• support for Makefile::Parallel (Simões, Fonseca, and Almeida, 2007), a Domain Specific Language for process parallelization (to take advantage of multi-processor machines and/or cluster systems).

This paper consists of three main sections. The first explains how NATools helps prepare parallel corpora. The second covers querying parallel corpora, both through a corpora server and through web interfaces. The third is about using NATools to extract parallel resources such as translation examples.

2 Parallel Corpora Preparation

Creating a parallel corpus and making it available is not a simple task. The process does not depend only on compiling parallel texts: these texts must be processed in several ways before they are really useful. Important steps include text tokenization, sentence boundary detection and sentence alignment (or translation unit alignment). NATools includes (and depends on) tools to perform these tasks.

2.1 Segmentation and Tokenization

While NATools does not directly include tools for segmentation and tokenization, it depends on Lingua::PT::PLNbase¹, a Perl module for segmentation and tokenization of the Portuguese language. Although it was developed with Portuguese in mind, support for Spanish, French and English has been incorporated over time. Thus, after installing NATools you can use the Perl module directly or through the NATools options for segmentation and tokenization.

¹ http://search.cpan.org/dist/Lingua-PT-PLNbase.

2.2 Sentence Alignment

The NATools sentence aligner uses the well-known algorithm by (Gale and Church, 1991). Work is under way to add clue-align (Tiedemann, 2003) information to the original algorithm, taking advantage of numbers and other non-textual elements in sentences in addition to the basic sentence-length metrics.

While the Gale and Church algorithm is known not to be robust enough for big corpora whose two sides differ greatly in number of sentences, it works for most available corpora.
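To illustrate the idea of length-based alignment, the following is a minimal sketch in Python. It is not the NATools aligner and it does not use the probabilistic cost of Gale and Church: as a simplifying assumption, the cost of a bead is just the absolute difference in character length plus a fixed penalty. The function name and the penalty values are hypothetical.

```python
def align_by_length(src, tgt, skip_penalty=50, merge_penalty=5):
    """Align two lists of sentences by dynamic programming over lengths.

    Simplified cost (an assumption, not the Gale and Church model):
    absolute difference in character length plus a fixed penalty.
    """
    inf = float("inf")
    n, m = len(src), len(tgt)
    # (source sentences consumed, target sentences consumed, penalty)
    moves = [(1, 1, 0), (1, 0, skip_penalty), (0, 1, skip_penalty),
             (2, 1, merge_penalty), (1, 2, merge_penalty)]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + abs(len_s - len_t) + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return beads


if __name__ == "__main__":
    source = ["a" * 10, "b" * 5, "c" * 12]
    target = ["A" * 15, "C" * 12]
    for bead in align_by_length(source, target):
        print(bead)
```

With these inputs the first two source sentences are merged and aligned with the first target sentence, because their combined length matches it.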

Also, note that NATools does not force the user to use the supplied sentence aligner (or tokenizer). For instance, we use easy-align from IMS-CWB (Christ et al., 1999) to perform sentence alignment on big corpora. Unfortunately, easy-align is not open-source and its algorithm is not described in any paper, but it uses not only the basic length metrics but also other knowledge, such as bilingual dictionaries, to produce better alignments.

2.3 Corpora Encoding

This is the only required step when using NATools. It encodes the corpora and creates auxiliary indexes for quick access. Two lexicon indexes are created (one per language), mapping each word to an integer identifier. The corpora are encoded using these integer values, and indexes for direct access by word and by sentence are created. There are other tools to index corpora, such as Emdros (Petersen, 2004) and IMS-CWB (Christ et al., 1999). While the first is freely available, it is intended for monolingual corpora. IMS-CWB, on the other hand, is not open software.
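The encoding step can be pictured with a small Python sketch. It is only an illustration of the idea for one side of the corpus (a lexicon that maps words to integers, sentences stored as integer lists, and an index from word to sentences); the function name and the in-memory structures are assumptions and say nothing about the actual NATools file formats.

```python
def encode_corpus(sentences):
    """Encode one side of a corpus as integer identifiers (illustrative)."""
    lexicon = {}      # word -> integer identifier
    encoded = []      # each sentence as a list of identifiers
    occurrences = {}  # identifier -> indexes of sentences containing it
    for sentence_no, sentence in enumerate(sentences):
        ids = []
        for word in sentence.lower().split():
            # New words receive the next free identifier.
            word_id = lexicon.setdefault(word, len(lexicon))
            ids.append(word_id)
            hits = occurrences.setdefault(word_id, [])
            if not hits or hits[-1] != sentence_no:
                hits.append(sentence_no)
        encoded.append(ids)
    return lexicon, encoded, occurrences


if __name__ == "__main__":
    lexicon, encoded, occurrences = encode_corpus(
        ["the european parliament", "the european union"])
    print(lexicon)
    print(encoded)
    print(occurrences[lexicon["european"]])
```

For a parallel corpus the same function would simply be applied once per language.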

2.4 Probabilistic Translation Dictionaries Extraction

This process extracts relationships between words and their probable translations. Some researchers (Hiemstra, August 1996) call this word alignment. Within NATools, we prefer to call the result probabilistic translation dictionaries (PTDs).

There are other tools, like Giza++ (Och and Ney, 2004), that perform word alignment directly from parallel corpora, but that is not our approach. Our dictionaries map each word in one language to a set of probable translations in the other language (each with a translation probability). A simple example of a PTD follows:

    ** europe (42853 occurrences)
       europa:    94.71 %
       europeus:   3.39 %
       europeu:    0.81 %
       europeia:   0.11 %

    ** stupid (180 occurrences)
       estúpido:  17.55 %
       estúpida:  10.99 %
       estúpidos:  7.41 %
       avisada:    5.65 %
       direita:    5.58 %
       impasse:    4.48 %

Note that although the first three entries for the word stupid have low probabilities, they refer to the same word with different inflections: masculine singular, feminine singular and masculine plural.
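As a rough picture of how such a dictionary can be handled in a program, the sketch below stores the two entries shown above in a nested Python dictionary and ranks the translations of a word. The structure and the function name are assumptions made for illustration; they are not the NATools storage format or API.

```python
# Entries copied from the PTD example above, probabilities as fractions.
PTD = {
    "europe": {
        "occurrences": 42853,
        "translations": {"europa": 0.9471, "europeus": 0.0339,
                         "europeu": 0.0081, "europeia": 0.0011},
    },
    "stupid": {
        "occurrences": 180,
        "translations": {"estúpido": 0.1755, "estúpida": 0.1099,
                         "estúpidos": 0.0741, "avisada": 0.0565,
                         "direita": 0.0558, "impasse": 0.0448},
    },
}


def probable_translations(ptd, word, min_prob=0.0):
    """Return (translation, probability) pairs, most probable first."""
    entry = ptd.get(word)
    if entry is None:
        return []
    ranked = sorted(entry["translations"].items(),
                    key=lambda item: item[1], reverse=True)
    return [(t, p) for t, p in ranked if p >= min_prob]


if __name__ == "__main__":
    for translation, prob in probable_translations(PTD, "stupid", 0.07):
        print(f"{translation}: {prob:.2%}")
```

A threshold such as `min_prob` is one simple way to discard the noisy tail of an entry.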

The algorithm, based on Twente-Aligner (Hiemstra, August 1996; Hiemstra, 1998), was fully reviewed and enhanced, and support for big corpora was added (Simões, 2004). The version included in NATools supports corpora of arbitrary size (limited only by disk space) and can be run on parallel machines and clusters.

NATools probabilistic dictionary extraction is being used for bilingual dictionary bootstrapping, as presented by (Guinovart and Fontenla, 2005).

3 Querying Parallel Corpora

Making parallel corpora available for querying is not easy either. After the encoding process described in section 2.3, a server is needed to help search and query the encoded corpora. Thus, NATools includes its own parallel corpora server.

3.1 NatServer: A Parallel Corpora Server

NATools includes NatServer, a socket-based program to efficiently query parallel corpora, corpora n-grams (bigrams, trigrams and tetragrams) and probabilistic translation dictionaries. It supports multiple corpora with different language pairs.

Given the modular implementation of NatServer, its C library can be used by other software, namely by the NATools Perl API (Application Programmer Interface). This makes it easy for any software to choose at run time whether to use the socket server or to access the encoded corpora locally. This is especially important for intensive batch tasks, where socket-based communication is a big performance overhead.

NatServer is also being prepared to be the server side of Distributed Translation Memories (Simões, Guinovart, and Almeida, 2004), a WebService to serve translators with external translation memories.

3.2 Query Tools

Linguists and translators make heavy use of parallel corpora and bilingual resources. However, they use simple applications or web interfaces. There are parallel corpora available for querying on the web, like COMPARA (Frankenberg-Garcia and Santos, 2001; Frankenberg-Garcia and Santos, 2003) or Opus (Tiedemann and Nygaard, 2004), and they are widely used. Thus, it is important to provide mechanisms to make our parallel corpora available on the Web as well.

NATools includes a set of web tools for concordances with translation guessing (see figure 1) and probabilistic translation dictionary browsing (see figure 2).

The web interface lets the user switch easily between concordances and dictionaries, and check corpora details (description, languages, sizes and so on).

4 Parallel Resources Extraction

The main objective of NATools was not to be an end-user software package but a toolbox for researchers who use parallel corpora. Thus, research is being done using NATools, and some of the resulting applications are being incorporated into the toolbox. The probabilistic translation dictionaries presented in section 2.4 are useful parallel resources in themselves. They were presented earlier because they are crucial for correctly querying NATools corpora.

4.1 Terminology Extraction

(Och, 1999; Och and Ney, 2004) describe methods to infer translation patterns from parallel corpora. In our work we found that describing translation patterns and applying them to parallel corpora gives an interesting result: bilingual terminology.

Translation patterns describe how word order changes when translation occurs. For instance, a simple pattern can describe how the adjective swaps with the noun when translating from Portuguese to English²:

T(A · B) = T(B) · T(A)

A slightly more complicated pattern,

T(P · de · V · N) = T(N) · T(P) · of · T(V)

is shown visually in figure 3. NATools includes a Domain Specific Language (DSL) to define these patterns in an easy way. The last example can be written as “P "de" V N = N P "of" V”.

² Note that the letters in these patterns do not have any special meaning. They are just variable names.

Figure 3: Translation Pattern example (alignment matrix between “fontes de financiamento alternativas” and “alternative sources of financing”).
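The simplest pattern, A B = B A, can be sketched in a few lines of Python. This is only an illustration of how a pattern and a probabilistic translation dictionary can be combined; it is not the NATools DSL or its matching algorithm, and the function name, the toy dictionary and its probabilities are all invented for the example.

```python
def extract_swap_pairs(src_tokens, tgt_tokens, ptd, min_prob=0.1):
    """Find source pairs "A B" whose translation appears as "T(B) T(A)"."""

    def translates(source_word, target_word):
        # A word pair counts as a translation if the (toy) PTD gives it
        # a probability of at least min_prob.
        return ptd.get(source_word, {}).get(target_word, 0.0) >= min_prob

    found = []
    for i in range(len(src_tokens) - 1):
        a, b = src_tokens[i], src_tokens[i + 1]
        for j in range(len(tgt_tokens) - 1):
            first, second = tgt_tokens[j], tgt_tokens[j + 1]
            if translates(b, first) and translates(a, second):
                found.append((f"{a} {b}", f"{first} {second}"))
    return found


if __name__ == "__main__":
    # Hypothetical probabilities, for illustration only.
    toy_ptd = {
        "o": {"the": 0.7},
        "parlamento": {"parliament": 0.9},
        "europeu": {"european": 0.8},
    }
    print(extract_swap_pairs("o parlamento europeu".split(),
                             "the european parliament".split(),
                             toy_ptd))
```

Counting how often each extracted pair occurs over a whole corpus gives occurrence figures of the kind reported for the extracted terminology.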

Although these patterns can be inferred from parallel corpora, most of them can be defined manually much faster and with good results. Figure 4 shows some of the extracted terminology. Each group is preceded by its rule. The number before each terminology pair is the occurrence count for that pair.

Note that the examples are the top five by number of occurrences. Although they are all good translations and can all be considered terminology, this does not hold for all the extracted examples. However, the DSL allows morphological constraints and Perl predicates to be added to a pattern. With these constraints it is quite easy to remove from the extracted entries those which are not terminology.

We ran a large-scale terminology extraction test using the EuroParl (Koehn, 2002) Portuguese:English corpus. Table 1 shows some statistics on the number of patterns extracted³.

Total number of TUs             1 000 000
Number of processed TUs           700 000
Number of patterns found          578 103
Number of different patterns      139 781
Number of filtered patterns       103 617

Table 1: Terminology extraction statistics.

Table 2 shows the occurrence distribution for some patterns. The third column is a simple evaluation of how many patterns are really terminology and are correct. The evaluation used three samples: the 20 patterns with the most occurrences, the 20 patterns with the fewest occurrences, and 20 patterns from the middle of the list.

³ The number of translation units processed is not equal to the total number of translation units because the process had not finished when these statistics were reported.

Pattern                 Occur.   Quality
A B = B A               77 497       86%
A de B = B A            12 694       95%
A B C = C B A            7 700       93%
H de D H = H D I         3 336      100%
A B C = C A B            1 466       40%
P de V N = N P of V        564       98%
P de T de F = F T P        360       96%

Table 2: Pattern occurrences by type, and respective quality.

Figure 1: Concordances interface.

Figure 2: PTDs query interface.

4.2 Word Alignment and Example Extraction

While Word Alignment and Example Extraction are different tasks, the base algorithm used in NATools is the same. Word alignment is done for each pair of translation units by creating a matrix of translation probabilities, as shown in figure 5. In this matrix one can see direct translations between words and some marked patterns. As these patterns are hopefully terminology, we consider each of them a term and, as such, align it as a whole with another term. From this matrix we can extract the real word alignment between the two translation units.

For the example in the figure, the following alignments would be extracted: discussão:discussion, sobre:about, fontes de financiamento alternativas:alternative sources of financing, para:for, a:the, aliança radical europeia:european radical alliance.
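A minimal Python sketch of reading word-level links off such a matrix is given below, using the values of figure 5. It simply picks, for each source word, the target word with the highest value; this is an assumption made for illustration and leaves out the step in which marked patterns are grouped into terms.

```python
TARGET = ["discussion", "about", "alternative", "sources", "of",
          "financing", "for", "the", "european", "radical", "alliance", "."]

# Rows of the word-alignment matrix in figure 5 (source word -> scores).
MATRIX = {
    "discussão":     [44, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "sobre":         [0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "fontes":        [0, 0, 0, 74, 0, 0, 0, 0, 0, 0, 0, 0],
    "de":            [0, 3, 0, 0, 27, 0, 6, 3, 0, 0, 0, 0],
    "financiamento": [0, 0, 0, 0, 0, 56, 0, 0, 0, 0, 0, 0],
    "alternativas":  [0, 0, 23, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "para":          [0, 0, 0, 0, 0, 0, 28, 0, 0, 0, 0, 0],
    "a":             [0, 1, 0, 0, 1, 0, 4, 33, 0, 0, 0, 0],
    "aliança":       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 65, 0],
    "radical":       [0, 0, 0, 0, 0, 0, 0, 0, 0, 80, 0, 0],
    "europeia":      [0, 0, 0, 0, 0, 0, 0, 0, 59, 0, 0, 0],
    ".":             [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 80],
}


def best_links(matrix, target_words):
    """Link each source word to its highest-scoring target word."""
    links = {}
    for source_word, scores in matrix.items():
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] > 0:  # ignore rows with no evidence at all
            links[source_word] = target_words[best]
    return links


if __name__ == "__main__":
    for source_word, target_word in best_links(MATRIX, TARGET).items():
        print(f"{source_word}:{target_word}")
```

On this matrix the ambiguous rows for "de" and "a" resolve to "of" and "the", the cells with the highest values.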

In fact, single-word translations are already present in the probabilistic translation dictionaries, so there is no advantage in extracting the word-to-word relations.

A B = B A
  14949  comunidades europeias | european communities
  12487  parlamento europeu | european parliament
  11645  comunidade europeia | european community
  10055  união europeia | european union
   7705  jornal oficial | official journal

P "de" V N = N P "of" V
    134  comunicação de acusações alterada | revised statement of objections
     55  comunicação de acusações inicial | initial statement of objections
     49  tribunal de justiça europeu | european court of justice
     45  fontes de energia renováveis | renewable sources of energy
     41  período de tempo limitado | limited period of time

A "de" B = B A
   3383  medidas de execução | implementing measures
   2754  comité de gestão | management committee
   1163  plano de acção | action plan
   1050  certificados de importação | import licences
   1036  sigla de identificação | identification marking

Figure 4: Bilingual terminology extracted by Translation Patterns.

                discussion  about  alternative  sources  of  financing  for  the  european  radical  alliance  .
discussão               44      0            0        0   0          0    0    0         0        0         0  0
sobre                    0     11            0        0   0          0    0    0         0        0         0  0
fontes                   0      0            0       74   0          0    0    0         0        0         0  0
de                       0      3            0        0  27          0    6    3         0        0         0  0
financiamento            0      0            0        0   0         56    0    0         0        0         0  0
alternativas             0      0           23        0   0          0    0    0         0        0         0  0
para                     0      0            0        0   0          0   28    0         0        0         0  0
a                        0      1            0        0   1          0    4   33         0        0         0  0
aliança                  0      0            0        0   0          0    0    0         0        0        65  0
radical                  0      0            0        0   0          0    0    0         0       80         0  0
europeia                 0      0            0        0   0          0    0    0        59        0         0  0
.                        0      0            0        0   0          0    0    0         0        0         0 80

Figure 5: Word-alignment matrix.

The alignment matrix can also be used to extract examples. If we join sequences of words (or terms) and their translations, a set of word sequences (examples) can be extracted. Again, for the matrix shown, we can extract more relationships, like discussão sobre:discussion about, sobre fontes de financiamento alternativas:about alternative sources of financing, fontes de financiamento alternativas para:alternative sources of financing for, para a:for the, a aliança radical europeia:the european radical alliance. This process can be repeated, resulting in bigger examples. This step is important because it generates more occurrences of each example and thus gives more weight to those that occur more often.
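The joining step can be sketched in Python as a sliding window over the aligned segments. The sketch assumes that the segments appear in the same order in both languages, which holds for the example above but not in general; the function name is hypothetical and this is not the NATools implementation.

```python
# Aligned segments extracted from the matrix, in sentence order.
SEGMENTS = [
    ("discussão", "discussion"),
    ("sobre", "about"),
    ("fontes de financiamento alternativas",
     "alternative sources of financing"),
    ("para", "for"),
    ("a", "the"),
    ("aliança radical europeia", "european radical alliance"),
]


def combine_examples(segments, size=2):
    """Join every run of `size` consecutive segments into one example."""
    examples = []
    for start in range(len(segments) - size + 1):
        window = segments[start:start + size]
        source = " ".join(src for src, _ in window)
        target = " ".join(tgt for _, tgt in window)
        examples.append((source, target))
    return examples


if __name__ == "__main__":
    for source, target in combine_examples(SEGMENTS, size=2):
        print(f"{source} | {target}")
```

Calling the function again with a larger `size` corresponds to repeating the process to obtain bigger examples.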

Figure 6 shows some examples extracted using this methodology. These examples can be consolidated (summed according to their occurrence counts) and used for machine translation or computer-assisted translation.

4.3 Example Generalization

Based on work by (Brown, 2000; Brown, 2001), we are incorporating generalization algorithms into NATools. One simple generalization is the detection of numbers, hours and dates. Some examples generalized using this technique follow.

    399  às hour | hour
    187  orçamento de year | year budget
    136  int euros | eur int
    135  int euros | eur int
    127  directiva de year | year directive
     51  orçamento year | year budget
     46  int de setembro | september int
     31  partir de year | year onwards
     29  convenção de year | year convention
     26  eleições de year | year elections
     25  período year-year | year-year period
     25  int dólares | usd int
     24  relatório de year | year report
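This kind of generalization can be approximated with a few regular expressions, as in the Python sketch below. The placeholder names follow the listing above, but the expressions themselves (what counts as an hour or a year) are assumptions made for the example and are not the rules used by NATools.

```python
import re

# Order matters: hours and years must be replaced before plain integers.
RULES = [
    (re.compile(r"\b\d{1,2}[:h]\d{2}\b"), "hour"),   # e.g. 12:30 or 12h30
    (re.compile(r"\b(?:1[89]|20)\d{2}\b"), "year"),  # e.g. 1999 or 2007
    (re.compile(r"\b\d+\b"), "int"),                 # any other integer
]


def generalize(text):
    """Replace hours, years and integers by class placeholders."""
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    for sample in ["orçamento de 2007", "25 de setembro",
                   "às 12h30", "período 2000-2006"]:
        print(generalize(sample))
```

Applying the same function to both sides of an example keeps the placeholders consistent between the two languages.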

Although these patterns can be useful, they are not as interesting as they would be if we could create place-holders for words. If we analyze similar entries in the examples listing, we can find entries that differ in just a few words, as in the following example.

    2  povo português | portuguese people
    2  povo paraguaio | paraguayan people
    2  povo nigeriano | nigerian people
    2  povo mexicano | mexican people
    2  povo marroquino | moroccan people
    2  povo mapuche | mapuche people
    2  povo indígena | indigenous people
    2  povo holandês | dutch people
    2  povo húngaro | hungarian people
    2  povo hmong | hmong people

This can be generalized by automatically creating a class for the differing words (in this case we called it gentilic). Given two different classes with a large number of similar members, we can join them, expanding the initial number of examples.

    povo X: gentilic(X) T(X) people
    governo X: gentilic(X) T(X) govern

raw examples
         protocolo para prevenir | protocol to prevent
         , reprimir e punir o | , suppress and punish
         tráfico de pessoas | trafficking in persons
         e em particular de | , especially
         mulheres e crianças | women and children

consolidated examples
  35736  tendo em conta | having regard
  11304  tratado que institui | treaty establishing
  10335  das comunidades europeias | of the european communities
   8789  institui a comunidade europeia | establishing the european community
   8424  e , nomeadamente | and in particular
   8224  , a comissão | , the commission
   8142  redacção que lhe foi dada pelo | amended by
   7352  à comissão | to the commission
   7072  a comissão das | the commission of
   6870  pela comissão | for the commission
   6540  todos os estados-membros | all member states
   6400  pela comissão | by the commission
   6379  considerando que , | whereas ,
   5409  regulamento é obrigatório | regulation shall be binding
   5400  adoptou | has adopted this

Figure 6: Translation examples.
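One simple way to find such classes automatically is sketched below in Python: every word position on each side is replaced in turn by a place-holder X, and templates shared by several examples are kept together with the word pairs that filled the place-holder. This brute-force approach and the function name are assumptions made for illustration, not the generalization algorithm being incorporated into NATools.

```python
from collections import defaultdict


def find_templates(examples, min_members=2):
    """Group examples that differ in one word on each side.

    Returns {(source template, target template): set of word pairs}.
    """
    classes = defaultdict(set)
    for source, target in examples:
        s_words, t_words = source.split(), target.split()
        for i, s_word in enumerate(s_words):
            for j, t_word in enumerate(t_words):
                s_tpl = " ".join(s_words[:i] + ["X"] + s_words[i + 1:])
                t_tpl = " ".join(t_words[:j] + ["X"] + t_words[j + 1:])
                classes[(s_tpl, t_tpl)].add((s_word, t_word))
    # Keep only templates filled by enough different word pairs.
    return {tpl: members for tpl, members in classes.items()
            if len(members) >= min_members}


if __name__ == "__main__":
    examples = [
        ("povo português", "portuguese people"),
        ("povo mexicano", "mexican people"),
        ("povo holandês", "dutch people"),
    ]
    for (s_tpl, t_tpl), members in find_templates(examples).items():
        print(s_tpl, "|", t_tpl, "|", sorted(members))
```

The members of the surviving template play the role of the gentilic class in the text above.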

4.4 StarDict generation

Although we are in the Internet era, there are still people without Internet access at home, or who work offline on a laptop. For these people, accessing the online query system is not possible. Especially for researchers outside computer science, it is important to make dictionaries and some concordances easily available.

Figure 7: StarDict screen-shot.

With this in mind, we created a tool to generate StarDict (Zheng, Evgeniy, and Murygin, 2007) dictionaries containing the probabilistic translation dictionary information and, for each possible translation, a set of three concordances.

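    use NAT::Client;

    # Connect to the EuroParl Portuguese-English corpus.
    my $client = NAT::Client->new(crp => "EuroParl-PT-EN");
    my %stardict;

    # Iterate over every Portuguese entry of the PTD.
    $client->iterate(
        { Language => "PT" },
        sub {
            my %param = @_;
            for my $trans (keys %{ $param{trans} }) {
                # Keep only translations with probability above 10%.
                if ($param{trans}{$trans} > 0.1) {
                    my $concs = $client->conc(
                        { concordance => 1 },
                        $param{word}, $trans);
                    $stardict{ $param{word} }{$trans} = $concs->[0];
                }
            }
        });

    # Pass a reference to the collected hash to the StarDict writer.
    print StarDict(\%stardict);

Figure 8: Perl code to create a StarDict dictionary.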

This tool was also an exercise to see how versatile the NATools API is. The basic structure of the dictionary to be converted to StarDict can be created with just a few lines of Perl code (see figure 8).

The process iterates over all the entries in the probabilistic translation dictionary. For each entry we fetch concordances for each probable translation (those with an association above 10%).


270

5 Conclusions

While a lot of work remains to be done within NATools, mostly on efficiency, being open-source makes it easier. Any researcher can contribute code, submit bug reports, and get some support for free.

The whole NATools framework proved to be robust enough for corpora of different sizes. It was tested with Le Monde Diplomatique (PT:FR) (Correia, 2006), JRC-Acquis (PT:ES, PT:EN, PT:FR) (Steinberger et al., 2006) and EuroParl (PT:ES, PT:EN, PT:FR) (Koehn, 2002). All these corpora are available for querying on the Internet.

NATools includes some other small tools not described in this paper. For instance, there is a set of small tools that grew out of experiments and were kept in the package, such as tools to compare probabilistic translation dictionaries, tools to rank (or classify) translation memories according to their translation probability, and others.

Acknowledgment

Alberto Simões has a scholarship from Fundação para a Computação Científica Nacional and the work reported here has been partially funded by Fundação para a Ciência e Tecnologia through project POSI/PLP/43931/2001, co-financed by POSI, and by POSC project POSC/339/1.3/C/NAC.

References

Brown, Ralf D. 2000. Automated generalization of translation examples. In Eighteenth International Conference on Computational Linguistics (COLING-2000), pages 125–131.

Brown, Ralf D. 2001. Transfer-rule induction for example-based translation. In Michael Carl and Andy Way, editors, Workshop on Example-Based Machine Translation, pages 1–11, September.

Christ, Oliver, Bruno M. Schulze, Anja Hofmann, and Esther König. 1999. The IMS Corpus Workbench: Corpus Query Processor (CQP): User’s Manual. Institute for Natural Language Processing, University of Stuttgart, March.

Correia, Ana Teresa Varajão Moutinho Pereira. 2006. Colaboração na constituição do corpus paralelo Le Monde Diplomatique (FR-PT). Relatório de estágio, Conselho de Cursos de Letras e Ciências Humanas, Universidade do Minho, Braga, Dezembro.

Danielsson, Pernilla and Daniel Ridings. 1997. Practical presentation of a “vanilla” aligner. In TELRI Workshop in alignment and exploitation of texts, February.

Frankenberg-Garcia, Ana and Diana Santos. 2001. Apresentando o COMPARA, um corpus português-inglês na Web. Cadernos de Tradução, Universidade de São Paulo.

Frankenberg-Garcia, Ana and Diana Santos. 2003. Introducing COMPARA, the Portuguese-English parallel translation corpus. In Silvia Bernardini, Federico Zanettin and Dominic Stewart, editors, Corpora in Translation Education. Manchester: St. Jerome Publishing, pages 71–87.

Gale, William A. and Kenneth Ward Church. 1991. A program for aligning sentences in bilingual corpora. In Meeting of the Association for Computational Linguistics, pages 177–184.

Guinovart, Xavier Gómez and Elena Sacau Fontenla. 2005. Técnicas para o desenvolvemento de dicionarios de tradución a partir de corpora aplicadas na xeración do Dicionario CLUVI Inglés-Galego. Viceversa: Revista Galega de Traducción, 11:159–171.

Hiemstra, Djoerd. 1998. Multilingual domain modeling in Twenty-One: automatic creation of a bi-directional lexicon from a parallel corpus. Technical report, University of Twente, Parlevink Group.

Hiemstra, Djoerd. August 1996. Using statistical methods to create a bilingual dictionary. Master’s thesis, Department of Computer Science, University of Twente.

Koehn, Philipp. 2002. EuroParl: a multilingual corpus for evaluation of machine translation. Draft, Unpublished.

Och, Franz Josef. 1999. An efficient method for determining bilingual word classes. In the 9th Conference of the European Chapter of the Association for Computational Linguistics, pages 71–76.

Och, Franz Josef and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30:417–449.

Petersen, Ulrik. 2004. Emdros - a text database engine for analyzed or annotated text. In 20th International Conference on Computational Linguistics, volume II, pages 1190–1193, Geneva, August.

Simões, Alberto and J. João Almeida. 2006a. Combinatory examples extraction for machine translation. In Jan Tore Lønning and Stephan Oepen, editors, 11th Annual Conference of the European Association for Machine Translation, pages 27–32, Oslo, Norway, 19–20 June.

Simões, Alberto and J. João Almeida. 2006b. NatServer: a client-server architecture for building parallel corpora applications. Procesamiento del Lenguaje Natural, 37:91–97, September.

Simões, Alberto, Ruben Fonseca, and José João Almeida. 2007. Makefile::Parallel dependency specification language. In Euro-Par 2007, Rennes, France, August. Forthcoming.

Simões, Alberto, Xavier Gómez Guinovart, and José João Almeida. 2004. Distributed translation memories implementation using webservices. Procesamiento del Lenguaje Natural, 33:89–94, July.

Simões, Alberto M. and J. João Almeida. 2003. NATools – a statistical word aligner workbench. Procesamiento del Lenguaje Natural, 31:217–224, September.

Simões, Alberto Manuel Brandão. 2004. Parallel corpora word alignment and applications. Master’s thesis, Escola de Engenharia - Universidade do Minho.

Steinberger, Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, and Dániel Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In 5th International Conference on Language Resources and Evaluation (LREC’2006), Genoa, Italy, 24–26 May.

Tiedemann, Jörg. 2003. Combining clues for word alignment. In 10th Conference of the European Chapter of the ACL (EACL03), Budapest, Hungary, April 12–17.

Tiedemann, Jörg and Lars Nygaard. 2004. The OPUS corpus - parallel & free. In Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal, May 26–28.

Zheng, Hu, Evgeniy, and Alex Murygin. 2007. StarDict. Software and documentation homepage, StarDict, http://stardict.sourceforge.net/, January.


272

DEMONSTRATIONS

A tool for manipulating bilingual corpora using lexical distance∗

Rafael Borrego Ropero and Víctor J. Díaz Madrigal
Departamento de Lenguajes y Sistemas Informáticos
E. T. S. Ingeniería Informática - Universidad de Sevilla
Avda. Reina Mercedes s/n 41012-Sevilla (Spain)
{rborrego, vjdiaz}@us.es

Resumen: En este artículo se presenta una herramienta que permite anotar corpora bilingüe y realizar alineamiento entre textos usando heurísticas basadas en frecuencia, posición y cercanía léxica (con Edit Distance). La anotación de corpora bilingüe es una tarea muy laboriosa pero esencial a la hora de desarrollar bases de conocimiento para la realización de traducciones automáticas entre distintos idiomas. Esta herramienta ayuda a esta tarea, permitiendo anotar de forma rápida y sencilla. Incluye características que facilitan la edición de textos planos y de textos anotados.
Palabras clave: Alineamiento, Etiquetado de entidades, Edit Distance, Corpora Bilingüe

Abstract: This article presents a tool for labeling bilingual parallel corpora and aligning texts using heuristics based on word frequency, position and lexicographical similarity (using Edit Distance). Bilingual corpora annotation is a very laborious task, but it is essential when developing knowledge bases for automatic translation between different languages. This tool helps with this task, allowing texts to be annotated quickly and simply. It includes features that help in editing plain and annotated texts.
Keywords: Alignment, Named Entity Recognition, Bilingual corpora, Edit Distance

1. Introduction

The system we present was developed to support one of the tasks of the NERO project (TIN 2004-07246-C03-03). It facilitates the alignment of named entities in parallel corpora based on several heuristics described in (Borrego and Díaz, 2007). Text alignment consists of identifying, in a bilingual corpus, which parts (paragraphs, sentences, words) of one of the corpora correspond to those of the other. Since annotation is a very laborious and difficult task, a corpus visualization and editing tool has been developed to support annotation; it detects alignments between sets of words. The objectives set for its development were the following:

• Build a portable and extensible application that allows parallel corpora to be annotated efficiently.

• Provide a graphical interface that makes the application easy to use, displaying the corpora intuitively (without requiring any knowledge of the heuristics used or of XML).

• Allow parallel corpora to be annotated, relating a set of words in one language to its equivalent in the other.

• Apply heuristics and a voting system to obtain alignments between sets of words in one language and their equivalents in the other.

• Define and modify tag sets (create, edit and delete tags).

• Read and write annotated corpora in different tagging formats, splitting texts using regular expressions or automatically.

• Query the corpora about their annotations, and view their properties.

• Automatically generate reports on the results of the annotations made.

∗ This work has been partially funded by the Ministerio de Educación y Ciencia (TIN 2004-07246-C03-03)
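The lexical-closeness heuristic relies on edit distance. As a reminder of how that measure works, here is a minimal Python sketch of the standard Levenshtein distance together with a normalized similarity score. The normalization and the function names are assumptions made for illustration; the tool itself is written in Java and its exact scoring is described in (Borrego and Díaz, 2007), not here.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute or match
            ))
        previous = current
    return previous[-1]


def lexical_similarity(a, b):
    """Similarity in [0, 1]; 1.0 means identical strings (assumed scaling)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    print(edit_distance("sevilla", "seville"))
    print(round(lexical_similarity("sevilla", "seville"), 3))
```

Named entities often keep a similar spelling across languages, which is why a score of this kind is a useful alignment clue.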

2. Technological aspects of the system

Certain technological decisions are worth highlighting. To meet the requirement of portability across operating systems, the application was implemented in Java.

As for the data, we chose an implementation built on the XML markup language. The first reason is that this markup language can be applied immediately to text tagging. This made it possible to define, in a simple way, a tagging format that is very flexible, extensible and easy to use, and that is easily handled by external applications. Moreover, it is a portable storage format that does not require any specific program to be installed.

XML was also chosen to store configuration data for the various aspects of the application, as well as data needed to make it easier to use, for example: project definitions, definitions of the regular expressions used to split text into sentences or words, stop words to be ignored, etc.
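Splitting text with configurable regular expressions and ignoring stop words can be sketched as follows in Python. The two expressions and the stop-word set are placeholder assumptions; in the tool they would come from the XML configuration, whose actual format is not shown in this paper.

```python
import re

# Hypothetical defaults; the tool reads its own definitions from XML.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
WORD = re.compile(r"\w+")


def split_sentences(text):
    """Split text after sentence-final punctuation followed by spaces."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def content_words(sentence, stop_words):
    """Lower-case the words of a sentence and drop the stop words."""
    return [w for w in WORD.findall(sentence.lower())
            if w not in stop_words]


if __name__ == "__main__":
    text = "El corpus es grande. Se anota a mano."
    for sentence in split_sentences(text):
        print(sentence, content_words(sentence, {"el", "es", "se", "a"}))
```

Keeping the expressions outside the code makes it possible to adapt the splitting to each language without changing the application.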

To make it easier for the user to handle, the application can automatically convert plain-text documents to XML, given the path of the files and, optionally, information about their content or authors. The user can thus start working with the application without knowing XML and without having to convert between encoding formats. It also makes it possible to work with a corpus without altering its content, since the contents of the plain-text files are never modified.

With the above, the application meets the stated requirements: it can tag texts, display tagged corpora in different languages, etc.

3. Basic description of the system

The system is based on a graphical environment organized around two basic elements: a set of drop-down menus, from which all the actions currently available in the application can be selected, and a set of windows in which the texts and the structure of the corpus are displayed.

Each corpus is associated with a project that includes all the files into which it is divided. The main window is subdivided into two parts: the left part contains the structure and files of the current project (corpus), and the right part displays the project files whose content the user wants to see. The internal windows that show the content of each file are divided into two areas, one for each language, showing annotated words in a different font. In addition, after a set of words is selected in one of the areas, the equivalent sentence is indicated in the other area.

The files that make up the corpus can be viewed in two ways. The first is in the windows associated with the files, which show the content of each file, with annotated sets of words in a different color. The other is a special window that shows the set of words the file contains, indicating the start and end positions as well as the word type.

Annotation can be done at any time: the user only has to select the desired text with the mouse and indicate that the selection should be annotated. The reverse process is also possible, to remove a previously made annotation.

4. Trabajo futuro

Respecto al reconocimiento de entidadesserıa interesante incluir mas heurısticas pararealizar el alineamiento. Ademas, debido alo laborioso del proceso de anotacion, es fre-cuente la participacion de equipos. Esto im-plica dificultades relacionadas con el mante-nimiento de la coherencia en el proceso de eti-quetacion y la gestion de versiones de corpus.En este aspecto, pretendemos enriquecer laherramienta para incorporar funcionalidadesque faciliten este tipo de procesos.

Bibliografıa

Borrego, R. y V. Dıaz. 2007. Alineamien-to de Entidades con Nombre usando Dis-tancia Lexica. Procesamiento del Lengua-je Natural, 38(1):61–66.

Rafael Borrego y Víctor J. Díaz


MyVoice goes Spanish. Cross-lingual adaptation of a voice-controlled PC tool for handicapped people ∗

Zoraida Callejas, Univ. Granada, Granada [email protected]

Jan Nouza, Tech. Univ. Liberec, Liberec [email protected]

Petr Cerva, Tech. Univ. Liberec, Liberec [email protected]

Ramón López-Cózar, Univ. Granada, Granada [email protected]

Abstract: In this paper, we present the cross-lingual adaptation of the MyVoice system from Czech to Spanish. MyVoice was developed to allow motor-handicapped people to control their PCs and applications by voice. Our objective was to adapt it to Spanish quickly and cost-efficiently, using only the resources available for Czech. Experimental results show that up to 96.73% recognition accuracy can be achieved for Spanish using MyVoice's Czech speech recognition environment.
Keywords: cross-linguistic, speech recognition, applications for handicapped people

1 The MyVoice system

MyVoice is a software tool for controlling a PC and its programs by voice. It recognizes voice commands and translates them into one or more basic actions, which include virtual keyboard handling, moving the mouse, clicking mouse buttons, printing strings and executing programs. MyVoice was developed to facilitate access to new technologies for Czech motor-handicapped people, and has been successfully used by them since 2005 (Nouza, Nouza, and Cerva, 2005).

MyVoice is structured into several command groups, each dealing with a specific task. For example, the group that controls the mouse is different from the one that deals with the keyboard, but each can be reached easily from the other by a voice command. Grouping the commands makes interaction easier, as the user is aware of the valid words he or she can utter at any time and can easily navigate between groups. Furthermore, as a specific vocabulary was defined for each task, better recognition results are achieved.

∗ Development of the MyVoice software was supported by the Grant Agency of the Czech Academy of Sciences (grant no. 1QS108040569).

The system was designed to be user-friendly and customizable, and it can easily be adapted to user preferences through its configuration window. From there, the phonetics of words can be changed, commands can be added, edited and deleted, and new command groups can be introduced, all without expert knowledge of computers.

The MyVoice system was carefully designed and implemented and has been warmly welcomed by the Czech handicapped community. Our aim was to make it available to Spanish users as well, not by building a new system from scratch but by reusing the resources already developed for the Czech language. To reach this objective, we carried out a cross-lingual adaptation of the system so that Spanish commands could be recognized with the Czech speech recognition environment (i.e. acoustic, lexical and linguistic models), as explained in the next section.

2 MyVoice cross-lingual adaptation to Spanish

MyVoice commands were translated into Spanish and a cross-lingual adaptation procedure was applied to the Czech recognizer. The Czech recognizer's decoding module works with a lexicon of alphabetically ordered words, each represented by its text and phonetic form. For the cross-language application we used Spanish text along with an automatically generated Czech phonetic representation. The phoneme models built for the Czech recognizer could then be applied to the new task of recognizing Spanish words, using the Czech phonetic form to construct the acoustic models of the words by concatenating the corresponding phoneme models.

To automatically generate the Czech phonetic representation of the Spanish commands, a correspondence between Spanish and Czech phonemes was established by one Spanish native speaker and supervised by several Czech native speakers. The accuracy of such correspondences depends on the number of phonemes present in each language and the similarity between them. However, Czech and Spanish are very different in origin: Czech belongs to the Slavic family, like Russian, whereas Spanish is an Italic language, like Italian or French. Thus, one of the challenges of our work was to obtain a satisfactory mapping for such different languages, especially since previous research had obtained poor results in cross-language tasks between Slavic and Italic languages, for example (Zgank et al., 2004) with Slovenian and Spanish.
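The lexicon construction described in this section can be sketched in Python as follows. The correspondence rules below are purely illustrative assumptions (a handful of spelling-based rules with Czech-style symbols); they are not the expert-built table used in MyVoice, which the paper does not list, and the function names are hypothetical.

```python
# Illustrative Spanish-to-Czech correspondences (assumptions for this sketch,
# NOT the table actually used in MyVoice).
DIGRAPHS = {"ch": "č", "ll": "j", "qu": "k", "rr": "r"}
SINGLE = {"ñ": "ň", "j": "x", "v": "b", "z": "s", "h": "",
          "á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u"}


def to_czech_phonetic(word):
    """Map a Spanish word to a Czech-style phonetic string, longest match first."""
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in DIGRAPHS:
            phones.append(DIGRAPHS[pair])
            i += 2
        else:
            # Characters without a rule are passed through unchanged.
            phones.append(SINGLE.get(word[i], word[i]))
            i += 1
    return "".join(phones)


def build_lexicon(commands):
    """Return (text, phonetic form) entries ordered alphabetically by text."""
    return sorted((command, to_czech_phonetic(command)) for command in commands)


for text, phonetic in build_lexicon(["ventana", "abajo", "noche"]):
    print(text, phonetic)
```

Each entry pairs the Spanish text with a phonetic form built from Czech units, which is the shape of lexicon the decoding module is said to expect; a real table would map phonemes rather than spellings.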

3 Experimental results

Our first experiments were carried out with a female Spanish native speaker using the MyVoice software for her daily activities with the PC. For speech recognition, a gender-dependent model was used, obtaining a 93.92% accuracy rate. We then carried out speaker adaptation to try to improve this result further. After adaptation to our female speaker, 96.73% accuracy was obtained. It is important to note that these results are for real interaction with MyVoice, in which the vocabulary is restricted at each step to the list of commands in the current group; the size of a group ranges between 5 and 137 commands. To obtain meaningful results for the different speaker models independently of the groups visited during the interaction, we carried out an offline speech recognition process in which we used the whole MyVoice vocabulary, which is composed of 432 commands. With a gender-dependent model we obtained 91.03% accuracy, which speaker adaptation improved to 96.58%.
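One way to read these figures is as a relative reduction of the error rate. The short Python sketch below derives that quantity from the accuracies reported above; the metric and the function name are our own illustration and are not reported in the paper.

```python
def relative_error_reduction(baseline_accuracy, adapted_accuracy):
    """Relative reduction of the error rate, in percent (inputs in percent)."""
    baseline_error = 100.0 - baseline_accuracy
    adapted_error = 100.0 - adapted_accuracy
    return 100.0 * (baseline_error - adapted_error) / baseline_error


# Accuracies taken from the text: gender-dependent model vs. speaker-adapted model.
online = relative_error_reduction(93.92, 96.73)   # restricted vocabulary per group
offline = relative_error_reduction(91.03, 96.58)  # full 432-command vocabulary
print(f"online: {online:.1f}%  offline: {offline:.1f}%")
```

Under this reading, speaker adaptation removes roughly 46% of the errors in the online setting and roughly 62% in the offline setting.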

4 Conclusions

In this paper we have presented the adaptation of the MyVoice system for voice control of a PC from Czech to Spanish. We have empirically demonstrated that cross-lingual adaptation of the speech recognition environment can be done in a short time by carrying out an expert-driven correspondence between the two languages' phonetic alphabets. Experimental results using the Spanish version of MyVoice showed that 96.58% offline and 96.73% online accuracy can be obtained. These are very promising results, as they show that speech recognizers can be ported in a straightforward way and that this approach can achieve good results even with languages as phonetically different as Czech (Slavic) and Spanish (Italic).

References

Zgank, A., Z. Kacic, F. Diehl, K. Vicsi, G. Szaszak, J. Juhar, and S. Lihan. 2004. The COST 278 MASPER initiative - crosslingual speech recognition with large telephone databases. In Proceedings of LREC 2004, Lisbon, Portugal, May.

Nouza, J., T. Nouza, and P. Cerva. 2005. A multi-functional voice-control aid for disabled persons. In Proceedings of International Conference on Speech and Computer (SPECOM 2005), pages 715–718, Patras, Greece, October.


HistoCat and DialCat: extensions of a morphological analyzer for handling historical and dialectal Catalan texts

Jordi Duran Cals
THERA SL
Adolf Florensa s/n, 08028-Barcelona
[email protected]

Mª Antònia Martí Antonín
Universitat de Barcelona
Gran Vía [email protected]

M. Pilar Perea Sabater
Universitat de Barcelona
Gran Vía 585, 08007-Barcelona
[email protected]

Abstract: Historical and dialectal Catalan texts cannot be morphosyntactically annotated automatically, because there is no standard reference variety that would allow a homogeneous and systematic treatment. The objective of the HistoCat and DialCat projects has been to develop a semi-automatic annotation environment for these corpora, building on existing tools for the morphosyntactic annotation of Catalan texts, that keeps manual annotation to a minimum.
Keywords: Historical and dialectal corpora, Morphosyntactic Annotation, Corpus Linguistics.

1 Introduction. Motivation

Historical and dialectal Catalan texts cannot be morphosyntactically annotated automatically, because there is no standard reference variety that would allow a homogeneous and systematic treatment.

Until now, the morphosyntactic annotation of these corpora has been done manually, because no automatic or semi-automatic annotation and lemmatization system was available (Albino, 2006).

In the old language, since there is no standard reference variety, we find a great multiplicity of spellings for the same word. In the case of dialectal variants, we have to face the problem of deciding how to transcribe orthographically the forms specific to certain dialect areas, which are not represented in the dictionaries of the language. It is a fact that the lexicographic tradition includes very little dialectal representation.

The objective of the HistoCat and DialCat projects has been twofold. On the one hand, the aim was to develop a tool for the automatic morphosyntactic analysis of historical and dialectal Catalan texts; on the other, to compile the lexicon of the old language and a present-day dialectal lexicon from corpora.

The corpus of the old language (HistoCat) contains 97,603 words and consists of texts from the 14th, 15th and 16th centuries. The dialectal corpus (DialCat) consists of 23 oral texts in phono-orthographic transcription (cf. Viaplana and Perea, 2003) that represent local varieties of the six major Catalan dialects, and contains 36,450 words.

The projects presented here have consisted of developing a semi-automatic annotation environment that builds on existing tools for the morphosyntactic annotation of Catalan texts and keeps manual annotation to a minimum.

2 Linguistic treatment of the corpora

In addition to the basic morphosyntactic information corresponding to the PoS, the historical corpus records the century, the work and the author. The dialectal corpora record the dialect, the dialectal variant and the informant. The annotator can also indicate whether a word is a derivative, a loanword from another language, or a barbarism.

3 Technological characteristics

The semi-automatic analysis system is based on an extended version of the HS-Morfo analyzer¹. It consists of three modules: 1) the tagger with the standard annotation system; 2) the tagger with the historical/dialectal annotation system; 3) the validation interface.

3.1 Standard tagger

This module consists of the tagger with the annotation system for the standard language, the HS-Morfo analyzer. It is the first module in the processing chain: it takes plain text as input and produces a document with the text segmented and annotated, in which each form receives the lemmas and PoS tags that can be associated with it. Words it does not recognize because they belong to the historical or dialectal lexicon are handled by the next module.

3.2 Tagger with the historical/dialectal form lexicon

This second module completes the annotation of forms specific to the historical or dialectal vocabulary: both forms that were not recognized by the standard analysis module and forms that were recognized but are ambiguous and may receive new interpretations.

¹ HS-Morfo is an analyzer provided by the company THERA SL for the development of the project. The technological development was carried out by that company.

3.3 The validation interface

This last module serves two purposes. First, the user validates which lemma-PoS pair, among those proposed for each form by the two previous modules, is correct in context. Second, it allows new information to be added, specifically new lemma-PoS pairs for words that were not analyzed by the previous modules.

Once entered, this information becomes part of the annotation system of the second module, the one that detects historical or dialectal forms. In this way the historical and dialectal form lexicons are fed back incrementally and remain available for future processing.
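The interplay of the three modules can be sketched as follows. This is a minimal Python illustration under stated assumptions: lexicons are modeled as plain dictionaries from a form to a list of (lemma, PoS) pairs, and the function names, the tag `NC` and the sample entries are hypothetical; the actual HS-Morfo data structures are not described in the paper.

```python
def analyze(form, standard_lexicon, variant_lexicon):
    """Modules 1 and 2: collect candidate (lemma, PoS) pairs for a form."""
    candidates = list(standard_lexicon.get(form, []))
    for pair in variant_lexicon.get(form, []):
        if pair not in candidates:
            candidates.append(pair)
    return candidates


def validate(form, chosen, standard_lexicon, variant_lexicon):
    """Module 3: accept the annotator's choice and feed new pairs back."""
    known = analyze(form, standard_lexicon, variant_lexicon)
    if chosen not in known:
        # New information is stored in the historical/dialectal lexicon only.
        variant_lexicon.setdefault(form, []).append(chosen)
    return chosen


standard = {"aigua": [("aigua", "NC")]}   # hypothetical standard entry
variant = {}                              # starts empty and grows with use

print(analyze("aygua", standard, variant))            # unknown spelling
validate("aygua", ("aigua", "NC"), standard, variant)
print(analyze("aygua", standard, variant))            # now recognized
```

The second call to `analyze` finds the pair added during validation, which is the feedback loop described above: each validated form reduces the manual work needed for later texts.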

4 Extensions of the system

This system can easily be extended to other languages, provided a morphological analyzer for the standard language is available.

A web query interface is currently being developed that will allow the lexicon to be retrieved according to the criteria applied in the annotation process.

5 Acknowledgements

DialCat (HUM2005-24445-E) and HistoCat (HUM2005-24438-E) are two projects funded by the Ministerio de Educación under the Acciones Complementarias programme.

Bibliography

Albino Pires, Natalia (2006) ‘ULISES: un Integrated Development Environement desarrollado para la anotación de un corpus romancístico’. Procesamiento del Lenguaje Natural, n. 37. September 2006.

Viaplana, J. and Perea, M. P. 2003. Corpus oral dialectal. Una selecció. Barcelona. PPU.


MorphOz: a development platform for multilingual syntactic-semantic analyzers

Oscar García Marchena

Departamento de Lingüística, VirtuOz S.A.
47, rue de la Chaussée d’Antin, 75009 París
[email protected]

Laboratorio de Lingüística Formal, Universidad Paris VII
30, Chateau de rentiers, 75013 París
[email protected]

1. A syntactic-semantic analyzer

MorphOz is a platform for developing linguistic knowledge that makes it possible to build syntactic-semantic analyzers for any language. These analyzers differ from other parsers in that their syntactic analyses are accompanied by semantic analyses generated from the syntactic analysis obtained. These semantic representations are language-independent and, in principle, identical for sentences with the same meaning in any language.

There are various possible technological applications for analyzers capable of multilingual meaning representation. Their creator, the company VirtuOz, uses them to build dialogue agents or chatbots: the user interacts with an interface that transforms human utterances into semantic representations, to which it can respond proactively over the course of a conversation.

MorphOz uses a model of grammatical analysis different from that of other analyzers: instead of analyzing the linear order of the sentence, it generates a tree representation of its deep syntax, thereby abstracting the grammatical analysis away from syntagmatic order. This type of representation derives from dependency grammar (Tesnière: 1959) and is based on a linguistic model, Meaning-Text Theory (Teoría Sentido-Texto, TST) (Mel’čuk: 1988), implemented by means of a unification grammar that is itself a recent model of linguistic representation, Polarized Unification Grammar (gramática de unificación polarizada, GUP) (Kahane: 2004). This system has the advantage of being a modular linguistic model, which allows morphological information, the lexicon, syntactic constructions, their semantics, and word order to be separated into independent dimensions of analysis.

2. Multilingual adaptation

2.1. Grammatical parameters in linguistic typology

Recent models in formal linguistics (HPSG, LFG, etc.) propose a grammatical organization of language that is at once, and to varying degrees, lexicalist and constructionist. Grammatical information about how the units of a given language combine is encoded in three areas: lexicon, constructions, and word order. The lexicon identifies the (sub)category, the meaning, and the morphology that links a token to a lemma; constructions indicate the structure in which that (sub)category appears. Finally, word order specifies the possible positions of the arguments. Once languages are parameterized in this way, we can formalize the degree of grammaticalization of each of these modules: a grammar of Chinese will contain a vocabulary without morphological information, several grammatical constructions, and few linear order rules, reflecting a rigid word order. For Spanish, by contrast, a good deal of morphological information will be needed in the lexicon, along with numerous linear order rules, to formalize the variety of possible syntagmatic orders.

2.2. Grammatical parameters in MorphOz

Following this lexico-constructionist current in present-day formal linguistics, MorphOz has a modular system that makes it possible to separate the different types of linguistic information, handle them independently, and even transfer shared parameters to other languages with structural similarities. Building an analysis engine for any language in MorphOz thus amounts to distributing the linguistic resources appropriately across three areas: lexicon (with category, semantic and morphological information), constructions, and word order. The lexicon of each language is treated as a non-transferable module, but not so the inventory of grammatical categories; the constructions associated with the categories, and the word order, can often be exported to genetically or typologically close languages. Grammatical constructions describe syntactic dependencies: they identify heads and dependents, and the grammatical functions that label each dependency (subject, direct object, indirect object, adverbial, etc.). Constructions also contain semantic information: each lexeme corresponds to a sememe-definition, which occupies a place in an ontology (based on Wordnet), and each syntactic function corresponds to a regular semantic role (agent, theme, patient, etc.). Although this decision is extremely problematic from a theoretical point of view, it fits well with the purposes of semantic representation in TST (Nasr: 1996). This final semantic representation must be the same for all languages. The linguist's final task is therefore to check that the semantic representations of sentences with equivalent meaning are identical across languages, despite differences in their deep-syntax (dependency syntax) representations.

2.2.1. Constructions

For the Romance languages, around 80% of the constructions were shared in building the grammars of Spanish, Italian and Portuguese. Some 70% are shared between these languages and French. The structures that differ are mainly verbal (sub)categories with different subcategorization, chiefly because there are no rules for alternations in the realization of valencies. To avoid calquing grammatical models from different linguistic traditions, for other languages a complete construction grammar is integrated directly, though always inspired by the solutions already adopted. The verbless sentences of Chinese, for example, thus follow the same scheme as Romance nominal sentences, in which the copular verb contributes no meaning but forms a predicate with its attribute.

2.2.2. Word order

Word order is encoded following the TST system, according to which linear order corresponds to a relation of distances to the left or right between the head and its dependent. The transition between deep and surface syntax is limited to a mapping or projection of the dependencies onto the linearity of the language. The Romance languages differ in only a few rules, particularly regarding the order of clitics. Other applications concern the possibilities of realization in the clause periphery, or the passive in Chinese, which is defined solely in terms of word order.
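The idea of projecting an order-free dependency tree onto a linear sequence can be illustrated with a small Python sketch. It assumes, purely for the example, that each dependent carries a signed `offset` giving its distance to the left (negative) or right (positive) of its head; the paper does not give MorphOz's actual rule format, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    word: str
    offset: int = 0  # signed distance to the head: < 0 left, > 0 right
    dependents: List["Node"] = field(default_factory=list)


def linearize(node):
    """Project a dependency tree onto a word sequence using signed offsets."""
    left = sorted((d for d in node.dependents if d.offset < 0), key=lambda d: d.offset)
    right = sorted((d for d in node.dependents if d.offset > 0), key=lambda d: d.offset)
    words = []
    for dependent in left:
        words.extend(linearize(dependent))
    words.append(node.word)
    for dependent in right:
        words.extend(linearize(dependent))
    return words


# Same dependencies, two different order rules for the subject.
svo = Node("come", dependents=[
    Node("Juan", offset=-1),
    Node("manzanas", offset=1, dependents=[Node("las", offset=-1)]),
])
vos = Node("come", dependents=[
    Node("Juan", offset=2),
    Node("manzanas", offset=1, dependents=[Node("las", offset=-1)]),
])
print(" ".join(linearize(svo)))
print(" ".join(linearize(vos)))
```

Both trees contain the same head-dependent relations; only the offsets change, which is why the dependency and semantic layers can be shared across languages while order rules are swapped.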

3. Conclusion

Implementing a linguistic theory such as TST to build syntactic-semantic analyzers is useful in two ways: as a development platform for research in formal linguistics, and for a variety of industrial applications: conversational agents, multilingual understanding systems, etc. Deep-syntax analysis also offers an advantage over other analyzers: by separating word order from dependencies, we run no risk of confusing complements with adjuncts, whatever the position of the latter.

4. References

S. KAHANE, “Grammaires d’unification polarisées”, in 11ième Conférence annuelle sur le Traitement Automatique des Langues Naturelles (TALN’04), Fès, Maroc, France, 2004.

I. MEL’CUK, Dependency Syntax: Theory and Practice. Albany, N.Y., The SUNY Press, 1988.

A. NASR, Un modèle de reformulation automatique fondé sur la Théorie Sens Texte: Application aux langues contrôlées. Doctoral thesis in computer science, Universidad Paris 7, 1996.

L. TESNIÈRE, “Comment construire une syntaxe”, in Bulletin de la Faculté des Lettres de Strasbourg, 1934, 7 - 12, 219–229.


Statistical Dialog System and Acquisition of a New Dialog Corpus ∗

D. Griol, E. Segarra, L.F. Hurtado, F. Torres, F. García, M. Castro, E. Sanchis
Departament de Sistemes Informàtics i Computació
Universitat Politècnica de València. E-46022 València, Spain
{dgriol,esegarra,lhurtado,ftgoterr,fgarcia,mcastro,esanchis}@dsic.upv.es

Abstract: We present a dialog system whose main modules have been learned from a dialog corpus acquired within the framework of the DIHANA project. A demonstration of the operation of the complete system will be given. In addition, we describe how the architecture used to acquire the DIHANA corpus has been adapted to a new task within the framework of the EDECAN project.
Keywords: Dialog Systems, Corpus Acquisition, Statistical Models

1. Introduction: the DIHANA dialog system

Although building a computer application that can hold a conversation with a person in a natural way remains a challenge today, steady advances in Speech Technology research have made it feasible to build voice-based human-machine communication systems capable of mixed-initiative interaction during the dialog. One of the main lines of work of our research group is the development of statistical methodologies that model the processes of speech recognition, automatic language understanding and dialog management. In these approaches, the model parameters are learned automatically from a labeled dialog corpus.

∗ This work has been carried out within the EDECAN project, funded by the MEC and FEDER under number TIN2005-08660-C04-02, with support from GVA grant ACOMP07-197 and the Vicerectorat d’Investigació, Desenvolupament i Innovació of the UPV.

The main objective of the DIHANA project (Benedí et al., 2006) was the design and development of a dialog system providing voice access, through spontaneous speech in Spanish, to information on timetables, prices and services for national train journeys. Within this project, a corpus of 900 dialogs was acquired using the Wizard of Oz technique. For this acquisition, a strategy was designed for the Wizard to manage the dialog and select the next system response, based on the information supplied by the user up to the current point in the dialog and on the confidence measures associated with each of the information slots. The corpus was labeled with dialog acts. In addition, a platform was developed to support the Wizard's management tasks and to display the results generated by the system modules that operated automatically. Detailed information on the acquisition and labeling of the DIHANA corpus can be found in (Benedí et al., 2006).
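To make the role of slot values and confidence measures more concrete, here is a minimal Python sketch of a rule that picks the next system action. The slot names, the threshold and the three action types are assumptions chosen for illustration; the strategy actually followed by the Wizard is documented in (Benedí et al., 2006), not in this paper.

```python
REQUIRED_SLOTS = ("origin", "destination", "date")  # hypothetical task slots
CONFIDENCE_THRESHOLD = 0.5                          # hypothetical threshold


def next_system_action(slots):
    """Choose the next action; slots maps a slot name to (value, confidence)."""
    for name in REQUIRED_SLOTS:
        if name not in slots:
            return ("ask", name)          # information still missing
        _value, confidence = slots[name]
        if confidence < CONFIDENCE_THRESHOLD:
            return ("confirm", name)      # supplied, but unreliable
    return ("answer", None)               # everything known with confidence


state = {"origin": ("Valencia", 0.9), "destination": ("Sevilla", 0.3)}
print(next_system_action(state))
```

In the system described here this decision is learned statistically from the corpus rather than hand-written; the rule above only shows what inputs such a decision depends on.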

As a result of the project, a mixed-initiative dialog system capable of interacting in the task domain has been developed. The behavior of the system's main modules is based on statistical models learned from the DIHANA corpus. The system integrates the Sphinx-II automatic speech recognizer (cmusphinx.sourceforge.net), whose acoustic and language models were learned from the acquired corpus. The speech understanding module is implemented with statistical models learned from the corpus. Text-to-speech synthesis is performed with the Festival synthesizer (www.cstr.ed.ac.uk/projects/festival). Task-related information is stored in a PostGres database, which uses train information extracted from the web. For dialog management, a statistical dialog model learned automatically from the corpus has been developed (Hurtado et al., 2006). Figure 1 shows the architecture of the dialog system developed for the DIHANA project.

Figure 1: Architecture of the DIHANA system

2. The EDECAN project

One of the main objectives of the EDECAN project (Lleida et al., 2006) is to increase the robustness of a spontaneous-speech dialog system through its adaptation and personalization to different acoustic and application environments. Within the project, a complete dialog system will be developed for accessing an information system through spontaneous speech (as with the DIHANA system). The domain defined for the system is multilingual (Catalan and Spanish) querying of a system that provides information on the availability and booking of sports facilities at our university. Statistical approaches will be used to develop this system, as described for the DIHANA system. A dialog corpus for the new task is therefore needed.

To acquire this corpus with real users, we propose a dialog system architecture (see Figure 2) involving two Wizards of Oz. The first Wizard will replace the speech recognition and understanding modules. The second Wizard will supervise the behavior of an automatic dialog manager with an initial model learned from a corpus of simulated dialogs for the task (Hurtado et al., 2006) (Torres et al., 2005), and will be able to modify the response proposed by the manager whenever he or she considers it could be problematic. An additional module has been developed to simulate recognition and understanding errors, based on an analysis of the errors generated by our speech recognition and language understanding modules for the DIHANA task (García et al., 2007).

Figure 2: Proposed scheme for acquiring a corpus in the EDECAN project

3. Objectives of the demonstration

The demonstration will show the DIHANA dialog system in operation. Example dialogs will be presented that allow a proper assessment of the DIHANA system, as well as of the proposal for acquiring the EDECAN corpus.

Bibliography

Benedí, J.M. et al. 2006. Design and acquisition of a telephone spontaneous speech dialogue corpus in Spanish: DIHANA. In Proc. of LREC’06, Genoa.

García, F. et al. 2007. Recognition and Understanding Simulation for a Spoken Dialog Corpus Acquisition. In Proc. of the 10th International Conference on Text, Speech and Dialogue, TSD’07, Pilsen.

Hurtado, L.F. et al. 2006. A Stochastic Approach for Dialog Management based on Neural Networks. In Proc. of InterSpeech’06, Pittsburgh.

Lleida, Eduardo et al. 2006. EDECAN: sistEma de Diálogo multidominio con adaptación al contExto aCústico y de AplicacióN. In Proc. IV Jornadas en Tecnología del Habla, Zaragoza.

Torres, F. et al. 2005. Learning of stochastic dialog models through a dialog simulation technique. In Proc. of Eurospeech’05, Lisbon.


JBeaver: A Dependency Parser for Spanish ∗

Jesús Herrera
Departamento de Lenguajes y Sistemas Informáticos
[email protected]

Pablo Gervás, Pedro J. Moriano, Alfonso Muñoz, Luis Romero
Departamento de Ingeniería del Software e Inteligencia Artificial
Universidad Complutense de Madrid
C/ Profesor José García Santesmases, s/n, E-28040 Madrid
[email protected], {pedrojmoriano, alfonsomm, luis.romero.tejera}@gmail.com

Abstract: JBeaver is a dependency parser for Spanish built using a machine-learning tool (Maltparser). It is the only publicly available dependency parser for Spanish, is self-contained, easy to install and to use (through a graphical interface or console commands), and provides high precision. The system also makes it simple to train Maltparser models, so it can potentially be used as a dependency parser for any language.
Keywords: Dependency parsing, Maltparser, JBeaver

1. JBeaver

The ultimate goal was a freely distributed dependency parser for Spanish that would be easy to install and use. At the same time, the effort had to be bounded, given the project's limited resources.

1.1. Design Decisions and Choice of Resources

Under the project's requirements it was not feasible to develop dependency parsing algorithms from scratch, so resources had to be found that would avoid this work. One of them is Maltparser (Nivre et al., 2006), which was finally chosen for the features it offered: it was self-contained, easy to integrate as a subsystem, and gave notable results for the languages on which it had been tested so far.

Both training Maltparser and running the learned model as a parser require the part-of-speech tags of the words in the text. Since one of the goals was for JBeaver to accept unannotated text, to make it as easy to use as possible, the tool itself had to tag the input text with parts of speech. As with dependency parsing, developing part-of-speech tagging algorithms was not feasible either. It was therefore necessary to find a tool that was available, self-contained, reliable and easy to integrate into JBeaver; in the end this was TreeTagger (Herrera et al., 2007) (Schmid, 1994).

∗ Partially supported by the Spanish Ministry of Education and Science (TIN2006-14433-C02-01 project).

Both the training of Maltparser and the evaluation of the final product require suitably annotated corpora. This was resolved by using the Cast3LB corpus (Navarro et al., 2003), which contains Spanish texts annotated with constituency parses. To obtain corpora suitable for training Maltparser and evaluating JBeaver, a tool was developed to convert the Cast3LB constituency parses into dependency parses (Herrera et al., 2007).

Figure 1: JBeaver graphical interface

Another defining feature of JBeaver is its graphical user interface (see Figure 1). It displays the resulting analyses as graphs, so that the data are visually easy to interpret. The output is nevertheless also provided as a text file, so that other programs can process it easily. Graph rendering is delegated to Graphviz, another of the subsystems that make up JBeaver.

1.2. Tests

Of the various tests to which JBeaver was subjected during development, the most notable concern the performance of the parsing core, that is, the trained Maltparser model. For these, a 431-word fraction of the Cast3LB corpus not previously used to train the Maltparser model was selected, and a corpus with dependency analyses was generated from it and taken as the reference. The untagged texts were extracted from that corpus and parsed by the learned model. The parser's output was then compared with the reference, showing that 91% of the dependencies had been found correctly.
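The comparison against the reference can be pictured with a minimal sketch. It assumes each analysis is stored as one head index per token and that only heads are compared; the paper does not say whether dependency labels were also checked, and the function name and toy data are invented for illustration.

```python
from typing import List


def attachment_score(gold_heads: List[int], predicted_heads: List[int]) -> float:
    """Return the fraction of tokens whose predicted head equals the reference head."""
    if len(gold_heads) != len(predicted_heads):
        raise ValueError("both analyses must cover the same tokens")
    if not gold_heads:
        raise ValueError("cannot score an empty analysis")
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)


# Toy five-token sentence (0 marks the artificial root); not taken from Cast3LB.
gold = [2, 0, 4, 2, 4]
predicted = [2, 0, 2, 2, 4]
print(f"{attachment_score(gold, predicted):.0%}")  # 4 of the 5 heads agree
```

A figure such as the reported 91% would correspond to this ratio computed over all tokens of the held-out fraction.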

References

J. Herrera, P. Gervás, P.J. Moriano, A. Muñoz, L. Romero. 2007. Building Corpora for the Development of a Dependency Parser for Spanish Using Maltparser. (SEPLN, this volume).

B. Navarro, M. Civit, M.A. Martí, R. Marcos, B. Fernández. 2003. Syntactic, Semantic and Pragmatic Annotation in Cast3LB. Proceedings of the Shallow Processing on Large Corpora (SProLaC), a Workshop on Corpus Linguistics, Lancaster, UK.

J. Nivre, J. Hall, J. Nilsson, G. Eryigit and S. Marinov. 2006. Labeled Pseudo-Projective Dependency Parsing with Support Vector Machines. Proceedings of the CoNLL-X Shared Task on Multilingual Dependency Parsing, New York, USA.

H. Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.



NowOnWeb: a NewsIR System∗

Javier Parapar
IRLab, Computer Science Dept.
University of A Coruña, Spain
Fac. Informática, Campus de Elviña
15071, A Coruña, Spain

[email protected]

Álvaro Barreiro
IRLab, Computer Science Dept.
University of A Coruña, Spain


[email protected]

Abstract: Nowadays there are thousands of news sites available online. Traditional ways of accessing this huge news repository are no longer adequate. In this paper we present NowOnWeb, a news retrieval system that crawls articles from Internet publishers and lets users search and browse them.
Keywords: News system, information extraction, redundancy detection, text summarization.

1. Introduction

The huge amount of news information available online requires the use of Information Retrieval (IR) techniques to avoid overwhelming users. The main objectives of these IR methods are to reduce the time spent reading articles, to avoid redundancy and to provide topic search. In this context we present NowOnWeb¹, a NewsIR system that works with online news sources to show news articles to the user effectively and efficiently through a comfortable and friendly interface. It is based on our previous research and solutions in the IR field, and serves as a research platform to test and assess new solutions, algorithms and improvements developed in the area.

2. System Overview

NowOnWeb was designed as a Model-View-Controller web application following a component-based architecture. The main system components are: a crawler and an indexer that maintain an incremental index with a temporal window, a news recognition and extraction module that allows sources to be added dynamically, a news grouping component that uses a redundancy detection approach, and an article summariser based on the extraction of relevant sentences.

∗ Acknowledgements: This work was co-funded by the “Secretaría de Estado de Universidades e Investigación” and FEDER funds (MEC TIN2005-08521-C02-02) and “Xunta de Galicia” (PGIDIT06PXIC10501PN).

¹ An operational version with international news is available at http://nowonweb.dc.fi.udc.es

Figure 1: A snapshot of the application's appearance.

Our application offers the user: news searching across all the indexed publishers, query suggestion, query spelling correction, redundancy detection and filtering, query-biased summary generation, multiple output formats such as PDF or syndication services, and personalisation options such as source selection. All these features aim to make the system easier to use; for this reason the results are shown in a friendly and natural way (see Figure 1). To the same end, technologies such as AJAX were applied to improve the user experience and extend what the system can do.

3. Research Issues

The development of NowOnWeb involves three main research topics: news recognition and extraction, redundancy detection and summary generation.

3.1. News Recognition and Extraction

The problem here is to extract the news articles present in a heterogeneous set of pages, most of which contain no article. First we have to filter out the pages without interesting content; second, from the pages that do contain an article, we have to extract its fields (title, body, date and image, if present) from among a large amount of unwanted content.

We developed a news recognition and extraction technique based on domain-specific heuristics over the structure of the articles, which resulted in an efficient and effective algorithm.

3.2. Redundancy Detection

The objective here is to filter out redundant articles so as not to overload the user. To achieve this we developed an algorithm based on traditional techniques from the information filtering field (Zhang, Callan, and Minka, 2002).

Generally speaking, our method takes as input a ranking of documents sorted by their relevance to the user query. The algorithm dynamically assigns each document a redundancy score with respect to the redundancy sets already created. If that score exceeds a threshold for one of the sets, the document is added to that set; otherwise it forms a new redundancy group.
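The grouping loop described above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not give the redundancy score, so the sketch uses cosine similarity over raw term counts and takes the maximum similarity to any member of a set; the threshold value and the function names are likewise invented.

```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def group_redundant(ranked_docs: List[str], threshold: float = 0.6) -> List[List[str]]:
    """Walk the ranking once, adding each document to a set or opening a new one."""
    groups: List[List[str]] = []
    vectors: List[List[Counter]] = []
    for doc in ranked_docs:
        vec = Counter(doc.lower().split())
        best_index, best_score = -1, 0.0
        for i, members in enumerate(vectors):
            # Assumed score: highest similarity to any document already in the set.
            score = max(cosine(vec, m) for m in members)
            if score > best_score:
                best_index, best_score = i, score
        if best_index >= 0 and best_score > threshold:
            groups[best_index].append(doc)
            vectors[best_index].append(vec)
        else:
            groups.append([doc])
            vectors.append([vec])
    return groups


ranking = ["storm hits the coast", "storm hits the coast today", "election results announced"]
for group in group_redundant(ranking):
    print(group)
```

Because the input is processed in ranking order, the first document of each group is the most relevant one, which makes it a natural representative to show to the user.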

3.3. Summary Generation

The system offers the user summaries of the articles relevant to the query. These summaries are query-biased and are generated dynamically at retrieval time.

To achieve this we used a technique based on the extraction of relevant sentences. Each sentence is scored (Allan, Wade, and Bolivar, 2003) according to its relevance to the query. The highest-scoring sentences are chosen until the summary reaches the desired size, and they are then re-sorted to preserve their relative position in the original article.
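The select-then-reorder step can be illustrated with a short sketch. The scoring function here is an assumption made for the example (a count of query-term occurrences), not the sentence scoring of Allan, Wade, and Bolivar (2003) that the system uses; the sentence splitter and names are also illustrative.

```python
import re
from typing import List


def summarise(article: str, query: str, max_sentences: int = 2) -> List[str]:
    """Pick the sentences most related to the query, then restore article order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", article) if s.strip()]
    query_terms = set(re.findall(r"\w+", query.lower()))

    def score(sentence: str) -> int:
        # Assumed relevance score: number of tokens that are query terms.
        return sum(1 for w in re.findall(r"\w+", sentence.lower()) if w in query_terms)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # back to original relative position
    return [sentences[i] for i in chosen]


article = ("The council met on Monday. Flood defences were approved. "
           "Local teams played at the weekend. The flood budget doubles river spending.")
print(summarise(article, "river flood"))
```

Re-sorting the chosen indices is what keeps the summary readable: the best-scoring sentence is not necessarily shown first.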

4. Conclusions and Future Work

NowOnWeb is a NewsIR system that satisfies users' information needs, allowing them to stay up to date without wasting time.

We obtained an original solution that differs from existing ones in both the academic field (Columbia NewsBlaster (McKeown et al., 2002), Michigan NewsInEssence (Radev et al., 2005)) and the commercial field (Google News, Yahoo News or MSN Newsbot).

As future work we will address architectural improvements to the system, efficient storage and mining of query logs, and the evaluation of our news extraction algorithm.

References

Allan, James, Courtney Wade, and Alvaro Bolivar. 2003. Retrieval and novelty detection at the sentence level. In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 314–321, New York, NY, USA. ACM Press.

McKeown, Kathleen R., Regina Barzilay, David Evans, Vasileios Hatzivassiloglou, Judith L. Klavans, Ani Nenkova, Carl Sable, Barry Schiffman, and Sergey Sigelman. 2002. Tracking and summarizing news on a daily basis with Columbia's Newsblaster. In Proceedings of the Human Language Technology Conference.

Radev, Dragomir, Jahna Otterbacher, Adam Winkel, and Sasha Blair-Goldensohn. 2005. Newsinessence: summarizing online news topics. Commun. ACM, 48(10):95–98.

Zhang, Yi, Jamie Callan, and Thomas Minka. 2002. Novelty and redundancy detection in adaptive filtering. In SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 81–88, New York, NY, USA. ACM Press.


The Coruña Corpus Tool∗

Javier Parapar
IRLab, Computer Science Dept.
University of A Coruña, Spain


[email protected]

Isabel Moskowich-Spiegel
MUSTE, English Philology Dept.
University of A Coruña, Spain
Fac. Filología, Campus de Zapateira
15070, A Coruña, [email protected]

Abstract: The Coruña Corpus of scientific writing will be used for the diachronic study of scientific discourse at most linguistic levels, and will thereby contribute to the study of the historical development of English. The Coruña Corpus Tool is an information retrieval system that allows knowledge to be extracted from the corpus.
Keywords: Corpus linguistics, English scientific writing, information retrieval.

1. Introduction

The Coruña Corpus: A Collection of Samples for the Historical Study of English Scientific Writing has been under compilation since 2003 by the MUSTE Group of the University of A Coruña. Compilation is still in progress; so far we have gathered samples of approximately 10,000 words each from eighteenth- and nineteenth-century mathematics and astronomy.

To manage all the information that the corpus will contain and to make data gathering easier for linguists, a corpus management tool, the Coruña Corpus Tool (CCT), has been developed in collaboration with the IRLab of the University of A Coruña. In this demo we would like to present to the natural language processing community the main characteristics of the corpus compilation process and of its management tool.

∗ Acknowledgements: The research reported here has been funded by the Xunta de Galicia through its Dirección Xeral de Investigación e Desenvolvemento, grant number PGIDIT03PXIB10402PR (supervised by Isabel Moskowich-Spiegel). This grant is hereby gratefully acknowledged. The first author also acknowledges funding from the “Secretaría de Estado de Universidades e Investigación” and FEDER (MEC TIN2005-08521-C02-02) and the “Xunta de Galicia” (PGIDIT06PXIC10501PN).

2. The Coruña Corpus

The Coruña Corpus (CC) has been designed as a tool for the study of language change in English scientific writing in general, as well as within the different scientific disciplines. Its purpose is to facilitate investigation at all linguistic levels, though, in principle, phonology is not among our intended research topics. The CC contains English scientific texts, other than medical ones, produced between 1650 and 1900. Medical texts have been left out because they are being compiled by Taavitsainen and Pahta and their team in Helsinki (Taavitsainen and Pahta, 1997). Our project aims to complement other corpora pertaining to the history of what we nowadays call ESP, such as the well-known Corpus of Early English Correspondence, the Corpus of Early English Medical Writing, and the Lampeter Corpus of Early Modern English Tracts.

Of the six areas into which UNESCO divides Science and Technology, we are currently compiling text samples from: Exact and Natural Sciences (Mathematics, Astronomy, Physics and Natural History); Agricultural Sciences; and Humanities (Philosophy and History). We intend to compile the same number of samples for each scientific field in order to facilitate comparative studies. For each discipline we have selected two texts per decade, with each sample containing around 10,000 words, excluding tables, figures, formulas and graphs.

3. The Coruña Corpus Tool

In order to retrieve information from the compiled data, we decided to create a corpus management tool. This software application is currently in its testing phase. It is designed to help linguists extract and condense valuable information for their research. The Coruña Corpus Tool (CCT) is an Information Retrieval (IR) platform (see Figure 1) in which the indexed textual repository is the set of compiled documents that constitutes the CC. The texts that make up the CC were encoded and stored as XML documents. We chose to tag the information following the recommendations of the TEI (Text Encoding Initiative) standard (Sperberg-McQueen and Burnard, 2002). Several of the tagged fields that we want to index are extracted from the documents. We build a multi-field index to allow searches using different criteria; we store, for instance, information about authors, date, scientific field and corpus document identifier.

Figure 1: A snapshot of the application.
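The idea of extracting tagged fields into a multi-field index can be sketched in a few lines. This is a rough illustration only: the sample document, its element names, the `id` attribute and the author name are invented for the example and are not the CC's actual encoding, and plain Python dictionaries stand in for the real index.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, heavily simplified TEI-like document; not a real CC sample.
SAMPLE = """<TEI.2 id="doc001">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample treatise</title><author>Doe, J.</author></titleStmt>
    </fileDesc>
    <profileDesc><creation><date>1750</date></creation></profileDesc>
  </teiHeader>
  <text><body><p>Of the motion of the planets.</p></body></text>
</TEI.2>"""


def extract_fields(xml_text):
    """Pull the fields to be indexed out of one encoded document."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id", ""),
        "author": root.findtext(".//author", default=""),
        "date": root.findtext(".//date", default=""),
        "body": " ".join("".join(p.itertext()) for p in root.iter("p")),
    }


def build_index(documents):
    """Map field name -> value -> set of document identifiers."""
    index = defaultdict(lambda: defaultdict(set))
    for xml_text in documents:
        fields = extract_fields(xml_text)
        for field in ("author", "date"):
            index[field][fields[field]].add(fields["id"])
        for word in fields["body"].lower().split():
            index["body"][word.strip(".,;:")].add(fields["id"])
    return index


index = build_index([SAMPLE])
print(sorted(index["date"]["1750"]))
```

Keeping one mapping per field is what allows a search to be restricted by author, date or any other stored criterion independently of the full-text terms.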

It is fair to mention that we used some existing open-source libraries to implement the system. Among them we would like to mention Lucene (Apache, 2007), an indexing library widely used in the development of IR applications.

3.1. Features

Among others, the system offers the following functionalities:

Document validation: if a document is not correctly constructed according to the DTD rules, the syntax validator shows the coders the errors present in the document so that they can be fixed.

Basic term search: it can be run over the whole set of indexed documents or on an individual document. As the result of a user query, all the occurrences of a word are shown. For each occurrence the following information is available: document identifier, word position and concordance.

Advanced search: a number of custom search options are implemented to facilitate the extraction of research results:

Wildcard use: wildcard characters can be included to search for spelling variants of the same form over time.

Regular expression searching: it allows searching with patterns, which is useful, for example, to search by suffixes or prefixes.

Phrase search: combinations of words can be specified as a query, indicating the gap between the words. This can be used, for instance, to look for expressions or verbal forms.

Term list generation: generation of lexicon lists for the whole corpus or for an individual document, as chosen. An alphabetically sorted list of words with their number of occurrences is generated, filtered by the user's criteria.
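Some of the features listed above can be approximated with the Python standard library. The sketch below is an illustration, not the CCT implementation (which relies on Lucene): the function names, the tokenisation and the concordance format are assumptions, and phrase search with gaps is left out.

```python
import re
from collections import Counter
from fnmatch import translate


def concordance(doc_id, text, pattern, width=3, wildcard=False):
    """Return (document id, word position, context) for every matching word."""
    # With wildcard=True the pattern uses * and ?; otherwise it is a regular expression.
    regex = re.compile(translate(pattern) if wildcard else pattern, re.IGNORECASE)
    tokens = re.findall(r"\w+", text)
    hits = []
    for position, token in enumerate(tokens):
        if regex.fullmatch(token):
            left = " ".join(tokens[max(0, position - width):position])
            right = " ".join(tokens[position + 1:position + 1 + width])
            hits.append((doc_id, position, f"{left} [{token}] {right}".strip()))
    return hits


def term_list(text, min_count=1):
    """Alphabetically sorted (word, occurrences) pairs, filtered by frequency."""
    counts = Counter(w.lower() for w in re.findall(r"\w+", text))
    return sorted((w, c) for w, c in counts.items() if c >= min_count)


sample = "The colour of the sky and the color of the sea"
for hit in concordance("d1", sample, "colo*r", wildcard=True):
    print(hit)
print(term_list(sample, min_count=2))
```

The wildcard query finds both spellings in one pass, which is the kind of spelling variation over time that the wildcard feature targets; a regular expression such as `\w+our` would instead search by suffix.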

4. Conclusions

As previously explained, the CC is still a work in progress, and a great deal of text remains to be compiled and encoded. The CCT, however, is designed to be scalable and adaptable to the new needs of the corpus compilation process. It can already be used to manage any TEI-encoded corpus, and it offers the features most often demanded by linguists.

References

Apache, Foundation. 2007. Lucene: http://lucene.apache.org/.

Sperberg-McQueen, C. M. and L. Burnard. 2002. TEI P4: Guidelines for electronic text encoding and interchange. In Text Encoding Initiative Consortium. XML Version: Oxford, Providence, Charlottesville, Bergen.

Taavitsainen, Irma and Päivi Pahta. 1997. Corpus of early english medical. In ICAME '97: Proceedings of the International Computer Archive of Modern and Medieval English Conference, pages 71–78. Kluwer Academic Print on Demand.


WebJspell: an online morphological analyser and spell checker

Rui Vilela
Universidade do Minho, Departamento de Informática
Campus de Gualtar 4710-057 Braga, [email protected]

Abstract: Webjspell is a multipurpose Internet tool for morphological analysis and spell checking of texts written in Portuguese. It also provides example phrases, frequencies, verb conjugation tables, word suggestions for possible spelling errors, and spell checking of Internet pages. This article describes the features of Webjspell, its results, and possible extensions of its techniques to other applications.
Keywords: spell checking, morphological analysis.

1 Introduction

People feel a need to evaluate and improve their own writing. A wide range of linguistic resources, on paper or in digital form, is available to help everyone improve their knowledge of a language.

Everyone, and foreign-language learners in particular, needs more online resources to strengthen their understanding of a language, because such resources are scarce and comparatively expensive.

Webjspell was developed as a solution to this problem, particularly for the Portuguese language, by making a morphological analyser and a spell checker readily accessible.

2 Webjspell

Webjspell was developed to bring the morphological analyser Jspell to a wider audience. It is available online at http://linguateca.di.uminho.pt/jspell.

It was developed in collaboration with the Natura Project¹ and Linguateca² to provide a broader and more user-friendly interface. Development was done in Perl using the available Jspell module (Simões and Almeida, 2001).

Jspell and its Portuguese dictionary were developed in 1994 by José João de Almeida and Ulisses Pinto (Almeida and Pinto, 1995), based on the Ispell spell checker for UNIX environments. Jspell is an interactive command-line application, mainly for analysing words in text files.

¹ http://natura.di.uminho.pt
² http://www.linguateca.pt

The Portuguese dictionary is currently used with other open source applications, such as Firefox, Thunderbird and OpenOffice, and in a variety of research projects.

Webjspell adds further features by using the Jspell Perl interface. Besides a new interactive interface, it uses public domain services and logging. At its core, it is divided into four services: morphological analysis, spell checking, spell checking of Internet web pages, and word feedback or suggestion.

3 Morphological analyser

The morphological analyser, shown in Figure 1, is the most prominent of the available services. For each given word and language, the program obtains a morphological and semantic classification.

Improvements were made over the original Jspell, such as: verbose morphological classification; inflected words stemming from lemmas; example phrases from public corpora; word frequencies; suggestions; feedback; and verb conjugation tables.

Further improvements are planned to extend some features, for example through external online services such as language translation, word definition and thesaurus capability.

Figure 1: Morphological analysis

3.1 Spell Checker

The spell checker helps the user find and fix misspelled words, with recourse to word suggestions. Colours are used to mark errors and fixes, and also to identify foreign words.

Webjspell enhances some features of the Jspell module, such as the handling of missing spaces and hyphens, and of the converse cases.

Further improvements could be implemented, such as patterns for common phonetic errors, better exploitation of Jspell's morphological capabilities to find simple grammatical errors, filtered suggestions, and detection of duplicated words.

3.2 Web pages spell checker

Given an Internet address, this service searches the page for spelling mistakes: the program edits the page locally and uses colours to mark unknown words and foreign words found in other supported dictionaries.

3.3 Word suggestions

An interface that allows users to submit a wish list of words, which may or may not then be included in the dictionary.

4 Final considerations

After some months in the wild, Webjspell has produced feedback that is worth analysing for the improvement of various dictionaries.

Since the application was released, it has received more than 2400 searches per month, along with a surge in the number of word suggestions for the dictionaries, which has helped increase the quality and precision of several of them.

All words, especially those that Jspell is unable to identify, are kept for later analysis. This method helps identify typical user errors and new words. Assorted problems were fixed, both in features and in the interface, including the Perl interface and Jspell. Webjspell contributes to the development of the dictionary, on which several text processing applications depend.

References

Almeida, J.J. and Ulisses Pinto. 1995. Jspell – um módulo para análise léxica genérica de linguagem natural. In Actas do X Encontro da Associação Portuguesa de Linguística, pages 1–15, Évora 1994.

Simões, Alberto Manuel and José João Almeida. 2001. jspell.pm – um módulo de análise morfológica para uso em processamento de linguagem natural. In Actas da Associação Portuguesa de Linguística, pages 485–495.


PROJECTS

The Gari-Coter project∗ within the RICOTERM2 project∗∗

Fco. Mario Barcala Rodríguez and Eva M.ª Domínguez Noya
Centro Ramón Piñeiro para a Investigación en Humanidades
{fbarcala,edomin}@cirp.es

Pablo Gamallo Otero, Marisol López Martínez, Eduardo Miguel Moscoso Mato, Guillermo Rojo, María Paula Santalla del Río and Susana Sotelo Docío
Universidade de Santiago de Compostela
{pablogam, fgmarsol, fgmato, guillermo.rojo, fesdocio}@usc.es

Abstract: Description of the Gari-Coter project for the development of the Galician linguistic resources needed for a multilingual query re-elaborator.
Keywords: query expansion, corpus, terminological database, automatic terminology extraction

1. Current Status

As indicated in the acknowledgement note attached to the project acronym in the title, the project has been under way since 2004 and is due to close at the end of 2007. It has thus been running for two and a half years, so what we include here is a schematic presentation of what it set out to do, together with some of what are by now actual results, with one sixth of the project's development time remaining. That remaining time is expected to be devoted to integrating the resources and tools produced within each of the subprojects that make up the coordinated project RICOTERM2: Gari-Coter itself and the subproject that shares the coordinated project's name, RICOTERM2¹.

∗ “Creación e integración multilingüe de recursos terminológicos en gallego para Recuperación de Información mediante estrategias de control terminológico y discursivo en ámbitos comunicativos especializados”. Subproject funded by the Spanish Ministry of Education and Science between 2004 and 2007 (HUM2004-05658-C02-02/FILO), under the direction of M.ª Paula Santalla.

∗∗ “Control terminológico y discursivo para la recuperación de información en ámbitos comunicativos especializados, mediante recursos lingüísticos específicos y un reelaborador de consultas”. Coordinated project funded by the Spanish Ministry of Education and Science between 2004 and 2007 (HUM2004-05658-C02-00/FILO), under the direction of Mercè Lorente Casafont.

2. The Gari-Coter subproject within the coordinated RICOTERM2 project

The main goal of the coordinated RICOTERM2 project is to develop a prototype of a multilingual system for reformulating queries posed by Internet users seeking information about a specialised communicative domain, in our case economics. As described in (Lorente, 2005), the system will be integrated into an application consisting of an interface, hosted on a web portal specialised in economics, that transforms simple queries into multilingual queries expanded linguistically and conceptually. The current working languages are Catalan, Spanish, Galician, English and Basque. The general design of the prototype is also described in (Lorente, 2005). For the specific goals of the Gari-Coter subproject to be properly understood, it suffices to say here that, in order to improve the results of the Information Retrieval applications involved by means of query expansion techniques, the project uses both only-term expansion and full-text expansion methods. The former will rely on a domain ontology. The latter will rely on a structurally and linguistically annotated corpus specific to economics, which, with the help of tools such as automatic terminology extractors and the like, will serve to detect collocations or phraseology characteristic of the terms entered by the user or obtained by consulting the ontology.
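The two expansion strategies can be pictured with a toy sketch. Everything in it is hypothetical: the miniature ontology, the collocation table and the function name are invented for illustration and say nothing about how the RICOTERM2 prototype actually stores or combines its resources.

```python
from typing import Dict, List

# Hypothetical miniature ontology: term -> equivalents per working language.
ONTOLOGY: Dict[str, Dict[str, List[str]]] = {
    "inflation": {"es": ["inflación"], "gl": ["inflación"], "ca": ["inflació"]},
}

# Hypothetical collocations, standing in for phraseology mined from an annotated corpus.
COLLOCATIONS: Dict[str, List[str]] = {
    "inflation": ["inflation rate"],
}


def expand_query(term: str) -> List[str]:
    """Expand one query term with ontology equivalents and corpus collocations."""
    expanded = [term]
    for equivalents in ONTOLOGY.get(term, {}).values():
        expanded.extend(equivalents)              # ontology-based (only-term) expansion
    expanded.extend(COLLOCATIONS.get(term, []))   # corpus-based (full-text) expansion
    unique: List[str] = []
    for candidate in expanded:                    # drop duplicates, keep first-seen order
        if candidate not in unique:
            unique.append(candidate)
    return unique


print(expand_query("inflation"))
```

A term that appears in neither resource is returned unchanged, so the expansion step never loses the user's original query.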

Within this general approach, the Gari-Coter project (apart from shared goals which, as one might expect, concern the design and integration of everything produced into a web application) has as its own goals the creation of the resources for Galician: an economics corpus, suitably encoded and annotated by adapting existing processing tools for Galician, and a terminological database, obtained from earlier resources and from the exploitation of the corpus itself. With a little over six months to go before the project ends, these resources have been built in the form and size briefly described below.

2.1. The corpus

As for all the languages involved in the RICOTERM2 project, not one but in fact two domain subcorpora have been developed for Galician: a generic subcorpus and a specific one. The first comprises 609 newspaper articles totalling 206,510 words in 7,892 sentences. The second comprises 14 books and two specialised journals totalling 801,702 words in 34,588 sentences.

Both corpora are encoded in XML. Each document consists of a header with bibliographic and content information, followed by the document itself, structured down to sentence level. Both corpora have also been morphosyntactically annotated with information on part of speech and on the inflectional categories considered relevant.

In line with the general approach of the coordinated project (finding and reusing existing resources), to build both corpora we reached an agreement with the Centro Ramón Piñeiro para a Investigación en Humanidades², which provided us with texts from the CORGA corpus (Corpus de Referencia del Gallego Actual), linguistically processed with its own tagging system. All the annotation of the generic corpus was corrected manually.

2.2. The terminological database

The terminological database has been built, on the one hand, from earlier resources that were considerably heterogeneous³ in quality, size and reliability: two dictionaries, two electronic glossaries and the economics section of a terminological database, the last of these being without doubt the richest and most rigorous.

The database currently holds 6,046 economics terms obtained in this way, most of them with exhaustive information on lemma, part of speech and definition and, in most cases, equivalents in other languages and information on synonyms and hypernyms.

On the other hand, this set of terms, together with the corpus, has also been used to extend the terminological database by automatically extracting multiword terms with techniques based on contextual similarity measures. The latest experiment yielded 740 multiword terms, but the associated precision results, no doubt a consequence of the small size of the corpus, call at the very least for a manual review of these terms.

Notes

¹ With the same acronym and name as the coordinated project, funded by the Ministry of Education and Science between 2004 and 2007 and directed by Mercè Lorente (HUM2004-05658-C02-01/FILO).

² http://www.cirp.es. [Accessed: 6 June 2007].

³ Eiras: Eiras Rey, A.: Dicionario de economía, unpublished. Formoso: Formoso Gosende, V. (coord.) (1997): Diccionario de termos económicos e empresariais galego-castelán-inglés. Santiago de Compostela: Confederación de Empresarios de Galicia. Panlatin Electronic Commerce Glossary: http://fon.gs/panlatino. Glossary about commerce from galego.org: http://galego.org/vocabularios/ccomercial.html. SNL: http://www.usc.es/en/servizos/portadas/snl.jsp.

Bibliography

Lorente, M. 2005. Ontología sobre economía y recuperación de información [online]. Hipertext.net, (3). http://www.hipertext.net. [Accessed: 30 January 2007].


Portal da Língua Portuguesa

Maarten Janssen
Instituto de Linguística Teórica e Computacional (ILTEC)

Rua Conde de Redondo 74-5, Lisboa, [email protected]

Summary: The aim of the Portal da Língua Portuguesa project is to build a set of lexical resources with a twofold purpose. First, these resources serve as the source of information for a web site about the Portuguese language aimed at the general public. Second, they form a repository of lexical information for linguistic research. The database design is modular and relational, and was conceived to provide structural solutions to lexical problems such as homonymy, orthographic variation, etc.
Keywords: Lexical database, morphology, phonetics.

Abstract: The goal of the Portal da Língua Portuguesa project is to construct a set of lexical resources with a double objective. On the one hand, the resources serve as the content source for a web site about the Portuguese language, aimed at the general public. On the other hand, they are built to serve as an open source repository of lexical information for linguistic research. The design of the database is modular and relational, and is set up in such a way that it provides structural solutions for lexical difficulties like homonymy, orthographic variation, etc.
Keywords: Lexical database, morphology, phonetics

1 Project Description

The Portal da Língua Portuguesa (henceforth Portal) is a free, large-scale online resource on the Portuguese language, currently under development at the ILTEC institute in Lisbon, Portugal. It has a primary focus on lexical information, and is designed for the general language user. Although the Portal is the visible outlet of the Portal project, the goal of the project itself goes further: to create a set of lexical resources which, apart from their online availability, will serve as open source data for linguistic research. The project started from a lexical database called MorDebe, which primarily concerns inflectional morphology. The database is currently being transformed into an Open Source Lexical Information Network (OSLIN), which contains a much wider, open-ended range of lexical information. Additional types of lexical information currently under development are inherent inflections, pronunciation, and syllabification.

The Portal project itself is internally supported by the ILTEC institute, and has no strict delimitation. Work on the MorDebe database started in mid-2004, and the web site was launched in November 2006. The web site is intended to continue for an undetermined amount of time. The project has two full-time FCT-funded scholars assigned to it for a period of 3 years, starting from September 2006. The project is reinforced by satellite projects, which deal with specific parts of the database. A two-year project on the improvement and exploration of the derivational data in OSLIN will start in October 2007.

2 OSLIN Design

2.1 Main database

The main database of OSLIN (MorDebe) consists of a simple two-table structure: one table with lemmas, the other with the related word-forms. The lemma list consists of two parts: on the one hand, it contains the lemmas from the two major Portuguese dictionaries, and on the other hand, it contains words with a significant frequency in newspapers. In both parts of the database, strict lexicographic control is kept over the data, with a significant amount of human intervention, using computer-aided methods. The total number of lemmas at this moment is around 130,000, with constant additions being made, and there are well over 1.5 million word-forms.

Although the MorDebe database was set up for Portuguese, its design is largely language independent. The set of word classes and inflectional forms is determined in a separate database, and can easily be modified to accommodate languages with rich nominal inflection, or with other fundamental word classes.
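As a rough illustration of the two-table structure, the following sketch uses an in-memory SQLite database. The table names, column names and sample inflection tags are assumptions made for this example, not the actual MorDebe schema, and the word classes are kept in an extra table here only to mimic the idea that they are defined separately from the data.

```python
import sqlite3


def build_main_db():
    """Create an in-memory database with a lemma table, a word-form table
    and a separate (hypothetical) table listing the allowed word classes."""
    conn = sqlite3.connect(":memory:")
    # Ask SQLite to check the REFERENCES clauses declared below.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE word_class (
            code  TEXT PRIMARY KEY,
            label TEXT NOT NULL
        );
        CREATE TABLE lemma (
            lemma_id      INTEGER PRIMARY KEY,
            citation_form TEXT NOT NULL,
            word_class    TEXT NOT NULL REFERENCES word_class(code)
        );
        CREATE TABLE word_form (
            form_id    INTEGER PRIMARY KEY,
            lemma_id   INTEGER NOT NULL REFERENCES lemma(lemma_id),
            inflection TEXT NOT NULL,
            form       TEXT NOT NULL
        );
        """
    )
    return conn


def add_lemma(conn, citation_form, word_class, forms):
    """Insert one lemma and its word-forms; `forms` maps inflection tag -> form."""
    cursor = conn.execute(
        "INSERT INTO lemma (citation_form, word_class) VALUES (?, ?)",
        (citation_form, word_class),
    )
    lemma_id = cursor.lastrowid
    conn.executemany(
        "INSERT INTO word_form (lemma_id, inflection, form) VALUES (?, ?, ?)",
        [(lemma_id, tag, form) for tag, form in forms.items()],
    )
    return lemma_id


def forms_of(conn, lemma_id):
    """Return the (inflection, form) pairs of one lemma in insertion order."""
    rows = conn.execute(
        "SELECT inflection, form FROM word_form WHERE lemma_id = ? ORDER BY form_id",
        (lemma_id,),
    )
    return rows.fetchall()
```

Because word-forms point to a lemma identifier rather than to a spelling, two homonymous lemmas with the same citation form can coexist, each with its own paradigm.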

2.2 Inherent Inflection

In the database, inherent inflections (Janssen, 2005) are modelled in terms of relations between lemmas, using relations, called inflectional functions, similar to those in the Meaning-Text Theory (Mel’cuk, 1993). With these inflectional functions, verbs are related to their deverbal nouns (s0v), adjectives to their synthetic superlative (sup), etc. The inherent inflection database is still under construction, and currently contains over 20,000 derivational forms. The complete set of all inherent inflections attested in dictionaries is planned to be covered within a year.
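A minimal sketch of how such inflectional functions could be stored as labelled relations between lemmas is given below. Only the function names s0v and sup come from the text; the class names, the direction of the relation and the sample Portuguese pairs are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LemmaRelation:
    """One inflectional function applied to a source lemma, yielding a target lemma."""
    function: str  # e.g. "s0v" (deverbal noun) or "sup" (synthetic superlative)
    source: str
    target: str


class InherentInflectionStore:
    """Hypothetical in-memory store of relations between lemmas."""

    def __init__(self):
        self._relations = set()

    def relate(self, function, source, target):
        self._relations.add(LemmaRelation(function, source, target))

    def targets(self, function, source):
        """Lemmas reached from `source` through `function`."""
        return sorted(
            r.target for r in self._relations
            if r.function == function and r.source == source
        )

    def sources(self, function, target):
        """Lemmas from which `target` is reached through `function`."""
        return sorted(
            r.source for r in self._relations
            if r.function == function and r.target == target
        )
```

For instance, `store.relate("s0v", "construir", "construção")` would record a verb and its deverbal noun, and `store.relate("sup", "fácil", "facílimo")` an adjective and its synthetic superlative; both ends of each relation remain full lemmas with their own paradigms.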

There are two types of relations that are modelled in a way similar to inherent inflections, but are of a different nature. The first is a separate database of gentiles (demonyms): all nouns and adjectives indicating people or objects from a specific place or region are relationally marked as such. The difference with inherent inflection is that toponyms are not lemmas, and are stored in a separate database of proper names. The complete set of over 3000 gentiles attested in dictionaries has been modelled in this fashion.

The second special type of ‘inflectional function’ is the relation between orthographic variants. Orthographic variation is traditionally seen as an intra-word phenomenon. But the explicit modelling of inflectional paradigms makes it necessary to keep the different variants apart and to interrelate them by means of a relation (Janssen, 2006).
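The two special relations can be sketched along the same lines. In the hypothetical code below, gentiles point from a lemma to an entry in a separate set of proper names, and orthographic variants are kept as distinct lemmas joined by a symmetric relation. All class names and sample words are assumptions made for illustration only.

```python
from collections import defaultdict


class GentileIndex:
    """Links gentile lemmas to toponyms, which live outside the lemma list."""

    def __init__(self, toponyms):
        self._toponyms = set(toponyms)  # stands in for the proper-name database
        self._place_of = {}

    def mark(self, gentile_lemma, toponym):
        if toponym not in self._toponyms:
            raise KeyError(f"unknown toponym: {toponym!r}")
        self._place_of[gentile_lemma] = toponym

    def place_of(self, gentile_lemma):
        """Return the toponym for a gentile lemma, or None if it is not marked."""
        return self._place_of.get(gentile_lemma)


class VariantIndex:
    """Keeps orthographic variants apart as lemmas and interrelates them."""

    def __init__(self):
        self._groups = defaultdict(set)

    def link(self, lemma_a, lemma_b):
        # Merge the groups of both lemmas so that the relation stays symmetric.
        group = self._groups[lemma_a] | self._groups[lemma_b] | {lemma_a, lemma_b}
        for lemma in group:
            self._groups[lemma] = group

    def variants_of(self, lemma):
        """Return all other spellings linked to `lemma`, sorted alphabetically."""
        return sorted(self._groups.get(lemma, set()) - {lemma})
```

With this layout each variant, for example the hypothetical pair "ouro" and "oiro", can carry its own full inflectional paradigm while remaining retrievable from the other.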

2.3 Web Site Design

The web site of the Portal provides (or will provide) five different types of information: not only the lexical information from the MorDebe database, but also information on legislation, a dictionary of linguistic terms, a repository of online resources on Portuguese other than the Portal itself, and a collection of easy texts concerning the Portuguese language. With the current content, the web site already attracts some 1000 visitors each day, mainly language professionals such as translators and writers, and that number is steadily rising.

The use of the MorDebe data in an online service for the general public provides an excellent additional motivation for the creation of the lexical resources, and even opens up the possibility of commercial sponsoring.

2.4 Modular Design

The design of the OSLIN database is fully modular: each additional type of information is modelled in a separate database, linked to one of the existing tables, currently either the word-forms or the lemmas. This design makes it easy to extend the database with additional types of information. The main resource currently under development is a database of IPA transcriptions for all lemmas in the database, but various other types of information are under investigation. At this time, there are no plans to add semantic entities, merely due to lack of resources, not because the framework does not allow it. Ideally, the framework would be extended to other languages besides Portuguese in the near future. Using the same set-up for various languages would not only allow reusing the existing tools, but also make it possible to create cross-linguistic relations.
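The modular set-up can be pictured as a registry of independent modules, each attached to either the lemma level or the word-form level. The sketch below is an assumption-laden illustration: the class, the module names, the levels chosen for each module and the sample values are hypothetical.

```python
class ModularLexicon:
    """Registry of information modules keyed by lemma or word-form identifiers."""

    LEVELS = ("lemma", "word_form")

    def __init__(self):
        # module name -> (level it attaches to, {record id -> stored value})
        self._modules = {}

    def register_module(self, name, level):
        if level not in self.LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        if name in self._modules:
            raise ValueError(f"module already registered: {name!r}")
        self._modules[name] = (level, {})

    def annotate(self, name, record_id, value):
        """Store one value in module `name` for the given record identifier."""
        _level, records = self._modules[name]
        records[record_id] = value

    def describe(self, level, record_id):
        """Collect what every module attached to `level` knows about one record."""
        return {
            name: records[record_id]
            for name, (module_level, records) in self._modules.items()
            if module_level == level and record_id in records
        }
```

Adding a new type of information then amounts to registering one more module, for example `register_module("ipa", "lemma")`, without touching the existing tables.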

Bibliography

Janssen, Maarten. 2005. “Between Inflection and Derivation: Paradigmatic Lexical Functions in Morphological Databases”. In East West Encounter: second international conference on Meaning - Text Theory, Moscow, Russia.

Janssen, Maarten. 2006. “Orthographic Variation in Lexical Databases”. In Proceedings of EURALEX 2005, Turin, Italy.

Mel’cuk, Igor A. 1993. “The Future of the Lexicon in Linguistic Description”. In Ik-Wan Lee (ed.) Linguistics in the Morning Calm 3: Selected papers from SICOL-1992. Seoul, Korea.


Author Index

Alegria, Iñaki: 71
Almeida, José João: 265
Alonso, Laura: 123
Alonso, Miguel A.: 13
Arcas-Túnez, Francisco: 197
Armentano-Oller, Carme: 257
Artola, Xabier: 157
Barcala, Fco. Mario: 295
Barreiro, Álvaro: 287
Bel, Núria: 131
Bengoetxea, Kepa: 5
Bischoff, Shannon: 21
Borrego, Rafael: 275
Callejas, Zoraida: 277
Castellón, Irene: 123
Castro, María José: 283
Cerva, Petr: 277
Coria, Sergio R.: 223
Corpas, Gloria: 165
Cruz, Fermín: 173
de Pablo-Sánchez, César: 113
Díaz de Ilarraza, Arantza: 157
Díaz, Manuel Carlos: 63
Díaz, Víctor J.: 275
Domínguez, Eva: 295
Duran, Jordi: 279
Enríquez, Fernando: 173
Errecalde, Marcelo: 55
Escapa, Alberto: 45
Ferrández, Antonio: 45
Forcada, Mikel L.: 257
Gamallo, Pablo: 241, 295
García, Oscar: 281
García, Fernando: 283
Gervás, Pablo: 181, 285
Gojenola, Koldo: 5
Gómez-Rodríguez, Carlos: 13
Griol, David: 231, 283
Herrera, Jesús: 37, 181, 285
Hulden, Mans: 21
Hurtado, Lluis F.: 231, 283
Ingaramo, Diego: 55
Izquierdo-Bevia, Rubén: 189
Janssen, Maarten: 297
Kozareva, Zornitsa: 81
Llopis, Fernando: 45
López, Marisol: 295
López-Cózar, Ramón: 277
Macías, Javier: 147
Marimon, Montserrat: 131
Marrero, Mónica: 213
Martí, Antonia: 205, 279
Martín, José Luis: 147
Martín, María Teresa: 63
Martínez, Paloma: 113
Martínez-Barco, Patricio: 97
Montejo, Arturo: 63
Montoyo, Andrés: 81
Morato, Jorge: 213
Moreiro, J. Antonio: 213
Moriano, Pedro J.: 181, 285
Moscoso, E. Miguel: 295
Moskowich-Spiegel, Isabel: 289
Muñoz, Alfonso: 181, 285
Nazar, Rogelio: 139
Noguera, Elisa: 45
Nouza, Jan: 277
O'Donnell, Michael: 249
Ortega, F. Javier: 173
Padró, Lluis: 89, 105
Padró, Muntsa: 89
Palazuelos, Sira E.: 147
Parapar, Javier: 287, 289
Pascual, Ismael: 249
Peñas, Anselmo: 37
Perea, Pilar: 279
Perekrestenko, Alexander: 27
Pérez-Ortiz, Juan Antonio: 257
Periñán-Pascual, Carlos: 197
Pichel, José Ramom: 241
Pineda, Luis Alberto: 223
Puchol-Blasco, Marcel: 97
Recasens, Marta: 205
Rigau, Germán: 189
Rojo, Guillermo: 295
Romero, Luis: 181, 285
Rosso, Paolo: 55
Sánchez-Cuadrado, Sonia: 213
Sánchez-Martínez, Felipe: 257
Sanchís, Emilio: 231, 283
Santalla del Río, María Paula: 295
Sapena, Emili: 105
Saquete, Estela: 97
Saralegi, Xabier: 71
Segarra, Encarna: 231, 283
Seghezzi, Natalia: 131
Seghiri, Miriam: 165
Simões, Alberto: 265
Sologaistoa, Aitor: 157
Soroa, Aitor: 157
Sotelo, Susana: 295
Suárez, Armando: 189
Taulé, Mariona: 205
Tinkova, Nevena: 123
Torres, Francisco: 283
Troyano, José Antonio: 173
Turmo, Jordi: 105
Ureña-López, L. Alfonso: 63
Vázquez, Sonia: 81
Verdejo, Felisa: 37
Vicente-Díez, María Teresa: 113
Vilares, Manuel: 13
Vilela, Rui: 291
Vivaldi, Jorge: 139
Wanner, Leo: 139
