Linguists and Computer Scientists will highlight the advantages
of using expert-crafted dictionaries and grammars to build NLP applications.
Important dates
First Call for Workshop Papers: December 12, 2017 (11:50pm CET)
Workshop Papers Due Date: May 25, 2018 (11:50pm CET)
Notification of Acceptance: June 20, 2018
Camera-ready Papers Due Date: June 30, 2018
Workshop Dates: August 20, 2018
Call for papers
In conjunction with COLING 2018 in Santa Fe, we are organizing a half-day workshop, entitled “Linguistic Resources for Natural Language Processing” (LR4NLP).
This workshop aims to bring together linguists who are interested in developing large-coverage linguistic resources and researchers with an interest in developing real-world NLP software. These two communities have been working separately for many years. NLP researchers are typically more focused on technical issues specific to automatic text processing, where high-quality performance (e.g. recall and precision) is crucial. Linguists, on the other hand, tend to focus on developing exhaustive and precise resources that cover the wide spectrum of language. These resources are linguistically motivated and naturally based on a specific theory, but are for the most part ‘neutral’ vis-à-vis any NLP application. That is, they might be implemented using various grammatical formalisms (HPSG, LFG, NooJ, RG, TAG, XFST, etc.) and should be usable by a wide variety of NLP applications, such as sentence parsing, text generation, transformational analysis, paraphrasing and translation, among others.
Recent progress in both computer science and linguistics is reducing many of these differences, with large-coverage, collaborative linguistic resources increasingly being used by robust NLP software. For example, NLP researchers can now use large dictionaries of multiword units and expressions, and several linguistic experiments have shown the feasibility of using extensive dictionaries and grammars in software applications that can parse sentences, as well as produce paraphrases and translations of sentences.
By encouraging members of both communities to mutually discuss current research on related topics, we hope to move towards a better understanding of the problems involved. Furthermore, examining ideas that offer reciprocal benefits to both communities may lead to potential collaborative efforts. This workshop focuses on the following questions:
- Is it possible to construct NLP applications that remove ambiguities by using linguistic data alone, i.e. with no statistical methods?
- How does one develop ‘neutral’ linguistic resources (dictionaries, lexicon-grammars, morphological, phrase-structure and transformational grammars, etc.) that can be used both to parse and generate texts, in one or multiple languages?
- What are the limitations of stochastic and neural net based systems, as opposed to grammar and rule-based ones?
Topics should relate to linguistically-based NLP, such as:
- Assessment of grammar and rule-based vs. statistical and neural net approaches to NLP
- Natural language disambiguation based on handcrafted grammars
- Development of large-coverage linguistic resources
- Use of linguistic resources in paraphrasing and machine translation applications
- Linguistically-based NLP for real-world applications
- Paraphrase and translation generation
- Phraseology of specialized languages
- Processing of multiword units, discontinuous expressions, phrasal verbs, etc.
- Surface structure realization
- Transformational analysis and generation
- Linguistically-based question-answering and summarization systems
Program
August 20, 2018
Time | Authors | Paper |
---|---|---|
8:30 - 9:00 | Registration | |
Session 1 | Clash of the Titans: Linguistics vs. Statistics vs. Neural-nets | |
9:00 - 9:10 | [Chair: Peter Machonis] | LR4NLP @ COLING Welcomes you |
9:10 - 9:50 | Mark Liberman | Corpus Phonetics: Past, Present, and Future
Abstract Semi-automatic analysis of digital speech collections is transforming the science of phonetics, and offers interesting opportunities to researchers in other fields. Convenient search and analysis of large published bodies of recordings, transcripts, metadata, and annotations – as much as three or four orders of magnitude larger than a few decades ago – has created a trend towards “corpus phonetics,” whose benefits include greatly increased researcher productivity, better coverage of variation in speech patterns, and essential support for reproducibility. The results of this work include insight into theoretical questions at all levels of linguistic analysis, as well as applications in fields as diverse as psychology, sociology, medicine, and poetics, as well as within phonetics itself. Crucially, analytic inputs include annotation or categorization of speech recordings along many dimensions, from words and phrase structures to discourse structures, speaker attitudes, speaker demographics, and speech styles. Among the many near-term opportunities in this area we can single out the possibility of improving parsing algorithms by incorporating features from speech as well as text. |
9:50 - 10:10 | Max Silberztein | Using Linguistic Resources to Evaluate the Quality of Annotated Corpora
Abstract Statistical and neural network based methods that compute their results by comparing a given text to be analyzed with a reference corpus assume that the reference corpus is complete and reliable enough. In this article, I conduct several experiments to verify this assumption and I suggest ways to improve these reference corpora by using carefully handcrafted linguistic resources. |
10:10 - 10:30 | Linrui Zhang and Dan Moldovan | Rule-Based vs. Neural Net Approaches to Semantic Textual Similarity
Abstract This paper presents a neural net approach to determine Semantic Textual Similarity (STS) using attention-based bidirectional Long Short-Term Memory Networks (Bi-LSTM). To date, most traditional STS systems have been rule-based systems built on the excessive use of linguistic features and resources. In this paper, we present an end-to-end attention-based Bi-LSTM neural network system that solely takes word-level features, without expensive feature engineering work or the usage of external resources. By comparing its performance with traditional rule-based systems against the SemEval 2012 benchmark, we make an assessment on the limitations and strengths of neural net systems as opposed to rule-based systems on STS. |
10:30 - 11:00 | Coffee break | |
Session 2 | May the Force Be with NooJ | |
[Chair: Max Silberztein] | ||
11:00 - 11:20 | Peter Machonis | Linguistic Resources for Phrasal Verb Identification
Abstract This paper shows how a lexicon grammar dictionary of English phrasal verbs (PV) can be transformed into an electronic dictionary, in order to accurately identify PV in large corpora within the linguistic development environment, NooJ. The NooJ program is an alternative to statistical methods commonly used in NLP: all PV are listed in a dictionary and then located by means of a PV grammar in both continuous and discontinuous format. Results are then refined with a series of dictionaries, disambiguating grammars, filters, and other linguistic resources. The main advantage of such a program is that all PV can be identified in any corpus. The only drawback is that PV not listed in the dictionary (e.g., archaic forms, recent neologisms) are not identified; however, new PV can easily be added to the electronic dictionary, which is freely available to all. |
11:20 - 11:40 | Kristina Kocijan, Krešimir Šojat and Dario Poljak | Designing a Croatian Aspectual Derivatives Dictionary: Preliminary Stages
Abstract The paper focuses on derivationally connected verbs in Croatian, i.e. on verbs that share the same lexical morpheme and are derived from other verbs via prefixation, suffixation and/or stem alternations. As in other Slavic languages with rich derivational morphology, each verb is marked for aspect, either perfective or imperfective. Some verbs, mostly of foreign origin, are marked as bi-aspectual verbs. The main objective of this paper is to detect and to describe major derivational processes and affixes used in the derivation of aspectually connected verbs with NooJ. Annotated chains are exported into a format adequate for a web-based system and further used to enhance the aspectual and derivational information for each verb. |
11:40 - 12:00 | Safa Boudhina and Héla Fehri | A Rule-Based System for Disambiguating French Locative Verbs and Their Translation into Arabic
Abstract This paper presents a rule-based system for disambiguating French locative verbs and their translation into Arabic. The disambiguation phase is based on the use of the French Verb dictionary of Dubois and Dubois Charlier (LVF) as a linguistic resource, from which a base of disambiguation rules is extracted. The extracted rules take the form of transducers which are subsequently applied to texts. The translation phase consists in translating the disambiguated locative verbs returned by the disambiguation phase. The translation takes into account the verb tense, as well as the inflected form of that verb. This phase is based on bilingual dictionaries that contain the different French locative verbs and their translation into Arabic. The experimentation and the evaluation are done using the linguistic platform NooJ, both a language resource development environment and a tool for automatic large corpora flow (Fehri, 2012). |
12:00 - 12:20 | Andrea Rodrigo, Mario Monteleone and Silvia Reyes | A Pedagogical Application of NooJ in Language Teaching: The Adjective in Spanish and Italian
Abstract This paper relies on the work developed by the research team IES_UNR (Argentina) and presents a pedagogical application of NooJ for the teaching and learning of Spanish as a foreign language. However, as this proposal specifically addresses learners of Spanish whose mother tongue is Italian, it also entailed vital collaboration with Mario Monteleone from the University of Salerno, Italy. The adjective was chosen on account of its lower frequency of occurrence in texts written in Spanish, and particularly in the Argentine Rioplatense variety, and with the aim of developing strategies to increase its use. The features that the adjective shares with other grammatical categories render it extremely productive and provide elements that enrich the learner’s proficiency. The reference corpus contains the front pages of the Argentinian newspaper Clarín related to an emblematic historical moment, whose starting point is March 24, 1976, when a military coup began, and covers a thirty year period until March 24, 2006. The use of the linguistic resources created in NooJ for the automatic processing of texts written in Spanish accounts for the adjective in a relevant historical context for Argentina. |
12:20 - 12:40 | Anabela Barreiro and Fernando Batista | Contractions: to Align or not to Align, That is the Question
Abstract This paper performs a detailed analysis on the alignment of Portuguese contractions, based on a previously aligned bilingual corpus. The alignment task was performed manually in a subset of the English-Portuguese CLUE4Translation Alignment Collection. The initial parallel corpus was pre-processed and a decision was made as to whether the contraction should be maintained or decomposed in the alignment. Decomposition was required in the cases in which the two words that have been concatenated, i.e., the preposition and the determiner or pronoun, go in two separate translation alignment pairs (PT - [no seio de] [a União Europeia] EN - [within] [the European Union]). Most contractions required decomposition in contexts where they are positioned at the end of a multiword unit. On the other hand, contractions tend to be maintained when they occur at the beginning or in the middle of the multiword unit, i.e., in the frozen part of the multiword (PT - [no que diz respeito a] EN - [with regard to] or PT - [além disso] EN - [in addition]). A correct alignment of multiwords and phrasal units containing contractions is instrumental for machine translation, paraphrasing, and variety adaptation. |
12:40 - 14:00 | Lunch break | |
Session 3 | One for the Road: Monolingual Resources | |
[Chair: Anabela Barreiro] | ||
14:00 - 14:20 | Bonnie Dorr and Clare Voss | STYLUS: A Resource for Systematically Derived Language Usage
Abstract Starting from an existing lexical-conceptual structure (LCS) Verb Database of 500 verb classes (containing a total of 9525 verb entries), we automatically derived a resource that supports argument identification for language understanding and argument realization for language generation. The extended resource, called STYLUS (SysTematicallY Derived Language USage), supports constraints at the syntax-semantics interface through the inclusion of components of meaning and collocations. We show that the resulting resource covers three cases of language usage patterns both for spatially oriented applications such as dialogue management for robot navigation and for non-spatial applications such as generation of cyber-related notifications. |
14:20 - 14:40 | Andargachew Mekonnen Gezmu, Binyam Ephrem Seyoum, Michael Gasser and Andreas Nürnberger | Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus
Abstract We introduce the contemporary Amharic corpus, which is automatically tagged for morpho-syntactic information. Texts are collected from 25,199 documents from different domains and about 24 million orthographic words are tokenized. Since it is partly a web corpus, we made some automatic spelling error corrections. We have also modified the existing morphological analyzer, HornMorpho, to use it for automatic tagging. |
14:40 - 15:00 | Chanakya Malireddy, Srivenkata N M Somisetty and Manish Shrivastava | Gold Corpus for Telegraphic Summarization
Abstract Most extractive summarization techniques operate by ranking all the source sentences and then selecting the top-ranked sentences as the summary. Such methods are known to produce good summaries, especially when applied to news articles and scientific texts. However, they do not fare so well when applied to texts such as fictional narratives, which do not have a single central or recurrent theme. This is because usually the information or plot of the story is spread across several sentences. In this paper, we discuss a different summarization technique called Telegraphic Summarization. Here, we do not select whole sentences, but rather pick short segments of text spread across sentences as the summary. We have tailored a set of guidelines to create such summaries and, using the same, annotate a gold corpus of 200 English short stories. |
15:00 - 15:20 | Hafte Abera and Sebsibe H/Mariam | Design of a Tigrinya Language Speech Corpus for Speech Recognition
Abstract In this paper, we describe the first Tigrinya Language speech corpus designed and developed for speech recognition purposes. Tigrinya, often written as Tigrigna (ትግርኛ) /tɪˈɡrinjə/, belongs to the Semitic branch of the Afro-Asiatic languages and shows characteristic features of a Semitic language. It is spoken by ethnic Tigray-Tigrigna people in the Horn of Africa. This paper outlines different corpus designing processes and related work on the creation of speech corpora for different languages. The authors also provide procedures that were used for the creation of a speech recognition corpus for Tigrinya, an under-resourced language. One hundred and thirty native Tigrinya speakers were recorded for the training and test datasets. Each speaker read 100 texts, which consisted of syllabically rich and balanced sentences. Ten thousand sets of sentences were used, which contained all of the contextual syllables and phones of Tigrinya. |
15:30 - 16:00 | Coffee break | |
Session 4 | Language Resources without Borders | |
[Chair: Kristina Kocijan] | ||
16:00 - 16:20 | Solomon Teferra Abate, Michael Melese Woldeyohannis, Martha Yifiru Tachbelie, Million Meshesha, Solomon Atinafu, Wondwossen Mulugeta, Yaregal Assabie, Hafte Abera, Binyam Ephrem, Tewodros Abebe, Wondimagegnhue Tsegaye, Amanuel Lemma, Tsegaye Andargie and Seifedin Shifaw | Parallel Corpora for bi-Directional Statistical Machine Translation for Seven Ethiopian Language Pairs
Abstract In this paper, we describe the development of parallel corpora for Ethiopian Languages: Amharic, Tigrigna, Afan-Oromo, Wolaytta and Ge’ez. To check the usability of all the corpora, we conducted baseline bi-directional statistical machine translation (SMT) experiments for seven language pairs. The performance of the bi-directional SMT systems shows that all the corpora can be used for further investigations. We have also shown that the morphological complexity of the Ethio-Semitic languages has a negative impact on the performance of the SMT, especially when they are target languages. Based on the results obtained, we are currently working towards handling the morphological complexities to improve the performance of statistical machine translation among the Ethiopian languages. |
16:20 - 16:40 | Jennifer Sikos and Sebastian Padó | Using Embeddings to Compare FrameNet Frames Across Languages
Abstract Much of the recent interest in Frame Semantics is fueled by the substantial extent of its applicability across languages. At the same time, lexicographic studies have found that the applicability of individual frames can be diminished by cross-lingual divergences regarding polysemy, syntactic valency, and lexicalization. Due to the large effort involved in manual investigations, there are so far no broad-coverage resources with “problematic” frames for any language pair. Our study investigates to what extent multilingual vector representations of frames learned from manually annotated corpora can address this need by serving as a wide coverage source for such divergences. We present a case study for the language pair English—German using the FrameNet and SALSA corpora and find that inferences can be made about cross-lingual frame applicability using a vector space model. |
16:40 - 17:00 | Yuming Zhai, Aurélien Max and Anne Vilnat | Construction of a Multilingual Corpus Annotated with Translation Relations
Abstract Translation relations, which distinguish literal translation from other translation techniques, constitute an important subject of study for human translators (Chuquet and Paillard, 1989). However, automatic processing techniques based on interlingual relations, such as machine translation or paraphrase generation exploiting translational equivalence, have not made use of these relations explicitly until now. In this work, we present a categorization of translation relations and then we annotate a parallel multilingual (English, French, Chinese) corpus of oral presentations, the TED Talks, with these relations. Our long-term objective will be to automatically detect these relations in order to integrate them as important characteristics for the search of monolingual segments in relation of equivalence (paraphrases) or of entailment. The annotated corpus resulting from our work will be made available to the community. |
17:00 - 17:20 | Mutsuko Tomokiyo, Christian Boitet and Mathieu Mangeot | Towards an Automatic Classification of Illustrative Examples in a Large Japanese-French Dictionary Obtained by OCR
Abstract This paper focuses on improving the Cesselin, a large, open source Japanese-French bilingual dictionary digitalized by OCR, available on the web, and contributively improvable online. Labelling its examples (about 226,000) would significantly enhance their usefulness for language learners. Examples are proverbs, idiomatic constructions, normal usage examples, and, for nouns, phrases containing a quantifier. Proverbs are easy to spot, but not the other types. To find a method for automatically or at least semi-automatically annotating them, we have studied many entries, and hypothesized that the degree of lexical similarity between results of MT into a third language might give good cues. To confirm that hypothesis, we sampled 500 examples and used Google Translate to translate into English the Cesselin Japanese expressions and their French translations. The hypothesis holds well, in particular for distinguishing examples of normal usage from idiomatic examples. Finally, we propose a detailed annotation procedure and discuss its future automatization. |
17:20 - 17:40 | Mrinal Dhar, Vaibhav Kumar and Manish Shrivastava | Enabling Code-Mixed Translation: Parallel Corpus Creation and MT Augmentation Approach
Abstract Code-mixing, use of two or more languages in a single sentence, is generated by multi-lingual speakers across the world. The phenomenon presents itself prominently in social media discourse. Consequently, there is a growing need for translating code-mixed hybrid language into standard languages. However, due to the lack of gold parallel data, existing machine translation systems fail to properly translate code-mixed text. In an effort to initiate the task of machine translation of code-mixed content, we present a newly created parallel corpus of code-mixed English-Hindi and English. We selected previously available English-Hindi code-mixed data as a starting point for our parallel corpus, and 4 human translators, fluent in both English and Hindi, translated the 6,096 code-mixed English-Hindi sentences into English. With the help of the created parallel corpus, we analyzed the structure of English-Hindi code-mixed data and present a technique to augment run-of-the-mill machine translation (MT) approaches that can help achieve superior translations without the need for specially designed translation systems. The augmentation pipeline is presented as a pre-processing step and can be plugged with any existing MT system, which we demonstrate by improving code-mixed translations done by systems like Moses, Google Neural Machine Translation System (NMTS) and Bing Translator. |
17:40 | Max Silberztein | LR4NLP@COLING Workshop Wrap-Up |
Submission Instructions
Authors are invited to submit papers describing original, unpublished work, be it completed or in progress. The papers should be maximally 9 pages of main content, with additional pages allowed for references and appendices. The COLING 2018 templates must be used; these will be provided in LaTeX and also Microsoft Word format. All accepted papers will be presented as talks.
Submitted papers should be from any of the following categories, each of which is associated with a distinct review form.
- COMPUTATIONALLY-AIDED LINGUISTIC ANALYSIS The focus of this paper type is new linguistic insight. Originality could be in the linguistic question being addressed, in the methodology applied to the linguistic question, or in the combination of the two. It should be shown how results generalize, either by deepening our understanding of some linguistic system in general or by demonstrating methodology that can be applied to other problems.
- NLP ENGINEERING EXPERIMENT PAPER This type of paper tests a hypothesis about the effectiveness of a technique for a task. The hypothesis should be clearly stated, the testing methodology rigorous, and the experiment reproducible. Furthermore, a successful COLING paper of this type will include thoughtful error analysis and a clear explanation of how the results in the experiment relate to the hypothesis.
- REPRODUCTION PAPER The contribution of a reproduction paper lies in analyses of and in insights into existing methods and problems—plus the added certainty that comes with validating previous results or the information that certain results are not reproducible. A strong reproduction paper offers analysis and deepens our understanding of the methodology used or problem approached, helping practitioners choose techniques / resources.
- RESOURCE PAPER Papers in this track present a new language resource. This could be a corpus, but also could be an annotation standard, tool, and so on. Part of the contribution of a resource paper lies in the quality, accessibility and description of resources.
- POSITION PAPER A position paper presents a challenge to conventional thinking or a futuristic new vision. It could open up a new area or spur the development of novel technology, propose changes in existing research practices, or give a new set of ground rules. Creative and sound positions will do best, with well-defined visions opening up new areas of research.
- SURVEY PAPER A survey paper provides a structured overview of the literature to date on a specific topic that helps the reader understand the kinds of questions being asked about the topic, the various approaches that have been applied, how they relate to each other, and what further research areas they open up. A conference-length survey paper should be about a sufficiently focused topic that it can do this successfully within the page limits.
Paper submission will be electronic in PDF format through the SoftConf conference management system.
Paper submission page will close on May 25th, 2018 at 23:00 Standard European Time
For full papers, please use Text Formatting Style provided by COLING 2018.
Author Responsibilities
Papers must be of original, previously-unpublished work. The formatting template must be strictly adhered to and deadlines met. Papers must be anonymized to support double-blind reviewing. If the paper is available as a preprint, this must be indicated in the submission form but not in the paper itself.
Papers that have been or will be under consideration for other venues at the same time must indicate this at submission time. If a paper is accepted for publication at LR4NLP@COLING, it must be immediately withdrawn from other venues. If a paper under review at LR4NLP@COLING is accepted elsewhere and authors intend to proceed there, the LR4NLP@COLING committee must be notified immediately.
Reviewing Policy
Reviewing will be double-blind, so authors need to conceal their identity. The paper should not include the authors' names and affiliations, nor any acknowledgements. Limit anonymized self-references only to articles that are relevant for reviewers.
Registration
For workshops only
- Regular - one day workshop: $320 (until July 13); $415 (July 14-Aug 19); $480 (on-site)
- Regular - two day workshop: $430 (until July 13); $560 (July 14-Aug 19); $645 (on-site)
- Student - one day workshop: $225 (until July 13); $290 (July 14-Aug 19); $335 (on-site)
- Student - two day workshop: $355 (until July 13); $460 (July 14-Aug 19); $525 (on-site)
For the main conference and workshop
- Regular
- main conference: $715 (until July 6); $930 (July 7-Aug 19); $1,070 (on-site)
- plus one-day workshop: $200 (until July 6); $260 (July 7-Aug 19); $300 (on-site)
- OR plus two-day workshop: $290 (until July 6); $375 (July 7-Aug 19); $435 (on-site)
- Student
- main conference: $500 (until July 6); $650 (July 7-Aug 19); $750 (on-site)
- plus one-day workshop: $140 (until July 6); $180 (July 7-Aug 19); $210 (on-site)
- OR plus two-day workshop: $205 (until July 6); $265 (July 7-Aug 19); $305 (on-site)
Attention students:
- If you declare that you are a student, you will be required to upload proof of your student status at time of registration.
- Proof of status will be verified.
- Student registration becomes final only after the status verification is completed!
- Student IDs without clear dates of validity, webpage screenshots and unofficial records will NOT be accepted as proof!
- Acceptable proof of student status MUST be clearly readable in English, contain the name of the candidate and the dates of validity.
- Please email a PDF scan of your proof of student status to studentstatusverification@coling2018.org
- Proof containing a verification code must not fail verification.
Workshop Organizers
- Anabela Barreiro, Post-Doctoral Researcher, INESC-ID, Lisbon (Portugal)
- Kristina Kocijan, Assistant Professor of Information and Communication Sciences, University of Zagreb (Croatia)
- Peter Machonis, Professor of French and Linguistics, Florida International University (USA)
- Max Silberztein, Professor of Computer Science and Linguistics, Université de Franche-Comté, Besançon (France)
Scientific Committee
- Program Committee Chair: Max Silberztein, Université de Franche-Comté (France)
- Jorge Baptista, University of Algarve (Portugal)
- Anabela Barreiro, INESC-ID Lisbon (Portugal)
- Xavier Blanco, Autonomous University of Barcelona (Spain)
- Nicoletta Calzolari, Istituto di Linguistica Computazionale (Italy)
- Christiane Fellbaum, Princeton University (USA)
- Héla Fehri, University of Sfax (Tunisia)
- Yuras Hetsevich, National Academy of Sciences (Belarus)
- Kristina Kocijan, University of Zagreb (Croatia)
- Mark Liberman, University of Pennsylvania (USA)
- Elena Lloret Pastor, Universidad de Alicante (Spain)
- Peter Machonis, Florida International University (USA)
- Slim Mesfar, Carthage University (Tunisia)
- Simon Mille, Universitat Pompeu Fabra (Spain)
- Mario Monteleone, University of Salerno (Italy)
- Johanna Monti, University of Naples - L'Orientale (Italy)
- Bernard Scott, Logos Institute (USA)
More information
- Information about the host conference can be found at COLING 2018 web page
- VISA information
- Request for invitation letter for COLING 2018 - please request it ASAP. Waiting until paper acceptances are out is way too late, given how long these things can take.
- Accommodation (see the COLING 2018 web page)
- Venue (see the COLING 2018 web page)