TuniCo - Linguistic Dynamics in the Greater Tunis Area

Research: Text Technology and Infrastructure

The text-technological work has focused on tools and relevant standards. Much of it has been closely linked to other projects of the Austrian Academy of Sciences, in particular the infrastructure-building activities of the ACDH's eLexicography working group. The data produced as part of TUNICO furnished an ideal test bed for the lexicographic virtual research environment that is currently being developed.

Modelling frequency data. Methodological considerations on the relationship between dictionaries and corpora

Karlheinz Moerth, Laurent Romary, Gerhard Budin and Daniel Schopper

Journal of the Text Encoding Initiative: Selected Papers from the 2013 TEI Conference, Issue 8 (2015).

Abstract: Academic dictionary writing is making greater and greater use of the TEI Guidelines’ dictionary module. And as increasing numbers of TEI dictionaries become available, there is an ever more palpable need to work towards greater interoperability among dictionary writing systems and other language resources that are needed by dictionaries and dictionary tools. In particular this holds true for the crucial role that statistical data obtained from language resources play in lexicographic workflow—a role that also has to be reflected in the model of the data produced in these workflows. Presenting a range of current projects, the authors address two main questions in this area: How can the relationship between a dictionary and other language resources be conceptualized, irrespective of whether they are used in the production of the dictionary or to enrich existing lexicographic data? And how can this be documented using the TEI Guidelines? Discussing a variety of options, this paper proposes a customization of the TEI dictionary module that tries to respond to the emerging requirements in an environment of increasingly intertwined language resources.
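To make the idea of frequency data stored inside a dictionary entry more concrete, the following is a minimal sketch in Python. It assumes a hypothetical encoding in which a TEI feature structure (fs) with the invented type value "corpFreq" sits inside the form element of an entry; the element placement, the type value, the feature names and the sample entry are illustrative assumptions and are not taken from the paper's actual customization.

```python
import xml.etree.ElementTree as ET

# TEI namespace; the prefix "tei" is only used for the path lookups below.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical entry: the placement of <fs> inside <form>, the type
# "corpFreq" and the feature names are assumptions made for illustration.
SAMPLE_ENTRY = """
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="entry_example_001">
  <form type="lemma">
    <orth>example</orth>
    <fs type="corpFreq">
      <f name="corpus"><symbol value="sample_corpus"/></f>
      <f name="frequency"><numeric value="42"/></f>
    </fs>
  </form>
</entry>
"""


def read_frequencies(entry_xml):
    """Return (lemma, {corpus name: frequency}) for one dictionary entry."""
    entry = ET.fromstring(entry_xml)
    lemma = entry.findtext("tei:form/tei:orth", namespaces=NS)
    frequencies = {}
    for fs in entry.iterfind(".//tei:fs[@type='corpFreq']", NS):
        corpus = fs.find("tei:f[@name='corpus']/tei:symbol", NS)
        freq = fs.find("tei:f[@name='frequency']/tei:numeric", NS)
        if corpus is None or freq is None:
            continue  # skip incomplete frequency statements
        frequencies[corpus.get("value")] = int(freq.get("value"))
    return lemma, frequencies


lemma, freqs = read_frequencies(SAMPLE_ENTRY)
print(lemma, freqs)
```

Keeping the corpus identifier next to the number lets a single entry carry frequencies from several language resources, which is the kind of dictionary-to-corpus relationship the abstract discusses.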

Towards a diatopic dictionary of spoken Arabic varieties: challenges in compiling the VICAV dictionaries

Karlheinz Moerth, Daniel Schopper and Omar Siam

11th Conference of AIDA - Association Internationale de Dialectologie Arabe, 25-28 May 2015, Bucharest, Romania.

Abstract: The research presented in this report is part of a number of digital humanities projects with a strong interest in eLexicography and a focus on Arabic dictionaries. These projects constitute a joint research agenda of the University of Vienna and the Austrian Academy of Sciences, which is situated at the crossroads of variational linguistics and language technology research. The transdisciplinary and applied approaches described in the paper have already produced demonstrable results with respect to research-driven tool development and the work on interoperability mechanisms such as encoding standards or language-related norms. One result of these endeavours was an innovative interface offering a single point of access to several lexical databases. Our presentation will deal with the background of these efforts and with issues relevant to research in both NLP and dialectology, focusing on new technologies and their applicability to the field of Arabic dialectology.

Our research has been based on a collection of digital lexicographic resources that are being created as part of two projects. The first is VICAV (Vienna Corpus of Arabic Varieties), a virtual platform for hosting and exchanging a wide range of digital language resources (such as language profiles, bibliographies, lexical resources, corpora, NLP tools, best practices and guidelines). The second is TUNICO (Linguistic Dynamics in the Greater Tunis Area: a Corpus-based Approach; Austrian Science Fund P25706-G23). In addition to a dictionary of Damascus Arabic, dictionaries for the Rabat and Cairo varieties are being compiled. A fourth item on the list is a micro-diachronic dictionary of the Tunis variety, which is being created as part of the TUNICO project.

Laying the Foundations for a Diachronic Dictionary of Tunis Arabic. A First Glance at an Evolving New Language Resource

Ines Dallaji, Stephan Procházka, Karlheinz Moerth

Proceedings of the XVI EURALEX International Congress. Bolzano 2014: The User in Focus: 377-387.

Abstract: Arabic lexicography has a long tradition. However, at the time of writing this report, there exist only very few digital products, let alone products documenting Arabic dialects. Our paper presents the TUNICO project (Linguistic Dynamics in the Greater Tunis Area) and the digital language resources which are being produced as part of the project. The TUNICO working group is working on a digital diachronic dictionary of Tunis Arabic, which is being compiled as part of a larger linguistic endeavour to document the variety of the Tunisian capital. One of the interesting features of the project is that it draws on a number of heterogeneous sources: textbooks, grammatical descriptions and a corpus of spoken youth language which is currently being compiled. In this project, the dictionary is used as an analytical tool, as a research instrument, by integrating the various sources into one new coherent language resource, thus allowing researchers to gain unprecedented insights into material that has in part been available for quite some time.

Linking Instead of Lemmatising: Enriching the TUNICO Corpus with the Dictionary of Tunis Arabic

Karlheinz Moerth, Daniel Schopper, Omar Siam

Abstract: Set against a background of digital humanities research, this paper reports on digital methods used in text technology and eLexicography. All of the technologies presented in this paper are part of the modern inventory of techniques used in humanities studies. We strongly believe that this kind of technology is becoming an intrinsic part of the 21st century's canon of linguistic methods, which makes its discussion a pivotal methodological contribution to a conference such as the International Conference on Tunisian and Libyan Arabic Dialects. We are dealing here with new and innovative tools and practices for the humanities: genuinely digital humanities methods. Data-based and data-driven approaches have become commonplace in many sub-disciplines of linguistics. This also holds true of traditional dialectology, variational linguistics, sociolinguistic approaches and, in particular, lexical studies, which rely heavily on data.

To be published in: V. Ritt-Benmimoun (ed.) Tunisian and Libyan Arabic Dialects: Common trends - Recent developments - Diachronic aspects. Estudios de Dialectología Árabe 13. Zaragoza
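
The idea named in the title of this paper, linking corpus tokens to dictionary entries rather than storing a lemma string on each token, can be sketched roughly as follows. This is a minimal illustration in Python under stated assumptions: the class names, the identifiers and the placeholder word forms are hypothetical and do not reproduce the actual TUNICO data model or tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """A dictionary entry (hypothetical, simplified structure)."""
    entry_id: str
    lemma: str
    gloss: str


@dataclass(frozen=True)
class Token:
    """A corpus token that may point to a dictionary entry by its ID."""
    form: str
    entry_ref: Optional[str] = None


def link_tokens(tokens, dictionary):
    """Pair each token form with its entry, or None if unlinked or dangling."""
    return [
        (tok.form, dictionary.get(tok.entry_ref) if tok.entry_ref else None)
        for tok in tokens
    ]


def forms_by_entry(tokens, dictionary):
    """Group the attested corpus forms under the entry they are linked to."""
    grouped = {}
    for form, entry in link_tokens(tokens, dictionary):
        if entry is not None:
            grouped.setdefault(entry.entry_id, []).append(form)
    return grouped


# Placeholder data, invented for illustration only.
DICTIONARY = {
    "entry_001": Entry("entry_001", "lemma_a", "gloss A"),
    "entry_002": Entry("entry_002", "lemma_b", "gloss B"),
}
TOKENS = [
    Token("form_a1", "entry_001"),
    Token("form_b1", "entry_002"),
    Token("form_a2", "entry_001"),
    Token("form_x"),  # not yet linked to any entry
]

print(forms_by_entry(TOKENS, DICTIONARY))
```

Because a token carries only a reference, everything recorded in the entry (lemma, glosses, grammatical information) becomes available to corpus queries without being copied into the corpus, and unlinked tokens remain easy to find.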