Dictionary: Overview

The dictionary was built not only on data from the corpus of spoken language compiled in the same project, but also on a range of additional sources: data elicited in complementary interviews with young Tunisians, and lexical material taken from various published historical sources dating from the middle of the 20th century and earlier. The most important of these is Hans-Rudolf Singer's monumental grammar (1984; almost 800 pages) of the Medina of Tunis. Singer's data were systematically evaluated and integrated into the dictionary, with all such material marked by references to the book. Other resources (Nicolas 1911; Marçais/Guîga 1958-61; Quéméneur 1962; Abdellatif 2010) were also consulted in order to verify and complete the contemporary data. The diachronic dimension will help to better understand processes in the development of the lexicon (for more details see Moerth, Prochazka & Dallaji 2014).

The first layer of lexical data to be incorporated into the digital lexicographical system was extracted from Veronika Ritt-Benmimoun’s textbook titled Skriptum zu den Lehrveranstaltungen Tunesisch-Arabisch Kurs A und B (Vienna, 2012/13). This data is referenced in the dictionary as Ritt-Benmimoun 2014.

Please keep in mind that the dictionary remains a work in progress. We make no claim to completeness, either with respect to the lemmas or with respect to the data in the entries' microstructure. This holds particularly true for etymologies and for usage labels indicating the regimen of verbs.

The project was embedded in the activities of the two large-scale pan-European research infrastructure consortia in the humanities, CLARIN (Common Language Resources and Technology Infrastructure) and DARIAH (Digital Research Infrastructure for the Arts and Humanities) (Budin, Moerth, & Durco 2013).

Query Interfaces

Over time, we experimented with various web frontends, all built on the same lexical data but making use of different technologies. Currently, the dictionary can be accessed via three websites:

TUNICO Dictionary Search

The main access point to the dictionary is part of this website. Technologically, it is based on the ACDH-OeAW's corpus_shell framework, a modular service-oriented architecture for a distributed and heterogeneous landscape of digital language resources. The principal idea behind the architecture is to decouple the modules serving data from the user-interface components. To this end, a number of basic requirements are imposed on the system: dynamic configuration of data sources, dynamic configuration of front-end layout, support for different protocols and support for different data formats. The backbone of the system is the SRU/CQL search protocol, which underlies FCS (Federated Content Search), an evolving standard increasingly used in the pan-European CLARIN community.
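To give a rough idea of how a client talks to such a backend, the following minimal sketch sends an SRU searchRetrieve request and prints the XML response. The endpoint URL and the sample query are hypothetical placeholders; the actual endpoints are configured per data source in corpus_shell.

    # Minimal sketch of an SRU 'searchRetrieve' request (hypothetical endpoint).
    import urllib.parse
    import urllib.request

    BASE_URL = "https://example.org/corpus_shell/sru"  # placeholder, not the real endpoint

    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": "besh",          # CQL query; a hypothetical headword
        "maximumRecords": "10",
    }

    url = BASE_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        # The response is an XML document wrapping the matching records.
        print(response.read().decode("utf-8"))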

VICAV

The more recent interface can be found on the VICAV website. The Vienna Corpus of Arabic Varieties (VICAV) is an international project aiming at the collection of digital language resources documenting varieties of spoken Arabic. It provides a wide range of materials such as language profiles, dictionaries, annotated texts, bibliographies and more. There the Tunis dictionary can be accessed together with other dictionaries (Damascus, Cairo). The VICAV dictionary interface was designed to allow users to compare Arabic varieties and to access several dictionaries concurrently.

The TUNICO Dictionary

The dictionary editor used to create the lexical data of this project provides functionality that allows users to easily build dictionary frontends. This web page is an example created using this functionality.

Encoding

The dictionary is encoded in TEI (P5). While the data model applied to the corpus was easy to implement, the model for the lexicographic data needed some work, as comparable data was not available. Major topics discussed at conferences and in publications were diachrony in TEI-encoded dictionaries, the modelling of statistical data in lexical resources, language identifiers (in particular in research that needs a high degree of granularity), and the dictionary-corpus interface.
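As a rough illustration of what such TEI (P5) encoding involves, the sketch below parses a minimal dictionary entry. The entry content and identifiers are invented for illustration and do not reproduce actual TUNICO data; aeb is the ISO 639-3 code for Tunisian Arabic.

    # A minimal, hypothetical TEI (P5) dictionary entry, parsed with lxml.
    from lxml import etree

    TEI_NS = "http://www.tei-c.org/ns/1.0"

    entry_xml = """
    <entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="example_001">
      <form type="lemma"><orth xml:lang="aeb">bhim</orth></form>
      <gramGrp><gram type="pos">noun</gram></gramGrp>
      <sense><cit type="translation" xml:lang="en"><quote>donkey</quote></cit></sense>
    </entry>
    """

    entry = etree.fromstring(entry_xml)
    orth = entry.findtext(f".//{{{TEI_NS}}}orth")
    translation = entry.findtext(f".//{{{TEI_NS}}}quote")
    print(orth, ":", translation)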

The encoding documentation of the dictionary is part of the VICAV documentation, which has been enhanced and refined through data and experience gathered during the TUNICO project. The documentation can be found on the website of the ACDH-OeAW's DictGate.

Tools

Several tools were further developed as part of the TUNICO project.

Viennese Lexicographic Editor (VLE)

The main tool for editing the lexicographic data of the project was the Viennese Lexicographic Editor (VLE). VLE is a standalone Windows application that allows lexicographers to process standards-based lexicographic and terminological data in basically any XML-based format, such as Lexical Markup Framework (LMF; ISO 24613:2008), TermBase eXchange (TBX; ISO 30042:2008), Resource Description Framework (RDF) or TEI. This general-purpose XML editor provides a number of functionalities to streamline lexicographic editing procedures. It allows collaborative dictionary editing on the internet and is built entirely on XML and related technologies (XPath, XQuery, XSLT, XML Schema). While for several years the tool functioned as part of a client-server architecture with MySQL in the backend, the most recent VLE versions can be used in combination with the free and easy-to-use XML database BaseX (http://basex.org/).
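As an example of the BaseX side of such a setup, the sketch below runs a query against a BaseX database over its REST interface, which BaseX serves on port 8984 by default. The database name, the credentials and the query are hypothetical placeholders, not the project's actual configuration.

    # Minimal sketch: query a BaseX database via its REST API.
    import base64
    import urllib.parse
    import urllib.request

    query = "//entry[.//orth = 'bhim']"  # hypothetical XQuery over TEI entries
    url = "http://localhost:8984/rest/tunico?" + urllib.parse.urlencode(
        {"query": query}
    )

    request = urllib.request.Request(url)
    # BaseX uses HTTP Basic authentication; placeholder credentials shown here.
    credentials = base64.b64encode(b"admin:admin").decode("ascii")
    request.add_header("Authorization", "Basic " + credentials)

    with urllib.request.urlopen(request) as response:
        print(response.read().decode("utf-8"))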

tokenEditor

The tokenEditor loads corpus texts vertically, i.e. with every token on a separate line. It furnishes methods to add labels to the tokens and to manually review automatically applied annotations. The tool allows users to filter, aggregate, visualise and, most importantly, edit word-level annotations in digital texts. It was used in several applications and integrated into our semi-automatic NLP pipeline. Although it started out as an independent Windows application, its development has until recently been tightly coupled with that of the lexicographic editor: the most recent VLE versions all feature a built-in tokenEditor component, which allows for the seamless integration of corpus and dictionary.

This module takes XML texts and/or text collections as input and creates verticalised views of the data, as sketched below. The verticalisation is accomplished on the basis of freely configurable XPath expressions, which create lists of XML elements on which a wide range of operations can be performed. The tokenEditor can create and display subsets of the list items through the application of regular expressions, and it allows the manual correction of linguistic information: PoS and lemma information can be changed for single tokens or in batch for multiple occurrences. What is key for the usability of the tool is that users have access to the context, can browse through the filtered lists and can still access the underlying corpus. While the tool was primarily designed to review part-of-speech tags and lemmas, it is fully customisable and capable of supporting any number of annotation layers.
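The following sketch mimics that verticalisation step: a configurable XPath expression selects token elements from an XML text, which are then emitted one per line together with their annotations. The element and attribute names (w, pos, lemma) are hypothetical placeholders, not the project's actual schema.

    # Minimal sketch of XPath-based verticalisation of an XML corpus text.
    from lxml import etree

    TOKEN_XPATH = "//w"  # configurable: which elements count as tokens

    doc = etree.fromstring("""
    <text>
      <s><w pos="NOUN" lemma="bhim">bhim</w><w pos="ADJ" lemma="kbir">kbir</w></s>
    </text>
    """)

    for token in doc.xpath(TOKEN_XPATH):
        # One token per line: form, part-of-speech tag, lemma.
        print(token.text, token.get("pos"), token.get("lemma"), sep="\t")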

Other Tools

In the process of creating the dictionary, several other tools were also used. The most important of these are the oXygen XML editor and the audio editor Audacity.