Austrian Centre for Digital Humanities Austrian Academy of Sciences
Abstract: The research presented in this report is part of a number of digital humanities projects with a strong interest in eLexicography and a focus on Arabic dictionaries. These projects constitute a joint research agenda of the University of Vienna and the Austrian Academy of Sciences which is situated at the crossroads of variational linguistics and language technology research. The transdisciplinary and applied approaches described in the paper have already produced demonstrable results with respect to research-driven tool development and work on interoperability mechanisms such as encoding standards and language-related norms. One result of these endeavours is an innovative interface offering a single point of access to several lexical databases. Our presentation deals with the background of these efforts and with issues relevant to research in both NLP and dialectology, focusing on new technologies and their applicability to the field of Arabic dialectology.
Our research has been based on a collection of digital lexicographic resources that are being created as part of the VICAV project (Vienna Corpus of Arabic Varieties), a virtual platform for hosting and exchanging a wide range of digital language resources (such as language profiles, bibliographies, lexical resources, corpora, NLP tools, best practices and guidelines), and the TUNICO project (Lexical Dynamics in the Greater Tunis Area: a Corpus-based Approach; Austrian Science Fund P25706-G23). In addition to a dictionary of Damascus Arabic, dictionaries of the Rabat and Cairo varieties are being compiled. A fourth item on the list is a micro-diachronic dictionary of the Tunis variety which is being created as part of the TUNICO project.
Keywords: Tunis, comparative dialectology, eLexicography, digital humanities, language technology, digital standards.
The transformations in technology and media our world has been undergoing over the past years have also given rise to a great number of theoretical and methodological changes in many disciplines of the humanities. This also holds true for dialect lexicography, and Arabic dialect lexicography, too, has started to make use of digital infrastructures. This report will touch on a wide range of digital language resources and digital infrastructure components which have been used and developed in projects working on dictionaries of varieties of spoken Arabic. Thematically, these projects are situated at the crossroads of variational linguistics and language technology research.
The projects at hand have become possible through a close cooperation between the Austrian Academy of Sciences and the University of Vienna. They are conducted jointly by the Institute of Oriental Studies (UV) and the Austrian Centre for Digital Humanities (AAS). Furthermore, the projects are embedded in the activities of the two large-scale pan-European research infrastructure consortia in the humanities, which are CLARIN (Common Language Resources and Technology Infrastructure) and DARIAH (Digital Research Infrastructure for the Arts and Humanities). Both have grown out of the ESFRI Roadmap and were officially endorsed by the Commission of the European Union after a preparatory phase of several years (Budin, Moerth, & Durco 2013).
Yet another institution to be mentioned is CLARIN Centre Vienna (CCV), Austria’s central connection point to the network of CLARIN centres across Europe and Austria’s only dedicated repository for digital language resources. CCV takes care of the long-term preservation of digital language resources.
The research described in this paper has been conducted as part of several linguistic and IT projects. Key areas of interest are eLexicography and advanced text technological methods. This involves activities such as tool development, corpus creation, corpus tools and standards relevant to the digital humanities.
In our projects we proceed from a broad understanding of what constitutes a digital language resource. In text technology, these are often described in the form of a triangle, as a triad of (a) data, (b) tools, and (c) means of increasing the interoperability of data and tools. What is implied by the latter are materials such as standards, best practices, specifications of tools, workflow descriptions and other such types of text (cf. Simons & Bird 2008).
The one project with substantial external funding is Lexical dynamics in the Greater Tunis area: a corpus based approach (TUNICO), which has been supported by the Austrian Science Fund (FWF, P25706-G23; https://www.acdh.oeaw.ac.at/tunico/). This project pursues two main objectives: (a) the exploration of a contemporary Arabic variety, and (b) the development of digital methods and tools. Major products to be created are two digital language resources: a corpus of spoken Tunis Arabic and a micro-diachronic dictionary of Tunis Arabic based on a combination of this and on previously published resources (cf. below 1.2.3).
Amongst our aims for this project is the integration of digital corpora and dictionaries, the development of tools to enable corpus-based eLexicography and to allow corpus enrichment by means of high quality lexicographic data, a task we have been working on for quite some time (Budin, & Moerth 2011).
The second project involved is the Vienna Corpus of Arabic Varieties (VICAV), a virtual research platform to host and exchange language resources to be used by scholars pursuing the comparative study of Arabic dialects. VICAV (https://www.acdh.oeaw.ac.at/vicav), which actually predates TUNICO, also has a strong text-technological component. Next to the obvious linguistic interests, it is meant to further the development of adequate digital tools and methods (Procházka, & Moerth 2013).
Although VICAV is defined as a corpus, which reflects the initial intentions behind the project, it is actually aimed at providing a wide range of digital language resources documenting varieties of spoken Arabic. The VICAV web-site contains various materials such as language profiles, comparative linguistic features, dictionaries, annotated texts, bibliographies and relevant documentation enabling scholars to get involved or to work along similar lines.
A particular type of text to be found on the VICAV website are language profiles, which consist of concise standardised linguistic descriptions of particular varieties, in many cases with clickable built-in features. The fixed structure of these texts is designed to ease comparison between varieties. They all start with information on the respective glottonyms (endonymic, in the local variety, as well as in MSA), followed by a formalised typology and a general introduction positioning the particular variety in a wider context. The articles outline the research history, provide a sample text, a short bibliography of relevant publications and also practical information concerning textbooks and dictionaries. Currently, there are 10 such profiles available on the website; many more are under preparation.
A second type of text consists of linguistic features. While criteria catalogues to be used in comparative studies would appear to be a most natural thing, nothing of the kind has so far been provided in the form of a proposal to standardise digital processing. The list constitutes a proposal of distinctive linguistic features for comparative purposes and is to be understood as a first attempt, to serve as a basis for discussion. The draft is designed in an extensible manner, open to refinements and enhancements.
Both VICAV and TUNICO produce lexical data in the form of digital dictionaries. All of these lexical resources are comparatively small in size, ranging from only several hundred to several thousand entries. None of these dictionaries has more than 8000 entries. Nevertheless, they are meant to offer structured information with detailed lexical data, not just simple look-up lists with unstructured single sense-to-sense relations.
The set of dictionaries is primarily intended for three main purposes:
The usefulness of such resources for didactic purposes is self-evident. The Institute of Near Eastern Studies in Vienna teaches four Arabic varieties on a regular basis: students of Arabic are offered introductory classes in the spoken varieties of Rabat, Tunis, Cairo and Damascus.
The Egyptian dictionary was the first one to be started. Like the other dictionaries, the initial material was taken from existing course materials and glossaries. So far, Tunis, Damascus, Cairo and a small MSA dictionary have been made available online. Furthermore, there is also limited data on Malta, Baghdad and some other locations. All the dictionaries have English translation equivalents, and some have additional translations into German, French or Spanish.
The dictionary on which the most time and work has been spent so far is the TUNICO Dictionary. It will not only contain all the lexicographic data of the corpus being produced in the project, but also two additional sources: data elicited from complementary interviews with young Tunisians and lexicographical material taken from various published historical sources dating from the middle of the 20th century and earlier. The most important of these is Hans-Rudolf Singer’s monumental grammar (1984; almost 800 pages) of the Medina of Tunis. Singer’s data is systematically evaluated and integrated into the dictionary, all the material being indicated by reference to the book. Additionally, other resources (Nicolas 1911, Marçais/Guîga 1958-61, Quéméneur 1962, Abdellatif 2010) are also consulted in order to verify and to complete the contemporary data. The diachronic dimension will help to better understand processes in the development of the lexicon (for more details see Moerth, Procházka, & Dallaji 2014).
Next to the linguistic interests, both projects have also set out to address methodological questions relating to the trend towards digitisation in academia. As there is not yet anything like an out-of-the-box solution and everything is still very much in a state of flux, digital lexicographers still face a number of difficult decisions in their everyday work. We have identified three main areas of concern: the choice of adequate tools, the encoding of the data in accordance with standards, and the publication and dissemination of data and methods.
While it is impossible to give simple and generic answers to these questions, we will try to outline our particular approaches in the following paragraphs.
The first challenge in realising the technical part of linguistic research projects is the discovery of the right tools. As the number of available applications has steadily been on the rise over the past few years, making the right decisions requires considerable experience. Most tools used in TUNICO and VICAV are products developed at the Austrian Academy of Sciences: the Viennese Lexicographic Editor (VLE), corpus_shell and DictGate. The most concise description of these might read like this:
When looking for an adequate tool to produce digital lexicographic data, researchers are not really spoilt for choice, although a number of products are available (Budin, & Moerth 2011). Patchy support for standards and/or high pricing led us to develop our own tool, the Viennese Lexicographic Editor, an XML editor providing the functionalities typically needed in editing lexicographic data. Basically, VLE can be used to edit any XML-based lexicographic and/or terminological format such as LMF, TBX, RDF or TEI.
VLE first came into existence as a by-product of an entirely different development activity: the creation of an interactive online learning system for university students, which was used in a collaborative glossary editing project carried out in language courses at the University of Vienna. As it proved to be flexible and adaptable enough, it was also put to work in other projects and continually developed further. It makes use of a wide range of XML technologies such as XSLT, XPath and XML Schema, and allows researchers to automatically verify the structural integrity of their data. It has configurable keyboard layouts and various editing modes, allows for freely configurable data visualisations, enables lexicographers to work collaboratively, supports versioning, has an optimised corpus-dictionary interface (the so-called tokenEditor) and is freely available (https://clarin.oeaw.ac.at/vle). The toolkit is designed for single researchers or small groups of researchers rather than for big publishing houses, and works as part of a larger infrastructure provided by the ACDH.
One of the main challenges for lexicographers when working with VLE is the widespread reluctance to work directly with XML. To remedy this difficulty, VLE has been furnished with a special editing mode which allows lexicographers to perform their tasks in predefined controls, very much in a manner you often find in database interfaces.
The importance of standards lies in two keywords: reusability and interoperability, which both play an important part in the technical agenda of our research projects. We have made every effort to build all components in a manner complying – by and large – with official or de-facto standards for the respective communities.
Giving advice regarding which standards to adopt has become easier as a certain degree of consensus has been established in many fields. The most straightforward method is probably looking at what others do in their projects. Moreover, it is becoming increasingly clear that the guidelines of the Text Encoding Initiative have become the most widely used system for scholarly text encoding. In many countries, digital texts are usually encoded in this elaborate system which has been developed over many years by a large community of practitioners. Working with the TEI implies the use of several other standards such as Unicode and several ISO standards. With respect to character encoding the use of Unicode has become commonplace in all text-based undertakings. Unicode is of particular importance when researchers deal with a variety of writing systems or make use of transcriptions (transliterations). All textual resources of VICAV and TUNICO, i.e. corpora, profiles, bibliographies etc., are encoded in TEI P5 (Budin, Majewski, & Moerth, 2012).
However, the situation is more complex with respect to lexicographic data. In a number of cases we have been working on a specialised schema based on the TEI (P5) dictionary module, which has remained our main means of encoding for such data. In all these endeavours we have aimed at a high degree of interoperability with the ISO standard Lexical Markup Framework (Declerck, Lendvai, & Moerth 2013). In projects aiming at cross-dictionary access we have also been experimenting with semantic technologies such as RDF and SKOS (Declerck, Moerth, & Wandl-Vogt 2014) which are aimed at Linked Data applications (Declerck, Wandl-Vogt, Moerth, & Resch 2014).
An important, ever-recurring and far from trivial issue is the identification of linguistic varieties, which is particularly important in digital dialectology, where systems with a high degree of granularity are needed. For many purposes it is important to be able to refer to varieties spoken at a particular location: in a particular town, in parts of a town or in a village. Although the International Organization for Standardization (ISO) has been dealing with the topic for many years, no solutions are at hand that have been widely adopted by the scholarly communities. Scholars have remained somewhat reluctant to adopt ISO 639 for research-driven language resources, which may be due to the fact that it lacks intuitiveness, flexibility and completeness. Another reason for this hesitancy may be that ISO standards are not freely available. Yet, parts of ISO 639 are incorporated in Best Current Practice 47 (BCP 47), which in turn is referred to in the TEI Guidelines.
There are other systems, such as Linguasphere (http://www.linguasphere.info) or Glottolog (http://glottolog.org/), which are conceptually interesting and display a sophisticated, very differentiated structure, but unfortunately enjoy little support in academia and even less in industry. The problem has been on the agenda of both VICAV and TUNICO, and again we have been working in accordance with the TEI Guidelines.
In the TEI world, it has become common practice to make use of the xml:lang attribute to identify both linguistic varieties and writing systems. In this hybrid approach, the value of the attribute should be constructed in accordance with BCP 47 which in turn refers to and aggregates a number of ISO standards (639-1, 639-2, 639-3, ISO 15924, ISO 3166). ISO 639-3 offers a considerable degree of differentiation for Arabic which is defined as a macrolanguage with 30 individual languages (http://www-01.sil.org/iso639-3/macrolanguages.asp). While the situation for Arabic varieties is therefore much better than for those of many other languages, the system is by far not fine-grained enough to cope with everything participants of the AIDA conference might need. To remedy this situation, the colleagues in the VICAV project developed an adapted system which allows for the identification of local varieties (Budin, Majewski, & Moerth 2012: 45).
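The composition of such tags can be illustrated with a minimal sketch. It assumes a scheme that combines the macrolanguage subtag 'ar' with an ISO 639-3 individual-language subtag, an optional ISO 15924 script code and a BCP 47 private-use subtag for the locality; the function name and the locality convention are our own illustrative assumptions, and the actual VICAV scheme (Budin, Majewski, & Moerth 2012: 45) may differ in detail.

```python
# Sketch (assumption): composing BCP 47-style tags for local Arabic
# varieties. 'aeb' = Tunisian Arabic, 'apc' = North Levantine Arabic,
# 'arz' = Egyptian Arabic (ISO 639-3 individual languages of the
# macrolanguage 'ar'). The '-x-' locality subtag is a hypothetical
# project convention, not part of any standard registry.

def make_variety_tag(extlang, script=None, locality=None):
    """Compose a tag such as 'ar-aeb-x-tunis'."""
    parts = ["ar", extlang]
    if script:                  # ISO 15924 script code, e.g. 'Latn'
        parts.append(script)
    if locality:                # private-use extension per BCP 47
        parts += ["x", locality]
    return "-".join(parts)

print(make_variety_tag("aeb", locality="tunis"))  # ar-aeb-x-tunis
print(make_variety_tag("apc", script="Latn"))     # ar-apc-Latn
```

Because BCP 47 reserves the singleton 'x' for private use, such locality subtags remain well-formed tags while carrying project-specific granularity.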
In many specialised projects, adaptations of the encoding system, i.e. the modelling of the data, are required. While many might tend to let ‘technicians’ resolve these issues, one must keep in mind that the only specialists capable of doing this properly are the humanities scholars themselves, who alas cannot be spared the effort. One of the goals of our dictionary projects was the development of a uniform dictionary structure to allow cross-dictionary queries and the use of the same tools on the various resources. To this end the researchers worked on a customization of the TEI P5 dictionary module (Guidelines 2015: 271-309) which has become fairly popular in such projects. The system has been used successfully for lexicographic data encoding in our institute, where it is meant to be a multi-purpose system targeting both human users and software applications (Budin, Moerth, & Schopper 2015).
Any general-purpose system such as the TEI is bound to have conceptual gaps. One such gap was the case of what in Semitic studies is commonly referred to as a root (Budin, Majewski, & Moerth 2012: 43), for which there was no standard way of encoding. Other problems were missing labels in the standard vocabularies for word classes or morphological categories such as count plural, construct state, collective noun and others typically used in Arabic linguistics.
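One conceivable TEI encoding of such a root, sketched here purely for illustration (the choice of a typed <gram> element and all data are assumptions, not the actual VICAV customization), can be processed with standard XML tooling:

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI P5 dictionary entry carrying the Semitic root in a
# typed <gram> element; the real VICAV customization is documented in
# Budin, Majewski & Moerth (2012: 43) and may differ.
ENTRY = """
<entry xmlns="http://www.tei-c.org/ns/1.0" xml:id="ktab_001">
  <form type="lemma"><orth>ktāb</orth></form>
  <form type="inflected"><orth>ktub</orth></form>
  <gramGrp>
    <gram type="pos">noun</gram>
    <gram type="root">k-t-b</gram>
  </gramGrp>
  <sense>
    <cit type="translation" xml:lang="en"><quote>book</quote></cit>
  </sense>
</entry>
"""

TEI = "{http://www.tei-c.org/ns/1.0}"          # Clark-notation namespace
entry = ET.fromstring(ENTRY)
root_elem = entry.find(f".//{TEI}gram[@type='root']")
print(root_elem.text)  # k-t-b
```

Keeping the root in its own typed element, rather than in free text, is what makes cross-dictionary root queries of the kind discussed below mechanically possible.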
An important issue in the development of the digital humanities (which we understand as a community of practice rather than as a discipline) is the dissemination of acquired know-how. It is not enough to implement innovative solutions; for the field as a whole it is key to ensure that new methods are properly documented and that this documentation is made available. We need more documentation of decisions made, of workflows, of tools etc. This is why both VICAV and TUNICO devote considerable resources to making their methods and data available to others and thus furnishing examples that can be recycled in future projects. One of the steps taken to achieve this end was the establishment of the DictGate website, a research platform that supports (groups of) researchers in need of solutions that can be applied without much logistical and technical overhead. DictGate is used for the exchange of lexicographic tools, data and documentation. It is a freely accessible service based at the Austrian Academy of Sciences which provides free lexicographic resources and supports the principle of open access (https://clarin.oeaw.ac.at/lrp/dict-gate/).
Having compiled a digital dictionary, lexicographers are in need of solutions to access this data, to search it, to analyse it and to publish it. While digital data usually allows the preparation of printing templates, our main concern has been digital availability.
So far the dictionaries described here have all been made available via the VICAV website (http://acdh.oeaw.ac.at/vicav), each with their own specialised interface. In the future, the data will be available through the web-gate of CLARIN Centre Vienna (https://clarin.oeaw.ac.at/).
All these interfaces rely on a technological framework called corpus_shell, on which most of our web-based applications have been built (https://clarin.oeaw.ac.at/corpus_shell). The system has been developed at the Austrian Academy of Sciences over several years and is being used for a number of web-based applications. It is very flexible and makes it possible to build new applications almost on the fly. It is based on a modular service-oriented architecture and designed to be put to work in a distributed and heterogeneous virtual landscape.
As mentioned before, one of the purposes of creating the VICAV dictionaries was comparative research. What we are aiming at is a lexicographic system that allows linguists to compare different Arabic varieties: a single interface, the ‘diatopic dictionary’, which allows users to query a number of dictionaries and to obtain integrated results. Currently, our dictionary editor is capable of performing this task. A web interface is planned as the next stage in the development of the VICAV infrastructure. As we have outlined before, such a system requires the fulfilment of several preconditions, the most important of which is definitely an encoding system that is by and large the same across all the dictionaries involved (cf. above 3.2.2 Encoding Issues).
The central issue in conceptualising such an interface is deciding on which data to use to match the dictionary entries in a meaningful manner. The first thing that comes to mind are, of course, the aforementioned roots. All our dictionaries contain this kind of information. It is important to note that we do not proceed from the synchronic situation, but attribute corresponding Classical Arabic roots wherever a colloquial lexeme can be traced back to a CA cognate. In a second step, it would make sense to add formalised morphological information, which would then allow users to query for roots in combination with morphological patterns. At the moment it is only possible to match the roots with word class information. However, comparing roots has obvious drawbacks, as many high-frequency lexemes of the modern language cannot be traced back to CA roots. The English word ‘car’ is a good example, having etymologically completely unrelated counterparts in the various varieties:
Salé      ṭumubil (ṭumubilat)     [noun]  ‘car’
Tunis     kaṛhba (kṛāhib)         [noun]  ‘car’
Malta     karozza (karozzi)       [noun]  ‘car’
Cairo     ʕaṛabiyya (ʕaṛabiyyāt)  [noun]  ‘car’
Damascus  sayyāra (sayyārāt)      [noun]  ‘car’
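A root-based cross-dictionary lookup can be sketched as follows; the data structure and field names are illustrative assumptions, not the actual VICAV data model. Note how lexemes without an attributable CA root, such as Tunis kaṛhba, simply drop out of the root index:

```python
from collections import defaultdict

# Toy cross-dictionary index keyed on (Classical Arabic root, pos);
# entries and field names are invented for illustration.
entries = [
    {"variety": "Tunis",    "lemma": "ktāb",      "root": "k-t-b", "pos": "noun"},
    {"variety": "Damascus", "lemma": "ktāb",      "root": "k-t-b", "pos": "noun"},
    {"variety": "Cairo",    "lemma": "kitāb",     "root": "k-t-b", "pos": "noun"},
    # 'car': etymologically unrelated forms, some without a CA root
    {"variety": "Tunis",    "lemma": "kaṛhba",    "root": None,    "pos": "noun"},
    {"variety": "Cairo",    "lemma": "ʕaṛabiyya", "root": "ʕ-r-b", "pos": "noun"},
]

index = defaultdict(list)
for e in entries:
    if e["root"] is not None:       # rootless lexemes cannot be matched
        index[(e["root"], e["pos"])].append((e["variety"], e["lemma"]))

print(index[("k-t-b", "noun")])
# [('Tunis', 'ktāb'), ('Damascus', 'ktāb'), ('Cairo', 'kitāb')]
```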
There is a second type of information stored in the entries which currently helps to generate much more meaningful results. These are the senses, those XML elements which in the TEI system contain the semantically grouped translation equivalents of the lemmas. The formalised combination of ‘book’ with a part-of-speech label indicating noun would look like this: (sense = *book*) + (pos = noun). This then creates a result-set that looks like this:
MSA       kitāb (kutub)           [noun]  ‘book’
Salé      ktab (ktub, ktuba)      [noun]  ‘book’
Tunis     ktāb (ktub)             [noun]  ‘book’
Malta     ktieb (kotba)           [noun]  ‘book’
Cairo     kitāb (kutub)           [noun]  ‘book’
Damascus  ktāb (kǝtᵊb, kǝtob)     [noun]  ‘book’
Baghdad   ktāb (kutub)            [noun]  ‘book’
While the results of this query look perfect, the method is often considerably flawed by homonyms in the language used for the translations. When querying for ‘letter’, the Egyptian result set will contain both ḥarf ‘letter (of the alphabet)’ and gawāb ‘letter, message’. There are plenty of such examples:
bat: maḍrab ‘bat, racquet’; wiṭwāṭ ‘bat (animal)’
glass: ʔizāz ‘glass (material)’; kubbāya ‘drinking glass’
spring: rabīʕ ‘springtime’; ʕēn ‘spring (of water)’; lawlab ‘spring, coil’
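The naive sense-plus-pos match can be sketched as follows; the data set and the function are purely illustrative, not part of the VICAV code base. Substring matching on the English glosses retrieves both ḥarf and gawāb for ‘letter’, reproducing the homonym problem just described:

```python
# Matching on translation equivalents alone: English homonyms in the
# glossing language pull unrelated lexemes into one result set.
cairo = [
    {"lemma": "ḥarf",   "pos": "noun", "senses": ["letter (of the alphabet)"]},
    {"lemma": "gawāb",  "pos": "noun", "senses": ["letter, message"]},
    {"lemma": "maḍrab", "pos": "noun", "senses": ["bat, racquet"]},
    {"lemma": "wiṭwāṭ", "pos": "noun", "senses": ["bat (animal)"]},
]

def query(entries, keyword, pos):
    """Naive (sense = *keyword*) + (pos = ...) match on English glosses."""
    return [e["lemma"] for e in entries
            if e["pos"] == pos and any(keyword in s for s in e["senses"])]

print(query(cairo, "letter", "noun"))  # ['ḥarf', 'gawāb']
print(query(cairo, "bat", "noun"))     # ['maḍrab', 'wiṭwāṭ']
```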
There exist at least two possible solutions to this dilemma. Both make use of the information stored in the senses of the dictionary entries, those parts in the entries that contain the translation equivalents. Either one links these elements to one another (which would require that each sense be linked with senses in all the other dictionaries), or one links them to an abstract ‘outer’ system, some kind of semantic resource serving as a pivot between the senses.
We have conducted experiments with the first approach, automatically assigning links. Given the small number of lexemes contained in the dictionaries, the evaluated results were quite good. However, it is to be expected that the percentage of wrong assignments will rise considerably with larger dictionaries.
The second method would definitely be the more future-oriented approach. It would mean linking the senses to abstract concepts defined in a digitally available ontology. Actually, there exist a great number of such lexico-semantic databases for a wide range of languages. Currently, we are evaluating possibilities of working with Arabic WordNet (http://globalwordnet.org/arabic-wordnet/).
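The pivot approach can be sketched as follows; the concept identifiers are invented placeholders (real Arabic WordNet synset identifiers would be looked up, not hard-coded), and the Damascus item is likewise only illustrative:

```python
# Sketch of the pivot approach: each sense points at a language-neutral
# concept identifier instead of being linked pairwise to every other
# dictionary. All identifiers below are invented placeholders.
senses = [
    ("Cairo",    "gawāb",  "own:letter-message"),
    ("Damascus", "maktūb", "own:letter-message"),   # illustrative item
    ("Cairo",    "ḥarf",   "own:letter-grapheme"),
]

def equivalents(concept_id):
    """Collect all (variety, lemma) pairs linked to one pivot concept."""
    return [(variety, lemma) for variety, lemma, cid in senses
            if cid == concept_id]

print(equivalents("own:letter-message"))
# [('Cairo', 'gawāb'), ('Damascus', 'maktūb')]
```

Because the pivot separates the two readings of ‘letter’ into distinct concepts, the homonym collisions of the gloss-based query disappear by construction.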
The interface presented at AIDA 2015 is integrated into the Viennese Lexicographic Editor and still very experimental in nature. A more sophisticated browser-based interface is being developed and is planned to be made publicly available in 2016 through the VICAV website.
An additional field of activities is the expansion of the dictionary collection. By making the project better known in the community, we hope for wider participation. As additional data-sets can be integrated without much overhead, we are looking for small lexical databases, dictionaries and glossaries which can easily be transformed into TEI dictionaries and imported into the database. The ACDH offers support in converting the data and provides free tools to access and edit it. VICAV has always striven to pursue a clear policy of transparent author attribution that seeks to ensure that authors retain full control of their data. As the project is based at the Austrian Academy of Sciences (Austria’s largest research facility outside the universities), and conducted in collaboration with the University of Vienna and the large European infrastructure consortia CLARIN and DARIAH, it is to be expected that the efforts will result in sustainable structures.
Digital methods are increasingly becoming mainstream in all linguistic disciplines. They not only affect the way scholars work, they also have considerable social implications. Following the paradigms of the digital humanities, we hope to involve many more colleagues in our build-up of digital infrastructures, both by inviting them to contribute to our projects and by offering the outcomes of our projects to the community, thus propagating the spirit of open access and open source.
Abdellatif, K. 2010. Dictionnaire «le Karmous» du Tunisien. (http://www.fichier-pdf.fr /2010/08/31/m14401m/; accessed
Budin, G., Majewski, S., & Moerth, K. 2012. “Creating Lexical Resources in TEI P5”, Journal of the Text Encoding Initiative, 3. (doi:10.4000/jtei.522).
Budin, G., Moerth, K., Romary, L., & Schopper, D. 2015. “Modelling frequency data. Methodological considerations on the relationship between dictionaries and corpora”, Journal of the Text Encoding Initiative, 8. (https://jtei.revues.org/1356; doi:10.4000/jtei.1356).
Budin, G., & Moerth, K. 2011. “Hooking up to the corpus: the Viennese Lexicographic Editor’s corpus interface”, I. Kosem & Kosem, K. (eds.), Electronic lexicography in the 21st century: new applications for new users. Proceedings of eLex 2011 conference. Bled, Slovenia: Trojina, Institute for Applied Slovene Studies. 52-59.
Budin, G., Moerth, K., & Durco, M. 2013. “European Lexicography Infrastructure Components”, Kosem, I., Kallas, J., Gantar, P., Krek, S., Langemets, M., & Tuulik, M. (eds.), Electronic lexicography in the 21st century: thinking outside the paper. Proceedings of the eLex 2013 conference, 17-19 October 2013. Tallin, Estonia: Trojina, Institute for Applied Slovene Studies/Eesti Keele Instituut. 76-92.
Declerck, T., Lendvai, P., & Moerth, K. 2013. “Collaborative Tools: From Wiktionary to LMF, for Synchronic and Diachronic Language Data”, Francopoulo, G. (ed.), LMF. Lexical Markup Framework. John Wiley & Sons. 175-186.
Declerck, T., Moerth, K., & Wandl-Vogt, E. 2014. “A SKOS-based Schema for TEI encoded Dictionaries at ICLTT”, LREC 2014, Ninth International Conference on Language Resources and Evaluation. 26.-31. May 2014, Reykjavik. Reykjavik, Iceland: European Language Resources Association (ELRA).
Declerck, T., Wandl-Vogt, E., Moerth, K., & Resch, C. 2014. “Towards a Unified Approach for Publishing Regional and Historical Language Resources on the Linked Data Framework”, Workshop on Collaboration and Computing for Under-Resourced Languages in the Linked Open Data Era. Co-located with LREC 2014. 26.-31 May 2014, Reykjavik. Reykjavik, Iceland: European Language Resources Association (ELRA).
Marçais, W., & Guîga, A. 1958-61. Textes arabes de Takroûna. II: Glossaire. 8 vol. Paris.
Moerth, K., Procházka, S., & Dallaji, I. 2014. “Laying the Foundations for a Diachronic Dictionary of Tunis Arabic. A First Glance at an Evolving New Language Resource”. A. Abel, Vettori, C., & Ralli, N. (eds.), Proceedings of the XVI EURALEX International Congress: The User in Focus. Bolzano, EURALEX 2014: 377-387.
Nicolas, A. 1911. Dictionnaire français-arabe: idiome tunisien and Dictionnaire arabe-français. Tunis.
Procházka, S., & Moerth, K. 2015. “The Vienna Corpus of Arabic Varieties: building a digital research environment for Arabic dialects”. Proceedings of the 10th AIDA Conference, Doha 2013. (In print).
Quéméneur, J. 1962. “Glossaire de dialectal”, IBLA 1962. 325-67.
Simons, G., & Bird S. 2008. “Toward a Global Infrastructure for the Sustainability of Language Resources”, Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation. (http://www01.sil.org/~simonsg/preprint/PACLIC22.pdf; Accessed on 13.9.2015).
Singer, H. 1984. Grammatik der Arabischen Mundart der Medina von Tunis. Berlin-New York.
TEI Consortium. 2012. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version 2.8.0. Last updated 6th April. (www.tei-c.org/release/doc/tei-p5-doc/en/Guidelines.pdf).