TuniCo - Linguistic Dynamics in the Greater Tunis Area

Corpus

The corpus consists of transcriptions of more than 30 hours of recorded dialogues and narrative interviews, which were collected by Ines Dallaji and Ines Gabsi during a field trip in 2013. In its current form it is made up of 24 digital documents. In line with the research plan, our researchers approached speakers under the age of thirty-five from different social backgrounds who grew up and still live in the Greater Tunis area. The number of male and female speakers in our recordings is balanced.

The digital corpus is encoded in accordance with Chapter 8 (Transcriptions of Speech) of the Guidelines of the Text Encoding Initiative (TEI P5). The data has been validated against the standard tei_all schema.

The linguistic annotations were applied in several steps. The first step was carried out manually by linking tokens to the dictionary. This task was performed across the corpus using the lexicographic editor's tokenEditor component. Roughly 27% of the corpus (26,101 tokens) was annotated in this manner.

The remaining tokens were furnished with references using two approaches: (a) copying the manually created data to the rest of the corpus and (b) assigning references on the basis of automatically generated wordforms. By propagating the manually created data to the untagged parts of the corpus and matching the remaining tokens with automatically created wordforms, we captured another 52% of the tokens. A third figure has to be taken into consideration: the 13,154 French items (~14%), most of which were not entered in the dictionary. This leaves ~7% of the items that have not yet been linked to the dictionary.
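The two approaches can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual tooling: the function name `propagate_references`, the data structures, and the placeholder forms and entry IDs are all hypothetical.

```python
from collections import defaultdict


def propagate_references(tokens, manual_refs, generated_forms):
    """Assign dictionary references to tokens (hypothetical sketch).

    tokens          -- list of surface forms in corpus order
    manual_refs     -- dict: token index -> entry ID from the manual step
    generated_forms -- dict: wordform -> set of entry IDs from a generator
    """
    # (a) Collect every form/ID pairing seen in the manually linked tokens.
    seen = defaultdict(set)
    for index, entry_id in manual_refs.items():
        seen[tokens[index]].add(entry_id)

    refs = {}
    for index, form in enumerate(tokens):
        if index in manual_refs:
            # Keep the manual annotation unchanged.
            refs[index] = {manual_refs[index]}
        elif form in seen:
            # (a) Copy the manually created reference(s) to the same form.
            refs[index] = set(seen[form])
        elif form in generated_forms:
            # (b) Fall back to the automatically generated wordforms.
            refs[index] = set(generated_forms[form])
        # Tokens matching neither source stay unlinked.
    return refs


# Placeholder data, invented for illustration only.
tokens = ["formA", "formB", "formA", "formC", "formD"]
manual_refs = {0: "entry_1", 1: "entry_2"}
generated_forms = {"formC": {"entry_3"}}

refs = propagate_references(tokens, manual_refs, generated_forms)
unlinked = [i for i in range(len(tokens)) if i not in refs]
```

In this toy run, token 2 inherits its reference from the manually linked token 0, token 3 is matched through a generated wordform, and token 4 remains unlinked, which corresponds to the residue of items described above.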

The main challenge in creating the wordforms used to link the tokens with the dictionary entries was the verbs. Currently, the wordform generator produces 34,798 wordforms from the ~8,000 lemmata of the dictionary. This number will grow over time as entries are added to the dictionary.

The last step was the application of word class information. This was achieved simply by reading the information from the dictionary. For the automatically applied IDs, this has led to double assignments in a number of high-frequency items.
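One plausible way such double assignments arise is sketched below. It assumes that an automatically linked token can carry more than one entry ID when its form matches several dictionary entries. That mechanism, the function `word_classes`, and the sample dictionary are assumptions made for illustration, not a description of the actual implementation.

```python
def word_classes(entry_ids, dictionary):
    """Return the distinct word classes of the given entry IDs, sorted.

    dictionary -- dict: entry ID -> word class label (hypothetical layout)
    """
    return sorted({dictionary[entry_id] for entry_id in entry_ids})


# Invented entries: two homographs with different word classes.
dictionary = {"entry_1": "noun", "entry_2": "verb", "entry_3": "noun"}

single = word_classes({"entry_1"}, dictionary)
double = word_classes({"entry_1", "entry_2"}, dictionary)
```

Here `single` holds one word class, while `double` holds two, because the token was linked to two entries. Tokens like the second one would need disambiguation in context.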

Search the corpus ...

Find explanations about the query language ...

Statistics

The corpus is made up of 142,317 tokens: 94,652 w-tags and 47,665 pc-tags. Several statistics files have been prepared so far.
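Counts of this kind can be reproduced with a few lines of code. The sketch below counts `w` and `pc` elements in the TEI namespace using Python's standard library. The sample fragment is invented and deliberately minimal (it is not schema-valid TEI), and the function name `count_tokens` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented fragment for illustration; real documents also need a teiHeader.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <u who="#speakerA"><w>word1</w><w>word2</w><pc>,</pc><w>word3</w><pc>.</pc></u>
  </body></text>
</TEI>"""


def count_tokens(xml_text):
    """Count word (w) and punctuation (pc) elements in a TEI document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        if element.tag == f"{{{TEI_NS}}}w":
            counts["w"] += 1
        elif element.tag == f"{{{TEI_NS}}}pc":
            counts["pc"] += 1
    # A token is taken to be either a word or a punctuation character.
    counts["tokens"] = counts["w"] + counts["pc"]
    return counts


counts = count_tokens(SAMPLE)
```

For the sample, `counts` holds 3 w elements, 2 pc elements and 5 tokens. The same relation holds for the published figures: 94,652 + 47,665 = 142,317. The annotation percentages quoted in the corpus description also appear to be relative to the w-tags rather than to all tokens, since 26,101 / 94,652 is about 27.6% and 13,154 / 94,652 is about 13.9%.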

Get a list of all wordforms (ordered by frequency). Foreign items (mainly French) are displayed in red. Loading may take a while as the list contains 9886 items.

Get a list of all wordforms. Foreign items (mainly French) are displayed in red. Loading may take a while as the list contains 9886 items.

Get a list of all foreign wordforms.

Get a list of the 300 most frequent verbs.

Get a list of the 300 most frequent nouns.

Get a list of all adjectives.