Over the last few years, neural models have enabled spectacular progress in natural language processing (NLP). The DeepTypo project proposes to use multilingual speech models to design methods that automatically extract, from audio recordings, typological information useful for language documentation and research (phonological and morphosyntactic complexity indices, similarities between languages…).

Built on a collaboration between linguists and NLP researchers, the DeepTypo project sits squarely within the digital humanities and addresses fundamental questions for both communities.

It will help linguists document and analyze languages, especially “rare” or “under-resourced” languages, by providing them with new tools and methods that will allow them, for example, to bring out new information on similarities between languages. Beyond the “tool development” aspect, the DeepTypo project aims, above all, to show that the representations at the heart of neural networks can be used to answer fundamental questions in linguistics, taking as examples current issues in creolistics (the study of creoles) and in the dialectology of Sino-Tibetan languages.

Extracting typological information, the core of the DeepTypo project, will also help identify the limits of fine-tuning. This approach has made it possible to develop, at low cost, NLP systems for several languages and many tasks, and it is often presented today as "THE" solution to all NLP problems. Identifying the linguistic features captured by neural networks will allow us to verify whether this is indeed the case: if a model is, for example, unable to detect and represent the tones of a language, it is more than likely that it cannot be used to develop a system for tonal languages.

To achieve this ambitious goal, we will use neural representation analysis methods to interpret and understand the decisions of neural networks and will develop them along four original axes:

  1. Drawing on the collaboration with the different partners of the project, we will try to identify richer features than those considered in the state of the art: whereas existing work has focused on “simple” features (speaker gender, language of the utterance, ...), we will also consider information related to the diversity of languages and to their linguistic characteristics (phonemic inventory, identification of tonal languages, ...).

  2. In addition to existing analysis methods (e.g. linguistic probes), we will develop new methods to measure similarity between languages. Again, close collaboration between linguists and NLP researchers will be essential to define a linguistically relevant similarity (or similarities).

  3. We will apply our methods to the 230 languages of the Pangloss collection (an archive of rare languages managed by LACITO) and to 15 creoles (collected mainly by LLL). These large-scale experiments will allow us to test state-of-the-art pre-trained models on languages with a wide variety of linguistic features rarely considered in NLP work.

  4. We will apply these methods to language documentation support tasks, an application that has, until now, never been considered.
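
As a rough illustration of the language-similarity idea in item 2, the sketch below averages utterance-level representations into one vector per language and compares languages by cosine similarity. This is only a minimal baseline under stated assumptions, not the method the project will develop: the function names, the language labels and the three-dimensional toy vectors are all hypothetical, and in practice the vectors would come from a pre-trained multilingual speech model.

```python
import math


def centroid(vectors):
    """Mean-pool a list of equal-length vectors into a single vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def language_similarities(embeddings):
    """Return cosine similarity between language centroids for every pair.

    `embeddings` maps a language label to a list of utterance vectors.
    """
    centroids = {lang: centroid(vecs) for lang, vecs in embeddings.items()}
    langs = sorted(centroids)
    return {
        (a, b): cosine(centroids[a], centroids[b])
        for i, a in enumerate(langs)
        for b in langs[i + 1:]
    }


# Hypothetical toy data: two utterance vectors per language.
toy_embeddings = {
    "lang_a": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "lang_b": [[0.9, 0.1, 0.0], [1.0, 0.1, 0.1]],
    "lang_c": [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9]],
}

similarities = language_similarities(toy_embeddings)
for pair, score in sorted(similarities.items()):
    print(pair, round(score, 3))
```

Centroid cosine similarity is deliberately crude: it says nothing about which linguistic properties drive the resemblance, which is why a linguistically relevant definition of similarity has to be worked out jointly with linguists.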

Funding

The DeepTypo project is funded by the French Agence Nationale de la Recherche (ANR).
