24.11.2013: Language Identification

Task: Identify the language of (each word in) a text.

Idea: Use grapheme statistics to determine the language of words/text.

Solution:
If you have unlimited memory and processing time, you could create a dictionary of words $w \in W$ for each language $L_{ID}\subseteq W^*$. A look-up of each word $w$ in each dictionary would give you a detailed overview. But what about words that occur in several languages, such as the following English/German words: "an", "man", "in", "text", etc.? You could compute word statistics for each language:
\[L_{\hat{ID}} = \arg\max_{ID} P(L_{ID}|\underline{w}) = \arg\max_{ID} P_{L_{ID}}(\underline{w})P(L_{ID}) \]
An N-gram Markov Model would be an intuitive choice:
\[P_{L_{ID}}(\underline{w})=P_{L_{ID}}(w_1..w_m) = \prod_{i=1}^m P_{L_{ID}}(w_i|w_1..w_{i-1}) \approx \prod_{i=1}^m P_{L_{ID}}(w_i|w_{i-(N-1)}..w_{i-1})\]
where $P_{L_{ID}}(\underline{w})$ is the probability of a word sequence $\underline{w}$ for a given language $L_{ID}$. For $N=1$, each word in the dictionary has a probability. The 10 most probable words for English are:
"the, be, and, of, a, in, to, have, to, it".
The 10 most probable words for German are:
"der, die, und, in, den, von, zu, das, mit, sich".
Using words and word statistics is probably the best you can do. Unfortunately, this method requires a considerable amount of memory, CPU power, etc., and knowledge of every word of a language. At least the last requirement is hardly achievable.
The idea is to reduce the number of features used to compute the $N$-gram model. An intuitive approach is to drop the word segmentation and use the grapheme sequence itself, including spaces. A word $w$ can be written as a sequence of graphemes $\underline{g} \in G_{ID}^* \subseteq G^*$. The set of graphemes $G_{ID}$ of a language $L_{ID}$ is limited by definition. For German, there are 30 case-normalized graphemes: $G_{ID} = \{A,B,..,Y,Z,Ä,Ö,..\}$. The task becomes trivial if each language has dedicated graphemes. The graphemes "ä", "ö", "ü", "ß" are typical for German-like languages. But what if none of those graphemes occurs in the text? Let's talk about the non-trivial case where each language uses the same set of graphemes, $G_{i} = G_{j}$ for $i \neq j$.
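For the trivial case, a check for dedicated graphemes could be as simple as the following sketch (the set of "German-only" graphemes is just an illustration):

```python
# Graphemes that, in this rough illustration, occur in German but not in English text.
GERMAN_ONLY = set("äöüß")

def looks_german(text):
    """Trivial case: decide 'German' as soon as a dedicated grapheme appears."""
    return any(g in GERMAN_ONLY for g in text.lower())

print(looks_german("Straße"))   # True
print(looks_german("street"))   # False
```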
In principle, you can use a similar computation as already described:
\[L_{\hat{ID}} = \arg\max_{ID} P(L_{ID}|\underline{w}) = \arg\max_{ID} P_{L_{ID}}(\underline{w})P(L_{ID}) = \arg\max_{ID} P_{L_{ID}}(\underline{g})P(L_{ID}) \]
for $g \in G_{ID} \cup \{\text{space}\}$. The $N$-gram Markov Model would be:
\[P_{L_{ID}}(\underline{g})=P_{L_{ID}}(g_1..g_m) = \prod_{i=1}^m P_{L_{ID}}(g_i|g_1..g_{i-1}) \approx \prod_{i=1}^m P_{L_{ID}}(g_i|g_{i-(N-1)}..g_{i-1})\]
Using $N=1$ results in the following grapheme-ranking:
UK-English: e t a o i n s r h l d c u m f p g w y b v k x j q z
German: e n i s r a t d h u l c g m o b w f k z v ü p ä ß j ö y q x
Dutch: e n a t i r o d s l g h v k m u b p w j c z f x y
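Such a ranking can be obtained by simple counting; a minimal sketch (the sample sentence is only a stand-in for a real corpus):

```python
from collections import Counter

def grapheme_ranking(text, alphabet):
    """Rank the graphemes of `alphabet` by their relative frequency (N=1) in `text`."""
    counts = Counter(g for g in text.lower() if g in alphabet)
    total = sum(counts.values())
    return [(g, counts[g] / total) for g, _ in counts.most_common()]

# Hypothetical usage with a short German sample text:
sample = "der schnelle braune fuchs springt über den faulen hund"
print([g for g, _ in grapheme_ranking(sample, set("abcdefghijklmnopqrstuvwxyzäöüß"))])
```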
If you have a long text, you will most probably get a sufficient result, at least for well-distinguishable languages. For shorter texts or single words, you should investigate larger $N$ or alternative models.
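A rough sketch of a grapheme model with $N=2$ and a very naive add-one smoothing (anticipating the outlook below); the two training sentences are placeholders for real corpora:

```python
import math
from collections import defaultdict

def train_bigram_model(text, alphabet):
    """Estimate P(g_i | g_{i-1}) with add-one smoothing over `alphabet` plus space."""
    symbols = alphabet | {" "}
    bigram_counts = defaultdict(lambda: defaultdict(int))
    chars = [c for c in text.lower() if c in symbols]
    for prev, cur in zip(chars, chars[1:]):
        bigram_counts[prev][cur] += 1
    V = len(symbols)
    return {
        prev: {cur: (bigram_counts[prev][cur] + 1) / (sum(bigram_counts[prev].values()) + V)
               for cur in symbols}
        for prev in symbols
    }

def score(text, model, alphabet):
    """Log-probability of `text` under a bigram model (unseen pairs covered by smoothing)."""
    symbols = alphabet | {" "}
    chars = [c for c in text.lower() if c in symbols]
    return sum(math.log(model[prev][cur]) for prev, cur in zip(chars, chars[1:]))

# Placeholder training data; real models would be estimated from large corpora.
ALPHABET = set("abcdefghijklmnopqrstuvwxyzäöüß")
models = {
    "en": train_bigram_model("the quick brown fox jumps over the lazy dog", ALPHABET),
    "de": train_bigram_model("der schnelle braune fuchs springt über den faulen hund", ALPHABET),
}
word = "sch"
print(max(models, key=lambda lang: score(word, models[lang], ALPHABET)))  # "de" with these toy snippets
```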

Conclusion: Language identification using $N$-gram models on words and graphemes.

Outlook: Use $N$-gram smoothing, try alternative models, and apply the approach to other applications such as text-topic identification.

References: ToDo

Additional Notes: The open-source HTML math library MathJax was used: http://www.mathjax.org