

Table of Contents
1. Language Detection Overview
2. Data-Set
3. Language Detection
   3.1 Data loading and Pre-processing
   3.2 Model building
   3.3 Training and evaluation
   3.4 Prediction
4. Source Code
5. References
1. Language Detection Overview
Language detection (or identification) is a classic NLP problem: guessing the language of a piece of text, whether it is a long document or a short tweet.
The problem is commonly solved with an N-gram approach combined with traditional classifiers such as Naive Bayes or SVMs. Several Python libraries, such as langid and fastText, already provide language detection with good results.
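As a quick point of reference, here is how langid can be used off the shelf (a minimal sketch; the example sentence and the printed score are illustrative):

import langid  # pip install langid

# classify() returns a (language_code, score) pair
lang, score = langid.classify('This is a short English sentence.')
print(lang, score)  # e.g. en -54.4 -- the score is an unnormalized log-probability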
The hard part is identifying the language of short text with high accuracy.
As we know, deep learning (neural networks) is a hot topic and has solved many natural language processing and computer vision tasks with promising results.
Now we will try to solve this problem with a character-level convolutional network, following the paper “Character-level Convolutional Networks for Text Classification” (http://arxiv.org/abs/1509.01626), which showed that CNNs also give good, promising results on NLP tasks.
2. Data-Set
We will pick languages that are similar to each other, starting with five: English, German, French, Italian, and Spanish.
Okay, but where do we find the right data-set?
Deep neural networks are data hungry: more training data is needed for the model to generalize well.
“Wikipedia”….!
Wikipedia is an information repository for people, but for data scientists it is a corpus. Wikipedia publishes dumps at https://dumps.wikimedia.org/, so that will be our main data source.
I downloaded the following files:
- English – https://dumps.wikimedia.org/enwiki/20190920/enwiki-20190920-pages-articles1.xml-p10p30302.bz2
- German – https://dumps.wikimedia.org/dewiki/20190920/dewiki-20190920-pages-articles4.xml-p4420050p5466270.bz2
- French – https://dumps.wikimedia.org/frwiki/20190920/frwiki-20190920-pages-articles6.xml-p7494135p8994135.bz2
- Italian – https://dumps.wikimedia.org/itwiki/20190920/itwiki-20190920-pages-articles1.xml-p2p277087.bz2
- Spanish – https://dumps.wikimedia.org/eswiki/20190920/eswiki-20190920-pages-articles1.xml-p5p143635.bz2
Each archive contains a set of articles, but how do we extract the article text from these files? After some searching I found a handy script, “process_wiki.py”, which uses the “gensim” library to extract the text of all articles into a single text file (a minimal sketch of this step is shown after the listing below).
/data_dir
    ./en.txt - English
    ./de.txt - German
    ./fr.txt - French
    ./it.txt - Italian
    ./es.txt - Spanish
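For reference, a minimal sketch of what such an extraction step could look like with gensim's WikiCorpus (assuming gensim 3.x; the actual process_wiki.py script may differ in its details):

from gensim.corpora import WikiCorpus

def extract_articles(dump_path, out_path):
    """Extract plain article text from a Wikipedia .bz2 dump into one text file."""
    wiki = WikiCorpus(dump_path, lemmatize=False, dictionary={})
    with open(out_path, 'w', encoding='utf-8') as out:
        for tokens in wiki.get_texts():            # one token list per article
            out.write(' '.join(tokens) + '\n')     # one article per line

# extract_articles('enwiki-20190920-pages-articles1.xml-p10p30302.bz2', 'en.txt')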
Now our raw data is ready. Yup
3. Language Detection
3.1 Data loading and Pre-processing
First, we import the libraries we will need.
# core libs
import random
from collections import Counter

# numpy
import numpy as np

# Sklearn
import sklearn
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# keras
import keras
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Input, Embedding, Conv1D, MaxPool1D, Flatten, Dense
from keras.models import Model
Let's check the installed versions of the core libraries.
# Library version
print(f'keras= {keras.__version__}')
print(f'sklearn= {sklearn.__version__}')
print(f'numpy= {np.__version__}')
keras= 2.2.4
sklearn= 0.20.4
numpy= 1.16.4
Next we set up the basic configuration.
# Basic Configs
data_dir = '/data_dir'

# Number of articles (lines) to read from each file
num_of_articles = 10000

# Maximum sequence length
sentense_len = 150

# shingle configs
shingles_range = (70, 100, 130)

# how many shingles to generate per line (for each shingle length)
shingle_per_line = 10

# out of vocabulary token
oov_str = 'oov'
Here we create dictionaries mapping each language code to its full name and to its data file.
# language code wise full name mapping
lang_code_dict = {
    'en': 'english',
    'de': 'german',
    'fr': 'french',
    'it': 'italian',
    'es': 'spanish'
}

# language code wise data file mapping
data_info = {
    'en': data_dir + '/en.txt',
    'de': data_dir + '/de.txt',
    'fr': data_dir + '/fr.txt',
    'it': data_dir + '/it.txt',
    'es': data_dir + '/es.txt',
}
This is the data loading stage: we simply read lines from each file and convert them to lower case. The number of lines kept per language is controlled by the “num_of_articles” setting.
# data loading
data_dict = {}
for lang_code, file_path in data_info.items():
    with open(file_path, encoding='utf-8') as file:
        lines = file.readlines()
        lines = lines[:num_of_articles]
        # convert to lower case
        lines = [l.lower().strip() for l in lines]
        data_dict[lang_code] = lines
        print(lang_code, len(lines))
en 10000
de 10000
fr 10000
it 10000
es 10000
“Shingles” are also referred to as n-grams (here, character substrings). Below we define utility methods for generating shingles from a single line and from a list of lines.
def generate_shingles(line, length, total):
    """ Generate shingles from a single line """
    shingle_list = []
    max_index = len(line) - length
    if max_index > 0:
        for _ in range(total):
            index = random.randint(0, max_index)
            shingle_text = line[index:index+length]
            shingle_list.append(shingle_text)
    else:
        shingle_list.append(line)
    return shingle_list


def generate_shingles_lines(lines, length, total):
    """ Generate shingles from a list of lines """
    shingle_list = []
    for line in lines:
        shingles = generate_shingles(line=line, length=length, total=total)
        shingle_list.extend(shingles)
    return shingle_list
Why do we need shingles? Because each line of the data file is a whole article, which is long. That is why we defined two settings earlier: “shingles_range”, a tuple of shingle lengths, and “shingle_per_line”, the number of shingles to generate per line for each length.
Example:
shingles_range = (10, 20)
shingle_per_line = 5

Here the first shingle length is 10 with 5 shingles per line, and the second shingle length is 20 with another 5, so a total of 10 shingles are generated per line.
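To get a feel for the output, here is a hypothetical call (the slices are random, so your results will differ):

sample = 'language detection with character level convolutional networks'
print(generate_shingles(sample, length=10, total=3))
# e.g. ['detection ', 'character ', 'convolutio']  -- three random 10-character slices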
# generate shingles
shingle_data_dict = {}
for lang, lines in data_dict.items():
    shingle_list = []
    for s_range in shingles_range:
        shingles = generate_shingles_lines(lines, s_range, shingle_per_line)
        shingle_list.extend(shingles)
    shingle_data_dict[lang] = shingle_list
    print(lang, len(shingle_list))
en 300000
de 300000
fr 300000
it 300000
es 300000
We generated the shingles as a dictionary; now we flatten it into a list of records with matching labels. With 10,000 articles × 3 shingle lengths × 10 shingles each, we get 300,000 shingles per language, i.e. about 1.5M records in total.
# create list of lines and labels
data_lines, labels = [], []
for lang, samples in shingle_data_dict.items():
    data_lines.extend(samples)
    total_samples = len(samples)
    labels.extend([lang] * total_samples)
print(len(data_lines), len(labels))
1500000 1500000
In this stage we prepare the vocabulary. Why? Because our model works at the character level, so we need a character vocabulary that covers all the languages.
First we collect all characters that occur in the data lines and keep the 76 most frequent ones. The number 76 is not fixed; it depends on the problem at hand.
Notice that we also add an “OOV” (out-of-vocabulary) token: any character that is not in our vocabulary is treated as “OOV”.
# create list of all characters from all data lines
data_char_ls = []
for line in data_lines:
    char_ls = [c for c in line]
    data_char_ls.append(char_ls)

# count all characters
char_counter = Counter(x for xs in data_char_ls for x in set(xs))

# create vocabulary
char_vocab = [c[0] for c in char_counter.most_common(76)] + [oov_str]
print(char_vocab)
[' ', 'e', 'a', 'i', 'n', 'r', 't', 's', 'o', 'l', 'd', 'c', 'u', 'm', 'p', 'g', 'h', 'b', 'f', 'v', 'w', 'z', 'y', 'k', 'é', 'q', 'j', 'x', 'ó', 'í', 'ü', 'á', 'ä', 'è', 'ö', 'à', 'ñ', 'ú', 'ß', 'ò', 'ç', 'ù', 'ê', 'ô', 'â', 'î', 'ì', 'œ', 'û', 'ï', 'ō', '²', 'š', 'ë', 'č', 'ã', 'ł', 'ā', 'ø', 'ć', 'ū', 'ž', 'ı', 'å', 'ř', 'ş', 'ý', 'æ', 'α', 'ο', 'ă', 'о', 'а', 'ń', 'н', 'ν', 'oov']
# create dictionary for (char to index)
# we use (index + 1) because index 0 is reserved for padding
ch2int = {c: i+1 for i, c in enumerate(char_vocab)}
print(ch2int)
print()

# create dictionary for (index to char)
int2ch = {i: c for c, i in ch2int.items()}
print(int2ch)
{' ': 1, 'e': 2, 'a': 3, 'i': 4, 'n': 5, 'r': 6, 't': 7, 's': 8, 'o': 9, 'l': 10, 'd': 11, 'c': 12, 'u': 13, 'm': 14, 'p': 15, 'g': 16, 'h': 17, 'b': 18, 'f': 19, 'v': 20, 'w': 21, 'z': 22, 'y': 23, 'k': 24, 'é': 25, 'q': 26, 'j': 27, 'x': 28, 'ó': 29, 'í': 30, 'ü': 31, 'á': 32, 'ä': 33, 'è': 34, 'ö': 35, 'à': 36, 'ñ': 37, 'ú': 38, 'ß': 39, 'ò': 40, 'ç': 41, 'ù': 42, 'ê': 43, 'ô': 44, 'â': 45, 'î': 46, 'ì': 47, 'œ': 48, 'û': 49, 'ï': 50, 'ō': 51, '²': 52, 'š': 53, 'ë': 54, 'č': 55, 'ã': 56, 'ł': 57, 'ā': 58, 'ø': 59, 'ć': 60, 'ū': 61, 'ž': 62, 'ı': 63, 'å': 64, 'ř': 65, 'ş': 66, 'ý': 67, 'æ': 68, 'α': 69, 'ο': 70, 'ă': 71, 'о': 72, 'а': 73, 'ń': 74, 'н': 75, 'ν': 76, 'oov': 77} {1: ' ', 2: 'e', 3: 'a', 4: 'i', 5: 'n', 6: 'r', 7: 't', 8: 's', 9: 'o', 10: 'l', 11: 'd', 12: 'c', 13: 'u', 14: 'm', 15: 'p', 16: 'g', 17: 'h', 18: 'b', 19: 'f', 20: 'v', 21: 'w', 22: 'z', 23: 'y', 24: 'k', 25: 'é', 26: 'q', 27: 'j', 28: 'x', 29: 'ó', 30: 'í', 31: 'ü', 32: 'á', 33: 'ä', 34: 'è', 35: 'ö', 36: 'à', 37: 'ñ', 38: 'ú', 39: 'ß', 40: 'ò', 41: 'ç', 42: 'ù', 43: 'ê', 44: 'ô', 45: 'â', 46: 'î', 47: 'ì', 48: 'œ', 49: 'û', 50: 'ï', 51: 'ō', 52: '²', 53: 'š', 54: 'ë', 55: 'č', 56: 'ã', 57: 'ł', 58: 'ā', 59: 'ø', 60: 'ć', 61: 'ū', 62: 'ž', 63: 'ı', 64: 'å', 65: 'ř', 66: 'ş', 67: 'ý', 68: 'æ', 69: 'α', 70: 'ο', 71: 'ă', 72: 'о', 73: 'а', 74: 'ń', 75: 'н', 76: 'ν', 77: 'oov'}
def encode(in_ls, key):
    """ Encode a list of characters to character indices using the 'ch2int' dictionary """
    out_ls = []
    for ch in in_ls:
        index = key.get(ch)
        if index is None:
            index = key.get(oov_str)
        out_ls.append(index)
    return out_ls
A neural network only understands numbers, so we convert each character to its index using “ch2int”.
# data encoding
encoded_ls = [encode(l, ch2int) for l in data_lines]
print(len(encoded_ls))
1500000
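As a quick sanity check (a hypothetical input; the indices correspond to the ch2int mapping printed above), any character outside the vocabulary falls back to the 'oov' index:

# 'a' and 'ß' are in the vocabulary, '€' is not and maps to the 'oov' index
print(encode(['a', 'ß', '€'], ch2int))
# [3, 39, 77]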
In this stage we apply a few operations to the data-set: padding/truncating the sequences and encoding the targets.
# padding and truncating of encoded sequences
X = pad_sequences(encoded_ls, maxlen=sentense_len, truncating='post', padding='post')

# target encoding from language codes like 'en' or 'de' to integers
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)
print(label_encoder.classes_)

# one hot encoding of targets
y = to_categorical(encoded_labels)
print(X.shape, y.shape)
['de' 'en' 'es' 'fr' 'it']
(1500000, 150) (1500000, 5)
The complete data-set is not for training only; we also want to measure the model's accuracy, so we split it into train and test sets.
# Train & Test split (70:30) ratio from full data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(1050000, 150) (450000, 150) (1050000, 5) (450000, 5)
3.2 Model building
The neural network is a stack of layers; here we use Embedding, Conv1D, MaxPool1D, Flatten, and Dense layers.
# Build the Neural network
inp = Input(shape=(sentense_len, ))
x = Embedding(input_dim=len(char_vocab) + 1, output_dim=64)(inp)
x = Conv1D(64, 5, activation='relu')(x)
x = MaxPool1D(5)(x)
x = Conv1D(64, 5, activation='relu')(x)
x = MaxPool1D(20)(x)
x = Flatten()(x)
x = Dense(64, activation='relu')(x)
x = Dense(5, activation='softmax')(x)

model = Model(inputs=inp, outputs=x)
model.summary()

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
Layer (type)                 Output Shape              Param #
=================================================================
input_4 (InputLayer)         (None, 150)               0
_________________________________________________________________
embedding_4 (Embedding)      (None, 150, 64)           4992
_________________________________________________________________
conv1d_8 (Conv1D)            (None, 146, 64)           20544
_________________________________________________________________
max_pooling1d_8 (MaxPooling1 (None, 29, 64)            0
_________________________________________________________________
conv1d_9 (Conv1D)            (None, 25, 64)            20544
_________________________________________________________________
max_pooling1d_9 (MaxPooling1 (None, 1, 64)             0
_________________________________________________________________
flatten_3 (Flatten)          (None, 64)                0
_________________________________________________________________
dense_5 (Dense)              (None, 64)                4160
_________________________________________________________________
dense_6 (Dense)              (None, 5)                 325
=================================================================
Total params: 50,565
Trainable params: 50,565
Non-trainable params: 0
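As a quick check on the parameter counts: the Embedding layer has (77 vocabulary entries + 1 padding index) × 64 = 4,992 weights, each Conv1D layer has (5 kernel width × 64 input channels) × 64 filters + 64 biases = 20,544, the first Dense layer has 64 × 64 + 64 = 4,160, and the output layer has 64 × 5 + 5 = 325, giving the 50,565 trainable parameters reported above.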
3.3 Training and evaluation
# Train the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=256, epochs=5)
Train on 1050000 samples, validate on 450000 samples
Epoch 1/5
1050000/1050000 [==============================] - 750s 715us/step - loss: 0.1605 - acc: 0.9531 - val_loss: 0.1075 - val_acc: 0.9710
Epoch 2/5
1050000/1050000 [==============================] - 753s 717us/step - loss: 0.0967 - acc: 0.9734 - val_loss: 0.0899 - val_acc: 0.9753
Epoch 3/5
1050000/1050000 [==============================] - 719s 685us/step - loss: 0.0850 - acc: 0.9765 - val_loss: 0.0859 - val_acc: 0.9765
Epoch 4/5
1050000/1050000 [==============================] - 720s 686us/step - loss: 0.0787 - acc: 0.9781 - val_loss: 0.0832 - val_acc: 0.9772
Epoch 5/5
1050000/1050000 [==============================] - 751s 715us/step - loss: 0.0744 - acc: 0.9792 - val_loss: 0.0808 - val_acc: 0.9780
# prediction on test data
pred = model.predict(X_test)
pred_y = pred.argmax(axis=1).ravel()
actual_y = y_test.argmax(axis=1).ravel()

# Generate classification report
report = classification_report(actual_y, pred_y, target_names=label_encoder.classes_)
print(report)
              precision    recall  f1-score   support

          de       0.98      0.98      0.98     89924
          en       0.96      0.98      0.97     89480
          es       0.99      0.98      0.98     90056
          fr       0.98      0.97      0.97     90318
          it       0.99      0.98      0.98     90222

   micro avg       0.98      0.98      0.98    450000
   macro avg       0.98      0.98      0.98    450000
weighted avg       0.98      0.98      0.98    450000
3.4 Prediction
After training and evaluating the model, we can use the following method to predict the language of a single line of text.
def predict(line):
    """ Prediction method for a single line """
    line = line.lower()
    chars = [c for c in line]
    encoded = encode(chars, ch2int)
    padded = keras.preprocessing.sequence.pad_sequences([encoded], maxlen=sentense_len, truncating='post', padding='post')
    scores = model.predict(padded)
    max_index = scores[0].argmax()
    lbl = label_encoder.classes_[max_index]
    return lbl, scores[0][max_index]
# sample prediction
print(predict('this is sample text'))
('en', 0.9471939)
# Real time data from google news
test_data = [
    ('en', 'Today rural India and its villages have declared themselves'),
    ('de', 'Es ist einer dieser Momente, bei denen man dabei gewesen sein will'),
    ('fr', 'Mais rien ne permet pour l’instant de confirmer ces propos.'),
    ('it', 'Il peso della compartecipazione dei cittadini (il ticket appunto) sarà cacolato'),
    ('es', 'Después de la evaluación y las pruebas médicas, se descubrió que tenía un')
]

# predict on real time data
for actual_lang, data in test_data:
    print('-----------------')
    print(f'Data:{data}')
    print(f'Predicted:{predict(data)}, Actual:{actual_lang}')
-----------------
Data:Today rural India and its villages have declared themselves
Predicted:('en', 0.97216403), Actual:en
-----------------
Data:Es ist einer dieser Momente, bei denen man dabei gewesen sein will
Predicted:('de', 0.9998753), Actual:de
-----------------
Data:Mais rien ne permet pour l’instant de confirmer ces propos.
Predicted:('fr', 0.98878455), Actual:fr
-----------------
Data:Il peso della compartecipazione dei cittadini (il ticket appunto) sarà cacolato
Predicted:('it', 0.9981592), Actual:it
-----------------
Data:Después de la evaluación y las pruebas médicas, se descubrió que tenía un
Predicted:('es', 0.9999844), Actual:es
4. Source Code
You can download the full source code here: Lang_Detection_CNN.ipynb
5. References
- Wikimedia Downloads
- Deep Learning: Language identification using Keras & TensorFlow – Machine Learning Experiments
- Detect Language Using Python | ProBytes Software
- Character level CNN with Keras – Towards Data Science