1. Language Detection Overview

Language detection, or language identification, is an NLP problem: given a piece of text, from a long document down to a short tweet, guess which language it is written in.

The language detection problem is commonly solved with an N-gram approach combined with traditional classifiers such as Naive Bayes or SVM. Some Python libraries, such as langid and FastText, also provide language detection with good results.
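As a quick illustrative baseline (not part of this article's pipeline), the langid library can be queried directly; this sketch assumes it is installed, e.g. via pip install langid.

# illustrative baseline using the langid library (assumes: pip install langid)
import langid

# classify() returns a (language_code, score) tuple
print(langid.classify('this is a short english sentence'))
# e.g. ('en', <score>)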

The key challenge is to identify the language of short text with high accuracy.

As we know, deep learning (neural networks) is a hot topic and has solved many Natural Language Processing and Computer Vision tasks with promising results.

Now we will try to solve this problem with a "Character-level Convolutional Network", a type of deep neural network. As the paper at http://arxiv.org/abs/1509.01626 shows, CNNs give good and promising results for NLP tasks as well.

2. Data-Set

We will select languages that are similar to each other, initially five: English, German, French, Italian, and Spanish.

Okay, but where do we find the right data-set?

Deep neural networks are hungry for data. In other words, more data is needed to train a model that generalizes well.

“Wikipedia”….!

Wikipedia is an information repository for people, but for data scientists it is a CORPUS. Wikipedia publishes dumps at https://dumps.wikimedia.org/, so that is our main data source.

I downloaded the Wikipedia dump archive for each of the five languages.

Each archive contains a set of articles, but how do we extract the article text from these files? After some searching I found a handy script, "process_wiki.py", which uses the "gensim" library and extracts the text of all articles into a single text file.
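The following is only a minimal sketch of what such an extraction can look like using gensim's WikiCorpus; the function name and file paths are placeholders, and minor adjustments may be needed depending on the gensim version.

# minimal sketch of a process_wiki.py-style extraction using gensim
from gensim.corpora import WikiCorpus

def extract_wiki_text(dump_path, out_path):
    # passing an empty dictionary skips the slow dictionary-building step
    wiki = WikiCorpus(dump_path, dictionary={})
    with open(out_path, 'w', encoding='utf-8') as out:
        # each article is yielded as a list of tokens; write it as one line
        for tokens in wiki.get_texts():
            out.write(' '.join(tokens) + '\n')

# example usage (file names are placeholders):
# extract_wiki_text('enwiki-latest-pages-articles.xml.bz2', '/data_dir/en.txt')

After extraction, the data directory looks like this: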

   /data_dir
      ./en.txt - English
      ./de.txt - German
      ./fr.txt - French
      ./it.txt - Italian
      ./es.txt - Spanish

Now our raw data is ready. Yup!

3. Language Detection

3.1 Data loading and Pre-processing

First, we import the required libraries.

# core libs
import random
from collections import Counter

# numpy
import numpy as np

# Sklearn
import sklearn
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# keras
import keras
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Input, Embedding, Conv1D, MaxPool1D, Flatten, Dense
from keras.models import Model

Let’s check the versions of the installed core libraries.

# Library version
print(f'keras= {keras.__version__}')
print(f'sklearn= {sklearn.__version__}')
print(f'numpy= {np.__version__}')
keras= 2.2.4
sklearn= 0.20.4
numpy= 1.16.4

Next, we initialize the basic configuration.

# Basic Configs
data_dir = '/data_dir'
# number of articles to take from each file
num_of_articles = 10000
# Maximum sequence length
sentense_len = 150
# shingle lengths (in characters)
shingles_range = (70, 100, 130)
# how many shingles to generate per line, per shingle length
shingle_per_line = 10
# out of vocabulary token
oov_str = 'oov'

Here we create two dictionaries: one mapping each language code to its full language name, and one mapping each language code to its data file.

# language code wise full name mapping
lang_code_dict = {
    'en':'english', 'de':'german', 
    'fr':'french', 'it':'italian', 
    'es':'spanish'
}
# language code wise data file mapping
data_info = {
    'en' : data_dir + '/en.txt',
    'de' : data_dir + '/de.txt',
    'fr' : data_dir + '/fr.txt',
    'it' : data_dir + '/it.txt',
    'es' : data_dir + '/es.txt',
}

This is the data loading stage: we simply read lines from each file and convert them to lower case. The total number of lines collected is controlled by the "num_of_articles" configuration.

# data loading
data_dict = {}
for lang_code, file_path in data_info.items():
    with open(file_path, encoding='utf-8') as file:
        lines = file.readlines()
        lines = lines[:num_of_articles]
        # convert to lower case
        lines = [l.lower().strip() for l in lines]
        data_dict[lang_code] = lines
        print(lang_code, len(lines))
en 10000
de 10000
fr 10000
it 10000
es 10000

"Shingles" are also referred to as n-grams. Here we define utility methods to generate shingles from a single line or from a list of lines.

def generate_shingles(line, length, total):
    """
    Generate shingles from line
    """
    shingle_list = [] 
    max_index = len(line) - length
    if max_index > 0:
        for _ in range(total):
            index = random.randint(0, max_index)
            shingle_text = line[index:index+length]
            shingle_list.append(shingle_text)
    else:
        shingle_list.append(line)
    return shingle_list

def generate_shingles_lines(lines, length, total):
    """
    Generate shingles from list of lines
    """
    shingle_list = []
    for line in lines:
        shingles = generate_shingles(line=line, length=length, total=total)
        shingle_list.extend(shingles)
    return shingle_list

Why do we need to create shingles? Because each line in the data file is an entire article, and its length is long. That is why we defined the two configurations "shingles_range" and "shingle_per_line" earlier: "shingles_range" is a tuple of shingle lengths, and "shingle_per_line" is the number of shingles to generate per line for each length.

Example: shingles_range = (10, 20), shingle_per_line = 5

Here the first shingle length is 10 with 5 shingles per line, and the second shingle length is 20 with another 5 shingles, so in total 10 shingles are generated per line.
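To make this concrete, here is an illustrative call to generate_shingles on a short sample line; the start positions are chosen at random, so the output differs between runs.

# illustrative run on a sample line (output varies because positions are random)
sample_line = 'language detection with a character level convolutional network'
print(generate_shingles(sample_line, length=10, total=3))
# prints 3 random substrings of length 10, e.g. ['on with a ', 'l convolut', 'age detect']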

# generate shingles
shingle_data_dict = {}
for lang, lines in data_dict.items():
    shingle_list = []
    for s_range in shingles_range:
        shingles = generate_shingles_lines(lines, s_range, shingle_per_line)
        shingle_list.extend(shingles)
    shingle_data_dict[lang] = shingle_list
    print(lang, len(shingle_list))
en 300000
de 300000
fr 300000
it 300000
es 300000

Earlier we generated shingles in the form of a dictionary. Now we convert the dictionary into a flat list of lines with their labels. In total we have around 1.5M records.

# create list of lines and labels
data_lines, labels = [], []
for lang, samples in shingle_data_dict.items():
    data_lines.extend(samples)
    total_samples = len(samples)
    labels.extend([lang] * total_samples)
print(len(data_lines), len(labels))
1500000
1500000


In this stage we prepare the vocabulary. But why? Because our approach is a character-level deep learning model, so we need a character vocabulary that covers all the languages.

First we build a list of all characters found in the data lines and select the 76 most frequent ones. The number 76 is not fixed; it depends on the problem statement.

Notice that we have also added an "OOV" (out-of-vocabulary) token: any character that is not found in our vocabulary is treated as "OOV".

# create list of all characters from all data lines
data_char_ls = []
for line in data_lines:
    char_ls = [c for c in line]
    data_char_ls.append(char_ls)
    
# count all characters
char_count = Counter(x for xs in data_char_ls for x in set(xs))

# create vocabulary
char_vocab = [c[0] for c in char_count.most_common(76)] + [oov_str]
print(char_vocab)
[' ', 'e', 'a', 'i', 'n', 'r', 't', 's', 'o', 'l', 'd', 'c', 'u', 'm', 'p', 'g', 'h', 'b', 'f', 'v', 'w', 'z', 'y', 'k', 'é', 'q', 'j', 'x', 'ó', 'í', 'ü', 'á', 'ä', 'è', 'ö', 'à', 'ñ', 'ú', 'ß', 'ò', 'ç', 'ù', 'ê', 'ô', 'â', 'î', 'ì', 'œ', 'û', 'ï', 'ō', '²', 'š', 'ë', 'č', 'ã', 'ł', 'ā', 'ø', 'ć', 'ū', 'ž', 'ı', 'å', 'ř', 'ş', 'ý', 'æ', 'α', 'ο', 'ă', 'о', 'а', 'ń', 'н', 'ν', 'oov']


# create dictionary for (char to index)
# use (index + 1) because index 0 is reserved for padding
ch2int = {c:i+1 for i, c in enumerate(char_vocab)}
print(ch2int)
print()
# create dictionary for (index to char)
int2ch = {i:c for c, i in ch2int.items()}
print(int2ch)
{' ': 1, 'e': 2, 'a': 3, 'i': 4, 'n': 5, 'r': 6, 't': 7, 's': 8, 'o': 9, 'l': 10, 'd': 11, 'c': 12, 'u': 13, 'm': 14, 'p': 15, 'g': 16, 'h': 17, 'b': 18, 'f': 19, 'v': 20, 'w': 21, 'z': 22, 'y': 23, 'k': 24, 'é': 25, 'q': 26, 'j': 27, 'x': 28, 'ó': 29, 'í': 30, 'ü': 31, 'á': 32, 'ä': 33, 'è': 34, 'ö': 35, 'à': 36, 'ñ': 37, 'ú': 38, 'ß': 39, 'ò': 40, 'ç': 41, 'ù': 42, 'ê': 43, 'ô': 44, 'â': 45, 'î': 46, 'ì': 47, 'œ': 48, 'û': 49, 'ï': 50, 'ō': 51, '²': 52, 'š': 53, 'ë': 54, 'č': 55, 'ã': 56, 'ł': 57, 'ā': 58, 'ø': 59, 'ć': 60, 'ū': 61, 'ž': 62, 'ı': 63, 'å': 64, 'ř': 65, 'ş': 66, 'ý': 67, 'æ': 68, 'α': 69, 'ο': 70, 'ă': 71, 'о': 72, 'а': 73, 'ń': 74, 'н': 75, 'ν': 76, 'oov': 77} {1: ' ', 2: 'e', 3: 'a', 4: 'i', 5: 'n', 6: 'r', 7: 't', 8: 's', 9: 'o', 10: 'l', 11: 'd', 12: 'c', 13: 'u', 14: 'm', 15: 'p', 16: 'g', 17: 'h', 18: 'b', 19: 'f', 20: 'v', 21: 'w', 22: 'z', 23: 'y', 24: 'k', 25: 'é', 26: 'q', 27: 'j', 28: 'x', 29: 'ó', 30: 'í', 31: 'ü', 32: 'á', 33: 'ä', 34: 'è', 35: 'ö', 36: 'à', 37: 'ñ', 38: 'ú', 39: 'ß', 40: 'ò', 41: 'ç', 42: 'ù', 43: 'ê', 44: 'ô', 45: 'â', 46: 'î', 47: 'ì', 48: 'œ', 49: 'û', 50: 'ï', 51: 'ō', 52: '²', 53: 'š', 54: 'ë', 55: 'č', 56: 'ã', 57: 'ł', 58: 'ā', 59: 'ø', 60: 'ć', 61: 'ū', 62: 'ž', 63: 'ı', 64: 'å', 65: 'ř', 66: 'ş', 67: 'ý', 68: 'æ', 69: 'α', 70: 'ο', 71: 'ă', 72: 'о', 73: 'а', 74: 'ń', 75: 'н', 76: 'ν', 77: 'oov'}


def encode(in_ls, key):
    """
    Encode a list of characters into indices using the given mapping (e.g. 'ch2int'); unknown characters map to the OOV index.
    """
    out_ls = []
    for ch in in_ls:
        index = key.get(ch)
        if index is None:
            index = key.get(oov_str)
        out_ls.append(index)
    return out_ls

A neural network understands only numbers, so we convert each character to its index using the "ch2int" mapping.

# data encoding
encoded_ls = [encode(l, ch2int) for l in data_lines]
print(len(encoded_ls))
1500000
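As a quick illustrative check of the OOV handling, any character outside the vocabulary falls back to the 'oov' index.

# illustrative check: '文' is not in the vocabulary, so it maps to the 'oov' index
print(encode('a 文', ch2int))
# -> [3, 1, 77]  ('a' -> 3, ' ' -> 1, '文' -> 'oov' -> 77)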

In this stage we apply padding/truncation to the encoded sequences and encode the targets.

# padding and truncating of encoded sequences
X = pad_sequences(encoded_ls, maxlen=sentense_len, truncating='post', padding='post')

# encode language codes ('de', 'en', ...) as integer class labels
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)
print(label_encoder.classes_)

# one hot encoding of targets
y = to_categorical(encoded_labels)

print(X.shape, y.shape)
['de' 'en' 'es' 'fr' 'it']
(1500000, 150) (1500000, 5)

Our complete data-set is not only for training, because we also want to check the accuracy of the model. So we divide the whole data-set into train and test splits.

# Train & Test split (70:30) ratio from full data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(1050000, 150) (450000, 150) (1050000, 5) (450000, 5)


3.2 Model building

A neural network is a stack of layers. Here we use Embedding, Conv1D, MaxPool1D, Flatten, and Dense layers.

# Build the Neural network
inp = Input(shape=(sentense_len, ))
x = Embedding(input_dim=len(char_vocab) + 1, output_dim=64)(inp)
x = Conv1D(64, 5, activation='relu')(x)
x = MaxPool1D(5)(x)
x = Conv1D(64, 5, activation='relu')(x)
x = MaxPool1D(20)(x)
x = Flatten()(x)
x = Dense(64, activation='relu')(x)
x = Dense(5, activation='softmax')(x)
model = Model(inputs=inp, outputs=x)
model.summary()
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
Layer (type)                 Output Shape              Param #   
=================================================================
input_4 (InputLayer)         (None, 150)               0         
_________________________________________________________________
embedding_4 (Embedding)      (None, 150, 64)           4992      
_________________________________________________________________
conv1d_8 (Conv1D)            (None, 146, 64)           20544     
_________________________________________________________________
max_pooling1d_8 (MaxPooling1 (None, 29, 64)            0         
_________________________________________________________________
conv1d_9 (Conv1D)            (None, 25, 64)            20544     
_________________________________________________________________
max_pooling1d_9 (MaxPooling1 (None, 1, 64)             0         
_________________________________________________________________
flatten_3 (Flatten)          (None, 64)                0         
_________________________________________________________________
dense_5 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_6 (Dense)              (None, 5)                 325       
=================================================================
Total params: 50,565
Trainable params: 50,565
Non-trainable params: 0

3.3 Training and evaluation

# Train the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), batch_size=256, epochs=5)
WARNING:tensorflow:From /home/divyesh/.local/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Train on 1050000 samples, validate on 450000 samples
Epoch 1/5
1050000/1050000 [==============================] - 750s 715us/step - loss: 0.1605 - acc: 0.9531 - val_loss: 0.1075 - val_acc: 0.9710
Epoch 2/5
1050000/1050000 [==============================] - 753s 717us/step - loss: 0.0967 - acc: 0.9734 - val_loss: 0.0899 - val_acc: 0.9753
Epoch 3/5
1050000/1050000 [==============================] - 719s 685us/step - loss: 0.0850 - acc: 0.9765 - val_loss: 0.0859 - val_acc: 0.9765
Epoch 4/5
1050000/1050000 [==============================] - 720s 686us/step - loss: 0.0787 - acc: 0.9781 - val_loss: 0.0832 - val_acc: 0.9772
Epoch 5/5
1050000/1050000 [==============================] - 751s 715us/step - loss: 0.0744 - acc: 0.9792 - val_loss: 0.0808 - val_acc: 0.9780
# prediction on test data
pred = model.predict(X_test)
pred_y = pred.argmax(axis=1).ravel()
actual_y = y_test.argmax(axis=1).ravel()

# Generate classification report
report = classification_report(actual_y, pred_y, target_names=label_encoder.classes_)
print(report)
              precision    recall  f1-score   support

          de       0.98      0.98      0.98     89924
          en       0.96      0.98      0.97     89480
          es       0.99      0.98      0.98     90056
          fr       0.98      0.97      0.97     90318
          it       0.99      0.98      0.98     90222

   micro avg       0.98      0.98      0.98    450000
   macro avg       0.98      0.98      0.98    450000
weighted avg       0.98      0.98      0.98    450000

3.4 Prediction

After training and evaluating the model, we can use the following method to predict the language of a line of text.

def predict(line):
    """
    Prediction method for single line
    """
    line = line.lower()
    chars = [c for c in line]
    encoded = encode(chars, ch2int)
    padded = keras.preprocessing.sequence.pad_sequences([encoded], maxlen=sentense_len, truncating='post', padding='post')
    scores = model.predict(padded)
    max_index = scores[0].argmax()
    lbl = label_encoder.classes_[max_index]
    return lbl, scores[0][max_index]
# sample prediction
print(predict('this is sample text'))
('en', 0.9471939)
# Real time data from google news
test_data = [
    ('en', 'Today rural India and its villages have declared themselves'),
    ('de', 'Es ist einer dieser Momente, bei denen man dabei gewesen sein will'),
    ('fr', 'Mais rien ne permet pour l’instant de confirmer ces propos.'),
    ('it', 'Il peso della compartecipazione dei cittadini (il ticket appunto) sarà cacolato'),
    ('es', 'Después de la evaluación y las pruebas médicas, se descubrió que tenía un')
]

# predict on real time data
for actual_lang, data in test_data:
    print('-----------------')
    print(f'Data:{data}')
    print(f'Predicted:{predict(data)}, Actual:{actual_lang}')
-----------------
Data:Today rural India and its villages have declared themselves
Predicted:('en', 0.97216403), Actual:en
-----------------
Data:Es ist einer dieser Momente, bei denen man dabei gewesen sein will
Predicted:('de', 0.9998753), Actual:de
-----------------
Data:Mais rien ne permet pour l’instant de confirmer ces propos.
Predicted:('fr', 0.98878455), Actual:fr
-----------------
Data:Il peso della compartecipazione dei cittadini (il ticket appunto) sarà cacolato
Predicted:('it', 0.9981592), Actual:it
-----------------
Data:Después de la evaluación y las pruebas médicas, se descubrió que tenía un
Predicted:('es', 0.9999844), Actual:es

4. Source Code

You can download the full source code here: Lang_Detection_CNN.ipynb

5. References

Xiang Zhang, Junbo Zhao, and Yann LeCun. "Character-level Convolutional Networks for Text Classification." http://arxiv.org/abs/1509.01626
