“Natural Language Processing is a branch of artificial intelligence that deals with analyzing, understanding, and generating the languages that humans use naturally, in order to interface with computers in both written and spoken contexts using natural human languages instead of computer languages.”
The fundamental concepts of NLP differ from those of Machine Learning or Software Engineering in general. In this article, we will discuss some of the basic concepts.
Tokenization
Tokenization is a core tool in every NLP framework. Many ML techniques, whether aimed at text classification or regression, use n-grams and the features produced from them. Before you can start extracting features, you need to get the words. Many tools are available for tokenization; Lucene is widely used, and different Lucene tokenizers and filters can be combined to achieve it. We will discuss this in more detail in upcoming blogs.
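As a minimal sketch of the idea (not Lucene itself), a regex-based tokenizer can split raw text into word and punctuation tokens; real tokenizers handle many more cases:

```python
import re

def tokenize(text):
    # Runs of word characters become one token; every other
    # non-whitespace character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

Note how even this toy version keeps punctuation as separate tokens, which the sentence splitter below depends on.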
Sentence splitting
Sentence splitting divides a sequence of tokens into sentences. It is largely a deterministic consequence of tokenization: a sentence ends when a sentence-ending character (., !, or ?) is found that has not been grouped with other characters into a token (as in an abbreviation or a number), though the sentence may still include a few tokens that follow the sentence-ending character (such as closing quotes and brackets). Stanford NLP, NLTK, and Apache OpenNLP all provide this functionality.
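A naive rule-based splitter illustrates the idea; this sketch splits after ., !, or ? followed by a capital letter, and (unlike the toolkits above) will mishandle abbreviations such as "Dr.":

```python
import re

def split_sentences(text):
    # Split at whitespace that follows a sentence-ending character
    # and precedes a capital letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

print(split_sentences("I like NLP. It is fun!"))
```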
Part-of-speech tagger
A part-of-speech tagger (POS tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other tokens).
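A toy lookup-based tagger shows the input/output shape; real taggers (Stanford's, NLTK's, OpenNLP's) use statistical models over context rather than this hypothetical lexicon:

```python
# Hypothetical lexicon and tag set, for illustration only.
LEXICON = {"the": "DET", "dog": "NOUN", "barks": "VERB", "loud": "ADJ"}

def pos_tag(tokens):
    # Unknown words default to NOUN, a common fallback heuristic.
    return [(tok, LEXICON.get(tok.lower(), "NOUN")) for tok in tokens]
```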
Lemmatization
Lemmatization in linguistics is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. Algorithmically, it is the process of determining the lemma for a given word. Since this may involve complex tasks such as understanding context, the part of speech of a word in a sentence should be determined first, because the normalization rules differ for different parts of speech.
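A toy sketch of why the part of speech must come first: the same suffix-stripping rule cannot apply to every word class. (The rules here are deliberately simplistic assumptions, not a real algorithm.)

```python
def lemmatize(word, pos):
    # Different normalization rules per part of speech.
    word = word.lower()
    if pos == "VERB" and word.endswith("ing"):
        return word[:-3]
    if pos == "NOUN" and word.endswith("s"):
        return word[:-1]
    return word

print(lemmatize("walking", "VERB"))  # walk
print(lemmatize("dogs", "NOUN"))     # dog
```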
Parsing
Parsing can also help in identifying the subjects, objects of interest, part-of-speech information, and so on. Applications can be built on top of this information to improve the results of their specific use cases; language prediction, translation, and chatbots are some of the applications where parsing is used.
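A full parser is well beyond a short example, but a toy noun-phrase chunker over (token, POS) pairs sketches how parse-like structure exposes candidate subjects and objects (the tag names are assumptions):

```python
def chunk_noun_phrases(tagged):
    # Group runs of DET/ADJ tokens ending in a NOUN into one phrase.
    phrases, current = [], []
    for tok, pos in tagged:
        if pos in ("DET", "ADJ", "NOUN"):
            current.append(tok)
            if pos == "NOUN":
                phrases.append(" ".join(current))
                current = []
        else:
            current = []
    return phrases
```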
Coreference resolution
Coreference resolution is the task of finding all expressions that refer to the same entity in a text. It is an important step for a lot of higher-level NLP tasks that involve natural language understanding, such as document summarization, question answering, and information extraction.
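As a deliberately naive sketch of the task, the resolver below links each pronoun to the most recently seen entity from an assumed entity set; production resolvers use far richer signals (gender, number, syntax, discourse):

```python
PRONOUNS = {"he", "she", "it", "they"}

def resolve(tokens, entities):
    # Replace each pronoun with the last entity mention seen so far.
    last_entity, resolved = None, []
    for tok in tokens:
        if tok in entities:
            last_entity = tok
            resolved.append(tok)
        elif tok.lower() in PRONOUNS and last_entity:
            resolved.append(last_entity)
        else:
            resolved.append(tok)
    return resolved
```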
Named Entity Recognition
Recognizes named entities (person and company names, etc.) in the text.
NER systems rely on extracted parts of speech and basic grammars encoded in frameworks. There is a separate area of NLP, called information extraction, where people do really cool things such as automated generation of reports from several messages about a topic; NER is a major part of it.
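The simplest possible NER is a gazetteer lookup, sketched below with a hypothetical entity list; real recognizers combine such lists with statistical models over POS and contextual features:

```python
# Hypothetical gazetteer: surface form -> entity type.
GAZETTEER = {"Google": "ORG", "London": "LOC", "Alice": "PER"}

def tag_entities(tokens):
    # "O" marks tokens that are not part of any named entity.
    return [(tok, GAZETTEER.get(tok, "O")) for tok in tokens]
```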
Relation extraction
Relation extraction extracts directed, qualified relations from free-text sentences in which two or more entities have been identified by the entity-extraction module. The relation-extraction module requires a list of verbs and nominalization terms that describe the relations of interest.
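A toy pattern-based extractor sketches the idea: given tokens and a set of already-recognized entities, emit an (entity, verb, entity) triple whenever a verb from an assumed relation-verb list appears between two entities:

```python
# Hypothetical list of verbs that signal relations of interest.
RELATION_VERBS = {"acquired", "founded", "visited"}

def extract_relations(tokens, entities):
    triples = []
    for i, tok in enumerate(tokens):
        if tok in RELATION_VERBS:
            # Nearest entity to the left and to the right of the verb.
            left = next((t for t in reversed(tokens[:i]) if t in entities), None)
            right = next((t for t in tokens[i + 1:] if t in entities), None)
            if left and right:
                triples.append((left, tok, right))
    return triples
```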
Sentiment analysis
Sentiment analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.
There are several ways to perform sentiment analysis; some people even use deep learning (e.g. word2vec embeddings). It usually starts with feature extraction: a term-document matrix (TDM) is computed from 2- and 3-grams that contain sentiment-related words, taken either from dictionaries (semi-supervised and supervised models) or built from the word distribution itself (unsupervised and semi-supervised models). The TDM is then used as a feature matrix and fed to a neural net, an SVM, or whatever the end-point algorithm happens to be.
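At the dictionary end of that spectrum, a minimal scorer just counts matches against assumed positive and negative word lists; supervised approaches would instead learn weights over the TDM features described above:

```python
# Hypothetical sentiment dictionaries, for illustration only.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "terrible", "hate"}

def sentiment_score(tokens):
    # +1 per positive word, -1 per negative word.
    score = sum((t.lower() in POSITIVE) - (t.lower() in NEGATIVE) for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment_score(["I", "love", "this"]))  # positive
```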