

Stanford RegexNER is a pattern-based (i.e., rule-based) interface for doing Named Entity Recognition (NER).
The simplest rule file has two tab-separated fields on a line. The first field has text to match and the second field has the entity category to assign. (Note that you must have a tab character between the text and the category. Other spaces will not do.)
RegexNER basic file:
We might wish to give the names of computer hardware a HARDWARE_COMPONENT entity label. Then our first RegexNER file might be the following (with a tab before each “HARDWARE_COMPONENT”):
Motherboard	HARDWARE_COMPONENT
RAM	HARDWARE_COMPONENT
CPU	HARDWARE_COMPONENT
Hard Drive	HARDWARE_COMPONENT
Case	HARDWARE_COMPONENT
Optical Drive	HARDWARE_COMPONENT
Expansion Card	HARDWARE_COMPONENT
Fan	HARDWARE_COMPONENT
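Because many editors silently turn tabs into spaces, it can be safer to generate the rule file from code. The following is a minimal sketch using only the Java standard library; the class MappingFileWriter and its methods are hypothetical helpers made up for illustration, not part of CoreNLP, and the output file name is simply the one used in the example further down.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of CoreNLP): builds RegexNER rule lines
// with a real tab character between the two fields.
final class MappingFileWriter {

    // Joins the text to match and the entity label with exactly one tab.
    static String formatRule(String text, String label) {
        if (text.contains("\t") || label.contains("\t")) {
            throw new IllegalArgumentException("fields must not contain tab characters");
        }
        return text + "\t" + label;
    }

    // Builds one rule line per name, all with the same entity label.
    static List<String> buildRules(List<String> names, String label) {
        List<String> lines = new ArrayList<>();
        for (String name : names) {
            lines.add(formatRule(name, label));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        List<String> names = Arrays.asList("Motherboard", "RAM", "CPU", "Hard Drive",
                "Case", "Optical Drive", "Expansion Card", "Fan");
        // File name assumed to match the regexner.mapping property used later.
        Path out = Paths.get("HARDWARE_COMPONENT.txt");
        Files.write(out, buildRules(names, "HARDWARE_COMPONENT"), StandardCharsets.UTF_8);
    }
}
```

Running main writes the eight rules shown above, one per line, into the current directory.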
RegexNER overwritable entity types:
RegexNER will not overwrite an existing entity assignment, unless you give it permission in a third tab-separated column, which contains a comma-separated list of entity types that can be overwritten. Only the non-entity O label can always be overwritten, but you can specify extra entity tags which can always be overwritten as well.
Motherboard	HARDWARE_COMPONENT	LOCATION
RAM	HARDWARE_COMPONENT	PERSON
CPU	HARDWARE_COMPONENT
Hard Drive	HARDWARE_COMPONENT
Case	HARDWARE_COMPONENT	LOCATION
Optical Drive	HARDWARE_COMPONENT	ORGANIZATION
Expansion Card	HARDWARE_COMPONENT	ORGANIZATION
Fan	HARDWARE_COMPONENT	PERSON
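The overwrite rule described above can be sketched in a few lines of plain Java. This is only an illustration of the stated behaviour (O can always be replaced, other labels only if listed in the third column); OverwriteRule is a hypothetical class, not the CoreNLP implementation, which may apply further checks.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of one rule line (not the CoreNLP implementation).
final class OverwriteRule {
    final String text;
    final String label;
    final Set<String> overwritable;

    OverwriteRule(String text, String label, Set<String> overwritable) {
        this.text = text;
        this.label = label;
        this.overwritable = overwritable;
    }

    // Parses "text<TAB>label[<TAB>TYPE1,TYPE2,...]".
    static OverwriteRule parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length < 2) {
            throw new IllegalArgumentException("expected at least two tab-separated fields");
        }
        Set<String> types = new HashSet<>();
        if (fields.length >= 3 && !fields[2].trim().isEmpty()) {
            for (String type : fields[2].split(",")) {
                types.add(type.trim());
            }
        }
        return new OverwriteRule(fields[0], fields[1], types);
    }

    // The non-entity label O can always be replaced; any other existing
    // label only if the rule lists it in its third column.
    boolean mayReplace(String existingLabel) {
        return "O".equals(existingLabel) || overwritable.contains(existingLabel);
    }
}
```

For example, the Motherboard rule above may replace an existing LOCATION label but not an existing PERSON label, while the CPU rule may only replace O.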
RegexNER entity priority:
The fourth column can be used to give rules a priority. If multiple rules match, the result is undefined unless you give the rules a priority. Here is a (sort of silly) variation on the last file, where the rules have priorities. The priority goes in the fourth column, so when a rule lists no overwritable types (an empty third column, as in the CPU rule), note very carefully that there are two tabs between the entity label and the priority. Rules with no explicitly given priority have priority 1.0.
Motherboard	HARDWARE_COMPONENT	LOCATION	2.0
RAM	HARDWARE_COMPONENT	PERSON	3.0
CPU	HARDWARE_COMPONENT		3.0
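To make the column layout concrete, here is a rough sketch of reading the fourth column and choosing between competing rules. PriorityRule is a hypothetical class, the default of 1.0 is taken from the text above, and the sketch assumes that the rule with the larger number wins, which the text does not state explicitly.

```java
import java.util.List;

// Hypothetical model of a rule with a priority (not the CoreNLP implementation).
final class PriorityRule {
    final String text;
    final String label;
    final double priority;

    PriorityRule(String text, String label, double priority) {
        this.text = text;
        this.label = label;
        this.priority = priority;
    }

    // Parses "text<TAB>label<TAB>overwritable<TAB>priority".
    // The limit of -1 keeps empty fields, so an empty third column
    // (two consecutive tabs) still leaves the priority at index 3.
    static PriorityRule parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length < 2) {
            throw new IllegalArgumentException("expected at least two tab-separated fields");
        }
        double priority = 1.0; // default stated in the text above
        if (fields.length >= 4 && !fields[3].trim().isEmpty()) {
            priority = Double.parseDouble(fields[3].trim());
        }
        return new PriorityRule(fields[0], fields[1], priority);
    }

    // Assumption: the larger number wins; ties keep the earlier rule.
    // Returns null when the list is empty.
    static PriorityRule pickWinner(List<PriorityRule> matches) {
        PriorityRule best = null;
        for (PriorityRule rule : matches) {
            if (best == null || rule.priority > best.priority) {
                best = rule;
            }
        }
        return best;
    }
}
```

Splitting with a limit of -1 is the detail that matters here: it is what lets the CPU line, with its two consecutive tabs, parse to a priority of 3.0 instead of falling back to the default.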
Example:
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// The class name is arbitrary; it only makes the snippet compilable.
public class RegexNerExample {
    public static void main(String[] args) {
        // creates a StanfordCoreNLP object with POS tagging, lemmatization, NER and RegexNER
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
        props.setProperty("regexner.mapping", "HARDWARE_COMPONENT.txt");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        String text = "Motherboard: The electronic skeleton of the entire system.\n"
                + "CPU: A powerful calculator – The brains of it all.\n"
                + "RAM: The indispensable short-term memory.\n"
                + "Hard Drive: Where all permanent data is saved and stored.\n"
                + "Case: The shell that holds all components together.";

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(text);

        // run all Annotators on this text
        pipeline.annotate(document);

        // these are all the sentences in this document
        // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

        for (CoreMap sentence : sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel word : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // the final entity label (from NER or RegexNER) is stored in NamedEntityTagAnnotation
                String tag = word.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                if (tag == null || tag.equals("O")) {
                    continue; // skip tokens that are not part of any entity
                }
                System.out.println(word.word() + " = " + tag);
            }
        }
    }
}
Refer to the Stanford RegexNER guide for more details.