Stanford RegexNER is a pattern-based (i.e., rule-based) interface for doing Named Entity Recognition (NER).

The simplest rule file has two tab-separated fields per line. The first field holds the text to match and the second field holds the entity category to assign. (Note that there must be a tab character between the text and the category; spaces will not work.)

RegexNER basic file:

Suppose we want to give the names of computer hardware a HARDWARE_COMPONENT entity label. Then our first RegexNER file might be the following (with a tab before each “HARDWARE_COMPONENT”):

Motherboard	HARDWARE_COMPONENT
RAM	HARDWARE_COMPONENT
CPU	HARDWARE_COMPONENT
Hard Drive	HARDWARE_COMPONENT
Case	HARDWARE_COMPONENT
Optical Drive	HARDWARE_COMPONENT
Expansion Card	HARDWARE_COMPONENT
Fan	HARDWARE_COMPONENT
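
Because many editors silently turn tabs into spaces, it can be safer to generate the mapping file from code. The following is only a minimal sketch using the Java standard library; the class name MappingFileWriter and its helper methods are made up for this post and are not part of CoreNLP.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a CoreNLP class): builds two-column RegexNER rules.
class MappingFileWriter {

    // Joins the phrase and the label with a real tab character.
    static String toRule(String phrase, String label) {
        return phrase + "\t" + label;
    }

    // Builds one rule line per phrase, all with the same label.
    static List<String> buildRules(List<String> phrases, String label) {
        List<String> rules = new ArrayList<>();
        for (String phrase : phrases) {
            rules.add(toRule(phrase, label));
        }
        return rules;
    }

    public static void main(String[] args) throws IOException {
        List<String> phrases = List.of("Motherboard", "RAM", "CPU", "Hard Drive");
        List<String> rules = buildRules(phrases, "HARDWARE_COMPONENT");

        // Writes one rule per line to the file used later as regexner.mapping.
        Path out = Path.of("HARDWARE_COMPONENT.txt");
        Files.write(out, rules);
        System.out.println("Wrote " + rules.size() + " rules to " + out);
    }
}
```

Generating the file this way guarantees that the separator is a tab, which is the one thing the format is strict about.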

RegexNER overwriting existing entity labels:

RegexNER will not overwrite an existing entity assignment, unless you give it permission in a third tab-separated column, which contains a comma-separated list of entity types that can be overwritten. Only the non-entity O label can always be overwritten, but you can specify extra entity tags which can always be overwritten as well.

Motherboard	HARDWARE_COMPONENT	LOCATION
RAM	HARDWARE_COMPONENT	PERSON
CPU	HARDWARE_COMPONENT
Hard Drive	HARDWARE_COMPONENT
Case	HARDWARE_COMPONENT	LOCATION
Optical Drive	HARDWARE_COMPONENT	ORGANIZATION
Expansion Card	HARDWARE_COMPONENT	ORGANIZATION
Fan	HARDWARE_COMPONENT	PERSON
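
To make the overwrite rule concrete, here is a small sketch of the decision described above: a rule may replace an existing label only if that label is O or is listed in the rule's third column. The OverwriteRule class is purely illustrative and is an assumption of this post; it mirrors the description, not CoreNLP's actual implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of one mapping line (not a CoreNLP class).
class OverwriteRule {
    final String phrase;
    final String label;
    final Set<String> overwritable = new HashSet<>();

    OverwriteRule(String line) {
        String[] fields = line.split("\t");
        phrase = fields[0];
        label = fields[1];
        // The non-entity label O can always be overwritten.
        overwritable.add("O");
        // The optional third column is a comma-separated list of extra labels.
        if (fields.length > 2 && !fields[2].isEmpty()) {
            for (String type : fields[2].split(",")) {
                overwritable.add(type.trim());
            }
        }
    }

    // True if a token already tagged with existingTag may be relabelled by this rule.
    boolean mayOverwrite(String existingTag) {
        return overwritable.contains(existingTag);
    }

    public static void main(String[] args) {
        OverwriteRule rule = new OverwriteRule("Motherboard\tHARDWARE_COMPONENT\tLOCATION");
        System.out.println(rule.mayOverwrite("O"));        // true
        System.out.println(rule.mayOverwrite("LOCATION")); // true
        System.out.println(rule.mayOverwrite("PERSON"));   // false
    }
}
```

So with the file above, a "Motherboard" token that an earlier annotator tagged as LOCATION would be relabelled, while one tagged as PERSON would be left alone.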

RegexNER entity priority:

The fourth column can be used to give rules a priority. If multiple rules match, the result is undefined unless you give the rules a priority. Here is a (sort of silly) extension of the last file, where we have rules with priorities. The priorities go in the fourth column, so note very carefully that a rule without an overwrite list (such as the CPU line) needs two tabs between the entity label and the priority. Rules with no explicitly given priority have priority 1.0.

Motherboard	HARDWARE_COMPONENT	LOCATION	2.0
RAM	HARDWARE_COMPONENT	PERSON	3.0
CPU	HARDWARE_COMPONENT		3.0
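
The two-tab detail is easy to get wrong, so a quick check of how a line splits into columns can help. This is a hedged sketch with a made-up PriorityRule class (not a CoreNLP API): it splits a line on tabs, keeps the empty third column, and falls back to the default priority of 1.0 when the fourth column is missing.

```java
// Hypothetical helper (not a CoreNLP class) for inspecting mapping lines.
class PriorityRule {
    static final double DEFAULT_PRIORITY = 1.0;

    // Splits on tabs; the -1 limit keeps empty columns, including trailing ones.
    static String[] columns(String line) {
        return line.split("\t", -1);
    }

    // Returns the fourth column as a number, or the default when it is absent.
    static double priorityOf(String line) {
        String[] fields = columns(line);
        if (fields.length > 3 && !fields[3].isBlank()) {
            return Double.parseDouble(fields[3].trim());
        }
        return DEFAULT_PRIORITY;
    }

    public static void main(String[] args) {
        String withGap = "CPU\tHARDWARE_COMPONENT\t\t3.0";
        System.out.println(columns(withGap).length);  // 4 (third column is empty)
        System.out.println(priorityOf(withGap));      // 3.0
        System.out.println(priorityOf("Fan\tHARDWARE_COMPONENT")); // 1.0
    }
}
```

If the CPU line had only one tab before 3.0, the value would land in the third column and be read as an overwrite type rather than a priority.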

Example:

import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

        // creates a StanfordCoreNLP object with POS tagging, lemmatization, NER and RegexNER
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
        // path to the tab-separated mapping file shown above
        props.setProperty("regexner.mapping", "HARDWARE_COMPONENT.txt");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        String text = "Motherboard: The electronic skeleton of the entire system.\n" +
                "CPU: A powerful calculator – The brains of it all.\n" +
                "RAM: The indispensable short-term memory.\n" +
                "Hard Drive: Where all permanent data is saved and stored.\n" +
                "Case: The shell that holds all components together.";
        // create an empty Annotation just with the given text
        Annotation document = new Annotation(text);

        // run all Annotators on this text
        pipeline.annotate(document);

        // these are all the sentences in this document
        // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);

        for (CoreMap sentence : sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel word : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // RegexNER stores its result in the token's named entity tag
                String tag = word.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                // skip tokens without an entity label
                if (tag == null || tag.equals("O"))
                    continue;
                System.out.println(word.word() + " = " + tag);
            }
        }

 

Refer to the Stanford RegexNER guide for more details.

 
