Prepare training data for Custom NER using WebAnno

Example of NER output (entity label in brackets):
Apple Inc. [ORG], founded by Steve Jobs [PERSON], Steve Wozniak [PERSON], and Ronald Wayne [ORG] in April 1976 [DATE], is a multinational company headquartered in Cupertino [GPE], California [GPE].


Named entity recognition (NER) is an important NLP task that extracts specific pieces of information from text, such as words or phrases that denote a location, a person's name, and so on.

To do that, you can use a readily available pre-trained NER model from an open source library such as spaCy or Stanford CoreNLP.

If the pre-trained NER models do not give the results you expect, or the entity you are looking for (for example: Animal, Tree name, Fruit name) is not available in a pre-trained model, you can train your own named entity recognition model.

To train a custom NER model you need a large amount of annotated data. Preparing that much annotated data entirely by hand is not practical, so you need an annotation tool.

This tutorial explains how to prepare training data for custom NER using an annotation tool (WebAnno). Later we will use this training data to train a custom NER model with spaCy.

In my next tutorial I will explain how to train the custom NER model using the prepared data.

By following this article you can also prepare training data with custom entities like Fruit, Animal etc.


To prepare training data for custom Named Entity Recognition we need an annotator (annotation tool).

Many open source annotation tools are available, so which one should you go with?
I found WebAnno the most helpful because:
·       It is a runnable jar file, so you do not need to install it
·       It can be used for complex projects
·       Multiple users can work on the same project
·       It is a web based application
·       It can be used on any operating system
·       Most importantly, it is easy to use (unlike brat)
So in this tutorial I will walk you through every step, from download and setup to preparing training data for custom NER. Let’s get started.


Download and Setup WebAnno:

Download the beta version of WebAnno from the link below:
This is a runnable jar file, so you do not need to install it, and you can use it on any operating system without trouble.
To run this web based application, double click the downloaded jar file or start it from the command line with the command below:
java -jar webanno-standalone-4.0.0-beta-6.jar
Note: WebAnno creates a directory called .webanno under your home directory (on Windows: C: => Users => “your user name”) and stores its database and files there.
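If you are not sure where that folder ends up on your machine, a small Python sketch like the one below can print the expected location. It only assumes what the note above says (a .webanno folder directly under the home directory); the function name is my own and is not part of WebAnno.

```python
from pathlib import Path


def default_webanno_home():
    """Return the .webanno folder under the current user's home directory.

    Assumption: WebAnno uses its default location, as described in the note above.
    """
    return Path.home() / ".webanno"


# Show where to look for WebAnno's database and files.
print(default_webanno_home())
```

The folder only exists after WebAnno has been started at least once.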
Now let’s start working with WebAnno to generate training data for a custom NER model in spaCy.

Prepare training data for custom NER model:

To prepare training data for a custom NER model using WebAnno, follow the steps below:

Step1: Download and start WebAnno

Run WebAnno by following the steps in the Download and Setup WebAnno section above. While it starts, you should see a screen like the one below:

Don’t do anything here; just wait until you see the popup box below.
In this popup, select Open browser. Since WebAnno is a web based application, a new tab will open in your default browser.
The link is: http://localhost:8080/login.html
On the opening page, log in with the user name and password below.
Username: admin
Password: admin
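If the browser tab does not load, it can help to check whether anything is listening on the port at all. The sketch below is a plain TCP check written for this article, not a WebAnno feature. The host and port come from the link above, and the function name is hypothetical.

```python
import socket


def webanno_is_listening(host="localhost", port=8080, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    This only proves that a server is listening there, not that it is WebAnno.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not reachable.
        return False


print("Server reachable on port 8080:", webanno_is_listening())
```

If this prints False a minute or so after starting the jar, the application has most likely not finished starting or failed to start.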

On the next page, after a successful login, click on Projects.

Step2: Setup Project in WebAnno for NER

Now on the Project Settings page:
1.    In the Projects tab, click on Create
2.    Enter a name for the project (Ex: “Test_Annotation”)
3.    Select annotation as the Project Type
4.    Script Direction: if you are going to annotate text written in English, leave it as left-to-right (default). If you want to work with a language like Urdu, set the script direction to right-to-left.
That’s all; there is no need to change anything else on this page. Now click on Save (bottom right).
If you have completed the steps above successfully, you should be able to see your project name inside the Projects tab.
Once the project details have been defined, several more tabs appear, such as Users, Documents, Layers, Tagsets etc.


From there, select the Documents tab and do the following:
1.    Select “Plain text” as the format from the dropdown box
2.    Upload the text file for which we are going to prepare training data
3.    Click on Import
The sample text I used in my input text file is listed below:
Who is Shaka Khan?
I like London and Berlin
Note: I have used the same text that appears in the spaCy documentation so that you can relate this article to it.
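If you prefer to create the input file from a script instead of a text editor, here is a minimal sketch. The file name input.txt matches the one used later in this article; the function name and the choice of UTF-8 encoding are my own assumptions.

```python
from pathlib import Path

# The two sample sentences used in this tutorial, one per line.
SAMPLE_LINES = ["Who is Shaka Khan?", "I like London and Berlin"]


def write_sample_file(path):
    """Write the sample sentences to a plain text file and return its path."""
    path = Path(path)
    path.write_text("\n".join(SAMPLE_LINES) + "\n", encoding="utf-8")
    return path


# Usage: write_sample_file("input.txt"), then upload that file in the Documents tab.
```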


You can see that my sample text has only two kinds of entities, i.e. Name and Location.

In the spaCy model the Name entity is listed as PERSON and the Location entity is listed as LOC.
So to prepare training data for updating an existing spaCy model, you have to follow the spaCy entity list. If you want to train a new model instead, you can choose any name for a specific entity.
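To make the goal concrete, here is a rough sketch of what the finished training data could look like for the two sample sentences. It uses the (text, {"entities": [(start, end, label)]}) layout from spaCy's v2-era training examples, with character offsets computed by Python's str.find. The helper name find_entity is hypothetical, and the labels are the ones discussed above.

```python
def find_entity(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase in text.

    The end offset is exclusive, so text[start:end] == phrase.
    """
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    return (start, start + len(phrase), label)


# Sample sentences from this tutorial with the phrases we want to label.
SAMPLES = [
    ("Who is Shaka Khan?", [("Shaka Khan", "PERSON")]),
    ("I like London and Berlin", [("London", "LOC"), ("Berlin", "LOC")]),
]

TRAIN_DATA = [
    (text, {"entities": [find_entity(text, phrase, label) for phrase, label in entities]})
    for text, entities in SAMPLES
]

for example in TRAIN_DATA:
    print(example)
```

Note that str.find only locates the first occurrence of a phrase, so a sentence that repeats an entity needs more care. The annotations exported from WebAnno carry their own offsets, which avoids that problem.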
Now let’s check whether WebAnno has those specific entity names (LOC, PERSON). If not, we have to create new tags that match the names above.
To do that, follow the steps below:
1.    Click on the Tagsets tab
2.    Select Name Entity tags from the Tagsets list
3.    On the right side of the page, check whether the LOC and PERSON tags are available under the Tags list
4.    In my case PER is listed instead of PERSON, and LOC is available. So we need to create only one tag, named PERSON. To do that, click on Create under Tags
5.    On the right side, type the entity name you want to add (in my case PERSON)
6.    Finally, click on Save and check: the new entity name (PERSON) should be listed in the Tags list
In the same way you can also create your own custom entities, like Animal, Fruit etc.
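The check in steps 3 and 4 is just a set difference: the labels you need minus the tags the tagset already has. A tiny sketch follows. The tag set here contains only the two tags mentioned above (the real list in WebAnno is longer), and the names are hypothetical.

```python
# Labels we want to annotate with, following the spaCy entity names.
REQUIRED_LABELS = {"PERSON", "LOC"}

# Assumption: only the two tags this article mentions; the real tagset has more.
TAGSET_TAGS = {"PER", "LOC"}


def missing_tags(required, available):
    """Return the required labels that still have to be created, sorted by name."""
    return sorted(set(required) - set(available))


print(missing_tags(REQUIRED_LABELS, TAGSET_TAGS))
```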

At this point we are done with the project setup. Now we can move on to the main part, which is annotation.

Step3: Annotation in WebAnno

Hopefully you are done with the project setup at this stage. If so, click on Home to go to the main menu.
From the project menu select Annotation. A popup window will appear; select the document you want to annotate from there.

Note: A project can have multiple documents.
On the annotation page, do the following to annotate your text:
1.    Select Named entity as the Layer
2.    Select a word or phrase with the mouse (one you think is an entity)
3.    Select the entity type from Value (ex: LOC, PERSON)
Once you are done with your annotation, click on Export, select UIMA CAS JSON from the popup window, and export it.



This downloads a file named something like webanno2969822115926518043export.

This is a zip file, which needs to be extracted. After extracting it you will have your annotated JSON file. For me it is input.json, as my input file name was input.txt.
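If you export often, the extraction step can be scripted. The sketch below uses Python's standard zipfile module. The function name and the output folder are my own choices, and it assumes the export is an ordinary zip archive as described above.

```python
import zipfile
from pathlib import Path


def extract_export(zip_path, out_dir):
    """Unpack a WebAnno export archive and return the JSON files found inside."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # ZipFile detects the format from the file content, so a missing
    # .zip extension on the downloaded file is not a problem.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    return sorted(out_dir.rglob("*.json"))


# Usage (the file name is the example from this article):
# extract_export("webanno2969822115926518043export", "webanno_export")
```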

Now let’s have a quick look at the annotated file generated by WebAnno.

Somewhere in the file you can see NamedEntity, Sentence, Token and _referenced_fss. These are the things you will need later when you prepare the final spaCy formatted training data.
I will make a separate tutorial on converting this data to spaCy formatted training data.
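Until then, a quick way to get a feel for the exported file is to load it and count how many entries sit under those names. The sketch below makes no assumption about where in the JSON those keys live and simply searches the whole structure. The function names are hypothetical, and this is only an inspection aid, not the conversion itself.

```python
import json


def find_values(node, key):
    """Yield every value stored under `key` anywhere in a nested JSON structure."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_values(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


def count_annotations(data, type_names=("NamedEntity", "Sentence", "Token")):
    """Count list entries found under each type name (non-list values are ignored)."""
    return {
        name: sum(len(v) for v in find_values(data, name) if isinstance(v, list))
        for name in type_names
    }


# Usage with the exported file from this article:
# with open("input.json", encoding="utf-8") as handle:
#     print(count_annotations(json.load(handle)))
```

For the two sample sentences you would expect three NamedEntity entries (Shaka Khan, London, Berlin) if every entity was annotated.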

Conclusion:

In this tutorial I have discussed how to prepare training data for a custom NER model using WebAnno. In my next post I will explain how to convert this annotated data into the final spaCy formatted training data for training custom named entity recognition, as our main objective is to build a spaCy model from custom NER data.

By following this article you can also prepare training data with custom entities like Fruit, Animal etc.

If you have any questions or suggestions regarding this topic, see you in the comment section. I will try my best to answer.

6 thoughts on “Prepare training data for Custom NER using WebAnno”

  1. Hi Tomanin, it's really nice of you to reply. I just wanted to ask: is there a better way to make custom data for spaCy? For example, how can we find a token and its start and end? E.g. in “karan is good boy” I want the start and end of karan. Any clues? Is there any spaCy defined function for this?

  2. No, there is no function, but you can write a custom function based on character counts.

    So for your example your custom function will return:
    karan: [start: 0, end: 5] # karan is 5 characters long
    space: next word starts at 5+1 = 6
    is: [start: 6, end: 8]
    space: next word starts at 8+1 = 9
    good: [start: 9, end: 13]
    And so on……

  3. Hi, thanks for your reply. Well, two last questions.
    1. For the above method, what if the word is at the end of the sentence?
    2. When I follow your WebAnno method for annotations, an error comes up when I run the code that parses the JSON, i.e. list index not matching. I tried a lot to resolve it but got stuck.
    Your reply would really be appreciated.
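On the offset question in the comments above, the character-count idea can be sketched in a few lines of plain Python with re.finditer. This is a simple whitespace split written for illustration, not a spaCy function, and the function name is hypothetical. spaCy's own tokens also expose their start offset as token.idx, which is usually the better choice once the text has been tokenized by spaCy.

```python
import re


def token_offsets(text):
    """Return (token, start, end) for each whitespace-separated token.

    The end offset is exclusive, so text[start:end] == token.
    """
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


for token, start, end in token_offsets("karan is good boy"):
    print(token, start, end)
# karan 0 5
# is 6 8
# good 9 13
# boy 14 17
```

A word at the end of the sentence needs no special handling: its end offset is simply the length of the text. Note that this split keeps punctuation attached to the word, so "boy." would come back as one token.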
