Permissively-Licensed Named Entity Recognition on the JVM

Fortis is an open source social data ingestion, analysis, and visualization platform built on Scala and Apache Spark. The tool is developed in collaboration with the United Nations Office for the Coordination of Humanitarian Affairs (UN OCHA) to provide insights into crisis events as they occur, via the lens of social media.

A key part of the Fortis platform is the ability to search events (such as Tweets or news articles) for key figures or places of interest. To increase the accuracy and index quality of this search, Fortis uses named entity recognition to differentiate between normal content words and special entities like organizations, people or locations. This code story explains how Fortis integrated named entity recognition using Spark Streaming and Scala, the challenges faced with this approach and with running named entity recognition on the Java Virtual Machine (JVM), and our solution based on Docker containers and Azure Web Apps for Linux.

The state of open source named entity recognition on the JVM

Several well-known packages in the Java ecosystem offer natural language processing and named entity recognition capabilities; the table below lists some of them. However, many of these projects are either not licensed under terms acceptable for the MIT-licensed Fortis project, or target only a few languages. Some offer only generic named entity recognition such as “this is an entity” as opposed to more granular labels like “this is a place” or “this is a person.”

Project                   Languages                           License    Disambiguation
Annie                     English                             GPL        Yes
Azure Cognitive Services  English                             Azure      No
Berkeley                  English                             GPL        Yes
Epic                      English                             Apache v2  Yes
Factorie                  English                             Apache v2  Yes
OpenNLP                   English, Spanish, Dutch             Apache v2  No
Stanford                  English, German, Spanish, Chinese   GPL        Yes

The OpeNER project, created by the European Union together with a consortium of research universities and industry partners, stands out as a key package for Fortis: OpeNER offers named entity recognition in six languages (English, French, German, Spanish, Italian, Dutch) and is licensed under the Apache v2 license, which makes it easy to integrate into an existing open source project.

Named entity recognition via OpeNER

OpeNER is based on a simple pipeline model in which text is analyzed by a sequence of models, each step augmenting the source text with additional information that is used by subsequent steps. The pipeline model is illustrated in the figure below.

OpeNER pipeline model

This pipeline can be added to a project via Maven and may be consumed from a JVM language such as Scala. The sample code below illustrates the analysis of an input text using the pipeline (see also the production code in the Fortis project). However, note that in the context of a Spark application the approach has four main limitations which will be discussed in the next section and addressed later in this code story.
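The pipeline model can be sketched as follows. Note that the types and step names below are illustrative stand-ins, not OpeNER's actual API; see the Fortis production code for the real integration.

```scala
// Illustrative sketch of the pipeline pattern OpeNER uses: each step
// reads the document and augments it with additional annotations.
case class AnnotatedText(text: String, annotations: Map[String, String])

type PipelineStep = AnnotatedText => AnnotatedText

// Hypothetical steps standing in for OpeNER's language identification,
// tokenization, and named entity recognition models.
val identifyLanguage: PipelineStep =
  doc => doc.copy(annotations = doc.annotations + ("language" -> "en"))

val tokenize: PipelineStep =
  doc => doc.copy(annotations =
    doc.annotations + ("tokens" -> doc.text.split("\\s+").mkString("|")))

val recognizeEntities: PipelineStep =
  doc => doc.copy(annotations = doc.annotations + ("entities" -> "..."))

// The full pipeline is simply the composition of its steps, in order.
val pipeline: PipelineStep =
  Seq(identifyLanguage, tokenize, recognizeEntities).reduce(_ andThen _)

val analyzed = pipeline(AnnotatedText("Rome is in Italy", Map.empty))
```

Each step only depends on the annotated document it receives, which is what makes the pipeline easy to extend with additional models.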

After running text through the pipeline, entities can be extracted from the annotated output. OpeNER supports eight types of entities, ranging from concrete real-world entities such as “person,” “geopolitical entity (GPE)” and “location” to more abstract concepts like “date,” “time” and “money.”
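As a sketch of working with the extracted entities (again with an illustrative model rather than OpeNER's actual KAF/NAF output), filtering down to concrete real-world entity types might look like this:

```scala
// Illustrative entity model; OpeNER's pipeline output is XML in practice.
case class Entity(text: String, entityType: String)

val extracted = Seq(
  Entity("Geneva", "location"),
  Entity("UN OCHA", "organization"),
  Entity("Monday", "date"))

// Keep only concrete real-world entities, dropping abstract ones
// like dates, times and monetary amounts.
val concreteTypes = Set("person", "location", "organization")
val concreteEntities = extracted.filter(e => concreteTypes(e.entityType))
```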

Simplifying the integration with OpeNER via Docker and Azure Web Apps for Linux

In a Spark application context, several issues exist in the approach outlined above:

  • The model binaries must be managed and deployed to every Spark worker node
  • Loading the models from disk is time-consuming for short-lived Spark workers
  • Spark workers are often provisioned with low-spec resources, favoring horizontal over vertical scaling
  • Model files, however, are large binaries, so low-spec Spark workers can run out of memory when loading more than one or two models

To address these limitations, OpeNER can host its models behind HTTP services so that developers can separate their natural language processing infrastructure from their application infrastructure. These HTTP services are simple to consume but hard to set up since they have several complex dependencies. To simplify deployment, the Fortis team created Docker images for each of the services. The services can be run locally as follows (using Bash, after installing Docker):
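For example, the services can be started locally with `docker run`. The image names and port mappings below are assumptions for illustration; consult the Fortis OpeNER Docker repositories for the published names.

```shell
# Start one container per OpeNER service, each on its own local port.
# Image names below are placeholders; check the repositories for the
# actual published images.
docker run -d -p 8080:80 catalystcode/opener-docker-language-identifier
docker run -d -p 8081:80 catalystcode/opener-docker-tokenizer
docker run -d -p 8082:80 catalystcode/opener-docker-pos-tagger
docker run -d -p 8083:80 catalystcode/opener-docker-ner
```

Running the containers with `-d` keeps them in the background so all four services can be composed into a local pipeline.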

Each of the Docker images also comes with a one-click deployment template to set up and run the service in production on Azure Web Apps for Linux; the templates can be found in the corresponding repositories on GitHub.

The one-click deployments running the OpeNER containers on Azure Web Apps for Linux are convenient as they offer a simple deployment and management story. Deploying the services is as simple as clicking the “Deploy to Azure” button on the GitHub repositories and stepping through the wizard. Once the deployment is done, the Azure Portal can be used to easily scale the services horizontally (distributing the service over more instances) and vertically (hosting the service on more powerful virtual machines).

When dealing with large workloads with low-latency or high-throughput requirements, however, introducing four HTTP hops for natural language processing can be prohibitively expensive. In such scenarios, there is the option to run the natural language processing models in-process as described earlier in this post, or to host the Docker images for the OpeNER services on the same host and expose them via a wrapper service such as opener-docker-wrapper. An end-to-end usage example of the wrapper service can be found in its GitHub repository. Deploying the wrapper service on an Azure Standard F1 Virtual Machine leads to an average entity extraction speed of 280ms per request, with a standard deviation of 165ms.
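When evaluating such a deployment, per-request latency can be measured with a small helper like the one below; `callService` is a stub standing in for the real HTTP request to the wrapper service.

```scala
// Measure how long an operation takes, returning its result and the
// elapsed time in milliseconds.
def timed[A](op: => A): (A, Long) = {
  val start = System.nanoTime()
  val result = op
  (result, (System.nanoTime() - start) / 1000000)
}

// Stub standing in for the real HTTP call to the wrapper service.
def callService(text: String): String = s"entities for: $text"

val (response, elapsedMs) = timed(callService("Rome is in Italy"))
```

Collecting these per-request timings over a sample of traffic is how figures like the 280ms average and 165ms standard deviation above can be reproduced for a given deployment.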

The ability to identify entities such as places, people, and organizations adds a powerful level of natural language understanding to applications. However, the open source ecosystem for named entity recognition on the JVM is quite limited: many projects are licensed under non-permissive licenses, target only a few languages, or are hard to deploy.

To solve this problem, the Fortis team created an MIT-licensed one-click deployment to Azure for web services that lets developers get started with a wide range of natural language tasks in 5 minutes or less, by consuming simple HTTP services for language identification, tokenization, part-of-speech tagging, and named entity recognition. These services can be used in a wide variety of other contexts, such as identifying organizations in product reviews or automatically tagging places in social media posts.

Resources

 
