Entity extraction: How does it work?
Entity extraction, also known as entity name extraction or named entity recognition, is an information extraction technique that identifies key elements in text and classifies them into pre-defined categories. In this way, it helps transform unstructured data into structured data that is machine readable and available for standard processing such as information retrieval, fact extraction and question answering. Let’s look at the process of entity extraction and how it works.
In formats such as document files, spreadsheets, web pages and social media, text appears as unstructured data. Being able to identify entities (people, places, organizations and concepts; numerical expressions such as currency amounts and phone numbers; and temporal expressions such as dates, times, durations and frequencies) provides a way for anyone who must make use of the information to understand what it contains. Whether it’s an analyst who has hundreds of documents or more to review, or an investigative journalist who just received a data dump of several gigabytes (on the scale of WikiLeaks or the Panama Papers, for example), they may not initially know what the information contains, nor what they should be looking for.
Entity extraction can provide a useful view of unknown data sets by immediately revealing, at a minimum, who and what the information mentions. As a result, an analyst would be able to see a structured representation of all of the names of people, companies, brands, cities or countries, and even phone numbers, in a corpus. That view could serve as a point of departure for further analysis and investigation.
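To make the idea of a "structured representation" concrete, here is a minimal sketch of pattern-based extraction for a few numerical and temporal expressions. The three regular expressions, the labels and the sample sentence are assumptions chosen for illustration; they cover only a narrow set of formats and are not how any particular product works.

```python
import re
from collections import defaultdict

# Illustrative patterns only (assumed formats); real text needs far broader coverage.
PATTERNS = {
    "PHONE": re.compile(r"\+?\d{1,3}[ -]\d{3}[ -]\d{3}[ -]\d{4}"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "MONEY": re.compile(r"[$€£]\d+(?:,\d{3})*(?:\.\d{2})?"),
}

def extract_entities(text):
    """Return a dict mapping each entity label to the matching strings found in text."""
    found = defaultdict(list)
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found[label].append(match.group())
    return dict(found)

sample = "Wire $12,500.00 by 2016-05-03 and confirm at +1 555-010-9999."
print(extract_entities(sample))
```

The result groups every match under its category, which is the kind of table an analyst could scan before reading a single document in full.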
Entity extraction at work
Entity extraction technologies must address a number of language issues to be able to correctly identify and classify entities. While it’s easy for a human to distinguish between different types of names (person? place? organization? product?), the ambiguities of language make this an especially complex task for machines. A system based on keywords alone cannot differentiate between the possible meanings of a word or the ways it is used, for example: orange (fruit), Orange (county), Orange (company) or orange (color).
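The "orange" problem can be sketched in a few lines. The keyword matcher below finds every mention but cannot tell the senses apart, while a toy classifier that counts context cue words can separate them in easy cases. The cue-word lists, the sense labels and both function names are hypothetical and hand-picked for this example; real systems rely on much richer linguistic and semantic context.

```python
import re

# Hypothetical cue words per sense, hand-picked for this illustration.
SENSE_CUES = {
    "FRUIT": {"juice", "peel", "fresh", "squeezed", "ate"},
    "PLACE": {"county", "residents", "located", "city", "voted"},
    "COMPANY": {"telecom", "subscribers", "shares", "announced", "network"},
    "COLOR": {"painted", "bright", "color", "shade", "wearing"},
}

def keyword_match(text, keyword="orange"):
    """Naive keyword search: finds mentions but says nothing about their meaning."""
    return [m.group() for m in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE)]

def classify_mention(sentence):
    """Pick the sense whose cue words overlap most with the sentence, else UNKNOWN."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(cues & tokens) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "UNKNOWN"

sentences = [
    "She squeezed a fresh orange for juice.",
    "Orange announced new subscribers for its network.",
    "Residents of Orange County voted on Tuesday.",
    "He painted the fence bright orange.",
]
for s in sentences:
    print(keyword_match(s), "->", classify_mention(s))
```

Every sentence produces a keyword hit, yet only the surrounding words reveal whether the hit is a fruit, a place, a company or a color.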
Extraction rules drive the extraction of entities from text and may be based on pattern matching, linguistics, syntax, semantics or a combination of approaches. Entity extraction based on semantic technologies can disambiguate meaning and understand context, which enables a number of downstream operations that are valuable for business and for security and intelligence functions. These include:
- Entity relation extraction: Reveals direct relationships, connections or events shared among different entities as well as complex relationships through inferred, indirect connections.
- Linking: Establishes links between knowledge banks; for example, it could identify all of the places mentioned in a corpus and link to the corresponding location on a map, or cross-reference entities with other information sources.
- Fact extraction: Extracts all of the data associated with an entity so that questions and queries can be answered directly from a corpus (in contrast to a query that would just return a list of documents containing the “answers”).
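The three operations above can be sketched together on a tiny corpus. Everything here is an assumption made for illustration: the mini knowledge base, the fictional entities "Jane Doe" and "Acme Corp", the approximate coordinates, and the use of sentence co-occurrence as a crude stand-in for relation extraction. Semantic systems infer relationships from meaning, not just from proximity.

```python
import re
from itertools import combinations

# Hypothetical mini knowledge base; coordinates are approximate.
KNOWLEDGE_BASE = {
    "Paris": {"type": "PLACE", "lat": 48.8566, "lon": 2.3522},
    "Panama City": {"type": "PLACE", "lat": 8.9824, "lon": -79.5199},
    "Acme Corp": {"type": "ORG"},
    "Jane Doe": {"type": "PERSON"},
}

def find_entities(sentence):
    """Return the known entity names that appear in the sentence."""
    return [name for name in KNOWLEDGE_BASE
            if re.search(r"\b" + re.escape(name) + r"\b", sentence)]

def link_places(sentences):
    """Linking: map each mentioned place to its coordinates in the knowledge base."""
    links = {}
    for sentence in sentences:
        for name in find_entities(sentence):
            entry = KNOWLEDGE_BASE[name]
            if entry["type"] == "PLACE":
                links[name] = (entry["lat"], entry["lon"])
    return links

def cooccurrence_relations(sentences):
    """Relation extraction (crude): pair entities mentioned in the same sentence."""
    relations = set()
    for sentence in sentences:
        for a, b in combinations(find_entities(sentence), 2):
            relations.add(tuple(sorted((a, b))))
    return relations

def facts_about(entity, sentences):
    """Fact extraction (crude): return the sentences that mention the entity."""
    return [s for s in sentences if entity in find_entities(s)]

corpus = [
    "Jane Doe registered Acme Corp in Panama City.",
    "Acme Corp later opened an office in Paris.",
]
print(link_places(corpus))
print(sorted(cooccurrence_relations(corpus)))
print(facts_about("Acme Corp", corpus))
```

Note that `facts_about` returns the statements themselves rather than a list of documents, which mirrors the contrast drawn in the fact extraction item.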
Originally published May 2016, updated January 2020