Text mining algorithms
As we’ve mentioned before, text mining is “the process of deriving high-quality information from text.” It is a broad term that covers a variety of techniques for extracting information from unstructured text. In this post, we’re going to talk about text mining algorithms and two of the most important tasks in this activity: named entity recognition and relation extraction.
Named entity recognition
A named entity is a sequence of words that identifies a real-world element, for example “France,” “Barack Obama” and “Facebook Inc.” Named entity recognition (NER) identifies named entities in unstructured text and assigns each one a type from a known list, such as person, organization, location, measurement, address, or SSN.
This task cannot be accomplished by matching against pre-compiled lists, because the named entities in even a single entity type are virtually unlimited. For example, think of all of the entities that could be categorized as people, or think of all of the locations that you can imagine (and then consider all of the locations that you don’t know). Any list to match the entities against would always be incomplete.
Another reason that matching is not enough is that the context in which a named entity appears can influence its type. For example, “Leonardo Da Vinci” could refer to the historic artist or to the airport in Rome, depending on the context in which the string of words is used.
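To make the role of context concrete, here is a minimal Python sketch that guesses the type of an ambiguous mention from the words around it. The cue-word lists and the function name guess_type are hypothetical and chosen only for illustration; a real system would rely on much richer context than a handful of keywords.

```python
# Hypothetical cue words for each entity type (illustrative only).
CONTEXT_CUES = {
    "PERSON": {"painted", "artist", "born", "invented", "drew"},
    "LOCATION": {"airport", "flight", "landed", "terminal", "departs"},
}


def guess_type(sentence, mention):
    """Pick the entity type whose cue words overlap most with the context."""
    # Remove the mention itself so that only the surrounding words count.
    context = sentence.replace(mention, " ").lower()
    words = {w.strip(".,;:!?") for w in context.split()}
    scores = {label: len(words & cues) for label, cues in CONTEXT_CUES.items()}
    best = max(scores, key=scores.get)
    # With no cue word in the context, the type stays undecided.
    return best if scores[best] > 0 else None
```

Called on “Leonardo Da Vinci painted the Mona Lisa.” this sketch answers PERSON, while “Our flight landed at Leonardo Da Vinci after midnight.” yields LOCATION. The same string gets a different type purely because of its context.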
The first text mining algorithm used for NER is the rule-based approach. Rule-based methods consist of defining a set of rules, either manually or through machine learning. Tools like our Cogito Studio allow you to choose and/or combine both approaches based on your needs. Each word in the text is represented by a set of features. The text is then compared against the rules, and a rule is applied if a match is found.
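The rule-based idea can be sketched in a few lines of Python. This is a simplified illustration, not how Cogito Studio works internally: the features, the two hand-written rules and the names token_features and apply_rules are all assumptions made for the example.

```python
def token_features(token):
    """Represent a token by a small set of surface features."""
    return {
        "text": token,
        "is_capitalized": token[:1].isupper(),
        "is_title": token in {"Mr.", "Mrs.", "Dr."},
        "is_company_suffix": token in {"Inc.", "Corp.", "Ltd."},
    }


# Each rule pairs an entity type with one predicate per consecutive token.
# These two hand-written rules are hypothetical examples.
RULES = [
    ("PERSON", [lambda f: f["is_title"],
                lambda f: f["is_capitalized"]]),
    ("ORGANIZATION", [lambda f: f["is_capitalized"] and not f["is_company_suffix"],
                      lambda f: f["is_company_suffix"]]),
]


def apply_rules(text):
    """Scan the text left to right and apply the first rule that matches."""
    tokens = text.split()
    feats = [token_features(t) for t in tokens]
    entities = []
    i = 0
    while i < len(tokens):
        for label, pattern in RULES:
            window = feats[i:i + len(pattern)]
            if len(window) == len(pattern) and all(
                    pred(f) for pred, f in zip(pattern, window)):
                entities.append((" ".join(tokens[i:i + len(pattern)]), label))
                i += len(pattern)  # skip past the matched entity
                break
        else:
            i += 1  # no rule matched at this position
    return entities
```

On the sentence “Yesterday Dr. Rossi joined Facebook Inc. as an advisor” the sketch finds “Dr. Rossi” as a PERSON and “Facebook Inc.” as an ORGANIZATION. Neither name has to appear in any list, because the rules look at features rather than at the words themselves.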
A second kind of text mining algorithm is based on a statistical learning approach. Statistical text mining algorithms treat named entity recognition as a sequence labeling problem, a general machine learning problem that is used to model many natural language processing tasks: each word in the text receives a label that says whether it is part of an entity and, if so, of which type.
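To show what “sequence labeling” means in practice, the Python sketch below converts entity spans into one label per word and back again. It assumes the widely used BIO scheme (B = beginning of an entity, I = inside, O = outside), which the post itself does not name; the function names are hypothetical. A statistical model would then be trained to predict these labels from the words.

```python
def spans_to_bio(tokens, spans):
    """Turn (start, end_exclusive, label) spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def bio_to_spans(tags):
    """Recover (start, end_exclusive, label) spans from a BIO tag sequence."""
    spans = []
    start, label = None, None
    # The extra "O" acts as a sentinel that closes an entity at the very end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans
```

For the words Barack, Obama, visited, France with a PERSON span over the first two words and a LOCATION span over the last, spans_to_bio produces B-PERSON, I-PERSON, O, B-LOCATION, and bio_to_spans turns those labels back into the original spans.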
Relation extraction
The second task in text mining is relation extraction, which builds on the output of named entity recognition. Relation extraction is the task of identifying and defining the semantic relations between entities in text. The importance of this task has grown in recent years because of the relevance of relations in social media monitoring and government intelligence.
The most commonly used text mining algorithms for relation extraction are those also used for classification problems. Relation extraction is treated as a classification task that, given a pair of entities that co-occur in the same sentence, tries to categorize their relation based on a predefined list or taxonomy of relations.
A good example of this is feature-based classification. The text mining algorithm starts from any pair of entities that co-occur in the same sentence. The objective of the classification task is to assign a class label to this co-occurrence: either one of the predefined relation types, or zero if the two entities are not related.
A second example is the kernel method, a more complex mechanism that is frequently used in machine learning. In this method, a kernel (a mathematical function) acts as a similarity measure between observations. This similarity, once again oversimplifying this advanced algorithm, is then used to identify similar relations.
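Here is a minimal Python sketch of feature-based classification for a pair of entities. The features, the relation names and especially the weights are hypothetical: a real system would learn the weights from annotated examples, while here they are set by hand to keep the example short. The label "NONE" plays the role of the “zero” class for unrelated entities.

```python
def pair_features(tokens, e1, e2):
    """Describe an entity pair by its types and the words between the entities.

    e1 and e2 are (start, end_exclusive, type) spans, with e1 before e2.
    """
    between = [t.lower() for t in tokens[e1[1]:e2[0]]]
    feats = {f"types={e1[2]}-{e2[2]}", f"distance={min(len(between), 5)}"}
    feats.update(f"between={w}" for w in between)
    return feats


# Hand-set, hypothetical weights standing in for a trained classifier.
WEIGHTS = {
    "WORKS_FOR": {"types=PERSON-ORGANIZATION": 1.0,
                  "between=works": 2.0, "between=for": 1.0},
    "BORN_IN": {"types=PERSON-LOCATION": 1.0,
                "between=born": 2.0, "between=in": 1.0},
}
THRESHOLD = 2.0  # below this score the pair is considered unrelated


def classify_pair(tokens, e1, e2):
    """Return the best-scoring relation type, or "NONE" if nothing scores well."""
    feats = pair_features(tokens, e1, e2)
    scores = {rel: sum(w for f, w in ws.items() if f in feats)
              for rel, ws in WEIGHTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= THRESHOLD else "NONE"
```

With these weights, “Alice works for Acme” is labeled WORKS_FOR and “Alice was born in Rome” is labeled BORN_IN, while “Alice and Acme” stays below the threshold and gets NONE.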
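As a rough illustration of a kernel used as a similarity measure, the Python sketch below compares the words found between two entities using a very simple kernel (a dot product of word counts) and picks the relation of the most similar labeled example. This is a deliberately basic stand-in: kernels used for relation extraction in practice are usually more elaborate, and the labeled examples and function names here are invented for the example.

```python
from collections import Counter
from math import sqrt


def bow_kernel(a, b):
    """Dot product of the word-count vectors of two word lists."""
    ca, cb = Counter(a), Counter(b)
    return sum(ca[w] * cb[w] for w in ca)


def normalized_kernel(a, b):
    """Scale the kernel so that identical inputs have similarity 1."""
    denom = sqrt(bow_kernel(a, a) * bow_kernel(b, b))
    return bow_kernel(a, b) / denom if denom else 0.0


# Hypothetical labeled examples: words between two entities, plus the relation.
LABELED = [
    (["works", "for"], "WORKS_FOR"),
    (["is", "employed", "by"], "WORKS_FOR"),
    (["was", "born", "in"], "BORN_IN"),
]


def most_similar_relation(context):
    """Return the relation of the labeled example most similar to the context."""
    return max(LABELED, key=lambda ex: normalized_kernel(context, ex[0]))[1]
```

A new context such as “works part-time for” shares two words with the first labeled example and none with the others, so the sketch returns WORKS_FOR. The point is that the decision comes from similarity to known examples rather than from an explicit list of features.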
You can see some of these features in action in our live Cogito demo.