Posted on: April 08, 2016, by: Ann Kelly
Auto-categorization is usually described as one of the basic text analytics capabilities. This is unfortunate because the underlying functionality can be used for much more than simply categorizing the subject of a document. The name probably stuck because a lot of text analytics companies only offered categorization using training sets. With that approach, auto-categorization is all you can do. However, once you add categorization rules, such as those in Smartlogic’s Semaphore (either sets of terms or full categorization syntax), you can use them for a broad range of other tasks.
Let’s take a look at three examples:
First, rules can be used to greatly enhance the accuracy of data extraction (one of the other basic functions of text analytics) by using the context surrounding a target noun phrase to disambiguate the phrase. One of my favorite examples is the word “pipeline,” which can be either an oil and gas pipeline or a biotech research pipeline. By looking at the words that surround “pipeline” to see if they are oil and gas or biotech related, you can disambiguate and capture the correct meaning.
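The idea of disambiguating by surrounding words can be sketched in a few lines of Python. This is only an illustration, not how Semaphore evaluates its rules: the cue vocabularies, the `disambiguate_pipeline` function, and the eight-token window are all assumptions made for the example.

```python
import re

# Hypothetical cue vocabularies; a real project would draw these from a taxonomy.
OIL_GAS_CUES = {"oil", "gas", "crude", "drilling", "refinery", "barrels"}
BIOTECH_CUES = {"drug", "clinical", "trial", "trials", "fda", "compound", "therapy"}


def disambiguate_pipeline(text, window=8):
    """Label each occurrence of 'pipeline' using cue words within `window` tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    labels = []
    for i, tok in enumerate(tokens):
        if tok not in ("pipeline", "pipelines"):
            continue
        # Tokens on either side of the target word, excluding the word itself.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        oil = sum(t in OIL_GAS_CUES for t in context)
        bio = sum(t in BIOTECH_CUES for t in context)
        if oil > bio:
            labels.append("oil_and_gas")
        elif bio > oil:
            labels.append("biotech")
        else:
            labels.append("unknown")
    return labels
```

Counting cue words on both sides of the target is the simplest possible scoring. A production rule would more likely weight terms and respect sentence boundaries.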
Second, rules can be used to capture specific facts, not just random strings. For example, on one project we wanted to capture not only addresses and phone numbers, but also the addresses of a number of specific people mentioned in a document. To do that, we needed to examine the context around those terms and put together the right combinations. We started with semi-regular expressions like addresses and phone numbers, which was fairly easy. The next step was to capture people’s names, for which we combined external data sources that listed many of the names. However, the lists were incomplete and so had to be supplemented with rules that looked for capitalized words appearing near text that indicated they were people, not other proper nouns. These indicator words included job titles and various action verbs like “said” or “submitted.” This enabled us to differentiate the address of the architect for a particular project from the addresses of people in less important roles.
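The name-detection step can be sketched as follows. This is a minimal illustration rather than the rules used on the project: the cue lists, the regular expressions, and the `find_people` function are assumptions, and only the verbs “said” and “submitted” come from the description above.

```python
import re

# Hypothetical cue lists; the job titles here are illustrative only.
JOB_TITLES = ["architect", "engineer", "contractor"]
ACTION_VERBS = ["said", "submitted"]

# Two or more capitalized words in a row, e.g. "Jane Doe".
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"

# A job title (any case) immediately before a capitalized sequence.
TITLE_THEN_NAME = re.compile(r"\b((?i:%s))\s+(%s)" % ("|".join(JOB_TITLES), NAME))
# A capitalized sequence immediately before an action verb.
NAME_THEN_VERB = re.compile(r"(%s)\s+(%s)\b" % (NAME, "|".join(ACTION_VERBS)))


def find_people(text):
    """Map each likely person name to the set of cues that supported it."""
    cues_by_name = {}
    for m in TITLE_THEN_NAME.finditer(text):
        cues_by_name.setdefault(m.group(2), set()).add(m.group(1).lower())
    for m in NAME_THEN_VERB.finditer(text):
        cues_by_name.setdefault(m.group(1), set()).add(m.group(2))
    return cues_by_name
```

In a sentence such as “The architect Jane Doe submitted the plans for 12 Main Street.”, “Jane Doe” is supported by two cues, while “Main Street” is capitalized but has no person cue next to it and is ignored.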
The third example, and the one I think is the most powerful, is analyzing unstructured text. Of course, there is no such thing as truly unstructured text: all text has some structure, starting with words, sentences, and paragraphs. In addition, many documents are structured into sections of varying degrees of formality. These sections are typically identified by a small number of text headings. For example, on a recent project we developed rules that only looked at specific sections for the target words. This allowed us to achieve much higher accuracy by ignoring parts of the document where those words either were not important or meant something entirely different, such as being part of a list of options rather than indicating the presence of an option.
The rule, when developed in Smartlogic’s Semaphore, looked like this:
<sequence sequencetype="paragraph">
  <paragraph>
    <text data="LOCATION AND DESCRIPTION OF PROPOSED WORK:" />
  </paragraph>
  <skip count="4"/>
  <paragraph>
    <text data="Target - Work Issues" />
  </paragraph>
</sequence>
This rule only counted the list of concepts in the ontology node - Work Issues - if they appeared within 4 paragraphs of the section heading, “LOCATION AND DESCRIPTION OF PROPOSED WORK.” This is a simple example that could be extended to use a variety of the structural elements found in most documents.
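For readers who do not use Semaphore, the same section-scoping logic can be approximated in plain Python. This sketch does not reproduce Semaphore’s rule semantics: the `terms_in_section` function, its `within` parameter, and the sample terms standing in for the Work Issues node are assumptions for illustration.

```python
def terms_in_section(paragraphs, heading, terms, within=4):
    """Return the terms found in the `within` paragraphs that follow `heading`."""
    found = set()
    for i, para in enumerate(paragraphs):
        if heading not in para:
            continue
        # Only look at the paragraphs immediately after the heading.
        for following in paragraphs[i + 1:i + 1 + within]:
            lowered = following.lower()
            # Simple case-insensitive substring match (an assumption of this sketch).
            found.update(t for t in terms if t.lower() in lowered)
    return found


# Hypothetical document and terms.
paragraphs = [
    "LOCATION AND DESCRIPTION OF PROPOSED WORK:",
    "Replace roof.",
    "Install new wiring.",
    "Paint the exterior.",
    "Repair the fence.",
    "Options considered included demolition.",
]
section_terms = terms_in_section(
    paragraphs,
    "LOCATION AND DESCRIPTION OF PROPOSED WORK:",
    ["roof", "demolition"],
)
```

Here “roof” is counted because it appears within four paragraphs of the heading, while “demolition” falls outside that window and is ignored.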
On another project, we built rules that looked in the Abstract section of scientific papers. As expected, the section heading text varied from publication to publication, so we incorporated all of the variations in our rules. For example, the rule started with the following variations:
Abstract,
ABSTRACT,
Introduction,
Background,
Summary,
Aim,
SUMMARY,
The fully mature rule contained many variations and we were then able to categorize different sections and develop section-type specific rules. Using these techniques, we achieved over 99% accuracy and leveraged that accuracy to build additional applications.
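A rough Python equivalent of matching heading variations is sketched below. It is an illustration only: the `ABSTRACT_HEADINGS` set holds just the starting variations listed above, and the `find_abstract` function and its paragraph-list input are assumptions. Case-folding means “Abstract” and “ABSTRACT” do not need separate entries, as they did in the original rule.

```python
# Starting heading variations from the rule, stored case-folded.
ABSTRACT_HEADINGS = {"abstract", "introduction", "background", "summary", "aim"}


def find_abstract(paragraphs):
    """Return the paragraph that follows the first recognized heading, or None."""
    for i, para in enumerate(paragraphs[:-1]):
        # Normalize the candidate heading: trim, drop trailing ':' or '.', ignore case.
        key = para.strip().rstrip(":.").casefold()
        if key in ABSTRACT_HEADINGS:
            return paragraphs[i + 1]
    return None
```

Once a section has been located this way, section-specific rules can be applied to its text alone rather than to the whole document.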
These are just three examples of ways you can use “auto-categorization.” The functionality underlying categorization is really the brains of text analytics, and it enables us to process text with additional depth and intelligence. The possibilities are expanding every day, so stay tuned.