Semaphore Rulebase Generator

Accurate, Transparent Rule-based Classification and Tagging of Content

Semaphore Rulebase Generator is the Semaphore component that links the language used to describe a topic (held in Semaphore Ontology Manager) to the classification processing engine (Semaphore Classification Server).

Semaphore employs 20 different types of rule, with many different control attributes, ways of describing expressions and types of wildcard. This allows the system to create sophisticated classification rules that deliver precise results.

Rule-based classification is transparent: you can see exactly why a 'tag' has been returned by analyzing the rules. This matters, for example, in e-discovery, where it is important to be able to determine exactly why documents are submitted. By contrast, Bayesian statistical methods over training sets create a 'black box'.

The Rulebase Generation Process

The process of classifying a document based on its content (as opposed to other factors such as its format or stage in a process) might follow this path:

  • What 'Language' Evidence is Available?

The system (or user) identifies vocabulary that describes a subject or concept.  For example:  Mobile Handsets, Mobiles, GSM, Cellular, Networks, Global Standard for Mobiles, Quad band, 3G, iPhone often appear in literature about Mobile Communications.

  • How Significant is That Language?

Language is complex: the same term is often used to describe very different concepts. For example, Mobile, Networks and Cellular might be relevant terms but may also occur in documents that have nothing to do with mobile communications. iPhone, Quad Band and GSM, however, are less ambiguous.

  • How is The Language Used in Your Domain?

The relationships between terms are significant. For example, 'networks' is a parent term of 'mobile' and 'fixed line', and 'mobile' in turn is related to 'cellular'. This structure helps identify the subject area as mobile communications.
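As an illustration only, the relationships in this example could be held in a structure like the one below. This is a minimal Python sketch, not Semaphore's own model format: the names `BROADER`, `RELATED` and `context_terms` are hypothetical and chosen just for this example.

```python
# Hypothetical encoding of the example relationships; not Semaphore's format.
# child term -> parent (broader) term
BROADER = {"mobile": "networks", "fixed line": "networks"}
# term -> set of related terms (stored in both directions)
RELATED = {"mobile": {"cellular"}, "cellular": {"mobile"}}


def context_terms(term):
    """Return every term exactly one relationship away from `term`."""
    near = set(RELATED.get(term, ()))
    if term in BROADER:
        near.add(BROADER[term])  # the parent term
    # any child terms whose parent is `term`
    near.update(child for child, parent in BROADER.items() if parent == term)
    return near


print(context_terms("mobile"))
```

A document that mentions 'mobile' together with several of its neighbouring terms is more likely to be about mobile communications than one that mentions 'mobile' alone.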

Semaphore Rulebase Generator encapsulates this language processing logic and outputs 'rule-bases'.

During rulebase creation it takes all of the following elements into account:

  • How significant is the language in relation to a concept? Rules are weighted to reflect this:
    • 'Mobile' gets a low weight (it contributes little to the score, e.g. 10) because it is ambiguous.
    • 'GSM' gets a higher weight (it contributes more to the score, e.g. 20) because it is a more reliable indicator.
  • The location of the language in the text can also reflect its significance. Content is made up of various structures:
    • Single words
    • Phrases
    • Sentences
    • Title / Body
    • Metadata fields
  • Weightings can be applied accordingly:
    • If the preferred term appears in the document title as an exact phrase match, a very high weight (contributing 40 to the score) can be applied, as this is a very good indication.
    • If a metadata field contains a term that is a look-up term in the taxonomy, ensure the preferred term is returned (contribute 100 to the score).
  • The Semaphore model holds other clues in its relationships between terms:
    • If a child term 'fires' (i.e. its score exceeds the threshold), pass some weight to its immediate parent.
    • If a related topic fires, pass a higher weight.
    • A specific equivalence relationship can carry higher significance. For example, a Full / Abbreviation equivalence relationship has been created and, for the content set, it is useful to return a higher weight for abbreviations than for other Use / Use For equivalence relationships, i.e. contribute a score of 10 for UF/Use and a score of 15 for Abbr/Full.
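
The weighting scheme above can be sketched in a few lines of Python. This is an illustrative sketch only, not Semaphore's rule syntax or API. The weights 10, 20, 40, 100, 10 and 15 are the example values from the list; the threshold and the two propagation weights are assumptions, since the text gives no figures for them. All function and variable names are hypothetical.

```python
import re
from dataclasses import dataclass, field

# Example weights quoted in the list above.
TERM_WEIGHTS = {"mobile": 10, "gsm": 20}
TITLE_EXACT_PHRASE_WEIGHT = 40
METADATA_LOOKUP_WEIGHT = 100
EQUIVALENCE_WEIGHTS = {"use_for": 10, "abbreviation": 15}

# Assumed values: the text gives no numbers for these.
THRESHOLD = 50
CHILD_TO_PARENT_WEIGHT = 5
RELATED_TOPIC_WEIGHT = 8


@dataclass
class Document:
    title: str
    body: str
    metadata: dict = field(default_factory=dict)


def words(text):
    """Lower-cased set of alphanumeric tokens in `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score_concept(doc, preferred_term, lookup_terms=(), equivalents=None):
    """Score one concept against one document.

    Simplifications: each term counts once however often it occurs,
    equivalents are single words, and the title check is a
    case-insensitive substring match.
    """
    score = 0
    body_words = words(doc.body)
    for term, weight in TERM_WEIGHTS.items():
        if term in body_words:
            score += weight
    # equivalents maps a term to "use_for" or "abbreviation"
    for term, kind in (equivalents or {}).items():
        if term.lower() in body_words:
            score += EQUIVALENCE_WEIGHTS[kind]
    if preferred_term.lower() in doc.title.lower():
        score += TITLE_EXACT_PHRASE_WEIGHT
    lookups = {t.lower() for t in lookup_terms}
    if any(str(v).lower() in lookups for v in doc.metadata.values()):
        score += METADATA_LOOKUP_WEIGHT
    return score


def propagate(scores, parents, related, threshold=THRESHOLD):
    """Pass weight from each fired concept to its parent and related topics."""
    result = dict(scores)
    for concept, score in scores.items():
        if score <= threshold:  # a concept fires only above the threshold
            continue
        parent = parents.get(concept)
        if parent is not None:
            result[parent] = result.get(parent, 0) + CHILD_TO_PARENT_WEIGHT
        for other in related.get(concept, ()):
            result[other] = result.get(other, 0) + RELATED_TOPIC_WEIGHT
    return result


doc = Document(title="Mobile Communications Market Report",
               body="GSM handsets and mobile networks.")
score = score_concept(doc, "Mobile Communications")  # 10 + 20 + 40
print(propagate({"Mobile Communications": score},
                parents={"Mobile Communications": "Networks"},
                related={}))
```

Because every contribution is a named weight attached to a specific piece of evidence, the final score can be traced back to the rules that produced it, which is the transparency property described earlier.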