Fact & Language Service

Classify information assets in multiple languages using advanced language processing

Advanced Language Packs apply sophisticated language-processing logic to deliver highly accurate classification.

Semaphore Advanced Language Packs help you extract the vocabulary and relationships from your classification model so information can be put in the context of the business. Using analytics and natural language processing strategies such as stemming, tokenization, lemmatization and part-of-speech tagging, you can identify the sentiment and context within unstructured information and use it to identify trends, discover patterns and improve your organization.

The process begins with Semaphore Ontology Editor where you create a taxonomy/ontology/model that reflects the topics, concepts and unique characteristics of the organization as well as your fact extraction strategy. Semaphore Rulebase Generator creates rulebases directly from the model and Classification Server uses them to perform precise, complete and consistent metadata tagging.

Advanced Language Packs use part-of-speech tagging, which identifies a word’s grammatical category (e.g. noun or verb). This information is used in conjunction with rule logic to perform complex matching. With Semaphore, information can be analyzed in bulk; Semaphore can then generate RDF triples and use graph-based technology to visualize results and drive information discovery.
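As a rough illustration of how a part-of-speech tag can sharpen a match, the sketch below accepts the term "book" only when it is tagged as a noun, then writes the match as an RDF triple in N-Triples syntax. This is not Semaphore's rulebase syntax or API: the function names, the hand-assigned tags and the example.org IRIs are all assumptions made for the example.

```python
# Hypothetical illustration only; not the Semaphore rulebase syntax or API.
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, part-of-speech tag)


def match_with_pos(tokens: List[TaggedToken], term: str, required_pos: str) -> List[int]:
    """Return the positions where `term` occurs with the required part of speech."""
    return [
        i
        for i, (word, pos) in enumerate(tokens)
        if word.lower() == term and pos == required_pos
    ]


def to_ntriple(subject_iri: str, predicate_iri: str, object_iri: str) -> str:
    """Serialize one triple of IRIs as a single N-Triples line."""
    return f"<{subject_iri}> <{predicate_iri}> <{object_iri}> ."


# Tags are assigned by hand here; a real tagger would produce them.
sentence_a = [("They", "PRON"), ("will", "AUX"), ("book", "VERB"),
              ("a", "DET"), ("flight", "NOUN")]
sentence_b = [("The", "DET"), ("book", "NOUN"), ("sold", "VERB"), ("well", "ADV")]

hits_a = match_with_pos(sentence_a, "book", "NOUN")  # verb usage: no match
hits_b = match_with_pos(sentence_b, "book", "NOUN")  # noun usage: match

triple = ""
if hits_b:
    # Example IRIs under example.org are placeholders.
    triple = to_ntriple(
        "http://example.org/doc/42",
        "http://purl.org/dc/terms/subject",
        "http://example.org/concept/Book",
    )
    print(triple)
```

A plain keyword match would tag both sentences with the concept; the part-of-speech condition keeps only the one where "book" is actually a noun.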

Semaphore Advanced Language Packs are available in languages such as Arabic, Bokmål, Catalan, Chinese (Simplified and Traditional), Croatian, Czech, Danish, Dutch, English, Finnish, French, German, Italian, Japanese, Korean, Nynorsk, Polish, Portuguese, Russian, Serbian, Slovak, Slovenian and Spanish.

Advanced Language Pack strategies

Advanced Language Packs use a number of sophisticated linguistic strategies to analyze unstructured information and identify sentiment, context and meaning:

  • Language identification – automatic identification of the language found within the text (e.g. French, English or German) as well as the text format (e.g. plain text or HTML).
  • Document Analyzer – parses information assets and identifies paragraphs and sentences.
  • Case Normalization – identifies case-normalized alternatives for words within your asset based on document position, such as within a sentence or in a title.
  • Word Segmentation – performs basic tokenization, breaking text into syntactic units (tokens). Identifies abbreviations and multi-word tokens (e.g. out-of-the-box) so they can be processed as single words.
  • Stemmer – identifies the base form (stem) for each token found within the text. For example, the words speaks and speaking have a stem of speak.
  • Part-of-Speech tagging – identifies and labels the part of speech (e.g. noun or verb) for each word using the surrounding context, as well as sub-class attributes (singular or plural for nouns, present or past tense for verbs).
  • Tagged Stemming – provides complete linguistic analysis of input text, including stemming with respect to part-of-speech information. This operation segments text into words and punctuation, performs document analysis, case normalization, and part-of-speech tagging.
  • Phrase Grouping – identifies sequences of tokens that function as a single syntactic unit in text. Given sequences of words labeled with part-of-speech tags, phrase grouping uses grammar rules defined in the language-specific modules to form phrases.
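Several of the strategies above can be pictured as stages in a small pipeline. The sketch below is a toy, English-only approximation of document analysis (sentence splitting), word segmentation and stemming. It does not reflect how the Advanced Language Packs are implemented: the function names, the regular expressions and the naive suffix-stripping rule are assumptions chosen only to make the ideas concrete.

```python
# Toy pipeline for illustration; real language packs use far richer rules.
import re
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Document analysis (toy): split after ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def tokenize(sentence: str) -> List[str]:
    """Word segmentation (toy): keep hyphenated multi-word tokens together."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)


def stem(token: str) -> str:
    """Stemmer (toy): lowercase, then strip a trailing 'ing' or 's'."""
    word = token.lower()
    for suffix in ("ing", "s"):
        # Require at least three remaining characters so "is" stays intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text: str) -> List[List[Tuple[str, str]]]:
    """Run the stages in order and pair every token with its stem."""
    return [
        [(token, stem(token)) for token in tokenize(sentence)]
        for sentence in split_sentences(text)
    ]


sample = "She speaks clearly. He is speaking about an out-of-the-box tool."
for sentence_tokens in analyze(sample):
    print(sentence_tokens)
```

In this toy version, speaks and speaking both reduce to speak, and out-of-the-box survives as a single token. A production stemmer would need language-specific rules and exception lists that a simple suffix strip cannot provide.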

Advanced Language Packs are available in more than 30 languages. To discover the power of Semaphore Advanced Language Packs, download our fact sheet.