Posted on: May 09, 2016, by: Steve Biondi
The MarkLogic database is a remarkably flexible and fast NoSQL database capable of handling a broad spectrum of data-driven applications. Some applications have tightly controlled XML document schemas, where data is chopped up into well-known, discrete units and stored consistently within XML documents in a disciplined manner. The documents in other applications, however, may be less structured. They may contain entire sections of XML or XHTML holding valuable blocks of text of varying sizes with little informational structure, for example, when Microsoft Office documents or PDF files are converted to XHTML, or when user comments or textual descriptions are stored in XML elements. In these cases, it is difficult to leverage the valuable information contained in these textual blocks.
Smartlogic’s Semaphore solves this problem by automatically analyzing the textual information within your documents and enriching them with well-structured, machine-readable metadata that contains the relevant concepts found within your enterprise controlled vocabulary, taxonomy, or ontology. It can also extract and embed “entities” such as names, addresses, and dates.
When you combine this powerful capability with your MarkLogic application and enrich your documents, you have effectively mapped them onto your enterprise’s semantic knowledge model. These models can cover almost any domain, such as general topic maps, product lines, medical areas and/or standards, geographic locations, and areas of law. The semantic models themselves can also be used to further enhance applications using model information such as synonyms, related concepts, broader or narrower concepts, or custom relationship types defined by you. A simple example could be mapping XML documents into a topical medical taxonomy, which could then lead to other related concepts through pre-established associations or “crosswalk” mappings between the medical or disease taxonomies and terminologies.
The Smartlogic for MarkLogic Connector is a pipeline in the MarkLogic Content Processing Framework (CPF), which is automatically triggered when a user or process adds or updates a document. The Smartlogic pipeline module sends the XML/XHTML document to Smartlogic’s Classification Server, which executes the rules and automatically inserts the resulting information into the document.
The pipeline configuration specifies the Classification Server host and port, the location within the XML document where results are inserted, and the minimum score threshold. Users can configure and deploy the pipeline using a web application that is part of the connector’s Roxy-based installation.
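To make those configuration settings concrete, here is a minimal Python sketch of how they might be modeled and validated. It is only an illustration: the class name `PipelineConfig`, its field names, the example host, the port value, and the insert path are all hypothetical and are not the connector's actual configuration format. The zero-to-100 bound on the threshold follows the score range described later in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical model of the four settings named in the text."""
    host: str          # Classification Server host (placeholder value below)
    port: int          # Classification Server port (placeholder value below)
    insert_path: str   # location in the XML document where results go
    min_score: float   # minimum score threshold, on the 0-100 scale

    def __post_init__(self):
        # Reject obviously invalid settings before a pipeline is deployed.
        if not self.host:
            raise ValueError("host must not be empty")
        if not 1 <= self.port <= 65535:
            raise ValueError("port must be between 1 and 65535")
        if not self.insert_path:
            raise ValueError("insert_path must not be empty")
        if not 0 <= self.min_score <= 100:
            raise ValueError("min_score must be between 0 and 100")


# Example values are placeholders, not real defaults.
config = PipelineConfig(
    host="classifier.example.com",
    port=8080,
    insert_path="/doc/metadata",
    min_score=48,
)
```

Validating up front like this is a design choice for the sketch; it simply keeps a bad threshold or port from surfacing later as a confusing classification failure.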
The knowledge models used for classification are built and managed in Smartlogic’s Workbench Ontology Editor application. Smartlogic models are based on the SKOS-XL semantic standard and contain concept schemes, concepts, labels, metadata, and relationships.
Once a knowledge model is ready to be used for classification, a user publishes it, and our publisher service generates and deploys all the rules necessary to classify documents. The Smartlogic Classification Server classifies text within a document using these classification rules. Each rule, after execution, returns the taxonomy concept identifier, the concept label, its rule-base class, and a classification confidence score ranging from zero to 100.
When classification is complete, the concept results whose score exceeds a configurable threshold are automatically embedded directly into the XML/XHTML document. These XML tags, called “META” tags, can be used for a wide variety of purposes, for example, in field or path range indexes to enable faceted refiners with the MarkLogic Search API. The rule-base class can also be used within XPath expressions to create multiple field or path range indexes, and thereby multiple facet dimensions on documents from the knowledge model. And because we store the concept identifiers in the documents, you do not have to update embedded labels when the labels for concepts change (and they always do).
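The threshold-and-embed step, and the idea of using the rule-base class to separate facet dimensions, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the connector's implementation: the `ConceptResult` class, the `classification` wrapper element, the attribute names on the `META` element, and the sample concepts are all hypothetical, and the counting function only mimics in memory what a MarkLogic range index would do.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from dataclasses import dataclass


@dataclass
class ConceptResult:
    """One classification result: the four values each rule returns."""
    concept_id: str
    label: str
    rule_class: str   # rule-base class
    score: float      # confidence score, 0-100


def embed_results(doc_xml, results, threshold):
    """Append a META element for every result scoring above the threshold."""
    root = ET.fromstring(doc_xml)
    # Hypothetical wrapper element; the real insert location is configurable.
    container = ET.SubElement(root, "classification")
    for r in results:
        if r.score > threshold:
            # Hypothetical attribute layout for the embedded tag.
            ET.SubElement(container, "META", {
                "name": r.rule_class,
                "id": r.concept_id,
                "value": r.label,
                "score": str(r.score),
            })
    return ET.tostring(root, encoding="unicode")


def facet_counts(docs, rule_class):
    """Count concept identifiers for one rule-base class across documents."""
    counts = Counter()
    for doc_xml in docs:
        root = ET.fromstring(doc_xml)
        # XPath predicate on the rule-base class selects one facet dimension.
        for meta in root.findall(f".//META[@name='{rule_class}']"):
            counts[meta.get("id")] += 1
    return counts


# Made-up sample results for a made-up document.
results = [
    ConceptResult("C001", "Asthma", "Disease", 87),
    ConceptResult("C002", "Cough", "Symptom", 40),
    ConceptResult("G010", "France", "Geography", 72),
]
enriched = embed_results("<doc><body>Sample text.</body></doc>", results, threshold=48)
disease_facet = facet_counts([enriched], "Disease")
geography_facet = facet_counts([enriched], "Geography")
```

Counting by concept identifier rather than by label mirrors the point above: a label can be renamed in the model without touching the documents.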
In subsequent blog entries, I will go into further detail about how to get the most out of the Smartlogic for MarkLogic solution. For example, you can use a MarkLogic database as the back-end triple store for your models, allowing you to access that information natively in MarkLogic via SPARQL queries. In other scenarios, you may want to control document classification manually and avoid using the CPF and the MarkLogic task server altogether, in which case a tool such as MarkLogic CORB2 might be a better approach. Until then!