Strange reasoning with Open Calais

I have been testing Open Calais.
Open Calais is a Web Service for text mining that can extract entities (persons, companies, countries, ... ) in RDF/OWL from arbitrary text and HTML documents.

My favorite SW IDE, TopBraid Composer, supports Open Calais as one of its import features. So I gave it a try with the following HTML page: Wikipedia's entry on John Zorn.

[Image: John Zorn on Wikipedia]

This is the 'quite impressive' result of Open Calais' mining, detecting amongst others 38 instances of 'Music Albums' and 25 of 'Music Groups'.

[Image: Detected classes and instances]

Is this perfect? Of course not. One of his most famous bands, Masada, is detected not as a 'Music Band' but as a 'Facility' and a 'Product'.

The semantics of all those classes can be found at the URL of the namespace(s) used.
The 'cale' prefix stands for the following URI: 'http://s.opencalais.com/1/type/em/e/'.
This URI is dereferenceable and offers, depending on the content type negotiated by the client, a human-oriented HTML representation or a machine-readable RDF version.
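
As a rough illustration of that content negotiation, the sketch below builds two requests for the same class URI that differ only in their Accept header. It assumes that the class URI is the 'cale' namespace plus the local name 'Facility', and that 'application/rdf+xml' is the media type served for the RDF version; neither is confirmed here. The helper names are mine, and fetch() is only defined, not called, since I cannot promise what the server answers.

```python
from urllib.request import Request, urlopen

# Namespace behind the 'cale' prefix, as given in the post.
CALE_NS = "http://s.opencalais.com/1/type/em/e/"


def build_request(local_name, accept):
    """Build a GET request for a class URI with the given Accept header."""
    return Request(CALE_NS + local_name, headers={"Accept": accept})


def fetch(local_name, accept):
    """Dereference the class URI; returns (content type, body bytes)."""
    with urlopen(build_request(local_name, accept), timeout=10) as resp:
        return resp.headers.get("Content-Type"), resp.read()


# Same URI, two representations (media types are assumptions).
html_req = build_request("Facility", "text/html")
rdf_req = build_request("Facility", "application/rdf+xml")
```

Calling fetch("Facility", "application/rdf+xml") would then be the way to get at the rdfs:domain and rdfs:range statements, provided the server honours the header.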

Below is the human-targeted explanation of the class 'Facility'.

[Image: Facility explained]

Looking at the RDF descriptions, we discover a lot of rdfs:domain and rdfs:range statements.
So I decided to use these statements to infer new triples with a reasoner, and got surprising results.

An example result.

The resource with the following identifier

[Image: id]

being initially of type 'Company', now becomes an instance of every class in the list below:

[Image: instanceOf]

which sounds like complete nonsense to me.

Thanks to using Pellet 1.5.2 as the reasoner, we are able to ask where those inferred triples come from (for me, the feature that makes Pellet something you cannot live without):

[Image: Pellet explanation]

And indeed, by assigning the property 'c:name' to a resource, this resource automatically becomes (by the rdfs:domain semantics) an instance of every class that is the object of a "c:name rdfs:domain ?object" statement.

IF
P rdfs:domain D
AND
x P y
THEN
x rdf:type D.
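
To make the effect of this rule concrete, here is a minimal sketch in plain Python that applies it to a handful of triples. It is not how Pellet works internally. The resource 'ex:resource1', its name literal, and the particular list of domain classes for 'c:name' are made up for illustration; the real schema declares its own set.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"


def infer_domain_types(triples):
    """Apply the rdfs:domain rule once; return only the new rdf:type triples."""
    # Collect every declared domain per property: P -> {D1, D2, ...}
    domains = {}
    for s, p, o in triples:
        if p == RDFS_DOMAIN:
            domains.setdefault(s, set()).add(o)

    # For each statement 'x P y', x becomes an instance of every domain of P.
    inferred = set()
    for s, p, o in triples:
        for d in domains.get(p, ()):
            candidate = (s, RDF_TYPE, d)
            if candidate not in triples:
                inferred.add(candidate)
    return inferred


# Hypothetical schema fragment: several rdfs:domain statements on one property.
schema = {
    ("c:name", RDFS_DOMAIN, "cale:Company"),
    ("c:name", RDFS_DOMAIN, "cale:Facility"),
    ("c:name", RDFS_DOMAIN, "cale:Product"),
}

# Hypothetical instance data: a Company that simply has a name.
data = {
    ("ex:resource1", RDF_TYPE, "cale:Company"),
    ("ex:resource1", "c:name", '"Some Company"'),
}

new_types = infer_domain_types(schema | data)
```

With these inputs, new_types holds the rdf:type triples for 'cale:Facility' and 'cale:Product': the Company picks up both extra types only because it has a name.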
I have the impression that a classic SW modeling error has been made here: misusing 'rdfs:domain' to attach a property to a class, as you normally do in object-oriented modeling.
In the SW, however, a property can be used anywhere and is independent of any class; rdfs:domain is used solely for inferencing. In the Open Calais case, I cannot imagine that these are the inferences you want.

Something to report at the Pedantic Web Group?



Comments

Rafi (unauthenticated)
May 11, 2010

A few months ago we released an OWL ontology fixing the problem described in your blog. The corrected schema can be downloaded from
http://www.opencalais.com/files/owl.opencalais-4.3a.xml. Note that dereferencing the URIs of OpenCalais classes and predicates still retrieves the erroneous old schema, so make sure that you use the ontology from the URL above.

Hope this helps

Rafi Shachar
Open Calais Team

Living in the XML and RDF world
May 19, 2010

Rafi,

Thanks for the update.

Paul