Filtering untyped literals
After a job that integrated triples from different sources, I ended up in a situation where multiple subjects had the same value twice for the same property: one value was a plain literal, while the other had datatype xsd:string.
<x> pcce:city "Brussels"
<x> pcce:city "Brussels"^^xsd:string
My aim was to get rid of one of the two.
My initial idea was to make use of the SPARQL "datatype" operator, which returns the datatype IRI of the value.
My first SPARQL UPDATE query looked as follows:
DELETE {?a pcce:city ?city.}
WHERE {
?a a pcce:Building.
?a pcce:city ?city.
FILTER (datatype(?city) != xsd:string)
}
To my great surprise, this didn't work at all.
Scott Henninger of TopQuadrant kindly referred me to the SPARQL spec, which clearly states that if the argument of the datatype operator is a simple literal, xsd:string is returned.
The built-in datatype operator in SPARQL treats untyped literals as xsd:string, so xsd:string is returned in both cases, which makes it impossible to detect untyped literals this way.
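To make the failure concrete, here is a minimal Python sketch of the behaviour described above. It is not a SPARQL engine: the `Literal` class and the `sparql_datatype` function are hypothetical stand-ins introduced only for illustration, and language-tagged literals are left out.

```python
from typing import NamedTuple, Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal (language tags left out)."""
    lexical: str
    datatype: Optional[str] = None  # None models a plain (untyped) literal


def sparql_datatype(lit: Literal) -> str:
    """Mimic datatype() as described above: plain literals report xsd:string."""
    return lit.datatype if lit.datatype is not None else XSD_STRING


plain = Literal("Brussels")
typed = Literal("Brussels", XSD_STRING)

# The FILTER from the first query: keep values whose datatype is not xsd:string.
matches = [lit for lit in (plain, typed) if sparql_datatype(lit) != XSD_STRING]
print(matches)  # prints []
```

Both values report xsd:string, so the filter matches nothing and the DELETE removes no triples.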
Work-around
There is a SPARQL operator, "sameTerm", that allows you to test whether two RDF terms are the same. In our case, that means testing two RDF literal values for equality.
How is Literal Equality defined?
Two literals are equal if and only if all of the following hold:
- The strings of the two lexical forms compare equal, character by character.
- Either both or neither have language tags.
- The language tags, if any, compare equal.
- Either both or neither have datatype URIs.
- The two datatype URIs, if any, compare equal, character by character.
According to this definition "Brussels" and "Brussels"^^xsd:string are different terms.
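The five conditions above can be sketched as a small Python function. This is only an illustration of the definition as quoted: the `Literal` class and `literals_equal` are hypothetical names, and language tags are compared as plain strings, which assumes they have already been normalised.

```python
from typing import NamedTuple, Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal."""
    lexical: str
    lang: Optional[str] = None      # None means no language tag
    datatype: Optional[str] = None  # None means no datatype URI


def literals_equal(a: Literal, b: Literal) -> bool:
    """Apply the five conditions of the definition, in order."""
    return (
        a.lexical == b.lexical                            # same lexical form
        and (a.lang is None) == (b.lang is None)          # both or neither tagged
        and a.lang == b.lang                              # tags compare equal
        and (a.datatype is None) == (b.datatype is None)  # both or neither typed
        and a.datatype == b.datatype                      # datatype URIs equal
    )


plain = Literal("Brussels")
typed = Literal("Brussels", datatype=XSD_STRING)

print(literals_equal(plain, typed))  # prints False
print(literals_equal(typed, typed))  # prints True
```

The plain and typed values fail the fourth condition, because only one of them has a datatype URI.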
What to compare?
The second piece of the solution is that we can cast primitive datatypes to other simple datatypes according to the rules of XPath.
If we take "Brussels" and cast it using the xsd:string constructor function to "Brussels"^^xsd:string and then compare both terms with sameTerm we will get FALSE.
If we take "Brussels"^^xsd:string and cast it using xsd:string to "Brussels"^^xsd:string then the comparison will give TRUE.
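The two cases above can be traced with a short Python sketch. The names `cast_to_xsd_string`, `same_term` and `is_xsd_string_typed` are hypothetical helpers that only model the xsd:string cast and the sameTerm comparison for string-like literals. They are not the API of any SPARQL library.

```python
from typing import NamedTuple, Optional

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal (language tags left out)."""
    lexical: str
    datatype: Optional[str] = None  # None models a plain (untyped) literal


def cast_to_xsd_string(lit: Literal) -> Literal:
    """Model of the xsd:string cast: same lexical form, datatype xsd:string."""
    return Literal(lit.lexical, XSD_STRING)


def same_term(a: Literal, b: Literal) -> bool:
    """Model of sameTerm: both lexical form and datatype must be identical."""
    return a == b  # NamedTuple equality compares field by field


def is_xsd_string_typed(lit: Literal) -> bool:
    """True only when casting to xsd:string leaves the term unchanged."""
    return same_term(lit, cast_to_xsd_string(lit))


plain = Literal("Brussels")
typed = Literal("Brussels", XSD_STRING)

print(is_xsd_string_typed(plain))  # prints False
print(is_xsd_string_typed(typed))  # prints True
```

Unlike the datatype comparison, this check tells the two values apart, so it can be used as a filter condition.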
Solution
This leads to the following solution, which deletes the xsd:string-typed value and keeps the plain literal:
DELETE {?a pcce:city ?city.}
WHERE {
?a a pcce:Building.
?a pcce:city ?city.
FILTER (sameTerm(?city,xsd:string(?city)))
}
Users of products from the TopBraid Suite family have access to a shortcut function for testing untyped literals: "spl:isUntypedLiteral".
Our query then becomes:
DELETE {?a pcce:city ?city.}
WHERE {
?a a pcce:Building.
?a pcce:city ?city.
FILTER (!spl:isUntypedLiteral(?city))
}