Using RDFS or OWL as a schema language for validating RDF

[This post is rescued from an ancient SWAD-E FAQ list because I want to update it.]

Many software applications need to test that some input data is complete and correct enough to be processed, e.g. to check the data once so that access functions will not break later on due to missing items. This is commonly done by using a schema language to define what “complete and correct” means in this syntactic sense, and a schema processor to validate data against the schema.

Developers new to RDF can easily mistake RDFS for a schema language (perhaps because the ‘S’ stands for schema!). They then get referred to OWL as the solution, and are surprised by the results of trying to use OWL this way.

This is a big topic which we’ll just touch on here. In this FAQ entry I just want to illustrate a few of the pitfalls and hint at why this is harder than it looks, in the hope that it might reduce the “unpleasant surprise” for developers new to OWL.

To spoil the punch line, there isn’t yet a really good schema solution for semantic web applications but one is needed. OWL does allow you to express some (though not all) of the constraints you might like. However, to use it you may need an OWL processor which makes additional assumptions relevant to your application – a generic processor will not do the sort of validation a schema-language user is expecting.

The problems arise from fundamental features of the semantic web:
- open world assumption
- no unique name assumption
- multiple typing
- support for inference

Let’s look at a few examples of schema-like constraints you might want to express:

1. Required property

Suppose you want to express a constraint something like “every document must have an author”. You might say something like:

eg:Document rdf:type owl:Class;
    rdfs:subClassOf [ a owl:Restriction;
        owl:onProperty     dc:author;
        owl:minCardinality "1"^^xsd:nonNegativeInteger ].

 eg:myDoc rdf:type eg:Document .

You might think that if you asked a general OWL processor to validate this it would say “invalid” because eg:myDoc doesn’t have an author. Not so. The OWL restriction says something that is supposed to be “true of the world” rather than true of any given data document. So on seeing an instance of Document, an OWL processor will conclude that it must have an author (because every Document does), just not one we know about yet. In fact, if you now ask an OWL-aware processor for the author of myDoc you might, for example, get back a bNode: an example of the inferential, as opposed to constraint-checking, nature of OWL processing. This also fits in with the open world assumption, since there may be another triple giving an author for myDoc “out there” somewhere.

Of course, the fact that general OWL processors behave this way doesn’t prevent one from creating a specialist validator which treats a document as a complete, closed description and flags any such missing properties. It is just that a generic OWL reasoner probably won’t do this by default.
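To make the idea of such a specialist validator concrete, here is a minimal sketch of a closed-world “required property” check. It is only an illustration, not how any particular OWL toolkit works: triples are modelled as plain Python tuples, the function name missing_required is made up for this example, and eg:otherDoc and eg:Dave are invented test data.

```python
# Triples are plain (subject, predicate, object) tuples. A real application
# would use an RDF toolkit; every name below is illustrative only.
RDF_TYPE = "rdf:type"

def missing_required(triples, cls, prop):
    """Return the instances of cls that have no value for prop,
    treating the given triples as a complete (closed) description."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    have_prop = {s for s, p, o in triples if p == prop}
    return sorted(instances - have_prop)

data = [
    ("eg:myDoc", RDF_TYPE, "eg:Document"),
    ("eg:otherDoc", RDF_TYPE, "eg:Document"),   # invented for the example
    ("eg:otherDoc", "dc:author", "eg:Dave"),
]

# Under the closed-world reading eg:myDoc is flagged; an OWL reasoner would
# instead conclude that eg:myDoc has some author we have not been told about.
print(missing_required(data, "eg:Document", "dc:author"))  # ['eg:myDoc']
```

The check only looks at asserted types and asserted property values, which is exactly the additional “this document is complete” assumption that a generic reasoner does not make.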

2. Limiting the number of properties

A related example is expressing the constraint that “every document can have at most one copyright holder”.

  eg:Document rdf:type owl:Class;
              rdfs:subClassOf [ a owl:Restriction;
               owl:onProperty     eg:copyrightHolder;
               owl:maxCardinality "1"^^xsd:nonNegativeInteger ].

  eg:myDoc rdf:type eg:Document ;
           eg:copyrightHolder eg:institute1 ;
           eg:copyrightHolder eg:institute2 .

Again, if you ask a general OWL processor to validate this set of statements you might expect it to complain that there are two values for eg:copyrightHolder. Not so. In this case, the problem is the absence of a unique name assumption. On the web two different URIs could refer to the same resource and there is no defined way to tell whether they do. Unless there is an explicit declaration that eg:institute1 and eg:institute2 are owl:differentFrom each other, there is no violation.

Indeed, just like in the first example, what an OWL processor does is the reverse. Instead of noticing a violation it infers additional facts which must be true if the data is consistent. In this case it would infer:

       eg:institute1 owl:sameAs  eg:institute2 .

Again, a specialist OWL processor could be told to make an additional unique name assumption to handle such cases but that is not a good thing to do in general. In fact, using such cardinality constraints (e.g. in the guise of owl:InverseFunctionalProperty or owl:FunctionalProperty) to detect aliases is a powerful and much used feature of OWL.
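The two readings of “at most one value” can be contrasted in a small sketch. This is a simplification under stated assumptions: triples are plain tuples, the constraint is applied to every subject rather than only to instances of eg:Document, and the function name check_max_one and its unique_names switch are invented for this illustration.

```python
from collections import defaultdict
from itertools import combinations

def check_max_one(triples, prop, different=frozenset(), unique_names=False):
    """Apply 'at most one value for prop' to every subject.

    Returns (inferred, violations): pairs of values inferred to be
    owl:sameAs, and pairs known to be distinct that break the constraint.
    `different` holds frozenset pairs declared owl:differentFrom.
    """
    values = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            values[s].add(o)
    inferred, violations = [], []
    for s in sorted(values):
        for a, b in combinations(sorted(values[s]), 2):
            distinct = unique_names or frozenset((a, b)) in different
            (violations if distinct else inferred).append((a, b))
    return inferred, violations

data = [
    ("eg:myDoc", "eg:copyrightHolder", "eg:institute1"),
    ("eg:myDoc", "eg:copyrightHolder", "eg:institute2"),
]

# Default (no unique name assumption): the two names are taken to be aliases.
print(check_max_one(data, "eg:copyrightHolder"))
# Validator-style reading: different names mean different things.
print(check_max_one(data, "eg:copyrightHolder", unique_names=True))
```

Passing different={frozenset(("eg:institute1", "eg:institute2"))} gives the same violation as the unique-name switch, which corresponds to the explicit owl:differentFrom declaration mentioned above.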

Life is a little easier if one is dealing with DatatypeProperties because you can tell when two literals are distinct (well even this is hard when you are looking at different xsd number classes but at least strings are easy!).

3. Type constraints

The third common schema requirement is to limit the types of values a given property can take. For example:

  eg:Document rdf:type owl:Class;
              owl:equivalentClass [ a owl:Restriction;
               owl:onProperty     eg:author ;
               owl:allValuesFrom  eg:Person ].

  eg:myDoc rdf:type eg:Document ;
           eg:author eg:Daffy .
  eg:Daffy rdf:type eg:Duck.

  eg:myDoc2 eg:author eg:Dave .
  eg:Dave rdf:type eg:Person .

Does the myDoc example cause a constraint violation? No. In RDF an instance can be a member of many classes. Unless we are explicitly told that the classes eg:Duck and eg:Person are disjoint, all that happens with the myDoc example is that we infer that eg:Daffy must be a Person as well. Again, a specialist processor could be developed to flag a warning in cases where an object is inferred to have a type which is not a known supertype of its declared types. Again, this would be making additional assumptions not warranted in the general case but useful for input validation purposes.

Having got the hang of the idea that OWL is more about inference than constraint checking, what about myDoc2? Should the OWL processor infer that myDoc2 is a Document? After all, we defined Document this time using a complete, rather than partial, definition, so that anything for which all authors are Persons should be a Document, and the author of myDoc2 is a Person. The answer, again, is “no”. Just because all the authors we see happen to be people doesn’t mean there aren’t more authors for myDoc2 that we don’t know about.
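A rough sketch of the allValuesFrom behaviour, again using plain tuples rather than a real reasoner. The function name apply_all_values_from and the disjoint parameter are assumptions made for this example. It handles only the direction “known Documents have Person authors”, and it deliberately never classifies anything as a Document, for the open-world reason just given.

```python
RDF_TYPE = "rdf:type"

def apply_all_values_from(triples, cls, prop, value_cls, disjoint=frozenset()):
    """For each known instance of cls, infer that every value of prop is a
    value_cls. Returns (inferred_triples, clashes), where a clash is a value
    with a declared type that is disjoint with value_cls."""
    instances = {s for s, p, o in triples if p == RDF_TYPE and o == cls}
    declared = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            declared.setdefault(s, set()).add(o)
    inferred, clashes = [], []
    for s, p, o in triples:
        if p == prop and s in instances:
            types = declared.get(o, set())
            if value_cls not in types:
                inferred.append((o, RDF_TYPE, value_cls))
            clashes += [(o, t) for t in sorted(types)
                        if frozenset((t, value_cls)) in disjoint]
    return inferred, clashes

data = [
    ("eg:myDoc", RDF_TYPE, "eg:Document"),
    ("eg:myDoc", "eg:author", "eg:Daffy"),
    ("eg:Daffy", RDF_TYPE, "eg:Duck"),
    ("eg:myDoc2", "eg:author", "eg:Dave"),
    ("eg:Dave", RDF_TYPE, "eg:Person"),
]

# Without a disjointness declaration, eg:Daffy simply becomes a Person too.
print(apply_all_values_from(data, "eg:Document", "eg:author", "eg:Person"))

# Only when Duck and Person are declared disjoint is there a contradiction.
ducks_are_not_people = {frozenset(("eg:Duck", "eg:Person"))}
print(apply_all_values_from(data, "eg:Document", "eg:author", "eg:Person",
                            disjoint=ducks_are_not_people))
```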

4. Value ranges

Another common schema requirement is to limit the range of a value. For example to say that an integer representing a day-of-the-month should be between 1 and 31.

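Range restrictions like this were not part of the original OWL at all (OWL 2 later added datatype restrictions using XSD facets such as xsd:minInclusive and xsd:maxInclusive).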

You can express them within XML Schema Datatypes. You could declare a user defined XSD datatype which is an xsd:integer restricted to the range 1 to 31.

There is a problem that XML Schema doesn’t define a standard way of determining the URI for a user defined datatype and the RDF datatyping mechanism requires all datatypes to have a URI. This will hopefully get “clarified” and in any case there is a de facto convention which is straightforward, used by DAML and supported by toolkits, so in the meantime we can be non-standard but get work done.

It is also slightly less useful than it seems, since the RDF datatyping machinery requires that each literal value have an explicit datatype URI. You can’t just give a lexical value and use range constraints to apply the type.

These caveats aside, the xsd user defined datatype machinery is useful and this is the one place where RDFS on its own, without OWL, can do some validation. An RDFS processor should detect if the lexical form of a typed literal does not match the declared datatype.
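As an illustration of the kind of check a datatype-aware processor performs, here is a small sketch for the day-of-the-month example. It is not taken from any toolkit: the datatype name eg:dayOfMonth, the property eg:day and the subjects are all hypothetical, and typed literals are modelled as (lexical form, datatype URI) pairs.

```python
import re

# Lexical space of xsd:integer: an optional sign followed by decimal digits.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")

def is_valid_day(lexical):
    """True if lexical is a legal xsd:integer form with a value in 1..31."""
    if not INTEGER_LEXICAL.fullmatch(lexical):
        return False
    return 1 <= int(lexical) <= 31

def invalid_literals(triples, datatype="eg:dayOfMonth"):
    """Return triples whose object is typed with `datatype` but is ill-formed.
    Objects that are not (lexical, datatype) pairs are ignored."""
    return [(s, p, o) for s, p, o in triples
            if isinstance(o, tuple) and o[1] == datatype
            and not is_valid_day(o[0])]

data = [
    ("eg:meeting", "eg:day", ("15", "eg:dayOfMonth")),
    ("eg:party", "eg:day", ("32", "eg:dayOfMonth")),     # out of range
    ("eg:launch", "eg:day", ("first", "eg:dayOfMonth")), # not an integer
    ("eg:note", "eg:day", ("32", "xsd:string")),         # different datatype
]

for s, p, (lexical, dt) in invalid_literals(data):
    print(s, "has ill-formed", dt, "literal:", repr(lexical))
```

Note that eg:note is not flagged: only literals that explicitly carry the restricted datatype are checked, which is the limitation described above.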

5. Complex constraints

The final forms of constraints that come up are ones which involve constraints between values. For example, that a pair of properties should form a unique value pair, or that the value of one datatype property must be less than another property of the same resource, or of a related resource.

No such cross-property constraints can be expressed at all in OWL.
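In practice checks like these tend to end up in application code. As a hedged sketch of one of them (“one property must be less than another property of the same resource”), using plain tuples for triples and made-up property names eg:firstPage and eg:lastPage:

```python
from collections import defaultdict

def less_than_violations(triples, low_prop, high_prop):
    """Return subjects where some value of low_prop is not strictly less
    than some value of high_prop on the same subject. Subjects missing
    either property are not reported (that is a separate check)."""
    values = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        values[s][p].append(o)
    bad = []
    for s in sorted(values):
        lows = values[s].get(low_prop, [])
        highs = values[s].get(high_prop, [])
        if any(lo >= hi for lo in lows for hi in highs):
            bad.append(s)
    return bad

data = [
    ("eg:doc1", "eg:firstPage", 1),
    ("eg:doc1", "eg:lastPage", 10),
    ("eg:doc2", "eg:firstPage", 20),
    ("eg:doc2", "eg:lastPage", 12),   # ends before it starts
    ("eg:doc3", "eg:firstPage", 5),   # no lastPage, so not compared
]

print(less_than_violations(data, "eg:firstPage", "eg:lastPage"))  # ['eg:doc2']
```

Like the other validator-style checks, this treats the data it is given as complete and compares only the values it can see.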
