Semantic MediaWiki/Background: Ontologies and the Semantic Web - Meta
The problem of creating machine-accessible content on the Web is not new, and much effort has been invested in solving it. Research in this field received a strong boost from the vision of the Semantic Web, envisaged as an improved World Wide Web that allows users to search for actual content instead of text. Based on machine-readable descriptions of web content, "intelligent" software was supposed to gather and organize information, relate data from distributed sources, and answer questions. The basic ingredients for these features are ontologies: formal specifications of various kinds that describe important features of a domain of interest.
Today, the intended revolution in Web usage has turned into a gradual evolution, and it is clear that fully implementing the original objectives will still take years. Nonetheless, the large amount of research and development in this field has established versatile technologies with many applications. The extension of MediaWiki should take advantage of these achievements: committing to technologies that have already become standard would allow us to reuse existing software and to stay in the mainstream of future developments.
Contents
* 1 Wikipedia vs. the Semantic Web
* 2 The Web Ontology Language and others
o 2.1 Resource Description Framework (RDF)
o 2.2 RDF Schema (RDFS)
o 2.3 Too much and too little: a critical look on the expressive power of RDFS
o 2.4 The Web Ontology Language (OWL)
* 3 Software applications for the Semantic Web
Wikipedia vs. the Semantic Web
Obviously, Wikipedia is not the Web, and it must be understood that the "semantification" of the two is not the same thing. A closer look shows that this is rather fortunate for Wikipedia: some unsolved problems of the Semantic Web simply do not occur in our single-site context. But let us start with some points where a "semantic Wikipedia" is quite similar to a Semantic Web. Among others, the Semantic Web confronts us with the following issues:
* web pages that were created for humans must be annotated for machine-reading,
* to be readable for the public, annotations must be provided in standardized data formats,
* to be understandable for the public, annotations must have a formalized meaning,
* translating informal information into formal annotations can be difficult and we must develop methods to guide the users,
* programs must be able to integrate information from many sites,
* many different people will create their annotations in a distributed way; we must expect contradictions and errors in the gathered data.
In this respect, Wikipedia is not that different from the Web. In particular, the distributed, non-central way of providing content is an important similarity, which suggests Semantic Web methods for our setting. For most of the above issues, we already have some concrete answers available today: Annotations can be built on non-proprietary data formats that have a standardized syntax and semantics (meaning). There is a considerable body of methodology and experience in designing ontologies, and an understanding has developed of which types of annotations are more difficult for users. A multitude of programs exist that can work on the mentioned standard data formats. Many of them are still under heavy development, both in companies and in universities. Most of the software is free (partly free-as-in-speech), but there are also industrial-strength applications that are developed commercially. Ontology languages and software are generally designed to work on distributed, potentially incomplete or erroneous specifications.
On the other hand, the Semantic Web faces far more difficult issues. Even if everybody used a common standard language for annotations (there is more than one), different names might still be used for the same concepts. There is no "world community" to negotiate the usage of annotations, and ontologies can become incompatible. Furthermore, there is no easy way of creating annotations: instead of a convenient MediaWiki interface, people would have to write their ontologies directly in a technical syntax. Finally, the motivation for creating annotations is currently rather weak, since most people do not want to provide their data to the world in a machine-readable way but rather want humans to visit their sites and click on advertisements. These might be some of the reasons why we are much closer today to a "semantic Wikipedia" than to a semantic WWW in general.
The Web Ontology Language and others
As mentioned above, there are various languages for writing annotation data in a way that is understood globally. Here, we want to discuss RDF/RDFS and OWL, both of which have a machine-accessible XML syntax. Both are W3C recommendations like HTML and XML, but OWL is the more recent development and is arguably more evolved. However, OWL is downward compatible: OWL ontologies can also be processed with tools that were conceived for RDF or even for XML. The converse is generally not true.
Resource Description Framework (RDF)
RDF is a very simple format for describing relations between all kinds of resources (though the various syntactic formats are confusing for most humans). What an RDF-specification describes is basically just a directed graph where both the nodes (i.e. resources) and the edges (i.e. relationships, properties) have labels. That is all, but one can express rather complex relationships. In the context of Wikipedia, this could already implement typed links (as described in the section on related work): articles are resources that can easily be described by their URLs, and typed links are the labeled edges between them. The resulting structure could then be queried to obtain information. Such queries are just questions about the graph, e.g. "Find all nodes that have a link of type birthplace to France".
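The labeled-graph view can be sketched in a few lines of Python. This is only an illustration, not an RDF library: each statement is a plain (subject, property, object) tuple, and the example.org URLs, the article names and the helper `subjects_with` are assumptions made up for this sketch.

```python
# Each statement is one labeled edge: (subject, property, object).
# All URLs and article names below are hypothetical placeholders.
WIKI = "http://example.org/wiki/"

triples = {
    (WIKI + "Author_A", "birthplace", WIKI + "France"),
    (WIKI + "Author_B", "birthplace", WIKI + "Poland"),
    (WIKI + "Composer_C", "birthplace", WIKI + "France"),
    (WIKI + "Paris", "capital of", WIKI + "France"),
}


def subjects_with(graph, prop, obj):
    """Return all nodes that have an edge labeled `prop` pointing to `obj`."""
    return sorted(s for (s, p, o) in graph if p == prop and o == obj)


# "Find all nodes that have a link of type birthplace to France"
born_in_france = subjects_with(triples, "birthplace", WIKI + "France")
print(born_in_france)
```

The query only matches labels and node names; the edge "capital of" also points to France but is ignored because its label differs.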
Moreover, RDF relationships can also be declared between a resource and a so-called literal. In effect, literals are just simple data values with an associated data type (the available types are defined in the standard and are closely related to the data types in XML Schema). Thus one can annotate resources with data properties of certain values. The result can still be depicted as a directed graph, where we now have resources and literals as two distinguished types of nodes. RDF has some more features, like the description of resource collections (sets, lists, etc.), but we will not go into more detail.
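To illustrate the two kinds of nodes, the following sketch models a literal as a value paired with a data type name. The `Literal` class, the helper functions, the example.org URLs and all values are assumptions invented for this example; only the type names `xsd:integer` and `xsd:date` are taken from XML Schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """A simple data value with an associated data type."""
    value: str
    datatype: str  # e.g. "xsd:integer" or "xsd:date"


# Hypothetical articles; the values are invented for illustration only.
CITY = "http://example.org/wiki/Example_City"
COUNTRY = "http://example.org/wiki/Example_Country"

triples = [
    (CITY, "located in", COUNTRY),                         # resource -> resource
    (CITY, "population", Literal("1000", "xsd:integer")),  # resource -> literal
    (CITY, "founded", Literal("1900-01-01", "xsd:date")),  # resource -> literal
]


def data_properties(graph, subject):
    """Properties of `subject` whose value is a literal."""
    return {p: o for (s, p, o) in graph if s == subject and isinstance(o, Literal)}


def linked_resources(graph, subject):
    """Properties of `subject` whose value is another resource."""
    return {p: o for (s, p, o) in graph if s == subject and not isinstance(o, Literal)}


city_data = data_properties(triples, CITY)
city_links = linked_resources(triples, CITY)
print(city_data)
print(city_links)
```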
RDF Schema (RDFS)
However, RDF is not sufficient for more elaborate purposes, since it does not allow us to describe anything beyond simple directed graphs. In particular, there is no internal mechanism to implement classes (e.g. for categorization of articles). Of course, one can define relationships with a label "hasClass" between an article and its class, but a typical RDF tool will not recognize this as a special relationship. In fact, the class (category) will just be treated like any other resource (article). This already causes problems with the subclass relationship: if A is a subclass of B, and B is a subclass of C, then A is a subclass of C. But this will not be derived by RDF tools, since we cannot express that the relation "subclass of" is transitive. Indeed, "subclass of" is just a label, a string that has no internal meaning whatsoever.
To overcome this problem, RDF was extended by a simple ontology language called RDF Schema (RDFS). In this language, special relationships like "subclassOf" are predefined and are treated in a standardized way. This enables programs to handle various "structural" descriptions in an adequate way, instead of treating them like plain meaningless labels. In particular, it makes classification possible: RDFS has a predefined "Class" object and a property "type" which states that a resource belongs to a class (is of this type). Any resource of type Class is treated as a class and can thus be used as the type of other resources. In addition, classes can be organized in a hierarchy by relating them with the "subclassOf" property.
The meaning of these expressions is built into the language. For example, let A be a subclass of the class B, and assume that the specification contains a resource r of type A. Now if the user enters a query for all objects of type B, then r will also be returned: the relationship must be inferred by the program that implements the RDFS specification. These features are very helpful, since they simplify our annotations considerably: without the built-in meaning one would have to state explicitly that r is of type B (and possibly of many other classes). On the other hand, the software must be more "intelligent" than for working on simple RDF.
Apart from the extensions mentioned (and a few more of a similar kind), RDFS is very closely related to RDF. The syntactical XML format is valid RDF, with the only difference that capable programs can make use of the additional knowledge of the built-in meaning. Yet, one could still use RDF tools to work with the data.
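The inference described here can be sketched as follows. This is a minimal illustration under simplifying assumptions, not an implementation of the RDFS semantics: the names `types`, `subclass_of`, `superclasses` and `instances_of` are invented for the example, and only the "type" and "subclassOf" relationships are modeled.

```python
# Stated facts: r is of type A; A is a subclass of B; B is a subclass of C.
types = [("r", "A")]
subclass_of = {"A": ["B"], "B": ["C"]}


def superclasses(cls, hierarchy):
    """Return `cls` together with every class reachable via subclassOf."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(hierarchy.get(current, ()))
    return seen


def instances_of(target, type_facts, hierarchy):
    """All resources whose stated type is `target` or a subclass of it."""
    return sorted(r for (r, cls) in type_facts
                  if target in superclasses(cls, hierarchy))


# A tool that treats "type" as a plain label finds nothing for B ...
plain_lookup = [r for (r, cls) in types if cls == "B"]
# ... while a tool that knows the built-in meaning also returns r.
inferred_b = instances_of("B", types, subclass_of)
inferred_c = instances_of("C", types, subclass_of)
print(plain_lookup, inferred_b, inferred_c)
```

The fact "r is of type B" is never stored; it is derived at query time, which is why the annotations stay short while the software has to do more work.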
Too much and too little: a critical look on the expressive power of RDFS
In spite of its versatility, the combination of RDF and RDFS has some major disadvantages. There are two major sources of trouble: (1) RDF(S) treats all properties and classes as resources and (2) statements that can be made about some resource are usually legal for any resource. For example, one can easily state that a class has itself as a type (i.e. it is an instance of itself). This creates some problems. When we speak of classes (or categories), we usually imagine them as "collections of things". If something is of a certain type, then it just belongs to that collection. For instance, the class "Person" symbolizes the collection of all persons. Unfortunately, this interpretation of classes is no longer applicable when we allow classes to be their own type: no set of common set theory can contain itself.
In effect, the correct formal interpretation of RDFS is much more complicated, and is not easily communicated to the average user. This of course is quite problematic in the context of Wikipedia, since we cannot provide prior training for editors working on annotations. But the complicated semantics of RDFS gives us even more expressive power, sometimes more than we would reasonably like to have. By definition, even the predefined resources like "Class" and "subclassOf" are just resources. Thus we can legally state that "subclassOf is of type Class", a statement that is rather nonsensical. This further adds to the confusion that users may encounter when working with RDFS. While users can still work on the idea that classes are collections of resources, standard-compliant software has to obey the official semantics to process arbitrary RDF(S) input. Thus the behavior of such tools might not be what the user expected.
On the other hand, RDFS is still a very weak language for making more elaborate descriptions. For example, like RDF, it has no means of stating that a property is transitive. So if we state that Frankfurt lies in Germany and that Germany lies in Europe, we still cannot derive the information that Frankfurt is in Europe. Yet, the average user would take it for granted that this knowledge is given in the specification, and would like to obtain Frankfurt in a search for European cities as well.
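The missing inference is easy to state once a property is known to be transitive. The sketch below computes the transitive closure of the "lies in" statements from the example; the function name `transitive_closure` and the variable names are assumptions of this illustration, and it shows what a tool could derive if transitivity were declared, which RDFS itself does not allow.

```python
def transitive_closure(pairs):
    """Repeatedly add (a, d) whenever (a, b) and (b, d) are present."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# The two statements from the text, read as "X lies in Y".
stated = {("Frankfurt", "Germany"), ("Germany", "Europe")}
derived = transitive_closure(stated)

print(("Frankfurt", "Europe") in stated)   # the fact was never written down
print(("Frankfurt", "Europe") in derived)  # but follows from transitivity
```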
Another limitation is that RDFS cannot construct complex class expressions: if the user wants to have all resources that belong to the class "City" and are located in Germany, then RDFS cannot be used as a query language. Likewise, we cannot describe that the class "Human" consists exactly of the classes "Woman" and "Man" and many other more elaborate statements that we might want to make (this concerns the possibility of extending our annotation framework later on; for the moment, we have no need for such complicated expressions).
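Both kinds of class expression mentioned here correspond to simple set operations, which the following sketch spells out in Python. It only illustrates the intended meaning and is not tied to any ontology language; apart from Frankfurt, all resource names (`City_X`, `River_Y`, `w1`, `m1`) and the dictionaries are hypothetical.

```python
# Stated types and locations; City_X and River_Y are made-up resources.
types = {
    "Frankfurt": {"City"},
    "City_X": {"City"},
    "River_Y": {"River"},
}
located_in = {
    "Frankfurt": "Germany",
    "City_X": "France",
    "River_Y": "Germany",
}

# "City" AND "located in Germany": an intersection of two conditions.
german_cities = sorted(
    r for r in types
    if "City" in types[r] and located_in.get(r) == "Germany"
)

# "Human consists exactly of Woman and Man": a union that must be exhaustive.
women = {"w1"}
men = {"m1"}
humans = {"w1", "m1"}
human_is_exact_union = humans == (women | men)

print(german_cities, human_is_exact_union)
```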
Finally, RDF(S) has a feature called reification that allows us to use even statements as resources. Thus we can express "the fact that Frankfurt lies in Germany is a type of geographical relation". Though this sounds complicated, it actually has quite a few practical applications. It allows us to annotate our annotations, for example with a source for a statement or with a time for which it is true. However, reification turns out to be extremely powerful; so powerful that in combination with simple (very useful) extensions like those mentioned above, it rules out the possibility of implementing a program that can fully evaluate these specifications (the language becomes undecidable). That is the reason why one usually chooses to sacrifice reification for some other practical features and a decidable (implementable) formal semantics.
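The idea of annotating an annotation can be sketched by letting a statement object appear as the subject of further triples. This is a simplified model and does not reproduce the actual reification vocabulary of the standard; the `Statement` class, the helper `about`, the property labels and the source URL are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """A statement treated as a resource in its own right."""
    subject: str
    prop: str
    obj: str


fact = Statement("Frankfurt", "lies in", "Germany")

# Because the statement is itself a resource, it can be the subject of
# further statements. The source URL is a hypothetical placeholder.
annotations = [
    (fact, "type", "geographical relation"),
    (fact, "source", "http://example.org/some-atlas"),
]


def about(statement, graph):
    """Collect everything that is said about a given statement."""
    return {p: o for (s, p, o) in graph if s == statement}


fact_info = about(fact, annotations)
print(fact_info)
```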
The Web Ontology Language (OWL)
OWL has a much simpler semantics that disallows some freedoms of RDFS in exchange for more powerful descriptions in other areas. The added power also poses some problems for intuitive usability, so we should restrict ourselves to simple OWL annotations. To be added …
Software applications for the Semantic Web
Here we will introduce some tools, preferably non-commercial ones, that can be used to work on ontologies in standard file formats.