In this Activity, we will introduce Knowledge Graphs as an appropriate means to formalise data.
At the end of this Activity, you should have a basic understanding of
We will recall (or introduce) some basic notions of Propositional Logic, as those are useful to understand Simple Knowledge Graph Logic.
In this lecture, we will introduce the core notions of Data, Information and Knowledge, and show that knowledge is an essential ingredient for turning raw data into useful Information. In many application problems, this knowledge is tacit, or implicit, in the heads of the Data Scientists or the Database (IT) Department of a company.
This is very cost-intensive, and it only works as long as the data remains in its silo: as soon as we consider reuse or integration of different data sources, there would be an enormous gain if this knowledge were available in an explicit way.
The following video (of about 16 minutes) argues for explicit, formalised knowledge as an essential means for turning raw Data into valuable Information.
I found https://www.slideshare.net/mcjenkins/knwoedgebase-vs-database a rather good discussion of the relation between data and knowledge, and of the impact this has on the systems one has to develop.
The big question now is what good ways there are to integrate big collections of data with explicit knowledge. While relational databases are constructed to optimise data access, they are not ideal for data integration, or for making the meaning of the relations between data items explicit.
In this video, we will argue why Knowledge Graphs are natural representation formalisms for data, as they provide natural ways to make the intrinsic meaning of the data explicit.
If this is indeed true, we need to understand how to model data and knowledge in a Knowledge Graph as a formal system. In short, a formal system is a formal language equipped with a formal meaning (a notion of truth) that can be evaluated by a computer program. This should guarantee predictable inference; in other words, whoever works with a Knowledge Graph in this formalism should get precisely the same answers.
The following video explains this idea in more detail.
Before we work out a formal system for Knowledge Graphs, we should have a better understanding of what a formal system is, and how to build one. For this purpose, we have a(nother) look at Propositional Logic. Some of you have followed a course in Logic, others have not. For the former, this video is a reminder; for the others, it is new material.
Propositional Logic is a well-known formal system to represent and reason over "Propositions", statements that can either be true or false. Most people would agree that it is the canonical Logic: a very simple formal system that is easy to understand, while still very useful in practice.
Understanding how Propositional Logic is formally defined will be useful for understanding Simple Knowledge Graph Logic (and the more expressive variants we will study later).
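To make the idea of a formal language with a computable notion of truth concrete, here is a minimal sketch in Python (not part of the lecture material; the encoding of formulas as nested tuples is our own choice, purely for illustration) that evaluates a propositional formula under a given interpretation (truth assignment):

# Minimal sketch: evaluating a propositional formula under an interpretation.
# Formulas are nested tuples; the encoding and variable names are illustrative only.

def evaluate(formula, interpretation):
    """Return the truth value of a propositional formula.
    A formula is either a variable name (str) or a tuple like
    ("not", f), ("and", f, g), ("or", f, g), ("implies", f, g)."""
    if isinstance(formula, str):
        return interpretation[formula]
    op = formula[0]
    if op == "not":
        return not evaluate(formula[1], interpretation)
    if op == "and":
        return evaluate(formula[1], interpretation) and evaluate(formula[2], interpretation)
    if op == "or":
        return evaluate(formula[1], interpretation) or evaluate(formula[2], interpretation)
    if op == "implies":
        return (not evaluate(formula[1], interpretation)) or evaluate(formula[2], interpretation)
    raise ValueError("unknown operator: " + op)

# (p and q) -> p is true under every interpretation (a tautology);
# here we only check it for one particular truth assignment.
f = ("implies", ("and", "p", "q"), "p")
print(evaluate(f, {"p": True, "q": False}))   # True

The point is only that, once syntax and semantics are fixed, truth under an interpretation becomes a mechanical computation that every implementation must agree on.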
The following Video lecture (37 minutes) introduces the Syntax and Semantics of PL:
There is plenty of literature on Propositional Logic. Should things not be clear yet, we recommend that you read: HuthRyanChapter1.pdf and huth-ryan-section2.4.pdf.
Propositional Logic was introduced as one of the most intuitive and simple Formal Systems. Before we look at how to represent Knowledge Graphs formally, let us try to get a bit more intuition about Formal Systems.
The following Video lecture discusses some more examples of Formal Systems:
SKGL, our Simple Knowledge Graph Logic, is not an "officially recognised" logic, but a very simple, home-made system to formalise a special type of simple knowledge graphs. It is a subset of more expressive official languages, such as RDF.
The advantage of this Formal System is that it can be easily extended, first towards a Web language, and secondly towards integrating knowledge. But those will be the topics of the next learning activities.
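To get a first feeling for what such a formalisation looks like, here is a minimal sketch in Python (our own illustration, not part of the official material; all names are made up): a simple knowledge graph as a finite set of (subject, relation, object) statements, together with the most basic check we can already program, namely whether all statements of one graph are contained in another.

# Illustrative sketch: a simple knowledge graph as a set of
# (subject, relation, object) statements; all names are made up.
kg = {
    ("amsterdam", "capital_of", "netherlands"),
    ("netherlands", "member_of", "eu"),
}

query_graph = {("amsterdam", "capital_of", "netherlands")}

# Without blank nodes or background knowledge, a graph follows from kg
# exactly when all of its statements are already contained in kg.
print(query_graph <= kg)                                  # True
print({("paris", "capital_of", "netherlands")} <= kg)     # False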
The following Video lecture introduces the Syntax and Semantics of SKGL:
You should now be ready to do the first practical assignment
Practical assignment 1 (Manipulating formal systems, PL), available on 6 September
and the first test:
Test 1 (Quiz), available on Monday (11 September)
Just follow the links, and you will get more explanation.
In this Activity, we will introduce Linked Data, and the RDF data model as a technology to publish and consume data on the Web.
At the end of this Activity, you should have a basic understanding of
These lectures should give you sufficient background knowledge for the 2nd practical assignment. In the test (quiz) at the end of this module, you will be asked general questions about Data publishing on the Web, RDF and SPARQL.
In this short video, we recall the main things introduced in the previous module.
View the slides here.
The following video of about 15 minutes discusses the problems of publishing data on the current Web.
The Web of Documents has been devised for human consumption, and that means that even information produced from structured databases usually cannot be accessed programmatically by anybody but the data owner, let alone be combined with data sources from other locations and owners.
Since Web 2.0, users of the Web have also started to produce data directly (think of Facebook, Twitter and YouTube), and even more data is collected about users and their behaviour. While this data can in some cases be accessed via APIs, it can hardly be reused, as the data is not semantically annotated or linked.
Knowledge Graphs are a format that can help overcome this world of data silos on the Web.
View the slides here.
One of the best motivations for a Web of Data comes from the inventor of the Web himself, Tim Berners-Lee. Worth watching:
In this video, Berners-Lee argues strongly for the value of data, how beneficial it would be to publish data on the Web, and in particular how to make it really useful there.
The following lecture of about 20 minutes explains our claim that Knowledge Graphs are a very appropriate technology for making data sharing and reuse possible in the first place. First, we will show the benefits of linking datasets, which leads to the notion of Linked Data.
For this to move from a vision and dream to reality, there are four proposals for things that need to be done:
View the slides here.
In the video we also argue that Inference, i.e. the capability to derive new information from data according to the formal semantics of the statements in the knowledge graph, or to make implicit information explicit, is an important part of publishing data on the Web. We will later deal with more expressive knowledge, with the schema language RDFS (Module 3) and the richer ontology language OWL (Module 4).
It should be clear from what we introduced in Module 1 that Knowledge Graphs and their formalisms are very appropriate data models for representing data on the Web. But it should also be clear that there are not only technological challenges (standards, tools, etc.) but also societal challenges.
Tim Berners-Lee discusses what it means to publish data on the Web. One major criticism of the technology has been that linking depends on shared vocabularies, and that there will never be agreement on using a single conceptualisation of the world and a single vocabulary. Instead, think of a bag of crisps (or was it chips?). Another nice presentation from him:
The following video of about 23 minutes introduces the data model RDF, the formal system used to combine the Knowledge Graph language we introduced in Module 1 with the Web technology required to make it a true Web language.
View the slides here.
In order to port data to the Web, Tim Berners-Lee proposed four principles, also called the Linked Data principles:
1. Use URIs as names for things.
2. Use HTTP URIs, so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
4. Include links to other URIs, so that more things can be discovered.
RDF has been devised to cater for all these principles: the main elements of triples are references to resources (URIs); one is encouraged to use HTTP URIs (names with a web address attached to them, whose content is ideally machine-readable, i.e. again in RDF). The knowledge graph format obviously lends itself perfectly to linking with other URIs which, given the web nature of those identifiers, makes it easy to integrate datasets from different locations on the Web.
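As a small, hedged illustration of these principles in practice, here is a sketch using the Python library rdflib (one common choice, not prescribed by the course; the example namespace, resource names and the ex:livesIn property are made up): we create a few triples that use HTTP URIs, link out to a URI maintained elsewhere, and serialise the result.

from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import RDF, FOAF

EX = Namespace("http://example.org/people/")   # fictitious HTTP namespace

g = Graph()
g.bind("ex", EX)
g.bind("foaf", FOAF)

# Triples whose subjects and predicates are HTTP URIs, and which link
# out to a URI maintained elsewhere (here: a DBpedia resource).
g.add((EX.alice, RDF.type, FOAF.Person))
g.add((EX.alice, FOAF.name, Literal("Alice")))
g.add((EX.alice, EX.livesIn, URIRef("http://dbpedia.org/resource/Amsterdam")))

print(g.serialize(format="turtle"))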
Up to now, the representation formalism for RDF was mostly an extension of our simple Knowledge Graph Logic from Module 1 with methods to integrate well-understood Web technology. This allows us to refer to Web objects in our language via URIs, link them using RDF triples, and even get formally specified (additional) information via dereferencing.
What we have not covered yet is the case when we want to talk about objects to which we cannot, or do not want to, give a name. In this case, RDF provides for a weak form of quantification, called Blank Nodes.
A short video (less than 4 minutes) introduces these blank nodes.
View the slides here.
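A minimal, hedged sketch of a blank node in practice (again Python with rdflib; the example data is made up): the statement "Alice knows someone whose name is Bob", without giving that someone a URI.

from rdflib import Graph, BNode, Literal, Namespace
from rdflib.namespace import FOAF

EX = Namespace("http://example.org/")
g = Graph()

# "Alice knows someone whose name is Bob": we do not (or cannot)
# give that someone a URI, so we use a blank node instead.
someone = BNode()
g.add((EX.alice, FOAF.knows, someone))
g.add((someone, FOAF.name, Literal("Bob")))

print(g.serialize(format="turtle"))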
As we discussed in Module 1, in order to give formal, shared meaning to formulas, meaning that can be unambiguously interpreted by machines, we first have to define which formulas (RDF graphs) are well-formed.
The following video of about 15 minutes formalises the Syntax of RDF:
View the slides here.
The formalism is very similar to the definition we gave for our simple Knowledge Graph Logic in Module 1. The only difference is that we split the vocabulary in a different way: while in SKGL we had a set of objects and a set of relations, this distinction is not made in RDF. This is a very interesting feature, as we can now treat properties as objects, and make explicit statements about them.
We could, e.g., state that
has_sibling rdf:type owl:SymmetricProperty .
in other words, that the has_sibling relation is symmetric (if I am the sibling of someone, (s)he is my sibling as well); the owl:SymmetricProperty term comes from the OWL vocabulary we will meet in Module 4. This is formally complicated (as you will see when I discuss RDF models and interpretations), but very powerful for modelling properties of properties.
Instead of distinguishing between properties and resources, RDF makes a distinction between Resources, Literals and Blank Nodes. As not all combinations of those sets are useful in triples, RDF allows only a subset of all the combinations of those three sets in the (s p o) positions.
For practical reasons, different ways of writing down the information are also important. There are different so-called serialisations, of which we study Turtle in more detail.
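To illustrate both points at once, the following hedged sketch (Python with rdflib; the ex: vocabulary is invented) parses a small Turtle document in which a property, ex:has_sibling, itself occurs as the subject of a triple:

from rdflib import Graph

turtle_doc = """
@prefix ex:  <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

ex:mary ex:has_sibling ex:john .
# A statement about the property itself:
ex:has_sibling rdf:type owl:SymmetricProperty .
"""

g = Graph()
g.parse(data=turtle_doc, format="turtle")
for s, p, o in g:
    print(s, p, o)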
The following video (12 minutes) briefly discusses the RDF semantics.
View the slides here.
It does not make much sense to fully introduce the model-theoretic semantics of RDF here, as it is rather tedious and advanced (the real fan can read it here, but don't waste time on this). What is important to remember is that the basic idea of assigning semantics to RDF knowledge graphs is very similar to our approach in SKGL: for each triple we assign objects to the subject, predicate and object, and relations to the predicates (as mentioned before, predicates are both objects and relations).
A model of a graph is then an interpretation that satisfies all its triples. Special treatment is given to the blank nodes, but the result is a theorem that is very similar to the one we had in SKGL (the theorem in the video is slightly weaker, btw):
Theorem (calculus): a graph G' is entailed by a graph G if and only if G' can be rewritten into a subset of G, where rewriting means assigning a URI (or another term of G) to each blank node of G'.
Here, the same semantic notion of entailment is applied as usual.
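The theorem suggests a naive procedure that we can actually program: try all ways of replacing the blank nodes of G' by terms occurring in G, and check whether the result is a subset of G. Below is a rough sketch in Python (triples as tuples, blank nodes marked with a leading "_:"; the encoding and the example data are our own, and the brute-force search is only meant for illustration, not for efficiency):

from itertools import product

def is_blank(term):
    return isinstance(term, str) and term.startswith("_:")

def entails(g, g_prime):
    """Naive simple-entailment check: does some replacement of the
    blank nodes in g_prime turn it into a subset of g?"""
    blanks = sorted({t for triple in g_prime for t in triple if is_blank(t)})
    terms = {t for triple in g for t in triple}
    if not blanks:
        return g_prime <= g
    for assignment in product(terms, repeat=len(blanks)):
        mapping = dict(zip(blanks, assignment))
        instantiated = {tuple(mapping.get(t, t) for t in triple) for triple in g_prime}
        if instantiated <= g:
            return True
    return False

g = {("ex:alice", "foaf:knows", "ex:bob")}
g_prime = {("ex:alice", "foaf:knows", "_:x")}   # "Alice knows someone"
print(entails(g, g_prime))   # True

Real triplestores do this kind of matching far more efficiently, essentially as graph pattern matching during query answering.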
An alternative view on semantics is also given: as in the example discussed above by Tim Berners-Lee, one of the biggest advantages of the Web of Data initiative is the use of common vocabularies. While this does not allow for fully automated, predictable inference by machines, it is helpful nevertheless, as people who agree on a joint vocabulary usually also, up to a certain point, share a joint understanding of the concepts described in it.
NOTE: there will be no questions in the test about the model theory (formal semantics) of RDF, with two exceptions: you should know that entailment between an RDF graph G and an RDF graph G' is semantically defined in the usual way (all models of G are also models of G'), and that you can calculate entailment between G and G' by finding a replacement for the blank nodes in G' such that the resulting graph G'' is a subset of G.
This final video lecture (10 minutes) touches on some important issues related to RDF.
For example, even though they are called RDF graphs, RDF graphs are not real graphs, as mentioned before. The lecture also discusses the nature of URIs in more detail. Important here is that URIs should not be confused with automatic lookup services; whenever a user uses a URI, this does not imply that he or she has any authority over that resource.
The most important part of this video, though, is the 4 different ways in which RDF data can be published and accessed:
View the slides here.
The final video of about 35 minutes introduces the SPARQL query language.
You will need to understand SPARQL in order to finish the second assignment.
There is also very good literature available, including screencasts.
View the slides here.
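To give a concrete, hedged impression of what a SPARQL query looks like in practice, here is a sketch using the Python package SPARQLWrapper against the public DBpedia endpoint (the endpoint's availability and the exact results returned are of course not guaranteed, and this is not part of the assignment itself):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?city ?name WHERE {
        ?city dbo:country <http://dbpedia.org/resource/Netherlands> ;
              rdfs:label ?name .
        FILTER (lang(?name) = "en")
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["city"]["value"], row["name"]["value"])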
In this Activity, we will move from pure data publishing to also integrating explicitly modelled knowledge. RDF Schema is the basic language for this purpose; it has been introduced as a simple language to model the data schema, and it also provides some initial facilities for ontological knowledge.
As we have discussed in the previous modules, formally modelled knowledge is the basis for predictable inferencing, which rests on the rules that implement the basic entailment relation.
At the end of this Activity, you should have a basic understanding of
These lectures should give you sufficient background knowledge for the 3rd practical assignment. In the test (quiz) at the end of this module, you will be asked general questions about RDFS, inferencing, and data publishing.
In this short video, we recall the main things introduced in the previous module.
View the slides here.
The following video of about 15 minutes introduces different ways of publishing and consuming data on the Web. First, we can add triples and graphs manually to our knowledge graphs in a programming environment, as we did in the first practical assignment. Alternatively, we can add data to so-called triplestores, which are graph databases specialised for RDF graphs.
This video should be very useful when you start working on the third assignment.
View the slides here.
The following video discusses the notions of knowledge and inferencing on the Web. Until now, we have mostly focussed on the publication and consumption of data in RDF. Given the formal semantics provided by a logical system (such as PL, RDF and RDFS), we know which facts are entailed and can devise rules to calculate all possible logical consequences.
In the lecture we briefly touch upon the roots of knowledge representation in Quillian's Semantic Networks from the Sixties. It was because of the unpredictability of the inferences in such networks that people started to develop formal semantics for them, which nowadays allows for unambiguous notions of semantics and inference, such as entailment and logical consequence.
Finally, we also discuss typical separations made in knowledge representation between instances and classes, as well as denotations and instantiation.
If there is just one thing to remember from these slides, it should be the following principle (which should be known by now): formulas are axioms that restrict the possible interpretations of the world (the models of a Knowledge Base). Entailment is then defined as truth in all of these restricted interpretations (models).
Based on the power of these formal mechanisms, we can now devise languages and inference systems that can express rich knowledge about the data model and the domain itself. RDFS, the RDF Schema language, is the simplest of these knowledge representation languages.
View the slides here.
In this lecture we introduce the basic notions w.r.t. RDFS as well as its inference capabilities.
This video introduces RDFS and basic inferencing:
View the slides in PDF here.
You will need to understand RDFS and basic inferencing in order to finish the third assignment.
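As a small illustration of what RDFS inferencing amounts to, here is a sketch in Python with the rdflib library (the example classes and instances are made up, and only a single entailment rule is applied; real RDFS reasoners implement the full rule set and iterate until nothing new can be derived):

from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Student, RDFS.subClassOf, EX.Person))   # schema (RDFS)
g.add((EX.alice, RDF.type, EX.Student))           # data (RDF)

# One RDFS entailment rule (often referred to as rdfs9):
# if (C rdfs:subClassOf D) and (x rdf:type C), then (x rdf:type D).
for c, _, d in g.triples((None, RDFS.subClassOf, None)):
    for x in list(g.subjects(RDF.type, c)):
        g.add((x, RDF.type, d))

print((EX.alice, RDF.type, EX.Person) in g)   # True: an inferred triple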
The official reference for RDF schema is found here.
For those of you who are familiar with UML (Unified Modelling Language), there is a nice Discussion of the Relationship Between RDF-Schema and UML here (not compulsory reading).
It does not make much sense to fully introduce the model-theoretic semantics of RDFS, for a variety of reasons. The problem is that the natural extension of the RDF semantics (where interpretations are graphs) is at odds with the interpretation of objects as instances of classes. We will therefore skip the discussion of the formal semantics of RDFS, as this is a rather advanced topic. Trust us, though, that there is a well-defined model theory, and that there is a theorem stating that the inference (entailment) rules presented in the previous video lecture are sound and complete with respect to it.
This means that in practice you can treat inference with the RDFS rules as equivalent to entailment.
In the final lecture, we discuss RDF and RDFS as vocabularies in RDF datasets. Basically this means that the special operators used to describe RDF information (rdf:type, etc.) have to be referred to via the namespaces declared in the RDF graph, which is the way of stating that the RDF semantics has to be used. The same holds for RDFS: unless we declare the RDFS namespace, it is not clear to anybody how to interpret the special symbols used in RDFS axioms, such as rdfs:subClassOf, rdfs:subPropertyOf, etc.
This lecture ends with an example of how to use RDFS in a data integration problem. We show how to translate the JSON output of the Facebook and IMDB APIs to build RDF datasets, and how to link the objects from the two databases in various ways (including using RDFS inferencing).
View the slides in PDF here.
In this Activity, we will introduce the expressive Web Ontology Language OWL.
As we have discussed in the previous modules, formally specified knowledge is an essential ingredient for data sharing, and indeed for making data reuse possible in the first place. With RDFS we have already studied an ontology language, which allowed us to model simple constraints on the data model and the domain of the data. Still, there are many things that cannot be said in RDFS, and this Module will introduce a language that maximises expressivity while keeping well-understood semantics and computational properties.
At the end of this Activity, you should have a basic understanding of
These lectures should give you sufficient background knowledge for the 4th practical assignment, which basically consists of building your own ontology. We recommend that you already have the final assignment in mind when you model your ontology.
You can also watch the video lectures on YouTube (by popular demand - you can watch at double speed).
In this short video of 4 minutes, we recall the main things introduced in the previous module(s), starting from data publishing on the Web, RDF as a data model, and RDFS as a simple schema and ontology language.
View a PDF version of the slides here.
In this video lecture of about 28 minutes, we recap the main knowledge representation principles that underlie the formal systems we use in this course to represent data and knowledge. There is also a reminder of the basic distinctions made in knowledge representation between different types of knowledge, namely generic knowledge about classes of objects, usually called the terminology, and instances of those classes, the assertions.
Discussing the weaknesses of RDFS, we introduce OWL as an expressive language for modelling conceptual and individual knowledge. Based on Description Logics, with their formal model-theoretic semantics, OWL is well understood and very expressive, while still having relatively good computational properties.
View a PDF version of the slides here.
In this video lecture of 39 minutes, we discuss OWL class axioms and property types. These are very useful for describing in detail how (not) to classify individuals and how (not) to use properties in the domain that your OWL ontology describes. The class axioms are used by a reasoner to detect inconsistencies in your data, for example when two specific classes A and B can never contain the same individual (or, phrased differently, "an individual cannot belong to class A and to class B at the same time"). Moreover, a reasoner can infer that an instance belongs to a specific class C because we already know it belongs to a union or intersection of two other classes A and B.
With property axioms we can declare, for example, that a property is reflexive, transitive or functional, and a reasoner will then infer that there is a contradiction when these axioms are violated by your data (actual instances being related via a property in the wrong way), or it will infer new knowledge, for example because you are using a transitive property (A -> B and B -> C imply A -> C) or a symmetric property (A -> B implies B -> A).
We discuss several of these OWL entailment rules, which are more expressive (and therefore useful) than the RDFS entailment rules we discussed last week.
An important application of these rules is to classify new knowledge based on the existing class and property axioms in your OWL ontology. Or, when merging two ontologies, we can assign existing instances/data classified by one of the ontologies to the correct classes in the other ontology, by defining axioms about unions, intersections and disjointness of classes in the two ontologies.
View a PDF version of the slides here.
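As a small, hedged illustration of such class and property axioms (the vocabulary below is invented, and an OWL reasoner, e.g. inside Protege or via the owlrl package, would be needed on top of this to actually draw the inferences and detect the inconsistencies discussed above):

from rdflib import Graph

ontology = """
@prefix ex:  <http://example.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# Class axiom: nothing can be both a Cat and a Dog.
ex:Cat a owl:Class .
ex:Dog a owl:Class .
ex:Cat owl:disjointWith ex:Dog .

# Property axioms: ancestor_of is transitive, has_mother is functional.
ex:ancestor_of a owl:TransitiveProperty .
ex:has_mother a owl:FunctionalProperty .

ex:felix a ex:Cat .
ex:anne ex:ancestor_of ex:bob .
ex:bob ex:ancestor_of ex:carol .
"""

g = Graph()
g.parse(data=ontology, format="turtle")
print(len(g), "triples loaded")
# A reasoner would now infer ex:anne ex:ancestor_of ex:carol, and would
# flag an inconsistency if ex:felix were also declared to be an ex:Dog.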
In this lecture of 27 minutes, we will discuss OWL class restrictions, that is, the operators in OWL that allow you to restrict how properties are used on the members of a class in a more fine-grained way than range and domain.
View a PDF version of the slides here.
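For intuition, here is one hedged example of a class restriction, written in Turtle and parsed with Python/rdflib (the pizza vocabulary is invented for illustration, in the spirit of the Protege Pizza tutorial): it states that every CheesyPizza is a Pizza with at least one topping from the class CheeseTopping. Parsing only checks the syntax; a reasoner is needed to actually use the restriction.

from rdflib import Graph

restriction = """
@prefix ex:   <http://example.org/pizza#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Every CheesyPizza has at least one topping that is a CheeseTopping
# (an existential restriction, owl:someValuesFrom).
ex:CheesyPizza rdfs:subClassOf ex:Pizza ;
    rdfs:subClassOf [
        a owl:Restriction ;
        owl:onProperty ex:hasTopping ;
        owl:someValuesFrom ex:CheeseTopping
    ] .
"""

g = Graph()
g.parse(data=restriction, format="turtle")
print(g.serialize(format="turtle"))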
I made a short video in which I try to explain the class axioms in a more "whiteboard" style (be careful, this file is huge).
OWL has a very strict distinction between concepts and instances. Nothing is, in principle, allowed to be both a concept (class of objects) and an individual (object).
In this short lecture, we briefly mention a technique in OWL for doing this anyway. The basic idea is that when a name is used both as a concept and as an instance, OWL interprets the two uses as two different things that merely share the same name; this is called punning.
In the final lecture of 12 minutes, we will discuss a number of common mistakes that beginners typically make when starting to model knowledge with OWL. Examples are overcommitting when using range and domain restrictions for properties, confusing complement with disjointness, and problems with universal quantification.
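A tiny, hedged example of punning (the names are invented): the same IRI ex:Eagle is used both as a class, with ex:harry as an instance, and as an individual that is itself a member of the class ex:EndangeredSpecies. OWL 2 treats these two uses as two different things that merely share a name.

from rdflib import Graph

punning_example = """
@prefix ex:  <http://example.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

ex:Eagle a owl:Class .
ex:harry a ex:Eagle .              # ex:Eagle used as a class
ex:Eagle a ex:EndangeredSpecies .  # ex:Eagle used as an individual
"""

g = Graph()
g.parse(data=punning_example, format="turtle")
print(g.serialize(format="turtle"))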
View a PDF version of the slides here.
Protege is an ontology editor that allows you to model knowledge in OWL in a more intuitive way than just writing down the axioms in Turtle, which very quickly becomes unmanageable. You need to install Protege as described in the tools document.
Rinke Hoekstra made a nice screencast (12 minutes) last year to guide you step by step through the process of building an ontology in Protege.
The ontology from the Pizza tutorial is a good example: http://protege.stanford.edu/ontologies/pizza/pizza.owl, but please be creative (toppings on fries instead of pizzas is not creative). The tutorial itself is worth reading; it guides you through building an ontology step by step.
We decided to postpone the Lecture on Ontology Engineering to next week, as we believe that you have enough to do learning the OWL concepts without having to worry about the details of building a beautiful ontology in a systematic way. Next week!
In this Activity, we will turn towards the task of ontology engineering and knowledge and data integration.
As we have discussed in the previous modules, formally specified knowledge is an essential ingredient for data sharing, and indeed for making data reuse possible in the first place. You made a first contact with the Web Ontology Language OWL, which provides a wide variety of operators to model highly complex properties of concepts and instances, as well as relations between them. For the final assignment you will need to build ontologies that make use of these modelling tools within OWL. In order to deal with the complexity of this task, we need to discuss the matter of Ontology Engineering in a bit more detail: how do you actually construct an ontology for your specific application and domain?
Before we get to the final assignment, which will bring all the learned material together, we also need to address specific problems that arise when integrating data on the Web. First, the ontologies people use to describe their data have to be mapped; in other words, the world views have to be unified. After that, we can link the data. These will be the topics of this learning activity.
At the end of this Activity, you should have a basic understanding of
There is also a video summarising the task for the final assignment (but a more detailed description will be provided as well at the beginning of next week).
From Wednesday morning on, you can also watch the video lectures on YouTube (by popular demand - you can watch at double speed and with subtitles).
In this short video of 4 minutes, we recall the main things introduced in the previous module(s), starting from data publishing on the Web, RDF as a data model, and RDFS and OWL as ontology languages.
This video of about one hour introduces some basic concepts of ontology engineering. It is a useful cookbook-style introduction, which can be very valuable when you have to construct your own ontology for the final assignment.
There are 8 steps one should take when building an ontology, which will be discussed in detail in the lecture:
While this is neither a linear process nor a law, it is very useful to understand these steps before starting to build your ontology.
Also view in PDF or on Youtube.
Before you can integrate different data sources from the Web, you need to make sure that the conceptualisations, in other words the ontologies describing the data, are aligned. If you have two datasets about similar domains which would be useful to combine, you need to make sure that the terminology, the vocabularies you use to describe the data, is compatible. This means that you need to model, in some way or another, how the concepts from one vocabulary are related to the concepts from the other.
But on the Web there are not only formal ontologies modelled in OWL with a well-defined model-theoretic semantics, but also other vocabularies with less or hardly any formal semantics. Those knowledge organisation schemes, such as thesauri or taxonomies, are often very rich hierarchies of concepts, used for example to organise topics in libraries or on the Web (such as catalogues). Many of those vocabularies are digitised and published online, and can be valuable sources of knowledge for reuse.
Vocabularies on the Web are manifold. The rich and expressive ontologies are most commonly modelled in OWL; thesauri and other hierarchies often make use of the SKOS vocabulary, which relates topics in terms of broader and narrower relations.
Reusing these vocabularies often means combining them, through alignments (mapping concepts from one ontology or vocabulary to the other) or merging. These mappings can be produced (semi-)automatically by a number of algorithms, but are often made by hand (as you will probably do in your final assignment).
As with building ontologies, there is no one-size-fits-all solution for mappings, and the quality of a mapping has to be evaluated.
The following lecture gives an overview of ontology alignment.
View the slides on Youtube or in PDF.
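As a small, hedged sketch of what a hand-made alignment can look like (in Python with rdflib; the two shop vocabularies and their URIs are invented): we relate terms from one vocabulary to terms of the other using owl:equivalentClass, rdfs:subClassOf, or, for loosely defined vocabularies, SKOS mapping relations such as skos:exactMatch.

from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDFS, SKOS

A = Namespace("http://example.org/shopA#")
B = Namespace("http://example.org/shopB#")

mapping = Graph()
mapping.bind("owl", OWL)
mapping.bind("skos", SKOS)

# Hand-made alignment between two invented vocabularies.
mapping.add((A.Laptop, OWL.equivalentClass, B.NotebookComputer))
mapping.add((A.Tablet, RDFS.subClassOf, B.MobileDevice))
mapping.add((A.Smartphone, SKOS.exactMatch, B.Phone))

print(mapping.serialize(format="turtle"))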
The following video lecture of about 36 minutes finally addresses the issue of data integration, based on the methods presented in the previous four modules. The main idea is that knowledge graphs, modelled in RDF, can easily be extended with explicit knowledge and are thus very useful for combining different sources of potentially heterogeneous data.
In this lecture we first look at ways to transform existing data sources into RDF. This can be done with tools that help you transform relational databases or CSV files into knowledge graphs. An alternative is to write programs that interpret the JSON output of Web APIs, e.g. from Facebook or IMDB, and store those interpretations in RDF.
Now we can use expressive schema and ontology languages to map those datasets, as is shown in the presentation. For accessing the external datasets there are various options: we can store all the relevant data in our own semantic database (our triplestore), use SERVICE queries to integrate live queries to other endpoints into our SPARQL queries, or use the knowledge to integrate the data on the client side.
View the slides in PDF.
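As a hedged illustration of the SERVICE option (the local vocabulary and data are invented; the DBpedia endpoint is real, but its availability is not guaranteed), a federated SPARQL query that combines triples in your own triplestore with live results from another endpoint could look roughly like this:

# A federated query: match local data, then fetch extra information
# from a remote endpoint via SERVICE. Run this against your own
# SPARQL endpoint (e.g. a local triplestore) that supports SERVICE.
federated_query = """
PREFIX ex:   <http://example.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?movie ?label WHERE {
    ?movie a ex:FavouriteMovie .             # local data
    SERVICE <https://dbpedia.org/sparql> {   # remote endpoint
        ?movie rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
}
"""
print(federated_query)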
There is also a nice screencast that explains how to use OpenRefine to clean and transform a messy dataset into RDF.
This video is also a good example on how to produce RDF from existing datasources.
In this lecture we describe the final project, in which you will have to bring everything together that you've learned in the first 5 weeks of the course.
The task is to build a web application that combines different data sources from the Web and integrates them with data sources you might have built yourself and enriched with ontological information. You will have to do this by integrating the ontologies describing these data sources, combining them with other vocabularies, matching the data itself, enriching it with ontological domain knowledge, and accessing it via SPARQL.
There are also some nice examples in this video lecture to give you some intuition of what is expected and possible.
Note: the rubric and project guidelines in this video lecture are not final. See the final project guidelines and rubric on Canvas for an updated version.
Example screencasts are on https://vimeo.com/189549359 (no sound) and on ScreencastGroep19 (in Dutch) and on http://www.tinyurl.com/y7o82t27 (in English, save/download to computer and open in video player)