Scalable Integration of Linked Data
Introduction
The goal of this tutorial is to introduce, motivate and detail techniques for integrating heterogeneous structured data from across the Web. Inspired by the growth in Linked Data publishing, our tutorial aims at educating Web researchers and practitioners about this new publishing paradigm. The tutorial will show how Linked Data enables uniform access, parsing and interpretation of data, and how this novel wealth of structured data can potentially be exploited for creating new applications or enhancing existing ones.
As such, the tutorial will focus on Linked Data publishing and related Semantic Web technologies and standards, introducing scalable techniques for crawling, indexing and automatically integrating structured heterogeneous Web data through reasoning.
Content
Monday, October 24th
Session 1: Introduction to RDF and Linked Data
The first session gives an overview of RDF and Linked Data publishing. We will discuss the RDF data model and Linked Data principles for publishing RDF data on the Web. In particular, this session will cover:
Session 2: Scalable Linked Data Crawling
This session gives an overview of the state of the art in efficient data-retrieval techniques, including novel challenges and techniques for crawling Linked Data from the Web. We will present the architecture of a crawler for small to medium-sized datasets of up to several hundred million triples. In particular, this session will cover:
Session 3: Scalable RDF Indexing Techniques
This session presents scalable techniques for indexing and querying local repositories of Linked Data. We will discuss the standardised SPARQL query language and then the state of the art in RDF storage with respect to research, directions and applications. In particular, this session will cover:
Session 4: Reasoning: Motivation and Overview
This session gives an introduction to the RDFS and OWL (2) standards and to rule-based reasoning, with heavy emphasis on motivating reasoning for the Linked Data use case and for integrating heterogeneous data from a large number of diverse sources. We also introduce algorithms that incorporate information about the provenance of data during reasoning to ensure robustness in the face of noisy or impudent remote data. In particular, this session will cover:
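To give a flavour of rule-based reasoning, the following is a minimal sketch of forward-chaining over two standard RDFS rules (rdfs9 and rdfs11) until no new triples are inferred. It is an illustration only, not the tutorial's own material: triples are plain Python tuples, the `ex:` names and the functions `apply_rules` and `materialise` are hypothetical, and provenance is not tracked.

```python
# Triples are plain (subject, predicate, object) tuples; all names are illustrative.
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def apply_rules(triples):
    """Apply rdfs9 and rdfs11 once; return only the newly inferred triples."""
    subclass_of = {(s, o) for s, p, o in triples if p == SUBCLASS}
    inferred = set()
    for sub, sup in subclass_of:
        for s, p, o in triples:
            # rdfs9: (x type sub), (sub subClassOf sup) => (x type sup)
            if p == RDF_TYPE and o == sub:
                inferred.add((s, RDF_TYPE, sup))
            # rdfs11: (a subClassOf sub), (sub subClassOf sup) => (a subClassOf sup)
            if p == SUBCLASS and o == sub:
                inferred.add((s, SUBCLASS, sup))
    return inferred - triples


def materialise(triples):
    """Repeat rule application until a fixpoint is reached."""
    closure = set(triples)
    while True:
        new = apply_rules(closure)
        if not new:
            return closure
        closure |= new


data = {
    ("ex:alice", RDF_TYPE, "ex:Student"),
    ("ex:Student", SUBCLASS, "ex:Person"),
    ("ex:Person", SUBCLASS, "ex:Agent"),
}
closure = materialise(data)
```

Here `closure` additionally contains the inferred facts that `ex:alice` is an `ex:Person` and an `ex:Agent`, and that `ex:Student` is a subclass of `ex:Agent`. This naive loop rescans all triples on every pass, which is exactly the kind of cost that scalable reasoners need to avoid.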
Session 5: Scalable Distributed Reasoning over MapReduce
This session presents scalable distributed reasoning using the MapReduce distribution framework, enabling high performance over a cluster of commodity hardware. This session details the MapReduce framework (employed by Google and Yahoo, among others) and the award-winning WebPIE system which integrates optimised execution strategies for rules supporting a (pragmatic) fragment of OWL semantics.
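As a rough illustration of how a single inference rule can be phrased as a MapReduce job, the sketch below simulates the map, shuffle and reduce phases in memory for the RDFS subclass rule (rdfs9). It is a generic reduce-side join written for this page, not the optimised strategy used by WebPIE; the functions `map_triple`, `reduce_class` and `run_job` and the `ex:` names are hypothetical.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def map_triple(triple):
    """Map phase: emit (join key, tagged value) pairs keyed on the class."""
    s, p, o = triple
    if p == RDF_TYPE:
        yield o, ("instance", s)
    elif p == SUBCLASS:
        yield s, ("superclass", o)


def reduce_class(cls, values):
    """Reduce phase: join the instances of a class with its superclasses."""
    instances = [v for tag, v in values if tag == "instance"]
    superclasses = [v for tag, v in values if tag == "superclass"]
    for inst in instances:
        for sup in superclasses:
            yield (inst, RDF_TYPE, sup)


def run_job(triples):
    """Simulate one job: map, group by key (shuffle), then reduce."""
    groups = defaultdict(list)
    for t in triples:
        for key, value in map_triple(t):
            groups[key].append(value)
    out = set()
    for key, values in groups.items():
        out.update(reduce_class(key, values))
    return out


result = run_job({
    ("ex:alice", RDF_TYPE, "ex:Student"),
    ("ex:Student", SUBCLASS, "ex:Person"),
})
```

In a real cluster the grouping step is performed by the framework across machines, and computing a full closure would require running such jobs repeatedly until no new triples appear.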
Session 6: Implementing a LarKC Workflow
This session gives attendees hands-on experience of building scalable Linked Data applications. Some of the technologies presented in the previous sessions will be combined using a scalable workflow engine tailored for Linked Data: the Large Knowledge Collider (LarKC).