Doctoral Thesis

2009

Christoph Kiefer, Non-Deductive Reasoning for the Semantic Web and Software Analysis, January 2009. (doctoralthesis)

The Semantic Web uses a number of knowledge representation (KR) languages to represent the terminological knowledge of a domain in a structured and formally sound way. Such KRs are typically description logics (DL), which are a particular kind of knowledge representation language. One of the underpinnings of the Semantic Web and, therefore, a strength of any such semantic architecture, is the ability to reason from data, that is, to derive new knowledge from basic facts. In other words, the information that is already known and stored in the knowledgebase is extended with the information that can be logically deduced from the ground truth. The world, however, generally does not fit into a fixed, predetermined logic system of zeroes and ones. To account for this, especially in order to deal with the uncertainty inherent in the physical world, different models of human reasoning are required. Two prominent ways to model human reasoning are similarity reasoning (aka analogical reasoning) and inductive reasoning. It has been shown in recent years that the notion of similarity plays an important role in a number of Semantic Web tasks, such as Semantic Web service matchmaking, similarity-based service discovery, and ontology alignment. With inductive reasoning, two prominent tasks that can benefit from the use of statistical induction techniques are Semantic Web service classification and (semi-)automatic semantic data annotation. This dissertation transfers these ideas to the Semantic Web. To this end, it extends the well-known RDF query language SPARQL with two novel, non-deductive reasoning extensions in order to enable similarity and inductive reasoning. To address these issues, specifically to implement the two novel reasoning variants by using SPARQL, we introduce the concept of virtual triple patterns. Virtual triples are not asserted but inferred. Hence, they do not exist in the knowledgebase, but, rather, only as a result of the similarity/inductive reasoning process. To address similarity reasoning, we present the iSPARQL (imprecise SPARQL) framework, an extension of traditional SPARQL that supports customized similarity strategies via virtual triple patterns in order to explore an RDF dataset for similar resources. For our inductive reasoning extension, we introduce our SPARQL-ML (SPARQL Machine Learning) approach to create and work with statistical induction/data mining models in traditional SPARQL. Our presented iSPARQL and SPARQL-ML frameworks are validated using five different case studies of heavily researched Semantic Web and Software Analysis tasks. For the Semantic Web, these tasks are semantic service matchmaking, service discovery, and service classification. For Software Analysis, we conduct some experiments in software evolution and bug prediction. By applying our approaches to this large number of different tasks, we hope to show the approaches' generality, ease-of-use, extensibility, and high degree of flexibility in terms of customization to the actual task.


Publications

2010

Jonas Tappolet, Christoph Kiefer, Abraham Bernstein, Semantic web enabled software analysis, Journal of Web Semantics: Science, Services and Agents on the World Wide Web 8, July 2010. (article)

One of the most important decisions researchers face when analyzing software systems is the choice of a proper data analysis/exchange format. In this paper, we present EvoOnt, a set of software ontologies and data exchange formats based on OWL. EvoOnt models software design, release history information, and bug-tracking meta-data. Since OWL describes the semantics of the data, EvoOnt (1) is easily extendible, (2) can be processed with many existing tools, and (3) allows assertions to be derived through its inherent Description Logic reasoning capabilities. The contribution of this paper is that it introduces a novel software evolution ontology that vastly simplifies typical software evolution analysis tasks. In detail, we show the usefulness of EvoOnt by repeating selected software evolution and analysis experiments from the 2004-2007 Mining Software Repositories Workshops (MSR). We demonstrate that if the data used for analysis were available in EvoOnt then the analyses in 75% of the papers at MSR could be reduced to one or at most two simple queries within off-the-shelf SPARQL tools. In addition, we present how the inherent capabilities of the Semantic Web have the potential of enabling new tasks that have not yet been addressed by software evolution researchers, e.g., due to the complexities of the data integration.

2009

Abraham Bernstein, Esther Kaufmann, Christoph Kiefer, Querying the Semantic Web with Ginseng - A Guided Input Natural Language Search Engine, Searching Answers. Festschrift in Honour of Michael Hess on the Occasion of His 60th Birthday, Editor(s): Simon Clematide, Manfred Klenner, Martin Volk; 2009, MV-Wissenschaft, Münster. (incollection)

2008

Christoph Kiefer, Abraham Bernstein, André Locher, Adding Data Mining Support to SPARQL via Statistical Relational Learning Methods, Proceedings of the 5th European Semantic Web Conference (ESWC), February 2008, Springer. (inproceedings)

Exploiting the complex structure of relational data makes it possible to build better models by taking into account the additional information provided by the links between objects. We extend this idea to the Semantic Web by introducing our novel SPARQL-ML approach to perform data mining for Semantic Web data. Our approach is based on traditional SPARQL and statistical relational learning methods, such as Relational Probability Trees and Relational Bayesian Classifiers. We analyze our approach thoroughly by conducting three sets of experiments on synthetic as well as real-world data sets. Our analytical results show that our approach can be used for any Semantic Web data set to perform instance-based learning and classification. A comparison to kernel methods used in Support Vector Machines shows that our approach is superior in terms of classification accuracy.
Markus Stocker, Andy Seaborne, Abraham Bernstein, Christoph Kiefer, Dave Reynolds, SPARQL Basic Graph Pattern Optimization Using Selectivity Estimation, Proceedings of the 17th International World Wide Web Conference (WWW), April 2008, ACM. (inproceedings)

In this paper, we formalize the problem of Basic Graph Pattern (BGP) optimization for SPARQL queries and main memory graph implementations of RDF data. We define and analyze the characteristics of heuristics for selectivity-based static BGP optimization. The heuristics range from simple triple pattern variable counting to more sophisticated selectivity estimation techniques. Customized summary statistics for RDF data enable the selectivity estimation of joined triple patterns and the development of efficient heuristics. Using the Lehigh University Benchmark (LUBM), we evaluate the performance of the heuristics for the queries provided by the LUBM and discuss some of them in more detail. Note that the SPARQL versions of the 14 LUBM queries and the University0 data set we used in this paper can be downloaded from here.
Christoph Kiefer, Abraham Bernstein, The Creation and Evaluation of iSPARQL Strategies for Matchmaking, Proceedings of the 5th European Semantic Web Conference (ESWC), February 2008, Springer. (inproceedings)

This research explores a new method for Semantic Web service matchmaking based on iSPARQL strategies, which make it possible to query the Semantic Web with techniques from traditional information retrieval. The strategies for matchmaking that we developed and evaluated can make use of a plethora of similarity measures and combination functions from SimPack, our library of similarity measures. We show how our combination of structured and imprecise querying can be used to perform hybrid Semantic Web service matchmaking. We analyze our approach thoroughly on a large OWL-S service test collection and show how our initial strategies can be improved by applying machine learning algorithms to result in very effective strategies for matchmaking.

2007

Christoph Kiefer, Abraham Bernstein, Jonas Tappolet, Analyzing Software with iSPARQL, Proceedings of the 3rd International Workshop on Semantic Web Enabled Software Engineering (SWESE 2007), June 2007, Springer. (inproceedings)
Christoph Kiefer, Imprecise SPARQL: Towards a Unified Framework for Similarity-Based Semantic Web Tasks, Proceedings of 2nd Knowledge Web PhD Symposium (KWEPSY) colocated with the 4th Annual European Semantic Web Conference (ESWC), June 2007. (inproceedings)

This proposal explores a unified framework to solve Semantic Web tasks that often require similarity measures, such as RDF retrieval, ontology alignment, and semantic service matchmaking. Our aim is to see how far it is possible to integrate user-defined similarity functions (UDSF) into SPARQL to achieve good results for these tasks. We present some research questions, summarize the experimental work conducted so far, and present our research plan that focuses on the various challenges of similarity querying within the Semantic Web.
Christoph Kiefer, Abraham Bernstein, Jonas Tappolet, Mining Software Repositories with iSPARQL and a Software Evolution Ontology, Proceedings of the 2007 International Workshop on Mining Software Repositories (MSR '07), March 2007, IEEE Computer Society. (inproceedings)

One of the most important decisions researchers face when analyzing the evolution of software systems is the choice of a proper data analysis/exchange format. Most existing formats have to be processed with special programs written specifically for that purpose and are not easily extendible. Most scientists, therefore, use their own database(s), requiring each of them to repeat the work of writing the import/export programs to their format. We present EvoOnt, a software repository data exchange format based on the Web Ontology Language (OWL). EvoOnt includes software, release, and bug-related information. Since OWL describes the semantics of the data, EvoOnt (1) is easily extendible, (2) comes with many existing tools, and (3) allows assertions to be derived through its inherent Description Logic reasoning capabilities. The paper also presents iSPARQL, our SPARQL-based Semantic Web query engine containing similarity joins. Together with EvoOnt, iSPARQL can accomplish a sizable number of tasks sought in software repository mining projects, such as an assessment of the amount of change between versions or the detection of bad code smells. To illustrate the usefulness of EvoOnt (and iSPARQL), we perform a series of experiments with a real-world Java project. These show that a number of software analyses can be reduced to simple iSPARQL queries on an EvoOnt dataset.
Abraham Bernstein, Christoph Kiefer, Markus Stocker, OptARQ: A SPARQL Optimization Approach based on Triple Pattern Selectivity Estimation, Department of Informatics, University of Zurich 2007. (techreport)

Query engines for ontological data based on graph models mostly execute user queries without considering any optimization. Especially for large ontologies, optimization techniques are required to ensure that query results are delivered within reasonable time. OptARQ is a first prototype for SPARQL query optimization based on the concept of triple pattern selectivity estimation. The evaluation we conduct demonstrates how triple pattern reordering according to their selectivity affects the query execution performance.
Christoph Kiefer, Abraham Bernstein, Hong Joo Lee, Mark Klein, Markus Stocker, Semantic Process Retrieval with iSPARQL, Proceedings of the 4th European Semantic Web Conference (ESWC '07), March 2007, Springer. (inproceedings)

The vision of semantic business processes is to enable the integration and inter-operability of business processes across organizational boundaries. Since different organizations model their processes differently, the discovery and retrieval of similar semantic business processes is necessary in order to foster inter-organizational collaborations. This paper presents our approach of using iSPARQL, our imprecise query engine based on SPARQL, to query the OWL MIT Process Handbook, a large collection of over 5000 semantic business processes. We particularly show how easy it is to use iSPARQL to perform the presented process retrieval task. Furthermore, since choosing the best performing similarity strategy is a non-trivial, data- and context-dependent task, we evaluate the performance of three simple and two human-engineered similarity strategies. In addition, we conduct machine learning experiments to learn similarity measures, showing that complementary information contained in the different notions of similarity strategies provides a very high retrieval accuracy. Our preliminary results indicate that iSPARQL is indeed useful for extending the reach of queries and that it, therefore, is an enabler for inter- and intra-organizational collaborations.
Abraham Bernstein, Markus Stocker, Christoph Kiefer, SPARQL Query Optimization Using Selectivity Estimation, 2007. (misc)

This poster describes three static SPARQL optimization approaches for in-memory RDF graphs: (1) a selectivity estimation index (SEI) for single query triple patterns; (2) a query pattern index (QPI) for joined triple patterns; and (3) a hybrid optimization approach that combines both indexes. Using the Lehigh University Benchmark (LUBM), we show that the hybrid approach outperforms other SPARQL query engines such as ARQ and Sesame for in-memory graphs.
Christoph Kiefer, Abraham Bernstein, Markus Stocker, The Fundamentals of iSPARQL - A Virtual Triple Approach For Similarity-Based Semantic Web Tasks, Proceedings of the 6th International Semantic Web Conference (ISWC), March 2007, Springer. (inproceedings)

This research explores three SPARQL-based techniques to solve Semantic Web tasks that often require similarity measures, such as semantic data integration, ontology mapping, and Semantic Web service matchmaking. Our aim is to see how far it is possible to integrate customized similarity functions (CSF) into SPARQL to achieve good results for these tasks. Our first approach exploits virtual triples calling property functions to establish virtual relations among resources under comparison; the second approach uses extension functions to filter out resources that do not meet the requested similarity criteria; finally, our third technique applies new solution modifiers to post-process a SPARQL solution sequence. The semantics of the three approaches are formally elaborated and discussed. We close the paper with a demonstration of the usefulness of our iSPARQL framework in the context of a data integration and an ontology mapping experiment.

2006

Tobias Sager, Abraham Bernstein, Martin Pinzger, Christoph Kiefer, Detecting Similar Java Classes Using Tree Algorithms, Proceedings of the International Workshop on Mining Software Repositories, May 2006, ACM. (inproceedings)

Similarity analysis of source code is helpful during development to provide, for instance, better support for code reuse. Consider a development environment that analyzes code while typing and that suggests similar code examples or existing implementations from a source code repository. Mining software repositories by means of similarity measures enables and enforces reusing existing code and reduces the development effort needed by creating a shared knowledge base of code fragments. In information retrieval, similarity measures are often used to find documents similar to a given query document. This paper extends this idea to source code repositories. It introduces our approach to detect similar Java classes in software projects using tree similarity algorithms. We show how our approach makes it possible to find similar Java classes based on an evaluation of three tree-based similarity measures in the context of five user-defined test cases as well as a preliminary software evolution analysis of a medium-sized Java project. Initial results of our technique indicate that it (1) is indeed useful to identify similar Java classes, (2) successfully identifies the ex ante and ex post versions of refactored classes, and (3) provides some interesting insights into within-version and between-version dependencies of classes within a Java project.
Patrick Ziegler, Christoph Kiefer, Christoph Sturm, Klaus R. Dittrich, Abraham Bernstein, Detecting Similarities in Ontologies with the SOQA-SimPack Toolkit, 10th International Conference on Extending Database Technology (EDBT 2006), Editor(s): Yannis Ioannidis, Marc H. Scholl, Joachim W. Schmidt, Florian Matthes, Mike Hatzopoulos, Klemens Boehm, Alfons Kemper, Torsten Grust, Christian Boehm; March 2006, Springer. (inproceedings)

Ontologies are increasingly used to represent the intended real-world semantics of data and services in information systems. Unfortunately, different databases often do not relate to the same ontologies when describing their semantics. Consequently, it is desirable to have information about the similarity between ontology concepts for ontology alignment and integration. This paper presents the SOQA-SimPack Toolkit (SST), an ontology language independent Java API that enables generic similarity detection and visualization in ontologies. We demonstrate SST's usefulness with the SOQA-SimPack Toolkit Browser, which allows users to graphically perform similarity calculations in ontologies.
Patrick Ziegler, Christoph Kiefer, Christoph Sturm, Klaus R. Dittrich, Abraham Bernstein, Generic Similarity Detection in Ontologies with the SOQA-SimPack Toolkit, SIGMOD Conference, June 2006, ACM. (inproceedings)

Ontologies are increasingly used to represent the intended real-world semantics of data and services in information systems. Unfortunately, different databases often do not relate to the same ontologies when describing their semantics. Consequently, it is desirable to have information about the similarity between ontology concepts for ontology alignment and integration. In this demo, we present the SOQA-SimPack Toolkit (SST) [7], an ontology language independent Java API that enables generic similarity detection and visualization in ontologies. We demonstrate SST's usefulness with the SOQA-SimPack Toolkit Browser, which allows users to graphically perform similarity calculations in ontologies.
Abraham Bernstein, Esther Kaufmann, Christian Kaiser, Christoph Kiefer, Ginseng: A Guided Input Natural Language Search Engine for Querying Ontologies, 2006 Jena User Conference, May 2006. (inproceedings)
Abraham Bernstein, Christoph Kiefer, Imprecise RDQL: Towards Generic Retrieval in Ontologies Using Similarity Joins, 21st Annual ACM Symposium on Applied Computing (ACM SAC 2006), April 2006, ACM. (inproceedings)

Traditional semantic web query languages support a logic-based access to the semantic web. They offer a retrieval (or reasoning) of data based on facts. On the traditional web and in databases, however, exact querying often provides an incomplete answer as queries are over-specified or the mix of multiple ontologies/modelling differences requires "interpretational flexibility." Therefore, similarity measures or ranking approaches are frequently used to extend the reach of a query. This paper extends this idea to the semantic web. It introduces iRDQL, a semantic web query language with support for similarity joins. It is an extension of RDQL (RDF Data Query Language) that enables its users to query for similar resources, ranking the results using a similarity measure. We show how iRDQL extends the reach of a query by finding additional results. We quantitatively evaluated four similarity measures for their usefulness in iRDQL in the context of an OWL-S semantic web service retrieval test collection and compared the results to a specialized OWL-S matchmaker. Initial results of using iRDQL indicate that it is indeed useful for extending the reach of queries and that it is able to improve recall without overly sacrificing precision. We also found that our generic iRDQL approach was only slightly outperformed by the specialized algorithm.

2005

Abraham Bernstein, Christoph Kiefer, iRDQL - Imprecise Queries Using Similarity Joins for Retrieval in Ontologies, 4th International Semantic Web Conference, November 2005. (inproceedings)
Abraham Bernstein, Christoph Kiefer, iRDQL - Imprecise RDQL Queries Using Similarity Joins, K-CAP 2005 Workshop on: Ontology Management: Searching, Selection, Ranking, and Segmentation, October 2005. (inproceedings)

Traditional semantic web query languages support a logic-based access to the semantic web. They offer a retrieval (or reasoning) of data based on facts. On the traditional web and in databases, however, exact querying often provides an incomplete answer as queries are overspecified or the mix of multiple ontologies/modelling differences requires "interpretational flexibility." This paper introduces iRDQL, a semantic web query language with support for similarity joins. It is an extension to RDQL that enables the user to query for similar resources in an ontology. A similarity measure is used to determine the degree of similarity between two semantic web resources. Similar resources are ranked by their similarity and returned to the user. We show how iRDQL extends the reach of a query by finding additional results. We quantitatively evaluated one measure of SimPack (our library of similarity measures for the use in ontologies) for its usefulness in iRDQL within the context of an OWL-S semantic web service retrieval test collection. Initial results of using iRDQL indicate that it is indeed useful for extending the reach of the query and that it is able to improve recall without overly sacrificing precision.
Abraham Bernstein, Esther Kaufmann, Anne Göhring, Christoph Kiefer, Querying Ontologies: A Controlled English Interface for End-users, 4th International Semantic Web Conference (ISWC 2005), November 2005. (inproceedings)

The semantic web presents the vision of a distributed, dynamically growing knowledge base founded on formal logic. Common users, however, seem to have problems even with the simplest Boolean expressions. As queries from web search engines show, the great majority of users simply do not use Boolean expressions. So how can we help users to query a web of logic that they do not seem to understand? We address this problem by presenting a natural language interface to semantic web querying. The interface allows formulating queries in Attempto Controlled English (ACE), a subset of natural English. Each ACE query is translated into a discourse representation structure (a variant of the language of first-order logic) that is then translated into an N3-based semantic web querying language using an ontology-based rewriting framework. As the validation shows, our approach offers great potential for bridging the gap between the logic-based semantic web and its real-world users, since it allows users to query the semantic web without having to learn an unfamiliar formal language. Furthermore, we found that users liked our approach and designed good queries resulting in a very good retrieval performance (100% precision and 90% recall).
Abraham Bernstein, Esther Kaufmann, Christoph Kiefer, Christoph Bürki, SimPack: A Generic Java Library for Similarity Measures in Ontologies, University of Zurich, Department of Informatics, August 2005. (techreport)

Good similarity measures are central for techniques such as retrieval, matchmaking, clustering, data-mining, ontology translations, automatic database schema matching, and simple object comparisons. Measures for the use with complex (or aggregated) objects in ontologies are, however, rare, even though they are central for semantic web applications. This paper first introduces SimPack, a library of similarity measures for the use in ontologies (of complex objects). The measures of the library are then experimentally compared with a similarity "gold standard" established by surveying 94 human subjects in two ontologies. Results show that human and algorithm assessments vary (both between people and across ontologies), but can be grouped into cohesive clusters, each of which is well modeled by one of the measures in the library. Furthermore, we show two increasingly accurate methods to predict the cluster membership of the subjects providing the foundation for the construction of personalized similarity measures.


Other theses

Christoph Kiefer. Qualitative and Quantitative Evaluation of the Reading People Tracker (Diploma Thesis). ETH Zurich, Department of Computer Science. March 2004. [pdf] [bibtex]
Christoph Kiefer. Semi-Automatic Calibration for the Ada/Expo.02 Vision Matrix (Semester Thesis). ETH Zurich, Department of Computer Science. August 2003. [pdf] [bibtex]
Thomas Brunner and Christoph Kiefer. Preisdifferenzierung für Flugtickets auf den Strecken Zürich-London und Frankfurt-New York (Semester Thesis). ETH Zurich, Department of Computer Science. March 2003. [pdf] [bibtex]

Student Advisor

Daniel Buchmüller, Ubidas - A Novel P2P Backup System, 2008.
André Locher, SPARQL-ML: Knowledge Discovery for the Semantic Web, 2008.
Jonas Tappolet, Mining Software Repositories -- A Semantic Web Approach, 2007.
Manuel Kägi, Using Genetic Programming and SimPack to Learn Global Similarity Measures, 2007.
Markus Stocker, The Fundamentals of iSPARQL, 2006.
Daniel Baggenstoss, Implementation and Evaluation of Graph Isomorphism Algorithms for RDF Graphs, 2006.
Tobias Sager, Coogle -- A Code Google Plug-in for Detecting Similar Java Classes, 2005.
Silvan Hollenstein, XQuery Similarity Joins, 2005.
Guido Badertscher, Generelle Queryverarbeitung für das Semantische Web, 2004.
Patrick Brunner, YASeC - Yet another semantic Web Crawler: Implementation eines semantikfreundlichen Web-Crawlers, 2004.
Stephan Bögli, Supported Querying - Ein interaktives Anfragetool für das Semantic Web, 2004.
