SimPack Project Page

Motivation

Similarity is a heavily researched subject in the computer science, artificial intelligence, psychology, and linguistics literature. Typically, these studies focus on the similarity between vectors [Baeza-Yates & Ribeiro-Neto '99, Salton & McGill '83], strings [Lord et al. '03], trees or graphs [Shasha & Zhang '97], or simple objects [Gentner & Medina '98, Resnik '99, Lin '98].

In our case, we are particularly interested in the similarity between concepts (complex objects) in ontologies. All measures are implemented in SimPack, our Java-based generic similarity framework.

Goal

SimPack is intended primarily for research on the similarity between concepts in ontologies, or between ontologies as a whole. Other possible application areas of SimPack include:

  • the investigation of similarity between software source code, for instance to detect changes between classes of different software releases.
  • research on the similarity between hierarchically structured data, such as XML, in order to compare, search, or integrate data from different data sources.

SimPack is used, for example, in iSPARQL, an extension of traditional SPARQL (SPARQL Protocol And RDF Query Language) that allows querying for similar concepts in ontologies.

Implemented Similarity Measures

The similarity between entities (concepts in ontologies, classes in source code, XML documents, data streams, etc.) can be measured by a myriad of similarity measures. Currently we have implemented similarity measures from the following categories:

  • Feature vectors
    Alignment, Cosine, Dice, Euclidean, Jaccard, Manhattan, Overlap, Pearson
  • Strings or sequences of strings (text)
    Averaged String Matching, Jaro, TFIDF
  • Sets
    Jaccard, Loss of Information, Resemblance
  • Sequences
    Levenshtein Edit Distance
  • Trees
    Bottom-up/Top-down Maximum Common Subtree, Tree Edit Distance
  • Graphs
    Conceptual Similarity, Graph Isomorphism, Subgraph Isomorphism, Maximum Common Subgraph Isomorphism, Graph Isomorphism Covering, Shortest Path
  • Information theory
    Jiang & Conrath, Lin, Resnik

In addition, SimPack wraps the measures from the SecondString, SimMetrics, ontology Alignment API, and OWLS-MX projects.
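
To make a few of the categories above concrete, the following is a minimal Java sketch of three of the listed measures: Cosine (feature vectors), Jaccard (sets), and Levenshtein edit distance (shown here on character sequences). It is not SimPack's own code or API. The class name SimilaritySketch, the method signatures, and the conventions for empty inputs are assumptions made for illustration only.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only; class and method names are hypothetical, not SimPack's API. */
final class SimilaritySketch {

    /** Cosine similarity of two equal-length feature vectors. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumed convention: a zero vector is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Jaccard similarity: size of the intersection divided by size of the union. */
    static <T> double jaccard(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumed convention: two empty sets are identical
        }
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    /** Levenshtein edit distance with unit costs for insert, delete, and substitute. */
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // turning the empty prefix of s into t[0..j) takes j inserts
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // turning s[0..i) into the empty string takes i deletes
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }
}
```

With these definitions, levenshtein("kitten", "sitting") is 3, and the Jaccard similarity of {a, b, c} and {b, c, d} is 0.5. Note that Cosine and Jaccard return similarities in [0, 1], whereas the edit distance is a dissimilarity and would need to be normalized before being compared with the other two.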

Use cases

Similarity analysis of the structures of different ontologies

Comparison of general, tree-like structures

Comparison of workflow layouts

Measuring the similarity of general, graph-like structures

Features of SimPack

SimPack offers the following features among others:

  • it offers a variety of similarity measures for use in ontologies and other research areas
  • it is generic, i.e., it can be applied to different data structures given the existence of appropriate data accessors
  • it is implemented in Java and is thus portable
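
The idea of a generic measure that works through data accessors can be sketched as follows. This is only an illustration of the design, not SimPack's actual interfaces: FeatureAccessor, TokenAccessor, KeyAccessor, and GenericDice are hypothetical names, and the set-based Dice coefficient is chosen here simply as an example measure.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical accessor: maps some data structure to a set of features. */
interface FeatureAccessor<T> {
    Set<String> features(T source);
}

/** Hypothetical accessor for plain text: the features are lower-cased tokens. */
final class TokenAccessor implements FeatureAccessor<String> {
    @Override
    public Set<String> features(String source) {
        String[] parts = source.trim().toLowerCase(Locale.ROOT).split("\\s+");
        Set<String> tokens = new HashSet<>(Arrays.asList(parts));
        tokens.remove(""); // splitting an empty input yields one empty token
        return tokens;
    }
}

/** Hypothetical accessor for attribute maps: the features are the attribute names. */
final class KeyAccessor implements FeatureAccessor<Map<String, String>> {
    @Override
    public Set<String> features(Map<String, String> source) {
        return new HashSet<>(source.keySet());
    }
}

/** Set-based Dice coefficient that only sees data through an accessor. */
final class GenericDice {
    static <T> double similarity(T left, T right, FeatureAccessor<T> accessor) {
        Set<String> a = accessor.features(left);
        Set<String> b = accessor.features(right);
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // assumed convention: two empty feature sets are identical
        }
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return 2.0 * common.size() / (a.size() + b.size());
    }
}
```

For example, GenericDice.similarity("the quick brown fox", "the lazy brown dog", new TokenAccessor()) evaluates to 0.5, since two of the four tokens on each side are shared. The measure itself never inspects the underlying data structure, so supporting a new kind of input in this sketch only requires writing another accessor.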

Links

SimPack uses the following APIs:
Cobertura, Colt, Eclipse, Famix, Jena, JGraphT, JUnit, Apache Lucene, OWL-S API, SecondString, SimMetrics, Taverna

Download

The current version is 0.91 (17 April 2008); the previous version was 0.90.

Source distribution: simpack-0.91-src.zip
Source distribution including jar-files: simpack-0.91-all-src.zip (~46MB)
Only jar file: simpack-0.91-bin.jar

License

This work is licensed under LGPL.

Documentation

Javadoc API

Publications

Credits

Daniel Baggenstos, Beat Fluri, Antoon Goderis, Silvan Hollenstein, Manuel Kägi, Tobias Sager, Markus Stocker, Michael Würsch

Contacts

Please do not hesitate to contact us if you have any questions or comments about the SimPack project. Write to simpack [at] ifi.unizh.ch or contact one of the authors.

 

Last modified April 17, 2008 by Christoph Kiefer <kiefer at ifi.uzh.ch>
