-
Mei Wang, Abraham Bernstein, Marc Chesney, An Experimental Study on Real Option Strategies, 37th Annual Meeting of the European Finance Association 2010. (inproceedings)
We conduct a laboratory experiment to study whether people intuitively use real-option strategies in a dynamic investment setting. The participants were asked to play as an oil manager and make production decisions in response to a simulated mean-reverting oil price. Using cluster analysis, participants can be classified into four groups, which we label "mean-reverting", "Brownian motion real-option", "Brownian motion myopic real-option", and "ambiguous". We find two behavioral biases in our participants' strategies: ignoring the mean-reverting process, and myopic behavior. Both lead to overly frequent switching when compared with the theoretical benchmark. We also find that the last group behaves as if it had learned to incorporate the true underlying process into its decisions, improving them during the later stage.
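The mean-reverting price process used in such experiments is commonly modeled as a discretized Ornstein-Uhlenbeck process. The paper does not include its simulation code, so the following Python sketch only illustrates the general idea; the parameter values (mu, kappa, sigma) are invented for the example.

    import random

    def simulate_mean_reverting_price(p0=50.0, mu=50.0, kappa=0.3, sigma=2.0, steps=100):
        """Simulate a discretized Ornstein-Uhlenbeck (mean-reverting) price path.

        p0: starting price, mu: long-run mean, kappa: speed of reversion,
        sigma: standard deviation of the random shocks. Values are illustrative only.
        """
        prices = [p0]
        for _ in range(steps):
            prev = prices[-1]
            shock = random.gauss(0.0, sigma)
            # the price is pulled back towards mu at rate kappa, plus a random shock
            prices.append(prev + kappa * (mu - prev) + shock)
        return prices

    if __name__ == "__main__":
        print(simulate_mean_reverting_price()[:5])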
-
Floarea Serban, Jörg-Uwe Kietz, Abraham Bernstein, An Overview of Intelligent Data Assistants for Data Analysis, Proc. of the 3rd Planning to Learn Workshop (WS9) at ECAI'10, August 2010. (inproceedings)
Today's intelligent data assistants (IDAs) for data analysis focus on how to do effective and intelligent data analysis. This is not a trivial task, since one must take into consideration all the influencing factors: data analysis in general on the one hand, and the communication and interaction with data analysts on the other. The basic goal of building an IDA whose data analysis is both (1) better and (2) faster is not a very constructive criterion and does not help in designing good IDAs. Therefore this paper tries to (a) identify constructive criteria that allow us to compare existing systems and help design better IDAs and (b) review previous IDAs based on these criteria to find out which problems IDAs should solve as well as which method works best for which problem. In conclusion, we try to learn from previous experience which features should be incorporated into a new IDA that would solve the problems of today's analysts.
-
Steffen Hölldobler, Abraham Bernstein, Günter Hotz, Klaus-Peter Löhr, Paul Molitor, Gustav Neumann, Rüdiger Reischuk, Dietmar Saupe, Myra Spiliopoulou, Harald Störrle, Dorothea Wagner, Ausgezeichnete Informatikdissertationen 2009, Lecture Notes in Informatics, Gesellschaft für Informatik (GI), 2010. (book)
-
Floarea Serban, Auto-experimentation of KDD Workflows based on Ontological Planning, The 9th International Semantic Web Conference (ISWC 2010), Doctoral Consortium, November 2010. (inproceedings)
One of the problems of Knowledge Discovery in Databases (KDD) is the lack of user support for solving KDD problems. Current Data Mining (DM) systems enable the user to manually design workflows, but this becomes difficult when there are too many operators to choose from or the workflow's size is too large. Therefore we propose to use auto-experimentation based on ontological planning to provide the users with automatically generated workflows as well as rankings of workflows based on several criteria (execution time, accuracy, etc.). Moreover, auto-experimentation will help to validate the generated workflows and to prune and reduce their number. Furthermore, we will use mixed-initiative planning to allow the users to set parameters and criteria to limit the planning search space as well as to guide the planner towards better workflows.
-
Jiwen Li, Automatic verification of small molecule structure with one dimensional proton nuclear magnetic resonance spectrum, April 2010. (doctoralthesis)
Small molecule structure verification with one-dimensional (1D) proton (1H) Nuclear Magnetic Resonance (NMR) spectra has become a vital procedure for drug design and discovery. However, the low throughput of the human verification procedure has limited its application to serving only as an arbitration instrument for molecular structure identification. Considering NMR's unimpeachable advantages in molecular structure identification tasks (compared to other techniques), popularizing NMR technology in routine molecular structure verification procedures (especially in compound library management in the pharmaceutical industry) would dramatically increase the efficiency of drug discovery. As a result, several automatic NMR structure verification software approaches have been developed, described in the literature, and made commercially available. Unfortunately, all of them are limited in principle (e.g., they depend heavily on chemical shift prediction) and have been shown not to work in practice.
Driven by strong motivation from industry, we propose a new way to approach the problem. Specifically, we propose to use techniques from artificial intelligence (AI) to mimic the spectroscopist's NMR molecular structure verification procedure. Guided by this strategy, a human-logic-based optimization (i.e., heuristic search) approach is designed to mimic the spectroscopist's decision process. The approach rests on a probabilistic model that unifies the human-logic-based optimization under a maximum likelihood framework. Furthermore, a new automatic 1D 1H NMR molecular structure verification system is designed and implemented based on the proposed optimization approach.
To convince the broad community of NMR spectroscopists and practitioners of molecular structure identification, comprehensive experiments are used to evaluate the system's decision accuracy and its consistency with the spectroscopists. The results demonstrate that the system performs very well in terms of both accuracy and consistency with the spectroscopists on the test datasets we used. This validates both the correctness of our approach and the feasibility of building industrial software based on our system for practical structure verification environments. As a result, commercial software based on our system is under development by a major NMR manufacturer and is going to be released to the pharmaceutical industry.
Finally, the thesis also discusses similarities and differences between human-logic-based optimization and other commonly used optimization approaches, with a particular focus on their applicability. Through these discussions, we hope that human-logic-based optimization can serve as a reference for other computer science practitioners solving automation problems in different domains.
-
Cosmin Basca, Abraham Bernstein, Avalanche - Putting the Spirit of the Web back into Semantic Web Querying. (conference/Demo)
Traditionally, Semantic Web applications either included a web crawler or relied on external services to gain access to the Web of Data. Recent efforts have enabled applications to query the entire Semantic Web for up-to-date results. Such approaches are based on either centralized indexing of semantically annotated metadata or link traversal and URI dereferencing, as in the case of Linked Open Data. They make a number of limiting assumptions, thus breaking the openness principle of the Web. In this demo we present a novel technique called Avalanche, designed to allow a data surfer to query the Semantic Web transparently. The technique makes no prior assumptions about data distribution. Specifically, Avalanche can perform "live" queries over the Web of Data. First, it gets on-line statistical information about the data distribution, as well as bandwidth availability. Then, it plans and executes the query in a distributed manner, trying to quickly provide first answers.
-
Cosmin Basca, Abraham Bernstein, Avalanche: Putting the Spirit of the Web back into Semantic Web Querying, Proceedings Of The 6th International Workshop On Scalable Semantic Web Knowledge Base Systems (SSWS2010), Editor(s): Achille Fokoue, Thorsten Liebig, Yuanbo Guo, November 2010. (inproceedings/full paper)
Traditionally Semantic Web applications either included a web crawler or relied on external services to gain access to the Web of Data. Recent efforts have enabled applications to query the entire Semantic Web for up-to-date results. Such approaches are based on either centralized indexing of semantically annotated metadata or link traversal and URI dereferencing as in the case of Linked Open Data. By making limiting assumptions about the information space, they violate the openness principle of the Web -- a key factor for its ongoing success. In this article we propose a technique called Avalanche, designed to allow a data surfer to query the Semantic Web transparently without making any prior assumptions about the distribution of the data -- thus adhering to the openness criteria. Specifically, Avalanche can perform "live" (SPARQL) queries over the Web of Data. First, it gets on-line statistical information about the data distribution, as well as bandwidth availability. Then, it plans and executes the query in a distributed manner trying to quickly provide first answers. The main contribution of this paper is the presentation of this open and distributed SPARQL querying approach. Furthermore, we propose to extend the query planning algorithm with qualitative statistical information. We empirically evaluate Avalanche using a realistic dataset, show its strengths but also point out the challenges that still exist.
-
Minh Khoa Nguyen, Cosmin Basca, Abraham Bernstein, B+Hash Tree: Optimizing query execution times for on-Disk Semantic Web data structures, Proceedings Of The 6th International Workshop On Scalable Semantic Web Knowledge Base Systems (SSWS2010), Editor(s): Achille Fokoue, Thorsten Liebig, Yuanbo Guo, November 2010. (inproceedings/full paper)
The increasing growth of the Semantic Web has substantially enlarged the amount of data available in RDF format. One proposed solution is to map RDF data to relational databases (RDBs). The lack of a common schema, however, makes this mapping inefficient. Some RDF-native solutions use B+Trees, which are potentially becoming a bottleneck, as the single key-space approach of the Semantic Web may even make their O(log(n)) worst-case performance too costly. Alternatives, such as hash-based approaches, suffer from insufficient update and scan performance. In this paper we propose a novel type of index structure called a B+Hash Tree, which combines the strengths of traditional B-Trees with the speedy constant-time lookup of a hash-based structure. Our main research idea is to enhance the B+Tree with a Hash Map to enable constant retrieval time instead of the common logarithmic one of the B+Tree. The result is a scalable, updatable, and lookup-optimized on-disk index structure that is especially suitable for the large key-spaces of RDF datasets. We evaluate the approach against existing RDF indexing schemes using two commonly used datasets and show that a B+Hash Tree is at least twice as fast as its competitors -- an advantage that we show should grow as dataset sizes increase.
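As an illustration of the core idea -- a sorted leaf structure for range scans combined with a hash map for constant-time point lookups -- the following toy Python sketch mimics a B+Hash-Tree-like index in memory. It is not the authors' on-disk implementation; in particular, rebuilding the hash map on every insert is a deliberate simplification.

    import bisect

    class BPlusHashIndex:
        """Toy in-memory analogue of a B+Hash Tree: a sorted key array
        (stand-in for B+Tree leaves, supporting range scans) plus a hash
        map from key to position for constant-time point lookups."""

        def __init__(self):
            self._keys = []      # sorted keys, as in B+Tree leaf order
            self._values = []    # values aligned with self._keys
            self._hash = {}      # key -> position in self._keys

        def insert(self, key, value):
            pos = bisect.bisect_left(self._keys, key)
            self._keys.insert(pos, key)
            self._values.insert(pos, value)
            # simplification: rebuild the hash map because positions shift
            self._hash = {k: i for i, k in enumerate(self._keys)}

        def get(self, key):
            pos = self._hash.get(key)          # constant-time point lookup
            return None if pos is None else self._values[pos]

        def range(self, lo, hi):
            start = bisect.bisect_left(self._keys, lo)
            end = bisect.bisect_right(self._keys, hi)
            return list(zip(self._keys[start:end], self._values[start:end]))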
-
Cosmin Basca, Robert H. Warren, Abraham Bernstein, Canopener: Recycling Old and New Data, 3rd Workshop on Mashups, Enterprise Mashups and Lightweight Composition on the Web (MEM 2010), April 2010. (inproceedings/full paper)
The advent of social markup languages and lightweight public data access methods has created an opportunity to share the social, documentary and system information locked in most servers as a mashup. Whereas solutions already exist for creating and managing mashups from network sources, we propose here a mashup framework whose primary information sources are the applications and user files of a server. This enables us to use server legacy data sources that are already maintained as part of basic administration to semantically link user documents and accounts using social web constructs.
-
Katharina Reinecke, Culturally Adaptive User Interfaces 2010. (phdthesis)
One of the largest impediments to the efficient use of software in different cultural contexts is the gap between software designs - typically following western cultural cues - and the users, who handle them within their own cultural frame. The problem has become even more relevant, as today the majority of revenue in the software industry comes from outside market-dominating countries such as the USA. While research has shown that adapting user interfaces to cultural preferences can be a decisive factor for marketplace success, the endeavor is oftentimes foregone because it is time-consuming and costly. Moreover, it is usually limited to producing one uniform user interface for each nation, thereby disregarding the intangible nature of cultural backgrounds.
To overcome these problems, this thesis introduces a new approach called 'cultural adaptivity'. The main idea behind it is to develop intelligent user interfaces, which can automatically adapt to the user's culture. Rather than only adapting to one country, cultural adaptivity is able to anticipate different influences on the user's cultural background, such as previous countries of residence, differing nationalities of the parents, religion, or the education level. We hypothesized that realizing these influences in adequate adaptations of the interface improves the overall usability, and specifically, increases work efficiency and user satisfaction.
In support of this thesis, we developed a cultural user model ontology, which includes various facets of users' cultural backgrounds. The facets were aligned with information on cultural differences in perception and user interface preferences, resulting in a comprehensive set of adaptation rules.
We evaluated our approach with our culturally adaptive system MOCCA, which can adapt to the users' cultural backgrounds with more than 115'000 possible combinations of its user interface. Initially, the system relies on the above-mentioned adaptation rules to compose a suitable user interface layout. In addition, MOCCA is able to learn new, and refine existing, adaptation rules from users' manual modifications of the user interface based on a collaborative filtering mechanism, and from observing the user's interaction with the interface.
The results of our evaluations showed that MOCCA is able to anticipate the majority of user preferences in an initial adaptation, and that users' performance and satisfaction significantly improved when using the culturally adapted version of MOCCA, compared to its 'standard' US interface.
-
Jörg-Uwe Kietz, Floarea Serban, Abraham Bernstein, Simon Fischer, Data Mining Workflow Templates for Intelligent Discovery Assistance in RapidMiner, Proc. of RCOMM'10 2010. (inproceedings)
Knowledge Discovery in Databases (KDD) has evolved during the last years and reached a mature stage offering plenty of operators to solve complex tasks. User support for building workflows, in contrast, has not increased proportionally. The large number of operators available in current KDD systems makes it difficult for users to successfully analyze data. Moreover, workflows easily contain a large number of operators, and parts of the workflows are applied several times; thus it is hard for users to build them manually. In addition, workflows are not checked for correctness before execution. Hence, it frequently happens that the execution of a workflow stops with an error after several hours of runtime. In this paper we address these issues by introducing a knowledge-based representation of KDD workflows as a basis for cooperative-interactive planning. Moreover, we discuss workflow templates that can mix executable operators and tasks to be refined later into sub-workflows. This new representation helps users to structure and handle workflows, as it constrains the number of operators that need to be considered. We show that workflows can be grouped into templates, enabling re-use and simplifying KDD workflow construction in RapidMiner.
-
Jörg-Uwe Kietz, Floarea Serban, Abraham Bernstein, Simon Fischer, Data Mining Workflow Templates for Intelligent Discovery Assistance and Auto-Experimentation, Proc. of the ECML/PKDD'10 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery (SoKD'10), September 2010. (inproceedings)
Knowledge Discovery in Databases (KDD) has grown a lot during the last years, but providing user support for constructing workflows is still problematic. The large number of operators available in current KDD systems makes it difficult for a user to successfully solve her task. Also, workflows can easily reach a huge number of operators (hundreds), and parts of the workflows are applied several times; therefore, it becomes hard for the user to construct them manually. In addition, workflows are not checked for correctness before execution. Hence, it frequently happens that the execution of the workflow stops with an error after several hours of runtime.
In this paper we present a solution to these problems. We introduce a knowledge-based representation of Data Mining (DM) workflows as a basis for cooperative interactive planning. Moreover, we discuss workflow templates, i.e., abstract workflows that can mix executable operators and tasks to be refined later into sub-workflows. This new representation helps users to structure and handle workflows, as it constrains the number of operators that need to be considered. Finally, workflows can be grouped into templates, which fosters re-use and further simplifies DM workflow construction.
-
Thomas Scharrenbach, Claudia d'Amato, Nicola Fanizzi, Rolf Grütter, Bettina Waldvogel, Abraham Bernstein, Default Logics for Plausible Reasoning with Controversial Axioms, Proceedings of the 6th International Workshop on Uncertainty Reasoning for the Semantic Web (URSW-2010), November 2010, CEUR Workshop Proceedings. (inproceedings)
Using a variant of Lehmann's Default Logics and Probabilistic Description Logics, we recently presented a framework that invalidates those unwanted inferences that cause concept unsatisfiability without the need to remove explicitly stated axioms. The solutions of this method were shown to outperform classical ontology repair w.r.t. the number of inferences invalidated. However, conflicts may still exist in the knowledge base and can make reasoning ambiguous. Furthermore, solutions with a minimal number of invalidated inferences do not necessarily minimize the number of conflicts. In this paper we provide an overview of how to find solutions that have a minimal number of conflicts while invalidating as few inferences as possible. Specifically, we propose to evaluate solutions w.r.t. the quantity of information they convey by recurring to the notion of entropy, and we discuss a possible approach towards computing the entropy w.r.t. an ABox.
-
Jonas Luell, Employee fraud detection under real world conditions, April 2010. (doctoralthesis)
Employee fraud in financial institutions is a considerable monetary and reputational risk. Studies state that this type of fraud is typically detected by a tip, in the worst case from affected customers, which is fatal in terms of reputation. Consequently, there is a high motivation to improve analytic detection.
We analyze the problem of client advisor fraud in a major financial institution and find that it differs substantially from other types of fraud. However, internal fraud at the employee level receives little attention in research.
In this thesis, we provide an overview of fraud detection research with a focus on implicit assumptions and applicability. We propose a decision framework for finding adequate fraud detection approaches for real-world problems based on a number of defined characteristics. By applying the decision framework to the problem setting we encountered at Alphafin, we motivate the chosen approach.
The proposed system consists of a detection component and a visualization component. A number of implementations of the detection component, with a focus on tempo-relational pattern matching, are discussed. The visualization component, which was turned into production software at Alphafin in the course of the collaboration, is introduced. On the basis of three case studies we demonstrate the potential of the proposed system and discuss findings and possible extensions for further refinement.
-
Jörg-Uwe Kietz, Floarea Serban, Abraham Bernstein, eProPlan: A Tool to Model Automatic Generation of Data Mining Workflows, Proc. of the 3rd Planning to Learn Workshop (WS9) at ECAI'10, August 2010. (inproceedings)
This paper introduces the first ontological modeling environment for planning Knowledge Discovery (KDD) workflows. We use ontological reasoning combined with AI planning techniques to automatically generate workflows for solving Data Mining (DM) problems. KDD researchers can easily model not only their DM and preprocessing operators but also their DM tasks, which are used to guide the workflow generation.
-
Stuart N. Wrigley, Khadija Elbedweihy, Dorothee Reinhard, Abraham Bernstein, Fabio Ciravegna, Evaluating Semantic Search Tools using the SEALS platform, Proceedings of the International Workshop on Evaluation of Semantic Technologies (IWEST 2010) Workshop 2010. (inproceedings)
In common with many state-of-the-art semantic technologies, there is a lack of comprehensive, established evaluation mechanisms for semantic search tools. In this paper, we describe a new evaluation and benchmarking approach for semantic search tools using the infrastructure under development within the SEALS initiative. To our knowledge, it is the first effort to present a comprehensive evaluation methodology for semantic search tools. The paper describes the evaluation methodology, including our two-phase approach in which tools are evaluated both in a fully automated fashion as well as within a user-based study. We also present and discuss preliminary results from the first SEALS evaluation campaign together with a discussion of some of the key findings.
-
Esther Kaufmann, Abraham Bernstein, Evaluating the Usability of Natural Language Query Languages and Interfaces to Semantic Web Knowledge Bases, Journal of Web Semantics: Science, Services and Agents on the World Wide Web 8 2010. (article)
The need to make the contents of the Semantic Web accessible to end-users becomes increasingly pressing as the amount of information stored in ontology-based knowledge bases steadily increases. Natural language interfaces (NLIs) provide a familiar and convenient means of query access to Semantic Web data for casual end-users. While several studies have shown that NLIs can achieve high retrieval performance as well as domain independence, this paper focuses on usability and investigates if NLIs and natural language query languages are useful from an end-user's point of view. To that end, we introduce four interfaces each allowing a different query language and present a usability study benchmarking these interfaces. The results of the study reveal a clear preference for full natural language query sentences with a limited set of sentence beginnings over keywords or formal query languages. NLIs to ontology-based knowledge bases can, therefore, be considered to be useful for casual or occasional end-users. As such, the overarching contribution is one step towards the theoretical vision of the Semantic Web becoming reality.
-
Christian Bird, Adrian Bachmann, Foyzur Rahman, Abraham Bernstein, LINKSTER: Enabling Efficient Manual Inspection and Annotation of Mined Data, ACM SIGSOFT / FSE '10: Proceedings of the eighteenth International Symposium on the Foundations of Software Engineering, November 2010. (inproceedings/formal demonstration)
While many uses of mined software engineering data are automatic in nature, some techniques and studies either require, or can be improved, by manual methods. Unfortunately, manually inspecting, analyzing, and annotating mined data can be difficult and tedious, especially when information from multiple sources must be integrated. Oddly, while there are numerous tools and frameworks for automatically mining and analyzing data, there is a dearth of tools which facilitate manual methods. To fill this void, we have developed LINKSTER, a tool which integrates data from bug databases, source code repositories, and mailing list archives to allow manual inspection and annotation. LINKSTER has already been used successfully by an OSS project lead to obtain data for one empirical study.
-
Stuart N. Wrigley, Dorothee Reinhard, Khadija Elbedweihy, Abraham Bernstein, Fabio Ciravegna, Methodology and Campaign Design for the Evaluation of Semantic Search Tools, Proceedings of the Semantic Search 2010 Workshop (SemSearch 2010) 2010. (inproceedings)
The main problem with the state of the art in the semantic search domain is the lack of comprehensive evaluations. There exist only a few efforts to evaluate semantic search tools and to compare the results with other evaluations of their kind.
In this paper, we present a systematic approach for testing and benchmarking semantic search tools that was developed within the SEALS project. Unlike other semantic web evaluations, our methodology tests search tools both automatically and interactively with a human user in the loop. This allows us to test not only functional performance measures, such as precision and recall, but also usability issues, such as ease of use and comprehensibility of the query language.
The paper describes the evaluation goals and assumptions; the criteria and metrics; and the type of experiments we will conduct as well as the datasets required to conduct the evaluation in the context of the SEALS initiative. To our knowledge it is the first effort to present a comprehensive evaluation methodology for Semantic Web search tools.
-
Katharina Reinecke, Sonja Schenkel, Abraham Bernstein, Modeling a User's Culture. In: The Handbook of Research in Culturally-Aware Information Technology: Perspectives and Models , IGI Global 2010. (inbook)
-
Proc. of the 3rd Planning to Learn Workshop (WS9) at ECAI 2010. (proceedings/Workshop Proceedings)
The task of constructing composite systems, that is, systems composed of more than one part, can be seen as an interdisciplinary area which builds on expertise in different domains. The aim of this workshop is to explore the possibilities of constructing such systems with the aid of Machine Learning and exploiting the know-how of Data Mining. One way of producing composite systems is by inducing the constituents and then putting the individual parts together.
For instance, a text extraction system may be composed of various subsystems, some oriented towards tagging, morphosyntactic analysis or word sense disambiguation. This may be followed by selection of informative attributes and finally generation of the system for the extraction of the relevant information. Machine Learning techniques may be employed in various stages of this process. The problem of constructing complex systems can thus be seen as a problem of planning to resolve multiple (possibly interacting) tasks.
So, one important issue that needs to be addressed is how these multiple learning processes can be coordinated. Each task is resolved using a certain ordering of operations. Meta-learning can be useful in this process. It can help us to retrieve previous solutions conceived in the past and re-use them in new settings.
The aim of the workshop is to explore the possibilities of this new area, offer a forum for exchanging ideas and experience concerning the state of the art, bring in knowledge gathered in different but related and relevant areas, and outline new directions for research. It is expected that the workshop will help to create a sub-community of ML / DM researchers interested in exploring these new avenues for ML / DM problems and thus help to advance the research on and potential for new types of ML / DM systems.
-
Jonas Tappolet, Christoph Kiefer, Abraham Bernstein, Semantic web enabled software analysis, Journal of Web Semantics: Science, Services and Agents on the World Wide Web 8, July 2010. (article)
One of the most important decisions researchers face when analyzing software systems is the choice of a proper data analysis/exchange format. In this paper, we present EvoOnt, a set of software ontologies and data exchange formats based on OWL. EvoOnt models software design, release history information, and bug-tracking meta-data. Since OWL describes the semantics of the data, EvoOnt (1) is easily extendible, (2) can be processed with many existing tools, and (3) allows deriving assertions through its inherent Description Logic reasoning capabilities. The contribution of this paper is that it introduces a novel software evolution ontology that vastly simplifies typical software evolution analysis tasks. In detail, we show the usefulness of EvoOnt by repeating selected software evolution and analysis experiments from the 2004-2007 Mining Software Repositories Workshops (MSR). We demonstrate that if the data used for analysis were available in EvoOnt, then the analyses in 75% of the papers at MSR could be reduced to one or at most two simple queries within off-the-shelf SPARQL tools. In addition, we present how the inherent capabilities of the Semantic Web have the potential to enable new tasks that have not yet been addressed by software evolution researchers, e.g., due to the complexities of the data integration.
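To illustrate the kind of "one or at most two simple queries" the abstract refers to, the following SPARQL query (embedded in Python as a plain string) counts bug-fixing revisions per file. The vocabulary used here (the evo: prefix and its properties) is an invented placeholder, not the actual EvoOnt terms.

    # Hypothetical EvoOnt-style query: number of bug-fix revisions per file.
    # The prefix "evo:" and the properties "touchesFile" / "fixesIssue" are
    # made-up placeholders for illustration only.
    REVISIONS_PER_FILE = """
    PREFIX evo: <http://example.org/evoont#>
    SELECT ?file (COUNT(?rev) AS ?fixes)
    WHERE {
      ?rev a evo:Revision ;
           evo:touchesFile ?file ;
           evo:fixesIssue ?issue .
    }
    GROUP BY ?file
    ORDER BY DESC(?fixes)
    """

    if __name__ == "__main__":
        # the string can be handed to any SPARQL engine of choice
        print(REVISIONS_PER_FILE)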
-
Philip Stutz, Abraham Bernstein, William W. Cohen, Signal/Collect: Graph Algorithms for the (Semantic) Web, ISWC 2010, Editor(s): P.F. Patel-Schneider; 2010, Springer, Heidelberg. (inproceedings)
The Semantic Web graph is growing at an incredible pace, enabling opportunities to discover new knowledge by interlinking and analyzing previously unconnected data sets. This confronts researchers with a conundrum: whilst the data is available, the programming models that facilitate scalability and the infrastructure to run various algorithms on the graph are missing.
Some use MapReduce -- a good solution for many problems. However, even some simple iterative graph algorithms do not map nicely to that programming model, requiring programmers to shoehorn their problems into the MapReduce model.
This paper presents the Signal/Collect programming model for synchronous and asynchronous graph algorithms. We demonstrate that this abstraction can capture the essence of many algorithms on graphs in a concise and elegant way by giving Signal/Collect adaptations of various relevant algorithms. Furthermore, we built and evaluated a prototype Signal/Collect framework that executes algorithms in our programming model. We empirically show that this prototype transparently scales and that guiding computations by scoring as well as asynchronicity can greatly improve the convergence of some example algorithms. We released the framework under the Apache License 2.0 (at http://www.ifi.uzh.ch/ddis/research/sc).
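The programming model can be pictured as vertices that compute signals for their outgoing edges and collect incoming signals into a new state. The following Python sketch re-implements PageRank in that style purely as an illustration; the released framework is a Scala implementation whose API differs from this toy synchronous loop.

    class PageRankVertex:
        """Vertex in a toy synchronous Signal/Collect-style computation."""

        def __init__(self, vid, out_edges, damping=0.85):
            self.id = vid
            self.out_edges = out_edges      # ids of target vertices
            self.damping = damping
            self.state = 1.0 - damping      # initial rank

        def signal(self):
            # emit a signal along every outgoing edge
            share = self.damping * self.state / max(len(self.out_edges), 1)
            return {target: share for target in self.out_edges}

        def collect(self, signals):
            # combine incoming signals into the new state
            self.state = (1.0 - self.damping) + sum(signals)

    def run(vertices, steps=20):
        """Synchronous execution: all vertices signal, then all collect."""
        for _ in range(steps):
            inbox = {v.id: [] for v in vertices}
            for v in vertices:
                for target, value in v.signal().items():
                    inbox[target].append(value)
            for v in vertices:
                v.collect(inbox[v.id])
        return {v.id: v.state for v in vertices}

    if __name__ == "__main__":
        a = PageRankVertex("a", ["b"])
        b = PageRankVertex("b", ["a", "c"])
        c = PageRankVertex("c", ["a"])
        print(run([a, b, c]))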
-
Thomas Scharrenbach, Rolf Grütter, Bettina Waldvogel, Abraham Bernstein, Structure Preserving TBox Repair using Defaults, Proceedings of the 23rd International Workshop on Description Logics (DL 2010), May 2010, CEUR Workshop Proceedings. (inproceedings)
Unsatisfiable concepts are a major cause of inconsistencies in Description Logics knowledge bases. Popular methods for repairing such concepts aim to remove or rewrite axioms to resolve the conflict within the original logic used. Under certain conditions, however, the structure and intention of the original axioms must be preserved in the knowledge base. This, in turn, requires changing the underlying logic for repair. In this paper, we show how Probabilistic Description Logics, a variant of Reiter's default logics with Lehmann's Lexicographical Entailment, can be used to resolve conflicts fully automatically and obtain a consistent knowledge base from which inferences can be drawn again.
-
Adrian Bachmann, Christian Bird, Foyzur Rahman, Premkumar Devanbu, Abraham Bernstein, The Missing Links: Bugs and Bug-fix Commits, ACM SIGSOFT / FSE '10: Proceedings of the eighteenth International Symposium on the Foundations of Software Engineering, November 2010. (inproceedings)
Empirical studies of software defects rely on links between bug databases and program code repositories. This linkage is typically based on bug-fixes identified in developer-entered commit logs. Unfortunately, developers do not always report which commits perform bug-fixes. Prior work suggests that such links can be a biased sample of the entire population of fixed bugs. The validity of statistical hypothesis testing based on linked data could well be affected by this bias. Given the wide use of linked defect data, it is vital to gauge the nature and extent of the bias, and to try to develop testable theories and models of the bias. To do this, we must establish ground truth: manually analyze a complete version history corpus, and nail down those commits that fix defects and those that do not. This is a difficult task, requiring an expert to compare versions, analyze changes, find related bugs in the bug database, reverse-engineer missing links, and finally record their work for use later. This effort must be repeated for hundreds of commits to obtain a useful sample of reported and unreported bug-fix commits. We make several contributions. First, we present Linkster, a tool to facilitate link reverse-engineering. Second, we evaluate this tool, engaging a core developer of the Apache HTTP web server project to exhaustively annotate 493 commits that occurred during a six-week period. Finally, we analyze this comprehensive data set, showing that there are serious and consequential problems in the data.
-
Thomas Scharrenbach, Claudia d'Amato, Nicola Fanizzi, Rolf Grütter, Bettina Waldvogel, Abraham Bernstein, Unsupervised Conflict-Free Ontology Evolution Without Removing Axioms, Proceedings of the 4th International Workshop on Ontology Dynamics (IWOD 2010), November 2010, CEUR Workshop Proceedings. (inproceedings)
In the beginning of the Semantic Web, ontologies were usually constructed once by a single knowledge engineer and then used as a static conceptualization of some domain. Nowadays, knowledge bases are increasingly dynamically evolving and incorporate new knowledge from different heterogeneous domains -- some of which is contributed by casual users (i.e., non-knowledge engineers) or even software agents. Given that ontologies are based on the rather strict formalism of Description Logics and their inference procedures, conflicts are likely to occur during ontology evolution. Conflicts, in turn, may cause an ontological knowledge base to become inconsistent and make reasoning impossible. Hence, every formalism for ontology evolution should provide a mechanism for resolving conflicts.
In this paper we provide a general framework for conflict-free ontology evolution without changing the knowledge representation. Using a variant of Lehmann's Default Logics and Probabilistic Description Logics, we can invalidate unwanted implicit inferences without removing explicitly stated axioms. We show that this method outperforms classical ontology repair w.r.t. the amount of information lost while allowing for automatic conflict-solving when evolving ontologies.
-
Rolf Grütter, Thomas Scharrenbach, Bettina Waldvogel, Vague Spatio-Thematic Query-Processing - A Qualitative Approach to Spatial Closeness, Transactions in GIS 14, April 2010. (article)
In order to support the processing of qualitative spatial queries, spatial knowledge must be represented in a way that machines can make use of it. Ontologies typically represent thematic knowledge. Enhancing them with spatial knowledge is still a challenge. In this article, an implementation of the Region Connection Calculus (RCC) in the Web Ontology Language (OWL), augmented by DL-safe SWRL rules, is used to represent spatio-thematic knowledge. This involves partially ordered partitions, which are implemented by nominals and functional roles. Accordingly, a spatial division into administrative regions, rather than, for instance, a metric system, is used as a frame of reference for evaluating closeness. Hence, closeness is evaluated purely according to qualitative criteria. Colloquial descriptions typically involve qualitative concepts. The approach presented here is thus expected to align better with the way human beings deal with closeness than does a quantitative approach. To illustrate the approach, it is applied to the retrieval of documents from the database of the Datacenter Nature and Landscape (DNL).
-
Adrian Bachmann, Abraham Bernstein, When Process Data Quality Affects the Number of Bugs: Correlations in Software Engineering Datasets, MSR '10: Proceedings of the 7th IEEE Working Conference on Mining Software Repositories, May 2010. (inproceedings)
Software engineering process information extracted from version control systems and bug tracking databases is widely used in empirical software engineering. In prior work, we showed that these data are plagued by quality deficiencies, which vary in their characteristics across projects. In addition, we showed that those deficiencies, in the form of bias, do impact the results of studies in empirical software engineering. While these findings affect software engineering researchers, the impact on practitioners has not yet been substantiated. In this paper we, therefore, explore (i) if the process data quality and characteristics have an influence on the bug fixing process and (ii) if the process quality as measured by the process data has an influence on the product (i.e., software) quality. Specifically, we analyze six Open Source as well as two Closed Source projects and show that process data quality and characteristics have an impact on the bug fixing process: the high rate of empty commit messages in Eclipse, for example, correlates with the bug report quality. We also show that the product quality -- measured by the number of bugs reported -- is affected by process data quality measures. These findings have the potential to prompt practitioners to increase the quality of their software process and its associated data quality.
-
Adrian Bachmann, Why should we care about data quality in software engineering?, October 2010. (doctoralthesis)
Software engineering tools such as bug tracking databases and version control systems store large amounts of data about the history and evolution of software projects. In the last few years, empirical software engineering researchers have paid attention to these data to provide promising research results, for example, to predict the number of future bugs, recommend bugs to fix next, and visualize the evolution of software systems. Unfortunately, such data is not well-prepared for research purposes, which forces researchers to make process assumptions and develop tools and algorithms to extract, prepare, and integrate (i.e., inter-link) these data. This is inexact and may lead to quality issues. In addition, the quality of data stored in software engineering tools is questionable, which may have an additional effect on research results.
In this thesis, therefore, we present a step-by-step procedure to gather, convert, and integrate software engineering process data, introducing an enhanced linking algorithm that results in a better linking ratio and, at the same time, higher data quality compared to previously presented approaches. We then use this technique to generate six open source and two closed source software project datasets. In addition, we introduce a framework of data quality and characteristics measures, which allows an evaluation and comparison of these datasets.
However, evaluating and reporting data quality issues are of no importance if there is no effect on research results, processes, or product quality. Therefore, we show why software engineering researchers should care about data quality issues and, fundamentally, show that such datasets are incomplete and biased; we also show that, even worse, the award-winning bug prediction algorithm BUGCACHE is affected by quality issues like these. The easiest way to fix such data quality issues would be to ensure good data quality at its origin by software engineering practitioners, which requires extra effort on their part. Therefore, we consider why practitioners should care about data quality and show that there are three reasons to do so: (i) process data quality issues have a negative effect on bug fixing activities, (ii) process data quality issues have an influence on product quality, and (iii) current and future laws and regulations such as the Sarbanes-Oxley Act or the Capability Maturity Model Integration (CMMI) as well as operational risk management implicitly require traceability and justification of all changes to information systems (e.g., by change management). In a way, this increases the demand for good data quality in software engineering, including good data quality of the tools used in the process.
Summarizing, we discuss why we should care about data quality in software engineering, showing that (i) we have various data quality issues in software engineering datasets and (ii) these quality issues have an effect on research results as well as missing traceability and justification of program code changes, and so software engineering researchers as well as software engineering practitioners should care about these issues.
-
Stefanie Hauske, Gerhard Schwabe, Abraham Bernstein, Wiederverwendung multimedialer und online verfuegbarer Selbstlernmodule in der Wirtschaftsinformatik -- Lessons Learned, E-Learning 2010: Aspekte der Betriebswirtschaftslehre und Informatik (MKWI 2008), Editor(s): M. Breitner, F. Lehner, J. Staff, U. Winand; 2010. (inproceedings)
The reusability of digital learning content was a central question in the e-learning project "Foundations of Information Systems (FOIS)", a joint project of five Swiss universities. During the project, twelve multimedia self-study modules were produced and made available online; they cover a broad spectrum of information systems topics and are primarily used in introductory information systems courses following a blended-learning approach. In this article we describe how the aspects essential for the re-use of e-learning content and materials -- flexibility, context independence, unification of content and didactics, and blended-learning use -- were implemented in the project. In the second part we report on the experiences that we and our students gathered with the FOIS modules in teaching at the University of Zurich, and present evaluation results from three courses as well as our lessons learned.
-
Rolf Grütter, Thomas Scharrenbach, A Qualitative Approach to Vague Spatio-Thematic Query Processing, Proceedings of the Terra Cognita Workshop, ISWC2009, Editor(s): Dave Kolas, Nancy Wiegand, Gary Berg-Cross, October 2009. (inproceedings)
-
Jonas Tappolet, Abraham Bernstein, Applied Temporal RDF: Efficient Temporal Querying of RDF Data with SPARQL, Proceedings of the 6th European Semantic Web Conference (ESWC), June 2009, Springer. (inproceedings)
Many applications operate on time-sensitive data. Some of these data are only valid for certain intervals (e.g., job assignments, versions of software code), others describe temporal events that happened at certain points in time (e.g., a person's birthday). Until recently, the only way to incorporate time into Semantic Web models was as a data type property. Temporal RDF, however, considers time as an additional dimension in data, preserving the semantics of time.
In this paper we present a syntax and storage format based on named graphs to express temporal RDF. Given the restriction to preexisting RDF syntax, our approach can perform any temporal query using standard SPARQL syntax only. For convenience, we introduce a shorthand format called τ-SPARQL for temporal queries and show how τ-SPARQL queries can be translated to standard SPARQL. Additionally, we show that, depending on the underlying data's nature, the temporal RDF approach vastly reduces the number of triples by eliminating redundancies, resulting in increased performance for processing and querying. Last but not least, we introduce a new indexing method that can significantly reduce the time needed to execute time point queries (e.g., what happened on January 1st).
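A rough illustration of the named-graph encoding: if each named graph carries its validity interval, a time-point query can be phrased in standard SPARQL. The vocabulary below (ex:validFrom, ex:validTo) is invented for the example, and the query shows the plain-SPARQL target, not the actual τ-SPARQL shorthand.

    # Hypothetical encoding: each named graph ?g carries its validity interval
    # via the made-up properties ex:validFrom / ex:validTo. A time-point query
    # ("what held on 2009-01-01?") then becomes plain SPARQL 1.1:
    TIME_POINT_QUERY = """
    PREFIX ex:  <http://example.org/temporal#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    SELECT ?s ?p ?o
    WHERE {
      GRAPH ?g { ?s ?p ?o }
      ?g ex:validFrom ?from ;
         ex:validTo   ?to .
      FILTER (?from <= "2009-01-01"^^xsd:date && "2009-01-01"^^xsd:date <= ?to)
    }
    """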
-
Dorothea Wagner, Abraham Bernstein, Thomas Dreier, Steffen Hölldobler, Günter Hotz, Klaus-Peter Löhr, Paul Molitor, Gustaf Neumann, Rüdiger Reischuk, Dietmar Saupe, Myra Spiliopoulou, Harald Störrle, Ausgezeichnete Informatikdissertationen 2008, Lecture Notes in Informatics, Gesellschaft für Informatik (GI), 2009. (book)
-
Katharina Reinecke, Cultural Adaptivity in User Interfaces, Doctoral Consortium at the International Conference of Information Systems (ICIS), December 2009. (inproceedings)
-
Adrian Bachmann, Abraham Bernstein, Data Retrieval, Processing and Linking for Software Process Data Analysis, University of Zurich, Department of Informatics, December 2009. (techreport)
Many projects in the mining software repositories community rely on software process data gathered from bug tracking databases and commit log files of version control systems. These data are then used to predict defects, gain insight into a project's life-cycle, and perform other tasks. In this technical report we introduce the software systems which hold such data. Furthermore, we present our approach for retrieving, processing and linking these data. Specifically, we first introduce the bug fixing process and the software products used to support this process. We then present step-by-step guidance for our approach to retrieve, parse, convert and link the data sources. Additionally, we introduce an improved approach for linking the change log file with the bug tracking database. In doing so, we achieve a higher linking rate than other approaches.
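The core of such linking approaches is usually a scan of commit messages for bug identifiers, followed by a plausibility check against the bug tracking database. The Python sketch below shows only that basic heuristic; the regular expression, field names, and the date check are illustrative assumptions, not the report's actual algorithm.

    import re

    # matches e.g. "fix for bug 1234", "fixes #1234", "Bug: 1234" (illustrative pattern)
    BUG_ID_PATTERN = re.compile(r"(?:bug[s]?\s*(?:id)?[:#\s]*|#)(\d{3,7})", re.IGNORECASE)

    def candidate_links(commits, bug_reports):
        """Yield (revision, bug_id) pairs where the commit message mentions a
        bug id that exists in the bug tracker and was resolved at or after the
        commit date (a simple plausibility check)."""
        for commit in commits:
            for match in BUG_ID_PATTERN.finditer(commit["message"]):
                bug_id = match.group(1)
                bug = bug_reports.get(bug_id)
                if bug is not None and bug["resolved_at"] >= commit["date"]:
                    yield commit["revision"], bug_id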
-
Christian Bird, Adrian Bachmann, Eirik Aune, John Duffy, Abraham Bernstein, Vladimir Filkov, Premkumar Devanbu, Fair and Balanced? Bias in Bug-Fix Datasets, ESEC/FSE '09: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering on European software engineering conference and foundations of software engineering, August 2009. (inproceedings)
Software engineering researchers have long been interested in where and why bugs occur in code, and in predicting where they might turn up next. Historical bug-occurrence data has been key to this research. Bug tracking systems, and code version histories, record when, how and by whom bugs were fixed; from these sources, datasets that relate file changes to bug fixes can be extracted. These historical datasets can be used to test hypotheses concerning processes of bug introduction, and also to build statistical bug prediction models. Unfortunately, processes and humans are imperfect, and only a fraction of bug fixes are actually labelled in source code version histories, and thus become available for study in the extracted datasets. The question naturally arises: are the bug fixes recorded in these historical datasets a fair representation of the full population of bug fixes? In this paper, we investigate historical data from several software projects, and find strong evidence of systematic bias. We then investigate the potential effects of "unfair, imbalanced" datasets on the performance of prediction techniques. We draw the lesson that bias is a critical problem that threatens both the effectiveness of processes that rely on biased datasets to build prediction models and the generalizability of hypotheses tested on biased data.
-
Abraham Bernstein, Jiwen Li, From Active Towards InterActive Learning: Using Consideration Information to Improve Labeling Correctness, Human Computation Workshop, June 2009. (inproceedings)
Data mining techniques have become central to many applications. Most of those applications rely on so-called supervised learning algorithms, which learn from given examples in the form of data with predefined labels (e.g., classes such as spam, not spam). Labeling, however, is oftentimes expensive, as it typically requires manual work by human experts. Active learning systems reduce the human effort by choosing the most informative instances for labeling. Unfortunately, research in psychology has shown conclusively that human decisions are inaccurate, easily biased by circumstances, and far from the oracle decision making assumed in active learning research. Based on these findings we show experimentally that (human) mistakes in labeling can significantly deteriorate the performance of active learning systems. To solve this problem, we introduce consideration information -- a concept from marketing -- into an active learning system to bias and improve the human's labeling performance. Results (with simulated and human labelers) show that consideration information can indeed be used to exert a bias. Furthermore, we find that the choice of appropriate consideration information can be used to positively bias an expert and thereby improve the overall performance of the learning setting.
-
Christoph Kiefer, Non-Deductive Reasoning for the Semantic Web and Software Analysis, January 2009. (doctoralthesis)
The Semantic Web uses a number of knowledge representation (KR) languages to represent the terminological knowledge of a domain in a structured and formally sound way. Such KRs are typically description logics (DL), a particular kind of knowledge representation language. One of the underpinnings of the Semantic Web and, therefore, a strength of any such semantic architecture, is the ability to reason from data, that is, to derive new knowledge from basic facts. In other words, the information that is already known and stored in the knowledge base is extended with the information that can be logically deduced from the ground truth.
The world does, however, generally not fit into a fixed, predetermined logic system of zeroes and ones. To account for this, especially in order to deal with the uncertainty inherent in the physical world, different models of human reasoning are required. Two prominent ways to model human reasoning are similarity reasoning (aka analogical reasoning) and inductive reasoning. It has been shown in recent years that the notion of similarity plays an important role in a number of Semantic Web tasks, such as Semantic Web service matchmaking, similarity-based service discovery, and ontology alignment. With inductive reasoning, two prominent tasks that can benefit from the use of statistical induction techniques are Semantic Web service classification and (semi-)automatic semantic data annotation.
This dissertation transfers these ideas to the Semantic Web. To this end, it extends the well-known RDF query language SPARQL with two novel, non-deductive reasoning extensions in order to enable similarity and inductive reasoning. To implement these two reasoning variants using SPARQL, we introduce the concept of virtual triple patterns. Virtual triples are not asserted but inferred. Hence, they do not exist in the knowledge base, but, rather, only as a result of the similarity/inductive reasoning process.
To address similarity reasoning, we present the iSPARQL (imprecise SPARQL) framework -- an extension of traditional SPARQL that supports customized similarity strategies via virtual triple patterns in order to explore an RDF dataset for similar resources. For our inductive reasoning extension, we introduce our SPARQL-ML (SPARQL Machine Learning) approach to create and work with statistical induction/data mining models in traditional SPARQL.
Our iSPARQL and SPARQL-ML frameworks are validated using five different case studies of heavily researched Semantic Web and Software Analysis tasks. For the Semantic Web, these tasks are semantic service matchmaking, service discovery, and service classification. For Software Analysis, we conduct experiments in software evolution and bug prediction. By applying our approaches to this large number of different tasks, we hope to show the approaches' generality, ease-of-use, extensibility, and high degree of flexibility in terms of customization to the actual task.
-
Thomas Scharrenbach, Abraham Bernstein, On the Evolution of Ontologies using Probabilistic Description Logics, Proceedings of the First ESWC Workshop on Inductive Reasoning and Machine Learning on the Semantic Web, Editor(s): Claudia d'Amato, Nicola Fanizzi, Marko Grobelnik, Agnieszka Lawrynowicz, Vojtech Svátek, June 2009, CEUR Workshop Proceedings. (inproceedings)
Exceptions play an important role in conceptualizing data, especially when new knowledge is introduced or existing knowledge changes. Furthermore, real-world data is often contradictory and uncertain. Current formalisms for conceptualizing data, like Description Logics, rely upon first-order logic. As a consequence, they are poor at addressing exceptional, inconsistent and uncertain data, in particular when evolving the knowledge base over time.
This paper investigates the use of Probabilistic Description Logics as a formalism for the evolution of ontologies that conceptualize real-world data. Different scenarios are presented for the automatic handling of inconsistencies during ontology evolution.
-
Cathrin Weiss, Abraham Bernstein, On-disk storage techniques for Semantic Web data - Are B-Trees always the optimal solution?, Proceedings of the 5th International Workshop on Scalable Semantic Web Knowledge Base Systems, October 2009. (inproceedings)
Since its introduction in 1971, the B-tree has become the dominant index structure in database systems. Conventional wisdom dictated that the use of a B-tree index or one of its descendants would typically lead to good results. The advent of XML-data, column stores, and the recent resurgence of typed-graph (or triple) stores motivated by the Semantic Web has changed the nature of the data typically stored. In this paper we show that in the case of triple-stores the usage of B-trees is actually highly detrimental to query performance. Specifically, we compare on-disk query performance of our triple-based Hexastore when using two different B-tree implementations, and our simple and novel vector storage that leverages offsets. Our experimental evaluation with a large benchmark data set confirms that the vector storage outperforms the other approaches by at least a factor of four in load-time, by approximately a factor of three (and up to a factor of eight for some queries) in query-time, as well as by a factor of two in required storage. The only drawback of the vector-based approach is its time-consuming need for reorganization of parts of the data during inserts of new triples: a seldom occurrence in many Semantic Web environments. As such this paper tries to reopen the discussion about the trade-offs when using different types of indices in the light of non-relational data and contribute to the endeavor of building scalable and fast typed-graph databases.
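A toy in-memory analogue of the offset-based vector storage idea: dense, consecutive IDs index directly into an array of adjacency lists, replacing the tree descent with a single array access. This is only an illustration under simplified assumptions, not the Hexastore on-disk layout.

    class VectorIndex:
        """Toy offset-based index: subject IDs 0..n-1 index directly into a
        vector whose slots hold the (predicate, object) pairs of that subject.
        Lookup is a single array access instead of an O(log n) tree descent."""

        def __init__(self, num_subjects):
            self._slots = [[] for _ in range(num_subjects)]

        def add(self, subject_id, predicate_id, object_id):
            self._slots[subject_id].append((predicate_id, object_id))

        def lookup(self, subject_id):
            return self._slots[subject_id]

        def grow(self, new_num_subjects):
            # inserting IDs beyond the current capacity forces a reorganization,
            # mirroring the insert drawback discussed in the abstract
            self._slots.extend([] for _ in range(new_num_subjects - len(self._slots)))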
-
Tobias Bannwart, Amancio Bouza, Gerald Reif, Abraham Bernstein, Private Cross-page Movie Recommendations with the Firefox add-on OMORE, 8th International Semantic Web Conference (ISWC 2009), October 2009. (inproceedings/Semantic Web Challenge)
Online stores and Web portals bring information about a myriad of items such as books, CDs, restaurants or movies to the user's fingertips. Although the Web reduces the barrier to this information, the user is overwhelmed by the number of available items. Therefore, recommender systems aim to guide the user to relevant items. Current recommender systems store user ratings on the server side. This way, the scope of the recommendations is limited to this server only. In addition, the user entrusts the operator of the server with valuable information about his preferences.
Thus, we introduce the private, personal movie recommender OMORE, which learns the user model based on the user's movie ratings. To preserve privacy, OMORE is implemented as Firefox add-on which stores the user ratings and the learned user model locally at the client side. Although OMORE uses the features from the movie pages on the IMDb site, it is not restricted to IMDb only. To enable cross-referencing between various movie sites such as IMDb, Amazon.com, Blockbuster, Netflix, Jinni, or Rotten Tomatoes we introduce the movie cross-reference database LiMo which contributes to the Linked Data cloud.
-
Amancio Bouza, Gerald Reif, Abraham Bernstein, Probabilistic Partial User Model Similarity for Collaborative Filtering, Proceedings of the 1st International Workshop on Inductive Reasoning and Machine Learning on the Semantic Web (IRMLeS2009) at the 6th European Semantic Web Conference (ESWC2009), June 2009. (inproceedings)
Recommender systems play an important role in helping people find items they like. One type of recommender system is user-based collaborative filtering. Its fundamental assumption is that people who share similar preferences for common items will behave similarly in the future. The similarity of user preferences is computed globally on commonly rated items, so partial preference similarities may be missed. Consequently, valuable ratings of partially similar users are ignored. Furthermore, two users may have similar preferences even though the set of commonly rated items is too small to infer preference similarity. We propose, first, an approach that computes user preference similarities based on learned user preference models and, second, a method to compute partial user preference similarities based on partial user model similarities. For users with few commonly rated items, we show that user similarity based on preferences significantly outperforms user similarity based on commonly rated items.
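The following sketch illustrates the general idea under simplifying assumptions; the ratings, the genre feature, and the per-genre mean model are invented for illustration and are not the paper's learned preference models:

```python
# Sketch only: global similarity on co-rated items vs. similarity of learned
# preference models over item features (here: genres). Data is hypothetical.
import math

ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob":   {"m3": 2, "m4": 5, "m5": 4},   # only one item co-rated with alice
}
genres = {"m1": "scifi", "m2": "scifi", "m3": "drama", "m4": "scifi", "m5": "scifi"}

def cosine(u, v):
    keys = set(u) & set(v)
    if not keys:
        return 0.0
    dot = sum(u[k] * v[k] for k in keys)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

# Global item-based similarity: only items both users rated count, so a single
# shared item dominates (or the similarity is undefined altogether).
print("co-rated items:", set(ratings["alice"]) & set(ratings["bob"]))

# Preference-model similarity: learn a simple per-genre model (mean rating)
# for each user and compare the models instead of the raw rating vectors.
def preference_model(user_ratings):
    sums, counts = {}, {}
    for item, r in user_ratings.items():
        g = genres[item]
        sums[g] = sums.get(g, 0) + r
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}

model_a = preference_model(ratings["alice"])
model_b = preference_model(ratings["bob"])
print("model similarity:", round(cosine(model_a, model_b), 3))
```

Even with only one co-rated item, the two toy users come out as highly similar once their preference models are compared, which is the effect the approach exploits.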
-
Abraham Bernstein, Esther Kaufmann, Christoph Kiefer, Querying the Semantic Web with Ginseng - A Guided Input Natural Language Search Engine, Searching Answers. Festschrift in Honour of Michael Hess on the Occasion of His 60th Birthday, Editor(s): Simon Clematide, Manfred Klenner, Martin Volk; 2009, MV-Wissenschaft, Münster. (incollection)
-
Adrian Bachmann, Abraham Bernstein, Software Process Data Quality and Characteristics - A Historical View on Open and Closed Source Projects, IWPSE-Evol '09: Proceedings of the joint international and annual ERCIM workshops on Principles of software evolution (IWPSE) and software evolution (Evol) workshops, August 2009. (inproceedings)
Software process data gathered from bug tracking databases and version control system log files are a very valuable source for analyzing the evolution and history of a project or predicting its future. These data are used, for instance, to predict defects, gain insight into a project's life-cycle, and support additional tasks. In this paper we survey five open source projects and one closed source project in order to provide a deeper insight into the quality and characteristics of these often-used process data. Specifically, we first define quality and characteristics measures, which allow us to compare the quality and characteristics of the data gathered for different projects. We then compute the measures and discuss the issues arising from these observations. We show that there are vast differences between the projects, particularly with respect to the quality of the link rate between bugs and commits.
-
Esther Kaufmann, Talking to the semantic web : natural language query interfaces for casual end-users 2009. (doctoralthesis)
The Semantic Web presents the vision of a distributed, dynamically growing knowledge base founded on formal logic. This formal framework facilitates precise and effective querying in order to manage information-seeking tasks. Casual end-users, however, are typically overwhelmed by the formal logic. So how can we help users to query a Web of logic that they do not seem to understand? One solution to address this problem is the use of natural language for query specification, but the development of natural language interfaces requires computationally and conceptually intensive algorithms relying on large quantities of domain-dependent background knowledge, thereby making them virtually unadaptable to new domains and applications, or achievable only by sacrificing retrieval performance. Furthermore, while natural language interfaces hide prohibitive formal query languages, users should know their capabilities in order to utilize the natural language interface successfully.
This thesis proposes to break the dichotomy between full natural language and formal approaches by regarding them as ends of a Formality Continuum, where the freedom of full natural languages and the structuredness of formal query languages lie at the ends of the continuum. We hypothesize that portable natural language interfaces to the Semantic Web with high retrieval performance can be built, thereby avoiding a complex configuration, by controlling the query language and by extracting the necessary underlying frameworks from ontology-based knowledge bases, since such knowledge bases offer a rich source of semantically annotated information. We further hypothesize that query interfaces for the casual user should impose some structure on the user's input in order to guide the entry, but should not overly restrict the user with an excessively formalistic language. In this way, the best solutions for casual end-users lie somewhere
between the freedom of a full natural language and the structuredness of a formal query language.
To support our proposition we introduce, in a first step, four different, domain-independent query interfaces to the Semantic Web that lie at different positions of the Formality Continuum: NLP-Reduce, Querix, Ginseng, and Semantic Crystal. The first two interfaces allow users to pose questions in almost full or slightly controlled English. The third interface offers query formulation in a controlled language akin to English. The last interface belongs to the formal approaches, as it exhibits a formal, but graphically displayed, query language. The interfaces are simple in configuration, but still offer well-performing and appropriate tools for composing queries to ontology-based data for casual end-users.
As a second step, we present two evaluations to test our hypotheses: (1) an in-depth retrieval performance evaluation with three test sets comparing our interfaces with three
existing systems and (2) a comprehensive usability study with real-world end-users assessing our interfaces and, in particular, their query languages. The two evaluations provide sufficient evidence to determine the advantages and disadvantages of query interfaces at various points along the Formality Continuum. In turn, they lead to concrete answers to our hypotheses.
The thesis shows that natural language interfaces provide a convenient as well as reliable means of query access to the Semantic Web and, hence, a realistic potential for bridging the gap between the formal logic of the Semantic Web and casual end-users. As such, its overarching contribution is one step towards the theoretical vision of the Semantic Web becoming reality.
-
Katharina Reinecke, Abraham Bernstein, Tell Me Where You've Lived, and I'll Tell You What You Like: Adapting Interfaces to Cultural Preferences, User Modeling, Adaptation, and Personalization (UMAP) 2009. (inproceedings)
-
Bettina Bauer-Messmer, Lukas Wotruba, Kalin Müller, Sandro Bischof, Rolf Grütter, Thomas Scharrenbach, Rolf Meile, Martin Hägeli, Jürg Schenker, The Data Centre Nature and Landscape (DNL): Service Oriented Architecture, Metadata Standards and Semantic Technologies in an Environmental Information System, EnviroInfo 2009: Environmental Informatics and Industrial Environmental Protection: Concepts, Methods and Tools, Editor(s): Volker Wohlgemuth, Bernd Page, Kristina Voigt, September ; 2009, Shaker Verlag, Aachen. (inproceedings)
-
The Semantic Web - ISWC 2009. (proceedings)
This book constitutes the refereed proceedings of the 8th International Semantic Web Conference, ISWC 2009, held in Chantilly, VA, USA, during October 25-29, 2009.
The volume contains 43 revised full research papers selected from a total of 250 submissions; 15 papers out of 59 submissions to the Semantic Web in-use track; and 7 papers and 12 posters accepted out of 19 submissions to the doctoral consortium.
The topics covered in the research track are ontology engineering; data management; software and service engineering; non-standard reasoning with ontologies; semantic retrieval; OWL; ontology alignment; description logics; user interfaces; Web data and knowledge; semantic Web services; semantic social networks; and rules and relatedness. The semantic Web in-use track covers knowledge management; business applications; applications from home to space; and services and infrastructure.
-
Jörg-Uwe Kietz, Floarea Serban, Abraham Bernstein, Simon Fischer, Towards Cooperative Planning of Data Mining Workflows, Proc. of the ECML/PKDD09 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery (SoKD-09) 2009. (inproceedings)
A major challenge for third generation data mining and knowledge discovery systems is the integration of different data mining tools and services for data understanding, data integration, data preprocessing, data mining, evaluation and deployment, which are distributed across the network of computer systems. In this paper we outline how an intelligent assistant that is intended to support end-users in the difficult and time-consuming task of designing KDD workflows out of these distributed services can be built. The assistant should support the user in checking the correctness of workflows, understanding the goals behind given workflows, enumerating AI-planner-generated workflow completions, and storing, retrieving, adapting and repairing previous workflows. It should also be an open, easily extensible system. This is achieved by basing the system on a data mining ontology (DMO) in which all the services (operators), together with their in-/outputs and pre-/postconditions, are described. This description is compatible with OWL-S, and new operators can be added by importing their OWL-S specification and classifying it into the operator ontology.
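As a rough illustration of planning over such operator descriptions, the sketch below enumerates workflows from toy operators described only by pre- and postconditions; the operator names, conditions, and goal are invented and much simpler than the DMO/OWL-S descriptions the assistant would use:

```python
# Minimal forward-planning sketch over abstractly described operators
# (names, conditions, and the toy goal are illustrative, not the actual DMO).
OPERATORS = {
    "FillMissingValues": {"pre": {"raw_table"}, "post": {"clean_table"}},
    "Discretize":        {"pre": {"clean_table"}, "post": {"discrete_table"}},
    "DecisionTree":      {"pre": {"discrete_table"}, "post": {"model"}},
    "NaiveBayes":        {"pre": {"clean_table"}, "post": {"model"}},
}

def plan(initial_state, goal, max_len=4):
    """Enumerate operator sequences whose accumulated postconditions reach the goal."""
    frontier = [([], frozenset(initial_state))]
    for _ in range(max_len):
        next_frontier = []
        for steps, state in frontier:
            for name, op in OPERATORS.items():
                if op["pre"] <= state and name not in steps:
                    new_state = state | op["post"]
                    new_steps = steps + [name]
                    if goal <= new_state:
                        yield new_steps
                    else:
                        next_frontier.append((new_steps, new_state))
        frontier = next_frontier

for workflow in plan({"raw_table"}, {"model"}):
    print(" -> ".join(workflow))
```

Enumerating all valid completions in this way is what produces the candidate workflows that the assistant would then rank, prune, or repair.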
-
Jayalath Ekanayake, Jonas Tappolet, Harald C. Gall, Abraham Bernstein, Tracking Concept Drift of Software Projects Using Defect Prediction Quality, Proceedings of the 6th IEEE Working Conference on Mining Software Repositories, May 2009, IEEE Computer Society. (inproceedings)
Defect prediction is an important task in the mining of software repositories, but the quality of predictions varies strongly within and across software projects. In this paper we investigate the reasons for these fluctuations, which stem from the changing nature of the bug (or defect) fixing process. To that end, we adopt the notion of a concept drift, which denotes that the defect prediction model has become unsuitable because the set of influencing features has changed, usually due to a change in the underlying bug generation process (i.e., the concept). We explore four open source projects (Eclipse, OpenOffice, Netbeans and Mozilla) and construct file-level and project-level features for each of them from their respective CVS and Bugzilla repositories. We then use these data to build defect prediction models and visualize the prediction quality along the time axis. These visualizations allow us to identify concept drifts and, as a consequence, phases of stability and instability expressed in the level of defect prediction quality. Further, we identify those project features which influence the defect prediction quality, using both a tree-induction algorithm and a linear regression model. Our experiments uncover that software systems are subject to considerable concept drifts in their evolution history. Specifically, we observe that changes in the number of authors editing a file and in the number of defects fixed by them contribute to a project's concept drift and therefore influence the defect prediction quality. Our findings suggest that project managers using defect prediction models for decision making should be aware of the actual phase of stability or instability due to a potential concept drift.
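A minimal sketch of the evaluation idea, on synthetic data rather than the Eclipse/OpenOffice/Netbeans/Mozilla features used in the paper: train a predictor on one time window, test it on the next, and watch the prediction quality (AUC) drop when the underlying concept changes:

```python
# Sketch (synthetic data, not the study's repository features): train a defect
# predictor per time window and track its AUC to surface concept drift.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def window(n, drifted):
    """Two file-level features; the feature-defect relation flips after drift."""
    authors = rng.poisson(3, n)
    changes = rng.poisson(10, n)
    signal = changes if not drifted else authors
    y = (signal + rng.normal(0, 2, n) > np.median(signal)).astype(int)
    return np.column_stack([authors, changes]), y

windows = [window(300, drifted=(t >= 5)) for t in range(10)]

aucs = []
for t in range(1, len(windows)):
    X_train, y_train = windows[t - 1]    # train on the previous window
    X_test, y_test = windows[t]          # evaluate on the current one
    model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
    aucs.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

for t, auc in enumerate(aucs, start=1):
    flag = "  <- possible drift" if auc < 0.6 else ""
    print(f"window {t}: AUC = {auc:.2f}{flag}")
```

Plotting such a per-window quality curve along the time axis is, in miniature, the visualization used to spot phases of stability and instability.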
-
Christoph Kiefer, Abraham Bernstein, André Locher, Adding Data Mining Support to SPARQL via Statistical Relational Learning Methods, Proceedings of the 5th European Semantic Web Conference (ESWC), February 2008, Springer. (inproceedings)
Exploiting the complex structure of relational data makes it possible to build better models by taking into account the additional information provided by the links between objects. We extend this idea to the Semantic Web by introducing our novel SPARQL-ML approach to perform data mining for Semantic Web data. Our approach is based on traditional SPARQL and statistical relational learning methods, such as Relational Probability Trees and Relational Bayesian Classifiers.
We analyze our approach thoroughly conducting three sets of experiments on synthetic as well as real-world data sets. Our analytical results show that our approach can be used for any Semantic Web data set to perform instance-based learning and classification. A comparison to kernel methods used in Support Vector Machines shows that our approach is superior in terms of classification accuracy.
-
Dorothea Wagner, Abraham Bernstein, Thomas Dreier, Steffen Hölldobler, Günter Hotz, Klaus-Peter Löhr, Paul Molitor, Rüdiger Reischuk, Dietmar Saupe, Myra Spiliopoulou, Ausgezeichnete Informatikdissertationen 2007, Lecture Notes in Informatics, Gesellschaft für Informatik (GI)2008. (book)
-
Peter Vorburger, Catching the drift : when regimes change over time 2008. (doctoralthesis)
The goal of this thesis is the identification of relationships between data streams. These relationships may change over time. The research contribution is to solve this problem by combining two data mining fields. The first field is the identification of such relationships, e.g. by using correlation measures. The second field covers methods to deal with the dynamics of such a system, which require model reconsideration. This field is called "concept drift" and makes it possible to identify and handle new situations. In this thesis two different approaches are presented to combine these two fields into one solution. After that, these two approaches are assessed on synthetic and real-world datasets. Finally, the solution is applied to the finance domain. The task is the determination of dominant factors influencing exchange rates. Finance experts call such a dominant factor a "regime". These factors change over time and thus the problem is named "regime drift". The approach turns out to be successful in dealing with regime drifts.
-
Man Lung Yiu, Nikos Mamoulis, Panagiotis Karras, Common Influence Join: A Natural Join Operation for Spatial Pointsets, Proc. of the 24th IEEE Intl Conf. on Data Engineering (ICDE), February 2008, IEEE Computer Society. (inproceedings)
-
Jonas Luell, Abraham Bernstein, Den Transaktionen auf der Spur, OecNews, February 2008. (article)
-
Thomas Scharrenbach, End-User Assisted Ontology Evolution in Uncertain Domains, The Semantic Web - ISWC 2008, 7th International Semantic Web Conference 2008, Springer. (inproceedings)
-
Simon Ferndriger, Abraham Bernstein, Jin Song Dong, Yuzhang Feng, Yuan-Fang Li, Larry Hunter, Enhancing Semantic Web Services with Inheritance, Proceedings of the 7th International Semantic Web Conference (ISWC) 2008, November 2008, Springer. (inproceedings)
Currently proposed Semantic Web Services technologies allow the creation of ontology-based semantic annotations of Web services so that software agents are able to discover, invoke, compose and monitor these services with a high degree of automation. The OWL Services (OWL-S) ontology is an upper ontology in the OWL language, providing essential vocabularies to semantically describe Web services. Currently, OWL-S services can only be developed independently; if one service is unavailable, then finding a suitable alternative would require an expensive and difficult global search/match. It is desirable to have a new OWL-S construct that can systematically support substitution tracing as well as incremental development and reuse of services. Introducing an inheritance relationship (IR) into OWL-S is a natural solution. However, OWL-S, as well as most of the other currently discussed formalisms for Semantic Web Services such as WSMO or SAWSDL, has yet to define a concrete and self-contained mechanism for establishing inheritance relationships among services, which we believe is very important for the automated annotation and discovery of Web services as well as the human organization of services into a taxonomy-like structure. In this paper, we extend OWL-S with the ability to define and maintain inheritance relationships between services. Through the definition of an additional "inheritance profile", inheritance relationships can be stated and reasoned about. Two types of IRs are allowed to grant service developers the choice of whether or not to respect the "contract" between services. The proposed inheritance framework has also been implemented, and the prototype will be briefly evaluated as well.
-
Jonas Luell, Abraham Bernstein, Alexandra Schaller, Hans Geiger, Foreign Exchange (S.114-177), Swiss Financial Center Watch Monitoring Report, February 2008. (inproceedings)
-
Cathrin Weiss, Panagiotis Karras, Abraham Bernstein, Hexastore: Sextuple Indexing for Semantic Web Data Management, Proc. of the 34th Intl Conf. on Very Large Data Bases (VLDB), February 2008. (inproceedings)
-
Cathrin Weiss, Abraham Bernstein, Sandro Boccuzzo, i-MoCo: Mobile Conference Guide - Storing and querying huge amounts of Semantic Web data on the iPhone/iPod Touch, October 2008. (misc)
Querying and storing huge amounts of Semantic Web data has usually required a lot of computational power. This is no longer true if one makes use of recent research outcomes such as modern RDF indexing strategies. We present a mobile conference guide application that combines several different RDF data sets to present interlinked information about publications, conferences, authors, locations, and more to the user. With our application we show that it is possible to store a large amount of indexed data on an iPhone/iPod Touch device. That querying is also efficient is demonstrated by creating the application's actual content out of real-time queries on the data.
-
Bettina Bauer-Messmer, Thomas Scharrenbach, Rolf Grütter, Improving an Environmental Ontology by Incorporating User-Input, Environmental Informatics and Industrial Ecology. Proceedings of the 22nd International Conference on Informatics for Environmental Protection, September 2008. (inproceedings)
-
Rolf Grütter, Thomas Scharrenbach, Bettina Bauer-Messmer, Improving an RCC-Derived Geospatial Approximation by OWL Axioms, The Semantic Web - ISWC 2008, 7th International Semantic Web Conference, October 2008, Springer. (inproceedings)
-
Panagiotis Karras, Nikos Mamoulis, Lattice Histograms: a Resilient Synopsis Structure, Proc. of the 24th IEEE Intl Conf. on Data Engineering (ICDE), February 2008, IEEE Computer Society. (inproceedings)
-
Eirik Aune, Adrian Bachmann, Abraham Bernstein, Christian Bird, Premkumar Devanbu, Looking Back on Prediction: A Retrospective Evaluation of Bug-Prediction Techniques, November 2008. (misc/poster)
-
Alexandros Kalousis, Abraham Bernstein, Melanie Hilario, Meta-learning with kernels and similarity functions for planning of data mining workflows, Proceedings of the ICML/COLT/UAI 2008 Planning to Learn Workshop, Editor(s): Pavel Brazdil, Abraham Bernstein, Larry Hunter, July ; 2008. (inproceedings)
We propose an intelligent data mining (DM) assistant that will combine planning and meta-learning to provide support to users of a virtual DM laboratory. A knowledge-driven planner will rely on a data mining ontology to plan the knowledge discovery workflow and determine the set of valid operators for each step of this workflow. A probabilistic meta-learner will select the most appropriate operators by using relational similarity measures and kernel functions over records of past sessions' meta-data stored in a DM experiments repository.
-
Katharina Reinecke, Abraham Bernstein, Predicting User Interface Preferences of Culturally Ambiguous Users, Proceedings of 26th Conference Extended Abstracts on Human Factors in Computing Systems (CHI) 2008. (inproceedings)
To date, localized user interfaces are still being adapted to one nation, not taking into account cultural ambiguities of people within this nation. We have developed an approach to cultural user modeling, which makes it possible to personalize user interfaces to an individual's cultural background. The study presented in this paper shows how we use this approach to predict user interface preferences. Results show that we are able to reduce the absolute error of this prediction to 1.079 on a rating scale of 5. These findings suggest that it is possible to automate the process of localization and, thus, to automatically personalize user interfaces for users of different cultural backgrounds.
-
Pavel Brazdil, Abraham Bernstein, Larry Hunter, Proceedings of the Second Planning to Learn Workshop at ICML/COLT/UAI 2008, University of Zurich, Department of Informatics, July 2008. (book)
-
Man Lung Yiu, Panagiotis Karras, Nikos Mamoulis, Ring-constrained Join: Deriving Fair Middleman Locations from Pointsets via a Geometric Constraint, Proc. of the 11th Intl Conf. on Extending Database Technology (EDBT), February 2008. (inproceedings)
-
Jonas Tappolet, Semantics-aware Software Project Repositories, ESWC 2008 Ph.D. Symposium, June 2008. (inproceedings)
This proposal explores a general framework to solve software analysis tasks using ontologies. Our aim is to build semantically annotated, flexible, and extensible software repositories to overcome data representation, intra- and inter-project integration difficulties as well as to make the tedious and error-prone extraction and preparation of meta-data obsolete. We also outline a number of practical evaluation approaches for our propositions.
-
Amancio Bouza, Gerald Reif, Abraham Bernstein, Harald C. Gall, SemTree: Ontology-Based Decision Tree Algorithm for Recommender Systems, In Proceedings of the 7th International Semantic Web Conference, October 2008. (inproceedings/Poster)
Recommender systems play an important role in supporting people when choosing items from an overwhelmingly large number of choices. So far, no recommender system makes use of domain knowledge. We model user preferences with a machine learning approach to recommend items to people by predicting the item ratings. Specifically, we propose SemTree, an ontology-based decision tree learner that uses a reasoner and an ontology to semantically generalize item features in order to improve the effectiveness of the decision tree built. We show that SemTree outperforms comparable approaches, providing more accurate recommendations by taking domain knowledge into account.
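A toy sketch of the underlying idea, not the actual SemTree algorithm (which relies on an OWL ontology and a reasoner): lifting sparse, specific item features to a superclass before decision tree induction lets the learner share statistical support across related values. The mini-hierarchy, samples, and labels below are invented:

```python
# Toy sketch of ontology-based generalization (not the actual SemTree code):
# lift specific item features to their superclass before inducing a decision
# tree, so rare specific values share statistical support.
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mini-ontology: specific genres and their superclass.
SUPERCLASS = {
    "space_opera": "science_fiction", "cyberpunk": "science_fiction",
    "slasher": "horror", "ghost_story": "horror",
}

# (genre, user liked it?) pairs; each specific genre is rare on its own.
samples = [("space_opera", 1), ("cyberpunk", 1), ("slasher", 0), ("ghost_story", 0)] * 3

# Generalize every feature to its superclass, then train a plain decision tree.
features = [SUPERCLASS[g] for g, _ in samples]
labels = [liked for _, liked in samples]
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform([[f] for f in features])
tree = DecisionTreeClassifier(random_state=0).fit(X, labels)

# An unseen specific genre (say a hypothetical "time_travel" movie) would also
# be generalized to "science_fiction" and thus still receive a prediction.
new_item = encoder.transform([["science_fiction"]])
print("predicted rating class:", tree.predict(new_item)[0])
```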
-
David Kurz, Abraham Bernstein, Katrin Hunt, Dragana Radovanovic, Paul E. Erne, Z Siudak, Osmund Bertel, Simple point of care risk stratification in acute coronary syndromes: The AMIS model, Heart 2008. (article)
Background: Early risk stratification is important in the management of patients with acute coronary syndromes (ACS).
Objective: To develop a rapidly available risk stratification tool for use in all ACS.
Design and methods: Application of modern data mining and machine learning algorithms to a derivation cohort of 7520 ACS patients included in the AMIS (Acute Myocardial Infarction in Switzerland)-Plus registry between 2001 and 2005; prospective model testing in two validation cohorts.
Results: The most accurate prediction of in-hospital mortality was achieved with the "Averaged One-Dependence Estimators" (AODE) algorithm, with input of 7 variables available at first patient contact: age, Killip class, systolic blood pressure, heart rate, pre-hospital cardio-pulmonary resuscitation, history of heart failure, and history of cerebrovascular disease. The c-statistic for the derivation cohort (0.875) was essentially maintained in important subgroups, and calibration over five risk categories, ranging from <1% to >30% predicted mortality, was accurate. Results were validated prospectively against an independent AMIS-Plus cohort (n=2854, c-statistic 0.868) and the Krakow-Region ACS Registry (n=2635, c-statistic 0.842). The AMIS model significantly outperformed established "point-of-care" risk prediction tools in both validation cohorts. In comparison to a logistic regression-based model, the AODE-based model proved to be more robust when tested on the Krakow validation cohort (c-statistic 0.842 vs. 0.746). Accuracy of the AMIS model prediction was maintained at 12-month follow-up in an independent cohort (n=1972, c-statistic 0.877).
Conclusions: The AMIS model is a reproducibly accurate point-of-care risk stratification tool for the complete range of ACS, based on variables available at first patient contact.
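For readers unfamiliar with the AODE algorithm named above, the following compact sketch shows how an Averaged One-Dependence Estimators classifier combines one-dependence estimates over categorical inputs; the variables and records are invented for illustration and do not reproduce the AMIS model or its data:

```python
# Compact AODE sketch for categorical inputs (illustrative only; the clinical
# variables and records below are made up, not the AMIS model or its data).
from collections import Counter

def train_aode(X, y):
    classes = sorted(set(y))
    n_feat = len(X[0])
    pair = Counter()    # counts of (class, i, x_i)
    joint = Counter()   # counts of (class, i, x_i, j, x_j)
    values = [sorted({row[i] for row in X}) for i in range(n_feat)]
    for row, c in zip(X, y):
        for i in range(n_feat):
            pair[(c, i, row[i])] += 1
            for j in range(n_feat):          # j == i is included; it only adds a near-one factor
                joint[(c, i, row[i], j, row[j])] += 1
    return {"classes": classes, "values": values, "pair": pair,
            "joint": joint, "n": len(X), "n_feat": n_feat}

def predict_proba(model, row):
    scores = {}
    for c in model["classes"]:
        total = 0.0
        for i in range(model["n_feat"]):
            # Laplace-smoothed P(c, x_i), the "super-parent" term
            p = (model["pair"][(c, i, row[i])] + 1) / (
                model["n"] + len(model["classes"]) * len(model["values"][i]))
            for j in range(model["n_feat"]):
                # Laplace-smoothed P(x_j | c, x_i)
                p *= (model["joint"][(c, i, row[i], j, row[j])] + 1) / (
                    model["pair"][(c, i, row[i])] + len(model["values"][j]))
            total += p
        scores[c] = total
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}

# Hypothetical categorical records: (age band, Killip class, resuscitated) -> died?
X = [("<65", "I", "no"), ("<65", "I", "no"), (">=65", "III", "yes"),
     (">=65", "II", "no"), (">=65", "IV", "yes"), ("<65", "II", "no")]
y = [0, 0, 1, 0, 1, 0]
print(predict_proba(train_aode(X, y), (">=65", "III", "yes")))
```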
-
Markus Stocker, Andy Seaborne, Abraham Bernstein, Christoph Kiefer, Dave Reynolds, SPARQL Basic Graph Pattern Optimization Using Selectivity Estimation, Proceedings of the 17th International World Wide Web Conference (WWW), April 2008, ACM. (inproceedings)
In this paper, we formalize the problem of Basic Graph Pattern (BGP) optimization for SPARQL queries and main memory graph implementations of RDF data. We define and analyze the characteristics of heuristics for selectivity-based static BGP optimization. The heuristics range from simple triple pattern variable counting to more sophisticated selectivity estimation techniques. Customized summary statistics for RDF data enable the selectivity estimation of joined triple patterns and the development of efficient heuristics. Using the Lehigh University Benchmark (LUBM), we evaluate the performance of the heuristics for the queries provided by the LUBM and discuss some of them in more detail.
Note that the SPARQL versions of the 14 LUBM queries and the University0 data set we used in this paper can be downloaded from here.
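A minimal sketch of the simplest kind of heuristic mentioned above, plain variable counting over triple patterns; the weights, namespaces, and example BGP are illustrative assumptions, not the paper's exact heuristics or statistics:

```python
# Sketch of a plain variable-counting heuristic: patterns with fewer unbound
# positions are assumed more selective and are executed first. The weights and
# the example BGP below are illustrative assumptions.
def is_var(term):
    return term.startswith("?")

def pattern_cost(triple):
    weights = (1, 2, 4)   # illustrative penalty for an unbound subject/predicate/object
    return sum(w for term, w in zip(triple, weights) if is_var(term))

bgp = [
    ("?x", "ub:takesCourse", "?c"),
    ("?x", "rdf:type", "ub:GraduateStudent"),
    ("?c", "rdf:type", "ub:Course"),
    ("?x", "ub:advisor", "?prof"),
]

for triple in sorted(bgp, key=pattern_cost):
    print(pattern_cost(triple), triple)
```

The more sophisticated heuristics replace such fixed weights with selectivity estimates derived from summary statistics over the data.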
-
Robinson Aschoff, Abraham Bernstein, Suchmethoden im Netz: heute - morgen, digma. Zeitschrift für Datenrecht und Informationssicherheit 8, September 2008. (article)
From the conventional search engine to the vision of an answer that understands the question: potentials and limitations. The development of search technologies for the World Wide Web is one of the central challenges in computer science today. Social-search approaches represent an alternative to today's algorithm-based search engines. The Semantic Web, finally, embodies the vision of being able to answer complex natural-language queries.
-
Christoph Kiefer, Abraham Bernstein, The Creation and Evaluation of iSPARQL Strategies for Matchmaking, Proceedings of the 5th European Semantic Web Conference (ESWC), February 2008, Springer. (inproceedings)
This research explores a new method for Semantic Web service matchmaking based on iSPARQL strategies, which makes it possible to query the Semantic Web with techniques from traditional information retrieval. The strategies for matchmaking that we developed and evaluated can make use of a plethora of similarity measures and combination functions from SimPack, our library of similarity measures. We show how our combination of structured and imprecise querying can be used to perform hybrid Semantic Web service matchmaking. We analyze our approach thoroughly on a large OWL-S service test collection and show how our initial strategies can be improved by applying machine learning algorithms, resulting in very effective strategies for matchmaking.
-
Katharina Reinecke, Abraham Bernstein, Stefanie Hauske, To Make or to Buy? Sourcing Decisions at the Zurich Cantonal Bank, Proceedings of the International Conference on Information Systems (ICIS) 2008. (inproceedings)
The case study describes the IT situation at Zurich Cantonal Bank around the turn of the millennium. It shows how the legacy systems, incapable of fulfilling the company's strategic goals, force the company into the decision either to modify the old systems or to replace them with standard software packages: to make or to buy? The case study introduces the bank's strategic goals and their importance for the three make-or-buy alternatives. All solutions are described in detail; however, the bank's decision is left open for students to decide. For a thorough analysis of the situation, the student is required to take the position of the key decision maker at Zurich Cantonal Bank, calculating risks and balancing the advantages and disadvantages of each solution. Six video interviews reveal further technical and interpersonal aspects of the decision-making process at the bank, as well as of the situation today.
-
David Kurz, Abraham Bernstein, Katrin Hunt, Z Siudak, D Dudek, Dragana Radovanovic, Paul E. Erne, Osmund Bertel, Validation of the AMIS risk stratification model for acute coronary syndromes in an external cohort in Jahrestagung der Schweizerischen Gesellschaft für Kardiologie, May 2008. (misc)
Background: We recently reported the development of the AMIS (Acute Myocardial Infarction in Switzerland) risk stratification model for patients with acute coronary syndrome (ACS). This model predicts hospital mortality risk across the complete spectrum of ACS based on 7 parameters available in the prehospital phase. Since the AMIS model was developed on a Swiss dataset in which the majority of patients were treated by primary PCI, we sought validation on an external cohort treated with a more conservative strategy.
Methods: The Krakow Region (Malopolska) ACS registry included patients treated with a non-invasive strategy in 29 hospitals in the greater Krakow (PL) area between 2002-2006. In-hospital mortality risk was calculated using the AMIS model (input parameters: age, Killip class, systolic blood pressure, heart rate, pre-hospital resuscitation, history of heart failure, and history of cerebrovascular disease; risk calculator available at www.amis-plus.ch). Discriminative performance was quantified as "area under the curve" (AUC, range 0-1) in a receiver operator characteristic, and was compared to the risk scores for ST-elevation myocardial infarction (STEMI) and Non-STE-ACS from the TIMI study group.
Results: Among the 2635 patients included in the registry (57% male, mean age 68.2±11.5 years, 31% STEMI) hospital mortality was 7.6%. The AUC using the AMIS model was 0.842, compared to 0.724 for the TIMI risk score for STEMI or 0.698 for the TIMI risk score for Non-STE-ACS (Fig. A). Risk calibration was maintained with the AMIS model over the complete range of risks (Fig. B). The performance of the AMIS model in this cohort was comparable to that found in the AMIS validation cohort (n=2854, AUC 0.868).
Conclusions: The AMIS risk prediction model for ACS displayed an excellent predictive performance in this non-invasively-treated external cohort, confirming the reliability of this bedside "point-of-care" model in everyday practice.
-
Nicolas Bettenburg, Sascha Just, Adrian Schröter, Cathrin Weiss, Rahul Premraj, Thomas Zimmermann, What Makes a Good Bug Report?, Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE), February 2008. (inproceedings)
-
Stefanie Hauske, Gerhard Schwabe, Abraham Bernstein, Wiederverwendung multimedialer und online verfügbarer Selbstlernmodule in der Wirtschaftsinformatik - Lessons Learned, Multikonferenz WIrtschaftsinformatik MKWI 2008, February 2008. (inproceedings)
The reusability of digital learning content was a central question in the e-learning project "Foundations of Information Systems (FOIS)", a joint project of five Swiss universities. During the project, twelve multimedia self-study modules available online were produced; they cover a broad spectrum of information systems topics and are primarily used in introductory information systems courses following a blended-learning approach. In this article we describe how the aspects essential for the reuse of e-learning content and materials, namely flexibility, context independence, unification of content and didactics, and blended-learning use, were implemented in the project. In the second part we discuss the experiences that we and our students have gathered with the FOIS modules in teaching at the University of Zurich, and we present evaluation results from three courses as well as our lessons learned.
-
Christoph Kiefer, Abraham Bernstein, Jonas Tappolet, Analyzing Software with iSPARQL, Proceedings of the 3rd International Workshop on Semantic Web Enabled Software Engineering (SWESE 2007), June 2007, Springer. (inproceedings)
-
Dorothea Wagner, Abraham Bernstein, Thomas Dreier, Steffen Hölldobler, Günter Hotz, Klaus-Peter Löhr, Paul Molitor, Rüdiger Reischuk, Dietmar Saupe, Myra Spiliopoulou, Ausgezeichnete Informatikdissertationen 2006, Lecture Notes in Informatics, Gesellschaft für Informatik (GI)2007. (book)
-
Hülya Topcuoglu, Katharina Reinecke, Stefanie Hauske, Abraham Bernstein, CaseML - Enabling Multifaceted Learning Scenarios with a Flexible Markup Language for Business Case Studies, ED Media 2007 2007. (inproceedings)
-
Isabelle Guyon, Jiwen Li, Theodor Mador, Patrick A. Pletscher, Gerold Schneider, Markus Uhr, Competitive baseline methods set new standards for the NIPS 2003 feature selection benchmark, Pattern Recognition Letters 28, September 2007. (article)
We used the datasets of the NIPS 2003 challenge on feature selection as part of the practical work of an undergraduate course on feature extraction. The students were provided with a toolkit implemented in Matlab. Part of the course requirements was that they should outperform given baseline methods. The results were beyond expectations: the students matched or exceeded the performance of the best challenge entries and achieved very effective feature selection with simple methods. We make available to the community the results of this experiment and the corresponding teaching material. These results also provide a new baseline for researchers in feature selection.
-
Katharina Reinecke, Gerald Reif, Abraham Bernstein, Cultural User Modeling With CUMO: An Approach to Overcome the Personalization Bootstrapping Problem, First International Workshop on Cultural Heritage on the Semantic Web at the 6th International Semantic Web Conference (ISWC 2007), November 12 2007. (inproceedings)
The increasing interest in personalizable applications for heterogeneous user populations has heightened the need for a more efficient acquisition of start-up information about the user. We argue that the user's cultural background is suitable for predicting various adaptation preferences at once. With these as a basis, we can accelerate the initial acquisition process. The paper presents an approach to factoring culture into user models. We introduce the cultural user model ontology CUMO, describing how and to which extent it can accurately represent the user's cultural background. Furthermore, we outline its use as a re-usable and shared knowledge base in a personalization process, before presenting a plan of our future work towards cultural personalization.
-
Katharina Reinecke, Abraham Bernstein, Culturally Adaptive Software: Moving Beyond Internationalization, Proceedings of the HCI International (HCII), July 2007, Springer. (inproceedings)
So far, culture has played a minor role in the design of software. Our experience with imbuto, a program designed for Rwandan agricultural advisors, has shown that cultural adaptation increased efficiency, but was extremely time-consuming and, thus, prohibitively expensive. In order to bridge the gap between cost savings on the one hand and international usability on the other, this paper promotes the idea of culturally adaptive software. In contrast to manual localization, adaptive software is able to acquire details about an individual's cultural identity during use. Combining insights from the related fields of international usability, user modeling, and user interface adaptation, we show how research findings can be exploited for an integrated approach to automatically adapt software to the user's cultural frame.
-
Panagiotis Karras, Dimitris Sacharidis, Nikos Mamoulis, Exploiting Duality in Summarization with Deterministic Guarantees, Proc. of the 13th ACM SIGKDD Intl Conf. on Knowledge Discovery and Data Mining (KDD) 2007, ACM. (inproceedings)
-
Gabriel Ghinita, Panagiotis Karras, Panos Kalnis, Nikos Mamoulis, Fast Anonymization with Low Information Loss, Proc. of the 33rd Intl Conf. on Very Large Data Bases (VLDB) 2007. (inproceedings)
-
Katharina Reinecke, Hülya Topcuoglu, Stefanie Hauske, Abraham Bernstein, Flexibilisierung der Lehr- und Lernszenarien von Business-Fallstudien durch CaseML, 5. E-Learning-Fachtagung DELFI 2007. (inproceedings)
This paper presents a markup language for multimedia, modularized case studies used in information systems teaching. While most case studies are written for one specific teaching and learning situation, the case studies described here are meant to be usable flexibly and modularly for different assignments and in different teaching and learning scenarios. This requires a flexible representation of the case studies, which is ensured by CaseML, the markup language we have developed.
-
Cathrin Weiss, Rahul Premraj, Thomas Zimmermann, Andreas Zeller, How Long will it Take to Fix This Bug?, Proceedings of the Fourth International Workshop on Mining Software Repositories, Editor(s): Harald C. Gall, Michele Lanza, May ; 2007, IEEE Computer Society. (inproceedings)
Predicting the time and effort for a software problem has long been a difficult task. We present an approach that automatically predicts the fixing effort, i.e., the person-hours spent on fixing an issue. Our technique leverages existing issue tracking systems: given a new issue report, we use the Lucene framework to search for similar, earlier reports and use their average time as a prediction. Our approach thus allows for early effort estimation, helping in assigning issues and scheduling stable releases. We evaluated our approach using effort data from the JBoss project. Given a sufficient number of issue reports, our automatic predictions are close to the actual effort; for issues that are bugs, we are off by only one hour, beating naive predictions by a factor of four.
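The following sketch captures the retrieval idea with off-the-shelf components, using TF-IDF and a nearest-neighbour index in place of Lucene, and invented reports and effort values rather than the JBoss data:

```python
# Rough sketch of the idea (not the paper's Lucene/JBoss pipeline): retrieve the
# textually most similar earlier issue reports and use their mean effort as the
# prediction for a new report. Reports and hours below are invented.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

past_reports = [
    "NullPointerException when saving empty project",
    "Crash on startup with corrupted workspace metadata",
    "Saving a project with no files throws NullPointerException",
    "Toolbar icons misaligned on high-DPI displays",
]
past_effort_hours = np.array([4.0, 16.0, 5.0, 2.0])

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(past_reports)
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)

new_report = "NullPointerException while saving project without files"
_, neighbor_ids = index.kneighbors(vectorizer.transform([new_report]))
prediction = past_effort_hours[neighbor_ids[0]].mean()
print(f"predicted fixing effort: {prediction:.1f} person-hours")
```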
-
Esther Kaufmann, Abraham Bernstein, How Useful are Natural Language Interfaces to the Semantic Web for Casual End-users?, 6th International Semantic Web Conference (ISWC 2007), March 2007. (inproceedings)
Natural language interfaces offer end-users a familiar and convenient option for querying ontology-based knowledge bases. Several studies have shown that they can achieve high retrieval performance as well as domain independence. This paper focuses on usability and investigates whether NLIs are useful from an end-user's point of view. To that end, we introduce four interfaces, each allowing a different query language, and present a usability study benchmarking these interfaces. The results of the study reveal a clear preference for full sentences as the query language and confirm that NLIs are useful for querying Semantic Web data.
-
Christoph Kiefer, Imprecise SPARQL: Towards a Unified Framework for Similarity-Based Semantic Web Tasks, Proceedings of 2nd Knowledge Web PhD Symposium (KWEPSY) colocated with the 4th Annual European Semantic Web Conference (ESWC), June 2007. (inproceedings)
This proposal explores a unified framework to solve Semantic Web tasks that often require similarity measures, such as RDF retrieval, ontology alignment, and semantic service matchmaking. Our aim is to see how far it is possible to integrate user-defined similarity functions (UDSF) into SPARQL to achieve good results for these tasks. We present some research questions, summarize the experimental work conducted so far, and present our research plan that focuses on the various challenges of similarity querying within the Semantic Web.
-
Abraham Bernstein, Jayalath Ekanayake, Martin Pinzger, Improving Defect Prediction Using Temporal Features and Non Linear Models, Proceedings of the International Workshop on Principles of Software Evolution, September 2007, IEEE Computer Society. (inproceedings)
Predicting the defects in the next release of a large software system is a very valuable asset for the project manager to plan her resources. In this paper we argue that temporal features (or aspects) of the data are central to prediction performance. We also argue that the use of non-linear models, as opposed to traditional regression, is necessary to uncover some of the hidden interrelationships between the features and the defects and to maintain the accuracy of the prediction in some cases.
Using data obtained from the CVS and Bugzilla repositories of the Eclipse project, we extract a number of temporal features, such as the number of revisions and the number of reported issues within the last three months. We then use these data to predict both the location of defects (i.e., the classes in which defects will occur) as well as the number of reported bugs in the next month of the project. To that end we use standard tree-based induction algorithms in comparison with traditional regression.
Our non-linear models uncover the hidden relationships between features and defects and present them in an easy to understand form. Results also show that using the temporal features our prediction model can predict whether a source file will have a defect with an accuracy of 99% (area under ROC curve 0.9251) and the number of defects with a mean absolute error of 0.019 (Spearman's correlation of 0.96).
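The sketch below illustrates the experimental setup in miniature, on synthetic logs rather than the Eclipse data: it derives "activity in the last three months" style features and compares a tree-based learner with plain linear regression on a target that contains a non-linear interaction:

```python
# Illustrative sketch only (synthetic logs, not the Eclipse CVS/Bugzilla data):
# build temporal features such as "revisions in the last 3 months" per file and
# compare a tree-based learner against plain linear regression.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
n_files, n_months = 200, 12

revisions = rng.poisson(2, (n_files, n_months))   # per-file monthly revision counts
issues = rng.poisson(1, (n_files, n_months))      # per-file monthly reported issues

# Temporal features at month t: activity within the last three months.
t = 9
X = np.column_stack([
    revisions[:, t - 3:t].sum(axis=1),
    issues[:, t - 3:t].sum(axis=1),
])
# Synthetic target with a non-linear interaction between the two features.
y_next = (X[:, 0] * X[:, 1] > 8).astype(float) * rng.poisson(3, n_files)

# In-sample comparison, for illustration only.
for name, model in [("tree", DecisionTreeRegressor(max_depth=4, random_state=0)),
                    ("linear", LinearRegression())]:
    pred = model.fit(X, y_next).predict(X)
    print(name, "MAE:", round(mean_absolute_error(y_next, pred), 3))
```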
-
Jacek Ratzinger, Thomas Sigmund, Peter Vorburger, Harald C. Gall, Mining Software Evolution to Predict Refactoring, Proceedings of the International Symposium on Empirical Software Engineering and Measurement (ESEM 2007) 2007, IEEE Computer Society. (inproceedings)
Can we predict locations of future refactoring based on the development history? In an empirical study of open source projects we found that attributes of software evolution data can be used to predict the need for refactoring in the following two months of development. Information systems utilized in software projects provide a broad range of data for decision support. Versioning systems log each activity during the development, which we use to extract data mining features such as growth measures, relationships between classes, the number of authors working on a particular piece of code, etc. We use this information as input into classification algorithms to create prediction models for future refactoring activities. Different state-of-the-art classifiers are investigated such as decision trees, logistic model trees, propositional rule learners, and nearest neighbor algorithms. With both high precision and high recall we can assess the refactoring proneness of object-oriented systems. Although we investigate different domains, we discovered critical factors within the development life cycle leading to refactoring, which are common among all studied projects.
-
Christoph Kiefer, Abraham Bernstein, Jonas Tappolet, Mining Software Repositories with iSPARQL and a Software Evolution Ontology, Proceedings of the 2007 International Workshop on Mining Software Repositories (MSR '07), March 2007, IEEE Computer Society. (inproceedings)
One of the most important decisions researchers face when analyzing the evolution of software systems is the choice of a proper data analysis/exchange format. Most existing formats have to be processed with special programs written specifically for that purpose and are not easily extendible. Most scientists, therefore, use their own database(s), requiring each of them to repeat the work of writing the import/export programs to their format. We present EvoOnt, a software repository data exchange format based on the Web Ontology Language (OWL). EvoOnt includes software, release, and bug-related information. Since OWL describes the semantics of the data, EvoOnt is (1) easily extendible, (2) comes with many existing tools, and (3) allows assertions to be derived through its inherent Description Logic reasoning capabilities. The paper also shows iSPARQL, our SPARQL-based Semantic Web query engine containing similarity joins. Together with EvoOnt, iSPARQL can accomplish a sizable number of tasks sought in software repository mining projects, such as an assessment of the amount of change between versions or the detection of bad code smells. To illustrate the usefulness of EvoOnt (and iSPARQL), we perform a series of experiments with a real-world Java project. These show that a number of software analyses can be reduced to simple iSPARQL queries on an EvoOnt dataset.
-
Esther Kaufmann, Abraham Bernstein, Lorenz Fischer, NLP-Reduce: A "naïve" but Domain-independent Natural Language Interface for Querying Ontologies, November 2007. (misc)
Casual users are typically overwhelmed by the formal logic of the Semantic Web. The question is how to help casual users to query a web based on logic that they do not seem to understand. An often proposed solution is the use of natural language interfaces. Such tools, however, suffer from the problem that entries have to be grammatical. Furthermore, the systems are hardly adaptable to new domains. We address these issues by presenting NLP-Reduce, a "naïve," domain-independent natural language interface for the Semantic Web. The simple approach deliberately avoids any complex linguistic and semantic technology while still achieving good retrieval performance as shown by the preliminary evaluation.
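A very loose sketch of such a "naive" lookup, not the actual NLP-Reduce implementation: query words are matched against labels in a knowledge base (invented here) and the hits are assembled into candidate triple patterns:

```python
# Very loose sketch of a "naive" label lookup (not the actual NLP-Reduce code):
# the tiny knowledge base, labels, and stopword list below are invented.
KB = {
    "classes":    {"ex:Restaurant": "restaurant"},
    "properties": {"ex:locatedIn": "located in"},
    "instances":  {"ex:SanFrancisco": "san francisco"},
}
STOPWORDS = {"which", "are", "in", "the", "a", "show", "me"}

def match(words, labels):
    stems = {w.rstrip("s") for w in words}          # crude plural handling
    return [uri for uri, label in labels.items()
            if stems & {t.rstrip("s") for t in label.split()}]

def candidate_patterns(question):
    words = [w for w in question.lower().rstrip("?").split() if w not in STOPWORDS]
    patterns = [("?x", "rdf:type", cls) for cls in match(words, KB["classes"])]
    for prop in match(words, KB["properties"]):
        for inst in match(words, KB["instances"]):
            patterns.append(("?x", prop, inst))
    return patterns

print(candidate_patterns("Which restaurants are located in San Francisco?"))
```

The resulting triple patterns would then be combined into a query, which is roughly why the approach tolerates ungrammatical input: it never parses the sentence in the first place.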
-
Abraham Bernstein, Christoph Kiefer, Markus Stocker, OptARQ: A SPARQL Optimization Approach based on Triple Pattern Selectivity Estimation, Department of Informatics, University of Zurich 2007. (techreport)
Query engines for ontological data based on graph models mostly execute user queries without considering any optimization. Especially for large ontologies, optimization techniques are required to ensure that query results are delivered within reasonable time. OptARQ is a first prototype for SPARQL query optimization based on the concept of triple pattern selectivity estimation. The evaluation we conduct demonstrates how triple pattern reordering according to their selectivity affects the query execution performance.
-
Cathrin Weiss, Rahul Premraj, Thomas Zimmermann, Andreas Zeller, Predicting Effort to fix Software Bugs, Proceedings of the 9th Workshop Software Reengineering, May 2007. (inproceedings)
-
Nicolas Bettenburg, Sascha Just, Adrian Schröter, Cathrin Weiss, Rahul Premraj, Thomas Zimmermann, Quality of Bug Reports in Eclipse, Proceedings of the 2007 OOPSLA Workshop on Eclipse Technology eXchange, October 2007, ACM. (inproceedings)
The information in bug reports influences the speed at which bugs are fixed. However, bug reports differ in their quality of information. We conducted a survey among the ECLIPSE developers to determine which information in reports they widely use and which problems they frequently encounter. Our results show that steps to reproduce and stack traces are most sought after by developers, while inaccurate steps to reproduce and incomplete information pose the largest hurdles. Surprisingly, developers are indifferent to bug duplicates. Such insight is useful to design new bug tracking tools that guide reporters in providing more helpful information. We also present a prototype of a quality-meter tool that measures the quality of bug reports by scanning their content.
-
Christoph Kiefer, Abraham Bernstein, Hong Joo Lee, Mark Klein, Markus Stocker, Semantic Process Retrieval with iSPARQL, Proceedings of the 4th European Semantic Web Conference (ESWC '07), March 2007, Springer. (inproceedings)
The vision of semantic business processes is to enable the integration and inter-operability of business processes across organizational boundaries. Since different organizations model their processes differently, the discovery and retrieval of similar semantic business processes is necessary in order to foster inter-organizational collaborations. This paper presents our approach of using iSPARQL, our imprecise query engine based on SPARQL, to query the OWL MIT Process Handbook, a large collection of over 5000 semantic business processes. We particularly show how easy it is to use iSPARQL to perform the presented process retrieval task. Furthermore, since choosing the best performing similarity strategy is a non-trivial, data- and context-dependent task, we evaluate the performance of three simple and two human-engineered similarity strategies. In addition, we conduct machine learning experiments to learn similarity measures, showing that the complementary information contained in the different notions of similarity strategies provides a very high retrieval accuracy. Our preliminary results indicate that iSPARQL is indeed useful for extending the reach of queries and that it, therefore, is an enabler for inter- and intra-organizational collaborations.
-
Abraham Bernstein, Markus Stocker, Christoph Kiefer, SPARQL Query Optimization Using Selectivity Estimation, 2007. (misc)
This poster describes three static SPARQL optimization approaches for in-memory RDF graphs: (1) a selectivity estimation index (SEI) for single query triple patterns; (2) a query pattern index (QPI) for joined triple patterns; and (3) a hybrid optimization approach that combines both indexes. Using the Lehigh University Benchmark (LUBM), we show that the hybrid approach outperforms other SPARQL query engines such as ARQ and Sesame for in-memory graphs.
-
Christoph Kiefer, Abraham Bernstein, Markus Stocker, The Fundamentals of iSPARQL - A Virtual Triple Approach For Similarity-Based Semantic Web Tasks, Proceedings of the 6th International Semantic Web Conference (ISWC), March 2007, Springer. (inproceedings)
This research explores three SPARQL-based techniques to solve Semantic Web tasks that often require similarity measures, such as semantic data integration, ontology mapping, and Semantic Web service matchmaking. Our aim is to see how far it is possible to integrate customized similarity functions (CSF) into SPARQL to achieve good results for these tasks. Our first approach exploits virtual triples calling property functions to establish virtual relations among resources under comparison; the second approach uses extension functions to filter out resources that do not meet the requested similarity criteria; finally, our third technique applies new solution modifiers to post-process a SPARQL solution sequence. The semantics of the three approaches are formally elaborated and discussed. We close the paper with a demonstration of the usefulness of our iSPARQL framework in the context of a data integration and an ontology mapping experiment.
-
Panagiotis Karras, Nikos Mamoulis, The Haar+ Tree: a Refined Synopsis Data Structure, Proc. of the 23rd IEEE Intl Conf. on Data Engineering (ICDE) 2007, IEEE Computer Society. (inproceedings)
-
Abraham Bernstein, Michael Daenzer, The NExT System: Towards True Dynamic Adaptions of Semantic Web Service Compositions (System Description), Proceedings of the 4th European Semantic Web Conference (ESWC '07), March 2007, Springer. (inproceedings)
Traditional process support systems typically offer a static composition of atomic tasks to more powerful services. In the real world, however, processes change over time: business needs are rapidly evolving thus changing the work itself and relevant information may be unknown until workflow execution run-time. Hence, the static approach does not sufficiently address the need for dynamism. Based on applications in the life science domain this paper puts forward five requirements for dynamic process support systems. These demand a focus on a tight user interaction in the whole process life cycle. The system and the user establish a continuous feedback loop resulting in a mixed-initiative approach requiring a partial execution and resumption feature to adapt a running process to changing needs. Here we present our prototype implementation NExT and discuss a preliminary validation based on a real-world scenario.
-
Abraham Bernstein, Peter Vorburger, A Scenario-Based Approach for Direct Interruptability Prediction on Wearable Devices, Journal of Pervasive Computing and Communications 3, March 2006. (article)
People are subjected to a multitude of interruptions. This situation is likely to get worse as technological devices are making us increasingly reachable. In order to manage the interruptions it is imperative to predict a person's interruptability, that is, his/her current readiness or inclination to be interrupted. In this paper we introduce the approach of direct interruptability inference from sensor streams (accelerometer and audio data) in a ubiquitous computing setup and show that it provides highly accurate and robust predictions. Furthermore, we argue that scenarios are central for evaluating the performance of ubiquitous computing devices (and interruptability predicting devices in particular) and prove it on our setup. We also demonstrate that scenarios provide the foundation for avoiding misleading results, assessing the results' generalizability, and building a stratified scenario-based learning model, which greatly speeds up the training of such devices.
-
Dorothea Wagner, Ausgezeichnete Informatikdissertationen 2005, Series of the German Informatics society (GI) D-6, 2006. (book)
-
Tobias Sager, Abraham Bernstein, Martin Pinzger, Christoph Kiefer, Detecting Similar Java Classes Using Tree Algorithms, Proceedings of the International Workshop on Mining Software Repositories, May 2006, ACM. (inproceedings)
Similarity analysis of source code is helpful during development to provide, for instance, better support for code reuse. Consider a development environment that analyzes code while typing and suggests similar code examples or existing implementations from a source code repository. Mining software repositories by means of similarity measures enables and encourages reusing existing code and reduces the development effort by creating a shared knowledge base of code fragments. In information retrieval, similarity measures are often used to find documents similar to a given query document. This paper extends this idea to source code repositories. It introduces our approach to detect similar Java classes in software projects using tree similarity algorithms. We show how our approach allows finding similar Java classes based on an evaluation of three tree-based similarity measures in the context of five user-defined test cases as well as a preliminary software evolution analysis of a medium-sized Java project. Initial results of our technique indicate that it (1) is indeed useful to identify similar Java classes, (2) successfully identifies the ex-ante and ex-post versions of refactored classes, and (3) provides some interesting insights into within-version and between-version dependencies of classes within a Java project.
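As a much simplified, Python-flavoured analogue of tree-based code similarity (the paper operates on Java parse trees with dedicated tree similarity measures), the sketch below compares two snippets by the cosine similarity of their AST node-type counts:

```python
# Much simplified analogue of tree-based code similarity: compare a bag of AST
# node types between two snippets (the paper uses Java ASTs and richer measures).
import ast
import math
from collections import Counter

def node_profile(source):
    """Count AST node types in a piece of source code."""
    return Counter(type(n).__name__ for n in ast.walk(ast.parse(source)))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    return dot / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

snippet_a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
snippet_b = "def product(ys):\n    p = 1\n    for y in ys:\n        p *= y\n    return p\n"

print(round(cosine(node_profile(snippet_a), node_profile(snippet_b)), 3))
```

The two structurally near-identical toy functions score close to 1.0, which is the kind of signal a similarity-aware development environment could surface while typing.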
-
Patrick Ziegler, Christoph Kiefer, Christoph Sturm, Klaus R. Dittrich, Abraham Bernstein, Detecting Similarities in Ontologies with the SOQA-SimPack Toolkit, 10th International Conference on Extending Database Technology (EDBT 2006), Editor(s): Yannis Ioannidis, Marc H. Scholl, Joachim W. Schmidt, Florian Matthes, Mike Hatzopoulos, Klemens Boehm, Alfons Kemper, Torsten Grust, Christian Boehm, March ; 2006, Springer. (inproceedings)
Ontologies are increasingly used to represent the intended real-world semantics of data and services in information systems. Unfortunately, different databases often do not relate to the same ontologies when describing their semantics. Consequently, it is desirable to have information about the similarity between ontology concepts for ontology alignment and integration. This paper presents the SOQA-SimPack Toolkit (SST), an ontology language independent Java API that enables generic similarity detection and visualization in ontologies. We demonstrate SST's usefulness with the SOQA-SimPack Toolkit Browser, which allows users to graphically perform similarity calculations in ontologies.
-
David Kurz, Katrin Hunt, Abraham Bernstein, Dragana Radovanovic, Paul E. Erne, Jean-Christophe Stauffer, Osmund Bertel, Development of a novel risk stratification model to improve mortality prediction in acute coronary syndromes: the AMIS (Acute Myocardial Infarction in Switzerland) model, World Congress of Cardiology 2006, September 2006. (incollection/Abstract)
Background: Current established models predicting mortality in acute coronary syndrome (ACS) patients are derived from randomised controlled trials performed in the 1990's, and are thus based on and predictive for selected populations. These scores perform inadequately in patients treated according to current guidelines. The aim of this study was to develop a model with improved predictive performance applicable to all kinds of ACS, based on outcomes in real world patients from the new millennium.
Methods: The AMIS (Acute Myocardial Infarction in Switzerland)-Plus registry prospectively collects data from ACS patients admitted to 56 Swiss hospitals. Patients included in this registry between October 2001 and May 2005 (n = 7520) were the basis for model development. Modern data mining computational methods using new classification learning algorithms were tested to optimise mortality risk prediction using well-defined and non-ambiguous variables available at first patient contact. Predictive performance was quantified as "area under the curve" (AUC, range 0 - 1) in a receiver operating characteristic, and was compared to the benchmark risk score from the TIMI study group. Results were verified using 10-fold cross-validation.
Results: Overall, hospital mortality was 7.5%. The final prediction model was based on the "Averaged One-Dependence Estimators" algorithm and included the following 7 input variables: 1) Age, 2) Killip class, 3) systolic blood pressure, 4) heart rate, 5) pre-hospital mechanical resuscitation, 6) history of heart failure, 7) history of cerebrovascular disease. The output of the model was an estimate of in-hospital mortality risk for each patient. The AUC for the entire cohort was 0.875, compared to 0.803 for the TIMI risk score. The AMIS model performed equally well for patients with or without ST elevation myocardial infarction (AUC 0.879 and 0.868, respectively). Subgroup analysis according to the initial revascularisation modality indicated that the AMIS model performed best in patients undergoing PCI (AUC 0.884 vs. 0.783 for TIMI) and worst in patients receiving no revascularisation therapy (AUC 0.788 vs. 0.673 for TIMI). The model delivered an accurate and reproducible prediction over the complete range of risks and for all kinds of ACS.
Conclusions: The AMIS model performs about 10% better than established risk prediction models for hospital mortality in patients with all kinds of ACS in the modern era. Modern data mining algorithms proved useful to optimise the model development.
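A minimal sketch of the evaluation protocol described above (10-fold cross-validation scored by AUC), using synthetic stand-in data and scikit-learn's Gaussian naive Bayes in place of the Averaged One-Dependence Estimators learner, which scikit-learn does not provide; none of the numbers relate to the AMIS cohort.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the 7 admission variables (age, Killip class, ...),
# with roughly the cohort size and mortality rate mentioned in the abstract.
X, y = make_classification(n_samples=7520, n_features=7,
                           weights=[0.925, 0.075], random_state=0)

# 10-fold cross-validated area under the ROC curve.
auc = cross_val_score(GaussianNB(), X, y, cv=10, scoring="roc_auc")
print(f"mean AUC over 10 folds: {auc.mean():.3f} (+/- {auc.std():.3f})")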
-
David Kurz, Katrin Hunt, Abraham Bernstein, Dragana Radovanovic, Paul E. Erne, Jean-Christophe Stauffer, Development of a novel risk stratification model to improve mortality prediction in acute coronary syndromes: the AMIS model, Gemeinsame Jahrestagung der Schweizerischen Gesellschaften für Kardiologie, für Pneumologie, für Thoraxchirurgie, und für Intensivmedizin, June 2006. (incollection/Abstract)
Background: Current established models predicting mortality in acute coronary syndrome (ACS) patients are derived from randomised controlled trials performed in the 1990's, and are thus based on and predictive for selected populations. These scores perform inadequately in patients treated according to current guidelines. The aim of this study was to develop a model with improved predictive performance applicable to all kinds of ACS, based on outcomes in real world patients from the new millennium.
Methods: The AMIS-Plus registry prospectively collects data from ACS patients admitted to 56 Swiss hospitals. Patients included in this registry between October 2001 and May 2005 (n = 7520) were the basis for model development. Modern data mining computational methods using new classification learning algorithms were tested to optimise mortality risk prediction using well-defined and non-ambiguous variables available at first patient contact. Predictive performance was quantified as "area under the curve" (AUC, range 0 - 1) in a receiver operating characteristic, and was compared to the benchmark risk score from the TIMI study group. Results were verified using 10-fold cross-validation.
Results: Overall, hospital mortality was 7.5%. The final prediction model was based on the "Averaged One-Dependence Estimators" algorithm and included the following 7 input variables: 1) Age, 2) Killip class, 3) systolic blood pressure, 4) heart rate, 5) pre-hospital mechanical resuscitation, 6) history of heart failure, 7) history of cerebrovascular disease. The output of the model was an estimate of in-hospital mortality risk for each patient. The AUC for the entire cohort was 0.875, compared to 0.803 for the TIMI risk score. The AMIS model performed equally well for patients with or without ST elevation (AUC 0.879 and 0.868, respectively). Subgroup analysis according to the initial revascularisation modality indicated that the AMIS model performed best in patients undergoing PCI (AUC 0.884 vs. 0.783 for TIMI) and worst for patients receiving no revascularisation therapy (AUC 0.788 vs. 0.673 for TIMI). The model delivered an accurate and reproducible prediction over the complete range of risks and for all kinds of ACS.
Conclusions: The AMIS model performs about 10% better than established risk prediction models for hospital mortality in patients with all kinds of ACS in the modern era. Modern data mining algorithms proved useful to optimise the model development.
-
Peter Vorburger, Abraham Bernstein, Entropy-based Concept Shift Detection, IEEE International Conference on Data Mining (ICDM), March 2006. (inproceedings)
-
Hülya Topcuoglu, FAST - Flexible Assignment System, E-Learn 2006 2006. (inproceedings)
Nowadays the use of collaborative learning in e-Learning environments is becoming very popular. Even so, there is a lack of web-based assignment systems supporting collaborative learning. In this work, I report on a novel assignment system for organizing and realizing web-based exercises. This new flexible assignment system makes it possible to design and perform different exercises in various collaborative learning settings. For example, exercises can be arranged as tutorials as well as peer assessments. For an effective implementation of collaborative learning environments, the learning process must be structured. This can be done through so-called collaboration scripts and statechart diagrams.
-
Patrick Ziegler, Christoph Kiefer, Christoph Sturm, Klaus R. Dittrich, Abraham Bernstein, Generic Similarity Detection in Ontologies with the SOQA-SimPack Toolkit, SIGMOD Conference, June 2006, ACM. (inproceedings)
Ontologies are increasingly used to represent the intended real-world semantics of data and services in information systems. Unfortunately, different databases often do not relate to the same ontologies when describing their semantics. Consequently, it is desirable to have information about the similarity between ontology concepts for ontology alignment and integration. In this demo, we present the SOQA-SimPack Toolkit (SST) [7], an ontology language independent Java API that enables generic similarity detection and visualization in ontologies. We demonstrate SST's usefulness with the SOQA-SimPack Toolkit Browser, which allows users to graphically perform similarity calculations in ontologies.
-
Abraham Bernstein, Esther Kaufmann, GINO - A Guided Input Natural Language Ontology Editor, 5th International Semantic Web Conference (ISWC 2006), November 2006, Springer. (inproceedings)
The casual user is typically overwhelmed by the formal logic of the Semantic Web. The gap between the end user and the logic-based scaffolding has to be bridged if the Semantic Web's capabilities are to be utilized by the general public. This paper proposes that controlled natural languages offer one way to bridge the gap. We introduce GINO, a guided input natural language ontology editor that allows users to edit and query ontologies in a language akin to English. It uses a small static grammar, which it dynamically extends with elements from the loaded ontologies. The usability evaluation shows that GINO is well-suited for novice users when editing ontologies. We believe that the use of guided entry overcomes the habitability problem, which adversely affects most natural language systems. Additionally, the approach's dynamic grammar generation allows for easy adaptation to new ontologies.
-
Abraham Bernstein, Esther Kaufmann, Christian Kaiser, Christoph Kiefer, Ginseng: A Guided Input Natural Language Search Engine for Querying Ontologies, 2006 Jena User Conference, May 2006. (inproceedings)
-
Abraham Bernstein, Christoph Kiefer, Imprecise RDQL: Towards Generic Retrieval in Ontologies Using Similarity Joins, 21st Annual ACM Symposium on Applied Computing (ACM SAC 2006), April 2006, ACM. (inproceedings)
Traditional semantic web query languages support a logic-based access to the semantic web. They offer a retrieval (or reasoning) of data based on facts. On the traditional web and in databases, however, exact querying often provides an incomplete answer, as queries are over-specified or the mix of multiple ontologies/modelling differences requires "interpretational flexibility." Therefore, similarity measures or ranking approaches are frequently used to extend the reach of a query. This paper extends this idea to the semantic web. It introduces iRDQL, a semantic web query language with support for similarity joins. It is an extension of RDQL (RDF Data Query Language) that enables its users to query for similar resources, ranking the results using a similarity measure. We show how iRDQL allows us to extend the reach of a query by finding additional results. We quantitatively evaluated four similarity measures for their usefulness in iRDQL in the context of an OWL-S semantic web service retrieval test collection and compared the results to a specialized OWL-S matchmaker. Initial results of using iRDQL indicate that it is indeed useful for extending the reach of queries and that it is able to improve recall without overly sacrificing precision. We also found that our generic iRDQL approach was only slightly outperformed by the specialized algorithm.
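A minimal illustration of the similarity-join idea, not iRDQL's actual syntax or engine: exact matching retrieves nothing for an over-specified request, so the sketch instead ranks all candidate resources by a similarity measure over their property sets; the resource names and properties are invented.

# Hypothetical service descriptions: resource -> set of property values.
services = {
    "BookFlight":  {"travel", "air", "booking", "payment"},
    "BookHotel":   {"travel", "lodging", "booking", "payment"},
    "WeatherInfo": {"weather", "forecast"},
}

query = {"travel", "air", "booking", "insurance"}  # over-specified: no exact match exists

def jaccard(a, b):
    return len(a & b) / len(a | b)

# "Similarity join": pair the query with every resource and rank by similarity.
ranked = sorted(((jaccard(query, props), name) for name, props in services.items()),
                reverse=True)
for score, name in ranked:
    print(f"{score:.2f}  {name}")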
-
David Kurz, Katrin Hunt, Abraham Bernstein, Dragana Radovanovic, Paul E. Erne, Jean-Christophe Stauffer, Osmund Bertel, Inadequate performance of the TIMI risk prediction score for patients with ST-elevation myocardial infarction in the modern era, Gemeinsame Jahrestagung der Schweizerischen Gesellschaften für Kardiologie, für Pneumologie, für Thoraxchirurgie, und für Intensivmedizin, June 2006. (incollection/Abstract)
Background: Mortality prediction of patients admitted with ST elevation myocardial infarction (STEMI) is currently based on models derived from randomised controlled trials performed in the 1990's, with selective inclusion and exclusion criteria. It is unclear whether such models remain valid in community-based populations in the modern era.
Methods: The AMIS-Plus registry prospectively collects data from ACS patients admitted to 56 Swiss hospitals. We analysed hospital mortality for patients with ST-Elevation myocardial infarction (STEMI) included in this registry between 1997-2005, and compared it to mortality as predicted by the benchmark risk score from the TIMI study group. This is an integer score calculated from 10 weighted parameters available at admission. Each score value delivers a hospital mortality risk prediction (range 0.7% for 0 points, 31.7% for >8 points).
Results: Among 7875 patients with STEMI, overall hospital mortality was 7.3%. The TIMI risk score overestimated mortality risk at each score level for the entire population. Subgroup analysis according to initial revascularisation treatment (PCI [n=3358], thrombolysis [n=1842], none [n=2675]) showed an especially poor performance for patients treated by PCI. In this subgroup, no relevant increase in mortality was observed up to 5 points (actual mortality 2.7%, predicted 11.6%), and mortality remained below 5% up to 7 points (predicted 21.5%) (Figure 1).
Conclusions: The TIMI risk score overestimates the mortality risk and delivers poor stratification in real life patients with STEMI treated according to current guidelines.
-
David Kurz, Katrin Hunt, Abraham Bernstein, Dragana Radovanovic, Paul E. Erne, Osmund Bertel, Inadequate performance of the TIMI risk prediction score for patients with ST-elevation myocardial infarction treated according to current guidelines, World Congress of Cardiology 2006, September 2006. (incollection/Abstract)
Background: Mortality prediction of patients admitted with ST elevation myocardial infarction (STEMI) is currently based on models derived from randomised controlled trials performed in the 1990's, with selective inclusion and exclusion criteria. It is unclear whether such models remain valid in community-based populations in the modern era.
Methods: The AMIS (Acute Myocardial Infarction in Switzerland)-Plus registry prospectively collects data from ACS patients admitted to 56 Swiss hospitals. We analysed hospital mortality for patients with ST-elevation myocardial infarction (STEMI) included in this registry between 1997-2005, and compared it to mortality as predicted by the benchmark risk score from the TIMI study group. This is an integer score calculated from 10 weighted parameters available at admission. Each score value delivers a hospital mortality risk prediction (range 0.7% for 0 points, 31.7% for >8 points).
Results: Among 7875 patients with STEMI, overall hospital mortality was 7.3%. The TIMI risk score overestimated mortality risk at each score level for the entire population. Subgroup analysis according to initial revascularisation treatment (PCI [n=3358], thrombolysis [n=1842], none [n=2675]) showed an especially poor performance of the TIMI risk score for patients treated by PCI. In this subgroup, no relevant increase in mortality was observed up to 5 points (actual mortality 2.7%, predicted 11.6%), and mortality remained below 5% up to 7 points (predicted 21.5%) (Figure 1).
Conclusions: The TIMI risk score overestimates the mortality risk and delivers poor stratification in real life patients with STEMI treated according to current guidelines.
-
Stefanie Hauske, Kooperative Content-Erstellung mittels eines iterativen und prototypischen Vorgehens, E-Learning - Alltagstaugliche Innovation, Waxmann 2006. (inbook)
-
Patrick Knab, Martin Pinzger, Abraham Bernstein, Predicting Defect Densities in Source Code Files with Decision Tree Learners, MSR '06: Proceedings of the 2006 International Workshop on Mining Software Repositories, May 2006, ACM. (inproceedings)
With the advent of open source software repositories, the data available for defect prediction in source files increased tremendously. Although traditional statistics turned out to derive reasonable results, the sheer amount of data and the problem context of defect prediction demand sophisticated analysis such as that provided by current data mining and machine learning techniques.
In this work we focus on defect density prediction and present an approach that applies a decision tree learner on evolution data extracted from the Mozilla open source web browser project. The evolution data includes different source code, modification, and defect measures computed from seven recent Mozilla releases. Among the modification measures we also take into account the change coupling, a measure for the number of change-dependencies between source files. The main reason for choosing decision tree learners, instead of for example neural nets, was the goal of finding underlying rules which can be easily interpreted by humans. To find these rules, we set up a number of experiments to test common hypotheses regarding defects in software entities. Our experiments showed that a simple tree learner can produce good results with various sets of input data.
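A hedged sketch of the general setup rather than the paper's actual experiments: fit a decision tree regressor on made-up per-file evolution measures (lines added, number of modifications, change coupling) to predict defect density, then print the learned, human-readable rules.

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(42)
n_files = 500

# Synthetic per-file evolution measures (stand-ins for the Mozilla data).
lines_added     = rng.integers(0, 2000, n_files)
modifications   = rng.integers(0, 100, n_files)
change_coupling = rng.integers(0, 40, n_files)
X = np.column_stack([lines_added, modifications, change_coupling])

# Toy ground truth: defect density grows with modifications and coupling.
y = 0.01 * modifications + 0.02 * change_coupling + rng.normal(0, 0.1, n_files)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["lines_added", "modifications", "change_coupling"]))

Keeping the tree shallow (max_depth=3) is what makes the resulting rules readable, which mirrors the interpretability argument made in the abstract.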
-
Abraham Bernstein, Thomas Gschwind, Wolf Zimmermann, Proceedings of the Fourth IEEE European Conference on Web Services (ECOWS 2006), IEEE Computer Society, December 2006. (book)
-
Esther Kaufmann, Abraham Bernstein, Renato Zumstein, Querix: A Natural Language Interface to Query Ontologies Based on Clarification Dialogs, 5th International Semantic Web Conference (ISWC 2006), November 2006, Springer. (inproceedings)
The logic-based machine-understandable framework of the Semantic Web typically challenges casual users when they try to query ontologies. An often proposed solution to help casual users is the use of natural language interfaces. Such tools, however, suffer from one of the biggest problems of natural language: ambiguities. Furthermore, the systems are hardly adaptable to new domains. This paper addresses these issues by presenting Querix, a domain-independent natural language interface for the Semantic Web. The approach allows queries in natural language, thereby asking the user for clarification in case of ambiguities. The preliminary evaluation showed good retrieval performance.
-
Abraham Bernstein, Michael Daenzer, The NExT Process Workbench: Towards the Support of Dynamic Semantic Web Processes, ECOWS'06 Workshop on Semantics for Web Services 2006. (inproceedings)
Traditional process support systems offer the promise of software assembled from service elements. The typical approach is a static composition of atomic processes to more powerful services. In the real world, however, processes change over time: business needs are rapidly evolving and, thus, changing the work itself, and relevant information may be unknown until workflow execution run-time. Hence, the traditional, static approach does not sufficiently address the need for dynamism. Based on applications in the life science domain, this paper puts forward five requirements for a dynamic process support system. Specifically, these demand a focus on tight user interaction in the process discovery, composition, and execution phases. The system and the user establish a continuous feedback loop, resulting in a mixed-initiative approach. We also present a prototype implementation, NExT, which embodies this approach, and present a preliminary validation based on a real-world scenario as well as a comparison with other process support tools.
-
Abraham Bernstein, Benjamin Grosof, Michael Kifer, Beyond Monotonic Inheritance: Towards Non-Monotonic Semantic Web Process Ontologies, W3C Workshop On Frameworks for Semantics in Web Services, June 2005, World Wide Web Consortium. (inproceedings)
-
Adrian Bachmann, Design and Prototypical Implementation of an Accounting System for an AAA Server, August 2005. (misc/Semester Thesis)
The key aim of this thesis is to design and prototypically implement the Accounting module for an AAA server, based on the Generic AAA Architecture defined in RFC 2903 and the Diameter protocol specifications. The resulting protocol and architecture shall provide a solution for offering accounting services to a Mobile Grid. It will also be used at a later stage, together with various charging models, for creating a charging mechanism for future mobile grids. A mobile grid environment, by its heterogeneous nature, brings new challenges to all three A's in a traditional AAA environment. Regarding the accounting process, new types of resources have to be accounted for, which requires new parameters to be present in accounting records. Besides the traditional accounting of time, bytes, and packets, a grid service might need to account for CPU usage, memory consumption, or even accessed/contained information. The accounting module shall provide generic interfaces for possibly different monitoring entities that adapt to the type of resource being accounted for. Access to accounting data shall be secure and reliable. Secure in this context means that accounting records for a certain service can only be created by entities that are approved by that service provider. This requirement can be realized using X.509 certificates or other kinds of credential tokens. Encryption of accounting messages shall be offered as a communication option between the AAA client and the AAA server. Reliability refers to the possibility of retrieving accurate accounting information for a certain service/resource usage for charging purposes.
-
Abraham Bernstein, Peter Vorburger, Patrice Egger, Direct Interruptability Prediction and Scenario-based Evaluation of Wearable Devices: Towards Reliable Interruptability Predictions, First International Workshop on Managing Context Information in Mobile and Pervasive Environments MCMP-05, February 2005. (inproceedings)
In this paper we introduce the approach of direct interruptability inference from accelerometer and audio data and show that it provides highly accurate and robust predictions. Furthermore, we argue that scenarios are central for evaluating the performance of interruptability predicting devices and prove it on our setup. We also demonstrate that scenarios provide the foundation for avoiding misleading results, assessing the results' generalizability, and provide the basis for a stratified scenario-based learning model, which greatly speeds up the training of such devices.
-
Abraham Bernstein, Esther Kaufmann, Christoph Bürki, Mark Klein, How Similar Is It? Towards Personalized Similarity Measures in Ontologies, 7. Internationale Tagung Wirtschaftsinformatik, February 2005. (inproceedings)
Finding a good similarity assessment algorithm for the use in ontologies is central to the functioning of techniques such as retrieval, matchmaking, clustering, data-mining, ontology translations, automatic database schema matching, and simple object comparisons. This paper assembles a catalogue of ontology-based similarity measures, which are experimentally compared with a "similarity gold standard" obtained by surveying 50 human subjects. Results show that human and algorithmic similarity predictions varied substantially, but could be grouped into cohesive clusters. Addressing this variance, we present a personalized similarity assessment procedure, which uses a machine learning component to predict a subject's cluster membership, providing an excellent prediction of the gold standard. We conclude by hypothesizing ontology-dependent similarity measures.
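The two-step procedure, clustering the human similarity judgments and then learning to predict a new subject's cluster from a few of their answers, can be illustrated roughly as follows; the data are random stand-ins for the survey responses, and the choice of classifier is arbitrary.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# 50 subjects x 20 pairwise similarity ratings in [0, 1] (synthetic survey data).
ratings = rng.random((50, 20))

# Step 1: group subjects into cohesive clusters of similarity "styles".
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(ratings)

# Step 2: predict a subject's cluster membership from the first 5 ratings only,
# i.e. personalize the similarity measure after a handful of questions.
few_answers = ratings[:, :5]
acc = cross_val_score(LogisticRegression(max_iter=1000), few_answers, clusters, cv=5)
print(f"cluster prediction accuracy: {acc.mean():.2f}")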
-
Peter Vorburger, Abraham Bernstein, Alen Zurfluh, Interruptability Prediction Using Motion Detection, First International Workshop on Managing Context Information in Mobile and Pervasive Environments MCMP-05, May 2005. (inproceedings)
-
Abraham Bernstein, Christoph Kiefer, iRDQL - Imprecise Queries Using Similarity Joins for Retrieval in Ontologies, 4th International Semantic Web Conference, November 2005. (inproceedings)
-
Abraham Bernstein, Christoph Kiefer, iRDQL - Imprecise RDQL Queries Using Similarity Joins, K-CAP 2005 Workshop on: Ontology Management: Searching, Selection, Ranking, and Segmentation, October 2005. (inproceedings)
Traditional semantic web query languages support a logic-based access to the semantic web. They offer a retrieval (or reasoning) of data based on facts. On the traditional web and in databases, however, exact querying often provides an incomplete answer as queries are overspecified or the mix of multiple ontologies/modelling differences requires "interpretational flexibility."
This paper introduces iRDQL, a semantic web query language with support for similarity joins. It is an extension to RDQL that enables the user to query for similar resources in an ontology. A similarity measure is used to determine the degree of similarity between two semantic web resources. Similar resources are ranked by their similarity and returned to the user. We show how iRDQL allows us to extend the reach of a query by finding additional results. We quantitatively evaluated one measure of SimPack, our library of similarity measures for the use in ontologies, for its usefulness in iRDQL within the context of an OWL-S semantic web service retrieval test collection. Initial results of using iRDQL indicate that it is indeed useful for extending the reach of the query and that it is able to improve recall without overly sacrificing precision.
-
Abraham Bernstein, Esther Kaufmann, Anne Göhring, Christoph Kiefer, Querying Ontologies: A Controlled English Interface for End-users, 4th International Semantic Web Conference (ISWC 2005), November 2005. (inproceedings)
The semantic web presents the vision of a distributed, dynamically growing knowledge base founded on formal logic. Common users, however, seem to have problems even with the simplest Boolean expressions. As queries from web search engines show, the great majority of users simply do not use Boolean expressions. So how can we help users to query a web of logic that they do not seem to understand? We address this problem by presenting a natural language interface to semantic web querying. The interface allows formulating queries in Attempto Controlled English (ACE), a subset of natural English. Each ACE query is translated into a discourse representation structure, a variant of the language of first-order logic, which is then translated into an N3-based semantic web querying language using an ontology-based rewriting framework. As the validation shows, our approach offers great potential for bridging the gap between the logic-based semantic web and its real-world users, since it allows users to query the semantic web without having to learn an unfamiliar formal language. Furthermore, we found that users liked our approach and designed good queries resulting in a very good retrieval performance (100% precision and 90% recall).
-
Abraham Bernstein, Esther Kaufmann, Christian Kaiser, Querying the Semantic Web with Ginseng: A Guided Input Natural Language Search Engine, 15th Workshop on Information Technology and Systems (WITS 2005), December 2005. (inproceedings)
The Semantic Web presents the vision of a distributed, dynamically growing knowledge base founded on formal logic. Common users, however, seem to have problems even with the simplest Boolean expression. As queries from web search engines show, the great majority of users simply do not use Boolean expressions. So how can we help users to query a web of logic that they do not seem to understand?
We address this problem by presenting Ginseng, a quasi natural language guided query interface to the Semantic Web. Ginseng relies on a simple question grammar which gets dynamically extended by the structure of an ontology to guide users in formulating queries in a language seemingly akin to English. Based on the grammar Ginseng then translates the queries into a Semantic Web query language (RDQL), which allows their execution. Our evaluation with 20 users shows that Ginseng is extremely simple to use without any training (as opposed to any logic-based querying approach) resulting in very good query performance (precision = 92.8%, recall = 98.4%). We, furthermore, found that even with its simple grammar/approach Ginseng could process over 40% of questions from a query corpus without modification.
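A toy sketch of the guided-input idea, unrelated to Ginseng's real grammar engine: a small grammar, extended with class and property names that would come from a loaded ontology, offers the only legal continuations at each step of the question. All tokens, slots, and ontology terms below are invented.

# Static question skeleton; the bracketed slots get filled from the ontology.
GRAMMAR = {
    "START": ["what"],
    "what": ["is", "are"],
    "is": ["the"],
    "are": ["the"],
    "the": ["<property>"],
    "<property>": ["of"],
    "of": ["<class>"],
    "<class>": ["?"],
}

# Vocabulary dynamically harvested from a (hypothetical) loaded ontology.
ONTOLOGY = {"<property>": ["population", "capital"], "<class>": ["Switzerland", "Zurich"]}

def symbol_of(token):
    """Map a concrete ontology term back to its grammar slot, if any."""
    for slot, terms in ONTOLOGY.items():
        if token in terms:
            return slot
    return token

def suggestions(last_token):
    """Legal next tokens after last_token, expanding ontology-driven slots."""
    out = []
    for symbol in GRAMMAR.get(symbol_of(last_token), []):
        out.extend(ONTOLOGY.get(symbol, [symbol]))
    return out

print(suggestions("START"))       # ['what']
print(suggestions("the"))         # ['population', 'capital']
print(suggestions("population"))  # ['of']

The point of the sketch is only that every keystroke is constrained to sentences the system can later translate; the real grammar and the translation to RDQL are of course far richer.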
-
Steve Battle, Abraham Bernstein, Harold Boley, Benjamin Grosof, Michael Gruninger, Richard Hull, Michael Kifer, David Martin, Sheila McIlraith, Deborah McGuinness, Jianwen Su, Said Tabet, Semantic Web Services Framework (SWSF), Semantic Web Services Initiative (SWSI), April 2005. (techreport/Technical Report)
This is the initial technical report of the Semantic Web Services Language (SWSL) Committee of the Semantic Web Services Initiative (SWSI). This report consists of the following four top-level documents, with four related appendices.
* Semantic Web Services Framework (SWSF) Overview
* The Semantic Web Services Language (SWSL)
* The Semantic Web Services Ontology (SWSO)
* SWSF Application Scenarios
Appendices (of the Ontology document):
* PSL in SWSL-FOL and SWSL-Rules
* Axiomatization of the FLOWS Process Model
* Axiomatization of the Process Model in SWSL-Rules
* Reference Grammars
-
Abraham Bernstein, Esther Kaufmann, Christoph Kiefer, Christoph Bürki, SimPack: A Generic Java Library for Similarity Measures in Ontologies, University of Zurich, Department of Informatics, August 2005. (techreport)
Good similarity measures are central for techniques such as retrieval, matchmaking, clustering, data-mining, ontology translations, automatic database schema matching, and simple object comparisons. Measures for the use with complex (or aggregated) objects in ontologies are, however, rare, even though they are central for semantic web applications. This paper first introduces SimPack, a library of similarity measures for the use in ontologies (of complex objects). The measures of the library are then experimentally compared with a similarity "gold standard" established by surveying 94 human subjects in two ontologies. Results show that human and algorithm assessments vary (both between people and across ontologies), but can be grouped into cohesive clusters, each of which is well modeled by one of the measures in the library. Furthermore, we show two increasingly accurate methods to predict the cluster membership of the subjects, providing the foundation for the construction of personalized similarity measures.
-
Abraham Bernstein, Esther Kaufmann, Norbert E. Fuchs, Talking to the Semantic Web - A Controlled English Query Interface for Ontologies, AIS SIGSEMIS Bulletin 2, February 2005. (article)
The semantic web presents the vision of a distributed, dynamically growing knowledge base founded on formal logic. Common users, however, seem to have problems even with the simplest Boolean expression. As queries from web search engines show, the great majority of users simply do not use Boolean expressions. So how can we help users to query a web of logic that they do not seem to understand?
We address this problem by presenting a natural language front-end to semantic web querying. The front-end allows formulating queries in Attempto Controlled English (ACE), a subset of natural English. Each ACE query is translated into a discourse representation structure, a variant of the language of first-order logic, which is then translated into the semantic web querying language PQL. As examples show, our approach offers great potential for bridging the gap between the semantic web and its real-world users, since it allows users to query the semantic web without having to learn an unfamiliar formal language.
-
Peter Vorburger, Abraham Bernstein, Towards an Artificial Receptionist: Anticipating a Person's Phone Behavior, University of Zurich, Department of Informatics 2005. (techreport)
People are subjected to a multitude of interruptions, which in some situations are detrimental to their work performance. Consequently, the capability to predict a person's degree of interruptability (i.e., a measure of how detrimental an interruption would be to her current work) can provide a basis for a filtering mechanism. This paper introduces a novel approach to predict a person's presence and interruptability in an office-like environment based on audio, multi-sector motion detection using video, and the time of day collected as sensor data.
Conducting an experiment in a real office environment over the length of more than 40 work days, we show that the multi-sector motion detection data, which to our knowledge has been used for the first time to this end, outperforms audio data both in presence and interruptability prediction. We, furthermore, show that the combination of all three data streams improves the interruptability prediction accuracy and robustness. Finally, we use these data to predict a subject's phone behavior (ignore or accept the incoming phone call) by combining interruptability and the estimated importance of the call. We call such an application an artificial receptionist. Our analysis also shows that the results improve when taking the temporal aspect of the context into account.
-
Abraham Bernstein, Foster Provost, Shawndra Hill, Towards Intelligent Assistance for a Data Mining Process: An Ontology-based Approach for Cost-sensitive Classification, IEEE Transactions on Knowledge and Data Engineering 17, April 2005. (article)
A data mining (DM) process involves multiple stages. A simple, but typical, process might include preprocessing data, applying a data-mining algorithm, and postprocessing the mining results. There are many possible choices for each stage, and only some combinations are valid. Because of the large space and non-trivial interactions, both novices and data-mining specialists need assistance in composing and selecting DM processes. Extending notions developed for statistical expert systems, we present a prototype Intelligent Discovery Assistant (IDA), which provides users with (i) systematic enumerations of valid DM processes, in order that important, potentially fruitful options are not overlooked, and (ii) effective rankings of these valid processes by different criteria, to facilitate the choice of DM processes to execute. We use the prototype to show that an IDA can indeed provide useful enumerations and effective rankings in the context of simple classification processes. We discuss how an IDA could be an important tool for knowledge sharing among a team of data miners. Finally, we illustrate the claims with a comprehensive demonstration of cost-sensitive classification using a more involved process and data from the 1998 KDDCUP competition.
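To convey the core mechanics, here is a small sketch that enumerates valid process chains from an invented operator catalogue (each operator declares the data state it requires and produces) and ranks them by a toy speed score; none of the operator names or scores come from the paper.

# Hypothetical operator catalogue: name -> (input state, output state, speed score).
OPERATORS = {
    "sample":        ("raw",      "raw",      5),
    "discretize":    ("raw",      "discrete", 3),
    "naive_bayes":   ("discrete", "model",    4),
    "decision_tree": ("raw",      "model",    2),
}

def enumerate_processes(state="raw", goal="model", max_len=3):
    """All operator chains of length <= max_len that turn `state` into `goal`."""
    if state == goal:
        yield []
    if max_len == 0:
        return
    for name, (inp, out, _) in OPERATORS.items():
        if inp == state:
            for rest in enumerate_processes(out, goal, max_len - 1):
                yield [name] + rest

def speed(process):
    return sum(OPERATORS[name][2] for name in process)

processes = list(enumerate_processes())
for p in sorted(processes, key=speed, reverse=True):
    print(f"speed={speed(p):2d}  " + " -> ".join(p))

The enumeration guarantees that only valid combinations are produced, and the ranking function stands in for the different criteria (speed, accuracy, cost) an IDA would offer.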
-
Abraham Bernstein, Benjamin Grosof, Beyond Monotonic Inheritance: Towards Semantic Web Process Ontologies, University of Zurich, Department of Informatics, August 2003. (techreport/Working Paper)
Semantic Web Services (SWS), the convergence of Semantic Web and Web Services, is the emerging next major generation of the Web, in which e-services and business communication become more knowledge-based and agent-based. In the SWS vision, service descriptions are built partly upon process ontologies, widely shared ontological knowledge about business processes, which are represented using Semantic Web techniques for declarative knowledge representation (KR), e.g., OWL Description Logic or RuleML Logic Programs.
In this paper, we give the first approach to solving a previously unsolved, crucial problem in representing process ontologies using SW KR: how to represent non-monotonic inheritance reasoning, in which at each (sub)class in the class hierarchy, any inherited property value may be overridden with another value, or simply cancelled (i.e., not inherited). Non-monotonic inheritance is an important, heavily-used feature in pre-SWS process ontologies, e.g., ubiquitous in object-oriented (OO) programming. The advantages of non-monotonicity in inheritance include greater reuse/modularity and easier specification, updating, and merging. We focus in particular on the Process Handbook (PH), a large, influential, and well-used process ontology repository that is representative in its features for non-monotonic inheritance. W3C's OWL, the currently dominant SW KR for ontologies, is fundamentally incapable of representing non-monotonicity; so too is First Order Logic. Using instead another form of leading SW KR, RuleML, we give a new approach that successfully represents the PH's style of non-monotonic inheritance. In this Courteous Inheritance approach, PH ontology knowledge is represented as prioritized default rules expressed in the Courteous Logic Programs (CLP) subset of RuleML.
A prototype of our approach is in progress. We aim to use it to enable SWS exploitation of the forthcoming open-source version of the PH.
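The flavour of non-monotonic inheritance the paper targets can be mimicked in a few lines of Python: property values flow down a hypothetical process class hierarchy, but a subclass may override an inherited value or cancel it outright. This only illustrates the behaviour, not the Courteous Logic Programs encoding itself, and the process names and properties are invented.

CANCEL = object()  # sentinel: "do not inherit this property at all"

# Hypothetical process ontology: class -> (parent, locally asserted properties).
PROCESSES = {
    "Sell":          (None,   {"receives": "payment", "ships": "goods"}),
    "SellOnline":    ("Sell", {"channel": "web"}),
    "GiveAway":      ("Sell", {"receives": CANCEL}),          # cancel the inherited value
    "SellByAuction": ("Sell", {"receives": "winning bid"}),   # override the inherited value
}

def properties(cls):
    """Effective properties of cls: inherited ones, minus cancellations, plus overrides."""
    parent, local = PROCESSES[cls]
    props = properties(parent) if parent else {}
    for key, value in local.items():
        if value is CANCEL:
            props.pop(key, None)
        else:
            props[key] = value
    return props

for cls in PROCESSES:
    print(cls, properties(cls))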
-
Abraham Bernstein, How can cooperative work tools support dynamic group processes? Bridging the specificity frontier, Organizing Business Knowledge: The MIT Process Handbook, Editor(s): Thomas W. Malone, Kevin Crowston, George Herman, August 2003, MIT Press. (incollection)
In the past, most collaboration support systems have focused on either automating fixed work processes or simply supporting communication in ad-hoc processes. This results in systems that are usually inflexible and difficult to change or that provide no specific support to help users decide what to do next.
This paper describes a new kind of tool that bridges the gap between these two approaches by flexibly supporting processes at many points along the spectrum: from highly specified to highly unspecified. The development of this approach was strongly based on social science theory about collaborative work.
-
Abraham Bernstein, Process Recombination: An Ontology Based Approach for Business Process Re-Design, SAP Design Guild 7, October 2003. (article)
A critical need for many organizations is the ability to quickly (re-)design their business processes in response to changing needs and capabilities. Current process design tools and methodologies, however, are very resource-intensive and provide little support for generating (as opposed to merely recording) new design alternatives.
This paper describes the 'process recombination,' a novel approach for template-based business process re-design based upon the MIT Process Handbook. This approach allows one to systematically generate different process (re-) designs using the repository of process alternatives stored in the Process Handbook. Our experience to date has shown that this approach can be effective in helping users produce innovative process designs.
-
Abraham Bernstein, Mark Klein, Thomas W. Malone, The Process Recombinator: A Tool for Generating New Business Process Ideas, Organizing Business Knowledge: The MIT Process Handbook, Editor(s): Thomas W. Malone, Kevin Crowston, George Herman, August 2003, MIT Press. (incollection)
A critical need for many organizations in the next century will be the ability to quickly develop innovative business processes to take advantage of rapidly changing technologies and markets. Current process design tools and methodologies, however, are very resource-intensive and provide little support for generating (as opposed to merely recording) new design alternatives.
This paper describes the Process Recombinator, a novel tool for generating new business process ideas by recombining elements from a richly structured repository of knowledge about business processes. The key contribution of the work is the technical demonstration of how such a repository can be used to automatically generate a wide range of innovative process designs. We have also informally evaluated the Process Recombinator in several field studies, which are briefly described here as well.
-
Abraham Bernstein, The Product Workbench: An Environment for the Mass-Customization of Production-Processes, Organizing Business Knowledge: The MIT Process Handbook, Editor(s): Thomas W. Malone, Kevin Crowston, George Herman, August 2003, MIT Press. (incollection)
This article investigates how to support process enactment in highly flexible organizations. First it develops the requirements for such a support system. Then it proposes a prototype implementation, which offers its users the equivalent of a CAD/CAM-like tool for designing and supporting business processes. The tool enables end-users to take flexible building blocks of a production process, reassemble them to fit the specific needs of a particular case and finally export its description to process support systems like workflow management systems.
-
Abraham Bernstein, Scott Clearwater, Foster Provost, The Relational Vector-space Model and Industry Classification, IJCAI-2003 Workshop on Learning Statistical Models from Relational Data, August 2003. (inproceedings)
This paper addresses the classification of linked entities. We introduce a relational vector-space (VS) model (in analogy to the VS model used in information retrieval) that abstracts the linked structure, representing entities by vectors of weights. Given labeled data as background knowledge/training data, classification procedures can be defined for this model, including a straightforward, "direct" model using weighted adjacency vectors. Using a large set of tasks from the domain of company affiliation identification, we demonstrate that such classification procedures can be effective. We then examine the method in more detail, showing that, as expected, the classification performance correlates with the relational autocorrelation of the data set. We then turn the tables and use the relational VS scores as a way to analyze/visualize the relational autocorrelation present in a complex linked structure. The main contribution of the paper is to introduce the relational VS model as a potentially useful addition to the toolkit for relational data mining. It could provide useful constructed features for domains with low to moderate relational autocorrelation; it may be effective by itself for domains with high levels of relational autocorrelation, and it provides a useful abstraction for analyzing the properties of linked data.
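A rough Python rendering of the weighted-adjacency idea behind the "direct" model, using an invented mini-dataset: each company is represented by a vector of link weights to shared entities, and an unlabeled company is classified by cosine similarity to the labeled ones. Company names, entities, weights, and labels are all made up.

import numpy as np

# Invented link weights from companies to shared entities (banks, law firms, co-mentions, ...).
entities = ["bank_a", "law_firm_b", "supplier_c", "fund_d"]
companies = {
    "OilCo":    np.array([3.0, 0.0, 2.0, 0.0]),
    "PetroInc": np.array([2.0, 1.0, 3.0, 0.0]),
    "SoftCorp": np.array([0.0, 2.0, 0.0, 4.0]),
    "AppWorks": np.array([0.0, 3.0, 1.0, 3.0]),
}
labels = {"OilCo": "energy", "PetroInc": "energy", "SoftCorp": "tech", "AppWorks": "tech"}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def classify(vector):
    """Assign the label of the most similar labeled adjacency vector."""
    best = max(labels, key=lambda name: cosine(vector, companies[name]))
    return labels[best]

unknown = np.array([1.0, 0.5, 2.0, 0.0])  # hypothetical new company's link weights
print(classify(unknown))                  # -> 'energy'

The better this nearest-vector assignment works, the more the labels cluster along the links, which is the connection to relational autocorrelation drawn in the abstract.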
-
Abraham Bernstein, Foster Provost, Scott Clearwater, The Relational Vector-space Model and Industry Classification (techreport), New York University - Stern School of Business, Information Systems Group 2003. (techreport/Technical Report)
This paper addresses the classification of linked entities. We introduce a relational vector-space (VS) model (in analogy to the VS model used in information retrieval) that abstracts the linked structure, representing entities by vectors of weights. Given labeled data as background knowledge/training data, classification procedures can be defined for this model, including a straightforward, "direct" model using weighted adjacency vectors. Using a large set of tasks from the domain of company affiliation identification, we demonstrate that such classification procedures can be effective. We then examine the method in more detail, showing that, as expected, the classification performance correlates with the relational autocorrelation of the data set. We then turn the tables and use the relational VS scores as a way to analyze/visualize the relational autocorrelation present in a complex linked structure. The main contribution of the paper is to introduce the relational VS model as a potentially useful addition to the toolkit for relational data mining. It could provide useful constructed features for domains with low to moderate relational autocorrelation; it may be effective by itself for domains with high levels of relational autocorrelation, and it provides a useful abstraction for analyzing the properties of linked data.
-
Thomas W. Malone, Kevin Crowston, Jintae Lee, Brian Pentland, Chrysanthos Dellarocas, George Wyner, John Quimby, Abraham Bernstein, George Herman, Mark Klein, Charley Osborne, Tools for inventing organizations: Toward a handbook of organizational processes, Organizing Business Knowledge: The MIT Process Handbook, Editor(s): Thomas W. Malone, Kevin Crowston, George Herman, August 2003, MIT Press. (incollection)
This paper describes a novel theoretical and empirical approach to tasks such as business process redesign and knowledge management. The project involves collecting examples of how different organizations perform similar processes, and organizing these examples in an on-line "process handbook". The handbook is intended to help people: (1) redesign existing organizational processes, (2) invent new organizational processes (especially ones that take advantage of information technology), and (3) share ideas about organizational practices.
A key element of the work is an approach to analyzing processes at various levels of abstraction, thus capturing both the details of specific processes as well as the "deep structure" of their similarities. This approach uses ideas from computer science about inheritance and from coordination theory about managing dependencies. A primary advantage of the approach is that it allows people to explicitly represent the similarities (and differences) among related processes and to easily find or generate sensible alternatives for how a given process could be performed. In addition to describing this new approach, the work reported here demonstrates the basic technical feasibility of these ideas and gives one example of their use in a field study.
-
Abraham Bernstein, Shawndra Hill, Foster Provost, An Intelligent Assistant for the Knowledge Discovery Process (techreport), 2002. (techreport/Working Paper)
A data mining (DM) process involves multiple stages. A simple, but typical, process might include preprocessing data, applying a data-mining algorithm, and postprocessing the mining results. There are many possible choices for each stage, and only some combinations are valid. Because of the large space and non-trivial interactions, both novices and data-mining specialists need assistance in composing and selecting DM processes. We present the concept of Intelligent Discovery Assistants (IDAs), which provide users with (i) systematic enumerations of valid DM processes, in order that important, potentially fruitful options are not overlooked, and (ii) effective rankings of these valid processes by different criteria, to facilitate the choice of DM processes to execute. We use a prototype to show that an IDA can indeed provide useful enumerations and effective rankings. We discuss how an IDA is an important tool for knowledge sharing among a team of data miners. Finally, we illustrate all the claims with a comprehensive demonstration using a more involved process and data from the 1998 KDDCUP competition.
-
Abraham Bernstein, Scott Clearwater, Shawndra Hill, Claudia Perlich, Foster Provost, Discovering Knowledge from Relational Data Extracted from Business News, Workshop on Multi-Relational Data Mining (MRDM 2002) at the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2002. (inproceedings)
Thousands of business news stories (including press releases, earnings reports, general business news, etc.) are released each day. Recently, information technology advances have partially automated the processing of documents, reducing the amount of text that must be read. Current techniques (e.g., text classification and information extraction) for full-text analysis for the most part are limited to discovering information that can be found in single documents. Often, however, important information does not reside in a single document, but in the relationships between information distributed over multiple documents.
This paper reports on an investigation into whether knowledge can be discovered automatically from relational data extracted from large corpora of business news stories. We use a combination of information extraction, network analysis, and statistical techniques. We show that relationally interlinked patterns distributed over multiple documents can indeed be extracted, and (specifically) that knowledge about companies' interrelationships can be discovered. We evaluate the extracted relationships in several ways: we give a broad visualization of related companies, showing intuitive industry clusters; we use network analysis to ask who the central players are; and finally, we show that the extracted interrelationships can be used for important tasks, such as classifying companies by industry membership.
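The network-analysis step can be sketched as follows: given hypothetical company co-occurrences that an information-extraction step might produce from news stories, build a graph weighted by co-mention counts and ask who the central players are; the company names and counts are invented, and the snippet requires the networkx package.

import networkx as nx

# Hypothetical (company, company, number of co-mentioning stories) triples.
comentions = [
    ("OilCo", "PetroInc", 12),
    ("OilCo", "BankCorp", 5),
    ("PetroInc", "BankCorp", 7),
    ("SoftCorp", "AppWorks", 9),
    ("BankCorp", "SoftCorp", 2),
]

graph = nx.Graph()
graph.add_weighted_edges_from(comentions)

# Who are the central players? (weighted degree as a simple centrality measure)
centrality = dict(graph.degree(weight="weight"))
for company, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{company:10s} {score}")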
-
Abraham Bernstein, Scott Clearwater, Shawndra Hill, Claudia Perlich, Foster Provost, Discovering Knowledge from Relational Data Extracted from Business News (techreport), New York University, Center for Digital Economy Research 2002. (techreport/Working Paper)
Thousands of business news stories (including press releases, earnings reports, general business news, etc.) are released each day. Recently, information technology advances have partially automated the processing of documents, reducing the amount of text that must be read. Current techniques (e.g., text classification and information extraction) for full-text analysis for the most part are limited to discovering information that can be found in single documents. Often, however, important information does not reside in a single document, but in the relationships between information distributed over multiple documents.
This paper reports on an investigation into whether knowledge can be discovered automatically from relational data extracted from large corpora of business news stories. We use a combination of information extraction, network analysis, and statistical techniques. We show that relationally interlinked patterns distributed over multiple documents can indeed be extracted, and (specifically) that knowledge about companies' interrelationships can be discovered. We evaluate the extracted relationships in several ways: we give a broad visualization of related companies, showing intuitive industry clusters; we use network analysis to ask who the central players are; and finally, we show that the extracted interrelationships can be used for important tasks, such as classifying companies by industry membership.
-
Abraham Bernstein, Mark Klein, Discovering Services: Towards High-Precision Service Retrieval, The 'Web Services, E-Business and Semantic Web Workshop' at the fourteenth international Conference on Advanced Information Systems Engineering (CAiSE-2002), August 2002. (inproceedings)
The ability to rapidly locate useful on-line services (e.g. software applications, software components), as opposed to simply useful documents, is becoming increasingly critical in many domains. Current service retrieval technology is, however, notoriously prone to low precision. This paper describes a novel service retrieval approach based on the sophisticated use of process ontologies. Our preliminary evaluations suggest that this approach offers qualitatively higher retrieval precision than existing (keyword and table-based) approaches without sacrificing recall and computational tractability/scalability.
-
Mark Klein, Abraham Bernstein, Searching for services on the semantic web using process ontologies, The Emerging Semantic Web - Selected papers from the first Semantic Web Working Symposium, Editor(s): Stefan Decker, Jerome Euzenat, Deborah McGuinness, Isabel Cruz, August 2002, IOS. (incollection)
The ability to rapidly locate useful on-line services (e.g. software applications, software components, process models, or service organizations), as opposed to simply useful documents, is becoming increasingly critical in many domains. As the sheer number of such services increases, it will become increasingly important to provide tools that allow people (and software) to quickly find the services they need, while minimizing the burden for those who wish to list their services with these search engines. This can be viewed as a critical enabler of the "friction-free" markets of the "new economy". Current service retrieval technology is, however, seriously deficient in this regard. The information retrieval community has focused on the retrieval of documents, not services per se, and has as a result emphasized keyword-based approaches. Those approaches achieve fairly high recall but low precision. The software agents and distributed computing communities have developed simple "frame-based" approaches for "matchmaking" between tasks and on-line services, increasing precision at the substantial cost of requiring all services to be modeled as frames and only supporting perfect matches. This paper proposes a novel, ontology-based approach that employs the characteristics of a process taxonomy to increase recall without sacrificing precision or increasing the computational complexity of the service retrieval process.
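One way to picture the role of the process taxonomy, as a deliberately simplified sketch rather than the authors' indexing scheme: a service whose advertised process is a specialization of the queried process should still be retrieved, which plain keyword or exact frame matching would miss. The taxonomy, service names, and processes below are invented.

# Toy process taxonomy: process -> its more general parent process.
TAXONOMY = {
    "Sell": None,
    "SellOnline": "Sell",
    "SellByAuction": "SellOnline",
    "Rent": None,
}

# Hypothetical service advertisements, each indexed by the process it implements.
SERVICES = {"eBid": "SellByAuction", "WebShop": "SellOnline", "LeaseIt": "Rent"}

def specializes(process, query):
    """True if process equals query or is (transitively) a specialization of it."""
    while process is not None:
        if process == query:
            return True
        process = TAXONOMY[process]
    return False

def retrieve(query):
    return [name for name, proc in SERVICES.items() if specializes(proc, query)]

print(retrieve("Sell"))        # ['eBid', 'WebShop']: recall beyond exact matches
print(retrieve("SellOnline"))  # ['eBid', 'WebShop']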
-
Guruduth Banavar, Abraham Bernstein, Software Infrastructure and Design Challenges for Ubiquitous Computing Applications, Communications of the ACM 45, August 2002. (article)
-
Abraham Bernstein, Mark Klein, Towards High-Precision Service Retrieval (inproceedings), The International Semantic Web Conference 2002. (inproceedings)
The ability to rapidly locate useful on-line services (e.g. software applications, software components, process models, or service organizations), as opposed to simply useful documents, is becoming increasingly critical in many domains. Current service retrieval technology is, however, notoriously prone to low precision. This paper describes a novel service retrieval approach based on the sophisticated use of process ontologies. Our preliminary evaluations suggest that this approach offers qualitatively higher retrieval precision than existing (keyword and table-based) approaches without sacrificing recall and computational tractability/scalability.