Dr. Adrian Bachmann
LINKSTER: Enabling Efficient Manual Inspection and Annotation of Mined Data, ACM SIGSOFT / FSE '10: Proceedings of the Eighteenth International Symposium on the Foundations of Software Engineering, November 2010. (inproceedings/formal demonstration)
While many uses of mined software engineering data are automatic in nature, some techniques and studies either require, or can be improved by, manual methods. Unfortunately, manually inspecting, analyzing, and annotating mined data can be difficult and tedious, especially when information from multiple sources must be integrated. Oddly, while there are numerous tools and frameworks for automatically mining and analyzing data, there is a dearth of tools which facilitate manual methods. To fill this void, we have developed LINKSTER, a tool which integrates data from bug databases, source code repositories, and mailing list archives to allow manual inspection and annotation. LINKSTER has already been used successfully by an OSS project lead to obtain data for one empirical study.
The Missing Links: Bugs and Bug-fix Commits, ACM SIGSOFT / FSE '10: Proceedings of the Eighteenth International Symposium on the Foundations of Software Engineering, November 2010. (inproceedings)
Empirical studies of software defects rely on links between bug databases and program code repositories. This linkage is typically based on bug-fixes identified in developer-entered commit logs. Unfortunately, developers do not always report which commits perform bug-fixes. Prior work suggests that such links can be a biased sample of the entire population of fixed bugs. The validity of statistical hypothesis testing based on linked data could well be affected by bias. Given the wide use of linked defect data, it is vital to gauge the nature and extent of the bias, and try to develop testable theories and models of the bias. To do this, we must establish ground truth: manually analyze a complete version history corpus, and nail down those commits that fix defects, and those that do not. This is a difficult task, requiring an expert to compare versions, analyze changes, find related bugs in the bug database, reverse-engineer missing links, and finally record their work for later use. This effort must be repeated for hundreds of commits to obtain a useful sample of reported and unreported bug-fix commits. We make several contributions. First, we present Linkster, a tool to facilitate link reverse-engineering. Second, we evaluate this tool, engaging a core developer of the Apache HTTP web server project to exhaustively annotate 493 commits that occurred during a six-week period. Finally, we analyze this comprehensive data set, showing that there are serious and consequential problems in the data.
When Process Data Quality Affects the Number of Bugs: Correlations in Software Engineering Datasets, MSR '10: Proceedings of the 7th IEEE Working Conference on Mining Software Repositories, May 2010. (inproceedings)
Software engineering process information extracted from version control systems and bug tracking databases is widely used in empirical software engineering. In prior work, we showed that these data are plagued by quality deficiencies, which vary in their characteristics across projects. In addition, we showed that those deficiencies, in the form of bias, do impact the results of studies in empirical software engineering. While these findings affect software engineering researchers, the impact on practitioners has not yet been substantiated. In this paper, we therefore explore (i) whether the process data quality and characteristics have an influence on the bug fixing process and (ii) whether the process quality, as measured by the process data, has an influence on the product (i.e., software) quality. Specifically, we analyze six Open Source as well as two Closed Source projects and show that process data quality and characteristics have an impact on the bug fixing process: the high rate of empty commit messages in Eclipse, for example, correlates with the bug report quality. We also show that the product quality -- measured by the number of bugs reported -- is affected by process data quality measures. These findings have the potential to prompt practitioners to increase the quality of their software process and its associated data quality.
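A minimal Python sketch of the kind of correlation analysis this paper describes; the data shape and field names are illustrative assumptions, not the paper's actual setup:

    from scipy.stats import spearmanr

    # Correlate a process-quality measure (rate of empty commit messages)
    # with a product-quality measure (number of reported bugs) across
    # projects or releases. Field names are hypothetical.
    def quality_correlation(samples):
        """samples: list of dicts with 'empty_msg_rate' and 'reported_bugs' keys."""
        rates = [s["empty_msg_rate"] for s in samples]
        bugs = [s["reported_bugs"] for s in samples]
        rho, p_value = spearmanr(rates, bugs)
        return rho, p_value  # rank correlation and its significance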
Why should we care about data quality in software engineering?, October 2010. (doctoralthesis)
Software engineering tools such as bug tracking databases and version control systems store large amounts of data about the history and evolution of software projects. In the last few years, empirical software engineering researchers have paid attention to these data to produce promising research results, for example, to predict the number of future bugs, recommend bugs to fix next, and visualize the evolution of software systems. Unfortunately, such data is not well-prepared for research purposes, which forces researchers to make process assumptions and develop tools and algorithms to extract, prepare, and integrate (i.e., inter-link) these data. This is inexact and may lead to quality issues. In addition, the quality of data stored in software engineering tools is questionable, which may have an additional effect on research results.

In this thesis, therefore, we present a step-by-step procedure to gather, convert, and integrate software engineering process data, introducing an enhanced linking algorithm that results in a better linking ratio and, at the same time, higher data quality compared to previously presented approaches. We then use this technique to generate six open source and two closed source software project datasets. In addition, we introduce a framework of data quality and characteristics measures, which allows an evaluation and comparison of these datasets.

However, evaluating and reporting data quality issues are of no importance if there is no effect on research results, processes, or product quality. Therefore, we show why software engineering researchers should care about data quality issues and, fundamentally, show that such datasets are incomplete and biased; we also show that, even worse, the award-winning bug prediction algorithm BUGCACHE is affected by quality issues like these.

The easiest way to fix such data quality issues would be to ensure good data quality at its origin by software engineering practitioners, which requires extra effort on their part. Therefore, we consider why practitioners should care about data quality and show that there are three reasons to do so: (i) process data quality issues have a negative effect on bug fixing activities, (ii) process data quality issues have an influence on product quality, and (iii) current and future laws and regulations such as the Sarbanes-Oxley Act or the Capability Maturity Model Integration (CMMI), as well as operational risk management, implicitly require traceability and justification of all changes to information systems (e.g., by change management). In a way, this increases the demand for good data quality in software engineering, including good data quality in the tools used in the process.

In summary, we discuss why we should care about data quality in software engineering, showing that (i) there are various data quality issues in software engineering datasets and (ii) these quality issues have an effect on research results as well as on the traceability and justification of program code changes; hence software engineering researchers as well as software engineering practitioners should care about these issues.
Data Retrieval, Processing and Linking for Software Process Data Analysis, University of Zurich, Department of Informatics, December 2009. (techreport)
Many projects in the mining software repositories community rely on software process data gathered from bug tracking databases and the commit log files of version control systems. These data are then used to predict defects, gain insight into a project's life-cycle, and perform other tasks. In this technical report we introduce the software systems which hold such data. Furthermore, we present our approach for retrieving, processing, and linking these data. Specifically, we first introduce the bug fixing process and the software products used to support it. We then present step-by-step guidance for our approach to retrieving, parsing, converting, and linking the data sources. Additionally, we introduce an improved approach for linking the change log file with the bug tracking database, achieving a higher linking rate than other approaches.
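A minimal Python sketch of the classic message-scanning heuristic that such linking approaches build on; the regular expression and data shapes are illustrative assumptions, not the report's exact algorithm:

    import re

    # Scan commit messages for bug identifiers (e.g. "bug 1234", "fix #42")
    # and validate each candidate against the bug tracking database.
    BUG_ID = re.compile(r"(?:bug|issue|fix(?:es|ed)?)\s*#?(\d+)", re.IGNORECASE)

    def link_commits_to_bugs(commits, bug_db):
        """commits: iterable of (commit_id, message); bug_db: dict of bug_id -> bug record."""
        links = []
        for commit_id, message in commits:
            for match in BUG_ID.finditer(message):
                bug_id = int(match.group(1))
                if bug_id in bug_db:  # stronger variants also check dates and authors
                    links.append((commit_id, bug_id))
        return links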
Fair and Balanced? Bias in Bug-Fix Datasets, ESEC/FSE '09: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, August 2009. (inproceedings)
Software engineering researchers have long been interested in where and why bugs occur in code, and in predicting where they might turn up next. Historical bug-occurrence data has been key to this research. Bug tracking systems, and code version histories, record when, how, and by whom bugs were fixed; from these sources, datasets that relate file changes to bug fixes can be extracted. These historical datasets can be used to test hypotheses concerning processes of bug introduction, and also to build statistical bug prediction models. Unfortunately, processes and humans are imperfect, and only a fraction of bug fixes are actually labelled in source code version histories, and thus become available for study in the extracted datasets. The question naturally arises, are the bug fixes recorded in these historical datasets a fair representation of the full population of bug fixes? In this paper, we investigate historical data from several software projects, and find strong evidence of systematic bias. We then investigate the potential effects of "unfair, imbalanced" datasets on the performance of prediction techniques. We draw the lesson that bias is a critical problem that threatens both the effectiveness of processes that rely on biased datasets to build prediction models and the generalizability of hypotheses tested on biased data.
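One way to make the bias question concrete is a distributional comparison between the linked sample and the full population of fixed bugs. A hedged Python sketch using bug severity follows; field names and the assumption that linked bugs are a subset of all fixed bugs are illustrative, not the paper's method:

    from collections import Counter
    from scipy.stats import chi2_contingency

    def severity_bias_test(all_fixed_bugs, linked_bugs, severities):
        """Each bug is a dict with a 'severity' key; linked_bugs is assumed
        to be a subset of all_fixed_bugs."""
        linked = Counter(b["severity"] for b in linked_bugs)
        total = Counter(b["severity"] for b in all_fixed_bugs)
        table = [
            [linked.get(s, 0) for s in severities],                    # linked fixes
            [total.get(s, 0) - linked.get(s, 0) for s in severities],  # unlinked fixes
        ]
        chi2, p_value, _, _ = chi2_contingency(table)
        return chi2, p_value  # a small p-value suggests the linked sample is not representative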
Software Process Data Quality and Characteristics - A Historical View on Open and Closed Source Projects, IWPSE-Evol '09: Proceedings of the joint international and annual ERCIM workshops on Principles of software evolution (IWPSE) and software evolution (Evol) workshops, August 2009. (inproceedings)
Software process data gathered from bug tracking databases and version control system log files are a valuable source for analyzing the evolution and history of a project or predicting its future. These data are used, for instance, to predict defects, gain insight into a project's life-cycle, and perform additional tasks. In this paper we survey five open source projects and one closed source project in order to provide deeper insight into the quality and characteristics of these often-used process data. Specifically, we first define quality and characteristics measures, which allow us to compare the quality and characteristics of the data gathered for different projects. We then compute the measures and discuss the issues arising from these observations. We show that there are vast differences between the projects, particularly with respect to the link rate between bugs and commits.
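The simplest of such measures, the link rate, can be sketched in a few lines of Python (the data shapes are assumptions for illustration):

    def link_rate(fixed_bug_ids, links):
        """fixed_bug_ids: set of IDs of bugs marked as fixed;
        links: iterable of (commit_id, bug_id) pairs."""
        linked = {bug_id for _, bug_id in links if bug_id in fixed_bug_ids}
        return len(linked) / len(fixed_bug_ids) if fixed_bug_ids else 0.0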
Looking Back on Prediction: A Retrospective Evaluation of Bug-Prediction Techniques, November 2008. (misc/poster)
Indoornavigation mittels Ortsinterpolation (Indoor Navigation by Means of Location Interpolation), March 2006. (diplomathesis/Diploma Thesis)
Satellite navigation is ubiquitous in our daily life. Unfortunately, satellite navigation signals cannot always be received. As a result, a new stream of research has emerged focusing on alternatives, especially in the area of indoor navigation. In this diploma thesis, a new approach is developed and presented that derives navigation information from accelerometer and magnetometer sensor data in a novel way. The approach overcomes the shortcoming of insufficient calibration, one of the major issues in current research. The main contribution of this work is an online calibration framework, which allows the system to adapt to changing boundary conditions. The result is a much more robust, precise, and up-to-date basis for path extrapolation.
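A minimal Python sketch of the dead-reckoning idea behind this work; the running-mean hard-iron correction merely stands in for the thesis's online calibration framework, and all thresholds and the step length are assumptions:

    import math

    def estimate_path(samples, step_length=0.7, accel_threshold=11.0):
        """samples: iterable of (ax, ay, az, mx, my) sensor readings;
        returns the extrapolated path as a list of (x, y) positions."""
        x = y = 0.0
        path = [(x, y)]
        mx_bias = my_bias = 0.0
        n = 0
        for ax, ay, az, mx, my in samples:
            # Online calibration (crude): running mean as a hard-iron offset.
            n += 1
            mx_bias += (mx - mx_bias) / n
            my_bias += (my - my_bias) / n
            heading = math.atan2(my - my_bias, mx - mx_bias)
            # Step detection: acceleration magnitude above a threshold.
            if math.sqrt(ax * ax + ay * ay + az * az) > accel_threshold:
                x += step_length * math.cos(heading)
                y += step_length * math.sin(heading)
                path.append((x, y))
        return path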
Design and Prototypical Implementation of an Accounting System for an AAA Server, August 2005. (misc/Semester Thesis)
The key aim of this thesis work is to design and prototypically implement the Accounting module for an AAA server, based on the Generic AAA Architecture defined in RFC 2903 and the Diameter protocol specifications. The resulting protocol and architecture shall provide a solution for offering accounting services to a Mobile Grid. It will also be used at a later stage, together with various charging models, to create a charging mechanism for future mobile grids. A mobile grid environment, by its heterogeneous nature, brings new challenges to all three A's in a traditional AAA environment. Regarding the accounting process, new types of resources have to be accounted for, which requires new parameters to be present in accounting records. Besides the traditional accounting of time, bytes, and packets, a grid service might need to account for CPU usage, memory consumption, or even accessed/contained information. The accounting module shall provide generic interfaces for possibly different monitoring entities that adapt to the type of resource being accounted for. Access to the accounting data shall be secure and reliable. Secure in this context means that accounting records for a certain service can only be created by entities approved by that service provider. This requirement can be realized using X.509 certificates or other kinds of credential tokens. Encryption of accounting messages shall be offered as a communication option between the AAA client and the AAA server. Reliability refers to the possibility of retrieving accurate accounting information for a given service/resource usage for charging purposes.
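A hypothetical Python sketch of an extended accounting record as motivated above; the field names are illustrative stand-ins, not actual Diameter AVP names from the specifications:

    import time
    from dataclasses import dataclass, field

    @dataclass
    class GridAccountingRecord:
        session_id: str
        service: str
        record_type: str              # e.g. "START", "INTERIM", "STOP"
        timestamp: float = field(default_factory=time.time)
        duration_s: float = 0.0       # traditional time-based accounting
        bytes_in: int = 0             # traditional volume-based accounting
        bytes_out: int = 0
        cpu_seconds: float = 0.0      # grid-specific: CPU usage
        memory_mb_peak: float = 0.0   # grid-specific: memory consumption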