-
Christian Bird, Adrian Bachmann, Foyzur Rahman, Abraham Bernstein, LINKSTER: Enabling Efficient Manual Inspection and Annotation of Mined Data, ACM SIGSOFT / FSE '10: Proceedings of the Eighteenth International Symposium on the Foundations of Software Engineering, November 2010. (inproceedings/formal demonstration)
While many uses of mined software engineering data are automatic in nature, some techniques and studies either require, or can be improved by, manual methods. Unfortunately, manually inspecting, analyzing, and annotating mined data can be difficult and tedious, especially when information from multiple sources must be integrated. Oddly, while there are numerous tools and frameworks for automatically mining and analyzing data, there is a dearth of tools that facilitate manual methods. To fill this void, we have developed LINKSTER, a tool that integrates data from bug databases, source code repositories, and mailing list archives to allow manual inspection and annotation. LINKSTER has already been used successfully by an OSS project lead to obtain data for one empirical study.
-
Adrian Bachmann, Christian Bird, Foyzur Rahman, Premkumar Devanbu, Abraham Bernstein, The Missing Links: Bugs and Bug-fix Commits, ACM SIGSOFT / FSE '10: Proceedings of the Eighteenth International Symposium on the Foundations of Software Engineering, November 2010. (inproceedings)
Empirical studies of software defects rely on links between bug databases and program code repositories. This linkage is typically based on bug-fixes identified in developer-entered commit logs. Unfortunately, developers do not always report which commits perform bug-fixes. Prior work suggests that such links can be a biased sample of the entire population of fixed bugs. The validity of statistical hypothesis testing based on linked data could well be affected by bias. Given the wide use of linked defect data, it is vital to gauge the nature and extent of the bias, and try to develop testable theories and models of the bias. To do this, we must establish ground truth: manually analyze a complete version history corpus, and nail down those commits that fix defects, and those that do not. This is a difficult task, requiring an expert to compare versions, analyze changes, find related bugs in the bug database, reverse-engineer missing links, and finally record their work for use later. This effort must be repeated for hundreds of commits to obtain a useful sample of reported and unreported bug-fix commits. We make several contributions. First, we present Linkster, a tool to facilitate link reverse-engineering. Second, we evaluate this tool, engaging a core developer of the Apache HTTP web server project to exhaustively annotate 493 commits that occurred during a six week period. Finally, we analyze this comprehensive data set, showing that there are serious and consequential problems in the data.
-
Adrian Bachmann, Abraham Bernstein, When Process Data Quality Affects the Number of Bugs: Correlations in Software Engineering Datasets, MSR '10: Proceedings of the 7th IEEE Working Conference on Mining Software Repositories, May 2010. (inproceedings)
Software engineering process information extracted from version control systems and bug tracking databases is widely used in empirical software engineering. In prior work, we showed that these data are plagued by quality deficiencies, which vary in their characteristics across projects. In addition, we showed that those deficiencies, in the form of bias, do impact the results of studies in empirical software engineering. While these findings affect software engineering researchers, the impact on practitioners has not yet been substantiated. In this paper we therefore explore (i) whether the process data quality and characteristics have an influence on the bug fixing process and (ii) whether the process quality as measured by the process data has an influence on the product (i.e., software) quality. Specifically, we analyze six open source as well as two closed source projects and show that process data quality and characteristics have an impact on the bug fixing process: the high rate of empty commit messages in Eclipse, for example, correlates with bug report quality. We also show that the product quality -- measured by the number of bugs reported -- is affected by process data quality measures. These findings have the potential to prompt practitioners to increase the quality of their software process and its associated data quality.
-
Adrian Bachmann, Why should we care about data quality in software engineering?, October 2010. (doctoralthesis)
Software engineering tools such as bug tracking databases and version control systems store large amounts of data about the history and evolution of software projects. In the last few years, empirical software engineering researchers have paid attention to these data to provide promising research results, for example, to predict the number of future bugs, recommend bugs to fix next, and visualize the evolution of software systems. Unfortunately, such data are not well-prepared for research purposes, which forces researchers to make process assumptions and develop tools and algorithms to extract, prepare, and integrate (i.e., inter-link) these data. This is inexact and may lead to quality issues. In addition, the quality of the data stored in software engineering tools is questionable, which may have an additional effect on research results.
In this thesis, therefore, we present a step-by-step procedure to gather, convert, and integrate software engineering process data, introducing an enhanced linking algorithm that results in a better linking ratio and, at the same time, higher data quality compared to previously presented approaches. We then use this technique to generate six open source and two closed source software project datasets. In addition, we introduce a framework of data quality and characteristics measures, which allows an evaluation and comparison of these datasets.
However, evaluating and reporting data quality issues are of no importance if there is no effect on research results, processes, or product quality. Therefore, we show why software engineering researchers should care about data quality issues and, fundamentally, show that such datasets are incomplete and biased; we also show that, even worse, the award-winning bug prediction algorithm BUGCACHE is affected by quality issues like these. The easiest way to fix such data quality issues would be to ensure good data quality at its origin by software engineering practitioners, which requires extra effort on their part. Therefore, we consider why practitioners should care about data quality and show that there are three reasons to do so: (i) process data quality issues have a negative effect on bug fixing activities, (ii) process data quality issues have an influence on product quality, and (iii) current and future laws and regulations such as the Sarbanes-Oxley Act or the Capability Maturity Model Integration (CMMI), as well as operational risk management, implicitly require traceability and justification of all changes to information systems (e.g., by change management). In a way, this increases the demand for good data quality in software engineering, including good data quality of the tools used in the process.
Summarizing, we discuss why we should care about data quality in software engineering, showing that (i) we have various data quality issues in software engineering datasets and (ii) these quality issues have an effect on research results as well as on the traceability and justification of program code changes; software engineering researchers as well as software engineering practitioners should therefore care about these issues.
-
Adrian Bachmann, Abraham Bernstein, Data Retrieval, Processing and Linking for Software Process Data Analysis, University of Zurich, Department of Informatics, December 2009. (techreport)
Many projects in the mining software repositories community rely on software process data gathered from bug tracking databases and the commit logs of version control systems. These data are then used to predict defects, gather insight into a project's life-cycle, and support other tasks. In this technical report we introduce the software systems that hold such data. Furthermore, we present our approach for retrieving, processing, and linking these data. Specifically, we first introduce the bug fixing process and the software products that support it. We then present step-by-step guidance for our approach to retrieve, parse, convert, and link the data sources. Additionally, we introduce an improved approach for linking the change log file with the bug tracking database, which achieves a higher linking rate than other approaches.
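The common baseline that such linking approaches build on is to scan each commit message for bug-identifier patterns and keep only matches that correspond to real reports in the bug tracker. The sketch below illustrates that baseline only; it is not the report's enhanced algorithm, and the pattern set and helper names are illustrative assumptions.

```python
import re

# Illustrative baseline (not the authors' improved algorithm): link a commit
# to a bug report by scanning the commit message for identifier patterns such
# as "bug 1234", "fixes 1234", or a bare "#1234".
BUG_ID_PATTERN = re.compile(
    r"(?:bugs?|fix(?:e[sd])?|close[sd]?|issue)[:\s#]*(\d+)"  # keyword forms
    r"|#(\d+)",                                              # bare "#1234"
    re.IGNORECASE,
)

def extract_bug_ids(commit_message):
    """Return the set of candidate bug IDs referenced in a commit message."""
    ids = set()
    for keyword_id, bare_id in BUG_ID_PATTERN.findall(commit_message):
        ids.add(int(keyword_id or bare_id))
    return ids

def link_commits_to_bugs(commits, known_bug_ids):
    """Map each commit id to the referenced bug IDs that exist in the tracker.

    Filtering against the tracker's known IDs discards false positives such
    as revision numbers that merely look like bug references.
    """
    links = {}
    for commit_id, message in commits:
        matched = extract_bug_ids(message) & known_bug_ids
        if matched:
            links[commit_id] = matched
    return links
```

Keyword-free mentions and numbers absent from the tracker are dropped, which hints at why naive linkers miss many true bug-fix commits, the incompleteness the report's improved approach targets.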
-
Christian Bird, Adrian Bachmann, Eirik Aune, John Duffy, Abraham Bernstein, Vladimir Filkov, Premkumar Devanbu, Fair and Balanced? Bias in Bug-Fix Datasets, ESEC/FSE '09: Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering on European software engineering conference and foundations of software engineering, August 2009. (inproceedings)
Software engineering researchers have long been interested in where and why bugs occur in code, and in predicting where they might turn up next. Historical bug-occurrence data has been key to this research. Bug tracking systems, and code version histories, record when, how and by whom bugs were fixed; from these sources, datasets that relate file changes to bug fixes can be extracted. These historical datasets can be used to test hypotheses concerning processes of bug introduction, and also to build statistical bug prediction models. Unfortunately, processes and humans are imperfect, and only a fraction of bug fixes are actually labelled in source code version histories, and thus become available for study in the extracted datasets. The question naturally arises, are the bug fixes recorded in these historical datasets a fair representation of the full population of bug fixes? In this paper, we investigate historical data from several software projects, and find strong evidence of systematic bias. We then investigate the potential effects of "unfair, imbalanced" datasets on the performance of prediction techniques. We draw the lesson that bias is a critical problem that threatens both the effectiveness of processes that rely on biased datasets to build prediction models and the generalizability of hypotheses tested on biased data.
-
Adrian Bachmann, Abraham Bernstein, Software Process Data Quality and Characteristics - A Historical View on Open and Closed Source Projects, IWPSE-Evol '09: Proceedings of the joint international and annual ERCIM workshops on Principles of software evolution (IWPSE) and software evolution (Evol) workshops, August 2009. (inproceedings)
Software process data gathered from bug tracking databases and version control system log files are a very valuable source to analyze the evolution and history of a project or predict its future. These data are used, for instance, to predict defects, gather insight into a project's life-cycle, and support additional tasks. In this paper we survey five open source projects and one closed source project in order to provide a deeper insight into the quality and characteristics of these often-used process data. Specifically, we first define quality and characteristics measures, which allow us to compare the quality and characteristics of the data gathered for different projects. We then compute the measures and discuss the issues arising from these observations. We show that there are vast differences between the projects, particularly with respect to the quality of the links between bugs and commits.