Intelligent Assistance for the Data Mining Process: An Ontology-based Approach

Knowledge discovery from data is the result of an exploratory process involving the application of various algorithmic procedures for manipulating data, building models from data, and manipulating the models. The Knowledge Discovery (KD) process [Fayyad, Piatetsky-Shapiro & Smyth, 1996] is one of the central notions of the field of Knowledge Discovery and Data Mining (KDD). The KD process deserves more attention from the research community: processes comprise multiple algorithmic components, which interact in non-trivial ways. Even data-mining specialists are not familiar with the full range of components, let alone the vast design space of possible processes. Therefore, both novices and data-mining specialists are apt to overlook useful instances of the KD process. We consider tools that help data miners navigate the space of KD processes (see Figure 1) systematically and more effectively.

Figure 1: The KD process (adapted from Fayyad et al. [1996])

Figure 2 shows three simple example DM processes. Process 1 comprises simply the application of a decision-tree inducer. Process 2 preprocesses the data by discretizing numeric attributes, and then builds a naïve Bayesian classifier. Process 3 preprocesses the data first by taking a random subsample, then applies discretization, and then builds a naïve Bayesian classifier.


Figure 2: Three valid DM processes

Intelligent Discovery Assistants (IDAs) help data miners with the exploration of the space of valid DM processes. A valid DM process violates no fundamental constraints of its constituent techniques. For example, if the input data set contains numeric attributes, simply applying naïve Bayes is not a valid DM process, because (strictly speaking) naïve Bayes applies only to categorical attributes. However, Process 2 is valid, because it preprocesses the data with a discretization routine, transforming the numeric attributes to categorical ones. IDAs take advantage of an explicit ontology of data-mining techniques, which defines the various techniques and their properties. Using the ontology, an IDA searches the space of valid processes. Each search operator corresponds to including a different data-mining technique in the DM process; the operator's preconditions constrain when it can be applied, and its effects describe what applying it changes. Figure 3 shows some (simplified) ontology entries (cf. Figure 2).
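The search over valid processes described above can be illustrated with a minimal sketch. It is not the project's actual ontology or planner: the `Operator` class, the property names (`numeric_attrs`, `sampled`, `model`), and the four ontology entries are all hypothetical, chosen only to mirror the techniques named in Figure 2. The sketch assumes that a decision-tree inducer accepts numeric attributes, that naïve Bayes does not, and that each preprocessing step is applied at most once.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

State = FrozenSet[str]      # properties of the data so far, e.g. {"numeric_attrs"}
Process = Tuple[str, ...]   # ordered names of the techniques applied


@dataclass(frozen=True)
class Operator:
    """One hypothetical ontology entry: a technique with preconditions and effects."""
    name: str
    requires: FrozenSet[str] = frozenset()  # properties that must hold beforehand
    forbids: FrozenSet[str] = frozenset()   # properties that must not hold beforehand
    adds: FrozenSet[str] = frozenset()      # effect: properties added
    removes: FrozenSet[str] = frozenset()   # effect: properties removed

    def applicable(self, state: State) -> bool:
        return self.requires <= state and not (self.forbids & state)

    def apply(self, state: State) -> State:
        return (state - self.removes) | self.adds


# Illustrative entries only (assumptions, not the real ontology).
ONTOLOGY: List[Operator] = [
    Operator("random_sample", forbids=frozenset({"sampled"}),
             adds=frozenset({"sampled"})),
    Operator("discretize", requires=frozenset({"numeric_attrs"}),
             removes=frozenset({"numeric_attrs"})),
    Operator("decision_tree", adds=frozenset({"model"})),
    Operator("naive_bayes", forbids=frozenset({"numeric_attrs"}),
             adds=frozenset({"model"})),
]


def enumerate_processes(state: State, ontology: List[Operator],
                        prefix: Process = ()) -> List[Process]:
    """Depth-first enumeration of every valid process that ends in a model."""
    if "model" in state:
        return [prefix]
    found: List[Process] = []
    for op in ontology:
        if op.applicable(state):
            found += enumerate_processes(op.apply(state), ontology,
                                         prefix + (op.name,))
    return found


# Start from a data set that contains numeric attributes.
processes = enumerate_processes(frozenset({"numeric_attrs"}), ONTOLOGY)
for p in processes:
    print(" -> ".join(p))
```

Under these assumptions the enumeration includes the three processes of Figure 2 and excludes naïve Bayes applied directly to numeric data, because its precondition is violated.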


In this project we investigate the benefits of IDAs. In particular, we believe that IDAs can provide users with

  1. a systematic enumeration of valid DM processes, so they do not miss important, potentially fruitful options
  2. effective rankings of these valid processes by different criteria, to help them choose between the options
  3. an infrastructure for sharing knowledge about data-mining processes, which leads to what economists call network externalities.
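As a rough illustration of the second benefit, ranking by different criteria, the sketch below orders the three processes of Figure 2 under two scoring functions. Everything in it is an assumption made for illustration: the per-technique cost values are placeholders rather than measurements, the 10% sampling rate is arbitrary, and the function names are hypothetical. It does not reproduce the ranking heuristics used in the project.

```python
from typing import Callable, Dict, List, Tuple

Process = Tuple[str, ...]

# Placeholder relative costs per technique (hypothetical, not measured).
COST: Dict[str, float] = {"discretize": 2.0, "naive_bayes": 3.0, "decision_tree": 4.0}
SAMPLING_COST = 0.5   # hypothetical cost of drawing the subsample
SAMPLE_RATE = 0.1     # hypothetical fraction of the data kept


def estimated_cost(process: Process) -> float:
    """Sum step costs; steps after sampling run on a smaller data set."""
    scale, total = 1.0, 0.0
    for step in process:
        if step == "random_sample":
            total += SAMPLING_COST
            scale = SAMPLE_RATE
        else:
            total += scale * COST[step]
    return total


def rank(processes: List[Process],
         score: Callable[[Process], float]) -> List[Process]:
    """Return the processes ordered from lowest to highest score."""
    return sorted(processes, key=score)


candidates: List[Process] = [
    ("decision_tree",),                               # Process 1
    ("discretize", "naive_bayes"),                    # Process 2
    ("random_sample", "discretize", "naive_bayes"),   # Process 3
]

by_cost = rank(candidates, estimated_cost)   # cheapest first
by_length = rank(candidates, len)            # fewest steps first
```

The two criteria disagree: with these placeholder numbers the subsampled process is cheapest but also the longest, which is the kind of trade-off a ranked list is meant to expose to the user.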

Relevant Publications

  • Bernstein, Abraham, Shawndra Hill, and Foster Provost. 2002.
    An Intelligent Assistant for the Knowledge Discovery Process: An Ontology-based Approach
    New York University – Leonard Stern School of Business, Center for Digital Economy Research, CeDER Working Paper # IS-02-02
  • Bernstein, Abraham, and Foster Provost. 2001.
    An Intelligent Assistant for the Knowledge Discovery Process
    New York University – Leonard Stern School of Business, Center for Digital Economy Research, CeDER Working Paper # IS-01-01
    also to be presented at the
    IJCAI 2001 Workshop on Wrappers for Performance Enhancement in Knowledge Discovery in Databases