Decision Tree Learning
Overview
Menu Help
Create Mode
Test Mode
Algorithms
Learning is the ability to improve one's behaviour based on experience and represents an essential element of computational intelligence. Decision trees are a simple yet successful technique for supervised classification learning. This applet demonstrates how to build a decision tree using a training data set and then use the tree to classify unseen examples in a test data set.
This applet provides several sample data sets of examples to learn and classify; however, you can also create or import your own data sets. Before building a decision tree, the data set can be viewed, and examples can be moved between the training set and the test set. The applet's Create Mode allows you to watch as a decision tree is built automatically, or to build the tree yourself. When building the tree manually, you can use several tools to gain information that can guide your decisions. Once the decision tree is built, switch to Test Mode to test the tree against the unseen examples in your test set.
Create Mode is used both for acquiring or creating data sets and for building the decision tree. In Create Mode, the control panel provides a variety of tools for setting up a problem. The "View/Edit Examples" button opens a window used to manipulate the data set, while the remaining controls are used to construct the decision tree.
The easiest way to obtain a data set to build a tree for is to load one of the provided samples. However, new data sets can also be created by entering examples manually or loading them from a text file.
Loading a Sample Data Set To load a sample, click "Load Sample Dataset" from the "File" menu. Then select a sample from the drop-down list and click "Load." See "Manipulating a Data Set".
Creating a New Data Set To begin creating a new data set, click "Create New Data Set" from the "File" menu. A window will then appear asking for the "parameters" of the new data set. These parameters are the names of the input values that the tree will use to predict the output value, which is also given a name in the "Data Set Parameter Input" window. Parameter names are separated by commas, and the output value's name must appear last. A trivial example is: input1, input2, input3, output. Click "Ok" when finished. The next step is to create examples using the "View/Edit Examples" window.
Loading a Data Set From File Example data can also be loaded from a text file. Click "Open Location" from the "File" menu and specify the address of the file you would like to load. If the file is in a valid format, it will be loaded into the program. A valid file contains one example per line, with the input values followed by the output value, separated by spaces, commas, or tabs.
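For example, a small hypothetical data set with input parameters outlook and wind and output parameter play could be stored in a file as:

sunny, strong, no
sunny, weak, yes
rain, strong, no
overcast, weak, yes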
Once you have created or acquired a data set and distributed examples between the test and training sets (using the "View/Edit Examples" window), you are ready to begin building the tree. The decision tree is visualized as a set of nodes, edges, and edge labels. Each interior node in the tree represents an input parameter that examples are split on. Red rectangles represent interior nodes, while green rectangles represent leaf nodes. Blue diamond-shaped nodes have not yet been split or set as leaves. The labels on the edges between nodes indicate the possible values of the parent node's split parameter.
Constructing a Tree Automatically To automatically generate a tree, first select the splitting function to use from the left-side control panel. This splitting function will determine how the program chooses a parameter to split examples on. You can choose Random, Information Gain, Gain Ratio, or GINI. See the Algorithms section for more information on these splitting functions. After selecting a splitting function, simply click "Step" to watch the program construct the tree. Clicking "Auto Step" will cause the program to continue stepping until the tree is complete or the "Stop" button is pressed.
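To make the splitting functions more concrete, here is a minimal Python sketch of parameter selection by Information Gain. It is not the applet's code; the list-of-dictionaries example representation and the function names are assumptions made for illustration.

import math
from collections import Counter

def entropy(examples, target):
    """Shannon entropy of the output value distribution over the examples."""
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, param, target):
    """Reduction in entropy obtained by splitting the examples on param."""
    total = len(examples)
    remainder = 0.0
    for value in set(e[param] for e in examples):
        subset = [e for e in examples if e[param] == value]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder

def best_split(examples, params, target):
    """Pick the input parameter with the highest information gain."""
    return max(params, key=lambda p: information_gain(examples, p, target))

# Hypothetical usage: best_split(weather_examples, ["outlook", "wind"], "play")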
Stopping conditions can be used to restrict splitting while automatically generating a tree. To set stopping conditions, click "Stopping Conditions..." from the "Options" menu. Clicking the checkbox beside a condition enables it and allows you to edit the parameter that appears to its right. The program will not split a node if any enabled stopping condition is met.
The minimum information gain condition prevents splits that do not produce at least the specified information gain. The minimum example count condition prevents a node from being split if fewer than the specified number of examples are mapped to it. Finally, the maximum depth condition prevents splits that would increase the maximum root-to-leaf depth of the tree beyond the specified value. Note that the root has depth 0.
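A minimal sketch of how the three stopping conditions combine; the function name and threshold values here are illustrative assumptions, not the applet's internals.

def should_stop(examples, best_gain, depth,
                min_gain=0.1, min_examples=2, max_depth=4):
    """Return True if any enabled stopping condition is met at this node."""
    if best_gain < min_gain:          # minimum information gain
        return True
    if len(examples) < min_examples:  # minimum example count
        return True
    if depth >= max_depth:            # maximum depth (the root has depth 0)
        return True
    return False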
Constructing a Tree Manually You can also construct a tree by selecting parameters to split on yourself. When the graph mode is set to "Split Node," you can click on any blue diamond-shaped node to split it. A window will appear with information about each of the parameters that the examples can be split on. When you have chosen a parameter, select its checkbox and click "Split."
Several tools are available to guide your splitting choices. When the "View Node Information" graph mode is selected, you can click any node to get summary information about the examples mapped to it, its entropy, and its GINI index. Clicking a node in "View Mapped Examples" mode will show you the examples that have been mapped to the node. "Toggle Histogram" mode allows you to quickly view the output value probability distribution for a node.
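The GINI index reported here can be read as one minus the sum of squared output value probabilities; a minimal sketch, again assuming a list-of-dictionaries representation of the mapped examples:

from collections import Counter

def gini_index(examples, target):
    """GINI impurity: 1 minus the sum of squared output value probabilities."""
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def output_distribution(examples, target):
    """Output value probabilities of the kind shown by the histogram view."""
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return {value: c / total for value, c in counts.items()}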
The "Show Plot" button on the control panel opens a plot of training and test set error over the number of splits. This can be useful for evaluating whether or not further splitting is likely to improve the decsion tree's predictions. The default error value is the proportion of incorrectly predicted examples. However, the type of error plotted can be changed via the options menu. Other error calculations include average sum of absolute values of differences between the predicted distribution and the actual distribution, and a squared variant of this, the average sum of squares of absolute values of differences.
When splitting manually, you may decide that there is nothing to be gained from splitting a blue diamond-shaped split node further. In this case, you can use the "Set as Leaf" mode to select an output value for the node. Selecting "Set all Split-nodes as Leaves" from the "Options" menu will set every remaining split node in the tree as a leaf, using the mode output value of the examples mapped to each node.
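The mode output value is simply the most frequent output value among the examples mapped to a node; a one-line sketch under the same assumed representation:

from collections import Counter

def mode_output(examples, target):
    """Most common output value among the examples mapped to a node."""
    return Counter(e[target] for e in examples).most_common(1)[0][0]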
When finished constructing the decision tree, select the "Test" tab in the upper-left corner of the screen to test the tree against the test set of examples. Click the "Test" button to view the test set examples classified into the categories: "Correctly Predicted," "Incorrectly Predicted," and "No Prediction." The pie chart at the bottom of the test results window provides a quick perspective on the performance of your decision tree.
The default test results window classifies examples as correct or incorrect based on whether they mapped to a leaf with the same output value as the test example. An alternative test result display can be accessed by setting the "Test Result Type" to "Probabilities" in the options menu and clicking the "Test" button again. This time, each example's probabilistic error is compared against an error threshold to classify it as correct or incorrect. A slider at the bottom of this window allows you to change the error threshold, and radio buttons allow you to choose between the two error calculations.
The graph modes in Test Mode allow you to inspect individual nodes of the tree. You can view mapped test examples, node information, and toggle the histogram view.
Before automatically creating a decision tree, you can choose from several splitting functions that determine which parameter to split on. The following splitting functions are available: