Introduction
As announced, assignment #3 consists of two parts: the first part deals with decision trees and Naive Bayes and is primarily a paper-and-pencil exercise. The second part is about WEKA, a very common machine learning / data mining tool (Weka Data Mining toolkit).
Total credit for this assignment is 100 points. It is a single-student assignment.
Due date: Monday, 24.01.2005
Part 1: Decision Trees & Naive Bayes (60)
The goal of this exercise is to get familiar with decision trees and probabilistic learning (Naive Bayes). For this purpose, a simple dataset about whether to go skiing or not is provided. The decision to go skiing depends on the attributes snow, weather, season, and physical condition, as shown in the table below.
snow  weather  season  physical condition  go skiing 
sticky  foggy  low  rested  no 
fresh  sunny  low  injured  no 
fresh  sunny  low  rested  yes 
fresh  sunny  high  rested  yes 
fresh  sunny  mid  rested  yes 
frosted  windy  high  tired  no 
sticky  sunny  low  rested  yes 
frosted  foggy  mid  rested  no 
fresh  windy  low  rested  yes 
fresh  windy  low  rested  yes 
fresh  foggy  low  rested  yes 
fresh  foggy  low  rested  yes 
sticky  sunny  mid  rested  yes 
frosted  foggy  low  injured  no 
a) Build the decision tree for the data above by calculating the information gain for each possible node, as shown in the lecture. Please hand in the resulting tree including all calculation steps. (20)
Hint: There is a scientific calculator on calculator.com for the calculation of log2(x). Recall that log2(x) = log10(x) * log2(10), or equivalently log2(x) = ln(x) * log2(e).
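As a sanity check for the pencil-and-paper work, the entropy and information-gain formulas from the lecture can be sketched in a few lines of Python. This is purely illustrative and not part of the required hand-in; the only numbers used are the class counts of the full skiing dataset (9 "yes", 5 "no"), which you can verify from the table above.

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, partitions):
    """Information gain of a split: parent entropy minus the
    count-weighted entropy of the partitions (one class-count
    list per attribute value)."""
    total = sum(parent_counts)
    remainder = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - remainder

# The full skiing dataset has 9 "yes" and 5 "no" instances:
print(entropy([9, 5]))  # approximately 0.940 bits
```

Computing the gain for each attribute with these helpers and picking the largest one gives the root node of the tree; the same procedure is then repeated recursively on each partition.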
b) Apply Naive Bayes as a probabilistic learning algorithm to the dataset above and create a table with counts and probabilities. The following calculation, based on a smaller dataset, is provided as an example: (20)
snow  weather  go skiing 
fresh  foggy  no 
sticky  windy  no 
sticky  sunny  yes 
fresh  windy  yes 
fresh  foggy  yes 
frosted  sunny  no 
This small data table leads to the following table with counts and probabilities:
snow            yes  no    weather   yes  no    go skiing  yes  no
fresh            2    1    foggy      1    1                3    3
sticky           1    1    windy      1    1
frosted          0    1    sunny      1    1
fresh           2/3  1/3   foggy     1/3  1/3              3/6  3/6
sticky          1/3  1/3   windy     1/3  1/3
frosted         0/3  1/3   sunny     1/3  1/3
If we now want to classify the following new instance, "snow=fresh and weather=sunny", we calculate the likelihood of "go skiing=yes" in the following way:
likelihood of yes = 2/3 * 1/3 * 3/6 = 6/54 = 1/9
likelihood of no = 1/3 * 1/3 * 3/6 = 3/54 = 1/18
(We may simply multiply the individual attribute probabilities under the assumption that all attributes are equally important and independent given the class; that is why Naive Bayes is called "naive".)
probability of yes = (1/9)/((1/9)+(1/18)) = 2/3
probability of no = (1/18)/((1/18)+(1/9)) = 1/3
Therefore, based on the information given in this small example, the probability of not going skiing is 1/3 (about 33%).
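The worked example above can also be reproduced programmatically. The following Python sketch (attribute values copied from the small table; the function name is illustrative) computes the same likelihoods and normalized probability:

```python
# Training data from the small example table above: (snow, weather, go skiing).
data = [
    ("fresh",   "foggy", "no"),
    ("sticky",  "windy", "no"),
    ("sticky",  "sunny", "yes"),
    ("fresh",   "windy", "yes"),
    ("fresh",   "foggy", "yes"),
    ("frosted", "sunny", "no"),
]

def likelihood(snow, weather, label):
    """Naive Bayes likelihood: P(snow|label) * P(weather|label) * P(label)."""
    rows = [r for r in data if r[2] == label]
    n = len(rows)
    p_snow = sum(1 for r in rows if r[0] == snow) / n
    p_weather = sum(1 for r in rows if r[1] == weather) / n
    return p_snow * p_weather * n / len(data)

l_yes = likelihood("fresh", "sunny", "yes")  # 2/3 * 1/3 * 3/6 = 1/9
l_no = likelihood("fresh", "sunny", "no")    # 1/3 * 1/3 * 3/6 = 1/18
p_yes = l_yes / (l_yes + l_no)               # normalizes to 2/3
```

The same counting scheme extends directly to the four-attribute skiing dataset of part b): one conditional probability factor per attribute, times the class prior.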
c) Please explain in your own words the following terms: Naive Bayes Classifier, Bayesian Belief Network. In addition, draw the Bayesian Belief Network that represents the conditional independence assumption of the Naive Bayes Classifier for the skiing problem. Give the conditional probability tables associated with each node in the Belief Network. (10)
Hint: The Naive Bayes conditional independence assumption is P(a1, a2, …, an | vi) = P(a1 | vi) * P(a2 | vi) * … * P(an | vi), where vi is a class label and a1, a2, …, an are dataset attributes.
d) Decide for the new instance given below whether to go skiing or not, based on the calculated classifiers: (5)
snow  weather  season  physical condition  go skiing 
sticky  windy  high  tired  ? 
e) Now, an instance containing missing values has to be classified (use the two classifiers built on the original dataset in parts a) and b)): (5)
snow  weather  season  physical condition  go skiing 
  windy  mid  injured  ? 
Decide whether to go skiing or not.
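One common convention for Naive Bayes, found in standard textbook treatments, is to simply leave an attribute with a missing value out of the product of probabilities; whether and why that is justified is for you to argue in your answer. The sketch below only illustrates the mechanics of this convention on the small two-attribute example from part b), with the snow value assumed missing (names and indices are illustrative):

```python
# Small example table from part b): (snow, weather, go skiing).
data = [
    ("fresh",   "foggy", "no"),
    ("sticky",  "windy", "no"),
    ("sticky",  "sunny", "yes"),
    ("fresh",   "windy", "yes"),
    ("fresh",   "foggy", "yes"),
    ("frosted", "sunny", "no"),
]

def likelihood_partial(evidence, label):
    """Naive Bayes likelihood over only the observed attributes;
    attributes with missing values are left out of the product.
    `evidence` maps attribute index -> observed value."""
    rows = [r for r in data if r[-1] == label]
    p = len(rows) / len(data)  # class prior P(label)
    for idx, value in evidence.items():
        p *= sum(1 for r in rows if r[idx] == value) / len(rows)
    return p

# snow (index 0) is missing; only weather=windy is observed:
l_yes = likelihood_partial({1: "windy"}, "yes")  # 1/3 * 3/6 = 1/6
l_no = likelihood_partial({1: "windy"}, "no")    # 1/3 * 3/6 = 1/6
```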
Part 2: Data Mining Practice using WEKA (40)
Introduction
The tool used in this exercise, WEKA, is Java-based. It is very intuitive and easy to handle, as seen in the lecture. Download it (the GUI version is recommended) from www.cs.waikato.ac.nz/ml/weka/index.html and install it as described in the documentation. To complete the assignment it is sufficient to use the WEKA Explorer GUI (a user guide for the Explorer GUI, among other detailed tutorials, is available on the WEKA website).
Assignment
Two datasets are provided, each of which has been divided into a training and a test dataset: first, the drug dataset (training set, test set), and second, the soybean dataset (training set, test set).
Run WEKA and choose the Explorer. Load a training file. On the Classify page, select the J48 decision tree classifier. This classifier has a number of options; two of them are "unpruned" and "minNumObj". Setting "unpruned" prevents the decision tree from being pruned after it is built. "minNumObj" sets the minimum number of instances in any leaf node, which limits the growth of the decision tree.
a) For each dataset, vary the above two parameters: generate pruned and unpruned trees, with a minimum number of instances per leaf between 1 and 4.
b) Explain why pruning affects the performance of the decision tree algorithm.
c) Explain how different limits on the number of instances per leaf influence the performance (prediction accuracy) of the decision tree algorithm.
What to hand in:
A write-up of your results, answers to the questions, and supporting material such as experimental results. Please do not write long prose; hand in a succinct description of your results.
Please do not hesitate to contact us (Jiwen Li, Christoph Kiefer) if you have questions, problems or remarks of any kind.