Assignment #3: Learning

Introduction

As announced, assignment #3 consists of two parts: the first part deals with decision trees and Naive Bayes and is primarily a paper-and-pencil exercise. The second part is about WEKA, a very common machine learning / data mining tool (the Weka Data Mining toolkit).

Total credit for this assignment is 100 points. It's a single-student assignment.

Due date: Monday, 24.01.2005

Part 1: Decision Trees & Naive Bayes (60)

The goal of this exercise is to become familiar with decision trees and probabilistic learning (Naive Bayes). For this purpose, a simple dataset about whether to go skiing or not is provided. The decision to go skiing depends on the attributes snow, weather, season, and physical condition, as shown in the table below.

snow    | weather | season | physical condition | go skiing
--------+---------+--------+--------------------+----------
sticky  | foggy   | low    | rested             | no
fresh   | sunny   | low    | injured            | no
fresh   | sunny   | low    | rested             | yes
fresh   | sunny   | high   | rested             | yes
fresh   | sunny   | mid    | rested             | yes
frosted | windy   | high   | tired              | no
sticky  | sunny   | low    | rested             | yes
frosted | foggy   | mid    | rested             | no
fresh   | windy   | low    | rested             | yes
fresh   | windy   | low    | rested             | yes
fresh   | foggy   | low    | rested             | yes
fresh   | foggy   | low    | rested             | yes
sticky  | sunny   | mid    | rested             | yes
frosted | foggy   | low    | injured            | no

a) Build the decision tree based on the data above by calculating the information gain for each possible node, as shown in the lecture. Please hand in the resulting tree including all calculation steps. (20)

Hint: There is a scientific calculator on calculator.com for the calculation of log2(x). By the way: log2(x) = log10(x) * log2(10), or equivalently log2(x) = ln(x) * log2(e).
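Hint: if you want to check your hand calculations, the following small Python sketch (optional, and only one possible way to write it) computes the information gain of each attribute on the dataset above:

from math import log2
from collections import Counter

# rows: (snow, weather, season, physical condition, go skiing)
DATA = [
    ("sticky",  "foggy", "low",  "rested",  "no"),
    ("fresh",   "sunny", "low",  "injured", "no"),
    ("fresh",   "sunny", "low",  "rested",  "yes"),
    ("fresh",   "sunny", "high", "rested",  "yes"),
    ("fresh",   "sunny", "mid",  "rested",  "yes"),
    ("frosted", "windy", "high", "tired",   "no"),
    ("sticky",  "sunny", "low",  "rested",  "yes"),
    ("frosted", "foggy", "mid",  "rested",  "no"),
    ("fresh",   "windy", "low",  "rested",  "yes"),
    ("fresh",   "windy", "low",  "rested",  "yes"),
    ("fresh",   "foggy", "low",  "rested",  "yes"),
    ("fresh",   "foggy", "low",  "rested",  "yes"),
    ("sticky",  "sunny", "mid",  "rested",  "yes"),
    ("frosted", "foggy", "low",  "injured", "no"),
]
ATTRS = ["snow", "weather", "season", "physical condition"]

def entropy(rows):
    """Entropy of the class label (last column), in bits."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def information_gain(rows, attr_index):
    """Reduction in entropy achieved by splitting on one attribute."""
    total = len(rows)
    remainder = 0.0
    for value in {row[attr_index] for row in rows}:
        subset = [row for row in rows if row[attr_index] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(rows) - remainder

for i, name in enumerate(ATTRS):
    print(f"gain({name}) = {information_gain(DATA, i):.4f}")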

b) Apply Naive Bayes as a probabilistic learning algorithm to the dataset above and create a table with counts and probabilities. The following calculation, based on a smaller dataset, is provided as an example: (20)

snow    | weather | go skiing
--------+---------+----------
fresh   | foggy   | no
sticky  | windy   | no
sticky  | sunny   | yes
fresh   | windy   | yes
fresh   | foggy   | yes
frosted | sunny   | no

This small data table leads to the following table with counts and probabilities:

Counts:

snow    | yes | no
--------+-----+----
fresh   |  2  |  1
sticky  |  1  |  1
frosted |  0  |  1

weather | yes | no
--------+-----+----
sunny   |  1  |  1
foggy   |  1  |  1
windy   |  1  |  1

go skiing | yes | no
----------+-----+----
          |  3  |  3

Probabilities:

snow    | yes | no
--------+-----+-----
fresh   | 2/3 | 1/3
sticky  | 1/3 | 1/3
frosted | 0/3 | 1/3

weather | yes | no
--------+-----+-----
sunny   | 1/3 | 1/3
foggy   | 1/3 | 1/3
windy   | 1/3 | 1/3

go skiing | yes | no
----------+-----+-----
          | 3/6 | 3/6

If we now want to classify the following new instance, "snow=fresh and weather=sunny", we calculate the likelihood of "go skiing=yes" in the following way:

likelihood of yes = 2/3 * 1/3 * 3/6 = 6/54 = 1/9

likelihood of no = 1/3 * 1/3 * 3/6 = 3/54 = 1/18

(we can multiply the individual attribute probabilities under the assumption that all attributes are equally important and independent; that is why Naive Bayes is called naive)

probability of yes = (1/9)/((1/9)+(1/18)) = 2/3

probability of no = (1/18)/((1/18)+(1/9)) = 1/3

Therefore, based on the information given in this small example, the probability of not going skiing is 33% (and of going skiing, 67%).
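Hint: the same calculation can be checked programmatically. The following Python sketch (optional, one possible implementation) reproduces the worked example on the six-row dataset:

from collections import Counter

# rows: (snow, weather, go skiing), the small example dataset
SMALL = [
    ("fresh",   "foggy", "no"),
    ("sticky",  "windy", "no"),
    ("sticky",  "sunny", "yes"),
    ("fresh",   "windy", "yes"),
    ("fresh",   "foggy", "yes"),
    ("frosted", "sunny", "no"),
]

def naive_bayes(rows, instance):
    """Return normalized class probabilities for a tuple of attribute values."""
    labels = Counter(row[-1] for row in rows)
    likelihoods = {}
    for label, label_count in labels.items():
        p = label_count / len(rows)  # prior P(label)
        for i, value in enumerate(instance):
            # conditional probability P(attribute value | label)
            matches = sum(1 for row in rows
                          if row[-1] == label and row[i] == value)
            p *= matches / label_count
        likelihoods[label] = p
    total = sum(likelihoods.values())
    return {label: p / total for label, p in likelihoods.items()}

print(naive_bayes(SMALL, ("fresh", "sunny")))
# prints {'no': 0.333..., 'yes': 0.666...}, matching the result above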

c) Please explain in your own words the following terms: Naive Bayes Classifier, Bayesian Belief Network. In addition, draw the Bayesian Belief Network that represents the conditional independence assumption of the Naive Bayes Classifier for the skiing problem. Give the conditional probability tables associated with each node in the Belief Network. (10)

Hint: The Naive Bayes conditional independence assumption is P(a1, a2, …, an|vi) = P(a1|vi) * P(a2|vi) * … * P(an|vi), where vi is a class label and a1, a2, …, an are dataset attributes.
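For part c), the following standard derivation (general Naive Bayes background, not specific to this dataset) relates the hint to the classifier's decision rule; it applies Bayes' theorem and then drops the denominator P(a1, a2, ..., an), which is the same for every class:

\[
v_{NB} = \operatorname*{argmax}_{v_j} P(v_j \mid a_1, \dots, a_n)
       = \operatorname*{argmax}_{v_j} \frac{P(a_1, \dots, a_n \mid v_j)\, P(v_j)}{P(a_1, \dots, a_n)}
       = \operatorname*{argmax}_{v_j} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)
\]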

d) Decide for the new instance given below whether to go skiing or not, based on the calculated classifiers: (5)

snow    | weather | season | physical condition | go skiing
--------+---------+--------+--------------------+----------
sticky  | windy   | high   | tired              | ?

e) Now, an instance containing missing values has to be classified (use the two classifiers built in a) and b) on the original dataset): (5)

snow    | weather | season | physical condition | go skiing
--------+---------+--------+--------------------+----------
-       | windy   | mid    | injured            | ?

Decide whether to go skiing or not.
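Hint: a common convention for Naive Bayes with missing values, and the one assumed in the sketch below, is simply to drop the factor of the missing attribute; please verify this against the lecture before relying on it. The sketch adapts the naive_bayes function from the earlier hint (None marks a missing value):

from collections import Counter

def naive_bayes_missing(rows, instance):
    """Like naive_bayes above, but ignores attributes given as None."""
    labels = Counter(row[-1] for row in rows)
    likelihoods = {}
    for label, label_count in labels.items():
        p = label_count / len(rows)  # prior P(label)
        for i, value in enumerate(instance):
            if value is None:
                continue  # missing value: skip this attribute's factor
            matches = sum(1 for row in rows
                          if row[-1] == label and row[i] == value)
            p *= matches / label_count
        likelihoods[label] = p
    total = sum(likelihoods.values())
    return {label: p / total for label, p in likelihoods.items()}

# e.g. naive_bayes_missing(data, (None, "windy", "mid", "injured")),
# with data being the full 14-row dataset from part a)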

Part 2: Data Mining Practice using WEKA (40)

Introduction

The tool used in this exercise is called WEKA and is based on Java. It is very intuitive and easy to handle, as seen in the lecture. Download it (GUI version recommended) from www.cs.waikato.ac.nz/ml/weka/index.html and install it as described in the documentation. To complete the assignment it is sufficient to use the WEKA Explorer GUI (there is a user guide for the Explorer GUI, among other detailed tutorials, on the WEKA web site).

Assignment

There are two datasets provided, each of which has been divided into a training and a test dataset: first the drug dataset (training set, test set), and second the soybean dataset (training set, test set).

Run WEKA and choose the Explorer. Load a training file. On the Classify page, select the J48 decision tree classifier. This classifier has a number of options; two of them are "unpruned" and "minNumObj". "unpruned" prevents the decision tree from being pruned after it is built. "minNumObj" sets the minimum number of instances per leaf node, which limits the growth of the decision tree.
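Hint: if you prefer the command line over the Explorer GUI, the same two options are available as flags of the J48 classifier (-U for unpruned, -M for minNumObj); the .arff file names below are placeholders for the provided datasets:

java -cp weka.jar weka.classifiers.trees.J48 -U -t drug_train.arff -T drug_test.arff
java -cp weka.jar weka.classifiers.trees.J48 -M 3 -t drug_train.arff -T drug_test.arff

Here -t names the training file and -T the test file.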

a) For each dataset, vary the two parameters described above: generate pruned and unpruned trees, with the minimum number of instances per leaf ranging from 1 to 4.

b) Explain why pruning affects the performance of the decision tree algorithm.

c) Explain how different lower limits on the number of instances per leaf influence the performance (prediction accuracy) of the decision tree algorithm.

What to hand in:

A write-up of your results, answers to the questions, and supporting material such as experimental results. Please do not write long prose; hand in a succinct description of your results.

Please do not hesitate to contact us (Jiwen Li, Christoph Kiefer) if you have questions, problems or remarks of any kind.