Assignment #3: Learning

Introduction

As announced, assignment #3 consists of two parts: the first part deals with decision trees and Naive Bayes and is primarily a paper-and-pencil exercise. The second part is about WEKA, a very common machine learning / data mining tool (the Weka Data Mining toolkit).

Total credit for this assignment is 100 points. It's a single-student assignment.

Due date: Monday, 24.01.2005

Part 1: Decision Trees & Naive Bayes (60)

The goal of this exercise is to become familiar with decision trees and probabilistic learning (Naive Bayes). For this purpose, a simple dataset about whether to go skiing or not is provided. The decision to go skiing depends on the attributes snow, weather, season, and physical condition, as shown in the table below.

snow    | weather | season | physical condition | go skiing
--------+---------+--------+--------------------+----------
sticky  | foggy   | low    | rested             | no
fresh   | sunny   | low    | injured            | no
fresh   | sunny   | low    | rested             | yes
fresh   | sunny   | high   | rested             | yes
fresh   | sunny   | mid    | rested             | yes
frosted | windy   | high   | tired              | no
sticky  | sunny   | low    | rested             | yes
frosted | foggy   | mid    | rested             | no
fresh   | windy   | low    | rested             | yes
fresh   | windy   | low    | rested             | yes
fresh   | foggy   | low    | rested             | yes
fresh   | foggy   | low    | rested             | yes
sticky  | sunny   | mid    | rested             | yes
frosted | foggy   | low    | injured            | no

a) Build the decision tree based on the data above by calculating the information gain for each possible node, as shown in the lecture. Please hand in the resulting tree including all calculation steps. (20)

Hint: There is a scientific calculator on calculator.com for the calculation of log2(x). By the way: log2(x) = log10(x) * log2(10), or equivalently log2(x) = ln(x) * log2(e).
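Hint: if you want to check your hand calculations, the following small Python sketch (optional, and only one possible way to write it) computes the information gain of each attribute on the dataset above:

from math import log2
from collections import Counter

# rows: (snow, weather, season, physical condition, go skiing)
DATA = [
    ("sticky",  "foggy", "low",  "rested",  "no"),
    ("fresh",   "sunny", "low",  "injured", "no"),
    ("fresh",   "sunny", "low",  "rested",  "yes"),
    ("fresh",   "sunny", "high", "rested",  "yes"),
    ("fresh",   "sunny", "mid",  "rested",  "yes"),
    ("frosted", "windy", "high", "tired",   "no"),
    ("sticky",  "sunny", "low",  "rested",  "yes"),
    ("frosted", "foggy", "mid",  "rested",  "no"),
    ("fresh",   "windy", "low",  "rested",  "yes"),
    ("fresh",   "windy", "low",  "rested",  "yes"),
    ("fresh",   "foggy", "low",  "rested",  "yes"),
    ("fresh",   "foggy", "low",  "rested",  "yes"),
    ("sticky",  "sunny", "mid",  "rested",  "yes"),
    ("frosted", "foggy", "low",  "injured", "no"),
]
ATTRS = ["snow", "weather", "season", "physical condition"]

def entropy(rows):
    """Entropy of the class label (last column), in bits."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def information_gain(rows, attr_index):
    """Reduction in entropy achieved by splitting on one attribute."""
    total = len(rows)
    remainder = 0.0
    for value in {row[attr_index] for row in rows}:
        subset = [row for row in rows if row[attr_index] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(rows) - remainder

for i, name in enumerate(ATTRS):
    print(f"gain({name}) = {information_gain(DATA, i):.4f}")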

b) Apply Naive Bayes as a probabilistic learning algorithm to the dataset above and create a table with counts and probabilities. The following calculation, based on a smaller dataset, is provided as an example: (20)

snow    | weather | go skiing
--------+---------+----------
fresh   | foggy   | no
sticky  | windy   | no
sticky  | sunny   | yes
fresh   | windy   | yes
fresh   | foggy   | yes
frosted | sunny   | no

This small data table leads to the following table with counts and probabilities:

Counts:

snow    | yes | no
--------+-----+----
fresh   |  2  |  1
sticky  |  1  |  1
frosted |  0  |  1

weather | yes | no
--------+-----+----
sunny   |  1  |  1
foggy   |  1  |  1
windy   |  1  |  1

go skiing | yes | no
----------+-----+----
          |  3  |  3

Probabilities:

snow    | yes | no
--------+-----+-----
fresh   | 2/3 | 1/3
sticky  | 1/3 | 1/3
frosted | 0/3 | 1/3

weather | yes | no
--------+-----+-----
sunny   | 1/3 | 1/3
foggy   | 1/3 | 1/3
windy   | 1/3 | 1/3

go skiing | yes | no
----------+-----+-----
          | 3/6 | 3/6

If we now want to classify the following new instance, "snow=fresh and weather=sunny", we calculate the likelihood of "go skiing=yes" in the following way:

likelihood of yes = 2/3 * 1/3 * 3/6 = 6/54 = 1/9

likelihood of no = 1/3 * 1/3 * 3/6 = 3/54 = 1/18

(we can multiply the individual attribute probabilities under the assumption that all attributes are equally important and independent; that is why Naive Bayes is called naive)

probability of yes = (1/9)/((1/9)+(1/18)) = 2/3

probability of no = (1/18)/((1/18)+(1/9)) = 1/3

Therefore, based on the information given in this small example, the probability of not going skiing is 33% (and of going skiing, 67%).
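Hint: the same calculation can be checked programmatically. The following Python sketch (optional, one possible implementation) reproduces the worked example on the six-row dataset:

from collections import Counter

# rows: (snow, weather, go skiing), the small example dataset
SMALL = [
    ("fresh",   "foggy", "no"),
    ("sticky",  "windy", "no"),
    ("sticky",  "sunny", "yes"),
    ("fresh",   "windy", "yes"),
    ("fresh",   "foggy", "yes"),
    ("frosted", "sunny", "no"),
]

def naive_bayes(rows, instance):
    """Return normalized class probabilities for a tuple of attribute values."""
    labels = Counter(row[-1] for row in rows)
    likelihoods = {}
    for label, label_count in labels.items():
        p = label_count / len(rows)  # prior P(label)
        for i, value in enumerate(instance):
            # conditional probability P(attribute value | label)
            matches = sum(1 for row in rows
                          if row[-1] == label and row[i] == value)
            p *= matches / label_count
        likelihoods[label] = p
    total = sum(likelihoods.values())
    return {label: p / total for label, p in likelihoods.items()}

print(naive_bayes(SMALL, ("fresh", "sunny")))
# prints {'no': 0.333..., 'yes': 0.666...}, matching the result above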

c) Please explain in your own words the following terms: Naive Bayes Classifier, Bayesian Belief Network. In addition, draw the Bayesian Belief Network that represents the conditional independence assumption of the Naive Bayes Classifier for the skiing problem. Give the conditional probability tables associated with each node in the Belief Network. (10)

Hint: The Naive Bayes conditional independence assumption is P(a1, a2, …, an|vi) = P(a1|vi) * P(a2|vi) * … * P(an|vi), where vi is a class label and a1, a2, …, an are dataset attributes.
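For part c), the following standard derivation (general Naive Bayes background, not specific to this dataset) relates the hint to the classifier's decision rule; it applies Bayes' theorem and then drops the denominator P(a1, a2, ..., an), which is the same for every class:

\[
v_{NB} = \operatorname*{argmax}_{v_j} P(v_j \mid a_1, \dots, a_n)
       = \operatorname*{argmax}_{v_j} \frac{P(a_1, \dots, a_n \mid v_j)\, P(v_j)}{P(a_1, \dots, a_n)}
       = \operatorname*{argmax}_{v_j} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)
\]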

d) Decide for the new instance given below whether to go skiing or not, based on the calculated classifiers: (5)

snow    | weather | season | physical condition | go skiing
--------+---------+--------+--------------------+----------
sticky  | windy   | high   | tired              | ?

e) Now, an instance containing missing values has to be classified (use the two classifiers built in a) and b) on the original dataset): (5)

snow    | weather | season | physical condition | go skiing
--------+---------+--------+--------------------+----------
-       | windy   | mid    | injured            | ?

Decide whether to go skiing or not.
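Hint: a common convention for Naive Bayes with missing values, and the one assumed in the sketch below, is simply to drop the factor of the missing attribute; please verify this against the lecture before relying on it. The sketch adapts the naive_bayes function from the earlier hint (None marks a missing value):

from collections import Counter

def naive_bayes_missing(rows, instance):
    """Like naive_bayes above, but ignores attributes given as None."""
    labels = Counter(row[-1] for row in rows)
    likelihoods = {}
    for label, label_count in labels.items():
        p = label_count / len(rows)  # prior P(label)
        for i, value in enumerate(instance):
            if value is None:
                continue  # missing value: skip this attribute's factor
            matches = sum(1 for row in rows
                          if row[-1] == label and row[i] == value)
            p *= matches / label_count
        likelihoods[label] = p
    total = sum(likelihoods.values())
    return {label: p / total for label, p in likelihoods.items()}

# e.g. naive_bayes_missing(data, (None, "windy", "mid", "injured")),
# with data being the full 14-row dataset from part a)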

Part 2: Data Mining Practice using WEKA (40)

Introduction

The tool used in this exercise is called WEKA and is based on Java. It is very intuitive and easy to handle, as seen in the lecture. Download it (GUI version recommended) from www.cs.waikato.ac.nz/ml/weka/index.html and install it as described in the documentation. To complete the assignment it is sufficient to use the WEKA Explorer GUI (there is a user guide for the Explorer GUI, among other detailed tutorials, on the WEKA web site).

Assignment

There are two datasets provided, each of which has been divided into a training and a test dataset: first the drug dataset (training set, test set), and second the soybean dataset (training set, test set).

Run WEKA and choose the Explorer. Load a training file. On the Classify page, select the J48 decision tree classifier. This classifier has a number of options; two of them are "unpruned" and "minNumObj". "unpruned" prevents the decision tree from being pruned after it is built. "minNumObj" sets the minimum number of instances per leaf node, which limits the growth of the decision tree.
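Hint: if you prefer the command line over the Explorer GUI, the same two options are available as flags of the J48 classifier (-U for unpruned, -M for minNumObj); the .arff file names below are placeholders for the provided datasets:

java -cp weka.jar weka.classifiers.trees.J48 -U -t drug_train.arff -T drug_test.arff
java -cp weka.jar weka.classifiers.trees.J48 -M 3 -t drug_train.arff -T drug_test.arff

Here -t names the training file and -T the test file.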

a) For each dataset, vary the two parameters described above: generate pruned and unpruned trees, with the minimum number of instances per leaf ranging from 1 to 4.

b) Explain why pruning affects the performance of the decision tree algorithm.

c) Explain how different lower limits on the number of instances per leaf influence the performance (prediction accuracy) of the decision tree algorithm.

What to hand in:

A write-up of your results, answers to the questions, and supporting material such as experimental results. Please do not write long prose; hand in a succinct description of your results.

Please do not hesitate to contact us (Jiwen Li, Christoph Kiefer) if you have questions, problems or remarks of any kind.