Assignment #3: Learning

Introduction

As announced, assignment #3 consists of three parts: a self-exploratory online task, some paper-and-pencil work, and some experimental work (using the Weka Data Mining toolkit).

Due date: Wednesday, 21 January 2004

Part 1: Self Exploration

Decision trees

Please go to

http://www.ifi.unizh.ch/ddisold/Teaching/BI2003/Applets/CLspace/dtree/

(page protected with the usual username & password)

alternatively go to:

http://www.cs.ubc.ca/labs/lci/CIspace/Version4/dTree/index.html

and try to understand the details of how a decision tree learner works.

Bayesian Belief Networks

Please go to

http://www.ifi.unizh.ch/ddisold/Teaching/BI2003/Applets/CLspace/baysian/

(page protected with the usual username & password)

alternatively go to:

http://www.cs.ubc.ca/labs/lci/CIspace/Version4/bayes/index.html

and try to understand the details of how a Bayesian network works.

 

Part 2: Paper and Pencil

The goal of this exercise is to become familiar with decision trees and probabilistic learning (naive Bayes). For this purpose, a simple dataset about a weather-dependent game is provided. The decision whether to play the game depends on the attributes outlook, temperature, humidity, and wind, as shown in the table below.

outlook  | temperature | humidity | wind   | play game
---------+-------------+----------+--------+----------
sunny    | hot         | high     | weak   | no
sunny    | hot         | high     | strong | no
overcast | hot         | high     | weak   | yes
rain     | mild        | high     | weak   | yes
rain     | cool        | normal   | weak   | yes
rain     | cool        | normal   | strong | no
overcast | cool        | normal   | strong | yes
sunny    | mild        | high     | weak   | no
sunny    | cool        | normal   | weak   | yes
rain     | mild        | normal   | weak   | yes
sunny    | mild        | normal   | strong | yes
overcast | mild        | high     | strong | yes
overcast | hot         | normal   | weak   | yes
rain     | mild        | high     | strong | no

a)

Build the decision tree based on the data above by calculating the information gain for each possible node, as shown in the lecture. Please hand in the resulting tree including all calculation steps.
Hint: There is a scientific calculator at calculator.com for the calculation of log2(x). By the way: log2(x) = log10(x) * log2(10), or log2(x) = ln(x) * log2(e).
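
If you want to double-check your arithmetic, the following minimal Python sketch (our own illustration, not part of the assignment) computes the information gain of every attribute on the table above; the 14 instances are hard-coded from the table.

    from math import log2

    # The 14 training instances from the table above:
    # (outlook, temperature, humidity, wind, play game)
    data = [
        ("sunny",    "hot",  "high",   "weak",   "no"),
        ("sunny",    "hot",  "high",   "strong", "no"),
        ("overcast", "hot",  "high",   "weak",   "yes"),
        ("rain",     "mild", "high",   "weak",   "yes"),
        ("rain",     "cool", "normal", "weak",   "yes"),
        ("rain",     "cool", "normal", "strong", "no"),
        ("overcast", "cool", "normal", "strong", "yes"),
        ("sunny",    "mild", "high",   "weak",   "no"),
        ("sunny",    "cool", "normal", "weak",   "yes"),
        ("rain",     "mild", "normal", "weak",   "yes"),
        ("sunny",    "mild", "normal", "strong", "yes"),
        ("overcast", "mild", "high",   "strong", "yes"),
        ("overcast", "hot",  "normal", "weak",   "yes"),
        ("rain",     "mild", "high",   "strong", "no"),
    ]
    ATTRS = ["outlook", "temperature", "humidity", "wind"]

    def entropy(rows):
        """Entropy (in bits) of the class label, i.e. the last column."""
        n = len(rows)
        counts = {}
        for r in rows:
            counts[r[-1]] = counts.get(r[-1], 0) + 1
        return -sum(c / n * log2(c / n) for c in counts.values())

    def gain(rows, i):
        """Information gain of splitting rows on attribute column i."""
        n = len(rows)
        remainder = 0.0
        for v in set(r[i] for r in rows):
            subset = [r for r in rows if r[i] == v]
            remainder += len(subset) / n * entropy(subset)
        return entropy(rows) - remainder

    for i, name in enumerate(ATTRS):
        print(f"gain({name}) = {gain(data, i):.3f}")
    # The attribute with the highest gain becomes the root node; the same
    # calculation is then repeated recursively on each branch.

The printout only checks the first split; for the hand-in you still need to show all intermediate entropy calculations.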

 

b)

Apply naive Bayes as a probabilistic learning algorithm to the dataset above.

The following calculation based on a smaller dataset is provided as an example:  

temperature | humidity | play game
------------+----------+----------
hot         | high     | no
hot         | high     | no
hot         | high     | yes
mild        | high     | yes
cool        | normal   | yes
cool        | normal   | no
This small data table leads to the following tables of counts and the corresponding relative frequencies:

Counts:

temperature | yes | no
hot         | 1   | 2
mild        | 1   | 0
cool        | 1   | 1

humidity | yes | no
high     | 2   | 2
normal   | 1   | 1

play game | yes | no
          | 3   | 3

Relative frequencies:

temperature | yes | no
hot         | 1/3 | 2/3
mild        | 1/3 | 0/3
cool        | 1/3 | 1/3

humidity | yes | no
high     | 2/3 | 2/3
normal   | 1/3 | 1/3

play game | yes | no
          | 3/6 | 3/6

If we now want to classify the following new instance, "temperature=hot and humidity=normal", we calculate the likelihood of "play=yes" as follows:

likelihood of yes = 1/3 * 1/3 * 3/6 = 1/18
likelihood of no = 2/3 * 1/3 * 3/6 = 2/18

(we may simply multiply the individual attribute probabilities under the assumption that all attributes are equally important and independent of each other; that is why naive Bayes is called "naive")

probability of yes = (1/18)/((1/18)+(2/18)) = 1/3
probability of no = (2/18)/((1/18)+(2/18)) = 2/3

Based on the information given in this small example, the game will therefore most likely (with probability 2/3, about 67%) not be played.
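
To make the mechanics concrete, here is a small Python sketch (again our own illustration) that reproduces exactly these numbers from the six instances of the small example:

    from collections import Counter

    # The six instances of the small example: (temperature, humidity, play game)
    rows = [
        ("hot",  "high",   "no"),
        ("hot",  "high",   "no"),
        ("hot",  "high",   "yes"),
        ("mild", "high",   "yes"),
        ("cool", "normal", "yes"),
        ("cool", "normal", "no"),
    ]

    def naive_bayes(new, rows):
        """Class probabilities for a new instance (tuple of attribute values)."""
        classes = Counter(r[-1] for r in rows)      # class counts: yes 3, no 3
        n = len(rows)
        likelihood = {}
        for c, count in classes.items():
            p = count / n                           # prior P(c), here 3/6
            for i, value in enumerate(new):
                matches = sum(1 for r in rows if r[-1] == c and r[i] == value)
                p *= matches / count                # conditional P(value | c)
            likelihood[c] = p                       # 1/18 for yes, 2/18 for no
        total = sum(likelihood.values())
        return {c: p / total for c, p in likelihood.items()}

    print(naive_bayes(("hot", "normal"), rows))
    # -> probability of no = 2/3, probability of yes = 1/3, as derived above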

 

c)

Decide for the new instance given below whether to play or not, based on the classifiers calculated in a) and b):

outlook | temperature | humidity | wind   | play game
--------+-------------+----------+--------+----------
sunny   | cool        | high     | strong | ?

d)

Now, the training set above is extended by the following instance:

outlook  | temperature | humidity | wind | play game
---------+-------------+----------+------+----------
overcast | cool        | normal   | weak | no

Recalculate both classifiers on the extended dataset. What is the impact of this instance on the classifiers?

e)

Now, an instance containing a missing value has to be classified (use the classifiers from a) and b), built on the original dataset):

outlook | temperature | humidity | wind   | play game
--------+-------------+----------+--------+----------
-       | cool        | high     | strong | ?

Decide whether to play or not.
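
A note on the missing value: for naive Bayes, a common convention (described, for example, in Witten & Frank's book accompanying WEKA) is to simply leave the missing attribute out of the product of conditional probabilities. For the decision tree you have to pick a strategy yourself, e.g. following the most popular branch at a node that tests the missing attribute. The following variant of the sketch from b) illustrates the naive Bayes convention; the function name and the use of None to mark the missing value are our own:

    from collections import Counter

    def naive_bayes_missing(new, rows):
        """As in b), but attribute values given as None are treated as
        missing and skipped, so they contribute to neither class."""
        classes = Counter(r[-1] for r in rows)
        n = len(rows)
        likelihood = {}
        for c, count in classes.items():
            p = count / n                           # prior P(c)
            for i, value in enumerate(new):
                if value is None:                   # missing value: skip attribute
                    continue
                matches = sum(1 for r in rows if r[-1] == c and r[i] == value)
                p *= matches / count                # conditional P(value | c)
            likelihood[c] = p
        total = sum(likelihood.values())
        return {c: p / total for c, p in likelihood.items()}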

f)

Discuss the results above, focusing on the advantages and disadvantages of each classifier.

 

Part 3: Data Mining Practice

Overview

The goal of this exercise is to become familiar with a data mining tool and several data mining algorithms.

Introduction

The tool used is called WEKA and is written in Java. As seen in the lecture, it is intuitive and easy to handle. Download it (the GUI version is recommended) from http://www.cs.waikato.ac.nz/ml/weka/index.html and install it as described in the documentation. To complete the assignment it is sufficient to use the WEKA Explorer GUI (there is a user guide for the Explorer GUI, among other detailed tutorials, on the WEKA website).

The Assignment

There are two datasets provided: first, the German credit dataset, and second, the Titanic survivor dataset.

As an introduction, look first at the Titanic dataset. This dataset consists of only four categorical attributes. Answer the following questions and give your reasoning:

  • Which is the most significant attribute for estimating the chance of survival?
  • Is it worth paying more for a more luxurious class?
  • Did the gentlemen act like gentlemen?
  • Find another correlation between the attributes.

Have a look at the German credit dataset and try to find some regularities by eyeballing it.

  • Try a number of algorithms (decision tree, decision stumps, naive Bayes, neural network, ...) and compare the resulting models. Which one is:
    • most comprehensible (to a human)?
    • most predictive?
    • fastest when (a) classifying and (b) learning?
  • Which one would you use in a project and why?
  • What do you think about the learned models?
    • Is it wise to use them for credit granting?
    • What is the major problem of using such models?
    • How often do they need to be renewed?

(Hint: discretize attributes by applying a filter if an algorithm cannot handle numeric attributes).
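
To see what such a discretization filter does conceptually, here is a minimal equal-width binning sketch in Python (purely illustrative; WEKA's own filters are more flexible, and the exact filter name depends on the WEKA version):

    def equal_width_bins(values, k=3):
        """Map a list of numbers to k equal-width intervals, returning
        a bin index (0 .. k-1) for each value."""
        lo, hi = min(values), max(values)
        width = (hi - lo) / k or 1.0    # avoid dividing by zero for constant columns
        return [min(int((v - lo) / width), k - 1) for v in values]

    ages = [23, 35, 29, 61, 45, 52, 19]
    print(equal_width_bins(ages))       # -> [0, 1, 0, 2, 1, 2, 0]

After binning, an algorithm that only handles nominal attributes can treat each interval as one symbolic value.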

What to hand in

A write-up of your results, answers to the questions, and supporting material such as graphs, etc. Please do not write long prose but hand in a succinct (auf Deutsch: kurz und prägnant) description of your results. You can summarize things with bullet points.

Groups of 2-3 students are allowed. Please send your solution to vorburger@ifi.unizh.ch. Don't hesitate to write an email if questions or suggestions arise.