Introduction
As announced, assignment #3 consists of three parts: a self-exploratory online task, some paper-and-pencil work, and some experimental work (using the Weka data mining toolkit).
Due date: Wednesday, 21 January 2004
Part 1: Self Exploration
Decision trees
Please go to
http://www.ifi.unizh.ch/ddisold/Teaching/BI2003/Applets/CLspace/dtree/
(page protected with the usual username & password)
or, alternatively, go to:
http://www.cs.ubc.ca/labs/lci/CIspace/Version4/dTree/index.html
and try to understand the details of how a decision tree learner works.
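If it helps to see the procedure spelled out, below is a minimal Python sketch of the kind of greedy, information-gain-based learning such an applet illustrates (function and variable names are my own, not the applet's; examples are assumed to be dicts mapping attribute names to values, plus a class label):

import math
from collections import Counter

def entropy(labels):
    # Entropy (in bits) of a list of class labels.
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target):
    # Reduction in entropy of the target obtained by splitting on the attribute.
    base = entropy([ex[target] for ex in examples])
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex[target] for ex in examples if ex[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder

def build_tree(examples, attributes, target):
    # Recursively split on the attribute with the highest gain (ID3-style).
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, rest, target)
    return tree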
Bayesian Belief Networks
Please go to
http://www.ifi.unizh.ch/ddisold/Teaching/BI2003/Applets/CLspace/baysian/
(page protected with the usual username & password)
or, alternatively, go to:
http://www.cs.ubc.ca/labs/lci/CIspace/Version4/bayes/index.html
and try to understand the details of how a Bayesian network works.
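As a small illustration of what such a network encodes, here is a Python sketch of exact inference by enumeration in a tiny, hypothetical two-cause network (Burglary and Earthquake both influence Alarm); the numbers are made up for illustration and are not taken from the applet:

P_B = {True: 0.001, False: 0.999}   # prior P(Burglary)
P_E = {True: 0.002, False: 0.998}   # prior P(Earthquake)
P_A = {                             # P(Alarm=true | Burglary, Earthquake)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}

def posterior_burglary(alarm=True):
    # P(Burglary | Alarm=alarm), summing out the hidden variable Earthquake.
    scores = {}
    for b in (True, False):
        total = 0.0
        for e in (True, False):
            p_alarm = P_A[(b, e)] if alarm else 1.0 - P_A[(b, e)]
            total += P_B[b] * P_E[e] * p_alarm
        scores[b] = total
    norm = sum(scores.values())
    return {b: s / norm for b, s in scores.items()}

print(posterior_burglary(alarm=True))   # P(Burglary=true | Alarm=true) comes out at roughly 0.37 here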
Part 2: Paper and pencil
The goal of this exercise is to become familiar with decision trees and probabilistic learning (naive Bayes). For this purpose, a simple dataset about a weather-dependent game is provided. The decision whether to play the game depends on the attributes outlook, temperature, humidity, and wind, as shown in the table below.
outlook  | temperature | humidity | wind   | play game
---------+-------------+----------+--------+----------
sunny    | hot         | high     | weak   | no
sunny    | hot         | high     | strong | no
overcast | hot         | high     | weak   | yes
rain     | mild        | high     | weak   | yes
rain     | cool        | normal   | weak   | yes
rain     | cool        | normal   | strong | no
overcast | cool        | normal   | strong | yes
sunny    | mild        | high     | weak   | no
sunny    | cool        | normal   | weak   | yes
rain     | mild        | normal   | weak   | yes
sunny    | mild        | normal   | strong | yes
overcast | mild        | high     | strong | yes
overcast | hot         | normal   | weak   | yes
rain     | mild        | high     | strong | no
a) Build the decision tree based on the data above by calculating the information gain for each possible node, as shown in the lecture. Please hand in the resulting tree, including all calculation steps.
b) Apply naive Bayes as a probabilistic learning algorithm to the dataset above. The following calculation, based on a smaller dataset, is provided as an example (a rough self-check sketch in Python follows this exercise list):
c) Decide whether or not to play for the new instance given below, based on the calculated classifiers:
d) Now the training set above is extended by the following instance:
Recalculate both classifiers on the extended dataset. What is the impact of this instance on the classifiers?
e) Now an instance with missing values has to be classified (use the first two classifiers, built on the original dataset):
Decide whether to play or not.
f) Discuss the results above, focusing on the advantages and disadvantages of each classifier.
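For parts (b), (c) and (e), the following Python sketch can serve as a rough self-check of your hand calculations. It uses plain relative frequencies (no smoothing), so adjust it if your calculation uses a Laplace correction; the query instance at the bottom is a hypothetical example, not the instance handed out with the sheet.

from collections import Counter

ATTRS = ["outlook", "temperature", "humidity", "wind"]
DATA = [  # (outlook, temperature, humidity, wind, play)
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]

def naive_bayes(instance):
    # Unnormalized score P(class) * product of P(attribute=value | class).
    # Attributes missing from the instance (part e) are simply skipped.
    class_counts = Counter(row[-1] for row in DATA)
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / len(DATA)                     # prior P(class)
        rows = [row for row in DATA if row[-1] == cls]
        for i, attr in enumerate(ATTRS):
            if attr not in instance:                  # missing value: skip
                continue
            match = sum(1 for row in rows if row[i] == instance[attr])
            score *= match / n_cls                    # likelihood P(value | class)
        scores[cls] = score
    return scores

# Hypothetical query instance, only to show how the function is called:
print(naive_bayes({"outlook": "sunny", "temperature": "cool",
                   "humidity": "high", "wind": "strong"}))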
Part 3: Data Mining Practice
Overview
The goal of this exercise is to become familiar with a data mining tool and several data mining algorithms.
Introduction
The tool used is called WEKA and is written in Java. As seen in the lecture, it is very intuitive and easy to handle. Download it (GUI version recommended) from http://www.cs.waikato.ac.nz/ml/weka/index.html and install it as described in the documentation. To complete the assignment it is sufficient to use the WEKA Explorer GUI (there is a user guide for the Explorer GUI, among other detailed tutorials, on the WEKA website).
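The datasets are loaded into the Explorer from files; Weka's native file format is ARFF. The snippet below is only a hypothetical illustration of the ARFF structure (the relation name, attribute names and values are assumptions made for illustration, so check the actual files you are given):

% hypothetical ARFF file illustrating the format, not the handed-out data
@relation titanic

@attribute class    {first, second, third, crew}
@attribute age      {adult, child}
@attribute sex      {male, female}
@attribute survived {yes, no}

@data
first, adult, female, yes
crew,  adult, male,   no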
The Assignment
Two datasets are provided: the German credit dataset and the Titanic survivor dataset.
As an introduction, look first at the Titanic dataset. It consists of only four attributes. Answer the following questions and give your reasoning:
- Which is the most significant attribute for estimating the chance of survival?
- Is it worth paying more for a more luxurious class?
- Did the gentlemen act like gentlemen?
- Find another correlation between the attributes.
Have a look at the German credit dataset and try to find some regularities by eyeballing it.
- Try a number of algorithms (decision tree, decision stump, naive Bayes, neural network, ...) and compare the resulting models. Which one is:
- most comprehensible (to a human)?
- most predictive?
- fastest when (a) classifying and (b) learning?
- Which one would you use in a project and why?
- What do you think of the learned models?
- Would it be wise to use them for credit granting?
- What is the major problem of using such models?
- How often do they need to be renewed?
(Hint: discretize attributes by applying a filter if an algorithm cannot handle numeric attributes).
What to hand in
A write-up of your results, answers to the questions, and supporting material such as graphs, etc. Please do not write long prose; hand in a succinct (kurz und prägnant, i.e. short and to the point) description of your results. You can summarize things with bullet points.
Groups of 2-3 students are allowed. Please send your solution to vorburger@ifi.unizh.ch. Don't hesitate to write an email if questions or suggestions arise.