# Data: StackOverflow Posts

The code below loads the data set used for decision tree classification in practice session 2.

In [1]:
labeled_posts_dt = spark.read.load("/data/dt_example/")
labeled_posts_dt.printSchema()

root
 |-- id: long (nullable = true)
 |-- PostTypeId: long (nullable = true)
 |-- CreationDate: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- TagsRaw: string (nullable = true)
 |-- BodyRaw: string (nullable = true)
 |-- ParentId: long (nullable = true)
 |-- AcceptedAnswerId: long (nullable = true)
 |-- Score: double (nullable = true)
 |-- ViewCount: long (nullable = true)
 |-- AnswerCount: long (nullable = true)
 |-- CommentCount: long (nullable = true)
 |-- FavoriteCount: long (nullable = true)
 |-- Tags: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- Body: string (nullable = true)
 |-- Text: string (nullable = true)
 |-- IsJava: double (nullable = true)
 |-- HasNoAcceptedAnswer: boolean (nullable = true)



## Categorize the Score Values into Three Classes

In [2]:
from pyspark.ml.feature import QuantileDiscretizer

label_discretizer_model = QuantileDiscretizer(numBuckets=3, inputCol='Score', \
                        outputCol='Popularity').fit(labeled_posts_dt)
dataset = label_discretizer_model.transform(labeled_posts_dt)
dataset.cache()

dataset.select('Score', 'Popularity').sample(withReplacement=False, fraction=0.1).show(truncate=False)

# 90% Training, 10% Test
(training, test) = dataset.randomSplit([0.9, 0.1])

+-----+----------+
|Score|Popularity|
+-----+----------+
|0.0  |1.0       |
|0.0  |1.0       |
|1.0  |2.0       |
|0.0  |1.0       |
|2.0  |2.0       |
|0.0  |1.0       |
|1.0  |2.0       |
|1.0  |2.0       |
|-1.0 |0.0       |
|0.0  |1.0       |
|0.0  |1.0       |
|0.0  |1.0       |
|0.0  |1.0       |
|0.0  |1.0       |
|-2.0 |0.0       |
|0.0  |1.0       |
|-1.0 |0.0       |
|0.0  |1.0       |
|0.0  |1.0       |
|0.0  |1.0       |
+-----+----------+
only showing top 20 rows
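
To make the mapping in the sample output concrete, here is a minimal plain-Python sketch of how split points turn a score into a class index. The split points 0.0 and 1.0 are an assumption inferred from the sample rows above (negative scores map to 0.0, zero maps to 1.0, positive scores map to 2.0); the splits the fitted model really uses can be read with `label_discretizer_model.getSplits()`. `bucket_for` is a hypothetical helper, not part of Spark.

```python
from bisect import bisect_right

# Interior split points, ASSUMED from the sample output above.
# The outer bounds are -inf and +inf, so every score falls into some bucket.
ASSUMED_SPLITS = [0.0, 1.0]


def bucket_for(score, splits=ASSUMED_SPLITS):
    """Return the index of the half-open bucket [lower, upper) containing score."""
    return float(bisect_right(splits, score))


for score in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print("%5.1f -> %.1f" % (score, bucket_for(score)))
```

Because so many posts have a score of exactly 0, the three buckets cannot be equally sized, which is worth keeping in mind when reading the class counts later on.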



# Decision Tree Classifier Pipeline

Below is mostly the same code used in practice session 2 for decision tree classification (minus the VectorIndexer).

In [3]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler

feature_assembler = VectorAssembler(inputCols=['ViewCount', \
    'AnswerCount', 'CommentCount', 'FavoriteCount', 'HasNoAcceptedAnswer'], \
    outputCol='features')

# Setup a Decision Tree classifier with default parameters.
dt_classifier = DecisionTreeClassifier(labelCol='Popularity', featuresCol='features')

# Setup simple Pipeline of VectorAssembler and Decision Tree classifier.
dt_pipeline = Pipeline(stages=[feature_assembler, dt_classifier])

## Fit, Predict, and Evaluate

In [4]:
from pyspark.sql.functions import count
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

model_dt = dt_pipeline.fit(training)
predictions_dt = model_dt.transform(test)

# Let's look at the predicted values.
predictions_dt.select('Score', 'Popularity', 'prediction'). \
    sample(withReplacement=False, fraction=0.4).show(truncate=False)

# Are we balanced?
predictions_dt.groupBy('Popularity').agg(count('Popularity').alias('Count')).show()

# Evaluation
evaluator = MulticlassClassificationEvaluator(predictionCol=dt_classifier.getPredictionCol(), \
                labelCol=dt_classifier.getLabelCol(), metricName='accuracy')

print(evaluator.getMetricName() + (": %g" % evaluator.evaluate(predictions_dt)))

+-----+----------+----------+
|Score|Popularity|prediction|
+-----+----------+----------+
|0.0  |1.0       |1.0       |
|0.0  |1.0       |1.0       |
|1.0  |2.0       |1.0       |
|0.0  |1.0       |1.0       |
|0.0  |1.0       |1.0       |
|0.0  |1.0       |1.0       |
|-1.0 |0.0       |1.0       |
|1.0  |2.0       |1.0       |
|0.0  |1.0       |1.0       |
|0.0  |1.0       |1.0       |
|0.0  |1.0       |1.0       |
|1.0  |2.0       |2.0       |
|0.0  |1.0       |1.0       |
|-1.0 |0.0       |1.0       |
|0.0  |1.0       |1.0       |
|0.0  |1.0       |1.0       |
|-2.0 |0.0       |2.0       |
|-1.0 |0.0       |2.0       |
|0.0  |1.0       |1.0       |
|-3.0 |0.0       |1.0       |
+-----+----------+----------+
only showing top 20 rows

+----------+-----+
|Popularity|Count|
+----------+-----+
|       0.0|  374|
|       1.0| 1609|
|       2.0| 1320|
+----------+-----+

accuracy: 0.737209
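
To put the accuracy into perspective, it helps to compare it with a classifier that always predicts the most frequent class. The sketch below uses only the class counts and the accuracy printed above; the variable names are illustrative.

```python
# Class counts copied from the groupBy output above (test set).
class_counts = {0.0: 374, 1.0: 1609, 2.0: 1320}
reported_accuracy = 0.737209  # value printed by the evaluator above

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = class_counts[majority_class] / total
gain = reported_accuracy - baseline_accuracy

print("rows in test set: %d" % total)
print("majority class: %s" % majority_class)
print("majority-class baseline accuracy: %.3f" % baseline_accuracy)
print("gain over baseline: %.3f" % gain)
```

The classes are not balanced (class 0.0 is much smaller than the other two), so accuracy alone can hide poor performance on the smallest class.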


## Inspect the Tree

In [5]:
def print_dt_info(features, tree_model):
    print("Features %s" % features)
    print("Feature importances: %s" % tree_model.featureImportances.toArray())
    print("Tree: %s" % tree_model.toDebugString)
    
print_dt_info(feature_assembler.getInputCols(), model_dt.stages[1])

Features ['ViewCount', 'AnswerCount', 'CommentCount', 'FavoriteCount', 'HasNoAcceptedAnswer']
Feature importances: [0.94718312 0.00892717 0.01469145 0.02919826 0.        ]
Tree: DecisionTreeClassificationModel (uid=DecisionTreeClassifier_4d23b7b334a0a344dc29) of depth 5 with 63 nodes
  If (feature 0 <= 53.0)
   If (feature 0 <= 20.0)
    If (feature 0 <= 10.0)
     If (feature 2 <= 0.0)
      If (feature 0 <= 5.0)
       Predict: 1.0
      Else (feature 0 > 5.0)
       Predict: 1.0
     Else (feature 2 > 0.0)
      If (feature 1 <= 0.0)
       Predict: 1.0
      Else (feature 1 > 0.0)
       Predict: 1.0
    Else (feature 0 > 10.0)
     If (feature 1 <= 0.0)
      If (feature 3 <= 0.0)
       Predict: 1.0
      Else (feature 3 > 0.0)
       Predict: 1.0
     Else (feature 1 > 0.0)
      If (feature 3 <= 0.0)
       Predict: 1.0
      Else (feature 3 > 0.0)
       Predict: 1.0
   Else (feature 0 > 20.0)
    If (feature 3 <= 0.0)
     If (feature 2 <= 0.0)
      If (feature 0 <= 41.0)
  

Maybe the tree is too deep?

In [6]:
print(dt_classifier.explainParam('maxDepth'))

maxDepth: Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (default: 5)
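
The depth-to-node-count relation in the parameter description can be checked with a little arithmetic. This is a minimal sketch, assuming a binary tree (which the If/Else structure of the debug string suggests); `max_nodes` and `max_leaves` are hypothetical helpers.

```python
def max_nodes(depth):
    """Upper bound on the number of nodes in a binary tree of the given depth."""
    return 2 ** (depth + 1) - 1


def max_leaves(depth):
    """Upper bound on the number of leaf nodes in a binary tree of the given depth."""
    return 2 ** depth


for depth in [0, 1, 5]:
    print("depth %d: at most %d nodes, %d leaves"
          % (depth, max_nodes(depth), max_leaves(depth)))
```

Depth 5 allows at most 63 nodes, which is exactly the node count reported for the fitted tree, so the default tree is grown all the way to its depth limit.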


# Model Tuning with k-fold CrossValidation

Let's find the optimal depth using a systematic and automated approach.

In [7]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

param_grid_dt = ParamGridBuilder().addGrid(dt_classifier.maxDepth, [1, 2, 3, 4, 5, 6, 8, 10, 15, 30]).build()

crossval_dt = CrossValidator(estimator=dt_pipeline, \
                          estimatorParamMaps=param_grid_dt, \
                          evaluator=evaluator, \
                          numFolds=3)

cv_model_dt = crossval_dt.fit(training)

prediction_dt_cv = cv_model_dt.transform(test)
print(evaluator.getMetricName() + (": %g" % evaluator.evaluate(prediction_dt_cv)))

accuracy: 0.737209
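
For intuition, the sketch below shows the idea behind k-fold cross-validation in plain Python: the rows are partitioned into k folds, and each fold serves once as the validation set while the remaining folds are used for training. The round-robin fold assignment is an assumption made for readability (Spark assigns rows to folds randomly), and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train_idx, validation_idx) pairs; each row is validated exactly once."""
    # Round-robin assignment of row indices to folds (illustrative only).
    folds = [list(range(f, n_rows, num_folds)) for f in range(num_folds)]
    for held_out in range(num_folds):
        validation = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, validation


splits = list(k_fold_indices(n_rows=9, num_folds=3))
for train, validation in splits:
    print("train:", sorted(train), "validation:", validation)
```

The metric for one parameter value is the average over the folds, which is what `avgMetrics` reports per parameter map.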


## Inspect the Cross-Validation Model

In [8]:
print("Average Metrics: %s" % cv_model_dt.avgMetrics)

best_dt_model = cv_model_dt.bestModel.stages[1]
print("\nBest Model:")
print_dt_info(feature_assembler.getInputCols(), best_dt_model)

Average Metrics: [0.7365286474320645, 0.7365286474320645, 0.7413568596377471, 0.7421066096438848, 0.7428991397709886, 0.7424877507076783, 0.7416390069121415, 0.7386890131973116, 0.7291802211679683, 0.7262578062664323]

Best Model:
Features ['ViewCount', 'AnswerCount', 'CommentCount', 'FavoriteCount', 'HasNoAcceptedAnswer']
Feature importances: [0.94718312 0.00892717 0.01469145 0.02919826 0.        ]
Tree: DecisionTreeClassificationModel (uid=DecisionTreeClassifier_4d23b7b334a0a344dc29) of depth 5 with 63 nodes
  If (feature 0 <= 53.0)
   If (feature 0 <= 20.0)
    If (feature 0 <= 10.0)
     If (feature 2 <= 0.0)
      If (feature 0 <= 5.0)
       Predict: 1.0
      Else (feature 0 > 5.0)
       Predict: 1.0
     Else (feature 2 > 0.0)
      If (feature 1 <= 0.0)
       Predict: 1.0
      Else (feature 1 > 0.0)
       Predict: 1.0
    Else (feature 0 > 10.0)
     If (feature 1 <= 0.0)
      If (feature 3 <= 0.0)
       Predict: 1.0
      Else (feature 3 > 0.0)
       Predict: 1.0
     

Looks like the default maxDepth of 5 performs best: it has the highest average metric (about 0.7429). CV chose the same model as we had by default.
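
The selection can be reproduced by hand from the printed numbers. The sketch below assumes that the entries of `avgMetrics` are in the same order as the `maxDepth` values passed to `addGrid`; the lists are copied from the cells above.

```python
# maxDepth candidates, in the order given to addGrid above.
max_depth_grid = [1, 2, 3, 4, 5, 6, 8, 10, 15, 30]

# Average metrics copied from the cross-validation output above.
avg_metrics = [
    0.7365286474320645, 0.7365286474320645, 0.7413568596377471,
    0.7421066096438848, 0.7428991397709886, 0.7424877507076783,
    0.7416390069121415, 0.7386890131973116, 0.7291802211679683,
    0.7262578062664323,
]

# ASSUMPTION: avg_metrics[i] belongs to max_depth_grid[i].
best_index = max(range(len(avg_metrics)), key=avg_metrics.__getitem__)
best_depth = max_depth_grid[best_index]

for depth, metric in zip(max_depth_grid, avg_metrics):
    marker = " <- best" if depth == best_depth else ""
    print("maxDepth=%2d  accuracy=%.4f%s" % (depth, metric, marker))
```

The numbers also show that accuracy falls off for the deepest trees (15 and 30), which is the usual sign of overfitting.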

# Model Selection with Train-Validation Split

This is the same as with CV, but uses Train-Validation Split instead, which evaluates each parameter set on only one (training, validation) pair rather than on k folds.

In [9]:
from pyspark.ml.tuning import TrainValidationSplit

tv_split_dt = TrainValidationSplit( \
                estimator=dt_pipeline, \
                estimatorParamMaps=param_grid_dt, \
                evaluator=evaluator, \
                trainRatio=0.8)

tv_split_dt_model = tv_split_dt.fit(training)
prediction_tv_dt = tv_split_dt_model.transform(test)
print(evaluator.getMetricName() + (": %g" % evaluator.evaluate(prediction_tv_dt)))

accuracy: 0.737209
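
The practical difference between the two tuning strategies is cost versus reliability. The sketch below counts model fits during the search for the grid used here and shows how `trainRatio` divides the rows; `fits_for_search` and `split_sizes` are hypothetical helpers, and the row count of 1000 is a made-up example value. Spark splits randomly, so real split sizes are only approximately in this ratio.

```python
def fits_for_search(num_param_maps, num_evaluation_splits):
    """Number of model fits needed to score every parameter map once per split."""
    return num_param_maps * num_evaluation_splits


def split_sizes(n_rows, train_ratio):
    """Expected (train, validation) sizes for a single train-validation split."""
    n_train = int(round(n_rows * train_ratio))
    return n_train, n_rows - n_train


num_param_maps = 10  # maxDepth candidates in param_grid_dt

cv_fits = fits_for_search(num_param_maps, num_evaluation_splits=3)   # numFolds=3
tvs_fits = fits_for_search(num_param_maps, num_evaluation_splits=1)  # one split

print("cross-validation fits: %d" % cv_fits)
print("train-validation split fits: %d" % tvs_fits)
print("split of 1000 example rows at trainRatio=0.8: %s" % (split_sizes(1000, 0.8),))
```

Train-Validation Split is cheaper, but each metric comes from a single validation set and is therefore noisier than a cross-validated average.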


## Inspect the Train-Validation Split Model

In [10]:
print("Metrics: %s" % tv_split_dt_model.validationMetrics)

best_dt_model2 = tv_split_dt_model.bestModel.stages[1]
print("\nBest Model:")
print_dt_info(feature_assembler.getInputCols(), best_dt_model2)

Metrics: [0.7383292383292384, 0.7383292383292384, 0.73990873990874, 0.739031239031239, 0.7416637416637417, 0.7395577395577395, 0.7388557388557389, 0.7388557388557389, 0.7363987363987364, 0.7325377325377326]

Best Model:
Features ['ViewCount', 'AnswerCount', 'CommentCount', 'FavoriteCount', 'HasNoAcceptedAnswer']
Feature importances: [0.94718312 0.00892717 0.01469145 0.02919826 0.        ]
Tree: DecisionTreeClassificationModel (uid=DecisionTreeClassifier_4d23b7b334a0a344dc29) of depth 5 with 63 nodes
  If (feature 0 <= 53.0)
   If (feature 0 <= 20.0)
    If (feature 0 <= 10.0)
     If (feature 2 <= 0.0)
      If (feature 0 <= 5.0)
       Predict: 1.0
      Else (feature 0 > 5.0)
       Predict: 1.0
     Else (feature 2 > 0.0)
      If (feature 1 <= 0.0)
       Predict: 1.0
      Else (feature 1 > 0.0)
       Predict: 1.0
    Else (feature 0 > 10.0)
     If (feature 1 <= 0.0)
      If (feature 3 <= 0.0)
       Predict: 1.0
      Else (feature 3 > 0.0)
       Predict: 1.0
     Else (featu

# Tree Ensembles

Tuning a single decision tree did not improve the result. Let's try a random forest.

In [11]:
from pyspark.ml.classification import RandomForestClassifier

random_forest = RandomForestClassifier(labelCol=dt_classifier.getLabelCol(), \
                                       featuresCol=dt_classifier.getFeaturesCol())

pipeline_rf = Pipeline(stages=[feature_assembler, random_forest])

In [12]:
print(random_forest.explainParam(random_forest.numTrees.name))

numTrees: Number of trees to train (>= 1). (default: 20)


In [13]:
param_grid_rf = ParamGridBuilder() \
    .addGrid(random_forest.maxDepth, [1, 2, 3, 4, 5, 6, 8, 10, 15, 30]) \
    .addGrid(random_forest.numTrees, [5, 10, 50, 100]) \
    .build()
    
tv_split_rf = TrainValidationSplit(estimator=pipeline_rf, \
                                   estimatorParamMaps=param_grid_rf, \
                                   evaluator=evaluator, \
                                   trainRatio=0.9)

tv_split_model_rf = tv_split_rf.fit(training)
prediction_rf = tv_split_model_rf.transform(test)
print(evaluator.getMetricName() + (": %g" % evaluator.evaluate(prediction_rf)))

accuracy: 0.737511


## Inspect the Train-Validation Split Model of the RandomForestClassifier

In [18]:
print("Metrics: %s" % tv_split_model_rf.validationMetrics)

best_rf_model = tv_split_model_rf.bestModel.stages[1]
# Code below creates big output!
# print("\nBest Model:")
# print_dt_info(feature_assembler.getInputCols(), best_rf_model)

Metrics: [0.7335243553008596, 0.7388968481375359, 0.7388968481375359, 0.7356733524355301, 0.7378223495702005, 0.7378223495702005, 0.7374641833810889, 0.7371060171919771, 0.7399713467048711, 0.7399713467048711, 0.7410458452722063, 0.7414040114613181, 0.7417621776504298, 0.7414040114613181, 0.7428366762177651, 0.744269340974212, 0.7431948424068768, 0.7421203438395415, 0.7424785100286533, 0.7431948424068768, 0.7431948424068768, 0.7414040114613181, 0.7431948424068768, 0.7431948424068768, 0.7435530085959885, 0.7431948424068768, 0.7439111747851003, 0.7460601719197708, 0.7431948424068768, 0.7446275071633238, 0.7435530085959885, 0.7453438395415473, 0.7392550143266475, 0.7424785100286533, 0.7431948424068768, 0.7414040114613181, 0.7392550143266475, 0.7417621776504298, 0.7424785100286533, 0.7421203438395415]
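
A quick summary of the 40 validation metrics makes the picture clearer than the raw list. The values below are copied from the output above; the sketch only reports the position of the best metric and does not guess which parameter combination it belongs to. In a live session, pairing `tv_split_rf.getEstimatorParamMaps()` with `tv_split_model_rf.validationMetrics` gives that mapping.

```python
# Validation metrics copied from the output above, in the order Spark reported them.
validation_metrics = [
    0.7335243553008596, 0.7388968481375359, 0.7388968481375359, 0.7356733524355301,
    0.7378223495702005, 0.7378223495702005, 0.7374641833810889, 0.7371060171919771,
    0.7399713467048711, 0.7399713467048711, 0.7410458452722063, 0.7414040114613181,
    0.7417621776504298, 0.7414040114613181, 0.7428366762177651, 0.744269340974212,
    0.7431948424068768, 0.7421203438395415, 0.7424785100286533, 0.7431948424068768,
    0.7431948424068768, 0.7414040114613181, 0.7431948424068768, 0.7431948424068768,
    0.7435530085959885, 0.7431948424068768, 0.7439111747851003, 0.7460601719197708,
    0.7431948424068768, 0.7446275071633238, 0.7435530085959885, 0.7453438395415473,
    0.7392550143266475, 0.7424785100286533, 0.7431948424068768, 0.7414040114613181,
    0.7392550143266475, 0.7417621776504298, 0.7424785100286533, 0.7421203438395415,
]

# Grid size from param_grid_rf: 10 maxDepth values x 4 numTrees values.
expected_grid_size = 10 * 4

best_index = max(range(len(validation_metrics)), key=validation_metrics.__getitem__)
spread = max(validation_metrics) - min(validation_metrics)

print("parameter maps evaluated: %d" % len(validation_metrics))
print("best validation accuracy: %.4f at index %d"
      % (validation_metrics[best_index], best_index))
print("spread between best and worst: %.4f" % spread)
```

The whole grid spans little more than one percentage point of accuracy, so no combination of depth and number of trees stands out.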


Well, that didn't work: test accuracy is 0.737511, versus 0.737209 for the single tree.

Perhaps we should try a different ML algorithm.

# Multilayer Perceptron Classifier

In [15]:
from pyspark.ml.classification import MultilayerPerceptronClassifier

# Specify layers for the neural network:
# input layer of size 5 (features), then two hidden layers (12 and 6 units),
# and output layer of size 3 (classes).
layers = [5, 12, 6, 3]

mlp = MultilayerPerceptronClassifier(maxIter=400, \
                                     layers=layers, \
                                     labelCol=dt_classifier.getLabelCol(), \
                                     featuresCol=feature_assembler.getOutputCol(), \
                                     predictionCol=dt_classifier.getPredictionCol())

pipeline_mlp = Pipeline(stages=[feature_assembler, mlp])
                                                 
model_mlp = pipeline_mlp.fit(training)

predictions_mlp = model_mlp.transform(test)

print(evaluator.getMetricName() + (": %g" % evaluator.evaluate(predictions_mlp)))

accuracy: 0.741447


Not much changed: 0.741447 versus 0.737209 for the decision tree.
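
The first entry of `layers` has to equal the number of assembled features (five input columns in `feature_assembler`) and the last entry the number of classes (three `Popularity` buckets). The sketch below checks this and estimates the size of the network, assuming every layer is fully connected with one bias per output unit; `mlp_parameter_count` is a hypothetical helper.

```python
def mlp_parameter_count(layers):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


layers = [5, 12, 6, 3]

input_cols = ['ViewCount', 'AnswerCount', 'CommentCount',
              'FavoriteCount', 'HasNoAcceptedAnswer']
num_classes = 3  # Popularity buckets 0.0, 1.0, 2.0

# The network shape must match the data it is trained on.
assert layers[0] == len(input_cols)
assert layers[-1] == num_classes

print("trainable parameters: %d" % mlp_parameter_count(layers))
```

Under that assumption the network has 171 trainable parameters, which is small for the amount of data, so model capacity is unlikely to be the limiting factor.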

## Adding Feature Selection

Feature selection is not needed for Decision Trees because they implicitly do their own feature selection. But it might benefit a neural network.
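
The tree's implicit feature selection is visible in the feature importances printed earlier. The sketch below ranks them; the names and values are copied from the "Inspect the Tree" output.

```python
# Copied from the print_dt_info output of the decision tree model.
features = ['ViewCount', 'AnswerCount', 'CommentCount',
            'FavoriteCount', 'HasNoAcceptedAnswer']
importances = [0.94718312, 0.00892717, 0.01469145, 0.02919826, 0.0]

# Sort features from most to least important.
ranked = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print("%-20s %.4f" % (name, importance))

# Features the tree never split on.
unused = [name for name, imp in zip(features, importances) if imp == 0.0]
print("never used for a split: %s" % unused)
```

`ViewCount` accounts for almost 95% of the importance and `HasNoAcceptedAnswer` is never used, so the tree has effectively reduced the problem to one dominant feature on its own.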

In [16]:
from pyspark.ml.feature import ChiSqSelector

feature_selector = ChiSqSelector(featuresCol=feature_assembler.getOutputCol(), \
    outputCol='selected', labelCol=dt_classifier.getLabelCol())

pipeline_fs_mlp = Pipeline(stages=[feature_assembler, feature_selector, mlp])

param_grid_fs_mlp = [
    {mlp.featuresCol: feature_selector.getOutputCol(), mlp.layers: [1, 12, 6, 3], feature_selector.numTopFeatures: 1},
    {mlp.featuresCol: feature_selector.getOutputCol(), mlp.layers: [2, 12, 6, 3], feature_selector.numTopFeatures: 2},
    {mlp.featuresCol: feature_selector.getOutputCol(), mlp.layers: [3, 12, 6, 3], feature_selector.numTopFeatures: 3},
    {mlp.featuresCol: feature_selector.getOutputCol(), mlp.layers: [4, 12, 6, 3], feature_selector.numTopFeatures: 4}
]
    
    
tv_split_fs_mlp = TrainValidationSplit(estimator=pipeline_fs_mlp, \
                                       estimatorParamMaps=param_grid_fs_mlp, \
                                       evaluator=evaluator,
                                       trainRatio=0.8)

tv_split_model_fs_mlp = tv_split_fs_mlp.fit(training)
prediction_fs_mlp = tv_split_model_fs_mlp.transform(test)
print(evaluator.getMetricName() + (": %g" % evaluator.evaluate(prediction_fs_mlp)))

accuracy: 0.7363


## Inspect Train-Validation Split Model for MultilayerPerceptronClassifier

In [17]:
print("Metrics: %s" % tv_split_model_fs_mlp.validationMetrics)

best_fs_mlp_model = tv_split_model_fs_mlp.bestModel.stages[2]
print("\nBest Model: %s" % best_fs_mlp_model.layers)

Metrics: [0.7397332397332397, 0.7376272376272376, 0.7411372411372411, 0.7492102492102493]

Best Model: [4, 12, 6, 3]
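
The parameter maps for this search were written out by hand because two settings are coupled: the input layer must be exactly as wide as the number of selected features. The sketch below generates the same combinations programmatically; plain string keys stand in for the Spark `Param` objects, and the dictionaries are illustrative only, not something Spark accepts directly.

```python
hidden_and_output = [12, 6, 3]        # hidden layers and output layer stay fixed
num_top_features_grid = [1, 2, 3, 4]  # candidates used in the grid above

# One configuration per candidate; the input layer width follows numTopFeatures.
configs = [
    {"numTopFeatures": k, "layers": [k] + hidden_and_output}
    for k in num_top_features_grid
]

for config in configs:
    print(config)
```

Note that the grid stops at four of the five available features, so the best model found here still drops one feature compared with the plain MLP pipeline.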


# Summary

Model tuning and different ML algorithms didn't change the result in any meaningful way: test accuracy stayed between roughly 0.736 and 0.741 in every experiment. The most plausible conclusion for now is that a better result is not possible with the given features. We need to engineer better features to improve.