What is a Decision Tree?

A Decision Tree is like a flow chart: at the top of the tree there is a ROOT, where things start, and at the bottom there are leaves. Below is a visualization of a Decision Tree.

(figure: decision tree visualization, max depth 3)

Gini impurity

The Gini impurity score tells how well a split criterion separates the instances. For a node it is computed as Gi = 1 - (sum of squared class proportions in the node).

  • Gi = 0.49 means it does not separate the training instances very well
  • Gi = 0.0 means it separates the training instances perfectly

(figure: Gini impurity values at each node)
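As a concrete illustration of how this score is computed, here is a minimal sketch (not from the original post):

import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

# a node with a 50/50 class mix is maximally impure (0.5),
# a pure node scores 0.0
print(gini_impurity([0, 1, 0, 1]))  # 0.5
print(gini_impurity([1, 1, 1, 1]))  # 0.0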

How does a Decision Tree work?

  1. calculate the Gini impurity for each feature, using all the records available
  2. select the feature with the lowest Gini impurity as the root (see the sketch below)
  3. repeat steps 1-2 for each node until:
  • there are no more records
  • further separation increases the Gini impurity
  • a stopping criterion is met
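To make steps 1-2 concrete, here is a rough sketch (my own illustration, not from the original code) of picking the split with the lowest weighted Gini impurity; it reuses the gini_impurity helper defined above and assumes numerical features in a NumPy array X with labels y:

import numpy as np

def weighted_gini(feature_values, labels, threshold):
    """Weighted Gini impurity of the two child nodes created by a split."""
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    if len(left) == 0 or len(right) == 0:
        return np.inf  # degenerate split, ignore it
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

def best_split(X, y):
    """Scan every feature and candidate threshold, keep the lowest score."""
    best_feature, best_threshold, best_score = None, None, np.inf
    for j in range(X.shape[1]):
        for threshold in np.unique(X[:, j]):
            score = weighted_gini(X[:, j], y, threshold)
            if score < best_score:
                best_feature, best_threshold, best_score = j, threshold, score
    return best_feature, best_threshold, best_score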

Data preparation

  • Feature Scaling: Because a Decision Tree looks at one feature at a time, there is generally no need for feature scaling
  • Feature Imputation (see the sketch after this list)
    • categorical: pick the majority value, or predict it from another highly correlated column
    • numerical: mean or median
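A short sketch of what imputation can look like with scikit-learn's SimpleImputer; the column names and values here are hypothetical, not taken from the heart disease data:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical toy frame with one categorical and one numerical column
df = pd.DataFrame({
    "chest_pain_type": ["typical", np.nan, "atypical", "typical"],
    "cholesterol":     [230.0, 180.0, np.nan, 210.0],
})

# categorical: fill with the majority (most frequent) value
df[["chest_pain_type"]] = SimpleImputer(strategy="most_frequent").fit_transform(df[["chest_pain_type"]])

# numerical: fill with the median (strategy="mean" also works)
df[["cholesterol"]] = SimpleImputer(strategy="median").fit_transform(df[["cholesterol"]])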

Overfitting issue in DT

Below is a decision tree built with no restrictions. Compared with the max-depth-3 tree above, this second tree clearly has an overfitting problem.

(figure: decision tree built with no restrictions)

A few ways to overcome this:

  • min samples per leaf: the minimum number of training instances required in each leaf
  • max depth: how deep the decision tree is allowed to grow (see the sketch below)
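As a self-contained sketch of the effect (using the toy make_moons data rather than the heart disease set), compare the train/test scores of an unrestricted tree against a restricted one:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# toy dataset, not the heart disease data
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# no restriction: the tree keeps splitting until the leaves are pure
unrestricted = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# restricted: shallower tree with a minimum number of samples per leaf
restricted = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42).fit(X_train, y_train)

for name, clf in [("unrestricted", unrestricted), ("restricted", restricted)]:
    print(name, clf.score(X_train, y_train), clf.score(X_test, y_test))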

Code Example

The examples below use the Kaggle Heart Disease Dataset.

# training decision tree
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier(max_depth=5, min_samples_leaf=2)
tree_clf.fit(x_train, y_train)
# confusion matrix
import matplotlib.pyplot as plt
from sklearn.metrics import plot_confusion_matrix
fig, ax = plt.subplots(figsize=(15, 10))
plot_confusion_matrix(tree_clf, x_test, y_test, ax=ax)

(figure: confusion matrix for the decision tree)

# other scores
from sklearn.metrics import classification_report
y_pred = tree_clf.predict(x_test)
print(classification_report(y_test, y_pred))

# output
#               precision    recall  f1-score   support
#
#            0       0.81      0.90      0.85        29
#            1       0.90      0.81      0.85        32
#
#     accuracy                           0.85        61
#    macro avg       0.85      0.85      0.85        61
# weighted avg       0.86      0.85      0.85        61
# roc curve
from sklearn.metrics import roc_auc_score, roc_curve
y_prob = tree_clf.predict_proba(x_test)[:, 1]
roc_score = roc_auc_score(y_test, y_prob)
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.plot([0, 1], [0, 1], "red")  # diagonal chance line
plt.plot(fpr, tpr)

(figure: ROC curve for the decision tree)

Random Forest

For each tree, a random subset of features and a random subset of samples are selected, so the forest is built from many shorter, simpler trees.

Collectively they predict the result: each tree votes, and the majority vote is taken as the final prediction.

Regarding sampling for each tree, there are 2 main approaches:

  1. BAGGING: short for bootstrap aggregation, which is sampling with replacement, allowing each record to be sampled multiple times for every predictor (aka tree)
  2. PASTING: sampling without replacement.

Code Example

# random forest
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier()
rnd_clf.fit(x_train, y_train)
y_pred = rnd_clf.predict(x_test)

# bagging classifier: bootstrap=True samples with replacement (bagging),
# bootstrap=False would give pasting
from sklearn.ensemble import BaggingClassifier
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(x_train, y_train)
y_pred = bag_clf.predict(x_test)
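Both ensembles can then be scored the same way as the single tree, for example with accuracy_score (a short sketch assuming the same x_test / y_test split as above):

from sklearn.metrics import accuracy_score
print("random forest:", accuracy_score(y_test, rnd_clf.predict(x_test)))
print("bagging:      ", accuracy_score(y_test, bag_clf.predict(x_test)))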