A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules. For all X that falls in a particular region (leaf), we make the same prediction. For this post, we give a brief introduction to Decision Trees and outline some widely used Decision Tree algorithms.
Content:
The Basics
Classification and Regression Tree
Overfitting and Pruning Trees
Trees VS Linear Models
Advantages VS Disadvantages
PART ONE
Classification and Regression Trees:
Tree-based regression models are known for their simplicity and efficiency when dealing with domains with large number of variables and cases. Regression trees are obtained using a fast divide and conquer greedy algorithm that recursively partitions the given training data into smaller subsets.
PsuedoCode:
Assign all training instances to the root of the tree. Set current node to root node.
For each attribute, Partition all data instances at the node by the value of the attribute.
Compute the information gain ratio from the partitioning.
Identify feature that results in the greatest information gain ratio.
Set this feature to be the splitting criterion at the current node.
If the best information gain ratio is 0, tag the current node as a leaf and return.
Partition all instances according to attribute value of the best feature.
Denote each partition as a child node of the current node.
For each child node (for classification): If the child node is “pure” (has instances from only one class) tag it as a leaf and return.If not set the child node as the current node and recurse to step 2.
You might be wondering how we calculate the gain ratio and determine where to split the data. We use information gain, which is a measure of a reduction of uncertainty. Information gain increases with the average purity of the subsets and we choose an attribute that gives us the most information gain.
We measure purity using entropy, which is given in the equation below:
PART TWO
Overfitting and Pruning Trees
If we follow the above procedure, we are likely to end up with trees that are overly complex and overfitting, that is they end up fitting the noise. Generally, we "prune" the tree (i.e. cutting off some of the terminal nodes) by using cross validation to see which for all the tree sizes, which one has the lowest error rate.
How do we see which branch to cut at each tree size? Among each given tree size, we choose the one with the lowest cost complexity, which equals to the proportion of misclassified records plus the penalty factor attached to tree size (set by user, default is 0.01 in some Rpackages)
PART THREE
Tress VS Linear Models
If the relationship between the predictors and response is linear, then classical linear models such as linear regression would outperform regression trees; on the other hand, if the relationship between the predictors is non-linear, then decision trees would outperform classical approaches.
PART FOUR
Advantages and Disadvantages of Decision Trees
Advantages:
Trees are easy to explain
Trees can be plotted graphically and thus easily interpreted by non-experts
Works on both classification and regression problems
Disadvantages:
1. Classification and Regression Trees don't have the same prediction accuracy as some of the more complicated approaches.
Next section, we will take a look at how to implement CART in R.
Comments