In this guide, we'll be discussing how to build a regression model using a decision tree. Before we get into the material, I'd like to mention that decision trees can be used for either regression or classification, so we'll be coming back to this topic once we get into classification.
Just like all the other regression methods we've worked with, a regression tree is a form of supervised learning. But unlike the previous examples, regression trees are structured more like a flowchart than a fitted line. Regression trees are built using an algorithm known as CART (the one scikit-learn implements) together with a process known as recursive partitioning.
Recursive partitioning is an iterative process that splits the data into partitions that we refer to as branches, and then keeps splitting each partition into smaller and smaller subsets as it moves down each branch. Every decision tree begins at the root node, or first parent, which is where the algorithm starts iterating over the entire training set, distributing the data into the first two partitions, or branches, that offer the largest information gain. The algorithm determines how the data is split and distributed using a sum-of-squares criterion.
Just like the other algorithms we've worked with, Python and scikit-learn will do all of the math for us. But if you're interested, it works by minimizing the sum of squared deviations from the mean within each partition; the sketch below shows the basic idea. The splitting rule is then applied to each new branch in turn, creating a subset or child node. Depending on the size of the tree, the process continues until all of the features are used or each branch ends in a terminal, or leaf, node.
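If you'd like to see that idea concretely, here's a minimal sketch of how a single split might be chosen on a toy one-dimensional feature. This isn't scikit-learn's actual implementation, just an illustration of the sum-of-squares criterion it's based on.

```python
import numpy as np

def best_split(x, y):
    """Scan candidate thresholds on one feature and return the split whose
    two partitions have the smallest combined sum of squared deviations
    from their means (a rough sketch of the splitting criterion)."""
    best_threshold, best_sse = None, np.inf
    for threshold in np.unique(x)[:-1]:
        left, right = y[x <= threshold], y[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold, best_sse = threshold, sse
    return best_threshold, best_sse

# Toy data: the outcome jumps between the first and second group,
# so the best split lands right between them
x = np.array([1, 2, 3, 10, 11, 12])
y = np.array([2.0, 2.5, 3.0, 8.0, 8.5, 9.0])
print(best_split(x, y))
```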
To help make better sense of this, let's take a look at a regression tree that predicts how many hours a student will study on any given day. In our example, we have a dataset containing information about the weather. The first thing that happens is the algorithm looks for the feature that offers the greatest information gain, using the sum-of-squares criterion. It partitions the data between sunny and cloudy days and creates a child node at the end of each branch. At the temperature node, the algorithm once again determines which feature offers the biggest information gain using sum of squares.
Meanwhile, the same process is happening at the rain node. After the remaining data is partitioned, the algorithm runs out of features to split on, and each branch ends in a leaf node that outputs a numerical prediction for the number of hours a student will study that day. So if it's a cloudy day with heavy rain, the regression tree predicts that a student will study for six hours. In contrast, if it's sunny and not too hot or cold, it's likely the student will only study for two hours.
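As a rough illustration, the tree we just described behaves like a set of nested conditionals. The feature names and most of the hour values below are hypothetical; only the six-hour and two-hour predictions come from the example above.

```python
def predicted_study_hours(outlook, rain=None, temperature=None):
    """Hypothetical version of the weather tree, written as plain
    conditionals; the leaf values other than 6 and 2 are illustrative."""
    if outlook == "cloudy":
        # Cloudy branch splits on how heavy the rain is
        return 6 if rain == "heavy" else 4
    else:
        # Sunny branch splits on temperature
        return 2 if temperature == "mild" else 3

print(predicted_study_hours("cloudy", rain="heavy"))       # 6 hours
print(predicted_study_hours("sunny", temperature="mild"))  # 2 hours
```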
Okeydoke. Now that we've covered the basics, let's go through a real example using some of our own data. What I'd like to do is look at how the number of hours spent with a TA can help predict a student's grade on an upcoming exam. So for this example, we'll be using TA hours from exam one as well as exam one grades. We're going to start by importing the necessary libraries, segmenting the variables, applying NumPy's newaxis to the feature variable, and then splitting the data into training and testing sets. I'd also like to mention that the only new class we'll be importing comes from the tree module in the scikit-learn library.
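Here's roughly what that setup could look like. The file name and column names are assumptions, since they depend on how your own dataset is laid out; the rest is standard NumPy and scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical file and column names -- adjust to match your own data
df = pd.read_csv("student_data.csv")
X = df["ta_hours_exam_one"].values[:, np.newaxis]  # newaxis makes the feature 2-D
y = df["exam_one_grade"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```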
Once you have everything imported and segmented, we're going to continue as if this were a regular regression problem. We'll move forward by creating a regressor, but this time we'll be using the DecisionTreeRegressor class. With the regressor in place, we can use the fit function to train the model, and then we'll finish up by creating a prediction variable and assigning it the output of the predict function.
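A minimal sketch of those steps, continuing from the setup above:

```python
# Create the regressor, train it on the training set,
# and generate predictions for the test set
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)
prediction = regressor.predict(X_test)
```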
And like I did in the previous guide, I already have everything set up to graph, so let's go ahead and see what's going on. It looks like we made the same mistake as we did in the last guide, so let's go ahead and do a reassignment using argsort again. Oh yeah, and don't forget to add the squeeze function.
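Here's a sketch of what that graphing fix might look like, assuming the variable names from the earlier snippets. The argsort reordering just makes the prediction line plot from left to right, and squeeze flattens the 2-D feature array for matplotlib.

```python
import numpy as np
import matplotlib.pyplot as plt

# Reorder the test data by feature value so the line plots cleanly
sort_index = np.argsort(X_test, axis=0).squeeze()
X_test, y_test = X_test[sort_index], y_test[sort_index]
prediction = regressor.predict(X_test)

plt.scatter(X_test.squeeze(), y_test, color="gray")
plt.plot(X_test.squeeze(), prediction, color="red")
plt.xlabel("TA hours (exam one)")
plt.ylabel("Exam one grade")
plt.show()
```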
All right, let's go ahead and run this, and you can see that we have a much better result. But as it stands, it looks like our algorithm might be overfitting. This is a really easy fix; we just need to go back to the regressor variable and pass in the max_depth parameter. There's no real science to this, so let's just start with a max depth of two and run the code to see what happens. Now this looks like a pretty good fit. My only criticism is that some of the less dense data looks like it's being influenced too much by areas of higher density. You can kind of see that the students who studied between 24 and 35 hours are grouped together, but because of the large number of students who studied between 24 and 26 hours, the algorithm appears to underestimate the exam grade of students who studied more than 26 hours.
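That adjustment is just one extra keyword argument when creating the regressor, something along these lines:

```python
# Constrain the depth of the tree to rein in overfitting;
# we start with 2 here and bump it up in a moment
regressor = DecisionTreeRegressor(max_depth=2)
regressor.fit(X_train, y_train)
prediction = regressor.predict(X_test)
```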
So let's go back to our regressor and adjust the max depth to three, and then we'll see what happens. I leaned more toward a depth of three because it appears to account for the small changes in the data a little better, and the density concern I had when the max depth was two doesn't really seem to exist anymore.
We can also retrieve predictions and return the coefficient of determination, and that's done the same basic way as before. To retrieve a prediction, all we need to do is pass in a feature value.
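Assuming the same variable names as before, that could look like this; the value 20 is just an illustrative number of TA hours.

```python
# Coefficient of determination (R^2) on the test set
print(regressor.score(X_test, y_test))

# Predict the exam grade for a single, hypothetical feature value
print(regressor.predict([[20]]))
```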
There's a little bit more we could go through, but I think I'm going to wrap this guide up, because in the next guide we're going to be expanding on a lot of the same ideas that we covered here. So for now, I'll wrap it up, and I'll see you in the next one.