In this guide, we're going to take what we just learned about decision trees and extend it into random decision forests. The whole idea behind a decision forest is power in numbers, because it's essentially just a bunch of decision trees combined together. What makes a forest so powerful is that instead of relying on a single tree, it combines the predictions of many trees, which increases its predictive power and reduces overfitting.
Using another analogy to help explain decision forests, I'm going to lean on the old philosophical idea that there's more wisdom in a crowd than in any single person. Obviously this isn't a hard-and-fast rule, but it holds up well in practice. Think about it this way. Most of us have probably played the game where we had to guess the number of jelly beans, or coffee beans, or marbles in a jar, and whoever gets the closest wins a prize. What usually happens is that some of us guess way too many, a few others guess way too few, but most land somewhere around the middle. So if we take the average of all of the guesses, we're far more likely to be close to the actual number than the vast majority of the individual guesses.
There are a couple of caveats that come with this, though. First, it's assumed that each individual is independent of the others, meaning they went in with no information or influence from the other people guessing. The next is that the population needs to be randomly sampled. If we only ask a bunch of four-year-olds, we'd probably end up with a ridiculous result, since they don't know the difference between 11 and a million. And finally, there needs to be an appropriate number of samples to make sure the extreme guesses don't influence the result too much.
Applying the same concepts to a random decision forest, we can treat the decision trees like individuals in a crowd. Just like before, each decision tree does its own thing. Then all of the trees come together and essentially vote on the best result. But to ensure the trees act as independently as possible, we use a technique called bootstrap aggregation, or bagging for short. Bagging makes each tree responsible for taking its own randomly chosen sample of the training data. When a tree is done with its random sample, it puts whatever rows it used back into the training set, and the next tree gathers its own random sample. Sampling with replacement like that allows each tree to stay as independent as possible, which helps ensure varying results.
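To make the bootstrapping idea a little more concrete, here's a tiny sketch of sampling with replacement in NumPy. The data here is just a made-up toy array, not anything from the project:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # pretend these are the rows of our training set

# Each "tree" draws its own sample with replacement, so the rows it picks
# go straight back into the bag for the next tree to draw from.
for tree in range(3):
    sample_idx = rng.choice(len(data), size=len(data), replace=True)
    print(f"Tree {tree} trains on rows: {sorted(sample_idx)}")
```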
Along with bootstrapping, random forests also utilize feature randomness for even more variation throughout the trees in the forest. In the last guide, we talked about how a decision tree considers all of the features to figure out where it should partition. Well, in a random forest that isn't allowed. Instead, the forest only allows each tree to partition based on a handful of randomly selected features. That results in lower correlation from tree to tree and helps create more diversity.
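We won't change this setting in the example below, but for reference, scikit-learn exposes feature randomness through the max_features parameter on the regressor. Here's a rough sketch of what that looks like:

```python
from sklearn.ensemble import RandomForestRegressor

# Each split only gets to consider roughly sqrt(n_features) randomly chosen
# features instead of all of them, which keeps the trees less correlated.
forest = RandomForestRegressor(
    n_estimators=100,
    max_features='sqrt',
    random_state=0,
)
```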
Now that we've gone over the general gist of how this works, let's get into an example. Since this is the last guide of the section, I thought it would be fun to revisit an old example to show how we can solve the same problem using a different method. I already imported everything that we'll need, including the RandomForestRegressor class from scikit-learn. In terms of the data, the feature variable contains exams one, two, and three, and the target variable is the final exam grade. I did make one small change to the train_test_split function which we haven't utilized yet: I set the random state to zero. That way the training and testing sets won't be reshuffled every time we run the code. And you'll notice that I did the same thing for the regressor.
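To give you a rough picture of the setup, here's a sketch that uses synthetic stand-in data. The column names (exam_1, exam_2, exam_3, final_exam) and variable names are just my own placeholders, not the exact ones from the project files:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the exam data used in the guide
rng = np.random.default_rng(0)
exams = rng.integers(50, 100, size=(200, 3))
df = pd.DataFrame(exams, columns=['exam_1', 'exam_2', 'exam_3'])
df['final_exam'] = exams.mean(axis=1) + rng.normal(0, 3, size=200)

X = df[['exam_1', 'exam_2', 'exam_3']]
y = df[['final_exam']]  # column vector; we'll flatten it with ravel below

# random_state=0 keeps the same train/test split on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# random_state=0 also pins the forest's random row and feature selection
forest = RandomForestRegressor(random_state=0)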
Like I mentioned earlier, a decision forest will randomly select data points and features every time it's run. So by setting the random state to zero, it performs a random selection the very first time it's run, then keeps the same selection every time we run it after that. And what did I forget to do? It looks like this is different from the other methods that we've used. Instead of using the column vector, we need to flatten it back out to a 1-D array. We haven't done this yet, but it's really simple. We just need to go back and pass the target object into NumPy's ravel function.
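Continuing from the sketch above, the fix looks something like this:

```python
# The forest's fit method expects a 1-D target, so flatten the column
# vector with NumPy's ravel before fitting.
forest.fit(X_train, np.ravel(y_train))
```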
All righty, moving on. The default number of trees in a random forest is 100. So when we run this, we get an R-squared value of almost 0.93, which is a pretty good score. I just mentioned that the default for the number of trees is 100, but that parameter can easily be changed by passing in n_estimators, followed by the number of trees. So let's double what we have and see how it affects the coefficient of determination. And it looks like it barely changed anything, probably because 100 trees was already large enough.
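Roughly, the scoring and the tree-count change look like this. Since the data in my sketch is a stand-in, the exact scores will differ from the ones in the guide:

```python
# score() returns the R-squared value on the test set
print(forest.score(X_test, y_test))

# Double the default of 100 trees and score again
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, np.ravel(y_train))
print(forest.score(X_test, y_test))
```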
Choosing an estimator size is similar to choosing a sample size, and it can be surprisingly tricky. Sometimes you'll only need a handful of estimators, while other times you might need thousands. I was always taught to use the smallest sample size possible without affecting the result, because if you use an estimator size that's too small, you might see an adverse effect like we're getting here when we use just one tree.
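For example, dropping down to a single estimator usually looks something like this, again assuming the variables from the sketch above:

```python
# A "forest" of one tree typically scores noticeably worse than the default
forest = RandomForestRegressor(n_estimators=1, random_state=0)
forest.fit(X_train, np.ravel(y_train))
print(forest.score(X_test, y_test))
```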
Now, out of curiosity, let's see what our result would be if we used a single decision tree instead of a forest, because in theory it should be very similar to what we have here. So let's move back up to the top and import the DecisionTreeRegressor class, then create the regressor, and we'll make the random state zero for this as well. Then we'll just need to fit the model and create the R-squared object. To make it a little bit easier, I'll pass it in down here as well. Now let's run it again, and we get 0.879, which is nearly identical to the single-tree forest.
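Here's a sketch of that comparison, with variable names that are mine rather than the guide's:

```python
from sklearn.tree import DecisionTreeRegressor

# A single decision tree should land close to the one-tree forest above
tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_train, np.ravel(y_train))
print(tree.score(X_test, y_test))
```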
I'm crossing my fingers that you remember back to the last guide when we changed the depth of the tree, because we can do the same thing with all of the trees in the forest. So let's do that and start with a max depth of 10. And that hardly helped at all. Let's drop it down to five to see if we were overfitting. Alrighty. Now it's back up to 0.9. We also know from before that we got a much better result with 100 trees instead of just the one, so let's move that back to 100 as well. Now we're getting a score of 0.93 again, which is about where we started.
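Sketching that out with the variables from the setup above, we cap the depth on the single-tree forest first, then bump the estimator count back up to 100:

```python
# Cap the depth of the single-tree forest
forest = RandomForestRegressor(n_estimators=1, max_depth=5, random_state=0)
forest.fit(X_train, np.ravel(y_train))
print(forest.score(X_test, y_test))

# Keep the depth cap, but go back to the full 100 trees
forest = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X_train, np.ravel(y_train))
print(forest.score(X_test, y_test))
```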
The last thing that I'd like to do in this guide is compare the decision forest result to multivariable regression. So to finish up, let's add the LinearRegression class, then we'll move down to add the regressor and fit function, and then the R-squared object. Now, when we run it, the linear regression score is just barely lower than the forest score.
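Here's roughly what that addition looks like, continuing the earlier sketch:

```python
from sklearn.linear_model import LinearRegression

# Fit an ordinary multivariable linear regression on the same split
regressor = LinearRegression()
regressor.fit(X_train, np.ravel(y_train))
print(regressor.score(X_test, y_test))
```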
To dig a little deeper into the forest result, we can use an attribute called feature_importances_. This is going to be very similar to the coefficient attribute that we used in regression, where the higher the number, the more important the feature. In order for us to compare the two, let's make the first object for the forest, then we'll move down and make the second object for the regression coefficients. Now let's move into the Console and pass them both in. And I don't know about you, but I find this kind of interesting. The most important feature for the forest was exam two, but for the regression model it was exam one, yet for whatever reason they were both still able to return a really high R-squared score. Which to me is just kind of cool.
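The comparison itself is just a couple of prints; the array order matches the exam_1, exam_2, exam_3 column order assumed in my sketch:

```python
# Per-feature importance scores from the fitted forest
print(forest.feature_importances_)

# Per-feature coefficients from the fitted linear regression
print(regressor.coef_)
```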
Well, that brings us to the end of not only the guide, but also the section. So as always, I will see you in the next one.