WEBVTT
1
00:00:02.690 --> 00:00:04.400
Over the next few guides,
2
00:00:04.400 --> 00:00:06.060
we're gonna use what we've already learned
3
00:00:06.060 --> 00:00:07.780
about regression trees
4
00:00:07.780 --> 00:00:10.930
and apply that to how we can use a decision tree
5
00:00:10.930 --> 00:00:12.333
for classification.
6
00:00:13.810 --> 00:00:15.430
There's quite a bit to cover,
7
00:00:15.430 --> 00:00:17.770
so I decided to split everything up
8
00:00:17.770 --> 00:00:20.210
into three separate guides.
9
00:00:20.210 --> 00:00:24.150
The first two guides are gonna be a little more theoretical.
10
00:00:24.150 --> 00:00:27.670
In the first, we'll focus on two main concepts,
11
00:00:27.670 --> 00:00:30.773
which are Gini Impurity and Entropy.
12
00:00:32.320 --> 00:00:35.110
In the second, we'll have a detailed discussion
13
00:00:35.110 --> 00:00:37.373
about maximizing Information Gain.
14
00:00:38.270 --> 00:00:40.120
Then in the final guide,
15
00:00:40.120 --> 00:00:41.700
we'll use what we've learned
16
00:00:41.700 --> 00:00:45.223
to build a really clean Decision Tree Visualization.
17
00:00:46.670 --> 00:00:49.120
But before we get into all of the new stuff,
18
00:00:49.120 --> 00:00:50.763
let's do a quick review.
19
00:00:51.700 --> 00:00:53.760
So we already know that a decision tree
20
00:00:53.760 --> 00:00:56.540
starts by taking in the entire training set
21
00:00:56.540 --> 00:00:58.370
at the root node.
22
00:00:58.370 --> 00:01:00.400
Then it works its way through the data,
23
00:01:00.400 --> 00:01:02.220
by asking what feature offers
24
00:01:02.220 --> 00:01:04.340
the biggest reduction in uncertainty,
25
00:01:04.340 --> 00:01:07.863
if the data were to be partitioned into subsets, or child nodes.
26
00:01:08.960 --> 00:01:11.670
The algorithm continues down each branch
27
00:01:11.670 --> 00:01:14.450
using the same process until only terminal
28
00:01:14.450 --> 00:01:16.223
or leaf nodes are remaining.
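To make that recap concrete, here's a minimal sketch of that greedy, recursive process. This is my own toy illustration, not Scikit-learn's implementation; the data layout and helper names are assumptions, and it measures uncertainty with Gini impurity, which this guide defines in a moment.

```python
# Toy sketch of recursive, greedy tree building (not Scikit-learn's internals).
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, depth=0, max_depth=3):
    # Terminal (leaf) node: the node is pure or we've hit the depth limit.
    if len(set(labels)) == 1 or depth == max_depth:
        return {"predict": Counter(labels).most_common(1)[0][0]}

    # Ask which feature and threshold give the biggest reduction in uncertainty,
    # i.e. the lowest weighted impurity across the two child nodes.
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            score = (len(left) * gini([labels[i] for i in left]) +
                     len(right) * gini([labels[i] for i in right])) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)

    if best is None:  # every feature value is identical, so no split is possible
        return {"predict": Counter(labels).most_common(1)[0][0]}

    _, f, t, left, right = best
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            depth + 1, max_depth),
    }

# Toy usage: two numeric features per row, weather classes as labels.
tree = build_tree([[1, 0], [2, 1], [3, 0], [4, 1]],
                  ["sunny", "sunny", "cloudy", "cloudy"])
print(tree)
```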
29
00:01:17.140 --> 00:01:18.790
When we built the regression tree,
30
00:01:18.790 --> 00:01:22.640
the partition was determined by using mean squared error.
31
00:01:22.640 --> 00:01:25.170
But Scikit-learn also gave us the option
32
00:01:25.170 --> 00:01:29.683
of using Friedman mean squared error or mean absolute error.
33
00:01:30.750 --> 00:01:34.480
For a classification tree it's gonna be a little different
34
00:01:34.480 --> 00:01:36.450
because the two options we're given
35
00:01:36.450 --> 00:01:39.113
are Gini impurity and entropy.
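In code, those choices show up as the `criterion` argument on the two estimators. A minimal sketch, assuming a recent Scikit-learn release (older versions used strings like "mse" and "mae" for the regressor):

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Regression trees: squared error (default), Friedman MSE, or absolute error.
reg = DecisionTreeRegressor(criterion="squared_error")  # or "friedman_mse", "absolute_error"

# Classification trees: Gini impurity (default) or entropy.
clf = DecisionTreeClassifier(criterion="gini")           # or "entropy"
```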
36
00:01:40.680 --> 00:01:42.380
The first method that we'll go through
37
00:01:42.380 --> 00:01:46.783
to help calculate uncertainty at a node is Gini impurity.
38
00:01:47.650 --> 00:01:50.230
Which also happens to be the default method
39
00:01:50.230 --> 00:01:51.363
for Scikit-learn.
40
00:01:53.500 --> 00:01:58.500
So, Gini impurity is a metric that ranges from zero to one
41
00:01:58.540 --> 00:02:01.010
and it's used to measure how often
42
00:02:01.010 --> 00:02:03.460
a randomly chosen element from a set
43
00:02:03.460 --> 00:02:05.263
would be incorrectly labeled.
44
00:02:06.680 --> 00:02:08.170
As an example,
45
00:02:08.170 --> 00:02:11.070
let's pretend that we have two different sets.
46
00:02:11.070 --> 00:02:13.890
The first contains all of the feature elements
47
00:02:13.890 --> 00:02:16.713
and the second contains all of the class labels.
48
00:02:17.990 --> 00:02:21.690
If we randomly select one element from the first set
49
00:02:21.690 --> 00:02:25.010
and randomly select a label from the second set,
50
00:02:25.010 --> 00:02:28.700
then classify the element as having that label,
51
00:02:28.700 --> 00:02:31.390
the Gini impurity gives us the probability
52
00:02:31.390 --> 00:02:33.833
of incorrectly labeling the feature element.
53
00:02:35.280 --> 00:02:39.020
Because all four elements in our set are the same (all sunny days), the impurity is zero,
54
00:02:39.020 --> 00:02:41.703
since there's no chance of mislabeling an element.
55
00:02:43.970 --> 00:02:46.680
But if we were to replace two of the sunny days
56
00:02:46.680 --> 00:02:49.993
with cloudy days, the impurity would be .5.
57
00:02:50.990 --> 00:02:53.540
And that's because if we were to randomly select
58
00:02:53.540 --> 00:02:55.260
an element and label,
59
00:02:55.260 --> 00:02:57.810
we'd end up incorrectly classifying the element
60
00:02:57.810 --> 00:02:59.313
about half of the time.
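The video doesn't write the formula out, but that mislabeling probability works out to the standard Gini definition, G = 1 - Σ pᵢ², where pᵢ is the share of each label in the set. A quick Python check reproduces both values above:

```python
# Quick check of the two example sets: G = 1 - sum(p_i^2).
from collections import Counter

def gini_impurity(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini_impurity(["sunny", "sunny", "sunny", "sunny"]))    # 0.0 -> pure set
print(gini_impurity(["sunny", "sunny", "cloudy", "cloudy"]))  # 0.5 -> evenly mixed
```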
61
00:03:01.390 --> 00:03:06.390
Now, I mentioned that Gini impurity ranges from zero to one,
62
00:03:06.610 --> 00:03:09.720
which I guess is technically incorrect,
63
00:03:09.720 --> 00:03:11.300
because there's really no way
64
00:03:11.300 --> 00:03:15.063
of incorrectly labeling an element 100% of the time.
65
00:03:16.160 --> 00:03:19.010
If we go back to our pure set of sunny days,
66
00:03:19.010 --> 00:03:22.350
it's not possible to incorrectly classify a cloudy day
67
00:03:22.350 --> 00:03:24.370
100% of the time.
68
00:03:24.370 --> 00:03:27.720
Because cloudy days don't exist in the set.
69
00:03:27.720 --> 00:03:31.883
So really the maximum Gini impurity has a limit at one.
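To ground that claim with a quick worked example (mine, not the video's): with j equally likely classes,

$$
G = 1 - \sum_{i=1}^{j} p_i^2 = 1 - j\left(\frac{1}{j}\right)^2 = 1 - \frac{1}{j}
$$

which equals 0.5 for our two weather classes and only approaches 1 as the number of classes grows, so 1 is a limit rather than a value the Gini impurity can actually reach.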
70
00:03:33.400 --> 00:03:36.560
But for what we're doing, the concept of impurity limits
71
00:03:36.560 --> 00:03:38.193
isn't really that important.
72
00:03:39.090 --> 00:03:41.170
In fact, the only reason I brought it up
73
00:03:41.170 --> 00:03:43.440
is because the existence of the limit
74
00:03:43.440 --> 00:03:45.320
suggests that Gini impurity
75
00:03:45.320 --> 00:03:47.833
is really just a variation of entropy.
76
00:03:49.070 --> 00:03:51.950
So that obviously leads us into entropy,
77
00:03:51.950 --> 00:03:53.840
which is the next method we can use
78
00:03:53.840 --> 00:03:55.423
to calculate uncertainty.
79
00:03:57.660 --> 00:03:58.860
There's a decent chance
80
00:03:58.860 --> 00:04:01.900
that a few of you have heard of the term entropy before.
81
00:04:01.900 --> 00:04:03.980
And that's because it's often used
82
00:04:03.980 --> 00:04:06.803
to describe disorder or chaos in a system.
83
00:04:07.820 --> 00:04:11.010
In machine learning it's a pretty similar concept,
84
00:04:11.010 --> 00:04:15.320
but to be a little more precise when we refer to entropy,
85
00:04:15.320 --> 00:04:18.280
we're talking about the homogeneity of a data set
86
00:04:18.280 --> 00:04:20.453
or how similar the elements are.
87
00:04:22.130 --> 00:04:24.170
Using the same example as before,
88
00:04:24.170 --> 00:04:28.090
let's start with a data set made up of only sunny days.
89
00:04:28.090 --> 00:04:31.500
For this, we'd say the data set is completely homogeneous
90
00:04:31.500 --> 00:04:33.653
because all four elements are the same.
91
00:04:34.580 --> 00:04:37.540
But as we replace sunny days with cloudy days,
92
00:04:37.540 --> 00:04:40.020
we'll begin to lose homogeneity
93
00:04:40.020 --> 00:04:42.223
until the set is perfectly mixed.
94
00:04:43.860 --> 00:04:45.210
Then if we were to replace
95
00:04:45.210 --> 00:04:47.860
the remaining sunny days with cloudy days,
96
00:04:47.860 --> 00:04:50.353
we'd have a perfectly homogeneous set again.
97
00:04:52.520 --> 00:04:54.140
To calculate entropy,
98
00:04:54.140 --> 00:04:57.760
we say entropy is equal to negative one
99
00:04:57.760 --> 00:05:00.700
times the sum of p sub i
100
00:05:00.700 --> 00:05:05.700
times log base two of p sub i from i equals one to j.
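Written out, that spoken formula is:

$$
H = -\sum_{i=1}^{j} p_i \log_2 p_i
$$

where p sub i is the proportion of elements belonging to class i and j is the number of classes.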
101
00:05:09.630 --> 00:05:13.000
Now applying the entropy equation to our example,
102
00:05:13.000 --> 00:05:16.433
the entropy of the cloudy day set is in fact zero.
103
00:05:18.610 --> 00:05:22.190
Then as we replace one of the cloudy days with a sunny day,
104
00:05:22.190 --> 00:05:24.933
the entropy is about .811.
105
00:05:27.320 --> 00:05:29.270
With an evenly mixed set
106
00:05:29.270 --> 00:05:32.110
the entropy reaches its maximum value of one.
107
00:05:34.370 --> 00:05:37.150
With three sunny days and one cloudy day,
108
00:05:37.150 --> 00:05:40.830
the entropy is back to about .811
109
00:05:40.830 --> 00:05:42.950
and finally with all sunny days,
110
00:05:42.950 --> 00:05:45.143
the entropy is back to zero again.
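As a quick sanity check of those five values, here's a short sketch using only Python's standard library (the label names are just the weather classes from the example):

```python
# Checking the five example sets against the entropy formula above.
from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum(p_i * log2(p_i)); a pure set gives 0 (Python may display it as -0.0)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

sets = [
    ["cloudy"] * 4,                  # all cloudy        -> 0.0
    ["sunny"] + ["cloudy"] * 3,      # 1 sunny, 3 cloudy -> ~0.811
    ["sunny"] * 2 + ["cloudy"] * 2,  # evenly mixed      -> 1.0
    ["sunny"] * 3 + ["cloudy"],      # 3 sunny, 1 cloudy -> ~0.811
    ["sunny"] * 4,                   # all sunny         -> 0.0
]
for s in sets:
    print(Counter(s), round(entropy(s), 3))
```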
111
00:05:47.080 --> 00:05:49.110
And we don't really have enough data points
112
00:05:49.110 --> 00:05:50.140
to make it smooth
113
00:05:50.140 --> 00:05:53.470
but if we were to graph that binary entropy function,
114
00:05:53.470 --> 00:05:55.193
it would look something like this.
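If you want to reproduce that graph yourself, here's a rough sketch (it assumes NumPy and Matplotlib are installed, and the axis labels are my own):

```python
# Plot the binary entropy curve H(p) = -p*log2(p) - (1-p)*log2(1-p).
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)  # avoid log2(0) at the endpoints
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)

plt.plot(p, H)
plt.xlabel("proportion of one class (p)")
plt.ylabel("entropy H(p)")
plt.title("Binary entropy function")
plt.show()
```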
115
00:05:56.500 --> 00:05:58.460
Now, before we wrap this guide up,
116
00:05:58.460 --> 00:06:00.233
let's do a quick recap.
117
00:06:01.560 --> 00:06:05.440
Gini impurity is the default method in Scikit-learn
118
00:06:05.440 --> 00:06:08.283
and is considered to be a variation of entropy.
119
00:06:09.260 --> 00:06:11.770
It works by using the impurity of a node
120
00:06:11.770 --> 00:06:13.393
to calculate uncertainty.
121
00:06:15.360 --> 00:06:18.360
Entropy also calculates uncertainty of a node,
122
00:06:18.360 --> 00:06:21.483
but instead it uses homogeneity and log rules.
123
00:06:23.420 --> 00:06:26.290
And I think that about covers everything you'll need to know
124
00:06:26.290 --> 00:06:27.890
going into the next guide.
125
00:06:27.890 --> 00:06:31.763
So for now I'll wrap it up and I'll see you in the next one.