WEBVTT
1
00:00:03.380 --> 00:00:04.500
In the last guide,
2
00:00:04.500 --> 00:00:06.550
we talked about how we can use entropy
3
00:00:06.550 --> 00:00:09.710
to measure uncertainty at a decision node.
4
00:00:09.710 --> 00:00:11.230
Well, in this guide,
5
00:00:11.230 --> 00:00:13.440
we're gonna use what we know about entropy
6
00:00:13.440 --> 00:00:15.803
and relate that to information gain.
7
00:00:17.180 --> 00:00:20.170
I'm gonna do my best to not overcomplicate this,
8
00:00:20.170 --> 00:00:22.670
but the core principles go all the way back
9
00:00:22.670 --> 00:00:25.980
to the work done by a man named Claude Shannon,
10
00:00:25.980 --> 00:00:29.180
who is the creator of information theory.
11
00:00:29.180 --> 00:00:31.190
While he was developing his theory,
12
00:00:31.190 --> 00:00:34.520
he determined that entropy of a random variable
13
00:00:34.520 --> 00:00:37.890
is the average level of information or uncertainty
14
00:00:37.890 --> 00:00:40.913
inherent in the variable's possible outcomes.
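As a quick illustration of that definition, here is a minimal Python sketch of Shannon entropy computed from a set of class probabilities; the helper name and its list-of-probabilities input are my own illustrative choices, not code from the course.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over the class probabilities."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 -> a 50/50 mix is maximally uncertain
print(entropy([1.0]))       # 0.0 -> a pure outcome has no uncertainty
```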
15
00:00:42.690 --> 00:00:46.030
So in the last guide when we used entropy to determine
16
00:00:46.030 --> 00:00:47.700
the uncertainty at a node,
17
00:00:47.700 --> 00:00:50.530
we were simultaneously and unknowingly calculating
18
00:00:50.530 --> 00:00:53.223
the average level of information at the node.
19
00:00:54.710 --> 00:00:57.020
Looking at Wikipedia's general definition
20
00:00:57.020 --> 00:00:59.387
of information gain, it states,
21
00:00:59.387 --> 00:01:02.757
"In general terms, the expected information gain
22
00:01:02.757 --> 00:01:05.567
"is the change in information entropy H
23
00:01:05.567 --> 00:01:07.677
"from a prior state to a state
24
00:01:07.677 --> 00:01:10.287
"that takes some information as given."
25
00:01:11.600 --> 00:01:14.580
And to simplify that definition even more,
26
00:01:14.580 --> 00:01:16.460
information gain can be thought of
27
00:01:16.460 --> 00:01:20.090
as the change in uncertainty measured by entropy
28
00:01:20.090 --> 00:01:22.213
from parent node to child node.
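Put in code form, that "change from parent to child" is just the parent's entropy minus the size-weighted entropy of the child nodes. A minimal sketch of the idea, with helper names of my own choosing rather than anything from the video:

```python
import math

def entropy_from_labels(labels):
    """Entropy (in bits) of a list of class labels, e.g. ['yes', 'no', 'no']."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return sum(-p * math.log2(p) for p in probs if p > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy_from_labels(child) for child in children)
    return entropy_from_labels(parent) - weighted

# Splitting a perfectly mixed parent into two pure children recovers the full 1 bit.
print(information_gain(['yes', 'yes', 'no', 'no'], [['yes', 'yes'], ['no', 'no']]))  # 1.0
```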
29
00:01:24.230 --> 00:01:27.530
But the most important aspect of information gain
30
00:01:27.530 --> 00:01:30.270
is that we need it to help us decide which feature
31
00:01:30.270 --> 00:01:32.263
should be used to partition the data when building the tree,
32
00:01:33.200 --> 00:01:35.420
and to do that at each step,
33
00:01:35.420 --> 00:01:37.680
we need to make sure we're choosing the split
34
00:01:37.680 --> 00:01:40.133
that results in the purest child nodes.
35
00:01:41.610 --> 00:01:44.390
And I feel like that explanation was still
36
00:01:44.390 --> 00:01:46.050
a little too wordy.
37
00:01:46.050 --> 00:01:48.910
So let's work through a really simple dataset
38
00:01:48.910 --> 00:01:51.510
that contains information about whether or not a student
39
00:01:51.510 --> 00:01:54.223
is going to spend time studying on a given day.
40
00:01:55.440 --> 00:01:58.850
The first column of the dataset is cloud coverage,
41
00:01:58.850 --> 00:02:02.560
and it indicates if it was sunny or cloudy outside.
42
00:02:02.560 --> 00:02:06.180
The second column lets us know if it rained or not.
43
00:02:06.180 --> 00:02:09.520
And the last column contains the class labels,
44
00:02:09.520 --> 00:02:12.623
indicating whether or not the student decided to study.
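The video never lists the eight rows themselves, so the frame below is only a hypothetical reconstruction: the rows were chosen to match the counts quoted later in this guide (four sunny days that are all "no", four cloudy days split three "yes" to one "no", and two rainy days that are both "yes"), and the column names are my own.

```python
import pandas as pd

# Hypothetical rows consistent with the counts described in this guide,
# not the author's actual dataset.
df = pd.DataFrame({
    "cloud_cover": ["sunny", "sunny", "sunny", "sunny",
                    "cloudy", "cloudy", "cloudy", "cloudy"],
    "rain":        ["no rain", "no rain", "no rain", "no rain",
                    "rain", "rain", "no rain", "no rain"],
    "study":       ["no", "no", "no", "no",
                    "yes", "yes", "yes", "no"],
})
print(df)
```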
45
00:02:14.280 --> 00:02:15.800
Like we've talked about,
46
00:02:15.800 --> 00:02:18.310
the algorithm starts at the root node,
47
00:02:18.310 --> 00:02:21.160
containing the entire dataset.
48
00:02:21.160 --> 00:02:23.970
And before any splitting can be done,
49
00:02:23.970 --> 00:02:27.283
the entropy of the root node needs to be calculated.
50
00:02:28.240 --> 00:02:31.390
After we pass in 5/8 for the probability
51
00:02:31.390 --> 00:02:33.410
of a student not studying,
52
00:02:33.410 --> 00:02:37.930
and 3/8 for the probability of a student studying,
53
00:02:37.930 --> 00:02:42.150
we end up with an entropy of .954,
54
00:02:42.150 --> 00:02:45.030
indicating the set is just about evenly mixed
55
00:02:45.030 --> 00:02:47.723
between study days and non-study days.
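That 0.954 comes straight from plugging 5/8 and 3/8 into the entropy formula, which is easy to verify:

```python
import math

p_no, p_yes = 5 / 8, 3 / 8
root_entropy = -(p_no * math.log2(p_no)) - (p_yes * math.log2(p_yes))
print(round(root_entropy, 3))  # 0.954
```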
56
00:02:48.960 --> 00:02:51.170
In the next step of the algorithm,
57
00:02:51.170 --> 00:02:54.300
we'll begin to iterate over every feature variable,
58
00:02:54.300 --> 00:02:55.910
calculating the entropy,
59
00:02:55.910 --> 00:02:59.023
and then the information gain for each unique element.
60
00:03:00.280 --> 00:03:02.610
Taking a quick look at our data frame,
61
00:03:02.610 --> 00:03:05.230
there are only four unique values:
62
00:03:05.230 --> 00:03:08.653
sunny, cloudy, rain, and no rain.
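In other words, each unique value becomes a yes/no question that carves the rows into a "true" group and a "false" group. Here is a short, self-contained sketch of that iteration, reusing the hypothetical rows and column names assumed above:

```python
import pandas as pd

# Hypothetical rows matching the counts described in this guide.
df = pd.DataFrame({
    "cloud_cover": ["sunny"] * 4 + ["cloudy"] * 4,
    "rain": ["no rain"] * 4 + ["rain", "rain", "no rain", "no rain"],
    "study": ["no"] * 4 + ["yes", "yes", "yes", "no"],
})

# Every unique value of every feature becomes a candidate yes/no split.
for feature in ["cloud_cover", "rain"]:
    for value in df[feature].unique():
        mask = df[feature] == value
        print(f"Is {feature} == '{value}'?  "
              f"true: {mask.sum()} rows, false: {(~mask).sum()} rows")
```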
63
00:03:10.210 --> 00:03:12.110
So starting with sunny,
64
00:03:12.110 --> 00:03:15.273
the algorithm will ask the question, is it sunny?
65
00:03:16.180 --> 00:03:18.010
If the answer is true,
66
00:03:18.010 --> 00:03:20.800
the observation will be grouped on the left,
67
00:03:20.800 --> 00:03:24.750
but if the answer is false, it will be grouped on the right.
68
00:03:24.750 --> 00:03:27.740
Now, just looking at the true child node,
69
00:03:27.740 --> 00:03:30.200
it contains four observations,
70
00:03:30.200 --> 00:03:33.280
all of which have a study value of no,
71
00:03:33.280 --> 00:03:35.423
resulting in an entropy of zero.
72
00:03:36.840 --> 00:03:40.470
The node on the right also contains four observations,
73
00:03:40.470 --> 00:03:43.370
but only one has a study value of no,
74
00:03:43.370 --> 00:03:47.070
while the other three have a study value of yes.
75
00:03:47.070 --> 00:03:51.323
This results in the node having an entropy of .811.
76
00:03:52.730 --> 00:03:55.900
Now that we know the entropy of both of the child nodes,
77
00:03:55.900 --> 00:03:59.790
we can move forward with calculating the weighted average.
78
00:03:59.790 --> 00:04:02.330
And the reason we're using a weighted average
79
00:04:02.330 --> 00:04:05.300
is because we care more about a large subset
80
00:04:05.300 --> 00:04:07.190
with low uncertainty
81
00:04:07.190 --> 00:04:09.853
than a small subset with high uncertainty.
82
00:04:11.180 --> 00:04:13.110
Once that's taken care of,
83
00:04:13.110 --> 00:04:16.830
we have everything we need to calculate information gain,
84
00:04:16.830 --> 00:04:19.040
which, if we were to use this split,
85
00:04:19.040 --> 00:04:22.803
we'd end up with an information gain of about .55.
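To make the arithmetic explicit: each child node holds 4 of the 8 rows, so the weighted child entropy is 0.5 * 0 + 0.5 * 0.811 ≈ 0.406, and subtracting that from the root entropy of 0.954 leaves roughly 0.55. A quick check in Python (the variable names are mine):

```python
import math

def entropy(probabilities):
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

root = entropy([5 / 8, 3 / 8])             # 0.954
sunny_child = entropy([4 / 4])             # 0.0   -> all four sunny rows are "no"
not_sunny_child = entropy([1 / 4, 3 / 4])  # 0.811 -> one "no", three "yes"

weighted = (4 / 8) * sunny_child + (4 / 8) * not_sunny_child
gain = root - weighted
print(round(weighted, 3), round(gain, 3))  # 0.406 0.549
```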
86
00:04:24.820 --> 00:04:27.870
Now, if we were to follow the exact same steps
87
00:04:27.870 --> 00:04:29.470
for cloudy days,
88
00:04:29.470 --> 00:04:32.770
we'd end up getting the exact same information gain.
89
00:04:32.770 --> 00:04:36.450
So really, we just have one more split to test,
90
00:04:36.450 --> 00:04:39.683
and that's between rainy days and non-rainy days.
91
00:04:41.300 --> 00:04:45.120
Calculating the entropy for the true node, we get zero.
92
00:04:45.120 --> 00:04:48.077
And the entropy for the false node is .65.
93
00:04:50.360 --> 00:04:52.050
For the weighted entropy,
94
00:04:52.050 --> 00:04:56.190
we end up getting a value of .4875,
95
00:04:56.190 --> 00:05:00.973
which results in an information gain of about .467.
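The same arithmetic works for the rain question: the false node holds 6 of the 8 rows, so the weighted entropy is (6/8) * 0.65 = 0.4875, and the gain comes out to roughly 0.467. A quick check, again with variable names of my own:

```python
import math

def entropy(probabilities):
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

root = entropy([5 / 8, 3 / 8])          # 0.954
rain_child = 0.0                        # both rainy rows share one label
no_rain_child = entropy([1 / 6, 5 / 6]) # 0.65 -> five of the six rows share one label

weighted = (2 / 8) * rain_child + (6 / 8) * no_rain_child
gain = root - weighted
print(round(weighted, 4), round(gain, 4))  # 0.4875 0.4669
```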
96
00:05:03.120 --> 00:05:04.750
When we compare the results,
97
00:05:04.750 --> 00:05:07.140
the largest information gain was achieved
98
00:05:07.140 --> 00:05:09.510
when we split the root node between sunny days
99
00:05:09.510 --> 00:05:12.950
and cloudy days, indicating that that should be
100
00:05:12.950 --> 00:05:16.403
the very first partition we use to build the decision tree.
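Picking the winner is then a single comparison over the gains computed above (the question labels here are just shorthand of my own):

```python
# The candidate question with the largest information gain becomes the root split.
gains = {"is it sunny/cloudy?": 0.549, "is it raining?": 0.467}
best_question = max(gains, key=gains.get)
print(best_question, gains[best_question])  # is it sunny/cloudy? 0.549
```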
101
00:05:18.010 --> 00:05:19.750
If we wanted to continue,
102
00:05:19.750 --> 00:05:23.080
we'd follow the exact same steps down each branch
103
00:05:23.080 --> 00:05:25.543
until we were only left with leaf nodes.
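As a rough illustration of what "following the same steps down each branch" looks like in code, here is a compact sketch of a greedy, entropy-based tree builder. Everything in it (the function names, the dictionary-based tree structure, and the hypothetical rows) is my own illustration under the assumptions above, not the implementation this course builds later:

```python
import math

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return sum(-p * math.log2(p) for p in probs if p > 0)

def best_split(rows, features, target):
    """Return the (feature, value, gain) question with the largest information gain."""
    parent_entropy = entropy([row[target] for row in rows])
    best = None
    for feature in features:
        for value in {row[feature] for row in rows}:
            true_rows = [r for r in rows if r[feature] == value]
            false_rows = [r for r in rows if r[feature] != value]
            if not true_rows or not false_rows:
                continue  # this question doesn't actually split the node
            weighted = (
                len(true_rows) / len(rows) * entropy([r[target] for r in true_rows])
                + len(false_rows) / len(rows) * entropy([r[target] for r in false_rows])
            )
            gain = parent_entropy - weighted
            if best is None or gain > best[2]:
                best = (feature, value, gain)
    return best

def build_tree(rows, features, target):
    """Recurse down each branch until the nodes are pure (or can't be split further)."""
    labels = [row[target] for row in rows]
    if entropy(labels) == 0:
        return labels[0]                            # pure leaf node
    split = best_split(rows, features, target)
    if split is None:
        return max(set(labels), key=labels.count)   # unsplittable: majority-label leaf
    feature, value, _ = split
    true_rows = [r for r in rows if r[feature] == value]
    false_rows = [r for r in rows if r[feature] != value]
    return {
        "question": f"{feature} == {value}?",
        "true": build_tree(true_rows, features, target),
        "false": build_tree(false_rows, features, target),
    }

# Hypothetical rows that match the counts described in this guide.
rows = (
    [{"cloud": "sunny", "rain": "no rain", "study": "no"}] * 4
    + [{"cloud": "cloudy", "rain": "rain", "study": "yes"}] * 2
    + [{"cloud": "cloudy", "rain": "no rain", "study": "yes"},
       {"cloud": "cloudy", "rain": "no rain", "study": "no"}]
)
print(build_tree(rows, ["cloud", "rain"], "study"))
```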
104
00:05:27.030 --> 00:05:28.550
Now that we're done with that,
105
00:05:28.550 --> 00:05:31.550
I think that about brings us to the end of this guide,
106
00:05:31.550 --> 00:05:34.863
which leaves us with one last decision tree guide.
107
00:05:35.770 --> 00:05:37.420
And for that, we'll be building
108
00:05:37.420 --> 00:05:42.200
an actual decision tree visualization, but for now,
109
00:05:42.200 --> 00:05:45.683
I will wrap things up and I will see you in the next one.