Decision Tree: Information Gain

WEBVTT

1
00:00:03.380 --> 00:00:04.500
In the last guide,

2
00:00:04.500 --> 00:00:06.550
we talked about how we can use entropy

3
00:00:06.550 --> 00:00:09.710
to measure uncertainty at a decision node.

4
00:00:09.710 --> 00:00:11.230
Well, in this guide,

5
00:00:11.230 --> 00:00:13.440
we're gonna use what we know about entropy

6
00:00:13.440 --> 00:00:15.803
and relate that to information gain.

7
00:00:17.180 --> 00:00:20.170
I'm gonna do my best to not overcomplicate this,

8
00:00:20.170 --> 00:00:22.670
but the core principles go all the way back

9
00:00:22.670 --> 00:00:25.980
to the work done by a man named Claude Shannon,

10
00:00:25.980 --> 00:00:29.180
who is the creator of information theory.

11
00:00:29.180 --> 00:00:31.190
While he was developing his theory,

12
00:00:31.190 --> 00:00:34.520
he determined that the entropy of a random variable

13
00:00:34.520 --> 00:00:37.890
is the average level of information or uncertainty

14
00:00:37.890 --> 00:00:40.913
inherent in the variable's possible outcomes.
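
As a quick reference, Shannon entropy can be sketched in a couple of lines of Python. This helper isn't from the video; it's just a minimal illustration of the formula H = -sum(p * log2(p)) over the outcome probabilities.

import math

def entropy(probabilities):
    # Shannon entropy: sum of -p * log2(p), skipping zero probabilities
    return -sum(p * math.log2(p) for p in probabilities if p > 0)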

15
00:00:42.690 --> 00:00:46.030
So in the last guide when we used entropy to determine

16
00:00:46.030 --> 00:00:47.700
the uncertainty at a node,

17
00:00:47.700 --> 00:00:50.530
we were simultaneously and unknowingly calculating

18
00:00:50.530 --> 00:00:53.223
the average level of information at the node.

19
00:00:54.710 --> 00:00:57.020
Looking at Wikipedia's general definition

20
00:00:57.020 --> 00:00:59.387
of information gain, it states,

21
00:00:59.387 --> 00:01:02.757
"In general terms, the expected information gain

22
00:01:02.757 --> 00:01:05.567
"is the change in information entropy H

23
00:01:05.567 --> 00:01:07.677
"from a prior state to a state

24
00:01:07.677 --> 00:01:10.287
"that takes some information as given."

25
00:01:11.600 --> 00:01:14.580
And to simplify that definition even more,

26
00:01:14.580 --> 00:01:16.460
information gain can be thought of

27
00:01:16.460 --> 00:01:20.090
as the change in uncertainty measured by entropy

28
00:01:20.090 --> 00:01:22.213
from parent node to child node.
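
Put as a small sketch (not the video's code), that change from parent to children can be computed by weighting each child's entropy by its share of the parent's samples and subtracting from the parent's entropy:

def information_gain(parent_entropy, children):
    # children: list of (sample_count, entropy) pairs, one per child node
    total = sum(count for count, _ in children)
    weighted = sum((count / total) * h for count, h in children)
    return parent_entropy - weighted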

29
00:01:24.230 --> 00:01:27.530
But the most important aspect of information gain

30
00:01:27.530 --> 00:01:30.270
is that we need it to help decide which feature

31
00:01:30.270 --> 00:01:32.263
the data should be partitioned on to build the tree,

32
00:01:33.200 --> 00:01:35.420
and to do that at each step,

33
00:01:35.420 --> 00:01:37.680
we need to make sure we're choosing the split

34
00:01:37.680 --> 00:01:40.133
that results in the purest child nodes.

35
00:01:41.610 --> 00:01:44.390
And I feel like that explanation was still

36
00:01:44.390 --> 00:01:46.050
a little too wordy.

37
00:01:46.050 --> 00:01:48.910
So let's work through a really simple dataset

38
00:01:48.910 --> 00:01:51.510
that contains information about whether or not a student

39
00:01:51.510 --> 00:01:54.223
is going to spend time studying on a given day.

40
00:01:55.440 --> 00:01:58.850
The first column of the dataset is cloud coverage,

41
00:01:58.850 --> 00:02:02.560
and it indicates if it was sunny or cloudy outside.

42
00:02:02.560 --> 00:02:06.180
The second column lets us know if it rained or not.

43
00:02:06.180 --> 00:02:09.520
And the last column contains the class labels,

44
00:02:09.520 --> 00:02:12.623
indicating whether or not the student decided to study.

45
00:02:14.280 --> 00:02:15.800
Like we've talked about,

46
00:02:15.800 --> 00:02:18.310
the algorithm starts at the root node,

47
00:02:18.310 --> 00:02:21.160
containing the entire dataset.

48
00:02:21.160 --> 00:02:23.970
And before any splitting can be done,

49
00:02:23.970 --> 00:02:27.283
the entropy of the root node needs to be calculated.

50
00:02:28.240 --> 00:02:31.390
After we pass in 5/8 for the probability

51
00:02:31.390 --> 00:02:33.410
of a student not studying,

52
00:02:33.410 --> 00:02:37.930
and 3/8 for the probability of a student studying,

53
00:02:37.930 --> 00:02:42.150
we end up with an entropy of .954,

54
00:02:42.150 --> 00:02:45.030
indicating the set is just about evenly mixed

55
00:02:45.030 --> 00:02:47.723
between study days and non-study days.
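
That root value is easy to double-check; a minimal sketch using the 5/8 and 3/8 probabilities quoted above:

import math

p_no, p_yes = 5 / 8, 3 / 8
root_entropy = -(p_no * math.log2(p_no) + p_yes * math.log2(p_yes))
print(round(root_entropy, 3))  # 0.954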

56
00:02:48.960 --> 00:02:51.170
In the next step of the algorithm,

57
00:02:51.170 --> 00:02:54.300
we'll begin to iterate over every feature variable,

58
00:02:54.300 --> 00:02:55.910
calculating the entropy,

59
00:02:55.910 --> 00:02:59.023
and then the information gain for each unique value.

60
00:03:00.280 --> 00:03:02.610
Taking a quick look at our data frame,

61
00:03:02.610 --> 00:03:05.230
there are only four unique values,

62
00:03:05.230 --> 00:03:08.653
sunny, cloudy, rain, and no rain.

63
00:03:10.210 --> 00:03:12.110
So starting with sunny,

64
00:03:12.110 --> 00:03:15.273
the algorithm will ask the question, is it sunny?

65
00:03:16.180 --> 00:03:18.010
If the answer is true,

66
00:03:18.010 --> 00:03:20.800
the observation will be grouped on the left,

67
00:03:20.800 --> 00:03:24.750
but if the answer is false, it will be grouped on the right.
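
In code, that question is just a boolean test used to partition the rows. The dictionary keys and the two placeholder rows below are made up purely to keep the sketch runnable; the video's actual data frame isn't shown in the transcript.

# Hypothetical rows standing in for the real dataset
rows = [
    {"cloud_coverage": "sunny", "rain": "no", "study": "no"},
    {"cloud_coverage": "cloudy", "rain": "no", "study": "yes"},
]

def is_sunny(row):
    return row["cloud_coverage"] == "sunny"   # the question "is it sunny?"

true_branch = [row for row in rows if is_sunny(row)]       # grouped on the left
false_branch = [row for row in rows if not is_sunny(row)]  # grouped on the right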

68
00:03:24.750 --> 00:03:27.740
Now, just looking at the true child node,

69
00:03:27.740 --> 00:03:30.200
it contains four observations,

70
00:03:30.200 --> 00:03:33.280
all of which have a study value of no,

71
00:03:33.280 --> 00:03:35.423
resulting in an entropy of zero.

72
00:03:36.840 --> 00:03:40.470
The node on the right also contains four observations,

73
00:03:40.470 --> 00:03:43.370
but only one has a study value of no,

74
00:03:43.370 --> 00:03:47.070
while the other three have a study value of yes.

75
00:03:47.070 --> 00:03:51.323
This results in the node having an entropy of .811.
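
Both child entropies can be reproduced from the class counts the video describes (four "no" in the true node; one "no" and three "yes" in the false node). A small self-contained sketch:

import math

def entropy_from_counts(counts):
    # Entropy of a node from its per-class counts, e.g. [4, 0] or [1, 3]
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy_from_counts([4, 0]))            # 0.0   (true node: all "no")
print(round(entropy_from_counts([1, 3]), 3))  # 0.811 (false node: one "no", three "yes")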

76
00:03:52.730 --> 00:03:55.900
Now that we know the entropy of both of the child nodes,

77
00:03:55.900 --> 00:03:59.790
we can move forward with calculating the weighted average.

78
00:03:59.790 --> 00:04:02.330
And the reason we're using a weighted average

79
00:04:02.330 --> 00:04:05.300
is because we care more about a large subset

80
00:04:05.300 --> 00:04:07.190
with low uncertainty

81
00:04:07.190 --> 00:04:09.853
than a small subset with high uncertainty.

82
00:04:11.180 --> 00:04:13.110
Once that's taken care of,

83
00:04:13.110 --> 00:04:16.830
we have everything we need to calculate information gain,

84
00:04:16.830 --> 00:04:19.040
which, if we were to use this split,

85
00:04:19.040 --> 00:04:22.803
we'd end up with an information gain of about .55.
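
Using the numbers quoted so far, the weighted average and the resulting gain for the sunny split work out like this (a sketch of the arithmetic, not the video's code):

parent_entropy = 0.954                                # root node
weighted_children = (4 / 8) * 0.0 + (4 / 8) * 0.811   # ≈ 0.406
print(round(parent_entropy - weighted_children, 2))   # ≈ 0.55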

86
00:04:24.820 --> 00:04:27.870
Now, if we were to follow the exact same steps

87
00:04:27.870 --> 00:04:29.470
for cloudy days,

88
00:04:29.470 --> 00:04:32.770
we'd end up getting the exact same information gain.

89
00:04:32.770 --> 00:04:36.450
So really, we just have one more split to test,

90
00:04:36.450 --> 00:04:39.683
and that's between rainy days and non-rainy days.

91
00:04:41.300 --> 00:04:45.120
Calculating the entropy for the true node, we get zero.

92
00:04:45.120 --> 00:04:48.077
And the entropy for the false node is .65.

93
00:04:50.360 --> 00:04:52.050
For the weighted entropy,

94
00:04:52.050 --> 00:04:56.190
we end up getting a value of .4875,

95
00:04:56.190 --> 00:05:00.973
which results in an information gain of about .4655.
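
The same arithmetic covers the rain split. The two-versus-six node sizes below are inferred from the quoted weighted entropy of .4875 rather than stated outright, and the small difference from the quoted .4655 comes down to rounding along the way:

parent_entropy = 0.954
weighted_children = (2 / 8) * 0.0 + (6 / 8) * 0.65    # = 0.4875
print(round(parent_entropy - weighted_children, 4))   # ≈ 0.4665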

96
00:05:03.120 --> 00:05:04.750
When we compare the results,

97
00:05:04.750 --> 00:05:07.140
the largest information gain was achieved

98
00:05:07.140 --> 00:05:09.510
when we split the root node between sunny days

99
00:05:09.510 --> 00:05:12.950
and cloudy days, indicating that that should be

100
00:05:12.950 --> 00:05:16.403
the very first partition we use to build the decision tree.
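
The comparison itself is just a max over the candidate gains; a small sketch using the values computed above:

gains = {"sunny vs. cloudy": 0.55, "rain vs. no rain": 0.4665}
best_split = max(gains, key=gains.get)
print(best_split)  # sunny vs. cloudy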

101
00:05:18.010 --> 00:05:19.750
If we wanted to continue,

102
00:05:19.750 --> 00:05:23.080
we'd follow the exact same steps down each branch

103
00:05:23.080 --> 00:05:25.543
until we were only left with leaf nodes.

104
00:05:27.030 --> 00:05:28.550
Now that we're done with that,

105
00:05:28.550 --> 00:05:31.550
I think that about brings us to the end of this guide,

106
00:05:31.550 --> 00:05:34.863
which leaves us with one last decision tree guide.

107
00:05:35.770 --> 00:05:37.420
And for that, we'll be building

108
00:05:37.420 --> 00:05:42.200
an actual decision tree visualization, but for now,

109
00:05:42.200 --> 00:05:45.683
I will wrap things up and I will see you in the next one.