Decision Tree: Information Entropy

WEBVTT

1
00:00:02.690 --> 00:00:04.400
Over the next few guides,

2
00:00:04.400 --> 00:00:06.060
we're gonna use what we've already learned

3
00:00:06.060 --> 00:00:07.780
about regression trees

4
00:00:07.780 --> 00:00:10.930
and apply that to how we can use a decision tree

5
00:00:10.930 --> 00:00:12.333
for classification.

6
00:00:13.810 --> 00:00:15.430
There's quite a bit to cover,

7
00:00:15.430 --> 00:00:17.770
so I decided to split everything up

8
00:00:17.770 --> 00:00:20.210
into three separate guides.

9
00:00:20.210 --> 00:00:24.150
The first two guides are gonna be a little more theoretical.

10
00:00:24.150 --> 00:00:27.670
In the first, we'll focus on two main concepts,

11
00:00:27.670 --> 00:00:30.773
which are Gini impurity and entropy.

12
00:00:32.320 --> 00:00:35.110
In the second, we'll have a detailed discussion

13
00:00:35.110 --> 00:00:37.373
about maximizing Information Gain.

14
00:00:38.270 --> 00:00:40.120
Then in the final guide,

15
00:00:40.120 --> 00:00:41.700
we'll use what we've learned

16
00:00:41.700 --> 00:00:45.223
to build a really clean Decision Tree Visualization.

17
00:00:46.670 --> 00:00:49.120
But before we get into all of the new stuff,

18
00:00:49.120 --> 00:00:50.763
let's do a quick review.

19
00:00:51.700 --> 00:00:53.760
So we already know that a decision tree

20
00:00:53.760 --> 00:00:56.540
starts by taking in the entire training set

21
00:00:56.540 --> 00:00:58.370
at the root node.

22
00:00:58.370 --> 00:01:00.400
Then it works its way through the data,

23
00:01:00.400 --> 00:01:02.220
by asking which feature would offer

24
00:01:02.220 --> 00:01:04.340
the biggest reduction in uncertainty,

25
00:01:04.340 --> 00:01:07.863
if the data were partitioned into subsets, or child nodes.

26
00:01:08.960 --> 00:01:11.670
The algorithm continues down each branch

27
00:01:11.670 --> 00:01:14.450
using the same process until only terminal

28
00:01:14.450 --> 00:01:16.223
or leaf nodes remain.
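
To make that recursive splitting a little more concrete, here's a minimal sketch of the idea in Python. It's not Scikit-learn's actual implementation; the helpers `uncertainty` and `best_split` are hypothetical stand-ins for whichever impurity measure and split search you plug in, and a real tree would also add stopping rules like a maximum depth or a minimum number of samples per leaf.

```python
# A minimal sketch of recursive partitioning, assuming you supply two helpers:
#   uncertainty(labels)          -> impurity score for a node (e.g. Gini or entropy)
#   best_split(rows, labels, fn) -> (feature_index, threshold, gain) for the best split
def build_tree(rows, labels, uncertainty, best_split):
    # A node with no uncertainty left is pure, so it becomes a leaf node.
    if uncertainty(labels) == 0.0:
        return {"leaf": True, "prediction": labels[0]}

    # Ask which feature offers the biggest reduction in uncertainty.
    feature, threshold, gain = best_split(rows, labels, uncertainty)
    if gain <= 0.0:  # no split helps, so stop and predict the majority label
        return {"leaf": True, "prediction": max(set(labels), key=labels.count)}

    # Partition into subsets (child nodes) and continue down each branch.
    left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
    right = [i for i in range(len(rows)) if i not in left]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           uncertainty, best_split),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            uncertainty, best_split),
    }
```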

29
00:01:17.140 --> 00:01:18.790
When we built the regression tree,

30
00:01:18.790 --> 00:01:22.640
the partition was determined by using mean squared error.

31
00:01:22.640 --> 00:01:25.170
But Scikit-learn also gave us the option

32
00:01:25.170 --> 00:01:29.683
of using Friedman mean squared error or mean absolute error.

33
00:01:30.750 --> 00:01:34.480
For a classification tree it's gonna be a little different

34
00:01:34.480 --> 00:01:36.450
because the two options we're given

35
00:01:36.450 --> 00:01:39.113
are Gini impurity and entropy.
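
For reference, here's roughly how you'd select those options in Scikit-learn. The criterion names below assume a recent version; older releases used "mse" and "mae" for the regression criteria instead of "squared_error" and "absolute_error".

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Regression trees: squared error is the default, with Friedman MSE and
# absolute error as the alternatives mentioned above.
reg = DecisionTreeRegressor(criterion="friedman_mse")

# Classification trees: Gini impurity is the default; entropy is the
# other option we'll cover in this guide.
clf = DecisionTreeClassifier(criterion="entropy")
```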

36
00:01:40.680 --> 00:01:42.380
The first method that we'll go through

37
00:01:42.380 --> 00:01:46.783
to help calculate uncertainty at a node is Gini impurity,

38
00:01:47.650 --> 00:01:50.230
which also happens to be the default method

39
00:01:50.230 --> 00:01:51.363
for Scikit-learn.

40
00:01:53.500 --> 00:01:58.500
So, Gini impurity is a metric that ranges from zero to one

41
00:01:58.540 --> 00:02:01.010
and it's used to measure how often

42
00:02:01.010 --> 00:02:03.460
a randomly chosen element from a set

43
00:02:03.460 --> 00:02:05.263
would be incorrectly labeled.

44
00:02:06.680 --> 00:02:08.170
As an example,

45
00:02:08.170 --> 00:02:11.070
let's pretend that we have two different sets.

46
00:02:11.070 --> 00:02:13.890
The first contains all of the feature elements

47
00:02:13.890 --> 00:02:16.713
and the second contains all of the class labels.

48
00:02:17.990 --> 00:02:21.690
If we randomly select one element from the first set

49
00:02:21.690 --> 00:02:25.010
and randomly select a label from the second set,

50
00:02:25.010 --> 00:02:28.700
then classify the element as having that label,

51
00:02:28.700 --> 00:02:31.390
the Gini impurity gives us the probability

52
00:02:31.390 --> 00:02:33.833
of incorrectly labeling the feature element.

53
00:02:35.280 --> 00:02:39.020
With all four elements in our example set the same, in this case sunny days, the impurity is zero

54
00:02:39.020 --> 00:02:41.703
because there's no chance of mislabeling the element.

55
00:02:43.970 --> 00:02:46.680
But if we were to replace two of the sunny days

56
00:02:46.680 --> 00:02:49.993
with cloudy days, the impurity would be 0.5.

57
00:02:50.990 --> 00:02:53.540
And that's because if we were to randomly select

58
00:02:53.540 --> 00:02:55.260
an element and label,

59
00:02:55.260 --> 00:02:57.810
we'd end up incorrectly classifying the element

60
00:02:57.810 --> 00:02:59.313
about half of the time.
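
If you'd like to check those two numbers, here's a short sketch using the standard formula for Gini impurity, one minus the sum of the squared class proportions, which is just another way of writing the mislabeling probability described above.

```python
from collections import Counter

def gini_impurity(labels):
    # Probability of mislabeling an element drawn at random when its label
    # is also drawn at random from the set's own label distribution.
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["sunny", "sunny", "sunny", "sunny"]))    # 0.0 -> pure set
print(gini_impurity(["sunny", "sunny", "cloudy", "cloudy"]))  # 0.5 -> evenly mixed
```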

61
00:03:01.390 --> 00:03:06.390
Now, I mentioned that Gini impurity ranges from zero to one,

62
00:03:06.610 --> 00:03:09.720
which I guess is technically incorrect,

63
00:03:09.720 --> 00:03:11.300
because there's really no way

64
00:03:11.300 --> 00:03:15.063
of incorrectly labeling an element 100% of the time.

65
00:03:16.160 --> 00:03:19.010
If we go back to our pure set of sunny days,

66
00:03:19.010 --> 00:03:22.350
it's not possible to incorrectly classify a cloudy day

67
00:03:22.350 --> 00:03:24.370
100% of the time,

68
00:03:24.370 --> 00:03:27.720
because cloudy days don't exist in the set.

69
00:03:27.720 --> 00:03:31.883
So really, the Gini impurity only approaches one as an upper limit; it never actually reaches it.
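
One quick way to see why, using the standard formula for Gini impurity: with $j$ equally likely classes, the worst case works out to

$$G_{\max} = 1 - \sum_{i=1}^{j} \left(\frac{1}{j}\right)^{2} = 1 - \frac{1}{j},$$

which is 0.5 for our two-class sunny/cloudy example and only creeps toward one as the number of classes grows, without ever reaching it.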

70
00:03:33.400 --> 00:03:36.560
But for what we're doing, the concept of impurity limits

71
00:03:36.560 --> 00:03:38.193
isn't really that important.

72
00:03:39.090 --> 00:03:41.170
In fact, the only reason I brought it up

73
00:03:41.170 --> 00:03:43.440
is because the existence of the limit

74
00:03:43.440 --> 00:03:45.320
suggests that Gini impurity

75
00:03:45.320 --> 00:03:47.833
is really just a variation of entropy.

76
00:03:49.070 --> 00:03:51.950
So that obviously leads us into entropy,

77
00:03:51.950 --> 00:03:53.840
which is the next method we can use

78
00:03:53.840 --> 00:03:55.423
to calculate uncertainty.

79
00:03:57.660 --> 00:03:58.860
There's a decent chance

80
00:03:58.860 --> 00:04:01.900
that a few of you have heard of the term entropy before.

81
00:04:01.900 --> 00:04:03.980
And that's because it's often used

82
00:04:03.980 --> 00:04:06.803
to describe disorder or chaos in a system.

83
00:04:07.820 --> 00:04:11.010
In machine learning it's a pretty similar concept,

84
00:04:11.010 --> 00:04:15.320
but to be a little more precise when we refer to entropy,

85
00:04:15.320 --> 00:04:18.280
we're talking about the homogeneity of a data set

86
00:04:18.280 --> 00:04:20.453
or how similar the elements are.

87
00:04:22.130 --> 00:04:24.170
Using the same example as before,

88
00:04:24.170 --> 00:04:28.090
let's start with a data set made up of only sunny days.

89
00:04:28.090 --> 00:04:31.500
For this, we'd say the data set is completely homogeneous

90
00:04:31.500 --> 00:04:33.653
because all four elements are the same.

91
00:04:34.580 --> 00:04:37.540
But as we replace sunny days with cloudy days,

92
00:04:37.540 --> 00:04:40.020
we'll begin to lose homogeneity

93
00:04:40.020 --> 00:04:42.223
until the set is perfectly mixed.

94
00:04:43.860 --> 00:04:45.210
Then if we were to replace

95
00:04:45.210 --> 00:04:47.860
the remaining sunny days with cloudy days,

96
00:04:47.860 --> 00:04:50.353
we'd have a perfectly homogeneous set again.

97
00:04:52.520 --> 00:04:54.140
To calculate entropy,

98
00:04:54.140 --> 00:04:57.760
we say entropy is equal to negative one

99
00:04:57.760 --> 00:05:00.700
times the sum of p sub i

100
00:05:00.700 --> 00:05:05.700
times log base two of p sub i from i equals one to j.
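
Written out, that's the standard entropy formula, where $p_i$ is the proportion of elements belonging to class $i$ and $j$ is the number of classes:

$$E = -1 \times \sum_{i=1}^{j} p_i \log_2 p_i$$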

101
00:05:09.630 --> 00:05:13.000
Now applying the entropy equation to our example,

102
00:05:13.000 --> 00:05:16.433
the entropy of the cloudy day set is in fact zero.

103
00:05:18.610 --> 00:05:22.190
Then as we replace one of the cloudy days with a sunny day,

104
00:05:22.190 --> 00:05:24.933
the entropy is about 0.811.

105
00:05:27.320 --> 00:05:29.270
With an evenly mixed set

106
00:05:29.270 --> 00:05:32.110
the entropy reaches its maximum value of one.

107
00:05:34.370 --> 00:05:37.150
With three sunny days and one cloudy day,

108
00:05:37.150 --> 00:05:40.830
the entropy is back to about 0.811,

109
00:05:40.830 --> 00:05:42.950
and finally with all sunny days,

110
00:05:42.950 --> 00:05:45.143
the entropy is back to zero again.
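
If you want to verify those five values yourself, here's a small sketch that walks through the same four-element sets:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

for sunny in range(5):  # 0, 1, 2, 3, or 4 sunny days out of four elements
    days = ["sunny"] * sunny + ["cloudy"] * (4 - sunny)
    print(sunny, "sunny:", round(entropy(days), 3))
# prints 0.0, 0.811, 1.0, 0.811, 0.0
```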

111
00:05:47.080 --> 00:05:49.110
And we don't really have enough data points

112
00:05:49.110 --> 00:05:50.140
to make it smooth

113
00:05:50.140 --> 00:05:53.470
but if we were to graph that binary entropy function,

114
00:05:53.470 --> 00:05:55.193
it would look something like this.
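
For reference, the curve being described is the binary entropy function, $H(p) = -p \log_2 p - (1 - p)\log_2(1 - p)$. A quick matplotlib sketch like the one below would draw the smooth version, peaking at one when the set is evenly mixed:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)               # proportion of sunny days
h = -p * np.log2(p) - (1 - p) * np.log2(1 - p)   # binary entropy

plt.plot(p, h)
plt.xlabel("Proportion of sunny days")
plt.ylabel("Entropy")
plt.title("Binary entropy function")
plt.show()
```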

115
00:05:56.500 --> 00:05:58.460
Now, before we wrap this guide up,

116
00:05:58.460 --> 00:06:00.233
let's do a quick recap.

117
00:06:01.560 --> 00:06:05.440
Gini impurity is the default method in Scikit-learn

118
00:06:05.440 --> 00:06:08.283
and is considered to be a variation of entropy.

119
00:06:09.260 --> 00:06:11.770
It works by using the impurity of a node

120
00:06:11.770 --> 00:06:13.393
to calculate uncertainty.

121
00:06:15.360 --> 00:06:18.360
Entropy also calculates uncertainty of a node,

122
00:06:18.360 --> 00:06:21.483
but it does so by measuring homogeneity with a logarithmic formula.

123
00:06:23.420 --> 00:06:26.290
And I think that about covers everything you'll need to know

124
00:06:26.290 --> 00:06:27.890
going into the next guide.

125
00:06:27.890 --> 00:06:31.763
So for now I'll wrap it up and I'll see you in the next one.