WEBVTT
1
00:00:02.990 --> 00:00:03.823
Now that we have
2
00:00:03.823 --> 00:00:05.780
all of the technical stuff out of the way,
3
00:00:05.780 --> 00:00:08.570
we can move forward with applying what we've learned
4
00:00:08.570 --> 00:00:10.433
to a real world problem.
5
00:00:11.550 --> 00:00:13.270
And I think what I like most
6
00:00:13.270 --> 00:00:16.110
about using a decision tree for classification,
7
00:00:16.110 --> 00:00:19.173
is the visual representation we're able to create.
8
00:00:20.300 --> 00:00:21.330
The reason being,
9
00:00:21.330 --> 00:00:24.310
and I'm sure many of you have experienced this,
10
00:00:24.310 --> 00:00:26.160
when you're trying to explain something
11
00:00:26.160 --> 00:00:27.960
that's fairly complicated,
12
00:00:27.960 --> 00:00:32.320
it makes it a lot easier when you have some diagram or chart
13
00:00:32.320 --> 00:00:34.023
for people to follow along with.
14
00:00:35.100 --> 00:00:36.860
And in my opinion,
15
00:00:36.860 --> 00:00:38.250
the visual representation
16
00:00:38.250 --> 00:00:40.830
we're able to create from a decision tree,
17
00:00:40.830 --> 00:00:44.040
offers one of the most powerful and effective ways
18
00:00:44.040 --> 00:00:47.433
of simplifying and communicating a complex problem.
19
00:00:49.650 --> 00:00:53.900
Now, I really want this guide to focus on the visual itself,
20
00:00:53.900 --> 00:00:55.470
so I thought it would be best
21
00:00:55.470 --> 00:00:56.730
to work with the data frame
22
00:00:56.730 --> 00:00:59.010
that we're already familiar with.
23
00:00:59.010 --> 00:01:01.390
That's why I chose the tumor dataset
24
00:01:01.390 --> 00:01:04.473
that we first worked with in the K-Nearest Neighbor guide.
25
00:01:06.200 --> 00:01:09.920
To save a little bit of time, I already imported pandas,
26
00:01:09.920 --> 00:01:12.170
the train_test_split function,
27
00:01:12.170 --> 00:01:14.170
and the decision tree classifier
28
00:01:14.170 --> 00:01:16.063
from Scikit-learn's tree module.
29
00:01:17.420 --> 00:01:20.570
I also imported the tumor CSV file,
30
00:01:20.570 --> 00:01:23.240
segmented our feature and target variables,
31
00:01:23.240 --> 00:01:26.353
and then applied both of those to the train_test_split function.
32
00:01:27.610 --> 00:01:30.630
And like we've done in almost every other example,
33
00:01:30.630 --> 00:01:33.610
I created a decision tree classifier object
34
00:01:33.610 --> 00:01:35.170
and used the fit function
35
00:01:35.170 --> 00:01:38.020
to create a model using the training data.
36
00:01:38.020 --> 00:01:41.860
Then the last thing I did was create a score variable
37
00:01:41.860 --> 00:01:44.283
so we can check the accuracy of the model.
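For reference, here's a minimal sketch of that setup. The file name tumor.csv, the column name class, and the variable names are assumptions based on the narration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the tumor dataset (file name assumed)
df = pd.read_csv('tumor.csv')

# Segment the feature and target variables ('class' column name assumed)
X = df.drop(columns=['class'])
y = df['class']

# Apply both of those to the train_test_split function
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Create a decision tree classifier object and fit a model on the training data
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Create a score variable to check the accuracy of the model
score = clf.score(X_test, y_test)
```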
38
00:01:46.160 --> 00:01:48.010
Before we go any further,
39
00:01:48.010 --> 00:01:49.450
let's run this really quick
40
00:01:49.450 --> 00:01:51.253
and take a look at the data frame.
41
00:01:58.500 --> 00:02:01.540
In case any of you don't remember what we're working with,
42
00:02:01.540 --> 00:02:03.530
we have nine feature variables
43
00:02:03.530 --> 00:02:06.400
going from clump thickness to mitoses.
44
00:02:07.840 --> 00:02:12.180
Then the last column contains all of the class labels.
45
00:02:12.180 --> 00:02:13.573
And for that,
46
00:02:15.540 --> 00:02:20.540
we have 444 elements representing benign tumors,
47
00:02:20.760 --> 00:02:24.863
and 239 elements that represent malignant tumors.
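For reference, a quick way to confirm those counts, assuming the class column from the setup sketch above (in the standard Wisconsin tumor data, which these counts match, 2 codes for benign and 4 for malignant):

```python
# Count the elements per class label: 2 = benign, 4 = malignant (assumed encoding)
print(df['class'].value_counts())
```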
48
00:02:26.830 --> 00:02:29.650
When we built out the K-Nearest Neighbor model,
49
00:02:29.650 --> 00:02:33.523
I think we had an accuracy score somewhere in the high 90s.
50
00:02:35.260 --> 00:02:38.203
Let's take a look at how the decision tree matches up.
51
00:02:40.840 --> 00:02:42.550
And it's pretty good,
52
00:02:42.550 --> 00:02:45.343
but not quite as good as K-Nearest Neighbors.
53
00:02:46.920 --> 00:02:50.040
All right, now that we have all of that taken care of,
54
00:02:50.040 --> 00:02:52.363
let's start building out the tree visual.
55
00:02:53.440 --> 00:02:55.270
The first thing that we're gonna need to do
56
00:02:55.270 --> 00:02:58.373
is import plot_tree from the tree module.
57
00:02:59.310 --> 00:03:03.930
Then we'll move down and call plot_tree, parentheses,
58
00:03:03.930 --> 00:03:05.563
and then the classifier.
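A minimal sketch of that call, reusing the clf object fitted above:

```python
from sklearn.tree import plot_tree

# Draw the fitted classification tree in the plot pane
plot_tree(clf)
```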
59
00:03:09.130 --> 00:03:11.343
Now, when we check the plot pane,
60
00:03:12.320 --> 00:03:15.933
we have a very hard to read classification tree.
61
00:03:17.080 --> 00:03:19.270
But before we address that issue,
62
00:03:19.270 --> 00:03:21.570
let's go through a couple of the attributes
63
00:03:21.570 --> 00:03:23.263
that Sklearn offers.
64
00:03:25.660 --> 00:03:28.060
The three that I use the most often
65
00:03:28.060 --> 00:03:32.240
are feature_names, class_names, and the filled option,
66
00:03:32.240 --> 00:03:33.960
which adds color to a node
67
00:03:33.960 --> 00:03:36.383
to help indicate what class it belongs to.
68
00:03:38.480 --> 00:03:41.700
Occasionally, I'll add the rounded attribute as well.
69
00:03:41.700 --> 00:03:45.043
And all that does is round the corners of the node box.
70
00:03:47.380 --> 00:03:50.503
So let's add those in starting with feature names.
71
00:03:52.240 --> 00:03:55.703
For this, we can just pass in df.columns,
72
00:03:57.010 --> 00:03:59.480
then include all of the column names
73
00:03:59.480 --> 00:04:01.533
excluding the class column.
74
00:04:04.120 --> 00:04:07.240
The class names are gonna be a little bit more tricky,
75
00:04:07.240 --> 00:04:09.493
but it's still pretty straightforward.
76
00:04:11.010 --> 00:04:13.320
What we're gonna do is start by passing in
77
00:04:13.320 --> 00:04:16.070
Y for the target variable.
78
00:04:16.070 --> 00:04:20.113
Then we'll use the map function and pass in a dictionary.
79
00:04:21.280 --> 00:04:24.010
For the key, we're gonna use the class label
80
00:04:24.010 --> 00:04:25.223
starting with two.
81
00:04:26.250 --> 00:04:28.943
And for the value, we'll pass in benign.
82
00:04:31.410 --> 00:04:34.543
Then we can do the exact same thing for the malignant class.
83
00:04:41.370 --> 00:04:44.563
For the filled attribute, we just have to pass in True.
84
00:04:45.440 --> 00:04:47.703
And for the rounded, we'll do the same thing.
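Here's a sketch of the full call. The narration builds the class names by mapping Y through a dictionary; the version below maps the labels in clf.classes_ through that same dictionary instead, which keeps the names in the classifier's own label order. The 2 = benign, 4 = malignant encoding is assumed from the standard Wisconsin data:

```python
# Dictionary mapping each class label to a readable name (encoding assumed)
labels = {2: 'benign', 4: 'malignant'}

plot_tree(clf,
          feature_names=df.columns[:-1],                  # every column except the class column
          class_names=[labels[c] for c in clf.classes_],  # names in the classifier's label order
          filled=True,                                    # color each node by its majority class
          rounded=True)                                   # round the corners of the node boxes
```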
85
00:04:51.640 --> 00:04:53.640
Now let's run this one more time
86
00:04:54.550 --> 00:04:57.090
and you can see color has been added,
87
00:04:57.090 --> 00:04:59.353
but it's still too small to read anything.
88
00:05:01.410 --> 00:05:05.000
Now, there is a way to export the file directly,
89
00:05:05.000 --> 00:05:05.910
but instead,
90
00:05:05.910 --> 00:05:09.673
let's use another really popular library called Graphviz.
91
00:05:12.110 --> 00:05:14.280
And the reason I'm showing you this option
92
00:05:14.280 --> 00:05:18.260
is because Graphviz offers a ton of different options
93
00:05:18.260 --> 00:05:19.810
and it's super easy to use
94
00:05:19.810 --> 00:05:22.193
because it works directly with Scikit-learn.
95
00:05:23.360 --> 00:05:25.490
So if it's something that you would like to try,
96
00:05:25.490 --> 00:05:27.470
it's a really easy install.
97
00:05:27.470 --> 00:05:30.143
But for now, let's just go through how it works.
98
00:05:32.090 --> 00:05:33.900
The first thing that we're gonna need to do
99
00:05:33.900 --> 00:05:36.250
is import export_graphviz
100
00:05:38.170 --> 00:05:40.823
as well as the Graphviz library.
101
00:05:49.910 --> 00:05:52.920
Then we're gonna create a variable named dot_data
102
00:05:52.920 --> 00:05:56.493
and assign that to the export_graphviz function.
103
00:05:58.940 --> 00:06:03.183
Now let's just copy all of this and paste it below.
104
00:06:06.860 --> 00:06:09.430
I'm also gonna add in one more attribute
105
00:06:09.430 --> 00:06:11.353
that plot_tree doesn't offer.
106
00:06:13.240 --> 00:06:16.923
So we'll say special_characters equals True.
107
00:06:19.540 --> 00:06:23.550
If you're wondering about the purpose of export_graphviz,
108
00:06:23.550 --> 00:06:27.370
it's used to export the decision tree into DOT format,
109
00:06:27.370 --> 00:06:29.783
which is the language used by Graphviz.
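A sketch of that step, reusing the fitted classifier and the labels dictionary from above. Passing out_file=None makes export_graphviz return the DOT source as a string:

```python
from sklearn.tree import export_graphviz
import graphviz

# Export the fitted tree into DOT format, the language Graphviz reads
dot_data = export_graphviz(clf,
                           out_file=None,
                           feature_names=df.columns[:-1],
                           class_names=[labels[c] for c in clf.classes_],
                           filled=True,
                           rounded=True,
                           special_characters=True)
```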
110
00:06:32.225 --> 00:06:34.570
The last step is to render the tree
111
00:06:34.570 --> 00:06:37.070
into a format we can actually read,
112
00:06:37.070 --> 00:06:38.963
which is a pretty quick process.
113
00:06:39.890 --> 00:06:43.240
All we need to do is create a graph variable
114
00:06:43.240 --> 00:06:46.773
and assign that to graphviz.Source.
115
00:06:49.760 --> 00:06:51.480
Then inside of the parentheses,
116
00:06:51.480 --> 00:06:52.540
we just need to pass in
117
00:06:52.540 --> 00:06:55.370
the source of the DOT-format decision tree,
118
00:06:55.370 --> 00:06:57.963
which in our case is the dot_data object.
119
00:07:01.370 --> 00:07:04.310
Finally, we just say graph.render,
120
00:07:04.310 --> 00:07:07.153
and pass in a string for the title of the graph.
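A sketch of those last two steps; the file name passed to render is an assumption:

```python
# Wrap the DOT source, then render it (PDF is Graphviz's default output format)
graph = graphviz.Source(dot_data)
graph.render('tumor_decision_tree')  # writes tumor_decision_tree.pdf to the working directory
```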
121
00:07:14.490 --> 00:07:16.060
Then after we run it again,
122
00:07:16.060 --> 00:07:19.120
the default format of the graph is a PDF.
123
00:07:19.120 --> 00:07:20.630
So you should find the graph
124
00:07:20.630 --> 00:07:23.653
saved as a PDF in your working directory.
125
00:07:27.720 --> 00:07:29.130
Now when we open this up,
126
00:07:29.130 --> 00:07:31.190
the first thing that you'll probably notice
127
00:07:31.190 --> 00:07:33.563
are the color differences between the nodes.
128
00:07:34.760 --> 00:07:37.020
And those are directly related to the class
129
00:07:37.020 --> 00:07:38.853
and Gini impurity value.
130
00:07:39.860 --> 00:07:42.830
The orange nodes indicate a sample with more elements
131
00:07:42.830 --> 00:07:44.950
from the benign class.
132
00:07:44.950 --> 00:07:48.303
And the darker the node gets, the more pure the sample is.
133
00:07:51.310 --> 00:07:53.190
So when you look at one of the leaf nodes,
134
00:07:53.190 --> 00:07:56.623
it's a really dark orange and the impurity score is zero.
135
00:07:58.330 --> 00:08:00.430
Then the same thing is gonna be true
136
00:08:00.430 --> 00:08:03.193
with the malignant class, but all of those are blue.
137
00:08:04.600 --> 00:08:07.500
The last thing we'll cover before we finish this guide up
138
00:08:07.500 --> 00:08:09.540
is what each node tells us
139
00:08:09.540 --> 00:08:11.483
about how partitions were decided.
140
00:08:13.950 --> 00:08:15.270
Starting at the root node,
141
00:08:15.270 --> 00:08:17.330
we can see the color is light orange,
142
00:08:17.330 --> 00:08:19.570
and that's because the majority of the samples
143
00:08:19.570 --> 00:08:21.113
come from the benign class.
144
00:08:22.570 --> 00:08:23.750
And there is a chance
145
00:08:23.750 --> 00:08:25.710
that your tree is gonna be a little different,
146
00:08:25.710 --> 00:08:27.590
but at the top of my root node,
147
00:08:27.590 --> 00:08:29.590
we can see that the first partition
148
00:08:29.590 --> 00:08:33.033
was made using the feature cell size uniformity.
149
00:08:34.810 --> 00:08:36.280
And according to the graph,
150
00:08:36.280 --> 00:08:40.420
if the feature value was less than or equal to 3.5,
151
00:08:40.420 --> 00:08:43.650
it follows the true arrow to the left.
152
00:08:43.650 --> 00:08:45.980
But if it was above 3.5,
153
00:08:45.980 --> 00:08:48.203
it follows the false arrow to the right.
154
00:08:49.630 --> 00:08:52.860
And you can tell right away that was a really good split
155
00:08:52.860 --> 00:08:55.950
because both of the gini values for child nodes
156
00:08:55.950 --> 00:08:57.633
are really close to zero.
157
00:08:59.220 --> 00:09:00.170
Then from there,
158
00:09:00.170 --> 00:09:03.460
you just have to follow the same process down each branch
159
00:09:03.460 --> 00:09:05.970
until you eventually get to a leaf node.
160
00:09:05.970 --> 00:09:07.820
And that's really all there is to it.
161
00:09:08.920 --> 00:09:11.810
So for now, I will be wrapping this up
162
00:09:11.810 --> 00:09:13.660
and I will see you in the next guide.