WEBVTT
1
00:00:02.990 --> 00:00:03.823
Now that we have
2
00:00:03.823 --> 00:00:05.780
all of the technical stuff out of the way,
3
00:00:05.780 --> 00:00:08.570
we can move forward with applying what we've learned
4
00:00:08.570 --> 00:00:10.433
to a real world problem.
5
00:00:11.550 --> 00:00:13.270
And I think what I like most
6
00:00:13.270 --> 00:00:16.110
about using a decision tree for classification,
7
00:00:16.110 --> 00:00:19.173
is the visual representation we're able to create.
8
00:00:20.300 --> 00:00:21.330
The reason being,
9
00:00:21.330 --> 00:00:24.310
and I'm sure many of you have experienced this,
10
00:00:24.310 --> 00:00:26.160
when you're trying to explain something
11
00:00:26.160 --> 00:00:27.960
that's fairly complicated,
12
00:00:27.960 --> 00:00:32.320
it makes it a lot easier when you have some diagram or chart
13
00:00:32.320 --> 00:00:34.023
for people to follow along with.
14
00:00:35.100 --> 00:00:36.860
And in my opinion,
15
00:00:36.860 --> 00:00:38.250
the visual representation
16
00:00:38.250 --> 00:00:40.830
we're able to create from a decision tree,
17
00:00:40.830 --> 00:00:44.040
offers one of the most powerful and effective ways
18
00:00:44.040 --> 00:00:47.433
of simplifying and communicating a complex problem.
19
00:00:49.650 --> 00:00:53.900
Now, I really want this guide to focus on the visual itself,
20
00:00:53.900 --> 00:00:55.470
so I thought it would be best
21
00:00:55.470 --> 00:00:56.730
to work with the data frame
22
00:00:56.730 --> 00:00:59.010
that we're already familiar with.
23
00:00:59.010 --> 00:01:01.390
That's why I chose the tumor dataset
24
00:01:01.390 --> 00:01:04.473
that we first worked with in the K-Nearest Neighbor guide.
25
00:01:06.200 --> 00:01:09.920
To save a little bit of time, I already imported pandas,
26
00:01:09.920 --> 00:01:12.170
the train_test_split function,
27
00:01:12.170 --> 00:01:14.170
and the decision tree classifier
28
00:01:14.170 --> 00:01:16.063
from Scikit-learn's tree module.
29
00:01:17.420 --> 00:01:20.570
I also imported the tumor CSV file,
30
00:01:20.570 --> 00:01:23.240
segmented our feature and target variables,
31
00:01:23.240 --> 00:01:26.353
and then applied both of those to the train_test_split function.
32
00:01:27.610 --> 00:01:30.630
And like we've done in almost every other example,
33
00:01:30.630 --> 00:01:33.610
I created a decision tree classifier object
34
00:01:33.610 --> 00:01:35.170
and used the fit function
35
00:01:35.170 --> 00:01:38.020
to create a model using the training data.
36
00:01:38.020 --> 00:01:41.860
Then the last thing I did was create a score variable
37
00:01:41.860 --> 00:01:44.283
so we can check the accuracy of the model.
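For reference, here's a minimal sketch of that setup. The file name tumor.csv, the column name class, and the variable names are assumptions based on the narration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the tumor dataset (file name assumed)
df = pd.read_csv('tumor.csv')

# Segment the feature and target variables ('class' column name assumed)
X = df.drop(columns=['class'])
y = df['class']

# Apply both of those to the train_test_split function
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Create a decision tree classifier object and fit a model on the training data
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Create a score variable to check the accuracy of the model
score = clf.score(X_test, y_test)
```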
38
00:01:46.160 --> 00:01:48.010
Before we go any further,
39
00:01:48.010 --> 00:01:49.450
let's run this really quick
40
00:01:49.450 --> 00:01:51.253
and take a look at the data frame.
41
00:01:58.500 --> 00:02:01.540
In case any of you don't remember what we're working with,
42
00:02:01.540 --> 00:02:03.530
we have nine feature variables
43
00:02:03.530 --> 00:02:06.400
going from clump thickness to mitoses.
44
00:02:07.840 --> 00:02:12.180
Then the last column contains all of the class labels.
45
00:02:12.180 --> 00:02:13.573
And for that,
46
00:02:15.540 --> 00:02:20.540
we have 444 elements representing benign tumors,
47
00:02:20.760 --> 00:02:24.863
and 239 elements that represent malignant tumors.
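For reference, a quick way to confirm those counts, assuming the class column from the setup sketch above (in the standard Wisconsin tumor data, which these counts match, 2 codes for benign and 4 for malignant):

```python
# Count the elements per class label: 2 = benign, 4 = malignant (assumed encoding)
print(df['class'].value_counts())
```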
48
00:02:26.830 --> 00:02:29.650
When we built out the K-Nearest Neighbor model,
49
00:02:29.650 --> 00:02:33.523
I think we had an accuracy score somewhere in the high 90s.
50
00:02:35.260 --> 00:02:38.203
Let's take a look at how the decision tree matches up.
51
00:02:40.840 --> 00:02:42.550
And it's pretty good,
52
00:02:42.550 --> 00:02:45.343
but not quite as good as K-Nearest Neighbors.
53
00:02:46.920 --> 00:02:50.040
All right, now that we have all of that taken care of,
54
00:02:50.040 --> 00:02:52.363
let's start building out the tree visual.
55
00:02:53.440 --> 00:02:55.270
The first thing that we're gonna need to do
56
00:02:55.270 --> 00:02:58.373
is import plot_tree from the tree module.
57
00:02:59.310 --> 00:03:03.930
Then we'll move down and call plot_tree, parentheses,
58
00:03:03.930 --> 00:03:05.563
and then the classifier.
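A minimal sketch of that call, reusing the clf object fitted above:

```python
from sklearn.tree import plot_tree

# Draw the fitted classification tree in the plot pane
plot_tree(clf)
```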
59
00:03:09.130 --> 00:03:11.343
Now, when we check the plot pane,
60
00:03:12.320 --> 00:03:15.933
we have a very hard to read classification tree.
61
00:03:17.080 --> 00:03:19.270
But before we address that issue,
62
00:03:19.270 --> 00:03:21.570
let's go through a couple of the attributes
63
00:03:21.570 --> 00:03:23.263
that Sklearn offers.
64
00:03:25.660 --> 00:03:28.060
The three that I use the most often
65
00:03:28.060 --> 00:03:32.240
are feature_names, class_names, and the filled option,
66
00:03:32.240 --> 00:03:33.960
which adds color to a node
67
00:03:33.960 --> 00:03:36.383
to help indicate what class it belongs to.
68
00:03:38.480 --> 00:03:41.700
Occasionally, I'll add the rounded attribute as well.
69
00:03:41.700 --> 00:03:45.043
And all that does is round the corners of the node box.
70
00:03:47.380 --> 00:03:50.503
So let's add those in starting with feature names.
71
00:03:52.240 --> 00:03:55.703
For this, we can just pass in df.columns,
72
00:03:57.010 --> 00:03:59.480
then include all of the column names
73
00:03:59.480 --> 00:04:01.533
excluding the class column.
74
00:04:04.120 --> 00:04:07.240
The class names are gonna be a little bit more tricky,
75
00:04:07.240 --> 00:04:09.493
but it's still pretty straightforward.
76
00:04:11.010 --> 00:04:13.320
What we're gonna do is start by passing in
77
00:04:13.320 --> 00:04:16.070
Y for the target variable.
78
00:04:16.070 --> 00:04:20.113
Then we'll use the map function and pass in a dictionary.
79
00:04:21.280 --> 00:04:24.010
For the key, we're gonna use the class label
80
00:04:24.010 --> 00:04:25.223
starting with two.
81
00:04:26.250 --> 00:04:28.943
And for the value, we'll pass in benign.
82
00:04:31.410 --> 00:04:34.543
Then we can do the exact same thing for the malignant class.
83
00:04:41.370 --> 00:04:44.563
For the filled attribute, we just have to pass in True.
84
00:04:45.440 --> 00:04:47.703
And for the rounded, we'll do the same thing.
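Here's a sketch of the full call. The narration builds the class names by mapping Y through a dictionary; the version below maps the labels in clf.classes_ through that same dictionary instead, which keeps the names in the classifier's own label order. The 2 = benign, 4 = malignant encoding is assumed from the standard Wisconsin data:

```python
# Dictionary mapping each class label to a readable name (encoding assumed)
labels = {2: 'benign', 4: 'malignant'}

plot_tree(clf,
          feature_names=df.columns[:-1],                  # every column except the class column
          class_names=[labels[c] for c in clf.classes_],  # names in the classifier's label order
          filled=True,                                    # color each node by its majority class
          rounded=True)                                   # round the corners of the node boxes
```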
85
00:04:51.640 --> 00:04:53.640
Now let's run this one more time
86
00:04:54.550 --> 00:04:57.090
and you can see color has been added,
87
00:04:57.090 --> 00:04:59.353
but it's still too small to read anything.
88
00:05:01.410 --> 00:05:05.000
Now, there is a way to export the file directly,
89
00:05:05.000 --> 00:05:05.910
but instead,
90
00:05:05.910 --> 00:05:09.673
let's use another really popular library called Graphviz.
91
00:05:12.110 --> 00:05:14.280
And the reason I'm showing you this option
92
00:05:14.280 --> 00:05:18.260
is because Graphviz offers a ton of different options
93
00:05:18.260 --> 00:05:19.810
and it's super easy to use
94
00:05:19.810 --> 00:05:22.193
because it works directly with Scikit-learn.
95
00:05:23.360 --> 00:05:25.490
So if it's something that you would like to try,
96
00:05:25.490 --> 00:05:27.470
it's a really easy install.
97
00:05:27.470 --> 00:05:30.143
But for now, let's just go through how it works.
98
00:05:32.090 --> 00:05:33.900
The first thing that we're gonna need to do
99
00:05:33.900 --> 00:05:36.250
is import export_graphviz
100
00:05:38.170 --> 00:05:40.823
as well as the Graphviz library.
101
00:05:49.910 --> 00:05:52.920
Then we're gonna create a variable named dot_data
102
00:05:52.920 --> 00:05:56.493
and assign that to the export_graphviz function.
103
00:05:58.940 --> 00:06:03.183
Now let's just copy all of this and paste it below.
104
00:06:06.860 --> 00:06:09.430
I'm also gonna add in one more attribute
105
00:06:09.430 --> 00:06:11.353
that plot_tree doesn't offer.
106
00:06:13.240 --> 00:06:16.923
So we'll say special_characters equals True.
107
00:06:19.540 --> 00:06:23.550
If you're wondering about the purpose of export_graphviz,
108
00:06:23.550 --> 00:06:27.370
it's used to export the decision tree into DOT format,
109
00:06:27.370 --> 00:06:29.783
which is the language used by Graphviz.
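A sketch of that step, reusing the fitted classifier and the labels dictionary from above. Passing out_file=None makes export_graphviz return the DOT source as a string:

```python
from sklearn.tree import export_graphviz
import graphviz

# Export the fitted tree into DOT format, the language Graphviz reads
dot_data = export_graphviz(clf,
                           out_file=None,
                           feature_names=df.columns[:-1],
                           class_names=[labels[c] for c in clf.classes_],
                           filled=True,
                           rounded=True,
                           special_characters=True)
```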
110
00:06:32.225 --> 00:06:34.570
The last step is to render the tree
111
00:06:34.570 --> 00:06:37.070
into a format we can actually read,
112
00:06:37.070 --> 00:06:38.963
which is a pretty quick process.
113
00:06:39.890 --> 00:06:43.240
All we need to do is create a graph variable
114
00:06:43.240 --> 00:06:46.773
and assign that to graphviz.Source.
115
00:06:49.760 --> 00:06:51.480
Then inside of the parentheses,
116
00:06:51.480 --> 00:06:52.540
we just need to pass in
117
00:06:52.540 --> 00:06:55.370
the source of the DOT-format decision tree,
118
00:06:55.370 --> 00:06:57.963
which in our case is the dot_data object.
119
00:07:01.370 --> 00:07:04.310
Finally, we just say graph.render,
120
00:07:04.310 --> 00:07:07.153
and pass in a string for the title of the graph.
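A sketch of those last two steps; the file name passed to render is an assumption:

```python
# Wrap the DOT source, then render it (PDF is Graphviz's default output format)
graph = graphviz.Source(dot_data)
graph.render('tumor_decision_tree')  # writes tumor_decision_tree.pdf to the working directory
```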
121
00:07:14.490 --> 00:07:16.060
Then after we run it again,
122
00:07:16.060 --> 00:07:19.120
the default format of the graph is a PDF.
123
00:07:19.120 --> 00:07:20.630
So you should find the graph
124
00:07:20.630 --> 00:07:23.653
saved as a PDF in your working directory.
125
00:07:27.720 --> 00:07:29.130
Now when we open this up,
126
00:07:29.130 --> 00:07:31.190
the first thing that you'll probably notice
127
00:07:31.190 --> 00:07:33.563
are the color differences between the nodes.
128
00:07:34.760 --> 00:07:37.020
And those are directly related to the class
129
00:07:37.020 --> 00:07:38.853
and Gini impurity value.
130
00:07:39.860 --> 00:07:42.830
The orange nodes indicate a sample with more elements
131
00:07:42.830 --> 00:07:44.950
from the benign class.
132
00:07:44.950 --> 00:07:48.303
And the darker the node gets, the more pure the sample is.
133
00:07:51.310 --> 00:07:53.190
So when you look at one of the leaf nodes,
134
00:07:53.190 --> 00:07:56.623
it's a really dark orange and the impurity score is zero.
135
00:07:58.330 --> 00:08:00.430
Then the same thing is gonna be true
136
00:08:00.430 --> 00:08:03.193
with the malignant class, but all of those are blue.
137
00:08:04.600 --> 00:08:07.500
The last thing we'll cover before we finish this guide up
138
00:08:07.500 --> 00:08:09.540
is what each node tells us
139
00:08:09.540 --> 00:08:11.483
about how partitions were decided.
140
00:08:13.950 --> 00:08:15.270
Starting at the root node,
141
00:08:15.270 --> 00:08:17.330
we can see the color is light orange,
142
00:08:17.330 --> 00:08:19.570
and that's because the majority of the samples
143
00:08:19.570 --> 00:08:21.113
come from the benign class.
144
00:08:22.570 --> 00:08:23.750
And there is a chance
145
00:08:23.750 --> 00:08:25.710
that your tree is gonna be a little different,
146
00:08:25.710 --> 00:08:27.590
but at the top of my root node,
147
00:08:27.590 --> 00:08:29.590
we can see that the first partition
148
00:08:29.590 --> 00:08:33.033
was made using the feature cell size uniformity.
149
00:08:34.810 --> 00:08:36.280
And according to the graph,
150
00:08:36.280 --> 00:08:40.420
if the feature value was less than or equal to 3.5,
151
00:08:40.420 --> 00:08:43.650
it follows the true arrow to the left.
152
00:08:43.650 --> 00:08:45.980
But if it was above 3.5,
153
00:08:45.980 --> 00:08:48.203
it follows the false arrow to the right.
154
00:08:49.630 --> 00:08:52.860
And you can tell right away that was a really good split
155
00:08:52.860 --> 00:08:55.950
because both of the gini values for child nodes
156
00:08:55.950 --> 00:08:57.633
are really close to zero.
157
00:08:59.220 --> 00:09:00.170
Then from there,
158
00:09:00.170 --> 00:09:03.460
you just have to follow the same process down each branch
159
00:09:03.460 --> 00:09:05.970
until you eventually get to a leaf node.
160
00:09:05.970 --> 00:09:07.820
And that's really all there is to it.
161
00:09:08.920 --> 00:09:11.810
So for now, I will be wrapping this up
162
00:09:11.810 --> 00:09:13.660
and I will see you in the next guide.