WEBVTT
1
00:00:03.580 --> 00:00:05.610
In this guide, we're gonna use
2
00:00:05.610 --> 00:00:08.680
what we covered in the last guide and apply that
3
00:00:08.680 --> 00:00:11.830
to creating a logistic regression model.
4
00:00:11.830 --> 00:00:14.390
Then when we're done with that, we're gonna bring back
5
00:00:14.390 --> 00:00:16.870
the confusion matrix and walk through
6
00:00:16.870 --> 00:00:19.333
how we can make adjustments to the threshold.
7
00:00:21.620 --> 00:00:24.140
To get right into it, let's take a look at the code
8
00:00:24.140 --> 00:00:25.540
we're gonna be working with.
9
00:00:26.700 --> 00:00:29.630
Hopefully by now, all of this should look pretty familiar,
10
00:00:29.630 --> 00:00:32.170
especially since we're using the same tumor data set
11
00:00:32.170 --> 00:00:33.713
as all of the other guides.
12
00:00:35.430 --> 00:00:37.840
And other than the classification report that you see
13
00:00:37.840 --> 00:00:39.480
from the metrics module,
14
00:00:39.480 --> 00:00:41.780
there isn't anything new for us to go through.
15
00:00:48.240 --> 00:00:50.430
You can see right away, the obvious benefit
16
00:00:50.430 --> 00:00:52.710
of using the classification report
17
00:00:52.710 --> 00:00:54.950
because it gives us a little bit more information
18
00:00:54.950 --> 00:00:56.563
than just the accuracy.
19
00:00:57.830 --> 00:01:00.530
And that's gonna turn out to be important in this example
20
00:01:00.530 --> 00:01:02.610
because we're gonna be paying a little more attention
21
00:01:02.610 --> 00:01:04.033
to the precision score.
22
00:01:06.670 --> 00:01:09.360
I know I made reference to this in some of the other guides
23
00:01:09.360 --> 00:01:12.350
but the reason we're concerning ourselves with precision,
24
00:01:12.350 --> 00:01:15.220
is because it measures the ability of the classifier
25
00:01:15.220 --> 00:01:17.363
to avoid the dreaded false positive.
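As a quick illustration of the precision score mentioned here: precision is the share of positive predictions that turned out to be correct, TP / (TP + FP). The counts below are hypothetical, chosen only to show the arithmetic, and are not taken from the guide's output.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Precision = TP / (TP + FP): the fraction of positive predictions that were correct."""
    predicted_positives = true_positives + false_positives
    if predicted_positives == 0:
        raise ValueError("precision is undefined when nothing is predicted positive")
    return true_positives / predicted_positives


# Hypothetical counts: 95 correct positive predictions and 5 false positives.
example_precision = precision(95, 5)
print(f"precision = {example_precision:.2f}")
```

Every false positive that is removed shrinks the denominator, so precision moves toward one.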
26
00:01:18.320 --> 00:01:20.650
And in this example, it's even more important
27
00:01:20.650 --> 00:01:22.240
to avoid a false positive
28
00:01:22.240 --> 00:01:25.403
because a misdiagnosis could quite literally be deadly.
29
00:01:26.760 --> 00:01:29.560
So even though we have a really solid precision score
30
00:01:29.560 --> 00:01:32.530
of 0.95 for the positive class,
31
00:01:32.530 --> 00:01:36.150
ideally we'd want that as close to one as possible.
32
00:01:36.150 --> 00:01:38.970
To make that happen, we'll have to reduce the sensitivity
33
00:01:38.970 --> 00:01:40.570
of the positive class.
34
00:01:40.570 --> 00:01:42.010
Or in other words,
35
00:01:42.010 --> 00:01:44.310
we'll need to increase the decision threshold
36
00:01:44.310 --> 00:01:46.160
making it harder for the classifier
37
00:01:46.160 --> 00:01:47.713
to predict a benign tumor.
38
00:01:53.400 --> 00:01:55.400
The first step in doing that is to use
39
00:01:55.400 --> 00:01:58.050
the predict_proba function to create an object
40
00:01:58.050 --> 00:02:00.400
that contains all of the class probabilities
41
00:02:00.400 --> 00:02:01.663
for the feature test set.
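A minimal sketch of this step, assuming scikit-learn's built-in breast cancer data stands in for the tumor data set (0 = malignant, 1 = benign). The scaling step, the `random_state` value, and the variable names are assumptions made for this example and are not shown in the guide.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: this data set stands in for the guide's tumor data.
# Its labels are 0 = malignant and 1 = benign, so benign is the positive class.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling is included only so the solver converges cleanly.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# One row per test sample, one column per class: [P(malignant), P(benign)].
class_probabilities = model.predict_proba(X_test)
benign_probability = class_probabilities[:, 1]
print(class_probabilities[:5].round(3))
```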
42
00:02:13.181 --> 00:02:14.830
Once we're done with that, we're gonna be using
43
00:02:14.830 --> 00:02:16.900
the only new function in the guide
44
00:02:16.900 --> 00:02:18.700
and that's the binarize function,
45
00:02:18.700 --> 00:02:21.063
located in the preprocessing module.
46
00:02:22.030 --> 00:02:25.020
I know the documentation isn't overly descriptive
47
00:02:25.020 --> 00:02:27.500
but the reason we're using the binarize function
48
00:02:27.500 --> 00:02:30.457
is because it gives us a really simple way
49
00:02:30.457 --> 00:02:34.430
of thresholding numerical features to get boolean values.
50
00:02:34.430 --> 00:02:36.980
And I'll show you what that means in just a minute.
51
00:02:40.350 --> 00:02:42.120
Now to get back to the code,
52
00:02:42.120 --> 00:02:45.340
I went ahead and did the binarize import already
53
00:02:45.340 --> 00:02:48.330
as well as adding in a new data frame, which we'll be using
54
00:02:48.330 --> 00:02:50.593
to help visualize all of what's going on.
55
00:02:51.480 --> 00:02:53.440
But before we take a look at that,
56
00:02:53.440 --> 00:02:56.163
we need to create the binarized array object.
57
00:03:00.660 --> 00:03:02.270
This is pretty straightforward
58
00:03:02.270 --> 00:03:04.693
but after we pass in the binarize function,
59
00:03:08.070 --> 00:03:11.623
all we need to do is pass in the data we want binarized,
60
00:03:14.300 --> 00:03:16.193
followed by a threshold value.
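To make the thresholding concrete, here is a small sketch of `binarize` from `sklearn.preprocessing` applied to a few made-up probabilities. The numbers are illustrative only and do not come from the guide's data.

```python
import numpy as np
from sklearn.preprocessing import binarize

# Made-up probabilities for the positive class. binarize expects a 2D array,
# so they are stored as a single column.
probabilities = np.array([[0.20], [0.50], [0.88], [0.95]])

# Values strictly greater than the threshold become 1, everything else becomes 0.
at_half = binarize(probabilities, threshold=0.5).ravel().astype(int)
at_ninety = binarize(probabilities, threshold=0.9).ravel().astype(int)

print(at_half)    # [0 0 1 1]
print(at_ninety)  # [0 0 0 1]
```

Note that a value exactly equal to the threshold maps to 0, and that raising the threshold can only turn 1s into 0s, never the reverse.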
61
00:03:19.693 --> 00:03:21.440
Towards the end of the last guide, I mentioned
62
00:03:21.440 --> 00:03:24.143
that the default threshold value for logistic regression is 0.5.
63
00:03:25.450 --> 00:03:27.313
So let's just start with that.
64
00:03:29.270 --> 00:03:32.080
Now after we run it, let's go ahead and open up
65
00:03:32.080 --> 00:03:33.943
the class probability data frame.
66
00:03:52.040 --> 00:03:54.590
All right, so the first column comes from
67
00:03:54.590 --> 00:03:58.340
the prediction probability object we created earlier.
68
00:03:58.340 --> 00:04:01.190
And because we're only gonna be using the positive label
69
00:04:01.190 --> 00:04:03.270
to help us build the confusion matrix,
70
00:04:03.270 --> 00:04:04.540
I thought it made more sense
71
00:04:04.540 --> 00:04:06.633
to only include the benign class.
72
00:04:07.960 --> 00:04:10.310
Now after that, we have a column containing
73
00:04:10.310 --> 00:04:12.680
all of the actual class labels,
74
00:04:12.680 --> 00:04:14.783
which we got from the target test set.
75
00:04:16.820 --> 00:04:20.270
Then in the logistic regression prediction column,
76
00:04:20.270 --> 00:04:22.490
it gives us the label that was actually predicted
77
00:04:22.490 --> 00:04:24.440
by the logistic regression model,
78
00:04:24.440 --> 00:04:26.193
while using the threshold of 0.5.
79
00:04:28.827 --> 00:04:31.260
Then the last column is just the binarized array,
80
00:04:31.260 --> 00:04:34.123
with a threshold setting of 0.5 as well.
81
00:04:35.090 --> 00:04:36.930
So assuming everything worked out
82
00:04:36.930 --> 00:04:38.480
the way it was supposed to,
83
00:04:38.480 --> 00:04:41.580
the last two columns should be identical.
84
00:04:41.580 --> 00:04:44.023
And when we scroll through the data frame,
85
00:04:50.390 --> 00:04:52.073
they are in fact the same.
86
00:04:54.520 --> 00:04:57.660
Before we go ahead with making changes to the threshold,
87
00:04:57.660 --> 00:05:00.263
let's first compare the confusion matrices.
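The comparison could be sketched roughly as follows. It assumes scikit-learn's breast cancer data in place of the tumor data set, plus a scaling step and a `random_state` that are choices made for this example, so the counts will not necessarily match the ones in the video.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, binarize

# Assumption: stand-in data set with 0 = malignant and 1 = benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

# Labels predicted by the model itself (implicit threshold of 0.5).
model_predictions = model.predict(X_test)

# Labels rebuilt by thresholding the benign probability at 0.5.
benign_probability = model.predict_proba(X_test)[:, [1]]  # kept 2D for binarize
binarized_predictions = binarize(benign_probability, threshold=0.5).ravel().astype(int)

# Rows are the actual classes, columns are the predicted classes.
cm_model = confusion_matrix(y_test, model_predictions, labels=[0, 1])
cm_binarized = confusion_matrix(y_test, binarized_predictions, labels=[0, 1])
print(cm_model)
print(cm_binarized)
print("identical:", np.array_equal(cm_model, cm_binarized))
```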
88
00:05:41.120 --> 00:05:43.913
And as expected, both of those match up.
89
00:05:45.200 --> 00:05:48.283
Now going back into the class probability data frame,
90
00:05:56.780 --> 00:05:59.410
we can see that all five of the false positives
91
00:05:59.410 --> 00:06:01.490
occurred within this range
92
00:06:01.490 --> 00:06:03.350
and the largest false positive
93
00:06:03.350 --> 00:06:06.593
had a confidence level of about 88%.
94
00:06:08.070 --> 00:06:11.273
So in theory, if we were to increase the threshold to 0.9,
95
00:06:12.530 --> 00:06:14.930
that should reduce the sensitivity enough
96
00:06:14.930 --> 00:06:17.973
to avoid malignant tumors being classified as benign.
97
00:06:27.520 --> 00:06:29.870
Now that we've replaced the five with the nine,
98
00:06:29.870 --> 00:06:31.453
we can run it again.
99
00:06:34.100 --> 00:06:36.380
And we now have a confusion matrix
100
00:06:36.380 --> 00:06:38.320
with zero false positives,
101
00:06:38.320 --> 00:06:40.543
which is exactly what we were looking for.
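A hedged sketch of the same experiment: it counts false positives at thresholds of 0.5 and 0.9. It assumes the scikit-learn breast cancer data as a stand-in for the tumor data set, and `confusion_at` is a hypothetical helper written for this example, so the exact counts may differ from the video.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, binarize

# Assumption: stand-in data set with 0 = malignant and 1 = benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
benign_probability = model.predict_proba(X_test)[:, [1]]  # kept 2D for binarize


def confusion_at(threshold: float):
    """Hypothetical helper: confusion matrix after thresholding P(benign)."""
    predictions = binarize(benign_probability, threshold=threshold).ravel().astype(int)
    return confusion_matrix(y_test, predictions, labels=[0, 1])


counts = {}
for threshold in (0.5, 0.9):
    tn, fp, fn, tp = confusion_at(threshold).ravel()
    counts[threshold] = {"fp": int(fp), "fn": int(fn)}
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

Raising the threshold can only remove benign predictions, so the false positive count cannot go up. The trade-off is that the false negative count cannot go down, which is worth checking before settling on a value.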
102
00:06:44.150 --> 00:06:46.260
And I think to finish this guide up,
103
00:06:46.260 --> 00:06:49.110
we should make a second classification report
104
00:06:49.110 --> 00:06:51.703
but this time using the binarized data.
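One way this second report might be produced, again assuming the scikit-learn breast cancer data in place of the tumor data set. The class names passed to `target_names` and the 0.9 threshold follow the discussion in this guide, while the variable names are assumptions for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, binarize

# Assumption: stand-in data set with 0 = malignant and 1 = benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

# Threshold the benign probability at 0.9 instead of the default 0.5.
benign_probability = model.predict_proba(X_test)[:, [1]]
predictions = binarize(benign_probability, threshold=0.9).ravel().astype(int)

# The report is built from the thresholded labels rather than model.predict().
print(classification_report(
    y_test, predictions, target_names=["malignant", "benign"], zero_division=0
))
```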
105
00:07:12.450 --> 00:07:14.700
And taking a look at that,
106
00:07:14.700 --> 00:07:17.010
the precision score of the benign class
107
00:07:17.010 --> 00:07:19.980
went from 0.95 to one.
108
00:07:19.980 --> 00:07:24.050
And even the accuracy jumped from 0.96 to 0.98,
109
00:07:24.050 --> 00:07:25.693
which is also great news.
110
00:07:26.570 --> 00:07:28.530
So overall, the modifications
111
00:07:28.530 --> 00:07:30.483
to the threshold really paid off.
112
00:07:32.160 --> 00:07:34.490
And I think that's about all we really needed
113
00:07:34.490 --> 00:07:36.050
to cover for this guide.
114
00:07:36.050 --> 00:07:39.563
So I will wrap this up and I will see you in the next guide.