Building the Logistic Regression Model

WEBVTT

1
00:00:03.580 --> 00:00:05.610
In this guide, we're gonna use

2
00:00:05.610 --> 00:00:08.680
what we covered in the last guide and apply that

3
00:00:08.680 --> 00:00:11.830
to creating a logistic regression model.

4
00:00:11.830 --> 00:00:14.390
Then when we're done with that, we're gonna bring back

5
00:00:14.390 --> 00:00:16.870
the confusion matrix and walk through

6
00:00:16.870 --> 00:00:19.333
how we can make adjustments to the threshold.

7
00:00:21.620 --> 00:00:24.140
To get right into it, let's take a look at the code

8
00:00:24.140 --> 00:00:25.540
we're gonna be working with.

9
00:00:26.700 --> 00:00:29.630
Hopefully by now, all of this should look pretty familiar,

10
00:00:29.630 --> 00:00:32.170
especially since we're using the same tumor data set

11
00:00:32.170 --> 00:00:33.713
as all of the other guides.

12
00:00:35.430 --> 00:00:37.840
And other than the classification report that you see

13
00:00:37.840 --> 00:00:39.480
from the metrics module,

14
00:00:39.480 --> 00:00:41.780
there isn't anything new for us to go through.
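
NOTE
A minimal sketch of the setup described here, not the exact code shown on screen.
It assumes the tumor data set is scikit-learn's built-in breast cancer data
(label 0 is malignant, label 1 is benign). The split size, random_state and
max_iter values are illustrative assumptions.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
# Assumption: the breast cancer data stands in for the guide's tumor data set.
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0
)
# max_iter is raised only to help the solver converge on unscaled features.
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# The report lists precision, recall and f1-score per class, plus accuracy.
report = classification_report(y_test, y_pred, target_names=data.target_names)
print(report)
```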

15
00:00:48.240 --> 00:00:50.430
You can see right away, the obvious benefit

16
00:00:50.430 --> 00:00:52.710
of using the classification report

17
00:00:52.710 --> 00:00:54.950
because it gives us a little bit more information

18
00:00:54.950 --> 00:00:56.563
than just the accuracy.

19
00:00:57.830 --> 00:01:00.530
And that's gonna turn out to be important in this example

20
00:01:00.530 --> 00:01:02.610
because we're gonna be paying a little more attention

21
00:01:02.610 --> 00:01:04.033
to the precision score.

22
00:01:06.670 --> 00:01:09.360
I know I made reference to this in some of the other guides

23
00:01:09.360 --> 00:01:12.350
but the reason we're concerning ourselves with precision,

24
00:01:12.350 --> 00:01:15.220
is because it measures the ability of the classifier

25
00:01:15.220 --> 00:01:17.363
to avoid the dreaded false positive.

26
00:01:18.320 --> 00:01:20.650
And in this example, it's even more important

27
00:01:20.650 --> 00:01:22.240
to avoid a false positive

28
00:01:22.240 --> 00:01:25.403
because a misdiagnosis could quite literally be deadly.

29
00:01:26.760 --> 00:01:29.560
So even though we have a really solid precision score

30
00:01:29.560 --> 00:01:32.530
of 0.95 for the positive class,

31
00:01:32.530 --> 00:01:36.150
ideally we'd want that as close to one as possible.

32
00:01:36.150 --> 00:01:38.970
To make that happen, we'll have to reduce the sensitivity

33
00:01:38.970 --> 00:01:40.570
of the positive class.

34
00:01:40.570 --> 00:01:42.010
Or in other words,

35
00:01:42.010 --> 00:01:44.310
we'll need to increase the decision threshold

36
00:01:44.310 --> 00:01:46.160
making it harder for the classifier

37
00:01:46.160 --> 00:01:47.713
to predict a benign tumor.
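
NOTE
A small worked example of why raising the threshold can raise precision.
The probabilities and labels below are made up for illustration, and
precision_at is a hypothetical helper, not a library function.
```python
# Made-up probabilities of the positive (benign) class and the true labels
# (1 = benign, 0 = malignant); these values are illustrative only.
probs = [0.97, 0.88, 0.60, 0.95, 0.30, 0.99]
actual = [1, 0, 0, 1, 0, 1]
def precision_at(threshold):
    """Precision of the positive class: TP / (TP + FP)."""
    predicted = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    return tp / (tp + fp)
# At 0.5 the malignant samples scored 0.88 and 0.60 are called benign.
print(precision_at(0.5))
# At 0.9 both of those fall below the threshold, so no false positives remain.
print(precision_at(0.9))
```
The trade-off is that a stricter threshold can also turn true positives into
false negatives, which this toy data happens not to show.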

38
00:01:53.400 --> 00:01:55.400
The first step in doing that is to use

39
00:01:55.400 --> 00:01:58.050
the predict_proba function to create an object

40
00:01:58.050 --> 00:02:00.400
that contains all of the class probabilities

41
00:02:00.400 --> 00:02:01.663
for the feature test set.
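
NOTE
A sketch of the class probability step, again assuming scikit-learn's
breast cancer data stands in for the tumor data set. Variable names and the
split settings are assumptions, not taken from the video.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0
)
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
# One row per test sample, one column per class, ordered like model.classes_.
class_probs = model.predict_proba(X_test)
# Column 1 holds the probability of label 1, the benign (positive) class.
benign_probs = class_probs[:, 1]
print(model.classes_)
print(class_probs[:5])
```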

42
00:02:13.181 --> 00:02:14.830
Once we're done with that, we're gonna be using

43
00:02:14.830 --> 00:02:16.900
the only new function in the guide

44
00:02:16.900 --> 00:02:18.700
and that's the binarize function,

45
00:02:18.700 --> 00:02:21.063
located in the preprocessing module.

46
00:02:22.030 --> 00:02:25.020
I know the documentation isn't overly descriptive

47
00:02:25.020 --> 00:02:27.500
but the reason we're using the binarize function

48
00:02:27.500 --> 00:02:30.457
is because it gives us a really simple way

49
00:02:30.457 --> 00:02:34.430
of thresholding numerical features to get boolean values.

50
00:02:34.430 --> 00:02:36.980
And I'll show you what that means in just a minute.

51
00:02:40.350 --> 00:02:42.120
Now to get back to the code,

52
00:02:42.120 --> 00:02:45.340
I went ahead and did the binarize import already

53
00:02:45.340 --> 00:02:48.330
as well as adding in a new data frame, which we'll be using

54
00:02:48.330 --> 00:02:50.593
to help visualize all of what's going on.

55
00:02:51.480 --> 00:02:53.440
But before we take a look at that,

56
00:02:53.440 --> 00:02:56.163
we need to create the binarized array object.

57
00:03:00.660 --> 00:03:02.270
This is pretty straightforward

58
00:03:02.270 --> 00:03:04.693
but after we call the binarize function,

59
00:03:08.070 --> 00:03:11.623
all we need to do is pass in the data we want binarized,

60
00:03:14.300 --> 00:03:16.193
followed by a threshold value.

61
00:03:19.693 --> 00:03:21.440
Towards the end of the last guide, I mentioned

62
00:03:21.440 --> 00:03:24.143
the default threshold value for logistic regression is 0.5.

63
00:03:25.450 --> 00:03:27.313
So let's just start with that.
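
NOTE
A short illustration of what binarize does with a threshold of 0.5.
The input values are made up. Note that binarize expects a 2D array.
```python
import numpy as np
from sklearn.preprocessing import binarize
# Illustrative probabilities only, shaped as a single column.
probs = np.array([[0.20], [0.50], [0.88], [0.95]])
# Values strictly greater than the threshold become 1, all others become 0.
labels = binarize(probs, threshold=0.5)
print(labels.ravel())
```
A value exactly equal to the threshold, like 0.50 here, maps to 0.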

64
00:03:29.270 --> 00:03:32.080
Now after we run it, let's go ahead and open up

65
00:03:32.080 --> 00:03:33.943
the class probability data frame.

66
00:03:52.040 --> 00:03:54.590
All right, so the first column comes from

67
00:03:54.590 --> 00:03:58.340
the prediction probability object we created earlier.

68
00:03:58.340 --> 00:04:01.190
And because we're only gonna be using the positive label

69
00:04:01.190 --> 00:04:03.270
to help us build the confusion matrix,

70
00:04:03.270 --> 00:04:04.540
I thought it made more sense

71
00:04:04.540 --> 00:04:06.633
to only include the benign class.

72
00:04:07.960 --> 00:04:10.310
Now after that, we have a column containing

73
00:04:10.310 --> 00:04:12.680
all of the actual class labels,

74
00:04:12.680 --> 00:04:14.783
which we got from the target test set.

75
00:04:16.820 --> 00:04:20.270
Then in the logistic regression prediction column,

76
00:04:20.270 --> 00:04:22.490
it gives us the label that was actually predicted

77
00:04:22.490 --> 00:04:24.440
by the logistic regression model,

78
00:04:24.440 --> 00:04:26.193
while using the threshold of 0.5.

79
00:04:28.827 --> 00:04:31.260
Then the last column is just the binarized array,

80
00:04:31.260 --> 00:04:34.123
with a threshold setting of 0.5 as well.

81
00:04:35.090 --> 00:04:36.930
So assuming everything worked out

82
00:04:36.930 --> 00:04:38.480
the way it was supposed to,

83
00:04:38.480 --> 00:04:41.580
the last two columns should be identical.

84
00:04:41.580 --> 00:04:44.023
And when we scroll through the data frame,

85
00:04:50.390 --> 00:04:52.073
they are in fact the same.
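
NOTE
A sketch of a data frame like the one described, assuming scikit-learn's
breast cancer data and pandas. The column names are hypothetical stand-ins
for the ones shown in the video.
```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import binarize
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0
)
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
# Probability of the benign (positive) class for each test sample.
benign_probs = model.predict_proba(X_test)[:, 1]
# binarize needs a 2D input, so reshape to one column and flatten afterwards.
binarized = binarize(benign_probs.reshape(-1, 1), threshold=0.5).ravel().astype(int)
class_prob_df = pd.DataFrame({
    "benign_probability": benign_probs,
    "actual": y_test,
    "lr_prediction": model.predict(X_test),
    "binarized": binarized,
})
print(class_prob_df.head())
# Check that the model's own predictions match the 0.5 binarized column.
columns_match = bool((class_prob_df["lr_prediction"] == class_prob_df["binarized"]).all())
print(columns_match)
```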

86
00:04:54.520 --> 00:04:57.660
Before we go ahead with making changes to the threshold,

87
00:04:57.660 --> 00:05:00.263
let's first compare the confusion matrices.

88
00:05:41.120 --> 00:05:43.913
And as expected, both of those match up.

89
00:05:45.200 --> 00:05:48.283
Now going back into the class probability data frame,

90
00:05:56.780 --> 00:05:59.410
we can see that all five of the false positives

91
00:05:59.410 --> 00:06:01.490
occurred within this range

92
00:06:01.490 --> 00:06:03.350
and the largest false positive

93
00:06:03.350 --> 00:06:06.593
had a predicted probability of about 88%.

94
00:06:08.070 --> 00:06:11.273
So in theory, if we were to increase the threshold to 0.9,

95
00:06:12.530 --> 00:06:14.930
that should reduce the sensitivity enough

96
00:06:14.930 --> 00:06:17.973
to avoid malignant tumors being classified as benign.

97
00:06:27.520 --> 00:06:29.870
Now that we've replaced the five with the nine,

98
00:06:29.870 --> 00:06:31.453
we can run it again.

99
00:06:34.100 --> 00:06:36.380
And we now have a confusion matrix

100
00:06:36.380 --> 00:06:38.320
with zero false positives,

101
00:06:38.320 --> 00:06:40.543
which is exactly what we were looking for.
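
NOTE
A sketch comparing confusion matrices at thresholds of 0.5 and 0.9, assuming
scikit-learn's breast cancer data. The counts will depend on the split, so
they may not match the numbers in the video.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import binarize
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0
)
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
benign_probs = model.predict_proba(X_test)[:, 1].reshape(-1, 1)
default_pred = binarize(benign_probs, threshold=0.5).ravel().astype(int)
strict_pred = binarize(benign_probs, threshold=0.9).ravel().astype(int)
# Rows are actual labels and columns are predicted labels, so entry [0, 1]
# counts malignant tumors predicted as benign (the false positives).
cm_default = confusion_matrix(y_test, default_pred, labels=[0, 1])
cm_strict = confusion_matrix(y_test, strict_pred, labels=[0, 1])
print(cm_default)
print(cm_strict)
print(cm_default[0, 1], cm_strict[0, 1])
```
Raising the threshold can only remove positive predictions, so the false
positive count at 0.9 is never higher than at 0.5.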

102
00:06:44.150 --> 00:06:46.260
And I think to finish this guide up,

103
00:06:46.260 --> 00:06:49.110
we should make a second classification report

104
00:06:49.110 --> 00:06:51.703
but this time using the binarized data.

105
00:07:12.450 --> 00:07:14.700
And taking a look at that,

106
00:07:14.700 --> 00:07:17.010
the precision score of the benign class

107
00:07:17.010 --> 00:07:19.980
went from 0.95 to one.

108
00:07:19.980 --> 00:07:24.050
And even the accuracy jumped from 0.96 to 0.98,

109
00:07:24.050 --> 00:07:25.693
which is also great news.

110
00:07:26.570 --> 00:07:28.530
So overall, the modifications

111
00:07:28.530 --> 00:07:30.483
to the threshold really paid off.
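
NOTE
A sketch of the second classification report built from the binarized
predictions, assuming scikit-learn's breast cancer data. The scores depend
on the split and are not expected to reproduce the video's figures.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import binarize
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=0
)
model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
benign_probs = model.predict_proba(X_test)[:, 1].reshape(-1, 1)
# Predictions under the stricter 0.9 threshold instead of the default 0.5.
strict_pred = binarize(benign_probs, threshold=0.9).ravel().astype(int)
# zero_division=0 avoids a warning if a class ends up with no predictions.
report_strict = classification_report(
    y_test, strict_pred, target_names=data.target_names, zero_division=0
)
print(report_strict)
```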

112
00:07:32.160 --> 00:07:34.490
And I think that's about all we really needed

113
00:07:34.490 --> 00:07:36.050
to cover for this guide.

114
00:07:36.050 --> 00:07:39.563
So I will wrap this up and I will see you in the next guide.