How to Build a Spam Classifier

WEBVTT

1
00:00:03.010 --> 00:00:04.250
In this guide,

2
00:00:04.250 --> 00:00:05.610
we're gonna use what we've learned

3
00:00:05.610 --> 00:00:07.167
about the Naive Bayes classifier

4
00:00:07.167 --> 00:00:10.450
and extend that into how it can be used

5
00:00:10.450 --> 00:00:12.163
to create a spam filter.

6
00:00:13.520 --> 00:00:15.010
As a quick refresher,

7
00:00:15.010 --> 00:00:18.010
Naive Bayes is based on Bayes' rule,

8
00:00:18.010 --> 00:00:20.080
which gives us a really easy way

9
00:00:20.080 --> 00:00:22.530
to determine the posterior probability

10
00:00:22.530 --> 00:00:23.963
of an event occurring.

11
00:00:25.700 --> 00:00:28.930
In the last example, it was the probability

12
00:00:28.930 --> 00:00:31.230
of a family having twin girls,

13
00:00:31.230 --> 00:00:33.913
knowing that the first was already a girl.

14
00:00:35.090 --> 00:00:39.010
And when we expand that to more than one feature,

15
00:00:39.010 --> 00:00:41.720
the conditional probability of each feature

16
00:00:41.720 --> 00:00:44.420
is calculated and then the probabilities

17
00:00:44.420 --> 00:00:45.840
are multiplied together

18
00:00:45.840 --> 00:00:47.663
to get the final probability.

19
00:00:48.700 --> 00:00:50.510
Then the final probability

20
00:00:50.510 --> 00:00:53.310
is calculated for the other class as well

21
00:00:53.310 --> 00:00:56.350
and whichever has the greater probability

22
00:00:56.350 --> 00:00:59.393
determines what class the object belongs to.
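
As a quick sketch of that decision rule in symbols (this notation is mine, not from the video), with features w_1 through w_n and the two classes spam and ham:

    \hat{y} \;=\; \arg\max_{c \,\in\, \{\mathrm{spam},\ \mathrm{ham}\}} \; P(c)\,\prod_{i=1}^{n} P(w_i \mid c)

Whichever class gives the larger product is the predicted class.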

23
00:01:01.750 --> 00:01:03.926
In the example that we'll be working through next,

24
00:01:03.926 --> 00:01:06.270
the object is an email

25
00:01:06.270 --> 00:01:09.360
and the features are all of the unique words

26
00:01:09.360 --> 00:01:10.603
in that email.

27
00:01:11.600 --> 00:01:14.920
And what that means is that the posterior probability

28
00:01:14.920 --> 00:01:19.710
has to be calculated for every unique word in the email.

29
00:01:19.710 --> 00:01:22.833
And that determines if it's spam or not spam.

30
00:01:24.940 --> 00:01:27.410
And it might sound like a tall task

31
00:01:27.410 --> 00:01:30.010
but it's actually pretty simple

32
00:01:30.010 --> 00:01:31.923
when we get to use Sklearn.

33
00:01:34.490 --> 00:01:35.950
Just like before,

34
00:01:35.950 --> 00:01:39.173
we're gonna start off by importing everything that we need.

35
00:01:40.030 --> 00:01:42.540
And the only core library we're gonna need

36
00:01:42.540 --> 00:01:45.480
in this example is pandas

37
00:01:45.480 --> 00:01:48.821
but we'll also be using the train_test_split function,

38
00:01:48.821 --> 00:01:51.550
the multinomial Naive Bayes class,

39
00:01:51.550 --> 00:01:54.810
which we can use for text classification.

40
00:01:54.810 --> 00:01:57.330
We also have a CountVectorizer

41
00:01:57.330 --> 00:02:00.390
and a bunch of different metric functions

42
00:02:00.390 --> 00:02:02.670
and I'll explain what all of those do

43
00:02:02.670 --> 00:02:04.543
as we work through the example.
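
As a minimal sketch, the imports described above would look roughly like this (the module paths come from scikit-learn's current API and aren't shown in the video):

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score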

44
00:02:05.780 --> 00:02:07.640
The DataFrame itself is made up

45
00:02:07.640 --> 00:02:11.763
of just over 5,500 rows and two columns.

46
00:02:12.890 --> 00:02:16.370
The first column contains the class label,

47
00:02:16.370 --> 00:02:19.440
which is either going to be spam or ham,

48
00:02:19.440 --> 00:02:22.690
which is another way of saying not spam.

49
00:02:22.690 --> 00:02:26.783
Then the second column contains all of the actual emails.
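
A minimal loading sketch, assuming a CSV file named spam.csv with a 'label' column and a 'text' column (the actual file and column names aren't shown in the transcript):

    # Hypothetical file and column names -- adjust to match the real dataset
    df = pd.read_csv('spam.csv')
    df.shape   # just over 5,500 rows and 2 columns: 'label' ('ham'/'spam') and 'text' (the emails)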

50
00:02:28.900 --> 00:02:32.220
Our input x will contain every row

51
00:02:32.220 --> 00:02:33.720
of the last column,

52
00:02:33.720 --> 00:02:35.653
which contains all of the emails.

53
00:02:36.850 --> 00:02:39.410
And the output, which is y,

54
00:02:39.410 --> 00:02:41.440
contains all of the class labels

55
00:02:41.440 --> 00:02:44.523
that indicate if an email is spam or ham.

56
00:02:46.610 --> 00:02:48.500
Unlike some of our other examples,

57
00:02:48.500 --> 00:02:50.680
we don't need to reshape the data

58
00:02:50.680 --> 00:02:52.333
to create a column vector.

59
00:02:53.310 --> 00:02:56.230
Instead, I just added the values attribute

60
00:02:56.230 --> 00:02:57.970
to remove the column labels

61
00:02:57.970 --> 00:02:59.943
and only return the values.

62
00:03:00.950 --> 00:03:02.781
Notice that when I add the values attribute

63
00:03:02.781 --> 00:03:06.140
and run it again, the object type changes

64
00:03:06.140 --> 00:03:09.623
from a pandas series to a NumPy array.
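
A sketch of what that looks like, again assuming the columns are named 'label' and 'text':

    X = df['text'].values    # every email, as a NumPy array
    y = df['label'].values   # 'spam' or 'ham' for each email

    type(df['text'])   # pandas.core.series.Series
    type(X)            # numpy.ndarray -- .values strips the labels and returns just the values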

65
00:03:15.310 --> 00:03:17.530
Next, let's move into the console

66
00:03:17.530 --> 00:03:20.963
and do a quick overview of our class labels.

67
00:03:22.360 --> 00:03:26.360
First, let's pass in df followed by square brackets

68
00:03:26.360 --> 00:03:27.383
and the column name.

69
00:03:29.050 --> 00:03:31.163
Then .unique.

70
00:03:33.700 --> 00:03:35.640
And what we get back is an array

71
00:03:35.640 --> 00:03:38.500
with all of the unique values in the column

72
00:03:38.500 --> 00:03:40.590
and there aren't any nan values,

73
00:03:40.590 --> 00:03:42.323
so that's a good thing.

74
00:03:44.470 --> 00:03:46.963
Now, let's pass in the same thing again.

75
00:03:49.860 --> 00:03:52.423
But this time, we'll finish up with value_counts.

76
00:03:58.300 --> 00:03:59.910
And this time, we get back a series

77
00:03:59.910 --> 00:04:03.093
with the total number of ham and spam entries.
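
In the console, those two checks would look something like this (assuming a column named 'label'):

    df['label'].unique()        # array(['ham', 'spam'], dtype=object) -- no NaN values
    df['label'].value_counts()  # ham shows up roughly six to seven times as often as spam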

78
00:04:04.660 --> 00:04:06.540
Without doing the exact math,

79
00:04:06.540 --> 00:04:08.430
it looks like there's about six

80
00:04:08.430 --> 00:04:10.870
to seven times as many ham values

81
00:04:10.870 --> 00:04:12.673
as there are spam values.

82
00:04:13.800 --> 00:04:15.460
That'll be something that we can talk about

83
00:04:15.460 --> 00:04:16.593
later on as well.

84
00:04:18.020 --> 00:04:19.470
But for now, let's move on

85
00:04:19.470 --> 00:04:21.573
with building the classification model.

86
00:04:23.940 --> 00:04:25.830
In the first Naive Bayes example

87
00:04:25.830 --> 00:04:27.070
that we went through,

88
00:04:27.070 --> 00:04:30.730
we started out by creating the classifier object.

89
00:04:30.730 --> 00:04:33.330
But this time, it's gonna be a little bit different.

90
00:04:34.270 --> 00:04:37.100
And that's because before we can generate a model,

91
00:04:37.100 --> 00:04:39.100
we need to go through all of the emails

92
00:04:39.100 --> 00:04:41.000
to figure out the probability

93
00:04:41.000 --> 00:04:44.973
of every word belonging to spam or ham emails.

94
00:04:46.770 --> 00:04:49.040
To do that, we're going to leverage

95
00:04:49.040 --> 00:04:51.310
the CountVectorizer class.

96
00:04:51.310 --> 00:04:54.690
So we'll start by creating a vectorizer variable

97
00:04:55.610 --> 00:04:58.763
and assigning that to the CountVectorizer class.

98
00:05:03.460 --> 00:05:07.130
Then we'll create a second variable named counts,

99
00:05:07.130 --> 00:05:10.140
which will be assigned to the fit_transform method

100
00:05:10.140 --> 00:05:11.973
of the CountVectorizer class.
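
A sketch of that step, including the train/test split the later narration refers to (variable names here are assumptions):

    X_train, X_test, y_train, y_test = train_test_split(X, y)

    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(X_train)   # sparse matrix of word counts for the training emails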

101
00:05:19.960 --> 00:05:23.550
This is unlike anything we've used so far in the course

102
00:05:23.550 --> 00:05:26.410
but what the fit_transform method is doing

103
00:05:26.410 --> 00:05:28.173
is tokenizing the data.

104
00:05:29.490 --> 00:05:31.100
And what tokenizing means

105
00:05:31.100 --> 00:05:33.260
is that all of the meaningful data,

106
00:05:33.260 --> 00:05:35.140
which are the words in an email,

107
00:05:35.140 --> 00:05:37.410
are being replaced by a number,

108
00:05:37.410 --> 00:05:39.683
which is commonly referred to as a token.

109
00:05:41.410 --> 00:05:43.370
So when we use the CountVectorizer

110
00:05:43.370 --> 00:05:45.630
to fit and transform the data,

111
00:05:45.630 --> 00:05:47.550
it goes through each email,

112
00:05:47.550 --> 00:05:51.130
counting how many times each word was used.

113
00:05:51.130 --> 00:05:53.520
Then it iterates back through the dataset

114
00:05:53.520 --> 00:05:55.170
and converts each email into a vector of counts,

115
00:05:55.170 --> 00:05:57.513
with one entry for each word in the vocabulary.

116
00:05:59.640 --> 00:06:01.860
And once we have that word count,

117
00:06:01.860 --> 00:06:04.733
that's what we'll be using to calculate probabilities.

118
00:06:06.390 --> 00:06:07.910
So now we can go ahead

119
00:06:07.910 --> 00:06:10.440
and create our classifier object

120
00:06:10.440 --> 00:06:14.233
and assign that to the multinomial Naive Bayes class.

121
00:06:18.700 --> 00:06:21.760
The next step is to create the classification model

122
00:06:21.760 --> 00:06:23.093
by using the fit method.

123
00:06:27.370 --> 00:06:29.000
And this is where we wanna pass

124
00:06:29.000 --> 00:06:31.220
in the transformed training set

125
00:06:31.220 --> 00:06:32.623
that we just created.

126
00:06:33.530 --> 00:06:36.443
And we'll follow that up with the output training set.
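
Putting those two steps together, roughly:

    classifier = MultinomialNB()
    classifier.fit(counts, y_train)   # the transformed training emails plus their labels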

127
00:06:39.720 --> 00:06:41.880
Okay, now we are done building

128
00:06:41.880 --> 00:06:43.783
the actual classification model.

129
00:06:44.660 --> 00:06:47.163
But we still need a way of evaluating it.

130
00:06:48.120 --> 00:06:50.040
So the first thing that we wanna do

131
00:06:50.040 --> 00:06:52.280
is tokenize the test set

132
00:06:52.280 --> 00:06:55.010
and we can do that using the transform method

133
00:06:55.010 --> 00:06:56.633
from the vectorizer class.

134
00:07:09.410 --> 00:07:12.330
Then we'll finish by creating a prediction object

135
00:07:12.330 --> 00:07:14.580
that will make class predictions based

136
00:07:14.580 --> 00:07:16.313
on the transformed test set.
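
A sketch of the evaluation setup (variable names assumed):

    counts_test = vectorizer.transform(X_test)     # reuse the vocabulary learned from the training set
    predictions = classifier.predict(counts_test)  # predicted 'ham'/'spam' label for each test email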

137
00:07:32.420 --> 00:07:35.780
Just like with regression, Sklearn gives us a bunch

138
00:07:35.780 --> 00:07:39.053
of different tools we can use for model evaluation.

139
00:07:41.500 --> 00:07:43.540
The first option that we'll go through

140
00:07:43.540 --> 00:07:46.690
uses the accuracy_score function

141
00:07:46.690 --> 00:07:48.960
and this will return the proportion

142
00:07:48.960 --> 00:07:51.970
of correct predictions made by the model

143
00:07:51.970 --> 00:07:54.203
from the test set that we passed in.
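
For example:

    accuracy_score(y_test, predictions)   # proportion of test emails classified correctly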

144
00:08:02.020 --> 00:08:03.650
The next evaluation method

145
00:08:03.650 --> 00:08:05.211
that I'd like us to talk about

146
00:08:05.211 --> 00:08:07.683
is called the confusion_matrix.

147
00:08:21.840 --> 00:08:25.040
So after we pass in the test labels and the predictions,

148
00:08:25.040 --> 00:08:26.853
let's run it again.

149
00:08:34.340 --> 00:08:37.543
And I'll walk you through what this two-by-two matrix means.

150
00:08:39.220 --> 00:08:40.980
Depending how your data was split

151
00:08:40.980 --> 00:08:43.260
during the train/test split step,

152
00:08:43.260 --> 00:08:46.900
your confusion matrix will probably have different values.

153
00:08:46.900 --> 00:08:49.720
But it's all gonna mean the same thing.

154
00:08:49.720 --> 00:08:52.420
So what the confusion matrix does

155
00:08:52.420 --> 00:08:56.373
is split the prediction results into four main categories.

156
00:08:58.110 --> 00:08:59.980
The element in the first row

157
00:08:59.980 --> 00:09:02.520
of the first column indicates the number

158
00:09:02.520 --> 00:09:04.730
of ham predictions the model made

159
00:09:04.730 --> 00:09:07.303
that were correctly classified as ham.

160
00:09:08.670 --> 00:09:11.930
The value in the first column, second row indicates

161
00:09:11.930 --> 00:09:14.090
how many ham predictions were made

162
00:09:14.090 --> 00:09:15.723
that were actually spam.

163
00:09:18.790 --> 00:09:20.780
Then over in the second column,

164
00:09:20.780 --> 00:09:23.690
the first row indicates the number of spam predictions

165
00:09:23.690 --> 00:09:25.590
that were actually ham

166
00:09:25.590 --> 00:09:27.400
and the second row is the number

167
00:09:27.400 --> 00:09:29.943
of spam predictions that were correctly predicted.
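
A sketch of the call and how scikit-learn lays the matrix out (rows are the true labels, columns are the predicted labels, sorted alphabetically):

    confusion_matrix(y_test, predictions)
    # [[ham correctly predicted as ham,   ham wrongly predicted as spam],
    #  [spam wrongly predicted as ham,    spam correctly predicted as spam]]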

168
00:09:32.050 --> 00:09:34.608
A confusion matrix can be a good tool to use

169
00:09:34.608 --> 00:09:36.652
when one type of incorrect prediction

170
00:09:36.652 --> 00:09:39.303
matters far more than the other.

171
00:09:40.870 --> 00:09:44.680
Say, for example, instead of this being a spam filter,

172
00:09:44.680 --> 00:09:47.173
it was classifying possible tumors.

173
00:09:48.260 --> 00:09:50.050
Incorrectly classifying someone

174
00:09:50.050 --> 00:09:53.080
with a tumor into the non-tumor group

175
00:09:53.080 --> 00:09:55.448
has much more severe consequences

176
00:09:55.448 --> 00:09:57.038
than if a non-tumor patient

177
00:09:57.038 --> 00:10:00.403
was incorrectly classified as having a tumor.

178
00:10:01.450 --> 00:10:02.810
Because at that point,

179
00:10:02.810 --> 00:10:05.210
they can always go in for further testing

180
00:10:05.210 --> 00:10:07.070
where it will eventually be determined

181
00:10:07.070 --> 00:10:09.460
it was a misdiagnosis.

182
00:10:09.460 --> 00:10:13.200
But obviously, if someone is told they don't have a tumor,

183
00:10:13.200 --> 00:10:14.850
and they actually do,

184
00:10:14.850 --> 00:10:18.393
the outcome has the potential to be much more costly.

185
00:10:20.410 --> 00:10:22.490
So moving on to the next method,

186
00:10:22.490 --> 00:10:24.720
we have the f1 score

187
00:10:24.720 --> 00:10:27.270
and this combines precision and recall into one score

188
00:10:27.270 --> 00:10:29.253
for a binary classification model.

189
00:10:38.160 --> 00:10:40.823
And let's see what this error message says.

190
00:10:42.340 --> 00:10:44.260
Oh, so for this, we're gonna need

191
00:10:44.260 --> 00:10:46.220
to declare what class label

192
00:10:46.220 --> 00:10:48.410
is the positive result

193
00:10:48.410 --> 00:10:50.140
and since we're checking to see

194
00:10:50.140 --> 00:10:52.040
if an email is spam,

195
00:10:52.040 --> 00:10:54.090
that's what will yield a positive result.
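
So the corrected call looks roughly like this:

    f1_score(y_test, predictions, pos_label='spam')   # 'spam' is treated as the positive class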

196
00:10:56.260 --> 00:10:57.970
So let's run this again

197
00:10:57.970 --> 00:11:00.223
and now we have our F1 score.

198
00:11:02.360 --> 00:11:04.060
The last method that we'll be covering

199
00:11:04.060 --> 00:11:05.920
is the precision_score,

200
00:11:05.920 --> 00:11:09.453
which tells us what fraction of the emails predicted as spam really were spam.

201
00:11:17.070 --> 00:11:19.133
And it looks like I forgot the pos_label argument again.

202
00:11:25.150 --> 00:11:27.003
Now let's go ahead and run it.
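
With the pos_label argument included, the call is roughly:

    precision_score(y_test, predictions, pos_label='spam')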

203
00:11:32.630 --> 00:11:34.170
And it looks like we have a score

204
00:11:34.170 --> 00:11:38.063
of just under .99, which is a really good score.

205
00:11:39.500 --> 00:11:41.010
To finally wrap this guide up,

206
00:11:41.010 --> 00:11:44.150
let's go ahead and make a couple new emails

207
00:11:44.150 --> 00:11:46.163
to see how the model responds.

208
00:11:50.910 --> 00:11:54.080
The first will be for Free Viagra!!!

209
00:12:00.750 --> 00:12:03.203
And we'll make the second message from a friend.

210
00:12:07.600 --> 00:12:10.270
Before we can use the model to make a prediction,

211
00:12:10.270 --> 00:12:12.520
we're gonna have to tokenize both new_emails.

212
00:12:26.490 --> 00:12:28.903
Then we'll apply that to the predict function.

213
00:12:44.170 --> 00:12:45.470
And now we can go ahead

214
00:12:45.470 --> 00:12:47.943
and check the new email class predictions.
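
A sketch of those last steps (the second message body is a placeholder; the video only says it's from a friend):

    new_emails = ['Free Viagra!!!', 'Hey, are we still on for lunch tomorrow?']
    new_counts = vectorizer.transform(new_emails)   # tokenize with the same vocabulary
    classifier.predict(new_counts)                  # ideally 'spam' for the first and 'ham' for the second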

215
00:12:54.660 --> 00:12:57.323
And it looks like the model did a really good job.

216
00:12:59.670 --> 00:13:02.580
Well, that finally brings us to the end of this guide.

217
00:13:02.580 --> 00:13:04.333
So I will see you in the next one.