1

00:00:00,000 --> 00:00:00,580

2
00:00:00,580 --> 00:00:03,270
ANNOUNCER: The following program
is brought to you by Caltech.

3
00:00:03,270 --> 00:00:16,250

4
00:00:16,250 --> 00:00:19,950
YASER ABU-MOSTAFA: Welcome to machine
learning, and welcome to our online

5
00:00:19,950 --> 00:00:21,650
audience as well.

6
00:00:21,650 --> 00:00:25,830
Let me start with an outline of the
course, and then go into the material

7
00:00:25,830 --> 00:00:28,070
of today's lecture.

8
00:00:28,070 --> 00:00:34,540
As you see from the outline, the topics
are given colors, and that

9
00:00:34,540 --> 00:00:38,860
designates their main content, whether
it's mathematical or practical.

10
00:00:38,860 --> 00:00:41,960
Machine learning is
a very broad subject.

11
00:00:41,960 --> 00:00:47,300
It goes from very abstract theory to
extreme practice as in rules of thumb.

12
00:00:47,300 --> 00:00:52,350
And the inclusion of a topic in the
course depends on the relevance to

13
00:00:52,350 --> 00:00:53,370
machine learning.

14
00:00:53,370 --> 00:00:57,720
So some mathematics is useful because it
gives you the conceptual framework,

15
00:00:57,720 --> 00:01:02,320
and then some practical aspects are
useful because they give you the way

16
00:01:02,320 --> 00:01:05,080
to deal with real learning systems.

17
00:01:05,080 --> 00:01:09,370
Now if you look at the topics, these
are not meant to be separate topics

18
00:01:09,370 --> 00:01:10,560
for each lecture.

19
00:01:10,560 --> 00:01:13,630
They just highlight the main
content of those lectures.

20
00:01:13,630 --> 00:01:17,740
But there is a story line that goes
through it, and let me tell you what

21
00:01:17,740 --> 00:01:20,380
the story line is like.

22
00:01:20,380 --> 00:01:23,530
It starts here with: what is learning?

23
00:01:23,530 --> 00:01:26,500

24
00:01:26,500 --> 00:01:27,750
Can we learn?

25
00:01:27,750 --> 00:01:29,940

26
00:01:29,940 --> 00:01:31,190
How to do it?

27
00:01:31,190 --> 00:01:33,660

28
00:01:33,660 --> 00:01:37,250
How to do it well?

29
00:01:37,250 --> 00:01:40,410
And then the take-home lessons.

30
00:01:40,410 --> 00:01:44,420
There is a logical dependency that goes
through the course, and there's

31
00:01:44,420 --> 00:01:48,390
one exception to that
logical dependency.

32
00:01:48,390 --> 00:01:53,240
One lecture, which is the third one,
doesn't really belong here.

33
00:01:53,240 --> 00:01:57,750
It's a practical topic, and the reason
I included it early on is because I

34
00:01:57,750 --> 00:02:01,740
needed to give you some tools to play
around with, to test the

35
00:02:01,740 --> 00:02:03,990
theoretical and conceptual aspects.

36
00:02:03,990 --> 00:02:09,430
If I waited until where it normally
belongs, which is the second part of the

37
00:02:09,430 --> 00:02:15,620
linear models which is down there, the
beginning of the course would be

38
00:02:15,620 --> 00:02:18,730
just too theoretical
for people's taste.

39
00:02:18,730 --> 00:02:22,320
And as you see, if you look at the
colors, it is mostly red in the

40
00:02:22,320 --> 00:02:25,080
beginning and mostly blue in the end.

41
00:02:25,080 --> 00:02:28,430
So it starts building the
concepts and the theory.

42
00:02:28,430 --> 00:02:32,100
And then it goes on to the
practical aspects.

43
00:02:32,100 --> 00:02:35,970
Now, let me start today's lecture.

44
00:02:35,970 --> 00:02:39,790
And the subject of the lecture
is the learning problem.

45
00:02:39,790 --> 00:02:41,890
It's an introduction to
what learning is.

46
00:02:41,890 --> 00:02:45,230
And I will draw your attention
to one aspect of this slide,

47
00:02:45,230 --> 00:02:48,450
which is this part.

48
00:02:48,450 --> 00:02:50,590
That's the logo of the course.

49
00:02:50,590 --> 00:02:53,850
And believe it or not,
this is not artwork.

50
00:02:53,850 --> 00:02:56,720
This is actually a technical
figure that will come up

51
00:02:56,720 --> 00:02:57,670
in one of the lectures.

52
00:02:57,670 --> 00:02:59,200
I'm not going to tell you which one.

53
00:02:59,200 --> 00:03:03,100
So you can wait in anticipation until it
comes up, but this will actually be

54
00:03:03,100 --> 00:03:06,420
a scientific figure that
we will talk about.

55
00:03:06,420 --> 00:03:11,930
Now when we move to today's
lecture, I'm going to talk

56
00:03:11,930 --> 00:03:14,660
today about the following.

57
00:03:14,660 --> 00:03:18,590
Machine learning is a very broad
subject, and I'm going to start with

58
00:03:18,590 --> 00:03:22,100
one example that captures the
essence of machine learning.

59
00:03:22,100 --> 00:03:26,480
It's a fun example about movies
that everybody watches.

60
00:03:26,480 --> 00:03:30,870
And then after that, I'm going to
abstract from the learning problem,

61
00:03:30,870 --> 00:03:35,690
the practical learning problem,
aspects that are common to all

62
00:03:35,690 --> 00:03:38,540
learning situations that
you're going to face.

63
00:03:38,540 --> 00:03:42,030
And in abstracting them, we'll have the
mathematical formalization of the

64
00:03:42,030 --> 00:03:43,310
learning problem.

65
00:03:43,310 --> 00:03:47,530
And then we will get our first algorithm
for machine learning today.

66
00:03:47,530 --> 00:03:51,210
It's a very simple algorithm, but it
will fix the idea about what is the

67
00:03:51,210 --> 00:03:53,060
role of an algorithm in this case.

68
00:03:53,060 --> 00:03:56,420
And we will survey the types of learning,
so that we know which part we

69
00:03:56,420 --> 00:04:01,400
are emphasizing in this course,
and which parts are nearby.

70
00:04:01,400 --> 00:04:04,775
And I will end up with a puzzle, a very
interesting puzzle, and it's

71
00:04:04,775 --> 00:04:08,390
a puzzle in more ways than
one, as you will see.

72
00:04:08,390 --> 00:04:11,040
OK, so let me start with an example.

73
00:04:11,040 --> 00:04:14,040

74
00:04:14,040 --> 00:04:18,410
The example of machine learning that
I'm going to start with is how

75
00:04:18,410 --> 00:04:22,290
a viewer would rate a movie.

76
00:04:22,290 --> 00:04:26,195
Now that is an interesting problem, and
it's interesting for us because we

77
00:04:26,195 --> 00:04:30,020
watch movies, and very interesting for
a company that rents out movies.

78
00:04:30,020 --> 00:04:37,070
And indeed, a company which is Netflix
wanted to improve the in-house system

79
00:04:37,070 --> 00:04:40,040
by a mere 10%.

80
00:04:40,040 --> 00:04:43,450
So they make recommendations when you
log in, they recommend movies that

81
00:04:43,450 --> 00:04:46,650
they think you will like, so they think
that you'll rate them highly.

82
00:04:46,650 --> 00:04:50,230
And they had a system, and they
wanted to improve the system.

83
00:04:50,230 --> 00:04:56,610
So how much is a 10% improvement in
performance worth to the company?

84
00:04:56,610 --> 00:05:03,480
It was actually $1 million that was
paid out to the first group that

85
00:05:03,480 --> 00:05:06,780
actually managed to get
the 10% improvement.

86
00:05:06,780 --> 00:05:10,780
So you ask yourself, 10% improvement
in something like that, why should

87
00:05:10,780 --> 00:05:13,670
that be worth a million dollars?

88
00:05:13,670 --> 00:05:19,890
It's because, if the recommendations
that the movie company makes are spot

89
00:05:19,890 --> 00:05:25,150
on, you will pay more attention to the
recommendation, you are likely to rent

90
00:05:25,150 --> 00:05:27,830
the movies that they recommend, and they
will make lots of money-- much

91
00:05:27,830 --> 00:05:30,130
more than the million dollars
they promised.

92
00:05:30,130 --> 00:05:31,910
And this is very typical
in machine learning.

93
00:05:31,910 --> 00:05:35,290
For example, machine learning has
applications in financial forecasting.

94
00:05:35,290 --> 00:05:38,970
You can imagine that the minutest
improvement in financial forecasting

95
00:05:38,970 --> 00:05:40,810
can make a lot of money.

96
00:05:40,810 --> 00:05:45,410
So the fact that you can actually push
the system to be better using machine

97
00:05:45,410 --> 00:05:50,800
learning is a very attractive aspect of
the technique in a wide spectrum of

98
00:05:50,800 --> 00:05:52,480
applications.

99
00:05:52,480 --> 00:05:53,985
So what did these guys do?

100
00:05:53,985 --> 00:05:57,320

101
00:05:57,320 --> 00:06:04,260
They gave out the data, and people started
working on the problem using different

102
00:06:04,260 --> 00:06:08,810
algorithms, until someone managed
to get the prize.

103
00:06:08,810 --> 00:06:12,530
Now if you look at the problem of
rating a movie, it captures the

104
00:06:12,530 --> 00:06:16,460
essence of machine learning, and the
essence has three components.

105
00:06:16,460 --> 00:06:20,320
If you find these three components in
a problem you have in your field, then

106
00:06:20,320 --> 00:06:24,300
you know that machine learning is
ready as an application tool.

107
00:06:24,300 --> 00:06:25,850
What are the three?

108
00:06:25,850 --> 00:06:29,860
The first one is that
a pattern exists.

109
00:06:29,860 --> 00:06:33,810
If a pattern didn't exist, there
would be nothing to look for.

110
00:06:33,810 --> 00:06:35,900
So what is the pattern here?

111
00:06:35,900 --> 00:06:41,990
There is no question that the way
a person rates a movie is related to how

112
00:06:41,990 --> 00:06:46,200
they rated other movies, and is
also related to how other

113
00:06:46,200 --> 00:06:48,870
people rated that movie.

114
00:06:48,870 --> 00:06:50,250
We know that much.

115
00:06:50,250 --> 00:06:52,670
So there is a pattern
to be discovered.

116
00:06:52,670 --> 00:06:57,710
However, we cannot really pin
it down mathematically.

117
00:06:57,710 --> 00:07:01,740
I cannot ask you to write a 17th-order
polynomial that captures how people

118
00:07:01,740 --> 00:07:03,790
rate movies.

119
00:07:03,790 --> 00:07:06,880
So the fact that there is a pattern,
and that we cannot pin it down

120
00:07:06,880 --> 00:07:10,800
mathematically, is the reason why we
are going for machine learning.

121
00:07:10,800 --> 00:07:12,520
For "learning from data".

122
00:07:12,520 --> 00:07:15,930
We couldn't write down the system on our
own, so we're going to depend on

123
00:07:15,930 --> 00:07:18,930
data in order to be able
to find the system.

124
00:07:18,930 --> 00:07:21,080
There is a missing component
which is very important.

125
00:07:21,080 --> 00:07:24,970
If you don't have that,
you are out of luck.

126
00:07:24,970 --> 00:07:28,310
We have to have data. We
are learning from data.

127
00:07:28,310 --> 00:07:31,030
So if someone knocks on my door with
an interesting machine learning

128
00:07:31,030 --> 00:07:33,670
application, and they tell me how
exciting it is, and how great the

129
00:07:33,670 --> 00:07:37,200
application would be, and how much
money they would make, the first

130
00:07:37,200 --> 00:07:41,150
question I ask, what data do you have?

131
00:07:41,150 --> 00:07:43,120
If you have data, we are in business.

132
00:07:43,120 --> 00:07:45,880
If you don't, you are out of luck.

133
00:07:45,880 --> 00:07:48,510
If you have these three components,
you are ready to

134
00:07:48,510 --> 00:07:50,970
apply machine learning.

135
00:07:50,970 --> 00:07:51,750

136
00:07:51,750 --> 00:07:56,030
Now let me give you a solution to the
movie rating, in order to start

137
00:07:56,030 --> 00:07:57,220
getting a feel for it.

138
00:07:57,220 --> 00:07:59,160
So here is a system.

139
00:07:59,160 --> 00:08:01,630
Let me start to focus on part of it.

140
00:08:01,630 --> 00:08:07,230
We are going to describe a viewer
as a vector of factors, a profile if

141
00:08:07,230 --> 00:08:08,630
you will.

142
00:08:08,630 --> 00:08:16,320
So if you look here for example, the
first one would be comedy content.

143
00:08:16,320 --> 00:08:17,360

144
00:08:17,360 --> 00:08:18,660
Does the movie have a lot of comedy?

145
00:08:18,660 --> 00:08:21,600

146
00:08:21,600 --> 00:08:25,020
From a viewer point of view,
do they like comedies?

147
00:08:25,020 --> 00:08:27,800
Here, do they like action?

148
00:08:27,800 --> 00:08:31,580
Do they prefer blockbusters, or
do they like fringe movies?

149
00:08:31,580 --> 00:08:36,210
And you can go on all the way, even to
asking yourself whether you like the

150
00:08:36,210 --> 00:08:38,250
lead actor or not.

151
00:08:38,250 --> 00:08:42,909
Now you go to the content of the
movie itself, and you get the

152
00:08:42,909 --> 00:08:44,580
corresponding part.

153
00:08:44,580 --> 00:08:46,750
Does the movie have comedy?

154
00:08:46,750 --> 00:08:48,010
Does it have action?

155
00:08:48,010 --> 00:08:49,160
Is it a blockbuster?

156
00:08:49,160 --> 00:08:50,620
And so on.

157
00:08:50,620 --> 00:08:56,310
Now you compare the two, and you realize
that if there is a match--

158
00:08:56,310 --> 00:08:59,950
let's say you hate comedy and the
movie has a lot of comedy--

159
00:08:59,950 --> 00:09:02,090
then the chances are you're
not going to like it.

160
00:09:02,090 --> 00:09:02,770

161
00:09:02,770 --> 00:09:06,670
But if there is a match between so many
coordinates, and the

162
00:09:06,670 --> 00:09:10,730
number of factors here could be
really like 300 factors.

163
00:09:10,730 --> 00:09:12,590
Then the chances are you'll
like the movie.

164
00:09:12,590 --> 00:09:15,220
And if there's a mismatch, the
chances are you're not

165
00:09:15,220 --> 00:09:16,570
going to like the movie.

166
00:09:16,570 --> 00:09:17,670
So what do you do,

167
00:09:17,670 --> 00:09:17,680

168
00:09:17,680 --> 00:09:21,900
you match the movie and the viewer
factors, and then you add the

169
00:09:21,900 --> 00:09:23,240
contributions of them.

170
00:09:23,240 --> 00:09:26,950
And then as a result of that, you
get the predicted rating.

171
00:09:26,950 --> 00:09:28,370

172
00:09:28,370 --> 00:09:34,100
This is all good except for one problem,
which is this is really not

173
00:09:34,100 --> 00:09:35,950
machine learning.

174
00:09:35,950 --> 00:09:40,190
In order to produce this thing, you have
to watch the movie, and analyze

175
00:09:40,190 --> 00:09:41,300
the content.

176
00:09:41,300 --> 00:09:46,180
You have to interview the viewer,
and ask about their taste.

177
00:09:46,180 --> 00:09:49,020
And then after that, you combine
them and try to get

178
00:09:49,020 --> 00:09:51,000
a prediction for the rating.

179
00:09:51,000 --> 00:09:52,130

180
00:09:52,130 --> 00:09:55,590
Now the idea of machine learning is that
you don't have to do any of that.

181
00:09:55,590 --> 00:09:59,550
All you do is sit down and sip on your
tea, while the machine is doing

182
00:09:59,550 --> 00:10:03,050
something to come up with
this figure on its own.

183
00:10:03,050 --> 00:10:03,570

184
00:10:03,570 --> 00:10:06,460
So let's look at the
learning approach.

185
00:10:06,460 --> 00:10:12,390
So in the learning approach, we know
that the viewer will be a vector of

186
00:10:12,390 --> 00:10:16,150
different factors, and different
components for every factor.

187
00:10:16,150 --> 00:10:20,680
So this vector will be different
from one viewer to another.

188
00:10:20,680 --> 00:10:21,050

189
00:10:21,050 --> 00:10:25,500
For example, one viewer will have a big
blue content here, and one of them

190
00:10:25,500 --> 00:10:28,510
will have a small blue content,
depending on their taste.

191
00:10:28,510 --> 00:10:28,940

192
00:10:28,940 --> 00:10:32,140
And then, there is the movie.

193
00:10:32,140 --> 00:10:37,220
And a particular movie will have different
contents that correspond to those.

194
00:10:37,220 --> 00:10:42,580
And the way we said we are computing the
rating, is by simply taking these

195
00:10:42,580 --> 00:10:45,830
and combining them and
getting the rating.

196
00:10:45,830 --> 00:10:51,830
Now what machine learning will do is
reverse-engineer that process.

197
00:10:51,830 --> 00:10:54,310

198
00:10:54,310 --> 00:10:59,380
It starts from the rating, and then
tries to find out what factors would be

199
00:10:59,380 --> 00:11:01,800
consistent with that rating.

200
00:11:01,800 --> 00:11:03,120
So think of it this way.

201
00:11:03,120 --> 00:11:07,690
You start, let's say, with
completely random factors.

202
00:11:07,690 --> 00:11:12,260
So you take these guys, just random
numbers from beginning to end, and

203
00:11:12,260 --> 00:11:14,290
these guys, random numbers
from beginning to end.

204
00:11:14,290 --> 00:11:17,620
For every user and every movie,
that's your starting point.

205
00:11:17,620 --> 00:11:22,810
Obviously, there is no chance in the
world that when you get the inner

206
00:11:22,810 --> 00:11:26,120
product between these factors that are
random, you'll get anything that

207
00:11:26,120 --> 00:11:30,120
looks like the rating that actually
took place, right?

208
00:11:30,120 --> 00:11:34,430
But what you do is you take a rating
that actually happened, and then you

209
00:11:34,430 --> 00:11:40,010
start nudging the factors ever so
slightly toward that rating.

210
00:11:40,010 --> 00:11:45,540
Make the value of the inner product
get closer to the rating.

211
00:11:45,540 --> 00:11:49,500
Now it looks like a hopeless thing. I
start with so many factors, they are

212
00:11:49,500 --> 00:11:51,780
all random, and I'm trying to
make them match a rating.

213
00:11:51,780 --> 00:11:53,330
What are the chances?

214
00:11:53,330 --> 00:11:57,070
Well the point is that you are going to
do this not for one rating, but for

215
00:11:57,070 --> 00:11:59,050
a hundred million ratings.

216
00:11:59,050 --> 00:12:01,610
And you keep cycling through
the 100 million, over

217
00:12:01,610 --> 00:12:03,740
and over and over.

218
00:12:03,740 --> 00:12:08,180
And eventually, lo and behold, you
find that the factors now are

219
00:12:08,180 --> 00:12:10,420
meaningful in terms of the ratings.

220
00:12:10,420 --> 00:12:17,380
And if you get a user, a viewer here,
that didn't watch a movie, and you get

221
00:12:17,380 --> 00:12:21,320
the vector that resulted from that
learning process, and you get the

222
00:12:21,320 --> 00:12:25,470
movie vector that resulted from that
process, and you do the inner product,

223
00:12:25,470 --> 00:12:29,640
lo and behold, you get a rating which
is actually consistent with how that

224
00:12:29,640 --> 00:12:31,770
viewer rates the movie.

225
00:12:31,770 --> 00:12:33,480
That's the idea.
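
A minimal sketch of that nudging process in Python with NumPy, for concreteness. It assumes a squared-error measure of how far the inner product is from an observed rating; the factor count, learning rate, and array sizes are illustrative choices, not details from the lecture.

    import numpy as np

    rng = np.random.default_rng(0)
    n_viewers, n_movies, n_factors = 1000, 500, 20   # illustrative sizes

    # Start with completely random factors for every viewer and every movie.
    viewers = rng.normal(scale=0.1, size=(n_viewers, n_factors))
    movies = rng.normal(scale=0.1, size=(n_movies, n_factors))

    def nudge(u, m, rating, lr=0.01):
        """Nudge the factors of viewer u and movie m ever so slightly,
        so that their inner product gets closer to the observed rating."""
        error = rating - viewers[u] @ movies[m]   # how far off the prediction is
        v_old = viewers[u].copy()
        viewers[u] += lr * error * movies[m]
        movies[m] += lr * error * v_old

    # Cycle through the observed (viewer, movie, rating) triples over and over:
    #     for u, m, r in observed_ratings:   # e.g., 100 million triples
    #         nudge(u, m, r)
    # Predicting an unseen pair is then just the inner product:
    #     predicted_rating = viewers[u] @ movies[m]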

226
00:12:33,480 --> 00:12:35,380

227
00:12:35,380 --> 00:12:41,050
Now this actually, the solution I
described, is one of the winning

228
00:12:41,050 --> 00:12:43,220
solutions in the competition
that I mentioned.

229
00:12:43,220 --> 00:12:47,170
So this is for real, this
actually can be used.

230
00:12:47,170 --> 00:12:51,440
Now with this example in mind,
let's actually go to the

231
00:12:51,440 --> 00:12:52,610
components of learning.

232
00:12:52,610 --> 00:12:56,700
So now I would like to abstract from the
learning problems that I see, what

233
00:12:56,700 --> 00:13:00,280
are the mathematical components that
make up the learning problem?

234
00:13:00,280 --> 00:13:01,910
And I'm going to use a metaphor.

235
00:13:01,910 --> 00:13:05,540
I'm going to use a metaphor now from
another application domain, which

236
00:13:05,540 --> 00:13:07,280
is a financial application.

237
00:13:07,280 --> 00:13:11,900
So the metaphor we are going to
use is credit approval.

238
00:13:11,900 --> 00:13:15,600
You apply for a credit card, and the
bank wants to decide whether it's

239
00:13:15,600 --> 00:13:18,010
a good idea to extend a credit
card for you or not.

240
00:13:18,010 --> 00:13:19,960
From the bank's point of view,
if they're going to make

241
00:13:19,960 --> 00:13:20,800
money, they are happy.

242
00:13:20,800 --> 00:13:22,490
If they are going to lose money,
they are not happy.

243
00:13:22,490 --> 00:13:24,590
That's the only criterion they have.

244
00:13:24,590 --> 00:13:29,000
Now, very much like we didn't have
a magic formula for deciding how

245
00:13:29,000 --> 00:13:32,660
a viewer will rate a movie, the bank
doesn't have a magic formula for

246
00:13:32,660 --> 00:13:36,230
deciding whether a person
is creditworthy or not.

247
00:13:36,230 --> 00:13:39,940
What they're going to do, they're going
to rely on historical records of

248
00:13:39,940 --> 00:13:43,950
previous customers, and how their credit
behavior turned out, and then

249
00:13:43,950 --> 00:13:47,660
try to reverse-engineer the system, and
when they get the system frozen,

250
00:13:47,660 --> 00:13:50,070
they're going to apply it
to a future customer.

251
00:13:50,070 --> 00:13:51,480
That's the deal.

252
00:13:51,480 --> 00:13:54,360
What are the components here?

253
00:13:54,360 --> 00:13:56,480
Let's look at it.

254
00:13:56,480 --> 00:13:58,690
First, you have the applicant
information.

255
00:13:58,690 --> 00:14:02,380
And the applicant information-- you
look at this, and you can see that

256
00:14:02,380 --> 00:14:06,865
there is the age, the gender, how much
money you make, how much money you

257
00:14:06,865 --> 00:14:10,870
owe, and all kinds of fields that are
believed to be related to the

258
00:14:10,870 --> 00:14:13,160
creditworthiness.

259
00:14:13,160 --> 00:14:17,580
Again, pretty much like we did in
the movie example, there is no question

260
00:14:17,580 --> 00:14:21,310
that these fields are related
to the creditworthiness.

261
00:14:21,310 --> 00:14:25,020
They don't necessarily uniquely
determine it, but they are related.

262
00:14:25,020 --> 00:14:28,680
And the bank doesn't want a sure bet.
They want to get the credit decision

263
00:14:28,680 --> 00:14:29,950
as reliable as possible.

264
00:14:29,950 --> 00:14:32,970
So they want to use that pattern,
in order to be able to come up with

265
00:14:32,970 --> 00:14:33,960
a good decision.

266
00:14:33,960 --> 00:14:34,620

267
00:14:34,620 --> 00:14:38,900
And they take this input, and they want
to approve the credit or deny it.

268
00:14:38,900 --> 00:14:41,220
So let's formalize this.

269
00:14:41,220 --> 00:14:45,190
First, we are going to
have an input.

270
00:14:45,190 --> 00:14:48,640
And the input is called
x. Surprise, surprise!

271
00:14:48,640 --> 00:14:52,970
And that input happens to be
the customer application.

272
00:14:52,970 --> 00:14:56,280
So we can think of it as
a d-dimensional vector, the first

273
00:14:56,280 --> 00:15:00,660
component is the salary, years in
residence, outstanding debt, whatever

274
00:15:00,660 --> 00:15:01,600
the components are.

275
00:15:01,600 --> 00:15:05,170
You put it as a vector, and
that becomes the input.

276
00:15:05,170 --> 00:15:09,690
Then we get the output y. The output
y is simply the decision, either to

277
00:15:09,690 --> 00:15:14,130
extend credit or not to extend
credit, +1 and -1.

278
00:15:14,130 --> 00:15:15,380

279
00:15:15,380 --> 00:15:17,540

280
00:15:17,540 --> 00:15:22,980
And being a good or bad customer, that
is from the bank's point of view.

281
00:15:22,980 --> 00:15:26,380
Now we have after that,
the target function.

282
00:15:26,380 --> 00:15:31,860
The target function is a function
from a domain X, which is the

283
00:15:31,860 --> 00:15:34,370
set of all of these x's.

284
00:15:34,370 --> 00:15:34,770

285
00:15:34,770 --> 00:15:37,240
So it is the set of vectors
of d dimensions.

286
00:15:37,240 --> 00:15:40,020
So it's a d-dimensional Euclidean
space, in this case.

287
00:15:40,020 --> 00:15:42,220
And then the Y is the set of y's.

288
00:15:42,220 --> 00:15:44,820
Well, that's an easy one because
y can only be +1 or -1,

289
00:15:44,820 --> 00:15:46,320
accept or deny.

290
00:15:46,320 --> 00:15:49,990
And therefore this is just
a binary co-domain.

291
00:15:49,990 --> 00:15:55,620
And this target function is the ideal
credit approval formula, which we

292
00:15:55,620 --> 00:15:57,280
don't know.

293
00:15:57,280 --> 00:16:00,920
In all of our endeavors in machine
learning, the target function is

294
00:16:00,920 --> 00:16:02,740
unknown to us.

295
00:16:02,740 --> 00:16:05,850
If it were known, nobody
would need learning.

296
00:16:05,850 --> 00:16:08,120
We just go ahead and implement it.

297
00:16:08,120 --> 00:16:11,890
But we need to learn it because
it is unknown to us.

298
00:16:11,890 --> 00:16:14,850
So what are we going
to do to learn it?

299
00:16:14,850 --> 00:16:19,040
We are going to use data, examples.

300
00:16:19,040 --> 00:16:23,360
So the data in this case is based on
previous customer application records.

301
00:16:23,360 --> 00:16:28,040
The input, which is the information in
their applications, and the output,

302
00:16:28,040 --> 00:16:31,070
which is how they turned
out in hindsight.

303
00:16:31,070 --> 00:16:33,920
This is not a question of prediction
at the time they applied, but after

304
00:16:33,920 --> 00:16:36,430
five years, they turned out
to be a great customer.

305
00:16:36,430 --> 00:16:40,005
So the bank says, if someone has
these attributes again, let's approve

306
00:16:40,005 --> 00:16:43,150
credit because these guys
tend to make us money.

307
00:16:43,150 --> 00:16:47,290
And this one made us lose a lot of
money, so let's deny it, and so on.

308
00:16:47,290 --> 00:16:50,680
And the historical records-- there are
plenty of historical records.

309
00:16:50,680 --> 00:16:54,180
All of this makes sense when you're
talking about having 100,000 of

310
00:16:54,180 --> 00:16:55,030
those guys.

311
00:16:55,030 --> 00:16:58,160
Then you pretty much say, I will
capture what the essence of that

312
00:16:58,160 --> 00:16:59,330
function is.

313
00:16:59,330 --> 00:17:02,940
So this is the data, and then you use
the data, which is the historical

314
00:17:02,940 --> 00:17:06,829
records, in order to
get the hypothesis.

315
00:17:06,829 --> 00:17:12,160
The hypothesis is the formal name we're
going to call the formula we get

316
00:17:12,160 --> 00:17:14,220
to approximate the target function.

317
00:17:14,220 --> 00:17:19,348
So the hypothesis lives in the same
world as the target function, and if

318
00:17:19,348 --> 00:17:25,779
you look at the value of g, it supposedly
approximates f.

319
00:17:25,780 --> 00:17:29,030
While f is unknown to us,
g is very much known--

320
00:17:29,030 --> 00:17:33,050
actually we created it-- and the hope
is that it does approximate f well.

321
00:17:33,050 --> 00:17:35,690
That's the goal of learning.

322
00:17:35,690 --> 00:17:39,310
So this notation will be our notation
for the rest of the course, so get

323
00:17:39,310 --> 00:17:40,170
used to it.

324
00:17:40,170 --> 00:17:43,800
The target function is always f, the
hypothesis we produce, which we'll

325
00:17:43,800 --> 00:17:47,860
refer to as the final hypothesis will be
called g, the data will always have

326
00:17:47,860 --> 00:17:51,400
that notation-- there are capital N
points, which are the data set.

327
00:17:51,400 --> 00:17:55,630
And the output is always y.

328
00:17:55,630 --> 00:17:59,120
So this is the formula to be used.

329
00:17:59,120 --> 00:18:03,940
Now, let's put it in a diagram in order
to analyze it a little bit more.

330
00:18:03,940 --> 00:18:07,110
If you look at the diagram
here, here is the target

331
00:18:07,110 --> 00:18:08,810
function and it is unknown--

332
00:18:08,810 --> 00:18:11,740
that is the ideal approval which we will
never know, but that's what we're

333
00:18:11,740 --> 00:18:13,860
hoping to get to approximate.

334
00:18:13,860 --> 00:18:15,170
And we don't see it.

335
00:18:15,170 --> 00:18:18,230
We see it only through the eyes
of the training examples.

336
00:18:18,230 --> 00:18:21,350
This is our vehicle of understanding
what the target function is.

337
00:18:21,350 --> 00:18:25,170
Otherwise the target function is
a mysterious quantity for us.

338
00:18:25,170 --> 00:18:28,440
And eventually, we would like to
produce the final hypothesis.

339
00:18:28,440 --> 00:18:31,490
The final hypothesis is the formula the
bank is going to use in order to

340
00:18:31,490 --> 00:18:36,440
approve or deny credit, with the hope
that g approximates f well.

341
00:18:36,440 --> 00:18:37,300

342
00:18:37,300 --> 00:18:41,160
Now what connects those two guys?

343
00:18:41,160 --> 00:18:43,110
This will be the learning algorithm.

344
00:18:43,110 --> 00:18:47,340
So the learning algorithm takes the
examples, and will produce the final

345
00:18:47,340 --> 00:18:51,850
hypothesis, as we described in the
example of the movie rating.

346
00:18:51,850 --> 00:18:56,880
Now there is another component that
goes into the learning algorithm.

347
00:18:56,880 --> 00:19:02,430
So what the learning algorithm does, it
creates the formula from a preset

348
00:19:02,430 --> 00:19:06,100
model of formulas, a set of candidate
formulas, if you will.

349
00:19:06,100 --> 00:19:10,610
And these we are going to call the
hypothesis set, a set of hypotheses

350
00:19:10,610 --> 00:19:13,230
from which we are going to
pick one hypothesis.

351
00:19:13,230 --> 00:19:18,440
So from this H comes a bunch of small
h's, which are functions that can be

352
00:19:18,440 --> 00:19:21,320
candidates for being the
credit approval.

353
00:19:21,320 --> 00:19:24,390
And one of them will be picked by the
learning algorithm, which happens to

354
00:19:24,390 --> 00:19:27,220
be g, hopefully approximating f.

355
00:19:27,220 --> 00:19:30,960
Now if you look at this part of the
chain, from the target function to the

356
00:19:30,960 --> 00:19:34,540
training to the learning algorithm to
the final hypothesis, this is very

357
00:19:34,540 --> 00:19:37,280
natural, and nobody will
object to that.

358
00:19:37,280 --> 00:19:40,280
But why do we have this
hypothesis set?

359
00:19:40,280 --> 00:19:44,030
Why not let the algorithm
pick from anything?

360
00:19:44,030 --> 00:19:48,150
Just create the formula, without being
restricted to a particular set of

361
00:19:48,150 --> 00:19:50,190
formulas H.

362
00:19:50,190 --> 00:19:52,790
There are two reasons, and
I want to explain them.

363
00:19:52,790 --> 00:19:56,920
One of them is that there is no downside
for including a hypothesis

364
00:19:56,920 --> 00:19:59,320
set in the formalization.

365
00:19:59,320 --> 00:20:01,480
And there is an upside.

366
00:20:01,480 --> 00:20:05,030
So let me describe why there is no
downside, and then describe why there

367
00:20:05,030 --> 00:20:07,040
is an upside.

368
00:20:07,040 --> 00:20:11,430
There is no downside for the simple
reason that, from a practical point of

369
00:20:11,430 --> 00:20:12,730
view, that's what you do.

370
00:20:12,730 --> 00:20:14,770
You want to learn, you say I'm going
to use a linear formula.

371
00:20:14,770 --> 00:20:16,060
I'm going to use a neural network.

372
00:20:16,060 --> 00:20:17,570
I'm going to use a support
vector machine.

373
00:20:17,570 --> 00:20:20,990
So you are already dictating
a set of hypotheses.

374
00:20:20,990 --> 00:20:25,260
If you happen to be a brave soul, and you
don't want to restrict yourself at

375
00:20:25,260 --> 00:20:29,680
all, very well, then your hypothesis
set is the set of all possible

376
00:20:29,680 --> 00:20:31,580
hypotheses.

377
00:20:31,580 --> 00:20:32,200
Right?

378
00:20:32,200 --> 00:20:34,600
So there is no loss of generality
in putting it.

379
00:20:34,600 --> 00:20:37,410
So there is no downside.

380
00:20:37,410 --> 00:20:40,890
The upside is not obvious here, but it
will become obvious as we go through

381
00:20:40,890 --> 00:20:41,900
the theory.

382
00:20:41,900 --> 00:20:47,150
The hypothesis set will play a pivotal
role in the theory of learning.

383
00:20:47,150 --> 00:20:50,590
It will tell us: can we learn, and
how well we learn, and whatnot.

384
00:20:50,590 --> 00:20:54,370
Therefore having it as an explicit
component in the problem statement

385
00:20:54,370 --> 00:20:56,600
will make the theory go through.

386
00:20:56,600 --> 00:20:58,910
So that's why we have this figure.

387
00:20:58,910 --> 00:20:59,720

388
00:20:59,720 --> 00:21:04,440
Now, let me focus on the solution
components of that figure.

389
00:21:04,440 --> 00:21:07,520
What do I mean by the
solution components?

390
00:21:07,520 --> 00:21:14,610
If you look at this, the first part,
which is the target-- let me try to

391
00:21:14,610 --> 00:21:15,570
expand it--

392
00:21:15,570 --> 00:21:18,540
so the target function is
not under your control.

393
00:21:18,540 --> 00:21:21,640
Someone knocks on my door and says:
I want to approve credit.

394
00:21:21,640 --> 00:21:24,730
That's the target function, I
have no control over that.

395
00:21:24,730 --> 00:21:27,110
And by the way, here are
the historical records.

396
00:21:27,110 --> 00:21:30,440
I have no control over that,
so they give me the data.

397
00:21:30,440 --> 00:21:33,330
And would you please hand me
the final hypothesis?

398
00:21:33,330 --> 00:21:36,250
That is what I'm going to give them at
the end, before I receive my check.

399
00:21:36,250 --> 00:21:36,900

400
00:21:36,900 --> 00:21:39,170
So all of that is completely dictated.

401
00:21:39,170 --> 00:21:49,120
Now let's look at the other part. The
learning algorithm, and the hypothesis

402
00:21:49,120 --> 00:21:52,090
set that we talked about,
are your solution tools.

403
00:21:52,090 --> 00:21:55,450
These are things you choose, in
order to solve the problem.

404
00:21:55,450 --> 00:22:01,150
And I would like to take a little bit
of a look into what they look like,

405
00:22:01,150 --> 00:22:04,770
and give you an example of them, so that
you have a complete chain for

406
00:22:04,770 --> 00:22:06,630
the entire figure in your mind.

407
00:22:06,630 --> 00:22:09,670
From the target function, to the data
set, to the learning algorithm,

408
00:22:09,670 --> 00:22:12,210
hypothesis set, and the
final hypothesis.

409
00:22:12,210 --> 00:22:12,780

410
00:22:12,780 --> 00:22:15,520
So, here is the hypothesis set.

411
00:22:15,520 --> 00:22:22,820
We chose the notation H for the
set, and the element will be given the

412
00:22:22,820 --> 00:22:23,990
symbol small h.

413
00:22:23,990 --> 00:22:27,890
So h is a function, pretty much
like the final hypothesis g.

414
00:22:27,890 --> 00:22:30,540
g is just one of them
that you happen to elect.

415
00:22:30,540 --> 00:22:34,060
So when we elect it, we call it g. If
it's sitting there generically, we

416
00:22:34,060 --> 00:22:35,100
call it h.

417
00:22:35,100 --> 00:22:35,860

418
00:22:35,860 --> 00:22:39,090
And then, when you put them together,
they are referred to as

419
00:22:39,090 --> 00:22:39,690
the learning model.

420
00:22:39,690 --> 00:22:42,610
So if you're asked what is the learning
model you are using, you're

421
00:22:42,610 --> 00:22:46,580
actually choosing both a hypothesis
set and a learning algorithm.

422
00:22:46,580 --> 00:22:49,400
We'll see the perceptron in a moment,
so this would be the

423
00:22:49,400 --> 00:22:52,780
perceptron model, and this would be the
perceptron learning algorithm.

424
00:22:52,780 --> 00:22:56,420
This could be neural network, and
this would be back propagation.

425
00:22:56,420 --> 00:22:59,050
This could be support vector
machines of some kind, let's say

426
00:22:59,050 --> 00:23:02,520
radial basis function version, and this
would be the quadratic programming.

427
00:23:02,520 --> 00:23:05,840
So every time you have a model, there is
a hypothesis set, and then there is

428
00:23:05,840 --> 00:23:07,960
an algorithm that will do the
searching and produce

429
00:23:07,960 --> 00:23:09,280
one of those guys.

430
00:23:09,280 --> 00:23:09,690

431
00:23:09,690 --> 00:23:13,630
So this is the standard form

432
00:23:13,630 --> 00:23:14,820
for the solution.

433
00:23:14,820 --> 00:23:18,860
Now, let me go through a simple
hypothesis set in detail so we have

434
00:23:18,860 --> 00:23:19,650
something to implement.

435
00:23:19,650 --> 00:23:23,860
So after the lecture, you can actually
implement a learning algorithm on real

436
00:23:23,860 --> 00:23:24,950
data if you want to.

437
00:23:24,950 --> 00:23:28,440
This is not a glorious model. It's
a very simple model. On the other hand,

438
00:23:28,440 --> 00:23:33,790
it's a very clear model to pinpoint
what we are talking about.

439
00:23:33,790 --> 00:23:34,570

440
00:23:34,570 --> 00:23:35,890
So here is the deal.

441
00:23:35,890 --> 00:23:39,660

442
00:23:39,660 --> 00:23:43,730
You have an input, and the input
is x_1 up to x_d, as we said--

443
00:23:43,730 --> 00:23:49,210
d-dimensional vector-- and each of them
comes from the real numbers, just

444
00:23:49,210 --> 00:23:49,840
for simplicity.

445
00:23:49,840 --> 00:23:51,320
So this belongs to the real numbers.

446
00:23:51,320 --> 00:23:53,170
And these are the attributes
of a customer.

447
00:23:53,170 --> 00:23:56,470
As we said, salary, years in
residence, and whatnot.

448
00:23:56,470 --> 00:24:00,080
So what does the perceptron model do?

449
00:24:00,080 --> 00:24:02,730
It computes a very simple formula.

450
00:24:02,730 --> 00:24:08,760
It takes the attributes you have and
gives them different weights, w.

451
00:24:08,760 --> 00:24:12,110
So let's say the salary is important,
the chances are w corresponding to the

452
00:24:12,110 --> 00:24:13,900
salary will be big.

453
00:24:13,900 --> 00:24:15,880
Some other attribute is
not that important.

454
00:24:15,880 --> 00:24:19,210
The chances are the w that
goes with it is not that big.

455
00:24:19,210 --> 00:24:21,540
Actually, outstanding
debt is bad news.

456
00:24:21,540 --> 00:24:23,370
If you owe a lot, that's not good.

457
00:24:23,370 --> 00:24:26,600
So the chances are the weight will
be negative for outstanding

458
00:24:26,600 --> 00:24:28,420
debt, and so on.

459
00:24:28,420 --> 00:24:32,210
Now you add them together, and you add
them in a linear form-- that's what

460
00:24:32,210 --> 00:24:33,630
makes it a perceptron--

461
00:24:33,630 --> 00:24:39,010
and you can look at this as
a credit score, of sorts.

462
00:24:39,010 --> 00:24:39,760

463
00:24:39,760 --> 00:24:43,300
Now you compare the credit
score with a threshold.

464
00:24:43,300 --> 00:24:46,690
If you exceed the threshold, they
approve the credit card.

465
00:24:46,690 --> 00:24:50,420
And if you don't, they
deny the credit card.

466
00:24:50,420 --> 00:24:51,710
So that is the formula they

467
00:24:51,710 --> 00:24:52,520
settle on.

468
00:24:52,520 --> 00:24:58,500
They have no idea, yet, what the w's and
the threshold are, but they dictated the

469
00:24:58,500 --> 00:25:01,110
formula-- the analytic form that
they're going to use.

470
00:25:01,110 --> 00:25:02,040

471
00:25:02,040 --> 00:25:06,530
Now we take this and we put it
in the formalization we had.

472
00:25:06,530 --> 00:25:11,370
We have to define a hypothesis h,
and this will tell us what is the

473
00:25:11,370 --> 00:25:14,820
hypothesis set that has all the
hypotheses that have the same

474
00:25:14,820 --> 00:25:16,170
functional form.

475
00:25:16,170 --> 00:25:17,530
So you can write it down as this.

476
00:25:17,530 --> 00:25:22,270
This is a little bit long, but there's
absolutely nothing to it.

477
00:25:22,270 --> 00:25:26,490
This is your credit score, and this
is the threshold you compare to by

478
00:25:26,490 --> 00:25:27,740
subtracting.

479
00:25:27,740 --> 00:25:30,910
If this quantity is positive, you belong
to the first thing and you will

480
00:25:30,910 --> 00:25:31,890
approve credit.

481
00:25:31,890 --> 00:25:34,880
If it's negative, you belong here
and you will deny credit.

482
00:25:34,880 --> 00:25:38,440
Well, the function that takes a real
number, and produces the sign +1 or

483
00:25:38,440 --> 00:25:41,010
-1, is called the sign.

484
00:25:41,010 --> 00:25:43,930
So when you take the sign of this thing,
this will indeed be +1 or

485
00:25:43,930 --> 00:25:46,970
-1, and this will give
the decision you want.

486
00:25:46,970 --> 00:25:49,820
And that will be the form
of your hypothesis.

487
00:25:49,820 --> 00:25:57,640
Now let's put it in color, and you
realize that what defines h is your

488
00:25:57,640 --> 00:26:00,290
choice of w_i and the threshold.

489
00:26:00,290 --> 00:26:05,060
These are the parameters that define
one hypothesis versus the other.

490
00:26:05,060 --> 00:26:07,820
x is an input that will be
put into any hypothesis.

491
00:26:07,820 --> 00:26:11,780
As far as we are concerned, when we are
learning for example, the inputs

492
00:26:11,780 --> 00:26:13,610
and outputs are already determined.

493
00:26:13,610 --> 00:26:15,010
These are the data set.

494
00:26:15,010 --> 00:26:19,640
But what we vary to get one hypothesis
or another, and what the algorithm

495
00:26:19,640 --> 00:26:23,270
needs to vary in order to choose the
final hypothesis, are those parameters

496
00:26:23,270 --> 00:26:27,270
which, in this case, are
w_i and the threshold.

497
00:26:27,270 --> 00:26:28,810

498
00:26:28,810 --> 00:26:30,610
So let's look at it visually.

499
00:26:30,610 --> 00:26:32,990
Let's assume that the data
you are working

500
00:26:32,990 --> 00:26:34,790
with is linearly separable.

501
00:26:34,790 --> 00:26:38,770
Linearly separable: in this case, for
example, you have nine data points.

502
00:26:38,770 --> 00:26:42,220
And if you look at the nine data
points, some of them were good

503
00:26:42,220 --> 00:26:44,850
customers and some of them
were bad customers.

504
00:26:44,850 --> 00:26:48,450
And you would like now to apply the
perceptron model, in order to separate

505
00:26:48,450 --> 00:26:49,460
them correctly.

506
00:26:49,460 --> 00:26:53,860
You would like to get to this situation,
where the perceptron, which

507
00:26:53,860 --> 00:26:57,680
is this purple line, separates the blue
region from the red region or the

508
00:26:57,680 --> 00:27:02,240
pink region, and indeed all the good
customers belong to one, and the bad

509
00:27:02,240 --> 00:27:03,600
customers belong to the other.

510
00:27:03,600 --> 00:27:07,100
So you have hope that a future customer,
if they lie here or lie

511
00:27:07,100 --> 00:27:09,152
here, they will be classified
correctly.

512
00:27:09,152 --> 00:27:12,780
That is, if there is actually a simple
linear pattern here to be detected.

513
00:27:12,780 --> 00:27:16,950
But when you start, you start with
random weights, and the random weights

514
00:27:16,950 --> 00:27:18,990
will give you any line.

515
00:27:18,990 --> 00:27:23,490
So the purple line in both
cases corresponds to the

516
00:27:23,490 --> 00:27:25,900
purple parameters there.

517
00:27:25,900 --> 00:27:30,370
One choice of these w's and the
threshold corresponds to one line.

518
00:27:30,370 --> 00:27:32,220
You change them, you get another line.

519
00:27:32,220 --> 00:27:35,410
So you can see that the learning
algorithm is playing around with these

520
00:27:35,410 --> 00:27:39,360
parameters, and therefore moving the
line around, trying to arrive at this

521
00:27:39,360 --> 00:27:40,950
happy solution.

522
00:27:40,950 --> 00:27:42,350

523
00:27:42,350 --> 00:27:45,620
Now we are going to have a simple
change of notation.

524
00:27:45,620 --> 00:27:51,000
Instead of calling it threshold, we're
going to treat it as if it's a weight.

525
00:27:51,000 --> 00:27:55,030
It was minus threshold.
Now we call it plus w_0.

526
00:27:55,030 --> 00:27:58,760
Absolutely nothing changed; all you need
to do is choose w_0 to

527
00:27:58,760 --> 00:28:00,930
be minus the threshold.

528
00:28:00,930 --> 00:28:01,840
No big deal.

529
00:28:01,840 --> 00:28:03,060

530
00:28:03,060 --> 00:28:04,750
So why do we do that?

531
00:28:04,750 --> 00:28:08,220
We do that because we are going to
introduce an artificial coordinate.

532
00:28:08,220 --> 00:28:11,780
Remember that the input
was x_1 through x_d.

533
00:28:11,780 --> 00:28:14,020
Now we're going to add x_0.

534
00:28:14,020 --> 00:28:15,910
This is not an attribute of
the customer, but

535
00:28:15,910 --> 00:28:20,460
an artificial constant we add, which
happens to be always +1.

536
00:28:20,460 --> 00:28:22,710
Why are we doing this?
You probably guessed.

537
00:28:22,710 --> 00:28:26,540
Because when you do that, then all of
a sudden the formula simplifies.

538
00:28:26,540 --> 00:28:30,040
Now you are summing from i equals
0, instead of i equals 1.

539
00:28:30,040 --> 00:28:33,410
So you added the zero term,
and what is the zero term?

540
00:28:33,410 --> 00:28:37,270
It's the threshold which you
conveniently call w_0 with a plus sign,

541
00:28:37,270 --> 00:28:38,390
multiplied by the 1.

542
00:28:38,390 --> 00:28:41,550
So indeed, this will be the formula
equivalent to that.

543
00:28:41,550 --> 00:28:43,340
So it looks better.

544
00:28:43,340 --> 00:28:46,490
And this is the standard notation
we're going to use.

545
00:28:46,490 --> 00:28:51,720
And now we put it as a vector
form, which will simplify matters, so

546
00:28:51,720 --> 00:28:56,200
in this case you will be having an inner
product between a vector w,

547
00:28:56,200 --> 00:28:59,190
a column vector, and a vector x.

548
00:28:59,190 --> 00:29:04,870
So the vector w would be w_0,
w_1, w_2, w_3, w_4, et cetera.

549
00:29:04,870 --> 00:29:07,030
And x_0, x_1, x_2, et cetera.

550
00:29:07,030 --> 00:29:10,180
And you do the inner product by taking
a transpose, and you get a formula

551
00:29:10,180 --> 00:29:12,200
which is exactly the formula
you have here.

552
00:29:12,200 --> 00:29:17,330
So now we are down to this formula
for the perceptron hypothesis.

553
00:29:17,330 --> 00:29:19,010

554
00:29:19,010 --> 00:29:22,540
Now that we have the hypothesis set,
let's look for the learning algorithm

555
00:29:22,540 --> 00:29:23,500
that goes with it.

556
00:29:23,500 --> 00:29:27,170
The hypothesis set tells you the
resources you can work with.

557
00:29:27,170 --> 00:29:29,990
Now we need the algorithm that is
going to look at the data, the

558
00:29:29,990 --> 00:29:33,810
training data that you're going to use,
and navigate through the space

559
00:29:33,810 --> 00:29:37,380
of hypotheses, to pick the one that
it is going to output as the final

560
00:29:37,380 --> 00:29:39,660
hypothesis that you give
to your customer.

561
00:29:39,660 --> 00:29:47,210
So this one is called the perceptron
learning algorithm, and it implements

562
00:29:47,210 --> 00:29:49,420
this function.

563
00:29:49,420 --> 00:29:51,610
What it does is the following.

564
00:29:51,610 --> 00:29:54,560
It takes the training data.

565
00:29:54,560 --> 00:29:56,710
That is always what a learning
algorithm does. This is

566
00:29:56,710 --> 00:29:58,020
its starting point.

567
00:29:58,020 --> 00:30:03,330
So it takes existing customers, and
their existing credit behavior in

568
00:30:03,330 --> 00:30:04,030
hindsight--

569
00:30:04,030 --> 00:30:05,380
that's what it uses--

570
00:30:05,380 --> 00:30:07,070
and what does it do?

571
00:30:07,070 --> 00:30:12,080
It tries to make the w correct.

572
00:30:12,080 --> 00:30:17,640
So it really doesn't like at all
when a point is misclassified.

573
00:30:17,640 --> 00:30:22,960
So if a point is misclassified,
it means that your w didn't do

574
00:30:22,960 --> 00:30:23,850
the right job here.

575
00:30:23,850 --> 00:30:26,640
So what does it mean to be
a misclassified point here?

576
00:30:26,640 --> 00:30:31,910
It means that when you apply your
formula, with the current w--

577
00:30:31,910 --> 00:30:34,340
the w is the one that the algorithm
will play with--

578
00:30:34,340 --> 00:30:37,340
apply it to this particular x.

579
00:30:37,340 --> 00:30:38,170
Then what happens?

580
00:30:38,170 --> 00:30:41,200
You get something that is not the
credit behavior you want.

581
00:30:41,200 --> 00:30:43,500
It is misclassified.

582
00:30:43,500 --> 00:30:45,150
So what do we do when a point
is misclassified?

583
00:30:45,150 --> 00:30:46,840
We have to do something.

584
00:30:46,840 --> 00:30:49,660
So what the algorithm does, it
updates the weight vector.

585
00:30:49,660 --> 00:30:53,480
It changes the weight, which changes
the hypothesis, so that it behaves

586
00:30:53,480 --> 00:30:55,840
better on that particular point.

587
00:30:55,840 --> 00:30:59,970
And this is the formula that it uses.

588
00:30:59,970 --> 00:31:01,480
So I'll explain it in a moment.

589
00:31:01,480 --> 00:31:08,430
Let me first try to explain the inner
product in terms of agreement or

590
00:31:08,430 --> 00:31:10,770
disagreement.

591
00:31:10,770 --> 00:31:16,890
If you have the vector x and the vector
w this way, their inner product

592
00:31:16,890 --> 00:31:21,230
will be positive, and the sign
will give you a +1.

593
00:31:21,230 --> 00:31:25,440
If they are this way, the inner product
will be negative, and the sign

594
00:31:25,440 --> 00:31:27,550
will be -1.

595
00:31:27,550 --> 00:31:32,180
So being misclassified means that
either they are this way and the

596
00:31:32,180 --> 00:31:37,590
output should be -1, or it's this
way and output should be +1.

597
00:31:37,590 --> 00:31:40,840
That's what makes it misclassified,
right?

598
00:31:40,840 --> 00:31:41,750

599
00:31:41,750 --> 00:31:49,720
So if you look here at this formula, it
takes the old w and adds something

600
00:31:49,720 --> 00:31:52,130
that depends on the misclassified
point.

601
00:31:52,130 --> 00:31:55,280
Both in terms of the x_n and y_n.

602
00:31:55,280 --> 00:31:57,410
y_n is just +1 or -1.

603
00:31:57,410 --> 00:32:00,800
So here you are either adding a vector
or subtracting a vector.

604
00:32:00,800 --> 00:32:05,490
And we will see from this diagram that
you're always doing so in such a way

605
00:32:05,490 --> 00:32:09,520
that you make the point more likely
to be correctly classified.

606
00:32:09,520 --> 00:32:10,780
How is that?

607
00:32:10,780 --> 00:32:15,990
If y equals +1, as you see here,
then it must be, since the point

608
00:32:15,990 --> 00:32:19,530
is misclassified, that
w dot x was negative.

609
00:32:19,530 --> 00:32:24,900
Now when you modify this to w plus
y x, it's actually w plus x.

610
00:32:24,900 --> 00:32:29,330
You add x to w, and when you add x to
w you get the blue vector instead of

611
00:32:29,330 --> 00:32:30,120
the red vector.

612
00:32:30,120 --> 00:32:33,850
And lo and behold, now the inner
product is indeed positive.

613
00:32:33,850 --> 00:32:38,830
And in the other case when it's -1,
it is misclassified because they

614
00:32:38,830 --> 00:32:39,760
were this way.

615
00:32:39,760 --> 00:32:41,840
They give you +1 when
it should be -1.

616
00:32:41,840 --> 00:32:44,520
And when you apply the rule, since
y is -1, you are actually

617
00:32:44,520 --> 00:32:45,640
subtracting x.

618
00:32:45,640 --> 00:32:48,680
So you subtract x and get this guy,
and you will get the correct

619
00:32:48,680 --> 00:32:49,620
classification.

620
00:32:49,620 --> 00:32:49,810

621
00:32:49,810 --> 00:32:51,560
So this is the intuition behind it.

622
00:32:51,560 --> 00:32:53,990
However, it is not the intuition
that makes this work.

623
00:32:53,990 --> 00:32:58,930
There are a number of problems
with this approach.

624
00:32:58,930 --> 00:33:02,570
I just motivated that
this is not a crazy rule.

625
00:33:02,570 --> 00:33:06,660
Whether or not it's a working
rule, that is yet to be seen.

626
00:33:06,660 --> 00:33:11,270
Let's look at the iterations of
the perceptron learning algorithm.

627
00:33:11,270 --> 00:33:13,860
Here is one iteration of PLA.

628
00:33:13,860 --> 00:33:18,950
So you look at this thing, and this
current w corresponds to

629
00:33:18,950 --> 00:33:20,330
the purple line.

630
00:33:20,330 --> 00:33:22,760
This guy is blue in the red region.

631
00:33:22,760 --> 00:33:24,610
It means it's misclassified.

632
00:33:24,610 --> 00:33:28,630
So now you would like to adjust
the weights, that is move around

633
00:33:28,630 --> 00:33:32,340
that purple line, such that the
point is classified correctly.

634
00:33:32,340 --> 00:33:35,440
If you apply the learning rule, you'll
find that you're actually moving in

635
00:33:35,440 --> 00:33:40,210
this direction, which means that the
blue point will likely be correctly

636
00:33:40,210 --> 00:33:42,230
classified after that iteration.

637
00:33:42,230 --> 00:33:43,790

638
00:33:43,790 --> 00:33:46,930
There is a problem because, let's
say that I actually move

639
00:33:46,930 --> 00:33:49,700
this guy in this direction.

640
00:33:49,700 --> 00:33:55,010
Well this one, I got it right, but this
one, which used to be right,

641
00:33:55,010 --> 00:33:56,440
now is messed up.

642
00:33:56,440 --> 00:33:58,920
It moved to the blue region, right?

643
00:33:58,920 --> 00:34:02,440
And if you think about it, I'm trying
to take care of one point, and I may be

644
00:34:02,440 --> 00:34:05,450
messing up all other points, because
I'm not taking them into

645
00:34:05,450 --> 00:34:06,980
consideration.

646
00:34:06,980 --> 00:34:08,469
Well, the good news for the perceptron

647
00:34:08,469 --> 00:34:12,909
learning algorithm is that all you need
to do, is for iterations 1,

648
00:34:12,909 --> 00:34:19,179
2, 3, 4, et cetera, pick a misclassified
point, any one you like.

649
00:34:19,179 --> 00:34:22,020

650
00:34:22,020 --> 00:34:24,489
And then apply the iteration to it.

651
00:34:24,489 --> 00:34:27,870
The iteration we just talked about,
which is this one.

652
00:34:27,870 --> 00:34:29,480
The top one.

653
00:34:29,480 --> 00:34:31,210
And that's it.

654
00:34:31,210 --> 00:34:35,790
If you do that, and the data was
originally linearly separable, then

655
00:34:35,790 --> 00:34:40,310
you will end up with the case that you
will get to a correct solution.

656
00:34:40,310 --> 00:34:42,870
You will get to something that
classifies all of them correctly.
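
Putting the pieces together, here is a minimal sketch of the perceptron learning algorithm in Python. It assumes each input row already carries the artificial coordinate x_0 = +1, and it picks the first misclassified point it finds, which is one valid way of choosing "any one you like"; the iteration cap is a practical safeguard, not part of the algorithm.

    import numpy as np

    def pla(X, y, max_iterations=100_000):
        """Perceptron learning algorithm.
        X: N x (d+1) array of inputs, each row starting with x_0 = +1.
        y: length-N array of labels, each +1 or -1."""
        w = np.zeros(X.shape[1])               # any starting weights will do
        for _ in range(max_iterations):
            wrong = np.where(np.sign(X @ w) != y)[0]
            if wrong.size == 0:                # all points classified correctly
                return w
            n = wrong[0]                       # pick a misclassified point
            w = w + y[n] * X[n]                # apply the update rule to it
        return w                               # cap reached: may not be separable

On linearly separable data this loop terminates with a w that classifies every training point correctly, which is the convergence promise just described.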

657
00:34:42,870 --> 00:34:44,340
This is not an obvious statement.

658
00:34:44,340 --> 00:34:45,310
It requires a proof.

659
00:34:45,310 --> 00:34:47,300
The proof is not that hard.

660
00:34:47,300 --> 00:34:51,570
But it gives us the simplest possible
learning model we can think of.

661
00:34:51,570 --> 00:34:54,710
It's a linear model, and
this is your algorithm.

662
00:34:54,710 --> 00:34:59,060
All you need to do is be very patient,
because 1, 2, 3, 4-- this is

663
00:34:59,060 --> 00:35:00,200
a really long process.

664
00:35:00,200 --> 00:35:01,900
At times it can be very long.

665
00:35:01,900 --> 00:35:03,310
But it eventually converges.

666
00:35:03,310 --> 00:35:04,350
That's the promise,

667
00:35:04,350 --> 00:35:06,970
as long as the data is
linearly separable.

668
00:35:06,970 --> 00:35:13,920
So now we have one learning model, and
if I give you now data from a bank--

669
00:35:13,920 --> 00:35:17,216
previous customers and their credit
behavior-- you can actually run the

670
00:35:17,216 --> 00:35:21,090
perceptron learning algorithm, and come up
with a final hypothesis g that you

671
00:35:21,090 --> 00:35:22,630
can hand to the bank.

672
00:35:22,630 --> 00:35:26,390
Not clear at all that it will be good,
because all you did was match the

673
00:35:26,390 --> 00:35:27,750
historical records.

674
00:35:27,750 --> 00:35:31,280
Well, you may ask the question: if I
match the historical records, does this

675
00:35:31,280 --> 00:35:34,300
mean that I'm getting future customers
right, which is the

676
00:35:34,300 --> 00:35:35,480
only thing that matters?

677
00:35:35,480 --> 00:35:38,240
The bank already knows what happened
with the previous customers. It's just

678
00:35:38,240 --> 00:35:41,050
using the data to help you
find a good formula.

679
00:35:41,050 --> 00:35:44,050
The formula will be good or not good to
the extent that it applies to a new

680
00:35:44,050 --> 00:35:47,510
customer, and can predict the
behavior correctly.

681
00:35:47,510 --> 00:35:50,530
Well, that's a loaded question
which will be handled in

682
00:35:50,530 --> 00:35:53,470
extreme detail, when we talk about
the theory of learning.

683
00:35:53,470 --> 00:35:57,190
That's why we have to develop
all of this theory.

684
00:35:57,190 --> 00:35:58,970
So, that's it.

685
00:35:58,970 --> 00:36:02,340
And that is the perceptron
learning algorithm.

686
00:36:02,340 --> 00:36:07,620
Now let me go into the bigger picture
of learning, because what I talked

687
00:36:07,620 --> 00:36:09,990
about so far is one type of learning.

688
00:36:09,990 --> 00:36:13,700
It happens to be by far the most
popular, and the most used.

689
00:36:13,700 --> 00:36:16,250
But there are other types of learning.

690
00:36:16,250 --> 00:36:21,540
So let's talk about the premise of
learning, from which the different

691
00:36:21,540 --> 00:36:24,180
types came about.

692
00:36:24,180 --> 00:36:27,220
That's what learning is about.

693
00:36:27,220 --> 00:36:29,630

694
00:36:29,630 --> 00:36:34,640
This is the premise that is common
between any problem that you

695
00:36:34,640 --> 00:36:36,650
would consider learning.

696
00:36:36,650 --> 00:36:41,450
You use a set of observations,
what we call data, to uncover

697
00:36:41,450 --> 00:36:43,280
an underlying process.

698
00:36:43,280 --> 00:36:46,540
In our case, the target function.

699
00:36:46,540 --> 00:36:51,170
You can see that this is
a very broad premise.

700
00:36:51,170 --> 00:36:54,730
And therefore, you can see that people
have rediscovered that over and over

701
00:36:54,730 --> 00:36:57,160
and over, in so many disciplines.

702
00:36:57,160 --> 00:37:00,870
Can you think of a discipline, other than
machine learning, that uses that

703
00:37:00,870 --> 00:37:02,125
as its exclusive premise?

704
00:37:02,125 --> 00:37:05,420

705
00:37:05,420 --> 00:37:09,460
Has anybody taken courses
in statistics?

706
00:37:09,460 --> 00:37:12,180
In statistics, that's what they do.

707
00:37:12,180 --> 00:37:16,090
The underlying process is
a probability distribution.

708
00:37:16,090 --> 00:37:21,590
And the observations are samples
generated by that distribution.

709
00:37:21,590 --> 00:37:24,220
And you want to take the samples, and
predict what the probability

710
00:37:24,220 --> 00:37:25,890
distribution is.

711
00:37:25,890 --> 00:37:29,740
And over and over, there are so many
disciplines under different names.
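
As a concrete instance of the statistical version of this premise, here is a short sketch; the Gaussian form of the underlying process and the numbers are assumptions made for illustration.

```python
import numpy as np

# Observations: samples generated by an unknown process
# (assumed Gaussian here, purely for illustration).
rng = np.random.default_rng(1)
samples = rng.normal(loc=2.0, scale=0.5, size=1000)

# "Uncover the underlying process": estimate its parameters from the samples.
mu_hat = samples.mean()
sigma_hat = samples.std(ddof=1)
print(mu_hat, sigma_hat)   # close to the true 2.0 and 0.5
```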

712
00:37:29,740 --> 00:37:34,000
Now when we talk about different types
of learning, it's not like we sit down

713
00:37:34,000 --> 00:37:37,970
and look at the world and say, this
looks different from this because the

714
00:37:37,970 --> 00:37:39,420
assumptions look different.

715
00:37:39,420 --> 00:37:43,700
What you do is, you take this premise
and apply it in a context.

716
00:37:43,700 --> 00:37:48,100
And that calls for a certain amount
of mathematics and algorithms.

717
00:37:48,100 --> 00:37:53,690
If a particular set of assumptions takes
you sufficiently far from the

718
00:37:53,690 --> 00:37:57,850
mathematics and the algorithms you used
in the other disciplines, then

719
00:37:57,850 --> 00:38:00,360
it takes on a life of its own.

720
00:38:00,360 --> 00:38:03,900
And it develops its own math and
algorithms, and you declare it

721
00:38:03,900 --> 00:38:05,250
a different type.

722
00:38:05,250 --> 00:38:05,660

723
00:38:05,660 --> 00:38:09,750
So when I list the types, it's not
completely obvious just by the slide

724
00:38:09,750 --> 00:38:13,000
itself, that these should be
the types that you have.

725
00:38:13,000 --> 00:38:16,370
But for what it's worth, these
are the most important types.

726
00:38:16,370 --> 00:38:18,110
First one is supervised learning,
that's what we have
727
00:38:18,110 --> 00:38:18,970
been talking about.

728
00:38:18,970 --> 00:38:22,240
And I will discuss it in detail, and tell
you why it's called supervised.

729
00:38:22,240 --> 00:38:26,640
And it is, by far, the concentration
of this course.

730
00:38:26,640 --> 00:38:31,950
There is another one which is called
unsupervised learning, and

731
00:38:31,950 --> 00:38:33,990
unsupervised learning
is very intriguing.

732
00:38:33,990 --> 00:38:37,310
I will mention it briefly here, and then
we will talk about a very famous

733
00:38:37,310 --> 00:38:40,740
algorithm for unsupervised learning
later in the course.

734
00:38:40,740 --> 00:38:44,090
And the final type is reinforcement
learning, which is even more

735
00:38:44,090 --> 00:38:47,640
intriguing, and I will
discuss it in a brief

736
00:38:47,640 --> 00:38:49,760
introduction in a moment.

737
00:38:49,760 --> 00:38:50,330

738
00:38:50,330 --> 00:38:52,290
So let's take them one by one.

739
00:38:52,290 --> 00:38:53,180
Supervised learning.

740
00:38:53,180 --> 00:38:54,460
So what is supervised learning?

741
00:38:54,460 --> 00:38:56,960

742
00:38:56,960 --> 00:39:01,320
Anytime you have the data that is
given to you, with the output

743
00:39:01,320 --> 00:39:07,630
explicitly given-- here is the user
and movie, and here is the rating.

744
00:39:07,630 --> 00:39:11,030
Here is the previous customer, and
here is their credit behavior.

745
00:39:11,030 --> 00:39:15,270
It's as if a supervisor is helping you
out, in order to be able to classify

746
00:39:15,270 --> 00:39:16,300
the future ones.

747
00:39:16,300 --> 00:39:18,140
That's why it's called supervised.

748
00:39:18,140 --> 00:39:21,160
Let's take an example of coin
recognition, just to be able to

749
00:39:21,160 --> 00:39:24,110
contrast it with unsupervised
learning in a moment.

750
00:39:24,110 --> 00:39:24,630

751
00:39:24,630 --> 00:39:29,350
Let's say you have a vending machine,
and you would like to make

752
00:39:29,350 --> 00:39:31,670
the system able to
recognize the coins.

753
00:39:31,670 --> 00:39:33,030
So what do you do?

754
00:39:33,030 --> 00:39:36,630
You have physical measurements of the
coin, let's be simplistic and say we

755
00:39:36,630 --> 00:39:39,520
measure the size and mass
of the coin you put in.

756
00:39:39,520 --> 00:39:44,980
Now the coins will be quarters,
nickels, pennies, and dimes.

757
00:39:44,980 --> 00:39:46,800
25, 5, 1, and 10.

758
00:39:46,800 --> 00:39:47,500

759
00:39:47,500 --> 00:39:51,760
And when you put the data in this
diagram, they will belong there.

760
00:39:51,760 --> 00:39:56,640
So the quarters, for example, are
bigger, so they will belong here.

761
00:39:56,640 --> 00:40:00,480
And the dimes in the US currency happen
to be the smallest of them,

762
00:40:00,480 --> 00:40:04,200
so they are smallest here, and there
will be a scatter because of the error

763
00:40:04,200 --> 00:40:07,160
in measurement, because of the exposure
to the elements, and whatnot.

764
00:40:07,160 --> 00:40:10,070
So let's say that this is your
training data, and it's supervised
765
00:40:10,070 --> 00:40:11,830
because things are colored.

766
00:40:11,830 --> 00:40:15,660
I gave you those and told you they
are 25 cents, 5 cents, et cetera.

767
00:40:15,660 --> 00:40:20,040
So you use those in order to train
a system, and the system will then be

768
00:40:20,040 --> 00:40:22,100
able to classify a future one.

769
00:40:22,100 --> 00:40:26,990
For example, if we stick to the
linear approach, you may be able to

770
00:40:26,990 --> 00:40:29,890
find separator lines like those.

771
00:40:29,890 --> 00:40:33,250
And those separator lines will
separate, based on the data, the 10

772
00:40:33,250 --> 00:40:35,510
from the 1 from the
5 from the 25.

773
00:40:35,510 --> 00:40:37,240
And once you have those,

774
00:40:37,240 --> 00:40:39,870
you can bid farewell to the data.
You don't need it anymore.

775
00:40:39,870 --> 00:40:42,960
And when you get a future coin that is
now unlabeled, you don't know what it

776
00:40:42,960 --> 00:40:47,220
is, when the vending machine is actually
working, then the coin will

777
00:40:47,220 --> 00:40:51,090
lie in one region or another, and you're
going to classify it accordingly.

778
00:40:51,090 --> 00:40:53,550
So that is supervised learning.
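
A hedged sketch of that supervised setup, reusing the pla function sketched earlier. The (size, mass) measurements, the labels, and the one-vs-rest reduction to several linear separators are illustrative assumptions, not the lecture's actual data.

```python
import numpy as np

# Hypothetical (size, mass) measurements and the coin values they were labeled with.
X_raw = np.array([[24.3, 5.7], [21.2, 5.0], [19.1, 2.5], [17.9, 2.3]])
labels = np.array([25, 5, 1, 10])

X = np.hstack([np.ones((len(X_raw), 1)), X_raw])   # prepend the bias x0 = 1

# One-vs-rest: train one linear separator per denomination.
weights = {c: pla(X, np.where(labels == c, 1, -1)) for c in np.unique(labels)}

def classify(size, mass):
    # A future, unlabeled coin: report the denomination with the largest score.
    x = np.array([1.0, size, mass])
    return max(weights, key=lambda c: float(weights[c] @ x))
```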

779
00:40:53,550 --> 00:40:56,060
Now let's look at unsupervised
learning.

780
00:40:56,060 --> 00:41:01,490
For unsupervised learning, instead of
having the examples, the training data,

781
00:41:01,490 --> 00:41:05,570
having this form which is the
input plus the correct

782
00:41:05,570 --> 00:41:07,020
target-- the correct output--

783
00:41:07,020 --> 00:41:12,470
the customer and how they behaved
in reality in credit,

784
00:41:12,470 --> 00:41:16,765
we are going to have examples that have
less information, so much less it

785
00:41:16,765 --> 00:41:19,480
is laughable.

786
00:41:19,480 --> 00:41:23,920
I'm just going to tell you
what the input is.

787
00:41:23,920 --> 00:41:27,330
And I'm not going to tell you what
the target function is at all.

788
00:41:27,330 --> 00:41:30,190
I'm not going to tell you anything
about the target function.

789
00:41:30,190 --> 00:41:32,770
I'm just going to tell you, here
is the data of a customer.
790
00:41:32,770 --> 00:41:36,210
Good luck, try to predict the credit.

791
00:41:36,210 --> 00:41:38,300
OK--

792
00:41:38,300 --> 00:41:41,340
How in the world are we
going to do that?

793
00:41:41,340 --> 00:41:44,810
Let me show you that the situation
is not totally hopeless.

794
00:41:44,810 --> 00:41:46,010
That's what I'm going to achieve.

795
00:41:46,010 --> 00:41:48,390
I'm not going to tell you
how to do it completely.

796
00:41:48,390 --> 00:41:51,780
But let me show you that a situation
like that is not totally hopeless.

797
00:41:51,780 --> 00:41:52,620

798
00:41:52,620 --> 00:41:55,330
Let's go for the coin example.

799
00:41:55,330 --> 00:41:56,240

800
00:41:56,240 --> 00:42:01,550
For the coin example, we have
data that looks like this.

801
00:42:01,550 --> 00:42:05,800
If I didn't tell you what the
denominations are, the data

802
00:42:05,800 --> 00:42:08,530
would look like this.
803
00:42:08,530 --> 00:42:09,940
Right?

804
00:42:09,940 --> 00:42:12,220
You have the measurements, but you don't
know, is that a quarter, is

805
00:42:12,220 --> 00:42:14,140
it-- you don't know.

806
00:42:14,140 --> 00:42:17,970
Now honestly, if you look at this
thing, you say I can know

807
00:42:17,970 --> 00:42:19,720
something from this figure.

808
00:42:19,720 --> 00:42:21,740
Things tend to cluster together.

809
00:42:21,740 --> 00:42:25,880
So I may be able to classify those
clusters into categories, without

810
00:42:25,880 --> 00:42:28,440
knowing what the categories are.

811
00:42:28,440 --> 00:42:29,960
That will be quite
an achievement already.

812
00:42:29,960 --> 00:42:33,110
You still don't know whether it's
25 cents, or whatever.

813
00:42:33,110 --> 00:42:36,040
But the data actually made you
able to do something that is

814
00:42:36,040 --> 00:42:38,630
a significant step.

815
00:42:38,630 --> 00:42:42,370
You're going to be able to come
up with these boundaries.
816
00:42:42,370 --> 00:42:43,160

817
00:42:43,160 --> 00:42:46,210
And now, you are so close to
finding the full system.

818
00:42:46,210 --> 00:42:49,470
So unlabeled data actually
can be pretty useful.

819
00:42:49,470 --> 00:42:52,910
Obviously, I have seen the colored
ones, so I actually chose the

820
00:42:52,910 --> 00:42:55,500
boundaries right because I still
remember them visually.

821
00:42:55,500 --> 00:42:58,300
But if you look at the clusters and
you have never heard about that,

822
00:42:58,300 --> 00:43:02,830
especially these guys might not
look like two clusters.

823
00:43:02,830 --> 00:43:04,510
They may look like one cluster.

824
00:43:04,510 --> 00:43:10,260
So it actually could be that this is
ambiguous, and indeed in unsupervised

825
00:43:10,260 --> 00:43:13,900
learning, the number of clusters
is ambiguous at times.

826
00:43:13,900 --> 00:43:16,040

827
00:43:16,040 --> 00:43:18,145
And then, what you do--

828
00:43:18,145 --> 00:43:20,740
829
00:43:20,740 --> 00:43:23,110
this is the output of your system.
Now, I can categorize the

830
00:43:23,110 --> 00:43:24,960
coins into types.

831
00:43:24,960 --> 00:43:28,050
I'm just going to call them
types: type 1, type 2,

832
00:43:28,050 --> 00:43:29,260
type 3, type 4.

833
00:43:29,260 --> 00:43:33,140
I have no idea which belongs to which,
but obviously if someone comes with

834
00:43:33,140 --> 00:43:37,420
a single example of a quarter, a dime,
et cetera, then you are ready to go.

835
00:43:37,420 --> 00:43:37,890

836
00:43:37,890 --> 00:43:40,680
Whereas before, you had to have lots of
examples in order to choose where

837
00:43:40,680 --> 00:43:42,770
exactly to put the boundary.

838
00:43:42,770 --> 00:43:43,880

839
00:43:43,880 --> 00:43:47,600
And this is why a set like that,
which looks like complete

840
00:43:47,600 --> 00:43:50,160
jungle, is actually useful.
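
One standard way to find such clusters is k-means; a minimal sketch follows. This is an illustration only (not necessarily the famous unsupervised algorithm covered later in the course), and note that the number of clusters k has to be chosen by us, which is exactly the ambiguity just mentioned.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Group unlabeled measurements into k clusters (sketch)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial centers
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        new_centers = np.array([
            X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break                                # assignments have stabilized
        centers = new_centers
    return centers, assign                       # "type 1, type 2, ..." labels
```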

841
00:43:50,160 --> 00:43:52,850
Let me give you another interesting
example of unsupervised learning,
842
00:43:52,850 --> 00:43:55,610
where I give you the input without the
output, and you are actually in

843
00:43:55,610 --> 00:43:58,320
a better situation to learn.

844
00:43:58,320 --> 00:44:02,290
Let's say that your company or your
school in this case, is sending you

845
00:44:02,290 --> 00:44:05,190
for a semester in Rio de Janeiro.

846
00:44:05,190 --> 00:44:09,690
So you're very excited, and you
decide that you'd better learn some

847
00:44:09,690 --> 00:44:13,660
Portuguese, in order to be able to
speak the language when you arrive.

848
00:44:13,660 --> 00:44:14,370

849
00:44:14,370 --> 00:44:17,830
Not to worry, when you arrive, there
will be a tutor who teaches you

850
00:44:17,830 --> 00:44:18,400
Portuguese.

851
00:44:18,400 --> 00:44:20,400
But you have a month to go,
and you want to help

852
00:44:20,400 --> 00:44:22,320
yourself as much as possible.

853
00:44:22,320 --> 00:44:26,620
You look around, and you find that the
only resource you have is a radio

854
00:44:26,620 --> 00:44:30,080
station in Portuguese in your car.
855
00:44:30,080 --> 00:44:35,080
So what you do, you just turn
it on whenever you drive.

856
00:44:35,080 --> 00:44:38,680
And for an entire month, you're
bombarded with Portuguese.

857
00:44:38,680 --> 00:44:42,890
"tudo bem", "como vai", "valeu",
stuff like that keeps coming at you.

858
00:44:42,890 --> 00:44:45,550
After a while, without knowing anything--
it's unsupervised, nobody

859
00:44:45,550 --> 00:44:47,250
told you the meaning of any word--

860
00:44:47,250 --> 00:44:50,870
you start to develop a model of
the language in your mind.

861
00:44:50,870 --> 00:44:52,370
You know what the idioms
are, et cetera.

862
00:44:52,370 --> 00:44:54,930
You are very eager to know
what actually "tudo bem"

863
00:44:54,930 --> 00:44:56,350
-- what does that mean?

864
00:44:56,350 --> 00:44:58,380
You are ready to learn, and once
you learn it, it's actually

865
00:44:58,380 --> 00:44:59,780
fixed in your mind.

866
00:44:59,780 --> 00:45:03,130
Then when you go there, you will learn
the language faster than if you didn't

867
00:45:03,130 --> 00:45:05,070
go through this experience.

868
00:45:05,070 --> 00:45:08,320
So you can think of unsupervised
learning, in one way or another, as

869
00:45:08,320 --> 00:45:12,300
a way of getting a higher-level
representation of the input.

870
00:45:12,300 --> 00:45:15,580
Whether it's extremely high level as
in clusters-- you forgot all the

871
00:45:15,580 --> 00:45:19,680
attributes and you just tell me a label,
or higher level as in this-- a better

872
00:45:19,680 --> 00:45:23,620
representation than just the
crude input into some model

873
00:45:23,620 --> 00:45:25,212
in your mind.

874
00:45:25,212 --> 00:45:29,280

875
00:45:29,280 --> 00:45:32,250
Now let's talk about
reinforcement learning.

876
00:45:32,250 --> 00:45:35,430
In this case, it's not as bad
as unsupervised learning.

877
00:45:35,430 --> 00:45:38,970
So again, without the benefit of
supervised learning, you don't get

878
00:45:38,970 --> 00:45:40,810
the correct output.

879
00:45:40,810 --> 00:45:44,550
What you do is-- I will
give you the input.
880
00:45:44,550 --> 00:45:46,750
OK, thank you very much,
that's very kind.

881
00:45:46,750 --> 00:45:48,580
What else?

882
00:45:48,580 --> 00:45:53,450
I'm going to give you some output.

883
00:45:53,450 --> 00:45:54,540
The correct output?

884
00:45:54,540 --> 00:45:55,200
No!

885
00:45:55,200 --> 00:45:56,690
Some output.

886
00:45:56,690 --> 00:46:01,070
OK, that's very nice, but doesn't
seem very helpful.

887
00:46:01,070 --> 00:46:05,100
It looks now like unsupervised learning,
because in unsupervised learning I

888
00:46:05,100 --> 00:46:06,460
could give you some output.

889
00:46:06,460 --> 00:46:08,080
Here is a dime. Oh, it's a quarter.

890
00:46:08,080 --> 00:46:10,490
It's some output!

891
00:46:10,490 --> 00:46:12,740
Such output has no information.

892
00:46:12,740 --> 00:46:16,240
The information comes from the next one.

893
00:46:16,240 --> 00:46:19,520
I'm going to grade this output.
894
00:46:19,520 --> 00:46:21,440
So that is the information
provided to you.

895
00:46:21,440 --> 00:46:26,200
So I'm not explicitly giving you the
output, but when you choose an output,

896
00:46:26,200 --> 00:46:28,900
I'm going to tell you how
well you're doing.

897
00:46:28,900 --> 00:46:31,850
Reinforcement learning is interesting
because it is mostly our own

898
00:46:31,850 --> 00:46:33,450
experience in learning.

899
00:46:33,450 --> 00:46:38,060
Think of a toddler, and a hot
cup of tea in front of her.

900
00:46:38,060 --> 00:46:40,610
She is looking at it, and
she is very curious.

901
00:46:40,610 --> 00:46:43,210
So she reaches to touch. Ouch!

902
00:46:43,210 --> 00:46:44,720
And she starts crying.

903
00:46:44,720 --> 00:46:47,790
The reward is very negative
for trying.

904
00:46:47,790 --> 00:46:51,325
Now next time she looks at it, and she
remembers the previous experience, and

905
00:46:51,325 --> 00:46:52,760
she doesn't touch it.

906
00:46:52,760 --> 00:46:56,120
But there is a certain level of pain,
because there is an unfulfilled

907
00:46:56,120 --> 00:46:57,870
curiosity.

908
00:46:57,870 --> 00:47:01,860
And curiosity killed the cat. After
three or four trials, the toddler

909
00:47:01,860 --> 00:47:02,530
tries again.

910
00:47:02,530 --> 00:47:04,100
Maybe now it's OK.

911
00:47:04,100 --> 00:47:05,420
And Ouch!

912
00:47:05,420 --> 00:47:09,890
Eventually, from just the grade of the
behavior-- to touch it or not to

913
00:47:09,890 --> 00:47:14,290
touch it, the toddler will learn not to
touch cups of tea that have smoke

914
00:47:14,290 --> 00:47:15,350
coming out of them.

915
00:47:15,350 --> 00:47:16,060

916
00:47:16,060 --> 00:47:18,930
So that is a case of
reinforcement learning.

917
00:47:18,930 --> 00:47:22,340
The most important application, or one
of the most important applications of

918
00:47:22,340 --> 00:47:26,650
reinforcement learning, is
in playing games.

919
00:47:26,650 --> 00:47:28,290
920
00:47:28,290 --> 00:47:32,420
So backgammon is one of the games,
and suppose that you want

921
00:47:32,420 --> 00:47:33,600
a system to learn it.

922
00:47:33,600 --> 00:47:40,050
So what you want, you want to take the
current state of the board, and you

923
00:47:40,050 --> 00:47:44,010
roll the dice, and then you decide
what is the optimal move in

924
00:47:44,010 --> 00:47:45,960
order to stand the best chance to win.

925
00:47:45,960 --> 00:47:46,830
That's the game.

926
00:47:46,830 --> 00:47:50,890
So the target function is the
best move given a state.

927
00:47:50,890 --> 00:47:55,680
Now, if I have to generate those things
in order for the system to

928
00:47:55,680 --> 00:48:00,430
learn, then I must be a pretty good
backgammon player already.

929
00:48:00,430 --> 00:48:03,580
So now it's a vicious cycle.

930
00:48:03,580 --> 00:48:06,480
Now, reinforcement learning
comes in handy.

931
00:48:06,480 --> 00:48:09,030
What you're going to do, you
are going to have the

932
00:48:09,030 --> 00:48:11,070
computer choose any output.

933
00:48:11,070 --> 00:48:13,790
A crazy move, for all you care.

934
00:48:13,790 --> 00:48:16,070
And then see what happens eventually.

935
00:48:16,070 --> 00:48:19,280
So this computer is playing against
another computer, both of

936
00:48:19,280 --> 00:48:21,040
them want to learn.

937
00:48:21,040 --> 00:48:24,730
And you make a move, and eventually
you win or lose.

938
00:48:24,730 --> 00:48:28,280
So you propagate back the credit
because of winning or losing,

939
00:48:28,280 --> 00:48:31,780
according to a very specific and
sophisticated formula, into all the

940
00:48:31,780 --> 00:48:34,810
moves that happened.

941
00:48:34,810 --> 00:48:37,570
Now you may think that's completely hopeless,
because maybe this is not the

942
00:48:37,570 --> 00:48:39,750
move that resulted in this,
it's another move.

943
00:48:39,750 --> 00:48:45,390
But always remember, that you are going
to do this 100 billion times.

944
00:48:45,390 --> 00:48:47,130
Not you, the poor computer.

945
00:48:47,130 --> 00:48:49,610
You're sitting down sipping
your tea.

946
00:48:49,610 --> 00:48:53,460
A computer is doing this, playing
against an imaginary opponent, and

947
00:48:53,460 --> 00:48:55,240
they keep playing and
playing and playing.

948
00:48:55,240 --> 00:48:58,530
And in three hours of CPU time, you go
back to the computer-- maybe not three

949
00:48:58,530 --> 00:49:02,180
hours, maybe three days of CPU time--
you go back to the computer, and you

950
00:49:02,180 --> 00:49:03,505
have a backgammon champion.

951
00:49:03,505 --> 00:49:06,430

952
00:49:06,430 --> 00:49:09,960
Actually, that's true.

953
00:49:09,960 --> 00:49:13,900
The world champion, at some point, was
a neural network that learned the way

954
00:49:13,900 --> 00:49:15,880
I described.
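
A crude sketch of the credit-propagation idea. The discounting scheme below is a simplification invented for illustration; the "very specific and sophisticated formula" behind the backgammon result was temporal-difference learning with a neural network (TD-Gammon), which is more elaborate than this.

```python
def propagate_credit(values, won, alpha=0.1, gamma=0.95):
    """Back up a terminal win/loss through the value estimates of the
    states visited in one game; later moves receive more of the credit.

    values: list of current value estimates, one per state visited.
    won: True if this game ended in a win.
    """
    reward = 1.0 if won else -1.0
    updated = list(values)
    for t in range(len(updated)):
        steps_from_end = len(updated) - 1 - t
        target = (gamma ** steps_from_end) * reward   # discounted terminal reward
        updated[t] += alpha * (target - updated[t])   # nudge estimate toward it
    return updated
```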

955
00:49:15,880 --> 00:49:20,730
So it is actually a very attractive
approach, because in machine

956
00:49:20,730 --> 00:49:24,590
learning now, we have a target function
that we cannot model.

957
00:49:24,590 --> 00:49:27,590
That covers a lot of territory,
I've seen a lot of those.
958
00:49:27,590 --> 00:49:29,720
We have data coming from
the target function.

959
00:49:29,720 --> 00:49:30,830

960
00:49:30,830 --> 00:49:32,560
I usually have that.

961
00:49:32,560 --> 00:49:36,010
And now we have the lazy
man's approach to life.

962
00:49:36,010 --> 00:49:39,410
We are going to sit down, and let the
computer do all of the work, and

963
00:49:39,410 --> 00:49:40,830
produce the system we want.

964
00:49:40,830 --> 00:49:44,090
Instead of studying the thing
mathematically, and writing code, and

965
00:49:44,090 --> 00:49:44,900
debugging--

966
00:49:44,900 --> 00:49:46,740
I hate debugging.

967
00:49:46,740 --> 00:49:49,650
And then you go. No,
we're not going to do that.

968
00:49:49,650 --> 00:49:52,550
The learning algorithm just works,
and produces something good.

969
00:49:52,550 --> 00:49:53,040

970
00:49:53,040 --> 00:49:54,490
And we get the check.
971
00:49:54,490 --> 00:49:56,950
So this is a pretty good deal.

972
00:49:56,950 --> 00:50:03,880
It actually is so good, it might
be too good to be true.

973
00:50:03,880 --> 00:50:07,120
So let's actually examine if
all of this was a fantasy.

974
00:50:07,120 --> 00:50:10,590

975
00:50:10,590 --> 00:50:14,080
So now I'm going to give you
a learning puzzle.

976
00:50:14,080 --> 00:50:16,020
Humans are very good learners, right?

977
00:50:16,020 --> 00:50:17,630

978
00:50:17,630 --> 00:50:21,170
So I'm now going to give you a learning
problem in the form that I

979
00:50:21,170 --> 00:50:23,870
described, a supervised
learning problem.

980
00:50:23,870 --> 00:50:28,910
And that supervised learning problem
will give you a training set, some

981
00:50:28,910 --> 00:50:32,300
points mapped to +1, some
points mapped to -1.

982
00:50:32,300 --> 00:50:35,600
And then I'm going to give you
a test point that is unlabeled.

983
00:50:35,600 --> 00:50:41,780
Your task is to look at the examples,
learn the target function, apply it to

984
00:50:41,780 --> 00:50:46,630
the test point, and then decide what
the value of the function is.

985
00:50:46,630 --> 00:50:50,550
After that, I'm going to ask, who
decided that the function is +1,

986
00:50:50,550 --> 00:50:53,130
and who decided that the
function is -1.

987
00:50:53,130 --> 00:50:55,720
OK? It's clear what the deal is.

988
00:50:55,720 --> 00:50:55,730

989
00:50:55,730 --> 00:50:59,680
And I would like our online audience
to do the same thing.

990
00:50:59,680 --> 00:51:02,650
And please text what the solution is.

991
00:51:02,650 --> 00:51:04,900
Just +1 or -1.

992
00:51:04,900 --> 00:51:05,590

993
00:51:05,590 --> 00:51:06,560
Fair enough?

994
00:51:06,560 --> 00:51:07,810
Let's start the game.

995
00:51:07,810 --> 00:51:12,260

996
00:51:12,260 --> 00:51:14,890

997
00:51:14,890 --> 00:51:19,390
What is above the line are
the training examples.

998
00:51:19,390 --> 00:51:23,870
I put the input as a three-by-three
pattern in order to be visually easy

999
00:51:23,870 --> 00:51:24,780
to understand.

1000
00:51:24,780 --> 00:51:28,370
But this is just really nine
bits worth of information.

1001
00:51:28,370 --> 00:51:31,470
And they are ones and zeros,
black and white.

1002
00:51:31,470 --> 00:51:36,640
And for this input, this input, and this
input, the value of the target

1003
00:51:36,640 --> 00:51:38,760
function is -1.

1004
00:51:38,760 --> 00:51:42,160
For this input, this input, and this
input, the value of the target

1005
00:51:42,160 --> 00:51:44,470
function is +1.

1006
00:51:44,470 --> 00:51:47,980
Now this is your data set, this
is your training set.

1007
00:51:47,980 --> 00:51:49,360
Now you should learn the function.

1008
00:51:49,360 --> 00:51:52,980
And when you're done, could you please
tell me what your function will return

1009
00:51:52,980 --> 00:51:54,680
on this test point?
1010
00:51:54,680 --> 00:51:57,130
Is it +1 or -1?

1011
00:51:57,130 --> 00:52:00,480
I will give everybody 30 seconds
before I ask for an answer.

1012
00:52:00,480 --> 00:52:05,330

1013
00:52:05,330 --> 00:52:06,930
Maybe we should have some
background music?

1014
00:52:06,930 --> 00:52:13,680

1015
00:52:13,680 --> 00:52:14,930

1016
00:52:14,930 --> 00:52:20,400

1017
00:52:20,400 --> 00:52:22,680
OK, time's up.

1018
00:52:22,680 --> 00:52:24,850
Your learning algorithm
has converged, I hope.

1019
00:52:24,850 --> 00:52:30,835
And now we apply it here, and I ask
people here, who says it's +1?

1020
00:52:30,835 --> 00:52:34,120

1021
00:52:34,120 --> 00:52:35,180
Thank you.

1022
00:52:35,180 --> 00:52:37,810
Who says it's -1?

1023
00:52:37,810 --> 00:52:39,270
Thank you.
1024
00:52:39,270 --> 00:52:42,020
I see that the online audience
also contributed?

1025
00:52:42,020 --> 00:52:44,070
MODERATOR: Yeah, the big
majority says +1.

1026
00:52:44,070 --> 00:52:45,950
PROFESSOR: But
are there -1's?

1027
00:52:45,950 --> 00:52:46,840
MODERATOR: Two -1's.

1028
00:52:46,840 --> 00:52:47,300

1029
00:52:47,300 --> 00:52:48,320
PROFESSOR: Cool.

1030
00:52:48,320 --> 00:52:49,050

1031
00:52:49,050 --> 00:52:50,990
I don't care if it's
a +1 or -1.

1032
00:52:50,990 --> 00:52:54,270
What I care about is that
I get both answers.

1033
00:52:54,270 --> 00:52:55,810
That is the essence of it.

1034
00:52:55,810 --> 00:52:57,280
Why do I care?

1035
00:52:57,280 --> 00:53:00,760
Because in reality, this
is an impossible task.

1036
00:53:00,760 --> 00:53:03,470
1037
00:53:03,470 --> 00:53:06,090
I told you the target
function is unknown.

1038
00:53:06,090 --> 00:53:11,110
It could be anything,
really anything.

1039
00:53:11,110 --> 00:53:15,740
And now I give you the value of the
target function at 6 points.

1040
00:53:15,740 --> 00:53:19,470
Well, there are many functions that
fit those 6 points, and behave

1041
00:53:19,470 --> 00:53:21,470
differently outside.

1042
00:53:21,470 --> 00:53:32,400
For example, if you take the function
to be +1 if the top left square

1043
00:53:32,400 --> 00:53:40,510
is white, then this should
be -1, right?

1044
00:53:40,510 --> 00:53:49,880
If you take the function to be +1
if the pattern is symmetric--

1045
00:53:49,880 --> 00:53:52,700
let's see, I said it
the other way around.

1046
00:53:52,700 --> 00:53:57,430
So the top one is black,
it would be -1.

1047
00:53:57,430 --> 00:53:58,850
So this would be -1.

1048
00:53:58,850 --> 00:54:00,680
If it's symmetric, it would be +1.

1049
00:54:00,680 --> 00:54:03,420
So this would be +1, because
this guy has both-- this is

1050
00:54:03,420 --> 00:54:05,380
black, and also it is symmetric.

1051
00:54:05,380 --> 00:54:06,230
Right?

1052
00:54:06,230 --> 00:54:09,320
And you can find infinite
variety like that.
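
To make that concrete, here is a small sketch of two hypotheses of the kind just described. The 3x3 test pattern below is made up (the slide's actual patterns are not reproduced here); the point is that two rules can agree on finitely many training examples and still disagree on a new point.

```python
import numpy as np

def f_top_left(p):
    # -1 if the top-left square is black (1), else +1.
    return -1 if p[0, 0] == 1 else +1

def f_symmetric(p):
    # +1 if the pattern is left-right symmetric, else -1.
    return +1 if np.array_equal(p, p[:, ::-1]) else -1

# A made-up test pattern: black top-left corner, and left-right symmetric.
test = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 1]])
print(f_top_left(test), f_symmetric(test))   # -1 versus +1: the rules disagree
```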

1053
00:54:09,320 --> 00:54:12,310
And that problem is not restricted
to this case.

1054
00:54:12,310 --> 00:54:14,300

1055
00:54:14,300 --> 00:54:15,530
The question here is obvious.

1056
00:54:15,530 --> 00:54:17,010
The function is unknown.

1057
00:54:17,010 --> 00:54:18,300
You really mean unknown, right?

1058
00:54:18,300 --> 00:54:19,430
Yes, I mean it.

1059
00:54:19,430 --> 00:54:20,260
Unknown-- anything?

1060
00:54:20,260 --> 00:54:21,280
Yes, I do.

1061
00:54:21,280 --> 00:54:22,160
OK.

1062
00:54:22,160 --> 00:54:26,110
You give me a finite sample,
it can be anything outside.

1063
00:54:26,110 --> 00:54:30,750
How in the world am I going to tell
what the learning outside is?

1064
00:54:30,750 --> 00:54:33,720
OK, that sounds about right.

1065
00:54:33,720 --> 00:54:37,150
But we are in trouble, because that's
the premise of learning.

1066
00:54:37,150 --> 00:54:41,400
If the goal was to memorize the examples
I gave you, that would be

1067
00:54:41,400 --> 00:54:43,780
memorizing, not learning.

1068
00:54:43,780 --> 00:54:48,110
Learning is to figure out a pattern
that applies outside.

1069
00:54:48,110 --> 00:54:53,370
And now we realize that outside,
I cannot say anything.

1070
00:54:53,370 --> 00:54:56,641
Does this mean that learning
is doomed?

1071
00:54:56,641 --> 00:55:00,860
Well, this is going to be
a very short course!

1072
00:55:00,860 --> 00:55:06,230
Well, the good news is that learning
is alive and well.

1073
00:55:06,230 --> 00:55:13,420
And we are going to show that, without
compromising our basic premise.

1074
00:55:13,420 --> 00:55:18,320
The target function will
continue to be unknown.

1075
00:55:18,320 --> 00:55:21,440
And we still mean unknown.

1076
00:55:21,440 --> 00:55:24,390
And we will be able to learn.

1077
00:55:24,390 --> 00:55:27,620
And that will be the subject
of the next lecture.

1078
00:55:27,620 --> 00:55:32,410
Right now, we are going to go for
a short break, after which we are going

1079
00:55:32,410 --> 00:55:40,150
to take the Q&A.

1080
00:55:40,150 --> 00:55:43,270

1081
00:55:43,270 --> 00:55:49,350
We'll start the Q&A, and we will get
questions from the class here, and

1082
00:55:49,350 --> 00:55:51,270
from the online audience.

1083
00:55:51,270 --> 00:55:56,160
And if you'd like to ask a question, let
me ask you to go to this side of

1084
00:55:56,160 --> 00:56:00,630
the room where the mic is, so that
your question can be heard.

1085
00:56:00,630 --> 00:56:04,680
And we will alternate, if there are
questions here, we will alternate

1086
00:56:04,680 --> 00:56:07,540
between campus and off campus.

1087
00:56:07,540 --> 00:56:11,030
So let me start if there is
a question from outside.

1088
00:56:11,030 --> 00:56:16,080
MODERATOR: Yes, so the most common
question is, how do you determine if

1089
00:56:16,080 --> 00:56:19,050
a set of points is linearly
separable, and what do you do

1090
00:56:19,050 --> 00:56:20,730
if they're not separable?

1091
00:56:20,730 --> 00:56:26,120
PROFESSOR: The linear separability
assumption is a very

1092
00:56:26,120 --> 00:56:29,850
simplistic assumption, and mostly doesn't
apply in practice.

1093
00:56:29,850 --> 00:56:34,780
And I chose it only because it goes with
a very simple algorithm, which is

1094
00:56:34,780 --> 00:56:36,950
the perceptron learning algorithm.

1095
00:56:36,950 --> 00:56:40,850
There are two ways to deal with the
case of linear inseparability.

1096
00:56:40,850 --> 00:56:44,450
There are algorithms, and most
algorithms actually deal with that

1097
00:56:44,450 --> 00:56:49,730
case, and there's also a technique that
we are going to study next

1098
00:56:49,730 --> 00:56:55,330
week, which will take a set of points
which is not linearly separable, and

1099
00:56:55,330 --> 00:56:59,150
create a mapping that makes
them linearly separable.

1100
00:56:59,150 --> 00:57:02,050
So there is a way to deal with it.
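
A sketch of that mapping idea; the circle-shaped decision boundary is an assumption chosen for illustration, and the actual technique is developed later in the course.

```python
import numpy as np

# Points separable by a circle centered at the origin are not linearly
# separable in (x1, x2), but they become linearly separable in z-space.
def transform(X):
    # z = (1, x1^2, x2^2): a circle in x-space becomes a line in z-space.
    return np.hstack([np.ones((len(X), 1)), X[:, :1] ** 2, X[:, 1:2] ** 2])

# A linear algorithm can then be run on transform(X) instead of X,
# e.g. w = pla(transform(X), y) with the pla sketch given earlier.
```
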
1101
00:57:02,050 --> 00:57:05,990
However, the question how do you
determine it's linearly separable, the

1102
00:57:05,990 --> 00:57:09,240
right way of doing it in practice is
that, when someone gives you data, you

1103
00:57:09,240 --> 00:57:11,870
assume in general it's not
linearly separable.

1104
00:57:11,870 --> 00:57:15,310
It will hardly ever be, and therefore
you take techniques that can deal with

1105
00:57:15,310 --> 00:57:16,630
that case as well.

1106
00:57:16,630 --> 00:57:20,100
There is a simple modification of the
perceptron learning algorithm, which

1107
00:57:20,100 --> 00:57:21,670
is called the pocket algorithm,

1108
00:57:21,670 --> 00:57:26,190
that applies the same rule with a very
minor modification, and deals with the

1109
00:57:26,190 --> 00:57:29,460
case where the data is not separable.

1110
00:57:29,460 --> 00:57:34,820
However, if you apply the perceptron
learning algorithm, that is guaranteed

1111
00:57:34,820 --> 00:57:38,990
to converge to a correct solution in the
case of linear separability, and

1112
00:57:38,990 --> 00:57:43,520
you apply it to data that is not
linearly separable, bad things happen.

1113
00:57:43,520 --> 00:57:46,800
Not only is it not going to converge--
obviously it is not going to converge

1114
00:57:46,800 --> 00:57:50,600
because it terminates when there are
no misclassified points, right?

1115
00:57:50,600 --> 00:57:53,640
If there is a misclassified point, then
there's a next iteration always.

1116
00:57:53,640 --> 00:57:56,500
So since the data is not linearly
separable, we will never come to

1117
00:57:56,500 --> 00:57:59,110
a point where all the points
are classified correctly.

1118
00:57:59,110 --> 00:58:01,220
So this is not what is bothering us.

1119
00:58:01,220 --> 00:58:04,570
What is bothering us is that, as you go
from one step to another, you can

1120
00:58:04,570 --> 00:58:08,040
go from a very good solution
to a terrible solution.

1121
00:58:08,040 --> 00:58:10,450
In the case of no linear separability.

1122
00:58:10,450 --> 00:58:13,530
So it's not an algorithm that you
would like to use, and just

1123
00:58:13,530 --> 00:58:15,650
terminate by force at an iteration.

1124
00:58:15,650 --> 00:58:21,360
A modification of it can be used this
way, and I'll mention it briefly when

1125
00:58:21,360 --> 00:58:26,120
we talk about linear regression
and other linear methods.

1126
00:58:26,120 --> 00:58:29,770
MODERATOR: There's also a question of
how does the rate of convergence of

1127
00:58:29,770 --> 00:58:33,810
the perceptron change with the
dimensionality of the data?

1128
00:58:33,810 --> 00:58:35,840
PROFESSOR: Badly!

1129
00:58:35,840 --> 00:58:37,200
That's the answer.

1130
00:58:37,200 --> 00:58:38,440
Let me put it this way.

1131
00:58:38,440 --> 00:58:42,000
You can build pathological cases, where
it really will take forever.

1132
00:58:42,000 --> 00:58:45,230
However, I did not give the perceptron
learning algorithm in the first

1133
00:58:45,230 --> 00:58:47,900
lecture to tell you that this is
the great algorithm that you

1134
00:58:47,900 --> 00:58:49,160
need to learn.

1135
00:58:49,160 --> 00:58:51,720
I gave it in the first lecture,
because this is the simplest

1136
00:58:51,720 --> 00:58:53,550
algorithm I could give.

1137
00:58:53,550 --> 00:58:56,990
By the end of this course,
you'll be saying, what?

1138
00:58:56,990 --> 00:58:57,650
Perceptron?

1139
00:58:57,650 --> 00:58:58,880
Never heard of it.

1140
00:58:58,880 --> 00:59:02,710
So it will go out of contention, after we
get to the more interesting stuff.

1141
00:59:02,710 --> 00:59:03,240

1142
00:59:03,240 --> 00:59:07,090
But as a method that can be used, it
indeed can be used, and can be

1143
00:59:07,090 --> 00:59:09,710
explained in five minutes
as you have seen.

1144
00:59:09,710 --> 00:59:15,050
MODERATOR: Regarding the items for
learning, you mentioned that there

1145
00:59:15,050 --> 00:59:15,900
must be a pattern.

1146
00:59:15,900 --> 00:59:18,400
So can you be more specific about that?

1147
00:59:18,400 --> 00:59:20,590
How do you know if there's a pattern?

1148
00:59:20,590 --> 00:59:21,940
PROFESSOR: You don't.

1149
00:59:21,940 --> 00:59:25,840
My answers seem to be very abrupt,
but that's the way it is.

1150
00:59:25,840 --> 00:59:29,680
When we get to the theory--
is learning feasible-- it will

1151
00:59:29,680 --> 00:59:34,060
become very clear that there is
a separation between the target

1152
00:59:34,060 --> 00:59:35,730
function-- there is
a pattern to detect--

1153
00:59:35,730 --> 00:59:37,150
and whether we can learn it.

1154
00:59:37,150 --> 00:59:40,150
It is very difficult for me to explain
it in two minutes, it will take a full

1155
00:59:40,150 --> 00:59:41,500
lecture to get there.

1156
00:59:41,500 --> 00:59:47,600
But the essence of it is that you take
the data, you apply your learning

1157
00:59:47,600 --> 00:59:52,710
algorithm, and there is something you
can explicitly detect that will

1158
00:59:52,710 --> 00:59:54,890
tell you whether you learned or not.

1159
00:59:54,890 --> 00:59:57,630
So in some cases, you're not
going to be able to learn.

1160
00:59:57,630 --> 00:59:59,890
In some cases, you'll be able to learn.

1161
00:59:59,890 --> 01:00:02,630
And the key is that you're going
to be able to tell by

1162
01:00:02,630 --> 01:00:04,440
running your algorithm.

1163
01:00:04,440 --> 01:00:07,280
And I'm going to explain that
in more details later on.

1164
01:00:07,280 --> 01:00:08,010
1165
01:00:08,010 --> 01:00:15,660
So basically, I'm also resisting
taking the data, deciding

1166
01:00:15,660 --> 01:00:19,220
whether it's linearly separable, looking
at it and seeing. You will

1167
01:00:19,220 --> 01:00:25,370
realize as we go through that it's
a no-no to actually look at the data.

1168
01:00:25,370 --> 01:00:26,860
What?

1169
01:00:26,860 --> 01:00:29,580
That's what data is for, to look at.

1170
01:00:29,580 --> 01:00:30,850
Bear with me.

1171
01:00:30,850 --> 01:00:34,720
We will come to the level where we ask
why don't we look at the data--

1172
01:00:34,720 --> 01:00:37,920
just looking at it and then saying:
It's linearly separable.

1173
01:00:37,920 --> 01:00:39,890
Let's pick the perceptron.

1174
01:00:39,890 --> 01:00:42,370
That's bad practice, for reasons
that are not obvious now.

1175
01:00:42,370 --> 01:00:45,460
They will become obvious, once we
are done with the theory.

1176
01:00:45,460 --> 01:00:50,330
So when someone knocks on my door with
a set of data, I can ask them all

1177
01:00:50,330 --> 01:00:54,360
kinds of questions about the data-- not
the particular data set that they gave

1178
01:00:54,360 --> 01:00:57,750
me, but about the general data that
is generated by their process.

1179
01:00:57,750 --> 01:01:00,570
They can tell me this variable is
important, the function is symmetric,

1180
01:01:00,570 --> 01:01:04,210
they can give you all kinds of
information that I will take to heart.

1181
01:01:04,210 --> 01:01:08,730
But I will try, as much as I can, to
avoid looking at the particular data

1182
01:01:08,730 --> 01:01:14,680
set that they gave me, lest I should
tailor my system toward this data set,

1183
01:01:14,680 --> 01:01:17,680
and be disappointed when another
data set comes about.

1184
01:01:17,680 --> 01:01:20,100
You don't want to get too
close to the data set.

1185
01:01:20,100 --> 01:01:24,190
This will become very clear
as we go with the theory.

1186
01:01:24,190 --> 01:01:27,190
MODERATOR: In general about
machine learning, how does it

1187
01:01:27,190 --> 01:01:30,550
relate to other statistical, especially
econometric techniques?

1188
01:01:30,550 --> 01:01:33,090

1189
01:01:33,090 --> 01:01:37,150
PROFESSOR: Statistics is, in
the form I said, it's machine

1190
01:01:37,150 --> 01:01:38,710
learning where the target--

1191
01:01:38,710 --> 01:01:42,010
it's not a function in this case--
is a probability distribution.

1192
01:01:42,010 --> 01:01:44,670
Statistics is a mathematical field.

1193
01:01:44,670 --> 01:01:49,100
And therefore, you put the assumptions
that you need in order to be able to

1194
01:01:49,100 --> 01:01:53,970
rigorously prove the results you have,
and get the results in detail.

1195
01:01:53,970 --> 01:01:55,700
For example, linear regression.

1196
01:01:55,700 --> 01:01:59,810
When we talk about linear regression, it
will have very few assumptions, and

1197
01:01:59,810 --> 01:02:03,150
the results will apply to a wide range,
because we didn't make too many

1198
01:02:03,150 --> 01:02:04,330
assumptions.

1199
01:02:04,330 --> 01:02:07,530
When you study linear regression under
statistics, there is a lot of

1200
01:02:07,530 --> 01:02:11,020
mathematics that goes with it, lot of
assumptions, because that is the

1201
01:02:11,020 --> 01:02:12,640
purpose of the field.
1202
01:02:12,640 --> 01:02:18,560
In general, machine learning tries to make
the least assumptions and cover the

1203
01:02:18,560 --> 01:02:22,090
most territory. These go together.

1204
01:02:22,090 --> 01:02:25,640
So it is not a mathematical discipline,
but it's not a purely

1205
01:02:25,640 --> 01:02:26,850
applied discipline.

1206
01:02:26,850 --> 01:02:31,270
It spans both the mathematical, to
certain extent, but it is willing to

1207
01:02:31,270 --> 01:02:35,870
actually go into territory where we
don't have mathematical models, and

1208
01:02:35,870 --> 01:02:38,040
still want to apply our techniques.

1209
01:02:38,040 --> 01:02:40,600
So that is what characterizes
it the most.

1210
01:02:40,600 --> 01:02:44,120
And then there are other fields.
If you look for machine learning,

1211
01:02:44,120 --> 01:02:46,400
you can find it under the name
computational learning,

1212
01:02:46,400 --> 01:02:48,090
or statistical learning.

1213
01:02:48,090 --> 01:02:52,120
Data mining has a huge intersection
with machine learning.

1214
01:02:52,120 --> 01:02:56,020
There are lots of disciplines around
that actually share some value.

1215
01:02:56,020 --> 01:02:59,630
But the point is, the premise that you
saw is so broad, that it shouldn't be

1216
01:02:59,630 --> 01:03:03,690
surprising that people at different times
developed a particular discipline

1217
01:03:03,690 --> 01:03:06,840
with its own jargon, to deal
with that discipline.

1218
01:03:06,840 --> 01:03:13,990
So what I'm giving you is machine
learning as the mainstream goes, and

1219
01:03:13,990 --> 01:03:17,520
that can be applied as widely as
possible to applications, both

1220
01:03:17,520 --> 01:03:20,900
practical applications and
scientific applications.

1221
01:03:20,900 --> 01:03:24,870
You will see, here is a situation, I
have an experiment, here is a target,

1222
01:03:24,870 --> 01:03:25,980
I have the data.

1223
01:03:25,980 --> 01:03:28,640
How do I produce the target
in the best way I want?

1224
01:03:28,640 --> 01:03:32,010
And then you apply machine learning.

1225
01:03:32,010 --> 01:03:36,180
MODERATOR: Also, in a general
question about machine learning.

1226
01:03:36,180 --> 01:03:36,190
1227
01:03:36,190 --> 01:03:42,370
Do machine learning algorithms perform
global optimization methods,

1228
01:03:42,370 --> 01:03:45,810
or just local optimization methods?

1229
01:03:45,810 --> 01:03:47,640
PROFESSOR: Obviously,
a general question.

1230
01:03:47,640 --> 01:03:48,470

1231
01:03:48,470 --> 01:03:52,120
Optimization is a tool
for machine learning.

1232
01:03:52,120 --> 01:03:56,340
So we will pick whatever optimization
that does the job for us.

1233
01:03:56,340 --> 01:03:59,440
And sometimes, there is a very
specific optimization method.

1234
01:03:59,440 --> 01:04:01,470
For example, in support vector
machines, it will be quadratic

1235
01:04:01,470 --> 01:04:01,990
programming.

1236
01:04:01,990 --> 01:04:04,050
It happens to be the one
that works with that.

1237
01:04:04,050 --> 01:04:08,190
But optimization is not something
that machine learning people

1238
01:04:08,190 --> 01:04:10,000
study for its own sake.

1239
01:04:10,000 --> 01:04:12,840
They obviously study it to understand
it better, and to choose the correct

1240
01:04:12,840 --> 01:04:14,900
optimization method.

1241
01:04:14,900 --> 01:04:17,780
Now, the question is alluding
to something that will

1242
01:04:17,780 --> 01:04:21,220
become clear when we talk about neural
networks, which is local minimum versus

1243
01:04:21,220 --> 01:04:22,830
global minimum.

1244
01:04:22,830 --> 01:04:26,680
And it is impossible to put this in
any perspective before we get the

1245
01:04:26,680 --> 01:04:29,120
details of neural networks,
so I will defer that until

1246
01:04:29,120 --> 01:04:32,850
we get to that lecture.

1247
01:04:32,850 --> 01:04:37,530
MODERATOR: Also, this is
a math question, I guess.

1248
01:04:37,530 --> 01:04:42,470
Is the hypothesis set, in a topological
sense, continuous?

1249
01:04:42,470 --> 01:04:47,160
PROFESSOR: The hypothesis
set can be anything, in principle.

1250
01:04:47,160 --> 01:04:50,500
So it can be continuous,
and it can be discrete.

1251
01:04:50,500 --> 01:04:53,710
For example, in the next lecture I take
the simplest case where we have
1252
01:04:53,710 --> 01:04:57,610
a finite hypothesis set, in order
to make a certain point.

1253
01:04:57,610 --> 01:05:00,610
In reality, almost all the hypothesis
sets that you find are

1254
01:05:00,610 --> 01:05:02,580
continuous and infinite.

1255
01:05:02,580 --> 01:05:04,170
Very infinite!

1256
01:05:04,170 --> 01:05:10,190
And the level of sophistication
of the hypothesis set can be huge.

1257
01:05:10,190 --> 01:05:15,440
And nonetheless, we will be able to see
that under one condition, which

1258
01:05:15,440 --> 01:05:19,307
comes from the theory, we'll be able to
learn even if the hypothesis set is

1259
01:05:19,307 --> 01:05:23,580
huge and complicated.

1260
01:05:23,580 --> 01:05:26,340
There's a question from inside, yes?

1261
01:05:26,340 --> 01:05:32,930
STUDENT: I think I understood, more or
less, the general idea, but I don't

1262
01:05:32,930 --> 01:05:37,160
understand the second example
you gave about credit approval.

1263
01:05:37,160 --> 01:05:41,200
So how do we collect our data?

1264
01:05:41,200 --> 01:05:46,210
Should we give credit to everyone, or
should we make our data biased,
1265
01:05:46,210 --> 01:05:51,170
because we cannot determine
the data of--

1266
01:05:51,170 --> 01:05:57,480
we can't determine, should we give credit
or not to persons we rejected?

1267
01:05:57,480 --> 01:05:58,030
PROFESSOR: Correct.

1268
01:05:58,030 --> 01:06:04,465
This is a good point. Every time
someone asks a question, the

1269
01:06:04,465 --> 01:06:05,590
lecture number comes to my mind.

1270
01:06:05,590 --> 01:06:07,570
I know when I'm going
to talk about it.

1271
01:06:07,570 --> 01:06:10,410
So what you describe is
called sampling bias.

1272
01:06:10,410 --> 01:06:12,190
And I will describe it in detail.

1273
01:06:12,190 --> 01:06:18,450
But when you use the biased data, let's
say the bank uses historical records.

1274
01:06:18,450 --> 01:06:22,450
So it sees the people who applied and
were accepted, and for those guys, it

1275
01:06:22,450 --> 01:06:26,030
can actually predict what the credit
behavior is, because it has their

1276
01:06:26,030 --> 01:06:26,700
credit history.

1277
01:06:26,700 --> 01:06:30,000
They charged and repaid and maxed
out, and all of this.

1278
01:06:30,000 --> 01:06:32,590
And then they decide: is this
a good customer or not?

1279
01:06:32,590 --> 01:06:36,400
For those who were rejected, there's
really no way to tell in this case

1280
01:06:36,400 --> 01:06:38,870
whether they were falsely rejected,
that they would have been good

1281
01:06:38,870 --> 01:06:40,220
customers or not.

1282
01:06:40,220 --> 01:06:44,050
Nonetheless, if you take the customer
base that you have, and base your

1283
01:06:44,050 --> 01:06:48,070
decision on it, the boundary
works fairly decently.

1284
01:06:48,070 --> 01:06:51,300
Actually, pretty decently, even for the
other guys, because the other guys

1285
01:06:51,300 --> 01:06:55,060
usually are deeper into the
classification region than the

1286
01:06:55,060 --> 01:06:57,940
boundary guys that you accepted,
and turned out to be bad.

1287
01:06:57,940 --> 01:06:58,810

1288
01:06:58,810 --> 01:07:01,000
But the point is well taken.

1289
01:07:01,000 --> 01:07:04,390
The data set in this case is not
completely representative, and there
1290
01:07:04,390 --> 01:07:07,750
is a particular principle in learning
that we'll talk about, which is

1291
01:07:07,750 --> 01:07:11,400
sampling bias, that deals
with this case.

1292
01:07:11,400 --> 01:07:14,270
Another question from here?

1293
01:07:14,270 --> 01:07:17,420
STUDENT: You explained that we need
to have a lot of data to learn.

1294
01:07:17,420 --> 01:07:22,050
So how do you decide how much
data is required for

1295
01:07:22,050 --> 01:07:26,980
a particular problem, in order to be
able to come up with a reasonable--

1296
01:07:26,980 --> 01:07:27,930
PROFESSOR: Good question.

1297
01:07:27,930 --> 01:07:31,710
So let me tell you the theoretical,
and the practical answer.

1298
01:07:31,710 --> 01:07:36,340
The theoretical answer is that this is
exactly the crux of the theory part

1299
01:07:36,340 --> 01:07:37,810
that we're going to talk about.

1300
01:07:37,810 --> 01:07:38,350

1301
01:07:38,350 --> 01:07:40,950
And in the theory, we are going
to see, can we learn?

1302
01:07:40,950 --> 01:07:43,120
And how much data.

1303
01:07:43,120 --> 01:07:46,150
So all of this will be answered
in a mathematical way.

1304
01:07:46,150 --> 01:07:48,020
So this is the theoretical answer.

1305
01:07:48,020 --> 01:07:52,770
The practical answer is: that's
not under your control.

1306
01:07:52,770 --> 01:07:57,180
When someone knocks on your door: Here
is the data, I have 500 points.

1307
01:07:57,180 --> 01:08:00,170
I tell him, I will give you
a fantastic system if you

1308
01:08:00,170 --> 01:08:02,200
just give me 2000.

1309
01:08:02,200 --> 01:08:05,000
But I don't have 2000, I have 500.

1310
01:08:05,000 --> 01:08:09,040
So now you go and you use your theory
to do something to your system, such

1311
01:08:09,040 --> 01:08:11,000
that it can work with the 500.

1312
01:08:11,000 --> 01:08:11,710

1313
01:08:11,710 --> 01:08:12,600
There was one case--

1314
01:08:12,600 --> 01:08:16,930
I worked with data in different
applications--

1315
01:08:16,930 --> 01:08:20,330
at some point, we had almost
100 million points.

1316
01:08:20,330 --> 01:08:21,760
You were swimming in data.

1317
01:08:21,760 --> 01:08:23,279
You wouldn't complain about data.

1318
01:08:23,279 --> 01:08:25,200
Data was wonderful.

1319
01:08:25,200 --> 01:08:28,779
And in another case, there were
less than 100 points.

1320
01:08:28,779 --> 01:08:31,890
And you had to deal with
the data with gloves!

1321
01:08:31,890 --> 01:08:36,290
Because if you use them the wrong way,
they are contaminated, which is

1322
01:08:36,290 --> 01:08:38,970
an expression we will see, and
then you have nothing.

1323
01:08:38,970 --> 01:08:43,029
And you will produce a system, and you
are proud of it, but you have no idea

1324
01:08:43,029 --> 01:08:44,540
whether it will perform well or not.

1325
01:08:44,540 --> 01:08:46,899
And you cannot give this to the customer,
and have the customer come

1326
01:08:46,899 --> 01:08:49,300
back to you and say: what did you do!?

1327
01:08:49,300 --> 01:08:49,760

1328
01:08:49,760 --> 01:08:55,490
So there is a question of, what
performance can you do given

1329
01:08:55,490 --> 01:08:57,090
what data size you have?

1330
01:08:57,090 --> 01:09:00,520
But in practice, you really have no
control over the data size in almost

1331
01:09:00,520 --> 01:09:03,140
all the cases, almost all
the practical cases.

1332
01:09:03,140 --> 01:09:05,960
Yes?

1333
01:09:05,960 --> 01:09:10,540
STUDENT: Another question I have
is regarding the hypothesis set.

1334
01:09:10,540 --> 01:09:13,729
So the larger the hypothesis set
is, probably I'll be able to

1335
01:09:13,729 --> 01:09:15,649
better fit the data.

1336
01:09:15,649 --> 01:09:20,420
But that, as you were explaining, might
be a bad thing to do because

1337
01:09:20,420 --> 01:09:23,460
when the new data point comes,
there might be troubles.

1338
01:09:23,460 --> 01:09:25,210
So how do you decide
the size of your--

1339
01:09:25,210 --> 01:09:27,680
PROFESSOR: You are asking all
the right questions, and all of

1340
01:09:27,680 --> 01:09:28,350
them are coming up.

1341
01:09:28,350 --> 01:09:32,330
This is again part of the theory,
but let me try to explain this.

1342
01:09:32,330 --> 01:09:35,420
As we mentioned, learning is about
being able to predict.

1343
01:09:35,420 --> 01:09:40,510
So you are using the data, not to
memorize it, but to figure out what

1344
01:09:40,510 --> 01:09:42,130
the pattern is.

1345
01:09:42,130 --> 01:09:45,100
And if you figure out a pattern that
applies to all the data, and it's

1346
01:09:45,100 --> 01:09:47,216
a reasonable pattern, then you
have a chance that it

1347
01:09:47,216 --> 01:09:49,340
will generalize outside.

1348
01:09:49,340 --> 01:09:53,880
Now the problem is that, if I give you
50 points, and you use a 7000th-order

1349
01:09:53,880 --> 01:09:57,360
polynomial, you will fit the
heck out of the data.

1350
01:09:57,360 --> 01:10:01,160
You will fit it so much with so many
degrees of freedom to spare, but you

1351
01:10:01,160 --> 01:10:02,070
haven't learned anything.

1352
01:10:02,070 --> 01:10:04,610
You just memorized it in a fancy way.

1353
01:10:04,610 --> 01:10:08,500
You put it in a polynomial form, and
that actually carries all the
1354
01:10:08,500 --> 01:10:10,400
information about the
data that you have,

1355
01:10:10,400 --> 01:10:11,890
and then some.

1356
01:10:11,890 --> 01:10:15,280
So you don't expect at all that
this will generalize outside.

1357
01:10:15,280 --> 01:10:18,450
And that intuitive observation
will be formalized when we

1358
01:10:18,450 --> 01:10:19,580
talk about the theory.

1359
01:10:19,580 --> 01:10:22,930
There will be a measurement of the
hypothesis set that you give me, that

1360
01:10:22,930 --> 01:10:25,550
measures the sophistication of it,
and will tell you with that

1361
01:10:25,550 --> 01:10:28,850
sophistication, you need that amount
of data in order to be able to make

1362
01:10:28,850 --> 01:10:30,430
any statement about generalization.

1363
01:10:30,430 --> 01:10:31,680
So that is what the theory is about.

1364
01:10:31,680 --> 01:10:34,650

1365
01:10:34,650 --> 01:10:37,880
STUDENT: Suppose, I mean, here
whatever we discussed, it is like I

1366
01:10:37,880 --> 01:10:42,930
had a data set and I came up with
an algorithm, and gave the output.

1367
01:10:42,930 --> 01:10:48,690
But won't it be also important to see,
OK, we came up with the output, and

1368
01:10:48,690 --> 01:10:52,790
using that, what was the feedback?

1369
01:10:52,790 --> 01:10:57,690
Are there techniques where you take
the feedback and try to

1370
01:10:57,690 --> 01:10:58,980
correct your--

1371
01:10:58,980 --> 01:11:03,360
PROFESSOR: You are alluding
to different techniques here.

1372
01:11:03,360 --> 01:11:07,740
But one of them would be validation,
which is after you learn, you validate

1373
01:11:07,740 --> 01:11:09,360
your solution.

1374
01:11:09,360 --> 01:11:13,000
And this is an extremely established and
core technique in machine learning

1375
01:11:13,000 --> 01:11:16,870
that will be covered in
one of the lectures.

1376
01:11:16,870 --> 01:11:18,810
Any questions from the online audience?

1377
01:11:18,810 --> 01:11:25,780
MODERATOR: In practice, how many
dimensions would be considered easy,

1378
01:11:25,780 --> 01:11:28,730
medium, and hard for
a perceptron problem?

1379
01:11:28,730 --> 01:11:31,100
PROFESSOR: The hard part,

1380
01:11:31,100 --> 01:11:34,850
in most people's minds before they
get into machine learning, is the

1381
01:11:34,850 --> 01:11:36,420
computational time.

1382
01:11:36,420 --> 01:11:38,800
If something takes a lot of time,
then it's a hard problem.

1383
01:11:38,800 --> 01:11:42,800
If something can be computed quickly,
it's an easy problem.

1384
01:11:42,800 --> 01:11:47,210
For machine learning, the bottleneck
in my case, has never been the

1385
01:11:47,210 --> 01:11:51,340
computation time, even in
incredibly big data sets.

1386
01:11:51,340 --> 01:11:55,410
The bottleneck for machine learning is
to be able to generalize outside the

1387
01:11:55,410 --> 01:11:56,990
data that you have seen.

1388
01:11:56,990 --> 01:12:01,790
So to answer your question, the
perceptron behaves badly in terms of

1389
01:12:01,790 --> 01:12:04,090
the computational behavior.

1390
01:12:04,090 --> 01:12:07,490
We will be able to predict its
generalization behavior, based on the

1391
01:12:07,490 --> 01:12:09,370
number of dimensions and
the amount of data.

1392
01:12:09,370 --> 01:12:11,610
This will be given explicitly.

1393
01:12:11,610 --> 01:12:19,030
And therefore, the perceptron algorithm
is bad computationally, good

1394
01:12:19,030 --> 01:12:20,980
in terms of generalization.

1395
01:12:20,980 --> 01:12:24,900
If you actually can get away with
perceptrons, your chances of

1396
01:12:24,900 --> 01:12:28,460
generalizing are good because
it's a simplistic

1397
01:12:28,460 --> 01:12:33,850
model, and therefore its ability to
generalize is good, as we will see.
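
The quantitative version previewed here: for the perceptron in d dimensions, the measure of sophistication developed in the theory lectures (the VC dimension) works out to d + 1, and a rule of thumb used later in the course is to have roughly 10 times that many examples. A sketch (the factor of 10 is a heuristic, not a theorem):

```python
def examples_needed(d, factor=10):
    """Rule-of-thumb sample size for a perceptron in d dimensions:
    the VC dimension is d + 1, and a common heuristic asks for
    about 10 examples per unit of VC dimension."""
    d_vc = d + 1
    return factor * d_vc

for d in (2, 10, 100):
    print(d, "dimensions ->", examples_needed(d), "examples")
```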

1398
01:12:33,850 --> 01:12:38,010
MODERATOR: Also, in the example you
explained the use of a binary function.

1399
01:12:38,010 --> 01:12:43,690
So can you use multi-valued
or real-valued functions?

1400
01:12:43,690 --> 01:12:45,100
PROFESSOR: Correct.

1401
01:12:45,100 --> 01:12:47,980
Remember when I told you that there is
a topic that is out of sequence?

1402
01:12:47,980 --> 01:12:51,810
There was a logical sequence to the
course, and then I took part of the

1403
01:12:51,810 --> 01:12:55,870
linear models and put it very early on,
to give you something a little bit

1404
01:12:55,870 --> 01:12:59,140
more sophisticated than perceptrons
to try your hand on.

1405
01:12:59,140 --> 01:13:01,560
That happens to be for
real-valued functions.

1406
01:13:01,560 --> 01:13:05,650
And obviously there are hypotheses that
cover all types of co-domains.

1407
01:13:05,650 --> 01:13:07,010
Y could be anything as well.

1408
01:13:07,010 --> 01:13:09,930

1409
01:13:09,930 --> 01:13:18,730
MODERATOR: Another question is, in
the learning process you showed, when

1410
01:13:18,730 --> 01:13:22,420
do you pick your learning algorithm,
when do you pick your hypothesis set,

1411
01:13:22,420 --> 01:13:23,840
and what liberty do you have?

1412
01:13:23,840 --> 01:13:28,380

1413
01:13:28,380 --> 01:13:33,070
PROFESSOR: The hypothesis set
is the most important aspect of

1414
01:13:33,070 --> 01:13:36,030
determining the generalization behavior
that we'll talk about.

1415
01:13:36,030 --> 01:13:38,960
The learning algorithm does play a role,
although it is a secondary role,

1416
01:13:38,960 --> 01:13:41,330
as we will see in the discussion.

1417
01:13:41,330 --> 01:13:45,960
So in general, the learning
algorithm has the form of

1418
01:13:45,960 --> 01:13:49,140
minimizing an error function.

1419
01:13:49,140 --> 01:13:51,540
So you can think of the
perceptron, what does

1420
01:13:51,540 --> 01:13:52,290
the algorithm do?

1421
01:13:52,290 --> 01:13:55,420
It tries to minimize the
classification error.

1422
01:13:55,420 --> 01:13:58,710
That is your error function, and
you're minimizing it using this

1423
01:13:58,710 --> 01:14:00,220
particular update rule.

1424
01:14:00,220 --> 01:14:03,700
And in other cases, we'll see that we
are minimizing an error function.

1425
01:14:03,700 --> 01:14:08,000
Now the minimization aspect is
an optimization question, and once you

1426
01:14:08,000 --> 01:14:11,180
determine that this is indeed the
error function that I want to

1427
01:14:11,180 --> 01:14:15,950
minimize, then you go and minimize
it as much as you can using the most

1428
01:14:15,950 --> 01:14:18,710
sophisticated optimization
technique that you find.

1429
01:14:18,710 --> 01:14:22,160
So the question now translates into
what is the choice of the error

1430
01:14:22,160 --> 01:14:26,280
function or error measure that
will help or not help.

1431
01:14:26,280 --> 01:14:29,530
And that will be covered also next week
under the topic, Error and Noise.

1432
01:14:29,530 --> 01:14:32,350
When I talk about error, we'll talk
about error measures, and this

1433
01:14:32,350 --> 01:14:37,660
translates directly to the learning
algorithm that goes with them.
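
For reference, the update rule being referred to is the perceptron learning rule from earlier in the lecture: take a misclassified point (x, y) and set w <- w + y*x. A minimal sketch (assuming the inputs already include the constant-1 coordinate):

```python
import numpy as np

def pla(X, y, max_iters=10000):
    """Perceptron learning algorithm: repeatedly pick a misclassified
    point and nudge the weights toward classifying it correctly,
    thereby driving down the in-sample classification error."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        preds = np.sign(X @ w)
        wrong = np.flatnonzero(preds != y)
        if wrong.size == 0:
            return w                # zero classification error on the sample
        i = wrong[0]
        w = w + y[i] * X[i]         # the update rule: w <- w + y*x
    return w
```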

1434
01:14:37,660 --> 01:14:38,730
MODERATOR: Back to the perceptron.

1435
01:14:38,730 --> 01:14:43,220
So what happens if your hypothesis
gives you exactly 0 in this case?

1436
01:14:43,220 --> 01:14:47,200
PROFESSOR: So remember that
the quantity you compute and

1437
01:14:47,200 --> 01:14:49,960
compare with the threshold
was your credit score.

1438
01:14:49,960 --> 01:14:53,090
So I told you what happens if you are
above threshold, and what happens if

1439
01:14:53,090 --> 01:14:54,760
you're below threshold.

1440
01:14:54,760 --> 01:14:57,430
So what happens if you're exactly
at the threshold?

1441
01:14:57,430 --> 01:15:02,340
Your score is exactly that.

1442
01:15:02,340 --> 01:15:07,080
The informal answer is that it depends
on the mood of the credit

1443
01:15:07,080 --> 01:15:08,650
officer on that day.

1444
01:15:08,650 --> 01:15:10,870
If they had a bad day,
you will be denied!

1445
01:15:10,870 --> 01:15:16,410
But the serious answer is that
there are technical ways of

1446
01:15:16,410 --> 01:15:17,870
defining that point.

1447
01:15:17,870 --> 01:15:21,580
You can define it as 0,
so the sign of 0 is 0.

1448
01:15:21,580 --> 01:15:24,190
In which case you are always making
an error, because you are never +1 or

1449
01:15:24,190 --> 01:15:25,830
-1, when you should be.

1450
01:15:25,830 --> 01:15:28,230
Or you could make it belong
to the +1 category or

1451
01:15:28,230 --> 01:15:29,700
to the -1 category.

1452
01:15:29,700 --> 01:15:32,190
There are ramifications for
all of these decisions

1453
01:15:32,190 --> 01:15:33,950
that are purely technical.

1454
01:15:33,950 --> 01:15:36,010
Nothing conceptual comes out of them.

1455
01:15:36,010 --> 01:15:38,790
That's why I decided not
to include it.

1456
01:15:38,790 --> 01:15:42,220
Because it clutters the main concept
with something that really has no

1457
01:15:42,220 --> 01:15:43,170
ramification.

1458
01:15:43,170 --> 01:15:46,090
As far as you're concerned, the easiest
way to consider it is that the

1459
01:15:46,090 --> 01:15:49,040
output will be 0, and therefore you will
be making an error regardless of

1460
01:15:49,040 --> 01:15:50,410
whether it's +1 or -1.
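
In code, the convention suggested here matches what, for instance, NumPy's sign function already does:

```python
import numpy as np

score = 0.0                 # credit score exactly at the threshold
output = np.sign(score)     # np.sign(0.0) returns 0, neither +1 nor -1
print(output)               # 0.0 -- counted as an error against either label
```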

1461
01:15:50,410 --> 01:15:53,600

1462
01:15:53,600 --> 01:15:57,070
MODERATOR: Is there a kind of problem
that cannot be learned even if

1463
01:15:57,070 --> 01:16:01,480
there's a huge amount of data?

1464
01:16:01,480 --> 01:16:02,360
PROFESSOR: Correct.

1465
01:16:02,360 --> 01:16:07,010
For example, if I go to my computer
and use a pseudo-random number

1466
01:16:07,010 --> 01:16:12,090
generator to generate the target over
the entire domain, then patently,

1467
01:16:12,090 --> 01:16:14,960
nothing I can give you will make
you learn the other guys.
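
A quick way to see this (a sketch with arbitrary sizes and an arbitrary memorizing learner): generate the labels at random, memorize half of them, and test on the rest. Performance stays at chance.

```python
import numpy as np

rng = np.random.default_rng(1)

# A "target" that is pure noise: every point gets an independent random label.
X = rng.normal(size=(1000, 5))
y = rng.choice([-1, 1], size=1000)

def nearest(x, Xtr, ytr):
    # Memorization: predict the label of the closest training point.
    return ytr[np.argmin(np.sum((Xtr - x) ** 2, axis=1))]

# Memorizing the first 500 points perfectly says nothing about the other 500.
Xtr, ytr, Xte, yte = X[:500], y[:500], X[500:], y[500:]
acc = np.mean([nearest(x, Xtr, ytr) == t for x, t in zip(Xte, yte)])
print("out-of-sample accuracy:", acc)   # hovers around 0.5 -- chance level
```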

1468
01:16:14,960 --> 01:16:16,360

1469
01:16:16,360 --> 01:16:17,665
So remember the three--

1470
01:16:17,665 --> 01:16:20,310

1471
01:16:20,310 --> 01:16:23,170
let me try to--

1472
01:16:23,170 --> 01:16:24,640
the essence of machine learning.

1473
01:16:24,640 --> 01:16:28,710
The first one was, a pattern exists.

1474
01:16:28,710 --> 01:16:29,510

1475
01:16:29,510 --> 01:16:34,130
If there's no pattern that exists,
there is nothing to learn.

1476
01:16:34,130 --> 01:16:38,350
Let's say that it's like a baby,
and stuff is happening, and the

1477
01:16:38,350 --> 01:16:42,000
baby is just staring. There is nothing
to pick from that thing.

1478
01:16:42,000 --> 01:16:44,740
Once there is a pattern, you can see
the smile on the baby's face.

1479
01:16:44,740 --> 01:16:46,500
Now I can see what is going on.

1480
01:16:46,500 --> 01:16:49,420
So whatever you are learning,
there needs to be a pattern.

1481
01:16:49,420 --> 01:16:50,240

1482
01:16:50,240 --> 01:16:52,640
Now, how to tell that there's
a pattern or not,

1483
01:16:52,640 --> 01:16:53,400
that's a different question.

1484
01:16:53,400 --> 01:16:58,300
But the main ingredient, there's a pattern.
The other one is we cannot pin

1485
01:16:58,300 --> 01:16:58,980
it down mathematically.

1486
01:16:58,980 --> 01:17:00,970
If we can pin it down mathematically,
and you decide to do

1487
01:17:00,970 --> 01:17:02,840
the learning, then you
are really lazy.

1488
01:17:02,840 --> 01:17:04,730
Because you could just write the code.

1489
01:17:04,730 --> 01:17:05,380
But fine.

1490
01:17:05,380 --> 01:17:08,280
You can use learning in this case, but
it's not the recommended method,

1491
01:17:08,280 --> 01:17:11,620
because it has certain errors
in performance.

1492
01:17:11,620 --> 01:17:14,060
Whereas if you have the mathematical
definition, you just implement it and

1493
01:17:14,060 --> 01:17:16,150
you'll get the best possible solution.

1494
01:17:16,150 --> 01:17:18,240
And the third one, you have data,
which is key.

1495
01:17:18,240 --> 01:17:22,370
So if you have plenty of data, but the
first one is off, you are simply not

1496
01:17:22,370 --> 01:17:23,890
going to learn.

1497
01:17:23,890 --> 01:17:27,900
And it's not like you have to answer each
of these questions at random.

1498
01:17:27,900 --> 01:17:31,460
The theory will completely
capture what is going on.

1499
01:17:31,460 --> 01:17:35,820
So there's a very good reason for going
through four lectures in the

1500
01:17:35,820 --> 01:17:38,490
outline that are
mathematically inclined.

1501
01:17:38,490 --> 01:17:40,140
This is not for the sake of math.

1502
01:17:40,140 --> 01:17:45,170
I don't like to do math
hacking, if you will.

1503
01:17:45,170 --> 01:17:48,680
I pick the math that is necessary
to establish a concept.

1504
01:17:48,680 --> 01:17:51,530
And these will establish it, and they
are very much worth being patient with

1505
01:17:51,530 --> 01:17:52,480
and going through.

1506
01:17:52,480 --> 01:17:55,840
Because once you're done with them, you
basically have it cold about what

1507
01:17:55,840 --> 01:18:00,520
are the components that make learning
possible, and how do we tell, and all

1508
01:18:00,520 --> 01:18:03,360
of the questions that have been asked.

1509
01:18:03,360 --> 01:18:04,620
MODERATOR: Historical question.

1510
01:18:04,620 --> 01:18:10,880
So why is the perceptron often
associated with a neuron?

1511
01:18:10,880 --> 01:18:14,435
PROFESSOR: I will discuss this
in neural networks, but in general,

1512
01:18:14,435 --> 01:18:19,760
when you take a neuron and synapses, and
you find what is the function that

1513
01:18:19,760 --> 01:18:25,200
gets to the neuron, you find that the
neuron fires, which is +1, if the

1514
01:18:25,200 --> 01:18:31,090
signal coming to it, which is roughly
a combination of the stimuli, exceeds

1515
01:18:31,090 --> 01:18:32,400
a certain threshold.

1516
01:18:32,400 --> 01:18:37,760
So that was the initial inspiration, and
the initial inspiration was

1517
01:18:37,760 --> 01:18:41,460
that the brain does a pretty good
job, so maybe if we mimic the

1518
01:18:41,460 --> 01:18:42,890
function, we will get something good.

1519
01:18:42,890 --> 01:18:45,940
But you mimic one neuron, and then you
put it together and you'll get the

1520
01:18:45,940 --> 01:18:47,520
neural network that you
are talking about.

1521
01:18:47,520 --> 01:18:52,780
And I will discuss the analogy with
biology, and the extent to which it can be

1522
01:18:52,780 --> 01:18:55,850
benefited from, when we talk
about neural networks, because

1523
01:18:55,850 --> 01:18:57,799
that will be the more proper
context for that.

1524
01:18:57,799 --> 01:19:02,850

1525
01:19:02,850 --> 01:19:08,710
MODERATOR: Another question is,
regarding the hypothesis set, are there

1526
01:19:08,710 --> 01:19:12,645
Bayesian hierarchical procedures
to narrow down the hypothesis set?

1527
01:19:12,645 --> 01:19:15,660

1528
01:19:15,660 --> 01:19:16,916
PROFESSOR: OK.

1529
01:19:16,916 --> 01:19:20,320
The choice of the hypothesis set and
the model in general is model

1530
01:19:20,320 --> 01:19:23,820
selection, and there's quite a bit of
stuff that we are going to talk about

1531
01:19:23,820 --> 01:19:26,550
in model selection, when we
talk about validation.

1532
01:19:26,550 --> 01:19:31,160
In general, the word Bayesian was
mentioned here-- if you

1533
01:19:31,160 --> 01:19:36,330
look at machine learning, there are
schools that deal with the subject

1534
01:19:36,330 --> 01:19:37,840
differently.

1535
01:19:37,840 --> 01:19:41,940
So for example, the Bayesian school
puts a mathematical framework

1536
01:19:41,940 --> 01:19:43,160
completely on it.

1537
01:19:43,160 --> 01:19:47,490
And then everything can be derived,
and that is based on Bayesian

1538
01:19:47,490 --> 01:19:48,500
principles.

1539
01:19:48,500 --> 01:19:54,380
I will talk about that at the very
end, so it's last but not least.

1540
01:19:54,380 --> 01:19:57,350
And I will make a very specific point
about it, for what it's worth.

1541
01:19:57,350 --> 01:20:03,280
But what I'm talking about in the course
in all of the details, are the

1542
01:20:03,280 --> 01:20:08,310
most commonly useful methods
in practice.

1543
01:20:08,310 --> 01:20:10,280
That is my criterion for inclusion.

1544
01:20:10,280 --> 01:20:10,900

1545
01:20:10,900 --> 01:20:13,910
So I will get to that
when we get there.

1546
01:20:13,910 --> 01:20:16,080
In terms of a hierarchy,

1547
01:20:16,080 --> 01:20:19,160
there are a number of hierarchical
methods.

1548
01:20:19,160 --> 01:20:23,360
For example, structural risk
minimization is one of them.

1549
01:20:23,360 --> 01:20:27,060
There are methods based on hierarchies,
and there are ramifications of that for

1550
01:20:27,060 --> 01:20:27,910
generalization.

1551
01:20:27,910 --> 01:20:30,500
I may touch upon it, when I get
to support vector machines.

1552
01:20:30,500 --> 01:20:35,490
But again, there's a lot of theory,
and if you read a book on machine

1553
01:20:35,490 --> 01:20:38,860
learning written by someone from pure
theory, you would think that you are

1554
01:20:38,860 --> 01:20:41,220
reading about a completely
different subject.

1555
01:20:41,220 --> 01:20:44,370
It's respectable stuff, but
different from the other

1556
01:20:44,370 --> 01:20:45,670
stuff that is practiced.

1557
01:20:45,670 --> 01:20:51,950
So one of the things that I'm trying to
do, I'm trying to pick from all the

1558
01:20:51,950 --> 01:20:56,070
components of machine learning, the
big picture that gives you the

1559
01:20:56,070 --> 01:20:59,540
understanding of the concept, and
the tools to use it in practice.

1560
01:20:59,540 --> 01:21:00,790
That is the criterion for inclusion.

1561
01:21:00,790 --> 01:21:04,170

1562
01:21:04,170 --> 01:21:04,710

1563
01:21:04,710 --> 01:21:07,340
Any questions from the inside here?

1564
01:21:07,340 --> 01:21:11,060

1565
01:21:11,060 --> 01:21:13,040
OK, we'll call it a day, and
we'll see you on Thursday.
