ANNOUNCER: The following program is brought to you by Caltech.

YASER ABU-MOSTAFA: Welcome to machine learning, and welcome to our online audience as well. Let me start with an outline of the course, and then go into the material of today's lecture.
As you see from the outline, the topics are given colors, which designate their main content, whether it is mathematical or practical. Machine learning is a very broad subject. It goes from very abstract theory to extreme practice, as in rules of thumb, and the inclusion of a topic in the course depends on its relevance to machine learning. Some mathematics is useful because it gives you the conceptual framework, and some practical aspects are useful because they give you a way to deal with real learning systems.
Now, if you look at the topics, these are not meant to be separate topics for each lecture. They just highlight the main content of those lectures. But there is a storyline that goes through them, and let me tell you what that storyline is like. It starts with: what is learning? Can we learn? How do we do it? How do we do it well? And then the take-home lessons. There is a logical dependency that goes through the course, and there is one exception to that logical dependency.
One lecture, the third one, doesn't really belong where it is. It's a practical topic, and the reason I included it early on is that I needed to give you some tools to play around with, to test the theoretical and conceptual aspects. If I had waited until its natural place, which is the second part of the linear models further down, the beginning of the course would have been just too theoretical for people's taste. And as you see, if you look at the colors, it is mostly red in the beginning and mostly blue at the end. So the course starts by building the concepts and the theory, and then goes on to the practical aspects.
Now, let me start today's lecture. The subject of the lecture is the learning problem; it's an introduction to what learning is. I will draw your attention to one aspect of this slide, which is this part: the logo of the course. Believe it or not, this is not artwork. It is actually a technical figure that will come up in one of the lectures. I'm not going to tell you which one, so you can wait in anticipation until it comes up, but it will actually be a scientific figure that we will talk about.
Moving to today's lecture, I'm going to talk about the following. Machine learning is a very broad subject, and I'm going to start with one example that captures the essence of machine learning. It's a fun example about movies, which everybody watches. After that, I'm going to abstract from the practical learning problem the aspects that are common to all learning situations you are going to face. In abstracting them, we will arrive at the mathematical formalization of the learning problem. Then we will get our first algorithm for machine learning today. It's a very simple algorithm, but it will fix the idea of what the role of an algorithm is in this case. We will also survey the types of learning, so that we know which part we are emphasizing in this course, and which parts are nearby. And I will end with a puzzle, a very interesting puzzle, and it's a puzzle in more ways than one, as you will see. OK, so let me start with an example.
The example of machine learning that I'm going to start with is how a viewer would rate a movie. That is an interesting problem: interesting for us because we watch movies, and very interesting for a company that rents out movies. Indeed, one such company, Netflix, wanted to improve its in-house recommendation system by a mere 10%. They make recommendations when you log in: they recommend movies that they think you will like, that is, movies they think you will rate highly. They had a system, and they wanted to improve it. So how much is a 10% improvement in performance worth to the company? It was actually $1 million, paid out to the first group that managed to get the 10% improvement. So you ask yourself: why should a 10% improvement in something like that be worth a million dollars? It's because, if the recommendations the company makes are spot on, you will pay more attention to them, you are likely to rent the movies they recommend, and they will make lots of money, much more than the million dollars they promised. And this is very typical in machine learning. For example, machine learning has applications in financial forecasting, and you can imagine that the minutest improvement in financial forecasting can make a lot of money. So the fact that you can push a system to be better using machine learning is a very attractive aspect of the technique in a wide spectrum of applications. So what did these guys do? They released the data, and people started working on the problem using different algorithms, until someone managed to win the prize.
Now, if you look at the problem of rating a movie, it captures the essence of machine learning, and that essence has three components. If you find these three components in a problem in your own field, then you know that machine learning is ready as an application tool. What are the three? The first one is that a pattern exists. If a pattern didn't exist, there would be nothing to look for. So what is the pattern here? There is no question that the way a person rates a movie is related to how they rated other movies, and is also related to how other people rated that movie. We know that much, so there is a pattern to be discovered. However, and this is the second component, we cannot really pin the pattern down mathematically. I cannot ask you to write a 17th-order polynomial that captures how people rate movies. So the fact that there is a pattern, and that we cannot pin it down mathematically, is the reason we are going for machine learning, for "learning from data". We couldn't write down the system on our own, so we are going to depend on data in order to find the system. And there is a third component, which is very important: we have to have data. We are learning from data, and without it you are out of luck. So if someone knocks on my door with an interesting machine learning application, and tells me how exciting it is, how great the application would be, and how much money they would make, the first question I ask is: what data do you have? If you have data, we are in business. If you don't, you are out of luck. If you have these three components, you are ready to apply machine learning.
Now let me give you a solution to the movie rating problem, in order to start getting a feel for it. So here is a system, and let me focus on part of it. We are going to describe a viewer as a vector of factors, a profile if you will. If you look here, for example, the first factor would be comedy content. Does the movie have a lot of comedy? From the viewer's point of view, do they like comedies? Here, do they like action? Do they prefer blockbusters, or do they like fringe movies? And you can go on all the way, even to asking whether you like the lead actor or not. Now you go to the content of the movie itself, and you get the corresponding part. Does the movie have comedy? Does it have action? Is it a blockbuster? And so on. Now you compare the two. If there is a mismatch, let's say you hate comedy and the movie has a lot of comedy, then the chances are you're not going to like it. But if there is a match across many coordinates, and the number of factors here could really be something like 300, then the chances are you'll like the movie. So what do you do? You match the movie and the viewer factors, and then you add up their contributions. As a result of that, you get the predicted rating.
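The matching step just described, comparing viewer factors with movie factors and adding up the contributions, is essentially an inner product. Here is a minimal sketch of that combination; the factor names and the numbers are invented for illustration and are not from the lecture or any real system.

```python
# Sketch of the rating prediction described above: a viewer profile and
# a movie profile over the same factors, combined by an inner product.
# Factor names and values are illustrative assumptions, not real data.

viewer = {"comedy": 0.9, "action": 0.1, "blockbuster": 0.6, "lead_actor": 0.8}
movie  = {"comedy": 0.8, "action": 0.2, "blockbuster": 0.9, "lead_actor": 0.7}

def predicted_rating(viewer, movie):
    # Match each factor of the movie against the corresponding viewer
    # factor, and add up the contributions.
    return sum(viewer[f] * movie[f] for f in viewer)

print(round(predicted_rating(viewer, movie), 2))  # → 1.84
```

A large positive sum means many matched coordinates, hence a high predicted rating; a small sum means a mismatch.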
This is all good, except for one problem: this is really not machine learning. In order to produce this, you have to watch the movie and analyze its content. You have to interview the viewer and ask about their taste. And then, after that, you combine the two and try to get a prediction for the rating. The idea of machine learning is that you don't have to do any of that. All you do is sit down and sip your tea, while the machine is doing something to come up with this figure on its own.
So let's look at the learning approach. In the learning approach, we know that the viewer will be a vector of factors, with a different component for every factor. This vector will be different from one viewer to another. For example, one viewer will have a big blue content here, and another will have a small blue content, depending on their taste. And then there is the movie. A particular movie will have different contents that correspond to those factors. And the way we said we are computing the rating is simply by taking these, combining them, and getting the rating. Now, what machine learning will do is reverse-engineer that process. It starts from the rating, and then tries to find out what factors would be consistent with that rating. So think of it this way. You start, let's say, with completely random factors. You take these guys, just random numbers from beginning to end, and these guys, random numbers from beginning to end, for every user and every movie. That's your starting point. Obviously, there is no chance in the world that when you take the inner product between these random factors, you'll get anything that looks like the rating that actually took place, right? But what you do is take a rating that actually happened, and then start nudging the factors ever so slightly toward that rating, in a direction that makes the inner product get closer to the rating. Now, it looks like a hopeless task: I start with so many factors, all random, and I'm trying to make them match a rating. What are the chances? Well, the point is that you are going to do this not for one rating, but for 100 million ratings. And you keep cycling through the 100 million, over and over and over. And eventually, lo and behold, you find that the factors are now meaningful in terms of the ratings. And if you take a viewer who didn't watch a particular movie, take the viewer vector that resulted from that learning process and the movie vector that resulted from that process, and compute the inner product, lo and behold, you get a rating that is actually consistent with how that viewer would rate the movie. That's the idea.
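The nudging process just described amounts to gradient-style updates on a matrix factorization model. Below is a minimal sketch under assumed details (squared error, the learning rate, the factor count, and the toy ratings are all invented for illustration; the actual prize-winning systems were far more elaborate):

```python
import random

# Start with random user and movie factor vectors, and for each observed
# rating nudge both vectors ever so slightly so that their inner product
# moves toward the rating. Cycle over the ratings many times.
random.seed(0)
K = 3            # number of latent factors (could be ~300 in practice)
lr = 0.01        # how slight each nudge is

ratings = [("ann", "m1", 5.0), ("ann", "m2", 1.0), ("bob", "m1", 4.0)]

users  = {u: [random.random() for _ in range(K)] for u, _, _ in ratings}
movies = {m: [random.random() for _ in range(K)] for _, m, _ in ratings}

def predict(u, m):
    # Inner product of user and movie factor vectors.
    return sum(a * b for a, b in zip(users[u], movies[m]))

for _ in range(2000):                  # cycle over the ratings repeatedly
    for u, m, r in ratings:
        err = r - predict(u, m)        # how far the inner product is off
        for k in range(K):             # nudge both vectors toward r
            uk, mk = users[u][k], movies[m][k]
            users[u][k]  += lr * err * mk
            movies[m][k] += lr * err * uk

print(round(predict("ann", "m1"), 1))  # should now be close to 5.0
```

After enough cycles, the initially random factors reproduce the observed ratings, which is the sense in which they become "meaningful in terms of the ratings".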
Now, the solution I just described is actually one of the winning solutions in the competition I mentioned. So this is for real; it actually can be used. Now, with this example in mind, let's go to the components of learning.
I would now like to abstract from the learning problems that I see: what are the mathematical components that make up the learning problem? And I'm going to use a metaphor, a metaphor from another application domain, which is a financial application. The metaphor we are going to use is credit approval. You apply for a credit card, and the bank wants to decide whether it's a good idea to extend a credit card to you or not. From the bank's point of view, if they are going to make money, they are happy; if they are going to lose money, they are not happy. That's the only criterion they have. Now, very much as we didn't have a magic formula for deciding how a viewer will rate a movie, the bank doesn't have a magic formula for deciding whether a person is creditworthy or not. What they are going to do is rely on historical records of previous customers, and how their credit behavior turned out, and then try to reverse-engineer the system; once they have the system frozen, they are going to apply it to future customers. That's the deal. So what are the components here? Let's look at it. First, you have the applicant information. You look at it, and you can see that there is the age, the gender, how much money you make, how much money you owe, and all kinds of fields that are believed to be related to creditworthiness. Again, pretty much as in the movie example, there is no question that these fields are related to creditworthiness. They don't necessarily uniquely determine it, but they are related. And the bank doesn't want a sure bet; they want to get the credit decision as reliable as possible. So they want to use that pattern in order to be able to come up with a good decision. They take this input, and they want to approve the credit or deny it.
So let's formalize this. First, we are going to have an input, and the input is called x. Surprise, surprise! That input happens to be the customer application. We can think of it as a d-dimensional vector: the first component is the salary, then years in residence, outstanding debt, whatever the components are. You put them in a vector, and that becomes the input. Then we get the output y. The output y is simply the decision: either to extend credit or not to extend credit, +1 or -1. And being a good or bad customer is judged from the bank's point of view. After that, we have the target function. The target function is a function from a domain X, which is the set of all of these x's, so it is the set of d-dimensional vectors, a d-dimensional Euclidean space in this case, to a co-domain Y, which is the set of y's. That's an easy one, because y can only be +1 or -1, accept or deny, so Y is just a binary co-domain. This target function is the ideal credit approval formula, which we don't know. In all of our endeavors in machine learning, the target function is unknown to us. If it were known, nobody would need learning; we would just go ahead and implement it. But we need to learn it, because it is unknown to us. So what are we going to do to learn it? We are going to use data: examples. The data in this case is based on previous customer application records: the input, which is the information in their applications, and the output, which is how they turned out in hindsight. This is not a question of prediction at the time they applied; rather, after five years, they turned out to be a great customer. So the bank says: if someone has these attributes again, let's approve credit, because these guys tend to make us money. And this one made us lose a lot of money, so let's deny it, and so on. And there are plenty of historical records. All of this makes sense when you are talking about having 100,000 of those records; then you can pretty much say, I will capture what the essence of that function is. So this is the data, and then you use the data, which is the historical records, in order to get the hypothesis. The hypothesis is the formal name for the formula we produce to approximate the target function. So the hypothesis g lives in the same world as the target function, and it supposedly approximates f. While f is unknown to us, g is very much known; actually, we created it, and the hope is that it approximates f well. That's the goal of learning. This notation will be our notation for the rest of the course, so get used to it. The target function is always f; the hypothesis we produce, which we will refer to as the final hypothesis, is called g; the data will always have that notation, with capital-N points making up the data set; and the output is always y. So this is the notation to be used.
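The notation just fixed, a target f from X to Y, a data set of N input-output pairs, and a final hypothesis g, can be summarized in a small sketch. The attribute fields and the stand-in target below are invented purely to generate example data; in a real problem f is unknown and only the data is available.

```python
from typing import Callable, List, Tuple

# x is a d-dimensional vector of applicant attributes; y is +1 or -1.
X = Tuple[float, ...]   # e.g. (salary, years_in_residence, debt) -- invented fields
Y = int                 # +1 = extend credit, -1 = deny

# The target f: X -> Y is unknown in practice; this toy stand-in exists
# only so we can generate example data (it is NOT part of the formalization).
def f(x: X) -> Y:
    salary, years, debt = x
    return +1 if salary - debt > 0 else -1

# The data: N examples (x_1, y_1), ..., (x_N, y_N).
data: List[Tuple[X, Y]] = [(x, f(x)) for x in [
    (50.0, 3.0, 10.0), (20.0, 1.0, 35.0), (80.0, 7.0, 5.0)]]

# A final hypothesis g is some function X -> Y picked by the learning
# algorithm; here just a trivial illustrative guess.
g: Callable[[X], Y] = lambda x: +1 if x[0] > 30 else -1

# g approximates f well if it agrees with it on the data (and beyond).
agreement = sum(g(x) == y for x, y in data) / len(data)
print(agreement)  # → 1.0 on this toy data
```

The point of the sketch is only the shape of the problem: f and the data are given to us, and g is the object we produce.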
Now, let's put this in a diagram, in order to analyze it a little bit more. If you look at the diagram, here is the target function, and it is unknown: that is the ideal credit approval, which we will never know, but which we are hoping to approximate. And we don't see it. We see it only through the eyes of the training examples. They are our vehicle for understanding what the target function is; otherwise the target function is a mysterious quantity for us. And eventually, we would like to produce the final hypothesis. The final hypothesis is the formula the bank is going to use in order to approve or deny credit, with the hope that g approximates f. Now, what connects those two? That is the learning algorithm. The learning algorithm takes the examples and produces the final hypothesis, as we described in the movie rating example. Now, there is another component that goes into the learning algorithm. What the learning algorithm does is create the formula from a preset model, a set of candidate formulas, if you will. These we are going to call the hypothesis set: a set of hypotheses from which we are going to pick one. So from this H comes a bunch of small h's, which are functions that are candidates for being the credit approval formula. And one of them will be picked by the learning algorithm; that one happens to be g, hopefully approximating f. Now, if you look at this part of the chain, from the target function to the training examples to the learning algorithm to the final hypothesis, it is very natural, and nobody will object to it. But why do we have this hypothesis set? Why not let the algorithm pick from anything, just create the formula without being restricted to a particular set of formulas H? There are two reasons, and I want to explain them. One is that there is no downside to including a hypothesis set in the formalization, and there is an upside. So let me describe why there is no downside, and then why there is an upside. There is no downside for the simple reason that, from a practical point of view, that's what you do. You want to learn, and you say: I'm going to use a linear formula, I'm going to use a neural network, I'm going to use a support vector machine. So you are already dictating a set of hypotheses. If you happen to be a brave soul, and you don't want to restrict yourself at all, very well: then your hypothesis set is the set of all possible hypotheses. Right? So there is no loss of generality in putting it in, and hence no downside. The upside is not obvious here, but it will become obvious as we go through the theory. The hypothesis set will play a pivotal role in the theory of learning. It will tell us: can we learn, how well do we learn, and so on. Therefore, having it as an explicit component in the problem statement will make the theory go through. So that's why we have this figure.
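The lecture names linear formulas, neural networks, and support vector machines as possible hypothesis sets. As an illustrative sketch (not the lecture's own example; the weights and attributes here are invented), each choice of weights below defines one candidate hypothesis h in a set H of linear threshold functions, and the learning algorithm's job is to pick the member that becomes g:

```python
from typing import Tuple

# Each choice of weights (w, b) defines one hypothesis h in the set H of
# linear threshold functions: h(x) = sign(w . x + b).
def make_h(w: Tuple[float, ...], b: float):
    def h(x: Tuple[float, ...]) -> int:
        s = sum(wi * xi for wi, xi in zip(w, x)) + b
        return +1 if s > 0 else -1
    return h

# H is, conceptually, the set of all such functions; here are two members,
# with invented weights over invented attributes (salary, debt).
h1 = make_h((1.0, -1.0), 0.0)   # approve if salary exceeds debt
h2 = make_h((0.0, 1.0), -5.0)   # approve if debt exceeds 5 (a bad candidate)

# The learning algorithm would examine the data and elect one member of H
# as the final hypothesis g.
print(h1((60.0, 10.0)), h2((60.0, 10.0)))
```

Restricting the search to such a parameterized family is exactly the "no downside" point: in practice you always commit to some family before learning.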
Now, let me focus on the solution components of that figure. What do I mean by the solution components? If you look at the first part, which is the target, let me try to expand on it: the target function is not under your control. Someone knocks on my door and says: I want to approve credit. That's the target function; I have no control over it. And by the way, here are the historical records; I have no control over those either, so they give me the data. And would you please hand me the final hypothesis? That is what I'm going to give them at the end, before I receive my check. So all of that is completely dictated. Now let's look at the other part. The learning algorithm and the hypothesis set that we talked about are your solution tools. These are things you choose in order to solve the problem. And I would like to take a little bit of a look into what they look like, and give you an example of them, so that you have a complete chain for the entire figure in your mind: from the target function, to the data set, to the learning algorithm, to the hypothesis set, to the final hypothesis. So, here is the hypothesis set. We chose the notation H for the set, and each element will be given the symbol small h. So h is a function, pretty much like the final hypothesis g; g is just one of them, the one that you happen to elect. So when we elect it, we call it g. If
it's sitting there generically, we
416
00:22:34,060 --> 00:22:35,100
call it h.
417
00:22:35,100 --> 00:22:35,860
418
00:22:35,860 --> 00:22:39,090
And then, when you put them together,
they are referred to as
419
00:22:39,090 --> 00:22:39,690
the learning model.
420
00:22:39,690 --> 00:22:42,610
So if you're asked what is the learning
model you are using, you're
421
00:22:42,610 --> 00:22:46,580
actually choosing both a hypothesis
set and a learning algorithm.
422
00:22:46,580 --> 00:22:49,400
We'll see the perceptron in a moment,
so this would be the
423
00:22:49,400 --> 00:22:52,780
perceptron model, and this would be the
perceptron learning algorithm.
424
00:22:52,780 --> 00:22:56,420
This could be a neural network, and
this would be backpropagation.
425
00:22:56,420 --> 00:22:59,050
This could be support vector
machines of some kind, let's say
426
00:22:59,050 --> 00:23:02,520
radial basis function version, and this
would be quadratic programming.
427
00:23:02,520 --> 00:23:05,840
So every time you have a model, there is
a hypothesis set, and then there is
428
00:23:05,840 --> 00:23:07,960
an algorithm that will do the
searching and produce
429
00:23:07,960 --> 00:23:09,280
one of those guys.
430
00:23:09,280 --> 00:23:09,690
431
00:23:09,690 --> 00:23:13,630
So this is the standard form
432
00:23:13,630 --> 00:23:14,820
for the solution.
433
00:23:14,820 --> 00:23:18,860
Now, let me go through a simple
hypothesis set in detail so we have
434
00:23:18,860 --> 00:23:19,650
something to implement.
435
00:23:19,650 --> 00:23:23,860
So after the lecture, you can actually
implement a learning algorithm on real
436
00:23:23,860 --> 00:23:24,950
data if you want to.
437
00:23:24,950 --> 00:23:28,440
This is not a glorious model. It's
a very simple model. On the other hand,
438
00:23:28,440 --> 00:23:33,790
it's a very clear model to pinpoint
what we are talking about.
439
00:23:33,790 --> 00:23:34,570
440
00:23:34,570 --> 00:23:35,890
So here is the deal.
441
00:23:35,890 --> 00:23:39,660
442
00:23:39,660 --> 00:23:43,730
You have an input, and the input
is x_1 up to x_d, as we said--
443
00:23:43,730 --> 00:23:49,210
d-dimensional vector-- and each of them
comes from the real numbers, just
444
00:23:49,210 --> 00:23:49,840
for simplicity.
445
00:23:49,840 --> 00:23:51,320
So this belongs to the real numbers.
446
00:23:51,320 --> 00:23:53,170
And these are the attributes
of a customer.
447
00:23:53,170 --> 00:23:56,470
As we said, salary, years in
residence, and whatnot.
448
00:23:56,470 --> 00:24:00,080
So what does the perceptron model do?
449
00:24:00,080 --> 00:24:02,730
It does a very simple formula.
450
00:24:02,730 --> 00:24:08,760
It takes the attributes you have and
gives them different weights, w.
451
00:24:08,760 --> 00:24:12,110
So let's say the salary is important,
the chances are w corresponding to the
452
00:24:12,110 --> 00:24:13,900
salary will be big.
453
00:24:13,900 --> 00:24:15,880
Some other attribute is
not that important.
454
00:24:15,880 --> 00:24:19,210
The chances are the w that
goes with it is not that big.
455
00:24:19,210 --> 00:24:21,540
Actually, outstanding
debt is bad news.
456
00:24:21,540 --> 00:24:23,370
If you owe a lot, that's not good.
457
00:24:23,370 --> 00:24:26,600
So the chances are the weight will
be negative for outstanding
458
00:24:26,600 --> 00:24:28,420
debt, and so on.
459
00:24:28,420 --> 00:24:32,210
Now you add them together, and you add
them in a linear form-- that's what
460
00:24:32,210 --> 00:24:33,630
makes it a perceptron--
461
00:24:33,630 --> 00:24:39,010
and you can look at this as
a credit score, of sorts.
462
00:24:39,010 --> 00:24:39,760
463
00:24:39,760 --> 00:24:43,300
Now you compare the credit
score with a threshold.
464
00:24:43,300 --> 00:24:46,690
If you exceed the threshold, they
approve the credit card.
465
00:24:46,690 --> 00:24:50,420
And if you don't, they
deny the credit card.
466
00:24:50,420 --> 00:24:51,710
So that is the formula they
467
00:24:51,710 --> 00:24:52,520
settle on.
468
00:24:52,520 --> 00:24:58,500
They have no idea, yet, what the w's and
the threshold are, but they dictated the
469
00:24:58,500 --> 00:25:01,110
formula-- the analytic form that
they're going to use.
470
00:25:01,110 --> 00:25:02,040
471
00:25:02,040 --> 00:25:06,530
Now we take this and we put it
in the formalization we had.
472
00:25:06,530 --> 00:25:11,370
We have to define a hypothesis h,
and this will tell us what is the
473
00:25:11,370 --> 00:25:14,820
hypothesis set that has all the
hypotheses that have the same
474
00:25:14,820 --> 00:25:16,170
functional form.
475
00:25:16,170 --> 00:25:17,530
So you can write it down as this.
476
00:25:17,530 --> 00:25:22,270
This is a little bit long, but there's
absolutely nothing to it.
477
00:25:22,270 --> 00:25:26,490
This is your credit score, and this
is the threshold you compare to by
478
00:25:26,490 --> 00:25:27,740
subtracting.
479
00:25:27,740 --> 00:25:30,910
If this quantity is positive, you belong
to the first thing and you will
480
00:25:30,910 --> 00:25:31,890
approve credit.
481
00:25:31,890 --> 00:25:34,880
If it's negative, you belong here
and you will deny credit.
482
00:25:34,880 --> 00:25:38,440
Well, the function that takes a real
number, and produces +1 or
483
00:25:38,440 --> 00:25:41,010
-1, is called the sign.
484
00:25:41,010 --> 00:25:43,930
So when you take the sign of this thing,
this will indeed be +1 or
485
00:25:43,930 --> 00:25:46,970
-1, and this will give
the decision you want.
486
00:25:46,970 --> 00:25:49,820
And that will be the form
of your hypothesis.
487
00:25:49,820 --> 00:25:57,640
Now let's put it in color, and you
realize that what defines h is your
488
00:25:57,640 --> 00:26:00,290
choice of w_i and the threshold.
489
00:26:00,290 --> 00:26:05,060
These are the parameters that define
one hypothesis versus the other.
490
00:26:05,060 --> 00:26:07,820
x is an input that will be
put into any hypothesis.
491
00:26:07,820 --> 00:26:11,780
As far as we are concerned, when we are
learning for example, the inputs
492
00:26:11,780 --> 00:26:13,610
and outputs are already determined.
493
00:26:13,610 --> 00:26:15,010
These are the data set.
494
00:26:15,010 --> 00:26:19,640
But what we vary to get one hypothesis
or another, and what the algorithm
495
00:26:19,640 --> 00:26:23,270
needs to vary in order to choose the
final hypothesis, are those parameters
496
00:26:23,270 --> 00:26:27,270
which, in this case, are
w_i and the threshold.
497
00:26:27,270 --> 00:26:28,810
498
00:26:28,810 --> 00:26:30,610
So let's look at it visually.
499
00:26:30,610 --> 00:26:32,990
Let's assume that the data
you are working
500
00:26:32,990 --> 00:26:34,790
with is linearly separable.
501
00:26:34,790 --> 00:26:38,770
Linearly separable in this case, for
example, you have nine data points.
502
00:26:38,770 --> 00:26:42,220
And if you look at the nine data
points, some of them were good
503
00:26:42,220 --> 00:26:44,850
customers and some of them
were bad customers.
504
00:26:44,850 --> 00:26:48,450
And you would like now to apply the
perceptron model, in order to separate
505
00:26:48,450 --> 00:26:49,460
them correctly.
506
00:26:49,460 --> 00:26:53,860
You would like to get to this situation,
where the perceptron, which
507
00:26:53,860 --> 00:26:57,680
is this purple line, separates the blue
region from the red region or the
508
00:26:57,680 --> 00:27:02,240
pink region, and indeed all the good
customers belong to one, and the bad
509
00:27:02,240 --> 00:27:03,600
customers belong to the other.
510
00:27:03,600 --> 00:27:07,100
So you have hope that a future customer,
if they lie here or lie
511
00:27:07,100 --> 00:27:09,152
here, they will be classified
correctly.
512
00:27:09,152 --> 00:27:12,780
That is, if there is actually a simple
linear pattern here to be detected.
513
00:27:12,780 --> 00:27:16,950
But when you start, you start with
random weights, and the random weights
514
00:27:16,950 --> 00:27:18,990
will give you any line.
515
00:27:18,990 --> 00:27:23,490
So the purple line in both
cases corresponds to the
516
00:27:23,490 --> 00:27:25,900
purple parameters there.
517
00:27:25,900 --> 00:27:30,370
One choice of these w's and the
threshold corresponds to one line.
518
00:27:30,370 --> 00:27:32,220
You change them, you get another line.
519
00:27:32,220 --> 00:27:35,410
So you can see that the learning
algorithm is playing around with these
520
00:27:35,410 --> 00:27:39,360
parameters, and therefore moving the
line around, trying to arrive at this
521
00:27:39,360 --> 00:27:40,950
happy solution.
522
00:27:40,950 --> 00:27:42,350
523
00:27:42,350 --> 00:27:45,620
Now we are going to have a simple
change of notation.
524
00:27:45,620 --> 00:27:51,000
Instead of calling it threshold, we're
going to treat it as if it's a weight.
525
00:27:51,000 --> 00:27:55,030
It was minus threshold.
Now we call it plus w_0.
526
00:27:55,030 --> 00:27:58,760
Absolutely nothing changed; all you need
to do is choose w_0 to
527
00:27:58,760 --> 00:28:00,930
be minus the threshold.
528
00:28:00,930 --> 00:28:01,840
No big deal.
529
00:28:01,840 --> 00:28:03,060
530
00:28:03,060 --> 00:28:04,750
So why do we do that?
531
00:28:04,750 --> 00:28:08,220
We do that because we are going to
introduce an artificial coordinate.
532
00:28:08,220 --> 00:28:11,780
Remember that the input
was x_1 through x_d.
533
00:28:11,780 --> 00:28:14,020
Now we're going to add x_0.
534
00:28:14,020 --> 00:28:15,910
This is not an attribute of
the customer, but
535
00:28:15,910 --> 00:28:20,460
an artificial constant we add, which
happens to be always +1.
536
00:28:20,460 --> 00:28:22,710
Why are we doing this?
You probably guessed.
537
00:28:22,710 --> 00:28:26,540
Because when you do that, then all of
a sudden the formula simplifies.
538
00:28:26,540 --> 00:28:30,040
Now you are summing from i equals
0, instead of i equals 1.
539
00:28:30,040 --> 00:28:33,410
So you added the zero term,
and what is the zero term?
540
00:28:33,410 --> 00:28:37,270
It's the threshold which you
conveniently call w_0 with a plus sign,
541
00:28:37,270 --> 00:28:38,390
multiplied by the 1.
542
00:28:38,390 --> 00:28:41,550
So indeed, this will be the formula
equivalent to that.
543
00:28:41,550 --> 00:28:43,340
So it looks better.
544
00:28:43,340 --> 00:28:46,490
And this is the standard notation
we're going to use.
545
00:28:46,490 --> 00:28:51,720
And now we put it as a vector
form, which will simplify matters, so
546
00:28:51,720 --> 00:28:56,200
in this case you will be having an inner
product between a vector w,
547
00:28:56,200 --> 00:28:59,190
a column vector, and a vector x.
548
00:28:59,190 --> 00:29:04,870
So the vector w would be w_0,
w_1, w_2, w_3, w_4, et cetera.
549
00:29:04,870 --> 00:29:07,030
And x_0, x_1, x_2, et cetera.
550
00:29:07,030 --> 00:29:10,180
And you do the inner product by taking
a transpose, and you get a formula
551
00:29:10,180 --> 00:29:12,200
which is exactly the formula
you have here.
552
00:29:12,200 --> 00:29:17,330
So now we are down to this formula
for the perceptron hypothesis.
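With the threshold renamed to w_0 and the artificial coordinate x_0 = +1 prepended, the hypothesis collapses to the sign of a single inner product, h(x) = sign(w^T x). A minimal sketch, with made-up numbers for illustration:

```python
def perceptron_h(x, w):
    """h(x) = sign(w^T x); x[0] is the artificial coordinate, always +1."""
    score = sum(w_i * x_i for w_i, x_i in zip(w, x))   # the inner product w^T x
    return 1 if score > 0 else -1

# Hypothetical customer, with x_0 = +1 prepended to the attributes
x = [1.0, 60.0, 4.0, 10.0]
# w_0 = -threshold absorbs the threshold into the weight vector
w = [-25.0, 0.5, 0.2, -0.3]
print(perceptron_h(x, w))   # prints 1: the score 2.8 is positive
```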
553
00:29:17,330 --> 00:29:19,010
554
00:29:19,010 --> 00:29:22,540
Now that we have the hypothesis set,
let's look for the learning algorithm
555
00:29:22,540 --> 00:29:23,500
that goes with it.
556
00:29:23,500 --> 00:29:27,170
The hypothesis set tells you the
resources you can work with.
557
00:29:27,170 --> 00:29:29,990
Now we need the algorithm that is
going to look at the data, the
558
00:29:29,990 --> 00:29:33,810
training data that you're going to use,
and navigate through the space
559
00:29:33,810 --> 00:29:37,380
of hypotheses, to bring the one that
is going to output as the final
560
00:29:37,380 --> 00:29:39,660
hypothesis that you give
to your customer.
561
00:29:39,660 --> 00:29:47,210
So this one is called the perceptron
learning algorithm, and it implements
562
00:29:47,210 --> 00:29:49,420
this function.
563
00:29:49,420 --> 00:29:51,610
What it does is the following.
564
00:29:51,610 --> 00:29:54,560
It takes the training data.
565
00:29:54,560 --> 00:29:56,710
That is always what a learning
algorithm does. This is
566
00:29:56,710 --> 00:29:58,020
their starting point.
567
00:29:58,020 --> 00:30:03,330
So it takes existing customers, and
their existing credit behavior in
568
00:30:03,330 --> 00:30:04,030
hindsight--
569
00:30:04,030 --> 00:30:05,380
that's what it uses--
570
00:30:05,380 --> 00:30:07,070
and what does it do?
571
00:30:07,070 --> 00:30:12,080
It tries to make the w correct.
572
00:30:12,080 --> 00:30:17,640
So it really doesn't like at all
when a point is misclassified.
573
00:30:17,640 --> 00:30:22,960
So if a point is misclassified,
it means that your w didn't do
574
00:30:22,960 --> 00:30:23,850
the right job here.
575
00:30:23,850 --> 00:30:26,640
So what does it mean to be
a misclassified point here?
576
00:30:26,640 --> 00:30:31,910
It means that when you apply your
formula, with the current w--
577
00:30:31,910 --> 00:30:34,340
the w is the one that the algorithm
will play with--
578
00:30:34,340 --> 00:30:37,340
apply it to this particular x.
579
00:30:37,340 --> 00:30:38,170
Then what happens?
580
00:30:38,170 --> 00:30:41,200
You get something that is not the
credit behavior you want.
581
00:30:41,200 --> 00:30:43,500
It is misclassified.
582
00:30:43,500 --> 00:30:45,150
So what do we do when a point
is misclassified?
583
00:30:45,150 --> 00:30:46,840
We have to do something.
584
00:30:46,840 --> 00:30:49,660
So what the algorithm does, it
updates the weight vector.
585
00:30:49,660 --> 00:30:53,480
It changes the weight, which changes
the hypothesis, so that it behaves
586
00:30:53,480 --> 00:30:55,840
better on that particular point.
587
00:30:55,840 --> 00:30:59,970
And this is the formula that it does.
588
00:30:59,970 --> 00:31:01,480
So I'll explain it in a moment.
589
00:31:01,480 --> 00:31:08,430
Let me first try to explain the inner
product in terms of agreement or
590
00:31:08,430 --> 00:31:10,770
disagreement.
591
00:31:10,770 --> 00:31:16,890
If you have the vector x and the vector
w this way, their inner product
592
00:31:16,890 --> 00:31:21,230
will be positive, and the sign
will give you a +1.
593
00:31:21,230 --> 00:31:25,440
If they are this way, the inner product
will be negative, and the sign
594
00:31:25,440 --> 00:31:27,550
will be -1.
595
00:31:27,550 --> 00:31:32,180
So being misclassified means that
either they are this way and the
596
00:31:32,180 --> 00:31:37,590
output should be -1, or it's this
way and output should be +1.
597
00:31:37,590 --> 00:31:40,840
That's what makes it misclassified,
right?
598
00:31:40,840 --> 00:31:41,750
599
00:31:41,750 --> 00:31:49,720
So if you look here at this formula, it
takes the old w and adds something
600
00:31:49,720 --> 00:31:52,130
that depends on the misclassified
point.
601
00:31:52,130 --> 00:31:55,280
Both in terms of the x_n and y_n.
602
00:31:55,280 --> 00:31:57,410
y_n is just +1 or -1.
603
00:31:57,410 --> 00:32:00,800
So here you are either adding a vector
or subtracting a vector.
604
00:32:00,800 --> 00:32:05,490
And we will see from this diagram that
you're always doing so in such a way
605
00:32:05,490 --> 00:32:09,520
that you make the point more likely
to be correctly classified.
606
00:32:09,520 --> 00:32:10,780
How is that?
607
00:32:10,780 --> 00:32:15,990
If y equals +1, as you see here,
then it must be that since the point
608
00:32:15,990 --> 00:32:19,530
is misclassified, that
w dot x was negative.
609
00:32:19,530 --> 00:32:24,900
Now when you modify this to w plus
y x, it's actually w plus x.
610
00:32:24,900 --> 00:32:29,330
You add x to w, and when you add x to
w you get the blue vector instead of
611
00:32:29,330 --> 00:32:30,120
the red vector.
612
00:32:30,120 --> 00:32:33,850
And lo and behold, now the inner
product is indeed positive.
613
00:32:33,850 --> 00:32:38,830
And in the other case when it's -1,
it is misclassified because they
614
00:32:38,830 --> 00:32:39,760
were this way.
615
00:32:39,760 --> 00:32:41,840
They give you +1 when
it should be -1.
616
00:32:41,840 --> 00:32:44,520
And when you apply the rule, since
y is -1, you are actually
617
00:32:44,520 --> 00:32:45,640
subtracting x.
618
00:32:45,640 --> 00:32:48,680
So you subtract x and get this guy,
and you will get the correct
619
00:32:48,680 --> 00:32:49,620
classification.
620
00:32:49,620 --> 00:32:49,810
621
00:32:49,810 --> 00:32:51,560
So this is the intuition behind it.
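The geometric argument can be checked numerically. With hypothetical vectors, the update w + y x raises the agreement y (w dot x) by exactly the squared length of x, pushing the misclassified point toward the correct side:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical misclassified point: y = +1 but the current w gives w.x < 0
w = [-1.0, 0.5]
x = [2.0, 1.0]
y = 1
assert y * dot(w, x) < 0                              # misclassified now

w_new = [w_i + y * x_i for w_i, x_i in zip(w, x)]     # the PLA update
assert y * dot(w_new, x) > 0                          # correct after the update
print(y * dot(w_new, x) - y * dot(w, x))              # 5.0, which is ||x||^2
```

In this particular example a single update already flips the sign; in general one update only moves the point in the right direction, which is why the behavior of the full algorithm still needs an argument.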
622
00:32:51,560 --> 00:32:53,990
However, it is not the intuition
that makes this work.
623
00:32:53,990 --> 00:32:58,930
There are a number of problems
with this approach.
624
00:32:58,930 --> 00:33:02,570
I just motivated that
this is not a crazy rule.
625
00:33:02,570 --> 00:33:06,660
Whether or not it's a working
rule, that is yet to be seen.
626
00:33:06,660 --> 00:33:11,270
Let's look at the iterations of
the perceptron learning algorithm.
627
00:33:11,270 --> 00:33:13,860
Here is one iteration of PLA.
628
00:33:13,860 --> 00:33:18,950
So you look at this thing, and
this current w corresponds to
629
00:33:18,950 --> 00:33:20,330
the purple line.
630
00:33:20,330 --> 00:33:22,760
This guy is blue in the red region.
631
00:33:22,760 --> 00:33:24,610
It means it's misclassified.
632
00:33:24,610 --> 00:33:28,630
So now you would like to adjust
the weights, that is move around
633
00:33:28,630 --> 00:33:32,340
that purple line, such that the
point is classified correctly.
634
00:33:32,340 --> 00:33:35,440
If you apply the learning rule, you'll
find that you're actually moving in
635
00:33:35,440 --> 00:33:40,210
this direction, which means that the
blue point will likely be correctly
636
00:33:40,210 --> 00:33:42,230
classified after that iteration.
637
00:33:42,230 --> 00:33:43,790
638
00:33:43,790 --> 00:33:46,930
There is a problem because, let's
say that I actually move
639
00:33:46,930 --> 00:33:49,700
this guy in this direction.
640
00:33:49,700 --> 00:33:55,010
Well this one, I got it right, but this
one, which used to be right,
641
00:33:55,010 --> 00:33:56,440
now is messed up.
642
00:33:56,440 --> 00:33:58,920
It moved to the blue region, right?
643
00:33:58,920 --> 00:34:02,440
And if you think about it, I'm trying
to take care of one point, and I may be
644
00:34:02,440 --> 00:34:05,450
messing up all other points, because
I'm not taking them into
645
00:34:05,450 --> 00:34:06,980
consideration.
646
00:34:06,980 --> 00:34:08,469
Well, the good news for the perceptron
647
00:34:08,469 --> 00:34:12,909
learning algorithm is that all you need
to do, is for iterations 1,
648
00:34:12,909 --> 00:34:19,179
2, 3, 4, et cetera, pick a misclassified
point, any one you like.
649
00:34:19,179 --> 00:34:22,020
650
00:34:22,020 --> 00:34:24,489
And then apply the iteration to it.
651
00:34:24,489 --> 00:34:27,870
The iteration we just talked about,
which is this one.
652
00:34:27,870 --> 00:34:29,480
The top one.
653
00:34:29,480 --> 00:34:31,210
And that's it.
654
00:34:31,210 --> 00:34:35,790
If you do that, and the data was
originally linearly separable, then
655
00:34:35,790 --> 00:34:40,310
you will eventually arrive
at a correct solution.
656
00:34:40,310 --> 00:34:42,870
You will get to something that
classifies all of them correctly.
657
00:34:42,870 --> 00:34:44,340
This is not an obvious statement.
658
00:34:44,340 --> 00:34:45,310
It requires a proof.
659
00:34:45,310 --> 00:34:47,300
The proof is not that hard.
660
00:34:47,300 --> 00:34:51,570
But it gives us the simplest possible
learning model we can think of.
661
00:34:51,570 --> 00:34:54,710
It's a linear model, and
this is your algorithm.
662
00:34:54,710 --> 00:34:59,060
All you need to do is be very patient,
because 1, 2, 3, 4-- this is
663
00:34:59,060 --> 00:35:00,200
a really long sequence.
664
00:35:00,200 --> 00:35:01,900
At times it can be very long.
665
00:35:01,900 --> 00:35:03,310
But it eventually converges.
666
00:35:03,310 --> 00:35:04,350
That's the promise,
667
00:35:04,350 --> 00:35:06,970
as long as the data is
linearly separable.
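The full algorithm as just described fits in a short loop: repeatedly pick any misclassified point and apply the update until none remain. A minimal sketch on a tiny made-up linearly separable data set (the starting weights, data, and iteration cap are illustrative, not from the lecture):

```python
def sign(s):
    # Convention here: treat a score of exactly 0 as -1
    return 1 if s > 0 else -1

def pla(data, d, max_iters=1000):
    """Perceptron learning algorithm on (x, y) pairs, where x[0] = +1."""
    w = [0.0] * (d + 1)                       # start from any weights; zeros here
    for _ in range(max_iters):
        mis = [(x, y) for x, y in data
               if sign(sum(wi * xi for wi, xi in zip(w, x))) != y]
        if not mis:                           # all points classified correctly
            return w
        x, y = mis[0]                         # pick any misclassified point
        w = [wi + y * xi for wi, xi in zip(w, x)]   # the PLA update
    return w                                  # give up if the cap is hit

# Tiny linearly separable set: label is +1 when x1 > x2, with x = [1, x1, x2]
data = [([1.0, 2.0, 1.0], 1), ([1.0, 3.0, 0.5], 1),
        ([1.0, 1.0, 2.0], -1), ([1.0, 0.5, 3.0], -1)]
w = pla(data, d=2)
print(all(sign(sum(wi * xi for wi, xi in zip(w, x))) == y for x, y in data))  # True
```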
668
00:35:06,970 --> 00:35:13,920
So now we have one learning model, and
if I give you now data from a bank--
669
00:35:13,920 --> 00:35:17,216
previous customers and their credit
behavior-- you can actually run the
670
00:35:17,216 --> 00:35:21,090
perceptron learning algorithm, and come up
with a final hypothesis g that you
671
00:35:21,090 --> 00:35:22,630
can hand to the bank.
672
00:35:22,630 --> 00:35:26,390
Not clear at all that it will be good,
because all you did was match the
673
00:35:26,390 --> 00:35:27,750
historical records.
674
00:35:27,750 --> 00:35:31,280
Well, you may ask the question: if I
match the historical records, does this
675
00:35:31,280 --> 00:35:34,300
mean that I'm getting future customers
right, which is the
676
00:35:34,300 --> 00:35:35,480
only thing that matters?
677
00:35:35,480 --> 00:35:38,240
The bank already knows what happened
with the previous customers. It's just
678
00:35:38,240 --> 00:35:41,050
using the data to help you
find a good formula.
679
00:35:41,050 --> 00:35:44,050
The formula will be good or not good to
the extent that it applies to a new
680
00:35:44,050 --> 00:35:47,510
customer, and can predict the
behavior correctly.
681
00:35:47,510 --> 00:35:50,530
Well, that's a loaded question
which will be handled in
682
00:35:50,530 --> 00:35:53,470
extreme detail, when we talk about
the theory of learning.
683
00:35:53,470 --> 00:35:57,190
That's why we have to develop
all of this theory.
684
00:35:57,190 --> 00:35:58,970
So, that's it.
685
00:35:58,970 --> 00:36:02,340
And that is the perceptron
learning algorithm.
686
00:36:02,340 --> 00:36:07,620
Now let me go into the bigger picture
of learning, because what I talked
687
00:36:07,620 --> 00:36:09,990
about so far is one type of learning.
688
00:36:09,990 --> 00:36:13,700
It happens to be by far the most
popular, and the most used.
689
00:36:13,700 --> 00:36:16,250
But there are other types of learning.
690
00:36:16,250 --> 00:36:21,540
So let's talk about the premise of
learning, from which the different
691
00:36:21,540 --> 00:36:24,180
types came about.
692
00:36:24,180 --> 00:36:27,220
That's what learning is about.
693
00:36:27,220 --> 00:36:29,630
694
00:36:29,630 --> 00:36:34,640
This is the premise that is common
between any problem that you
695
00:36:34,640 --> 00:36:36,650
would consider learning.
696
00:36:36,650 --> 00:36:41,450
You use a set of observations,
what we call data, to uncover
697
00:36:41,450 --> 00:36:43,280
an underlying process.
698
00:36:43,280 --> 00:36:46,540
In our case, the target function.
699
00:36:46,540 --> 00:36:51,170
You can see that this is
a very broad premise.
700
00:36:51,170 --> 00:36:54,730
And therefore, you can see that people
have rediscovered that over and over
701
00:36:54,730 --> 00:36:57,160
and over, in so many disciplines.
702
00:36:57,160 --> 00:37:00,870
Can you think of a discipline, other than
machine learning, that uses that
703
00:37:00,870 --> 00:37:02,125
as its exclusive premise?
704
00:37:02,125 --> 00:37:05,420
705
00:37:05,420 --> 00:37:09,460
Has anybody taken courses
in statistics?
706
00:37:09,460 --> 00:37:12,180
In statistics, that's what they do.
707
00:37:12,180 --> 00:37:16,090
The underlying process is
a probability distribution.
708
00:37:16,090 --> 00:37:21,590
And the observations are samples
generated by that distribution.
709
00:37:21,590 --> 00:37:24,220
And you want to take the samples, and
predict what the probability
710
00:37:24,220 --> 00:37:25,890
distribution is.
711
00:37:25,890 --> 00:37:29,740
And over and over, there are so many
disciplines under different names.
712
00:37:29,740 --> 00:37:34,000
Now when we talk about different types
of learning, it's not like we sit down
713
00:37:34,000 --> 00:37:37,970
and look at the world and say, this
looks different from this because the
714
00:37:37,970 --> 00:37:39,420
assumptions look different.
715
00:37:39,420 --> 00:37:43,700
What you do is, you take this premise
and apply it in a context.
716
00:37:43,700 --> 00:37:48,100
And that calls for a certain amount
of mathematics and algorithms.
717
00:37:48,100 --> 00:37:53,690
If a particular set of assumptions takes
you sufficiently far from the
718
00:37:53,690 --> 00:37:57,850
mathematics and the algorithms you used
in the other disciplines that
719
00:37:57,850 --> 00:38:00,360
it takes on a life of its own.
720
00:38:00,360 --> 00:38:03,900
Once it develops its own math and
algorithms, you declare it
721
00:38:03,900 --> 00:38:05,250
a different type.
722
00:38:05,250 --> 00:38:05,660
723
00:38:05,660 --> 00:38:09,750
So when I list the types, it's not
completely obvious just by the slide
724
00:38:09,750 --> 00:38:13,000
itself, that these should be
the types that you have.
725
00:38:13,000 --> 00:38:16,370
But for what it's worth, these
are the most important types.
726
00:38:16,370 --> 00:38:18,110
The first one is supervised learning;
that's what we have
727
00:38:18,110 --> 00:38:18,970
been talking about.
728
00:38:18,970 --> 00:38:22,240
And I will discuss it in detail, and tell
you why it's called supervised.
729
00:38:22,240 --> 00:38:26,640
And it is, by far, the concentration
of this course.
730
00:38:26,640 --> 00:38:31,950
There is another one which is called
unsupervised learning, and
731
00:38:31,950 --> 00:38:33,990
unsupervised learning
is very intriguing.
732
00:38:33,990 --> 00:38:37,310
I will mention it briefly here, and then
we will talk about a very famous
733
00:38:37,310 --> 00:38:40,740
algorithm for unsupervised learning
later in the course.
734
00:38:40,740 --> 00:38:44,090
And the final type is reinforcement
learning, which is even more
735
00:38:44,090 --> 00:38:47,640
intriguing, and I will
discuss it in a brief
736
00:38:47,640 --> 00:38:49,760
introduction in a moment.
737
00:38:49,760 --> 00:38:50,330
738
00:38:50,330 --> 00:38:52,290
So let's take them one by one.
739
00:38:52,290 --> 00:38:53,180
Supervised learning.
740
00:38:53,180 --> 00:38:54,460
So what is supervised learning?
741
00:38:54,460 --> 00:38:56,960
742
00:38:56,960 --> 00:39:01,320
Anytime you have the data that is
given to you, with the output
743
00:39:01,320 --> 00:39:07,630
explicitly given-- here is the user
and movie, and here is the rating.
744
00:39:07,630 --> 00:39:11,030
Here is the previous customer, and
here is their credit behavior.
745
00:39:11,030 --> 00:39:15,270
It's as if a supervisor is helping you
out, in order to be able to classify
746
00:39:15,270 --> 00:39:16,300
the future ones.
747
00:39:16,300 --> 00:39:18,140
That's why it's called supervised.
748
00:39:18,140 --> 00:39:21,160
Let's take an example of coin
recognition, just to be able to
749
00:39:21,160 --> 00:39:24,110
contrast it with unsupervised
learning in a moment.
750
00:39:24,110 --> 00:39:24,630
751
00:39:24,630 --> 00:39:29,350
Let's say you have a vending machine,
and you would like to make
752
00:39:29,350 --> 00:39:31,670
the system able to
recognize the coins.
753
00:39:31,670 --> 00:39:33,030
So what do you do?
754
00:39:33,030 --> 00:39:36,630
You have physical measurements of the
coin, let's be simplistic and say we
755
00:39:36,630 --> 00:39:39,520
measure the size and mass
of the coin you put in.
756
00:39:39,520 --> 00:39:44,980
Now the coins will be quarters,
nickels, pennies, and dimes.
757
00:39:44,980 --> 00:39:46,800
25, 5, 1, and 10.
758
00:39:46,800 --> 00:39:47,500
759
00:39:47,500 --> 00:39:51,760
And when you put the data in this
diagram, they will belong there.
760
00:39:51,760 --> 00:39:56,640
So the quarters, for example, are
bigger, so they will belong here.
761
00:39:56,640 --> 00:40:00,480
And the dimes in the US currency happen
to be the smallest of them,
762
00:40:00,480 --> 00:40:04,200
so they are smallest here, and there
will be a scatter because of the error
763
00:40:04,200 --> 00:40:07,160
in measurement, because of the exposure
to the elements, and whatnot.
764
00:40:07,160 --> 00:40:10,070
So let's say that this is your
training data, and it's supervised
765
00:40:10,070 --> 00:40:11,830
because things are colored.
766
00:40:11,830 --> 00:40:15,660
I gave you those and told you they
are 25 cents, 5 cents, et cetera.
767
00:40:15,660 --> 00:40:20,040
So you use those in order to train
a system, and the system will then be
768
00:40:20,040 --> 00:40:22,100
able to classify a future one.
769
00:40:22,100 --> 00:40:26,990
For example, if we stick to the
linear approach, you may be able to
770
00:40:26,990 --> 00:40:29,890
find separator lines like those.
771
00:40:29,890 --> 00:40:33,250
And those separator lines will
separate, based on the data, the 10
772
00:40:33,250 --> 00:40:35,510
from the 1 from the
5 from the 25.
773
00:40:35,510 --> 00:40:37,240
And once you have those,
774
00:40:37,240 --> 00:40:39,870
you can bid farewell to the data.
You don't need it anymore.
775
00:40:39,870 --> 00:40:42,960
And when you get a future coin that is
now unlabeled, you don't know what it
776
00:40:42,960 --> 00:40:47,220
is, when the vending machine is actually
working, then the coin will
777
00:40:47,220 --> 00:40:51,090
lie in one region or another, and you're
going to classify it accordingly.
778
00:40:51,090 --> 00:40:53,550
So that is supervised learning.
779
00:40:53,550 --> 00:40:56,060
Now let's look at unsupervised
learning.
780
00:40:56,060 --> 00:41:01,490
For unsupervised learning, instead of
having the examples, the training data,
781
00:41:01,490 --> 00:41:05,570
having this form which is the
input plus the correct
782
00:41:05,570 --> 00:41:07,020
target-- the correct output--
783
00:41:07,020 --> 00:41:12,470
the customer and how they behaved
in reality in credit,
784
00:41:12,470 --> 00:41:16,765
we are going to have examples that have
less information, so much less it
785
00:41:16,765 --> 00:41:19,480
is laughable.
786
00:41:19,480 --> 00:41:23,920
I'm just going to tell you
what the input is.
787
00:41:23,920 --> 00:41:27,330
And I'm not going to tell you what
the target function is at all.
788
00:41:27,330 --> 00:41:30,190
I'm not going to tell you anything
about the target function.
789
00:41:30,190 --> 00:41:32,770
I'm just going to tell you, here
is the data of a customer.
790
00:41:32,770 --> 00:41:36,210
Good luck, try to predict the credit.
791
00:41:36,210 --> 00:41:38,300
OK--
792
00:41:38,300 --> 00:41:41,340
How in the world are we
going to do that?
793
00:41:41,340 --> 00:41:44,810
Let me show you that the situation
is not totally hopeless.
794
00:41:44,810 --> 00:41:46,010
That's what I'm going to achieve.
795
00:41:46,010 --> 00:41:48,390
I'm not going to tell you
how to do it completely.
796
00:41:48,390 --> 00:41:51,780
But let me show you that a situation
like that is not totally hopeless.
797
00:41:51,780 --> 00:41:52,620
798
00:41:52,620 --> 00:41:55,330
Let's go for the coin example.
799
00:41:55,330 --> 00:41:56,240
800
00:41:56,240 --> 00:42:01,550
For the coin example, we have
data that looks like this.
801
00:42:01,550 --> 00:42:05,800
If I didn't tell you what the
denominations are, the data
802
00:42:05,800 --> 00:42:08,530
would look like this.
803
00:42:08,530 --> 00:42:09,940
Right?
804
00:42:09,940 --> 00:42:12,220
You have the measurements, but you don't
know, is that a quarter, is
805
00:42:12,220 --> 00:42:14,140
it-- you don't know.
806
00:42:14,140 --> 00:42:17,970
Now honestly, if you look at this
thing, you say I can know
807
00:42:17,970 --> 00:42:19,720
something from this figure.
808
00:42:19,720 --> 00:42:21,740
Things tend to cluster together.
809
00:42:21,740 --> 00:42:25,880
So I may be able to classify those
clusters into categories, without
810
00:42:25,880 --> 00:42:28,440
knowing what the categories are.
811
00:42:28,440 --> 00:42:29,960
That will be quite
an achievement already.
812
00:42:29,960 --> 00:42:33,110
You still don't know whether it's
25 cents, or whatever.
813
00:42:33,110 --> 00:42:36,040
But the data actually made you
able to do something that is
814
00:42:36,040 --> 00:42:38,630
a significant step.
815
00:42:38,630 --> 00:42:42,370
You're going to be able to come
up with these boundaries.
816
00:42:42,370 --> 00:42:43,160
817
00:42:43,160 --> 00:42:46,210
And now, you are so close to
finding the full system.
818
00:42:46,210 --> 00:42:49,470
So unlabeled data actually
can be pretty useful.
819
00:42:49,470 --> 00:42:52,910
Obviously, I have seen the colored
ones, so I actually chose the
820
00:42:52,910 --> 00:42:55,500
boundaries right because I still
remember them visually.
821
00:42:55,500 --> 00:42:58,300
But if you look at the clusters and
you have never heard about that,
822
00:42:58,300 --> 00:43:02,830
especially these guys might not
look like two clusters.
823
00:43:02,830 --> 00:43:04,510
They may look like one cluster.
824
00:43:04,510 --> 00:43:10,260
So it actually could be that this is
ambiguous, and indeed in unsupervised
825
00:43:10,260 --> 00:43:13,900
learning, the number of clusters
is ambiguous at times.
826
00:43:13,900 --> 00:43:16,040
827
00:43:16,040 --> 00:43:18,145
And then, what you do--
828
00:43:18,145 --> 00:43:20,740
829
00:43:20,740 --> 00:43:23,110
this is the output of your system.
Now, I can categorize the
830
00:43:23,110 --> 00:43:24,960
coins into types.
831
00:43:24,960 --> 00:43:28,050
I'm just going to call them
types: type 1, type 2,
832
00:43:28,050 --> 00:43:29,260
type 3, type 4.
833
00:43:29,260 --> 00:43:33,140
I have no idea which belongs to which,
but obviously if someone comes with
834
00:43:33,140 --> 00:43:37,420
a single example of a quarter, a dime,
et cetera, then you are ready to go.
835
00:43:37,420 --> 00:43:37,890
836
00:43:37,890 --> 00:43:40,680
Whereas before, you had to have lots of
examples in order to choose where
837
00:43:40,680 --> 00:43:42,770
exactly to put the boundary.
838
00:43:42,770 --> 00:43:43,880
839
00:43:43,880 --> 00:43:47,600
And this is why a set like that,
which looks like complete
840
00:43:47,600 --> 00:43:50,160
jungle, is actually useful.
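The clustering idea here-- group unlabeled measurements into types without knowing what the types are-- is what k-means does. A minimal sketch, with made-up two-blob data standing in for the coin measurements:

```python
# Minimal k-means sketch: given only unlabeled (size, mass) points, group
# them into k clusters ("type 1", "type 2", ...) without ever seeing
# the denominations.

def kmeans(points, k=2, iters=20):
    """Cluster 2-D points into k groups; returns (centers, clusters)."""
    def d2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    # Deterministic init: first point, then repeatedly the point farthest
    # from all centers chosen so far (a crude k-means++-style seeding).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(d2(p, c) for c in centers)))

    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: d2(p, centers[i]))].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                   if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Unlabeled "coin" measurements: two obvious blobs, denominations unknown.
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]: "type 1" and "type 2"
```

Note the ambiguity the lecture mentions: nothing in the data dictates k, and a different choice of k would merge or split these blobs.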
841
00:43:50,160 --> 00:43:52,850
Let me give you another interesting
example of unsupervised learning,
842
00:43:52,850 --> 00:43:55,610
where I give you the input without the
output, and you are actually in
843
00:43:55,610 --> 00:43:58,320
a better situation to learn.
844
00:43:58,320 --> 00:44:02,290
Let's say that your company or your
school in this case, is sending you
845
00:44:02,290 --> 00:44:05,190
for a semester in Rio de Janeiro.
846
00:44:05,190 --> 00:44:09,690
So you're very excited, and you
decide that you'd better learn some
847
00:44:09,690 --> 00:44:13,660
Portuguese, in order to be able to
speak the language when you arrive.
848
00:44:13,660 --> 00:44:14,370
849
00:44:14,370 --> 00:44:17,830
Not to worry, when you arrive, there
will be a tutor who teaches you
850
00:44:17,830 --> 00:44:18,400
Portuguese.
851
00:44:18,400 --> 00:44:20,400
But you have a month to go,
and you want to help
852
00:44:20,400 --> 00:44:22,320
yourself as much as possible.
853
00:44:22,320 --> 00:44:26,620
You look around, and you find that the
only resource you have is a radio
854
00:44:26,620 --> 00:44:30,080
station in Portuguese in your car.
855
00:44:30,080 --> 00:44:35,080
So what you do, you just turn
it on whenever you drive.
856
00:44:35,080 --> 00:44:38,680
And for an entire month, you're
bombarded with Portuguese.
857
00:44:38,680 --> 00:44:42,890
"tudo bem", "como vai", "valeu",
stuff like that comes back.
858
00:44:42,890 --> 00:44:45,550
After a while, without knowing anything--
it's unsupervised, nobody
859
00:44:45,550 --> 00:44:47,250
told you the meaning of any word--
860
00:44:47,250 --> 00:44:50,870
you start to develop a model of
the language in your mind.
861
00:44:50,870 --> 00:44:52,370
You know what the idioms
are, et cetera.
862
00:44:52,370 --> 00:44:54,930
You are very eager to know
what actually "tudo bem"
863
00:44:54,930 --> 00:44:56,350
-- what does that mean?
864
00:44:56,350 --> 00:44:58,380
You are ready to learn, and once
you learn it, it's actually
865
00:44:58,380 --> 00:44:59,780
fixed in your mind.
866
00:44:59,780 --> 00:45:03,130
Then when you go there, you will learn
the language faster than if you didn't
867
00:45:03,130 --> 00:45:05,070
go through this experience.
868
00:45:05,070 --> 00:45:08,320
So you can think of unsupervised
learning, in one way or another, as
869
00:45:08,320 --> 00:45:12,300
a way of getting a higher-level
representation of the input.
870
00:45:12,300 --> 00:45:15,580
Whether it's extremely high level as
in clusters-- you forgot all the
871
00:45:15,580 --> 00:45:19,680
attributes and you just tell me a label,
or higher level as in this-- a better
872
00:45:19,680 --> 00:45:23,620
representation than just the
crude input into some model
873
00:45:23,620 --> 00:45:25,212
in your mind.
874
00:45:25,212 --> 00:45:29,280
875
00:45:29,280 --> 00:45:32,250
Now let's talk about
reinforcement learning.
876
00:45:32,250 --> 00:45:35,430
In this case, it's not as bad
as unsupervised learning.
877
00:45:35,430 --> 00:45:38,970
So again, without the benefit of
supervised learning, you don't get
878
00:45:38,970 --> 00:45:40,810
the correct output.
879
00:45:40,810 --> 00:45:44,550
What you do is-- I will
give you the input.
880
00:45:44,550 --> 00:45:46,750
OK, thank you very much,
that's very kind.
881
00:45:46,750 --> 00:45:48,580
What else?
882
00:45:48,580 --> 00:45:53,450
I'm going to give you some output.
883
00:45:53,450 --> 00:45:54,540
The correct output?
884
00:45:54,540 --> 00:45:55,200
No!
885
00:45:55,200 --> 00:45:56,690
Some output.
886
00:45:56,690 --> 00:46:01,070
OK, that's very nice, but it doesn't
seem very helpful.
887
00:46:01,070 --> 00:46:05,100
It looks now like unsupervised learning,
because in unsupervised learning I
888
00:46:05,100 --> 00:46:06,460
could give you some output.
889
00:46:06,460 --> 00:46:08,080
Here is a dime. Oh, it's a quarter.
890
00:46:08,080 --> 00:46:10,490
It's some output!
891
00:46:10,490 --> 00:46:12,740
Such output has no information.
892
00:46:12,740 --> 00:46:16,240
The information comes from the next one.
893
00:46:16,240 --> 00:46:19,520
I'm going to grade this output.
894
00:46:19,520 --> 00:46:21,440
So that is the information
provided to you.
895
00:46:21,440 --> 00:46:26,200
So I'm not explicitly giving you the
output, but when you choose an output,
896
00:46:26,200 --> 00:46:28,900
I'm going to tell you how
well you're doing.
897
00:46:28,900 --> 00:46:31,850
Reinforcement learning is interesting
because it is mostly our own
898
00:46:31,850 --> 00:46:33,450
experience in learning.
899
00:46:33,450 --> 00:46:38,060
Think of a toddler, and a hot
cup of tea in front of her.
900
00:46:38,060 --> 00:46:40,610
She is looking at it, and
she is very curious.
901
00:46:40,610 --> 00:46:43,210
So she reaches to touch. Ouch!
902
00:46:43,210 --> 00:46:44,720
And she starts crying.
903
00:46:44,720 --> 00:46:47,790
The reward is very negative
for trying.
904
00:46:47,790 --> 00:46:51,325
Now next time she looks at it, and she
remembers the previous experience, and
905
00:46:51,325 --> 00:46:52,760
she doesn't touch it.
906
00:46:52,760 --> 00:46:56,120
But there is a certain level of pain,
because there is an unfulfilled
907
00:46:56,120 --> 00:46:57,870
curiosity.
908
00:46:57,870 --> 00:47:01,860
And curiosity killed the cat. After
three or four trials, the toddler
909
00:47:01,860 --> 00:47:02,530
tries again.
910
00:47:02,530 --> 00:47:04,100
Maybe now it's OK.
911
00:47:04,100 --> 00:47:05,420
And Ouch!
912
00:47:05,420 --> 00:47:09,890
Eventually, from just the grade of the
behavior-- to touch it or not to
913
00:47:09,890 --> 00:47:14,290
touch it-- the toddler will learn not to
touch cups of tea that have smoke
914
00:47:14,290 --> 00:47:15,350
coming out of them.
915
00:47:15,350 --> 00:47:16,060
916
00:47:16,060 --> 00:47:18,930
So that is a case of
reinforcement learning.
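The toddler story is essentially a tiny bandit problem: choose an action, receive only a grade for that action, and update a running value estimate. A minimal sketch, with a made-up reward function standing in for the hot cup:

```python
# Minimal reinforcement-learning sketch of the toddler example: two actions,
# and the environment returns only a grade for the chosen action -- never
# the correct action. The learner keeps a value estimate per action and
# mostly picks the best one, occasionally exploring (curiosity).
import random

def reward(action):
    # Made-up environment: touching the hot cup hurts; not touching is neutral.
    return -10.0 if action == "touch" else 0.0

def learn(episodes=100, epsilon=0.2, alpha=0.5, seed=1):
    rng = random.Random(seed)
    value = {"touch": 0.0, "dont_touch": 0.0}
    for _ in range(episodes):
        # Explore with probability epsilon; otherwise exploit the best estimate.
        if rng.random() < epsilon:
            action = rng.choice(list(value))
        else:
            action = max(value, key=value.get)
        # Update the running estimate from the grade alone.
        value[action] += alpha * (reward(action) - value[action])
    return value

values = learn()
print(max(values, key=values.get))  # the learned behavior: dont_touch
```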
917
00:47:18,930 --> 00:47:22,340
The most important application, or one
of the most important applications of
918
00:47:22,340 --> 00:47:26,650
reinforcement learning, is
in playing games.
919
00:47:26,650 --> 00:47:28,290
920
00:47:28,290 --> 00:47:32,420
So backgammon is one of the games,
and suppose that you want
921
00:47:32,420 --> 00:47:33,600
a system to learn it.
922
00:47:33,600 --> 00:47:40,050
So what you want, you want to take the
current state of the board, and you
923
00:47:40,050 --> 00:47:44,010
roll the dice, and then you decide
what is the optimal move in
924
00:47:44,010 --> 00:47:45,960
order to stand the best chance to win.
925
00:47:45,960 --> 00:47:46,830
That's the game.
926
00:47:46,830 --> 00:47:50,890
So the target function is the
best move given a state.
927
00:47:50,890 --> 00:47:55,680
Now, if I have to generate those things
in order for the system to
928
00:47:55,680 --> 00:48:00,430
learn, then I must be a pretty good
backgammon player already.
929
00:48:00,430 --> 00:48:03,580
So now it's a vicious cycle.
930
00:48:03,580 --> 00:48:06,480
Now, reinforcement learning
comes in handy.
931
00:48:06,480 --> 00:48:09,030
What you're going to do, you
are going to have the
932
00:48:09,030 --> 00:48:11,070
computer choose any output.
933
00:48:11,070 --> 00:48:13,790
A crazy move, for all you care.
934
00:48:13,790 --> 00:48:16,070
And then see what happens eventually.
935
00:48:16,070 --> 00:48:19,280
So this computer is playing against
another computer, both of
936
00:48:19,280 --> 00:48:21,040
them want to learn.
937
00:48:21,040 --> 00:48:24,730
And you make a move, and eventually
you win or lose.
938
00:48:24,730 --> 00:48:28,280
So you propagate back the credit
because of winning or losing,
939
00:48:28,280 --> 00:48:31,780
according to a very specific and
sophisticated formula, into all the
940
00:48:31,780 --> 00:48:34,810
moves that happened.
941
00:48:34,810 --> 00:48:37,570
Now you think that's completely hopeless,
because maybe this is not the
942
00:48:37,570 --> 00:48:39,750
move that resulted in this,
it's another move.
943
00:48:39,750 --> 00:48:45,390
But always remember, that you are going
to do this 100 billion times.
944
00:48:45,390 --> 00:48:47,130
Not you, the poor computer.
945
00:48:47,130 --> 00:48:49,610
You're sitting down sipping
your tea.
946
00:48:49,610 --> 00:48:53,460
A computer is doing this, playing
against an imaginary opponent, and
947
00:48:53,460 --> 00:48:55,240
they keep playing and
playing and playing.
948
00:48:55,240 --> 00:48:58,530
And in three hours of CPU time, you go
back to the computer-- maybe not three
949
00:48:58,530 --> 00:49:02,180
hours, maybe three days of CPU time--
you go back to the computer, and you
950
00:49:02,180 --> 00:49:03,505
have a backgammon champion.
951
00:49:03,505 --> 00:49:06,430
952
00:49:06,430 --> 00:49:09,960
Actually, that's true.
953
00:49:09,960 --> 00:49:13,900
The world champion, at some point, was
a neural network that learned the way
954
00:49:13,900 --> 00:49:15,880
I described.
955
00:49:15,880 --> 00:49:20,730
So it is actually a very attractive
approach, because in machine
956
00:49:20,730 --> 00:49:24,590
learning now, we have a target function
that we cannot model.
957
00:49:24,590 --> 00:49:27,590
That covers a lot of territory,
I've seen a lot of those.
958
00:49:27,590 --> 00:49:29,720
We have data coming from
the target function.
959
00:49:29,720 --> 00:49:30,830
960
00:49:30,830 --> 00:49:32,560
I usually have that.
961
00:49:32,560 --> 00:49:36,010
And now we have the lazy
man's approach to life.
962
00:49:36,010 --> 00:49:39,410
We are going to sit down, and let the
computer do all of the work, and
963
00:49:39,410 --> 00:49:40,830
produce the system we want.
964
00:49:40,830 --> 00:49:44,090
Instead of studying the thing
mathematically, and writing code, and
965
00:49:44,090 --> 00:49:44,900
debugging--
966
00:49:44,900 --> 00:49:46,740
I hate debugging.
967
00:49:46,740 --> 00:49:49,650
And then you go. No,
we're not going to do that.
968
00:49:49,650 --> 00:49:52,550
The learning algorithm just works,
and produces something good.
969
00:49:52,550 --> 00:49:53,040
970
00:49:53,040 --> 00:49:54,490
And we get the check.
971
00:49:54,490 --> 00:49:56,950
So this is a pretty good deal.
972
00:49:56,950 --> 00:50:03,880
It actually is so good, it might
be too good to be true.
973
00:50:03,880 --> 00:50:07,120
So let's actually examine if
all of this was a fantasy.
974
00:50:07,120 --> 00:50:10,590
975
00:50:10,590 --> 00:50:14,080
So now I'm going to give you
a learning puzzle.
976
00:50:14,080 --> 00:50:16,020
Humans are very good learners, right?
977
00:50:16,020 --> 00:50:17,630
978
00:50:17,630 --> 00:50:21,170
So I'm now going to give you a learning
problem in the form that I
979
00:50:21,170 --> 00:50:23,870
described, a supervised
learning problem.
980
00:50:23,870 --> 00:50:28,910
And that supervised learning problem
will give you a training set, some
981
00:50:28,910 --> 00:50:32,300
points mapped to +1, some
points mapped to -1.
982
00:50:32,300 --> 00:50:35,600
And then I'm going to give you
a test point that is unlabeled.
983
00:50:35,600 --> 00:50:41,780
Your task is to look at the examples,
learn the target function, apply it to
984
00:50:41,780 --> 00:50:46,630
the test point, and then decide what
the value of the function is.
985
00:50:46,630 --> 00:50:50,550
After that, I'm going to ask, who
decided that the function is +1,
986
00:50:50,550 --> 00:50:53,130
and who decided that the
function is -1.
987
00:50:53,130 --> 00:50:55,720
OK? It's clear what the deal is.
988
00:50:55,720 --> 00:50:55,730
989
00:50:55,730 --> 00:50:59,680
And I would like our online audience
to do the same thing.
990
00:50:59,680 --> 00:51:02,650
And please text what the solution is.
991
00:51:02,650 --> 00:51:04,900
Just +1 or -1.
992
00:51:04,900 --> 00:51:05,590
993
00:51:05,590 --> 00:51:06,560
Fair enough?
994
00:51:06,560 --> 00:51:07,810
Let's start the game.
995
00:51:07,810 --> 00:51:12,260
996
00:51:12,260 --> 00:51:14,890
997
00:51:14,890 --> 00:51:19,390
What is above the line are
the training examples.
998
00:51:19,390 --> 00:51:23,870
I put the input as a three-by-three
pattern in order to be visually easy
999
00:51:23,870 --> 00:51:24,780
to understand.
1000
00:51:24,780 --> 00:51:28,370
But this is just really nine
bits worth of information.
1001
00:51:28,370 --> 00:51:31,470
And they are ones and zeros,
black and white.
1002
00:51:31,470 --> 00:51:36,640
And for this input, this input, and this
input, the value of the target
1003
00:51:36,640 --> 00:51:38,760
function is -1.
1004
00:51:38,760 --> 00:51:42,160
For this input, this input, and this
input, the value of the target
1005
00:51:42,160 --> 00:51:44,470
function is +1.
1006
00:51:44,470 --> 00:51:47,980
Now this is your data set, this
is your training set.
1007
00:51:47,980 --> 00:51:49,360
Now you should learn the function.
1008
00:51:49,360 --> 00:51:52,980
And when you're done, could you please
tell me what your function will return
1009
00:51:52,980 --> 00:51:54,680
on this test point?
1010
00:51:54,680 --> 00:51:57,130
Is it +1 or -1?

1011
00:51:57,130 --> 00:52:00,480
I will give everybody 30 seconds
before I ask for an answer.
1012
00:52:00,480 --> 00:52:05,330
1013
00:52:05,330 --> 00:52:06,930
Maybe we should have some
background music?
1014
00:52:06,930 --> 00:52:13,680
1015
00:52:13,680 --> 00:52:14,930
1016
00:52:14,930 --> 00:52:20,400
1017
00:52:20,400 --> 00:52:22,680
OK, time's up.
1018
00:52:22,680 --> 00:52:24,850
Your learning algorithm
has converged, I hope.
1019
00:52:24,850 --> 00:52:30,835
And now we apply it here, and I ask
people here, who says it's +1?
1020
00:52:30,835 --> 00:52:34,120
1021
00:52:34,120 --> 00:52:35,180
Thank you.
1022
00:52:35,180 --> 00:52:37,810
Who says it's -1?
1023
00:52:37,810 --> 00:52:39,270
Thank you.
1024
00:52:39,270 --> 00:52:42,020
I see that the online audience
also contributed?
1025
00:52:42,020 --> 00:52:44,070
MODERATOR: Yeah, the big
majority says +1.
1026
00:52:44,070 --> 00:52:45,950
PROFESSOR: But
are there -1's?
1027
00:52:45,950 --> 00:52:46,840
MODERATOR: Two -1's.
1028
00:52:46,840 --> 00:52:47,300
1029
00:52:47,300 --> 00:52:48,320
PROFESSOR: Cool.
1030
00:52:48,320 --> 00:52:49,050
1031
00:52:49,050 --> 00:52:50,990
I don't care if it's
a +1 or -1.
1032
00:52:50,990 --> 00:52:54,270
What I care about is that
I get both answers.
1033
00:52:54,270 --> 00:52:55,810
That is the essence of it.
1034
00:52:55,810 --> 00:52:57,280
Why do I care?
1035
00:52:57,280 --> 00:53:00,760
Because in reality, this
is an impossible task.
1036
00:53:00,760 --> 00:53:03,470
1037
00:53:03,470 --> 00:53:06,090
I told you the target
function is unknown.
1038
00:53:06,090 --> 00:53:11,110
It could be anything,
really anything.
1039
00:53:11,110 --> 00:53:15,740
And now I give you the value of the
target function at 6 points.
1040
00:53:15,740 --> 00:53:19,470
Well, there are many functions that
fit those 6 points, and behave
1041
00:53:19,470 --> 00:53:21,470
differently outside.
1042
00:53:21,470 --> 00:53:32,400
For example, if you take the function
to be +1 if the top left square
1043
00:53:32,400 --> 00:53:40,510
is white, then this should
be -1, right?
1044
00:53:40,510 --> 00:53:49,880
If you take the function to be +1
if the pattern is symmetric--
1045
00:53:49,880 --> 00:53:52,700
let's see, I said it
the other way around.
1046
00:53:52,700 --> 00:53:57,430
So the top one is black,
it would be -1.
1047
00:53:57,430 --> 00:53:58,850
So this would be -1.
1048
00:53:58,850 --> 00:54:00,680
If it's symmetric, it would be +1.
1049
00:54:00,680 --> 00:54:03,420
So this would be +1, because
this guy has both-- this is
1050
00:54:03,420 --> 00:54:05,380
black, and also it is symmetric.
1051
00:54:05,380 --> 00:54:06,230
Right?
1052
00:54:06,230 --> 00:54:09,320
And you can find infinite
variety like that.
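The point-- that many target functions agree on the training set and disagree on the test point-- can be checked exhaustively on a scaled-down version of the puzzle. Here inputs are 3-bit patterns rather than 9-bit, so every possible target function can be enumerated; the training labels are arbitrary.

```python
# Tiny version of the puzzle: 3-bit inputs (8 possible inputs) instead of
# 9-bit, so we can enumerate every target function f: {0,1}^3 -> {+1,-1}.
# Fix the labels on 6 training inputs; among the functions consistent with
# all 6, the label of a 7th test input is +1 in exactly half of them and
# -1 in the other half. The data says nothing about the outside.
from itertools import product

inputs = list(product([0, 1], repeat=3))
train = dict(zip(inputs[:6], [-1, -1, -1, +1, +1, +1]))  # arbitrary labels
test_point = inputs[6]

# Each candidate target function is one assignment of +-1 to all 8 inputs.
consistent = []
for labels in product([-1, +1], repeat=8):
    f = dict(zip(inputs, labels))
    if all(f[x] == y for x, y in train.items()):
        consistent.append(f)

plus = sum(f[test_point] == +1 for f in consistent)
minus = sum(f[test_point] == -1 for f in consistent)
print(len(consistent), plus, minus)  # 4 consistent functions, split 2 and 2
```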
1053
00:54:09,320 --> 00:54:12,310
And that problem is not restricted
to this case.
1054
00:54:12,310 --> 00:54:14,300
1055
00:54:14,300 --> 00:54:15,530
The question here is obvious.
1056
00:54:15,530 --> 00:54:17,010
The function is unknown.
1057
00:54:17,010 --> 00:54:18,300
You really mean unknown, right?
1058
00:54:18,300 --> 00:54:19,430
Yes, I mean it.
1059
00:54:19,430 --> 00:54:20,260
Unknown-- anything?
1060
00:54:20,260 --> 00:54:21,280
Yes, I do.
1061
00:54:21,280 --> 00:54:22,160
OK.
1062
00:54:22,160 --> 00:54:26,110
You give me a finite sample,
it can be anything outside.
1063
00:54:26,110 --> 00:54:30,750
How in the world am I going to tell
what the function is outside?
1064
00:54:30,750 --> 00:54:33,720
OK, that sounds about right.
1065
00:54:33,720 --> 00:54:37,150
But we are in trouble, because that's
the premise of learning.
1066
00:54:37,150 --> 00:54:41,400
If the goal was to memorize the examples
I gave you, that would be
1067
00:54:41,400 --> 00:54:43,780
memorizing, not learning.
1068
00:54:43,780 --> 00:54:48,110
Learning is to figure out a pattern
that applies outside.
1069
00:54:48,110 --> 00:54:53,370
And now we realize that outside,
I cannot say anything.
1070
00:54:53,370 --> 00:54:56,641
Does this mean that learning
is doomed?
1071
00:54:56,641 --> 00:55:00,860
Well, this is going to be
a very short course!
1072
00:55:00,860 --> 00:55:06,230
Well, the good news is that learning
is alive and well.
1073
00:55:06,230 --> 00:55:13,420
And we are going to show that, without
compromising our basic premise.
1074
00:55:13,420 --> 00:55:18,320
The target function will
continue to be unknown.
1075
00:55:18,320 --> 00:55:21,440
And we still mean unknown.
1076
00:55:21,440 --> 00:55:24,390
And we will be able to learn.
1077
00:55:24,390 --> 00:55:27,620
And that will be the subject
of the next lecture.
1078
00:55:27,620 --> 00:55:32,410
Right now, we are going to go for
a short break, after which we are going
1079
00:55:32,410 --> 00:55:40,150
to take the Q&A.
1080
00:55:40,150 --> 00:55:43,270
1081
00:55:43,270 --> 00:55:49,350
We'll start the Q&A, and we will get
questions from the class here, and
1082
00:55:49,350 --> 00:55:51,270
from the online audience.
1083
00:55:51,270 --> 00:55:56,160
And if you'd like to ask a question, let
me ask you to go to this side of
1084
00:55:56,160 --> 00:56:00,630
the room where the mic is, so that
your question can be heard.
1085
00:56:00,630 --> 00:56:04,680
And we will alternate, if there are
questions here, we will alternate
1086
00:56:04,680 --> 00:56:07,540
between campus and off campus.
1087
00:56:07,540 --> 00:56:11,030
So let me start if there is
a question from outside.
1088
00:56:11,030 --> 00:56:16,080
MODERATOR: Yes, so the most common
question is, how do you determine if
1089
00:56:16,080 --> 00:56:19,050
a set of points is linearly
separable, and what do you do
1090
00:56:19,050 --> 00:56:20,730
if they're not separable.
1091
00:56:20,730 --> 00:56:26,120
PROFESSOR: The linear separability
assumption is a very
1092
00:56:26,120 --> 00:56:29,850
simplistic assumption, and doesn't
apply mostly in practice.
1093
00:56:29,850 --> 00:56:34,780
And I chose it only because it goes with
a very simple algorithm, which is
1094
00:56:34,780 --> 00:56:36,950
the perceptron learning algorithm.
1095
00:56:36,950 --> 00:56:40,850
There are two ways to deal with the
case of linear inseparability.
1096
00:56:40,850 --> 00:56:44,450
There are algorithms, and most
algorithms actually deal with that
1097
00:56:44,450 --> 00:56:49,730
case, and there's also a technique that
we are going to study next
1098
00:56:49,730 --> 00:56:55,330
week, which will take a set of points
which is not linearly separable, and
1099
00:56:55,330 --> 00:56:59,150
create a mapping that makes
them linearly separable.
1100
00:56:59,150 --> 00:57:02,050
So there is a way to deal with it.
1101
00:57:02,050 --> 00:57:05,990
However, for the question of how you
determine it's linearly separable, the
1102
00:57:05,990 --> 00:57:09,240
right way of doing it in practice is
that, when someone gives you data, you
1103
00:57:09,240 --> 00:57:11,870
assume in general it's not
linearly separable.
1104
00:57:11,870 --> 00:57:15,310
It will hardly ever be, and therefore
you take techniques that can deal with
1105
00:57:15,310 --> 00:57:16,630
that case as well.
1106
00:57:16,630 --> 00:57:20,100
There is a simple modification of the
perceptron learning algorithm, which
1107
00:57:20,100 --> 00:57:21,670
is called the pocket algorithm,
1108
00:57:21,670 --> 00:57:26,190
that applies the same rule with a very
minor modification, and deals with the
1109
00:57:26,190 --> 00:57:29,460
case where the data is not separable.
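The pocket idea mentioned here can be sketched briefly: run the ordinary perceptron update, but keep "in your pocket" the best weight vector seen so far, measured by training misclassifications. The toy non-separable data below is made up for illustration.

```python
# Minimal sketch of the pocket algorithm: ordinary perceptron updates, but
# after each update we keep ("pocket") the weights with the fewest training
# misclassifications so far. On non-separable data the raw perceptron can
# hop from good weights to terrible ones; the pocket copy only improves.

def errors(w, data):
    # x is augmented with a leading 1 for the bias term.
    return sum(1 for x, y in data
               if (1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1) != y)

def pocket(data, dim, iters=200):
    w = [0.0] * (dim + 1)
    best_w, best_e = list(w), errors(w, data)
    for t in range(iters):
        # Perceptron step: pick a misclassified point and update on it.
        mis = [(x, y) for x, y in data
               if (1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1) != y]
        if not mis:
            return w, 0          # separable case: PLA converged
        x, y = mis[t % len(mis)]
        w = [wi + y * xi for wi, xi in zip(w, x)]
        e = errors(w, data)
        if e < best_e:           # pocket update: keep only improvements
            best_w, best_e = list(w), e
    return best_w, best_e

# Non-separable 1-D data: one flipped label makes perfect separation impossible.
data = [((1, 0.0), -1), ((1, 1.0), -1), ((1, 2.0), +1),
        ((1, 3.0), +1), ((1, 0.5), +1)]          # the flipped point
pocket_w, pocket_e = pocket(data, dim=1)
print(pocket_e)  # best achievable here: 1 misclassified point
```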
1110
00:57:29,460 --> 00:57:34,820
However, if you apply the perceptron
learning algorithm, that is guaranteed
1111
00:57:34,820 --> 00:57:38,990
to converge to a correct solution in the
case of linear separability, and
1112
00:57:38,990 --> 00:57:43,520
you apply it to data that is not
linearly separable, bad things happen.
1113
00:57:43,520 --> 00:57:46,800
Not only is it going not to converge,
obviously it is not going to converge
1114
00:57:46,800 --> 00:57:50,600
because it terminates when there are
no misclassified points, right?
1115
00:57:50,600 --> 00:57:53,640
If there is a misclassified point, then
there's a next iteration always.
1116
00:57:53,640 --> 00:57:56,500
So since the data is not linearly
separable, we will never come to
1117
00:57:56,500 --> 00:57:59,110
a point where all the points
are classified correctly.
1118
00:57:59,110 --> 00:58:01,220
So this is not what is bothering us.
1119
00:58:01,220 --> 00:58:04,570
What is bothering us is that, as you go
from one step to another, you can
1120
00:58:04,570 --> 00:58:08,040
go from a very good solution
to a terrible solution.
1121
00:58:08,040 --> 00:58:10,450
In the case of no linear separability.
1122
00:58:10,450 --> 00:58:13,530
So it's not an algorithm that you
would like to use, and just
1123
00:58:13,530 --> 00:58:15,650
terminate by force at an iteration.
1124
00:58:15,650 --> 00:58:21,360
A modification of it can be used this
way, and I'll mention it briefly when
1125
00:58:21,360 --> 00:58:26,120
we talk about linear regression
and other linear methods.
1126
00:58:26,120 --> 00:58:29,770
MODERATOR: There's also a question of
how does the rate of convergence of
1127
00:58:29,770 --> 00:58:33,810
the perceptron change with the
dimensionality of the data?
1128
00:58:33,810 --> 00:58:35,840
PROFESSOR: Badly!
1129
00:58:35,840 --> 00:58:37,200
That's the answer.
1130
00:58:37,200 --> 00:58:38,440
Let me put it this way.
1131
00:58:38,440 --> 00:58:42,000
You can build pathological cases, where
it really will take forever.
1132
00:58:42,000 --> 00:58:45,230
However, I did not give the perceptron
learning algorithm in the first
1133
00:58:45,230 --> 00:58:47,900
lecture to tell you that this is
the great algorithm that you
1134
00:58:47,900 --> 00:58:49,160
need to learn.
1135
00:58:49,160 --> 00:58:51,720
I gave it in the first lecture,
because this is the simplest
1136
00:58:51,720 --> 00:58:53,550
algorithm I could give.
1137
00:58:53,550 --> 00:58:56,990
By the end of this course,
you'll be saying, what?
1138
00:58:56,990 --> 00:58:57,650
Perceptron?
1139
00:58:57,650 --> 00:58:58,880
Never heard of it.
1140
00:58:58,880 --> 00:59:02,710
So it will go out of contention, after we
get to the more interesting stuff.
1141
00:59:02,710 --> 00:59:03,240
1142
00:59:03,240 --> 00:59:07,090
But as a method that can be used, it
indeed can be used, and can be
1143
00:59:07,090 --> 00:59:09,710
explained in five minutes
as you have seen.
1144
00:59:09,710 --> 00:59:15,050
MODERATOR: Regarding the items for
learning, you mentioned that there
1145
00:59:15,050 --> 00:59:15,900
must be a pattern.
1146
00:59:15,900 --> 00:59:18,400
So can you be more specific about that?
1147
00:59:18,400 --> 00:59:20,590
How do you know if there's a pattern?
1148
00:59:20,590 --> 00:59:21,940
PROFESSOR: You don't.
1149
00:59:21,940 --> 00:59:25,840
My answers seem to be very abrupt,
but that's the way it is.
1150
00:59:25,840 --> 00:59:29,680
When we get to the theory--
is learning feasible-- it will
1151
00:59:29,680 --> 00:59:34,060
become very clear that there is
a separation between the target
1152
00:59:34,060 --> 00:59:35,730
function-- there is
a pattern to detect--
1153
00:59:35,730 --> 00:59:37,150
and whether we can learn it.
1154
00:59:37,150 --> 00:59:40,150
It is very difficult for me to explain
it in two minutes, it will take a full
1155
00:59:40,150 --> 00:59:41,500
lecture to get there.
1156
00:59:41,500 --> 00:59:47,600
But the essence of it is that you take
the data, you apply your learning
1157
00:59:47,600 --> 00:59:52,710
algorithm, and there is something you
can explicitly detect that will
1158
00:59:52,710 --> 00:59:54,890
tell you whether you learned or not.
1159
00:59:54,890 --> 00:59:57,630
So in some cases, you're not
going to be able to learn.
1160
00:59:57,630 --> 00:59:59,890
In some cases, you'll be able to learn.
1161
00:59:59,890 --> 01:00:02,630
And the key is that you're going
to be able to tell by
1162
01:00:02,630 --> 01:00:04,440
running your algorithm.
1163
01:00:04,440 --> 01:00:07,280
And I'm going to explain that
in more details later on.
1164
01:00:07,280 --> 01:00:08,010
1165
01:00:08,010 --> 01:00:15,660
So basically, I'm also resisting
taking the data, deciding
1166
01:00:15,660 --> 01:00:19,220
whether it's linearly separable, looking
at it and seeing. You will
1167
01:00:19,220 --> 01:00:25,370
realize as we go through that it's
a no-no to actually look at the data.
1168
01:00:25,370 --> 01:00:26,860
What?
1169
01:00:26,860 --> 01:00:29,580
That's what data is for, to look at.
1170
01:00:29,580 --> 01:00:30,850
Bear with me.
1171
01:00:30,850 --> 01:00:34,720
We will come to the level where we ask
why don't we look at the data--
1172
01:00:34,720 --> 01:00:37,920
just looking at it and then saying:
It's linearly separable.
1173
01:00:37,920 --> 01:00:39,890
Let's pick the perceptron.
1174
01:00:39,890 --> 01:00:42,370
That's bad practice, for reasons
that are not obvious now.
1175
01:00:42,370 --> 01:00:45,460
They will become obvious, once we
are done with the theory.
1176
01:00:45,460 --> 01:00:50,330
So when someone knocks on my door with
a set of data, I can ask them all
1177
01:00:50,330 --> 01:00:54,360
kinds of questions about the data-- not
the particular data set that they gave
1178
01:00:54,360 --> 01:00:57,750
me, but about the general data that
is generated by their process.
1179
01:00:57,750 --> 01:01:00,570
They can tell me this variable is
important, the function is symmetric,
1180
01:01:00,570 --> 01:01:04,210
they can give you all kinds of
information that I will take to heart.
1181
01:01:04,210 --> 01:01:08,730
But I will try, as much as I can, to
avoid looking at the particular data
1182
01:01:08,730 --> 01:01:14,680
set that they gave me, lest I should
tailor my system toward this data set,
1183
01:01:14,680 --> 01:01:17,680
and be disappointed when another
data set comes about.
1184
01:01:17,680 --> 01:01:20,100
You don't want to get too
close to the data set.
1185
01:01:20,100 --> 01:01:24,190
This will become very clear
as we go with the theory.
1186
01:01:24,190 --> 01:01:27,190
MODERATOR: In general about
machine learning, how does it
1187
01:01:27,190 --> 01:01:30,550
relate to other statistical, especially
econometric techniques?
1188
01:01:30,550 --> 01:01:33,090
1189
01:01:33,090 --> 01:01:37,150
PROFESSOR: Statistics, in
the form I described, is machine
1190
01:01:37,150 --> 01:01:38,710
learning where the target--
1191
01:01:38,710 --> 01:01:42,010
it's not a function in this case--
is a probability distribution.
1192
01:01:42,010 --> 01:01:44,670
Statistics is a mathematical field.
1193
01:01:44,670 --> 01:01:49,100
And therefore, you put the assumptions
that you need in order to be able to
1194
01:01:49,100 --> 01:01:53,970
rigorously prove the results you have,
and get the results in detail.
1195
01:01:53,970 --> 01:01:55,700
For example, linear regression.
1196
01:01:55,700 --> 01:01:59,810
When we talk about linear regression, it
will have very few assumptions, and
1197
01:01:59,810 --> 01:02:03,150
the results will apply to a wide range,
because we didn't make too many
1198
01:02:03,150 --> 01:02:04,330
assumptions.
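The point about few assumptions can be made concrete with a short sketch of my own (in NumPy; the function name and data are illustrative, not from the course): the machine-learning view of linear regression simply minimizes squared error on the data, with no distributional assumptions about how the data was generated.

```python
import numpy as np

def linear_regression(X, y):
    """Least-squares linear fit -- a minimal sketch of the
    machine-learning view: minimize squared error on the data,
    assuming nothing about the distribution that produced it."""
    Xb = np.column_stack([np.ones(len(X)), X])  # constant (bias) term
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # minimizes squared error
    return w

# Exactly linear data y = 1 + 2x is recovered exactly.
w = linear_regression(np.array([[0.0], [1.0], [2.0]]),
                      np.array([1.0, 3.0, 5.0]))
```

The statistics treatment of the same fit would add assumptions (e.g. Gaussian noise) in order to prove stronger, more detailed results.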
1199
01:02:04,330 --> 01:02:07,530
When you study linear regression under
statistics, there is a lot of
1200
01:02:07,530 --> 01:02:11,020
mathematics that goes with it, lot of
assumptions, because that is the
1201
01:02:11,020 --> 01:02:12,640
purpose of the field.
1202
01:02:12,640 --> 01:02:18,560
In general, machine learning tries to make
the fewest assumptions and cover the
1203
01:02:18,560 --> 01:02:22,090
most territory. These go together.
1204
01:02:22,090 --> 01:02:25,640
So it is not a mathematical discipline,
but it's not a purely
1205
01:02:25,640 --> 01:02:26,850
applied discipline.
1206
01:02:26,850 --> 01:02:31,270
It spans both the mathematical, to
certain extent, but it is willing to
1207
01:02:31,270 --> 01:02:35,870
actually go into territory where we
don't have mathematical models, and
1208
01:02:35,870 --> 01:02:38,040
still want to apply our techniques.
1209
01:02:38,040 --> 01:02:40,600
So that is what characterizes
it the most.
1210
01:02:40,600 --> 01:02:44,120
And then there are other fields.
Machine learning itself
1211
01:02:44,120 --> 01:02:46,400
you can find under the name
computational learning,
1212
01:02:46,400 --> 01:02:48,090
or statistical learning.
1213
01:02:48,090 --> 01:02:52,120
Data mining has a huge intersection
with machine learning.
1214
01:02:52,120 --> 01:02:56,020
There are lots of disciplines around
that actually share some value.
1215
01:02:56,020 --> 01:02:59,630
But the point is, the premise that you
saw is so broad, that it shouldn't be
1216
01:02:59,630 --> 01:03:03,690
surprising that people at different times
developed a particular discipline
1217
01:03:03,690 --> 01:03:06,840
with its own jargon, to deal
with that discipline.
1218
01:03:06,840 --> 01:03:13,990
So what I'm giving you is machine
learning as the mainstream goes, and
1219
01:03:13,990 --> 01:03:17,520
that can be applied as widely as
possible to applications, both
1220
01:03:17,520 --> 01:03:20,900
practical applications and
scientific applications.
1221
01:03:20,900 --> 01:03:24,870
You will see, here is a situation, I
have an experiment, here is a target,
1222
01:03:24,870 --> 01:03:25,980
I have the data.
1223
01:03:25,980 --> 01:03:28,640
How do I produce the target
in the best way I want?
1224
01:03:28,640 --> 01:03:32,010
And then you apply machine learning.
1225
01:03:32,010 --> 01:03:36,180
MODERATOR: Also, in a general
question about machine learning.
1226
01:03:36,180 --> 01:03:36,190
1227
01:03:36,190 --> 01:03:42,370
Do machine learning algorithms perform
global optimization methods,
1228
01:03:42,370 --> 01:03:45,810
or just local optimization methods?
1229
01:03:45,810 --> 01:03:47,640
PROFESSOR: Obviously,
a general question.
1230
01:03:47,640 --> 01:03:48,470
1231
01:03:48,470 --> 01:03:52,120
Optimization is a tool
for machine learning.
1232
01:03:52,120 --> 01:03:56,340
So we will pick whatever optimization
that does the job for us.
1233
01:03:56,340 --> 01:03:59,440
And sometimes, there is a very
specific optimization method.
1234
01:03:59,440 --> 01:04:01,470
For example, in support vector
machines, it will be quadratic
1235
01:04:01,470 --> 01:04:01,990
programming.
1236
01:04:01,990 --> 01:04:04,050
It happens to be the one
that works with that.
1237
01:04:04,050 --> 01:04:08,190
But optimization is not something
that machine learning people
1238
01:04:08,190 --> 01:04:10,000
study for its own sake.
1239
01:04:10,000 --> 01:04:12,840
They obviously study it to understand
it better, and to choose the correct
1240
01:04:12,840 --> 01:04:14,900
optimization method.
1241
01:04:14,900 --> 01:04:17,780
Now, the question is alluding
to something that will
1242
01:04:17,780 --> 01:04:21,220
become clear when we talk about neural
networks, which is local minimum versus
1243
01:04:21,220 --> 01:04:22,830
global minimum.
1244
01:04:22,830 --> 01:04:26,680
And it is impossible to put this in
any perspective before we get the
1245
01:04:26,680 --> 01:04:29,120
details of neural networks,
so I will defer that until
1246
01:04:29,120 --> 01:04:32,850
we get to that lecture.
1247
01:04:32,850 --> 01:04:37,530
MODERATOR: Also, this is
a math question, I guess.
1248
01:04:37,530 --> 01:04:42,470
Is the hypothesis set, in a topological
sense, continuous?
1249
01:04:42,470 --> 01:04:47,160
PROFESSOR: The hypothesis
set can be anything, in principle.
1250
01:04:47,160 --> 01:04:50,500
So it can be continuous,
and it can be discrete.
1251
01:04:50,500 --> 01:04:53,710
For example, in the next lecture I take
the simplest case where we have
1252
01:04:53,710 --> 01:04:57,610
a finite hypothesis set, in order
to make a certain point.
1253
01:04:57,610 --> 01:05:00,610
In reality, almost all the hypothesis
sets that you find are
1254
01:05:00,610 --> 01:05:02,580
continuous and infinite.
1255
01:05:02,580 --> 01:05:04,170
Very infinite!
1256
01:05:04,170 --> 01:05:10,190
And the level of sophistication
of the hypothesis set can be huge.
1257
01:05:10,190 --> 01:05:15,440
And nonetheless, we will be able to see
that under one condition, which
1258
01:05:15,440 --> 01:05:19,307
comes from the theory, we'll be able to
learn even if the hypothesis set is
1259
01:05:19,307 --> 01:05:23,580
huge and complicated.
1260
01:05:23,580 --> 01:05:26,340
There's a question from inside, yes?
1261
01:05:26,340 --> 01:05:32,930
STUDENT: I think I understood, more or
less, the general idea, but I don't
1262
01:05:32,930 --> 01:05:37,160
understand the second example
you gave about credit approval.
1263
01:05:37,160 --> 01:05:41,200
So how do we collect our data?
1264
01:05:41,200 --> 01:05:46,210
Should we give credit to everyone, or
should we make our data biased,
1265
01:05:46,210 --> 01:05:51,170
because we cannot determine
the data of--
1266
01:05:51,170 --> 01:05:57,480
we can't determine, should we give credit
or not to persons we rejected?
1267
01:05:57,480 --> 01:05:58,030
PROFESSOR: Correct.
1268
01:05:58,030 --> 01:06:04,465
This is a good point. Every time
someone asks a question, the
1269
01:06:04,465 --> 01:06:05,590
lecture number comes to my mind.
1270
01:06:05,590 --> 01:06:07,570
I know when I'm going
to talk about it.
1271
01:06:07,570 --> 01:06:10,410
So what you describe is
called sampling bias.
1272
01:06:10,410 --> 01:06:12,190
And I will describe it in detail.
1273
01:06:12,190 --> 01:06:18,450
But when you use the biased data, let's
say the bank uses historical records.
1274
01:06:18,450 --> 01:06:22,450
So it sees the people who applied and
were accepted, and for those guys, it
1275
01:06:22,450 --> 01:06:26,030
can actually predict what the credit
behavior is, because it has their
1276
01:06:26,030 --> 01:06:26,700
credit history.
1277
01:06:26,700 --> 01:06:30,000
They charged and repaid and maxed
out, and all of this.
1278
01:06:30,000 --> 01:06:32,590
And then they decide: is this
a good customer or not?
1279
01:06:32,590 --> 01:06:36,400
For those who were rejected, there's
really no way to tell in this case
1280
01:06:36,400 --> 01:06:38,870
whether they were falsely rejected,
that they would have been good
1281
01:06:38,870 --> 01:06:40,220
customers or not.
1282
01:06:40,220 --> 01:06:44,050
Nonetheless, if you take the customer
base that you have, and base your
1283
01:06:44,050 --> 01:06:48,070
decision on it, the boundary
works fairly decently.
1284
01:06:48,070 --> 01:06:51,300
Actually, pretty decently, even for the
other guys, because the other guys
1285
01:06:51,300 --> 01:06:55,060
usually are deeper into the
classification region than the
1286
01:06:55,060 --> 01:06:57,940
boundary guys that you accepted,
and turned out to be bad.
1287
01:06:57,940 --> 01:06:58,810
1288
01:06:58,810 --> 01:07:01,000
But the point is well taken.
1289
01:07:01,000 --> 01:07:04,390
The data set in this case is not
completely representative, and there
1290
01:07:04,390 --> 01:07:07,750
is a particular principle in learning
that we'll talk about, which is
1291
01:07:07,750 --> 01:07:11,400
sampling bias, that deals
with this case.
1292
01:07:11,400 --> 01:07:14,270
Another question from here?
1293
01:07:14,270 --> 01:07:17,420
STUDENT: You explain that we need
to have a lot of data to learn.
1294
01:07:17,420 --> 01:07:22,050
So how do you decide how much amount
of data that is required for
1295
01:07:22,050 --> 01:07:26,980
a particular problem, in order to be
able to come up with a reasonable--
1296
01:07:26,980 --> 01:07:27,930
PROFESSOR: Good question.
1297
01:07:27,930 --> 01:07:31,710
So let me tell you the theoretical,
and the practical answer.
1298
01:07:31,710 --> 01:07:36,340
The theoretical answer is that this is
exactly the crux of the theory part
1299
01:07:36,340 --> 01:07:37,810
that we're going to talk about.
1300
01:07:37,810 --> 01:07:38,350
1301
01:07:38,350 --> 01:07:40,950
And in the theory, we are going
to see, can we learn?
1302
01:07:40,950 --> 01:07:43,120
And how much data.
1303
01:07:43,120 --> 01:07:46,150
So all of this will be answered
in a mathematical way.
1304
01:07:46,150 --> 01:07:48,020
So this is the theoretical answer.
1305
01:07:48,020 --> 01:07:52,770
The practical answer is: that's
not under your control.
1306
01:07:52,770 --> 01:07:57,180
When someone knocks on your door: Here
is the data, I have 500 points.
1307
01:07:57,180 --> 01:08:00,170
I tell him, I will give you
a fantastic system if you
1308
01:08:00,170 --> 01:08:02,200
just give me 2000.
1309
01:08:02,200 --> 01:08:05,000
But I don't have 2000, I have 500.
1310
01:08:05,000 --> 01:08:09,040
So now you go and you use your theory
to do something to your system, such
1311
01:08:09,040 --> 01:08:11,000
that it can work with the 500.
1312
01:08:11,000 --> 01:08:11,710
1313
01:08:11,710 --> 01:08:12,600
There was one case--
1314
01:08:12,600 --> 01:08:16,930
I worked with data in different
applications--
1315
01:08:16,930 --> 01:08:20,330
at some point, we had almost
100 million points.
1316
01:08:20,330 --> 01:08:21,760
You were swimming in data.
1317
01:08:21,760 --> 01:08:23,279
You wouldn't complain about data.
1318
01:08:23,279 --> 01:08:25,200
Data was wonderful.
1319
01:08:25,200 --> 01:08:28,779
And in another case, there were
less than 100 points.
1320
01:08:28,779 --> 01:08:31,890
And you had to deal with
the data with gloves!
1321
01:08:31,890 --> 01:08:36,290
Because if you use them the wrong way,
they are contaminated, which is
1322
01:08:36,290 --> 01:08:38,970
an expression we will see, and
then you have nothing.
1323
01:08:38,970 --> 01:08:43,029
And you will produce a system, and you
are proud of it, but you have no idea
1324
01:08:43,029 --> 01:08:44,540
whether it will perform well or not.
1325
01:08:44,540 --> 01:08:46,899
And you cannot give this to the customer,
and have the customer come
1326
01:08:46,899 --> 01:08:49,300
back to you and say: what did you do!?
1327
01:08:49,300 --> 01:08:49,760
1328
01:08:49,760 --> 01:08:55,490
So there is a question of, what
performance can you do given
1329
01:08:55,490 --> 01:08:57,090
what data size you have?
1330
01:08:57,090 --> 01:09:00,520
But in practice, you really have no
control over the data size in almost
1331
01:09:00,520 --> 01:09:03,140
all the cases, almost all
the practical cases.
1332
01:09:03,140 --> 01:09:05,960
Yes?
1333
01:09:05,960 --> 01:09:10,540
STUDENT: Another question I have
is regarding the hypothesis set.
1334
01:09:10,540 --> 01:09:13,729
So the larger the hypothesis set
is, probably I'll be able to
1335
01:09:13,729 --> 01:09:15,649
better fit the data.
1336
01:09:15,649 --> 01:09:20,420
But that, as you were explaining, might
be a bad thing to do because
1337
01:09:20,420 --> 01:09:23,460
when the new data point comes,
there might be troubles.
1338
01:09:23,460 --> 01:09:25,210
So how do you decide
the size of your--
1339
01:09:25,210 --> 01:09:27,680
PROFESSOR: You are asking all
the right questions, and all of
1340
01:09:27,680 --> 01:09:28,350
them are coming up.
1341
01:09:28,350 --> 01:09:32,330
This is again part of the theory,
but let me try to explain this.
1342
01:09:32,330 --> 01:09:35,420
As we mentioned, learning is about
being able to predict.
1343
01:09:35,420 --> 01:09:40,510
So you are using the data, not to
memorize it, but to figure out what
1344
01:09:40,510 --> 01:09:42,130
the pattern is.
1345
01:09:42,130 --> 01:09:45,100
And if you figure out a pattern that
applies to all the data, and it's
1346
01:09:45,100 --> 01:09:47,216
a reasonable pattern, then you
have a chance that it
1347
01:09:47,216 --> 01:09:49,340
will generalize outside.
1348
01:09:49,340 --> 01:09:53,880
Now the problem is that, if I give you
50 points, and you use a 7000th-order
1349
01:09:53,880 --> 01:09:57,360
polynomial, you will fit the
heck out of the data.
1350
01:09:57,360 --> 01:10:01,160
You will fit it so much with so many
degrees of freedom to spare, but you
1351
01:10:01,160 --> 01:10:02,070
haven't learned anything.
1352
01:10:02,070 --> 01:10:04,610
You just memorized it in a fancy way.
1353
01:10:04,610 --> 01:10:08,500
You put it in a polynomial form, and
that actually carries all the
1354
01:10:08,500 --> 01:10:10,400
information about the
data that you have,
1355
01:10:10,400 --> 01:10:11,890
and then some.
1356
01:10:11,890 --> 01:10:15,280
So you don't expect at all that
this will generalize outside.
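That intuition can be seen in a tiny experiment of my own (a sketch, with a degree-9 polynomial on 10 points standing in for the 7000th-order example): with as many parameters as data points, the polynomial fits the sample essentially exactly, yet that says little about the target off the sample.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 10)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(10)  # 10 noisy points

# Degree 9: as many parameters as points, so the polynomial can
# interpolate the sample -- "fit the heck out of the data".
coeffs = np.polyfit(x, y, deg=9)
train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)

# Off the sample, the memorized fit need not track the target.
x_dense = np.linspace(-1.0, 1.0, 200)
test_err = np.mean((np.polyval(coeffs, x_dense) - np.sin(np.pi * x_dense)) ** 2)
```

The training error is essentially zero while the off-sample error is not: the polynomial has stored the data, noise and all, rather than learned the pattern.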
1357
01:10:15,280 --> 01:10:18,450
And that intuitive observation
will be formalized when we
1358
01:10:18,450 --> 01:10:19,580
talk about the theory.
1359
01:10:19,580 --> 01:10:22,930
There will be a measurement of the
hypothesis set that you give me, that
1360
01:10:22,930 --> 01:10:25,550
measures the sophistication of it,
and will tell you with that
1361
01:10:25,550 --> 01:10:28,850
sophistication, you need that amount
of data in order to be able to make
1362
01:10:28,850 --> 01:10:30,430
any statement about generalization.
1363
01:10:30,430 --> 01:10:31,680
So that is what the theory is about.
1364
01:10:31,680 --> 01:10:34,650
1365
01:10:34,650 --> 01:10:37,880
STUDENT: Suppose, I mean, here
whatever we discussed, it is like I
1366
01:10:37,880 --> 01:10:42,930
had a data set and I came up with
an algorithm, and gave the output.
1367
01:10:42,930 --> 01:10:48,690
But won't it be also important to see,
OK, we came up with the output, and
1368
01:10:48,690 --> 01:10:52,790
using that, what was the feedback?
1369
01:10:52,790 --> 01:10:57,690
Are there techniques where you take
the feedback and try to
1370
01:10:57,690 --> 01:10:58,980
correct your--
1371
01:10:58,980 --> 01:11:03,360
PROFESSOR: You are alluding
to different techniques here.
1372
01:11:03,360 --> 01:11:07,740
But one of them would be validation,
which is after you learn, you validate
1373
01:11:07,740 --> 01:11:09,360
your solution.
1374
01:11:09,360 --> 01:11:13,000
And this is an extremely established and
core technique in machine learning
1375
01:11:13,000 --> 01:11:16,870
that will be covered in
one of the lectures.
1376
01:11:16,870 --> 01:11:18,810
Any questions from the online audience?
1377
01:11:18,810 --> 01:11:25,780
MODERATOR: In practice, how many
dimensions would be considered easy,
1378
01:11:25,780 --> 01:11:28,730
medium, and hard for
a perceptron problem?
1379
01:11:28,730 --> 01:11:31,100
PROFESSOR: The hard,
1380
01:11:31,100 --> 01:11:34,850
in most people's mind before they
get into machine learning, is the
1381
01:11:34,850 --> 01:11:36,420
computational time.
1382
01:11:36,420 --> 01:11:38,800
If something takes a lot of time,
then it's a hard problem.
1383
01:11:38,800 --> 01:11:42,800
If something can be computed quickly,
it's an easy problem.
1384
01:11:42,800 --> 01:11:47,210
For machine learning, the bottleneck
in my case, has never been the
1385
01:11:47,210 --> 01:11:51,340
computation time, even in
incredibly big data sets.
1386
01:11:51,340 --> 01:11:55,410
The bottleneck for machine learning is
to be able to generalize outside the
1387
01:11:55,410 --> 01:11:56,990
data that you have seen.
1388
01:11:56,990 --> 01:12:01,790
So to answer your question, the
perceptron behaves badly in terms of
1389
01:12:01,790 --> 01:12:04,090
the computational behavior.
1390
01:12:04,090 --> 01:12:07,490
We will be able to predict its
generalization behavior, based on the
1391
01:12:07,490 --> 01:12:09,370
number of dimensions and
the amount of data.
1392
01:12:09,370 --> 01:12:11,610
This will be given explicitly.
1393
01:12:11,610 --> 01:12:19,030
And therefore, the perceptron algorithm
is bad computationally, good
1394
01:12:19,030 --> 01:12:20,980
in terms of generalization.
1395
01:12:20,980 --> 01:12:24,900
If you actually can get away with
perceptrons, your chances of
1396
01:12:24,900 --> 01:12:28,460
generalizing are good because
it's a simplistic
1397
01:12:28,460 --> 01:12:33,850
model, and therefore its ability to
generalize is good, as we will see.
1398
01:12:33,850 --> 01:12:38,010
MODERATOR: Also, in the example you
explain the use of binary function.
1399
01:12:38,010 --> 01:12:43,690
So can you use more multi-valued
or real functions?
1400
01:12:43,690 --> 01:12:45,100
PROFESSOR: Correct.
1401
01:12:45,100 --> 01:12:47,980
Remember when I told you that there is
a topic that is out of sequence.
1402
01:12:47,980 --> 01:12:51,810
There was a logical sequence to the
course, and then I took part of the
1403
01:12:51,810 --> 01:12:55,870
linear models and put it very early on,
to give you something a little bit
1404
01:12:55,870 --> 01:12:59,140
more sophisticated than perceptrons
to try your hand on.
1405
01:12:59,140 --> 01:13:01,560
That happens to be for
real-valued functions.
1406
01:13:01,560 --> 01:13:05,650
And obviously there are hypotheses that
cover all types of co-domains.
1407
01:13:05,650 --> 01:13:07,010
Y could be anything as well.
1408
01:13:07,010 --> 01:13:09,930
1409
01:13:09,930 --> 01:13:18,730
MODERATOR: Another question is, in
the learning process you showed, when
1410
01:13:18,730 --> 01:13:22,420
do you pick your learning algorithm,
when do you pick your hypothesis set,
1411
01:13:22,420 --> 01:13:23,840
and what liberty do you have?
1412
01:13:23,840 --> 01:13:28,380
1413
01:13:28,380 --> 01:13:33,070
PROFESSOR: The hypothesis set
is the most important aspect of
1414
01:13:33,070 --> 01:13:36,030
determining the generalization behavior
that we'll talk about.
1415
01:13:36,030 --> 01:13:38,960
The learning algorithm does play a role,
although it is a secondary role,
1416
01:13:38,960 --> 01:13:41,330
as we will see in the discussion.
1417
01:13:41,330 --> 01:13:45,960
So in general, the learning
algorithm has the form of
1418
01:13:45,960 --> 01:13:49,140
minimizing an error function.
1419
01:13:49,140 --> 01:13:51,540
So you can think of the
perceptron, what does
1420
01:13:51,540 --> 01:13:52,290
the algorithm do?
1421
01:13:52,290 --> 01:13:55,420
It tries to minimize the
classification error.
1422
01:13:55,420 --> 01:13:58,710
That is your error function, and
you're minimizing it using this
1423
01:13:58,710 --> 01:14:00,220
particular update rule.
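For concreteness, here is a standard sketch of that update rule (variable names are mine, not the lecture's): while some point is misclassified, pick one such point x_n and update w to w + y_n * x_n.

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron learning algorithm, a standard sketch: repeatedly
    pick a misclassified point x_n and update w <- w + y_n * x_n,
    which reduces the classification error on that point."""
    Xb = np.column_stack([np.ones(len(X)), X])  # x0 = 1 absorbs the threshold
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iters):
        mis = np.flatnonzero(np.sign(Xb @ w) != y)
        if mis.size == 0:
            break  # every point is classified correctly
        n = mis[0]
        w = w + y[n] * Xb[n]
    return w

# Linearly separable toy data: +1 where x1 > x2, -1 where x1 < x2.
X = np.array([[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]])
y = np.array([1, 1, -1, -1])
w = pla(X, y)
```

On separable data like this, the update rule is guaranteed to reach a weight vector that classifies every training point correctly.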
1424
01:14:00,220 --> 01:14:03,700
And in other cases, we'll see that we
are minimizing an error function.
1425
01:14:03,700 --> 01:14:08,000
Now the minimization aspect is
an optimization question, and once you
1426
01:14:08,000 --> 01:14:11,180
determine that this is indeed the
error function that I want to
1427
01:14:11,180 --> 01:14:15,950
minimize, then you go and minimize
as much as you can using the most
1428
01:14:15,950 --> 01:14:18,710
sophisticated optimization
technique that you find.
1429
01:14:18,710 --> 01:14:22,160
So the question now translates into
what is the choice of the error
1430
01:14:22,160 --> 01:14:26,280
function or error measure that
will help or not help.
1431
01:14:26,280 --> 01:14:29,530
And that will be covered also next week
under the topic, Error and Noise.
1432
01:14:29,530 --> 01:14:32,350
When I talk about error, we'll talk
about error measures, and this
1433
01:14:32,350 --> 01:14:37,660
translates directly to the learning
algorithm that goes with them.
1434
01:14:37,660 --> 01:14:38,730
MODERATOR: Back to the perceptron.
1435
01:14:38,730 --> 01:14:43,220
So what happens if your hypothesis
gives you exactly 0 in this case?
1436
01:14:43,220 --> 01:14:47,200
PROFESSOR: So remember that
the quantity you compute and
1437
01:14:47,200 --> 01:14:49,960
compare with the threshold
was your credit score.
1438
01:14:49,960 --> 01:14:53,090
So I told you what happens if you are
above threshold, and what happens if
1439
01:14:53,090 --> 01:14:54,760
you're below threshold.
1440
01:14:54,760 --> 01:14:57,430
So what happens if you're exactly
at the threshold?
1441
01:14:57,430 --> 01:15:02,340
Your score is exactly that.
1442
01:15:02,340 --> 01:15:07,080
The informal answer is that it depends
on the mood of the credit
1443
01:15:07,080 --> 01:15:08,650
officer on that day.
1444
01:15:08,650 --> 01:15:10,870
If they had a bad day,
you will be denied!
1445
01:15:10,870 --> 01:15:16,410
But the serious answer is that
there are technical ways of
1446
01:15:16,410 --> 01:15:17,870
defining that point.
1447
01:15:17,870 --> 01:15:21,580
You can define it as 0,
so the sign of 0 is 0.
1448
01:15:21,580 --> 01:15:24,190
In which case you are always making
an error, because you are never +1 or
1449
01:15:24,190 --> 01:15:25,830
-1, when you should be.
1450
01:15:25,830 --> 01:15:28,230
Or you could make it belong
to the +1 category or
1451
01:15:28,230 --> 01:15:29,700
to the -1 category.
1452
01:15:29,700 --> 01:15:32,190
There are ramifications for
all of these decisions
1453
01:15:32,190 --> 01:15:33,950
that are purely technical.
1454
01:15:33,950 --> 01:15:36,010
Nothing conceptual comes out of them.
1455
01:15:36,010 --> 01:15:38,790
That's why I decided not
to include it.
1456
01:15:38,790 --> 01:15:42,220
Because it clutters the main concept
with something that really has no
1457
01:15:42,220 --> 01:15:43,170
ramification.
1458
01:15:43,170 --> 01:15:46,090
As far as you're concerned, the easiest
way to consider it is that the
1459
01:15:46,090 --> 01:15:49,040
output will be 0, and therefore you will
be making an error regardless of
1460
01:15:49,040 --> 01:15:50,410
whether it's +1 or -1.
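In code, that convention can be sketched like this (a small illustration of my own; NumPy's `sign` already returns 0 at exactly 0):

```python
import numpy as np

def perceptron_output(w, x):
    """Perceptron output with the convention discussed above:
    a score of exactly 0 yields sign 0, which matches neither
    +1 nor -1, so it counts as an error either way."""
    return int(np.sign(np.dot(w, x)))
```

For example, `perceptron_output([1.0, -1.0], [1.0, 1.0])` lands exactly on the threshold and returns 0, disagreeing with both possible labels.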
1461
01:15:50,410 --> 01:15:53,600
1462
01:15:53,600 --> 01:15:57,070
MODERATOR: Is there a kind of problem
that cannot be learned even if
1463
01:15:57,070 --> 01:16:01,480
there's a huge amount of data?
1464
01:16:01,480 --> 01:16:02,360
PROFESSOR: Correct.
1465
01:16:02,360 --> 01:16:07,010
For example, if I go to my computer
and use a pseudo-random number
1466
01:16:07,010 --> 01:16:12,090
generator to generate the target over
the entire domain, then patently,
1467
01:16:12,090 --> 01:16:14,960
nothing I can give you will make
you learn the other guys.
1468
01:16:14,960 --> 01:16:16,360
1469
01:16:16,360 --> 01:16:17,665
So remember the three--
1470
01:16:17,665 --> 01:16:20,310
1471
01:16:20,310 --> 01:16:23,170
let me try to--
1472
01:16:23,170 --> 01:16:24,640
the essence of machine learning.
1473
01:16:24,640 --> 01:16:28,710
The first one was, a pattern exists.
1474
01:16:28,710 --> 01:16:29,510
1475
01:16:29,510 --> 01:16:34,130
If there's no pattern that exists,
there is nothing to learn.
1476
01:16:34,130 --> 01:16:38,350
Let's say that it's like a baby,
and stuff is happening, and the
1477
01:16:38,350 --> 01:16:42,000
baby is just staring. There is nothing
to pick from that thing.
1478
01:16:42,000 --> 01:16:44,740
Once there is a pattern, you can see
the smile on the baby's face.
1479
01:16:44,740 --> 01:16:46,500
Now I can see what is going on.
1480
01:16:46,500 --> 01:16:49,420
So whatever you are learning,
there needs to be a pattern.
1481
01:16:49,420 --> 01:16:50,240
1482
01:16:50,240 --> 01:16:52,640
Now, how to tell that there's
a pattern or not,
1483
01:16:52,640 --> 01:16:53,400
that's a different question.
1484
01:16:53,400 --> 01:16:58,300
But the main ingredient, there's a pattern.
The other one is we cannot pin
1485
01:16:58,300 --> 01:16:58,980
it down mathematically.
1486
01:16:58,980 --> 01:17:00,970
If we can pin it down mathematically,
and you decide to do
1487
01:17:00,970 --> 01:17:02,840
the learning, then you
are really lazy.
1488
01:17:02,840 --> 01:17:04,730
Because you could just write the code.
1489
01:17:04,730 --> 01:17:05,380
But fine.
1490
01:17:05,380 --> 01:17:08,280
You can use learning in this case, but
it's not the recommended method,
1491
01:17:08,280 --> 01:17:11,620
because it has certain errors
in performance.
1492
01:17:11,620 --> 01:17:14,060
Whereas if you have the mathematical
definition, you just implement it and
1493
01:17:14,060 --> 01:17:16,150
you'll get the best possible solution.
1494
01:17:16,150 --> 01:17:18,240
And the third one, you have data,
which is key.
1495
01:17:18,240 --> 01:17:22,370
So if you have plenty of data, but the
first one is off, you are simply not
1496
01:17:22,370 --> 01:17:23,890
going to learn.
1497
01:17:23,890 --> 01:17:27,900
And it's not like I have to answer each
of these questions at random.
1498
01:17:27,900 --> 01:17:31,460
The theory will completely
capture what is going on.
1499
01:17:31,460 --> 01:17:35,820
So there's a very good reason for going
through four lectures in the
1500
01:17:35,820 --> 01:17:38,490
outline that are
mathematically inclined.
1501
01:17:38,490 --> 01:17:40,140
This is not for the sake of math.
1502
01:17:40,140 --> 01:17:45,170
I don't like to do math
hacking, if you will.
1503
01:17:45,170 --> 01:17:48,680
I pick the math that is necessary
to establish a concept.
1504
01:17:48,680 --> 01:17:51,530
And these will establish it, and they
are very much worth being patient with
1505
01:17:51,530 --> 01:17:52,480
and going through.
1506
01:17:52,480 --> 01:17:55,840
Because once you're done with them, you
basically have it cold about what
1507
01:17:55,840 --> 01:18:00,520
are the components that make learning
possible, and how do we tell, and all
1508
01:18:00,520 --> 01:18:03,360
of the questions that have been asked.
1509
01:18:03,360 --> 01:18:04,620
MODERATOR: Historical question.
1510
01:18:04,620 --> 01:18:10,880
So why is the perceptron often
related with a neuron?
1511
01:18:10,880 --> 01:18:14,435
PROFESSOR: I will discuss this
in neural networks, but in general,
1512
01:18:14,435 --> 01:18:19,760
when you take a neuron and synapses, and
you find what is the function that
1513
01:18:19,760 --> 01:18:25,200
gets to the neuron, you find that the
neuron fires, which is +1, if the
1514
01:18:25,200 --> 01:18:31,090
signal coming to it, which is roughly
a combination of the stimuli, exceeds
1515
01:18:31,090 --> 01:18:32,400
a certain threshold.
1516
01:18:32,400 --> 01:18:37,760
So that was the initial inspiration, and
the initial inspiration was
1517
01:18:37,760 --> 01:18:41,460
that: the brain does a pretty good
job, so maybe if we mimic the
1518
01:18:41,460 --> 01:18:42,890
function, we will get something good.
1519
01:18:42,890 --> 01:18:45,940
But you mimic one neuron, and then you
put it together and you'll get the
1520
01:18:45,940 --> 01:18:47,520
neural network that you
are talking about.
1521
01:18:47,520 --> 01:18:52,780
And I will discuss the analogy with
biology, and the extent that it can be
1522
01:18:52,780 --> 01:18:55,850
benefited from, when we talk
about neural networks, because
1523
01:18:55,850 --> 01:18:57,799
that will be the more proper
context for that.
1524
01:18:57,799 --> 01:19:02,850
1525
01:19:02,850 --> 01:19:08,710
MODERATOR: Another question is,
regarding the hypothesis set, are there
1526
01:19:08,710 --> 01:19:12,645
Bayesian hierarchical procedures
to narrow down the hypothesis set?
1527
01:19:12,645 --> 01:19:15,660
1528
01:19:15,660 --> 01:19:16,916
PROFESSOR: OK.
1529
01:19:16,916 --> 01:19:20,320
The choice of the hypothesis set and
the model in general is model
1530
01:19:20,320 --> 01:19:23,820
selection, and there's quite a bit of
stuff that we are going to talk about
1531
01:19:23,820 --> 01:19:26,550
in model selection, when we
talk about validation.
1532
01:19:26,550 --> 01:19:31,160
In general, the word Bayesian was
mentioned here-- if you
1533
01:19:31,160 --> 01:19:36,330
look at machine learning, there are
schools that deal with the subject
1534
01:19:36,330 --> 01:19:37,840
differently.
1535
01:19:37,840 --> 01:19:41,940
So for example, the Bayesian school
puts a mathematical framework
1536
01:19:41,940 --> 01:19:43,160
completely on it.
1537
01:19:43,160 --> 01:19:47,490
And then everything can be derived,
and that is based on Bayesian
1538
01:19:47,490 --> 01:19:48,500
principles.
1539
01:19:48,500 --> 01:19:54,380
I will talk about that at the very
end, so it's last but not least.
1540
01:19:54,380 --> 01:19:57,350
And I will make a very specific point
about it, for what it's worth.
1541
01:19:57,350 --> 01:20:03,280
But what I'm talking about in the course
in all of the details, are the
1542
01:20:03,280 --> 01:20:08,310
most commonly useful methods
in practice.
1543
01:20:08,310 --> 01:20:10,280
That is my criterion for inclusion.
1544
01:20:10,280 --> 01:20:10,900
1545
01:20:10,900 --> 01:20:13,910
So I will get to that
when we get there.
1546
01:20:13,910 --> 01:20:16,080
In terms of a hierarchy,
1547
01:20:16,080 --> 01:20:19,160
there are a number of hierarchical
methods.
1548
01:20:19,160 --> 01:20:23,360
For example, structural risk
minimization is one of them.
1549
01:20:23,360 --> 01:20:27,060
There are methods of hierarchies,
and the ramifications of it in
1550
01:20:27,060 --> 01:20:27,910
generalization.
1551
01:20:27,910 --> 01:20:30,500
I may touch upon it, when I get
to support vector machines.
1552
01:20:30,500 --> 01:20:35,490
But again, there's a lot of theory,
and if you read a book on machine
1553
01:20:35,490 --> 01:20:38,860
learning written by someone from pure
theory, you would think that you are
1554
01:20:38,860 --> 01:20:41,220
reading about a completely
different subject.
1555
01:20:41,220 --> 01:20:44,370
It's respectable stuff, but
different from the other
1556
01:20:44,370 --> 01:20:45,670
stuff that is practiced.
1557
01:20:45,670 --> 01:20:51,950
So one of the things that I'm trying to
do, I'm trying to pick from all the
1558
01:20:51,950 --> 01:20:56,070
components of machine learning, the
big picture that gives you the
1559
01:20:56,070 --> 01:20:59,540
understanding of the concept, and
the tools to use it in practice.
1560
01:20:59,540 --> 01:21:00,790
That is the criterion for inclusion.
1561
01:21:00,790 --> 01:21:04,170
1562
01:21:04,170 --> 01:21:04,710
1563
01:21:04,710 --> 01:21:07,340
Any questions from the inside here?
1564
01:21:07,340 --> 01:21:11,060
1565
01:21:11,060 --> 01:21:13,040
OK, we'll call it a day, and
we'll see you on Thursday.