ANNOUNCER: The following program is brought to you by Caltech.

YASER ABU-MOSTAFA: Welcome to machine learning, and welcome to our online audience as well. Let me start with an outline of the course, and then go into the material of today's lecture.
As you see from the outline, the topics are given colors, which designate their main content, whether it is mathematical or practical. Machine learning is a very broad subject. It goes from very abstract theory to extreme practice, as in rules of thumb, and the inclusion of a topic in the course depends on its relevance to machine learning. Some mathematics is useful because it gives you the conceptual framework, and some practical aspects are useful because they give you a way to deal with real learning systems.
Now, if you look at the topics, these are not meant to be separate topics for each lecture. They just highlight the main content of those lectures. But there is a storyline that goes through them, and let me tell you what that storyline is like. It starts with: what is learning? Can we learn? How do we do it? How do we do it well? And then the take-home lessons. There is a logical dependency that goes through the course, and there is one exception to that logical dependency.
One lecture, the third one, doesn't really belong where it is. It's a practical topic, and the reason I included it early on is that I needed to give you some tools to play around with, to test the theoretical and conceptual aspects. If I had waited until its natural place, which is the second part of the linear models further down, the beginning of the course would have been just too theoretical for people's taste. And as you see, if you look at the colors, it is mostly red in the beginning and mostly blue at the end. So the course starts by building the concepts and the theory, and then goes on to the practical aspects.
Now, let me start today's lecture. The subject of the lecture is the learning problem; it's an introduction to what learning is. I will draw your attention to one aspect of this slide, which is this part: the logo of the course. Believe it or not, this is not artwork. It is actually a technical figure that will come up in one of the lectures. I'm not going to tell you which one, so you can wait in anticipation until it comes up, but it will actually be a scientific figure that we will talk about.
Moving to today's lecture, I'm going to talk about the following. Machine learning is a very broad subject, and I'm going to start with one example that captures the essence of machine learning. It's a fun example about movies, which everybody watches. After that, I'm going to abstract from the practical learning problem the aspects that are common to all learning situations you are going to face. In abstracting them, we will arrive at the mathematical formalization of the learning problem. Then we will get our first algorithm for machine learning today. It's a very simple algorithm, but it will fix the idea of what the role of an algorithm is in this case. We will also survey the types of learning, so that we know which part we are emphasizing in this course, and which parts are nearby. And I will end with a puzzle, a very interesting puzzle, and it's a puzzle in more ways than one, as you will see. OK, so let me start with an example.
The example of machine learning that I'm going to start with is how a viewer would rate a movie. That is an interesting problem: interesting for us because we watch movies, and very interesting for a company that rents out movies. Indeed, one such company, Netflix, wanted to improve its in-house recommendation system by a mere 10%. They make recommendations when you log in: they recommend movies that they think you will like, that is, movies they think you will rate highly. They had a system, and they wanted to improve it. So how much is a 10% improvement in performance worth to the company? It was actually $1 million, paid out to the first group that managed to get the 10% improvement. So you ask yourself: why should a 10% improvement in something like that be worth a million dollars? It's because, if the recommendations the company makes are spot on, you will pay more attention to them, you are likely to rent the movies they recommend, and they will make lots of money, much more than the million dollars they promised. And this is very typical in machine learning. For example, machine learning has applications in financial forecasting, and you can imagine that the minutest improvement in financial forecasting can make a lot of money. So the fact that you can push a system to be better using machine learning is a very attractive aspect of the technique in a wide spectrum of applications. So what did these guys do? They released the data, and people started working on the problem using different algorithms, until someone managed to win the prize.
Now, if you look at the problem of rating a movie, it captures the essence of machine learning, and that essence has three components. If you find these three components in a problem in your own field, then you know that machine learning is ready as an application tool. What are the three? The first one is that a pattern exists. If a pattern didn't exist, there would be nothing to look for. So what is the pattern here? There is no question that the way a person rates a movie is related to how they rated other movies, and is also related to how other people rated that movie. We know that much, so there is a pattern to be discovered. However, and this is the second component, we cannot really pin the pattern down mathematically. I cannot ask you to write a 17th-order polynomial that captures how people rate movies. So the fact that there is a pattern, and that we cannot pin it down mathematically, is the reason we are going for machine learning, for "learning from data". We couldn't write down the system on our own, so we are going to depend on data in order to find the system. And there is a third component, which is very important: we have to have data. We are learning from data, and without it you are out of luck. So if someone knocks on my door with an interesting machine learning application, and tells me how exciting it is, how great the application would be, and how much money they would make, the first question I ask is: what data do you have? If you have data, we are in business. If you don't, you are out of luck. If you have these three components, you are ready to apply machine learning.
Now let me give you a solution to the movie rating problem, in order to start getting a feel for it. So here is a system, and let me focus on part of it. We are going to describe a viewer as a vector of factors, a profile if you will. If you look here, for example, the first factor would be comedy content. Does the movie have a lot of comedy? From the viewer's point of view, do they like comedies? Here, do they like action? Do they prefer blockbusters, or do they like fringe movies? And you can go on all the way, even to asking whether you like the lead actor or not. Now you go to the content of the movie itself, and you get the corresponding part. Does the movie have comedy? Does it have action? Is it a blockbuster? And so on. Now you compare the two. If there is a mismatch, let's say you hate comedy and the movie has a lot of comedy, then the chances are you're not going to like it. But if there is a match across many coordinates, and the number of factors here could really be something like 300, then the chances are you'll like the movie. So what do you do? You match the movie and the viewer factors, and then you add up their contributions. As a result of that, you get the predicted rating.
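The matching step just described, comparing viewer factors with movie factors and adding up the contributions, is essentially an inner product. Here is a minimal sketch of that combination; the factor names and the numbers are invented for illustration and are not from the lecture or any real system.

```python
# Sketch of the rating prediction described above: a viewer profile and
# a movie profile over the same factors, combined by an inner product.
# Factor names and values are illustrative assumptions, not real data.

viewer = {"comedy": 0.9, "action": 0.1, "blockbuster": 0.6, "lead_actor": 0.8}
movie  = {"comedy": 0.8, "action": 0.2, "blockbuster": 0.9, "lead_actor": 0.7}

def predicted_rating(viewer, movie):
    # Match each factor of the movie against the corresponding viewer
    # factor, and add up the contributions.
    return sum(viewer[f] * movie[f] for f in viewer)

print(round(predicted_rating(viewer, movie), 2))  # → 1.84
```

A large positive sum means many matched coordinates, hence a high predicted rating; a small sum means a mismatch.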
This is all good, except for one problem: this is really not machine learning. In order to produce this, you have to watch the movie and analyze its content. You have to interview the viewer and ask about their taste. And then, after that, you combine the two and try to get a prediction for the rating. The idea of machine learning is that you don't have to do any of that. All you do is sit down and sip your tea, while the machine is doing something to come up with this figure on its own.
So let's look at the learning approach. In the learning approach, we know that the viewer will be a vector of factors, with a different component for every factor. This vector will be different from one viewer to another. For example, one viewer will have a big blue content here, and another will have a small blue content, depending on their taste. And then there is the movie. A particular movie will have different contents that correspond to those factors. And the way we said we are computing the rating is simply by taking these, combining them, and getting the rating. Now, what machine learning will do is reverse-engineer that process. It starts from the rating, and then tries to find out what factors would be consistent with that rating. So think of it this way. You start, let's say, with completely random factors. You take these guys, just random numbers from beginning to end, and these guys, random numbers from beginning to end, for every user and every movie. That's your starting point. Obviously, there is no chance in the world that when you take the inner product between these random factors, you'll get anything that looks like the rating that actually took place, right? But what you do is take a rating that actually happened, and then start nudging the factors ever so slightly toward that rating, in a direction that makes the inner product get closer to the rating. Now, it looks like a hopeless task: I start with so many factors, all random, and I'm trying to make them match a rating. What are the chances? Well, the point is that you are going to do this not for one rating, but for 100 million ratings. And you keep cycling through the 100 million, over and over and over. And eventually, lo and behold, you find that the factors are now meaningful in terms of the ratings. And if you take a viewer who didn't watch a particular movie, take the viewer vector that resulted from that learning process and the movie vector that resulted from that process, and compute the inner product, lo and behold, you get a rating that is actually consistent with how that viewer would rate the movie. That's the idea.
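The nudging process just described amounts to gradient-style updates on a matrix factorization model. Below is a minimal sketch under assumed details (squared error, the learning rate, the factor count, and the toy ratings are all invented for illustration; the actual prize-winning systems were far more elaborate):

```python
import random

# Start with random user and movie factor vectors, and for each observed
# rating nudge both vectors ever so slightly so that their inner product
# moves toward the rating. Cycle over the ratings many times.
random.seed(0)
K = 3            # number of latent factors (could be ~300 in practice)
lr = 0.01        # how slight each nudge is

ratings = [("ann", "m1", 5.0), ("ann", "m2", 1.0), ("bob", "m1", 4.0)]

users  = {u: [random.random() for _ in range(K)] for u, _, _ in ratings}
movies = {m: [random.random() for _ in range(K)] for _, m, _ in ratings}

def predict(u, m):
    # Inner product of user and movie factor vectors.
    return sum(a * b for a, b in zip(users[u], movies[m]))

for _ in range(2000):                  # cycle over the ratings repeatedly
    for u, m, r in ratings:
        err = r - predict(u, m)        # how far the inner product is off
        for k in range(K):             # nudge both vectors toward r
            uk, mk = users[u][k], movies[m][k]
            users[u][k]  += lr * err * mk
            movies[m][k] += lr * err * uk

print(round(predict("ann", "m1"), 1))  # should now be close to 5.0
```

After enough cycles, the initially random factors reproduce the observed ratings, which is the sense in which they become "meaningful in terms of the ratings".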
Now, the solution I just described is actually one of the winning solutions in the competition I mentioned. So this is for real; it actually can be used. Now, with this example in mind, let's go to the components of learning.
I would now like to abstract from the learning problems that I see: what are the mathematical components that make up the learning problem? And I'm going to use a metaphor, a metaphor from another application domain, which is a financial application. The metaphor we are going to use is credit approval. You apply for a credit card, and the bank wants to decide whether it's a good idea to extend a credit card to you or not. From the bank's point of view, if they are going to make money, they are happy; if they are going to lose money, they are not happy. That's the only criterion they have. Now, very much as we didn't have a magic formula for deciding how a viewer will rate a movie, the bank doesn't have a magic formula for deciding whether a person is creditworthy or not. What they are going to do is rely on historical records of previous customers, and how their credit behavior turned out, and then try to reverse-engineer the system; once they have the system frozen, they are going to apply it to future customers. That's the deal. So what are the components here? Let's look at it. First, you have the applicant information. You look at it, and you can see that there is the age, the gender, how much money you make, how much money you owe, and all kinds of fields that are believed to be related to creditworthiness. Again, pretty much as in the movie example, there is no question that these fields are related to creditworthiness. They don't necessarily uniquely determine it, but they are related. And the bank doesn't want a sure bet; they want to get the credit decision as reliable as possible. So they want to use that pattern in order to be able to come up with a good decision. They take this input, and they want to approve the credit or deny it.
So let's formalize this. First, we are going to have an input, and the input is called x. Surprise, surprise! That input happens to be the customer application. We can think of it as a d-dimensional vector: the first component is the salary, then years in residence, outstanding debt, whatever the components are. You put them in a vector, and that becomes the input. Then we get the output y. The output y is simply the decision: either to extend credit or not to extend credit, +1 or -1. And being a good or bad customer is judged from the bank's point of view. After that, we have the target function. The target function is a function from a domain X, which is the set of all of these x's, so it is the set of d-dimensional vectors, a d-dimensional Euclidean space in this case, to a co-domain Y, which is the set of y's. That's an easy one, because y can only be +1 or -1, accept or deny, so Y is just a binary co-domain. This target function is the ideal credit approval formula, which we don't know. In all of our endeavors in machine learning, the target function is unknown to us. If it were known, nobody would need learning; we would just go ahead and implement it. But we need to learn it, because it is unknown to us. So what are we going to do to learn it? We are going to use data: examples. The data in this case is based on previous customer application records: the input, which is the information in their applications, and the output, which is how they turned out in hindsight. This is not a question of prediction at the time they applied; rather, after five years, they turned out to be a great customer. So the bank says: if someone has these attributes again, let's approve credit, because these guys tend to make us money. And this one made us lose a lot of money, so let's deny it, and so on. And there are plenty of historical records. All of this makes sense when you are talking about having 100,000 of those records; then you can pretty much say, I will capture what the essence of that function is. So this is the data, and then you use the data, which is the historical records, in order to get the hypothesis. The hypothesis is the formal name for the formula we produce to approximate the target function. So the hypothesis g lives in the same world as the target function, and it supposedly approximates f. While f is unknown to us, g is very much known; actually, we created it, and the hope is that it approximates f well. That's the goal of learning. This notation will be our notation for the rest of the course, so get used to it. The target function is always f; the hypothesis we produce, which we will refer to as the final hypothesis, is called g; the data will always have that notation, with capital-N points making up the data set; and the output is always y. So this is the notation to be used.
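The notation just fixed, a target f from X to Y, a data set of N input-output pairs, and a final hypothesis g, can be summarized in a small sketch. The attribute fields and the stand-in target below are invented purely to generate example data; in a real problem f is unknown and only the data is available.

```python
from typing import Callable, List, Tuple

# x is a d-dimensional vector of applicant attributes; y is +1 or -1.
X = Tuple[float, ...]   # e.g. (salary, years_in_residence, debt) -- invented fields
Y = int                 # +1 = extend credit, -1 = deny

# The target f: X -> Y is unknown in practice; this toy stand-in exists
# only so we can generate example data (it is NOT part of the formalization).
def f(x: X) -> Y:
    salary, years, debt = x
    return +1 if salary - debt > 0 else -1

# The data: N examples (x_1, y_1), ..., (x_N, y_N).
data: List[Tuple[X, Y]] = [(x, f(x)) for x in [
    (50.0, 3.0, 10.0), (20.0, 1.0, 35.0), (80.0, 7.0, 5.0)]]

# A final hypothesis g is some function X -> Y picked by the learning
# algorithm; here just a trivial illustrative guess.
g: Callable[[X], Y] = lambda x: +1 if x[0] > 30 else -1

# g approximates f well if it agrees with it on the data (and beyond).
agreement = sum(g(x) == y for x, y in data) / len(data)
print(agreement)  # → 1.0 on this toy data
```

The point of the sketch is only the shape of the problem: f and the data are given to us, and g is the object we produce.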
Now, let's put this in a diagram, in order to analyze it a little bit more. If you look at the diagram, here is the target function, and it is unknown: that is the ideal credit approval, which we will never know, but which we are hoping to approximate. And we don't see it. We see it only through the eyes of the training examples. They are our vehicle for understanding what the target function is; otherwise the target function is a mysterious quantity for us. And eventually, we would like to produce the final hypothesis. The final hypothesis is the formula the bank is going to use in order to approve or deny credit, with the hope that g approximates f. Now, what connects those two? That is the learning algorithm. The learning algorithm takes the examples and produces the final hypothesis, as we described in the movie rating example. Now, there is another component that goes into the learning algorithm. What the learning algorithm does is create the formula from a preset model, a set of candidate formulas, if you will. These we are going to call the hypothesis set: a set of hypotheses from which we are going to pick one. So from this H comes a bunch of small h's, which are functions that are candidates for being the credit approval formula. And one of them will be picked by the learning algorithm; that one happens to be g, hopefully approximating f. Now, if you look at this part of the chain, from the target function to the training examples to the learning algorithm to the final hypothesis, it is very natural, and nobody will object to it. But why do we have this hypothesis set? Why not let the algorithm pick from anything, just create the formula without being restricted to a particular set of formulas H? There are two reasons, and I want to explain them. One is that there is no downside to including a hypothesis set in the formalization, and there is an upside. So let me describe why there is no downside, and then why there is an upside. There is no downside for the simple reason that, from a practical point of view, that's what you do. You want to learn, and you say: I'm going to use a linear formula, I'm going to use a neural network, I'm going to use a support vector machine. So you are already dictating a set of hypotheses. If you happen to be a brave soul, and you don't want to restrict yourself at all, very well: then your hypothesis set is the set of all possible hypotheses. Right? So there is no loss of generality in putting it in, and hence no downside. The upside is not obvious here, but it will become obvious as we go through the theory. The hypothesis set will play a pivotal role in the theory of learning. It will tell us: can we learn, how well do we learn, and so on. Therefore, having it as an explicit component in the problem statement will make the theory go through. So that's why we have this figure.
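The lecture names linear formulas, neural networks, and support vector machines as possible hypothesis sets. As an illustrative sketch (not the lecture's own example; the weights and attributes here are invented), each choice of weights below defines one candidate hypothesis h in a set H of linear threshold functions, and the learning algorithm's job is to pick the member that becomes g:

```python
from typing import Tuple

# Each choice of weights (w, b) defines one hypothesis h in the set H of
# linear threshold functions: h(x) = sign(w . x + b).
def make_h(w: Tuple[float, ...], b: float):
    def h(x: Tuple[float, ...]) -> int:
        s = sum(wi * xi for wi, xi in zip(w, x)) + b
        return +1 if s > 0 else -1
    return h

# H is, conceptually, the set of all such functions; here are two members,
# with invented weights over invented attributes (salary, debt).
h1 = make_h((1.0, -1.0), 0.0)   # approve if salary exceeds debt
h2 = make_h((0.0, 1.0), -5.0)   # approve if debt exceeds 5 (a bad candidate)

# The learning algorithm would examine the data and elect one member of H
# as the final hypothesis g.
print(h1((60.0, 10.0)), h2((60.0, 10.0)))
```

Restricting the search to such a parameterized family is exactly the "no downside" point: in practice you always commit to some family before learning.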
Now, let me focus on the solution components of that figure. What do I mean by the solution components? If you look at the first part, which is the target, let me try to expand on it: the target function is not under your control. Someone knocks on my door and says: I want to approve credit. That's the target function; I have no control over it. And by the way, here are the historical records; I have no control over those either, so they give me the data. And would you please hand me the final hypothesis? That is what I'm going to give them at the end, before I receive my check. So all of that is completely dictated. Now let's look at the other part. The learning algorithm and the hypothesis set that we talked about are your solution tools. These are things you choose in order to solve the problem. And I would like to take a little bit of a look into what they look like, and give you an example of them, so that you have a complete chain for the entire figure in your mind: from the target function, to the data set, to the learning algorithm, to the hypothesis set, to the final hypothesis. So, here is the hypothesis set. We chose the notation H for the set, and each element will be given the symbol small h. So h is a function, pretty much like the final hypothesis g; g is just one of them, the one that you happen to elect. So when we elect it, we call it g. If
it's sitting there generically, we
416
00:22:34,060 --> 00:22:35,100
call it h.
417
00:22:35,100 --> 00:22:35,860
418
00:22:35,860 --> 00:22:39,090
And then, when you put them together,
they are referred to as
419
00:22:39,090 --> 00:22:39,690
the learning model.
420
00:22:39,690 --> 00:22:42,610
So if you're asked what is the learning
model you are using, you're
421
00:22:42,610 --> 00:22:46,580
actually choosing both a hypothesis
set and a learning algorithm.
422
00:22:46,580 --> 00:22:49,400
We'll see the perceptron in a moment,
so this would be the
423
00:22:49,400 --> 00:22:52,780
perceptron model, and this would be the
perceptron learning algorithm.
424
00:22:52,780 --> 00:22:56,420
This could be a neural network, and
this would be backpropagation.
425
00:22:56,420 --> 00:22:59,050
This could be support vector
machines of some kind, let's say
426
00:22:59,050 --> 00:23:02,520
radial basis function version, and this
would be quadratic programming.
427
00:23:02,520 --> 00:23:05,840
So every time you have a model, there is
a hypothesis set, and then there is
428
00:23:05,840 --> 00:23:07,960
an algorithm that will do the
searching and produce
429
00:23:07,960 --> 00:23:09,280
one of those guys.
430
00:23:09,280 --> 00:23:09,690
431
00:23:09,690 --> 00:23:13,630
So this is the standard form
432
00:23:13,630 --> 00:23:14,820
for the solution.
433
00:23:14,820 --> 00:23:18,860
Now, let me go through a simple
hypothesis set in detail so we have
434
00:23:18,860 --> 00:23:19,650
something to implement.
435
00:23:19,650 --> 00:23:23,860
So after the lecture, you can actually
implement a learning algorithm on real
436
00:23:23,860 --> 00:23:24,950
data if you want to.
437
00:23:24,950 --> 00:23:28,440
This is not a glorious model. It's
a very simple model. On the other hand,
438
00:23:28,440 --> 00:23:33,790
it's a very clear model to pinpoint
what we are talking about.
439
00:23:33,790 --> 00:23:34,570
440
00:23:34,570 --> 00:23:35,890
So here is the deal.
441
00:23:35,890 --> 00:23:39,660
442
00:23:39,660 --> 00:23:43,730
You have an input, and the input
is x_1 up to x_d, as we said--
443
00:23:43,730 --> 00:23:49,210
d-dimensional vector-- and each of them
comes from the real numbers, just
444
00:23:49,210 --> 00:23:49,840
for simplicity.
445
00:23:49,840 --> 00:23:51,320
So this belongs to the real numbers.
446
00:23:51,320 --> 00:23:53,170
And these are the attributes
of a customer.
447
00:23:53,170 --> 00:23:56,470
As we said, salary, years in
residence, and whatnot.
448
00:23:56,470 --> 00:24:00,080
So what does the perceptron model do?
449
00:24:00,080 --> 00:24:02,730
It does a very simple formula.
450
00:24:02,730 --> 00:24:08,760
It takes the attributes you have and
gives them different weights, w.
451
00:24:08,760 --> 00:24:12,110
So let's say the salary is important,
the chances are w corresponding to the
452
00:24:12,110 --> 00:24:13,900
salary will be big.
453
00:24:13,900 --> 00:24:15,880
Some other attribute is
not that important.
454
00:24:15,880 --> 00:24:19,210
The chances are the w that
goes with it is not that big.
455
00:24:19,210 --> 00:24:21,540
Actually, outstanding
debt is bad news.
456
00:24:21,540 --> 00:24:23,370
If you owe a lot, that's not good.
457
00:24:23,370 --> 00:24:26,600
So the chances are the weight will
be negative for outstanding
458
00:24:26,600 --> 00:24:28,420
debt, and so on.
459
00:24:28,420 --> 00:24:32,210
Now you add them together, and you add
them in a linear form-- that's what
460
00:24:32,210 --> 00:24:33,630
makes it a perceptron--
461
00:24:33,630 --> 00:24:39,010
and you can look at this as
a credit score, of sorts.
462
00:24:39,010 --> 00:24:39,760
463
00:24:39,760 --> 00:24:43,300
Now you compare the credit
score with a threshold.
464
00:24:43,300 --> 00:24:46,690
If you exceed the threshold, they
approve the credit card.
465
00:24:46,690 --> 00:24:50,420
And if you don't, they
deny the credit card.
466
00:24:50,420 --> 00:24:51,710
So that is the formula they
467
00:24:51,710 --> 00:24:52,520
settle on.
468
00:24:52,520 --> 00:24:58,500
They have no idea, yet, what the w's and
the threshold are, but they dictated the
469
00:24:58,500 --> 00:25:01,110
formula-- the analytic form that
they're going to use.
470
00:25:01,110 --> 00:25:02,040
471
00:25:02,040 --> 00:25:06,530
Now we take this and we put it
in the formalization we had.
472
00:25:06,530 --> 00:25:11,370
We have to define a hypothesis h,
and this will tell us what is the
473
00:25:11,370 --> 00:25:14,820
hypothesis set that has all the
hypotheses that have the same
474
00:25:14,820 --> 00:25:16,170
functional form.
475
00:25:16,170 --> 00:25:17,530
So you can write it down as this.
476
00:25:17,530 --> 00:25:22,270
This is a little bit long, but there's
absolutely nothing to it.
477
00:25:22,270 --> 00:25:26,490
This is your credit score, and this
is the threshold you compare to by
478
00:25:26,490 --> 00:25:27,740
subtracting.
479
00:25:27,740 --> 00:25:30,910
If this quantity is positive, you belong
to the first thing and you will
480
00:25:30,910 --> 00:25:31,890
approve credit.
481
00:25:31,890 --> 00:25:34,880
If it's negative, you belong here
and you will deny credit.
482
00:25:34,880 --> 00:25:38,440
Well, the function that takes a real
number, and produces +1 or
483
00:25:38,440 --> 00:25:41,010
-1, is called the sign.
484
00:25:41,010 --> 00:25:43,930
So when you take the sign of this thing,
this will indeed be +1 or
485
00:25:43,930 --> 00:25:46,970
-1, and this will give
the decision you want.
486
00:25:46,970 --> 00:25:49,820
And that will be the form
of your hypothesis.
487
00:25:49,820 --> 00:25:57,640
Now let's put it in color, and you
realize that what defines h is your
488
00:25:57,640 --> 00:26:00,290
choice of w_i and the threshold.
489
00:26:00,290 --> 00:26:05,060
These are the parameters that define
one hypothesis versus the other.
490
00:26:05,060 --> 00:26:07,820
x is an input that will be
put into any hypothesis.
491
00:26:07,820 --> 00:26:11,780
As far as we are concerned, when we are
learning for example, the inputs
492
00:26:11,780 --> 00:26:13,610
and outputs are already determined.
493
00:26:13,610 --> 00:26:15,010
These are the data set.
494
00:26:15,010 --> 00:26:19,640
But what we vary to get one hypothesis
or another, and what the algorithm
495
00:26:19,640 --> 00:26:23,270
needs to vary in order to choose the
final hypothesis, are those parameters
496
00:26:23,270 --> 00:26:27,270
which, in this case, are
w_i and the threshold.
497
00:26:27,270 --> 00:26:28,810
498
00:26:28,810 --> 00:26:30,610
So let's look at it visually.
499
00:26:30,610 --> 00:26:32,990
Let's assume that the data
you are working
500
00:26:32,990 --> 00:26:34,790
with is linearly separable.
501
00:26:34,790 --> 00:26:38,770
Linearly separable in this case, for
example, you have nine data points.
502
00:26:38,770 --> 00:26:42,220
And if you look at the nine data
points, some of them were good
503
00:26:42,220 --> 00:26:44,850
customers and some of them
were bad customers.
504
00:26:44,850 --> 00:26:48,450
And you would like now to apply the
perceptron model, in order to separate
505
00:26:48,450 --> 00:26:49,460
them correctly.
506
00:26:49,460 --> 00:26:53,860
You would like to get to this situation,
where the perceptron, which
507
00:26:53,860 --> 00:26:57,680
is this purple line, separates the blue
region from the red region or the
508
00:26:57,680 --> 00:27:02,240
pink region, and indeed all the good
customers belong to one, and the bad
509
00:27:02,240 --> 00:27:03,600
customers belong to the other.
510
00:27:03,600 --> 00:27:07,100
So you have hope that a future customer,
if they lie here or lie
511
00:27:07,100 --> 00:27:09,152
here, they will be classified
correctly.
512
00:27:09,152 --> 00:27:12,780
That is, if there is actually a simple
linear pattern here to be detected.
513
00:27:12,780 --> 00:27:16,950
But when you start, you start with
random weights, and the random weights
514
00:27:16,950 --> 00:27:18,990
will give you any line.
515
00:27:18,990 --> 00:27:23,490
So the purple line in both
cases corresponds to the
516
00:27:23,490 --> 00:27:25,900
purple parameters there.
517
00:27:25,900 --> 00:27:30,370
One choice of these w's and the
threshold corresponds to one line.
518
00:27:30,370 --> 00:27:32,220
You change them, you get another line.
519
00:27:32,220 --> 00:27:35,410
So you can see that the learning
algorithm is playing around with these
520
00:27:35,410 --> 00:27:39,360
parameters, and therefore moving the
line around, trying to arrive at this
521
00:27:39,360 --> 00:27:40,950
happy solution.
522
00:27:40,950 --> 00:27:42,350
523
00:27:42,350 --> 00:27:45,620
Now we are going to have a simple
change of notation.
524
00:27:45,620 --> 00:27:51,000
Instead of calling it threshold, we're
going to treat it as if it's a weight.
525
00:27:51,000 --> 00:27:55,030
It was minus threshold.
Now we call it plus w_0.
526
00:27:55,030 --> 00:27:58,760
Absolutely nothing changed; all you need
to do is choose w_0 to
527
00:27:58,760 --> 00:28:00,930
be minus the threshold.
528
00:28:00,930 --> 00:28:01,840
No big deal.
529
00:28:01,840 --> 00:28:03,060
530
00:28:03,060 --> 00:28:04,750
So why do we do that?
531
00:28:04,750 --> 00:28:08,220
We do that because we are going to
introduce an artificial coordinate.
532
00:28:08,220 --> 00:28:11,780
Remember that the input
was x_1 through x_d.
533
00:28:11,780 --> 00:28:14,020
Now we're going to add x_0.
534
00:28:14,020 --> 00:28:15,910
This is not an attribute of
the customer, but
535
00:28:15,910 --> 00:28:20,460
an artificial constant we add, which
happens to be always +1.
536
00:28:20,460 --> 00:28:22,710
Why are we doing this?
You probably guessed.
537
00:28:22,710 --> 00:28:26,540
Because when you do that, then all of
a sudden the formula simplifies.
538
00:28:26,540 --> 00:28:30,040
Now you are summing from i equals
0, instead of i equals 1.
539
00:28:30,040 --> 00:28:33,410
So you added the zero term,
and what is the zero term?
540
00:28:33,410 --> 00:28:37,270
It's the threshold which you
conveniently call w_0 with a plus sign,
541
00:28:37,270 --> 00:28:38,390
multiplied by the 1.
542
00:28:38,390 --> 00:28:41,550
So indeed, this will be the formula
equivalent to that.
543
00:28:41,550 --> 00:28:43,340
So it looks better.
544
00:28:43,340 --> 00:28:46,490
And this is the standard notation
we're going to use.
545
00:28:46,490 --> 00:28:51,720
And now we put it as a vector
form, which will simplify matters, so
546
00:28:51,720 --> 00:28:56,200
in this case you will be having an inner
product between a vector w,
547
00:28:56,200 --> 00:28:59,190
a column vector, and a vector x.
548
00:28:59,190 --> 00:29:04,870
So the vector w would be w_0,
w_1, w_2, w_3, w_4, et cetera.
549
00:29:04,870 --> 00:29:07,030
And x_0, x_1, x_2, et cetera.
550
00:29:07,030 --> 00:29:10,180
And you do the inner product by taking
a transpose, and you get a formula
551
00:29:10,180 --> 00:29:12,200
which is exactly the formula
you have here.
552
00:29:12,200 --> 00:29:17,330
So now we are down to this formula
for the perceptron hypothesis.
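With the threshold renamed to w_0 and the artificial coordinate x_0 = +1 prepended, the hypothesis collapses to the sign of a single inner product, h(x) = sign(w^T x). A minimal sketch, with made-up numbers for illustration:

```python
def perceptron_h(x, w):
    """h(x) = sign(w^T x); x[0] is the artificial coordinate, always +1."""
    score = sum(w_i * x_i for w_i, x_i in zip(w, x))   # the inner product w^T x
    return 1 if score > 0 else -1

# Hypothetical customer, with x_0 = +1 prepended to the attributes
x = [1.0, 60.0, 4.0, 10.0]
# w_0 = -threshold absorbs the threshold into the weight vector
w = [-25.0, 0.5, 0.2, -0.3]
print(perceptron_h(x, w))   # prints 1: the score 2.8 is positive
```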
553
00:29:17,330 --> 00:29:19,010
554
00:29:19,010 --> 00:29:22,540
Now that we have the hypothesis set,
let's look for the learning algorithm
555
00:29:22,540 --> 00:29:23,500
that goes with it.
556
00:29:23,500 --> 00:29:27,170
The hypothesis set tells you the
resources you can work with.
557
00:29:27,170 --> 00:29:29,990
Now we need the algorithm that is
going to look at the data, the
558
00:29:29,990 --> 00:29:33,810
training data that you're going to use,
and navigate through the space
559
00:29:33,810 --> 00:29:37,380
of hypotheses, to bring the one that
is going to output as the final
560
00:29:37,380 --> 00:29:39,660
hypothesis that you give
to your customer.
561
00:29:39,660 --> 00:29:47,210
So this one is called the perceptron
learning algorithm, and it implements
562
00:29:47,210 --> 00:29:49,420
this function.
563
00:29:49,420 --> 00:29:51,610
What it does is the following.
564
00:29:51,610 --> 00:29:54,560
It takes the training data.
565
00:29:54,560 --> 00:29:56,710
That is always what a learning
algorithm does. This is
566
00:29:56,710 --> 00:29:58,020
their starting point.
567
00:29:58,020 --> 00:30:03,330
So it takes existing customers, and
their existing credit behavior in
568
00:30:03,330 --> 00:30:04,030
hindsight--
569
00:30:04,030 --> 00:30:05,380
that's what it uses--
570
00:30:05,380 --> 00:30:07,070
and what does it do?
571
00:30:07,070 --> 00:30:12,080
It tries to make the w correct.
572
00:30:12,080 --> 00:30:17,640
So it really doesn't like at all
when a point is misclassified.
573
00:30:17,640 --> 00:30:22,960
So if a point is misclassified,
it means that your w didn't do
574
00:30:22,960 --> 00:30:23,850
the right job here.
575
00:30:23,850 --> 00:30:26,640
So what does it mean to be
a misclassified point here?
576
00:30:26,640 --> 00:30:31,910
It means that when you apply your
formula, with the current w--
577
00:30:31,910 --> 00:30:34,340
the w is the one that the algorithm
will play with--
578
00:30:34,340 --> 00:30:37,340
apply it to this particular x.
579
00:30:37,340 --> 00:30:38,170
Then what happens?
580
00:30:38,170 --> 00:30:41,200
You get something that is not the
credit behavior you want.
581
00:30:41,200 --> 00:30:43,500
It is misclassified.
582
00:30:43,500 --> 00:30:45,150
So what do we do when a point
is misclassified?
583
00:30:45,150 --> 00:30:46,840
We have to do something.
584
00:30:46,840 --> 00:30:49,660
So what the algorithm does, it
updates the weight vector.
585
00:30:49,660 --> 00:30:53,480
It changes the weight, which changes
the hypothesis, so that it behaves
586
00:30:53,480 --> 00:30:55,840
better on that particular point.
587
00:30:55,840 --> 00:30:59,970
And this is the formula that it does.
588
00:30:59,970 --> 00:31:01,480
So I'll explain it in a moment.
589
00:31:01,480 --> 00:31:08,430
Let me first try to explain the inner
product in terms of agreement or
590
00:31:08,430 --> 00:31:10,770
disagreement.
591
00:31:10,770 --> 00:31:16,890
If you have the vector x and the vector
w this way, their inner product
592
00:31:16,890 --> 00:31:21,230
will be positive, and the sign
will give you a +1.
593
00:31:21,230 --> 00:31:25,440
If they are this way, the inner product
will be negative, and the sign
594
00:31:25,440 --> 00:31:27,550
will be -1.
595
00:31:27,550 --> 00:31:32,180
So being misclassified means that
either they are this way and the
596
00:31:32,180 --> 00:31:37,590
output should be -1, or it's this
way and output should be +1.
597
00:31:37,590 --> 00:31:40,840
That's what makes it misclassified,
right?
598
00:31:40,840 --> 00:31:41,750
599
00:31:41,750 --> 00:31:49,720
So if you look here at this formula, it
takes the old w and adds something
600
00:31:49,720 --> 00:31:52,130
that depends on the misclassified
point.
601
00:31:52,130 --> 00:31:55,280
Both in terms of the x_n and y_n.
602
00:31:55,280 --> 00:31:57,410
y_n is just +1 or -1.
603
00:31:57,410 --> 00:32:00,800
So here you are either adding a vector
or subtracting a vector.
604
00:32:00,800 --> 00:32:05,490
And we will see from this diagram that
you're always doing so in such a way
605
00:32:05,490 --> 00:32:09,520
that you make the point more likely
to be correctly classified.
606
00:32:09,520 --> 00:32:10,780
How is that?
607
00:32:10,780 --> 00:32:15,990
If y equals +1, as you see here,
then it must be that since the point
608
00:32:15,990 --> 00:32:19,530
is misclassified, that
w dot x was negative.
609
00:32:19,530 --> 00:32:24,900
Now when you modify this to w plus
y x, it's actually w plus x.
610
00:32:24,900 --> 00:32:29,330
You add x to w, and when you add x to
w you get the blue vector instead of
611
00:32:29,330 --> 00:32:30,120
the red vector.
612
00:32:30,120 --> 00:32:33,850
And lo and behold, now the inner
product is indeed positive.
613
00:32:33,850 --> 00:32:38,830
And in the other case when it's -1,
it is misclassified because they
614
00:32:38,830 --> 00:32:39,760
were this way.
615
00:32:39,760 --> 00:32:41,840
They give you +1 when
it should be -1.
616
00:32:41,840 --> 00:32:44,520
And when you apply the rule, since
y is -1, you are actually
617
00:32:44,520 --> 00:32:45,640
subtracting x.
618
00:32:45,640 --> 00:32:48,680
So you subtract x and get this guy,
and you will get the correct
619
00:32:48,680 --> 00:32:49,620
classification.
620
00:32:49,620 --> 00:32:49,810
621
00:32:49,810 --> 00:32:51,560
So this is the intuition behind it.
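The geometric argument can be checked numerically. With hypothetical vectors, the update w + y x raises the agreement y (w dot x) by exactly the squared length of x, pushing the misclassified point toward the correct side:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical misclassified point: y = +1 but the current w gives w.x < 0
w = [-1.0, 0.5]
x = [2.0, 1.0]
y = 1
assert y * dot(w, x) < 0                              # misclassified now

w_new = [w_i + y * x_i for w_i, x_i in zip(w, x)]     # the PLA update
assert y * dot(w_new, x) > 0                          # correct after the update
print(y * dot(w_new, x) - y * dot(w, x))              # 5.0, which is ||x||^2
```

In this particular example a single update already flips the sign; in general one update only moves the point in the right direction, which is why the behavior of the full algorithm still needs an argument.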
622
00:32:51,560 --> 00:32:53,990
However, it is not the intuition
that makes this work.
623
00:32:53,990 --> 00:32:58,930
There are a number of problems
with this approach.
624
00:32:58,930 --> 00:33:02,570
I just motivated that
this is not a crazy rule.
625
00:33:02,570 --> 00:33:06,660
Whether or not it's a working
rule, that is yet to be seen.
626
00:33:06,660 --> 00:33:11,270
Let's look at the iterations of
the perceptron learning algorithm.
627
00:33:11,270 --> 00:33:13,860
Here is one iteration of PLA.
628
00:33:13,860 --> 00:33:18,950
So you look at this thing, and
this current w corresponds to
629
00:33:18,950 --> 00:33:20,330
the purple line.
630
00:33:20,330 --> 00:33:22,760
This guy is blue in the red region.
631
00:33:22,760 --> 00:33:24,610
It means it's misclassified.
632
00:33:24,610 --> 00:33:28,630
So now you would like to adjust
the weights, that is move around
633
00:33:28,630 --> 00:33:32,340
that purple line, such that the
point is classified correctly.
634
00:33:32,340 --> 00:33:35,440
If you apply the learning rule, you'll
find that you're actually moving in
635
00:33:35,440 --> 00:33:40,210
this direction, which means that the
blue point will likely be correctly
636
00:33:40,210 --> 00:33:42,230
classified after that iteration.
637
00:33:42,230 --> 00:33:43,790
638
00:33:43,790 --> 00:33:46,930
There is a problem because, let's
say that I actually move
639
00:33:46,930 --> 00:33:49,700
this guy in this direction.
640
00:33:49,700 --> 00:33:55,010
Well this one, I got it right, but this
one, which used to be right,
641
00:33:55,010 --> 00:33:56,440
now is messed up.
642
00:33:56,440 --> 00:33:58,920
It moved to the blue region, right?
643
00:33:58,920 --> 00:34:02,440
And if you think about it, I'm trying
to take care of one point, and I may be
644
00:34:02,440 --> 00:34:05,450
messing up all other points, because
I'm not taking them into
645
00:34:05,450 --> 00:34:06,980
consideration.
646
00:34:06,980 --> 00:34:08,469
Well, the good news for the perceptron
647
00:34:08,469 --> 00:34:12,909
learning algorithm is that all you need
to do, is for iterations 1,
648
00:34:12,909 --> 00:34:19,179
2, 3, 4, et cetera, pick a misclassified
point, any one you like.
649
00:34:19,179 --> 00:34:22,020
650
00:34:22,020 --> 00:34:24,489
And then apply the iteration to it.
651
00:34:24,489 --> 00:34:27,870
The iteration we just talked about,
which is this one.
652
00:34:27,870 --> 00:34:29,480
The top one.
653
00:34:29,480 --> 00:34:31,210
And that's it.
654
00:34:31,210 --> 00:34:35,790
If you do that, and the data was
originally linearly separable, then
655
00:34:35,790 --> 00:34:40,310
you will eventually arrive
at a correct solution.
656
00:34:40,310 --> 00:34:42,870
You will get to something that
classifies all of them correctly.
657
00:34:42,870 --> 00:34:44,340
This is not an obvious statement.
658
00:34:44,340 --> 00:34:45,310
It requires a proof.
659
00:34:45,310 --> 00:34:47,300
The proof is not that hard.
660
00:34:47,300 --> 00:34:51,570
But it gives us the simplest possible
learning model we can think of.
661
00:34:51,570 --> 00:34:54,710
It's a linear model, and
this is your algorithm.
662
00:34:54,710 --> 00:34:59,060
All you need to do is be very patient,
because 1, 2, 3, 4-- this is
663
00:34:59,060 --> 00:35:00,200
a really long sequence.
664
00:35:00,200 --> 00:35:01,900
At times it can be very long.
665
00:35:01,900 --> 00:35:03,310
But it eventually converges.
666
00:35:03,310 --> 00:35:04,350
That's the promise,
667
00:35:04,350 --> 00:35:06,970
as long as the data is
linearly separable.
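The full algorithm as just described fits in a short loop: repeatedly pick any misclassified point and apply the update until none remain. A minimal sketch on a tiny made-up linearly separable data set (the starting weights, data, and iteration cap are illustrative, not from the lecture):

```python
def sign(s):
    # Convention here: treat a score of exactly 0 as -1
    return 1 if s > 0 else -1

def pla(data, d, max_iters=1000):
    """Perceptron learning algorithm on (x, y) pairs, where x[0] = +1."""
    w = [0.0] * (d + 1)                       # start from any weights; zeros here
    for _ in range(max_iters):
        mis = [(x, y) for x, y in data
               if sign(sum(wi * xi for wi, xi in zip(w, x))) != y]
        if not mis:                           # all points classified correctly
            return w
        x, y = mis[0]                         # pick any misclassified point
        w = [wi + y * xi for wi, xi in zip(w, x)]   # the PLA update
    return w                                  # give up if the cap is hit

# Tiny linearly separable set: label is +1 when x1 > x2, with x = [1, x1, x2]
data = [([1.0, 2.0, 1.0], 1), ([1.0, 3.0, 0.5], 1),
        ([1.0, 1.0, 2.0], -1), ([1.0, 0.5, 3.0], -1)]
w = pla(data, d=2)
print(all(sign(sum(wi * xi for wi, xi in zip(w, x))) == y for x, y in data))  # True
```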
668
00:35:06,970 --> 00:35:13,920
So now we have one learning model, and
if I give you now data from a bank--
669
00:35:13,920 --> 00:35:17,216
previous customers and their credit
behavior-- you can actually run the
670
00:35:17,216 --> 00:35:21,090
perceptron learning algorithm, and come up
with a final hypothesis g that you
671
00:35:21,090 --> 00:35:22,630
can hand to the bank.
672
00:35:22,630 --> 00:35:26,390
Not clear at all that it will be good,
because all you did was match the
673
00:35:26,390 --> 00:35:27,750
historical records.
674
00:35:27,750 --> 00:35:31,280
Well, you may ask the question: if I
match the historical records, does this
675
00:35:31,280 --> 00:35:34,300
mean that I'm getting future customers
right, which is the
676
00:35:34,300 --> 00:35:35,480
only thing that matters?
677
00:35:35,480 --> 00:35:38,240
The bank already knows what happened
with the previous customers. It's just
678
00:35:38,240 --> 00:35:41,050
using the data to help you
find a good formula.
679
00:35:41,050 --> 00:35:44,050
The formula will be good or not good to
the extent that it applies to a new
680
00:35:44,050 --> 00:35:47,510
customer, and can predict the
behavior correctly.
681
00:35:47,510 --> 00:35:50,530
Well, that's a loaded question
which will be handled in
682
00:35:50,530 --> 00:35:53,470
extreme detail, when we talk about
the theory of learning.
683
00:35:53,470 --> 00:35:57,190
That's why we have to develop
all of this theory.
684
00:35:57,190 --> 00:35:58,970
So, that's it.
685
00:35:58,970 --> 00:36:02,340
And that is the perceptron
learning algorithm.
686
00:36:02,340 --> 00:36:07,620
Now let me go into the bigger picture
of learning, because what I talked
687
00:36:07,620 --> 00:36:09,990
about so far is one type of learning.
688
00:36:09,990 --> 00:36:13,700
It happens to be by far the most
popular, and the most used.
689
00:36:13,700 --> 00:36:16,250
But there are other types of learning.
690
00:36:16,250 --> 00:36:21,540
So let's talk about the premise of
learning, from which the different
691
00:36:21,540 --> 00:36:24,180
types came about.
692
00:36:24,180 --> 00:36:27,220
That's what learning is about.
693
00:36:27,220 --> 00:36:29,630
694
00:36:29,630 --> 00:36:34,640
This is the premise that is common
between any problem that you
695
00:36:34,640 --> 00:36:36,650
would consider learning.
696
00:36:36,650 --> 00:36:41,450
You use a set of observations,
what we call data, to uncover
697
00:36:41,450 --> 00:36:43,280
an underlying process.
698
00:36:43,280 --> 00:36:46,540
In our case, the target function.
699
00:36:46,540 --> 00:36:51,170
You can see that this is
a very broad premise.
700
00:36:51,170 --> 00:36:54,730
And therefore, you can see that people
have rediscovered that over and over
701
00:36:54,730 --> 00:36:57,160
and over, in so many disciplines.
702
00:36:57,160 --> 00:37:00,870
Can you think of a discipline, other than
machine learning, that uses that
703
00:37:00,870 --> 00:37:02,125
as its exclusive premise?
704
00:37:02,125 --> 00:37:05,420
705
00:37:05,420 --> 00:37:09,460
Has anybody taken courses
in statistics?
706
00:37:09,460 --> 00:37:12,180
In statistics, that's what they do.
707
00:37:12,180 --> 00:37:16,090
The underlying process is
a probability distribution.
708
00:37:16,090 --> 00:37:21,590
And the observations are samples
generated by that distribution.
709
00:37:21,590 --> 00:37:24,220
And you want to take the samples, and
predict what the probability
710
00:37:24,220 --> 00:37:25,890
distribution is.
711
00:37:25,890 --> 00:37:29,740
And over and over, there are so many
disciplines under different names.
712
00:37:29,740 --> 00:37:34,000
Now when we talk about different types
of learning, it's not like we sit down
713
00:37:34,000 --> 00:37:37,970
and look at the world and say, this
looks different from this because the
714
00:37:37,970 --> 00:37:39,420
assumptions look different.
715
00:37:39,420 --> 00:37:43,700
What you do is, you take this premise
and apply it in a context.
716
00:37:43,700 --> 00:37:48,100
And that calls for a certain amount
of mathematics and algorithms.
717
00:37:48,100 --> 00:37:53,690
If a particular set of assumptions takes
you sufficiently far from the
718
00:37:53,690 --> 00:37:57,850
mathematics and the algorithms you used
in the other disciplines that
719
00:37:57,850 --> 00:38:00,360
it takes on a life of its own.
720
00:38:00,360 --> 00:38:03,900
Once it develops its own math and
algorithms, you declare it
721
00:38:03,900 --> 00:38:05,250
a different type.
722
00:38:05,250 --> 00:38:05,660
723
00:38:05,660 --> 00:38:09,750
So when I list the types, it's not
completely obvious just by the slide
724
00:38:09,750 --> 00:38:13,000
itself, that these should be
the types that you have.
725
00:38:13,000 --> 00:38:16,370
But for what it's worth, these
are the most important types.
726
00:38:16,370 --> 00:38:18,110
The first one is supervised learning;
that's what we have
727
00:38:18,110 --> 00:38:18,970
been talking about.
728
00:38:18,970 --> 00:38:22,240
And I will discuss it in detail, and tell
you why it's called supervised.
729
00:38:22,240 --> 00:38:26,640
And it is, by far, the concentration
of this course.
730
00:38:26,640 --> 00:38:31,950
There is another one which is called
unsupervised learning, and
731
00:38:31,950 --> 00:38:33,990
unsupervised learning
is very intriguing.
732
00:38:33,990 --> 00:38:37,310
I will mention it briefly here, and then
we will talk about a very famous
733
00:38:37,310 --> 00:38:40,740
algorithm for unsupervised learning
later in the course.
734
00:38:40,740 --> 00:38:44,090
And the final type is reinforcement
learning, which is even more
735
00:38:44,090 --> 00:38:47,640
intriguing, and I will
discuss it in a brief
736
00:38:47,640 --> 00:38:49,760
introduction in a moment.
737
00:38:49,760 --> 00:38:50,330
738
00:38:50,330 --> 00:38:52,290
So let's take them one by one.
739
00:38:52,290 --> 00:38:53,180
Supervised learning.
740
00:38:53,180 --> 00:38:54,460
So what is supervised learning?
741
00:38:54,460 --> 00:38:56,960
742
00:38:56,960 --> 00:39:01,320
Anytime you have the data that is
given to you, with the output
743
00:39:01,320 --> 00:39:07,630
explicitly given-- here is the user
and movie, and here is the rating.
744
00:39:07,630 --> 00:39:11,030
Here is the previous customer, and
here is their credit behavior.
745
00:39:11,030 --> 00:39:15,270
It's as if a supervisor is helping you
out, in order to be able to classify
746
00:39:15,270 --> 00:39:16,300
the future ones.
747
00:39:16,300 --> 00:39:18,140
That's why it's called supervised.
748
00:39:18,140 --> 00:39:21,160
Let's take an example of coin
recognition, just to be able to
749
00:39:21,160 --> 00:39:24,110
contrast it with unsupervised
learning in a moment.
750
00:39:24,110 --> 00:39:24,630
751
00:39:24,630 --> 00:39:29,350
Let's say you have a vending machine,
and you would like to make
752
00:39:29,350 --> 00:39:31,670
the system able to
recognize the coins.
753
00:39:31,670 --> 00:39:33,030
So what do you do?
754
00:39:33,030 --> 00:39:36,630
You have physical measurements of the
coin, let's be simplistic and say we
755
00:39:36,630 --> 00:39:39,520
measure the size and mass
of the coin you put in.
756
00:39:39,520 --> 00:39:44,980
Now the coins will be quarters,
nickels, pennies, and dimes.
757
00:39:44,980 --> 00:39:46,800
25, 5, 1, and 10.
758
00:39:46,800 --> 00:39:47,500
759
00:39:47,500 --> 00:39:51,760
And when you put the data in this
diagram, they will belong there.
760
00:39:51,760 --> 00:39:56,640
So the quarters, for example, are
bigger, so they will belong here.
761
00:39:56,640 --> 00:40:00,480
And the dimes in the US currency happen
to be the smallest of them,
762
00:40:00,480 --> 00:40:04,200
so they are smallest here, and there
will be a scatter because of the error
763
00:40:04,200 --> 00:40:07,160
in measurement, because of the exposure
to the elements, and whatnot.
764
00:40:07,160 --> 00:40:10,070
So let's say that this is your
training data, and it's supervised
765
00:40:10,070 --> 00:40:11,830
because things are colored.
766
00:40:11,830 --> 00:40:15,660
I gave you those and told you they
are 25 cents, 5 cents, et cetera.
767
00:40:15,660 --> 00:40:20,040
So you use those in order to train
a system, and the system will then be
768
00:40:20,040 --> 00:40:22,100
able to classify a future one.
769
00:40:22,100 --> 00:40:26,990
For example, if we stick to the
linear approach, you may be able to
770
00:40:26,990 --> 00:40:29,890
find separator lines like those.
771
00:40:29,890 --> 00:40:33,250
And those separator lines will
separate, based on the data, the 10
772
00:40:33,250 --> 00:40:35,510
from the 1 from the
5 from the 25.
773
00:40:35,510 --> 00:40:37,240
And once you have those,
774
00:40:37,240 --> 00:40:39,870
you can bid farewell to the data.
You don't need it anymore.
775
00:40:39,870 --> 00:40:42,960
And when you get a future coin that is
now unlabeled, you don't know what it
776
00:40:42,960 --> 00:40:47,220
is, when the vending machine is actually
working, then the coin will
777
00:40:47,220 --> 00:40:51,090
lie in one region or another, and you're
going to classify it accordingly.
778
00:40:51,090 --> 00:40:53,550
So that is supervised learning.
779
00:40:53,550 --> 00:40:56,060
Now let's look at unsupervised
learning.
780
00:40:56,060 --> 00:41:01,490
For unsupervised learning, instead of
having the examples, the training data,
781
00:41:01,490 --> 00:41:05,570
having this form which is the
input plus the correct
782
00:41:05,570 --> 00:41:07,020
target-- the correct output--
783
00:41:07,020 --> 00:41:12,470
the customer and how they behaved
in reality in credit,
784
00:41:12,470 --> 00:41:16,765
we are going to have examples that have
less information, so much less it
785
00:41:16,765 --> 00:41:19,480
is laughable.
786
00:41:19,480 --> 00:41:23,920
I'm just going to tell you
what the input is.
787
00:41:23,920 --> 00:41:27,330
And I'm not going to tell you what
the target function is at all.
788
00:41:27,330 --> 00:41:30,190
I'm not going to tell you anything
about the target function.
789
00:41:30,190 --> 00:41:32,770
I'm just going to tell you, here
is the data of a customer.
790
00:41:32,770 --> 00:41:36,210
Good luck, try to predict the credit.
791
00:41:36,210 --> 00:41:38,300
OK--
792
00:41:38,300 --> 00:41:41,340
How in the world are we
going to do that?
793
00:41:41,340 --> 00:41:44,810
Let me show you that the situation
is not totally hopeless.
794
00:41:44,810 --> 00:41:46,010
That's what I'm going to achieve.
795
00:41:46,010 --> 00:41:48,390
I'm not going to tell you
how to do it completely.
796
00:41:48,390 --> 00:41:51,780
But let me show you that a situation
like that is not totally hopeless.
797
00:41:51,780 --> 00:41:52,620
798
00:41:52,620 --> 00:41:55,330
Let's go for the coin example.
799
00:41:55,330 --> 00:41:56,240
800
00:41:56,240 --> 00:42:01,550
For the coin example, we have
data that looks like this.
801
00:42:01,550 --> 00:42:05,800
If I didn't tell you what the
denominations are, the data
802
00:42:05,800 --> 00:42:08,530
would look like this.
803
00:42:08,530 --> 00:42:09,940
Right?
804
00:42:09,940 --> 00:42:12,220
You have the measurements, but you don't
know, is that a quarter, is
805
00:42:12,220 --> 00:42:14,140
it-- you don't know.
806
00:42:14,140 --> 00:42:17,970
Now honestly, if you look at this
thing, you say I can know
807
00:42:17,970 --> 00:42:19,720
something from this figure.
808
00:42:19,720 --> 00:42:21,740
Things tend to cluster together.
809
00:42:21,740 --> 00:42:25,880
So I may be able to classify those
clusters into categories, without
810
00:42:25,880 --> 00:42:28,440
knowing what the categories are.
811
00:42:28,440 --> 00:42:29,960
That will be quite
an achievement already.
812
00:42:29,960 --> 00:42:33,110
You still don't know whether it's
25 cents, or whatever.
813
00:42:33,110 --> 00:42:36,040
But the data actually made you
able to do something that is
814
00:42:36,040 --> 00:42:38,630
a significant step.
815
00:42:38,630 --> 00:42:42,370
You're going to be able to come
up with these boundaries.
816
00:42:42,370 --> 00:42:43,160
817
00:42:43,160 --> 00:42:46,210
And now, you are so close to
finding the full system.
818
00:42:46,210 --> 00:42:49,470
So unlabeled data actually
can be pretty useful.
819
00:42:49,470 --> 00:42:52,910
Obviously, I have seen the colored
ones, so I actually chose the
820
00:42:52,910 --> 00:42:55,500
boundaries right because I still
remember them visually.
821
00:42:55,500 --> 00:42:58,300
But if you look at the clusters and
you have never heard about that,
822
00:42:58,300 --> 00:43:02,830
especially these guys might not
look like two clusters.
823
00:43:02,830 --> 00:43:04,510
They may look like one cluster.
824
00:43:04,510 --> 00:43:10,260
So it actually could be that this is
ambiguous, and indeed in unsupervised
825
00:43:10,260 --> 00:43:13,900
learning, the number of clusters
is ambiguous at times.
826
00:43:13,900 --> 00:43:16,040
827
00:43:16,040 --> 00:43:18,145
And then, what you do--
828
00:43:18,145 --> 00:43:20,740
829
00:43:20,740 --> 00:43:23,110
this is the output of your system.
Now, I can categorize the
830
00:43:23,110 --> 00:43:24,960
coins into types.
831
00:43:24,960 --> 00:43:28,050
I'm just going to call them
types: type 1, type 2,
832
00:43:28,050 --> 00:43:29,260
type 3, type 4.
833
00:43:29,260 --> 00:43:33,140
I have no idea which belongs to which,
but obviously if someone comes with
834
00:43:33,140 --> 00:43:37,420
a single example of a quarter, a dime,
et cetera, then you are ready to go.
835
00:43:37,420 --> 00:43:37,890
836
00:43:37,890 --> 00:43:40,680
Whereas before, you had to have lots of
examples in order to choose where
837
00:43:40,680 --> 00:43:42,770
exactly to put the boundary.
838
00:43:42,770 --> 00:43:43,880
839
00:43:43,880 --> 00:43:47,600
And this is why a set like that,
which looks like complete
840
00:43:47,600 --> 00:43:50,160
jungle, is actually useful.
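The clustering idea here-- group unlabeled measurements into types without knowing what the types are-- is what k-means does. A minimal sketch, with made-up two-blob data standing in for the coin measurements:

```python
# Minimal k-means sketch: given only unlabeled (size, mass) points, group
# them into k clusters ("type 1", "type 2", ...) without ever seeing
# the denominations.

def kmeans(points, k=2, iters=20):
    """Cluster 2-D points into k groups; returns (centers, clusters)."""
    def d2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    # Deterministic init: first point, then repeatedly the point farthest
    # from all centers chosen so far (a crude k-means++-style seeding).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(d2(p, c) for c in centers)))

    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: d2(p, centers[i]))].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
                   if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Unlabeled "coin" measurements: two obvious blobs, denominations unknown.
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]: "type 1" and "type 2"
```

Note the ambiguity the lecture mentions: nothing in the data dictates k, and a different choice of k would merge or split these blobs.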
841
00:43:50,160 --> 00:43:52,850
Let me give you another interesting
example of unsupervised learning,
842
00:43:52,850 --> 00:43:55,610
where I give you the input without the
output, and you are actually in
843
00:43:55,610 --> 00:43:58,320
a better situation to learn.
844
00:43:58,320 --> 00:44:02,290
Let's say that your company or your
school in this case, is sending you
845
00:44:02,290 --> 00:44:05,190
for a semester in Rio de Janeiro.
846
00:44:05,190 --> 00:44:09,690
So you're very excited, and you
decide that you'd better learn some
847
00:44:09,690 --> 00:44:13,660
Portuguese, in order to be able to
speak the language when you arrive.
848
00:44:13,660 --> 00:44:14,370
849
00:44:14,370 --> 00:44:17,830
Not to worry, when you arrive, there
will be a tutor who teaches you
850
00:44:17,830 --> 00:44:18,400
Portuguese.
851
00:44:18,400 --> 00:44:20,400
But you have a month to go,
and you want to help
852
00:44:20,400 --> 00:44:22,320
yourself as much as possible.
853
00:44:22,320 --> 00:44:26,620
You look around, and you find that the
only resource you have is a radio
854
00:44:26,620 --> 00:44:30,080
station in Portuguese in your car.
855
00:44:30,080 --> 00:44:35,080
So what you do, you just turn
it on whenever you drive.
856
00:44:35,080 --> 00:44:38,680
And for an entire month, you're
bombarded with Portuguese.
857
00:44:38,680 --> 00:44:42,890
"tudo bem", "como vai", "valeu",
stuff like that comes back.
858
00:44:42,890 --> 00:44:45,550
After a while, without knowing anything--
it's unsupervised, nobody
859
00:44:45,550 --> 00:44:47,250
told you the meaning of any word--
860
00:44:47,250 --> 00:44:50,870
you start to develop a model of
the language in your mind.
861
00:44:50,870 --> 00:44:52,370
You know what the idioms
are, et cetera.
862
00:44:52,370 --> 00:44:54,930
You are very eager to know
what actually "tudo bem"
863
00:44:54,930 --> 00:44:56,350
-- what does that mean?
864
00:44:56,350 --> 00:44:58,380
You are ready to learn, and once
you learn it, it's actually
865
00:44:58,380 --> 00:44:59,780
fixed in your mind.
866
00:44:59,780 --> 00:45:03,130
Then when you go there, you will learn
the language faster than if you didn't
867
00:45:03,130 --> 00:45:05,070
go through this experience.
868
00:45:05,070 --> 00:45:08,320
So you can think of unsupervised
learning, in one way or another, as
869
00:45:08,320 --> 00:45:12,300
a way of getting a higher-level
representation of the input.
870
00:45:12,300 --> 00:45:15,580
Whether it's extremely high level as
in clusters-- you forgot all the
871
00:45:15,580 --> 00:45:19,680
attributes and you just tell me a label,
or higher level as in this-- a better
872
00:45:19,680 --> 00:45:23,620
representation than just the
crude input into some model
873
00:45:23,620 --> 00:45:25,212
in your mind.
874
00:45:25,212 --> 00:45:29,280
875
00:45:29,280 --> 00:45:32,250
Now let's talk about
reinforcement learning.
876
00:45:32,250 --> 00:45:35,430
In this case, it's not as bad
as unsupervised learning.
877
00:45:35,430 --> 00:45:38,970
So again, without the benefit of
supervised learning, you don't get
878
00:45:38,970 --> 00:45:40,810
the correct output.
879
00:45:40,810 --> 00:45:44,550
What you do is-- I will
give you the input.
880
00:45:44,550 --> 00:45:46,750
OK, thank you very much,
that's very kind.
881
00:45:46,750 --> 00:45:48,580
What else?
882
00:45:48,580 --> 00:45:53,450
I'm going to give you some output.
883
00:45:53,450 --> 00:45:54,540
The correct output?
884
00:45:54,540 --> 00:45:55,200
No!
885
00:45:55,200 --> 00:45:56,690
Some output.
886
00:45:56,690 --> 00:46:01,070
OK, that's very nice, but it doesn't
seem very helpful.
887
00:46:01,070 --> 00:46:05,100
It looks now like unsupervised learning,
because in unsupervised learning I
888
00:46:05,100 --> 00:46:06,460
could give you some output.
889
00:46:06,460 --> 00:46:08,080
Here is a dime. Oh, it's a quarter.
890
00:46:08,080 --> 00:46:10,490
It's some output!
891
00:46:10,490 --> 00:46:12,740
Such output has no information.
892
00:46:12,740 --> 00:46:16,240
The information comes from the next one.
893
00:46:16,240 --> 00:46:19,520
I'm going to grade this output.
894
00:46:19,520 --> 00:46:21,440
So that is the information
provided to you.
895
00:46:21,440 --> 00:46:26,200
So I'm not explicitly giving you the
output, but when you choose an output,
896
00:46:26,200 --> 00:46:28,900
I'm going to tell you how
well you're doing.
897
00:46:28,900 --> 00:46:31,850
Reinforcement learning is interesting
because it is mostly our own
898
00:46:31,850 --> 00:46:33,450
experience in learning.
899
00:46:33,450 --> 00:46:38,060
Think of a toddler, and a hot
cup of tea in front of her.
900
00:46:38,060 --> 00:46:40,610
She is looking at it, and
she is very curious.
901
00:46:40,610 --> 00:46:43,210
So she reaches to touch. Ouch!
902
00:46:43,210 --> 00:46:44,720
And she starts crying.
903
00:46:44,720 --> 00:46:47,790
The reward is very negative
for trying.
904
00:46:47,790 --> 00:46:51,325
Now next time she looks at it, and she
remembers the previous experience, and
905
00:46:51,325 --> 00:46:52,760
she doesn't touch it.
906
00:46:52,760 --> 00:46:56,120
But there is a certain level of pain,
because there is an unfulfilled
907
00:46:56,120 --> 00:46:57,870
curiosity.
908
00:46:57,870 --> 00:47:01,860
And curiosity killed the cat. After
three or four trials, the toddler
909
00:47:01,860 --> 00:47:02,530
tries again.
910
00:47:02,530 --> 00:47:04,100
Maybe now it's OK.
911
00:47:04,100 --> 00:47:05,420
And Ouch!
912
00:47:05,420 --> 00:47:09,890
Eventually, from just the grade of the
behavior-- to touch it or not to
913
00:47:09,890 --> 00:47:14,290
touch it-- the toddler will learn not to
touch cups of tea that have smoke
914
00:47:14,290 --> 00:47:15,350
coming out of them.
915
00:47:15,350 --> 00:47:16,060
916
00:47:16,060 --> 00:47:18,930
So that is a case of
reinforcement learning.
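The toddler story is essentially a tiny bandit problem: choose an action, receive only a grade for that action, and update a running value estimate. A minimal sketch, with a made-up reward function standing in for the hot cup:

```python
# Minimal reinforcement-learning sketch of the toddler example: two actions,
# and the environment returns only a grade for the chosen action -- never
# the correct action. The learner keeps a value estimate per action and
# mostly picks the best one, occasionally exploring (curiosity).
import random

def reward(action):
    # Made-up environment: touching the hot cup hurts; not touching is neutral.
    return -10.0 if action == "touch" else 0.0

def learn(episodes=100, epsilon=0.2, alpha=0.5, seed=1):
    rng = random.Random(seed)
    value = {"touch": 0.0, "dont_touch": 0.0}
    for _ in range(episodes):
        # Explore with probability epsilon; otherwise exploit the best estimate.
        if rng.random() < epsilon:
            action = rng.choice(list(value))
        else:
            action = max(value, key=value.get)
        # Update the running estimate from the grade alone.
        value[action] += alpha * (reward(action) - value[action])
    return value

values = learn()
print(max(values, key=values.get))  # the learned behavior: dont_touch
```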
917
00:47:18,930 --> 00:47:22,340
The most important application, or one
of the most important applications of
918
00:47:22,340 --> 00:47:26,650
reinforcement learning, is
in playing games.
919
00:47:26,650 --> 00:47:28,290
920
00:47:28,290 --> 00:47:32,420
So backgammon is one of the games,
and suppose that you want
921
00:47:32,420 --> 00:47:33,600
a system to learn it.
922
00:47:33,600 --> 00:47:40,050
So what you want, you want to take the
current state of the board, and you
923
00:47:40,050 --> 00:47:44,010
roll the dice, and then you decide
what is the optimal move in
924
00:47:44,010 --> 00:47:45,960
order to stand the best chance to win.
925
00:47:45,960 --> 00:47:46,830
That's the game.
926
00:47:46,830 --> 00:47:50,890
So the target function is the
best move given a state.
927
00:47:50,890 --> 00:47:55,680
Now, if I have to generate those things
in order for the system to
928
00:47:55,680 --> 00:48:00,430
learn, then I must be a pretty good
backgammon player already.
929
00:48:00,430 --> 00:48:03,580
So now it's a vicious cycle.
930
00:48:03,580 --> 00:48:06,480
Now, reinforcement learning
comes in handy.
931
00:48:06,480 --> 00:48:09,030
What you're going to do, you
are going to have the
932
00:48:09,030 --> 00:48:11,070
computer choose any output.
933
00:48:11,070 --> 00:48:13,790
A crazy move, for all you care.
934
00:48:13,790 --> 00:48:16,070
And then see what happens eventually.
935
00:48:16,070 --> 00:48:19,280
So this computer is playing against
another computer, both of
936
00:48:19,280 --> 00:48:21,040
them want to learn.
937
00:48:21,040 --> 00:48:24,730
And you make a move, and eventually
you win or lose.
938
00:48:24,730 --> 00:48:28,280
So you propagate back the credit
because of winning or losing,
939
00:48:28,280 --> 00:48:31,780
according to a very specific and
sophisticated formula, into all the
940
00:48:31,780 --> 00:48:34,810
moves that happened.
941
00:48:34,810 --> 00:48:37,570
Now you think that's completely hopeless,
because maybe this is not the
942
00:48:37,570 --> 00:48:39,750
move that resulted in this,
it's another move.
943
00:48:39,750 --> 00:48:45,390
But always remember, that you are going
to do this 100 billion times.
944
00:48:45,390 --> 00:48:47,130
Not you, the poor computer.
945
00:48:47,130 --> 00:48:49,610
You're sitting down sipping
your tea.
946
00:48:49,610 --> 00:48:53,460
A computer is doing this, playing
against an imaginary opponent, and
947
00:48:53,460 --> 00:48:55,240
they keep playing and
playing and playing.
948
00:48:55,240 --> 00:48:58,530
And in three hours of CPU time, you go
back to the computer-- maybe not three
949
00:48:58,530 --> 00:49:02,180
hours, maybe three days of CPU time--
you go back to the computer, and you
950
00:49:02,180 --> 00:49:03,505
have a backgammon champion.
951
00:49:03,505 --> 00:49:06,430
952
00:49:06,430 --> 00:49:09,960
Actually, that's true.
953
00:49:09,960 --> 00:49:13,900
The world champion, at some point, was
a neural network that learned the way
954
00:49:13,900 --> 00:49:15,880
I described.
955
00:49:15,880 --> 00:49:20,730
So it is actually a very attractive
approach, because in machine
956
00:49:20,730 --> 00:49:24,590
learning now, we have a target function
that we cannot model.
957
00:49:24,590 --> 00:49:27,590
That covers a lot of territory,
I've seen a lot of those.
958
00:49:27,590 --> 00:49:29,720
We have data coming from
the target function.
959
00:49:29,720 --> 00:49:30,830
960
00:49:30,830 --> 00:49:32,560
I usually have that.
961
00:49:32,560 --> 00:49:36,010
And now we have the lazy
man's approach to life.
962
00:49:36,010 --> 00:49:39,410
We are going to sit down, and let the
computer do all of the work, and
963
00:49:39,410 --> 00:49:40,830
produce the system we want.
964
00:49:40,830 --> 00:49:44,090
Instead of studying the thing
mathematically, and writing code, and
965
00:49:44,090 --> 00:49:44,900
debugging--
966
00:49:44,900 --> 00:49:46,740
I hate debugging.
967
00:49:46,740 --> 00:49:49,650
And then you go. No,
we're not going to do that.
968
00:49:49,650 --> 00:49:52,550
The learning algorithm just works,
and produces something good.
969
00:49:52,550 --> 00:49:53,040
970
00:49:53,040 --> 00:49:54,490
And we get the check.
971
00:49:54,490 --> 00:49:56,950
So this is a pretty good deal.
972
00:49:56,950 --> 00:50:03,880
It actually is so good, it might
be too good to be true.
973
00:50:03,880 --> 00:50:07,120
So let's actually examine if
all of this was a fantasy.
974
00:50:07,120 --> 00:50:10,590
975
00:50:10,590 --> 00:50:14,080
So now I'm going to give you
a learning puzzle.
976
00:50:14,080 --> 00:50:16,020
Humans are very good learners, right?
977
00:50:16,020 --> 00:50:17,630
978
00:50:17,630 --> 00:50:21,170
So I'm now going to give you a learning
problem in the form that I
979
00:50:21,170 --> 00:50:23,870
described, a supervised
learning problem.
980
00:50:23,870 --> 00:50:28,910
And that supervised learning problem
will give you a training set, some
981
00:50:28,910 --> 00:50:32,300
points mapped to +1, some
points mapped to -1.
982
00:50:32,300 --> 00:50:35,600
And then I'm going to give you
a test point that is unlabeled.
983
00:50:35,600 --> 00:50:41,780
Your task is to look at the examples,
learn the target function, apply it to
984
00:50:41,780 --> 00:50:46,630
the test point, and then decide what
the value of the function is.
985
00:50:46,630 --> 00:50:50,550
After that, I'm going to ask, who
decided that the function is +1,
986
00:50:50,550 --> 00:50:53,130
and who decided that the
function is -1.
987
00:50:53,130 --> 00:50:55,720
OK? It's clear what the deal is.
988
00:50:55,720 --> 00:50:55,730
989
00:50:55,730 --> 00:50:59,680
And I would like our online audience
to do the same thing.
990
00:50:59,680 --> 00:51:02,650
And please text what the solution is.
991
00:51:02,650 --> 00:51:04,900
Just +1 or -1.
992
00:51:04,900 --> 00:51:05,590
993
00:51:05,590 --> 00:51:06,560
Fair enough?
994
00:51:06,560 --> 00:51:07,810
Let's start the game.
995
00:51:07,810 --> 00:51:12,260
996
00:51:12,260 --> 00:51:14,890
997
00:51:14,890 --> 00:51:19,390
What is above the line are
the training examples.
998
00:51:19,390 --> 00:51:23,870
I put the input as a three-by-three
pattern in order to be visually easy
999
00:51:23,870 --> 00:51:24,780
to understand.
1000
00:51:24,780 --> 00:51:28,370
But this is just really nine
bits worth of information.
1001
00:51:28,370 --> 00:51:31,470
And they are ones and zeros,
black and white.
1002
00:51:31,470 --> 00:51:36,640
And for this input, this input, and this
input, the value of the target
1003
00:51:36,640 --> 00:51:38,760
function is -1.
1004
00:51:38,760 --> 00:51:42,160
For this input, this input, and this
input, the value of the target
1005
00:51:42,160 --> 00:51:44,470
function is +1.
1006
00:51:44,470 --> 00:51:47,980
Now this is your data set, this
is your training set.
1007
00:51:47,980 --> 00:51:49,360
Now you should learn the function.
1008
00:51:49,360 --> 00:51:52,980
And when you're done, could you please
tell me what your function will return
1009
00:51:52,980 --> 00:51:54,680
on this test point?
1010
00:51:54,680 --> 00:51:57,130
Is it +1 or -1?

1011
00:51:57,130 --> 00:52:00,480
I will give everybody 30 seconds
before I ask for an answer.
1012
00:52:00,480 --> 00:52:05,330
1013
00:52:05,330 --> 00:52:06,930
Maybe we should have some
background music?
1014
00:52:06,930 --> 00:52:13,680
1015
00:52:13,680 --> 00:52:14,930
1016
00:52:14,930 --> 00:52:20,400
1017
00:52:20,400 --> 00:52:22,680
OK, time's up.
1018
00:52:22,680 --> 00:52:24,850
Your learning algorithm
has converged, I hope.
1019
00:52:24,850 --> 00:52:30,835
And now we apply it here, and I ask
people here, who says it's +1?
1020
00:52:30,835 --> 00:52:34,120
1021
00:52:34,120 --> 00:52:35,180
Thank you.
1022
00:52:35,180 --> 00:52:37,810
Who says it's -1?
1023
00:52:37,810 --> 00:52:39,270
Thank you.
1024
00:52:39,270 --> 00:52:42,020
I see that the online audience
also contributed?
1025
00:52:42,020 --> 00:52:44,070
MODERATOR: Yeah, the big
majority says +1.
1026
00:52:44,070 --> 00:52:45,950
PROFESSOR: But
are there -1's?
1027
00:52:45,950 --> 00:52:46,840
MODERATOR: Two -1's.
1028
00:52:46,840 --> 00:52:47,300
1029
00:52:47,300 --> 00:52:48,320
PROFESSOR: Cool.
1030
00:52:48,320 --> 00:52:49,050
1031
00:52:49,050 --> 00:52:50,990
I don't care if it's
a +1 or -1.
1032
00:52:50,990 --> 00:52:54,270
What I care about is that
I get both answers.
1033
00:52:54,270 --> 00:52:55,810
That is the essence of it.
1034
00:52:55,810 --> 00:52:57,280
Why do I care?
1035
00:52:57,280 --> 00:53:00,760
Because in reality, this
is an impossible task.
1036
00:53:00,760 --> 00:53:03,470
1037
00:53:03,470 --> 00:53:06,090
I told you the target
function is unknown.
1038
00:53:06,090 --> 00:53:11,110
It could be anything,
really anything.
1039
00:53:11,110 --> 00:53:15,740
And now I give you the value of the
target function at 6 points.
1040
00:53:15,740 --> 00:53:19,470
Well, there are many functions that
fit those 6 points, and behave
1041
00:53:19,470 --> 00:53:21,470
differently outside.
1042
00:53:21,470 --> 00:53:32,400
For example, if you take the function
to be +1 if the top left square
1043
00:53:32,400 --> 00:53:40,510
is white, then this should
be -1, right?
1044
00:53:40,510 --> 00:53:49,880
If you take the function to be +1
if the pattern is symmetric--
1045
00:53:49,880 --> 00:53:52,700
let's see, I said it
the other way around.
1046
00:53:52,700 --> 00:53:57,430
So the top one is black,
it would be -1.
1047
00:53:57,430 --> 00:53:58,850
So this would be -1.
1048
00:53:58,850 --> 00:54:00,680
If it's symmetric, it would be +1.
1049
00:54:00,680 --> 00:54:03,420
So this would be +1, because
this guy has both-- this is
1050
00:54:03,420 --> 00:54:05,380
black, and also it is symmetric.
1051
00:54:05,380 --> 00:54:06,230
Right?
1052
00:54:06,230 --> 00:54:09,320
And you can find infinite
variety like that.
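The point-- that many target functions agree on the training set and disagree on the test point-- can be checked exhaustively on a scaled-down version of the puzzle. Here inputs are 3-bit patterns rather than 9-bit, so every possible target function can be enumerated; the training labels are arbitrary.

```python
# Tiny version of the puzzle: 3-bit inputs (8 possible inputs) instead of
# 9-bit, so we can enumerate every target function f: {0,1}^3 -> {+1,-1}.
# Fix the labels on 6 training inputs; among the functions consistent with
# all 6, the label of a 7th test input is +1 in exactly half of them and
# -1 in the other half. The data says nothing about the outside.
from itertools import product

inputs = list(product([0, 1], repeat=3))
train = dict(zip(inputs[:6], [-1, -1, -1, +1, +1, +1]))  # arbitrary labels
test_point = inputs[6]

# Each candidate target function is one assignment of +-1 to all 8 inputs.
consistent = []
for labels in product([-1, +1], repeat=8):
    f = dict(zip(inputs, labels))
    if all(f[x] == y for x, y in train.items()):
        consistent.append(f)

plus = sum(f[test_point] == +1 for f in consistent)
minus = sum(f[test_point] == -1 for f in consistent)
print(len(consistent), plus, minus)  # 4 consistent functions, split 2 and 2
```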
1053
00:54:09,320 --> 00:54:12,310
And that problem is not restricted
to this case.
1054
00:54:12,310 --> 00:54:14,300
1055
00:54:14,300 --> 00:54:15,530
The question here is obvious.
1056
00:54:15,530 --> 00:54:17,010
The function is unknown.
1057
00:54:17,010 --> 00:54:18,300
You really mean unknown, right?
1058
00:54:18,300 --> 00:54:19,430
Yes, I mean it.
1059
00:54:19,430 --> 00:54:20,260
Unknown-- anything?
1060
00:54:20,260 --> 00:54:21,280
Yes, I do.
1061
00:54:21,280 --> 00:54:22,160
OK.
1062
00:54:22,160 --> 00:54:26,110
You give me a finite sample,
it can be anything outside.
1063
00:54:26,110 --> 00:54:30,750
How in the world am I going to tell
what the function is outside?
1064
00:54:30,750 --> 00:54:33,720
OK, that sounds about right.
1065
00:54:33,720 --> 00:54:37,150
But we are in trouble, because that's
the premise of learning.
1066
00:54:37,150 --> 00:54:41,400
If the goal was to memorize the examples
I gave you, that would be
1067
00:54:41,400 --> 00:54:43,780
memorizing, not learning.
1068
00:54:43,780 --> 00:54:48,110
Learning is to figure out a pattern
that applies outside.
1069
00:54:48,110 --> 00:54:53,370
And now we realize that outside,
I cannot say anything.
1070
00:54:53,370 --> 00:54:56,641
Does this mean that learning
is doomed?
1071
00:54:56,641 --> 00:55:00,860
Well, this is going to be
a very short course!
1072
00:55:00,860 --> 00:55:06,230
Well, the good news is that learning
is alive and well.
1073
00:55:06,230 --> 00:55:13,420
And we are going to show that, without
compromising our basic premise.
1074
00:55:13,420 --> 00:55:18,320
The target function will
continue to be unknown.
1075
00:55:18,320 --> 00:55:21,440
And we still mean unknown.
1076
00:55:21,440 --> 00:55:24,390
And we will be able to learn.
1077
00:55:24,390 --> 00:55:27,620
And that will be the subject
of the next lecture.
1078
00:55:27,620 --> 00:55:32,410
Right now, we are going to go for
a short break, after which we are going
1079
00:55:32,410 --> 00:55:40,150
to take the Q&A.
1080
00:55:40,150 --> 00:55:43,270
1081
00:55:43,270 --> 00:55:49,350
We'll start the Q&A, and we will get
questions from the class here, and
1082
00:55:49,350 --> 00:55:51,270
from the online audience.
1083
00:55:51,270 --> 00:55:56,160
And if you'd like to ask a question, let
me ask you to go to this side of
1084
00:55:56,160 --> 00:56:00,630
the room where the mic is, so that
your question can be heard.
1085
00:56:00,630 --> 00:56:04,680
And we will alternate, if there are
questions here, we will alternate
1086
00:56:04,680 --> 00:56:07,540
between campus and off campus.
1087
00:56:07,540 --> 00:56:11,030
So let me start if there is
a question from outside.
1088
00:56:11,030 --> 00:56:16,080
MODERATOR: Yes, so the most common
question is, how do you determine if
1089
00:56:16,080 --> 00:56:19,050
a set of points is linearly
separable, and what do you do
1090
00:56:19,050 --> 00:56:20,730
if they're not separable.
1091
00:56:20,730 --> 00:56:26,120
PROFESSOR: The linear separability
assumption is a very
1092
00:56:26,120 --> 00:56:29,850
simplistic assumption, and doesn't
apply mostly in practice.
1093
00:56:29,850 --> 00:56:34,780
And I chose it only because it goes with
a very simple algorithm, which is
1094
00:56:34,780 --> 00:56:36,950
the perceptron learning algorithm.
1095
00:56:36,950 --> 00:56:40,850
There are two ways to deal with the
case of linear inseparability.
1096
00:56:40,850 --> 00:56:44,450
There are algorithms, and most
algorithms actually deal with that
1097
00:56:44,450 --> 00:56:49,730
case, and there's also a technique that
we are going to study next
1098
00:56:49,730 --> 00:56:55,330
week, which will take a set of points
which is not linearly separable, and
1099
00:56:55,330 --> 00:56:59,150
create a mapping that makes
them linearly separable.
1100
00:56:59,150 --> 00:57:02,050
So there is a way to deal with it.
1101
00:57:02,050 --> 00:57:05,990
However, for the question of how you
determine it's linearly separable, the
1102
00:57:05,990 --> 00:57:09,240
right way of doing it in practice is
that, when someone gives you data, you
1103
00:57:09,240 --> 00:57:11,870
assume in general it's not
linearly separable.
1104
00:57:11,870 --> 00:57:15,310
It will hardly ever be, and therefore
you take techniques that can deal with
1105
00:57:15,310 --> 00:57:16,630
that case as well.
1106
00:57:16,630 --> 00:57:20,100
There is a simple modification of the
perceptron learning algorithm, which
1107
00:57:20,100 --> 00:57:21,670
is called the pocket algorithm,
1108
00:57:21,670 --> 00:57:26,190
that applies the same rule with a very
minor modification, and deals with the
1109
00:57:26,190 --> 00:57:29,460
case where the data is not separable.
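The pocket idea mentioned here can be sketched briefly: run the ordinary perceptron update, but keep "in your pocket" the best weight vector seen so far, measured by training misclassifications. The toy non-separable data below is made up for illustration.

```python
# Minimal sketch of the pocket algorithm: ordinary perceptron updates, but
# after each update we keep ("pocket") the weights with the fewest training
# misclassifications so far. On non-separable data the raw perceptron can
# hop from good weights to terrible ones; the pocket copy only improves.

def errors(w, data):
    # x is augmented with a leading 1 for the bias term.
    return sum(1 for x, y in data
               if (1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1) != y)

def pocket(data, dim, iters=200):
    w = [0.0] * (dim + 1)
    best_w, best_e = list(w), errors(w, data)
    for t in range(iters):
        # Perceptron step: pick a misclassified point and update on it.
        mis = [(x, y) for x, y in data
               if (1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1) != y]
        if not mis:
            return w, 0          # separable case: PLA converged
        x, y = mis[t % len(mis)]
        w = [wi + y * xi for wi, xi in zip(w, x)]
        e = errors(w, data)
        if e < best_e:           # pocket update: keep only improvements
            best_w, best_e = list(w), e
    return best_w, best_e

# Non-separable 1-D data: one flipped label makes perfect separation impossible.
data = [((1, 0.0), -1), ((1, 1.0), -1), ((1, 2.0), +1),
        ((1, 3.0), +1), ((1, 0.5), +1)]          # the flipped point
pocket_w, pocket_e = pocket(data, dim=1)
print(pocket_e)  # best achievable here: 1 misclassified point
```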
1110
00:57:29,460 --> 00:57:34,820
However, if you apply the perceptron
learning algorithm, that is guaranteed
1111
00:57:34,820 --> 00:57:38,990
to converge to a correct solution in the
case of linear separability, and
1112
00:57:38,990 --> 00:57:43,520
you apply it to data that is not
linearly separable, bad things happen.
1113
00:57:43,520 --> 00:57:46,800
Not only is it going not to converge,
obviously it is not going to converge
1114
00:57:46,800 --> 00:57:50,600
because it terminates when there are
no misclassified points, right?
1115
00:57:50,600 --> 00:57:53,640
If there is a misclassified point, then
there's a next iteration always.
1116
00:57:53,640 --> 00:57:56,500
So since the data is not linearly
separable, we will never come to
1117
00:57:56,500 --> 00:57:59,110
a point where all the points
are classified correctly.
1118
00:57:59,110 --> 00:58:01,220
So this is not what is bothering us.
1119
00:58:01,220 --> 00:58:04,570
What is bothering us is that, as you go
from one step to another, you can
1120
00:58:04,570 --> 00:58:08,040
go from a very good solution
to a terrible solution.
1121
00:58:08,040 --> 00:58:10,450
In the case of no linear separability.
1122
00:58:10,450 --> 00:58:13,530
So it's not an algorithm that you
would like to use, and just
1123
00:58:13,530 --> 00:58:15,650
terminate by force at an iteration.
1124
00:58:15,650 --> 00:58:21,360
A modification of it can be used this
way, and I'll mention it briefly when
1125
00:58:21,360 --> 00:58:26,120
we talk about linear regression
and other linear methods.
1126
00:58:26,120 --> 00:58:29,770
MODERATOR: There's also a question of
how does the rate of convergence of
1127
00:58:29,770 --> 00:58:33,810
the perceptron change with the
dimensionality of the data?
1128
00:58:33,810 --> 00:58:35,840
PROFESSOR: Badly!
1129
00:58:35,840 --> 00:58:37,200
That's the answer.
1130
00:58:37,200 --> 00:58:38,440
Let me put it this way.
1131
00:58:38,440 --> 00:58:42,000
You can build pathological cases, where
it really will take forever.
1132
00:58:42,000 --> 00:58:45,230
However, I did not give the perceptron
learning algorithm in the first
1133
00:58:45,230 --> 00:58:47,900
lecture to tell you that this is
the great algorithm that you
1134
00:58:47,900 --> 00:58:49,160
need to learn.
1135
00:58:49,160 --> 00:58:51,720
I gave it in the first lecture,
because this is the simplest
1136
00:58:51,720 --> 00:58:53,550
algorithm I could give.
1137
00:58:53,550 --> 00:58:56,990
By the end of this course,
you'll be saying, what?
1138
00:58:56,990 --> 00:58:57,650
Perceptron?
1139
00:58:57,650 --> 00:58:58,880
Never heard of it.
1140
00:58:58,880 --> 00:59:02,710
So it will go out of contention, after we
get to the more interesting stuff.
1141
00:59:02,710 --> 00:59:03,240
1142
00:59:03,240 --> 00:59:07,090
But as a method that can be used, it
indeed can be used, and can be
1143
00:59:07,090 --> 00:59:09,710
explained in five minutes
as you have seen.
1144
00:59:09,710 --> 00:59:15,050
MODERATOR: Regarding the items for
learning, you mentioned that there
1145
00:59:15,050 --> 00:59:15,900
must be a pattern.
1146
00:59:15,900 --> 00:59:18,400
So can you be more specific about that?
1147
00:59:18,400 --> 00:59:20,590
How do you know if there's a pattern?
1148
00:59:20,590 --> 00:59:21,940
PROFESSOR: You don't.
1149
00:59:21,940 --> 00:59:25,840
My answers seem to be very abrupt,
but that's the way it is.
1150
00:59:25,840 --> 00:59:29,680
When we get to the theory--
is learning feasible-- it will
1151
00:59:29,680 --> 00:59:34,060
become very clear that there is
a separation between the target
1152
00:59:34,060 --> 00:59:35,730
function-- there is
a pattern to detect--
1153
00:59:35,730 --> 00:59:37,150
and whether we can learn it.
1154
00:59:37,150 --> 00:59:40,150
It is very difficult for me to explain
it in two minutes, it will take a full
1155
00:59:40,150 --> 00:59:41,500
lecture to get there.
1156
00:59:41,500 --> 00:59:47,600
But the essence of it is that you take
the data, you apply your learning
1157
00:59:47,600 --> 00:59:52,710
algorithm, and there is something you
can explicitly detect that will
1158
00:59:52,710 --> 00:59:54,890
tell you whether you learned or not.
1159
00:59:54,890 --> 00:59:57,630
So in some cases, you're not
going to be able to learn.
1160
00:59:57,630 --> 00:59:59,890
In some cases, you'll be able to learn.
1161
00:59:59,890 --> 01:00:02,630
And the key is that you're going
to be able to tell by
1162
01:00:02,630 --> 01:00:04,440
running your algorithm.
1163
01:00:04,440 --> 01:00:07,280
And I'm going to explain that
in more details later on.
1164
01:00:07,280 --> 01:00:08,010
1165
01:00:08,010 --> 01:00:15,660
So basically, I'm also resisting
taking the data, deciding
1166
01:00:15,660 --> 01:00:19,220
whether it's linearly separable, looking
at it and seeing. You will
1167
01:00:19,220 --> 01:00:25,370
realize as we go through that it's
a no-no to actually look at the data.
1168
01:00:25,370 --> 01:00:26,860
What?
1169
01:00:26,860 --> 01:00:29,580
That's what data is for, to look at.
1170
01:00:29,580 --> 01:00:30,850
Bear with me.
1171
01:00:30,850 --> 01:00:34,720
We will come to the level where we ask
why don't we look at the data--
1172
01:00:34,720 --> 01:00:37,920
just looking at it and then saying:
It's linearly separable.
1173
01:00:37,920 --> 01:00:39,890
Let's pick the perceptron.
1174
01:00:39,890 --> 01:00:42,370
That's bad practice, for reasons
that are not obvious now.
1175
01:00:42,370 --> 01:00:45,460
They will become obvious, once we
are done with the theory.
1176
01:00:45,460 --> 01:00:50,330
So when someone knocks on my door with
a set of data, I can ask them all
1177
01:00:50,330 --> 01:00:54,360
kinds of questions about the data-- not
the particular data set that they gave
1178
01:00:54,360 --> 01:00:57,750
me, but about the general data that
is generated by their process.
1179
01:00:57,750 --> 01:01:00,570
They can tell me this variable is
important, the function is symmetric,
1180
01:01:00,570 --> 01:01:04,210
they can give you all kinds of
information that I will take to heart.
1181
01:01:04,210 --> 01:01:08,730
But I will try, as much as I can, to
avoid looking at the particular data
1182
01:01:08,730 --> 01:01:14,680
set that they gave me, lest I should
tailor my system toward this data set,
1183
01:01:14,680 --> 01:01:17,680
and be disappointed when another
data set comes about.
1184
01:01:17,680 --> 01:01:20,100
You don't want to get too
close to the data set.
1185
01:01:20,100 --> 01:01:24,190
This will become very clear
as we go with the theory.
1186
01:01:24,190 --> 01:01:27,190
MODERATOR: In general about
machine learning, how does it
1187
01:01:27,190 --> 01:01:30,550
relate to other statistical, especially
econometric techniques?
1188
01:01:30,550 --> 01:01:33,090
1189
01:01:33,090 --> 01:01:37,150
PROFESSOR: Statistics, in
the form I described, is machine
1190
01:01:37,150 --> 01:01:38,710
learning where the target--
1191
01:01:38,710 --> 01:01:42,010
it's not a function in this case--
is a probability distribution.
1192
01:01:42,010 --> 01:01:44,670
Statistics is a mathematical field.
1193
01:01:44,670 --> 01:01:49,100
And therefore, you put the assumptions
that you need in order to be able to
1194
01:01:49,100 --> 01:01:53,970
rigorously prove the results you have,
and get the results in detail.
1195
01:01:53,970 --> 01:01:55,700
For example, linear regression.
1196
01:01:55,700 --> 01:01:59,810
When we talk about linear regression, it
will have very few assumptions, and
1197
01:01:59,810 --> 01:02:03,150
the results will apply to a wide range,
because we didn't make too many
1198
01:02:03,150 --> 01:02:04,330
assumptions.
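The point about few assumptions can be made concrete with a short sketch of my own (in NumPy; the function name and data are illustrative, not from the course): the machine-learning view of linear regression simply minimizes squared error on the data, with no distributional assumptions about how the data was generated.

```python
import numpy as np

def linear_regression(X, y):
    """Least-squares linear fit -- a minimal sketch of the
    machine-learning view: minimize squared error on the data,
    assuming nothing about the distribution that produced it."""
    Xb = np.column_stack([np.ones(len(X)), X])  # constant (bias) term
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # minimizes squared error
    return w

# Exactly linear data y = 1 + 2x is recovered exactly.
w = linear_regression(np.array([[0.0], [1.0], [2.0]]),
                      np.array([1.0, 3.0, 5.0]))
```

The statistics treatment of the same fit would add assumptions (e.g. Gaussian noise) in order to prove stronger, more detailed results.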
1199
01:02:04,330 --> 01:02:07,530
When you study linear regression under
statistics, there is a lot of
1200
01:02:07,530 --> 01:02:11,020
mathematics that goes with it, lot of
assumptions, because that is the
1201
01:02:11,020 --> 01:02:12,640
purpose of the field.
1202
01:02:12,640 --> 01:02:18,560
In general, machine learning tries to make
the fewest assumptions and cover the
1203
01:02:18,560 --> 01:02:22,090
most territory. These go together.
1204
01:02:22,090 --> 01:02:25,640
So it is not a mathematical discipline,
but it's not a purely
1205
01:02:25,640 --> 01:02:26,850
applied discipline.
1206
01:02:26,850 --> 01:02:31,270
It spans both the mathematical, to
certain extent, but it is willing to
1207
01:02:31,270 --> 01:02:35,870
actually go into territory where we
don't have mathematical models, and
1208
01:02:35,870 --> 01:02:38,040
still want to apply our techniques.
1209
01:02:38,040 --> 01:02:40,600
So that is what characterizes
it the most.
1210
01:02:40,600 --> 01:02:44,120
And then there are other fields.
Machine learning itself
1211
01:02:44,120 --> 01:02:46,400
you can find under the name
computational learning,
1212
01:02:46,400 --> 01:02:48,090
or statistical learning.
1213
01:02:48,090 --> 01:02:52,120
Data mining has a huge intersection
with machine learning.
1214
01:02:52,120 --> 01:02:56,020
There are lots of disciplines around
that actually share some value.
1215
01:02:56,020 --> 01:02:59,630
But the point is, the premise that you
saw is so broad, that it shouldn't be
1216
01:02:59,630 --> 01:03:03,690
surprising that people at different times
developed a particular discipline
1217
01:03:03,690 --> 01:03:06,840
with its own jargon, to deal
with that discipline.
1218
01:03:06,840 --> 01:03:13,990
So what I'm giving you is machine
learning as the mainstream goes, and
1219
01:03:13,990 --> 01:03:17,520
that can be applied as widely as
possible to applications, both
1220
01:03:17,520 --> 01:03:20,900
practical applications and
scientific applications.
1221
01:03:20,900 --> 01:03:24,870
You will see, here is a situation, I
have an experiment, here is a target,
1222
01:03:24,870 --> 01:03:25,980
I have the data.
1223
01:03:25,980 --> 01:03:28,640
How do I produce the target
in the best way I want?
1224
01:03:28,640 --> 01:03:32,010
And then you apply machine learning.
1225
01:03:32,010 --> 01:03:36,180
MODERATOR: Also, in a general
question about machine learning.
1226
01:03:36,180 --> 01:03:36,190
1227
01:03:36,190 --> 01:03:42,370
Do machine learning algorithms perform
global optimization methods,
1228
01:03:42,370 --> 01:03:45,810
or just local optimization methods?
1229
01:03:45,810 --> 01:03:47,640
PROFESSOR: Obviously,
a general question.
1230
01:03:47,640 --> 01:03:48,470
1231
01:03:48,470 --> 01:03:52,120
Optimization is a tool
for machine learning.
1232
01:03:52,120 --> 01:03:56,340
So we will pick whatever optimization
that does the job for us.
1233
01:03:56,340 --> 01:03:59,440
And sometimes, there is a very
specific optimization method.
1234
01:03:59,440 --> 01:04:01,470
For example, in support vector
machines, it will be quadratic
1235
01:04:01,470 --> 01:04:01,990
programming.
1236
01:04:01,990 --> 01:04:04,050
It happens to be the one
that works with that.
1237
01:04:04,050 --> 01:04:08,190
But optimization is not something
that machine learning people
1238
01:04:08,190 --> 01:04:10,000
study for its own sake.
1239
01:04:10,000 --> 01:04:12,840
They obviously study it to understand
it better, and to choose the correct
1240
01:04:12,840 --> 01:04:14,900
optimization method.
1241
01:04:14,900 --> 01:04:17,780
Now, the question is alluding
to something that will
1242
01:04:17,780 --> 01:04:21,220
become clear when we talk about neural
networks, which is local minimum versus
1243
01:04:21,220 --> 01:04:22,830
global minimum.
1244
01:04:22,830 --> 01:04:26,680
And it is impossible to put this in
any perspective before we get the
1245
01:04:26,680 --> 01:04:29,120
details of neural networks,
so I will defer that until
1246
01:04:29,120 --> 01:04:32,850
we get to that lecture.
1247
01:04:32,850 --> 01:04:37,530
MODERATOR: Also, this is
a math question, I guess.
1248
01:04:37,530 --> 01:04:42,470
Is the hypothesis set, in a topological
sense, continuous?
1249
01:04:42,470 --> 01:04:47,160
PROFESSOR: The hypothesis
set can be anything, in principle.
1250
01:04:47,160 --> 01:04:50,500
So it can be continuous,
and it can be discrete.
1251
01:04:50,500 --> 01:04:53,710
For example, in the next lecture I take
the simplest case where we have
1252
01:04:53,710 --> 01:04:57,610
a finite hypothesis set, in order
to make a certain point.
1253
01:04:57,610 --> 01:05:00,610
In reality, almost all the hypothesis
sets that you find are
1254
01:05:00,610 --> 01:05:02,580
continuous and infinite.
1255
01:05:02,580 --> 01:05:04,170
Very infinite!
1256
01:05:04,170 --> 01:05:10,190
And the level of sophistication
of the hypothesis set can be huge.
1257
01:05:10,190 --> 01:05:15,440
And nonetheless, we will be able to see
that under one condition, which
1258
01:05:15,440 --> 01:05:19,307
comes from the theory, we'll be able to
learn even if the hypothesis set is
1259
01:05:19,307 --> 01:05:23,580
huge and complicated.
1260
01:05:23,580 --> 01:05:26,340
There's a question from inside, yes?
1261
01:05:26,340 --> 01:05:32,930
STUDENT: I think I understood, more or
less, the general idea, but I don't
1262
01:05:32,930 --> 01:05:37,160
understand the second example
you gave about credit approval.
1263
01:05:37,160 --> 01:05:41,200
So how do we collect our data?
1264
01:05:41,200 --> 01:05:46,210
Should we give credit to everyone, or
should we make our data biased,
1265
01:05:46,210 --> 01:05:51,170
because we cannot determine
the data of--
1266
01:05:51,170 --> 01:05:57,480
we can't determine, should we give credit
or not to persons we rejected?
1267
01:05:57,480 --> 01:05:58,030
PROFESSOR: Correct.
1268
01:05:58,030 --> 01:06:04,465
This is a good point. Every time
someone asks a question, the
1269
01:06:04,465 --> 01:06:05,590
lecture number comes to my mind.
1270
01:06:05,590 --> 01:06:07,570
I know when I'm going
to talk about it.
1271
01:06:07,570 --> 01:06:10,410
So what you describe is
called sampling bias.
1272
01:06:10,410 --> 01:06:12,190
And I will describe it in detail.
1273
01:06:12,190 --> 01:06:18,450
But when you use the biased data, let's
say the bank uses historical records.
1274
01:06:18,450 --> 01:06:22,450
So it sees the people who applied and
were accepted, and for those guys, it
1275
01:06:22,450 --> 01:06:26,030
can actually predict what the credit
behavior is, because it has their
1276
01:06:26,030 --> 01:06:26,700
credit history.
1277
01:06:26,700 --> 01:06:30,000
They charged and repaid and maxed
out, and all of this.
1278
01:06:30,000 --> 01:06:32,590
And then they decide: is this
a good customer or not?
1279
01:06:32,590 --> 01:06:36,400
For those who were rejected, there's
really no way to tell in this case
1280
01:06:36,400 --> 01:06:38,870
whether they were falsely rejected,
that they would have been good
1281
01:06:38,870 --> 01:06:40,220
customers or not.
1282
01:06:40,220 --> 01:06:44,050
Nonetheless, if you take the customer
base that you have, and base your
1283
01:06:44,050 --> 01:06:48,070
decision on it, the boundary
works fairly decently.
1284
01:06:48,070 --> 01:06:51,300
Actually, pretty decently, even for the
other guys, because the other guys
1285
01:06:51,300 --> 01:06:55,060
usually are deeper into the
classification region than the
1286
01:06:55,060 --> 01:06:57,940
boundary guys that you accepted,
and turned out to be bad.
1287
01:06:57,940 --> 01:06:58,810
1288
01:06:58,810 --> 01:07:01,000
But the point is well taken.
1289
01:07:01,000 --> 01:07:04,390
The data set in this case is not
completely representative, and there
1290
01:07:04,390 --> 01:07:07,750
is a particular principle in learning
that we'll talk about, which is
1291
01:07:07,750 --> 01:07:11,400
sampling bias, that deals
with this case.
1292
01:07:11,400 --> 01:07:14,270
Another question from here?
1293
01:07:14,270 --> 01:07:17,420
STUDENT: You explain that we need
to have a lot of data to learn.
1294
01:07:17,420 --> 01:07:22,050
So how do you decide how much amount
of data that is required for
1295
01:07:22,050 --> 01:07:26,980
a particular problem, in order to be
able to come up with a reasonable--
1296
01:07:26,980 --> 01:07:27,930
PROFESSOR: Good question.
1297
01:07:27,930 --> 01:07:31,710
So let me tell you the theoretical,
and the practical answer.
1298
01:07:31,710 --> 01:07:36,340
The theoretical answer is that this is
exactly the crux of the theory part
1299
01:07:36,340 --> 01:07:37,810
that we're going to talk about.
1300
01:07:37,810 --> 01:07:38,350
1301
01:07:38,350 --> 01:07:40,950
And in the theory, we are going
to see, can we learn?
1302
01:07:40,950 --> 01:07:43,120
And how much data.
1303
01:07:43,120 --> 01:07:46,150
So all of this will be answered
in a mathematical way.
1304
01:07:46,150 --> 01:07:48,020
So this is the theoretical answer.
1305
01:07:48,020 --> 01:07:52,770
The practical answer is: that's
not under your control.
1306
01:07:52,770 --> 01:07:57,180
When someone knocks on your door: Here
is the data, I have 500 points.
1307
01:07:57,180 --> 01:08:00,170
I tell him, I will give you
a fantastic system if you
1308
01:08:00,170 --> 01:08:02,200
just give me 2000.
1309
01:08:02,200 --> 01:08:05,000
But I don't have 2000, I have 500.
1310
01:08:05,000 --> 01:08:09,040
So now you go and you use your theory
to do something to your system, such
1311
01:08:09,040 --> 01:08:11,000
that it can work with the 500.
1312
01:08:11,000 --> 01:08:11,710
1313
01:08:11,710 --> 01:08:12,600
There was one case--
1314
01:08:12,600 --> 01:08:16,930
I worked with data in different
applications--
1315
01:08:16,930 --> 01:08:20,330
at some point, we had almost
100 million points.
1316
01:08:20,330 --> 01:08:21,760
You were swimming in data.
1317
01:08:21,760 --> 01:08:23,279
You wouldn't complain about data.
1318
01:08:23,279 --> 01:08:25,200
Data was wonderful.
1319
01:08:25,200 --> 01:08:28,779
And in another case, there were
less than 100 points.
1320
01:08:28,779 --> 01:08:31,890
And you had to deal with
the data with gloves!
1321
01:08:31,890 --> 01:08:36,290
Because if you use them the wrong way,
they are contaminated, which is
1322
01:08:36,290 --> 01:08:38,970
an expression we will see, and
then you have nothing.
1323
01:08:38,970 --> 01:08:43,029
And you will produce a system, and you
are proud of it, but you have no idea
1324
01:08:43,029 --> 01:08:44,540
whether it will perform well or not.
1325
01:08:44,540 --> 01:08:46,899
And you cannot give this to the customer,
and have the customer come
1326
01:08:46,899 --> 01:08:49,300
back to you and say: what did you do!?
1327
01:08:49,300 --> 01:08:49,760
1328
01:08:49,760 --> 01:08:55,490
So there is a question of, what
performance can you do given
1329
01:08:55,490 --> 01:08:57,090
what data size you have?
1330
01:08:57,090 --> 01:09:00,520
But in practice, you really have no
control over the data size in almost
1331
01:09:00,520 --> 01:09:03,140
all the cases, almost all
the practical cases.
1332
01:09:03,140 --> 01:09:05,960
Yes?
1333
01:09:05,960 --> 01:09:10,540
STUDENT: Another question I have
is regarding the hypothesis set.
1334
01:09:10,540 --> 01:09:13,729
So the larger the hypothesis set
is, probably I'll be able to
1335
01:09:13,729 --> 01:09:15,649
better fit the data.
1336
01:09:15,649 --> 01:09:20,420
But that, as you were explaining, might
be a bad thing to do because
1337
01:09:20,420 --> 01:09:23,460
when the new data point comes,
there might be troubles.
1338
01:09:23,460 --> 01:09:25,210
So how do you decide
the size of your--
1339
01:09:25,210 --> 01:09:27,680
PROFESSOR: You are asking all
the right questions, and all of
1340
01:09:27,680 --> 01:09:28,350
them are coming up.
1341
01:09:28,350 --> 01:09:32,330
This is again part of the theory,
but let me try to explain this.
1342
01:09:32,330 --> 01:09:35,420
As we mentioned, learning is about
being able to predict.
1343
01:09:35,420 --> 01:09:40,510
So you are using the data, not to
memorize it, but to figure out what
1344
01:09:40,510 --> 01:09:42,130
the pattern is.
1345
01:09:42,130 --> 01:09:45,100
And if you figure out a pattern that
applies to all the data, and it's
1346
01:09:45,100 --> 01:09:47,216
a reasonable pattern, then you
have a chance that it
1347
01:09:47,216 --> 01:09:49,340
will generalize outside.
1348
01:09:49,340 --> 01:09:53,880
Now the problem is that, if I give you
50 points, and you use a 7000th-order
1349
01:09:53,880 --> 01:09:57,360
polynomial, you will fit the
heck out of the data.
1350
01:09:57,360 --> 01:10:01,160
You will fit it so much with so many
degrees of freedom to spare, but you
1351
01:10:01,160 --> 01:10:02,070
haven't learned anything.
1352
01:10:02,070 --> 01:10:04,610
You just memorized it in a fancy way.
1353
01:10:04,610 --> 01:10:08,500
You put it in a polynomial form, and
that actually carries all the
1354
01:10:08,500 --> 01:10:10,400
information about the
data that you have,
1355
01:10:10,400 --> 01:10:11,890
and then some.
1356
01:10:11,890 --> 01:10:15,280
So you don't expect at all that
this will generalize outside.
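That intuition can be seen in a tiny experiment of my own (a sketch, with a degree-9 polynomial on 10 points standing in for the 7000th-order example): with as many parameters as data points, the polynomial fits the sample essentially exactly, yet that says little about the target off the sample.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 10)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(10)  # 10 noisy points

# Degree 9: as many parameters as points, so the polynomial can
# interpolate the sample -- "fit the heck out of the data".
coeffs = np.polyfit(x, y, deg=9)
train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)

# Off the sample, the memorized fit need not track the target.
x_dense = np.linspace(-1.0, 1.0, 200)
test_err = np.mean((np.polyval(coeffs, x_dense) - np.sin(np.pi * x_dense)) ** 2)
```

The training error is essentially zero while the off-sample error is not: the polynomial has stored the data, noise and all, rather than learned the pattern.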
1357
01:10:15,280 --> 01:10:18,450
And that intuitive observation
will be formalized when we
1358
01:10:18,450 --> 01:10:19,580
talk about the theory.
1359
01:10:19,580 --> 01:10:22,930
There will be a measurement of the
hypothesis set that you give me, that
1360
01:10:22,930 --> 01:10:25,550
measures the sophistication of it,
and will tell you with that
1361
01:10:25,550 --> 01:10:28,850
sophistication, you need that amount
of data in order to be able to make
1362
01:10:28,850 --> 01:10:30,430
any statement about generalization.
1363
01:10:30,430 --> 01:10:31,680
So that is what the theory is about.
1364
01:10:31,680 --> 01:10:34,650
1365
01:10:34,650 --> 01:10:37,880
STUDENT: Suppose, I mean, here
whatever we discussed, it is like I
1366
01:10:37,880 --> 01:10:42,930
had a data set and I came up with
an algorithm, and gave the output.
1367
01:10:42,930 --> 01:10:48,690
But won't it be also important to see,
OK, we came up with the output, and
1368
01:10:48,690 --> 01:10:52,790
using that, what was the feedback?
1369
01:10:52,790 --> 01:10:57,690
Are there techniques where you take
the feedback and try to
1370
01:10:57,690 --> 01:10:58,980
correct your--
1371
01:10:58,980 --> 01:11:03,360
PROFESSOR: You are alluding
to different techniques here.
1372
01:11:03,360 --> 01:11:07,740
But one of them would be validation,
which is after you learn, you validate
1373
01:11:07,740 --> 01:11:09,360
your solution.
1374
01:11:09,360 --> 01:11:13,000
And this is an extremely established and
core technique in machine learning
1375
01:11:13,000 --> 01:11:16,870
that will be covered in
one of the lectures.
1376
01:11:16,870 --> 01:11:18,810
Any questions from the online audience?
1377
01:11:18,810 --> 01:11:25,780
MODERATOR: In practice, how many
dimensions would be considered easy,
1378
01:11:25,780 --> 01:11:28,730
medium, and hard for
a perceptron problem?
1379
01:11:28,730 --> 01:11:31,100
PROFESSOR: The hard,
1380
01:11:31,100 --> 01:11:34,850
in most people's mind before they
get into machine learning, is the
1381
01:11:34,850 --> 01:11:36,420
computational time.
1382
01:11:36,420 --> 01:11:38,800
If something takes a lot of time,
then it's a hard problem.
1383
01:11:38,800 --> 01:11:42,800
If something can be computed quickly,
it's an easy problem.
1384
01:11:42,800 --> 01:11:47,210
For machine learning, the bottleneck
in my case, has never been the
1385
01:11:47,210 --> 01:11:51,340
computation time, even in
incredibly big data sets.
1386
01:11:51,340 --> 01:11:55,410
The bottleneck for machine learning is
to be able to generalize outside the
1387
01:11:55,410 --> 01:11:56,990
data that you have seen.
1388
01:11:56,990 --> 01:12:01,790
So to answer your question, the
perceptron behaves badly in terms of
1389
01:12:01,790 --> 01:12:04,090
the computational behavior.
1390
01:12:04,090 --> 01:12:07,490
We will be able to predict its
generalization behavior, based on the
1391
01:12:07,490 --> 01:12:09,370
number of dimensions and
the amount of data.
1392
01:12:09,370 --> 01:12:11,610
This will be given explicitly.
1393
01:12:11,610 --> 01:12:19,030
And therefore, the perceptron algorithm
is bad computationally, good
1394
01:12:19,030 --> 01:12:20,980
in terms of generalization.
1395
01:12:20,980 --> 01:12:24,900
If you actually can get away with
perceptrons, your chances of
1396
01:12:24,900 --> 01:12:28,460
generalizing are good because
it's a simplistic
1397
01:12:28,460 --> 01:12:33,850
model, and therefore its ability to
generalize is good, as we will see.
1398
01:12:33,850 --> 01:12:38,010
MODERATOR: Also, in the example you
explain the use of binary function.
1399
01:12:38,010 --> 01:12:43,690
So can you use more multi-valued
or real functions?
1400
01:12:43,690 --> 01:12:45,100
PROFESSOR: Correct.
1401
01:12:45,100 --> 01:12:47,980
Remember when I told you that there is
a topic that is out of sequence.
1402
01:12:47,980 --> 01:12:51,810
There was a logical sequence to the
course, and then I took part of the
1403
01:12:51,810 --> 01:12:55,870
linear models and put it very early on,
to give you something a little bit
1404
01:12:55,870 --> 01:12:59,140
more sophisticated than perceptrons
to try your hand on.
1405
01:12:59,140 --> 01:13:01,560
That happens to be for
real-valued functions.
1406
01:13:01,560 --> 01:13:05,650
And obviously there are hypotheses that
cover all types of co-domains.
1407
01:13:05,650 --> 01:13:07,010
Y could be anything as well.
1408
01:13:07,010 --> 01:13:09,930
1409
01:13:09,930 --> 01:13:18,730
MODERATOR: Another question is, in
the learning process you showed, when
1410
01:13:18,730 --> 01:13:22,420
do you pick your learning algorithm,
when do you pick your hypothesis set,
1411
01:13:22,420 --> 01:13:23,840
and what liberty do you have?
1412
01:13:23,840 --> 01:13:28,380
1413
01:13:28,380 --> 01:13:33,070
PROFESSOR: The hypothesis set
is the most important aspect of
1414
01:13:33,070 --> 01:13:36,030
determining the generalization behavior
that we'll talk about.
1415
01:13:36,030 --> 01:13:38,960
The learning algorithm does play a role,
although it is a secondary role,
1416
01:13:38,960 --> 01:13:41,330
as we will see in the discussion.
1417
01:13:41,330 --> 01:13:45,960
So in general, the learning
algorithm has the form of
1418
01:13:45,960 --> 01:13:49,140
minimizing an error function.
1419
01:13:49,140 --> 01:13:51,540
So you can think of the
perceptron, what does
1420
01:13:51,540 --> 01:13:52,290
the algorithm do?
1421
01:13:52,290 --> 01:13:55,420
It tries to minimize the
classification error.
1422
01:13:55,420 --> 01:13:58,710
That is your error function, and
you're minimizing it using this
1423
01:13:58,710 --> 01:14:00,220
particular update rule.
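For concreteness, here is a standard sketch of that update rule (variable names are mine, not the lecture's): while some point is misclassified, pick one such point x_n and update w to w + y_n * x_n.

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron learning algorithm, a standard sketch: repeatedly
    pick a misclassified point x_n and update w <- w + y_n * x_n,
    which reduces the classification error on that point."""
    Xb = np.column_stack([np.ones(len(X)), X])  # x0 = 1 absorbs the threshold
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iters):
        mis = np.flatnonzero(np.sign(Xb @ w) != y)
        if mis.size == 0:
            break  # every point is classified correctly
        n = mis[0]
        w = w + y[n] * Xb[n]
    return w

# Linearly separable toy data: +1 where x1 > x2, -1 where x1 < x2.
X = np.array([[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]])
y = np.array([1, 1, -1, -1])
w = pla(X, y)
```

On separable data like this, the update rule is guaranteed to reach a weight vector that classifies every training point correctly.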
1424
01:14:00,220 --> 01:14:03,700
And in other cases, we'll see that we
are minimizing an error function.
1425
01:14:03,700 --> 01:14:08,000
Now the minimization aspect is
an optimization question, and once you
1426
01:14:08,000 --> 01:14:11,180
determine that this is indeed the
error function that I want to
1427
01:14:11,180 --> 01:14:15,950
minimize, then you go and minimize
as much as you can using the most
1428
01:14:15,950 --> 01:14:18,710
sophisticated optimization
technique that you find.
1429
01:14:18,710 --> 01:14:22,160
So the question now translates into
what is the choice of the error
1430
01:14:22,160 --> 01:14:26,280
function or error measure that
will help or not help.
1431
01:14:26,280 --> 01:14:29,530
And that will be covered also next week
under the topic, Error and Noise.
1432
01:14:29,530 --> 01:14:32,350
When I talk about error, we'll talk
about error measures, and this
1433
01:14:32,350 --> 01:14:37,660
translates directly to the learning
algorithm that goes with them.
1434
01:14:37,660 --> 01:14:38,730
MODERATOR: Back to the perceptron.
1435
01:14:38,730 --> 01:14:43,220
So what happens if your hypothesis
gives you exactly 0 in this case?
1436
01:14:43,220 --> 01:14:47,200
PROFESSOR: So remember that
the quantity you compute and
1437
01:14:47,200 --> 01:14:49,960
compare with the threshold
was your credit score.
1438
01:14:49,960 --> 01:14:53,090
So I told you what happens if you are
above threshold, and what happens if
1439
01:14:53,090 --> 01:14:54,760
you're below threshold.
1440
01:14:54,760 --> 01:14:57,430
So what happens if you're exactly
at the threshold?
1441
01:14:57,430 --> 01:15:02,340
Your score is exactly that.
1442
01:15:02,340 --> 01:15:07,080
The informal answer is that it depends
on the mood of the credit
1443
01:15:07,080 --> 01:15:08,650
officer on that day.
1444
01:15:08,650 --> 01:15:10,870
If they had a bad day,
you will be denied!
1445
01:15:10,870 --> 01:15:16,410
But the serious answer is that
there are technical ways of
1446
01:15:16,410 --> 01:15:17,870
defining that point.
1447
01:15:17,870 --> 01:15:21,580
You can define it as 0,
so the sign of 0 is 0.
1448
01:15:21,580 --> 01:15:24,190
In which case you are always making
an error, because you are never +1 or
1449
01:15:24,190 --> 01:15:25,830
-1, when you should be.
1450
01:15:25,830 --> 01:15:28,230
Or you could make it belong
to the +1 category or
1451
01:15:28,230 --> 01:15:29,700
to the -1 category.
1452
01:15:29,700 --> 01:15:32,190
There are ramifications for
all of these decisions
1453
01:15:32,190 --> 01:15:33,950
that are purely technical.
1454
01:15:33,950 --> 01:15:36,010
Nothing conceptual comes out of them.
1455
01:15:36,010 --> 01:15:38,790
That's why I decided not
to include it.
1456
01:15:38,790 --> 01:15:42,220
Because it clutters the main concept
with something that really has no
1457
01:15:42,220 --> 01:15:43,170
ramification.
1458
01:15:43,170 --> 01:15:46,090
As far as you're concerned, the easiest
way to consider it is that the
1459
01:15:46,090 --> 01:15:49,040
output will be 0, and therefore you will
be making an error regardless of
1460
01:15:49,040 --> 01:15:50,410
whether it's +1 or -1.
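In code, that convention can be sketched like this (a small illustration of my own; NumPy's `sign` already returns 0 at exactly 0):

```python
import numpy as np

def perceptron_output(w, x):
    """Perceptron output with the convention discussed above:
    a score of exactly 0 yields sign 0, which matches neither
    +1 nor -1, so it counts as an error either way."""
    return int(np.sign(np.dot(w, x)))
```

For example, `perceptron_output([1.0, -1.0], [1.0, 1.0])` lands exactly on the threshold and returns 0, disagreeing with both possible labels.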
1461
01:15:50,410 --> 01:15:53,600
1462
01:15:53,600 --> 01:15:57,070
MODERATOR: Is there a kind of problem
that cannot be learned even if
1463
01:15:57,070 --> 01:16:01,480
there's a huge amount of data?
1464
01:16:01,480 --> 01:16:02,360
PROFESSOR: Correct.
1465
01:16:02,360 --> 01:16:07,010
For example, if I go to my computer
and use a pseudo-random number
1466
01:16:07,010 --> 01:16:12,090
generator to generate the target over
the entire domain, then patently,
1467
01:16:12,090 --> 01:16:14,960
nothing I can give you will make
you learn the other guys.
1468
01:16:14,960 --> 01:16:16,360
1469
01:16:16,360 --> 01:16:17,665
So remember the three--
1470
01:16:17,665 --> 01:16:20,310
1471
01:16:20,310 --> 01:16:23,170
let me try to--
1472
01:16:23,170 --> 01:16:24,640
the essence of machine learning.
1473
01:16:24,640 --> 01:16:28,710
The first one was, a pattern exists.
1474
01:16:28,710 --> 01:16:29,510
1475
01:16:29,510 --> 01:16:34,130
If there's no pattern that exists,
there is nothing to learn.
1476
01:16:34,130 --> 01:16:38,350
Let's say that it's like a baby,
and stuff is happening, and the
1477
01:16:38,350 --> 01:16:42,000
baby is just staring. There is nothing
to pick from that thing.
1478
01:16:42,000 --> 01:16:44,740
Once there is a pattern, you can see
the smile on the baby's face.
1479
01:16:44,740 --> 01:16:46,500
Now I can see what is going on.
1480
01:16:46,500 --> 01:16:49,420
So whatever you are learning,
there needs to be a pattern.
1481
01:16:49,420 --> 01:16:50,240
1482
01:16:50,240 --> 01:16:52,640
Now, how to tell that there's
a pattern or not,
1483
01:16:52,640 --> 01:16:53,400
that's a different question.
1484
01:16:53,400 --> 01:16:58,300
But the main ingredient, there's a pattern.
The other one is we cannot pin
1485
01:16:58,300 --> 01:16:58,980
it down mathematically.
1486
01:16:58,980 --> 01:17:00,970
If we can pin it down mathematically,
and you decide to do
1487
01:17:00,970 --> 01:17:02,840
the learning, then you
are really lazy.
1488
01:17:02,840 --> 01:17:04,730
Because you could just write the code.
1489
01:17:04,730 --> 01:17:05,380
But fine.
1490
01:17:05,380 --> 01:17:08,280
You can use learning in this case, but
it's not the recommended method,
1491
01:17:08,280 --> 01:17:11,620
because it has certain errors
in performance.
1492
01:17:11,620 --> 01:17:14,060
Whereas if you have the mathematical
definition, you just implement it and
1493
01:17:14,060 --> 01:17:16,150
you'll get the best possible solution.
1494
01:17:16,150 --> 01:17:18,240
And the third one, you have data,
which is key.
1495
01:17:18,240 --> 01:17:22,370
So if you have plenty of data, but the
first one is off, you are simply not
1496
01:17:22,370 --> 01:17:23,890
going to learn.
1497
01:17:23,890 --> 01:17:27,900
And it's not like I have to answer each
of these questions at random.
1498
01:17:27,900 --> 01:17:31,460
The theory will completely
capture what is going on.
1499
01:17:31,460 --> 01:17:35,820
So there's a very good reason for going
through four lectures in the
1500
01:17:35,820 --> 01:17:38,490
outline that are
mathematically inclined.
1501
01:17:38,490 --> 01:17:40,140
This is not for the sake of math.
1502
01:17:40,140 --> 01:17:45,170
I don't like to do math
hacking, if you will.
1503
01:17:45,170 --> 01:17:48,680
I pick the math that is necessary
to establish a concept.
1504
01:17:48,680 --> 01:17:51,530
And these will establish it, and they
are very much worth being patient with
1505
01:17:51,530 --> 01:17:52,480
and going through.
1506
01:17:52,480 --> 01:17:55,840
Because once you're done with them, you
basically have it cold about what
1507
01:17:55,840 --> 01:18:00,520
are the components that make learning
possible, and how do we tell, and all
1508
01:18:00,520 --> 01:18:03,360
of the questions that have been asked.
1509
01:18:03,360 --> 01:18:04,620
MODERATOR: Historical question.
1510
01:18:04,620 --> 01:18:10,880
So why is the perceptron often
related with a neuron?
1511
01:18:10,880 --> 01:18:14,435
PROFESSOR: I will discuss this
in neural networks, but in general,
1512
01:18:14,435 --> 01:18:19,760
when you take a neuron and synapses, and
you find what is the function that
1513
01:18:19,760 --> 01:18:25,200
gets to the neuron, you find that the
neuron fires, which is +1, if the
1514
01:18:25,200 --> 01:18:31,090
signal coming to it, which is roughly
a combination of the stimuli, exceeds
1515
01:18:31,090 --> 01:18:32,400
a certain threshold.
1516
01:18:32,400 --> 01:18:37,760
So that was the initial inspiration, and
the initial inspiration was
1517
01:18:37,760 --> 01:18:41,460
that: the brain does a pretty good
job, so maybe if we mimic the
1518
01:18:41,460 --> 01:18:42,890
function, we will get something good.
1519
01:18:42,890 --> 01:18:45,940
But you mimic one neuron, and then you
put it together and you'll get the
1520
01:18:45,940 --> 01:18:47,520
neural network that you
are talking about.
1521
01:18:47,520 --> 01:18:52,780
And I will discuss the analogy with
biology, and the extent that it can be
1522
01:18:52,780 --> 01:18:55,850
benefited from, when we talk
about neural networks, because
1523
01:18:55,850 --> 01:18:57,799
that will be the more proper
context for that.
1524
01:18:57,799 --> 01:19:02,850
1525
01:19:02,850 --> 01:19:08,710
MODERATOR: Another question is,
regarding the hypothesis set, are there
1526
01:19:08,710 --> 01:19:12,645
Bayesian hierarchical procedures
to narrow down the hypothesis set?
1527
01:19:12,645 --> 01:19:15,660
1528
01:19:15,660 --> 01:19:16,916
PROFESSOR: OK.
1529
01:19:16,916 --> 01:19:20,320
The choice of the hypothesis set and
the model in general is model
1530
01:19:20,320 --> 01:19:23,820
selection, and there's quite a bit of
stuff that we are going to talk about
1531
01:19:23,820 --> 01:19:26,550
in model selection, when we
talk about validation.
1532
01:19:26,550 --> 01:19:31,160
In general, the word Bayesian was
mentioned here-- if you
1533
01:19:31,160 --> 01:19:36,330
look at machine learning, there are
schools that deal with the subject
1534
01:19:36,330 --> 01:19:37,840
differently.
1535
01:19:37,840 --> 01:19:41,940
So for example, the Bayesian school
puts a mathematical framework
1536
01:19:41,940 --> 01:19:43,160
completely on it.
1537
01:19:43,160 --> 01:19:47,490
And then everything can be derived,
and that is based on Bayesian
1538
01:19:47,490 --> 01:19:48,500
principles.
1539
01:19:48,500 --> 01:19:54,380
I will talk about that at the very
end, so it's last but not least.
1540
01:19:54,380 --> 01:19:57,350
And I will make a very specific point
about it, for what it's worth.
1541
01:19:57,350 --> 01:20:03,280
But what I'm talking about in the course
in all of the details, are the
1542
01:20:03,280 --> 01:20:08,310
most commonly useful methods
in practice.
1543
01:20:08,310 --> 01:20:10,280
That is my criterion for inclusion.
1544
01:20:10,280 --> 01:20:10,900
1545
01:20:10,900 --> 01:20:13,910
So I will get to that
when we get there.
1546
01:20:13,910 --> 01:20:16,080
In terms of a hierarchy,
1547
01:20:16,080 --> 01:20:19,160
there are a number of hierarchical
methods.
1548
01:20:19,160 --> 01:20:23,360
For example, structural risk
minimization is one of them.
1549
01:20:23,360 --> 01:20:27,060
There are methods of hierarchies,
and the ramifications of it in
1550
01:20:27,060 --> 01:20:27,910
generalization.
1551
01:20:27,910 --> 01:20:30,500
I may touch upon it, when I get
to support vector machines.
1552
01:20:30,500 --> 01:20:35,490
But again, there's a lot of theory,
and if you read a book on machine
1553
01:20:35,490 --> 01:20:38,860
learning written by someone from pure
theory, you would think that you are
1554
01:20:38,860 --> 01:20:41,220
reading about a completely
different subject.
1555
01:20:41,220 --> 01:20:44,370
It's respectable stuff, but
different from the other
1556
01:20:44,370 --> 01:20:45,670
stuff that is practiced.
1557
01:20:45,670 --> 01:20:51,950
So one of the things that I'm trying to
do, I'm trying to pick from all the
1558
01:20:51,950 --> 01:20:56,070
components of machine learning, the
big picture that gives you the
1559
01:20:56,070 --> 01:20:59,540
understanding of the concept, and
the tools to use it in practice.
1560
01:20:59,540 --> 01:21:00,790
That is the criterion for inclusion.
1561
01:21:00,790 --> 01:21:04,170
1562
01:21:04,170 --> 01:21:04,710
1563
01:21:04,710 --> 01:21:07,340
Any questions from the inside here?
1564
01:21:07,340 --> 01:21:11,060
1565
01:21:11,060 --> 01:21:13,040
OK, we'll call it a day, and
we'll see you on Thursday.