Anda di halaman 1dari 51

[MUSIC PLAYING]

DAVID J. MALAN: All right, this is CS50 and this is a lecture four.
So we're here in beautiful Lowell Lecture Hall
and Sanders is in use today.
And we're joined by some friends that will soon
be clear and present in just a moment.
But before then, recall that last time we took a look at CS50 IDE.
This was a new web-based programming environment similar in spirit
to CS50 Sandbox and CS50 Lab, but added a few features.
For instance, what features did it add to you--
to your capabilities?
Yeah?
AUDIENCE: Debugger.
DAVID J. MALAN: What's that?
AUDIENCE: The debugger.
DAVID J. MALAN: The debugger.
So debug50, which opens that side panel that
allows you to step through your code, step by step, and see variables.
Yeah?
AUDIENCE: Check50.
DAVID J. MALAN: Sorry, say again?
AUDIENCE: Check50.
DAVID J. MALAN: Check50 as well, which is a CS50 specific tool that
allows you to check the correctness of your code
much like the teaching fellows would when providing feedback on it.
Running a series of tests that pretty much are
the same tests that a lot of the homework's
will encourage you yourself to run manually,
but it just automates the process.
And anything else?
AUDIENCE: [INAUDIBLE]
DAVID J. MALAN: So that is true too.
There's a little hidden Easter egg that we don't use this semester,
but yes indeed.
If you look for a small puzzle piece, you
can actually convert your C code back to Scratch like puzzle pieces
and back and forth, and back to forth, thanks to Kareem and some of the team.
So that is there, but by now, it's probably better
to get comfortable with text as well.
So there's a couple of our other tools that we've
used over time of course besides check50 and debug50.
We've of course used printf and when is printf useful?
Like when might you want to use it beyond needing to just print something
because the problem set tells you to.
Yeah?
AUDIENCE: To find where your bug is.
DAVID J. MALAN: Yeah, so to find where your bug is.
If you just, kind of, want to print out variables, value or some kind of text
so you know what's going on and you don't necessarily
want to deploy debug50, you can do that.
When else?
AUDIENCE: If you have a long formula for something [INAUDIBLE]
and you want to see [INAUDIBLE].
DAVID J. MALAN: Good.
Yeah.
AUDIENCE: How running-- like going through debug50 50 times.
DAVID J. MALAN: Indeed.
Well, in real life-- so you might want to use printf
when you have maybe a nested loop, and you want to put a printf inside loop
so as to see when it kicks in.
Of course, you could use debug50, but you
might end up running debug50 or clicking next, next, next, next, next, next,
next so many times that gets a little tedious.
But do keep in mind, you can just put a breakpoint deeper into your code
as well and perhaps remove an earlier breakpoint as well.
And honestly, all the time, whether it's in C or other languages,
do I find myself occasionally using printf just to type out printf in here
just so that I can literally see if my code got to a certain point in here
to see if something's printed.
But the debugger you're going to find now
and hence forth so much more powerful, so much more versatile.
So if you haven't already gotten to the habit of using debug50 by all
means start and use those breakpoints to actually walk through your code
where you care to see what's going on.
So style50, of course, checks the style of your code much like the teaching
fellows might, and it shows you in red or green
what spaces you might want to delete, what spaces you might
want to add just to pretty things up.
So it's more readable for you and others.
And then what about help50?
When should you instinctively reach for help50?
AUDIENCE: When you don't understand an error message.
DAVID J. MALAN: Exactly.
Yeah, when you don't understand an error message.
So you're compiling something.
You're running a command.
It doesn't really quite work and you're seeing a cryptic error message.
Eventually, you'll get the muscle memory and the sort of exposure
to just know, oh, I remember what that means.
But until then, run help50 at the beginning of that same command,
and it's going to try to detect what your error is
and provide TF-like feedback on how to actually work around that.
You'll see two on the course's website is a wonderful handout made
by Emily Hong, one of our own teaching fellows,
that introduces all of these tools, and a few more,
and gets you into the habit of thinking about things.
It's kind of a flow chart.
If I have this problem, then do this or else
if I have this problem do this other thing.
So to check that out as well.
But today, let's introduce really the last, certainly for C,
of our command line tools that's going to help
you chase down problems in your code.
Last week, recall that we had talked about memory a lot.
We talked about malloc, allocating memory,
and we talked about freeing memory and the like.
But it turns out, you can do a lot of damage
when you start playing with memory.
In fact, probably by now, almost everyone-- segmentation fault?
[LAUGHTER]
Yeah, so that's just one of the errors that you might run into,
and frankly, you might have errors in your code now
and hence forth that have bugs but you don't even realize it
because you're just getting lucky.
And the program is just not crashing or it's not freezing,
but this can still happen.
And so Valgrind is a command line program that is probably
looks the scariest of the tools we've used,
but you can also use it with help50, that
just tries to find what are called memory leaks in your program.
Recall that last week we introduced malloc,
and malloc lets you allocate memory.
But if you don't free that memory, by literally calling the free function,
you're going to constantly ask your operating system, MacOS, Linux,
Windows, whatever, can I have more memory?
Can I have more memory?
Can I have more memory?
And if you never, literally, hand it back by calling free your computer
may very well slow down or freeze or crash.
And frankly, if you've ever had that happen on your Mac or PC, very likely
that's what some human accidentally did.
He or she just allocated more and more memory
but never really got around to freeing that memory.
So Valgrind can help you find those mistakes before you or your users do.
So let's do a quick example, let me go CS50 IDE, and let me go ahead
and make one new program here.
We'll call it memory.c because we'll see later today how
I might chase down those memory leaks.
But for now, let's start with something even simpler, which all of you
may be done by now, which is to accidentally touch memory
that you shouldn't, changing it, reading it and let's see what this might mean.
So let me do the familiar at the top here.
Include standard IO.
Well, let's not even do that yet.
Let's just do this first.
Let's do int, main(void), just to start a simple program
and in here let me go ahead and just call a function called f.
I don't really care what its name is for today.
I just want to call a function f, and then that's it.
Now this function f, let me go ahead and define it as follows, void f(void).
It's not going to do much of anything at all.
But let's suppose, just for the sake of discussion, that f's purpose in life
is just to allocate memory for whatever useful purpose,
but for now it's just for demonstration's sake.
So what's the function with which you can allocate memory?
AUDIENCE: Malloc.
DAVID J. MALAN: Malloc.
So suppose I want malloc space for, I don't know,
something simple like just one integer.
We're just doing this for demonstration purposes,
or actually let's do more, 10 integers, 10 integers.
I could, of course, do-- well, give me 10, but how many bytes do what I want?
How many bytes do I need for 10 integers?
AUDIENCE: sizeof(int).
DAVID J. MALAN: Yeah, so I can do literally sizeof(int)
and most likely the size of an int is going to be?
AUDIENCE: Four.
DAVID J. MALAN: Four, probably.
On many systems today, it's just 4 bytes or 32 bits,
but you don't want to hard code that lest someone else's computer not use
those same values.
So the size of an int.
So 10 times the size of an int.
Malloc returns what type of data?
What does that hand me back?
AUDIENCE: [INAUDIBLE]
DAVID J. MALAN: Yeah, returns an address or a pointer.
Specifically, the address, 100, 900, whatever, of the chunk of memory
it just allocated for you.
So if I want to keep that around, I need to declare a pointer.
Let's just call it x for today that stores that address.
Could call it x, y, z, whatever, but it's not an int that it's returning.
It's the address of an int.
And remember, that's what the star operator now means.
The address of some data type.
It's just a number.
All right, so now if I were to--
first, let's clean this up.
Turns out that you use malloc, I need to use stdlib.h.
We saw that last week, albeit briefly, and then of course
if I'm going to call f, what do I have to do to fix this code?
AUDIENCE: You need to declare.
DAVID J. MALAN: Yeah, I need to declare it up here,
or I could just move f's implementation up top.
So I think this works, even though this program at the moment
is completely stupid.
It doesn't do anything useful, but it will allocate memory.
And I'll do something with it as follows.
If I want to change the first value in this chunk of memory,
well how might I do that?
Well, I've asked the computer for 10 integers or rather space
for 10 integers.
What's interesting about malloc is that when
it returns a chunk of memory for you it's contiguous, back-to-back.
And when you hear contiguous or back-to-back,
what kind of data structure does that recall to mind?
AUDIENCE: An array.
DAVID J. MALAN: An array.
So it turns out we can treat this just random chunk of memory
like it's an array.
So if we want to go to the first location in that array of memory,
I can just do this and put in the number say 50.
Or if I want to go to the next location, I can do this.
Or if I want to do the next location, I can do this.
Or if I want to go to the last location, I might do this,
but is that good or bad?
AUDIENCE: Bad.
DAVID J. MALAN: Why bad?
AUDIENCE: It's-- it's out of bounds
DAVID J. MALAN: Yeah, so it's out of bounds.
Right?
This is sort of week one style mistakes when it came to loops.
Recall, with for loops or while loops, you might go a little too far,
and that's fine.
But now we actually will see we have a tool that
can help us notice these things.
So hopefully, just visually, it's apparent that what I have going on here
is just-- on line 12, I have a variable x
that storing the address of that chunk of memory.
And then on line 13, I'm just trying to access location 10
and set the value 50 there.
But as you note, there is no location 10.
There's location 0, 1, 2, 3, all the way through 9, of course.
So how might we detect this with a program?
Well, let me go ahead and increase my terminal window just a bit
here, save my file, and let me go ahead and compile make memory.
OK, all is well.
It compiled without any error messages, and now
let me go ahead and run memory, enter.
All right, so that worked pretty well.
Let's actually be a little more explicit here just for good measure.
Let me go ahead and print something out.
So printf, %i for an integer, and let's make it just more explicit.
You inputted %i and then comma x bracket 10.
And what do I have to include you use printf?
AUDIENCE: stdio.h.
DAVID J. MALAN: Yeah, so stdio.
So let's just quickly add that, stdio.h, save.
All right, let me recompile this, make memory, enter.
And now let me go ahead and do ./memory.
Huh?
Feels like it's a correct program.
And yet, for a couple of weeks now we've been claiming that mm-hmm,
don't do that.
Don't go beyond the boundaries of your array.
So how do we reconcile this?
Feels like buggy code or at least we've told you it's buggy code,
and yet it's working.

Yeah?
AUDIENCE: [INAUDIBLE]
DAVID J. MALAN: That's a good way of putting it.
AUDIENCE: It's still very similar.
We want that.
DAVID J. MALAN: OK.
AUDIENCE: So we can theoretically--
it just created a program.
DAVID J. MALAN: Yeah, and I think if I heard you correctly,
you said C doesn't scream if you go too far?
AUDIENCE: Yeah.
DAVID J. MALAN: Yeah, OK.
So that's a good way of putting it.
Like, you can get lucky in C. And you can
do something that is objectively, pedagogically, like technically wrong,
but the computer's not going to crash.
It's not going to freeze because you just get lucky.
Because often, for performance reasons, when
you allocate space for 10 integers, you're
actually going to get a chunk of memory back
that's a little bigger than you need.
It's just not safe to assume that it's bigger than you need,
but you might just get lucky.
And you might end up having more memory that you can technically get away
with touching or accessing or changing, and the computer's not going to notice.
But that's not safe because on someone else's Mac or PC,
their computer might just be operating a little bit differently than yours,
and bam, that bug is going to bite them and not you.
And those are the hardest, most annoying bugs to chase down as some of you
might have experienced.
Right?
It works on your computer but not a friends or vise versa.
These are the kinds of explanations for that.
So Valgrind can help us track down even these most subtle errors.
The program seems to be working.
Check50 or tools like it might even assume
that it's working because it is printing the right thing,
but let's take a look at what this program Valgrind thinks.
Let me increase the size of the terminal window here,
and go ahead and type in Valgrind ./memory.
So same program name ./memory but I'm prefixing it with the name Valgrind.
All right?
Unfortunately, Valgrind is really quite ugly,
and it prints out a whole bunch of stuff here.
So let's take a look.
At the very top, you'll see all these numbers on the left,
and that's just an unfortunate aesthetic.
But we do see some useful information.
Invalid read of size 4 and then it has these cryptic
looking letters and numbers.
What are those?
They're just addresses and hexadecimal.
It doesn't really matter what they are, but Valgrind
can tell us where the memory is that's acting up suspiciously.
You can then see next to that, that Valgrind is pointing
to function f on memory. c 15th line.
So that's perhaps helpful, and then main on line 8
because that's the function that was called.
So Valgrind is actually kind of nice in that it's showing us all the functions
that you called from bottom up, much like the stack from last week.
And so something's going wrong line 15, and if we go back to that,
let's see line 15 was--
well, sure enough.
I'm actually trying to access that memory location
and frankly I did it on line 14 as well.
So hopefully fixing one or both of those will address this issue.
And notice here, this frankly just gets overwhelming pretty quickly.
And then, oh, 40 bytes in one block are definitely lost in lost record.
I mean, this is the problem with Valgrind, honestly.
It was written some years ago, not particularly user friendly,
but that's fine we have a tool to address this.
Let me go ahead and rerun Valgrind with help50,
enter, and see if we can't just assist with this.
All right, so still the same amount of black and white input but down here now
help50 is noticing, oh, I can help you with an invalid write of size 4.
So it's still at the same location, but this time--
or rather same file, memory.c but line 14.
And we propose, looks like you're trying to modify 4 bytes of memory that
isn't yours, question mark.
Did you try to store something beyond the bounds of an array?
Take a closer look at line 14 of memory.c.
So hopefully, even though Valgrind's output is crazy esoteric,
at least that yellow output will point you toward, ah, line 14.
I'm indeed touching 4 bytes, an integer, that shouldn't be.
And so let's go ahead and fix this.
If I go into my program, and I don't do this.
Let's change it to location 9, and location 9 here and save.
Then let me go ahead and rerun Valgrind without help50.
All right, progress except--
oops.
Nope, no progress.
I skipped the step.
Yeah, I didn't recompile it.
A little puzzled why I saw the same thing.
So now let's rerun Valgrind and here it seems to be better.
So I don't see that same error message up
at the very top like we did before, but notice here, 40 bytes in one blocks.
OK, that was bad grammar in the program, but are definitely
lost in loss record 1 of 1.
So I still don't quite understand that.
No big deal.
Let's go ahead and run help50 and see what the second of two errors
apparently is here.
So here it's highlighting those lines.
40 bytes and one blocks are definitely lost, and looks like your program
leaked 40 bytes of memory.
Did you forget the free memory that you allocated with malloc?
Take a closer look at line 13 of memory.c.
So in this case line 13 indeed has a call to malloc.
So what's the fix for this problem?
AUDIENCE: Free.
DAVID J. MALAN: Per help50 or your own intuition?
What do I have to add to this program?
AUDIENCE: Free.
AUDIENCE: Free.
Yeah, free, and where does that go?

Right here.
So we can free the memory.
Why would this be bad?
AUDIENCE: [INAUDIBLE]
DAVID J. MALAN: Exactly.
We're freeing the memory, which is like saying to the operating system,
I don't need this anymore.
And yet, two lines later we're using it again and again.
So bad.
We didn't do that mistake last week, but you should only
be freeing memory when, literally, you're
ready to free it up and give it back, which should probably
be at the end of the program.
So let me go ahead and re-save this, Open, up my terminal window,
recompile it this time, and now, let me run Valgrind one last time
without help50.
And still a little verbose, but zero errors, from zero contexts.
That sounds pretty good.
And moreover, it also explicitly says, all heap blocks were freed.
And recall that the heap, is that chunk of memory
that we drew visually up here, which is where malloc takes memory from.
So, done.
So this is kind of the mentality with which
to have when approaching the correctness of your code.
Like, it's one thing to run sample inputs, or run the program like I did.
All looked well.
It's one thing to run tools like check50, which we humans wrote.
But we too are fallible, certainly, and we might not think of anything.
And thankfully, smart humans have made tools, that at first glance,
might be a little hard to use.
Like debug 50, as is Valgrind now.
But they ultimately help you get your code 100% correct
without you having to struggle visually over just staring at the screen.
And we see this a lot in office hours, honestly.
A lot of students, to their credit, sort of reasoning through, staring
at the screen, just trying to understand what's going wrong,
but they're not taking any additional input other than the characters
on the screen.
You have so many tools that can feed you more and more hints along the way.
So do acquire those instincts.
Any questions on this?
Yeah?
AUDIENCE: Sir, if you had a main function that took arguments.
Would you run Valgrind with those arguments as well?
DAVID J. MALAN: Yes, indeed.
So Valgrind works just like debug 50, just like help50.
If you have command line arguments, just run them as usual,
but prefix your command with Valgrind, or maybe even help50 Valgrind,
to help one with the other.
Good question.
Other thoughts?
Yeah?
AUDIENCE: Where does the data go [INAUDIBLE]??

DAVID J. MALAN: Good question.


So at the end of the day, think about what's
inside the computer, which is just something like this.
So physically, it's obviously still there.
It's just being treated by the operating system--
Mac, OS, Windows, Linux, whatever, as like a pool of memory.
We keep drawing it as a grid that looks a little something like this.
So the operating systems job is to just keep track of which of those squares
is in use, thanks to malloc.
And which has been freed.
And so you can think of it as having little check
marks next to them saying, this is in use, this is in use,
these others are not in use.
So they just go back on the so-called free list into that pool of memory.
Good question.
If you take a higher level course on operating systems in fact,
or CS61 or 161 at Harvard, you'll actually build these kinds of things
yourself.
And implement tools like, malloc, yourself.
Yeah?
AUDIENCE: So why did we have to allocate memory in this case, and what happens
[INAUDIBLE]?
DAVID J. MALAN: Good question.
Why did we have to allocate memory in this case?
We did not.
This was purely, as mentioned, for demonstration purposes.
If we had some program in which we wanted
to allocate some amount of memory, then this is how we might do it.
However, a cleaner way to do all of this,
would have been to say, hey, computer, give me 10 integers like this,
and not have to worry about memory management.
And that's where we began in week one, just using arrays on the stack,
so to speak.
Not using malloc at all.
So the point is only, that once you start using malloc, and free,
and memory more generally, you take on more responsibilities
than we did in week one.
Good question.
And the others?
All right.
So, turns out, there's one more tool, in all seriousness.
This is the thing.
[? DDB50. ?] So debug 50 is an allusion to a very popular tool called, GDB 50,
[? Gnu ?] debugger.
It's an older tool that you won't use at the command line,
but it's what makes debug 50 work.
Turns out, there's a thing.
And there's an actual Wikipedia article that you
might have clicked on in my email last night, called rubber duck debugging.
And frankly, you don't have to go as all out, as excessive, as we did here,
but the purpose of this technique, of rubber duck debugging,
is to keep, literally, like a rubber duck on your shelf, or on your desk.
And when you have a bug and you don't have the luxury of a teaching fellow,
or a roommate who took CS50, or a more technical friend who can help walk you
through your code, literally, start walking through your code
verbally, talking to the duck saying, well, online 2, I'm declaring main,
and on line 3, I'm allocating space for an array.
And then, on line 4, I'm calling-- ah!
That's what I'm doing wrong.
So if any of you have ever had that kind of moment, whether in office hours,
or alone, where you're either talking in your head,
or you're talking through your code to someone else.
And here, she doesn't even have to respond.
You just hear yourself saying the wrong thing, or having that aha moment.
You can approximate that by just keeping one of these little guys on your desk,
and have that conversation.
And it's actually not as crazy sounding as it actually is.
It's that process of just talking through your code logically,
step by step, in a way that you can't necessarily do in your own mind.
At least I can't.
When you hear yourself say something wrong,
or that didn't quite follow logically, bam, you
can actually have that aha moment.
So on the way out today, by all means, take any one of these ducks.
That took quite a long, time for [? Colten ?] to lay out today.
And we'll have more at office hours in the weeks to come, if you would like.
So some of you might recall such a duck from [? Currier ?] House
last year too, which was a cousin of his as well.
All right.
So that is rubber duck debugging.
Now, last week, recall that we began to take off training wheels.
We'd use for a few weeks, the CS50 library.
And that's kind of in the past now.
That was just a technique, a tool, via which
we could get user input a little more pleasantly, than if we actually
started dealing with memory early on.
And we revealed last week that a "string", quote, unquote,
is just what, underneath the hood in C?

Say again.
An array of characters.
And even more specifically, it's a synonym S-T-R-I-N-G for what actual
data type?
char star, as we've called it.
So a char star is just the computer scientists
way of describing a pointer to a character,
or rather the address of a character, which
is functionally equivalent to saying an array of memory, or sequence of memory.
But it's kind of the more precise, more technical way of describing it.
And so now that we know that we have char stars underneath the hood, well,
where is all of that coming from?
Well, indeed, it maps directly to that memory.
We keep pointing out that something like this is inside of your computer.
And we can think of the memory as just being chunks of memory,
all of whose bytes are numbered.
0 on up to 2 gigabytes, or 2 billion, whatever the value might be.
But of course last week, we pointed out that you think about this memory
not as being hardware per se, but as just being this pool of memory that's
divided into different regions.
The very top of your computer's memory, so to speak,
is what we call the text segment.
And what goes in the text segment of your computer's memory
when you're running a program?
Text is like, poor choice of words, frankly, but what is it?
Say again.
AUDIENCE: File Headers?
DAVID J. MALAN: Not the file headers, in this case.
This is in the context of running a program, not necessarily saving a file.
Yeah?
AUDIENCE: String literals.
DAVID J. MALAN: Not string literals here,
but they're nearby, actually, in memory.
AUDIENCE: Functions.
DAVID J. MALAN: Functions, closer.
Yeah.
The text segment of your computer's memory
is where, when you double click a program to run it,
or in Linux, when you do dot flash something, to run it.
That's where the zeros and ones of your actual program, the machine code,
that we talked about in week zero, is just loaded into RAM.
So recall from last week, that, you know, anything physical in this world--
hard drives, solid state drives, is slow.
So those devices are slow, but RAM, the stuff we keep pulling up on the screen,
is relatively fast.
If only because it has no moving parts.
It's purely electronic.
So when you double click a program on your Mac or PC,
or do dot slash something in Linux, that is
loading from a slow device, your hard drive,
where the data is stored long term, into RAM or memory,
where it can run much more quickly and pleasurably in terms of performance.
And so, what does this actually mean for us?
Well, it's got to go somewhere.
We just decided, humans, years ago that it's
going to go at the top, so to speak, of this chunk of memory.
Below that though, are the more dynamic regions of memory--
the stack and the heap.
And we said this a moment ago, and last week as well, what goes on the heap?
Or who uses the heap?
AUDIENCE: Dynamic memory.
DAVID J. MALAN: Dynamic memory.
Any time you call malloc, you're asking the operating system
for memory from the so-called heap.
Anytime you call free, you're sort of conceptually putting it back.
Like, it's not actually going anywhere.
You're just marking it as available for other functions and variables to use.
The stack, meanwhile, is used for what?
AUDIENCE: Local variables.
DAVID J. MALAN: Local variables and any of your functions.
So main, typically takes a sliver of memory at the bottom.
If main calls another function, it gets a sliver of memory above that.
If that function calls one, it gets a sliver of memory above that.
So they each have their own different regions of memory.
But of course, these arrows, both pointing at each other,
doesn't seem like such a good design.
But the reality, is bad things can happen.
You can allocate so much memory that, bam, the stack overflows the heap.
Or the heap overflows the stack.
Thus was born websites like Stack Overflow, and the like.
But that's just a reality.
If you have a finite amount of memory, at some point,
something's going to break.
Or the computer's going to have to say, mm-mm, no more memory.
You're going to have to quit some programs, or close some files,
or whatnot.
So that was only to say that that's how the memory is laid out.
And we started to explore this by way of a few programs.
This one here-- it's a little dark here.
This one here, was a swap function.
Now it's even darker.
It was a swap function that actually did swap two values, A and B.
But it didn't actually work in the way we intended.
What was broken about this swap function last week?

Like, I'm pretty sure it worked.


And when our brave volunteer came up and swapped the orange juice and the milk,
that worked.
So like, the logic was correct, but the program itself did not work.
Why?
AUDIENCE: It changed the values of the copy variables.
DAVID J. MALAN: Exactly.
It changed values in the copies of the variable.
So recall, that when main was the function
we called, and it had two values, x and y, that chunk of memory was here.
That chunk of memory was here.
And it had like the numbers 1 and 2.
But when it called the swap function, that got its own chunk of memory.
So main was at the bottom, swap was above that.
It had its own chunks of memory called, a and b, which
initially, got the values 1 and 2.
1 and 2 were indeed successfully swapped,
but that had no effect on x and y.
So we fixed that.
With the newer version of this program, of course,
it looked a lot more cryptic at first glance, but in English,
could someone just describe what it is that happens
in this example that was more correct?
Like, what does this program do line by line?
Yeah?
AUDIENCE: Instead of passing copies of the variables,
you pass pointers to their addresses.
DAVID J. MALAN: Exactly.
Instead of passing the values of the variables, thereby copying them,
it passes the addresses of those variables.
So that's like saying, I don't technically care where it is in memory,
but I do need to know that it is somewhere in memory.
So instead of passing an x in the number 1,
let's suppose that x is at location 100--
my go to example.
It's actually the number 100 that's going to go there.
And if y is at the location like, 104, well, it's
104 that's going to go there, which are not the values we want to swap,
but those are sort of like little maps, or breadcrumbs if you will,
that lead us to the right location.
So that when we execute this code, what we're ultimately
swapping in those three lines, is this and this, and all along the way,
recall, we're using a temporary variable there
that can be just thrown away after.
So that's what pointers allowed us to do.
And that's what allowed us to actually change values on the so-called stack,
even by calling on other function.
All right.
Any questions then, on where we left off last time with the stack and with swap?
No?
All right.
So recall we introduced Binky as well, who lost his head at one point,
but why?
What went horribly, horribly awry with this scene from last week's film
from Stanford?

Binky was doing everything correctly, right?


Like, moving values.
42 was successful.
And then, yeah?
AUDIENCE: He tried to dereference something that
wasn't pointing to any actual address.
DAVID J. MALAN: Exactly.
He tried to dereference a pointer, an address, that wasn't actually pointing
to a valid address.
Recall that this was the line in code in question that was unlucky and bad.
Star y, means, go to the address in y, and do something to it.
Set it equal to the number 13.
But the problem was, that in the code we looked at last week,
all we did at the start was say, hey, computer give me a pointer to an int,
and call it x.
Do the same, and call it y.
Allocate space and point x at it.
But we never did the same for y.
So whereas x contained, last week, the address of an actual chunk of memory,
thanks to malloc, what did y contain at that point in the story?
The yellow line there.

What did y contain?


What value?

AUDIENCE: Null.
DAVID J. MALAN: Null.
Maybe.
Maybe.
But it's not obvious because there's no mention of null in the program.
We might get lucky.
Null is just 0.
And sometimes we've seen that 0 are the default values in a program.
So maybe.
But I say, maybe, and I'm hedging why.
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Yeah.
And it doesn't allocate-- well, allocate, is not quite the right word.
That suggests you are allocating actual memory.
It's a garbage value.
There's something there.
Right?
My Mac has been running for a few hours.
And your Macs, and PCs, and phones, are probably running all day long.
Or certainly when the lid is up.
And so, your memory is getting used, and unused, and used.
Like, lots of stuff is going on.
So your computer is not filled with all zeros or all ones.
If you look at it at some random point in the day,
it's filled with like bunches and bunches of zeros and ones
from previous programs that you quit long ago.
Windows you have in the background and the like.
So, the short of it is, when you're running
a program for the first time, that's been running now for some time,
it's going to get messy.
That big rectangle of memory is going to have some ones over here
some zeros over here and vise versa.
So they're garbage values, because those bytes have some values in them.
You just don't necessarily know what they are.
So the point is, you should never ever dereference a pointer
that you have not set yourself.
Maybe you will crash.
Maybe it won't crash.
Valgrind can help you find these things but sometimes.
But it's just not a safe operation.
And lastly, the last thing we introduced last week,
which will be the stepping stone for what problems we'll solve this week,
was struct.
So struck is kind of cool, in that you can design your own custom data
structures.
C is pretty limited out of the box, so to speak.
You only have chars and boules, and floats, and ints, and doubles,
and longs, and str--
well, we don't even have strings, per se.
So it doesn't really come with many features, like a lot of languages do.
Like Python, which we'll see in a few weeks.
So with struct in C, you have the ability
to solve some problems of your own.
For instance, with the struct, we can actually
start to implement our own features.
Or our own data types.
For instance, let me go up here.
And let me go ahead and create a file called say,
student, or rather destruct dot h.
So recall that dot h is a header file.
Thus far, you have used header files that other people made.
Like, CS50 dot h, and standard IO dot h, and standard [? lid ?] dot h,
but you can make your own.
Header files are just files that typically contain code that you
want to share across multiple programs.
And we'll see more of this in time.
So let me go ahead and just save this file.
And suppose that I want to represent a student in memory.
A student of course, is probably going to have what?
For instance, how about a string for their name,
a string for their dorm-- but string is kind of two weeks ago.
Lets call this char star.
And lets call name, char star.
And so you might want to associate like, multiple pieces of data with students.
Right?
And you don't want to have multiple variables, per se.
It would be nice to kind of encapsulate these together.
And recall at the very end of last week, we
saw this feature where you can define your own type,
with typedef, that is a structure itself.
And you can give it a name.
So in short, simply by executing this these lines of code,
you have just created your own custom data type.
It's now called student.
And every student in the world shall have, per this code, a name
and a dorm associated with them.
Now, why is this useful?
Well the program, we looked at the very end of last time looked
a little something like this.
Instruct zero dot c, we had the following,
I first allocated some amount of space for student.
I asked the user what's the enrollment in the class or whatnot?
That gives us an int.
And then, we allocated an array of type student, called students, plural.
This was an alternative, recall, to doing something
like this, string names enrollment, and string dorms enrollment.
Which would work.
You could have two separate arrays, and you'd just
have to remember that name zero and dorm zero is the same human.
But why do that if you can keep things together.
So with structs, we were able to do this.
Give me this many student structures, and call the whole array, students.
And the only new syntax we introduce to satisfy this goal, was what operator?
AUDIENCE: The dot.
DAVID J. MALAN: The dot.
Yeah.
So in the past, recall from like week two, we introduced arrays.
And arrays allow you to do square bracket notation.
So that is no different from a couple of weeks back.
But if your array is not storing just integers, or chars, or floats,
or whatever, it's actually storing a structure, like a student,
you can get at that student's name by literally just saying dot name.
And you can get at their dorm by doing dot dorm.
And then everything else is the same.
This is what's called, encapsulation.
And it's kind of like a fundamental principle of programming
where, if you have some real world entity, like a student,
and you want to represent students with code, yeah,
you can have a bunch of arrays that all have called names, dorms, emails, phone
numbers, but that just gets messy.
You can instead encapsulate all of that related Information about a student
into one data structure so that now you have, per week zero, an abstraction.
Like, a student is an abstraction.
And if we break that abstraction, what is a student actually?
Not in the real world, but in our code world here?
Student is an abstraction.
It's a useful word, all of us can kind of agree means something,
but technically, what does it apparently mean?
A student is actually a name in a dorm, which really kind of is
diminutive to everyone in this room, but we've distilled it in code
to just those two values.
So there we have encapsulation.
You're kind of encapsulating together multiple values.
And you're abstracting away just have a more useful term,
because no one is going to want to talk in terms of lines of code
to describe anything.
So, same topic as in the past.
So, now we have the ability to come up with our own custom data structures
it seems.
That we can store anything inside of them that we want.
So let's now see how poorly we've been designing
some things for the past few weeks.
So it turns out that much of the code, hopefully
we've been writing in recent weeks has been correct,
but we've been not necessarily designing solutions in the best way.
Recall that when we have this chunk of memory,
we've typically treated it as at most, an array.
So just a contiguous chunk of memory.
And thanks to this very simple mental model, do we get strings,
do we get arrays of students now.
But arrays aren't necessarily the best data structure in the world.
Like, what is a downside of an array if you've encountered ones thus far.

In C, what's a downside of an array?


Yeah?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Can or cannot?
AUDIENCE: Cannot.
DAVID J. MALAN: You cannot.
That is true.
So in C, you cannot mix data types inside of an array.
They must all be ints, they must all be chars, they must all be students.
It's a bit of a white lie because technically, you
can have something called a void star, and you can actually map-- but yes.
That is true though, strictly speaking-- cannot mix data types.
Though frankly, even though other languages let you do that,
it's not necessarily the best design decision.
But sure, a limitation.
Other thoughts.
Yeah?
AUDIENCE: The size cannot change.
DAVID J. MALAN: The size cannot change.
Let's focus on that one.
Because that's sort of even more constraining it would seem.
So if you want an array for, say, two values, what do you do?
Well, you can do something like int, x, bracket, 2, semi-colon.
And what does that actually give you inside of your computer's memory?
It gives you some chunk that we'll draw a rectangle.
This is location 0.
This is location 1.
Suppose that, oh, a few minutes later, you change your mind.
Oh, darn, I just took a--
I want to type in a third value, or I want
to add another student to the array.
Where do you put that?
Well, you don't.
If you want to add a third value to an array of size 2,
what's your only option in C?
AUDIENCE: You make a new array.
DAVID J. MALAN: You make a new array.
So literally.
And if this array had the number like 42,
and this had the number 13, the only way to add a third number is to allocate
a second array, copy the values into the same locations, 42, 13, and then,
we'll add another value, 50.
And then, so that you're not using up twice as much space
almost permanently, now you can sort of free somehow,
or stop using that chunk of memory.
So that's fine.
It's correct what we just did.
But what's the running time of that process?

Recall a couple of weeks ago, we started talking about efficiency and design.
What's the running time of resizing an array.
AUDIENCE: Too long.
DAVID J. MALAN: Say Again.
AUDIENCE: I said, too long.
DAVID J. MALAN: Too long.
Fair.
But let's be more precise.
Big o of-- big o of what?
AUDIENCE: N.
DAVID J. MALAN: N. What's n?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: OK.
True.
But what does n represent?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Yeah.
So you don't actually have to not know.
It's just a general answer.
In this case, however long the array is, call it n.
It is that many steps to resize it into that plus 1.
Technically it's big o, over n, plus 1.
But recall in our discussion, "The big o notation," we just
ignore the smaller terms-- the plus 1s, the divided by 2s, the plus n.
We focus only on the most powerful term in the expression, which
is just n here.
So yes, if you have an array of size 2, and you resize it
into an array of size 3, or really, n plus 1, that's
going to take me roughly n steps.
Technically n plus 1 steps.
But n steps.
Ergo big o of n.
So it's a linear process.
So possible but not necessarily the fastest
thing because he literally had to move all those damn values around.
So what would be better than this?
And if you've programed before, you might have the right instincts already.
How do we solve this problem?

Yeah?
AUDIENCE: Would you allocate more memory at the end of the array?
DAVID J. MALAN: Reallocate more memory at the end of the array.
So it turns out c does have a function called, realloc.
Perfectly, if not obviously, named that reallocates memory.
And if you pass it, the address of a chunk of memory you've allocated,
and the operating system notices, oh, yeah you got lucky.
I've got more memory at the end of this array,
it will then allocate that additional RAM for you, and let you use it.
Or worst case, if there's nothing available at the end
of the array in memory, because it's being
used by something else in your program.
That's fine.
Realloc will take on the responsibility of creating another array somewhere
in memory, copying all of that data for you into it,
and returning the address of that new chunk of memory.
Unfortunately, that's still linear.
Yeah?
AUDIENCE: Is this all being done in the heap?
Or--
DAVID J. MALAN: This is all being done in the heap.
Malloc, and realloc, and free, all operate on the heap.
Yes.
So that is a solution, but it doesn't really speak to the efficiency.
Yeah?
AUDIENCE: Could you use linked list?
DAVID J. MALAN: Yeah.
What is a linked list?
Go ahead.
AUDIENCE: It's when you have an element that points to different elements.
DAVID J. MALAN: OK.
Points to other elements.
Yeah.
So let me speak to what's the fundamental issue here.
The fundamental problem is much like painting yourself into a corner,
so to speak, as the cliche goes.
With an array, you're deciding in advance how big the data structure is
and committing to it.
Well, what if you just do the opposite.
Don't do that.
If you want initially, room for just one value, say one integer,
only ask the computer for that.
Give me space for one integer and I'll put my number 42 in here.
And then, if and only if, you want a second integer,
do you ask the computer for a second integer.
And so the computer, as by a malloc, or whatnot, will give you another one
like, the number 13.
And if you want a third, just ask the same question of the operating system.
Each time just getting back one chunk of memory.
But there's a fundamental gotcha here.
There's always a trade off.
So yes, this is possible.
You can call malloc three times.
Each time asking for a chunk of memory of size 1, instead of size 3,
for instance.
But what's the price you pay?
Or what problem do we still need to solve?
Yeah?
AUDIENCE: They're not stored next to each other.
DAVID J. MALAN: Yeah.
They're not being stored next to each other.
So even though I can think of this as being the first element, the second,
and the third, you do not have, in this story, random access to elements.
And random access, ergo, random access memory, or RAM,
just means that arithmetically, like, mathematically, you
can jump to location 0, location 1, location 2, randomly, or in constant
time.
Just instantly.
Because if they're all back to back to back, all you have to do is like,
add 1, or add 4, or whatever to the address, and you're there.
But the problem is, if you're calling malloc again and again
and again, there's no guarantee that these things are even
going to be proximal to one another.
These second chunks of memory might end up--
if this is a big chunk of memory we've been talking about,
where the heaps up here, and the stacks down here--
42 might end up over here.
The next chunk of memory, 50, might end up over here.
The third chunk might end up over here.
So you can't just jump from location 0, to 1, to 2,
because you have to somehow remember where location 0, and 1, and 2, are.
So how do we solve this?
Even if you haven't programed before, like, what would a solution be here?

AUDIENCE: Somehow store [INAUDIBLE].


DAVID J. MALAN: OK.
Somehow storing the addresses of--
AUDIENCE: Of the [INAUDIBLE]
DAVID J. MALAN: All right.
So let's just suppose, for the sake of discussion, that this chunk of memory
ended up at location 100.
This one ended up at like 150.
This one ended up at like 475.
Whatever those values are.
It would seem that somehow or other I need to remember three values--
100, 150, and 475.
So where can I store that?
Well, it turns out, I can be a little clever but a little greedy.
I could say to malloc, you know what, every time I call you, don't just
give me space for an integer, give me space for an integer
plus the address of another integer.
So if you've ever kind of seen like popcorn strung together on a string,
or any kind of chain link fence where one link is linking to another.
We could create the equivalent of-- oops not that.
We could create the equivalent of this kind of picture,
where each of these squares, or nodes, we'll start calling them, kind of links
graphically to the other.
Well, we've seen these links, or these pointers,
literally arrows that are pointing implemented in code.
An arrow or a pointer is just an address.
So you know what?
We should just ask malloc not for enough space for just the number 42,
we should instead, ask for a little more memory in each of these squares,
making them pictorially rectangles now.
So that now, yes, we do have these arrows conceptually
pointing from one location to another.
But what values do I actually want to put in these new additional boxes?
AUDIENCE: The addresses of the next.
DAVID J. MALAN: The addresses of the next.
So they're like little breadcrumbs.
So in this box here, associated with the first value,
should be the address of my second value, 475.
Associated with my second value here, per the arrow--
and let me draw the arrow from the right place.
--from the arrow, should be the address 150, because that's the last.
And then, from this extra box, what should I put there?
Yeah?
AUDIENCE: Slash 0 or something?
DAVID J. MALAN: Yeah.
So probably, the equivalent of slash 0, which in the world of pointer's recall,
is null.
So just a special value that means that's it, this is the end of the line.
That still leaves us with room to add a fourth value and point to it,
but it for now, signifies very clearly to us there's nothing actually there.
So what did we just do?
We created a list of values 50, oh sorry 42, 50, 13,
but we linked to them together.
First, pictorially, with just arrows.
Like any human might with a piece of chalk.
But technically in code, we could do this
by just storing addresses in each of these places.
So just to be clear then, what might this actually translate to in code?
Well, what if I proposed this.
In code, we might do something like this.
If we want to store an integer.
We're of course, going to need to store like int n, we'll call it.
n will represent 42, or 50, or 13.
But if we want to create a data structure,
we might want to start giving this data structure a name.
I called it, a moment ago, node, which is a CS term for a node in a linked
list, so to speak.
And it looks like this.
So typedef means, give me my own type.
Struct means, make it a structure, like a student was.
And then, node, which is going to be the name of this thing.
And I'll explain in a moment why I have the word node twice this time.
But I left room on the board for just one more line.
In addition to an int, called n, or whatever,
I need to somehow represent in code, the additional memory
that I want malloc to give me for the address.
So first of all, these are addresses of what data types?
Each of those three new boxes.
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: They are the addresses of integers in that point in the story.
But technically, what is this box really pointing to?
Is it pointing specifically to the ints?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: It's pointing to that whole chunk of memory, if you will.
So if you start thinking about each of these rectangles as being a node,
and each of the arrows as pointing to another node,
we need to somehow express, I need to somehow store a pointer to a node.
In other words, each of these arrows needs to point to another node.
And in code, we could say this.
Right?
Like, let's give it a name.
Instead of n, which is the number, let's call it next.
So next, shall be the name of this field that points to the next node in memory.
And node star, what does that mean in English, if you will?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Say again?
AUDIENCE: Pointing to an address.
DAVID J. MALAN: Pointing to an address.
Right?
It looks different.
Node is a new word today and that's fine.
But node star, just means a pointer to a node.
The address of a node.
And it turns out that this is a custom structure
so we actually have to say this.
But it's the same principle even though things are kind of escalating quickly
here, we just need to values, an int, and then, a pointer to another thing.
That other thing is going to be another node.
And we're just using a node, frankly, to encapsulate two values--
an int and a pointer.
And the way you express in C, albeit somewhat cryptically,
a pointer, or one of those arrows, is you say give me a variable called next,
have it point to a structure called node.
Or rather, have it be the address of a structure of type node.
Yeah?
AUDIENCE: How can you [? reveal ?] the timing of struct node [INAUDIBLE]??

DAVID J. MALAN: Good question.


So this feels like a circular kind of definition because I'm defining a node,
and yet, inside of a node is a node.
That is OK because of the star.
It is necessary in C--
remember that C always is kind of read top to bottom.
So accordingly, this very first line of code here, typedef struct note,
at that point in the story, when clang has read that line,
it knows that a phrase, struct node, exists.
AUDIENCE: That's why you say nodes [INAUDIBLE]..
DAVID J. MALAN: Exactly.
Exactly.
We didn't need to do this with students because there were
no pointers involved to other students.
But yes, in this case.
So in short, this tells clang, hey, clang, give me a structure called node.
And then, in here, we say, hey, clang, each of those nodes
shall have two things, an integer called n,
and a pointer to another one of these data structures of type node,
and call the whole thing, node.
It's a bit of a mouthful.
But all this is, is the following.
Let me go ahead and erase all of this.
All this data type is--
if we get rid of the picture we draw on the fly there.
--is this says, hey, clang, give me a data structure
that pictorially looks like this.
It's divided into two parts.
The first part is called n, the second type is called, next.
This data type is of type int.
This is a pointer to another such node.
And that's it.
Even though the code looks complex, the idea is exactly that.
Yeah?
AUDIENCE: [INAUDIBLE]?
Why do you have to say struct node again?
DAVID J. MALAN: Good question.
The reason is, as just came up a moment ago, clang
and C, in general, are kind of dumb.
They just read code top to bottom.
And the problem is, you have to declare the name of this structure
as being a struct node before you actually use it.
It's similar in spirit to our discussion of prototypes-- y functions need
to be mentioned way up top.
This just says to clang, give me a type called struct node.
You don't know what it's going to look like yet.
But I'll finish my thought later.
And then in here, we're just telling clang, inside of that node
should be an integer, as well as, a pointer to the very type of thing
I'm in the middle of defining.
But if I had left off the word node up there, and just said struct,
you couldn't do that because it hasn't seen the word N-O-D-E yet.
That's all.
Other questions?
All right.
So if I now have a data structure called node,
I can use it to kind of stitch together these linked lists.
And maybe just the very things a little bit,
and to start giving away some ducks, would folks
be comfortable with volunteering to solve a problem here?
Yeah?
OK.
Come on up.
1, 2--
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Sure.
Or you can take a duck and run.
OK.
1, 2, how about 3?
Come on over here, 3.
So if you want to be our first pointer, you can be number 5.
Come on over here.
You want to be number 9.
And one more.
One more volunteer.
Come on over here.
Yeah.
All right.
So-- I'll meet you over here.
OK, 17.
All right.
So if you'd like to--
just so we pick this up for those following along at home.
If you would like to just say hello to the audience.
ANDREA: Hi, I'm Andrea.
[? COMEY: ?] Hi, [? I'm Comey. ?]
[? KYONG: ?] Hi, [? I'm Kyong. ?]
SPEAKER 2: Hi, I'm [INAUDIBLE].
DAVID J. MALAN: Wonderful.
OK.
If you wouldn't mind all just taking a big step back over the ducks,
just so that we're a little farther back.
Let's go ahead and do this.
If you're our first pointer, if you could come over here for instance,
and just stand outside the ducks.
And if you guys could come a little over here in front is still fine.
So here we have the makings of a linked list.
And what's your name again?
[? COMEY: ?] [? Comey. ?]
DAVID J. MALAN: [? Comey ?] is our first pointer if you will.
Via [? Comey's ?] variable are we just going
to keep track of the first element of the linked list.
So if you could, with your left hand, represent first.
Just point over at-- what was your name again?
ANDREA: Andrea.
DAVID J. MALAN: So Andrea is the number 9.
If you could use your left hand to point at number 5.
And if you could use your left hand, yep, to point at number 17.
And your left hand to just point at null, which we'll just call the ground.
So you don't want to just point it randomly
because that would be like following a bogus pointer, so here means null.
All right.
So this is a linked list.
All you need to store are linked list of three values
is three nodes, inside of which are three integers,
and their left hands represents that next pointer, so to speak.
[? Comey's ?] a little different, in that she's not holding a value.
She's not holding an integer.
Rather, holding just the name of the variable, first.
So you're the only one that's different here fundamentally.
So suppose I want to insert the number 20?
Could someone volunteer to be number 20?
OK.
Come on up.
All right.
And what's your name?
ERIC: Eric.
DAVID J. MALAN: Eric.
Eric, you're the number 20.
And Eric, actually, let's see.

Actually can we do this?


Let me give-- let me make this a little more different.
OK.
That never happened.
OK.
Eric, give me that please.
I want to insert Eric as number 5.
So Eric, I'm keeping this list sorted.
So where, obviously, you're going to go?
ERIC: Go right there.
DAVID J. MALAN: All right.
But before you do that, let's just consider what this looks like in code.
In code, presumably, we have malloced Eric from the audience.
I've given him a value, n of number 5.
And his left hand is like, it's garbage value right now, because it's not
pointing to anything specific.
So he's got two values-- an integer, and a left hand representing
the next pointer.
If the goal is to put Eric in sorted order.
What should our steps be?
Like, whose hand should point where, and in what order?
Yeah.
Give us one step.
AUDIENCE: You should point to number 9.
DAVID J. MALAN: OK so you should point at number 9,
which is equivalent to saying, point at whatever first.
Where [? Comey ?] is pointing at.
So go ahead and do that.
All right next?
What's the next step?
Someone else?

Someone else.
Almost there.
Yeah?
AUDIENCE: First should point to 5.
DAVID J. MALAN: OK.
So first, or [? Comey, ?] could you point to 5.
And that's fine.
You don't even have to move.
Right?
This is the beauty of a linked list.
It doesn't matter where you are in memory,
it's the whole beauty of these pointers, where you can literally
point at that other location.
It's not an array where they need to be standing back to back to back.
They can be pointing anywhere.
All right.
Let's go ahead and insert one more.
Who wants to be say, 55?
Big value.
Yeah.
Come on down.
All right.
What's your name?
[? KYONG: ?] [? Kyong. ?]
DAVID J. MALAN: [? Kyong. ?] OK.
So come on over.
So we've just malloced [? Kyong ?] from the audience.
I've given him his end value of 55.
His left hand is just some garbage value right now.
How do we insert [? Kyong ?] in the right order?
Where is the obviously supposed to go?
In sorted order, he obviously belongs at the end.
But here's the catch with the linked list.
Just like when we've discussed searching and sorting in the past,
the computer is pretty blind to all but just one value.
And the linked list, at the moment--
like, I don't know that these three, these four, exist.
All I know really, is that [? Comey ?] exists.
Because via this first pointer, is the only access
to the rest of the elements.
And so what's cool about a linked list, but perhaps not obvious,
is that you only--
the most important value is the first.
Because from the first value, you can get to everyone else.
It's not useful-- excuse me for me to remember, Andrea?
--Andrea alone, because if I do, I've just
lost track of [? Comey ?] and more importantly, because of his number,
Eric.
So all I have to do really, is remember [? Comey. ?]
So if the goal now is to insert number 55, what steps should come first?
No pun intended.
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Say again.
AUDIENCE: Finding the first space.
DAVID J. MALAN: OK.
Finding the first space.
So I'm going to start at [? Comey, ?] and I'm going to follow this pointer.
Number 5, does 55 belong here?
No.
So I'm going to follow this pointer and get to Andrea.
Does 55 belong here?
No.
Gonna follow her pointer, and 22, does it belong here?
No.
I follow this pointer, 26?
No.
But you have a free hand, it turns out.
So what step should come next?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: We could have you point at 55, and now done.
So relatively simple, but what was the running time of this?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: It's big o of n.
It's linear.
Because I had to start at the beginning, even though we
humans have the luxury of just eyeballing it.
Saying, oh, obviously, he belongs way at the end.
Mm-mm.
Not in code.
Like, we have to start at the beginning to reverse the whole darn list,
until we get linearly to the very end.
And now we're done.
Let's try one last one.
How about 20?
Yeah.
Great.
Come on down.
What's your name?
JAMES: James.
DAVID J. MALAN: James.
All right, James.
All right.
So we just malloced James, given him the number 20.
He obviously belongs roughly in the middle.
What's the first step?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Sorry?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: All right.
So we start with [? Comey, ?] again.
All right.
First, OK.
5, do you belong here?
No.
Let me follow the link.
OK 9, do you belong here?
No.
Do you belong at 22-- ooh.
But what did I just do wrong?

I went too far.


At least in this story.
Like, I literally-- Andrea is behind me now.
OK.
So can I follow the pointer backwards?
You can't.
Like in every picture we've drawn, and every example
we've done with an address, we only have the address of the next pointer.
We don't have what's called, a doubly linked list, at least in this story,
where I can just turn around.
So that was a bug.
So I need to start over instead.
First, OK 5, OK 19--
what I really need in code, ultimately, is
to kind of peek ahead and not actually move-- not that far.
Just to 22.
Peek ahead at 22 and realize, oh, that's going to be too far.
This is not yet far enough.
So let's go ahead and bring James over.
Well, actually, you can stay there physically.
But what step has to happen first?
I know now he belongs in here.
You want to point at him?
OK.
Point at him.
ANDREA: Oh.
I'm sorry, he points first.
DAVID J. MALAN: Well let's do that, just because it is incorrect.
That's fine.
OK.
Andrea proposed that we point here, but she just broke the whole linked list.
Why?
ANDREA: Because there's nothing to point at.
DAVID J. MALAN: Right.
No one is remembering-- what's was your name again?
[? KYONG: ?] [? Kyong. ?]
DAVID J. MALAN: No one's remembered where [? Kyong ?] was.
So you can't do that.
Your left hand has to stay there.
So what steps should happen first instead?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: James should point at whatever
Andrea is pointing at, perhaps?
So a little redundantly at the moment, just like before.
OK.
Now what happens next?
That's step one.
ANDREA: Now I can point.
DAVID J. MALAN: Now you can point at him.
OK.
You could do that.
All right.
And so now, this looks like a complete mess,
but if we know that [? Comey ?] is first,
we can follow the breadcrumbs to Eric, and then to Andrea, and then to James,
and then the rest of our list step by step by step.
So it's a huge amount of like logic now.
But what problem have we solved?
And I think we identified it over here earlier.
What was the problem first and foremost with the arrays?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: You have to decide on their size in advance.
And once you do that, if you want to add an additional element,
you have to resize the whole darn thing.
Which is expensive because you have to move everyone around.
Now frankly, I'm being a little greedy here.
And every time we've inserted these new elements,
I've been keeping them in sorted order.
So it would seem that if you insert things in sorted order, big o event,
every time.
Because in the worst case, the new element
might end up all the way at the end.
But what if we relax that constraint?
What if I'm not so uptight and need everything nice and orderly and sorted?
What if I just want to keep growing the list in any random order?
And I allocate the number 34.
And I'll play the number 34.
Malloc 34.
Where is the quickest place for me to go?
Yeah?
AUDIENCE: Point to 5, and then have [INAUDIBLE]..
DAVID J. MALAN: OK.
I'll point to 5, and then, [? Comey, ?] if you could point to me.
Done.
One-- well, two steps.
All right.
Suppose now, I malloc 17 with someone else, who'll we'll
pretend is right here.
Where's the best place for 17 to go?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Right after [? Comey ?] too.
So now, [? Comey ?] can point at 17, 17 can point at me, I can point at Eric,
and so forth.
And that's two steps again.
Two steps-- if it's the same number of steps every time,
we call that, constant time.
And we write it as big o of 1.
And so here too, it's just a trade off.
If you want really fast insertions, don't worry about sorting.
Just put them at the beginning and deal with it later.
If you want a dynamic resizeability, don't use an array, use a linked list,
and just keep allocating more and more as you go without wasting
a huge amount of space too.
Which notice, that's another big problem with an array.
If you over allocate space, and only use part of it, you're just wasting space.
So there's no one solution here.
But we do now have the capabilities, thanks to the structs
and pointers to stitch together, if you will, these new problems.
Yes, please.
SPEAKER 2: Why can't the node [INAUDIBLE]??

DAVID J. MALAN: And who am I in this story?


SPEAKER 2: [INAUDIBLE].
DAVID J. MALAN: Oh, OK.
Absolutely.
So another very reasonable idea would be, well,
why don't we just put the new ones at the end?
That's fine if I keep track of who is at the end.
The problem, is at the moment in the story,
and we'll ultimately see this in code, I'm only remembering [? Comey. ?] And
from [? Comey ?] am I getting everywhere else.
I could have another pointer, a second pointer,
and literally call it, last, that's equivalent to you.
Or that's always pointing at you.
I just need then two pointers, one literally called first,
one literally called last.
That's fine.
That's a nice optimization if I want to throw all the elements at the end.
And frankly, I could get really fancy--
and to solve the problem that Andrea cited earlier--
if I store not just an int and a pointer, but instead,
an int and two pointers, I can even have each
of these guys pointing with their left and right hands
in a doubly linked list, so as to solve the problem Andrea identified, which
was if I go too far no big deal.
Take one step back.
I don't have to think as hard about that logic.
So there too, a trade off.
Let's go ahead and take a five minute break.
I'll turn on some music.
Grab a duck now, if you'd like.
And we'll return with some fancier data structures still.
Thanks.
All right.
We're back.
So let's now translate some of these ideas to code.
So that we can actually solve this problem a little more
concretely than just having humans pointing at each other.
So for instance, let's try to distill everything
we've been talking about into just a goal in code
of storing a list of numbers.
I would propose that we can take like three passes at this problem.
The first would be, let's just decide in advance how many numbers we
want to store so we don't have to deal with all
this complexity with the pointing and the pointers and all this,
and just hard code that value somehow, and just stop
when the user is inputted that many numbers and no more.
Two, we can improve upon that and at least let the user dynamically resize
their array.
So that if they decide to input more numbers than we intend,
it's going to grow, and deal with that.
Of course, arrays are not necessarily ideal
because they have to do all that damn copying from old to new.
That's linear time.
It would seem smartest to get subversion 3, which
is actually going to use a linked list.
So we're just more modestly allocating space for another number,
and another number, and another number, or really a node.
One number at a time.
So let me go ahead and start as follows.
I'm going to go ahead and include some familiar lines in list 0.c,
of the CS50 library, just to make it easy to get some user input for this.
And standard iO dot h, for printdef.
And let me go ahead and declare my main function as usual.
And then, in here let's do a couple of things.
First, let's ask the user for the capacity of the array
that we're going to use.
Or rather, let's do this first.
Let me first rewind and say, you know what?
Int, numbers, 50.
Well, that's going to be annoying to type in 50 numbers.
We're going to give the user two numbers at first, that here, she can type in.
Next, let's go ahead and prompt the user for those numbers.
So let me go ahead and say--
let's do this.
Let's at least clean this up a little bit so that we can reuse this value.
So we don't have a magic number.
This just came up in discussion actually.
So while-- do I want to do that?
Nope.
Let me fix this.
This will be my capacity of size 2.
And that's going to give me that size.
And then, I'm going to keep track of how many integers
I've prompted the user for so far.
So initially, the size of this structure is going to be 0.
But it's capacity, so to speak, is 2.
So size means how many things are in it.
Capacity means how many things can be in it.
And while the size of the structure is less than its capacity,
let's go ahead and get some inputs from the user.
Let's go ahead and ask them for a number, using our old friend, get int.
And just say, give me a number.
And then, let me go ahead and insert the number
that they type in into this array at location size, like this.
And then, do size plus, plus.
I think.
You know, I wrote it pretty quickly.
But let's consider what I just did.
I initialized size to 0, because there's nothing in it initially.
Then I say, while size is less than the capacity of the whole thing--
and capacity is 2 by default--
go ahead and do the following.
Give me an int from the user.
OK.
So int number gets int.
Then, put at location, size, in my numbers, array,
whatever the human typed in, number.
And then, increment size with plus, plus.
All right.
So on the first iteration size is 0.
So numbers, bracket, 0, gets the first number.
Numbers, bracket, 1, gets the second number.
Then, size equals capacity.
So it stops, logically.
Any questions on the logic of this code?
All right.
So once we have those numbers, let's just do something simple.
Like for int, I gets 0.
I is less than the actual size I, plus, plus.
Let's just go ahead and print out the number
you inputted, percent I, backslash n, and type out numbers, bracket, I. All
right.
So if I made no typos in list 0 dot C, then, I'm going to go ahead
and do dot, slash, o, dot, C. I'm going to be prompted for a couple of numbers.
Let's go ahead and do 1, 2.
You inputted 1, you inputted 2.
All right.
So not bad.
But this is bad design, arguably, why?
Just find one fault. It's correct.
But bad design.
AUDIENCE: Repetitive.
DAVID J. MALAN: Repetitive, because I'm using a couple of loops, sure.
And it's fundamentally-- it's very limited in functionality.
Why?
Like how useful is this program?
AUDIENCE: It's hard coded at 2.
DAVID J. MALAN: Yeah.
It's hard coded at 2.
So let's at least improve upon this a little bit,
and get rid of this hard coding.
Why don't I at least ask the user for something like this?
Well, instead of just declaring the capacity, let me go ahead and say,
you know what?
Let's just replace the 2.
Get int, and just say capacity, for instance.
All right.
And now if I do this, I'm going to be prompted--
so make list 0.
Dot slash list 0.
The capacity will be 2.
1, 2, that's nice.
But if I run it again, and give it a capacity of 3--
1, 2, 3, I get more capacity.
So that's nice.
It's an improvement for sure.
There is a bug here.
Before I test it further, can anyone identify a bug or somehow crash this?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Oh, go ahead.
AUDIENCE: If you don't input an integer.
DAVID J. MALAN: If I don't put an integer.
Or-- is that same comment up here?
AUDIENCE: I was going to say, what happens if you go back
and put in [INAUDIBLE] those other [INAUDIBLE] will be in the memory.
DAVID J. MALAN: Oh.
No.
Because I'm rerunning it in each time.
I don't need to worry about previous runs of the program.
Yeah?
AUDIENCE: In the for loop, it just goes 1,
2, 3, it doesn't actually care what you put it.
DAVID J. MALAN: [INAUDIBLE] 1, 2, 3-- well, I am iterating up to size,
which could be capacity.
Because now they do end up being equivalent.
Because I'm filling the whole thing.
But let's try this.
If you don't type in a value.
So let me go ahead and rerun this.
My capacity shall be duck.
All right.
So we did handle that.
Because getInt does that for me.
But I bet I can still break this.
Ooh, yeah, let's always try something negative.
Oh, OK.
So bad.
Like cryptic looking message, but clearly,
has to do with a negative value.
So I should probably be a little smarter about this.
And recall from like, Week 1, we did do this.
With Mario, you might have done this.
So I could do something like, do, while capacity is less than 1.
I could go ahead and say, capacity getInt capacity.
So just a little bit of error checking to close the bug that you identified.
All right.
So let's go ahead and recompile this.
Make lists 0-- oops we're going to start hearing that a lot today.
Aren't we [INAUDIBLE]?
Make list 0, dot, slash, list 0.
Capacity will be 3.
1, 2, 3.
Now capacity will be negative 1.
Doesn't allow it.
Capacity 0, doesn't allow it.
Capacity 1, yes.
So non-exhaustively, I've tested it.
It feels like it's in better shape.
OK.
But this program, while correct, and while more featureful,
still has this fundamental limit.
Wouldn't it be nice to allow the user to just keep typing numbers,
as many as they want, and then quit once they're done inputting numbers.
Right?
If you're making a program to compute someone's GPA,
different students might have taken different courses,
you don't want to have them to type in all 32 courses.
If they're younger and haven't taken all those courses.
Like there's a lot of scenarios where you don't know in advance how
many numbers the user wants to provide.
But you want to support a few numbers, lots of numbers, or beyond.
So let's do this in a second version.
In list 1 dot C, let me go ahead and improve upon that example as follows.
First, let me give my familiar friends up here CS50 dot for iO,
standard iO dot h, and then, in here, int main void.
And then, let's start writing this.
So now, I don't know in advance, necessarily, how many numbers the user
is going to type in.
Like the goal is, I want them to be able to type
in a number, another number, another number, and then
hit the equivalent of like, q, for quit, when they're done inputting numbers.
Like I don't want them to have to think about in advance, how many numbers it
is they're inputting.
But how do I do that?
Like I can't just come up with an array called numbers, and say, 50.
Because if the user wants to type in 51 numbers,
I'm going to have to resize that.
But how do you resize an array?

How do you resize an array?


AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: What's that?
AUDIENCE: You can't.
DAVID J. MALAN: You can't.
Right.
We've never seen an instance where you've re-sized an array.
We talked about it on the blackboard here.
Well, just like, allocate a bigger one and copy everything in.
And we did identify realloc.
But you can't actually use realloc on an array.
Realloc actually accepts an address of a chunk of memory
that you want to grow, or shrink.
So it turns out, if we now start to harness
the sort of fundamental definition of what an array is, a chunk of memory,
we can actually build arrays ourselves.
If an array is just a chunk of memory, or more specifically,
it's like the address of the first byte of a chunk of memory,
it would seem that I could declare my array, not with square brackets
as we've been doing for weeks, but I can say,
you know what numbers really is, it's really just a pointer.
And I'm initially going to initialize it to null.
Because there is no array.
But now I have the ability to point that pointer
at any chunk of memory, small or big.
Now why is this useful?
Well, initially let me claim that my capacity is 0,
because nothing's going on yet.
I haven't called malloc or anything.
And initially, my size is 0 because there's nothing in the array.
And it doesn't even have a size.
But let me just do this forever.
Much like in scratch, we had the forever block you can use, while true, and C,
to just say keep doing this until the user breaks out of this.
And let me go ahead and ask the user, give me a number, getInt.
And just ask them for a number.
And then, we just need a place to put that.
So where do I put this number?
Well, do I have, at the moment, any place to put the number?
No.
And technically speaking, how do you express that?
Like in pseudo code, I want to say, if no place for number.
But technically, I could do this.
Well, if the size of the array at the moment, equals its capacity,
that feels like a lower level way of expressing the same thing.
If whatever the capacity is, if the size is the same, there is no more room.
And that simple statement also covers the scenario where the capacity is 0,
the size is therefore, 0.
So its the same question.
Either we have no space at all, or we have some space
but we've used it all-- size equals, equals, capacity.
So if the size equals capacity, or put more casually,
if I don't have enough space.
What do I want to do intuitively?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Allocate more memory.
And it turns out, you proposed, or someone proposed earlier,
reallocating memory.
We can use this function for the very first time.
Let me go ahead and say this--
the catch with realloc is you have to be smart about it,
because it returns a pointer.
So let me propose this code first.
First, just give me a temporary variable, call it, temp,
that's going to store the following.
Actually, no.
Let me start this more simply.
Let me go ahead and say, numbers should be reallocated please,
realloc by passing its self in.
And this time, give me the size of an int, times--
how many ints do I want this time?
How many numbers did the human just input presumably?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Just one.
Right.
Because literally, we've only called getInt once in this story.
So whatever the size of this array is now, we need to increase it by 1.
That's all.
So this line of code here is saying, hey computer,
go ahead and reallocate this array from whatever its current size is,
and make it this size instead.
The size of whatever it is, plus 1, times the size of an int.
Because that's what we're trying to store, is an int.
So we have to do that multiplication.
And realloc, as mentioned earlier, is pretty fancy.
It's going to take an pointer, whatever chunk of memory
you've already allocated, and it's going to then reallocate
a bigger chunk of memory.
Hopefully, what's going to happen is this--
if your chunk of memory initially looks like this,
it's going to hopefully notice, oh, this memory is free.
Let me just give you back the same address.
So if this is address 100, and you get lucky
and this address is also available, the realloc function's
going to remember that for the operating system.
It's going to return the number 100 again.
And you're good to go.
You can safely touch memory here.
Or if this is in use already, this chunk of memory, and therefore we
can't fit another byte there because some other code you wrote
is using that memory.
But there is twice as much memory available down here.
What realloc will do, is if you've stored the number 50,
it will handle the process of copying 50 to the new value.
This is going to be left as a garbage value for you to deal with.
And it's going to return to you the address of the new chunk of memory,
having done the copying for you.
So even though it's technically re-allocating the array,
it's not necessarily just going to grow it.
It might relocate it in memory to a bigger chunk,
and then give you the new address of that memory.
Question?
AUDIENCE: Is that process really preferable
to just creating extra memory in it's place.
And then saving the time and energy of reallocating them [? all at once. ?]
DAVID J. MALAN: That's a really good question.
Honestly, we could avoid this problem slightly by just doing, you know what,
give me at least--
go ahead and give me at least the size of an int, times--
I don't know, most humans are not going to type in more than 50 numbers.
Let's just pick 50.
So you could do this, and that would indeed save you time.
Because the approach I'm currently taking
is pretty inefficient because every damn time the user
calls getInt, and gives an int, we're resizing, resizing, resizing.
Very expensive.
As to what the best value is though--
50?
Should it be 25?
Should it be 1,000?
I'm either going to under bet or over bet.
And it just depends on you to decide which of those is the worst decisions.
AUDIENCE: But, like, in terms of programs,
is it also pretty expensive to have memory that you're not using
or generally, is it usually more OK?
DAVID J. MALAN: Good question.
In programs you're writing, is it better to have more memory than you're using,
or should you really be conservative?
These days, memory is cheap.
We all have gigabytes of memory.
And so wasting 50 bytes or 200 bytes, times 4, of memory, not a big deal.
Like, just get the job done quickly and easily.
But in resource constrained devices, maybe, things like phones
or little internet of things style devices
that have a lot fewer resources, you don't really want to go wasting bytes.
But honestly, the CPUs, the brains in our computers,
are so darned fast these days, even if you're calling malloc
10 times, 1,000 times, it's happening so darned fast
that the human doesn't even notice.
So there too.
These are what are called design decisions.
And these are the kinds of things that, in the real world,
you might actually debate with someone at a whiteboard,
saying, no, this is stupid because of this reason.
Or he or she might push back for other reasons.
And no one's necessarily right.
The whole goal is to just that thought process first
so you're at least confident in what you chose.
Yeah?
AUDIENCE: When we were writing to a file in the last PSET,
was it storing it in memory first or putting it right on the hard drive?
DAVID J. MALAN: When you were calling fread,
you were by definition in the forensics problem set
reading bytes from disk into memory.
When you were calling fwrite, you were copying bytes from memory back to disk.
If that answers the question.
OK.
Other questions?
Yeah?
AUDIENCE: Why did you say, size + 1, in line 16?
DAVID J. MALAN: Why do I say, size + 1, in line 16?
Because the whole goal is to make room in this array for the newly inputted
number that the human just typed in.
And so whatever the current size of the array is,
I clearly need one more space.

AUDIENCE: So that repeats on and on?


DAVID J. MALAN: It does repeat on and on.
Because at the moment, I'm inside of this while loop.
So we do need to ask a question, when is the human done inputting.
And it turns out-- and this is not obvious.
And it's not the best user experience on a keyboard for the human.
But we can actually detect the following sentiments--
if user is done inputting numbers, then let's go ahead and break.
But the question then is, how do you express that pseudo code?
Well, you could in some programs maybe type q for quit.
But is that going to work when using getInt?
Could we detect q?
Why not?

AUDIENCE: Because getInt immediately prompts you for another integer.


DAVID J. MALAN: Exactly.
Because getInt immediately prompts you for another int.
So because of the way we designed the CS50 library, you can't detect q,
or you can't have the human type quit unless you don't use getInt.
You instead use?
AUDIENCE: getString.
DAVID J. MALAN: We could use getString.
And then every time the human types in a number, we could use,
like, A2i to convert it to an int.
But if the human types in q or Q-U-I-T--
a string also-- we could just have an if condition with string compare and quit.
But honestly, then you're reimplementing getInt--
so trade-off.
Anyhow, a common way to work around this would
be, you know that Control-C quits programs, perhaps,
cancels out of your program.
There's another popular keystroke, Control-D,
that sends what's called end of file.
It simulates the end of a file.
It simulates the end of the human's input.
So it's kind of like the period at the end of an English sentence.
So if you want to signal to a computer that's waiting for input from you that
you don't want to quit the program-- that would be Control-C--
but you just want to be done inputting input to the computer,
you hit Control-D, otherwise known as EOF.
And the way to express this-- and you would only know this from
documentation-- would be to say something like this,
if the number the human typed in equals end of file--
but there is no such thing in this context--
you actually do this because of the CS50 library works.
It turns out that if the only values a function can return
are integers, that means you can return 0, 1, negative 1, 2 billion,
negative 2 billion give or take.
What humans did for years with old programming languages
is they would just steal one or a few numbers.
For instance, you'd steal the number two billion and call it intmax--
the maximum integer.
And you'd just say, you can never actually type 2 billion,
because we're using that as a special value to signify
that the human hit Control-D. Or you could do negative 2 billion,
or you could do 0, or 50.
But at some point, you have to steal one of the 4 billion available numbers
to use as a sentinel value, a special value
that you can then check for as a constant.
So anyhow, this just means, when the user is done typing input,
go ahead and break out of this while loop.
And as an aside, let me fix one thing.
It turns out things can go wrong with realloc.
And if realloc fails to allocate memory, it
can return null, a special value that just means, eh, something went wrong.
It's an invalid pointer.
It's the address 0.
And so it turns out there's a subtle bug here where, technically, I
should actually do this--
store realloc's return value in a temporary variable.
Because if temp = null, something went wrong.
And I should actually go ahead and quit out of this program.
But let me wave my hand at that for now because it's more of a corner case.
But you'll see in the online version of this program we have additional error
checking that just checks, in the rare case that realloc fails,
clean it up and return properly.
But I'll wave to the online code for that.
All right.
Any questions on that example before we move on?
Yeah?
AUDIENCE: So in realloc, when it creates the new pointer for the [INAUDIBLE],,
does it clear the memory from the original pointer?
Does it automatically clear it?
DAVID J. MALAN: Good question.
When you call realloc and it ends up allocating more space,
does it clear the original memory?
No.
And that is where garbage values come from, for instance.
Because they're just left in memory from the previous use.
Other questions?
Yeah?
AUDIENCE: What does the user actually type to break?
DAVID J. MALAN: Oh, Control-D. Control-D. And it's not break.
It is to send end of file, end of input.
Control-C kills or breaks out of the program itself.
AUDIENCE: And that's the same as the intmax kind of?
DAVID J. MALAN: Same as intmax?
Yes.
AUDIENCE: Because you're not adding, like, a giant value.
DAVID J. MALAN: Correct.
In the CS50 library, intmax, yes, is the symbol.
Yes.
Yeah?
AUDIENCE: Could you also just ask the user to say,
do you want to enter another number yes or no?
DAVID J. MALAN: Absolutely.
We could add more logic.
And you could use getString.
And we could prompt him or her, hey, do you want to input another number.
The only downside of that would be, now, I
have to type in not only my number, but yes or no constantly.
So it's just a trade-off user interface-wise.
All right.
So let me go ahead.
And let me go ahead and return 0 here just as my simple solution
to this problem of something going wrong.
I've just compiled this program.
Let me go ahead and run it.
I'm going to type in one number, two numbers, three numbers.
And now I'm bored.
I don't want to keep doing this.
How do I tell the computer I'm done?
AUDIENCE: Control-D.
DAVID J. MALAN: Control-D. Oops.
Oh, OK.
That's correct behavior because I forgot a key step.
What's that?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Yeah.
I'm not actually doing anything with the values.
I should probably for int I get 0, I less than size,
I + + code we had before.
And I should probably print out You inputted %I, this.
Save that.
Make list one.
So all I did was re-add the printing code.
Now if I rerun this-- one, two three, Control-D--
dammit.
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Oh.
OK.
Now I broke my code here.
Let me do this.
We're going to get rid of this error checking
because I'm not actually ever resizing.
numbers gets realloc.
Oh, and maybe someone chiming in with this--
numbers bracket size gets the user's input.
Size + +-- was this a key detail someone wanted me to do?
OK.
So I didn't actually finish the program earlier.
Notice we left off as follows--
hey, computer, give me an array of size 0 initially
that's null-- there's no memory for it.
Therefore, the size of this array is 0.
Do the following forever.
Get a number from the human.
If the number equals this special value, intmax just
breakout because the program is done.
And actually, sorry.
This is why I write these in advance too.
OK.
Go ahead and prompt the user for a number.
If they have inputted the Control-D, just break out of this loop.
However, if the size of the array equals its current capacity,
go ahead and reallocate space for this thing being one number bigger than it
previously was.
Now, assuming that succeeded and we have memory,
go ahead, and just like our list 0 example, store in the numbers array
at the current location, which is 0, whatever number the human typed in.
And then increment the size by one to remember what we have done.
I'm also though going to need to do capacity + + here
to remember that we've increased the capacity of the array.
So again, two new measures.
capacity is how much space there is in total.
size is how much we're using.
They happen to be identical at the moment
because we're growing this thing step by step by step.
All right.
Let me go ahead and hit Save.
Let me go ahead and compile this one last time.
./list1 and input 1, 2, 3.
Control-D. OK.
Now it's just an aesthetic bug.
I forgot my /n.
So just to prove that I can actually program, ./list1; 1, 2, 3; Control-D.
Phew.
All right.
So you inputted 1.
And the reason it didn't move to another line
is because Control-D gets sent immediately without hitting Enter.
All right.
Phew.
That's all using arrays.
Now let's do the sort of cake baked already and pull it out of the oven.
The third and final example here is list two.
And actually, before we get there, let me note one thing.
Yeah, let's do one last thing here.
Let me go ahead and run, per earlier, our new friend valgrind on list1.
Enter.
It's waiting for me to type in 1, 2, 3.
Let me go ahead and hit Control-D. Interesting.
I seem to have a buggy program even though I claimed a moment ago that I
knew what I was doing.
12 bytes in one blocks are definitely lost in lost record one of one.
Again, I don't understand most of those words.
But 12 bytes definitely lost--
probably my fault. Why is it 12?
And what are those 12 bytes?
Yeah?
AUDIENCE: I think you made three integers.
DAVID J. MALAN: Yeah, 1, 2, and 3.
AUDIENCE: And each one is 4 bytes.
And you never freed them after you used malloc.
DAVID J. MALAN: Exactly.
I typed in three numbers--
1, 2, and 3.
Each of those is 4 bytes on this computer.
That's 12-- 3 times 4.
And so I'd never freed them seems to be the source of the issue.
So at the end, let's just prove that valgrind
can detect correctness as well.
Free my numbers, semi-colon.
Let me go ahead and rerun make list1.
And now let me increase the size of this and do valgrind again
on list1, typing in the same values--
1, 2, and 3.
Control-D. All he blocks were freed.
No leaks are possible.
So again, valgrind is your friend.
It finds problems that you didn't even necessarily notice.
And you didn't have to read through your lines of code again
and again to identify the source of the issue unnecessarily.
All right.
Any questions then on these arrays that are dynamically allocated
and the bugs we find therein with valgrind?
All right.
So the last demonstration of code is going to be this.
I have stolen, for this final example, some of the building blocks
that we had on the screen earlier.
In my code for list2.c, I need a structure called node.
And that node, as we claimed earlier with our human volunteers,
is going to contain a number called number,
we'll call it this time, instead of n.
And it's going to contain a ptr called next to another such node.
So that's copied and pasted earlier, albeit with the integer renamed
to number for clarity.
Now, notice in main what I'm doing first.
Go ahead and allocate an array of no space initially.
So this was like when Comey was holding up first and representing
the beginning of our data structure.
This is the analog using an array, that the piece of paper that
would be held up here would be numbers.
And it's just pointing at nothing, null-- like left hand
down on the floor.
Because there is no memory yet allocated.
But then, and while true, go ahead and get an integer
from the user with this code here.
Check if the user hit Control-D, as with this arcane technique.
And then our code is similar in spirit, but we
have to stitch these things together.
Allocate space for the number.
So when I malloc an additional volunteer from the audience
and he or she came down, the equivalent in code is this--
hey, computer, allocate with malloc enough space to fit the size of a node,
then store the results in a ptr called n.
So node *n just means, give me a pointer to a node, call it n,
and store the address that was just allocated from the audience as before.
Why do I have these lines of code here that I've highlighted in blue?
What's that expressing?

If bang n, or if not n would be how you pronounce it--


what's going on there?
Yeah?
AUDIENCE: If there is no more memory that you can point to, then it fails.
DAVID J. MALAN: Exactly.
This isn't going to happen all that often.
But if the computer is out of memory, and therefore malloc fails,
you don't want the program just to crash or freeze.
Like, all of us hate when that happens on Mac OS or Windows.
So check for it.
If not n, or equivalently, if n = = null, just return 1.
Quit gracefully, even though annoyingly.
But don't just crash or do something unexpected.
So you can simplify that check to just if not n-- if n is not a valid ptr,
return 1.
Now, here's the code with which we were implementing the demonstration
with our humans.
And this is the scariest looking or most cryptic at least
looking code we're going to see in C.
Today is our final day in C. We've been running up
a really steep hill of late, learning about memory,
and now data structures and syntax.
This is the last of our syntax in C.
So what are the symbols to be aware of?
This line of code here is how I handed one of our volunteers a piece of paper.
On the right-hand side is the number that was typed in--
55, or 5, or 20, or whatever the value is.
On the left-hand side is where you want to put it.
n and then literally an arrow number does this.
It has, with malloc a line or so prior, given me in memory
just one of these big rectangles.
And again, the top of this in this example is called the number
and the bottom is called next.
So that's our human having stood up from the back of the room.
When I hand that human a number, like 55, it visually goes there.
The line of code with which you achieve that is this here.
Because notice on line 31 here, when I malloc that node,
I stored its address in a variable called n.
And that's a pointer, as drawn with an arrow, to that big node.
Or if we really want to be nit-picky, if this is in address 100,
yes, then the pointer actually has the value 100 in it.
But again, that's rarely useful information.
So we can abstract away with just an arrow.
So line 31 is what creates those boxes on the screen.
Line 38 is what puts the number--
for instance, 55-- into the box exactly, much like I handed a piece of paper
over.
So what is this?
This is the only real new notation today,
even though we're using lots of stars elsewhere--
arrow This is wonderfully the first time in C it actually maps to our pictures.
If n is the variable and you do n arrow something,
that means follow the arrow--
kind of like Chutes and Ladders if you grew up playing that--
and then put the number where the arrow has led you in the field called number.
So as an aside, we can think about this a different way. n is what data type?
What is this thing in blue--
n?

AUDIENCE: Pointer.
DAVID J. MALAN: It's a pointer.
And it's a pointer to one of these things that we created earlier.
So we're not doing students anymore with our structures.
We're implementing nodes, which have numbers and next pointers.
So it turns out that if n is a pointer to a node--
recall that dot notation from before--
this is not how you access number in this case.
Because n is not a node itself.
It's a pointer.
But if n is a pointer, how do you go to a pointer?
How do you go to an address?
With what notation?
AUDIENCE: Star.
DAVID J. MALAN: Star.
So recall from last week, if we want to go to an address,
you could do syntax like this.
Ignore the parentheses for a moment.
Just *n means if n is an address of a chunk of memory, *n means go there.
Once you're there, you're conceptually right here-- top left-hand corner.
How do you access individual fields like number or next?
You use dot notation.
So if you literally do *n.number, that means go to the address and access
the number field.
There is nice syntactic sugar in C, which
is just a fancy way of saying shorthand notation, where it's just the arrow.
But that's all it is.
This arrow notation doesn't do anything new.
It just combines, go there, with, access a field in a struct, all in one breath
if you will.
And this just looks a little prettier.
When I told our volunteers earlier, point your hand
down at the floor, that's all that line of code is doing.
It's saying, go to n's address, which is here, access the next field,
and write in that field null, which is just
the address 0-- the default, special address, like pointing at the floor.
This line of code, 40, is just a quick error check.
if (numbers)-- what is that equivalent to?
That's actually just saying, if numbers, not equals null.
So if numbers is legitimate, if malloc worked correctly, then let's go ahead
and do the following.
Phew.
This is a mouthful.
What is going on here?
So this is a for-loop that's not using numbers.
Well, or is it?
Almost every for-loop we've written and you've probably written just
uses I, J, maybe K, but just integers probably.
But that doesn't have to be the case.
What is a pointer?
It's an address.
What is an address?

AUDIENCE: A place in memory.


DAVID J. MALAN: A place in memory, or a number really.
So you can certainly use for-loops just involving addresses.
But how?
So we'll consider this line of code.
This here looks different today, but it's everything
before that first semi-colon.
That's just where you initialize a value.
So this is like saying, hey, computer, go ahead and give me
a variable called ptr and initialize it to be the start of my list.
Then I'm saying, hey, computer, do this so long as ptr does not equal null.
And then what am I doing?
if-- and let's ignore this for now, it's an error check--
go ahead and-- sorry, let me think for one second.

OK.
Let's do this.

What are these lines of code doing?


This is the code that was actually suggested
at the very end of our human example.
Like, what if we wanted to insert all of the elements
at the end of the link list?
How do you express that?
So in this highlighted lines of code, we're asking the question,
if the current pointer's next field is null, we've found the end.
Go ahead and update that next field to equal n and then break.
So let me translate this to an actual picture,
but using smaller boxes that makes clear where something is going.
So suppose that this program's been running for a little while
and we have a length list that looks like this,
where this one is pointing here and maybe this one's pointing here.
And this says null here.
And this points here.
And the numbers are, as we've been using today, 42, 50, 13.
So the start of this list is called numbers.

This points to the start of the list.


What am I doing in this for-loop?
I am just implementing the following logic with this loop--
give me a variable called ptr, as represented
in the story by my left finger, here, and initialize
that to be the start of the list.
If that node's next pointer is equal to null, add a new node here.
But this is not null.
I want to follow the bread crumbs to here.
And then, oh, we're at the end of the list.
I want to insert this new thing here.
So how do express this code actually in C?
So if I look back up here, this is the line of code
that allocates my left finger here called ptr and initialize it
to equal numbers, which is the same as pointing at the first element.
It's kind of like Comey was representing first earlier.
But now our array is called numbers.
Next, what am I doing?
Does ptr equal null?
Well, no.
If my left hand is pointing here, it obviously doesn't equal null.
So we don't have to worry yet.
Then what do I want to do?
If ptr next equals null, well, what does that mean?
Well, ptr is here.
ptr arrow next means here.
Does this equal null in this story?
I mean, it literally doesn't.
Because null is not written there. null is way down there.
So the condition does not pass.
So what do I do next?
If ptr is equal to null doesn't apply, here's a weird update.
ptr gets ptr next.
So it's cryptic-looking syntax.
But if ptr is pointing here, what is ptr next?
That's just this, right?
This is n.
This is next.
Or this is number.
This is next.
So ptr next is this.
So what is this value?
Well, this is a pointer pointing here.
So that highlighted block of code, ptr equals ptr next,
has the effect visually of doing this.
Why?
If the arrows are a little too magical, just think about these being addresses.
If this is saying, the next address is location 100,
ptr equals ptr next is like saying, well, this also equals 100.
Whatever 100 is, for instance, over here is
what both arrows should now point out.
And if you now repeat this process and repeat this process,
eventually that question we asked earlier is going to apply--
if ptr next equals null, what do I want to do?
Well, if ptr x equals null, there's two lines going on. ptr next equals n.
So ptr next is no longer null.
It should instead be pointing at n, which is the new node.
And then that's it.
Because this was already initialized to null.
And let's suppose this was 55.
And we're done.
So much easier to do, obviously, in person with just humans,
and moving around, and pointing with their left hands.
But in code, you just have to think about the basic building blocks.
What is each of these values?
Where is each of it pointing?
And which of those fields do you need to update?
And the only new code here-- even though we're kind of combining it all in one
massive example--
is this.
We are actually using arrow notation to say, go to that address
and access some value therein.
And this condition down here, which I'll wave my hand out for now,
just handles this situation where the list is initially empty.
Any questions on this thus far?
All right.
So let's take a look more graphically at some final problems we can solve.
And what you'll see in the days ahead is the following
when it comes to these linked lists and more.
We now have the ability to actually allocate things in memory dynamically.
We don't necessarily know in advance how many numbers we have
or, in the case of the next problem set, how many words we have.
We have the ability though to use malloc, and maybe even realloc,
to grow and grow our data structure in memory.
And we have the ability in code to actually
traverse those values in such a way that we
can access memory that's all over the board now
and not necessarily back to back to back.
But what happens if we want to combine these ideas into fancier solutions
still?
Well, let's take a look at that.
In particular, if I go let's say over here to the following,
let's consider a problem we might now solve.
If I wanted to store everyone's name in this room in a data structure,
I could do what?
Well, we could use an array.
So I could actually decide how many people are in the room--
let's call it n--
and actually draw n boxes on the board, and then iteratively ask
everyone for their name, and actually write it down.
If I then wanted to take attendance thereafter and say, oh, is Alice here,
or is Bob here, or is Kareem here, or Brian,
I could just look through that array and say yes or no, that human is here.
But what's the running time of that algorithm?
How long would it take to look up a name in a data structure
where I've just drawn it as an array, a big list on the board?
AUDIENCE: A big O of n.
DAVID J. MALAN: What's that?
AUDIENCE: A big O of n.
DAVID J. MALAN: A big O of n, right?
Because if it's just a list of names, it's going to take big 0 of n.
And frankly, that seems a little slow.
How could I do an optimization?
Well, what if we combined some of these ideas?
Arrays are nice because they give me random sort
of instant access to memory locations.
But linked lists are nice because they allow me to dynamically add or subtract
elements even if I want from the list.
So you know what?
Instead of writing down everyone's names, like Alice, and Bob,
and Charlie, like this in just one big array of some fixed size that might
paint me into a corner-- now I only have room for one more name--
what if I instead do things a little more cleverly?
So when I'm actually jotting down everyone's name in the room, what
if I instead did, OK, is Alice here.
All right.
Alice is here.
And then Brian is here.
I'm going to put Brian here.
And then maybe Charlie is here.
All right.
So Charlie.
And then maybe Arnold is here.
Where should I put Arnold?
So also starts with A. You know what?
Let's just put Arnold here.
Arnold.
And Abby is here.
So you know what?
Let's just put Abby up here as well.
Bob came as well.
So Bob-- so what's the pattern I'm obviously following
as I'm hearing names called out?
AUDIENCE: Alphabetically sorted.
DAVID J. MALAN: Alphabetically sorted--
kind of.
Like, Abby kind of ended up in a weird place here.
But that's fine because I didn't hear her name first.
But I did kind of bucketize people into different rows of the board.
In other words, all of the A names I seem
to just write down for convenience at the top,
and then all of the B names together, and C names.
And probably if I kept going, I could do this all the way
through Z in the English alphabet.
So what's nice about this is that, yeah, I'm making lists of names,
but how long is each of those lists?
If there's n people in the room, each of my lists
is not going to be n long, which is slow.
It's going to be what? n divided by 26, give or take.
If we assume that there's an equal number of people
with Z names and A names, it's going to be roughly n divided by 26 so
that I have these chains of human names, but they're
much shorter than they would have been if I just grouped everyone together.
And this is a fundamental technique in programming called hashing.
It turns out there are things in this world called hash functions.
These are just mathematical, or verbal, or code-implemented functions
that take as input something and produce as output a number typically-- a number
from 0 to, say, 25, or from 1 to 26.
But they can also output strings in other contexts as well.
So my hash function here in my mind is, if you hand me a name,
I'm going to look at the first letter in your name.
And if it's A, I'm putting you in location 0.
If it's B, I'm going to put you in location 1.
If it's a Z, I'm going to put you in location 25 at the end.
So these are all buckets I've got, so to speak,
in computer science-- like 26 buckets or room
on the board that represent the starts of people's names.
So what is that?
Well, it would seem that if I don't know in advance how many A names I have,
that's kind of like drawing this as a linked list, if you will,
that might just get longer and longer.
But I do know that I only have a finite number of first letters.
So that-- at the risk of drawing a little messily--
is kind of like drawing what data structure?
AUDIENCE: An array.
DAVID J. MALAN: Yeah.
It's kind of like drawing an array that just has 26 spots.
And what's nice about an array is that I have random access.
I can jump right to any letter of the alphabet in constant time, one step.
And once I get there, I'm still going to see a list of names.
Thankfully, thanks to linked lists, that list can be short or long.
But on average, let's say it's going to be
126th the length that it would have been if I just used one array or one linked
list.
So this technique of using a hash function-- which, again,
I've defined as you give me a name; I take that as input;
I look at the first letter; and I return as output a number from 0 to 25--
a hash function lets you create a hash table.
And there's different ways to implement hash tables,
but perhaps one of the most common is indeed like this.
You decide in advance on the size of an array.
But that array does not contain the strings or the humans' names.
That array actually contains linked lists.
And it's the linked lists that contain the names.
So we borrow ideas from, like, week two.
We merge them with an idea today from week four of adding arrays
to linked list respectively.
And we kind of get the best of both worlds.
Because I can immediately jump to any letter of the alphabet super fast.
And once I'm there, yeah, there's a list,
but it's not nearly as long as it would have been if I didn't use this trick.
So what's the running time of all of this?
Well, it turns out that a hash table in the worst case
might still take you how many steps to find someone's name once it's
been added to the list?
In the very worst case, how many steps, if there's n people in the room?
AUDIENCE: n.
DAVID J. MALAN: Maybe n.
Why?
It's kind of a perverse situation.
But can you contrive a scenario in which,
even though we're doing this fanciness, it still
takes me n steps to confirm or deny that someone's here?
Yeah?
AUDIENCE: Everyone's name starts with the same letter.
DAVID J. MALAN: Everyone's name starts with the same letter
for some weird reason.
Now, it's a little silly in the human world.
But it could happen if you're just talking
data or whatever in the computer world.
This can devolve into, sure, an array with just one really linked list.
But in practice, that's not likely going to happen, right?
If we actually spent the time here and asked everyone for their name,
we'd probably get a reasonably uniform distribution of letters,
at least as is statistically likely with just human names.
So that would kind of spread things out.
And so there's this fundamental distinction between sort of real-world
running time, or wall clock time-- how many seconds are actually spinning
on the clock--
versus asymptotic running time.
We've talked for a couple of weeks now about running time as being big O of n.
And that might be still the case, that a hash table-- yes, in the worst case,
it's still a big O of n data structure.
Because in the worst case, it's going to take n steps.
But in the real world, big O of n is really big O of n divided by 26,
even though we always ignore those lower-order terms.
But when it's you, the human, running the code and analyzing the data,
running 26 times faster is actually real time saved,
even though a mathematician might say, ah, that's the same fundamentally.
And indeed, one of the problems ahead for the next problem set
is going to be to suss out exactly what the implications are
in your own code for actual wall clock running time.
And making smarter design decisions, like something like this,
can actually really speed up your code to be 26 times as fast, even
though, yes, a theoretician would say, ah,
but that's still asymptotically or mathematically
equivalent to just something linear.
So it's this fine tuning that will make your code even better and better.
Now, frankly, hashing on first names probably
isn't the smartest thing alone, right?
Like, does anyone's-- and this is going to be hard.
Does anyone's name start with X here?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: [INAUDIBLE] is not here.
But thank you for that perfect counter-example.
But she's not here.
So look, there's no Zs.
So now we're down to 25 possible values.
And I could probably pick some less common letters too.
The point is there's probably a few more As than there are Zs
or a few more B's than there are Q's just by nature of human names.
So maybe just using the first letter isn't good enough.
And frankly, with 26 names-- suppose we did this for all of Harvard
and had thousands of names.
Each of my chains might still have hundreds or thousands of names.
So another design question is going to be, well, how many buckets should you
have, how big should the array be.
Maybe you shouldn't look at the first letter.
What if you look at the first and the second letter together-- so AA, and AB,
and AC, and then dot dot dot, BA, BB, BC, so you could come up
with more and more buckets?
But what else?
How else might we kind of uniformly distribute people?
What do all of you have that we could use as input to a hash function?
AUDIENCE: A last name.
DAVID J. MALAN: OK.
Well, you could do last name, which might give us
a different or similar distribution.
Yeah?
AUDIENCE: ID number.
DAVID J. MALAN: Whats that?
AUDIENCE: ID number.
DAVID J. MALAN: Yeah.
We could use your ID number and actually look at the first digit of your ID.
And odds are, it's 0 through 9.
So we could probably at least get 10 buckets that way.
And that's probably uniformly distributed.
I'm not sure.
We could use birth dates in some way.
Like, we could put all of the freshmen in one bucket,
all the seniors in another bucket, and everyone else,
and so forth, in their own buckets, which would also give us some input.
So again, a hash function is entirely up to you to program and design.
The goal though is to smooth things out.
You want to have roughly the same number of things in each linked list
just so that you have about the same performance
across all of these various inputs.
So let's take a look at a couple of other data structures,
again, in this abstract way.
Now that we know that, even though it's not obvious at first attempt,
we know how to construct arrays.
We kind of know now how to construct linked lists.
It stands to reason we could implement them together in code.
What else could we do now with these building blocks?
So for instance, this structure here is a very common one, known as a tree.
A tree like a family tree, where there's one patriarch or matriarch
at the top, and then their children, and then their grandchildren,
and great grandchildren, and so forth.
And what's nice about a tree structure is that, if you're storing data,
you can actually store the data in clever ways to the left child,
to the right child, and so forth, as follows.
Notice here, there's something curious about all the numbers in this data
structure.
What is noteworthy about them?

What is noteworthy?
Yeah?
AUDIENCE: Multiples of 11.
DAVID J. MALAN: What's that?
AUDIENCE: They're multiples of 11.
DAVID J. MALAN: They are multiples of 11.
That was just to make them look pretty though by the author here.
Yeah?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Yeah.
There's a mathematical significance too.
Like, no matter what node or circle you look at, the value in it
is bigger than the left child and it's smaller than the right child.
So it's kind of in-between.
Any circle you look at, the number to the left is smaller,
the number to the right is bigger.
And I think that applies universally all over the place.
Yes?
So what does that mean?
We'll recall from, like, week 0 when we had a whole bunch of phone book pages
that we were searching--
1, 2, 3, 4, 5, 6.
Let's give ourselves a 7th one.
Recall that when we did divide and conquer, or binary search,
we did it on an array.
And what was nice about binary search was we started in the middle,
and then we maybe went left, or we maybe went right,
and we kind of divided and divided and divided
and conquered the problem much more efficiently in logarithmic time
than it would have been if we did it linearly.
But we know now weeks later that arrays are kind of limiting, right?
If I keep storing all of my values in an array,
what can I not do with the array?

Make it bigger, right?


I can't add an element to it without copying every darn element,
as we've discussed thus far today.
But what if I was a little smarter about it?
What if I stored my values, not just in an array,
but I started storing them in these circles--
let's call them nodes--
and each of those nodes is really just an integer plus two additional values?
How would we implement this data structure in memory?
Well, here's an int n-- could represent the number in question.
And we could put that in a data structure
called a node that just has the same syntax as earlier today,
but I've left room for two more fields.
What is it that I want to represent in code if I
want to start storing my numbers, not in this old-school week 0 array,
but in a tree?
AUDIENCE: Two pointers.
DAVID J. MALAN: Two--
AUDIENCE: Pointers.
DAVID J. MALAN: Two pointers.
Right?
A tree, as drawn here literally with arrows,
is just like saying every one of these nodes or circles
has a left child and a right child.
How do you implement children?
Well, you can literally just use pointer notation as well here.
A left child is just a pointer to another struct on the left.
And a right child is just another pointer to the child on the right.
And what's nice about this ultimately is that we can now
traverse this tree just as efficiently as we can traverse this array.
Because notice if I want to search for the number 66,
how many steps does it take me if I start at the top?
Just like Comey represented the start of our linked list,
so in the world of a tree does the root have special significance.
And that's where we always begin.
So how many steps does it take me to find 66 given the top?
AUDIENCE: Three.
AUDIENCE: Two.
DAVID J. MALAN: It looks like--
yeah, two or three, right?
I start at the top.
I look at it and say, hmm, 55, which way do I go.
I go to the right.
Then I see 77.
OK.
Which way do I go?
I go to the left.
So it's the same logic as week 0 in dividing and conquering the phone book
or an array a couple of weeks later.
But we get to the number we care about pretty quickly.
And it's not linear.
And in fact, if we actually did out the math, what's
really cool about a binary search tree is that if you have n elements,
n circles, the height of that tree is by definition mathematically log n.
So the height of the tree just so happens
to correspond to exactly how many times you can take n
and divide it, divide it, divide it, divide it in two.
And you can actually see this if you think about it the reverse direction.
On the bottom row, there are how many elements?
All right?
And on the middle row, there is?
AUDIENCE: Two.
DAVID J. MALAN: Two.
And on the top row, there's one.
So you can actually see it in the reverse direction.
This is like divide and conquer, but in a different conceptual way.

Every row in the tree has half as many elements as the one below it.
And so the implication of that is just like from week 0 in the phone book
when we're dividing, and dividing, and dividing in half, and half, and half.
So this is only to say, now that we have structures and pointers,
we can build something like this.
But let's try one other example here too.
This is a crazy looking example.
But it's kind of amazing.
Suppose that, if we wanted to store a dictionary of words--
so not humans' names this time, but English words.
So Merriam Webster or Oxford English Dictionary has what?
Thousands, hundreds of thousands of words
these days in English for instance?
How do you actually store those?
Well, if you just look up words in a dictionary back in yesteryear,
that is linear.
You have to start at the beginning and look through it
page by page, looking for words.
Or you could be a little smarter.
Because the words in any dictionary are hopefully alphabetized,
you can do the Mike Smith-style divide and conquer by going to the middle,
then the middle of the middle, and so forth--
log of n.
But what if I told you, you could look up words in constant time--
some fixed number of steps?
None of this divide and conquer complexity.
No log n.
Just constant time-- you want a word, go get it instantly.
That's where this last structure comes in, which is called a trie--
T-R-I-E-- short for retrieval, even though it's pronounced the opposite.
So a trie is a tree each of whose nodes is an array.
So it's like this weird Frankenstein's monster kind of data structure.
We're just really combining lots of different ideas, as follows.
And the way a trie works, as is implied by this partial diagram on the board,
is that if you want to store the name Brian, for instance,
in your dictionary-- it's the first word--
what you do is you start by creating a tree with just one node.
But that node is effectively an array.
That array is of size, let's say for simplicity, 26.
So A through Z. This location here therefore represents B for Brian.
So if I want to insert Brian into this tree, I create one node at the top.
And then for the second letter in his name, R,
I create another node, also an array, A through Z.
And so here, I put a pointer to this node here.
B-R-I. So I should have drawn some more boxes.
A, B, C, D, E, F, G, H, I. So here, I'm going to draw another pointer to B--
wait.
Bian.
[LAUGHTER]
OK.
That's wrong.
Billy shall be our name.
Billy is at B. Wait.
No.
Dammit.
B, B. B-I-A-- yes, this works.
This works.
OK.
Sorry.
So here we go.
We're inserting Billy into this fancy data structure.
So the first node represents the first letter.
The second node represents the second letter.
The third node represents the third letter.
And so forth.
But what's cool about this is the re-usability.
So notice if this is the second letter and I counted this out correctly,
I, this is going to lead to a third node deeper
in the tree where it's L that we care about next, and then another one
down here which represents another L.
And I'll start drawing the letters.
L. This is B. This is I. L. And we'll call this L.
And then, finally, another one over here, which is a Y. And this
gets pointing down here.
This gets pointing here.
And so forth.
So in short, we have one node essentially
for every letter in the word that we're inserting into the data structure.
Now, this looks stupidly inefficient at the moment.
Because to store B, I, L, L, Y, how much memory did I just use?
26 plus 26 plus 26 plus 26 plus 26.
Just to store five characters, I use 26 times 5.
But this is kind of thematic in computer science--
spend a little more space, and I bet I can decrease the amount of time
it takes to find anyone.
Because now no matter how many other students are in this data structure--
and for instance, let's do another one.
If we had another one, like Bob--
so B is the same first letter.
That leads us to this second node.
O is somewhere else in this array, say, over here.
So this represents O. And then Bob has another one.
So there's going to be another array here.
And this is why the picture above draws this so succinctly.
This is how we might store Bob.
So B, I, L, L, Y. Or you can follow a different route, B, O, B.
So we can start to reuse some of these arrays.
So there's where you start to get some of the efficiency.
Any time names share a few letters, then you start reusing those same nodes.
So it's not super, super wasteful.
But the question now is, if there's like 1,000 students in the class,
or 1,000 students in the room, we're going have a lot of nodes
there on the board.
But how many steps does it take to find Billy,
or Bob, or any name with this data structure, and to conclude yes or no
that student is in the class?
So, like, five for Billy, three for Bob.
And notice none of that math has any relationship
to how many students are in the room.
If we instead wrote out a long list of 1,000 names, in the worst case,
it might take me 1,000 steps to find Billy or Bob.
Maybe I could be a little smarter if I sort it.
But in the worst case, big O of n, it's linear.
Or if I used a hash table before, and maybe there's
1,000 students in the room, but, OK, there's
26 letters in the English alphabet at least.
So that's 26 buckets.
So maybe it's 1,000 divided by 26, worst case,
if I'm using those linked lists inside my array.
But wait a minute.
If I'm using this structure, a trie, where every node in the tree
is just in an array that leads me to the next node, ala breadcrumbs, B, I, L, L,
Y is 5 and always 5.
B, O, B is always 3.
B, R, I, A, N would have been 5 as well.
None of these totals has any impact or any influence
from the number of total names in the data structure.
So a trie in some sense is this amazing holy grail
in that, by combining these various data structures, now you get constant time,
but you do pay a price.
And just to be clear, what is the price we seem to be paying?
AUDIENCE: Memory.
DAVID J. MALAN: Memory.
And in fact, this is why I'm not really drawing it much more.
Because it just becomes a big mess on the screen because it's
hard to draw such wide data structures.
It's taking a huge amount of memory.
But theoretically, it's coming faster.
Yeah?
Question.
AUDIENCE: So would you deal with a case if someone is in the Bob,
but then the other kid is in the Bobby?
DAVID J. MALAN: Good question.
So it's a bit of a simplification.
If you were storing both Bob and Bobby, you would actually keep going.
So each of these elements is not just one letter.
You also have essentially a node there or some other data structure
that says either stop here or continue.
And you'll see actually in the problems that we'll
propose to you how you can represent that idea if you
choose to go this route.
Indeed, the challenge ahead ultimately is something quite like this.
You will implement your very own spell checker.
And we will give you code that gets you started with this process.
And of course, a spell checker these days in Google Docs
and Microsoft Word just underlines in red misspelled words.
But what's going on?
And how is it that Word or Google Docs can
spell check your English or whatever language so quickly?
Well, it has a dictionary in memory, probably with tens of thousands
or hundreds of thousands of words.
And all they're doing constantly is, every time you type a word
and hit the Spacebar, or Period, or Enter,
it's quickly looking up that new word or those words in its dictionary
and saying, yes or no, should I squiggle a red line underneath this word.
And so what we're going to do is give you a big text file, ASCII text,
containing 100-plus thousand words.
You're going to have to decide how to load those
into memory, not just correctly, but in a way that's well designed.
And we'll even give you a tool, if you choose to use it,
that times how long your code takes.
And it even counts how much RAM you're actually using.
But the key goals for this week and our final week in C
is to take some of these basic building blocks,
like arrays, and pointers, and structures,
and decide for yourselves how you're most comfortable stitching them
together, to what extent you want to really fine tune your code beyond just
getting it correct, and to give you a better sense of the underlying code
that people have had to write for years in libraries
to make programming doable, ala Scratch.
Because in just a few weeks, we're going to transition to Python.
And the dozens of lines of code you've been writing now
are going to be whittled down to one line, two line,
because we're going to get a lot more features from these newer,
fancier languages.
But you'll hopefully have an appreciation of what is actually
going on underneath that hood.
So I'll stick around for any one-on-one questions.
Let's call it a day.
Take a duck on your way out for roommates as well.
And we'll see you next time.

Anda mungkin juga menyukai