DAVID J. MALAN: All right, this is CS50 and this is lecture four.
So we're here in beautiful Lowell Lecture Hall
and Sanders is in use today.
And we're joined by some friends that will soon
be clear and present in just a moment.
But before then, recall that last time we took a look at CS50 IDE.
This was a new web-based programming environment similar in spirit
to CS50 Sandbox and CS50 Lab, but added a few features.
For instance, what features did it add to you--
to your capabilities?
Yeah?
AUDIENCE: Debugger.
DAVID J. MALAN: What's that?
AUDIENCE: The debugger.
DAVID J. MALAN: The debugger.
So debug50, which opens that side panel that
allows you to step through your code, step by step, and see variables.
Yeah?
AUDIENCE: Check50.
DAVID J. MALAN: Sorry, say again?
AUDIENCE: Check50.
DAVID J. MALAN: Check50 as well, which is a CS50 specific tool that
allows you to check the correctness of your code
much like the teaching fellows would when providing feedback on it.
Running a series of tests that pretty much are
the same tests that a lot of the homeworks
will encourage you yourself to run manually,
but it just automates the process.
And anything else?
AUDIENCE: [INAUDIBLE]
DAVID J. MALAN: So that is true too.
There's a little hidden Easter egg that we don't use this semester,
but yes indeed.
If you look for a small puzzle piece, you
can actually convert your C code back to Scratch-like puzzle pieces
and back and forth, thanks to Kareem and some of the team.
So that is there, but by now, it's probably better
to get comfortable with text as well.
So there's a couple of our other tools that we've
used over time of course besides check50 and debug50.
We've of course used printf and when is printf useful?
Like when might you want to use it beyond needing to just print something
because the problem set tells you to.
Yeah?
AUDIENCE: To find where your bug is.
DAVID J. MALAN: Yeah, so to find where your bug is.
If you just, kind of, want to print out a variable's value or some kind of text
so you know what's going on and you don't necessarily
want to deploy debug50, you can do that.
When else?
AUDIENCE: If you have a long formula for something [INAUDIBLE]
and you want to see [INAUDIBLE].
DAVID J. MALAN: Good.
Yeah.
AUDIENCE: How running-- like going through debug50 50 times.
DAVID J. MALAN: Indeed.
Well, in real life-- so you might want to use printf
when you have maybe a nested loop, and you want to put a printf inside the loop
so as to see when it kicks in.
Of course, you could use debug50, but you
might end up running debug50 or clicking next, next, next, next, next, next,
next so many times that gets a little tedious.
But do keep in mind, you can just put a breakpoint deeper into your code
as well and perhaps remove an earlier breakpoint as well.
And honestly, all the time, whether it's in C or other languages,
do I find myself occasionally using printf just to type out printf in here
just so that I can literally see if my code got to a certain point in here
to see if something's printed.
But the debugger you're going to find, now
and henceforth, so much more powerful, so much more versatile.
So if you haven't already gotten into the habit of using debug50, by all
means start and use those breakpoints to actually walk through your code
where you care to see what's going on.
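As a minimal sketch of the printf technique described above: the function below is hypothetical (count_iterations is not from the lecture) and simply prints from inside a nested loop, so you can see exactly when the inner body runs.

```c
#include <stdio.h>

// Hypothetical example (not from the lecture): a nested loop with a
// printf inside the inner loop, so each iteration announces itself.
// Returns how many times the inner body ran.
int count_iterations(int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
        {
            // Temporary debugging output: proves the code got to this point
            printf("in here: i=%i j=%i\n", i, j);
            count++;
        }
    }
    return count;
}
```

Once the bug is found, a line like this is typically deleted again. A breakpoint in debug50 on the same line gives the same information without editing the code.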
So style50, of course, checks the style of your code much like the teaching
fellows might, and it shows you in red or green
what spaces you might want to delete, what spaces you might
want to add just to pretty things up.
So it's more readable for you and others.
And then what about help50?
When should you instinctively reach for help50?
AUDIENCE: When you don't understand an error message.
DAVID J. MALAN: Exactly.
Yeah, when you don't understand an error message.
So you're compiling something.
You're running a command.
It doesn't really quite work and you're seeing a cryptic error message.
Eventually, you'll get the muscle memory and the sort of exposure
to just know, oh, I remember what that means.
But until then, run help50 at the beginning of that same command,
and it's going to try to detect what your error is
and provide TF-like feedback on how to actually work around that.
You'll see, too, on the course's website a wonderful handout made
by Emily Hong, one of our own teaching fellows,
that introduces all of these tools, and a few more,
and gets you into the habit of thinking about things.
It's kind of a flow chart.
If I have this problem, then do this or else
if I have this problem do this other thing.
So do check that out as well.
But today, let's introduce really the last, certainly for C,
of our command line tools that's going to help
you chase down problems in your code.
Last week, recall that we had talked about memory a lot.
We talked about malloc, allocating memory,
and we talked about freeing memory and the like.
But it turns out, you can do a lot of damage
when you start playing with memory.
In fact, probably by now, almost everyone-- segmentation fault?
[LAUGHTER]
Yeah, so that's just one of the errors that you might run into,
and frankly, you might have bugs in your code now
and henceforth, but you don't even realize it
because you're just getting lucky.
And the program is just not crashing or it's not freezing,
but this can still happen.
And so Valgrind is a command line program that probably
looks the scariest of the tools we've used,
but you can also use it with help50, that
just tries to find what are called memory leaks in your program.
Recall that last week we introduced malloc,
and malloc lets you allocate memory.
But if you don't free that memory, by literally calling the free function,
you're going to constantly ask your operating system, MacOS, Linux,
Windows, whatever, can I have more memory?
Can I have more memory?
Can I have more memory?
And if you never, literally, hand it back by calling free your computer
may very well slow down or freeze or crash.
And frankly, if you've ever had that happen on your Mac or PC, very likely
that's what some human accidentally did.
He or she just allocated more and more memory
but never really got around to freeing that memory.
So Valgrind can help you find those mistakes before you or your users do.
So let's do a quick example, let me go to CS50 IDE, and let me go ahead
and make one new program here.
We'll call it memory.c because we'll see later today how
I might chase down those memory leaks.
But for now, let's start with something even simpler, which all of you
may have done by now, which is to accidentally touch memory
that you shouldn't, changing it, reading it and let's see what this might mean.
So let me do the familiar at the top here.
Include standard IO.
Well, let's not even do that yet.
Let's just do this first.
Let's do int, main(void), just to start a simple program
and in here let me go ahead and just call a function called f.
I don't really care what its name is for today.
I just want to call a function f, and then that's it.
Now this function f, let me go ahead and define it as follows, void f(void).
It's not going to do much of anything at all.
But let's suppose, just for the sake of discussion, that f's purpose in life
is just to allocate memory for whatever useful purpose,
but for now it's just for demonstration's sake.
So what's the function with which you can allocate memory?
AUDIENCE: Malloc.
DAVID J. MALAN: Malloc.
So suppose I want malloc space for, I don't know,
something simple like just one integer.
We're just doing this for demonstration purposes,
or actually let's do more, 10 integers, 10 integers.
I could, of course, do-- well, give me 10, but how many bytes do I want?
How many bytes do I need for 10 integers?
AUDIENCE: sizeof(int).
DAVID J. MALAN: Yeah, so I can do literally sizeof(int)
and most likely the size of an int is going to be?
AUDIENCE: Four.
DAVID J. MALAN: Four, probably.
On many systems today, it's just 4 bytes or 32 bits,
but you don't want to hard code that lest someone else's computer not use
those same values.
So the size of an int.
So 10 times the size of an int.
Malloc returns what type of data?
What does that hand me back?
AUDIENCE: [INAUDIBLE]
DAVID J. MALAN: Yeah, returns an address or a pointer.
Specifically, the address, 100, 900, whatever, of the chunk of memory
it just allocated for you.
So if I want to keep that around, I need to declare a pointer.
Let's just call it x for today that stores that address.
Could call it x, y, z, whatever, but it's not an int that it's returning.
It's the address of an int.
And remember, that's what the star operator now means.
The address of some data type.
It's just a number.
All right, so now if I were to--
first, let's clean this up.
Turns out that to use malloc, I need to include stdlib.h.
We saw that last week, albeit briefly, and then of course
if I'm going to call f, what do I have to do to fix this code?
AUDIENCE: You need to declare.
DAVID J. MALAN: Yeah, I need to declare it up here,
or I could just move f's implementation up top.
So I think this works, even though this program at the moment
is completely stupid.
It doesn't do anything useful, but it will allocate memory.
And I'll do something with it as follows.
If I want to change the first value in this chunk of memory,
well how might I do that?
Well, I've asked the computer for 10 integers or rather space
for 10 integers.
What's interesting about malloc is that when
it returns a chunk of memory for you it's contiguous, back-to-back.
And when you hear contiguous or back-to-back,
what kind of data structure does that recall to mind?
AUDIENCE: An array.
DAVID J. MALAN: An array.
So it turns out we can treat this just random chunk of memory
like it's an array.
So if we want to go to the first location in that array of memory,
I can just do this and put in the number say 50.
Or if I want to go to the next location, I can do this.
Or if I want to do the next location, I can do this.
Or if I want to go to the last location, I might do this,
but is that good or bad?
AUDIENCE: Bad.
DAVID J. MALAN: Why bad?
AUDIENCE: It's-- it's out of bounds
DAVID J. MALAN: Yeah, so it's out of bounds.
Right?
This is sort of week one style mistakes when it came to loops.
Recall, with for loops or while loops, you might go a little too far,
and that's fine.
But now we actually will see we have a tool that
can help us notice these things.
So hopefully, just visually, it's apparent that what I have going on here
is just-- on line 12, I have a variable x
that's storing the address of that chunk of memory.
And then on line 13, I'm just trying to access location 10
and set the value 50 there.
But as you note, there is no location 10.
There's location 0, 1, 2, 3, all the way through 9, of course.
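To make the indexing concrete, here is a small sketch under stated assumptions: fill_ints is a hypothetical name (the lecture's function is called f and returns nothing), and the line numbers mentioned in the lecture will not match this listing. It allocates space for n integers, treats the chunk like an array, and only touches locations 0 through n - 1.

```c
#include <stdlib.h>

// Hypothetical helper (not the lecture's f): allocates space for n ints
// and writes value into every valid location.
// Returns the number of slots written, or -1 on bad input or failed malloc.
int fill_ints(int n, int value)
{
    if (n <= 0)
    {
        return -1;
    }

    // n times the size of an int, rather than a hard-coded 4 bytes each
    int *x = malloc(n * sizeof(int));
    if (x == NULL)
    {
        return -1;
    }

    int written = 0;
    for (int i = 0; i < n; i++)  // valid indices are 0 through n - 1
    {
        x[i] = value;
        written++;
    }

    // x[n] = value;  // out of bounds: there is no location n

    free(x);  // hand the memory back once we're done with it
    return written;
}
```

With n equal to 10, the commented-out line would be the x[10] access discussed above: it may appear to work, but it touches memory that was never allocated.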
So how might we detect this with a program?
Well, let me go ahead and increase my terminal window just a bit
here, save my file, and let me go ahead and compile make memory.
OK, all is well.
It compiled without any error messages, and now
let me go ahead and run memory, enter.
All right, so that worked pretty well.
Let's actually be a little more explicit here just for good measure.
Let me go ahead and print something out.
So printf, %i for an integer, and let's make it just more explicit.
You inputted %i and then comma x bracket 10.
And what do I have to include to use printf?
AUDIENCE: stdio.h.
DAVID J. MALAN: Yeah, so stdio.
So let's just quickly add that, stdio.h, save.
All right, let me recompile this, make memory, enter.
And now let me go ahead and do ./memory.
Huh?
Feels like it's a correct program.
And yet, for a couple of weeks now we've been claiming that mm-hmm,
don't do that.
Don't go beyond the boundaries of your array.
So how do we reconcile this?
Feels like buggy code or at least we've told you it's buggy code,
and yet it's working.
Yeah?
AUDIENCE: [INAUDIBLE]
DAVID J. MALAN: That's a good way of putting it.
AUDIENCE: It's still very similar.
We want that.
DAVID J. MALAN: OK.
AUDIENCE: So we can theoretically--
it just created a program.
DAVID J. MALAN: Yeah, and I think if I heard you correctly,
you said C doesn't scream if you go too far?
AUDIENCE: Yeah.
DAVID J. MALAN: Yeah, OK.
So that's a good way of putting it.
Like, you can get lucky in C. And you can
do something that is objectively, pedagogically, like technically wrong,
but the computer's not going to crash.
It's not going to freeze because you just get lucky.
Because often, for performance reasons, when
you allocate space for 10 integers, you're
actually going to get a chunk of memory back
that's a little bigger than you need.
It's just not safe to assume that it's bigger than you need,
but you might just get lucky.
And you might end up having more memory that you can technically get away
with touching or accessing or changing, and the computer's not going to notice.
But that's not safe because on someone else's Mac or PC,
their computer might just be operating a little bit differently than yours,
and bam, that bug is going to bite them and not you.
And those are the hardest, most annoying bugs to chase down as some of you
might have experienced.
Right?
It works on your computer but not a friend's or vice versa.
These are the kinds of explanations for that.
So Valgrind can help us track down even these most subtle errors.
The program seems to be working.
Check50 or tools like it might even assume
that it's working because it is printing the right thing,
but let's take a look at what this program Valgrind thinks.
Let me increase the size of the terminal window here,
and go ahead and type in Valgrind ./memory.
So same program name ./memory but I'm prefixing it with the name Valgrind.
All right?
Unfortunately, Valgrind is really quite ugly,
and it prints out a whole bunch of stuff here.
So let's take a look.
At the very top, you'll see all these numbers on the left,
and that's just an unfortunate aesthetic.
But we do see some useful information.
Invalid read of size 4 and then it has these cryptic
looking letters and numbers.
What are those?
They're just addresses in hexadecimal.
It doesn't really matter what they are, but Valgrind
can tell us where the memory is that's acting up suspiciously.
You can then see next to that, that Valgrind is pointing
to function f on line 15 of memory.c.
So that's perhaps helpful, and then main on line 8
because that's the function that was called.
So Valgrind is actually kind of nice in that it's showing us all the functions
that you called from bottom up, much like the stack from last week.
And so something's going wrong line 15, and if we go back to that,
let's see line 15 was--
well, sure enough.
I'm actually trying to access that memory location
and frankly I did it on line 14 as well.
So hopefully fixing one or both of those will address this issue.
And notice here, this frankly just gets overwhelming pretty quickly.
And then, oh, 40 bytes in one block are definitely lost in loss record.
I mean, this is the problem with Valgrind, honestly.
It was written some years ago, not particularly user friendly,
but that's fine we have a tool to address this.
Let me go ahead and rerun Valgrind with help50,
enter, and see if we can't just assist with this.
All right, so still the same amount of black and white output but down here now
help50 is noticing, oh, I can help you with an invalid write of size 4.
So it's still at the same location, but this time--
or rather same file, memory.c but line 14.
And we propose, looks like you're trying to modify 4 bytes of memory that
isn't yours, question mark.
Did you try to store something beyond the bounds of an array?
Take a closer look at line 14 of memory.c.
So hopefully, even though Valgrind's output is crazy esoteric,
at least that yellow output will point you toward, ah, line 14.
I'm indeed touching 4 bytes, an integer, that I shouldn't be.
And so let's go ahead and fix this.
If I go into my program, and I don't do this.
Let's change it to location 9, and location 9 here and save.
Then let me go ahead and rerun Valgrind without help50.
All right, progress except--
oops.
Nope, no progress.
I skipped the step.
Yeah, I didn't recompile it.
A little puzzled why I saw the same thing.
So now let's rerun Valgrind and here it seems to be better.
So I don't see that same error message up
at the very top like we did before, but notice here, 40 bytes in one blocks.
OK, that was bad grammar in the program, but are definitely
lost in loss record 1 of 1.
So I still don't quite understand that.
No big deal.
Let's go ahead and run help50 and see what the second of two errors
apparently is here.
So here it's highlighting those lines.
40 bytes in one blocks are definitely lost, and looks like your program
leaked 40 bytes of memory.
Did you forget to free memory that you allocated with malloc?
Take a closer look at line 13 of memory.c.
So in this case line 13 indeed has a call to malloc.
So what's the fix for this problem?
AUDIENCE: Free.
DAVID J. MALAN: Per help50 or your own intuition?
What do I have to add to this program?
AUDIENCE: Free.
AUDIENCE: Free.
DAVID J. MALAN: Yeah, free, and where does that go?
Right here.
So we can free the memory.
Why would this be bad?
AUDIENCE: [INAUDIBLE]
DAVID J. MALAN: Exactly.
We're freeing the memory, which is like saying to the operating system,
I don't need this anymore.
And yet, two lines later we're using it again and again.
So bad.
We didn't make that mistake last week, but you should only
be freeing memory when, literally, you're
ready to free it up and give it back, which should probably
be at the end of the program.
So let me go ahead and re-save this, open up my terminal window,
recompile it this time, and now, let me run Valgrind one last time
without help50.
And still a little verbose, but zero errors, from zero contexts.
That sounds pretty good.
And moreover, it also explicitly says, all heap blocks were freed.
And recall that the heap, is that chunk of memory
that we drew visually up here, which is where malloc takes memory from.
So, done.
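Putting both fixes together, the corrected function might look roughly like the sketch below. This is an approximation rather than the exact code on screen: the lecture's f returns void, whereas this hypothetical f_fixed returns the stored value so it can be checked, and the NULL check is an added precaution.

```c
#include <stdio.h>
#include <stdlib.h>

// Sketch of the corrected function. Hypothetical variant of the lecture's
// f: it returns the stored value (or -1 if malloc fails).
int f_fixed(void)
{
    int *x = malloc(10 * sizeof(int));
    if (x == NULL)
    {
        return -1;
    }

    x[9] = 50;                          // last valid location, not x[10]
    printf("You inputted %i\n", x[9]);
    int value = x[9];                   // copy out before freeing

    free(x);                            // free only after the last use of x
    return value;
}
```

Calling free before the x[9] lines would be the mistake just discussed: the memory would be handed back and then used anyway.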
So this is kind of the mentality
to have when approaching the correctness of your code.
Like, it's one thing to run sample inputs, or run the program like I did.
All looked well.
It's one thing to run tools like check50, which we humans wrote.
But we too are fallible, certainly, and we might not think of everything.
And thankfully, smart humans have made tools, that at first glance,
might be a little hard to use.
Like debug50, as is Valgrind now.
But they ultimately help you get your code 100% correct
without you having to struggle visually over just staring at the screen.
And we see this a lot in office hours, honestly.
A lot of students, to their credit, sort of reasoning through, staring
at the screen, just trying to understand what's going wrong,
but they're not taking any additional input other than the characters
on the screen.
You have so many tools that can feed you more and more hints along the way.
So do acquire those instincts.
Any questions on this?
Yeah?
AUDIENCE: Sir, if you had a main function that took arguments.
Would you run Valgrind with those arguments as well?
DAVID J. MALAN: Yes, indeed.
So Valgrind works just like debug50, just like help50.
If you have command line arguments, just run them as usual,
but prefix your command with Valgrind, or maybe even help50 Valgrind,
to help one with the other.
Good question.
Other thoughts?
Yeah?
AUDIENCE: Where does the data go [INAUDIBLE]?
Say again.
An array of characters.
And even more specifically, it's a synonym S-T-R-I-N-G for what actual
data type?
char star, as we've called it.
So a char star is just the computer scientist's
way of describing a pointer to a character,
or rather the address of a character, which
is functionally equivalent to saying an array of memory, or sequence of memory.
But it's kind of the more precise, more technical way of describing it.
And so now that we know that we have char stars underneath the hood, well,
where is all of that coming from?
Well, indeed, it maps directly to that memory.
We keep pointing out that something like this is inside of your computer.
And we can think of the memory as just being chunks of memory,
all of whose bytes are numbered.
0 on up to 2 gigabytes, or 2 billion, whatever the value might be.
But of course last week, we pointed out that you think about this memory
not as being hardware per se, but as just being this pool of memory that's
divided into different regions.
The very top of your computer's memory, so to speak,
is what we call the text segment.
And what goes in the text segment of your computer's memory
when you're running a program?
Text is like, poor choice of words, frankly, but what is it?
Say again.
AUDIENCE: File Headers?
DAVID J. MALAN: Not the file headers, in this case.
This is in the context of running a program, not necessarily saving a file.
Yeah?
AUDIENCE: String literals.
DAVID J. MALAN: Not string literals here,
but they're nearby, actually, in memory.
AUDIENCE: Functions.
DAVID J. MALAN: Functions, closer.
Yeah.
The text segment of your computer's memory
is where, when you double click a program to run it,
or in Linux, when you do dot slash something, to run it.
That's where the zeros and ones of your actual program, the machine code,
that we talked about in week zero, is just loaded into RAM.
So recall from last week, that, you know, anything physical in this world--
hard drives, solid state drives, is slow.
So those devices are slow, but RAM, the stuff we keep pulling up on the screen,
is relatively fast.
If only because it has no moving parts.
It's purely electronic.
So when you double click a program on your Mac or PC,
or do dot slash something in Linux, that is
loading from a slow device, your hard drive,
where the data is stored long term, into RAM or memory,
where it can run much more quickly and pleasurably in terms of performance.
And so, what does this actually mean for us?
Well, it's got to go somewhere.
We just decided, humans, years ago that it's
going to go at the top, so to speak, of this chunk of memory.
Below that though, are the more dynamic regions of memory--
the stack and the heap.
And we said this a moment ago, and last week as well, what goes on the heap?
Or who uses the heap?
AUDIENCE: Dynamic memory.
DAVID J. MALAN: Dynamic memory.
Any time you call malloc, you're asking the operating system
for memory from the so-called heap.
Anytime you call free, you're sort of conceptually putting it back.
Like, it's not actually going anywhere.
You're just marking it as available for other functions and variables to use.
The stack, meanwhile, is used for what?
AUDIENCE: Local variables.
DAVID J. MALAN: Local variables and any of your functions.
So main, typically takes a sliver of memory at the bottom.
If main calls another function, it gets a sliver of memory above that.
If that function calls one, it gets a sliver of memory above that.
So they each have their own different regions of memory.
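A small sketch may help separate the two regions. The names here (heap_copy, n, p) are hypothetical and not from the lecture: the parameter n and the pointer variable p live in the function's own sliver of the stack, while the integer that p points to comes from the heap via malloc.

```c
#include <stdlib.h>

// Hypothetical helper (not from the lecture).
// n and p are local: they live in this function's sliver of the stack
// and disappear when the function returns.
// The int that p points to lives on the heap and stays allocated
// until someone calls free on it.
int *heap_copy(int n)
{
    int *p = malloc(sizeof(int));
    if (p != NULL)
    {
        *p = n;  // store a copy of n in the heap memory
    }
    return p;    // the caller is responsible for calling free
}
```

A caller would use it as `int *p = heap_copy(7);` and later `free(p);` to mark that heap memory as available again.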
But of course, these arrows, both pointing at each other,
doesn't seem like such a good design.
But the reality, is bad things can happen.
You can allocate so much memory that, bam, the stack overflows the heap.
Or the heap overflows the stack.
Thus was born websites like Stack Overflow, and the like.
But that's just a reality.
If you have a finite amount of memory, at some point,
something's going to break.
Or the computer's going to have to say, mm-mm, no more memory.
You're going to have to quit some programs, or close some files,
or whatnot.
So that was only to say that that's how the memory is laid out.
And we started to explore this by way of a few programs.
This one here-- it's a little dark here.
This one here, was a swap function.
Now it's even darker.
It was a swap function that actually did swap two values, A and B.
But it didn't actually work in the way we intended.
What was broken about this swap function last week?
AUDIENCE: Null.
DAVID J. MALAN: Null.
Maybe.
Maybe.
But it's not obvious because there's no mention of null in the program.
We might get lucky.
Null is just 0.
And sometimes we've seen that zeros are the default values in a program.
So maybe.
But I say, maybe, and I'm hedging why.
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Yeah.
And it doesn't allocate-- well, allocate, is not quite the right word.
That suggests you are allocating actual memory.
It's a garbage value.
There's something there.
Right?
My Mac has been running for a few hours.
And your Macs, and PCs, and phones, are probably running all day long.
Or certainly when the lid is up.
And so, your memory is getting used, and unused, and used.
Like, lots of stuff is going on.
So your computer is not filled with all zeros or all ones.
If you look at it at some random point in the day,
it's filled with like bunches and bunches of zeros and ones
from previous programs that you quit long ago.
Windows you have in the background and the like.
So, the short of it is, when you're running
a program for the first time on a computer that's been running now for some time,
it's going to get messy.
That big rectangle of memory is going to have some ones over here
some zeros over here and vice versa.
So they're garbage values, because those bytes have some values in them.
You just don't necessarily know what they are.
So the point is, you should never ever dereference a pointer
that you have not set yourself.
Maybe it will crash.
Maybe it won't crash.
Valgrind can help you find these things, but only sometimes.
But it's just not a safe operation.
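One defensive habit, offered here as an assumption rather than something stated in the lecture, is to set every pointer to a known value (NULL if nothing better is available) and to check it before dereferencing. The helper name read_or_default is hypothetical.

```c
#include <stddef.h>

// Hypothetical helper (not from the lecture): dereferences p only if the
// caller actually set it to something other than NULL.
int read_or_default(const int *p, int fallback)
{
    if (p == NULL)
    {
        return fallback;  // nothing to dereference
    }
    return *p;            // safe only if p holds a valid address
}
```

Note the limit of this check: it catches a pointer that was deliberately set to NULL, but it cannot detect a garbage value. A pointer that was never set at all still has to be avoided by initializing it in the first place.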
And lastly, the last thing we introduced last week,
which will be the stepping stone for what problems we'll solve this week,
was struct.
So struct is kind of cool, in that you can design your own custom data
structures.
C is pretty limited out of the box, so to speak.
You only have chars and bools, and floats, and ints, and doubles,
and longs, and str--
well, we don't even have strings, per se.
So it doesn't really come with many features, like a lot of languages do.
Like Python, which we'll see in a few weeks.
So with struct in C, you have the ability
to solve some problems of your own.
For instance, with the struct, we can actually
start to implement our own features.
Or our own data types.
For instance, let me go up here.
And let me go ahead and create a file called say,
student, or rather struct dot h.
So recall that dot h is a header file.
Thus far, you have used header files that other people made.
Like, CS50 dot h, and standard IO dot h, and standard lib dot h,
but you can make your own.
Header files are just files that typically contain code that you
want to share across multiple programs.
And we'll see more of this in time.
So let me go ahead and just save this file.
And suppose that I want to represent a student in memory.
A student of course, is probably going to have what?
For instance, how about a string for their name,
a string for their dorm-- but string is kind of two weeks ago.
Let's call this char star.
And let's call name, char star.
And so you might want to associate like, multiple pieces of data with students.
Right?
And you don't want to have multiple variables, per se.
It would be nice to kind of encapsulate these together.
And recall at the very end of last week, we
saw this feature where you can define your own type,
with typedef, that is a structure itself.
And you can give it a name.
So in short, simply by writing these lines of code,
you have just created your own custom data type.
It's now called student.
And every student in the world shall have, per this code, a name
and a dorm associated with them.
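The type described above can be sketched as follows. The two fields follow the lecture's description; the helper make_student is hypothetical and only there to show the type in use.

```c
// Sketch of the custom type: a student has a name and a dorm,
// both stored as char * (what we previously called string).
typedef struct
{
    char *name;
    char *dorm;
}
student;

// Hypothetical helper (not from the lecture): bundles a name and a dorm
// into one student value.
student make_student(char *name, char *dorm)
{
    student s;
    s.name = name;
    s.dorm = dorm;
    return s;
}
```

For instance, `student s = make_student("Maria", "Cabot");` would give one variable that carries both pieces of data (the name and dorm here are made-up sample values).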
Now, why is this useful?
Well the program, we looked at the very end of last time looked
a little something like this.
In struct zero dot c, we had the following,
I first allocated some amount of space for student.
I asked the user what's the enrollment in the class or whatnot?
That gives us an int.
And then, we allocated an array of type student, called students, plural.
This was an alternative, recall, to doing something
like this, string names enrollment, and string dorms enrollment.
Which would work.
You could have two separate arrays, and you'd just
have to remember that name zero and dorm zero is the same human.
But why do that if you can keep things together.
So with structs, we were able to do this.
Give me this many student structures, and call the whole array, students.
And the only new syntax we introduce to satisfy this goal, was what operator?
AUDIENCE: The dot.
DAVID J. MALAN: The dot.
Yeah.
So in the past, recall from like week two, we introduced arrays.
And arrays allow you to do square bracket notation.
So that is no different from a couple of weeks back.
But if your array is not storing just integers, or chars, or floats,
or whatever, it's actually storing a structure, like a student,
you can get at that student's name by literally just saying dot name.
And you can get at their dorm by doing dot dorm.
And then everything else is the same.
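As a rough illustration of combining square brackets with the dot operator, the sketch below searches an array of students by name. The function find_dorm and its parameters are hypothetical; only the student type with its name and dorm fields comes from the lecture.

```c
#include <string.h>

typedef struct
{
    char *name;
    char *dorm;
}
student;

// Hypothetical helper (not from the lecture): returns the dorm of the
// first student whose name matches, or NULL if there is no such student.
char *find_dorm(student students[], int enrollment, const char *name)
{
    for (int i = 0; i < enrollment; i++)
    {
        // students[i] picks one structure; .name and .dorm pick its fields
        if (strcmp(students[i].name, name) == 0)
        {
            return students[i].dorm;
        }
    }
    return NULL;
}
```

With two parallel arrays, the same lookup would have to index names[i] and dorms[i] separately and trust that both arrays stay in the same order.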
This is what's called, encapsulation.
And it's kind of like a fundamental principle of programming
where, if you have some real world entity, like a student,
and you want to represent students with code, yeah,
you can have a bunch of arrays called names, dorms, emails, phone
numbers, but that just gets messy.
You can instead encapsulate all of that related information about a student
into one data structure so that now you have, per week zero, an abstraction.
Like, a student is an abstraction.
And if we break that abstraction, what is a student actually?
Not in the real world, but in our code world here?
Student is an abstraction.
It's a useful word, all of us can kind of agree means something,
but technically, what does it apparently mean?
A student is actually a name and a dorm, which really kind of is
diminutive to everyone in this room, but we've distilled it in code
to just those two values.
So there we have encapsulation.
You're kind of encapsulating together multiple values.
And you're abstracting away the details just to have a more useful term,
because no one is going to want to talk in terms of lines of code
to describe anything.
So, same topic as in the past.
So, now we have the ability to come up with our own custom data structures
it seems.
That we can store anything inside of them that we want.
So let's now see how poorly we've been designing
some things for the past few weeks.
So it turns out that much of the code
we've been writing in recent weeks has, hopefully, been correct,
but we've been not necessarily designing solutions in the best way.
Recall that when we have this chunk of memory,
we've typically treated it as at most, an array.
So just a contiguous chunk of memory.
And thanks to this very simple mental model, do we get strings,
do we get arrays of students now.
But arrays aren't necessarily the best data structure in the world.
Like, what is a downside of an array if you've encountered one thus far?
Recall a couple of weeks ago, we started talking about efficiency and design.
What's the running time of resizing an array?
AUDIENCE: Too long.
DAVID J. MALAN: Say Again.
AUDIENCE: I said, too long.
DAVID J. MALAN: Too long.
Fair.
But let's be more precise.
Big O of-- big O of what?
AUDIENCE: N.
DAVID J. MALAN: N. What's n?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: OK.
True.
But what does n represent?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Yeah.
So you don't actually have to know.
It's just a general answer.
In this case, however long the array is, call it n.
It is that many steps to resize it into that plus 1.
Technically it's big O of n plus 1.
But recall in our discussion of big O notation, we just
ignore the smaller terms-- the plus 1s, the divided by 2s, the plus n.
We focus only on the most powerful term in the expression, which
is just n here.
So yes, if you have an array of size 2, and you resize it
into an array of size 3, or really, n plus 1, that's
going to take me roughly n steps.
Technically n plus 1 steps.
But n steps.
Ergo big O of n.
So it's a linear process.
So possible but not necessarily the fastest
thing because he literally had to move all those damn values around.
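The copying cost described here can be sketched in C. This is a minimal
illustration, not code from the lecture: grow_by_one is a hypothetical
helper, and it assumes the old array came from malloc.

```c
#include <stdlib.h>

// Hypothetical helper (not from the lecture): grow an int array from
// n elements to n + 1 by allocating a new chunk and copying every value.
// Returns the new array, or NULL if malloc fails (old array left intact).
int *grow_by_one(int *old, int n, int value)
{
    int *bigger = malloc((n + 1) * sizeof(int));
    if (bigger == NULL)
    {
        return NULL;
    }

    // The linear part: one copy per existing value, so n steps
    for (int i = 0; i < n; i++)
    {
        bigger[i] = old[i];
    }

    // Plus 1 step for the new value
    bigger[n] = value;

    free(old);
    return bigger;
}
```

Growing an array of size 2 into size 3 this way copies 2 values and then
writes 1 more, which is the "n plus 1 steps" counted above.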
So what would be better than this?
And if you've programmed before, you might have the right instincts already.
How do we solve this problem?
Yeah?
AUDIENCE: Would you allocate more memory at the end of the array?
DAVID J. MALAN: Reallocate more memory at the end of the array.
So it turns out C does have a function called realloc,
perfectly, if not obviously, named, that reallocates memory.
And if you pass it the address of a chunk of memory you've allocated,
and the operating system notices, oh, yeah you got lucky.
I've got more memory at the end of this array,
it will then allocate that additional RAM for you, and let you use it.
Or worst case, if there's nothing available at the end
of the array in memory, because it's being
used by something else in your program.
That's fine.
Realloc will take on the responsibility of creating another array somewhere
in memory, copying all of that data for you into it,
and returning the address of that new chunk of memory.
Unfortunately, that's still linear.
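A minimal sketch of calling realloc, under the assumption that the array
was originally obtained from malloc. The helper name append_value is
hypothetical. Note that realloc returns NULL on failure and leaves the
original chunk untouched, which is why the result goes into a temporary
pointer first.

```c
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper: append one value to a heap-allocated int array
// of length n. On success, *array may point somewhere new in memory.
// Returns true on success, false on failure (original array still valid).
bool append_value(int **array, int n, int value)
{
    // Use a temporary so the original pointer isn't lost if realloc fails
    int *tmp = realloc(*array, (n + 1) * sizeof(int));
    if (tmp == NULL)
    {
        return false;
    }

    // Whether or not the chunk moved, the first n values are preserved
    tmp[n] = value;
    *array = tmp;
    return true;
}
```

The caller never learns whether the data was copied, only where it lives
now, so the worst case remains linear in the length of the array.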
Yeah?
AUDIENCE: Is this all being done in the heap?
Or--
DAVID J. MALAN: This is all being done in the heap.
Malloc, and realloc, and free, all operate on the heap.
Yes.
So that is a solution, but it doesn't really speak to the efficiency.
Yeah?
AUDIENCE: Could you use a linked list?
DAVID J. MALAN: Yeah.
What is a linked list?
Go ahead.
AUDIENCE: It's when you have an element that points to different elements.
DAVID J. MALAN: OK.
Points to other elements.
Yeah.
So let me speak to what's the fundamental issue here.
The fundamental problem is much like painting yourself into a corner,
so to speak, as the cliche goes.
With an array, you're deciding in advance how big the data structure is
and committing to it.
Well, what if you just do the opposite?
Don't do that.
If you want initially, room for just one value, say one integer,
only ask the computer for that.
Give me space for one integer and I'll put my number 42 in here.
And then, if and only if, you want a second integer,
do you ask the computer for a second integer.
And so the computer, via malloc, or whatnot, will give you another one
like, the number 13.
And if you want a third, just ask the same question of the operating system.
Each time just getting back one chunk of memory.
But there's a fundamental gotcha here.
There's always a trade off.
So yes, this is possible.
You can call malloc three times.
Each time asking for a chunk of memory of size 1, instead of size 3,
for instance.
But what's the price you pay?
Or what problem do we still need to solve?
Yeah?
AUDIENCE: They're not stored next to each other.
DAVID J. MALAN: Yeah.
They're not being stored next to each other.
So even though I can think of this as being the first element, the second,
and the third, you do not have, in this story, random access to elements.
And random access, ergo, random access memory, or RAM,
just means that arithmetically, like, mathematically, you
can jump to location 0, location 1, location 2, randomly, or in constant
time.
Just instantly.
Because if they're all back to back to back, all you have to do is like,
add 1, or add 4, or whatever to the address, and you're there.
But the problem is, if you're calling malloc again and again
and again, there's no guarantee that these things are even
going to be proximal to one another.
These second chunks of memory might end up--
if this is a big chunk of memory we've been talking about,
where the heap's up here, and the stack's down here--
42 might end up over here.
The next chunk of memory, 50, might end up over here.
The third chunk might end up over here.
So you can't just jump from location 0, to 1, to 2,
because you have to somehow remember where location 0, and 1, and 2, are.
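To make the trade-off concrete, here is a small sketch that asks for one
integer at a time. The helper names one_int and show are hypothetical.
The addresses printed are whatever malloc happens to return, so nothing
here should be read as a guarantee about where the chunks land.

```c
#include <stdio.h>
#include <stdlib.h>

// Ask for space for exactly one int and store a value in it.
// Returns the address of the new chunk, or NULL if malloc fails.
int *one_int(int value)
{
    int *p = malloc(sizeof(int));
    if (p != NULL)
    {
        *p = value;
    }
    return p;
}

// Print where each chunk landed. Unlike an array, there is no promise
// that b is one int past a, or that c is one int past b.
void show(int *a, int *b, int *c)
{
    printf("%p %p %p\n", (void *) a, (void *) b, (void *) c);
}
```

Calling one_int three times gives three separate pointers, and each one
has to be remembered by the program, because none can be computed from
the others by adding 1 or 4 to an address.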
So how do we solve this?
Even if you haven't programmed before, like, what would a solution be here?
Someone else.
Almost there.
Yeah?
AUDIENCE: First should point to 5.
DAVID J. MALAN: OK.
So first, or [? Comey, ?] could you point to 5?
And that's fine.
You don't even have to move.
Right?
This is the beauty of a linked list.
It doesn't matter where you are in memory,
it's the whole beauty of these pointers, where you can literally
point at that other location.
It's not an array where they need to be standing back to back to back.
They can be pointing anywhere.
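In code, each volunteer corresponds to a small struct holding a number
and a pointer to the next one. The field names number and next follow
the ones used later in this lecture; make_node is a hypothetical helper
added for illustration.

```c
#include <stdlib.h>

typedef struct node
{
    int number;
    struct node *next;
}
node;

// Hypothetical helper: allocate one node holding `number`, with its
// next pointer set to NULL (pointing at the floor) instead of being
// left as a garbage value.
node *make_node(int number)
{
    node *n = malloc(sizeof(node));
    if (n != NULL)
    {
        n->number = number;
        n->next = NULL;
    }
    return n;
}
```

With this, `node *first = make_node(5);` followed by
`first->next = make_node(9);` links two nodes no matter where malloc
placed them in memory.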
All right.
Let's go ahead and insert one more.
Who wants to be say, 55?
Big value.
Yeah.
Come on down.
All right.
What's your name?
[? KYONG: ?] [? Kyong. ?]
DAVID J. MALAN: [? Kyong. ?] OK.
So come on over.
So we've just malloced [? Kyong ?] from the audience.
I've given him his end value of 55.
His left hand is just some garbage value right now.
How do we insert [? Kyong ?] in the right order?
Where is he obviously supposed to go?
In sorted order, he obviously belongs at the end.
But here's the catch with the linked list.
Just like when we've discussed searching and sorting in the past,
the computer is pretty blind to all but just one value.
And the linked list, at the moment--
like, I don't know that these three, these four, exist.
All I know really, is that [? Comey ?] exists.
Because this first pointer is the only access
to the rest of the elements.
And so what's cool about a linked list, but perhaps not obvious,
is that you only--
the most important value is the first.
Because from the first value, you can get to everyone else.
It's not useful-- excuse me-- for me to remember-- Andrea?
--Andrea alone, because if I do, I've just
lost track of [? Comey ?] and more importantly, because of his number,
Eric.
So all I have to do really, is remember [? Comey. ?]
So if the goal now is to insert number 55, what steps should come first?
No pun intended.
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Say again.
AUDIENCE: Finding the first space.
DAVID J. MALAN: OK.
Finding the first space.
So I'm going to start at [? Comey, ?] and I'm going to follow this pointer.
Number 5, does 55 belong here?
No.
So I'm going to follow this pointer and get to Andrea.
Does 55 belong here?
No.
Gonna follow her pointer, and 22, does it belong here?
No.
I follow this pointer, 26?
No.
But you have a free hand, it turns out.
So what step should come next?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: We could have you point at 55, and now done.
So relatively simple, but what was the running time of this?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: It's big O of n.
It's linear.
Because I had to start at the beginning, even though we
humans have the luxury of just eyeballing it.
Saying, oh, obviously, he belongs way at the end.
Mm-mm.
Not in code.
Like, we have to start at the beginning to traverse the whole darn list,
until we get linearly to the very end.
And now we're done.
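The walk to the end of the list can be sketched as below. This is an
illustrative version, not the lecture's own code; append_to_list is a
hypothetical name, and it takes a pointer to the first pointer so that
an empty list can be handled too.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct node
{
    int number;
    struct node *next;
}
node;

// Append `number` to the end of the list that *first points to.
// Returns false if malloc fails.
bool append_to_list(node **first, int number)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return false;
    }
    n->number = number;
    n->next = NULL;

    // Empty list: the new node becomes the first one
    if (*first == NULL)
    {
        *first = n;
        return true;
    }

    // Otherwise follow next pointers until reaching a node with a
    // "free hand", which takes one step per existing node
    node *ptr = *first;
    while (ptr->next != NULL)
    {
        ptr = ptr->next;
    }
    ptr->next = n;
    return true;
}
```

The while loop is the linear part: appending 55 after 5, 9, 22, and 26
visits all four existing nodes before the new one can be linked in.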
Let's try one last one.
How about 20?
Yeah.
Great.
Come on down.
What's your name?
JAMES: James.
DAVID J. MALAN: James.
All right, James.
All right.
So we just malloced James, given him the number 20.
He obviously belongs roughly in the middle.
What's the first step?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Sorry?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: All right.
So we start with [? Comey, ?] again.
All right.
First, OK.
5, do you belong here?
No.
Let me follow the link.
OK, 9, do you belong here?
No.
Do you belong at 22-- ooh.
But what did I just do wrong?
AUDIENCE: Pointer.
DAVID J. MALAN: It's a pointer.
And it's a pointer to one of these things that we created earlier.
So we're not doing students anymore with our structures.
We're implementing nodes, which have numbers and next pointers.
So it turns out that if n is a pointer to a node--
recall that dot notation from before--
this is not how you access number in this case.
Because n is not a node itself.
It's a pointer.
But if n is a pointer, how do you go to a pointer?
How do you go to an address?
With what notation?
AUDIENCE: Star.
DAVID J. MALAN: Star.
So recall from last week, if we want to go to an address,
you could do syntax like this.
Ignore the parentheses for a moment.
Just *n means if n is an address of a chunk of memory, *n means go there.
Once you're there, you're conceptually right here-- top left-hand corner.
How do you access individual fields like number or next?
You use dot notation.
So if you literally do (*n).number, that means go to the address and access
the number field.
There is nice syntactic sugar in C, which
is just a fancy way of saying shorthand notation, where it's just the arrow.
But that's all it is.
This arrow notation doesn't do anything new.
It just combines, go there, with, access a field in a struct, all in one breath
if you will.
And this just looks a little prettier.
When I told our volunteers earlier, point your hand
down at the floor, that's all that line of code is doing.
It's saying, go to n's address, which is here, access the next field,
and write in that field NULL, which is just
the address 0-- the default, special address, like pointing at the floor.
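The two notations can be put side by side in a short sketch. The struct
matches the fields named above; init_node is a hypothetical helper used
only for illustration.

```c
#include <stddef.h>

typedef struct node
{
    int number;
    struct node *next;
}
node;

// Fill in a node through a pointer, using both notations.
void init_node(node *n, int number)
{
    // Long form: go to the address, then use dot notation.
    // The parentheses matter: *n.number would parse as *(n.number),
    // which does not compile because n is a pointer, not a struct.
    (*n).number = number;

    // Arrow form: the same "go there, then access a field" in one step.
    // Writing NULL here is the "point your hand at the floor" step.
    n->next = NULL;
}
```

Both lines compile to the same kind of access; the arrow is only
shorthand.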
This line of code, 40, is just a quick error check.
if (numbers)-- what is that equivalent to?
That's actually just saying, if numbers does not equal NULL, or if (numbers != NULL).
So if numbers is legitimate, if malloc worked correctly, then let's go ahead
and do the following.
Phew.
This is a mouthful.
What is going on here?
So this is a for-loop that's not using numbers.
Well, or is it?
Almost every for-loop we've written and you've probably written just
uses i, j, maybe k, but just integers probably.
But that doesn't have to be the case.
What is a pointer?
It's an address.
What is an address?
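A for-loop over pointers rather than integers might look like the
following sketch. It assumes numbers points to the first node of a
linked list (or is NULL for an empty list); count_nodes is a
hypothetical function, not the code shown on screen in the lecture.

```c
#include <stddef.h>

typedef struct node
{
    int number;
    struct node *next;
}
node;

// Count the nodes reachable from `numbers` by following next pointers.
int count_nodes(node *numbers)
{
    int count = 0;

    // Start at the first node, stop at NULL, advance by following next
    for (node *ptr = numbers; ptr != NULL; ptr = ptr->next)
    {
        count++;
    }
    return count;
}
```

The three parts of the loop play the same roles as `int i = 0`,
`i < n`, and `i++`, except that the variable being updated is an
address.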
OK.
Let's do this.
What is noteworthy?
Yeah?
AUDIENCE: Multiples of 11.
DAVID J. MALAN: What's that?
AUDIENCE: They're multiples of 11.
DAVID J. MALAN: They are multiples of 11.
That was just to make them look pretty though by the author here.
Yeah?
AUDIENCE: [INAUDIBLE].
DAVID J. MALAN: Yeah.
There's a mathematical significance too.
Like, no matter what node or circle you look at, the value in it
is bigger than the left child and it's smaller than the right child.
So it's kind of in-between.
Any circle you look at, the number to the left is smaller,
the number to the right is bigger.
And I think that applies universally all over the place.
Yes?
So what does that mean?
Well, recall from, like, week 0 when we had a whole bunch of phone book pages
that we were searching--
1, 2, 3, 4, 5, 6.
Let's give ourselves a 7th one.
Recall that when we did divide and conquer, or binary search,
we did it on an array.
And what was nice about binary search was we started in the middle,
and then we maybe went left, or we maybe went right,
and we kind of divided and divided and divided
and conquered the problem much more efficiently in logarithmic time
than it would have been if we did it linearly.
But we know now weeks later that arrays are kind of limiting, right?
If I keep storing all of my values in an array,
what can I not do with the array?
Every row in the tree has half as many elements as the one below it.
And so the implication of that is just like from week 0 in the phone book
when we're dividing, and dividing, and dividing in half, and half, and half.
So this is only to say, now that we have structures and pointers,
we can build something like this.
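A sketch of such a tree and a search over it, assuming the ordering
property described above (smaller values to the left, larger to the
right). The names tree_node and search are illustrative. The search is
only logarithmic when the tree is balanced, as in the diagram.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct tree_node
{
    int number;
    struct tree_node *left;
    struct tree_node *right;
}
tree_node;

// Return true if `number` is somewhere in the tree rooted at `tree`.
bool search(tree_node *tree, int number)
{
    if (tree == NULL)
    {
        return false;
    }
    else if (number < tree->number)
    {
        // Everything smaller lives in the left subtree
        return search(tree->left, number);
    }
    else if (number > tree->number)
    {
        // Everything bigger lives in the right subtree
        return search(tree->right, number);
    }
    else
    {
        return true;
    }
}
```

Each call throws away one whole subtree, which is the same dividing in
half that binary search does on a sorted array.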
But let's try one other example here too.
This is a crazy looking example.
But it's kind of amazing.
Suppose that, if we wanted to store a dictionary of words--
so not humans' names this time, but English words.
So Merriam Webster or Oxford English Dictionary has what?
Thousands, hundreds of thousands of words
these days in English for instance?
How do you actually store those?
Well, if you just look up words in a dictionary back in yesteryear,
that is linear.
You have to start at the beginning and look through it
page by page, looking for words.
Or you could be a little smarter.
Because the words in any dictionary are hopefully alphabetized,
you can do the Mike Smith-style divide and conquer by going to the middle,
then the middle of the middle, and so forth--
log of n.
But what if I told you, you could look up words in constant time--
some fixed number of steps?
None of this divide and conquer complexity.
No log n.
Just constant time-- you want a word, go get it instantly.
That's where this last structure comes in, which is called a trie--
T-R-I-E-- short for retrieval, even though it's pronounced the opposite.
So a trie is a tree each of whose nodes is an array.
So it's like this weird Frankenstein's monster kind of data structure.
We're just really combining lots of different ideas, as follows.
And the way a trie works, as is implied by this partial diagram on the board,
is that if you want to store the name Brian, for instance,
in your dictionary-- it's the first word--
what you do is you start by creating a tree with just one node.
But that node is effectively an array.
That array is of size, let's say for simplicity, 26.
So A through Z. This location here therefore represents B for Brian.
So if I want to insert Brian into this tree, I create one node at the top.
And then for the second letter in his name, R,
I create another node, also an array, A through Z.
And so here, I put a pointer to this node here.
B-R-I. So I should have drawn some more boxes.
A, B, C, D, E, F, G, H, I. So here, I'm going to draw another pointer to B--
wait.
Bian.
[LAUGHTER]
OK.
That's wrong.
Billy shall be our name.
Billy is at B. Wait.
No.
Dammit.
B, B. B-I-A-- yes, this works.
This works.
OK.
Sorry.
So here we go.
We're inserting Billy into this fancy data structure.
So the first node represents the first letter.
The second node represents the second letter.
The third node represents the third letter.
And so forth.
But what's cool about this is the re-usability.
So notice if this is the second letter and I counted this out correctly,
I, this is going to lead to a third node deeper
in the tree where it's L that we care about next, and then another one
down here which represents another L.
And I'll start drawing the letters.
L. This is B. This is I. L. And we'll call this L.
And then, finally, another one over here, which is a Y. And this
gets pointing down here.
This gets pointing here.
And so forth.
So in short, we have one node essentially
for every letter in the word that we're inserting into the data structure.
Now, this looks stupidly inefficient at the moment.
Because to store B, I, L, L, Y, how much memory did I just use?
26 plus 26 plus 26 plus 26 plus 26.
Just to store five characters, I use 26 times 5.
But this is kind of thematic in computer science--
spend a little more space, and I bet I can decrease the amount of time
it takes to find anyone.
Because now no matter how many other students are in this data structure--
and for instance, let's do another one.
If we had another one, like Bob--
so B is the same first letter.
That leads us to this second node.
O is somewhere else in this array, say, over here.
So this represents O. And then Bob has another one.
So there's going to be another array here.
And this is why the picture above draws this so succinctly.
This is how we might store Bob.
So B, I, L, L, Y. Or you can follow a different route, B, O, B.
So we can start to reuse some of these arrays.
So there's where you start to get some of the efficiency.
Any time names share a few letters, then you start reusing those same nodes.
So it's not super, super wasteful.
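The reuse of nodes can be sketched as follows. This is a simplified
illustration that assumes words contain only lowercase 'a' through 'z';
all names here are hypothetical. It allocates one child node per letter
below the root, so its node counts differ slightly from the drawing on
the board, and it does not yet mark where a word ends.

```c
#include <stdlib.h>

#define LETTERS 26

typedef struct trie_node
{
    struct trie_node *children[LETTERS];
}
trie_node;

// A new node whose 26 pointers all start out as NULL
trie_node *new_trie_node(void)
{
    trie_node *n = malloc(sizeof(trie_node));
    if (n != NULL)
    {
        for (int i = 0; i < LETTERS; i++)
        {
            n->children[i] = NULL;
        }
    }
    return n;
}

// Insert a lowercase word. Returns how many new nodes had to be
// allocated, or -1 if memory ran out.
int trie_insert(trie_node *root, const char *word)
{
    int created = 0;
    trie_node *ptr = root;
    for (int i = 0; word[i] != '\0'; i++)
    {
        int index = word[i] - 'a';
        if (ptr->children[index] == NULL)
        {
            ptr->children[index] = new_trie_node();
            if (ptr->children[index] == NULL)
            {
                return -1;
            }
            created++;
        }
        // Reuse the existing node if this prefix was seen before
        ptr = ptr->children[index];
    }
    return created;
}

// Free a node and everything below it
void trie_free(trie_node *n)
{
    if (n == NULL)
    {
        return;
    }
    for (int i = 0; i < LETTERS; i++)
    {
        trie_free(n->children[i]);
    }
    free(n);
}
```

In this sketch, inserting "billy" into an empty trie allocates 5 nodes,
while inserting "bob" afterward allocates only 2, because the node for
the shared first letter already exists.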
But the question now is, if there's like 1,000 students in the class,
or 1,000 students in the room, we're going to have a lot of nodes
there on the board.
But how many steps does it take to find Billy,
or Bob, or any name with this data structure, and to conclude yes or no
that student is in the class?
So, like, five for Billy, three for Bob.
And notice none of that math has any relationship
to how many students are in the room.
If we instead wrote out a long list of 1,000 names, in the worst case,
it might take me 1,000 steps to find Billy or Bob.
Maybe I could be a little smarter if I sort it.
But in the worst case, big O of n, it's linear.
Or if I used a hash table before, and maybe there's
1,000 students in the room, but, OK, there's
26 letters in the English alphabet at least.
So that's 26 buckets.
So maybe it's 1,000 divided by 26, worst case,
if I'm using those linked lists inside my array.
But wait a minute.
If I'm using this structure, a trie, where every node in the tree
is just an array that leads me to the next node, a la breadcrumbs, B, I, L, L,
Y is 5 and always 5.
B, O, B is always 3.
B, R, I, A, N would have been 5 as well.
None of these totals has any impact or any influence
from the number of total names in the data structure.
So a trie in some sense is this amazing holy grail
in that, by combining these various data structures, now you get constant time,
but you do pay a price.
And just to be clear, what is the price we seem to be paying?
AUDIENCE: Memory.
DAVID J. MALAN: Memory.
And in fact, this is why I'm not really drawing it much more.
Because it just becomes a big mess on the screen because it's
hard to draw such wide data structures.
It's taking a huge amount of memory.
But theoretically, it's becoming faster.
Yeah?
Question.
AUDIENCE: So how would you deal with a case where someone is named Bob,
but then another kid is named Bobby?
DAVID J. MALAN: Good question.
So it's a bit of a simplification.
If you were storing both Bob and Bobby, you would actually keep going.
So each of these elements is not just one letter.
You also have essentially a node there or some other data structure
that says either stop here or continue.
And you'll see actually in the problems that we'll
propose to you how you can represent that idea if you
choose to go this route.
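One common way to represent "stop here or continue" is a boolean in
each node. The sketch below assumes that approach and lowercase
'a' through 'z' input; is_word, trie_insert, and trie_check are
hypothetical names, not necessarily what the problem set uses.

```c
#include <stdbool.h>
#include <stdlib.h>

#define LETTERS 26

typedef struct trie_node
{
    bool is_word;                         // true if a word ends here
    struct trie_node *children[LETTERS];
}
trie_node;

trie_node *new_trie_node(void)
{
    trie_node *n = malloc(sizeof(trie_node));
    if (n != NULL)
    {
        n->is_word = false;
        for (int i = 0; i < LETTERS; i++)
        {
            n->children[i] = NULL;
        }
    }
    return n;
}

// Insert a lowercase word; returns false if memory ran out
bool trie_insert(trie_node *root, const char *word)
{
    trie_node *ptr = root;
    for (int i = 0; word[i] != '\0'; i++)
    {
        int index = word[i] - 'a';
        if (ptr->children[index] == NULL)
        {
            ptr->children[index] = new_trie_node();
            if (ptr->children[index] == NULL)
            {
                return false;
            }
        }
        ptr = ptr->children[index];
    }
    ptr->is_word = true;  // "stop here" marker for this word
    return true;
}

// Look up a word: one step per letter, regardless of how many
// other words are stored
bool trie_check(const trie_node *root, const char *word)
{
    const trie_node *ptr = root;
    for (int i = 0; word[i] != '\0'; i++)
    {
        ptr = ptr->children[word[i] - 'a'];
        if (ptr == NULL)
        {
            return false;
        }
    }
    return ptr->is_word;
}

void trie_free(trie_node *n)
{
    if (n == NULL)
    {
        return;
    }
    for (int i = 0; i < LETTERS; i++)
    {
        trie_free(n->children[i]);
    }
    free(n);
}
```

With both "bob" and "bobby" inserted, the node reached after B, O, B has
is_word set to true and also a child for the next B, so the lookup can
either stop there or keep going. A prefix such as "bo" is reachable but
not marked, so it is reported as absent.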
Indeed, the challenge ahead ultimately is something quite like this.
You will implement your very own spell checker.
And we will give you code that gets you started with this process.
And of course, a spell checker these days in Google Docs
and Microsoft Word just underlines in red misspelled words.
But what's going on?
And how is it that Word or Google Docs can
spell check your English or whatever language so quickly?
Well, it has a dictionary in memory, probably with tens of thousands
or hundreds of thousands of words.
And all they're doing constantly is, every time you type a word
and hit the Spacebar, or Period, or Enter,
it's quickly looking up that new word or those words in its dictionary
and saying, yes or no, should I squiggle a red line underneath this word.
And so what we're going to do is give you a big text file, ASCII text,
containing 100-plus thousand words.
You're going to have to decide how to load those
into memory, not just correctly, but in a way that's well designed.
And we'll even give you a tool, if you choose to use it,
that times how long your code takes.
And it even counts how much RAM you're actually using.
But the key goals for this week and our final week in C
is to take some of these basic building blocks,
like arrays, and pointers, and structures,
and decide for yourselves how you're most comfortable stitching them
together, to what extent you want to really fine tune your code beyond just
getting it correct, and to give you a better sense of the underlying code
that people have had to write for years in libraries
to make programming doable, a la Scratch.
Because in just a few weeks, we're going to transition to Python.
And the dozens of lines of code you've been writing now
are going to be whittled down to one line, two lines,
because we're going to get a lot more features from these newer,
fancier languages.
But you'll hopefully have an appreciation of what is actually
going on underneath that hood.
So I'll stick around for any one-on-one questions.
Let's call it a day.
Take a duck on your way out for roommates as well.
And we'll see you next time.