
Printf Case Study

Prerequisites: Basic Pointers, Arrays, String Literals, Stack Frames


Goal: Reinforce our understanding of pointers by solving a real problem that requires them (can't be
solved without pointers).
Overview: In this case study we'll exploit our knowledge of stack frames on Visual Studio for x86 and
write some C code that reads parameter values directly from the stack. This C code will be specific to
Visual Studio on x86 platforms (i.e., it is NOT portable code and is NOT a good example of how to solve
this type of problem in general), but it will very nicely demonstrate how we can draw diagrams of the
internal state of our program variables and then use those diagrams to write correct, non-trivial code
using pointers.
C/C++ language constructs and concepts demonstrated: character strings, int* pointers, char* pointers,
char** pointers


Background
Printf is a function we use all the time and we rarely (if ever) give any thought to how it works. We know
it performs output (to the screen), and that it will format and output just about anything we want it to.
For example, we can output something simple like:
printf("Hello World!\n");

Or, we can output something more complex like:
printf("The circuit's impedance is %f + %fj\n", real_imp, imag_imp);

Somehow, printf has to determine both what values we want to display and how we want those values
formatted. As programmers we know to use the % sequences (such as %f in the example above) to
describe what and how we want things formatted, but how does printf do its thing? In this case study
we'll write our own printf function. We'll only implement a small fraction of the functionality provided
by the Standard C library's printf, but we'll cover most of the important aspects of printf. As we go
through this case study we'll discover that pointers are the key to being able to extract the values from
the stack. We'll do some pointer arithmetic to calculate the address of the values we want on the stack,
we'll declare pointers of the correct type to ensure that we read these values correctly, and we'll even
spend a little bit of time (very little) discussing how we format those values for output.
The putchar function
We're not going to build the entire functionality of printf from scratch. Our starting point will be the
putchar function. Putchar is a very simple function (from the Standard C library) that formats and
outputs a single ASCII character. For example, if you invoke putchar(65), then you will get the letter A
displayed on the screen. That's because in the ASCII table, row 65 is assigned the character 'A'. Of
course, displaying a character actually requires a fairly complex bit of hardware and software to light up
all the right pixels on your display, not to mention the esoteric art of designing character fonts. That's cool
stuff, but way outside the bounds of what we need to learn in EE312.
Printf version 1, just the basics
We're ready to jump in and get started. Printf, in its most basic usage, is actually quite simple. Given an
invocation like:
printf("Hello World!\n");

All we need to do is output each of the ASCII characters one-at-a-time using putchar. A loop will take
care of that easily.
#include <stdint.h>  // uint32_t
#include <stdio.h>   // putchar

void printf_v1(char fmt[]) {
    uint32_t k = 0;

    while (fmt[k] != 0) {
        putchar(fmt[k]);
        k += 1;
    }
}

Code review
In this incarnation of printf, we have just a single parameter. I've named the parameter fmt,
which is shorthand for "format string". I like that name because the first argument provided to
printf is the formatting instructions for this output operation: it tells us what characters are to
be displayed, and also contains the %f, %d and similar formatting instructions for all the other
outputs. I've followed my own personal style of declaring fmt using the array declaration syntax
(i.e., the []), even though I know full well that fmt is actually a pointer. I use the [] syntax
whenever I have a parameter that points to an array. I use the * syntax to declare parameters
that point at single variables. Of course, the compiler doesn't care, and both char fmt[] and
char* fmt mean the same thing.
You'll also note that I'm using a while loop rather than a for loop. That's another element of
my style where I try to use "for" when the iteration takes place over a well defined range (e.g.,
from 1 to 10), and where I try to use "while" when the iteration continues until something
special happens. In this case, I don't know in advance how many characters are in the format
string, so I use a while loop that continues until it detects the terminating zero at the end of the
format string.
Finally, you may have noticed that I describe the terminating zero as 0. That, of course, is what it
is: the number zero. Some programmers prefer to write that zero using the syntax '\0'. Please
don't get confused, 0 and '\0' are the same thing. For that matter, 0 and 0x0 are the same thing
too. Just be careful not to confuse '\0' and '0', which are quite different ('0' is actually the
number 48).
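One way to convince yourself of the distinction is to let the compiler compare the values. The sketch below is only an illustration: the helper names digit_value and is_terminator are hypothetical (they are not part of the case study), and the code assumes an ASCII character set.

```c
#include <assert.h>

/* Hypothetical helper: converts an ASCII digit character to its numeric value.
   Assumes c is one of '0' through '9'. */
int digit_value(char c) {
    assert(c >= '0' && c <= '9');
    return c - '0';   /* '0' is 48 in ASCII, so '7' - '0' is 55 - 48 */
}

/* Hypothetical helper: returns 1 when c is the string terminator, 0 otherwise. */
int is_terminator(char c) {
    return c == 0;    /* identical to c == '\0' and to c == 0x0 */
}
```

With these definitions, is_terminator('\0') is true while is_terminator('0') is false, because the character '0' is stored as the number 48.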
This version of printf is very limited. Printf_v1 simply prints the format string verbatim. If you try
something like:
printf("the number is %d\n", 42);

then you'll see "the number is %d" as your output. Note that the \n is handled just fine. That's because
'\n' is really just the number 10 (row 10 in the ASCII table is line feed, i.e., new line). That's worth
repeating, just to be sure. '\n' is NOT a \ character followed by the letter n. It is a single character
representing the new line operation. That character happens to be 10 in the ASCII table (line feed).
However, the %d formatting code is completely ignored by printf_v1. That's because we didn't even
attempt to take care of this case. Time to move on to version 2.
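The claim that '\n' is one character can be checked by counting. The following is a minimal sketch using the same loop shape as printf_v1; count_chars is a hypothetical helper name introduced here for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: counts the characters that come before the terminating
   zero, walking the array exactly the way printf_v1 does. */
uint32_t count_chars(const char str[]) {
    assert(str != 0);
    uint32_t k = 0;
    while (str[k] != 0) {
        k += 1;
    }
    return k;
}
```

For the string "a\n" this count is 2, not 3: the letter a and the single new line character.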
Printf version 2, one decimal argument
Our first big challenge comes when we try to make printf extract a value from the stack. Consider the
invocation:
printf_v2("the number is %d\n", 42);

In this invocation we have two arguments passed to printf. The first (as always) is a format string. The
second is the value 42 (on our platform this is a 32-bit signed integer). We understand how stack frames
work, so we can certainly imagine how these two arguments would be arranged on the stack.

In the diagram, I've illustrated the stack frame for a myPrintf function that has one formal parameter (i.e.,
one parameter that is declared in the parameter list), but has been supplied with two actual arguments.
I've also illustrated the stack frame so that it contains two local variables, the variable k which is used as
before to index into the format string, and a new variable p which is a pointer. This diagram corresponds
with the following function:
void printf_v2(char* fmt, ...) {
    uint32_t k;
    int32_t* p;
}

Note that the parameter list for printf_v2 has one formal parameter (declared as a pointer this time, but
as I'm sure you remember, array parameters and pointer parameters are the same thing; I'm using
pointer syntax in this case to remind you that the actual argument will be an address). After the
declaration of fmt I have the C/C++ ellipsis: "...". The ellipsis means that printf_v2 is a
function that can accept extra arguments. If you declare a function with an ellipsis, then you can call that
function with as many extra arguments as you wish. The extra arguments can be any type (characters,
integers, strings, floats, etc.). You can also have zero extra arguments. For this case, the extra argument
is 42.
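For comparison only: portable C code reaches the extra arguments through the va_list macros in <stdarg.h> rather than through hand-computed stack addresses. The sketch below is not the approach this case study takes; sum_ints and its count parameter are hypothetical names, and the function assumes the caller passes exactly count extra int arguments.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical example of a function declared with an ellipsis.
   The caller must pass exactly `count` extra arguments of type int. */
int sum_ints(int count, ...) {
    assert(count >= 0);
    va_list args;
    va_start(args, count);              /* start after the last named parameter */
    int total = 0;
    for (int i = 0; i < count; i += 1) {
        total += va_arg(args, int);     /* fetch the next extra argument as an int */
    }
    va_end(args);
    return total;
}
```

A call such as sum_ints(3, 1, 2, 3) supplies three extra arguments, and sum_ints(0) supplies none. Note that the function has no way of discovering the number or types of the extra arguments on its own, which is the job the format string does for printf.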
[Figure: the stack frame for the main function and the stack frame for the myPrintf function]
Now that we've become familiar with the terrain, we have three problems to solve: (1) we
need to locate the memory location where the extra argument is stored, (2) we need to determine
when the argument is supposed to be printed (i.e., where the %d is inside the format string), and (3) we
need to actually format the output in decimal. The first problem is by far the most interesting: finding
the memory location with the 42 in it. Our diagram actually makes this pretty easy. There are a
couple of ways I can go about finding this address. In the first method, I can make the pointer p point
at the variable k, and then I'll increment p by 4. In the second, I'll make the pointer p point at the
parameter fmt and increment p by 1. I actually like the second strategy better, but let's start with the
first. In the diagram below, I've gone ahead and removed the stack frame for main (it's really not interesting
to us), and I've added the actual array of characters for the format string. Note that (just as it always is),
the array argument is not actually on the stack with the parameters. The real argument is the address of
the first character in the array (in our case fmt is a pointer to the letter 't'). You'll notice that the fmt
parameter points to the first character in our array, and also that the array ends with two numbers. The
first number is 10, which is the actual value of '\n' (newline). In ASCII, the new line code is the
10th entry in the ASCII table. Sometimes students will get confused and think that '\n' is actually two
characters. It's not. The new line character is just that, a single character, which happens to have ASCII
value 10. The second number is the zero which marks the end of the string. I've also assumed that I've
executed the statement p = &k; setting p to be equal to the address of k, in other words, making p
point to k.



The diagram shows the addresses that result from the pointer arithmetic p + 1, p + 2, etc. Recall that
Visual Studio uses eight bytes of storage on the stack to implement function return (four bytes for the
return address plus another four bytes to store the copy of the old frame pointer). Based on the diagram
we can clearly see that p + 3 points at the first parameter (fmt), and p + 4 points at the memory location
which contains the second argument, 42. Since this argument is one of the extra arguments that are
permitted by our printf(char fmt[], ...) declaration, the argument has no name. The ONLY way we can
access this argument is by calculating its address. Using a diagram, we can easily calculate the pointer
arithmetic expression to find this address, resulting in the code shown below.

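[Figure: stack frame for printf_v2. The pointer p points at k, and p+1 through p+4 label the following four-byte slots, with p+3 at fmt and p+4 at the extra argument 42. fmt points at the character array "the number is %d" followed by 10 and 0.]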
void printf_v2(char* fmt, ...) {
    uint32_t k = 0;
    int32_t* p = (int32_t*) &k; // k is unsigned, so taking its address as int32_t* needs a cast
    p = p + 4; // per the diagram, p now points at the extra argument

    while (fmt[k] != 0) {
        if (fmt[k] != '%') {
            putchar(fmt[k]);
            k += 1;
        } else { // fmt[k] is the beginning of an escape sequence, e.g., %d
            /* I'm just going to assume %d for now */
            int32_t x = *p;
            displayDecimal(x);
            k = k + 2; // we add 2 to skip the % and the d, and then resume our loop.
        }
    }
}

There are a couple of things worth noting. First of all, this version of printf is far from done. One big
mistake is that it always assumes that % is followed by d. As a result, the function doesn't work for
%c, %f, %s or any other escape sequence. Also, the function is limited to working with only a single extra
argument. If there's more than one %d in the format string, then the function just prints the same
argument over and over (the pointer p never moves, so each time we go to print an argument we
always print the same one). Still, the function works just fine for our simple example:
printf_v2("the number is %d\n", 42);
One other thing worth noting is the use of the function displayDecimal. This function takes the integer
argument and converts that argument into a sequence of ASCII characters. As humans we often forget
that this step is even necessary; we instinctively think of something like "42" as being a number, even
when it quite clearly is a sequence: '4' followed by '2'. In our computer program, we have to
manually identify and then output each character that makes up the number. A simple function to do
that is shown below.
void displayDecimal(int32_t x) {
    if (x == 0) { // special case for 0
        putchar('0');
        return;
    }

    if (x < 0) { // special case for negative values
        putchar('-');
        x = -x; // note: this overflows for the most negative int32_t value
        // fall through and display the absolute value
    }

    /* we can now assume x > 0 */
    /* extract the digits in x from least to most significant */
    char digits[10]; // int32_t is at most about 2 billion, so at most 10 characters
    uint32_t num_digits = 0; // the actual number of digits
    while (x != 0) {
        uint32_t d = x % 10; // least significant digit
        char c = d + '0'; // ASCII representation of d
        /* store the characters in an array so we can reverse them */
        digits[num_digits] = c;
        num_digits += 1;

        /* continue to the next digit of x */
        x = x / 10;
    }

    /* now print the digits in reverse order */
    while (num_digits > 0) {
        num_digits -= 1;
        putchar(digits[num_digits]);
    }
}
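The same digit-extraction idea can be written so that the result lands in a character array, which makes it easy to inspect. This is a sketch under stated assumptions, not a replacement for displayDecimal: formatDecimal is a hypothetical name, the caller is assumed to supply at least 12 bytes, and the sign is handled with unsigned arithmetic so that the most negative int32_t value does not overflow when negated.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical variant of displayDecimal: writes the decimal text of x into out
   (which must hold at least 12 bytes) and returns the number of characters
   written, not counting the terminating zero. */
uint32_t formatDecimal(int32_t x, char out[]) {
    assert(out != 0);
    uint32_t len = 0;
    uint32_t magnitude;

    if (x < 0) {
        out[len] = '-';
        len += 1;
        magnitude = 0u - (uint32_t) x;   /* absolute value, safe even for INT32_MIN */
    } else {
        magnitude = (uint32_t) x;
    }

    /* extract the digits from least to most significant */
    char digits[10];
    uint32_t num_digits = 0;
    do {
        digits[num_digits] = (char) ('0' + magnitude % 10);
        num_digits += 1;
        magnitude = magnitude / 10;
    } while (magnitude != 0);            /* do-while also covers x == 0 */

    /* copy the digits out in reverse order */
    while (num_digits > 0) {
        num_digits -= 1;
        out[len] = digits[num_digits];
        len += 1;
    }
    out[len] = 0;                        /* terminating zero */
    return len;
}
```

The 12-byte requirement comes from the longest possible result: a minus sign, ten digits, and the terminating zero.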
Summary (printf_v2)
Printf has only one formal parameter (in our case, we call this parameter fmt). However, printf
can have extra arguments. These arguments do not have names and can only be accessed using
their address.
Calculating the address of a variable in memory requires that you have a diagram showing you
the location of that variable relative to other variables. In our case, we used our detailed
knowledge of Visual Studio's stack frame to draw a diagram illustrating the position of the
unnamed extra argument 42 relative to the named variables k and fmt.
We chose to read the argument from the stack using a pointer (named p). By referring to our
diagram we concluded that p = &k + 4 was the correct arithmetic. Note that since p is declared
to be an int32_t* pointer, the +4 in our arithmetic is actually going to increase the address
stored inside p by 16: the addition is scaled by the size of int32_t, i.e., multiplied by four.
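The scaling can be checked portably by modeling the frame as an ordinary array instead of the real stack. In the sketch below everything is an assumption made for illustration: the frame is a hypothetical array of five 32-bit slots (slot 0 stands for k, slots 1 and 2 for the eight bytes of return bookkeeping, slot 3 for fmt, slot 4 for the extra argument), and read_slot and byte_distance are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: reads the slot that lies `steps` int32_t-sized steps
   past the start of a five-slot frame. */
int32_t read_slot(int32_t frame[], uint32_t steps) {
    assert(steps < 5);
    int32_t* p = &frame[0];   /* plays the role of p = &k */
    p = p + steps;            /* the address grows by steps * sizeof(int32_t) bytes */
    return *p;
}

/* Returns how many bytes an int32_t* moves when it is advanced by `steps`. */
uint32_t byte_distance(uint32_t steps) {
    int32_t frame[5] = {0, 0, 0, 0, 0};
    assert(steps < 5);
    char* start = (char*) &frame[0];
    char* end = (char*) (&frame[0] + steps);
    return (uint32_t) (end - start);   /* char* subtraction counts bytes */
}
```

In this model, advancing by 4 covers 16 bytes and lands on the slot holding the extra argument, matching the arithmetic in the summary above.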

Printf version 3, a string argument and %s
Of course decimal is not the only format we want to use when producing output, and %d is far from the
only escape option provided by printf. Let's consider the escape sequence %s which will format and
display a string argument. Consider:
printf_v3("Hello %s\n", "Craig");
In this case we have two string arguments. The first string argument is bound to the formal parameter
fmt. The second string argument, "Craig", will be an unnamed extra argument. To access this
argument we will need to calculate its address (just like we did with the 42 in printf_v2). Before jumping
into the pointer arithmetic, it is worthwhile to remind ourselves exactly what the string argument
"Craig" is. In the C programming language, strings are arrays (arrays of characters with a zero at the
end). Furthermore, arrays, when used as arguments to functions, are passed using the address of the
first character of that array. So, in this case, the unnamed argument is actually going to be the address
of the ASCII 'C' in an array of six characters: 'C', 'r', 'a', 'i', 'g', 0. Like all addresses in 32-bit Windows, this
address in Visual Studio will be four bytes long. The following diagram shows the stack frame.


Since we've not changed the number of arguments from the previous example, and all the arguments
are coincidentally the same size, we can continue to use the same code to extract the extra argument
from the stack. Naturally, we don't want to format this argument in decimal anymore, so we'll use the
function displayString instead.
void displayString(char str[]) {
    uint32_t k = 0;
    while (str[k] != 0) {
        putchar(str[k]);
        k += 1;
    }
}

Other than changing the function we use to format the output, printf itself is not changed.
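[Figure: stack frame for printf_v3. The pointer p points at k; p+3 is fmt, which points at "Hello %s" followed by 10 and 0; p+4 is the extra argument, which points at "Craig" followed by 0.]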
void printf_v3(char* fmt, ...) {
    uint32_t k = 0;
    int32_t* p = (int32_t*) &k;
    p = p + 4;

    while (fmt[k] != 0) {
        if (fmt[k] != '%') {
            putchar(fmt[k]);
            k += 1;
        } else { // fmt[k] is the beginning of an escape sequence, e.g., %s
            /* I'm just going to assume %s for now */
            int32_t x = *p;
            displayString(x); // x is an int32_t here, not a char*
            k = k + 2; // we add 2 to skip the % and the s, and then resume our loop.
        }
    }
}

Conceptually, this version of printf does the right thing for printf("Hello %s", "Craig"); However, the
compiler balks at our invocation of displayString(x); The compiler is concerned that we declared x to be
an int32_t (i.e., a number) and yet the function displayString needs an argument that is an address (i.e.,
a pointer). In other words, the compiler thinks we made a mistake. Actually, it's the compiler that's
mistaken here. We know our code is correct because the code matches precisely our diagram (and our
diagram is correct). After p = p + 4, our pointer p points at the location on the stack where the second
(extra) argument is stored. We know that this memory location contains the address of the letter 'C' in
our string "Craig". So, by reading from *p and storing the result in the variable x, we are storing the
address of the letter 'C' in the variable x. This address is precisely the address that displayString needs in
order for displayString to print out "Craig". So, we're right, the compiler is wrong. What do we do?
The situation calls for a type cast expression. In this case, I'm going to declare an additional variable (q)
and specify that q is type char*. Then I'll use a type cast to convert the value of x into an address and
store that address in q.
int32_t x = *p;
char* q;
q = (char*) x; // type cast expression
displayString(q);

Type casts in C/C++ allow you to explicitly convert from one type to another. In our case, we want to
convert from an integer (x) to an address. We know that addresses really are numbers, after all, so this
conversion isn't actually a conversion at all: the value in q is going to be precisely the same value that
was in x. However, since x and q are different types, the language considers them to be different. The
type cast is required in order to satisfy the language's type system, but that type cast doesn't do
anything. q = (char*) x; means exactly what q = x; means: copy the number in x and store that
number in the variable q.
IMPORTANT: Any type cast expression involving pointers in the C programming language will not do
any actual conversion. In fact, if you want to understand what is happening, it's best to completely
ignore the type cast when reviewing the code.
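The "same number, different type" idea can be demonstrated without touching the stack. The sketch below is an illustration under an assumption that goes beyond the case study: it uses uintptr_t, the standard integer type meant to hold an address, because int32_t is only wide enough on a 32-bit target such as the one used here. The helper name through_integer is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: stores an address in an integer variable, then turns
   that integer back into an address. */
char* through_integer(char* original) {
    assert(original != 0);
    uintptr_t x = (uintptr_t) (void*) original;  /* the address, viewed as a number */
    char* q = (char*) (void*) x;                 /* the same number, viewed as an address */
    return q;
}
```

The pointer that comes back compares equal to the one that went in, so reading through it reaches the same characters.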
Now that we can display both %s and %d we should add the case-selection code to our program so that
it correctly selects between strings and decimals. While the switch keyword can be used, I actually
prefer to stick with the more general if-then-else for most of my case selection. So, printf_v3 looks like
this:
void printf_v3(char* fmt, ...) {
    uint32_t k = 0;
    int32_t* p = (int32_t*) &k;
    p = p + 4;

    while (fmt[k] != 0) {
        if (fmt[k] != '%') {
            putchar(fmt[k]);
            k += 1;
        } else { // fmt[k] is the beginning of an escape sequence, e.g., %s
            if (fmt[k+1] == 'd') { // %d case
                int32_t x = *p;
                displayDecimal(x);
            } else if (fmt[k+1] == 's') { // %s case
                int32_t x = *p;
                char* q = (char*) x;
                displayString(q);
            } else { // default case (an error!)
                /* do nothing */
            }

            k = k + 2; // we add 2 to skip the % and the letter after it, and then resume our loop.
        }
    }
}
As you can see, we have three cases currently in our code. The first case is for %d sequences, the second
is for %s sequences. We can distinguish between these two cases by examining the value of fmt[k+1].
Since fmt[k] is the % character, fmt[k+1] will be either a 'd' or an 's'. Well, I suppose it's possible
that fmt[k+1] is neither 'd' nor 's'. That's the third case: for now, it's an error and since we don't know
what to do, I'm going to structure the code so that it ignores that error.
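The case selection can be exercised on its own, without reading anything from the stack. The sketch below uses a hypothetical helper, count_escapes, that walks a format string with the same if-then-else structure and counts how many extra arguments the loop would try to read. It also adds one check that the case study's loop does not have (an assumption on my part about what is desirable): it stops when a lone % is the last character, so that k + 2 never steps past the terminating zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: counts the %d and %s sequences in a format string. */
uint32_t count_escapes(const char fmt[]) {
    assert(fmt != 0);
    uint32_t k = 0;
    uint32_t count = 0;
    while (fmt[k] != 0) {
        if (fmt[k] != '%') {
            k += 1;                      /* ordinary character */
        } else if (fmt[k + 1] == 0) {
            break;                       /* lone % at the end of the string */
        } else {
            if (fmt[k + 1] == 'd') {         /* %d case */
                count += 1;
            } else if (fmt[k + 1] == 's') {  /* %s case */
                count += 1;
            } else {                         /* default case: ignored */
            }
            k += 2;                      /* skip the % and the character after it */
        }
    }
    return count;
}
```

For "Hello %s\n" the count is 1, and for a string with no % characters it is 0.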
Printf version 3 summary
A string argument is a pointer: the address of the first character in an array of characters.
In our platform, addresses are the same size (and same binary encoding) as numbers. We can
extract the string extra argument using the same code that we used to extract the integer extra
argument in version 2.
The C programming language considers the types of our variables to be very important, and
consults the type of each variable before determining if an expression is legal. Using an integer
variable where an address (pointer) is expected is illegal in C, even if the number stored in the
variable is the correct address. To get around this problem, we can use type casts. A type cast
will often not do anything other than tell the compiler that the operation should be legal and to
compile it as written. In the case of type casts using pointer types (e.g., type casting to char*)
this is always the case and a type cast using a pointer will never actually do anything. The type
cast essentially just becomes the "manual override" button that the programmer presses to tell
the compiler to shut up and just generate the machine code.
Printf version 4, cleaning up the code
In our last version of printf, I want to accomplish two things. First, the code is incredibly ugly. Most
importantly, by declaring the variable p as an int32_t* the code is incredibly misleading. We don't know
that p actually points to an integer. It might point to an address (%s) or it might even point to a floating
point number (%f). I want to correct this and declare p using a type that documents only what I know
about that address (and at the same time, I'm going to give this variable a new name). The second thing
I want to do is to improve the functionality of printf so that it will print multiple arguments. To make
that happen, I'll need to add some pointer arithmetic to increment p each time we extract an argument.
As long as we're working on yet another version of printf, I might as well give the code a thorough
cleaning and add in the additional cases for %c and %f. A heads up though, I'm not going to bother
actually writing displayFloat as a function. Extracting the binary encoding for IEEE floating point and
creating a sequence of ASCII characters to represent that number is way outside of the goals for this
example.
First up on the docket is to replace the variable p with a new variable, next_arg. In our program
next_arg will always be the address of the next extra argument (if there is one). So, we'll initialize
next_arg to be the address of the first extra argument, and each time we see a valid % sequence, we'll
increment next_arg so that it becomes the address of the next argument. I'd like to give next_arg the
correct type, which for this case is quite clearly void*. In C/C++ the type void* is a generic pointer. We
use that type when we have an address, but we don't know what type of information is stored at that
address. That's perfect for this case where I know that next_arg is the address of the next argument, but
I don't yet know whether that argument is an integer, a float, a character or a string.
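A generic pointer only becomes readable once it is copied into a pointer with a definite type. The sketch below illustrates that step in isolation; read_as_int and read_as_double are hypothetical helper names, and each one assumes the caller really did pass the address of a value of the matching type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the caller promises that addr is the address of an int32_t. */
int32_t read_as_int(void* addr) {
    assert(addr != 0);
    int32_t* p = (int32_t*) addr;  /* now the compiler knows to read 4 bytes as an integer */
    return *p;
}

/* Hypothetical helper: the caller promises that addr is the address of a double. */
double read_as_double(void* addr) {
    assert(addr != 0);
    double* p = (double*) addr;    /* now the compiler knows to read a double */
    return *p;
}
```

The same void* variable can hold the address of an integer at one moment and the address of a double the next; only the typed copy decides how many bytes are read and how they are interpreted.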
As part of my code cleaning, I'm going to initialize next_arg to be &fmt + 1 rather than &k + 4. As we can
tell from our diagram, either bit of arithmetic calculates the correct address. I prefer &fmt + 1 since this
will still be the correct address even if I create additional local variables (&k + 4 is correct only as long as
k is the first local variable; declare a local variable before k in the program and the whole thing breaks).
void printf(char* fmt, ...) {
    void* next_arg = &fmt + 1;
    uint32_t k = 0;
The main loop for printf is slightly more complicated because I'm adding cases for %f and %c (more on
that later). The biggest change to the main loop is caused by the fact that in C/C++ I cannot legally
dereference a void* pointer. Specifically in this case, even though next_arg is the correct address, I can't
read from that address using *next_arg. The reason I can't read from that location is that since next_arg
is a generic pointer, the compiler has no idea how many bytes I want to read (or how to interpret the
bits contained inside those bytes). For example, next_arg could be the address of a character, or
next_arg could be the address of a float. We don't know (yet), which is why we declared the pointer to be
void* in the first place. Well, the compiler doesn't know either, so it cannot possibly create machine
code for an expression like *next_arg. To get around this problem, I'm going to resurrect my variable p.
Actually, I'm going to create a whole bunch of variables, each named p, and each with exactly the
correct type to match the argument being read. Here's the final code.
void printf(char* fmt, ...) {
    void* next_arg = &fmt + 1; // address of next "extra" argument
    uint32_t k = 0;

    while (fmt[k] != 0) {
        if (fmt[k] != '%') {
            putchar(fmt[k]);
            k += 1;
        } else {
            // fmt[k] is the beginning of an escape sequence, e.g., %d
            if (fmt[k + 1] == 'd') { // %d case
                int32_t* p = (int32_t*) next_arg;
                next_arg = p + 1;
                displayDecimal(*p);
            } else if (fmt[k + 1] == 's') { // %s case
                char** p = (char**) next_arg;
                next_arg = p + 1;
                displayString(*p);
            } else if (fmt[k + 1] == 'f') { // %f case
                double* p = (double*) next_arg;
                next_arg = p + 1;
                displayFloat(*p);
            } else if (fmt[k + 1] == 'c') { // %c case
                int* p = (int*) next_arg;
                next_arg = p + 1;
                putchar(*p);
            } else { // either %% or error
                putchar('%');
            }
            k += 2; // we add 2 to skip the % and the character after it, and then resume our loop.
        } // end of %? escape sequence
    }
}

The first escape sequence case in the code is for %d. In this case, next_arg will be the address of an
integer. Accordingly, I declare a variable named p of type int32_t* and I copy the address from next_arg
to p. The C++ programming language mandates that I use a type cast when I copy this address (C itself
would accept the assignment without one). However, the type cast doesn't do anything, it just tells the
compiler to go ahead and copy the address into the new variable. Once I have p pointing at the right
location (and declared with the correct type), I can do my pointer arithmetic to calculate the correct
address for the next extra argument. The expression p + 1 is precisely the correct address because the 1
will be scaled by the size of the current argument (i.e., multiplied by 4 since the current argument is an
int32_t). I can also read the extra argument using the expression *p and send that value directly to
displayDecimal to handle the output.
The case for %s is almost verbatim a copy of the %d case. That's not surprising since our diagram
illustrated how similar the two cases actually are. Again, I declare a pointer p and copy the address from
next_arg into p (with a type cast). What's different this time is that p is declared to be char**. That type
means "a pointer to a pointer to a character". That is, of course, precisely what next_arg is in this case.
Consider this diagram from printf_v3.

[Figure: the printf_v3 stack frame. fmt points at "Hello %s" followed by 10 and 0; the extra argument points at "Craig" followed by 0.]
The address stored in next_arg is the address of the extra argument. That extra argument is itself an
address, specifically the address of the 'C' in the string "Craig". In our diagram, next_arg is a pointer that
points to a pointer that points to a character.
In our code, as soon as we know we're processing the case for %s we know we have a diagram like this
one. Consequently we know that next_arg is really a char**. So, we create a new variable (p) of type
char**, copy the address from next_arg into p and proceed as always. We assign next_arg the
incremented address p + 1 and we send *p to our output function displayString. It is incredibly
important to be able to recognize why char** is the correct type, and why all the code around p is
correct (p + 1 and not *p + 1 or &p + 1 for example). It takes a little while to sink in, but the code is
correct because the code precisely matches the diagram (and the diagram is correct).
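The %s step can be tried out portably by letting an ordinary array of char* stand in for the argument slots on the stack. Everything in this sketch is an assumption made for illustration: take_string is a hypothetical helper, and it receives the address of next_arg (a void**) only so that it can update the caller's variable.

```c
#include <assert.h>

/* Hypothetical helper: reads the string argument that *next_arg points at,
   then advances *next_arg to the following argument slot. */
char* take_string(void** next_arg) {
    assert(next_arg != 0 && *next_arg != 0);
    char** p = (char**) *next_arg;  /* the slot holds a char*, so its address is a char** */
    *next_arg = p + 1;              /* p + 1, not *p + 1: step over one whole slot */
    return *p;                      /* the address of the first character of the string */
}
```

Writing *p + 1 instead would produce the address of the second character of the string rather than the address of the next slot, which is why the distinction in the text matters.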
Finally we have two cases, one for %c and one for %f. Both these cases match the case for %d with the
obvious substitution of displayFloat instead of displayDecimal for %f and putchar instead of
displayDecimal for %c. There is one odd thing going on, and that's that for %f I used a double* pointer
(instead of float*) and for %c I used an int* pointer instead of char*. The reason I used these pointer
types is because of an obscurity in the C standard. The C standard states that a float passed as one of
the extra arguments of a function declared with "..." is always promoted to double. Even if the caller
passes a float variable, the compiler will actually pass a double-precision value instead. A similar thing
happens with characters. In C and C++, character values passed as extra arguments are always promoted to
int. Since the argument for %f is going to be a double, I have to use double* to read this argument
(otherwise I'd only read half the bytes). Since the argument for %c is going to be int, I have to use int* to
read this argument. Note that I used int here instead of the more specific int32_t. The C standard
doesn't say that char is promoted to 32-bit ints, only that it's promoted to int.
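These promotions can be observed with the portable <stdarg.h> macros, which is not how this case study reads its arguments but follows the same rule. In the sketch below the helper names first_extra_as_double and first_extra_as_int are hypothetical, and each assumes the caller passes at least one extra argument of the expected kind.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical helper: returns the first extra argument, read as a double. */
double first_extra_as_double(int count, ...) {
    assert(count >= 1);
    va_list args;
    va_start(args, count);
    double value = va_arg(args, double);  /* a float argument arrives as a double */
    va_end(args);
    return value;
}

/* Hypothetical helper: returns the first extra argument, read as an int. */
int first_extra_as_int(int count, ...) {
    assert(count >= 1);
    va_list args;
    va_start(args, count);
    int value = va_arg(args, int);        /* a char argument arrives as an int */
    va_end(args);
    return value;
}
```

Passing a float variable to the first helper, or a char variable to the second, works precisely because the values have already been widened by the time the function reads them.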
