x86 32-bit Assembly for Atheists


Contents
Introduction
What is Assembly?
What is the Purpose of Learning Assembly?
Getting to know the x86 Architecture
Bits, Bytes and Numbers
Bit Operations
Signed Integers
The First Steps
Writing Windows Applications Using x86 32-bit Assembly
MASM
FASM
NASM
Writing Linux/BSD Applications Using x86 32-bit Assembly
as
Partial Registers
Control Structures
Multiplication and Division of Integers
Floating Point Operations

Introduction
This document is an introduction to creating programs for microprocessors of the x86 architecture family - in particular
32-bit code. The reader is expected to be familiar with programming in C/C++ (or similar languages such as Java at
least) and the essential API of the operating system they are using. Some mathematical knowledge up to
high school/university level is essential for understanding a lot of things as well. I will try not to be too OS specific but the
environments I am going to focus on in this document are Windows, Linux and BSD. Pure knowledge of assembly in
itself is useless if you do not know how to combine it with the API of the operating system you are going to run it on, so
I want to make sure that this will be demonstrated to a certain extent. It is not that different from the way you would do
it in a high level programming language such as C++, so it is not difficult to understand.
You might wonder why the name of this document contains the "for Atheists" part. I added it primarily to distinguish it
from other documents by name but also because I am an outspoken atheist, as every scientist should be. It was more
of a joke really, though.

What is Assembly?
Assembly (short: ASM) is the lowest level of programming possible - if you are a programmer and you want to get as close
to the hardware as you can get then this is the place you want to be. In assembly code you get to control every single
instruction that your CPU (central processing unit) is going to execute. There are different assembly languages for
different microprocessor architectures and all of them are different from each other. Usually they are totally
incompatible so you will have to write assembly code that is specific to the particular architecture you want your
program to run on. In assembly you write instructions using the ASCII character set which directly represent machine
code instructions that are executed by your processor. The names of these instructions are usually extremely short and
are often abbreviations of full names. These assembly instruction names are called mnemonics.
Before your microprocessor can actually run the program you have written in assembly you will have to run it through a
program which translates all the mnemonics and arguments to numerical machine code. This program is called an
assembler. Assemblers often also support more features than just the pure instructions to make the job easier for the
programmers, but you will see how they do that later on.

What is the Purpose of Learning Assembly?


This is a very important question and subject to a lot of discussion. My answer to this question is a list of reasons,
really. Better understanding of what goes on at the lowest level can make you a better programmer at a higher level. It
allows you to see what goes on behind the scenes and it often gives you a totally new perspective on things. It has its
uses in writing high performance parts of high level language programming where you need to use features of your
processor that are not easily accessible in that high level language. Knowing assembly is obviously also necessary to be
able to write compilers for a particular microprocessor architecture which convert high level language code to machine
instructions.
The truth is that most people learn x86 ASM nowadays to crack commercial software, to reverse engineer closed
source programs and to write cheats for computer games. Cracking is the one that made me learn it but I have to admit
that I never got particularly good at it and I got totally distracted from my original goal in the process of understanding
how it works. It is my experience that it is essential to learn how to write x86 ASM yourself first in order to be
successful at cracking and reverse engineering. Knowing how to manually translate a C++ program to assembly is a
valuable skill to have for this purpose.

Getting to know the x86 Architecture


So, what are we dealing with here? The x86 microprocessor is basically a register machine which uses a CISC (complex
instruction set computer) instruction set. At first I am going to explain what a register machine is. After that I will move
on to the CISC part. A register machine is a computing device which stores results of arithmetic operations and such
primarily in so called registers. These are small but highly efficient storage units inside the processor. When you write
assembly you deal a lot with them. They are simple hardware implementations of integers. They are incapable of
holding a lot of data at once but they are essential as temporary placeholders used in most instructions executed by the
processor. In terms of the memory hierarchy of contemporary computers they are at the very top. The hierarchy
looks like this:
1. CPU registers
2. CPU cache
3. RAM (random access memory)
4. HDD (hard disk drive storage)
5. External storage, DVDs and such
Registers hold minimal amounts of data and are extremely fast. The cache holds far larger amounts of data but it is still
pretty fast. The RAM holds even larger amounts of data and it is far slower than any memory operation inside your
CPU. Hard disks have an even larger capacity than your RAM and they are very slow in comparison to the objects at
the top of the hierarchy.
At this point I should probably briefly explain what the CPU cache actually is. It is a small high performance storage unit
inside your CPU to which chunks of memory from the RAM are copied whenever you perform a RAM memory access.
This way the CPU does not have to access the RAM over and over again when it is processing the same piece of data.
This speeds up the execution of code a lot - RAM access is very slow in comparison to cache access after all. In reality
this cache is actually not a single unit but it is divided into multiple cache levels. The Level 1 Cache is the smallest but
fastest one. The next level is bigger but slower and so on. Caching is not of much interest to somebody who is new to
assembly, though. I might cover this topic later in sections which deal with optimising code for speed.
Back to the CISC part I mentioned earlier. There are two major microprocessor architecture philosophies known as
RISC (reduced instruction set computer) and CISC (complex instruction set computer). In RISC architectures
instructions are really fundamental and atomic. These instructions are incapable of performing multiple actions at once
but they are very fast. The instruction format for such architectures is usually quite uniform with instructions that all take
up the same amount of bytes which makes them very easy to decode for the processor. In CISC architectures it is
common practice to have complex instructions which perform multiple tasks sequentially, like loading a value from
memory, performing an arithmetic operation on it and then writing back the result of the operation to the memory. In a
RISC architecture this would be divided into multiple steps. CISC instructions are usually of variable length and they are
very complicated to decode for the CPU. The CISC philosophy actually predates RISC, even though it was not called
CISC at that time. Back in the early days of microprocessors (around 1971) most code had to be written in architecture
specific assembly. High level languages as we know them today were pretty much impossible in those days so it was
desirable to make the job for the programmers easier by providing them with comfortable to use instructions which
would perform commonly grouped actions at once. When they finally realised that this was not very efficient (around
1975) it was already too late. The microprocessor industry based around CISC architectures would flourish and
dominate to eventually totally conquer the desktop market. I would like to emphasise that this did not happen because
of superior technology but because of superior resources and marketing strategies. We are basically using an inherently
inferior instruction architecture for most desktop computers nowadays. The pure design philosophies do not exist as
such in practice anymore and you will find elements of both of them in architectures nowadays.

Bits, Bytes and Numbers


A bit is simply the smallest digital piece of information possible. To us it is a rather abstract unit which can have two
values, zero or one. A byte is composed of a certain number of bits. On all mainstream architectures a byte is
composed of exactly 8 bits. In this document we focus on the x86 32-bit architecture. The 32-bit part means that the
word size of this processor is 32 bits. The word size of a microprocessor tells you what size most registers and
operands are. So in this case all general purpose registers are 32 bits = 4 bytes in size. General purpose means that
we, the programmers, can pretty much use these registers as we want without breaking anything. They are meant to be
used for all major calculations. There are special registers such as the stack pointer, which may not be used in such a
way. In fact, that would usually result in a crash at some point, but we will get back to that later. Our registers consist
of 32 single bits which can each assume two different values. This means that each register can assume one of
2^32 = 4294967296 (that is: 2 to the power of 32) different values at a time.
There are different systems of representing numbers using digits (which can actually be arbitrary symbols) such as our
decimal system. At this point it is important to introduce the concept of the positional notation. Positional notation is a
method which takes a base b and a set of digits d_0, d_1, ..., d_(b-1) with the values val(d_i) = i. I hope I did
not scare anybody away with those equations - I should probably clarify. b is a natural number - a positive integer like 1,
2, 3, etc. The digits d_0 to d_(b-1) are associated with the numerical values 0 to b - 1. In a positional system
the number which is composed of the digits a_(n-1) a_(n-2) ... a_1 a_0 actually represents the natural number
val(a_(n-1)) * b^(n-1) + val(a_(n-2)) * b^(n-2) + ... + val(a_1) * b^1 + val(a_0) * b^0.
The decimal system is actually a positional system with b = 10. So we have the digits 0, 1, ..., 9 with the
corresponding values. So according to the equation the decimal number 913 represents the natural number
9 * 10^2 + 1 * 10^1 + 3 * 10^0 = 900 + 10 + 3 = 913.
Now that was fairly obvious but let us move on to other systems such as the hexadecimal one, which is quite frequently
used in the world of programming, especially in assembly languages. In the hexadecimal system we deal with the base
b = 16 and the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. So A stands for 10, B for 11, and so on. So the
hexadecimal number F4D7 actually represents the natural number
15 * 16^3 + 4 * 16^2 + 13 * 16^1 + 7 * 16^0 = 61440 + 1024 + 208 + 7 = 62679.
Then there is of course the binary system which has the digits 0 and 1. The binary number 1101001 represents
the natural number
1 * 2^6 + 1 * 2^5 + 0 * 2^4 + 1 * 2^3 + 0 * 2^2 + 0 * 2^1 + 1 * 2^0 = 64 + 32 + 8 + 1 = 105. This is pretty
much exactly the way unsigned numbers are stored at the hardware level in the registers of your x86 microprocessor.
The single binary digits represent the bits in the register. But the binary number 1101001 has 7 bits and our 32-bit
registers have 32 bits. What happens to the other 25 bits? The upper bits are simply set to 0, so the register would
actually be set to 00000000000000000000000001101001.
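If you want to verify such conversions yourself, here is a small C++ sketch (the function name and the test values are just my own illustration) which evaluates a string of digits in an arbitrary base using exactly the positional sum described above:

#include <cstdint>
#include <cstdio>

// Evaluate a digit string in the given base. Processing the digits from the
// left and doing value = value * base + digit is just another way of writing
// the positional sum val(a_i) * base^i.
// Assumes upper-case digits and performs no error checking.
uint32_t positional_value(const char *digits, uint32_t base)
{
    uint32_t value = 0;
    for (const char *p = digits; *p != '\0'; ++p) {
        uint32_t digit = (*p >= 'A') ? (uint32_t)(*p - 'A' + 10) : (uint32_t)(*p - '0');
        value = value * base + digit;
    }
    return value;
}

int main()
{
    printf("%u\n", (unsigned)positional_value("913", 10));    // 913
    printf("%u\n", (unsigned)positional_value("F4D7", 16));   // 62679
    printf("%u\n", (unsigned)positional_value("1101001", 2)); // 105
    return 0;
}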

Bit Operations
In the introduction I stated that this document is intended for people who already know a high level programming
language which also implies knowledge about the six essential binary operations but I would still like to introduce them
at this point. I will use the C operators & for AND, | for OR, ~ for NOT, ^ for XOR, << for SHIFT LEFT and >> for
SHIFT RIGHT in this document. Let's start with AND. It takes two bits as input and outputs a single bit. If you do not
fully grasp these operations yet I suggest taking a few moments to totally digest and appreciate their output. So what is
the AND operation usually used for? We use it to set particular bits of an integer to 0, to "mask off" particular ranges of
bits we do not care about.

a b a & b
0 0 0
0 1 0
1 0 0
1 1 1

Now for the OR operator. Like AND it takes two bits as input and has a single bit as output. We use it to set particular
bits of an integer to 1:

a b a | b
0 0 0
0 1 1
1 0 1

1 1 1

The NOT operator inverts bits so it takes one argument instead of two like the previous operators:

a ~a
0 1
1 0

The XOR operator is used to "flip" bits. It is not an essential function, meaning that it can be expressed with the help of
AND, OR and NOT. Curiously, one of its primary purposes is setting registers to zero, but we will get to that later. It is also very
useful for pseudo random number generators and encryption algorithms.

a b a ^ b
0 0 0
0 1 1
1 0 1
1 1 0

The bit shift operations SHIFT LEFT and SHIFT RIGHT cannot be described in compact tables like the other ones.
They shift the bits inside the register to the left or to the right, discarding information as bits "are pushed out" and
shifting new 0-bits in. The first operand for a shift operation is the number that will get shifted. The second operand
specifies the number of bit positions to shift by. Here are a few examples on 32-bit numbers to clarify what
these operations mean:

a b a << b
11010001010010100001101010001010 1 10100010100101000011010100010100
00000000100000001100000001011111 3 00000100000001100000001011111000
11100000010100110001010010111010 2 10000001010011000101001011101000
00001001101110000000001110000000 29 00000000000000000000000000000000

a b a >> b
11010001010010100001101010001010 1 01101000101001010000110101000101
00000000100000001100000001011111 3 00000000000100000001100000001011
11100000010100110001010010111010 2 00111000000101001100010100101110
00001001101110000000001110000000 27 00000000000000000000000000000001
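
You can experiment with all six operations in a high level language before worrying about the corresponding x86 instructions. The following C++ sketch (the values are arbitrary examples I picked) applies them to 32-bit unsigned integers:

#include <cstdint>
#include <cstdio>

int main()
{
    uint32_t value = 0xF4D7;

    uint32_t low_byte  = value & 0xFF;       // AND masks off everything but the lowest 8 bits -> 0xD7
    uint32_t with_flag = value | 0x80000000; // OR forces the most significant bit to 1
    uint32_t inverted  = ~value;             // NOT inverts every bit
    uint32_t flipped   = value ^ 0xFF;       // XOR flips the lowest 8 bits
    uint32_t doubled   = value << 1;         // SHIFT LEFT by 1 doubles the value (bits pushed out are lost)
    uint32_t halved    = value >> 4;         // SHIFT RIGHT by 4 divides an unsigned value by 16

    printf("%08X %08X %08X %08X %08X %08X\n",
           (unsigned)low_byte, (unsigned)with_flag, (unsigned)inverted,
           (unsigned)flipped, (unsigned)doubled, (unsigned)halved);
    return 0;
}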

Signed Integers
We have seen how unsigned integers are encoded in the 32-bit registers of the microprocessor but surely the x86
architecture has to deal with signed integers as well. There are different systems of representing signed integers using
binary encodings. The most intuitive approach is to sacrifice one bit (the "left-most" one) to represent the sign of the
integer. This system is called sign and magnitude and the bit we sacrificed is called the sign bit. A sign bit of 0 stands
for the lack of the minus and a sign bit of 1 means that the number is negative. The absolute value is expressed in all the
other bits. In an 8-bit register the binary number 10011100 represents the integer -28 (the sign bit is 1 and the magnitude bits 0011100 give 28). One
irritating property of this system is that there are two representations for the number 0: 00000000, 10000000 - a
"positive" zero and a "negative" zero - which is obviously a waste. The actual major problem with this system is that it is
very annoying to perform addition and subtraction in it because there are three different cases to be dealt with (positive
+ positive, positive + negative, negative + negative).
Another possibility is to represent the negative version of the natural number n, that is -n, by inverting all its bits (see
NOT operation). Hence this system is called One's Complement. This way also one bit is sacrificed to represent the
sign and there are two representations of the number zero again. So the negative version of 00001111 (which is 15) is
11110000 (which is -15) in our hypothetical 8 bit register. The advantage of this system is that it makes
addition/subtraction easier. Consider the operation -7 + 3: 11111000 + 00000011 = 11111011, which is the One's
Complement representation of -4. As you can see we can reduce the complexity of arithmetic operations this way as
long as we don't cross the zero barrier. Consider the case -1 + 2: 11111110 + 00000010 = 00000000. This is obviously
the wrong result, -1 + 2 is not zero, it's 1! Every time you cross the "zero barrier" in One's Complement you have to fix
the result by offsets of 1.
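You can reproduce this zero barrier problem yourself by simulating a hypothetical 8-bit One's Complement register with a uint8_t in C++ (this is only an experiment to illustrate the point, not how the x86 actually works):

#include <cstdint>
#include <cstdio>

int main()
{
    // One's Complement: the negative of n is ~n.
    uint8_t minus7 = (uint8_t)~7;        // 11111000
    uint8_t r1 = (uint8_t)(minus7 + 3);  // 11111011 = ~4, i.e. -4: correct
    uint8_t minus1 = (uint8_t)~1;        // 11111110
    uint8_t r2 = (uint8_t)(minus1 + 2);  // 00000000, but -1 + 2 should be 1: off by one
    printf("%02X %02X\n", r1, r2);       // prints: FB 00
    return 0;
}
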
So to solve this problem smart computer scientists came up with Two's Complement. It is similar to One's
Complement but all negative numbers have an additional offset of 1. This way there is no redundant zero and you can
flawlessly perform arithmetic operations without having to worry about the sign really. Two's Complement is the system
used by all important mainstream CPUs in desktop computers including the x86 architecture - so be sure to understand
this. There is no inherent difference between storing an unsigned integer and a signed integer in a 32-bit
register - it is all about how you treat it. The 32-bit value 11111111111111111111111111111111 can either stand
for 4294967295 = 2^32 - 1 (which is the greatest unsigned integer that can be represented with 32 bits) or, if
interpreted as a negative number, for -1. To calculate the Two's Complement representation of -n for a positive integer n you need
to:
1. Subtract 1
2. Invert bits
All of these operations are invertible so you can simply apply the inverse steps in reverse order (invert the bits first, then add 1) in order to convert a negative
number back to its positive representation. Let's have a look at the Two's Complement representations of a few integers:

Signed decimal number Two's Complement Representation


23 00000000000000000000000000010111
-1 11111111111111111111111111111111
-2 11111111111111111111111111111110
-3 11111111111111111111111111111101
-2147483648 10000000000000000000000000000000
2147483647 01111111111111111111111111111111

As you can see the greatest signed number that can be represented with Two's Complement is 2147483647 and the
lowest one is -2147483648.
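
Because C and C++ compilers targeting the x86 use Two's Complement for their signed integer types, you can observe all of this directly from a high level language. Here is a small sketch (the values are arbitrary) showing the dual interpretation of a bit pattern and the subtract-then-invert rule:

#include <cstdint>
#include <cstdio>

int main()
{
    uint32_t bits = 0xFFFFFFFF;                 // all 32 bits set

    // The same bit pattern, interpreted two different ways:
    printf("unsigned: %u\n", (unsigned)bits);   // 4294967295
    printf("signed:   %d\n", (int32_t)bits);    // -1

    // Two's Complement negation of n: subtract 1, then invert the bits.
    uint32_t n = 23;
    uint32_t neg = ~(n - 1);                    // same result as ~n + 1
    printf("-23 as bits: %08X, as signed: %d\n", (unsigned)neg, (int32_t)neg);
    return 0;
}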

The First Steps


Now that we have covered most of the basics we can finally get started with some x86 ASM. I am going to use Intel
syntax for now which is far more popular and more intuitive than AT&T syntax. Let's have a look at the following code:

mov eax, 1
mov ebx, 10
xor ecx, ecx
our_loop:
add ecx, eax
inc eax
cmp eax, ebx
jle our_loop

Most of these lines represent an instruction that gets executed by the processor. The pattern is instruction
[argument 1], [argument 2], .... Let's go through it line by line. mov eax, 1 calls the instruction mov with
the arguments eax and 1. The mov destination, source operator is kind of like the assignment operator in C++.
It copies the value of source to destination. In this case it writes the constant 1 to the 32-bit general purpose
register eax. That is one of the registers we can use for all kinds of calculations without breaking anything. Here is a list
of all the 32-bit general purpose registers of the x86 architecture:
1. eax
2. ebx
3. ecx
4. edx
5. edi
6. esi
7. ebp
As you can see they have different naming patterns and historically some of them are tied to special instructions which
hardly appear in modern code, though. Those special instructions are mostly still supported for legacy reasons but they are inefficient and
are not produced by modern compilers. After the mov eax, 1 instruction has been processed the instruction pointer gets
increased. The instruction pointer contains the address of the instruction that is currently being executed. The instruction pointer of the
x86 architecture is stored in the 32-bit register eip. So eip gets increased by the length of the mov eax, 1
instruction after it has been executed and eip points towards the next instruction which is mov ebx, 10. As you might
have guessed, mov ebx, 10 writes the value 10 to the register ebx. But now it gets interesting. xor ecx, ecx is
basically the x86 ASM equivalent of the C++ code ecx ^= ecx;. You might ask: What is the fucking point? The
educated mind sees immediately that a ^ a is equal to zero for all values of a. So what this basically does is set
ecx to zero. So you might wonder why you do not use mov ecx, 0 instead. Well, you could, but xor ecx, ecx is
actually shorter because it does not involve any constant. By always using xor reg, reg instead we reduce the
overall file size of our binaries and this can save time and space. But this instruction does more than just modify the ecx
register. The x86 microprocessor has a set of so called flags. These are 1 bit values which are used for conditional
jumps and other operations. We do not think about them most of the time and it is not too relevant for writing
conventional code but it is absolutely essential to know that they exist, what they are good for and how they are set. All
arithmetic-logical operations set some of these flags. The mov instruction does not modify the flags but the xor
instruction sets several of them. The Intel x86 Instruction Set Reference says: "The OF and CF flags are cleared; the
SF, ZF, and PF flags are set according to the result. The state of the AF flag is undefined." You can look such data up
in the original PDF document released by Intel or by using the HTML version provided on this site: x86 Instruction Set
Reference. Here is a list of the most important x86 flags and what they stand for:

Abbreviation Full name


CF Carry Flag
PF Parity Flag
ZF Zero Flag
SF Sign Flag
OF Overflow Flag

When you perform the addition of two large unsigned 32-bit numbers the result can obviously be too large to be stored
in a 32-bit register so what you get is a so called wraparound or an overflow. Consider the addition of the following
numbers:

Summand 1 11111111111111111111111111111111
Summand 2 + 00000000000000000000000000000001
Sum 100000000000000000000000000000000

The result requires at least 33 bits to be represented properly. So the leading bit of the sum (the "33rd" bit) is basically
discarded and not stored in the target register - instead it gets stored in the Carry Flag. The Carry Flag
is basically always the "33rd" bit. Actually, it is not a standalone 1 bit value but a single bit at offset 0 in a 32 bit register
known as the EFLAGS register. This is where all the flags are stored - so obviously there are far more flags than just
the ones listed above but most of them are not of any relevance for writing user land applications. The Parity Flag is
set to 1 if the right most 8 bits of the result contain an even number of 1-bits. It is set to 0
otherwise. It is basically the result of XORing those 8 bits with each other and then inverting the result. The Zero
Flag is set when the result is equal to zero - very useful. The Sign Flag is the left most bit of the 32-bit result - it
represents its sign. The Overflow Flag is set when an overflow occurred.
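The flags themselves are not visible from a high level language, but the situations that set them are easy to reproduce. The following C++ sketch shows the unsigned wraparound that would set the Carry Flag and the signed wraparound that would set the Overflow Flag (the signed case is computed on unsigned values first because signed overflow is undefined behaviour in C++):

#include <cstdint>
#include <cstdio>

int main()
{
    // Unsigned wraparound: this is the situation in which the CPU sets the Carry Flag.
    uint32_t a = 0xFFFFFFFF;
    uint32_t b = 1;
    uint32_t sum = a + b;                      // the "33rd" bit is lost, sum becomes 0
    printf("unsigned sum: %u\n", (unsigned)sum);

    // Signed overflow: adding 1 to the greatest positive value yields a negative result,
    // which is the situation in which the CPU sets the Overflow Flag.
    int32_t c = 2147483647;
    int32_t d = (int32_t)((uint32_t)c + 1u);   // wraps around to -2147483648
    printf("signed overflow result: %d\n", d);
    return 0;
}
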
But let us get back to the actual code we were discussing. In our code none of the flags that are set by the arithmetic-logical
operations actually get used - they are simply overwritten immediately. The next line in the code, our_loop:, is not an
actual instruction that gets executed but a directive for your assembler which allows you to define labels. These are
markings in the code which are essential to control the flow of execution by providing named locations to which you can
jump. add ecx, eax is the x86 ASM equivalent of ecx += eax; and it sets the flags again but they are unused -
again. inc eax increments the value of eax by 1 - it is the equivalent of eax++;. We could have used add eax, 1
instead and it is pretty much the same but the inc instruction is smaller so it is preferable. It does not affect the carry
flag but it does not make any difference in this case. cmp eax, ebx is short for "compare" and it performs the
subtraction eax - ebx but the result is not stored anywhere. Instead it simply sets the flags we just talked about
according to the result. This is an essential way to check whether two registers have equal values, for example. In case
they are equal the result of the subtraction is zero so the Zero Flag gets set and we can act accordingly. A cmp is the
first part of the x86 ASM equivalent of a C++ if construct, so to say. The next part is one of the conditional jumps - in
this case it is jle our_loop. jle is short for "jump if less or equal". So, what is a jump? It is a method of controlling
the flow of execution of your program by basically telling your microprocessor where to resume the execution after the
jump. In the case of the x86 32-bit architecture it means modifying the value of eip. When jumps actually get encoded
they just contain numeric offsets which are either relative to the position of the jump ("jump 25 backward") or even
absolute ("go to the instruction at offset 0x12345678"). There are unconditional jumps which are executed no matter
what and conditional jumps which only perform jumps under certain conditions. If a conditional jump is not taken the
execution simply continues with the next instruction after the jump as usual. our_loop refers to the label we previously
defined so if the jump gets executed the next instruction will be add ecx, eax. So in this case we are dealing with a
jump which is executed if the flags indicate that the result is "less or equal". What is this supposed to mean?! Less than
or equal to what? The cmp operation is almost always used in combination with a conditional jump. Together they are
the x86 ASM equivalent of the C++ if construct. So what the "less or equal" part relates to is the cmp executed prior to
the conditional jump. Read cmp eax, ebx in combination with jle our_loop as if(eax <= ebx) goto
our_loop; - resume execution at label our_loop if eax is less than or equal to ebx. There are quite a few different
mnemonics for such jumps:

Mnemonic Meaning
ja Jump if above (CF=0 and ZF=0).
jae Jump if above or equal (CF=0).
jb Jump if below (CF=1).
jbe Jump if below or equal (CF=1 or ZF=1).
jc Jump if carry (CF=1).
je Jump if equal (ZF=1).
jz Jump if 0 (ZF=1).
jg Jump if greater (ZF=0 and SF=OF).
jge Jump if greater or equal (SF=OF).
jl Jump if less (SF!=OF).
jle Jump if less or equal (ZF=1 or SF!=OF).
jna Jump if not above (CF=1 or ZF=1).
jnae Jump if not above or equal (CF=1).
jnb Jump if not below (CF=0).
jnbe Jump if not below or equal (CF=0 and ZF=0).
jnc Jump if not carry (CF=0).
jne Jump if not equal (ZF=0).
jng Jump if not greater (ZF=1 or SF!=OF).
jnge Jump if not greater or equal (SF!=OF).
jnl Jump if not less (SF=OF).
jnle Jump if not less or equal (ZF=0 and SF=OF).
jno Jump if not overflow (OF=0).
jnp Jump if not parity (PF=0).
jns Jump if not sign (SF=0).
jnz Jump if not zero (ZF=0).
jo Jump if overflow (OF=1).
jp Jump if parity (PF=1).
jpe Jump if parity even (PF=1).
jpo Jump if parity odd (PF=0).
js Jump if sign (SF=1).
jz Jump if 0 (ZF=1).

As you can see several of these mnemonics basically do the same thing and they are actually encoded the same way in
the actual executable files - they just exist to make the jobs for the programmers easier. So let's translate this assembly
code to C++ so we can see what it does:

int eax, ebx, ecx;

//...

//mov eax, 1
eax = 1;

//mov ebx, 10
ebx = 10;

//xor ecx, ecx
ecx = 0;

our_loop:

//add ecx, eax
ecx += eax;

//inc eax
eax++;

//cmp eax, ebx
//jle our_loop
if(eax <= ebx)
goto our_loop;

As you can see there is some kind of iteration going on. eax is the iterator which is involved in the decision when to
leave the loop, ebx contains the limit for the iterator and ecx += eax is the actual body of the loop. So we can further
simplify the program to:

int sum = 0;
for(int i = 1; i <= 10; i++)
sum += i;

Nice, now that is actually really short and easy to comprehend C/C++ code. So it actually calculates the sum 1 + 2 + ...
+ 9 + 10 = 55. Currently the upper limit is 10 - what do we do if we want to set this limit to an arbitrary number? In a high
level language we would write a new function for this purpose which takes the limit for the loop as its sole argument and
which returns an integer, something along the lines of:

int calculate_sum(int limit)
{
int sum = 0;
for(int i = 1; i <= limit; i++)
sum += i;
return sum;
}

So, how do we translate this back to x86 ASM? There is a direct equivalent for functions, too. Let us check out what it
looks like when we translate this function to assembly language and call the function with a larger limit of 20:

calculate_sum:
mov eax, 1
mov ebx, [esp + 4]
xor ecx, ecx
sum_loop:
add ecx, eax
inc eax
cmp eax, ebx
jle sum_loop
mov eax, ecx
ret 4

start:
push 20
call calculate_sum

The first change you can see is the calculate_sum label that has been added. It is necessary so we have a marker in
the code which we can use to call the function. The next change is the mov ebx, [esp + 4] part. The brackets
mean that we perform a memory access (RAM). It is the x86 ASM equivalent of the C/C++ * operator to dereference
pointers. So in this case the pointer is esp + 4. The x86 architecture supports statements of the form register +
constant_offset for this purpose. So in this case the address esp + 4 is calculated and dereferenced. But how
many bytes are read and stored in ebx? The register is 32-bits in size so 4 bytes are read, beginning at the address
esp + 4, so the bytes at addresses esp + 4, esp + 5, esp + 6 and esp + 7 are stored in ebx - but in what order? Which byte
goes where? The x86 architecture uses the so called Little Endian byte order in memory. In this order the least
significant byte is stored first in memory:

10010011000100001100000001110010

In this case the left-most bit is the most significant bit and the right-most bit is the least significant bit. Significance means how
much of a difference the bit can make to the value of the integer. The most significant bit has a value of 2^31 whereas
the least significant bit only makes a difference of 2^0 = 1. We do not have to worry about the byte order as long as we stick to writing
values of a given size to memory and reading the same number of bytes later. When we read 4 bytes from memory with an instruction like
mov ebx, [esi] the x86 microprocessor reads 4 bytes in Little Endian order, of course. The only time we have to
worry about it is when we have an integer of several bytes in memory and we want to access particular bytes of it
individually. In that case we have to know what Endianness the underlying system uses so we read the
right bytes. The opposite of Little Endian byte order is Big Endian. In Big Endian byte order the most significant byte
is stored first - just like we do in normal hexadecimal and decimal notation. We write down the number 931 to express
9 * 100 + 3 * 10 + 1. If we applied the Little Endian philosophy to this we would write 139 instead of 931, so to
say. Let us have a look at a few examples of how bytes are stored in memory with different "Endiannesses":

Hexadecimal notation Consecutive bytes in memory with Little Endian byte order Consecutive bytes in memory with Big Endian byte order
0xFF 0xFF 0xFF
0x1234 0x34, 0x12 0x12, 0x34
0x09451143 0x43, 0x11, 0x45, 0x09 0x09, 0x45, 0x11, 0x43
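
You can check the byte order of your own machine with a few lines of C++ - the following sketch stores a 32-bit value to memory and then inspects the individual bytes (0x09451143 is the same example value as in the table above):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    uint32_t value = 0x09451143;
    unsigned char bytes[4];

    // Copy the 32-bit value into a byte array, just like a store to memory would.
    memcpy(bytes, &value, sizeof(value));

    // On a Little Endian machine (such as the x86) this prints: 43 11 45 09
    // On a Big Endian machine it would print:                   09 45 11 43
    printf("%02X %02X %02X %02X\n", bytes[0], bytes[1], bytes[2], bytes[3]);
    return 0;
}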

So for single bytes it does not make any difference but as soon as you deal with integers which require two or more
bytes the Endianness becomes an issue. Big Endian byte order is much more intuitive as you can see, since it matches
the way we usually write numbers so it is easier for humans to read when looking at hex dumps and stuff like that. With
Little Endian byte order you would have to reverse it first in order to parse it in the way you usually do. Unluckily the x86
architecture uses the Little Endian byte order so there is not much we can do about it. Even the internet is based on the
Big Endian byte order - all the official protocols such as IP, TCP and UDP use it. This is why it is also called Network
Byte Order. When you send multi byte integer values in the headers of the official protocols using a Little Endian byte
order, all the bytes must be reversed first in order to satisfy the specifications - what a bummer! Now that we
have cleared this up we should get back to the statement we were originally discussing, mov ebx, [esp + 4]. So
there is a new register in that statement: esp. Its name is short for extended stack pointer which is actually a 32-bit
"general purpose register" just like the other 7 ones that we listed above. But messing with it can have disastrous
results and there is not really a point to use it for any other purpose in normal code, really. The stack is a part of the
memory your application uses. The C++ analogy to this mechanism is std::stack (which I hope you are familiar with)
and in Java it would be java.util.Stack. A stack is one of the basic container/data structure types which (in theory)
only allows you to access the element at the top of the stack by "popping" it off and it allows you to add new elements
to the stack by "pushing" them on the top. In terms of microprocessors it is a bit different because you usually access
several elements inside the stack which are not at the top because it is more efficient. Let us have a look at a few
examples of operations on an abstract stack of integers. This is what our stack looks like at the beginning: (do not be
confused by the "Bottom of the stack" part, the bottom does not serve a special purpose and it is just marked to clarify
things)

Description Value
Top of the stack 4
17
Bottom of the stack 2

We are going to execute a series of operations on this stack now to observe the changes. First we will push an 8 onto
the top. The C++ pseudo code equivalent for this action shall be: push(8);

Description Value
Top of the stack 8
4
17
Bottom of the stack 2

As you can see the 4 is no longer the top of the stack and the value we just pushed onto the stack is now at the top.
Next pseudocode to execute: int a = pop(); int b = pop();

Description Value
Top of the stack 4
17
Bottom of the stack 2

Description Value
Top of the stack 17
Bottom of the stack 2

The pseudo function pop() pops one value off the stack and returns its value. So the two variables we used in the
pseudocode we just executed now have the values a == 8 and b == 4. So what is the point of this whole stack?
What does it get used for in the x86 architecture and other microprocessor architectures? It is an essential data
structure which is used to:
1. Pass arguments to functions
2. Store return addresses
3. Allocate local variables which are too large for the registers
We will get back to what a return address is and how we allocate space for local variables later. At first I would like to
discuss the necessity of the stack to pass arguments to functions. You might wonder: "Why do you not simply pass the
arguments inside registers?" There actually is a special type of calling convention for functions which does use this
mechanism and it is known as fastcall - because it is extremely fast since it uses registers only for the arguments.
This, too, is not yet the point to discuss calling conventions, though. The problem with having all your arguments in
registers is that you have a very limited amount of registers but a large amount of functions which call each other in
complicated nested ways so you will eventually run out of registers to use in one way or the other. Recursive function
calls of arbitrary call depth are physically not possible without storing the arguments somewhere in memory. This is why
we pass most arguments on the stack. We push the arguments on the "top", call our function and remove them after
that. The x86 architecture stores the current pointer to the top of the stack, the stack pointer, in the esp register. So
mov eax, [esp] performs a memory access and copies the value of the top element of the stack into the eax
register. We are dealing with a 32-bit architecture so all elements on the stack are of the same size which is 4 bytes.
This is why esp will usually have a value which is a multiple of 4. You might expect that mov eax, [esp - 4] might
access the element right "below" the top one - but that is not so. On the x86 architecture the next element below the top
is actually stored at esp + 4 so we have to use mov eax, [esp + 4] to copy its value to eax. The third element
would be at esp + 8, the fourth at esp + 12 and so on. I am not entirely sure why they picked this somewhat
counter-intuitive order for the stack - probably because they thought that addition is more comfortable to deal with to
access arguments. The x86 instruction to push a new argument on top of the stack is push. It takes one argument
which can be a register or an immediate value, for example. An immediate value is any constant value like 2 or
0x00ff1234 which is directly encoded into the instruction. So what push eax does is:
1. sub esp, 4 (decrease the stack pointer by the word size of the microprocessor to make space for the new
top element)
2. mov [esp], eax (write the value of eax to the memory at the address contained in esp - the top of the
stack is now equal to eax)
You can basically replace push with those two instructions all the time but x86 is a CISC architecture so we save space
by using push instead. sub is short for subtract and as you might have guessed already, it performs a subtraction like
add performs an addition. The arguments are intuitive, sub register1, register2 is equivalent to the C++ code
register1 -= register2; or register1 = register1 - register2;. The x86 instruction to pop an element
off the stack is called pop. Like push, it takes one argument, too. Usually it is a register so let us check out what pop
ebx does:
1. mov ebx, [esp] (read 4 bytes from the memory at the address specified by the register esp and store
them in ebx)
2. add esp, 4 (add 4 to the stack pointer so it makes the element "below" the top one the new top element of
the stack)
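If the direction in which the stack grows seems confusing, the following C++ model of push and pop might help. It is only a simulation of the mechanism - the array size is arbitrary and the pointer is merely named after esp for the analogy:

#include <cstdint>
#include <cstdio>

// A tiny model of the x86 stack: it grows towards lower addresses, so "push"
// first decreases the stack pointer and then stores the value, while "pop"
// reads the value and then increases the stack pointer again.
uint32_t memory[64];
uint32_t *esp = memory + 64;   // the stack pointer starts at the "high" end

void push(uint32_t value)
{
    esp -= 1;                  // sub esp, 4
    *esp = value;              // mov [esp], eax
}

uint32_t pop()
{
    uint32_t value = *esp;     // mov eax, [esp]
    esp += 1;                  // add esp, 4
    return value;
}

int main()
{
    push(20);
    printf("top of stack: %u\n", (unsigned)esp[0]);  // 20, like reading [esp]
    printf("popped: %u\n", (unsigned)pop());
    return 0;
}
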
So, to get back to the piece of x86 ASM we were previously discussing: What mov ebx, [esp + 4] really does is
copying the first argument passed to the function into the ebx register. The first argument is the second element of the
stack, the second argument is the third and so on. So what does the top of the stack contain? What is stored at
[esp]? This is the so called return address which is necessary for a function to know where it was originally called
from so eip, the instruction pointer, can be set to the right value after the function returns. We want our program to
resume execution with the instruction after our call, after all.
The start label is not really relevant for the understanding of the principle but I simply wanted to point out where the
execution actually started in case somebody got confused by the stream of x86 ASM instructions. The assembly code
push 20, call calculate_sum first pushes the immediate value 20 on top of the stack and then calls the function.
Remember that we only see this label in our assembly language code. It does not exist as such in the actual binary file.
In the executable file you will simply see the call with an address after it. This address points to the first instruction of
the function we marked with the label calculate_sum. In this case it is the instruction mov eax, 1. This is what
call address actually does:
1. esp -= 4; (make space for the return address on the stack)
2. *esp = address_of_next_instruction; (stores the address of the instruction after the call on the
stack)
3. eip = address; (modify the instruction pointer so it resumes execution in the body of the function)
As you can see I did not use x86 ASM instructions to describe what this function does. Instead, I resorted to the C++
pseudo code I use occasionally. This is because we cannot modify eip like a normal register. It takes a jump or a call
(which is similar to a jump) to achieve that. What the combination of push 20 and call calculate_sum really
translates to in C++ is calculate_sum(20);. The function calculate_sum is called with the argument 20 - which is
what we originally wanted. You will see the push/call construct quite frequently. Usually functions have even more than
one argument so you will see multiple pushes prior to a call. Perhaps you have noticed that another element has been
added to the function. The ret instruction is used to return from a function after it has been called. Additionally it can
take an argument which in the case of ret 4 is equal to 4. Let us have a look at what this ret argument instruction
does:
1. eip = *esp (copy the return address from the stack back into the instruction pointer to resume execution at
the instruction after the call)
2. esp += 4 + argument; (remove the return address from the stack and, if necessary, arguments that were
pushed onto the stack for the function)
If a function takes n arguments you will usually return from it with the instruction ret 4*n. If the function is nullary
(if it takes no arguments), you can simply use ret instead of ret 0. You also might have noticed the additional mov
eax, ecx I inserted prior to the ret 4. I did this because it is a common calling convention to return the result of a
function in the eax register which is the first of the 8 (or rather 7 without esp) general purpose registers. Of course you
do not have to do this but if you interface with C/C++ code you will have to know this of course. Now that we have
covered the basics of calling and returning we can move on to the topic of calling conventions. You might remember me
talking about the fastcall convention - it is one of them. If you are an experienced C/C++ programmer you should
already be familiar with a few of them. They differ in the following regards:
1. How do arguments get passed? On the stack? In registers?
2. In what order are arguments passed?
3. Who is responsible for fixing the stack after the function has processed the arguments?
4. What registers must be preserved by the function?
Let us take a look at the most common calling conventions:

Calling convention   Where are the arguments stored?   In what order are the arguments passed?   Who is responsible for cleaning up the stack?   Registers that have to be preserved
cdecl (C declaration)   Stack   Right to left   Calling function   ebx edi esi ebp
stdcall (standard call)   Stack   Right to left   Called function   ebx edi esi ebp
fastcall (fast call)   Registers and stack   1. ecx, 2. edx, 3. and further: stack from right to left   Called function   ebx edi esi ebp

You will find the cdecl calling convention in pretty much all C based interfaces, especially obviously in the world of
UNIX-like operating systems. Windows uses the stdcall convention for most of its API. Right to left means that the
rightmost argument in conventional notation is pushed first in the assembly code. The order does obviously not make
any difference if there is only a single argument as in the code we were discussing. If the called function is responsible
for fixing the stack - that is, getting rid of the arguments below the return address which were passed to the function - it
simply does what our code does: it uses the ret instruction with the right argument to clean up the stack. If the calling
function is responsible for cleaning up the stack, the called function usually simply uses a plain ret and leaves all the work
to whoever called it. Let us clarify by example:

void __cdecl test1(int a, int b, int c)
{
}

void __stdcall test2(int a, int b, int c)
{
}

test1(1, 2, 3);
test2(1, 2, 3);

This C++ code would look something like this in x86 ASM:

test1:
ret

test2:
ret 12

start:
push 3
push 2
push 1
call test1
add esp, 12

push 3
push 2
push 1
call test2

So the differences between these two function calls are:

Calling convention   Description
cdecl (C declaration)   Uses ret to return and forces the calling function to add esp, 12 to fix the stack.
stdcall (standard call)   Uses ret 12 to return so the calling function does not have to fix the stack.

stdcall is more efficient than cdecl because it requires one instruction less to execute. Do not be fooled by the
number of instructions - there are tons of examples in which code which uses more instructions to solve the same
problem is actually faster but in this case it happens to be intuitively right because all these simple instructions get
executed at about the same speed. However, it is not yet the time to talk about optimisations in general so let us return to the
matter at hand. Now that we have covered some of the fundamentals of x86 assembly programming we can finally start
writing simple programs that actually interact with the API of the underlying OS. We are going to implement the same
basic command line program on different operating systems using various assembler programs in the next section of
this document. This program will simply be an extension of what we have already written.

Writing Windows Applications Using x86 32-bit Assembly

At first we are going to take a look at how to write assembly for the popular Windows operating system. We are going
to check out the Microsoft (R) Macro Assembler Version 9.00.21022.08 (short MASM), the great open source program
flat assembler 1.67.27 for Windows (short FASM) and the netwide assembler 2.04 RC 1 (short NASM).

MASM
This is the most used assembler for writing assembly code for the Microsoft Windows operating systems. Before I am
going to just throw the code right into your face I would like to dedicate a few words to what is known as MASM32 - a
great source of confusion and frustration.
MASM32 (from movsd.com/masm32.com) does not have anything to do with Microsoft or MASM as such. It is a
package mostly made by an Australian known as Steve Hutchesson (aka hutch). This package contains lots of headers
and example code and it uses the Microsoft ml/link binaries version 6.x from 1998 or so. It is absolutely useless for any
modern x86 32-bit Windows assembly programming. Never make the mistake of mixing up MASM and MASM32. I see
a lot of people making this mistake. I have even seen people give up on MASM because they thought that MASM32
was it. It is really sickening. As for the actual package itself, the example code may be useful for new ASM
programmers but you should stay away from the MASM32 "library" and the outdated headers which give you parsing
errors in new versions of MASM.
Let us get back to the actual code. We will go through it line by line:

;Arithmetic sum program

.686p
.model flat, stdcall

option epilogue: none
option prologue: none

GetStdHandle proto :dword
wsprintfA proto c :vararg
WriteConsoleA proto :dword, :dword, :dword, :dword, :dword
ExitProcess proto :dword

STD_OUTPUT_HANDLE equ -11

.data

sum_string db "Arithmetic sum for the limit %d is %d", 10, 0

.data?

buffer db 64 dup(?)
bytes_written dd ?

.code

calculate_sum proc limit:dword
xor esi, esi
mov edi, 1
jmp loop_start
sum_loop:
add esi, edi
inc edi
loop_start:
cmp edi, [esp + 4]
jle sum_loop
ret
calculate_sum endp

public main

main proc
invoke GetStdHandle, STD_OUTPUT_HANDLE
mov ebp, eax
xor ebx, ebx
main_loop:
invoke calculate_sum, ebx
invoke wsprintfA, addr buffer, addr sum_string, ebx, esi
invoke WriteConsoleA, ebp, addr buffer, eax, addr bytes_written, 0
inc ebx
cmp ebx, 20
jle main_loop
invoke ExitProcess, 0
main endp

end

;Arithmetic sum program is a comment. In MASM comments are made using the semicolon character. There are
only single line comments so it is pretty much the equivalent of the C/C++ //. Unluckily there is no multi line comment
feature. .686p specifies what instructions are allowed to be used in the assembly code. There are numerous other
ones but this is a nobrainer really. .686p is simply the most "modern" one. .model flat, stdcall specifies two
things. For one it specifies the memory model to be used. The flat memory model is the only one supported by
modern Windows operating systems so it is hardly worth discussing. The stdcall part specifies the default calling
convention. Since we are dealing with Windows API code we definitely should use stdcall. option epilogue:
none and option prologue: none tell the assembler that we do not want to use the default MASM
prologues/epilogues. These are instructions that are automatically inserted at the beginning of your function and wherever
you use ret. It is a nice feature to support but it is very annoying that it actually puts some instructions there you do not
see in your code - instructions you never expected to be there until you run your binary in a debugger to find out that
some strange feature is totally breaking your assembly! In the case of MASM the default code establishes a so
called stack frame. This is mostly a nasty relic of the past which is still used by unenlightened copycats who never
bother to actually check whether such features actually make sense or not. I am just going to demonstrate this dumb
practice to you once:

;non-sensical function
some_function proc
xor eax, eax
jmp return
ret
return:
mov eax, 1
ret
some_function endp

This function does not perform any meaningful task, it is merely an example. With the default epilogue/prologue code
this will assemble to:

push ebp
mov ebp, esp
xor eax, eax
jmp some_constant ;I do not know the immediate value that will be used here so I just made one up
pop ebp
ret
mov eax, 1
pop ebp
ret

So as you can see every proc (a fancy way of declaring a function, short for "procedure") will generate an additional
push ebp, mov ebp, esp and an additional pop ebp prior to every single ret. So at first the value of ebp is
supposed to get preserved by being copied to the stack. Then esp is copied to ebp. In the end the original value of
ebp gets restored by pop ebp. So, what is the point of all this? When you manipulate esp in your function body you
will have to watch what offset your original arguments which were passed to your function are at. The idea of the stack
frame is to preserve the original value so you do not have to do any thinking to come up with the offset. The truth is that
stack frames are useless unless you are performing operations which do not let you predict the argument
offset at assembly time. It is a waste of instructions. All you usually achieve is being able to write stuff like mov eax,
[ebp + 8] instead of mov eax, [esp + 4] which is really idiotic in general. So do not use this feature. Now that
we have cleared this up, let us get back to the code at hand.
GetStdHandle proto :dword declares the prototype (proto) of a function called GetStdHandle which uses the
default calling convention we specified. We specified stdcall in the .model directive. We could even have used
GetStdHandle proto stdcall :dword to be even more explicit about it but it is unnecessary in this case since
we already specified a default calling convention. If you do not know what a particular function does you should
definitely look it up on MSDN. You might even know some of these Windows API calls but you are probably surprised
by these -A prefixes some of these have. The Windows API uses the -A prefix for "ASCII" and the -W prefix for "Wide
chars" - Unicode. They are redirected according to the preprocessor definitions in the C interface. The -A variants
usually simply call the -W ones after a few string conversions. Windows handles most strings internally as Unicode. It is
more convenient to use ASCII in this case, though, so we use the -A interfaces. On the next few lines a few more
prototypes are declared and the majority of them are just like this one but the wsprintfA one stands out because
of the vararg part. This nasty function takes a variable amount of arguments and it is up to us to fix them later. It uses
the C calling convention, hence the c part. These prototypes basically work like the function declarations in C/C++. We
definitely need to use them for external calls so we need to declare them as prototypes using the proto directive first
before we can use them. STD_OUTPUT_HANDLE equ -11 is a directive which will replace all future occurrences of the
name STD_OUTPUT_HANDLE with the constant -11. We need this one for GetStdHandle. The .data directive
declares the start of the data section. This is where we store static data/global variables, so to say. sum_string db
"Arithmetic sum for the limit %d is %d", 10, 0 declares a zero terminated string "Arithmetic sum for
the limit %d is %d\n" by the name of sum_string. db is short for "declare byte(s)", I believe. Consecutive bytes
can simply be declared using the comma notation you can see in those lines. The .data? directive introduces the
uninitialised data section. We store static variables which have no predefined value in this part of the program. buffer
db 64 dup (?) declares a range of 64 bytes which have no defined value as specified in the parenthesis (?). dup is
short for "duplicate", or so, I believe. bytes_written dd ? uses the dd directive which is short for "declare
DWORD" which has no predefined value. At this point I should probably explain the Microsoft/Intel terms for different
byte counts:

Term Full name Size in bytes


BYTE Byte 1
WORD Word 2
DWORD Double word 4
QWORD Quad word 8
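
For comparison, here are rough C/C++ counterparts of these size terms and of the declarations in our data sections (dq, the QWORD counterpart of dd, is not used in our program - I only list it for completeness):

#include <cstdint>

// Rough C/C++ counterparts of the Intel/Microsoft size terms:
uint8_t  a_byte;        // BYTE  - declared with db in MASM
uint16_t a_word;        // WORD  - declared with dw in MASM
uint32_t a_dword;       // DWORD - declared with dd in MASM
uint64_t a_qword;       // QWORD - declared with dq in MASM

// And the two variables from our uninitialised data section:
char     buffer[64];    // buffer db 64 dup(?)
uint32_t bytes_written; // bytes_written dd ?

int main()
{
    return 0;
}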

Actually, "word" usually refers to whatever the normal size of the general purpose registers of a microprocessor may be
which in the case of the x86 32-bit architecture happens to be 4 bytes. Intel and Microsoft somewhat hijacked these
terms so you have to be careful not to mix this stuff up. .code represents the beginning of the code section where all the
instructions are stored. calculate_sum proc limit:dword declares the beginning of a function (procedure) by
the name of calculate_sum which takes one argument which happens to go by the name of limit which is for
documentative purpose mostly, though. MASM does support statements like mov eax, limit but in this case it
would translate to mov eax, [ebp + 8] because this inferior software simply assumes that you use the stupid stack
frame thing I previously explained. So just stick to accessing the arguments using esp directly - it is much more
comfortable and efficient in the long run. The rest of the function is actually pretty much the same - except for the ret.
As you can see there no longer is any argument on the return instruction. This is managed by MASM automatically since
we specified the number of arguments in the proc line. calculate_sum endp declares the end of the procedure.
Requiring the programmer to add the name of the function there is really redundant and I have no idea why they coded
it this way but we do not have much of a choice if we want to use the proc feature in MASM. endp is short for "end of
procedure", I suppose. public main makes the label/function main a public symbol so we can later specify it as the
entry point of our program in the linker command line. main proc declares the beginning of the main procedure. I
chose to call it main because of the meaning of the name in C/C++ but you could pretty much give it any name you
want really. The main procedure contains the code that gets executed first. invoke is not an actual x86 instruction - it
is a MASM macro which makes calling functions with one or more arguments more comfortable. As you can see it allows
you to call functions pretty much as in C/C++. The addr directive gets the address of a named object inside the data
section. invoke wsprintfA, addr buffer, addr sum_string, ebx, esi calls the user32.dll function
wsprintfA which has a variable number of arguments and uses the C calling convention. We use it to fill the static
buffer with the formatted string output with the appropriate values plugged in where the %d are but you should be
familiar with this nasty stuff from C and possibly other high level programming languages. The rest of the program
should be fairly obvious with the knowledge you now possess so let us find out how to assemble, link and run this stuff.
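For reference, here is a hedged sketch of what an invoke line like the wsprintfA call above roughly expands to. Since wsprintfA uses the C calling convention, the arguments are pushed from right to left and the caller has to clean up the stack afterwards:

;rough manual equivalent of: invoke wsprintfA, addr buffer, addr sum_string, ebx, esi
push esi
push ebx
push offset sum_string
push offset buffer
call wsprintfA
;4 arguments of 4 bytes each were pushed, so the caller removes 16 bytes again
add esp, 16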
At first you will need to get your hands on the relevant MASM binaries. mspdb80.dll and link.exe come with Microsoft
Visual C++ Express 9 which can be downloaded for free from http://www.microsoft.com/express/download/. I
recommend that you get the offline installation instead of the annoying web installer - it simply gives you more control. I
do not know where to get the current ml.exe from for free. You can get the 8.0 one at http://www.microsoft.com
/downloads/details.aspx?FamilyID=7a1c9da0-0510-44a2-b042-7ef370530c64&displaylang=en apparently but that's
outdated. If you already own a legal or illegal copy of Microsoft Visual C++ 9 you will not need to download it,
obviously. We only care about three files out of the installation anyways. They are:
1. ml.exe (771 KiB)
2. link.exe (349 KiB)
3. mspdb80.dll (188 KiB)
Ok, actually we can use more of the files - the lib and dll files. Here is the commandline I used to assemble and link this
program:

D:\Code\Assembly\calculate_sum\source>"D:\Code\Assembly\binaries\32\ml.exe" /c /Cx /c
Assembling: calculate_sum.asm

D:\Code\Assembly\calculate_sum\binary>"D:\Code\Assembly\binaries\32\link.exe" /MACHIN
/OUT:calculate_sum.exe /SUBSYSTEM:CONSOLE calculate_sum.obj /LIBPATH:"C:\Program File
8\VC\PlatformSDK\Lib" kernel32.lib "C:\Program Files (x86)\Microsoft Visual Studio 9.

ml.exe is the actual Microsoft Macro Assembler which we use to produce object files. link.exe is the Microsoft linker
which produces the actual binaries from the object files. Let us first quickly go through the arguments of each program.
/c tells it that you want to generate a single object file without having ml produce an .exe directly. /Cx makes it case
sensitive: "Preserve case in publics, externs". /coff makes it generate COFF files which is the standard object format
Windows uses. /nologo suppresses the Microsoft copyright banner that ml would otherwise print to the console for no
particular reason. calculate_sum.asm is simply the name of the source file in which I stored the assembly code.
Now for the linker arguments. /MACHINE:X86 specifies the x86 machine as target platform. /NODEFAULTLIB
specifies that it must not use default libraries. /NOLOGO does the same thing as before. /ENTRY:main specifies the
main entry point of the program - this is the place where the execution starts first. This is directly related to the public
main in our assembly code. /OUT:calculate_sum.exe specifies the output file name which is our binary -
calculate_sum.exe. /SUBSYSTEM:CONSOLE specifies the subsystem to use. In this case we want a console by
default so we're using the Windows PE Loader Console setting. calculate_sum.obj is the output from ml.exe.
/LIBPATH:"C:\Program Files (x86)\Microsoft Visual Studio 8\VC\PlatformSDK\Lib" specifies
the default path where link.exe should look for library files to link with. kernel32.lib is an essential library which we
need for the ExitProcess call. "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC
\lib\msvcrt.lib" links the object file with the Microsoft Visual C runtime. We need this to use all the C library
functions we are using. So after executing both lines we should have a working small 2.5 KiB executable which we are
going to run now:

D:\Code\Assembly\calculate_sum\binary>calculate_sum.exe
Arithmetic sum for the limit 0 is 0
Arithmetic sum for the limit 1 is 1
Arithmetic sum for the limit 2 is 3
Arithmetic sum for the limit 3 is 6
Arithmetic sum for the limit 4 is 10
Arithmetic sum for the limit 5 is 15
Arithmetic sum for the limit 6 is 21
Arithmetic sum for the limit 7 is 28
Arithmetic sum for the limit 8 is 36
Arithmetic sum for the limit 9 is 45
Arithmetic sum for the limit 10 is 55
Arithmetic sum for the limit 11 is 66
Arithmetic sum for the limit 12 is 78
Arithmetic sum for the limit 13 is 91
Arithmetic sum for the limit 14 is 105
Arithmetic sum for the limit 15 is 120
Arithmetic sum for the limit 16 is 136
Arithmetic sum for the limit 17 is 153
Arithmetic sum for the limit 18 is 171
Arithmetic sum for the limit 19 is 190
Arithmetic sum for the limit 20 is 210

D:\Code\Assembly\calculate_sum\binary>


FASM
Flat assembler is a great multi platform open source program which you can download at http://flatassembler.net
/download.php. Here is the equivalent code:

format PE Console 4.0


entry start

include 'D:\Downloads\fasmw16727\INCLUDE\win32a.inc'
include 'D:\Downloads\fasmw16727\INCLUDE\macro\masm.inc'

option prologue:none
option epilogue:none

section '.data' data readable writeable

sum_string db 'Arithmetic sum for the limit %d is %d', 10, 0

section '.bss' readable writeable

buffer db 64 dup (?)


bytes_written dd ?

section '.code' code readable executable

proc calculate_sum limit


xor esi, esi
mov edi, 1
jmp loop_start
sum_loop:
add esi, edi
inc edi
loop_start:
cmp edi, [esp + 4]
jle sum_loop
ret 4
endp

start:
invoke GetStdHandle, STD_OUTPUT_HANDLE
mov ebp, eax
xor ebx, ebx
main_loop:
stdcall calculate_sum, ebx
invoke wsprintfA, buffer, sum_string, ebx, esi
invoke WriteConsoleA, ebp, buffer, eax, bytes_written, 0
inc ebx
cmp ebx, 20
jle main_loop
invoke ExitProcess, 0

section '.idata' import data readable

library\
kernel32, 'kernel32.dll',\
user32, 'user32.dll'

import kernel32,\
GetStdHandle, 'GetStdHandle',\
WriteConsoleA, 'WriteConsoleA',\
ExitProcess, 'ExitProcess'

import user32,\
wsprintfA, 'wsprintfA'


format PE Console 4.0 specifies the format of the output file. As you can see from this FASM is both an
assembler and a linker. It specifies the portable executable format (PE) which is used by Windows and the console
subsystem just like last time. entry start specifies the entry point - in this case it is the label start. The include
directive includes another FASM file. In this case I included files from the FASM installation which contain information
about the Win32 API and a file which contains macros. We require those macros for the next two lines: option
prologue:none, option epilogue:none are basically equivalent to the MASM directives. Unluckily FASM, too,
uses a stack frame for procs by default - what a shame. So we need to use those directives to get rid of the stuff
again. The section directives declare new sections, their names and their properties. The following lines do not
require any special explanations with two exceptions. The .bss section is used for uninitialised data and it is basically
equivalent to MASM's .data?. stdcall calculate_sum, ebx is different, aswell. For some peculiar reason you
cannot use the invoke macro in that case and I really have no idea why. It will give you an assembler error if you do.
The only difference between those calls is the fact that the invoke lines operate on imported functions
whereas calculate_sum was defined by us. Another difference to MASM is the section '.idata' import data
readable part which actually defines an import section. This part is not visible in MASM as such. We define which
functions to import from dynamic Windows libraries (DLL) in this part of the code so we can call and use them. The
command line is unspectacular:

D:\Downloads\fasmw16727>FASM.EXE D:\code\Assembly\calculate_sum_fasm\calculate_sum.as
\calculate_sum.exe
flat assembler version 1.67.27 (1207647 kilobytes memory)
3 passes, 2048 bytes.

D:\Downloads\fasmw16727>

The output of the executable is the same as before so we do not need to go over it again.

NASM
The netwide assembler is another multi platform open source assembler which I do not have much experience with.
Windows include files come in a separate package known as NASMX and unluckily it is outdated and does not work
well with the current version of NASM (produces warnings, possibly errors). I tried to use it at first for this small
example but eventually I gave up on it and went the hard way. I shall link NASMX here aswell in case it gets updated at
some point but I cannot recommend getting it right now except as example code. Go to http://nasm.sourceforge.net/ to
download NASM and http://www.asmcommunity.net/projects/nasmx/ to get the outdated NASMX. I am using the
Microsoft linker with the NASM object file output in this example but you can use other linkers aswell. The NASMX
package comes with a different linker but I will stick with Microsoft's current linker which comes with Visual Studio 2008.
Here is the equivalent code in NASM which is "rawer" than the MASM and FASM one:

%define STD_OUTPUT_HANDLE -11

extern _GetStdHandle@4
extern _WriteConsoleA@20
extern _wsprintfA
extern _ExitProcess@4

global _start

[section .data]

format_string db 'Arithmetic sum for the limit %d is %d', 10, 0

[section .bss]

buffer resb 64


bytes_written resd 1

[section .text]

calculate_sum:
xor esi, esi
mov edi, 1
jmp loop_start
sum_loop:
add esi, edi
inc edi
loop_start:
cmp edi, [esp + 4]
jle sum_loop
ret

_start:
push STD_OUTPUT_HANDLE
call _GetStdHandle@4
mov ebp, eax
xor ebx, ebx
main_loop:
push ebx
call calculate_sum
push esi
push ebx
push format_string
push buffer
call _wsprintfA
add esp, 16
push 0
push bytes_written
push eax
push buffer
push ebp
call _WriteConsoleA@20
inc ebx
cmp ebx, 20
jle main_loop
push 0
call _ExitProcess@4

%define STD_OUTPUT_HANDLE -11 simply defines the constant STD_OUTPUT_HANDLE again, just with a more
C'ish feeling on it. extern _GetStdHandle@4 declares the external symbol _GetStdHandle@4 which is the actual
name of GetStdHandle inside the Microsoft library kernel32.lib. MASM handles those on its own and can use the
names as they actually appear in the Windows DLLs. The number behind the @ is the size of the arguments in bytes so
it is basically the number of arguments times 4 mostly. global _start declares a symbol which gets exported. We
require this in order to define _start as an entry point in the linker settings later on. The underscore prefix is required
by the Microsoft linker as you will see later. The section declaration syntax is a bit different as you can see but the
names are still the same. buffer resb 64 declares 64 uninitialised bytes in the .bss section. The b in resb stands
for byte. bytes_written resd 1 declares a single uninitialised "dword", a 4 byte integer, hence the d in resd. The
calls below show that NASM does not come with any fancy invoke (MASM, FASM) or stdcall (FASM) macros by
default. NASMX has those but I do not use it for aforementioned reasons. So we have to push all the arguments
manually as necessary. The add esp, 16 stack fix is necessary because wsprintfA uses the C calling convention
so it does not clean up after itself. We pushed 4 arguments of 4 bytes each onto the stack so we have to add 16 to esp ourselves. Now
for the commandline arguments to assemble and link this program:

d:\Downloads\nasm-2.04rc1-win32\nasm-2.04rc1\nasm.exe -f win32 D:\code\Assembly\calcu


\Assembly\calculate_sum_nasm\calculate_sum.obj
D:\Code\Assembly\binaries\32\link.exe /MACHINE:X86 /NODEFAULTLIB /NOLOGO /ENTRY:start


d:\Code\Assembly\calculate_sum_nasm\calculate_sum.obj /LIBPATH:"C:\Program Files (x86


8\VC\PlatformSDK\Lib" kernel32.lib user32.lib

This does not require much explanation really since it is the same stuff all over again. One notable thing is the
/ENTRY:start part. The Microsoft linker prefixes all those names with an underscore (_) when it actually looks them
up inside the objects files which is why your entry point must have this prefix inside the code, hence the name _start in
the source code. The output is the same as usual so we should move on.
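Before moving on, here is a minimal hedged sketch contrasting the two calling conventions we keep running into; the function names are purely illustrative and not real imports:

;stdcall: the callee cleans up the stack, which is why the decorated name carries the @8
push 2
push 1
call _some_stdcall_function@8   ;removes its 8 bytes of arguments itself via ret 8

;cdecl: the caller cleans up, the name is only prefixed with an underscore
push 2
push 1
call _some_cdecl_function
add esp, 8                      ;we pushed 2 dwords, so we fix esp ourselves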

Writing Linux/BSD Applications Using x86 32-bit


Assembly
Now that we have covered Windows x86 programming with various assemblers it is time to find out how to implement
this program on Linux/BSD, too. FASM runs on x86 Linux/BSD and NASM will run on any Linux/BSD aswell but I have
already demonstrated the syntax for those so I am just going to cover the infamous GNU assembler version 2.17 in
this section.

as
as is the command to call the GNU assembler (short gas) on Linux/BSD systems. It comes with any sane distro so I
really do not need to tell you where to get it. Consult apt-get or emerge otherwise. It is generally agreed upon that
the gas syntax is very archaic and esoteric. It uses the AT&T syntax (as opposed to the Intel syntax I have been
using so far) in which basically the order of all arguments for the x86 instructions is reversed. Additionally it uses
customised instruction names with -b, -w, -l suffixes to imply byte/word/long (1 byte, 2 bytes, 4 bytes). Additionally it
introduces terribly redundant and complicated notations for offsets and cryptic register and number prefixes. Luckily, a
feature was introduced which enables crude Intel syntax support. I am going to use this feature in the following code.
Unluckily I had to use a silly hack to make it work properly because I could not find proper documentation on
replacements for standard operations which are not possible in the Intel syntax mode using the conventional ways. I am
going to show you some examples of conventional gas code later. For the record: I am using GNU assembler
version 2.17 (i486-linux-gnu) using BFD version 2.17 Debian GNU/Linux and collect2
version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21) (i386 Linux/ELF) in this example. Here is
my code:

.code32
.intel_syntax noprefix

.section .data

format_string: .string "Arithmetic sum for the limit %d is %d\n"

.section .text

calculate_sum:
xor esi, esi
mov edi, 1
jmp loop_start
sum_loop:
add esi, edi
inc edi
loop_start:
cmp edi, [esp + 4]
jle sum_loop
ret 4

.global main

main:
xor ebx, ebx
main_loop:


push ebx
call calculate_sum
push esi
push ebx
lea edi, format_string
push edi
call printf
add esp, 12
add ebx, 1
cmp ebx, 20
jle main_loop
xor eax, eax
ret

You can probably figure out most of this stuff on your own by now, it is the same stuff over and over again. It is merely
the syntax that differs, the idea is usually still the same. .section declares sections, etc. .code32 forces the
assembler to produce 32-bit code instead of 64-bit code. This is only necessary if you are running a 64-bit box.
.intel_syntax noprefix does what I was talking about. It allows you to write code which is more similar to what
you are (hopefully) used to - the Intel syntax. gas is pretty low level by default as you can see and it does not offer any
fancy macros as such. I think it supports the C/C++ preprocessor specs, though. So you could write all kinds of crazy
stuff using #define and such. The "hack" I was talking about in the introduction is lea edi, format_string. lea
is short for "load effective address". It is used to store the address of some object in a register, so they say. In practice
a simple lea is hardly any different from what mov does. In this case we could have achieved the same result using a
mov. The problem is just that I have not figured out how to get the address of a string when you are using the
.intel_syntax noprefix mode otherwise. At the end we simply ret from the main function and it will return the program
exit code in eax.
Figuring out how to assemble and link this application on my Debian box was a painful process. I recommend that you
simply write a simple C program which performs something similar to what you intend to do first. Run gcc with -S in
order to produce ASM output. Check out the .s-file (gas ASM file extension) to get an idea. Proceed to run gcc again,
this time without -S but with -v. That switch will tell you every single command it actually executes. At first it produces the ASM
code, then it assembles and links it. We can use this information to rip out the assembling and linking information to use
with our project. Be sure to include all libraries you intend to use in your ASM program in your C test program - the
linking information will prove invaluable. Here is how I assembled and linked this monstrosity:

as -o calculate_sum_gas.o calculate_sum_gas.asm

/usr/lib/gcc/i486-linux-gnu/4.1.2/collect2 --eh-frame-hdr -m elf_i386 -dynamic-linker


/gcc/i486-linux-gnu/4.1.2/../../../../lib/crt1.o /usr/lib/gcc/i486-linux-gnu/4.1.2/..
gnu/4.1.2/crtbegin.o -L/usr/lib/gcc/i486-linux-gnu/4.1.2 -L/usr/lib/gcc/i486-linux-gn
/../../../../lib -L/lib/../lib -L/usr/lib/../lib calculate_sum_gas.o -lgcc --as-neede
--as-needed -lgcc_s --no-as-needed /usr/lib/gcc/i486-linux-gnu/4.1.2/crtend.o /usr/li
/lib/crtn.o

As you can see there is an excessive amount of arguments used to actually link this program. I am not sure how many
of these are even necessary but I will not bother to find out as long as it works. If you have a 64-bit Linux/BSD this
might prove problematic. You might need the right gcc library binaries etc to run this 32-bit example there. I previously
claimed that the gas syntax is outrageously cryptic and difficult to understand. Here is the same program using
"normal" gas syntax, without the lea hack:

.code32
.section .data

format_string: .string "Arithmetic sum for the limit %d is %d\n"

.section .text


calculate_sum:
xorl %esi, %esi
movl $1, %edi
jmp loop_start
sum_loop:
addl %edi, %esi
incl %edi
loop_start:
cmpl 4(%esp), %edi
jle sum_loop
retl $4

.global main

main:
xorl %ebx, %ebx
main_loop:
pushl %ebx
call calculate_sum
pushl %esi
pushl %ebx
push $format_string
call printf
addl $12, %esp
addl $1, %ebx
cmp $20, %ebx
jle main_loop
xorl %eax, %eax
ret

The real gas syntax uses so many special characters that you get the feeling of looking at some esoteric programming
language like brainfuck. As long as I have a choice between various assemblers I definitely would not pick this one. gcc
inline assembly has an even more outrageous and complicated syntax.

Partial Registers
One thing I did not tell you on purpose about until now are the partial registers of the x86 architecture. This is because
they frequently do harm and slow down programs when used incorrectly. On the x86 architecture it is possible to
access lower parts of all general purpose registers according to the following pattern which I am going to demonstrate
on eax:

Bits 31 - 24   Bits 23 - 16   Bits 15 - 8   Bits 7 - 0
<------------------------- eax ------------------------>
                              <------------ ax --------->
                              <--- ah --->   <--- al --->

So eax can contain 32-bits, alright, we already knew that. But eax also has the three partial registers ax, ah and al.
These point to the data eax contains - they just show a smaller part of the same register. So ax represents the lower
16 bits of eax. ah represents the upper 8 bits of ax and al gives you the lower 8 bits of it. This exists mostly for backward
compatibility reasons. ax is the original 16-bit accumulator register the x86 architecture started with. The e in eax
stands for "extended" since it is an extended version of ax. The l and the h in al/ah stand for low/high. Not all general
purpose registers feature the "high"/"low" partial registers. They are only available on eax, ebx, ecx and edx. You can
write to them and read from them just like normal registers. Here is a full list of all the other 32-bit registers and their
partial registers:

32-bit register   Lower 16 bits   Bits 15 - 8   Bits 7 - 0
eax               ax              ah            al
ebx               bx              bh            bl
ecx               cx              ch            cl
edx               dx              dh            dl
ebp               bp              -             -
esp               sp              -             -
esi               si              -             -
edi               di              -             -

The problem with using partial registers in 32-bit code is that they can cause so called partial register stalls. Reading
from partial registers and writing to partial registers causes more dependencies than doing the same to the full
registers. This can stall the execution pipelines of the processor which results in a slowdown of your code.
Understanding what the execution pipeline is and how it works on modern CPUs requires a lot of explaining and it is not
yet the time to cover this topic. So all you should keep in mind for now is that you should always avoid using partial
registers. There usually is a faster solution which minimises partial register usage - usually totally eliminating it. Both the
Intel Optimization Manual and the AMD Software Optimization Guide strongly discourage the use of partial registers for
this very reason. Here are a few examples of how partial registers work:

;in MASM syntax you use the following notation for hex numbers:
mov eax, 012345678h
;this writes 0x12345678 to eax, the decisive parts are the 0-prefix and the h-suffix

;the x86 instruction movzx is usually used to move values from 1-byte or 2-byte sized sources into a larger register
;it writes them to the lower part of the register and sets the rest of the register to zero
;the "zx" part stands for "zero extend"

;read al, zero extend it and write it to ebx:


movzx ebx, al

;ebx now contains 0x00000078

;read ah, zero extend it and write it to ebx:


movzx ebx, ah

;ebx now contains 0x00000056

;read ax, zero extend it and write it to ebx:


movzx ebx, ax

;ebx now contains 0x00005678

;normal mov:
mov ebx, eax

;ebx now contains 0x12345678
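Relatedly, here is a hedged sketch of the kind of rewrite the optimisation manuals suggest: avoid reading a full register right after writing one of its partial registers.

;potentially stall-prone pattern: write al, then read the full eax
mov al, byte ptr [esi]
add ebx, eax            ;reading eax here depends on the partial write to al

;friendlier variant: zero extend straight into the full register
;(this also clears the upper 24 bits, which is usually what you wanted anyway)
movzx eax, byte ptr [esi]
add ebx, eax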

Addresses
The x86 architecture allows us to use fairly complex expressions when we are addressing memory. The pattern it
supports is:

address = base + index * scale + offset

The variables used in the pattern have the following meaning:


1. base: general purpose register
2. index: general purpose register
3. scale: 1, 2, 4 or 8
4. offset: 32-bit integers
To fully comprehend what this means you should take a look at the following example:

;ebx <<= 3, shift it left by 3 bits


shl ebx, 3

add eax, ebx


add eax, 17

;the "byte ptr" signifies that we want to read a single byte only and not 2 or even 4
;I have to use movzx because ecx is a 32-bit register and the byte is only 8 bits in
movzx ecx, byte ptr [eax]

So what this basically does is reading a single byte from memory. This byte is stored at the address eax + (ebx <<
3) + 17. Shifts can be expressed with multiplication/division with powers of two, in case you never heard about it.
Shifting ebx 3 bits to the left is the same as multiplying it by 8 because two to the power of three is equal to 8. In
general n <<= a; is equivalent to n *= static_cast<long>(std::pow(2.0, static_cast<double>(a))); and n >>= a; is equivalent to
n /= static_cast<long>(std::pow(2.0, static_cast<double>(a)));, assuming that a and n are both of type long. This is a direct
result of the positional notation I formulated mathematically at the beginning of this document. By shifting you basically
add zeroes or remove digits from the sequence of digits. This leads to multiplications by the powers of the base or
integer divisions with the same. The great thing is that the x86 architecture allows us to do all of those things in a single
step:

movzx ecx, byte ptr [eax + ebx * 8 + 17]

It is a really useful feature and has lots of applications. In combination with lea you can even use it to accelerate simple
arithmetic operations which would usually require several steps for multiplication and addition so do keep it in mind.
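To make that concrete, here is a small hedged sketch: a single lea replaces a shift and two additions when the operands fit the base + index * scale + offset pattern, and it does not even touch the flags.

;compute eax = ecx + ebx * 4 + 7 in one instruction
lea eax, [ecx + ebx * 4 + 7]

;the multi-step version of the same calculation
mov eax, ebx
shl eax, 2
add eax, ecx
add eax, 7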

Control Structures
In the introduction I stated that this document is intended for people who are already familiar with iterative high level
programming languages such as C/C++. At the beginning it is difficult to stop thinking in terms of if, for and while
and to start thinking in terms of labels and conditional jumps so I wanted to show you a few examples which
demonstrate how to translate these standard control structures to x86 32-bit assembly. My apologies to Linux/BSD users
- but I am going to use Windows specific API calls in these examples occasionally.

if

The first thing we are going to check out is the basic if-construct:

if(lstrlen(some_string) > 32)


return -1;
return 0;

Here is the equivalent x86 32-bit ASM code:


invoke lstrlen, addr some_string


cmp eax, 32
jle return_zero
mov eax, -1
ret
return_zero:
xor eax, eax
ret

So the original comparison operator was "greater than" but in x86 assembly we jump away from the if-body so we
need to check for the exact opposite of the original condition. The logical negation of "greater than" is "less than or
equal to" so we use jle in this case. Here is a list of logical negations of standard operators:

Original operator Logical negation


== !=
!= ==
< >=
> <=
<= >
>= <

if else

The if-else construct is next:

void divisor_check(int divisor)


{
if(divisor == 0)
puts("Division by zero!");
else
puts("Valid division");
puts("This gets executed in either case");
}

The approximate equivalent in x86 ASM is:

.data

division_by_zero db "Division by zero!", 0


valid_division db "Valid division", 0
either_case db "This gets executed in either case", 0

.text

divisor_check proc divisor:dword


cmp dword ptr [esp + 4], 0
jne else_label
invoke puts, addr division_by_zero
jmp end_of_if
else_label:
invoke puts, addr valid_division
end_of_if:
invoke puts, addr either_case
ret


So if the result of the comparison implies that the else body must be executed we jump to the else-label and simply
continue the execution after that because it gets us right where we are supposed to be anyways. In the other case the
first jump is not taken but we must not fall through to the else-body so an unconditional jump is necessary to take us to
the end of the if-structure.
while

Let us move on to the while-construct:

long i = 0;
while(data[i] >= 128 || some_function(data[i]))
i++;

Let us assume that i gets stored in the general purpose register edi (which survives the function call) and that data is a
string/an array of bytes stored at the label array. Then the approximate equivalent in x86 ASM is:

xor edi, edi

loop_start:
movzx ebx, byte ptr [array + edi]
cmp ebx, 128
jge iterate
invoke some_function, ebx
test eax, eax
jz end_of_loop
iterate:
inc edi
jmp loop_start
end_of_loop:

I am using the new instruction test in this example. test is to the logical AND what cmp is to subtraction. It performs
a logical AND using the operands behind the instruction as arguments for the logical function but it does not store the
result of it anywhere. It is simply used to set the flags. If you look closely at the logical table for AND you will realise that
for all boolean values a the term a & a evaluates to a. This is called idempotence. An idempotent operation can be
applied to a variable over and over again but in the end it always evaluates to the same variable again. The logical OR
operation is idempotent aswell. So what test eax, eax really does is checking the value of eax and setting the flags
accordingly. In this case we use it to determine whether the register is equal to zero. In case you are not used to this
particular feature of C/C++: while(data[i] >= 128 || some_function(data[i])) is equivalent to
while(data[i] >= 128 || some_function(data[i]) != 0) - which is why we jump out of the loop if that
part of the boolean term fails, too. We could even optimise this loop by using an initial jump to enter the loop and putting
the iteration at the beginning of the body:

xor edi, edi

jmp enter_loop

iterate:
inc edi

enter_loop:
movzx ebx, byte ptr [array + edi]
cmp ebx, 128
jge iterate
invoke some_function, ebx
test eax, eax
jnz iterate

We do have that additional jump at the beginning now but you always have to keep in mind that programs tend to spend
the vast majority of their CPU time inside loops. Optimising the body of a loop greatly outweighs such a one-time operation. The new body looks
unintuitive and does not remind one of the original C/C++ code at all so I did not want to start with it right away and
used the less optimised version first.
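Since test eax, eax shows up in practically every loop like this, here is a minimal standalone sketch of the usual zero check, detached from the example above:

;test sets the flags as if it had computed eax & eax, without storing the result
test eax, eax
jz value_is_zero       ;ZF = 1 exactly when eax == 0
;execution continues here when eax is not zero
value_is_zero: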


for

Now that we have covered if and while the next part is not a big deal. It is time for the for-statement:

for(long i = 1; i <= 10; i++)


printf("Iteration step %d\n", i);

Here is the equivalent x86 32-bit ASM code:

.data

iteration_string db "Iteration step %d", 10, 0

.text

_start:

mov ebx, 1
loop_start:
invoke printf, addr iteration_string, ebx
inc ebx
cmp ebx, 10
jle loop_start

;...

What we actually do here is skipping the first comparison 1 <= 10 because we know that is true anyways. If you have
a more complex iteration condition where you cannot tell whether the first iteration will always occur you can obviously
not skip it so be careful with optimisations like these.
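If you cannot guarantee that the first iteration will occur, a hedged sketch of the safe variant with the initial check left in place looks like this:

mov ebx, 1
cmp ebx, 10
jg end_of_loop          ;perform the initial check instead of skipping it
loop_start:
invoke printf, addr iteration_string, ebx
inc ebx
cmp ebx, 10
jle loop_start
end_of_loop: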

Multiplication and Division of Integers


Multiplication and division of integers on the x86 architecture are not as intuitive as the addition and subtraction of
integers. The largest unsigned integer you can store in a 32-bit general purpose register has the value 2^32 - 1 =
4294967295, so when we have two of those and multiply them we get (2^32 - 1) * (2^32 - 1) = 2^64 - 2^33 + 1 =
18446744065119617025. In order to calculate how many bits this number requires to be represented in a standard
binary positional system we use the formula floor(log_b(n)) + 1 with a base of b = 2:
floor(log_2(2^64 - 2^33 + 1)) + 1 = 63 + 1 = 64.

This means that the result of multiplications in general is at least 64 bits in size. But our x86 architecture has only 32-bit
general purpose registers so how is this dealt with? The result is stored in two registers. The mul reg instruction is
used for the multiplication of two unsigned 32-bit integers. It takes a single argument - a general purpose register. It
uses eax as the other factor so what it calculates is eax * reg. The result is stored in edx:eax which means that
the upper 32 bits get stored in edx and the lower 32 bits of the result are stored in eax. Example:

mov eax, 012345678h


mov edx, 012345678h
mul edx
;The result of this multiplication is:
;0x12345678 * 0x12345678 = 0x014B66DC1DF4D840
;Which produces the following values of edx and eax:
;edx = 0x014B66DC
;eax = 0x1DF4D840

Multiplication is considerably more expensive than addition or subtraction in terms of execution
time. The imul instruction is used for the multiplication
of signed integers and it supports more operands. Like mul, it can store the result in edx:eax, taking only a single


argument. But it can also take multiple registers and even an immediate value but it will discard the upper 32-bits of the
result then. Example:

mov eax, -2
mov ebx, -2
imul ebx
;edx:eax = (-2) * (-2) = 4 = 0x0000000000000004

;Let us use the largest positive signed integers possible:


mov edi, 2147483647
mov esi, 2147483647

;this stores the result in edi and discards the upper 32 bits:
imul edi, esi
;2147483647 * 2147483647 = 4611686014132420609 = 0x3FFFFFFF00000001
;edi = 0x00000001
;0x3FFFFFFF is discarded

mov eax, 10
mov ebx, 20

;this simply adds another factor to the multiplication:


imul eax, ebx, 3
;10 * 20 * 3 = 600
;upper 32 bit are discarded but they are 0 anyways, so:
;eax = 600 = 0x00000258

Division is far more expensive than multiplication. The div reg instruction is used for unsigned division. It performs the
division edx:eax / reg and stores the quotient in eax and the remainder in edx. In case you do not even know what those
words mean I am going to explain it with a small example: 17 / 5 = 3, remainder 2. The result of
the division 17 / 5 is 3 and the remainder is 2 because 3 * 5 + 2 = 17. You might
wonder why the x86 architecture supports 64-bit dividends but only 32-bit divisors. I
am not absolutely sure why this is, actually. But it was probably more convenient at
the time and it is not a big deal to work around it. Usually you do not require such
large numbers anyways. Enough of this primary school stuff, let us have a look at a few
examples:

mov eax, 100


xor edx, edx
mov ebx, 10
div ebx
;edx:eax / ebx = 100 / 10 = 10
;edx:eax % ebx = 0
;Quotient: eax = 10
;Remainder: edx = 0

mov eax, 21
div ebx
;edx:eax / ebx = 21 / 10 = 2
;edx:eax % ebx = 21 % 10 = 1
;Quotient: eax = 2
;Remainder: edx = 1

Just like there are mul and imul, there is the idiv instruction for signed division. It works just like div, though. The
remainder always has the same sign as the dividend:


;Set edx:eax to -100


mov eax, -100
mov edx, -1
;-1 because we want all upper bits to be 1 - it is Two's complement!

mov ebx, 10
idiv ebx
;edx:eax / ebx = - 100 / 10 = - 10
;edx:eax % ebx = 0
;Quotient: eax = - 10
;Remainder: edx = 0

mov eax, -21


mov edx, -1
idiv ebx
;edx:eax / ebx = - 21 / 10 = - 2
;edx:eax % ebx = - 21 % 10 = - 1
;Quotient: eax = - 2
;Remainder: edx = - 1
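For the record, instead of loading edx with -1 by hand you can let the processor sign extend for you. A minimal sketch using the cdq instruction, which sign extends eax into edx:

mov eax, -100
cdq              ;sign extend eax into edx, so edx:eax now holds -100
mov ebx, 10
idiv ebx
;Quotient: eax = - 10
;Remainder: edx = 0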

Floating Point Operations


Now that we have covered the most important integer operations we should move on to floating point instructions. The
CPU contains many different units which work together when a program is executed. Instructions like add, sub, xor,
shl, mul and idiv are all executed on the so called Arithmetic Logical Unit (ALU). The operations we are going to
concern ourselves with are executed in a completely different unit. It is called the Floating Point Unit (FPU). This
is where stuff like 0.942 - 7.183 = -6.241 is calculated. The x86 architecture implements the IEEE-754 standard
(http://en.wikipedia.org/wiki/IEEE_floating-point_standard) and offers support for three different precision formats:
1. Single precision 32-bit floating point numbers (float in C/C++)
2. Double precision 64-bit floating point numbers (double in C/C++)
3. Extended precision 80-bit floating point numbers (long double in C/C++)
We are going to explore the mathematical concept behind this system first. The real numbers know no bounds and
even if we specify boundaries for them they are still not discrete. There are infinitely many real values between the
numbers 0 and 1 so we have to make sacrifices to represent real numbers with boundaries and approximative
capabilities that meet our requirements. We can only represent approximations of most real values using this concept. A
few of them which are composed of sums of positive/negative powers of 2 can be represented flawlessly without error
using IEEE-754. We are going to examine the smallest of the three formats in this part of the document only. The others
are based on the same idea and just use different parameters really.
The IEEE-754 32-bit precision format divides the 32-bit value into 3 different sections:

Section Size Bit range


Sign 1 bit 31 - 31
Biased exponent 8 bits 30 - 23
Mantissa 23 bits 22 - 0

The mantissa expresses a fraction of the form 1.x where x is expressed by the mantissa. The numbers are encoded
using the Big Endian format (most significant bit first). Let s stand for the sign bit, e for the biased exponent and
m_1, m_2, ..., m_23 for the mantissa bits (they are read from the left to the right), then the value
most IEEE-754 encoded reals represent can be determined using the following formula:

value = (-1)^s * 1.m_1 m_2 ... m_23 (binary) * 2^(e - 127)

The (-1)^s part simply changes the sign according to the sign bit. The value is negative if the sign bit is set to 1 and
positive if it is set to 0. 2^(e - 127) is where the power of 2 gets expressed through the actual exponent
e - 127. The "bias" of 127 I was talking about is a necessity to allow the exponent to be negative. Otherwise we would
have positive/zero exponents only and we would be unable to express numbers smaller than 1 since this requires
negative powers of 2. The mantissa represents the binary digits to the right of the point in a number like 1.01101 which
had the leading "1." simply chopped off. Let us manually convert the number 149.8 to the IEEE-754 single precision
format. At first we require a binary representation of this decimal number. We examine the part to the left of the point
first and repeatedly perform divisions by 2, splitting the result into an integer part and a rational part after each step.
Then we repeat the procedure with this integer part until it eventually becomes zero.

149 / 2 = 74 + 1 / 2
74 / 2 = 37 + 0 / 2
37 / 2 = 18 + 1 / 2
18 / 2 = 9 + 0 / 2
9 / 2 = 4 + 1 / 2
4 / 2 = 2 + 0 / 2
2 / 2 = 1 + 0 / 2
1 / 2 = 0 + 1 / 2

You get the binary representation of the left part of the number by reading the "bit stream" on the right bottom-up, which
gives us the binary number 10010101 which is indeed equal to the decimal representation 149. The part to the right of
the point works in a similar fashion but it can be tricky at times because some numbers cannot be expressed with
a finite number of digits in another base. We chop off the left part of the number and replace it with 0. We repeatedly
multiply it by 2, chopping off the leading digit to replace it with 0 each step. We are done when there is a 0 left or when we
discover a periodic pattern or when we simply hit the limit for the precision:

0.8 * 2 = 1.6 => 1 <--+


0.6 * 2 = 1.2 => 1 | periodic pattern detected
0.2 * 2 = 0.4 => 0 |
0.4 * 2 = 0.8 => 0 ---+

We discovered a periodic pattern which means that 0.8 cannot be written in binary notation in a finite number of digits
without any loss of precision. We conclude:

0.8 (decimal) = 0.1100 1100 1100 ... (binary), with the group 1100 repeating forever

We can prove this mathematically by making use of the geometric series with q = 1/16, for which
sum_{k=0..inf} q^k = 1 / (1 - q) = 16 / 15. All we need to do is to perform some index shifts and splitting up of sums to show the desired result:

0.1100 1100 1100 ... (binary) = sum_{k=0..inf} (2^(-4k-1) + 2^(-4k-2)) = (1/2 + 1/4) * sum_{k=0..inf} (1/16)^k
= (3/4) * (16/15) = 12/15 = 4/5 = 0.8

The next step in the conversion process is shifting our binary representation of the number to the left or to the right such
that it has the form "1.x". We have to keep track of the number of shifting steps we performed and of the direction
aswell. We store this information in the exponent of a power of two:

10010101.1100 1100 1100 ... (binary) = 1.0010101 1100 1100 1100 ... (binary) * 2^7
= (-1)^0 * 1.0010101 1100 1100 1100 ... (binary) * 2^(134 - 127)

As you can see I algebraically manipulated the term so it matches the pattern in the original formula I showed to you.
Now we can simply read off the values:

sign = 0, biased exponent = 7 + 127 = 134, mantissa = 00101011100110011001100

We are nearly done, we simply need to convert the biased exponent to its 8-bit binary representation now:

134 / 2 = 67 + 0 / 2
67 / 2 = 33 + 1 / 2
33 / 2 = 16 + 1 / 2


16 / 2 = 8 + 0 / 2
8 / 2 = 4 + 0 / 2
4 / 2 = 2 + 0 / 2
2 / 2 = 1 + 0 / 2
1 / 2 = 0 + 1 / 2

This is the same algorithm we used before and it tells us that the binary representation is 10000110. So the IEEE-754
representation of the decimal value 149.8 is:

Sign : biased exponent : mantissa

0 : 10000110 : 00101011100110011001100

= 01000011 00010101 11001100 11001100

In the case of periodic binary representations you simply repeat the periodic digit group until you hit the limits imposed
by the 23 bit boundary for the mantissa. Let us get back to the x86 architecture. The other formats simply use different
biased exponent and mantissa sizes which also forces them to use different bias values, obviously. There are certain
special cases for invalid numbers, infinity and zero and such but it is not relevant for this document, really. Check out
http://en.wikipedia.org/wiki/IEEE_floating-point_standard if you are interested in the topic.
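If you want to double check such a conversion, a small hedged sketch: declare the hand-encoded bit pattern next to the assembler's own single precision encoding of the literal and compare the two dwords in a debugger or hex editor. Keep in mind that an assembler typically rounds the last mantissa bit instead of truncating it, so the two values may differ in the lowest bit:

manually_encoded dd 04315CCCCh   ;sign 0, biased exponent 10000110, truncated periodic mantissa
assembler_encoded dd 149.8       ;the assembler's own single precision encoding of the same number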
The FPU contains 8 80-bit elements called st(0), st(1), ..., st(7). I did not say "register" because they are
implemented as a stack so st(0) always refers to the top element, st(1) to the element below the top, etc. This makes it
quite tricky to use. Honestly, I am not even quite sure why they implemented it this way. Maybe it makes some common
operations shorter this way or something like that. FPU code is very fragile and often hard to read. I strongly
recommend reading a good instruction set reference simultaneously when you are working with FPU code. There are
lots of details you have to pay attention to and it is easy to make mistakes so be extra careful with FPU code. Let us
start out with some easy examples to see how the FPU stack works. Here is the state of the FPU at the beginning of
the program:

Position Value
st(0) undefined
st(1) undefined
st(2) undefined
st(3) undefined
st(4) undefined
st(5) undefined
st(6) undefined
st(7) undefined

All FPU instructions have the f- prefix. If they have a -p suffix it means that they pop the top element off the stack at
the end of their execution. We are going to execute the following instructions on the FPU now:

;...

;push one on top of the stack


fld1
;push zero on top of the stack
fldz
;add them up and store the result in st(0)
fadd st(0), st(1)
;add the top two elements up and store the result in st(1), pop the stack
faddp st(1), st(0)

;...


Let us go through the changes in the FPU stack, instruction by instruction. Changes to registers are marked in red. We
start out with fld1:

Position Value
st(0) 1.0
st(1) undefined
st(2) - st(7) undefined

fldz adds another well known constant to the top of the stack:

Position Value
st(0) 0.0
st(1) 1.0
st(2) - st(7) undefined

Perform the addition st(0) = st(0) + st(1); with the specified stack elements using fadd:

Position Value
st(0) 1.0
st(1) 1.0
st(2) - st(7) undefined

As you can see the rest of the stack remains the same. We are going to look at the next one in two steps because it
basically does two things. We have the FPU perform the addition st(1) = st(1) + st(0); using faddp, which
discards the top element after the addition:

Position Value
st(0) 1.0
st(1) 2.0
st(2) - st(7) undefined

Position Value
st(0) 2.0
st(1) undefined
st(2) - st(7) undefined

So how do you load floating point values from memory onto the FPU stack and how do you store them in memory
again? Take a look at the following snippet:

.data

some_float dd 0.123
some_double dq 4.556456823

.data?

float_storage dd ?


double_storage dq ?

.text

;...

;load float from memory and push it on the top of the stack:
fld some_float
;load double from memory and push it on the top of the stack:
fld some_double
;add them up and pop the stack:
faddp st(1), st(0)
;store the result as a float:
fst float_storage
;store the result as a double (in case we need more precision) and pop the stack so it is empty again:
fstp double_storage

;...

The first new thing in this code is the dq (declare quadword) assembler directive which we use to initialise double
precision floats (64 bits, 8 bytes). In this case I use it to define static values which I load onto the FPU stack. Let us go
through the changes again, instruction by instruction. In the beginning the stack is totally empty:

Position Value
st(0) - st(7) undefined

fld some_float pushes the approximation of 0.123 onto the stack:

Position Value
st(0) 0.123
st(1) - st(7) undefined

fld some_double does the same again, just with a double precision value which is an approximation of 4.556456823:

Position Value
st(0) 4.556456823
st(1) 0.123
st(2) - st(7) undefined

faddp st(1), st(0) adds them up and pops the stack, just like we have seen before in the previous example:

Position Value
st(0) 4.679456826
st(1) - st(7) undefined

fst float_storage stores the top of the stack as a 32-bit single precision value in memory at the specified
address. It does not modify the FPU stack in any way, though. fstp double_storage does, though. After storing
the double precision value in double_storage it pops the top off the stack:

Position Value
st(0) - st(7) undefined
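To tie the load/compute/store steps together, here is one more hedged sketch reusing the declarations from above; it multiplies the two values instead of adding them and leaves the stack empty at the end:

fld some_float            ;st(0) = 0.123
fld some_double           ;st(0) = 4.556456823, st(1) = 0.123
fmulp st(1), st(0)        ;st(1) = st(1) * st(0), then pop - the product is now in st(0)
fstp double_storage       ;store the product as a double and pop - the stack is empty again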


Now that we have covered the basics of the FPU stack it is time to move on to a more elaborate example of how to use
the FPU. Here is the original C++ code we are going to translate to x86 32-bit assembly:

#include <cstdio>
#include <cmath>
#include <iostream>

double f(double x)
{
    char buffer[10];
    return std::sin(x * x + 1.0) / 2.0;
}

int main()
{
    for(double i = 0.0; i < 1.0; i += 0.1)
    {
        double value = f(i);
        printf("f(%.1f) = %.4f, first two digits: %ld\n", i, value, static_cast<long>(value * 100.0));
    }
    std::cin.get();
    return 0;
}

The FPU example code I came up with is quite nasty I must say. I apologise in advance. Here is my translation to x86
ASM:

.686p
.model flat, stdcall

option epilogue: none


option prologue: none

GetStdHandle proto :dword


wsprintfA proto c :vararg
WriteConsoleA proto :dword, :dword, :dword, :dword, :dword
ExitProcess proto :dword

STD_OUTPUT_HANDLE equ -11

.data

format_string db "f(%s) = %s, first two digits: %d", 10, 0


iteration_step dq 0.1
hundred dq 100.0
ten dq 10.0
half dq 0.5

fuck_off dq 123.456789

.data?

number_buffer db 16 dup (?)


number_buffer2 db 16 dup (?)
buffer db 64 dup (?)
bytes_written dd ?

iterator dq ?

.code

f proc
fmul st(0), st(0)
fld1
faddp st(1), st(0)


fsin
fmul half
ret
f endp

write_double proc double_buffer:dword, precision:dword


fst qword ptr [esp - 12]
mov esi, [esp + 4]
mov byte ptr [esi], '.'
fld st(0)
fisttp dword ptr [esp - 4]
mov eax, [esp - 4]
mov ecx, 10
write_left_digits:
xor edx, edx
div ecx
add edx, '0'
dec esi
mov byte ptr [esi], dl
test eax, eax
jnz write_left_digits
mov ecx, [esp + 4]
mov eax, [esp + 8]
write_right_digits:
fild dword ptr [esp - 4]
fsubp st(1), st(0)
fmul ten
fld st(0)
fisttp dword ptr [esp - 4]
mov ebx, [esp - 4]
add ebx, '0'
inc ecx
mov byte ptr [ecx], bl
dec eax
jnz write_right_digits
mov byte ptr [ecx + 1], 0
fstp st(0)
fld qword ptr [esp - 12]
ret 8
write_double endp

public main

main:
invoke GetStdHandle, STD_OUTPUT_HANDLE
mov ebp, eax

fldz

main_loop:
fst iterator
invoke write_double, addr number_buffer + 8, 4
mov edi, esi
call f
invoke write_double, addr number_buffer2 + 8, 6
fld st(0)
fmul hundred
fisttp dword ptr [esp - 4]
mov eax, [esp - 4]
invoke wsprintfA, addr buffer, addr format_string, edi, esi, eax
invoke WriteConsoleA, ebp, addr buffer, eax, addr bytes_written, 0
fstp st(0)
fld iterator
fadd iteration_step
fld1


fcomip st(0), st(1)


jnb main_loop
invoke ExitProcess, 0
end

Let us start with the main code. fldz pushes the zero constant on the stack. So after executing this instruction st(0)
contains zero while all other FPU registers are still undefined/empty. If you use such an undefined register in an FPU
operation you will cause an exception. If you try to pop an element off an empty stack - meaning, all values are
undefined - you will get an exception, too. Another thing you have to watch out for is stack overflow. You cannot just
keep on adding more elements onto the stack. If all registers have been set and you push an additional element onto
the stack you will cause an FPU stack overflow so keep track of the height of your FPU stack at all times. fst
iterator reads the 64-bit value from st(0) and stores it at the memory location iterator - the stack remains the
same. On the next line write_double gets called for the first time. This is a stupid workaround I had to come up with
because the Windows API does not offer any functions which allow you to convert floating point numbers to strings to
my knowledge. I tried to use the Microsoft Visual C Runtime but those attempts did not bear any fruit either because
using msvcrt in ASM is pretty tricky nowadays. You have to link the application with those silly manifest files and
annoying stuff like that so I eventually gave up on making, say, printf work and wrote my own double -> string
converter.
The write_double body starts out with fst qword ptr [esp - 12]. This is not any different from the previous
fst store call really, it simply does not use an address which is known at compile time. Instead we use unused bytes
on the stack to store it temporarily locally. The qword ptr part is necessary because the assembler cannot magically
know what format we actually want to store. It could be single precision, double precision or even extended precision -
it simply cannot tell. With the prior fst call this was not necessary because the size of that object is defined in the data
section. The fld loads values and pushes them on top of the stack. This value can be a location in memory or even an
FPU register. In this case we are using fld st(0) which basically just duplicates the top of the stack. fisttp
dword ptr [esp - 4] performs a double to int conversion and stores the result at the specified memory
location. This instruction simply truncates the digits to the right of the decimal point away. So we would convert something
like 2.913 to 2. In the following loop we perform a conventional integer to decimal ASCII number conversion by
repeatedly dividing the integer by 10, adding 0x30 to it, and writing the lower byte of the register to a buffer. This buffer
gets first filled from right to left, starting with the decimal point. In the write_right_digits section we read the
decimal digits to the right of the decimal point from the number by repeatedly multiplying it by 10, converting it to an
integer, converting the integer to a decimal digit which gets added to the buffer and by eventually subtracting the integer
from the double again to get rid of the digit to the left of the decimal point. The buffer gets filled from the left to the right,
starting with the byte to the right of the decimal point. It starts with fild dword ptr [esp - 4]. This instruction
loads an integer from memory, converts it to a double and pushes it onto the FPU stack so we can access it via st(0).
fsubp st(1), st(0) performs the calculation st(1) - st(0) and stores the result in st(1). Additionally it
pops off the top of the stack so what was once st(1) is now st(0). You can tell this from the -p suffix in the
mnemonic. fmul ten multiplies st(0) with the value 10.0 which is stored in memory. The result of the operation gets
stored in st(0) again. It is followed by another fld st(0) which duplicates the top of the stack. It is necessary
because we are going to use fisttp on the next line. Unluckily this instruction does not feature any non -p version so
we are forced to perform that somewhat redundant additional FPU stack push. So what this loop does is simply writing
a certain number of digits to the right of the decimal point. The number is specified by the second argument precision.
At the end of the function you can see a fstp st(0). This stores st(0) in st(0) and pops the top element of the
stack (which is st(0)). So it basically works like a pop - with the exception of being somewhat unintuitive and cryptic. At
the very end fld qword ptr [esp - 12] is called. This restores the original value st(0) had when the function
was first called because we wrote it to that part of the memory in the beginning.
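As a tiny standalone sketch of that fisttp truncation step, reusing some_double from the earlier snippet and assuming a dword variable truncated_value in the uninitialised data section:

fld some_double           ;st(0) = 4.556456823
fisttp truncated_value    ;truncated_value = 4, the top of the stack is popped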
After returning from write_double for the first time the function f is called. This is pretty much a direct translation of
the corresponding C/C++ function we defined earlier. fmul st(0), st(0) is basically equivalent to st(0) = st(0)
* st(0); in C/C++ pseudo code. fld1 is another "load constant" instruction, just like fldz. It pushes the value one on
top of the stack. faddp st(1), st(0) adds st(1) to st(0) and stores the result in st(1). After this step the top
element of the stack gets popped off so the result is in st(0). fsin performs st(0) = sin(st(0)). And last but
not least we divide the resulting value by 2.0 using fmul half. This actually multiplies st(0) with 0.5 but it is
obviously the same as dividing by 2 since 0.5 = 1 / 2. The value of f is returned in st(0) and written into a
buffer. Most of the following instructions have already been explained, with the exception of fcomip st(0), st(1).
This compares the values of st(0) and st(1) and sets the Zero/Parity/Carry flags according to a particular
pattern. More importantly, it sets CF = 1 if and only if st(0) < st(1) - which is exactly what we want. The jnb
instruction jumps only as long as the iterator has not yet exceeded 1.0. Here is the actual output of the function:


f(0.0000) = 0.420735, first two digits: 42


f(0.1000) = 0.423415, first two digits: 42
f(0.2000) = 0.431202, first two digits: 43
f(0.3000) = 0.443313, first two digits: 44
f(0.4000) = 0.458401, first two digits: 45
f(0.5000) = 0.474492, first two digits: 47
f(0.6000) = 0.488932, first two digits: 48
f(0.7000) = 0.498368, first two digits: 49
f(0.7999) = 0.498803, first two digits: 49
f(0.9000) = 0.485763, first two digits: 48
f(0.9999) = 0.454648, first two digits: 45

As you can see there are a few errors in the iterator such as 0.7999 and 0.9999. You should probably already be
familiar with the problem of rounding errors on floating point numbers from imperative high level programming
languages. It is simply a result of the system we use. Not every number can be perfectly represented. Most of them
have certain small errors and the results of arithmetic operations on them are inaccurate. There are many FPU
instructions with extremely specific applications and I am not going to cover them here. Rule of thumb: If you want to
know how something works, write it in C/C++, turn on assembly listings and read the code in combination with an x86
Instruction Set Reference.
