
Unit 1: Introduction

Explain the concept of data structures.


In computer science, a data structure is a particular way of organizing data in a computer so
that it can be used efficiently. Data structures can implement one or more particular abstract data
types (ADT), which are the means of specifying the contract of operations and their complexity.
In comparison, a data structure is a concrete implementation of the contract provided by an ADT.
Different kinds of data structures are suited to different kinds of applications, and some are
highly specialized to specific tasks. For example, databases use B-tree indexes for small
percentages of data retrieval, and compilers and databases use dynamic hash tables as look-up
tables.
Data structures provide a means to manage large amounts of data efficiently for uses such as
large databases and internet indexing services. Usually, efficient data structures are key to
designing efficient algorithms. Some formal design methods and programming languages
emphasize data structures, rather than algorithms, as the key organizing factor in software
design. Storing and retrieving can be carried out on data stored in both main memory and in
secondary memory.

Explain the concept of algorithms.


As you will recall from earlier in your studies, an algorithm is a step-by-step process that
performs actions on data structures. For example, you can design and write code for an algorithm
to find the smallest integer in an array of integers; you can design and write code for an
algorithm that finds all red pixels in a 2-D colour image.
Explain the need for efficiency in data structures and algorithms.
These examples tell us that the obvious implementations of data structures do not scale well
when the number of items, n, in the data structure and the number of operations, m, performed
on the data structure are both large. In these cases, the time (measured in, say, machine
instructions) is roughly n x m. The solution, of course, is to carefully organize data within the
data structure so that not every operation requires every data item to be inspected. Although it
sounds impossible at first, we will see data structures where a search requires looking at only two
items on average, independent of the number of items stored in the data structure.

Distinguish the difference between an interface and an implementation.


A program is an implementation of an algorithm. In fact, every program is an
implementation of some algorithm.
When discussing data structures, it is important to understand the difference between a
data structure's interface and its implementation. An interface describes what a data
structure does, while an implementation describes how the data structure does it.
An interface, sometimes also called an abstract data type, defines the set of operations
supported by a data structure and the semantics, or meaning, of those operations. An

interface tells us nothing about how the data structure implements these operations; it
only provides a list of supported operations along with specifications about what types of
arguments each operation accepts and the value returned by each operation.
A data structure implementation, on the other hand, includes the internal representation of
the data structure as well as the definitions of the algorithms that implement the
operations supported by the data structure. Thus, there can be many implementations of a
single interface. For example, in Chapter 2, we will see implementations of the List
interface using arrays and in Chapter 3 we will see implementations of the List interface
using pointer-based data structures. Each implements the same interface, List, but in
different ways.
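As a concrete illustration, here is a minimal sketch of what such an interface might look like in Java. The name SimpleList and its method signatures are illustrative, not the exact interface used in the course text; the point is that an array-based class and a pointer-based class can both implement the same contract.

// A minimal sketch of a List-style interface; names are illustrative only.
public interface SimpleList<T> {
    int size();               // number of elements stored
    T get(int i);             // return the i-th element
    T set(int i, T x);        // replace the i-th element with x, return the old value
    void add(int i, T x);     // insert x at position i
    T remove(int i);          // remove and return the i-th element
}

// Two very different implementations can satisfy the same interface:
// class ArrayBackedList<T> implements SimpleList<T> { ... }   // backed by an array
// class LinkedNodeList<T>  implements SimpleList<T> { ... }   // backed by linked nodes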
Use mathematical concepts required to understand data structures and algorithms.

We generally use asymptotic notation to simplify functions. For example, in place of
5n log n + 8n - 200 we can write O(n log n). This is proven as follows:

5n log n + 8n - 200 <= 5n log n + 8n <= 5n log n + 8n log n (for n >= 2, so that log n >= 1) <= 13n log n.

This demonstrates that the function f(n) = 5n log n + 8n - 200 is in the set O(n log n)
using the constants c = 13 and n0 = 2.
Using big-Oh notation, a running time like this can therefore be simplified to O(n log n).
Apply a model of computation.


For this, we use the w-bit word-RAM model. RAM stands for Random Access Machine. In this
model, we have access to a random access memory consisting of cells, each of which stores a
w-bit word. This implies that a memory cell can represent, for example, any integer in the set
{0, ..., 2^w - 1}.
In the word-RAM model, basic operations on words take constant time. This includes arithmetic
operations (+, -, *, /, %), comparisons (<, >, =, <=, >=), and bitwise boolean
operations (bitwise-AND, OR, and exclusive-OR).
Any cell can be read or written in constant time. A computer's memory is managed by a memory
management system from which we can allocate or deallocate a block of memory of any size we
would like. Allocating a block of memory of size k takes O(k) time and returns a reference (a
pointer) to the newly-allocated memory block. This reference is small enough to be represented
by a single word.
The word-size w is a very important parameter of this model. The only assumption we will
make about w is the lower-bound w >= log n, where n is the number of elements stored in
any of our data structures. This is a fairly modest assumption, since otherwise a word is not even
big enough to count the number of elements stored in the data structure.
Space is measured in words, so that when we talk about the amount of space used by a data
structure, we are referring to the number of words of memory used by the structure. All of our
data structures store values of a generic type T, and we assume an element of type T occupies one
word of memory.

Apply correctness, time complexity, and space complexity to data structures and algorithms.
Correctness:
The data structure should correctly implement its interface.
Time complexity:
The running times of operations on the data structure should be as small as possible.
Space complexity:
The data structure should use as little memory as possible.
Worst-case running times:
These are the strongest kind of running time guarantees. If a data structure operation has a worst-case running time of f(n), then one of these operations never takes longer than f(n) time.
Amortized running times:
If we say that the amortized running time of an operation in a data structure is f(n), then this means that the cost of a typical operation is at most f(n). More precisely, if a data structure has an amortized running time of f(n), then a sequence of m operations takes at most m * f(n) time. Some individual operations may take more than f(n) time, but the average, over the entire sequence of operations, is at most f(n).
Expected running times:
If we say that the expected running time of an operation on a data structure is f(n), this means
that the actual running time is a random variable (see Section 1.3.4) and the expected value of
this random variable is at most f(n). The randomization here is with respect to random choices
made by the data structure.

Unit 2: Array-Based Lists


Implement List interfaces.
Implement Queue interfaces
A stack is a container of objects that are inserted and removed according to the last-in first-out
(LIFO) principle. In a pushdown stack only two operations are allowed: push an item onto the
stack, and pop an item off the stack. A stack is a limited-access data structure: elements can be
added and removed only at the top. Push adds an item to the top of the stack; pop removes the
item from the top. A helpful analogy is a stack of books: you can remove only the top book, and
you can add a new book only on top.
A queue is a container of objects (a linear collection) that are inserted and removed according to
the first-in first-out (FIFO) principle. An excellent example of a queue is a line of students in the
food court of the UC. New additions to the line are made at the back of the queue, while removal
(or serving) happens at the front. In a queue only two operations are allowed: enqueue and
dequeue. Enqueue means to insert an item at the back of the queue; dequeue means removing
the front item.
The difference between stacks and queues is in removing. In a stack we remove the item
most recently added; in a queue, we remove the item least recently added.
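The short sketch below makes the two access disciplines concrete using Java's built-in ArrayDeque (an assumption of convenience, not the course's own implementation): push/pop act on one end for LIFO behaviour, while the queue adds at the back and removes from the front for FIFO behaviour.

import java.util.ArrayDeque;

public class StackQueueDemo {
    public static void main(String[] args) {
        // LIFO: the last book placed on the pile is the first one removed.
        ArrayDeque<String> stack = new ArrayDeque<>();
        stack.push("bottom book");
        stack.push("top book");
        System.out.println(stack.pop());          // prints "top book"

        // FIFO: the first student to join the line is the first one served.
        ArrayDeque<String> queue = new ArrayDeque<>();
        queue.addLast("first student");           // enqueue at the back
        queue.addLast("second student");
        System.out.println(queue.removeFirst());  // prints "first student"
    }
}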

Unit 3: Linked Lists


Implement singly-linked lists.
Implement doubly-linked lists.
Implement space-efficient linked lists.
We first study singly-linked lists, which can implement Stack and (FIFO) Queue operations in
constant time per operation, and then move on to doubly-linked lists, which can implement
Deque operations in constant time.
Linked lists have advantages and disadvantages when compared to array-based implementations
of the List interface. The primary disadvantage is that we lose the ability to access any element
using get(i) or set(i, x) in constant time. Instead, we have to walk through the list, one
element at a time, until we reach the i-th element. The primary advantage is that they are
more dynamic: given a reference to any list node u, we can delete u or insert a node adjacent
to u in constant time. This is true no matter where u is in the list.
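A minimal sketch of a singly-linked list supporting stack-style push and pop at the head in constant time; the class and field names are illustrative, not the book's exact code.

// Minimal singly-linked list sketch (illustrative names).
public class SLList<T> {
    private static class Node<T> {
        T x;
        Node<T> next;
        Node(T x) { this.x = x; }
    }

    private Node<T> head;   // first node, or null if the list is empty
    private int n;          // number of elements

    // Stack-style push: insert at the head in O(1) time.
    public void push(T x) {
        Node<T> u = new Node<>(x);
        u.next = head;
        head = u;
        n++;
    }

    // Stack-style pop: remove and return the head element in O(1) time.
    public T pop() {
        if (head == null) return null;
        T x = head.x;
        head = head.next;
        n--;
        return x;
    }

    public int size() { return n; }
}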

Unit 4: Skiplists
Implement skiplists.
Unit 5: Hash Tables
Explain hash functions (division, multiplication, folding, radix transformation, digit
rearrangement, length-dependent, mid-square).
Estimate the effectiveness of hash functions (division, multiplication, folding, radix
transformation, digit rearrangement, length-dependent, mid-square).
Differentiate between various hash functions (division, multiplication, folding, radix
transformation, digit rearrangement, length-dependent, mid-square).
Division Hashing
Division hashing using a prime number is quite popular. Here is a list of prime numbers that you
can use and the range in which they are effective:
53     effective for table sizes between 2^5 and 2^6
97     effective for table sizes between 2^6 and 2^7
193    effective for table sizes between 2^7 and 2^8
389    effective for table sizes between 2^8 and 2^9
769    effective for table sizes between 2^9 and 2^10
The search keys that we intend to hash can range from simple numbers to the entire text content
of a book. The original object could be textual or numerical or in any medium. We need to
convert each object into its equivalent numerical representation in preparation for hashing. That

is, objects are referenced by search keys.


A hash function must guarantee that the number it returns is a valid index to one of the table
slots. A simple way is to use (k modulo TableSize).
Example: Suppose we intend to hash strings; i.e., the table is to store strings. A very simple hash
function would be to add up the ASCII values of all the characters in the string and take the
modulo of the table size, say 97.
Thus cobb would be stored at the location (here each letter is valued as 64 plus its position in the alphabet)
(64 + 3 + 64 + 15 + 64 + 2 + 64 + 2) % 97 = 84
hike would be stored at the location
(64 + 8 + 64 + 9 + 64 + 11 + 64 + 5) % 97 = 95
ppqq would be stored at the location
(64 + 16 + 64 + 16 + 64 + 17 + 64 + 17) % 97 = 31
abcd would be stored at the location
(64 + 1 + 64 + 2 + 64 + 3 + 64 + 4) % 97 = 72
The key idea is to choose a hash function that spreads the keys far apart from each other in the table.
A better hashing function for a string s0 s1 s2 ... sN, given a table size TableSize, would be
[ascii(s0) * 128^0 + ascii(s1) * 128^1 + ascii(s2) * 128^2 + ... + ascii(sN) * 128^N] % TableSize
The computation of this hashing function is very likely to fail for large strings because of
overflow in various terms. This failure can be avoided by using Horner's rule, applying mod at each
stage of the computation.
Given a polynomial of degree n,
p(x) = a_n x^n + a_(n-1) x^(n-1) + ... + a_1 x + a_0
one might suspect that n + (n - 1) + (n - 2) + ... + 1 = n(n + 1)/2 multiplications would be
needed to evaluate p(x) for a given x. However, Horner's rule shows that it can be rewritten so
that only n multiplications are needed:
p(x) = ((...((a_n x + a_(n-1))x + a_(n-2))x + ... + a_1)x + a_0
This is exactly the way that integer constants are evaluated from strings of characters (digits):
12345 = 1 * 10^4 + 2 * 10^3 + 3 * 10^2 + 4 * 10^1 + 5 * 10^0
      = (((1 * 10 + 2) * 10 + 3) * 10 + 4) * 10 + 5
Use of Horner's rule would imply computing the string hash above in the following fashion:
ascii(s0) + 128(ascii(s1) + 128(ascii(s2) + ... + 128(ascii(sN-1) + 128 ascii(sN))...))
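A small sketch of this string hash computed with Horner's rule, taking the remainder after every step so the intermediate values never overflow. The radix 128 and the table size 97 match the example above; the class name is illustrative.

// Polynomial string hash using Horner's rule, reducing mod tableSize at every step.
public class StringHash {
    public static int hash(String s, int tableSize) {
        int h = 0;
        // Process characters from last to first so that s.charAt(0) ends up
        // multiplied by 128^0 and s.charAt(N) by 128^N, as in the formula above.
        for (int i = s.length() - 1; i >= 0; i--) {
            h = (128 * h + s.charAt(i)) % tableSize;
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("cobb", 97)); // some slot in the range 0..96
    }
}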
Multiplication Hashing
Multiplication hashing uses multiplication by a real number and then a truncation of the integer.
Intuitively, we can expect that multiplying a random real number between 0 and 1 with an
integer key should give us another random real number. Taking the decimal part of this result
should give us most of the digits of precision (i.e., randomness) of the original. The decimal part
also restricts output to a range of values.
A good value of the real number to be used in multiplication hashing is c = (sqrt(5) - 1) / 2 ≈ 0.618.
Thus,
h(k) = floor( m * (k * c - floor(k * c)) ), where 0 < c < 1.
Here, the key k is multiplied by the constant real number c, where 0 < c < 1. We then take the
fractional part of k * c and multiply this value by m. The exact choice of m is not critical here.
Taking the floor of the result gives the hash value of k.
Suppose the size of the table, m, is 1301:
For k = 1234, h(k) = 850

For k = 1235, h(k) = 353


For k = 1236, h(k) = 1157
For k = 1237, h(k) = 660
For k = 1238, h(k) = 164
For k = 1239, h(k) = 968
For k = 1240, h(k) = 471
As you can see, the hash function breaks the input pattern fairly uniformly.
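A sketch of the multiplication method with c = (sqrt(5) - 1)/2 and m = 1301, matching the example above; it reproduces the listed values (850, 353, 1157, ...).

// Multiplication hashing sketch: h(k) = floor(m * frac(k * c)).
public class MultiplicativeHash {
    static final double C = (Math.sqrt(5) - 1) / 2;   // ~0.6180339887

    public static int hash(int k, int m) {
        double product = k * C;
        double fractional = product - Math.floor(product); // keep only the fraction
        return (int) Math.floor(m * fractional);
    }

    public static void main(String[] args) {
        for (int k = 1234; k <= 1240; k++) {
            System.out.println(k + " -> " + hash(k, 1301)); // 850, 353, 1157, 660, ...
        }
    }
}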
The division and multiplication hash functions are not order preserving. That is, the original
order of the keys is not the same as the order in which the hashed values appear in the table:
k1 < k2 does not imply h(k1) < h(k2).
Also, the division and multiplication hash functions are not perfect or minimal perfect hash
functions. A minimal perfect hash maps n keys to a range of n elements with no collisions. A
perfect hash maps n keys to a range of m elements, m >= n, with no collisions. Refer to
http://cmph.sourceforge.net/ for additional information about algorithms that can generate
minimal perfect hashing.
Folding Hashing
This method divides the original object or the search key into several parts, adds the parts
together, and then uses the last four digits (or some other arbitrary number of digits) as the
hashed value or key. For example, a social insurance number 123 456 789 can be broken into
three parts: 123, 456, and 789. These three numbers are added, yielding the position as 1368. The
hash function will be 1368 % TableSize.
The folding can be done in a number of ways. For instance, one could instead divide the number into four
parts, 12, 34, 56, and 789, and add those together.
Radix Transformation
The number base (or radix) of the search key can be changed, resulting in a different sequence of
digits. For example, a decimal numbered key could be transformed into a hexadecimal numbered
key. High-order digits could be discarded to fit a hash value of uniform length. For instance, if
our key is 23 in base 10, we might convert it to 32 in base 7. We then use the division method to
obtain a hash value.

Digit Rearrangement
Here, the search key digits, say in positions 3 through 6, are reversed, resulting in a new search
key.
For example, if our key is 1234567, we might select the digits in positions 2 through 4, yielding
234. The manipulation can then take many forms:
reversing the digits (432), resulting in a key of 1432567
performing a circular shift to the right (423), resulting in a key of 1423567
performing a circular shift to the left (342), resulting in a key of 1342567
swapping each pair of digits (324), resulting in a key of 1324567.
Length-Dependent Hashing
In this method, the key and the length of the key are combined in some way to form either the
index itself or an intermediate version. For example, if our key is 8765, we might multiply the
first two digits by the length and then divide by the last digit, yielding 69. If our table size is 43,

we would then use the division method, resulting in an index of 26.


Mid-Square Hashing
The key is squared, and the middle part of the result is used as address for the hash table. The
entire key participates in generating the address so that there is a better chance that different
addresses are generated even for keys close to each other. For example,
suppose the key is 3121, the square is 9740641, and the mid value is 406
suppose the key is 3122, the square is 9746884, and the mid value is 468
suppose the key is 3123, the square is 9753129, and the mid value is 531
In practice, it is more efficient to choose a power of 2 for the size of the table and extract the
middle part of the bit representation of the square of a key. If the table size is chosen in this
example as 1024, the binary representation of square of 3121 is
1001010-0101000010-1100001.
The middle part can be easily extracted using a mask and a shift operation.
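A small sketch of that mask-and-shift extraction for a table of size 1024 (2^10). The shift of 7 is tailored to the roughly 24-bit squares in this example, so it is illustrative rather than a general rule.

// Mid-square hashing sketch: square the key and keep the middle 10 bits.
public class MidSquareHash {
    public static int hash(int key) {
        long square = (long) key * key;        // e.g. 3121 * 3121 = 9740641
        return (int) ((square >> 7) & 0x3FF);  // keep bits 7..16: a value in 0..1023
    }

    public static void main(String[] args) {
        System.out.println(hash(3121)); // 322, the middle bits 0101000010
    }
}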

Recognize various collision resolution algorithms: open addressing (linear probing, quadratic
probing, double hashing), separate chaining (normal, with list heads, with other data
structures), coalesced hashing, Robin Hood hashing, cuckoo hashing, hopscotch hashing,
and dynamic resizing (resizing the whole table, incremental resizing).

Collisions happen when two search keys are hashed into the same slot in the table. There
are many ways to resolve collisions in hashing. Alternatively, one can discover a hash
function that is perfect, meaning that it maps each search key into a different hash value.
Unfortunately, perfect hash functions are effective only in situations where the inputs are
fixed and known in advance. A sub-category of perfect hash is minimal perfect hash,
where the range of the hash values is also limited, yielding a compact hash table.
If we are able to develop a perfect hashing function, we do not need to be concerned
about collisions or table size. However, often we do not know the size of the input dataset
and are not able to develop a perfect hashing function. In these cases, we must choose a
method for handling collisions.
For almost all hash functions, it is possible that more than one key is assigned to the same
table slot. For example, if the hash function computes the slot based just on the first letter
of the key, then all keys starting with the same letter will be hashed to the same slot,
resulting in a collision.
Collisions can be resolved partially by choosing another hash function, one which computes
the slot based on the first two letters of the key. However, even if a hash function is chosen in
which all the letters of the key participate, there is still a possibility that a number of keys
may hash to the same slot in the hash table.
Another factor that can be used to reduce collisions of multiple keys is the size of the hash
table. A larger table will result in fewer collisions, but it also requires more memory.
A number of strategies have been proposed to handle collisions of multiple keys.
Strategies that look for another open position in the table other than the one to which the
key originally hashed are called open addressing strategies. We will examine three
open addressing strategies: linear probing, quadratic probing, and double hashing.


Linear Probing
When a collision takes place, you should search for the next available position in the
table by making a sequential search. Thus the hash values are generated by
after-collision: h(k, i) = [h(k) + p(i) ] mod TableSize,
where p(i) is the probing function for the ith probe. The probing function is one that
looks for the next available slot in case of a collision. The simplest probing function is
linear probing, for which p(i) = i; that is, each successive probe moves one slot further along the table.
Consider a simple example with a table of size 10, hence mod 10. After hashing keys 22, 9,
and 43, the table is shown below. Note that initially a simple division hashing function,
h(k) = k mod TableSize, works fine. We will use the modified hash function, h(k, i) =
[h(k) + i ] mod TableSize, only when there is a collision.
slot:  0    1    2    3    4    5    6    7    8    9
key:             22   43                            9
When keys 32 and 65 arrive, they are stored as follows. Note that the search key 32
results in a hash value of 2, but slot 2 is already occupied. Thus, using the modified hash
function, with i = 1, a new hash value of 3 is obtained. However, slot 3 is also occupied,
so we reapply the modified hash function. This results in a slot value of 4 that houses the
search key 32. The search key 65 directly hashes to slot 5.
slot:  0    1    2    3    4    5    6    7    8    9
key:             22   43   32   65                  9
Suppose we have another key with value of 54. The key 54 cannot be stored in its
designated place because it collides with 32, so a new place for it is found by linear
probing to position 6, which is empty at this point:
slot:  0    1    2    3    4    5    6    7    8    9
key:             22   43   32   65   54             9
When the search reaches the end of the table, it continues from the first location again. Thus
the key 59, which hashes to the occupied slot 9, will be stored as follows:
slot:  0    1    2    3    4    5    6    7    8    9
key:   59        22   43   32   65   54             9
In linear probing, the keys start forming clusters, which have a tendency to grow quickly
because more and more collisions take place and the new keys get attached to one end of
the cluster. These are called primary clusters.
The problem with such clusters is that they slow down searches, particularly unsuccessful ones:
a search may have to scan an entire cluster, possibly continuing past the end of the table and
starting again from the beginning, before it can stop.
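A minimal sketch of insertion with linear probing into a table of Integer keys; null marks an empty slot, and the probe wraps around the end of the table as described above. There is no deletion or resizing handling here.

// Linear probing insertion sketch: probe h(k), h(k)+1, h(k)+2, ... (mod table size)
// until an empty slot (null) is found.
public class LinearProbingTable {
    private final Integer[] table;

    public LinearProbingTable(int size) { table = new Integer[size]; }

    public int insert(int key) {
        int home = Math.floorMod(key, table.length);
        for (int i = 0; i < table.length; i++) {
            int slot = (home + i) % table.length;   // wrap around the end of the table
            if (table[slot] == null) {
                table[slot] = key;
                return slot;
            }
        }
        throw new IllegalStateException("table is full");
    }

    public static void main(String[] args) {
        LinearProbingTable t = new LinearProbingTable(10);
        int[] keys = {22, 9, 43, 32, 65, 54, 59};
        for (int k : keys) System.out.println(k + " -> slot " + t.insert(k));
        // 22->2, 9->9, 43->3, 32->4, 65->5, 54->6, 59->0 (wraps past slot 9)
    }
}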
Quadratic Probing
To overcome the primary clustering problem, quadratic probing places the elements
further away rather than in immediate succession.
Let h(k) be the hash function that maps a search key k to an integer in [0, m - 1], where m
is the size of the table. One choice is the following quadratic function for the ith probe.
That is, the modified hash function is used to probe only after a collision has been observed.
after collision: h(k, i) = (h(k) + c1*i + c2*i^2) mod TableSize, where c2 is not equal to 0
If c2 = 0, then this hash function becomes a linear probe. For a given hash table, the
values c1 and c2 remain constant. For m = 2^n, a good choice for the constants is c1 = c2 = 1/2.
For a prime m > 2, most choices of c1 and c2 will make h(k, i) distinct for i in [0, (m - 1) / 2].
Such choices include c1 = c2 = 1/2, c1 = c2 = 1, and c1 = 0, c2 = 1.
Although using quadratic probing gives much better results than using linear probing, the
problem of cluster buildup is not avoided altogether. Such clusters are called secondary
clusters.
Double Hashing
The problem of secondary clustering is best addressed with double hashing. A second
function is used for resolving conflicts.
Like linear probing, double hashing uses one hash value as a starting point and then
repeatedly steps forward an interval until the desired value is located, an empty location
is reached, or the entire table has been searched. But the resulting interval is decided
using a second, independent hash function; hence the name double hashing. Given
independent hash functions h1 and h2, the jth probing for value k in a hash table of size m
is
h(k, j) = (h1(k) + j * h2(k)) mod m
Whatever scheme is used for hashing, it is obvious that the search time depends on how
much of the table is filled up. The search time increases with the number of elements in
the table. In the worst case, one may have to go through all the table entries.
Similar to other open addressing techniques, double hashing becomes linear as the hash
table approaches maximum capacity. Also, it is possible for the secondary hash function
to evaluate to zero; for example, with the function h2(k) = 5 - (k mod 7), choosing k = 5
gives h2(5) = 0.
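A sketch of the probe sequence h(k, j) = (h1(k) + j * h2(k)) mod m. The secondary function 7 - (k mod 7) is an illustrative choice that always lies in 1..7 and so never evaluates to zero; it is not the function from the example above.

// Double hashing sketch: the j-th probe for key k is (h1(k) + j * h2(k)) mod m.
public class DoubleHashingTable {
    private final Integer[] table;

    public DoubleHashingTable(int m) { table = new Integer[m]; }

    private int h1(int k) { return Math.floorMod(k, table.length); }
    private int h2(int k) { return 7 - Math.floorMod(k, 7); }      // always in 1..7

    public int insert(int k) {
        for (int j = 0; j < table.length; j++) {
            int slot = Math.floorMod(h1(k) + j * h2(k), table.length);
            if (table[slot] == null) {
                table[slot] = k;
                return slot;
            }
        }
        throw new IllegalStateException("no free slot found");
    }

    public static void main(String[] args) {
        DoubleHashingTable t = new DoubleHashingTable(10);
        System.out.println(t.insert(22)); // slot 2
        System.out.println(t.insert(32)); // collides at 2, steps by h2(32)=3 to slot 5
    }
}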
Separate Chaining
A popular and space-efficient alternative to the above schemes is separate chaining
hashing. Each position of the table is associated with a linked list or chain of structures
whose data field stores the keys. The hashing table is a table of references to these linked
lists. Thus, the keys 78, 8, 38, 28, and 58 would hash to the same position, position 8, in
the reference hash table.
In this scheme, the table can never overflow, because the linked lists are extended only
upon the arrival of new keys. A new key is always added to the front of the linked list,
thus minimizing insertion time. Many unsuccessful searches end up in empty lists, which
makes them faster than in the other hashing schemes above. This is, of course, at the expense
of extra storage for linked-list references. While searching for a key, you must first locate
the slot using the hash function and then search through the linked list for the specific
entry.

In this example, John Smith and Sandra Dee end up in the same bucket: table entry 152.
Entry 152 points first to the John Smith object, which is linked to the Sandra Dee object.
Insertion of a new key requires appending to either end of the list in the hashed slot.
Deletion requires searching the list and removing the element.
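A small sketch of separate chaining in which each table slot holds a Java LinkedList of keys and new keys are added at the front of the chain, as described above; this is illustrative, not a production hash table.

import java.util.LinkedList;

// Separate chaining sketch: each slot holds a linked list ("chain") of the keys
// that hash to it.
public class ChainedHashTable {
    private final LinkedList<Integer>[] table;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int size) {
        table = new LinkedList[size];
        for (int i = 0; i < size; i++) table[i] = new LinkedList<>();
    }

    private int hash(int key) { return Math.floorMod(key, table.length); }

    public void insert(int key) { table[hash(key)].addFirst(key); }

    public boolean contains(int key) { return table[hash(key)].contains(key); }

    public static void main(String[] args) {
        ChainedHashTable t = new ChainedHashTable(10);
        for (int k : new int[]{78, 8, 38, 28, 58}) t.insert(k); // all chain at slot 8
        System.out.println(t.contains(38)); // true
    }
}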
Study carefully and thoroughly the section titled Separate Chaining in
http://en.wikipedia.org/wiki/Separate_chaining#Separate_chaining, particularly the two
different types of separate chaining: separate chaining with list heads and separate
chaining with other structures.
This web page also introduces you to coalesced hashing, Robin Hood hashing, cuckoo
hashing, and hopscotch hashing, which may help you understand how they differ from
each other.
Coalesced hashing
A hybrid of chaining and open addressing, coalesced hashing links together chains of nodes
within the table itself.[12] Like open addressing, it achieves space usage and (somewhat
diminished) cache advantages over chaining. Like chaining, it does not exhibit clustering effects;
in fact, the table can be efficiently filled to a high density. Unlike chaining, it cannot have more
elements than table slots.
Cuckoo hashing
Another alternative open-addressing solution is cuckoo hashing, which ensures constant lookup
time in the worst case, and constant amortized time for insertions and deletions. It uses two or
more hash functions, which means any key/value pair could be in two or more locations. For
lookup, the first hash function is used; if the key/value is not found, then the second hash
function is used, and so on. If a collision happens during insertion, then the key is re-hashed with

the second hash function to map it to another bucket. If all hash functions are used and there is
still a collision, then the key it collided with is removed to make space for the new key, and the
old key is re-hashed with one of the other hash functions, which maps it to another bucket. If that
location also results in a collision, then the process repeats until there is no collision or the
process traverses all the buckets, at which point the table is resized. By combining multiple hash
functions with multiple cells per bucket, very high space utilization can be achieved.
Hopscotch hashing
Another alternative open-addressing solution is hopscotch hashing,[13] which combines the
approaches of cuckoo hashing and linear probing, yet seems in general to avoid their limitations.
In particular it works well even when the load factor grows beyond 0.9. The algorithm is well
suited for implementing a resizable concurrent hash table.
The hopscotch hashing algorithm works by defining a neighborhood of buckets near the original
hashed bucket, where a given entry is always found. Thus, search is limited to the number of
entries in this neighborhood, which is logarithmic in the worst case, constant on average, and
with proper alignment of the neighborhood typically requires one cache miss. When inserting an
entry, one first attempts to add it to a bucket in the neighborhood. However, if all buckets in this
neighborhood are occupied, the algorithm traverses buckets in sequence until an open slot (an
unoccupied bucket) is found (as in linear probing). At that point, since the empty bucket is
outside the neighborhood, items are repeatedly displaced in a sequence of hops. (This is similar
to cuckoo hashing, but with the difference that in this case the empty slot is being moved into the
neighborhood, instead of items being moved out with the hope of eventually finding an empty
slot.) Each hop brings the open slot closer to the original neighborhood, without invalidating the
neighborhood property of any of the buckets along the way. In the end, the open slot has been
moved into the neighborhood, and the entry being inserted can be added to it.
Robin Hood hashing
One interesting variation on double-hashing collision resolution is Robin Hood hashing.[14][15] The
idea is that a new key may displace a key already inserted, if its probe count is larger than that of
the key at the current position. The net effect of this is that it reduces worst case search times in
the table. This is similar to ordered hash tables[16] except that the criterion for bumping a key does
not depend on a direct relationship between the keys. Since both the worst case and the variation
in the number of probes is reduced dramatically, an interesting variation is to probe the table
starting at the expected successful probe value and then expand from that position in both
directions.[17] External Robin Hood hashing is an extension of this algorithm where the table is stored
in an external file and each table position corresponds to a fixed-sized page or bucket with B
records.[18]

Unit 6: Recursion
Define recursion.
One of the most useful features of modern programming languages like C++, C#, and Java
(as well as many others) is that these languages allow you to define methods that reference
themselves; such methods are said to be recursive. One of the biggest advantages recursive
methods bring to the table is that they usually result in more readable and compact solutions to
problems.

A recursive method, then, is one that is defined in terms of itself. Generally, a recursive algorithm
has two main properties (illustrated in the sketch below):

1. One or more base cases; and


2. A recursive case
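The factorial function is a minimal example of these two properties: a base case (n <= 1) that stops the recursion, and a recursive case that calls the method on a smaller argument.

// Factorial illustrates the two properties of a recursive algorithm.
public class Factorial {
    public static long factorial(int n) {
        if (n <= 1) {                 // base case
            return 1;
        }
        return n * factorial(n - 1);  // recursive case: a smaller instance of the problem
    }

    public static void main(String[] args) {
        System.out.println(factorial(5)); // 120
    }
}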

Inspect recursive programs.


Create different types of recursive programs.
Differentiate recursive solutions from iterative solutions.
An iterative solution uses no recursion whatsoever; it relies only on the use of loops (e.g., for,
while, do-while). The downside of iterative algorithms is that they tend not to be as clear as their
recursive counterparts with respect to their operation. The major advantage of iterative solutions
is speed. Most production software you will find uses little or no recursive code. The latter
property can sometimes be a company's prerequisite to checking in code; e.g., upon check-in, a
static analysis tool may verify that the code the developer is checking in contains no recursive
algorithms. Normally it is systems-level code that has this zero-tolerance policy for recursive
algorithms.

Estimate the time complexity of recursive programs.

Unit 7: Binary Trees


Define binary tree
Mathematically, a binary tree is a connected, undirected, finite graph with no cycles, and no
vertex of degree greater than three.
For most computer science applications, binary trees are rooted: a special node, r, of degree at
most two is called the root of the tree. For every node, u != r, the second node on the path
from u to r is called the parent of u. Each of the other nodes adjacent to u is called a child of u.
Most of the binary trees we are interested in are ordered, so we distinguish between the left child
and right child of u.
The depth of a node, u, in a binary tree is the length of the path from u to the root of the tree.
If a node, w, is on the path from u to r, then w is called an ancestor of u and u is called a
descendant of w. The subtree of a node, u, is the binary tree that is rooted at u and contains
all of u's descendants. The height of a node, u, is the length of the longest path from u to one of
its descendants. The height of a tree is the height of its root. A node, u, is a leaf if it has no
children.
We sometimes think of the tree as being augmented with external nodes. Any node that
does not have a left child has an external node as its left child, and, correspondingly, any
node that does not have a right child has an external node as its right child (see
Figure 6.2.b).
Define binary search tree.
A BinarySearchTree is a special kind of binary tree in which each node, u, also stores a data
value, u.x, from some total order. The data values in a binary search tree obey the binary
search tree property: for a node, u, every data value stored in the subtree rooted at u.left is
less than u.x, and every data value stored in the subtree rooted at u.right is greater than u.x.
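A simplified sketch (not the book's BinarySearchTree class) of a node type and a search/insert pair that follow the binary search tree property: go left when the target is smaller than the value at the current node, right when it is larger.

// Binary search tree sketch for int values.
public class BST {
    private static class Node {
        int x;
        Node left, right;
        Node(int x) { this.x = x; }
    }

    private Node root;

    public boolean contains(int x) {
        Node u = root;
        while (u != null) {
            if (x < u.x) u = u.left;        // everything smaller is in the left subtree
            else if (x > u.x) u = u.right;  // everything larger is in the right subtree
            else return true;               // found it
        }
        return false;
    }

    public void insert(int x) { root = insert(root, x); }

    private Node insert(Node u, int x) {
        if (u == null) return new Node(x);
        if (x < u.x) u.left = insert(u.left, x);
        else if (x > u.x) u.right = insert(u.right, x);
        return u;                            // duplicates are ignored
    }
}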
Examine a binary tree and binary search tree.
Implement a binary tree and binary search tree.
Define AVL tree.
An AVL tree is a self-balancing binary search tree in which the heights of the two child subtrees
of any node differ by at most 1; therefore, it is also said to be height-balanced. Lookup, insertion,
and deletion all take O(log n) time in both the average and worst cases, where n is the number of
nodes in the tree prior to the operation. Insertions and deletions may require the tree to be
rebalanced by one or more tree rotations.
The balance factor of a node is the height of its right subtree minus the height of its left subtree,
and a node with balance factor -1, 0, or 1 is considered balanced. A node with any other balance
factor is considered unbalanced and requires rebalancing the tree. The balance factor is either
stored directly at each node or computed from the heights of the subtrees.
AVL trees are often compared with red-black trees (see Unit 9) because they support the same set
of operations and because red-black trees also take O(log n) time for the basic operations. AVL
trees perform better than red-black trees for lookup-intensive applications. AVL trees, red-black
trees, and (2,4) trees, to be introduced in Unit 9 and Chapter 9 of Morin's book, share a number
of good properties, but AVL trees and (2,4) trees may require extra operations to deal with
restructuring (rotations), fusing, or splitting. Red-black trees do not have these drawbacks.
Unit 8: Scapegoat Trees
Define scapegoat tree.
A ScapegoatTree is a BinarySearchTree that, in addition to keeping track of the number, n, of
nodes in the tree, also keeps a counter, q, that maintains an upper-bound on the number of
nodes.

At all times, n and q obey the following inequalities:

q/2 <= n <= q

In addition, a ScapegoatTree has logarithmic height; at all times, the height of the scapegoat tree
does not exceed log_(3/2) q <= log_(3/2) 2n < log_(3/2) n + 2.
Examine a scapegoat tree.


Unfortunately, it will sometimes happen that depth(u) > log_(3/2) q. In this case, we need to
reduce the height. This isn't a big job; there is only one node, namely u, whose depth exceeds
log_(3/2) q. To fix u, we walk from u back up to the root looking for a scapegoat, w. The
scapegoat, w, is a very unbalanced node. It has the property that

size(w.child) / size(w) > 2/3,

where w.child is the child of w on the path from the root to u.

Implement a scapegoat tree.

Unit 9: Red-Black Trees


Define red-black tree.
A 2-4 tree is a rooted tree with the following properties:
Property 9.1 (height) All leaves have the same depth.
Property 9.2 (degree) Every internal node has 2, 3, or 4 children.

Lemma 9.1 A 2-4 tree with n leaves has height at most log n.

Examine a red-black tree.


A red-black tree is a binary search tree in which each node,

, has a colour which is either red

or black. Red is represented by the value 0 and black by the value .


Before and after any operation on a red-black tree, the following two properties are satisfied.
Each property is defined both in terms of the colours red and black, and in terms of the numeric
values 0 and 1.
Property 9..3 (black-height) There are the same number of black nodes on every root to leaf
path. (The sum of the colours on any root to leaf path is the same.)
Property 9..4 (no-red-edge)

No two red nodes are adjacent. (For any

, except the root,

.)Notice that we can always colour the root, , of a


red-black tree black without violating either of these two properties, so we will assume that the
root is black, and the algorithms for updating a red-black tree will maintain this. Another trick
that simplifies red-black trees is to treat the external nodes (represented by

) as black nodes.

This way, every real node, , of a red-black tree has exactly two children, each with a welldefined colour. Furthermore, the black-height property now guarantees that every root-to-leaf
path in

has the same length. In other words,

Lemma 9..2 The height of red-black tree with

is a 2-4 tree!
nodes is at most

implement a redblack tree.

Unit 10: Heaps


Define heap.
A heap is a data structure created using a binary tree. It can be seen as a binary tree with two
additional constraints:
The shape property: the tree is a complete binary tree; that is, all levels of the tree, except
possibly the last one (deepest) are fully filled, and, if the last level of the tree is not complete, the
nodes of that level are filled from left to right.
The heap property: each node is greater than or equal to each of its children according to some
comparison predicate which is fixed for the entire data structure.
"Greater than" means according to whatever comparison function is chosen to sort the heap, not
necessarily greater than in the mathematical sense, because the quantities are not always
numerical. Heaps where the comparison function is the mathematical greater-than are called
max-heaps; those where the comparison function is the mathematical less-than are called
min-heaps. Conventionally, min-heaps are used because they are readily applicable for use in priority
queues.
Note that the ordering of siblings in a heap is not specified by the heap property, so the two
children of a parent can be freely interchanged, as long as this does not violate the shape and
heap properties.
The binary heap is a special case of the d-ary heap in which d = 2.

Examine a binary heap tree.


If we apply Eytzinger's method to a sufficiently large tree, some patterns emerge. The left
child of the node at index i is at index left(i) = 2i + 1 and the right child of the
node at index i is at index right(i) = 2i + 2. The parent of the node at index i is at index
parent(i) = (i - 1) / 2.
A BinaryHeap implements the (priority) Queue interface.
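These index relationships are easy to express directly; the helper methods below are a sketch of the 0-based array layout described above, not the book's BinaryHeap class itself.

// Eytzinger layout helpers for a 0-based array representation of a complete binary tree.
public class HeapIndices {
    static int left(int i)   { return 2 * i + 1; }
    static int right(int i)  { return 2 * i + 2; }
    static int parent(int i) { return (i - 1) / 2; }

    public static void main(String[] args) {
        System.out.println(left(0));   // 1
        System.out.println(right(0));  // 2
        System.out.println(parent(5)); // 2
    }
}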

Implement a binary heap tree.


Define meldable heap.
The MeldableHeap is a priority Queue implementation in which the underlying structure is also a
heap-ordered binary tree. However, unlike a BinaryHeap, in which the underlying binary tree is
completely defined by the number of elements, there are no restrictions on the shape of the
binary tree that underlies a MeldableHeap; anything goes.
The add(x) and remove() operations in a MeldableHeap are implemented in
terms of the merge(h1, h2) operation. This operation takes two heap nodes h1 and h2
and merges them, returning a heap node that is the root of a heap that contains all
elements in the subtree rooted at h1 and all elements in the subtree rooted at h2.
Examine a randomized meldable heap.


Unit 11: Sorting Algorithms
Describe sorting algorithms (merge, quick, heap, counting, radix).
Merge sort
The merge-sort algorithm is a classic example of recursive divide and conquer: if the length of
the array a is at most 1, then a is already sorted, so we do nothing. Otherwise, we split a into two
halves, a0 and a1. We recursively sort a0 and a1, and then we merge (the now sorted) a0 and a1
to get our fully sorted array a.
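A sketch of that divide-and-conquer structure: split the array into two halves, sort each recursively, and merge the sorted halves back together.

import java.util.Arrays;

// Merge-sort sketch: split, recursively sort the halves, then merge them.
public class MergeSort {
    public static void mergeSort(int[] a) {
        if (a.length <= 1) return;                       // already sorted
        int[] a0 = Arrays.copyOfRange(a, 0, a.length / 2);
        int[] a1 = Arrays.copyOfRange(a, a.length / 2, a.length);
        mergeSort(a0);
        mergeSort(a1);
        merge(a0, a1, a);
    }

    // Merge the sorted halves a0 and a1 back into a.
    private static void merge(int[] a0, int[] a1, int[] a) {
        int i = 0, j = 0;
        for (int k = 0; k < a.length; k++) {
            if (i == a0.length)       a[k] = a1[j++];
            else if (j == a1.length)  a[k] = a0[i++];
            else if (a0[i] <= a1[j])  a[k] = a0[i++];
            else                      a[k] = a1[j++];
        }
    }

    public static void main(String[] args) {
        int[] a = {5, 2, 9, 1, 5, 6};
        mergeSort(a);
        System.out.println(Arrays.toString(a)); // [1, 2, 5, 5, 6, 9]
    }
}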

Quick sort
The quicksort algorithm is another classic divide and conquer algorithm. Unlike merge-sort,
which does merging after solving the two subproblems, quicksort does all of its
work upfront.
Quicksort is simple to describe: pick a random pivot element, x, from the array a; partition a
into the set of elements less than x, the set of elements equal to x, and the set of
elements greater than x; and, finally, recursively sort the first and third sets in this
partition.
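A sketch of the scheme just described, using a random pivot and a three-way partition into smaller, equal, and larger elements before recursing on the outer two groups. This list-based version is written for clarity; the in-place variant discussed later partitions within the array instead.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Quicksort sketch with a random pivot and a three-way partition.
public class QuickSort {
    private static final Random rand = new Random();

    public static List<Integer> quickSort(List<Integer> a) {
        if (a.size() <= 1) return a;
        int pivot = a.get(rand.nextInt(a.size()));
        List<Integer> less = new ArrayList<>(), equal = new ArrayList<>(), greater = new ArrayList<>();
        for (int x : a) {
            if (x < pivot) less.add(x);
            else if (x > pivot) greater.add(x);
            else equal.add(x);
        }
        List<Integer> result = new ArrayList<>(quickSort(less)); // sort the smaller elements
        result.addAll(equal);                                    // pivot copies are already in place
        result.addAll(quickSort(greater));                       // sort the larger elements
        return result;
    }

    public static void main(String[] args) {
        System.out.println(quickSort(List.of(5, 2, 9, 1, 5, 6))); // [1, 2, 5, 5, 6, 9]
    }
}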
Heap sort
The heap-sort algorithm is another in-place sorting algorithm. Heap-sort uses the binary heaps
discussed in Section 10.1. Recall that the BinaryHeap data structure represents a heap using a
single array. The heap-sort algorithm converts the input array a into a heap and then repeatedly
extracts the minimum value.
More specifically, a heap stores n elements in an array, a, at array locations a[0], ..., a[n-1],
with the smallest value stored at the root, a[0]. After transforming a into a BinaryHeap, the
heap-sort algorithm repeatedly swaps a[0] and a[n-1], decrements n, and calls trickleDown(0)
so that a[0], ..., a[n-2] once again form a valid heap representation. When this process
ends (because n = 0), the elements of a are stored in decreasing order, so a is
reversed to obtain the final sorted order. Figure 11.4 shows an example of the execution
of heap sort.
Consider a statement of the form

c[a[i]]++

This statement executes in constant time, but has as many possible different outcomes as there
are counters in c, depending on the value of a[i]. This means that the execution of an algorithm that makes such
a statement cannot be modelled as a binary tree. Ultimately, this is the reason that the algorithms
in this section are able to sort faster than comparison-based algorithms.
Counting Sort
Suppose we have an input array a consisting of n integers, each in the range 0, ..., k-1. The
counting-sort algorithm sorts a using an auxiliary array c of k counters. It outputs a sorted
version of a as an auxiliary array b.
The idea behind counting-sort is simple: for each i in {0, ..., k-1}, count the number of
occurrences of i in a and store this in c[i]. Now, after sorting, the output will look like c[0]
occurrences of 0, followed by c[1] occurrences of 1, followed by c[2] occurrences of 2, ...,
followed by c[k-1] occurrences of k-1.
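A sketch of the simple variant of counting-sort described above: count the occurrences of each value and then emit each value as many times as it was counted. (The book's version uses prefix sums over c to place elements stably; this simpler form is enough to show the idea.)

import java.util.Arrays;

// Counting-sort sketch: a holds n integers in the range 0..k-1.
public class CountingSort {
    public static int[] countingSort(int[] a, int k) {
        int[] c = new int[k];                  // c[i] = number of occurrences of i
        for (int value : a) c[value]++;

        int[] b = new int[a.length];           // sorted output
        int pos = 0;
        for (int i = 0; i < k; i++)            // emit c[i] copies of i
            for (int j = 0; j < c[i]; j++)
                b[pos++] = i;
        return b;
    }

    public static void main(String[] args) {
        int[] a = {7, 2, 9, 0, 2, 7, 7};
        System.out.println(Arrays.toString(countingSort(a, 10))); // [0, 2, 2, 7, 7, 7, 9]
    }
}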

Radix-Sort
Counting-sort is very efficient for sorting an array of integers when the length, n, of the array is
not much smaller than the maximum value, k-1, that appears in the array. The radix-sort
algorithm, which we now describe, uses several passes of counting-sort to allow for a much
greater range of maximum values.
Radix-sort sorts w-bit integers by using w/d passes of counting-sort to sort these integers
d bits at a time. More precisely, radix sort first sorts the integers by their least significant d bits,
then their next significant d bits, and so on until, in the last pass, the integers are sorted by their
most significant d bits.

Estimate the complexity of sorting algorithms.


Merge sort
Therefore, the total amount of time taken by merge-sort is O(n log n).
The merge_sort(a) algorithm runs in O(n log n) time and performs at most n log n comparisons.
Quick sort
When quicksort is called to sort an array containing the integers 0, ..., n-1, the expected number of
times element i is compared to a pivot element is at most H_(i+1) + H_(n-i), where H_j denotes the
j-th harmonic number.
A little summing up of harmonic numbers gives us the following theorem about the running time
of quicksort:
Theorem 11.2 When quicksort is called to sort an array containing n distinct elements, the
expected number of comparisons performed is at most 2n ln n + O(n).
The quick_sort(a) method runs in O(n log n) expected time, and the expected
number of comparisons it performs is at most 2n ln n + O(n).
Heap sort
The heap_sort(a) method runs in O(n log n) time and performs at most 2n log n + O(n)
comparisons.
Counting sort
The counting_sort(a, k) method can sort an array a containing n integers in the set
{0, ..., k-1} in O(n + k) time.
Radix sort
For any integer d > 0, the radix_sort(a, k) method can sort an array a containing n
w-bit integers in O((w/d)(n + 2^d)) time.
If we think, instead, of the elements of the array being in the range {0, ..., n^c - 1}, and take
d = ceil(log n), we obtain the following version of Theorem 11.8.
Corollary 11.1 The radix_sort(a, k) method can sort an array a containing n integer values
in the range {0, ..., n^c - 1} in O(cn) time.

Compare sorting algorithms.

               comparisons                      in-place
Merge-sort     n log n (worst-case)             No
Quicksort      1.38 n log n + O(n) (expected)   Yes
Heap-sort      2 n log n + O(n) (worst-case)    Yes

Each of these comparison-based algorithms has its advantages and disadvantages. Merge-sort
does the fewest comparisons and does not rely on randomization. Unfortunately, it uses an
auxiliary array during its merge phase. Allocating this array can be expensive and is a potential
point of failure if memory is limited. Quicksort is an in-place algorithm and is a close second in
terms of the number of comparisons, but is randomized, so this running time is not always
guaranteed. Heap-sort does the most comparisons, but it is in-place and deterministic.
There is one setting in which merge-sort is a clear winner; this occurs when sorting a linked list.
In this case, the auxiliary array is not needed; two sorted linked lists are very easily merged into a
single sorted linked list by pointer manipulations (see Exercise 11.2).
The counting-sort and radix-sort algorithms described here are due to Seward [66, Section 2.4.6].
However, variants of radix-sort have been used since the 1920s to sort punch cards using
punched card sorting machines. These machines can sort a stack of cards into two piles based on
the existence (or not) of a hole in a specific location on the card. Repeating this process for
different hole locations gives an implementation of radix-sort.
Finally, we note that counting sort and radix-sort can be used to sort other types of numbers
besides non-negative integers. Straightforward modifications of counting sort can sort integers,
in any interval {a, ..., b}, in O(n + b - a) time. Similarly, radix sort can sort integers in the same
interval in O(n log_n(b - a)) time. Both of these algorithms can also be used to sort
floating point numbers in the IEEE 754 floating point format. This is because the IEEE format is
designed to allow the comparison of two floating point numbers by comparing their values as if
they were integers in a signed-magnitude binary representation.
Unit 12: Graphs
Mathematically, a (directed) graph is a pair G = (V, E) where V is a set of vertices and E
is a set of ordered pairs of vertices called edges. An edge (i, j) is directed from i to j; i is
called the source of the edge and j is called the target. A path in G is a sequence of vertices
v0, ..., vk such that, for every i in {1, ..., k}, the edge (v(i-1), vi) is in E. A path
v0, ..., vk is a cycle if, additionally, the edge (vk, v0) is in E. A path (or cycle) is simple
if all of its vertices are unique. If there is a path from some vertex i to some vertex j, then
we say that j is reachable from i. An example of a graph is shown in Figure 12.1.

Figure 12.1: A graph with twelve vertices. Vertices are drawn as numbered circles and edges are
drawn as pointed curves pointing from source to target.

Represent a graph by a matrix


Where the adjacency matrix performs poorly is with the outEdges(i) and inEdges(i)
operations. To implement these, we must scan all n entries in the corresponding row or column
and gather up all the indices, j, where a[i][j], respectively a[j][i], is true. These operations
clearly take O(n) time per operation.

Another drawback of the adjacency matrix representation is that it is large. It stores an
n x n boolean matrix, so it requires at least n^2 bits of memory. The implementation here uses
a matrix of boolean values, so it actually uses on the order of n^2 bytes of memory.

Despite its high memory requirements and the poor performance of the inEdges(i) and
outEdges(i) operations, an AdjacencyMatrix can still be useful for some applications. In
particular, when the graph G is dense, i.e., it has close to n^2 edges, then a memory usage
of n^2 may be acceptable.

Represent a graph in adjacency lists


The space used by an AdjacencyLists data structure is O(n + m).
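A simplified sketch of the adjacency-list idea (not the book's AdjacencyLists class): vertex i stores a list of the vertices it has edges to, so the total space is proportional to n + m.

import java.util.ArrayList;
import java.util.List;

// Adjacency-list sketch: adj.get(i) is the list of vertices j such that the
// directed edge (i, j) is present.
public class AdjacencyListGraph {
    private final List<List<Integer>> adj;

    public AdjacencyListGraph(int n) {
        adj = new ArrayList<>(n);
        for (int i = 0; i < n; i++) adj.add(new ArrayList<>());
    }

    public void addEdge(int i, int j) { adj.get(i).add(j); }

    public boolean hasEdge(int i, int j) { return adj.get(i).contains(j); }

    public List<Integer> outEdges(int i) { return adj.get(i); }

    public static void main(String[] args) {
        AdjacencyListGraph g = new AdjacencyListGraph(4);
        g.addEdge(0, 1);
        g.addEdge(0, 2);
        g.addEdge(2, 3);
        System.out.println(g.outEdges(0)); // [1, 2]
    }
}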
Understand the execution process of the depth-first-search and breadth-first-search algorithms for
traversing a graph
Analyze the performance of the depth-first-search and breadth-first-search algorithms for
traversing a graph
When given as input a Graph, g, that is implemented using the AdjacencyLists data structure,
the bfs(g, r) algorithm runs in O(n + m) time.
A particularly useful application of the breadth-first-search algorithm is, therefore, in computing
shortest paths.
When given as input a Graph, g, that is implemented using the AdjacencyLists data structure,
the dfs(g, r) and dfs2(g, r) algorithms each run in O(n + m) time.
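A sketch of breadth-first search over an adjacency-list representation (not the book's bfs(g, r) code): each vertex enters the queue at most once and each edge is examined once, which is where the O(n + m) running time comes from.

import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Breadth-first search sketch over an adjacency-list graph.
public class BFS {
    // adj.get(i) is the list of out-neighbours of vertex i.
    public static boolean[] bfs(List<List<Integer>> adj, int r) {
        boolean[] seen = new boolean[adj.size()];
        Queue<Integer> q = new ArrayDeque<>();
        seen[r] = true;
        q.add(r);
        while (!q.isEmpty()) {
            int i = q.remove();
            for (int j : adj.get(i)) {
                if (!seen[j]) {          // first time we reach j
                    seen[j] = true;
                    q.add(j);
                }
            }
        }
        return seen;                      // seen[v] is true iff v is reachable from r
    }
}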

Implement those search algorithms for traversing a graph in pseudo-code or other programming
languages, such as Java, C, or C++, etc.
Unit 13: Binary Trie
Define trie.
A BinaryTrie encodes a set of w-bit integers in a binary tree. All leaves in the tree have depth w,
and each integer is encoded as a root-to-leaf path. The path for the integer x turns left at level i
if the i-th most significant bit of x is a 0 and turns right if it is a 1.
The find(x) method runs in O(w) time.
Examine a binary trie.
Explain binary trie.

Each node, u, also contains an additional pointer, u.jump. If u's left child is missing, then
u.jump points to the smallest leaf in u's subtree. If u's right child is missing, then u.jump
points to the largest leaf in u's subtree.
