
Run-Time Theory in Practice: An Analysis of

Sorting Algorithms

Garrett Ursin
CoSc 320, Data Structures
Pepperdine University

October 26, 2018

Abstract

This analysis dissects Insert Sort, Selection Sort, Heap Sort, Merge Sort, and
Quick Sort to determine how accurately their theoretical run-time complexities
hold in execution. It scrutinizes any discrepancies that would suggest the
theoretical run-times are inaccurate. It also attempts to determine which of the
tested sorts are the most and least efficient.

1 Introduction
This paper serves to analyze and compare the theoretical run-time complexities of
Merge Sort, Quick Sort, Heap Sort, Selection Sort, and Insert Sort with real sample data
gathered through first-hand experimentation on each algorithm. Because time samples
vary from machine to machine due to differing factors such as CPU clock speeds, host-
bus speeds, and CPU capacity at run-time, sample data is gathered and presented in
terms of numbers of assignments and comparisons. These statistics remain constant
regardless of the capabilities of the machine implementing the sort and are therefore
more reliable. In order to most appropriately demonstrate a relationship between theo-
retical Θ and the aforementioned samples of each algorithm, best fit curves and residual
standard error approximations determine how well each sort conforms to its theoretical
run-time complexity.
The Method section examines the technique behind each individual sort in depth
and describes the Object-Oriented design pattern used to track assignments and com-

parisons. It also explicates Residual Standard Error (RSE), shows how it relates each
algorithm to a particular asymptotic run time, and demonstrates how to identify the most
efficient sort. The Results section presents the raw data and draws conclusions from it,
dissecting each algorithm in detail and
suggesting possible explanations for various data trends. The Conclusions section of-
fers specific deductions and discoveries developed as a result of the experiment.

2 Method
The following subsections delineate each sort as well as the exact Object-Oriented de-
sign pattern used to analyze them. They also examine RSE mathematically and sum-
marize how it can be used to determine asymptotic run time for each algorithm.

2.1 Sort Algorithms


The following subsections summarize the mechanics of each sorting algorithm.

2.1.1 Insert Sort

Insert Sort begins with the first element of an array. Because a single element is, by
definition, sorted, Insert Sort starts by partitioning the array into an unsorted and a
sorted portion. By incorporating one element at a time from the unsorted portion into
the sorted portion, Insert Sort grows a sorted list. When it is finished, no unsorted
section of the list remains.
This means that for each element, Insert Sort likely iterates through a substantial
portion of the sorted list in order to place it at the correct index. When an array of
length n is in reverse order, the sort takes Θ(n²) time, where n is the number of
elements in the array. However, the best case for Insert Sort is an array that is
already sorted, in which case no extra traversals are needed, leaving Θ(n) time. The
best case is uncommon, but possible. The average case assumes that each element must
be compared against roughly half of the elements to its left; however, half of Θ(n²)
is still Θ(n²) time. Thus, Insert Sort has worst and average cases of Θ(n²) and a best
case of Θ(n).
Insert Sort, due to its best case of Θ(n), is considered adaptive – it works well
for sets that are almost in order. It is also stable, meaning elements with equal
values retain their relative order, unlike in Selection Sort (see section 2.1.2). It is
also in-place, requiring no extra memory allocation.

Insert Sort thus works well for relatively small data sets, much like other sorting
algorithms of quadratic time. However, as the size of the unsorted list grows, quadratic
sorts quickly become inefficient.
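
As a concrete illustration, the following minimal C++ sketch shows the shifting
behavior described above. It is illustrative only; the int element type and array
interface are assumptions, not the implementation used in this experiment.

#include <cstddef>

// A minimal sketch of Insert Sort. Each pass takes the first unsorted
// element and shifts larger sorted elements right until its slot is found.
void insertSort(int a[], std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        int key = a[i];                    // next element of the unsorted portion
        std::size_t j = i;
        while (j > 0 && a[j - 1] > key) {  // each shift costs one comparison
            a[j] = a[j - 1];               // and one assignment
            --j;
        }
        a[j] = key;                        // grow the sorted portion by one
    }
}

On an already-sorted array the inner while loop never runs, which is exactly the
adaptive Θ(n) best case described above.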

2.1.2 Selection Sort

Like Insertion Sort, Selection Sort is an in-place sorting algorithm requiring no extra
internal space to execute. Additionally, Selection Sort partitions the list into
two portions, the unsorted and the sorted portion. However, the sorted portion starts out
empty, and the algorithm sifts through the entire unsorted portion to find the extremum
(the maximum or minimum, depending on implementation) before swapping it with the left-
most unsorted element. It then extends the boundary of the sorted list. Eventually,
the unsorted partition disappears, and Selection Sort leaves behind a perfectly sorted
array.
While Selection Sort is an in-place algorithm like Insertion Sort, it is not adaptive.
It goes through the same number of iterations for a reverse-ordered list as for a
list that is already sorted, making it an inefficient counterpart to Insertion Sort for
lists that are already partially sequential. It is also unstable: a long-distance swap
can carry an element past another of equal value, changing the relative order of equal
elements. However,
Selection Sort is characterized by a surprisingly small number of writes, or assignments,
compared to Insertion Sort, which may be considered helpful or advantageous in certain
situations.
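
The following C++ sketch illustrates this behavior; it is an assumed
implementation, not the one used in the experiment. Note that the unconditional
swap costs exactly three assignments per pass, which is what keeps the write
count so low.

#include <cstddef>
#include <utility>

// A minimal sketch of Selection Sort: scan the unsorted portion for the
// minimum, then swap it with the left-most unsorted element.
void selectSort(int a[], std::size_t n) {
    for (std::size_t i = 0; i + 1 < n; ++i) {
        std::size_t minIdx = i;
        for (std::size_t j = i + 1; j < n; ++j)
            if (a[j] < a[minIdx])          // comparisons dominate: Theta(n^2)
                minIdx = j;
        std::swap(a[i], a[minIdx]);        // one swap (three assignments) per pass
    }
}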

2.1.3 Heap Sort

Heap Sort utilizes a Complete Binary Tree, or Binary Heap, in which items are
stored so that the value of each parent node is larger (or smaller, depending on the
implementation) than the values of its two children; these are called max and min
heaps, respectively. The Binary Heap is also structured so that if it is not completely
full, it is filled from left to right, with no gaps between its final leaves.
In order to Heap Sort an array, the array must first be a representation of a min or
max heap. If it is not already in this order, then it must be built using the elements
already inside the array. This process takes Θ(n) time.
The Max Heap has its largest element stored at the root. Heap Sort takes this root
element and swaps it with the last element in the Heap. It then “Heapifies” the newly
modified Heap – it reorders the elements so that they are once again in Max Heap order.
This is known as “Sifting” an element down to its proper position. Once heapified, the
next largest element of the array is at the root of the Heap. Heap Sort replaces this
element with the last item in the Heap, then calls Heapify again. This process is repeated until
all the elements in the array are sorted, and the Heap has only one element remaining –
the smallest member of the array.
Building the initial heap takes Θ(n) time, while each call to Heapify takes Θ(lg n)
time. Because Heap Sort calls Heapify for each element within the array, Heap Sort
takes Θ(n lg n) time in the best, worst, and average cases.
Heap Sort, like Insertion and Selection sort, is in-place, using only the memory
allocated by the array as it executes. It is not stable, however, and must be modified in
order to be adaptive.
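
A hedged C++ sketch of this procedure follows. The sift-down helper and max-heap
orientation shown here are one common choice; the experiment's implementation may
differ.

#include <cstddef>
#include <utility>

// Sift the element at index i down until the max-heap property holds
// among the first `size` elements of a.
static void siftDown(int a[], std::size_t size, std::size_t i) {
    for (;;) {
        std::size_t largest = i;
        std::size_t left = 2 * i + 1;
        std::size_t right = 2 * i + 2;
        if (left < size && a[left] > a[largest]) largest = left;
        if (right < size && a[right] > a[largest]) largest = right;
        if (largest == i) return;          // heap property restored
        std::swap(a[i], a[largest]);
        i = largest;
    }
}

void heapSort(int a[], std::size_t n) {
    if (n < 2) return;
    // Build the max heap (Theta(n)), sifting from the last parent downward.
    for (std::size_t i = n / 2; i-- > 0;) siftDown(a, n, i);
    // Repeatedly swap the root (the maximum) to the end, then re-heapify.
    for (std::size_t end = n - 1; end > 0; --end) {
        std::swap(a[0], a[end]);
        siftDown(a, end, 0);               // Theta(lg n) per call
    }
}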

2.1.4 Merge Sort

Merge Sort is a very popular Divide and Conquer sort developed by John von Neumann
in 1945. It works by recursively splitting an array into two subsections, sorting them,
and then merging them back together. Merge Sort thus continually splits the array it is
operating on into sub-arrays until each sub-array has one element (an array with one
element is sorted). It then merges the sub-arrays until only one remains: the newly
sorted array.
Merge Sort is “easy to split, hard to join,” meaning it does most of its work
when merging two sub-arrays. In all cases, Merge Sort has a run-time complexity of
Θ(n lg n).
Merge Sort, unlike Selection Sort and Heap Sort, is stable. In a typical implementa-
tion, Merge Sort requires extra space to perform, and is thus not an in-place algorithm.
However, it can be specifically programmed so that it performs in-place. Merge Sort is
also not adaptive. It performs roughly the same number of computations regardless of
whether or not the array is already sorted.
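
A minimal C++ sketch of the recursive split-and-merge structure follows, assuming
a typical out-of-place merge through a temporary buffer; the details of the
experiment's implementation may differ.

#include <cstddef>
#include <vector>

// Merge the sorted ranges a[lo, mid) and a[mid, hi) through a temporary buffer.
static void merge(std::vector<int>& a, std::size_t lo, std::size_t mid, std::size_t hi) {
    std::vector<int> tmp;                  // the extra space that makes this not in-place
    tmp.reserve(hi - lo);
    std::size_t i = lo, j = mid;
    while (i < mid && j < hi)
        tmp.push_back(a[i] <= a[j] ? a[i++] : a[j++]);  // <= preserves stability
    while (i < mid) tmp.push_back(a[i++]);
    while (j < hi) tmp.push_back(a[j++]);
    for (std::size_t k = 0; k < tmp.size(); ++k) a[lo + k] = tmp[k];
}

// Sort a[lo, hi) by recursively splitting and merging.
void mergeSort(std::vector<int>& a, std::size_t lo, std::size_t hi) {
    if (hi - lo < 2) return;               // one element is already sorted
    std::size_t mid = lo + (hi - lo) / 2;  // easy to split
    mergeSort(a, lo, mid);
    mergeSort(a, mid, hi);
    merge(a, lo, mid, hi);                 // hard to join
}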

2.1.5 Quick Sort

Quick Sort is a Divide and Conquer algorithm designed by Tony Hoare in 1959. It
functions on an array by first selecting a pivot element (either randomly or by
design). It then loops through the array, swapping elements so that all members less
than the pivot lie on the left side of the array and all those greater than the pivot
lie on the right side; after this partitioning, the pivot occupies its final sorted
position. Quick Sort then repeats this process recursively, sorting the left portion of
the array followed by the right portion until the entire array is sorted.
The best and average-case performance of Quick Sort is Θ(n lg n). The worst case,
however, is Θ(n²). This occurs only when the pivot element is chosen poorly. If, for
example, the pivot is the maximum element of the array each time, then each partition
removes only one element, and the algorithm must loop through nearly the entire array
n times.
Quick Sort is not adaptive: its run-time does not improve for already-sorted arrays. It is also
not stable, as the initial order of equal elements is not guaranteed to remain constant.
Quick Sort is in-place, requiring no extra memory to perform.
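
The following C++ sketch uses a Lomuto-style partition with the last element as
pivot. This is one common variant and not necessarily the scheme used in the
experiment (Hoare's original partition differs).

#include <cstddef>
#include <utility>

// Partition a[lo..hi] around the pivot a[hi]; afterwards the pivot sits in its
// final sorted position, smaller elements to its left, larger to its right.
static std::size_t partition(int a[], std::size_t lo, std::size_t hi) {
    int pivot = a[hi];
    std::size_t i = lo;
    for (std::size_t j = lo; j < hi; ++j)
        if (a[j] < pivot) std::swap(a[i++], a[j]);
    std::swap(a[i], a[hi]);
    return i;
}

// Sort the inclusive range a[lo..hi]; call as quickSort(a, 0, n - 1) for n >= 1.
void quickSort(int a[], std::size_t lo, std::size_t hi) {
    if (lo >= hi) return;
    std::size_t p = partition(a, lo, hi);
    if (p > lo) quickSort(a, lo, p - 1);   // guard against unsigned underflow
    quickSort(a, p + 1, hi);
}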

2.2 Data Collection


For each sorting algorithm, run-time complexity is measured in terms of comparisons
and assignments rather than seconds, because both counts remain constant regardless
of the specifications of the machine gathering the data. An Object-Oriented design
pattern, called CAMetrics, overloads the comparison and assignment operators so that
internal counters are incremented whenever either operation executes. It is thus able
to track and store the number of comparisons and assignments of each sort internally,
using parametric decorators as well as an accessory counter class.
When an algorithm is finished sorting an array, the CAMetrics class returns the
values stored within its counter object. These values compose the sample data which
is used to analyze the performance of each individual sort for various n-sized arrays.
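
The internals of CAMetrics are not reproduced here; the following hypothetical C++
sketch only illustrates the counting idea, overloading assignment and comparison on
a wrapper element type that shares an accessory counter object. All names are
illustrative assumptions.

#include <cstddef>

// Accessory counter object shared by all elements of an array under test.
struct Counter {
    std::size_t comparisons = 0;
    std::size_t assignments = 0;
};

// Wrapper element whose operators tally every comparison and assignment.
class Counted {
public:
    Counted(int v, Counter* c) : value(v), counter(c) {}
    Counted& operator=(const Counted& other) {
        value = other.value;
        ++counter->assignments;            // count one write
        return *this;
    }
    bool operator<(const Counted& other) const {
        ++counter->comparisons;            // count one comparison
        return value < other.value;
    }
private:
    int value;
    Counter* counter;
};

A sort templated on its element type can then be run over an array of Counted
values, and the totals read back from the shared Counter once it finishes.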

2.3 Analysis
True asymptotic run time is determined by comparing the best fit curves of each
sort's performance with the curves n² and n lg n. Here, n is the size of the array
that is being sorted, and the best fit curves are generated using the sample data gathered
by the CAMetrics class.
The equation of the best fit curve is calculated so that it most closely aligns with the
sample data produced by each sorting algorithm. This is done by manipulating the A, B,
and C constants of the following equations in order to minimize the Residual Standard
Error:

y = An² + Bn + C

y = An lg n + Bn + C

If the Residual Standard Error is smaller in case 1, then the sample data most closely
resembles the curve n²; if it is smaller in case 2, then it most closely resembles the
curve n lg n. Residual Standard Error is computed using the following equation, where
the degrees of freedom (d.f.) are the number of data points minus the number of fitted
coefficients:

RSE = √( Σᵢ (yᵢ − ŷᵢ)² / d.f. )

This formula calculates the magnitude of the total difference between the sample data
and the curve it is being compared against, in order to determine which curve fits the
data best. This analysis compares each sort only with the curves n² and n lg n, as
these are the only theoretical asymptotic bounds that characterize the sorting
algorithms being tested (see section 2.1). If, for example, one of the analyzed sorts
had a theoretical run-time complexity of n³, then the best fit curves of the sample
data would also be compared with a graph of n³ in order to best determine accuracy in
practice.
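
As a sketch of the computation (assuming the fitted values ŷᵢ have already been
produced by minimizing over A, B, and C), the RSE of a three-coefficient fit can be
evaluated as follows.

#include <cmath>
#include <cstddef>
#include <vector>

// Residual Standard Error of fitted values yhat against samples y, for a
// model with numCoeffs fitted coefficients (three here: A, B, and C).
double residualStandardError(const std::vector<double>& y,
                             const std::vector<double>& yhat,
                             std::size_t numCoeffs) {
    double rss = 0.0;                      // residual sum of squares
    for (std::size_t i = 0; i < y.size(); ++i) {
        double r = y[i] - yhat[i];
        rss += r * r;
    }
    double df = static_cast<double>(y.size() - numCoeffs);  // degrees of freedom
    return std::sqrt(rss / df);
}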

3 Results
3.1 Raw Data

Number of elements sorted (n)


Algorithm 200 400 800 1600 2400 3200 4000 4800 5600 6400

Insert 29825 119421 485647 1916027 4324881 7663372 12018039 17272963 23460923 30701612
Select 20099 80199 320399 1280799 2881199 5121599 8001999 11522399 15682799 20483199
Heap 2284 5365 12304 27826 44517 62013 79882 98545 117604 136868
Merge 1470 3366 7511 16671 26381 36477 46787 57523 68411 79401
Quick 2182 5056 10722 23582 37041 51499 65018 79549 94661 109284

Figure 1: Number of comparisons performed by each sort over n elements.

Figure 1 shows the number of comparisons each sort performs given an unsorted
array of size n, with n displayed at the top of each column. Figure 2 displays the number
of assignments each sort executes for an unsorted array of size n, with n similarly displayed at the
top of each column.

6
Number of elements sorted (n)
Algorithm 200 400 800 1600 2400 3200 4000 4800 5600 6400

Insert 49532 198825 804453 3193640 7201293 12778583 20012051 28785771 39132532 51172024
Select 597 1197 2397 4797 7197 9597 11997 14397 16797 19197
Heap 2859 6511 14592 32384 51377 71115 91293 112206 133585 155120
Merge 3088 6976 15552 34304 54208 75008 95808 118016 140416 162816
Quick 3490 7740 16498 36891 54314 74065 99507 119522 151702 174423

Figure 2: Number of assignments performed by each sort over n elements.

Figure 3 is a graphical representation of the sample data in Figure 1, with the number
of elements sorted displayed along the x-axis and the number of comparisons displayed
along the y-axis. Figure 4 is a graph of the sample data in Figure 2, with the x-
axis representing the number of elements sorted and the y-axis portraying the number
of assignments performed over n.

[Line graph: Number of Comparisons (y-axis, 0 to 3×10⁷) versus Number Sorted
(x-axis, 0 to 6400), with one series each for Insert, Selection, Heap, Merge, and
Quick Sort.]

Figure 3: A figure representing the number of comparisons for each sort algorithm,
over n elements.

[Line graph: Number of Assignments (y-axis, 0 to 5×10⁷) versus Number Sorted
(x-axis, 0 to 6400), with one series each for Insert, Selection, Heap, Merge, and
Quick Sort.]

Figure 4: A figure representing the number of assignments for each sort algorithm over
n elements.

Figure 3 demonstrates that the Insertion and Selection sorts perform significantly
more comparisons than Heap, Quick, and Merge Sort as input size grows. This supports
the hypothesis that Selection and Insert Sort are, at the very least, far less
efficient than the other three. It is also apparent from this graph that the
comparison counts of Selection Sort and Insertion Sort grow much faster than those of
Heap, Merge, and Quick Sort, suggesting that these two sorts belong to a different
asymptotic run-time class. Additionally, Insertion Sort performs moderately worse than
Selection Sort as n increases, with a difference of roughly 1×10⁷ comparisons for an
array of 6400 elements.
Figures 2 and 4 further suggest that Merge, Quick, and Heap are similar in run-
time efficiency, as their assignment counts are nearly identical for every input size n.
However, there is one major discrepancy in the data. Selection Sort, which is supposed
to have a run-time of Θ(n²), has fewer assignments than every other sorting algorithm,
including Heap, Merge, and Quick Sort, which are the most efficient algorithms in
Figures 1 and 3. Exactly why this occurs is analyzed in section 3.3.
Sections 3.2 – 3.6 provide an in-depth analysis of each sort individually, based on
the data given in Figures 1, 2, 3, and 4.

3.2 Analysis: Insert Sort

[Two-panel plot: Insert Sort's comparison counts (left) and assignment counts
(right) versus Number Sorted, each shown with the fitted n² (red) and n lg n (blue)
curves.]

(a) A graph of the best fit curve of the comparison data taken from Insert Sort.
(b) A graph of the best fit curve of the assignment data taken from Insert Sort.

Figure 5: Graphs of the best fit curves of the comparison and assignment data taken
on Insert Sort.

When compared with the curves n² and n lg n, the graph of Insert Sort's comparison
data produces Residual Standard Errors of 18,920 and 700,200, respectively. As shown
in section 2.3, RSE represents a magnitude of difference; thus n² is the curve of best
fit, because 18,920 is far less than 700,200. A smaller standard error indicates that
the sample data deviates less from the curve n² than from the curve n lg n. This
suggests that Insert Sort does indeed have an in-practice run-time complexity matching
its theoretical complexity of Θ(n²), as stated in section 2.1.1. Figure 5(a)
graphically illustrates the similarity between n² (red) and the sample data (black).
The sample data clearly adheres more closely to the red curve than to the blue curve
(n lg n).
The RSE of Insert Sort's assignment data is 18,920 for the curve n² and 1,167,000 for
the curve n lg n. Because the RSE is far smaller for n² than for n lg n, the
assignment data also suggests that Insert Sort's run-time complexity is Θ(n²).
Figure 5(b) illustrates the relationship between the curve of Insert Sort's assignment
data and the curve n²; the assignment data, like the comparison data, adheres much
more closely to n² than to n lg n.
It is thus reasonable to conclude that Insert Sort has a run-time complexity of Θ(n²)
both in theory and in practice.

3.3 Analysis: Selection Sort

[Two-panel plot: Selection Sort's comparison counts (left) and assignment counts
(right) versus Number Sorted, each shown with the fitted n² (red) and n lg n (blue)
curves.]

(a) A graph of the best fit curve of the comparison data taken from Selection Sort.
(b) A graph of the best fit curve of the assignment data taken from Selection Sort.

Figure 6: Graphs of the best fit curves of the comparison and assignment data taken
on Selection Sort.

Selection Sort's comparison data produces an RSE of 3.56×10⁻⁹ when compared with the
graph of n² and 467,400 when compared with the graph of n lg n. The former is
essentially a perfect fit, leaving little doubt that Selection Sort's run-time
complexity is far closer to Θ(n²) than Θ(n lg n). Figure 6(a) demonstrates this close
relationship between the curve n² and the comparison samples; the two are almost
identical.
Interestingly, the best fit curve of Selection Sort's assignment data yields an RSE
of 2.333×10⁻¹² when compared to a graph of n², and 1.295×10⁻¹² when compared to the
curve n lg n. This suggests Selection Sort's assignment count is, if anything, more
similar to the curve n lg n than to n². Figure 6(b) reinforces this notion, showing
the plot of the assignment data and the fitted curve as essentially the same. How can
this be, if Selection Sort is an algorithm with a theoretical run-time complexity of
Θ(n²)? Selection Sort's low assignment count can be attributed to the manner in which
it functions. Because Selection Sort operates by repeatedly finding the largest or
smallest element in the unsorted portion and swapping it into place, it performs only
one swap per pass; indeed, the assignment counts in Figure 2 are exactly 3(n − 1),
three assignments per swap over n − 1 passes. The number of assignments it makes is
therefore negligible compared to the number of comparisons it performs.
However, even though Selection Sort's assignment count is relatively low, its
comparison count still behaves according to Θ(n²), and Θ(n²) plus a lower-order term
is still Θ(n²), according to the definition of asymptotic complexity.
Thus, Selection Sort performs in Θ(n²) time both in theory and in execution.

3.4 Analysis: Heap Sort

[Two-panel plot: Heap Sort's comparison counts (left) and assignment counts (right)
versus Number Sorted, each shown with the fitted n² (red) and n lg n (blue) curves.]

(a) A graph of the best fit curve of the comparison data taken from Heap Sort.
(b) A graph of the best fit curve of the assignment data taken from Heap Sort.

Figure 7: Graphs of the best fit curves of the comparison and assignment data taken
on Heap Sort.

Heap Sort's comparison data produces an RSE of 497.5 when compared with the quadratic
curve n², and 78.8 when compared to the curve n lg n. Because 78.8 indicates a smaller
magnitude of difference, Heap Sort has a run-time complexity much closer to Θ(n lg n)
than Θ(n²). This is confirmed by Figure 7(a), which shows that the sample data aligns
more closely with n lg n, the blue curve, than n², the red curve.
Heap Sort's assignment data generates an RSE of 494 when compared with the n² curve,
and 81.38 when compared to the curve n lg n. As with the comparison data, Heap Sort's
assignment data suggests that it performs with a run-time complexity much closer to
Θ(n lg n) than Θ(n²). Figure 7(b) shows this relationship as a curve of best fit; the
sample data most closely aligns with the blue curve, n lg n.
It therefore suffices to say that Heap Sort's actual run-time complexity matches its
theoretical run time, Θ(n lg n) (see section 2.1.3).

3.5 Analysis: Merge Sort

[Two-panel plot: Merge Sort's comparison counts (left) and assignment counts (right)
versus Number Sorted, each shown with the fitted n² (red) and n lg n (blue) curves.]

(a) A graph of the best fit curve of the comparison data taken from Merge Sort.
(b) A graph of the best fit curve of the assignment data taken from Merge Sort.

Figure 8: Graphs of the best fit curves of the comparison and assignment data taken
on Merge Sort.

When Merge Sort's comparison data is fit to the quadratic curve n², an RSE of 246.9
is produced. When compared with the curve n lg n, an RSE of 44.79 is calculated.
Thus, Merge Sort more closely matches a run-time complexity of Θ(n lg n). Figure 8(a)
illustrates this visually, with the sample data (black) aligning more closely with the
blue curve, n lg n, than the red curve, n².
When Merge Sort's assignment data is fit to the quadratic curve n², the calculated
RSE is 499.3. When the assignment data is fit to n lg n, an RSE of 181.9 results. It
is therefore safe to conclude that Merge Sort's assignment data more closely conforms
to a run-time complexity of Θ(n lg n), as the Residual Standard Error is smaller for
that curve fit. Figure 8(b) is a graphical representation of this data from which the
same conclusion may be drawn, with the black points aligning more closely to the blue
curve, n lg n, than the red curve, n².
Because both the comparison data and the assignment data suggest an asymptotic run
time of Θ(n lg n), it is safe to conclude that Merge Sort has a run-time of Θ(n lg n)
in practice, in accordance with the theory presented in section 2.1.4.

3.6 Analysis: Quick Sort

[Two-panel plot: Quick Sort's comparison counts (left) and assignment counts (right)
versus Number Sorted, each shown with the fitted n² (red) and n lg n (blue) curves.]

(a) A graph of the best fit curve of the comparison data taken from Quick Sort.
(b) A graph of the best fit curve of the assignment data taken from Quick Sort.

Figure 9: Graphs of the best fit curves of the comparison and assignment data taken
on Quick Sort.

When the comparison data of Quick Sort is fit against the curve n², the RSE is 406.
When it is compared with the curve n lg n, an RSE of 244.4 is produced. Because 244.4
is a smaller error than 406, Quick Sort's comparison count suggests an asymptotic run
time more similar to Θ(n lg n) than Θ(n²). Figure 9(a) is a graphical representation
of the curve fit of Quick Sort's comparison data, from which the same conclusion may
be drawn: the sample data, displayed in black, more closely adheres to the blue curve,
n lg n, than the red curve, n².
Fitting the assignment data to the curves n² and n lg n produces Residual Standard
Errors of 2558 and 2098, respectively. Because 2098 is the smaller error, Quick Sort's
assignment data suggests a run time more similar to Θ(n lg n) than Θ(n²) in practice.
Figure 9(b) is a visual representation of the curve fit of Quick Sort's assignment
data compared to the curves n² and n lg n. While it is harder to see which correlation
is stronger due to the similarity of the RSEs, the errors indicate that the sample
data more closely adheres to n lg n.
Because both the comparison data and the assignment data tend towards the curve
n lg n, as demonstrated graphically and statistically, it is reasonable to conclude
that Quick Sort has a run-time complexity of Θ(n lg n) in practice, in agreement with
the theory presented in section 2.1.5.

3.7 Sort Comparisons
The best sort algorithm in practice is the one with the fewest total comparisons and
assignments. Summing the sample data in Figures 1 and 2 gives combined operation
counts of 1,050,190 for Merge Sort, 1,216,746 for Quick Sort, and 1,258,250 for Heap
Sort, so by this definition the best performing algorithms are Merge, Quick, and Heap
Sort, in that order. The results thus suggest that the most efficient sorting
algorithm in practice is Merge Sort.
Of the two sorts with theoretical Θ(n²) time, Selection Sort is by far the more
efficient. Figures 3 and 4 show that the curve of Insert Sort's sample data never
falls below Selection Sort's in either graph, and Figures 1 and 2 demonstrate that
Insert Sort's comparison and assignment counts exceed Selection Sort's for every array
size n.
Of all the sorts tested, Insertion Sort has the highest comparison and assignment
counts, making it the least efficient. This is surprising, because Insert Sort is
adaptive while Selection Sort is not. It is possible that the unsorted arrays were
such that Insert Sort tended towards its worst case each time; the arrays may have
been closer to reverse order than sorted order, resulting in a run time that
deteriorates as n increases.

3.8 Conclusions
This paper shows, using best fit curves of sample data, that Selection Sort and Insert
Sort have run times most closely resembling Θ(n²). In addition, it shows that Heap
Sort, Quick Sort, and Merge Sort all have run-time complexities resembling Θ(n lg n).
Thus, it confirms that all five sorts operate according to their theoretically derived
run-time complexities.
This paper also explores which of the five sorts is most efficient, again using best
fit curves and raw sample data gathered from experimental testing. It finds that, of
all the sorts, Merge Sort is the least time consuming. It also determines that, of the
two Θ(n²) sorts, Selection Sort outperforms Insertion Sort, and that Insertion Sort is
the least efficient of all the algorithms analyzed.
Finally, it explains several anomalies within the data, such as the discrepancy in
the best fit curve of Selection Sort (section 3.3), by delineating the mechanics of
each sorting method.

In total, this analysis serves as an exploration into the practicality of run-time
theory in real-world environments.

