by
Mengheng Jin
Master of Science
in
Computer, Microelectronic Devices, Circuits and Systems
Mengheng Jin
Spring 2014
Edmonton, Alberta
Permission is hereby granted to the University of Alberta Libraries to reproduce single copies of this thesis
and to lend or sell such copies for private, scholarly or scientific research purposes only. Where the thesis is
converted to, or otherwise made available in digital form, the University of Alberta will advise potential
users of the thesis of these terms.
The author reserves all other publication and other rights in association with the copyright in the thesis and,
except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or
otherwise reproduced in any material form whatsoever without the author's prior written permission.
Dedicated to my beloved parents...
Abstract
Acknowledgement
First of all, I would like to express my sincere gratitude and respect to my super-
visors Dr. Chintha Tellambura and Dr. Bruce Cockburn for their brilliant advice
and the limitless time they spent helping me during my M.Sc. program. With their
professional knowledge and continuous encouragement, I have learned a lot about
how to analyse a problem, technical writing, presentation skills, etc., and I have
learned even more from their great personalities. I feel fortunate to have had this
opportunity to study under their supervision and I am sincerely grateful to them.
My thanks also go to my M.Sc. examining committee, Dr. Hai Jiang and Dr.
Masum Hossain, for the time they spent reading my thesis and providing valuable
comments and advice. I am also grateful to the faculty and the staff of the
Department of Electrical and Computer Engineering for their full support.
I would also like to thank Dr. Shuangshuang Han for her immense encourage-
ment and valuable advice on my studies, and Andrew Maier for his very helpful
programming suggestions for my research. I also give many thanks to Prasanna and
Russell for spending their time helping me to improve my oral presentation skills,
and to my labmates, including Jinghang and David, for creating a pleasant lab
environment.
My heartfelt and deepest gratitude goes to my beloved parents for their invalu-
able support and endless love throughout my life.
Thank you all!
Table of Contents
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Outline of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 MIMO Systems 4
2.1 Benefits of MIMO Technology . . . . . . . . . . . . . . . . . . . . 4
2.2 Technical Implementation of MIMO Systems . . . . . . . . . . . . 6
2.2.1 Spatial Multiplexing . . . . . . . . . . . . . . . . . . . . . 6
2.2.2 Diversity Coding . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.3 Precoding . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Characterization of MIMO Systems . . . . . . . . . . . . . . . . . 7
2.3.1 Modulation Schemes . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 Signal-to-Noise Ratio . . . . . . . . . . . . . . . . . . . . 9
2.3.3 Bit Error Rate . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.4 Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.5 Diversity Order . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.6 Processing Speed . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Overview of Linear MIMO Detection Methods . . . . . . . . . . . 13
2.5.1 The Zero Forcing (ZF) Algorithm . . . . . . . . . . . . . . 13
2.5.2 The Minimum Mean Square Error (MMSE) Algorithm . . . 14
2.5.3 The Vertical BLAST (V-BLAST) Algorithm . . . . . . . . 15
2.5.4 Performance of the Linear Algorithms . . . . . . . . . . . . 16
2.6 Overview of the Sphere Detection Algorithm . . . . . . . . . . . . 17
2.6.1 The Fincke-Pohst (FP) Sphere Detection Algorithm . . . . . 19
2.6.2 Schnorr-Euchner (SE) Enumeration . . . . . . . . . . . . . 20
2.6.3 The K-Best Sphere Detection Algorithm . . . . . . . . . . . 21
2.6.4 Pre-processing the Channel Matrix . . . . . . . . . . . . . . 22
2.6.5 Performance of the Sphere Detection Algorithms . . . . . . 24
6 Conclusions 81
6.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Bibliography 84
List of Tables
4.1 Matrix multiplication times (in seconds) for different looping (for
and gfor) structures . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Matrix multiplication times (in seconds) for serial and different de-
grees of parallel versions . . . . . . . . . . . . . . . . . . . . . . . 48
4.3 Matrix multiplication times (in seconds) for the merged matrix with
parallel gfor-loop structure . . . . . . . . . . . . . . . . . . . . . . 51
4.4 Comparison of matrix inverse running times (in seconds) using built-
in function inv and new function NewInverse . . . . . . . . . . 56
4.5 Running times (in seconds) comparison of MIMO detection algo-
rithms with the serial and different parallel versions . . . . . . . . . 58
4.6 The most time consuming operations for the MIMO detection algo-
rithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1 Matrix multiplication times (in seconds) using the for, gfor and par-
for loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 Running times (in seconds) comparison of MIMO detection algo-
rithms with the serial and different parallel versions . . . . . . . . . 79
List of Figures
List of Abbreviations
Abbreviation Definition
FP Fincke-Pohst
FSK frequency shift keying
SD sphere detector
SE Schnorr-Euchner
SER symbol error rate
SIMD single-instruction multiple-data
SIMO single-input multiple-output
SISD single-instruction single-data
SISO single-input single-output
SM streaming multiprocessor
SNR signal-to-noise ratio
ZF zero forcing
List of Symbols
Notation Definition
|a| absolute value of scalar a
⌈a⌉ the smallest integer greater than or equal to a
⌊a⌋ the largest integer less than or equal to a
ℜ(a) the real component of (complex) scalar a
ℑ(a) the imaginary component of (complex) scalar a
A real-valued matrix A
Ā complex-valued matrix A
Â detected matrix A
A(i, j) i-th element of the j-th column of matrix A
‖A − B‖²_F Euclidean distance between A and B
A⁻¹ inverse of square matrix A (for m = n)
A† Moore-Penrose pseudo-inverse of matrix A (for m ≠ n)
A^H conjugate transpose of matrix A
A^T transpose of matrix A
I_n identity matrix of rank n
min(a, b) minimum of scalars a and b
arg min_i (A) index i corresponding to the smallest element a_i of set A
log(·) natural logarithm
log₂(·) logarithm to base 2
Γ(·) Gamma function
Chapter 1
Introduction
1.1 Background
Wireless communications provides key infrastructure used in modern daily life.
The convenience of wireless allows us to use cellular telephones and wirelessly
connected computers almost everywhere in towns and cities and along major trans-
portation corridors. However, since the limited wireless bandwidth cannot cope
with the rapidly increasing user traffic, multiplexing technology has become an
essential way to better exploit the limited channel resources. A popular method for
increasing the wireless capacity within a fixed bandwidth is to exploit multipath
propagation among one or more transmitting and receiving antennas. In this thesis, we consider
the multiple-input multiple-output (MIMO) system which makes full use of mul-
tiple antennas at both the transmitter and receiver ends of the channel to achieve
significant improvements in wireless system performance.
1.2 Motivation
Wireless signals propagate from the transmitter to the receiver through the radio
channel. However, because the radio channel has various inevitable sources of
noise and fading attenuation, the received signal is distorted and detection errors
can occur at the receiver. In a MIMO channel, each receiver antenna receives
superimposed copies of all of the transmitted signals. In order to recover (detect)
the transmitted data from the received signal with a lower bit error rate (BER), researchers have
already investigated many ways to improve the performance of the MIMO detector.
The optimal MIMO detector employs maximum likelihood principles to recover the transmit
data. Many MIMO detection algorithms have been proposed that can approach the
statistically optimal performance of maximum likelihood detection [2]. However,
the high computational complexity of these algorithms has made them unsuitable
for widespread adoption in practical MIMO receiver designs.
Hardware parallelism is now provided in various ways in the instruction sets
and architectures of most computers [3]. Parallel computing exploits the fact that
large problems can often be divided into smaller computations, which can then be
solved concurrently to reduce the total required running time. Traditionally, to solve
a problem, an algorithm is designed and implemented as a serial stream of instruc-
tions. These instructions are executed on a central processing unit (CPU) on one
computer. Only one instruction may execute at a time. Parallel computing, on the
other hand, uses multiple processing elements to solve a problem simultaneously.
This is accomplished by splitting the problem into several independent parts so that
each processing element can execute simultaneously in parallel with the others. The
processing elements could be diverse and could include resources such as a single
computer with multiple processors, several networked computers, specialized hard-
ware, or any combination of the above. In this research, we investigate different
ways to exploit the forms of hardware parallelism available in the simulation of a
MIMO system with the objective of measuring and maximizing performance and
efficiency. Insights obtained while implementing a parallel MIMO simulation could
lead to improved parallel MIMO detectors that benefit wireless communications
equipment.
The main objective of this project is to find a better way to implement the par-
allelism, either on the multicore CPU or the GPU subsystem, and to achieve signif-
icant acceleration in some of the MIMO detection algorithms.
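The decomposition idea described above can be illustrated with a small sketch. The thesis's implementations use MATLAB (with Jacket and the Parallel Computing Toolbox); the following Python/NumPy fragment is only a hypothetical illustration, where `zf_error_count` is an invented helper that simulates one independent batch of MIMO frames:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def zf_error_count(seed, trials=50, snr_db=20.0, m=4):
    """One independent sub-problem: simulate `trials` random 4x4 BPSK MIMO
    frames, detect them with zero forcing, and count symbol errors."""
    rng = np.random.default_rng(seed)
    sigma = 10.0 ** (-snr_db / 20.0)
    errors = 0
    for _ in range(trials):
        H = rng.standard_normal((m, m))
        s = rng.choice([-1.0, 1.0], size=m)
        y = H @ s + sigma * rng.standard_normal(m)
        s_hat = np.sign(np.linalg.pinv(H) @ y)   # ZF equalize, then slice
        errors += int(np.sum(s_hat != s))
    return errors

# Split the simulation into independent seeds and run them concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    total_errors = sum(pool.map(zf_error_count, range(8)))
```

Because the batches share no state, they can be handed to any pool of workers (threads here; multicore processes or GPU kernels in the thesis) without synchronization.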
1.3 Outline of the Thesis
Chapter 2 describes the major MIMO detection algorithms including several variants of the
sphere detector (SD) algorithm. A brief comparison of these algorithms is provided
at the end of the chapter.
Chapter 3 introduces the graphics processing unit (GPU) and the application
of parallel GPU-based computing for MIMO detection. There are several ways to
exploit GPU parallelism. The first parallel programming environment evaluated in
this thesis is the Jacket library extension of the MATLAB environment.
Chapter 4 describes the details of the parallel MIMO detector implementations
on the GPU. Since all of the simulated data are created initially on the CPU in
serial fashion, it is important to have an efficient method to map the calculations
efficiently onto the parallel GPU hardware. There are often limitations imposed on
the algorithms by the hardware parallelism. For example, one might be required
to synchronize the same kinds of arithmetic operations on parallel streams of data.
Several challenges are addressed in this chapter. At the end of the chapter, the
simulation and experimental evaluation of the developed parallel MIMO detection
algorithms are discussed.
Chapter 5 compares the performance of parallel computation on the GPU and
the multiple cores of the CPU. The parallel computing toolbox (PCT) in MATLAB
is used to implement the parallelism on the multicore CPU. Several detection algo-
rithms introduced in Chapters 2 and 4 are run and compared to find a better way to
exploit the different kinds of hardware parallelism.
Chapter 6 includes the conclusions arising from the research presented in this
thesis and gives recommendations for future work.
Chapter 2
MIMO Systems
where ρ is the average signal-to-noise ratio (SNR) at the receiver. Eq. (2.1) shows
that when the signal bandwidth B and SNR are fixed, the channel capacity can
be linearly increased by increasing the number of antennas as long as the chan-
nel remains rich scattering. Sufficiently rich scattering is required to allow signal
processing to disentangle the multiple transmitted signals in the MIMO receiver.
Equivalently, Eq. (2.1) also indicates that the spectral efficiency (bits per second
per hertz of bandwidth), which indicates the number of users that can be simultane-
ously supported on a limited frequency bandwidth, can be increased by spreading
the total transmitted power over the available antennas to achieve an improvement
Figure 2.1: Antenna configurations ((a) SISO, (b) SIMO, (c) MISO and (d) MIMO), showing the channel gains h between the transmitter (TX) and receiver (RX) antennas.
without consuming additional bandwidth. Furthermore, by employing more anten-
nas at the receiver side, one can reduce the vulnerability to channel fading to im-
prove the link reliability. Fading is the sometimes severe attenuation of the signal
strength at a receiver antenna caused by destructive interference among the multi-
ple superimposed received signals. In general, MIMO technology can also ensure
the independence of each signal copy from the different transmitters to achieve a lower
error rate at the receiver. Because of these properties, MIMO technology plays an
important role in many modern wireless communication standards, for example, in
IEEE 802.11n (Wi-Fi) [5], 4G [6], the 3rd-generation partnership project (3GPP)
long term evolution (LTE) [7] and IEEE 802.16 (WiMAX) [8].
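The capacity scaling described above can be summarized by a standard approximate relation (a textbook form consistent with the discussion of Eq. (2.1), not reproduced verbatim from it), where B is the signal bandwidth and ρ the average receive SNR:

```latex
C \;\approx\; \min(M_t, M_r)\, B \log_2\!\left(1 + \rho\right) \quad \text{bits/s}
```

That is, for fixed B and ρ, the capacity grows linearly with the smaller of the two antenna counts, provided the scattering keeps the channel matrix well conditioned.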
Spatial multiplexing is a common MIMO scenario (Fig. 2.1 (d)). Its main principle
is to first split the data stream into several independent substreams and then
to transmit them from different transmitter antennas within the same frequency
range. Compared to a conventional SISO system, spatial multiplexing improves
the throughput rate to achieve much higher frequency spectrum utilization. If the
MIMO channel between the transmitter and receiver arrays provides sufficient di-
versity due to the rich scattering in the channel, the receiver can detect these parallel
data streams reliably. Spatial multiplexing technology can be applied successfully
at the receiver even without knowing or exploiting the channel state information at
the transmitting side. The Bell Laboratories Layered Space-Time (BLAST) system [9],
developed by Foschini and other researchers at Bell Labs, was an early example of
practical spatial multiplexing technology.
receiver. This combination step effectively reduces the effects of channel fading
affecting any one path, ensuring a robust system by increasing the effective number
of independent channels. To maximize the signal diversity, space time coding [10]
is used in MIMO systems to ensure that all transmitted data are sent out on all trans-
mitter antennas and then received on all receiver antennas. A suitable space-time
decoder is required at the receiver to efficiently recover the data from the signals
obtained from all receiver antennas.
2.2.3 Precoding
2.3.1 Modulation Schemes
Standard modulation techniques include phase shift keying (PSK) modulation, fre-
quency shift keying (FSK) modulation, amplitude shift keying (ASK) modulation,
quadrature amplitude modulation (QAM). In this thesis, we focus on QAM modu-
lation, which is widely used in the highest-capacity broadband wireless systems.
In QAM, the digital bit stream modulates the amplitudes of two orthogonal car-
riers (on sine and cosine) of the same frequency. Because QAM makes full use of
both the amplitude and phase of two orthogonal carriers, the bandwidth efficiency
is increased. A QAM constellation diagram is a two-dimensional scatter plot of a
digital modulated signal in the complex plane. In QAM, if a suitable constellation
size is chosen, it is possible to achieve relatively high spectral efficiencies, limited
only by the signal-to-noise ratio and the effects of distortion and fading in the com-
munications channel. The constellation points are usually packed within a square or
rectangular grid with equal vertical and horizontal spacing. Because data in digital
communications is in binary format, it is convenient that the number of points in the
grid be a power of 2 (such as 2, 4, 8, . . . ). Each point maps a group of data bits (e.g.,
2, 4, 8, . . . ) forming a code word to a unique transmitted complex symbol in the
transmitter. The constellation diagram of 16-QAM, which is used in this research,
is shown in Fig. 2.2.
Following standard practice, the Gray code [12] scheme was used to map code
words to constellation points. Adjacent constellation points correspond to code
words that differ in exactly one bit. In 16-QAM, the data is transmitted using 4-bit
symbols, so when a symbol detection error does occur during transmission (usually
to an adjacent constellation point), the number of resulting data bit errors is minimized.
According to Fig. 2.2, the data stream is mapped onto the complex plane by demul-
tiplexing it into real and imaginary substreams and Gray-converting consecutive
bit pairs: 00 to −3, 01 to −1, 11 to +1 and 10 to +3. Note that each complex-
valued symbol encodes 4 bits (Fig. 2.2, where the axis labels I and Q stand for
the real and imaginary part, respectively). It is possible to transmit more bits per
symbol by applying a higher-order constellation. However, higher-order QAM con-
stellations mean that the constellation points are more closely spaced together and
Figure 2.2: Gray-coded 16-QAM constellation. The axes I and Q denote the real and imaginary parts; each of the 16 points is labelled with its 4-bit code word (decimal value in parentheses).
are thus more susceptible to noise and other signal corruptions, possibly leading to
incorrectly detected symbols and hence to bit errors.
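One Gray mapping of bit pairs to amplitudes (chosen so that adjacent points differ in exactly one bit) can be written down compactly. The following Python sketch is illustrative only; the thesis's simulations are in MATLAB, and Fig. 2.2 may orient one of the axes in the mirrored direction:

```python
import numpy as np

# One-dimensional Gray map applied to both the I and Q substreams.
GRAY_PAM4 = {(0, 0): -3.0, (0, 1): -1.0, (1, 1): 1.0, (1, 0): 3.0}

def map_16qam(bits):
    """Map a bit stream (length divisible by 4) to 16-QAM symbols:
    the first bit pair selects the I amplitude, the second the Q."""
    groups = np.reshape(bits, (-1, 4))
    return np.array([GRAY_PAM4[(g[0], g[1])] + 1j * GRAY_PAM4[(g[2], g[3])]
                     for g in groups])
```

For example, `map_16qam([0, 0, 0, 0])` yields the corner symbol −3 − 3j.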
compared to the number of originally transmitted bits:

BER = N_error / N_total . (2.2)
In Eq. (2.2), Nerror denotes the total number of error bits seen at the receiver,
and Ntotal denotes the total number of the bits that were transmitted. When using
QAM it is simpler and thus common practice to calculate the related quantity called
the symbol error rate (SER) instead of the BER. In 16-QAM, there are 16 possible
symbols and the SER can be obtained by replacing the Nerror and Ntotal with the
number of error symbols and the total number of transmitted symbols respectively.
For general M-QAM, the SER will be roughly log2 M times the corresponding
BER, because each symbol carries log2 M bits and, with Gray coding, a symbol
detection error typically corrupts only one of them.
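This rule of thumb is a one-liner (an illustrative Python helper; `ber_from_ser` is a hypothetical name):

```python
import math

def ber_from_ser(ser, M=16):
    """Approximate BER from SER for Gray-coded M-QAM, assuming each
    symbol error corrupts about one of its log2(M) bits."""
    return ser / math.log2(M)

# e.g. for 16-QAM, an SER of 4% corresponds to a BER of about 1%
```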
2.3.4 Complexity
order of the MIMO channel can be shown to be:

d = Mt Mr . (2.3)
When the SNR and the error probability are measured experimentally in a sys-
tem simulation, the diversity order can be shown to be [13]:
d = − lim_{ρ→∞} ( log Pe(ρ) / log ρ ), (2.4)

where ρ denotes the signal-to-noise ratio and Pe(ρ) denotes the error probability,
which is taken to be the SER in this thesis. In this expression, the diversity order
can be seen to be the magnitude of the slope of the error probability vs. SNR curve
on a log-log plot. This implies that for the same SNR, the use of a higher diver-
sity order detector can achieve a lower SER. MIMO diversity coding mentioned
in Section 2.2.2 has been designed to maximize the diversity order. In contrast,
spatial multiplexing does not attempt to maximize the diversity order but instead
maximizes the data rate.
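Eq. (2.4) suggests a practical recipe for estimating the diversity order from simulation output: fit a line to log SER versus log SNR at high SNR and negate the slope. A hypothetical NumPy sketch:

```python
import numpy as np

def diversity_order(snr_db, ser):
    """Estimate the diversity order d of Eq. (2.4) as minus the slope of
    the measured SER curve versus linear SNR on a log-log plot."""
    rho = 10.0 ** (np.asarray(snr_db) / 10.0)
    slope, _ = np.polyfit(np.log10(rho), np.log10(np.asarray(ser)), 1)
    return -slope

# Synthetic check: an error curve Pe = c * rho^(-4) has diversity order 4.
snr = np.array([20.0, 25.0, 30.0])
pe = 5.0 * (10.0 ** (snr / 10.0)) ** (-4.0)
```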
Processing speed is affected by many factors, including most obviously the algorithm's
time complexity, which determines the required number of CPU or GPU instructions. The hardware
of the processing device also plays an important role in determining the time com-
plexity cost per bit. In this research, we investigate alternative ways of improving
MIMO detection algorithms by converting the data and node searching algorithms
into parallel form to better exploit the characteristics of the available parallel hard-
ware. Whenever a group of data values can be processed at the same time, the
computation time should be reduced compared to the serial computation.
y = Hs + n, (2.5)

where y = [y1 y2 ⋯ yMr]^T is the Mr-element received signal vector. The complex-valued model of Eq. (2.5) can be converted into an equivalent real-valued model

y = Hs + n, (2.6)

i.e.,

[ ℜ(y) ]   [ ℜ(H)  −ℑ(H) ] [ ℜ(s) ]   [ ℜ(n) ]
[ ℑ(y) ] = [ ℑ(H)   ℜ(H) ] [ ℑ(s) ] + [ ℑ(n) ]    (2.7)

where ℜ(·) and ℑ(·) represent the real and imaginary parts of the corresponding
elements of complex vectors and matrices. As a result, when going to a real-valued
system, the dimensions of y, H, s, and n grow to 2Mr × 1, 2Mr × 2Mt, 2Mt × 1,
and 2Mr × 1, respectively.
The objective of MIMO detection is to find the signal vector that minimizes
the Euclidean distance between the predicted noise-free signal vector Hs and the
received vector y in the presence of the Gaussian noise n [15]. Statistically optimal
performance is obtained using the maximum likelihood (ML) detection rule, i.e.,

ŝ = arg min_{s ∈ Ω^{2Mt}} ‖y − Hs‖², (2.8)

where ŝ is the detected signal vector and Ω stands for the set of the real entries along
one dimension in the constellation, e.g., Ω = {−3, −1, 1, 3} if we are considering
16-QAM. ‖·‖² denotes the sum of the squares of the corresponding elements. For
convenience, we also define Mc = √M to be the equivalent real-valued constella-
tion size of M-QAM (i.e., Mc = 4 in 16-QAM).
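The real-valued decomposition of Eq. (2.7) and the exhaustive ML rule of Eq. (2.8) can be sketched as follows (an illustrative Python/NumPy fragment, not the thesis's MATLAB code; `to_real` and `ml_detect` are hypothetical names):

```python
import numpy as np
from itertools import product

def to_real(Hc, yc):
    """Real-valued decomposition of Eq. (2.7): a complex Mr x Mt system
    becomes an equivalent real 2Mr x 2Mt system."""
    H = np.block([[Hc.real, -Hc.imag],
                  [Hc.imag,  Hc.real]])
    y = np.concatenate([yc.real, yc.imag])
    return H, y

def ml_detect(H, y, omega=(-3.0, -1.0, 1.0, 3.0)):
    """Exhaustive ML rule of Eq. (2.8): search all |Omega|^n candidate
    vectors (feasible only for very small systems)."""
    best, best_dist = None, np.inf
    for cand in product(omega, repeat=H.shape[1]):
        s = np.array(cand)
        dist = np.sum((y - H @ s) ** 2)
        if dist < best_dist:
            best, best_dist = s, dist
    return best
```

The exhaustive search enumerates Mc^(2Mt) candidates, which is exactly the cost the sphere detectors of Section 2.6 are designed to avoid.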
2.5 Overview of Linear MIMO Detection Methods
In a MIMO system, linear detection algorithms are widely used methods at the re-
ceiver. In a linear detection algorithm, the computational complexity grows linearly
in the number of antennas. First, the received signal vector undergoes a linear trans-
formation by being pre-multiplied by a conditioning matrix (e.g., a matrix computed
using the Zero-Forcing or MMSE criteria); then the resulting signal is quantized to
the closest constellation points. In these algorithms, we obtain the complex form of
the signal vector y using the system model from Eq. (2.5). That is:
s1
.
y h1 h2 hMt .. n, (2.9)
sMt
1. Nulling step:

z = [z1; z2; …; zMt] = G_ZF y = [g1; g2; …; gMt] y, (2.11)

where gi denotes the i-th row of G_ZF. The constructed matrix G_ZF in Eq. (2.11)
meets the following constraints: g1 ⊥ hk (where k = 2, 3, …, Mt), so that
g1 hk = 0, and also g1 h1 = 1. We thus have
z1 = g1 y = g1 h1 s1 + g1 h2 s2 + ⋯ + g1 hMt sMt + g1 n = s1 + g1 n.
The other gk vectors are computed similarly.
In the Zero Forcing algorithm, the conditioning matrix G_ZF (which is also
called the ZF equalizer) is calculated as follows:

G_ZF = (H^H H)^{−1} H^H = H†, (2.12)

where (·)^H is the conjugate transpose of a matrix. G_ZF = H† is also known as the
Moore-Penrose pseudo-inverse [16], [17].
This definition of GZF ensures that the effects of the measured impairments are
forced to zero (i.e., nulled) to totally remove the ISI, ignoring the possibility that
some of the impairment is caused by noise. Thus, a noise-free environment is the
ideal case for using the ZF algorithm. However, in a normal noisy channel, the
ZF algorithm's performance is limited because the effects of noise tend to be
amplified when the ZF equalizer is applied to the received signal vector y.
In the minimum mean square error (MMSE) algorithm, the basic strategy is similar
to that of ZF. The difference is that a new conditioning matrix G_MMSE is calculated
to minimize the signal distortion caused by both the channel H and the expected noise.
The conditioning matrix is given by:

G_MMSE = (H^H H + (1/ρ) I_n)^{−1} H^H, (2.13)

where ρ denotes the SNR and I_n denotes the n × n identity matrix. Note how
the SNR is considered in the calculation of the conditioning matrix G_MMSE. The
MMSE algorithm gives better performance than the ZF algorithm in the presence of
additive Gaussian noise because it accounts for the average effects of the Gaussian
noise while also minimizing the effects of ISI.
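Eqs. (2.12) and (2.13) translate directly into code. The sketch below is an illustrative NumPy fragment for a real-valued channel (the thesis's simulations are in MATLAB; the function names are hypothetical):

```python
import numpy as np

def zf_equalizer(H):
    """G_ZF of Eq. (2.12): the Moore-Penrose pseudo-inverse of H."""
    return np.linalg.pinv(H)

def mmse_equalizer(H, rho):
    """G_MMSE of Eq. (2.13): regularizes the inversion with the SNR rho,
    trading residual interference against noise amplification."""
    n = H.shape[1]
    return np.linalg.solve(H.conj().T @ H + np.eye(n) / rho, H.conj().T)

def detect(G, y, omega=np.array([-3.0, -1.0, 1.0, 3.0])):
    """Equalize, then quantize each output to the closest constellation point."""
    z = G @ y
    return omega[np.argmin(np.abs(z[:, None] - omega[None, :]), axis=1)]
```

As ρ grows, the regularization term vanishes and the MMSE equalizer converges to the ZF equalizer.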
2.5.3 The Vertical BLAST (V-BLAST) Algorithm
The Bell Laboratories layered space-time (BLAST) detector was first proposed
in [9]. It is an efficient MIMO detection algorithm that gives better BER perfor-
mance than either ZF or MMSE at the cost of increased computational complexity.
In V-BLAST, signal symbols are detected vertically from the same signal vec-
tor y, that is, by detecting the symbol transmitted by each transmit antenna in turn
in order of decreasing estimated SNR. V-BLAST achieves better detection ac-
curacy by exploiting interference cancellation. The principle of this algorithm is
that the strongest (i.e., highest SNR) transmitted symbol is detected in the first step
using either the ZF or MMSE criteria. Then the interference from this symbol on
the Mr received MIMO signals is predicted and subtracted away to eliminate the
interference of the symbol from the Mr signals. The same steps are repeated to
detect the remaining transmitted symbols. In this way, we can cancel the interfer-
ence caused by previously detected symbols to offer more accurate detection for
the next detected symbol. But we also need to pay an increasing calculation cost
when the number of antennas grows. The V-BLAST algorithm is more expensive
computationally than ZF and MMSE, but the cost still grows linearly in the number
of antennas. The algorithm is as follows:
1. Initialization: i ← 1. Compute the first conditioning matrix H† from H.

2. Find ki, the index of the column of the conditioning matrix with the minimum
norm. Select this column, g_ki, which corresponds to the ki-th
element in Eq. (2.5). Null the interference on symbol ki from the other Mt − i
undetected symbols. Then slice g_ki yi to detect s_ki.
transmit antenna can now be removed: we predict the interference caused
by the detected symbol, and then subtract this interference from all of the
MIMO signals.

5. i ← i + 1; go back to step 2 until all the symbols are detected (i.e., until
i > Mt).
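The steps above can be sketched as ZF-based ordered successive interference cancellation (a hedged NumPy illustration, not the thesis's MATLAB code; `vblast_detect` is a hypothetical name):

```python
import numpy as np

def vblast_detect(H, y, omega=np.array([-3.0, -1.0, 1.0, 3.0])):
    """Detect the strongest remaining symbol (the row of the pseudo-inverse
    with minimum norm), subtract its predicted interference, and repeat."""
    H = H.astype(float)
    y = y.astype(float).copy()
    n = H.shape[1]
    s_hat = np.zeros(n)
    active = list(range(n))                    # still-undetected columns
    while active:
        G = np.linalg.pinv(H[:, active])
        k = int(np.argmin(np.sum(G ** 2, axis=1)))   # min-norm row of G
        z = G[k] @ y
        s = omega[np.argmin(np.abs(z - omega))]      # slice to constellation
        col = active[k]
        s_hat[col] = s
        y -= H[:, col] * s                           # cancel its interference
        active.pop(k)
    return s_hat
```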
Figure 2.3: Performance of three linear MIMO detection algorithms (ZF, MMSE
and V-BLAST) compared to the optimal detection method (ML detection) for
Mt = Mr = 4 and 16-QAM modulation. Each data point represents at least 100
detection errors.
Fig. 2.3 shows the large performance gap between the optimal and suboptimal
detection algorithms according to the SER vs. SNR characteristic. The three linear
symbol detection algorithms have relatively low computational complexity but their
SER performance is relatively poor at high SNRs. The SER performance of V-
BLAST is limited by error propagation effects of symbol detection errors for the
first detected symbols [18].
where n = 2Mt represents the dimension of H, and r_ij are the elements of R. Then,
the partial Euclidean distance (PED) after detecting symbol values s_n, s_{n−1}, . . . , s_k
in symbol vector positions n, n − 1, . . . , k, where n ≥ k ≥ 1, can be written as
follows:

T_k = Σ_{i=k}^{n} ( z_i − Σ_{j=i}^{n} r_ij s_j )² ≤ d². (2.16)
If T_k > d² for a partially detected symbol vector, all fully detected symbol vectors based on the given
partially detected symbol vector will be pruned away and discarded. In this way,
the complexity of the sphere detection algorithm will be reduced compared with the
exhaustive ML detection.
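Eq. (2.16) can be evaluated in a few lines. The following NumPy sketch is illustrative only (0-based indices, with `s_tail` holding the already-detected bottom entries of s; `partial_ed` is a hypothetical name):

```python
import numpy as np

def partial_ed(R, z, s_tail):
    """T_k of Eq. (2.16): the PED of a partial symbol vector that fixes
    the last len(s_tail) entries (levels n down to k) of s.
    R must be upper triangular, so rows k..n only touch s_tail."""
    n = len(z)
    k = n - len(s_tail)
    r = z[k:] - R[k:, k:] @ np.asarray(s_tail, dtype=float)
    return float(r @ r)
```

Because every added level contributes a non-negative term, T_k never decreases as the search descends, which is exactly what justifies pruning once T_k exceeds d².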
Figure 2.4: The sphere-detection search tree, from the root node at the top, through one layer per detected symbol (e.g., layer 2 selects s2), down to the leaf nodes.
it is common to convert the sphere detection algorithm into a tree search prob-
lem. Fig. 2.4 shows the model of the search tree we applied in this thesis. The
root node at the top of the tree corresponds to the start of the search for the best
symbol vector s. The leaves of the tree at the bottom correspond to the set of fully
detected candidate symbol vectors. The tree has n + 1 (n = 2Mt) layers including
the root node, and each traversed node has Mc sub-nodes under it, so that the total
number of nodes in this tree will be Mc (Mc⁰ + Mc¹ + ⋯ + Mc^{n−1}). The search
starts at the root node before the first symbol has been detected. As the search pro-
gresses from the root node, symbol selections are made going from the n-th layer
to the (n − 1)-th layer, etc., on down to the 1st layer. Then the least-cost path from
the root node down to a leaf node is the detected received signal vector. Following
conventional tree-search terminology (see a general graph theory reference such
as [20]), we distinguish between two basic kinds of sphere detection algorithms:
algorithm and Schnorr-Euchner (SE) enumeration. The relative complexity
of these two methods varies with the system's SNR.
The details of the FP sphere detection (FP-SD) algorithm are given in [21]. One of
the key ideas is that the initial radius d is defined as
d² = α n σ², (2.17)

where σ² is the variance of the noise vector n. The scale factor α is chosen so that
the probability 1 − ε that a sphere of radius d will enclose the correct signal vector
is given by:

∫₀^{αn/2} ( λ^{n/2 − 1} e^{−λ} / Γ(n/2) ) dλ = 1 − ε. (2.18)
3. Determine the number of nodes within the bound. If there is a node within
the bound, go to step 5; else go to step 4.

6. The last level has been reached. Calculate the PED of this detected symbol
vector s as ped = ‖y − Hs‖². Compare this ped with the previous lowest PED.
If ped < PED, save s and assign PED ← ped, then go to step 3.

7. If the returned symbol vector s is empty, reduce ε to get a larger radius d.
Restart the algorithm from step 1.
Both the forward (going down layers) and backward (going up layers) tree
search are applied in a depth-first search order, so that the performance of the FP-
SD algorithm approaches that of ML detection. However, the cost in computational
complexity is extremely high, especially when the size Mc of the QAM constella-
tion and the number Mt of transmit antennas increase.
In the SE sphere detection (SE-SD) algorithm [22], [23], we begin the search from
the Babai point (BP) s_i, which is the Zero-Forcing solution at the n-th level. We
then use Eq. (2.20) to define a zigzag search path to determine the next node. As
the search proceeds, we keep the candidate s which has the smallest PED encountered so far:

s_i, s_i + 1, s_i − 1, s_i + 2, s_i − 2, . . . (2.20)
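The zigzag order of Eq. (2.20) is simply "constellation points by increasing distance from the unconstrained estimate". A minimal Python sketch (illustrative only; `se_order` is a hypothetical helper):

```python
def se_order(z_i, omega=(-3, -1, 1, 3)):
    """Candidate order of Eq. (2.20): the Babai (closest) point first,
    then zigzag outward by increasing distance from z_i; ties are broken
    deterministically by value."""
    return sorted(omega, key=lambda p: (abs(p - z_i), p))
```

For example, an estimate of 0.4 yields the enumeration order [1, -1, 3, -3].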
1. Initialization: i ← n, bestdist ← +∞, the initial PED dist_i ← 0, s_i ←
quantize(z_i) is the BP based on the constellation point set, and the error is
e ← z_i − r_ii s_i. Record the sign of e: step_i ← sign(e), to determine the next
direction of enumeration.

3. If i > 1, set i ← i − 1 and go to step 4; otherwise the lowest level has been
reached, so save s as the detected symbol vector and update bestdist ← newdist.
Go to step 5.

4. tempZ_i ← z_i − Σ_{j=i+1}^{n} r_ij s_j, new PED dist_i ← newdist, s_i ←
quantize(tempZ_i) based on the constellation point set, e ← tempZ_i − r_ii s_i.
Record the sign of the error e: step_i ← sign(e). Go to step 2.

6. If i = n, terminate the algorithm and return the detected symbol vector s; else
go to step 5.
As can be seen from Fig. 2.5, the SER performance of SE-SD can closely ap-
proach that of exhaustive ML detection. Note that, because the Zero-Forcing solu-
tion ensures that the start of the search will be closer to the optimal point compared
to FP-SD (in FP-SD, the search starts from the first point of the Mc constellation),
the complexity of SE-SD is much lower than FP-SD even though it still takes a long
time to find the optimal ML solution when the SNR is low.
As mentioned above, the K-Best sphere detection (K-Best SD) algorithm uses a
breadth-first search strategy. Starting from the n-th level, we keep the K nodes that
have the smallest PEDs at each level to obtain a matrix that comprises K detected
vectors s. We then pick the symbol vector s with the smallest PED as the output
result after the tree search is finished.
The basic K-Best algorithm is as follows:
Inputs: n (number of levels), K (retained nodes per level), R, z (obtained
from the QR decomposition of the channel and used to calculate the PEDs)
3. Extend the surviving partial symbol vectors and obtain Mc K contender paths.
Select the K partial symbol vectors with the smallest PEDs and update the
path history with them.
4. If i = 1, terminate the algorithm and return the symbol vector s that has the
smallest PED; otherwise, return to step 2.
If K is large enough, which means the surviving paths contain most if not all of the
closest symbol vectors, the performance of the K-Best SD algorithm approaches that
of exhaustive ML detection [24]. However, in the K-Best algorithm, the complexity
is proportional to the number Mc K of searched paths at each level (excepting the
n-th level, which has only Mc paths), so it will increase linearly with increasing K.
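The steps above can be sketched as follows (an illustrative NumPy fragment, not the thesis's MATLAB implementation; R is upper triangular and z = Qᵀy from the QR decomposition):

```python
import numpy as np

OMEGA = (-3.0, -1.0, 1.0, 3.0)   # per-dimension 16-QAM alphabet

def k_best_detect(R, z, K):
    """Breadth-first K-Best search: at each level keep the K partial
    symbol vectors with the smallest PEDs."""
    n = R.shape[0]
    paths = [((), 0.0)]                       # (tail symbols s_{i+1..n}, PED)
    for i in range(n - 1, -1, -1):            # level n down to level 1
        cand = []
        for tail, ped in paths:
            interf = R[i, i + 1:] @ np.array(tail) if tail else 0.0
            for s in OMEGA:                   # extend every surviving path
                inc = (z[i] - interf - R[i, i] * s) ** 2
                cand.append(((s,) + tail, ped + inc))
        cand.sort(key=lambda t: t[1])         # keep the K best contenders
        paths = cand[:K]
    return np.array(min(paths, key=lambda t: t[1])[0])
```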
For the sphere detection algorithms above, it is clear that the quality of the estimate of the channel H will influence both the search complexity and the performance. In other words, when the channel's SNR is high enough, it should be much easier for these algorithms to correctly detect the symbol vector. In addition, preprocessing the channel before detection might produce better performance.
As with ZF and MMSE, it is common to condition the signal vector by pre-multiplying it by the Moore-Penrose pseudo-inverse (denoted (·)†), which is computed from the channel matrix H as in Eq. (2.12) for a real-valued system.
To achieve the smallest detection error on the i-th layer, the row g_i of G should have the minimum Euclidean norm, so as to minimize the interference and noise from the other, as yet undetected, symbols. Accordingly, we should sort the rows of the channel matrix H to obtain better performance. The preprocessing algorithm is as follows:
Inputs: n, H, y
After the preprocessing, the resulting new Q, R, and H can be used in the sphere detection algorithm, and the algorithm's complexity is partially reduced, as shown in Fig. 2.7. One thing to note here is that the symbols in the detected vector s must be reordered according to the ordering subscript p.
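The norm-based ordering described above can be sketched in a few lines (a NumPy illustration; the function name and the ascending ordering direction are assumptions for illustration, not the thesis's exact preprocessing listing):

```python
import numpy as np

def sort_channel(H):
    """Order the symbol layers of H by the row norms of its pseudo-inverse,
    then compute the QR decomposition of the reordered channel."""
    G = np.linalg.pinv(H)                      # Moore-Penrose pseudo-inverse
    p = np.argsort(np.linalg.norm(G, axis=1))  # layers with least noise amplification first
    H_sorted = H[:, p]                         # permute the symbol layers
    Q, R = np.linalg.qr(H_sorted)              # ready for the sphere search
    return p, Q, R

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 4))
p, Q, R = sort_channel(H)
# the detected vector must afterwards be un-permuted using the subscript p
```

The returned permutation p plays the role of the ordering subscript mentioned above: after detection, the symbol estimates are mapped back to their original layer positions.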
The simulation environment is the same as the one that was assumed for the linear detection algorithms. The plots in Fig. 2.5 show that the sphere detection algorithms achieve much higher detection accuracy than the suboptimal algorithms illustrated in Fig. 2.3, but at a much greater computational cost.
[Figure: SER vs. SNR in dB for ML, FP, SE and 6-Best detection, each with and without preprocessing]
Figure 2.5: Performance of three detection algorithms (FP, SE, and K-Best with K = 6), with and without preprocessing, for an Mt = Mr = 4, 16-QAM MIMO system. Each data point represents at least 100 detection errors.
Fig. 2.5 shows the SER vs. SNR performance of the different sphere detection algorithms with preprocessing. It shows that the FP and SE algorithms approach optimal ML detection performance, while the K-Best (K = 6) detector approaches optimal performance only after applying the preprocessing method.
[Figure: SER vs. SNR in dB performance curves, including ML]
[Figure: average number of traversed nodes vs. SNR in dB for FP and SE, with and without preprocessing]
Figure 2.7: Complexity of three detection algorithms (FP, SE, and K-Best with K = 6) for an Mt = Mr = 4, 16-QAM MIMO system.
detection algorithms. We can see in Fig. 2.7 that, in a poor low-SNR environment, the complexity of the FP sphere detection algorithm is much higher than that of the other two algorithms. However, when the SNR is higher, it becomes easier to detect the correct symbol vector in both the FP and SE algorithms. For the K-Best algorithm, the complexity stays fixed, as expected. However, the complexity of the K-Best algorithm increases several-fold when K becomes larger because, at each level, the amount of computation per selected node depends on the size Mc of the constellation.
Chapter 3
3.1 Parallelism
Traditionally, programs are written as serial sequences of data manipulations and calculations. The execution time of a calculation is then directly proportional to the required number of CPU operations. For cases where we need to deal with
a large amount of data, the data storage capacity is also a limitation if only one
processor is considered. To solve these problems, parallel processing on parallel
hardware is one strategy that can be applied to speed up the processing. In parallel
processing, the problem is divided into several sub-programs that are executed at the
same time on different processors so that the total processing time is reduced. The
shrinking size of semiconductor transistors and wires is allowing more and more
processing cores to be provided on each chip, so parallel hardware is now widely
available and relatively inexpensive.
There are two major ways of implementing hardware parallelism, pipelining and
multiprocessing.
Pipelining
In a classical Von Neumann computer architecture, binary data and program in-
structions are stored in a shared memory [3]. A single central processing unit (CPU)
fetches instructions and executes them one by one.
Various strategies have been employed to speed up calculations, including the use of a cache memory hierarchy to reduce the average time for memory accesses.
Multiprocessing
While the pipelined architecture is applied within one processor, the multiprocessing approach to parallelism uses multiple processors.
There are four major kinds of parallelism according to Flynn's taxonomy [25] [26] [27]. This taxonomy represents theoretical extremes of computer architecture. Real computer architectures incorporate different kinds of parallelism at different levels of their architecture.
SISD: The instructions are fetched by the one CPU from the common memory and executed one-by-one at a rate determined by the system clock. The execution time per instruction is determined by the clock period and the average number of clock cycles per instruction.
SIMD: Parallel structure is present in the data memory and in the data processing hardware. In other words, there are multiple parallel data paths.
MISD: Parallel structure is present in the CPU but not in the data memory or the data processing hardware. The usage of MISD is not as widespread as SIMD. One of the few examples [27] is the experimental Carnegie-Mellon C.mmp computer (1971) [28] [29].
MIMD: The most flexible form of parallel structure, with parallelism in both the CPU and the data memory.
The SISD architecture is the traditional computer model assumed when algorithms are developed. However, there are disadvantages to the SISD architecture. Many problems have inherent parallelism that could be exploited for faster execution on parallel hardware. The simplest form of parallelism (SIMD) involves performing the same instruction on different data on different processors. The most complex form of parallelism (MIMD) executes different instructions on different data. In this thesis, we investigate how each of these two forms of parallelism can reduce the execution time of MIMO detection algorithms in the communications area.
Ideally, the acceleration from parallelism should increase linearly with the number of parallel processors that are applied. However, in most problems, not all of the operations in the algorithm can be executed in parallel, so the achievable speed-up of a parallel program on multiple processors is limited by the inherently serial part of the algorithm. Amdahl's law [30] gives the potential speed-up of a program with serial and parallel parts:

S(n) = 1 / ((1 − P) + P/n),   (3.1)
where P is the fraction of the program that can be parallelized and n is the number of processors. As n tends to infinity, the maximum possible speed-up is limited by the non-parallelized portion of the program, no matter how large the degree n of parallelism.
Amdahl's law is illustrated in Fig. 3.1.

[Figure 3.1: speed-up factor vs. the number n of processors (1 to 65536) for parallel proportions of 50%, 75%, 90% and 95%]
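Eq. (3.1) is straightforward to evaluate numerically; a small Python sketch reproducing the saturating behaviour plotted in Fig. 3.1 (the function name is illustrative):

```python
def amdahl_speedup(P, n):
    """Amdahl's law, Eq. (3.1): S(n) = 1 / ((1 - P) + P / n),
    where P is the parallel fraction and n the number of processors."""
    return 1.0 / ((1.0 - P) + P / n)

# with P = 0.95 the speed-up saturates near 1 / (1 - P) = 20 as n grows
speedups = [amdahl_speedup(0.95, n) for n in (1, 16, 1024, 65536)]
```

For P = 95%, no number of processors can push the speed-up past a factor of 20, which is exactly the ceiling visible in Fig. 3.1.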
hardware parallelism that is used to speed up graphical data processing. GPUs
are widely used in consumer PCs, supercomputers, game consoles and even cell
phones.
The concept of the GPU was proposed in 1999 [31] and applied initially to the
personal computer. The company NVIDIA released the world's first GPU, the GeForce 256, with the ability to process a minimum of 10 million polygons per second [31]. From then on, GPU technology evolved rapidly. General-purpose
computing on graphics processing units (GPGPU) is a new trend that attempts to
use the parallel computing power of the GPU for a wider range of programming
problems, beyond graphics processing.
CUDA, short for Compute Unified Device Architecture, is a software development environment provided by NVIDIA to support GPGPU computing on their GPUs. This environment includes new features to support general-purpose computation. Parallel code is executed by different CUDA threads running on multiple parallel CUDA cores. All the threads in one multiprocessor are independent of each other but execute the same instructions, following the SIMD model. The SIMD model imposes a strict form on the parallel computation.
[Figure: CUDA execution and memory hierarchy: a grid of thread blocks (stream multiprocessors), each with shared memory and many threads; global memory, constant cache and texture cache; and the host CPU with its memory]
Warp: A set of threads that run in parallel at the same time. A warp consists of up to 32 threads. The concept of the warp was introduced in CUDA by NVIDIA.
Thread Block: A group of threads organized into a block; a block can also be viewed as being made up of warps. These threads share a memory space and cooperate with each other via barrier synchronization.
Grid: An array of thread blocks that execute the same parallel program and that access the global memory. These blocks are executed one by one, so no synchronization exists among blocks.
CUDA Stream: A host-initiated sequence that contains a list of grids executing in order.
In order to make effective use of GPUs, several programming models have been
proposed:
CUDA C
OpenCL
manage the programming on the device parts, to allocate memory resources and to
control the run-time environment. The device parts, which could include GPUs and
DSPs, are responsible for parallel functions offloaded from the CPUs.
In OpenCL, the task on the device side is divided into work groups, which correspond to CUDA thread blocks. All these work groups are organized in an NDRange (the next organization level). A work group organizes all the work items, which correspond to the CUDA threads, within it. On the device side, the instructions follow the SIMT (Single Instruction, Multiple Thread) model, which means that the same instruction is executed on different threads at the same time.
Jacket
Jacket, which was marketed from 2007 to 2012 by AccelerEyes (Atlanta, GA), is another parallel GPGPU computing platform. Jacket is designed to accelerate MATLAB-based code running on GPU-equipped PCs whose GPUs support CUDA technology. Jacket provides parallel extensions of data types and functions for MATLAB. Most of the Jacket commands look the same as the original MATLAB code, but with several limitations governing their usage. MATLAB is a widely used technical programming language and environment in many fields in both academia and industry, such as signal processing, data analysis, mathematical computation, image processing, and application development. Jacket extends MATLAB to make the GPU data structures and operations more visible and easier to understand, and to make sure that GPU applications work properly in the MATLAB environment. In this thesis, the algorithms were originally written in MATLAB, so our initial GPU acceleration strategy was to exploit Jacket.
In Jacket programming, data can be either moved (i.e., cast) from the CPU memory to the GPU memory or created in the GPU's own memory, depending on the functions that are used. According to the Jacket documentation [32], it costs significant time to transfer data between the GPU and CPU, and that bottleneck reduces the benefits of GPU acceleration. Thus, as much as possible, it is better to create the data on the GPU directly and then cast the final result back to the CPU in a final step.
Code Listing 3.1: Simple Example of Generating and Casting Random Numbers on/to the GPU Using the Jacket Library in MATLAB
% Casting a matrix on the GPU
a = randn(N); % N is the size of matrix
b = gdouble(a); % matrix b is a parallel data structure on the GPU
gzeros, gones, geye, grand: These functions create a matrix of zeros, a matrix of ones, an identity matrix, or a random matrix directly on the GPU.
Code Listing 3.1 shows a simple example that generates parallel data on the
GPU. All of these GPU data structures are manipulated by parallel operations on
the GPU. The last input argument is usually used to specify the number of parallel
GPU cores to be used.
Many parallel extensions of basic operations are supported on the GPU [32], such as matrix and array arithmetic operations, relational operations, logical operations, diagonal matrices and diagonals of a matrix (diag), LU matrix factorization (lu), orthogonal-triangular decomposition (qr), and sorting of array elements in ascending or descending order (sort).
Parallelism can also be exploited in a loop-like control structure. Instead of launching each of the loop iterations sequentially, as in the original MATLAB for-loop, Jacket uses the gfor-loop to vectorize the loop so that the original loop iterations are performed simultaneously on parallel GPU cores. The iterator of the gfor-loop controls the degree of parallelism.
It is often possible to avoid parallelism that is explicitly specified using a gfor-loop, and to instead rely on the implicit use of parallel operations on parallel variables. Such implicit vectorization usually provides better performance than explicit parallelism using the gfor construct. For example, use a = b + c instead of looping a(ii) = b(ii) + c(ii) in a gfor-loop with ii = 1:Parallelism.
Many built-in functions are supported for parallel operation within a gfor-loop, such as fft, sum, max, min, etc. However, these functions have restrictions that we must consider [32]. Here are some of the key constraints.
All iterations within one gfor-loop must be independent of each other. Data
dependencies are not allowed among different iterations of the gfor-loop.
Conditional statements are not allowed inside a gfor-loop. A branch such as

if condition
    var = expr1;
else
    var = expr2;
end

must be rewritten in arithmetic form as

var = condition*expr1 + (1-condition)*expr2;

where condition is a logical value of either true (1) or false (0). Because of this limitation, depth-first search algorithms are less practical and efficient; breadth-first search algorithms are often more attractive.
Nesting one gfor-loop inside another gfor-loop is not allowed. However, one
gfor-loop can be nested among one or more nested regular for-loops.
Sufficient GPU memory is required to support all iterations at the same time; otherwise, out-of-memory errors will occur.
Subscripted data cannot be cast back directly to the CPU. On the GPU, the parallel paths run simultaneously with the same subscripts; when these variables are pulled back to the CPU, an extra dimension must be added to the destination matrix to avoid subscript conflicts. For example, if we need to pull a 4 × 4 parallel matrix product with 1024 parallel paths back to the CPU, a 4 × 4 × 1024 matrix should be prepared after the end of the gfor-loop.
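The arithmetic rewriting of conditionals noted in the constraints above carries over to any vectorized setting; a NumPy sketch (illustrative, not Jacket syntax) of the same condition*expr1 + (1-condition)*expr2 pattern applied across parallel paths:

```python
import numpy as np

x = np.array([-2.0, 0.5, 3.0, -1.0])      # one value per parallel path
condition = (x > 0).astype(float)          # logical 1 / 0 per path

expr1 = 10.0 * x                           # branch for condition == 1
expr2 = -x                                 # branch for condition == 0

# branch-free select: no data-dependent control flow across the paths
var = condition * expr1 + (1.0 - condition) * expr2
```

Every path evaluates both expressions, but the arithmetic mask keeps only the wanted result, so the same instruction stream serves all paths, as the SIMD model requires.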
The PC platform that we used to run experiments has an NVIDIA GeForce GTX 590 GPU with 1024 CUDA cores, organized as 32 streaming multiprocessors of 32 cores each. The memory clock runs at 607 MHz. The standard memory configuration is 3073 MB and the memory bandwidth is 327.7 GB/sec.
We used a PC with an Intel(R) Core(TM) i7-2600K CPU running at a clock frequency of 3.40 GHz, with 16.0 GB of RAM. This CPU contains four independent cores that can each execute two parallel threads. In addition, these CPUs provide a number of SIMD instructions for relatively simple vector arithmetic.
L which contains the most likely candidates with the smallest Euclidean distances
(EDs). The parallel architecture of LSD can be divided into several parts: the first
step is to compute the PED. This is done by a number of TSUs (Tree Search Units)
in parallel. At the end of this operation, the results are written into cache memory.
Then, in the second step, a dispatcher unit finds the smallest PED, which is compared with the current radius when a leaf node is reached. If the new PED is smaller, the list is updated and a new sub-tree is assigned to the TSU. In [33],
although the hardware was designed for a custom VLSI implementation, it still
provides an example of parallel programming.
algorithms (PSA), which are proposed in that paper to deal with this problem, provide sorting strategies for the K-Best algorithm. The key structure in the PSA is an array of interconnected Compare-and-Exchange cells. The inputs of this array are the corresponding branch-metric costs for each path at each layer, and the outputs are the sorted values that can be used in K-Best to determine the first K best symbols. The PSA exploits customized hardware design (e.g., an FPGA) to accomplish the sorting in parallel.
parallel programming.
In this way, the 2k-th layer and the (2k−1)-th layer, for k = 1, 2, …, Mt, which represent the real and imaginary components of a detected symbol, respectively, are independent of each other, so the search can proceed simultaneously over these two nodes, which are defined as a node pair in that paper. The Parallel Sphere Decoder (PSD) algorithm moves from node pair to node pair to reduce the computing time.
Depth-first search algorithms can take an irregular path when trying to find the optimal path from the root node to the leaves, so they are often considered hard to synchronize in a parallel implementation. In that paper, the author provides a good idea (node pairs that can be traversed in parallel) that brings parallelism into the depth-first search.
Software implementations mostly use different programming languages on existing parallel-enabled hardware. The parallel-enabled hardware can be workstations with multicore CPUs or GPUs; the researchers do not need to design the hardware, but they have to be familiar with the development programming languages, such as C/C++ and CUDA C for GPUs. They came up with several different ideas for building the data structures to fit the parallelism models of the different detection algorithms. But because the hardware environment is fixed, there are also various limitations during the software programming.
Hardware implementation requires more knowledge about hardware design but, as a benefit, the structure of the detection algorithms can be mapped more flexibly onto the hardware. The commonly used hardware environments are FPGAs and custom VLSI. Researchers can map each data structure, or even a complete detection algorithm package, to a unit on the chip, allocate different memories to different working spaces, and trace and control the parallelism step by step.
Chapter 4
In Chapter 3, we reviewed the key aspects of GPU technology and parallel pro-
gramming. General-purpose GPUs have already been applied in several different
areas [39]. Our research aims to speed up the standard MIMO detection algorithms
by exploiting the hardware parallelism of the GPU and the parallel computing en-
vironment provided by the Jacket extension of MATLAB. To ensure the efficiency of the parallel approach, most of the data should be generated and processed in parallel on the GPU to avoid time-consuming transfers of data between the CPU and GPU. This means that we need to rewrite the conventional MATLAB MIMO detector models using the Jacket library functions to ensure parallel operation of the GPU-based detection programs up to the parallelism limits of the underlying hardware.
In a MIMO system, we aim to process more data streams in a shorter time to
gain higher efficiency. If these data streams can be efficiently mapped in a directly
scalable way onto a parallel structure and processed at the same time, then acceler-
ation can be achieved by increasing the number of parallel paths.
Code Listing 4.1: Source code for the Matrix Multiplication with a conventional
MATLAB for-loop and the Jacket gfor-loop
C1 = gzeros(N,N,Parallelism); % N is the size of matrix
C2 = gzeros(N,N,Parallelism); % Parallelism is the degree of parallelism
for outloop = 1:100
A = grand(N,N,Parallelism);
B = grand(N,N,Parallelism);
Bt = permute(B,[2 1 3]); % Transpose each parallel slice for the dot product in the for-loop
% for-loop applied
for ii = 1:N
for jj = 1:N
C1(ii,jj,:) = dot(conj(A(ii,1:N,:)),Bt(jj,1:N,:));
end
end
% gfor-loop applied
gfor pp = 1:Parallelism
C2(:,:,pp) = A(:,:,pp)*B(:,:,pp);
gend
end
(which will be discussed in detail in Section 4.4.1) is always required, and the matrix multiplication consumes most of the calculation time; it is the bottleneck in accelerating this algorithm. So we decided to conduct experiments to determine the best way to implement this critical operation in parallel. In parallel matrix processing, our data structures are often three-dimensional, where the first two dimensions correspond to the numbers of rows and columns and the third dimension corresponds to the degree of parallelism.
Two alternative methods are compared in this experiment. The MATLAB source code used in the experiment is shown in Code Listing 4.1. The first method uses two nested for-loops to compute the dot product of each row and column vector of the two input matrices. The second method uses a single gfor-loop from the Jacket library as the inner loop. The usage of the gfor-loop is almost the same as that of the for-loop in MATLAB; the only difference is that the iterator in a gfor specifies the degree of parallelism across GPU cores. The operations in a gfor-loop can be viewed as executing in parallel on different streams of data in SIMD fashion.
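The three-dimensional data layout just described maps directly onto batched matrix multiplication; a NumPy sketch with illustrative sizes (note that NumPy batches over the leading dimension, whereas the Jacket listings place the degree of parallelism in the trailing dimension):

```python
import numpy as np

N, parallelism = 4, 1024                  # matrix size and degree of parallelism
rng = np.random.default_rng(0)

# a stack of `parallelism` independent N x N matrices
A = rng.standard_normal((parallelism, N, N))
B = rng.standard_normal((parallelism, N, N))

# one batched call multiplies every parallel slice, like the gfor-loop body
C = A @ B                                 # shape (parallelism, N, N)
```

Each slice C[p] equals A[p] @ B[p], which is exactly what the gfor-loop in Code Listing 4.1 computes for each value of its iterator.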
The experimental results are shown in Table 4.1, where N × N is the size of the
Table 4.1: Matrix multiplication times (in seconds) for different looping (for and gfor) structures
real-valued matrices. The table gives the average running times (in seconds) of real-valued matrix multiplication using the for-loop and gfor-loop constructs. For a reliable measurement, we repeated the test 100 times using the outer for-loop, so these running times are 100 times greater than for a single matrix multiplication. It can be seen that the running time is not greatly influenced by increases in the matrix size in the gfor-loop implementation, while the matrix size has a big impact in the for-loop implementation. On the other hand, when the degree of parallelism increases, the running time of the for-loop method stays almost steady, while for the gfor-loop method the running time increases at the same rate as the degree of parallelism. However, it is clear that even though the gfor-loop's running time increases, it is still faster than the for-loop until the degree of parallelism reaches 1024, which is the number of GPU cores. Moreover, for the gfor-loop method, the size of the matrix does not affect the running time of the multiplication, while the time quadruples for the for-loop.
4.1.2 Experiment 2 for the Serial and Parallel gfor Looping Struc-
tures
Having compared the different loop models, we also wanted to determine how much improvement we could achieve from parallelism compared with serial multiplications for different sizes of matrices. In the parallel version, we apply the gfor-loop structure for the multiplication, while in the serial version, a for-loop whose iterator runs serially over the degree of parallelism is used, so that the multiplications are executed in serial.
The source code for this second experiment is shown in Code Listing 4.2, where N stands for the size of a matrix and Parallelism is the degree of parallelism. In order to get an equivalent result, the number of iterations is set to Parallelism × 100 in the serial version.
Code Listing 4.2: Source Code for the Matrix Multiplication Experiment with Serial and Parallel Versions
% Serial version on the CPU
for ii = 1:Parallelism*100
A = rand(N,N);
B = rand(N,N);
C(:,:,ii) = A*B;
end
Table 4.2 shows the results from this test. As in Table 4.1, the performance is
measured by the running time (including 100 outer loop repetitions) of each version.
The Speed-Up values are calculated as:
Speed-Up
Time for the serial version
(4.1)
Time for the parallel version
Results could not be obtained when the size of the matrix grows to 128 and the degree of parallelism equals 10240; Jacket is unable to allocate sufficient memory on the GPU to do the multiplications under these conditions. The serial multiplication time grows rapidly when the size of the matrix increases, but the running time for the parallel version remains relatively constant; it is mainly impacted by the increasing degree of parallelism.
Table 4.2: Matrix multiplication times (in seconds) for serial and parallel versions at different degrees of parallelism

                  Parallelism = 512            Parallelism = 1024           Parallelism = 10240
Size of Matrix N  Serial  Parallel  Speed-up   Serial  Parallel  Speed-up   Serial  Parallel  Speed-up
4                 0.18    0.16      1.13       0.38    0.34      1.12       3.65    3.35      1.09
The results from the previous Experiment 2 showed that when the matrix size in each parallel path grows bigger, we can achieve more acceleration from the parallelism. So we decided to try merging multiple small matrices into one large matrix to see how much acceleration could be obtained.
The strategy of this experiment is to merge small matrices onto the diagonal of a large matrix. Taking two groups of four small 4 × 4 matrices A, B, C, D and E, F, G, H as an example, the multiplication is shown in Eq. (4.2):

[ A 0 0 0 ]   [ E 0 0 0 ]   [ AE 0  0  0  ]
[ 0 B 0 0 ] * [ 0 F 0 0 ] = [ 0  BF 0  0  ]      (4.2)
[ 0 0 C 0 ]   [ 0 0 G 0 ]   [ 0  0  CG 0  ]
[ 0 0 0 D ]   [ 0 0 0 H ]   [ 0  0  0  DH ]

where AE, BF, CG and DH stand for the sub-matrix products A*E, B*F, C*G and D*H, respectively.
It can be seen from Eq. (4.2) that these four sub-matrix multiplications are executed at the same time within one large matrix multiplication to save running time. In this equation, we set the size of the sub-matrices to 4, but this could also be changed. In this experiment, we deal with square matrices of size N. However, the matrices do not have to be square, as long as the two small matrices on each side of each sub-multiplication can be
Code Listing 4.3: Source Code for the Merged Matrix Multiplication Experiment with the Parallel gfor-loop Structure
Parallelism = 1024; % Degree of parallelism
N = 4; % Matrix size
F = 1; % Number of component sub-matrices
interval = N-1; % Number of rows/columns between each small matrix
for loop = 1:100
    LeftMatrix = gzeros(N*F,N*F,Parallelism);
    RightMatrix = gzeros(N*F,N*F,Parallelism);
    ProdMatrix = gzeros(N*F,N*F,Parallelism);
    for f = 1:F
        LeftMatrix((f*N-interval):f*N,(f*N-interval):f*N,:) = grand(N,N,Parallelism);
        RightMatrix((f*N-interval):f*N,(f*N-interval):f*N,:) = grand(N,N,Parallelism);
    end
    gfor pp = 1:Parallelism
        ProdMatrix(:,:,pp) = LeftMatrix(:,:,pp) * RightMatrix(:,:,pp);
    gend
    for f = 1:F
        AE = ProdMatrix((f*N-interval):f*N,(f*N-interval):f*N,:); % result for each small matrix
    end
end
matched and placed on the diagonals of the two large matrices. This strategy is also applied in the parallel implementation of the MIMO detection algorithms later.
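The merging identity of Eq. (4.2) is easy to check numerically; a NumPy sketch with F = 4 sub-matrices of size N = 4 (illustrative sizes, not the thesis code):

```python
import numpy as np

N, F = 4, 4
rng = np.random.default_rng(1)
left = [rng.standard_normal((N, N)) for _ in range(F)]    # A, B, C, D
right = [rng.standard_normal((N, N)) for _ in range(F)]   # E, F, G, H

# place the small matrices on the diagonals of two NF x NF matrices
L = np.zeros((N * F, N * F))
Rm = np.zeros((N * F, N * F))
for f in range(F):
    L[f*N:(f+1)*N, f*N:(f+1)*N] = left[f]
    Rm[f*N:(f+1)*N, f*N:(f+1)*N] = right[f]

P = L @ Rm    # one large multiplication computes all F sub-products at once
```

Each diagonal block of P equals the corresponding small product, and the off-diagonal blocks are zero, which is exactly Eq. (4.2).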
The source code for this experiment is shown in Code Listing 4.3.
In Code Listing 4.3, Parallelism is the degree of parallelism, which is set to 1024. N is the size of the small square matrices. F is the number of small matrices that are combined into one matrix. F can also be seen as a speed-up factor for the N × N matrices within the NF × NF matrix. As in Experiments 1 and 2, we also set an outer loop to repeat all the operations.
The results of this experiment are shown in Table 4.3.
In Table 4.3, the data in the gfor columns show the running times of the gfor-loop (including 100 outer-loop repetitions), which contains only the matrix multiplication. The data in the Speed-up columns compare the time for different values of F with that for F = 1 for each N. The data for F = 73 with N = 4, F = 36 with N = 8, F = 18 with N = 16, F = 4 with N = 64, F = 2 with N = 128 and
Table 4.3: Matrix multiplication times (in seconds) for the merged matrix with the parallel gfor-loop structure

      N = 4          N = 8          N = 16         N = 64          N = 128         N = 256
F     gfor Speed-up  gfor Speed-up  gfor Speed-up  gfor  Speed-up  gfor  Speed-up  gfor  Speed-up
1     0.34 1.00      0.34 1.00      0.34 1.00      0.40  1.00      1.64  1.00      15.97 1.00
2     0.33 2.06      0.35 1.94      0.34 2.00      0.34  2.35      15.56 0.21
3     0.34 3.00      0.35 2.91      0.33 3.09      0.39  3.08
4     0.33 4.12      0.34 4.00      0.33 4.12      15.28 0.10
5     0.34 5.00      0.34 5.00      0.34 5.00
10    0.34 10.00     0.34 10.00     0.44 7.73
needs to do. NumParallel stands for the number of parallel GPU threads that can be executed simultaneously. Mt and Mr stand for the numbers of antennas at the transmitter and the receiver, respectively, following the same convention used in Chapter 2. In this way, we detect TotalLoop × NumParallel symbol vectors by the end of program execution and hopefully reduce the running time by efficiently exploiting the hardware parallelism.
Since we obtained fairly good results from Experiment 3 in Section 4.1.3, we can also form another parallelism model by applying the strategy of merged matrix multiplication to the detection algorithms, to see how much advantage can be gained.
It can be seen from Algorithm 2 that the structure of the model is almost the same as that of Algorithm 1; the difference is that the data matrix is enlarged by the factor F, which was introduced in Experiment 3 in Section 4.1.3, to enable F times as many matrices/vectors to be operated on at the same time. Line 3 shows the reduction of the outer loops if we have a fixed total number (TotalLoops × NumParallel) of symbol vectors when the factor F is applied.
The performance of the serial and parallel versions is compared in Table 4.5.
4.3 Channel Generation on the GPU
Since our simulation model as described in Chapter 2, generates the channel and
noise using the MATLABs built-in function randn, the first step of efficient par-
allelization is to generate all these signals on the GPU. It is important for efficiency
to avoid moving data between the CPU and GPU as well as between GPU cores.
As much as possible, data should be generated and processed in parallel within the
GPU cores. In the Jacket library, there are many useful functions that can achieve
this task. The Jacket function grandn is used to generate normally distributed
pseudo-random numbers on the GPU. Both the channel coefficients and the addi-
tive white Gaussian noise samples are generated using grandn. Symbol gener-
ation must be done differently because MATLABs built-in functions randi and
qammod are not supported with parallel versions on the GPU. These two func-
tions are used to generate integer values from the uniform distribution and produce
a random stream of QAM symbols. The random bit stream is encoded using a Gray
Code in the real and imaginary dimensions, following standard practice, to mini-
mize the number of bit errors produced by symbol detection errors during symbol
detection. For most symbol detection errors, only one bit error will be produced;
only rarely will two or more bit errors occur because of one symbol error.
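Because randi and qammod are unavailable on the GPU, the symbol generation has to be written out by hand. The Gray mapping can be sketched as follows (a Python illustration; the specific bit-to-level labelling below is one conventional Gray choice, assumed for illustration rather than taken from the thesis code):

```python
import numpy as np

# Gray-coded 4-PAM: neighbouring amplitude levels differ in exactly one bit
GRAY_4PAM = {(0, 0): -3.0, (0, 1): -1.0, (1, 1): 1.0, (1, 0): 3.0}

def random_4pam(n, rng):
    """Map n pairs of uniform random bits to Gray-coded 4-PAM levels."""
    bits = rng.integers(0, 2, size=(n, 2))        # plays the role of randi
    return np.array([GRAY_4PAM[(int(b0), int(b1))] for b0, b1 in bits])

rng = np.random.default_rng(2)
symbols = random_4pam(1000, rng)
# two such independent 4-PAM symbols form one complex 16-QAM symbol
```

Since adjacent amplitude levels differ in a single bit, a nearest-neighbour symbol detection error usually causes only one bit error, matching the remark above.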
The distributions of these three generated values are shown in Fig. 4.1. Note that
two independent 4-PAM symbols are required for each complex 16-QAM symbol.
As can be seen in Fig. 4.1, the distributions of both the noise samples and the
channel coefficients accurately follow the Gaussian distribution.
[Figure: histograms of the symbol values, noise values and channel coefficients]
Figure 4.1: Distribution of 4-PAM symbols, additive noise and MIMO channel coefficients
Jacket [32].
For the first three linear detection algorithms, the most complex calculation is to compute the Moore-Penrose pseudo-inverse of the channel (H†), which was described in Chapter 2 in Eqs. (2.12) and (2.13). In conventional serial MATLAB, the inverse of a matrix can be computed using the built-in function inv. However, we found the running time of this function to be quite large: it costs almost half of the time of a linear MIMO detection program. Note that in the V-BLAST algorithm, we have to apply interference cancellation after each layer's slicing and quantization steps, so the resulting channel matrix inverse calculations become the bottleneck of the detector simulation. We therefore decided to design an improved function to compute the matrix inverse.
Our new function NewInverse performs LU decomposition and then solves the resulting linear equations. Assume that there is an N × N matrix A and a linear equation AX = B. If B is set to be the N × N identity matrix, then X must be the inverse matrix of A. The matrix inverse computation proceeds as follows:
1. Decompose A into triangular factors, A = LU.
2. Solve LY = B for Y by forward substitution.
3. Solve UX = Y for X by backward substitution.
In steps 2 and 3, the equations can be easily solved by forward and backward substitution N times (once per column of B) without using Gaussian elimination, because of the triangular forms of the matrices L and U.
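The steps above can be sketched in Python/NumPy (the thesis function NewInverse is MATLAB code; this Doolittle-style reimplementation without pivoting is our own illustrative assumption and requires a well-conditioned A):

```python
import numpy as np

def new_inverse(a):
    """Invert A by LU decomposition plus N forward/backward substitutions."""
    n = a.shape[0]
    l = np.eye(n)
    u = a.astype(float).copy()
    # Doolittle LU decomposition without pivoting (assumes A is well conditioned)
    for k in range(n - 1):
        for i in range(k + 1, n):
            l[i, k] = u[i, k] / u[k, k]
            u[i, k:] -= l[i, k] * u[k, k:]
    # Solve A X = I column by column: L y = e_j (forward), U x = y (backward)
    x = np.empty((n, n))
    for j in range(n):
        e = np.eye(n)[:, j]
        y = np.empty(n)
        for i in range(n):                       # forward substitution
            y[i] = e[i] - l[i, :i] @ y[:i]
        for i in range(n - 1, -1, -1):           # backward substitution
            x[i, j] = (y[i] - u[i, i + 1:] @ x[i + 1:, j]) / u[i, i]
    return x
```

A production version would add partial pivoting; the small well-conditioned channel matrices considered here make the unpivoted sketch adequate for illustration.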
Table 4.4: Comparison of matrix inverse running times (in seconds) using the built-in function inv and the new function NewInverse
A brief comparison between this new inverse function (NewInverse) and the MATLAB built-in function inv is given in Table 4.4. A and B are two random square matrices of size N. Since the purpose of the matrix inverse in this thesis is to perform the matrix division for the channel (as in Eq. (2.12) and Eq. (2.13)), the matrix division is also included in this comparison. In MATLAB, instead of computing inv(A)*B, the operation A\B is more efficient and faster according to the MATLAB documentation. So the matrix division is compared between A\B and NewInverse(A)*B.
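The difference between the two MATLAB idioms can be mirrored in NumPy, where np.linalg.solve(A, B) plays the role of A\B and is generally preferred over forming the explicit inverse (a sketch assuming a well-conditioned A):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.random((4, 4)) + 4 * np.eye(4)   # well-conditioned test matrix
b = rng.random((4, 4))

# A\B in MATLAB: solve A X = B directly, without forming inv(A)
x_solve = np.linalg.solve(a, b)
# inv(A)*B in MATLAB: form the explicit inverse first (slower, less stable)
x_inv = np.linalg.inv(a) @ b
```

Both produce the same X up to floating-point error, but the direct solve avoids the extra work and rounding of explicitly inverting A.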
The results in Table 4.4 show that when the matrix size is small, our new inversion function is much faster than the MATLAB built-in function inv; when the size increases, the choice of method should depend on the situation. In this thesis, where a 4 × 4 real-valued matrix inverse must be computed, the speed-up provides a worthwhile improvement.
The major strategy in parallelization is to make sure that all the data structures are
initialized on the GPU before the detection process begins and are then updated in
parallel on the GPU. In this way, we can minimize time-consuming data transfers
between the CPU and GPU. As described in the previous section, the transmitted
signals, channel matrices and noise signals have already been loaded on the GPU,
and the performance of the channel inverse calculation has also been improved. We
can now directly implement parallel versions of the detection algorithms. Since these algorithms are built from slicing, quantization and interference cancellation (in the case of the V-BLAST algorithm), and all of these operations are fully supported on the GPU, it is relatively straightforward to convert them into fully parallel versions using Jacket functions.
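As an illustration of this batched style, a vectorized NumPy sketch of zero-forcing detection over many independent paths (our own simplified real-valued example, not the thesis's Jacket code) might look like:

```python
import numpy as np

def zf_detect_batch(h, y, cons):
    """Zero-forcing detection for a batch of P independent signal paths.

    h: (P, Mr, Mt) channel matrices, y: (P, Mr) received vectors,
    cons: 1-D array of real constellation points (e.g. 4-PAM).
    All P paths are handled by vectorized operations, mimicking a
    gfor loop that runs one detector per path.
    """
    h_pinv = np.linalg.pinv(h)              # (P, Mt, Mr) pseudo-inverses
    x = np.einsum('pij,pj->pi', h_pinv, y)  # soft ZF estimates, (P, Mt)
    # Slice: snap every soft estimate to the nearest constellation point
    idx = np.abs(x[..., None] - cons).argmin(axis=-1)
    return cons[idx]
```

Keeping the whole batch in one set of arrays mirrors the strategy of initializing all data on the GPU and updating it in parallel.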
The first part of our research is to test the performance of our algorithms on the GPU. The running times of the serial and parallel versions are compared in Table 4.5. The assumed system environment is as follows:
4 × 4 MIMO system
SNR = 20 dB
In Table 4.5, the data in the Serial and GPU Parallel gfor columns are the running times (in seconds) of each algorithm in the corresponding versions. The running times include all 1000 × 1024 symbol vectors. The speed-up is calculated as in Eq. (4.1).
Table 4.5: Running times (in seconds) comparison of MIMO detection algorithms with the serial and different parallel versions

Algorithm                  | Serial   | gfor (1000 × 1024) | Speed-up | gfor with factor F | Speed-up | gfor (10 × 102400) | Speed-up
MMSE Detection             | 287.134  | 15.296             | 18.772   | 6.572              | 43.691   | 8.044              | 35.695
V-BLAST Detection          | 626.020  | 191.333            | 3.272    | 133.290            | 4.697    | 64.530             | 9.701
K-Best Detection           | 703.008  | 620.127            | 1.134    | -                  | -        | 490.303            | 1.434
Parallel V-BLAST Detection | 3075.281 | 513.076            | 5.994    | 234.749            | 13.100   | 290.964            | 10.569
The notation 1000 × 1024 means that the program loops 1000 times while 1024 parallel signal paths are processed concurrently in each loop iteration (according to Algorithm 1 described above). Thus the notation 10 × 102400 corresponds to a program that loops 10 times while 102400 parallel signal detector paths are processed each time. Data Generation stands for the generation of all the transmitted symbols, channel matrices and noise samples. Most of the data are complex-valued; however, the K-Best algorithm deals with real-valued data.
It can be seen from Table 4.5 that the speed-up factor from the normal parallel gfor loop (1000 × 1024) is not as good as we expected. There must be some overhead during the processing; in particular, the data matrix of each parallel path is quite small, so we cannot gain much benefit from the parallelism. When the degree of parallelism increases to 102400, the improvement becomes better. The results for the K-Best algorithm are much the same: there is little speed-up going from the serial to the normal parallel version. This is because in the K-Best algorithm most of the operations are applied node by node at each level, and there are few matrix multiplications. The CPU already processes a single data element quickly enough that the GPU calculation gains little advantage.
The speed-up results in columns 5 and 6, obtained with a merging factor F, are different from those of the other parallel versions. In order to understand the limits to acceleration, we ran the MATLAB profiler and listed the top three most time-consuming parts of each algorithm in Table 4.6.
According to Table 4.6, for all four linear detection algorithms, matrix multiplication and matrix inversion are the most time-consuming parts of the processing. To solve the first problem, merged matrix multiplication can be applied to provide a significant improvement, which can be seen in Table 4.5, columns 5 and 6, with F = 18. The speed-up factors of the ZF and MMSE detection algorithms are about 35×, which roughly matches the percentage of running time consumed by matrix multiplication.
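The merging idea can be sketched as follows (an illustrative NumPy version of packing F small matrices along the diagonal of one large matrix; the names here are our own, not the thesis's Jacket code):

```python
import numpy as np

def merge_block_diag(mats):
    """Pack F small (n x n) matrices along the diagonal of one (Fn x Fn) matrix."""
    f, n, _ = mats.shape
    big = np.zeros((f * n, f * n))
    for i in range(f):
        big[i * n:(i + 1) * n, i * n:(i + 1) * n] = mats[i]
    return big

# One large multiplication then replaces F small ones: block i of the
# product of the merged matrices equals mats_a[i] @ mats_b[i].
```

Because the product of two block-diagonal matrices is block-diagonal, the single large multiplication yields exactly the F small products, while giving the GPU a matrix size it can exploit.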
For the V-BLAST detection algorithm, the bottleneck is the channel matrix inversion. Note that in this V-BLAST version we have already applied the modified matrix inverse of Section 4.1.1, and Table 4.4 shows that a larger matrix costs more time to invert,

Table 4.6: The most time consuming operations for the MIMO detection algorithms

so we cannot take much advantage of merging small matrices together even if the built-in function inv is applied. The speed-up factor is therefore limited by the number of matrix multiplications in the V-BLAST algorithm.
In the K-Best detection algorithm, we cannot take much advantage of adding the factor F to the program. Because at each step the operations work on one node, it rarely requires matrix multiplication. If we packed several 2Mr × 2Mt (real-valued) matrices into one large matrix, we would still have to evaluate the nodes (detect the symbols) level by level, and the overhead of the for f = 1:F loop would cost additional time. This is why there is no data for K-Best using F-factor parallelism in Table 4.5.
The speed-up of the parallel V-BLAST detection algorithm using the merged matrix strategy is better than for the normal parallel version (1000 × 1024) in Table 4.5. Since the weakest channel layer is fully enumerated over all the constellation points, the operations of the V-BLAST algorithm are repeated 16 times when 16-QAM is applied. By packing F matrices together, the total number of these 16-times-repeated matrix multiplications is reduced substantially. Note that, based on the matrix inverse results in Table 4.4, we also use the MATLAB built-in operator \ (mldivide) to perform the matrix division in this parallel V-BLAST version.
In the original V-BLAST algorithm, the first detected layer is chosen to be the layer with the minimum norm and hence the lowest expected post-detection SER. After symbol detection, we subtract the predicted contribution of that symbol from the signal vector (interference cancellation) to minimize the SER on the remaining symbols to be detected. Errors in the detection of the first layer increase the interference in the detection of the following layers. The parallel V-BLAST algorithm in [18] tries to avoid this effect by fully enumerating the weakest layer, thereby minimizing detection errors in the strongest layers. In this way, all 16 possible symbol values (16-QAM) of the weakest layer are considered in the first detected layer, and then the original V-BLAST detector is applied to the remaining layers. At the end of this algorithm, the 16 candidate detected symbol vectors are compared and only the one with the minimum Euclidean distance between the predicted noise-free signal
Hs and the received symbol vector y is picked as the detected symbol vector.
In Algorithm 3, the channel, symbol and received symbol vectors use the same complex-valued convention (Eq. (2.5)) used in Chapter 2. G equals G_MMSE as given in Eq. (2.13) in Chapter 2. ConsMat denotes the 16-QAM constellation set {−3 − 3i, −3 − 1i, ..., 3 + 3i} shown in Fig. 2.2 in Chapter 2.
Ideally, all 16 candidate symbol vectors should be processed in parallel to get the maximum speed-up over the serial V-BLAST algorithm. However, as we mentioned in Chapter 3, the Jacket library uses the gfor loop to specify one explicit dimension of parallelism and does not allow nested gfor loops. Since we have already set up 1024 parallel paths at the beginning of the program, it is impossible to open another 16 parallel paths within those 1024 paths. It would, however, be possible to use 64 × 16 parallel execution paths. Our parallel V-BLAST implementation processes the 16 symbols of the first (weakest) layer one by one to obtain the 16 candidate detected symbol vectors. The performance of this parallel V-BLAST can be seen in Fig. 4.2.
It can be seen from Fig. 4.2 that the parallel V-BLAST algorithm's performance is near-optimal. By enumerating all possible values of the weakest symbol we remove a significant source of interference noise on all other symbols. In particular, detection errors on the strongest symbol are reduced, which limits error propagation to the detection of the remaining symbols.
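A compact serial sketch of this fully enumerated strategy is given below (our own Python/NumPy simplification: the remaining layers are detected here with plain ZF slicing rather than the full ordered successive cancellation used in the thesis):

```python
import numpy as np

def parallel_vblast(h, y, cons):
    """Fully enumerated V-BLAST sketch: try every constellation point in the
    weakest layer, cancel it, detect the remaining layers (simplified to ZF
    slicing here), and keep the candidate with the smallest metric."""
    g = np.linalg.pinv(h)
    weakest = np.argmax(np.linalg.norm(g, axis=1))   # largest row norm of G
    best, best_metric = None, np.inf
    for c in cons:                                    # enumerate the weakest layer
        y_c = y - h[:, weakest] * c                   # interference cancellation
        h_rest = np.delete(h, weakest, axis=1)
        x = np.linalg.pinv(h_rest) @ y_c              # detect remaining layers
        s_rest = cons[np.abs(x[:, None] - cons).argmin(axis=1)]
        s = np.insert(s_rest, weakest, c)
        metric = np.linalg.norm(y - h @ s) ** 2       # distance to received y
        if metric < best_metric:
            best, best_metric = s, metric
    return best
```

The candidate built from the correct weakest-layer symbol sees no residual interference from that layer, which is why the minimum-distance selection at the end is so effective.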
This algorithm modifies the original parallel V-BLAST algorithm [41]. The main modification is that the real and imaginary components are treated separately and all calculations are real-valued. For 16-QAM, 16 possible weakest-layer symbols are enumerated in the original complex-valued algorithm. By treating the real and imaginary components separately, only 4 possible component values need to be considered for each component, and so the total number of candidates is reduced from Mc = 16 to 2√Mc = 8. Therefore, the computational complexity is partially reduced compared to the complex-valued parallel V-BLAST: the number of candidate symbol calculations is reduced, although the matrices are bigger (from 4 × 4 to 8 × 8), which increases the cost of the matrix inversions and multiplications. This advantage could be significant for large MIMO systems with many antennas. The algorithm is shown in Algorithm 4.
In Algorithm 4, the channel, symbol and received symbol vectors use the real-valued convention of Eq. (2.7) in Chapter 2. G is still calculated as G_MMSE in Eq. (2.13) in Chapter 2. RealConsMat is simplified to {−3, −1, 1, 3}, which was also introduced in Chapter 2, and its size is reduced to RealConsSize = √Mc = 4. The detection loop iterates 8 times, four times each for the real and imaginary values of the weakest layer. Line 2 identifies the imaginary-component layer corresponding to the weakest real-component layer determined in line 1.
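The real-valued convention of Eq. (2.7) can be illustrated with a small NumPy check (a sketch; the exact stacking order used in the thesis may differ):

```python
import numpy as np

def to_real(h, y, s):
    """Real-valued equivalent of the complex model y = H s:
    stack real and imaginary parts so that H_r s_r = y_r."""
    h_r = np.block([[h.real, -h.imag],
                    [h.imag,  h.real]])
    s_r = np.concatenate([s.real, s.imag])
    y_r = np.concatenate([y.real, y.imag])
    return h_r, s_r, y_r
```

Each complex 16-QAM symbol thus becomes two independent real 4-PAM components, at the cost of doubling the matrix dimensions.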
As with the complex-valued parallel V-BLAST, we did not turn these 8 candidates into parallel threads, although a 128 × 8 parallel execution model would also be possible. Fig. 4.2 shows the performance of the three V-BLAST algorithms compared to ML detection.
As can be seen from Fig. 4.2, the real-valued parallel V-BLAST detection al-
Algorithm 4 Parallel V-BLAST with Real and Imaginary Components
Inputs:
The numbers of transmitter and receiver antennas Mt, Mr;
16-QAM constellation set RealConsMat and its size RealConsSize;
The 2Mr × 2Mt real-valued channel matrix H, the real-valued symbol vector s and the real-valued received signal vector y;
The real-valued Moore-Penrose pseudo-inverse matrix G;
Output:
The number of symbol errors from the detector;
1: Reallayer_weakest = arg max(norm(G))
2: Imaglayer_weakest = Reallayer_weakest + RealConsSize
3: for all i = 1 : RealConsSize do
4:   RealDetSym(Reallayer_weakest) = RealConsMat(i)
5:   Interference cancellation on the Reallayer_weakest-th layer
6:   Normal V-BLAST algorithm on the remaining layers
7: end for
8: Combine 4 candidate symbol vectors as a matrix RealDetSym
9: for all j = 1 : RealConsSize do
10:   ImagDetSym(Imaglayer_weakest) = RealConsMat(j)
11:   Interference cancellation on the Imaglayer_weakest-th layer
12:   Normal V-BLAST algorithm on the remaining layers
13: end for
14: Combine 4 candidate symbol vectors as a matrix ImagDetSym
15: The candidate vectors DetSym = [RealDetSym ImagDetSym]
16: BestSetIndex = arg min ‖y − H · DetSym‖²
17: Compare DetSym(BestSetIndex) to s, and count the number of symbol errors
Figure 4.2: SER versus SNR (dB) performance of the V-BLAST, complex-valued parallel V-BLAST, real-valued parallel V-BLAST and ML detectors.
most reaches the optimal performance, since the weakest channel layer is fully enumerated over all possible candidates; after the interference cancellation, the influence of the noise is reduced as much as possible to ensure accurate detection of the remaining layers. The real-and-imaginary-components version also performs near-optimally.
The algorithm considered in this section is the conventional K-Best algorithm converted into a parallel version. As introduced in Chapter 2, the K-Best algorithm is a breadth-first sphere detector in which the width of the search at each level of the tree is restricted to K. All the operations at the same level of the tree search can be transferred to the GPU and executed in parallel.
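For reference, a compact serial sketch of the underlying K-Best search on the QR-triangularized real-valued model (our own Python illustration, not the parallel Jacket code):

```python
import numpy as np

def k_best_detect(r, y_eff, cons, k):
    """Breadth-first K-Best search on the triangular system R s = y_eff.

    Layers are detected from the last row of R upward; at every level
    each surviving candidate is expanded with all constellation points
    and only the k candidates with the smallest accumulated metric survive.
    """
    n = r.shape[0]
    survivors = [(0.0, np.zeros(n))]         # (metric, partial symbol vector)
    for level in range(n - 1, -1, -1):
        expanded = []
        for metric, s in survivors:
            for c in cons:
                s_new = s.copy()
                s_new[level] = c
                e = y_eff[level] - r[level, level:] @ s_new[level:]
                expanded.append((metric + e * e, s_new))
        expanded.sort(key=lambda t: t[0])    # keep the k best partial paths
        survivors = expanded[:k]
    return survivors[0][1]
```

Here R and y_eff come from a QR decomposition of the channel (y_eff = Qᵀy); the per-level expansion and pruning are the operations that the parallel version maps onto the GPU.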
The structure of this algorithm is essentially the same as that of the K-Best algorithm introduced in Section 2.6.3. Recall that conditional statements are not allowed inside a gfor loop; this restriction can be overcome by expressing the condition as a multiplied condition factor (see Fig. 3.3 in Chapter 3).
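The condition-factor trick can be illustrated in NumPy (a sketch; Jacket's actual gfor semantics are not reproduced here):

```python
import numpy as np

def branchless_select(cond, a, b):
    """Replace `if cond: x = a else: x = b` with a multiplied 0/1 factor,
    as required inside loops that forbid conditional statements."""
    m = cond.astype(a.dtype)      # 1 where the condition holds, 0 elsewhere
    return m * a + (1 - m) * b
```

Because every element computes the same arithmetic regardless of the condition, all parallel paths execute identical instructions.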
The performance of the resulting parallel K-Best algorithm is shown in Fig. 4.3. As K increases, the performance of the parallel K-Best improves. When K = 16, the performance matches the ML results.
Since the performance of parallel V-BLAST shows a great reduction in SER compared to the conventional V-BLAST detector, we tried to apply the same strategy to the K-Best algorithm to see if improved performance would result. As described above, the first step in designing a parallel V-BLAST algorithm was to find the weakest channel layer, fully enumerate it, and then apply the original detector to the remaining layers. Modifying the original K-Best algorithm in the same way, the symbol vector is first separated into real and imaginary components and reshaped into one real-valued symbol vector according to Eq. (2.6) in Chapter 2. We therefore only need to consider 4 possible values (−3, −1, 1, 3) for both the real and imaginary components on this weakest layer. Then the normal K-Best procedure is executed on the remaining layers. At the end of this algorithm, 2 × 4 × K candidate symbol vectors are left. From these candidate solutions we pick the symbol vector that minimizes the predicted error metric.
The algorithmic procedure is provided in Algorithm 5. Most of the parameters
in this algorithm are similar to those in Algorithm 4. Note that in the conventional
K-Best algorithm, the strongest layer of the symbol vector is detected first. In the
modified K-Best algorithm, the weakest layer is fully enumerated, and the remain-
ing sub-trees are searched in K-Best fashion, with a total of K nodes expanded at
each level.
Figure 4.3: Performance (SER versus SNR) of the K-Best (K = 2, 4, 16) and the fully enumerated K-Best (K = 2, 4) for a Mt = Mr = 4, 16-QAM MIMO system.
Algorithm 5 The Fully Enumerated K-Best Algorithm
Inputs:
The numbers of transmitter and receiver antennas, Mt and Mr, respectively;
16-QAM constellation set RealConsMat and its size RealConsSize;
The 2Mr × 2Mt real-valued channel matrix H, the real-valued symbol vector s and the real-valued received signal vector y;
The number K of selected best nodes on each layer;
The real-valued Moore-Penrose pseudo-inverse matrix G;
Output:
The number of symbol errors from the detector;
1: Reallayer_weakest = arg max(norm(G))
2: Imaglayer_weakest = Reallayer_weakest + RealConsSize
3: Set Reallayer_weakest as the first detected symbol layer
4: Reorder H so that Reallayer_weakest is the last channel layer
5: QR decomposition of the reordered channel
6: for all i = 1 : RealConsSize do
7:   The first detected symbol = RealConsMat(i)
8:   Normal K-Best algorithm on the remaining layers
9: end for
10: Combine four candidate symbol vectors as a matrix RealDetSym
11: Set Imaglayer_weakest as the first detected symbol layer
12: Reorder H so that Imaglayer_weakest is the last channel layer
13: QR decomposition of the reordered channel
14: for all j = 1 : RealConsSize do
15:   The first detected symbol = RealConsMat(j)
16:   Normal K-Best algorithm on the remaining layers
17: end for
18: Combine four candidate symbol vectors as a matrix ImagDetSym
19: The candidate vectors DetSym = [RealDetSym ImagDetSym]
20: BestSetIndex = arg min ‖y − H · DetSym‖²
21: Compare DetSym(BestSetIndex) to s, and count the number of symbol errors
The performance of this fully enumerated K-Best algorithm is illustrated in Fig. 4.3 for K = 2 and 4. The plots in Fig. 4.3 show that exhaustive enumeration over the weakest layer achieves good performance. Note that conventional 16-Best and 4-Best with full enumeration both approach the optimal detection curve. From this figure, we also see that 2-Best with full enumeration performs much better than conventional 4-Best in low-SNR environments, but it approaches the performance of conventional 2-Best as the SNR becomes higher. In a higher-SNR environment, the influence of the noise becomes progressively weaker, and detection errors are determined increasingly by the effects of interference and error propagation. In conclusion, the fully enumerated method produces more benefit in lower-SNR environments, and it can also help improve the detection accuracy of cheaper but less accurate detection algorithms.
The brief structure of this algorithm is shown in Fig. 4.4.
[Diagram: the weakest layer of the real-valued M-QAM symbol vector is fully enumerated, producing 2√Mc × K candidate symbols; the V-BLAST algorithm is applied to the next V Layers and the K-Best algorithm to the remaining K Layers, ending at the second-weakest layer.]
Figure 4.4: Algorithmic structure of the parallel V-BLAST with K-Best algorithm.
Algorithm 6 shows the procedure of this algorithm. The main strategy of Algorithm 6 is still to fully enumerate the weakest symbol layer; the remaining layers are then detected by both the V-BLAST and K-Best algorithms. The parameters KLayer and VLayer can each be chosen from 1 to 2Mt − 2. Note that the sum of KLayer and VLayer is always 2Mt − 1.
The resulting performance can be seen in Fig. 4.5. The conventional K-Best algorithm (for K = 2, 4 and 16) was compared to two extreme versions of the new parallel algorithm: one executes parallel V-BLAST on the first 6 layers and the 2-Best algorithm on the last layer. For this version, we can see from Fig. 4.5 that the performance is better than that of the normal 4-Best algorithm. At the other extreme, if only one layer is detected by parallel V-BLAST and the rest are detected using 2-Best, the performance is almost the same as that of the conventional serial 4-Best algorithm. The conclusion is that pure parallel V-BLAST performs better than parallel V-BLAST with the last layer detected by the 2-Best algorithm.
Algorithm 6 The parallel V-BLAST with K-Best algorithm
Inputs:
The numbers of transmitter and receiver antennas, Mt and Mr, respectively;
16-QAM constellation set RealConsMat and its size RealConsSize;
The 2Mr × 2Mt real-valued channel matrix H, the real-valued symbol vector s and the real-valued received signal vector y;
The number K of selected best nodes on each layer;
The number KLayer of layers that apply the K-Best algorithm;
The number VLayer of layers that apply the V-BLAST algorithm;
The real-valued Moore-Penrose pseudo-inverse matrix G;
Output:
The number of symbol errors from the detector;
1: Reallayer_weakest = arg max(norm(G))
2: Imaglayer_weakest = Reallayer_weakest + RealConsSize
3: for all i = 1 : RealConsSize do
4:   RealDetSym(Reallayer_weakest) = RealConsMat(i)
5:   Interference cancellation on the Reallayer_weakest-th layer
6:   Normal V-BLAST algorithm on the remaining 2 : VLayer layers
7: end for
8: Reorder H, setting the last KLayer channel layers as the pending layers
9: QR decomposition of the reordered channel
10: Normal K-Best algorithm on the remaining layers
11: Get detected symbol vector RealDetSym
12: for all i = 1 : RealConsSize do
13:   ImagDetSym(Imaglayer_weakest) = RealConsMat(i)
14:   Interference cancellation on the Imaglayer_weakest-th layer
15:   Normal V-BLAST algorithm on the remaining 2 : VLayer layers
16: end for
17: Reorder H, setting the last KLayer channel layers as the pending layers
18: QR decomposition of the reordered channel
19: Normal K-Best algorithm on the remaining layers
20: Get detected symbol vector ImagDetSym
21: The candidate vectors DetSym = [RealDetSym ImagDetSym]
22: BestSetIndex = arg min ‖y − H · DetSym‖²
23: Compare DetSym(BestSetIndex) to s, and count the number of symbol errors
Figure 4.5: SER versus SNR (dB) performance of the conventional K-Best (K = 2, 4, and the near-optimal K = 16) and two extremes of the parallel V-BLAST with K-Best algorithm (6 layers with ParaVBLAST & 1 layer with 2-Best; 1 layer with ParaVBLAST & 6 layers with 2-Best).
Chapter 5
Nowadays, most CPUs in desktop computers, laptops, tablets and even cellphones are multicore. The MATLAB parallel computing toolbox (PCT), which is intended to exploit multicore CPUs and clusters of computers, is another parallel strategy that we investigated in this research.
[Fig. 5.1: the MATLAB client distributing work to MATLAB pools 1 through n.]
Fig. 5.1 briefly shows how a program is parallelized with the PCT. The MATLAB client is the copy of MATLAB that we start in the regular way. A MATLAB pool, also called a worker in some of the documentation, is a copy of MATLAB created to help with the computation. A pool can be seen as a "lab" in MATLAB, where a lab is the space to which the data will be distributed. The labs can either be independent of each other or communicate when necessary. The number of labs depends on the number of cores on one or multiple workstations.
The PCT starts the parallelism by opening multiple labs in MATLAB. On the local computer we must enter the command matlabpool open, and when the parallel computation is finished the labs are released by the second command matlabpool close. Since the overhead of opening the MATLAB labs is relatively expensive, we should make sure that all the parallel computing is finished before we close the pool of labs.
The PCT is easy to apply in our programs since most of the built-in functions in MATLAB are multithreading-aware. Only relatively small changes are required to our programs, mainly to the commands related to the parallelism.
Similar to the gfor-loop structure in Jacket for the GPU, the parfor-loop structure in the PCT can replace a conventional for-loop to provide parallel computation. Instead of being executed serially, the commands in the parfor-loop are executed in parallel. The total number of iterations is automatically distributed over the number of open labs, and each group of iterations is executed at the same time. The computation within each iteration of the loop should be independent of all the others. The parallelism pattern of the parfor structure is task-parallel, and there is no communication between the labs.
The PCT also provides another command, spmd, which stands for single program, multiple data. This command can automatically distribute a large array over the parallel hardware by dividing it into pieces for each lab in MATLAB. The parallelism pattern of the spmd structure is data-parallel, and the parallel labs can communicate with each other under the spmd model.
In our research, we deal with million-element data sets, one algorithm at a time, where each iteration of the loop is independent of the others, so we decided to apply the parfor structure to implement the parallelism.
Code Listing 5.1: Matrix Multiplication Benchmark using the parfor loop
matlabpool('open', NumPool)    % functional form: open NumPool labs
C = zeros(N, N, 102400);       % preallocate the sliced output array
parfor ii = 1:102400
    A = rand(N,N);
    B = rand(N,N);
    C(:,:,ii) = A*B;
end
matlabpool close
The matrix multiplication benchmark in Code Listing 5.1 is similar to Code Listing 4.2. The only change is that the parfor structure is used instead of for. As in Chapter 4, we compare the for, gfor and parfor loops. The results are shown in Table 5.1.
Table 5.1: Matrix multiplication times (in seconds) using the for, gfor and parfor
loops
In Table 5.1, the running time covers the total of 102400 iterations. The degree of parallelism depends on the number NumPool of MATLAB pools that have been opened for the computation. When using the parfor-loop, MATLAB distributes the 102400 iterations into NumPool groups. For each group, MATLAB serializes the data first and then executes all the commands in the parfor-loop. This is why the parfor-loop results are even larger than those of the serial version when the number of pools is 1. This is also the reason why the data is not available when the matrix size is 64 and the number of open pools in MATLAB is 1: the resulting error from MATLAB is "Attempt to serialize data which is too large".
The results for the for-loop and the gfor-loop are a little different from those in Table 4.2. This is because the previous test results, obtained from the profiler in MATLAB, considered only the multiplication in each iteration. In this test we include the time for random number generation as well, and use the tic-toc functions in the programs to accurately determine the running time. It can be seen that when the matrix size is small, the multiplication time with the gfor-loop on the 1024-core GPU is almost the same as for the serial for-loop, while the performance of the parfor-loop improves as the number of open labs increases. But as the matrix size grows, the running times for both the for and parfor loops increase considerably, while the running times for the gfor loop stay almost the same.
In Algorithm 7, NumPool stands for the number of labs that we decide to open. The degree of parallelism is determined by NumPool. NumLoops is the number of outer loops that repeat the same algorithm. Mt and Mr are still the numbers of antennas at the transmitter and the receiver.
Each of the detection algorithms described in Chapter 2 and Chapter 4 can then be substituted for step 6. A serial version of each algorithm can be used with almost no changes in Algorithm 7; the only change is to apply the parfor-loop instead of the for-loop. The communication system environment is set to be the same as in Chapter 4.
In order to give a clear view of the acceleration achieved by the different methods, a complete running-time comparison of the MIMO detection algorithms, covering the serial version and all of the parallel versions, is provided in Table 5.2. Table 5.2 compares the serial version, the parallel version using Jacket on the GPU, and the parallel versions using the PCT on the CPU. The data in the first seven columns are the same as those in Table 4.5 in Chapter 4; the last four columns contain the new results for the PCT versions. The number of labs is set to 4 and then 8 in this test.
It can be seen from each of the speed-up columns that the acceleration is similar for all detection algorithms and is determined by the number of open labs. For the ZF and MMSE algorithms, since the matrix multiplications are applied only a few times, the advantage of GPU computing is greater than that of the PCT. In the V-BLAST algorithm, matrix multiplication is frequently used, and so the PCT performance is better than the Jacket performance. The K-Best algorithm was described in Chapter 2, and its procedures use matrix operations intensively. When the size of the matrix is only 4, Jacket does not provide much acceleration, while the PCT can still distribute the data and instructions across all 4 labs to reduce the calculation time, which also shows the advantage of task parallelism. For all these detection algorithms, the speed-up factors stay at almost 4 when the number of open labs is fixed at 4. When the number of open labs increases to 8, we gain some benefit from the 4 extra labs, but not much, since our PC has only 4 physical cores. With the help of the multithreading technique, 8 threads are available on the 4 cores.
The parallel V-BLAST algorithm has the same strategy as the conventional se-
rial V-BLAST, except the weakest channel layer is fully enumerated, which makes
Table 5.2: Running times (in seconds) comparison of MIMO detection algorithms with the serial and different parallel versions

Algorithm                  | Serial   | gfor (1000 × 1024) | Speed-up | gfor with factor F | Speed-up | gfor (10 × 102400) | Speed-up | PCT (4 labs) | Speed-up | PCT (8 labs) | Speed-up
MMSE Detection             | 287.134  | 15.296             | 18.772   | 6.572              | 43.691   | 8.044              | 35.695   | 72.399       | 3.966    | 56.031       | 5.125
V-BLAST Detection          | 626.020  | 191.333            | 3.272    | 133.290            | 4.697    | 64.530             | 9.701    | 158.114      | 3.959    | 116.038      | 5.395
K-Best Detection           | 703.008  | 620.127            | 1.134    | -                  | -        | 490.303            | 1.434    | 204.663      | 3.435    | 209.069      | 3.363
Parallel V-BLAST Detection | 3075.281 | 513.076            | 5.994    | 234.749            | 13.100   | 290.964            | 10.569   | 932.596      | 3.298    | 677.297      | 4.541
the calculation times larger than for the normal V-BLAST algorithm. So we can see that, for such large computations, the GPU achieves more speed-up than the CPU.
Chapter 6
Conclusions
6.1 Contributions
In Chapter 2, we briefly reviewed the fundamentals of MIMO wireless technology and described the major classes of detection algorithms. The detectors included three linear-complexity algorithms (ZF, MMSE, V-BLAST) and then the more complex, but more accurate, sphere detection algorithms (FP, SE, K-Best). Note that the FP and SE sphere detection algorithms are depth-first and cannot easily be parallelized, because the branches of the search tree differ from one data instance to the next, and it is awkward and usually inefficient to execute different code paths simultaneously.
Chapter 3 introduced various ways to exploit hardware parallelism such as using
FPGA, custom VLSI and GPU technology. We briefly reviewed the architecture of
GPU units and described the various kinds of memories allocated inside these units.
The GPU is not only used in the graphics processing field, but it can also be used
for general-purpose computing. General-purpose computing on the GPU has re-
cently received a lot of attention because of the potential benefits of significant and
relatively cheap speed-up. Access to the GPU can be achieved using a variety of
programming environments such as CUDA, OpenCL and Jacket. Our implemen-
tations used Jacket and the parfor construct in the MATLAB parallel computing
toolbox. In addition, we reviewed the literature to see what other researchers have
achieved in this area.
The main focus of this thesis is the implementation of the MIMO detection
algorithm on the GPU, as described in Chapter 4. We chose the Jacket function
library because of its compatibility with MATLAB. First, we did experiments on
matrix-vector multiplication benchmarks using the GPU to find out how much im-
provement we could expect to achieve from the parallelism. The disappointing
result was that the large speed-ups reported for other matrix-oriented problems on
the GPU only seem to be attainable with relatively large matrices. So in Exper-
iment 3, we merged several small matrices into the diagonal blocks of one large
matrix and then performed a single multiplication with the large matrix; the results
showed that the GPU becomes more beneficial as the matrix size grows.
Our existing detection code was already written in MATLAB, and it would be
easier and clearer to compare serial and parallel versions of alternative detectors
if we used the same programming environment. In data generation, we readily
achieved significant speed-ups on the GPU. When parallelizing the algorithms, we
had to pay close attention to the restrictions of the Jacket library and ensure that
all the data structures were processed as much as possible locally on the GPU. We
also proposed a new MIMO detection algorithm called Parallel VBLAST-KBest.
This algorithm combines the strategies of Parallel V-BLAST and K-Best
to reduce the computational complexity of V-BLAST and increase the accuracy of
the normal K-Best.
In Chapter 5, we investigated another way to implement the parallelism. Since
matrix multiplication sometimes remained a bottleneck in Chapter 4, we applied
the parallel computing toolbox in MATLAB to achieve acceleration by taking
advantage of multiple cores on the CPU. We also repeated the same experiments
on the basic MIMO detection algorithms to see how much improvement we can get
from multicore CPU parallelism compared to GPU programming.
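The parfor pattern, independent Monte-Carlo trials distributed over CPU cores, can be sketched with a worker pool; this is an illustrative Python analogue (the trial function is a toy stand-in, not the thesis detector):

```python
from concurrent.futures import ThreadPoolExecutor
import random

def run_trial(seed):
    """One independent Monte-Carlo trial: a toy stand-in for a single
    MIMO detection run that returns a symbol-error count."""
    rng = random.Random(seed)                    # per-trial generator keeps trials independent
    sent = [rng.randrange(4) for _ in range(8)]
    # pretend the detector flips each symbol with 10% probability
    detected = [s if rng.random() > 0.1 else (s + 1) % 4 for s in sent]
    return sum(a != b for a, b in zip(sent, detected))

# like "parfor ss = 1:NumSymbs": the trials share no state, so they can
# be distributed over a pool of workers and their results summed afterwards
with ThreadPoolExecutor(max_workers=4) as pool:
    errors = list(pool.map(run_trial, range(100)))
total_errors = sum(errors)
```

The key property parfor also demands is that iterations are independent, so the loop body can run in any order on any worker.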
erations simultaneously on 1024 GPU cores. In Chapter 5, we briefly investigated
multicore CPU parallelism using the parallel computing toolbox in MATLAB.
However, there are many models of parallelism. One strategy is to divide the
multiple data streams into several different groups that execute different commands
at the same time. While the threads within each group run in parallel, after a certain
period of time the results from the different groups can be combined into the same
data structure and the rest of the calculation completed serially.
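One way to picture this grouped model is the following sketch (illustrative only; the detector functions here are invented toy rules, not algorithms from the thesis):

```python
from concurrent.futures import ThreadPoolExecutor

def detector_a(stream):
    """Group 1 runs one (toy) detection rule on its streams."""
    return max(stream)

def detector_b(stream):
    """Group 2 runs a different (toy) rule at the same time."""
    return min(stream)

streams = [[3, 1, 4], [1, 5, 9], [2, 6, 5], [3, 5, 8]]
group1, group2 = streams[:2], streams[2:]       # divide the streams into two groups
with ThreadPoolExecutor() as pool:
    out1 = pool.map(detector_a, group1)         # the groups execute concurrently,
    out2 = pool.map(detector_b, group2)         # each with its own command
    combined = list(out1) + list(out2)          # merge the results into one structure
final = sum(combined)                           # finish the calculation serially
```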
All of the parallel MIMO detectors that were investigated in this thesis were
implemented using both the Jacket function library with GPU and the parallel com-
puting toolbox in MATLAB with a multicore CPU. The relatively high-level data
structures and functions of these environments may be limiting compared to the
lower-level GPU language CUDA C/C++. If all, or even just the critical parts, of
the programs could be re-implemented in CUDA C/C++ and called from MATLAB,
the performance of the parallel MIMO detection algorithms might be greatly
improved.
Appendix A
NumSymbs = 1024000;
NumSymbsPerChan = 10;
Constellation_point(col);
ConsCount = ConsCount + 1;
end
end
for i = 1:length(SNR)
snr = 10^(SNR(i)/10);
ChanAge = NumSymbsPerChan;
Error_temp = zeros(1,NumSymbs);
for ss = 1:NumSymbs
% Generate the transmit signal, channel and noise
x = randint(M_transmit,1,[0,M_QAM-1]);
s = qammod(x,M_QAM,0,'Gray');
noise = complex(randn(M_receive,1),randn(M_receive,1))*sqrt(Energy*M_transmit/(2*snr));
ChanAge = ChanAge + 1;
if (ChanAge > NumSymbsPerChan)
H = complex(randn(M_receive,M_transmit),randn(M_receive,M_transmit))/sqrt(2);
end
% Received signal
y = H*s + noise;
% Detection algorithm
% M_QAM: size of modulation scheme
% ConstMat: complex-valued constellation
% Output: s_det: detected symbol vector
% error: number of erroneously detected symbols
SymbolSize = length(s);
s_det = zeros(SymbolSize,1);
error = 0;
temps = zeros(SymbolSize,1);
s_det(:,1) = ConstMat(1);
min_value = norm(y - H*s_det)^2;
for dd = 1:M_QAM
layer = 1;
temps(layer) = ConstMat(dd);
for cc = 1:M_QAM
layer = 2;
temps(layer) = ConstMat(cc);
for bb = 1:M_QAM
layer = 3;
temps(layer) = ConstMat(bb);
for aa = 1:M_QAM
layer = 4;
temps(layer) = ConstMat(aa);
temp_norm = norm(y - H*temps)^2;
if temp_norm < min_value
s_det = temps;
min_value = temp_norm;
end
end
end
end
end
% Calculate the symbol errors
for ee = 1:SymbolSize
RealCount = real(s_det(ee))-real(s(ee));
Realcondition = RealCount~=0;
error = error+Realcondition;
ImgCount = imag(s_det(ee))-imag(s(ee));
Imgcondition = ImgCount~=0;
error = error+Imgcondition;
end
A.3 Zero Forcing (ZF) Detection Algorithm
function error = ZF(s,H,noise,Constellation_point,partition)
Q = (H'*H)\H';
error = 0;
for i = 1:length(s)
Y(:,i) = H(:,i)*s(i)+noise;
% Nulling and Slicing
[~,R] = quantiz(real(Q(i,:)*Y(:,i)),partition,Constellation_point);
[~,Img] = quantiz(imag(Q(i,:)*Y(:,i)),partition,Constellation_point);
% Symbol Errors
if R~=real(s(i))
error = error+1;
end
if Img~=imag(s(i))
error = error+1;
end
end
Q = (H'*H+(1/snr)*eye(M_transmit,M_transmit))\H';
error = 0;
for i = 1:length(s)
Y(:,i) = H(:,i)*s(i)+noise;
% Nulling and Slicing
[~,R] = quantiz(real(Q(i,:)*Y(:,i)),partition,Constellation_point);
[~,Img] = quantiz(imag(Q(i,:)*Y(:,i)),partition,Constellation_point);
% Symbol Errors
if R~=real(s(i))
error = error+1;
end
if Img~=imag(s(i))
error = error+1;
end
end
G = (H'*H+(1/snr)*eye(M_transmit,M_transmit))\H';
k = zeros(1,length(s));
s_det = zeros(1,length(s));
error = 0;
for i = 1:length(s)
for p = 1:length(s)
Q(p) = (norm(G(p,:)))^2; % calculate the norm of each row of G
end
for t = 1:i-1
Q(k(t)) = Inf; % set the already-detected norms to infinity
end
[~,I] = min(Q); % I is the index of the minimum norm
k(i) = I; % save the index
shk = G(I,:)*Y; % nulling
% slicing
[~,R] = quantiz(real(shk),partition,Constellation_point);
[~,Img] = quantiz(imag(shk),partition,Constellation_point);
s_det(I) = R+1j*Img;
Y = Y-s_det(I)*H(:,I); % interference cancellation
H(:,I) = 0; % zero out the used channel column
G = pinv(H); % pseudo-inverse for the new channel
% Symbol Errors
if R~=real(s(I))
error = error+1;
end
if Img~=imag(s(I))
error = error+1;
end
end
% Radius for FP-SD
variance2 = (M_transmit*Energy/(2*log2(M_QAM)))/snr; % variance of the noise
Probability2 = 0.01;
d = 2*chi2inv((1-Probability2),m_dimension)*variance2;
a = inf;
num_nodes = 0;
k = m_dimension; % search level
D(k) = d; % the radius matrix
s = zeros(m_dimension,1); % initialize the detected result
det_node = zeros(m_dimension,1);
error = 0;
while (k~=0)
rs = 0;
for t = (k+1):m_dimension
rs = rs+R(k,t)*s(t); % Summation of r*s
end
lower_bound(k) = (Z_r(k)-rs-sqrt(D(k)))/R(k,k); % set the lower bound
upper_bound(k) = (Z_r(k)-rs+sqrt(D(k)))/R(k,k); % set the upper bound
while(k~=(m_dimension+1))
s(k) = search(lower_bound(k),upper_bound(k),Constellation_point,s(k)); % check if s(k) is in bound
if (s(k)==0) % no point found
k = k+1; % back to the higher level
else
num_nodes = num_nodes+1;
if (k==1) % reach the lowest level
b = norm(Y_r-H_r*s)^2; % ML detection
if(b<a) % found a smaller node
a = b;
det_node = s; % save the detected node
end
else
k = k-1; % keep searching the lower level
RS = 0;
for j = (k+1):m_dimension
RS = RS+R(k+1,j)*s(j); % Summation of R*S
end
D(k) = D(k+1)-(Z_r(k+1)-RS)^2; % reduce the radius
break; % recalculate the search bound and start searching again
end
end
end
if (k==(m_dimension+1)) % the search level is out of bounds
break; % terminate the algorithm
end
end
% Symbol Errors
for i = 1:length(S_r)
if det_node(i)~=S_r(i)
error = error+1; % count the error symbol
end
end
Y_r = H_r*S_r+noise_r; % The real system
[Q,R] = qr(H_r); % QR decomposition
for k = 1:length(Y_r)
if (R(k,k)<0)
Q(:,k) = Q(:,k)*(-1);
R(k,:) = R(k,:)*(-1);
end
end
Z_r = Q'*Y_r;
for p = 1:(length(Constellation_point)-1)
partition(p) = (Constellation_point(p)+Constellation_point(p+1))/2;
end
gap = Constellation_point(2)-Constellation_point(1);
i = m_dimension; % search level
bestdist = d; % the radius matrix
dist(i) = 0;
e(i,:) = Y_r'*Q*L;
[Index,u(i)] = quantiz(e(i,i),partition,Constellation_point);
y_h = (e(i,i)-u(i))/L(i,i);
step(i) = sign(y_h);
num_nodes = 0; % the counter for the expanded nodes
s = zeros(m_dimension,1); % initialize the detected result
error = 0;
while (1)
newdist = dist(i)+y_h^2;
if (newdist<bestdist)
num_nodes = num_nodes+1;
if (i>1)
for j = 1:i-1
e(i-1,j) = e(i,j)-y_h*L(i,j);
end
i = i-1;
dist(i) = newdist;
[Index,u(i)]=quantiz(e(i,i),partition,
Constellation_point);
y_h = (e(i,i)-u(i))/L(i,i);
step(i) = sign(y_h);
else
det_node = u;
bestdist = newdist;
i = i+1;
y_h = 25;
for k = 1:2
u(i) = u(i)+gap*step(i);
step(i) = (-1)*step(i)-sign(step(i));
if (isempty(find(u(i)==Constellation_point))==0)
y_h = (e(i,i)-u(i))/L(i,i);
break;
end
end
end
else
if (i==m_dimension)
return;
else
i = i+1;
y_h = 25;
for k = 1:2
u(i) = u(i)+gap*step(i);
step(i) = (-1)*step(i)-sign(step(i));
if (isempty(find(u(i)==Constellation_point))==0)
y_h = (e(i,i)-u(i))/L(i,i);
break;
end
end
end
end
end
% Symbol Errors
for i = 1:length(S_r)
if det_node(i)~=S_r(i)
error = error+1; % count the error symbol
end
end
% Constellation_point: real-valued constellation
% Output: det_node: the matrix of the detected nodes at each level
% num_nodes: the number of the expanded nodes
% error: number of erroneously detected symbols
num_nodes = 0;
T = [];
s_h = zeros(m_dimension,K);
temp_s = zeros(m_dimension,K);
e = [];
for i = m_dimension:-1:1
if (i==m_dimension) % m_dimension-th node
if (K>length(Constellation_point))
K1 = length(Constellation_point);
else
K1 = K;
end
temp_T = zeros(1,K1);
for j = 1:length(Constellation_point)
temp_T(j) = (Z(i)-R(i,i)*Constellation_point(j))^2; % Branch cost
end
Sort_T = sort(temp_T,'ascend'); % Sort the branch costs in ascending order
T(i,1:K1) = Sort_T(1:K1); % Select the K partial vectors with the smallest PEDs
num_nodes = num_nodes+length(T(i,:));
for t = 1:K1
s_h(i,t) = Constellation_point(find(temp_T==T(i,t))); % save the detected nodes
end
temp_s = s_h;
else % i-th node (i<m_dimension)
count = 1;
if (K>(length(Constellation_point))^(m_dimension-i))
K1 = (length(Constellation_point))^(m_dimension-i);
if (K>(length(Constellation_point))^(m_dimension-i+1))
K2 = length(Constellation_point)^(m_dimension-i+1);
else
K2 = K;
end
else
K1 = K;
K2 = K;
end
length_T = K1*length(Constellation_point);
temp_T = zeros(1,length_T);
for t=1:K1
for j = 1:length(Constellation_point) % Go through all the constellation nodes
temp_s(i,t) = Constellation_point(j);
temp_vector(:,count) = temp_s(:,t);
rs = 0;
for n = i:m_dimension
rs = rs+R(i,n)*temp_s(n,t); % Calculate the branch cost for each level
end
e(i,count) = (Z(i)-rs)^2;
temp_T(count) = T(i+1,t)+e(i,count); % Calculate the PED
count = count+1;
end
end
Sort_T = sort(temp_T,'ascend'); % Sort the branch costs in ascending order
T(i,1:K2) = Sort_T(1:K2); % Select the K partial vectors with the smallest PEDs
num_nodes = num_nodes+length(T(i,:));
for t = 1:K2 % Pick the nodes related to the partial vectors
subscript(t) = find(temp_T==T(i,t));
end
subscript = sort(subscript,'ascend');
for q = 1:K2
T(i,q) = temp_T(subscript(q));
s_h(:,q) = temp_vector(:,subscript(q)); % Save the detected nodes and update the path
end
temp_s = s_h;
end
end
% Reach the lowest level
for k = 1:K
b(k) = norm(Y_r-H_r*s_h(:,k))^2; % Calculate K PEDs
end
det_node = s_h(:,(find(b==min(b)))); % Pick the vector with the smallest PED
% Symbol Errors
for i = 1:length(S_r)
if det_node(i)~=S_r(i)
error = error+1; % count the error symbol
end
end
Appendix B
M_transmit = 4;
M_receive = 4;
m_dimension = 2*M_transmit; % Channel layer
NumSymbs = 1000;
NumSymbsPerChan = 10;
NumParallel = 1024; % Degree of Parallelism
-1.0000 - 3.0000i;
-1.0000 - 1.0000i;
3.0000 + 3.0000i;
3.0000 + 1.0000i;
3.0000 - 3.0000i;
3.0000 - 1.0000i;
1.0000 + 3.0000i;
1.0000 + 1.0000i;
1.0000 - 3.0000i;
1.0000 - 1.0000i]);
for ee = 1:LengthSNR
snr = 10^(SNR(ee)/10);
NoiseScale = sqrt(Energy*M_transmit/(2*snr));
ChanAge = NumSymbsPerChan; % force generation of first channel matrix
Error_temp = zeros(NumSymbs,1);
for ss = 1:NumSymbs
x = gsingle(randi([0,M_QAM-1],M_transmit,1,NumParallel));
s = bbb(x+1);
ChanAge = ChanAge + 1;
if (ChanAge > NumSymbsPerChan)
% It is time to generate a new channel matrix
ChanAge = 0;
H = complex(grandn(M_receive,M_transmit,NumParallel),grandn(M_receive,M_transmit,NumParallel))/sqrt(2);
end % if (ChanAge > NumSymbsPerChan)
n = complex(grandn(M_receive,1,NumParallel),grandn(M_receive,1,NumParallel))*sqrt(Energy*M_transmit/(2*snr));
gfor pp = 1:NumParallel
y(:,:,pp) = H(:,:,pp)*s(:,:,pp) + n(:,:,pp);
gend
% Detection algorithm
global NumParallel
IdentityMat = geye(M_transmit);
[~,N] = size(A(:,:,1));
B = IdentityMat; %B is an N x N identity matrix
X = gzeros(N,N,NumParallel);
Y = gzeros(N,N,NumParallel);
R = gsingle(1:N);
R = repmat(R,NumParallel,1);
C = gzeros(1,N,NumParallel);
j = gzeros(NumParallel,1);
d = gzeros(NumParallel,1);
mult = gzeros(NumParallel,1);
% The next step is to find the factorization (LU decomposition)
gfor pp = 1:NumParallel
B(:,:,pp) = IdentityMat;
for p = 1:N-1
%Find the pivot row for column p
[~, j(pp)] = max(abs(A(p:N,p,pp)));
end
gend
result = X;
global NumParallel
RealCount = gzeros(NumParallel,1);
Realcondition = gzeros(NumParallel,1);
ImgCount = gzeros(NumParallel,1);
Imgcondition = gzeros(NumParallel,1);
error = gzeros(1,NumParallel);
gfor pp = 1:NumParallel
transpose_h(:,:,pp) = h(:,:,pp)';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*h(:,:,pp);
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
Q(:,:,pp) = AfterInverseMat(:,:,pp)*transpose_h(:,:,pp);
gend
gfor pp = 1:NumParallel
% Nulling
TempVec(:,:,pp) = Q(:,:,pp)*Y(:,:,pp);
% Slicing
[~,R(:,:,pp)] = quantization(real(TempVec(:,:,pp)),partition,Constellation_point);
[~,Img(:,:,pp)] = quantization(imag(TempVec(:,:,pp)),partition,Constellation_point);
gend
% Symbol Errors
for i = 1:length(s(:,1,1))
gfor pp = 1:NumParallel
RealCount(pp) = R(:,i,pp)-real(s(i,:,pp));
Realcondition(pp) = RealCount(pp)~=0;
error(pp) = error(pp)+Realcondition(pp);
ImgCount(pp) = Img(:,i,pp)-imag(s(i,:,pp));
Imgcondition(pp) = ImgCount(pp)~=0;
error(pp) = error(pp)+Imgcondition(pp);
gend
end
global NumParallel
RealCount = gzeros(NumParallel,1);
Realcondition = gzeros(NumParallel,1);
ImgCount = gzeros(NumParallel,1);
Imgcondition = gzeros(NumParallel,1);
error = gzeros(1,NumParallel);
IdentityMat = geye(M_transmit);
gfor pp = 1:NumParallel
transpose_h(:,:,pp) = h(:,:,pp)';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*h(:,:,pp)+(1/snr)*IdentityMat;
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
Q(:,:,pp) = AfterInverseMat(:,:,pp)*transpose_h(:,:,pp);
gend
gfor pp = 1:NumParallel
% Nulling
TempVec(:,:,pp) = Q(:,:,pp)*Y(:,:,pp);
% Slicing
[~,R(:,:,pp)] = quantization(real(TempVec(:,:,pp)),partition,Constellation_point);
[~,Img(:,:,pp)] = quantization(imag(TempVec(:,:,pp)),partition,Constellation_point);
gend
% Symbol Errors
for i = 1:length(s(:,1,1))
gfor pp = 1:NumParallel
RealCount(pp) = R(:,i,pp)-real(s(i,:,pp));
Realcondition(pp) = RealCount(pp)~=0;
error(pp) = error(pp)+Realcondition(pp);
ImgCount(pp) = Img(:,i,pp)-imag(s(i,:,pp));
Imgcondition(pp) = ImgCount(pp)~=0;
error(pp) = error(pp)+Imgcondition(pp);
gend
end
% Constellation_point: real-valued constellation
% partition: constellation points partition
% Output: SymbolError: number of erroneously detected symbols
global NumParallel
M_receive = length(H(:,1,1));
k = gzeros(1,M_transmit,NumParallel);
TestY = Y;
TestH = H;
error = gzeros(1,NumParallel);
shk = gzeros(1,NumParallel);
R = gzeros(NumParallel,1);
RealCount = gzeros(NumParallel,1);
Realcondition = gzeros(NumParallel,1);
Img = gzeros(NumParallel,1);
ImgCount = gzeros(NumParallel,1);
Imgcondition = gzeros(NumParallel,1);
transpose_h = gzeros(M_transmit,M_receive,NumParallel,'single');
InverseMat = gzeros(M_transmit,M_transmit,NumParallel,'single');
I = gzeros(NumParallel,1);
NormQ = gzeros(1,M_transmit,NumParallel);
IdentityMat = geye(M_transmit);
gfor pp = 1:NumParallel
transpose_h(:,:,pp) = h(:,:,pp)';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*h(:,:,pp)+(1/snr)*IdentityMat;
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
G(:,:,pp) = AfterInverseMat(:,:,pp)*transpose_h(:,:,pp);
gend
for i = 1:M_transmit
gfor pp = 1:NumParallel
for p = 1:M_transmit
NormQ(1,p,pp) = sum(abs(G(p,:,pp)).^2);
end
[~,I(pp)] = min(NormQ(:,:,pp)); % I is the index of the minimum norm
k(1,i,pp) = I(pp); % save the index
% Nulling
shk(pp) = G(I(pp),:,pp)*TestY(:,:,pp);
% Slicing
[~,R(pp)] = quantization(real(shk(pp)),partition,Constellation_point);
RealCount(pp) = R(pp)-real(s(I(pp),:,pp));
Realcondition(pp) = RealCount(pp)~=0;
error(pp) = error(pp)+Realcondition(pp);
[~,Img(pp)] = quantization(imag(shk(pp)),partition,Constellation_point);
ImgCount(pp) = Img(pp)-imag(s(I(pp),:,pp));
Imgcondition(pp) = ImgCount(pp)~=0;
error(pp) = error(pp)+Imgcondition(pp);
TestY(:,:,pp) = TestY(:,:,pp)-(R(pp)+1i*Img(pp))*TestH(:,I(pp),pp); % interference cancellation
TestH(:,I(pp),pp) = 0; % zero out the used channel column
transpose_h(:,:,pp) = TestH(:,:,pp)';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*TestH(:,:,pp)+(1/snr)*geye(M_transmit);
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
G(:,:,pp) = AfterInverseMat(:,:,pp)*transpose_h(:,:,pp);
gend
end
% Input: snr: signal-to-noise ratio
% M_transmit: number of transmit antennas
% M_receive: number of received antennas
% H: complex-valued channel matrix
% y: complex-valued received signal
% s: complex-valued transmitted signal
% Constellation_point: real-valued constellation
% ConsMat_s: complex-valued constellation
% M_QAM : size of the complex-valued constellation
% partition: constellation points partition
% Output: SymbolError: number of erroneously detected symbols
global NumParallel
error = gzeros(1,NumParallel);
HH = H;
tempH = H;
YY = Y;
s_det = gzeros(M_transmit,NumParallel);
size = length(s(:,1,1));
G = gzeros(M_transmit,M_receive,M_transmit,NumParallel);
shk = gzeros(1,NumParallel);
SymbolTest = gzeros(M_transmit,M_QAM,NumParallel);
TempY = gzeros(M_transmit,M_QAM,NumParallel);
TempError = gzeros(1,M_QAM,NumParallel);
NormQ = gzeros(1,M_transmit,NumParallel);
VBLASTI = gzeros(1,NumParallel);
R = gzeros(1,NumParallel);
Img = gzeros(1,NumParallel);
order = gzeros(1,M_transmit,NumParallel);
RealCount = gzeros(1,NumParallel);
Realcondition = gzeros(1,NumParallel);
ImgCount = gzeros(1,NumParallel);
Imgcondition = gzeros(1,NumParallel);
ConsMat = repmat(ConsMat_s,[1,NumParallel]);
IdentityMat = geye(M_transmit);
gfor pp = 1:NumParallel
transpose_h(:,:,pp) = h(:,:,pp)';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*h(:,:,pp)+(1/snr)*IdentityMat;
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
Q(:,:,pp) = AfterInverseMat(:,:,pp)*transpose_h(:,:,pp);
gend
gfor pp = 1:NumParallel
G(:,:,1,pp) = Q(:,:,pp);
for p = 1:M_transmit
NormQ(1,p,pp) = (norm(G(p,:,1,pp)))^2; % calculate the norm of each row of G
end
[~,I(pp)] = max(NormQ(:,:,pp)); % I is the index of the maximum norm
tempH(:,I(pp),pp) = 0;
transpose_h(:,:,pp) = tempH(:,:,pp)';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*tempH(:,:,pp)+(1/snr)*geye(M_transmit);
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
G(:,:,1,pp) = AfterInverseMat(:,:,pp)*transpose_h(:,:,pp);
gend
VBLASTk = gzeros(1,M_transmit-1,NumParallel);
tempH(:,VBLASTI(pp),pp) = 0;
transpose_h(:,:,pp) = tempH(:,:,pp)';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*tempH(:,:,pp)+(1/snr)*geye(M_transmit);
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
G(:,:,jj+1,pp) = AfterInverseMat(:,:,pp)*transpose_h(:,:,pp);
for qq = 1:M_receive
NormQ(1,qq,pp) = (norm(G(qq,:,jj+1,pp)))^2; % calculate the norm of each row of G
end
gend
end
gfor pp = 1:NumParallel
for jj = 1:M_transmit-1
% nulling
shk(:,pp) = G(VBLASTk(1,jj,pp),:,jj,pp)*TestY(:,:,pp);
% Slicing
[~,R(:,pp)] = quantization(real(shk(:,pp)),partition,Constellation_point);
[~,Img(:,pp)] = quantization(imag(shk(:,pp)),partition,Constellation_point);
VBLASTSymbolTest(jj,pp) = R(:,pp)+1i*Img(:,pp); % get the detected symbol
% interference cancellation
TestY(:,:,pp) = TestY(:,:,pp)-(R(:,pp)+1i*Img(:,pp))*TestH(:,VBLASTk(1,jj,pp),pp);
TestH(:,VBLASTk(1,jj,pp),pp) = 0; % zero out the used channel column
end
SymbolTestRow(pp,:) = [ConsMat(tt,pp) VBLASTSymbolTest(:,pp).'];
order(:,:,pp) = [I(pp) VBLASTk(:,:,pp)];
gend
TempVector = ColumnExchange(SymbolTestRow,order);
gfor pp = 1:NumParallel
SymbolTest(:,tt,pp) = (TempVector(pp,:)).';
TempY(:,tt,pp) = HH(:,:,pp)*SymbolTest(:,tt,pp);
TempError(1,tt,pp) = norm(YY(:,:,pp)-TempY(:,tt,pp))^2;
gend
end
gfor pp = 1:NumParallel
[~,Index(pp)] = min(TempError(:,:,pp));
s_det(:,pp) = SymbolTest(:,Index(pp),pp);
SymbolError = single(sum(error)); % Cast GPU data back to
CPU
gfor pp = 1:NumParallel
[Q(:,:,pp),R(:,:,pp)] = qr(H_r(:,:,pp)); % QR factorization
for k = 1:2*M_receive
QRCondition(pp) = R(k,k,pp)<0;
Q(:,k,pp) = (1-QRCondition(pp))*Q(:,k,pp)+QRCondition(pp)*Q(:,k,pp)*(-1);
R(k,:,pp) = (1-QRCondition(pp))*R(k,:,pp)+QRCondition(pp)*R(k,:,pp)*(-1);
end
Y_r(:,:,pp) = H_r(:,:,pp)*S_r(:,:,pp)+noise_r(:,:,pp); % The real system
Z_r(:,:,pp) = Q(:,:,pp)'*Y_r(:,:,pp);
gend
global NumParallel
LengthConstelaltion = length(Constellation_point(1,:));
LengthKConstelaltion = K*LengthConstelaltion;
T = gzeros(m_dimension,K,NumParallel);
s_h = gzeros(m_dimension,K,NumParallel);
e = gzeros(m_dimension,LengthKConstelaltion,NumParallel);
temp_vector = gzeros(m_dimension,LengthKConstelaltion,
NumParallel);
subscript = gzeros(1,K,NumParallel);
b = gzeros(1,K,NumParallel);
error = gzeros(NumParallel,1);
i = m_dimension;
KCondition_0 = K>LengthConstelaltion;
K1 = KCondition_0*LengthConstelaltion+(1-KCondition_0)*K;
temp_T = gzeros(1,LengthConstelaltion,NumParallel);
gfor pp = 1:NumParallel
for j = 1:LengthConstelaltion
temp_T(1,j,pp) = (Z(i,:,pp)-R(i,i,pp)*Constellation_point(pp,j))^2; % Branch cost
end
Sort_T = sort(temp_T,'ascend'); % Sort the branch costs in ascending order
T(i,1:K1,pp) = Sort_T(1,1:K1,pp); % Select the K partial vectors with the smallest PEDs
for t = 1:K1
s_h(i,t,pp) = Constellation_point(pp,FindData(T(i,t,pp),temp_T(:,:,pp))); % save the detected nodes
end
temp_s = s_h;
for i = m_dimension-1:-1:1
count = 1;
KCondition = K>(LengthConstelaltion)^(m_dimension-i);
K1 = KCondition*(LengthConstelaltion)^(m_dimension-i)+(1-KCondition)*K;
KCondition_1 = K>(LengthConstelaltion)^(m_dimension-i+1);
K2 = KCondition*(KCondition_1*LengthConstelaltion^(m_dimension-i+1)+(1-KCondition_1)*K)+(1-KCondition)*K;
temp_T = gzeros(1,K1*LengthConstelaltion,NumParallel);
for t=1:K1
for j = 1:LengthConstelaltion % Go through all the constellation nodes
temp_s(i,t,pp) = Constellation_point(pp,j);
temp_vector(:,count,pp) = temp_s(:,t,pp);
e(i,count,pp) = (Z(i,1,pp)-R(i,i:m_dimension,pp)*temp_s(i:m_dimension,t,pp))^2;
temp_T(1,count,pp) = T(i+1,t,pp)+e(i,count,pp); % Calculate the PED
count = count+1;
end
end
Sort_T = sort(temp_T,'ascend'); % Sort the branch costs in ascending order
T(i,1:K2,pp) = Sort_T(1,1:K2,pp); % Select the K partial vectors with the smallest PEDs
for t = 1:K2 % Pick the nodes related to the partial vectors
subscript(1,t,pp) = FindData(T(i,t,pp),temp_T(:,:,pp));
end
subscript = sort(subscript,'ascend');
T(i,1:K2,pp) = temp_T(:,subscript(1,1:K2,pp),pp);
s_h(:,1:K2,pp) = temp_vector(:,subscript(1,1:K2,pp),pp); % Save the detected nodes and update the path
temp_s = s_h;
end
% Reach the lowest level
for k = 1:K
b(1,k,pp) = norm(Y_r(:,:,pp)-H_r(:,:,pp)*s_h(:,k,pp))^2; % Calculate K PEDs
end
det_node(:,:,pp) = s_h(:,FindMinimum(b(:,:,pp)),pp); % Pick the vector with the smallest PED and save it
error(pp) = sum(det_node(1:m_dimension,:,pp)~=S_r(1:m_dimension,:,pp));
gend
IdentityMat = geye(2*M_transmit);
gfor pp = 1:NumParallel
transpose_h(:,:,pp) = H_r(:,:,pp)';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*H_r(:,:,pp)+(1/snr)*IdentityMat;
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
Q(:,:,pp) = AfterInverseMat(:,:,pp)*transpose_h(:,:,pp);
Y_r(:,pp) = H_r(:,:,pp)*S_r(:,pp)+noise_r(:,pp); % The real system
gend
global pp
global NumParallel
gfor pp = 1:NumParallel
HH(:,:,pp) = H;
RealTempH(:,:,pp) = H;
ImagTempH(:,:,pp) = H;
YY(:,pp) = Y;
M_receive = (length(Y(:,1)))/2;
RealG = gzeros(2*M_transmit,2*M_receive,2*M_transmit,NumParallel);
ImagG = gzeros(2*M_transmit,2*M_receive,2*M_transmit,NumParallel);
Dimention = length(Constellation_point);
TempY = gzeros(2*M_receive,2*Dimention,NumParallel); % all the candidate Y
TempError = gzeros(1,2*Dimention); % error between the original Y and all the candidate Y
NormQ = gzeros(1,M_transmit);
RealNormQ = gzeros(1,2*M_transmit);
ImagNormQ = gzeros(1,2*M_transmit);
G(:,:,pp) = Q;
for p = 1:M_transmit
NormQ(p) = (norm(G(p,:,pp)))^2; % calculate the norm of each row of G
end
transpose_h(:,:,pp) = RealTempH(:,:,pp)';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*RealTempH(:,:,pp)+(1/snr)*IdentityMat;
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
RealG(:,:,1,pp) = AfterInverseMat(:,:,pp)*transpose_h(:,:,pp);
for qq = 1:2*M_transmit
RealNormQ(qq) = (norm(RealG(qq,:,1,pp)))^2; % calculate the norm of each row of G
end
I_Imag = I_Real+Dimention;
ImagTempH(:,I_Imag,pp) = 0;
transpose_h(:,:,pp) = ImagTempH(:,:,pp)';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*ImagTempH(:,:,pp)+(1/snr)*IdentityMat;
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
ImagG(:,:,1,pp) = AfterInverseMat(:,:,pp)*transpose_h(:,:,pp);
for qq = 1:2*M_transmit
ImagNormQ(qq) = (norm(ImagG(qq,:,1,pp)))^2; % calculate the norm of each row of G
end
gend
%%%%%%%%%%%%%%%%%
% Real part Norm of G
RealVBLASTk = gzeros(1,2*M_transmit-1);
for jj = 1:2*M_transmit-1
RealNormQ(I_Real) = Inf;
for t = 1:jj-1
RealNormQ(RealVBLASTk(t)) = Inf; % set the already-detected norms to infinity
end
[~,Real] = min(RealNormQ); % Real is the index of the minimum norm
RealVBLASTk(jj) = Real;
gfor pp = 1:NumParallel
RealTempH(:,RealVBLASTk(jj),pp) = 0;
transpose_h(:,:,pp) = RealTempH(:,:,pp)';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*RealTempH(:,:,pp)+(1/snr)*IdentityMat;
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
RealG(:,:,jj+1,pp) = AfterInverseMat(:,:,pp)*transpose_h(:,:,pp);
for qq = 1:length(RealTempH(:,1))
RealNormQ(qq) = (norm(RealG(qq,:,jj+1,pp)))^2; % calculate the norm of each row of G
end
gend
end
%%%%%%%%%%%%%%%%%
% Imaginary part Norm of G
ImagVBLASTk = gzeros(1,2*M_transmit-1);
for jj = 1:2*M_transmit-1
ImagNormQ(I_Imag) = Inf;
for t = 1:jj-1
ImagNormQ(ImagVBLASTk(t)) = Inf; % set the already-detected norms to infinity
end
[~,Imag] = min(ImagNormQ); % Imag is the index of the minimum norm
ImagVBLASTk(jj) = Imag;
gfor pp = 1:NumParallel
ImagTempH(:,ImagVBLASTk(jj),pp) = 0;
transpose_h(:,:,pp) = ImagTempH(:,:,pp)';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*ImagTempH(:,:,pp)+(1/snr)*IdentityMat;
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
ImagG(:,:,jj+1,pp) = AfterInverseMat(:,:,pp)*transpose_h(:,:,pp);
for qq = 1:length(ImagTempH(:,1))
ImagNormQ(qq) = (norm(ImagG(qq,:,jj+1,pp)))^2; % calculate the norm of each row of G
end
gend
end
gfor pp = 1:NumParallel
%%%%%%%%%%%%%%%%%%
% Real part V-BLAST
RealSymbolTestRow = gzeros(Dimention,2*M_transmit,NumParallel);
RealVBLASTSymbolTest = gzeros(2*M_transmit-1,1,NumParallel);
for tt = 1:Dimention
TestY = gzeros(2*M_receive,NumParallel);
TestH = gzeros(2*M_receive,2*M_transmit,NumParallel);
TestY(:,pp) = Y;
TestH(:,:,pp) = H;
TestY(:,pp) = TestY(:,pp)-Constellation_point(tt)*TestH(:,I_Real,pp);
TestH(:,I_Real,pp) = 0;
for jj = 1:2*M_transmit-1
shk = gzeros(1,NumParallel);
% nulling
shk(:,pp) = RealG(RealVBLASTk(jj),:,jj,pp)*TestY(:,pp);
% Slicing
[~,RealValue] = quantiz(shk(:,pp),partition,Constellation_point);
RealVBLASTSymbolTest(jj,1,pp) = RealValue; % get the detected symbol
% interference cancellation
TestY(:,pp) = TestY(:,pp)-RealValue*TestH(:,RealVBLASTk(jj),pp);
TestH(:,RealVBLASTk(jj),pp) = 0; % zero out the used channel column
end
RealSymbolTestRow(tt,:,pp) = [Constellation_point(tt) RealVBLASTSymbolTest(:,1,pp).'];
end
%%%%%%%%%%%%%%%%%%
% Imaginary part V-BLAST
ImagSymbolTestRow = gzeros(Dimention,2*M_transmit,NumParallel);
ImagVBLASTSymbolTest = gzeros(2*M_transmit-1,NumParallel);
for tt = 1:Dimention
TestY = gzeros(2*M_receive,NumParallel);
TestH = gzeros(2*M_receive,2*M_transmit,NumParallel);
TestY(:,pp) = Y;
TestH(:,:,pp) = H;
TestY(:,pp) = TestY(:,pp)-Constellation_point(tt)*TestH(:,I_Imag,pp);
TestH(:,I_Imag,pp) = 0;
for jj = 1:2*M_transmit-1
shk = gzeros(1,NumParallel);
% nulling
shk(:,pp) = ImagG(ImagVBLASTk(jj),:,jj,pp)*TestY(:,pp);
% Slicing
[~,ImagValue] = quantiz(shk(:,pp),partition,Constellation_point);
ImagVBLASTSymbolTest(jj,pp) = ImagValue; % get the detected symbol
TestY(:,pp) = TestY(:,pp)-ImagValue*TestH(:,ImagVBLASTk(jj),pp); % interference cancellation
TestH(:,ImagVBLASTk(jj),pp) = 0; % zero out the used channel column
end
ImagSymbolTestRow(tt,:,pp) = [Constellation_point(tt) ImagVBLASTSymbolTest(:,pp).'];
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Symbol Errors
Realorder = [I_Real RealVBLASTk];
Imagorder = [I_Imag ImagVBLASTk];
RealTotalSymbol(:,:,pp) = Symbol_ColumnExchangeBack(
RealSymbolTestRow(:,:,pp),Realorder);
ImagTotalSymbol(:,:,pp) = Symbol_ColumnExchangeBack(
ImagSymbolTestRow(:,:,pp),Imagorder);
TotalSymbol(:,:,pp) = [RealTotalSymbol(:,:,pp)
ImagTotalSymbol(:,:,pp)];
for kk = 1:2*Dimention
TempY(:,kk,pp) = HH(:,:,pp)*TotalSymbol(:,kk,pp);
TempError(kk) = norm(YY(:,pp)-TempY(:,kk,pp))^2;
end
[~,Index] = min(TempError);
s_det = gzeros(2*M_transmit,NumParallel);
s_det(:,pp) = TotalSymbol(:,Index,pp);
ss(:,pp) = s;
for i = 1:2*M_transmit
Count = s_det(i,pp)-ss(i,pp);
condition = (Count ~= 0);
error = error+condition;
end
gend
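The nulling / slicing / interference-cancellation structure of the V-BLAST stage above can be restated as a short NumPy sketch. This is a minimal zero-forcing illustration under assumed toy dimensions, not the GPU implementation; `vblast_detect` and its inputs are hypothetical names introduced here for clarity.

```python
import numpy as np

def vblast_detect(H, y, constellation):
    """Illustrative ZF V-BLAST: null, slice, cancel, layer by layer."""
    H = H.astype(float)
    y = y.astype(float).copy()
    n = H.shape[1]
    s_hat = np.zeros(n)
    remaining = list(range(n))
    for _ in range(n):
        G = np.linalg.pinv(H[:, remaining])        # nulling matrix
        norms = np.linalg.norm(G, axis=1)
        j = int(np.argmin(norms))                  # strongest layer first
        k = remaining[j]
        soft = G[j] @ y                            # nulling
        s_hat[k] = min(constellation, key=lambda c: abs(c - soft))  # slicing
        y = y - s_hat[k] * H[:, k]                 # interference cancellation
        remaining.pop(j)
    return s_hat

# Assumed toy channel and symbols (noiseless, so detection is exact).
H = np.array([[1.0, 0.2], [0.1, 1.0], [0.3, -0.4]])
s_true = np.array([1.0, -1.0])
s_hat = vblast_detect(H, H @ s_true, [-1.0, 1.0])
```

In the noiseless case the pseudo-inverse recovers each layer exactly, so the sliced symbols match `s_true`.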
global NumParallel
error = gzeros(NumParallel,1);
num_nodes = 0;
ConstellationSzie = length(Constellation_point);
HH(:,:,pp) = H_r;
RealZ = gzeros(m_dimension,NumParallel);
ImagZ = gzeros(m_dimension,NumParallel);
IdentityMat = geye(2*M_transmit);
gfor pp = 1:NumParallel
transpose_h(:,:,pp) = HH(:,:,pp).';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*HH(:,:,pp)+(1/
snr)*IdentityMat;
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
Vblast_Q(:,:,pp) = AfterInverseMat(:,:,pp)*transpose_h
(:,:,pp);
gend
G = Vblast_Q;
NormQ = gzeros(1,M_transmit);
RealOriginal_order = gdouble(1:m_dimension);
ImagOriginal_order = gdouble(1:m_dimension);
gfor pp = 1:NumParallel
for p = 1:M_transmit
NormQ(p) = (norm(G(p,:,pp)))^2; % squared norm of
the p-th row of G
end
[~,I_Real] = max(NormQ); % I_Real is the index of
the maximum value
I_Imag = I_Real+ConstellationSzie;
RealTempOrder = RealOriginal_order(m_dimension);
RealOriginal_order(m_dimension) = I_Real;
RealOriginal_order(I_Real) = RealTempOrder;
ImagTempOrder = ImagOriginal_order(m_dimension);
ImagOriginal_order(m_dimension) = I_Imag;
ImagOriginal_order(I_Imag) = ImagTempOrder;
Realnew_H(:,:,pp) = Channel_ColumnExchange(HH(:,:,pp),
RealOriginal_order);
Realnew_H(:,m_dimension,pp) = 0;
[RealQ(:,:,pp),RealR(:,:,pp)] = qr(Realnew_H(:,:,pp)); % QR
factorization
for k = 1:M_receive
QRCondition = RealR(k,k,pp)<0;
RealQ(:,k,pp) = (1-QRCondition)*RealQ(:,k,pp)+
QRCondition*RealQ(:,k,pp)*(-1);
RealR(k,:,pp) = (1-QRCondition)*RealR(k,:,pp)+
QRCondition*RealR(k,:,pp)*(-1);
end
Y_r = Y_ori;
%%%%%%%%%%%%%%%%%%%%
RealT = gzeros(m_dimension,K);
Reals_h = gzeros(m_dimension,K);
Reale = gzeros(m_dimension,K*length(Constellation_point));
Realtemp_vector = gzeros(m_dimension,K*length(
Constellation_point));
Realsubscript = gzeros(1,K);
KCondition_0 = K>length(Constellation_point);
K1 = KCondition_0*length(Constellation_point)+(1-
KCondition_0)*K;
RealK_s = gdouble([]);
for cc = 1:ConstellationSzie
for tt = 1:K1
Reals_h(m_dimension,tt) = Constellation_point(cc);
end
Y_r = Y_r-Constellation_point(cc)*HH(:,I_Real,pp);
RealZ(:,pp) = RealQ(:,:,pp).'*Y_r;
ii = m_dimension-1;
Realtemp_T = gzeros(1,length(Constellation_point));
for j = 1:length(Constellation_point)
Realtemp_T(j) = (RealZ(ii,pp)-RealR(ii,ii,pp)*
Constellation_point(j))^2; % Branch cost
end
RealSort_T = sort(Realtemp_T,'ascend'); % Sort the
branch costs in ascending order
RealT(ii,1:K1) = RealSort_T(1:K1); % Select the K
partial vectors that have the smallest PEDs
for t = 1:K1
Reals_h(ii,t) = Constellation_point(FindData(RealT(
ii,t),Realtemp_T)); % save the detected nodes
end
Realtemp_s = Reals_h;
for i = m_dimension-2:-1:1
% i-th node(i<m_dimension)
count = 1;
KCondition = K>(length(Constellation_point))^(
m_dimension-i);
K1 = KCondition*(length(Constellation_point))^(
m_dimension-i)+(1-KCondition)*K;
KCondition_1 = K>(length(Constellation_point))^(
m_dimension-i+1);
K2 = KCondition*(KCondition_1*length(
Constellation_point)^(m_dimension-i+1)+(1-
KCondition_1)*K)+(1-KCondition)*K;
length_T = K1*length(Constellation_point);
Realtemp_T = gzeros(1,length_T);
for t=1:K1
for j = 1:length(Constellation_point)
% Go through all the
constellation nodes
Realtemp_s(i,t) = Constellation_point(j);
Realtemp_vector(:,count) = Realtemp_s(:,t);
rs = 0;
for n = i:m_dimension
rs = rs+RealR(i,n,pp)*Realtemp_s(n,t);
% Calculate the branch cost for
each level
end
Reale(i,count) = (RealZ(i,pp)-rs)^2;
Realtemp_T(count) = RealT(i+1,t)+Reale(i,
count); % Calculate the PED
count = count+1;
end
end
RealSort_T = sort(Realtemp_T,'ascend'); % Sort the
branch costs in ascending order
RealT(i,1:K2) = RealSort_T(1:K2); % Select the K
partial vectors that have the smallest PEDs
for t = 1:K2 % Pick the
nodes related to the partial vectors
Realsubscript(t) = FindData(RealT(i,t),
Realtemp_T);
end
Realsubscript = sort(Realsubscript,'ascend');
for q = 1:K2
RealT(i,q) = Realtemp_T(Realsubscript(q));
Reals_h(:,q) = Realtemp_vector(:,Realsubscript(q
)); % Save the detected nodes and Update the
path
end
Realtemp_s = Reals_h;
end
RealK_s = [RealK_s Reals_h];
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Imagnew_H(:,:,pp) = Channel_ColumnExchange(HH(:,:,pp),
ImagOriginal_order);
Imagnew_H(:,m_dimension,pp) = 0;
[ImagQ(:,:,pp),ImagR(:,:,pp)] = qr(Imagnew_H(:,:,pp));
% QR factorization
for k = 1:M_receive
QRCondition = ImagR(k,k,pp)<0;
ImagQ(:,k,pp) = (1-QRCondition)*ImagQ(:,k,pp)+
QRCondition*ImagQ(:,k,pp)*(-1);
ImagR(k,:,pp) = (1-QRCondition)*ImagR(k,:,pp)+
QRCondition*ImagR(k,:,pp)*(-1);
end
Y_r = Y_ori;
%%%%%%%%%%%%%%%%%%%%
ImagT = gzeros(m_dimension,K);
Imags_h = gzeros(m_dimension,K);
Image = gzeros(m_dimension,K*length(Constellation_point));
Imagtemp_vector = gzeros(m_dimension,K*length(
Constellation_point));
Imagsubscript = gzeros(1,K);
ImagK_s = gdouble([]);
for cc = 1:ConstellationSzie
for tt = 1:K1
Imags_h(m_dimension,tt) = Constellation_point(cc);
end
Y_r = Y_r-Constellation_point(cc)*Imagnew_H(:,
m_dimension,pp);
ImagZ(:,pp) = ImagQ(:,:,pp).'*Y_r;
ii = m_dimension-1;
KCondition_0 = K>length(Constellation_point);
K1 = KCondition_0*length(Constellation_point)+(1-
KCondition_0)*K;
Imagtemp_T = gzeros(1,length(Constellation_point));
for j = 1:length(Constellation_point)
Imagtemp_T(j) = (ImagZ(ii,pp)-ImagR(ii,ii,pp)*
Constellation_point(j))^2; % Branch cost
num_nodes = num_nodes+1;
end
ImagSort_T = sort(Imagtemp_T,'ascend'); % Sort the
branch costs in ascending order
ImagT(ii,1:K1) = ImagSort_T(1:K1); % Select the K
partial vectors that have the smallest PEDs
for t = 1:K1
Imags_h(ii,t) = Constellation_point(FindData(ImagT(
ii,t),Imagtemp_T)); % save the detected nodes
end
Imagtemp_s = Imags_h;
for i = m_dimension-2:-1:1
% i-th node(i<m_dimension)
count = 1;
KCondition = K>(length(Constellation_point))^(
m_dimension-i);
K1 = KCondition*(length(Constellation_point))^(
m_dimension-i)+(1-KCondition)*K;
KCondition_1 = K>(length(Constellation_point))^(
m_dimension-i+1);
K2 = KCondition*(KCondition_1*length(
Constellation_point)^(m_dimension-i+1)+(1-
KCondition_1)*K)+(1-KCondition)*K;
length_T = K1*length(Constellation_point);
Imagtemp_T = gzeros(1,length_T);
for t=1:K1
for j = 1:length(Constellation_point) %
Go through all the constellation nodes
Imagtemp_s(i,t) = Constellation_point(j);
Imagtemp_vector(:,count) = Imagtemp_s(:,t);
rs = 0;
for n = i:m_dimension
rs = rs+ImagR(i,n,pp)*Imagtemp_s(n,t); %
Calculate the branch cost for each
level
end
Image(i,count) = (ImagZ(i,pp)-rs)^2;
num_nodes = num_nodes+1;
Imagtemp_T(count) = ImagT(i+1,t)+Image(i,
count); % Calculate the PED
count = count+1;
end
end
ImagSort_T = sort(Imagtemp_T,'ascend'); % Sort the
branch costs in ascending order
ImagT(i,1:K2) = ImagSort_T(1:K2); % Select the K
partial vectors that have the smallest PEDs
for t = 1:K2 % Pick the
nodes related to the partial vectors
Imagsubscript(t) = FindData(ImagT(i,t),
Imagtemp_T);
end
Imagsubscript = sort(Imagsubscript,'ascend');
for q = 1:K2
ImagT(i,q) = Imagtemp_T(Imagsubscript(q));
Imags_h(:,q) = Imagtemp_vector(:,Imagsubscript(q
)); % Save the detected nodes and Update the
path history for each retained path
end
Imagtemp_s = Imags_h;
end
ImagK_s = [ImagK_s Imags_h];
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Real part
RealTempVector(:,:,pp) = Symbol_ColumnExchangeBack(RealK_s(:,:)
.',RealOriginal_order);
ImagTempVector(:,:,pp) = Symbol_ColumnExchangeBack(ImagK_s(:,:)
.',ImagOriginal_order);
s_total(:,:,pp) = [RealTempVector(:,:,pp).'
ImagTempVector(:,:,pp).'];
% Reach the lowest level
b = gzeros(1,2*K*ConstellationSzie);
for k = 1:2*K*ConstellationSzie
b(k) = norm(Y_r-H_r*s_total(:,k,pp))^2; % Calculate K
PEDs
end
[MinSub,~] = FindMinimum(b);
det_node = gzeros(m_dimension,NumParallel);
det_node(:,pp) = s_total(:,MinSub,pp); % Pick the
vector which has the smallest PED and save it
end
% Symbol Errors
for i = 1:2*M_transmit
Count = det_node(i,pp)-S_r(i,pp);
condition = (Count ~= 0);
error = error+condition;
end
gend
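The breadth-first K-Best search carried out above (compute the branch costs at each level, sort the partial Euclidean distances, keep only the K smallest survivors) can be sketched in NumPy. This is an illustrative restatement under assumed toy inputs, not the GPU listing; `k_best` and its arguments are hypothetical names.

```python
import numpy as np

def k_best(R, z, constellation, K):
    """Illustrative K-Best tree search over an upper-triangular system z = R s."""
    m = R.shape[0]
    # Each survivor is (PED, symbols chosen from level m-1 down to the current level).
    survivors = [(0.0, [])]
    for i in range(m - 1, -1, -1):
        candidates = []
        for ped, path in survivors:
            for s in constellation:
                full = path + [s]
                # Branch cost at level i uses the symbols already fixed below it;
                # full[k] is the symbol for index m-1-k.
                est = sum(R[i, m - 1 - k] * full[k] for k in range(len(full)))
                candidates.append((ped + (z[i] - est) ** 2, full))
        candidates.sort(key=lambda c: c[0])   # sort PEDs in ascending order
        survivors = candidates[:K]            # keep the K best partial vectors
    _, path = survivors[0]
    return np.array(path[::-1])               # reorder as s[0..m-1]

# Assumed toy system: noiseless, so the best survivor matches the true symbols.
R = np.array([[2.0, 0.5], [0.0, 1.5]])
s_true = np.array([1.0, -1.0])
z = R @ s_true
s_hat = k_best(R, z, [-1.0, 1.0], K=2)
```

With K at least as large as the constellation size at the top level, the noiseless search returns the transmitted vector.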
% M_receive: number of receive antennas
% m_dimension: search depth (number of levels)
% K: number of best nodes retained at each level
% H_r: real-valued channel matrix
% Y_r: real-valued received signal y
% K_Layer: number of layers to be detected with K-Best
% snr: signal-to-noise ratio
% Constellation_point: real-valued constellation
% partition: constellation point partition
% output: SymbolError: the matrix of the detected nodes at each
% level
global pp
global NumParallel
error = gzeros(NumParallel,1);
V_BLAST_Layer = m_dimension-K_Layer;
num_nodes = 0;
ConstellationSzie = length(Constellation_point);
IdentityMat = geye(2*M_transmit);
gfor pp = 1:NumParallel
HH(:,:,pp) = H_r;
RealTempH(:,:,pp) = H_r;
ImagTempH(:,:,pp) = H_r;
transpose_h(:,:,pp) = HH(:,:,pp).';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*HH(:,:,pp)+(1/
snr)*IdentityMat;
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
Vblast_Q(:,:,pp) = AfterInverseMat(:,:,pp)*transpose_h
(:,:,pp);
gend
G = Vblast_Q;
RealG = gzeros(2*M_transmit,2*M_receive,2*M_transmit,
NumParallel);
ImagG = gzeros(2*M_transmit,2*M_receive,2*M_transmit,
NumParallel);
NormQ = gzeros(1,M_transmit);
RealNormQ = gzeros(1,2*M_transmit);
ImagNormQ = gzeros(1,2*M_transmit);
gfor pp = 1:NumParallel
for p = 1:M_transmit
NormQ(p) = (norm(G(p,:,pp)))^2; % squared norm of
the p-th row of G
end
[~,I_Real] = max(NormQ); % I_Real is the
index of the maximum value
RealTempH(:,I_Real,pp) = 0;
transpose_h(:,:,pp) = RealTempH(:,:,pp).';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*RealTempH(:,:,
pp)+(1/snr)*IdentityMat;
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
RealG(:,:,1,pp) = AfterInverseMat(:,:,pp)*transpose_h
(:,:,pp);
for qq = 1:2*M_transmit
RealNormQ(qq) = (norm(RealG(qq,:,1,pp)))^2; % squared
norm of the qq-th row of RealG
I_Imag = I_Real+ConstellationSzie;
ImagTempH(:,I_Imag,pp) = 0;
transpose_h(:,:,pp) = ImagTempH(:,:,pp).';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*ImagTempH(:,:,
pp)+(1/snr)*IdentityMat;
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
ImagG(:,:,1,pp) = AfterInverseMat(:,:,pp)*transpose_h
(:,:,pp);
for qq = 1:2*M_transmit
ImagNormQ(qq) = (norm(ImagG(qq,:,1,pp)))^2; % squared
norm of the qq-th row of ImagG
end
gend
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Real part Norm G
RealVBLASTk = gzeros(1,2*M_transmit-1);
for jj = 1:2*M_transmit-1
gfor pp = 1:NumParallel
RealNormQ(I_Real) = Inf;
for t = 1:jj-1
RealNormQ(RealVBLASTk(t)) = Inf; % set the detected
normal value to infinity
end
[~,Real] = min(RealNormQ); % Real is the
index of the minimum value
RealVBLASTk(jj) = Real;
RealTempH(:,RealVBLASTk(jj),pp) = 0;
transpose_h(:,:,pp) = RealTempH(:,:,pp).';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*
RealTempH(:,:,pp)+(1/snr)*IdentityMat;
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
RealG(:,:,jj+1,pp) = AfterInverseMat(:,:,pp)*
transpose_h(:,:,pp);
for qq = 1:length(RealTempH(:,1))
RealNormQ(qq) = (norm(RealG(qq,:,jj+1,pp)))^2; %
squared norm of the qq-th row of RealG
end
gend
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Imaginary part Norm G
ImagVBLASTk = gzeros(1,2*M_transmit-1);
for jj = 1:2*M_transmit-1
gfor pp = 1:NumParallel
ImagNormQ(I_Imag) = Inf;
for t = 1:jj-1
ImagNormQ(ImagVBLASTk(t)) = Inf; % set the detected
normal value to infinity
end
[~,Imag] = min(ImagNormQ); % Imag is the
index of the minimum value
ImagVBLASTk(jj) = Imag;
ImagTempH(:,ImagVBLASTk(jj),pp) = 0;
transpose_h(:,:,pp) = ImagTempH(:,:,pp).';
InverseMat(:,:,pp) = transpose_h(:,:,pp)*
ImagTempH(:,:,pp)+(1/snr)*IdentityMat;
gend
AfterInverseMat = NewInverse(InverseMat);
gfor pp = 1:NumParallel
ImagG(:,:,jj+1,pp) = AfterInverseMat(:,:,pp)*
transpose_h(:,:,pp);
for qq = 1:length(ImagTempH(:,1))
ImagNormQ(qq) = (norm(ImagG(qq,:,jj+1,pp)))^2; %
squared norm of the qq-th row of ImagG
end
gend
end
gfor pp = 1:NumParallel
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Real part V-BLAST
RealSymbolTestRow = gzeros(ConstellationSzie,V_BLAST_Layer,
NumParallel);
RealVBLASTSymbolTest = gzeros(V_BLAST_Layer-1,1,NumParallel)
;
for tt = 1:ConstellationSzie
TestY = gzeros(m_dimension,NumParallel);
TestY(:,pp) = Y_r;
TestH = HH;
TestY(:,pp) = TestY(:,pp)-Constellation_point(tt)*TestH
(:,I_Real,pp);
TestH(:,I_Real,pp) = 0;
for jj = 1:V_BLAST_Layer-1
shk = gzeros(1,NumParallel);
% nulling
shk(:,pp) = RealG(RealVBLASTk(jj),:,jj,pp)*TestY(:,
pp);
% Slicing
[~,RealValue] = quantization(shk(:,pp),partition,
Constellation_point);
RealVBLASTSymbolTest(jj,1,pp) = RealValue; % get
the real detected symbol
% interference cancellation
TestY(:,pp) = TestY(:,pp)-RealValue*TestH(:,
RealVBLASTk(jj),pp);
TestH(:,RealVBLASTk(jj),pp) = 0; % set the used
channel into 0
end
RealSymbolTestRow(tt,:,pp) = [Constellation_point(tt)
RealVBLASTSymbolTest(:,1,pp).'];
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Imaginary part V-BLAST
ImagSymbolTestRow = gzeros(ConstellationSzie,V_BLAST_Layer,
NumParallel);
ImagVBLASTSymbolTest = gzeros(V_BLAST_Layer-1,NumParallel);
for tt = 1:ConstellationSzie
TestY = gzeros(m_dimension,NumParallel);
TestY(:,pp) = Y_r;
TestH = HH;
TestY(:,pp) = TestY(:,pp)-Constellation_point(tt)*TestH
(:,I_Imag,pp);
TestH(:,I_Imag,pp) = 0;
for jj = 1:V_BLAST_Layer-1
shk = gzeros(1,NumParallel);
% nulling
shk(:,pp) = ImagG(ImagVBLASTk(jj),:,jj,pp)*TestY(:,
pp);
% Slicing
[~,ImagValue] = quantization(shk(:,pp),partition,
Constellation_point);
ImagVBLASTSymbolTest(jj,pp) = ImagValue; % get
the imaginary detected symbol
% interference cancellation
TestY(:,pp) = TestY(:,pp)-ImagValue*TestH(:,
ImagVBLASTk(jj),pp);
TestH(:,ImagVBLASTk(jj),pp) = 0; % set the used
channel into 0
end
ImagSymbolTestRow(tt,:,pp) = [Constellation_point(tt)
ImagVBLASTSymbolTest(:,pp).'];
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Realorder = [I_Real RealVBLASTk(1:V_BLAST_Layer-1)];
Imagorder = [I_Imag ImagVBLASTk(1:V_BLAST_Layer-1)];
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Realorder_count = 1;
Realnew_order = gzeros(1,K_Layer);
for i = m_dimension-1:-1:V_BLAST_Layer
Realnew_order(Realorder_count) = RealVBLASTk(i);
Realorder_count = Realorder_count+1;
end
%%%%%%%%%%%%%%%%%%%%
Realtotal_order = [Realorder Realnew_order];
Realnew_H = Channel_ColumnExchange(HH(:,:,pp),
Realtotal_order);
[RealQ(:,:,pp),RealR(:,:,pp)] = qr(Realnew_H(:,:,pp)); %
QR factorization
for k = 1:length(Y_r(:,1))
QRCondition = RealR(k,k,pp)<0;
RealQ(:,k,pp) = (1-QRCondition)*RealQ(:,k,pp)+
QRCondition*RealQ(:,k,pp)*(-1);
RealR(k,:,pp) = (1-QRCondition)*RealR(k,:,pp)+
QRCondition*RealR(k,:,pp)*(-1);
end
RealZ(:,pp) = RealQ(:,:,pp).'*Y_r;
%%%%%%%%%%%%%%%%%%%%
% Real part K-Best
RealT = gzeros(m_dimension,K);
Reals_h = gzeros(m_dimension,K);
Reale = gzeros(m_dimension,K*length(Constellation_point));
Realtemp_vector = gzeros(m_dimension,K*length(
Constellation_point));
Realsubscript = gzeros(1,K);
i = m_dimension;
KCondition_0 = K>length(Constellation_point);
K1 = KCondition_0*length(Constellation_point)+(1-
KCondition_0)*K;
Realtemp_T = gzeros(1,length(Constellation_point));
for j = 1:length(Constellation_point)
Realtemp_T(j) = (RealZ(i,pp)-RealR(i,i,pp)*
Constellation_point(j))^2; % Branch cost
num_nodes = num_nodes+1;
end
RealSort_T = sort(Realtemp_T,'ascend'); % Sort the branch
costs in ascending order
RealT(i,1:K1) = RealSort_T(1:K1); % Select the K partial
vectors that have the smallest PEDs
for t = 1:K1
Reals_h(i,t) = Constellation_point(FindData(RealT(i,t),
Realtemp_T)); % save the detected nodes
end
Realtemp_s = Reals_h;
for i = m_dimension-1:-1:m_dimension-K_Layer+1
% i-th node(i<m_dimension)
count = 1;
KCondition = K>(length(Constellation_point))^(
m_dimension-i);
K1 = KCondition*(length(Constellation_point))^(
m_dimension-i)+(1-KCondition)*K;
KCondition_1 = K>(length(Constellation_point))^(
m_dimension-i+1);
K2 = KCondition*(KCondition_1*length(Constellation_point
)^(m_dimension-i+1)+(1-KCondition_1)*K)+(1-KCondition)
*K;
length_T = K1*length(Constellation_point);
Realtemp_T = gzeros(1,length_T);
for t=1:K1
for j = 1:length(Constellation_point)
% Go through all the constellation
nodes
Realtemp_s(i,t) = Constellation_point(j);
Realtemp_vector(:,count) = Realtemp_s(:,t);
rs = 0;
for n = i:m_dimension
rs = rs+RealR(i,n,pp)*Realtemp_s(n,t);
% Calculate the branch cost for each
level
end
Reale(i,count) = (RealZ(i,pp)-rs)^2;
num_nodes = num_nodes+1;
Realtemp_T(count) = RealT(i+1,t)+Reale(i,count);
% Calculate the PED
count = count+1;
end
end
RealSort_T = sort(Realtemp_T,'ascend'); % Sort the
branch costs in ascending order
RealT(i,1:K2) = RealSort_T(1:K2); % Select the K
partial vectors that have the smallest PEDs
for t = 1:K2 % Pick the
nodes related to the partial vectors
Realsubscript(t) = FindData(RealT(i,t),Realtemp_T);
end
Realsubscript = sort(Realsubscript,'ascend');
for q = 1:K2
RealT(i,q) = Realtemp_T(Realsubscript(q));
Reals_h(:,q) = Realtemp_vector(:,Realsubscript(q));
% Save the detected nodes and Update the path
end
Realtemp_s = Reals_h;
end
RealK_s(:,:,pp) = Reals_h(V_BLAST_Layer+1:m_dimension,:);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Imagorder_count = 1;
Imagnew_order = gzeros(1,K_Layer);
for i = m_dimension-1:-1:V_BLAST_Layer
Imagnew_order(Imagorder_count) = ImagVBLASTk(i);
Imagorder_count = Imagorder_count+1;
end
%%%%%%%%%%%%%%%%%%%%
Imagtotal_order = [Imagorder Imagnew_order];
Imagnew_H = Channel_ColumnExchange(HH(:,:,pp),
Imagtotal_order);
[ImagQ(:,:,pp),ImagR(:,:,pp)] = qr(Imagnew_H(:,:,pp));
% QR factorization
for k = 1:length(Y_r(:,1))
QRCondition = ImagR(k,k,pp)<0;
ImagQ(:,k,pp) = (1-QRCondition)*ImagQ(:,k,pp)+
QRCondition*ImagQ(:,k,pp)*(-1);
ImagR(k,:,pp) = (1-QRCondition)*ImagR(k,:,pp)+
QRCondition*ImagR(k,:,pp)*(-1);
end
ImagZ(:,pp) = ImagQ(:,:,pp).'*Y_r;
%%%%%%%%%%%%%%%%%%%%
% Imaginary part K-Best
ImagT = gzeros(m_dimension,K);
Imags_h = gzeros(m_dimension,K);
Image = gzeros(m_dimension,K*length(Constellation_point)
);
Imagtemp_vector = gzeros(m_dimension,K*length(
Constellation_point));
Imagsubscript = gzeros(1,K);
i = m_dimension;
KCondition_0 = K>length(Constellation_point);
K1 = KCondition_0*length(Constellation_point)+(1-
KCondition_0)*K;
Imagtemp_T = gzeros(1,length(Constellation_point));
for j = 1:length(Constellation_point)
Imagtemp_T(j) = (ImagZ(i,pp)-ImagR(i,i,pp)*
Constellation_point(j))^2; % Branch cost
num_nodes = num_nodes+1;
end
ImagSort_T = sort(Imagtemp_T,'ascend'); % Sort the
branch costs in ascending order
ImagT(i,1:K1) = ImagSort_T(1:K1); % Select the K
partial vectors that have the smallest PEDs
for t = 1:K1
Imags_h(i,t) = Constellation_point(FindData(ImagT(i,t),
Imagtemp_T)); % save the detected nodes
end
Imagtemp_s = Imags_h;
for i = m_dimension-1:-1:m_dimension-K_Layer+1
% i-th node(i<m_dimension)
count = 1;
KCondition = K>(length(Constellation_point))^(
m_dimension-i);
K1 = KCondition*(length(Constellation_point))^(
m_dimension-i)+(1-KCondition)*K;
KCondition_1 = K>(length(Constellation_point))^(
m_dimension-i+1);
K2 = KCondition*(KCondition_1*length(Constellation_point
)^(m_dimension-i+1)+(1-KCondition_1)*K)+(1-KCondition)
*K;
length_T = K1*length(Constellation_point);
Imagtemp_T = gzeros(1,length_T);
for t=1:K1
for j = 1:length(Constellation_point) % Go
through all the constellation nodes
Imagtemp_s(i,t) = Constellation_point(j);
Imagtemp_vector(:,count) = Imagtemp_s(:,t);
rs = 0;
for n = i:m_dimension
rs = rs+ImagR(i,n,pp)*Imagtemp_s(n,t); %
Calculate the branch cost for each level
end
Image(i,count) = (ImagZ(i,pp)-rs)^2;
num_nodes = num_nodes+1;
Imagtemp_T(count) = ImagT(i+1,t)+Image(i,count);
% Calculate the PED
count = count+1;
end
end
ImagSort_T = sort(Imagtemp_T,'ascend'); % Sort the
branch costs in ascending order
ImagT(i,1:K2) = ImagSort_T(1:K2); % Select the K
partial vectors that have the smallest PEDs
for t = 1:K2 % Pick the
nodes related to the partial vectors
Imagsubscript(t) = FindData(ImagT(i,t),Imagtemp_T);
end
Imagsubscript = sort(Imagsubscript,'ascend');
for q = 1:K2
ImagT(i,q) = Imagtemp_T(Imagsubscript(q));
Imags_h(:,q) = Imagtemp_vector(:,Imagsubscript(q));
% Save the detected nodes and Update the path
end
Imagtemp_s = Imags_h;
end
ImagK_s(:,:,pp) = Imags_h(V_BLAST_Layer+1:m_dimension,:);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Real part total Symbol
Realtemp_s_total = gzeros(m_dimension,K*
ConstellationSzie,NumParallel);
RealTotalSymbol = 0;
for ll = 1:ConstellationSzie
for vv = 1:K
RealTotalSymbol = RealTotalSymbol+1;
Realtemp_s_total(RealTotalSymbol,:,pp) = [
RealSymbolTestRow(ll,:,pp) RealK_s(:,vv,pp)
.'];
end
end
% Imaginary part total Symbol
Imagtemp_s_total = gzeros(m_dimension,K*
ConstellationSzie,NumParallel);
ImagTotalSymbol = 0;
for ll = 1:ConstellationSzie
for vv = 1:K
ImagTotalSymbol = ImagTotalSymbol+1;
Imagtemp_s_total(ImagTotalSymbol,:,pp) = [
ImagSymbolTestRow(ll,:,pp) ImagK_s(:,vv,pp)
.'];
end
end
RealTempVector(:,:,pp) = Symbol_ColumnExchangeBack(
Realtemp_s_total(:,:,pp),Realtotal_order);
ImagTempVector(:,:,pp) = Symbol_ColumnExchangeBack(
Imagtemp_s_total(:,:,pp),Imagtotal_order);
s_total(:,:,pp) = [RealTempVector(:,:,pp).'
ImagTempVector(:,:,pp).'];
% Reach the lowest level
b = gzeros(1,2*K*ConstellationSzie);
for k = 1:2*K*ConstellationSzie
b(k) = norm(Y_r-H_r*s_total(:,k,pp))^2; % Calculate K
PEDs
end
[MinSub,~] = FindMinimum(b);
det_node = gzeros(m_dimension,NumParallel);
det_node(:,pp) = s_total(:,MinSub,pp); % Pick the
vector which has the smallest PED and save it
gend
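The nulling matrix that each routine above rebuilds through NewInverse is the MMSE pseudo-inverse G = (H^T H + (1/snr) I)^(-1) H^T, whose squared row norms fix the V-BLAST detection order. A minimal NumPy sketch under an assumed random real-valued channel (the sizes and snr value are illustrative assumptions):

```python
import numpy as np

# Assumed example dimensions and SNR; H models the real-valued expansion
# of a complex M_receive x M_transmit channel.
rng = np.random.default_rng(0)
M_receive, M_transmit, snr = 4, 4, 10.0
H = rng.standard_normal((2 * M_receive, 2 * M_transmit))

# MMSE nulling matrix: solve (H^T H + (1/snr) I) G = H^T instead of
# forming the inverse explicitly.
G = np.linalg.solve(H.T @ H + np.eye(2 * M_transmit) / snr, H.T)

# Squared row norms of G: the row with the smallest norm corresponds to
# the layer with the highest post-detection SNR.
row_norms = np.linalg.norm(G, axis=1) ** 2
```

Using `solve` rather than an explicit inverse is the numerically preferred form of the same computation.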