834 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 21, NO.
5, MAY 2013
Novel MIMO Detection Algorithm for High-Order
Constellations in the Complex Domain
Mojtaba Mahdavi and Mahdi Shabany
Abstract: A novel detection algorithm with an efficient VLSI architecture featuring efficient operation over infinite complex lattices is proposed. The proposed design results in the highest throughput, the lowest latency, and the lowest energy compared to the complex-domain VLSI implementations to date. The main innovations are a novel complex-domain means of expanding/visiting the intermediate nodes of the search tree on demand, rather than exhaustively, as well as a new distributed sorting scheme to keep track of the best candidates at each search phase. Its support of unbounded infinite lattice decoding distinguishes the present method from previous K-Best strategies and also allows its complexity to scale sublinearly with the modulation order. Since the expansion and sorting cores are data-driven, the architecture is well suited for a pipelined parallel VLSI implementation. The proposed algorithm is used to fabricate a 4×4, 64-QAM complex multiple-input multiple-output detector in a 0.13-μm CMOS technology, achieving a clock rate of 417 MHz with a core area of 340 kgates. The chip test results show that the fabricated design can sustain a throughput of 1 Gb/s with an energy efficiency of 110 pJ/bit, the best numbers reported to date.
Index Terms: Complex-domain detection, K-best detectors, LTE/WiMAX systems, multiple-input multiple-output (MIMO) detector.
I. INTRODUCTION
MULTIPLE-INPUT MULTIPLE-OUTPUT (MIMO) systems have the potential of achieving high spectral efficiency, high data rates, and robust wireless links, with an acceptable implementation complexity in wireless systems. MIMO technology has already been included in many wireless communication standards, such as the Long-Term Evolution (LTE) project, IEEE 802.16e, and IEEE 802.16m. Achieving low complexity, low energy, high performance, and high throughput at the same time is the key challenge in the design of any MIMO receiver. Several MIMO detection algorithms have been proposed to address this challenge, which offer various tradeoffs between the performance and the computational complexity.
Among the large variety of MIMO detection techniques, maximum-likelihood (ML) detection is the optimum detection method and minimizes the bit error rate (BER). However, its computational complexity grows exponentially with the number of transmit antennas. On the other hand, linear detection methods such as zero-forcing or minimum mean squared error (MMSE) detection have lower complexity but poor BER performance. Ordered successive interference cancelation (SIC) algorithms, such as the vertical Bell Laboratories layered space-time algorithm, form another category of detectors. These algorithms perform better than the linear detection methods, but their BER performance is still not acceptable.

Manuscript received November 11, 2011; revised February 2, 2012; accepted April 4, 2012. Date of publication May 15, 2012; date of current version April 22, 2013.
The authors are with the Electrical Engineering Department, Sharif University of Technology, Tehran 14174, Iran (e-mail: mahd_ma@yahoo.com; mahdi@sharif.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2012.2196296
Finally, as a tradeoff between complexity and performance loss, a large category of detection algorithms has been proposed, which includes the depth-first and the breadth-first search algorithms. The well-known depth-first strategy is the sphere decoder (SD) [1], which guarantees the optimal performance in the case of unlimited execution time [2]. However, its intrinsically variable throughput results in extra hardware overhead and significantly lower data rates in the lower signal-to-noise ratio (SNR) regimes. Among the breadth-first search methods, the most well-known approach is the K-Best algorithm (a.k.a. the M-algorithm) [3]. The K-Best detector guarantees an SNR-independent fixed-throughput detection scheme with a performance close to that of ML. Although K-Best detectors are attractive for VLSI implementations, some challenges remain, such as the need for efficient sorting and expansion schemes, before high throughputs can be achieved.
II. SYSTEM MODEL
Consider a MIMO system with $N_T$ transmit and $N_R$ receive antennas. The equivalent baseband model of the Rayleigh fading channel between the transmitter and the receiver is described by a complex-valued $N_R \times N_T$ channel matrix $\mathbf{H}$. There are two models for such MIMO systems, namely, the complex equivalent model and the real equivalent model. In this paper, we consider the complex-domain framework. However, the proposed scheme can be easily tailored for the real equivalent model. The complex baseband equivalent model can be expressed as

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{v} \tag{1}$$

where $\mathbf{s} = [s_1, s_2, \ldots, s_{N_T}]^T$ is the $N_T$-dimensional complex transmit signal vector, in which each element is independently drawn from a complex constellation $\mathcal{O}$ (symmetric $M$-QAM schemes with $\log_2 M$ bits per symbol, i.e., $|\mathcal{O}| = M$), $\mathbf{y} = [y_1, y_2, \ldots, y_{N_R}]^T$ is the $N_R$-dimensional received symbol vector, and $\mathbf{v} = [v_1, v_2, \ldots, v_{N_R}]^T$ represents the $N_R$-dimensional independent identically distributed circularly symmetric complex zero-mean Gaussian noise vector with variance $\sigma^2$, i.e., $v_i \sim \mathcal{N}_c(0, \sigma^2)$. The real equivalent model
can also be derived using a simple real-valued decomposition technique [1].

1063-8210/$31.00 © 2012 IEEE
The objective of the MIMO detection method is to find the closest lattice point $\hat{\mathbf{s}}$ for a given received signal $\mathbf{y}$

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s} \in \mathcal{O}^{N_T}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2. \tag{2}$$

In this paper, a novel MIMO detection algorithm is proposed to solve the above problem in the complex domain with linear complexity. Since the proposed algorithm is based on the K-Best algorithm, this algorithm is first briefly described in the following.
A. K-Best Algorithm
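Consider the problem in (2), and let us denote the QR decomposition of the channel matrix as $\mathbf{H} = \mathbf{Q}\mathbf{R}$, where $\mathbf{Q}$ is a unitary $N_R \times N_T$ matrix and $\mathbf{R}$ is an upper triangular $N_T \times N_T$ matrix. Performing the nulling operation by $\mathbf{Q}^H$ results in $\mathbf{z} = \mathbf{Q}^H\mathbf{y} = \mathbf{R}\mathbf{s} + \mathbf{w}$, where $\mathbf{w} = \mathbf{Q}^H\mathbf{v}$. Since the nulling matrix is unitary, the noise $\mathbf{w}$ remains spatially white after the nulling. Exploiting the triangular nature of $\mathbf{R}$, (2) can be expanded as

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s} \in \mathcal{O}^{N_T}} \sum_{i=1}^{N_T} \Big| z_i - \sum_{j=i}^{N_T} r_{ij} s_j \Big|^2. \tag{3}$$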
The above problem can be thought of as a tree-search problem with $N_T$ levels, where, starting from the last row, one symbol is detected and, based on that, the next symbol in the upper row is detected, and so on. Thus, starting from $i = N_T$, (3) can be evaluated in an iterative manner as follows:

$$T_i\big(\mathbf{s}^{(i)}\big) = T_{i+1}\big(\mathbf{s}^{(i+1)}\big) + \big|e_i\big(\mathbf{s}^{(i)}\big)\big|^2 \tag{4}$$

$$e_i\big(\mathbf{s}^{(i)}\big) = z_i - \sum_{j=i}^{N_T} r_{ij} s_j = L_i\big(\mathbf{s}^{(i)}\big) - r_{ii} s_i \tag{5}$$

$$L_i\big(\mathbf{s}^{(i)}\big) = z_i - \sum_{j=i+1}^{N_T} r_{ij} s_j \tag{6}$$

$$\tilde{L}_i\big(\mathbf{s}^{(i)}\big) = \frac{L_i\big(\mathbf{s}^{(i)}\big)}{r_{ii}} \tag{7}$$

where $\mathbf{s}^{(i)} = [s_i, s_{i+1}, \ldots, s_{N_T}]^T$, $T_i(\mathbf{s}^{(i)})$ is the accumulated partial Euclidean distance (PED) with $T_{N_T+1}(\mathbf{s}^{(N_T+1)}) = 0$, and $|e_i(\mathbf{s}^{(i)})|^2$ denotes the distance increment between two successive nodes/levels in the tree.
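As a minimal sketch, assuming the upper triangular matrix R and the nulled vector z are already available, the recursion in (4)-(6) can be written in Python as follows. The function name `accumulated_ped`, the 0-based row indexing, and the numerical values are illustrative assumptions and are not taken from the fabricated design.

```python
def accumulated_ped(R, z, s):
    """Evaluate (4)-(6) from the last row up to the first (0-based rows)."""
    n = len(s)
    T = 0.0                                   # T_{N_T+1} = 0
    for i in range(n - 1, -1, -1):            # start from the last row
        L_i = z[i] - sum(R[i][j] * s[j] for j in range(i + 1, n))   # (6)
        e_i = L_i - R[i][i] * s[i]                                  # (5)
        T += abs(e_i) ** 2                                          # (4)
    return T

# Illustrative 2x2 example: z = R s plus a small perturbation on each row.
R = [[2.0, 0.5 - 0.5j], [0.0, 1.5]]
s = [1 - 1j, -1 + 1j]
z = [R[0][0] * s[0] + R[0][1] * s[1] + 0.1, R[1][1] * s[1] - 0.2j]
print(accumulated_ped(R, z, s))
```

The accumulated value equals the sum of the squared row residuals in (3), which is why the PED of a full path can be compared directly against the ML metric.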
Based on the above model, starting from level $i = N_T$, the K-Best algorithm expands each of the $K$ surviving nodes in each level to its $M$ possible children in $\mathcal{O}$ and calculates their updated PEDs. It then sorts the $K \times M$ produced nodes and selects the $K$ nodes with the lowest PEDs as the surviving nodes in the next level. The path with the lowest PED at the first level of the tree is the hard decision output of the detector. There are two main computational cores in the above algorithm, which are discussed in the following.
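For reference, the conventional K-Best search with exhaustive expansion described above can be sketched as follows. This is only a baseline illustration under stated assumptions: the constellation uses unnormalized odd-integer coordinates, and the helper names `qam` and `k_best` are hypothetical. It is not the on-demand scheme proposed in this paper.

```python
from itertools import product

def qam(M):
    """Square M-QAM points with odd-integer coordinates (unnormalized)."""
    m = int(round(M ** 0.5))
    axis = [2 * k - m + 1 for k in range(m)]          # e.g. [-3, -1, 1, 3]
    return [complex(a, b) for a, b in product(axis, axis)]

def k_best(R, z, M, K):
    """Breadth-first search keeping the K lowest-PED paths per level."""
    n = len(z)
    points = qam(M)
    survivors = [(0.0, ())]                   # (PED, partial path s_{i+1..N_T})
    for i in range(n - 1, -1, -1):            # last row first
        children = []
        for ped, path in survivors:
            # Interference of the already detected symbols, as in (6).
            L_i = z[i] - sum(R[i][i + 1 + t] * path[t] for t in range(len(path)))
            for c in points:                  # exhaustive expansion: M children
                children.append((ped + abs(L_i - R[i][i] * c) ** 2, (c,) + path))
        children.sort(key=lambda node: node[0])   # sort the K*M candidates
        survivors = children[:K]                  # keep the K best
    return list(survivors[0][1])

R = [[2.0, 0.5 - 0.5j], [0.0, 1.5]]
s = [1 - 1j, -1 + 3j]
z = [R[0][0] * s[0] + R[0][1] * s[1], R[1][1] * s[1]]   # noiseless observation
print(k_best(R, z, 16, 4))
```

The inner loop over all M points and the sort over K × M candidates are exactly the two costs that the expansion and sorting cores discussed next aim to reduce.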
1) Expansion: According to the K-Best algorithm in the complex domain, in each level, $K$ (parents of each level) $\times$ $M$ (children per parent) children should be enumerated, which results in a large complexity. The current expansion schemes in the real domain such as the phase shift keying (PSK) enumeration [11], the base-centric search methodology [5], and the relaxed K-Best enumeration scheme based on PSK enumeration [6] are compared in [7]. Although these schemes can be applied to the complex domain, they do not scale linearly with the constellation size (such as in [11]) and/or have performance loss compared to the exact K-Best implementation (such as in [5] and [6]). To address this challenge, an efficient complex-domain expansion method called the on-demand expansion scheme is proposed in this paper, which provides all the information required for the exact K-Best implementation in the complex domain with no performance degradation while avoiding the exhaustive enumeration of the children. The computational complexity of the proposed scheme is independent of the constellation size.
2) Sorting: In each level of the K-Best algorithm in the complex domain, $K \times M$ children should be sorted. In [7] and [8], most of the sorting schemes such as bubble sorting [3], which is a sorting method based on the Schnorr-Euchner (SE) ([2], [9]) technique, and a distributed sorting scheme [6], [10] are compared. However, some of them are time-intensive for large values of $K$ and $M$ ([3], [11]) or incur a performance loss ([6], [10]). The most efficient sorting scheme is proposed in [7], which is used in this paper. The distributed sorter works for any value of $K$ and $M$ with no performance loss. Also, its complexity is independent of the constellation size and scales linearly with the value of $K$.¹
Due to the intrinsic challenges in the implementation in the complex domain, most of the MIMO detection algorithms in the literature have been proposed for the real domain. However, on account of the deeper search tree, the real-domain implementation results in a larger silicon area and a larger latency. Nevertheless, a high-throughput MIMO detector in the complex domain with an acceptable complexity for high-order constellations has always been a challenge in the literature. To address this challenge, in this paper a high-throughput detection algorithm along with its VLSI architecture for a 4×4, 64-QAM complex MIMO detector is proposed, which is scalable to higher order constellation schemes such as 256-QAM and to a larger number of antennas (i.e., $N_T > 4$).
III. COMPLEX SE ENUMERATION
The main challenge in developing the complex-domain enumerator lies in devising a means of iteratively enumerating the elements of the complex constellation in the order of increasing squared distance from the unconstrained value, i.e., increasing PED. In this paper, a novel complex SE enumeration scheme is proposed to enumerate the complex constellation points in the order of nondecreasing PED.
A. First Child (FC)
Based on the complex version of the system model in (4), the FC of a node in $\mathcal{K}_{l+1}$, denoted by $s_l^{[1]}$, is the one that minimizes $|e_l(\mathbf{s}^{(l)})|$.² In other words

$$s_l^{[1]} = \arg\min_{s_l \in \mathcal{O}} \big|e_l\big(\mathbf{s}^{(l)}\big)\big|^2 = \arg\min_{s_l \in \mathcal{O}} \big|L_l\big(\mathbf{s}^{(l)}\big) - r_{ll} s_l\big|^2$$

$$= \arg\min_{\Re(s_l) \in \Omega} \Big| \underbrace{\Re\big(L_l\big(\mathbf{s}^{(l)}\big)\big)/r_{ll}}_{u_l^R} - \Re(s_l) \Big|^2 \tag{8}$$

$$\quad + j \arg\min_{\Im(s_l) \in \Omega} \Big| \underbrace{\Im\big(L_l\big(\mathbf{s}^{(l)}\big)\big)/r_{ll}}_{u_l^I} - \Im(s_l) \Big|^2 \tag{9}$$

$$= a^R_{[1]} + j\,a^I_{[1]} \tag{10}$$

where $\Omega = \{-\sqrt{M} + 1, \ldots, -1, +1, \ldots, +\sqrt{M} - 1\}$ represents all the possible values of the real/imaginary part of the constellation points, $a^R_{[1]} = \Re[s_l^{[1]}]$, $a^I_{[1]} = \Im[s_l^{[1]}]$, and the index $l$ was removed as we focus on one parent node in level $l + 1$ and try to enumerate its children in level $l$. Note that (8) and (9) are derived based on the fact that $r_{ll}$ is a real number, which is a result of the QR decomposition.

Fig. 1. Three-level tree used for enumeration of the complex constellation $\mathcal{O}$. This tree is defined for each complex constellation point.

¹ By increasing the value of $K$, the performance becomes close to that of ML detection. However, a higher $K$ results in more hardware complexity. In this paper, based on the simulations (see Section VII), $K$ is chosen to be 10.
Let us define $s_l^{[0]} = L_l(\mathbf{s}^{(l)})/r_{ll}$ as the unconstrained received value. Considering the square symmetric $M$-QAM constellation schemes, there are $|\Omega| = \sqrt{M}$ possible integers on both the real and imaginary axes. Thus the optimizations in (8) and (9) are computationally inexpensive to implement, as $u_l^R = \Re[s_l^{[0]}]$ and $u_l^I = \Im[s_l^{[0]}]$ can be easily rounded to the nearest integers in $\Omega$ to find the optimized values for $\Re(s_l^{[1]})$ and $\Im(s_l^{[1]})$, respectively. This optimized value is denoted by $a^R_{[1]} + j\,a^I_{[1]}$ in (10), which is the FC of the current parent. Therefore, the FC can be easily implemented through a 2-D slicer.
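A software view of the 2-D slicer may help here. The sketch below assumes unnormalized odd-integer constellation coordinates and clamps values that fall outside the constellation to its edge; the names `slice_axis` and `first_child` are illustrative only.

```python
def slice_axis(u, M):
    """Round u to the nearest odd integer in {-(sqrt(M)-1), ..., sqrt(M)-1}."""
    m = int(round(M ** 0.5))
    nearest_odd = 2 * round((u - 1) / 2) + 1
    return max(-(m - 1), min(m - 1, nearest_odd))   # clamp to the constellation edge

def first_child(L_l, r_ll, M):
    """FC: slice the real and imaginary parts of s0 = L_l / r_ll independently."""
    s0 = L_l / r_ll                                  # unconstrained received value
    return complex(slice_axis(s0.real, M), slice_axis(s0.imag, M))

print(first_child(-1.6 + 5.2j, 2.0, 16))   # s0 = -0.8 + 2.6j
```

Because r_ll is real, the two axes decouple and each needs only one rounding and one clamp, independent of M.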
B. Next Child (NC)
To describe how the next child in the complex domain is calculated, let us denote all the points in the complex constellation $\mathcal{O}$ by a three-level tree as shown in Fig. 1, where $s_l^{[0]}$ is at the root (Level 1). Once the FC (i.e., $s_l^{[1]} = a^R_{[1]} + j\,a^I_{[1]}$) is determined, it is selected as the first node in Level 2 of the three-level tree (the left-most node in Level 2 in Fig. 1) and the $\sqrt{M}$ Level-2 siblings are chosen as those that share the same imaginary value $a^I_{[1]}$ as the FC.³ Therefore, their squared distances from $s_l^{[0]}$ vary directly with those of their real components from the real part of $s_l^{[0]}$. Since they are all in one line in the complex constellation with different real parts, the typical real SE enumeration technique [7] can be applied to enumerate them in the order of nondecreasing squared distance from $s_l^{[0]}$ [row SE (RSE) enumeration]. This means that the nodes in the second level of the three-level tree are positioned in the nondecreasing PED order.

² Because $T_{l+1}(\mathbf{s}^{(l+1)})$ is common for all the children of a parent node.
³ This is because of the fact that there are $\sqrt{M}$ possible real values and $\sqrt{M}$ possible imaginary values in the constellation.
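The real SE ordering along one axis can be sketched as below. This is an illustrative two-pointer formulation that yields the same order as the usual zigzag movement around the unconstrained value; the function name `real_se_order` and the odd-integer axis are assumptions made for the example.

```python
def real_se_order(u, M):
    """Yield the sqrt(M) axis values in order of nondecreasing |u - a|."""
    m = int(round(M ** 0.5))
    lo, hi = -(m - 1), m - 1
    first = max(lo, min(hi, 2 * round((u - 1) / 2) + 1))   # sliced value
    yield first
    below, above = first - 2, first + 2       # next candidates on each side
    while below >= lo or above <= hi:
        take_below = above > hi or (
            below >= lo and abs(u - below) <= abs(u - above))
        if take_below:
            yield below
            below -= 2
        else:
            yield above
            above += 2

print(list(real_se_order(-0.4, 16)))
```

Only one comparison per step is needed, and values are produced one at a time, which is what allows the enumeration to stop as soon as enough candidates have been found.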
For the third level of the three-level tree in Fig. 1, $\sqrt{M} - 1$ siblings are assigned to each Level-2 parent node such that they share the same real value inherited from their common Level-2 node whereas their imaginary parts take the values $a^I_{[2]}, \ldots, a^I_{[\sqrt{M}]}$. For instance, the elements of the left-most subtree in Fig. 1 are $\{a^R_{[1]} + j\,a^I_{[2]}, \ldots, a^R_{[1]} + j\,a^I_{[\sqrt{M}]}\}$ and those of the right-most subtree are $\{a^R_{[\sqrt{M}]} + j\,a^I_{[2]}, \ldots, a^R_{[\sqrt{M}]} + j\,a^I_{[\sqrt{M}]}\}$. In fact, these $(\sqrt{M} - 1)$ Level-3 nodes and their common Level-2 node are in the same column of the complex constellation. Thus each column of the complex constellation corresponds to a definite subtree in the three-level tree. Similar to the Level-2 nodes, the real SE enumeration can be applied to their imaginary components to enumerate them in the order of nondecreasing squared distance from $s_l^{[0]}$ [column SE enumeration (CSE)]. This implies that, using CSE, all the Level-3 nodes of a definite Level-2 node are positioned in the corresponding subtree in the third level of the three-level tree from left to right in the order of nondecreasing PED.
Based on the above three-level tree structure, the next child is calculated as follows. Recall that the FC corresponds to the $s_l \in \mathcal{O}$ that minimizes $|e_l(\mathbf{s}^{(l)})|$. By definition, its next best sibling ($s_l^{[2]}$) is the one that has the next smallest incremental distance $|e_l(\mathbf{s}^{(l)})|$, i.e., it is the one in $s_l \in \{\mathcal{O} - s_l^{[1]}\}$ that minimizes $|e_l(\mathbf{s}^{(l)})|$. Let $\mathcal{L}$ denote the set of points in the constellation that have been enumerated but have not yet been announced as the next best sibling. These nodes are named visited nodes. As a new point is enumerated, it is added to $\mathcal{L}$, so initially $\mathcal{L} = \{s_l^{[1]}\}$. At each step, the point in $\mathcal{L}$ with the lowest PED is announced to be the next best sibling. That point is removed from $\mathcal{L}$ and the complex SE enumeration is applied to it. So, according to the type of the announced node, i.e., Level-3 or Level-2, the announced node is replaced by one or two new points, respectively.
In other words, if the announced node is a Level-3 node, then only column SE (CSE) enumeration will be applied to it, whereas if the announced node is a Level-2 node, then both row and column SE enumerations will be performed on it. In fact, the row and column enumerations enable the coverage of the possible sets of values for $\Re[s_l]$ and $\Im[s_l]$, respectively. These new point(s) are said to be visited and are then added to $\mathcal{L}$.
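The announce-and-replace rule above can be sketched in software with a priority queue standing in for the list of visited nodes. This is a behavioral illustration under stated assumptions, not the hardware architecture: the axis orderings are precomputed with a sort instead of the zigzag SE movement, distances to the unconstrained value are used in place of full PEDs (the two differ only by a common offset and the scale r_ll²), and all function names and the value of s0 are hypothetical.

```python
import heapq
from itertools import islice

def axis_order(u, M):
    """Odd-integer axis values of square M-QAM, nearest to u first."""
    m = int(round(M ** 0.5))
    return sorted(range(-(m - 1), m, 2), key=lambda a: abs(u - a))

def complex_se(s0, M):
    """Yield the points of square M-QAM in order of nondecreasing |s0 - s|^2."""
    re_order = axis_order(s0.real, M)     # row SE order of the real parts
    im_order = axis_order(s0.imag, M)     # column SE order of the imaginary parts

    def dist2(i, k):
        return (s0.real - re_order[i]) ** 2 + (s0.imag - im_order[k]) ** 2

    visited = [(dist2(0, 0), 0, 0)]       # visited list, initially only the FC
    while visited:
        _, i, k = heapq.heappop(visited)            # announce the best visited node
        yield complex(re_order[i], im_order[k])
        if k == 0 and i + 1 < len(re_order):        # Level-2 node: row SE step
            heapq.heappush(visited, (dist2(i + 1, 0), i + 1, 0))
        if k + 1 < len(im_order):                   # column SE step for either level
            heapq.heappush(visited, (dist2(i, k + 1), i, k + 1))

# First four children for an assumed unconstrained value s0 = -0.2 + 0.6j in 16-QAM.
print(list(islice(complex_se(complex(-0.2, 0.6), 16), 4)))
```

A node with index k = 0 shares the imaginary part of the FC and is therefore a Level-2 node, which receives both the row and the column step; every other node is a Level-3 node and receives only the column step. Since the generator is lazy, requesting four children visits only a handful of points rather than all M.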
Fig. 2 shows an example of the complex SE enumeration, where the bold crosses represent the visited points while the bold circled crosses represent the points announced as the next best child. Starting from $s_l^{[1]}$ [Fig. 2(a)], which is the left-most node in Level 2, its corresponding Level-3 node and its next sibling in Level 2 are visited and are added to $\mathcal{L}$. Among these two new nodes in $\mathcal{L}$, the one with the lowest PED is chosen [$+1 + j$ in Fig. 2(b)]. If the chosen node is a Level-2 node, its next sibling in Level 2 and its next child in Level 3 are both enumerated and are added to $\mathcal{L}$, which is equivalent to running both the row and the column SE enumerations simultaneously [Fig. 2(c)]. However, if the chosen node in $\mathcal{L}$ is a Level-3 node, only its next sibling in Level 3 is enumerated and added to $\mathcal{L}$, as is the case in Fig. 2(d). In other words, after finding $s_l^{[2]}$ in Fig. 2(b), since it is a Level-2 node, both row and column enumerations are performed, resulting in the addition of $+1 - j$ and $-3 + j$ to $\mathcal{L}$. On the other hand, since in Fig. 2(c) the node $s_l^{[3]}$ is a Level-3 node, only the column enumeration is performed, resulting in the addition of $-1 + 3j$ to $\mathcal{L}$. This process is performed until all the points in the constellation are covered. Repeated application of this procedure ensures the on-demand enumeration of the complex constellation points in the order of increasing local PED. Note that, since the expansion scheme proposed here is on demand, not all the nodes in the constellation are necessarily visited.

Fig. 2. First four best children using complex SE enumeration in a 16-QAM constellation scheme. (a) $\mathcal{L} = \{-1 + j\}$. (b) $\mathcal{L} = \{-1 - j, +1 + j\}$. (c) $\mathcal{L} = \{-1 - j, 1 - j, -3 + j\}$. (d) $\mathcal{L} = \{-1 + 3j, 1 - j, -3 + j\}$.
The sequence in Fig. 2 shows the process of finding the first four best children of a particular parent node using the proposed scheme, with the elements of $\mathcal{L}$ listed for each stage in the caption. At each stage, a dashed gray circle indicates the distance of the most recently announced K-Best node to $s_l^{[0]}$. One argument should be proven to guarantee the correct functionality of this complex enumeration scheme, i.e., it should be ensured that, when a node in $\mathcal{L}$ is announced as the next K-Best node, any other node in $\mathcal{O}$ with a lower PED has already been visited and announced as a K-Best candidate. In other words, all the unannounced nodes in $\mathcal{O}$ should have a larger PED than the most recently announced K-Best node.
Proposition: Using the proposed complex SE enumeration scheme, nodes are visited in the order of increasing PED value.

Proof: The following two observations are used for this proof.

Lemma:
1) any Level-3 node has a larger PED than that of its corresponding Level-2 parent node;
2) if a node is announced as the next K-Best node, it has the lowest PED among the nodes in $\mathcal{L}$.

Fig. 3. Six possible cases for the complex SE enumeration.

Let $\mathcal{U} = \mathcal{O} - \mathcal{K} - \mathcal{L}$, where $\mathcal{O}$ is the set of all nodes in the constellation, $\mathcal{K}$ is the set of nodes that have been announced as the K-Best nodes, and $\mathcal{L}$ is the set of nodes that have been visited but have not yet been announced as K-Best candidates. It should be proved that all the unannounced nodes in $\mathcal{O}$ have a larger PED than the nodes in $\mathcal{K}$. In a mathematical form

$$\forall s_l \in \{\mathcal{O} - \mathcal{K}\},\ \forall s_l^{[i]} \in \mathcal{K} : \mathrm{PED}(s_l) > \mathrm{PED}\big(s_l^{[i]}\big) \tag{11}$$

where $\mathrm{PED}(s_l^{[i]})$ denotes the PED value of the node and all its ancestors to the $N_T$th level of the detection tree. Since K-Best nodes are announced in the nondecreasing order of the PED, one only needs to prove that

$$\forall s_l \in \{\mathcal{O} - \mathcal{K}\} : \mathrm{PED}(s_l) > \mathrm{PED}\big(s_l^{[k]}\big) \tag{12}$$

where $s_l^{[k]}$ represents the most recently announced K-Best node in $\mathcal{K}$.

To prove this, let us consider the contrary, i.e., assume there is a node, represented by $s_l$, which is not in $\mathcal{K}$ and has a lower PED than $\mathrm{PED}(s_l^{[k]})$. There are six possible cases to look at, as shown in Fig. 3 and described in the following:
1) this case implies that a Level-2 node $s_l$ has a lower PED than a Level-2 $s_l^{[k]}$ node, which is contrary to the concept of the ordered row SE enumeration;
2) this case means that a Level-2 node $s_l \in \mathcal{L}$ has a lower PED than the Level-3 $s_l^{[k]}$ node, which is contrary to note 2) above;
3) this case means that a Level-2 node $s_l \in \mathcal{U}$ has a lower PED than the Level-3 node $s_l^{[k]}$. This implies that there is one unannounced Level-2 node in $\mathcal{L}$ whose PED is larger than $\mathrm{PED}(s_l^{[k]})$, resulting in the conclusion that $\mathrm{PED}(s_l) < \mathrm{PED}(s_l^{[k]})$, which is contrary to the ordered row SE enumeration;
4) this means that a Level-3 node $s_l \in \mathcal{L}$ has a lower PED than the node $s_l^{[k]}$, which is contrary to note 2) above;
5) this means that a Level-3 node $s_l \in \mathcal{U}$, whose Level-2 parent node is in $\mathcal{L}$, has a lower PED than the node $s_l^{[k]}$, which is contrary to notes 1) and 2) above;
6) this means that a Level-3 node $s_l \in \mathcal{U}$, whose Level-2 parent node is in $\mathcal{U}$, has a lower PED than the node $s_l^{[k]}$. This means that regardless of the level of $s_l^{[k]}$, there is one unannounced Level-2 node in $\mathcal{L}$ (not including $s_l^{[k]}$), which has a lower PED than the node $s_l^{[k]}$. This is contrary to the ordered row SE enumeration and note 1) above.

Fig. 4. Variation of the value of $|\mathcal{L}|$ for 16-QAM for a specific received symbol. (a) $|\mathcal{L}| = 3$. (b) $|\mathcal{L}| = 4$. (c) $|\mathcal{L}| = 4$. (d) $|\mathcal{L}| = 4$. (e) $|\mathcal{L}| = 4$. (f) $|\mathcal{L}| = 1$.
Considering the nature of the SE enumeration in both
dimensions creates an intuitive proof. In other words, the nodes
in the three-level tree shown in Fig. 1 are sorted while being
enumerated in the tree on both Levels 2 and 3 from left to right.
Therefore, the selection of the nodes starts from left to right on
both levels. In fact, moving from left to right on the three-level
tree corresponds to visiting the nodes with nondecreasing PED
value. Therefore, if there is one unannounced Level-3 node
among children of a Level-2 parent node, it is guaranteed that
all its Level-3 siblings on its right side have a higher PED.
Also, if there is one unannounced Level-2 node, all the other
Level-2 nodes on its right side and their children have a higher
PED than this node.
One key property of the proposed enumeration scheme is
that, because it is based on the SE enumeration, it does not
require that the lattice search space under consideration be
bounded. Another feature is that the best children of each
parent are generated one by one and on demand (without
visiting all the other points). Therefore, the complexity of this
approach and the search complexity are independent of the
constellation order. This makes our approach a promising one
especially for higher order modulations such as the 64-QAM
or 256-QAM.
C. Relation of |L| and the Constellation Order (M)
Another promising aspect of the proposed approach is that it can be shown that $|\mathcal{L}| \le \sqrt{M}$. The fact that makes this feature possible is the ordered expansion along with the pipelined sorting scheme. It is worth noting that the complex version requires extra circuitry to implement the above proposed expansion scheme in the complex domain. However, since it is implemented on an on-demand basis, and since $\mathcal{L}$ does not grow linearly with $M$, this extra circuitry is not considerable.

Proposition: Using the above complex SE enumeration scheme, the size of $\mathcal{L}$ never exceeds $\sqrt{M}$, where $M$ represents the size of the constellation.

Proof: Based on the above proposed scheme, there are two levels of nodes in the tree (Level-2 and Level-3 nodes). If any of the Level-2 nodes is selected, it is excluded from $\mathcal{L}$ and at most two more constellation points are added to $\mathcal{L}$ (its next sibling in Level 2 and its next child in Level 3). Therefore, the selection of any Level-2 node would increase the value of $|\mathcal{L}|$ by at most 1. However, if the selected node is in Level 3, it is excluded from $\mathcal{L}$ and at most its sibling, if any, is added to $\mathcal{L}$. Therefore, if a Level-3 node is selected, the value of $|\mathcal{L}|$ does not change, or may even decrease. Having said that, since there are $\sqrt{M}$ Level-2 nodes in any $M$-QAM constellation, the value of $|\mathcal{L}|$ can grow to at most $\sqrt{M}$; thus $|\mathcal{L}| \le \sqrt{M}$.
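The bound can also be checked numerically. The following sketch, which is an illustration rather than part of the proof, runs the enumeration to completion for a grid of assumed unconstrained values and records the largest size reached by the visited list after each announcement. The function name and the grid are hypothetical choices.

```python
import heapq

def max_visited_list_size(s0, M):
    """Largest size of the visited list over a full complex SE enumeration."""
    m = int(round(M ** 0.5))
    axis = range(-(m - 1), m, 2)
    re_order = sorted(axis, key=lambda a: abs(s0.real - a))
    im_order = sorted(axis, key=lambda a: abs(s0.imag - a))

    def dist2(i, k):
        return (s0.real - re_order[i]) ** 2 + (s0.imag - im_order[k]) ** 2

    visited, largest = [(dist2(0, 0), 0, 0)], 1
    while visited:
        _, i, k = heapq.heappop(visited)        # announced node leaves the list
        if k == 0 and i + 1 < m:                # Level-2 node: next Level-2 sibling
            heapq.heappush(visited, (dist2(i + 1, 0), i + 1, 0))
        if k + 1 < m:                           # next Level-3 node in the column
            heapq.heappush(visited, (dist2(i, k + 1), i, k + 1))
        largest = max(largest, len(visited))
    return largest

# Sweep s0 over a grid that extends beyond the 64-QAM constellation.
worst = max(max_visited_list_size(complex(x / 4, y / 4), 64)
            for x in range(-40, 41) for y in range(-40, 41))
print(worst, "<=", int(64 ** 0.5))
```

Each Level-2 announcement adds at most one net entry and the last Level-2 node has no further sibling, so the list starts at one entry and can gain at most √M − 1 more.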
This fact is illustrated in Fig. 4, which shows the complex enumeration in a 16-QAM constellation for a specific received symbol, which is marked in the figure. Note that dots represent the constellation points, circled dots represent the visited candidates not selected yet (the elements of $\mathcal{L}$), and black circles denote the constellation points announced so far (the elements of $\mathcal{K}$). The arrows show the flow of the enumeration in the 16-QAM constellation, which depends on the location of the received symbol, and the number on each arrow represents the time step at which the target node is visited. For instance, in Fig. 4(a), $+1 - j$ and $-1 - 3j$ are visited in the second step of the enumeration, while $+1 + j$ is visited in the fifth step of the enumeration [Fig. 4(c)]. The value of $|\mathcal{L}|$ is the number of visited points not selected yet, i.e., circled dots. By looking at Fig. 4, the largest value of $|\mathcal{L}|$ is 4. The value of $|\mathcal{L}|$ for each step is mentioned in the caption.
IV. PROPOSED COMPLEX K-BEST ALGORITHM
The implementation in the real domain is straightforward, as the next child can be found by a simple zigzag movement around the unconstrained received value $s_l^{[0]}$ without any PED calculation in the feedback path of the architecture [12]. However, in the complex SE enumeration scheme, a 2-D SE enumeration is needed to find the next best siblings. After each complex SE enumeration, one or two new nodes will be generated that will later be added to $\mathcal{L}$ after calculation of their PED values. Thus the size of $\mathcal{L}$, which includes the best sibling nodes, may increase. It is most probable that the size of $\mathcal{L}$ is greater than 1. So in order to find the next child of the announced node, we should find the child with the minimum PED among the entries of $\mathcal{L}$, which incurs an extra complexity in the critical path.
On the other hand, in order to have a high-throughput detector, $K$ parent nodes of the next level should be generated in $K$ clock cycles. So according to the nature of the distributed K-Best algorithm, both the PED calculation and PED comparison processes should be done in the feedback path in one clock cycle for each announced node, which is the main underlying challenge. Also it is obvious that the PED calculation in the complex domain needs more computations compared to the real domain. Therefore, all these computations will be added to the total critical path of the detector, which will result in a significant decrease in the throughput. Thus the idea of the real-domain distributed K-Best algorithm cannot be applied to the complex domain to achieve a high-throughput design.
To address the above challenges, a novel complex-domain detection algorithm is proposed in this paper. Let us consider an $N_R \times N_T$, $M$-QAM MIMO system, so that the complex-domain detection tree has $N_T$ levels. The proposed algorithm can then be described as follows.
A. Proposed Complex K-Best Algorithm
Step I. Level $N_T$
1) Calculate the FC of the incoming node, which is the $N_T$th entry of the $\mathbf{z}$ vector.
2) Find all of the Level-2 nodes that are located in the same row of the constellation as the FC.
3) Calculate the PED of the FC and all the Level-2 nodes and save all these $\sqrt{M}$ nodes and their PED values in a register bank (i.e., $\mathcal{L}$).
4) For k = 1 : K
   a) Find the node with the minimum PED in $\mathcal{L}$ and announce it as one of the $K$ parent nodes of the next level of the detector.
   b) Find the next child of the announced node. In the complex domain, the next child should be calculated by the complex SE enumeration technique.
   c) Calculate the PED of the new Level-3 node and replace the announced node with the new Level-3 node in $\mathcal{L}$.⁴
   End

⁴ Note that there are $\sqrt{M}$ Level-2 nodes in an $M$-QAM constellation. So in Step I.2), $\mathcal{L}$ was initialized by $\sqrt{M}$ Level-2 nodes ($|\mathcal{L}| = \sqrt{M}$). Also, according to the proposed idea in Step I.4.b, each announced node will be replaced with only one Level-3 node. So the size of $\mathcal{L}$ will remain fixed ($|\mathcal{L}| = \sqrt{M}$).

Step II. Level $(N_T - 1)$ to Level 2
1) For i = 1 : K
   a) Calculate the value of $L_i(\mathbf{s}^{(i)})$ for the incoming node, which is the $i$th parent of the current level.
   b) Find the FC.
   c) Find RSE_Num Level-2 nodes, which are the nearest nodes to the FC in the constellation, using the row SE enumeration (RSE) technique.
   d) Calculate the PED values for the above RSE_Num Level-2 nodes and the FC, and then save all these nodes in the corresponding register bank ($\mathcal{L}$), resulting in $|\mathcal{L}|$ = RSE_Num + 1 for the $i$th parent of the current level.
   e) For j = 1 : CSE_Num
      i) Find the node with the minimum PED in $\mathcal{L}$.
      ii) Find the next child of the selected node. In this step, all the $\mathcal{L}$ entries are Level-2 nodes. So regardless of the type of the selected node (Level 2 or Level 3), in order to find the next child, only the corresponding Level-3 node of the selected node should be found by using the CSE technique.
      iii) Calculate the PED of the new Level-3 node and save it in $\mathcal{L}$. So the size of $\mathcal{L}$ will increase by 1.
      End
   f) For the current parent node (i.e., the $i$th parent), sort the entries of the obtained $\mathcal{L}$ in the order of nondecreasing PED, resulting in the final size of $|\mathcal{L}|$ = RSE_Num + 1 + CSE_Num.
   End
2) Find the sorted list of the $K$ first children of the $K$ parents in the order of nondecreasing PED.
3) For k = 1 : K
   a) Announce the node with the minimum PED in the above sorted list as the $k$th parent of the next level.
   b) In the above sorted list, replace the announced node with its next child, which is obtained from the sorted $\mathcal{L}$ of its parent.
   End
Step III. First Level
1) For k = 1 : K
   a) Calculate the value of $L_k(\mathbf{s}^{(k)})$ for the incoming node, which is the $k$th parent of the current level.
   b) Find the FC and the corresponding PED.
   End
2) Find the sorted list of the $K$ first children of the $K$ parents in the order of nondecreasing PED.
3) Find the node with the minimum PED from the sorted list of the first children and announce it with all of its parents up to level $N_T$ as the hard decision output $\hat{\mathbf{s}}$ of the detector.
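To make Step II.1 more concrete, the construction of the fixed-size child list of one parent can be sketched as follows. This is one possible reading of Steps II.1.b to II.1.f under stated assumptions: in Step II.1.e the minimum is taken over the nodes that have not yet been expanded, the PED increment is computed as $r_{ll}^2 |s_l^{[0]} - s_l|^2$ (valid because $r_{ll}$ is real), and the function name `child_list` as well as the values of `rse_num` and `cse_num` are purely illustrative.

```python
import heapq

def child_list(s0, r_ll, parent_ped, M, rse_num, cse_num):
    """Sorted list of rse_num + 1 + cse_num children of one parent (Step II.1)."""
    m = int(round(M ** 0.5))
    axis = range(-(m - 1), m, 2)
    re_order = sorted(axis, key=lambda a: abs(s0.real - a))   # RSE order
    im_order = sorted(axis, key=lambda a: abs(s0.imag - a))   # CSE order

    def ped(i, k):
        d2 = (s0.real - re_order[i]) ** 2 + (s0.imag - im_order[k]) ** 2
        return parent_ped + (r_ll ** 2) * d2

    # II.1.b-d: the FC plus the rse_num nearest Level-2 nodes.
    frontier = [(ped(i, 0), i, 0) for i in range(min(rse_num + 1, m))]
    kept = list(frontier)
    heapq.heapify(frontier)
    # II.1.e: cse_num column SE steps, each on the best unexpanded node.
    for _ in range(cse_num):
        if not frontier:
            break
        _, i, k = heapq.heappop(frontier)
        if k + 1 < m:
            node = (ped(i, k + 1), i, k + 1)
            heapq.heappush(frontier, node)
            kept.append(node)
    kept.sort()                                   # II.1.f: nondecreasing PED
    return [(p, complex(re_order[i], im_order[k])) for p, i, k in kept]

for p, point in child_list(complex(-0.2, 0.6), 1.0, 0.0, 16, rse_num=2, cse_num=2):
    print(round(p, 2), point)
```

Since every parent produces a list of the same length, the K lists can be merged in Steps II.2 and II.3 by repeatedly comparing only their current heads.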
B. Limited SE Enumeration Idea
There are two key points in the detection process that affect the BER performance of the detection algorithm.
1) It is obvious that the generation of the $K$ parent nodes of level $N_T$ should be done carefully, as any error in level $N_T$ propagates to all of the other levels, which results in performance loss.
2) According to the complex SE enumeration, the Level-2 node of each column has a lower PED than the other nodes of that column. So the Level-2 node of each column in the constellation has a higher priority than the other nodes of the same column to be announced as one of the $K$ best nodes of the next level. Thus, it is necessary to avoid missing the Level-2 nodes in the detection algorithm.
So the generation of the parent nodes of level $N_T$ and the generation of the Level-2 nodes are two important factors in the final BER of the system.
One of the innovations of this paper is to find all the Level-2 nodes at the beginning of the proposed algorithm (i.e., Level $N_T$). Then, regardless of the type of the current node (Level-2 or Level-3), in order to find the next child, only the corresponding Level-3 node of the current node should be found, which can be done by the CSE enumeration technique. So, in order to consider the first factor at the beginning of level $N_T$, all the Level-2 nodes are generated and saved in $\mathcal{L}$ (Step I.2). If one of these nodes is selected as one of the parents of the next level, then only its corresponding Level-3 node should be generated, as its neighboring Level-2 nodes have already been generated (Step I.4.b). So all the Level-2 and Level-3 nodes can be generated in this level and there is no omitted node in level $N_T$ of the proposed algorithm. This will ensure that there is no performance degradation in this level of the proposed algorithm compared to the exact K-Best algorithm.
Also, the second point can be addressed through the generation of all of the Level-2 nodes. The above idea can be applied to the other levels of the proposed algorithm at the added cost of a larger silicon area. On the other hand, the effect of the parent nodes of level 1 on the BER performance is lower than that of the parent nodes of level $N_T$. Another contribution of the proposed algorithm is to generate a fixed number of child nodes for each parent in level $N_T - 1$ through level 2, as described in the following.
Let us consider the distributed K-Best algorithm which uses the complex SE enumeration scheme [13]. After announcing a node as one of the K best nodes of the next level, the complex SE enumeration should be applied on the announced node. Then the node with the lowest PED in L will be announced as one of the K parents of the next level. This process will be repeated to generate all of the K parents of the next level. During this process, an unpredictable number (0 to K) of parents of the next level will be chosen from the same parent of the current level. In other words, the number of children of a current-level parent that become parents of the next level is not fixed. This means that the number of column/row SE enumerations that should be performed on a given parent, and also the size of the corresponding L, are unknown and variable. This results in a significant decrease in the hardware utilization. To address this challenge, in the proposed algorithm a fixed number of children will be generated for each parent (i.e., RSE_Num + CSE_Num + 1). Thus a fixed number of the column/row SE enumerations will be applied to each parent, which is done in Steps II.1.c and II.1.e of the proposed algorithm by using RSE_Num row SE enumerations and CSE_Num CSE enumerations, respectively. Proper values of these two important parameters are determined in the following.
C. Finding the Value of CSE_Num
Let us consider the generation of the parent nodes in the
complex domain using the complex SE enumeration scheme.
In this case, all the child nodes of a parent can be enumerated
and there is no constraint on the number of the column/row
SE enumerations. This method is referred to as the relaxed
SE enumeration scheme as opposed to our proposed limited
SE enumeration scheme. In fact, the BER performance of the
exact K-Best algorithm will be obtained for the relaxed SE
enumeration scheme. It will also be shown that the difference
between the final BER of our proposed detection algorithm
and the exact K-Best algorithm is negligible.
In order to find the proper value of CSE_Num, consider the generation of the parent nodes in the relaxed SE enumeration scheme in the complex domain. The simulation result of this scheme for a 4 × 4, 64-QAM MIMO detector with K = 10 is shown in Fig. 5. This simulation was performed for 21 845 packets, where each packet consists of 4 (transmitted vectors/packet) × 4 (symbols/transmitted vector) × 6 (bits/symbol) = 96 bits (thus about 2 Mbits in total). This figure shows the number of parents that have the same number of visited child nodes at the end of the simulation. For more clarity, the generation of the parents is explained step by step in the following.
Fig. 5. Number of parent nodes that have the same number of visited child nodes for a 4 × 4, 64-QAM complex MIMO detector with K = 10.
At the beginning, the FC of each parent will be found. So each parent node has only one visited child. Thus the parent nodes will be placed in the first column in Fig. 5. If the FC of a parent node from the first column is announced as one of the K-Best parents of the next level, then both column/row SE enumerations will be applied on it and two new nodes will be generated (note that the FC is a Level-2 node). So there is a total of three visited children, which is reflected in the third column in Fig. 5. The parent nodes with no announced child remain in the first column. This is the reason why the second column is always empty, which follows from the concept of the complex SE enumeration [see Fig. 2(a) and (b)].
Moreover, Fig. 5 shows that there are nearly $11 \times 10^6$ parent nodes that have one visited child till the end of the simulation with no announced node. Also, there are almost $6 \times 10^6$ parent nodes with three visited children.
This can be seen differently in Fig. 6, which shows the number of child nodes that are in the same category. For example, the third column of Fig. 5 shows that there are $6 \times 10^6$ parent nodes with three visited children. Thus there are $3 \times 6 \times 10^6 = 18 \times 10^6$ child nodes in this category. In fact, the value of each column in Fig. 6 is equal to the product of the values of the horizontal and vertical axes of the corresponding column in Fig. 5.
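The relation between Figs. 5 and 6 can be checked with a few lines of Python. This is only a sketch: the two parent counts are the approximate values quoted in the text for the first and third columns, and the function name is illustrative.

```python
# Approximate parent-node counts per number of visited children, taken from
# the two columns of Fig. 5 that are quoted in the text.
parents_per_visited = {1: 11_000_000, 3: 6_000_000}


def children_per_category(parents_hist):
    """Fig. 6 style histogram: (visited children) x (number of parents)."""
    return {visited: visited * parents for visited, parents in parents_hist.items()}


children_hist = children_per_category(parents_per_visited)
print(children_hist[3])  # 18000000, i.e. 18 x 10^6 child nodes
```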
Note that, according to Figs. 5 and 6, the number of parent nodes with five visited children is larger than the number of parent nodes with four visited children (i.e., the fourth column). After announcing the FC as a parent of the next level, its parent will be transferred from the first column to the third column. According to the concept of the complex SE enumeration, the third column includes both Level-2 and Level-3 nodes. So there are two scenarios.
1) If a Level-3 node is chosen from the third column as one of the K parents of the next level, then just the column SE enumeration will be applied to it, which results in transferring that parent from the third column to the fourth column. This is the only possible way to add a node to the fourth column because the second column is empty.
2) If a Level-2 node is chosen from the third column as one of the parents of the next level, then both the column and row SE enumerations will be applied to it, which results in transferring that parent from the third column to the fifth column. In addition, as a key point, consider a Level-3 node in the fourth column that is chosen as one of the K parents of the next level. Only the column SE enumeration will be applied to it, which results in transferring that parent node from the fourth column to the fifth column.
Thus the fifth column is the target of two columns (i.e., the third and fourth columns), but the fourth column is the target of only one column (i.e., the third column). So the number of nodes that are placed in the fifth category is larger than in the fourth category.
Fig. 6. Number of child nodes that are in the same category for a 4 × 4, 64-QAM MIMO detector with K = 10 in the complex domain.
It is worth noting that visiting and announcing are two different concepts, meaning that the number of visited nodes and the number of announced nodes from a parent node will be different, as shown in Table I. The first row of this table corresponds to the horizontal axis in Fig. 5, which is the number of visited child nodes for a parent, and the second row shows the possible number of announced child nodes for that parent node. For instance, the parent nodes of the third column in Fig. 5 have only one announced child node (i.e., their first children), which is shown in the third column of Table I. Note that the second row of Table I shows the possible range for the number of announced children from a specific parent. For example, for the parent nodes of the seventh column of Table I, we are sure that at least three child nodes of these parents were announced as the parents of the next level. But this table shows that the parent nodes of this category can have up to five announced child nodes.
TABLE I
COMPARISON OF THE NUMBER OF VISITED NODES AND THE NUMBER OF ANNOUNCED NODES FOR A SPECIFIC PARENT
Number of visited nodes | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Number of parent nodes | 0 | 0 | 1 | 2 | 2, 3 | 3, 4 | 3, 4, 5 | 4, 5, 6 | 4, 5, 6, 7 | 5, 6, 7, 8
Fig. 7. All possible scenarios for the node generation from a definite parent node with CSE_Num = 3.
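The entries of Table I, and the empty second column of Fig. 5, can be reproduced with a small enumeration. The sketch below is a simplified model and not the authors' simulator: it assumes that the first announced child is always the FC (a Level-2 node, which adds two visited children), and that afterwards either a Level-2 node (two new visited children) or a Level-3 node (one new visited child) may be announced. The function name is illustrative.

```python
def announced_counts(max_visited):
    """Map each reachable visited-children count to the possible announced counts."""
    possible = {1: {0}}  # only the FC has been visited; nothing announced yet
    for visited in range(1, max_visited + 1):
        for announced in possible.get(visited, ()):
            # From the initial state only the FC (Level-2) can be announced.
            steps = (2,) if visited == 1 else (1, 2)
            for step in steps:
                nxt = visited + step
                if nxt <= max_visited:
                    possible.setdefault(nxt, set()).add(announced + 1)
    return possible


table = announced_counts(10)
for visited in sorted(table):
    print(visited, sorted(table[visited]))
```

Under these assumptions a visited count of 2 is never reached, and a parent with seven visited children has between three and five announced children, in agreement with the seventh column of Table I.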
As an important result, Fig. 5 shows that less than 9% of all the parent nodes (roughly $2 \times 10^6$ parent nodes out of $22 \times 10^6$) have more than five visited child nodes, and less than 2% of all the parent nodes (roughly $0.5 \times 10^6$ parent nodes out of $22 \times 10^6$) have more than seven visited child nodes. So we can ignore the parent nodes of the other categories. Thus, one of the innovations of this paper is to limit the number of SE enumerations for each parent node, which is equivalent to ignoring the last columns in Fig. 5. This idea is used to obtain the appropriate values of CSE_Num and RSE_Num based on the above observation. Larger values of CSE_Num and RSE_Num result in a better BER performance at the cost of more silicon area.
In order to find the proper value of CSE_Num, exhaustive simulations were done and the value of CSE_Num was chosen to be 3. According to the concept of complex SE enumeration, it can be proven that in this case (i.e., CSE_Num = 3) the number of visited children for a parent node will be five, six, or seven, which is shown in Fig. 7(a)–(c), respectively. Fig. 7 shows all of the possible scenarios for the node generation of a definite parent with CSE_Num = 3, which result in visiting five, six, or seven child nodes from that parent. The announced child nodes of the parent are shown by green circles in Fig. 7. Thus in our proposed algorithm, three CSE enumerations (i.e., CSE_Num = 3) will be applied on each parent node and three Level-3 nodes will be generated (Step II.1.e in the proposed algorithm).
D. Value of RSE_Num
Fig. 7 shows that, according to the value of CSE_Num, up to three Level-2 nodes can be generated after the FC [see Fig. 7(a)]. So, in order to avoid missing the Level-2 nodes in the proposed algorithm, three Level-2 nodes should always be generated in addition to the FC of a parent, which can be done using three row SE enumerations. According to Step II.1.f of the proposed algorithm, all children of a parent should be sorted in the order of nondecreasing PED. According to the proposed architecture for Step II.1.f, it can be proven that the difference between the hardware required to sort seven or eight values is negligible. So, according to this fact and the importance of the Level-2 nodes in the final BER (see Section IV-B2), the number of Level-2 child nodes of each parent was chosen to be five in the proposed algorithm, which implies that RSE_Num = 4, because the FC is generated in a different manner, without the SE enumeration technique. Thus four row SE enumerations will always be applied on each parent node and four Level-2 nodes will be generated (Step II.1.c in the proposed algorithm).
Therefore, the result of the above ideas is to generate four Level-2 nodes by applying four row SE enumerations (RSE_Num = 4) and three Level-3 nodes by applying three CSE enumerations (CSE_Num = 3) for each parent (Steps II.1.c and II.1.e in the proposed algorithm). Thus, in addition to the FC, seven child nodes will be generated for each parent. Fig. 8 shows two possible cases that are covered by the proposed idea. Note that, after finding the FC, all these nodes will be visited for each parent (Steps II.1.c and II.1.e). This means that all the nodes that are not indicated with the green circles in Fig. 7 can be announced as the parents of the next level. But if the next child of the announced node does not exist in the collection of eight visited nodes of that parent, then that next child will not be enumerated.
Fig. 8. Two scenarios of node generation from a definite parent node that are covered: (a) Level-3 nodes are in three different columns and (b) Level-3 nodes are in two columns.
Fig. 9. Proposed VLSI architecture of a 4 × 4, 64-QAM MIMO detector.
V. PROPOSED VLSI ARCHITECTURE
The proposed VLSI architecture for a 4 × 4, 64-QAM hard-output MIMO detector is shown in Fig. 9. This architecture consists of four layers. Each layer gets the entries of z, R, some control signals, and the K parents of the previous layer (if any) as inputs and generates the K parents of the next layer as outputs. The control unit performs the scheduling of the inputs of the layers. Each layer consists of some subblocks, as described in the following. In order to reduce the final critical path of the design, fine-grain pipelining is applied to the proposed architecture. The internal pipelining stages of the subblocks and the pipeline stages between the blocks are shown by the red dashed lines and arrows in all the figures to come.
Fig. 10. Detailed VLSI architecture for Layer $N_T$ of the proposed MIMO detector.
A. Layer $N_T$
The layer $N_T$ of the proposed architecture implements Step I of the proposed algorithm. The detailed VLSI architecture of the layer $N_T$ is shown in Fig. 10; it gets $z_4$ and $r_{44}$ as inputs and generates the K parents of layer $N_T - 1$ as outputs.
B. Layer ($N_T - 1$) ... Layer 2
In the layer $N_T - 1$ through Layer 2, the corresponding entries of z and R, some control signals, and the K parents of the previous layer are the inputs, and the K parents of the next layer are generated as the outputs. In fact, these layers of the proposed architecture perform Step II of the proposed algorithm. The architecture of these layers is the same and is described in detail below.
C. Layer 1
Finally, Layer 1 of the proposed architecture performs Step III of the proposed algorithm and announces the child with the lowest PED with all its parents up to Layer $N_T$ as the hard decision output s of the detector.
D. Multipliers
According to (3)–(7), there are four types of multiplication in the proposed scheme, which are implemented as follows.
1) Fast Multiplier: The first type of multiplier is devised to perform the multiplication $r_{ii} L_i$. This multiplication is time-intensive and is a part of the total critical path of the system. So, in order to improve the throughput of the system, an efficient architecture is proposed in Fig. 11. The basic core of the proposed architecture is a 2's complement 4 × 4-bit Baugh–Wooley carry-save multiplier [Fig. 11(b)]. The final 16 × 16-bit fast multiplier is obtained by connecting a group of these basic cores together [Fig. 11(a)]. Also, in order to decrease the critical path of the multiplier, fine-grain pipelining is used in this block [red arrows in Fig. 11(b)], which results in a critical path of 1 ns and a throughput of 32 Gb/s for the multiplier in a 0.13-μm CMOS technology, instead of 7 ns and a throughput of 4.5 Gb/s for a conventional 16 × 16-bit multiplier in the same technology.
Fig. 11. (a) Proposed architecture of the fast multiplier. (b) Architecture of a 4 × 4-bit Baugh–Wooley carry-save multiplier.
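The idea of building a 16 × 16-bit product from 4 × 4-bit cores can be sketched as follows. This is a functional illustration only: it uses unsigned operands for simplicity, whereas the fabricated core is a 2's complement Baugh–Wooley design, and it says nothing about the carry-save structure or pipelining. The throughput helper assumes one 32-bit product per critical-path delay, which is an assumption that happens to match the quoted figures. Both function names are illustrative.

```python
def multiply_16x16_from_4x4(a, b):
    """Multiply two unsigned 16-bit integers using only 4 x 4-bit products."""
    assert 0 <= a < (1 << 16) and 0 <= b < (1 << 16)
    nibbles_a = [(a >> (4 * i)) & 0xF for i in range(4)]
    nibbles_b = [(b >> (4 * j)) & 0xF for j in range(4)]
    product = 0
    for i, na in enumerate(nibbles_a):
        for j, nb in enumerate(nibbles_b):
            # Each 4 x 4-bit partial product is weighted by 16^(i + j).
            product += (na * nb) << (4 * (i + j))
    return product


def multiplier_throughput_gbps(output_bits, critical_path_ns):
    """Bits produced per nanosecond, i.e. Gb/s, for one result per delay."""
    return output_bits / critical_path_ns


print(multiply_16x16_from_4x4(51234, 40321) == 51234 * 40321)  # True
print(multiplier_throughput_gbps(32, 1.0))  # 32.0
```

With a 7-ns critical path the same helper gives about 4.57 Gb/s, close to the 4.5 Gb/s quoted for the conventional multiplier.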
2) Constant Multiplier (CM): The second multiplier is devised to perform $r_{ij}\,\Re/\Im\{s_k\}$, which is the basic core of the remaining three types of multipliers. So an efficient low-area multiplier is proposed that is implemented by using simple shift and addition operations [Fig. 12(a)]. The key idea in the proposed multiplier comes from the fact that both operands are real valued and the value of $\Re/\Im\{s_k\}$ is always chosen from a constant set ($\Omega$).
3) Real × Complex Multiplier (RCM): The third type of multiplier is designed to perform $r_{ii} s_i$. After the QR decomposition, all the diagonal elements of R will be real and the other elements will be complex. So this multiplier is an RCM. The RCM is used to implement (5). It is implemented by a combination of two constant multipliers [see Fig. 12(b)].
4) Complex × Complex Multiplier (CCM): Finally, the fourth type of multiplier is designed to perform $r_{ij} s_j$, which is a CCM. According to (6), the CCM is used to implement $L_i\left(s^{(i)}\right)$, shown in Fig. 12(c).
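A shift-and-add constant multiplier and its use in the RCM and CCM can be sketched as follows. Several points are assumptions rather than details taken from the paper: the constant set is taken to be the odd integers ±1, ±3, ±5, ±7 (the per-axis levels of 64-QAM), operands are plain integers instead of fixed-point words, and the CCM is built from four constant multipliers in the textbook way. The text only states that the RCM combines two constant multipliers. All function names are illustrative.

```python
def constant_multiply(x, c):
    """Multiply integer x by an odd constant c in {+-1, +-3, +-5, +-7} using shifts and adds."""
    magnitude = abs(c)
    if magnitude == 1:
        result = x
    elif magnitude == 3:
        result = (x << 1) + x   # 2x + x
    elif magnitude == 5:
        result = (x << 2) + x   # 4x + x
    elif magnitude == 7:
        result = (x << 3) - x   # 8x - x
    else:
        raise ValueError("constant outside the assumed set")
    return -result if c < 0 else result


def real_complex_multiply(r, s_re, s_im):
    """Real r times complex s: two constant multipliers."""
    return constant_multiply(r, s_re), constant_multiply(r, s_im)


def complex_complex_multiply(r_re, r_im, s_re, s_im):
    """Complex r times complex s: four constant multipliers plus two adders."""
    re = constant_multiply(r_re, s_re) - constant_multiply(r_im, s_im)
    im = constant_multiply(r_re, s_im) + constant_multiply(r_im, s_re)
    return re, im


print(complex_complex_multiply(2, 3, 5, -7))  # (31, 1), i.e. (2 + 3j) * (5 - 7j)
```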
E. PED Calculation Block (PED Calc.)
This block calculates the PED based on (4), which is implemented through a fully pipelined architecture in Fig. 13. The middle subblocks in Fig. 13 implement the $\ell_1$-norm calculation. Simulation results show that the difference between the BER performance of the $\ell_2$-norm and the $\ell_1$-norm is negligible [11]. Due to the lower complexity of the $\ell_1$-norm, it is the preferred approach for the implementation in our design.
Fig. 12. Proposed architectures of (a) CM, (b) RCM, and (c) CCM.
Fig. 13. Proposed VLSI architecture for the PED calculation block.
F. $L_i$ Calculation ($L_i$ Calc.)
According to (8) and (10), in order to find the FC, the value of $s_l^{[0]} = L_l\left(s^{(l)}\right)$ should be calculated. Also, based on (7), this value is used to calculate $L_l\left(s^{(l)}\right)$ for the PED calculation. Since different numbers of $z_i$ and $r_{ij}$ are used to calculate $L_1$, $L_2$, and $L_3$, a customized fully pipelined architecture is proposed for each of them in Fig. 14(a)–(c), respectively. These blocks perform Step II.1.a and Step III.1.a of the proposed algorithm.
G. FC Block
The proposed architecture for the FC block is shown
in Fig. 15(a), which performs Step I.1, Step II.1.b, and
Step III.1.b of the proposed algorithm. This is done by using
two mapper blocks and two limiter blocks, which are described
below.
1) Mapper: The main task of the mapper block is to round the real/imaginary part of $L_i$ (i.e., $\Re/\Im\{s_l^{[0]}\}$) to the nearest odd integer value, which can be done on the basis of the following equation:
$$\Re/\Im\{s_l^{[1]}\} = 2\left(\left\lfloor \frac{\Re/\Im\{s_l^{[0]}\}}{2} \right\rfloor + 1\right) - 1 \qquad (13)$$
where $\lfloor \cdot \rfloor$ represents the truncation operation. The detailed architecture of the mapper block is shown in Fig. 15(b).
2) Limiter: If $\Re/\Im\{s_l^{[1]}\}$ is outside the allowed boundary of $\Omega$, it will be bounded by the limiter block to generate the real/imaginary part of the FC [see Fig. 15(c)].
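The mapper and limiter can be illustrated with a short sketch. It rounds to the nearest odd integer using the form 2⌊x/2⌋ + 1, assuming that truncation behaves like a floor (as it does when fractional bits are dropped from a two's complement number), and it assumes a boundary of ±7, the outermost per-axis level of 64-QAM. The function names and the default bound are illustrative.

```python
import math


def map_to_nearest_odd(x):
    """Mapper: round x to the nearest odd integer (floor-based truncation assumed)."""
    return 2 * math.floor(x / 2) + 1


def limit(value, bound=7):
    """Limiter: clamp value to the interval [-bound, bound]."""
    return max(-bound, min(bound, value))


def first_child_coordinate(x, bound=7):
    """Real or imaginary coordinate of the FC for an unconstrained estimate x."""
    return limit(map_to_nearest_odd(x), bound)


print(first_child_coordinate(2.3))   # 3
print(first_child_coordinate(-2.5))  # -3
print(first_child_coordinate(9.2))   # 7 (mapper gives 9, limiter clamps it)
```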
Fig. 14. Proposed VLSI architecture of the $L_i$ calculation block. (a) $L_1$ calculation block. (b) $L_2$ calculation block. (c) $L_3$ calculation block.
Fig. 15. (a) Proposed architecture for the FC block. (b) Detailed architecture
of the mapper block. (c) Detailed architecture of the limiter block.
H. CSE Enumeration
This block performs the CSE technique to generate the
Level-3 nodes of a parent. In fact, this block performs
Step I.4.b and Step II.1.e.ii of the proposed algorithm, which
is implemented using simple adders and subtractors.
I. NC Block
This block is used in all layers except the first and the last layer. A fully pipelined VLSI architecture is proposed for the NC block, shown in Fig. 16. The NC block implements Step II.1.c and Step II.1.d using adders, subtractors, and norm calculation blocks. Also, it implements Step II.1.e and Step II.1.f of the proposed algorithm using the upper right corner blocks and the lower half of Fig. 16, respectively.
J. Sorters
The proposed VLSI architecture includes four types of
sorters, which are described below.
1) Sorter 1: This sorter performs Step II.2 and Step III.2 of the proposed algorithm. Sorter 1, proposed in [7], is used in all of the layers except Layer $N_T$.
2) Sorter 2 and Shifter: This block implements Step II.3 of the proposed algorithm by using Sorter 2, which is the same as Sorter 1, and a shifter. This block is used in all layers except the first and the last.
3) Sorter 3: The main task of Sorter 3 is to perform Step I.4.a of the proposed algorithm. A feedforward and pipelinable VLSI architecture is proposed for Sorter 3, which is shown in Fig. 17.
4) Sorter NC: This sorter is used in the NC block. In fact, Sorter NC is a modified version of Sorter 3, which is customized for a different number of inputs. Note that the only difference between the Sorter NC blocks in Fig. 16 is in the number of inputs.
Fig. 16. Detailed VLSI architecture of the NC block.
Fig. 17. Proposed VLSI architecture of Sorter 3.
It is worth noting that the proposed feedforward/fully
pipelined architecture can be easily applied to the multicarrier
scenarios. In fact, the proposed MIMO detector is applied on
each carrier separately, so each subsequent carrier can be fed
to the proposed MIMO detector through a pipelined fashion.
This can be done through a simple hardware wrapper sitting
next to the proposed core. While it is assumed that the channel
is perfectly known to the receiver, the proposed algorithm can
be used under different channel conditions when used along
with a channel estimator providing the estimate of the current
channel status.
VI. COMPLEXITY ANALYSIS
A. ASIC Implementation
The proposed VLSI architecture was modeled in Verilog HDL using ModelSim, synthesized using Synopsys Design Vision in 0.13-μm and 90-nm CMOS technologies, and placed and routed using Cadence SoC Encounter. The chip boundary and final graphic database system (GDS) stream-out were performed using Cadence Virtuoso. The golden fixed-point MATLAB model was used to validate the register-transfer-level (RTL) and gate-level netlists. The final verified ASIC core was fabricated in 0.13-μm IBM 1P/8M CMOS technology using Artisan standard library cells. A micrograph of the die, which was packaged in a CFP120 package, is shown in Fig. 18.
The fabricated design was tested using an Agilent (Verigy) 93000 SoC high-speed digital tester and a Temptronic TP04300 thermal forcing unit. The test setup consisted of the 93K system-on-chip (SoC) tester, the thermal forcing unit, and a load board holding the device under test. The nominal supply voltage supplied to the core is 1.2 V, while the I/O voltage is 2.7 V. The operation of the chip was verified by passing the input vectors at different SNR values to the chip through the tester and comparing the detector outputs with the expected values from the bit-true simulations in both MATLAB and ModelSim. Finally, an at-speed test was run on the chip and the outputs were compared against the expected bit stream generated by the MATLAB simulations.
TABLE II
DESIGN COMPARISON OF THE CURRENT ASIC IMPLEMENTATIONS FOR 4 × 4 MIMO DETECTORS
Reference | JSAC 2006 [1] | TCAS-II 2010 [14] | DATE 2009 [15] | TVLSI 2007 [6] | TVLSI 2010 [16] | JSSC 2010 [17] | JSSC 2011 [18] | TVLSI 2011 [19] | TVLSI 2011 [7] | This work
Modulation | 16-QAM | 16-QAM | 64-QAM | 64-QAM | 64-QAM | (4-64)QAM | 64-QAM | (16-64)QAM | 64-QAM | 64-QAM
Antenna | 4 × 4 | 4 × 4 | 4 × 4 | 4 × 4 | 4 × 4 | 4 × 4 to 8 × 8 | 4 × 4 | 4 × 4 | 4 × 4 | 4 × 4
Method | K-Best | SISO-SD | Sys. Like detection | K-Best | K-Best | MBF-FD (SD) | SISO MMSE-PIC | MMF-LSD | K-Best | Modified K-Best
Domain | Real | Complex | Complex | Complex | Real | Complex | Complex | Real | Real | Complex
K-Value | 5 | N/A | N/A | 64 | 5-64 | N/A | N/A | N/A | 10 | 10
Process | 0.35 μm | 90 nm | 45 nm | 0.13 μm | 65 nm | 0.13 μm | 90 nm | 0.18 μm | 0.13 μm | 0.13 μm
f_max (MHz) | 100 | 250 | 574.7 | 270 | 158 | 198 | 568 | 250 | 282 | 417
Throughput (Mb/s) | 54 (145^a) | 90 (62^a) | 215 (74^a) | 100 | 732-100 (366-50^a) | 285-431 | 757 (524^a) | 31.7-146.3 (44-202^a) | 675 | 1000
Gate count | 91 kG | 96 kG | 33.1 kG | 5270 kG | 1760 kG | 350 kG | 410 kG | 25.4-48.2 kG | 114 kG | 340 kG
NHE^b | 0.63^a | 1.6^a | 0.45^a | 52.7 | 4.81-35.2^a | 1.23-0.81 | 0.78^a | 0.58-0.24^a | 0.17 | 0.34
Energy/bit | 594 pJ/b | N/A | N/A | 8470 pJ/b | N/A | N/A | 250 pJ/b | N/A | 200 pJ/b | 110 pJ/b
Power (mW) | 626 | N/A | N/A | 847 | 165 | 57-74 | 189.1 | 57-90 | 135 | 1700
Latency (μs) | 2.4 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 0.6 | 0.36
Hard/soft | Hard | Soft | Soft | Soft | Hard | Soft | Soft | Soft | Hard | Hard
^a Technology scaling from $S_1$ to the 0.13-μm CMOS process, assuming that $t_{pd}$ scales with the ratio of $S_1$ (nm) to 130 (nm) and that $f_{\max} \propto 1/t_{pd}$.
^b Normalized hardware efficiency (kG/Mb/s).
Fig. 18. Photograph of the die. (Labeled regions: Layer 1, Layer 2, Layer 3, and Layer 4.)
The final measured BER performance of the proposed approach is similar to that in [1] and [11]. Thus the major difference between all of these schemes, including the one in this paper, is the way the detection algorithm is implemented, which translates to different throughput and/or hardware complexity. The proposed algorithm is implemented using a feedforward architecture. In our proposed architecture, the critical paths of subblocks such as the 16 × 16-bit fast multiplier and the PED Calc. block are reduced by applying the pipelining technique. According to the proposed algorithm, the K-Best candidates of each layer of the architecture are generated in K clock cycles, which increases the throughput of the system.
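The reported throughput figures are consistent with a simple model in which the pipelined detector delivers one detected vector every K clock cycles. The sketch below works under that assumption; the function name and parameter names are illustrative and not taken from the paper.

```python
def detector_throughput_mbps(clock_mhz, n_tx, bits_per_symbol, k):
    """Throughput in Mb/s when one n_tx-symbol vector is detected every k cycles."""
    bits_per_vector = n_tx * bits_per_symbol
    return clock_mhz * bits_per_vector / k


# 4 x 4, 64-QAM (6 bits per symbol), K = 10
print(detector_throughput_mbps(417, 4, 6, 10))  # about 1000.8 Mb/s (ASIC clock)
print(detector_throughput_mbps(150, 4, 6, 10))  # 360.0 Mb/s (FPGA clock)
```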
TABLE III
DEVICE UTILIZATION IN THE FPGA PLATFORM
            | Slices | LUTs  | Reg.  | LUT-FF pairs | DSP48Es
Available   | 14720  | 58880 | 58880 | 51869        | 640
Used        | 13467  | 46160 | 36912 | 31203        | 8
Utilization | 91%    | 78%   | 62%   | 60%          | 1%
The comparison between the proposed complex MIMO detector and the recently proposed MIMO detectors in the complex and the real domains that are reported in the literature is shown in Table II. This comparison shows that the proposed scheme has the same performance but higher throughput, lower area, lower energy, and lower latency compared to all the reported complex-domain VLSI realizations. Also, the proposed design has higher throughput, less latency, and less energy than the distributed K-Best algorithm in [7], which is one of the most efficient real-domain MIMO detectors. The proposed design has a larger core area than the one in [7], which is related to the nature of the complex-domain implementation and the extra resources for the complex-domain calculations.
In order to perform a fair comparison, a normalized hardware efficiency (NHE) is defined, which includes the core area of the design (i.e., gate count) and the corresponding scaled throughput in the same technology for all designs. So
$$\text{NHE (kG/Mb/s)} = \frac{\text{core area (kG)}}{\text{scaled throughput (Mb/s)}}.$$
Table II shows that the proposed design has the lowest NHE compared to all of the complex-domain MIMO detectors. Moreover, the proposed scheme has a lower NHE than all the real-domain implementations except [7].
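The NHE entries of Table II can be recomputed from the gate counts and throughputs. In the sketch below, the scaling rule (throughput multiplied by the ratio of the original feature size to 130 nm) is an assumption inferred from the scaled entries of Table II, for example 54 Mb/s at 0.35 μm becoming 145 Mb/s. The function names are illustrative.

```python
def scale_throughput(throughput_mbps, process_nm, target_nm=130):
    """Scale a throughput to the target process, assuming f_max ~ 1 / feature size."""
    return throughput_mbps * process_nm / target_nm


def nhe(gate_count_kg, scaled_throughput_mbps):
    """Normalized hardware efficiency in kG per Mb/s (lower is better)."""
    return gate_count_kg / scaled_throughput_mbps


this_work = nhe(340, 1000)                     # 0.13 um, no scaling needed
ref7 = nhe(114, 675)                           # 0.13 um, no scaling needed
ref1 = nhe(91, scale_throughput(54, 350))      # 0.35 um scaled to 0.13 um

print(round(this_work, 2), round(ref7, 2), round(ref1, 2))  # 0.34 0.17 0.63
```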
Fig. 19. K-Best versus ML BER performance for different values of K in both real and complex domain for a 4 × 4, 64-QAM MIMO detector. (Plot of BER versus SNR; curves: K-Best complex (K = 7), K-Best real (K = 10), K-Best complex (K = 11), K-Best complex (K = 15), and ML.)
Moreover, the proposed scheme is implemented on the FPGA platform. The synthesis results and the required resources for the 4 × 4, 64-QAM MIMO detector using the proposed scheme are shown in Table III. The target device is the Virtex-5 FPGA from Xilinx, i.e., XC5VSX95T-2FF1136. On the FPGA platform, a throughput of 360 Mb/s at 150 MHz is achieved for the proposed design.
VII. SIMULATION RESULTS
In theory, the K-Best algorithm might miss the hard-ML point and might have performance loss as a result. However, by a proper choice of K, the BER performance of the K-Best method approaches the optimal case for a reasonable range of SNR values. Since the proposed algorithm is based on the K-Best algorithm, it is necessary to choose a proper value for K. Fig. 19 shows the BER performance curves for a 4 × 4, 64-QAM MIMO system using the proposed scheme versus the ML detector. It reveals the behavior of the proposed algorithm for different values of K. It is seen that by increasing the value of K, the performance becomes close to that of the ML detection. However, a higher K value results in more hardware complexity. For instance, in the proposed algorithm for a 4 × 4, 64-QAM MIMO system, K = 15 results in an ML-like result while K = 8 comes with less diversity in high-SNR regimes (Fig. 19). The performance of the K-Best scheme with K = 10 is close to the ML while having a moderate complexity. Thus K = 10 is chosen as the framework for the hardware implementation in this paper.
Moreover, the word length is another important parameter that affects the final BER of the algorithm and also the hardware complexity. Let us consider a pair (W, F) for each variable of the algorithm, which represents the total word length and the fractional length, respectively. We consider three different cases for (W, F): small values, medium values, and large values. These values are listed in Table IV. The BER comparison of these cases in the fixed-point domain and also the floating-point simulation results of the proposed algorithm as well as ML detection are shown in Fig. 20. Simulation results show that choosing a large value for W and F results in less performance loss, but a larger and more complicated hardware is needed ("best" in Fig. 20). More truncation of parameters results in lower hardware complexity but a larger BER and performance loss ("bad" in Fig. 20). Medium hardware complexity and BER are obtained by choosing the medium set of (W, F), denoted by "optimized" in Fig. 20, which is chosen in this paper.
TABLE IV
FIXED-POINT VARIABLES WITH THREE SETS OF (W, F)
          | r_ij     | r_ii     | z_i      | s_i    | L_i      | PED
Best      | (34, 30) | (31, 28) | (34, 25) | (4, 0) | (34, 25) | (34, 31)
Optimized | (16, 12) | (13, 10) | (16, 7)  | (4, 0) | (16, 7)  | (16, 13)
Bad       | (10, 6)  | (9, 6)   | (12, 3)  | (4, 0) | (11, 2)  | (10, 7)
Fig. 20. BER performance of the proposed algorithm in both fixed/floating point domains for different values of (W, F) versus ML for a 4 × 4, 64-QAM MIMO detector with K = 10.
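The effect of a (W, F) pair on a single value can be illustrated with a small quantizer. This is a sketch under stated assumptions, not the golden MATLAB model: it assumes signed two's complement words, truncation of the discarded fractional bits, and saturation on overflow. The function name is illustrative.

```python
import math


def quantize(x, word_length, frac_length):
    """Quantize x to a signed fixed-point value with W total bits and F fractional bits."""
    scale = 1 << frac_length
    raw = math.floor(x * scale)            # truncate the discarded fractional bits
    lowest = -(1 << (word_length - 1))     # most negative representable code
    highest = (1 << (word_length - 1)) - 1 # most positive representable code
    raw = max(lowest, min(highest, raw))   # saturate on overflow
    return raw / scale


# r_ij formats from Table IV: "Optimized" (16, 12) versus "Bad" (10, 6)
print(quantize(0.3, 16, 12))  # 0.2998046875
print(quantize(0.3, 10, 6))   # 0.296875
```

The shorter word has a visibly larger quantization error for the same input, which is the mechanism behind the BER gap between the word-length sets.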
Finally, it is necessary to verify the proposed idea of the limited number of column/row SE enumerations and compare the BER of the proposed algorithm with the exact K-Best algorithm and also ML detection. In other words, the chosen values of CSE_Num and RSE_Num (i.e., CSE_Num = 3, RSE_Num = 4) as well as their effects on the BER of the proposed algorithm should be confirmed. According to the proposed algorithm, the values of CSE_Num and RSE_Num affect the complexity of the architecture and the BER performance. Larger values result in a larger chip area and better BER performance.
There are two strategies for the values of CSE_Num and RSE_Num. The first strategy is the relaxed SE enumeration scheme, where the number of column/row SE enumerations is not limited ("relaxed SE" in Fig. 21). In fact, the BER of this strategy is the same as the BER of the exact K-Best algorithm [13]. The second strategy is our proposed limited SE enumeration scheme, where the number of column/row SE enumerations is limited (i.e., CSE_Num = 3, RSE_Num = 4). This scheme is denoted by "Limited SE" in Fig. 21. The simulation results of these strategies versus ML detection are shown in Fig. 21. We can see that the difference between the BER of these two schemes is negligible, which confirms that the proposed algorithm can achieve the same BER as the exact K-Best algorithm.
Fig. 21. Effect of the proposed limited SE enumeration idea on the BER performance of the proposed algorithm versus the exact K-Best algorithm and the ML for a 4 × 4, 64-QAM MIMO detector with K = 10.
The above simulation results are for a single-carrier 4 × 4, 64-QAM MIMO system. The BER curves are obtained from 100 000 packets, where each packet consists of 4 × 6 × 4 = 96 bits (9.6 Mbits in total). Test vectors are generated using: 1) pseudorandom data; 2) a complex-valued random Gaussian channel matrix H with statistically independent elements, updated every four channel uses; and 3) additive white Gaussian (circularly symmetric) complex random noise.
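A test-vector generator along these lines can be sketched with the Python standard library. The details below are assumptions for illustration: unit-variance channel entries, unnormalized 64-QAM levels ±1, ±3, ±5, ±7 per axis, one channel matrix per packet of four transmitted vectors, and a caller-supplied noise variance. The function names are hypothetical.

```python
import random

LEVELS = (-7, -5, -3, -1, 1, 3, 5, 7)  # assumed per-axis 64-QAM levels


def complex_gaussian(rng, variance=1.0):
    """Circularly symmetric complex Gaussian sample with the given total variance."""
    sigma = (variance / 2) ** 0.5
    return complex(rng.gauss(0, sigma), rng.gauss(0, sigma))


def make_packet(rng, n=4, uses_per_channel=4, noise_variance=1.0):
    """One packet: a single channel matrix H shared by four transmitted vectors."""
    h = [[complex_gaussian(rng) for _ in range(n)] for _ in range(n)]
    vectors = []
    for _ in range(uses_per_channel):
        s = [complex(rng.choice(LEVELS), rng.choice(LEVELS)) for _ in range(n)]
        noise = [complex_gaussian(rng, noise_variance) for _ in range(n)]
        y = [sum(h[i][j] * s[j] for j in range(n)) + noise[i] for i in range(n)]
        vectors.append((s, y))
    return h, vectors


BITS_PER_PACKET = 4 * 6 * 4            # vectors x bits/symbol x symbols/vector
TOTAL_BITS = 100_000 * BITS_PER_PACKET  # 9.6 Mbits over the whole simulation

rng = random.Random(0)
h, vectors = make_packet(rng)
print(len(vectors), BITS_PER_PACKET, TOTAL_BITS)  # 4 96 9600000
```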
VIII. CONCLUSION
A novel detection algorithm with an efficient architecture featuring efficient operation over infinite complex-domain lattices has been proposed. The proposed design is scalable both in terms of the number of transmit antennas and the constellation order. Efficient implementation of the subblocks results in the highest throughput and the lowest area and energy consumption in the literature to date. The proposed design was implemented on both the FPGA and ASIC platforms. In the ASIC implementation, the proposed hard-output detector provided a sustained throughput of 1 Gb/s with an area of 340 kgates in a 0.13-μm CMOS process. Synthesis results in 90-nm CMOS show a potential throughput of 1.5 Gb/s.
REFERENCES
[1] Z. Guo and P. Nilsson, "Algorithm and implementation of the K-best sphere decoding for MIMO detection," IEEE J. Sel. Areas Commun., vol. 24, no. 3, pp. 491–503, Mar. 2006.
[2] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, "Closest point search in lattices," IEEE Trans. Inf. Theory, vol. 48, no. 8, pp. 2201–2214, Aug. 2002.
[3] K. W. Wong, C. Y. Tsui, R. S. K. Cheng, and W. H. Mow, "A VLSI architecture of a K-best lattice decoding algorithm for MIMO channels," in Proc. IEEE Int. Symp. Circuits Syst., vol. 3, May 2002, pp. 273–276.
[4] B. M. Hochwald and S. T. Brink, "Achieving near-capacity on a multiple-antenna channel," IEEE Trans. Commun., vol. 51, no. 3, pp. 389–399, Mar. 2003.
[5] H.-L. Lin, R. C. Chang, and H. Chan, "A high-speed SDM-MIMO decoder using efficient candidate searching for wireless communication," IEEE Trans. Circuits Syst. II, vol. 55, no. 3, pp. 289–293, Mar. 2008.
[6] S. Chen, T. Zhang, and Y. Xin, "Relaxed K-best MIMO signal detector design and VLSI implementation," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 3, pp. 328–337, Mar. 2007.
[7] M. Shabany and P. G. Gulak, "A 675 Mb/s, 4 × 4 64-QAM K-best MIMO detector in 0.13 μm CMOS," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 20, no. 1, pp. 135–147, Jan. 2012.
[8] P. A. Bengough and S. J. Simmons, "Sorting-based VLSI architecture for the M-algorithm and T-algorithm trellis decoders," IEEE Trans. Commun., vol. 43, no. 2/3/4, pp. 514–522, Mar. 1995.
[9] C. P. Schnorr and M. Euchner, "Lattice basis reduction: Improved practical algorithms and solving subset sum problems," Math. Programm., vol. 66, nos. 1–3, pp. 181–191, 1994.
[10] B. Kim and I. C. Park, "K-best MIMO detection based on interleaving of distributed sorting," Electron. Lett., vol. 44, no. 1, pp. 42–43, Jan. 2008.
[11] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, "K-best MIMO detection VLSI architectures achieving up to 424 Mb/s," in Proc. IEEE Int. Symp. Circuits Syst., May 2006, pp. 1151–1154.
[12] M. Shabany and P. G. Gulak, "A 0.13 μm CMOS, 655 Mb/s, 64-QAM, K-best 4 × 4 MIMO detector," in Proc. IEEE Int. Solid State Circuits Conf., Feb. 2009, pp. 256–257.
[13] M. Shabany, K. Su, and P. G. Gulak, "A pipelined high-throughput implementation of near-optimal complex K-best lattice decoders," in Proc. IEEE Int. Conf. Acoust., Speech, Signal, Apr. 2008, pp. 3173–3176.
[14] E. M. Witte, F. Borlenghi, G. Ascheid, R. Leupers, and H. Meyr, "A scalable VLSI architecture for soft-input soft-output single tree-search sphere decoding," IEEE Trans. Circuits Syst. II, vol. 57, no. 9, pp. 706–710, Sep. 2010.
[15] P. Bhagawat, R. Dash, and G. Choi, Systolic like soft-detection archi-
tecture for 44 64-QAM MIMO system, in Proc. IEEE Design, Autom.
Test Eur. Conf. Exhibit., Jun. 2009, pp. 870873.
[16] S. Mondal, A. Eltawil, C. Shen, and K. Salama, Design and implemen-
tation of a sort-free K-best sphere decoder, IEEE Trans. Very Large
Scale Integr. (VLSI) Syst., vol. 18, no. 10, pp. 14971501, Oct. 2010.
[17] C. Liao, T. Wang, and T. Chiueh, A 74.8 mW soft-output detector IC
for 8 8 spatial-multiplexing MIMO communications, IEEE J. Solid
State Circuits, vol. 45, no. 2, pp. 411421, Feb. 2010.
[18] C. Studer, S. Fateh, and D. Seethaler, ASIC implementation of soft-
input soft-output MIMO detection using MMSE parallel interference
cancellation, IEEE J. Solid State Circuits, vol. 46, no. 7, pp. 1754
1765, Jul. 2011.
[19] M. Myllyl, J. Cavallaro, and M. Juntti, Architecture design and imple-
mentation of the metric rst list sphere detector algorithm, IEEE Trans.
Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 5, pp. 895899, May
2011.
Mojtaba Mahdavi received the M.Sc. degree in
electrical engineering from the Sharif University of
Technology, Tehran, Iran, in 2010.
He was with Advanced Integrated Circuit Design
Laboratory, Sharif University of Technology, from
2010 to 2012. Currently, he is working on implemen-
tation of the Long Term Evolution (LTE-Advanced)
System. His current research interests include digi-
tal VLSI architectures for digital signal processing
algorithms, VLSI communication systems, digital
integrated circuit design, and field-programmable
gate array-based systems.
Mr. Mahdavi was on the subcommittee for the International Solid-State
Circuits Conference from 2005 to 2008.
Mahdi Shabany received the B.Sc. degree in elec-
trical engineering from the Sharif University of
Technology, Tehran, Iran, in 2002, and the M.Sc.
and Ph.D. degrees in electrical engineering from the
University of Toronto, Toronto, ON, Canada, in 2004
and 2008, respectively.
He is an Assistant Professor with the Electrical
Engineering Department, Sharif University of Tech-
nology. From 2007 to 2008, he was with Red-
line Communications Company, Toronto, where he
developed and patented designs for WiMAX sys-
tems. He served as a Post-Doctoral Fellow with the University of Toronto
in 2009. He holds two U.S. patents. His current research interests include
digital electronics and VLSI architecture and algorithm design for broadband
communication systems.
