Proceedings of the IEEE International Conference on VLSI Design, January 2007. DOI: 10.1109/VLSID.2007.73
Abstract

In recent years, research and development in the field of machine learning and classification techniques have gained paramount importance. The future generation of intelligent embedded devices will require such classifiers working on-line and performing classification tasks in a variety of fields, ranging from data mining to recognition tasks in image and video. Among such techniques, Support Vector Machines (SVMs) have been found to deliver state-of-the-art performance, emerging as the clear winner.

In this work, the Support Vector Machine learning and classification tasks are evaluated on embedded processor architectures, and architectural modifications are proposed to improve their performance.

1 Introduction

Classifiers are intelligent applications equipped with on-line learning for different classification problems like natural language parsing, information retrieval, image recognition, etc. Such classifiers are likely to be a key component in future embedded systems. Among the different classification techniques for machine learning, Support Vector Machines have recently attracted a lot of attention from both researchers and practitioners [14]. The theory of SVMs, developed in recent years, suggests that they are more promising than traditional neural networks, and SVMs have shown remarkable results in both classification and regression applications. However, the inherent complexity of the learning phase of the SVM makes it unsuitable for embedded mobile applications using online learning, discouraging its use in spite of superior performance. In such a situation, a general embedded processor like ARM [1] alone cannot meet the performance requirements. A case study carried out in the present work revealed that an SVM learning machine with 2000 training data points required almost 18x more time to execute on an ARM9 processor than on a Pentium IV. Hence, this work proposes SVMs as the classifier of choice for which specific architectural refinements are required.

In this paper, a comprehensive methodology is proposed for performance enhancement of SVMs on an embedded processor platform, with subsequent architectural refinements performed on the base platform. For the case study, the ARM processor was chosen for its commercial popularity [1]. Based on the information extracted about application characteristics, a domain specific co-processor was implemented to attain a performance improvement. The results from the co-processor implementation indicated that a parallel architecture would be more suitable for obtaining further speedup for the SVM application. Subsequently, a parallel version of the SVM was evaluated on a super-threaded architecture [2] and speedup results were extracted for different numbers of threading units.

The paper is organized as follows: Section 2 briefly reviews related work. Section 3 describes the main aspects of the applications concerned. Section 4 covers application analysis, and Section 5 describes the adopted methodology and working platform; further architectural refinements are suggested and evaluated there, and results for a co-processor based implementation as well as a super-threaded architecture based implementation are presented. Section 6 sums up the work with conclusions and future work.

2 Related Work

There is a significant lack of architecture exploration work for machine learning systems like SVMs. A mixed-signal ASIC implementation of an SVM for pattern recognition tasks has been reported in [3]. Anguita et al. [4] propose the design of a fully digital architecture for SVM training and classification employing the linear and RBF
kernels. None of these works presents a performance evaluation of SVMs on embedded processor platforms with subsequent architectural modifications. However, SVM has been observed to be an application that is ideal for hardware-software co-design, because it can be targeted towards various application domains with little change in the base implementation. A co-design architecture for SVM would be beneficial since the bottleneck computations could be executed in custom hardware, while software programmability would allow the same SVM implementation to be applied to classification work in different domains. This has been a design objective in the present work.
Because the L1 caches have minimum access latency, L1 misses have the most pronounced effect on the number of execution cycles. The block size of both the L1 instruction and data caches was varied from 8 to 64 bytes keeping the number of sets fixed at 128, followed by variation in the number of sets up to 4096 keeping the block size fixed at 64 bytes. As is common for all knowledge-based systems, SVM was also found to have high miss rates for small block sizes and small numbers of sets. Miss rates were found to be smaller for cache configurations with higher associativity. The reduction in miss percentage is evident from Fig 2, which shows the analysis for the L1 instruction cache. Results are shown for three cache associativity values.

Figure 2. Miss ratio for Level 1 Instruction Cache with variable block size and 128 sets (miss % versus L1 I-Cache block size in bytes; 1-way, 2-way and 4-way associativity).
The number of simulation cycles consumed by the SVM learning machine for different cache configurations with varying block size is shown in Fig 3. For a cache block size variation from 8 to 64 bytes, there is a reduction of almost 65% in the execution time and a miss-ratio reduction of up to 85%.

Figure 3. Simulation cycles for Instruction Cache with varying block size (cycles, x10^6, versus L1 I-Cache block size in bytes; 1-way, 2-way and 4-way associativity).
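To make the swept parameters concrete, the following C sketch (our illustration, not code from the paper) shows how a block size and a number of sets, both powers of two, partition an address into block offset, set index and tag; sweeping the block size from 8 to 64 bytes and the number of sets up to 4096, as above, changes exactly this mapping.

    #include <stdint.h>

    /* Illustrative only: how an address maps into a cache with
       'block_bytes'-byte blocks and 'num_sets' sets (both powers of two).
       These are the two parameters varied in the experiments above. */
    typedef struct { uint32_t offset, index, tag; } cache_addr_t;

    static cache_addr_t split_address(uint32_t addr,
                                      uint32_t block_bytes,
                                      uint32_t num_sets)
    {
        cache_addr_t a;
        a.offset = addr & (block_bytes - 1);              /* byte within block */
        a.index  = (addr / block_bytes) & (num_sets - 1); /* set selector      */
        a.tag    = (addr / block_bytes) / num_sets;       /* remaining bits    */
        return a;
    }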
Performance evaluation was also performed assuming bi-modal branch predictor hardware with the branch history table size varied from 8 to 4096 entries, which provided roughly 18.75% savings in execution cycles and a 27% increase in branch prediction rate.
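For reference, a bi-modal predictor of this kind is simply a table of 2-bit saturating counters indexed by low-order PC bits. The sketch below is our illustration, not the simulator's code; the table size shown is one point of the 8 to 4096 sweep.

    #include <stdint.h>

    /* Minimal bi-modal predictor sketch: 2-bit saturating counters
       indexed by low PC bits. BHT_SIZE must be a power of two. */
    #define BHT_SIZE 4096
    static uint8_t bht[BHT_SIZE];   /* counter values 0..3; >= 2 predicts taken */

    static int predict_taken(uint32_t pc)
    {
        return bht[(pc >> 2) & (BHT_SIZE - 1)] >= 2;
    }

    static void train(uint32_t pc, int taken)
    {
        uint8_t *c = &bht[(pc >> 2) & (BHT_SIZE - 1)];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
    }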
5 Architectural Exploration

It was found from the profile results that more than 50% of the execution time was consumed by double-precision FP-MAC operations. Most standard embedded processors do not support floating point (FP) instructions in hardware because their long latencies create frequent pipeline stalls in simple 5-stage pipelines; FP instructions are generally emulated by software FP libraries. FP units with low latencies incur a massive area overhead, and such units are generally found in high-performance cores like the MIPS R3000 [7]. ARM has introduced vector floating point (VFP) co-processors from the ARM10 series onwards (15-cycle FP multiplication). However, a full-fledged co-processor implements all FP instructions in hardware. To keep the design simple and area efficient, an application specific FP co-processor needs to implement only the most frequently occurring FP operations in hardware. Hence, for the SVM, a co-processor needs to be implemented based on these observations. It will operate in conjunction with an ARM base processor and also try to exploit parallelism in the application code.
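The operation in question is the multiply-accumulate inside the inner products on which SVM kernel evaluation is built. A minimal C sketch (our illustration, not the paper's code) of such a loop:

    #include <stddef.h>

    /* Each iteration performs one double-precision FP-MAC. Without FP
       hardware, the multiply and add are emulated by a software FP
       library, which is why this loop dominates the execution profile. */
    static double dot_product(const double *x, const double *y, size_t n)
    {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++)
            acc += x[i] * y[i];
        return acc;
    }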
The design process has been carried out in an interactive co-simulation environment. The software code is executed on top of an Instruction Set Simulator (ISS) called SimIt-ARM [8]. The GeZel [9] design language and environment was chosen for implementing the co-processor. The methodology for application analysis and subsequent architectural refinement is briefly summarized in Fig 4. The methodology relies on a base architecture which is the platform to build upon. As shown in Fig 4, the process is basically iterative.
Figure 4. Design methodology: application profiling, performance bottleneck identification, and partitioning, synchronization and scheduling, iterating if the result is not satisfactory.

ports. A non-busy-waiting protocol adapted for this purpose allows for concurrent execution, which adds to the speedup obtained by implementing the co-processor. The co-processor writes the accumulated MAC result to the out port only when the main processor signals it at the exit of the inner-product computation loop.
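As a sketch of this handshake from the software side, the following C fragment streams data-pairs to a memory-mapped FP-MAC co-processor and reads back the accumulated sum only at loop exit. The register addresses, names and control encodings are hypothetical; the paper does not specify the actual interface.

    #include <stdint.h>

    /* Hypothetical memory-mapped interface to the FP-MAC co-processor. */
    #define COP_OPD_A   (*(volatile uint64_t *)0x80000000u)
    #define COP_OPD_B   (*(volatile uint64_t *)0x80000008u)
    #define COP_CTRL    (*(volatile uint32_t *)0x80000010u)
    #define COP_RESULT  (*(volatile uint64_t *)0x80000018u)
    #define CTRL_MAC    0x1u   /* accumulate one data-pair */
    #define CTRL_FLUSH  0x2u   /* signal loop exit: drive result to out port */

    static double coproc_dot_product(const double *x, const double *y, int n)
    {
        union { double d; uint64_t u; } a, b, r;
        for (int i = 0; i < n; i++) {
            a.d = x[i];
            b.d = y[i];
            COP_OPD_A = a.u;        /* set up the data values in the */
            COP_OPD_B = b.u;        /* special registers             */
            COP_CTRL  = CTRL_MAC;   /* co-processor accumulates; the CPU
                                       proceeds without busy-waiting   */
        }
        COP_CTRL = CTRL_FLUSH;      /* exit of the inner-product loop  */
        r.u = COP_RESULT;           /* read the accumulated MAC result */
        return r.d;
    }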
Table 1. Co-processor statistics.

  Application:                             SVM
  Operation:                               Double-precision FP-MAC
  Computation cycles:                      15
  Operating frequency:                     90 MHz
  Lines of GeZel code:                     629
  Lines of VHDL generated:                 1846
  No. of logic cells (Altera APEX FPGA):   2784
most viable solution, because the clock speeds of single-core embedded processors are not increasing in keeping with the performance requirement. The co-processor based implementation suggested previously for the SVM application will obviously perform better than general purpose embedded processor solutions, but if the performance requirement rises further, the only alternative is multi-processor architectures.

Initially it was found that a sequential implementation of the SVM had certain bottlenecks, which were implemented in a co-processor. However, the speedup had scope for improvement: once the FP-MAC instruction is executed in hardware, the cost of transferring the data to the co-processor becomes greater than the computation cost. Every communication of a data-pair required the processor to set up the data values in special registers, in addition to the bus latency. In such a scenario, when multiple data-pairs were transferred to the co-processor for parallel operation, the execution time increased because of the parallelization overhead. For exploiting the scope of parallel processing, a shared memory multi-processor architecture is more likely to be the solution because of the huge number of memory intensive operations common to all knowledge-based applications. SVMs have shown similarly high memory bandwidth requirements.

The super-threaded architecture combines compiler techniques and runtime hardware support to exploit both thread-level and instruction-level parallelism [17]. It consists of multiple threading units (TUs), with each TU connected to its successor via a unidirectional communication ring. Each TU has its own private L1 instruction cache and data cache, memory buffer, register file, and execution units based on the superscalar MIPS IV architecture. The TUs share unified L2 instruction and data caches, thus modeling a chip multi-processor (CMP) type architecture. The execution model for the super-threaded architecture is thread-pipelining, which allows threads with data and control dependencies to be executed in parallel by runtime checking and speculation.

The execution profile of GPDT suggested that the kernel evaluations of the SVM took nearly 75% of the execution time for an input data size of about 17000 taken from a web-mining data-set [13]. From the relative execution pattern of the kernel evaluations shown in Fig 6, it can be observed that the execution time of the kernel evaluations continues to increase with increasing size of the data-set. Hence, the GPDT code was ported to the super-threaded architecture after parallelizing the kernel evaluation phase by incorporating threading function calls and library support. The proposed modifications to the GPDT software resulted in the spawning of multiple threads of kernel evaluation, which were executed in parallel by the superscalar TUs.

Figure 6. Execution profile: percentage of execution time spent in kernel evaluations (axis range 60-75%).
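The structure of this parallelization can be sketched with POSIX threads; the actual port uses the super-threaded architecture's own threading calls and library, and GPDT's internal data structures, so every name below is our hypothetical simplification. Each worker evaluates a slice of RBF kernel values, mirroring the kernel-evaluation threads spawned across the TUs.

    #include <math.h>
    #include <pthread.h>
    #include <stddef.h>

    #define MAX_THREADS 16   /* upper bound on workers in this sketch */

    typedef struct {
        const double *x;     /* query point */
        const double *data;  /* training points, row-major n x dim */
        double *out;         /* kernel values to fill */
        int dim, begin, end; /* this worker's slice [begin, end) */
        double gamma;        /* RBF width parameter */
    } slice_t;

    static void *kernel_worker(void *arg)
    {
        slice_t *s = (slice_t *)arg;
        for (int i = s->begin; i < s->end; i++) {
            const double *xi = s->data + (size_t)i * s->dim;
            double d2 = 0.0;
            for (int j = 0; j < s->dim; j++) {   /* FP-MAC dominated loop */
                double diff = s->x[j] - xi[j];
                d2 += diff * diff;
            }
            s->out[i] = exp(-s->gamma * d2);     /* RBF kernel value */
        }
        return NULL;
    }

    /* Evaluate one row of the kernel matrix with n_threads workers. */
    static void kernel_row(const double *x, const double *data, double *out,
                           int n, int dim, double gamma, int n_threads)
    {
        pthread_t tid[MAX_THREADS];
        slice_t s[MAX_THREADS];
        int chunk = (n + n_threads - 1) / n_threads;
        for (int t = 0; t < n_threads; t++) {
            int begin = t * chunk;
            int end = begin + chunk < n ? begin + chunk : n;
            s[t] = (slice_t){ x, data, out, dim, begin, end, gamma };
            pthread_create(&tid[t], NULL, kernel_worker, &s[t]);
        }
        for (int t = 0; t < n_threads; t++)
            pthread_join(tid[t], NULL);
    }

Run with four workers, this mirrors the four-TU configuration whose speedup is reported below.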
It can be observed that the speedup increases with the input data size, as the kernel evaluations are computed for a greater amount of time, leading to a greater amount of parallel execution. However, the speedup is not proportional to the number of TUs because of the threading overhead associated with the parallel execution. A speedup of about 2.1 times was observed with four parallel threading units (which is practically implementable in modern embedded MP-SoCs) when compared to sequential execution in a single TU.

Figure: Speedup due to parallel implementation (execution cycles, x10^9).

6 Conclusion

In the present work, architectural refinements were proposed for accelerating the learning phase of an SVM executing in a single embedded processor. For attaining further speedup, a parallel SVM algorithm was ported to a shared memory architecture consisting of multiple high performance cores. All previously reported hardware for SVM learning and classification consists of ASIC/FPGA implementations, whereas the proposed architectural solutions are based on embedded processors and application specific accelerators. Hence, the speedup results reported in the present work cannot be compared with prior works due to the discrepancy in implementation platforms.

The possibility of embedded learning machines is immense, keeping in mind the drive towards intelligent embedded devices that will perform a variety of tasks like mining web-based data, parsing human language and classifying image and video data by learning models of each of the domains.