

Embedded Support Vector Machine: Architectural Enhancements and Evaluation

Soumyajit Dey, Monu Kedia, Niket Agarwal, Anupam Basu


Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur
soumyajit.dey@gmail.com

Abstract

In recent years, research and development in the field of machine learning and classification techniques have gained paramount importance. The future generation of intelligent embedded devices will require such classifiers to work on-line, performing classification tasks in fields ranging from data mining to recognition tasks in image and video. Among such techniques, Support Vector Machines (SVMs) have been found to deliver state-of-the-art performance, emerging as the clear winner.

In this work, the Support Vector Machine learning and classification tasks are evaluated on embedded processor architectures, and architectural modifications are proposed for improving their performance.

1 Introduction

Classifiers are intelligent applications equipped with on-line learning for classification problems like natural language parsing, information retrieval, and image recognition. Such classifiers are likely to be a key component in future embedded systems. Among the various classification techniques for machine learning, Support Vector Machines have recently attracted a lot of attention from both researchers and practitioners [14]. The theory of SVMs, developed in recent years, suggests that they are more promising than traditional neural networks, and SVMs have shown remarkable results in both classification and regression applications. However, the inherent complexity of the learning phase of an SVM makes it unsuitable for embedded mobile applications using online learning, discouraging its use in spite of superior performance. In such a situation, a general embedded processor like ARM [1] alone cannot meet the performance requirements: a case study carried out in the present work revealed that an SVM learning machine with 2000 training data points required almost 18x more time to execute on an ARM9 processor than on a Pentium IV. Hence, this work proposes SVMs as the classifier of choice, for which specific architectural refinements are required.

In this paper, a comprehensive methodology is proposed for performance enhancement of SVMs on an embedded processor platform, and subsequent architectural refinements are performed on the base platform. For the case study, the ARM processor was chosen for its commercial popularity [1]. Based on the information extracted about the application characteristics, a domain specific co-processor was implemented to attain a performance improvement. The results obtained from the co-processor implementation indicated that a parallel architecture would be more suitable for obtaining still higher speedups for the SVM application. Subsequently, a parallel version of SVM was evaluated on a super-threaded architecture [2], and speedup results were extracted for different numbers of threading units.

The paper is organized as follows: section 2 briefly reviews the related work. Section 3 describes the main aspects of the applications concerned. Section 4 covers application analysis, and section 5 is devoted to the adopted methodology and working platform; there, further architectural refinements are suggested and evaluated, and results are presented for a co-processor based implementation as well as a super-threaded architecture based implementation. Section 6 sums up the work with conclusions and future work.

2 Related Work

There is a significant lack of architecture exploration work for machine learning systems like SVMs. A mixed signal ASIC implementation of SVM for pattern recognition tasks has been reported in [3]. Anguita et al. [4] propose the design of a fully digital architecture for SVM training and classification employing the linear and RBF kernels.

None of these works presents a performance evaluation of SVMs on embedded processor platforms with subsequent architectural modifications. However, SVM has been observed to be an application that is ideal for hardware-software co-design, because it can be targeted towards various application domains with little change in the base implementation. A co-design architecture for SVM would be beneficial since the bottleneck computations could be executed in custom hardware while software programmability would allow the same SVM implementation to be applied to classification work in different domains. This has been a design objective in the present work.

3 Application Aspects

SVM algorithms are concerned with learning a mapping $f : X \rightarrow Y$ from a training sample of input-output pairs $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ and applying it for classification [14]. In an example domain of Natural Language Parsing, the function is supposed to map a given sentence $x$ to a parse tree $y$. The basic philosophy of SVM algorithms is to employ linear models to represent nonlinear class boundaries by transforming the input space into a new space using a nonlinear mapping. This transformation is done using kernel functions [14], which ease the process of learning a decision boundary called the Maximum Margin Hyperplane (MMH). The training instances closest to the MMH are identified as Support Vectors (SVs). Support Vector training is a complex Quadratic Optimization Problem (QP). Let $x_i$ be an SV with class value $y_i$ and coefficient $\alpha_i$, and let $b$ be the threshold value. Then, for a simple two-class classification problem, the classification function becomes

$$f(x) = \mathrm{sgn}\Big( \sum_i \alpha_i \, y_i \, K(x, x_i) + b \Big)$$

where $K(x, x_i) = \Phi(x) \cdot \Phi(x_i)$ is the chosen kernel function for the transformation to a high dimensional space. In order to obtain different functional forms of SVs, different kernel functions are employed.
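To make the cost structure of this function concrete, the following C++ sketch shows a two-class decision function with a linear kernel. It is only an illustration of the formula above (variable names and the dense data layout are ours, not SVM-Light's); the point to note is that every classification reduces to chains of double precision multiply-accumulate operations.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Linear kernel: K(a, b) = a . b; other kernels substitute here.
double linear_kernel(const std::vector<double>& a, const std::vector<double>& b) {
    double dot = 0.0;
    for (std::size_t k = 0; k < std::min(a.size(), b.size()); ++k)
        dot += a[k] * b[k];                    // double precision multiply-accumulate
    return dot;
}

// f(x) = sgn(sum_i alpha_i * y_i * K(x, x_i) + b) over all support vectors x_i.
int classify(const std::vector<std::vector<double>>& sv,  // support vectors x_i
             const std::vector<double>& alpha,            // coefficients alpha_i
             const std::vector<double>& y,                // class values y_i (+/-1)
             double b,                                    // threshold value
             const std::vector<double>& x) {              // instance to classify
    double sum = b;
    for (std::size_t i = 0; i < sv.size(); ++i)
        sum += alpha[i] * y[i] * linear_kernel(x, sv[i]);
    return (sum >= 0.0) ? +1 : -1;
}
```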

4 Application Analysis

The architectural exploration of the task begins with a workload analysis of SVM on the Simplescalar architecture simulator [5]. The baseline configuration was selected to model the StrongARM architecture. Several open source software implementations of SVM are available; SVM-Light was chosen because of the software optimizations it already implements to reduce the training time, a bottleneck in most learning techniques: i) an efficient and effective method for selecting the working set, ii) successive shrinking of the optimization problem, and iii) caching and incremental update of the gradient together with efficient termination criteria [15].

4.1 Execution Profiles

Initially, profile based executions of the SVM applications were studied extensively. This was necessary for identifying the task-level hot spots which need to be accelerated through architectural optimizations. The application binaries were ported to a TI-OMAP1510 [6] based platform running Embedded Linux. The profile results were obtained using arm-linux-gprof, with profiling done at both the procedural and the basic block level.

[Figure 1. Profiling Results of SVM.]

SVM-Light: The profiling of SVM-Light was done with input data containing up to 2000 training examples. As the profiling results in Fig 1 show, the sprod_ns procedure takes more than 50% of the total execution time and is therefore the computational bottleneck of the application. Analysis of sprod_ns reveals that it is called many times and contains double precision floating point multiply-and-accumulate (FP-MAC) operations in a loop. The reason for this huge number of FP-MAC operations is as follows: SVM-Light offers several choices of kernel, and even in the simplest one, the linear kernel $K(x_i, x_j) = x_i \cdot x_j$, inner product computation consumes most of the time. Over the whole set of support vectors, the computation of the inner products takes the form of a matrix-vector multiplication (MVM) of dimension $N \times n$, where $N$ is the number of SVs and $n$ is the feature vector length. This naturally leads to a huge number of fused multiply-add operations.
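Viewed over all support vectors at once, the inner product loop is exactly an N x n matrix-vector multiply. The sketch below, a dense rendering of our own (SVM-Light itself operates on sparse vectors inside sprod_ns), makes the FP-MAC pattern that dominated the profile explicit.

```cpp
#include <cstddef>
#include <vector>

// N x n matrix-vector multiply: row i holds support vector x_i, v is the
// feature vector being classified. Each output element is an n-term chain of
// double precision multiply-accumulates -- the operation pattern behind the
// sprod_ns hot spot identified by the profile.
std::vector<double> mvm(const std::vector<double>& sv_matrix,  // N*n, row-major
                        std::size_t N, std::size_t n,
                        const std::vector<double>& v) {        // length n
    std::vector<double> out(N, 0.0);
    for (std::size_t i = 0; i < N; ++i) {
        double acc = 0.0;
        for (std::size_t k = 0; k < n; ++k)
            acc += sv_matrix[i * n + k] * v[k];  // candidate for a fused MAC unit
        out[i] = acc;
    }
    return out;
}
```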
4.2 Memory Behavior

A detailed cache analysis was performed to obtain insight into the temporal versus spatial relationships of the memory access patterns. The memory system was modeled with access latencies of 1 cycle for the L1 caches, 6 cycles for the L2 caches, and 18 cycles for DRAM. Since the L1 caches have the minimum access latency, L1 misses have the most pronounced effect on the number of execution cycles.

The block size of both the L1 instruction and data caches was varied from 8 to 64 bytes keeping the number of sets fixed at 128, followed by a variation of the number of sets up to 4096 keeping the block size fixed at 64 bytes. As is common for knowledge-based systems, SVM was found to have high miss rates for small block sizes and small numbers of sets, and miss rates were smaller for cache configurations with higher associativity. The reduction in miss percentage is evident from Fig 2, which shows the analysis for the L1 instruction cache; results are shown for three associativity values (1-way, 2-way and 4-way).

[Figure 2. Miss ratio for the Level 1 instruction cache with variable block size and 128 sets.]

The number of simulation cycles consumed by the SVM learning machine for different cache configurations with varying block size is shown in Fig 3. For a cache block size variation from 8 to 64 bytes, there is a reduction of almost 65% in the execution time and a miss-ratio reduction of up to 85%.

[Figure 3. Simulation cycles for the L1 instruction cache with varying block size.]

Performance evaluation was also carried out assuming bi-modal branch predictor hardware with the branch history table size varied from 8 to 4096 entries, which provided roughly 18.75% savings in execution cycles and a 27% increase in the branch prediction rate.
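For concreteness, the two sweeps described above cover the following design points; the snippet below simply enumerates them and prints the implied cache capacity (sets x block size x associativity). It is our own illustration of the swept space, not the simulator's configuration interface.

```cpp
#include <cstdio>

int main() {
    const int assocs[] = {1, 2, 4};  // associativities shown in Fig 2

    // Sweep 1: block size 8..64 bytes, number of sets fixed at 128.
    for (int assoc : assocs)
        for (int block = 8; block <= 64; block *= 2)
            std::printf("sets=128  block=%2d B  assoc=%d  capacity=%6d B\n",
                        block, assoc, 128 * block * assoc);

    // Sweep 2: number of sets 128..4096, block size fixed at 64 bytes.
    for (int assoc : assocs)
        for (int sets = 128; sets <= 4096; sets *= 2)
            std::printf("sets=%4d block=64 B  assoc=%d  capacity=%7d B\n",
                        sets, assoc, sets * 64 * assoc);
    return 0;
}
```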
5 Architectural Exploration

The profile results showed that more than 50% of the execution time was consumed by double precision FP-MAC operations. Most standard embedded processors do not support floating point (FP) instructions in hardware, because their long latencies create frequent pipeline stalls in simple 5-stage pipelines; FP operations are instead emulated by software FP libraries. FP units with low latencies incur a massive area overhead, and such units are generally found only in high-performance cores like the MIPS R3000 [7]. ARM has introduced vector floating point (VFP) co-processors from the ARM10 series onwards (15-cycle FP multiplication). A full-fledged co-processor, however, implements all FP instructions in hardware. To keep the design simple and area efficient, an application specific FP co-processor needs to implement only the most frequently occurring FP operations in hardware. Hence, for the SVM, a co-processor was implemented based on these observations; it operates in conjunction with an ARM base processor and also tries to exploit parallelism in the application code.

5.1 Co-Processor Based Design

The design process was carried out in an interactive co-simulation environment: the software code executes on top of an Instruction Set Simulator (ISS) called SimIt-ARM [8], and the GeZel [9] design language and environment was chosen for implementing the co-processor. The methodology for application analysis and subsequent architectural refinement is summarized in Fig 4. It relies on a base architecture which serves as the platform to build upon. As shown in Fig 4, the process is basically an iterative approach in which application bottlenecks are identified in each pass.

This is similar to any standard SoC design methodology where, for reasons of energy efficiency and performance, co-processors are used for domain specific tasks.

[Figure 4. Adopted Methodology.]

Analysis of SVM-Light shows that more than 57% of the computation intensive learning time is consumed by the single function sprod_ns, which performs double precision FP-MAC operations in a loop and returns the accumulated result to the calling function. The vectors passed are dynamically allocated, leading to inner product computations of different lengths for different function calls; this forbids the use of shared memory and increases the communication overhead. The main issues in the co-processor implementation are as follows:

• Representation format: Since the application uses only double precision arithmetic, only the double precision format was supported in the co-processor.

• Unsupported modes: The co-processor was implemented without support for floating point exception handling, operations on denormalised numbers, and the rounding modes [10]. Being infrequent, as found from the execution statistics, these features were left to be handled in software. After designing this application specific, low area overhead co-processor, classification results for natural language parsing and sample text-mining inputs were compared with those of the software implementation, showing zero loss of accuracy.

• Communication protocol: The application and the co-processor communicate through memory mapped ports. A non-busy-waiting protocol adopted for this purpose allows concurrent execution, which adds to the speedup obtained by implementing the co-processor. The co-processor writes the accumulated MAC result to the out port only when the main processor signals it at the exit of the inner product loop; a sketch of this handshake follows the list.
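The fragment below sketches, from the software side, what such a memory mapped exchange can look like. The register addresses, the flag encoding, and the function names are hypothetical placeholders for illustration; the paper does not document the actual port map.

```cpp
#include <cstdint>

// Hypothetical memory mapped co-processor registers (addresses illustrative only).
volatile double*   const OPERAND_A = reinterpret_cast<volatile double*>(0xFFF00000);
volatile double*   const OPERAND_B = reinterpret_cast<volatile double*>(0xFFF00008);
volatile uint32_t* const CTRL      = reinterpret_cast<volatile uint32_t*>(0xFFF00010);
volatile double*   const MAC_OUT   = reinterpret_cast<volatile double*>(0xFFF00018);

enum : uint32_t { FEED = 1u, DONE = 2u, READY = 4u };  // hypothetical flag bits

// Inner product off-loaded to the FP-MAC co-processor. The main processor
// streams operand pairs and is free to issue independent work between feeds
// (the non-busy-waiting aspect); the accumulated result is read back exactly
// once, after signalling the exit of the inner product loop.
double offloaded_dot(const double* a, const double* b, int n) {
    for (int k = 0; k < n; ++k) {
        *OPERAND_A = a[k];
        *OPERAND_B = b[k];
        *CTRL = FEED;            // co-processor accumulates a[k] * b[k]
        // ... independent ARM-side instructions can execute here ...
    }
    *CTRL = DONE;                // signal the exit of the inner product loop
    while (!(*CTRL & READY)) {}  // final synchronization on the out port
    return *MAC_OUT;             // single read of the accumulated MAC result
}
```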

Results of Co-processor Design: The architectural refinements led to the speedup results shown in Fig 5. With a single co-processor and a base processor modeled on the StrongARM [11] instruction set architecture (ISA), the designed SVM co-processor architecture yielded a severalfold speedup over the software implementation in the learning phase.

[Figure 5. SVM co-processor speedup results: cycle counts of the SW and HW-SW versions against the number of training examples.]

From the GeZel description of the co-processor, synthesizable VHDL code was extracted using back-end code generators, and the design was synthesized for an Altera APEX FPGA to obtain area and timing results. The usefulness of the customizations performed for the SVM co-processor was validated by comparing its synthesis results with those of a full-fledged IEEE 754 compliant double precision design [16] at the same operating frequency and on the same target board: the full implementation required 4.7 times the area of the custom design reported in the present work. Statistics for the implemented co-processor are given in Table 1.

Table 1. Co-processor statistics.

  Application:                      SVM
  Operation:                        Double-precision FP-MAC
  Computation cycles:               15
  Operating frequency:              90 MHz
  Lines of GeZel code:              629
  Lines of generated VHDL:          1846
  Logic cells (Altera APEX FPGA):   2784

5.2 Exploiting Parallelism

With the increasing complexity of the applications being migrated to the embedded domain, multi-processor System-on-Chip (MP-SoC) solutions have become the most viable option, because the clock speeds of single-core embedded processors are not increasing in keeping with the performance requirements. The co-processor based implementation suggested previously for the SVM application will obviously perform better than general purpose embedded processor solutions; but if the performance requirement is higher still, the only alternative is a multi-processor architecture.

Initially it was found that a sequential implementation of the SVM had certain bottlenecks, which were implemented in a co-processor. However, the speedup had scope for improvement. The reason is that, once the FP-MAC instruction executes in hardware, the cost of transferring the data to the co-processor becomes greater than the computation cost: every communication of a data pair requires the processor to set up the data values in special registers and to pay the bus latency. In such a scenario, when multiple data pairs were transferred to the co-processor for parallel operation, the execution time increased because of the parallelization overhead. For exploiting parallel processing, a shared memory multi-processor architecture is the more likely solution because of the huge number of memory intensive operations common to all knowledge-based applications. SVMs show similarly high memory bandwidth demands, and performing the MVM operations in parallel is an attractive option.
For a parallel embedded architecture, the applications need to be based on algorithms designed to exploit the parallelism that the architecture offers. SVM-Light [12] is not an inherently parallel implementation of the SVM, so GPDT [18], a parallel implementation of SVMs, was used instead. GPDT is a C++ software package designed to train large-scale SVMs in both scalar and distributed memory parallel environments. It uses a popular problem decomposition technique to split the SVM quadratic programming (QP) problem into a sequence of smaller QP subproblems, each solved by a suitable gradient projection method (GPM). The GPM used in the present work, part of the GPDT software, is the Dai-Fletcher method, chosen for its relative merits as discussed in [19].

The GPDT version of the SVM learning machine was ported to a multi-threaded version of the Simplescalar simulator [2]. The super-threaded architecture integrates compilation techniques and runtime hardware support to exploit both thread-level and instruction-level parallelism [17]. It consists of multiple threading units (TUs), with each TU connected to its successor via a unidirectional communication ring. Each TU has its own private L1 instruction cache and data cache, memory buffer, register file, and execution units based on the superscalar MIPS IV architecture. The TUs share unified L2 instruction and data caches, thus modeling a chip multi-processor (CMP) type architecture. The execution model of the super-threaded architecture is thread pipelining, which allows threads with data and control dependencies to be executed in parallel through runtime checking and speculation.

The execution profile of GPDT showed that the kernel evaluations of the SVMs take nearly 75% of the execution time for an input data size of about 17000 points taken from a web-mining data-set [13]. The relative execution pattern of the kernel evaluations, shown in Fig 6, indicates that the share of execution time spent in kernel evaluations continues to increase with the size of the data-set.

[Figure 6. Execution profile of the kernel evaluation in GPDT: percentage of execution time versus number of data points.]

Hence, the GPDT code was ported to the super-threaded architecture after parallelizing the kernel evaluation phase by incorporating threading function calls and library support. The proposed modifications to the GPDT software spawn multiple threads of kernel evaluation, which are executed in parallel by the superscalar TUs.
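A minimal pthreads-style sketch of this partitioning is given below: the rows of the kernel matrix are divided into contiguous blocks, one per threading unit. All names, the dense data layout, and the RBF kernel choice are our own illustration; GPDT's actual threading calls and kernel code are not reproduced here.

```cpp
#include <pthread.h>
#include <cmath>
#include <vector>

// Illustrative dense RBF kernel over a global feature matrix (layout ours).
static const double* g_x;       // n x d feature matrix, row-major
static int g_d;                 // feature dimension
static double g_gamma = 0.5;    // RBF width (arbitrary for the sketch)

static double kernel_eval(int i, int j) {
    double s = 0.0;
    for (int k = 0; k < g_d; ++k) {
        double diff = g_x[i * g_d + k] - g_x[j * g_d + k];
        s += diff * diff;
    }
    return std::exp(-g_gamma * s);
}

struct Chunk { int begin, end, n; double* out; };

// Each threading unit (TU) evaluates a contiguous block of kernel rows.
static void* eval_rows(void* arg) {
    Chunk* c = static_cast<Chunk*>(arg);
    for (int i = c->begin; i < c->end; ++i)
        for (int j = 0; j < c->n; ++j)
            c->out[static_cast<long>(i) * c->n + j] = kernel_eval(i, j);
    return nullptr;
}

// Spawn one thread per TU over the row range [0, n), then join.
void parallel_kernel_rows(int n, double* out, int num_tu) {
    std::vector<pthread_t> tid(num_tu);
    std::vector<Chunk>     chunk(num_tu);
    int rows = (n + num_tu - 1) / num_tu;
    for (int t = 0; t < num_tu; ++t) {
        int b = t * rows;
        int e = (b + rows < n) ? (b + rows) : n;
        chunk[t] = Chunk{b, e, n, out};
        pthread_create(&tid[t], nullptr, eval_rows, &chunk[t]);
    }
    for (int t = 0; t < num_tu; ++t)
        pthread_join(tid[t], nullptr);
}
```

With num_tu set to four, this corresponds to the 4-TU configuration whose speedup is reported below.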

Results of Parallel Implementation: The kernel evaluation part of GPDT was parallelized and simulated on the super-threaded architecture with varying numbers of threading units. The results, shown in Fig 7, demonstrate the speedup achieved by parallelization and highlight the fact that the speedup increases with the input data size, since the kernel evaluations then run for a greater share of the time, leading to a greater amount of parallel execution. However, the speedup is not proportional to the number of TUs because of the threading overhead associated with parallel execution. A speedup of about 2.1 times was observed with four parallel threading units (a number that is practically implementable in modern embedded MP-SoCs) when compared to sequential execution on a single TU.

[Figure 7. Speedup results of the parallel implementation of SVM: execution cycles versus number of data points for sequential, 4-TU, 8-TU and 16-TU configurations.]

The FP-MAC operation, which was previously implemented in a co-processor, was also tried on the super-threaded architecture; however, the speedup achieved was not considerable. The co-processor based implementation had used ARM as its base processor, which has no FP instructions, so implementing the FP-MAC instruction in custom hardware led to considerable acceleration. The super-threaded architecture used for evaluating GPDT, in contrast, already provides floating point addition and multiplication instructions with a low latency implementation (4 cycles), and hence custom hardware does not bring a considerable advantage in this case.

6 Conclusion

The present work considered domain specific architecture exploration for SVMs in embedded handheld devices. The memory subsystem exploration led to fixing proper values of the cache and branch predictor parameters. An application specific floating point co-processor was designed for accelerating the learning phase of an SVM executing on a single embedded processor. For attaining further speedup, a parallel SVM algorithm was ported to a shared memory architecture consisting of multiple high performance cores. All previously reported hardware for SVM learning and classification consists of ASIC/FPGA implementations, whereas the architectural solutions proposed here are based on embedded processors and application specific accelerators; hence, the speedup results reported in the present work cannot be compared with prior works, owing to the difference in implementation platforms.

The potential of embedded learning machines is immense, given the drive towards intelligent embedded devices that will perform a variety of tasks like mining web-based data, parsing human language, and classifying image and video data by learning models of each of the domain specific problems. Future work in this direction will focus on extracting custom instructions that accelerate the QP solver phase of SVM learning, and on customizing the multi-processor infrastructure with more application specific optimizations.

References

[1] ARM. http://www.arm.com/.
[2] SIMCA. http://www.arctic.umn.edu/SIMCA/index.shtml/.
[3] R. Genov and G. Cauwenberghs, "Kerneltron: Support Vector 'Machine' in Silicon," IEEE Transactions on Neural Networks, vol. 14, no. 5, Sept. 2003.
[4] D. Anguita, A. Boni and S. Ridella, "A Digital Architecture for Support Vector Machines: Theory, Algorithm, and FPGA Implementation," IEEE Transactions on Neural Networks, vol. 14, no. 5, Sept. 2003.
[5] Simplescalar. http://www.simplescalar.com/.
[6] TI. http://www.ti.com/.
[7] MIPS. http://www.mips.com/.
[8] SimIt-ARM. http://sourceforge.net/projects/simit-arm/.
[9] GeZel. http://www.ece.vt.edu/schaum/gezel/.
[10] IEEE 754. http://grouper.ieee.org/groups/754/.
[11] StrongARM. http://www.intel.com/.
[12] SVM-Light. http://svmlight.joachims.org/.
[13] GPDT data sets. http://dm.unife.it/gpdt/.
[14] C.J.C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, vol. 2, June 1998.
[15] T. Joachims, "Making Large-Scale SVM Learning Practical," in Advances in Kernel Methods - Support Vector Learning, MIT Press, 1999.
[16] G. Danese, I. De Lotto, F. Leporati, M. Scaricabarozzi and A. Spelgatti, "An Accelerator for Double Precision Floating Point Operations," Euro-PDP, 2003.
[17] J.-Y. Tsai, J. Huang, C. Amlo, D.J. Lilja and P.-C. Yew, "The Superthreaded Processor Architecture," IEEE Transactions on Computers, vol. 48, no. 9, pp. 881-902, 1999.
[18] G. Zanghirati and L. Zanni, "A Parallel Solver for Large Quadratic Programs in Training Support Vector Machines," Parallel Computing, pp. 535-551, 2003.
[19] Y.H. Dai and R. Fletcher, "New Algorithms for Singly Linearly Constrained Quadratic Programming Problems Subject to Lower and Upper Bounds," Math. Prog., vol. 106(3), pp. 403-421, 2006.
