H.264 Decoder Implementation on a Dynamically Reconfigurable Instruction Cell Based Architecture

Adam Major, Ying Yi, Ioannis Nousias, Mark Milward, Sami Khawam and Tughrul Arslan

School of Engineering and Electronics


University of Edinburgh, Edinburgh, EH9 3JL
Tel: (+44)131 650 5619 Email: adam.major@ed.ac.uk

ABSTRACT

This paper presents a new Baseline Profile compliant H.264 decoder implementation specifically tailored for an ANSI-C programmable, dynamically reconfigurable, instruction cell based architecture which has recently been developed [10]. We use the ffmpeg libavcodec library as the basis for our decoder and identify the most processor intensive functions. These functions are tailored in a novel framework incorporating established software techniques alongside several architecture specific transforms. Initial results demonstrate that our reconfigurable architecture based decoder provides a significant performance boost with power figures below those of a microcontroller such as an ARM.

I. INTRODUCTION

Future mobile imaging and video applications, from personal entertainment systems to aerospace reconnaissance platforms, will demand high image quality at extreme resolutions whilst remaining constrained to relatively low transmission bandwidths, storage capacity and energy resources. H.264/AVC is the latest video compression standard from the Joint Video Team (JVT) of the ISO MPEG and ITU-T VCEG, and aims to address the needs of these next generation video systems [4]. The standard builds on and extends the feature sets of previous standards such as MPEG-2, H.263+ and MPEG-4, and adds several key components such as an in-loop de-blocking filter; variable block size, quarter pixel precision motion prediction; and context adaptive binary arithmetic coding (CABAC). Studies show that this enables it to achieve bit rate reductions of around 50% compared with MPEG-2 streams of comparable subjective image quality [9][13]. However, the computational cost associated with the encoding and decoding algorithms is also significantly elevated [3][7][16].

At present there are three established technologies for use in mobile applications: general purpose Digital Signal Processing cores (DSP), Application Specific Integrated Circuits (ASIC) and Field Programmable Gate Arrays (FPGA). Each of these has advantages but, importantly, none of them can provide a comprehensive solution to all future mobile requirements. DSPs are highly flexible, allowing full general purpose programmability which can be updated at a low cost after release to respond to changing standards. They are, however, significantly slower than alternative platforms. At the other end of the scale, ASIC design offers the best possible combination of high throughput and low power operation, but development is costly and it lacks post-fabrication flexibility. FPGA solutions are able to provide performance close to that of ASIC with the added benefit that post-fabrication modifications are possible. They are less attractive in terms of energy and area requirements, however, due to the large overhead imposed by complex configuration and interconnect logic [8]. Therefore, in order to address all the future requirements of mobile systems we require new low power, high performance technologies which are highly flexible and inexpensive in terms of development and maintenance costs.

One such technology, which combines an energy efficient, instruction cell based, dynamically reconfigurable fabric with ANSI-C programmability, has recently been developed in industry [10]. This architecture belongs to the emerging field of Reconfigurable Computing (RC) and aims to combine the flexibility and programmability of DSP with the performance of FPGA and the energy requirements of ASIC in order to meet the demands of next generation mobile systems. This paper provides the initial results of the first H.264 decoder implemented on a programmable instruction cell based reconfigurable architecture. We present timing and performance analyses of the open source ffmpeg libavcodec library compiled for this system, and then apply the first stage in a number of planned optimisations to computationally intensive code sections in order to improve the throughput.

This paper is organised as follows. Section II contains a brief overview of previous work. Section III describes the implementation of code tailoring. Testing and results are provided in Sections IV and V. Finally, conclusions are drawn in Section VI.

0-7803-9782-7/06/$20.00 ©2006 IEEE

Authorized licensed use limited to: The Ohio State University. Downloaded on January 8, 2009 at 08:58 from IEEE Xplore. Restrictions apply.
II. BACKGROUND AND PREVIOUS WORK

A. Target reconfigurable architecture overview

The target core consists of a highly reconfigurable fabric of interconnected instruction cells (ICs) with functionality derived from common machine code instructions. The cells can be dynamically reconfigured to provide highly parallel, FPGA-like representations of typical software operations. The fabric is controlled through a series of VLIW instructions which are extracted from a C based high level algorithm description by a sophisticated scheduling algorithm [15]. This enables the architecture to exploit far higher levels of instruction level parallelism than existing processors executing the same code, since it is not constrained by dependencies between operations or by small branches. As a result it exhibits a significant performance and flexibility advantage over traditional solutions [10].

The resources available to the system are fully parameterisable with a standard allocation defined for general purpose applications. At present the standard core is only around 4 times the area of a small DSP core such as OpenRISC. When compared to FPGA, this area equates to between twice as much for small designs and around 10 times less for large designs. Furthermore, tests across a range of algorithms indicate energy requirements of up to 7 times less than DSP and 16 times less than FPGA. This combination of low power, performance and flexibility makes the instruction cell based architecture a prime candidate for the execution of processor intensive image processing algorithms such as H.264 in a mobile environment.

B. Code optimisation

The instruction cell based architecture affords significant scope for code tailoring to achieve even greater performance. The core has been found to respond most favourably to techniques which enable it to take maximum advantage of instruction level parallelism. In general these techniques attempt to re-arrange the target algorithm to reduce the number and density of conditional branches. Many methods employed to increase performance on traditional processors, such as loop unrolling, loop unfolding and loop re-distribution, can be very effective. Additionally, it is possible to realise small branches as hardware multiplexers, reducing jumps and allowing groups of conditional statements to be evaluated simultaneously. Consequently, throughput can be increased significantly at the expense of a few wasted computations.

Furthermore, the nature of the reconfigurable fabric is such that a number of conventionally hardware specific optimisations, such as shift registers and pipelining, can be implemented for appropriate sections of code. This can provide orders of magnitude greater performance for heavily iterated loop sections. Since such sections often account for well over three quarters of the execution time of an algorithm, these are extremely powerful techniques.

III. IMPLEMENTATION

For this initial implementation, the standard libavcodec algorithms were modified based on findings from several timing profiles of simulated decode tests, which highlighted the most appropriate target functions for optimisation. In order to assess the ease of porting existing software to the new core, only minimal and obvious modifications were made and no advanced methods such as pipelining were employed.

Loop Transforms: Several standard transforms were applied selectively to suitable looping sections of code. Loops with a small number of iterations or a short string of internal commands were recoded into a single iteration to reduce the loop overheads and increase the opportunity to perform commands in parallel. Longer loops were unrolled by a number of iterations or re-distributed entirely, for example to remove a conditional jump from the body of the loop.

Multiplexer Enforcement: This optimisation relies on a specialised feature of the reconfigurable core, namely the inclusion of a hardware multiplexer. Ordinarily, branches reduce the number of computations that must be performed by skipping sections of code that are not applicable when certain conditions exist. In the reconfigurable core it is more efficient, in terms of speed and energy, to evaluate both branches and then select the appropriate result using the hardware multiplexer. In using this technique attention is paid to the relative costs of evaluating one branch over the other. If one side requires substantially more computation then it is often preferable to leave the jump in place and focus on the body of each branch. However, in simple situations, for example where an “if” statement is simply assigning a value, it remains more efficient to map unbalanced branches as multiplexers since the benefits of increased parallelism outweigh the cost of always evaluating both paths.

Computational Simplifications: A number of computational simplifications were made to the code. A scheme of pre-calculation was applied for common expressions in arithmetic heavy sections to reduce duplication of operations.

Memory Access Reduction: Many operations in the H.264 algorithm repeatedly require the same piece of input data for a number of calculations. FIR filter like sections, for example, will access the same data location for n results, where n is the number of taps. In the libavcodec implementation of the decoder, sets of input data are often stored in an array, which not only incurs a high indexing overhead but also translates into one or several accesses to main memory every time the data is required.

In hardware representations this data would be stored in fast local registers to improve access times. We are able to enforce a similar scheme for the reconfigurable core by assigning repeatedly accessed locations in the input array to non-array variables and thereafter referencing the data through them. In this way the data is stored in the core’s internal registers and the memory access overhead is significantly reduced.

IV. TEST SETUP

A. Test Sequences

The test sequences “Driving”, “Opening Ceremony” and “Whale Show” were used to perform both the initial complexity analysis and subsequent performance assessment. Each sequence uses YUV 4:2:0 format and consists of 200 frames at a resolution of 720x480. The characteristics and complexity of the sequences vary, which allows the performance of the decoder over a range of scene types to be roughly assessed.

The “Opening Ceremony” sequence is relatively low in complexity, comprising a static backdrop overlaid with small areas of simple motion as teams march into an Olympic stadium. By comparison, “Whale Show” is highly complex, combining a high level of object motion with fast and irregular camera panning as the scene follows a whale performing at a wildlife park. The complexity of the “Driving” sequence falls between the other two scenes; it consists of a regular slow moving background with an essentially stationary object at the image focus.

B. Encoder Settings

Test sequences were encoded using the JVT Joint Model 10.2 reference software (JM 10.2) configured using the default baseline profile settings [5]. A model with constant QP was used since the rate control supported by the software is not fully mature. Hence complex scenes are allocated higher bit rates and present a more complex task to the decode algorithm.

V. RESULTS

A. Optimisation

Table 1 presents the performance achieved by our decoder for each test sequence using both modified and original decoding algorithms. Executing the standard code, performance ranges from 12,407 Macro-Blocks per second (MB/s) for complex scenes to 22,424 MB/s for simple sequences. The average output bandwidth of 16,848 MB/s exceeds the requirements of Baseline Profile Level 2 [4].

After minor modification the performance of the algorithm on the reconfigurable core is improved by over 28%. The optimised decoder speed ranges from 15,944 MB/s to 28,877 MB/s. The average output bandwidth is 21,654 MB/s and hence this initial implementation realises just over half of the speed required for Level 3 compliance [4].

Table 1: Decoder Performance in frames per second

Sequence    Standard    Optimised    Improvement
Driving     11.64       14.93        28.25%
Opening     16.61       21.39        28.78%
Whale       9.19        11.81        28.56%
Average     12.48       16.04        28.53%

B. Comparison with Existing Platforms

Table 2 compares our decoder with a selection of existing implementations. TIVR Communications H.264 Baseline Profile Software is aimed at mobile platforms powered by ARM and XScale processors [12]. The software enables a 220 MHz ARM9 based device to achieve real time decoding of QVGA (320x240) video streams at 25 fps. This equates to a throughput of 7,500 MB/s with energy requirements of 7.33 µJ per decoded Macro-Block (µJ/MB). Using tailored code our decoder provides roughly three times this performance whilst consuming around six times less energy.

TIVR’s solution has been highly optimised and tailored for ARM based systems. In order to gauge the true performance differential between ARM and the reconfigurable architecture, the source code was also executed on a simulated ARM9TDMI core using the ADS 1.0.1 ARMulator [2]. The simulated core uses an idealised memory bus and default flat memory map, which assesses its maximum theoretical speed. Compared with a similarly idealised reconfigurable core, the ARM is eight times slower. If a similar differential existed between our decoder and TIVR ARM, the throughput achieved would be close to 60,000 MB/s.

The second device in Table 2 is an AMBA AHB based SoC ASIC decoder design proposed in [6].
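
The macro-block rates quoted in this section follow directly from the frame dimensions: a frame holds one 16x16 macro-block per 16 pixels in each direction (rounded up), and multiplying by the frame rate gives MB/s. The ANSI-C sketch below reproduces that conversion as a cross-check of the figures. The function names are hypothetical and are not part of libavcodec or of the decoder described here.

```c
#include <stdio.h>

/* Number of 16x16 macro-blocks in a frame; partial blocks count as whole. */
int macroblocks_per_frame(int width, int height)
{
    return ((width + 15) / 16) * ((height + 15) / 16);
}

/* Decoder throughput in macro-blocks per second at a given frame rate. */
double macroblocks_per_second(int width, int height, double fps)
{
    return (double)macroblocks_per_frame(width, height) * fps;
}

/* Print the per-frame macro-block count and the resulting throughput. */
void report_throughput(const char *label, int width, int height, double fps)
{
    printf("%s: %d MB/frame, %.0f MB/s\n", label,
           macroblocks_per_frame(width, height),
           macroblocks_per_second(width, height, fps));
}
```

For the 720x480 test sequences this gives 1,350 macro-blocks per frame, so the average rates of 12.48 fps and 16.04 fps in Table 1 correspond to 16,848 MB/s and 21,654 MB/s, and QVGA at 25 fps corresponds to 7,500 MB/s.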

The device combines a high performance ARM microcontroller with a pipelined set of application specific hardware blocks which enable it to realise a real time 1080HD decoder at 30 fps. To achieve this, the core must run at 170 MHz, at which it consumes 554 mW, which equates to 2.28 µJ/MB. Compared with our decoder this design requires roughly twice the energy to decode a single MB.

The last device in Table 2 is a Baseline decoder core for Xilinx Virtex and Spartan FPGAs developed by 4i2i [1]. Documentation for the core states clock speeds required for a number of standard video resolutions alongside the maximum frequency attainable when implemented on a Xilinx Virtex-II array. Extrapolating from this data yields a theoretical upper throughput limit of 79,200 MB/s, which equates to roughly 4 times the speed of our current decoder implementation.

To approximate the energy required by this design to decode a single Macro-Block we refer to the findings of [11], which estimates the power consumption of a Virtex-II CLB in a typical design to be 5.9 µW per MHz. The decoder has a stated area of 15,000 slices, which equates to 3,750 CLBs in a Virtex-II array [14].

Table 2: Comparison of H.264 decoder implementations

Platform                  Throughput (Macro-Blocks/s)    Energy (µJ/Macro-Block)
ARM9 [12]                 7,500                          7.33
Xilinx Virtex II [1]      79,200                         15.64
ASIC [6]                  243,000                        2.28
Our decoder               16,848                         1.49
Our decoder (tailored)    21,654                         1.15

VI. CONCLUSIONS

This paper has presented the first H.264 decoder implementation on a reconfigurable, instruction cell based architecture. Currently the decoder achieves between 12 fps and 21 fps at NTSC D1 after minor modification to the standard libavcodec based code. Compared to existing RISC based solutions our decoder exhibits significant performance advantages, attaining three times the speed of a highly optimised ARM9 solution and eight times the speed of an ARM9 executing the same code. Furthermore, it vastly outperforms DSP, FPGA and ASIC accelerated microcontroller designs in terms of energy requirements.

The aim of future implementations will be to match and exceed the throughput of FPGA and ASIC based devices by applying a deeper level of modifications to the critical modules of the decoder. Amongst the candidate functions will be the in-loop de-blocking filter, sub-pixel motion compensation and Inverse DCT. The tailoring of each module will focus on a combination of established H.264 algorithmic simplifications, for both hardware and software, with the advanced methods available to the reconfigurable fabric, such as pipelined loops and shift registers.

REFERENCES

1. 4i2i Communications technical staff, H.264/MPEG-4 Part-10 Baseline Video Decoder IP Core, 4i2i Communications Ltd., revision 1.0, 2005.
2. Advanced RISC Machines, AXD Debugger for ADS 1.0.1, ARM Ltd., 1999-2000.
3. A. Chang, O.C. Au, Y.M. Yeung: “A novel approach to fast multi-block motion estimation for H.264 video coding”, Multimedia and Expo, 2003. ICME '03. Proceedings. 2003 International Conference on, Volume 1, 6-9 July 2003, Page(s): I-105-8 vol.1.
4. Joint Video Team of ITU-T and ISO/IEC JTC 1, “Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC)”, Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT G050, March 2003.
5. Joint Video Team (JVT), Reference Software JM 10.2, October 2005.
6. H. Kang, K. Jeong, J. Bae, Y. Lee, S. Lee: “MPEG4 AVC/H.264 Decoder with Scalable Bus Architecture and Dual Memory Controller”, Circuits and Systems, 2004. ISCAS '04. Proceedings of the 2004 International Symposium on, Volume 2, 23-26 May 2004, Page(s): II-145-8 Vol.2.
7. J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer, T. Wedi: “Video coding with H.264/AVC: Tools, Performance, and Complexity”, IEEE Circuits and Systems Magazine, First Quarter 2004, pp. 7-28.
8. Power Comparison, RapidChip© Platform ASICs vs. FPGA.
9. G. Raja, M.J. Mirza: “Performance comparison of advanced video coding H.264 standard with baseline H.263 and H.263+ standards”, Communications and Information Technology, 2004. ISCIT 2004. IEEE International Symposium on, Volume 2, 26-29 Oct. 2004, Page(s): 743-746 vol.2.
10. Reconfigurable Instruction Cell Array, U.K. Patent Application Number 0508589.9.
11. L. Shang, A. S. Kaviani, K. Bathala: “Dynamic power consumption in Virtex-II FPGA family”, Proceedings of the 2002 ACM/SIGDA tenth International Symposium on Field Programmable Gate Arrays, pages 157–164. ACM Press, 2002.
12. TIVR technical staff, H.264 Baseline Profile (BP) Video Decoder, TIVR Communications Private Ltd.
13. T. Wiegand, G.J. Sullivan, G. Bjøntegaard, A. Luthra: “Overview of the H.264/AVC video coding standard”, Circuits and Systems for Video Technology, IEEE Transactions on, Volume 13, Issue 7, July 2003, Page(s): 560-576.
14. Xilinx, Virtex-II Platform FPGAs: Complete Data Sheet, Xilinx Inc, DS031, version 3.4, 2005.
15. Y. Yi, I. Nousias, M. Milward, S. Khawam, T. Arslan and I. Lindsay: “System-level Scheduling on Instruction Cell Based Reconfigurable Systems”, presented at Design, Automation and Test in Europe Conference 2006 (DATE 2006), 6-10 March, ICM, MESSE Munich, Germany.
16. J. Zhang, Y. He, S. Yang, Y. Zhong: “Performance and complexity joint optimization for H.264 video coding”, Circuits and Systems, 2003. ISCAS '03. Proceedings of the 2003 International Symposium on, Volume 2, 25-28 May 2003, Page(s): II-888 - II-891 vol.2.
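
The tailoring steps described in Section III (loop transforms, multiplexer enforcement, pre-calculation of common expressions and memory access reduction) can be illustrated with a short ANSI-C sketch. It is not taken from libavcodec or from the modified decoder: all function names are hypothetical, and the six-tap kernel (1, -5, 20, 20, -5, 1), which is the H.264 half-sample luma interpolation filter, is used only as a representative FIR-like section. Whether a given compiler maps the conditional expressions in `clip_select` onto a hardware multiplexer is an assumption rather than something the source states.

```c
#define TAPS 6

/* Six-tap kernel, used here only as a representative FIR-like section. */
static const int coeff[TAPS] = { 1, -5, 20, 20, -5, 1 };

/* Branching clip to 0..255: each test may become a conditional jump. */
int clip_branch(int v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return v;
}

/* Select-style clip: both candidate values are available and one is
   chosen, a form that can be realised as a multiplexer (assumption). */
int clip_select(int v)
{
    int lo = (v < 0) ? 0 : v;
    return (lo > 255) ? 255 : lo;
}

/* Baseline: short inner loop, every sample re-read through array
   indexing. src must hold n + 5 samples. */
void fir_baseline(const unsigned char *src, int *dst, int n)
{
    int i, k;
    for (i = 0; i < n; i++) {
        int acc = 0;
        for (k = 0; k < TAPS; k++)
            acc += coeff[k] * src[i + k];
        dst[i] = acc;
    }
}

/* Tailored: inner loop recoded as one expression, symmetric taps summed
   once, and the sliding window held in scalars so that each input sample
   is loaded from the array only once. src must hold n + 5 samples. */
void fir_tailored(const unsigned char *src, int *dst, int n)
{
    int a = src[0], b = src[1], c = src[2], d = src[3], e = src[4];
    int i;
    for (i = 0; i < n; i++) {
        int f = src[i + 5];
        dst[i] = (a + f) - 5 * (b + e) + 20 * (c + d);
        a = b; b = c; c = d; d = e; e = f;
    }
}
```

Both filter versions compute the same outputs. The tailored form trades the inner loop and five of the six array reads per output for a handful of register moves, which is the kind of change the Loop Transforms and Memory Access Reduction paragraphs describe.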
