
Monolithic Architectures for Image Processing and Compression

Konstantinos Konstantinides and Vasudev Bhaskaran
Hewlett-Packard Laboratories

To meet the demands of image-intensive applications like desktop publishing, medical imaging, and multimedia, researchers are refining programmable ICs for image processing.

In emerging applications such as desktop publishing, medical imaging, and multimedia, hardware support for image processing is vital. In the past, custom, complex, and expensive image processing architectures satisfied the computational needs of intensive image processing tasks like pattern recognition and analysis of satellite data.1,2 However, the increasing power of computing ICs and the opportunity to attract new markets prompted many developers to provide some image processing support. Present image processing accelerators are based mostly on general-purpose microprocessors or digital signal processors (DSPs), such as the i860 from Intel or the TMS320 family from Texas Instruments.

Emphasizing image-processor ICs, we survey many of the current IC designs. No single article can cover all existing designs. Hence, we present a selection of chips, application-specific integrated circuits (ASICs), and programmable image processors that cover the current design trends in imaging acceleration.

Challenges for imaging


New areas of applications, such as high-definition TV and video teleconferencing, demand processing power that existing general-purpose DSPs cannot provide. Until recently, commercially available image processing ICs performed only low-level imaging operations like convolution. These ICs had limited programming capability, if any, because of the well-defined operation and data representation of the low-level imaging functions and the integration limits in VLSI circuits.

Emerging standards (like JPEG and MPEG for image and video compression), advances in submicron integration technology, and the opening of new market areas (like multimedia) encourage manufacturers to invest in the design and development of new image processing ICs. Image compression ICs, for example, are now becoming widely available. With these
design and development efforts, a new generation of general-purpose image processors will emerge to extend the capabilities of general-purpose DSPs.

November 1992  0272-1716/92/1100-0075 $03.00 © 1992 IEEE

Image processing is one of computing's broadest fields. It includes document image processing, machine vision, geophysical imaging, multimedia, graphic arts, and medical imaging. In general, we can separate image processing operations into low, intermediate, and high levels of complexity. In low-level processing (filtering, scaling, and thresholding), a one-to-one mapping occurs between input and output data, and the output data is again in the form of a pixel array. In intermediate-level processing (like edge detection), the system can no longer express the transformed input data as just an image-sized pixel array. Finally, high-level processing (like feature extraction and pattern recognition) attempts to interpret this data to describe the image content.

Because of this large range of operations, it seems that developers have applied every conceivable type of computer architecture at one time or another to image processing.2 But so far, no single architecture can efficiently address all possible problems. For example, single instruction, multiple data (SIMD) architectures are well suited for low-level image processing algorithms, but because of their limited local control, they can't address complex, high-level algorithms.3 Multiple instruction, multiple data (MIMD) architectures are better suited for high-level algorithms, but they require expensive support for efficient interprocessor communication, data management, and programming. Regardless of the specifics, every image processing architecture must address the following requirements: processing power, efficient data addressing, and support for data management and I/O.

We can divide image processing architectures into the following broad categories: dedicated image processors, image processing accelerators, and image-processor ICs. The dedicated image-processor systems usually include a host controller and an imaging subsystem that embodies memory, custom processing ICs, and custom or commercially available math processing units, connected in a SIMD, MIMD, or mixed-type architecture. Researchers are developing most of these systems either at universities for the study of image processing architectures or at corporate laboratories for in-house use or for specific customers and applications.3,4

Developers usually design image processing accelerators as attached boards or subsystems for commercially available personal computers or technical workstations. These accelerators use standard buses (VME, EISA, NuBus, and so forth) to communicate with the main host, and they usually consist of memory, a video port, a frame buffer, and one to four commercially available DSPs (such as the TMS320C40 from Texas Instruments) or high-performance microprocessors (such as the Intel i860). Most accelerators use floating-point arithmetic and have a bus-based architecture.5

Application-specific image ICs

You can implement most image processing functions on a general-purpose digital signal processor, and several vendors offer such solutions today. Such solutions are generally not amenable to real-time applications because of the processing requirements and the resulting high cost of using multiple DSPs per system. Let's look at several chip sets that you can use for image and video compression and for traditional image processing systems like computer-vision systems.

ICs for image and video compression

Workstations today possess significant capabilities, including high-performance CPUs (70 to 100 million instructions per second), high-performance display subsystems, and extensive networking and storage support. Such workstations offer a viable platform for multimedia applications requiring images and video. However, the bandwidth and computation power of these workstations are not adequate for the compression functions needed for video and image data. A single 24-bit image at 1K x 768 resolution requires 2.3 Mbytes of storage, and a sequence with 15 minutes of animation would require 60 Gbytes. For imaging and video applications (and multimedia computing in general), recent research has focused on image and video compression. This research has progressed along two fronts: the standardization of compression techniques and the development of ICs that can perform the compression functions these standards require. As shown in Table 1, the image and video compression standards now being developed fall into four classes: JPEG, MPEG-I, MPEG-II, and Px64.

Table 1. Image and video compression standards.

Features                               JPEG         MPEG-I        MPEG-II       Px64
Full-color still images                Yes          --            --            --
Full-motion video                      Yes          Yes           Yes           Yes
Real-time video capture and playback   Yes          --            --            Yes
Broadcast-quality full-motion video    --           --            Yes           --
Image size (max)                       64K x 64K    360 x 240     640 x 480     360 x 288
Compression ratios                     10 to 80:1   200:1 (max)   100:1 (max)   100:1 to 2,000:1
Data rates (compressed)                --           1.5 Mbps      5-10 Mbps     64 Kbps to 2 Mbps

IC vendors offer single-chip solutions for JPEG, Px64, and MPEG. The JPEG standard6 is intended for still images. However, researchers are also using it in edit-level video applications, where there is a need for frame-by-frame processing. The MPEG (I and II) standard applies to full-motion video and was originally intended for CD-ROM applications wherein encoding would be done less frequently than decoding.
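The storage figures quoted above for uncompressed images follow from simple arithmetic. A quick check (raw byte counts only, with a Mbyte taken as 10^6 bytes and no file-format overhead, so the totals differ slightly from the text's rounded values):

```python
# Uncompressed size of a single 24-bit (3 bytes/pixel) image at "1K x 768".
width, height, bytes_per_pixel = 1024, 768, 3
image_bytes = width * height * bytes_per_pixel
print(image_bytes / 1e6)        # about 2.4 Mbytes -- the text's "2.3 Mbytes"

# A 15-minute animation at 30 frames per second, stored uncompressed.
frames = 15 * 60 * 30
sequence_bytes = image_bytes * frames
print(sequence_bytes / 1e9)     # about 64 Gbytes -- the text's "60 Gbytes"
```

Either way you round, uncompressed video outruns CD-ROM and network delivery rates by roughly two orders of magnitude, which is what motivates the compression hardware surveyed here.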
MPEG-I7 offers VHS-quality decompressed video, whereas the intent of MPEG-II is to offer improved video quality suitable for broadcast applications. Recently, video broadcast companies proposed video transmission schemes that would use an MPEG decoder and a modem in each home. By optimizing the decoder and modem configuration, the broadcasters are attempting to incorporate up to four video channels in the present 6-MHz single-channel bandwidth. The increased channel capacity has significant business implications.

The Px64 (H.261) standard is intended for video conferencing applications. This technology is mature, but the devices available to accomplish video conferencing have been multiboard-level products until now. As CPU capabilities increase, we expect some of these functions to migrate to the CPU. However, a calculation of the speed (generally measured in MIPS, millions of instructions per second) that compression schemes require indicates that such an eventuality is at least five years away. For example, the Px64 compression scheme requires more than 1,000 MIPS for encoding and about 200 MIPS for decoding.8 The 1,000-MIPS requirement for encoding will not be achievable cheaply on a desktop in the near future. On the other hand, MPEG decoding should become feasible on a desktop by 1994. Thus, application-specific ICs seem to be the only viable near-term solution. For still images, we can optimize the JPEG algorithms to accomplish interactive processing on a desktop workstation. However, if we use JPEG for motion sequences, a desktop system cannot easily support the 30-frames-per-second data rates, and in such cases a chip-based solution is appropriate.

In Table 2, we list representative image and video compression chips offered by various IC vendors. The chip designs from Matsushita and NEC have been disclosed, but they are not yet commercially available.

Table 2. Image and video compression IC vendors.

Vendor                 Part Number   Standard            Speed           Comments
C-Cube Microsystems    CL550         JPEG                10 and 27 MHz   --
                       CL450         MPEG-I              30 frames/sec   Decode only
                       CL950         MPEG-I              30 frames/sec   Can be used at MPEG-II resolutions
AT&T                   AVP-1300E     JPEG/MPEG-I/Px64    30 frames/sec   Encoder only; data rates up to 4 Mbps
                       AVP-1400D     JPEG/MPEG-I/Px64    30 frames/sec   Decoder only; spatial resolution up to 1K x 1K
NEC                    --            JPEG/MPEG-I/Px64    30 frames/sec   352 x 240 encoding at 30 frames/sec; JPEG solution requires only two chips
Integrated             VP            JPEG/MPEG-I/Px64    30 frames/sec   Non-real-time MPEG-I encoding; real-time decoding at 352 x 288 spatial resolution
Information Technology
Intel                  82750         JPEG/MPEG-I/Px64    30 frames/sec   Real-time decoding at 352 x 288 spatial resolution; MPEG-I encoding capability not known
LSI Logic              L647xx        JPEG/MPEG-I/Px64    30 frames/sec   Requires up to 7 ICs for multistandard compression
SGS-Thomson            STV3200       JPEG/MPEG           --              Performs 8 x 8 and 16 x 16 DCT for JPEG and MPEG core
                       STV3208       JPEG/MPEG           --              Performs 8 x 8 DCT for JPEG and MPEG core
                       STI3220       --                  --              Performs motion estimation for MPEG or Px64 encoding
Matsushita             --            --                  15 frames/sec   64-Kbps encoder output rate; 352 x 288 spatial resolution

The remaining designs are either available now or will be by the end of this year.

Designers optimized these chips for a core set of four compression-related functions:

1. spatial-to-frequency domain transformations,
2. quantization of frequency-domain signals,
3. entropy coding of the quantized data, and
4. motion compensation.

For compression of motion sequences, these chips also employ additional processing in the temporal domain. Let's look briefly at the processing steps for the JPEG, MPEG, and Px64 compression schemes.

Spatial-to-frequency domain transformation. The chips accomplish spatial-to-frequency domain transformation via an 8 x 8 2D discrete cosine transform (DCT) performed on the spatial-domain data. Often the chips first convert RGB (red, green, blue) spatial-domain data into a color space suitable for image compression; the system uses the YCrCb space (a commonly used color space) for this purpose. The 2D DCT can be performed as straightforward matrix multiplication, as matrix-vector multiplication (on 16 vectors), or via a fast DCT algorithm. A hardware implementation would use straightforward matrix-vector multiplication methods or fast DCT algorithms with a regular structure, whereas a programmable image computing engine (like a DSP) would use only a fast DCT implementation that has a small multiply count.

Quantization. The system quantizes frequency-domain samples to eliminate some of the frequency components. This results in a loss of detail that worsens as the compression ratio increases. The quantization function is a pointwise divide (and rounding) operation.

Entropy coding. The quantized data has few distinct values, and there is usually a run of zero-valued samples between nonzero-valued samples. Developers usually use a Huffman coding method employed on the runs (as in the baseline JPEG standard, for example). For decoding, a lookup-table-based approach is faster and more efficient for software implementations.

Motion compensation. In CD-ROM applications, since CDs output data at a 1 to 1.5 Mbits-per-second rate and digitized NTSC rates are 80 to 130 Mbits per second, developers need compression ratios around 100:1 for CD-ROM-based video. When developing the MPEG standard, designers expected that many video applications would be CD-ROM based. To achieve such high compression ratios, developers need to exploit both intraframe and temporal redundancies. This is accomplished through motion compensation, a technique we find in MPEG and Px64 encoding; frame-to-frame changes are determined, and only the changes are transmitted. In most motion estimation implementations, the system performs some form of block matching between the blocks comprising frames. This feature is highly parallelizable and is exploited in some of the MPEG and Px64 encoders.

Let's turn from explaining the ICs' image-compression functions to examining the chips themselves. We briefly describe the JPEG and MPEG ICs offered by various vendors.

CL550

The CL550 is a single-chip JPEG compression and decompression engine offered by C-Cube Microsystems.10 It consists of a pixel-bus and a host-bus interface for data I/O, control registers, and a JPEG core unit with a DCT coefficient register, a quantizer, and Huffman tables. Compression ratios can range from 8:1 to 100:1 and are controlled by the quantization tables within the chip. The chip has on-chip video and host-bus interfaces and thus minimizes the glue logic needed to interface the chip to the computer's or display subsystem's bus.

The CL550 chip can handle up to four color channels, and thus we can use it either for printing (images are handled as CMYK: cyan, magenta, yellow, black) or for image display and scanning (images are handled as RGB). This chip also provides memory interface signals and can process data at a 27-MHz rate. Thus, we can use it in full-motion video applications that need frame-to-frame access. (MPEG uses interframe coding methods that require several frames to be decoded to access a specific frame.) Similar chips will soon be offered (or proposed) by SGS-Thomson and LSI Logic.

CL450

The CL450 is a single-chip MPEG-I decoder from C-Cube Microsystems. Compact disc systems (such as the Philips CD-I player) that play back audio and video use such a chip for video playback. As with the CL550, designers minimized the glue logic required for interfacing to buses and other chips by providing bus logic and memory control signals within the chip. In a small system, we might need to add up to one megabyte of dynamic RAM for MPEG decoding.

The CL950 is similar to this chip. However, it can support the data rates and spatial resolutions proposed for MPEG-II.

Combined chip sets

Multimedia computing systems will deal with JPEG, MPEG, and Px64 compressed data. For cost-effective solutions on the desktop, it is highly unlikely that designers will use separate chips for each standard. Thus, several IC vendors have developed (or proposed) JPEG/MPEG/Px64 chip designs that can handle all three of the major multimedia compression, decompression, and transmission standards.

AT&T multistandard chip set

The AT&T chip set11 includes the AVP-1400C system controller, the AVP-1300E encoder, and the AVP-1400D decoder. The system controller addresses the issue of transmission of multimedia data (this is a key differentiator in AT&T's multimedia chip set offering). The controller provides the usual multiplexing and demultiplexing of MPEG or Px64 streams and performs error correction. Furthermore,

the concentration highway interface provides an interface to channels like T1 and ISDN. This interface also supports the synchronization of audio and video streams, an important feature in multimedia systems. In a JPEG-only system, this chip is not required.

The AVP-1300E performs single-chip MPEG-I encoding. The 1400-E version supports MPEG-I and Px64 compression and uses a fixed motion-estimation processor that searches exhaustively over a range of ±32 pixels with half-pixel accuracy. Users can select the channel bit rate from 40 Kbits per second to 4 Mbits per second (the latter is closer to the proposed MPEG-II specifications) using a constant bit rate or a constant quantization step size. Spatial resolution can go up to 720 pixels x 576 lines. You can use the 1300-E in the intraframe mode and thus perform JPEG-like encoding.

The encoder contains several functional blocks:

1. a host-bus interface,
2. a FIFO to hold uncompressed data,
3. a memory controller to interface to external dynamic RAMs,
4. a motion estimator,
5. a quantization processor to determine quantizer step size and output frame rate,
6. a signal processor comprising six SIMD engines that perform DCT, quantization, and zigzag scanning,
7. a global controller to sequence the operation of the blocks,
8. an entropy coder to generate data compliant with the MPEG and H.261 standards, and
9. an output FIFO to hold compressed data.

The decoder is less computation-intensive (primarily because it need not estimate motion). The AVP-1400D performs decoding at compressed-data stream rates up to 4 Mbits per second and can support frame rates up to 30 frames per second. It can support spatial resolutions up to 1,024 lines x 1,024 pixels (the larger spatial resolutions are useful in JPEG still-image applications). The decoder has an on-chip color space converter, so you can convert the decompressed YCbCr data directly to RGB.

NEC multistandard chip set

The NEC multistandard chip set12 can encode and decode MPEG video in real time. Like the AT&T chip set, it can support all three compression standards. Unlike the AT&T chip set, however, the encoding and decoding algorithms are split across several chips. The first chip performs interframe prediction, the second executes the DCT and quantization, and the third does entropy coding. We depict these chips in Figure 1. The chip set estimates motion in the prediction engine, but unlike AT&T's encoder, the motion search strategy is a two-step process. In NEC's motion estimation algorithm, the first search step is over a one-fourth subsampled area, and the second search step occurs over an 18 x 18 window and yields half-pixel accuracy. The DCT and quantization engine is a DSP unit that has four 16-bit multiply-and-accumulate (MAC) units and 24-bit arithmetic logic units (ALUs). The entropy coding engine has a RISC core, performs all the bitstream coding functions, and provides control for basic playback functions like stop, play, and random access.

Figure 1. The NEC multistandard chip set: (a) interframe prediction, (b) DCT and quantization, and (c) variable-length coding.

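The DCT-and-quantization pair that both the AT&T and NEC engines dedicate hardware to reduces, in software terms, to two small matrix products followed by a pointwise divide-and-round. A minimal sketch, assuming the orthonormal DCT-II and a made-up uniform quantization table (not a table from any standard):

```python
import math

N = 8

# Orthonormal DCT-II basis: C[k][n] = a(k) * cos(pi * (2n + 1) * k / (2N)).
C = [[(math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N))
      * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
      for n in range(N)] for k in range(N)]

def dct2(block):
    """8 x 8 2D DCT as two matrix multiplications: Y = C X C^T."""
    tmp = [[sum(C[k][n] * block[n][j] for n in range(N)) for j in range(N)]
           for k in range(N)]
    return [[sum(tmp[k][n] * C[l][n] for n in range(N)) for l in range(N)]
            for k in range(N)]

def quantize(coeffs, q):
    """Pointwise divide-and-round against a quantization table -- the lossy step."""
    return [[round(coeffs[i][j] / q[i][j]) for j in range(N)] for i in range(N)]

flat = [[100] * N for _ in range(N)]            # a featureless block...
coeffs = dct2(flat)                             # ...transforms to a lone DC term
quant = quantize(coeffs, [[16] * N for _ in range(N)])
```

For the featureless block in the example, every AC coefficient quantizes to zero and only the DC term survives; it is exactly this clustering of zeros that the zigzag-scan and entropy-coding stages then exploit.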
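The block matching described above, which the AT&T encoder performs exhaustively and NEC performs in two steps, can be sketched as a full search minimizing a sum of absolute differences (SAD). The block size and search range below are illustrative parameters only, and real MPEG/Px64 encoders add the half-pixel refinement and subsampled first pass discussed in the text:

```python
def sad(cur, ref, bx, by, dx, dy, B=8):
    """Sum of absolute differences between the current B x B block at
    (bx, by) and the reference block displaced by (dx, dy)."""
    return sum(abs(cur[by + i][bx + j] - ref[by + dy + i][bx + dx + j])
               for i in range(B) for j in range(B))

def best_motion_vector(cur, ref, bx, by, search=4, B=8):
    """Exhaustive (full-search) block matching over +/- search pixels;
    returns the integer-pel motion vector (dx, dy) with the lowest SAD."""
    h, w = len(ref), len(ref[0])
    best_cost, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidate blocks that fall outside the reference frame.
            if not (0 <= by + dy and by + dy + B <= h
                    and 0 <= bx + dx and bx + dx + B <= w):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, B)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv
```

Only the displacement (dx, dy) and the residual after compensation are then coded, which is how MPEG approaches the roughly 100:1 ratios CD-ROM video needs; the independent SAD evaluations are also what makes the search so parallelizable in hardware.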
IIT multistandard chip set

The IIT multistandard chip set13 takes a different approach from the AT&T chip set. It offers a programmable chip (referred to as the Vision Processor), and the microcode-based engine of this chip can execute JPEG, MPEG, and Px64 compression and decompression algorithms. This chip has a RISC processing core, a direct memory access (DMA) port for data access, and a command port. You can also connect this chip set to optional microcode static RAM. This chip can perform JPEG, MPEG, and Px64 decoding at 30 frames per second at spatial resolutions of 352 x 240. For MPEG encoding in real time, you will need two Vision Processors. For MPEG and Px64 applications, the Vision Processor is combined with a vision controller chip that handles the pixel interface and frame-buffer control functions.

Intel 82750

This is Intel's third-generation design of the 82750 family. Earlier versions were designed primarily to decode Digital Video Interactive (DVI) compressed data. This chip offers 1,000 MOPS (million operations per second) and operates at 50 MHz. Because of the MOPS requirements for MPEG encoding, the 82750 with this MOPS rating might not be suitable for MPEG encoding. However, it can easily accomplish MPEG decoding. Furthermore, a system can decode multiple MPEG streams with a single 82750, and thus in a desktop environment you could view several video sources simultaneously. The programmable nature of the 82750 allows for JPEG and Px64 encoding and decoding. (At this time, we are not aware of the availability of this processor.)

Other vendors

Zoran14 offers a two-chip solution for use in still-image compression applications where the encoder bit rate needs to be fixed. This is essential in applications like digital still-video cameras. The Zoran chip set achieves the fixed rate via a two-pass process. During the first pass, the chip estimates the bit rate from the DCT values. The chip actually codes during the second pass by controlling quantization based on information derived from the first pass.

Since many of the compression and decompression processes resemble DSP methods, traditional DSP IC vendors such as Texas Instruments, Motorola, and Analog Devices will probably announce JPEG and MPEG solutions based on an optimized DSP core. In fact, Matsushita15 recently disclosed that, according to their estimates, the DSP engine that they specially designed for 2 giga-operations per second (GOPS) can provide decoding of Px64 compressed video at 15 frames per second.

From the graphics side, IC vendors like Brooktree and Inmos will have compression and decompression chip sets based on an optimization of the graphics core in their current offerings. For a multimedia computer, we believe that a chip set with a high level of integration and support for host-bus interface and dynamic-RAM control logic would probably be preferable.

ICs for generic image processing

In computer vision and most traditional image processing systems, the computer performs a significant amount of processing with low-level image processing functions. For example, in computer vision systems, you might want to perform semi-automated PC-board inspection. To do this, the system must image a PC board at video rates (256 x 256 x 8 bits per pixel acquired at 30 frames per second), detect the edges of the image, and make a binary bitmap of the edge-detected image. Then the system might perform a template-matching function on the binary image to determine faulty traces on the board. The above-mentioned tasks require at least 120 MOPS of processing capability. Most general-purpose DSPs cannot deliver such fast processing at sustained rates. To make applications like these feasible, IC vendors offer single-chip solutions for many of these image processing tasks.

ICs for image filtering

LSI Logic16 offers the L64240, a multibit finite impulse response filter (MFIR). This chip is a transversal filter consisting of two 32-tap sections. Each 32-tap section contains four separate eighth-order finite impulse response filter sections. Each filter cell within a filter section consists of a multiplier and an adder that adds the multiplier output and the adder output of the preceding filter cell. This chip can generate output at a 20-MHz sample rate with 24-bit precision. This device is suitable for both adaptive filtering and cross-correlation, since the convolution kernel can be loaded synchronously with the device clock.

The L64240 has a format-adjust block that provides the capability to scale (with saturation), threshold, clip negative values, invert, or take the absolute value of the output (these functions are useful in image enhancement and in image display). The chip is reconfigurable and can perform 1D, 2D, and 3D filtering. With the incorporation of a frame delay, you can configure it as a 3D filter in which a 4 x 8 convolution is performed across two frames.

ICs for image enhancement

In many imaging applications, users acquire the image data from a noisy source, such as a scanned image from a photocopy. Users must apply some noise-reduction function to this image. One such technique is rank-value filtering, such as median filtering. An IC that does this at a 20-MHz rate is the L64220 from LSI Logic. Like the MFIR, this is a 64-tap reconfigurable filter. However, the output comes from a sorted list of the input values, not from a weighted sum as in the MFIR's filtering function. You can configure the filter to perform rank-value operations on 8 x 8, 4 x 16, 2 x 32, or 1 x 64 masks. The rank-value functions include min, max, and median.

In many image processing applications, users need to enhance the acquired image to incorporate it into an electronic document.
80 IEEE Computer Graphics & Applications

Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY SURATHKAL. Downloaded on September 23, 2009 at 21:39 from IEEE Xplore. Restrictions apply.
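In software terms, the transversal filtering the L64240 performs is the direct 2D convolution sum, each output pixel a weighted sum of its neighbors. A naive sketch over the "valid" region only, with 0-based kernel indices (roughly what the chip's cascaded multiply-add cells compute, though it pipelines the work across its 64 taps):

```python
def convolve2d(x, c):
    """Direct 2D FIR filtering: y(i, j) = sum_m sum_n c[m][n] * x[i-m][j-n],
    evaluated only where the kernel fits entirely inside the image
    (no padding), so a P x Q kernel shrinks the output by P-1 rows
    and Q-1 columns."""
    P, Q = len(c), len(c[0])
    H, W = len(x), len(x[0])
    return [[sum(c[m][n] * x[i - m][j - n]
                 for m in range(P) for n in range(Q))
             for j in range(Q - 1, W)]
            for i in range(P - 1, H)]
```

With a 1 x 1 identity kernel this returns the image unchanged; loading smoothing, sharpening, or matched-template coefficients turns the same loop into the enhancement and cross-correlation uses the text describes.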
A popular image enhancement technique is histogram equalization.17 To do this, you compute the histogram of the input image and then apply a function to the input that equalizes the histogram. In Figure 2 we show a simplified block diagram of LSI Logic's L64250, which can equalize histograms at a 20-MHz rate on data sets as large as 4K x 4K. The accumulation memory ("ACC RAM" in the figure) holds the histogram for the input data (DI), and this memory is updated for each set of input data. After processing a frame of data, the chip can use the accumulation memory's contents to compute an equalization transfer function that can then be transferred to the lookup-table RAM. The chip then uses the image data to index the lookup-table RAM, and the output is the histogram-equalized image. You can also reconfigure the L64250 chip to compute the Hough transform of an image, a common preprocessing operation in many computer vision tasks.

Figure 2. The L64250 chip: histogram and Hough-transform IC.

ICs for computer vision

An IC useful for computer vision and optical character recognition (OCR) is the object contour tracer (the L64290 from LSI Logic). This chip locates the contours of objects in a binary image. For each contour, it returns the (x, y) coordinates, discrete curvature, bounding box, area, and perimeter. In an OCR application, the system can use this data as part of a feature set for template matching in feature space. Thus, the system can recognize a character (whose 128 x 128 image is input to the contour tracer). In conventional computer-vision applications, you can use the contour tracer as a component in a system to identify objects. This could be the front-end signal-processing component of a robot. For an n x m image, the contour tracer takes 4.5nm cycles to generate the contours.

You can also use this IC to compress black-and-white images, in which case the contour information represents the compressed data. For most textual documents, this could yield high compression ratios compared with the traditional Group III and Group IV fax-compression methods.

You can use an IC such as the L64230 (also from LSI Logic) for template matching and binary morphological operations. These processing functions are usually required in the preprocessing stage of an image recognition system. By changing the setup and filter coefficients, this IC can perform FIR filtering, template matching, erosion, and dilation functions.

General-purpose image ICs

Designs for general-purpose image processors (IPs) must combine the flexibility and cost effectiveness of a programmable processor with the high performance of an ASIC. Therefore, it is no surprise that one class of programmable image processors are extensions of general-purpose DSP chips. We call them uniprocessor IPs. They include one multiply-and-accumulate (MAC) unit, separate buses for instruction and data, and independent address generation units. Depending on the target applications, they also include special arithmetic function units, on-chip DMA, and special data-addressing capabilities (for example, dual arithmetic units for computing absolute values or L2 norms in one cycle).

Another class of image processors has an internal multiprocessing architecture. We call them multiprocessor IPs. Researchers develop almost all these chips for in-house use, and they are targeted to specific applications (video processing, machine vision, data compression, and so forth). Multiprocessor IPs don't seem to be commercially available.

Uniprocessor IPs

Let's look at these two categories of ICs, uniprocessor and multiprocessor, more closely. We begin with uniprocessor IPs.

VISP

Figure 3. A block diagram of the Video Image Signal Processor (VISP).

Figure 3 shows a block diagram of the Video Image Signal Processor (VISP) from NEC.18 This design is representative of most programmable, uniprocessor IPs that can be considered extensions of general-purpose DSPs. VISP is a 16-bit video processor developed for real-time picture encoding in TV conference systems. It has two input and one output data
buses, two local RAMs (128 x 16 bits each, in the data memory unit (DMU)), an instruction sequencer, a 16-bit ALU, a 16 x 16-bit multiplier, a 20-bit accumulator, a timing control unit, an address generation unit, and a host interface unit. For efficient I/O, the processor has a built-in DMA controller in the host interface unit. The address generation unit can support up to three external buffer memories. Like Hitachi's IP19 and the Digital Image Signal Processor (DISP) from Mitsubishi,20 VISP has hardware support to translate a 2D logical address to a 1D physical memory address. This feature is important in image processing applications, where we can describe many of the low-level algorithms by

y(i,j) = Σ_{m=1..p} Σ_{n=1..q} c_{mn} x(i-m, j-n)

where x and y are the input and output images, and c is a p x q matrix (kernel) of algorithm-dependent coefficients. For efficient addressing, special pointers are required to move the kernel of coefficients within an image and to access the data within a kernel window.18

VISP's ALU is specially designed to compute pattern-matching operations in a pipelined manner. It can compute either L1 or L2 norms for vector quantization and motion compensation operations. Figure 4 shows a block diagram of the ALU. The first stage has two shifters (SFT) for scaling the input video data. The second to fourth stages are for the ALU and the multiplier. The output of the multiplier is extended to 20 bits by the Expand module in the fifth stage. Data is further scaled by a 16-bit barrel shifter (BSF). The last stage is a minimum value detector (MMD) that codes motion pictures efficiently.

In similar designs, Toshiba's IP21 has dual ALUs to compute an absolute value in one cycle, Hitachi's IP has special hardware to search for minimum and maximum values,19 and Matsushita's Real-time Image Signal Processor (RISP)22 has a multiplier and divider unit and a special flag counter for fast averaging operations.

Instead of on-chip DMA and data memories (as in VISP, DISP, and Hitachi's IP), Toshiba's IP has hardware support for direct access to three external memories, and RISP has an array of 25 local image registers and buffers with a 25-to-1 selector. This feature lets RISP freeze and operate on a 5 x 5 array of image pixels at a time. RISP addresses the data by setting the multiplexers' select code. The VISP processor has a 25 nanosecond cycle time, and, like most of the other image processors, you can use it in a multiprocessor array.

Figure 4. A block diagram of the Video Image Signal Processor (VISP) processing unit.

Multiprocessor IPs

The latest designs of image processing ICs combine features from both DSP architectures and classical SIMD and MIMD designs. Advances in integration now allow multiple processing units on a single chip. Thus, these designs combine the power of parallel architectures with the advantages of monolithic designs.

ViTec's PIP

Figure 5 shows a block diagram of ViTec's Parallel Image Processor.23 The PIP chip is the core of ViTec's multiprocessor imaging systems.24 It is probably the first image-processor chip that employs an on-chip parallel architecture. Each chip has eight 8-bit ALUs and eight 8 x 8-bit parallel multipliers. A nine-way 32-bit adder combines the results out of the multipliers. This architecture is highly efficient for convolution operations. The chip uses a special replicator to replicate pixels for zooming. This architecture has limited local control, but it can provide up to 114 MOPS, assuming an 8-bit wide data path. Communication with an image memory occurs via a 64-bit bidirectional bus. A separate 32-bit bus allows interprocessor communication in multichip systems. As in Hitachi's IP,19 the PIP chip uses two levels of internal control. A 4-bit instruction combined with a 6-bit address selects a 64-bit microinstruction contained in an on-chip writable control store.

Figure 5. A block diagram of ViTec's Pipelined Image Processor.
82 IEEE Computer Graphics & Applications
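The low-level kernel operation that the text describes, y(i,j) as a weighted sum of a p x q neighborhood, is an ordinary 2D convolution. A minimal software sketch of it, with plain nested loops and none of any chip's special addressing hardware, is:

```python
def convolve2d(x, c):
    """Direct 2D convolution y(i,j) = sum_m sum_n c[m][n] * x[i-m][j-n].

    x is the input image, c the p x q coefficient kernel; the output
    keeps only positions where the kernel lies fully inside the image.
    """
    p, q = len(c), len(c[0])
    rows, cols = len(x), len(x[0])
    y = []
    for i in range(p - 1, rows):
        out_row = []
        for j in range(q - 1, cols):
            acc = 0
            for m in range(p):
                for n in range(q):
                    acc += c[m][n] * x[i - m][j - n]
            out_row.append(acc)
        y.append(out_row)
    return y
```

The four nested loops make the iterative, multiply-accumulate nature of low-level imaging obvious, which is exactly why the surveyed chips devote multipliers, accumulators, and kernel pointers to this pattern.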

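The L1 and L2 norms that VISP's ALU accelerates for vector quantization and motion compensation are block-matching distances; a motion estimator evaluates one of them for many candidate blocks and keeps the minimum. The sketch below is illustrative only, and the norm-selection interface is our own, not VISP's instruction set.

```python
def block_distance(a, b, norm="L1"):
    """Distance between two equal-sized pixel blocks.

    L1 is the sum of absolute differences (SAD); L2 is the sum of
    squared differences. Picking the candidate block with the minimum
    distance is the job a minimum value detector stage performs.
    """
    total = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            d = pa - pb
            total += abs(d) if norm == "L1" else d * d
    return total
```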
Pipeline IP

We can consider the Pipeline IP chip from Matsushita25 a variation of the ViTec design. It consists of nine processing elements positioned side by side. However, in this design (shown in Figure 6), you can reconfigure the neighbor interconnections between the processing elements. Each subprocessor consists of a 13 x 16-bit multiplier, a 24-bit accumulator, pipeline registers, and multiplexers. The multiplexers allow different interconnection topologies among the processing elements, and the system computes image processing operations systolically. The processor has three input and one output ports and several pipeline-shift registers. This architecture is very efficient for matrix-type and filtering operations and can compute an eight-tap DCT using a brute-force technique (matrix-vector multiplication).

Figure 6. A block diagram of the Pipeline Image Processor.
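The brute-force eight-tap DCT mentioned above is just a matrix-vector product with the 8 x 8 DCT coefficient matrix. The sketch below uses the standard floating-point DCT-II definition, not the chip's fixed-point arithmetic:

```python
import math

def dct_matrix(N=8):
    """N x N DCT-II coefficient matrix C, so that X = C x (matrix-vector)."""
    c = []
    for k in range(N):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        c.append([scale * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                  for n in range(N)])
    return c

def dct8(x):
    """Eight-tap DCT as a plain matrix-vector multiplication."""
    C = dct_matrix(8)
    return [sum(C[k][n] * x[n] for n in range(8)) for k in range(8)]
```

A systolic array of multiply-accumulate elements maps directly onto the row-times-vector dot products in the last line, one output coefficient per processing element.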

ISMP

Figure 7 shows the block diagram of ISMP,26 a digital image signal multiprocessor version of the RISP22 architecture we referred to earlier. Also manufactured by Matsushita, ISMP has a main controller, an arithmetic unit, and four 12-bit processor elements. Each processor element has a local image register, a 24-bit ALU, and a 13 x 12-bit multiplier. ISMP also has its own instruction RAM and local controller. You can use this chip either to operate at different locations of the image or on the same location, but with four different feature extraction programs. ISMP chips can also operate in a multiprocessor array.

To resolve the problem of feeding four different processors with data, the ISMP has a global image register and four local image registers. The global register consists of five 12-bit-wide shift registers, each with a 12-bit input port, and drives an image bus of 300 lines. The local image registers, located in each processor element, receive a 5 x 5 pixel window from the image bus. This window is latched at the falling edge of a "start" signal that starts the execution of each processor element. The processor has a 20 nanosecond cycle time and can compute a 32-level histogram of a 256 x 256 pixel image in 23.6 milliseconds.

Figure 7. A block diagram of ISMP.

Figure 8. A block diagram of IDSP.
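ISMP's arrangement, one latched 5 x 5 pixel window broadcast to four processor elements that each run their own program, can be mimicked in software. The four "programs" below (min, max, sum, count) are placeholders of our choosing for illustration, not ISMP's instruction set:

```python
def window5x5(image, i, j):
    """The 5 x 5 neighborhood centered at (i, j), flattened to 25 pixels."""
    return [image[i + di][j + dj] for di in range(-2, 3)
                                  for dj in range(-2, 3)]

def run_pes(image, i, j):
    """Latch one window, then let four 'PEs' process the same data."""
    w = window5x5(image, i, j)
    programs = [min, max, sum, len]  # stand-ins for four PE programs
    return [prog(w) for prog in programs]
```

The point of the global/local register split is visible here: the window is fetched once, and the four feature-extraction routines all read the same latched copy.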

Table 3. Programmable image processing ICs.

System | Year | Technology | Clock (ns) | Instruction RAM | Data RAM | Processing units | Bits
[The scanned entries for the Toshiba, Hitachi, Mitsubishi (DISP), ViTec (PIP), NEC (VISP), Matsushita (Pipeline IP, RISP, ISMP), and NTT (IDSP) processors are too garbled in this copy to reconstruct reliably.]

Year denotes the year of publication of the corresponding reference. MPY denotes multiply unit, MAC stands for multiply and accumulate unit, and BS denotes a barrel shifter. On the bit field, a dual number such as A/B denotes that the chip uses A bits for input, but output or internal results might also be available with B bits of precision.

IDSP

Figure 8 shows NTT's Image Digital Signal Processor (IDSP),27 another parallel architecture of a video signal processor. This processor has three parallel I/O ports and four pipelined processing units. Each processing unit has a 16-bit ALU, a 16 x 16-bit multiplier, and a 24-bit adder/subtracter. Each I/O port has a 20-bit address generation unit, DMA control processor, and I/O control circuitry. Data transfers and program execution occur in parallel.

Each data processing unit (DPU) has a local data cache memory and also shares a work memory. All memories are configured as 512-word x 16-bit dual-port RAMs. The system transfers data via ten 16-bit vector data buses and a 16-bit scalar data bus. As in other architectures, where a conventional VLIW (very long instruction word) encoding would require hundreds of bits, the IDSP uses two levels of instruction decoding. IDSP has a 40 nanosecond instruction cycle time.

Texas Instruments' video processor

Texas Instruments is working on an advanced processing chip for imaging, graphics, video, and audio processing.9,28 It will support all the major imaging and graphics standards (like JPEG, MPEG, PEX, and MS Windows). It will have both floating-point and pixel-processing capabilities for 2D image processing and 3D graphics.

The proposed product will have multiple processors on a single chip. One of these will be a RISC floating-point processor. The remaining processors will have a mixture of DSP and pixel-manipulation capabilities. The on-chip processors will be fully programmable and have instruction caches, so each of the processors will be able to run independently in a MIMD fashion. The processors and on-chip main memory will be connected via a crossbar. Texas Instruments estimates that performance will be in the range of 3 billion "RISC-like" operations per second.
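The two-level instruction decoding used to avoid wide VLIW words amounts to a short instruction selecting a wide control word from a writable store. The field widths below follow the PIP description given earlier (a 4-bit instruction plus a 6-bit address selecting a 64-bit microinstruction); the stored values are made up for illustration:

```python
# A short instruction selects a wide control word from a writable
# control store, so the program stream itself stays narrow.
CONTROL_STORE = {
    0: 0x0123456789ABCDEF,  # illustrative 64-bit control words
    1: 0xFEDCBA9876543210,
}

def decode(instr):
    """Split a 10-bit instruction into its 4-bit opcode and 6-bit
    control-store address, then fetch the 64-bit microinstruction."""
    opcode = (instr >> 6) & 0xF   # 4-bit instruction field
    address = instr & 0x3F        # 6-bit control-store address
    return opcode, CONTROL_STORE[address]
```

With 10 bits fetched per cycle instead of 64 or more, instruction bandwidth drops by better than 6 to 1 while the datapath still sees full-width control.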


Table 3 summarizes the major features of the image-processor ICs. Table 4 shows benchmark results for two classes of algorithms for which data for most of the above ICs is available: a block-type algorithm, such as the FFT or DCT, and a spatial filtering algorithm with a 3 x 3 kernel.

Imaging and graphics

An area of special interest to both computer workstation manufacturers and IC developers is the integration of image processing with computer graphics. Traditionally, developers considered these separate fields and designed acceleration engines specifically for imaging or graphics. However, the proliferation of window-based user interfaces (which can benefit from graphics acceleration), the combination of traditional image processing operations (like texture mapping and warping) with graphics techniques, and the emergence of multimedia applications have motivated manufacturers and developers to work on systems that integrate graphics, imaging, audio, and video.

In general, requirements for graphics architectures differ from those in image processing. Most graphics accelerators follow the well-established graphics pipeline: host, transformation unit, clipping, illumination, scan conversion and shading, and display.29 In graphics, systems perform most operations on polygons or vectors. Regardless of the front-end operations (transformations, lighting, and so forth), the back-end operations are always scan conversion and z-buffer interpolation. The regularity of the graphics pipeline makes it easier for hardware designers to decide which part of the pipeline needs to be accelerated by special processors. In contrast, image processing applications always operate at the pixel level and do not usually follow a single computational path. Furthermore, most front-end operations in graphics require floating-point arithmetic. In contrast, only high-level imaging operations require floating-point accuracy.

These diverse requirements and the absence of programmable image processors forced most developers to use general-purpose processors (such as Intel's i860 or the TI TMS320C40) to provide an integrated solution.6 However, the new generation of image and video processors might allow for more efficient solutions. The high performance of the latest RISC processors lets us shift most of the computational load, especially the floating-point operations, to the main CPU. Then the image processors can execute both imaging and the last stages of the graphics pipeline.

For example, perspective and z-buffer interpolation are two common graphics tasks usually performed by custom ICs. You can describe the pixel arithmetic in those functions by

q_i = q_s + (i - s) S (q_e - q_s)

where q_i is the quantity to be interpolated (for example, an RGB, specular, or z value z_i) for pixel i, which lies in the scan line between pixels s and e, and S is a scaling factor that must be determined for each scan line. However, this type of operation is also well suited for an image processor.

High-end graphics workstations will always require application-specific ICs. However, the new image and video processors should provide adequate performance to satisfy the needs of most users. The latest designs from Intel and Texas Instruments claim to support both imaging and graphics. Such systems should assist in making the integration of graphics and imaging a reality.

Conclusions and predictions

Circuitry that accelerates image processing will be an integrated part of future personal computers and scientific workstations. In this article, we reviewed recent architectures for image processing ICs. Application-specific processors are available for image and video compression and low-level image processing functions. Programmable image processing ICs are either modified extensions of general-purpose DSPs or incorporate multiple processing units in an integrated parallel architecture. Because of the iterative nature of low-level imaging, such parallel architectures are ideal for fast image processing.

A new generation of programmable video processors will soon emerge to accommodate the special needs of real-time video processing. They will consist of multiple computing

Table 4. Benchmarking results for programmable image processing ICs.

Processor | Transform | Filtering (256 x 256 data, 3 x 3 mask), ms
Toshiba | FFT, 1,024 points, 1.0 ms | -
Hitachi | FFT, 512 points, 1.5 ms | 39.3
RISP | - | 11.8
DISP | FFT, 512 points, 0.91 ms | 26.9
VISP | 2D DCT, 256 x 256, 26.3 ms | 14.8
- | 2D DCT, 8 x 8, 41.8 µs | -
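The scan-line arithmetic described above reduces to linear interpolation between the span endpoints. The exact printed formula did not survive the scan, so the form below, with S = 1/(e - s) playing the role of the per-scan-line scaling factor, is a plausible reconstruction rather than the authors' exact expression:

```python
def interpolate_span(q_s, q_e, s, e):
    """Linearly interpolate a quantity (a z value or a color channel)
    across one scan line, from pixel s to pixel e inclusive.

    S = 1 / (e - s) is computed once per scan line, so the per-pixel
    work is q_i = q_s + (i - s) * S * (q_e - q_s).
    """
    S = 1.0 / (e - s)
    return [q_s + (i - s) * S * (q_e - q_s) for i in range(s, e + 1)]
```

Because the per-pixel work is one multiply-accumulate per interpolated quantity, this back-end graphics step maps naturally onto the same datapaths the image processors already provide.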


units, some of them application-specific 2D-data-address-generation units, and efficient I/O ports. As video compression standards reach their final form, new image compression chips will appear for applications like video teleconferencing and HDTV. These chips will probably include support for interfacing to digital networks like the fiber distributed data interface (FDDI), will possess computing power in the range of 2 to 4 GOPS, and will include dynamic RAM control for glueless interfacing to display memory. To smoothly integrate audio applications, these ICs will also include a conventional DSP core for audio processing.

Acknowledgment
We thank Alex Drukarev for his helpful comments and suggestions.

References
1. M.D. Edwards, "A Review of MIMD Architectures for Image Processing," in Image Processing System Architectures, J. Kittler and M.J.B. Duff, eds., Research Studies Press, Letchworth, Hertfordshire, England, 1985, pp. 85-101.
2. S. Yalamanchili et al., "Image Processing Architectures: A Taxonomy and Survey," in Progress in Pattern Recognition 2, L.N. Kanal and A. Rosenfeld, eds., Elsevier Science Publishers, North Holland, 1985, pp. 1-37.
3. T.J. Fountain, K.N. Matthews, and M.J.B. Duff, "The CLIP7A Image Processor," IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 10, No. 3, May 1988, pp. 310-319.
4. M. Kidode and Y. Shiraogawa, "High Speed Image Processor: TOSPIX-II," in Evaluation of Multicomputers, L.M. Uhr et al., eds., Academic Press, New York, 1986, pp. 319-335.
5. N.L. Seed et al., "An Enhanced Transputer Module for Real-time Image Processing," Third Int'l Conf. on Image Processing and Its Applications, IEE, Hitchin, Herts, England, 1989, pp. 131-135.
6. K.S. Mills, G.K. Wong, and Y. Kim, "A High Performance Floating-Point Image Computing Workstation for Medical Applications," Medical Imaging IV: Image Capture and Display, Vol. 1232, SPIE, 1990, pp. 246-256.
7. JPEG-1 DIS, Draft International Standard DIS 10918-1, CCITT Rec. T.81, Working Group 10, Comité Consultatif International Télégraphique et Téléphonique, New York, Jan. 2, 1992.
8. MPEG-1 CD, Committee Draft ISO/IEC 11172, Working Group 11, International Standards Organization, IPSJ, Tokyo, Dec. 1991.
9. K. Guttag et al., "A Single-Chip Multiprocessor for Multimedia: The MVP," IEEE CG&A, Vol. 12, No. 6, Nov. 1992, pp. 53-64.
10. D. Pryce, "Monolithic Circuits Expedite Desktop Video," EDN, Vol. 36, No. 22, Oct. 24, 1991, pp. 67-76.
11. "AUP-1300E Video Encoder," AT&T Product Note, AT&T Microelectronics, April 1992.
12. I. Tamitani et al., "An Encoder/Decoder Chip Set for the MPEG Video Standard," IEEE ICASSP-92, CS Press, Los Alamitos, Calif., 1992, pp. 661-664.
13. "Using the IIT Vision Processor in JPEG Applications," Product Note, Integrated Information Technologies, Santa Clara, Calif., Sept. 1991.
14. A. Razavi et al., "VLSI Implementation of an Image Compression Algorithm with a New Bit Rate Control Capability," IEEE ICASSP-92, Vol. 5, CS Press, Los Alamitos, Calif., 1992, pp. 669-672.
15. T. Araki et al., "The Architecture of a Vector Digital Signal Processor for Video Coding," IEEE ICASSP-92, Vol. 5, CS Press, Los Alamitos, Calif., 1992, pp. 681-684.
16. Digital Signal Processing Data Book, tech. memo, LSI Logic Corp., Milpitas, Calif., Sept. 1991.
17. W.K. Pratt, Digital Image Processing, John Wiley and Sons, New York, 1991.
18. K. Kikuchi et al., "A Single-Chip 16-bit 25-ns Real-time Video/Image Signal Processor," IEEE J. Solid-State Circuits, Vol. 24, No. 6, Dec. 1989, pp. 1,662-1,667.
19. K. Kaneko et al., "A 50 ns DSP with Parallel Processing Architecture," IEEE ISSCC 87, CS Press, Los Alamitos, Calif., 1987, pp. 158-159.
20. T. Murakami et al., "A DSP Architectural Design for Low Bit-rate Motion Video Codec," IEEE Trans. Circuits and Systems, Vol. 36, No. 10, Oct. 1989, pp. 1,267-1,274.
21. A. Kanuma et al., "A 20 MHz 32b Pipelined CMOS Image Processor," IEEE ISSCC 86, CS Press, Los Alamitos, Calif., 1986, pp. 102-103.
22. H. Yamada et al., "A Microprogrammable Real-Time Image Processor," IEEE J. Solid-State Circuits, Vol. 23, No. 1, Jan. 1988, pp. 216-223.
23. J.P. Norsworthy et al., "A Parallel Image Processing Chip," IEEE ISSCC 88, CS Press, Los Alamitos, Calif., 1988, pp. 158-159.
24. D. Pfeiffer, "Integrating Image Processing with Standard Workstation Platforms," Computer Technology Review, Summer 1991, pp. 103-107.
25. K. Aono, M. Toyokura, and T. Araki, "A 30 ns (600 MOPS) Image Processor with a Reconfigurable Pipeline Architecture," IEEE 1989 Custom Integrated Circuits Conf., CS Press, Los Alamitos, Calif., 1989, pp. 24.4.1-24.4.4.
26. M. Maruyama et al., "An Image Signal Multiprocessor on a Single Chip," IEEE J. Solid-State Circuits, Vol. 25, No. 6, Dec. 1990, pp. 1,476-1,483.
27. T. Minami et al., "A 300-MOPS Video Signal Processor with a Parallel Architecture," IEEE J. Solid-State Circuits, Vol. 26, No. 12, Dec. 1991, pp. 1,868-1,875.
28. R.J. Gove, "Architectures for Single-Chip Image Computing," SPIE Electronic Imaging in Science and Technology Conf. on Image Processing and Interchange, Vol. 1659, SPIE, San Jose, Calif., 1992, pp. 30-40.
29. H.K. Reghbati and A.C. Lee, Computer Graphics Hardware: Image Generation and Display, CS Press, Los Alamitos, Calif., 1988.

Konstantinos Konstantinides is a member of the technical staff at Hewlett-Packard Laboratories in Palo Alto, California, where he is involved in digital and image signal processing and scientific visualization. He received his engineering diploma from the University of Patras in Patras, Greece, his MS from the University of Massachusetts, Amherst, and his PhD from the University of California, Los Angeles. All three degrees are in electrical engineering.

Vasudev Bhaskaran is a member of the technical staff at Hewlett-Packard Laboratories in Palo Alto. His research interests include image and video compression, image processing, and video transmission. Vasudev received his BTech from the Indian Institute of Technology, Madras, India, his MS from Wichita State University in Wichita, Kansas, and his PhD from Rensselaer Polytechnic Institute, Troy, New York.

Contact Konstantinides at Hewlett-Packard Laboratories, P.O. Box 10490, Palo Alto, CA 94303, or by e-mail at kk@hpkronos.hpl.hp.com.


