
Automatic Process-Oriented Control Circuit Generation for Asynchronous High-Level Synthesis


Euiseok Kim, Jeong-Gun Lee and Dong-Ik Lee
Department of Information and Communications, Kwangju Institute of Science and Technology
1, Oryong-Dong, Buk-Gu, Kwangju, 500-712, Republic of Korea
Tel: +82-62-970-2267
Fax: +82-62-970-2204
e-mail: uskim@geguri.kjist.ac.kr, eulia@geguri.kjist.ac.kr, dilee@kjist.ac.kr

Abstract
As the asynchronous design style becomes popular, the demand for
asynchronous high-level synthesis (AHLS) tools is increasing
continuously. In this paper, a so-called process-oriented method,
which generates distributed asynchronous control circuits
automatically in a hierarchical and systematic manner, is suggested
as a part of an AHLS tool. Experimental results show that the
suggested method is efficient in terms of the area and performance
of the derived control circuits.

1 Introduction
In the last decade, the asynchronous design style has become popular
due to potential advantages such as no clock skew, high performance,
low power consumption, small EMI and ease of modular design[1].
However, since most existing CAD tools, which are indispensable for
supporting complex and large circuit design, are targeted at
synchronous system design, there is a great need for design
automation tools that are suitable for asynchronous system design.
Until recently, most work on CAD tools for asynchronous systems has
focused on logic synthesis of hazard-free asynchronous control
circuits from signal transition graphs (STGs) or asynchronous finite
state machines (AFSMs)[2, 3, 4]. However, as the asynchronous design
style becomes widespread, research in high-level synthesis, such as
scheduling, allocation, resource binding and control circuit
generation, is required more and more. Though there are many
achievements in synchronous HLS, those results cannot be directly
applied to asynchronous systems due to inherent features of
asynchronous circuits such as the absence of a global clock. In
particular, scheduling and automatic control circuit generation are
steps that should be reconsidered from the viewpoint of AHLS. In this
paper, automatic control circuit generation is chiefly discussed.
[5] and [6] are early works on asynchronous high-level synthesis
(AHLS), in which scheduling under resource constraints, an
asynchronous architecture model and control circuit generation were
discussed. In particular, [5] proposed a general way to generate an
independent control circuit for each hardware component, a so-called
hardware-oriented control circuit. However, in the case of
hardware-oriented controllers, various kinds of information must be
handled in each controller, and the resulting circuit may suffer from
rapid area increase and performance degradation. Another synthesis
method, the macromodule-based synthesis method[7], may cause area
overhead due to inherent circuit redundancy[8].
In this paper, a method of control circuit generation for AHLS is
proposed. Control circuits generated by the proposed method have the
following four primary features:
- they control the whole system without a global clock
- they guarantee maximal concurrency between processes
- they are uniformly distributed
- they are derived in a hierarchical and systematic way
The uniform distribution of controllers can provide advantages in
area as well as performance. In order to achieve the above four
primary features, we suggest a process-oriented control circuit
generation method, which is the main idea of this paper. Further, in
the proposed controller, we introduce a particular way of
interleaving transitions to avoid complete state coding (CSC)
violations, which are a source of area overhead.
Compared to the hardware-oriented control circuits in [5, 6],
process-oriented control circuits have advantages in area,
performance and decentralization of control, as is discussed in
Section 4.

Figure 1: (a) CDFG (b) IF-node (c) WHILE-node

This paper is organized as follows. Section 2 presents the
preliminaries necessary to understand this paper. Section 3 explains
AFAHLS, the target asynchronous architecture for AHLS. Section 4
presents and compares several approaches to control circuit
generation. In Sections 5 and 6, we explain in detail how to build a
set of communicating controllers from a control data flow graph using
the process-oriented control circuit generation method. Section 7
presents the timing constraints that control circuits generated by
the suggested method must satisfy for correct operation. In Section
8, experimental results showing the effectiveness of the suggested
method are given. Section 9 presents conclusions and future work.

2 Preliminaries
In this section, we explain the basic knowledge necessary for
understanding this paper.

2.1 Control data flow graph

We assume that the initial behavioral description of a target system
is given in the form of a control data flow graph (CDFG), such as
Fig. 1(a). Prior to defining a CDFG, its two major constituent
components, a DFG-unit and a CDFG-unit, are defined. Then we define a
CDFG.

DEFINITION 1 A DFG-unit is a triple $(N, E, O)$, where $N$ is a set
of operation nodes, $E$ is a set of edges between operation nodes,
and $O$ is a set of operations defined for each operation node.

DEFINITION 2 A CDFG-unit is a 4-tuple $(X, Y, Z, E)$, where $X$ is a
control node, $Y$ is a conditional node, $Z$ is the set of nodes in
the CDFG-unit except for $X$ and $Y$, called the child block of the
CDFG-unit's control node, and $E$ is a set of edges between nodes.

Figure 2: Scheduled, Allocated, Resource binded-CDFG (SAR-CDFG)

The control node is responsible for controlling the predefined
execution order of the unit and corresponds to the if and while
constructs shown in Fig. 1(b) and (c). In the rest of this paper,
only the IF-node and the WHILE-node are taken into account as control
nodes for the sake of simplicity. A conditional node is a DFG-unit
representing an execution condition for the control node. A child
block corresponds to an execution block under the control of the
control node and consists of a set of DFG-units and CDFG-units.
DEFINITION 3 A CDFG is a pair $(N, E)$, where $N$, the union of the
set of DFG-units and the set of CDFG-units, is the set of nodes and
$E$ is the set of edges between nodes; that is, each node is either a
DFG-unit or a CDFG-unit. Each node has at most one predecessor and at
most one successor. Moreover, each node has at least one predecessor
or one successor, except in the case that the CDFG consists of only
one DFG-unit or one CDFG-unit.
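
As a minimal sketch, Definitions 1 to 3 can be mirrored by a small data structure. The class names, field names and the helper `is_well_formed` below are illustrative assumptions and do not come from the definitions themselves; the helper only checks the predecessor/successor conditions of Definition 3.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple, Union


@dataclass
class DFGUnit:
    """Definition 1: operation nodes N, edges E, operations O."""
    nodes: Set[str]
    edges: Set[Tuple[str, str]]
    ops: Dict[str, str]  # operation defined for each operation node


@dataclass
class CDFGUnit:
    """Definition 2: control node X, conditional node Y, child block Z."""
    control: str                # "if" or "while" (hypothetical encoding)
    condition: DFGUnit          # the conditional node is itself a DFG-unit
    child_block: List["Node"]   # DFG-units and CDFG-units under control


Node = Union[DFGUnit, CDFGUnit]


@dataclass
class CDFG:
    """Definition 3: nodes are DFG-units or CDFG-units."""
    nodes: List[Node]
    edges: Set[Tuple[int, int]]  # pairs of indices into `nodes`


def is_well_formed(g: CDFG) -> bool:
    """Check the degree conditions stated in Definition 3."""
    n = len(g.nodes)
    preds = [0] * n
    succs = [0] * n
    for u, v in g.edges:
        succs[u] += 1
        preds[v] += 1
    # At most one predecessor and at most one successor per node.
    if any(p > 1 for p in preds) or any(s > 1 for s in succs):
        return False
    # A single-unit CDFG is the only case where a node may be isolated.
    if n == 1:
        return True
    return all(preds[i] + succs[i] >= 1 for i in range(n))
```

Under these assumptions a CDFG is simply a chain of units, which is what later allows the units to be sequenced linearly.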

Through a series of AHLS procedures such as scheduling, allocation
and resource binding, details about the hardware implementation are
attached to the initial CDFG as shown in Fig. 2; we call the result a
Scheduled, Allocated, Resource binded-CDFG (SAR-CDFG). An SAR-CDFG is
used as the input of the synthesis procedure; namely, scheduling,
allocation and resource binding have already been done. When only a
DFG-unit is considered, the term `SAR-DFG' is used instead of
`SAR-CDFG'. In the rest of the paper, for the sake of simplicity, we
omit the prefix `SAR-' when no confusion arises.

3 Architectural model for AHLS


The target Architecture For AHLS (AFAHLS) is shown in Fig. 3. AFAHLS
is composed of input/output processing parts, a functional part and a
control part.

Figure 3: Architecture For AHLS (AFAHLS)

The control part, which is the main concern of this paper, is
composed of process controllers (PCs), process sequencing controllers
(PSCs), control node controllers (CNCs) and unit sequencing
controllers (USCs) that communicate with each other. The first two,
PC and PSC, work as the controllers for a DFG-unit, and the rest, CNC
and USC, take responsibility for control at levels higher than
DFG-units in a CDFG. Each PC is associated with a process, which
corresponds to a node in a DFG-unit. A process consists of `read
operands', `execute' and `write the execution result'. A PSC
corresponds to a DFG-unit and coordinates the execution order among
PCs based on the dependencies given in the DFG-unit. Therefore, a PSC
is basically a centralized controller for a DFG-unit, while the PCs
are distributed. A CNC corresponds to a control node in a CDFG-unit,
and thus there are two kinds of CNCs, WHILE-CNC and IF-CNC. Each CNC
handles the child block under its control according to the function
of the corresponding control node in a CDFG-unit. A USC coordinates
the execution order among CNCs and PSCs as a PSC does among PCs.
Fig. 4(a) presents the hierarchical interconnection of the
controllers explained in this section for Fig. 1(a). Fig. 4(b) shows
the general hierarchy of those controllers.
In AFAHLS, the I/O processing parts consist of input muxes, output
muxes and positive-edge-triggered registers. Furthermore, mux
selectors (SELs), which are necessary for choosing one of the mux
inputs, register enabling signal generators (REs) for register
writing, and delay elements for registers are required. Muxes, which
are connected to the I/O ports of registers, transport data between
functional units and registers. Input signals to SELs and REs come
from the control part, specifically from PCs.

Figure 4: (a) Hierarchical interconnection of controllers for Fig. 1(a) (b) General hierarchy of controllers

Data processing modules such as adders, ALUs, multipliers and
shifters comprise the functional part of AFAHLS. In AFAHLS, the
4-phase handshake protocol and the bundled data method are assumed
for the control and data path protocols, respectively, for better
performance and smaller area. Therefore, each functional unit is
implemented in single-rail logic, as in a synchronous design. Note
that the bundled data assumption is not essential, but is introduced
to reduce the area overhead of functional units.

4 Approaches to control circuit generation

In this section, we introduce three approaches to control circuit
generation, including our process-oriented method, to aid the
reader's comprehension.
The first approach, shown in Fig. 5(a), is a centralized control
circuit generation method, which is widely used in synchronous HLS
(SHLS). In this method, all the control signals are generated by a
centralized controller, and hence the size of the controller is large
in general. The centralized controller is described and synthesized
as a finite state machine (FSM) and generates proper control signals
according to the specified state transitions. However, the absence of
a global clock makes it difficult to apply this method to
asynchronous controller generation. Moreover, the difficulty of
synthesizing hazard-free logic for a large centralized controller
makes it even harder. A centralized control unit would also take away
some of the attractiveness of asynchronous systems; that is, global
control signals introduce delays that reduce the inherent parallelism
of an asynchronous system.

In order to avoid this problem, [5] and [6] proposed a decentralized
approach that implements the overall control as a set of small
communicating controllers, called control processes (CPs). A CP is
generated for each self-timed hardware block, which is why we call it
a hardware-oriented control circuit generation method. Although this
method naturally decentralizes the overall control, it may still fall
into the same problem. For example, if there are many addition
operations but only one adder is allocated, the size of the
corresponding CP becomes large. Fig. 5(b) shows this approach.

Figure 5: (a) Centralized controller (b) Hardware-oriented controller

The process-oriented control circuit generation method shown in
Fig. 3 is another way to decompose an overall control circuit
systematically. In this method, a basic control element, a PC, is
built for a process corresponding to a node in a DFG-unit. Since
processes are regular in behavior, namely `read operands', `execute'
and `write the execution result', the sizes of PCs are also regular
and small compared to controllers based on the above two approaches.
PCs constitute a controller for a DFG-unit together with a PSC
coordinating the execution order among all the PCs. This concept can
be extended to a CDFG systematically through CNCs and USCs. The most
distinctive feature of the process-oriented control circuit
generation method is the separation of controllers performing
processes from those handling the process execution order. The former
are PCs and the latter are PSCs, CNCs and USCs. This separation
enables a good decomposition of an overall circuit, resulting in
small delay and area, as shown in the experimental results.
In the following two sections, we explain how to derive a distributed
control unit of AFAHLS. First, automatic generation of PCs and a PSC
for a DFG-unit is presented; then the method to build CNCs and a USC
for a CDFG is explained in detail.

5 Controllers for data flow graph

5.1 Generating process controllers

A PC is a module that generates the control signals necessary for
performing a process corresponding to a node in a DFG-unit. In order
to perform a process, a PC exchanges several control signals with a
PSC, a functional unit and registers, as shown in Fig. 6(a). The
general behavior of a PC is described as follows:
step 1 A PC is activated by receiving ReqStart+ from the associated
PSC.
step 2 The PC sends signals for requesting operands, such as ReqOP1+
and ReqOP2+.
step 3 The PC activates the bound functional unit by sending ReqFU+.
Since all signal exchanges are based on the 4-phase bundled-data
handshake protocol, AckFU+ arrives at the PC after some fixed delay.
Timing constraints for correct operation are discussed in detail in
Section 7.
step 4 The PC generates ReqWDR+ and sends it to the destination
register in order to store the execution result.
Similar to AckFU+, AckWDR+ arrives at the PC after some fixed delay.
step 5 After all procedures of a process have been completed, the PC
sends AckStart+ to the PSC and enters an idling phase immediately.
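
The five steps can be illustrated by a minimal sketch that lists one possible transition trace of a PC: all rising transitions of the working phase in the order of steps 1 to 5, followed by the falling transitions of the idling phase. The function names are hypothetical, and the order of the falling transitions is an assumption made only for illustration; the text states only that the idling phase follows AckStart+.

```python
from typing import Dict, List

# Signals of a PC for a computation process, in the order of steps 1-5.
WORKING_PHASE = ["ReqStart", "ReqOP1", "ReqOP2", "ReqFU", "AckFU",
                 "ReqWDR", "AckWDR", "AckStart"]


def pc_trace(signals: List[str] = WORKING_PHASE) -> List[str]:
    """Return one possible 4-phase trace: rising edges, then falling edges."""
    rising = [s + "+" for s in signals]    # working phase
    falling = [s + "-" for s in signals]   # idling phase (assumed order)
    return rising + falling


def is_consistent(trace: List[str]) -> bool:
    """Each signal must alternate +, -, ... and end at the low level."""
    level: Dict[str, bool] = {}
    for transition in trace:
        name, sign = transition[:-1], transition[-1]
        high = level.get(name, False)
        if (sign == "+") == high:   # rising while high, or falling while low
            return False
        level[name] = not high
    return not any(level.values())
```

The `is_consistent` helper checks only that rising and falling transitions of each signal alternate, which is the consistency property required of an STG.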
Figure 6: (a) Signal exchanges in a PC (b) An STG for a PC (c) Gate-level implementation of the PC given in (b)

Figure 7: (a) A PN derived from a DFG-unit (b) An STG for a PSC (c) Signal exchanges in a PSC (d) Gate-level implementation of the PSC given in (b)

The above behavior of the PC is described by a signal transition
graph (STG), which has been widely accepted as a description of
asynchronous control circuits since its introduction in [2].
Fig. 6(b) shows the STG corresponding to the above behavior. Since
the behaviors of PCs are regular, the corresponding STG can be
generated easily for each process by slightly modifying the STG in
Fig. 6(b) according to the operation of the corresponding process.
For example, in the case of an assignment operation, the
rising/falling transitions of ReqOP1, ReqOP2, ReqFU and AckFU only
have to be removed. Note that OP code signals are not described in
Fig. 6(b) because they are not general signals. For example, if
one-hot coding is adopted for the OP code, the generation of only one
appropriate bit signal is enough for an OP code signal. The circuit
in Fig. 6(c) is derived from (b) using an asynchronous control
circuit synthesis tool, Petrify[9]. In general, an STG should satisfy
the following five properties for correct asynchronous logic
synthesis: liveness, boundedness, output semi-modularity, consistency
and CSC[10]. The following definition and two theorems are given in
order to prove that an STG we conceive for a PC always satisfies
those five properties.

DEFINITION 4 [11] A marked graph (MG) is a Petri net (PN) such that
each place $p$ has exactly one input transition and exactly one
output transition, i.e., $|{\bullet}p| = |p{\bullet}| = 1$ for all
$p \in P$, where $P$ is the set of places.

THEOREM 5 [11] A marked graph $(G, M_0)$ is live iff $M_0$ places at
least one token on each directed circuit in $G$.

THEOREM 6 [11] A strongly connected marked graph $(G, M_0)$ is
bounded.

Note that, by construction, an STG for a PC is always a strongly
connected MG.

PROPOSITION 7 An STG for a PC always satisfies the liveness,
boundedness, consistency, output semi-modularity and CSC properties.

[proof] All the proofs are almost trivial.
liveness All the directed circuits structurally include the arc from
AckStart- to ReqStart+. Since this arc initially holds one token,
every directed circuit has at least one token at $M_0$. Therefore,
the STG is live by Th. 5.
boundedness This follows directly from strong connectedness and
Th. 6.
consistency The STG has exactly one rising transition and one falling
transition for each signal. Moreover, the rising and falling
transitions of a signal always occur alternately. Therefore,
consistency is satisfied.
output semi-modularity An enabled transition in an MG cannot be
disabled without firing. Therefore, the STG inherently satisfies
output semi-modularity.
CSC The STG explicitly has a working phase and an idling phase.
Namely, all the rising transitions occur in the working phase and all
the falling transitions occur in the idling phase. Since the working
phase ends with AckStart+ and the idling phase ends with AckStart-,
all the states generated in the working phase have binary states
different from those generated in the idling phase. Moreover, all the
states in the same phase have binary states different from each
other. Therefore, all the binary states generated from the STG are
distinct, and the STG satisfies the CSC property.
According to Prop. 7, an STG for a PC can be synthesized into a
speed-independent circuit.
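
The liveness condition of Theorem 5 can be checked mechanically. The following is a minimal sketch, assuming a marked graph is given as a list of arcs between transitions, each arc standing for one place with its initial token count; this representation and the function name are illustrative assumptions. Every directed circuit carries a token exactly when the arcs without tokens form an acyclic graph, which is tested here with a topological sort.

```python
from collections import defaultdict, deque
from typing import Iterable, Tuple

# (source transition, target transition, initial tokens on the place between)
Arc = Tuple[str, str, int]


def is_live_marked_graph(arcs: Iterable[Arc]) -> bool:
    """Theorem 5: live iff every directed circuit holds at least one token."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for src, dst, tokens in arcs:
        nodes.update((src, dst))
        if tokens == 0:               # keep only the token-free arcs
            succ[src].append(dst)
            indeg[dst] += 1
    # Kahn's algorithm: the token-free subgraph must be acyclic.
    queue = deque(n for n in nodes if indeg[n] == 0)
    visited = 0
    while queue:
        n = queue.popleft()
        visited += 1
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return visited == len(nodes)
```

For a PC, the single token on the arc from AckStart- to ReqStart+ breaks every circuit, so such a check succeeds on any STG built as described.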

5.2 Generating process sequencing controllers

A process sequencing controller (PSC) is a circuit that activates a
series of PCs in the proper order based on the dependencies among
nodes in a DFG-unit. In the first step of automatic PSC generation,
we transform a DFG-unit into a Petri net (PN). Since all the
operation nodes in a DFG-unit have already been scheduled, the
corresponding PN can be easily obtained. The following definition
gives the dependency relation between any two operation nodes in the
same DFG-unit.
DEFINITION 8 For any two operation nodes $op_i$ and $op_j$ in a
DFG-unit, if the scheduled execution of $op_i$ does not overlap with
that of $op_j$ and $op_i$ is scheduled prior to $op_j$, then $op_i$
precedes $op_j$, denoted by $(op_i \, P \, op_j)$. If
$(op_i \, P \, op_j)$ and there does not exist $op_k$ such that
$(op_i \, P \, op_k)$ and $(op_k \, P \, op_j)$, then $op_i$ directly
precedes $op_j$, denoted by $(op_i \, DP \, op_j)$.
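
As a minimal sketch, the two relations of Definition 8 can be computed from a schedule given as start and end times. The schedule representation and function names are assumptions; in particular, two operations are treated as non-overlapping when the end time of the first is not later than the start time of the second.

```python
from typing import Dict, Set, Tuple

Schedule = Dict[str, Tuple[int, int]]  # operation name -> (start, end)


def precedes(s: Schedule) -> Set[Tuple[str, str]]:
    """All pairs (a, b) such that a precedes b (relation P)."""
    return {(a, b) for a in s for b in s
            if a != b and s[a][1] <= s[b][0]}


def directly_precedes(s: Schedule) -> Set[Tuple[str, str]]:
    """All pairs (a, b) such that a directly precedes b (relation DP)."""
    p = precedes(s)
    return {(a, b) for (a, b) in p
            if not any((a, k) in p and (k, b) in p for k in s)}
```

For a purely sequential schedule such as a: (0, 10), b: (10, 20), c: (20, 30), a precedes both b and c, but a directly precedes only b.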

Based on Def. 8, the following algorithm constructs a PN from a
DFG-unit.

ALGORITHM 1 Derivation of a PN from a DFG-unit

Note that a DFG inherently has no choice, and thus the net is an MG.
Therefore, "make an arc from $t$ to $t'$" in this algorithm means
that $t$ and $t'$ are connected via a place $p$, i.e.,
$t \to p \to t'$.
step 1 Generate two transitions labeled Start and End.
step 2 For each operation node $op_i$, make a transition and label it
PCi, $i = 1, 2, \ldots$.
step 3 For each transition PCi corresponding to an operation node
$op_i$ for which there is no operation node $op_j$ such that
$(op_j \, P \, op_i)$, make an arc from the Start transition to PCi.
step 4 For each transition PCi corresponding to an operation node
$op_i$ for which there is no operation node $op_j$ such that
$(op_i \, P \, op_j)$, make an arc from PCi to the End transition.
step 5 Make an arc from PCi to PCj for operation nodes $op_i$ and
$op_j$ such that $(op_i \, DP \, op_j)$. That is, if
$(op_i \, DP \, op_j)$, then PCi $\to p \to$ PCj.
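
The steps of Algorithm 1 can be sketched as follows. This is a minimal illustration, assuming the schedule is a mapping from operation names to (start, end) times and that places are left implicit, so the PN is returned as a set of transition-to-transition arcs. The function name and the example schedule are hypothetical.

```python
from typing import Dict, Set, Tuple

Schedule = Dict[str, Tuple[int, int]]  # operation name -> (start, end)
Arc = Tuple[str, str]                  # transition -> transition, place implied


def derive_pn(schedule: Schedule) -> Set[Arc]:
    """Build the arcs of the PN for one DFG-unit (steps 1-5)."""
    ops = sorted(schedule)
    # Relations P and DP of Definition 8 (end <= start means no overlap).
    p = {(a, b) for a in ops for b in ops
         if a != b and schedule[a][1] <= schedule[b][0]}
    dp = {(a, b) for (a, b) in p
          if not any((a, k) in p and (k, b) in p for k in ops)}
    # Step 2: one transition PCi per operation node.
    label = {op: f"PC{i}" for i, op in enumerate(ops, start=1)}
    arcs: Set[Arc] = set()
    for op in ops:
        # Step 3: no preceding operation, so connect from Start.
        if not any((other, op) in p for other in ops):
            arcs.add(("Start", label[op]))
        # Step 4: no following operation, so connect to End.
        if not any((op, other) in p for other in ops):
            arcs.add((label[op], "End"))
    # Step 5: one arc per directly-precedes pair.
    arcs |= {(label[a], label[b]) for (a, b) in dp}
    return arcs


# Hypothetical schedule with two pairs of concurrent operations.
example = {"op1": (0, 50), "op2": (0, 70), "op3": (70, 120), "op4": (70, 140)}
pn = derive_pn(example)
```

In this hypothetical schedule, PC1 and PC2 start concurrently, and PC3 and PC4 each wait for both of them.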
Moreover, the PN derived from a DFG-unit can be automatically
transformed into an STG in a straightforward way. The following
algorithm shows how to derive an STG of a PSC from a PN using the
4-phase handshaking protocol. Fig. 7(a) and (b) show how to derive a
PN and an STG from the DFG-unit in Fig. 2, and Fig. 7(c) shows the
signal exchanges between a PSC and the associated PCs.

ALGORITHM 2 Derivation of an STG from a PN using the 4-phase
handshaking protocol
step 1 Transform the Start transition into Req+ and the End
transition into Ack+ → Req-.
step 2 Divide each transition corresponding to PCi into two rising
transitions, ReqPCi+ and AckPCi+, and add an arc ReqPCi+ → AckPCi+.
step 3 For the transitions generated in steps 1 and 2, make an arc
from AckPCi+ to ReqPCj+ if there is an arc from PCi to PCj in the PN.
In addition, make an arc from Req+ to ReqPCj+ for each PCj that has
an arc from Start in the PN, and an arc from AckPCi+ to Ack+ for each
PCi that has an arc to End.
step 4 Make the falling transitions ReqPCi-, AckPCi- and Ack-, and
add arcs ReqPCi- → AckPCi-.
step 5 Make arcs from Req- to all the ReqPCi- transitions and from
all the AckPCi- transitions to Ack-.
step 6 Add an arc from Ack- to Req+.
step 7 Put a token on the arc from Ack- to Req+.
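
Algorithm 2 can be sketched in the same style. The code below is a minimal illustration, assuming the PN is given as a set of transition-to-transition arcs with transitions named Start, End and PCi, and that every arc touches at least one PCi; the function name and the example PN are hypothetical. It returns the STG arcs together with the arc that carries the initial token.

```python
from typing import Set, Tuple

Arc = Tuple[str, str]


def derive_stg(pn: Set[Arc]) -> Tuple[Set[Arc], Arc]:
    """Derive the STG arcs of a PSC from a PN (steps 1-7)."""
    pcs = sorted({t for arc in pn for t in arc} - {"Start", "End"})
    stg: Set[Arc] = {("Ack+", "Req-")}                # step 1
    for pc in pcs:
        stg.add((f"Req{pc}+", f"Ack{pc}+"))           # step 2
        stg.add((f"Req{pc}-", f"Ack{pc}-"))           # step 4
        stg.add(("Req-", f"Req{pc}-"))                # step 5
        stg.add((f"Ack{pc}-", "Ack-"))                # step 5
    for src, dst in pn:                               # step 3
        if src == "Start":
            stg.add(("Req+", f"Req{dst}+"))
        elif dst == "End":
            stg.add((f"Ack{src}+", "Ack+"))
        else:
            stg.add((f"Ack{src}+", f"Req{dst}+"))
    token_arc = ("Ack-", "Req+")                      # steps 6 and 7
    stg.add(token_arc)
    return stg, token_arc


# Hypothetical PN with two concurrent PCs followed by two more.
example_pn = {("Start", "PC1"), ("Start", "PC2"),
              ("PC1", "PC3"), ("PC1", "PC4"),
              ("PC2", "PC3"), ("PC2", "PC4"),
              ("PC3", "End"), ("PC4", "End")}
stg, token = derive_stg(example_pn)
```

All falling transitions hang off Req- in parallel, which reflects the concurrent return-to-zero of the handshake signals in the idling phase.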

It can be proved in a similar way to Prop. 7 that STGs obtained
through Algo. 2 always satisfy the five properties for
speed-independent circuit implementation. Therefore, an STG for a PSC
can be synthesized into a speed-independent circuit. Fig. 7(d) is the
gate-level implementation of the STG in (b).

There are two points worth noting here about the construction of a
PSC and the associated PCs and about their interactions. Consider a
PSC, say $PSC_i$, and the set of all associated PCs, say
$PC_{PSC_i}$. $PSC_i$ activates every PC in $PC_{PSC_i}$ in the order
of the associated dependencies before entering the idling phase. Each
PC that is activated by $PSC_i$ then executes the assigned process.
After completion of the process, the PC sends out an acknowledge
signal. Then the PC immediately goes into its idling phase, i.e., it
takes down its request and acknowledgement signals, under its own
control rather than following $PSC_i$'s idling phase. In consequence,
we obtain better performance by hiding the idling phases of all PCs.
For example, in the STG shown in Fig. 7, the PSC issues ReqPCi+,
i = 1, 2, to activate PC1 and PC2. ReqPCi+ corresponds to ReqStart+
in the STG for a PC shown in Fig. 6(b). After the completion of the
assigned process, PCi, i = 1, 2, sends AckStart+, which is identical
to AckPCi+ in the PSC. Then PCi immediately starts its idling phase
without any interaction with the PSC by issuing ReqOP1-, ReqOP2-,
ReqFU- and so on. At the same time, the PSC sends ReqPC3+ and
ReqPC4+ without issuing ReqPCi-, i = 1, 2.
In addition, if decomposition of a PSC is required for some reason,
for example when several small PSCs are preferable to one large PSC,
a PSC can easily be decomposed into small sub-PSCs that communicate
with each other and control their allocated PCs. This decomposition
guarantees further distribution of the control circuits in uniform
size.

6 Controllers for control data flow graph

In the previous section, we explained how to derive controllers, PCs
and a PSC, for a DFG-unit. In this section, the target description
for generating controllers is extended from an SAR-DFG to an
SAR-CDFG. The generation of CNCs and USCs is explained below.

6.1 Building control node controllers

In a CDFG, control nodes coordinate the execution of child blocks,
and there are two kinds of control nodes according to their function:
the IF-node and the WHILE-node. An IF-node makes the child block
under its control execute if the given condition is true. Similarly,
a WHILE-node makes the child block under its control execute
repeatedly while the given condition is true.
A CNC corresponds to a control node in a CDFG-unit, and a control
node operates according to the given condition. Thus, when its Req
input signal is activated, a CNC checks whether the given condition
is true by executing the conditional node. Then, depending on the
value of Flag, which indicates the execution result of the
conditional node, the CNC activates the child block. Fig. 8(a) is a
block diagram showing the signal exchanges among a CNC, the
associated conditional node and the child block. We propose the STGs
for IF-CNC and WHILE-CNC shown in Fig. 8(b) and (c), respectively.
The STGs for IF-CNC and WHILE-CNC satisfy the five properties for
speed-independent circuit synthesis. Therefore, they can be
synthesized into speed-independent circuits.

Figure 8: (a) Signal exchanges in CNC (b) STG for IF-CNC (c) STG for WHILE-CNC

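
The difference between the two kinds of CNC can be illustrated at the behavioral level. The following is only a sketch under stated assumptions: it uses the signal names Req, Ack, ReqCon, AckCon, ReqBlk and AckBlk of the block diagram, models Flag as the return value of a hypothetical `condition` callback, and serializes each handshake completely. It does not reproduce the transition ordering of the STGs in Fig. 8(b) and (c).

```python
from typing import Callable, List


def run_cnc(kind: str, condition: Callable[[], bool]) -> List[str]:
    """Return one sequential event trace for a single activation of a CNC."""
    if kind not in ("if", "while"):
        raise ValueError("kind must be 'if' or 'while'")
    events = ["Req+"]
    while True:
        # Execute the conditional node and sample its result (Flag).
        events += ["ReqCon+", "AckCon+"]
        flag = condition()
        events += ["ReqCon-", "AckCon-"]
        if not flag:
            break
        # Condition true: execute the child block once.
        events += ["ReqBlk+", "AckBlk+", "ReqBlk-", "AckBlk-"]
        if kind == "if":
            break   # an IF-CNC never re-evaluates the condition
    events += ["Ack+", "Req-", "Ack-"]
    return events
```

In this sketch, a WHILE-CNC evaluates the conditional node once more than it executes the child block, while an IF-CNC evaluates it exactly once.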

6.2 Building unit sequencing controllers

A unit sequencing controller (USC) coordinates the given execution
order among CNCs and PSCs in order to perform a CDFG, as a PSC does
for a DFG-unit. According to the definition of a CDFG, the DFG-units
and CDFG-units constituting a CDFG are performed sequentially.
Therefore, the executions of the CNCs and PSCs handled by a USC are
linearly ordered. The following algorithm generates a PN representing
the execution order among CNCs and PSCs from an SAR-CDFG. Note that
"make an arc" in the algorithm again means that two transitions are
connected via a place.

ALGORITHM 3 Derivation of a sequential PN from an SAR-CDFG
step 1 Generate two transitions, Start and End.
step 2 For every DFG-unit or CDFG-unit, make a corresponding
transition and label it Blocki, i = 1, 2, ....
step 3 Make an arc from the Start transition to the transition
corresponding to the DFG-unit or CDFG-unit having no predecessor.
step 4 Make an arc to the End transition from the transition
corresponding to the DFG-unit or CDFG-unit having no successor.
step 5 Between two adjacent nodes in the CDFG, make an arc between
the corresponding two transitions according to the given dependency.
Here a node means a DFG-unit or a CDFG-unit.

A sequential PN obtained through Algo. 3 can be transformed into an
STG of the target USC in the same way as in Algo. 2 and synthesized
into a speed-independent USC, like a PSC. Fig. 9 shows the series of
steps to derive an STG for a USC from an SAR-CDFG.

Figure 9: Derivation of an STG for a USC from a CDFG

7 Timing constraints for correct operation of AFAHLS

Timing constraints are necessary for correct control of the whole
system by the proposed controllers. Those constraints are as follows:
1. The delay associated with a functional unit (FU) should be larger
than the sum of the maximum operand fetch delay, the FU's worst-case
delay and the worst-case delay of the destination register's input
muxes.
2. The delay associated with a register should be larger than the
delay of register writing.
3. For two consecutive processes, $p_i$ and $p_j$, using the same
hardware, if $p_i$ is executed prior to $p_j$, the idling phase of
$p_i$ should not overlap with the working phase of $p_j$.
The necessity of the first and second constraints is obvious, and
they are easy to satisfy. The I/O ports of an FU are connected to
interconnection logic consisting of a series of muxes. Therefore, the
maximum operand fetch delay and the worst-case delay of the
destination register's input muxes must be considered in constraint
1. The last constraint is necessary for a correct implementation of
4-phase bundled-data handshaking. Note that the signal exchanges
between an FU/register and a PC are implemented in AFAHLS as shown in
Fig. 10. We assume that PC1 and PC2 are processes that access the
same FU sequentially, as shown in Fig. 10(a). Consider the case in
which the FU has completed its operation for ReqFU1+. Then the delay
element gives AckFU1+ to PC1. Since g1 and g2 share the output of the
delay element D, if PC2 sends ReqFU2+ before the output of D becomes
low, PC2 receives AckFU2+ immediately. This problem originally stems
from the fact that a PSC activates all associated PCs before it
enters the idling phase, for better performance. To avoid this
situation, we should insert a delay element with a proper delay
between the PSC and PC2. This causes some performance degradation. In
order to avoid or reduce the performance degradation due to the third
constraint, we make the idling phases of PCs as short as possible
through concurrent falling of signals and a specialized delay element
whose delay is very small for a falling signal, as shown in
Fig. 10(b)[12]. Consequently, the third constraint can be satisfied
easily with only a small performance loss.
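
The three constraints can be written as simple checks. The following is a minimal sketch in which all names are hypothetical and all delays are assumed to be given in one common time unit; how the individual delays are obtained (for example from a cell library) is outside its scope.

```python
from dataclasses import dataclass


@dataclass
class FuTiming:
    """Delays around one functional unit (hypothetical field names)."""
    matched_delay: float        # delay element associated with the FU
    operand_fetch_max: float    # maximum operand fetch delay
    fu_worst_case: float        # worst-case delay of the FU itself
    dest_mux_worst_case: float  # worst case of destination input muxes


def satisfies_constraint_1(t: FuTiming) -> bool:
    """FU delay element must exceed the sum of the three path delays."""
    path = t.operand_fetch_max + t.fu_worst_case + t.dest_mux_worst_case
    return t.matched_delay > path


def satisfies_constraint_2(register_delay: float, write_delay: float) -> bool:
    """Register delay element must exceed the register writing delay."""
    return register_delay > write_delay


def satisfies_constraint_3(idle_end_pi: float, work_start_pj: float) -> bool:
    """Idling phase of p_i must be over when p_j's working phase begins."""
    return idle_end_pi <= work_start_pj
```

With such checks, the extra delay to insert between a PSC and the later of two PCs sharing a unit is whatever makes the third check hold, which is why a delay element with a fast falling edge keeps that padding small.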

8 Experimental results
In this paper, we suggested a process-oriented control circuit
generation method. The proposed method is being implemented as a part
of an asynchronous high-level synthesis tool. This tool consists of
two main parts, an automatic asynchronous control circuit generator
and a VHDL code generator. The former derives a series of controllers
based on the process-oriented method, and the latter generates
structural VHDL code for those controllers for circuit simulation and
analysis.
We performed two experiments in order to check the effectiveness of
our method. First, we compared the hardware-oriented method and the
process-oriented method in terms of the number of literals, area,
worst-case delay and average cycle time. Since the automatic VHDL
code generator is still being developed, in this experiment the
control circuits were implemented and simulated manually using a
commercial VHDL tool, SYNOPSYS, and the 0.6 µm IDEC-C631 library[13].
Table 1 shows the experimental results for the four kinds of
controllers, PC, PSC, CNC and USC, in the process-oriented method. In
particular, PSCs and USCs scale with the size of the given DFG-unit
and CDFG. Thus we experimented with several PSCs and USCs of various
sizes. As shown in Table 1, the areas of the controllers based on the
process-oriented method are small and regular, except for IF-CNC and
WHILE-CNC; in consequence, they show smaller worst-case delay and
average cycle time compared to those of the controllers in the
hardware-oriented method. These features are due to the good
decomposition of a global controller into several process-level
controllers and coordinators among them through the proposed method.

Figure 10: (a) Signal exchanges based on 4-phase bundled data method between FU and PC (b) Delay element with fast high-to-low propagation delay

Table 2 shows experimental results for the controllers
of functional units and registers, denoted by CPfu
and CPreg, among the controllers of the hardware-oriented
method[5]. Unfortunately, [5] did not present any
experimental results, and thus the CPs in Table 2 were
constructed by us for the purpose of comparison. CPi,
i=1, 2, 3, in Table 2 are constructed for the cases in which
one, two or three processes use a functional unit or a
register sequentially. As Table 2 shows, the CPs are
larger than the process-oriented controllers given in
Table 1 and thus suffer from larger delay. In our opinion,
this results from the following two reasons. The first
reason is CSC violations, which cause large area overhead
and performance degradation. For the controllers of the
hardware-oriented method, although Petrify, which is the
state of the art in solving CSC violation problems, was
used to solve the CSC violations, many additional internal
signals were inserted. As a consequence, the resulting
circuits show larger area, worst-case delay and average
cycle time.
Although CSC violations can be reduced with much

Table 1: Controllers generated through the process-oriented method

Controller   Number of   Area    Worst-Case Delay /
             Literals            Average Cycle Time (ns)
PCcomp       14          23.29   2.23 / 5.30
PCassi        4          15.29   1.77 / 3.73
PSC2          4          12.97   1.41 / 2.62
PSC4         14          19.63   1.74 / 4.26
PSC8         30          28.63   2.41 / 5.91
IF-CNC       25          74.83   3.72 / 6.77
WHILE-CNC    27          68.18   2.40 / 5.03
USC2          5          14.63   1.41 / 3.37
USC4         11          20.28   2.03 / 4.91
USC8         23          31.92   2.41 / 6.73

PCcomp : PC for an op. node performing data computation
PCassi : PC for an op. node performing data assignment
PSCi : PSC coordinating execution order among i PCs
USCi : USC coordinating execution order among i units
The area of a 2-input NAND gate is 1.00


Table 2: Controllers generated through the hardware-oriented method[5]

Controller   Number of   Area     Worst-Case Delay /
             Literals             Average Cycle Time (ns)
CPfu 1        27 (3)      73.82   4.68 / 12.43
CPfu 2        51 (6)     146.68   3.66 / 10.52
CPfu 3        73 (9)     190.24   5.84 / 11.57
CPreg 1       24 (4)      83.12   6.76 / 13.15
CPreg 2       69 (5)     153.71   5.55 / 13.12
CPreg 3      141 (7)     273.48   9.79 / 14.69

The area of a 2-input NAND gate is 1.00
( ) gives the number of internal signals inserted for solving
CSC violations.

more intensive effort, it seems difficult to avoid
them completely. The second reason is that process
execution information and process sequencing information
are handled together in a CP. In particular, under
limited hardware allocation, growth of a DFG or
a CDFG may increase the area and delay of a CP.
Comparing Table 1 with Table 2,
we can see that the process-oriented method is more
efficient in terms of both area and performance.
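
The comparison can be checked by tabulating the reported numbers. The following is a small sketch that only copies the values of Table 1 and Table 2 and prints the range of each column; the names `process_oriented`, `hardware_oriented` and `column_range` are illustrative, and the parenthesized internal-signal counts of Table 2 are left out.

```python
# Rows: (number of literals, area, worst-case delay in ns, average cycle time in ns).
# Values copied from Table 1 (process-oriented method).
process_oriented = {
    "PCcomp": (14, 23.29, 2.23, 5.30),
    "PCassi": (4, 15.29, 1.77, 3.73),
    "PSC2": (4, 12.97, 1.41, 2.62),
    "PSC4": (14, 19.63, 1.74, 4.26),
    "PSC8": (30, 28.63, 2.41, 5.91),
    "IF-CNC": (25, 74.83, 3.72, 6.77),
    "WHILE-CNC": (27, 68.18, 2.40, 5.03),
    "USC2": (5, 14.63, 1.41, 3.37),
    "USC4": (11, 20.28, 2.03, 4.91),
    "USC8": (23, 31.92, 2.41, 6.73),
}

# Values copied from Table 2 (hardware-oriented method).
hardware_oriented = {
    "CPfu 1": (27, 73.82, 4.68, 12.43),
    "CPfu 2": (51, 146.68, 3.66, 10.52),
    "CPfu 3": (73, 190.24, 5.84, 11.57),
    "CPreg 1": (24, 83.12, 6.76, 13.15),
    "CPreg 2": (69, 153.71, 5.55, 13.12),
    "CPreg 3": (141, 273.48, 9.79, 14.69),
}


def column_range(table, index):
    """Return (min, max) of one column over all controllers in a table."""
    values = [row[index] for row in table.values()]
    return min(values), max(values)


for label, index in (("area", 1), ("worst-case delay", 2), ("average cycle time", 3)):
    print(label,
          "process-oriented:", column_range(process_oriented, index),
          "hardware-oriented:", column_range(hardware_oriented, index))
```

With these numbers, every average cycle time in Table 1 is below every average cycle time in Table 2, while the area and worst-case delay ranges overlap only because of IF-CNC.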
Second, we implemented a simple CDFG, shown
in Fig. 11, based on the suggested method in order
to check its feasibility. As Figs. 12 and 13 show,
the process-oriented controllers work
together using the 4-phase handshaking protocol. To
satisfy the timing constraints given in section 7, we
inserted proper delays. In particular, for functional
units and registers, we inserted delays 1.1 to 1.2 times
their worst-case delays.
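
The delay sizing rule above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the 4.0 ns worst-case delay is an invented value for an adder, and only the 1.1 to 1.2 margin range comes from the experiment.

```python
# Event order of one 4-phase handshake between a controller and a unit.
FOUR_PHASE_ORDER = ("req+", "ack+", "req-", "ack-")


def matched_delay(worst_case_ns, margin):
    """Return the inserted delay: the unit's worst-case delay times a margin."""
    if not 1.1 <= margin <= 1.2:
        raise ValueError("margin outside the 1.1-1.2 range used in the experiment")
    return worst_case_ns * margin


def data_ready_before_ack(worst_case_ns, margin):
    """True if the unit's result is stable no later than ack+ is produced."""
    return worst_case_ns <= matched_delay(worst_case_ns, margin)


# Hypothetical adder with a 4.0 ns worst-case delay.
for margin in (1.1, 1.2):
    delay = matched_delay(4.0, margin)
    print(f"margin {margin}: inserted delay {delay:.2f} ns,",
          "safe" if data_ready_before_ack(4.0, margin) else "unsafe")
```

A larger margin tolerates more delay variation in the unit at the cost of a longer cycle, which is the trade-off behind choosing a factor only slightly above 1.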

[Figure 11: CDFG diagram. Recoverable content: assignments a = 1; b = 4; c = 0; d = 2; a while loop. Register binding: R0: a; R1: b, e, g; R2: c, f, h; R3: d. FU: adder1, adder2. Reg: R0, R1, R2, R3.]

Figure 11: Simple example for simulation


From the above two experiments, we conclude
that our approach offers a good and practical method for
automatic control circuit generation for AHLS.

9 Conclusions and future work


We have presented an automatic process-oriented
control circuit generation method for AHLS. The
proposed method has the following notable features:
- It presents a systematic and hierarchical way to
generate a set of controllers with small and regular
sizes.
- It produces STGs that satisfy the five properties for
speed-independent circuit implementation without any
modification. Therefore, the STGs can be synthesized into
speed-independent circuits easily.
- It produces distributed controllers that are advantageous
in terms of area and performance owing to the
above two features.
Moreover, since all the procedures for deriving
controllers from a CDFG are given in an algorithmic and
systematic way, they can be automated. Consequently,
our method is expected to be a good and practical
way of generating controllers automatically for
AHLS.
This work has been performed as part of building an
AHLS CAD tool. To construct a complete
AHLS CAD tool, research on scheduling, resource
allocation, resource binding and asynchronous architecture should be performed.

10 Acknowledgements
This work has been supported in part by the Korea Research Foundation under grant 1998-016-E00058
and by the KAIST/K-JIST IT-21 Initiative in BK21 of
the Ministry of Education.

Figure 12: Simulation result I for Fig. 11

Figure 13: Simulation result II for Fig. 11

References
[1] S. Hauck, "Asynchronous Design Methodologies: An
Overview," Proceedings of the IEEE, 83(1), Jan., 1995.
[2] T. A. Chu, "Synthesis of Self-timed VLSI Circuits
from Graph-theoretic Specifications," Ph. D. thesis,
MIT, Jun., 1987.
[3] A. Kondratyev, M. Kishinevsky, B. Lin, P. Vanbekbergen
and A. Yakovlev, "Basic Gate Implementation
of Speed-Independent Circuits," In Proceedings of Design Automation Conference, Jun., 1994.
[4] S. M. Nowick and D. L. Dill, "Synthesis of Asynchronous State Machines Using a Local Clock," In
Proceedings of ICCD, Oct., 1991.
[5] J. Cortadella and R. M. Badia, "An Asynchronous Architecture Model for Behavioral Synthesis," In Proceedings of European Conference on Design Automation, Mar., 1992.
[6] R. M. Badia, J. Cortadella, E. Pastor and A. Pardo,
"A High-Level Synthesis System for Asynchronous
Circuits," Sixth International Workshop on High-Level
Synthesis, Nov., 1992.
[7] E. Brunvand, "Translating Concurrent Communicating Programs into Asynchronous Circuits," Ph. D. thesis, CMU, 1991.
[8] T. Kolks, S. Vercauteren and B. Lin, "Control Resynthesis for Control-Dominated Asynchronous Designs,"
In Proceedings of Second International Symposium on
Advanced Research in Asynchronous Circuits and Systems, Mar., 1996.
[9] J. Cortadella et al., "Petrify: a tool for manipulating concurrent specifications and synthesis of asynchronous controllers," In Proceedings of the 11th Conf.
Design of Integrated Circuits and Systems, Nov., 1996.
[10] A. Kondratyev, J. Cortadella, M. Kishinevsky, E.
Pastor, O. Roig and A. Yakovlev, "Checking Signal
Transition Graph Implementability by Symbolic BDD
Traversal," European Design and Test'95, Mar., 1995.
[11] T. Murata, "Petri Nets: Properties, Analysis and Applications," Proceedings of the IEEE, Vol. 77, No. 4,
1989.
[12] K. T. Christensen, P. Jensen, P. Korger and J.
Sparsø, "The Design of an Asynchronous TinyRISC(TM)
TR4101 Microprocessor Core," In Proceedings of
Fourth International Symposium on Advanced Research in Asynchronous Circuits and Systems, Mar.,
1998.
[13] IDEC Cell Library Data Book Release 9804, Apr.,
1998.