ABSTRACT
In image processing, when we want to extract an object of interest from the rest of the image data, the first stage is detecting the edges of all the objects in the image and then filtering for the objects with the required features. Edge detection is therefore a necessary step in image processing, and it is normally done on a computer. In this project we set out to design a customized edge-detection system to be embedded in digital cameras. The algorithm we chose for edge detection is Canny, which is simple and easy to implement. We designed the entire system using the SpecC language, and the simulation gave good results.
CONTENTS

1. Introduction
   a. System Level Modeling
   b. System Level Description Languages
2. Case Study on a Canny Edge Detector SoC
   a. Canny Application Reference C Code
   b. System Level Model in SpecC
   c. Estimation, Optimization and Refinement using SCE
3. Conclusion
4. References
1. INTRODUCTION
In designing a system, we need to make a lot of decisions, and the two main decisions that influence all the others are the choice of the model of computation and the selection of a description language. These two parameters determine the design flow and the tools required, and their selection depends on the nature of the system being developed. The following subsections discuss system level modeling and system level description languages.
This clearly shows that at the highest level of abstraction the number of components we work with is very small compared to the lowest level, which results from adding more and more details and requirements to the design. Accuracy also improves as we move toward lower levels of abstraction, because we can individually specify the behavior of the components that perform the most basic operations. For example, at the transistor level of abstraction we know the width and length of each transistor, which helps us predict the exact timing of the gates and eventually the timing of the entire system. The following figure illustrates the models as we move toward lower levels of abstraction.
As you can see, we start with the requirements and the specification model, which is a pure functional description. We then add more detail step by step: the different processing elements (architecture model), the on-chip communication networks (communication model), and finally the choice of technology and the RTL (implementation model). In systems where the number of tasks is high, the scheduling of those tasks plays an important role in the efficiency of the system.
A few of these languages are:

- C: Good for functional representation; cannot be used for hardware-level modeling.
- C++: Same as C, with the additional feature of exception handling.
- Java: Like C++, with features for concurrency and synchronization.
- VHDL: A hardware description language; has almost all the features required to synthesize hardware with structural hierarchy.
- Verilog: Another hardware description language, comparable to VHDL.
- SpecC: Well suited for capturing the system; has the features that are missing in the languages above.
- SystemC: More a library for C++ than a language; it also has all the features of SpecC.
The following figure shows the capabilities of each language in the context of system level modeling.
Structural hierarchy of the model:

behavior Main
|------ Monitor monitor
|------ Platform platform
|       |------ DUT canny
|       |------ DataIn din
|       |------ DataOut dout
|       |------ c_img_queue q1
|       \------ c_img_queue q2
|------ Stimulus stimulus
|------ c_img_queue q1
\------ c_img_queue q2
[Figure: block diagram of the model. STIMULUS (Read_pgm(), P.Send(img)) feeds the PLATFORM through Queue q1. Inside the PLATFORM, DATA IN (P1.Read(img), P2.Send(img)) forwards the image to the DUT (Gaussian(), Canny(), Hysteresis()), and DATA OUT passes the result on. The PLATFORM sends the processed image through Queue q2 to the MONITOR (P.Read(img), Write_pgm(img), Exit()).]
Stimulus: This behavior uses Read_pgm() to read the image and sends it to the Platform behavior through the port P. The communication channel between Stimulus and Platform is a simple Queue, q1.
Platform: The Platform has its own Data_in and Data_out interfaces to communicate with the other behaviors instead of communicating directly with Stimulus and Monitor. These modules are included to make future modifications easier: if we intend to change the interface between the Stimulus or Monitor and the Platform, we need not disturb the entire code; we can simply modify Data_in or Data_out. Data_in is the interface between Platform and Stimulus, and Data_out is the interface between Platform and Monitor. The DUT is the main behavior that implements the full functionality of the Canny application; all functions related to edge detection are implemented in the DUT behavior.

Monitor: This behavior reads the processed image from the Platform and writes it to a file using the Write_pgm() function. The channel between Platform and Monitor is also a simple Queue, q2.
This clearly shows that the Gaussian_smooth function dominates the entire computation time. Gaussian_smooth has two main parts: blurring in X and blurring in Y. Each part is internally data-independent, so each can be parallelized on its own. Four instances were created for BlurX and four for BlurY. Before parallelization this part took 400 ms; now it takes 100 ms.
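The independence claim can be sketched in plain C: a horizontal blur touches each row on its own, so the rows can be split into four disjoint stripes, one per BlurX instance. This is an illustration under assumptions (a toy 16x16 image and a simple 1-2-1 kernel), not the project's actual Gaussian code.

```c
#define W 16
#define H 16

/* Horizontal blur of rows [y0, y1) with a simple 1-2-1 kernel;
 * edge pixels are clamped. Each row is processed independently. */
static void blur_x_stripe(unsigned char in[H][W],
                          unsigned char out[H][W], int y0, int y1) {
    for (int y = y0; y < y1; y++)
        for (int x = 0; x < W; x++) {
            int l = in[y][x > 0 ? x - 1 : x];
            int r = in[y][x < W - 1 ? x + 1 : x];
            out[y][x] = (unsigned char)((l + 2 * in[y][x] + r) / 4);
        }
}

/* Whole-image blur done as four independent stripes, mirroring the four
 * BlurX hardware instances; each call touches disjoint rows, so the four
 * calls could run in parallel with no synchronization between them. */
static void blur_x_4way(unsigned char in[H][W], unsigned char out[H][W]) {
    for (int i = 0; i < 4; i++)
        blur_x_stripe(in, out, i * (H / 4), (i + 1) * (H / 4));
}
```

Because the stripes write disjoint rows, running them on four hardware units gives the same result as one sequential pass, which is consistent with the observed 400 ms to 100 ms improvement.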
Architectural Refinement:
With the optimized model in hand, the next step is to decide which hardware units to allocate for the behaviors. The SCE tool offers different processors such as the ARM7TDMI, Motorola cores, the ARM9, various DSPs, and many custom hardware options. The code has no DSP requirement, so the ARM7TDMI was chosen for the main control. In selecting hardware for the Blur functions there were two options: individual custom units, or a single shared unit for BlurX and BlurY, since BlurY executes after BlurX. This is up to the designer, and the decision is a tradeoff between performance and chip cost. In this project, the decision was made in favor of individual hardware units for the BlurX and BlurY behaviors, and Data_in and Data_out were allocated virtual hardware. The resulting mapping of behaviors to processing elements is:

Behavior             Processing Element
Canny                ARM7TDMI
BlurX                Custom hardware (4 instances)
BlurY                Custom hardware (4 instances)
Data_in, Data_out    Virtual hardware
Scheduling Refinement
Once the decision on processing elements is made, the next step is to schedule the tasks that share the same hardware unit. Fortunately, in this project no scheduling is needed, because the functions are sequential and the only parallel part, the Gaussian smoothing, is given its own custom hardware units.
Network Refinement
After scheduling, the communication channels between the individual hardware units have to be defined. In this project the hardware units are one ARM core, the virtual hardware units, and eight custom hardware units. The ARM core comes with the AMBA bus architecture, so any communication to and from the ARM can be done over AMBA. The communication between the custom hardware units then needs to be finalized. We do not need a complex protocol, so a simple double-handshake protocol was selected. Each BlurX hardware unit has two ports: one for the AMBA bus (input from the ARM) and one for a double-handshake bus (output to the BlurY units). Each BlurY unit has five ports: one for the AMBA bus (output to the ARM core) and four input ports for the double-handshake buses from the BlurX units.
3. CONCLUSION
Starting with just the requirements and a sample C source code, the system has been developed step by step through all the levels of abstraction. Although the final RTL refinement required for synthesis was not done due to the short duration of the project, the results obtained are satisfactory. The individual execution times of the processing elements are:

ARM core - 501.9 ms
BlurX HW - 47.9 ms
BlurY HW - 51.3 ms
The next issue is that all the simulation results we obtained are only estimates; they are not real measurements, and there can be variations after synthesis, which may further lower the effective processing power. Is 320x240 an acceptable resolution these days? No. Most digital cameras now capture video at a minimum resolution of 1280x720, and our design cannot process even a single image at this resolution. We would therefore need to parallelize our approach much more aggressively, as in a graphics card, using as many cores as possible to process the image. It is possible to address these issues through proper selection of hardware units and optimization of the code, with some tradeoff in accuracy of results, and make the system work for real-time video processing.
4. REFERENCES

1. ftp://figment.csee.usf.edu/pub/Edge_Comparison/source_code/canny.src
2. http://en.wikipedia.org/wiki/Canny_edge_detector
3. http://www.cecs.uci.edu/~doemer/publications/SpecC_LRM_20.pdf
4. http://www.cecs.uci.edu/~cad/sce.html