
International Journal of Advanced Computer Science, Vol. 2, No. 1, Pp. 11-17, Jan. 2012.

An Abstract Machine-based Dynamic Translation Technique in Java Processors


Haichen Wang & Xiangmo Zhao
Abstract: Binary translation is a migration technique that allows software to run on other machines while achieving near-native code performance. This paper proposes a hardware abstract machine (HAM)-based dynamic translation technique for implementing Java processors. The HAM exploits a mock-execution method to analyze and identify dependences among Java instructions and dynamically translate Java bytecodes into tag-based RISC-like instructions, which are then executed on a RISC engine. Stack folding is also added to the technique to further reduce load/store operations. Using the proposed HAM-based technique, we realized a Java ILP processor and extended it to the design of a multithreading Java processor. The paper presents a performance evaluation of the Java ILP processor and discusses some related issues.

Manuscript
Received: 19 Sep. 2011; Revised: 1 Nov. 2011; Accepted: 26 Nov. 2011; Published: 15 Feb. 2012

Keywords
Binary Translation, Abstract Machine, Java Processor, Multithreading

1. Introduction
Dynamic binary translation (DBT) is the process of translating and optimizing, at run time, executable code built for one processor so that it runs on another. From the late 1980s, processor manufacturers developed binary translation techniques to provide a migration path from legacy CISC machines to newer RISC machines. DBT has since become a versatile tool that addresses a wide range of system challenges while maintaining backward compatibility for existing software. DBT facilitates the deployment of new tools by eliminating the need to re-compile or modify existing software. It has been successfully used in commercial and research environments to support virtualization, cross-platform binary compatibility, debuggers, and performance optimization. DBT can be realized in software by modifying a running program's binary instructions at run time, which requires the support of a software dynamic translator. Most software dynamic translation systems exploit a similar fundamental approach: adding a software layer between the program and the CPU to virtualize aspects of the host running environment [1]. The software layer acts as a virtual machine that dynamically examines and translates instructions
This work was supported by the Program for Changjiang Scholars and Innovative Research Team in University (IRT0951). Hai-chen Wang, Xiang-mo Zhao, Chang'an University, Xi'an, China; {h.c.wang, x.m.zhao}@chd.edu.cn

before they are run on the host machine. As a case study, this paper applies a hardware abstract machine (HAM) approach to implement the DBT function. In this approach, the HAM works as a decoder that dynamically identifies independent instructions and translates Java bytecodes into tag-based RISC-like instruction formats. Combined with stack folding, the processor can further reduce load/store operations. With the proposed approach, we designed a Java ILP processor and extended it to a multithreading Java processor architecture, which further demonstrates the applicability of the proposed technique. The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 explains the concept of abstract machine-based dynamic binary translation (ADBT) and illustrates an implementation of the Java HAM. Section 4 presents a Java ILP processor architecture that exploits the proposed ADBT technique, and Section 5 gives the performance evaluation. On top of the Java ILP processor, the architecture of a multithreading Java processor is proposed in Section 6. Conclusions and future work are presented in Section 7.

2. Related Work
In the late 1980s, companies attempted to improve on existing emulation techniques and began using binary translation to achieve native code performance. In recent years, binary translation has moved toward hybrid translators, which are proving to be extremely successful. Mixing translation with emulation and runtime profiling produced some of the leading performers in hybrid translation, e.g. Digital's FX!32 [2], which emulates a program initially and translates it in the background using information gathered during profiling. Many optimization techniques have been used in dynamic translators. Runtime optimizations in dynamic compilers can provide 0.9x to 2x the performance of statically compiled programs. Such techniques have been used in just-in-time (JIT) compilers for Java; JIT compilers from Sun [3] and Intel [4] can dynamically generate native machine code at runtime. Software dynamic translation gives system designers great flexibility in controlling and modifying a program's execution. For example, Transmeta's Code Morphing technique allows unmodified Intel IA-32 binaries to run on the low-power VLIW Crusoe processor [5]. The UQDBT dynamically translates Intel IA-32 binaries to run on SPARC-based processors [6], and the IBM DAISY [7]


[Figure 1 shows the ADBT pipeline: Inst. Input, Inst. Tagging (Decode)/DBT, Inst. Schedule, Inst. Exec., Write-back]

uses software dynamic translation to run newly generated code on novel VLIW architectures with accompanying optimization. Both the Transmeta Crusoe processor [5] and DAISY use a co-designed VM with an internal VLIW-style instruction set composed of RISC-like operations. DAISY combines JIT with native compilation techniques to execute Java efficiently. Thread-level parallelism (TLP) can be exploited in Java applications, especially by a Java multithreading processor that extracts coarse-grained parallelism. Sun's MAJC [8] processor adopts a vertical multithreading technique to exploit TLP, but MAJC needs a JIT compiler to convert bytecodes to native code. The Java Multi-Threaded Processor (JMTP) is a single-chip CPU architecture that couples an off-the-shelf general-purpose processor core with an array of Java Thread Processors (JTPs) [9] to discover TLP. PicoJava [10], a stack processor that executes Java directly, was designed by Sun. This paper proposes an all-hardware-supported ADBT approach that directly translates input binaries from different machines, whether RISC or CISC, into tag-based RISC instruction formats and executes the converted instructions on a RISC engine. This approach enables us to exploit existing RISC processor cores to design new processors effectively, which is the major contribution of the paper. In the following, we explain the proposed ADBT approach.

Fig. 1 The concept of ABM-based DBT approach (ADBT)

B. Java Stack Abstract Machine
DBT involves dynamically translating an existing binary and replacing instructions as needed. In our Java ILP processor, DBT is implemented in the tagging unit (TU), which realizes the function of the HAM. The HAM first performs a tag-based mock execution, adding tags to each instruction, and then translates bytecodes into tag-based RISC-like instructions. During the mock execution, tags replace operands in the instructions. Second, the processor performs Java stack instruction folding to reduce stack load/store operations. Third, folded tag-based instructions that are independent are scheduled into issue slots for execution on a RISC engine. Here we introduce the first two steps; the schedule and execution procedures depend on the execution engine used, whether superscalar or VLIW.
C. How a HAM works
In stack machines, operands on the execution stack are erased once they are used by an operator. A load/ALU instruction may need only one or two operands. Each result is uniquely identified by a tag; once the result is consumed, its tag is immediately discarded rather than being kept on the stack. We use the following Java bytecode snippet to illustrate how a bytecode instruction stream is tagged: iload_1, iload_2, imul, iload_4, iload_5, iadd, iadd, istore_3. Table 1 shows how the HAM works and changes tags on the OTS. Each stack instruction is assigned a unique tag when it enters the Tag Renaming Unit (TRU). The tagged instructions are dispatched, and after an instruction is executed in a load/store or ALU unit, the generated result is delivered to the later instruction that consumes it. In this manner, the consumer-producer relation between an instruction pair is identified by using tags.
The HAM, as in the case of RISC processors, performs a mock execution of the program using tags rather than values; this "execution" is sequential, since only one operation can use the stack at any moment, but fast. In DBT-supported processors, all code execution is controlled by the DBT [13], and only translated code blocks are executed. A cache is used to store translated basic blocks or traces, typically to reduce runtime overhead. The HAM includes the TRU, the OTS, and a stack cache. The TRU is the key component, responsible for adding tags to instructions and storing tagged instructions. The OTS emulates executions of the stack. The stack cache only stores temporary results and data. The instructions of each basic block are stored in the TRU, which is similar to an instruction cache in DBT, and wait to be translated into tag-based instructions. Tag-based mock execution is simple, so
International Journal Publishers Group (IJPG)

3. ADBT Approach
A. The concept of ADBT
Abstract machines (ABM) are mostly used for compilation [11], but we propose an all-hardware ABM approach. The ABM-based DBT technique (ADBT) is shown in Figure 1. ADBT is pipeline-based and caters to many existing computer architectures. ADBT exploits a hardware ABM to dynamically translate any binary code into tag-based instruction formats for instruction-level parallelism (ILP) execution after data conflicts are resolved. Since the tagged instructions are similar to a RISC instruction set architecture, they can run on RISC machines by mapping tags to values. In a pipelined processor, the decoder logic typically identifies code dependences in order to schedule independent instructions into the issue window for the next execution cycle [12]. In a DBT-based processor, the DBT performs dynamic translation and execution of application binaries, taking over the function of the decoder. In ADBT, we replace the decoder with a hardware abstract machine (HAM) that performs a mock execution to achieve dynamic binary translation. Unlike a real processor, which inputs real values and produces output values, the HAM performs no actual instruction execution. The HAM processes instruction streams sequentially, but much faster, so it can keep up with parallel execution requirements. In the following, we implement a Java ILP processor as an example to demonstrate how to apply the ADBT approach.

H.C.Wang et al.: An Abstract Machine-based Dynamic Translation Technique In Java Processors.


the HAM can work fast enough to meet the requirements of the execution engine. Tags represent the results of load/ALU instructions; the HAM pushes them onto the OTS, and consumer instructions remove tags from the stack during instruction tagging. The presence of a common tag thus establishes a dataflow [14] relation between the producer and consumer instructions. The actual values are delivered from producer instructions to consumer instructions via a stack cache (register file), as in a RISC processor.
TABLE 1 A SAMPLE OF THE HAM WORKING PROCESS IN A JAVA PROCESSOR

  #   Instruction   Tag (TRU)   OTS after the instruction
  1   iload_1       T1          T1
  2   iload_2       T2          T1 T2
  3   imul          T3          T3
  4   iload_4       T4          T3 T4
  5   iload_5       T5          T3 T4 T5
  6   iadd          T6          T3 T6
  7   iadd          T7          T7
  8   istore_3      T8          (empty)
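The tagging process of Table 1 can be sketched in software. The following is a minimal illustrative model, not the authors' hardware: each instruction gets a fresh tag, loads push their tag onto the operand tag stack (OTS), ALU and store operations pop the tags they consume, and each popped tag records one producer-to-consumer dataflow edge. The function and variable names are our own.

```python
# Sketch of the HAM's tag-based mock execution: operands are replaced by
# tags on an operand tag stack (OTS), and popping a tag records a
# producer -> consumer dataflow edge. Values never appear; only tags do.

def mock_execute(bytecodes):
    ots = []        # operand tag stack: holds tags, never values
    edges = []      # (producer_tag, consumer_tag) dataflow relations
    trace = []      # (tag, opcode, consumed_tags): tag-based instructions
    next_tag = 1
    for op in bytecodes:
        tag = f"T{next_tag}"
        next_tag += 1
        if op.startswith("iload"):
            consumed = []
            ots.append(tag)                 # a load produces one stack value
        elif op in ("iadd", "imul"):
            consumed = [ots.pop(), ots.pop()]
            ots.append(tag)                 # ALU op consumes two, produces one
        elif op.startswith("istore"):
            consumed = [ots.pop()]          # a store consumes one, produces none
        else:
            consumed = []
        edges += [(p, tag) for p in consumed]
        trace.append((tag, op, consumed))
    return trace, edges

# the bytecode snippet used in Table 1
trace, edges = mock_execute(
    ["iload_1", "iload_2", "imul", "iload_4",
     "iload_5", "iadd", "iadd", "istore_3"])
for tag, op, consumed in trace:
    print(tag, op, consumed)
```

Running this reproduces the tag assignments T1..T8 of Table 1, and the recorded edges (e.g. T3 and T6 feeding T7, T7 feeding T8) are exactly the consumer-producer relations that the real TRU identifies.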

D. Java stack folding
With the aid of the stack folding [10] technique, the processor can effectively reduce load/store operations [15] after the HAM dynamically translates bytecodes into tag-based RISC-like instructions. With mock execution, the HAM can analyze and identify data dependences among instructions, which simplifies the implementation of Java stack folding to some extent and enables bytecode tagging and stack folding to execute in parallel. In this paper, we use the same stack folding approach as in [16-19] to complete the folding functions. After stack folding, the bytecode instructions are produced as RISC-like instructions.
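The effect of stack folding can be illustrated with a small sketch. The single pattern below (load, load, ALU op, store collapsing into one three-address instruction) is a simplified assumption for illustration; it is not the exact rule set of the folding logic in [16-19].

```python
# Illustrative sketch of Java stack folding: a contiguous
# load/load/ALU/store bytecode pattern is folded into one RISC-like
# three-address instruction, eliminating the intermediate stack traffic.
# The pattern set and register naming here are simplified assumptions.

FOLDABLE_ALU = {"iadd": "add", "isub": "sub", "imul": "mul"}

def fold(bytecodes):
    folded, i = [], 0
    while i < len(bytecodes):
        window = bytecodes[i:i + 4]
        # pattern: load, load, ALU op, store  ->  one folded instruction
        if (len(window) == 4
                and window[0].startswith("iload")
                and window[1].startswith("iload")
                and window[2] in FOLDABLE_ALU
                and window[3].startswith("istore")):
            src1 = window[0].split("_")[1]
            src2 = window[1].split("_")[1]
            dst = window[3].split("_")[1]
            folded.append(f"{FOLDABLE_ALU[window[2]]} r{dst}, r{src1}, r{src2}")
            i += 4
        else:
            folded.append(bytecodes[i])     # no foldable pattern: pass through
            i += 1
    return folded

print(fold(["iload_1", "iload_2", "iadd", "istore_3"]))
# -> ['add r3, r1, r2']  (four stack instructions become one)
```

This shows why the technique removes load/store operations: the two pushes and the pop/store of the stack form disappear entirely in the folded form.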

VLIW instructions. The generation of instruction bundles depends on data availability, and the working process is similar to that in DAISY [7]. Although the instruction bundles are issued in order, at the time they are bundled they may not be in program order. The TMU manages the results generated by producer instructions; the results are still stored in the register file, and only the referencing tags are modified. This shortens the data path for data forwarding and keeps the TMU logic simple. A stack cache is provided in the processor to eliminate inefficiencies typically associated with stack-based instruction processing and to store temporary results. The TMU, built with a reorder buffer, holds the control-related information for each entry and uses a mapping table to manage tags and registers. In the TMU, data forwarding [12] only changes the reference of the tags associated with the data. With instruction tagging, WAR (write-after-read) and WAW (write-after-write) data dependences are removed, because the two instructions will access different registers. Thus, the processor only needs to handle the true data dependence, read-after-write (RAW), which requires the issue logic to schedule the later instruction containing the read operation only after the previous write completes and the result is acquired through a tag-value matching (TVM) window. The TVM accommodates out-of-order execution. A free tag list (FTL) is maintained for allocation and reuse of tags; once a tag has been used, or its associated value has been read, it is no longer needed and can be returned to the FTL for reuse. The VLIW instructions are issued in order to the VLIW execution engine, which simplifies the instruction issue logic. The schematic block diagram is shown in Figure 4.
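The in-order bundling driven by RAW dependences can be sketched as follows. This is a simplified model of the issue logic, not the real TMU/TVM hardware: an instruction enters the current bundle only if all tags it consumes were produced in an earlier bundle, and a bundle closes at the first stalled instruction or at the issue width.

```python
# Sketch of RAW-aware in-order VLIW bundling: an instruction is ready
# only when every tag it consumes has been produced in an earlier
# bundle. Width 4 matches the 4-issue configuration discussed in the
# evaluation; the TVM window and FTL details are abstracted away.

def bundle(insts, width=4):
    """insts: list of (tag, consumed_tags); returns a list of bundles."""
    done, bundles = set(), []
    i = 0
    while i < len(insts):
        group = []
        while i < len(insts) and len(group) < width:
            tag, consumed = insts[i]
            if all(c in done for c in consumed):
                group.append(tag)
                i += 1
            else:
                break          # in-order issue: stop at the first RAW stall
        done.update(group)     # results become visible after the bundle
        bundles.append(group)
    return bundles

# the tagged example stream from Table 1: (tag, tags it consumes)
stream = [("T1", []), ("T2", []), ("T3", ["T1", "T2"]),
          ("T4", []), ("T5", []), ("T6", ["T4", "T5"]),
          ("T7", ["T3", "T6"]), ("T8", ["T7"])]
print(bundle(stream))
# -> [['T1', 'T2'], ['T3', 'T4', 'T5'], ['T6'], ['T7'], ['T8']]
```

Note how WAR/WAW hazards never appear in the model at all: every instruction writes a fresh tag, so only the RAW edges constrain bundle formation, exactly as the tagging argument above claims.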

5. Performance Evaluation
A. Experimental methodology
We carried out a simulation study of the proposed Java ILP processor, using a trace-driven simulation technique to analyze its performance. The bytecode traces are extracted from the execution of the benchmark programs, and the traces are scheduled and run based on pipeline stages. We assumed a TMU with 64 entries. We ran the Linpack [19] and SPECjvm98 [18] programs. The SPECjvm98 programs run with the s1 data set, and mtrt runs in a single thread. Instruction scheduling was performed within a basic block, and instruction prefetching is provided. A static branch predictor is used with a penalty of 3 cycles, and all caches are assumed to have a 100% hit rate.
B. Exploitable ILP
To investigate the exploitable ILP, we relaxed the resource constraints and set the issue rate at four. That is, within an instruction issue window, the Java processor could execute at most four instructions at the same time if there are no dependences. We executed different

4. A Java ILP Processor


In the following, we present how to apply the proposed ADBT technique to implement a Java ILP processor. Like a common pipelined processor, the proposed Java processor has a six-stage pipeline: instruction fetch, decode/tagging, stack folding, instruction issue and schedule, execution, and commit. The HAM and DBT are composed of the TRU, the Tag Matching Unit (TMU), the OTS, and the stack folding logic (SFL). After instruction tagging, the TRU converts stack instructions into tag-based RISC-like instructions with the support of the SFL function unit. The TMU tracks the ready/not-ready status of the operands of a tag-based instruction according to the value of its mapped register in the stack cache (register file). If the operands of an instruction are ready, the instruction is added to the ready-instruction queue and then bundled into



benchmarks on the simulator and obtained the following results. Table 2 gives the distribution of instructions issued in parallel. The in-order multi-issue Java processor in [20] shows only a small number of three-issue groups and no four-issue instruction groups. In contrast, our results show that the percentage of three-issue instruction groups ranges from 1.55% to 10.78%, and the percentage of four-issue instruction groups ranges from 0.86% to 33.43%.
TABLE 2 DISTRIBUTION OF INSTRUCTIONS EXECUTED IN PARALLEL WITH TAGGING SCHEME

C. Reduced Execution Instructions

[Bar chart: normalized reduced load/store instruction counts for compress, db, jack, javac, jess, mpegaudio, mtrt, and Linpack; series: Total Insts. Exec Ratio% and Reduced Ratio%]

Instructions executed in parallel (percentage):

  Benchmark   1-issue   2-issue   3-issue   4-issue
  Comp.        67.37     15.43     10.78      6.42
  Db           79.97     14.98      3.78      1.27
  Jack         79.54     14.22      3.89      2.35
  Javac        72.85     21.87      4.24      1.04
  Jess         81.51     13.47      3.26      1.76
  Mpegau.      43.26     16.53      6.78     33.43
  Mtrt         87.92      9.67      1.55      0.86

Fig. 2 The normalized reduced execution instruction counts

TABLE 3 DISTRIBUTION OF INSTRUCTIONS EXECUTED IN PARALLEL WITH UNLIMITED RESOURCES

Tagged instructions executed in parallel (percentage):

  Benchmark     1      2      3      4      5      6      7      8
  Comp.        62.6   13.7   17.1   3.4    1.8    0.3    0.04   1.02
  Db           79.1   15.0    4.6   1.04   0.12   0.09   0.03   0.0
  Jack         75.8   17.5   4.18   1.96   0.12   0.27   0.0    0.11
  Javac        71.2   21.3   6.35   0.78   0.02   0.27   0.03   0.03
  Jess         80.9   13.6   3.73   1.1    0.2    0.27   0.16   0.0
  Mpeg.        45.1   18.4   5.9    4.16   5.75   9.62   2.4    8.64
  Mtrt         85.6   11.8   1.62   0.77   0.02   0.01   0.03   0.0

For our technique, we combine Java stack folding with ADBT. With stack folding, the processor can largely reduce load/store instructions and improve performance. We calculated the total number of Java instructions that need to be executed by the benchmark programs on the stack-based processor architecture, and counted the total instructions executed after applying the proposed ADBT technique. The total reduction in executed instructions ranges from 62% to 71%; mpegaudio reaches a figure of around 80%. This is because the program contains some large loops, and the loop blocks have many load/store instructions that can be folded and reduced. These characteristics of the mpegaudio program result in a reduced total number of executed instructions and improved performance.
D. ILP speedup gains
To compute the ILP gain, we assumed unit latency for all bytecodes. Figure 3 presents the ILP speedup results for the proposed ADBT-based multi-issue Java processor (Multi-Issue) over the base single-issue Java stack processor (Base). The ILP gain of the proposed Java processor is from 78% to 173% higher than that of the base Java processor. The result shows that the processor is able to significantly increase the average ILP.

To investigate the maximum exploitable ILP using basic-block scheduling, we modified the simulator and relaxed some constraints: no resource limitations and an instruction issue rate of 8. With these settings, we re-scheduled the benchmark programs to obtain the new results in Table 3. We can see that most instructions within a basic block are issued in groups of four or fewer. Although we increased the issue rate to 8, due to dependency limitations only a small number of instructions are issued in groups larger than four. For most benchmark programs, except compress and mpegaudio, less than 1% of instructions are issued in groups larger than four; for compress the percentage is 3.16%, and for mpegaudio it is 26.41%. These two benchmarks have bigger basic blocks than the others. Comparing with the results in Table 2, we find that, due to the ILP limitations within a basic block, adding more resources to the processor yields only a very small ILP improvement. Thus, considering hardware complexity and pipeline efficiency, we prefer a 4-issue Java processor.
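The issue-distribution percentages reported in the tables above can be derived mechanically from a scheduled trace. The following sketch (with illustrative bundle data, not the benchmark traces) counts, for each group width n, the fraction of all instructions that were issued in an n-wide group:

```python
# Sketch of deriving an issue-size distribution (as in Tables 2 and 3)
# from a list of issued bundles: for each bundle width n, accumulate the
# number of instructions issued in n-wide groups, then normalize.
from collections import Counter

def issue_distribution(bundles):
    total = sum(len(b) for b in bundles)
    counts = Counter()
    for b in bundles:
        counts[len(b)] += len(b)   # instructions issued in n-wide groups
    return {n: 100.0 * c / total for n, c in sorted(counts.items())}

# illustrative schedule: 8 instructions in bundles of width 2, 3, 1, 1, 1
bundles = [["T1", "T2"], ["T3", "T4", "T5"], ["T6"], ["T7"], ["T8"]]
print(issue_distribution(bundles))
# -> {1: 37.5, 2: 25.0, 3: 37.5}
```

Per-benchmark tables like Table 2 are then just this computation applied to each benchmark's full scheduled trace.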

6. A Multithreading Java Processor


Simultaneous multithreading (SMT) is a variation on multithreading that combines the hardware features of wide-issue superscalars with multithreaded processors [21]. It can exploit both thread-level parallelism (TLP) and ILP for greater instruction throughput and speedups. In this case, TLP and ILP are exploited simultaneously, with multiple threads using the issue slots within a single cycle.




[Bar chart: normalized speedup (0.6 to 2.8) of Base vs. Multi-Issue for compress, db, jack, javac, jess, mpegaudio, mtrt, and Linpack]

Fig. 3 ILP speedup gain of the Java processor with ADBT

[Block diagram: instruction fetch logic with method cache and branch prediction logic; decoding/tagging unit (TU) with trace buffer; Operand Tag Stack (OTS); Tag Matching Unit (TMU); Stack Folding Logic (SFL); ready-instruction queue and multi-issue logic; stack cache (register file); load/store buffer with hazard detection; VLIW execution engine; data cache]

Fig. 4 The proposed Java ILP processor architecture

[Block diagram: multi-threaded programs in the I-Cache are fetched by per-thread fetch units; a parallel tagging array of TRUs tags each thread's instructions; each thread has a private register file (PRF) and a global register file (GRF) is shared; a VLIW bundler issues instructions to the execution units (XFUs); a load/store path and write-back buffer complete the datapath]

Fig. 5 The multithreading Java processor architecture




The Java ILP processor in this paper can be extended to support SMT to achieve higher speedup and throughput. We propose a multithreading Java processor (MJP) composed of four threading units. Each unit owns a fetch unit, a TRU, a private register file (PRF), a program counter (PC), and a private page table. In the MJP, multithreaded programs reside in the I-Cache, and multiple fetch units can fetch instructions from multiple threads simultaneously. The bytecode programs from different threads are tagged by different TRUs, which handle instructions in the same way as in the Java ILP processor. After tag-based RISC-like instructions are ready, they are bundled into VLIW instructions to be executed in parallel. Since the tagged instructions come from different threads, TLP is achieved accordingly. An instruction scheduler needs to be designed to schedule the ready instructions from the different TRUs. Tagged ready instructions can be issued without considering inter-thread data dependences, because dependences within a thread are handled by each TRU. Designed in this way, multiple threads in the MJP can share the common execution engine, so high throughput can be achieved. The schematic diagram is shown in Figure 5. The MJP owns a hierarchical register file architecture. A global register file (GRF) is used for communication between threads and for holding global values that have many consumers. The PRF of each thread is used for storing temporary data within the thread and for running a single thread locally. The memory can be shared by all threads through the virtual memory mechanisms, which already support multiprogramming. Since multiple threads in the MJP can share common objects or data via virtual memory, we consider a sequential consistency or release consistency memory model for programming in order to guarantee the correctness of program execution and to respect the Java Memory Model (JMM) [22].
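The MJP's sharing of one VLIW bundler by several independently tagged streams can be sketched as follows. The round-robin policy is our own illustrative assumption, since the paper leaves the scheduler policy open; intra-thread dependences are assumed already resolved per TRU, as described above.

```python
# Sketch of the MJP idea: independently tagged instruction streams from
# several TRUs share one VLIW bundler. Intra-thread RAW dependences are
# already handled per TRU, so the shared bundler can fill a wide slot
# from whichever threads have ready instructions. Round-robin, one slot
# per thread per cycle, is an assumed policy for illustration.
from collections import deque

def bundle_threads(thread_queues, width=4):
    queues = [deque(q) for q in thread_queues]
    bundles = []
    while any(queues):
        group = []
        for q in queues:                  # one slot per thread, round-robin
            if q and len(group) < width:
                group.append(q.popleft())
        bundles.append(group)
    return bundles

# two threads' ready tagged instructions ("T<n>.<thread>")
t0 = ["T1.0", "T2.0", "T3.0"]
t1 = ["T1.1", "T2.1"]
print(bundle_threads([t0, t1]))
# -> [['T1.0', 'T1.1'], ['T2.0', 'T2.1'], ['T3.0']]
```

The point of the sketch is that TLP fills issue slots that a single thread's RAW chains would leave empty, which is exactly the SMT throughput argument made above.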

There are many challenging issues in implementing an x86 ISA. To execute x86 at high performance, Intel's Pentium [23] dynamically translates x86 instructions into simple, fixed-length instructions that Intel calls Micro-Operations, or μops. These μops are then executed in a decoupled superscalar core capable of register renaming and out-of-order execution. Like RISC instructions, μops use a load/store model: x86 instructions that operate on memory must be broken into a load μop, an ALU μop, and possibly a store μop [24]. Like a RISC instruction, a μop uses a regular structure to encode an operation, two sources, and a destination. When implementing a CISC instruction set architecture, e.g. Intel x86, on a RISC processor, the processor will generally split, or "crack", an x86 instruction into a number of RISC-like micro-operations [24] and then execute them. The HAM can perform work similar to that in [25-26] to identify dependences among micro-instructions and put the independent ones into the same slot for scheduling. After x86 instructions are converted into μops, the HAM can convert the μops into tag-based RISC-like instructions. Here, register renaming logic needs to be combined with the Register Alias Table (RAT) used in the Pentium [24]. After tagging the μops, the HAM can identify all independent instruction groups and schedule each independent group into an issue group for dispatch and execution. The detailed implementation of an x86 multithreading processor with the ADBT technique will be our future work.
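The cracking step described above can be sketched in a few lines. The mnemonics, tuple encoding, and temporary-register naming below are illustrative assumptions, not Intel's actual μop encoding: a register-memory add becomes a load uop feeding an ALU uop, after which every uop obeys a load/store model and can be tagged like any RISC-like instruction.

```python
# Hedged sketch of "cracking" a CISC memory-operand instruction into
# RISC-like micro-operations: a register-memory add becomes a load uop
# plus an ALU uop operating only on registers. The encoding is a toy
# (op, dst, src) tuple, not a real x86 or uop format.

def crack(inst):
    op, dst, src = inst                  # e.g. ("add", "eax", "[ebx]")
    if src.startswith("["):              # memory operand -> split it
        return [("load", "tmp0", src),   # load uop fetches the operand
                (op, dst, "tmp0")]       # ALU uop works on registers only
    return [inst]                        # register form needs no cracking

print(crack(("add", "eax", "[ebx]")))
# -> [('load', 'tmp0', '[ebx]'), ('add', 'eax', 'tmp0')]
```

Once cracked, the load uop and ALU uop form an ordinary producer-consumer pair, so the same tag-based mock execution used for Java bytecodes applies unchanged.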

Acknowledgment
The project was supported by the Program for Changjiang Scholars and Innovative Research Team in University (IRT0951), China, the Special Fund for Basic Scientific Research of Central Colleges, Chang'an University (CHD2009JC125, CHD2011TD009), and China NSF Grant (50978030).

7. Conclusion and Discussion


In this paper, we proposed an ADBT approach that dynamically translates binaries into tag-based RISC-like instruction formats using a mock-execution technique. The tag-based RISC instructions can easily be executed on RISC engines. ADBT, acting as a decoder, can identify all instruction dependences and generate producer-consumer relationships with tags in order to discover ILP within programs. As an example, we explained in detail how to implement ADBT in a Java ILP processor and extended it to a multithreading Java processor. We presented the MJP's architecture and discussed some design issues in applying the ADBT technique. Due to space limitations, we did not give results for the MJP's performance evaluation, which will be our future work. With instruction mock execution in the HAM, the processor can discover all independent instruction groups and put independent instructions into issue slots to implement ILP execution. The technique can also be applied to x86 CISC processors.

References
[1] R.L. Sites, A. Chernoff, M.B. Kirk, M.P. Marks, and S.G. Robinson, "Binary translation," Communications of the ACM, vol. 36, no. 2, pp. 69-81, 1993.
[2] R.J. Hookway and M.A. Herdeg, "Digital FX!32: Combining emulation and binary translation," Digital Technical Journal, vol. 9, no. 1, pp. 3-12, 1997.
[3] Sun, Java JIT compiler, http://www.sun.com/solaris/jit.
[4] A.-R. Adl-Tabatabai et al., "Fast, effective code generation in a just-in-time Java compiler," Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, Montreal, Canada, pp. 280-290, 1998.
[5] D.R. Ditzel, "Transmeta's Crusoe: Cool chips for mobile computing," Hot Chips 12, Stanford University, Stanford, California, IEEE Press, August 2000.
[6] D. Ung and C. Cifuentes, "Machine-adaptable dynamic binary translation," ACM SIGPLAN Notices, vol. 35, no. 7, pp. 41-51, 2000.
[7] K. Ebcioglu and E.R. Altman, "DAISY: Dynamic compilation for 100% architectural compatibility," ACM SIGARCH Computer Architecture News, vol. 25, no. 2, pp. 26-37, 1997.
[8] M. Tremblay et al., "The MAJC architecture: A synthesis of parallelism and scalability," IEEE Micro, vol. 20, no. 6, pp. 12-25, 2000.
[9] R. Helaihel and K. Olukotun, "JMTP: An architecture for exploiting concurrency in embedded Java applications with real-time considerations," International Conference on Computer-Aided Design, pp. 551-557, 1999.
[10] J.M. O'Connor and M. Tremblay, "PicoJava-I: The Java virtual machine in hardware," IEEE Micro, vol. 17, no. 2, pp. 45-53, 1997.
[11] S. Diehl, P. Hartel, and P. Sestoft, "Abstract machines for programming language implementation," Future Generation Computer Systems, vol. 16, pp. 739-751, 2000.
[12] J.E. Smith and G.S. Sohi, "The microarchitecture of superscalar processors," Proceedings of the IEEE, vol. 83, pp. 1609-1624, 1995.
[13] K. Scott, N. Kumar, S. Velusamy, et al., "Retargetable and reconfigurable software dynamic translation," Proceedings of the International Symposium on Code Generation and Optimization, San Francisco, California, pp. 36-47, 2003.
[14] B. Lee and A.R. Hurson, "Dataflow architectures and multithreading," IEEE Computer, vol. 27, no. 8, pp. 27-39, 1994.
[15] T. Lindholm and F. Yellin, The Java Virtual Machine Specification, Addison-Wesley, Reading, MA, 1996.
[16] H.C. Wang and C.K. Yuen, "Exploiting dataflow to extract Java instruction level parallelism on a tag-based multi-issue semi in-order (TMSI) processor," IEEE International Parallel & Distributed Processing Symposium, Rhodes, Greece, 2006.
[17] H.C. Wang and C.K. Yuen, "Exploiting an abstract-machine-based framework in the design of a Java ILP processor," Journal of Systems Architecture, vol. 55, no. 1, pp. 53-60, 2009.
[18] SPEC JVM98 Benchmarks, http://www.spec.org/osg/jvm98/.
[19] Linpack, http://www.netlib.org/linpack.
[20] R. Radhakrishnan, D. Talla, and L.K. John, "Allowing for ILP in an embedded Java processor," Proceedings of the 27th International Symposium on Computer Architecture, pp. 294-305, 2000.
[21] J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen, "Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading," ACM Transactions on Computer Systems, vol. 15, no. 3, pp. 322-354, 1997.
[22] J. Manson, W. Pugh, and S.V. Adve, "The Java memory model," Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, California, USA, 2005.
[23] G. Hinton, D. Sager, M. Upton, et al., "The microarchitecture of the Pentium 4 processor," Intel Technology Journal, Q1 2001.
[24] L. Gwennap, "Intel's P6 uses decoupled superscalar design," Microprocessor Report, pp. 9-15, 1995.
[25] S. Hu, I. Kim, M.H. Lipasti, and J.E. Smith, "An approach for implementing efficient superscalar CISC processors," Proceedings of the 12th International Symposium on High-Performance Computer Architecture (HPCA-12), Austin, TX, USA, 2006.
[26] S. Hu and J.E. Smith, "Using dynamic binary translation to fuse dependent instructions," Proceedings of the International Symposium on Code Generation and Optimization, Palo Alto, California, pp. 213-220, 2004.

