Anda di halaman 1dari 3

The Instruction Frame Pointer (IPF): points to the address of the rst instruction of the thread.

The Data Frame Pointer List (DFPL): a list of pointers to the data inputs assigned for the thread.

Hardware performance counter

In computers, hardware performance counters, or hardware counters are a set of specialpurpose registers built into modern microprocessors to store the counts of hardware-related activities within computer systems. Advanced users often rely on those counters to conduct lowlevel performance analysis or tuning. The number of available hardware counters in a processor is limited while each cpu model might have a lot of different events that a developer might like to measure. Each counter can be programmed with the index of an event type to be monitored, like a L1 cache miss or a branch mispredict.

---------------------------------------------------------------------------------------------------------------------------------------Compared to software profilers, hardware counters provide low-overhead access to a wealth of detailed performance information related to CPU's functional units, caches and main memory etc. Another benefit of using them is that no source code modifications are needed in general. However, the types and meanings of hardware counters vary from one kind of architecture to another due to the variation in hardware organizations. There can be difficulties correlating the low level performance metrics back to source code. The limited number of registers to store the counters often force users to conduct multiple measurements to collect all desired performance metrics. Moreover, modern superscalar processors schedule and execute multiple instructions at one time. These "in-flight" instructions can retire at any time, depending on memory access, hits in cache, stalls in the pipeline and many other factors. This can cause performance counter events to be attributed to the wrong instructions, making precise performance analysis difficult or impossible. Newer processors have introduced methods to mitigate some of these drawbacks. For example, the [2] AMD Opteron processors have implemented in 2007 a technique known as Instruction Based Sampling (or IBS). AMD's implementation of IBS provides hardware counters for both fetch sampling (the front of the superscalar pipeline) and op sampling (the back of the pipeline). This results in discrete performance data associating retired instructions with the "parent" AMD64 instruction.

Hardware Performance Counter Basics

The first time I encountered the term hardware performance counters, it was in the context of having access to multi-million-dollar supercomputers where every CPU cycle is critical and research teams spend substantial amounts of time tweaking their codes in order to extract maximum performance from the system. Often, software is tailored explicitly for each type of computer on which it is to be run. Research teams sometimes pore over the numbers generated by these performance counters to measure the exact performance of their applications and to ferret out places where they might gain additional speedup. Needless to say, this all sounded exotic to me. But the purpose and function of the counters turned out to be simple: they are extra logic added to the CPU that track low-level operations or events within the processor accurately and with minimal overhead. For example, even if you're not an expert in computer architecture, you probably already know that nearly all processors in common use are cachebased machines. Caches, which offer much higher-speed access to data and instructions than what is possible with main memory, are based on the principles of temporal and spatial locality. Put another way, cache designs hope to take advantage of many applications' tendency to reuse blocks of data not long after first use (temporal locality) and to also access data items near those already used (spatial locality). If your application follows these patterns, you have a much greater chance of achieving high performance on a cache-based processor. If not, your performance may be disappointing. If you're interested in improving a poorly performing application, your next task is to try to determine why the processor is stalling instead of completing useful work. This is where performance counters may help. It takes a little research to learn which performance counters are available to you on a particular processor. Each CPU has a different set of available performance counters, usually with different names. In fact, different models in the same processor family can differ substantially in the specific performance counters available. In general, the counters measure similar types of things. For example, they can record the absolute number of cache misses, the number of instructions issued, the number of floating point instructions executed and the number of vector, such as SSE or MMX, instructions. The best reference for available counters on your processor are the vendor's technical reference on the processor, often available on the Web. Another complication is kernel-level support is needed to access the performance counters. Although the Itanium (IA-64) kernel provides this support through the perfmon driver in the official kernel (authored by

Stephane Eranian of HP Research), the standard x86 Linux tree currently does not. Fortunately, efforts are underway to address these issues. The first is the development of a performance monitoring driver for the x86 kernel called perfctr. This is a very stable kernel patch developed by Mikael Pettersson of Uppsala University in Sweden. The perfctr kernel patch is becoming more widely adopted by the community and continually is improved and maintained. The second is an effort from the Innovative Computing Laboratory at the University of Tennessee-Knoxville called PAPI (Performance Application Programming Interface). PAPI defines a standard set of cross-platform performance monitoring events and a standard API that allows measurement using hardware counters in a portable way. The PAPI project provides implementations for the library on several current processors and operating systems, including Intel/AMD x86 processors, Itanium systems and, most recently, AMD's x86-64 CPUs. On Linux, PAPI uses the perfmon and perfctr drivers as appropriate. Refer to the on-line Resources section for references where you can learn much more about perfctr, perfmon and PAPI. PerfSuite, discussed in the remainder of this article, builds upon PAPI, perfmon and perfctr to provide developers with an even higher-level user interface as well as additional functionality. A main focus of PerfSuite is ease of use. Based on my experiences in working with developers interested in performance analysis, it became clear that an ideal solution would require little or no extra work from users who simply want to know how well an application is performing on a computer. They want to know this without having to learn many details about how to configure or access the performance data at a low level.