Using the perf utility on ARM

Given two systems, both with a Cortex-A5 CPU: one clocked at 396 MHz without L2 cache, the other clocked at 500 MHz with 512 kB of L2 cache. How big is the impact of the L2 cache? Since the clock frequencies differ, a simple comparison of a given program's CPU time does not answer the question… I tried to answer it using perf. perf is usually used to profile software, but in this case it also proved useful for comparing two different hardware implementations.

Most CPUs nowadays have internal counters which count various events (e.g. executed instructions, cache misses, executed branches and branch misses). Other hardware, e.g. cache controllers, may expose performance counters too, but this article focuses on the hardware counters exposed by the CPU.

The Linux perf_events subsystem exports these counters to user space. Besides the hardware events, various software events are counted as well (e.g. context switches). The perf utility is a user space application built on the perf_events interface of the Linux kernel. The building blocks for most perf commands are the available event types, which the perf list command shows.

The two system configurations mentioned above can be found on the Freescale Vybrid based Toradex Colibri VF50 and VF61 modules. As of now, the perf list output looks rather sparse:

root@colibri-vf:~# perf list

List of pre-defined events (to be used in -e):

  alignment-faults                                   [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

  rNNN                                               [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]
   (see 'man perf-list' on how to encode it)

  mem:<addr>[/len][:access]                          [Hardware breakpoint]

The CPU counters are missing… According to the Cortex-A5 Technical Reference Manual, the CPU should support various counters. Digging into it revealed two prerequisites: the config option CONFIG_HW_PERF_EVENTS (which makes sure the architecture dependent driver arch/arm/kernel/perf_event_v7.c gets compiled into the kernel) and an appropriate device tree node. With those two in place, perf list shows the CPU counters:

# perf list

List of pre-defined events (to be used in -e):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]

  alignment-faults                                   [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-dcache-prefetches                               [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  L1-icache-prefetch-misses                          [Hardware cache event]
  L1-icache-prefetches                               [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]

  rNNN                                               [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]
   (see 'man perf-list' on how to encode it)

  mem:<addr>[/len][:access]                          [Hardware breakpoint]
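For reference, the device tree node mentioned above can look roughly like the sketch below. The compatible string "arm,cortex-a5-pmu" is the documented Linux binding for this core, but the interrupt specifier is a placeholder, not the actual Vybrid value; check the SoC reference manual for the real PMU interrupt.

```dts
pmu {
	compatible = "arm,cortex-a5-pmu";
	/* placeholder interrupt specifier; the PMU IRQ number is SoC-specific */
	interrupts = <0 7 4>;
};
```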

Unfortunately, the listed events by themselves leave some questions unanswered: What exactly does cache-misses mean? Global cache misses (counting only accesses which miss in both cache levels), or local (L1) cache misses only? The source file mentioned above, combined with the information from the ARM Cortex-A5 TRM, gives more insight. The following table shows the mapping between the perf events and the effective hardware counters they represent:

perf event                 PMU event/register   ARM Cortex-A5 TRM description
branches                   0x0c                 Software change of the PC (according to the source, all taken branches)
branch-misses              0x10                 Mispredicted or not predicted branch speculatively executed
cache-misses               0x03                 Level 1 data cache refill
cache-references           0x04                 Level 1 data cache access
cycles                     PMCCNTR              Counts processor clock cycles
instructions               0x08                 Instruction architecturally executed
L1-dcache-load-misses      0x03                 Level 1 data cache refill
L1-dcache-loads            0x04                 Level 1 data cache access
L1-dcache-prefetch-misses  0xc3                 Prefetch linefill dropped
L1-dcache-prefetches       0xc2                 Linefill because of prefetch
L1-dcache-store-misses     0x03                 Level 1 data cache refill
L1-dcache-stores           0x04                 Level 1 data cache access
L1-icache-load-misses      0x01                 Level 1 instruction cache refill
L1-icache-loads            0x14                 Level 1 instruction cache access
L1-icache-prefetch-misses  0xc3                 Prefetch linefill dropped
L1-icache-prefetches       0xc2                 Linefill because of prefetch
branch-load-misses         0x10                 Mispredicted or not predicted branch speculatively executed
branch-loads               0x12                 Predictable branch speculatively executed
dTLB-load-misses           0x05                 Level 1 data TLB refill
dTLB-store-misses          0x05                 Level 1 data TLB refill
iTLB-load-misses           0x02                 Level 1 instruction TLB refill

Since the perf_events interface is architecture independent, not all information of the ARM PMU maps perfectly to perf events. Several counters appear twice, and some names are misleading (e.g. L1-dcache-prefetches actually counts both D-cache and I-cache prefetches). The PMUs of other ARMv7 CPUs expose a similar set of events, often mapped as shown above, but some events might differ, hence YMMV! Another interesting fact is that the Cortex-A5 provides two configurable hardware counters. The CPU cycle counter is a separate register and hence comes “for free”.

OK, now let's compare an application between the two systems mentioned above. I chose four events: task-clock, cycles, instructions and branch-misses. This selection makes sure that each of the PMU-counted events (instructions and branch-misses) gets a dedicated hardware counter (again, cycles do not need a PMU counter since they are exposed through a separate register, and task-clock is a software event). My test application uses the cairo library to draw 1000 rectangles on a framebuffer device. One can argue whether that load is representative, but I do not want to go down that road!

The first system under test is the Colibri VF50 with the 396MHz clocked CPU and without L2 cache. Here is a sample output:

# perf stat -e task-clock,cycles,instructions,branch-misses ./cairo

 Performance counter stats for './cairo':

       4346.477415      task-clock (msec)         #    0.138 CPUs utilized          
        1705864718      cycles                    #    0.392 GHz                    
         292200141      instructions              #    0.17  insns per cycle        
            920911      branch-misses             #    0.212 M/sec

And the same application on the Colibri VF61 with the 500MHz clocked CPU and 512kB L2 cache:

# perf stat -e task-clock,cycles,instructions,branch-misses ./cairo

 Performance counter stats for './cairo':

       2851.861460      task-clock (msec)         #    0.124 CPUs utilized          
        1418321454      cycles                    #    0.497 GHz                    
         290460474      instructions              #    0.20  insns per cycle        
            860675      branch-misses             #    0.302 M/sec

I made a “warm-up” run and then 5 measurements on each system. perf calculates the CPU frequency from the (software based) task clock and the cycle count (the value after the # sign on the cycles row). The values match the actual CPU frequencies closely, which suggests that both the hardware counts and the software task-clock are accurate. All counters were stable across the 5 measurements, with standard deviations below 1% of the mean.
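The derived metrics perf prints after the # sign are simple ratios of the raw counts; for example, the frequency and branch-miss rate of the VF50 run can be recomputed with a quick awk calculation from the sample numbers above:

```shell
# Reproduce perf's derived metrics for the VF50 sample run:
# "GHz" is cycles per nanosecond of task-clock,
# "M/sec" is branch-misses per microsecond of task-clock.
awk 'BEGIN {
    task_clock_ms = 4346.477415
    cycles        = 1705864718
    branch_misses = 920911
    printf "%.3f GHz\n",   cycles / (task_clock_ms * 1e6)
    printf "%.3f M/sec\n", branch_misses / task_clock_ms / 1e3
}'
```

This prints 0.392 GHz and 0.212 M/sec, matching the VF50 output above.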

Since the two systems use the very same CPU core, the difference in clock cycles required to execute the program can most likely be attributed to the L2 cache misses. Another measurement showed that the L1 cache misses are in the same order of magnitude on the two systems. Unfortunately it is not possible to measure the cache miss penalties, which likely differ considerably between the two systems…

Assuming we can attribute the difference in cycles per instruction to the L2 cache, we can answer the initial question: the speedup between VF50 and VF61 attributable to the L2 cache itself is 1.20 (calculated from the cycles per instruction). The overall speedup, which includes the higher CPU frequency, is 1.53 (calculated from the execution times).
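The two speedup figures can be recomputed from the single sample runs shown above (the article's 1.53 comes from the averages over five runs, so the execution time ratio of these particular samples comes out marginally lower):

```shell
# Speedup estimates from the sample perf stat runs above.
awk 'BEGIN {
    cpi_vf50 = 1705864718 / 292200141   # cycles per instruction, VF50
    cpi_vf61 = 1418321454 / 290460474   # cycles per instruction, VF61
    printf "CPI VF50: %.2f, CPI VF61: %.2f\n", cpi_vf50, cpi_vf61
    printf "L2 speedup (CPI ratio): %.2f\n", cpi_vf50 / cpi_vf61
    printf "Overall speedup (task-clock ratio): %.2f\n", 4346.477415 / 2851.861460
}'
```

Comparing cycles per instruction instead of raw cycle counts factors out the (tiny) difference in executed instructions between the two runs.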

 

Note: When using more hardware events than there are hardware counters, the subsystem multiplexes the counters and extrapolates the values. This can lead to values which are way off, especially if the sample rate is low (exposed through /proc/sys/kernel/perf_event_max_sample_rate). The subsystem lowers the sample rate on slow systems automatically; in my case this happened after using the perf top command. With a sample frequency of 200 and more than two event counters in use, I got some interesting values:

...
            878094      branch-misses      #  191.03% of all branches          (99.60%)
            459661      branches                                               (78.82%)
...

...
          23028913      cycles             #    0.005 GHz                      (82.95%)
         293729707      instructions       #   12.75  insns per cycle          (99.77%)
...

More branch misses than branches! And 12.75 instructions per cycle on a single-issue, in-order Cortex-A5, impressive 😉
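The percentages in parentheses show the fraction of the runtime during which each multiplexed event was actually counted; perf then scales the raw count up by time_enabled/time_running. A sketch of that extrapolation, using the 78.82% figure from the output above with a made-up raw count:

```shell
# perf's extrapolation for a multiplexed counter:
#   reported = raw_count * time_enabled / time_running
# With a hypothetical raw count of 362000 observed during
# 78.82% of the run, perf would report about 459274.
awk 'BEGIN { printf "%d\n", 362000 / 0.7882 }'
```

The shorter the counting windows, the noisier this extrapolation gets, which is how nonsense like 191% branch misses can appear.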


2 Replies to “Using the perf utility on ARM”

  1. Awesome report! Your data is presented well also. I am trying to measure performance between an A5 board and another machine too.

    I have perf_3.2 running on my machine, but whenever I run “perf_3.2 stat ./program”, the results say that the default events are <not supported>. How did you enable the event support on your A5 machine? Thanks.

    1. Besides the kernel config option and the device tree change, I don’t think I did anything else. What kernel version are you using? And what events are supported according to perf? (check with perf list)…
