jump to navigation

Measuring Cache miss August 25, 2008

Posted by Prabakaran Thirumalai in cache.
Tags: , ,
trackback

cachegrind tool helps programmers to quantify and understand the cache behavior of programs and algorithms. we shall modify our code to be more cache-friendly and thereby make it run faster.

The following example demonstrates how to measure cache misses for L1 and L2 cache on Linux platform. Consider the following program which initializes two dimensional array (test1.c)

#include<stdio.h>
int array[10000][10000];
int setValue1()
{
int i,j;
for (i=0;i<10000;i++)
for (j=0;j<10000;j++)
array[j][i]=0;
}
int main()
{
setValue1();
return 0;
}

Compile the program using gcc compiler

$gcc -O2 -o test test1.c

Run the executable generated using the valgrind tool

$valgrind –tool=cachegrind ./test
==2605== Cachegrind, an I1/D1/L2 cache profiler.
==2605== Copyright (C) 2002-2007, and GNU GPL’d, by Nicholas Nethercote et al.
==2605== Using LibVEX rev 1732, a library for dynamic binary translation.
==2605== Copyright (C) 2004-2007, and GNU GPL’d, by OpenWorks LLP.
==2605== Using valgrind-3.2.3, a dynamic binary instrumentation framework.
==2605== Copyright (C) 2000-2007, and GNU GPL’d, by Julian Seward et al.
==2605== For more details, rerun with: -v
==2605==
==2605==
==2605== I refs: 500,153,442
==2605== I1 misses: 547
==2605== L2i misses: 545
==2605== I1 miss rate: 0.00%
==2605== L2i miss rate: 0.00%
==2605==
==2605== D refs: 100,048,658 (33,484 rd + 100,015,174 wr)
==2605== D1 misses: 100,000,803 ( 629 rd + 100,000,174 wr)
==2605== L2d misses: 6,260,759 ( 589 rd + 6,260,170 wr)
==2605== D1 miss rate: 99.9% ( 1.8% + 99.9% )
==2605== L2d miss rate: 6.2% ( 1.7% + 6.2% )
==2605==
==2605== L2 refs: 100,001,350 ( 1,176 rd + 100,000,174 wr)
==2605== L2 misses: 6,261,304 ( 1,134 rd + 6,260,170 wr)
==2605== L2 miss rate: 1.0% ( 0.0% + 6.2% )

It displays the access patterns with respect to Level 1 and Level 2 caches which includes the total instruction references(I) and total data references(D), Level 1 instruction misses(I1) and data misses(D1) and L2 instruction(L2i) and data misses(L2d).
From the above output, we could see D1 miss rate is 99.9% and D2 miss rate is 6.2%. There were about 100,048,658 references to L1 cache out of which 100,000,803 references are misses in L1 cache.

The above output is summary for the whole program, When we run the cachegrind tool, it also generates a file named cachegrind.out with suffix as pid of the process. This file contains L1 and L2 cache information for each and every function in your program. Using this we can figure out which function is the candidate for optimization.
For the above run, cachegrind.out.2605 file contains the following information

desc: I1 cache: 32768 B, 64 B, 8-way associative
desc: D1 cache: 32768 B, 64 B, 8-way associative
desc: L2 cache: 2097152 B, 64 B, 8-way associative
cmd: ./a.out
events: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
fl=???
fn=(below main)
0 64 4 4 19 0 0 26 0 0
fn=???
0 1037 14 12 956 3 1 34 1 1
…….
fn=sbrk
0 25 2 2 9 0 0 5 0 0
fn=setValue1
0 500060005 0 0 2 1 1 100000001 100000000 6260000
fn=strcmp
0 9647 1 1 3161 14 11 0 0 0
fn=strlen
0 657 3 3 103 2 2 0 0 0
fn=strsep
0 711 2 2 241 0 0 7 0 0
fn=uname
0 8 2 2 2 0 0 0 0 0
fn=version_check_doit
0 19 2 2 7 0 0 5 0 0
summary: 500153442 547 545 33484 629 589 100015174 100000174 6260170


To understand the output, we should know what each of these values represent.
fn=sbrk means the function is sbrk() and values below that are in the format “
events Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw”

Ir Instructions Executed
I1mr L1 Instruction Cache read misses
I2mr L2 Instruction Cache read misses
Dr Total Memory Reads
D1mr L1 Data Cache read misses
D2mr L2 Data Cache read misses
Dw Total Memory Writes
D1mw L1 Data Cache write misses
D2mw L2 Data Cache write misses

For setValue1() function, 500060005 Instructions were executed and there were no instruction misses both in L1 and L2 cache. 100000001 memory writes and out of which 100000000 were L1 data misses and 6260000 were L2 data misses.

Further Reading

http://www.valgrind.org

About these ads

Comments»

1. Lucas De Marchi - May 19, 2009

As far as I know this will use an “emulated trace” for cache misses an hits depending on your configuration of caches.

I’m looking if there’s a way (at least for x86 architecture) to trace the real hits/misses. I’m not satisfied with this tool because I have to test a Xeon Quad core, in which there’s a private L1 for each core and a shared L2 for each 2 of them and I think may be other architectures not covered by this tool. I think a hardware measurement is much more truthful. Do you any way of doing this?

2. Nitin Kunal - October 11, 2011

You can try Intel’s vtune for getting actual number of cache misses asd hits. This tool relies upon CPU performance counters for fetching these numbers–unlike valgrind that uses some kind of virtual machine for doing the same.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: