CPU caches with examples for ARM Cortex-M

In the previous article, I mentioned that we used caches to accelerate graphics on microcontrollers.

We talked about some advantages and disadvantages of Write-Through mode, but it was a quick overview. This article talks about caches from the point of view of programmers in more detail.

CPU data cache modes

I’ll start where the previous article left off, namely with the difference between write-back and write-through modes. In short:

The advantage of write-through mode is ease of use, which potentially reduces the number of errors. Indeed, in this mode the memory is always coherent and does not require additional update procedures.

It seems that this should decrease performance significantly, but ST says it shouldn’t:

“Write-through: triggers a write to the memory as soon as the contents on the cache line are written to. This is safer for the data coherency, but it requires more bus accesses. In practice, the write to the memory is done in the background and has a little effect unless the same cache set is being accessed repeatedly and very quickly. It is always a tradeoff.”

The disadvantage of ‘write-through’ mode is the one the quote mentions: every write reaches the memory, so it requires more bus accesses.

In ‘write-back’ mode, by contrast, a write operation stores the data only in the cache, not in the memory.

Caches have another property: the ‘write-allocate’ and ‘no-write-allocate’ modes. These modes determine what happens on a write cache miss. In ‘write-allocate’ mode the data is loaded into the cache; in ‘no-write-allocate’ mode it is not.

The resulting cache mode is a combination of these two pairs: ‘write-through’ or ‘write-back’ with ‘write-allocate’ or ‘no-write-allocate’. For instance, it can be ‘write-through no-write-allocate’ or ‘write-back write-allocate’.
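To make the combinations more concrete, here is a minimal host-side sketch of a toy direct-mapped cache that only counts traffic. It is an illustration under simplifying assumptions, not a model of the real Cortex-M7 cache: the names (`toy_cache`, `toy_cache_write`) and the 8-line size are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32u   /* bytes per cache line */
#define NUM_LINES 8u    /* toy size, chosen only for illustration */

struct toy_cache {
    bool     write_through;   /* false means write-back */
    bool     write_allocate;  /* false means no-write-allocate */
    bool     valid[NUM_LINES];
    bool     dirty[NUM_LINES];
    uint32_t tag[NUM_LINES];
    unsigned mem_writes;      /* write transactions that reach memory */
    unsigned line_fills;      /* lines loaded from memory into the cache */
};

static void toy_cache_init(struct toy_cache *c, bool wt, bool wa)
{
    assert(c != NULL);
    memset(c, 0, sizeof(*c));
    c->write_through = wt;
    c->write_allocate = wa;
}

/* Evict whatever occupies the slot and load the requested line. */
static void toy_cache_fill(struct toy_cache *c, unsigned idx, uint32_t line)
{
    if (c->valid[idx] && c->dirty[idx]) {
        c->mem_writes++;      /* a dirty victim is written back first */
    }
    c->valid[idx] = true;
    c->dirty[idx] = false;
    c->tag[idx] = line;
    c->line_fills++;
}

static void toy_cache_write(struct toy_cache *c, uint32_t addr)
{
    uint32_t line = addr / LINE_SIZE;
    unsigned idx = line % NUM_LINES;
    bool hit = c->valid[idx] && c->tag[idx] == line;

    if (!hit && c->write_allocate) {
        toy_cache_fill(c, idx, line);   /* write miss allocates a line */
        hit = true;
    }
    if (c->write_through || !hit) {
        c->mem_writes++;                /* the data goes to memory now */
    } else {
        c->dirty[idx] = true;           /* write-back: memory is updated later */
    }
}
```

With four writes to the same address, this toy model counts no memory writes and one line fill for ‘write-back write-allocate’, and four memory writes and no line fills for ‘write-through no-write-allocate’.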

NOTE: Besides the D-cache (data cache), there is also an I-cache (instruction cache). We do not consider the I-cache in detail here, but we do enable it in real-world applications.

How to set a cache mode in ARM Cortex-M?

In the ARMv7-M architecture, the MPU (Memory Protection Unit) is used to set the cache mode of a specific region. You can configure up to 16 regions. The settings include the base address, size, access rights, and the TEX, cacheable, bufferable, and shareable attributes, among others. Any of the cache modes can be selected with these attributes.
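As an illustration of how these attributes combine, the sketch below builds a value for the region attribute and size register (MPU_RASR) on the host. The bit positions and the TEX/C/B encodings follow the ARMv7-M register layout; the helper `rasr_for` and the `cache_mode` enum are hypothetical names introduced for this example, and full read/write access is assumed. On the target, the value would be written to the MPU together with the region base address.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions in the ARMv7-M MPU_RASR register. */
#define RASR_ENABLE   (1u << 0)
#define RASR_SIZE(n)  ((uint32_t)(n) << 1)   /* region size is 2^(n+1) bytes */
#define RASR_B        (1u << 16)             /* bufferable */
#define RASR_C        (1u << 17)             /* cacheable */
#define RASR_TEX(n)   ((uint32_t)(n) << 19)  /* type extension */
#define RASR_AP_FULL  (3u << 24)             /* read/write at any privilege */

/* Hypothetical names for the modes discussed in the article. */
enum cache_mode {
    MODE_NON_CACHEABLE,
    MODE_WT_NO_WA,      /* write-through, no-write-allocate */
    MODE_WB_NO_WA,      /* write-back, no-write-allocate */
    MODE_WB_WA          /* write-back, write-allocate */
};

static uint32_t rasr_for(enum cache_mode mode, uint32_t size_bytes)
{
    uint32_t attrs = 0;
    uint32_t order = 0;

    /* MPU regions are a power of two in size, 32 bytes at minimum. */
    assert(size_bytes >= 32u && (size_bytes & (size_bytes - 1u)) == 0);
    while ((1u << order) < size_bytes) {
        order++;
    }

    switch (mode) {
    case MODE_NON_CACHEABLE: attrs = RASR_TEX(1);                   break;
    case MODE_WT_NO_WA:      attrs = RASR_TEX(0) | RASR_C;          break;
    case MODE_WB_NO_WA:      attrs = RASR_TEX(0) | RASR_C | RASR_B; break;
    case MODE_WB_WA:         attrs = RASR_TEX(1) | RASR_C | RASR_B; break;
    }
    return RASR_AP_FULL | attrs | RASR_SIZE(order - 1u) | RASR_ENABLE;
}
```

For example, `rasr_for(MODE_WB_WA, 64 * 1024)` yields 0x030B001F: full access, TEX=1, C=1, B=1, and a size field of 15 for a 64 KB region.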

ARMv7-M defines default settings for each region. For example, STM32F7 SRAM has ‘write-back write-allocate’ mode by default. You only need to set up regions with non-standard attributes. You can also create a subregion with its own attributes.

Each chip manufacturer can have its own specific memory types. For example, STM32 MCUs have TCM memory. It is connected to its own bus, so it is very fast (it runs at the CPU clock). It can be thought of as an additional cache, so this region cannot be made cacheable.

Cache modes comparison. Tests.

Let’s run some tests to better understand the difference between the cache modes. We disable the I-cache and compiler optimizations (compiling with the -O0 flag) to reduce uncertainty. We also use a 64 KB region in SDRAM and set the required memory attributes through the MPU.

We also need to measure time intervals with high precision, so we disable interrupts and use the DWT (Data Watchpoint and Trace unit) instead of generic timers. The DWT contains a 32-bit counter of CPU cycles that can be used for precise time measurement, but only for intervals of up to about 20 seconds, so we clear the counter before each test starts.
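The limit of about 20 seconds follows from the counter width. Here is a small host-side sketch of the arithmetic, assuming a 216 MHz core clock (an assumption; the clock is not stated above). The helper names are hypothetical, and on the target the two cycle values would be read from the DWT cycle counter.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles between two counter reads. Unsigned subtraction stays correct
 * across a single wraparound of the 32-bit counter. */
static uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to whole milliseconds for a given core clock. */
static uint32_t cycles_to_ms(uint32_t cycles, uint32_t cpu_hz)
{
    assert(cpu_hz != 0);
    return (uint32_t)(((uint64_t)cycles * 1000u) / cpu_hz);
}

/* Longest interval a 32-bit cycle counter can hold before wrapping. */
static uint32_t max_interval_ms(uint32_t cpu_hz)
{
    return cycles_to_ms(UINT32_MAX, cpu_hz);
}
```

At the assumed 216 MHz, `max_interval_ms(216000000u)` gives 19884 ms, which is why every test has to fit into roughly 20 seconds.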

First, we look for a difference between non-cacheable and cacheable write-back modes. Obviously, we need both write and read operations, because a CPU data cache makes sense only if there are read operations.

Non-cacheable memory vs. write-back

#define ITERS     100
#define BLOCK_LEN 4096
#define BLOCKS    16
/* An arbitrary buffer in external SDRAM memory. */
#define DATA_ADDR 0x60100000
***
for (i = 0; i < ITERS * BLOCKS * 8; i++) {
    dst = (uint8_t *) DATA_ADDR;
    for (j = 0; j < BLOCK_LEN; j++) {
        val = VALUE;
        *dst = val;
        val = *dst;
        dst++;
    }
}

But when we measured the time for both cacheable and non-cacheable modes, we were surprised to find no difference. We think the reason is the internal buffers in the FMC. In STM32, SDRAM is connected through the Flexible Memory Controller (FMC), which has internal buffers of its own that influence the performance of the memory subsystem.

We need to make the FMC’s life more difficult. We can unroll the loops and increment an array element:

for (i = 0; i < ITERS * BLOCKS; i++) {
    for (j = 0; j < BLOCK_LEN; j++) {
        /* 16 lines */
        arr[i]++;
        arr[i]++;
        ***
        arr[i]++;
    }
}

Results:

Non-cacheable: 4s 743ms
Write-back   : 4s 187ms

It looks better. The cacheable memory is faster by about half a second. Let’s continue and add accesses to sparse array indexes.

for (i = 0; i < ITERS * BLOCKS; i++) {
    for (j = 0; j < BLOCK_LEN; j++) {
        arr[i + 0  ]++;
        ***
        arr[i + 3  ]++;
        arr[i + 4  ]++;
        arr[i + 100]++;
        arr[i + 6  ]++;
        arr[i + 7  ]++;
        ***
        arr[i + 15 ]++;
    }
}

Results:

Non-cacheable: 11s 371ms
Write-back   : 4s 551ms

Well, we got a noticeable difference. Let’s go a little deeper and add a second index:

for (i = 0; i < ITERS * BLOCKS; i++) {
    for (j = 0; j < BLOCK_LEN; j++) {
        arr[i + 0  ]++;
        ***
        arr[i + 4  ]++;
        arr[i + 100]++;
        arr[i + 6  ]++;
        ***
        arr[i + 9  ]++;
        arr[i + 200]++;
        arr[i + 11 ]++;
        arr[i + 12 ]++;
        ***
        arr[i + 15 ]++;
    }
}

Results:

Non-cacheable: 12s 62ms
Write-back   : 4s 551ms

We can see that the time increased for non-cacheable mode and remained the same for cacheable mode.

When is ‘write-allocate’ better?

For example, ‘write-allocate’ is better when there are many write operations to sequential memory addresses and only occasional read operations. In this case, a lot of cache misses would occur if ‘no-write-allocate’ mode were selected.

Let’s use the following test to simulate this situation:

for (i = 0; i < ITERS * BLOCKS; i++) {
    for (j = 0; j < BLOCK_LEN; j++) {
        arr[j + 0 ] = VALUE;
        ***
        arr[j + 7 ] = VALUE;
        arr[j + 8 ] = arr[i % 1024 + (j % 256) * 128];
        arr[j + 9 ] = VALUE;
        ***
        arr[j + 15] = VALUE;
    }
}

In this example, 15 of the 16 write operations store the VALUE constant, while the remaining one stores a value read from another memory cell. So only the elements that are read get cached when ‘no-write-allocate’ mode is used. The array index “(i % 1024 + (j % 256) * 128)” makes the accesses sparse in order to decrease SDRAM performance.

Results:

Write-back                  : 4s 720ms
Write-back no write allocate: 4s 888ms
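To see how sparse the read index “(i % 1024 + (j % 256) * 128)” is, the host-side sketch below counts how many distinct cache lines it touches for a fixed i. It assumes byte-sized array elements and a 32-byte cache line; the element type of arr is not stated above, so treat both as assumptions. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_SIZE 32u                       /* assumed cache line, in bytes */
#define MAX_LINES (64u * 1024u / LINE_SIZE) /* lines in the 64 KB test region */

/* Count the distinct cache lines hit by the sparse read index for one
 * value of i, assuming arr is an array of bytes. */
static unsigned distinct_lines(unsigned i)
{
    uint8_t seen[MAX_LINES] = {0};
    unsigned count = 0;

    for (unsigned j = 0; j < 256; j++) {
        unsigned idx = i % 1024 + (j % 256) * 128;
        unsigned line = idx / LINE_SIZE;

        assert(line < MAX_LINES);
        if (!seen[line]) {
            seen[line] = 1;
            count++;
        }
    }
    return count;
}
```

Under these assumptions, consecutive reads are 128 bytes apart, so each of the 256 values of j % 256 lands on a different cache line.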

When is ‘no-write-allocate’ better?

This is the most complex case to test. It occurs when a program often writes to addresses that are rarely read. In this case, cache allocations are wasted work, because useful data in the cache is displaced by data that will not be needed.

In the following test, I prepared a 64 KB ‘arr_wr’ array for writing and a 4 KB ‘arr_rd’ array for reading. Because the writes go to sparse, non-sequential addresses, a cache allocation would happen on every write, which is wasted work. Therefore the ‘no-write-allocate’ strategy should work faster.

for (i = 0; i < ITERS * BLOCKS; i++) {
    for (j = 0; j < BLOCK_LEN; j++) {
        arr_wr[i * BLOCK_LEN             ] = arr_rd[j + 0 ];
        arr_wr[i * BLOCK_LEN + j*32  + 1 ] = arr_rd[j + 1 ];
        arr_wr[i * BLOCK_LEN + j*64  + 2 ] = arr_rd[j + 2 ];
        arr_wr[i * BLOCK_LEN + j*128 + 3 ] = arr_rd[j + 3 ];
        arr_wr[i * BLOCK_LEN + j*32  + 4 ] = arr_rd[j + 4 ];
        ***
        arr_wr[i * BLOCK_LEN + j*32  + 15] = arr_rd[j + 15];
    }
}

Results:

Write-back                  : 7s 601ms
Write-back no write allocate: 7s 599ms

Great. We have found a situation where ‘no-write-allocate’ is slightly faster, although the difference here is only 2 ms.

Real-world applications with and without CPU caches

Let’s also consider a couple of real examples. We will use Embox RTOS and will launch the examples as command-line utils.

PING

We can estimate the performance with an STM32F769I-Discovery board connected directly to the host. Embox is built with -O2 optimization. The results are as follows:

Non-cacheable: ~0.246 sec
Write-back   : ~0.140 sec

OpenCV

OpenCV is another example that clearly demonstrates how caches influence performance. OpenCV has already been ported to Embox, but on the STM32F7 board it had a rather low FPS. We have already written about it in the ‘How to run OpenCV on STM32 MCU’ article. Let’s launch the same example (Canny edge detector) and compare the execution time without caches and with them (both I-cache and D-cache enabled).

Code:

gettimeofday(&tv_start, NULL);
cedge.create(image.size(), image.type());
cvtColor(image, gray, COLOR_BGR2GRAY);
blur(gray, edge, Size(3,3));
Canny(edge, edge, edgeThresh, edgeThresh*3, 3);
cedge = Scalar::all(0);
image.copyTo(cedge, edge);
gettimeofday(&tv_cur, NULL);
timersub(&tv_cur, &tv_start, &tv_cur);

Results without caches:

> edges fruits.png 20
Processing time 0s 926ms
Framebuffer: 800x480 32bpp
Image: 512x269; Threshold=20

Results with ‘write-back’:

> edges fruits.png 20
Processing time 0s 134ms
Framebuffer: 800x480 32bpp
Image: 512x269; Threshold=20

926ms vs 134ms! Performance increased by almost 7 times with caches enabled.

Conclusion

As we expected, caches are very powerful. So why not use them always? In other words, are there any cases where non-cacheable memory is preferable? Yes, there are. For example, if you use DMA, you have to take care of keeping the memory in a coherent state.

We considered different strategies for memory management. Good memory management strongly influences overall system performance.
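If a cacheable buffer is shared with DMA anyway, the usual approach is explicit cache maintenance over whole cache lines. The sketch below is a host-side helper that widens a buffer to cache-line boundaries, assuming a 32-byte line. The names `cache_span` and `cache_line_span` are hypothetical; on the target, the resulting range would be handed to cache maintenance routines such as the CMSIS `SCB_CleanDCache_by_Addr` (before DMA reads the memory) or `SCB_InvalidateDCache_by_Addr` (after DMA writes it).

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 32u   /* assumed D-cache line size, in bytes */

/* Hypothetical helper type: an address range covering whole cache lines. */
struct cache_span {
    uintptr_t start;
    uint32_t  size;
};

/* Widen [addr, addr + len) so that it starts and ends on line boundaries. */
static struct cache_span cache_line_span(uintptr_t addr, uint32_t len)
{
    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE - 1u);
    struct cache_span span;
    uintptr_t end;

    assert(len > 0);
    span.start = addr & mask;
    end = (addr + len + CACHE_LINE - 1u) & mask;
    span.size = (uint32_t)(end - span.start);
    return span;
}
```

For instance, a 40-byte buffer at 0x60100010 widens to the 64 bytes starting at 0x60100000. Anything else living in those two lines would be cleaned or invalidated together with the buffer, which is why DMA buffers are usually aligned to the line size and padded to a multiple of it.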

You can reproduce all examples described in the article from our repo on GitHub.