CPU caches with examples for ARM Cortex-M

In a previous article, I mentioned that we use caches to accelerate graphics on microcontrollers.

We talked about some advantages and disadvantages of write-through mode there, but it was only a quick overview. This article looks at caches from a programmer's point of view in more detail.

CPU data cache modes

  • Write-back. On a write, data is loaded only into the cache; the actual write to memory is deferred until the cache line is evicted to make room for new data.
  • Write-through. Writes go to the cache and to memory “simultaneously”.

The advantage of write-through mode is its simplicity, which potentially reduces the number of errors. Indeed, in this mode the memory is always coherent and requires no additional update procedures.

It would seem that this should decrease performance significantly, but ST says it shouldn’t:

“Write-through: triggers a write to the memory as soon as the contents on the cache line are written to. This is safer for the data coherency, but it requires more bus accesses. In practice, the write to the memory is done in the background and has a little effect unless the same cache set is being accessed repeatedly and very quickly. It is always a tradeoff.”

Disadvantages of the ‘write-through’ mode are the following:

  • Sequential and frequent access to the same memory address can degrade performance.
  • You still need to invalidate the cache after DMA operations complete.
  • There is a “Data corruption in a sequence of Write-Through stores and loads” erratum in some revisions of the Cortex-M7.

In ‘write-back’ mode, by contrast, a write loads data only into the cache, not into memory.

There is another cache property: the ‘write-allocate’ and ‘no-write-allocate’ modes. These determine what happens on a write cache miss. With ‘write-allocate’, the corresponding line is loaded into the cache; with ‘no-write-allocate’, it is not.

The resulting cache mode is a combination of these two pairs: ‘write-through’ or ‘write-back’ with ‘write-allocate’ or ‘no-write-allocate’. For instance, it can be ‘write-through no-write-allocate’ or ‘write-back write-allocate’.

NOTE: Besides the D-cache (data cache), there is also an I-cache (instruction cache). We do not consider the I-cache in detail here, but we do enable it in real-world applications.

How to set a cache mode in ARM Cortex-M?

Default attributes for each memory region are defined in the ARMv7-M architecture. For example, STM32F7 SRAM is ‘write-back write-allocate’ by default. You only need to set up regions with non-standard attributes. You can also create a subregion with its own attributes.
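In practice this is done through the MPU. Here is a sketch using the CMSIS-Core MPU helper macros; the region number, base address 0x60000000, and 8 MB size are assumptions for an STM32F7 board with external SDRAM, and the header name comes from the STM32Cube package. Check your reference manual before using anything like this:

```c
#include "stm32f7xx.h"   /* pulls in CMSIS-Core and mpu_armv7.h */

/* Sketch: mark external SDRAM at 0x60000000 (assumed base and size)
 * as 'write-through, no write allocate' (TEX=0, C=1, B=0). */
void sdram_set_write_through(void)
{
    ARM_MPU_Disable();
    ARM_MPU_SetRegion(
        ARM_MPU_RBAR(0, 0x60000000u),        /* region 0, base address  */
        ARM_MPU_RASR(0,                      /* execution allowed       */
                     ARM_MPU_AP_FULL,        /* full read/write access  */
                     0,                      /* TEX = 0                 */
                     0,                      /* not shareable           */
                     1,                      /* C = 1: cacheable        */
                     0,                      /* B = 0: not bufferable   */
                     0,                      /* all subregions enabled  */
                     ARM_MPU_REGION_SIZE_8MB));
    ARM_MPU_Enable(MPU_CTRL_PRIVDEFENA_Msk); /* keep default background map */
    SCB_EnableDCache();
}
```

The C/B/TEX bit combination selects the cache policy: for instance, TEX=1, C=1, B=1 would give ‘write-back write-allocate’, and TEX=1, C=0, B=0 gives normal non-cacheable memory.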

Each chip manufacturer can also have its own specific memory types. For example, STM32 MCUs have TCM (tightly coupled memory). It is connected to its own bus, so it is very fast (it runs at the CPU clock). It can be thought of as an additional cache, and this region is not allowed to be made cacheable.

Cache mode comparison: tests

We also need to measure time intervals with high precision, so we disable interrupts and use the DWT (Data Watchpoint and Trace unit) instead of generic timers. The DWT contains a 32-bit counter of CPU cycles, which allows precise measurement but wraps roughly every 20 seconds, so we clear the counter before each test starts.

First of all, let’s try to see a difference between non-cacheable and cacheable write-back modes. We need both write and read operations, because the data cache only makes sense when there are reads.

Non-cacheable memory vs. write-back

#define ITERS     100
#define BLOCK_LEN 4096
#define BLOCKS    16
/* An arbitrary buffer in external SDRAM. */
#define DATA_ADDR 0x60100000

volatile uint8_t *dst;
uint8_t val;
int i, j;

for (i = 0; i < ITERS * BLOCKS * 8; i++) {
    dst = (volatile uint8_t *) DATA_ADDR;
    for (j = 0; j < BLOCK_LEN; j++) {
        val = VALUE;
        *dst = val;
        val = *dst;
        dst++;
    }
}

But when we measured the time for both cacheable and non-cacheable modes, we were surprised to find no difference. We think the reason is internal buffering in the FMC: in STM32, SDRAM is connected through the Flexible Memory Controller (FMC), whose internal buffers influence memory-subsystem performance.

We need to make the FMC’s life harder. Let’s unroll the loops and increment an array element:

for (i = 0; i < ITERS * BLOCKS; i++) {
    for (j = 0; j < BLOCK_LEN; j++) {
        /* 16 increments of the same element */
        arr[i]++;
        arr[i]++;
        /* ... */
        arr[i]++;
    }
}

Results:

Non-cacheable: 4s 743ms
Write-back:    4s 187ms

That looks better: cacheable memory is about half a second faster. Let’s continue and add accesses to sparse array indices.

for (i = 0; i < ITERS * BLOCKS; i++) {
    for (j = 0; j < BLOCK_LEN; j++) {
        arr[i + 0  ]++;
        /* ... */
        arr[i + 3  ]++;
        arr[i + 4  ]++;
        arr[i + 100]++;
        arr[i + 6  ]++;
        arr[i + 7  ]++;
        /* ... */
        arr[i + 15 ]++;
    }
}

Results:

Non-cacheable: 11s 371ms
Write-back:    4s 551ms

Now we get a noticeable difference. Let’s go a little further and add a second sparse index:

for (i = 0; i < ITERS * BLOCKS; i++) {
    for (j = 0; j < BLOCK_LEN; j++) {
        arr[i + 0  ]++;
        /* ... */
        arr[i + 4  ]++;
        arr[i + 100]++;
        arr[i + 6  ]++;
        /* ... */
        arr[i + 9  ]++;
        arr[i + 200]++;
        arr[i + 11 ]++;
        arr[i + 12 ]++;
        /* ... */
        arr[i + 15 ]++;
    }
}

Results:

Non-cacheable: 12s 62ms
Write-back:    4s 551ms

We can see that the time increased for non-cacheable mode and remained the same for cacheable mode.

When is ‘write-allocate’ better?

‘Write-allocate’ pays off when recently written addresses are accessed again soon, so the allocated cache lines get reused. Let’s use the following test to simulate this situation:

for (i = 0; i < ITERS * BLOCKS; i++) {
    for (j = 0; j < BLOCK_LEN; j++) {
        arr[j + 0 ] = VALUE;
        /* ... */
        arr[j + 7 ] = VALUE;
        arr[j + 8 ] = arr[i % 1024 + (j % 256) * 128];
        arr[j + 9 ] = VALUE;
        /* ... */
        arr[j + 15] = VALUE;
    }
}

In this example, 15 of 16 write operations store the constant VALUE, while the read operations access other memory cells. So with ‘no-write-allocate’, only the read elements end up in the cache. The array index “(i % 1024 + (j % 256) * 128)” makes the accesses sparse to decrease SDRAM performance.

Results:

Write-back                  : 4s 720ms
Write-back no write allocate: 4s 888ms

When is ‘no-write-allocate’ better?

In the following test, I prepared a 64 KB array ‘arr_wr’ for writing and a 4 KB array ‘arr_rd’ for reading. Because the writes go to sparse, non-sequential addresses, a cache line would be allocated on every write, which is wasted work. Therefore the ‘no-write-allocate’ strategy should be faster here.

for (i = 0; i < ITERS * BLOCKS; i++) {
    for (j = 0; j < BLOCK_LEN; j++) {
        arr_wr[i * BLOCK_LEN            ] = arr_rd[j + 0 ];
        arr_wr[i * BLOCK_LEN + j*32 + 1 ] = arr_rd[j + 1 ];
        arr_wr[i * BLOCK_LEN + j*64 + 2 ] = arr_rd[j + 2 ];
        arr_wr[i * BLOCK_LEN + j*128 + 3] = arr_rd[j + 3 ];
        arr_wr[i * BLOCK_LEN + j*32 + 4 ] = arr_rd[j + 4 ];
        /* ... */
        arr_wr[i * BLOCK_LEN + j*32 + 15] = arr_rd[j + 15];
    }
}

Results:

Write-back                  : 7s 601ms
Write-back no write allocate: 7s 599ms

Great. We have found a situation where ‘no-write-allocate’ is slightly faster.

Real-world applications with and without CPU caches

PING

Non-cacheable: ~0.246 sec
Write-back:    ~0.140 sec

OpenCV

Code:

gettimeofday(&tv_start, NULL);

cedge.create(image.size(), image.type());
cvtColor(image, gray, COLOR_BGR2GRAY);
blur(gray, edge, Size(3,3));
Canny(edge, edge, edgeThresh, edgeThresh*3, 3);
cedge = Scalar::all(0);
image.copyTo(cedge, edge);

gettimeofday(&tv_cur, NULL);
timersub(&tv_cur, &tv_start, &tv_cur);

Results without caches:

> edges fruits.png 20
Processing time 0s 926ms
Framebuffer: 800x480 32bpp
Image: 512x269; Threshold=20

Results with ‘write-back’:

> edges fruits.png 20
Processing time 0s 134ms
Framebuffer: 800x480 32bpp
Image: 512x269; Threshold=20

926ms vs 134ms! Performance increased by almost 7 times with caches enabled.

Conclusion

We considered different memory-management strategies. Good memory management has a strong influence on overall system performance.

You can reproduce all the examples described in this article from our repo on GitHub.