A little about graphics subsystem internals on microcontrollers

In this article, we would like to talk about the particularities of implementing a GUI on an MCU that has both a familiar user interface and a decent FPS. We focus on things common to any such system: memory, caches, DMA, and so on, rather than on specific hardware. We are members of the Embox team, so the examples and experiments were made under this RTOS.

There is a Qt port on Embox which has been launched on an MCU. We managed to achieve pretty smooth animation in that example, but it requires a lot of memory for code, so we had to run the example from external QSPI flash. Of course, when a complex and multifunctional interface is required, the cost in hardware resources can be quite justified (especially if you already have the code developed for Qt).

But what if powerful Qt functionality isn’t required? What if you need four buttons, a single slider, and a pair of popup menus, and of course you want it to be “pretty and quick”? In this case, it would be more practical to use more lightweight tools, for instance, lvgl or similar.

Embox has a port of a very lightweight GUI framework — Nuklear. Therefore, we decided to use it to build a simple application with several widgets and touchscreen input. We used an STM32F7-Discovery board as the hardware platform.

So the GUI framework and the hardware platform were chosen; it’s time to estimate how many resources are required. Since the internal RAM is several times faster than the external SDRAM, data, including video memory, should preferably be placed in the internal RAM. The screen is 480x272 pixels, so with a 4-byte pixel format the framebuffer needs about 512 KB of RAM. Our chip has only 320 KB of RAM, so a framebuffer of that size would have to live in external RAM. On the other hand, with a 16-bit pixel format, less than 256 KB is required, so we can try to use the internal RAM.
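The arithmetic is easy to check. A tiny helper (a standalone sketch; fb_size is our own name, not an Embox API) computes the framebuffer size for a given pixel format:

```c
#include <stdint.h>

/* Bytes needed for a width x height framebuffer at a given bytes-per-pixel */
uint32_t fb_size(uint32_t width, uint32_t height, uint32_t bytes_per_px) {
    return width * height * bytes_per_px;
}
```

fb_size(480, 272, 4) gives 522240 bytes (about 510 KB), while fb_size(480, 272, 2) gives 261120 bytes (about 255 KB), which is why the 16-bit format has a chance of fitting into the 320 KB of internal RAM.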

Well, let’s try to save as many resources as possible: make our video memory 256 KB, place it in the internal memory, and draw directly into it. Trouble appeared immediately: the scene was blinking. This is natural, because Nuklear redraws the full scene every time. So the full sequence happens on every redraw: first the whole screen is filled with the background color, then a widget is drawn on top of it, then a button is placed into the widget, and so on. Since all of this happens directly in video memory, the user sees the intermediate drawing stages.

After fiddling a little with the previous approach (placing video memory in internal RAM), X Server and Wayland came to mind. Indeed, window managers essentially process requests from clients (such as our custom application) and assemble the elements into the final scene. For example, the Linux kernel delivers events from input devices to the server through the evdev driver. The server, in turn, determines which client the event should go to. Clients, having received an event (for example, a tap on the touchscreen), execute their internal logic: highlight a button, display a new menu. Then (slightly differently for X and Wayland) either the client itself or the server makes the changes to the buffer. Finally, the compositor puts all the pieces together and renders the screen. This is, of course, a simplified and schematic explanation.

It became clear that we need similar logic, but we really don’t want to push an X Server into an stm32 for the sake of a small application. So let’s try simply drawing not directly into video memory, but into an intermediate buffer located in ordinary memory. After the entire scene is rendered, the buffer is copied into video memory.

The code of the example can be found in the Embox repo here.

In this example, a 200 x 200 px window with different graphical elements is created. The final scene is drawn into the fb_buf buffer, which we allocated in SDRAM. Then, in the last line, memcpy() is simply called to copy it to video memory, and everything repeats in an infinite loop.
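The loop boils down to something like the following sketch (hypothetical names: render_scene() stands in for the Nuklear rawfb drawing calls, screen_base for the video memory address):

```c
#include <stdint.h>
#include <string.h>

#define WIDTH  480
#define HEIGHT 272

/* Intermediate buffer placed in ordinary (SDRAM) memory */
static uint16_t fb_buf[WIDTH * HEIGHT];

/* Hypothetical stand-in for the Nuklear rawfb drawing calls */
static void render_scene(uint16_t *buf) {
    for (int i = 0; i < WIDTH * HEIGHT; i++) {
        buf[i] = 0x3030; /* background fill */
    }
    /* ... widgets and buttons are drawn on top of the background ... */
}

/* One iteration of the infinite loop: render off-screen, then publish */
void redraw(uint16_t *screen_base) {
    render_scene(fb_buf);
    memcpy(screen_base, fb_buf, sizeof(fb_buf));
}
```

The key point is that the user never sees the intermediate drawing stages: only the fully rendered scene reaches video memory.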

If we just build and run this example, we get about 10–15 FPS, which is certainly not very good, because the redrawing is visible to the naked eye. Note that since there are a lot of floating-point calculations in the Nuklear renderer code, we enabled hardware FPU support from the start; FPS would be even lower without it. The first and simplest optimization is the -O2 compiler flag. Building and running the same example gives 20 FPS. Better, but still not good enough.

Before going on, I should mention that we are using the rawfb Nuklear backend, which draws directly into memory. Accordingly, optimizing memory accesses looks very promising. The first thing that comes to mind is the cache.

Higher-end Cortex-M cores, such as the Cortex-M7 (our case), have instruction and data caches. They can be enabled through the CCR register of the System Control Block. But enabling the cache brings new problems, such as inconsistency between the data in the cache and the data in memory.
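In CMSIS terms, enabling both caches on a Cortex-M7 is a pair of helper calls that invalidate the caches and set the corresponding CCR bits (a sketch; where exactly this runs depends on the startup code):

```c
#include "core_cm7.h"  /* CMSIS Cortex-M7 header */

void enable_caches(void) {
    SCB_EnableICache(); /* invalidate, then enable the instruction cache */
    SCB_EnableDCache(); /* invalidate, then enable the data cache */
}
```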

There are several ways to manage the cache, but in this article, I will not dwell on them, so I will move on to one of the simplest, in my opinion. To solve the cache/memory inconsistency problem we can simply mark all available memory as “non-cacheable”. This means that all writes to this memory will always go to memory and not to the cache. But if we mark all memory in this way, then there will be no advantages in using the cache.

There is another variant, write-through mode, in which every write to memory marked as write-through goes simultaneously to both the cache and memory. This creates overhead on writes but greatly speeds up reads, so the overall result depends on the specific application.
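On stm32f7 this kind of memory attribute is set through the MPU. The sketch below, using the STM32 HAL, marks a hypothetical 8 MB SDRAM region at 0xC0000000 as write-through (in ARMv7-M attribute terms: TEX=0, cacheable, non-bufferable); the base address and size are assumptions for illustration:

```c
#include "stm32f7xx_hal.h"

/* Mark external SDRAM (hypothetical base 0xC0000000, 8 MB) as write-through */
void mpu_sdram_write_through(void) {
    MPU_Region_InitTypeDef region = {0};

    HAL_MPU_Disable();

    region.Enable           = MPU_REGION_ENABLE;
    region.Number           = MPU_REGION_NUMBER0;
    region.BaseAddress      = 0xC0000000;
    region.Size             = MPU_REGION_SIZE_8MB;
    region.AccessPermission = MPU_REGION_FULL_ACCESS;
    region.TypeExtField     = MPU_TEX_LEVEL0;           /* TEX = 0   */
    region.IsCacheable      = MPU_ACCESS_CACHEABLE;     /* C   = 1   */
    region.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;/* B   = 0   */
    region.IsShareable      = MPU_ACCESS_NOT_SHAREABLE;
    region.DisableExec      = MPU_INSTRUCTION_ACCESS_ENABLE;
    HAL_MPU_ConfigRegion(&region);

    HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);
}
```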

Write-through mode turned out to be very good for Nuklear: performance increased from 20 FPS to 45 FPS, which is already quite good and smooth. This effect is interesting, of course; we even tried disabling write-through mode, ignoring the data inconsistency, but FPS increased only to 50. That is, there was no significant gain compared to write-through mode. From this we concluded that many operations involve reading memory. Why? Perhaps because of the number of transformations in the rawfb code, which often reads memory to fetch the next coefficient or the like.

We didn’t want to stop at 45 FPS, so we decided to experiment further. The next idea was double buffering. The idea is widely known and rather obvious: one device renders the scene into one buffer while another device displays the other buffer on the screen at the same time. If you look at the previous code, you can clearly see a loop in which a scene is drawn into the buffer and then copied into video memory using memcpy(). Since memcpy() uses the CPU, rendering and copying happen sequentially. Our idea was that the copying can be done in parallel with rendering using DMA: while the processor draws a new scene, the DMA copies the previous scene into video memory.

Let’s replace the memcpy() call with the following code:

/* Wait for the previous DMA transfer to complete */
while (dma_in_progress()) {
}
/* Copy the just-rendered buffer into video memory:
 * destination, source, number of 32-bit words */
dma_transfer((uint32_t) fb_info->screen_base,
             (uint32_t) fb_buf[fb_buf_idx], (width * height * bpp) / 4);
/* Render the next frame into the other buffer while the DMA works */
fb_buf_idx = (fb_buf_idx + 1) % 2;

Here fb_buf_idx is the index of the buffer: fb_buf_idx = 0 is the front buffer, fb_buf_idx = 1 is the back buffer. The dma_transfer() function takes the destination, the source, and the number of 32-bit words. Then the DMA is loaded with the required data, and rendering continues into the next buffer.

After trying this mechanism, performance increased to about 48 FPS. That is better than memcpy(), but only slightly. I don’t mean to say that DMA turned out to be useless, but in this particular example the cache had a bigger effect on the overall picture.

After being a little surprised that DMA performed worse than expected, we came up with an “excellent” (as it seemed to us then) idea to use several DMA channels. What’s the point? The amount of data that can be loaded into the DMA at a time on stm32f7xx is 256 KB. Our screen is 480x272 and the video memory is about 512 KB, so it would seem that we could feed the first half of the data to one DMA channel and the second half to another. And everything seems fine. But performance decreased from 48 FPS to 25–30 FPS. That is, we returned to the situation before the cache was enabled.
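The 256 KB figure follows from the DMA transfer counter: an stm32f7 DMA stream counts at most 65535 items per transfer, i.e. 65535 x 4 bytes when moving 32-bit words. A quick sanity check of the numbers (a standalone sketch; the helper names are our own):

```c
#include <stdint.h>

/* Largest number of bytes one stm32f7 DMA transfer can move:
 * the transfer counter is 16 bits (65535 items), 4 bytes per item
 * when transferring 32-bit words. */
uint32_t dma_max_bytes(void) {
    return 65535u * 4u; /* 262140 bytes, ~256 KB */
}

/* Does the framebuffer fit when split across two DMA channels? */
int halves_fit(uint32_t fb_bytes) {
    return fb_bytes / 2 <= dma_max_bytes();
}
```

A 480 x 272 framebuffer at 4 bytes per pixel is 522240 bytes, so a single transfer is not enough, but each half (261120 bytes) fits comfortably, hence the temptation to use two channels.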

What could this depend on? After a little reflection, we realized there is nothing surprising here: there is only a single memory, and the write and read cycles all go to a single chip (over a single bus). Since another source/receiver was added, the bus arbiter (which controls the data flow on the bus) has to interleave command cycles from the different DMA channels.

Copying from an intermediate buffer is certainly good, but as we found out, it is not enough. Let’s take a look at another obvious improvement: double buffering at the display controller level. In the vast majority of modern display controllers, the address of the video memory in use can be changed. Thus we can get rid of copying entirely and simply point the video memory address at the prepared buffer; the display controller fetches the data itself, in an optimal way, via its own DMA. This is real double buffering, without the intermediate buffer we had before. Some display controllers can also have two or more buffers of their own, which is essentially the same: we write to one buffer while the other is used by the controller, so no copying is required.
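With the STM32 HAL, this pointer-switching variant is essentially a one-liner per frame (a sketch; hltdc, the two buffers, and their addresses are assumed to be set up elsewhere):

```c
#include "stm32f7xx_hal.h"

extern LTDC_HandleTypeDef hltdc;   /* configured elsewhere by the BSP */
extern uint32_t fb_addr[2];        /* addresses of the two framebuffers */
static int fb_idx;

void show_rendered_frame(void) {
    /* point Layer 1 at the buffer we just finished rendering */
    HAL_LTDC_SetAddress(&hltdc, fb_addr[fb_idx], LTDC_LAYER_1);
    /* the next frame is rendered into the other buffer */
    fb_idx = (fb_idx + 1) % 2;
}
```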

The LTDC (LCD-TFT display controller) in stm32f74xx has two hardware overlay layers, Layer 1 and Layer 2, where Layer 2 lies on top of Layer 1. Each layer is independently configurable and can be enabled or disabled separately. First we tried to enable only Layer 1 and switch its video memory address between the front and back buffers. That is, we give one buffer to the display and draw into the other one in the meantime. But we got noticeable flickering when switching.

Then we tried to use both layers, turning one of them on and off. That is, each layer has its own fixed video memory address, and the current buffer is selected by enabling one layer while disabling the other. This method also resulted in flickering. Finally, we tried the option where a layer is never turned off, but its alpha channel is set either to zero or to its maximum value (255); that is, we controlled the transparency, making one of the layers invisible. But this option did not live up to expectations either: the flickering was still present.

The reason was not clear. The documentation says that layer state updates can be performed “on the fly”. We made a simple test. We turned the caches and floating-point unit off and drew a static picture with a green square at the center of the screen, the same for both Layer 1 and Layer 2. Then we began to switch layers in a loop, hoping to get a static picture. But we got the same flickering again.

It became clear that there was another reason. And then we remembered about the alignment of the framebuffer address in memory. Since the buffers were allocated from the heap, their addresses were not aligned. We aligned the framebuffer addresses to 1 KB and got the expected picture without flickering. Then we found in the documentation that the LTDC reads data in batches of 64 bytes and that unaligned accesses lead to a significant loss in performance. Moreover, both the address of the beginning of the framebuffer and its line width must be aligned. To test this, we changed the 480 x 4 byte line width to 470 x 4, which is not divisible by 64 bytes, and got the same screen flickering.
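Once the cause is known, the fix is simple: allocate the framebuffers with 64-byte alignment and keep each line a multiple of 64 bytes. On a hosted system the allocation part can be sketched with C11 aligned_alloc() (alloc_framebuffer is our own helper name):

```c
#include <stdlib.h>
#include <stdint.h>

/* Allocate a framebuffer whose base address is 64-byte aligned.
 * aligned_alloc (C11) requires the size to be a multiple of the
 * alignment, so round the size up first. */
void *alloc_framebuffer(size_t line_bytes, size_t height) {
    size_t size = line_bytes * height;
    size = (size + 63) & ~(size_t) 63;
    return aligned_alloc(64, size);
}
```

On bare metal, the same effect can be achieved with a statically allocated array marked __attribute__((aligned(64))). Note that a 480 x 4 = 1920-byte line is already a multiple of 64, while 470 x 4 = 1880 is not, which reproduces the flicker described above.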

As a result, we aligned both the buffers’ addresses and the line width to 64 bytes and ran Nuklear: the flickering disappeared. The solution that worked is to use transparency instead of switching between layers by completely disabling Layer 1 or Layer 2. That is, a layer is disabled or enabled by setting its transparency to 0 or 255, respectively.

/* Make the layer holding the just-rendered scene opaque (no immediate reload) */
BSP_LCD_SetTransparency_NoReload(fb_buf_idx, 0xff);
fb_buf_idx = (fb_buf_idx + 1) % 2;
/* Hide the other layer; the next scene will be rendered into it */
BSP_LCD_SetTransparency(fb_buf_idx, 0x00);

We got 70–75 FPS! Much better than the original 15 :)

It is worth noting that only the solution based on transparency control works here. The options with disabling one of the layers and with changing the video memory address produce flickering at FPS above 40–50; the reason is currently unknown to us. Also, running ahead, I will say that this solution is specific to this board.

But this is not the last thing we can improve; our final FPS optimization is hardware scene filling. Before, we did the filling programmatically. Now we will fill the scene with a single color (0xff303030) through the DMA2D controller. One of the main functions of DMA2D is copying or filling a rectangle in RAM. The convenience here is that a rectangle on the screen is not a contiguous piece of memory, but a rectangular region stored with gaps between lines, which ordinary DMA cannot handle directly. In Embox we have not worked with this device yet and there is no API for it, so let’s just use the STM32Cube tools: the BSP_LCD_Clear(uint32_t Color) function, which programs the fill color and the size of the entire screen into DMA2D.
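Under the hood, a fill like BSP_LCD_Clear() programs DMA2D in register-to-memory mode. With the STM32 HAL it looks roughly like this (a sketch; hdma2d is assumed to be initialized elsewhere, and dma2d_fill is our own helper name):

```c
#include "stm32f7xx_hal.h"

extern DMA2D_HandleTypeDef hdma2d;   /* configured elsewhere by the BSP */

/* Fill a width x height rectangle at dst with one ARGB8888 color.
 * OutputOffset is the number of pixels skipped at the end of each
 * line, which is how DMA2D handles non-contiguous rectangles. */
void dma2d_fill(uint32_t dst, uint32_t color,
                uint32_t width, uint32_t height, uint32_t screen_width) {
    hdma2d.Init.Mode         = DMA2D_R2M;  /* register-to-memory fill */
    hdma2d.Init.ColorMode    = DMA2D_OUTPUT_ARGB8888;
    hdma2d.Init.OutputOffset = screen_width - width;
    HAL_DMA2D_Init(&hdma2d);

    HAL_DMA2D_Start(&hdma2d, color, dst, width, height);
    HAL_DMA2D_PollForTransfer(&hdma2d, 100); /* wait for completion */
}
```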

We achieved 80–85 FPS, that is, another 10 FPS more.

But even with the achieved 80 FPS, a noticeable problem remained: parts of the widget showed small “tears” when moving across the screen. In other words, the widget seemed to be split into 3 (or more) parts that moved side by side with a slight delay. It turned out that the reason was incorrect video memory updates; more precisely, updates performed at the wrong moments in time.

A display controller has the VBLANK property, aka VBI or Vertical Blanking Period. It defines the time interval between adjacent video frames; more precisely, the interval between the last line of the previous video frame and the first line of the next one. No data is transferred to the display during this interval, so the picture is static. For this reason, it is safe to update video memory during VBLANK.

In practice, the LTDC controller has an interrupt that can be configured to trigger after a given framebuffer line has been processed (LTDC line interrupt position configuration register, LTDC_LIPCR). Thus, if this interrupt is configured with the number of the last screen line, the interrupt handler fires right at the beginning of the VBLANK interval, and at that point buffer switching is safe.
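With the STM32 HAL, the same mechanism is exposed as a “line event” (a sketch; swap_buffers() is a hypothetical function that performs the buffer switch, e.g. the transparency flip):

```c
#include "stm32f7xx_hal.h"

extern LTDC_HandleTypeDef hltdc;   /* configured elsewhere by the BSP */
extern void swap_buffers(void);    /* hypothetical buffer-switch routine */

volatile int frame_ready;          /* set by the render loop */

#define LAST_LINE 271              /* last visible line of a 480x272 panel */

void arm_vblank_event(void) {
    /* request an interrupt once the last visible line has been output */
    HAL_LTDC_ProgramLineEvent(&hltdc, LAST_LINE);
}

/* weak HAL callback, invoked from the LTDC interrupt handler */
void HAL_LTDC_LineEventCallback(LTDC_HandleTypeDef *h) {
    if (frame_ready) {
        swap_buffers();            /* we are inside VBLANK: safe to switch */
        frame_ready = 0;
    }
    HAL_LTDC_ProgramLineEvent(h, LAST_LINE); /* re-arm for the next frame */
}
```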

As a result of such actions, the picture returned to its normal state, the gaps were gone. But at the same time, FPS decreased from 80 to 60. Let’s understand the reason.

The following formula can be found in the documentation:

LCD_CLK (MHz) = total_screen_size * refresh_rate

where total_screen_size = total_width x total_height, and these totals include the blanking intervals, not just the visible area. LCD_CLK is the frequency at which the display controller loads pixels from video memory to the screen (for example, via the Display Serial Interface (DSI)); refresh_rate is the refresh rate of the screen itself, a physical characteristic. In other words, you can compute the display controller frequency from the screen’s refresh rate and its dimensions. After checking the registers configured by STM32Cube, we found out that it tunes the controller for a 60 Hz screen. That answers our question.
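As a sanity check of the formula (a standalone sketch; the totals of 531 x 297 are hypothetical values for the 480 x 272 active area plus sync and porch intervals):

```c
#include <stdint.h>

/* LCD_CLK = total_width * total_height * refresh_rate, where the
 * totals include blanking intervals, not just the visible area */
uint32_t lcd_clk_hz(uint32_t total_width, uint32_t total_height,
                    uint32_t refresh_rate) {
    return total_width * total_height * refresh_rate;
}
```

lcd_clk_hz(531, 297, 60) gives 9462420, i.e. roughly 9.5 MHz, which is the order of magnitude of pixel clock such small panels are driven at.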

We had another similar board, STM32F769I-DISCO. It has the same LTDC controller but a different screen with an 800x480 resolution. After launching our example on it, we got 25 FPS. That is a noticeable drop in performance, due to the size of the framebuffer, which is almost 3 times larger. But there was a bigger problem: the widget was very distorted even when it was not moving.

The reason was not clear, so we started looking at the standard examples from STM32Cube. There was a double-buffering example for this board. In that example, the developers, in contrast to the transparency-changing method, simply switch the framebuffer pointer when handling VBLANK. We had already tried this method earlier on the first board, but it did not work there. However, using this method on STM32F769I-DISCO, we got a fairly smooth picture at 25 FPS.

After the success, we checked this method again (with switching pointers) on the first board, but it still did not work at high FPS. As a result, the method with layer transparencies (60 FPS) works on one board, and the method with switching pointers (25 FPS) on the other. After discussing the situation, we decided to postpone unification until a deeper investigation of the graphics stack.

Well, let’s summarize. The demonstrated example represents a simple but at the same time common GUI pattern for microcontrollers — a few buttons, a slider, or something else. The example lacks any logic associated with events since the emphasis was placed on the graphics. In terms of performance, we got a pretty decent FPS value.

The accumulated nuances of performance optimization lead to the conclusion that graphics are becoming more complicated on modern microcontrollers. Now, just as on large platforms, you need to control the processor cache, place some data in external memory and some in faster memory, use DMA and DMA2D, watch for VBLANK, and so on. In other words, microcontrollers have come to resemble big platforms, and maybe that’s why we have already referred to X Server and Wayland several times.

Perhaps the most unoptimized part is the rendering itself: we redraw the whole scene from scratch, entirely. I cannot say how this is done in other libraries for microcontrollers; perhaps somewhere this stage is built into the library itself. But based on the results of working with Nuklear, it seems that a lightweight analog of X Server or Wayland is needed at this point. It leads us again to the idea that small systems repeat the path of large ones.

Our contacts:

Github: https://github.com/embox/embox

Email list: embox-devel@googlegroups.com

Telegram chat: https://t.me/embox_chat_en