VC4

Mesa’s vc4 graphics driver supports multiple implementations of Broadcom’s VideoCore IV GPU. It is notably used in the Raspberry Pi 0 through Raspberry Pi 3 hardware, and the driver is included as an option as of the 2016-02-09 Raspbian release using raspi-config. On most other distributions such as Debian or Fedora, you need no configuration to enable the driver.

This Mesa driver talks directly to the vc4 kernel DRM driver for scheduling graphics commands, and that module also provides KMS display support. The driver makes no use of the closed-source VPU firmware on the VideoCore IV block, instead talking directly to the GPU block from Linux.

GLES2 support

The vc4 driver is a nearly conformant GLES2 driver, and the hardware has achieved GLES2 conformance with other driver stacks.

OpenGL support

Along with GLES 2.0, the Mesa driver also exposes OpenGL 2.1, which is mostly correct but with a few caveats.

  • 4-byte index buffers.

GLES 2.0, and vc4, don’t have GL_UNSIGNED_INT index buffers. To support them in vc4, we create a shadow copy of your index buffer with the indices truncated to 2 bytes. This is incorrect (and will fail an assertion in debug builds of Mesa) if any of the indices are greater than 65535. To fix that, we would need to detect this case and rewrite the index buffer and vertex buffers to do a series of draws, each with small indices and new vertex attrib bindings.

To avoid this problem, ensure that all index buffers are written using GL_UNSIGNED_SHORT, even at the cost of doing multiple draw calls with updated vertex attrib bindings.
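One way to follow that advice on the application side is to split a mesh into batches that each reference no more vertices than a 16-bit index can address. The Python sketch below is only an illustration and not something vc4 or Mesa does: split_indexed_triangles is a hypothetical helper, it assumes a plain triangle list, and it leaves uploading each batch's vertices and issuing the draw calls to the caller.

```python
def split_indexed_triangles(indices, max_vertices=65536):
    """Split a triangle list into batches whose local indices fit in 16 bits.

    Returns a list of (vertices, local_indices) pairs, where `vertices` holds
    the original vertex indices used by the batch (in local order) and
    `local_indices` indexes into that list.
    """
    if len(indices) % 3 != 0:
        raise ValueError("triangle list length must be a multiple of 3")
    if max_vertices < 3:
        raise ValueError("a batch must be able to hold at least one triangle")

    batches = []
    remap = {}     # original vertex index -> local index in the current batch
    vertices = []  # original vertex indices referenced by the current batch
    local = []     # rewritten indices for the current batch

    for i in range(0, len(indices), 3):
        tri = indices[i:i + 3]
        new_vertices = {v for v in tri if v not in remap}
        # Start a new batch if this triangle would overflow the current one.
        if local and len(vertices) + len(new_vertices) > max_vertices:
            batches.append((vertices, local))
            remap, vertices, local = {}, [], []
        for v in tri:
            if v not in remap:
                remap[v] = len(vertices)
                vertices.append(v)
            local.append(remap[v])

    if local:
        batches.append((vertices, local))
    return batches
```

Each returned batch would then map to one draw call using GL_UNSIGNED_SHORT indices, with the vertex attributes rebound to that batch's vertex data.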

  • Occlusion queries

The VC4 hardware has no support for occlusion queries. GL 2.0 requires that you support the occlusion queries extension, but you can report 0 from glGetQueryiv(GL_SAMPLES_PASSED, GL_QUERY_COUNTER_BITS). This is absurd, but it’s how OpenGL handles “we want the functions to be present everywhere, but we want it to be optional for hardware to support it.” Sadly, gallium doesn’t yet allow the driver to report 0 query bits.

  • Primitive mode

VC4 doesn’t support reducing triangles/quads/polygons to lines and points like desktop GL. If the front and back modes matched, we could rewrite the index buffer to the new primitive type, but we don’t. If the front and back modes don’t match, we would need to run the vertex shader in software, classify the prims, write new index buffers, and emit (possibly many) new draw calls to rasterize the new prims in the same order.

Bug Reporting

VC4 rendering bugs should go to Mesa’s gitlab issues page.

By far the easiest way to communicate bug reports for rendering problems is to take an apitrace. This passes exactly the drawing you saw to the developer, without the developer needing to download and build the application and replicate whatever steps you took to produce the problem. Traces attached to bug reports should ideally be small.

For GPU hangs, if you can get a short apitrace that produces the problem, that’s still the best. If the problem takes a long time to reproduce or you can’t capture it in a trace, describing how to reproduce it and including a GPU hang dump would be the most useful. Install vc4-gpu-tools <https://github.com/anholt/vc4-gpu-tools/> and use vc4_dump_hang_state my-app.hang. Sometimes the hang file will provide useful information.

Tiled Rendering

VC4 is a tiled renderer, chopping the screen into 64x64 (non-MSAA) or 32x32 (MSAA) tiles and rendering the scene per tile. Rasterization looks like:

(CPU) Allocate space to store a list of draw commands per tile
(CPU) Set up a command list per tile that does:
    Either load the current tile's color buffer from memory, or clear it.
    Either load the current tile's depth buffer from memory, or clear it.
    Branch into the draw list for the tile
    Store the depth buffer if anybody might read it.
    Store the color buffer if anybody might read it.
(GPU) Initialize the per-tile draw call lists to empty.
(GPU) Run all draw calls collecting vertex data
(GPU) For each tile covered by a draw call's primitive:
    Emit state packets to the list to update it to the current draw call's state.
    Emit a primitive description into the tile's draw call list.

Tiled rendering avoids the need for large render target caches, at the expense of increasing the cost of vertex processing. Unlike some tiled renderers, VC4 has no non-tiled rendering mode.
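The binning step above can be modelled in a few lines of Python. This is only a sketch to make the data flow concrete: bin_primitives and tile_grid are hypothetical names, primitives are reduced to pixel bounding boxes (a conservative stand-in for whatever coverage test the hardware applies), and state packets are omitted.

```python
def tile_grid(width, height, tile=64):
    """Number of tile columns and rows needed to cover the framebuffer."""
    return ((width + tile - 1) // tile, (height + tile - 1) // tile)


def bin_primitives(prims, width, height, tile=64):
    """Build per-tile draw lists.

    prims: list of (draw_id, (x0, y0, x1, y1)) bounding boxes in pixels,
           with x1/y1 exclusive.
    Returns a dict mapping (tile_x, tile_y) to the draw ids touching it,
    in submission order.
    """
    cols, rows = tile_grid(width, height, tile)
    # Initialize the per-tile lists to empty.
    lists = {(tx, ty): [] for ty in range(rows) for tx in range(cols)}
    for draw_id, (x0, y0, x1, y1) in prims:
        # Clamp the box to the framebuffer and skip empty boxes.
        x0, y0 = max(x0, 0), max(y0, 0)
        x1, y1 = min(x1, width), min(y1, height)
        if x0 >= x1 or y0 >= y1:
            continue
        # Append the primitive to every tile its box overlaps.
        for ty in range(y0 // tile, (y1 - 1) // tile + 1):
            for tx in range(x0 // tile, (x1 - 1) // tile + 1):
                lists[(tx, ty)].append(draw_id)
    return lists
```

A primitive that straddles a tile boundary lands in several lists, which is where the extra vertex-processing cost mentioned above shows up.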

Performance Tricks

  • Reducing memory bandwidth by clearing.

Even if your drawing is going to cover the entire render target, it’s more efficient for VC4 if you emit a glClear() of the color and depth buffers. This means we can skip the load of the previous state from memory, in favor of a cheap GPU-side memset() of the tile buffer before we start running the draw calls.

  • Reducing memory bandwidth with scissoring.

If all draw calls for the frame use a glScissor() that covers only part of the screen, then we can skip setting up the tiles outside that area, which means a little less memory used setting up the empty bins, and a lot less memory used loading/storing the unchanged tiles.
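To get a feel for the savings, the number of tiles a rectangle touches can be computed directly. The sketch below assumes the 64x64 non-MSAA tile size mentioned earlier and non-negative pixel coordinates; tiles_touched is a hypothetical helper, and the framebuffer and scissor sizes are made-up examples.

```python
def tiles_touched(x, y, width, height, tile=64):
    """Count the tiles overlapped by a rectangle given in glScissor() terms."""
    if width <= 0 or height <= 0:
        return 0
    cols = (x + width - 1) // tile - x // tile + 1
    rows = (y + height - 1) // tile - y // tile + 1
    return cols * rows


# Full 1920x1080 framebuffer: 30 columns by 17 rows of tiles.
full = tiles_touched(0, 0, 1920, 1080)
# A 256x256 scissor at (100, 100) is not tile-aligned, so it spans 5x5 tiles.
scissored = tiles_touched(100, 100, 256, 256)
```

Aligning the scissor rectangle to multiples of the tile size keeps the touched tile count at its minimum for a given area.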

  • Reducing memory bandwidth with glInvalidateFramebuffer().

If we don’t know who might use the contents of the framebuffer’s depth or color in the future, then we have to store it for later. If you use glInvalidateFramebuffer() before accessing the results of your rendering, then we can skip the store of the depth or color buffer. Note that this is unimplemented.

  • Avoid non-constant GLSL array indexing

In VC4 the only non-constant-index array access supported in hardware is uniforms. For everything else (inputs, outputs, temporaries), we have to lower them to an IF ladder like:

if (index == 0)
  return array[0]
else if (index == 1)
  return array[1]
...

This is very expensive as we probably have to execute every branch of every IF statement due to it being a SIMD machine. So, it is recommended (if you can) to avoid non-uniform non-constant array indexing.

Note that if you do variable indexing within a bounded loop that Mesa can unroll, that can actually count as constant indexing.
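The cost of the lowered ladder can be illustrated with a small Python model of SIMD lanes. This is a worst-case sketch, not a description of the actual compiler output: ladder_select is a hypothetical name, and it assumes every rung of the ladder executes for all lanes with a per-lane mask deciding which lanes keep the value.

```python
def ladder_select(array, lane_indices):
    """Emulate `array[index]` lowered to an IF ladder across SIMD lanes.

    Returns (per-lane results, number of rungs executed). Lanes whose index
    is out of range keep None.
    """
    results = [None] * len(lane_indices)
    rungs_executed = 0
    for i, value in enumerate(array):
        # Every rung runs once for the whole SIMD group...
        rungs_executed += 1
        for lane, idx in enumerate(lane_indices):
            # ...and only lanes whose index matches keep this element.
            if idx == i:
                results[lane] = value
    return results, rungs_executed
```

In this model the work grows with the array length rather than with the number of distinct indices in use, which is why a constant index (or a loop that Mesa can unroll) is so much cheaper.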

  • Increasing GPU memory: increase CMA pool size

The memory for the VC4 driver is allocated from the standard Linux CMA pool. The size of this pool defaults to 64 MB. To increase this, pass an additional parameter on the kernel command line. Edit the boot partition’s cmdline.txt to add:

cma=256M@256M

cmdline.txt is a single line with whitespace separated parameters.

The first value is the size of the pool and the second value is the start address of the pool. The pool size can be increased further, but it must fit into the memory, so size + start address must be below 1024M (Pi 2, 3, 3+) or 512M (Pi B, B+, Zero, Zero W). Note that this also reduces the memory available to Linux.
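The size constraint is easy to check before rebooting. The Python sketch below applies the rule literally (size + start strictly below the board's RAM); cma_fits is a hypothetical helper, and it only understands the size@start form with M suffixes used in the example above.

```python
import re


def cma_fits(param, ram_mib):
    """Return True if a `cma=<size>M@<start>M` parameter fits below ram_mib."""
    match = re.fullmatch(r"cma=(\d+)M@(\d+)M", param)
    if match is None:
        raise ValueError("expected a parameter of the form cma=<size>M@<start>M")
    size, start = int(match.group(1)), int(match.group(2))
    # The end of the pool must stay below the end of memory.
    return size + start < ram_mib
```

For example, cma_fits("cma=256M@256M", 1024) accepts the setting shown above for a Pi 2 or 3, while the same setting is rejected for a 512M board.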

  • Decrease firmware memory

The firmware allocates a fixed chunk of memory before booting Linux. If firmware functions are not required, this amount can be reduced.

In config.txt, set gpu_mem=16 if you do not need video decoding, or gpu_mem=64 if you do.

Performance debugging

  • Step 1: Known issues

The first tool to look at is running your application with the environment variable VC4_DEBUG=perf set. This will report debug information for many known causes of performance problems on the console. Not all of them will cause visible performance improvements when fixed, but it’s a good first step to see what might be going wrong.

  • Step 2: CPU vs GPU

The primary question is figuring out whether the CPU is busy in your application, the CPU is busy in the GL driver, the GPU is waiting for the CPU, or the CPU is waiting for the GPU. Ideally, you get to the point where the CPU is waiting for the GPU infrequently but for a significant amount of time (however long it takes the GPU to draw a frame).

Start with top while your application is running. Is the CPU usage around 90%+? If so, then our performance analysis will be with sysprof. If it’s not very high, is the GPU staying busy? We don’t have a clean tool for this yet, but cat /debug/dri/0/v3d_regs could be useful (adjust the path to wherever debugfs is mounted on your system, commonly /sys/kernel/debug). If CT0CA != CT0EA or CT1CA != CT1EA, that means that the GPU is currently busy processing some rendering job.
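The register comparison can be scripted. The Python sketch below is an assumption-heavy illustration: gpu_busy is a hypothetical helper, and it assumes each line of v3d_regs contains a register name ending in CT0CA, CT0EA, CT1CA or CT1EA with the register value as the last hexadecimal token on the line. The exact file format may differ between kernel versions, so check it against your own output first.

```python
import re

REQUIRED = {"CT0CA", "CT0EA", "CT1CA", "CT1EA"}


def gpu_busy(regs_text):
    """Return True if either control list's current address differs from its end address."""
    values = {}
    for line in regs_text.splitlines():
        name = re.search(r"(CT[01](?:CA|EA))\b", line)
        hexes = re.findall(r"0x[0-9a-fA-F]+", line)
        if name and hexes:
            # Assumption: the last hex token on the line is the register value.
            values[name.group(1)] = int(hexes[-1], 16)
    missing = REQUIRED - values.keys()
    if missing:
        raise ValueError("registers not found: " + ", ".join(sorted(missing)))
    return (values["CT0CA"] != values["CT0EA"]
            or values["CT1CA"] != values["CT1EA"])
```

Feeding it the text read from the v3d_regs file a few times per second gives a rough picture of how often the GPU has work in flight.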

  • sysprof for CPU usage

If the CPU is totally busy and the GPU isn’t terribly busy, there is an excellent tool for debugging: sysprof. Install, run as root (so you can get system-wide profiling), hit play and later stop. The top-left area shows the flat profile sorted by total time of that symbol plus its descendants. The top few are generally uninteresting (main() and its descendants consuming a lot), but eventually you can get down to something interesting. Click it, and to the right you get the callchains to descendants – where all that time actually went. On the other hand, the lower left shows callers – double-clicking those selects that as the symbol to view, instead.

Note that you need debug symbols for the callgraphs in sysprof to work, which is where most of its value is. Most distributions offer debug symbol packages from their builds which can be installed separately, and sysprof will find them. I’ve found that on ARM, the debug packages are not enough, and if someone could determine what is necessary for callgraphs in debugging, that would be really helpful.

  • perf for CPU waits on GPU

If the CPU is not very busy and the GPU is not very busy, then we’re probably ping-ponging between the two. Most cases of this would be noticed by VC4_DEBUG=perf, but not all. To see all cases where this happens, use the perf tool from the Linux kernel (note: unrelated to VC4_DEBUG=perf):

sudo perf record -f -g -e vc4:vc4_wait_for_seqno_begin -c 1 openarena

If you want to see the whole system’s stalls for a period of time (very useful!), use the -a flag instead of a particular command name. Just ^C when you’re done capturing data.

At exit, you’ll have perf.data in the current directory. You can print out the results with:

perf report | less

  • Debugging for GPU fully busy

As of Linux kernel 4.17 and Mesa 18.1, we now expose the hardware’s performance counters in OpenGL. Install apitrace, and trace your application with:

apitrace trace <application>          # for GLX applications
apitrace trace -a egl <application>   # for EGL applications

Once you’ve captured a trace, you can see what counters are available and replay it while looking at some of those counters:

apitrace replay <application>.trace --list-metrics

apitrace replay <application>.trace --pdraw=GL_AMD_performance_monitor:QPU-total-clk-cycles-vertex-coord-shading

Multiple counters can be captured at once with commas separating them.

Once you’ve found what draw calls are surprisingly expensive in one of the counters, you can work out which ones they were at the GL level by opening the trace up in qapitrace and using ^-G to jump to that call number and ^-L to look up the GL state at that call.

shader-db

shader-db is often used as a proxy for real-world app performance when working on the compiler in Mesa. On vc4, there is a lot of state-dependent code in the shaders (like blending or vertex attribute format handling), so the typical shader-db will miss important areas for optimization. Instead, anholt wrote a new one based on apitraces. Once you have a collection of traces, starting from traces-db, you can test a compiler change in this shader-db with:

./run.py > before
(cd ../mesa && make install)
./run.py > after
./report.py before after

Hardware Documentation

For driver developers, Broadcom publicly released a specification PDF for the 21553, which is closely related to the vc4 GPU present in the Raspberry Pi. They also released a snapshot of a corresponding Android graphics driver. That graphics driver was ported to Raspbian for a demo, but was not expected to have ongoing development.

Developers under NDA with Broadcom or Raspberry Pi can potentially get access to “simpenrose”, the C software simulator of the GPU. The Mesa driver includes a backend (vc4_simulator.c) to use simpenrose from an x86 system with the i915 graphics driver, with all of the vc4 rendering commands emulated on simpenrose and memcpyed to the real GPU.