Introduction

I’ve recently seen a discussion on twitter about getting to the root cause of things and documenting the process. Since joining Frostbite, I had a pleasure to work with some of the cutting edge tech (which is a euphemism for buggy, unstable, unpredictable, untested stuff). Investigating rendering artifacts / corruptions quite often leads to realization that the GPU driver or shader compiler is at fault. Tracking down driver issues has become mundane. Sharing the discoveries with a wider audience likely isn’t even worth while, since most people will never actually get exposed to the problem. On the other hand, tracking down intermittent / hard-to-repro issues in higher-level code always results in some juicy discovery and the investigation process often forces me to learn something new. I actually really like investigating bugs for this reason.

When I’m working on an issue, I keep plenty of various random notes (addresses, memory dumps, call stacks, code snippets, screenshots, logs). I starting working like that many years ago at Creative Assembly. Sometimes it’s possible to reconstruct a story out of those debugging notes. I used to share some of the more interesting issues on the internal mailing lists. These days, I typically write a summary of what lead to the problem and describe the solution in Perforce CL description directly, then mirror the data to JIRA.

I thought I’d share a story of investigating one particular bug that we encountered at Creative Assembly on Alien: Isolation in May 2013. Investigations were done by Tomasz Stachowiak, Michael Bailey, Angel Cuñado and me. This post is based on various notes that I made during investigation and the write-up that I sent to the team after the fix was submitted.

TL;DR

Legacy optimized code will eventually bite you.

The Black Screen Bug

The issue manifested as the screen going black sometimes, shortly after encountering some NPCs. Only in optimized 32-bit build configuration on PC. We have quickly narrowed down the visual issue to the post-processing system. One of the shader constants looked a little suspicious: -1.#IND000. The corresponding C++ code was sane:

float adaptationLuminance = calculateBracketedLuminanceMean(
	timeAveragedHistogram,
	lowBracket, highBracket);

The adaptationLuminance would sometimes become IND. We have eventually noticed that re-running the line above (by setting the instruction pointer in the debugger) magically turned IND into a sensible value! I’m sure some of the readers will immediately guess the root cause just from this information. However, back then this behavior made absolutely no sense to me or my colleagues.

The following code was inside calculateBracketedLuminanceMean:

for (ptrdiff_t i = blackPointPtr - &sat[0]; i < whitePointPtr - &sat[0]; ++i) {
	const float v = src[i];
	const float luminance = src.bucketToLuminance(i);
	bracketedLumSum += luminance * v;
	bracketedSum += v;
}

We duplicated this loop and then stepped through the code. The first copy sometimes produced IND in one of the values. Running the second copy immediately afterwards magically fixed everything. All this was single-threaded. There were no race conditions that would explain weird behavior. Just a pure function that sometimes returned broken result.

We concluded that the luminance calculation function is likely not at fault and there must be something happening earlier that leaves the FPU in a funky state. We decided to reset the FPU immediately before calculateBracketedLuminanceMean using FNINIT. It worked! After a while, we noticed something suspicious in the FPU register view of the debugger:

FPU Status

This page provides a detailed description of what various FPU registers mean, including the STAT register. I took this screenshot from there:

FPU Exception Flags

We removed the FNINIT and replaced it with code which logged the status register:

unsigned short fpstat = 0;
__asm fstsw fpstat;

We noticed that in the cases where IND was produced, the FPU stack bits were non-zero (something was on the FPU stack). This was quite unexpected. Also, this was a very promising clue!

We wrote a small function that checked various bits of STAT register (stack being empty, stack fault bit in the exception flags being unset, etc.) and sprinkled it all over the code-base. The function would trigger a breakpoint when FPU was in unexpected state. This allowed us to track down the problem to something completely unrelated to the post-processing system or even rendering. The culprit was in path planning / local avoidance system (Ha! Suddenly we understand why the bug was reproducible only after entering an area with NPCs!). This was the offending code:

Culprit

This looks like a fairly ordinary gameplay code. But it isn’t! Results of the ca_clamp_angle() calculations are used as the branch condition, however the body of the branch is empty (someone commented it out for whatever reason). Here’s what the disassembly looked like:

Culprit disassembly

Note that fldpi instruction pushes Pi on the FPU stack, fadd then is used to produce the 2*Pi value. This value is never popped from the stack!

We changed the code a little bit to actually have a non-empty branch body. Here’s the generated code that we observed:

Culprit disassembly with non-empty body

Now we see the fstp instruction, which stores the 2*Pi value and pops the stack. All this is very suspicious and little of it makes any sense. I believe the words “compiler bug” were said several times. Where does 2*Pi even come from? Why isn’t all this code just optimized away, since the branch body is clearly empty? We need to look at the implementation of ca_clamp_angle:

float ca_clamp_angle(float a) // clamps an angle to -Pi <= a < Pi
{
	return a - ca_round_to_float_fast(a * ca_inv_2pi()) * ca_2pi();
}

Oh look, we found our 2*Pi! Let’s look at how it’s implemented:

float ca_pi() { __asm fldpi}
float ca_2pi() { __asm fldpi __asm fadd st(0), st}

Looks familiar, doesn’t it? This works just fine for cases like this:

float foo = ca_2pi();
if(foo > 0) printf("hello!\n");

However, this will corrupt the FPU stack:

ca_2pi();

Compiler can’t remove inline assembly in this case:

Broken Pi

However, this code works:

Working Pi

Note how compiler automatically added the fstp instruction when we actually reference the result.

Mystery solved

Explicit FPU optimizations (questionable in 2013) in the core maths library caused an FPU stack corruption because compiler did not optimize away an inline assembly block. The stack remained non-empty after the offending gameplay code ran. This caused no real issues until a very FPU-heavy bucketToLuminance function ran inside the post-processing system. At that point the compiler would try to use the entire FPU stack, which it assumes to be empty. When stack overflowed, we got -1.#IND000 and passed it to the post-processing shader, which affected every pixel on screen.

Conclusion

We were quite lucky that FPU stack corruption resulted in a very obvious visual issue that forced us to fully investigate it. I’m sure this issue caused some intermittent glitches in the past which were brushed under the rug or just weren’t noticed. Corrupt FPU stack was a time bomb. I’m glad we don’t have to worry about x87 these days.

I’m sure that explicit FPU inline assembly code made some sort of sense when it was written (probably a decade earlier), however it most certainly did not make sense in 2013. Core maths libraries for 32-bit PC were never revisited since original implementation. People just used various functions from the library without experiencing any obvious problems. Nobody ever looked at the disassembly on PC, otherwise they would probably notice an unholy mix of x87 FPU and SSE instructions. Usage of inline assembly in optimized legacy code like this turned into an anti-optimization on more modern machines. It prevents the compiler from generating good code (const folding, reordering, vectorization). For example, here’s what the compiler can do if we write something sensible:

Better Pi

Alien codebase was pretty ancient. It evolved all the way from PS2 and GameCube days with no full re-writes. PC support was never a important in the past, since consoles were the target. Shipping Alien on PC, PS4 and XB1 meant not only fixing all the legacy maths code like this, but also porting everything to 64 bits, writing a modern renderer, etc. Good times!