UB or not UB: How gcc and clang handle statically known undefined behaviour
25 June 2024
Recently, we had a discussion in our team about undefined behaviour
(UB) in C. For those unfamiliar: we say that a program has undefined
behaviour when we write code for which the language specification doesn't
define what should happen during execution. That means compilers can do
whatever they like when they encounter such code, and there is no
guarantee that execution will behave in a predictable way. Thus,
undefined behaviour must be avoided at all costs, as it not only makes
programs misbehave but is also a common source of security vulnerabilities.
Examples of code that has undefined behaviour are out-of-bounds indexing
of an array, signed integer overflow, division by zero, and null-pointer
dereferencing [1].
Compilers often exploit undefined language semantics to make assumptions about the program.
For example, if we write something like int x = y/z, then
the compiler may assume that z is never zero, since division by
zero is undefined and programmers surely wouldn't write undefined code. It
can then use that information to further optimise the program:
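A small program along these lines shows the effect (a sketch; the constant 100 and the exact structure are illustrative):

int main(int argc) {
  int x = 100 / argc;  /* division by zero is UB, so argc is assumed non-zero */
  if (argc == 0)       /* clang removes this check entirely */
    return 1;
  return x;
}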
In this example, we can see that clang used the fact that division by
zero is undefined, and that argc therefore must not be zero, to
entirely remove the condition if (argc == 0), knowing that this
case can never happen [2].
Statically known undefined behaviour
While I knew that compilers can do clever optimisations under the assumption that no UB exists in the program, I was
wondering what they do when they statically detect the
existence of UB, or in other words, when we force the compiler
to compile code that both we and the compiler know is undefined. Eager to
find an excuse to use Compiler Explorer, I did some quick
experiments. For many, these results may not be surprising (and the experiments, if you can even call them that, are certainly
not exhaustive), but they satisfied my curiosity, and, by putting them out there, I hope others
may get some value from them too.
I need a hzero
The simplest program I could think of that forces UB in C is a
division by zero with a constant. The program and its output from gcc (v14.1) and clang
(v18.1), compiled to x86_64, are shown below:
Program

int main(int argc) {
  int ub = argc / 0;
  return ub;
}

gcc -O2

main:
        ud2

clang -O2

main:
        ret
During compilation, both gcc and clang give a warning:
<source>:2:17: warning: division by zero [-Wdiv-by-zero]
    2 |   int ub = argc / 0;
      |                 ^
However, while gcc compiled the program to a single illegal instruction
(ud2), clang reduced it to a ret. Under UB, both
approaches are valid, yet they are very different: one crashes the
program, while the other ignores the problematic code [3].
What happens if we change the program slightly, replacing the constant inside
the division with a variable:
Program

int main(int argc) {
  int i = 0;
  int ub = argc / i;
  return ub;
}

gcc -O2 -Wall

main:
        ud2

clang -O2 -Wall

main:
        ret
While the compiled programs stayed the same, we no longer get a warning (even
with -Wall), even though both compilers can easily work out
statically (e.g. via constant folding) that a division by zero occurs [4].
No guarantees
Let's add some more code before the division-by-zero line and see how this affects the output:
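Something along these lines (a sketch; the printf calls stand in for the extra code):

#include <stdio.h>

int main(int argc) {
  printf("before\n");  /* code leading up to the UB */
  int i = 0;
  int ub = argc / i;   /* statically known division by zero */
  printf("after\n");   /* code after the UB */
  return ub;
}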
Somewhat expectedly, gcc remains faithful to its crash approach, though note
that it only inserts the crash where it compiles the division by zero, not earlier, such as at the beginning of the function.
Clang, on the other hand, compiled both prints, before and after the division,
simply removing the division itself. As with the code after the division by zero,
there are also no guarantees for the code leading up to it. The mere existence
of UB in the program means all bets are off, and the compiler could choose to
crash the function immediately upon entering it [5].
If there's UB in a program but no one is around to use it, does it still make a sound?
Do compilers treat code that exhibits undefined behaviour but whose result is never used
like the proverbial soundless tree in the forest, and simply ignore it? Let's find out:
Program

int main(int argc) {
  int i = 0;
  int ub = argc / i;
  return 1;
}

gcc -O2

main:
        mov eax, 1
        ret

clang -O2

main:
        mov eax, 1
        ret
We can see that the answer to our question is "yes": both compilers have now optimised the division away.
Most likely, dead code elimination removed the division before
the compiler could figure out that it is UB.
Again, it is important to understand that this is something the
compilers chose to do (and only if we enable optimisations; otherwise, the division is compiled as is). Even if the UB "isn't used", that
doesn't mean the program has no UB. We just got "lucky" that the
compiler removed the dead code before realising it had UB.
There is no guarantee that other compilers
will do the same, nor that this behaviour will be consistent across
different versions of the same compiler. It would have been equally valid
to crash the program or open your CD-ROM drive.
That girl value is poison
We are now left with two questions: 1) Why do we often not get warnings
about UB in a program even when the compiler was able to work out that it exists?
2) Why are clang (and sometimes gcc) lenient when handling UB, compiling
(and running) the code instead of making it crash (e.g. by inserting an illegal
instruction)?
We can find answers to both questions in a blog
post by Chris Lattner.
With regard to the warnings, he explains that detecting UB would often generate too many
warnings to be useful (with lots of false positives). It's also difficult to
know when people want these warnings and when they don't (e.g. nobody cares about UB
in dead code). With regard to the leniency, especially in relation to our
programs above, the following paragraph from the blog post gives some insight:
“Arithmetic that operates on undefined values is considered to produce an
undefined value instead of producing undefined behavior. The distinction is
that undefined values can't format your hard drive or produce other undesirable
effects.”
These days LLVM mostly uses ‘poison’
values, which enable more optimisations
than ‘undef’, but the idea is the same: just because a value is the result of
undefined behaviour doesn't mean we need to immediately invalidate any code using that value. For example, when taking a poison value and anding it with 0, we may assume
that the result will always be 0, no matter what the actual poison value is.
This makes sense when, for example, the result of an undefined operation is
irrelevant for the execution of the remainder of the program, as the following
example shows:
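(A sketch; the key ingredients are a null-pointer dereference producing the undefined value and a bitwise or with a non-zero constant; the rest is illustrative.)

int main(int argc) {
  int *p = 0;
  int ub = *p;  /* null-pointer dereference: ub becomes a poison value */
  if (ub | 1)   /* x | 1 is always non-zero, so the branch is always taken */
    return 1;
  return 0;
}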
Since a bit-wise or with a non-zero value will always evaluate to
true, the if-condition will always succeed, no matter what the value of
ub is. In LLVM, arithmetic with poison values
doesn't necessarily produce another poison value; that is the case here,
which allows the compiler to remove the condition. Gcc, on the other hand, bailed out with a ud2 as soon as it saw the null-pointer dereference.
Conclusion
While these were all
very cherry-picked examples, they weren't selected in order to paint one compiler in a worse light.
The goal was to show a difference
in philosophies when handling UB: LLVM just carries on compiling when it can,
crossing its fingers that this won't cause problems later on, in an attempt
to make more programs run and to more closely match what it believes a developer, unaware of
the undefined behaviour in their code, might expect. Gcc, at least in the examples above, appears to be more conservative and
prefers to crash the program instead, making it more obvious to developers when
their programs contain UB. Neither approach is objectively better than the
other; both are equally valid in the face of UB, and which one to choose
ultimately comes down to the personal preference of the compiler developers and their
users.
[2]
When I wrote this example, I was fully
expecting gcc to do the same, and was surprised that it didn't. While this
is interesting in its own right, it is not the focus of this blog post.
[3]
The latter may or may not have bad effects later on, depending on how the value (which is now whatever happens to be in the RAX register at the time) is used after it is returned.
[4]
We can get both gcc and clang to produce errors at runtime with
-fsanitize=integer-divide-by-zero. However, this comes with a
performance overhead, and doesn't otherwise change the program: gcc still
crashes with ud2 while clang ignores the division.
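For illustration (the file name, line/column numbers, and exact output below are hypothetical, but the flag and UBSan's message format are real):

$ clang -O2 -fsanitize=integer-divide-by-zero div.c -o div
$ ./div
div.c:3:16: runtime error: division by zero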
[5]
At this point I was wondering if a compiler
could even choose to abort compilation when it can prove that there is UB and
that it will be executed. There is some discussion around this topic to be
found but I couldn't find a definitive answer to this question. Though I can
imagine that the reason for why they don't do this is probably that many
programs wouldn't compile at all otherwise.