Using and Abusing Unions

March 12, 2010

The C union type is one of those features that is generally frowned on by those who set programming standards for critical systems, yet is quite often used. MISRA C 2004 rule 18.4 bans them (“unions shall not be used”) on the grounds that there is a risk that the data may be misinterpreted. However, it goes on to say that deviations are acceptable for packing and unpacking of data, and for implementing variant records provided that the variants are differentiated by a common field.

According to K&R’s The C Programming Language, “A union is a variable that may hold (at different times) objects of different types and sizes, with the compiler keeping track of size and alignment requirements.” Note the words “at different times”. So it appears that they didn’t expect programmers to use unions to pack and unpack data, using code such as the following:

uint16_t unpack(uint8_t lobyte, uint8_t hibyte) {
    union {
        uint16_t wordData;
        uint8_t byteData[2];
    } temp;
    temp.byteData[0] = lobyte;
    temp.byteData[1] = hibyte;
    return temp.wordData;
}

I regard this as an abuse of unions. This code is not portable, because its behaviour is dependent on how the compiler lays out the union, and on whether the processor is big-endian or little-endian. Is there any need to use it? Let’s look at the alternative:

uint16_t unpack(uint8_t lobyte, uint8_t hibyte) {
    return (uint16_t)((((uint16_t)hibyte) << 8) | (uint16_t)lobyte);
}

This version also makes it clear that lobyte contains exactly the lower 8 bits of the data (which was probably read from an I/O port), and does not assume that uint8_t has exactly 8 bits (as opposed to at least 8 bits).

Is there any reason not to use this code and avoid the union? There might be a performance penalty, but only if two things are both true: the compiler is poor, or is running at a very low optimization level, so that it does not implement the shift using a move or byte-swap instruction; and the processor does not have barrel-shift hardware. In other cases, this version might be faster than using a union, because optimizing it does not require a variable to be eliminated.
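
As a minimal sketch of the shift-based approach in both directions: unpack is the function from the text, while lo_byte and hi_byte are hypothetical helper names introduced here purely for illustration. None of the three depends on byte order or on how the compiler lays out a union.

```c
#include <stdint.h>

/* Combine two bytes into a 16-bit word; same result on any byte order. */
uint16_t unpack(uint8_t lobyte, uint8_t hibyte) {
    return (uint16_t)(((uint16_t)hibyte << 8) | (uint16_t)lobyte);
}

/* Hypothetical inverse helpers (not part of the original text). */
uint8_t lo_byte(uint16_t word) {
    return (uint8_t)(word & 0xFFu);
}

uint8_t hi_byte(uint16_t word) {
    return (uint8_t)((word >> 8) & 0xFFu);
}
```

Splitting a word with lo_byte and hi_byte and passing the results back to unpack should give the original word on any target.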

ArC requires programs to be type-safe, and doesn’t make assumptions about endianness or struct and union layout and alignment. So it doesn’t support use of unions in this way. In the event that you really do need to use a union for packing or unpacking data, you can fool ArC like this:

#ifdef __ARC__
// define the shift version of unpack
...
#else
// define the union version of unpack
...
#endif

but you are then assuming responsibility for ensuring that the union version behaves correctly.
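
One way to discharge part of that responsibility is a start-up self-check. The sketch below fills in the conditional compilation under some assumptions: unpack_shift, unpack_union and unpack_self_check are hypothetical names, and the compiler is assumed to reinterpret the stored bytes when a union member other than the last one written is read, as mainstream C compilers do.

```c
#include <stdint.h>

/* Portable reference version, always available. */
static uint16_t unpack_shift(uint8_t lobyte, uint8_t hibyte) {
    return (uint16_t)(((uint16_t)hibyte << 8) | (uint16_t)lobyte);
}

#ifdef __ARC__
/* The verifier only ever sees the shift version. */
#define unpack unpack_shift
#else
/* Union version: matches unpack_shift only on a little-endian target. */
static uint16_t unpack_union(uint8_t lobyte, uint8_t hibyte) {
    union {
        uint16_t wordData;
        uint8_t  byteData[2];
    } temp;
    temp.byteData[0] = lobyte;
    temp.byteData[1] = hibyte;
    return temp.wordData;
}
#define unpack unpack_union
#endif

/* Returns 1 if the selected unpack agrees with the portable version. */
static int unpack_self_check(void) {
    return unpack(0x34u, 0x12u) == unpack_shift(0x34u, 0x12u);
}
```

Calling unpack_self_check() once during initialisation would catch, for example, a port of the union version to a big-endian target.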

What about using unions for their intended purpose, i.e. holding different types of data at different times? The usual criticism here is that C unions don’t have automatic discriminants, so the compiler cannot insert run-time checks. Why not verify formally that the data is never misinterpreted instead? What we need to ensure is that a union is only ever read through the same member as was last used to assign it. We express the concept that “member M was last used to assign the value of E” in ArC using the syntax E holds M. We can use a holds expression anywhere in a specification or any other ghost context, but not of course in real code. Here’s an example:

struct Status { ... };
struct Error { ... };

union StatusOrError {
    struct Status st;
    struct Error err;
};

static union StatusOrError lastResult;

Whenever ArC sees lastResult.err or lastResult.st being read, it will attempt to prove lastResult holds err or lastResult holds st respectively. If we want to write a function that assumes that lastResult holds a particular member, ArC will fail to verify the function unless we declare that assumption as a precondition. For example:

void displayError()
    pre(lastResult holds err)
{ ... lastResult.err ... }

void displayStatus()
    pre(lastResult holds st)
{ ... lastResult.st ... }

Now ArC will need to verify that the precondition holds at each call to displayError or displayStatus:

lastResult.err = ... ;
displayError();     // OK
displayStatus();    // verification failure here

So we have made unions type safe, effectively by adding a ghost discriminant that can be interrogated by a holds expression. If you want to store a real discriminant, you can tie the two together using an invariant:

struct WrappedStatusOrError {
    union StatusOrError stOrErr;
    enum { disc_st, disc_err } disc;
    invariant((disc == disc_st) == (stOrErr holds st))
    invariant((disc == disc_err) == (stOrErr holds err))
};
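
For comparison, here is a minimal sketch of how the same wrapped union might look in plain C without ArC, where the relationship between discriminant and active member is maintained by hand and checked at run time rather than proved. The fields of Status and Error are elided in the text, so the code members below are placeholders, and setStatus, setError, getStatus and getError are hypothetical accessor names.

```c
#include <assert.h>

struct Status { int code; };   /* placeholder field; elided in the text */
struct Error  { int code; };   /* placeholder field; elided in the text */

union StatusOrError {
    struct Status st;
    struct Error  err;
};

struct WrappedStatusOrError {
    union StatusOrError stOrErr;
    enum { disc_st, disc_err } disc;   /* real discriminant */
};

/* Writers update the member and the discriminant together. */
void setStatus(struct WrappedStatusOrError *w, struct Status s) {
    w->stOrErr.st = s;
    w->disc = disc_st;
}

void setError(struct WrappedStatusOrError *w, struct Error e) {
    w->stOrErr.err = e;
    w->disc = disc_err;
}

/* Readers check the discriminant before touching the union. */
struct Status getStatus(const struct WrappedStatusOrError *w) {
    assert(w->disc == disc_st);
    return w->stOrErr.st;
}

struct Error getError(const struct WrappedStatusOrError *w) {
    assert(w->disc == disc_err);
    return w->stOrErr.err;
}
```

The assertions cost a comparison on every read and only catch a mistake when it happens. The ArC invariants express the same relationship, but it is checked at verification time instead.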

Unions are rarely used in regular C++ programming, because variant data is almost always better represented using a class inheritance hierarchy. However, that approach normally requires dynamic memory allocation. Therefore, C-style unions still have a place in embedded C++ programming.

  1. AnonCSProf
    March 13, 2010 at 10:17

    I definitely encourage you to take a look at Deputy. (http://deputy.cs.berkeley.edu/) They’ve got an attractive mechanism for discriminated unions.

    (If you’re going to tackle a problem that others have solved before, it’s worth knowing what approach others have taken before going your own direction, so you can at least be aware of what they were trying to solve, and perhaps even can learn from them. As one of my colleagues likes to say, a week in the lab can save you a day in the library.)

    • March 13, 2010 at 11:17

      The syntax used by Deputy for discriminated unions is very neat. I guess that with Deputy, you always need a discriminant in a union, so that it can do the run-time checks. I’m taking the approach that a discriminant is not always needed, because we use formal verification to ensure that only the last-assigned member is read. In the case that the programmer does want a discriminant, the question is whether we should relate it to the active member using a mechanism that already exists in the language (i.e. ArC invariants), or introduce a special, simpler notation for this particular case. If it turns out that discriminated unions are frequently used, then the special notation will be worth adding.

  2. Dave Banham
    March 23, 2010 at 20:47

    I tend to use C in two ways: as a low level “on the metal” programming language and as a higher level application programming language. Unions have their place at both of these “levels”. For sure there is the application for holding different types of data in overlapped memory, where a state, or discriminant, variable indicates the type of data that is stored. Message queues are a classic example when more than one type of message can be sent. And this idiom applies at both levels of programming.

    However, for low level programming the application of a union for binary level access to data (by overlaying an unsigned char array) is much better than conning the compiler with an (unsigned char*)(void *) cast on a pointer to the data to be accessed. How else can the bytes of a long double be accessed so that they can be packed into a serial communication message frame? Or equally unpacked from a received communications message frame? Yes, you do need to understand the endianness of the machine and other such horrors (like the potentially uninitialised padding in structures). And yes, this is non-portable code. But that is generally the case for low level code.

    Can ArC be applied at both levels of programming, or just at the application level?

    Regards
    Dave B.

    • March 23, 2010 at 21:50

      I would prefer to see non-portable code such as serializing doubles to bytes confined to a small number of library functions. These library functions can be left unverified by ArC. You can still give them specifications so that ArC knows that the serialize and deserialize functions are the inverse of each other. Serializing structures should, I think, be done by iterating over the fields, thereby making the code independent of padding.
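
      As a rough sketch of that field-by-field approach, assuming a hypothetical struct Reading and a big-endian wire format (neither of which comes from the discussion above):

```c
#include <stdint.h>

/* Hypothetical record; the compiler may insert padding after 'channel'. */
struct Reading {
    uint8_t  channel;
    uint32_t value;
    uint16_t flags;
};

#define READING_WIRE_SIZE 7u   /* 1 + 4 + 2 bytes, no padding on the wire */

/* Write each field explicitly, most significant byte first. */
void serializeReading(const struct Reading *r, uint8_t out[READING_WIRE_SIZE]) {
    out[0] = r->channel;
    out[1] = (uint8_t)(r->value >> 24);
    out[2] = (uint8_t)(r->value >> 16);
    out[3] = (uint8_t)(r->value >> 8);
    out[4] = (uint8_t)(r->value);
    out[5] = (uint8_t)(r->flags >> 8);
    out[6] = (uint8_t)(r->flags);
}

/* Inverse of serializeReading. */
void deserializeReading(const uint8_t in[READING_WIRE_SIZE], struct Reading *r) {
    r->channel = in[0];
    r->value = ((uint32_t)in[1] << 24) | ((uint32_t)in[2] << 16)
             | ((uint32_t)in[3] << 8)  |  (uint32_t)in[4];
    r->flags = (uint16_t)(((uint16_t)in[5] << 8) | (uint16_t)in[6]);
}
```

      Because only named fields are read and written, any padding bytes inside the structure never reach the wire.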
