
Danger – unsigned types used here!

April 7, 2010

By way of a change from the last two posts on formal verification, this time I’m going to talk about using unsigned types in C/C++. Modern programming languages such as Java don’t provide unsigned types, with good reason. (C# does have unsigned types, but only for the purpose of interfacing to COM objects and other code written in other languages, and they are not used in the .NET Framework API.)

To illustrate the dangers of using unsigned types, I invite you to consider this example (in which uint16_t and int32_t are typedef’d to be unsigned 16-bit and signed 32-bit types respectively) and decide whether it makes safe use of unsigned values:

void foo(const uint16_t* reading, const size_t size) {
  size_t i;
  for (i = size - 1; i >= 0; --i) {
    doSomething(reading[i]);
    if (i > 0) {
      int32_t diff = reading[i] - reading[i - 1];
      doSomethingElse(diff);
    }
  }
}

Why are unsigned types dangerous in C/C++? Here are some reasons:

1. C/C++ provide implicit type conversions between signed and unsigned values. Unlike Ada, there is no runtime check to make sure the value is convertible to the new type. For example, you can readily “convert” a negative signed value to an unsigned value.

2. If you mix signed and unsigned operands in an arithmetic operation, the signed operand is always converted to unsigned. This may not be what you wanted.

3. In arithmetic expressions, operands whose type is smaller than (unsigned) int get promoted to int or unsigned int. This complicates the situation – especially if the result is implicitly converted again, because the implicit type conversions may not be happening where you think they are.

4. It is very easy to underflow the minimum value of an unsigned int (i.e. zero). It is much more difficult to underflow or overflow the minimum or maximum value of a signed int, especially on a 32-bit (or greater) platform. The short sketch below illustrates some of these pitfalls.
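To make these concrete, here is a small demonstration program (a sketch, assuming a typical platform with a 32-bit int) showing points 1, 2 and 4 in action:

#include <stdio.h>

int main(void) {
  unsigned int u = -1;      /* point 1: no runtime check; u silently becomes UINT_MAX */
  int s = -1;
  unsigned int zero = 0;
  if (s < 1u) {             /* point 2: s is converted to unsigned, so this test is false */
    puts("this line is never printed");
  }
  printf("%u\n", zero - 1); /* point 4: underflow wraps, printing 4294967295 with 32-bit int */
  printf("%u\n", u);        /* also 4294967295 */
  return 0;
}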

How many problems did you spot in the example? There are at least three.

The first one is the use of size - 1 as the initial value of the loop counter i. Since size is of type size_t, which is an unsigned type, size - 1 will underflow if size is zero. In this case, the loop will count down from the maximum value of a size_t rather than not executing at all.

The second problem is the use of i >= 0 in the for-loop header, with i again of type size_t. The loop will never terminate, because i cannot go negative.

The third problem concerns the assignment of reading[i] - reading[i - 1] to diff. Suppose reading[i - 1] is greater than reading[i], for example by one. Will diff end up as -1? Unfortunately, that depends on your compiler and target platform. If uint16_t maps to unsigned short and int32_t maps to int, and assuming 2’s complement hardware, then yes. Both readings will be promoted to unsigned int prior to subtraction, yielding 0xFFFFFFFF. This is then converted implicitly to int for the assignment to diff, yielding -1. But if uint16_t maps to unsigned int and int32_t maps to long, we get a different result. The subtraction yields 0xFFFF this time, which is converted to 0x0000FFFF for assignment to diff.
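To see the first problem in isolation, here is a minimal sketch of what size - 1 does when size is zero:

#include <stdio.h>
#include <stddef.h>

int main(void) {
  size_t size = 0;
  size_t i = size - 1;               /* problem 1: underflows to the maximum size_t value */
  printf("%lu\n", (unsigned long)i); /* e.g. 4294967295 on a 32-bit platform */
  /* problem 2 follows from the same fact: i >= 0 is true for every
     size_t value, so a loop with that condition never terminates */
  return 0;
}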

If you regularly use static analysis on your C/C++ programs (as I hope you do), you might like to check whether your static analyzer reports all three problems for this example.

What’s the best way to manage the dangers inherent in using unsigned types? One strategy is Just Say No: don’t use unsigned types, except in very special situations. Is this really feasible? When using a 32-bit (or greater) platform, I think it is. You’ll need to define a signed size-type (i.e. an integral type with the same size as size_t) to use as the natural type for representing array indices, string lengths, and sizes. On most platforms, ptrdiff_t is a suitable type to start from. Whenever you call a library function that returns a size_t, or use a sizeof expression, you should immediately cast the result to your signed size type. You may have difficulties if you use an array or string that takes up more than half the address space of the processor, but are you ever going to do that?

The other thing you’ll need to do is convert unsigned data read from the hardware to signed data as soon as it is read in, after any necessary shifting and masking. If you read an unsigned value from a 16-bit A-to-D converter, you’ll need to store each value in 32 bits. Alternatively, if you need to store lots of them, you can exempt that data from Just Say No and store them as 16-bit unsigned values, doing the conversion to 32-bit signed int whenever you pick a value out of the store.
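As a sketch of what this policy might look like in code (the typedef name sz_t and the wrapper function lengthOf are illustrative choices of mine, not a prescription):

#include <stddef.h>
#include <string.h>

typedef ptrdiff_t sz_t;  /* our signed size type, built on ptrdiff_t */

/* Wrap library functions that return size_t, so that the cast to
   signed happens in one well-defined place. */
static sz_t lengthOf(const char* s) {
  return (sz_t)strlen(s);
}

void example(const char* name) {
  sz_t len = lengthOf(name);      /* signed, not size_t */
  sz_t size = (sz_t)sizeof(int);  /* cast sizeof results immediately */
  (void)len;
  (void)size;
}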

Here’s my original example using Just Say No, assuming that the readings came from a 15-bit (or less) A-to-D converter, with sz_t standing for our signed size type:

void foo(const int16_t* reading, const sz_t size) {
  sz_t i;
  for (i = size - 1; i >= 0; --i) {
    doSomething(reading[i]);
    if (i > 0) {
      int32_t diff = reading[i] - reading[i - 1];
      doSomethingElse(diff);
    }
  }
}

The only changes I made are to replace unsigned types by signed types. You’ll still need to use unsigned values where you do bitwise operations, such as shifting and masking. But you probably only need to do these operations on raw incoming data, before storing the result as signed data. Also, if you are using bit fields, any fields that store data of Boolean or enumeration type should be declared unsigned.
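For example, here is a sketch of raw hardware data being shifted and masked while still unsigned, then converted to signed exactly once; the register layout is hypothetical:

#include <stdint.h>

/* Bit fields holding Boolean or enumeration data, declared unsigned: */
struct Flags {
  unsigned int ready : 1;
  unsigned int mode  : 2;
};

/* Extract a 12-bit reading from the top bits of a raw 16-bit register
   value, then store it as signed from that point on. */
int32_t decodeReading(uint16_t raw) {
  uint16_t masked = (uint16_t)((raw >> 4) & 0x0FFFu);
  return (int32_t)masked;
}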

If Just Say No is too radical for you, then the alternative is to embrace unsigned values, but be very careful using them. You must use a good static analyzer to detect possible problems. MISRA C:2004 rule 10.1 prohibits implicit signed/unsigned conversions (amongst others), so a static analyzer that enforces MISRA compliance should catch them. Even better, use a formal tool such as ArC. Here’s a safer version of our example that still uses unsigned types:

void foo(const uint16_t* reading, const size_t size) {
  size_t i;
  for (i = size; i != 0; ) {
    --i;
    doSomething(reading[i]);
    if (i > 0) {
      int32_t diff = (int32_t)(reading[i]) - (int32_t)(reading[i - 1]);
      doSomethingElse(diff);
    }
  }
}

I’ve changed the loop to count down from size instead of size - 1, I’ve changed the termination condition, and I’ve moved the decrement of i to the start of the loop body. I’ve also cast the unsigned readings to signed before subtracting them.

Have you been caught out by the behaviour of unsigned types? Does your organization have a policy on using them? As always, I welcome your feedback – just click on the Leave a comment link near the top of this page.

  1. AnonCSProf
    April 22, 2010 at 21:37

    I would assert that the problem is not the presence or use of unsigned types, but the presence of implicit non-value-preserving type conversions.

    If you use signed types to hold lengths, that creates a different set of issues, such as integer overflow and signed/unsigned bugs.

    Also: when using an unsigned type, it’s a bad idea to write a loop like

    for (i=size-1; i>=0; i--)
      ...
    

    That’s a bad idiom! Better:

    for (i=0; i<size; i++)
      ...
    

    Blame the bad idiom, not the use of unsigned types.

    I’d say: best practice in C is to use unsigned types like size_t to hold lengths, and to write your code carefully to avoid integer overflows and non-value-preserving conversions/casts. Because C’s rules about implicit conversions are tricky, I generally suggest that you try to avoid mixing signed and unsigned integral types in the same expression, or exercise great caution everywhere that you do.

    • April 22, 2010 at 22:23

      Thanks for your feedback. I was expecting this post to be controversial, but you are the first to post a comment.

      I agree that implicit non-value-preserving type conversions are one of the main problems in this area. Fortunately, these conversions are easy to detect using static analysis, so they can be banned. However, absence of underflow and overflow cannot, in general, be ensured by conventional shallow static analysis, but only by theorem proving, model checking, rigorous argument or similar. One of the dangers I see in using unsigned types is that it is very easy to write code that causes underflow, whereas underflow/overflow in signed types is much rarer.

      Regarding your assertion that using signed types to hold lengths creates a different set of issues, on a typical 32-bit platform the only way I think you could encounter an overflow issue is to create an array of char that takes up at least half the address space of the computer (so that a signed integer index is insufficient), or take ‘sizeof’ any array that uses at least half the address space. I have never come across a program that uses arrays as big as this. Signed/unsigned bugs can be avoided by not using unsigned types at all, or where they cannot be avoided (sizeof, return from strlen etc.) by casting the unsigned value immediately to signed.

      Regarding your for-loop examples, the idiom I gave is a common way of iterating through an array in reverse. Your suggested “better” idiom is not equivalent as it iterates in a forward direction, assuming that ‘i’ is to serve as the array index. There are many situations in which it is necessary to iterate over an array in reverse, for example when searching for the last element in the array that satisfies some property. Using signed types, it is easy to write a correct for-loop header, whichever direction you want to iterate in; but using unsigned types, you need to be careful if you want to iterate in the reverse direction, as my example was intended to show.

      I agree entirely with your suggestion that signed and unsigned types should not be mixed in the same expression. A policy of avoiding unsigned types makes this easier to implement. However, if formal analysis is used, then it matters less which policy is used, since underflow and overflow can be shown not to occur.

  2. AnonCSProf
    April 23, 2010 at 04:38

    Thanks for the comments. I guess maybe I didn’t quite understand the argument.

    Are you arguing that, if we have an integral value that ought to be non-negative, it’s better to store it in a signed int than an unsigned int, because if some intermediate expression happens to go a little bit negative, we won’t have a surprising departure from the standard mathematical rules of arithmetic?

    If so, is this still relevant if we avoid underflow and overflow? Don’t you have to avoid underflow and overflow either way, regardless of whether you’re using a signed type or an unsigned type? And if you can prove absence of underflow and overflow, then there’s nothing wrong with using unsigned types to represent integer values that are supposed to be non-negative.

    I realize that proving that underflow and overflow will never happen requires sophisticated verification technology; on the other hand, aren’t you building a tool with exactly those kinds of sophisticated capabilities?

    If you’re using a signed type to hold a length (which should always be non-negative), and you allow it to go negative, that might not be an “underflow” condition from the language perspective, but isn’t it still a problem? And if you later pass that length, e.g., as an argument to memcpy(), you now have a signed-to-unsigned conversion that causes problems.

    As a security person, I’m uncomfortable with saying that it’s OK to use “int” to hold lengths and not bother to worry about arrays of size 2^31 or larger. I can imagine attacks that might be able to exploit such bugs.

    In network protocols, it is not unusual to transmit variable-length data as a (length,value) pair. The code that parses this pair has to read in a length, then allocate enough room. If the length is stored as a signed int, then you have to worry about lengths that will cause the int to go negative. Indeed, there is an entire class of security vulnerabilities that arise specifically when lengths are stored in a signed int, which go away if you use an unsigned int. Consider:

    char buf[64];
    int len = read_length_from_network();
    if (len > 64)
      return;
    rv = read(networkfd, buf, len); // kaboom!
    

    Naively, the code looks like it performs all necessary checks to avoid array out-of-bounds/buffer overflow; but the implicit conversion from signed to unsigned (which occurs because the third argument to read() is declared as a size_t) causes disaster.

    I agree that iterating in a reverse direction using unsigned types is tricky, and if you really need to iterate in the reverse direction (which I suspect is only a rare fraction of the time), you’re going to have to be careful. Is using a pointer for iteration safer in this case?

    uint16_t *p;
    for (p = reading+size-1; p>=reading; p--)
      ...
    
    • April 23, 2010 at 09:23

      I agree with nearly everything you say. If you’re verifying absence of underflow and overflow (as you should in safety-critical or security-critical applications), then it doesn’t matter whether you use unsigned or signed types. If I chose to use signed types only, I would use constrained typedefs (see my earlier post) to impose limits. In a network protocol, I would transmit the length as an unsigned value, but I might convert it to a larger signed type after reading it in. Regarding your suggestion to do reverse iteration using pointers, technically the pointer assignment and pointer comparison in your code have undefined behaviour when p has become equal to reading, because section 3.3.6 of the C standard requires that the result of adding/subtracting a pointer and an integer is a pointer to an element of the array, or to one past the last element. A pointer to one before the first element is not permitted; although I doubt it would cause problems in practice on most platforms.
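      If you do want a pointer-based reverse loop, a sketch of a safe variant (along the lines of the index-based version in my post) is to start at one past the end and decrement before each use:

      const uint16_t *p;
      for (p = reading + size; p != reading; ) {
        --p;
        doSomething(*p);
      }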

  3. AnonCSProf
    April 23, 2010 at 16:09

    Thanks, that makes sense. I do appreciate the detailed explanation.

    Good point about the pointer-based solution: I overlooked that issue. Oops.

  4. Yannick Moy
    May 7, 2010 at 22:28

    David,

    Very interesting post, though I think there is a little confusion regarding the rules for promotion of operands. Your example where “uint16_t maps to unsigned int and int32_t maps to long” does not result in 0x0000FFFF. I checked it on my machine with the following program:

    #include <stdio.h>
    
    int main() {
      unsigned short us1 = 0;
      unsigned short us2 = 1;
      int res1 = us1 - us2;
      unsigned int ui1 = 0;
      unsigned int ui2 = 1;
      long res2 = ui1 - ui2;
      printf ("res1 is %d, res2 is %d\n", res1, res2);
      return 0;
    }
    

    Although I do not have a C standard right now to check, I believe your rule 2 is slightly wrong. Rule 3 applies first, so that the operands are promoted to int or unsigned int, and only then both are converted to unsigned int if one is unsigned int. (Not talking here about (unsigned) long long.)

    You may find the following article interesting: it details the results of a hunt for integer overflow bugs in a large Microsoft codebase…

    • May 8, 2010 at 08:53

      Hi Yannick,

      Thanks for your comment. Your program is slightly wrong: the format specification should be “res1 is %d, res2 is %ld\n” (you need to use %ld for a long, not %d). Also, on most 32-bit platforms, “long” and “int” are the same anyway. So to demonstrate the problem, you need to be using a platform and compiler for which “long” really is longer than “int” (such as gcc in 64-bit mode), or else change the type of res2 to “long long” and use %lld as the format specifier. I just tried the latter, using Visual C++ 2008, and got the result I was expecting (i.e. res2 is not -1).
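      For reference, here is the program with those changes applied (my reconstruction, assuming a 16-bit short and a 32-bit int, with res2 changed to long long so that the effect shows on a 32-bit platform):

      #include <stdio.h>

      int main() {
        unsigned short us1 = 0;
        unsigned short us2 = 1;
        int res1 = us1 - us2;        /* operands promoted to int, so res1 is -1 */
        unsigned int ui1 = 0;
        unsigned int ui2 = 1;
        long long res2 = ui1 - ui2;  /* unsigned subtraction wraps, so res2 is 4294967295 */
        printf("res1 is %d, res2 is %lld\n", res1, res2);
        return 0;
      }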

  5. Yannick Moy
    May 8, 2010 at 10:25

    You are right about the long long issue, but then your original example still does not give the appropriate result. The result is not 0x0000FFFF (65535) but 0x00000000FFFFFFFF (4294967295). Hope I am correct this time, I have to think about these things twice each time! C machine integers are just too complex to use without an analyzer on your side.

    • May 8, 2010 at 11:13

      In my post I was assuming that uint16_t and int32_t have been typedef’d so as to define an unsigned 16-bit type and a signed 32-bit type respectively, hence their names (these sorts of typedefs are normally used when writing critical embedded software). Under that assumption, my post is correct. Using “long long” to simulate the problem on a 32-bit platform, I get the same result as you.

  6. Yannick Moy
    May 8, 2010 at 13:19

    I am referring to this sentence in your post:

    “But if uint16_t maps to unsigned int and int32_t maps to long, we get a different result. The subtraction yields 0xFFFF this time, which is converted to 0x0000FFFF for assignment to diff.”

    The subtraction certainly does not yield 0xFFFF between zero and one as two unsigned ints.

    • May 8, 2010 at 14:27

      You’re assuming that what’s true for the C compiler you used is true of all other compilers. That isn’t so. The C compilers for 16-bit processors store an ‘int’ in only 16 bits. That’s why developers of critical embedded software define types such as uint16_t and int32_t, and use those types instead of int, short, long etc. – so that they always know the bounds of the variables they declare. Using a compiler that stores ‘int’ in 16 bits, the result will be 0xFFFF.

  7. Yannick Moy
    May 8, 2010 at 14:44

    I am only saying this example you show is inconsistent with the previous one with which you contrast it:

    “If uint16_t maps to unsigned short and int32_t maps to int, and assuming 2’s complement hardware, then yes. Both readings will be promoted to unsigned int prior to subtraction, yielding 0xFFFFFFFF.”

    In this previous example, you indeed assume a 32-bit int.

    But yes, you’re right that some sizes for integral types do trigger the behaviour you describe. All these subtle differences only show that treating integers as bit vectors, as C does, is asking for trouble.

    • May 8, 2010 at 14:54

      Yes, and in the following sentence, I said that the 0xFFFFFFFF result from the unsigned subtraction will be implicitly converted to -1 for the assignment to int (i.e. the bit pattern of all 1s is interpreted as -1). Do you doubt that this is the case, or have I misunderstood the point you are making?

  8. Yannick Moy
    May 8, 2010 at 15:04

    I agree with your arguments. I am just saying that your first example assumes a 32-bit int while your second example assumes a 16-bit int. That confused me.

    • May 8, 2010 at 15:06

      I’m sorry I didn’t make that clear – I assumed it would be apparent from the type names I chose.

  9. Dave Banham
    May 21, 2010 at 20:49

    After many years of using typedef names that included the size of the type, as in uint16_t and uint32_t, I have stopped and now favour the use of abstract type names. I’m even quite happy about a naked int for a loop index! Why? Because I started coming across a class of errors that related to mismatches between the implied type size and the actual size.

    For example, typedef int int16_t does not guarantee that objects declared as int16_t implement modulo-2^16 (2’s complement) arithmetic. On most 32-bit machines you get modulo-2^32 arithmetic. Worse still, typedef signed char int8_t most definitely won’t give you modulo-2^8 arithmetic when using an ISO C90/99 conforming compiler on any class of machine.
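    For example (a sketch along those lines, assuming a 32-bit int):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
      uint8_t a = 200;
      uint8_t b = 100;
      int sum = a + b;                    /* operands promoted to int: sum is 300, not 44 */
      uint8_t wrapped = (uint8_t)(a + b); /* modulo-2^8 behaviour returns only on conversion back */
      printf("%d %u\n", sum, (unsigned)wrapped);  /* prints 300 44 */
      return 0;
    }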

    So by this argument, any attempt to include sizing information in a type name does not lead to more portable code. Moreover, it tends to stop developers/programmers from thinking about what they are doing, and this causes defects to be introduced, typically ones that don’t manifest until the software enters the sustaining/maintenance part of its life cycle. That makes them nasty defects to find.

    • May 24, 2010 at 07:35

      Good point, you can’t assume that intN_t or uintN_t implements modulo-2^N arithmetic. I think the main benefit of these typedefs is that when you have some external data that you know fits in N bits, you can declare variables and arrays that you know can hold that data. Using intN_t or uintN_t as an array index type makes less sense, which is why I prefer to typedef a signed index type, or use size_t if I am allowing unsigned types. Your practice of using int as an array index type is fine for 16-bit or 32-bit embedded software, but potentially too small in a 64-bit environment, where array indices might be larger than 31 bits.

    • Nick Hounsome
      September 27, 2012 at 10:48

      A very late reply: you don’t typedef your own int16_t, you include <stdint.h>.
      int16_t is required to be EXACTLY 16 bits.
      If the processor does not have a 16-bit integer then int16_t won’t exist.
      For these rare processors you have int_least16_t (the smallest type with at least 16 bits) and int_fast16_t (at least 16 bits, but possibly more if more bits is faster).

  10. Clark
    July 18, 2010 at 03:45

    “You may have difficulties if you use an array or string that takes up more than half the address space of the processor, but are you ever going to do that?”

    Yes, yes I will. I like to have my programs able to handle the full range of the numbers I give them, to fix a problem I’ve never experienced. When you need to handle the “size” of ANYTHING, it should ALWAYS be size_t. NEVER a signed type. This prevents security bugs (as have already been mentioned) as well as other bugs when the size DOES exceed 2GB on x86. When you think about it, that isn’t actually very much memory at all.

    Programmers like you make buffer overflows trivial to exploit.

    • July 18, 2010 at 08:15

      If you write a program that tries to allocate a single array 2GB or bigger on a 32-bit machine as a result of some untrusted input it receives, then I suggest there is something seriously wrong with your input validation. Even if you do use size_t, you will run into the same problems when the input calls for a 4GB, 6GB or even larger array. BTW the C standard makes no guarantee that size_t is big enough to represent the entire address space available.
