Verifying pointer arithmetic

July 16, 2010

Today I’ll look at whether code that uses pointer arithmetic is any harder to verify than equivalent code that does not use pointer arithmetic.

Consider this function for copying an array (or part of an array) into another array (or part of another array):

void arrayCopy(const int* src, int* dst, size_t num) {
  size_t i;
  for (i = 0; i < num; ++i) {
    dst[i] = src[i];
  }
}

To verify it, we first need to specify what it is supposed to achieve in a postcondition and add any necessary preconditions. Then we need to add a loop invariant so that ArC can verify that the loop achieves the desired state when it terminates, and a loop variant so that ArC can prove that the loop terminates:

void arrayCopy(const int* array src, int* array dst, size_t num)
writes(dst.all)
pre(src.lim >= num; dst.lim >= num)
pre(disjoint(src.all, dst.all))
post(forall i in 0..(num - 1) :- dst[i] == src[i])
{ size_t i;
  for (i = 0; i < num; ++i)
  keep(i <= num)
  keep(forall j in 0..(i - 1) :- dst[j] == src[j])
  decrease(num - i)
  { dst[i] = src[i];
  }
}

This is fairly straightforward. As usual, the main loop invariant (the keep(forall…) clause) is a generalisation of the desired postcondition, and we need another invariant (the keep(i <= num) clause) to ensure that the main invariant and the body don’t violate the preconditions of the indexing operator. These invariants and the loop variant (the decrease(…) clause) are sufficient for ArC to verify this function.

Many C programmers would write the body as *dst++ = *src++; instead of explicitly indexing dst and src. If we make this change, then because we are changing src and dst within the loop, we need to describe their values in the invariant. Here is how we can do that:

void arrayCopy(const int* array src, int* array dst, size_t num)
writes(dst.all) pre(src.lim >= num; dst.lim >= num)
pre(disjoint(src.all, dst.all))
post(forall i in 0..(num - 1) :- dst[i] == src[i])
{ size_t i;
  for (i = 0; i < num; ++i)
  keep(i <= num)
  keep(forall j in 0..(i - 1) :- (old dst)[j] == (old src)[j]) 
  keep(src == (old src) + i; dst == (old dst) + i)
  decrease(num - i)
  { *dst++ = *src++;
  }
}

Applying the keyword old to an expression in a loop invariant or loop variant yields the value of that expression just before the first iteration of the loop. I’ve had to replace src in the second loop invariant by old src and similarly for dst, because in that invariant I am referring to array elements computed relative to the original values of src and dst. I’ve described the modification to src and dst by adding a third invariant, that says that at the start and end of any iteration, src has its original value plus i, and similarly for dst. The post-increment operators applied to src and dst in the body ensure that these relationships are maintained. Once again, the function is verifiable by ArC.

What if we go further and get rid of the loop counter i as well? We can do that if we precompute the end pointer src + num at the start, and then iterate until src reaches this point. We still need to describe the values of src and dst in the invariant, but we must do so without the benefit of i. Here’s one way of doing it:

void arrayCopy(const int* array src, int* array dst, size_t num)
writes(dst.all)
pre(src.lim >= num; dst.lim >= num)
pre(disjoint(src.all, dst.all))
post(forall i in 0..(num - 1) :- dst[i] == src[i])
{ const int* array const srcEnd = src + num;
  while(src != srcEnd)
  keep(src.base == old(src.base))
  keep(0 <= src - (old src)) 
  keep(src - (old src) <= n) 
  keep(forall j in 0..((src - (old src)) - 1) :- (old dst)[j] == (old src)[j])  
  keep(dst == (old dst) + (src - (old src))) 
  decrease(srcEnd - src)
  { *dst++ = *src++;
  }
}

Here’s how I arrived at the new loop annotations:

  • I’ve taken the three keep-clauses from the previous version and substituted (src – (old src)) for the loop counter i. That’s how I got the 3rd, 4th and 5th keep-clauses. For the last of the three, I also removed the resulting tautology src == (old src) + (src – (old src)), leaving just the component that talks about dst.
  • The original first invariant keep(i <= n) placed an upper bound on i, thereby ensuring that src[i] and dst[i] in the original code or *src and *dst in the modified code were in bounds. I didn’t need to put a lower bound on i because its type was size_t which is an unsigned type, so its lower bound was zero implicitly. But I’ve now replaced i by (src – (old src)), which has no such constraint. So I need to put a zero lower bound on this expression, which is what the second keep-clause is for.
  • Unfortunately, this isn’t quite enough. The expression src – old src uses the pointer-difference operator, which has the precondition that the operands are pointers into the same array. In general, when we re-assign an array pointer such as src, there is no requirement that it points into the same array as before. In this case, all we ever do to src is increment it; so it obviously continues to point into the same array as long as it is not over-incremented (which ArC verifies). However, ArC isn’t yet clever enough to spot this and generate an implicit loop invariant. So I’ve added the first invariant src.base == old(src.base) right at the beginning of the loop. In the expression src.base, base is a ghost field of src, just like lim and lwb. It subtracts src.lwb from src, yielding a pointer to the very start of the array that src points into. So to state that two array pointers point into the same array, I just need to say that their bases are equal.
  • Rather than just substitute for i in the loop variant, I’ve changed it to say that I expect the difference between srcEnd and src to decrease on each iteration.

I hope this example demonstrates to you that verification of C code that uses pointer arithmetic can be done, although it is likely to require more loop invariants than equivalent code that does not use pointer arithmetic, and getting them right may be a little harder. The number of verification conditions to be proved may be higher too – there were 24 from the original code, 32 from the second version, and 35 from the last version. So indexing should generally be preferred to explicit pointer arithmetic – as recommended by MISRA-C 2004 rule 17.4 – when writing verifiable code.

Finally, we could implement this function using memcpy:

void arrayCopy(const int* array src, int* array dst, size_t num)
writes(dst.all)
pre(src.lim >= num; dst.lim >= num)
pre(disjoint(src.all, dst.all))
pre(num * sizeof(int) <= maxof(size_t))
post(forall i in 0..(num - 1) :- dst[i] == src[i])
{ memcpy(dst, src, num * sizeof(int));
}

I’ve had to add another precondition to ensure that the computation of the number of bytes num * sizeof(int) doesn’t overflow a size_t. If Arc were to assume that the number of bytes in any array cannot exceed maxof(size_t) – as is the case for many implementations – then it could infer that dst.lim <= maxof(size_t)/sizeof(int) and we would not need the precondition. Unfortunately, the C standard makes no such promise.

The call to memcpy involves implicit conversions of its first two parameters from int* to void* and from const int* to const void*. In general, pointer conversions should be avoided when writing for ArC, because such conversions break ArC’s strong type model. However, ArC knows enough about the semantics of memcpy to prove the 10 verification conditions generated by this version too.

  1. AnonProf
    July 19, 2010 at 01:57

    You give an example that does not use loop indexing, and argue that it’s more complex than using loop indexing, so therefore everyone should use loop indexing.

    But I’m not convinced. It seems like the code that doesn’t use loop indexing could be made simpler. I don’t think you’ve made the case that this is the best it can get, without loop indexing.

    For instance, look at Deputy. See especially Figure 4 of the OSDI 2006 paper, which suggests that the invariants and annotations needed can be substantially less, through two techniques: (a) appropriate design of an invariant language, (b) appropriate inference.

    In your specific example, can we do something like this?

    void arrayCopy(const int* src, int* dst, size_t num) {
      char *p=src, *q=dst;
      while (p<src+num)
           keep(src <= p && p < src+num)
           keep(dst <= q && q < dst+num)
           keep(p-src == q-dst)
           keep(forall i in 0..(p-src-1) :- dst[i] == src[i])
           decreases(src+num-p) {
        *q++ = *p++;
      }
    }
    

    I make no promises that I got the above exactly right, since I don’t have a copy of ArC to test this on, but it seems simpler than what you’ve listed.

    It also seems plausible that, with a little bit of support for programming with pointer arithmetic, most of these loop invariants could be automatically inferred.

    Another comment: In one place, you write: “all we ever do to src is increment it; so it obviously continues to point into the same array as long as it is not over-incremented (which ArC verifies). However, ArC isn’t yet clever enough to spot this and generate an implicit loop invariant.” Well, this suggests to me that your comparison above is not a fair comparison. The fair comparison would be between ‘ArC written in a way that envisions typical idioms using loop indices, used to verify code written with loop indices’ vs ‘ ArC written in a way that envisions typical idioms using pointer arithmetic, used to verify code written with pointer arithmetic’. It seems like right now you are using a tool that has been optimized for code written using loop indices, and then concluding that this tool works better for code that uses loop indices than for code that uses pointer arithmetic. Well, no surprise there. The interesting question is what changes if you extend ArC to recognize typical C idioms of programming using pointer arithmetic and generate the obvious implicit loop invariants for those idioms.

    The really interesting comparison would be one where we are dealing with ”-terminated strings, which is the one where the classic C pointer arithmetic idioms really start to look more concise than ones using loop indices. You might try a side-by-side comparison of something like strcpy(), for instance — that would be interesting.

    Ultimately, I’d agree with your final conclusion: that using loop indices will often make verification easier than using pointer arithmetic. My quibble is with your argument, rather than your conclusion; I think the argument may have some weak spots in it.

    P.S. Do we also need arrayCopy() to have a precondition that src!=NULL and dst!=NULL?

  2. July 19, 2010 at 14:40

    Thanks for your comment. You seem to be suggesting that ArC favours code that doesn’t use pointer arithmetic because that’s what we’ve written it to work with. It’s certainly true that we’ve developed ArC with reference to code that is actually used in critical systems, however I don’t believe that’s an important factor here. Anywhere you compare pointers other than for equality, or subtract one pointer from another, there is the precondition that both point into the same array; otherwise the result is undefined. In my example, this meant that an extra invariant “src.base == (old src).base” was needed. ArC currently doesn’t infer any loop invariants at all (although it often suggests them when proofs fail), so I needed to add this explicitly.

    In your suggested code, I’m not sure whether you really meant p and q to have type char* or whether you meant int*. If you use char* then the main invariant is incorrect because an element will sometimes have been partially copied. So I tried the following amended version of your code:

    void arrayCopy(const int* array src, int* array dst, size_t num) 
    pre(...)
    post(...)
    { const int* array p=src;
      int* array q=dst;
      while (p<src+num)
           keep(src <= p && p < src+num)
           keep(dst <= q && q < dst+num)
           keep(p-src == q-dst)
           keep(forall i in 0..(p-src-1) :- dst[i] == src[i])
           decreases(src+num-p) {
        *q++ = *p++;
      }
    }

    ArC can’t prove this correct until I implement its suggestions for extra loop invariants “p.base == src.base” and “q.base == dst.base”. This gives a total of 8 invariants after splitting on “&&” at the top level, compared to 2 for my original version (or 3 if you include the invariant 0 <= i that I didn't need to make explicit because I declared i as of type size_t).

    I agree that suitable design of an invariant langage can reduce the number of annotations needed. In ArC you can declare ghost functions to define your own invariant language, to some extent. If I declare:

    ghost bool pointsWithin(const int* array parent, const int* child, size_t num)
    pre(num <= parent.lim)
    returns(child.base == parent.base && parent <= child && child <= parent + num);
    

    then I can condense 6 of the invariants into just 2 that call pointsWithin.

    Regarding your PS, ArC doesn't require an explicit precondition that src and dst are not null, because it requires nullable pointers to be qualified with the keyword "null" when they are declared – just as it requires array pointers to be qualified with "array". We decided that strengthening the type system in this way was preferable to writing the "not null" preconditions that would otherwise be needed, because in critical embedded software, pointers are used mostly for parameter passing and null values don't usually make sense.

  3. AnonProf
    July 20, 2010 at 15:59

    Thanks. Right, that’s what I intended; sorry about the inadvertent char *s. The invariants indicating that p and src point into the same array (p.base == src.base) seem like they should be straightforward to automatically infer, for a verification tool that includes support for pointer arithmetic. I would expect that the invariants saying that p should remain in-bounds for the array that it points to (i.e., keep(src <= p && p < src+num)) should also be pretty easy to automatically infer.

    So with reasonable support for pointer arithmetic, the pointer arithmetic idiom should need 3 loop invariants; with slightly better support for loop indexing, the loop indexing idiom should need 2 loop invariants (slightly better support, to avoid the need to specify i <= num; it seems entirely plausible that this could be automatically inferred). That seems fairly close.

  1. September 8, 2011 at 14:03
Comments are closed.
<span>%d</span> bloggers like this: