Run-time checks: Are they worth it?

July 7, 2010

One of the criticisms levelled against the use of C in safety-critical software is that the C language does not provide run-time checks automatically. For example, when indexing into an array, there is no check that the index is in bounds. Likewise, when doing integer arithmetic in C, there is no check for arithmetic overflow.
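
To make this concrete, here is a minimal sketch (the function and variable names are mine) showing that neither operation is checked:

    #include <climits>

    int unchecked(const int values[10], int i)
    {
        int v = values[i];   /* no bounds check: if i >= 10, this is undefined
                                behaviour, typically reading adjacent memory */
        return v + INT_MAX;  /* no overflow check: if v > 0, the signed overflow
                                is undefined behaviour, not a trapped error */
    }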

Many other programming languages do provide run-time checks. For example, Java and C# both raise exceptions if the program attempts to access an out-of-bounds array element. C# also provides an option to raise exceptions on arithmetic overflow. Ada provides all these run-time checks by default, although most compilers have an option to inhibit their generation. If you are programming in C++ and follow my recommendation to use a C++ array class, you can choose whether or not to perform index-in-bounds checks.
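
One possible shape for such a class – a sketch only, with the class name and the CHECKED parameter of my own choosing, not the API of any particular library – is:

    #include <cstddef>
    #include <stdexcept>

    template <typename T, std::size_t N, bool CHECKED = true>
    class Array
    {
    public:
        T& operator[](std::size_t i)
        {
            if (CHECKED && i >= N)
                throw std::out_of_range("array index out of bounds");
            return elems[i];
        }
    private:
        T elems[N];
    };

    Array<short, 16> a;         // indexing is range-checked
    Array<short, 16, false> b;  // unchecked, as fast as a plain C array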

Some developers of critical software consider it axiomatic that you should leave run-time checks enabled, if the programming language provides them. The practice of enabling run-time checks in debug builds but disabling them in release builds has been likened to carrying a fire extinguisher in your car at all times, except when you actually use it! Does this mean that we should insist that run-time checks are always enabled in release builds of critical software – and therefore avoid using programming languages that don’t support them, such as plain C?

In my opinion, shipping software containing run-time checks is generally a good thing if the software can do something useful when a run-time check fails, and that “something useful” is tested. This will probably require the deliberate introduction of temporary bugs, since we expect the software to be free from run-time errors in normal use. For an example of where this wasn’t done properly, see the inquiry report on the first Ariane 5 launch. In summary, the launcher was lost when a run-time error occurred in a floating-point to fixed-point conversion, in a software module that was performing no useful function at the time. This led to the whole software subsystem shutting down. In doing so, it put an error code on the data bus, where it was interpreted as flight data. This caused the rocket nozzles to move to full deflection, leading to break-up of the rocket and triggering of the self-destruct mechanism. Had the overflowing data conversion simply yielded wrong information instead of raising an exception, the mission might not have been lost. This may be a rare example, but I think it illustrates that run-time checks are not invariably a good thing.
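
To make the failure mode concrete: the conversion at fault took a 64-bit floating-point value (the horizontal bias) down to a 16-bit signed integer. The sketch below (function names are mine) contrasts a trapping conversion, which is essentially what the Ariane code did, with a saturating one that yields a wrong but bounded value:

    #include <limits>
    #include <stdexcept>

    /* Trapping conversion: raises an exception on overflow, as the
       Ariane code did -- there, the exception brought down the whole
       inertial reference system. */
    short to_fixed_checked(double x)
    {
        if (x > std::numeric_limits<short>::max() ||
            x < std::numeric_limits<short>::min())
            throw std::range_error("fixed-point conversion overflow");
        return static_cast<short>(x);
    }

    /* Saturating conversion: yields a wrong but bounded value instead
       of failing -- the behaviour that might have saved the mission. */
    short to_fixed_saturating(double x)
    {
        if (x > std::numeric_limits<short>::max())
            return std::numeric_limits<short>::max();
        if (x < std::numeric_limits<short>::min())
            return std::numeric_limits<short>::min();
        return static_cast<short>(x);
    }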

What should we do when a run-time check fails? The Ariane software developers assumed that any run-time check failures would be caused by random hardware errors, so their approach was to log the error (which they got right) and hand over to the backup unit (which didn’t help, because it had already experienced exactly the same run-time check failure).

In an embedded system, the options for handling failed run-time checks may be limited – particularly in life-preserving software, such as fly-by-wire control systems. Handing over to a backup unit won’t help if the run-time failure is caused by a systematic failure (i.e. a design fault rather than a random hardware error), and restarting the software is usually not a viable option. In such cases, the best solution is surely to show that run-time checks can never fail unless the hardware is at fault. You can attempt to show this by thorough testing, but you need to be very careful. The Ariane 5 software had been thoroughly tested and “proven in use” in Ariane 4; however, the software was then re-used in Ariane 5 without repeating the tests using the higher horizontal velocity inputs that occur in an Ariane 5 launch.

You can also attempt to show that run-time checks will never fail by formal verification. This will expose any hidden preconditions – such as the maximum horizontal velocity that could be handled by the Ariane software. Your re-use strategy must ensure that when you use previously verified software in a new environment, the new environment continues to respect the preconditions.
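
As an illustration of what a recovered precondition might look like – the limit and names below are purely hypothetical, and a formal tool would state the precondition as a verified annotation rather than a run-time assert:

    #include <cassert>

    /* Hypothetical precondition, for illustration only: the velocity
       must fit in a 16-bit signed value. A verifier would demand a
       proof that every caller respects this limit; re-use on a faster
       launcher would make that proof fail, exposing the problem before
       flight. The assert is merely a run-time stand-in. */
    const double MAX_H_VELOCITY = 32767.0;

    short velocity_to_fixed(double h_velocity)
    {
        assert(-MAX_H_VELOCITY <= h_velocity && h_velocity <= MAX_H_VELOCITY);
        return static_cast<short>(h_velocity);
    }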

If run-time checks can be a mixed blessing in release software, is there a situation in which they are unambiguously a good thing? Yes there is: when you are testing the software! There are no worries about what to do when a run-time check fails: just log it and terminate the test! Whether you are doing unit testing or integration testing, a failed run-time check gives an early indication that something is wrong, often making diagnosis of the bug much easier. Run-time checks can also catch “benign” bugs that don’t lead to incorrect results in tests, but which may bite you later. For example, reading an array beyond its bounds may cause the program to read a value that depends on the previous test. It might read a “benign” value during testing, but a “harmful” value during some particular use in the field.
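
A sketch of such a “benign” bug (my own example, not from any real project):

    /* The loop bound is wrong (<= instead of <), so readings[8] is read.
       With a plain C array the extra read is silent, and the result
       depends on whatever happens to follow the array in memory --
       possibly harmless in every test run. With a checked array class,
       the first test that reaches this loop fails at once with an
       out-of-range error. */
    double sum_readings(const double readings[8])
    {
        double total = 0.0;
        for (int i = 0; i <= 8; ++i)   // bug: should be i < 8
            total += readings[i];
        return total;
    }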

If you’re performing formal verification before testing, you may argue that run-time checks are a waste of testing time. After all, they are never going to fail, right? Well, even with full formal verification, errors might occur. The compiler you are using might generate incorrect code; the linker might introduce an error; or the hardware itself may be faulty. Even formal verification systems have been known to contain errors. When we test formally verified software, any test failure is symptomatic of a fault in the development process, tool chain, or hardware. If we test thoroughly and find no errors, this gives us confidence that the process and tool chain are sound. Testing with run-time checks enabled (as well as without, if we intend to ship without run-time checks) and experiencing no run-time check failures adds to that confidence.

We adopt this philosophy when developing ArC and Perfect Developer. Both are written in Perfect, from which we generate code. For releases, we generate C++ because of the increased speed it offers – after all, theorem proving is computationally expensive. But all our development prior to final regression testing is done using C# code generation, so that we have the added benefit of run-time checks.

  1. AnonProf
    July 11, 2010 at 01:14

    The key claim in this article is: “In my opinion, shipping software containing run-time checks is generally a good thing if the software can do something useful when a run-time check fails, and that ‘something useful’ is tested.”

    I think this advice is too narrow, and should contain some qualifiers. It might[1] be appropriate in safety-critical systems where the top priority is availability/liveness: if the system halts, then people can die. In those cases, run-time checks are not enough; they need to come with a recovery action.

    However, in many systems integrity is more important than availability, and it’s better for the software to halt than to corrupt data or provide misleading/incorrect answers. In those cases, the software can certainly do something useful when the run-time check fails: it can halt. I don’t see any need for special qualifications about testing those run-time checks. When you write the assert() library, you check that it works correctly (once), but you don’t need to do special testing for each subsequent application you build that uses assert()s.

    I’m not disputing that if the choice is between statically proving that a particular assert() will never trigger vs checking it at run-time, the static proof is preferable (if we ignore cost). But that’s not the interesting question. The interesting question is, if the choice is between checking an assert() at runtime vs not checking at all (in cases where there is no static proof that the assert cannot trigger), which is better?

    Footnote [1]: I don’t have enough experience with safety-critical embedded systems to have an informed opinion. I wonder if there might be some architectures where run-time checks can be appropriate, even if your two conditions are not both met. For instance, if the system is structured following a micro-reboot architecture, would run-time checks with automated reset of the failing component be a plausible approach? I don’t have enough experience to know.

    • July 12, 2010 at 08:32

      Where it is permissible to lose system availability when a run-time check fails, I would regard the action “report the error and halt” as doing “something useful”, and generally preferable to delivering incorrect results, as you say. Regarding a micro-reboot architecture, the danger I can see is that if the input conditions haven’t changed, then the same run-time check may fail repeatedly, leading to loss of system availability.

  2. Dave Banham
    July 13, 2010 at 15:09

    David,
    Does ArC go as far as allowing C++ try-catch blocks?

    • July 13, 2010 at 15:22

      Not yet, but we will consider supporting them if/when our clients wish to use C++ exceptions in their critical software.
