CRASHME: Random input testing

See the source crashme.c for reports of system crashes. Software available:

Source tarball crashme-2.7.tgz download page
Windows installer, source code (ZIP) for download, and subversion checkout, crashme.codeplex.com
links to various versions

Acknowledgements.

Many people have provided suggestions and comments and feedback. Some in private email and some as published on the comp.arch newsgroups. But as the author of this gross hack I take full responsibility for any errors in the information presented.

A bit of background on crashme. It is a tool for testing the robustness of an operating environment using a technique of "Random Input" response analysis. This I first saw formally proposed in the book Cybernetics by Norbert Wiener, but which any parent who has observed his children playing and learning would be well disposed to describe in detail.

The operating environment under consideration is the user-mode process.
The Random Input is provided by the execution of a sequence of pseudo-random data as an instruction stream.
The response analysis is to catch and record machine and software generated exceptions/errors/signals and to retry using new random data in both the current user-mode process and in newly created subprocesses.

Notes for release 2.2 of Crashme. 9-MAY-1994 GJC@ALUM.MIT.EDU

Added the X.Y syntax for the NBYTES argument. This may run faster, doing more tests per second. A reasonable value for Y would be the number of bytes in a machine instruction.

Many people have suggested that the output of previous versions was far too verbose, and that that was not only annoying but also effectively slowing down the program. Therefore there is a new argument available after the subprocess control argument, which is a verboseness level from 0 to 5. Using a level of 2 will print out only summary information about the runs. e.g.


$ crashme +2000 666 50 00:30:00 2
Crashme: (c) Copyright 1990, 1991 George J. Carrette
Version: 1.7 25-SEP-1991
Subprocess run for 1800 seconds (0 00:30:00)
Test complete, total real time: 1801 seconds (0 00:30:01)
exit status ... number of cases
       1100 ...     2
    3522652 ...     4
       1036 ...     1
       1084 ...     7
       1108 ...    19
          1 ...   432
         12 ...   137

The table of exit status codes and frequencies is a new interesting aspect of the test. This test was run on a VMS system, so that we have a normal process exit 432 times, access violation 137 times, and reserved operand fault 19 times, etc. As the number of tries goes up (50 in this case) we would expect that the number of normal process exits to go down.

If you define an environment variable (or vms logical name) called CRASHLOG then each subprocess will append to a file the arguments it was given. In that way you can recover what instance possibly caused a crash, but remember that without frequent disk fsync operations most Unix systems will leave a CRASHLOG that is out of date by a few minutes or more.

Here is some output supplied by nik@infonode.ingr.com on one of his machines.


Processor : Intergraph Clipper C300 RISC processor
            16Mb memory + 4k I cache and 4K D cache

Operating System: CLIX Version c.5.3.2
                  derived from AT&T SVR 3.1 with BSD enhancements.

Crashme: (c) Copyright 1990, 1991 George J. Carrette
Version: 1.7 25-SEP-1991
Subprocess run for 9000 seconds (0 02:30:00)
Test complete, total real time: 9004 seconds (0 02:30:04)
exit status ... number of cases
        136 ...     1
      24576 ...     1
         14 ...     1
        138 ...    11
        135 ...    27
        140 ...    26
        132 ...   430
        139 ...    18
      12800 ...   567

The status values here could be decoded by reading the documentation for the "wait" system procedure, and looking up the correct part of the value in the sys_errlist[] array. That is left as an exersize for the reader.

To compile, some systems may need #include <sys/types.h>.

Also, note the conditionalized code in bad_malloc. If your system only gets the signal "segmentation violation" then you may need to consider conditionalizations along this line.

However, on a machine with a segmented address space, that has "instructions" in one segment and "data" in another, it is highly unlikely that the code for setting up and invoking the "void (*badboy)()" will have any interesting effect. Nothing other than an easily handled SIGSEGV will result in the inner testing loop.

Some PDP-11 systems would be examples of this situation (different I and D space).

---MACHINE O/S SPECIFIC NOTES---

MACHINE:: DEC C (OPENVMS ALPHA AXP):


$ CC/PREFIX=ALL/NOOPTIMIZE CRASHME
$ LINK CRASHME

New for version 2.2 code has been added to hackishly manipulate the Procedure Descriptor data format. It seems be executing random instructions like we would want.


#if defined(__ALPHA) && defined(VMS) && !defined(NOCASTAWAY)

Without this hack crashme on this platform has very little chance of causing anything other than a SIGBUS bus error.

Perhaps a smart "learning" mode of random-data creation could achieve the same ends, maximizing some measurement of punishment. Genetic programming might be useful.

Test I've tried:


$crashme +1000.48 0 100 03:00:00 2

MACHINE:: Windows NT:

The only files needed are crashme.c,makefile.wnt, and make.bat. cd into the directory containing the files and you can make two versions. crashme and crashmep (posix).


>make

In WIN32 subsystem the subprocess-all-at-once mode has not been implemented, but the sequential (-nsub) and timed modes have been implemented.

In posix subsystem you must use the full name of the file in the command if you want to generate subprocesses.


>crashmep.exe .....

On an 486DX2-66 machine the following caused a totally wedged up machine in the Windows NT final release. (Build 511). This was built in WIN32 mode with debugging on.


>crashme +1000 666 50 12:00:00 3

In the posix subsystem the more verbose modes were not ever observed to go through more than 2 setjmp/longjmp cycles on a given random number seed. In the WIN32 subsystem there was a greater variety of fault conditions.

The above crash took place after about 6 hours of running. Final subprocess arguments were +1000 24131 50, and we verified twice that invoking the following crashed the OS within seconds.


>crashme +1000 24131 50

I have always been concerned that the more complex the unprotected data in the user address space the more likely it is for a program being developed to generate inscrutable errors that an "application developer" level of person will be unable to understand. And worse, will end up spinning wheels for large amounts of time, thereby delaying projects and risking deadline failures, and even worse, forcing management to bring in super-experienced (and limited availability) people into a project in order to get it going again.

The WINDOWS NT client-server model is one way around this problem. Having a subsystem in a different address space is one way to protect complex data manipulated through an API. However, as page 127 of "Inside Windows NT" there are some optimizations that make an unspoken trade-off between the robustness afforded by a protected seperate address space and efficiency of implementation on an API.

Robustness and 'scrutability of failure situations' vs efficiency.

MACHINE:: OS/2

It has been reported that this runs when compiled gcc crashme.c -o crashme.exe In order to disable the dialog boxes reporting abnormal process termination, add this to CONFIG.SYS: AUTOFAIL=YES. Or the following code to main:


DosError(FERR_DISABLEHARDERR | FERR_DISABLEEXCEPTION);

Another person says that Emx is the only c compiler under OS/2 that supports fork.

Survey of Procedure Descriptor Usage. The emphasis here is on currently shipping products. The program pddet.c included with the distribution can be used to determine some of this information. Note that in some environments, e.g. Microsoft C++ 4.0 the results of PDDET will be different depending on the compilation modes chosen: debug verses release.


Architecture    |D| Desc | Env | Reg | Apos | Atyp | Rpos | Rtyp | 
------------------------------------------------------------------
VAX             |2| No   | No  | Yes | No   | No   | No   | No   |
ALPHA, OPENVMS  |3| Yes  | Yes | Yes | Yes  | Opt  | Yes  | Opt  |
ALPHA, WNT      | | No   |     |     |      |      |      |      |
ALPHA, OSF/1    | | No   |     |     |      |      |      |      |
RS/6000, AIX    |2| Yes  | Yes | No  | No   | No   | No   | No   |
PowerPC,        |2| Yes  | Yes | No  | No   | No   | No   | No   |
MIPS, Unix      | |      |     |     |      |      |      |      |
MIPS, WNT       | |      |     |     |      |      |      |      |
Intel, WNT      | |      |     |     |      |      |      |      |
Sparc, SUNOS    | | No   |     |     |      |      |      |      |
PA-RISC, HPUX   |2| Yes  |     |     |      |      |      |      |
------------------------------------------------------------------

Legend:

D    ... level of detailed information I have available
         1 = Verbal description or suspect from pddet.c
         2 = exact structure details including code for CRASHME.C
             or obvious what it is from pddet.c
         3 = crashme uses manufacturers include files for descriptors.
Desc ... Uses descriptors
Env  ... has pointer to non-static environment
Reg  ... describes registers used
Apos ... describes argument positions (stack, registers) or number.
Atyp ... describes argument types
Rpos ... describes return value position.
Rtyp ... describes return value types

Layout of Descriptors. Sizes in bytes.

ALPHA OPENVMS:

[FLAGS&KIND]       2
[REG-SAVE]         2
[REG-FOR-RETPC]    1
[REG-FOR-RETVAL]   1
[SIGNATURE-OFFSET] 2
[START-PC]         8
[Other stuff ...]  from 8 to 32 bytes worth.

AIX 

actually points to a 3 word struct with:
 - the actual function address
 - Table Of Contents (r2) register value
 - Environment (r11) pointer (for nested functions)

POWERPC

[PROGRAM-COUNTER]
[TABLE-OF-CONTENTS]
[EXCEPTION-INFO]

[Editorial comment taken from comp.arch:] Not to sound picky about this, but this is not really part of the POWER/PowerPC architecture. There is no special support for this in the hardware, it is just the scheme the software designers came up with in order to support shared libraries. Other schemes would be possible. [GJC comment] Pretty much true for every architecture.

PA-RISC HPUX.

The pddet.c program was used, and suggested descriptors of 8 bytes long. The -examine 8 argument showed what appeared to be a 4-byte starting PC followed by a table of contents. Note: If somebody knows what /usr/include/sys/*.h file to use for this, please let me know.

Notes for release 2.6 of Crashme. 13-JUL-2008 GJC.

CRASHPRNG is a new environment variable letting you change the pseudorandom number generator used.

CRASHPRNG	description
RAND	C runtime library rand
MT (default)	Mersenne twister coded by Takuji Nishimura and Makoto Matsumoto.
VNSQ	A variation of the middle square method