Failure and Run-Time Error Recovery

How to employ the processor's built-in hardware features for robust error recovery

Getting started and getting stopped – restarts and resets
The COP watchdog timer and clock monitor

This chapter describes some useful hardware features of the PDQ Single Board Computer (SBC) based on the Freescale HCS12/9S12 processor:

The processor’s external hardware interrupt, /IRQ, may be used by external devices to request immediate service.
Three nonmaskable interrupts cause a hardware reset: the external reset, the COP, and the clock monitor. The main reset is activated on power-up or when the /RESET pin is pulled low. Enabling the computer operating properly (COP) sets up a watchdog timer that resets the processor unless a special register is periodically updated. This provides a means of recovering from crashes in an embedded application. Use of the COP requires installation of an autostart routine which services the COP. The clock monitor backs up the COP by resetting the machine if the system clock fails.
An on-board jumper allows you to invoke the special cleanup mode to return the PDQ Controller to its pristine factory condition.

Getting started and getting stopped – restarts and resets

External hardware resets

The main reset interrupt of the Freescale 9S12 (HCS12) processor is activated upon power-up or when the active-low /RESET signal is pulled low. The processor does not distinguish between a power-on reset and a reset caused by a low level on the /RESET input pin; both result in the same hardware initialization and software restart sequence.

The /RESET line is normally held high by a pull-up resistor. You can pull the /RESET line low by pushing on the reset switch. Moreover, any peripheral device can reset the processor by driving the /RESET signal low for at least 0.125 microseconds using an open-collector output.

The active-low /RESET signal is controlled by the power monitor circuitry. On power-up, the monitor asserts the reset signal until the positive supply has stabilized at an adequate operational level.

Internal resets

The HCS12 processor resets itself when a failure condition is detected by either the computer-operating-properly (COP) or the clock monitor circuit. When either of these failure conditions occurs, the processor drives the /RESET line low to reset itself and any peripherals that are connected to the /RESET line. The processor then determines which failure (COP or clock monitor) caused the reset, and branches to the associated service routine. The QED-Forth operating system initializes the interrupt vectors for the COP and clock monitor to perform the standard restart sequence, and the programmer may change the vectors if desired. The COP and clock monitor are described in the following sections.

Crashes

A computer crashes when it executes a set of instructions that it is not supposed to. This can cause the processor to write over memory locations that are not write-protected. The processor may get into an infinite loop of legal instructions (in which case it will not respond to your commands), or it may eventually execute an illegal opcode. Illegal instructions are detected by the processor’s illegal opcode trap and result in a restart, in which case you will see the QED-Forth startup message on your terminal, or execution of the autostart program, if present.

The best response to a crash during program development is to push the reset button. This initializes all of the registers and performs a restart. In most cases a warm restart will be performed, which should allow you to continue programming with access to all of the functions that you have defined. In other cases, the state of the user area or the dictionary may be corrupted. If the operating system detects the corruption, it will automatically execute a cold restart; otherwise you may type at the terminal:

COLD↓

which performs the restart. The cold restart re-initializes all of the user variables that control the operating system. The Table titled Shadow Flash, Write Protection, and Autostarting Configuration Functions in the Loading Your Program into Memory Chapter summarizes functions such as RESTORE and RESTORE.ALL that restore the programming environment (memory pointers, access to defined functions, etc.) after a COLD restart.

Resets versus restarts

To clarify the discussion of crashes, some terms must be defined. A reset is an initialization process invoked by the hardware of the HCS12 processor, while a restart is an initialization process controlled by software.

A reset can be caused by any of four events:

power is applied to the processor
the reset button is pushed
the clock monitor detects a clock failure
the computer operating properly (COP) circuit detects a failure

The hardware of the HCS12 is configured by a set of registers that reside at locations 0x0000 through 0x03FF. (These hardware registers should not be confused with the programming registers D, X, Y, etc.) The hardware reset initializes essentially all of the registers, and then initiates an interrupt response sequence. The interrupt calls a specified response program whose address is stored in an interrupt vector near the top of memory. The power-on and reset-button resets share the same interrupt vector at 0xFFFE. The clock monitor and COP resets are re-vectored to addresses in EEPROM where the programmer can install customized service routines coded in the C language or Forth, if desired. By default, all of these service routines are initialized to perform the standard restart routine.
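The vectoring just described can be pictured with a small host-side model. This is only an illustrative sketch: the enum, the table, and the function names (install_handler, dispatch_reset, custom_cop_handler) are hypothetical and are not the operating system's actual data structures or API. It assumes, following the text, that only the COP and clock monitor entries can be replaced by the programmer and that every entry defaults to the standard restart.

```c
#include <stddef.h>

/* Hypothetical model of reset-source dispatch; not the kernel's real tables. */
typedef enum {
    RESET_POWER_ON_OR_BUTTON,   /* power-up and /RESET pin share one vector */
    RESET_CLOCK_MONITOR,
    RESET_COP,
    RESET_SOURCE_COUNT
} reset_source_t;

enum { STANDARD_RESTART = 0, CUSTOM_HANDLER = 1 };

typedef int (*reset_handler_t)(void);

/* Default service routine: stands in for the standard restart sequence. */
int standard_restart(void) { return STANDARD_RESTART; }

/* Example of a programmer-supplied routine (illustrative only). */
int custom_cop_handler(void) { return CUSTOM_HANDLER; }

/* By default every reset source performs the standard restart. */
static reset_handler_t vector_table[RESET_SOURCE_COUNT] = {
    standard_restart, standard_restart, standard_restart
};

/* Replace the handler for a re-vectored source.
 * Returns 0 on success, -1 if the source is not replaceable in this model. */
int install_handler(reset_source_t source, reset_handler_t handler)
{
    if (handler == NULL) return -1;
    if (source != RESET_CLOCK_MONITOR && source != RESET_COP) return -1;
    vector_table[source] = handler;
    return 0;
}

/* Call the routine associated with the event that caused the reset. */
int dispatch_reset(reset_source_t source)
{
    return vector_table[source]();
}
```

In this model, dispatch_reset(RESET_COP) runs the standard restart until install_handler(RESET_COP, custom_cop_handler) succeeds, after which only the COP entry changes.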

A restart is an initialization process performed by software. After a (hardware-invoked) reset, the HCS12 calls a restart routine which re-initializes some of the registers to accommodate the embedded operating system, and initializes other memory locations including all or part of the user area. A restart can also be invoked solely via software, by executing the kernel words COLD or WARM from the terminal or from your program. When the illegal opcode trap detects an illegal instruction, it calls a restart routine, but does not perform a hardware reset. Note that a reset always results in a restart, but that a restart can be performed without a reset.

COLD is the most comprehensive software-invoked initialization command. Executing COLD after a crash usually puts the machine into a well-known state by completely initializing the system variables used by the operating system. But COLD does not initialize all of the registers. Therefore, in crashes where the contents of key hardware registers are corrupted, it may be necessary to perform a hardware reset by pushing the reset button or powering the machine off and on again.

Cold versus warm restarts

There are two types of restart: cold and warm. A cold restart initializes all of the parameters used by the operating system. These parameters are stored in reserved areas of system RAM, and in the user area, which is a 256-byte block of memory in the reserved common RAM. COLD initializes to default values all of the memory management pointers, format variables to control numeric conversion, quantities that enable the compilation of local variables, and many other system values.

COLD also initializes several vital interrupt vectors so that they will perform the startup sequence if they are invoked. These vital interrupts are the clock monitor, computer operating properly, and illegal opcode trap. The automated initialization of the vital interrupt response vectors can be overridden by executing NoVitalIRQInit() (see its entry in the C Glossary) and then posting customized handlers for these interrupts.

A warm restart, on the other hand, assumes that most of the user variables have already been properly initialized. A warm restart initializes only a few of these parameters, including stack pointers (it clears the stacks) and some multitasking variables (it makes sure that a single task is running and that it has control of the serial port).

A warm restart preserves QED-Forth’s prior number base (whatever you had set it to before the restart occurred) while a cold restart always sets the base to decimal. A warm restart preserves the user’s memory map and QED-Forth’s ability to find user defined words, while a cold restart sets a default memory map and forgets all words except those in the original kernel. As discussed in earlier chapters, you can use SAVE, SAVE.ALL, RESTORE, and RESTORE.ALL in conjunction with a restart to recover access to a saved memory map and your defined words. The Table titled Shadow Flash, Write Protection, and Autostarting Configuration Functions in the Loading Your Program Into Memory Chapter summarizes functions that restore the programming environment after a COLD restart.

The default restart program decides whether to perform a cold or a warm restart by checking a location in the user area to see if a specified pattern (0x1357) is stored there. If the correct pattern is present, the restart program assumes that the user area is already properly initialized, so it performs a warm restart. If the location does not contain the proper value, the restart program assumes that some event (perhaps a crash) has corrupted the user area, so a cold restart is executed to force the system to a known state.
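The decision described above amounts to a single comparison. The following is a minimal host-side sketch, not the kernel's code: the user_area_t structure and the function names are hypothetical, and it assumes (the text implies but does not state this) that a cold restart writes the pattern back so that the next restart can be warm.

```c
#include <stdint.h>

#define WARM_PATTERN 0x1357u   /* pattern value named in the text */

typedef enum { RESTART_WARM, RESTART_COLD } restart_kind_t;

/* Hypothetical stand-in for the user-area location that holds the pattern. */
typedef struct {
    uint16_t startup_pattern;
} user_area_t;

/* Warm restart only if the pattern is intact; otherwise force a cold restart. */
restart_kind_t choose_restart(const user_area_t *ua)
{
    return (ua->startup_pattern == WARM_PATTERN) ? RESTART_WARM : RESTART_COLD;
}

/* Assumption: cold initialization stores the pattern as part of
 * re-initializing the user area. */
void cold_init(user_area_t *ua)
{
    ua->startup_pattern = WARM_PATTERN;
}
```

Any corruption of the pattern location, even by a single bit, makes choose_restart() select a cold restart in this model.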

If a crash over-writes the user area, the next restart will be a cold restart. QED-Forth signals a cold startup by printing a COLDSTART statement before the QED-Forth V6.xx startup message is printed. If the crash did not corrupt the startup pattern in the user area, a warm restart would be performed, and you could continue debugging. In most cases, all of the words that you defined would still be accessible. If the machine is behaving in an unpredictable manner, however, it may be necessary to reset the machine and perform a cold restart to establish a known initialized state.

Bullet-proof your production code using ColdOnReset()

In a production embedded control application, it is wise to ensure that a thorough re-initialization of system variables occurs upon every reset. For example, a brown-out condition may occur in which system power starts to fail but is quickly restored. In this case, the processor could reset, while the RAM contents remain fully or mostly intact. If some but not all of the RAM contents are corrupted, the operating system may perform a warm restart and initialize only some of the system control variables. The possibility that some system variables may have been corrupted during the power transient could cause a run-time failure. The art of designing reliable embedded systems is ensuring that such low-probability events are properly handled.

The ColdOnReset() function (see its entry in the C Glossary) solves this potential problem. It stores a pattern in reserved system EEPROM that tells the operating system to perform a cold (as opposed to a warm) restart upon each power-up or reset condition. This guarantees a complete re-initialization of all system variables, boosting the reliability of a production instrument. To return to the more convenient cold-or-warm restart for system development, you can interactively execute STANDARD.RESET at the terminal prompt, or include the StandardReset() command in your C program.

Recovery tricks

Some crashes may be difficult (but not impossible!) to recover from. For example, if the name area of the dictionary is corrupted, QED-Forth may not be able to find even the most basic commands in the operating system vocabulary. If every command you give is met with the ? error message, try typing COLD in the terminal window. The operating system’s interpreter is programmed to always recognize the word COLD, even if the dictionary is corrupted.

If all else fails, use the special cleanup mode

These recovery techniques may not work if you have a buggy autostart word or a major crash. If typing COLD from the terminal or pressing the reset button does not greet you with the standard QED-Forth V6.xx prompt, you may need to use the special cleanup mode to restore your system to a proper state. This involves installing the cap onto the jumper labeled Jmp1 Clean and then pressing the reset button. The special cleanup procedure places the PDQ Board in the same state it was in when it was shipped from the factory.

Doing a "Special Cleanup"

If you ever need to return your PDQ Board to its factory-new condition, just do a Special Cleanup: With the power on, install the cap onto the jumper labeled Jmp1 Clean next to the reset button, press the reset button, then remove the jumper cap. This procedure will remove any application programs and reinitialize all operating system parameters to their factory-new condition.

If you are still having trouble, email us or give us a call.

The COP watchdog timer and clock monitor

In many embedded control applications, it is important that processor crashes be detected quickly so that the system can rapidly be returned to a proper operating condition. The Computer Operating Properly subsystem, also known as a watchdog timer or COP, provides this capability. It gives the programmer a way to force a processor reset if an application program crashes or gets lost. When enabled, the COP resets the processor if the application program fails to periodically update a specified register within a predetermined time-out period. The COP time-out period is programmable to any of 7 values ranging from 1 to 1049 milliseconds (ms).

The COP is enabled and controlled by two registers named COPCTL and ARMCOP. To use the COP, the autostart routine that runs the application must enable the COP by writing a value from 1 to 7 to the COPCTL register, and then, in addition to performing all of its normal tasks, periodically write a 2-byte sequence to the ARMCOP register. The specified sequence is simple: 0x55 then 0xAA must be written to the ARMCOP register. The 0x55 and/or 0xAA bytes may be written multiple times, and may be interspersed with other instructions, but the two-write sequence must be completed before the COP times out, and no other values may be written to ARMCOP. Then install the application as an autostart routine using the QED-Forth word AUTOSTART: or PRIORITY.AUTOSTART:.

If the application program ever allows the time-out period to be exceeded without writing the specified sequence, or if an invalid value (that is, any value other than 0x55 or 0xAA) is written to ARMCOP, the COP resets the processor. Presumably the sequence will not be properly written if the processor crashes for any reason, so the COP provides a way of automatically resetting the processor to recover from crashes. Then, because the application program has been installed as an autostart routine, the application is automatically restarted when the COP forces a reset.
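The servicing rules above can be captured in a small software model that is handy for reasoning about them on a PC. This is a sketch under stated assumptions, not a description of the silicon: the cop_model_t type and its functions are hypothetical, time is advanced by hand in microseconds, and an 0xAA written without a preceding 0x55 is assumed to be harmless but not to restart the time-out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical host-side model of the COP behaviour described in the text. */
typedef struct {
    uint32_t timeout_us;   /* configured time-out period; 0 models "disabled" */
    uint32_t elapsed_us;   /* time since the last completed service sequence */
    bool     saw_55;       /* true once 0x55 has been written in this sequence */
    bool     reset;        /* latched when the model would reset the processor */
} cop_model_t;

void cop_init(cop_model_t *cop, uint32_t timeout_us)
{
    cop->timeout_us = timeout_us;
    cop->elapsed_us = 0;
    cop->saw_55 = false;
    cop->reset = false;
}

/* Model a write to ARMCOP. */
void cop_write_armcop(cop_model_t *cop, uint8_t value)
{
    if (value == 0x55) {
        cop->saw_55 = true;             /* first half; repeats are allowed */
    } else if (value == 0xAA) {
        if (cop->saw_55) {              /* sequence complete: restart timing */
            cop->elapsed_us = 0;
            cop->saw_55 = false;
        }
    } else {
        cop->reset = true;              /* any other value forces a reset */
    }
}

/* Model the passage of time without a completed service sequence. */
void cop_advance(cop_model_t *cop, uint32_t us)
{
    cop->elapsed_us += us;
    if (cop->timeout_us != 0 && cop->elapsed_us >= cop->timeout_us) {
        cop->reset = true;              /* time-out period exceeded */
    }
}
```

With a 1048576 microsecond time-out, servicing the model every 600000 microseconds keeps the reset flag clear, while two such intervals without servicing set it.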

Be careful with the COP

Before enabling the COP, be sure to fully debug the application program that periodically updates the ARMCOP register, and install it as an autostart or priority autostart routine as described in prior chapters. If the startup program is improperly designed so that it is unable to service the COP on time, the COP will reset the machine, thereby invoking the startup program again, and leading to an infinite series of COP resets.

If you find yourself in this situation, you can return the PDQ Board to its pristine state by performing a special cleanup: install the cap onto the jumper labeled Jmp1 Clean, press the reset button, and then remove the jumper cap to resume normal operation with the COP disabled and any autostart routine removed.

The COP feature should prove trouble-free as long as the application program:

is fully debugged;
enables the COP by writing to the COPCTL register;
is capable of updating the ARMCOP register in a timely fashion; and,
is installed as an autostart or priority autostart routine.

Configuring the COP

The COPCTL register at address 0x003C configures and enables or disables the COP. This is a write-once register; only the first write to the register after a reset has an effect. The Freescale Clock and Reset Generator (CRG) Block User Guide fully describes the two COP registers. Your application program can enable the COP with a specified time-out period by writing one of the values shown in the table below to the COPCTL register when the program starts up. For example, to select the longest time-out period:

COPCTL = 7;

The value determines the amount of time that can elapse between updates of the ARMCOP register by the application program. If the time-out period is exceeded, the COP forces a reset. The available time-out periods are:

COP Time-out Period
COPCTL Contents	Time-out Period
0	COP disabled
1	1.024 ms
2	4.096 ms
3	16.384 ms
4	65.536 ms
5	262.144 ms
6	524.288 ms
7	1048.576 ms
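One way to use this table is to pick the shortest time-out that still exceeds the application's worst-case interval between ARMCOP updates, so that a crash is detected as quickly as possible without false resets. The helper below is a hypothetical sketch (cop_setting_for is not a library function); the periods are copied from the table above and expressed in microseconds.

```c
#include <stdint.h>

/* Time-out periods from the table, in microseconds, indexed by the value
 * written to COPCTL. Index 0 means the COP is disabled. */
static const uint32_t cop_timeout_us[8] = {
    0u, 1024u, 4096u, 16384u, 65536u, 262144u, 524288u, 1048576u
};

/* Return the smallest COPCTL value (1..7) whose time-out is strictly longer
 * than the worst-case interval between service sequences.
 * Returns 0 if even the longest time-out is too short; the caller must treat
 * that as an error rather than write it to COPCTL. */
uint8_t cop_setting_for(uint32_t worst_case_interval_us)
{
    uint8_t setting;
    for (setting = 1; setting <= 7; ++setting) {
        if (cop_timeout_us[setting] > worst_case_interval_us) {
            return setting;
        }
    }
    return 0;
}
```

For example, a loop that may take up to about 328 milliseconds between updates needs at least setting 6 (524.288 ms), since setting 5 (262.144 ms) would expire first. In practice you may want to add margin to the worst-case figure before calling the helper.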

Servicing the COP

Servicing the COP is accomplished by writing 0x55 and 0xAA to the ARMCOP register:

ARMCOP = 0x55;
ARMCOP = 0xAA;

At least one 0x55 followed by at least one 0xAA must be written to ARMCOP before the specified time period elapses. No other values may be written to ARMCOP. The number of intermediate instructions between them is inconsequential. Once the sequence has been written, the COP will need to be serviced again before the next time-out period has elapsed.
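Because the number of instructions between the two writes does not matter, one common design choice is to place the 0x55 write at the top of the main loop and the 0xAA write at the bottom, so that the COP is serviced only when a complete pass has executed. The sketch below illustrates the idea on a host PC under stated assumptions: armcop_write() and ARMCOP_MOCK are hypothetical stand-ins for the real ARMCOP register, and read_inputs() and update_outputs() are placeholder application steps.

```c
#include <stdint.h>

/* Host-side stand-in for the ARMCOP register, plus a log of what was written.
 * On the target, the writes would go to ARMCOP itself. */
static volatile uint8_t ARMCOP_MOCK;
static uint8_t write_log[8];
static unsigned write_count;

static void armcop_write(uint8_t value)
{
    ARMCOP_MOCK = value;
    if (write_count < sizeof write_log) {
        write_log[write_count++] = value;
    }
}

/* Placeholder application steps (illustrative only). */
static int read_inputs(void) { return 1; }
static int update_outputs(int inputs) { return inputs > 0; }

/* One pass of the main loop.
 * Returns 1 if the pass completed and the COP sequence was finished,
 * 0 if a step failed and the sequence was deliberately left incomplete
 * so that the COP will time out. */
int control_loop_pass(void)
{
    int inputs;

    armcop_write(0x55);              /* first half of the sequence */
    inputs = read_inputs();
    if (!update_outputs(inputs)) {
        return 0;                    /* do not complete the sequence */
    }
    armcop_write(0xAA);              /* second half: pass completed */
    return 1;
}
```

The total time for one pass must still be shorter than the configured time-out period.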

See below a demonstration program that enables the COP to time out after the maximum interval, and then writes to ARMCOP for several seconds. After the program stops writing to ARMCOP, the processor will reset within the configured COP interval of just over one second.

// Mosaic Industries sample application

#include <mosaic\allqed.h>

// These commands enable the computer-operating-properly (COP)
// watchdog timer, and acknowledge it after it has been enabled.
// If it is not acknowledged within a certain time period from
// being enabled or from the last acknowledgement, the processor
// resets.  The value of the timeout can only be configured once
// in COPCTL after a hardware reset, and then cannot be changed.
// Choose the value from the list below corresponding to the most
// appropriate timeout interval for your application:
// '\x00'          COP disabled
// '\x01'    1.024 milliseconds
// '\x02'    4.096 milliseconds
// '\x03'   16.384 milliseconds
// '\x04'   65.536 milliseconds
// '\x05'  262.144 milliseconds
// '\x06'  524.288 milliseconds
// '\x07' 1048.576 milliseconds
// Note that a single-quoted character is the appropriate way
// to specify an unsigned 8-bit (single byte) value.
#define WATCHDOG_START() { COPCTL = '\x07'; }
#define WATCHDOG_ACK()   { ARMCOP = '\x55'; ARMCOP = '\xaa'; }

int main()
{
    int i;

    // Disable libc output buffering, which causes unexpected behavior on embedded systems.
    // If I/O buffering would benefit your application, see the Queued Serial demo.
    setbuf( stdout, NULL );
    Emit('\n');

    WATCHDOG_START();

    for( i = 0; ; ++i )
    {
        puts( "Hello world!" );

        if( i < 20 ) { WATCHDOG_ACK(); }
        else puts( "Not resetting COP." );

        MicrosecDelay(-1);
        MicrosecDelay(-1);
        MicrosecDelay(-1);
        MicrosecDelay(-1);
        MicrosecDelay(-1);
    }

    return 0;
}

COP Watchdog Timer Demonstration


The clock monitor

The clock monitor provides a second level of security by monitoring the main system clock and resetting the processor if the clock signal disappears or oscillates too slowly. The clock monitor is enabled by default by the operating system, and a slow or stopped clock will force a reset.

Processor operating modes

The HCS12 microcontroller has 8 operating modes, but only one is used on the PDQ Board: normal expanded wide mode, with BDM allowed. Other modes include single-chip modes that do not support external memory, and special test, emulation, and peripheral modes. These modes are set by 3 mode pins on the processor that typically do not need to be modified.

The HCS12 offers two low-power modes called STOP and WAIT. Because the processor is not the major consumer of power on the PDQ Board, these modes do not confer much advantage and are not recommended for most applications.

Third party manufacturers sell BDM (Background Debug Mode) PC dongles with associated debugger software that connect to the 6-pin BDM header on the PDQ Board, enabling assembly-level tracing and debugging. These tools can supplement the built-in interactive debug tools implemented by the operating system.