Kernel Exception Events Analysis

Background

While HCE is running, certain kernel exception events are unavoidable, such as soft lockup, RCU (Read-Copy Update) stall, hung task, global OOM, cgroup OOM, page allocation failure, list corruption, bad mm_struct, I/O error, EXT4-fs error, Machine Check Exception (MCE), fatal signal, warning, and panic. This section describes these events and how to trigger them.

Soft Lockup

A soft lockup is the symptom of a task or kernel thread not releasing a CPU for a period longer than allowed (20 seconds by default).

Details
A soft lockup is detected by the watchdog mechanism of the Linux kernel. The kernel starts a FIFO real-time kernel thread (watchdog) with the highest priority for each CPU. The threads are named watchdog/0, watchdog/1, and so on. Each thread invokes the watchdog function every 4 seconds by default. Each time the function is invoked, an hrtimer is reset to expire after the soft lockup threshold, which is twice the value of the watchdog_thresh kernel parameter (10 seconds by default) and therefore defaults to 20 seconds.

If the watchdog thread is not scheduled within this duration and the hrtimer expires, the kernel prints a soft lockup exception similar to the following:
```
BUG: soft lockup - CPU#3 stuck for 23s! [kworker/3:0:32]
```
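
The timing relationship above can be checked with a little arithmetic. The following is a minimal sketch, not kernel code: the function names are hypothetical, and the assumption that the check interval is one fifth of the threshold is inferred from the 4-second and 20-second defaults given above. It also parses a message in the format shown above.

```python
import re


def softlockup_threshold(watchdog_thresh: int = 10) -> int:
    """Soft lockup threshold in seconds: twice watchdog_thresh."""
    return 2 * watchdog_thresh


def sample_period(watchdog_thresh: int = 10) -> float:
    """Interval between watchdog invocations in seconds.

    Assumption: one fifth of the soft lockup threshold, which gives
    4 seconds for the default threshold of 20 seconds.
    """
    return softlockup_threshold(watchdog_thresh) / 5


# Pattern for the message format shown in the sample above.
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line: str):
    """Return the CPU, stuck time, task name, and PID, or None if no match."""
    m = SOFT_LOCKUP_RE.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m.group("cpu")),
        "seconds": int(m.group("secs")),
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
    }
```

For the sample message, the parser reports CPU 3, a stuck time of 23 seconds (longer than the 20-second threshold), and the task kworker/3:0 with PID 32.
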
Triggering method
Run an infinite loop in kernel code with interrupts or preemption disabled.

RCU Stall

An RCU stall is an exception reported when RCU kernel threads are not scheduled within the RCU grace period.

Details
RCU readers are allowed to access any data, and RCU records information about these readers. When RCU writers update data, they make a copy and modify the copy. After all readers exit, writers replace the old data in a single operation.

Writers can only replace the old data after all readers stop referencing the old data. This period of time is a grace period.

If the readers still have not exited when the expected grace period is over, so that the writers wait longer than allowed, an RCU stall is reported.
Triggering method
Simulate a scenario described in Documentation/RCU/stallwarn.txt to trigger RCU stalls. For example, make a CPU loop indefinitely in an RCU read-side critical section, or loop indefinitely with interrupts or preemption disabled.

Hung Task

When the kernel detects that a process is in the D state for a period longer than the specified time, a hung task exception is reported.

Details
One process state is TASK_UNINTERRUPTIBLE, which is also called the D state. A process in the D state can be woken up only by wake_up. The kernel introduced the D state so that a process can wait for I/O to complete. When I/O is working normally, a process should not stay in the D state for a long time.

The kernel creates a thread (khungtaskd) to periodically traverse all processes in the system and check whether there is a process that is in the D state for a period longer than the preset duration (120 seconds by default). If there is such a process, related warnings and process stacks will be printed and reported. If hung_task_panic is configured (through proc or kernel startup parameters), a panic is initiated directly.
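
The periodic check can be modeled in a few lines. This is an illustrative sketch only, not how khungtaskd is implemented: the Task record, its secs_in_state field, and the function name are hypothetical, and the task list is supplied by the caller rather than read from the system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Task:
    pid: int
    comm: str
    state: str            # one-letter process state: "R", "S", "D", ...
    secs_in_state: float  # hypothetical: seconds spent in the current state


def scan_for_hung_tasks(
    tasks: List[Task],
    timeout_secs: float = 120.0,
    hung_task_panic: bool = False,
) -> Tuple[List[Task], str]:
    """Return the tasks stuck in the D state and the resulting action.

    The action is "ok" when nothing is hung, "warn" when hung tasks are
    only reported, and "panic" when hung_task_panic is set.
    """
    hung = [
        t for t in tasks
        if t.state == "D" and t.secs_in_state > timeout_secs
    ]
    if not hung:
        return [], "ok"
    return hung, ("panic" if hung_task_panic else "warn")
```

A task that has been in the D state for 121 seconds is reported with the default 120-second timeout, while a task that has been sleeping in the S state for the same time is not.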

Triggering method
Create a kernel thread, set its state to D (TASK_UNINTERRUPTIBLE), and then invoke the scheduler so that the thread gives up its time slice and is not woken up.

Global OOM

The Linux OOM killer is a memory management mechanism. When available memory runs low, the kernel kills some processes to free memory and keep the system running.

Details
When the kernel allocates memory to a process but the system memory is insufficient, OOM will occur. The OOM killer traverses all processes, scores the processes based on their memory usage, selects a process with the highest score, and terminates this process to release memory.

The kernel source code is linux/mm/oom_kill.c, and the core function is out_of_memory(). The following describes the processing flow:
1. The kernel instructs the modules that are registered with oom_notify_list to release some memory. If these modules release memory, the kernel takes no further action. If no memory is reclaimed, the kernel goes to the next step.
2. Generally, the OOM killer is triggered when the kernel is allocating memory to a process. If that process has a pending SIGKILL or is exiting, the kernel terminates it to release memory. Otherwise, the kernel goes to the next step.
3. The kernel checks the panic_on_oom setting configured by the system administrator and determines whether to run the OOM killer or panic in case of OOM. If the kernel panics, the system crashes and restarts. If the kernel runs the OOM killer, it goes to the next step.
4. The kernel enters the OOM killer and checks the system settings. The system administrator can configure the kernel to terminate either the process that requested memory and caused the OOM, or another process. If the setting is to terminate the current process, the kernel terminates it and the OOM killer stops. Otherwise, the kernel goes to the next step.
5. The kernel invokes select_bad_process to select an appropriate process, and then invokes oom_kill_process to terminate the selected process. If select_bad_process does not select any process, the kernel panics.
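
The five steps above can be sketched as a single decision function. This is a simplified model for illustration, not the code in oom_kill.c: the function name, its parameters, and the returned action strings are hypothetical, and the scores dictionary stands in for the per-process scoring described earlier.

```python
from typing import Dict, Optional, Tuple


def out_of_memory_flow(
    current_pid: int,
    current_is_dying: bool,
    notifiers_freed_memory: bool,
    panic_on_oom: bool,
    kill_allocating_task: bool,
    scores: Dict[int, int],
) -> Tuple[str, Optional[int]]:
    """Return (action, pid) following steps 1 to 5 of the flow above."""
    # Step 1: registered modules released memory, nothing more to do.
    if notifiers_freed_memory:
        return ("reclaimed", None)
    # Step 2: the allocating process has a pending SIGKILL or is exiting.
    if current_is_dying:
        return ("kill", current_pid)
    # Step 3: the administrator chose panic instead of the OOM killer.
    if panic_on_oom:
        return ("panic", None)
    # Step 4: the administrator chose to kill the allocating process.
    if kill_allocating_task:
        return ("kill", current_pid)
    # Step 5: pick the process with the highest score, or panic if none.
    if not scores:
        return ("panic", None)
    victim = max(scores, key=scores.get)
    return ("kill", victim)
```

With all flags off and scores of {100: 10, 200: 950, 300: 40}, the function selects PID 200, the process with the highest score.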
Triggering method
Run a program that consumes a large amount of memory until system memory is exhausted.

cgroup OOM

Difference from global OOM
cgroup OOM differs from global OOM in the memory it applies to: it is scoped to a single cgroup rather than the whole system. When the memory usage of the processes in a cgroup exceeds the cgroup's upper limit, the kernel kills processes in that cgroup to release memory.

Triggering method
Inside a cgroup with a memory limit, run a program that consumes a large amount of memory until the cgroup's limit is exceeded.

Page Allocation Failure

A page allocation failure is an error reported by the system when a program fails to obtain free pages. When a program requests memory of a given order, but the system has no free block whose order is equal to or higher than the requested order, the kernel reports an error.

Details
Linux uses the buddy system to efficiently allocate and manage memory. All idle pages (4 KB per page) are linked to an array containing 11 elements. Each element in the array is a linked list of free blocks of the same size, where each block consists of consecutive pages. The number of pages per block is 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, or 1,024. The maximum contiguous memory that can be allocated at a time is 4 MB, that is, 1,024 contiguous 4 KB pages.

Assume that you request 64 contiguous pages, which corresponds to order 6. The system searches the seventh through eleventh linked lists in the array in sequence. If a linked list is empty, there is no free memory of that order, and the system moves on to the next linked list until it reaches the last one.

If all of these linked lists are empty, the request fails. The kernel reports a page allocation failure and displays the following error message to indicate that a memory block of order 6 could not be allocated.
```
page allocation failure:order:6
```
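
The relationship between the page count, the order, and the free-list search can be worked through as follows. This is a toy model of the search only (it does not split or merge blocks): the function names are hypothetical, and free_lists is an assumed representation in which element i holds the number of free blocks of order i.

```python
PAGE_SIZE_KB = 4   # size of one page
NUM_ORDERS = 11    # free lists for orders 0 through 10


def order_for_pages(n_pages: int) -> int:
    """Smallest order whose block (2**order pages) holds n_pages."""
    return max(0, (n_pages - 1).bit_length())


def block_size_kb(order: int) -> int:
    """Size in KB of one block of the given order."""
    return (2 ** order) * PAGE_SIZE_KB


def find_free_list(free_lists, order):
    """Search the lists from the requested order upward.

    Returns the order of the first non-empty list, or None if every
    list from the requested order to the last one is empty.
    """
    for candidate in range(order, len(free_lists)):
        if free_lists[candidate] > 0:
            return candidate
    return None


def allocate(free_lists, n_pages):
    """Return the order that satisfies the request, or an error string."""
    order = order_for_pages(n_pages)
    found = find_free_list(free_lists, order)
    if found is None:
        return f"page allocation failure:order:{order}"
    return found
```

A request for 64 pages maps to order 6, and the largest block (order 10) is 4,096 KB, which matches the 4 MB limit described above. If only lists of order 5 and lower contain free blocks, the 64-page request fails with the order 6 message.
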
Triggering method
Use alloc_pages to repeatedly request high-order memory pages (for example, order=10) without releasing them until a request fails.

List Corruption

A list corruption error is reported when a linked list fails the kernel's validity check. There are two error types: list_add corruption and list_del corruption.

Details
The kernel's list_add and list_del interfaces check the validity of the linked list before adding or deleting an entry. If the linked list is invalid, a list corruption error is reported. The kernel source code for these checks is lib/list_debug.c.

This error is typically caused by abnormal memory operations, such as memory being overwritten so that the list pointers are corrupted.
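
The kind of consistency check involved can be illustrated with a small model of a circular doubly linked list. This is a sketch in Python rather than the kernel's C implementation: the Node and ListCorruption classes are hypothetical, and the checks shown (each neighbor must point back at the expected entry) are a simplified subset.

```python
class ListCorruption(Exception):
    """Raised when a neighbor pointer does not match the expected entry."""


class Node:
    def __init__(self, name):
        self.name = name
        # A new node points at itself, like an empty circular list head.
        self.prev = self
        self.next = self


def list_add(new, head):
    """Insert new right after head, after validating the neighbors."""
    prev, nxt = head, head.next
    if nxt.prev is not prev:
        raise ListCorruption("list_add corruption: next->prev should be prev")
    if prev.next is not nxt:
        raise ListCorruption("list_add corruption: prev->next should be next")
    nxt.prev = new
    new.next = nxt
    new.prev = prev
    prev.next = new


def list_del(entry):
    """Unlink entry, after validating that both neighbors point back at it."""
    prev, nxt = entry.prev, entry.next
    if prev.next is not entry:
        raise ListCorruption("list_del corruption: prev->next should be entry")
    if nxt.prev is not entry:
        raise ListCorruption("list_del corruption: next->prev should be entry")
    prev.next = nxt
    nxt.prev = prev
    entry.prev = entry.next = None
```

Overwriting the prev pointer of an entry and then deleting that entry, or adding a new entry next to it, raises the corresponding corruption error in this model.
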
Triggering method
Use the standard kernel interface of list.h to create a linked list, illegally modify the previous or next pointer of a linked list entry, and then call the kernel list_add or list_del interface.

Bad mm_struct

A bad mm_struct error is reported when one or more mm_struct data structures in the kernel are corrupted.

Details
mm_struct is an important data structure in the Linux kernel. It is used to track the virtual memory areas of a process. If the data structure is damaged, the process or system may crash. This error is usually caused by memory exceptions, for example, when the memory that holds mm_struct is corrupted or overwritten.
Triggering method
Bad mm_struct is triggered when there is a hardware error or Linux kernel code error.

I/O Error

An I/O error is reported when an input/output operation fails. This error may be printed when the driver of the I/O device such as the NIC or disk is abnormal or the file system is abnormal.

Details
The cause of this error is whatever condition made the I/O operation fail. Common causes are hardware faults, disk damage, file system errors, driver problems, and permission problems. For example, if an error occurs when the system attempts to read data from or write data to a disk, an I/O error is reported.
Triggering method
While the system is reading data from or writing data to a disk, remove the disk so that the operation fails.

EXT4-fs Error

EXT4-fs errors typically indicate problems with the ext4 file system.

Details
A sector is the minimum storage unit on a storage device. Multiple consecutive sectors form a block. An inode stores the metadata of a file, including the creator, creation date, file size, attributes, and the number of blocks. If the inode information of an ext4 file system fails verification, an EXT4-fs error is reported.

The kernel verifies ext4 inode information using a checksum. When there is a partition table error or the disk is damaged, verification fails, the kernel returns the EIO (Input/Output Error) error code, and the system reports "EXT4-fs error checksum invalid".
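
The general idea of checksum verification can be sketched as follows. This is an illustration of the technique only and does not reproduce the ext4 on-disk format or its checksum algorithm: zlib.crc32 is used as a stand-in, the function names are hypothetical, and the negative EIO return value follows the description above.

```python
import errno
import zlib


def inode_checksum(raw_inode: bytes) -> int:
    """Checksum over the raw inode bytes (CRC-32 used as a stand-in)."""
    return zlib.crc32(raw_inode) & 0xFFFFFFFF


def verify_inode(raw_inode: bytes, stored_checksum: int) -> int:
    """Return 0 if the checksum matches, or -EIO if it does not."""
    if inode_checksum(raw_inode) != stored_checksum:
        return -errno.EIO
    return 0
```

If a single byte of the inode data changes after the checksum was stored, the recomputed value no longer matches and the verification returns the error code.
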
Triggering method
Forcibly remove the disk, reinsert it, and then read data from it.

MCE

An MCE is a type of hardware error that occurs when a CPU detects a hardware problem. The interrupt number is 18, and the exception type is abort.

Details
MCEs are caused by bus faults, memory ECC errors, cache errors, TLB errors, or internal clock errors. In addition to hardware faults, inappropriate BIOS configurations, firmware bugs, and software bugs may also cause MCEs.

When an MCE is reported, the OS checks a group of registers called machine-check MSRs (model-specific registers) and executes the corresponding handling function based on the error codes in these registers. (The function varies depending on the chip architecture.)
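
As an illustration of reading error information from such a register, the sketch below decodes a 64-bit status value. It assumes the x86 machine-check bank status layout (flag bits 57 to 63, a model-specific code in bits 16 to 31, and the error code in bits 0 to 15); other architectures differ, and the function name and the dictionary keys are hypothetical.

```python
def decode_mci_status(status: int) -> dict:
    """Split an x86 machine-check bank status value into its main fields."""

    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),            # the register holds a logged error
        "overflow": bit(62),         # an earlier error was overwritten
        "uncorrected": bit(61),      # hardware could not correct the error
        "enabled": bit(60),          # reporting was enabled for this error
        "misc_valid": bit(59),       # the companion MISC register is valid
        "addr_valid": bit(58),       # the companion ADDR register is valid
        "context_corrupt": bit(57),  # processor context may be corrupted
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }
```

A handler would typically look at the valid and uncorrected flags first and then branch on the error code, which is the step the text above describes as executing the corresponding function.
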
Triggering method
An MCE is reported when there is a bus fault, memory ECC error, cache error, TLB error, or internal clock error.

Fatal Signal

A fatal signal is a signal that ends the normal execution of the receiving process. SIGKILL and SIGSTOP can be neither ignored nor handled by a user-defined handler, and SIGILL terminates the process unless a handler is installed.

Details
The signal mechanism is an asynchronous notification mechanism for communication between processes in the system. When a signal is sent to a process, the OS interrupts the process, and any non-atomic operation in progress is interrupted.

If a signal is SIGKILL, SIGSTOP, or SIGILL, it is a fatal signal.

Triggering method
Run a user-mode program that executes an invalid instruction, or run kill -9 to kill a process.
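
The kill -9 case can be reproduced from a script. The following is a minimal sketch that assumes a POSIX system; the helper name is hypothetical. It starts a child Python process that only sleeps, sends it a signal, and returns the exit status, which subprocess reports as the negative signal number when the child was terminated by a signal.

```python
import signal
import subprocess
import sys


def kill_child_with(sig: int) -> int:
    """Start a sleeping child process, send it sig, and return its status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.send_signal(sig)
    # A negative return code means the child was terminated by a signal.
    return child.wait(timeout=10)
```

Calling kill_child_with(signal.SIGKILL) returns the negative value of SIGKILL, which shows that the child did not get a chance to handle or ignore the signal.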

Warning

Warning is an action taken to report a kernel issue that requires immediate attention when the OS is running. Warning prints the call stack information when the issue occurs. The OS continues to run after a warning.

Details
Warning is triggered when macros such as WARN, WARN_ON, and WARN_ON_ONCE are invoked.

There are several causes of invoking a warning macro. You need to trace the call stack to locate the cause. A warning macro does not change the system status and does not provide guidance for handling the warning.
Triggering method
In kernel code, invoke a warning macro such as WARN_ON with a condition that evaluates to true.

Panic

A kernel panic refers to the action taken by the OS when it detects a fatal internal error that it cannot safely handle. When such an exception occurs while the kernel is running, the kernel calls the panic function to print the information collected at the time of the exception.

Details
There are various causes for the exception. Common causes include kernel stack overflow, division by zero, out-of-bounds memory access, and kernel deadlock. When this exception occurs, locate the cause of the panic based on the call information printed for the exception.
Triggering method
Read address 0 (dereference a null pointer) in kernel mode.
