Updated: 2025-07-18 GMT+08:00

Kernel Defect Causes OS Panic and Unexpected Restart of a Linux BMS Under Heavy Network Traffic

Symptom

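A BMS instance running Linux restarts unexpectedly, and the kernel log contains output similar to the following. In this capture, the lines of the fault report and the lines of the call trace are interleaved.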

```
[ 206.049736] Call trace:
[ 206.053074] vring_interrupt+0x38/0xfc [virtio_ring]
[ 206.058968] __handle_irq_event_percpu+0x64/0x1e0
[ 206.064597] handle_irq_event+0x80/0x1d0
[ 206.069423] handle_fasteoi_irq+0xd4/0x220
[ 206.074418] __handle_domain_irq+0x84/0xf0
[ 206.079409] gic_handle_irq+0x78/0x2c0
[ 206.084045] el1_irq+0xb8/0x140
[ 206.088069] virtqueue_get_buf_ctx_packed+0x194/0x200 [virtio_ring]
[ 206.095254] virtqueue_get_buf_ctx+0x20/0x40 [virtio_ring]
[ 206.101012] Unable to handle kernel paging request at virtual address ffff800023d7800e
[ 206.101671] virtnet_receive+0xc4/0x260 [virtio_net]
[ 206.110739] Mem abort info:
[ 206.116600] virtnet_poll+0x60/0x2c0 [virtio_net]
[ 206.120271] ESR = 0x96000007
[ 206.125864] napi_poll+0xcc/0x264
[ 206.125867] net_rx_action+0xdc/0x21c
[ 206.129811] EC = 0x25: DABT (current EL), IL = 32 bits
[ 206.134011] __do_softirq+0x130/0x358
[ 206.138564] SET = 0, FnV = 0
[ 206.138566] EA = 0, S1PTW = 0
[ 206.144785] irq_exit+0x134/0x154
[ 206.144788] __handle_domain_irq+0x88/0xf0
[ 206.149348] Data abort info:
[ 206.153293] gic_handle_irq+0x78/0x2c0
[ 206.153296] el1_irq+0xb8/0x140
[ 206.157333] ISV = 0, ISS = 0x00000007
[ 206.161542] arch_cpu_idle+0x18/0x40
[ 206.166530] CM = 0, WnR = 0
[ 206.170300] default_idle_call+0x5c/0x1c0
[ 206.170304] cpuidle_idle_call+0x17c/0x1b4
[ 206.174944] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000284f50157000
[ 206.178968] do_idle+0xc8/0x15c
[ 206.178971] cpu_startup_entry+0x30/0xfc
[ 206.183696] [ffff800023d7800e] pgd=00000840081d1003
[ 206.188158] secondary_start_kernel+0x158/0x1ec
[ 206.192007] , p4d=00000840081d1003
[ 206.196899] Code: f9402e61 d37c3c00 3941c662 8b000020 (79401c00)
[ 206.201881] , pud=00000840081d2003
[ 206.209674] SMP: stopping secondary CPUs
[ 206.213710] , pmd=00000840177fb003, pte=0000000000000000
[ 207.228541] SMP: failed to stop secondary CPUs 0-156
[ 207.235964] Starting crashdump kernel...
[ 207.240749] ------------[ cut here ]------------
[ 207.246227] Some CPUs may be stale, kdump will be unreliable.
[ 207.252842] WARNING: CPU: 157 PID: 0 at arch/arm64/kernel/machine_kexec.c:156 machine_kexec+0x48/0x2b0
```
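As a quick way to compare a saved console or kdump log with the trace above, the following is a minimal sketch. The function name `log_matches_packed_ring_panic` is hypothetical, and the two marker strings are taken from the log shown here. A match only suggests the same failure pattern and does not by itself confirm the root cause.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 when the given log file contains both
# markers seen in the trace above (the paging fault and the packed-ring
# receive function), and non-zero otherwise.
log_matches_packed_ring_panic() {
    grep -q 'Unable to handle kernel paging request' "$1" &&
        grep -q 'virtqueue_get_buf_ctx_packed' "$1"
}
```

Usage, assuming the log was saved to a file of your choice: `log_matches_packed_ring_panic /path/to/saved.log && echo "signature found"`.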

Possible Causes

In kernel version 5.10, the packed ring packet receive logic has a defect that causes an out-of-bounds memory access and crashes the kernel.

openEuler defect record:

https://gitee.com/openeuler/kernel/issues/I9RQAS
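To get a rough idea of whether an instance is in scope, you can check the kernel series and whether any virtio device negotiated the packed ring feature. The sketch below assumes that the sysfs `features` file of a virtio device prints one character per feature bit starting from bit 0, and that packed ring is bit 34 (`VIRTIO_F_RING_PACKED` in the virtio specification). All function names are hypothetical. A 5.10 kernel with packed ring negotiated is not necessarily affected, because the check cannot tell whether a given build already carries a fix.

```shell
#!/bin/sh
# Returns 0 when the release string (as printed by `uname -r`) is in the 5.10 series.
is_510_kernel() {
    case "$1" in
        5.10.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Returns 0 when bit 34 is set in a virtio feature string.
# Bit 0 is the first character, so bit 34 is the 35th character.
packed_ring_negotiated() {
    bit=$(printf '%s' "$1" | cut -c35)
    [ "$bit" = "1" ]
}

# Prints a short report for the running system.
report_packed_ring_scope() {
    release=$(uname -r)
    if is_510_kernel "$release"; then
        echo "kernel $release is in the 5.10 series"
    else
        echo "kernel $release is not in the 5.10 series"
    fi
    for f in /sys/bus/virtio/devices/virtio*/features; do
        [ -r "$f" ] || continue
        if packed_ring_negotiated "$(cat "$f")"; then
            echo "$f: packed ring negotiated"
        fi
    done
}
```

Calling `report_packed_ring_scope` on the instance prints the kernel series and lists any virtio device that negotiated packed ring.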

Constraints

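The operations in this section modify system kernel parameters. Modifying kernel parameters online may cause kernel instability, reduced security, or compatibility issues. Evaluate the risks carefully before you proceed.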

Workaround

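The root cause is a defect in the packed ring packet receive logic: it accesses memory out of bounds, which triggers a page fault and crashes the kernel. To work around the problem, configure the OS image to enable iommu=pt. With this setting, the memory is allocated in the linear mapping region, so the out-of-bounds access no longer triggers a page fault and the crash does not occur.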

  1. Log in to the BMS instance remotely.
  2. Modify the boot configuration file to enable IOMMU passthrough (iommu=pt).
    • x86 Ubuntu 22.04 LTS: in the configuration file /boot/grub/grub.cfg, append "intel_iommu=on iommu=pt" to the end of the boot entry.
      linux   /vmlinuz-5.15.0-25-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro net.ifnames=0 biosdevname=0 intel_iommu=on iommu=pt
    • ARM openEuler: in the configuration file /boot/grub/grub.cfg or /boot/efi/EFI/euleros/grub.cfg, append "iommu.passthrough=1" to the end of the boot entry.
      linux   /vmlinuz-5.10.0-136.12.0.86.h1032.eulerosv2r12.aarch64 root=/dev/mapper/euleros-root ro iommu.passthrough=1
  3. Restart the instance for the configuration to take effect.
    reboot
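
After the restart, it can be useful to confirm that the running kernel actually booted with the new parameters. The following is a minimal sketch that reads /proc/cmdline and looks for the parameters listed in step 2. The function names are hypothetical, and mapping `uname -m` values of `x86_64` and `aarch64` to the two cases above is an assumption.

```shell
#!/bin/sh
# Returns 0 when the parameter appears as a whole word in the command line string.
cmdline_has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Checks the running kernel's command line for the parameters from step 2.
check_passthrough_cmdline() {
    cmdline=$(cat /proc/cmdline)
    case "$(uname -m)" in
        x86_64)  wanted="intel_iommu=on iommu=pt" ;;
        aarch64) wanted="iommu.passthrough=1" ;;
        *) echo "unrecognized architecture: $(uname -m)"; return 2 ;;
    esac
    for p in $wanted; do
        if ! cmdline_has_param "$cmdline" "$p"; then
            echo "missing boot parameter: $p"
            return 1
        fi
    done
    echo "boot parameters present: $wanted"
}
```

Run `check_passthrough_cmdline` on the instance. If it reports a missing parameter, the edited boot entry was probably not the one that was booted.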
