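Kernel Defect Causes OS Panic and Unexpected Restart of Linux BMS Instances Under Heavy Network Traffic
Symptom
A BMS instance running Linux restarts unexpectedly, and the kernel log contains the following output: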
```
[ 206.049736] Call trace:
[ 206.053074] vring_interrupt+0x38/0xfc [virtio_ring]
[ 206.058968] __handle_irq_event_percpu+0x64/0x1e0
[ 206.064597] handle_irq_event+0x80/0x1d0
[ 206.069423] handle_fasteoi_irq+0xd4/0x220
[ 206.074418] __handle_domain_irq+0x84/0xf0
[ 206.079409] gic_handle_irq+0x78/0x2c0
[ 206.084045] el1_irq+0xb8/0x140
[ 206.088069] virtqueue_get_buf_ctx_packed+0x194/0x200 [virtio_ring]
[ 206.095254] virtqueue_get_buf_ctx+0x20/0x40 [virtio_ring]
[ 206.101012] Unable to handle kernel paging request at virtual address ffff800023d7800e
[ 206.101671] virtnet_receive+0xc4/0x260 [virtio_net]
[ 206.110739] Mem abort info:
[ 206.116600] virtnet_poll+0x60/0x2c0 [virtio_net]
[ 206.120271] ESR = 0x96000007
[ 206.125864] napi_poll+0xcc/0x264
[ 206.125867] net_rx_action+0xdc/0x21c
[ 206.129811] EC = 0x25: DABT (current EL), IL = 32 bits
[ 206.134011] __do_softirq+0x130/0x358
[ 206.138564] SET = 0, FnV = 0
[ 206.138566] EA = 0, S1PTW = 0
[ 206.144785] irq_exit+0x134/0x154
[ 206.144788] __handle_domain_irq+0x88/0xf0
[ 206.149348] Data abort info:
[ 206.153293] gic_handle_irq+0x78/0x2c0
[ 206.153296] el1_irq+0xb8/0x140
[ 206.157333] ISV = 0, ISS = 0x00000007
[ 206.161542] arch_cpu_idle+0x18/0x40
[ 206.166530] CM = 0, WnR = 0
[ 206.170300] default_idle_call+0x5c/0x1c0
[ 206.170304] cpuidle_idle_call+0x17c/0x1b4
[ 206.174944] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000284f50157000
[ 206.178968] do_idle+0xc8/0x15c
[ 206.178971] cpu_startup_entry+0x30/0xfc
[ 206.183696] [ffff800023d7800e] pgd=00000840081d1003
[ 206.188158] secondary_start_kernel+0x158/0x1ec
[ 206.192007] , p4d=00000840081d1003
[ 206.196899] Code: f9402e61 d37c3c00 3941c662 8b000020 (79401c00)
[ 206.201881] , pud=00000840081d2003
[ 206.209674] SMP: stopping secondary CPUs
[ 206.213710] , pmd=00000840177fb003, pte=0000000000000000
[ 207.228541] SMP: failed to stop secondary CPUs 0-156
[ 207.235964] Starting crashdump kernel...
[ 207.240749] ------------[ cut here ]------------
[ 207.246227] Some CPUs may be stale, kdump will be unreliable.
[ 207.252842] WARNING: CPU: 157 PID: 0 at arch/arm64/kernel/machine_kexec.c:156 machine_kexec+0x48/0x2b0
```
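To check whether a saved console or kdump log shows the same signature as the trace above, a small filter along the following lines may help. This is a minimal sketch, not an official diagnostic: the function name `matches_packed_ring_panic` is hypothetical, and the two marker strings are taken from the log excerpt above. Treat a match as a hint rather than proof, since other faults could produce the same two lines.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 when the log text on stdin contains both
# markers from the trace above (the packed-ring receive function and the
# paging-request fault message).
matches_packed_ring_panic() {
    log=$(cat)
    printf '%s\n' "$log" | grep -q 'virtqueue_get_buf_ctx_packed' &&
    printf '%s\n' "$log" | grep -q 'Unable to handle kernel paging request'
}

# Usage (the log path is an assumption; use wherever your logs are kept):
#   matches_packed_ring_panic < /var/log/messages && echo "signature found"
```

Both markers must be present because either one alone is too common to be meaningful: the function name appears in normal traces, and the fault message appears for unrelated page faults.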
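Possible Causes
In kernel version 5.10, the packed ring receive logic has a defect that causes an out-of-bounds memory access, which crashes the kernel.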
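To see whether an instance matches these conditions, one option is to check the kernel release and whether any virtio device has the packed ring feature enabled. The sketch below is illustrative only. It assumes that the sysfs `features` file of a virtio device prints one character per feature bit with bit 0 first, and that it reflects the negotiated feature set. `VIRTIO_F_RING_PACKED` is feature bit 34 in the virtio specification. The helper names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: $1 is the bit string read from a virtio "features" file.
# Bit 34 (VIRTIO_F_RING_PACKED) is the 35th character when bit 0 comes first.
packed_ring_negotiated() {
    bits=$1
    [ "$(printf '%s' "$bits" | cut -c35)" = "1" ]
}

# Hypothetical helper: returns 0 if the release string $1 is a 5.10 kernel.
is_5_10_kernel() {
    case $1 in
        5.10.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report on the running system; prints nothing where neither condition holds.
if is_5_10_kernel "$(uname -r)"; then
    echo "kernel $(uname -r) is in the 5.10 series"
fi
for f in /sys/bus/virtio/devices/*/features; do
    [ -r "$f" ] || continue   # skips the unexpanded glob when no device exists
    if packed_ring_negotiated "$(cat "$f")"; then
        echo "$f: packed ring enabled"
    fi
done
```

An instance on a 5.10 kernel with at least one packed ring device fits the conditions described here. Whether a specific kernel build already carries a fix cannot be determined from the version prefix alone.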
openEuler defect record:
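Constraints
The operations in this section modify system kernel parameters. Modifying kernel parameters online may cause kernel instability, reduced security, or compatibility issues. Evaluate the risks carefully before you proceed.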
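Workaround
The problem is caused by a defect in the packed ring receive logic: an out-of-bounds memory access triggers a page fault, and the page fault crashes the kernel. To work around it, configure the OS image to enable iommu=pt. With this setting, memory is allocated in the linear mapping region, so the page fault that causes the crash no longer occurs.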
- Remotely log in to the BMS instance.
- Modify the configuration file to set iommu=pt.
- x86 Ubuntu 22.04 LTS: in the configuration file /boot/grub/grub.cfg, append "intel_iommu=on iommu=pt" to the end of the boot entry.
linux /vmlinuz-5.15.0-25-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro net.ifnames=0 biosdevname=0 intel_iommu=on iommu=pt
- ARM openEuler: in the configuration file /boot/grub/grub.cfg or /boot/efi/EFI/euleros/grub.cfg, append "iommu.passthrough=1" to the end of the boot entry.
linux /vmlinuz-5.10.0-136.12.0.86.h1032.eulerosv2r12.aarch64 root=/dev/mapper/euleros-root ro iommu.passthrough=1
- Restart the instance for the configuration to take effect.
reboot
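After the restart, it can be worth confirming that the parameters actually reached the running kernel. The following is a minimal sketch and not part of the original procedure. `cmdline_has_param` is a hypothetical helper name, and the per-architecture parameter list is taken from the two boot entries shown in the steps above.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if the space-separated kernel command line
# in $1 contains exactly the parameter $2. Padding with spaces prevents
# "iommu=pt" from matching inside a longer token such as "intel_iommu=pt".
cmdline_has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# /proc/cmdline holds the command line the running kernel was booted with.
cmdline=$(cat /proc/cmdline 2>/dev/null || true)

# Pick the expected parameters for this architecture (from the steps above).
case $(uname -m) in
    x86_64)  wanted="intel_iommu=on iommu=pt" ;;
    aarch64) wanted="iommu.passthrough=1" ;;
    *)       wanted="" ;;
esac

for p in $wanted; do
    if cmdline_has_param "$cmdline" "$p"; then
        echo "$p: present"
    else
        echo "$p: MISSING"
    fi
done
```

If a parameter is reported as missing, the edited boot entry was probably not the one that booted, or the configuration file was regenerated and the change overwritten.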