Message-Id: <5F52CAE2-2FB7-4712-95F1-3312FBBFA8DD@gmail.com>
Date: Tue, 20 Feb 2024 22:38:52 +0800
From: Miao Wang <shankerwangmiao@...il.com>
To: netdev@...r.kernel.org
Cc: pabeni@...hat.com,
 "David S. Miller" <davem@...emloft.net>
Subject: [Bug report] veth cannot be created, reporting page allocation
 failure

Hi, all

I'm writing to report an issue with the veth module. The symptom is that
occasionally the veth pair can't be created, and the error shown in dmesg is
something like the following:

  dockerd: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=docker.service,mems_allowed=0

I ran into this problem after upgrading the kernel of an open source mirror
server from 5.10 to 6.1, as provided by Debian. I also tried the latest
mainline kernel and the problem still exists.

The server has 512GB of memory and 96 cores across 2 Kunpeng 920 CPUs. The
served files reside on a ZFS filesystem. About 75% ~ 90% of the memory is used
by the zfs arc cache. We run sync jobs inside docker containers to sync the
files from upstreams. As a result, docker containers are created and destroyed
frequently.

After the upgrade, docker containers occasionally fail to be created. The error
message from the docker daemon indicates that creating the veth pair failed.
The full error message from dmesg is attached at the end. On our server, across
several reboots with different kernel versions, the first occurrence of the
error is about 7~8 hours after boot. I also ruled out zfs and docker as the
cause by using exactly the same versions of zfs and docker across the reboots.

Similar issues are also reported in the following link:

  https://github.com/docker/for-linux/issues/1443

I tried to bisect the kernel to find the commit that introduced the problem, but
it would take too long to carry out the tests. However, after 4 rounds of
bisecting, by examining the remaining commits, I'm convinced that the problem is
caused by the following commit:

  9d3684c24a5232 ("veth: create by default nr_possible_cpus queues")

where the veth module was changed to create queues for all possible cpus when
userland does not provide the expected number of queues. The previous behavior
was to create only one queue in that case. The memory needed becomes large when
the number of cpus is large: here it is 96 * 768 bytes = 72KB, or 18 contiguous
4K pages in total, so it is no wonder that the allocation fails. I guess that
on certain platforms the number of possible cpus might be even larger, and
larger than the number of cpu cores physically installed, since several people
in the above discussion mentioned that manually specifying nr_cpus on the boot
command line works around the problem.
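
As a rough check of the arithmetic above, here is a minimal Python sketch. The
helper names are hypothetical; the 768 bytes per queue and the 4K page size are
the figures quoted in this report, not values read from the kernel source. It
assumes that a large kmalloc request is rounded up to a power-of-two number of
pages, which would explain why an 18-page request shows up as order:5 (32
pages, 128KB) in the dmesg output.

```python
PAGE_SIZE = 4096        # 4K pages, as stated in the report
PER_QUEUE_BYTES = 768   # per-queue size quoted in the report (assumption)


def rq_array_bytes(num_queues, per_queue_bytes=PER_QUEUE_BYTES):
    """Total bytes needed for one veth device's receive-queue array."""
    return num_queues * per_queue_bytes


def pages_needed(nbytes, page_size=PAGE_SIZE):
    """Number of pages needed to hold nbytes (rounded up)."""
    return -(-nbytes // page_size)


def alloc_order(nbytes, page_size=PAGE_SIZE):
    """Smallest order such that 2**order pages can hold nbytes."""
    return (pages_needed(nbytes, page_size) - 1).bit_length()


if __name__ == "__main__":
    size = rq_array_bytes(96)           # one queue per possible cpu
    print(size)                         # 73728 bytes (72KB)
    print(pages_needed(size))           # 18 pages
    print(alloc_order(size))            # order 5
    print(alloc_order(rq_array_bytes(1)))  # order 0 with a single queue
```

With a single queue, as in the previous default, the same calculation fits in
one page, so no high-order contiguous allocation would be needed.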

I carried out a cross check by applying the commit to the working 5.10 kernel,
and the problem occurs. I then reverted the commit on the 6.1 kernel, and the
problem has not occurred for 27 hours.

I think that the change to the default behavior of veth creation should be
reconsidered to reduce memory waste and avoid the allocation failure, since
linux containers rely heavily on veth pairs.

Cheers,

Miao Wang


Attached dmesg:

[11203.923303] dockerd: page allocation failure: order:5, mode:0x440dc0(GFP_KERNEL_ACCOUNT|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=docker.service,mems_allowed=0-3
[11203.939669] CPU: 56 PID: 361309 Comm: dockerd Kdump: loaded Tainted: P           OE      6.1.0-17-arm64 #1  Debian 6.1.69-1
[11203.951427] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDDA, BIOS 1.38 07/04/2020
[11203.960245] Call trace:
[11203.963304]  dump_backtrace+0xe4/0x140
[11203.967570]  show_stack+0x20/0x30
[11203.971407]  dump_stack_lvl+0x64/0x80
[11203.975590]  dump_stack+0x18/0x34
[11203.979411]  warn_alloc+0x124/0x1ac
[11203.983395]  __alloc_pages+0xda0/0xe64
[11203.987634]  __kmalloc_large_node+0x94/0x170
[11203.992396]  __kmalloc+0x128/0x1d0
[11203.996295]  veth_dev_init+0x8c/0x104 [veth]
[11204.001058]  register_netdevice+0xf8/0x5a4
[11204.005649]  veth_newlink+0x1e0/0x460 [veth]
[11204.010406]  __rtnl_newlink+0x5f0/0x860
[11204.014725]  rtnl_newlink+0x58/0x84
[11204.018710]  rtnetlink_rcv_msg+0x274/0x36c
[11204.023292]  netlink_rcv_skb+0x64/0x130
[11204.027617]  rtnetlink_rcv+0x20/0x30
[11204.031682]  netlink_unicast+0x2d4/0x33c
[11204.036078]  netlink_sendmsg+0x1d8/0x450
[11204.040475]  __sock_sendmsg+0x5c/0x70
[11204.044613]  __sys_sendto+0x10c/0x16c
[11204.048744]  __arm64_sys_sendto+0x30/0x40
[11204.053221]  invoke_syscall+0x78/0x100
[11204.057439]  el0_svc_common.constprop.0+0x4c/0xf4
[11204.062604]  do_el0_svc+0x34/0xd0
[11204.066388]  el0_svc+0x34/0xd4
[11204.069908]  el0t_64_sync_handler+0xf4/0x120
[11204.074639]  el0t_64_sync+0x18c/0x190
[11204.078954] Mem-Info:
[11204.081746] active_anon:4881 inactive_anon:2091912 isolated_anon:0
[11204.081746]  active_file:74800 inactive_file:76095 isolated_file:0
[11204.081746]  unevictable:4 dirty:158 writeback:2
[11204.081746]  slab_reclaimable:1596340 slab_unreclaimable:34544450
[11204.081746]  mapped:35555 shmem:3865 pagetables:13950
[11204.081746]  sec_pagetables:0 bounce:0
[11204.081746]  kernel_misc_reclaimable:0
[11204.081746]  free:1867614 free_pcp:12171 free_cma:48513
[11204.125833] Node 0 active_anon:17128kB inactive_anon:2173780kB active_file:47228kB inactive_file:59048kB unevictable:16kB isolated(anon):0kB isolated(file):0kB mapped:47124kB dirty:116kB writeback:0kB shmem:7092kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 323584kB writeback_tmp:0kB kernel_stack:11092kB pagetables:11944kB sec_pagetables:0kB all_unreclaimable? no
[11204.159593] Node 1 active_anon:624kB inactive_anon:1706892kB active_file:54120kB inactive_file:57144kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:41532kB dirty:28kB writeback:0kB shmem:3424kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 159744kB writeback_tmp:0kB kernel_stack:8692kB pagetables:10048kB sec_pagetables:0kB all_unreclaimable? no
[11204.192762] Node 2 active_anon:936kB inactive_anon:2221868kB active_file:59832kB inactive_file:58708kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:11072kB dirty:64kB writeback:0kB shmem:1860kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 241664kB writeback_tmp:0kB kernel_stack:10596kB pagetables:8644kB sec_pagetables:0kB all_unreclaimable? yes
[11204.226036] Node 3 active_anon:836kB inactive_anon:2265116kB active_file:138020kB inactive_file:129480kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:42492kB dirty:984kB writeback:24kB shmem:3084kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 180224kB writeback_tmp:0kB kernel_stack:10916kB pagetables:25164kB sec_pagetables:0kB all_unreclaimable? yes
[11204.259903] Node 0 DMA free:681676kB boost:0kB min:332kB low:1888kB high:3444kB reserved_highatomic:0KB active_anon:8192kB inactive_anon:304000kB active_file:1584kB inactive_file:2500kB unevictable:0kB writepending:4kB present:2092864kB managed:1598684kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:194052kB
[11204.289892] lowmem_reserve[]: 0 0 122877 122877
[11204.295192] Node 0 Normal free:1941500kB boost:0kB min:27112kB low:152936kB high:278760kB reserved_highatomic:2048KB active_anon:8936kB inactive_anon:1869616kB active_file:45644kB inactive_file:56548kB unevictable:16kB writepending:112kB present:132120576kB managed:125826136kB mlocked:16kB bounce:0kB free_pcp:40508kB local_pcp:0kB free_cma:0kB
[11204.327105] lowmem_reserve[]: 0 0 0 0
[11204.331639] Node 1 Normal free:686100kB boost:0kB min:28468kB low:160584kB high:292700kB reserved_highatomic:0KB active_anon:624kB inactive_anon:1707364kB active_file:54120kB inactive_file:57156kB unevictable:0kB writepending:60kB present:134217728kB managed:132117604kB mlocked:0kB bounce:0kB free_pcp:15484kB local_pcp:0kB free_cma:0kB
[11204.362967] lowmem_reserve[]: 0 0 0 0
[11204.367568] Node 2 Normal free:1283168kB boost:439048kB min:467516kB low:599632kB high:731748kB reserved_highatomic:0KB active_anon:936kB inactive_anon:2221864kB active_file:59832kB inactive_file:58708kB unevictable:0kB writepending:52kB present:134217728kB managed:132117600kB mlocked:0kB bounce:0kB free_pcp:12368kB local_pcp:32kB free_cma:0kB
[11204.399711] lowmem_reserve[]: 0 0 0 0
[11204.404341] Node 3 Normal free:2676652kB boost:435556kB min:463800kB low:594864kB high:725928kB reserved_highatomic:0KB active_anon:836kB inactive_anon:2265084kB active_file:138020kB inactive_file:129480kB unevictable:0kB writepending:1008kB present:134217728kB managed:131064536kB mlocked:0kB bounce:0kB free_pcp:17504kB local_pcp:0kB free_cma:0kB
[11204.436947] lowmem_reserve[]: 0 0 0 0
[11204.441696] Node 0 DMA: 2588*4kB (UMEC) 2322*8kB (UMEC) 2061*16kB (UMEC) 1664*32kB (UMEC) 1278*64kB (UMEC) 663*128kB (UMEC) 338*256kB (UMEC) 178*512kB (UMEC) 51*1024kB (UMC) 47*2048kB (UC) 18*4096kB (UC) = 681680kB
[11204.462171] Node 0 Normal: 14056*4kB (UEH) 13928*8kB (UMEH) 38891*16kB (UMEH) 35172*32kB (UMEH) 162*64kB (U) 924*128kB (UH) 2*256kB (H) 1*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 2045072kB
[11204.480482] Node 1 Normal: 16538*4kB (UME) 10181*8kB (UME) 4804*16kB (UME) 13628*32kB (UME) 296*64kB (U) 1300*128kB (UME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 846160kB
[11204.498139] Node 2 Normal: 26478*4kB (UME) 20542*8kB (UME) 8536*16kB (UME) 27135*32kB (UME) 207*64kB (U) 2096*128kB (U) 2*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1557192kB
[11204.515883] Node 3 Normal: 38676*4kB (UME) 56237*8kB (UME) 43905*16kB (UME) 42286*32kB (UE) 385*64kB (UE) 2193*128kB (UE) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2965832kB
[11204.533806] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[11204.543525] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
[11204.552945] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[11204.562353] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
[11204.571526] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[11204.581137] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
[11204.590554] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[11204.599819] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
[11204.608866] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[11204.618314] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
[11204.627718] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[11204.637004] Node 2 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
[11204.646201] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[11204.655783] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=32768kB
[11204.665174] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[11204.674608] Node 3 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=64kB
[11204.683672] 152133 total pagecache pages
[11204.688311] 0 pages in swap cache
[11204.692251] Free swap  = 0kB
[11204.695737] Total swap = 0kB
[11204.699194] 134216656 pages RAM
[11204.702911] 0 pages HighMem/MovableOnly
[11204.707288] 3535516 pages reserved
[11204.711243] 131072 pages cma reserved
[11204.715446] 0 pages hwpoisoned
