Message-Id: <78C6CA8F-3634-418A-8A50-71753B5DB0C8@gmail.com>
Date: Sat, 24 Feb 2024 06:37:18 +0800
From: Miao Wang <shankerwangmiao@...il.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org,
pabeni@...hat.com,
"David S. Miller" <davem@...emloft.net>
Subject: Re: [Bug report] veth cannot be created, reporting page allocation
failure
> On Feb 22, 2024, at 23:47, Miao Wang <shankerwangmiao@...il.com> wrote:
>
>
>
>> On Feb 22, 2024, at 08:49, Jakub Kicinski <kuba@...nel.org> wrote:
>>
>> On Tue, 20 Feb 2024 22:38:52 +0800 Miao Wang wrote:
>>> I tried to bisect the kernel to find the commit that introduced the problem, but
>>> it would take too long to carry out the tests. However, after 4 rounds of
>>> bisecting, by examining the remaining commits, I'm convinced that the problem is
>>> caused by the following commit:
>>>
>>> 9d3684c24a5232 ("veth: create by default nr_possible_cpus queues")
>>>
>>> where changes were made to the veth module to create one queue per possible
>>> CPU when the userland does not specify the expected number of queues. The
>>> previous behavior was to create only one queue in that case. The memory
>>> needed becomes large when the number of CPUs is large: 96 * 768 = 72KB in
>>> total, i.e. 18 contiguous 4K pages, so it is no wonder the allocation fails.
>>> I guess that on certain platforms the number of possible CPUs might be even
>>> larger, and larger than the number of CPU cores physically installed, since
>>> several people in the discussion above mentioned that manually specifying
>>> nr_cpus on the boot command line can work around the problem.
>>>
>>> I carried out a cross check by applying the commit to the working 5.10
>>> kernel, and the problem occurred. I then reverted the commit on the 6.1
>>> kernel, and the problem has not occurred for 27 hours.
>>
>> Thank you for the very detailed report! Would you be willing to give
>> this patch a try and report back if it fixes the problem for you?
>>
>> It won't help with the memory waste but should make the allocation
>> failures less likely:
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index a786be805709..cd4a6fe458f9 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -1461,7 +1461,8 @@ static int veth_alloc_queues(struct net_device *dev)
>> struct veth_priv *priv = netdev_priv(dev);
>> int i;
>>
>> - priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL_ACCOUNT);
>> + priv->rq = kvcalloc(dev->num_rx_queues, sizeof(*priv->rq),
>> + GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
>> if (!priv->rq)
>> return -ENOMEM;
>>
>> @@ -1477,7 +1478,7 @@ static void veth_free_queues(struct net_device *dev)
>> {
>> struct veth_priv *priv = netdev_priv(dev);
>>
>> - kfree(priv->rq);
>> + kvfree(priv->rq);
>> }
>>
>> static int veth_dev_init(struct net_device *dev)
>
> I applied this patch directly to the veth module on the 6.1.0 stable kernel,
> since no reboot is required. My previous test, with the commit in question
> reverted, had run for about 76 hours without the problem occurring before I
> replaced the veth module with this patched version. I'll monitor and report
> back after 24 hours if the problem does not recur.
>
It has now been about 30 hours since the patch was applied, and the problem has not occurred.
Cheers,
Miao Wang