Message-Id: <78C6CA8F-3634-418A-8A50-71753B5DB0C8@gmail.com>
Date: Sat, 24 Feb 2024 06:37:18 +0800
From: Miao Wang <shankerwangmiao@...il.com>
To: Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org,
pabeni@...hat.com,
"David S. Miller" <davem@...emloft.net>
Subject: Re: [Bug report] veth cannot be created, reporting page allocation
failure
> On Feb 22, 2024, at 23:47, Miao Wang <shankerwangmiao@...il.com> wrote:
>
>
>
>> On Feb 22, 2024, at 08:49, Jakub Kicinski <kuba@...nel.org> wrote:
>>
>> On Tue, 20 Feb 2024 22:38:52 +0800 Miao Wang wrote:
>>> I tried to bisect the kernel to find the commit that introduced the problem, but
>>> it would take too long to carry out the tests. However, after 4 rounds of
>>> bisecting, by examining the remaining commits, I'm convinced that the problem is
>>> caused by the following commit:
>>>
>>> 9d3684c24a5232 ("veth: create by default nr_possible_cpus queues")
>>>
>>> where changes were made to the veth module to create one queue per possible
>>> CPU when the userland does not specify the expected number of queues. The
>>> previous behavior was to create only one queue in that case. The memory
>>> needed becomes large when the number of CPUs is large: 96 * 768 = 72KB in
>>> total, i.e. 18 contiguous 4K pages, so it is no wonder the allocation fails.
>>> I guess that on certain platforms the number of possible CPUs might be even
>>> larger, and larger than the number of CPU cores physically installed, since
>>> several people in the discussion above mentioned that manually specifying
>>> nr_cpus on the boot command line can work around the problem.
>>>
>>> I carried out a cross check by applying the commit to the working 5.10
>>> kernel, and the problem occurred. I then reverted the commit on the 6.1
>>> kernel, and the problem has not occurred for 27 hours.
>>
>> Thank you for the very detailed report! Would you be willing to give
>> this patch a try and report back if it fixes the problem for you?
>>
>> It won't help with the memory waste but should make the allocation
>> failures less likely:
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index a786be805709..cd4a6fe458f9 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -1461,7 +1461,8 @@ static int veth_alloc_queues(struct net_device *dev)
>> struct veth_priv *priv = netdev_priv(dev);
>> int i;
>>
>> - priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL_ACCOUNT);
>> + priv->rq = kvcalloc(dev->num_rx_queues, sizeof(*priv->rq),
>> + GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
>> if (!priv->rq)
>> return -ENOMEM;
>>
>> @@ -1477,7 +1478,7 @@ static void veth_free_queues(struct net_device *dev)
>> {
>> struct veth_priv *priv = netdev_priv(dev);
>>
>> - kfree(priv->rq);
>> + kvfree(priv->rq);
>> }
>>
>> static int veth_dev_init(struct net_device *dev)
>
> I applied this patch directly to the veth module on the 6.1.0 stable kernel,
> since no reboot is required. My previous test, with the commit in question
> reverted, had run for about 76 hours without the problem occurring before I
> replaced the veth module with this patched version. I'll monitor and report
> back after 24 hours if the problem does not recur.
>
It has now been about 30 hours since the patch was applied, and the problem has not occurred.
Cheers,
Miao Wang