netdev - Re: [Bug report] veth cannot be created, reporting page allocation failure

Message-ID: <20240221164942.5af086c5@kernel.org>
Date: Wed, 21 Feb 2024 16:49:42 -0800
From: Jakub Kicinski <kuba@...nel.org>
To: Miao Wang <shankerwangmiao@...il.com>
Cc: netdev@...r.kernel.org, pabeni@...hat.com, "David S. Miller"
 <davem@...emloft.net>
Subject: Re: [Bug report] veth cannot be created, reporting page allocation
 failure

On Tue, 20 Feb 2024 22:38:52 +0800 Miao Wang wrote:
> I tried to bisect the kernel to find the commit that introduced the problem, but
> it would take too long to carry out the tests. However, after 4 rounds of
> bisecting, by examining the remaining commits, I'm convinced that the problem is
> caused by the following commit:
> 
>   9d3684c24a5232 ("veth: create by default nr_possible_cpus queues")
> 
> where the veth module was changed to create queues for all possible
> cpus when userland does not provide the expected number of queues. The previous
> behavior was to create only one queue in the same condition. The memory needed
> becomes large when the number of cpus is large: 96 * 768 = 72KB, or 18
> contiguous 4K pages in total, so it is no wonder the allocation fails. I guess
> on certain platforms, the number of possible cpus might be even larger, and
> larger than the number of cpu cores physically installed, since several people
> in the above discussion mentioned that manually specifying nr_cpus on the boot
> command line can work around the problem.
> 
> I've carried out a cross check by applying the commit on the working 5.10
> kernel, and the problem occurs. Then I reverted the commit on the 6.1 kernel,
> and the problem has not occurred for 27 hours.

Thank you for the very detailed report! Would you be willing to give
this patch a try and report back if it fixes the problem for you?

It won't help with the memory waste but should make the allocation
failures less likely:

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a786be805709..cd4a6fe458f9 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1461,7 +1461,8 @@ static int veth_alloc_queues(struct net_device *dev)
 	struct veth_priv *priv = netdev_priv(dev);
 	int i;
 
-	priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL_ACCOUNT);
+	priv->rq = kvcalloc(dev->num_rx_queues, sizeof(*priv->rq),
+			    GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
 	if (!priv->rq)
 		return -ENOMEM;
 
@@ -1477,7 +1478,7 @@ static void veth_free_queues(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
 
-	kfree(priv->rq);
+	kvfree(priv->rq);
 }
 
 static int veth_dev_init(struct net_device *dev)
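
For reference, the size arithmetic from the report can be checked with a
small userspace sketch. The helper names below are hypothetical and are not
kernel APIs; alloc_order() only mimics, as far as I understand it, how a
large physically contiguous request gets rounded up to a power-of-two
number of pages. The 96 and 768 figures are the ones quoted in the report.

```c
#include <stddef.h>

/* Hypothetical helper: number of whole pages needed to hold `bytes`,
 * rounding up. */
static size_t pages_needed(size_t bytes, size_t page_size)
{
	return (bytes + page_size - 1) / page_size;
}

/* Hypothetical helper: smallest order such that
 * (page_size << order) >= bytes. This is a userspace analogue of the
 * rounding a contiguous page allocation is assumed to apply. */
static unsigned int alloc_order(size_t bytes, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	while (span < bytes) {
		span <<= 1;
		order++;
	}
	return order;
}
```

With the reported numbers, pages_needed(96 * 768, 4096) gives 18 and
alloc_order(96 * 768, 4096) gives 5, i.e. a 32-page contiguous block if
the request is rounded up that way. That is the kind of allocation that
gets hard to satisfy on a fragmented system, and it is the case where the
vmalloc fallback in kvcalloc() should help.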