Message-ID:
<SN6PR02MB41572415707F0FA6D9A61247D4132@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Thu, 9 Jan 2025 03:16:03 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Breno Leitao <leitao@...ian.org>, Herbert Xu
<herbert@...dor.apana.org.au>, "saeedm@...dia.com" <saeedm@...dia.com>,
"tariqt@...dia.com" <tariqt@...dia.com>, "linux-hyperv@...r.kernel.org"
<linux-hyperv@...r.kernel.org>
CC: Andrew Morton <akpm@...ux-foundation.org>, Thomas Graf <tgraf@...g.ch>,
Tejun Heo <tj@...nel.org>, Hao Luo <haoluo@...gle.com>, Josh Don
<joshdon@...gle.com>, Barret Rhoden <brho@...gle.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH] rhashtable: Fix potential deadlock by moving
schedule_work outside lock
From: Breno Leitao <leitao@...ian.org> Sent: Thursday, January 2, 2025 2:16 AM
>
> On Sat, Dec 21, 2024 at 05:06:55PM +0800, Herbert Xu wrote:
> > On Thu, Dec 12, 2024 at 08:33:31PM +0800, Herbert Xu wrote:
> > >
> > > The growth check should stay with the atomic_inc. Something like
> > > this should work:
> >
> > OK I've applied your patch with the atomic_inc move.
>
> Sorry, I was on vacation, and I am back now. Let me know if you need
> anything further.
>
> Thanks for fixing it,
> --breno
Breno and Herbert --
This patch seems to break things in linux-next. I'm testing with
linux-next20250108 in a VM in the Azure public cloud. The Mellanox mlx5
Ethernet NIC in the VM is failing to get set up.
I bisected to commit e1d3422c95f0 ("rhashtable: Fix potential deadlock
by moving schedule_work outside lock"), then debugged why opening
the mlx5 NIC device is failing. The failure is in the XDP code in function
__xdp_reg_mem_model() where the call to rhashtable_insert_slow()
is returning -E2BIG. The problem does not occur when the commit
is reverted.
The function call stack is this:
    dev_open()
    __dev_open()
    mlx5e_open()
    mlx5e_open_locked()
    mlx5e_open_channels()
    mlx5e_open_channel()
    mlx5e_open_queues()
    mlx5e_open_rxq_rq()
    mlx5e_open_rq()
    mlx5e_alloc_rq()
    xdp_rxq_info_reg_mem_model()
    __xdp_reg_mem_model()
    rhashtable_insert_slow()
I have not debugged further as I don't know anything about the
rhashtable code or the XDP code. The only repro I have is a VM
in Azure. I thought I'd ask you (Breno and Herbert) to review
the patch again and see if there's a path that could cause the
hash table to be incorrectly detected as full.
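To illustrate what "incorrectly detected as full" could look like in general, here is a minimal toy sketch in C. It is not the kernel rhashtable code and is not a diagnosis of this failure: toy_table, toy_contains(), toy_insert(), TOY_CAP, and the miscount flag are all hypothetical names invented for this example. It only assumes a table that refuses inserts with -E2BIG once an element counter reaches a limit, and shows how that counter drifting away from the real element count would produce -E2BIG on a nearly empty table.

```c
#include <assert.h>
#include <errno.h>

#define TOY_CAP 8  /* hypothetical fixed storage size for the toy table */

/* Toy model only; every name here is made up for illustration. */
struct toy_table {
    unsigned int nelems;     /* counter consulted by the fullness check */
    unsigned int max_elems;  /* inserts fail with -E2BIG at this count */
    unsigned int stored;     /* number of keys actually held in keys[] */
    int keys[TOY_CAP];
};

/* Return 1 if key is already stored, 0 otherwise. */
int toy_contains(const struct toy_table *t, int key)
{
    for (unsigned int i = 0; i < t->stored; i++)
        if (t->keys[i] == key)
            return 1;
    return 0;
}

/*
 * Insert key. When miscount is nonzero, model a hypothetical accounting
 * bug: the counter is bumped even though no new element was added.
 */
int toy_insert(struct toy_table *t, int key, int miscount)
{
    assert(t->max_elems <= TOY_CAP);

    /* Fullness check relies only on the counter, not on real contents. */
    if (t->nelems >= t->max_elems)
        return -E2BIG;

    if (toy_contains(t, key)) {
        if (miscount)
            t->nelems++;     /* counter drifts away from 'stored' */
        return -EEXIST;
    }

    t->keys[t->stored++] = key;
    t->nelems++;
    return 0;
}
```

With max_elems set to 4 and miscount enabled, inserting one key and then retrying the same key three times leaves a single stored element but a counter of 4, so the next insert of a new key returns -E2BIG. With miscount disabled, the same sequence succeeds. Whether anything like this happens in the real code path is exactly the open question above.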
I've included the linux-hyperv mailing list and the mlx5 driver
maintainers on this email. Someone involved with Azure/Hyper-V
or the mlx5 driver may have seen the problem, and I want to try
to avoid duplicative debugging.
Let me know if there's something I can do to help debug further.
Thanks,
Michael Kelley