Message-ID: <615fdc6d-0c8b-4f09-a03e-996410bd0a65@nvidia.com>
Date: Sun, 6 Apr 2025 18:37:49 +0300
From: Yael Chemla <ychemla@...dia.com>
To: Kuniyuki Iwashima <kuniyu@...zon.com>
Cc: davem@...emloft.net, edumazet@...gle.com, horms@...nel.org,
kuba@...nel.org, kuni1840@...il.com, netdev@...r.kernel.org,
pabeni@...hat.com
Subject: Re: [PATCH v5 net 2/3] net: Fix dev_net(dev) race in
unregister_netdevice_notifier_dev_net().
On 02/04/2025 0:58, Kuniyuki Iwashima wrote:
> Hi Yael,
>
> Thanks for testing!
>
> From: Yael Chemla <ychemla@...dia.com>
> Date: Tue, 1 Apr 2025 23:49:42 +0300
>> Hi Kuniyuki,
>> Sorry for the delay (I was OOO). I tested your patch, and while the race
>> occurs much less frequently, it still happens—see the warnings and call
>> traces below.
>> Additionally, in some cases, the test which reproduces the race hangs.
>> Debugging shows that we're stuck in an endless loop inside
>> rtnl_net_dev_lock because the passive refcount is already zero, causing
>> net_passive_inc_not_zero to return false, so it goes to "again" and this
>> repeats without ending.
>> I suspect, as you mentioned before, that in such cases, the passive
>> refcount was decreased from cleanup_net.
>
> This sounds weird.
>
> We assumed vif will be moved to init_net, then the infinite loop
> should never happen.
>
> So the assumption was wrong and vif belonged to the dead netns and
> was not moved to init_net ... ??
>
> Even if dev_change_net_namespace() fails, it leads to BUG().
>
Hi Kuniyuki,
In the failure scenarios, we observe that cleanup_net is invoked, followed
by net_passive_dec, which drops the passive refcount to zero. Both happen
before the call to unregister_netdevice_notifier_dev_net.
During the test, dev_change_net_namespace is called once, but it
operates on a different net_device pointer than the one passed to the final
call of unregister_netdevice_notifier_dev_net, which is the call that enters
the infinite loop (with net->passive=0 and net->ns.count=0, inside
rtnl_net_dev_lock, as explained in my previous mail).
Do you need additional debug information, perhaps specific details
about the reassignment of the netns to init_net? Please let me know how I
can help further.
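
To make the hang easier to reason about, here is a minimal userspace
sketch of the retry pattern described above. It is not the kernel code:
the model_* names and the max_tries bound are hypothetical and exist only
for illustration, and locking and RCU are left out. It only shows that an
inc-not-zero on a counter that is already zero can never succeed, so a
retry loop that keeps reading the same netns cannot make progress.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins, not the kernel structures. */
struct model_net {
	atomic_int passive;	/* models net->passive */
};

struct model_dev {
	struct model_net *net;	/* models dev_net(dev) */
};

/* Take a reference unless the counter has already dropped to zero. */
static bool model_passive_inc_not_zero(struct model_net *net)
{
	int old = atomic_load(&net->passive);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&net->passive, &old, old + 1))
			return true;
	}
	return false;
}

/*
 * Models the "again" retry. Returns the number of attempts used, or -1
 * if max_tries is exhausted. The bound is an assumption added so the
 * sketch terminates; the loop described in this thread has no bound.
 */
static int model_dev_lock(struct model_dev *dev, int max_tries)
{
	assert(max_tries > 0);

	for (int tries = 1; tries <= max_tries; tries++) {
		if (model_passive_inc_not_zero(dev->net))
			return tries;
		/* Nothing here changes dev->net, so the retry sees the same dead netns. */
	}
	return -1;
}
```

With a device whose model_net has passive == 1, model_dev_lock()
succeeds on the first attempt. With passive == 0 and a dev->net that
never changes, every attempt fails and only the artificial bound ends
the loop.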
>>
>>
>> warnings and call traces:
>>
>> refcount_t: addition on 0; use-after-free.
>
> I guess this is from the old log or the test patch was not applied
> because _inc_not_zero() will trigger REFCOUNT_ADD_NOT_ZERO_OVF and
> then the message will be
>
> refcount_t: saturated; leaking memory
>
> , see __refcount_add_not_zero() and refcount_warn_saturate().
>
You are right, it's a mistake. I was unable to reproduce another failure
with call trace info. The test either succeeds or hangs (infinite loop).
>
>> WARNING: CPU: 4 PID: 27219 at lib/refcount.c:25 refcount_warn_saturate
>> (/usr/work/linux/lib/refcount.c:25 (discriminator 1))