Message-ID: <5ca349f4-dabe-48d4-8c52-1b02c7650104@nvidia.com>
Date: Wed, 23 Apr 2025 14:47:57 +0300
From: Yael Chemla <ychemla@...dia.com>
To: Kuniyuki Iwashima <kuniyu@...zon.com>
Cc: davem@...emloft.net, edumazet@...gle.com, horms@...nel.org,
kuba@...nel.org, kuni1840@...il.com, netdev@...r.kernel.org,
pabeni@...hat.com
Subject: Re: [PATCH v5 net 2/3] net: Fix dev_net(dev) race in
unregister_netdevice_notifier_dev_net().
On 06/04/2025 18:37, Yael Chemla wrote:
> On 02/04/2025 0:58, Kuniyuki Iwashima wrote:
>> Hi Yael,
>>
>> Thanks for testing!
>>
>> From: Yael Chemla <ychemla@...dia.com>
>> Date: Tue, 1 Apr 2025 23:49:42 +0300
>>> Hi Kuniyuki,
>>> Sorry for the delay (I was OOO). I tested your patch, and while the race
>>> occurs much less frequently, it still happens—see the warnings and call
>>> traces below.
>>> Additionally, in some cases, the test that reproduces the race hangs.
>>> Debugging shows that we're stuck in an endless loop inside
>>> rtnl_net_dev_lock because the passive refcount is already zero, causing
>>> net_passive_inc_not_zero to return false, so it jumps to "again" and
>>> this repeats without ending.
>>> I suspect, as you mentioned before, that in such cases, the passive
>>> refcount was decreased from cleanup_net.
>>
>> This sounds weird.
>>
>> We assumed vif will be moved to init_net, then the infinite loop
>> should never happen.
>>
>> So the assumption was wrong and vif belonged to the dead netns and
>> was not moved to init_net ... ??
>>
>> Even if dev_change_net_namespace() fails, it leads to BUG().
>>
>
> Hi Kuniyuki,
> In failure scenarios, we observe that cleanup_net is invoked, followed
> by net_passive_dec, which reduces the passive refcount to zero. These
> are called before the call to unregister_netdevice_notifier_dev_net.
>
> During the test, dev_change_net_namespace is called once, but it
> operates on a different net_device pointer than the one passed to the
> final call of unregister_netdevice_notifier_dev_net, the call that enters
> the infinite loop (with net->passive=0 and net->ns.count=0, inside
> rtnl_net_dev_lock, as explained in the previous mail).
>
> Do you need additional debug information, perhaps specific details
> regarding reassigning the netns to init_net? Please let me know how I
> can help further.
>
Hi Kuniyuki,
Any updates on this?
Thanks,
Yael
>>>
>>>
>>> warnings and call traces:
>>>
>>> refcount_t: addition on 0; use-after-free.
>>
>> I guess this is from the old log or the test patch was not applied
>> because _inc_not_zero() will trigger REFCOUNT_ADD_NOT_ZERO_OVF and
>> then the message will be
>>
>> refcount_t: saturated; leaking memory
>>
>> , see __refcount_add_not_zero() and refcount_warn_saturate().
>>
>
> You are right, it's a mistake. I was unable to reproduce another failure
> with call trace info. The test either succeeds or hangs (infinite loop).
>
>>
>>> WARNING: CPU: 4 PID: 27219 at lib/refcount.c:25 refcount_warn_saturate
>>> (/usr/work/linux/lib/refcount.c:25 (discriminator 1))
>
>