[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YmqYgu++0OuhfFxy@kroah.com>
Date: Thu, 28 Apr 2022 15:37:06 +0200
From: Greg KH <gregkh@...uxfoundation.org>
To: Lin Ma <linma@....edu.cn>
Cc: Duoming Zhou <duoming@....edu.cn>, krzysztof.kozlowski@...aro.org,
pabeni@...hat.com, linux-kernel@...r.kernel.org,
davem@...emloft.net, alexander.deucher@....com,
akpm@...ux-foundation.org, broonie@...nel.org,
netdev@...r.kernel.org, Jakub Kicinski <kuba@...nel.org>
Subject: Re: [PATCH net v4] nfc: ... device_is_registered() is data race-able
On Thu, Apr 28, 2022 at 09:22:11PM +0800, Lin Ma wrote:
> Hello there,
>
> >
> > Yes, that looks better,
>
> Cool, thanks again for giving comments. :)
>
> > but what is the root problem here that you are
> > trying to solve? Why does NFC need this when no other subsystem does?
> >
>
> Well, in fact, me and Duoming are keep finding concurrency bugs that happen
> between the device cleanup/detach routine and other undergoing routines.
>
> That is to say, when a device, no matter real or virtual, is detached from
> the machine, the kernel awake cleanup routine to reclaim the resource.
> In current case, the cleanup routine will call nfc_unregister_device().
>
> Other routines, mainly from user-space system calls, need to be careful of
> the cleanup event. In another word, the kernel need to synchronize these
> routines to avoid race bugs.
>
> In our practice, we find that many subsystems are prone to this type of bug.
>
> For example, in bluetooth we fix
>
> BT subsystem
> * e2cb6b891ad2 ("bluetooth: eliminate the potential race condition when removing
> the HCI controller")
> * fa78d2d1d64f ("Bluetooth: fix data races in smp_unregister(), smp_del_chan()")
> ..
>
> AX25 subsystem
> * 1ade48d0c27d ("ax25: NPD bug when detaching AX25 device")
> ..
>
> we currently focus on the net relevant subsystems and we now is auditing the NFC
> code.
>
> In another word, all subsystems need to take care of the synchronization issues.
> But seems that the solutions are varied between different subsystem.
>
> Empirically speaking, most of them use specific flags + specific locks to prevent
> the race.
>
> In such cases, if the cleanup routine first hold the lock, the other routines will
> wait on the locks. Since the cleanup routine write the specific flag, the other
> routine, after check the specific flag, will be aware of the cleanup stuff and just
> abort their tasks.
> If the other routines first hold the lock, the cleanup routine just wait them to
> finish.
>
> NFC here is special because it uses device_is_registered. I thought the author may
> believe this macro is race free. However, it is not. So we need to replace this check
> to make sure the netlink functions will 100 percent be aware of the cleanup routine
> and abort the task if they grab the device_lock lately. Otherwise, the nelink routine
> will call sub-layer code and possilby dereference resources that already freed.
>
> For example, one of my recent fix 3e3b5dfcd16a ("NFC: reorder the logic in
> nfc_{un,}register_device") takes the suggestion from maintainer as he thought the
> device_is_registered is enough. And for now we find out this device_is_registered
> is not enough.
How do you physically remove a NFC device from a system? What types of
busses are they on? If non-removable one, then odds are there's not
going to be many races so this is a low-priority thing. If they can be
on removable busses (USB, PCI, etc.), then that's a bigger thing.
But again, the NFC bus code should handle all of this for the drivers,
there's nothing special about NFC that should warrant a special need for
this type of thing.
thanks,
greg k-h
Powered by blists - more mailing lists