Message-ID: <Z8rKVZolu8n6lB1P@phenom.ffwll.local>
Date: Fri, 7 Mar 2025 11:28:37 +0100
From: Simona Vetter <simona.vetter@...ll.ch>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: John Hubbard <jhubbard@...dia.com>,
Greg KH <gregkh@...uxfoundation.org>,
Danilo Krummrich <dakr@...nel.org>,
Joel Fernandes <joelagnelf@...dia.com>,
Alexandre Courbot <acourbot@...dia.com>,
Dave Airlie <airlied@...il.com>, Gary Guo <gary@...yguo.net>,
Joel Fernandes <joel@...lfernandes.org>,
Boqun Feng <boqun.feng@...il.com>, Ben Skeggs <bskeggs@...dia.com>,
linux-kernel@...r.kernel.org, rust-for-linux@...r.kernel.org,
nouveau@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
paulmck@...nel.org
Subject: Re: [RFC PATCH 0/3] gpu: nova-core: add basic timer subdevice implementation

On Thu, Mar 06, 2025 at 11:32:36AM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 06, 2025 at 11:42:38AM +0100, Simona Vetter wrote:
> > > Further, I just remembered, (Danilo please notice!) there is another
> > > related issue here that DMA mappings *may not* outlive remove()
> > > either. netdev had a bug related to this recently and it was all
> > > agreed that it is not allowed. The kernel can crash in a couple of
> > > different ways if you try to do this.
> > >
> > > https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/#m0c7dda0fb5981240879c5ca489176987d688844c
> >
> > Hm, for the physical DMA I thought disabling PCI bus master should put a
> > stop to all this stuff?
>
> Not in the general case. Many device classes (eg platform) don't have
> something like "PCI bus master". It is also not always possible to
> reset a device, even in PCI.
>
> So the way things work today for module reload relies on the driver
> doing a full quiet down so that the next driver to attach can safely
> start up the device. Otherwise the next driver flips PCI bus master
> back on and immediately UAFs memory through rogue DMA.
>
> Relying on PCI Bus master also exposes a weakness we battled with in
> kexec. When the new driver boots up it has to gain control of the
> device and stop the DMA before flipping "PCI Bus Master" off. Almost
> no drivers actually do this, and some HW can't even achieve it without
> PCI reset (which is not always available). Meaning you end up with a
> likely UAF flow if you rely on this technique.
Yeah, this gets really hairy really fast. We might need some pragmatism
here and accept that we can't always do better than C.
And the entire "load a driver after a previous Linux driver has already
messed with the hardware" problem is a very broad issue, from rebinding to
module reload to kexec. With some hw it's just not possible to do safely,
and with a lot more hw it's not reliable due to complexity. E.g.
drm/i915/display can take over the gpu if outputs are enabled and fully
recover hw state into sw state. But de facto that only works for
configurations the fw/bootloader leaves behind, not in full generality.
Plus we don't handle misprogrammed hw at all.
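
To make that concrete, here's a rough sketch (all the my_* names are made
up for illustration, only the pci_*/free_irq calls are real APIs) of the
kind of quiet down a remove() has to do before the next driver can safely
bind:

#include <linux/interrupt.h>
#include <linux/pci.h>
#include <linux/slab.h>

struct my_dev {
	void __iomem *mmio;
	/* ... driver state ... */
};

static void my_remove(struct pci_dev *pdev)
{
	struct my_dev *md = pci_get_drvdata(pdev);

	/* Tell the hardware to stop issuing DMA (device specific). */
	/* my_stop_engines(md); */

	/* No new bus-mastered transactions after this point. */
	pci_clear_master(pdev);

	/* Make sure no interrupt handler can run into freed state. */
	free_irq(pci_irq_vector(pdev, 0), md);
	pci_free_irq_vectors(pdev);

	/* Only now tear down DMA mappings and MMIO. */
	pci_iounmap(pdev, md->mmio);
	kfree(md);
}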
> > For the sw lifecycle stuff I honestly didn't know that was an issue, I
> > guess that needs to be addressed in the dma-api wrappers or rust can blow
> > up in funny ways. C drivers just walk all mappings and shoot them.
>
> I wonder what people will come up with. DMA API is performance path,
> people are not going to accept pointless overheads there.
>
> IMHO whatever path the DMA API takes the MMIO design should follow
> it.
I think this needs to be subsystem specific, since very often there are
already data structures tracking all mappings, so it's easy to add a bit
of glue to nuke them all forcefully. Or at least data structures tracking
all pending requests, so again we can enforce that we stall until they
have all finished.
We'll probably end up with rust bindings being a lot more opinionated
about how a driver should work, which has the risk of going too far into
the midlayer mistake antipattern. I guess we'll see how that all pans out.
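
As an illustration of the "walk all mappings and shoot them" idea (again a
made-up sketch, not an existing subsystem API), the glue can be as dumb as
a list of live allocations that remove() force-frees:

#include <linux/dma-mapping.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct my_mapping {
	struct list_head node;
	void *cpu;
	dma_addr_t dma;
	size_t size;
};

struct my_dma_track {
	struct device *dev;
	spinlock_t lock;
	struct list_head mappings;	/* every live DMA allocation */
};

/* Called from remove(): no mapping may outlive the driver binding. */
static void my_nuke_mappings(struct my_dma_track *t)
{
	struct my_mapping *m, *tmp;
	LIST_HEAD(reap);

	/* Detach everything under the lock ... */
	spin_lock(&t->lock);
	list_splice_init(&t->mappings, &reap);
	spin_unlock(&t->lock);

	/* ... and free it outside of it, dma_free_coherent() must not be
	 * called in atomic context. */
	list_for_each_entry_safe(m, tmp, &reap, node) {
		list_del(&m->node);
		dma_free_coherent(t->dev, m->size, m->cpu, m->dma);
		kfree(m);
	}
}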
> > The trouble is that for real hotunplug, you need all this anyway. Because
> > when you physically hotunplug the interrupts will be dead, the mmio will
> > be gone any moment (not just at the beginning of an rcu revocable
> > section), so real hotunplug is worse than what we're trying to do here.
>
> There are two kinds of real hotunplug, the friendly kind that we see
> in physical PCI where you actually plonk a button on the case and wait
> for the light to go off. I.e. it is interactive and safe with the
> OS. Very similar to module reloading.
>
> And the hostile kind, like in thunderbolt, where it just goes away and
> dies.
>
> In the server world, other than nvme, we seem to focus on the friendly
> kind.
Yeah gpus tend to hang out in external enclosures sometimes, so I'm not
sure we can ignore the hostile kind.
> > So randomly interrupts not happening is something you need to cope with no
> > matter what.
>
> Yes
>
> > But for a driver unbind you _do_ have to worry about cleanly shutting down
> > the hardware. For the above reasons and also in general putting hardware
> > into a well-known (all off usually) state is better for then reloading a
> > new driver version and binding that. Except that there's no way to tell
> > whether your ->remove got called for unbinding or hotunplug.
>
> IMHO it doesn't really matter, the driver has to support the most
> difficult scenario anyhow. The only practical difference is that the
> MMIO might return -1 to all reads and the interrupts are dead. If you
> want to detect a gone PCI device then just do a register read and
> check for -1, which some drivers like mlx5 are doing as part of their
> resiliency strategy.
>
> > pci device was physically unplugged or not, and so for developer
> > convenience most pci drivers go with the "cleanly shut down everything"
> > approach, which is the wrong thing to do for actual hotunplug.
>
> I wouldn't say it is wrong. It is still the correct thing to do, and
> following down the normal cleanup paths is a good way to ensure the
> special case doesn't have bugs. The primary difference is you want to
> understand the device is dead and stop waiting on it faster. Drivers
> need to consider these things anyhow if they want resiliency against
> device crashes, PCI link wobbles and so on that don't involve
> remove().
Might need to revisit that discussion, but Greg didn't like it when we
asked for a pci helper to check whether the device is physically gone (at
least per the driver model). Hacking that up in drivers is doable, but
feels icky.
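
FWIW the driver-side hack boils down to something like this (MY_REG_ID is
a placeholder for any register that can never legitimately read back as
all ones; the all-ones check is the same pattern Jason mentions for mlx5):

#include <linux/io.h>
#include <linux/types.h>

#define MY_REG_ID	0x0

/* Reads from a surprise-removed PCI device complete with all 1s, so a
 * register that never contains all 1s on live hardware works as a cheap
 * "is the device still there" probe. */
static bool my_dev_gone(void __iomem *mmio)
{
	return readl(mmio + MY_REG_ID) == 0xffffffff;
}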
> Regardless, I think the point is clear that the driver author bears
> a lot of responsibility to sequence this stuff correctly as part of
> their remove() implementation. The idea that Rust can magically make
> all this safe against UAF or lockups seems incorrect.
Agreed, it's not pure magic. I do think it can help a lot though, or at
least I'm hoping.
> > > Ah.. I guess rust would have to validate the function pointers and the
> > > THIS_MODULE are consistent at runtime time before handing them off to
> > > C to prevent this. Seems like a reasonable thing to put under some
> > > CONFIG_DEBUG, also seems a bit hard to implement perhaps..
> >
> > We should know the .text section of a module, so checking whether a
> > pointer is within that shouldn't be too hard.
>
> It is legal to pass a pointer to a function in a module that this
> module is linked to as well. We do that sometimes, e.g. a fops having a
> simple_xx pointer. So you'd need to do some graph analysis.
Hm right, indirect deps are fine too ...
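
A naive debug-only version of the check could look roughly like this
(assuming within_module() and core_kernel_text() are usable here); the
indirect-deps case is exactly what it doesn't handle, for that you'd need
the dependency-graph walk:

#include <linux/kernel.h>
#include <linux/module.h>

/* Does @fn point inside @mod or into built-in kernel text? Pointers into
 * other modules we're linked against (e.g. a fops pointing at a helper
 * exported by a module we depend on) are rejected here and would need a
 * walk of the module dependency graph. */
static bool fn_ptr_plausible(const struct module *mod, const void *fn)
{
	unsigned long addr = (unsigned long)fn;

	return within_module(addr, mod) || core_kernel_text(addr);
}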
Cheers, Sima
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch