[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241017052335.iue4jhvk5q4efigv@thinkpad>
Date: Thu, 17 Oct 2024 10:53:35 +0530
From: Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>
To: Johan Hovold <johan@...nel.org>, Bjorn Helgaas <helgaas@...nel.org>
Cc: Marc Zyngier <maz@...nel.org>,
Pali Rohár <pali@...nel.org>,
Johan Hovold <johan+linaro@...nel.org>,
Kishon Vijay Abraham I <kishon@...com>,
Xiaowei Song <songxiaowei@...ilicon.com>,
Binghui Wang <wangbinghui@...ilicon.com>,
Thierry Reding <thierry.reding@...il.com>,
Ryder Lee <ryder.lee@...iatek.com>,
Jianjun Wang <jianjun.wang@...iatek.com>, linux-pci@...r.kernel.org,
Krzysztof Wilczyński <kw@...ux.com>,
Ley Foon Tan <ley.foon.tan@...el.com>, linux-kernel@...r.kernel.org
Subject: Re: Why set .suppress_bind_attrs even though .remove() implemented?
On Thu, Jul 28, 2022 at 02:17:01PM +0200, Johan Hovold wrote:
> On Wed, Jul 27, 2022 at 02:57:16PM -0500, Bjorn Helgaas wrote:
> > On Tue, Jul 26, 2022 at 11:56:59AM +0200, Johan Hovold wrote:
> > > On Mon, Jul 25, 2022 at 06:35:27PM +0100, Marc Zyngier wrote:
> > > > On Mon, 25 Jul 2022 16:18:48 +0100,
> > > > Johan Hovold <johan@...nel.org> wrote:
> > >
> > > > > Since when is unloading modules something that is expected to
> > > > > work perfectly? I keep hearing "well, don't do that then" when
> > > > > someone complains about unloading this module while doing this
> > > > > or that broke something. (And it's only root that can unload
> > > > > modules in the first place.)
> > > >
> > > > Well, maybe I have higher standards. For the stuff I maintain, I
> > > > now point-blank refuse to support module unloading if this can
> > > > result in a crash. Or worse.
> > >
> > > That makes sense for regular interrupt controllers where its hard to
> > > tell that all consumers are gone, but I don't think that should
> > > limit the usefulness of having modular PCI controller drivers where
> > > we know that the consumers are gone after deregistering the bus
> > > (i.e. the consumers are descendants of the controller in the device
> > > tree).
> >
> > Those consumers are endpoint drivers, so I think this depends on those
> > drivers correctly unmapping the interrupts they use, right?
>
> Right. For MSI this means that pci_alloc_irq_vectors() in probe should
> be matched by pci_free_irq_vectors() on remove.
>
> For legacy interrupts, which can be shared, the mapping is created by
> PCI core when binding to the first device and can only be disposed by
> the host-bridge driver once all descendants have been removed.
>
> The endpoint drivers still need to disable their interrupts of course.
>
> Buggy endpoint-driver remove implementations can lead to all sorts of
> crashes (e.g. after failing to deregister a class device), and if that's
> a worry then don't unload modules (and possibly disable it completely
> using CONFIG_MODULE_UNLOAD).
>
> > > > > It's useful for developers, but use it at your own risk.
> > > > >
> > > > > That said, I agree that if something is next to impossible to
> > > > > get right, as may be the case with interrupt controllers
> > > > > generally, then fine, let's disable module unloading for that
> > > > > class of drivers.
> > > > >
> > > > > And this would mean disabling driver unbind for the 20+ driver
> > > > > PCI drivers that currently implement it to some degree.
> > > >
> > > > That would be Bjorn's and Lorenzo's call.
> > >
> > > Sure, but I think it would be the wrong decision here. Especially,
> > > since the end result will likely just be that more drivers will
> > > become always compiled-in.
> >
> > Can you elaborate on this? I think Marc is suggesting that these PCI
> > controller drivers be modular but not removable. Why would that cause
> > more of them to be compiled-in?
>
> As mentioned earlier in this thread, we only appear to have some 60
> drivers in the entire tree that bother to try to implement that. I fear
> that blocking the use of modules (including being able to unload them)
> will just make people submit drivers that can only be built in.
>
> Not everyone cares about Android's GKI, but being able to unload a
> module during development is very useful (and keeping that out-of-tree
> prevents sharing the implementation and make it susceptible to even
> further bit rot).
>
> So continuing to supporting modules properly is a win for everyone (e.g.
> GKI and developers).
>
> > > > > > > Turns out the pcie-qcom driver does not support legacy
> > > > > > > interrupts so there's no risk of there being any lingering
> > > > > > > mappings if I understand things correctly.
> > > > > >
> > > > > > It still does MSIs, thanks to dw_pcie_host_init(). If you can
> > > > > > remove the driver while devices are up and running with MSIs
> > > > > > allocated, things may get ugly if things align the wrong way
> > > > > > (if a driver still has a reference to an irq_desc or irq_data,
> > > > > > for example).
> > > > >
> > > > > That is precisely the way I've been testing it and everything
> > > > > appears to be tore down as it should.
> > > > >
> > > > > And a PCI driver that has been unbound should have released its
> > > > > resources, or that's a driver bug. Right?
> > > >
> > > > But that's the thing: you can easily remove part of the
> > > > infrastructure without the endpoint driver even noticing. It may
> > > > not happen in your particular case if removing the RC driver will
> > > > also nuke the endpoints in the process, but I can't see this is an
> > > > absolute guarantee. The crash pointed to by an earlier email is
> > > > symptomatic of it.
> > >
> > > But that was arguably due to a driver bug, which we know how to fix.
> > > For MSIs the endpoint driver will free its interrupts and all is
> > > good.
> > >
> > > The key observation is that the driver model will make sure that any
> > > endpoint drivers have been unbound before the bus is deregistered.
> > >
> > > That means there are no longer any consumers of the interrupts,
> > > which can be disposed. For MSI this is handled by
> > > pci_free_irq_vectors() when unbinding the endpoint drivers. For
> > > legacy interrupts, which can be shared, the PCIe RC driver needs to
> > > manage this itself after the consumers are gone.
> >
> > The driver model ensures that endpoint drivers have been unbound. But
> > doesn't the interrupt disposal depend on the correct functioning of
> > those endpoint drivers? So if a buggy endpoint driver failed to
> > dispose of them, we're still vulnerable?
>
> Just as you are if an endpoint-driver fails to clean up after itself in
> some other way (e.g. leaves the interrupt enabled).
>
The IRQ disposal issue should hopefully fixed by this series:
https://lore.kernel.org/linux-pci/20240715114854.4792-3-kabel@kernel.org/
Then if the dwc driver calls pci_remove_irq_domain() instead of
irq_domain_remove(), we can be sure that all the IRQs are disposed during the
driver remove.
So can we proceed with the series making Qcom driver modular?
- Mani
--
மணிவண்ணன் சதாசிவம்
Powered by blists - more mailing lists