[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241017082526.ppoz7ynxas5nlht5@thinkpad>
Date: Thu, 17 Oct 2024 13:55:26 +0530
From: Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>
To: Marc Zyngier <maz@...nel.org>
Cc: Johan Hovold <johan@...nel.org>, Bjorn Helgaas <helgaas@...nel.org>,
Pali Rohár <pali@...nel.org>,
Johan Hovold <johan+linaro@...nel.org>,
Kishon Vijay Abraham I <kishon@...com>,
Xiaowei Song <songxiaowei@...ilicon.com>,
Binghui Wang <wangbinghui@...ilicon.com>,
Thierry Reding <thierry.reding@...il.com>,
Ryder Lee <ryder.lee@...iatek.com>,
Jianjun Wang <jianjun.wang@...iatek.com>, linux-pci@...r.kernel.org,
Krzysztof Wilczyński <kw@...ux.com>,
Ley Foon Tan <ley.foon.tan@...el.com>, linux-kernel@...r.kernel.org
Subject: Re: Why set .suppress_bind_attrs even though .remove() implemented?
On Thu, Oct 17, 2024 at 08:50:11AM +0100, Marc Zyngier wrote:
> On Thu, 17 Oct 2024 06:23:35 +0100,
> Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org> wrote:
> >
> > On Thu, Jul 28, 2022 at 02:17:01PM +0200, Johan Hovold wrote:
> > > On Wed, Jul 27, 2022 at 02:57:16PM -0500, Bjorn Helgaas wrote:
> > > > On Tue, Jul 26, 2022 at 11:56:59AM +0200, Johan Hovold wrote:
> > > > > On Mon, Jul 25, 2022 at 06:35:27PM +0100, Marc Zyngier wrote:
> > > > > > On Mon, 25 Jul 2022 16:18:48 +0100,
> > > > > > Johan Hovold <johan@...nel.org> wrote:
> > > > >
> > > > > > > Since when is unloading modules something that is expected to
> > > > > > > work perfectly? I keep hearing "well, don't do that then" when
> > > > > > > someone complains about unloading this module while doing this
> > > > > > > or that broke something. (And it's only root that can unload
> > > > > > > modules in the first place.)
> > > > > >
> > > > > > Well, maybe I have higher standards. For the stuff I maintain, I
> > > > > > now point-blank refuse to support module unloading if this can
> > > > > > result in a crash. Or worse.
> > > > >
> > > > > That makes sense for regular interrupt controllers where its hard to
> > > > > tell that all consumers are gone, but I don't think that should
> > > > > limit the usefulness of having modular PCI controller drivers where
> > > > > we know that the consumers are gone after deregistering the bus
> > > > > (i.e. the consumers are descendants of the controller in the device
> > > > > tree).
> > > >
> > > > Those consumers are endpoint drivers, so I think this depends on those
> > > > drivers correctly unmapping the interrupts they use, right?
> > >
> > > Right. For MSI this means that pci_alloc_irq_vectors() in probe should
> > > be matched by pci_free_irq_vectors() on remove.
> > >
> > > For legacy interrupts, which can be shared, the mapping is created by
> > > PCI core when binding to the first device and can only be disposed by
> > > the host-bridge driver once all descendants have been removed.
> > >
> > > The endpoint drivers still need to disable their interrupts of course.
> > >
> > > Buggy endpoint-driver remove implementations can lead to all sorts of
> > > crashes (e.g. after failing to deregister a class device), and if that's
> > > a worry then don't unload modules (and possibly disable it completely
> > > using CONFIG_MODULE_UNLOAD).
> > >
> > > > > > > It's useful for developers, but use it at your own risk.
> > > > > > >
> > > > > > > That said, I agree that if something is next to impossible to
> > > > > > > get right, as may be the case with interrupt controllers
> > > > > > > generally, then fine, let's disable module unloading for that
> > > > > > > class of drivers.
> > > > > > >
> > > > > > > And this would mean disabling driver unbind for the 20+ driver
> > > > > > > PCI drivers that currently implement it to some degree.
> > > > > >
> > > > > > That would be Bjorn's and Lorenzo's call.
> > > > >
> > > > > Sure, but I think it would be the wrong decision here. Especially,
> > > > > since the end result will likely just be that more drivers will
> > > > > become always compiled-in.
> > > >
> > > > Can you elaborate on this? I think Marc is suggesting that these PCI
> > > > controller drivers be modular but not removable. Why would that cause
> > > > more of them to be compiled-in?
> > >
> > > As mentioned earlier in this thread, we only appear to have some 60
> > > drivers in the entire tree that bother to try to implement that. I fear
> > > that blocking the use of modules (including being able to unload them)
> > > will just make people submit drivers that can only be built in.
> > >
> > > Not everyone cares about Android's GKI, but being able to unload a
> > > module during development is very useful (and keeping that out-of-tree
> > > prevents sharing the implementation and make it susceptible to even
> > > further bit rot).
> > >
> > > So continuing to supporting modules properly is a win for everyone (e.g.
> > > GKI and developers).
> > >
> > > > > > > > > Turns out the pcie-qcom driver does not support legacy
> > > > > > > > > interrupts so there's no risk of there being any lingering
> > > > > > > > > mappings if I understand things correctly.
> > > > > > > >
> > > > > > > > It still does MSIs, thanks to dw_pcie_host_init(). If you can
> > > > > > > > remove the driver while devices are up and running with MSIs
> > > > > > > > allocated, things may get ugly if things align the wrong way
> > > > > > > > (if a driver still has a reference to an irq_desc or irq_data,
> > > > > > > > for example).
> > > > > > >
> > > > > > > That is precisely the way I've been testing it and everything
> > > > > > > appears to be tore down as it should.
> > > > > > >
> > > > > > > And a PCI driver that has been unbound should have released its
> > > > > > > resources, or that's a driver bug. Right?
> > > > > >
> > > > > > But that's the thing: you can easily remove part of the
> > > > > > infrastructure without the endpoint driver even noticing. It may
> > > > > > not happen in your particular case if removing the RC driver will
> > > > > > also nuke the endpoints in the process, but I can't see this is an
> > > > > > absolute guarantee. The crash pointed to by an earlier email is
> > > > > > symptomatic of it.
> > > > >
> > > > > But that was arguably due to a driver bug, which we know how to fix.
> > > > > For MSIs the endpoint driver will free its interrupts and all is
> > > > > good.
> > > > >
> > > > > The key observation is that the driver model will make sure that any
> > > > > endpoint drivers have been unbound before the bus is deregistered.
> > > > >
> > > > > That means there are no longer any consumers of the interrupts,
> > > > > which can be disposed. For MSI this is handled by
> > > > > pci_free_irq_vectors() when unbinding the endpoint drivers. For
> > > > > legacy interrupts, which can be shared, the PCIe RC driver needs to
> > > > > manage this itself after the consumers are gone.
> > > >
> > > > The driver model ensures that endpoint drivers have been unbound. But
> > > > doesn't the interrupt disposal depend on the correct functioning of
> > > > those endpoint drivers? So if a buggy endpoint driver failed to
> > > > dispose of them, we're still vulnerable?
> > >
> > > Just as you are if an endpoint-driver fails to clean up after itself in
> > > some other way (e.g. leaves the interrupt enabled).
> > >
> >
> > The IRQ disposal issue should hopefully fixed by this series:
> > https://lore.kernel.org/linux-pci/20240715114854.4792-3-kabel@kernel.org/
> >
> > Then if the dwc driver calls pci_remove_irq_domain() instead of
> > irq_domain_remove(), we can be sure that all the IRQs are disposed during the
> > driver remove.
> >
> > So can we proceed with the series making Qcom driver modular?
>
> Who is volunteering to fix the drivers that will invariably explode
> once we allow this?
>
Why should anyone volunteer first up? If the issue gets reported for a driver
blowing up, then the driver has to be fixed by the maintainer or someone, just
like any other bug.
>From reading the thread, the major concern was disposing the IRQs before
removing the domain and that is now taken care of. If you are worrying about a
specific issue, please say so.
As a Qcom PCIe driver maintainer, I'd like to provide users/developers the
flexibility to remove the driver for development purposes.
- Mani
> Because if the outcome is that we let things bitrot even more than
> they already are, I don't think this is going in the correct direction
> -- as in *the direction of correctness*.
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.
--
மணிவண்ணன் சதாசிவம்
Powered by blists - more mailing lists