Message-ID: <56906209.9040401@roeck-us.net>
Date:	Fri, 8 Jan 2016 17:27:37 -0800
From:	Guenter Roeck <linux@...ck-us.net>
To:	Bjorn Helgaas <helgaas@...nel.org>
Cc:	Bjorn Helgaas <bhelgaas@...gle.com>,
	Yinghai Lu <yinghai@...nel.org>, linux-pci@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 1/2] PCI: pciehp: Merge hotplug work requests

Hi Bjorn,

On 01/08/2016 09:46 AM, Bjorn Helgaas wrote:
> On Fri, Jan 08, 2016 at 08:30:30AM -0800, Guenter Roeck wrote:
>> On 01/08/2016 08:18 AM, Bjorn Helgaas wrote:
>>> Hi Guenter,
>>>
>>> Sorry for the delay in getting to this.
>>>
>>> On Mon, Nov 02, 2015 at 07:48:15PM -0800, Guenter Roeck wrote:
>>>> Some oddball devices may experience a PCIe link flap after power-on.
>>>> This may result in the following sequence of events.
>>>>
>>>> fpc0 kernel: pciehp 0000:02:08.0:pcie24: Card present on Slot(0)
>>>> fpc0 kernel: pciehp 0000:02:08.0:pcie24: slot(0): Link Up event
>>>> fpc0 kernel: pciehp 0000:02:08.0:pcie24:
>>>> 	Link Up event ignored on slot(0): already powering on
>>>> fpc0 kernel: pciehp 0000:02:08.0:pcie24: slot(0): Link Down event
>>>> fpc0 kernel: pciehp 0000:02:08.0:pcie24:
>>>> 	Link Down event queued on slot(0): currently getting powered on
>>>> fpc0 kernel: pciehp 0000:02:08.0:pcie24: slot(0): Link Up event
>>>> fpc0 kernel: pciehp 0000:02:08.0:pcie24:
>>>> 	Link Up event queued on slot(0): currently getting powered off
>>>>
>>>> This causes the driver for affected devices to be instantiated, removed,
>>>> and re-instantiated.
>>>>
>>>> An even worse problem is a device which resets itself continuously.
>>>> This can result in the following endless sequence of messages.
>>>>
>>>> pciehp 0000:02:0a.0:pcie24: Card present on Slot(10)
>>>> pciehp 0000:02:0a.0:pcie24: Card not present on Slot(10)
>>>> pciehp 0000:02:0a.0:pcie24: Card present on Slot(10)
>>>> pciehp 0000:02:0a.0:pcie24: Card not present on Slot(10)
>>>> pciehp 0000:02:0a.0:pcie24: Card present on Slot(10)
>>>> pciehp 0000:02:0a.0:pcie24: Card not present on Slot(10)
>>>>
>>>> The problem in both cases is that all events are enqueued as hotplug
>>>> work requests and executed in sequence, which can overwhelm the system
>>>> and even result in "hung task" tracebacks in pciehp_power_thread().
>>>>
>>>> This exposes an underlying limitation of the hotplug state machine: It
>>>> executes all received requests, even though only the most recent request
>>>> really needs to be handled. As a result, by the time a request is handled,
>>>> it may well be obsolete and have been superseded by many other enable /
>>>> disable requests which have been enqueued in the meantime.
>>>>
>>>> To solve the problem, fold hotplug work requests into a single request.
>>>> Store the request as well as the work data structure in 'struct slot',
>>>> thus eliminating the need to allocate memory for each request.
>>>> Handle a sequence of requests by setting a 'disable' flag when needed,
>>>> indicating that a link needs to be disabled prior to re-enabling it.
>>>>
>>>> With this change, requests and request sequences are handled as follows.
>>>>
>>>> enable (single request):		enable link
>>>> disable (single request):		disable link
>>>> ... disable-enable-disable...disable:	disable link
>>>> ... disable-enable-disable...enable:	disable link, then enable it
>>>
>>> I think this is a really good idea, but I would like to understand
>>> what the critical points are and how they affect the state machine.
>>>
>>> I think you're basically accounting for the fact that some hotplug
>>> controller commands are not completed instantaneously, and we might
>>> receive more interrupts before the first command is completed.  I
>>> suspect that your patch only makes a difference on controllers that
>>> support Command Completed events, right?
>>
>> No, not really. The problem is that state change interrupts are not
>> handled immediately but enqueued. By the time an event is handled by the
>> workqueue, it is long since obsolete and superseded by other events.
>
> Ah.  So the important interval is the one between pcie_isr(), where we
> enqueue work, and the worker threads that are awakened to actually do
> the work.  Then the idea is that only the most recent state is
> important -- if we have several presence detect changes, e.g.,
> present, absent, present, absent, before the worker thread starts
> processing them, it should only look at the most recent state.  That
> seems like the right thing to me, and I think the removal of kmalloc
> from the ISR path is an important consequence of doing that.
>
Yes, exactly.
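
For reference, the core of the patch is something along the following
lines. This is a simplified sketch, not the patch itself; the field and
function names here (req, disable_first, pciehp_queue_request, ...) are
illustrative rather than the exact ones used.

struct slot {
	/* ... existing fields ... */
	struct work_struct work;	/* single, preallocated work item */
	spinlock_t lock;		/* protects req and disable_first */
	unsigned int req;		/* most recent request */
	bool disable_first;		/* disable link before re-enabling */
};

#define REQ_NONE	0
#define REQ_ENABLE	1
#define REQ_DISABLE	2

/*
 * Fold a new event into the pending request instead of allocating and
 * queueing a separate work item per event.  Safe to call from the
 * interrupt path; no memory allocation is needed.
 */
static void pciehp_queue_request(struct slot *slot, unsigned int req)
{
	unsigned long flags;

	spin_lock_irqsave(&slot->lock, flags);
	if (req == REQ_DISABLE)
		slot->disable_first = true;
	slot->req = req;		/* only the latest request survives */
	spin_unlock_irqrestore(&slot->lock, flags);

	queue_work(slot->wq, &slot->work);	/* no-op if already queued */
}

static void pciehp_request_handler(struct work_struct *work)
{
	struct slot *slot = container_of(work, struct slot, work);
	bool disable_first;
	unsigned int req;

	spin_lock_irq(&slot->lock);
	req = slot->req;
	disable_first = slot->disable_first;
	slot->req = REQ_NONE;
	slot->disable_first = false;
	spin_unlock_irq(&slot->lock);

	if (disable_first)
		pciehp_disable_slot(slot);
	if (req == REQ_ENABLE)
		pciehp_enable_slot(slot);
}

This is what produces the request sequences in the commit log: any
disable anywhere in a sequence forces the link down first, and only the
final request decides whether it comes back up.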

> Blue sky thinking:
>
>    - Do all interrupt-related CSR reads in pcie_isr() (as opposed to
>      doing some in pcie_isr() and others in the worker threads).  I
>      think this is important for consistency.  I think it's a mistake
>      to read and clear the status register in pcie_isr(), then go back
>      and read other CSRs later in the worker threads.
>
>    - Have pcie_isr() update a single set of "current state" CSR values.
>      This means we don't need any allocation in pcie_isr().  Some
>      values, e.g., Power Fault Detected, might be OR-d in.  Others,
>      e.g., Presence Detect, might overwrite the previous value.
>
>    - Collapse the several worker threads into a single one that
>      pcie_isr() would awaken (as opposed to having pcie_isr() decide
>      whether this is a button press, presence detect change, link
>      event, etc.)
>
>    - Have the single worker thread figure out what happened and how to
>      advance the state machine, based on the CSR values read by the
>      most recent pcie_isr() invocation.
>
> This would be a lot of changes, but I think it has the potential to
> centralize the state machine management and simplify things
> significantly.
>
That all makes a lot of sense, and I would love to spend some time on it.
Unfortunately, I don't think I will be able to spend much time on this
anytime soon.
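
For whoever does pick it up, here is roughly how I read that design.
Again, purely an illustrative sketch: read_and_clear_slot_status(),
read_link_active(), handle_power_fault() and handle_presence_or_link()
are placeholders, not existing pciehp functions.

struct controller_state {
	spinlock_t lock;		/* protects the fields below */
	u16 events;			/* latched events, OR-ed in */
	bool presence;			/* latest value overwrites previous */
	bool link_active;
	struct work_struct work;	/* the single worker */
};

static irqreturn_t pcie_isr(int irq, void *dev_id)
{
	struct controller_state *st = dev_id;
	/* the one and only place Slot Status is read and cleared */
	u16 status = read_and_clear_slot_status(st);	/* placeholder */

	spin_lock(&st->lock);
	st->events |= status;		/* e.g. Power Fault is OR-ed in */
	st->presence = !!(status & PCI_EXP_SLTSTA_PDS);
	st->link_active = read_link_active(st);		/* placeholder */
	spin_unlock(&st->lock);

	schedule_work(&st->work);	/* wake the single worker */
	return IRQ_HANDLED;
}

static void pciehp_worker(struct work_struct *work)
{
	struct controller_state *st =
		container_of(work, struct controller_state, work);
	bool presence, link;
	u16 events;

	spin_lock_irq(&st->lock);
	events = st->events;
	st->events = 0;
	presence = st->presence;
	link = st->link_active;
	spin_unlock_irq(&st->lock);

	/* advance the state machine from the most recent snapshot */
	if (events & PCI_EXP_SLTSTA_PFD)
		handle_power_fault(st);			/* placeholder */
	if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
		handle_presence_or_link(st, presence, link);
}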

Any chance you could accept at least the first of my two patches? Or, in
other words, do you see an immediate problem with it that I could address
in, say, the next week or two?

Thanks,
Guenter
