linux-kernel - Re: 3.8-rc2: pciehp waitqueue hang...

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Tue, 8 Jan 2013 11:11:24 -0700
From:	Bjorn Helgaas <bhelgaas@...gle.com>
To:	Yijing Wang <wangyijing@...wei.com>
Cc:	Jiang Liu <liuj97@...il.com>, Daniel J Blueman <daniel@...ra.org>,
	Jesse Barnes <jbarnes@...tuousgeek.org>,
	Kenji Kaneshige <kaneshige.kenji@...fujitsu.com>,
	Yinghai Lu <yinghai@...nel.org>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	Linux PCI <linux-pci@...r.kernel.org>
Subject: Re: 3.8-rc2: pciehp waitqueue hang...

On Sun, Jan 6, 2013 at 5:13 AM, Yijing Wang <wangyijing@...wei.com> wrote:
> On 2013/1/5 5:50, Bjorn Helgaas wrote:
>> [+to Yijing, +cc Kenji]
>>
>> On Fri, Jan 4, 2013 at 1:01 PM, Bjorn Helgaas <bhelgaas@...gle.com> wrote:
>>> On Thu, Jan 3, 2013 at 8:41 AM, Jiang Liu <liuj97@...il.com> wrote:
>>>> Hi Daniel,
>>>>         It seems like an issue caused by recursive PCIe HPC.
>>>> Could you please help to try the patch from:
>>>> http://www.spinics.net/lists/linux-pci/msg18625.html
>>>
>>> Hi Gerry,
>>>
>>> I'm working on merging this patch.  Seems like something that might be
>>> appropriate for stable as well.
>>>
>>> Did you look for similar problems in other hotplug drivers?
>>
>> Oops, sorry, I forgot that Yijing is the author of the patch in question.
>>
>> Yijing, please check for the same problem in other hotplug drivers.
>> Questions I have after a quick look:
>>
>
> Hi Bjorn,
>    Sorry for delay reply. There are some busy work these days.
>
>>   - shpchp_wq looks like it might have the same deadlock issue.
>
> shpchp driver uses two workqueues shpchp_wq and shpchp_ordered_wq, they are created by alloc_ordered_workqueue
> which set the "max_active" parameter to 1. So only one pci hotplug slot can do hotplug at the same time.
> shpchp introduced these workqueue to remove the use of flush_scheduled_work() which is deprecated and scheduled for removal.
>
> hot remove path is:
>  button press
>        shpc_isr(interrupt handler)
>             shpchp_handle_attention_button
>                 queue_interrupt_event
>                    queue_work "interrupt_event_handler" into "shpchp_wq"
>                        interrupt_event_handler
>                              handle_button_press_event
>                                    queue_delayed_work "shpchp_queue_pushbutton_work" into "shpchp_wq"
>                                          queue_work "shpchp_pushbutton_thread" into "shpchp_ordered_wq"
>                                                shpchp_pushbutton_thread
>                                                       shpchp_disable_slot
>                                                             pci_stop_and_remove_bus_device
>                                                                 ......
>                                                                shpc_remove()   if the hotplug slot connected a iobox which contains some hotplug pcieport, shpc_remove will be called when remove pcie port device.
>                                                                    hpc_release_ctlr
>                                                                        flush_workqueue(shpchp_wq);
>                                                                        flush_workqueue(shpchp_ordered_wq);
>                                                                        So hotplug task hang.
> shpchp driver has the same deadlock issue like pciehp driver, I think we should fix the issue, I will send out the patch if you agree this, but I have no machine support shpchp hotplug,
> so I can't test this patch in real machine.

That's OK.  You've tested pciehp, and I don't want to leave shpchp
broken the same way just because we can't test a similar fix there, so
please do send the shpchp patch, too.

>>   - pciehp_wq (and your per-slot replacement) are allocated with
>> alloc_workqueue().  shpchp_wq is allocated with
>> alloc_ordered_workqueue().  Why the difference?
>
> alloc_workqueue(name, 0, 0) set max_active to 0(0 is default value used and support 256 work items of the wq can be executing at the same time per CPU).
> So pciehp driver can handle push button event asynchronously.
>
> alloc_ordered_workqueue can only one handle push button event at the same time.

pciehp and shpchp should work the same in this respect unless there's
a reason they can't, so it sounds like we should make shpchp work like
pciehp.

>>   - The alloc/alloc_ordered difference might be related to 486b10b9f4,
>> where Kenji removed alloc_ordered from pciehp.  Should a similar
>> change be made to shpchp?
>
> Yes, I agree, we can use per-slot workqueue to fix this issue.
>
>>
>>   - acpiphp uses the global kacpi_hotplug_wq.  We never flush or drain
>> kacpi_hotplug_wq, so I doubt there's a deadlock issue, but I wonder if
>> there are any ordering issues there because we *don't* ever wait for
>> things in that queue to be completed.
>
> acpiphp driver is not attach to a pci device, so when hot remove pci device, driver will not to flush or drain kacpi_hotplug_wq.
> But if we do acpiphp hot remove in sequence like this, there maybe cause some unexpected errors, I think.
> slot(A)------pcie port----slot(B)
> slot A and slot B both support acpiphp hotplug.
> 1、press attention button on slot A;
> 2、press attention button on slot B quickly after step 1;
> Because kacpi_hotplug_wq is a ordered workqueue, slot B hot remove won't run unless slot A hot remove action completed.
> After Slot B hot remove completed, some resources of slot A also has been destroyed. So slot B hot remove will cause some unexpected errors.
> Because my hotplug machine's bios don't support iobox hotplug(slot-connected-slot), I can't verify this situation.

Hmm.  That definitely sounds like a potential problem.  But I think
it's beyond the scope of the issue you're trying to fix, and any fix
would look much different from your current pciehp patch, so I think
we can treat it separately.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/