linux-kernel - RE: system hung up when offlining CPUs

Message-ID: <alpine.DEB.2.20.1709131529400.1874@nanos>
Date:   Wed, 13 Sep 2017 15:33:01 +0200 (CEST)
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Kashyap Desai <kashyap.desai@...adcom.com>
cc:     Hannes Reinecke <hare@...e.de>,
        YASUAKI ISHIMATSU <yasu.isimatu@...il.com>,
        Marc Zyngier <marc.zyngier@....com>,
        Christoph Hellwig <hch@....de>, axboe@...nel.dk,
        mpe@...erman.id.au, keith.busch@...el.com, peterz@...radead.org,
        LKML <linux-kernel@...r.kernel.org>, linux-scsi@...r.kernel.org,
        Sumit Saxena <sumit.saxena@...adcom.com>,
        Shivasharan Srikanteshwara 
        <shivasharan.srikanteshwara@...adcom.com>
Subject: RE: system hung up when offlining CPUs

On Wed, 13 Sep 2017, Kashyap Desai wrote:
> > On 09/12/2017 08:15 PM, YASUAKI ISHIMATSU wrote:
> > > + linux-scsi and maintainers of megasas

> > >> In my server, IRQ#66-89 are sent to CPU#24-29. And if I offline
> > >> CPU#24-29, I/O does not work, showing the following messages.

....

> > This indeed looks like a problem.
> > We're going to great lengths to submit and complete I/O on the same CPU,
> > so if the CPU is offlined while I/O is in flight we won't be getting a
> > completion for this particular I/O.
> > However, the megasas driver should be able to cope with this situation;
> > after all, the firmware maintains completions queues, so it would be dead
> > easy to look at _other_ completions queues, too, if a timeout occurs.
> In case of an IO timeout, the megaraid_sas driver checks the other queues
> as well. That is why the IO was completed in this case and further IOs
> were resumed.
> 
> The driver completes commands via the code below, executed from
> megasas_wait_for_outstanding_fusion():
>     for (MSIxIndex = 0 ; MSIxIndex < count; MSIxIndex++)
>         complete_cmd_fusion(instance, MSIxIndex);
> 
> Because the driver executes the code above, we see only one print in the
> logs, as below:
> megaraid_sas 0000:02:00.0: [ 0]waiting for 2 commands to complete for scsi0
> 
> As per the link below, CPU hotplug will take care of this: "All interrupts
> targeted to this CPU are migrated to a new CPU"
> https://www.kernel.org/doc/html/v4.11/core-api/cpu_hotplug.html
> 
> BTW - We are also able to reproduce this issue locally. The reason for the
> IO timeout is: "The IO is completed, but the corresponding interrupt did
> not arrive on an online CPU. It may have been missed because the CPU was in
> the transient state of being offlined. I am not sure which component should
> take care of this."
> 
> Question - "What happens once __cpu_disable is called and some of the queued
> interrupts have affinity to that particular CPU?"
> I assume that ideally those pending/queued interrupts should be migrated to
> the remaining online CPUs. They should not be left unhandled if we want to
> avoid such IO timeouts.

Can you please provide the following information, before and after
offlining the last CPU in the affinity set:

# cat /proc/irq/$IRQNUM/smp_affinity_list
# cat /proc/irq/$IRQNUM/effective_affinity
# cat /sys/kernel/debug/irq/irqs/$IRQNUM

The last one requires: CONFIG_GENERIC_IRQ_DEBUGFS=y

Thanks,

	tglx
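
The data collection requested above could be scripted roughly as follows.
This is only a sketch and not part of the original message: the function
name dump_irq_state and the two override variables PROC_IRQ_ROOT and
DEBUG_IRQ_ROOT are hypothetical helpers introduced for illustration. The
three file paths are the ones named in the message. The debugfs file is
assumed to be reachable under /sys/kernel/debug, and files that are missing
(for example without CONFIG_GENERIC_IRQ_DEBUGFS=y) are reported rather than
treated as an error.

```shell
#!/bin/sh
# Sketch: dump the three files requested above for one IRQ.
# PROC_IRQ_ROOT and DEBUG_IRQ_ROOT are hypothetical override variables
# (not kernel interfaces); they default to the paths from the message.
PROC_IRQ_ROOT="${PROC_IRQ_ROOT:-/proc/irq}"
DEBUG_IRQ_ROOT="${DEBUG_IRQ_ROOT:-/sys/kernel/debug/irq/irqs}"

# dump_irq_state IRQNUM LABEL
# Prints a header with the IRQ number and a free-form label (for example
# "before" or "after"), then the content of each file, or a note if the
# file cannot be read.
dump_irq_state() {
    irq="$1"
    label="$2"
    echo "=== IRQ $irq ($label) ==="
    for f in "$PROC_IRQ_ROOT/$irq/smp_affinity_list" \
             "$PROC_IRQ_ROOT/$irq/effective_affinity" \
             "$DEBUG_IRQ_ROOT/$irq"; do
        echo "--- $f"
        if [ -r "$f" ]; then
            cat "$f"
        else
            echo "(not readable)"
        fi
    done
}
```

As an illustration only, using IRQ 66 and CPU 29 from the report quoted
above: run "dump_irq_state 66 before" as root, offline the CPU with
"echo 0 > /sys/devices/system/cpu/cpu29/online", then run
"dump_irq_state 66 after" and compare the two dumps.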