Message-ID: <alpine.DEB.2.20.1709131529400.1874@nanos>
Date: Wed, 13 Sep 2017 15:33:01 +0200 (CEST)
From: Thomas Gleixner <tglx@...utronix.de>
To: Kashyap Desai <kashyap.desai@...adcom.com>
cc: Hannes Reinecke <hare@...e.de>, YASUAKI ISHIMATSU <yasu.isimatu@...il.com>,
    Marc Zyngier <marc.zyngier@....com>, Christoph Hellwig <hch@....de>,
    axboe@...nel.dk, mpe@...erman.id.au, keith.busch@...el.com,
    peterz@...radead.org, LKML <linux-kernel@...r.kernel.org>,
    linux-scsi@...r.kernel.org, Sumit Saxena <sumit.saxena@...adcom.com>,
    Shivasharan Srikanteshwara <shivasharan.srikanteshwara@...adcom.com>
Subject: RE: system hung up when offlining CPUs

On Wed, 13 Sep 2017, Kashyap Desai wrote:
> > On 09/12/2017 08:15 PM, YASUAKI ISHIMATSU wrote:
> > > + linux-scsi and maintainers of megasas
> > >> In my server, IRQ#66-89 are sent to CPU#24-29. And if I offline
> > >> CPU#24-29, I/O does not work, showing the following messages.
....
> > This indeed looks like a problem.
> > We're going to great lengths to submit and complete I/O on the same
> > CPU, so if the CPU is offlined while I/O is in flight we won't be
> > getting a completion for this particular I/O.
> > However, the megasas driver should be able to cope with this
> > situation; after all, the firmware maintains completion queues, so it
> > would be dead easy to look at _other_ completion queues, too, if a
> > timeout occurs.
>
> In case of an IO timeout, the megaraid_sas driver checks the other queues
> as well. That is why the IO was completed in this case and further IOs
> were resumed.
>
> The driver completes commands via the code below, executed from
> megasas_wait_for_outstanding_fusion():
>
>     for (MSIxIndex = 0 ; MSIxIndex < count; MSIxIndex++)
>         complete_cmd_fusion(instance, MSIxIndex);
>
> Because the above code is executed in the driver, we see only one print,
> shown below, in these logs:
>
> megaraid_sas 0000:02:00.0: [ 0]waiting for 2 commands to complete for scsi0
>
> As per the link below, CPU hotplug will take care of this: "All
> interrupts targeted to this CPU are migrated to a new CPU"
> https://www.kernel.org/doc/html/v4.11/core-api/cpu_hotplug.html
>
> BTW - we are also able to reproduce this issue locally. The reason for the
> IO timeout is that the IO is completed, but the corresponding interrupt
> did not arrive on an online CPU; possibly it was missed because the CPU
> was in the transient state of being offlined. I am not sure which
> component should take care of this.
>
> Question - what happens once __cpu_disable() is called and some queued
> interrupt has affinity to that particular CPU?
>
> I assume that, ideally, those pending/queued interrupts should be migrated
> to the remaining online CPUs. They should not go unhandled if we want to
> avoid such IO timeouts.

Can you please provide the following information, before and after
offlining the last CPU in the affinity set:

# cat /proc/irq/$IRQNUM/smp_affinity_list
# cat /proc/irq/$IRQNUM/effective_affinity
# cat /sys/kernel/debug/irq/irqs/$IRQNUM

The last one requires: CONFIG_GENERIC_IRQ_DEBUGFS=y

Thanks,

	tglx
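For collecting that data, a small userspace helper can be handy; the
following is a minimal sketch (an illustration, not code from the driver
or from this thread) that takes the IRQ number as argv[1] and dumps
/proc/irq/$IRQNUM/smp_affinity_list, /proc/irq/$IRQNUM/effective_affinity
and /sys/kernel/debug/irq/irqs/$IRQNUM. It assumes debugfs is mounted at
/sys/kernel/debug and, as noted above, the debugfs file only exists with
CONFIG_GENERIC_IRQ_DEBUGFS=y.

#include <stdio.h>

/* Print the contents of one file, or note that it could not be read. */
static void dump_file(const char *path)
{
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f) {
		printf("%s: <unreadable>\n", path);
		return;
	}
	printf("== %s ==\n", path);
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
}

int main(int argc, char **argv)
{
	const char *procfiles[] = { "smp_affinity_list", "effective_affinity" };
	char path[128];
	unsigned int i;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <irqnum>\n", argv[0]);
		return 1;
	}

	/* The /proc/irq/<N>/ view of the requested and effective affinity. */
	for (i = 0; i < sizeof(procfiles) / sizeof(procfiles[0]); i++) {
		snprintf(path, sizeof(path), "/proc/irq/%s/%s",
			 argv[1], procfiles[i]);
		dump_file(path);
	}

	/* The debugfs view; needs CONFIG_GENERIC_IRQ_DEBUGFS=y and debugfs mounted. */
	snprintf(path, sizeof(path), "/sys/kernel/debug/irq/irqs/%s", argv[1]);
	dump_file(path);

	return 0;
}

Run it as root before and after offlining the last CPU in the affinity set
(for example via echo 0 > /sys/devices/system/cpu/cpuN/online) to get the
before/after comparison requested above.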