linux-kernel - RE: [PATCH v9 0/4] shut down devices asynchronously

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID:
 <SN6PR02MB4157B4916F2B8581B9C6E214D4422@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Sun, 20 Oct 2024 00:24:32 +0000
From: Michael Kelley <mhklinux@...look.com>
To: stuart hayes <stuart.w.hayes@...il.com>, Greg Kroah-Hartman
	<gregkh@...uxfoundation.org>, Lukas Wunner <lukas@...ner.de>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Rafael J .
 Wysocki" <rafael@...nel.org>, Martin Belanger <Martin.Belanger@...l.com>,
	Oliver O'Halloran <oohall@...il.com>, Daniel Wagner <dwagner@...e.de>, Keith
 Busch <kbusch@...nel.org>, David Jeffery <djeffery@...hat.com>, Jeremy
 Allison <jallison@....com>, Jens Axboe <axboe@...com>, Christoph Hellwig
	<hch@....de>, Sagi Grimberg <sagi@...mberg.me>,
	"linux-nvme@...ts.infradead.org" <linux-nvme@...ts.infradead.org>, Nathan
 Chancellor <nathan@...nel.org>, Jan Kiszka <jan.kiszka@...mens.com>, Bert
 Karwatzki <spasswolf@....de>
Subject: RE: [PATCH v9 0/4] shut down devices asynchronously

From: stuart hayes <stuart.w.hayes@...il.com> Sent: Friday, October 18, 2024 5:27 PM
> 
> On 10/18/2024 4:37 AM, Greg Kroah-Hartman wrote:
> > On Fri, Oct 18, 2024 at 11:14:51AM +0200, Lukas Wunner wrote:
> >> On Fri, Oct 18, 2024 at 07:49:51AM +0200, Greg Kroah-Hartman wrote:
> >>> On Fri, Oct 18, 2024 at 03:26:05AM +0000, Michael Kelley wrote:
> >>>> In the process, the workqueue code spins up additional worker threads
> >>>> to handle the load.  On the Hyper-V VM, 210 to 230 new kernel
> >>>> threads are created during device_shutdown(), depending on the
> >>>> timing. On the Pi 5, 253 are created. The max for this workqueue is
> >>>> WQ_DFL_ACTIVE (256).
> >> [...]
> >>> I don't think we can put this type of load on all systems just to handle
> >>> one specific type of "bad" hardware that takes long periods of time to
> >>> shutdown, sorry.
> >>
> >> Parallelizing shutdown means shorter reboot times, less downtime,
> >> less cost for CSPs.
> >
> > For some systems, yes, but as have been seen here, it comes at the
> > offset of a huge CPU load at shutdown, with sometimes longer reboot
> > times.
> >
> >> Modern servers (e.g. Sierra Forest with 288 cores) should handle
> >> this load easily and may see significant benefits from parallelization.
> >
> > "may see", can you test this?
> >
> >> Perhaps a solution is to cap async shutdown based on the number of cores,
> >> but always use async for certain device classes (e.g. nvme_subsys_class)?
> >
> > Maybe, but as-is, we can't take the changes this way, sorry.  That is a
> > regression from the situation of working hardware that many people have.
> >
> > thanks,
> >
> > greg k-h
> 
> Thank you both for your time and effort considering this.  It didn't
> occur to me that an extra few 10s of milliseconds (or maxing out the
> async workqueue) would be an issue.
> 
> To answer your earlier question (Michael), there shouldn't be a
> possibility of deadlock regardless of the number of devices. While the
> device shutdowns are scheduled on a workqueue rather than run in a loop,
> they are still scheduled in the same order as they are without this
> patch, any any device that is scheduled for shutdown should never have
> to wait for device that hasn't yet been scheduled. So even if only one
> device shutdown could be scheduled at a time, it should still work
> without deadlocking--it just wouldn't be able to do shutdowns in parallel.

OK -- makes sense.

> 
> And I believe there is still a benefit to having async shutdown enabled
> even with one core. The NVMe shutdowns that take a while involve waiting
> for drives to finish commands, so they are mostly just sleeping.
> Workqueues will schedule another worker if one worker sleeps, so even a
> single core system should be able to get a number of NVMe drives started
> on their shutdowns in parallel.

Yes, I agree that a single core system should be able to get multiple
NVMe drives shutting down in parallel. The parallelism would be
governed by the number of worker processes that the workqueue
decides are needed. I didn't look at how it makes that decision.

> 
> I'll see what I can to do limit the amount of stuff that gets put on the
> workqueue, though.  I can likely limit it to just the asynchronous
> device shutdowns (NVMe shutdowns), plus any devices that have to wait
> for them (i.e., any devices of which they are dependents or consumers).
> 

Yes, based on what I saw, that should eliminate the extra overhead
when there are no NVMe devices, or only a small number. If a system has
only a couple of NVMe devices (plus some dependent/consumer devices),
putting a few entries in the aysnc workqueue should be negligible. And when
there are many NVMe devices, the extra overhead of more kernel threads
is more than offset by the parallelism.

Michael