Message-ID:
 <SN6PR02MB41571E2DD410D09CE7494B38D4402@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Fri, 18 Oct 2024 03:26:05 +0000
From: Michael Kelley <mhklinux@...look.com>
To: Michael Kelley <mhklinux@...look.com>, Stuart Hayes
	<stuart.w.hayes@...il.com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, Greg Kroah-Hartman
	<gregkh@...uxfoundation.org>, "Rafael J . Wysocki" <rafael@...nel.org>,
	Martin Belanger <Martin.Belanger@...l.com>, Oliver O'Halloran
	<oohall@...il.com>, Daniel Wagner <dwagner@...e.de>, Keith Busch
	<kbusch@...nel.org>, Lukas Wunner <lukas@...ner.de>, David Jeffery
	<djeffery@...hat.com>, Jeremy Allison <jallison@....com>, Jens Axboe
	<axboe@...com>, Christoph Hellwig <hch@....de>, Sagi Grimberg
	<sagi@...mberg.me>, "linux-nvme@...ts.infradead.org"
	<linux-nvme@...ts.infradead.org>, Nathan Chancellor <nathan@...nel.org>, Jan
 Kiszka <jan.kiszka@...mens.com>, Bert Karwatzki <spasswolf@....de>
Subject: RE: [PATCH v9 0/4] shut down devices asynchronously

From: Michael Kelley <mhklinux@...look.com> Sent: Thursday, October 10, 2024 9:22 PM
> 
> From: Stuart Hayes <stuart.w.hayes@...il.com> Sent: Wednesday, October 9, 2024 10:58 AM
> >
> > This adds the ability for the kernel to shut down devices asynchronously.
> >
> > Only devices with drivers that enable it are shut down asynchronously.
> >
> > This can dramatically reduce system shutdown/reboot time on systems that
> > have multiple devices that take many seconds to shut down (like certain
> > NVMe drives). On one system tested, the shutdown time went from 11 minutes
> > without this patch to 55 seconds with the patch.
> 
> I've been testing this series against a 6.11.0 kernel in an Azure VM, which
> is running as a guest on the Microsoft Hyper-V hypervisor. The VM has 16 vCPUs,
> 128 Gbytes of memory, and two physical NVMe controllers that are mapped
> into the VM so that it can access them directly.
> 
[snip]
> 
> But here's the kicker:  The overall process of shutting down the devices
> took *longer* with the patch set than without.  Here's the same output
> from a 6.11.0 kernel without the patch set:
> 
[snip]
> 
> Any thoughts on what might be causing this?  I haven't gone into the
> details of your algorithms for parallelizing, but is there any extra
> overhead that could be adding to the time? Or maybe this is
> something unique to Hyper-V guests. The overall difference is only
> a few tens of milliseconds, so not that big a deal. But maybe it's
> an indicator that something unexpected is happening that we should
> understand.
> 
> I'll keep thinking about the issue and see if I can get any more insight.

I've debugged this enough to now know what is happening. I see the
same behavior in two different environments:

1) A Hyper-V VM with 8 vCPUs running on my local laptop. It has
no NVMe devices, so there's no opportunity for parallelism with this
patch set.

2) A Raspberry Pi 5 with 4 CPUs. Linux is running on the bare metal.
It has one NVMe device via the Pi 5 M.2 HAT.

In both cases, the loop in device_shutdown() goes through a lot of
devices:  492 in my Hyper-V VM, and 577 in the Pi 5. With the code
in this patch set, all the devices get added to the async_wq in
kernel/async.c. So async_wq quickly gets heavily populated.
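For anyone who wants to see the mechanics: the queueing boils down to
the async API in include/linux/async.h. A minimal sketch of the
pattern (simplified, with a made-up name shutdown_one_device_async;
not the actual code from this series):

	#include <linux/async.h>
	#include <linux/device.h>

	/* stand-in for the per-device shutdown work this series queues */
	static void shutdown_one_device_async(void *data, async_cookie_t cookie)
	{
		struct device *dev = data;

		/*
		 * ... take the device lock and invoke the bus's or
		 * driver's ->shutdown() callback for dev ...
		 */
	}

	...

	/* in the device_shutdown() loop, instead of shutting down inline: */
	cookie = async_schedule_dev(shutdown_one_device_async, dev);

	...

	/* after the loop, wait for all queued shutdown work to finish */
	async_synchronize_full();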

In the process, the workqueue code spins up additional worker threads
to handle the load.  On the Hyper-V VM, 210 to 230 new kernel
threads are created during device_shutdown(), depending on the
timing. On the Pi 5, 253 are created. The max for this workqueue is
WQ_DFL_ACTIVE (256).
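For reference, that limit comes from include/linux/workqueue.h, where
(in the 6.11 tree, if I'm reading it right) the relevant constants
are:

	WQ_MAX_ACTIVE	= 512,	  /* I like 512, better ideas? */
	WQ_DFL_ACTIVE	= WQ_MAX_ACTIVE / 2,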

Creating all these new worker threads, and scheduling and
synchronizing them, takes 30 to 40 milliseconds of additional time
compared to the original code. On the Hyper-V VM, device_shutdown()
completes in about 5 milliseconds with the original code, and in
roughly 40 milliseconds with the new async code. The Pi 5 results are
more variable, but also show roughly 30 to 40 milliseconds of
additional time.
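One simple way to reproduce the measurement is to bracket the
device_shutdown() call in kernel/reboot.c with ktime calls, roughly
like this (a sketch; any equivalent instrumentation will do):

	ktime_t start = ktime_get();

	device_shutdown();

	pr_info("device_shutdown() took %lld us\n",
		ktime_us_delta(ktime_get(), start));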

Is all this extra work a problem? I don't know. Clearly, there's a big
benefit in parallelizing the NVMe shutdown, and in those situations
the extra 30 to 40 milliseconds can be ignored. But I wonder if there
are other situations where the extra elapsed time and CPU
consumption might be a problem.

I also wonder about scalability. Does everything still work if
device_shutdown() is processing 5000 devices? Is there a possibility
of deadlock if async_wq can only have 256 active items out of
5000 queued ones?
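
To make the deadlock worry concrete, the scenario I have in mind is
something like the sketch below (illustrative only; shutdown_after
stands in for however the series tracks the dependency cookie): all
256 active slots held by work items that are blocked waiting on
children whose work items are still queued behind them.

	/* async shutdown work item for a parent device (sketch) */
	static void device_shutdown_async(void *data, async_cookie_t cookie)
	{
		struct device *dev = data;

		/*
		 * Block until everything scheduled before the child's
		 * cookie has run. If all WQ_DFL_ACTIVE slots are held
		 * by work items parked here, and the children they
		 * wait on are still queued behind them, nothing can
		 * make progress.
		 */
		async_synchronize_cookie(dev->shutdown_after);

		/* ... actually shut the device down ... */
	}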

Michael Kelley

> 
> >
> > Changes from V8:
> >
> > Deal with shutdown hangs that result when a parent/supplier device is
> >   later in the devices_kset list than its children/consumers:
> >   * Ignore sync_state_only devlinks for shutdown dependencies
> >   * Ignore shutdown_after for devices that don't want async shutdown
> >   * Add a sanity check to revert to sync shutdown for any device that
> >     would otherwise wait for a child/consumer shutdown that hasn't
> >     already been scheduled
> >
> > Changes from V7:
> >
> > Do not expose driver async_shutdown_enable in sysfs.
> > Wrapped a long line.
> >
> > Changes from V6:
> >
> > Removed a sysfs attribute that allowed the async device shutdown to be
> > "on" (with driver opt-out), "safe" (driver opt-in), or "off"... what was
> > previously "safe" is now the only behavior, so drivers now only need to
> > have the option to enable or disable async shutdown.
> >
> > Changes from V5:
> >
> > Separated into multiple patches to make review easier.
> > Reworked some code to make it more readable
> > Made devices wait for consumers to shut down, not just children
> >   (suggested by David Jeffery)
> >
> > Changes from V4:
> >
> > Change code to use cookies for synchronization rather than async domains
> > Allow async shutdown to be disabled via sysfs, and allow driver opt-in or
> >   opt-out of async shutdown (when not disabled), with ability to control
> >   driver opt-in/opt-out via sysfs
> >
> > Changes from V3:
> >
> > Bug fix (used "parent" not "dev->parent" in device_shutdown)
> >
> > Changes from V2:
> >
> > Removed recursive functions to schedule children to be shutdown before
> >   parents, since existing device_shutdown loop will already do this
> >
> > Changes from V1:
> >
> > Rewritten using kernel async code (suggested by Lukas Wunner)
> >
> >
> > Stuart Hayes (4):
> >   driver core: don't always lock parent in shutdown
> >   driver core: separate function to shutdown one device
> >   driver core: shut down devices asynchronously
> >   nvme-pci: Make driver prefer asynchronous shutdown
> >
> >  drivers/base/base.h           |   4 +
> >  drivers/base/core.c           | 137 +++++++++++++++++++++++++++-------
> >  drivers/nvme/host/pci.c       |   1 +
> >  include/linux/device/driver.h |   2 +
> >  4 files changed, 118 insertions(+), 26 deletions(-)
> >
> > --
> > 2.39.3
> >
> 

