linux-kernel - Re: [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io

Message-ID: <bc5ebdea-7091-4999-a021-ec2a65573aa0@flourine.local>
Date: Fri, 12 Sep 2025 10:32:38 +0200
From: Daniel Wagner <dwagner@...e.de>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Hannes Reinecke <hare@...e.de>, Daniel Wagner <wagi@...nel.org>, 
	Jens Axboe <axboe@...nel.dk>, Keith Busch <kbusch@...nel.org>, Christoph Hellwig <hch@....de>, 
	Sagi Grimberg <sagi@...mberg.me>, "Michael S. Tsirkin" <mst@...hat.com>, 
	Aaron Tomlin <atomlin@...mlin.com>, "Martin K. Petersen" <martin.petersen@...cle.com>, 
	Costa Shulyupin <costa.shul@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, 
	Valentin Schneider <vschneid@...hat.com>, Waiman Long <llong@...hat.com>, Ming Lei <ming.lei@...hat.com>, 
	Frederic Weisbecker <frederic@...nel.org>, Mel Gorman <mgorman@...e.de>, 
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, linux-kernel@...r.kernel.org, linux-block@...r.kernel.org, 
	linux-nvme@...ts.infradead.org, megaraidlinux.pdl@...adcom.com, linux-scsi@...r.kernel.org, 
	storagedev@...rochip.com, virtualization@...ts.linux.dev, 
	GR-QLogic-Storage-Upstream@...vell.com
Subject: Re: [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue
 is enabled

On Wed, Sep 10, 2025 at 10:20:26AM +0200, Thomas Gleixner wrote:
> On Mon, Sep 08 2025 at 09:26, Daniel Wagner wrote:
> > On Mon, Sep 08, 2025 at 08:13:31AM +0200, Hannes Reinecke wrote:
> >> >   const struct cpumask *blk_mq_online_queue_affinity(void)
> >> >   {
> >> > +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
> >> > +		cpumask_and(&blk_hk_online_mask, cpu_online_mask,
> >> > +			    housekeeping_cpumask(HK_TYPE_IO_QUEUE));
> >> > +		return &blk_hk_online_mask;
> >> 
> >> Can you explain the use of 'blk_hk_online_mask'?
> >> Why is it a static variable?
> >
> > The blk_mq_*_queue_affinity helpers return a const struct cpumask *, so the
> > caller doesn't need to free the return value. Because cpumask_and() needs
> > to store its result somewhere, I opted for the global static variable.
> >
> >> To my untrained eye it's being recalculated every time one calls
> >> this function. And only the first invocation runs on an empty mask,
> >> all subsequent ones see a populated mask.
> >
> > The cpu_online_mask might change over time, it's not a static bitmap.
> > Thus it's necessary to update the blk_hk_online_mask. Doing some sort of
> > caching is certainly possible. Given that we have plenty of cpumask
> > logic operations in the group_cpus_evenly() code path later, I am not so
> > sure this really makes a huge difference.
> 
> Sure, but none of this is serialized against CPU hotplug operations. So
> the resulting mask, which is handed into the spreading code can be
> concurrently modified. IOW it's not as const as the code claims.

Thanks for explaining.

In group_cpus_evenly():

	/*
	 * Make a local cache of 'cpu_present_mask', so the two stages
	 * spread can observe consistent 'cpu_present_mask' without holding
	 * cpu hotplug lock, then we can reduce deadlock risk with cpu
	 * hotplug code.
	 *
	 * Here CPU hotplug may happen when reading `cpu_present_mask`, and
	 * we can live with the case because it only affects that hotplug
	 * CPU is handled in the 1st or 2nd stage, and either way is correct
	 * from API user viewpoint since 2-stage spread is sort of
	 * optimization.
	 */
	cpumask_copy(npresmsk, data_race(cpu_present_mask));


And from commit 0263f92fadbb ("lib/group_cpus.c: avoid acquiring cpu
hotplug lock in group_cpus_evenly"):

  group_cpus_evenly() could be part of storage driver's error handler, such
  as nvme driver, when may happen during CPU hotplug, in which storage queue
  has to drain its pending IOs because all CPUs associated with the queue
  are offline and the queue is becoming inactive.  And handling IO needs
  error handler to provide forward progress.

  Then deadlock is caused:

  1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's
     handler is waiting for inflight IO

  2) error handler is waiting for CPU hotplug lock

  3) inflight IO can't be completed in blk-mq's CPU hotplug handler
     because error handling can't provide forward progress.

  Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(),
  in which two stage spreads are taken: 1) the 1st stage is over all present
  CPUs; 2) the end stage is over all other CPUs.

  Turns out the two stage spread just needs consistent 'cpu_present_mask',
  and remove the CPU hotplug lock by storing it into one local cache.  This
  way doesn't change correctness, because all CPUs are still covered.

This sounds like I should do something similar with cpu_online_mask, that
is, work on a local copy instead of a shared static mask. Anyway, I'll
work on this.
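
To make the difference concrete, here is a minimal userspace sketch of the
two patterns, not kernel code and not the eventual patch. Everything in it is
a hypothetical stand-in: mask_t models a cpumask as one 64-bit word,
online_mask models cpu_online_mask, hk_io_queue_mask models
housekeeping_cpumask(HK_TYPE_IO_QUEUE), and the two queue_affinity_*()
helpers are invented names. A real cpumask spans several words, so the
kernel's local copy only yields a stable (not necessarily atomic) snapshot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpumask: one bit per CPU, up to 64 CPUs. */
typedef uint64_t mask_t;

/* Stand-in for cpu_online_mask: may change at any time due to hotplug. */
static volatile mask_t online_mask;
/* Stand-in for housekeeping_cpumask(HK_TYPE_IO_QUEUE). */
static mask_t hk_io_queue_mask;
/* Stand-in for housekeeping_enabled(HK_TYPE_IO_QUEUE). */
static bool hk_io_queue_enabled;

/*
 * Pattern from the patch under review: the result lives in one shared
 * static buffer. Every call rewrites it, so a pointer returned earlier
 * can see its contents change underneath the holder.
 */
static mask_t shared_result;

static const mask_t *queue_affinity_shared(void)
{
	mask_t online = online_mask;

	shared_result = hk_io_queue_enabled ? (online & hk_io_queue_mask)
					    : online;
	return &shared_result;
}

/*
 * Snapshot pattern, in the spirit of the local cache in
 * group_cpus_evenly(): the caller owns the destination, the online mask
 * is read once, and later hotplug events cannot alter the copy.
 */
static void queue_affinity_snapshot(mask_t *dst)
{
	mask_t online;

	assert(dst != NULL);
	online = online_mask;	/* single read of the changing mask */
	*dst = hk_io_queue_enabled ? (online & hk_io_queue_mask) : online;
}
```

With CPUs 0-7 online and CPUs 0-3 as housekeeping CPUs, a snapshot taken
before CPUs 2-7 go offline still holds CPUs 0-3 afterwards, whereas the
shared buffer reflects whatever the most recent call computed.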

> How is this even remotely correct?

It isn't :( I did run hotplug tests, but obviously they were not really up
to the task. The kernel test bot gave me a pointer on how I should test this.