Message-ID: <bc5ebdea-7091-4999-a021-ec2a65573aa0@flourine.local>
Date: Fri, 12 Sep 2025 10:32:38 +0200
From: Daniel Wagner <dwagner@...e.de>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Hannes Reinecke <hare@...e.de>, Daniel Wagner <wagi@...nel.org>, 
	Jens Axboe <axboe@...nel.dk>, Keith Busch <kbusch@...nel.org>, Christoph Hellwig <hch@....de>, 
	Sagi Grimberg <sagi@...mberg.me>, "Michael S. Tsirkin" <mst@...hat.com>, 
	Aaron Tomlin <atomlin@...mlin.com>, "Martin K. Petersen" <martin.petersen@...cle.com>, 
	Costa Shulyupin <costa.shul@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, 
	Valentin Schneider <vschneid@...hat.com>, Waiman Long <llong@...hat.com>, Ming Lei <ming.lei@...hat.com>, 
	Frederic Weisbecker <frederic@...nel.org>, Mel Gorman <mgorman@...e.de>, 
	Mathieu Desnoyers <mathieu.desnoyers@...icios.com>, linux-kernel@...r.kernel.org, linux-block@...r.kernel.org, 
	linux-nvme@...ts.infradead.org, megaraidlinux.pdl@...adcom.com, linux-scsi@...r.kernel.org, 
	storagedev@...rochip.com, virtualization@...ts.linux.dev, 
	GR-QLogic-Storage-Upstream@...vell.com
Subject: Re: [PATCH v8 10/12] blk-mq: use hk cpus only when isolcpus=io_queue
 is enabled

On Wed, Sep 10, 2025 at 10:20:26AM +0200, Thomas Gleixner wrote:
> On Mon, Sep 08 2025 at 09:26, Daniel Wagner wrote:
> > On Mon, Sep 08, 2025 at 08:13:31AM +0200, Hannes Reinecke wrote:
> >> >   const struct cpumask *blk_mq_online_queue_affinity(void)
> >> >   {
> >> > +	if (housekeeping_enabled(HK_TYPE_IO_QUEUE)) {
> >> > +		cpumask_and(&blk_hk_online_mask, cpu_online_mask,
> >> > +			    housekeeping_cpumask(HK_TYPE_IO_QUEUE));
> >> > +		return &blk_hk_online_mask;
> >> 
> >> Can you explain the use of 'blk_hk_online_mask'?
> >> Why is a static variable?
> >
> > The blk_mq_*_queue_affinity helpers return a const struct cpumask *, the
> > caller doesn't need to free the return value. Because cpumask_and needs
> > store its result somewhere, I opted for the global static variable.
> >
> >> To my untrained eye it's being recalculated every time one calls
> >> this function. And only the first invocation run on an empty mask,
> >> all subsequent ones see a populated mask.
> >
> > The cpu_online_mask might change over time, it's not a static bitmap.
> > Thus it's necessary to update the blk_hk_online_mask. Doing some sort of
> > caching is certainly possible. Given that we have plenty of cpumask
> > logic operation in the cpu_group_evenly code path later, I am not so
> > sure this really makes a huge difference.
> 
> Sure,  but none of this is serialized against CPU hotplug operations. So
> the resulting mask, which is handed into the spreading code can be
> concurrently modified. IOW it's not as const as the code claims.

Thanks for explaining.

In group_cpus_evenly():

	/*
	 * Make a local cache of 'cpu_present_mask', so the two stages
	 * spread can observe consistent 'cpu_present_mask' without holding
	 * cpu hotplug lock, then we can reduce deadlock risk with cpu
	 * hotplug code.
	 *
	 * Here CPU hotplug may happen when reading `cpu_present_mask`, and
	 * we can live with the case because it only affects that hotplug
	 * CPU is handled in the 1st or 2nd stage, and either way is correct
	 * from API user viewpoint since 2-stage spread is sort of
	 * optimization.
	 */
	cpumask_copy(npresmsk, data_race(cpu_present_mask));


And from commit 0263f92fadbb ("lib/group_cpus.c: avoid acquiring cpu
hotplug lock in group_cpus_evenly"):

  group_cpus_evenly() could be part of storage driver's error handler, such
  as nvme driver, when may happen during CPU hotplug, in which storage queue
  has to drain its pending IOs because all CPUs associated with the queue
  are offline and the queue is becoming inactive.  And handling IO needs
  error handler to provide forward progress.

  Then deadlock is caused:

  1) inside CPU hotplug handler, CPU hotplug lock is held, and blk-mq's
     handler is waiting for inflight IO

  2) error handler is waiting for CPU hotplug lock

  3) inflight IO can't be completed in blk-mq's CPU hotplug handler
     because error handling can't provide forward progress.

  Solve the deadlock by not holding CPU hotplug lock in group_cpus_evenly(),
  in which two stage spreads are taken: 1) the 1st stage is over all present
  CPUs; 2) the end stage is over all other CPUs.

  Turns out the two stage spread just needs consistent 'cpu_present_mask',
  and remove the CPU hotplug lock by storing it into one local cache.  This
  way doesn't change correctness, because all CPUs are still covered.

This suggests I should do something similar with cpu_online_mask, i.e.
work on a local copy instead of handing out a shared static mask.
Anyway, I'll work on this.
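
To illustrate the local-cache idea, here is a minimal userspace sketch, not
kernel code and not the eventual patch. It assumes a toy 64-bit mask in place
of struct cpumask, and every name in it (toy_online_mask, toy_hk_mask,
toy_online_queue_affinity, toy_set_online) is hypothetical. The point is only
that the caller owns the storage for the result, so a later hotplug event
cannot change a mask that has already been handed to the spreading code.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Toy stand-ins for kernel cpumasks: one bit per CPU, at most 64 CPUs. */
typedef struct { _Atomic uint64_t bits; } shared_mask_t; /* may change at any time */
typedef struct { uint64_t bits; } local_mask_t;          /* owned by the caller */

/* Hypothetical globals: the online mask changes on hotplug, the
 * housekeeping mask is assumed to be fixed after boot. */
shared_mask_t toy_online_mask;
local_mask_t toy_hk_mask;

/* Simulate a hotplug event by setting or clearing one CPU bit. */
void toy_set_online(unsigned int cpu, int online)
{
	uint64_t bit = UINT64_C(1) << cpu;

	if (online)
		atomic_fetch_or(&toy_online_mask.bits, bit);
	else
		atomic_fetch_and(&toy_online_mask.bits, ~bit);
}

/*
 * Read the shared online mask once and store "online AND housekeeping"
 * into caller-provided storage. Nothing shared is returned, so the
 * result stays stable for as long as the caller uses it.
 */
void toy_online_queue_affinity(local_mask_t *out)
{
	uint64_t online = atomic_load_explicit(&toy_online_mask.bits,
					       memory_order_relaxed);

	out->bits = online & toy_hk_mask.bits;
}
```

A caller would declare a local_mask_t, fill it once, and pass it on. As in
the group_cpus_evenly() comment above, a hotplug event racing with the read
only decides whether the snapshot contains that CPU or not; it cannot modify
the snapshot afterwards.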

> How is this even remotely correct?

It isn't :( I did hotplug tests, but obviously these were not really up
to the task. The kernel test bot gave me a pointer on how I should test.
