Message-ID: <20241015132928.GA3961@lst.de>
Date: Tue, 15 Oct 2024 15:29:28 +0200
From: Christoph Hellwig <hch@....de>
To: Tero Kristo <tero.kristo@...ux.intel.com>
Cc: Christoph Hellwig <hch@....de>, linux-kernel@...r.kernel.org,
	axboe@...nel.dk, linux-nvme@...ts.infradead.org, sagi@...mberg.me,
	kbusch@...nel.org
Subject: Re: [PATCH 1/1] nvme-pci: Add CPU latency pm-qos handling

On Tue, Oct 15, 2024 at 12:25:37PM +0300, Tero Kristo wrote:
> I've been giving this some thought offline, but can't really think of
> how this could be done in the generic layers; the code needs to figure
> out the interrupt that gets fired by the activity, to prevent the CPU
> that is going to handle that interrupt from going into deep idle,
> potentially ruining the latency and throughput of the request. The
> knowledge of this interrupt mapping only resides at the driver level,
> in this case NVMe.
> 
> One thing that could be done is to prevent the whole feature from
> being used on setups where the number of CPUs per IRQ is above some
> threshold; let's say 4 as an example.

As a disclaimer, I don't really understand the PM QoS framework, just
the NVMe driver and the block layer.
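
If I read the patch right, per CPU it boils down to roughly the sketch
below, using the device PM QoS resume-latency interface (the my_* names
are made up for illustration, not taken from the patch):

#include <linux/cpu.h>
#include <linux/pm_qos.h>

/* illustrative per-CPU latency constraint, not actual patch code */
struct my_cpu_qos {
        struct dev_pm_qos_request       req;
        bool                            added;
};

/* keep @cpu out of idle states with a resume latency above @latency_us */
static int my_cpu_qos_constrain(struct my_cpu_qos *qos, unsigned int cpu,
                                s32 latency_us)
{
        struct device *dev = get_cpu_device(cpu);

        if (!dev)
                return -ENODEV;
        if (!qos->added) {
                int ret = dev_pm_qos_add_request(dev, &qos->req,
                                                 DEV_PM_QOS_RESUME_LATENCY,
                                                 latency_us);
                if (ret < 0)
                        return ret;
                qos->added = true;
                return 0;
        }
        return dev_pm_qos_update_request(&qos->req, latency_us);
}

/* drop the constraint again once the queue has gone idle */
static void my_cpu_qos_relax(struct my_cpu_qos *qos)
{
        if (qos->added)
                dev_pm_qos_update_request(&qos->req,
                                          PM_QOS_RESUME_LATENCY_NO_CONSTRAINT);
}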

With that, my gut feeling is that all this latency management should
be driven by the blk_mq_hw_ctx structure, the block layer equivalent
of a hardware queue.  And instead of having a per-CPU array of QoS
requests per device, there should be one request per CPU in the actual
cpumask of the hctx, so that you only have to iterate this small,
hctx-local data structure.
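
Roughly something like the following, reusing the per-CPU helpers from
the sketch above (again made-up names, not meant as actual patch code):

#include <linux/blk-mq.h>
#include <linux/slab.h>

/*
 * Illustrative per-hctx QoS state.  The array is sized for all possible
 * CPUs for simple indexing, but only the CPUs in hctx->cpumask are ever
 * touched.
 */
struct my_hctx_qos {
        struct my_cpu_qos       *cpus;          /* nr_cpu_ids entries */
        atomic_t                active;         /* single per-hctx check */
        unsigned long           last_used;      /* jiffies of last submission */
};

static struct my_hctx_qos *my_hctx_qos_alloc(struct blk_mq_hw_ctx *hctx)
{
        struct my_hctx_qos *qos;

        qos = kzalloc(sizeof(*qos), GFP_KERNEL);
        if (!qos)
                return NULL;
        qos->cpus = kcalloc(nr_cpu_ids, sizeof(*qos->cpus), GFP_KERNEL);
        if (!qos->cpus) {
                kfree(qos);
                return NULL;
        }
        return qos;
}

static void my_hctx_qos_free(struct my_hctx_qos *qos)
{
        unsigned int cpu;

        for_each_possible_cpu(cpu)
                if (qos->cpus[cpu].added)
                        dev_pm_qos_remove_request(&qos->cpus[cpu].req);
        kfree(qos->cpus);
        kfree(qos);
}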

Preferably there would be a single active check per hctx and not one
per CPU, i.e. when the block layer submits commands it has to do one
single check instead of an iteration.  Similarly, the block layer code
would time out the activity once per hctx, and only then iterate over
the (usually few) CPUs mapped to that hctx.
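
Continuing the sketch from above (the idle timeout value is just an
arbitrary example):

#define MY_QOS_IDLE_TIMEOUT_MS  50

/* submission fast path: one atomic test per hctx, no per-CPU iteration */
static void my_hctx_qos_mark_active(struct blk_mq_hw_ctx *hctx,
                                    struct my_hctx_qos *qos, s32 latency_us)
{
        unsigned int cpu;

        qos->last_used = jiffies;
        if (atomic_xchg(&qos->active, 1))
                return;

        /*
         * Only walks the few CPUs mapped to this hctx.  Note that the
         * dev_pm_qos calls can sleep, so in a real implementation this
         * would probably have to be deferred to a workqueue.
         */
        for_each_cpu(cpu, hctx->cpumask)
                my_cpu_qos_constrain(&qos->cpus[cpu], cpu, latency_us);
}

/* per-hctx expiry, e.g. driven by a delayed work owned by the hctx */
static void my_hctx_qos_expire(struct blk_mq_hw_ctx *hctx,
                               struct my_hctx_qos *qos)
{
        unsigned int cpu;

        if (time_before(jiffies, qos->last_used +
                        msecs_to_jiffies(MY_QOS_IDLE_TIMEOUT_MS)))
                return;
        if (!atomic_xchg(&qos->active, 0))
                return;

        for_each_cpu(cpu, hctx->cpumask)
                my_cpu_qos_relax(&qos->cpus[cpu]);
}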

