linux-kernel - Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Zj6-1sXvUNZWO1pB@kbusch-mbp.dhcp.thefacebook.com>
Date: Fri, 10 May 2024 18:41:58 -0600
From: Keith Busch <kbusch@...nel.org>
To: Ming Lei <ming.lei@...hat.com>
Cc: Christoph Hellwig <hch@....de>, Keith Busch <kbusch@...a.com>,
	linux-nvme@...ts.infradead.org, linux-kernel@...r.kernel.org,
	tglx@...utronix.de
Subject: Re: [PATCH 2/2] nvme-pci: allow unmanaged interrupts

On Sat, May 11, 2024 at 07:50:21AM +0800, Ming Lei wrote:
> On Fri, May 10, 2024 at 10:20:02AM -0600, Keith Busch wrote:
> > On Fri, May 10, 2024 at 05:10:47PM +0200, Christoph Hellwig wrote:
> > > On Fri, May 10, 2024 at 07:14:59AM -0700, Keith Busch wrote:
> > > > From: Keith Busch <kbusch@...nel.org>
> > > > 
> > > > Some people _really_ want to control their interrupt affinity.
> > > 
> > > So let them argue why.  I'd rather have a really, really, really
> > > good argument for this crap, and I'd like to hear it from the horses
> > > mouth.
> > 
> > It's just prioritizing predictable user task scheduling for a subset of
> > CPUs instead of having consistently better storage performance.
> > 
> > We already have "isolcpus=managed_irq," parameter to prevent managed
> > interrupts from running on a subset of CPUs, so the use case is already
> > kind of supported. The problem with that parameter is it is a no-op if
> > the starting affinity spread contains only isolated CPUs.
> 
> Can you explain a bit why it is a no-op? If only isolated CPUs are
> spread on one queue, there will be no IO originated from these isolated
> CPUs, that is exactly what the isolation needs.

The "isolcpus=managed_irq," option doesn't limit the dispatching CPUs.
It only limits where the managed irq will assign the effective_cpus as a
best effort.

Example, I boot with a system with 4 threads, one nvme device, and
kernel parameter:

  isolcpus=managed_irq,2-3

Run this:

  for i in $(seq 0 3); do taskset -c $i dd if=/dev/nvme0n1 of=/dev/null bs=4k count=1000 iflag=direct; done

Check /proc/interrupts | grep nvme0:

           CPU0       CPU1       CPU2       CPU3
..
 26:       1000          0          0          0  PCI-MSIX-0000:00:05.0   1-edge      nvme0q1
 27:          0       1004          0          0  PCI-MSIX-0000:00:05.0   2-edge      nvme0q2
 28:          0          0       1000          0  PCI-MSIX-0000:00:05.0   3-edge      nvme0q3
 29:          0          0          0       1043  PCI-MSIX-0000:00:05.0   4-edge      nvme0q4

The isolcpus did nothing becuase the each vector's mask had just one
cpu; there was no where else that the managed irq could send it. The
documentation seems to indicate that was by design as a "best effort".