Message-ID: <aApbYhyeYcCifoYI@kbusch-mbp.dhcp.thefacebook.com>
Date: Thu, 24 Apr 2025 09:40:18 -0600
From: Keith Busch <kbusch@...nel.org>
To: Christoph Hellwig <hch@....de>
Cc: Caleb Sander Mateos <csander@...estorage.com>,
	Jens Axboe <axboe@...nel.dk>, Sagi Grimberg <sagi@...mberg.me>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Kanchan Joshi <joshi.k@...sung.com>, linux-nvme@...ts.infradead.org,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5 3/3] nvme/pci: make PRP list DMA pools per-NUMA-node

On Thu, Apr 24, 2025 at 04:12:49PM +0200, Christoph Hellwig wrote:
> On Tue, Apr 22, 2025 at 04:09:52PM -0600, Caleb Sander Mateos wrote:
> > NVMe commands with more than 4 KB of data allocate PRP list pages from
> > the per-nvme_device dma_pool prp_page_pool or prp_small_pool.
> 
> That's not actually true.  We can transfer all of the MDTS without a
> single pool allocation when using SGLs.

Let's just change it to say discontiguous data, then.

Though even with PRPs, you could transfer up to 8k without allocating a
list, if its address is 4k aligned.
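
For concreteness, a minimal standalone sketch of that rule (my names,
not the driver's; assumes 4k controller pages):

	#include <stdbool.h>
	#include <stdint.h>

	#define NVME_PAGE_SIZE 4096u

	/*
	 * A PRP list page is only needed when the transfer spans more
	 * than two controller pages: PRP1 covers the tail of the first
	 * page, and PRP2 can point directly at one more page.
	 */
	static bool prp_list_needed(uint64_t dma_addr, uint32_t len)
	{
		uint32_t first = NVME_PAGE_SIZE -
				 (dma_addr & (NVME_PAGE_SIZE - 1));

		if (len <= first)
			return false;	/* PRP1 alone */
		if (len <= first + NVME_PAGE_SIZE)
			return false;	/* PRP1 plus a plain PRP2 */
		return true;		/* PRP2 must point at a list */
	}

So 8k at a 4k-aligned address needs no list, while the same 8k at an
unaligned address spans three pages and does.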
 
> > Each call
> > to dma_pool_alloc() and dma_pool_free() takes the per-dma_pool spinlock.
> > These device-global spinlocks are a significant source of contention
> > when many CPUs are submitting to the same NVMe devices. On a workload
> > issuing 32 KB reads from 16 CPUs (8 hypertwin pairs) across 2 NUMA nodes
> > to 23 NVMe devices, we observed 2.4% of CPU time spent in
> > _raw_spin_lock_irqsave called from dma_pool_alloc and dma_pool_free.
> > 
> > Ideally, the dma_pools would be per-hctx to minimize
> > contention. But that could impose considerable resource costs in a
> > system with many NVMe devices and CPUs.
> 
> Should we try to simply do a slab allocation first and only allocate
> from the dmapool when that fails?  That should give you all the
> scalability from the slab allocator with very few downsides.

The dmapool allocates dma coherent memory, and it stays mapped for the
lifetime of the pool. Allocating slab memory and DMA mapping it per-IO
would be pretty costly in comparison, I think.
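
Roughly, the two allocation paths look like this (an illustrative
sketch, not driver code; NVME_CTRL_PAGE_SIZE is defined locally here
for self-containment):

	#include <linux/dmapool.h>
	#include <linux/dma-mapping.h>
	#include <linux/slab.h>

	#define NVME_CTRL_PAGE_SIZE 4096	/* matches the driver's value */

	/*
	 * Pool path: the pool's pages were DMA-mapped once at
	 * dma_pool_create() time, so each allocation only pays the
	 * (contended) pool spinlock.
	 */
	static void *prp_from_pool(struct dma_pool *pool, dma_addr_t *dma)
	{
		return dma_pool_alloc(pool, GFP_ATOMIC, dma);
	}

	/*
	 * Slab path: the allocation itself scales via per-CPU slab
	 * caches, but every I/O then pays for a streaming mapping here,
	 * plus a dma_unmap_single() on completion.
	 */
	static void *prp_from_slab(struct device *dev, dma_addr_t *dma)
	{
		void *page = kmalloc(NVME_CTRL_PAGE_SIZE, GFP_ATOMIC);

		if (!page)
			return NULL;
		*dma = dma_map_single(dev, page, NVME_CTRL_PAGE_SIZE,
				      DMA_TO_DEVICE);
		if (dma_mapping_error(dev, *dma)) {
			kfree(page);
			return NULL;
		}
		return page;
	}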
