Message-ID: <ZvIsPe4JbJ7HX2sQ@dread.disaster.area>
Date: Tue, 24 Sep 2024 13:04:29 +1000
From: Dave Chinner <david@...morbit.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Kent Overstreet <kent.overstreet@...ux.dev>,
	linux-bcachefs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org, Dave Chinner <dchinner@...hat.com>
Subject: Re: [GIT PULL] bcachefs changes for 6.12-rc1

On Mon, Sep 23, 2024 at 07:26:31PM -0700, Linus Torvalds wrote:
> On Mon, 23 Sept 2024 at 17:27, Dave Chinner <david@...morbit.com> wrote:
> >
> > However, the problematic workload is cold cache operations where
> > the dentry cache repeatedly misses. This places all the operational
> > concurrency directly on the inode hash as new inodes are inserted
> > into the hash. Add memory reclaim and that adds contention as it
> > removes inodes from the hash on eviction.
> 
> Yeah, and then we spend all the time just adding the inodes to the
> hashes, and probably fairly seldom use them. Oh well.
> 
> And I had missed the issue with PREEMPT_RT and the fact that right now
> the inode hash lock is outside the inode lock, which is problematic.

*nod*
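
For context, the ordering in question looks roughly like the hash
insert path in fs/inode.c (paraphrased from memory, not a verbatim
copy) - the global inode_hash_lock is taken outside the per-inode
i_lock:

	void __insert_inode_hash(struct inode *inode, unsigned long hashval)
	{
		struct hlist_head *b = inode_hashtable + hash(inode->i_sb, hashval);

		/* global hash lock first... */
		spin_lock(&inode_hash_lock);
		/* ...then the per-inode lock nests inside it */
		spin_lock(&inode->i_lock);
		hlist_add_head_rcu(&inode->i_hash, b);
		spin_unlock(&inode->i_lock);
		spin_unlock(&inode_hash_lock);
	}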

> So it's all a bit nasty.
> 
> But I also assume most of the bad issues end up mainly showing up on
> just fairly synthetic benchmarks with ramdisks, because even with a
> good SSD I suspect the IO for the cold cache would still dominate?

No, all of these issues show up on consumer-level NVMe SSDs - they
have more than enough IO concurrency to trigger these CPU
concurrency problems.

Keep in mind that when it comes to doing huge amounts of IO,
ramdisks are fundamentally flawed and don't scale. That is, the IO
is synchronous and memcpy() based, so it consumes CPU time and both
read and write memory bandwidth, and concurrency is limited to the
number of CPUs in the system.
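
To see that model in miniature (a toy userspace sketch, not the
actual brd code): each "IO" is a copy executed by the submitting
thread, so concurrency tops out at the number of CPUs and every IO
burns both read and write memory bandwidth.

	/* build with: gcc -O2 -pthread ramdisk_model.c */
	#include <pthread.h>
	#include <string.h>

	#define IO_SIZE		4096
	#define NR_IOS		(256 * 1024)
	#define NR_WORKERS	8

	static char backing_store[64 * IO_SIZE];  /* stands in for the ramdisk */
	static volatile char sink;	/* stop the copies being optimised away */

	static void *io_worker(void *arg)
	{
		char buf[IO_SIZE];
		long i;

		for (i = 0; i < NR_IOS; i++) {
			/* the whole "IO" is a CPU-driven memcpy() */
			memcpy(buf, &backing_store[(i % 64) * IO_SIZE], IO_SIZE);
			sink = buf[0];
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t workers[NR_WORKERS];
		int i;

		for (i = 0; i < NR_WORKERS; i++)
			pthread_create(&workers[i], NULL, io_worker, NULL);
		for (i = 0; i < NR_WORKERS; i++)
			pthread_join(workers[i], NULL);
		return 0;
	}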

With NVMe SSDs, all the data movement is asynchronous and offloaded
to hardware with DMA engines that move the data. Those DMA engines
can often handle hundreds of concurrent IOs at once.

DMA-sourced data is also only written to RAM once, and there are no
dependent data reads to slow down the DMA write to RAM as there are
with a data copy streamed through a CPU. IOWs, once the IO rates and
concurrency go up, it is generally much faster to use the CPU to
program the DMA engines to move the data than it is to move the data
with the CPU itself.
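
To make the contrast concrete, here's a minimal liburing sketch
(userspace only, nothing from the patches under discussion; build
with -luring): the CPU just fills in and reaps the request
descriptors, while the device's DMA engines move the data for all
64 in-flight reads.

	#define _GNU_SOURCE		/* for O_DIRECT */
	#include <fcntl.h>
	#include <liburing.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define QD	64		/* IOs kept in flight at once */
	#define BS	4096		/* 4kB random-read sized IOs */

	int main(int argc, char **argv)
	{
		struct io_uring ring;
		void *bufs[QD];
		int fd, i;

		if (argc < 2) {
			fprintf(stderr, "usage: %s <device-or-file>\n", argv[0]);
			return 1;
		}
		fd = open(argv[1], O_RDONLY | O_DIRECT);
		if (fd < 0 || io_uring_queue_init(QD, &ring, 0) < 0)
			return 1;

		for (i = 0; i < QD; i++) {
			struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

			if (posix_memalign(&bufs[i], BS, BS))
				return 1;
			/* the CPU only describes the IO; the device DMAs into bufs[i] */
			io_uring_prep_read(sqe, fd, bufs[i], BS, (unsigned long long)i * BS);
		}
		io_uring_submit(&ring);

		for (i = 0; i < QD; i++) {
			struct io_uring_cqe *cqe;

			if (io_uring_wait_cqe(&ring, &cqe) < 0)
				break;
			io_uring_cqe_seen(&ring, cqe);
		}
		io_uring_queue_exit(&ring);
		close(fd);
		return 0;
	}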

The testing I did (and so the numbers in those benchmarks) was done
on 2018-era PCIe 3.0 enterprise NVMe SSDs that could do
approximately 400k 4kB random read IOPS. The latest consumer PCIe
5.0 NVMe SSDs are *way faster* than these drives when subject to
highly concurrent IO requests...

-Dave.
-- 
Dave Chinner
david@...morbit.com
