Message-ID: <Zw7jwnvBaMwloHXG@dread.disaster.area>
Date: Wed, 16 Oct 2024 08:50:58 +1100
From: Dave Chinner <david@...morbit.com>
To: Brian Foster <bfoster@...hat.com>
Cc: kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
lkp@...el.com, linux-kernel@...r.kernel.org,
Christian Brauner <brauner@...nel.org>,
"Darrick J. Wong" <djwong@...nel.org>,
Josef Bacik <josef@...icpanda.com>, linux-xfs@...r.kernel.org,
linux-fsdevel@...r.kernel.org, ying.huang@...el.com,
feng.tang@...el.com, fengwei.yin@...el.com
Subject: Re: [linus:master] [iomap] c5c810b94c:
stress-ng.metamix.ops_per_sec -98.4% regression
On Mon, Oct 14, 2024 at 12:34:37PM -0400, Brian Foster wrote:
> On Mon, Oct 14, 2024 at 03:55:24PM +0800, kernel test robot wrote:
> >
> >
> > Hello,
> >
> > kernel test robot noticed a -98.4% regression of stress-ng.metamix.ops_per_sec on:
> >
> >
> > commit: c5c810b94cfd818fc2f58c96feee58a9e5ead96d ("iomap: fix handling of dirty folios over unwritten extents")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > testcase: stress-ng
> > config: x86_64-rhel-8.3
> > compiler: gcc-12
> > test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
> > parameters:
> >
> > nr_threads: 100%
> > disk: 1HDD
> > testtime: 60s
> > fs: xfs
> > test: metamix
> > cpufreq_governor: performance
> >
> >
> >
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <oliver.sang@...el.com>
> > | Closes: https://lore.kernel.org/oe-lkp/202410141536.1167190b-oliver.sang@intel.com
> >
> >
> > Details are as below:
> > -------------------------------------------------------------------------------------------------->
> >
> >
> > The kernel config and materials to reproduce are available at:
> > https://download.01.org/0day-ci/archive/20241014/202410141536.1167190b-oliver.sang@intel.com
> >
>
> So I basically just run this on a >64xcpu guest and reproduce the delta:
>
> stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --metamix 64
>
> The short of it is that with tracing enabled, I see a very large number
> of extending writes across unwritten mappings, which basically means XFS
> eof zeroing is calling zero range and hitting the newly introduced
> flush. This is all pretty much expected given the patch.
Ouch.
The conditions required to cause this regression are that we either
first use fallocate() to preallocate beyond EOF, or buffered writes
trigger speculative delalloc beyond EOF that then gets converted to
unwritten extents through background writeback or fsync operations.
Both of these lead to unwritten extents beyond EOF that extending
writes will fall into.
All we need now is for the extending writes to be slightly
non-sequential, so that they do not land exactly at EOF but at some
distance beyond it. At that point, we trigger the new flush code.
Unfortunately, this is actually a fairly common workload pattern.
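To make that concrete, here's a minimal userspace sketch of the IO
pattern (untested and purely illustrative - the mount point, file
name and sizes are made up):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096] = { 0 };
		/* hypothetical test file on an XFS mount */
		int fd = open("/mnt/xfs/testfile",
			      O_CREAT | O_RDWR | O_TRUNC, 0644);

		if (fd < 0) {
			perror("open");
			return 1;
		}

		/* small write leaves a dirty folio straddling EOF at 1k */
		pwrite(fd, buf, 1024, 0);

		/* preallocate well beyond EOF without changing the size */
		fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 16 * 1024 * 1024);

		/*
		 * Extend the file with a write that lands a couple of
		 * blocks beyond EOF. XFS now has to zero the 1k-12k gap,
		 * and that zero range runs across the dirty EOF folio and
		 * the unwritten preallocation - the case where I'd expect
		 * the new flush to fire.
		 */
		pwrite(fd, buf, sizeof(buf), 3 * 4096);

		close(fd);
		return 0;
	}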
For example, experience tells me that NFS server processing of async
sequential write requests from a client will -always- end up with
slightly out of order extending writes because the incoming async
write requests are processed concurrently. Hence they always race to
extend the file and slightly out of order file extension happens
quite frequently.
Further, the NFS client will also periodically send a write commit
request (i.e. a server side fsync), and the resulting NFS server
writeback will convert the speculative delalloc that extends beyond
EOF into unwritten extents beyond EOF whilst extending write
requests are still arriving from the client.
Hence I think that there are common workloads (e.g. large sequential
writes on a NFS client) that set up the exact conditions and IO
patterns necessary to trigger this performance regression in
production systems...
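As a rough local stand-in for that pattern (no NFS involved, and the
path, block size and commit cadence below are arbitrary), something
like the following keeps landing slightly out-of-order extending
writes over speculative prealloc that periodic fsyncs have converted
to unwritten:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	#define BLKSZ	65536

	int main(void)
	{
		static char buf[BLKSZ];
		int fd = open("/mnt/xfs/nfs-sim",
			      O_CREAT | O_RDWR | O_TRUNC, 0644);

		if (fd < 0) {
			perror("open");
			return 1;
		}

		for (int i = 0; i < 1024; i += 2) {
			/*
			 * Swap each pair of writes so that every second
			 * extending write lands one block beyond EOF,
			 * like racing async write requests would.
			 */
			pwrite(fd, buf, BLKSZ, (off_t)(i + 1) * BLKSZ);
			pwrite(fd, buf, BLKSZ, (off_t)i * BLKSZ);

			/*
			 * Periodic "commit" (server side fsync): writeback
			 * converts the speculative delalloc beyond EOF to
			 * unwritten, so subsequent out of order extensions
			 * zero over unwritten extents.
			 */
			if ((i % 64) == 0)
				fsync(fd);
		}

		close(fd);
		return 0;
	}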
> I ran a quick experiment to skip the flush on sub-4k ranges in favor of
> doing explicit folio zeroing. The idea with that is that the range is
> likely restricted to a single folio and, since it's dirty, we can assume
> unwritten conversion is imminent and just explicitly zero the range. I
> still see a decent number of flushes from larger ranges in that
> experiment, but it still seems to get things pretty close to my
> baseline test (on a 6.10 distro kernel).
Which filesystems other than XFS actually need this iomap bandaid
right now? If there are none (which I think is the case), then we
should just revert this change until a more performant fix is
available for XFS.
-Dave.
--
Dave Chinner
david@...morbit.com