linux-kernel - Re: [linus:master] [iomap] c5c810b94c: stress-ng.metamix.ops_per

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Zw_gDDlIEgZbApU_@bfoster>
Date: Wed, 16 Oct 2024 11:47:24 -0400
From: Brian Foster <bfoster@...hat.com>
To: Dave Chinner <david@...morbit.com>
Cc: kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
	lkp@...el.com, linux-kernel@...r.kernel.org,
	Christian Brauner <brauner@...nel.org>,
	"Darrick J. Wong" <djwong@...nel.org>,
	Josef Bacik <josef@...icpanda.com>, linux-xfs@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, ying.huang@...el.com,
	feng.tang@...el.com, fengwei.yin@...el.com
Subject: Re: [linus:master] [iomap]  c5c810b94c:
 stress-ng.metamix.ops_per_sec -98.4% regression

On Wed, Oct 16, 2024 at 08:50:58AM +1100, Dave Chinner wrote:
> On Mon, Oct 14, 2024 at 12:34:37PM -0400, Brian Foster wrote:
> > On Mon, Oct 14, 2024 at 03:55:24PM +0800, kernel test robot wrote:
> > > 
> > > 
> > > Hello,
> > > 
> > > kernel test robot noticed a -98.4% regression of stress-ng.metamix.ops_per_sec on:
> > > 
> > > 
> > > commit: c5c810b94cfd818fc2f58c96feee58a9e5ead96d ("iomap: fix handling of dirty folios over unwritten extents")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > 
> > > testcase: stress-ng
> > > config: x86_64-rhel-8.3
> > > compiler: gcc-12
> > > test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
> > > parameters:
> > > 
> > > 	nr_threads: 100%
> > > 	disk: 1HDD
> > > 	testtime: 60s
> > > 	fs: xfs
> > > 	test: metamix
> > > 	cpufreq_governor: performance
> > > 
> > > 
> > > 
> > > 
> > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > the same patch/commit), kindly add following tags
> > > | Reported-by: kernel test robot <oliver.sang@...el.com>
> > > | Closes: https://lore.kernel.org/oe-lkp/202410141536.1167190b-oliver.sang@intel.com
> > > 
> > > 
> > > Details are as below:
> > > -------------------------------------------------------------------------------------------------->
> > > 
> > > 
> > > The kernel config and materials to reproduce are available at:
> > > https://download.01.org/0day-ci/archive/20241014/202410141536.1167190b-oliver.sang@intel.com
> > > 
> > 
> > So I basically just run this on a >64xcpu guest and reproduce the delta:
> > 
> >   stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --metamix 64
> > 
> > The short of it is that with tracing enabled, I see a very large number
> > of extending writes across unwritten mappings, which basically means XFS
> > eof zeroing is calling zero range and hitting the newly introduced
> > flush. This is all pretty much expected given the patch.
> 
> Ouch.
> 
> The conditions required to cause this regression are that we either
> first use fallocate() to preallocate beyond EOF, or buffered writes
> trigger specualtive delalloc beyond EOF and they get converted to
> unwritten beyond EOF through background writeback or fsync
> operations. Both of these lead to unwritten extents beyond EOF that
> extending writes will fall into.
> 
> All we need now is the extending writes to be slightly
> non-sequential and those non-sequential extending writes will not
> land at EOF but at some distance beyond it. At this point, we
> trigger the new flush code. Unfortunately, this is actually a fairly
> common workload pattern.
> 
> For example, experience tells me that NFS server processing of async
> sequential write requests from a client will -always- end up with
> slightly out of order extending writes because the incoming async
> write requests are processed concurrently. Hence they always race to
> extend the file and slightly out of order file extension happens
> quite frequently.
> 
> Further, the NFS client will also periodically be sending a write
> commit request (i.e. server side fsync), the
> NFS server writeback will convert the speculative delalloc that
> extends beyond EOF into unwritten extents beyond EOF whilst the
> incoming extending write requests are still incoming from the
> client.
> 
> Hence I think that there are common workloads (e.g. large sequential
> writes on a NFS client) that set up the exact conditions and IO
> patterns necessary to trigger this performance regression in
> production systems...
> 

It's not clear to me that purely out of order writeback via NFS would
produce the same sort of hit here because we'd only flush on write
extensions. I think the pathological case would have to be something
like reordering such that every other write lands sequentially to
maximize the number of post-eof write extensions, and then going back
and filling in the gaps. That seems rather suboptimal to start, and
short of that the cost of the flushes will start to amortize to some
degree (including with commit requests, etc.).

That said, I don't have much experience with NFS and I think this is a
reasonable enough argument to try and optimize here. If you or anybody
has an NFS test/workload that might exacerbate this condition, let me
know and I'll try to play around with it.

> > I ran a quick experiment to skip the flush on sub-4k ranges in favor of
> > doing explicit folio zeroing. The idea with that is that the range is
> > likely restricted to single folio and since it's dirty, we can assume
> > unwritten conversion is imminent and just explicitly zero the range. I
> > still see a decent number of flushes from larger ranges in that
> > experiment, but that still seems to get things pretty close to my
> > baseline test (on a 6.10 distro kernel).
> 
> What filesystems other than XFS actually need this iomap bandaid
> right now?  If there are none (which I think is the case), then we
> should just revert this change it until a more performant fix is
> available for XFS.
> 

I think that's a bit hasty. I had one or two ideas/prototypes to work
around this sort of problem before the flush patches even landed, it
just wasn't clear to me they were worth the extra logic. I'd prefer to
try and iterate on performance from a baseline of functional correctness
rather than the other way around, if possible.

A quick hack to test out some of that on latest master brings the result
of this test right back to baseline in my local env. Let me play around
with trying to work that into something more production worthy before we
break out the pitchforks.. ;)

Brian

> -Dave.
> -- 
> Dave Chinner
> david@...morbit.com
>