lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ZxkE93Vz3ZQaAFO1@bfoster>
Date: Wed, 23 Oct 2024 10:15:19 -0400
From: Brian Foster <bfoster@...hat.com>
To: Dave Chinner <david@...morbit.com>
Cc: kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
	lkp@...el.com, linux-kernel@...r.kernel.org,
	Christian Brauner <brauner@...nel.org>,
	"Darrick J. Wong" <djwong@...nel.org>,
	Josef Bacik <josef@...icpanda.com>, linux-xfs@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, ying.huang@...el.com,
	feng.tang@...el.com, fengwei.yin@...el.com
Subject: Re: [linus:master] [iomap]  c5c810b94c:
 stress-ng.metamix.ops_per_sec -98.4% regression

On Wed, Oct 16, 2024 at 11:47:24AM -0400, Brian Foster wrote:
> On Wed, Oct 16, 2024 at 08:50:58AM +1100, Dave Chinner wrote:
> > On Mon, Oct 14, 2024 at 12:34:37PM -0400, Brian Foster wrote:
> > > On Mon, Oct 14, 2024 at 03:55:24PM +0800, kernel test robot wrote:
> > > > 
> > > > 
> > > > Hello,
> > > > 
> > > > kernel test robot noticed a -98.4% regression of stress-ng.metamix.ops_per_sec on:
> > > > 
> > > > 
> > > > commit: c5c810b94cfd818fc2f58c96feee58a9e5ead96d ("iomap: fix handling of dirty folios over unwritten extents")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > 
> > > > testcase: stress-ng
> > > > config: x86_64-rhel-8.3
> > > > compiler: gcc-12
> > > > test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
> > > > parameters:
> > > > 
> > > > 	nr_threads: 100%
> > > > 	disk: 1HDD
> > > > 	testtime: 60s
> > > > 	fs: xfs
> > > > 	test: metamix
> > > > 	cpufreq_governor: performance
> > > > 
> > > > 
> > > > 
> > > > 
> > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > > the same patch/commit), kindly add following tags
> > > > | Reported-by: kernel test robot <oliver.sang@...el.com>
> > > > | Closes: https://lore.kernel.org/oe-lkp/202410141536.1167190b-oliver.sang@intel.com
> > > > 
> > > > 
> > > > Details are as below:
> > > > -------------------------------------------------------------------------------------------------->
> > > > 
> > > > 
> > > > The kernel config and materials to reproduce are available at:
> > > > https://download.01.org/0day-ci/archive/20241014/202410141536.1167190b-oliver.sang@intel.com
> > > > 
> > > 
> > > So I basically just run this on a >64xcpu guest and reproduce the delta:
> > > 
> > >   stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --metamix 64
> > > 
> > > The short of it is that with tracing enabled, I see a very large number
> > > of extending writes across unwritten mappings, which basically means XFS
> > > eof zeroing is calling zero range and hitting the newly introduced
> > > flush. This is all pretty much expected given the patch.
> > 
> > Ouch.
> > 
> > The conditions required to cause this regression are that we either
> > first use fallocate() to preallocate beyond EOF, or buffered writes
> > trigger specualtive delalloc beyond EOF and they get converted to
> > unwritten beyond EOF through background writeback or fsync
> > operations. Both of these lead to unwritten extents beyond EOF that
> > extending writes will fall into.
> > 
> > All we need now is the extending writes to be slightly
> > non-sequential and those non-sequential extending writes will not
> > land at EOF but at some distance beyond it. At this point, we
> > trigger the new flush code. Unfortunately, this is actually a fairly
> > common workload pattern.
> > 
> > For example, experience tells me that NFS server processing of async
> > sequential write requests from a client will -always- end up with
> > slightly out of order extending writes because the incoming async
> > write requests are processed concurrently. Hence they always race to
> > extend the file and slightly out of order file extension happens
> > quite frequently.
> > 
> > Further, the NFS client will also periodically be sending a write
> > commit request (i.e. server side fsync), the
> > NFS server writeback will convert the speculative delalloc that
> > extends beyond EOF into unwritten extents beyond EOF whilst the
> > incoming extending write requests are still incoming from the
> > client.
> > 
> > Hence I think that there are common workloads (e.g. large sequential
> > writes on a NFS client) that set up the exact conditions and IO
> > patterns necessary to trigger this performance regression in
> > production systems...
> > 
> 
> It's not clear to me that purely out of order writeback via NFS would
> produce the same sort of hit here because we'd only flush on write
> extensions. I think the pathological case would have to be something
> like reordering such that every other write lands sequentially to
> maximize the number of post-eof write extensions, and then going back
> and filling in the gaps. That seems rather suboptimal to start, and
> short of that the cost of the flushes will start to amortize to some
> degree (including with commit requests, etc.).
> 
> That said, I don't have much experience with NFS and I think this is a
> reasonable enough argument to try and optimize here. If you or anybody
> has an NFS test/workload that might exacerbate this condition, let me
> know and I'll try to play around with it.
> 
> > > I ran a quick experiment to skip the flush on sub-4k ranges in favor of
> > > doing explicit folio zeroing. The idea with that is that the range is
> > > likely restricted to single folio and since it's dirty, we can assume
> > > unwritten conversion is imminent and just explicitly zero the range. I
> > > still see a decent number of flushes from larger ranges in that
> > > experiment, but that still seems to get things pretty close to my
> > > baseline test (on a 6.10 distro kernel).
> > 
> > What filesystems other than XFS actually need this iomap bandaid
> > right now?  If there are none (which I think is the case), then we
> > should just revert this change it until a more performant fix is
> > available for XFS.
> > 
> 
> I think that's a bit hasty. I had one or two ideas/prototypes to work
> around this sort of problem before the flush patches even landed, it
> just wasn't clear to me they were worth the extra logic. I'd prefer to
> try and iterate on performance from a baseline of functional correctness
> rather than the other way around, if possible.
> 
> A quick hack to test out some of that on latest master brings the result
> of this test right back to baseline in my local env. Let me play around
> with trying to work that into something more production worthy before we
> break out the pitchforks.. ;)
> 

So it turns out there is a little bit more going on here. The regression
is not so much the flush on its own, but the combination of the flush
and changes in commit 5ce5674187c34 ("xfs: convert delayed extents to
unwritten when zeroing post eof blocks"). This changes post-eof zero
range calls on XFS to convert delalloc extents to unwritten instead of
the prior behavior of leaving them as delalloc, zeroing in memory, and
continuing on. IOW, the regression also goes away by bypassing this
particular commit, even with flushing in place.

The prealloc change seems fairly reasonable at face value, but the
commit log description documents it as purely an i_size change bug fix
associated with an internal zero range, which AFAICT isn't relevant any
more because iomap_zero_range() doesn't update i_size AFAICS. However,
it looks like it did so in the past and this behavior also swizzled back
and forth a time or two in the same timeframe as this particular commit,
so perhaps it was a problem when this was introduced and then iomap
changed again after (or maybe I'm just missing something?).

On thinking more about this, I'd be a little concerned on whether this
will reduce effectiveness of speculative preallocation on similar sorts
of write extending workloads as this test (i.e. strided extending
writes). This changes behavior from doing in-memory zeroing between
physical allocations via the writeback path to doing physical allocation
on every write that starts beyond EOF, which feels a little like going
from one extreme to the other. Instead, I'd expect to see something
where this converts larger mappings to avoid excessive zeroing and
zeroes on smallish ranges to avoid overly frequent and unnecessarily
small physical allocations, allowing multiple speculative preallocations
to compound.

Anyways, I've not dug into this enough to know whether it's a problem,
but since this is documented purely as a bug fix I don't see any
evidence that potential impact on allocation patterns was tested either.
This might be something to evaluate more closely in XFS.

On the iomap side, it also seems like the current handling of i_size on
zero range is confused. If iomap_zero_range() doesn't update i_size,
then it basically doesn't fully support post-eof ranges. It zeroes
through buffered writes, which writeback will just drop on the floor if
beyond EOF. However, XFS explicitly calls zero range on post-eof ranges
to trigger the aforementioned conversion in its begin callback (but
never expecting to see ranges that need buffered writes).

I think this is a landmine waiting to happen. If iomap decides to be
deliberate and skip post-eof ranges, then this could break current XFS
behavior if it skips the begin callback. OTOH if XFS were to change back
to at least doing some speculative prealloc delalloc zeroing, IIUC this
now introduces a race between writeback potentially throwing away the
zeroed folios over delalloc preallocation and the subsequent write
operation extending i_size so that doesn't happen. :/ None of this is
particularly obvious. And FWIW, I'm also skeptical that i_size updates
were ever consistent across mapping types. I.e., if the size was only
ever updated via iomap_write_end() for example, then behavior is kind of
unpredictable.

Maybe this is something that should be configurable via a keepsize flag
or some such. That would at least allow for correct behavior and/or a
failure/warning if we ever fell into doing zeroing for post-eof ranges
without updating i_size. Thoughts on any of this?

Brian

> Brian
> 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@...morbit.com
> > 
> 
> 


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ