linux-ext4 - Re: [PATCH 0/1] iomap regression for aio dio 4k writes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZJOqC7Cfjr5AoW7S@dread.disaster.area>
Date:   Thu, 22 Jun 2023 11:55:23 +1000
From:   Dave Chinner <david@...morbit.com>
To:     Jeremy Bongio <bongiojp@...il.com>
Cc:     Ted Tso <tytso@....edu>, "Darrick J . Wong" <djwong@...nel.org>,
        Allison Henderson <allison.henderson@...cle.com>,
        linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-xfs@...r.kernel.org
Subject: Re: [PATCH 0/1] iomap regression for aio dio 4k writes

On Thu, Jun 22, 2023 at 09:59:30AM +1000, Dave Chinner wrote:
> On Wed, Jun 21, 2023 at 10:29:19AM -0700, Jeremy Bongio wrote:
> > Since no issues have been found for ext4 calling completion work
> > directly in the io handler pre-iomap, it is unlikely that this is
> > unsafe (sleeping within an io handler callback). However, this may not
> > be true for all filesystems. Does XFS potentially sleep in its
> > completion code?
> 
> Yes, and ext4 does too. e.g. O_DSYNC overwrites always need to be
> deferred to task context to be able to take sleeping locks and
> potentially block on journal and or device cache flushes.
> 
> i.e. Have you considered what context all of XFS, f2fs, btrfs,
> zonefs and gfs2 need for pure DIO overwrite completion in all it's
> different variants?
> 
> AFAIC, it's far simpler conceptually to defer all writes to
> completion context than it is to try to work out what writes need to
> be deferred and what doesn't, especially as the filesystem ->end_io
> completion might need to sleep and the iomap code has no idea
> whether that is possible.

Ok, so having spent a bit more thought on this away from the office
this morning, I think there is a generic way we can avoid deferring
completions for pure overwrites.

We already have a mechanism in iomap that tells us if the write is a
pure overwrite and we use it to change how we issue O_DSYNC DIO
writes. i.e. we use it to determine if we can use FUA writes rather
than a post-IO journal/device cache flush to guarantee data
integrity. See IOMAP_DIO_WRITE_FUA for how we determine whether we
need issue a generic_write_sync() call or not in the post IO
completion processing.

The iomap flags that determines if we can make this optimisation are
IOMAP_F_SHARED and IOMAP_F_DIRTY. IOMAP_F_SHARED indicates a COW is
required to break sharing for the write IO to proceed, whilst
IOMAP_F_DIRTY indicates that the inode is either dirty or that the
write IO requires metadata to be dirtied at completion time (e.g.
unwritten extent conversion) before the sync operation that provides
data integrity guarantees can be run.

If neither of these flags are set in the iomap, it effectively means
that the IO is a pure overwrite. i.e. the filesytsem has explicitly
said that this write IO does not need any post-IO completion
filesystem work to be done.

At this point, the iomap code can optimise for a pure overwrite into
an IOMAP_MAPPED extent, knowing that the only thing it needs to care
about on completion is internal data integrity requirements (i.e.
O_DSYNC/O_SYNC) of the IO.

Hence if the filesystem has told iomap that it has no IO completion
requirements, and iomap doesn't need generic_write_sync() for data
integrity (i.e. no data integrity required or FUA was used for the
entire IO), then we could complete the DIO write directly from the
bio completion callback context...

IOWs, what you want can be done, it's just a whole lot more complex
than just avoiding a queue_work() call...

-Dave.
-- 
Dave Chinner
david@...morbit.com