linux-kernel - Re: [PATCHSET v6 0/12] Uncached buffered IO

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241206173751.GI7864@frogsfrogsfrogs>
Date: Fri, 6 Dec 2024 09:37:51 -0800
From: "Darrick J. Wong" <djwong@...nel.org>
To: Jens Axboe <axboe@...nel.dk>
Cc: linux-mm@...ck.org, linux-fsdevel@...r.kernel.org, hannes@...xchg.org,
	clm@...a.com, linux-kernel@...r.kernel.org, willy@...radead.org,
	kirill@...temov.name, bfoster@...hat.com
Subject: Re: [PATCHSET v6 0/12] Uncached buffered IO

On Tue, Dec 03, 2024 at 08:31:36AM -0700, Jens Axboe wrote:
> Hi,
> 
> 5 years ago I posted patches adding support for RWF_UNCACHED, as a way
> to do buffered IO that isn't page cache persistent. The approach back
> then was to have private pages for IO, and then get rid of them once IO
> was done. But that then runs into all the issues that O_DIRECT has, in
> terms of synchronizing with the page cache.
> 
> So here's a new approach to the same concent, but using the page cache
> as synchronization. That makes RWF_UNCACHED less special, in that it's
> just page cache IO, except it prunes the ranges once IO is completed.
> 
> Why do this, you may ask? The tldr is that device speeds are only
> getting faster, while reclaim is not. Doing normal buffered IO can be
> very unpredictable, and suck up a lot of resources on the reclaim side.
> This leads people to use O_DIRECT as a work-around, which has its own
> set of restrictions in terms of size, offset, and length of IO. It's
> also inherently synchronous, and now you need async IO as well. While
> the latter isn't necessarily a big problem as we have good options
> available there, it also should not be a requirement when all you want
> to do is read or write some data without caching.
> 
> Even on desktop type systems, a normal NVMe device can fill the entire
> page cache in seconds. On the big system I used for testing, there's a
> lot more RAM, but also a lot more devices. As can be seen in some of the
> results in the following patches, you can still fill RAM in seconds even
> when there's 1TB of it. Hence this problem isn't solely a "big
> hyperscaler system" issue, it's common across the board.
> 
> Common for both reads and writes with RWF_UNCACHED is that they use the
> page cache for IO. Reads work just like a normal buffered read would,
> with the only exception being that the touched ranges will get pruned
> after data has been copied. For writes, the ranges will get writeback
> kicked off before the syscall returns, and then writeback completion
> will prune the range. Hence writes aren't synchronous, and it's easy to
> pipeline writes using RWF_UNCACHED. Folios that aren't instantiated by
> RWF_UNCACHED IO are left untouched. This means you that uncached IO
> will take advantage of the page cache for uptodate data, but not leave
> anything it instantiated/created in cache.
> 
> File systems need to support this. The patches add support for the
> generic filemap helpers, and for iomap. Then ext4 and XFS are marked as
> supporting it. The last patch adds support for btrfs as well, lightly
> tested. The read side is already done by filemap, only the write side
> needs a bit of help. The amount of code here is really trivial, and the
> only reason the fs opt-in is necessary is to have an RWF_UNCACHED IO
> return -EOPNOTSUPP just in case the fs doesn't use either the generic
> paths or iomap. Adding "support" to other file systems should be
> trivial, most of the time just a one-liner adding FOP_UNCACHED to the
> fop_flags in the file_operations struct.
> 
> Performance results are in patch 8 for reads and patch 10 for writes,
> with the tldr being that I see about a 65% improvement in performance
> for both, with fully predictable IO times. CPU reduction is substantial
> as well, with no kswapd activity at all for reclaim when using uncached
> IO.
> 
> Using it from applications is trivial - just set RWF_UNCACHED for the
> read or write, using pwritev2(2) or preadv2(2). For io_uring, same
> thing, just set RWF_UNCACHED in sqe->rw_flags for a buffered read/write
> operation. And that's it.
> 
> Patches 1..7 are just prep patches, and should have no functional
> changes at all. Patch 8 adds support for the filemap path for
> RWF_UNCACHED reads, patch 11 adds support for filemap RWF_UNCACHED
> writes. In the below mentioned branch, there are then patches to
> adopt uncached reads and writes for ext4, xfs, and btrfs.
> 
> Passes full xfstests and fsx overnight runs, no issues observed. That
> includes the vm running the testing also using RWF_UNCACHED on the host.
> I'll post fsstress and fsx patches for RWF_UNCACHED separately. As far
> as I'm concerned, no further work needs doing here.
> 
> And git tree for the patches is here:
> 
> https://git.kernel.dk/cgit/linux/log/?h=buffered-uncached.8

Oh good, I much prefer browsing git branches these days. :)

 * mm/filemap: change filemap_create_folio() to take a struct kiocb
 * mm/readahead: add folio allocation helper
 * mm: add PG_uncached page flag
 * mm/readahead: add readahead_control->uncached member
 * mm/filemap: use page_cache_sync_ra() to kick off read-ahead
 * mm/truncate: add folio_unmap_invalidate() helper

The mm patches look ok to me, but I think you ought to get at least an
ack from willy since they're largely pagecache changes.

 * fs: add RWF_UNCACHED iocb and FOP_UNCACHED file_operations flag

See more detailed reply in the thread.

 * mm/filemap: add read support for RWF_UNCACHED

Looks cleaner now that we don't even unmap the page if it's dirty.

 * mm/filemap: drop uncached pages when writeback completes
 * mm/filemap: add filemap_fdatawrite_range_kick() helper
 * mm/filemap: make buffered writes work with RWF_UNCACHED

See more detailed reply in the thread.

 * mm: add FGP_UNCACHED folio creation flag

I appreciate that !UNCACHED callers of __filemap_get_folio now clear the
uncached bit if it's set.

Now I proceed into the rest of your branch, because I felt like it:

 * ext4: add RWF_UNCACHED write support

(Dunno about the WARN_ON removals in this patch, but this is really
Ted's call anyway).

 * iomap: make buffered writes work with RWF_UNCACHED

The commit message references a "iocb_uncached_write" but I don't find
any such function in the extended patchset?  Otherwise this looks ready
to me.  Thanks for changing it only to set uncached if we're actually
creating a folio, and not just returning one that was already in the
pagecache.

 * xfs: punt uncached write completions to the completion wq

Dumb nit: spaces between "IOMAP_F_SHARED|IOMAP_F_UNCACHED" in this
patch.

 * xfs: flag as supporting FOP_UNCACHED

Otherwise the xfs changes look ready too.

 * btrfs: add support for uncached writes
 * block: support uncached IO

Not sure why the definition of bio_dirty_lock gets moved around, but in
principle this looks ok to me too.

For the whole pile of mm changes (aka patches 1-6,8-10,12),
Acked-by: "Darrick J. Wong" <djwong@...nel.org>

--D

> 
>  include/linux/fs.h             |  21 +++++-
>  include/linux/page-flags.h     |   5 ++
>  include/linux/pagemap.h        |  14 ++++
>  include/trace/events/mmflags.h |   3 +-
>  include/uapi/linux/fs.h        |   6 +-
>  mm/filemap.c                   | 114 +++++++++++++++++++++++++++++----
>  mm/readahead.c                 |  22 +++++--
>  mm/swap.c                      |   2 +
>  mm/truncate.c                  |  35 ++++++----
>  9 files changed, 187 insertions(+), 35 deletions(-)
> 
> Since v5
> - Skip invalidation in filemap_uncached_read() if the folio is dirty
>   as well, retaining the uncached setting for later cleaning to do
>   the actual invalidation.
> - Use the same trylock approach in read invalidation as the writeback
>   invalidation does.
> - Swap order of patches 10 and 11 to fix a bisection issue.
> - Split core mm changes and fs series patches. Once the generic side
>   has been approved, I'll send out the fs series separately.
> - Rebase on 6.13-rc1
> 
> -- 
> Jens Axboe
> 
>