linux-kernel - Re: Could it be made possible to offer "supplementary" data to a DIO write ?


Message-ID: <YQv+iwmhhZJ+/ndc@casper.infradead.org>
Date:   Thu, 5 Aug 2021 16:06:51 +0100
From:   Matthew Wilcox <willy@...radead.org>
To:     David Howells <dhowells@...hat.com>
Cc:     linux-fsdevel@...r.kernel.org, jlayton@...nel.org,
        Christoph Hellwig <hch@...radead.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        dchinner@...hat.com, linux-block@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: Could it be made possible to offer "supplementary" data to a DIO
 write ?

On Thu, Aug 05, 2021 at 03:38:01PM +0100, David Howells wrote:
> > If you want to take leases at byte granularity, and then not writeback
> > parts of a page that are outside that lease, feel free.  It shouldn't
> > affect how you track dirtiness or how you writethrough the page cache
> > to the disk cache.
> 
> Indeed.  Handling writes to the local disk cache is different from handling
> writes to the server(s).  The cache has a larger block size but I don't have
> to worry about third-party conflicts on it, whereas the server can be taken as
> having no minimum block size, but my write can clash with someone else's.
> 
> Generally, I prefer to write back the minimum I can get away with (as does the
> Linux NFS client AFAICT).
> 
> However, if everyone agrees that we should only ever write back a multiple of
> a certain block size, even to network filesystems, what block size should that
> be?

If your network protocol doesn't give you a way to ask the server what
its block size is, assume 512 bytes and allow that to be overridden by a
mount option.
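
A minimal sketch of that fallback rule, in case it helps to see it written
down.  The helper names (pick_wb_block_size, round_to_blocks) and the
convention that 0 means "not provided" are assumptions for illustration;
nothing here is an existing kernel interface.  The precedence (mount option
first, then whatever the server reports, then 512) is one reading of the
suggestion above.

```c
#include <assert.h>
#include <stdint.h>

#define DEFAULT_WB_BLOCK_SIZE 512u	/* fallback suggested above */

/*
 * Hypothetical helper: choose the writeback block size.
 * A value of 0 means "not provided" (assumed convention).
 */
static uint32_t pick_wb_block_size(uint32_t mount_override,
				   uint32_t server_reported)
{
	if (mount_override)
		return mount_override;
	if (server_reported)
		return server_reported;
	return DEFAULT_WB_BLOCK_SIZE;
}

/*
 * Hypothetical helper: expand the dirty byte range [*start, *end)
 * outward so that both ends fall on multiples of bsize.
 */
static void round_to_blocks(uint64_t *start, uint64_t *end, uint32_t bsize)
{
	uint64_t rem;

	assert(bsize != 0 && *start < *end);
	*start -= *start % bsize;
	rem = *end % bsize;
	if (rem)
		*end += bsize - rem;
}
```

With a 512-byte block size, a three-byte change at offsets 1000..1002 would
be written back as the single block covering bytes 512..1023.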

> Note that PAGE_SIZE varies across arches and folios are going to
> exacerbate this.  What I don't want to happen is that you read from a file, it
> creates, say, a 4M (or larger) folio; you change three bytes and then you're
> forced to write back the entire 4M folio.

Actually, you do.  Two situations:

1. Application uses MADV_HUGEPAGE.  In response, we create a 2MB
page and mmap it aligned.  We use a PMD-sized TLB entry and then the
CPU dirties a few bytes with a store.  There's no sub-TLB-entry tracking
of dirtiness.  It's just the whole 2MB.

2. The bigger the folio, the more writes it will absorb before being
written back.  So when you're writing back that 4MB folio, you're not
just servicing this 3 byte write, you're servicing every other write
which hit this 4MB chunk of the file.

There is one exception I've found, and that's O_SYNC writes.  These are
pretty rare, and I think I have a solution to it which essentially treats
the page cache as writethrough (for sync writes).  We skip marking
the page (folio) as dirty and go straight to marking it as writeback.
We have all the information we need about which bytes to write and we're
actually using the existing page cache infrastructure to do it.

I'm working on implementing that in iomap; there's some SMOP type
problems to solve, but it looks doable.
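
For reference, this is what the O_SYNC case looks like from the application
side.  It is a sketch only, and sync_write_at is a made-up name.  It shows
why the exception is tractable: the write call itself carries the exact
offset and length, so the byte range is known at the time of the write.  It
does not show the kernel-side writethrough described above.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes at offset off with O_SYNC, so
 * pwrite() returns only after the data has been sent to storage.
 * Returns 0 on success, -1 on failure.
 */
static int sync_write_at(const char *path, off_t off,
			 const void *buf, size_t len)
{
	ssize_t n;
	int fd, rc;

	assert(buf != NULL && len > 0);

	fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
	if (fd < 0)
		return -1;

	/* The kernel sees exactly (off, len) here; no guessing is needed. */
	n = pwrite(fd, buf, len, off);
	rc = (n == (ssize_t)len) ? 0 : -1;

	if (close(fd) != 0)
		rc = -1;
	return rc;
}
```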