linux-kernel - Re: [LSF/MM/BPF TOPIC] untorn buffered writes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <9e230104-4fb8-44f1-ae5a-a940f69b8d45@oracle.com>
Date: Wed, 15 May 2024 13:54:39 -0600
From: John Garry <john.g.garry@...cle.com>
To: Theodore Ts'o <tytso@....edu>, lsf-pc@...ts.linux-foundation.org
Cc: linux-fsdevel@...r.kernel.org, linux-mm <linux-mm@...ck.org>,
        Luis Chamberlain <mcgrof@...nel.org>,
        "Martin K. Petersen" <martin.petersen@...cle.com>,
        Matthew Wilcox <willy@...radead.org>,
        Dave Chinner <david@...morbit.com>, linux-kernel@...r.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] untorn buffered writes

On 27/02/2024 23:12, Theodore Ts'o wrote:
> Last year, I talked about an interest to provide database such as
> MySQL with the ability to issue writes that would not be torn as they
> write 16k database pages[1].
> 
> [1] https://urldefense.com/v3/__https://lwn.net/Articles/932900/__;!!ACWV5N9M2RV99hQ!Ij_ZeSZrJ4uPL94Im73udLMjqpkcZwHmuNnznogL68ehu6TDTXqbMsC4xLUqh18hq2Ib77p1D8_4mV5Q$
> 

After discussing this topic earlier this week, I would like to know if 
there are still objections or concerns with the untorn-writes userspace 
API proposed in 
https://lore.kernel.org/linux-block/20240326133813.3224593-1-john.g.garry@oracle.com/

I feel that the series for supporting direct-IO only, above, is stuck 
because of this topic of buffered IO.

So I sent an RFC for buffered untorn-writes last month in 
https://lore.kernel.org/linux-fsdevel/20240422143923.3927601-1-john.g.garry@oracle.com/, 
which did leverage the bs > ps effort. Maybe it did not get noticed due 
to being an RFC. It works on the following principles:

- A buffered atomic write requires RWF_ATOMIC flag be set, same as
   direct IO. The same other atomic writes rules apply.
- For an inode, only a single size of buffered write is allowed. So for
   statx, atomic_write_unit_min = atomic_write_unit_max always for
   buffered atomic writes.
- A single folio maps to an atomic write in the pagecache. So inode
   address_space folio min order = max order = atomic_write_unit_min/max
- A folio is tagged as "atomic" when atomically written and written back
   to storage "atomically", same as direct-IO method would do for an
   atomic write.
- If userspace wants to guarantee a buffered atomic write is written to
   storage atomically after the write syscall returns, it must use
   RWF_SYNC or similar (along with RWF_ATOMIC).

This is all along the lines of what I described on Monday.

There are no concrete semantics for buffered untorn-writes ATM - like 
mixing RWF_ATOMIC write with non-RWF_ATOMIC writes in the pagecache - 
but I don't think that this needs to be formalized yet. Or, if it really 
does, let me know.

There was also talk in the "limits of buffered IO.. " session - as I 
understand - that RWF_ATOMIC for buffered IO should be writethough. If 
anyone wants to discuss that further or describe that issue, then please do.

Anyway, I plan to push the direct IO series for merging in the next 
cycle, so let me know of what else to discuss and get conclusion on.


> There is a patch set being worked on by John Garry which provides
> stronger guarantees than what is actually required for this use case,
> called "atomic writes".  The proposed interface for this facility
> involves passing a new flag to pwritev2(2), RWF_ATOMIC, which requests
> that the specific write be written to the storage device in an
> all-or-nothing fashion, and if it can not be guaranteed, that the
> write should fail.  In this interface, if the userspace sends an 128k
> write with the RWF_ATOMIC flag, if the storage device will support
> that an all-or-nothing write with the given size and alignment the
> kernel will guarantee that it will be sent as a single 128k request
> --- although from the database perspective, if it is using 16k
> database pages, it only needs to guarantee that if the write is torn,
> it only happen on a 16k boundary.  That is, if the write is split into
> 32k and 96k request, that would be totally fine as far as the database
> is concerned --- and so the RWF_ATOMIC interface is a stronger
> guarantee than what might be needed.
> 
> So far, the "atomic write" patchset has only focused on Direct I/O,
> where this stronger guarantee is mostly harmless, even if it is
> unneeded for the original motivating use case.  Which might be OK,
> since perhaps there might be other future use cases where they might
> want some 32k writes to be "atomic", while other 128k writes might
> want to be "atomic" (that is to say, persisted with all-or-nothing
> semantics), and the proposed RWF_ATOMIC interface might permit that
> --- even though no one can seem top come up with a credible use case
> that would require this.
> 
> 
> However, this proposed interface is highly problematic when it comes
> to buffered writes, and Postgress database uses buffered, not direct
> I/O writes.   Suppose the database performs a 16k write, followed by a
> 64k write, followed by a 128k write --- and these writes are done
> using a file descriptor that does not have O_DIRECT enable, and let's
> suppose they are written using the proposed RWF_ATOMIC flag.   In
> order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
> kernel would need to store the fact that certain pages in the page
> cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
> were dirtied as part of a 32k RWF_ATOMIC write, etc, so that the
> writeback code knows what the "atomic" guarantee that was made at
> write time.   This very quickly becomes a mess.
> 
> Another interface that one be much simpler to implement for buffered
> writes would be one the untorn write granularity is set on a per-file
> descriptor basis, using fcntl(2).  We validate whether the untorn
> write granularity is one that can be supported when fcntl(2) is
> called, and we also store in the inode the largest untorn write
> granularity that has been requested by a file descriptor for that
> inode.  (When the last file descriptor opened for writing has been
> closed, the largest untorn write granularity for that inode can be set
> back down to zero.)
> 
> The write(2) system call will check whether the size and alignment of
> the write are valid given the requested untorn write granularity.  And
> in the writeback path, the writeback will detect if there are
> contiguous (aligned) dirty pages, and make sure they are sent to the
> storage device in multiples of the largest requested untorn write
> granularity.  This provides only the guarantees required by databases,
> and obviates the need to track which pages were dirtied by an
> RWF_ATOMIC flag, and the size of the RWF_ATOMIC write.
> 
> I'd like to discuss at LSF/MM what the best interface would be for
> buffered, untorn writes (I am deliberately avoiding the use of the
> word "atomic" since that presumes stronger guarantees than what we
> need, and because it has led to confusion in previous discussions),
> and what might be needed to support it.
> 
> 						- Ted
>