Message-ID: <363ee90b-00a4-45f9-91a3-663a8cdf077c@oracle.com>
Date: Mon, 22 Apr 2024 17:02:27 +0100
From: John Garry <john.g.garry@...cle.com>
To: Matthew Wilcox <willy@...radead.org>
Cc: axboe@...nel.dk, brauner@...nel.org, djwong@...nel.org,
viro@...iv.linux.org.uk, jack@...e.cz, akpm@...ux-foundation.org,
dchinner@...hat.com, tytso@....edu, hch@....de,
martin.petersen@...cle.com, nilay@...ux.ibm.com, ritesh.list@...il.com,
mcgrof@...nel.org, linux-block@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-xfs@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-mm@...ck.org,
ojaswin@...ux.ibm.com, p.raghav@...sung.com, jbongio@...gle.com,
okiselev@...zon.com
Subject: Re: [PATCH RFC 5/7] fs: iomap: buffered atomic write support

On 22/04/2024 16:03, Matthew Wilcox wrote:
> On Mon, Apr 22, 2024 at 02:39:21PM +0000, John Garry wrote:
>> Add special handling of PG_atomic flag to iomap buffered write path.
>>
>> To flag an iomap iter for an atomic write, set IOMAP_ATOMIC.
>>
>> For a folio associated with a write which has IOMAP_ATOMIC set, set
>> PG_atomic.
>>
>> Otherwise, when IOMAP_ATOMIC is unset, clear PG_atomic.
>>
>> This means that an "atomic" folio which has not been written back
>> loses its "atomicity". So if userspace issues a write with RWF_ATOMIC set
>> and another write with RWF_ATOMIC unset and which fully or partially
>> overwrites that same region as the first write, that folio is not written
>> back atomically. For such a scenario to occur, it would be considered a
>> userspace usage error.
>>
>> To ensure that a buffered atomic write is written back atomically when
>> the write syscall returns, RWF_SYNC or similar needs to be used (in
>> conjunction with RWF_ATOMIC).
>>
>> As a safety check, when getting a folio for an atomic write in
>> iomap_get_folio(), ensure that the length matches the inode mapping folio
>> order-limit.
>>
>> Only a single BIO should ever be submitted for an atomic write. So modify
>> iomap_add_to_ioend() to ensure that we don't try to write back an atomic
>> folio as part of a larger mixed-atomicity BIO.
>>
>> In iomap_alloc_ioend(), handle an atomic write by setting REQ_ATOMIC for
>> the allocated BIO.
>>
>> When a folio is written back, again clear PG_atomic, as it is no longer
>> required. I assume it will not be needlessly written back a second time...
>
> I'm not taking a position on the mechanism yet; need to think about it
> some more. But there's a hole here I also don't have a solution to,
> so we can all start thinking about it.
>
> In iomap_write_iter(), we call copy_folio_from_iter_atomic(). Through no
> fault of the application, if the range crosses a page boundary, we might
> partially copy the bytes from the first page, then take a page fault on
> the second page, hence doing a short write into the folio. And there's
> nothing preventing writeback from writing back a partially copied folio.
>
> Now, if it's not dirty, then it can't be written back. So if we're
> doing an atomic write, we could clear the dirty bit after calling
> iomap_write_begin() (given the usage scenarios we've discussed, it should
> always be clear ...)
>
> We need to prevent the "fall back to a short copy" logic in
> iomap_write_iter() as well. But then we also need to make sure we don't
> get stuck in a loop, so maybe go three times around, and if it's still
> not readable as a chunk, -EFAULT?

This idea sounds reasonable. So at what stage would the dirty flag be
set? Would it be only when all bytes are copied successfully as a single
chunk?
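
To make the question concrete, here is one possible reading of the
suggestion as a self-contained userspace sketch. It is not iomap code:
struct fake_folio, copy_fn, flaky_copy() and
atomic_copy_all_or_nothing() are all hypothetical names, and the copy
callback only stands in for copy_folio_from_iter_atomic(). The
assumption being illustrated is that the dirty flag is set only after
the whole range was copied as a single chunk, with at most three tries
before giving up with -EFAULT.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ATOMIC_COPY_MAX_TRIES 3	/* "three times around" */

/* Hypothetical stand-in for a folio: a buffer plus a dirty flag. */
struct fake_folio {
	char data[4096];
	bool dirty;
};

/*
 * Hypothetical stand-in for copy_folio_from_iter_atomic(): copies up to
 * len bytes and returns how many were copied, which may be short.
 */
typedef size_t (*copy_fn)(char *dst, const char *src, size_t len, void *ctx);

/*
 * Copy len bytes into the folio as one chunk or not at all (assumption:
 * a clean folio cannot be picked up by writeback).
 * Returns 0 on success, -EFAULT if every attempt came up short.
 */
static int atomic_copy_all_or_nothing(struct fake_folio *folio,
				      const char *src, size_t len,
				      copy_fn copy, void *ctx)
{
	assert(len <= sizeof(folio->data));

	/* Keep the folio clean while a partial copy may be present. */
	folio->dirty = false;

	for (int attempt = 0; attempt < ATOMIC_COPY_MAX_TRIES; attempt++) {
		if (copy(folio->data, src, len, ctx) == len) {
			/* Dirty only once all bytes have landed. */
			folio->dirty = true;
			return 0;
		}
		/* Short copy: do not accept it, go around again. */
	}

	return -EFAULT;
}

/* Example callback: copies only half the bytes while *ctx > 0. */
static size_t flaky_copy(char *dst, const char *src, size_t len, void *ctx)
{
	int *faults_left = ctx;

	if (*faults_left > 0) {
		(*faults_left)--;
		len /= 2;	/* simulate a fault partway through */
	}
	memcpy(dst, src, len);
	return len;
}
```

With a callback that comes up short twice, the third attempt succeeds
and the folio is dirtied. If it comes up short on all three attempts,
the folio stays clean and the caller sees -EFAULT.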

FWIW, we do have somewhat equivalent handling in the direct IO path: if
the iomap iter loops more than once, such that we would need to create
> 1 bio in the DIO bio submission handler, then we return -EINVAL, as
something has gone wrong. But that's not so relevant here.

Thanks,
John