Message-ID: <e7e3e769-07fb-4b71-b4d4-8d50754bd3b1@oracle.com>
Date: Wed, 15 Jan 2025 09:30:33 +0000
From: John Garry <john.g.garry@...cle.com>
To: "Darrick J. Wong" <djwong@...nel.org>, Dave Chinner <david@...morbit.com>
Cc: brauner@...nel.org, cem@...nel.org, dchinner@...hat.com, hch@....de,
ritesh.list@...il.com, linux-xfs@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
martin.petersen@...cle.com
Subject: Re: [PATCH 1/4] iomap: Lift blocksize restriction on atomic writes
On 14/01/2025 23:57, Darrick J. Wong wrote:
> On Tue, Jan 14, 2025 at 03:41:13PM +1100, Dave Chinner wrote:
>> On Wed, Dec 11, 2024 at 05:34:33PM -0800, Darrick J. Wong wrote:
>>> On Fri, Dec 06, 2024 at 08:15:05AM +1100, Dave Chinner wrote:
>>>> On Thu, Dec 05, 2024 at 10:52:50AM +0000, John Garry wrote:
>>>> e.g. look at MySQL's use of fallocate(hole punch) for transparent
>>>> data compression - nobody had foreseen that hole punching would be
>>>> used like this, but it's a massive win for the applications which
>>>> store bulk compressible data in the database even though it does bad
>>>> things to the filesystem.
>>>>
>>>> Spend some time looking outside the proprietary database application
>>>> box and think a little harder about the implications of atomic write
>>>> functionality. i.e. what happens when we have ubiquitous support
>>>> for guaranteeing only the old or the new data will be seen after
>>>> a crash *without the need for using fsync*.
>>>
>>> IOWs, the program either wants an old version or a new version of the
>>> files that it wrote, and the commit boundary is syncfs() after updating
>>> all the files?
>>
>> Yes, though there isn't a need for syncfs() to guarantee old-or-new.
>> That's the sort of thing an application can choose to do at the end
>> of its update set...
>
> Well yes, there has to be a cache flush somewhere -- last I checked,
> RWF_ATOMIC doesn't require that the written data be persisted after the
> call completes.
Correct - RWF_ATOMIC | RWF_SYNC is required to guarantee persistence.
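Just to illustrate, an untested sketch - it assumes uapi/glibc headers new
enough to define RWF_ATOMIC (6.11+), and the hypothetical fd/buf/len/off
arguments would still have to respect the statx atomic write limits:

#define _GNU_SOURCE
#include <sys/uio.h>

/* one write which is both untorn and persisted before returning */
static ssize_t write_atomic_sync(int fd, const void *buf, size_t len,
				 off_t off)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

	/* RWF_ATOMIC: all-or-nothing after a crash
	 * RWF_SYNC: data and required metadata persisted on return
	 */
	return pwritev2(fd, &iov, 1, off, RWF_ATOMIC | RWF_SYNC);
}

Alternatively, as per the quoted text below, the application can issue plain
RWF_ATOMIC writes across its update set and amortise the flush with a single
syncfs().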
>
>>>> Think about the implications of that for a minute - for any full
>>>> file overwrite up to the hardware atomic limits, we won't need fsync
>>>> to guarantee the integrity of overwritten data anymore. We only need
>>>> a mechanism to flush the journal and device caches once all the data
>>>> has been written (e.g. syncfs)...
>>>
>>> "up to the hardware atomic limits" -- that's a big limitation. What if
>>> I need to write 256K but the device only supports up to 64k? RWF_ATOMIC
>>> won't work. Or what if the file range I want to dirty isn't aligned
>>> with the atomic write alignment? What if the awu geometry changes
>>> online due to a device change, how do programs detect that?
>>
>> If awu geometry changes dynamically in an incompatible way, then
>> filesystem RWF_ATOMIC alignment guarantees are fundamentally broken.
>> This is not a problem the filesystem can solve.
>>
>> IMO, RAID device hotplug should reject a replacement device whose
>> atomic write support is incompatible with the existing device set.
>> With that constraint, the whole mess of "awu can randomly change"
>> problems go away.
>
> Assuming device mapper is subject to that too, I agree.
If a device which does not support atomic writes is added to an md RAID
array, then atomic writes are disabled for the md block device. I need
to verify that hotplug behaves like this.
And dm does behave like this also, i.e. atomic writes are disabled for
the dm block device.
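One way to check that from userspace is simply to read back the stacked
device's queue limits in sysfs - an untested sketch, assuming the
queue/atomic_write_* attributes documented for 6.11+, where a zero value
means no atomic write support:

#include <stdio.h>

/* e.g. disk = "md0" or "dm-0"; returns the limit, 0 if unsupported, -1 on error */
static long atomic_write_unit_max_bytes(const char *disk)
{
	char path[256];
	long val;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/block/%s/queue/atomic_write_unit_max_bytes", disk);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}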
>
>>> Programs that aren't 100% block-based should use exchange-range. There
>>> are no alignment restrictions, no limits on the size you can exchange,
>>> no file mapping state requirements to trip over, and you can update
>>> arbitrary sparse ranges. As long as you don't tell exchange-range to
>>> flush the log itself, programs can use syncfs to amortize the log and
>>> cache flush across a bunch of file content exchanges.
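For reference, usage would be something like the below - an untested
sketch which assumes the 6.10+ XFS_IOC_EXCHANGE_RANGE uapi as installed
by xfsprogs, and which omits the temp file setup, the data writes and
error handling:

#include <sys/ioctl.h>
#include <xfs/xfs.h>	/* struct xfs_exchange_range, XFS_IOC_EXCHANGE_RANGE */

/* swap the staged range in tmp_fd into target_fd in one atomic commit */
static int commit_via_exchange(int target_fd, int tmp_fd, __u64 off, __u64 len)
{
	struct xfs_exchange_range xchg = {
		.file1_fd	= tmp_fd,	/* donor file holding the new data */
		.file1_offset	= off,
		.file2_offset	= off,
		.length		= len,
		/* no XFS_EXCHANGE_RANGE_DSYNC: leave the flush to syncfs(),
		 * as mentioned above */
		.flags		= 0,
	};

	/* the ioctl is issued against the file receiving the new data */
	return ioctl(target_fd, XFS_IOC_EXCHANGE_RANGE, &xchg);
}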
>>
>> Right - that's kinda my point - I was assuming that we'd be using
>> something like xchg-range as the "unaligned slow path" for
>> RWF_ATOMIC.
>>
>> i.e. RWF_ATOMIC as implemented by a COW capable filesystem should
>> always be able to succeed regardless of IO alignment. In these
>> situations, the REQ_ATOMIC block layer offload to the hardware is a
>> fast path that is enabled when the user IO and filesystem extent
>> alignment matches the constraints needed to do a hardware atomic
>> write.
>>
>> In all other cases, we implement RWF_ATOMIC something like
>> always-cow or prealloc-beyond-eof-then-xchg-range-on-io-completion
>> for anything that doesn't correctly align to hardware REQ_ATOMIC.
>>
>> That said, there is nothing that prevents us from first implementing
>> RWF_ATOMIC constraints as "must match hardware requirements exactly"
>> and then relaxing them to be less stringent as filesystem
>> implementations improve. We've relaxed the direct IO hardware
>> alignment constraints multiple times over the years, so there's
>> nothing that really prevents us from doing so with RWF_ATOMIC,
>> either. Especially as we have statx to tell the application exactly
>> what alignment will get fast hardware offloads...
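Right, and for completeness that statx query would look something like
this - untested, and it needs 6.11+ headers for STATX_WRITE_ATOMIC and
the stx_atomic_write_* fields:

#define _GNU_SOURCE
#include <fcntl.h>	/* AT_FDCWD */
#include <stdio.h>
#include <sys/stat.h>

/* report the atomic write geometry which RWF_ATOMIC users must obey */
static void print_awu_geometry(const char *path)
{
	struct statx stx;

	if (statx(AT_FDCWD, path, 0, STATX_WRITE_ATOMIC, &stx) != 0)
		return;

	if (stx.stx_mask & STATX_WRITE_ATOMIC)
		printf("awu_min=%u awu_max=%u segments_max=%u untorn=%d\n",
		       stx.stx_atomic_write_unit_min,
		       stx.stx_atomic_write_unit_max,
		       stx.stx_atomic_write_segments_max,
		       !!(stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC));
}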
>
> Ok, let's do that then. Just to be clear -- for any RWF_ATOMIC direct
> write that's correctly aligned and targets a single mapping in the
> correct state, we can build the untorn bio and submit it. For
> everything else, prealloc some post EOF blocks, write them there, and
> exchange-range them.
>
That makes my life easier ... today, anyway.
For RWF_ATOMIC, our targeted users will want guaranteed performance, so
they would really need to know about anything which is doing
software-based atomic writes behind the scenes.
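FWIW, my reading of the proposal above, in rough pseudo-code (not real
code - the helper names are made up for illustration):

/* hypothetical helpers, for illustration only */
static int fs_atomic_write(struct kiocb *iocb, struct iov_iter *from)
{
	if (write_is_aligned_to_awu(iocb, from) &&
	    range_is_single_mapping_in_correct_state(iocb, from)) {
		/* fast path: build one untorn (REQ_ATOMIC) bio and submit it */
		return submit_hw_atomic_write(iocb, from);
	}

	/*
	 * slow path: stage the data in preallocated post-EOF blocks, then
	 * commit with an atomic extent exchange so that log recovery
	 * guarantees old-or-new.
	 */
	return stage_and_exchange_range(iocb, from);
}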
JFYI, I did rework the zeroing code to leverage what we already have in
iomap, and it looks better to me:
https://github.com/johnpgarry/linux/commits/atomic-write-large-atomics-v6.13-v4/
There is a problem with atomic writes over EOF, but that can be solved.
> Tricky questions: How do we avoid collisions between overlapping writes?
> I guess we find a free file range at the top of the file that is long
> enough to stage the write, and put it there? And purge it later?
>
> Also, does this imply that the maximum file size is less than the usual
> 8EB?
>
> (There's also the question about how to do this with buffered writes,
> but I guess we could skip that for now.)
>
>>> Even better, if you still wanted to use untorn block writes to persist
>>> the temporary file's dirty data to disk, you don't even need forcealign
>>> because the exchange-range will take care of restarting the operation
>>> during log recovery. I don't know that there's much point in doing that
>>> but the idea is there.
>>
>> *nod*