linux-kernel - Re: [PATCH 1/4] iomap: Lift blocksize restriction on atomic writes

Message-ID: <956e081a-2f1c-4a8e-a5fa-49eb91778eee@oracle.com>
Date: Fri, 6 Dec 2024 09:43:06 +0000
From: John Garry <john.g.garry@...cle.com>
To: Dave Chinner <david@...morbit.com>
Cc: brauner@...nel.org, djwong@...nel.org, cem@...nel.org, dchinner@...hat.com,
        hch@....de, ritesh.list@...il.com, linux-xfs@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        martin.petersen@...cle.com
Subject: Re: [PATCH 1/4] iomap: Lift blocksize restriction on atomic writes


>>> Where's the documentation that outlines all the restrictions on
>>> userspace behaviour to prevent this sort of problem being triggered?
>>
>> I would provide a man page update.
> 
> I think, at this point, we need a better way of documenting all the
> atomic write stuff in one place. Not just the user interface and
> what is expected of userspace, but also all the things the
> filesystems need to do to ensure atomic writes work correctly. I was
> thinking that a document somewhere in the Documentation/ directory,
> rather than random pieces of information splattered across random man pages
> would be a much better way of explaining all this.
> 
> Don't get me wrong - man pages explaining the programmatic API are
> necessary, but there's a whole lot more to understanding and making
> effective use of atomic writes than what has been added to the man
> pages so far.

Sure, that would be useful. I think that the final piece of the 
jigsaw is large atomic write support; once that is in place, kernel 
documentation can be considered further.

> 
>>> Common operations such as truncate, hole punch,
>>
>> So how would punch hole be a problem? The atomic write unit max is limited
>> by the alloc unit, and we can only punch out full alloc units.
> 
> I was under the impression that this was a feature of the
> force-align code, not a feature of atomic writes. i.e. force-align
> is what ensures the BMBT aligns correctly with the underlying
> extents.
> 
> Or did I miss the fact that some of the force-align semantics bleed
> back into the original atomic write patch set?

Not really.

As I mentioned, if we can only punch out a full allocation unit and 
atomic write unit max is limited by the allocation unit size, then 
punching out a hole should not create a new range of mixed extents that 
we can legally attempt to atomic write.
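
To make that argument concrete, here is a small userspace model (not kernel code). It assumes the userspace rule documented for RWF_ATOMIC (power-of-two length within [unit_min, unit_max], at a naturally aligned offset) and, as a simplifying assumption, a power-of-two allocation unit. The helper names and the `alloc_unit` parameter are hypothetical. Under those assumptions, a legal atomic write with unit_max <= alloc_unit never straddles an allocation unit boundary, so a hole punch that only removes whole allocation units cannot leave such a write covering both punched and unpunched blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if x is a non-zero power of two. */
static bool is_pow2(uint64_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/*
 * Model of the userspace rule for an atomic write: power-of-two length
 * within [unit_min, unit_max], at an offset aligned to that length.
 */
static bool atomic_write_ok(uint64_t off, uint64_t len,
			    uint64_t unit_min, uint64_t unit_max)
{
	return is_pow2(len) && len >= unit_min && len <= unit_max &&
	       off % len == 0;
}

/*
 * True if the byte range [off, off + len) lies inside one allocation
 * unit. alloc_unit is a hypothetical parameter standing in for the
 * filesystem allocation unit size in bytes.
 */
static bool within_one_alloc_unit(uint64_t off, uint64_t len,
				  uint64_t alloc_unit)
{
	assert(len > 0 && alloc_unit > 0);
	return off / alloc_unit == (off + len - 1) / alloc_unit;
}
```

With a 64 KiB allocation unit and unit_max capped at 64 KiB, every offset and length accepted by atomic_write_ok() also satisfies within_one_alloc_unit(). The model says nothing about allocation units that are not a power of two.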

> 
>>> buffered writes,
>>> reflinks, etc will trip over this, so application developers, users
>>> and admins really need to know what they should be doing to avoid
>>> stepping on this landmine...
>>
>> If this is not a real-life scenario which we expect to see, then I don't see
>> why we would add the complexity to the kernel for this.
> 
> I gave you one above - restoring a data set as a result of disaster
> recovery.

ack

> 
>> My motivation for atomic writes support is to support atomically writing
>> large database internal page size. If the database only writes at a fixed
>> internal page size, then we should not see mixed mappings.
> 
> Yup, that's the problem here. Once atomic writes are supported by
> the kernel and userspace, all sorts of applications are going to
> start using them in all sorts of ways you didn't think of.
> 
>> But you see potential problems elsewhere ..
> 
> That's my job as a senior engineer with 20+ years of experience in
> filesystems and storage related applications. I see far because I
> stand on the shoulders of giants - I don't try to be a giant myself.
> 
> Other people become giants by implementing ground-breaking features
> (e.g. like atomic writes), but without the people who can see far
> enough ahead just adding features ends up with an incoherent mess of
> special interest niche features rather than a neatly integrated set
> of widely usable generic features.

yes

> 
> e.g. look at MySQL's use of fallocate(hole punch) for transparent
> data compression - nobody had foreseen that hole punching would be
> used like this, but it's a massive win for the applications which
> store bulk compressible data in the database even though it does bad
> things to the filesystem.
> 
> Spend some time looking outside the proprietary database application
> box and think a little harder about the implications of atomic write
> functionality.  i.e. what happens when we have ubiquitous support
> for guaranteeing only the old or the new data will be seen after
> a crash *without the need for using fsync*.
> 
> Think about the implications of that for a minute - for any full
> file overwrite up to the hardware atomic limits, we won't need fsync
> to guarantee the integrity of overwritten data anymore. We only need
> a mechanism to flush the journal and device caches once all the data
> has been written (e.g. syncfs)...
> 
> Want to overwrite a bunch of small files safely?  Atomic write the
> new data, then syncfs(). There's no need to run fdatasync after each
> write to ensure individual files are not corrupted if we crash in
> the middle of the operation. Indeed, atomic writes actually provide
> better overwrite integrity semantics than fdatasync as it will be
> all or nothing. fdatasync does not provide that guarantee if we
> crash during the fdatasync operation.
> 
> Further, with COW data filesystems like XFS, btrfs and bcachefs, we
> can emulate atomic writes for any size larger than what the hardware
> supports.
> 
> At this point we actually provide app developers with what they've
> been repeatedly asking kernel filesystem engineers to provide them
> for the past 20 years: a way of overwriting arbitrary file data
> safely without needing an expensive fdatasync operation on every
> file that gets modified.

Understood, you see that there are many applications of atomic writes 
beyond the scope of DBs.
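
As a rough userspace sketch of the "atomic write the new data, then syncfs()" pattern described above: the helper below issues one pwritev2() per file and then a single syncfs(), instead of an fdatasync() per file. The function name is hypothetical, and it assumes all files live on the same filesystem. Passing RWF_ATOMIC as `rwf_flags` additionally assumes a kernel and filesystem that support atomic writes, files opened with O_DIRECT, and a buffer, length and offset that meet the atomic write unit limits reported by statx(); none of that is checked here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>	/* O_DIRECT etc., for callers opening the files */
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fallback for older libc headers; value as in the Linux UAPI headers. */
#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040
#endif

/*
 * Overwrite the first `len` bytes of every file in `fds` with `buf`,
 * using one pwritev2() call per file, then flush with a single syncfs().
 * Assumes all fds refer to files on the same filesystem.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int overwrite_then_syncfs(const int *fds, size_t nfds,
				 const void *buf, size_t len, int rwf_flags)
{
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

	assert(nfds > 0 && len > 0);
	for (size_t i = 0; i < nfds; i++) {
		ssize_t ret = pwritev2(fds[i], &iov, 1, 0, rwf_flags);

		if (ret < 0)
			return -1;
		if ((size_t)ret != len) {
			/* A short write means the overwrite did not complete. */
			errno = EIO;
			return -1;
		}
	}
	/* One flush of the filesystem instead of one fdatasync() per file. */
	return syncfs(fds[0]);
}
```

A caller would pass `rwf_flags = RWF_ATOMIC` to request the all-or-nothing behaviour per file; with `rwf_flags = 0` the helper degrades to plain overwrites followed by syncfs(), which carries no torn-write protection.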

> 
> Put simply: atomic writes have a huge potential to fundamentally
> change the way applications interact with Linux filesystems and to
> make it *much* simpler for applications to safely overwrite user
> data.  Hence there is an imperative here to make the foundational
> support for this technology solid and robust because atomic writes
> are going to be with us for the next few decades...
> 

Thanks for going further in describing the possible use cases.

Now let's talk again about the implementation of kernel extent zeroing 
for atomic writes.

Firstly, to state the obvious: so far we can atomically write a single 
FS block automatically, and no FS_XFLAG_ATOMICWRITES flag was 
introduced to enable this. Furthermore, the range being written does 
not need to be in the mapped state.

Then we need to consider how to decide to do the extent zeroing for the 
following scenarios for atomic writes:
a. For forcealign, we can decide to always zero the full alloc unit, 
same as [0]. So that is good for atomic writes. But that does involve 
further work to zero extents for buffered IO.
b. For rtvol without forcealign, we will not start to always zero the 
full alloc unit as that would be a major regression in performance.
c. For rtvol with forcealign, we can do the same as a.

I can suggest 2x options for solving b:
1. Introduce FS_XFLAG_LARGE_ATOMICWRITES to control whether we do the 
same as a.
2. Introduce a method to pre-zero extents for atomic writes only

Option 2. would work like this:
- when we find that the atomic write covers a mix of unwritten and 
  mapped extent mappings, we have 2x phases of the atomic write:
  - phase 1. will pre-zero the unwritten extents and update the extent 
    mappings
  - phase 2. will retry the atomic write, and we should find a single 
    mapping

Option 2. could also be leveraged for a. and c., above.
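
The two phases of option 2 can be sketched with a toy userspace model. This is not kernel or iomap code: one byte stands in for each FS block, and every type and function name below is hypothetical. The model checks whether the target range has a single mapping state; if it is mixed, phase 1 zeroes the unwritten blocks and converts them to written, and phase 2 retries the write over what is now a single mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-block mapping state in the toy model. */
enum blk_state { BLK_UNWRITTEN, BLK_WRITTEN };

/* True if all blocks in [start, start + count) share one state. */
static bool single_mapping(const enum blk_state *map, size_t start,
			   size_t count)
{
	for (size_t i = 1; i < count; i++)
		if (map[start + i] != map[start])
			return false;
	return true;
}

/*
 * Phase 1: zero every unwritten block in the range and mark it written.
 * Returns the number of blocks converted.
 */
static size_t prezero_unwritten(enum blk_state *map, unsigned char *data,
				size_t start, size_t count)
{
	size_t converted = 0;

	for (size_t i = start; i < start + count; i++) {
		if (map[i] == BLK_UNWRITTEN) {
			data[i] = 0;
			map[i] = BLK_WRITTEN;
			converted++;
		}
	}
	return converted;
}

/*
 * Model of the atomic write path: write `val` to every block in the
 * range. Returns the number of attempts needed: 1 if the range already
 * had a single mapping, 2 if pre-zeroing was required first.
 */
static int atomic_write_model(enum blk_state *map, unsigned char *data,
			      size_t start, size_t count, unsigned char val)
{
	int attempts = 1;

	assert(count > 0);
	if (!single_mapping(map, start, count)) {
		prezero_unwritten(map, data, start, count);
		attempts = 2;	/* phase 2: retry over a single mapping */
		assert(single_mapping(map, start, count));
	}
	for (size_t i = start; i < start + count; i++) {
		data[i] = val;
		map[i] = BLK_WRITTEN;
	}
	return attempts;
}
```

A fully unwritten range takes a single attempt in this model, which matches the point above that the range does not need to be in mapped state. Only a mixed range takes the pre-zero step.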

Please let me know your thoughts on this.

[0] 
https://lore.kernel.org/linux-xfs/20240607143919.2622319-3-john.g.garry@oracle.com/

John