Message-ID: <ef979627-52dc-4a15-896b-c848ab703cd6@oracle.com>
Date: Mon, 13 Jan 2025 21:35:01 +0000
From: John Garry <john.g.garry@...cle.com>
To: Dave Chinner <david@...morbit.com>
Cc: brauner@...nel.org, djwong@...nel.org, cem@...nel.org, dchinner@...hat.com,
        hch@....de, ritesh.list@...il.com, linux-xfs@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        martin.petersen@...cle.com
Subject: Re: [PATCH 1/4] iomap: Lift blocksize restriction on atomic writes

On 05/12/2024 21:15, Dave Chinner wrote:
> On Thu, Dec 05, 2024 at 10:52:50AM +0000, John Garry wrote:
>> On 04/12/2024 20:35, Dave Chinner wrote:
>>> On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote:
>>>> From: "Ritesh Harjani (IBM)" <ritesh.list@...il.com>
>>>>
>>>> Filesystems like ext4 can submit writes in multiples of the blocksize,
>>>> but we still can't allow a write to be split into multiple BIOs. Hence
>>>> let's check whether iomap_length() is the same as iter->len.
>>>>
>>>> It is the responsibility of userspace to ensure that a write does not span
>>>> mixed unwritten and mapped extents (which would lead to multiple BIOs).
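
To make that concrete, the check described above amounts to roughly the
following in the iomap direct I/O path (a sketch of the idea, not the
literal patch):

	/*
	 * An atomic write must be issued as a single bio. If the current
	 * mapping does not cover the whole I/O, fail the write rather
	 * than splitting it across multiple bios.
	 */
	if ((iter->flags & IOMAP_ATOMIC) && iomap_length(iter) != iter->len)
		return -EINVAL;
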
>>>
>>> How is "userspace" supposed to do this?
>>
>> If an atomic write spans mixed unwritten and mapped extents, then
>> userspace should manually zero the unwritten extents beforehand.
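
For example, something like the hypothetical helper below. Note that
fallocate(FALLOC_FL_ZERO_RANGE) would not be sufficient here, as
filesystems typically implement it by leaving the range as unwritten
extents; real zeros have to be written:

	#include <sys/types.h>
	#include <unistd.h>

	/*
	 * Force a file range to be backed by written (mapped) extents
	 * by writing real zeros. Error handling abbreviated.
	 */
	static int zero_range(int fd, off_t off, off_t len)
	{
		static const char buf[65536];	/* zero-filled */

		while (len > 0) {
			size_t n = len < (off_t)sizeof(buf) ?
					(size_t)len : sizeof(buf);
			ssize_t ret = pwrite(fd, buf, n, off);

			if (ret <= 0)
				return -1;
			off += ret;
			len -= ret;
		}
		return fdatasync(fd);
	}
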
>>
>>>
>>> No existing utility in userspace is aware of atomic write limits or
>>> rtextsize configs, so how does "userspace" ensure everything is
>>> laid out in a manner compatible with atomic writes?
>>>
>>> e.g. restoring a backup (or other disaster recovery procedures) is
>>> going to have to lay the files out correctly for atomic writes.
>>> Backup tools often sparsify the data set, and so what gets restored
>>> will not have the same layout as the original data set...
>>
>> I am happy to support whatever is needed to make atomic writes work over
>> mixed extents if that is really an expected use case and it is a pain for an
>> application writer/admin to deal with this (by manually zeroing extents).
>>
>> JFYI, I did originally support the extent pre-zeroing for this. That was to
>> support a real-life scenario which we saw where we were attempting atomic
>> writes over mixed extents. The mixed extents were coming from userspace
>> punching holes and then attempting an atomic write over that space. However
>> that was using an early experimental and buggy forcealign; it was buggy as
>> it did not handle punching holes properly - it punched out single blocks
>> rather than only full alloc units.
>>
>>>
>>> Where's the documentation that outlines all the restrictions on
>>> userspace behaviour to prevent this sort of problem being triggered?
>>
>> I would provide a man page update.
> 
> I think, at this point, we need a better way of documenting all the
> atomic write stuff in one place. Not just the user interface and
> what is expected of userspace, but also all the things the
> filesystems need to do to ensure atomic writes work correctly. I was
> thinking that a document somewhere in the Documentation/ directory,
> rather than random pieces of information splattered across random man
> pages, would be a much better way of explaining all this.
> 
> Don't get me wrong - man pages explaining the programmatic API are
> necessary, but there's a whole lot more to understanding and making
> effective use of atomic writes than what has been added to the man
> pages so far.
> 
>>> Common operations such as truncate, hole punch,
>>
>> So how would punch hole be a problem? The atomic write unit max is limited
>> by the alloc unit, and we can only punch out full alloc units.
> 
> I was under the impression that this was a feature of the
> force-align code, not a feature of atomic writes. i.e. force-align
> is what ensures the BMBT aligns correctly with the underlying
> extents.
> 
> Or did I miss the fact that some of the force-align semantics bleed
> back into the original atomic write patch set?
> 
>>> buffered writes,
>>> reflinks, etc will trip over this, so application developers, users
>>> and admins really need to know what they should be doing to avoid
>>> stepping on this landmine...
>>
>> If this is not a real-life scenario which we expect to see, then I don't see
>> why we would add this complexity to the kernel.
> 
> I gave you one above - restoring a data set as a result of disaster
> recovery.
> 
>> My motivation for atomic write support is to allow databases to atomically
>> write their large internal pages. If the database only writes at a fixed
>> internal page size, then we should not see mixed mappings.
> 
> Yup, that's the problem here. Once atomic writes are supported by
> the kernel and userspace, all sorts of applications are going to
> start using them in all sorts of ways you didn't think of.
> 
>> But you see potential problems elsewhere ..
> 
> That's my job as a senior engineer with 20+ years of experience in
> filesystems and storage related applications. I see far because I
> stand on the shoulders of giants - I don't try to be a giant myself.
> 
> Other people become giants by implementing ground-breaking features
> (e.g. like atomic writes), but without the people who can see far
> enough ahead, just adding features ends up with an incoherent mess of
> special interest niche features rather than a neatly integrated set
> of widely usable generic features.
> 
> e.g. look at MySQL's use of fallocate(hole punch) for transparent
> data compression - nobody had foreseen that hole punching would be
> used like this, but it's a massive win for the applications which
> store bulk compressible data in the database even though it does bad
> things to the filesystem.
> 
> Spend some time looking outside the proprietary database application
> box and think a little harder about the implications of atomic write
> functionality.  i.e. what happens when we have ubiquitous support
> for guaranteeing only the old or the new data will be seen after
> a crash *without the need for using fsync*.
> 
> Think about the implications of that for a minute - for any full
> file overwrite up to the hardware atomic limits, we won't need fsync
> to guarantee the integrity of overwritten data anymore. We only need
> a mechanism to flush the journal and device caches once all the data
> has been written (e.g. syncfs)...
> 
> Want to overwrite a bunch of small files safely?  Atomic write the
> new data, then syncfs(). There's no need to run fdatasync after each
> write to ensure individual files are not corrupted if we crash in
> the middle of the operation. Indeed, atomic writes actually provide
> better overwrite integrity semantics than fdatasync, as it will be
> all or nothing. fdatasync does not provide that guarantee if we
> crash during the fdatasync operation.
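
As a concrete sketch of that pattern (assuming pwritev2() with
RWF_ATOMIC as in recent kernels, an O_DIRECT file descriptor, and a
write size within the advertised atomic write limits):

	#define _GNU_SOURCE
	#include <sys/uio.h>
	#include <unistd.h>

	/*
	 * All-or-nothing overwrite of one file at offset 0. RWF_ATOMIC
	 * comes from the kernel uapi headers on kernels that support it.
	 */
	static int overwrite_atomic(int fd, const void *buf, size_t len)
	{
		struct iovec iov = {
			.iov_base = (void *)buf,
			.iov_len = len,
		};

		return pwritev2(fd, &iov, 1, 0, RWF_ATOMIC);
	}

	/*
	 * After overwriting all the files on the filesystem, flush the
	 * journal and device caches once with syncfs() instead of
	 * calling fdatasync() per file.
	 */
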
> 
> Further, with COW data filesystems like XFS, btrfs and bcachefs, we
> can emulate atomic writes for any size larger than what the hardware
> supports.
> 
> At this point we actually provide app developers with what they've
> been repeatedly asking kernel filesystem engineers to provide them
> for the past 20 years: a way of overwriting arbitrary file data
> safely without needing an expensive fdatasync operation on every
> file that gets modified.
> 
> Put simply: atomic writes have a huge potential to fundamentally
> change the way applications interact with Linux filesystems and to
> make it *much* simpler for applications to safely overwrite user
> data.  Hence there is an imperative here to make the foundational
> support for this technology solid and robust because atomic writes
> are going to be with us for the next few decades...
> 

Dave,

I provided a proposal to solve this issue in 
https://lore.kernel.org/lkml/20241210125737.786928-3-john.g.garry@oracle.com/ 
(there is also a v3, which is much the same), but I can't make progress, 
as there is no agreement on how this should be implemented, if at all. 
Any input there would be appreciated...

Cheers

