Message-ID: <7df5f683-692c-42c7-a50a-9cafe672212f@nh2.me>
Date: Sat, 30 Nov 2024 16:36:00 +0100
From: Niklas Hambüchen <mail@....me>
To: Theodore Ts'o <tytso@....edu>
Cc: Rui Ueyama <rui314@...il.com>, LKML <linux-kernel@...r.kernel.org>,
Florian Weimer <fw@...eb.enyo.de>
Subject: Re: Wishlist for Linux from the mold linker's POV
Hi Ted,
On 2024-11-29 19:12, Theodore Ts'o wrote:
> It's not actually an fsync() in the close case. We initiate
> writeback, but we don't actually wait for the writes to complete on
> the close(). [..] But in the case where the
> application programmer is too lazy to call fsync(2), the delayed
> completion of the transaction commit is the implicit commit, and
> nothing is blocked behind it. (See below for more details.)
Then I actually have a question for you, as it seems I do have a situation where the close-without-rename blocks the userspace application's `close(2)` on ext4.
I have a program which, when writing files, uses
openat(..., O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC)
In `strace -T`, writing 1 GiB to a file in an empty directory, it shows
close(3<output.bin>) = 0 <0.000005>
but in a directory where `output.bin` already exists, it takes 2.5 seconds:
close(3<output.bin>) = 0 <2.527808>
Is that expected?
Repro:
time python -c 'with open("output.bin", "wb") as f: f.write(b"a" * (1024 * 1024 * 1024))'
The first run is fast, subsequent runs are slow; `rm output.bin` makes it fast again.
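The same measurement in C, for anyone who wants Python out of the picture. This is a minimal sketch mirroring the one-liner above (same file name, same 1 GiB size; error handling kept short), timing only the close(2) call where the stall shows up:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 1024UL * 1024 * 1024;  /* 1 GiB, as in the repro */
        char *buf = malloc(len);
        if (!buf) { perror("malloc"); return 1; }
        memset(buf, 'a', len);

        int fd = open("output.bin",
                      O_WRONLY | O_CREAT | O_TRUNC | O_CLOEXEC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, buf, len) != (ssize_t)len) { perror("write"); return 1; }

        /* time only the close(2); that is where the 2.5 s appears */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (close(fd) != 0) { perror("close"); return 1; }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("close() took %.6f s\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
    }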
Environment: Linux 6.6.33 x86_64, ext4 mounted with `rw,relatime,errors=remount-ro`
> But yes, the reason behind this is applications such as tuxracer
Haha, glorious. I didn't know that.
"But boss, the new kernel reduces global server throughput by 10x..." -- "Whatever the cost, my tuxracer high score MUST NOT BE LOST."
> In essence, file system developers are massively outnumbered by
> application programs, and for some reason as a class application
> programmers don't seem to be very careful about data corruption
> compared to file system developers --- and users *always* blame the
> file system developers.
Personally (as an application programmer) I would probably prefer the old behaviour: it is now difficult for a correct application to opt out of the performance penalty, and understanding your own performance and benchmarking becomes ever more complex.
Writing fast apps that do file processing with intermediate files now requires inspecting which filesystem we're on and what its mount options are, and implementing "quirks"-style workarounds like "rm + rename instead of just rename" (a sketch of such a check follows below).
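As an illustration of what such a quirk check might look like, here is a minimal C sketch using statfs(2). Note it is approximate even for the part it covers: ext2/3/4 all share the same superblock magic, and statfs(2) does not expose mount options at all (one would have to parse /proc/self/mountinfo for those):

    #include <stdio.h>
    #include <sys/vfs.h>      /* statfs(2) */
    #include <linux/magic.h>  /* EXT4_SUPER_MAGIC */

    /* Return 1 if path lives on an ext-family filesystem.
       The magic 0xEF53 is shared by ext2/ext3/ext4, so this cannot
       tell them apart, let alone see delalloc/auto_da_alloc options. */
    static int needs_extfs_quirk(const char *path)
    {
        struct statfs s;
        if (statfs(path, &s) != 0)
            return 0;  /* unknown filesystem: assume no quirk */
        return s.f_type == EXT4_SUPER_MAGIC;
    }

    int main(int argc, char **argv)
    {
        printf("%d\n", needs_extfs_quirk(argc > 1 ? argv[1] : "."));
        return 0;
    }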
But I equally relate to the frustration of users that lost files, and I can understand why you added this.
One can also blame the POSIX API for this to some extent, as it doesn't make it easy for the application programmer to do the right thing.
AppDev: How do I write a file?
Posix: write() + close().
AppDev: Really?
Posix: Actually, no. You also need to fsync() if you care about the data.
AppDev: OK I added it, good?
Posix: Actually, no. You also need to fsync() the parent dir if you care about the data and the file is new.
AppDev: How many more surprise steps will there be?
Fsyncgate: Hi
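For reference, the full ceremony the dialogue ends up describing, as a minimal C sketch: write to a temp file, fsync it, rename over the target, then fsync the parent directory so the rename itself is durable. Names and paths are illustrative, and tmp and final must live inside dir for the directory fsync to cover the rename:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int write_file_durably(const char *dir, const char *tmp,
                           const char *final,
                           const void *data, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC | O_CLOEXEC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        if (close(fd) != 0)
            return -1;

        if (rename(tmp, final) != 0)  /* atomically replace the target */
            return -1;

        /* fsync the parent directory so the rename reaches disk too */
        int dfd = open(dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
        if (dfd < 0)
            return -1;
        int r = fsync(dfd);
        close(dfd);
        return r;
    }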
I'm wondering if there's a way out of this, to make the trade-offs less global and provide an opt-out.
(As an application programmer I can't ask my users to enable `noauto_da_alloc`, because who knows what other applications they run.)
Maybe an `open(O_I_READ_THE_DOCS)` + fcntl flag that disables the heuristics aimed at incorrect applications?
It should probably have a more technical name.
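Purely as a sketch of the proposed interface, and nothing more; every identifier below is hypothetical, and nothing like it exists in any kernel today:

    /* HYPOTHETICAL -- neither this open flag nor this fcntl command
       exists anywhere; this only shows what the opt-out could look
       like at the call site. */
    int fd = open("output.bin",
                  O_WRONLY | O_CREAT | O_TRUNC | O_I_READ_THE_DOCS, 0644);
    fcntl(fd, F_SET_NO_IMPLICIT_FLUSH, 1);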
I realise this is fighting complexity with somewhat more complexity, but maybe buffering-by-default-and-fsync-for-durability was the wrong default all along, and close-is-durable-by-default-and-there-is-an-opt-out would be the better model; not sure.
Niklas