Message-ID: <7df5f683-692c-42c7-a50a-9cafe672212f@nh2.me>
Date: Sat, 30 Nov 2024 16:36:00 +0100
From: Niklas Hambüchen <mail@....me>
To: Theodore Ts'o <tytso@....edu>
Cc: Rui Ueyama <rui314@...il.com>, LKML <linux-kernel@...r.kernel.org>,
 Florian Weimer <fw@...eb.enyo.de>
Subject: Re: Wishlist for Linux from the mold linker's POV

Hi Ted,

On 2024-11-29 19:12, Theodore Ts'o wrote:
> It's not actually an fsync() in the close case.  We initiate
> writeback, but we don't actually wait for the writes to complete on
> the close().  [..]  But in the case where the
> application programmer is too lazy to call fsync(2), the delayed
> completion of the transaction is the implicit commit, and
> nothing is blocked behind it.  (See below for more details.)

Then I actually have a question for you, as it seems I do have a situation where close-without-rename blocks the userspace application's `close(2)` on ext4.

I have a program which, when writing files, uses

    openat(..., O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC)

Under `strace -T`, writing 1 GiB to a file in an empty directory shows

    close(3<output.bin>) = 0 <0.000005>

but in a directory where `output.bin` already exists, it takes 2.5 seconds:

    close(3<output.bin>) = 0 <2.527808>

Is that expected?

Repro:

    time python -c 'with open("output.bin", "wb") as f: f.write(b"a" * (1024 * 1024 * 1024))'

The first run is fast, subsequent runs are slow; `rm output.bin` makes it fast again.
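
For a more direct measurement (and to rule out Python's buffered I/O as a factor), here is a raw-`os` variant of the repro that times the `close()` alone; a sketch, with short writes ignored for brevity:

    import os, time

    PATH = "output.bin"                       # same file as the repro above
    DATA = b"a" * (1024 * 1024 * 1024)        # 1 GiB

    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_CLOEXEC, 0o644)
    os.write(fd, DATA)                        # ignoring short writes for brevity

    t0 = time.monotonic()
    os.close(fd)                              # fast on the first run, seconds on reruns
    print(f"close() took {time.monotonic() - t0:.3f}s")

Running it twice in the same directory should reproduce the fast-then-slow asymmetry from the strace output above.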

Environment: Linux 6.6.33 x86_64, mounted with `ext4 (ro,relatime,errors=remount-ro)`

> But yes, the reason behind this is applications such as tuxracer

Ahah, glorious, I didn't know that.
"But boss, the new kernel reduces global server throughput by 10x..." -- "Whatever the cost, my tuxracer high score MUST NOT BE LOST."

> In essence, file system developers are massively outnumbered by
> application programmers, and for some reason as a class application
> programmers don't seem to be very careful about data corruption
> compared to file system developers --- and users *always* blame the
> file system developers.

Personally (as an application programmer) I would probably prefer the old behaviour, because a correct application now has no easy way to opt out of the performance penalty, and understanding your own performance and benchmarking becomes ever more complex.
Writing fast apps that do file processing with intermediate files now requires inspecting which FS we're on and what its mount options are, and implementing "quirks"-style workarounds like "rm + rename instead of just rename" (see the sketch below).
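
Concretely, such a quirk might look like the sketch below. `replace_file` is a hypothetical helper, and whether the extra unlink actually avoids the flush depends on the filesystem's heuristics, so treat it as an illustration of the workaround rather than a guarantee:

    import os

    def replace_file(tmp_path, dst_path):
        # "rm + rename": unlink the destination first so the rename no
        # longer replaces an existing file; the idea is that the
        # replace-via-rename heuristic then has nothing to trigger on.
        try:
            os.unlink(dst_path)
        except FileNotFoundError:
            pass
        os.rename(tmp_path, dst_path)

The catch is a window in which neither the old nor the new file exists, which is exactly the crash-safety hole these heuristics were added to paper over.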

But I equally relate to the frustration of users that lost files, and I can understand why you added this.

One can also blame the POSIX API for this to some extent, as it doesn't make it easy for the application programmer to do the right thing.

AppDev:    How do I write a file?
Posix:     write() + close().
AppDev:    Really?
Posix:     Actually, no. You also need to fsync() if you care about the data.
AppDev:    OK I added it, good?
Posix:     Actually, no. You also need to fsync() the parent dir if you care about the data and the file is new.
AppDev:    How many more surprise steps will there be?
Fsyncgate: Hi
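
Written out, the sequence the dialogue arrives at looks roughly like this (a sketch with minimal error handling; `write_file_durably` is a made-up name):

    import os

    def write_file_durably(path, data):
        tmp = path + ".tmp"
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, data)     # ignoring short writes for brevity
            os.fsync(fd)           # step 1: flush the file's own data
        finally:
            os.close(fd)
        os.rename(tmp, path)       # atomically put the new version in place
        # step 2: fsync the parent directory so the new directory entry
        # (the rename itself) survives a crash
        dfd = os.open(os.path.dirname(path) or ".", os.O_DIRECTORY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)

A temp-file convention and two fsyncs for what write() + close() looks like it promises, which is rather the point.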

I'm wondering if there's a way out of this, to make the trade-offs less global and provide an opt-out.
(As an application programmer I can't ask my users to enable `noauto_da_alloc`, because who knows what other applications they run.)
Maybe an `open(O_I_READ_THE_DOCS)` flag plus an fcntl equivalent, which disables the wrong-application heuristics?
It should probably have a more technical name.

I realise this is fighting complexity with somewhat more complexity, but maybe buffering-by-default-and-fsync-for-durability was the wrong default all along, and close-is-durable-by-default-and-there-is-an-opt-out would be the better model; not sure.

Niklas
