linux-kernel - Re: sys_write() racy for multi-threaded append?

Message-ID: <f2b55d220703092243v226483ceve72cf654f61ee39f@mail.gmail.com>
Date:	Fri, 9 Mar 2007 22:43:35 -0800
From:	"Michael K. Edwards" <medwards.linux@...il.com>
To:	"Benjamin LaHaise" <bcrl@...ck.org>
Cc:	"Eric Dumazet" <dada1@...mosbay.com>,
	"Linux Kernel Mailing List" <linux-kernel@...r.kernel.org>
Subject: Re: sys_write() racy for multi-threaded append?

I apologize for throwing around words like "stupid".  Whether or not
the current semantics can be improved, that's not a constructive way
to characterize them.  I'm sorry.

As three people have ably pointed out :-), the particular case of a
pipe/FIFO isn't seekable and doesn't need the f_pos member anyway
(it's effectively always O_APPEND).  That's what I get for checking
against standards documents at 3AM.  Of course, this has nothing to do
with the point that led me to comment on pipes/FIFOs (which was that
there exist file types that never return 0<ret<nbytes).  And it was in
the context of a very explicit aside that f_pos is not _interesting_
on a pipe/FIFO, except as an indicator of total bytes written.  You
could only peek at this with an (admittedly non-portable) llseek(fd,
0, SEEK_CUR) anyway -- which you would only do for diagnostic
purposes.  But diagnosis of odd corner cases (rarely in my code,
usually in other people's) is what I do day in and day out, so for me
it would be worth having.

In any case, you're all right that the standard doesn't require you to
do anything useful with f_pos on a pipe/FIFO.  But you're permitted to
make it useful if you want to:

<1003.1 lseek()>
The behavior of lseek() on devices which are incapable of seeking is
implementation-defined. The value of the file offset associated with
such a device is undefined.
</1003.1>
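
For anyone who wants to see what that diagnostic peek returns today, here is a
minimal sketch. It relies on the ESPIPE error that the POSIX lseek() page lists
for a pipe, FIFO, or socket, which is also what Linux returns as far as I know.
The helper name offset_query_errno is hypothetical, not a kernel or libc
function.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel for the current file offset of fd.
 * Returns 0 if an offset is reported, otherwise the errno set by lseek(). */
int offset_query_errno(int fd)
{
    assert(fd >= 0);
    errno = 0;
    off_t pos = lseek(fd, 0, SEEK_CUR);
    return (pos == (off_t)-1) ? errno : 0;
}
```

Calling it on either end of a pipe() pair should give ESPIPE both before and
after data has been written, so the byte count is not observable this way
without a kernel change.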

Tracking f_pos accurately when writes from multiple threads hit the
same fd (pipe or not) isn't portable, but I recall situations where it
would have been useful.  And if f_pos has to be kept at all in the
uncontended case, it costs you little or nothing to do it in a
thread-safe manner -- as long as you don't overconstrain the semantics
such that you forbid the transient overshoot associated with a short
write.  In fact, unless there's something I've missed, increasing
f_pos before entering vfs_write() happens to be _faster_ than the
current code for common load patterns, both single- and multi-threaded
(although getting the full benefit in the multi-threaded case will
take some fiddling with f_count placement).
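
To make the "advance the position first, settle up afterwards" idea concrete,
here is a userspace analogue, not the read_write.c change itself. It assumes a
shared 64-bit counter standing in for f_pos and uses pwrite() so the fd's own
offset is never touched. The names shared_pos, open_shared and
reserve_then_write are all hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical userspace stand-in for a shared file position (f_pos). */
struct shared_pos {
    _Atomic int64_t pos;
};

/* Hypothetical helper: open (and truncate) a regular file for the demo. */
int open_shared(const char *path)
{
    return open(path, O_CREAT | O_TRUNC | O_RDWR, 0600);
}

/* Claim [start, start + n) before writing, write there, then hand back
 * whatever part of the claim was not actually written.
 * Between the fetch_add and the fetch_sub other threads see a position
 * that overshoots by the unwritten tail: the transient overshoot.
 * Returns whatever pwrite() returned. */
ssize_t reserve_then_write(struct shared_pos *sp, int fd,
                           const void *buf, size_t n)
{
    assert(sp != NULL);
    int64_t start = atomic_fetch_add(&sp->pos, (int64_t)n);
    ssize_t ret = pwrite(fd, buf, n, (off_t)start);
    size_t done = ret > 0 ? (size_t)ret : 0;
    if (done < n)
        atomic_fetch_sub(&sp->pos, (int64_t)(n - done));
    return ret;
}
```

Two threads calling reserve_then_write() on the same shared_pos never get
overlapping regions, because each claim is a single atomic add. The price is
that a short write can leave a hole if another thread claimed space in the
meantime, which is the relaxed semantics described above.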

I say it costs "little or nothing" only because altering an loff_t
atomically is not free.  But even on 32-bit x86, which cannot do a 64-bit
atomic add on memory without a cmpxchg8b loop, an uncontended spinlock
on a cacheline already in L1 is so cheap that making the f_pos changes
atomic will (I think) be lost in the noise.

In any case, rewriting read_write.c is proving interesting.  I'll let
you all know if anything comes of it.  In the meantime, thanks for
your (really quite friendly under the circumstances) comments.

Cheers,
- Michael