linux-kernel - Re: wakeup_pipe_readers/writers() && pipe

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAHk-=wh-SxjH7uvADd5XJBuM2ReyPcLPyXKvBbwbiS5kod+3hA@mail.gmail.com>
Date: Mon, 6 Jan 2025 12:23:58 -0800
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Oleg Nesterov <oleg@...hat.com>
Cc: Manfred Spraul <manfred@...orfullife.com>, Christian Brauner <brauner@...nel.org>, 
	David Howells <dhowells@...hat.com>, WangYuli <wangyuli@...ontech.com>, 
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: wakeup_pipe_readers/writers() && pipe_poll()

On Mon, 6 Jan 2025 at 11:34, Oleg Nesterov <oleg@...hat.com> wrote:
>
> 1. pipe_read() says
>
>          * But when we do wake up writers, we do so using a sync wakeup
>          * (WF_SYNC), because we want them to get going and generate more
>          * data for us.
>
> OK, WF_SYNC makes sense if pipe_read() or pipe_write() is going to do wait_event()
> after wake_up(). But wake_up_interruptible_sync_poll() looks at bit misleading if
> we are going to wakeup the writer or next_reader before return.

This heuristic has always been a bit iffy. And honestly, I think it's
been driven by benchmarks that aren't necessarily always realistic (ie
for ping-pong benchmarks, the best behavior is often to stay on the
same CPU and just schedule between the reader/writer).

Those are exactly the kinds that will *not* do wait_event(), because
they just read or wrote a single byte, but doing a "sync" wakeup then
gets you the optimal pattern of "run the other end on the same CPU".

And do those cases exist outside of benchmarks? Probably not.

Ergo: I don't think there's a "right solution".

> 2. I can't understand this code in pipe_write()
>
>         if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
>                 int err = file_update_time(filp);
>                 if (err)
>                         ret = err;
>                 sb_end_write(file_inode(filp)->i_sb);
>         }
>
>         - it only makes sense in the "fifo" case, right? When
>           i_sb->s_magic != PIPEFS_MAGIC...

I think we've done it for regular pipes too. You can see it with
'fstat()', after all.

And we've done it forever. It goes way back to even before the BK
days, although back when it was done differently with an open-coded

        inode->i_ctime = inode->i_mtime = CURRENT_TIME;
        mark_inode_dirty(inode);

instead.

I actually went back and looked at some historical trees. The inode
time updates were originally added in linux-2.0.1 (July 1996) and
didn't actually have the "mark_inode_dirty()" part, so they didn't
cause the inode to be written back to disk, but people would see the
time updates while they were happening.

The "write it back to disk" part started happening in 2.1.37pre1 (so
May 1997), although back then it was just a simple

        inode->i_dirt = 1;

instead.

My point being that we've been doing it for a *loong* time. And yes,
we've always done it for regular pipes too, not just on-disk FIFOs.

>         - why should we propogate the error code if "ret > 0" but
>           file_update_time() fails?

Normally file_update_time() wouldn't fail.

This was changed in c3b2da314834 ("fs: introduce inode operation
->update_time") and honestly, I think it was just a global
search-and-replace kind of change.

And I think the answer is: nobody cares, because it's FIFO's that
nobody uses in the first place, with an operation that never fails
anyway.

Could we remove the error propagation? Almost certainly. Does it
matter? Almost certainly not.

Notice how the read side also does the time update, except it's
obviously just atime, and it's done with

        if (ret > 0)
                file_accessed(filp);

instead (which honors O_NOATIME and then does touch_atime, and which
does the inode_update_time() *without* any error checking).

IOW, that is different, but it all boils down to the same thing:
nobody really cares in practice.

So to both of your issues, I think you can make changes as long as you
have good reasoning for them and document them in the commit (both
code and commit message).

           Linus