lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wh-SxjH7uvADd5XJBuM2ReyPcLPyXKvBbwbiS5kod+3hA@mail.gmail.com>
Date: Mon, 6 Jan 2025 12:23:58 -0800
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Oleg Nesterov <oleg@...hat.com>
Cc: Manfred Spraul <manfred@...orfullife.com>, Christian Brauner <brauner@...nel.org>, 
	David Howells <dhowells@...hat.com>, WangYuli <wangyuli@...ontech.com>, 
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: wakeup_pipe_readers/writers() && pipe_poll()

On Mon, 6 Jan 2025 at 11:34, Oleg Nesterov <oleg@...hat.com> wrote:
>
> 1. pipe_read() says
>
>          * But when we do wake up writers, we do so using a sync wakeup
>          * (WF_SYNC), because we want them to get going and generate more
>          * data for us.
>
> OK, WF_SYNC makes sense if pipe_read() or pipe_write() is going to do wait_event()
> after wake_up(). But wake_up_interruptible_sync_poll() looks at bit misleading if
> we are going to wakeup the writer or next_reader before return.

This heuristic has always been a bit iffy. And honestly, I think it's
been driven by benchmarks that aren't necessarily always realistic (ie
for ping-pong benchmarks, the best behavior is often to stay on the
same CPU and just schedule between the reader/writer).

Those are exactly the kinds that will *not* do wait_event(), because
they just read or wrote a single byte, but doing a "sync" wakeup then
gets you the optimal pattern of "run the other end on the same CPU".

And do those cases exist outside of benchmarks? Probably not.

Ergo: I don't think there's a "right solution".

> 2. I can't understand this code in pipe_write()
>
>         if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
>                 int err = file_update_time(filp);
>                 if (err)
>                         ret = err;
>                 sb_end_write(file_inode(filp)->i_sb);
>         }
>
>         - it only makes sense in the "fifo" case, right? When
>           i_sb->s_magic != PIPEFS_MAGIC...

I think we've done it for regular pipes too. You can see it with
'fstat()', after all.

And we've done it forever. It goes way back to even before the BK
days, although back when it was done differently with an open-coded

        inode->i_ctime = inode->i_mtime = CURRENT_TIME;
        mark_inode_dirty(inode);

instead.

I actually went back and looked at some historical trees. The inode
time updates were originally added in linux-2.0.1 (July 1996) and
didn't actually have the "mark_inode_dirty()" part, so they didn't
cause the inode to be written back to disk, but people would see the
time updates while they were happening.

The "write it back to disk" part started happening in 2.1.37pre1 (so
May 1997), although back then it was just a simple

        inode->i_dirt = 1;

instead.

My point being that we've been doing it for a *loong* time. And yes,
we've always done it for regular pipes too, not just on-disk FIFOs.

>         - why should we propogate the error code if "ret > 0" but
>           file_update_time() fails?

Normally file_update_time() wouldn't fail.

This was changed in c3b2da314834 ("fs: introduce inode operation
->update_time") and honestly, I think it was just a global
search-and-replace kind of change.

And I think the answer is: nobody cares, because it's FIFO's that
nobody uses in the first place, with an operation that never fails
anyway.

Could we remove the error propagation? Almost certainly. Does it
matter? Almost certainly not.

Notice how the read side also does the time update, except it's
obviously just atime, and it's done with

        if (ret > 0)
                file_accessed(filp);

instead (which honors O_NOATIME and then does touch_atime, and which
does the inode_update_time() *without* any error checking).

IOW, that is different, but it all boils down to the same thing:
nobody really cares in practice.

So to both of your issues, I think you can make changes as long as you
have good reasoning for them and document them in the commit (both
code and commit message).

           Linus

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ