Message-ID: <CAHk-=wh62OxWsL+msmks7=VdBJHz7HvRYoPDckkAEAwsgrmjew@mail.gmail.com>
Date: Tue, 21 Oct 2025 05:47:01 -1000
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Kiryl Shutsemau <kirill@...temov.name>, David Hildenbrand <david@...hat.com>,
Matthew Wilcox <willy@...radead.org>, Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>, linux-mm@...ck.org,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
Kiryl Shutsemau <kas@...nel.org>, Suren Baghdasaryan <surenb@...gle.com>
Subject: Re: [PATCH] mm/filemap: Implement fast short reads
On Sun, 19 Oct 2025 at 18:53, Andrew Morton <akpm@...ux-foundation.org> wrote:
>
> Is there really no way to copy the dang thing straight out to
> userspace, skip the bouncing?
Sadly, no.
It's trivial to copy to user space in a RCU-protected region: just
disable page faults and it all works fine.
In fact, it works so fine that everything boots and it all looks
beautiful in profiles etc - ask me how I know.
But it's still wrong. The problem is that *after* you've copied things
away from the page cache, you need to check that the page cache
contents are still valid.
And it's not a problem to do that and just say "don't count the bytes
I just copied, and we'll copy over them later".
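The copy-then-validate ordering can be sketched in user space. This is only an analogy, not the patch's code: `struct cached_page`, `optimistic_read()`, the sequence counter, and `BOUNCE_MAX` are all hypothetical names invented for illustration. It shows the "bounce" variant, where the data is copied to a private buffer first and only reaches the caller's buffer after the check passes.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_MAX 64   /* hypothetical cap on a "short" read */

/* Hypothetical stand-in for cached file data guarded by a sequence count. */
struct cached_page {
    atomic_uint seq;            /* odd while the contents are being changed */
    char data[BOUNCE_MAX];
};

/*
 * Copy into a private bounce buffer, re-check the sequence count, and only
 * then publish to the caller's buffer. Returns false (leaving dst untouched)
 * when the contents may have changed, so the caller can take a slow path.
 */
static bool optimistic_read(struct cached_page *pg, char *dst, size_t len)
{
    char bounce[BOUNCE_MAX];
    unsigned int start;

    if (len > BOUNCE_MAX)
        return false;

    start = atomic_load_explicit(&pg->seq, memory_order_acquire);
    if (start & 1)
        return false;               /* update in progress */

    /* Assumption: a real concurrent version needs a race-safe copy here. */
    memcpy(bounce, pg->data, len);

    atomic_thread_fence(memory_order_acquire);
    if (atomic_load_explicit(&pg->seq, memory_order_relaxed) != start)
        return false;               /* contents changed under us */

    memcpy(dst, bounce, len);       /* validated: safe to expose */
    return true;
}
```

Copying straight into `dst` and validating afterwards would save the second `memcpy`, but then a failed check leaves unvalidated bytes in the caller's buffer.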
But while 99.999% of the time we *will* copy over them later, it's not
actually guaranteed. What might happen is that after we've filled in
user space with the optimistically copied data, we figure out that the
page cache is no longer valid, and we go to the slow case, and two
problems may have happened:
(a) the file got truncated in the meantime, and we just filled in
stale data (possibly zeroes) in a user space buffer, and we're
returning a smaller length than what we filled out.
Will user space care? Not realistically, no. But it's wrong, and some
user space *might* be using the buffer as a ring-buffer or something,
and assume that if we return 5 bytes from "read()", the subsequent
bytes are still valid from (previous) ring buffer fills.
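A toy model of case (a) may make the ring-buffer concern concrete. The function names and parameters here are hypothetical and do not correspond to any kernel interface: `direct_short_read()` simply imitates a read that wrote `optimistic` bytes into the caller's buffer and then reported only `valid` of them.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model: copy `optimistic` bytes straight into the caller's
 * buffer, then discover only `valid` bytes (valid <= optimistic) still
 * exist in the file and return that smaller count.
 */
static size_t direct_short_read(char *dst, const char *stale,
                                size_t optimistic, size_t valid)
{
    memcpy(dst, stale, optimistic);
    return valid;
}

/*
 * Count bytes past the returned length that differ from a snapshot taken
 * before the read, i.e. earlier buffer contents that were overwritten.
 */
static size_t clobbered_tail(const char *buf, const char *before,
                             size_t ret, size_t size)
{
    size_t n = 0;

    for (size_t i = ret; i < size; i++)
        if (buf[i] != before[i])
            n++;
    return n;
}
```

With a 16-byte buffer, an optimistic copy of 10 bytes and a returned length of 5, `clobbered_tail()` reports 5 bytes that the caller was entitled to assume were untouched.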
But if we decide to ignore that issue (possibly with some "open()"
time flag to say "give me optimistic short reads, and I won't care"),
we still have
(b) the folio we copied from might have been released and re-used for
something else
and this is fatal. We might have optimistically copied things that are
now security-sensitive and even if we return a short read - or
overwrite it - later, user space should never have seen that data.
This (b) thing is solvable too, but requires that page cache releases
always would be RCU-delayed, and they aren't.
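The idea of delaying reuse until readers are done can be illustrated with a deliberately crude, single-threaded model. It is not RCU and not how the page cache works; `struct toy_page`, `toy_readers`, and the `toy_*` functions are hypothetical names made up for this sketch.

```c
#include <stdbool.h>

/* Hypothetical page state for the toy model. */
struct toy_page {
    bool in_cache;   /* still reachable by new lookups */
    bool reusable;   /* safe to hand to a new owner */
};

/* Number of readers currently inside a read-side section. */
static int toy_readers;

static void toy_read_lock(void)   { toy_readers++; }
static void toy_read_unlock(void) { toy_readers--; }

/* Release: unpublish immediately, but do NOT mark the page reusable yet. */
static void toy_release(struct toy_page *pg)
{
    pg->in_cache = false;
    pg->reusable = false;
}

/*
 * Stand-in for a grace period: an unpublished page becomes reusable only
 * once no reader that could still hold a stale pointer remains.
 * Returns true if the page may now be reused.
 */
static bool toy_try_reclaim(struct toy_page *pg)
{
    if (!pg->in_cache && toy_readers == 0)
        pg->reusable = true;
    return pg->reusable;
}
```

In this model a reader that started before the release can finish its optimistic copy without the page being handed to someone else. An immediate release lacks that guarantee, which is the situation in (b).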
So both are "solvable", but they are very big and very separate solutions.
Linus