Message-ID: <2mfkm3sotjz5tfw6wvtfrwnerae5pqspelyxw6xg6e5glsyaq6@jl73gcvmtve5>
Date: Tue, 16 Dec 2025 23:33:20 -0800
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: Andrii Nakryiko <andrii.nakryiko@...il.com>
Cc: Matthew Wilcox <willy@...radead.org>,
Linus Torvalds <torvalds@...ux-foundation.org>, Christoph Hellwig <hch@...radead.org>,
"Darrick J. Wong" <djwong@...nel.org>, SHAURYA RANE <ssrane_b23@...vjti.ac.in>,
akpm@...ux-foundation.org, eddyz87@...il.com, andrii@...nel.org, ast@...nel.org,
linux-fsdevel@...r.kernel.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
linux-kernel-mentees@...ts.linux.dev, skhan@...uxfoundation.org, david.hunter.linux@...il.com,
khalid@...nel.org, syzbot+09b7d050e4806540153d@...kaller.appspotmail.com,
bpf <bpf@...r.kernel.org>
Subject: Re: [PATCH] mm/filemap: fix NULL pointer dereference in
do_read_cache_folio()
On Tue, Nov 18, 2025 at 11:27:47AM -0800, Andrii Nakryiko wrote:
> On Tue, Nov 18, 2025 at 7:37 AM Matthew Wilcox <willy@...radead.org> wrote:
> >
> > On Tue, Nov 18, 2025 at 05:03:24AM -0800, Christoph Hellwig wrote:
> > > On Mon, Nov 17, 2025 at 10:45:31AM -0800, Andrii Nakryiko wrote:
> > > > As I replied on another email, ideally we'd have some low-level file
> > > > reading interface where we wouldn't have to know about secretmem, or
> > > > XFS+DAX, or whatever other unusual combination of conditions where
> > > > exposed internal APIs like filemap_get_folio() + read_cache_folio()
> > > > can crash.
> > >
> > > The problem is that you did something totally insane and it kinda works
> > > most of the time.
> >
> > ... on 64-bit systems. The HIGHMEM handling is screwed up too.
> >
> > > But bpf or any other file system consumer has
> > > absolutely no business poking into the page cache to start with.
> >
> > Agreed.
>
> Then please help make it better, give us interfaces you think are
> appropriate. People do use this functionality in production, it's
> important and we are not going to drop it. In non-sleepable mode it's
> best-effort, if the requested part of the file is paged in, we'll
> successfully read data (such as ELF's build ID), and if not, we'll
> report that to the BPF program as -EFAULT. In sleepable mode, we'll
> wait for that part of the file to be paged in before proceeding.
> PROCMAP_QUERY ioctl() is always in sleepable mode, so it will wait for
> file data to be read.
>
> If you don't like the implementation, please help improve it, don't
> just request dropping it "because BPF folks" or anything like that.
>
So, I took a stab at this, based in particular on Willy's suggestion of
IOCB_NOWAIT. This is untested and I am just sharing it to show what it
looks like and to see if there are any concerns. In addition, I think I
will look into the fstests part as well.
BTW, by simple code inspection I can already see that IOCB_NOWAIT is not
well respected. For example, filemap_read() calls cond_resched() without
any checks. The readahead path, i.e. page_cache_sync_ra(), can
potentially take sleeping locks. Btrfs takes locks in
btrfs_file_read_iter(). So it seems this will need extensive testing,
hopefully across all major filesystems.
Here is the draft patch:
From 9652cc97a817fe35e53a7e98a5fbb49c7788c744 Mon Sep 17 00:00:00 2001
From: Shakeel Butt <shakeel.butt@...ux.dev>
Date: Tue, 16 Dec 2025 16:53:57 -0800
Subject: [PATCH] lib/buildid: convert freader to use __kernel_read()
Convert the freader file reading implementation from direct page cache
access via filemap_get_folio()/read_cache_folio() to using kernel_read
interfaces.
Add a new __kernel_read_nowait() function that uses IOCB_NOWAIT flag
for non-blocking I/O. This is used when may_fault is false to avoid
blocking on I/O - if data is not immediately available, it returns
-EAGAIN.
For the may_fault case, use the standard __kernel_read() which can
block waiting for I/O.
This simplifies the code by removing the need to manage folios,
kmap/kunmap operations, and page cache locking. It also makes the
code work with filesystems that don't use the page cache directly.
Signed-off-by: Shakeel Butt <shakeel.butt@...ux.dev>
---
fs/read_write.c | 18 ++++++++-
include/linux/buildid.h | 3 --
include/linux/fs.h | 1 +
lib/buildid.c | 85 +++++++++--------------------------------
4 files changed, 37 insertions(+), 70 deletions(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index 833bae068770..7a042cfeefec 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -503,7 +503,8 @@ static int warn_unsupported(struct file *file, const char *op)
return -EINVAL;
}
-ssize_t __kernel_read(struct file *file, void *buf, size_t count, loff_t *pos)
+static ssize_t __kernel_read_internal(struct file *file, void *buf,
+ size_t count, loff_t *pos, int flags)
{
struct kvec iov = {
.iov_base = buf,
@@ -526,6 +527,7 @@ ssize_t __kernel_read(struct file *file, void *buf, size_t count, loff_t *pos)
init_sync_kiocb(&kiocb, file);
kiocb.ki_pos = pos ? *pos : 0;
+ kiocb.ki_flags |= flags;
iov_iter_kvec(&iter, ITER_DEST, &iov, 1, iov.iov_len);
ret = file->f_op->read_iter(&kiocb, &iter);
if (ret > 0) {
@@ -538,6 +540,20 @@ ssize_t __kernel_read(struct file *file, void *buf, size_t count, loff_t *pos)
return ret;
}
+ssize_t __kernel_read(struct file *file, void *buf, size_t count, loff_t *pos)
+{
+ return __kernel_read_internal(file, buf, count, pos, 0);
+}
+
+/*
+ * Non-blocking variant of __kernel_read() using IOCB_NOWAIT.
+ * Returns -EAGAIN if the read would block waiting for I/O.
+ */
+ssize_t __kernel_read_nowait(struct file *file, void *buf, size_t count, loff_t *pos)
+{
+ return __kernel_read_internal(file, buf, count, pos, IOCB_NOWAIT);
+}
+
ssize_t kernel_read(struct file *file, void *buf, size_t count, loff_t *pos)
{
ssize_t ret;
diff --git a/include/linux/buildid.h b/include/linux/buildid.h
index 831c1b4b626c..f1fa220353a2 100644
--- a/include/linux/buildid.h
+++ b/include/linux/buildid.h
@@ -25,9 +25,6 @@ struct freader {
union {
struct {
struct file *file;
- struct folio *folio;
- void *addr;
- loff_t folio_off;
bool may_fault;
};
struct {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f5c9cf28c4dc..498c804fc0b9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2832,6 +2832,7 @@ extern int do_pipe_flags(int *, int);
extern ssize_t kernel_read(struct file *, void *, size_t, loff_t *);
ssize_t __kernel_read(struct file *file, void *buf, size_t count, loff_t *pos);
+ssize_t __kernel_read_nowait(struct file *file, void *buf, size_t count, loff_t *pos);
extern ssize_t kernel_write(struct file *, const void *, size_t, loff_t *);
extern ssize_t __kernel_write(struct file *, const void *, size_t, loff_t *);
extern struct file * open_exec(const char *);
diff --git a/lib/buildid.c b/lib/buildid.c
index aaf61dfc0919..c9d4491557fe 100644
--- a/lib/buildid.c
+++ b/lib/buildid.c
@@ -5,6 +5,7 @@
#include <linux/elf.h>
#include <linux/kernel.h>
#include <linux/pagemap.h>
+#include <linux/fs.h>
#include <linux/secretmem.h>
#define BUILD_ID 3
@@ -28,55 +29,35 @@ void freader_init_from_mem(struct freader *r, const char *data, u64 data_sz)
r->data_sz = data_sz;
}
-static void freader_put_folio(struct freader *r)
-{
- if (!r->folio)
- return;
- kunmap_local(r->addr);
- folio_put(r->folio);
- r->folio = NULL;
-}
-
-static int freader_get_folio(struct freader *r, loff_t file_off)
+/*
+ * Read data from file at specified offset into the freader buffer.
+ * Uses non-blocking I/O when may_fault is false.
+ * Returns 0 on success, negative error code on failure.
+ */
+static int freader_read(struct freader *r, loff_t file_off, size_t sz)
{
- /* check if we can just reuse current folio */
- if (r->folio && file_off >= r->folio_off &&
- file_off < r->folio_off + folio_size(r->folio))
- return 0;
-
- freader_put_folio(r);
+ ssize_t ret;
+ loff_t pos = file_off;
/* reject secretmem folios created with memfd_secret() */
if (secretmem_mapping(r->file->f_mapping))
return -EFAULT;
- r->folio = filemap_get_folio(r->file->f_mapping, file_off >> PAGE_SHIFT);
-
- /* if sleeping is allowed, wait for the page, if necessary */
- if (r->may_fault && (IS_ERR(r->folio) || !folio_test_uptodate(r->folio))) {
- filemap_invalidate_lock_shared(r->file->f_mapping);
- r->folio = read_cache_folio(r->file->f_mapping, file_off >> PAGE_SHIFT,
- NULL, r->file);
- filemap_invalidate_unlock_shared(r->file->f_mapping);
- }
+ if (r->may_fault)
+ ret = __kernel_read(r->file, r->buf, sz, &pos);
+ else
+ ret = __kernel_read_nowait(r->file, r->buf, sz, &pos);
- if (IS_ERR(r->folio) || !folio_test_uptodate(r->folio)) {
- if (!IS_ERR(r->folio))
- folio_put(r->folio);
- r->folio = NULL;
+ if (ret < 0)
+ return ret;
+ if (ret != sz)
return -EFAULT;
- }
-
- r->folio_off = folio_pos(r->folio);
- r->addr = kmap_local_folio(r->folio, 0);
return 0;
}
const void *freader_fetch(struct freader *r, loff_t file_off, size_t sz)
{
- size_t folio_sz;
-
/* provided internal temporary buffer should be sized correctly */
if (WARN_ON(r->buf && sz > r->buf_sz)) {
r->err = -E2BIG;
@@ -97,46 +78,18 @@ const void *freader_fetch(struct freader *r, loff_t file_off, size_t sz)
return r->data + file_off;
}
- /* fetch or reuse folio for given file offset */
- r->err = freader_get_folio(r, file_off);
+ /* read data from file into buffer */
+ r->err = freader_read(r, file_off, sz);
if (r->err)
return NULL;
- /* if requested data is crossing folio boundaries, we have to copy
- * everything into our local buffer to keep a simple linear memory
- * access interface
- */
- folio_sz = folio_size(r->folio);
- if (file_off + sz > r->folio_off + folio_sz) {
- u64 part_sz = r->folio_off + folio_sz - file_off, off;
-
- memcpy(r->buf, r->addr + file_off - r->folio_off, part_sz);
- off = part_sz;
-
- while (off < sz) {
- /* fetch next folio */
- r->err = freader_get_folio(r, r->folio_off + folio_sz);
- if (r->err)
- return NULL;
- folio_sz = folio_size(r->folio);
- part_sz = min_t(u64, sz - off, folio_sz);
- memcpy(r->buf + off, r->addr, part_sz);
- off += part_sz;
- }
-
- return r->buf;
- }
-
- /* if data fits in a single folio, just return direct pointer */
- return r->addr + (file_off - r->folio_off);
+ return r->buf;
}
void freader_cleanup(struct freader *r)
{
if (!r->buf)
return; /* non-file-backed mode */
-
- freader_put_folio(r);
}
/*
--
2.47.3