[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aSeNtFxD1WRjFaiR@shell.armlinux.org.uk>
Date: Wed, 26 Nov 2025 23:31:00 +0000
From: "Russell King (Oracle)" <linux@...linux.org.uk>
To: Al Viro <viro@...iv.linux.org.uk>
Cc: Xie Yuanbin <xieyuanbin1@...wei.com>, brauner@...nel.org, jack@...e.cz,
will@...nel.org, nico@...xnic.net, akpm@...ux-foundation.org,
hch@....de, jack@...e.com, wozizhi@...weicloud.com,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org, linux-mm@...ck.org,
lilinjie8@...wei.com, liaohua4@...wei.com,
wangkefeng.wang@...wei.com, pangliyuan1@...wei.com
Subject: Re: [RFC PATCH] vfs: Fix might sleep in load_unaligned_zeropad()
with rcu read lock held
On Wed, Nov 26, 2025 at 08:02:21PM +0000, Al Viro wrote:
> On Wed, Nov 26, 2025 at 07:51:54PM +0000, Russell King (Oracle) wrote:
>
> > I don't understand how that helps. Wasn't the report that the filename
> > crosses a page boundary in userspace, but the following page is
> > inaccessible which causes a fault to be taken (as it always would do).
> > Thus, wouldn't "addr" be a userspace address (that the kernel is
> > accessing) and thus be below TASK_SIZE ?
> >
> > I'm also confused - if we can't take a fault and handle it while
> > reading the filename from userspace, how are pages that have been
> > swapped out or evicted from the page cache read back in from storage
> > which invariably results in sleeping - which we can't do here because
> > of the RCU context (not that I've ever understood RCU, which is why
> > I've always referred those bugs to Paul.)
>
> No, the filename is already copied in kernel space *and* it's long enough
> to end right next to the end of page. There's NUL before the end of page,
> at that, with '/' a couple of bytes prior. We attempt to save on memory
> accesses, doing word-by-word fetches, starting from the beginning of
> component. We *will* detect NUL and ignore all subsequent bytes; the
> problem is that the last 3 bytes of page might be '/', 'x' and '\0'.
> We call load_unaligned_zeropad() on page + PAGE_SIZE - 2. And get
> a fetch that spans the end of page.
>
> We don't care what's in the next page, if there is one mapped there
> to start with. If there's nothing mapped, we want zeroes read from
> it, but all we really care about is having the bytes within *our*
> page read correctly - and no oops happening, obviously.
>
> That fault is an extremely cold case on a fairly hot path. We don't
> want to mess with disabling pagefaults, etc. - not for the sake
> of that.
I think, looking at the x86 handling, 32-bit ARM has missed a heck of
a lot of changes to the fault handling code, going all the way back to
pre-git history.
I seem to remember that I had updated it to match i386's implementation
at one point in the distant past, which is essentially what we have
today with a few tweaks. As code ages, it gets more difficult to
justify wholesale rewrites to bring it back up.
Relevant to this, looking at i386, that at some point added:
+ /*
+ * We fault-in kernel-space virtual memory on-demand. The
+ * 'reference' page table is init_mm.pgd.
+ *
+ * NOTE! We MUST NOT take any locks for this case. We may
+ * be in an interrupt or a critical region, and should
+ * only copy the information from the master page table,
+ * nothing more.
+ *
+ * This verifies that the fault happens in kernel space
+ * (error_code & 4) == 0, and that the fault was not a
+ * protection error (error_code & 1) == 0.
+ */
+ if (unlikely(address >= TASK_SIZE)) {
+ if (!(error_code & 5))
+ goto vmalloc_fault;
+ /*
+ * Don't take the mm semaphore here. If we fixup a prefetch
+ * fault we could otherwise deadlock.
+ */
+ goto bad_area_nosemaphore;
+ }
which is after notify_die() and the test to see whether we need a
local_irq_enable(). This means we go straight to the fixing up etc
for these addresses.
In today's kernel, this has morphed into:
/* Was the fault on kernel-controlled part of the address space? */
if (unlikely(fault_in_kernel_space(address))) {
do_kern_addr_fault(regs, error_code, address);
} else {
do_user_addr_fault(regs, error_code, address);
meaning any page fault for a kernel space address is handled entirely
separately from the normal page fault handling, and it looks like
this is entirely sensible.
Interestingly, however, I notice that x86 appears to no longer call
notify_die(DIE_PAGE_FAULT) in its page fault handling path, and I
wonder whether that's a regression on x86.
Now, for 32-bit ARM, I think I am coming to the conclusion that Al's
suggestion is probably the easiest solution. However, whether it has
side effects, I couldn't say - the 32-bit ARM fault code has been
modified by quite a few people in ways I don't yet understand, so I
can't be certain at the moment whether it would cause problems.
I think the only thing to do is to try the solution and see what
breaks. I'm not in a position to be able to do that as, having not
had reason to touch 32-bit ARM for years, I don't have a hackable
platform nearby. Maybe Xie Yuanbin can test it?
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
Powered by blists - more mailing lists