Message-ID: <aTLLLuup7TeAqFVL@shell.armlinux.org.uk>
Date: Fri, 5 Dec 2025 12:08:14 +0000
From: "Russell King (Oracle)" <linux@...linux.org.uk>
To: Xie Yuanbin <xieyuanbin1@...wei.com>
Cc: torvalds@...ux-foundation.org, akpm@...ux-foundation.org,
brauner@...nel.org, catalin.marinas@....com, hch@....de,
jack@...e.com, linux-arm-kernel@...ts.infradead.org,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, pangliyuan1@...wei.com,
wangkefeng.wang@...wei.com, will@...nel.org,
wozizhi@...weicloud.com, yangerkun@...wei.com
Subject: Re: [Bug report] hash_name() may cross page boundary and trigger
sleep in RCU context
On Wed, Dec 03, 2025 at 09:48:00AM +0800, Xie Yuanbin wrote:
> On Tue, 2 Dec 2025 14:07:25 -0800, Linus Torvalds wrote:
> > On Tue, 2 Dec 2025 at 04:43, Russell King (Oracle)
> > <linux@...linux.org.uk> wrote:
> >>
> >> What I'm thinking is to address both of these by handling kernel space
> >> page faults (which will be permission or PTE-not-present) separately
> >> (not even build tested):
> >
> > That patch looks sane to me.
> >
> > But I also didn't build test it, just scanned it visually ;)
>
> That patch removes harden_branch_predictor() from __do_user_fault(), and
> moves it to do_page_fault()->do_kernel_address_page_fault().
> This resolves the previously mentioned kernel warning issue. However,
> __do_user_fault() is not only called by do_page_fault(), it is
> also called by do_bad_area(), do_sect_fault() and do_translation_fault().
>
> So I think that some harden_branch_predictor() is missing on other paths.
> According to my tests, when CONFIG_ARM_LPAE=n, harden_branch_predictor()
> will never be called anymore, even if a user program tries to access the
> kernel address.
>
> Or perhaps I've misunderstood something, could you please point it out?
> Thank you very much.
Right, let's split these issues into separate patches. Please test this
patch, which should address only the hash_name() fault issue, and
provides the basis for fixing the branch predictor issue.
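For context on the hash_name() fault: the report describes a word-sized load that starts near the end of one page and spills into the next. The following is a minimal user-space sketch of that arithmetic only. The 4 KiB page size and the helper name load_crosses_page() are assumptions made for illustration and are not part of the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, used here purely for illustration. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Hypothetical helper: returns true if a load of `len` bytes starting at
 * `addr` touches more than one page.
 */
static bool load_crosses_page(uintptr_t addr, size_t len)
{
	uintptr_t first = addr / SKETCH_PAGE_SIZE;
	uintptr_t last = (addr + len - 1) / SKETCH_PAGE_SIZE;

	return len != 0 && first != last;
}
```

With these assumptions, a 4-byte load at the last 4 bytes of a page stays within the page, while the same load one byte later touches the following page, which may not be mapped.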
Yes, at the moment, do_kernel_address_page_fault() looks very much like
do_bad_area(), with the addition of enabling IRQs if they were enabled
in the parent context. The following patch, which addresses the branch
predictor hardening, will show why it's different.
In my opinion, this approach makes the handling for kernel address
page faults (non-present pages and page permission faults) much easier
to understand.
Note that this will call __do_user_fault() with interrupts disabled.
Build tested, and remotely boot tested on Cortex-A5 hardware but
without kfence enabled. Also tested usermode access to kernel space
which fails with SEGV:
- read from 0xc0000000 (section permission fault, do_sect_fault)
- read from 0xffff2000 (page translation fault, do_page_fault)
- read from 0xffff0000 (vectors page - read possible as expected)
- write to 0xffff0000 (page permission fault, do_page_fault)
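A usermode check along the lines of the list above could be sketched as follows. This is not the test program that was actually used; the helper name probe_read() is hypothetical, and the sketch only reports whether a single byte read raised SIGSEGV and with which si_code.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf probe_env;
static volatile sig_atomic_t probe_code;

/* Record the si_code of the fault and jump back into probe_read(). */
static void probe_handler(int sig, siginfo_t *info, void *ctx)
{
	(void)sig;
	(void)ctx;
	probe_code = info->si_code;
	siglongjmp(probe_env, 1);
}

/*
 * Hypothetical helper: try to read one byte from addr.
 * Returns 0 if the read succeeded, the SIGSEGV si_code (for example
 * SEGV_MAPERR or SEGV_ACCERR) if it faulted, or -1 if the handler
 * could not be installed.
 */
static int probe_read(const volatile unsigned char *addr)
{
	struct sigaction sa, old;
	int ret = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = probe_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, &old) != 0)
		return -1;

	if (sigsetjmp(probe_env, 1) == 0)
		(void)*addr;		/* volatile read, may fault */
	else
		ret = probe_code;	/* reached via siglongjmp() */

	sigaction(SIGSEGV, &old, NULL);
	return ret;
}
```

On a 32-bit ARM system, calling probe_read((const volatile unsigned char *)0xc0000000UL) would be expected to return a non-zero si_code, in line with the SEGV results listed above, while the read of 0xffff0000 is reported above to succeed. On other architectures these addresses have no special meaning.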
8<===
From: "Russell King (Oracle)" <rmk+kernel@...linux.org.uk>
Subject: [PATCH] ARM: fix hash_name() fault
Zizhi Wo reports:
"During the execution of hash_name()->load_unaligned_zeropad(), a
potential memory access beyond the PAGE boundary may occur. For
example, when the filename length is near the PAGE_SIZE boundary.
This triggers a page fault, which leads to a call to
do_page_fault()->mmap_read_trylock(). If we can't acquire the lock,
we have to fall back to the mmap_read_lock() path, which calls
might_sleep(). This breaks RCU semantics because path lookup occurs
under an RCU read-side critical section."
This is seen with CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_KFENCE=y.
Kernel addresses (with the exception of the vectors/kuser helper
page) do not have VMAs associated with them. If the vectors/kuser
helper page faults, then there are two possibilities:
1. if the fault happened while in kernel mode, then we're basically
dead, because the CPU won't be able to vector through this page
to handle the fault.
2. if the fault happened while in user mode, that means the page was
protected from user access, and we want to fault anyway.
Thus, we can handle kernel addresses from any context entirely
separately without going anywhere near the mmap lock. This gives us
an entirely non-sleeping path for all kernel mode kernel address
faults.
Reported-by: Zizhi Wo <wozizhi@...weicloud.com>
Reported-by: Xie Yuanbin <xieyuanbin1@...wei.com>
Link: https://lore.kernel.org/r/20251126090505.3057219-1-wozizhi@huaweicloud.com
Signed-off-by: Russell King (Oracle) <rmk+kernel@...linux.org.uk>
---
arch/arm/mm/fault.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 46169fe42c61..2bbec38ced97 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -260,6 +260,35 @@ static inline bool ttbr0_usermode_access_allowed(struct pt_regs *regs)
}
#endif
+static int __kprobes
+do_kernel_address_page_fault(struct mm_struct *mm, unsigned long addr,
+ unsigned int fsr, struct pt_regs *regs)
+{
+ if (user_mode(regs)) {
+ /*
+ * Fault from user mode for a kernel space address. User mode
+ * should not be faulting in kernel space, which includes the
+ * vector/khelper page. Send a SIGSEGV.
+ */
+ __do_user_fault(addr, fsr, SIGSEGV, SEGV_MAPERR, regs);
+ } else {
+ /*
+ * Fault from kernel mode. Enable interrupts if they were
+ * enabled in the parent context. Section (upper page table)
+ * translation faults are handled via do_translation_fault(),
+ * so we will only get here for a non-present kernel space
+ * PTE or PTE permission fault. This may happen in exceptional
+ * circumstances and need the fixup tables to be walked.
+ */
+ if (interrupts_enabled(regs))
+ local_irq_enable();
+
+ __do_kernel_fault(mm, addr, fsr, regs);
+ }
+
+ return 0;
+}
+
static int __kprobes
do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
{
@@ -273,6 +302,12 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
if (kprobe_page_fault(regs, fsr))
return 0;
+ /*
+ * Handle kernel address faults separately, which avoids touching
+ * the mmap lock from contexts that are not able to sleep.
+ */
+ if (addr >= TASK_SIZE)
+ return do_kernel_address_page_fault(mm, addr, fsr, regs);
/* Enable interrupts if they were enabled in the parent context. */
if (interrupts_enabled(regs))
--
2.47.3
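
The control flow that the patch adds to do_page_fault() can be modelled in user space as a simple decision function. This is only a sketch: the boundary value, the enum and the function name are assumptions for illustration (the kernel's real TASK_SIZE depends on the configuration), and the model ignores everything the real handlers do beyond choosing a path.

```c
#include <stdbool.h>

/* Assumption: stand-in boundary; the real TASK_SIZE is configuration dependent. */
#define SKETCH_TASK_SIZE 0xc0000000UL

enum sketch_outcome {
	SKETCH_SIGSEGV,		/* user mode touched a kernel address */
	SKETCH_KERNEL_FAULT,	/* kernel mode: fixup tables walked, no mmap lock */
	SKETCH_MM_PATH,		/* user address: existing path, may take mmap lock */
};

/* Hypothetical model of the address check added at the top of do_page_fault(). */
static enum sketch_outcome sketch_page_fault(unsigned long addr, bool user_mode)
{
	if (addr >= SKETCH_TASK_SIZE)
		return user_mode ? SKETCH_SIGSEGV : SKETCH_KERNEL_FAULT;

	return SKETCH_MM_PATH;
}
```

The point the model tries to capture is that neither branch for a kernel address reaches the mmap lock, so a fault taken inside an RCU read-side critical section never reaches the sleeping mmap_read_lock() path.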
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!