Message-ID: <66508848-7bf3-44ac-92b5-0836960e852e@asahilina.net>
Date: Thu, 18 Jul 2024 23:34:38 +0900
From: Asahi Lina <lina@...hilina.net>
To: Will Deacon <will@...nel.org>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, asahi@...ts.linux.dev,
linux-arm-kernel@...ts.infradead.org,
Catalin Marinas <catalin.marinas@....com>, ryan.roberts@....com,
mark.rutland@....com, ardb@...nel.org
Subject: Re: LPA2 on non-LPA2 hardware broken with 16K pages
Hi,
On 7/18/24 10:14 PM, Will Deacon wrote:
> Hi Lina, [+Ard, Mark and Ryan],
>
> On Thu, Jul 18, 2024 at 06:39:10PM +0900, Asahi Lina wrote:
>> I ran into this with the Asahi Linux downstream kernel, based on v6.9.9,
>> but I believe the problem is also still upstream. The issue seems to be
>> an interaction between folding one page table level at compile time and
>> another one at runtime.
>
> Thanks for reporting this!
>
>> With this config, we have:
>>
>> CONFIG_PGTABLE_LEVELS=4
>> PAGE_SHIFT=14
>> PMD_SHIFT=25
>> PUD_SHIFT=36
>> PGDIR_SHIFT=47
>> pgtable_l5_enabled() == false (compile time)
>> pgtable_l4_enabled() == false (runtime, due to no LPA2)
>
> I think this is 'defconfig' w/ 16k pages, although I wasn't able to
> trigger the issue quickly under QEMU with that. Your analysis looks
> correct, however.
Yes, it should be. I first ran into the issue with a .config that was
derived from defconfig that someone sent me while trying to debug a
different problem.
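
For reference, the shift values quoted above follow from each table level
resolving PAGE_SHIFT - 3 bits (one 16K page of 8-byte descriptors holds
2048 entries). A minimal sketch of that arithmetic is below; the helper
names are made up for illustration and are not kernel symbols.

```c
#include <assert.h>

/* Hypothetical helper: bits of virtual address resolved by one table level.
 * A table is one page of 8-byte descriptors, hence PAGE_SHIFT - 3. */
static int bits_per_level(int page_shift)
{
	return page_shift - 3;
}

/* Hypothetical helper: shift of the level n steps above the PTE level
 * (n = 1 is the PMD, n = 2 the PUD, n = 3 the PGD in the quoted config). */
static int level_shift(int page_shift, int n)
{
	return page_shift + n * bits_per_level(page_shift);
}

/* Checks the values quoted in the report for PAGE_SHIFT = 14. */
static void check_quoted_shifts(void)
{
	assert(bits_per_level(14) == 11);
	assert(level_shift(14, 1) == 25);	/* PMD_SHIFT */
	assert(level_shift(14, 2) == 36);	/* PUD_SHIFT */
	assert(level_shift(14, 3) == 47);	/* PGDIR_SHIFT */
}
```

Calling check_quoted_shifts() from a small test program should pass
silently if the quoted numbers are consistent with 16K pages.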
[snip]
> Cheers for the explanation; I agree that 6.10 looks like it's affected
> in the same way, even though I couldn't reproduce the crash. I think the
> root of the problem is that p4d_offset_lockless() returns a stack
> address when the p4d level is folded. I wondered about changing the
> dummy pXd_offset_lockless() definitions in linux/pgtable.h to pass the
> real pointer through instead of the address of the local, but then I
> suppose _most_ pXd_offset() implementations are going to dereference
> that and it would break the whole point of having _lockless routines
> to start with.
>
> What if we provided our own implementation of p4d_offset_lockless()
> for the folding case, which could just propagate the page-table pointer?
> Diff below.
That seems to work: it neither reproduces the oopses outright nor
triggers any of the random sanity checks I sprinkled around gup.c while
debugging this to try to make it fail early ^^
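
To illustrate why propagating the real pointer matters, here is a rough
userspace model of the failure as described in this thread. It is only a
sketch: every model_* name is hypothetical, and it assumes that the
runtime-folded level recovers its table base by aligning the entry
pointer it was handed down to a page boundary, which is my reading of the
arm64 code rather than something shown in the diff.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 16384UL		/* 16K pages, PAGE_SHIFT = 14 */
#define MODEL_ENTRIES   2048		/* 16K / 8-byte descriptors */

typedef struct { uint64_t val; } pgd_t;
typedef struct { uint64_t val; } p4d_t;

/* Compile-time folded p4d level: the "offset" is just a pointer cast. */
static p4d_t *model_p4d_offset(pgd_t *pgdp, unsigned long addr)
{
	(void)addr;
	return (p4d_t *)pgdp;
}

/* Generic-style fallback: walks from the caller's local snapshot of the
 * entry, so the result points at that local, not at the real table. */
static p4d_t *model_lockless_generic(pgd_t *pgdp, pgd_t *snapshot,
				     unsigned long addr)
{
	(void)pgdp;
	return model_p4d_offset(snapshot, addr);
}

/* Shape of the proposed override: ignore the snapshot and propagate the
 * real page-table pointer. */
static p4d_t *model_lockless_fixed(pgd_t *pgdp, pgd_t *snapshot,
				   unsigned long addr)
{
	(void)snapshot;
	return model_p4d_offset(pgdp, addr);
}

/* Assumed behaviour of the runtime-folded level: derive the table base
 * by aligning the entry pointer down to the page size. */
static uintptr_t model_folded_table_base(const void *entryp)
{
	return (uintptr_t)entryp & ~(uintptr_t)(MODEL_PAGE_SIZE - 1);
}

/* Returns 1 if the derived base is the real table, 0 otherwise. */
static int model_walk_stays_in_table(int use_fix)
{
	static _Alignas(16384) pgd_t table[MODEL_ENTRIES];
	pgd_t *pgdp = &table[5];
	pgd_t snapshot = *pgdp;		/* local copy living on the stack */
	p4d_t *p4dp;

	assert(((uintptr_t)table & (MODEL_PAGE_SIZE - 1)) == 0);
	p4dp = use_fix ? model_lockless_fixed(pgdp, &snapshot, 0)
		       : model_lockless_generic(pgdp, &snapshot, 0);
	return model_folded_table_base(p4dp) == (uintptr_t)table;
}
```

In this model model_walk_stays_in_table(0) ends up with a base derived
from the stack address of the snapshot, while model_walk_stays_in_table(1)
stays inside the real table, which would match the "reading random stack
memory" symptom.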
>
>> This causes random oopses in internal_get_user_pages_fast and related
>> codepaths.
>
> Do you have a reliable way to trigger those? I tried doing some GUPpy
> things like strace (access_process_vm()) but it all seemed fine.
It's a bit weird because I had kernel builds where it didn't obviously
happen most of the time. During my latest round tracking this down
though, it almost 100% reliably triggered with a simple boot of a Fedora
system, usually in an `lvm` process during boot (lvm2-monitor.service).
The lvm crashes look like this, and I also got this trace in
systemd-journald once:
internal_get_user_pages_fast+0x420/0x1728
pin_user_pages_fast+0x9c/0xc4
iov_iter_extract_pages+0x234/0x1044
bio_iov_iter_get_pages+0x248/0xa90
blkdev_direct_IO.part.0+0x3a0/0x143c
blkdev_read_iter+0x1cc/0x388
aio_read.constprop.0+0x1e0/0x324
io_submit_one.constprop.0+0x378/0x1470
__arm64_sys_io_submit+0x198/0x2d0
invoke_syscall.constprop.0+0xd8/0x1e0
do_el0_svc+0xc4/0x1e0
el0_svc+0x48/0xc0
el0t_64_sync_handler+0x120/0x130
el0t_64_sync+0x190/0x194
Right now with my kernel build, this happens basically every boot.
However, it wasn't always like that, and I'm not sure what other
environmental differences affect the outcome. I guess since it's reading
random stack memory, it depends on what's there...
I disabled lvm2-monitor.service and tried to boot again (to unwedge
things) and this time it oopsed in udisksd like this (I recall seeing
this one before too):
internal_get_user_pages_fast+0x420/0x1728
pin_user_pages_fast+0x9c/0xc4
iov_iter_extract_pages+0x234/0x1044
bio_map_user_iov+0x214/0x724
blk_rq_map_user_iov+0x8b0/0x1080
blk_rq_map_user_io+0x138/0x17c
nvme_map_user_request.isra.0+0x2b4/0x3d8
nvme_submit_user_cmd.isra.0+0x21c/0x300
nvme_user_cmd.isra.0+0x200/0x3fc
nvme_dev_ioctl+0x284/0x480
__arm64_sys_ioctl+0x550/0x1b80
invoke_syscall.constprop.0+0xd8/0x1e0
do_el0_svc+0xc4/0x1e0
el0_svc+0x48/0xc0
el0t_64_sync_handler+0x120/0x130
el0t_64_sync+0x190/0x194
Prior to this I was testing with glmark2 (this whole thing started with
me trying to debug a downstream GPU driver issue, which I now realize is
a completely unrelated bug but it led me down this rabbit hole because
it was first reported by someone coincidentally compiling with LPA2 on).
I use this process termination torture test incantation, and I just
confirmed that it still oopses even with just a simpledrm framebuffer
and llvmpipe (software rendering) and no GPU/"full" KMS drivers compiled
in, so it should be reproducible on other platforms hopefully. The repro
is reliable within a few seconds. Running on KDE Plasma for whatever
that's worth:
while true; do WAYLAND_DISPLAY=wayland-0 timeout -s TERM -k 0 0.5 glmark2-es2-wayland & sleep 0.02 ; done
The trace looks like:
[ 301.830742] internal_get_user_pages_fast+0x248/0xcd0
[ 301.831343] get_user_pages_fast+0x48/0x60
[ 301.831897] get_futex_key+0xa4/0x3d0
[ 301.832281] futex_wait_setup+0x6c/0x164
[ 301.832777] __futex_wait+0xbc/0x15c
[ 301.833200] futex_wait+0x88/0x110
[ 301.833596] do_futex+0xf8/0x1a0
[ 301.833926] __arm64_sys_futex+0xec/0x188
[ 301.834417] invoke_syscall.constprop.0+0x50/0xe4
[ 301.835029] do_el0_svc+0x40/0xdc
[ 301.835378] el0_svc+0x3c/0x140
[ 301.835796] el0t_64_sync_handler+0x120/0x12c
[ 301.836365] el0t_64_sync+0x190/0x194
So it looks like aio, nvme ioctls, and futexes are things that tend to
trigger it. However, it's definitely not always obvious. The person
reporting the GPU issues with an LPA2 kernel build clearly wasn't having
their system crash on every boot, and I myself ended up with kernels
where the only repro was the glmark2 invocation and things weren't just
oopsing on boot.
Going down the aio path, I just tried fio, but it turns out that it
fairly reliably reproduces the futex oops when configured to fail, just
like glmark2 (I couldn't trivially get it to oops on actual aio
usage...).
while true; do fio --filename=nonexist --size=1M --rw=read --bs=16k --ioengine=libaio --numjobs=10 --name=a --readonly; done
(This will complain with "fio: refusing extend of file due to read-only"
and generally not actually run a proper test, but it still repros the
futex problem)
It's not just any futex usage though. I tried writing a trivial futex
test app, and also running the tools/testing/selftests/futex selftests,
and those don't seem to trigger it.
> Thanks,
>
> Will
>
> --->8
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index f8efbc128446..3afe624a39e1 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1065,6 +1065,13 @@ static inline bool pgtable_l5_enabled(void) { return false; }
>
> #define p4d_offset_kimg(dir,addr) ((p4d_t *)dir)
>
> +static inline
> +p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long addr)
> +{
> +	return p4d_offset(pgdp, addr);
> +}
> +#define p4d_offset_lockless p4d_offset_lockless
> +
> #endif /* CONFIG_PGTABLE_LEVELS > 4 */
>
> #define pgd_ERROR(e) \
~~ Lina