linux-kernel - Re: LPA2 on non-LPA2 hardware broken with 16K pages

Message-ID: <66508848-7bf3-44ac-92b5-0836960e852e@asahilina.net>
Date: Thu, 18 Jul 2024 23:34:38 +0900
From: Asahi Lina <lina@...hilina.net>
To: Will Deacon <will@...nel.org>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, asahi@...ts.linux.dev,
 linux-arm-kernel@...ts.infradead.org,
 Catalin Marinas <catalin.marinas@....com>, ryan.roberts@....com,
 mark.rutland@....com, ardb@...nel.org
Subject: Re: LPA2 on non-LPA2 hardware broken with 16K pages

Hi,

On 7/18/24 10:14 PM, Will Deacon wrote:
> Hi Lina, [+Ard, Mark and Ryan],
> 
> On Thu, Jul 18, 2024 at 06:39:10PM +0900, Asahi Lina wrote:
>> I ran into this with the Asahi Linux downstream kernel, based on v6.9.9,
>> but I believe the problem is also still upstream. The issue seems to be
>> an interaction between folding one page table level at compile time and
>> another one at runtime.
> 
> Thanks for reporting this!
> 
>> With this config, we have:
>>
>> CONFIG_PGTABLE_LEVELS=4
>> PAGE_SHIFT=14
>> PMD_SHIFT=25
>> PUD_SHIFT=36
>> PGDIR_SHIFT=47
>> pgtable_l5_enabled() == false (compile time)
>> pgtable_l4_enabled() == false (runtime, due to no LPA2)
> 
> I think this is 'defconfig' w/ 16k pages, although I wasn't able to
> trigger the issue quickly under QEMU with that. Your analysis looks
> correct, however.

Yes, it should be. I first ran into the issue with a .config that was
derived from defconfig that someone sent me while trying to debug a
different problem.
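
For reference, the shift values quoted above follow from the page size alone. The sketch below is only an illustration of that arithmetic, not kernel code, and the helper names (bits_per_level, level_shift) are made up for this example. It assumes 8-byte page-table entries, so a 16K table page holds 2048 entries and each level resolves 11 address bits.

```c
#include <assert.h>

/*
 * Illustrative helpers (hypothetical names, not kernel API).
 * A table page holds PAGE_SIZE / 8 entries because each arm64
 * page-table descriptor is 8 bytes, so one level resolves
 * (page_shift - 3) bits of the virtual address.
 */
static int bits_per_level(int page_shift)
{
    assert(page_shift > 3);
    return page_shift - 3;
}

/*
 * Shift of the region mapped by one entry, n levels above the PTE level:
 * n = 0 -> PAGE_SHIFT, 1 -> PMD_SHIFT, 2 -> PUD_SHIFT,
 * n = 3 -> PGDIR_SHIFT when CONFIG_PGTABLE_LEVELS=4.
 */
static int level_shift(int page_shift, int n)
{
    return page_shift + n * bits_per_level(page_shift);
}
```

With page_shift = 14 this gives 25, 36 and 47 for n = 1, 2, 3, matching PMD_SHIFT, PUD_SHIFT and PGDIR_SHIFT in the config above. It also shows that three levels already cover 14 + 3 * 11 = 47 address bits.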

[snip]

> Cheers for the explanation; I agree that 6.10 looks like it's affected
> in the same way, even though I couldn't reproduce the crash. I think the
> root of the problem is that p4d_offset_lockless() returns a stack
> address when the p4d level is folded. I wondered about changing the
> dummy pXd_offset_lockless() definitions in linux/pgtable.h to pass the
> real pointer through instead of the address of the local, but then I
> suppose _most_ pXd_offset() implementations are going to dereference
> that and it would break the whole point of having _lockless routines
> to start with.
> 
> What if we provided our own implementation of p4d_offset_lockless()
> for the folding case, which could just propagate the page-table pointer?
> Diff below.

That seems to work: it neither reproduces the oopses outright nor
triggers any of the random sanity checks I sprinkled around gup.c while
debugging this to try to make it fail early ^^
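
The failure mode described in the quoted text can be modelled in plain userspace C. This is a simplified sketch under stated assumptions, not the kernel implementation. All names here (entry_t, model_table, folded_offset_lockless, fixed_offset_lockless, runtime_folded_slot, walk_lands_in_table) are hypothetical, the "page" is shrunk to 64 bytes, and the bad pointer is only compared, never dereferenced. The model assumes that a runtime-folded level recovers its table base by aligning the incoming entry pointer down to the table size.

```c
#include <stdbool.h>
#include <stdint.h>

#define TABLE_BYTES 64u                         /* tiny stand-in for PAGE_SIZE */
#define ENTRIES (TABLE_BYTES / sizeof(uint64_t))

typedef struct { uint64_t val; } entry_t;

/* The "real" page table, aligned to its own size like a table page. */
static _Alignas(TABLE_BYTES) entry_t model_table[ENTRIES];

/* Level folded at compile time: the pointer is passed straight through. */
static entry_t *folded_offset(entry_t *p) { return p; }

/* Generic-style lockless fallback: works on the caller's local snapshot. */
#define folded_offset_lockless(tablep, snapshot) folded_offset(&(snapshot))

/* Shape of the proposed fix: ignore the snapshot, keep the table pointer. */
#define fixed_offset_lockless(tablep, snapshot) folded_offset(tablep)

/* Level folded at runtime: align the entry pointer down, then index. */
static uintptr_t runtime_folded_slot(const entry_t *p, unsigned idx)
{
    uintptr_t base = (uintptr_t)p & ~(uintptr_t)(TABLE_BYTES - 1);
    return base + idx * sizeof(entry_t);
}

static bool in_table(const entry_t *table, uintptr_t slot)
{
    uintptr_t start = (uintptr_t)table;
    return slot >= start && slot < start + TABLE_BYTES;
}

/* One step of the walk; reports whether the next slot is in the real table. */
static bool walk_lands_in_table(entry_t *table, unsigned top_idx,
                                unsigned idx, bool fixed)
{
    entry_t *topp = &table[top_idx];
    entry_t snapshot = *topp;   /* local copy, like reading *pgdp once */
    entry_t *p = fixed ? fixed_offset_lockless(topp, snapshot)
                       : folded_offset_lockless(topp, snapshot);
    return in_table(table, runtime_folded_slot(p, idx));
}
```

In this model, walk_lands_in_table(model_table, 0, 3, false) computes a slot next to the on-stack snapshot rather than inside model_table, which is the "reading random stack memory" behaviour. Passing fixed = true keeps the slot inside the table.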

> 
>> This causes random oopses in internal_get_user_pages_fast and related
>> codepaths.
> 
> Do you have a reliable way to trigger those? I tried doing some GUPpy
> things like strace (access_process_vm()) but it all seemed fine.

It's a bit weird because I had kernel builds where it didn't obviously
happen most of the time. During my latest round tracking this down
though, it almost 100% reliably triggered with a simple boot of a Fedora
system, usually in an `lvm` process during boot (lvm2-monitor.service).

The lvm crashes look like this, and I also got this trace in
systemd-journald once:

 internal_get_user_pages_fast+0x420/0x1728
 pin_user_pages_fast+0x9c/0xc4
 iov_iter_extract_pages+0x234/0x1044
 bio_iov_iter_get_pages+0x248/0xa90
 blkdev_direct_IO.part.0+0x3a0/0x143c
 blkdev_read_iter+0x1cc/0x388
 aio_read.constprop.0+0x1e0/0x324
 io_submit_one.constprop.0+0x378/0x1470
 __arm64_sys_io_submit+0x198/0x2d0
 invoke_syscall.constprop.0+0xd8/0x1e0
 do_el0_svc+0xc4/0x1e0
 el0_svc+0x48/0xc0
 el0t_64_sync_handler+0x120/0x130
 el0t_64_sync+0x190/0x194

Right now with my kernel build, this happens basically every boot.
However, it wasn't always like that, and I'm not sure what other
environmental differences affect the outcome. I guess since it's reading
random stack memory, it depends on what's there...

I disabled lvm2-monitor.service and tried to boot again (to unwedge
things) and this time it oopsed in udisksd like this (I recall seeing
this one before too):

 internal_get_user_pages_fast+0x420/0x1728
 pin_user_pages_fast+0x9c/0xc4
 iov_iter_extract_pages+0x234/0x1044
 bio_map_user_iov+0x214/0x724
 blk_rq_map_user_iov+0x8b0/0x1080
 blk_rq_map_user_io+0x138/0x17c
 nvme_map_user_request.isra.0+0x2b4/0x3d8
 nvme_submit_user_cmd.isra.0+0x21c/0x300
 nvme_user_cmd.isra.0+0x200/0x3fc
 nvme_dev_ioctl+0x284/0x480
 __arm64_sys_ioctl+0x550/0x1b80
 invoke_syscall.constprop.0+0xd8/0x1e0
 do_el0_svc+0xc4/0x1e0
 el0_svc+0x48/0xc0
 el0t_64_sync_handler+0x120/0x130
 el0t_64_sync+0x190/0x194

Prior to this I was testing with glmark2 (this whole thing started with
me trying to debug a downstream GPU driver issue, which I now realize is
a completely unrelated bug but it led me down this rabbit hole because
it was first reported by someone coincidentally compiling with LPA2 on).
I use this process termination torture test incantation, and I just
confirmed that it still oopses even with just a simpledrm framebuffer
and llvmpipe (software rendering) and no GPU/"full" KMS drivers compiled
in, so it should be reproducible on other platforms hopefully. The repro
is reliable within a few seconds. Running on KDE Plasma for whatever
that's worth:

while true; do WAYLAND_DISPLAY=wayland-0 timeout -s TERM -k 0 0.5 \
    glmark2-es2-wayland & sleep 0.02 ; done

The trace looks like:

[  301.830742]  internal_get_user_pages_fast+0x248/0xcd0
[  301.831343]  get_user_pages_fast+0x48/0x60
[  301.831897]  get_futex_key+0xa4/0x3d0
[  301.832281]  futex_wait_setup+0x6c/0x164
[  301.832777]  __futex_wait+0xbc/0x15c
[  301.833200]  futex_wait+0x88/0x110
[  301.833596]  do_futex+0xf8/0x1a0
[  301.833926]  __arm64_sys_futex+0xec/0x188
[  301.834417]  invoke_syscall.constprop.0+0x50/0xe4
[  301.835029]  do_el0_svc+0x40/0xdc
[  301.835378]  el0_svc+0x3c/0x140
[  301.835796]  el0t_64_sync_handler+0x120/0x12c
[  301.836365]  el0t_64_sync+0x190/0x194

So it looks like aio, nvme ioctls, and futexes are things that tend to
trigger it. However, it's definitely not always obvious. The person
reporting the GPU issues with an LPA2 kernel build clearly wasn't having
their system crash on every boot, and I myself ended up with kernels
where the only repro was the glmark2 invocation and things weren't just
oopsing on boot.

Going down the aio path, I tried fio, but it turns out that it fairly
reliably reproduces the futex oops when configured to fail, just like
glmark2 (I couldn't trivially get it to oops on actual aio usage...).

while true; do fio --filename=nonexist --size=1M --rw=read --bs=16k \
    --ioengine=libaio --numjobs=10 --name=a --readonly; done

(This will complain with "fio: refusing extend of file due to read-only"
and generally not actually run a proper test, but it still repros the
futex problem)

It's not just any futex usage though. I tried writing a trivial futex
test app, and also running the tools/testing/selftests/futex selftests,
and those don't seem to trigger it.
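
One detail that may matter when writing such a test: the futex trace above goes through get_futex_key() into get_user_pages_fast(), and as far as the upstream source reads, that path is only taken for futexes issued without FUTEX_PRIVATE_FLAG. Treat that as an assumption to check against the kernel version in use. The sketch below is not the test app mentioned above and has not been shown to reproduce the oops. It only illustrates a non-blocking FUTEX_WAIT in both flavours, and the helper name futex_wait_mismatch is made up for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Futex word used by the helper; only touched by the calling thread. */
static uint32_t demo_word;

/*
 * Issue a FUTEX_WAIT whose expected value deliberately differs from the
 * futex word, so the call returns immediately instead of sleeping.
 * shared != 0 omits FUTEX_PRIVATE_FLAG (process-shared futex).
 * Returns 0 on success, otherwise the errno value from the syscall.
 */
static int futex_wait_mismatch(int shared)
{
    int op = shared ? FUTEX_WAIT : FUTEX_WAIT_PRIVATE;
    uint32_t expected = demo_word + 1;   /* guaranteed mismatch */
    long ret = syscall(SYS_futex, &demo_word, op, expected, NULL, NULL, 0);

    return ret == -1 ? errno : 0;
}
```

Calling futex_wait_mismatch(1) in a loop exercises the shared-futex key lookup without ever blocking. Both variants are expected to fail with EAGAIN because the value check does not match.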

> Thanks,
> 
> Will
> 
> --->8
> 
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index f8efbc128446..3afe624a39e1 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1065,6 +1065,13 @@ static inline bool pgtable_l5_enabled(void) { return false; }
>  
>  #define p4d_offset_kimg(dir,addr)      ((p4d_t *)dir)
>  
> +static inline
> +p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long addr)
> +{
> +       return p4d_offset(pgdp, addr);
> +}
> +#define p4d_offset_lockless p4d_offset_lockless
> +
>  #endif  /* CONFIG_PGTABLE_LEVELS > 4 */
>  
>  #define pgd_ERROR(e)   \

~~ Lina