Message-ID: <fec617e3-8955-42c6-9cca-588e86833998@oracle.com>
Date: Mon, 8 Sep 2025 15:47:06 -0700
From: Anthony Yznaga <anthony.yznaga@...cle.com>
To: John Paul Adrian Glaubitz <glaubitz@...sik.fu-berlin.de>,
Michael Karcher <kernel@...rcher.dialup.fu-berlin.de>,
Andreas Larsson <andreas@...sler.com>
Cc: sparclinux@...r.kernel.org, linux-kernel@...r.kernel.org,
René Rebe <rene@...ctcode.com>
Subject: Re: [PATCH v4 2/5] sparc: fix accurate exception reporting in
copy_{from_to}_user for UltraSPARC III
On 9/7/25 11:53 PM, John Paul Adrian Glaubitz wrote:
> On Mon, 2025-09-08 at 08:47 +0200, John Paul Adrian Glaubitz wrote:
>> Hi,
>>
>> On Mon, 2025-09-08 at 08:30 +0200, John Paul Adrian Glaubitz wrote:
>>> Hi,
>>>
>>> On Sun, 2025-09-07 at 23:31 +0200, John Paul Adrian Glaubitz wrote:
>>>> Hi,
>>>>
>>>> On Sun, 2025-09-07 at 20:33 +0200, John Paul Adrian Glaubitz wrote:
>>>>> I assume that cheetah_patch_cachetlbops has to be invoked on UltraSPARC III
>>>>> since there is other code depending on it. On the other hand, the TLB code
>>>>> on UltraSPARC III was heavily overhauled in 2016 [1] which was also followed
>>>>> by a bug fix [2].
>>>>>
>>>>> Chances are there are still bugs in the code introduced in [1].
>>>>>
>>>>>> [1] https://github.com/torvalds/linux/commit/a74ad5e660a9ee1d071665e7e8ad822784a2dc7f
>>>>>> [2] https://github.com/torvalds/linux/commit/d3c976c14ad8af421134c428b0a89ff8dd3bd8f8
>>>>
>>>> I have reverted both commits. The machine boots until it tries to start
>>>> systemd when it locks up. So, I guess if there is a bug in the TLB code
>>>> it needs to be diagnosed differently.
>>>
>>> Another test with a kernel source rebased to 6.17-rc5+, with the following patch applied
>>> by Anthony Yznaga and CONFIG_SMP disabled:
>>>
>>> diff --git a/arch/sparc/mm/ultra.S b/arch/sparc/mm/ultra.S
>>> index 70e658d107e0..b323db303de1 100644
>>> --- a/arch/sparc/mm/ultra.S
>>> +++ b/arch/sparc/mm/ultra.S
>>> @@ -347,6 +347,7 @@ __cheetah_flush_tlb_kernel_range: /* 31 insns */
>>> membar #Sync
>>> stxa %g0, [%o4] ASI_IMMU_DEMAP
>>> membar #Sync
>>> + flush
>>> retl
>>> nop
>>> nop
>>> @@ -355,7 +356,6 @@ __cheetah_flush_tlb_kernel_range: /* 31 insns */
>>> nop
>>> nop
>>> nop
>>> - nop
>>>
>>> #ifdef DCACHE_ALIASING_POSSIBLE
>>> __cheetah_flush_dcache_page: /* 11 insns */
>>>
>>> Still crashes:
>>>
>>> [ 139.236744] tsk->{mm,active_mm}->context = 00000000000000ab
>>> [ 139.310042] tsk->{mm,active_mm}->pgd = fff0000007db8000
>>> [ 139.378747] \|/ ____ \|/
>>> [ 139.378747] "@'/ .. \`@"
>>> [ 139.378747] /_| \__/ |_\
>>> [ 139.378747] \__U_/
>>> [ 139.572059] systemd(1): Oops [#1]
>>> [ 139.615613] CPU: 0 UID: 0 PID: 1 Comm: systemd Not tainted 6.17.0-rc5+ #19 NONE
>>> [ 139.712832] TSTATE: 0000004411001602 TPC: 00000000005e29e4 TNPC: 00000000005e29e8 Y: 00000000 Not tainted
>>> [ 139.842076] TPC: <bpf_patch_insn_data+0x204/0x2e0>
>>> [ 139.905077] g0: ffffffffffffffff g1: 0000000000000000 g2: 0000000000000065 g3: fff0000009618b28
>>> [ 140.019460] g4: fff00000001f9500 g5: 0000000000657300 g6: fff000000022c000 g7: 0000000000000001
>>> [ 140.133837] o0: 0000000100058000 o1: 0000000000000000 o2: 0000000000000001 o3: 0000000000000002
>>> [ 140.248208] o4: fff00000045ec900 o5: 0000000000000002 sp: fff000000022f031 ret_pc: 00000000005e2998
>>> [ 140.367158] RPC: <bpf_patch_insn_data+0x1b8/0x2e0>
>>> [ 140.430057] l0: fff0000009618000 l1: 0000000100046048 l2: 0000000000000001 l3: 0000000100058000
>>> [ 140.544437] l4: 0000000100046068 l5: 0000000000000005 l6: 0000000000000000 l7: fff000000961e128
>>> [ 140.658810] i0: 0000000100046000 i1: 0000000000000004 i2: 0000000000000005 i3: 0000000000000002
>>> [ 140.773189] i4: 0000000100066000 i5: fff0000009618ae8 i6: fff000000022f0e1 i7: 0000000000607a08
>>> [ 140.887561] I7: <bpf_check+0x1988/0x34a0>
>>> [ 140.940171] Call Trace:
>>> [ 140.972191] [<0000000000607a08>] bpf_check+0x1988/0x34a0
>>> [ 141.041963] [<00000000005d862c>] bpf_prog_load+0x8ec/0xc80
>>> [ 141.114021] [<00000000005d9be4>] __sys_bpf+0x724/0x28a0
>>> [ 141.182646] [<00000000005dc338>] sys_bpf+0x18/0x60
>>> [ 141.245551] [<0000000000406174>] linux_sparc_syscall+0x34/0x44
>>> [ 141.322185] Disabling lock debugging due to kernel taint
>>> [ 141.391952] Caller[0000000000607a08]: bpf_check+0x1988/0x34a0
>>> [ 141.467440] Caller[00000000005d862c]: bpf_prog_load+0x8ec/0xc80
>>> [ 141.545212] Caller[00000000005d9be4]: __sys_bpf+0x724/0x28a0
>>> [ 141.619558] Caller[00000000005dc338]: sys_bpf+0x18/0x60
>>> [ 141.688179] Caller[0000000000406174]: linux_sparc_syscall+0x34/0x44
>>> [ 141.770535] Caller[fff000010089b80c]: 0xfff000010089b80c
>>> [ 141.840301] Instruction DUMP:
>>> [ 141.840305] 326ffffa
>>> [ 141.879185] c4004000
>>> [ 141.910065] c25e2038
>>> [ 141.940945] <c4006108>
>>> [ 141.971827] 80a0a000
>>> [ 142.002709] 04400014
>>> [ 142.033589] c25860f0
>>> [ 142.064474] 8400bfff
>>> [ 142.095354] 8e00606c
>>> [ 142.126234]
>>> [ 142.176560] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
>>> [ 142.277218] Press Stop-A (L1-A) from sun keyboard or send break
>>> [ 142.277218] twice on console to return to the boot prom
>>> [ 142.423608] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]---
>>
>> Disabling support for Transparent Huge Pages (CONFIG_THP) avoids the crash.
>
> Sorry, the option is called CONFIG_TRANSPARENT_HUGEPAGE, of course.
>
> My suspicion is that it's related to the flushing for D-cache alias handling, which is
> enabled for small pages only:
>
> https://elixir.bootlin.com/linux/v6.16.5/source/arch/sparc/mm/ultra.S#L1016
>
> and:
>
> https://elixir.bootlin.com/linux/v6.16.5/source/arch/sparc/include/asm/page_64.h#L9
>
> Interestingly, while running the reproducer with CONFIG_TRANSPARENT_HUGEPAGE disabled,
> I'm also getting this kernel warning, but the kernel does not crash:
>
> [ 108.733686] CPU[0]: Cheetah+ D-cache parity error at TPC[00000000005d78b4]
> [ 108.824096] TPC<bpf_prog_load+0x394/0xc80>
>
> Could it be that we need to enable the code guarded by DCACHE_ALIASING_POSSIBLE
> unconditionally?
It's already essentially enabled unconditionally. PAGE_SHIFT will always
be 13 on sparc64 systems.
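
For reference, a minimal sketch of why that is. It assumes the guard in
arch/sparc/include/asm/page_64.h compares PAGE_SHIFT against 14 (16K pages);
the helper dcache_aliasing_possible() is hypothetical and only there to make
the result of the preprocessor check visible:

```c
/* Sketch only: assumed to mirror the guard in
 * arch/sparc/include/asm/page_64.h; the real header may differ in detail. */
#define PAGE_SHIFT 13                    /* 8K base pages on sparc64 */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* D-cache alias flushing is only needed for pages smaller than 16K. */
#if PAGE_SHIFT < 14
#define DCACHE_ALIASING_POSSIBLE
#endif

/* Hypothetical helper (not in the kernel): reports whether the code
 * guarded by DCACHE_ALIASING_POSSIBLE would be compiled in. */
static int dcache_aliasing_possible(void)
{
#ifdef DCACHE_ALIASING_POSSIBLE
	return 1;
#else
	return 0;
#endif
}
```

With PAGE_SHIFT fixed at 13 the macro is always defined, so the guarded
flush code is always built regardless of CONFIG_TRANSPARENT_HUGEPAGE.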

The flushing should be happening for folios of any size. See
flush_dcache_folio()/flush_dcache_folio_all().

You could try setting page_poison=1 on the kernel command line to see if
the kernel detects any freed pages being used.

Is this a different Cheetah+-based system than the one I borrowed?

Definitely some sort of memory corruption happening, but the system I
used seemed much more stable than this.

Anthony
>
> Adrian
>