lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20231018143024.79vRO3R3@linutronix.de>
Date:   Wed, 18 Oct 2023 16:30:24 +0200
From:   Sebastian Andrzej Siewior <bigeasy@...utronix.de>
To:     Pierre Gondois <pierre.gondois@....com>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        LKML <linux-kernel@...r.kernel.org>,
        linux-rt-users@...r.kernel.org,
        Steven Rostedt <rostedt@...dmis.org>
Subject: Re: [ANNOUNCE] v6.6-rc4-rt7

On 2023-10-10 11:23:59 [+0200], Pierre Gondois wrote:
> Hello,
Hi,

> The issue seems to be related to this patchset:
> https://lore.kernel.org/all/20230112194314.845371875@infradead.org/
> but as I was unable to really diagnose the issue, it might aswell be
> something else.

Is it easy to figure when this faulty behaviour was introduced?

> It seems the memory/registers get corrupted, cf the recurring error
> messages:
> - Undefined instruction
> - Unable to handle kernel paging request at virtual address
> - Mem abort info
> Splats seem to happen while taking IRQs while going out of idle or
> when handling a syscall. More splat variations could be generated,
> but 5 should be enough I believe.
> 
> When running a non-PREEMPT_RT kernel, splats don't appear, so the issue
> might be related to the way locks are handled in PREEMPT_RT. I don't
> deeply understand the relation between rcu/irq/tracing so far, if someone
> has an idea of what could happen, this would be helpful :)

I've been looking at these splats. Do you have auth-pointer or shadow
stack enabled?

> Regards,
> Pierre
> 
> Splats:
> [splat-1]
> [...] Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT_RT SMP
…
> All code
> ========
>    0:   a8c47bfd        ldp     x29, x30, [sp], #64
>    4:   d50323bf        autiasp
>    8:   d65f03c0        ret
>    c:   d503201f        nop
>   10:*  d503233f        paciasp         <-- trapping instruction

paciasp is the undefined instruction. I don't see paciasp to be patched
out if not supported by the CPU so it is a NOP if not supported. What
could go wrong here?

> [splat-2]
> [...] Unable to handle kernel paging request at virtual address 001c71c71c71d434
…
> [...] pc : ttwu_do_activate (kernel/sched/core.c:3855)
> [...] lr : ttwu_do_activate (kernel/sched/sched.h:1363 kernel/sched/sched.h:1507 kernel/sched/core.c:3846)
…
> [...] x20: ffff80008000bcd8 x19: ffff00097eedebc0 x18: 071c71c71c71c71c
> [...] x8 : 00000000000645ab x7 : ffff800080149d58 x6 : 0000000000000000

…
> All code
> ========
>    0:   f001dd89        adrp    x9, 0x3bb3000
>    4:   f9068668        str     x8, [x19, #3336]
>    8:   f9418129        ldr     x9, [x9, #768]
>    c:   d341fd08        lsr     x8, x8, #1
>   10:*  f9068e68        str     x8, [x19, #3352]                <-- trapping instruction

So this reads as
	store x8 to x19 + #3352, 0xffff00097eedebc0 + 3352
	store 0x00000000000645ab to 0xffff00097eedf8d8

But the kernel complains about 001c71c71c71d434 which is not exactly
what I computed. But it is familiar to x18. Looking at those two
 0x07 1c71c71c71 c71c
 0x00 1c71c71c71 d434

the pattern in the middle is the same. And 0xd434 - 0xc71c = 0xd18 which
is 3352. 

x18 is the shadow stack (?) and contains the same value in splat-1 and
is zero in splat-3 and splat-4.

> [splat-3]
> [...] Mem abort info:
> [...]   ESR = 0x0000000096000045
> [...]   EC = 0x25: DABT (current EL), IL = 32 bits
> [...]   SET = 0, FnV = 0
> [...]   EA = 0, S1PTW = 0
> [...]   FSC = 0x05: level 1 translation fault
> [...] Data abort info:
> [...]   ISV = 0, ISS = 0x00000045, ISS2 = 0x00000000
> [...]   CM = 0, WnR = 1, TnD = 0, TagAccess = 0
> [...]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> [...] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000ed344000
> [...] [ffff800886adb8b8] pgd=10000009fffff003, p4d=10000009fffff003, pud=0000000000000000
> [...] Internal error: Oops: 0000000096000045 [#1] PREEMPT_RT SMP
> [...] Modules linked in:
> [...] CPU: 1 PID: 264 Comm: rtla-static Not tainted 6.6.0-rc4-rt8-00102-g97b0e2d47443 #1193
> [...] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Oct  4 2023
> [...] pstate: 204000c5 (nzCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [...] pc : rcu_is_watching (kernel/rcu/tree.c:695)
> [...] lr : trace_irq_disable (./include/trace/events/preemptirq.h:36)
> [...] sp : ffff800086adb8d0
> [...] x29: ffff800086adb8e0 x28: ffff0008062a4ec0 x27: 0000000000000030
> All code
> ========
>    0:   54fffb61        b.ne    0xffffffffffffff6c  // b.any
>    4:   17ffffe4        b       0xffffffffffffff94
>    8:   d503201f        nop
>    c:   d503233f        paciasp
>   10:*  a9be7bfd        stp     x29, x30, [sp, #-32]!           <-- trapping instruction

This looks like a stack entry. So this looks sane and if SP is correct
then nothing should go wrong. The fault says "translation fault" so my
guess would be that SP is not correct and we have to page tables backing
ffff800086adb8d0.

> [splat-4]
> [...] Mem abort info:
> [...]   ESR = 0x0000000096000045
> [...]   EC = 0x25: DABT (current EL), IL = 32 bits
> [...]   SET = 0, FnV = 0
> [...]   EA = 0, S1PTW = 0
> [...]   FSC = 0x05: level 1 translation fault
> [...] Data abort info:
> [...]   ISV = 0, ISS = 0x00000045, ISS2 = 0x00000000
> [...]   CM = 0, WnR = 1, TnD = 0, TagAccess = 0
> [...]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
…
> [...] lr : trace_irq_disable (./include/trace/events/preemptirq.h:36)
> [...] sp : ffff800086adb8d0
> [...] x29: ffff800086adb8e0 x28: ffff0008062a4ec0 x27: 0000000000000030
…
>    c:   d503233f        paciasp
>   10:*  a9be7bfd        stp     x29, x30, [sp, #-32]!           <-- trapping instruction

This seems to be same as splat-3 including register.

> [splat-5]
> [...] Internal error: Oops - Undefined instruction: 0000000002000000 [#1] PREEMPT_RT SMP
> [...] Modules linked in:
> [...] CPU: 2 PID: 40 Comm: rcuc/2 Not tainted 6.6.0-rc4-rt8-00102-g97b0e2d47443 #1194
> [...] Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Oct  4 2023
> [...] pstate: 804000c5 (Nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [...] pc : trace_pelt_irq_tp (./include/trace/events/sched.h:?)
> [...] lr : irqtime_account_irq (kernel/sched/cputime.c:64)
> [...] sp : ffff8000851d3ce0
> [...] x29: ffff8000851d3ce0 x28: 0000000000000020 x27: ffff800083ce4e80
> [...] x26: ffff800083d46180 x25: 000000000000000a x24: 0000000000000000
> [...] x23: 0000000000000007 x22: 0000000000000000 x21: ffff00097eeebf50
> [...] x20: 0000000000002a08 x19: ffff00080092b480 x18: ffff8000850fd038
> [...] x17: ffff800084e05000 x16: ffff800084445bf0 x15: 0000000008a87beb
> [...] x14: 000000003bb0a251 x13: 0000000000000006 x12: 0000000934346b33
> [...] x11: 0000000100000000 x10: 0000000000000001 x9 : 0000000014443054
> [...] x8 : ffff00097eeebeb0 x7 : ffff80008012d608 x6 : 0000000000000000
> [...] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff8000851d3d80
> [...] x2 : ffff00080092b480 x1 : 0000000000000000 x0 : 000000b814aa1780
> [...] Call trace:
> [...] trace_pelt_irq_tp (./include/trace/events/sched.h:?)
> [...] __do_softirq (./include/linux/vtime.h:? kernel/softirq.c:593)
> [...] __local_bh_enable_ip (kernel/softirq.c:?)
> [...] local_bh_enable (./include/linux/bottom_half.h:34)
> [...] rcu_cpu_kthread (kernel/rcu/tree.c:2493)
> [...] smpboot_thread_fn (kernel/smpboot.c:?)
> [...] kthread (kernel/kthread.c:389)
> [...] ret_from_fork (arch/arm64/kernel/entry.S:854)
> [...] Code: 17ffffc2 d4210000 17ffffe4 d503201f (819e3608)
> All code
> ========
>    0:   17ffffc2        b       0xffffffffffffff08
>    4:   d4210000        brk     #0x800
>    8:   17ffffe4        b       0xffffffffffffff98
>    c:   d503201f        nop
>   10:*  819e3608        .inst   0x819e3608 ; undefined          <-- trapping instruction

Knowing what PC is could help to figure out if this is really
trace_pelt_irq_tp. The brk opcode could be a warning since it jumps back
afterwards. But the trapping code a different/ wrong page that is mapped
here. Or it jumped too far.
I don't know why trace_pelt_irq_tp is visible here. This should be just
a nop which is patched at runtime to some underscroll function ;) I
might be missing something.

Sebastian

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ