[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4ad5180d-cb4d-b916-3872-b24c5a2cd1d8@loongson.cn>
Date: Tue, 30 Dec 2025 12:03:54 +0800
From: Bibo Mao <maobibo@...ngson.cn>
To: Jinyang He <hejinyang@...ngson.cn>, lixianglai <lixianglai@...ngson.cn>
Cc: loongarch@...ts.linux.dev, linux-kernel@...r.kernel.org,
kvm@...r.kernel.org, stable@...r.kernel.org,
Huacai Chen <chenhuacai@...nel.org>, WANG Xuerui <kernel@...0n.name>,
Tianrui Zhao <zhaotianrui@...ngson.cn>,
Charlie Jenkins <charlie@...osinc.com>, Thomas Gleixner
<tglx@...utronix.de>, Tiezhu Yang <yangtiezhu@...ngson.cn>
Subject: Re: [PATCH V3 2/2] LoongArch: KVM: fix "unreliable stack" issue
On 2025/12/30 上午11:36, Jinyang He wrote:
> On 2025-12-30 10:24, Bibo Mao wrote:
>
>>
>>
>> On 2025/12/29 下午6:41, Jinyang He wrote:
>>> On 2025-12-29 18:11, lixianglai wrote:
>>>
>>>> Hi Jinyang:
>>>>>
>>>>> On 2025-12-29 11:53, lixianglai wrote:
>>>>>> Hi Jinyang:
>>>>>>> On 2025-12-27 09:27, Xianglai Li wrote:
>>>>>>>
>>>>>>>> Insert the appropriate UNWIND macro definition into the
>>>>>>>> kvm_exc_entry in
>>>>>>>> the assembly function to guide the generation of correct ORC
>>>>>>>> table entries,
>>>>>>>> thereby solving the timeout problem of loading the
>>>>>>>> livepatch-sample module
>>>>>>>> on a physical machine running multiple vcpus virtual machines.
>>>>>>>>
>>>>>>>> While solving the above problems, we have gained an additional
>>>>>>>> benefit,
>>>>>>>> that is, we can obtain more call stack information
>>>>>>>>
>>>>>>>> Stack information that can be obtained before the problem is fixed:
>>>>>>>> [<0>] kvm_vcpu_block+0x88/0x120 [kvm]
>>>>>>>> [<0>] kvm_vcpu_halt+0x68/0x580 [kvm]
>>>>>>>> [<0>] kvm_emu_idle+0xd4/0xf0 [kvm]
>>>>>>>> [<0>] kvm_handle_gspr+0x7c/0x700 [kvm]
>>>>>>>> [<0>] kvm_handle_exit+0x160/0x270 [kvm]
>>>>>>>> [<0>] kvm_exc_entry+0x100/0x1e0
>>>>>>>>
>>>>>>>> Stack information that can be obtained after the problem is fixed:
>>>>>>>> [<0>] kvm_vcpu_block+0x88/0x120 [kvm]
>>>>>>>> [<0>] kvm_vcpu_halt+0x68/0x580 [kvm]
>>>>>>>> [<0>] kvm_emu_idle+0xd4/0xf0 [kvm]
>>>>>>>> [<0>] kvm_handle_gspr+0x7c/0x700 [kvm]
>>>>>>>> [<0>] kvm_handle_exit+0x160/0x270 [kvm]
>>>>>>>> [<0>] kvm_exc_entry+0x104/0x1e4
>>>>>>>> [<0>] kvm_enter_guest+0x38/0x11c
>>>>>>>> [<0>] kvm_arch_vcpu_ioctl_run+0x26c/0x498 [kvm]
>>>>>>>> [<0>] kvm_vcpu_ioctl+0x200/0xcf8 [kvm]
>>>>>>>> [<0>] sys_ioctl+0x498/0xf00
>>>>>>>> [<0>] do_syscall+0x98/0x1d0
>>>>>>>> [<0>] handle_syscall+0xb8/0x158
>>>>>>>>
>>>>>>>> Cc: stable@...r.kernel.org
>>>>>>>> Signed-off-by: Xianglai Li <lixianglai@...ngson.cn>
>>>>>>>> ---
>>>>>>>> Cc: Huacai Chen <chenhuacai@...nel.org>
>>>>>>>> Cc: WANG Xuerui <kernel@...0n.name>
>>>>>>>> Cc: Tianrui Zhao <zhaotianrui@...ngson.cn>
>>>>>>>> Cc: Bibo Mao <maobibo@...ngson.cn>
>>>>>>>> Cc: Charlie Jenkins <charlie@...osinc.com>
>>>>>>>> Cc: Xianglai Li <lixianglai@...ngson.cn>
>>>>>>>> Cc: Thomas Gleixner <tglx@...utronix.de>
>>>>>>>> Cc: Tiezhu Yang <yangtiezhu@...ngson.cn>
>>>>>>>>
>>>>>>>> arch/loongarch/kvm/switch.S | 28 +++++++++++++++++++---------
>>>>>>>> 1 file changed, 19 insertions(+), 9 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/arch/loongarch/kvm/switch.S
>>>>>>>> b/arch/loongarch/kvm/switch.S
>>>>>>>> index 93845ce53651..a3ea9567dbe5 100644
>>>>>>>> --- a/arch/loongarch/kvm/switch.S
>>>>>>>> +++ b/arch/loongarch/kvm/switch.S
>>>>>>>> @@ -10,6 +10,7 @@
>>>>>>>> #include <asm/loongarch.h>
>>>>>>>> #include <asm/regdef.h>
>>>>>>>> #include <asm/unwind_hints.h>
>>>>>>>> +#include <linux/kvm_types.h>
>>>>>>>> #define HGPR_OFFSET(x) (PT_R0 + 8*x)
>>>>>>>> #define GGPR_OFFSET(x) (KVM_ARCH_GGPR + 8*x)
>>>>>>>> @@ -110,9 +111,9 @@
>>>>>>>> * need to copy world switch code to DMW area.
>>>>>>>> */
>>>>>>>> .text
>>>>>>>> + .p2align PAGE_SHIFT
>>>>>>>> .cfi_sections .debug_frame
>>>>>>>> SYM_CODE_START(kvm_exc_entry)
>>>>>>>> - .p2align PAGE_SHIFT
>>>>>>>> UNWIND_HINT_UNDEFINED
>>>>>>>> csrwr a2, KVM_TEMP_KS
>>>>>>>> csrrd a2, KVM_VCPU_KS
>>>>>>>> @@ -170,6 +171,7 @@ SYM_CODE_START(kvm_exc_entry)
>>>>>>>> /* restore per cpu register */
>>>>>>>> ld.d u0, a2, KVM_ARCH_HPERCPU
>>>>>>>> addi.d sp, sp, -PT_SIZE
>>>>>>>> + UNWIND_HINT_REGS
>>>>>>>> /* Prepare handle exception */
>>>>>>>> or a0, s0, zero
>>>>>>>> @@ -200,7 +202,7 @@ ret_to_host:
>>>>>>>> jr ra
>>>>>>>> SYM_CODE_END(kvm_exc_entry)
>>>>>>>> -EXPORT_SYMBOL(kvm_exc_entry)
>>>>>>>> +EXPORT_SYMBOL_FOR_KVM(kvm_exc_entry)
>>>>>>>> /*
>>>>>>>> * int kvm_enter_guest(struct kvm_run *run, struct kvm_vcpu
>>>>>>>> *vcpu)
>>>>>>>> @@ -215,6 +217,14 @@ SYM_FUNC_START(kvm_enter_guest)
>>>>>>>> /* Save host GPRs */
>>>>>>>> kvm_save_host_gpr a2
>>>>>>>> + /*
>>>>>>>> + * The csr_era member variable of the pt_regs structure is
>>>>>>>> required
>>>>>>>> + * for unwinding orc to perform stack traceback, so we need
>>>>>>>> to put
>>>>>>>> + * pc into csr_era member variable here.
>>>>>>>> + */
>>>>>>>> + pcaddi t0, 0
>>>>>>>> + st.d t0, a2, PT_ERA
>>>>>>> Hi, Xianglai,
>>>>>>>
>>>>>>> It should use `SYM_CODE_START` to mark the `kvm_enter_guest`
>>>>>>> rather than
>>>>>>> `SYM_FUNC_START`, since the `SYM_FUNC_START` is used to mark
>>>>>>> "C-likely"
>>>>>>> asm functionw.
>>>>>>
>>>>>> Ok, I will use SYM_CODE_START to mark kvm_enter_guest in the next
>>>>>> version.
>>>>>>
>>>>>>> I guess the kvm_enter_guest is something like exception
>>>>>>> handler becuase the last instruction is "ertn". So usually it should
>>>>>>> mark UNWIND_HINT_REGS where can find last frame info by "$sp".
>>>>>>> However, all info is store to "$a2", this mark should be
>>>>>>> `UNWIND_HINT sp_reg=ORC_REG_A2(???) type=UNWIND_HINT_TYPE_REGS`.
>>>>>>> I don't konw why save this function internal PC here by `pcaddi
>>>>>>> t0, 0`,
>>>>>>> and I think it is no meaning(, for exception handler, they save
>>>>>>> last PC
>>>>>>> by read CSR.ERA). The `kvm_enter_guest` saves registers by
>>>>>>> "$a2"("$sp" - PT_REGS) beyond stack ("$sp"), it is dangerous if IE
>>>>>>> is enable. So I wonder if there is really a stacktrace through
>>>>>>> this function?
>>>>>>>
>>>>>> The stack backtracking issue in switch.S is rather complex because
>>>>>> it involves the switching between cpu root-mode and guest-mode:
>>>>>> Real stack backtracking should be divided into two parts:
>>>>>> part 1:
>>>>>> [<0>] kvm_enter_guest+0x38/0x11c
>>>>>> [<0>] kvm_arch_vcpu_ioctl_run+0x26c/0x498 [kvm]
>>>>>> [<0>] kvm_vcpu_ioctl+0x200/0xcf8 [kvm]
>>>>>> [<0>] sys_ioctl+0x498/0xf00
>>>>>> [<0>] do_syscall+0x98/0x1d0
>>>>>> [<0>] handle_syscall+0xb8/0x158
>>>>>>
>>>>>> part 2:
>>>>>> [<0>] kvm_vcpu_block+0x88/0x120 [kvm]
>>>>>> [<0>] kvm_vcpu_halt+0x68/0x580 [kvm]
>>>>>> [<0>] kvm_emu_idle+0xd4/0xf0 [kvm]
>>>>>> [<0>] kvm_handle_gspr+0x7c/0x700 [kvm]
>>>>>> [<0>] kvm_handle_exit+0x160/0x270 [kvm]
>>>>>> [<0>] kvm_exc_entry+0x104/0x1e4
>>>>>>
>>>>>>
>>>>>> In "part 1", after executing kvm_enter_guest, the cpu switches
>>>>>> from root-mode to guest-mode.
>>>>>> In this case, stack backtracking is indeed very rare.
>>>>>>
>>>>>> In "part 2", the cpu switches from the guest-mode to the root-mode,
>>>>>> and most of the stack backtracking occurs during this phase.
>>>>>>
>>>>>> To obtain the longest call chain, we save pc in kvm_enter_guest to
>>>>>> pt_regs.csr_era,
>>>>>> and after restoring the sp of the root-mode cpu in kvm_exc_entry,
>>>>>> The ORC entry was re-established using "UNWIND_HINT_REGS",
>>>>>> and then we obtained the following stack backtrace as we wanted:
>>>>>>
>>>>>> [<0>] kvm_vcpu_block+0x88/0x120 [kvm]
>>>>>> [<0>] kvm_vcpu_halt+0x68/0x580 [kvm]
>>>>>> [<0>] kvm_emu_idle+0xd4/0xf0 [kvm]
>>>>>> [<0>] kvm_handle_gspr+0x7c/0x700 [kvm]
>>>>>> [<0>] kvm_handle_exit+0x160/0x270 [kvm]
>>>>>> [<0>] kvm_exc_entry+0x104/0x1e4
>>>>> I found this might be a coincidence—correct behavior due to the
>>>>> incorrect
>>>>> UNWIND_HINT_REGS mark and unusual stack adjustment.
>>>>>
>>>>> First, the kvm_enter_guest contains only a single branch
>>>>> instruction, ertn.
>>>>> It hardware-jump to the CSR.ERA address directly, jump into
>>>>> kvm_exc_entry.
>>>>>
>>>>> At this point, the stack layout looks like this:
>>>>> -------------------------------
>>>>> frame from call to `kvm_enter_guest`
>>>>> ------------------------------- <- $sp
>>>>> PT_REGS
>>>>> ------------------------------- <- $a2
>>>>>
>>>>> Then kvm_exc_entry adjust stack without save any register (e.g.
>>>>> $ra, $sp)
>>>>> but still marked UNWIND_HINT_REGS.
>>>>> After the adjustment:
>>>>> -------------------------------
>>>>> frame from call to `kvm_enter_guest`
>>>>> -------------------------------
>>>>> PT_REGS
>>>>> ------------------------------- <- $a2, new $sp
>>>>>
>>>>> During unwinding, when the unwinder reaches kvm_exc_entry,
>>>>> it meets the mark of PT_REGS and correctly recovers
>>>>> pc = regs.csr_era, sp = regs.sp, ra = regs.ra
>>>>>
>>>> Yes, here unwinder does work as you say.
>>>>
>>>>> a) Can we avoid "ertn" rather than `jr reg (or jirl ra, reg, 0)`
>>>>> instead, like call?
>>>> No, we need to rely on the 'ertn instruction return PIE to CRMD IE,
>>>> at the same time to ensure that its atomic,
>>>> there should be no other instruction than' ertn 'more appropriate here.
>>> You are right! I got it.
>>>>
>>>>> The kvm_exc_entry cannot back to kvm_enter_guest
>>>>> if we use "ertn", so should the kvm_enter_guest appear on the
>>>>> stacktrace?
>>>>>
>>>>
>>>> It is flexible. As I mentioned above, the cpu completes the switch
>>>> from host-mode to guest mode through kvm_enter_guest,
>>>> and then the switch from guest mode to host-mode through
>>>> kvm_exc_entry. When we ignore the details of the host-mode
>>>> and guest-mode switching in the middle, we can understand that the
>>>> host cpu has completed kvm_enter_guest->kvm_exc_entry.
>>>> From this perspective, I think it can exist in the call stack, and
>>>> at the same time, we have obtained the maximum call stack information.
>>>>
>>>>
>>>>> b) Can we adjust $sp before entering kvm_exc_entry? Then we can mark
>>>>> UNWIND_HINT_REGS at the beginning of kvm_exc_entry, which something
>>>>> like ret_from_kernel_thread_asm.
>>>>>
>>>> The following command can be used to dump the orc entries of the
>>>> kernel:
>>>> ./tools/objtool/objtool --dump vmlinux
>>>>
>>>> You can observe that not all orc entries are generated at the
>>>> beginning of the function.
>>>> For example:
>>>> handle_tlb_protect
>>>> ftrace_stub
>>>> handle_reserved
>>>>
>>>> So, is it unnecessary for us to modify UNWIND_HINT_REGS in order to
>>>> place it at the beginning of the function.
>>>>
>>>> If you have a better solution, could you provide an example of the
>>>> modification?
>>>> I can test the feasibility of the solution.
>>>>
>>> The expression at the beginning of the function is incorrect (feeling
>>> sorry).
>>> It should be marked where have all stacktrace info.
>>> Thanks for all the explaining, since I'm unfamiliar with kvm, I need
>>> these to help my understanding.
>>>
>>> Can you try with follows, with save regs by $sp, set more precise era
>>> to pt_regs, and more unwind hint.
>>>
>>>
>>> diff --git a/arch/loongarch/kvm/switch.S b/arch/loongarch/kvm/switch.S
>>> index f1768b7a6194..8ed1d7b72c54 100644
>>> --- a/arch/loongarch/kvm/switch.S
>>> +++ b/arch/loongarch/kvm/switch.S
>>> @@ -14,13 +14,13 @@
>>> #define GGPR_OFFSET(x) (KVM_ARCH_GGPR + 8*x)
>>>
>>> .macro kvm_save_host_gpr base
>>> - .irp n,1,2,3,22,23,24,25,26,27,28,29,30,31
>>> + .irp n,1,2,22,23,24,25,26,27,28,29,30,31
>>> st.d $r\n, \base, HGPR_OFFSET(\n)
>>> .endr
>>> .endm
>>>
>>> .macro kvm_restore_host_gpr base
>>> - .irp n,1,2,3,22,23,24,25,26,27,28,29,30,31
>>> + .irp n,1,2,22,23,24,25,26,27,28,29,30,31
>>> ld.d $r\n, \base, HGPR_OFFSET(\n)
>>> .endr
>>> .endm
>>> @@ -88,6 +88,7 @@
>>> /* Load KVM_ARCH register */
>>> ld.d a2, a2, (KVM_ARCH_GGPR + 8 * REG_A2)
>>>
>>> +111:
>>> ertn /* Switch to guest: GSTAT.PGM = 1, ERRCTL.ISERR = 0,
>>> TLBRPRMD.ISTLBR = 0 */
>>> .endm
>>>
>>> @@ -158,9 +159,10 @@ SYM_CODE_START(kvm_exc_entry)
>>> csrwr t0, LOONGARCH_CSR_GTLBC
>>> ld.d tp, a2, KVM_ARCH_HTP
>>> ld.d sp, a2, KVM_ARCH_HSP
>>> + UNWIND_HINT_REGS
>>> +
>>> /* restore per cpu register */
>>> ld.d u0, a2, KVM_ARCH_HPERCPU
>>> - addi.d sp, sp, -PT_SIZE
>>>
>>> /* Prepare handle exception */
>>> or a0, s0, zero
>>> @@ -184,10 +186,11 @@ SYM_CODE_START(kvm_exc_entry)
>>> csrwr s1, KVM_VCPU_KS
>>> kvm_switch_to_guest
>>>
>>> + UNWIND_HINT_UNDEFINED
>>> ret_to_host:
>>> - ld.d a2, a2, KVM_ARCH_HSP
>>> - addi.d a2, a2, -PT_SIZE
>>> - kvm_restore_host_gpr a2
>>> + ld.d sp, a2, KVM_ARCH_HSP
>>> + kvm_restore_host_gpr sp
>>> + addi.d sp, sp, PT_SIZE
>>> jr ra
>>>
>>> SYM_INNER_LABEL(kvm_exc_entry_end, SYM_L_LOCAL)
>>> @@ -200,11 +203,15 @@ SYM_CODE_END(kvm_exc_entry)
>>> * a0: kvm_run* run
>>> * a1: kvm_vcpu* vcpu
>>> */
>>> -SYM_FUNC_START(kvm_enter_guest)
>>> +SYM_CODE_START(kvm_enter_guest)
>>> + UNWIND_HINT_UNDEFINED
>>> /* Allocate space in stack bottom */
>>> - addi.d a2, sp, -PT_SIZE
>>> + addi.d sp, sp, -PT_SIZE
>>> /* Save host GPRs */
>>> - kvm_save_host_gpr a2
>>> + kvm_save_host_gpr sp
>>> + la.pcrel a2, 111f
>>> + st.d a2, sp, PT_ERA
>>> + UNWIND_HINT_REGS
>>>
>> why the label 111f is more accurate? Supposing there is hw breakpoint
>> here and backtrace is called, what is the call trace stack then?
>> obvious label 111f is not executed instead.
> Xianglai said marking it as regs can get more stack infos, so I use
> UNWIND_HINT_REGS marked here, though it not called. Remove
> UNWIND_HINT_REGS thenforbid unwind from here.
> This function is called and should usually be marked as "call",
> but it is complex by switching the stack and use `ertn` calls
> another function.
>
>
>>
>> UNWIND_HINT_REGS is used for nested kernel stack, is that right?
>> With nested interrupt and exception handlers on LoongArch kernel, is
>> UNWIND_HINT_REGS used?
>>
>> SYM_CODE_START(ret_from_fork_asm)
>> UNWIND_HINT_REGS
>> move a1, sp
>> bl ret_from_fork
>> STACKLEAK_ERASE
>> RESTORE_STATIC
>> RESTORE_SOME
>> RESTORE_SP_AND_RET
>> SYM_CODE_END(ret_from_fork_asm)
>> With this piece of code, what is contents of pt_regs? In generic it is
>> called from sys_clone, era is user PC address, is that right? If so,
>> what is detailed usage in the beginning of ret_from_fork_asm?
> The stacktrace shows the control flow where the PC will go back, so
> it is right because when PC is in ret_from_fork_asm, it is already
> another thread. The era means it will go back user mode.
The problem is that user mode era shows unwind with error, and
user_mode(regs) is not accurate. here is piece of code.
pc = regs->csr_era;
if (!__kernel_text_address(pc))
goto err;
will UNWIND_HINT_END_OF_STACK be better than UNWIND_HINT_REGS?
Regards
Bibo Mao
>
> Jinyang
>>
>> Regards
>> Bibo Mao
>>
>>> addi.d a2, a1, KVM_VCPU_ARCH
>>> st.d sp, a2, KVM_ARCH_HSP
>>>
>>> Jinyang
>>>
>
Powered by blists - more mailing lists