lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9982d344-328e-320d-020f-218ab74ae2b1@huawei.com>
Date:   Tue, 14 Apr 2020 20:19:41 +0800
From:   Xie XiuQi <xiexiuqi@...wei.com>
To:     Mark Rutland <mark.rutland@....com>
CC:     <catalin.marinas@....com>, <will@...nel.org>, <tglx@...utronix.de>,
        <james.morse@....com>, <tanxiaofei@...wei.com>,
        <wangxiongfeng2@...wei.com>,
        <linux-arm-kernel@...ts.infradead.org>,
        <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] arm64: panic on synchronous external abort in kernel
 context

Hi Mark,

On 2020/4/14 18:59, Mark Rutland wrote:
> On Fri, Apr 10, 2020 at 09:52:45AM +0800, Xie XiuQi wrote:
>> We should panic even panic_on_oops is not set, when we can't recover
>> from synchronous external abort in kernel context.
>>
>> Othervise, there are two issues:
>> 1) fallback to do_exit() in exception context, cause this core hung up.
>>    do_sea()
>>    -> arm64_notify_die
>>       -> die
>>          -> do_exit
>> 2) errors may propagated.
>>
>> Signed-off-by: Xie XiuQi <xiexiuqi@...wei.com>
>> Cc: Xiaofei Tan <tanxiaofei@...wei.com>
>> ---
>>  arch/arm64/include/asm/esr.h | 12 ++++++++++++
>>  arch/arm64/kernel/traps.c    |  2 ++
>>  2 files changed, 14 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
>> index cb29253ae86b..acfc71c6d148 100644
>> --- a/arch/arm64/include/asm/esr.h
>> +++ b/arch/arm64/include/asm/esr.h
>> @@ -326,6 +326,18 @@ static inline bool esr_is_data_abort(u32 esr)
>>  	return ec == ESR_ELx_EC_DABT_LOW || ec == ESR_ELx_EC_DABT_CUR;
>>  }
>>  
>> +static inline bool esr_is_inst_abort(u32 esr)
>> +{
>> +	const u32 ec = ESR_ELx_EC(esr);
>> +
>> +	return ec == ESR_ELx_EC_IABT_LOW || ec == ESR_ELx_EC_IABT_CUR;
>> +}
>> +
>> +static inline bool esr_is_ext_abort(u32 esr)
>> +{
>> +	return esr_is_data_abort(esr) || esr_is_inst_abort(esr);
>> +}
> 
> A data abort or an intstruction abort are not necessarily synchronus
> external aborts, so this isn't right.
> 
> What exactly are you trying to catch here? If you are seeing a problem
> in practice, can you please share your log from a crash?

Yes, I meet a problem in practice.

Tan Xiaofei report this issue when doing ras error inject testing:
1) panic_on_oops is not set;
2) trigger a sea by inject a ECC error;
3) a cpu core receive a sea in kernel context:
   do_mem_abort
     -> arm64_notify_die
        -> die                    # kernel context, call die() directly;
           -> do_exit             # kernel process context, call do_exit(SIGSEGV);
              -> do_task_dead()   # call do_task_dead(), and hung up this core;

Actually, we should not call do_exit() for kernel task, or before return from
external abort.

So, I send this patch intend to panic directly when receive a sea in kernel context.

crash log:

NOTICE:  [TotemRasIntCpuNodeEri]:[1879L]
NOTICE:  [RasEriInterrupt]:[173L]NodeTYP2Status = 0x0
[  387.740609] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 9
[  387.748837] {1}[Hardware Error]: event severity: recoverable
[  387.754470] {1}[Hardware Error]:  Error 0, type: recoverable
[  387.760103] {1}[Hardware Error]:   section_type: ARM processor error
[  387.766425] {1}[Hardware Error]:   MIDR: 0x00000000481fd010
[  387.771972] {1}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000081080000
[  387.780628] {1}[Hardware Error]:   error affinity level: 0
[  387.786088] {1}[Hardware Error]:   running state: 0x1
[  387.791115] {1}[Hardware Error]:   Power State Coordination Interface state: 0
[  387.798301] {1}[Hardware Error]:   Error info structure 0:
[  387.803761] {1}[Hardware Error]:   num errors: 1
[  387.808356] {1}[Hardware Error]:    error_type: 0, cache error
[  387.814160] {1}[Hardware Error]:    error_info: 0x0000000024400017
[  387.820311] {1}[Hardware Error]:     transaction type: Instruction
[  387.826461] {1}[Hardware Error]:     operation type: Generic error (type cannot be determined)
[  387.835031] {1}[Hardware Error]:     cache level: 1
[  387.839878] {1}[Hardware Error]:     the error has been corrected
[  387.845942] {1}[Hardware Error]:    physical fault address: 0x00000027caf50770
[  387.853162] Internal error: synchronous external abort: 96000610 [#1] PREEMPT SMP
[  387.860611] Modules linked in: l1l2_inject(O) vfio_iommu_type1 vfio_pci vfio_virqfd vfio ib_ipoib ib_umad rpcrdma ib_iser libiscsi scsi_transport_iscsi hns_roce_hw_v2 crct10dif_ce ses hns3 hclge hnae3 hisi_trng_v2 rng_core hisi_zip hisi_sec2 hisi_hpre hisi_qm uacce hisi_sas_v3_hw hisi_sas_main libsas scsi_transport_sas
[  387.888725] CPU: 0 PID: 940 Comm: kworker/0:3 Kdump: loaded Tainted: G           O      5.5.0-rc4-g5993cbe #1
[  387.898592] Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V3.B210.01 03/12/2020
[
 Message from  387.907429] Workqueue: events error_inject [l1l2_inject]
[  387.914098] pstate: 80c00009 (Nzcv daif +PAN +UAO)
 s[yslogd@...alho  387.918867] pc : lsu_inj_ue+0x58/0x70 [l1l2_inject]
[  387.925103] lr : error_inject+0x64/0xb0 [l1l2_inject]
[  387.930132] sp : ffff800020af3d90
[  387.933435] x29: ffff800020af3d90 x28: 0000000000000000
st[ at Mar 30 11:  387.938723] x27: ffff0027dae6e838 x26: ffff80001178bcc8
[  387.945391] x25: 0000000000000000 x24: ffffa2c77162e090
[  387.950680] x23: 0000000000000000 x22: ffff2027d7c33d40
33[:55 ...
 ker  387.955968] x21: ffff2027d7c37a00 x20: ffff0027d679c000
[  387.962636] x19: ffffa2c77162e088 x18: 0000000020cbf59a
ne[l:Internal err  387.967924] x17: 000000000000000e x16: ffffa2c7b812bc98
[  387.974592] x15: 0000000000000001 x14: 0000000000000000
[  387.979880] x13: ffff2027cf75a780 x12: ffffa2c7b8299c18
or[: synchronous   387.985168] x11: 0000000000000000 x10: 00000000000009f0
[  387.991836] x9 : ffff800020af3d50 x8 : fefefefefefefeff
ex[ternal abort:   387.997124] x7 : 0000000000000000 x6 : 001d4ed88e000000
[  388.003792] x5 : 000073746e657665 x4 : 0000080110f81381
[  388.009080] x3 : 000000000000002f x2 : 807fffffffffffff
96[000610 [#1] PR  388.014369] x1 : ffffa2c77162c518 x0 : 0000000000000081
[  388.021037] Call trace:
EEMPT SMP
[  388.023475]  lsu_inj_ue+0x58/0x70 [l1l2_inject]
[  388.029019]  error_inject+0x64/0xb0 [l1l2_inject]
[  388.033707]  process_one_work+0x158/0x4b8
[  388.037699]  worker_thread+0x50/0x498
[  388.041348]  kthread+0xfc/0x128
[  388.044480]  ret_from_fork+0x10/0x1c
[  388.048042] Code: b2790000 d519f780 f9800020 d5033f9f (58001001)
[  388.054109] ---[ end trace 39d51c21b0e42ba6 ]---

core 0 hung up at here.


> 
> Thanks,
> Mark.
> 
>> +
>>  const char *esr_get_class_string(u32 esr);
>>  #endif /* __ASSEMBLY */
>>  
>> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
>> index cf402be5c573..08f7f7688d5b 100644
>> --- a/arch/arm64/kernel/traps.c
>> +++ b/arch/arm64/kernel/traps.c
>> @@ -202,6 +202,8 @@ void die(const char *str, struct pt_regs *regs, int err)
>>  		panic("Fatal exception in interrupt");
>>  	if (panic_on_oops)
>>  		panic("Fatal exception");
>> +	if (esr_is_ext_abort(err))
>> +		panic("Synchronous external abort in kernel context");
>>  
>>  	raw_spin_unlock_irqrestore(&die_lock, flags);
>>  
>> -- 
>> 2.20.1
>>
> .
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ