Date:   Fri, 03 Apr 2020 16:41:04 +1000
From:   Nicholas Piggin <npiggin@...il.com>
To:     Alexios Zavras <alexios.zavras@...el.com>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Christophe Leroy <christophe.leroy@....fr>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Enrico Weigelt <info@...ux.net>,
        Leonardo Bras <leonardo@...ux.ibm.com>,
        Michael Ellerman <mpe@...erman.id.au>,
        Paul Mackerras <paulus@...ba.org>, peterz@...radead.org,
        Thomas Gleixner <tglx@...utronix.de>
Cc:     linux-kernel@...r.kernel.org, linuxppc-dev@...ts.ozlabs.org
Subject: Re: [PATCH v3 1/1] ppc/crash: Reset spinlocks during crash

Leonardo Bras wrote on April 3, 2020 10:37 am:
> On Thu, 2020-04-02 at 22:28 +1100, Michael Ellerman wrote:
>> Leonardo Bras <leonardo@...ux.ibm.com> writes:
>> > During a crash, there is a chance that the cpus that handle the NMI IPI
>> > are holding a spin_lock. If this spin_lock is needed by crashing_cpu it
>> > will cause a deadlock. (rtas.lock and printk logbuf_lock as of today)
>> > 
>> > This is a problem if the system has kdump set up, because if it then
>> > crashes for any reason the kdump may not be saved for crash analysis.
>> > 
>> > After the NMI IPI is sent to all other cpus, force unlock all spinlocks
>> > needed to finish the crash routine.
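
(For anyone following along, the "force unlock" described above would look
roughly like the sketch below. This is not the actual patch:
crash_reset_spinlocks() and printk_reset_logbuf_lock() are invented names,
and rtas.lock is the arch_spinlock_t in arch/powerpc/kernel/rtas.c.)

/*
 * Sketch only: after the NMI IPI has stopped the other CPUs, forcibly
 * release the locks the remaining crash code still needs.
 */
void crash_reset_spinlocks(void)		/* hypothetical helper */
{
	/* rtas.lock is an arch_spinlock_t, so it can be dropped directly */
	arch_spin_unlock(&rtas.lock);

	/*
	 * logbuf_lock is private to kernel/printk/printk.c, so printk
	 * would have to export a reset hook such as this invented one.
	 */
	printk_reset_logbuf_lock();		/* hypothetical */
}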
>> 
>> I'm not convinced this is the right approach.
> 
> Me neither. I think it's a very hacky solution, but I couldn't think of
> anything better at the time.
> 
>> Busting locks is risky, it could easily cause a crash if data structures
>> are left in some inconsistent state.
>> 
>> I think we need to make this code more careful about what it's doing.
>> There's a clue at the top of default_machine_crash_shutdown(), which
>> calls crash_kexec_prepare_cpus():
>> 
>> 	 * This function is only called after the system
>> 	 * has panicked or is otherwise in a critical state.
>> 	 * The minimum amount of code to allow a kexec'd kernel
>> 	 * to run successfully needs to happen here.
>> 
>> 
>> You said the "IPI complete" message was the cause of one lockup:
>> 
>>   #0  arch_spin_lock 
>>   #1  do_raw_spin_lock 
>>   #2  __raw_spin_lock 
>>   #3  _raw_spin_lock 
>>   #4  vprintk_emit 
>>   #5  vprintk_func
>>   #7  crash_kexec_prepare_cpus 
>>   #8  default_machine_crash_shutdown
>>   #9  machine_crash_shutdown 
>>   #10 __crash_kexec
>>   #11 crash_kexec
>>   #12 oops_end
>> 
>> TBH I think we could just drop that printk() entirely.
>> 
>> Or we could tell printk() that we're in NMI context so that it uses the
>> percpu buffers.
>> 
>> We should probably do the latter anyway, in case there's any other code
>> we call that inadvertently calls printk().
>> 
> 
> I was not aware that printk could use per-cpu buffers. It may be a nice
> solution.
> 
> There is another printk call there:
> printk("kexec: Starting switchover sequence.\n");
> in default_machine_kexec().
> 
> Two printk calls and one rtas call: that's all I could see taking a
> spinlock after IPI Complete.
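
The "tell printk() that we're in NMI context" idea above could look
something like this sketch, assuming the printk_nmi_enter()/printk_nmi_exit()
helpers provided by kernel/printk/printk_safe.c (CONFIG_PRINTK_NMI); the
function body is a simplified stand-in, not the real crash shutdown code:

#include <linux/printk.h>

void default_machine_crash_shutdown(struct pt_regs *regs)
{
	/*
	 * Divert printk() to the per-CPU NMI buffers so that
	 * vprintk_emit() never takes logbuf_lock on this path.
	 */
	printk_nmi_enter();

	crash_kexec_prepare_cpus(smp_processor_id());
	machine_kexec_mask_interrupts();

	/*
	 * The buffered messages are flushed later, or recovered from
	 * the dump by the kdump kernel.
	 */
}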
> 
>> 
>> The RTAS trace you sent was:
>> 
>>   #0 arch_spin_lock
>>   #1  lock_rtas () 
>>   #2  rtas_call (token=8204, nargs=1, nret=1, outputs=0x0)
>>   #3  ics_rtas_mask_real_irq (hw_irq=4100) 
>>   #4  machine_kexec_mask_interrupts
>>   #5  default_machine_crash_shutdown
>>   #6  machine_crash_shutdown 
>>   #7  __crash_kexec
>>   #8  crash_kexec
>>   #9  oops_end
>> 
>> 
>> Which doesn't make it clear who holds the RTAS lock. We really shouldn't
>> be crashing while holding the RTAS lock, but I guess it could happen.
>> Can you get a full backtrace?
>> 
> 
> Oh, all traces are from the thread that called the crash, by writing
> 'c' to sysrq. That is what I am using to reproduce.
> 
> #10 bad_page_fault
> #11 handle_page_fault
> #12 __handle_sysrq (key=99, check_mask=false) 
> #13 write_sysrq_trigger 
> #14 proc_reg_write
> #15 __vfs_write
> #16 vfs_write
> #17 SYSC_write
> #18 SyS_write
> #19 system_call
> 
>> 
>> PAPR says we are not allowed to have multiple CPUs calling RTAS at once,
>> except for a very small list of RTAS calls. So if we bust the RTAS lock
>> there's a risk we violate that part of PAPR and crash even harder.
> 
> Interesting, I was not aware.
> 
>> 
>> Also it's not specific to kdump; we can't even get through a normal
>> reboot if we crash with the RTAS lock held.
>> 
>> Anyway here's a patch with some ideas. That allows me to get from a
>> crash with the RTAS lock held through kdump into the 2nd kernel. But it
>> only works if it's the crashing CPU that holds the RTAS lock.
>> 
> 
> Nice idea.
> But my test environment just triggers a crash from sysrq, so I think it
> would not improve the result, given that this thread is probably not
> holding the lock at that point.

Crash paths should not take that RTAS lock; it's a massive pain. I'm 
fixing it for machine check, and for other crashes I think it can be 
removed too, it just needs to be unpicked. The good thing about crashing 
is that you can reasonably *know* that you're single threaded, so you 
can usually reason through situations like the above.
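
One possible shape for that, purely as a sketch (the
kexec_crash_in_progress() check is an invented name, and lock_rtas() here
is a simplified version of the one in arch/powerpc/kernel/rtas.c):

static unsigned long lock_rtas(void)
{
	unsigned long flags;

	local_irq_save(flags);
	preempt_disable();

	/*
	 * Once the crash NMI IPI has stopped the other CPUs, only one CPU
	 * is left running, so serialising RTAS calls buys nothing and can
	 * only deadlock against a holder that will never run again.
	 */
	if (!kexec_crash_in_progress())		/* invented helper */
		arch_spin_lock(&rtas.lock);

	return flags;
}

unlock_rtas() would need the matching check, and the machine check path
gets its own treatment as mentioned above.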

> I noticed that when rtas is locked, irqs and preemption are also
> disabled.
> 
> Should the IPI sent by crash be able to interrupt a thread with irqs
> disabled?

Yes. It's been a bit painful, but in the long term it means that a CPU 
which hangs with interrupts off can be debugged, and it means we can 
take it offline to crash without risking that it will be clobbering what 
we're doing.

Arguably what I should have done is try a regular IPI first, wait a few 
seconds, then NMI IPI.
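
Something like the following rough sketch (the callback, the
cpus_in_crash counter, and crash_send_nmi_ipi() are placeholder names
here, and the timeout is arbitrary):

static void crash_stop_other_cpus(void)
{
	int msecs = 5000;

	/*
	 * Polite attempt first: a regular IPI, which is only handled once
	 * the target CPU re-enables interrupts.
	 */
	smp_call_function(crash_stop_cpu_callback, NULL, 0);

	while (atomic_read(&cpus_in_crash) < num_online_cpus() - 1 && msecs--)
		mdelay(1);

	/* Whoever is still missing gets the NMI IPI, irqs-off or not. */
	if (atomic_read(&cpus_in_crash) < num_online_cpus() - 1)
		crash_send_nmi_ipi();
}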

A couple of problems with that approach. Firstly, it would probably avoid 
the issue you hit almost all of the time, so it would never get fixed. So 
when we really need the NMI IPI in the field, the path would still be 
riddled with deadlocks.

Secondly, sending the regular IPI first can in theory be more intrusive 
to the state that we want to debug. It uses the currently running stack, 
paca save areas, etc. The NMI IPI uses its own stack and save regions, so 
it's a little more isolated. Maybe this is only a small advantage, but 
I'd like to have it if we can.

Thanks,
Nick
