[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <199ba55d-8eb8-47df-9f42-55a865b51ab7@zhaoxin.com>
Date: Wed, 29 May 2024 14:44:14 +0800
From: Tony W Wang-oc <TonyWWang-oc@...oxin.com>
To: Thomas Gleixner <tglx@...utronix.de>, Dave Hansen <dave.hansen@...el.com>,
<mingo@...hat.com>, <bp@...en8.de>, <dave.hansen@...ux.intel.com>,
<x86@...nel.org>, <hpa@...or.com>, <keescook@...omium.org>,
<tony.luck@...el.com>, <gpiccoli@...lia.com>, <mat.jonczyk@...pl>,
<rdunlap@...radead.org>, <alexandre.belloni@...tlin.com>,
<mario.limonciello@....com>, <yaolu@...inos.cn>, <bhelgaas@...gle.com>,
<justinstitt@...gle.com>, <linux-kernel@...r.kernel.org>,
<linux-hardening@...r.kernel.org>
CC: <CobeChen@...oxin.com>, <TimGuo@...oxin.com>, <LeoLiu-oc@...oxin.com>,
Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress
On 2024/5/29 12:39, Tony W Wang-oc wrote:
>
>
> On 2024/5/29 06:12, Thomas Gleixner wrote:
>>
>>
>> [这封邮件来自外部发件人 谨防风险]
>>
>> On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
>>> On 5/27/24 23:38, Tony W Wang-oc wrote:
>>> ...> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>>>> index c96ae8fee95e..ecadd0698d6a 100644
>>>> --- a/arch/x86/kernel/hpet.c
>>>> +++ b/arch/x86/kernel/hpet.c
>>>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>>>> if (in_nmi())
>>>> return (u64)hpet_readl(HPET_COUNTER);
>>>>
>>>> + /*
>>>> + * Read HPET directly if panic in progress.
>>>> + */
>>>> + if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>>>> + return (u64)hpet_readl(HPET_COUNTER);
>>>> +
>>>
>>> There is literally one other piece of the code in the kernel doing
>>> something similar: the printk() implementation. There's no other
>>> clocksource or timekeeping code that does this on any architecture.
>>>
>>> Why doesn't this problem apply to any other clock sources?
>>
>> I principle it applies to any clocksource which needs a spinlock to
>> serialize access. HPET is not the only insanity here.
>>
>> Think about i8253 :)
>>
>> Most real clocksources, like TSC and the majority of the preferred clock
>> sources on other architectures are perfectly fine. They just read and be
>> done.
>>
>>> Why should the problem be fixed in the clock sources themselves? Why
>>> doesn't printk() deadlock on systems using the HPET?
>>
>> Because regular printk()s are deferred to irq work when in NMI and
>> similar contexts, but that obviously does not apply to panic
>> situations. Also NMI is treated special even in the HPET code. See
>> below.
>>
>>> In other words, I think we should fix pstore to be more like printk
>>> rather than hacking around this in each clock source.
>>
>> pstore is perfectly fine. It uses a NMI safe time accessor function
>> which is then tripping over the HPET lock. That's really a HPET specific
>> problem.
>>
>> Though what I read out of the changelog is that the MCE hits the same
>> CPU 'x' which holds the lock. But that's fairy tale material as you can
>> see in the patch above:
>>
>> if (in_nmi())
>> return (u64)hpet_readl(HPET_COUNTER);
>>
>> For that particular case the dead lock, which would actually be a live
>> lock, cannot happen because in kernel MCEs are NMI class exceptions and
>> therefore in_nmi() evaluates to true and that new voodoo can't be
>> reached at all.
>>
>> Now there are two other scenarios which really can make that happen:
>>
>> 1) A non-NMI class exception within the lock held region
>>
>> CPU A
>> acquire(hpet_lock);
>> ... <- #PF, #GP, #DE, ... -> panic()
>>
>> If any of that happens within that lock held section then the live
>> lock on the hpet_lock is the least of your worries. Seriously, I
>> don't care about this at all.
>>
Actually, this scenario is what this patch is trying to solve.
We encountered hpet_lock deadlock from the call path of the MCE handler,
and this hpet_lock deadlock scenario may happen when others exceptions'
handler like #PF/#GP... to call the panic. So read_hpet should avoid
deadlock if panic in progress.
Sincerely
TonyWWang-oc
>> 2) The actual scenario is:
>>
>> CPU A CPU B
>> lock(hpet_lock)
>> MCE hits user space
>> ...
>> exc_machine_check_user()
>> irqentry_enter_from_user_mode(regs);
>>
>> irqentry_enter_from_user_mode() obviously does not mark the
>> exception as NMI class, so in_nmi() evaluates to false. That would
>> actually dead lock if CPU A is not making progress and releases
>> hpet_lock.
>>
>> Sounds unlikely to happen, right? But in reality it can because of
>> MCE broadcast. Assume that both CPUs go into MCE:
>>
>> CPU A CPU B
>> lock(hpet_lock)
>> exc_machine_check_user()
>> irqentry_enter_from_user_mode();
>> exc_machine_check_kernel() do_machine_check()
>> irqentry_nmi_enter(); mce_panic()
>> do_machine_check() if
>> (atomic_inc_return(&mce_panicked) > 1)
>> mce_panic() wait_for_panic(); <- Not taken
>>
>> if (atomic_inc_return(&mce_panicked) > 1)
>> wait_for_panic(); <- Taken
>>
>> ....
>> hpet_read()
>>
>> -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
>> obviously can't release the lock.
>>
>
> Because MCE handler will call printk() before call the panic, so
> printk() deadlock may happen in this scenario:
>
> CPU A CPU B
> printk()
> lock(console_owner_lock)
> exc_machine_check_user()
> irqentry_enter_from_user_mode()
> exc_machine_check_kernel() do_machine_check()
> irqentry_nmi_enter() mce_panic()
> do_machine_check() printk_mce() #A
> mce_panic() ...
> wait_for_panic() panic()
>
> printk deadlock will happened at #A because in_nmi() evaluates to false
> on CPU B and CPU B do not enter the panic() AT #A.
>
> Update user space MCE handler to NMI class context is preferred?
>
> Sincerely
> TonyWWang-oc
>
>> So the proposed patch makes sense to some extent. But it only cures the
>> symptom. The real underlying questions are:
>>
>> 1) Should we provide a panic mode read callback for clocksources which
>> are affected by this?
>>
>> 2) Is it correct to claim that a MCE which hits user space and ends
>> up in
>> mce_panic() is still just a regular exception or should we
>> upgrade to
>> NMI class context when we enter mce_panic() or even go as far to
>> upgrade to NMI class context for any panic() invocation?
>>
>> #1 Solves it at the clocksource level. It still needs HPET specific
>> changes.
>>
>> #2 Solves a whole class of issues
>>
>> ... while potentially introducing new ones :)
>>
>> To me upgrading any panic() invocation to NMI class context makes a
>> lot of sense because in that case all bets are off.
>>
>> in_nmi() is used in quite some places to avoid such problems. IOW,
>> that would kill a whole class of issues instead of "curing" the HPET
>> problem locally for the price of an extra conditional. Not that the
>> extra conditional matters much if HPET is the clocksource as that's
>> awfully slow anyway and I really don't care about that.
>>
>> But I very much care about avoiding to sprinkle panic_cpu checks all
>> over the place.
>>
>> Thanks,
>>
>> tglx
Powered by blists - more mailing lists