linux-kernel - Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87wmnda8mc.ffs@tglx>
Date: Wed, 29 May 2024 00:12:43 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: Dave Hansen <dave.hansen@...el.com>, Tony W Wang-oc
 <TonyWWang-oc@...oxin.com>, mingo@...hat.com, bp@...en8.de,
 dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com,
 keescook@...omium.org, tony.luck@...el.com, gpiccoli@...lia.com,
 mat.jonczyk@...pl, rdunlap@...radead.org, alexandre.belloni@...tlin.com,
 mario.limonciello@....com, yaolu@...inos.cn, bhelgaas@...gle.com,
 justinstitt@...gle.com, linux-kernel@...r.kernel.org,
 linux-hardening@...r.kernel.org
Cc: CobeChen@...oxin.com, TimGuo@...oxin.com, LeoLiu-oc@...oxin.com, Linus
 Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH] x86/hpet: Read HPET directly if panic in progress

On Tue, May 28 2024 at 07:18, Dave Hansen wrote:
> On 5/27/24 23:38, Tony W Wang-oc wrote:
> ...> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
>> index c96ae8fee95e..ecadd0698d6a 100644
>> --- a/arch/x86/kernel/hpet.c
>> +++ b/arch/x86/kernel/hpet.c
>> @@ -804,6 +804,12 @@ static u64 read_hpet(struct clocksource *cs)
>>  	if (in_nmi())
>>  		return (u64)hpet_readl(HPET_COUNTER);
>>  
>> +	/*
>> +	 * Read HPET directly if panic in progress.
>> +	 */
>> +	if (unlikely(atomic_read(&panic_cpu) != PANIC_CPU_INVALID))
>> +		return (u64)hpet_readl(HPET_COUNTER);
>> +
>
> There is literally one other piece of the code in the kernel doing
> something similar: the printk() implementation.  There's no other
> clocksource or timekeeping code that does this on any architecture.
>
> Why doesn't this problem apply to any other clock sources?

I principle it applies to any clocksource which needs a spinlock to
serialize access. HPET is not the only insanity here.

Think about i8253 :)

Most real clocksources, like TSC and the majority of the preferred clock
sources on other architectures are perfectly fine. They just read and be
done.

> Why should the problem be fixed in the clock sources themselves?  Why
> doesn't printk() deadlock on systems using the HPET?

Because regular printk()s are deferred to irq work when in NMI and
similar contexts, but that obviously does not apply to panic
situations. Also NMI is treated special even in the HPET code. See
below.

> In other words, I think we should fix pstore to be more like printk
> rather than hacking around this in each clock source.

pstore is perfectly fine. It uses a NMI safe time accessor function
which is then tripping over the HPET lock. That's really a HPET specific
problem.

Though what I read out of the changelog is that the MCE hits the same
CPU 'x' which holds the lock. But that's fairy tale material as you can
see in the patch above:

  	if (in_nmi())
  		return (u64)hpet_readl(HPET_COUNTER);

For that particular case the dead lock, which would actually be a live
lock, cannot happen because in kernel MCEs are NMI class exceptions and
therefore in_nmi() evaluates to true and that new voodoo can't be
reached at all.

Now there are two other scenarios which really can make that happen:

 1) A non-NMI class exception within the lock held region

    CPU A
    acquire(hpet_lock);
    ...                 <- #PF, #GP, #DE, ... -> panic()

    If any of that happens within that lock held section then the live
    lock on the hpet_lock is the least of your worries. Seriously, I
    don't care about this at all.

 2) The actual scenario is:

    CPU A                       CPU B
    lock(hpet_lock)
                                MCE hits user space
                                ...
                                exc_machine_check_user()
                                  irqentry_enter_from_user_mode(regs);

    irqentry_enter_from_user_mode() obviously does not mark the
    exception as NMI class, so in_nmi() evaluates to false. That would
    actually dead lock if CPU A is not making progress and releases
    hpet_lock.

    Sounds unlikely to happen, right? But in reality it can because of
    MCE broadcast. Assume that both CPUs go into MCE:

    CPU A                       CPU B
    lock(hpet_lock)
                                exc_machine_check_user()
                                  irqentry_enter_from_user_mode();
    exc_machine_check_kernel()    do_machine_check()
      irqentry_nmi_enter();         mce_panic()
      do_machine_check()            if (atomic_inc_return(&mce_panicked) > 1)
        mce_panic()                     wait_for_panic(); <- Not taken

        if (atomic_inc_return(&mce_panicked) > 1)
            wait_for_panic(); <- Taken

                                    ....
                                    hpet_read()

    -> Dead lock because in_nmi() evaluates to false on CPU B and CPU A
       obviously can't release the lock.

So the proposed patch makes sense to some extent. But it only cures the
symptom. The real underlying questions are:

  1) Should we provide a panic mode read callback for clocksources which
     are affected by this?

  2) Is it correct to claim that a MCE which hits user space and ends up in
     mce_panic() is still just a regular exception or should we upgrade to
     NMI class context when we enter mce_panic() or even go as far to
     upgrade to NMI class context for any panic() invocation?

#1 Solves it at the clocksource level. It still needs HPET specific
   changes.

#2 Solves a whole class of issues

   ... while potentially introducing new ones :)

   To me upgrading any panic() invocation to NMI class context makes a
   lot of sense because in that case all bets are off.

   in_nmi() is used in quite some places to avoid such problems. IOW,
   that would kill a whole class of issues instead of "curing" the HPET
   problem locally for the price of an extra conditional. Not that the
   extra conditional matters much if HPET is the clocksource as that's
   awfully slow anyway and I really don't care about that.

   But I very much care about avoiding to sprinkle panic_cpu checks all
   over the place.

Thanks,

        tglx