linux-kernel - Re: [STACK DUMP][5.16-rc?] from (cpufreq: intel_pstate: Process HWP Guaranteed change notification)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAHk-=wi63LDzMV1sSiPuH8q38tMZ0_b+w9bUVW=P_O1+rkW28A@mail.gmail.com>
Date:   Wed, 3 Nov 2021 14:58:39 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Steven Rostedt <rostedt@...dmis.org>
Cc:     LKML <linux-kernel@...r.kernel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        pablomh@...il.com, "Rafael J. Wysocki" <rafael.j.wysocki@...el.com>
Subject: Re: [STACK DUMP][5.16-rc?] from (cpufreq: intel_pstate: Process HWP
 Guaranteed change notification)

On Wed, Nov 3, 2021 at 1:33 PM Steven Rostedt <rostedt@...dmis.org> wrote:
>
> The RIP of cpuidle_enter_state+0xd6/0x3a0 gives me in gdb:

That's not actually the error address - it's not an oops. That's just
the register data at the interrupt entry.

The error was that triggered this was that

    unchecked MSR access error:
        WRMSR to 0x773 (tried to write 0x0000000000000000)
        at rIP: 0xffffffffab06e824 (native_write_msr+0x4/0x20)

and then it prints the stack frame - because we shouldn't be doing
random MSR writes.

And part of printing the stack frame is printing the register state at
kernel entries when switching to an interrupt or exception stack.

So that RIP you point to is not really interesting, and it looks like
it's an entirely random harmless instruction, it's just that it is
preceded by that

        sti
        nopw

so it's not surprising that it's an interrupt point, because it's
exactly one instruction after interrupts were enabled on that CPU (the
one instruction being the sti shadow that itself doesn't take
interrupts).

So cpuidle_enter_state() itself is entirely harmless, and you see that
in the call chain presumably simply because the CPU was idle when the
interrupt happened. Again, that's a very reasonable place for an
interrupt to happen.

End result: all of this is simply due to an unexpected MSR write, and
the real issue is that interrupt from an IPI that caused the WRMSR
write on another CPU:

    __wrmsr_on_cpu+0x33/0x40
    flush_smp_call_function_queue+0x122/0x1a0
    __sysvec_call_function_single+0x29/0xc0
    sysvec_call_function_single+0x9d/0xd0

and the rest of the call chain is just noise from "whatever that CPU
happened to do when it got the IPI".

You don't get the call chain of whatever actually did the IPI, I'm
afraid. That's the actual _cause_ of all this, but that happens on
another CPU entirely.

> I bisected it down to:
>
>  57577c996d73 ("cpufreq: intel_pstate: Process HWP Guaranteed change notification")
>
> Removing the commit makes the issue go away. Adding it back brings it back'

That commit does various odd things.

The code seems to depend on the 'hwp_notify_lock' spinlock, but then
it does some work outside that locked region (exactly because that
work triggers an interrupt that needs that lock):

      wrmsrl_on_cpu(cpudata->cpu, MSR_HWP_INTERRUPT, 0x00);

and that's exactly what will then warn because that MSR hasn't been
enabled on that CPU.

But it also does randomly do some READ_ONCE/WRITE_ONCE magic that all
seems entirely bogus. Those values should be protected by the lock.

The way it cancels delayed work is also very questionable: it does

        spin_lock_irqsave(&hwp_notify_lock, flags);
        if (cpumask_test_and_clear_cpu(cpudata->cpu, &hwp_intr_enable_mask))
                cancel_delayed_work(&cpudata->hwp_notify_work);
        spin_unlock_irqrestore(&hwp_notify_lock, flags);

but if the delayed work was already started on that cpu, then it will
continue running concurrently, even if the caller now thinks it's all
disabled.

You need the "sync" version to make sure the work is also waited on,
but you can't do that under the spinlock that the work in question may
need, of course.

> So I'm guessing that this commit calls some code that does not expect to be
> called on powering off, at least in debug mode, and is triggering this
> stack dump. Probably because there was some static call or something that
> isn't shutdown properly.

It actually looks more direct than that to me: the code literally
seems to do that wrmsrl_on_cpu() that triggers the IPI that actually
causes the unchecked MSR write.

I don't know this hardware, and I don't know that code, but there are
certainly a few red flags here.

Srinivas?

              Linus