linux-kernel - Re: [PATCH] x86/apic: Fix circular locking dependency between console and hrtimer locks

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <87tv15qj5u.fsf@nanos.tec.linutronix.de>
Date:   Mon, 27 Apr 2020 13:09:49 +0200
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Ingo Molnar <mingo@...nel.org>, Leon Romanovsky <leon@...nel.org>
Cc:     Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        "H. Peter Anvin" <hpa@...or.com>, x86 <x86@...nel.org>,
        Suresh Siddha <suresh.b.siddha@...el.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86/apic: Fix circular locking dependency between console and hrtimer locks

Ingo Molnar <mingo@...nel.org> writes:
> * Leon Romanovsky <leon@...nel.org> wrote:
> The fix definitely looks legit, lockdep is right that we shouldn't take 
> the console_sem.lock even under trylock.
>
> It's only a printk_once(), yet I'm wondering why in the last ~8 years 
> this never triggered. Nobody ever ran lockdep and debug console level 
> enabled on such hardware, or did something else change?
>
> One possibility would be that apic_check_deadline_errata() marked almost 
> all Intel systems as broken and the TSC-deadline hardware never actually 
> got activated. In that case you have triggered rarely tested code and 
> might see other weirdnesses. Just saying. :-)
>
> Or a bootup with "debug" specified is much more rare in production 
> systems, hence the 8 years old bug.

None of this makes any sense at all.

The local APIC timer (in this case the TSC deadline timer) is set up
during early boot on the boot CPU (before SMP setup) with this call
chain:

smp_prepare_cpus()
 native_smp_prepare_cpus()
   x86_init.timers.setup_percpu_clockev()
     setup_boot_APIC_clock()
       setup_APIC_timer()
         clockevents_config_and_register()
           tick_check_new_device()
             tick_setup_device()
               tick_setup_oneshot()
                 clockevents_switch_state()
                   lapic_timer_set_oneshot()
                     __setup_APIC_LVTT()
                       printk_once(...)

Nothing holds hrtimer.base_lock in this call chain.

But the lockdep splat clearly says:

 [  735.324357] stack backtrace:
 [  735.324360] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-for-upstream-dbg-2020-04-03_10-44-43-70 #1

...

So how can that be the first invocation of that printk_once()?

While the patch looks innocent, it papers over the underlying problem
and wild theories are not really helping here.

Here is a boot log excerpt with lockdep enabled and 'debug' on the
command line:

[    0.000000] Linux version 5.7.0-rc3 ...
...
[    3.992125] TSC deadline timer enabled
[    3.995820] smpboot: CPU0: Intel(R) ....
...
[    4.050766] smp: Bringing up secondary CPUs ...

No splat nothing. The real question is WHY this triggers on Leons
machine 735 seconds after boot and on CPU3.

Thanks,

        tglx