Message-ID: <878r826xys.ffs@tglx>
Date: Mon, 16 Oct 2023 23:47:55 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: John Stultz <jstultz@...gle.com>,
Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>
Cc: Stephen Boyd <sboyd@...nel.org>,
LKML <linux-kernel@...r.kernel.org>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
"Paul E. McKenney" <paulmck@...nel.org>, x86@...nel.org
Subject: Re: [PATCH] clocksource: disable irq when holding watchdog_lock.
On Mon, Oct 16 2023 at 10:46, John Stultz wrote:
> On Fri, Oct 13, 2023 at 7:51 AM Tetsuo Handa
> <penguin-kernel@...ove.sakura.ne.jp> wrote:
>>
>> Lockdep found that spin_lock(&watchdog_lock) from call_timer_fn()
>> is not safe. Use spin_lock_irqsave(&watchdog_lock, flags) instead.
>>
>> [ 0.378387] TSC synchronization [CPU#0 -> CPU#1]:
>> [ 0.378387] Measured 55060 cycles TSC warp between CPUs, turning off TSC clock.
<SNIP/>
Please trim splats down to the relevant content. Timestamps and the bulk
of the backtraces are completely uninteresting. Something like this is
completely sufficient and readable:
tsc: Marking TSC unstable due to check_tsc_sync_source failed
WARNING: inconsistent lock state
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
ffffffff8cfa1c78 (watchdog_lock){?.-.}-{2:2}, at: clocksource_watchdog+0x23/0x5a0
{IN-HARDIRQ-W} state was registered at:
_raw_spin_lock_irqsave+0x3f/0x60
clocksource_mark_unstable+0x1b/0x90
mark_tsc_unstable+0x41/0x50
check_tsc_sync_source+0x14f/0x180
sysvec_call_function_single+0x69/0x90
Possible unsafe locking scenario:
lock(watchdog_lock);
<Interrupt>
lock(watchdog_lock);
stack backtrace:
_raw_spin_lock+0x30/0x40
clocksource_watchdog+0x23/0x5a0
run_timer_softirq+0x2a/0x50
sysvec_apic_timer_interrupt+0x6e/0x90
20 lines of useful information vs. ~100 lines containing mostly
noise. No?
Documentation/process/* clearly asks for this.
>> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
>> index c108ed8a9804..4e8fc0a5ca9d 100644
>> --- a/kernel/time/clocksource.c
>> +++ b/kernel/time/clocksource.c
>> @@ -405,8 +405,9 @@ static void clocksource_watchdog(struct timer_list *unused)
>> enum wd_read_status read_ret;
>> unsigned long extra_wait = 0;
>> u32 md;
>> + unsigned long flags;
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#variable-declarations
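I.e. reverse fir tree order, longest declaration first. As a sketch,
covering only the variables visible in this hunk:

	unsigned long extra_wait = 0;
	enum wd_read_status read_ret;
	unsigned long flags;
	u32 md;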
>> - spin_lock(&watchdog_lock);
>> + spin_lock_irqsave(&watchdog_lock, flags);
That looks obvious, but it's just making the problem worse. See below.
> At first glance this looks sane to me.
Emphasis on "first glance" :)
> The only thing that gives me pause is that I'm a little shocked this
> hasn't been caught earlier, and I don't right off see recent changes
> that would have caused this to trip. But Thomas may have more insight.
From looking at the (for you neatly trimmed) backtrace:
{IN-HARDIRQ-W} state was registered at:
_raw_spin_lock_irqsave+0x3f/0x60
clocksource_mark_unstable+0x1b/0x90
mark_tsc_unstable+0x41/0x50
check_tsc_sync_source+0x14f/0x180
sysvec_call_function_single+0x69/0x90
it's bloody obvious which change caused that:
9d349d47f0e3 ("x86/smpboot: Make TSC synchronization function call based")
Even if you don't recognize it immediately then:
git grep -l check_tsc_sync_source | xargs git blame | grep check_tsc_sync_source
is not really rocket science either, right?
Vs. you being shocked: As long as that synchronization function does not
actually try to mark the TSC unstable, lockdep won't notice anything. So
it's not really surprising that this took a few months to surface.
But that's not what shocks me...
That spin_lock_irqsave() conversion is not a solution. Why?

If that happens on RT then it is even worse. mark_tsc_unstable() is
invoked from hard interrupt context via an SMP function call, which is a
NONO on RT because spinlocks are sleeping locks there, i.e.
watchdog_lock cannot be acquired from hard interrupt context at all. The
_irqsave() change does not help with that in the least.
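That is, with the patch applied an RT kernel ends up doing (hand-drawn
from the IN-HARDIRQ-W trace above):

	<SMP function call>				// hard interrupt context
	  check_tsc_sync_source()
	    mark_tsc_unstable()
	      clocksource_mark_unstable()
	        spin_lock_irqsave(&watchdog_lock, ...);	// sleeping lock on RT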
Yes, I broke it and did not think about that particular side effect
when moving check_tsc_sync_source() from thread to smp function call
context.
But that's just a distraction. What's even worse is that the patch
blindly slaps spin_lock_irqsave() at the problem without actually looking
at what the function does and why it does not use the irqsave variant in
the first place.
The watchdog function deliberately runs with interrupts enabled and
only explicitly disables them for the actual readout because there is
no point in delaying interrupts for this horrorshow across the whole
function.
But the explicit local_irq_disable()/enable() pair around the readout
means that the patch above is obviously broken:

	spin_lock_irqsave(&watchdog_lock, flags);
	cs_watchdog_read()
	  local_irq_disable();
	  ...
	  local_irq_enable();	<--- FAIL: re-enables interrupts with the lock held
Lockdep should have yelled at that too, but even if it had failed to do
so, the change is still very obviously wrong.
Something like the uncompiled/untested below should cure it for real. It
really does not matter whether the TSC unstable event happens a bit
later. The system is unhappy no matter what.
That said, this whole clocksource watchdog mess wants a proper
overhaul. It has become a pile of warts and duct tape by now, and after
staring at it long enough there is no real reason to run it in a timer
callback anymore. It can just move into delayed work, and the whole
locking problem can be reduced to the clocksource_mutex and some well
thought out atomic operations to handle the mark unstable case. But
that's a different story and not relevant for curing the problem at
hand.
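For illustration, the rough shape of that idea as an uncompiled sketch -
the dwork name and callback are made up here, only clocksource_mutex and
WATCHDOG_INTERVAL exist today:

	static void clocksource_watchdog_dwork_fn(struct work_struct *work);
	static DECLARE_DELAYED_WORK(clocksource_watchdog_dwork,
				    clocksource_watchdog_dwork_fn);

	static void clocksource_watchdog_dwork_fn(struct work_struct *work)
	{
		/*
		 * Process context. The plain clocksource_mutex is
		 * sufficient here - on RT as well.
		 */
		mutex_lock(&clocksource_mutex);
		/* ... the actual per-clocksource watchdog checks ... */
		mutex_unlock(&clocksource_mutex);
		schedule_delayed_work(&clocksource_watchdog_dwork,
				      WATCHDOG_INTERVAL);
	}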
Thanks,
tglx
---
--- a/arch/x86/kernel/tsc_sync.c
+++ b/arch/x86/kernel/tsc_sync.c
@@ -15,6 +15,7 @@
  * ( The serial nature of the boot logic and the CPU hotplug lock
  *   protects against more than 2 CPUs entering this code. )
  */
+#include <linux/workqueue.h>
 #include <linux/topology.h>
 #include <linux/spinlock.h>
 #include <linux/kernel.h>
@@ -342,6 +343,13 @@ static inline unsigned int loop_timeout(
 	return (cpumask_weight(topology_core_cpumask(cpu)) > 1) ? 2 : 20;
 }
 
+static void tsc_sync_mark_tsc_unstable(struct work_struct *work)
+{
+	mark_tsc_unstable("check_tsc_sync_source failed");
+}
+
+static DECLARE_WORK(tsc_sync_work, tsc_sync_mark_tsc_unstable);
+
 /*
  * The freshly booted CPU initiates this via an async SMP function call.
  */
@@ -395,7 +403,7 @@ static void check_tsc_sync_source(void *
 			"turning off TSC clock.\n", max_warp);
 		if (random_warps)
 			pr_warn("TSC warped randomly between CPUs\n");
-		mark_tsc_unstable("check_tsc_sync_source failed");
+		schedule_work(&tsc_sync_work);
 	}
 
 	/*