linux-kernel - Re: [RFC PATCH] clocksource: skip check while watchdog hung up or unstable

Message-ID: <5ef35e9e-3a91-282c-4254-0abcd10e0a7f@gmail.com>
Date:   Fri, 13 Aug 2021 08:54:12 +0800
From:   brookxu <brookxu.cn@...il.com>
To:     Thomas Gleixner <tglx@...utronix.de>, john.stultz@...aro.org,
        sboyd@...nel.org
Cc:     linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] clocksource: skip check while watchdog hung up or
 unstable



Thomas Gleixner wrote on 2021/8/12 6:53 PM:
> On Wed, Aug 11 2021 at 23:26, brookxu wrote:
>> Thomas Gleixner wrote on 2021/8/11 22:01:
>>>> To be precise, we are processing interrupts in handle_edge_irq() for a long
>>>> time. Since the interrupts of multiple hardware queues are mapped to a single
>>>> CPU, multiple cores are continuously issuing IO, and then a single core is
>>>> processing IO. Perhaps the test case can be optimized, but shouldn't this lead
>>>> to switching clocks in principle?
>>>
>>> The clocksource watchdog failure is only _ONE_ consequence. Processing
>>> hard interrupts for 155 seconds straight will trigger lockup detectors
>>> of all sorts if you have them enabled.
>>>
>>> So just papering over the clocksource watchdog does not solve anything,
>>> really. Next week you have to add similar hacks to the lockup detectors,
>>> RCU and whatever.
>>
>> Yeah, we have observed soft lockup and RCU stall, but these behaviors are
>> expected because the current CPU scheduling is disabled. However, marking
>> TSC unstable is inconsistent with the actual situation. The worst problem
>> is that after the clocksource switched to hpet, the abnormal time will be
>> greatly prolonged due to the degradation of performance. We have not found
>> that soft lockup and RCU stall will affect the machine for a long time in
>> this test. Aside from these, as the watchdog is scheduled periodically, when
>> wd_nsec is 0, it means that something may be abnormal. Do we really still
>> need to continue to verify the TSC? And how can we ensure the correctness
>> of the results?
> 
> Sorry no. While softlockups and RCU stalls might have no long term
> effect in the first place, this argumentation vs. the clocksource
> watchdog is just a strawman. You're abusing the system in a way which
> causes it to malfunction so you have to live with the consequences.
> 
> Aside from that, this 'workaround' is just duct taping a particular part of
> the problem. What guarantees that after the interrupt storm subsided the
> clocksource delta of the watchdog becomes 0 (negative)?
> 
> Absolutely nothing. The delta can be positive, but then the watchdog and
> the TSC are not in sync anymore which will disable the TSC as well.
> 
> A 24MHz HPET has a wraparound time of ~178s which means during:
> 
>   89s < tdelta < 178s
> 
> your hack papers over the problem. Any interrupt storm time outside of
> that window results in fail.
> 
> Now run the same test on a machine with a 14MHz HPET and you get
> 
>  153s < tdelta < 306s
> 
> so your 155s interrupt storm barely fits. And what are you doing with
> your next test which runs only 80 seconds?
> 
> Not to mention the fact that you wreck detection of a watchdog
> clocksource going stale.
> 
> So no, we are not adding hacks to support abuse.
> 
> What we really want to do is to add detection for interrupt storms of
> this sort and shut those interrupts down for good.

OK, thanks for your suggestion.

> Thanks,
> 
>         tglx
> ---
> Patient: "Doctor, it hurts when I hammer on my toe."
> Doctor:  "Don't do that then!"
> 
>
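
The windows quoted above (89s < tdelta < 178s at 24MHz, 153s < tdelta < 306s
at 14MHz) can be reproduced with a small model. This is only a sketch under
stated assumptions, not kernel code: it assumes a 32-bit HPET counter and a
delta helper that clamps an apparently negative masked delta to 0, which is
how the "0 (negative)" case in the quoted mail reads. The names wrap_seconds,
masked_delta and check_skipped are hypothetical.

```python
HPET_BITS = 32  # assumption: the HPET main counter is 32 bits wide


def wrap_seconds(freq_hz, bits=HPET_BITS):
    """Seconds until a free-running counter of the given width wraps."""
    return (1 << bits) / freq_hz


def masked_delta(now, last, bits=HPET_BITS):
    """Masked counter delta; a delta with the top bit set is treated as
    negative and clamped to 0 (assumed behaviour, see lead-in)."""
    mask = (1 << bits) - 1
    delta = (now - last) & mask
    return 0 if delta > (mask >> 1) else delta


def check_skipped(stall_s, freq_hz):
    """True if a stall of stall_s seconds makes the watchdog delta read 0,
    i.e. the only case in which the proposed patch would skip the check."""
    ticks = int(stall_s * freq_hz)
    return masked_delta(ticks, 0) == 0


if __name__ == "__main__":
    for freq in (24_000_000, 14_000_000):
        wrap = wrap_seconds(freq)
        print(f"{freq // 1_000_000}MHz: {int(wrap / 2)}s < tdelta < {int(wrap)}s")
    for stall in (80, 155, 200):
        print(f"{stall}s stall at 24MHz skipped: {check_skipped(stall, 24_000_000)}")
```

In this model a stall shorter than half the wrap time, or longer than the
full wrap time, yields a positive watchdog delta that no longer matches the
TSC, so the check is not skipped and the TSC would still be marked unstable.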