linux-kernel - Re: [PATCH] tick/sched: fix data races at tick_do_timer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <878skdat77.fsf@nanos.tec.linutronix.de>
Date:   Sat, 07 Mar 2020 01:45:32 +0100
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Qian Cai <cai@....pw>
Cc:     fweisbec@...il.com, mingo@...nel.org, elver@...gle.com,
        linux-kernel@...r.kernel.org, Peter Zijlstra <peterz@...radead.org>
Subject: Re: [PATCH] tick/sched: fix data races at tick_do_timer_cpu

Qian Cai <cai@....pw> writes:
> On Wed, 2020-03-04 at 06:20 -0500, Qian Cai wrote:
>> Looks at tick_sched_do_timer(), for example,
>> 
>> if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE)) {
>> #ifdef CONFIG_NO_HZ_FULL
>> 		WARN_ON(tick_nohz_full_running);
>> #endif
>> 		tick_do_timer_cpu = cpu;
>> 	}
>> #endif
>> 
>> /* Check, if the jiffies need an update */
>> if (tick_do_timer_cpu == cpu)
>> 	tick_do_update_jiffies64(now);
>> 
>> Could we rule out all compilers and archs will not optimize it if !CONFIG_NO_HZ_FULL to,
>> 
>> if (unlikely(tick_do_timer_cpu == TICK_DO_TIMER_NONE) || tick_do_timer_cpu == cpu)
>>          tick_do_update_jiffies64(now);
>> 
>> So it could save a branch or/and realized that tick_do_timer_cpu is
>> not used later in this cpu, so it could discard the store?

Sorry, I can't make any sense of this.

If a compiler optimizes this in the way you are describing then it is
seriously broken. If that happens then lots of code will just fall apart
in colourful ways.

>> I am not all that familiar with all other 14 places if it is possible
>> to happen concurrently as well, so I took a pragmatic approach that
>> for now only deal with the places that KCSAN confirmed, and then look
>> forward for an incremental approach if there are more places needs
>> treatments later once confirmed.

Stop this please. I just wasted days to understand why code was
sprinkled with random rcu*enter/exit() invocations until I figured out
that the reason was exactly what you are proposing:

     Curing the symptoms

That's the worst engineering principle ever, really.

Just shutting up the tool without doing a proper analysis is a
guaranteed recipe for disaster because it is going to paper over real or
latent bugs. Read the git log history carefully to find prime examples
for this.

So no, neither shutting up just the places KCSAN complained about and
ignoring the rest nor your new proposal

> If you don't think that will be happening and dislike using READ/WRITE_ONCE()
> for documentation purpose, we could annotate those using the data_race() macro.
> Is that more acceptable?

is leading anywhere. It's both based on cargo cult programming and lacks
proper root cause analysis.

I'm surely not ignoring the output of KCSAN, but please understand that
any attempt to silence its output purely based on mechanical duct tape
is not going anywhere.

I wasted actually months in the past to decode a subtle wreckage which
was caused by the back then 'silence uninitilized variable warnings'
frenzy. One of these blindly applied annotations papered over a real
bug and made it horribly hard to find.

As I'm busy with other urgent nightmares right now, this has to wait
until I get some spare cycles or you come up with some proper analysis.

Thanks,

        tglx