linux-kernel - Re: [PATCH v8 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable

Message-ID: <20210417235136.GD5006@paulmck-ThinkPad-P17-Gen-1>
Date:   Sat, 17 Apr 2021 16:51:36 -0700
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     linux-kernel@...r.kernel.org, john.stultz@...aro.org,
        sboyd@...nel.org, corbet@....net, Mark.Rutland@....com,
        maz@...nel.org, kernel-team@...com, neeraju@...eaurora.org,
        ak@...ux.intel.com, Chris Mason <clm@...com>
Subject: Re: [PATCH v8 clocksource 3/5] clocksource: Check per-CPU clock
 synchronization when marked unstable

On Sat, Apr 17, 2021 at 02:47:18PM +0200, Thomas Gleixner wrote:
> On Tue, Apr 13 2021 at 21:36, Paul E. McKenney wrote:
> 
> Bah, hit send too quick.
> 
> > +	cpumask_clear(&cpus_ahead);
> > +	cpumask_clear(&cpus_behind);
> > +	preempt_disable();
> 
> Daft. 

Would migrate_disable() be better?

Yes, I know, in virtual environments the hypervisor can migrate the
vCPU anyway, but this does limit the potential damage to one of the
two schedulers.

> > +	testcpu = smp_processor_id();
> > +	pr_warn("Checking clocksource %s synchronization from CPU %d.\n", cs->name, testcpu);
> > +	for_each_online_cpu(cpu) {
> > +		if (cpu == testcpu)
> > +			continue;
> > +		csnow_begin = cs->read(cs);
> > +		smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1);
> > +		csnow_end = cs->read(cs);
> 
> As this must run with interrupts enabled, that's a pretty rough
> approximation like measuring wind speed with a wet thumb.
> 
> Wouldn't it be smarter to let the remote CPU do the watchdog dance and
> take that result? i.e. split out more of the watchdog code so that you
> can get the nanoseconds delta on that remote CPU to the watchdog.

First, an interrupt, NMI, SMI, vCPU preemption, or whatever cannot
cause a false positive.  A false negative, perhaps, but no false
positives: any such delay only widens the window between the
csnow_begin and csnow_end reads, so a remote clock that is in sync
still reads within that window.  Second, in normal operation such
delays are rare, so hitting the (eventual) default of eight CPUs is
very likely to result in tight bounds on the delay-based error for
most of those CPUs.  Third, we really need to compare the TSC on one
CPU directly to the TSC on the other in order to have a very clear
indication of a problem, should a real TSC-synchronization issue
arise.  In contrast, comparisons against the watchdog timer will be
more complicated, and any errors detected will be quite hard to prove
to be due to TSC issues.

Or am I once again missing something?

> > +		delta = (s64)((csnow_mid - csnow_begin) & cs->mask);
> > +		if (delta < 0)
> > +			cpumask_set_cpu(cpu, &cpus_behind);
> > +		delta = (csnow_end - csnow_mid) & cs->mask;
> > +		if (delta < 0)
> > +			cpumask_set_cpu(cpu, &cpus_ahead);
> > +		delta = clocksource_delta(csnow_end, csnow_begin, cs->mask);
> > +		cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
> 
> > +		if (firsttime || cs_nsec > cs_nsec_max)
> > +			cs_nsec_max = cs_nsec;
> > +		if (firsttime || cs_nsec < cs_nsec_min)
> > +			cs_nsec_min = cs_nsec;
> > +		firsttime = 0;
> 
>   int64_t cs_nsec_max = 0, cs_nsec_min = LLONG_MAX;
> 
> and then the firsttime muck is not needed at all.

Good point, will fix!

And again, thank you for looking all of this over.

							Thanx, Paul