Message-ID: <f1fe524a8892acf032d8ddabda036c89a3cffce5.camel@surriel.com>
Date: Tue, 16 Jul 2024 09:10:08 -0400
From: Rik van Riel <riel@...riel.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: "Paul E. McKenney" <paulmck@...nel.org>, linux-kernel@...r.kernel.org,
kernel-team@...a.com, neeraj.upadhyay@...nel.org, mingo@...nel.org,
rostedt@...dmis.org, Leonardo Bras <leobras@...hat.com>
Subject: Re: [PATCH] smp: print only local CPU info when sched_clock goes
backward
On Tue, 2024-07-16 at 11:04 +0200, Peter Zijlstra wrote:
> On Mon, Jul 15, 2024 at 01:49:41PM -0400, Rik van Riel wrote:
> > About 40% of all csd_lock warnings observed in our fleet appear to
> > be due to sched_clock() going backward in time (usually only a
> > little
> > bit), resulting in ts0 being larger than ts2.
> >
> > When the local CPU is at fault, we should print out a message
> > reflecting
> > that, rather than trying to get the remote CPU's stack trace.
> >
> > Signed-off-by: Rik van Riel <riel@...riel.com>
> > ---
> > kernel/smp.c | 8 ++++++++
> > 1 file changed, 8 insertions(+)
> >
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index f085ebcdf9e7..5656ef63ea82 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -237,6 +237,14 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, in
> >  	if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
> >  		return false;
> > 
> > +	if (ts0 > ts2) {
> > +		/* Our own sched_clock went backward; don't blame another CPU. */
> > +		ts_delta = ts0 - ts2;
> > +		pr_alert("sched_clock on CPU %d went backward by %llu ns\n", raw_smp_processor_id(), ts_delta);
> > +		*ts1 = ts2;
> > +		return false;
> > +	}
>
> So I've seen some chatter about this on IRC and was WTF, seeing this
> patch I'm still WTF. What is going on with those machines?!?!
>
Unfortunately this particular behavior does not appear to be limited
to a few bad systems. The distribution over time looks as follows:

- 1 day: N affected systems
- 3 days: 3N affected systems
- 1 week: 6N affected systems
- 2 weeks: 14N affected systems
- 30 days: 29N affected systems
- 90 days: 72N affected systems

In other words, we are not looking at a few "repeat customers",
but at a large number of systems that show the backward TSC
behavior a few times, and then not again for months.

N is a triple digit number, so I'm fairly confident in these
numbers.

If it's any consolation, this seems to happen on both Intel
and AMD systems.

The only detection code we currently have is in csd_lock_wait_toolong,
so there could be a lot of cases of the TSC moving backward that
are falling through the cracks.

Most of the time the TSC only moves backward by a few dozen or a few
hundred nanoseconds; backward jumps past a few microseconds are
a very long tail. I have not sliced this data by CPU model, and
don't know whether the distribution is different for different CPUs.

The code in question simply loads one sched_clock() value into ts0,
and another into ts2:
static void __csd_lock_wait(call_single_data_t *csd)
{
	int bug_id = 0;
	u64 ts0, ts1;

	ts1 = ts0 = sched_clock();
	for (;;) {
		if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id))
			break;
		cpu_relax();
	}
	...

static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, int *bug_id)
{
	...
	ts2 = sched_clock();
I don't know whether it matters to the CPU that csd_lock_wait_toolong()
gets inlined into __csd_lock_wait().
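
Just to spell out why ts0 > ts2 ends up being reported as a stuck CSD
lock: the elapsed-time check subtracts u64 timestamps, so a clock that
moved backward by even a few nanoseconds underflows into an enormous
delta. A minimal user-space sketch (not kernel code; the 5 second
value is only assumed here as the default csd_lock_timeout):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t csd_lock_timeout_ns = 5000ULL * 1000 * 1000;	/* assumed 5 s default */
	uint64_t ts0 = 1000000200;	/* first sched_clock() read */
	uint64_t ts2 = 1000000000;	/* later read, 200 ns "in the past" */
	uint64_t ts_delta = ts2 - ts0;	/* u64 underflow: ~2^64 - 200 */

	printf("ts_delta = %" PRIu64 " ns\n", ts_delta);
	if (ts_delta > csd_lock_timeout_ns && csd_lock_timeout_ns != 0)
		printf("would be flagged as a stuck csd_lock\n");
	return 0;
}
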
On one particular system, replacing the rdtsc() in native_sched_clock()
with rdtsc_ordered() did not help, but that system sees frequent
negative time jumps of 2.5 ms, and might just be broken.
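
That replacement amounts to a one-line change in native_sched_clock()
in arch/x86/kernel/tsc.c, roughly the hunk below (sketched from memory,
the context lines may differ by kernel version); rdtsc_ordered() puts a
barrier in front of RDTSC so the timestamp read cannot be speculated
ahead of earlier instructions:

--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ ... @@ native_sched_clock
 	if (static_branch_likely(&__use_tsc)) {
-		u64 tsc_now = rdtsc();
+		u64 tsc_now = rdtsc_ordered();	/* barrier; rdtsc */
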
The MTBF is high enough that I'm not sure whether always using
rdtsc_ordered() would be warranted, and I don't know whether it
would even help...
--
All Rights Reversed.