Message-ID: <f1fe524a8892acf032d8ddabda036c89a3cffce5.camel@surriel.com>
Date: Tue, 16 Jul 2024 09:10:08 -0400
From: Rik van Riel <riel@...riel.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: "Paul E. McKenney" <paulmck@...nel.org>, linux-kernel@...r.kernel.org, 
	kernel-team@...a.com, neeraj.upadhyay@...nel.org, mingo@...nel.org, 
	rostedt@...dmis.org, Leonardo Bras <leobras@...hat.com>
Subject: Re: [PATCH] smp: print only local CPU info when sched_clock goes backward

On Tue, 2024-07-16 at 11:04 +0200, Peter Zijlstra wrote:
> On Mon, Jul 15, 2024 at 01:49:41PM -0400, Rik van Riel wrote:
> > About 40% of all csd_lock warnings observed in our fleet appear to
> > be due to sched_clock() going backward in time (usually only a
> > little bit), resulting in ts0 being larger than ts2.
> > 
> > When the local CPU is at fault, we should print out a message
> > reflecting that, rather than trying to get the remote CPU's stack
> > trace.
> > 
> > Signed-off-by: Rik van Riel <riel@...riel.com>
> > ---
> >  kernel/smp.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index f085ebcdf9e7..5656ef63ea82 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -237,6 +237,14 @@ static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, int *bug_id)
> >  	if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
> >  		return false;
> >  
> > +	if (ts0 > ts2) {
> > +		/* Our own sched_clock went backward; don't blame another CPU. */
> > +		ts_delta = ts0 - ts2;
> > +		pr_alert("sched_clock on CPU %d went backward by %llu ns\n", raw_smp_processor_id(), ts_delta);
> > +		*ts1 = ts2;
> > +		return false;
> > +	}
> 
> So I've seen some chatter about this on IRC and was WTF, seeing this
> patch I'm still WTF. What is going on with those machines?!?!
> 

Unfortunately this particular behavior does not appear to be limited
to a few bad systems. The distribution over time looks as follows:

- 1 day:     N affected systems
- 3 days:   3N affected systems
- 1 week:   6N affected systems
- 2 weeks: 14N affected systems
- 30 days: 29N affected systems
- 90 days: 72N affected systems

In other words, we are not looking at a few "repeat customers",
but at a large number of systems that show the backward TSC
behavior a few times, and then not again for months.

N is a triple digit number, so I'm fairly confident in these
numbers.

If it's any consolation, this seems to happen on both Intel
and AMD systems.

The only detection code we have currently is in csd_lock_wait_toolong,
so there could be a lot of cases of the TSC moving backward that
are falling through the cracks.
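
To look for cases outside the csd_lock path, even a pinned user-space
spin loop would do. A minimal sketch, assuming x86-64 Linux, gcc, and
CPU 0 being a valid target (this is not something we currently run in
the fleet):

/* Hypothetical probe: pin to one CPU, spin reading the TSC, and
 * report any backward step -- the same ts0 > ts2 comparison that
 * csd_lock_wait_toolong does, just in user space. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void)
{
        cpu_set_t set;
        uint64_t prev;

        /* Pin to CPU 0 so cross-CPU TSC offsets cannot masquerade
         * as local backward motion. */
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(0, sizeof(set), &set))
                perror("sched_setaffinity");

        prev = __rdtsc();
        for (;;) {
                uint64_t now = __rdtsc();

                if (now < prev)
                        printf("TSC went backward by %llu cycles\n",
                               (unsigned long long)(prev - now));
                prev = now;
        }
}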

Most of the time the TSC only moves backward by a few dozen or a few
hundred nanoseconds; backward jumps larger than a few microseconds
form a very long tail. I have not sliced this data by CPU model, and
don't know whether the distribution differs between CPU models.

The code in question simply loads one sched_clock value into ts0,
and another into ts2:

static void __csd_lock_wait(call_single_data_t *csd)
{
        int bug_id = 0;
        u64 ts0, ts1;

        ts1 = ts0 = sched_clock();
        for (;;) {
                if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id))
                        break;
                cpu_relax();
        }

...

static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, int *bug_id)
{
...
        ts2 = sched_clock();


I don't know whether it matters to the CPU that csd_lock_wait_toolong
gets inlined into __csd_lock_wait.
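
If anyone wants to rule that out, the experiment is a one-liner
(untested sketch, not a proposed change):

-static bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, int *bug_id)
+static noinline bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, int *bug_id)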

On one particular system replacing the rdtsc() in native_sched_clock()
with rdtsc_ordered() did not help, but that system sees frequent
negative time jumps of 2.5ms, and might just be broken.
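
For reference, that experiment was essentially a one-line change in
native_sched_clock() (sketch from memory, surrounding code omitted):

-		u64 tsc_now = rdtsc();
+		u64 tsc_now = rdtsc_ordered();

rdtsc_ordered() keeps the TSC read from being hoisted above older,
still-executing instructions (via LFENCE or RDTSCP), while a plain
rdtsc() can execute speculatively early.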

The MTBF is high enough that I'm not sure whether always using
rdtsc_ordered() would be warranted, and I don't know whether it
would even help...

-- 
All Rights Reversed.
