linux-kernel - Re: A couple of TSC questions


Message-ID: <c9680fe4-5f28-4436-84ee-472d1f5befb3@paulmck-laptop>
Date:   Thu, 13 Apr 2023 11:39:15 -0700
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Feng Tang <feng.tang@...el.com>
Cc:     Waiman Long <longman@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        linux-kernel@...r.kernel.org
Subject: Re: A couple of TSC questions

On Mon, Apr 03, 2023 at 11:11:40PM +0800, Feng Tang wrote:
> On Sun, Apr 02, 2023 at 08:38:37PM -0700, Paul E. McKenney wrote:
> > On Sun, Apr 02, 2023 at 10:05:51PM -0400, Waiman Long wrote:
> > > On 4/2/23 22:00, Paul E. McKenney wrote:
> > > > On Sun, Apr 02, 2023 at 09:04:04PM -0400, Waiman Long wrote:
> > > > > On 3/31/23 13:16, Paul E. McKenney wrote:
> > > > > > On Tue, Mar 28, 2023 at 02:58:54PM -0700, Paul E. McKenney wrote:
> > > > > > > On Mon, Mar 27, 2023 at 10:19:54AM +0800, Feng Tang wrote:
> > > > > > > > On Fri, Mar 24, 2023 at 05:47:33PM -0700, Paul E. McKenney wrote:
> > > > > > > > > On Wed, Mar 22, 2023 at 01:14:48PM +0800, Feng Tang wrote:

[ . . . ]

> > > > > > And what we are seeing is unlikely to be due to cache-latency-induced
> > > > > > delays.  We see a very precise warp, for example, one system always
> > > > > > has 182 cycles of TSC warp, another 273 cycles, and a third 469 cycles.
> > > > > > Another is at the insanely large value of about 2^64/10, and shows some
> > > > > > variation, but that variation is only about 0.1%.
> > > > > > 
> > > > > > But any given system only sees warp on about half of its reboots.
> > > > > > Perhaps due to the automation sometimes power cycling?
> > > > > > 
> > > > > > There are few enough affected systems that investigation will take
> > > > > > some time.
> > > > > > Maybe the difference in warp is due to the NUMA distance of the running
> > > > > > CPU from the node where the data resides. It will be interesting to see
> > > > > > if my patch helps.
> > > > Almost all of them are single-socket systems.
> 
> Interesting to know. I thought most of the TSC sync problems happen
> on multi-socket systems. IIRC, Waiman mentioned his platform is a
> Cooper Lake, which is for 4S or 8S platforms; Thomas and Peter also
> mentioned TSC sync issues on 8S platforms in other threads.
> 
> And your consistent warps of 182 (91 * 2) and 273 (91 * 3) cycles look
> 'artificial' :), maybe the TSC_ADJUST MSR was programmed by BIOS
> or other firmware?
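
One way to check the TSC_ADJUST theory on an affected box is to read the
MSR on each CPU and compare the values. The following is a minimal sketch,
not something posted in this thread. It assumes the "msr" kernel module is
loaded, that it runs as root, and that the MSR in question is
IA32_TSC_ADJUST at index 0x3b. The helper names are made up for
illustration, and treating the value as signed is a choice made here. It
also redoes the multiples-of-91 arithmetic for the three reported warps.

```python
import os
import struct

IA32_TSC_ADJUST = 0x3B  # architectural MSR index for TSC_ADJUST


def decode_msr(raw: bytes) -> int:
    """Interpret 8 raw MSR bytes as a signed 64-bit little-endian value."""
    (value,) = struct.unpack("<q", raw)
    return value


def read_tsc_adjust(cpu: int) -> int:
    """Read TSC_ADJUST for one CPU via /dev/cpu/N/msr (needs root + msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr device uses the file offset as the MSR index.
        return decode_msr(os.pread(fd, 8, IA32_TSC_ADJUST))
    finally:
        os.close(fd)


def warp_in_units(warp: int, unit: int = 91):
    """Return (quotient, remainder) of a warp value divided by `unit` cycles."""
    return divmod(warp, unit)


if __name__ == "__main__":
    # The three fixed warp values reported earlier in the thread.
    for warp in (182, 273, 469):
        print(warp, warp_in_units(warp))

    for cpu in range(os.cpu_count() or 1):
        try:
            print(f"cpu{cpu}: TSC_ADJUST = {read_tsc_adjust(cpu)}")
        except OSError as exc:
            print(f"cpu{cpu}: cannot read MSR ({exc})")
            break
```

Nonzero values, or values that differ between CPUs, would point toward
firmware having written the MSR. Note that 469 leaves a remainder of 14
when divided by 91, so only the first two warps fit the pattern.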

And all but one of them is almost assuredly a firmware issue.  But not
an Intel firmware issue, so there is that.  And in that case, the kernel
is doing what it should, yelling about a real problem.

The other is an Intel system, but is a one-off, with other ostensibly
identical systems doing just fine.  So it is likely simply a case of
dying hardware.  I will look closer when I return.

I will be on travel this coming week starting tomorrow (Friday),
Pacific Time.  There may be substantial intervals when I am completely
off the grid.

Have a great week!!!

							Thanx, Paul

> Thanks,
> Feng
> 
> > > > 
> > > > If the problem sticks with a few systems, I should be able to test
> > > > patches no problem.  If it is randomly distributed across the fleet, a
> > > > bit more prework analysis will be called for.  But what is life without
> > > > a challenge?  ;-)
> > > 
> > > If it is happening on a single socket system, maybe it is caused by false
> > > cacheline sharing. It is hard to tell unless we find a way to reproduce it.
> > 
> > But multiple times on a given system with exactly the same number of
> > clock cycles of warp each time?  It should be entertaining tracking this
> > one down.  ;-)
> > 
> > I will take a few scans of the fleet over the coming week and see if
> > there is any consistency.  Here is hoping...
> > 
> > 							Thanx, Paul