Message-ID: <001701db676d$1ef85910$5ce90b30$@telus.net>
Date: Wed, 15 Jan 2025 08:47:09 -0800
From: "Doug Smythies" <dsmythies@...us.net>
To: "'Len Brown'" <lenb@...nel.org>
Cc: "'Peter Zijlstra'" <peterz@...radead.org>,
<linux-kernel@...r.kernel.org>,
<vincent.guittot@...aro.org>,
"'Ingo Molnar'" <mingo@...nel.org>,
<wuyun.abel@...edance.com>,
"Doug Smythies" <dsmythies@...us.net>
Subject: RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF
Hi Len,
Thank you for chiming in on this thread.
On 2025.01.14 18:09 Len Brown wrote:
> Doug,
> Your attention to detail and persistence has once again found a tricky
> underlying bug -- kudos!
>
> Re: turbostat behaviour
>
> Yes, TSC_MHz -- "the measured rate of the TSC during an interval", is
> printed as a sanity check. If there are any irregularities in it, as
> you noticed, then something very strange in the hardware or software
> is going wrong (and the actual turbostat results will likely not be
> reliable).
While I use turbostat almost every day, I am embarrassed to admit
that until this investigation I did not know about the ability to
add the "Time_Of_Day_Seconds" and "usec" columns.
They have been incredibly useful.
Early on, until I discovered those two "show" options, I was using
the TSC_MHz sanity-check calculation to reveal the anomaly.
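For anyone who, like me, had missed them, the columns are selected
with "--show" (names as used in this thread; exact flags may vary
with turbostat version):

    turbostat --quiet --show usec,Time_Of_Day_Seconds,CPU,Busy%,TSC_MHz --interval 1

The sanity check itself is just the TSC count delta divided by the
measured interval, scaled to MHz, so a delay that skews when a CPU's
TSC is sampled relative to the interval shows up as an off-nominal
TSC_MHz.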
> Yes, the "usec" column measures how long it takes to migrate to a CPU
> and collect stats there. So if you are hunting down a glitch in
> migration all you need is this column to see it. "usec" on the
> summary row is the difference between the 1st migration and after the
> last -- excluding the sysfs/procfs time that is consumed on the last
> CPU. So migration delays will also be reflected there.
On a per-CPU basis, it excludes the actual CPU migration step.
Peter and I made a modification to turbostat to have the per-CPU
"usec" column focus just on the CPU migration time. [1]
> Note: we have a patch queued which changes the "usec" on the Summary
> row to *include* the sysfs/procfs time on the last CPU.
I did not realise that it was just for the last "sysfs/procfs" time.
I'll take a closer look, and wonder if that can explain why I have been
unable to catch the lingering >= 10 millisecond delays.
> (The per-cpu
> "usec" values are unchanged.) This is because we've noticed some
> really weird delays in doing things like reading /proc/interrupts and
> we want to be able to easily do A/B comparisons by simply including or
> excluding counters.
Yes, I saw the patch on the linux-pm mailing list and have included it
in my local turbostat build for about a week now.
> Also FYI, The scheme of migrating to each CPU so that collecting stats
> there will be "local" isn't scaling so well on very large systems, and
> I'm about to take a close look at it. In yogini we used a different
> scheme, where a thread is bound to each CPU, so they can collect in
> parallel; and we may be moving to something like that.
>
> cheers,
> Len Brown, Intel Open Source Technology Center
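Regarding the parallel collection idea: in case a sketch helps the
discussion, this is roughly how I picture the bound-thread scheme
(my own illustration, not yogini or turbostat code; assumes CPUs
0..N-1 are online):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

/* Each thread pins itself to one CPU and reads that CPU's counters
 * locally, so all CPUs are collected in parallel rather than by one
 * thread migrating serially. */
static void *collect(void *arg)
{
	long cpu = (long)arg;
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	/* ... read this CPU's MSRs / sysfs / procfs counters here ... */
	return NULL;
}

int main(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	pthread_t tid[ncpus];

	for (long cpu = 0; cpu < ncpus; cpu++)
		pthread_create(&tid[cpu], NULL, collect, (void *)cpu);
	for (long cpu = 0; cpu < ncpus; cpu++)
		pthread_join(&tid[cpu], NULL);
	return 0;
}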
[1] https://lore.kernel.org/lkml/001b01db608a$56d3dc40$047b94c0$@telus.net/
... Doug