Message-ID: <87plq9l5d2.ffs@tglx>
Date: Fri, 16 Aug 2024 01:43:21 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: "Liang, Kan" <kan.liang@...ux.intel.com>, Li Huafei
 <lihuafei1@...wei.com>, peterz@...radead.org, mingo@...hat.com
Cc: acme@...nel.org, namhyung@...nel.org, mark.rutland@....com,
 alexander.shishkin@...ux.intel.com, jolsa@...nel.org, irogers@...gle.com,
 adrian.hunter@...el.com, bp@...en8.de, dave.hansen@...ux.intel.com,
 x86@...nel.org, hpa@...or.com, linux-perf-users@...r.kernel.org,
 linux-kernel@...r.kernel.org, Andi Kleen <ak@...ux.intel.com>, Vince
 Weaver <vincent.weaver@...ne.edu>
Subject: Re: [PATCH] perf/x86/intel: Restrict period on Haswell

On Thu, Aug 15 2024 at 16:15, Kan Liang wrote:
> On 2024-08-15 2:26 p.m., Thomas Gleixner wrote:
>>> Implication: When this erratum occurs, the fixed-function performance
>>> counter IA32_FIXED_CTR0 may over count by up to 32.
>> 
>> Sure. That's only explaining half of the problem.
>> 
>> As I demonstrated in the non-contended case even with a count of 2 (I
>> tried 1 too) the status bit is never set on the second check.
>> 
>
> Do you mean the below example? The status bit (32) of the fixed counter
> 0 is always set.

When HT is off or the threads are not running the handler concurrently,
there is zero looping. Once they start to fiddle concurrently, the
looping starts and potentially never ends.

> 65.147782: x86_perf_event_set_period: idx:    32 period:         1 left:  2
> 65.147783: intel_pmu_handle_irq:      loops: 001 status: 100000000
>
> 65.147784: x86_perf_event_set_period: idx:    32 period:         1 left:  2
> 65.147784: intel_pmu_handle_irq:      loops: 002 status: 100000000

So in the non-concurrent (which includes !HT) case the status check
after handling the event is always 0. This never gets into a loop, not
even once.
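
For reference, the "loop" here is the retry path in
intel_pmu_handle_irq(). Paraphrased from arch/x86/events/intel/core.c
(details elided), it has roughly this shape:

	int loops = 0;
again:
	status = intel_pmu_get_status();
	if (!status)
		goto done;

	if (++loops > 100) {
		/* Assume the PMU is stuck and reset it */
		perf_event_print_debug();
		intel_pmu_reset();
		goto done;
	}

	/* ... handle and ack the events flagged in status ... */

	/* Re-check; a non-zero status here is what makes it loop */
	status = intel_pmu_get_status();
	if (status)
		goto again;
done:

The loops and status values in the trace above are exactly that counter
and that re-read status.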

>>> It should explain that the issue is gone with the magic number 32 or
>>> with Hyper-Threading disabled.
>> 
>> It explains only half of it. If you use 32, then the counter is set to
>> -32 so the overcount of 32 will still bring it to 0, which should set
>> the status bit, no?
>
> I think it's up to 32, not always 32.
> I don't have more details regarding the issue. The architect of HSW has
> left. I'm asking around internally to find the original bug report of
> the erratum. Hope there are more details in the report.

See below.

>>> I also found a related discussion about 9 years ago.
>>> https://lore.kernel.org/lkml/alpine.DEB.2.11.1505181343090.32481@vincent-weaver-1.umelst.maine.edu/
> Vince tried the workaround but it did not seem to work.
>> 
>> Let me play with that. :)
>
> Appreciate it.

I got it actually working. The initial sequence which "worked" is:

    1) Clear bit 32  in IA32_PERF_GLOBAL_CTRL
    2) Clear bit 0/1 in IA32_PERF_FIXED_CTR_CTRL
    3) Zero the IA32_FIXED_CTR0 MSR
    4) Set IA32_FIXED_CTR0 to (-left) & mask;
    5) Set bit 0/1 in IA32_PERF_FIXED_CTR_CTRL
    6) Set bit 32  in IA32_PERF_GLOBAL_CTRL

If I omit #3 it does not work. If I flip #5 and #6 it does not work.

So the initial "working" variant I had was hacking this sequence into
x86_perf_event_set_period() (omitting the fixed counter 0 conditionals
for readability):

	/* #1: clear bit 32 in IA32_PERF_GLOBAL_CTRL */
	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));

	/* #2: clear bit 0/1 in IA32_PERF_FIXED_CTR_CTRL */
	rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);

	/* #3 + #4: zero the counter, then arm it with -left */
	wrmsrl(hwc->event_base, 0);
	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);

	/* #5 + #6: re-enable in the inverse order */
	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);

Now I thought that this needs to be in intel/core.c and implemented a
proper quirk. And of course, being smart, I decided it was a brilliant
idea to use the cached values instead of the rdmsrl()s.

	cglbl = hybrid(cpuc->pmu, intel_ctrl) & ~cpuc->intel_ctrl_guest_mask;
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));

	cfixd = cpuc->fixed_ctrl_val;
	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);

	wrmsrl(hwc->event_base, 0);
	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);

	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);

Surprise, surprise, that does not work. So I went back and wanted to
know which rdmsrl() cures it:

	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));

	cfixd = cpuc->fixed_ctrl_val;
	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);

	wrmsrl(hwc->event_base, 0);
	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);

	wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);

This worked. Using the rdmsrl() only for MSR_ARCH_PERFMON_FIXED_CTR_CTRL
did not.

Now I got bold and boiled it down to:

	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);

	wrmsrl(hwc->event_base, 0);
	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);

and the whole thing still worked. *BLINK*

Exactly zero loop entries in the trace when running 100 instances of
that cve test case, which otherwise spams the trace with entries and
ends up in the loop > 100 path within a split second.

Removing the zeroing of the counter makes it come back, but reducing the
whole nonsense to:

	wrmsrl(hwc->event_base, 0);
	wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);

makes the loop problem go away, but it "works" so well that the stupid
frequency adjustment algorithm lets the left == 1, i.e. count == 2,
case stick around long enough to trigger 'hung task' messages ....
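
For context: in frequency mode the period is re-derived on every tick
from the observed event rate. A simplified model of what
perf_calculate_period() in kernel/events/core.c computes (calc_period()
is illustration only; the real one uses fls64() games to avoid 64bit
overflow, but the math boils down to this):

	/*
	 * Events counted in the last interval, scaled to events/sec,
	 * divided by the requested sample frequency. A hot counter plus
	 * a high sample_freq collapses this to a tiny period, which is
	 * how the left == 1, i.e. count == 2, case sticks around.
	 */
	static u64 calc_period(u64 count, u64 nsec, u64 freq)
	{
		u64 period = div64_u64(count * NSEC_PER_SEC, nsec * freq);

		return period ?: 1;
	}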

Now I looked more at the dmesg output of all the experiments. In all
"working" cases except one, running these 100 instances of cve...
results in a varying cascade of

   perf: interrupt took too long (2503 > 2500), lowering ...

messages.

The one case where this does not happen is when the limit is
unconditionally set to 32. But when I apply this limit only for the
fixed counter 0, it comes back.
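
The fixed counter 0 only variant was along these lines (reconstructed
for illustration; INTEL_PMC_IDX_FIXED is 32, which is the idx in the
traces above):

	/* Limit only fixed counter 0, i.e. idx 32 */
	if (hwc->idx == INTEL_PMC_IDX_FIXED)
		left = max(left, 32);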

Now I looked at when these 'took too long' problems surface aside from
the obvious case of extensive looping. They are unrelated to the
hyper-threading issue, as I can reproduce them with smt=off too.

They always happen when a counter was armed with a count < 32 and two
events expired in the same NMI. The test case uses fixed counter 0 and
general counter 0 for the other event.

So that erratum is a good hint, but that hardware has more issues than
it admits.

So I think we should just apply that limit patch with a proper
changelog and also make it:

hsw_limit(...)
{
	*left = max(*left, erratum_hsw11(event) ? 128 : 32);
}

or such.
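
Fleshed out, that's a limit_period quirk along these lines (sketch;
erratum_hsw11() is the helper from the patch under discussion, and the
signature is what x86_pmu.limit_period takes):

	static void hsw_limit(struct perf_event *event, s64 *left)
	{
		*left = max(*left, erratum_hsw11(event) ? 128 : 32);
	}

	/* and in intel_pmu_init(), for the Haswell model cases: */
	x86_pmu.limit_period = hsw_limit;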

That limit won't cure the overcount issue from that HSM154 erratum, but
*SHRUG*.

Thanks,

        tglx
