Message-ID: <705dc2fe-7ab3-458a-9b5a-7ea0b30756b8@linux.intel.com>
Date: Fri, 16 Aug 2024 15:27:10 -0400
From: "Liang, Kan" <kan.liang@...ux.intel.com>
To: Thomas Gleixner <tglx@...utronix.de>, Li Huafei <lihuafei1@...wei.com>,
peterz@...radead.org, mingo@...hat.com
Cc: acme@...nel.org, namhyung@...nel.org, mark.rutland@....com,
alexander.shishkin@...ux.intel.com, jolsa@...nel.org, irogers@...gle.com,
adrian.hunter@...el.com, bp@...en8.de, dave.hansen@...ux.intel.com,
x86@...nel.org, hpa@...or.com, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org, Andi Kleen <ak@...ux.intel.com>,
Vince Weaver <vincent.weaver@...ne.edu>
Subject: Re: [PATCH] perf/x86/intel: Restrict period on Haswell
Hi Thomas,
On 2024-08-15 7:43 p.m., Thomas Gleixner wrote:
> On Thu, Aug 15 2024 at 16:15, Kan Liang wrote:
>> On 2024-08-15 2:26 p.m., Thomas Gleixner wrote:
>>>> Implication: When this erratum occurs, the fixed-function performance
>>>> counter IA32_FIXED_CTR0 may over count by up to 32.
>>>
>>> Sure. That's only explaining half of the problem.
>>>
>>> As I demonstrated in the non-contended case even with a count of 2 (I
>>> tried 1 too) the status bit is never set on the second check.
>>>
>>
>> Do you mean the below example? The status bit (32) of the fixed counter
>> 0 is always set.
>
> When HT is off or the threads are not running the handler concurrently
> then there is zero looping. Once they start to fiddle concurrently the
> looping starts and potentially never ends.
>
>> 65.147782: x86_perf_event_set_period: idx: 32 period: 1 left: 2
>> 65.147783: intel_pmu_handle_irq: loops: 001 status: 100000000
>>
>> 65.147784: x86_perf_event_set_period: idx: 32 period: 1 left: 2
>> 65.147784: intel_pmu_handle_irq: loops: 002 status: 100000000
>
> So in the non-concurrent (which includes !HT) case the status check
> after handling the event is always 0. This never gets into a loop, not
> even once.
>
>>>> It should explain why the issue is gone with the magic number 32 or
>>>> with Hyper-Threading disabled.
>>>
>>> It explains only half of it. If you use 32, then the counter is set to
>>> -32 so the overcount of 32 will still bring it to 0, which should set
>>> the status bit, no?
>>
>> I think it's up to 32, not always 32.
>> I don't have more details regarding the issue. The architect of HSW has
>> left. I'm asking around internally to find the original bug report of
>> the erratum. Hope there are more details in the report.
>
> See below.
>
>>>> I also found a related discussion about 9 years ago.
>>>> https://lore.kernel.org/lkml/alpine.DEB.2.11.1505181343090.32481@vincent-weaver-1.umelst.maine.edu/
>>>> Vince tried the workaround but it seems not work.
>>>
>>> Let me play with that. :)
>>
>> Appreciate it.
>
> I actually got it working. The initial sequence which "worked" is:
>
> 1) Clear bit 32 in IA32_PERF_GLOBAL_CTRL
> 2) Clear bit 0/1 in IA32_PERF_FIXED_CTR_CTRL
> 3) Zero the IA32_FIXED_CTR0 MSR
> 4) Set IA32_FIXED_CTR0 to (-left) & mask;
> 5) Set bit 0/1 in IA32_PERF_FIXED_CTR_CTRL
> 6) Set bit 32 in IA32_PERF_GLOBAL_CTRL
>
> If I omit #3 it does not work. If I flip #5 and #6 it does not work.
>
> So the initial "working" variant I had was hacking this sequence into
> x86_perf_event_set_period() (omitting the fixed counter 0 conditionals
> for readability):
>
> rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));
>
> rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
> wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);
>
> wrmsrl(hwc->event_base, 0);
> wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>
> wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
>
> Now I thought that this needs to be in intel/core.c and implemented a
> proper quirk. And of course, being smart, I decided it's a brilliant idea
> to use the cached values instead of the rdmsrl()s.
>
> cglbl = hybrid(cpuc->pmu, intel_ctrl) & ~cpuc->intel_ctrl_guest_mask;
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));
>
> cfixd = cpuc->fixed_ctrl_val;
> wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);
>
> wrmsrl(hwc->event_base, 0);
> wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>
> wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
>
> Surprise, surprise, that does not work. So I went back and wanted to
> know which rdmsrl() was curing it:
>
> rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));
>
> cfixd = cpuc->fixed_ctrl_val;
> wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);
>
> wrmsrl(hwc->event_base, 0);
> wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>
> wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
>
> This worked. Using the rdmsrl() only for MSR_ARCH_PERFMON_FIXED_CTR_CTRL
> did not.
>
> Now I got bold and boiled it down to:
>
> rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
>
> wrmsrl(hwc->event_base, 0);
> wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>
> and the whole thing still worked. *BLINK*
>
> Exactly zero loop entries in the trace when running 100 instances of
> that cve test case, which otherwise spams the trace with entries and
> ends up in the loop > 100 path within a split second.
>
> Removing the zeroing of the counter makes it come back, but reducing the
> whole nonsense to:
>
> wrmsrl(hwc->event_base, 0);
> wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>
Thanks for all the testing. So the trick is to clear the counter before
writing the new value. That sounds like the usual way to trigger some
ucode magic.
> makes the loop problem go away, but it "works" so well that the stupid
> frequency adjustment algorithm keeps the left == 1, i.e. count == 2 case
> around long enough to trigger 'hung task messages' ....
>
> Now I looked more at the dmesg output of all the experiments. In all
> "working" cases except one, running these 100 instances of cve... results
> in a varying cascade of
>
> perf: interrupt took too long (2503 > 2500), lowering ...
>
> messages.
>
> The one case where this does not happen is when the limit is
> unconditionally set to 32. But when I apply this limit only for the
> fixed counter 0 it comes back.
Yes, it sounds like there is something wrong with periods < 32 as well.
>
> Now I looked at when these 'took too long' problems surface aside from the
> obvious case of extensive looping. They are unrelated to the hyper
> threading issue as I can reproduce it with smt=off too.
>
> They always happen when a counter was armed with a count < 32 and two
> events expired in the same NMI. The test case uses fixed counter 0 and
> general counter 0 for the other event.
>
> So that erratum is a good hint, but that hardware does have more issues
> than it tells.
>
> So I think we should just apply that limit patch with a proper change
> log and also make it:
>
> hsw_limit(...)
> {
> *left = max(*left, erratum_hsw11(event) ? 128 : 32);
> }
>
The HSW11 erratum is also listed as BDM11. It sounds like we need the
tricks from both bdw and nhm combined.
How about this?
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index e8bd45556c30..42f557a128b9 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4664,6 +4664,12 @@ static void nhm_limit_period(struct perf_event *event, s64 *left)
 	*left = max(*left, 32LL);
 }
 
+static void hsw_limit_period(struct perf_event *event, s64 *left)
+{
+	nhm_limit_period(event, left);
+	bdw_limit_period(event, left);
+}
+
 static void glc_limit_period(struct perf_event *event, s64 *left)
 {
 	if (event->attr.precise_ip == 3)
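For reference, nhm_limit_period() just enforces a floor of 32 (visible in
the context above), and bdw_limit_period(), as I read the current core.c,
is roughly:

	static void bdw_limit_period(struct perf_event *event, s64 *left)
	{
		/* BDM11/BDM55: INST_RETIRED.ALL wants a period >= 128 with the low 6 bits clear */
		if ((event->hw.config & INTEL_ARCH_EVENT_MASK) ==
				X86_CONFIG(.event=0xc0, .umask=0x01)) {
			if (*left < 128)
				*left = 128;
			*left &= ~0x3FULL;
		}
	}

So the combination would give every event on HSW a floor of 32, and
INST_RETIRED.ALL additionally the >= 128 treatment, which lines up with
your 128 vs 32 suggestion above.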
Do you plan to post the "limit" patch for HSW?
Or should I send the patch?
> or such.
>
> That limit won't cure the overcount issue from that HSM154 erratum, but
> *SHRUG*.
>
The code has been there without the fix for ~10 years, and I didn't find
any complaints. I guess it should be fine to leave it as is.
Thanks,
Kan