Message-ID: <705dc2fe-7ab3-458a-9b5a-7ea0b30756b8@linux.intel.com>
Date: Fri, 16 Aug 2024 15:27:10 -0400
From: "Liang, Kan" <kan.liang@...ux.intel.com>
To: Thomas Gleixner <tglx@...utronix.de>, Li Huafei <lihuafei1@...wei.com>,
peterz@...radead.org, mingo@...hat.com
Cc: acme@...nel.org, namhyung@...nel.org, mark.rutland@....com,
alexander.shishkin@...ux.intel.com, jolsa@...nel.org, irogers@...gle.com,
adrian.hunter@...el.com, bp@...en8.de, dave.hansen@...ux.intel.com,
x86@...nel.org, hpa@...or.com, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org, Andi Kleen <ak@...ux.intel.com>,
Vince Weaver <vincent.weaver@...ne.edu>
Subject: Re: [PATCH] perf/x86/intel: Restrict period on Haswell
Hi Thomas,
On 2024-08-15 7:43 p.m., Thomas Gleixner wrote:
> On Thu, Aug 15 2024 at 16:15, Kan Liang wrote:
>> On 2024-08-15 2:26 p.m., Thomas Gleixner wrote:
>>>> Implication: When this erratum occurs, the fixed-function performance
>>>> counter IA32_FIXED_CTR0 may over count by up to 32.
>>>
>>> Sure. That's only explaining half of the problem.
>>>
>>> As I demonstrated in the non-contended case even with a count of 2 (I
>>> tried 1 too) the status bit is never set on the second check.
>>>
>>
>> Do you mean the below example? The status bit (32) of the fixed counter
>> 0 is always set.
>
> When HT is off or the threads are not running the handler concurrently
> then there is zero looping. Once they start to fiddle concurrently the
> looping starts and potentially never ends.
>
>> 65.147782: x86_perf_event_set_period: idx: 32 period: 1 left: 2
>> 65.147783: intel_pmu_handle_irq: loops: 001 status: 100000000
>>
>> 65.147784: x86_perf_event_set_period: idx: 32 period: 1 left: 2
>> 65.147784: intel_pmu_handle_irq: loops: 002 status: 100000000
>
> So in the non-concurrent (which includes !HT) case the status check
> after handling the event is always 0. This never gets into a loop, not
> even once.
>
>>>> It should explain why the issue is gone with the magic number 32 or
>>>> with Hyper-Threading disabled.
>>>
>>> It explains only half of it. If you use 32, then the counter is set to
>>> -32 so the overcount of 32 will still bring it to 0, which should set
>>> the status bit, no?
>>
>> I think it's up to 32, not always 32.
>> I don't have more details regarding the issue. The architect of HSW has
>> left. I'm asking around internally to find the original bug report of
>> the erratum. Hope there are more details in the report.
>
> See below.
>
>>>> I also found a related discussion about 9 years ago.
>>>> https://lore.kernel.org/lkml/alpine.DEB.2.11.1505181343090.32481@vincent-weaver-1.umelst.maine.edu/
>>>> Vince tried the workaround but it seems not work.
>>>
>>> Let me play with that. :)
>>
>> Appreciate it.
>
> I actually got it working. The initial sequence which "worked" is:
>
> 1) Clear bit 32 in IA32_PERF_GLOBAL_CTRL
> 2) Clear bit 0/1 in IA32_PERF_FIXED_CTR_CTRL
> 3) Zero the IA32_FIXED_CTR0 MSR
> 4) Set IA32_FIXED_CTR0 to (-left) & mask;
> 5) Set bit 0/1 in IA32_PERF_FIXED_CTR_CTRL
> 6) Set bit 32 in IA32_PERF_GLOBAL_CTRL
>
> If I omit #3 it does not work. If I flip #5 and #6 it does not work.
>
> So the initial "working" variant I had was hacking this sequence into
> x86_perf_event_set_period() (omitting the fixed counter 0 conditionals
> for readability):
>
> rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));
>
> rdmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
> wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);
>
> wrmsrl(hwc->event_base, 0);
> wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>
> wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
>
> Now I thought that this needs to be in intel/core.c and implemented a
> proper quirk. And of course, being smart, I decided it's a brilliant idea
> to use the cached values instead of the rdmsrl()s.
>
> cglbl = hybrid(cpuc->pmu, intel_ctrl) & ~cpuc->intel_ctrl_guest_mask;
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));
>
> cfixd = cpuc->fixed_ctrl_val;
> wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);
>
> wrmsrl(hwc->event_base, 0);
> wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>
> wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
>
> Surprise, surprise, that does not work. So I went back and wanted to
> know which rdmsrl() was curing it:
>
> rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl & ~(1ULL << 32));
>
> cfixd = cpuc->fixed_ctrl_val;
> wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd & ~3ULL);
>
> wrmsrl(hwc->event_base, 0);
> wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>
> wrmsrl(MSR_ARCH_PERFMON_FIXED_CTR_CTRL, cfixd);
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
>
> This worked. Using the rdmsrl() only for MSR_ARCH_PERFMON_FIXED_CTR_CTRL
> did not.
>
> Now I got bold and boiled it down to:
>
> rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, cglbl);
>
> wrmsrl(hwc->event_base, 0);
> wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>
> and the whole thing still worked. *BLINK*
>
> Exactly zero loop entries in the trace when running 100 instances of
> that cve test case, which otherwise spams the trace with entries and
> ends up in the loop > 100 path within a split second.
>
> Removing the zeroing of the counter makes it come back, but reducing the
> whole nonsense to:
>
> wrmsrl(hwc->event_base, 0);
> wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>
Thanks for all the testing. So the trick is to clear the counter before
writing the new value. That sounds like the usual way to trigger some
ucode magic.
> makes the loop problem go away, but it "works" so well that the stupid
> frequency adjustment algorithm keeps the left == 1, i.e. count == 2 case
> around long enough to trigger 'hung task messages' ....
>
> Now I looked more at the dmesg output of all the experiments. In all
> "working" cases except one, running these 100 instances of cve... results
> in a varying cascade of
>
> perf: interrupt took too long (2503 > 2500), lowering ...
>
> messages.
>
> The one case where this does not happen is when the limit is
> unconditionally set to 32. But when I apply this limit only for the
> fixed counter 0 it comes back.
Yes, it sounds like there is something wrong with periods < 32 as well.
>
> Now I looked at when these 'took too long' problems surface aside from the
> obvious case of extensive looping. They are unrelated to the hyper
> threading issue as I can reproduce it with smt=off too.
>
> They always happen when a counter was armed with a count < 32 and two
> events expired in the same NMI. The test case uses fixed counter 0 and
> general counter 0 for the other event.
>
> So that erratum is a good hint, but that hardware does have more issues
> than it tells.
>
> So I think we should just apply that limit patch with a proper change
> log and also make it:
>
> hsw_limit(...)
> {
> *left = max(*left, erratum_hsw11(event) ? 128 : 32);
> }
>
The HSW11 erratum is also listed as BDM11. It sounds like we need the
tricks from both bdw and nhm combined.
How about this?
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index e8bd45556c30..42f557a128b9 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4664,6 +4664,12 @@ static void nhm_limit_period(struct perf_event *event, s64 *left)
 	*left = max(*left, 32LL);
 }
 
+static void hsw_limit_period(struct perf_event *event, s64 *left)
+{
+	nhm_limit_period(event, left);
+	bdw_limit_period(event, left);
+}
+
 static void glc_limit_period(struct perf_event *event, s64 *left)
 {
 	if (event->attr.precise_ip == 3)
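For reference, nhm_limit_period() just enforces a floor of 32 (visible in
the context above), and bdw_limit_period(), as I read the current core.c,
is roughly:

	static void bdw_limit_period(struct perf_event *event, s64 *left)
	{
		/* BDM11/BDM55: INST_RETIRED.ALL wants a period >= 128 with the low 6 bits clear */
		if ((event->hw.config & INTEL_ARCH_EVENT_MASK) ==
				X86_CONFIG(.event=0xc0, .umask=0x01)) {
			if (*left < 128)
				*left = 128;
			*left &= ~0x3FULL;
		}
	}

So the combination would give every event on HSW a floor of 32, and
INST_RETIRED.ALL additionally the >= 128 treatment, which lines up with
your 128 vs 32 suggestion above.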
Do you plan to post the "limit" patch for HSW?
Or should I send the patch?
> or such.
>
> That limit won't cure the overcount issue from that HSM154 erratum, but
> *SHRUG*.
>
The code has been there without the fix for ~10 years, and I didn't find
any complaints. I guess it should be fine to leave it as is.
Thanks,
Kan