linux-kernel - Re: [RESEND PATCH 2/2] perf/x86: improve the event scheduling to avoid unnecessary pmu

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <YidREXNn2AtI3V1c@hirez.programming.kicks-ass.net>
Date:   Tue, 8 Mar 2022 13:50:25 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Wen Yang <wenyang@...ux.alibaba.com>
Cc:     Wen Yang <simon.wy@...baba-inc.com>,
        Ingo Molnar <mingo@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Stephane Eranian <eranian@...gle.com>,
        mark rutland <mark.rutland@....com>,
        jiri olsa <jolsa@...hat.com>,
        namhyung kim <namhyung@...nel.org>,
        borislav petkov <bp@...en8.de>, x86@...nel.org,
        "h. peter anvin" <hpa@...or.com>, linux-perf-users@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [RESEND PATCH 2/2] perf/x86: improve the event scheduling to
 avoid unnecessary pmu_stop/start

On Sun, Mar 06, 2022 at 10:36:38PM +0800, Wen Yang wrote:

> The perf event generated by the above script A (cycles:D), and the counter
> it used changes from #1 to #3. We use perf event in pinned mode, and then
> continuously read its value for a long time, but its PMU counter changes

Yes, so what?

>, so the counter value will also jump.

I fail to see how the counter value will jump when we reprogram the
thing. When we stop we update the value, then reprogram on another
counter and continue. So where does it go sideways?

> 
> 0xffff88b72db85800:
> The perf event generated by the above script A (instructions:D), which has
> always occupied #fixed_instruction.
> 
> 0xffff88bf46c34000, 0xffff88bf46c35000, 0xffff88bf46c30000:
> Theses perf events are generated by the above script B.
> 
> 
> > > 
> > > so it will cause unnecessary pmu_stop/start and also cause abnormal cpi.
> > 
> > How?!?
> > 
> 
> We may refer to the x86_pmu_enable function:
> step1: save events moving to new counters
> step2: reprogram moved events into new counters
> 
> especially:
> 
> static inline int match_prev_assignment(struct hw_perf_event *hwc,
>                     struct cpu_hw_events *cpuc,
>                     int i)
> {
>     return hwc->idx == cpuc->assign[i] &&
>         hwc->last_cpu == smp_processor_id() &&
>         hwc->last_tag == cpuc->tags[i];
> }

I'm not seeing an explanation for how a counter value is not preserved.

> > > Cloud servers usually continuously monitor the cpi data of some important
> > > services. This issue affects performance and misleads monitoring.
> > > 
> > > The current event scheduling algorithm is more than 10 years old:
> > > commit 1da53e023029 ("perf_events, x86: Improve x86 event scheduling")
> > 
> > irrelevant
> > 
> 
> commit 1da53e023029 ("perf_events, x86: Improve x86 event scheduling")
> 
> This commit is the basis of the perf event scheduling algorithm we currently
> use.

Well yes. But how is the age of it relevant?

> The reason why the counter above changed from #1 to #3 can be found from it:
> The algorithm takes into account the list of counter constraints
> for each event. It assigns events to counters from the most
> constrained, i.e., works on only one counter, to the least
> constrained, i.e., works on any counter.
> 
> the nmi watchdog permanently consumes one fp (*cycles*).
>  therefore, when the above shell script obtains *cycles:D*
> again, it has to use a GP, and its weight is 5.
> but other events (like *cache-misses*) have a weight of 4,
> so the counter used by *cycles:D* will often be taken away.

So what?

I mean, it is known the algorithm isn't optimal, but at least it's
bounded. There are event sets that will fail to schedule but could, but
I don't think you're talking about that.

Event migrating to a different counter is not a problem. This is
expected and normal. Code *must* be able to deal with it.

> In addition, we also found that this problem may affect NMI watchdog in the
> production cluster.
> The NMI watchdog also uses a fixed counter in fixed mode. Usually, it is The
> first element of the event_list array, so it usually takes precedence and
> can get a fixed counter.
> But if the administrator closes the watchdog first and then enables it, it
> may be at the end of the event_list array, so its expected fixed counter may
> be occupied by other perf event, and it can only use one GP. In this way,
> there is a similar issue here: the PMU counter used by the NMI watchdog may
> be disabled/enabled frequently and unnecessarily.

Again, I'm not seeing a problem. If you create more events than we have
hardware counters we'll rotate the list and things will get scheduled in
all sorts of order. This works.

> Any advice or guidance on this would be appreciated.

I'm still not sure what your actual problem is; I suspect you're using
perf wrong.

Are you using rdpmc and not respecting the scheme described in
include/uapi/linux/perf_events.h:perf_event_mmap_page ?

Note that if you're using pinned counters you can simplify that scheme
by ignoring all the timekeeping nonsense. In that case it does become
significantly simpler/faster.

But you cannot use rdpmc without using the mmap page's self-monitoring
data.