linux-kernel - Re: [RESEND PATCH 2/2] perf/x86: improve the event scheduling to avoid unnecessary pmu

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0c119da1-053b-a2d6-1579-8fb09dbe8e63@linux.alibaba.com>
Date:   Sun, 6 Mar 2022 22:36:38 +0800
From:   Wen Yang <wenyang@...ux.alibaba.com>
To:     Peter Zijlstra <peterz@...radead.org>,
        Wen Yang <simon.wy@...baba-inc.com>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Stephane Eranian <eranian@...gle.com>,
        mark rutland <mark.rutland@....com>,
        jiri olsa <jolsa@...hat.com>,
        namhyung kim <namhyung@...nel.org>,
        borislav petkov <bp@...en8.de>, x86@...nel.org,
        "h. peter anvin" <hpa@...or.com>, linux-perf-users@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [RESEND PATCH 2/2] perf/x86: improve the event scheduling to
 avoid unnecessary pmu_stop/start



在 2022/3/4 下午11:39, Peter Zijlstra 写道:
> On Fri, Mar 04, 2022 at 07:03:51PM +0800, Wen Yang wrote:
>> this issue has been there for a long time, we could reproduce it as follows:
> 
> What issue? You've not described an issue. So you cannot reference one.
> 

OK, thank you for your opinion. Let's explain it.


> This is still completely unreadable gibberish.
> 
>> 1, run a script that periodically collects perf data, eg:
>> while true
>> do
>>      perf stat -e cache-misses,cache-misses,cache-misses -c 1 sleep 2
>>      perf stat -e cache-misses -c 1 sleep 2
>>      sleep 1
>> done
>>
>> 2, run another one to capture the ipc, eg:
>> perf stat -e cycles:d,instructions:d  -c 1 -i 1000
> 
> <snip line noise>
> 
>> the reason is that the nmi watchdog permanently consumes one fp
>> (*cycles*). therefore, when the above shell script obtains *cycles*
>> again, only one gp can be used, and its weight is 5.
>> but other events (like *cache-misses*) have a weight of 4,
>> so the counter used by *cycles* will often be taken away, as in
>> the raw data above:
>> [1]
>>    n_events = 3
>>    assign = {33, 1, 32, ...}
>> -->
>>    n_events = 6
>>    assign = {33, 3, 32, 0, 1, 2, ...}
> 
> Again unreadable... what do any of those numbers mean?
> 

Scenario: a monitor program will monitor the CPI on a specific CPU in 
pinned mode for a long time, such as the CPI in the original email:
     perf stat -e cycles:D,instructions:D  -C 1 -I 1000

Perf here is just an example. Our monitor program will continuously read 
the counter values of these perf events (cycles and instructions).

However, it is found that CPI data will be abnormal periodically because 
PMU counter conflicts. For example, scripts in e-mail will cause 
interference:
     perf stat -e cache-misses,cache-misses,cache-misses -C 1 sleep 2
     perf stat -e cache-misses -C 1 sleep 2 

   n_events = 3
   assign = {33, 1, 32, ...}
-->
   n_events = 6
   assign = {33, 3, 32, 0, 1, 2, ...}

They are some fields of cpu_hw_events structure.
int n_events; /* the # of events in the below arrays */
int assign[X86_PMC_IDX_MAX]; /* event to counter assignment */
struct perf_event *event_list[X86_PMC_IDX_MAX]; /* in enabled order */

32: intel_pmc_idx_fixed_instructions
33: intel_pmc_idx_fixed_cpu_cycles
34: intel_pmc_idx_fixed_ref_cycles
0,1,2,3: gp


n_events = 3
assign = {33, 1, 32, ...}
event_list = {0xffff88bf77b85000, 0xffff88b72db82000, 
0xffff88b72db85800, ...}

—>
Indicates that there are 3 perf events on CPU 1, which occupy the 
#fixed_cpu_cycles, #1 and #fixed_instruction counter one by one.
These 3 perf events are generated by the NMI watchdog and the script A:
perf stat -e cycles:D,instructions:D  -C 1 -I 1000


n_events = 6
assign = {33, 3, 32, 0, 1, 2, ...}
event_list = {0xffff88bf77b85000, 0xffff88b72db82000, 
0xffff88b72db85800, 0xffff88bf46c34000, 0xffff88bf46c35000, 
0xffff88bf46c30000,  ...}

—>
Indicates that there are 6 perf events on CPU 1, which occupy the 
#fixed_cpu_cycles, #3, #fixed_instruction, #0, #1 and #2 counter one by one.
These 6 perf events are generated by the NMI watchdog and the script A 
and B:
perf stat -e cycles:D,instructions:D  -C 1 -I 1000
perf stat -e cache-misses,cache-misses,cache-misses -C 1 sleep 2

0xffff88bf77b85000:
The perf event generated by NMI watchdog, which has always occupied 
#fixed_cpu_cycles

0xffff88b72db82000:
The perf event generated by the above script A (cycles:D), and the 
counter it used changes from #1 to #3. We use perf event in pinned mode, 
and then continuously read its value for a long time, but its PMU 
counter changes, so the counter value will also jump.


0xffff88b72db85800:
The perf event generated by the above script A (instructions:D), which 
has always occupied #fixed_instruction.

0xffff88bf46c34000, 0xffff88bf46c35000, 0xffff88bf46c30000:
Theses perf events are generated by the above script B.


>>
>> so it will cause unnecessary pmu_stop/start and also cause abnormal cpi.
> 
> How?!?
> 

We may refer to the x86_pmu_enable function:
step1: save events moving to new counters
step2: reprogram moved events into new counters

especially:

static inline int match_prev_assignment(struct hw_perf_event *hwc,
                     struct cpu_hw_events *cpuc,
                     int i)
{
     return hwc->idx == cpuc->assign[i] &&
         hwc->last_cpu == smp_processor_id() &&
         hwc->last_tag == cpuc->tags[i];
}



>> Cloud servers usually continuously monitor the cpi data of some important
>> services. This issue affects performance and misleads monitoring.
>>
>> The current event scheduling algorithm is more than 10 years old:
>> commit 1da53e023029 ("perf_events, x86: Improve x86 event scheduling")
> 
> irrelevant
> 

commit 1da53e023029 ("perf_events, x86: Improve x86 event scheduling")

This commit is the basis of the perf event scheduling algorithm we 
currently use.
The reason why the counter above changed from #1 to #3 can be found from it:
The algorithm takes into account the list of counter constraints
for each event. It assigns events to counters from the most
constrained, i.e., works on only one counter, to the least
constrained, i.e., works on any counter.

the nmi watchdog permanently consumes one fp (*cycles*).
  therefore, when the above shell script obtains *cycles:D*
again, it has to use a GP, and its weight is 5.
but other events (like *cache-misses*) have a weight of 4,
so the counter used by *cycles:D* will often be taken away.

In addition, we also found that this problem may affect NMI watchdog in 
the production cluster.
The NMI watchdog also uses a fixed counter in fixed mode. Usually, it is 
The first element of the event_list array, so it usually takes 
precedence and can get a fixed counter.
But if the administrator closes the watchdog first and then enables it, 
it may be at the end of the event_list array, so its expected fixed 
counter may be occupied by other perf event, and it can only use one GP. 
In this way, there is a similar issue here: the PMU counter used by the 
NMI watchdog may be disabled/enabled frequently and unnecessarily.


Any advice or guidance on this would be appreciated.

Best wishes,
Wen


>> we wish it could be optimized a bit.
> 
> I wish for a unicorn ...
> 
>> The fields msk_counters and msk_events are added to indicate currently
>> used counters and events so that the used ones can be skipped
>> in __perf_sched_find_counter and perf_sched_next_event functions to avoid
>> unnecessary pmu_stop/start.
> 
> Still not sure what your actual problem is, nor what the actual proposal
> is.
> 
> Why should I attempt to reverse engineer your code without basic
> understanding of what you're actually trying to achieve?
>