Message-ID: <775d8f1d-437d-47a3-b4b2-da476e914cf5@linux.intel.com>
Date: Fri, 19 Jul 2024 12:35:35 -0400
From: "Liang, Kan" <kan.liang@...ux.intel.com>
To: Ian Rogers <irogers@...gle.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
 Arnaldo Carvalho de Melo <acme@...nel.org>,
 Namhyung Kim <namhyung@...nel.org>, Mark Rutland <mark.rutland@....com>,
 Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
 Jiri Olsa <jolsa@...nel.org>, Adrian Hunter <adrian.hunter@...el.com>,
 Bjorn Helgaas <bhelgaas@...gle.com>, Jonathan Corbet <corbet@....net>,
 James Clark <james.clark@....com>, Ravi Bangoria <ravi.bangoria@....com>,
 Dominique Martinet <asmadeus@...ewreck.org>, linux-kernel@...r.kernel.org,
 linux-perf-users@...r.kernel.org,
 Dhananjay Ugwekar <Dhananjay.Ugwekar@....com>, ananth.narayan@....com,
 gautham.shenoy@....com, kprateek.nayak@....com, sandipan.das@....com
Subject: Re: [PATCH v2 3/6] perf pmu: Add support for event.cpus files in
 sysfs



On 2024-07-19 10:59 a.m., Ian Rogers wrote:
> On Fri, Jul 19, 2024 at 6:56 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
>>
>>
>>
>> On 2024-07-18 4:50 p.m., Ian Rogers wrote:
>>> On Thu, Jul 18, 2024 at 10:48 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
>>>>
>>>>
>>>>
>>>> On 2024-07-18 11:39 a.m., Ian Rogers wrote:
>>>>> On Thu, Jul 18, 2024 at 7:33 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2024-07-17 8:30 p.m., Ian Rogers wrote:
>>>>>>> If an event file exists in sysfs, check if an event.cpus file exists
>>>>>>> and read a perf_cpu_map from it if it does. This allows particular
>>>>>>> events to have a different set of CPUs compared to the PMU.
>>>>>>>
>>>>>>> One scenario where this could be useful is when a PMU is set up with a
>>>>>>> cpumask/events per SMT thread but some events record for all SMT
>>>>>>> threads. Programming an event on each SMT thread will cause
>>>>>>> unnecessary counters to be programmed and the aggregate value to be
>>>>>>> too large.
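
For reference, a minimal, untested sketch of the tool side, assuming
libperf's perf_cpu_map__new() cpu-list parser and the sysfs layout
described above (event_cpus() is a hypothetical helper name):

#include <limits.h>
#include <stdio.h>
#include <perf/cpumap.h>

/* Read <pmu>/events/<event>.cpus, if present. */
static struct perf_cpu_map *event_cpus(const char *pmu, const char *event)
{
        char path[PATH_MAX], buf[256];
        struct perf_cpu_map *map = NULL;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/bus/event_source/devices/%s/events/%s.cpus",
                 pmu, event);
        f = fopen(path, "r");
        if (!f)
                return NULL;    /* no override: fall back to the PMU's cpumask */
        if (fgets(buf, sizeof(buf), f))
                map = perf_cpu_map__new(buf);   /* parses lists like "0-3,8" */
        fclose(f);
        return map;
}
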
>>>>>>
>>>>>> If I understand the scenario correctly, I think the issue should have
>>>>>> been addressed since ICX, with the introduction of the "SMT-aware
>>>>>> events". Is there a specific event which still has the issue on newer
>>>>>> platforms?
>>>>>
>>>>> Nothing comes to mind, but it is there on popular machines like Skylake.
>>>>>
>>>>>>>
>>>>>>> Another scenario where this could be useful is when a PMU has
>>>>>>> historically had a cpumask at the package level, but now newer per
>>>>>>> die, core or CPU information is available.
>>>>>>
>>>>>> The traditional way to support new counters with a different scope is to
>>>>>> add a new PMU.
>>>>>>
>>>>>>>
>>>>>>> Additional context for the motivation is in these patches and
>>>>>>> conversation:
>>>>>>> https://lore.kernel.org/lkml/20240711102436.4432-1-Dhananjay.Ugwekar@amd.com/
>>>>>>
>>>>>> I don't see the advantages of the new event.cpus way.
>>>>>> The aggregation should be the same.
>>>>>
>>>>> Agreed. My concern is that we may end up with a pattern of
>>>>> <pmu>_per_pkg, <pmu>_per_core, <pmu>_per_cpu, <pmu>_per_l3, etc. just
>>>>> for the sake of the cpumask.
>>>>
>>>> The cstate PMUs already do so, e.g., cstate_core, cstate_pkg.
>>>>
>>>> From another perspective, it discloses the scope information in a PMU
>>>> name. If a user only cares about the information of a package, they just
>>>> need to focus on everything under <pmu>_pkg. Otherwise, they have to go
>>>> through all the events.
>>>>
>>>>>
>>>>>> The RAPL counters are free-running counters. So there is no difference
>>>>>> whether they are grouped or not.
>>>>>
>>>>> Should the RAPL counters return true for perf_pmu__is_software? In the
>>>>> tool we currently return false and won't allow grouping these events
>>>>> with other hardware events. The intent in perf_pmu__is_software was to
>>>>> emulate the way the kernel allows/disallows groups - software context
>>>>> events can be in a hardware context but otherwise I don't believe the
>>>>> grouping is allowed.
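
As a rough paraphrase (not the literal kernel code) of that rule: a
software event may join a hardware-context group, but a hardware event
from a different PMU may not:

/* Sketch only: the kernel expresses this via is_software_event() and
 * the group's capabilities, not via a helper like this. */
static bool group_allows(struct perf_event *leader, struct perf_event *event)
{
        if (event->pmu == leader->pmu)
                return true;    /* same PMU: same context, grouping OK */
        if (is_software_event(event))
                return true;    /* sw events can ride along in a hw context */
        return false;           /* cross-PMU hardware grouping is denied */
}
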
>>>>
>>>> No, it's not allowed for the RAPL counters.
>>>>
>>>> If the motivation is to find another way to group counters with
>>>> different scope, it may not work.
>>>>
>>>> We once tried to mix the perf_invalid_context PMUs in a group. But it
>>>> was denied.
>>>> https://yhbt.net/lore/all/20150415172856.GA5029@twins.programming.kicks-ass.net/
>>>>
>>>> The event.cpus still faces the same issues.
>>>
>>> Why so? The events would share the same perf_event_context, so I thought
>>> the perf_event_open would succeed.
>>
>> I think it breaks what groups are as well. At least, you cannot
>> guarantee that two events can be co-scheduled on the same CPU. Even
>> worse, there could be no events on some CPU.
>> The first thing that pops up is the sample read feature. On CPU_A,
>> event_A is the leader event, but on CPU_B, event_B could be the
>> leader event if event_A's cpumask doesn't include CPU_B.
>> There could be more similar features we have to handle specially.
> 
> Thanks Kan. I'm not wondering about a case of 2 CPUs. Say on CPU0, and
> solely in its perf event context, I want to know its core power and
> package power as a group so I never record one without the other. That
> grouping wouldn't be possible with 2 PMUs.


For power, to be honest, I don't think it improves anything. It gives
users a false impression that perf can group these counters.
But the truth is that perf cannot. The power counters are all
free-running counters. It's impossible to co-schedule them (which
requires a global mechanism to disable/enable all counters, e.g.,
GLOBAL_CTRL for core PMU). The kernel still has to read the counters one
by one while the counters keep running. There is no difference with or
without a group for the power events.
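
A sketch of what such a read boils down to, patterned loosely on
rapl_event_update() in arch/x86/events/rapl.c (unit/width handling
omitted):

static void free_running_event_update(struct perf_event *event)
{
        u64 prev, now;

        /* The counter never starts or stops: just sample and accumulate. */
        do {
                prev = local64_read(&event->hw.prev_count);
                rdmsrl(event->hw.event_base, now);
        } while (local64_cmpxchg(&event->hw.prev_count, prev, now) != prev);

        local64_add(now - prev, &event->count);
}
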

> 
>>>
>>>>>
>>>>>> But it makes the kernel driver complex, since it has to maintain at
>>>>>> least two different cpumasks.
>>>>>
>>>>> Two drivers each maintaining a cpumask or 1 driver maintaining 2
>>>>> cpumasks; it seems like there is a chance for code reuse in both cases.
>>>>> I'm not seeing how it adds to complexity particularly.
>>>>
>>>> Yes, there are some cleanup opportunities for the cpumask/hotplug code.
>>>>
>>>> We may even abstract some generic interfaces for pkg- or core-level PMUs.
>>>>
>>>> Eventually, the complexity/duplication should be reducible.
>>>>
>>>>>
>>>>>> The tool becomes complex too, since it has to take care of the
>>>>>> per-event CPU override case.
>>>>>
>>>>> I'm not sure I agree with this. What we need for perf_event_open is a
>>>>> perf_event_attr; we dress these up as evsels, which also have the
>>>>> ability to be for >1 CPU or thread. Longer term I think evsels can
>>>>> also have >1 PMU, for the wildcard cases like uncore memory
>>>>> controllers - this would remove the need for resorting evsels except
>>>>> for topdown events which have thrown us a giant bundle of challenges.
>>>>> Anyway, so an evsel is perf_event_attr information paired with CPU
>>>>> information; it makes sense to me that the parser should do this
>>>>> pairing. What's harder in the evsel/evlist setup is trying to fix up CPU
>>>>> maps outside of parsing, like with __perf_evlist__propagate_maps where
>>>>> the parsing is trying to leave crumbs around (like system_wide,
>>>>> has_user_cpus, is_pmu_core) so the map propagation works properly.
>>>>>
>>>>>> The json file must also be updated to add a
>>>>>> new field cpumask.
>>>>>
>>>>> Yeah, I don't like this as it means we end up putting CPU information
>>>>> into the json that isn't the same for every CPU variant of the same
>>>>> architecture model. Maybe we can have some kind of "scope" enum value
>>>>> in the json and then, when the scope differs from the PMU's (core scope
>>>>> vs the PMU's hyperthread scope), the tool can figure out the
>>>>> cpumask from the topology in sysfs. Maybe we should just always use
>>>>> the topology and get rid of cpumask files in sysfs, replacing them
>>>>> with "scope" files. Will Deacon pushed back on having ARM PMUs
>>>>> support hot plugging
>>>>> (https://lore.kernel.org/lkml/20240701142222.GA2691@willie-the-truck/)
>>>>> where the main thing the hot plugging handler needs to maintain is the
>>>>> cpumask.
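
An untested sketch of the "derive the mask from topology" idea, assuming
the package_cpus_list sysfs file (older kernels expose the same data as
core_siblings_list); first_cpu_of_package() is a hypothetical helper:

#include <stdio.h>

/* Return the first (lowest-numbered) CPU of @cpu's package, or -1. */
static int first_cpu_of_package(int cpu)
{
        char path[128];
        int first = -1;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/package_cpus_list",
                 cpu);
        f = fopen(path, "r");
        if (f) {
                if (fscanf(f, "%d", &first) != 1) /* list starts at lowest CPU */
                        first = -1;
                fclose(f);
        }
        return first;
}

/* A package-scope event would then be opened only on CPUs where
 * first_cpu_of_package(cpu) == cpu. */
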
>>>>
>>>> Not just the cpumask; some PMUs also need their context migrated, see
>>>> perf_pmu_migrate_context().
>>>
>>> Will do, thanks.
>>>
>>>> It seems we really need a shared cpumask in the generic code, so the
>>>> drivers don't need to handle the hotplug everywhere. I will think about it.
>>>
>>> Thanks. There are other problems on ARM PMUs like having no or empty
>>> cpumasks, which the tool takes to mean open the event on every online
>>> CPU (there is no cpus or cpumask file on core PMUs historically, so we
>>> adopt this behavior when the cpumask is empty or not present).
>>
>> The no-cpus/cpumask case and the empty-cpumask case are different.
>> No cpus/cpumasks file means that the counters/events are available for
>> all the CPUs.
>> But if it's an empty cpus/cpumasks file, it means that there are no
>> proper CPUs. It may happen on a hybrid machine when all e-core CPUs are
>> hot-removed. Since it's possible that the CPUs can be hot-added later,
>> the kernel driver doesn't unregister the cpu_atom PMU.
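
In tool terms the distinction could look like this (untested sketch,
reusing the includes from the sketch above; all_online_cpus() and
read_cpu_map() are hypothetical helpers):

static struct perf_cpu_map *pmu_cpus(const char *pmu)
{
        char path[PATH_MAX];
        struct perf_cpu_map *map;
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/bus/event_source/devices/%s/cpumask", pmu);
        f = fopen(path, "r");
        if (!f)
                return all_online_cpus(); /* no file: valid on all CPUs */
        map = read_cpu_map(f);            /* may legitimately be empty */
        fclose(f);
        return map;     /* empty map: open on no CPU, do not fall back */
}
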
>>
>>> I've
>>> been thinking of expanding `tools/perf/tests/pmu.c` with basic PMU sanity
>>> tests. Some ideas:
>>>
>>
>> Thanks.
>>
>>> 1) some kind of cpumask sanity check - we could open an event with the
>>> cpumask and see if it yields multiplexing, which would highlight the
>>> ARM no cpumask bug
>>
>> The multiplexing is triggered when there are not enough counters. It
>> should not be related to the cpumask.
> 
> Agreed. Here I'm thinking about the bugs I see in PMUs. One of them is
> that opening siblings within a group always succeeds, with the
> unavailable counters just reporting "not counted". This breaks weak
> groups used by metrics.
> 
>> For a PMU without a cpumask, I think the test case should try to open an
>> event on all CPUs and check if the open succeeds.
>> For a PMU with a cpumask, the test case should try to open an event on
>> the masked CPUs and check if the open succeeds.
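
Such a test might boil down to something like this (untested sketch):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open @attr on each CPU in @cpus[0..n-1]; pid = -1 with a specific cpu
 * counts that CPU only. Return -1 on the first unexpected failure. */
static int open_on_cpus(struct perf_event_attr *attr, const int *cpus, int n)
{
        for (int i = 0; i < n; i++) {
                int fd = syscall(SYS_perf_event_open, attr, -1, cpus[i], -1, 0);

                if (fd < 0)
                        return -1; /* open on a masked CPU must succeed */
                close(fd);
        }
        return 0;
}
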
> 
> I agree without cpumask means all CPUs; the bug I see on ARM PMUs is
> that they have uncore PMUs with no or empty cpumasks, leading to the
> uncore events being programmed on every CPU and over counting,
> multiplexing counters and so on. I'm trying to devise tests so that
> these breakages become visible.
> 
>> The behavior of opening an event on unmasked CPUs seems not defined.
>> For uncore, it's OK. The kernel treats the CPUs from the same socket
>> exactly the same. But I'm not sure about the other PMUs.
> 
> My understanding had been that for core PMUs a "perf stat -C" option
> would choose the particular CPU to count the event on, while for an uncore
> PMU the -C option would override the cpumask's "default" value. We
> have code to validate this:
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/evlist.c?h=perf-tools-next#n2522
> But it seems now that overriding an uncore PMU's default CPU is
> ignored. 

For the uncore driver, no matter what -C sets, it writes the default CPU
back.
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/arch/x86/events/intel/uncore.c#n760
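
From memory, the relevant pattern in uncore_pmu_event_init() is roughly:

        box = uncore_pmu_to_box(pmu, event->cpu);
        if (!box || box->cpu < 0)
                return -EINVAL;
        event->cpu = box->cpu;  /* any -C choice is overwritten here */
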

> If you did:
> $ perf stat -C 1 -e data_read -a sleep 0.1
> Then the tool thinks data_read is on CPU1 and will set its thread
affinity to CPU1 to avoid IPIs. It seems that to fix this we need to just
> throw away the -C option.
The perf tool can still read the counter from CPU1 with no IPIs
because of PERF_EV_CAP_READ_ACTIVE_PKG.

Not quite sure, but it seems only the open and close may be impacted,
silently changed to CPU0.
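
Roughly, the read-side logic in __perf_event_read_cpu() in
kernel/events/core.c is:

        if (event->group_caps & PERF_EV_CAP_READ_ACTIVE_PKG &&
            topology_physical_package_id(smp_processor_id()) ==
            topology_physical_package_id(event_cpu))
                return smp_processor_id();      /* read locally, no IPI */
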

Thanks,
Kan
> 
>>> 2) do the /sys/devices/<pmu>/events/event.(unit|scale|per-pkg|snapshot)
>>> files parse correctly and have a corresponding event.
>>> 3) keep opening events on the PMU into a group to make sure that
>>> when counters are exhausted the perf_event_open fails (I've seen this
>>> bug on AMD)
>>> 4) are the values in the type file unique
>>>
>>
>> The rest sounds good to me.
> 
> Cool. Let me know if you can think of more.
> 
> Thanks,
> Ian
> 
>> Thanks,
>> Kan
> 
