Message-ID: <CAP-5=fUUcehu-C=ytHVVixOpeCYoW4oJkkj6p6W=M0HtQ2wrRA@mail.gmail.com>
Date: Thu, 18 Jul 2024 08:12:27 -0700
From: Ian Rogers <irogers@...gle.com>
To: "Liang, Kan" <kan.liang@...ux.intel.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>, Namhyung Kim <namhyung@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>, Jiri Olsa <jolsa@...nel.org>,
Adrian Hunter <adrian.hunter@...el.com>, Bjorn Helgaas <bhelgaas@...gle.com>,
Jonathan Corbet <corbet@....net>, James Clark <james.clark@....com>,
Ravi Bangoria <ravi.bangoria@....com>, Dominique Martinet <asmadeus@...ewreck.org>,
linux-kernel@...r.kernel.org, linux-perf-users@...r.kernel.org,
Dhananjay Ugwekar <Dhananjay.Ugwekar@....com>, ananth.narayan@....com, gautham.shenoy@....com,
kprateek.nayak@....com, sandipan.das@....com
Subject: Re: [PATCH v2 6/6] perf parse-events: Add "cpu" term to set the CPU
an event is recorded on
On Thu, Jul 18, 2024 at 7:41 AM Liang, Kan <kan.liang@...ux.intel.com> wrote:
>
>
>
> On 2024-07-17 8:30 p.m., Ian Rogers wrote:
> > The -C option allows the CPUs for a list of events to be specified, but
> > it's not possible to set the CPU for a single event. Add a term to
> > allow this. The term isn't a general CPU list because ',' is already
> > a special character in event parsing; instead, multiple cpu= terms may
> > be provided and they will be merged/unioned together.
> >
> > An example of mixing different types of events counted on different CPUs:
> > ```
> > $ perf stat -A -C 0,4-5,8 -e "instructions/cpu=0/,l1d-misses/cpu=4,cpu=5/,inst_retired.any/cpu=8/,cycles" -a sleep 0.1
> >
> > Performance counter stats for 'system wide':
> >
> > CPU0 368,647 instructions/cpu=0/ # 0.26 insn per cycle
> > CPU4 <not counted> instructions/cpu=0/
> > CPU5 <not counted> instructions/cpu=0/
> > CPU8 <not counted> instructions/cpu=0/
> > CPU0 <not counted> l1d-misses [cpu]
> > CPU4 203,377 l1d-misses [cpu]
> > CPU5 138,231 l1d-misses [cpu]
> > CPU8 <not counted> l1d-misses [cpu]
> > CPU0 <not counted> cpu/cpu=8/
> > CPU4 <not counted> cpu/cpu=8/
> > CPU5 <not counted> cpu/cpu=8/
> > CPU8 943,861 cpu/cpu=8/
> > CPU0 1,412,071 cycles
> > CPU4 20,362,900 cycles
> > CPU5 10,172,725 cycles
> > CPU8 2,406,081 cycles
> >
> > 0.102925309 seconds time elapsed
> > ```
> >
> > Note, the event name of inst_retired.any is missing, reported as
> > cpu/cpu=8/, because the uniquify fixes haven't been merged yet:
> > https://lore.kernel.org/lkml/20240510053705.2462258-3-irogers@google.com/
> >
> > An example of spreading uncore overhead across two CPUs:
> > ```
> > $ perf stat -A -e "data_read/cpu=0/,data_write/cpu=1/" -a sleep 0.1
> >
> > Performance counter stats for 'system wide':
> >
> > CPU0 223.65 MiB uncore_imc_free_running_0/cpu=0/
> > CPU0 223.66 MiB uncore_imc_free_running_1/cpu=0/
> > CPU0 <not counted> MiB uncore_imc_free_running_0/cpu=1/
> > CPU1 5.78 MiB uncore_imc_free_running_0/cpu=1/
> > CPU0 <not counted> MiB uncore_imc_free_running_1/cpu=1/
> > CPU1 5.74 MiB uncore_imc_free_running_1/cpu=1/
> > ```
> >
> > Manually fixing the output, it should be:
> > ```
> > CPU0 223.65 MiB uncore_imc_free_running_0/data_read,cpu=0/
> > CPU0 223.66 MiB uncore_imc_free_running_1/data_read,cpu=0/
> > CPU1 5.78 MiB uncore_imc_free_running_0/data_write,cpu=1/
> > CPU1 5.74 MiB uncore_imc_free_running_1/data_write,cpu=1/
> > ```
> >
> > That is, data_read from 2 PMUs was counted on CPU0 and data_write was
> > counted on CPU1.
>
> There was an effort to make the counters accessible from any CPU of the package.
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d6a2f9035bfc27d0e9d78b13635dda9fb017ac01
>
> But now this limits the access to specific CPUs. It sounds like a
> regression.
Thanks Kan, I'm not sure I understand the comment. The overhead I was
thinking of here is more along the lines of cgroup context switches
(although that isn't in my example). There may be a large number of,
say, memory controller events just from having 2 events for each PMU
when there are 10s of PMUs. By putting half of the events on one CPU and
half on another, the context switch overhead is shared. That said, the
counters don't care what cgroup is accessing memory, so users doing
this are likely making some kind of error.
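To make the idea concrete, here's a rough sketch of the kind of split I
mean (the PMU and event names below are made up for illustration; only
the cpu= term syntax comes from this patch):
```
# Hypothetical: count half of the uncore events on CPU0 and the other
# half on CPU1, so no single CPU carries all of the accounting overhead.
$ perf stat -A -a \
    -e "uncore_pmu_a/event_x,cpu=0/,uncore_pmu_b/event_y,cpu=0/" \
    -e "uncore_pmu_c/event_z,cpu=1/,uncore_pmu_d/event_w,cpu=1/" \
    sleep 0.1
```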
Thanks,
Ian