Message-ID: <20240610205405.GA8774@noisy.programming.kicks-ass.net>
Date: Mon, 10 Jun 2024 22:54:05 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Stephane Eranian <eranian@...gle.com>
Cc: LKML <linux-kernel@...r.kernel.org>, Ian Rogers <irogers@...gle.com>,
"Liang, Kan" <kan.liang@...el.com>, Andi Kleen <ak@...ux.intel.com>,
Ingo Molnar <mingo@...hat.com>,
"Narayan, Ananth" <ananth.narayan@....com>,
"Bangoria, Ravikumar" <ravi.bangoria@....com>,
Namhyung Kim <namhyung@...gle.com>,
Mingwei Zhang <mizhang@...gle.com>,
Dapeng Mi <dapeng1.mi@...ux.intel.com>,
Zhang Xiong <xiong.y.zhang@...el.com>
Subject: Re: [RFC] perf_events: exclude_guest impact on
time_enabled/time_running
On Thu, Jun 06, 2024 at 12:57:35AM -0700, Stephane Eranian wrote:
> Hi Peter,
>
> In the context of the new vPMU passthru patch series, we have to look
> closer at the definition and implementation of the exclude_guest
> filter in the perf_event_attr structure. This filter has been in the
> kernel for many years. See patch:
> https://lore.kernel.org/all/20240506053020.3911940-8-mizhang@google.com/
>
> The presumed definition of the filter is that the user does not want
> the event to count while the processor is running in guest mode (i.e.,
> inside the virtual machine guest OS or guest user code).
>
> The perf tool sets it by default on all core PMU events:
> $ perf stat -vv -e cycles sleep 0
> ------------------------------------------------------------
> perf_event_attr:
> size 112
> sample_type IDENTIFIER
> read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
> disabled 1
> inherit 1
> enable_on_exec 1
> exclude_guest 1
> ------------------------------------------------------------
>
> In the kernel, the way this is handled differs between AMD and Intel,
> because AMD provides a hardware host/guest filter in the PMU counters
> whereas Intel does not. For the latter, the kernel simply disables the
> event in the hardware counters, i.e., the event is not descheduled.
> Both approaches produce pretty much the same desired effect: the event
> does not count while in guest mode.
>
> The issue I would like to raise has to do with the effects on
> time_enabled and time_running for exclude_guest=1 events.
>
> Given the event is not scheduled out while in guest mode, merely
> stopped, both time_enabled and time_running continue ticking while in
> guest mode. If a measurement is 10s long but only 5s are spent in
> non-guest mode, then time_enabled=10s and time_running=10s. The count
> covers only the 5s of non-guest execution, yet the timings suggest 10s
> of active monitoring, and the user has no way of determining this.
>
> If we look at vPMU passthru, the host event must have exclude_guest=1
> to avoid going into an error state on context switch to the vCPU
> thread (with vPMU enabled). But this time, the event is scheduled out,
> which means that time_enabled keeps counting while time_running stops.
> On context switch back in, the host event is scheduled again and
> time_running resumes ticking. For a 10s measurement where 5s were
> spent in the guest, the event comes out with time_enabled=10s and
> time_running=5s, and the tool scales the count up because it thinks
> the event was multiplexed, when in fact it was not. That is not the
> intended outcome: the tool should not scale the count, since the event
> was not multiplexed but descheduled because the filter forced it out.
> Note that if the event had been multiplexed while running on the host,
> then the scaling would be appropriate.
>
> In that case, I argue, time_running should be updated to also cover
> the time the event was not running, i.e., keep ticking across guest
> mode. That would bring us back to the case I was describing earlier.
>
> It boils down to the exact definition of exclude_guest and expected
> impact on time_enabled and time_running. Then, with or without vPMU
> passthru, we can fix the kernel to ensure a uniform behavior.
>
> What are your thoughts on this problem?
So with those patches having explicit scheduling points, we can actually
do this time accounting accurately, so I don't see a reason to not do
the right thing here.

Hysterically this was left vague in order to be able to avoid the
scheduling for these scenarios -- performance raisins etc.

The thing is, if you push this to its limits, we should start time
accounting for the ring selectors too, and that's going to be painful.