linux-kernel - Re: [RFC] perf_events: exclude_guest impact on time_enabled/time

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20240610205405.GA8774@noisy.programming.kicks-ass.net>
Date: Mon, 10 Jun 2024 22:54:05 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Stephane Eranian <eranian@...gle.com>
Cc: LKML <linux-kernel@...r.kernel.org>, Ian Rogers <irogers@...gle.com>,
	"Liang, Kan" <kan.liang@...el.com>, Andi Kleen <ak@...ux.intel.com>,
	Ingo Molnar <mingo@...hat.com>,
	"Narayan, Ananth" <ananth.narayan@....com>,
	"Bangoria, Ravikumar" <ravi.bangoria@....com>,
	Namhyung Kim <namhyung@...gle.com>,
	Mingwei Zhang <mizhang@...gle.com>,
	Dapeng Mi <dapeng1.mi@...ux.intel.com>,
	Zhang Xiong <xiong.y.zhang@...el.com>
Subject: Re: [RFC] perf_events: exclude_guest impact on
 time_enabled/time_running

On Thu, Jun 06, 2024 at 12:57:35AM -0700, Stephane Eranian wrote:
> Hi Peter,
> 
> In the context of the new vPMU passthru patch series, we have to look
> closer at the definition and implementation of the exclude_guest
> filter in the perf_event_attr structure. This filter has been in the
> kernel for many years. See patch:
> https://lore.kernel.org/all/20240506053020.3911940-8-mizhang@google.com/
> 
> The presumed  definition of the filter is that the user does not want
> the event to count while the processor is running in guest mode (i.e.,
> inside the virtual machine guest OS or guest user code).
> 
> The perf tool sets is by default on all core PMU events:
> $ perf stat -vv -e cycles sleep 0
> ------------------------------------------------------------
> perf_event_attr:
>   size                             112
>   sample_type                      IDENTIFIER
>   read_format                      TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
>   disabled                         1
>   inherit                          1
>   enable_on_exec                   1
>   exclude_guest                    1
> ------------------------------------------------------------
> 
> In the kernel, the way this is treated differs between AMD and Intel
> because AMD does provide a hardware filter for guest vs. host in the
> PMU counters whereas Intel does not. For the latter, the  kernel
> simply disables the event in the hardware counters, i.e., the event is
> not descheduled.  Both approaches produce pretty much the same desired
> effect, the event is not counted while in guest mode.
> 
> The issue I would like to raise has to do with the effects on
> time_enabled and time_running for exclude_guest=1 events.
> 
> Given the event is not scheduled out while in guest mode, even though
> it is stopped, both time_enabled and time_running continue ticking
> while in guest mode.  If a measurement is 10s long but only 5s are in
> non-guest mode, then time_enabled=10s, time_running=10s. The count
> represents 10s worth of non guest mode, of which only 5s were really
> actively monitoring, but the user has no way of determining this.
> 
> If we look at vPMU passthru, the host event must have exclude_guest=1
> to avoid going into an error state on context switch to the vCPU
> thread (with vPMU enabled). But this time, the event is scheduled out,
> that means that time_enabled keeps counting, but time_running stops.
> On context switch back in, the host event is scheduled again and
> time_running restarts ticking. For a 10s measurement, where 5s here in
> the guest, the event will come out with time_enabled=10s,
> time_running=5s, and the tool will scale it up because it thinks the
> event was multiplexed, when in fact it was not. This is not the
> intended outcome here. The tool should not scale the count, it was not
> multiplexed, it was descheduled because the filter forced it out.
> Note that if the event had been multiplexed while running on the host,
> then the scaling would be appropriate.
> 
> In that case, I argue, time_running should be updated to cover the
> time the event was not running. That would bring us back to the case I
> was describing earlier.
> 
> It boils down to the exact definition of exclude_guest and expected
> impact on time_enabled and time_running. Then, with or without vPMU
> passthru, we can fix the kernel to ensure a uniform behavior.
> 
> What are your thoughts on this problem?

So with those patches having explicit scheduling points, we can actually
do this time accounting accurately, so I don't see a reason to not do
the right thing here.

Hysterically this was left vague in order to be able to avoid the
scheduling for these scenarios -- performance raisins etc.

The thing is, if you push this to its limits, we should start time
accounting for the ring selectors too, and that's going to be painful.