linux-kernel - Re: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ZtdQYvfw9GV3LCRK@google.com>
Date: Tue, 3 Sep 2024 11:07:30 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Leonardo Bras <leobras.c@...il.com>
Cc: "Paul E. McKenney" <paulmck@...nel.org>, Leonardo Bras <leobras@...hat.com>, 
	Frederic Weisbecker <frederic@...nel.org>, Paolo Bonzini <pbonzini@...hat.com>, 
	Marcelo Tosatti <mtosatti@...hat.com>, linux-kernel@...r.kernel.org, kvm@...r.kernel.org
Subject: Re: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit

On Thu, Jun 20, 2024, Leonardo Bras wrote:
> On Wed, May 15, 2024 at 07:57:41AM -0700, Paul E. McKenney wrote:
> I tested x86 by counting cycles (using rdtsc_ordered()).
> 
> Cycles were counted upon function entry/exit on 
> {svm,vmx}_vcpu_enter_exit(), and right before / after 
> __{svm,vmx}_vcpu_run() in the same function.

Note, for posterity, this is the super-duper inner portion of KVM's run loop.
I.e. the 18-22% increase in latency sounds scary, but only because Leo used a
a relatively small portion of the entry flow for the baseline.  E.g. the total
time would be significantly higher, and thus the percentage increase much lower,
if the measurement started at the beginning of vmx_vcpu_run().

> The main idea was to get cycles spend in the procedures before entering 
> guest (such as reporting RCU quiescent state in entry / exit) and the 
> cycles actually used by the VM. 
> 
> Those cycles were summed-up and stored in per-cpu structures, with a 
> counter to get the average value. I then created a debug file to read the 
> results and reset the counters.
> 
> As for the VM, it got 20 vcpus, 8GB memory, and was booted with idle=poll.
> 
> The workload inside the VM consisted in cyclictest in 16 vcpus 
> (SCHED_FIFO,p95), while maintaining it's main routine in 4 other cpus 
> (SCHED_OTHER). This was made to somehow simulate busy and idle-er cpus. 
> 
>  $cyclictest -m -q -p95 --policy=fifo -D 1h -h60 -t 16 -a 4-19 -i 200 
>   --mainaffinity 0-3
> 
> All tests were run for exaclty 1 hour, and the clock counter was reset at 
> the same moment cyclictest stared. After that VM was poweroff from guest.
> Results show the average for all CPUs in the same category, in cycles.
> 
> With above setup, I tested 2 use cases:
> 1 - Non-RT host, no CPU Isolation, no RCU patience (regular use-case)
> 2 - PREEMPT_RT host, with CPU Isolation for all vcpus (pinned), and 
>     RCU patience = 1000ms (best case for RT)
> 
> Results are:
> # Test case 1:
> Vanilla: (average on all vcpus)
> VM Cycles / RT vcpu:		123287.75 
> VM Cycles / non-RT vcpu:	709847.25
> Setup Cycles:			186.00
> VM entries / RT vcpu:		58737094.81
> VM entries / non-RT vcpu:	10527869.25
> Total cycles in RT VM:		7241564260969.80
> Total cycles in non-RT VM:	7473179035472.06
> 
> Patched: (average on all vcpus)
> VM Cycles / RT vcpu:		124695.31        (+ 1.14%)
> VM Cycles / non-RT vcpu:	710479.00        (+ 0.09%)
> Setup Cycles:			218.65           (+17.55%)
> VM entries / RT vcpu:		60654285.44      (+ 3.26%) 
> VM entries / non-RT vcpu:	11003516.75      (+ 4.52%)
> Total cycles in RT VM:		7563305077093.26 (+ 4.44%)
> Total cycles in non-RT VM:	7817767577023.25 (+ 4.61%)
> 
> Discussion:
> Setup cycles raised in ~33 cycles, increasing overhead.
> It proves that noting a quiescent state in guest entry introduces setup 

Isn't the effect of the patch note a quiescent state in guest exit?  

> routine costs, which is expected.
> 
> On the other hand, both the average time spend inside the VM and the number 
> of VM entries raised, causing the VM to have ~4.5% more cpu cycles 
> available to run, which is positive. Extra cycles probably came from not 
> having invoke_rcu_core() getting ran after VM exit.
> 
> 
> # Test case 2:
> Vanilla: (average on all vcpus)
> VM Cycles / RT vcpu:		123785.63
> VM Cycles / non-RT vcpu:	698758.25
> Setup Cycles:			187.20
> VM entries / RT vcpu:		61096820.75
> VM entries / non-RT vcpu:	11191873.00
> Total cycles in RT VM:		7562908142051.72
> Total cycles in non-RT VM:	7820413591702.25
> 
> Patched: (average on all vcpus)
> VM Cycles / RT vcpu:		123137.13        (- 0.52%)
> VM Cycles / non-RT vcpu:	696824.25        (- 0.28%)
> Setup Cycles:			229.35           (+22.52%)
> VM entries / RT vcpu:		61424897.13      (+ 0.54%) 
> VM entries / non-RT vcpu:	11237660.50      (+ 0.41%)
> Total cycles in RT VM:		7563685235393.27 (+ 0.01%)
> Total cycles in non-RT VM:	7830674349667.13 (+ 0.13%)
> 
> Discussion:
> Setup cycles raised in ~42 cycles, increasing overhead.
> It proves that noting a quiescent state in guest entry introduces setup 

Same here, s/entry/exit, correct?  I just want to make sure I'm not completely
misunderstanding something.

> routine costs, which is expected.
> 
> The average time spend inside the VM was reduced, but the number of VM  
> entries raised, causing the VM to have around the same number of cpu cycles 
> available to run, meaning that the overhead caused by reporting RCU 
> quiescent state in VM exit got absorbed, and it may have to do with those 
> rare invoke_rcu_core()s that were bothering latency.
> 
> The difference is much smaller compared to case 1, and this is probably 
> because there is a clause in rcu_pending() for isolated (nohz_full) cpus 
> which may be already inhibiting a lot of invoke_rcu_core()s.
> 
> Sean, Paul, what do you think?

I'm in favor of noting the context switch on exit.  Singling out this patch for
introducing ~30 cycles of latency in entry/exit is rather ridiculous.  I guarantee
ww have introduced far more latency at various times without any scrutiny, even
if we were to limit the search to just the fastpath entry/exit loop.  Outside of
the fastpath, there's basically zero chance the extra 30-40 cycles are meaningful.

Even without the non-RT improvement, I'd give this a thumbs up just so that entry
and exit are symmetrical.  The fact that noting the context switches seems to be
a win-win is icing on the cake.

So, assuming the above entry/exit confusion was just a typo,

Acked-by: Sean Christopherson <seanjc@...gle.com>