linux-kernel - Re: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZnPUTGSdF7t0DCwR@LeoBras>
Date: Thu, 20 Jun 2024 04:03:41 -0300
From: Leonardo Bras <leobras.c@...il.com>
To: "Paul E. McKenney" <paulmck@...nel.org>
Cc: Leonardo Bras <leobras@...hat.com>,
	Sean Christopherson <seanjc@...gle.com>,
	Frederic Weisbecker <frederic@...nel.org>,
	Paolo Bonzini <pbonzini@...hat.com>,
	Marcelo Tosatti <mtosatti@...hat.com>,
	linux-kernel@...r.kernel.org,
	kvm@...r.kernel.org
Subject: Re: [RFC PATCH 1/1] kvm: Note an RCU quiescent state on guest exit

On Wed, May 15, 2024 at 07:57:41AM -0700, Paul E. McKenney wrote:
> On Wed, May 15, 2024 at 01:45:33AM -0300, Leonardo Bras wrote:
> > On Tue, May 14, 2024 at 03:54:16PM -0700, Paul E. McKenney wrote:
> > > On Mon, May 13, 2024 at 06:47:13PM -0300, Leonardo Bras Soares Passos wrote:
> > > > On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <seanjc@...gle.com> wrote:
> > > > >
> > > > > On Fri, May 10, 2024, Leonardo Bras wrote:
> > > > > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > > > > as it avoids the guest being interrupted for current RCU operations.
> > > > > >
> > > > > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > > > > check for any RCU operations waiting for this CPU. In case there are any of
> > > > > > such, it invokes rcu_core() in order to sched-out the current thread and
> > > > > > note a quiescent state.
> > > > > >
> > > > > > This occasional schedule work will introduce tens of microsseconds of
> > > > > > latency, which is really bad for vcpus running latency-sensitive
> > > > > > applications, such as real-time workloads.
> > > > > >
> > > > > > So, note a quiescent state in guest exit, so the interrupted guests is able
> > > > > > to deal with any pending RCU operations before being required to invoke
> > > > > > rcu_core(), and thus avoid the overhead of related scheduler work.
> > > > >
> > > > > Are there any downsides to this?  E.g. extra latency or anything?  KVM will note
> > > > > a context switch on the next VM-Enter, so even if there is extra latency or
> > > > > something, KVM will eventually take the hit in the common case no matter what.
> > > > > But I know some setups are sensitive to handling select VM-Exits as soon as possible.
> > > > >
> > > > > I ask mainly because it seems like a no brainer to me to have both VM-Entry and
> > > > > VM-Exit note the context switch, which begs the question of why KVM isn't already
> > > > > doing that.  I assume it was just oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> > > > > use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> > > > > case?
> > > > 
> > > > I don't know, by the lore I see it happening in guest entry since the
> > > > first time it was introduced at
> > > > https://lore.kernel.org/all/1423167832-17609-5-git-send-email-riel@redhat.com/
> > > > 
> > > > Noting a quiescent state is cheap, but it may cost a few accesses to
> > > > possibly non-local cachelines. (Not an expert in this, Paul please let
> > > > me know if I got it wrong).
> > > 
> > > Yes, it is cheap, especially if interrupts are already disabled.
> > > (As in the scheduler asks RCU to do the same amount of work on its
> > > context-switch fastpath.)
> > 
> > Thanks!
> > 
> > > 
> > > > I don't have a historic context on why it was just implemented on
> > > > guest_entry, but it would make sense when we don't worry about latency
> > > > to take the entry-only approach:
> > > > - It saves the overhead of calling rcu_virt_note_context_switch()
> > > > twice per guest entry in the loop
> > > > - KVM will probably run guest entry soon after guest exit (in loop),
> > > > so there is no need to run it twice
> > > > - Eventually running rcu_core() may be cheaper than noting quiescent
> > > > state every guest entry/exit cycle
> > > > 
> > > > Upsides of the new strategy:
> > > > - Noting a quiescent state in guest exit avoids calling rcu_core() if
> > > > there was a grace period request while guest was running, and timer
> > > > interrupt hits the cpu.
> > > > - If the loop re-enter quickly there is a high chance that guest
> > > > entry's rcu_virt_note_context_switch() will be fast (local cacheline)
> > > > as there is low probability of a grace period request happening
> > > > between exit & re-entry.
> > > > - It allows us to use the rcu patience strategy to avoid rcu_core()
> > > > running if any grace period request happens between guest exit and
> > > > guest re-entry, which is very important for low latency workloads
> > > > running on guests as it reduces maximum latency in long runs.
> > > > 
> > > > What do you think?
> > > 
> > > Try both on the workload of interest with appropriate tracing and
> > > see what happens?  The hardware's opinion overrides mine.  ;-)
> > 
> > That's a great approach!
> > 
> > But in this case I think noting a quiescent state in guest exit is 
> > necessary to avoid a scenario in which a VM takes longer than RCU 
> > patience, and it ends up running rcuc in a nohz_full cpu, even if guest 
> > exit was quite brief. 
> > 
> > IIUC Sean's question is more on the tone of "Why KVM does not note a 
> > quiescent state in guest exit already, if it does in guest entry", and I 
> > just came with a few arguments to try finding a possible rationale, since 
> > I could find no discussion on that topic in the lore for the original 
> > commit.
> 
> Understood, and maybe trying it would answer that question quickly.
> Don't get me wrong, just because it appears to work in a few tests doesn't
> mean that it really works, but if it visibly blows up, that answers the
> question quite quickly and easily.  ;-)
> 
> But yes, if it appears to work, there must be a full investigation into
> whether or not the change really is safe.
> 
> 							Thanx, Paul

Hello Paul, Sean, sorry for the delay on this.

I tested x86 by counting cycles (using rdtsc_ordered()).

Cycles were counted upon function entry/exit on 
{svm,vmx}_vcpu_enter_exit(), and right before / after 
__{svm,vmx}_vcpu_run() in the same function.

The main idea was to get cycles spend in the procedures before entering 
guest (such as reporting RCU quiescent state in entry / exit) and the 
cycles actually used by the VM. 

Those cycles were summed-up and stored in per-cpu structures, with a 
counter to get the average value. I then created a debug file to read the 
results and reset the counters.

As for the VM, it got 20 vcpus, 8GB memory, and was booted with idle=poll.

The workload inside the VM consisted in cyclictest in 16 vcpus 
(SCHED_FIFO,p95), while maintaining it's main routine in 4 other cpus 
(SCHED_OTHER). This was made to somehow simulate busy and idle-er cpus. 

 $cyclictest -m -q -p95 --policy=fifo -D 1h -h60 -t 16 -a 4-19 -i 200 
  --mainaffinity 0-3

All tests were run for exaclty 1 hour, and the clock counter was reset at 
the same moment cyclictest stared. After that VM was poweroff from guest.
Results show the average for all CPUs in the same category, in cycles.

With above setup, I tested 2 use cases:
1 - Non-RT host, no CPU Isolation, no RCU patience (regular use-case)
2 - PREEMPT_RT host, with CPU Isolation for all vcpus (pinned), and 
    RCU patience = 1000ms (best case for RT)

Results are:
# Test case 1:
Vanilla: (average on all vcpus)
VM Cycles / RT vcpu:		123287.75 
VM Cycles / non-RT vcpu:	709847.25
Setup Cycles:			186.00
VM entries / RT vcpu:		58737094.81
VM entries / non-RT vcpu:	10527869.25
Total cycles in RT VM:		7241564260969.80
Total cycles in non-RT VM:	7473179035472.06

Patched: (average on all vcpus)
VM Cycles / RT vcpu:		124695.31        (+ 1.14%)
VM Cycles / non-RT vcpu:	710479.00        (+ 0.09%)
Setup Cycles:			218.65           (+17.55%)
VM entries / RT vcpu:		60654285.44      (+ 3.26%) 
VM entries / non-RT vcpu:	11003516.75      (+ 4.52%)
Total cycles in RT VM:		7563305077093.26 (+ 4.44%)
Total cycles in non-RT VM:	7817767577023.25 (+ 4.61%)

Discussion:
Setup cycles raised in ~33 cycles, increasing overhead.
It proves that noting a quiescent state in guest entry introduces setup 
routine costs, which is expected.

On the other hand, both the average time spend inside the VM and the number 
of VM entries raised, causing the VM to have ~4.5% more cpu cycles 
available to run, which is positive. Extra cycles probably came from not 
having invoke_rcu_core() getting ran after VM exit.


# Test case 2:
Vanilla: (average on all vcpus)
VM Cycles / RT vcpu:		123785.63
VM Cycles / non-RT vcpu:	698758.25
Setup Cycles:			187.20
VM entries / RT vcpu:		61096820.75
VM entries / non-RT vcpu:	11191873.00
Total cycles in RT VM:		7562908142051.72
Total cycles in non-RT VM:	7820413591702.25

Patched: (average on all vcpus)
VM Cycles / RT vcpu:		123137.13        (- 0.52%)
VM Cycles / non-RT vcpu:	696824.25        (- 0.28%)
Setup Cycles:			229.35           (+22.52%)
VM entries / RT vcpu:		61424897.13      (+ 0.54%) 
VM entries / non-RT vcpu:	11237660.50      (+ 0.41%)
Total cycles in RT VM:		7563685235393.27 (+ 0.01%)
Total cycles in non-RT VM:	7830674349667.13 (+ 0.13%)

Discussion:
Setup cycles raised in ~42 cycles, increasing overhead.
It proves that noting a quiescent state in guest entry introduces setup 
routine costs, which is expected.

The average time spend inside the VM was reduced, but the number of VM  
entries raised, causing the VM to have around the same number of cpu cycles 
available to run, meaning that the overhead caused by reporting RCU 
quiescent state in VM exit got absorbed, and it may have to do with those 
rare invoke_rcu_core()s that were bothering latency.

The difference is much smaller compared to case 1, and this is probably 
because there is a clause in rcu_pending() for isolated (nohz_full) cpus 
which may be already inhibiting a lot of invoke_rcu_core()s.

Sean, Paul, what do you think?

Thanks!
Leo

> 
> > Since noting a quiescent state in guest exit is cheap enough, avoids rcuc 
> > schedules when grace period starts during guest execution, and enables a 
> > much more rational usage of RCU patience, it's a safe to assume it's a 
> > better way of dealing with RCU compared to current implementation.
> > 
> > Sean, what do you think?
> > 
> > Thanks!
> > Leo
> > 
> > > 
> > > 							Thanx, Paul
> > > 
> > 
>