Message-ID: <ZSlqo-k2htjN1gPh@google.com>
Date: Fri, 13 Oct 2023 09:04:51 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Uros Bizjak <ubizjak@...il.com>
Cc: x86@...nel.org, linux-kernel@...r.kernel.org,
Linus Torvalds <torvalds@...ux-foundation.org>,
Nadav Amit <namit@...are.com>, Ingo Molnar <mingo@...nel.org>,
Andy Lutomirski <luto@...nel.org>,
Brian Gerst <brgerst@...il.com>,
Denys Vlasenko <dvlasenk@...hat.com>,
"H . Peter Anvin" <hpa@...or.com>,
Peter Zijlstra <peterz@...radead.org>,
Thomas Gleixner <tglx@...utronix.de>,
Josh Poimboeuf <jpoimboe@...hat.com>
Subject: Re: [PATCH tip] x86/percpu: Rewrite arch_raw_cpu_ptr()
On Wed, Oct 11, 2023, Uros Bizjak wrote:
> Additionally, the patch introduces 'rdgsbase' alternative for CPUs with
> X86_FEATURE_FSGSBASE. The rdgsbase instruction *probably* will end up
> only decoding in the first decoder etc. But we're talking single-cycle
> kind of effects, and the rdgsbase case should be much better from
> a cache perspective and might use fewer memory pipeline resources to
> offset the fact that it uses an unusual front end decoder resource...
The switch to RDGSBASE should be a separate patch, and should come with actual
performance numbers.
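
For reference, IIUC the proposed arch_raw_cpu_ptr() ends up looking roughly
like this (a sketch from memory of the patch, details may differ):

  #define arch_raw_cpu_ptr(ptr)					\
  ({								\
	unsigned long tcp_ptr__;				\
	asm (ALTERNATIVE("mov " __percpu_arg(1) ", %0",		\
			 "rdgsbase %0",				\
			 X86_FEATURE_FSGSBASE)			\
	     : "=r" (tcp_ptr__)					\
	     : "m" (this_cpu_off));				\
								\
	tcp_ptr__ += (unsigned long)(ptr);			\
	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;		\
  })

i.e. the legacy path loads this_cpu_off via a GS-prefixed MOV, while the
alternative reads GS.base directly, the two being equal by construction since
GS.base *is* the per-CPU offset.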
A significant percentage of data accesses in Intel's TDX-Module[*] use this
pattern, e.g. even global data is relative to GS.base in the module due to its rather
odd and restricted environment. Back in the early days of TDX, the module used
RD{FS,GS}BASE instead of prefixes to get pointers to per-CPU and global data
structures in the TDX-Module. It's been a few years so I forget the exact numbers,
but at the time a single transition between guest and host would incur something
like ~100 reads of FS.base or GS.base. Switching from RD{FS,GS}BASE to prefixed
accesses reduced the latency for a guest<->host transition through the TDX-Module
by several thousand cycles, as every RD{FS,GS}BASE had a latency of ~18 cycles
(again, going off 3+ year old memories).
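
To make that concrete, the difference is between something like this
(illustrative asm, made-up registers and offsets):

  movq %gs:0x20(%rdi), %rax	# prefixed: a single load, with GS.base
				# applied during address generation

versus this:

  rdgsbase %rcx			# ~18 cycles of latency (old numbers)
  movq 0x20(%rcx,%rdi), %rax	# the load can't even start until the
				# RDGSBASE completes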
The TDX-Module code is pretty much a pathological worst-case scenario, but I
suspect its usage is very similar to most usage of raw_cpu_ptr(), e.g. get a
pointer to some data structure and then do multiple reads/writes from/to that
data structure.
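
E.g. the common pattern is something like this (contrived example, the struct
and helper are made up):

  struct foo *f = raw_cpu_ptr(&pcpu_foo);	/* one GS.base access */

  f->hits++;					/* plain loads/stores off 'f' */
  f->last_seen = now;
  if (f->hits > f->max_hits)
	update_stats(f);

i.e. the cost of materializing the per-CPU pointer is paid once and amortized
across all of the subsequent accesses.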
The other wrinkle with RD{FS,GS}BASE is that they are trivially easy to emulate.
If a hypervisor/VMM is advertising FSGSBASE even when it's not supported by
hardware, e.g. to migrate VMs to older hardware, then every RDGSBASE will end up
taking a few thousand cycles (#UD -> VM-Exit -> emulate). I would be surprised
if any hypervisor actually does this as it would be easier/smarter to simply not
advertise FSGSBASE if migrating to older hardware might be necessary, e.g. KVM
doesn't support emulating RD{FS,GS}BASE. But at the same time, the whole reason
I stumbled on the TDX-Module's sub-optimal RD{FS,GS}BASE usage was because I had
hacked KVM to emulate RD{FS,GS}BASE so that I could do KVM TDX development on older
hardware. I.e. it's not impossible that this code could run on hardware where
RDGSBASE is emulated in software.
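
FWIW, the hack was conceptually something like this (paraphrasing from years-old
memory; insn_is_rdgsbase() and insn_modrm_rm() below are made-up helpers):

  /* On an unhandled #UD from the guest: F3 (REX.W) 0F AE /1 == RDGSBASE */
  if (insn_is_rdgsbase(insn)) {
	u64 data = vmcs_read64(GUEST_GS_BASE);

	kvm_register_write(vcpu, insn_modrm_rm(insn), data);
	return kvm_skip_emulated_instruction(vcpu);
  }

Every emulated RDGSBASE eats a full VM-Exit roundtrip plus instruction decode,
hence the "few thousand cycles".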
[*] https://www.intel.com/content/www/us/en/download/738875/intel-trust-domain-extension-intel-tdx-module.html