linux-kernel - Re: [PATCH v2 00/21] Runtime TDX Module update support

Message-ID: <68fbebc54e776_10e9100fd@dwillia2-mobl4.notmuch>
Date: Fri, 24 Oct 2025 14:12:37 -0700
From: <dan.j.williams@...el.com>
To: Dave Hansen <dave.hansen@...el.com>, <dan.j.williams@...el.com>, Chao Gao
	<chao.gao@...el.com>
CC: Vishal Annapurve <vannapurve@...gle.com>, "Reshetova, Elena"
	<elena.reshetova@...el.com>, "linux-coco@...ts.linux.dev"
	<linux-coco@...ts.linux.dev>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "x86@...nel.org" <x86@...nel.org>, "Chatre,
 Reinette" <reinette.chatre@...el.com>, "Weiny, Ira" <ira.weiny@...el.com>,
	"Huang, Kai" <kai.huang@...el.com>, "yilun.xu@...ux.intel.com"
	<yilun.xu@...ux.intel.com>, "sagis@...gle.com" <sagis@...gle.com>,
	"paulmck@...nel.org" <paulmck@...nel.org>, "nik.borisov@...e.com"
	<nik.borisov@...e.com>, Borislav Petkov <bp@...en8.de>, Dave Hansen
	<dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>, Ingo Molnar
	<mingo@...hat.com>, "Kirill A. Shutemov" <kas@...nel.org>, Paolo Bonzini
	<pbonzini@...hat.com>, "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [PATCH v2 00/21] Runtime TDX Module update support

Dave Hansen wrote:
> On 10/24/25 12:40, dan.j.williams@...el.com wrote:
> > Dave Hansen wrote:
> >> On 10/24/25 00:43, Chao Gao wrote:
> >> ...
> >>> Beyond "the kvm_tdx object gets torn down during a build," I see two potential
> >>> issues:
> >>>
> >>> 1. TD Build and TDX migration aren't purely kernel processes -- they span multiple
> >>>    KVM ioctls. Holding a read-write lock throughout the entire process would
> >>>    require exiting to userspace while the lock is held. I think this is
> >>>    irregular, but I'm not sure if it's acceptable for read-write semaphores.
> >>
> >> Sure, I guess it's irregular. But look at it this way: let's say we
> >> concocted some scheme to use a TD build refcount and a module update
> >> flag, had them both wait_event_interruptible() on each other, and then
> >> did wakeups. That would get the same semantics without an rwsem.
> > 
> > This sounds unworkable to me.
> > 
> > First, you cannot return to userspace while holding a lock. Lockdep will
> > rightfully scream:
> > 
> >     "WARNING: lock held when returning to user space!"
> 
> Well, yup, it sure does look that way for normal lockdep-annotated lock
> types. It does seem like a sane rule to have for most things.
> 
> But, just to be clear, this is a lockdep thing and a good, solid
> semantic to have. It's not a rule that no kernel locking structure can
> ever be held when returning to userspace.

Sure, but I would submit that the lesser known cousin of the common
suggestion "do not write your own locking primitives" is "do not invent
locking schemes that involve holding locks over return to userspace". It
is rarely a good idea to the point that lockdep warns about it by
default.

> > The complexity of ensuring that a multi-stage ABI transaction completes
> > from the kernel side is painful. If that process dies in the middle of
> > its ABI sequence who cleans up these references?
> 
> The 'struct kvm_tdx' has to get destroyed at some point.

If a process goes out to lunch and fails to destroy kvm_tdx in a
reasonable timeframe, the resulting indefinite hang now has knock-on
effects.

[..]
> > The operational mechanism to make sure that one process flow does not
> > mess up another process flow is for those processes to communicate with
> > *userspace* file locks, or for those processes to check for failures after
> > the fact and retry. Unless you can make the build side an atomic ABI,
> > this is a documentation + userspace problem, not a kernel problem.
> 
> Yeah, that's a totally valid take on it.
> 
> My only worry is that the module update is going to be off in another
> world from the thing building TDs. We had a similar set of challenges
> around microcode updates, CPUSVN and SGX enclaves.
> 
> The guy doing "echo 1 > /sys/.../whatever" wasn't coordinating with
> every entity on the system that might run an SGX enclave. It certainly
> didn't help that enclave creation is typically done by unprivileged
> users. Maybe the KVM/TDX world is a _bit_ more narrow and they will be
> talking to each other, or the /dev/kvm permissions will be a nice funnel
> to get them talking to each other.
> 
> The SGX solution, btw, was to at least ensure forward progress (CPUSVN
> update) when the last enclave goes away. So new enclaves aren't
> *prevented* from starting but the window when the first one starts
> (enclave count going from 0->1) is leveraged to do the update.

The status quo does ensure forward progress. The TD does get built and
the update does complete; the only fallout is the small matter of TD
attestation failures, right?

Note, we had a similar problem with the tsm_report interface which,
because it is configfs and not an ioctl, is a multi-stage ABI to build a
report. If 2 threads collide in building an object, userspace indeed
gets to keep the pieces, but there is:

1/ Documentation of the potential for collisions

2/ A mechanism to detect collisions. See
   /sys/kernel/config/tsm/report/$name/generation in
   Documentation/ABI/testing/configfs-tsm-report
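
A minimal sketch of how userspace might use such a generation counter to
detect a collision in a multi-stage sequence. It assumes the counter
advances once per attribute write, which is my reading of the ABI
document; check that document for the exact semantics. The names
run_checked() and generation_reader() are hypothetical helpers, not part
of any kernel or library interface.

```python
def generation_reader(path):
    """Return a callable that reads an integer counter from `path`.

    `path` would be something like the report's `generation` attribute;
    the caller supplies it, nothing is hard-coded here.
    """
    def read():
        with open(path) as f:
            return int(f.read().strip())
    return read


def run_checked(read_generation, writes):
    """Run each write step, then report whether anyone else interfered.

    Assumption: the counter increments exactly once per write, so after
    N of our own writes it must have advanced by exactly N. Any other
    delta means another thread or process wrote to the same object.
    """
    before = read_generation()
    for write in writes:
        write()
    after = read_generation()
    return after - before == len(writes)


def run_with_retry(read_generation, writes, attempts=3):
    """Retry the whole sequence a bounded number of times on collision."""
    for _ in range(attempts):
        if run_checked(read_generation, writes):
            return True
    return False
```

The point of the design is that the kernel holds no lock across the
sequence; userspace checks after the fact and retries, or gives up and
reports the conflict.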

I really would not worry about the "off in another world" problem, it is
par for the course for datacenter operations. I encountered prolific use
of file locks in operations scripts during my time at Facebook. Think of
problems like coordinating disk partitioning across various provisioning
flows. The kernel happily lets 2 fdisk processes race to write a
partition table. The only way to ensure a consistent result in that case
is userspace sequencing, not a kernel lock while some process has a
partition table open.
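
For illustration, a minimal sketch of the kind of userspace sequencing
described above, using an advisory flock() on a lock file that both the
TD-build flow and the module-update flow agree to take. The helper name
flow_lock() and any lock file path are hypothetical; nothing here is an
existing TDX or KVM interface, and the lock only works if every
participating flow opts in.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def flow_lock(lock_path, blocking=True):
    """Hold an exclusive advisory lock on `lock_path` for the `with` body.

    Yields True if the lock is held. With blocking=False, yields False
    instead of waiting when another flow already holds the lock.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        flags = fcntl.LOCK_EX
        if not blocking:
            flags |= fcntl.LOCK_NB
        try:
            fcntl.flock(fd, flags)
            held = True
        except BlockingIOError:
            held = False
        yield held
    finally:
        # Closing the descriptor drops the lock. The kernel also drops it
        # if the process dies, so a crashed flow cannot wedge the others.
        os.close(fd)


# Hypothetical usage by an update script:
#
#     with flow_lock("/path/agreed/by/operators.lock", blocking=False) as held:
#         if not held:
#             raise SystemExit("TD build in progress, retry later")
#         ...  # trigger the module update here
```

Unlike a kernel lock held across a return to userspace, this lock is
released automatically when its holder exits, and a flow that does not
want to wait can fail fast and retry.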