Message-ID: <aC3rpZChhtw4NODS@google.com>
Date: Wed, 21 May 2025 08:05:09 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Michael Kelley <mhklinux@...look.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Nuno Das Neves <nunodasneves@...ux.microsoft.com>, 
	Paolo Bonzini <pbonzini@...hat.com>, Ingo Molnar <mingo@...hat.com>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, 
	Marc Zyngier <maz@...nel.org>, Oliver Upton <oliver.upton@...ux.dev>, 
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, 
	"linux-arm-kernel@...ts.infradead.org" <linux-arm-kernel@...ts.infradead.org>, 
	"kvmarm@...ts.linux.dev" <kvmarm@...ts.linux.dev>, K Prateek Nayak <kprateek.nayak@....com>, 
	David Matlack <dmatlack@...gle.com>, Juergen Gross <jgross@...e.com>, 
	Stefano Stabellini <sstabellini@...nel.org>, 
	Oleksandr Tyshchenko <oleksandr_tyshchenko@...m.com>
Subject: Re: [PATCH v2 08/12] sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority()

On Wed, May 21, 2025, Michael Kelley wrote:
> From: Peter Zijlstra <peterz@...radead.org> Sent: Wednesday, May 21, 2025 4:43 AM
> > 
> > On Tue, May 20, 2025 at 03:20:00PM -0700, Sean Christopherson wrote:
> > > On Tue, May 20, 2025, Peter Zijlstra wrote:
> > > > On Mon, May 19, 2025 at 11:55:10AM -0700, Sean Christopherson wrote:
> > > > > Drop the setting of WQ_FLAG_EXCLUSIVE from add_wait_queue_priority() to
> > > > > differentiate it from add_wait_queue_priority_exclusive().  The one and
> > > > > only user of add_wait_queue_priority(), Xen privcmd's irqfd_wakeup(),
> > > > > unconditionally returns '0', i.e. doesn't actually operate in exclusive
> > > > > mode.
> > > >
> > > > I find:
> > > >
> > > > drivers/hv/mshv_eventfd.c:      add_wait_queue_priority(wqh, &irqfd->irqfd_wait);
> > > > drivers/xen/privcmd.c:  add_wait_queue_priority(wqh, &kirqfd->wait);
> > > >
> > > > I mean, it might still be true and all, but hyperv seems to also use
> > > > this now.
> > >
> > > Oh FFS, another "heavily inspired by KVM".  I should have bribed someone to take
> > > this series when I had the chance.  *sigh*
> > >
> > > Unfortunately, the Hyper-V code does actually operate in exclusive mode.  Unless
> > > you have a better idea, I'll tweak the series to:
> > >
> > >   1. Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority() and have the callers
> > >      explicitly set the flag.
> > >   2. Add a patch to drop WQ_FLAG_EXCLUSIVE from Xen privcmd entirely.
> > >   3. Introduce add_wait_queue_priority_exclusive() and switch KVM to use it.
> > >
> > > That has an added bonus of introducing the Xen change in a dedicated patch, i.e.
> > > is probably a better sequence anyways.
> > >
> > > Alternatively, I could rewrite the Hyper-V code a la the KVM changes, but I'm not
> > > feeling very charitable at the moment (the complete lack of documentation for
> > > their ioctl doesn't help).
> > 
> > Works for me. Michael is typically very responsive wrt hyperv (but you
> > probably know this).
> 
> I can't be much help on this issue. This Hyper-V code is for Linux running in
> the root partition (i.e., "dom0") and I don't have a setup where I can run and
> test that configuration.
> 
> Adding Nuno Das Neves from Microsoft for his thoughts.

A slightly more helpful, less ranty explanation of what's going on:

KVM's irqfd code, which was pretty much copied verbatim for Hyper-V partitions, disallows
binding an eventfd to a single VM multiple times, but doesn't handle the scenario
where an eventfd is bound to multiple VMs, i.e. to multiple partitions.  What's
particularly "fun" about such a scenario is that WQ_FLAG_EXCLUSIVE+WQ_FLAG_PRIORITY
means only the first VM/partition that bound the eventfd will be notified.
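
For reference, the wakeup path boils down to this (paraphrasing the core loop
of __wake_up_common(), not verbatim):

    list_for_each_entry_safe(curr, next, &wq_head->head, entry) {
        unsigned int flags = curr->flags;
        int ret = curr->func(curr, mode, wake_flags, key);

        if (ret < 0)
            break;
        /* Stop after the first exclusive waiter that accepts the wakeup. */
        if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;
    }

Priority waiters are queued ahead of everyone else, in the order they were
added, so the first exclusive+priority waiter whose callback returns non-zero
ends the walk and starves everything behind it.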

For KVM-based setups, this is a legitimate concern because KVM supports intra-host
migration.  E.g. to upgrade the userspace VMM, a guest can be "migrated" from the
old VMM's "struct kvm" instance to the new VMM's "struct kvm".  If userspace mucks
up the migration, e.g. doesn't *unbind* the eventfd from the old VM(M) before
resuming the guest in the new VM(M), KVM will effectively drop virtual IRQs.
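
Concretely, the buggy flow from userspace looks something like this
(illustrative only, error handling omitted):

    struct kvm_irqfd irqfd = { .fd = evfd, .gsi = gsi };

    ioctl(old_vm_fd, KVM_IRQFD, &irqfd);    /* bind eventfd to the old VM */

    /* ... intra-host migration ... */

    /* Oops: no KVM_IRQFD_FLAG_DEASSIGN against old_vm_fd first. */
    ioctl(new_vm_fd, KVM_IRQFD, &irqfd);    /* bind same eventfd to new VM */

The old VM(M)'s exclusive, priority waiter is still at the head of the
eventfd's waitqueue, so signals on the eventfd never reach the new VM.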

This is purely a hardening exercise, i.e. isn't required for correctness, assuming
userspace is bug-free.  The KVM patches surrounding this patch show how
I am planning on ensuring a 1:1 eventfd:VM binding.
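
The new helper would look something like this (sketch only, the details may
change by the time the series lands):

    int add_wait_queue_priority_exclusive(struct wait_queue_head *wq_head,
                                          struct wait_queue_entry *wq_entry)
    {
        struct list_head *head = &wq_head->head;
        unsigned long flags;
        int ret = 0;

        wq_entry->flags |= WQ_FLAG_EXCLUSIVE | WQ_FLAG_PRIORITY;

        spin_lock_irqsave(&wq_head->lock, flags);
        /* Reject if another priority waiter already claimed the queue. */
        if (!list_empty(head) &&
            (list_first_entry(head, struct wait_queue_entry, entry)->flags &
             WQ_FLAG_PRIORITY))
            ret = -EBUSY;
        else
            list_add(&wq_entry->entry, head);
        spin_unlock_irqrestore(&wq_head->lock, flags);

        return ret;
    }

That lets KVM detect a double-bind at bind time and fail with -EBUSY instead
of silently dropping IRQs.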

To not block the KVM hardening on Hyper-V's eventfd usage, I am planning on making
this change in the next version of the series:

diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index 8dd22be2ca0b..b348928871c2 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -368,6 +368,14 @@ static void mshv_irqfd_queue_proc(struct file *file, wait_queue_head_t *wqh,
                        container_of(polltbl, struct mshv_irqfd, irqfd_polltbl);
 
        irqfd->irqfd_wqh = wqh;
+
+       /*
+        * TODO: Ensure there isn't already an exclusive, priority waiter, e.g.
+        * that the irqfd isn't already bound to another partition.  Only the
+        * first exclusive waiter encountered will be notified, and
+        * add_wait_queue_priority() doesn't enforce exclusivity.
+        */
+       irqfd->irqfd_wait.flags |= WQ_FLAG_EXCLUSIVE;
        add_wait_queue_priority(wqh, &irqfd->irqfd_wait);
 }
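
And for KVM itself (item #3 above), the ->poll() queue callback returns void,
so the -EBUSY would need to be stashed and checked after vfs_poll(), along
the lines of (sketch, the field name is made up):

    static void irqfd_ptable_queue_proc(struct file *file,
                                        wait_queue_head_t *wqh,
                                        poll_table *pt)
    {
        struct kvm_kernel_irqfd *irqfd =
                container_of(pt, struct kvm_kernel_irqfd, pt);

        irqfd->wqh = wqh;
        /*
         * 'qproc_ret' is a made-up field; stash the result since a
         * poll_table callback can't fail directly.
         */
        irqfd->qproc_ret = add_wait_queue_priority_exclusive(wqh, &irqfd->wait);
    }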
