Message-ID: <CABgObfa4TKcj-d3Spw+TAE7ZfO8wFGJebkW3jMyFY2TrKxMuSw@mail.gmail.com>
Date: Fri, 24 Jan 2025 23:19:52 +0100
From: Paolo Bonzini <pbonzini@...hat.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: "Kernel Mailing List, Linux" <linux-kernel@...r.kernel.org>, kvm <kvm@...r.kernel.org>
Subject: Re: [PATCH 1/2] KVM: x86: fix usage of kvm_lock in set_nx_huge_pages()

On Fri, Jan 24, 2025, 21:11 Sean Christopherson <seanjc@...gle.com> wrote:
> Heh, except it's all kinds of broken.

Yes, I didn't even try.

> IMO, biting the bullet and converting to
> an SRCU-protected list is going to be far less work in the long run.

I did try a long SRCU critical section and it was unreviewable. It
ends up a lot less manageable than just making the lock a leaf,
especially as the lock hierarchy spans multiple subsystems (static
key, KVM, cpufreq---thanks CPU hotplug lock...). I also didn't like
adding a synchronization primitive that's... kinda single-use, but
that would not be a blocker of course.

So the second attempt was regular RCU, which looked a lot like this
patch. I started writing all the dances to find a struct kvm that
passes kvm_get_kvm_safe() before you do rcu_read_unlock() and drop the
previous one (because you cannot do kvm_put_kvm() within the RCU read
side) and set aside the idea, incorrectly thinking that they were not
needed with kvm_lock. Plus I didn't like having to keep alive a bunch
of data for a whole grace period if call_rcu() is used.
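
For illustration, the "safe" get that the RCU variant depends on is an
increment-unless-zero: a reader that finds an entry whose count already
dropped to zero must skip it. Below is a minimal userspace sketch with C11
atomics. struct vm, vm_get_safe() and vm_put() are hypothetical stand-ins
chosen for the example, not KVM code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct kvm; only the reference count matters. */
struct vm {
	atomic_int users;
};

/* Take a reference only if the object is not already being torn down. */
static bool vm_get_safe(struct vm *vm)
{
	int old = atomic_load(&vm->users);

	while (old != 0) {
		/* On failure 'old' is refreshed with the current value; retry. */
		if (atomic_compare_exchange_weak(&vm->users, &old, old + 1))
			return true;
	}
	return false;	/* count is zero: the caller has to skip this entry */
}

/* Drop a reference; returns true if this was the last one. */
static bool vm_put(struct vm *vm)
{
	return atomic_fetch_sub(&vm->users, 1) == 1;
}
```

Once the count has reached zero, vm_get_safe() keeps failing, which is why
the list walker needs the "continue" branch.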

So for the third attempt I could have chosen between dropping the SRCU
or just using kvm_lock. I didn't even think of SRCU to be honest,
because everything so far looked so bad, but it does seem a little
better than RCU. At least, if kvm_destroy_vm() uses call_srcu(), you
can call kvm_put_kvm() within srcu_read_lock()...srcu_read_unlock().
It would look something like

  list_for_each_entry_srcu(kvm, &vm_list, vm_list, 1) {
    if (!kvm_get_kvm_safe(kvm))
      continue;

    /* kvm is protected by the reference count now. */
    srcu_read_unlock(&kvm_srcu);
    ...
    srcu_read_lock(&kvm_srcu);
    /* kvm stays alive, and next can be read, until the next srcu_read_unlock() */
    kvm_put_kvm(kvm);
  }
  srcu_read_unlock(&kvm_srcu);

but again I am not sure how speedy call_srcu() is in reclaiming the
data, even in the common case where set_nx_huge_pages() or any other
RCU reader (none of them is frequent) isn't running. If you don't use
call_srcu() it becomes just as bad as RCU or kvm_lock.

So... let's talk about kvm_lock.

> > @@ -7143,16 +7141,19 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
> > +                     kvm_get_kvm(kvm);
>
> This needs to be:
>
>                 if (!kvm_get_kvm_safe(kvm))
>                         continue;

If we go for kvm_lock, kvm_get_kvm() *can* be made safe within the
critical section; if kvm_put_kvm() uses refcount_dec_and_mutex_lock(),
then the 1->0 transition happens under kvm_lock and cannot race with
kvm_get_kvm() (the mutex can be dropped as soon as
refcount_dec_and_mutex_lock() returns, it's really just the decrement
that needs to be within the critical section).
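
To make the argument concrete, here is a userspace sketch of the
decrement-and-lock idea using pthreads and C11 atomics. It models the
behaviour described above (the 1->0 transition only ever happens with the
mutex held) and is not the kernel implementation. list_lock, struct vm,
dec_not_one(), dec_and_mutex_lock(), vm_get_locked() and vm_put() are all
hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

struct vm {
	atomic_int users;
};

/* Decrement unless the count is exactly 1; returns true if it decremented. */
static bool dec_not_one(atomic_int *r)
{
	int old = atomic_load(r);

	while (old != 1) {
		if (atomic_compare_exchange_weak(r, &old, old - 1))
			return true;
	}
	return false;
}

/* Returns true, with 'lock' held, only if the count dropped to zero. */
static bool dec_and_mutex_lock(atomic_int *r, pthread_mutex_t *lock)
{
	if (dec_not_one(r))
		return false;		/* fast path: not the last reference */

	pthread_mutex_lock(lock);
	if (atomic_fetch_sub(r, 1) != 1) {
		/* Someone took a reference in the meantime. */
		pthread_mutex_unlock(lock);
		return false;
	}
	return true;			/* 1->0 happened under the lock */
}

/* Plain increment; in this sketch callers must hold list_lock. */
static void vm_get_locked(struct vm *vm)
{
	atomic_fetch_add(&vm->users, 1);
}

/* Returns true if the caller dropped the last reference. */
static bool vm_put(struct vm *vm)
{
	if (!dec_and_mutex_lock(&vm->users, &list_lock))
		return false;
	/* Only the decrement needed the lock, so it can go right away. */
	pthread_mutex_unlock(&list_lock);
	return true;
}
```

A thread that holds list_lock and sees a nonzero count knows the count
cannot reach zero until it releases the lock, so the unconditional
increment in vm_get_locked() cannot resurrect a dying object.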

> >       if (new_val != old_val) {
> >               struct kvm *kvm;
> >
> > -             mutex_lock(&kvm_lock);
> > -
> >               list_for_each_entry(kvm, &vm_list, vm_list) {
>
> This is unsafe, as vm_list can be modified while kvm_lock is dropped.  And
> using list_for_each_entry_safe() doesn't help, because the _next_ entry may
> have been freed.

list_for_each_entry_safe() is broken, but list_for_each_entry() can be
used. The problem is the call to kvm_put_kvm(), which is where the kvm
struct is freed thus breaking the foreach loop. So next must be read
and ref'd _before_ kvm_put_kvm(), then you can do

  kvm_get_kvm(kvm);
  mutex_unlock(&kvm_lock);
  if (prev)
    kvm_put_kvm(prev);
  ...
  mutex_lock(&kvm_lock);
  prev = kvm;

I don't know... there are few-enough readers that SRCU seems a bit
misplaced and it has the issue of keeping the VM data alive; while
kvm_lock has uglier code with the kvm_put_kvm() looking really
misplaced. If there were many instances one could write a nice
iterator, but for just one use?
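
For reference, the prev-based loop above can be modelled in userspace as
follows. This is a simplified sketch under stated assumptions: the count is
a plain int guarded by the mutex rather than a refcount_t, the list is
singly linked, and the last put only unlinks the entry instead of freeing
it. struct vm, walk_vms(), visit(), victim and demo_walk() are hypothetical
names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct vm {
	int users;		/* guarded by list_lock in this sketch */
	int visited;
	struct vm *next;	/* singly linked, NULL-terminated */
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct vm *vm_list;
static struct vm *victim;	/* entry that visit() will drop once */

/* Caller holds list_lock. */
static void vm_get(struct vm *vm)
{
	vm->users++;
}

/* Takes list_lock itself; the last put unlinks the entry. */
static void vm_put(struct vm *vm)
{
	pthread_mutex_lock(&list_lock);
	if (--vm->users == 0) {
		struct vm **pp = &vm_list;

		while (*pp != vm)
			pp = &(*pp)->next;
		*pp = vm->next;
	}
	pthread_mutex_unlock(&list_lock);
}

/* Runs unlocked: records the visit and drops the victim's last reference. */
static void visit(struct vm *vm)
{
	vm->visited++;
	if (victim) {
		struct vm *v = victim;

		victim = NULL;
		vm_put(v);
	}
}

/* Walk the list, calling fn() on each entry with list_lock dropped. */
static int walk_vms(void (*fn)(struct vm *))
{
	struct vm *vm, *prev = NULL;
	int n = 0;

	pthread_mutex_lock(&list_lock);
	for (vm = vm_list; vm; vm = vm->next) {
		vm_get(vm);		/* pins vm, so vm->next stays readable */
		pthread_mutex_unlock(&list_lock);
		if (prev)
			vm_put(prev);	/* safe: we already hold a ref on vm */
		fn(vm);
		n++;
		pthread_mutex_lock(&list_lock);
		prev = vm;
	}
	pthread_mutex_unlock(&list_lock);
	if (prev)
		vm_put(prev);
	return n;
}

/* a -> b -> c; b loses its last reference while a is being visited. */
static int demo_walk(void)
{
	static struct vm a, b, c;

	a = (struct vm){ .users = 1, .next = &b };
	b = (struct vm){ .users = 1, .next = &c };
	c = (struct vm){ .users = 1, .next = NULL };
	vm_list = &a;
	victim = &b;

	int n = walk_vms(visit);

	assert(a.visited == 1 && b.visited == 0 && c.visited == 1);
	assert(a.users == 1 && b.users == 0 && c.users == 1);
	assert(vm_list == &a && a.next == &c);
	return n;
}
```

The reference on the current entry is what keeps its next pointer valid
across the unlocked region, and the put of the previous entry is deferred
until the lock has been dropped, which is the part that looks misplaced in
the loop body.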

Hmm... I wonder if something like

  if (poll_state_synchronize_srcu(&kvm_srcu,
          get_state_synchronize_srcu(&kvm_srcu))) {
    kvm_destroy_vm_cb(&kvm->rcu_head);
  } else {
    call_srcu(&kvm_srcu, &kvm->rcu_head, kvm_destroy_vm_cb);
  }

catches the case where there's no concurrent reader. If so, SRCU would
be a winner undoubtedly, but being the only user of a tricky RCU API
doesn't give me warm and fuzzy feelings. I'm still team kvm_lock for
now.

Anyhow I can prepare a tested version next Monday, with either
kvm_lock or with SRCU if the above trick works. Unless I showed that
it's trickier than it seems and successfully nerd-sniped you.
Seriously - just tell me what you prefer.

Paolo