[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240826192410.GA2182008.vipinsh@google.com>
Date: Mon, 26 Aug 2024 12:24:10 -0700
From: Vipin Sharma <vipinsh@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: pbonzini@...hat.com, dmatlack@...gle.com, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] KVM: x86/mmu: Recover NX Huge pages belonging to TDP
MMU under MMU read lock
On 2024-08-26 07:34:35, Sean Christopherson wrote:
> On Fri, Aug 23, 2024, Vipin Sharma wrote:
> > On 2024-08-19 15:12:42, Sean Christopherson wrote:
> > > On Mon, Aug 19, 2024, Vipin Sharma wrote:
> > > Huh. Actually, after a lot of fiddling and staring, there's a simpler solution,
> > > and it would force us to comment/document an existing race that's subly ok.
> > >
> > > For the dirty logging case, the result of kvm_mmu_sp_dirty_logging_enabled() is
> > > visible to the NX recovery thread before the memslot update task is guaranteed
> > > to finish (or even start) kvm_mmu_zap_collapsible_sptes(). I.e. KVM could
> > > unaccount an NX shadow page before it is zapped, and that could lead to a vCPU
> > > replacing the shadow page with an NX huge page.
> > >
> > > Functionally, that's a-ok, because the accounting doesn't provide protection
> > > against iTLB multi-hit bug, it's there purely to prevent KVM from bouncing a gfn
> > > between an NX hugepage and an execute small page. The only downside to the vCPU
> > > doing the replacement is that the vCPU will get saddle with tearing down all the
> > > child SPTEs. But this should be a very rare race, so I can't imagine that would
> > > be problematic in practice.
> >
> > I am worried that whenever this happens it might cause guest jitter
> > which we are trying to avoid as handle_changed_spte() might be keep a
> > vCPU busy for sometime.
>
> That race already exists today, and your series already extends the ways in which
> the race can be hit. My suggestion is to (a) explicit document that race and (b)
> expand the window in which it can occur to also apply to dirty logging being off.
>
I was not clear about vCPU doing the replacement part. I see how this
change is expanding the window.
> > } else {
> > /*
> > * Try again in future if the page is still in the
> > * list
> > */
> > spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > if (!list_empty(&sp->possible_nx_huge_page_link))
> > list_move_tail(&sp->possible_nx_huge_page_link,
> > kvm-> &kvm->arch.possible_nx_huge_pages);
>
> This is unsafe. The only thing that prevents a use-after-free of "sp" is the fact
> that this task holds rcu_read_lock(). The sp could already been queued for freeing
> via call_rcu().
Before call_rcu() happens, that page will be removed from
kvm->arch.possible_nx_huge_pages list in handle_remove_pt() via
tdp_mmu_unlink_sp() using kvm->arch.tdp_mmu_pages_lock. Here, we are
using the same lock and checking if page is in the list or not. If it is
in the list move to end and if it is not then don't do anything.
Am I missing something else? Otherwise, this logic seems correct to me.
Overall, I will be using your example code, so you won't see this code
in next version but just want to understand the concern with this else
part.
>
> > spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> > }
> >
> > /* Resched code below */
> > }
Powered by blists - more mailing lists