[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <294c8c437c2e48b318b8c27eb7467430dfcba92b.camel@infradead.org>
Date: Mon, 05 Aug 2024 12:04:53 +0100
From: David Woodhouse <dwmw2@...radead.org>
To: Sean Christopherson <seanjc@...gle.com>, Paolo Bonzini
<pbonzini@...hat.com>, "Hussain, Mushahid" <hmushi@...zon.co.uk>
Cc: Vitaly Kuznetsov <vkuznets@...hat.com>, Wanpeng Li
<wanpengli@...cent.com>, Jim Mattson <jmattson@...gle.com>, Joerg Roedel
<joro@...tes.org>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
Mingwei Zhang <mizhang@...gle.com>, Maxim Levitsky <mlevitsk@...hat.com>
Subject: Re: [PATCH v2 6/8] KVM: Fix multiple races in gfn=>pfn cache refresh
On Wed, 2022-04-27 at 01:40 +0000, Sean Christopherson wrote:
> Rework the gfn=>pfn cache (gpc) refresh logic to address multiple races
> between the cache itself, and between the cache and mmu_notifier events.
>
> The existing refresh code attempts to guard against races with the
> mmu_notifier by speculatively marking the cache valid, and then marking
> it invalid if a mmu_notifier invalidation occurs. That handles the case
> where an invalidation occurs between dropping and re-acquiring gpc->lock,
> but it doesn't handle the scenario where the cache is refreshed after the
> cache was invalidated by the notifier, but before the notifier elevates
> mmu_notifier_count. The gpc refresh can't use the "retry" helper as its
> invalidation occurs _before_ mmu_notifier_count is elevated and before
> mmu_notifier_range_start is set/updated.
>
> CPU0 CPU1
> ---- ----
>
> gfn_to_pfn_cache_invalidate_start()
> |
> -> gpc->valid = false;
> kvm_gfn_to_pfn_cache_refresh()
> |
> |-> gpc->valid = true;
>
> hva_to_pfn_retry()
> |
> -> acquire kvm->mmu_lock
> kvm->mmu_notifier_count == 0
> mmu_seq == kvm->mmu_notifier_seq
> drop kvm->mmu_lock
> return pfn 'X'
> acquire kvm->mmu_lock
> kvm_inc_notifier_count()
> drop kvm->mmu_lock()
> kernel frees pfn 'X'
> kvm_gfn_to_pfn_cache_check()
> |
> |-> gpc->valid == true
>
> caller accesses freed pfn 'X'
>
> Key off of mn_active_invalidate_count to detect that a pfncache refresh
> needs to wait for an in-progress mmu_notifier invalidation. While
> mn_active_invalidate_count is not guaranteed to be stable, it is
> guaranteed to be elevated prior to an invalidation acquiring gpc->lock,
> so either the refresh will see an active invalidation and wait, or the
> invalidation will run after the refresh completes.
>
> Speculatively marking the cache valid is itself flawed, as a concurrent
> kvm_gfn_to_pfn_cache_check() would see a valid cache with stale pfn/khva
> values. The KVM Xen use case explicitly allows/wants multiple users;
> even though the caches are allocated per vCPU, __kvm_xen_has_interrupt()
> can read a different vCPU (or vCPUs). Address this race by invalidating
> the cache prior to dropping gpc->lock (this is made possible by fixing
> the above mmu_notifier race).
>
> Finally, the refresh logic doesn't protect against concurrent refreshes
> with different GPAs (which may or may not be a desired use case, but its
> allowed in the code), nor does it protect against a false negative on the
> memslot generation. If the first refresh sees a stale memslot generation,
> it will refresh the hva and generation before moving on to the hva=>pfn
> translation. If it then drops gpc->lock, a different user can come along,
> acquire gpc->lock, see that the memslot generation is fresh, and skip
> the hva=>pfn update due to the userspace address also matching (because
> it too was updated). Address this race by adding an "in-progress" flag
> so that the refresh that acquires gpc->lock first runs to completion
> before other users can start their refresh.
>
> Complicating all of this is the fact that both the hva=>pfn resolution
> and mapping of the kernel address can sleep, i.e. must be done outside
> of gpc->lock
>
> Fix the above races in one fell swoop, trying to fix each individual race
> in a sane manner is impossible, for all intents and purposes.
Hm, the problem with this (commit 58cd407ca4c6278) is that it ends up
acting like a *really* unfair read/write lock.
If there are a lot of 'writers' invalidating other HVA ranges, then the
hva_to_pfn_retry() function as the 'reader' will back off for ever and
never make progress. Even while a single other invalidation is active
(->mn_active_invalidate_count is non-zero), hva_to_pfn_retry() will
just spin in an elaborate busy-wait loop, mapping same page over and
over again.
(In the repro case I'm not entirely sure *why* there are userspace
threads bashing on MADV_DONTNEED and causing this, but 'Don't Do That
Then' is only a partial answer. The kernel's own locking should allow
it to make progress regardless.)
If we call the GPC invalidate function from invalidate_range_end()
instead of _start, can't we do it concurrently without having to check
for active invalidations or mmu_invalidate_seq?
We can then introduce a 'validating' flag, set before the attempt to
hva_to_pfn() in the loop, so that the translation can be shot down
before it's even made. And use *that* as the trigger for the retry loop
so that only has to retry if its *own* uHVA is messed with.
Patch follows...
Download attachment "smime.p7s" of type "application/pkcs7-signature" (5965 bytes)
Powered by blists - more mailing lists