Message-ID: <8735djvwbu.wl-maz@kernel.org>
Date: Fri, 26 Aug 2022 16:49:41 +0100
From: Marc Zyngier <maz@...nel.org>
To: Paolo Bonzini <pbonzini@...hat.com>
Cc: Peter Xu <peterx@...hat.com>, Gavin Shan <gshan@...hat.com>,
kvmarm@...ts.cs.columbia.edu, linux-arm-kernel@...ts.infradead.org,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, linux-kselftest@...r.kernel.org,
corbet@....net, james.morse@....com, alexandru.elisei@....com,
suzuki.poulose@....com, oliver.upton@...ux.dev,
catalin.marinas@....com, will@...nel.org, shuah@...nel.org,
seanjc@...gle.com, dmatlack@...gle.com, bgardon@...gle.com,
ricarkol@...gle.com, zhenyzha@...hat.com, shan.gavin@...il.com
Subject: Re: [PATCH v1 1/5] KVM: arm64: Enable ring-based dirty memory tracking
On Fri, 26 Aug 2022 11:50:24 +0100,
Paolo Bonzini <pbonzini@...hat.com> wrote:
>
> On 8/24/22 00:47, Marc Zyngier wrote:
> >> I definitely don't think I 100% understand all the ordering things since
> >> they're complicated.. but my understanding is that the reset procedure
> >> didn't need memory barrier (unlike pushing, where we have explicit wmb),
> >> because we assumed the userapp is not hostile so logically it should only
> >> modify the flags which is a 32bit field, assuming atomicity guaranteed.
> > Atomicity doesn't guarantee ordering, unfortunately. Take the
> > following example: CPU0 is changing a bunch of flags for GFNs A, B, C,
> > D that exist in the ring in that order, and CPU1 performs an ioctl to
> > reset the page state.
> >
> > CPU0:
> > write_flag(A, KVM_DIRTY_GFN_F_RESET)
> > write_flag(B, KVM_DIRTY_GFN_F_RESET)
> > write_flag(C, KVM_DIRTY_GFN_F_RESET)
> > write_flag(D, KVM_DIRTY_GFN_F_RESET)
> > [...]
> >
> > CPU1:
> > ioctl(KVM_RESET_DIRTY_RINGS)
> >
> > Since CPU0 writes do not have any ordering, CPU1 can observe the
> > writes in a sequence that has nothing to do with program order, and
> > could for example observe that GFN A and D have been reset, but not B
> > and C. This in turn breaks the logic in the reset code (B, C, and D
> > don't get reset), despite userspace having followed the spec to the
> > letter. If each was a store-release (which is the case on x86), it
> > wouldn't be a problem, but nothing calls it in the documentation.
> >
> > Maybe that's not a big deal if it is expected that each CPU will issue
> > a KVM_RESET_DIRTY_RINGS itself, ensuring that it observes its own
> > writes. But expecting this to work across CPUs without any barrier is
> > wishful thinking.
>
> Agreed, but that's a problem for userspace to solve. If userspace
> wants to reset the fields in different CPUs, it has to synchronize
> with its own invoking of the ioctl.

Userspace has no choice. On its own, it cannot order the reads that the
kernel will do to *other* rings.

> That is, CPU0 must ensure that an ioctl(KVM_RESET_DIRTY_RINGS) is done
> after (in the memory-ordering sense) its last write_flag(D,
> KVM_DIRTY_GFN_F_RESET). If there's no such ordering, there's no
> guarantee that the write_flag will have any effect.

The problem isn't on CPU0. The problem is that CPU1 does observe
inconsistent data on arm64, and I don't think this difference in
behaviour is acceptable. Nothing documents this, and there is a
baked-in assumption that there is a strong ordering between writes as
well as between writes and reads.

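
To make the failure mode concrete, here is a minimal sketch of a reset
scan that stops at the first entry that doesn't look reset. It is only
a model: SKETCH_F_RESET and sketch_count_reset() are made-up names for
illustration, not the kernel's code or the uapi definitions, and the
array stands for whatever flag values the resetting CPU happens to
observe.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag value for this sketch, not taken from the uapi headers. */
#define SKETCH_F_RESET 0x2u

/*
 * Count how many entries a reset pass would recycle, given the flag
 * values the resetting CPU observes. The scan stops at the first entry
 * that does not look reset, so later entries are left alone even if
 * their flag write is already visible.
 */
static size_t sketch_count_reset(const uint32_t *observed, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!(observed[i] & SKETCH_F_RESET))
			break;
	}
	return i;
}
```

With an observed snapshot where A and D carry the reset flag but B and
C do not, this scan recycles A only and leaves B, C and D behind, even
though userspace wrote all four.
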
> The main reason why I preferred a global KVM_RESET_DIRTY_RINGS ioctl
> was because it takes kvm->slots_lock so the execution would be
> serialized anyway. Turning slots_lock into an rwsem would be even
> worse because it also takes kvm->mmu_lock (since slots_lock is a
> mutex, at least two concurrent invocations won't clash with each other
> on the mmu_lock).

Whatever the reason, the behaviour should be identical on all
architectures. As it is, it only really works on x86, and I contend
this is a bug that needs fixing.

Thankfully, this can be done at zero cost for x86, and at that of a
set of load-acquires on other architectures.

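
As a rough illustration of that pairing, here is a userspace model
written with C11 atomics. It is a sketch under assumptions, not a
patch: struct sketch_gfn, the SKETCH_F_* values and the sketch_*()
helpers are invented for the example, and it assumes the writer
publishes each flag with a store-release. In kernel code the
corresponding primitives would be along the lines of
smp_store_release() and smp_load_acquire().

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values and layout, for illustration only. */
#define SKETCH_F_DIRTY 0x1u
#define SKETCH_F_RESET 0x2u

struct sketch_gfn {
	_Atomic uint32_t flags;
	uint64_t offset;
};

/* Writer side: publish the reset flag with release semantics. */
static void sketch_mark_reset(struct sketch_gfn *e)
{
	atomic_store_explicit(&e->flags, SKETCH_F_RESET, memory_order_release);
}

/* Reader side: check the flag with acquire semantics. */
static int sketch_is_reset(struct sketch_gfn *e)
{
	uint32_t f = atomic_load_explicit(&e->flags, memory_order_acquire);

	return (f & SKETCH_F_RESET) != 0;
}

/*
 * Walk the ring in order, recycling entries until the first one that
 * has not been marked reset. Returns the number of entries recycled.
 */
static size_t sketch_reset_pass(struct sketch_gfn *ring, size_t n)
{
	size_t i;

	for (i = 0; i < n && sketch_is_reset(&ring[i]); i++)
		atomic_store_explicit(&ring[i].flags, 0, memory_order_relaxed);
	return i;
}
```

On x86 both accessors compile down to plain loads and stores, so the
extra cost only shows up on weakly ordered architectures.
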
M.
--
Without deviation from the norm, progress is not possible.