Message-ID: <20150108223132.GA5861@amt.cnet>
Date: Thu, 8 Jan 2015 20:31:33 -0200
From: Marcelo Tosatti <mtosatti@...hat.com>
To: Andy Lutomirski <luto@...capital.net>
Cc: Paolo Bonzini <pbonzini@...hat.com>,
"xen-devel@...ts.xenproject.org" <xen-devel@...ts.xenproject.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
kvm list <kvm@...r.kernel.org>, Gleb Natapov <gleb@...nel.org>
Subject: Re: [RFC 2/2] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On Tue, Jan 06, 2015 at 11:49:09AM -0800, Andy Lutomirski wrote:
> On Tue, Jan 6, 2015 at 10:45 AM, Marcelo Tosatti <mtosatti@...hat.com> wrote:
> > On Tue, Jan 06, 2015 at 10:26:22AM -0800, Andy Lutomirski wrote:
> >> On Tue, Jan 6, 2015 at 10:13 AM, Marcelo Tosatti <mtosatti@...hat.com> wrote:
> >> > On Tue, Jan 06, 2015 at 08:56:40AM -0800, Andy Lutomirski wrote:
> >> >> On Jan 6, 2015 4:01 AM, "Paolo Bonzini" <pbonzini@...hat.com> wrote:
> >> >> >
> >> >> >
> >> >> >
> >> >> > On 06/01/2015 09:42, Paolo Bonzini wrote:
> >> >> > > > > Still confused. So we can freeze all vCPUs in the host, then update
> >> >> > > > > pvti 1, then resume vCPU 1, then update pvti 0? In that case, we have
> >> >> > > > > a problem, because vCPU 1 can observe pvti 0 mid-update, and KVM
> >> >> > > > > doesn't increment the version pre-update, and we can return completely
> >> >> > > > > bogus results.
> >> >> > > > Yes.
> >> >> > > But then the getcpu test would fail (1->0). Even if you have an ABA
> >> >> > > situation (1->0->1), it's okay because the pvti that is fetched is the
> >> >> > > one returned by the first getcpu.
> >> >> >
> >> >> > ... this case of partial update of pvti, which is caught by the version
> >> >> > field, is of course different from the other (extremely unlikely) case that
> >> >> > Andy pointed out. That is when the getcpus are done on the same vCPU,
> >> >> > but the rdtsc is done on another.
> >> >> >
> >> >> > That one can be fixed by rdtscp, like
> >> >> >
> >> >> > do {
> >> >> >     // get a consistent (pvti, v, tsc) tuple
> >> >> >     do {
> >> >> >         cpu = get_cpu();
> >> >> >         pvti = get_pvti(cpu);
> >> >> >         v = pvti->version & ~1;
> >> >> >         // also acts as rmb();
> >> >> >         rdtsc_barrier();
> >> >> >         tsc = rdtscp(&cpu1);
> >> >>
> >> >> Off-topic note: rdtscp doesn't need a barrier at all. AIUI AMD
> >> >> specified it that way and both AMD and Intel implement it correctly.
> >> >> (rdtsc, on the other hand, definitely needs the barrier beforehand.)
> >> >>
> >> >> >         // control dependency, no need for rdtsc_barrier?
> >> >> >     } while(cpu != cpu1);
> >> >> >
> >> >> >     // ... compute nanoseconds from pvti and tsc ...
> >> >> >     rmb();
> >> >> > } while(v != pvti->version);
> >> >>
> >> >> Still no good. We can migrate a bunch of times so we see the same CPU
> >> >> all three times and *still* don't get a consistent read, unless we
> >> >> play nasty games with lots of version checks (I have a patch for that,
> >> >> but I don't like it very much). The patch is here:
> >> >>
> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=a69754dc5ff33f5187162b5338854ad23dd7be8d
> >> >>
> >> >> but I don't like it.
> >> >>
> >> >> Thus far, I've been told unambiguously that a guest can't observe pvti
> >> >> while it's being written, and I think you're now telling me that this
> >> >> isn't true and that a guest *can* observe pvti while it's being
> >> >> written while the low bit of the version field is not set. If so,
> >> >> this is rather strongly incompatible with the spec in the KVM docs.
> >> >>
> >> >> I don't suppose that you and Marcelo could agree on what the actual
> >> >> semantics that KVM provides are and could write it down in a way that
> >> >> people who haven't spent a long time staring at the request code
> >> >> understand? And maybe you could even fix the implementation while
> >> >> you're at it if the implementation is, indeed, broken. I have ugly
> >> >> patches to fix it here:
> >> >>
> >> >> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso_paranoia&id=3b718a050cba52563d831febc2e1ca184c02bac0
> >> >>
> >> >> but I'm not thrilled with them.
> >> >>
> >> >> --Andy
> >> >
> >> > I suppose that separating the version write from the rest of the pvclock
> >> > structure is sufficient, as that would guarantee the writes are not
> >> > reordered even with fast string REP MOVS.
> >> >
> >> > Thanks for catching this Andy!
> >> >
> >>
> >> Don't you still need:
> >>
> >> version++;
> >> write the rest;
> >> version++;
> >>
> >> with possible smp_wmb() in there to keep the compiler from messing around?
> >
> > Correct. Could just as well follow the protocol and use odd/even, which
> > is what your patch does.
> >
> > What is the point with the new flags bit though?
>
> To try to work around the problem on old hosts. I'm not at all
> convinced that this is worthwhile or that it helps, though.
Andy,
Are you going to submit the fix or should I?
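
For reference, a minimal sketch of the odd/even version protocol discussed
in the quoted thread (version++, write the rest, version++), written with
C11 atomics rather than kernel primitives. Everything named demo_* here is
hypothetical: struct demo_pvti is a simplified stand-in and not the real
pvclock structure, and the release/acquire orderings only play the role
that smp_wmb()/rmb() play in the discussion. It assumes a single writer
and a version that starts out even.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a per-vCPU time-info record. */
struct demo_pvti {
    _Atomic uint32_t version;        /* odd while an update is in progress */
    _Atomic uint64_t tsc_timestamp;
    _Atomic uint64_t system_time;
};

/* Writer: make version odd, write the payload, make version even again. */
static void demo_pvti_update(struct demo_pvti *p, uint64_t tsc, uint64_t ns)
{
    uint32_t v = atomic_load_explicit(&p->version, memory_order_relaxed);

    atomic_store_explicit(&p->version, v + 1, memory_order_relaxed);
    /* Keep the payload stores from being ordered before the odd version. */
    atomic_thread_fence(memory_order_release);

    atomic_store_explicit(&p->tsc_timestamp, tsc, memory_order_relaxed);
    atomic_store_explicit(&p->system_time, ns, memory_order_relaxed);

    /* Release store: the payload is visible before the even version is. */
    atomic_store_explicit(&p->version, v + 2, memory_order_release);
}

/* One read attempt; returns false if the snapshot may be torn. */
static bool demo_pvti_try_read(struct demo_pvti *p, uint64_t *tsc, uint64_t *ns)
{
    uint32_t v0 = atomic_load_explicit(&p->version, memory_order_acquire);

    *tsc = atomic_load_explicit(&p->tsc_timestamp, memory_order_relaxed);
    *ns = atomic_load_explicit(&p->system_time, memory_order_relaxed);

    /* Keep the payload loads from being ordered after the re-check. */
    atomic_thread_fence(memory_order_acquire);
    uint32_t v1 = atomic_load_explicit(&p->version, memory_order_relaxed);

    /* Reject if an update was in progress (odd) or completed in between. */
    return !(v0 & 1) && v0 == v1;
}

/* Reader: retry until a consistent snapshot is observed. */
static void demo_pvti_read(struct demo_pvti *p, uint64_t *tsc, uint64_t *ns)
{
    while (!demo_pvti_try_read(p, tsc, ns))
        ;
}
```

The sketch covers only the version protocol itself. It does not address the
vCPU migration case from the thread (getcpu and rdtscp landing on different
CPUs), which needs a separate check on the reader side.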