[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1620128746.51595.1509239268991.JavaMail.zimbra@efficios.com>
Date: Sun, 29 Oct 2017 01:07:48 +0000 (UTC)
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Andy Lutomirski <luto@...nel.org>
Cc: Kyle Huey <me@...ehuey.com>, Ben Maurer <bmaurer@...com>,
Boqun Feng <boqun.feng@...il.com>,
Will Deacon <will.deacon@....com>,
Peter Zijlstra <peterz@...radead.org>,
Paul Turner <pjt@...gle.com>,
linux-kernel <linux-kernel@...r.kernel.org>,
Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: rseq event_counter field
----- On Oct 28, 2017, at 10:20 PM, Andy Lutomirski luto@...nel.org wrote:
> Answering both emails here.
>
> Also, welcome Kyle. Kyle, how badly does rseq's proposed
> event_counter break rr? For that matter, how badly does rseq without
> an event_counter break rr?
>
> (Linus, if you care, I'm proposing that rseq is probably fine for
> 4.15, but that it should be merged without the event_counter. If
> there is a bona fide use for event_counter along with a benchmark
> showing that it makes a meaningful difference, then event_counter can
> be easily added later.)
>
> On Fri, Oct 27, 2017 at 10:52 AM, Ben Maurer <bmaurer@...com> wrote:
>> My understanding is that one of the motivations for the event counter was to
>> make it possible to express things like "increment a per-cpu value" in C.
>> Without this abstraction it wasn't possible to ensure that the strict
>> requirements of the rseq ABI were met (contiguous restartable block, single
>> instruction for the final write). If you write your entire restartable
>> sequence in assembly there's no need for the event counter. Comparing all
>> input values as an alternative way of having non-assembly sequences seems
>> like it might have a risk of ABA style race conditions.
>>
>
> My opinion after looking at a few examples is that event_counter
> sounds useful but that it's really more of a crutch that enables
> mediocre userspace code. With event_counter, you can do roughly this:
>
> struct rseq_state rseq = begin();
>
> int cpu = rseq_getcpu();
> Calculate some stuff. Be very very careful to to chase pointers,
> because it's easy to do in a very very fragile way as Mathieu's
> example below does.
>
> commit(&rseq, some pointer you found, some value to write, some
> double-checks to check);
>
> With a real database, this style of programming works well. You
> probably have a progress guarantee, and the database gives you the
> semantics you want. With rseq, you don't get any of this. If the C
> code is too slow or does something (an accidental interaction with
> userfaultfd, perhaps?), then you lose forward progress. You also can
> be a bit slow in the C code, in which case you'll get more conflicts
> detected than actually exist.
>
> Without event_counter, you can use a different abstraction. In your
> percpu data, you have something like:
>
> struct rseq_lock lock;
> struct freelist_entry *head;
>
> Then you can do:
>
> int cpu = rseq_getcpu();
> struct my_data *data = ... cpu ...;
> struct rseq_state rseq = rseq_acquire(data->lock);
>
> Calculate some stuff, being careful not to touch anything that isn't
> protected by the rseq_lock you're using.
>
> commit(&rseq, cpu, some pointer you found, some value to write, some
> double-checks to check);
>
> This variant is indeed a tiny bit slower *if you measure time by
> number if instructions* -- there are some very clever optimizations
> that save some instructions in the event_counter variant. But, in the
> cases where you actually have C code in here, I suspect that the
> difference won't matter in real code. And the counter-less variant
> can have arbitrarily slow C code or even syscalls without being
> spuriously aborted until the very short asm commit sequence starts.
>
> BUT, I think that the event_counter-free version is likely to be
> faster in real code if the library is done well. The version with
> event_counter has to load the event_counter from the rseq cacheline
> right at the beginning, and that value is used, so unless the CPU is
> really quite aggressive avoid speculating through a branch that hasn't
> gotten its dependencies into cache yet, you're likely to stall a bit
> if the per-thread rseq cacheline isn't in L1.
>
> The variant without event_counter can avoid this problem as long as
> the architecture gives a cheap way to find the CPU number. On x86,
> this way is RDPID on upcoming CPUs and is MOV %fs:__percpu_cpu_nr on
> any CPU as long as the kernel supports per-cpu FSBASE. (That's a
> feature that isn't terribly hard to add and would IMO be quite nifty.)
> If this is done, then you still have to do:
>
> __this_thread_rseq->rseq_cs = &the_rseq_cs_here;
>
> but that's a *store* to the rseq cacheline. It'll go in the store
> buffer and, in the fast case, nothing ever depends on it. Hence it's
> unlikely to stall. Of course, there's still a load from the
> rseq_lock, but that shares a cacheline with the data that it protects,
> so it's basically free.
>
> IOW, I think that adding event_counter encourages user code that has a
> weaker progress guarantee, is slower in a very highly optimized
> implementation, and allows users to think less about they're doing to
> their own detriment, for the gain of saving a couple of instructions.
> I could be wrong, but I think it would be nice to see evidence that
> I'm wrong before event_counter becomes enshrined in the ABI. Also, I
> suspect that neat tools like rr will have a much easier time dealing
> with rseq if event_counter isn't involved.
>
>>
>> RDPID is interesting, but it seems like given the current ABI (which
>> requires setting the current restartable range) you'd still need some sort
>> of TLS value. Have any benchmarks of RDPID perf vs TLS?
>
> That's the wrong comparison. For a good userspace implementation
> (which is very awkward without libc's help), storing to the rseq
> pointer is:
>
> leaq the_rseq_cs(%rip), %rax
> movq %rax, %gs:__tls_rseq + offsetof(struct rseq, rseq_cs)
>
> loading the cpu number is:
>
> movq %gs:__tls_rseq + offsetof(struct rseq, cpu), %rax
>
> The latter could be rdpid %rax; movl %eax, %eax (if the ABI gets
> fixed) instead. This avoids reading from the rseq cacheline.
>
>
> On Sat, Oct 28, 2017 at 2:35 AM, Mathieu Desnoyers
> <mathieu.desnoyers@...icios.com> wrote:
>> ----- On Oct 27, 2017, at 6:00 PM, Andy Lutomirski luto@...nel.org wrote:
>>
>>> On Thu, Oct 26, 2017 at 9:54 AM, Mathieu Desnoyers
>>> <mathieu.desnoyers@...icios.com> wrote:
>>
>> Let me step back a bit, and try to understand the fundamental
>> motivation for not letting user-space detect that it has been
>> preempted.
>>
>> My understanding (please let me know where I am mistaken) is that
>> the motivation for not letting user-space know if preempted is to
>> ensure malware cannot cloak (hide) themselves by changing their
>> behavior when they detect that they are heavily preempted. This
>> is effectively a way to detect things like single-stepping, strace,
>> ltrace, and so on.
>
> That's about a third of my objection. The next third is that it seems
> messy to be to intentionally expose what should be an invisible
> detail, and the third is that I still haven't seen convincing evidence
> that anything genuinely needs an event counter. See below and my
> other email.
>
>>
>> One can also pretty much trivially use rdtsc (x86), tb register
>> (powerpc) or ARM ccnt register (if user access is enabled) to
>> detect whether it runs under observation.
>> (this would be your timing-and-guessing approach)
>>
>
> Mostly irrelevant, but rdtsc is trappable.
>
>> I hope that very *few* users will completely open-code their asm
>> sequences. For instance, if you do the whole list pop in asm, you
>> need to:
>>
>> 1) store current rseq_cs pointer
>> 2) load CPU number from struct rseq,
>> 3) pointer-chase the right per-cpu data that you want to act on,
>> 4) load head
>> 5) compare head against NULL
>> 6) conditional branch
>> 6) load head->next
>> 10) store head->next to head (commit)
>>
>> step 3) is the interesting bit here:
>> It heavily depends on the application per-cpu data layout. It can be
>> a large array of per-cpu integers, or an array of per-cpu structures
>> at a specific offset, or an array of per-cpu struct + offset + pointer
>> to some other structure (dynamically allocated), and so on. I do not
>> want to impose any restriction on the data structure layout of
>> user-space applications. This means the asm-specialized-stuff will
>> have to be specialized for each per-cpu data structure layout, which
>> is a pain.
>
> To my point: I don't think this is a particularly good example. I
> think your examples are buggy (see below),
Good catch! I think indeed that the list pop I currently have is only meant
to be used on type-safe memory (or objects otherwise guaranteed to exist by
some other mean like RCU). This can be indeed very much unexpected and tricky
for users to have this kind of constraint. So I start to better understand
where you are getting to...
> and I would implement this
> quite differently if I were trying to use C (or even asm, perhaps),
> like this:
>
> 1) load CPU number.
> 2) pointer-chase the right per-cpu data, in C if desired.
> 3) store rseq_cs (which pretty much has to be in asm using the current ABI)
> 4) load head, compare against null, load head->next
> 5) commit
>
> I don't see how the event counter is at all useful here.
OK, so there is a "use-after-free" on non type-safe memory for list pop
in both my rseq fast-path and cpu_opv slow path currently. We'll indeed
need to have the load head, compare != NULL, load head->next all in asm
(and in the cpu_opv preempt-off section) to prevent use-after-free in the
general case.
This makes a very strong case for removing the event_counter, since its
use does not prevent against use-after-free, and would therefore be very
much error-prone.
>
> FWIW, I can easily imagine alternative ABIs in which abort_ip is
> treated as a *call* instead of a jump to enable much better C support.
>
>> Having the event_counter in place allows users to code the
>> pointer-chasing in pure C, which we can expect will help adoption
>> and ensure people use rseq through a utility library header file
>> rather than open-coding everything (with the associated testing
>> problems).
>
> True, but my variant works just as well, might be faster, and doesn't
> need event_counter.
>
>>
>> If we look at the list pop alternative using C for pointer-chasing,
>> with event_counter:
>>
>> <C>
>> 1) 64-bit load of both event_counter and cpu_id
>> 2) Use cpu_id to pointer-chase the per-cpu data
>> 3) Load head
>> 4) compare head with NULL
>> 5) conditional branch
>> 6) Load head->next
>
> ^^^ Potential segfault here. head might have been freed and unmapped.
>
>> <ASM>
>> 9) store current rseq_cs
>> 10) Load event counter
>> 11) compare current event_counter with event_counter loaded in 1)
>> 12) conditional branch
>> 13) store new value to @loc
>>
>> For a total of 2 store, 4 loads, one or many loads for pointer-chasing,
>> 2 compare, 2 cond. branch. Again, we can expect the load in 3) to
>> stall due to pointer-chasing.
>>
>> And here is the resulting list pop where the algorithm is in C,
>> without sequence counter. Let's make it a simple case
>> where the application is not changing the layout of the data
>> structures being pointer-chased concurrently from a book-keeping
>> algorithm (because that would add extra compare in the asm).
>>
>> <C>
>> 1) 32-bit load of cpu_id
>> 2) Use cpu_id to pointer-chase the per-cpu data
>> 3) Load head
>> 4) compare head with NULL
>> 5) conditional branch
>> 6) Load head->next
>
> ^^^ ditto: potential segfault here
>
Yep, good catch! Thanks for showing me how you envision the
userspace C vs ASM split, I'll try to come up with improved
user-space fast/slow paths based on your suggestions, and
therefore remove the event counter.
>> <ASM>
>> 7) store current rseq_cs
>> 8) 32-bit load of cpu_id
>> 9) compare cpu_id with prior cpu_id read in 1)
>> 10) conditional branch
>> 11) Load head
>> 12) compare head against head loaded in 3)
>> 13) conditional branch
>> 14) Load head->next
>> 15) compare head->next against head->next loaded in 6) --> this prevents ABA
>> 16) conditional branch
>> 17) store head->next as new head
>>
>>
>> Would it be required to change the kernel-user ABI when switching to
>> RDPID ? Can this co-exist with TLS users ?
>
> RDPID gives you the CPU number without reading memory, and that's
> pretty much all it does. The format in which it returns it sucks,
> unfortunately, but I'm going to try to fix that by the time rseq is
> merged.
OK, so it is entirely independent from rseq and can be simply used as
an alternative way of reading the cpu_id.
Thanks for the thorough explanation, I think I know what to aim for now.
Mathieu
>
> --Andy
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Powered by blists - more mailing lists