Message-ID: <238182999.1493.1504291547065.JavaMail.zimbra@efficios.com>
Date:   Fri, 1 Sep 2017 18:45:47 +0000 (UTC)
From:   Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To:     Will Deacon <will.deacon@....com>
Cc:     "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
        Peter Zijlstra <peterz@...radead.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Boqun Feng <boqun.feng@...il.com>,
        Andrew Hunter <ahh@...gle.com>,
        maged michael <maged.michael@...il.com>,
        gromer <gromer@...gle.com>, Avi Kivity <avi@...lladb.com>,
        Benjamin Herrenschmidt <benh@...nel.crashing.org>,
        Paul Mackerras <paulus@...ba.org>,
        Michael Ellerman <mpe@...erman.id.au>,
        Dave Watson <davejwatson@...com>,
        Andy Lutomirski <luto@...nel.org>,
        Hans Boehm <hboehm@...gle.com>,
        Russell King <linux@...linux.org.uk>
Subject: Re: [RFC PATCH v3] membarrier: provide core serialization

----- On Sep 1, 2017, at 1:10 PM, Will Deacon will.deacon@....com wrote:

> Hi Mathieu,
> 
> On Fri, Sep 01, 2017 at 05:00:38PM +0000, Mathieu Desnoyers wrote:
>> ----- On Sep 1, 2017, at 12:25 PM, Will Deacon will.deacon@....com wrote:
>> 
>> > On Fri, Sep 01, 2017 at 12:10:07PM -0400, Mathieu Desnoyers wrote:
>> >> Add a new MEMBARRIER_FLAG_SYNC_CORE flag to the membarrier
>> >> system call. It allows membarrier to issue core serializing barriers in
>> >> addition to memory barriers on target threads whenever a membarrier
>> >> command is performed.
>> >> 
>> >> It is relevant for reclaim of JIT code, which requires issuing core
>> >> serializing barriers on all threads running on behalf of a process
>> >> after ensuring the old code is not visible anymore, before re-using
>> >> memory for new code.
>> >> 
>> >> The new MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED command, used with
>> >> the MEMBARRIER_FLAG_SYNC_CORE flag, registers the current process as
>> >> requiring core serialization. It may block. It can be used to ensure
>> >> MEMBARRIER_CMD_PRIVATE_EXPEDITED never blocks, even the first time it is
>> >> invoked by a process with the MEMBARRIER_FLAG_SYNC_CORE flag.
>> >> 
>> >> * Scheduler Overhead Benchmarks
>> >> 
>> >> Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
>> >> Linux v4.13-rc6
>> >> 
>> >> Inter-thread scheduling
>> >> taskset 01 ./perf bench sched pipe -T
>> >> 
>> >>                        Avg. usecs/op         Std.Dev. usecs/op
>> >> Before this change:         2.55                   0.10
>> >> With this change:           2.49                   0.08
>> >> SYNC_CORE processes:        2.70                   0.10
>> >> 
>> >> Inter-process scheduling
>> >> taskset 01 ./perf bench sched pipe
>> >> 
>> >>                        Avg. usecs/op         Std.Dev. usecs/op
>> >> Before this change:         2.93                   0.13
>> >> With this change:           2.93                   0.13
>> >> SYNC_CORE processes:        3.20                   0.06
>> >> 
>> >> Changes since v2:
>> >> - Rename MEMBARRIER_CMD_REGISTER_SYNC_CORE to
>> >>   MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED,
>> > 
>> > I'm still not convinced that this registration step is needed (at least
>> > for arm, power and x86), but my previous comments were ignored.
>> 
>> I mistakenly thought that your previous comments were addressed in
>> other legs of the previous thread, sorry about that.
> 
> No problem, thanks for replying this time!

And thanks for the reminder :)

> 
>> Let's take x86 as an example. The private expedited membarrier
>> command iterates over all CPU runqueues, checking whether rq->curr->mm
>> matches current->mm, and only sends an IPI if it matches.
>> 
>> We can very well have a CPU for which the scheduler goes back
>> and forth between user-space thread and a kernel thread, in
>> which case the mm state is kept as is, and rq->curr->mm is
>> temporarily saved into rq->curr->active_mm.
>> 
>> This means that while that CPU is executing a kthread, we
>> won't send any IPI to that CPU, but it could then schedule
>> back a thread belonging to the original process, and then
>> go back to executing user-space code without having issued
>> any kind of core serializing barrier (assuming we return to
>> userspace with sysexit).
> 
> Right, ok. I forgot about Andy's sysexit optimisation on x86.
> 
>> Now about arm64: given that, as you say, it issues a core serializing
>> barrier when returning to user-space, and has a strong barrier
>> in switch_to, the explicit sync_core() in sched_in is not needed.
> 
> Good, that's what I thought.

And now that I think about it further, I think we could do without the
sync_core in sched_out; we may just need a core serializing instruction
between the full barrier after the rq->curr store and the return to
user-space. The rationale is that we just want to issue a core serializing
instruction before executing further user-space instructions. We might not
care about ordering wrt instructions executed by user-space prior to
entering the scheduler. We'd need advice from architecture maintainers on
this point though.
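
Just to make the idea concrete, a rough sketch (not the actual patch;
the TIF_NEED_SYNC_CORE flag name, the task_registered_sync_core()
helper and the hook placement are all invented here for illustration):

	/* scheduler, after the rq->curr store and its full barrier: */
	if (task_registered_sync_core(next))
		set_tsk_thread_flag(next, TIF_NEED_SYNC_CORE);

	/* return-to-usermode path, before running user instructions: */
	if (test_and_clear_tsk_thread_flag(current, TIF_NEED_SYNC_CORE))
		sync_core();	/* core serializing instruction on x86 */

The point being that only the second hunk needs a core serializing
instruction; the sched_out side would merely need to flag the task.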

> 
>> However, AFAIU, arm64 does not guarantee consistent data and instruction
>> caches.
> 
> Correct, but:
> 
>  * On 32-bit arm, we have a syscall to do that (and this is already used by
>    JITs and things like __builtin_clear_cache)
> 
>  * On arm64, cache maintenance instructions are directly available to
>    userspace

Good!

> 
> In both cases, the maintenance is broadcast by the hardware to all CPUs.
> The only part that cannot be broadcast is the pipeline flush, which is
> the part we need to do above and is implicit on exception return.
> 
>> I'm actually trying to wrap my head around what would be the sequence
>> of operations of a JIT trying to reclaim memory. Can we combine
>> core serialization and instruction cache flushing into a single
>> system call invocation, or do we need to split this into two separate
>> operations?
> 
> I think that cache-flushing and pipeline-flushing should be separated,
> as they tend to be in the CPU architectures I'm familiar with.

Indeed, if we make the icache flushing separate, then we can apply it
more selectively to specific address ranges, without having to flush
the entire user icache.
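
For the range-based flush, I'm thinking of something along these lines
from userspace (just a sketch; the function name and range are made up
for illustration):

	#include <stddef.h>

	/* Flush a freshly written code range before any thread may
	 * branch to it.  On 32-bit arm this ends up in the cacheflush
	 * syscall, on arm64 in user-accessible cache maintenance
	 * instructions; on x86 it is essentially a no-op since the
	 * I/D caches are kept coherent by hardware. */
	static void flush_code_range(void *start, size_t len)
	{
		__builtin___clear_cache((char *)start, (char *)start + len);
	}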

> 
>> The JIT reclaim usage scheme I envision is:
>> 
>> - userspace unpublishes all references to old code,
>> - userspace ensures no thread uses the old code anymore,
>> - sys_membarrier
>>   - for each executing thread
>>     - issue core serializing barrier
>> - userspace uses a separate system call to issue a data cache flush for
>>   the modified range
>> - sys_membarrier
>>   - for each executing thread
>>     - issue instruction cache flush
>> 
>> So my current thinking is that we may need to change the membarrier
>> system call so one command serializes the core, and a separate command
>> issues the cache flush.
> 
> Yeah, and the sequence is slightly different I think, as we need the
> pipeline flush to come *after* the I-cache invalidation (otherwise the
> stale instructions can just be refetched).

In my scenario, notice that userspace first unpublishes all refs to old
code, and does its own waiting for any thread still seeing the old
code (e.g. by using RCU). However, URCU currently only has full barriers,
not core serializing barriers.
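
To make that unpublish-and-wait step concrete, here is roughly what I
have in mind with liburcu (writer side only; struct jit_code, the
current_code pointer and reclaim() are made-up names, and readers would
of course access the pointer under rcu_read_lock()/rcu_dereference()):

	#include <urcu.h>

	struct jit_code *current_code;	/* published pointer */

	static void retire_old_code(struct jit_code *new_code)
	{
		struct jit_code *old = current_code;

		/* unpublish: no new reader can pick up the old code */
		rcu_assign_pointer(current_code, new_code);

		/* wait until every pre-existing reader is done with it;
		 * this only implies full memory barriers, not core
		 * serialization, hence the discussion below. */
		synchronize_rcu();

		reclaim(old);	/* old code range can now be re-used */
	}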

This means that when sys_membarrier is invoked, no thread can branch
into that old code anymore. What we actually want there is to synchronize
the core and flush the icache before we eventually publish a reference
to the new code.

What I wonder is whether the mere fact that some cores may still hold the
old, unsynchronized state prevents us from safely overwriting the old
code at that point, even though none of them can execute it going forward,
or whether we need the sync core on every core before we even overwrite
the old code (conservative approach).

Assuming we don't need a sync core before updating the old code, an
aggressive approach would be:

reclaim and re-use (aggressive):

1- userspace unpublishes all references to old code,
2- userspace ensures no thread uses the old code anymore (e.g. URCU),
3- userspace updates old code -> new code
4- issue data cache flush for the modified range (if needed)
5- sys_membarrier
   - for each executing thread
      - issue core serializing barrier
6- issue instruction cache flush for the modified range (if needed)
   (may be required on all active threads on some architectures)
7- userspace publishes reference to new code

However, if we do need a sync core before updating the old code,
the conservative approach looks like this (a code sketch follows the list):

reclaim and re-use (conservative):

1- userspace unpublishes all references to old code,
2- userspace ensures no thread uses the old code anymore (e.g. URCU),
3- sys_membarrier
   - for each executing thread
      - issue core serializing barrier
4- userspace updates old code -> new code
5- issue data cache flush for the modified range (if needed)
6- issue instruction cache flush for the modified range (if needed)
   (may be required on all active threads on some architectures)
7- userspace publishes reference to new code
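
In code, the conservative sequence could look roughly like this. The
MEMBARRIER_* register/sync-core names are taken from this RFC and may
still change, the jit_* wrappers and the assumption that the expedited
command also takes the SYNC_CORE flag are mine, and the aggressive
variant would simply move the membarrier call after the memcpy:

	#include <linux/membarrier.h>	/* headers with this RFC applied */
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <string.h>

	static int membarrier(int cmd, int flags)
	{
		return syscall(__NR_membarrier, cmd, flags);
	}

	/* Done once at process start, so later commands never block. */
	static void jit_membarrier_init(void)
	{
		membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED,
			   MEMBARRIER_FLAG_SYNC_CORE);
	}

	/* Steps 3-6 of the conservative reclaim/re-use sequence; steps
	 * 1-2 (unpublish + URCU wait) and step 7 (publish) are done by
	 * the caller. */
	static void jit_reuse_code(void *buf, const void *new_code, size_t len)
	{
		/* 3: core serializing barrier on all running threads */
		membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED,
			   MEMBARRIER_FLAG_SYNC_CORE);

		/* 4: overwrite old code with new code */
		memcpy(buf, new_code, len);

		/* 5-6: d-cache flush + i-cache invalidate for the range
		 * (cache maintenance on arm/arm64, no-op on x86) */
		__builtin___clear_cache((char *)buf, (char *)buf + len);
	}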

Knowing whether the more "aggressive" approach above is correct should
allow us to find out whether we need to add a sync_core() in sched_out
as well.

> 
> If you're at LPC in a week's time, this might be a good thing to sit down
> and bash our heads against (espec. if we can grab PPC and x86 folks too).

Yes, I'll be there! I could even suggest a microconf topic about this. It
could fit in Paul's Linux-Kernel Memory Model Workshop track [1], the
Wildcard track [2], or the Hallway track if we can get everyone together.

Thoughts ?

Thanks,

Mathieu

[1] http://www.linuxplumbersconf.org/2017/ocw/events/LPC2017/tracks/632
[2] http://www.linuxplumbersconf.org/2017/ocw/events/LPC2017/tracks/629

> 
> Will

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
