Message-ID: <CACT4Y+aJomhGe5qbffVCj43UroinYDGcQFKeAd7RMNv=cVX=cw@mail.gmail.com>
Date: Sun, 14 Sep 2014 11:35:25 -0700
From: Dmitry Vyukov <dvyukov@...gle.com>
To: Andi Kleen <ak@...ux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@...il.com>, x86@...nel.org,
	LKML <linux-kernel@...r.kernel.org>, Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>, "H. Peter Anvin" <hpa@...or.com>,
	Paul Turner <pjt@...gle.com>, Andrew Hunter <ahh@...gle.com>
Subject: Re: [PATCH RFC] x86_64: per-cpu memory for user-space

On Sun, Sep 14, 2014 at 7:06 AM, Andi Kleen <ak@...ux.intel.com> wrote:
> On Sat, Sep 13, 2014 at 06:35:34PM +0400, Konstantin Khlebnikov wrote:
>> This patch implements user-space per-cpu memory in the same manner as in
>> kernel-space: each cpu has its own %gs base address. On x86_64 %fs is used
>> for thread local storage, %gs usually is free.
>>
>> User-space application cannot prevent preemption but x86 read-modify-write
>> operations are atomic against interrupts and context switches. Thus percpu
>> counters, ring-buffer cursors, per-cpu locks and other cool things might
>> be implemented in a very efficient way.
>
> Do you have some concrete examples for the more complex operations?

Hi Andi,

I've shown how to implement a per-cpu freelist in a previous email.
Here is how you can do a per-cpu logging system: multiple threads
enqueue messages, a single thread polls all per-cpu queues (not tested
either):

struct elem_t {
    uint32 lap;
    T data;
};

struct percpu_queue_t {
    uint64 enqueue_pos_cpu;
    uint32 dequeue_pos;
    elem_t buf[N];
};

bool percpu_queue_enqueue(T *data)
{
    percpu_queue_t *q;
    uint32 enqueue_pos;
    elem_t *e;
    uint64 old;

    for (;;) {
        q = (percpu_queue_t*)&GS[OFF];
        old = atomic_load(&q->enqueue_pos_cpu, memory_order_relaxed);
        enqueue_pos = (uint32)(old >> 32);
        if (enqueue_pos - atomic32_load(&q->dequeue_pos, memory_order_acquire) >= N)
            return false; // full
        // enqueue_pos sits in the high half of enqueue_pos_cpu, the cpu
        // number in the low half; the CAS fails if another thread raced
        // with us or if we migrated to a cpu with a different %gs base.
        if (!CMPXCHG16B(&GS[OFF], old, old + ((uint64)1 << 32))) // w/o LOCK prefix
            continue;
        e = &q->buf[enqueue_pos%N];
        e->data = *data;
        atomic_store(&e->lap, enqueue_pos/N + 1, memory_order_release);
        return true;
    }
}

bool percpu_queue_dequeue(percpu_queue_t *q, T *data)
{
    elem_t *e;
    uint32 dequeue_pos;

    dequeue_pos = q->dequeue_pos;
    e = &q->buf[dequeue_pos%N];
    if (atomic_load(&e->lap, memory_order_acquire) != dequeue_pos/N + 1)
        return false; // empty
    *data = e->data;
    atomic_store(&q->dequeue_pos, dequeue_pos+1, memory_order_release);
    return true;
}

This is a simple version. It can be modified to do variable-size
messages, batch updates of dequeue_pos to reduce write sharing, or
split both enqueue and dequeue into prepare and commit phases to do
zero-copy communication.

So, generally, if you can reduce the per-cpu state modification to a
single CAS, you pair the data with the cpu number and include it in
the CAS. This allows you to do a CAS loop on per-cpu state w/o the
LOCK prefix.

And in the freelist code, the line:

if (CMPXCHG16B(fl, old, new))

needs to be replaced with (we want to ensure that we are still on the
same cpu):

if (CMPXCHG16B(&GS[OFF], old, new))
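To make that pattern concrete, here is a minimal sketch of a per-cpu
counter in the same pseudocode notation (GS[OFF], CMPXCHG16B) as the
queue above; it assumes each cpu's slot was pre-initialized with its
own cpu number in the low half, and it is untested as well:

void percpu_counter_inc(void)
{
    uint64 old;

    for (;;) {
        // Re-read through %gs on every attempt: a migration changes
        // the %gs base, so &GS[OFF] then points into another cpu's slot.
        old = atomic_load((uint64*)&GS[OFF], memory_order_relaxed);
        // Counter in the high half, owning cpu number in the low half.
        // If we migrated between the load and the CAS, the cpu number
        // behind %gs no longer matches 'old', the CAS fails, we retry.
        if (CMPXCHG16B(&GS[OFF], old, old + ((uint64)1 << 32))) // w/o LOCK prefix
            return;
    }
}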
For more complex data structures (e.g. a doubly-linked list), I
*suspect* that it's possible to do something like a transaction
journalling approach. Namely, there is a single per-cpu slot for the
current transaction. If it's empty, then a thread installs its own
transaction descriptor with a CAS w/o LOCK prefix, executes it and
uninstalls the trx descriptor. If it's not empty, then a thread first
helps to execute the pending transaction and uninstalls the trx
descriptor. All mutations in transactions are done with CAS w/o LOCK
prefix (both when executed by the issuing thread and during helping).
Care must be taken to ensure that threads do not operate on stale data
or on the wrong cpu; probably all data slots must be accompanied by an
ABA counter and/or cpu number. I've done this in the context of robust
shared memory data structures. I have not worked out all the details
for per-cpu data structures, but I think it can work.

> It seems to me the limitation to a simple instruction will be very limiting
> for anything more complicated than a counter. Also it's not even
> clear how someone would implement retry (short of something like kuchannel)

What is retry/kuchannel?

> Of course it wouldn't be a problem with TSX transactions, but it's not
> clear they need it.

As far as I know (I have not had a chance to try it), the commit of a
TSX transaction includes a trailing store-load style fence. And the
whole point of this change is to eliminate the cost of the store-load
fence. Otherwise, you just query any approximation of the current cpu
and do either TSX or a LOCK-prefixed RMW on that cpu's state.
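For comparison, here is a rough sketch of that TSX-or-LOCK alternative
using the RTM intrinsics from <immintrin.h>, assuming an RTM-capable
cpu (compile with -mrtm); sched_getcpu() stands in for "any
approximation of the current cpu", and the percpu_counter array is
purely illustrative (real code would pad each slot to a cache line to
avoid false sharing):

#define _GNU_SOURCE
#include <immintrin.h>  /* _xbegin, _xend, _XBEGIN_STARTED */
#include <sched.h>      /* sched_getcpu */

#define MAX_CPUS 256
static long percpu_counter[MAX_CPUS]; /* illustrative per-cpu state */

void approx_percpu_inc(void)
{
    int cpu = sched_getcpu(); /* may already be stale: just a hint */

    if (_xbegin() == _XBEGIN_STARTED) {
        percpu_counter[cpu]++; /* plain RMW inside the transaction */
        _xend();
    } else {
        /* Transaction aborted: fall back to a LOCK-prefixed RMW,
         * which stays correct even if we migrated off 'cpu'. */
        __sync_fetch_and_add(&percpu_counter[cpu], 1);
    }
}

Even on the fast path, though, the commit in _xend() would, per the
above, still carry the store-load fence cost that the %gs-based scheme
is meant to avoid.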
> The other problem with the approach is, how would cpu hotplug
> be handled?
>
>> By the way, newer Intel cpus have even faster instructions for
>> changing %fs/%gs, but they are still not supported by the kernel.
>
> Patch kits are pending.
>
> -Andi