Message-ID: <77543974-cd67-3999-103e-6714d04f0e5e@efficios.com>
Date: Fri, 23 Sep 2022 09:46:15 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Chris Kennelly <ckennelly@...gle.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Paul Turner <pjt@...gle.com>, Peter Oskolkov <posk@...k.io>,
linux-kernel <linux-kernel@...r.kernel.org>,
"carlos@...hat.com" <carlos@...hat.com>,
Florian Weimer <fw@...eb.enyo.de>,
"linux-api@...r.kernel.org" <linux-api@...r.kernel.org>
Subject: Re: [PATCH v4 00/25] RSEQ node id and virtual cpu id extensions
On 2022-09-22 16:10, Chris Kennelly wrote:
> Hi,
>
> I still need to update the code in TCMalloc to cooperate with the new
> glibc ABI/convention. One concern I have is that it looks like I might
> need to add an extra memory dereference (or two) to get the early
> initialized offsets provided by glibc folded into the read of the cpu_id
> field.
If you have a concrete example of this, I'd be happy to help and perhaps
we can improve your usage pattern.
>
> I think I can avoid this by using %gs to point to the address of the
> cpu_id field itself (which I think could be used to select between vCPUs
> or not*), but %gs is a global piece of state that all of the libraries
> in the program need to cooperate on.
I think what we are all looking for here is a scheme that would allow us
the fastest per-vcpu data structure accesses possible from userspace.

I think we could do something similar to what is done in the Linux
kernel for that, but in userspace. Here are some random ideas I have on
this topic:

We could introduce a new prctl(2) PT_{SET,GET}_GS_MODE on x86-64. This
would take as arguments the indexing mode and offset multiplier we want
to be applied to the GS segment selector on return to userspace:
    enum gs_index_mode {
            GS_INDEX_MODE_MM_VCPU,
    };

    struct prctl_set_gs_mode {
            enum gs_index_mode index_mode;
            u64 stride;
    };
For a memory space which has this gs mode set, the return to userspace
code would populate the GS segment selector register with:
    stride * current->mm_vcpu_id
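
To make the arithmetic concrete, here is a minimal userspace sketch of the
value the kernel would compute under this proposal. Nothing here is an
existing ABI: the enum and struct mirror the definitions above (with
uint64_t standing in for the kernel's u64), and gs_value_for_vcpu() is a
hypothetical helper name used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the proposed prctl argument; hypothetical, not an existing ABI. */
enum gs_index_mode {
	GS_INDEX_MODE_MM_VCPU,
};

struct prctl_set_gs_mode {
	enum gs_index_mode index_mode;
	uint64_t stride;	/* bytes of address space reserved per vcpu */
};

/*
 * Hypothetical helper: the value the kernel would load on return to
 * userspace under the proposal, i.e. stride * current->mm_vcpu_id.
 */
static uint64_t gs_value_for_vcpu(const struct prctl_set_gs_mode *mode,
				  uint32_t mm_vcpu_id)
{
	/* Only one indexing mode is defined in the proposal. */
	assert(mode->index_mode == GS_INDEX_MODE_MM_VCPU);
	return mode->stride * (uint64_t)mm_vcpu_id;
}
```

With a 4096-byte stride, vcpu 0 maps to offset 0 and vcpu 3 to offset
12288, so each vcpu gets a disjoint slice of the per-vcpu area.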
The "stride" would be the virtual address space size allowed for
per-vcpu data. This could be decided by the libc, with a tunable
to increase or decrease this size. Another libc tunable could
disable populating the GS segment selector altogether (e.g. for
compatibility with applications like Wine which AFAIK use it).
With this in place, I hope we could then do per-vcpu data access by
simply prefixing memory access instructions with a %gs: segment
selector prefix.
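
As a rough illustration of what such an access could look like, the sketch
below emulates the proposal from userspace. It is not the proposed
interface: since the new prctl does not exist, it uses the existing
arch_prctl(ARCH_SET_GS) call on x86-64 Linux as a stand-in for the kernel
setting GS on return to userspace, and falls back to plain pointer
arithmetic elsewhere or if that call fails. percpu_read_u64() and its
parameters are hypothetical names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__linux__)
#include <asm/prctl.h>		/* ARCH_SET_GS */
#include <sys/syscall.h>	/* SYS_arch_prctl */
#include <unistd.h>		/* syscall() */
#endif

/*
 * Hypothetical helper: read the copy of a 64-bit per-vcpu variable that
 * belongs to vcpu_id, given the address of the vcpu 0 copy and the stride.
 */
static uint64_t percpu_read_u64(const uint64_t *vcpu0_var, uint64_t stride,
				uint32_t vcpu_id)
{
	uint64_t offset = stride * (uint64_t)vcpu_id;
	uint64_t val;

	/* Keep every copy naturally aligned. */
	assert(stride % sizeof(uint64_t) == 0);
#if defined(__x86_64__) && defined(__linux__)
	/* Stand-in for the kernel: set the GS base to stride * vcpu_id. */
	if (syscall(SYS_arch_prctl, ARCH_SET_GS, (unsigned long)offset) == 0) {
		/* A single %gs-prefixed load reaches this vcpu's copy. */
		__asm__ volatile ("movq %%gs:(%1), %0"
				  : "=r" (val) : "r" (vcpu0_var) : "memory");
		/* Restore the default GS base of 0. */
		syscall(SYS_arch_prctl, ARCH_SET_GS, 0UL);
		return val;
	}
#endif
	/* Portable fallback: the same address computed by hand. */
	memcpy(&val, (const unsigned char *)vcpu0_var + offset, sizeof(val));
	return val;
}
```

In the real scheme the two system calls would disappear: the kernel would
keep GS up to date, and only the single prefixed load would remain on the
fast path.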
Thoughts?
Thanks,
Mathieu
>
> Thanks,
> Chris
>
> * TCMalloc is already paying a load+pointer arithmetic to select between
> cpu_id versus vcpu_id, so this would actually make things a little bit
> faster.
>
> On Thu, Sep 22, 2022 at 3:21 PM Mathieu Desnoyers
> <mathieu.desnoyers@...icios.com>
> wrote:
>
> Hi Chris,
>
> Sorry it looks like I forgot to CC you on this series. If you can give
> it a spin with tcmalloc I would be very much interested in the result.
>
> Thanks,
>
> Mathieu
>
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com