Message-ID: <500891137.95782.1659452479846.JavaMail.zimbra@efficios.com>
Date: Tue, 2 Aug 2022 11:01:19 -0400 (EDT)
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Peter Oskolkov <posk@...k.io>
Cc: Peter Zijlstra <peterz@...radead.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Thomas Gleixner <tglx@...utronix.de>,
"Paul E . McKenney" <paulmck@...nel.org>,
Boqun Feng <boqun.feng@...il.com>,
"H. Peter Anvin" <hpa@...or.com>, Paul Turner <pjt@...gle.com>,
linux-api <linux-api@...r.kernel.org>,
Christian Brauner <christian.brauner@...ntu.com>,
Florian Weimer <fw@...eb.enyo.de>,
David Laight <David.Laight@...lab.com>,
carlos <carlos@...hat.com>,
Chris Kennelly <ckennelly@...gle.com>,
Peter Oskolkov <posk@...gle.com>
Subject: Re: [PATCH v3 00/23] RSEQ node id and virtual cpu id extensions
----- On Aug 1, 2022, at 1:07 PM, Peter Oskolkov posk@...k.io wrote:
> On Fri, Jul 29, 2022 at 12:02 PM Mathieu Desnoyers
> <mathieu.desnoyers@...icios.com> wrote:
>>
>> Extend the rseq ABI to expose a NUMA node ID and a vm_vcpu_id field.
>
> Thanks a lot, Mathieu - it is really exciting to see this happening!
>
> I'll share our experiences here, with the hope that it may be useful.
> I've also cc-ed Chris Kennelly, who worked on the userspace/tcmalloc side,
> as he can provide more context/details if I miss or misrepresent something.

Thanks for sharing your experiences at Google. This helps put things in
perspective.
>
> The problem:
>
> tcmalloc maintains per-cpu freelists in the userspace to make userspace
> memory allocations fast and efficient; it relies on rseq to do so, as
> any manipulation of the freelists has to be protected vs thread migrations.
>
> However, as a typical userspace process at a Google datacenter is confined to
> a relatively small number of CPUs (8-16) via cgroups, while the servers
> typically have a much larger number of physical CPUs, the per-cpu freelist
> model is somewhat wasteful: if a process has only at most 10 threads running,
> for example, but these threads can "wander" across 100 CPUs over the lifetime
> of the process, keeping 100 freelists instead of 10 noticeably wastes memory.
>
> Note that although a typical process at Google has a limited CPU quota,
> thus using only a small number of CPUs at any given time, the process may
> often have many hundreds or thousands of threads, so per-thread freelists
> are not a viable solution to the problem just described.
>
> Our current solution:
>
> As you outlined in patch 9, tracking the number of currently running threads per
> address space and exposing this information via a vcpu_id abstraction helps
> tcmalloc to noticeably reduce its freelist overhead in the "narrow process
> running on a wide server" situation, which is typical at Google.
>
> We have experimented with several approaches here. The one that we are
> currently using is the "flat" model: we allocate vcpu IDs ignoring numa nodes.
>
> We did try per-numa-node vcpus, but it did not show any material improvement
> over the "flat" model, perhaps because on our most "wide" servers the CPU
> topology is multi-level. Chris Kennelly may provide more details here.

I would really like to know more about Google's per-numa-node vcpus implementation.
I suspect you guys may have taken a different turn somewhere in the design which
led to these results. But having not seen that implementation, I can only guess.

I notice the following Google-specific prototype extension in tcmalloc:

  // This is a prototype extension to the rseq() syscall. Since a process may
  // run on only a few cores at a time, we can use a dense set of "v(irtual)
  // cpus." This can reduce cache requirements, as we only need N caches for
  // the cores we actually run on simultaneously, rather than a cache for every
  // physical core.
  union {
    struct {
      short numa_node_id;
      short vcpu_id;
    };
    int vcpu_flat;
  };

Can you tell me more about the way the numa_node_id and vcpu_id are allocated
internally, and how they are expected to be used by userspace?
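
To make the layout question concrete, here is a minimal sketch of what the
quoted union implies: two 16-bit halves overlaying one 32-bit word. How the
fields are really allocated and consumed is exactly what is being asked above,
so nothing here describes the actual tcmalloc or kernel behaviour. The names
vcpu_field and vcpu_field_read are hypothetical, fixed-width types stand in
for short/int, and the single-load read is only one plausible reason for
having a flat view.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the quoted prototype field. Fixed-width types are
 * used so the layout does not depend on the platform's short/int sizes. */
union vcpu_field {
	struct {
		int16_t numa_node_id;
		int16_t vcpu_id;
	};
	int32_t vcpu_flat;	/* both halves viewed as one 32-bit word */
};

/*
 * Snapshot both halves with a single 32-bit load, then split them.
 * Assumption: a reader wants a node/vcpu pair that belong together even if
 * the word is updated concurrently. __atomic_load_n is a GCC/Clang builtin.
 */
static inline void vcpu_field_read(const union vcpu_field *f,
				   int16_t *node, int16_t *vcpu)
{
	union vcpu_field snap;

	snap.vcpu_flat = __atomic_load_n(&f->vcpu_flat, __ATOMIC_RELAXED);
	*node = snap.numa_node_id;
	*vcpu = snap.vcpu_id;
}
```

The numeric value of vcpu_flat depends on byte order, so a sketch like this
only treats it as an opaque snapshot and never compares it against a constant.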
>
> On a more technical note, we do use atomic operations extensively in the
> kernel to make sure vcpu IDs are "tightly packed", i.e. if only N threads
> of a process are currently running on physical CPUs, vcpu IDs will be in
> the range [0, N-1], i.e. no gaps, no going to N and above; this does consume
> some extra CPU cycles, but the RAM savings we gain far outweigh the extra CPU
> cost; it will be interesting to see what you can do with the optimizations
> you propose in this patchset.
The optimizations I propose keep those "tightly packed" characteristics, but skip
the atomic operations in common scenarios. I'll welcome benchmarks of the added
overhead in representative workloads.
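
For readers unfamiliar with the idea, the atomic-heavy baseline being discussed
can be sketched in userspace as a lowest-free-bit allocator over a bitmap.
This is an illustration only: it is neither Google's kernel implementation nor
the scheme in this patchset, and the names vcpu_mask, vcpu_id_get, vcpu_id_put
and the 64-ID limit are all made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define VCPU_MAX 64	/* illustrative limit: one 64-bit word of IDs */

/* Bit i set means ID i is currently held by a running thread. */
struct vcpu_mask {
	_Atomic uint64_t bits;
};

/* Claim the lowest free ID, or return -1 if all VCPU_MAX IDs are taken. */
static int vcpu_id_get(struct vcpu_mask *m)
{
	uint64_t old = atomic_load_explicit(&m->bits, memory_order_relaxed);

	for (;;) {
		if (old == UINT64_MAX)
			return -1;
		/* Lowest clear bit; ~old is non-zero here (GCC/Clang builtin). */
		int id = __builtin_ctzll(~old);
		uint64_t want = old | (UINT64_C(1) << id);

		/* On failure 'old' is refreshed with the current value; retry. */
		if (atomic_compare_exchange_weak_explicit(&m->bits, &old, want,
							  memory_order_acquire,
							  memory_order_relaxed))
			return id;
	}
}

/* Release an ID previously returned by vcpu_id_get(). */
static void vcpu_id_put(struct vcpu_mask *m, int id)
{
	atomic_fetch_and_explicit(&m->bits, ~(UINT64_C(1) << id),
				  memory_order_release);
}
```

Starting from an empty mask, three calls to vcpu_id_get() return 0, 1 and 2;
after vcpu_id_put(&m, 1) the next call returns 1 again. Lowest-free-first
keeps IDs below the peak number of concurrent holders, but it does not compact
IDs that are already handed out, so a strict [0, N-1] range relies on IDs being
re-acquired each time a thread is scheduled in. Every get and put is an atomic
read-modify-write on a shared word, which is the cost the reply above aims to
avoid in common scenarios.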
> Again, thanks a lot for this effort!
Thanks for your input. It really helps steer the effort in the right direction.
Mathieu
>
> Peter
>
> [...]
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com