Message-ID: <580eec2b-f204-2eb1-806d-8282b8b60bf2@efficios.com>
Date: Tue, 8 Nov 2022 15:07:42 -0500
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, Thomas Gleixner <tglx@...utronix.de>,
"Paul E . McKenney" <paulmck@...nel.org>,
Boqun Feng <boqun.feng@...il.com>,
"H . Peter Anvin" <hpa@...or.com>, Paul Turner <pjt@...gle.com>,
linux-api@...r.kernel.org, Christian Brauner <brauner@...nel.org>,
Florian Weimer <fw@...eb.enyo.de>, David.Laight@...lab.com,
carlos@...hat.com, Peter Oskolkov <posk@...k.io>,
Alexander Mikhalitsyn <alexander@...alicyn.com>,
Chris Kennelly <ckennelly@...gle.com>
Subject: Re: [PATCH v5 08/24] sched: Introduce per memory space current virtual cpu id
On 2022-11-08 08:04, Peter Zijlstra wrote:
> On Thu, Nov 03, 2022 at 04:03:43PM -0400, Mathieu Desnoyers wrote:
>
>> The credit goes to Paul Turner (Google) for the vcpu_id idea. This
>> feature is implemented based on the discussions with Paul Turner and
>> Peter Oskolkov (Google), but I took the liberty to implement scheduler
>> fast-path optimizations and my own NUMA-awareness scheme. The rumor has
>> it that Google have been running a rseq vcpu_id extension internally at
>> Google in production for a year. The tcmalloc source code indeed has
>> comments hinting at a vcpu_id prototype extension to the rseq system
>> call [1].
>
> Re NUMA thing -- that means that on a 512 node system a single threaded
> task can still observe 512 separate vcpu-ids, right?
Yes, that's correct.
>
> Also, said space won't be dense.
Indeed, this can be inefficient if the data structure within the
single-threaded task is not NUMA-aware *and* that task is free to bounce
all over the 512 numa nodes.
>
> The main selling point of the whole vcpu-id scheme was that the id space
> is dense and not larger than min(nr_cpus, nr_threads), which then gives
> useful properties.
>
> But I'm not at all seeing how the NUMA thing preserves that.
If a userspace per-vcpu data structure is implemented with NUMA-local
allocations, then it becomes really interesting to guarantee that the
per-vcpu-id accesses are always numa-local for performance reasons.
If a userspace per-vcpu data structure is not numa-aware, then we have
two scenarios:
A) The cpuset/sched affinity under which it runs pins it to a set of
cores belonging to a specific NUMA node. In this case, even with
numa-aware vcpu id allocation, the ids will stay as close to 0 as they
would with non-numa-aware allocation.
B) No specific cpuset/sched affinity set, which means the task is free
to bounce all over. In this case I agree that having the indexing
numa-aware, but the per-vcpu data structure not numa-aware, is inefficient.
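To make the two scenarios concrete, here is a minimal toy model in C. The
allocation rule it uses is an assumption made purely for illustration and
is not taken from the patch: an id becomes owned by the first NUMA node
that uses it, and a thread running on a node receives the lowest free id
that is either unowned or owned by that node. All names (vcpu_model,
model_get_numa, MODEL_NR_IDS, ...) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_IDS	16	/* arbitrary size of the toy id space */
#define MODEL_NO_NODE	(-1)

/* Toy model only: not the kernel's data structures. */
struct vcpu_model {
	int owner_node[MODEL_NR_IDS];	/* node that first used the id */
	bool in_use[MODEL_NR_IDS];
};

static void model_init(struct vcpu_model *m)
{
	for (int id = 0; id < MODEL_NR_IDS; id++) {
		m->owner_node[id] = MODEL_NO_NODE;
		m->in_use[id] = false;
	}
}

/* Flat allocation: lowest id not currently in use. */
static int model_get_flat(struct vcpu_model *m)
{
	for (int id = 0; id < MODEL_NR_IDS; id++) {
		if (!m->in_use[id]) {
			m->in_use[id] = true;
			return id;
		}
	}
	return -1;
}

/*
 * Assumed numa-aware rule: lowest free id that is unowned or already
 * owned by @node. The id stays owned by @node after it is released.
 */
static int model_get_numa(struct vcpu_model *m, int node)
{
	for (int id = 0; id < MODEL_NR_IDS; id++) {
		if (m->in_use[id])
			continue;
		if (m->owner_node[id] != MODEL_NO_NODE &&
		    m->owner_node[id] != node)
			continue;
		m->owner_node[id] = node;
		m->in_use[id] = true;
		return id;
	}
	return -1;
}

static void model_put(struct vcpu_model *m, int id)
{
	m->in_use[id] = false;
}

/*
 * Count the distinct ids observed by a single thread that runs once on
 * each of nodes 0..nr_nodes-1, releasing its id before each migration.
 */
static int model_distinct_ids_single_thread(int nr_nodes, bool numa_aware)
{
	struct vcpu_model m;
	bool seen[MODEL_NR_IDS] = { false };
	int distinct = 0;

	assert(nr_nodes <= MODEL_NR_IDS);
	model_init(&m);
	for (int node = 0; node < nr_nodes; node++) {
		int id = numa_aware ? model_get_numa(&m, node)
				    : model_get_flat(&m);

		assert(id >= 0);
		if (!seen[id]) {
			seen[id] = true;
			distinct++;
		}
		model_put(&m, id);
	}
	return distinct;
}
```

Under this assumed rule, a thread pinned to one node (scenario A) keeps
getting id 0, whereas a single thread bouncing across N nodes (scenario B)
observes N distinct ids with the numa-aware rule and only one with the
flat rule.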
I wonder whether a 512-node system running containers that use few
cores, but without cpusets/sched affinity pinning the workload to
specific numa nodes, is a workload we should optimize for. It looks
like the lack of numa locality caused by not restricting the allowed
cores is a userspace configuration issue.
We also must keep in mind that we can expect a single task to load a mix
of executable/shared libraries where some pieces may be numa-aware, and
others may not. This means we should ideally support a numa-aware
vcpu-id allocation scheme and non-numa-aware vcpu-id allocation scheme
within the same task.
This could be achieved by exposing two struct rseq fields rather than
one, e.g.:
vm_vcpu_id -> flat indexing, not numa-aware.
vm_numa_vcpu_id -> numa-aware vcpu id indexing.
This would allow data structures that are inherently numa-aware to
benefit from numa-locality, without hurting non-numa-aware data structures.
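As a rough sketch of how userspace could consume the two fields: the
field names vm_vcpu_id and vm_numa_vcpu_id come from the proposal above,
but everything else here (struct rseq_sketch, the counter types, the
SKETCH_MAX_VCPUS bound) is hypothetical and is not the real struct rseq
layout. The sketch also ignores the rseq critical section that a real
per-vcpu update would need in order to be safe against preemption and
migration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_VCPUS 8	/* arbitrary bound for the sketch */

/* Hypothetical stand-in for the two proposed struct rseq fields. */
struct rseq_sketch {
	uint32_t vm_vcpu_id;		/* flat indexing, not numa-aware */
	uint32_t vm_numa_vcpu_id;	/* numa-aware vcpu id indexing */
};

/* Non-numa-aware structure: one dense array, indexed by the flat id. */
struct flat_counters {
	uint64_t count[SKETCH_MAX_VCPUS];
};

/*
 * Numa-aware structure: each slot points to memory that the library is
 * assumed to have allocated on the node associated with that id.
 */
struct numa_counters {
	uint64_t *slot[SKETCH_MAX_VCPUS];
};

static inline void flat_inc(struct flat_counters *c,
			    const struct rseq_sketch *rs)
{
	assert(rs->vm_vcpu_id < SKETCH_MAX_VCPUS);
	c->count[rs->vm_vcpu_id]++;
}

static inline void numa_inc(struct numa_counters *c,
			    const struct rseq_sketch *rs)
{
	assert(rs->vm_numa_vcpu_id < SKETCH_MAX_VCPUS);
	assert(c->slot[rs->vm_numa_vcpu_id] != NULL);
	(*c->slot[rs->vm_numa_vcpu_id])++;
}
```

With both fields available, a non-numa-aware library keeps a small dense
index space through vm_vcpu_id, while a numa-aware library in the same
task indexes node-local memory through vm_numa_vcpu_id.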
>
> Also; given the utter mind-bendiness of the NUMA thing; should it go
> into its own patch; introduce the regular plain old vcpu first, and
> then add things to it -- that also allows pushing those weird cpumask
> ops you've created later into the series.
Good idea. I can do that once we agree on the way forward for flat vs
numa-aware vcpu-id rseq fields.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com