[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <46d08f8e-bd68-44a3-9b33-ba029c7e2a10@efficios.com>
Date: Wed, 4 Sep 2024 11:50:45 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Yury Norov <yury.norov@...il.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
linux-kernel@...r.kernel.org, Valentin Schneider <vschneid@...hat.com>,
Mel Gorman <mgorman@...e.de>, Steven Rostedt <rostedt@...dmis.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall
<bsegall@...gle.com>, Rasmus Villemoes <linux@...musvillemoes.dk>,
Dmitry Vyukov <dvyukov@...gle.com>, Marco Elver <elver@...gle.com>
Subject: Re: [RFC PATCH 2/2] sched: Improve cache locality of RSEQ concurrency
IDs for intermittent workloads
On 2024-09-04 11:24, Yury Norov wrote:
> On Tue, Sep 03, 2024 at 07:22:37PM -0400, Mathieu Desnoyers wrote:
>> On 2024-09-03 15:59, Yury Norov wrote:
>>> On Tue, Sep 03, 2024 at 03:06:50PM -0400, Mathieu Desnoyers wrote:
>> [...]
>>>> +
>>>> +static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
>>>> +{
>>>> + struct cpumask *mm_allowed = mm_cpus_allowed(mm);
>>>> + int cpu, nr_set = 0;
>>>> +
>>>> + if (!mm)
>>>> + return;
>>>> + /* The mm_cpus_allowed is the union of each thread allowed CPUs masks. */
>>>> + for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next_andnot(cpu, cpumask, mm_allowed)) {
>>>> + if (!cpumask_test_and_set_cpu(cpu, mm_allowed))
>>>> + nr_set++;
>>>> + }
>>>
>>> You can do the same nicer:
>>>
>>> for_each_cpu(cpu, cpumask)
>>> nr_set += !cpumask_test_and_set_cpu(cpu, mm_allowed);
>>>
>>> This should be faster and a bit simpler, to me.
>>
>> In this scenario, I expect the following per-thread cpumask properties for a
>> given process (typically): those will be typically the same bits
>> set repeated over all threads belonging to a process. There are of
>> course scenarios where specific threads will override the mask, but
>> I don't expect this to be the most frequent case.
>>
>> So we typically have an operation which initially copies the initial
>> thread's allowed cpus mask to the mm allowed cpus mask, and then when
>> additional affinity changes are done, we want to augment the mm allowed
>> cpus masks with any additional cpu that may show up. But again, I expect
>> the initial thread to typically have the complete mask and other
>> operations won't typically change the mm allowed cpumask bits.
>>
>> I also expect the cpumask to be often quite dense (often all bits
>> are set).
>>
>> Now if we look at the operations for your proposal here:
>>
>> - for_each_cpu loads cpumask word-by-word and for each set bit, it
>> issues cpumask_test_and_set_cpu on mm_allowed, which is really a
>> test_and_set_bit, a fully ordered atomic operation, on each _bit_
>> set. That's O(nr_cpus) fully ordered atomic operations, and thus
>> expensive exclusive cache line accesses.
>
> Both versions are O(N).
Yes, those are both theoretically O(N), but the cost of
loading/comparing two words compared to the cost of an atomic
test-and-set of each bit within those words is far from being
the same.
>
>> My approach does:
>>
>> - The equivalent of a for_each_cpu_andnot (actually I should use
>> exactly that! I just noticed it exists in the API.), which loads
>
> Yes, you should.
>
>> both thread and mm CPUs allowed masks in parallel, word-by-word,
>> and only issues a cpumask_test_and_set_cpu for CPUs which are set
>> in the per-thread mask, but not in the mm mask. In the typical cases
>> discussed above, we pretty much never need to issue the atomic
>> test-and-set. So all we need to do for the common case is to read
>> both cpu masks in parallel, no stores/atomic ops needed.
>
> This all doesn't look like a hot path. And anyways, speculating around
> performance without numbers on hands sounds cheap.
This is done whenever userspace invokes sched_setaffinity, or changes
its cgroup cpuset. It may not be the most important fast-path in the
world, but I expect some workloads to issue sched_setaffinity whenever
they create a thread, so it's not a purely slow-path either.
> In my experience, iterators with a very lightweight payload are ~100
> times slower comparing to dedicated bitmap ops. Check this for example:
> 3cea8d4753277.
>
> If you're really cared about performance here, I'd suggest you to
> compare your iterators approach with something like this:
>
> cpumask_or(mm_allowed, mm_allowed, cpumask);
> atomic_set(&mm->nr_cpus_allowed, cpumask_weight(mm_allowed);
AFAIU this approach would not work, or then we'd need some kind of
locking to make sure we don't have two concurrent cpumask updates
which happen to do a sequence of atomic_set which move the
nr_cpus_allowed value towards a smaller value due to ordering.
Also, AFAIU cpumask_or is not safe against concurrent updates, so
we'd need locking for that as well.
I'm fine providing performance numbers, but we need to make sure
the alternative we compare against actually works.
A "dedicated" bitmap op that would do what I need would do the
following:
- For each bit set in bitmap A, which are not set in bitmap B,
atomically test-and-set those bits in bitmap B, and return the
number of bits that transitioned from 0 to 1 for weighting
purposes.
In my original attempts I tried using cpumask_weight after altering
the bitmap, but then noticed that it would not work without additional
locking.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Powered by blists - more mailing lists