Message-ID: <615f169b-3b24-4661-8a2c-185c6d80f7a4@efficios.com>
Date: Tue, 3 Sep 2024 19:22:37 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Yury Norov <yury.norov@...il.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
 linux-kernel@...r.kernel.org, Valentin Schneider <vschneid@...hat.com>,
 Mel Gorman <mgorman@...e.de>, Steven Rostedt <rostedt@...dmis.org>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall
 <bsegall@...gle.com>, Rasmus Villemoes <linux@...musvillemoes.dk>,
 Dmitry Vyukov <dvyukov@...gle.com>, Marco Elver <elver@...gle.com>
Subject: Re: [RFC PATCH 2/2] sched: Improve cache locality of RSEQ concurrency
 IDs for intermittent workloads

On 2024-09-03 15:59, Yury Norov wrote:
> On Tue, Sep 03, 2024 at 03:06:50PM -0400, Mathieu Desnoyers wrote:
[...]
>> +
>> +static inline void mm_set_cpus_allowed(struct mm_struct *mm, const struct cpumask *cpumask)
>> +{
>> +	struct cpumask *mm_allowed = mm_cpus_allowed(mm);
>> +	int cpu, nr_set = 0;
>> +
>> +	if (!mm)
>> +		return;
>> +	/* The mm_cpus_allowed is the union of each thread allowed CPUs masks. */
>> +	for (cpu = 0; cpu < nr_cpu_ids; cpu = cpumask_next_andnot(cpu, cpumask, mm_allowed)) {
>> +		if (!cpumask_test_and_set_cpu(cpu, mm_allowed))
>> +			nr_set++;
>> +	}
> 
> You can do the same nicer:
> 
>    for_each_cpu(cpu, cpumask)
>    	nr_set += !cpumask_test_and_set_cpu(cpu, mm_allowed);
> 
> This should be faster and a bit simpler, to me.

In this scenario, I expect the per-thread cpumasks of a given process
to typically be identical: the same bits set, repeated across all
threads belonging to the process. There are of course scenarios where
specific threads override their mask, but I don't expect that to be
the most frequent case.

So the typical sequence is: copy the initial thread's allowed CPUs
mask into the mm allowed CPUs mask, and then, when further affinity
changes are done, augment the mm allowed CPUs mask with any additional
CPU that shows up. But again, I expect the initial thread to typically
carry the complete mask, so later operations rarely change the mm
allowed cpumask bits.

I also expect the cpumask to often be quite dense (frequently all bits
are set).

Now if we look at the operations involved in your proposal here:

- for_each_cpu loads the cpumask word-by-word and, for each set bit,
   issues cpumask_test_and_set_cpu on mm_allowed, which is really a
   test_and_set_bit, i.e. a fully ordered atomic operation per _bit_.
   That's O(nr_cpus) fully ordered atomic operations, and thus
   expensive exclusive cache line accesses.

My approach does:

- The equivalent of a for_each_cpu_andnot (actually, I should use
   exactly that; I just noticed it exists in the API), which loads
   both the thread and mm CPUs allowed masks in parallel, word-by-word,
   and only issues a cpumask_test_and_set_cpu for CPUs which are set
   in the per-thread mask but not in the mm mask. In the typical cases
   discussed above, we pretty much never need to issue the atomic
   test-and-set. So all we need to do for the common case is to read
   both cpu masks in parallel, with no stores/atomic ops needed. A
   rough sketch of that variant follows below.
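
For illustration only, here is a rough, untested sketch of that
for_each_cpu_andnot() variant. It assumes the mm_cpus_allowed()
accessor from the quoted hunk, and an atomic_t counter named
nr_allowed_cpus in struct mm_struct; that field name and placement
are assumptions for the sketch, not taken from the patch:

/*
 * Sketch only, not the actual patch: augment the mm-wide allowed mask
 * with the CPUs from @cpumask that it does not contain yet.
 * mm->nr_allowed_cpus (atomic_t) is an assumed field name.
 */
static inline void mm_set_cpus_allowed(struct mm_struct *mm,
				       const struct cpumask *cpumask)
{
	struct cpumask *mm_allowed;
	int cpu, nr_set = 0;

	if (!mm)
		return;
	mm_allowed = mm_cpus_allowed(mm);
	/*
	 * Only walk the bits set in the per-thread mask but clear in
	 * the mm mask: in the common case (identical, dense masks) the
	 * loop body never runs, so no atomic ops are issued at all.
	 */
	for_each_cpu_andnot(cpu, cpumask, mm_allowed) {
		if (!cpumask_test_and_set_cpu(cpu, mm_allowed))
			nr_set++;
	}
	if (nr_set)
		atomic_add(nr_set, &mm->nr_allowed_cpus);
}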

> What concerns me is that you call atomic function in a loop, which makes
> the whole procedure non-atomic. If it's OK, can you put a comment why a
> series of atomic ops is OK here? If not - I believe external locking
> would be needed.

Good point. The mm CPUs allowed mask tracks the allowed set, and based
on the result of each test_and_set we accumulate the number of newly
set bits and atomically add the total to nr_allowed_cpus. The mm_cid
algorithms only care about the mm nr_allowed_cpus, so they never need
to look at the mm CPUs allowed mask itself; therefore there is no need
to provide any ordering guarantees across the two data structures.

If we had a cpumask_test_and_set_cpu_relaxed(), it would be sufficient
here.
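
Such a helper does not exist in the kernel today; purely as a sketch
of what "sufficient" would mean, it could be built on the relaxed
atomic_long fetch-or, mirroring the generic atomic bitops:

/*
 * Hypothetical helper, not an existing kernel API: test-and-set of a
 * cpumask bit with relaxed ordering, since only the bit value matters
 * here and no ordering against other data is required.
 */
static inline bool cpumask_test_and_set_cpu_relaxed(int cpu,
						    struct cpumask *mask)
{
	unsigned long *word = cpumask_bits(mask) + BIT_WORD(cpu);
	unsigned long bit = BIT_MASK(cpu);

	return atomic_long_fetch_or_relaxed(bit, (atomic_long_t *)word) & bit;
}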

As you say, I should explain this in a comment.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
