linux-kernel - Re: [PATCH v10 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <84fdb30e-387d-7bc1-e1f9-fa57cd9a32c9@redhat.com>
Date:   Tue, 1 Sep 2020 13:38:29 -0400
From:   Waiman Long <longman@...hat.com>
To:     Alex Kogan <alex.kogan@...cle.com>
Cc:     linux@...linux.org.uk, peterz@...radead.org, mingo@...hat.com,
        will.deacon@....com, arnd@...db.de, linux-arch@...r.kernel.org,
        linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
        tglx@...utronix.de, bp@...en8.de, hpa@...or.com, x86@...nel.org,
        guohanjun@...wei.com, jglauber@...vell.com,
        steven.sistare@...cle.com, daniel.m.jordan@...cle.com,
        dave.dice@...cle.com
Subject: Re: [PATCH v10 3/5] locking/qspinlock: Introduce CNA into the slow
 path of qspinlock

On 8/31/20 5:39 PM, Alex Kogan wrote:
>> On Jul 28, 2020, at 4:00 PM, Waiman Long <longman@...hat.com> wrote:
>>
>> On 4/3/20 4:59 PM, Alex Kogan wrote:
>>> In CNA, spinning threads are organized in two queues, a primary queue for
>>> threads running on the same node as the current lock holder, and a
>>> secondary queue for threads running on other nodes. After acquiring the
>>> MCS lock and before acquiring the spinlock, the lock holder scans the
>>> primary queue looking for a thread running on the same node (pre-scan). If
>>> found (call it thread T), all threads in the primary queue between the
>>> current lock holder and T are moved to the end of the secondary queue.
>>> If such T is not found, we make another scan of the primary queue when
>>> unlocking the MCS lock (post-scan), starting at the position where
>>> pre-scan stopped. If both scans fail to find such T, the MCS lock is
>>> passed to the first thread in the secondary queue. If the secondary queue
>>> is empty, the lock is passed to the next thread in the primary queue.
>>> For more details, see https://urldefense.com/v3/__https://arxiv.org/abs/1810.05600__;!!GqivPVa7Brio!OaieLQ3MMZThgxr-Of8i9dbN5CwG8mXSIBJ_sUofhAXcs43IWL2x-stO-XKLEebn$ .
>>>
>>> Note that this variant of CNA may introduce starvation by continuously
>>> passing the lock to threads running on the same node. This issue
>>> will be addressed later in the series.
>>>
>>> Enabling CNA is controlled via a new configuration option
>>> (NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at the
>>> boot time only if we run on a multi-node machine in native environment and
>>> the new config is enabled. (For the time being, the patching requires
>>> CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
>>> resolved once static_call() is available.) This default behavior can be
>>> overridden with the new kernel boot command-line option
>>> "numa_spinlock=on/off" (default is "auto").
>>>
>>> Signed-off-by: Alex Kogan <alex.kogan@...cle.com>
>>> Reviewed-by: Steve Sistare <steven.sistare@...cle.com>
>>> Reviewed-by: Waiman Long <longman@...hat.com>
>>> ---
>> There is also a concern that the worst case latency for a lock transfer can be close to O(n) which can be quite large for large SMP systems. I have a patch on top that modifies the current behavior to limit the number of node scans after the lock is freed.
> I understand the concern. While your patch addresses it, I am afraid it makes
> the code somewhat more complex, and duplicates some of the slow path
> functionality (e.g., the spin loop until the lock value changes to a certain
> value).
>
> Let me suggest a different idea for gradually restructuring the main queue
> that has some similarity to the way you suggested to handle prioritized waiters.
> Basically, instead of scanning the entire chain of main queue waiters,
> we can check only the next waiter and, if present and it runs on a different
> node, move it to the secondary queue. In addition, to maintain the preference
> for a certain numa node ID, we set the numa node of the next-next waiter,
> if present, to that of the current lock holder. This is the part similar to the
> way you suggested to handle prioritized waiters.
>
> This way, the worst case complexity of cna_order_queue() decreases from O(n)
> down to O(1), as we always “scan" only one waiter. And as before, we change
> the preference (and flush the secondary queue) after X handovers (or after
> Y ms, as in your in other patch).
>
> I attach the patch that applies on top of your patch for prioritized nodes
> (0006), but does not include your patch 0007 (time based threshold),
> which I will integrate into the series in the next revision.
>
> Please, let me know what you think.
>
That is an interesting idea. I don't have any fundamental objection to 
that. I just wonder how it will impact the kind of performance test that 
you ran before. It would be nice to see the performance impact with that 
change.

Cheers,
Longman