Message-ID: <CAOpXEOzisJO2kN+qwVRcT5z+jQdSAtL2mpzyH1Q+PcXYvFQDJQ@mail.gmail.com>
Date: Mon, 15 Jul 2024 18:33:20 +0800
From: Shi-Wu, Lo(Gmail) <shiwulo@...il.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Bagas Sanjaya <bagasdotme@...il.com>, 
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...hat.com>, 
	Will Deacon <will@...nel.org>, Waiman Long <longman@...hat.com>, Boqun Feng <boqun.feng@...il.com>, 
	Mel Gorman <mgorman@...e.de>, Thomas Gleixner <tglx@...utronix.de>, Borislav Petkov <bp@...en8.de>, 
	Dave Hansen <dave.hansen@...ux.intel.com>, "H. Peter Anvin" <hpa@...or.com>, 
	"Theodore Ts'o" <tytso@....edu>, 
	"ccu_cs_oslab@...glegroups.com" <ccu_cs_oslab@...glegroups.com>
Subject: Re: A new spinlock for multicore (>16) platform

Dear Peter Zijlstra,
Here are my responses to the issues you raised:

1. We currently use a bitArray to address the time complexity of
unlock (https://www.cs.ccu.edu.tw/~shiwulo/nxt+sh-RON.pdf). If the
number of processors reaches the thousands, a two-layer bitArray can
be used: the first-layer bitArray records whether any core in a given
group of cores wants to enter the critical section, and the
second-layer bitArray records which cores within that group want to
enter. After a second-layer bitArray has been traversed, the
first-layer bitArray determines which second-layer bitArray to
traverse next. A minimal sketch of this idea appears below.
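
The following is only an illustrative sketch of the two-layer idea,
not the code from our repository; the names (group_mask, cpu_mask,
mark_waiting, next_waiting) and the use of GCC/Clang atomic builtins
in place of the kernel's bitops are my own assumptions here.

#include <stdint.h>

#define GROUP_BITS 64
#define MAX_GROUPS 64                   /* up to 4096 cores */

static uint64_t group_mask;             /* layer 1: groups with waiters */
static uint64_t cpu_mask[MAX_GROUPS];   /* layer 2: waiters per group */

/* Lock side: announce that 'cpu' wants the critical section. */
static void mark_waiting(unsigned int cpu)
{
        unsigned int g = cpu / GROUP_BITS;

        __atomic_fetch_or(&cpu_mask[g], 1ULL << (cpu % GROUP_BITS),
                          __ATOMIC_RELEASE);
        __atomic_fetch_or(&group_mask, 1ULL << g, __ATOMIC_RELEASE);
}

/*
 * Unlock side: find the next waiter after 'cur' on the circular path.
 * Two ffs-style lookups replace the O(n) per-core walk.  The sketch
 * assumes the unlocker's own bit is already cleared, omits the bit
 * clearing done when a waiter is granted the lock, and ignores races
 * where a bit changes between the two loads.
 */
static int next_waiting(unsigned int cur)
{
        unsigned int g = cur / GROUP_BITS, off = cur % GROUP_BITS;
        uint64_t m, gm;

        /* Any waiter after 'cur' inside its own group? */
        m = __atomic_load_n(&cpu_mask[g], __ATOMIC_ACQUIRE);
        m &= (off != GROUP_BITS - 1) ? ~0ULL << (off + 1) : 0;
        if (m)
                return g * GROUP_BITS + __builtin_ctzll(m);

        /* Otherwise layer 1 picks the next non-empty group, wrapping
         * around to the lowest-numbered group if necessary. */
        gm = __atomic_load_n(&group_mask, __ATOMIC_ACQUIRE);
        if (g != MAX_GROUPS - 1 && (gm & (~0ULL << (g + 1))))
                gm &= ~0ULL << (g + 1);
        if (!gm)
                return -1;              /* nobody is waiting */
        g = __builtin_ctzll(gm);
        m = __atomic_load_n(&cpu_mask[g], __ATOMIC_ACQUIRE);
        return m ? (int)(g * GROUP_BITS + __builtin_ctzll(m)) : -1;
}

With 64-bit words, two layers cover 64 * 64 = 4096 cores; a third
layer would extend the scheme further if needed.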

2. Core-to-core transfer speed can be measured in software; the
latencies reported in our paper were obtained with our own
measurement code. There are also websites that publish core-to-core
latency data, such as:
https://chipsandcheese.com/2023/11/07/core-to-core-latency-data-on-large-systems/.
A simplified sketch of such a measurement appears below.
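
This is only a rough sketch of the usual cache-line ping-pong
technique, not the exact benchmark from our repository; the core
numbers, ROUNDS, and all names here are illustrative.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000

/* The bounced cache line: one flag, alone on a 64-byte line. */
static _Atomic int flag __attribute__((aligned(64)));

static void pin_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Responder: wait for 1, answer with 0. */
static void *pong(void *arg)
{
        pin_to_cpu((int)(long)arg);
        for (int i = 0; i < ROUNDS; i++) {
                while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
                        ;
                atomic_store_explicit(&flag, 0, memory_order_release);
        }
        return NULL;
}

int main(void)
{
        struct timespec t0, t1;
        pthread_t t;

        pin_to_cpu(0);                          /* initiator on core 0 */
        pthread_create(&t, NULL, pong, (void *)1L); /* responder on core 1 */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ROUNDS; i++) {
                atomic_store_explicit(&flag, 1, memory_order_release);
                while (atomic_load_explicit(&flag, memory_order_acquire) != 0)
                        ;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        pthread_join(t, NULL);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("core 0 <-> core 1: %.1f ns one-way\n", ns / ROUNDS / 2);
        return 0;
}

Build with "gcc -O2 -pthread"; the round-trip time divided by two
approximates the one-way transfer cost, and running it over every
core pair yields the kind of latency matrix shown on the website
above.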

3. Redesigning data structures to achieve higher performance is the
best approach, but not every problem can be solved with a better data
structure; for example, the current Linux kernel still relies on many
spinlocks.

4. We have published the code on GitHub
(https://github.com/shiwulo/ron-osdi2023); this link is also given on
the first page of our paper. We did not apply for the USENIX
AVAILABLE badge because, although our code is open, we do not yet
consider it polished enough.


shiwu

Peter Zijlstra <peterz@...radead.org> wrote on Mon, Jul 15, 2024 at 5:54 PM:
>
> On Mon, Jul 15, 2024 at 06:45:16AM +0700, Bagas Sanjaya wrote:
> > On Mon, Jul 15, 2024 at 01:07:40AM +0800, Shi-Wu, Lo(Gmail) wrote:
> > > Dear Linux Contributors,
>
> > >    A detailed introduction to this method can be found in the following paper:
> > > https://www.usenix.org/conference/osdi23/presentation/lo
>
> So if I understand this right, the algorithm basically boils down to
> a circular path, and unlock iterates this path looking for the next
> waiting CPU.
>
> The worst case wait-time is one full circle of the rotor, which is
> bounded, so that's good.
>
> The immediate problem, however, is that O(n) iteration on unlock:
> Linux runs on many big machines with on the order of 1e3 CPUs, so
> this would be quite terrible at low to medium contention.
>
> You mention in Future Work alleviating this unlock issue by switching
> to a linked list, whereby unlock would become O(1). The problem then
> becomes that lock() needs to do an insertion sort, which needs to
> know the rotor position (e.g. the current lock owner).
>
> I don't think that is a much easier proposition -- notably, the
> uncontended fast path doesn't (want to) track the rotor position, and
> in the worst case you're still iterating over on the order of 1e3
> CPUs.
>
> Hierarchical rotors come to mind, but complexity -- is it warranted?
>
> > > Our laboratory is currently developing a system that can apply the
> > > same optimization strategy to all multi-core processors. Below is our
> > > plan.
> > >
> > > The New Method and Its Compatibility with qspinlock:
> > > 1. The default algorithm in the Linux kernel remains qspinlock.
> > > 2. A new file is created in /proc/routing_path, where a shortest path
> > > can be input, for example:
> > > sudo echo 1,2,3,4,16,17,18,19,5,6,7,8,11,12,13,14 > /proc/routing_path
> > > 3. After inputting the shortest path, the kernel switches to using the
> > > RON algorithm.
>
> This is quite horrendous. If you cannot compute the OWCR from the
> provided topology (CPUID on x86), 'nobody' is going to be using this.
>
> Additionally, which locks are so contended that this gives a
> significant performance gain? And can't we better rework the locking,
> rather than the lock implementation, to alleviate the problem?
>
> Eg. the futex hash lock is an oft cited example, but something like:
>
>   https://lkml.kernel.org/r/20230721105744.434742902@infradead.org
>
> can significantly reduce this.
>
> That is, in general it helps more to reduce lock contention than to
> optimize the contention behaviour.
>
> So which locks do you see sufficient contention on to make this
> worthwhile and how much do you gain by doing this?
>
> Additionally, per the paper there are Linux patches (albeit for
> rather old kernel versions); why aren't those included?
>
>


-- 
Shi-Wu Lo (羅習五)
Associate Professor, Department of Computer Science,
National Chung Cheng University
Tel: (05)2720411 ext. 33116
Fax: (05)2720859
