linux-kernel - user-space concurrent pipe buffer scheduler interactions

Message-ID: <969ccc0f-d909-4b45-908e-e98279777733@metaparadigm.com>
Date: Wed, 3 Apr 2024 09:53:34 +1300
From: Michael Clark <michael@...aparadigm.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>,
        Jens Axboe <axboe@...nel.dk>, Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>, linux-kernel@...r.kernel.org
Subject: user-space concurrent pipe buffer scheduler interactions

Folks,

I am working on a low-latency, cross-platform concurrent pipe buffer
using C11 threads and atomics. It is portable code using a <stdatomic.h>
polyfill on Windows that wraps the intrinsics that Microsoft provides.
There is a detailed write-up with implementation details, source code,
tests and benchmark results at the URL below:

- https://github.com/michaeljclark/cpipe/

I have been eagerly following Jens's work on io_uring, which is why I
am including him: he may be interested in these scheduler findings,
since I am currently using busy memory polling for synchronization.

The reason I am writing here is that I think I now have a pretty
decent test case for comparing the Windows and Linux schedulers
side-by-side. Let's just say it has been an eye-opening process, and I
think folks here might be interested in what I am seeing versus what we
would predict from Amdahl's Law and low-level cache ping-pong on atomics.

Let me cut to the chase. What I am observing is that when I add threads
on Windows, performance increases, but when I add threads on Linux,
performance decreases. I don't know exactly why. I am wondering whether
Windows is doing some topologically affine scheduling, or whether it is
using performance counters to inform scheduling decisions. I have
checked the codegen and it is basically two LOCK CMPXCHG instructions.
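
For readers who have not opened the repository, here is a minimal sketch of the kind of C11 compare-exchange loop that compiles to a LOCK CMPXCHG on x86-64. The names (`ring_sketch`, `ring_init`, `ring_claim`) are hypothetical and this is not the cpipe source; it only illustrates why several threads end up contending on the same cache line.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical illustration only; not taken from cpipe. */
typedef struct {
    _Atomic uint32_t write_pos; /* free-running byte offset claimed by writers */
    _Atomic uint32_t read_pos;  /* free-running byte offset released by readers */
    uint32_t capacity;          /* assumed to be a power of two */
} ring_sketch;

static void ring_init(ring_sketch *r, uint32_t capacity)
{
    atomic_init(&r->write_pos, 0);
    atomic_init(&r->read_pos, 0);
    r->capacity = capacity;
}

/* Try to claim `len` bytes for writing. On success, store the claimed
 * free-running start offset in *start and return true. Return false if
 * the buffer does not currently have `len` bytes free. */
static bool ring_claim(ring_sketch *r, uint32_t len, uint32_t *start)
{
    uint32_t w = atomic_load_explicit(&r->write_pos, memory_order_relaxed);
    for (;;) {
        uint32_t rd = atomic_load_explicit(&r->read_pos, memory_order_acquire);
        uint32_t used = w - rd; /* unsigned subtraction, correct across wrap */
        if (len > r->capacity - used)
            return false;       /* not enough space: caller polls again */
        /* On x86-64 this compiles to a LOCK CMPXCHG. On failure, `w` is
         * reloaded with the current value and the loop retries. */
        if (atomic_compare_exchange_weak_explicit(&r->write_pos, &w, w + len,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed)) {
            *start = w;
            return true;
        }
    }
}
```

Every thread that calls `ring_claim` needs exclusive ownership of the cache line holding `write_pos`, so where the scheduler places those threads relative to each other plausibly affects how expensive each retry is.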

I ran bare metal tests on Kaby Lake and Skylake processors on both OSes:

- `Windows 11 Version 23H2 Build 22631.3296`
- `Linux 6.5.0-25-generic #25~22.04.1-Ubuntu`

In any case, here are the numbers. I will let them speak for themselves:

# Minimum Latency (nanoseconds)

|                      | cpipe win11 | cpipe linux | linux pipes |
|:---------------------|------------:|------------:|------------:|
| Kaby Lake (i7-8550U) |      ~219ns |      ~362ns |     ~7692ns |
| Skylake (i9-7980XE)  |      ~404ns |      ~425ns |     ~9183ns |

# Message Rate (messages per second)

|                      | cpipe win11 | cpipe linux | linux pipes |
|:---------------------|------------:|------------:|------------:|
| Kaby Lake (i7-8550U) |       4.55M |       2.71M |     129.62K |
| Skylake (i9-7980XE)  |       2.47M |       2.35M |     108.89K |

# Bandwidth 32KB buffer (1 thread)

|                      | cpipe win11 | cpipe linux | linux pipes |
|:---------------------|------------:|------------:|------------:|
| Kaby Lake (i7-8550U) |  2.91GB/sec |  1.36GB/sec |  1.72GB/sec |
| Skylake (i9-7980XE)  |  2.98GB/sec |  1.44GB/sec |  1.67GB/sec |

# Bandwidth 32KB buffer (4 threads)

|                      | cpipe win11 | cpipe linux |
|:---------------------|------------:|------------:|
| Kaby Lake (i7-8550U) |  5.56GB/sec |  0.79GB/sec |
| Skylake (i9-7980XE)  |  7.11GB/sec |  0.89GB/sec |

I think we have a very useful test case here for the Linux scheduler. I
have been working on a generalization of a memory-polled user-space
queue, and this is about the 5th iteration, in which I have been very
careful to treat modulo arithmetic and overflow as the normal case.
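
To make "overflow as the normal case" concrete, here is a small sketch with free-running unsigned counters, assuming a power-of-two capacity. The helper names (`ring_used`, `ring_index`) are hypothetical and not from cpipe; the point is only that unsigned subtraction gives the right fill level even after the counters wrap.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not taken from cpipe. */

/* Bytes currently queued. `head` and `tail` are free-running counters that
 * are allowed to wrap; unsigned subtraction is defined modulo 2^32, so the
 * difference stays correct as long as it never exceeds the capacity. */
static uint32_t ring_used(uint32_t head, uint32_t tail)
{
    return head - tail;
}

/* Map a free-running counter to a buffer index. This assumes `capacity`
 * is a power of two, so it divides 2^32 and the mapping stays continuous
 * across the wrap. */
static uint32_t ring_index(uint32_t pos, uint32_t capacity)
{
    return pos & (capacity - 1u);
}
```

For example, with a 64-byte buffer and `tail = 0xFFFFFFF0`, writing 32 bytes wraps `head` around to `0x00000010`, yet `ring_used(head, tail)` is still 32 and the buffer indices run from 48 up through 63 and then 0 through 15, with no special case for the wrap.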

I know it is a little unfair to compare latency with Linux pipes, and
we also waste a lot of time spinning when the queue is full. This is
where we would really like to use something like SENDUIPI, UMONITOR and
UMWAIT, but I don't have access to silicon that supports those yet.

Regards,
Michael Clark