Message-ID: <CAHk-=wi+VFeT7e04kMr7jhoKWb4iKgb1szb7BxXC_-M38_qAKw@mail.gmail.com>
Date: Mon, 12 Aug 2024 14:07:02 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, Michael Ellerman <mpe@...erman.id.au>, tglx@...utronix.de,
peterz@...radead.org, paulmck@...nel.org, rostedt@...dmis.org,
mark.rutland@....com, juri.lelli@...hat.com, joel@...lfernandes.org,
raghavendra.kt@....com, boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
LKML <linux-kernel@...r.kernel.org>, Nicholas Piggin <npiggin@...il.com>
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling
On Mon, 12 Aug 2024 at 10:33, Shrikanth Hegde <sshegde@...ux.ibm.com> wrote:
>
> top 3 callstacks of __schedule collected with bpftrace.
>
> preempt=none preempt=full
>
> __schedule+12 |@[
> schedule+64 | __schedule+12
> interrupt_exit_user_prepare_main+600 | preempt_schedule+84
> interrupt_exit_user_prepare+88 | _raw_spin_unlock_irqrestore+124
> interrupt_return_srr_user+8 | __wake_up_sync_key+108
> , hackbench]: 482228 | pipe_write+1772
> @[ | vfs_write+1052
> __schedule+12 | ksys_write+248
> schedule+64 | system_call_exception+296
> pipe_write+1452 | system_call_vectored_common+348
> vfs_write+940 |, hackbench]: 538591
> ksys_write+248 |@[
> system_call_exception+292 | __schedule+12
> system_call_vectored_common+348 | schedule+76
> , hackbench]: 1427161 | schedule_preempt_disabled+52
> @[ | __mutex_lock.constprop.0+1748
> __schedule+12 | pipe_write+132
> schedule+64 | vfs_write+1052
> interrupt_exit_user_prepare_main+600 | ksys_write+248
> syscall_exit_prepare+336 | system_call_exception+296
> system_call_vectored_common+360 | system_call_vectored_common+348
> , hackbench]: 8151309 |, hackbench]: 5388301
> @[ |@[
> __schedule+12 | __schedule+12
> schedule+64 | schedule+76
> pipe_read+1100 | pipe_read+1100
> vfs_read+716 | vfs_read+716
> ksys_read+252 | ksys_read+252
> system_call_exception+292 | system_call_exception+296
> system_call_vectored_common+348 | system_call_vectored_common+348
> , hackbench]: 18132753 |, hackbench]: 64424110
>
So the pipe performance is very sensitive, partly because the pipe
overhead is normally very low.
So we've seen it in lots of benchmarks where the benchmark then gets
wildly different results depending on whether you get the good
"optimal pattern".
And I think your "preempt=none" pattern is the one you really want,
where all the pipe IO scheduling is basically done at exactly the
(optimized) pipe points, ie where the writer blocks because there is
no room (if it's a throughput benchmark), and the reader blocks
because there is no data (for the ping-pong or pipe ring latency
benchmarks).
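To make those two blocking points concrete, here's a rough sketch (not
the real fs/pipe.c code - the names below are made up, and it assumes a
single reader and a single writer, like the benchmark has):

#include <linux/mutex.h>
#include <linux/wait.h>

/*
 * Sketch only - "sketch_pipe", "sketch_full()" and "sketch_empty()"
 * are stand-ins for the real pipe ring and its occupancy checks.
 */
struct sketch_pipe {
        struct mutex            lock;
        wait_queue_head_t       rd_wait, wr_wait;
        unsigned int            head, tail, ring_size;
};

static bool sketch_full(struct sketch_pipe *p)
{
        return p->head - p->tail >= p->ring_size;
}

static bool sketch_empty(struct sketch_pipe *p)
{
        return p->head == p->tail;
}

static void sketch_write_one(struct sketch_pipe *p)
{
        /* throughput case: the writer only blocks when the ring is full */
        if (wait_event_interruptible(p->wr_wait, !sketch_full(p)))
                return;                         /* got a signal */
        mutex_lock(&p->lock);
        p->head++;                              /* "produce" one slot */
        mutex_unlock(&p->lock);
        wake_up_interruptible(&p->rd_wait);     /* the reader has data now */
}

static void sketch_read_one(struct sketch_pipe *p)
{
        /* ping-pong/latency case: the reader only blocks when it's empty */
        if (wait_event_interruptible(p->rd_wait, !sketch_empty(p)))
                return;                         /* got a signal */
        mutex_lock(&p->lock);
        p->tail++;                              /* "consume" one slot */
        mutex_unlock(&p->lock);
        wake_up_interruptible(&p->wr_wait);     /* the writer has room now */
}
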
And then when you get that "perfect" behavior, you typically also get
the best performance when all readers and all writers are on the same
CPU, so you get no unnecessary cache ping-pong either.
And that's a *very* typical pipe benchmark, where there are no costs
to generating the pipe data and no costs involved with consuming it
(ie the actual data isn't really *used* by the benchmark).
In real (non-benchmark) loads, you typically want to spread the
consumer and producer apart on different CPUs, so that the real load
then uses multiple CPUs on the data. But the benchmark case - having
no real data load - likes the "stay on the same CPU" thing.
Your traces for "preempt=none" very much look like that "both reader
and writer sleep synchronously" case, which is the optimal benchmark
case.
And then with "preempt=full", you see that "oh damn, reader and writer
actually hit the pipe mutex contention, because they are presumably
running at the same time on different CPUs, and didn't get into that
nice serial synchronous pattern". So now you not only have that mutex
overhead (which doesn't exist when the reader and writer synchronize),
you also end up with the cost of cache misses *and* the cost of
scheduling on two different CPUs where both of them basically go into
idle while waiting for the other end.
I'm not convinced this is solvable, because it really is an effect
that comes from "benchmarking is doing something odd that we
*shouldn't* generally optimize for".
I also absolutely detest the pipe mutex - 99% of what it protects
should be using either just atomic cmpxchg or possibly a spinlock, and
that's actually what the "use pipes for events" code does. However,
the actual honest user read()/write() code needs to do user space
accesses, and so it wants a sleeping lock.
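Concretely - and this is just a sketch with made-up names, not the
actual pipe or watch-queue code - the difference between the two sides
is:

#include <linux/errno.h>
#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/string.h>
#include <linux/uaccess.h>

static DEFINE_MUTEX(sketch_pipe_lock);
static DEFINE_SPINLOCK(sketch_event_lock);

/* read()/write() path: copies to/from user space, which can fault and
 * sleep, so it needs a sleeping lock */
static int sketch_user_write(void *slot, const void __user *ubuf, size_t len)
{
        int ret = 0;

        mutex_lock(&sketch_pipe_lock);
        if (copy_from_user(slot, ubuf, len))    /* may fault -> may sleep */
                ret = -EFAULT;
        mutex_unlock(&sketch_pipe_lock);
        return ret;
}

/* in-kernel event insertion: no user access at all, so a spinlock
 * (or a cmpxchg on the ring head) is enough */
static void sketch_event_post(void *slot, const void *kbuf, size_t len)
{
        spin_lock(&sketch_event_lock);
        memcpy(slot, kbuf, len);                /* never sleeps */
        spin_unlock(&sketch_event_lock);
}
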
We could - and probably at some point should - split the pipe mutex
into two: one that protects the writer side, one that protects the
reader side. Then with the common situation of a single reader and a
single writer, the mutex would never be contended. Then the rendezvous
between that "one reader" and "one writer" would be done using
atomics.
But it would be more complex, and it's already complicated by the
whole "you can also use pipes for atomic messaging for watch-queues".
Anyway, preempt=none has always excelled at certain things. This is
one of them.
Linus