Message-ID: <87jzglq8tw.fsf@oracle.com>
Date: Mon, 12 Aug 2024 22:40:27 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Shrikanth Hegde <sshegde@...ux.ibm.com>,
    Ankur Arora <ankur.a.arora@...cle.com>,
    Michael Ellerman <mpe@...erman.id.au>, tglx@...utronix.de,
    peterz@...radead.org, paulmck@...nel.org, rostedt@...dmis.org,
    mark.rutland@....com, juri.lelli@...hat.com, joel@...lfernandes.org,
    raghavendra.kt@....com, boris.ostrovsky@...cle.com,
    konrad.wilk@...cle.com, LKML <linux-kernel@...r.kernel.org>,
    Nicholas Piggin <npiggin@...il.com>
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling

Linus Torvalds <torvalds@...ux-foundation.org> writes:

> On Mon, 12 Aug 2024 at 10:33, Shrikanth Hegde <sshegde@...ux.ibm.com> wrote:
>>
>> top 3 callstacks of __schedule collected with bpftrace.
>>
>> preempt=none                          | preempt=full
>>
>> __schedule+12                         | @[
>> schedule+64                           | __schedule+12
>> interrupt_exit_user_prepare_main+600  | preempt_schedule+84
>> interrupt_exit_user_prepare+88        | _raw_spin_unlock_irqrestore+124
>> interrupt_return_srr_user+8           | __wake_up_sync_key+108
>> , hackbench]: 482228                  | pipe_write+1772
>> @[                                    | vfs_write+1052
>> __schedule+12                         | ksys_write+248
>> schedule+64                           | system_call_exception+296
>> pipe_write+1452                       | system_call_vectored_common+348
>> vfs_write+940                         | , hackbench]: 538591
>> ksys_write+248                        | @[
>> system_call_exception+292             | __schedule+12
>> system_call_vectored_common+348       | schedule+76
>> , hackbench]: 1427161                 | schedule_preempt_disabled+52
>> @[                                    | __mutex_lock.constprop.0+1748
>> __schedule+12                         | pipe_write+132
>> schedule+64                           | vfs_write+1052
>> interrupt_exit_user_prepare_main+600  | ksys_write+248
>> syscall_exit_prepare+336              | system_call_exception+296
>> system_call_vectored_common+360       | system_call_vectored_common+348
>> , hackbench]: 8151309                 | , hackbench]: 5388301
>> @[                                    | @[
>> __schedule+12                         | __schedule+12
>> schedule+64                           | schedule+76
>> pipe_read+1100                        | pipe_read+1100
>> vfs_read+716                          | vfs_read+716
>> ksys_read+252                         | ksys_read+252
>> system_call_exception+292             | system_call_exception+296
>> system_call_vectored_common+348       | system_call_vectored_common+348
>> , hackbench]: 18132753                | , hackbench]: 64424110
>>
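
For reference, counts in that "@[<kernel stack>, <comm>]: <count>" form are
what a bpftrace aggregation keyed on kstack and comm produces; the exact
script isn't shown above, but it would be something along the lines of:

    # count __schedule() hits per kernel stack and task name
    bpftrace -e 'kprobe:__schedule { @[kstack, comm] = count(); }'
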
>
> So the pipe performance is very sensitive, partly because the pipe
> overhead is normally very low.
>
> So we've seen it in lots of benchmarks where the benchmark then gets
> wildly different results depending on whether you get the good "optimal
> pattern".
>
> And I think your "preempt=none" pattern is the one you really want,
> where all the pipe IO scheduling is basically done at exactly the
> (optimized) pipe points, ie where the writer blocks because there is
> no room (if it's a throughput benchmark), and the reader blocks
> because there is no data (for the ping-pong or pipe ring latency
> benchmarks).
>
> And then when you get that "perfect" behavior, you typically also get
> the best performance when all readers and all writers are on the same
> CPU, so you get no unnecessary cache ping-pong either.
>
> And that's a *very* typical pipe benchmark, where there are no costs
> to generating the pipe data and no costs involved with consuming it
> (ie the actual data isn't really *used* by the benchmark).
>
> In real (non-benchmark) loads, you typically want to spread the
> consumer and producer apart on different CPUs, so that the real load
> then uses multiple CPUs on the data. But the benchmark case - having
> no real data load - likes the "stay on the same CPU" thing.
>
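Tangent: if it's useful to check that placement directly, it should show up
with a variation of the same aggregation, keyed on the CPU as well. A rough
sketch only (pipe_read/pipe_write are static functions, but they appear in
the stacks above, so kprobes on them should attach):

    # which CPUs the hackbench pipe readers/writers actually run on
    bpftrace -e 'kprobe:pipe_read, kprobe:pipe_write
                 /comm == "hackbench"/ { @[probe, cpu] = count(); }'
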
> Your traces for "preempt=none" very much look like that "both reader
> and writer sleep synchronously" case, which is the optimal benchmark
> case.
>
> And then with "preempt=full", you see that "oh damn, reader and writer
> actually hit the pipe mutex contention, because they are presumably
> running at the same time on different CPUs, and didn't get into that
> nice serial synchronous pattern". So now you not only have that mutex
> overhead (which doesn't exist when the reader and writer synchronize),
> you also end up with the cost of cache misses *and* the cost of
> scheduling on two different CPUs where both of them basically go into
> idle while waiting for the other end.
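
If it helps to pin that down further, the contended-mutex leg should also be
visible directly via the lock contention tracepoints (assuming a kernel new
enough to have lock:contention_begin, v5.19+); a sketch along the same lines:

    # kernel stacks that actually block on a contended lock
    bpftrace -e 'tracepoint:lock:contention_begin
                 /comm == "hackbench"/ { @[kstack] = count(); }'

which, if the reading above is right, should be dominated by the
pipe_write -> __mutex_lock path under preempt=full.
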
Thanks. That was very clarifying.

--
ankur