Message-ID: <87jzglq8tw.fsf@oracle.com>
Date: Mon, 12 Aug 2024 22:40:27 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Shrikanth Hegde <sshegde@...ux.ibm.com>,
    Ankur Arora <ankur.a.arora@...cle.com>,
    Michael Ellerman <mpe@...erman.id.au>, tglx@...utronix.de,
    peterz@...radead.org, paulmck@...nel.org, rostedt@...dmis.org,
    mark.rutland@....com, juri.lelli@...hat.com, joel@...lfernandes.org,
    raghavendra.kt@....com, boris.ostrovsky@...cle.com,
    konrad.wilk@...cle.com, LKML <linux-kernel@...r.kernel.org>,
    Nicholas Piggin <npiggin@...il.com>
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling

Linus Torvalds <torvalds@...ux-foundation.org> writes:

> On Mon, 12 Aug 2024 at 10:33, Shrikanth Hegde <sshegde@...ux.ibm.com> wrote:
>>
>> top 3 callstacks of __schedule collected with bpftrace.
>>
>> preempt=none                          | preempt=full
>>
>> __schedule+12                         | @[
>> schedule+64                           | __schedule+12
>> interrupt_exit_user_prepare_main+600  | preempt_schedule+84
>> interrupt_exit_user_prepare+88        | _raw_spin_unlock_irqrestore+124
>> interrupt_return_srr_user+8           | __wake_up_sync_key+108
>> , hackbench]: 482228                  | pipe_write+1772
>> @[                                    | vfs_write+1052
>> __schedule+12                         | ksys_write+248
>> schedule+64                           | system_call_exception+296
>> pipe_write+1452                       | system_call_vectored_common+348
>> vfs_write+940                         | , hackbench]: 538591
>> ksys_write+248                        | @[
>> system_call_exception+292             | __schedule+12
>> system_call_vectored_common+348       | schedule+76
>> , hackbench]: 1427161                 | schedule_preempt_disabled+52
>> @[                                    | __mutex_lock.constprop.0+1748
>> __schedule+12                         | pipe_write+132
>> schedule+64                           | vfs_write+1052
>> interrupt_exit_user_prepare_main+600  | ksys_write+248
>> syscall_exit_prepare+336              | system_call_exception+296
>> system_call_vectored_common+360       | system_call_vectored_common+348
>> , hackbench]: 8151309                 | , hackbench]: 5388301
>> @[                                    | @[
>> __schedule+12                         | __schedule+12
>> schedule+64                           | schedule+76
>> pipe_read+1100                        | pipe_read+1100
>> vfs_read+716                          | vfs_read+716
>> ksys_read+252                         | ksys_read+252
>> system_call_exception+292             | system_call_exception+296
>> system_call_vectored_common+348       | system_call_vectored_common+348
>> , hackbench]: 18132753                | , hackbench]: 64424110
>>
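
For reference, counts in that "@[<kernel stack>, <comm>]: <count>" form are
what a bpftrace aggregation keyed on kstack and comm produces; the exact
script isn't shown above, but it would be something along the lines of:

    # count __schedule() hits per kernel stack and task name
    bpftrace -e 'kprobe:__schedule { @[kstack, comm] = count(); }'
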
>
> So the pipe performance is very sensitive, partly because the pipe
> overhead is normally very low.
>
> So we've seen it in lots of benchmarks where the benchmark then gets
> wildly different results depending on whether you get the good "optimal
> pattern".
>
> And I think your "preempt=none" pattern is the one you really want,
> where all the pipe IO scheduling is basically done at exactly the
> (optimized) pipe points, ie where the writer blocks because there is
> no room (if it's a throughput benchmark), and the reader blocks
> because there is no data (for the ping-pong or pipe ring latency
> benchmarks).
>
> And then when you get that "perfect" behavior, you typically also get
> the best performance when all readers and all writers are on the same
> CPU, so you get no unnecessary cache ping-pong either.
>
> And that's a *very* typical pipe benchmark, where there are no costs
> to generating the pipe data and no costs involved with consuming it
> (ie the actual data isn't really *used* by the benchmark).
>
> In real (non-benchmark) loads, you typically want to spread the
> consumer and producer apart on different CPUs, so that the real load
> then uses multiple CPUs on the data. But the benchmark case - having
> no real data load - likes the "stay on the same CPU" thing.
>
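Tangent: if it's useful to check that placement directly, it should show up
with a variation of the same aggregation, keyed on the CPU as well. A rough
sketch only (pipe_read/pipe_write are static functions, but they appear in
the stacks above, so kprobes on them should attach):

    # which CPUs the hackbench pipe readers/writers actually run on
    bpftrace -e 'kprobe:pipe_read, kprobe:pipe_write
                 /comm == "hackbench"/ { @[probe, cpu] = count(); }'
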
> Your traces for "preempt=none" very much look like that "both reader
> and writer sleep synchronously" case, which is the optimal benchmark
> case.
>
> And then with "preempt=full", you see that "oh damn, reader and writer
> actually hit the pipe mutex contention, because they are presumably
> running at the same time on different CPUs, and didn't get into that
> nice serial synchronous pattern". So now you not only have that mutex
> overhead (which doesn't exist when the reader and writer synchronize),
> you also end up with the cost of cache misses *and* the cost of
> scheduling on two different CPUs where both of them basically go into
> idle while waiting for the other end.
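
If it helps to pin that down further, the contended-mutex leg should also be
visible directly via the lock contention tracepoints (assuming a kernel new
enough to have lock:contention_begin, v5.19+); a sketch along the same lines:

    # kernel stacks that actually block on a contended lock
    bpftrace -e 'tracepoint:lock:contention_begin
                 /comm == "hackbench"/ { @[kstack] = count(); }'

which, if the reading above is right, should be dominated by the
pipe_write -> __mutex_lock path under preempt=full.
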
Thanks. That was very clarifying.

--
ankur