linux-kernel - Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87zfrts1l1.fsf@oracle.com>
Date: Mon, 10 Jun 2024 00:23:38 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, tglx@...utronix.de,
        peterz@...radead.org, torvalds@...ux-foundation.org,
        paulmck@...nel.org, rostedt@...dmis.org, mark.rutland@....com,
        juri.lelli@...hat.com, joel@...lfernandes.org, raghavendra.kt@....com,
        boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
        LKML
 <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling


Shrikanth Hegde <sshegde@...ux.ibm.com> writes:

> On 6/4/24 1:02 PM, Shrikanth Hegde wrote:
>>
>>
>> On 6/1/24 5:17 PM, Ankur Arora wrote:
>>>
>>> Shrikanth Hegde <sshegde@...ux.ibm.com> writes:
>>>
>>>> On 5/28/24 6:04 AM, Ankur Arora wrote:
>>>>> Hi,
>>>>>
>>>>> This series adds a new scheduling model PREEMPT_AUTO, which like
>>>>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>>>>> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend
>>>>> on explicit preemption points for the voluntary models.
>>>>>
>>>>> The series is based on Thomas' original proposal which he outlined
>>>>> in [1], [2] and in his PoC [3].
>>>>>
>>>>> v2 mostly reworks v1, with one of the main changes having less
>>>>> noisy need-resched-lazy related interfaces.
>>>>> More details in the changelog below.
>>>>>
>>>>
>>>> Hi Ankur. Thanks for the series.
>>>>
>>>> nit: had to manually patch 11,12,13 since it didnt apply cleanly on
>>>> tip/master and tip/sched/core. Mostly due some word differences in the change.
>>>>
>>>> tip/master was at:
>>>> commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
>>>> Merge: 5d145493a139 47ff30cc1be7
>>>> Author: Ingo Molnar <mingo@...nel.org>
>>>> Date:   Tue May 28 12:44:26 2024 +0200
>>>>
>>>>     Merge branch into tip/master: 'x86/percpu'
>>>>
>>>>
>>>>
>>>>> The v1 of the series is at [4] and the RFC at [5].
>>>>>
>>>>> Design
>>>>> ==
>>>>>
>>>>> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
>>>>> PREEMPT_COUNT). This means that the scheduler can always safely
>>>>> preempt. (This is identical to CONFIG_PREEMPT.)
>>>>>
>>>>> Having that, the next step is to make the rescheduling policy dependent
>>>>> on the chosen scheduling model. Currently, the scheduler uses a single
>>>>> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a
>>>>> reschedule is needed.
>>>>> PREEMPT_AUTO extends this by adding an additional need-resched bit
>>>>> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the
>>>>> scheduler to express two kinds of rescheduling intent: schedule at
>>>>> the earliest opportunity (TIF_NEED_RESCHED), or express a need for
>>>>> rescheduling while allowing the task on the runqueue to run to
>>>>> timeslice completion (TIF_NEED_RESCHED_LAZY).
>>>>>
>>>>> The scheduler decides which need-resched bits are chosen based on
>>>>> the preemption model in use:
>>>>>
>>>>> 	       TIF_NEED_RESCHED        TIF_NEED_RESCHED_LAZY
>>>>>
>>>>> none		never   		always [*]
>>>>> voluntary       higher sched class	other tasks [*]
>>>>> full 		always                  never
>>>>>
>>>>> [*] some details elided.
>>>>>
>>>>> The last part of the puzzle is, when does preemption happen, or
>>>>> alternately stated, when are the need-resched bits checked:
>>>>>
>>>>>                  exit-to-user    ret-to-kernel    preempt_count()
>>>>>
>>>>> NEED_RESCHED_LAZY     Y               N                N
>>>>> NEED_RESCHED          Y               Y                Y
>>>>>
>>>>> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when
>>>>> none/voluntary preemption policies are in effect. And eager semantics
>>>>> under full preemption.
>>>>>
>>>>> In addition, since this is driven purely by the scheduler (not
>>>>> depending on cond_resched() placement and the like), there is enough
>>>>> flexibility in the scheduler to cope with edge cases -- ex. a kernel
>>>>> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by
>>>>> simply upgrading to a full NEED_RESCHED which can use more coercive
>>>>> instruments like resched IPI to induce a context-switch.
>>>>>
>>>>> Performance
>>>>> ==
>>>>> The performance in the basic tests (perf bench sched messaging, kernbench,
>>>>> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC.
>>>>> (See patches
>>>>>   "sched: support preempt=none under PREEMPT_AUTO"
>>>>>   "sched: support preempt=full under PREEMPT_AUTO"
>>>>>   "sched: handle preempt=voluntary under PREEMPT_AUTO")
>>>>>
>>>>> For a macro test, a colleague in Oracle's Exadata team tried two
>>>>> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series
>>>>> backported.)
>>>>>
>>>>> In both tests the data was cached on remote nodes (cells), and the
>>>>> database nodes (compute) served client queries, with clients being
>>>>> local in the first test and remote in the second.
>>>>>
>>>>> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
>>>>> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs
>>>>>
>>>>>
>>>>> 				  PREEMPT_VOLUNTARY                        PREEMPT_AUTO
>>>>> 				                                        (preempt=voluntary)
>>>>>                               ==============================      =============================
>>>>>                       clients  throughput    cpu-usage            throughput     cpu-usage         Gain
>>>>>                                (tx/min)    (utime %/stime %)      (tx/min)    (utime %/stime %)
>>>>> 		      -------  ----------  -----------------      ----------  -----------------   -------
>>>>>
>>>>>
>>>>>   OLTP                  384     9,315,653     25/ 6                9,253,252       25/ 6            -0.7%
>>>>>   benchmark	       1536    13,177,565     50/10               13,657,306       50/10            +3.6%
>>>>>  (local clients)       3456    14,063,017     63/12               14,179,706       64/12            +0.8%
>>>>>
>>>>>
>>>>>   OLTP                   96     8,973,985     17/ 2                8,924,926       17/ 2            -0.5%
>>>>>   benchmark	        384    22,577,254     60/ 8               22,211,419       59/ 8            -1.6%
>>>>>  (remote clients,      2304    25,882,857     82/11               25,536,100       82/11            -1.3%
>>>>>   90/10 RW ratio)
>>>>>
>>>>>
>>>>> (Both sets of tests have a fair amount of NW traffic since the query
>>>>> tables etc are cached on the cells. Additionally, the first set,
>>>>> given the local clients, stress the scheduler a bit more than the
>>>>> second.)
>>>>>
>>>>> The comparative performance for both the tests is fairly close,
>>>>> more or less within a margin of error.
>>>>>
>>>>> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu,  512GB RAM):
>>>>>
>>>>> "
>>>>>  a) Base kernel (6.7),
>>>>>  b) v1, PREEMPT_AUTO, preempt=voluntary
>>>>>  c) v1, PREEMPT_DYNAMIC, preempt=voluntary
>>>>>  d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y
>>>>>
>>>>>  Workloads I tested and their %gain,
>>>>>                     case b           case c       case d
>>>>>  NAS                +2.7%              +1.9%         +2.1%
>>>>>  Hashjoin,          +0.0%              +0.0%         +0.0%
>>>>>  Graph500,          -6.0%              +0.0%         +0.0%
>>>>>  XSBench            +1.7%              +0.0%         +1.2%
>>>>>
>>>>>  (Note about the Graph500 numbers at [8].)
>>>>>
>>>>>  Did kernbench etc test from Mel's mmtests suite also. Did not notice
>>>>>  much difference.
>>>>> "
>>>>>
>>>>> One case where there is a significant performance drop is on powerpc,
>>>>> seen running hackbench on a 320 core system (a test on a smaller system is
>>>>> fine.) In theory there's no reason for this to only happen on powerpc
>>>>> since most of the code is common, but I haven't been able to reproduce
>>>>> it on x86 so far.
>>>>>
>>>>> All in all, I think the tests above show that this scheduling model has legs.
>>>>> However, the none/voluntary models under PREEMPT_AUTO are conceptually
>>>>> different enough from the current none/voluntary models that there
>>>>> likely are workloads where performance would be subpar. That needs more
>>>>> extensive testing to figure out the weak points.
>>>>>
>>>>>
>>>>>
>>>> Did test it again on PowerPC. Unfortunately numbers shows there is regression
>>>> still compared to 6.10-rc1. This is done with preempt=none. I tried again on the
>>>> smaller system too to confirm. For now I have done the comparison for the hackbench
>>>> where highest regression was seen in v1.
>>>>
>>>> perf stat collected for 20 iterations show higher context switch and higher migrations.
>>>> Could it be that LAZY bit is causing more context switches? or could it be something
>>>> else? Could it be that more exit-to-user happens in PowerPC? will continue to debug.
>>>
>>> Thanks for trying it out.
>>>
>>> As you point out, context-switches and migrations are signficantly higher.
>>>
>>> Definitely unexpected. I ran the same test on an x86 box
>>> (Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference.
>>>
>>>   6.9.0/none.process.pipe.60:       170,719,761      context-switches          #    0.022 M/sec                    ( +-  0.19% )
>>>   6.9.0/none.process.pipe.60:        16,871,449      cpu-migrations            #    0.002 M/sec                    ( +-  0.16% )
>>>   6.9.0/none.process.pipe.60:      30.833112186 seconds time elapsed                                          ( +-  0.11% )
>>>
>>>   6.9.0-00035-gc90017e055a6/none.process.pipe.60:       177,889,639      context-switches          #    0.023 M/sec                    ( +-  0.21% )
>>>   6.9.0-00035-gc90017e055a6/none.process.pipe.60:        17,426,670      cpu-migrations            #    0.002 M/sec                    ( +-  0.41% )
>>>   6.9.0-00035-gc90017e055a6/none.process.pipe.60:      30.731126312 seconds time elapsed                                          ( +-  0.07% )
>>>
>>> Clearly there's something different going on powerpc. I'm travelling
>>> right now, but will dig deeper into this once I get back.
>>>
>>> Meanwhile can you check if the increased context-switches are voluntary or
>>> involuntary (or what the division is)?
>>
>>
>> Used "pidstat -w -p ALL 1 10" to capture 10 seconds data at 1 second interval for
>> context switches per second while running "hackbench -pipe 60 process 100000 loops"
>>
>>
>> preempt=none				6.10			preempt_auto
>> =============================================================================
>> voluntary context switches	    	7632166.19	        9391636.34(+23%)
>> involuntary context switches		2305544.07		3527293.94(+53%)
>>
>> Numbers vary between multiple runs. But trend seems to be similar. Both the context switches increase
>> involuntary seems to increase at higher rate.
>>
>>
>
>
> Continued data from hackbench regression. preempt=none in both the cases.
> From mpstat, I see slightly higher idle time and more irq time with preempt_auto.
>
> 6.10-rc1:
> =========
> 10:09:50 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 09:45:23 AM  all    4.14    0.00   77.57    0.00   16.92    0.00    0.00    0.00    0.00    1.37
> 09:45:24 AM  all    4.42    0.00   77.62    0.00   16.76    0.00    0.00    0.00    0.00    1.20
> 09:45:25 AM  all    4.43    0.00   77.45    0.00   16.94    0.00    0.00    0.00    0.00    1.18
> 09:45:26 AM  all    4.45    0.00   77.87    0.00   16.68    0.00    0.00    0.00    0.00    0.99
>
> PREEMPT_AUTO:
> ===========
> 10:09:50 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
> 10:09:56 AM  all    3.11    0.00   72.59    0.00   21.34    0.00    0.00    0.00    0.00    2.96
> 10:09:57 AM  all    3.31    0.00   73.10    0.00   20.99    0.00    0.00    0.00    0.00    2.60
> 10:09:58 AM  all    3.40    0.00   72.83    0.00   20.85    0.00    0.00    0.00    0.00    2.92
> 10:10:00 AM  all    3.21    0.00   72.87    0.00   21.19    0.00    0.00    0.00    0.00    2.73
> 10:10:01 AM  all    3.02    0.00   72.18    0.00   21.08    0.00    0.00    0.00    0.00    3.71
>
> Used bcc tools hardirq and softirq to see if irq are increasing. softirq implied there are more
> timer,sched softirq. Numbers vary between different samples, but trend seems to be similar.

Yeah, the %sys is lower and %irq, higher. Can you also see where the
increased %irq is? For instance are the resched IPIs numbers greater?

> 6.10-rc1:
> =========
> SOFTIRQ          TOTAL_usecs
> tasklet                   71
> block                    145
> net_rx                  7914
> rcu                   136988
> timer                 304357
> sched                1404497
>
>
>
> PREEMPT_AUTO:
> ===========
> SOFTIRQ          TOTAL_usecs
> tasklet                   80
> block                    139
> net_rx                  6907
> rcu                   223508
> timer                 492767
> sched                1794441
>
>
> Would any specific setting of RCU matter for this?
> This is what I have in config.

Don't see how it could matter unless the RCU settings are changing
between the two tests? In my testing I'm also using TREE_RCU=y,
PREEMPT_RCU=n.

Let me see if I can find a test which shows a similar trend to what you
are seeing. And, then maybe see if tracing sched-switch might point to
an interesting difference between x86 and powerpc.


Thanks for all the detail.

Ankur

> # RCU Subsystem
> #
> CONFIG_TREE_RCU=y
> # CONFIG_RCU_EXPERT is not set
> CONFIG_TREE_SRCU=y
> CONFIG_NEED_SRCU_NMI_SAFE=y
> CONFIG_TASKS_RCU_GENERIC=y
> CONFIG_NEED_TASKS_RCU=y
> CONFIG_TASKS_RCU=y
> CONFIG_TASKS_RUDE_RCU=y
> CONFIG_TASKS_TRACE_RCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_NEED_SEGCBLIST=y
> CONFIG_RCU_NOCB_CPU=y
> # CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
> # CONFIG_RCU_LAZY is not set
> # end of RCU Subsystem
>
>
> # Timers subsystem
> #
> CONFIG_TICK_ONESHOT=y
> CONFIG_NO_HZ_COMMON=y
> # CONFIG_HZ_PERIODIC is not set
> # CONFIG_NO_HZ_IDLE is not set
> CONFIG_NO_HZ_FULL=y
> CONFIG_CONTEXT_TRACKING_USER=y
> # CONFIG_CONTEXT_TRACKING_USER_FORCE is not set
> CONFIG_NO_HZ=y
> CONFIG_HIGH_RES_TIMERS=y
> # end of Timers subsystem


--
ankur