lists.openwall.net
Open Source and information security mailing list archives
Date:   Wed, 22 Mar 2023 15:08:10 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     Peter Zijlstra <peterz@...radead.org>, mingo@...nel.org,
        vincent.guittot@...aro.org
Cc:     linux-kernel@...r.kernel.org, juri.lelli@...hat.com,
        dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
        mgorman@...e.de, bristot@...hat.com, corbet@....net,
        qyousef@...alina.io, chris.hyser@...cle.com,
        patrick.bellasi@...bug.net, pjt@...gle.com, pavel@....cz,
        qperret@...gle.com, tim.c.chen@...ux.intel.com, joshdon@...gle.com,
        timj@....org, yu.c.chen@...el.com, youssefesmat@...omium.org,
        joel@...lfernandes.org
Subject: Re: [PATCH 00/10] sched: EEVDF using latency-nice

Hello Peter,

One important detail I forgot to mention: When I picked eevdf commits
from your tree
(https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=sched/core),
they were based on v6.3-rc1 with the sched/eevdf HEAD at:

commit: 0dddbc0b54ad ("sched/fair: Implement an EEVDF like policy")

On 3/22/2023 12:19 PM, K Prateek Nayak wrote:
> Hello Peter,
> 
> Leaving some results from my testing on a dual socket Zen3 machine
> (2 x 64C/128T) below.
> 
> tl;dr
> 
> o I've not yet tested workloads with nice and latency-nice values set,
>   focusing instead on out-of-the-box performance. For the same reason,
>   no changes were made to sched_feat.
> 
> o Except for hackbench (m:n communication relationship), I do not see
>   any regression for the other standard benchmarks (mostly 1:1 or 1:n
>   relationships) when the system is below fully loaded.
> 
> o In the fully loaded scenario, schbench seems to be unhappy. Looking
>   at the data from /proc/<pid>/sched for the tasks with schedstats
>   enabled, there is an increase in the number of context switches and
>   in the total wait sum. When the system is overloaded, things flip and
>   the schbench tail latency improves drastically. I suspect the
>   involuntary context-switches help workers make progress much sooner
>   after wakeup compared to tip, thus leading to lower tail latency.
> 
> o For the same reason as above, tbench throughput takes a hit, with the
>   number of involuntary context-switches increasing drastically for the
>   tbench server. An increase in wait sum is also noticed.
> 
> o A couple of real-world workloads were also tested. DeathStarBench
>   throughput tanks much more with the updated version in your tree
>   than with this series as is.
>   SpecJBB Max-jOPS sees large improvements, but at the cost of a drop
>   in Critical-jOPS, signifying an increase in either wait time or
>   involuntary context-switches, which can lead to transactions taking
>   longer to complete.
> 
> o Apart from DeathStarBench, all the trends reported remain the same
>   when comparing the version in your tree and this series, as is,
>   applied on the same base kernel.
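(Aside, for anyone reproducing the analysis above: the context-switch and
wait-sum figures come from the "key : value" lines in /proc/<pid>/sched.
Below is a minimal sketch of pulling them out; the sample text and field
names are illustrative and can differ across kernel versions, so check your
own kernel's output.)

```python
def parse_proc_sched(text):
    """Parse 'key : value' lines from /proc/<pid>/sched into a dict,
    keeping only the numeric fields."""
    stats = {}
    for line in text.splitlines():
        if ':' not in line:
            continue
        key, _, val = line.partition(':')
        key, val = key.strip(), val.strip()
        try:
            stats[key] = float(val)
        except ValueError:
            pass  # skip non-numeric lines (e.g. the policy/prio header)
    return stats

# Hypothetical sample resembling /proc/<pid>/sched output on a
# schedstats-enabled kernel (values invented for illustration):
sample = """
se.sum_exec_runtime        :       1234.567890
nr_switches                :               500
nr_voluntary_switches      :               350
nr_involuntary_switches    :               150
se.statistics.wait_sum     :         98.765432
"""
stats = parse_proc_sched(sample)
```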
> 
> I'll leave the detailed results and some limited analysis below.
> 
> On 3/6/2023 6:55 PM, Peter Zijlstra wrote:
>> Hi!
>>
>> Ever since looking at the latency-nice patches, I've wondered if EEVDF would
>> not make more sense, and I did point Vincent at some older patches I had for
>> that (which is where his augmented rbtree thing comes from).
>>
>> Also, since I really dislike the dual tree, I also figured we could dynamically
>> switch between an augmented tree and not (and while I have code for that,
>> that's not included in this posting because with the current results I don't
>> think we actually need this).
>>
>> Anyway, since I'm somewhat under the weather, I spent last week desperately
>> trying to connect a small cluster of neurons in defiance of the snot overlord
>> and bring back the EEVDF patches from the dark crypts where they'd been
>> gathering cobwebs for the past 13 odd years.
>>
>> By Friday they worked well enough, and this morning (because obviously I forgot
>> the weekend is ideal to run benchmarks) I ran a bunch of hackbench, netperf,
>> tbench and sysbench -- there's a bunch of wins and losses, but nothing that
>> indicates a total fail.
>>
>> ( in fact, some of the schbench results seem to indicate EEVDF schedules a lot
>>   more consistently than CFS and has a bunch of latency wins )
>>
>> ( hackbench also doesn't show the augmented tree and generally more expensive
>>   pick to be a loss, in fact it shows a slight win here )
>>
>>
>>   hackbench load + cyclictest --policy other results:
>>
>>
>> 			EEVDF			 CFS
>>
>> 		# Min Latencies: 00053
>>   LNICE(19)	# Avg Latencies: 04350
>> 		# Max Latencies: 76019
>>
>> 		# Min Latencies: 00052		00053
>>   LNICE(0)	# Avg Latencies: 00690		00687
>> 		# Max Latencies: 14145		13913
>>
>> 		# Min Latencies: 00019
>>   LNICE(-19)	# Avg Latencies: 00261
>> 		# Max Latencies: 05642
>>
> 
> Following are the results from testing the series on a dual socket
> Zen3 machine (2 x 64C/128T):
> 
> NPS Modes are used to logically divide a single socket into
> multiple NUMA regions.
> Following is the NUMA configuration for each NPS mode on the system:
> 
> NPS1: Each socket is a NUMA node.
>     Total 2 NUMA nodes in the dual socket machine.
> 
>     Node 0: 0-63,   128-191
>     Node 1: 64-127, 192-255
> 
> NPS2: Each socket is further logically divided into 2 NUMA regions.
>     Total 4 NUMA nodes exist over the 2 sockets.
>    
>     Node 0: 0-31,   128-159
>     Node 1: 32-63,  160-191
>     Node 2: 64-95,  192-223
>     Node 3: 96-127, 224-255
> 
> NPS4: Each socket is logically divided into 4 NUMA regions.
>     Total 8 NUMA nodes exist over the 2 sockets.
>    
>     Node 0: 0-15,    128-143
>     Node 1: 16-31,   144-159
>     Node 2: 32-47,   160-175
>     Node 3: 48-63,   176-191
>     Node 4: 64-79,   192-207
>     Node 5: 80-95,   208-223
>     Node 6: 96-111,  224-239
>     Node 7: 112-127, 240-255
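(The node ranges above follow a regular pattern: each node takes a
contiguous slice of the 128 physical CPUs, plus the SMT siblings at an
offset of 128. A quick sketch of my own, not from the report, to derive
them for any NPS mode:)

```python
def nps_nodes(nps, cpus_per_socket=64, sockets=2, smt_offset=128):
    """Derive per-node CPU ranges for a dual-socket machine logically
    split into `nps` NUMA regions per socket (sockets * nps nodes)."""
    per_node = cpus_per_socket // nps
    nodes = []
    for n in range(sockets * nps):
        lo = n * per_node
        hi = lo + per_node - 1
        # primary threads, then their SMT siblings
        nodes.append(((lo, hi), (lo + smt_offset, hi + smt_offset)))
    return nodes

# NPS1: two nodes, one per socket
# nps_nodes(1) -> [((0, 63), (128, 191)), ((64, 127), (192, 255))]
```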
> 
> Kernel versions:
> - tip:          6.2.0-rc6 tip sched/core
> - eevdf: 	6.2.0-rc6 tip sched/core
> 		+ eevdf commits from your tree
> 		(https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=sched/eevdf)

I had cherry-picked the following commits for eevdf:

commit: b84a8f6b6fa3 ("sched: Introduce latency-nice as a per-task attribute")
commit: eea7fc6f13b4 ("sched/core: Propagate parent task's latency requirements to the child task")
commit: a143d2bcef65 ("sched: Allow sched_{get,set}attr to change latency_nice of the task")
commit: d9790468df14 ("sched/fair: Add latency_offset")
commit: 3d4d37acaba4 ("sched/fair: Add sched group latency support")
commit: 707840ffc8fa ("sched/fair: Add avg_vruntime")
commit: 394af9db316b ("sched/fair: Remove START_DEBIT")
commit: 89b2a2ee0e9d ("sched/fair: Add lag based placement")
commit: e3db9631d8ca ("rbtree: Add rb_add_augmented_cached() helper")
commit: 0dddbc0b54ad ("sched/fair: Implement an EEVDF like policy")

from the sched/eevdf branch in your tree onto the tip branch back when
I started testing. I notice some more changes have been added since then.
I'm queuing testing of the latest changes on the updated tip:sched/core
based on v6.3-rc3. I was able to cherry-pick the latest commits from
sched/eevdf cleanly.
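(For readers following along, the pick rule at the heart of these patches
can be illustrated with a toy sketch; this is my own simplification of the
EEVDF idea, not code from the series. A task is "eligible" when its
vruntime has not overtaken the queue's average vruntime, i.e. it is still
owed service, and among eligible tasks the one with the earliest virtual
deadline runs:)

```python
def eevdf_pick(tasks, avg_vruntime):
    """Toy EEVDF pick: among eligible tasks (vruntime <= avg_vruntime),
    return the one with the earliest virtual deadline, or None if no
    task is eligible."""
    eligible = [t for t in tasks if t["vruntime"] <= avg_vruntime]
    if not eligible:
        return None
    return min(eligible, key=lambda t: t["vdeadline"])

# Invented example run queue (names/values are illustrative only):
tasks = [
    {"name": "A", "vruntime": 90,  "vdeadline": 110},
    {"name": "B", "vruntime": 100, "vdeadline": 105},  # eligible, earliest deadline
    {"name": "C", "vruntime": 120, "vdeadline": 101},  # ahead of its share, not eligible
]
picked = eevdf_pick(tasks, avg_vruntime=100)
```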

> 
> - eevdf prev:	6.2.0-rc6 tip sched/core + this series as is
> 
> When the testing started, the tip was at:
> commit 7c4a5b89a0b5 "sched/rt: pick_next_rt_entity(): check list_entry"
> [..snip..]
> 
--
Thanks and Regards,
Prateek
