Message-ID: <ZRFp3EO2JUXtK6XB@gmail.com>
Date: Mon, 25 Sep 2023 13:07:08 +0200
From: Ingo Molnar <mingo@...nel.org>
To: kernel test robot <oliver.sang@...el.com>,
Mel Gorman <mgorman@...hsingularity.net>,
Peter Zijlstra <peterz@...radead.org>
Cc: oe-lkp@...ts.linux.dev, lkp@...el.com,
linux-kernel@...r.kernel.org, ying.huang@...el.com,
feng.tang@...el.com, fengwei.yin@...el.com,
aubrey.li@...ux.intel.com, yu.c.chen@...el.com,
Mike Galbraith <efault@....de>,
K Prateek Nayak <kprateek.nayak@....com>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>,
linux-tip-commits@...r.kernel.org, x86@...nel.org,
Gautham Shenoy <gautham.shenoy@....com>
Subject: Re: [PATCH] sched/fair: Do not wakeup-preempt same-prio SCHED_OTHER tasks
* kernel test robot <oliver.sang@...el.com> wrote:
> Hello,
>
> kernel test robot noticed a -19.0% regression of stress-ng.filename.ops_per_sec on:
Thanks for the testing, this is useful!
So I've tabulated the results into a much easier to read format:
> | testcase: change | stress-ng: stress-ng.filename.ops_per_sec -19.0% regression
> | testcase: change | stress-ng: stress-ng.lockbus.ops_per_sec -6.0% regression
> | testcase: change | stress-ng: stress-ng.sigfd.ops_per_sec 17.6% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Masskrug.CPU-only.seconds -5.3% improvement
> | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec 11.5% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Boat.CPU-only.seconds -3.5% improvement
> | testcase: change | stress-ng: stress-ng.sigrt.ops_per_sec 100.2% improvement
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -93.9% regression
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -82.1% regression
> | testcase: change | stress-ng: stress-ng.sock.ops_per_sec 59.4% improvement
> | testcase: change | blogbench: blogbench.write_score -35.9% regression
> | testcase: change | hackbench: hackbench.throughput -4.8% regression
> | testcase: change | blogbench: blogbench.write_score -59.3% regression
> | testcase: change | stress-ng: stress-ng.exec.ops_per_sec -34.6% regression
> | testcase: change | netperf: netperf.Throughput_Mbps 60.6% improvement
> | testcase: change | hackbench: hackbench.throughput 19.1% improvement
> | testcase: change | stress-ng: stress-ng.dnotify.ops_per_sec -15.7% regression
And then sorted them along the regression/improvement axis:
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -93.9% regression
> | testcase: change | stress-ng: stress-ng.sigsuspend.ops_per_sec -82.1% regression
> | testcase: change | blogbench: blogbench.write_score -59.3% regression
> | testcase: change | blogbench: blogbench.write_score -35.9% regression
> | testcase: change | stress-ng: stress-ng.exec.ops_per_sec -34.6% regression
> | testcase: change | stress-ng: stress-ng.filename.ops_per_sec -19.0% regression
> | testcase: change | stress-ng: stress-ng.dnotify.ops_per_sec -15.7% regression
> | testcase: change | stress-ng: stress-ng.lockbus.ops_per_sec -6.0% regression
> | testcase: change | hackbench: hackbench.throughput -4.8% regression
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Masskrug.CPU-only.seconds +5.3% improvement
> | testcase: change | phoronix-test-suite: phoronix-test-suite.darktable.Boat.CPU-only.seconds +3.5% improvement
> | testcase: change | lmbench3: lmbench3.TCP.socket.bandwidth.64B.MB/sec 11.5% improvement
> | testcase: change | stress-ng: stress-ng.sigfd.ops_per_sec 17.6% improvement
> | testcase: change | hackbench: hackbench.throughput 19.1% improvement
> | testcase: change | stress-ng: stress-ng.sock.ops_per_sec 59.4% improvement
> | testcase: change | netperf: netperf.Throughput_Mbps 60.6% improvement
> | testcase: change | stress-ng: stress-ng.sigrt.ops_per_sec 100.2% improvement
Testing results notes:
- The '+' denotes an improvement whose sign I inverted (the darktable runtimes
  are measured in seconds, so lower is better); the mixing of signs in the
  kernel test robot's output is arguably confusing.
- Any hope of getting a similar summary format by default? It's much more
  informative than just picking out the biggest regression, which wasn't even
  done correctly AFAICT.
Summary:
While there are a lot of improvements, it is primarily the nature of the
performance regressions that dictates the way forward:
- stress-ng.sigsuspend.ops_per_sec regressions, -93%:
Signal delivery performance clearly hurts from delayed preemption, but that
should be straightforward to resolve, if we are willing to commit to adding
a high-prio insta-wakeup variant API ... (a rough sketch of the sigsuspend
ping-pong pattern follows these per-test notes)
- stress-ng.exec.ops_per_sec -34% regression:
Likewise, this possibly shows that it's better to reschedule immediately
during exec() - but maybe there's more to it, and it reflects some
unfavorable migration, as suggested by the NUMA locality figures:
   old value          %change    new value         metric
    79317172           -34.2%     52217838 ±  3%   numa-numastat.node0.local_node
    79360983           -34.2%     52240348 ±  3%   numa-numastat.node0.numa_hit
    77971050           -33.2%     52068168 ±  3%   numa-numastat.node1.local_node
    78009071           -33.2%     52089987 ±  3%   numa-numastat.node1.numa_hit
       88287           -45.7%        47970 ±  2%   vmstat.system.cs
- 'blogbench' regression of -59%:
It too has a very large reduction in context switches:
   old value          %change    new value         metric
       30035           -49.7%        15097 ±  3%   vmstat.system.cs
     2243545 ±  2%      -4.1%      2152228         blogbench.read_score
    52412617           -28.3%     37571769         blogbench.time.file_system_outputs
     2682930           -74.1%       694136         blogbench.time.involuntary_context_switches
     2369329           -50.0%      1184098 ±  5%   blogbench.time.voluntary_context_switches
        5851           -35.9%         3752 ±  2%   blogbench.write_score
It's unclear to me what's happening with this one just from these stats,
but it's "write_score" that hurts the most.
- 'stress-ng.filename.ops_per_sec' regression of -19%:
This test suffered from an *increase* in context-switching, and a large
increase in CPU-idle:
   old value          %change    new value         metric
     4641666           +19.5%      5545394 ±  2%   cpuidle..usage
       90589 ±  2%     +70.5%       154471 ±  2%   vmstat.system.cs
      628439           -19.2%       507711         stress-ng.filename.ops
       10317           -19.0%         8355         stress-ng.filename.ops_per_sec
      171981           -59.7%        69333 ±  3%   stress-ng.time.involuntary_context_switches
      770691 ±  3%    +200.9%      2319214         stress-ng.time.voluntary_context_switches
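As promised above, for reference: the sigsuspend workload is essentially a
signal ping-pong along these lines (a simplified userspace sketch, not
stress-ng's actual code). Its throughput is bounded by how quickly the
freshly signalled task gets back onto a CPU, so not wakeup-preempting a
same-prio partner can cost up to a full slice per round trip:

/*
 * Simplified signal ping-pong in the spirit of the sigsuspend stressor.
 * Illustrative sketch only - not stress-ng's actual implementation.
 */
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static void usr1_handler(int sig) { (void)sig; }

int main(void)
{
        sigset_t block, suspend_mask;
        pid_t child, peer;

        signal(SIGUSR1, usr1_handler);

        sigemptyset(&block);
        sigaddset(&block, SIGUSR1);
        sigprocmask(SIG_BLOCK, &block, NULL);   /* deliver SIGUSR1 only inside sigsuspend() */
        sigemptyset(&suspend_mask);             /* mask used while suspended: SIGUSR1 unblocked */

        child = fork();
        if (child < 0) {
                perror("fork");
                return 1;
        }
        peer = child ? child : getppid();

        for (int i = 0; i < 100000; i++) {
                if (child) {                    /* parent: send the ping ... */
                        kill(peer, SIGUSR1);
                        sigsuspend(&suspend_mask);  /* ... then sleep until the pong */
                } else {                        /* child: sleep until the ping ... */
                        sigsuspend(&suspend_mask);
                        kill(peer, SIGUSR1);    /* ... then send the pong back */
                }
        }

        if (child)
                wait(NULL);
        return 0;
}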
Anyway, it's clear from these results that while many workloads hurt from
our notion of wake-preemption, there are several that benefit from it,
especially generic ones like the phoronix-test-suite - which have no good
way to turn off wakeup preemption (SCHED_BATCH might help though - see the
sketch below).
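For completeness, opting a task into SCHED_BATCH is just a standard
sched_setscheduler() call - a minimal sketch:

#define _GNU_SOURCE             /* for SCHED_BATCH on glibc */
#include <sched.h>
#include <stdio.h>

int main(void)
{
        struct sched_param sp = { .sched_priority = 0 };  /* must be 0 for SCHED_BATCH */

        /*
         * A SCHED_BATCH task is assumed to be CPU-bound and does not
         * wakeup-preempt the currently running task when it wakes up.
         */
        if (sched_setscheduler(0, SCHED_BATCH, &sp) == -1) {
                perror("sched_setscheduler(SCHED_BATCH)");
                return 1;
        }

        /* ... run the batch-style workload here ... */
        return 0;
}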
One way to approach this would be to turn the current default around:
instead of always doing wakeup-preemption, only use it where it is clearly
beneficial - such as signal delivery or exec().
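To make that concrete, one possible shape - purely illustrative, the flag
and helper names below are made up and this is not a patch against the real
wakeup path - would be to let the waker pass an explicit preemption hint and
have the same-prio check honor only that:

#include <stdbool.h>

#define WF_WANT_PREEMPT 0x100   /* hypothetical: waker requests immediate preemption */

static bool should_wakeup_preempt(int curr_prio, int wakee_prio,
                                  unsigned int wake_flags)
{
        if (wakee_prio < curr_prio)     /* higher priority (lower value): always preempt */
                return true;
        if (wakee_prio > curr_prio)     /* lower priority: never preempt */
                return false;

        /*
         * Same priority: only preempt when the wakeup site (e.g. signal
         * delivery or the exec() path) explicitly asked for it.
         */
        return wake_flags & WF_WANT_PREEMPT;
}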
The canonical way to solve this would be to give *userspace* a way to
signal that it's beneficial to preempt immediately, i.e. yield(), but right
now that interface hurts tasks that only want to give other tasks a chance
to run, without necessarily giving up their own right to run:
se->deadline += calc_delta_fair(se->slice, se);
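For reference, the userspace pattern being penalized is essentially "hand
off some work, then yield in the hope that the other task runs right away" -
an illustrative sketch:

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int work_ready;

static void *consumer(void *arg)
{
        (void)arg;
        while (!atomic_load(&work_ready))
                sched_yield();          /* busy-ish wait, yielding the CPU each time round */
        printf("consumer: got work\n");
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, consumer, NULL);

        atomic_store(&work_ready, 1);   /* hand off the work ... */

        /*
         * ... and nudge the consumer onto the CPU. With the yield
         * implementation quoted above, the yielding task's own virtual
         * deadline gets pushed out by roughly one slice - a bigger penalty
         * than callers like this usually intend.
         */
        sched_yield();

        pthread_join(t, NULL);
        return 0;
}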
Anyway, my patch is obviously a no-go as-is, and this clearly needs more work.
Thanks,
Ingo