Message-ID: <Y65RNzj522d7Q3OI@chenyu5-mobl1>
Date: Fri, 30 Dec 2022 10:47:19 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: K Prateek Nayak <kprateek.nayak@....com>
CC: Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Tim Chen <tim.c.chen@...el.com>,
Mel Gorman <mgorman@...hsingularity.net>,
Juri Lelli <juri.lelli@...hat.com>,
"Rik van Riel" <riel@...riel.com>, Aaron Lu <aaron.lu@...el.com>,
Abel Wu <wuyun.abel@...edance.com>,
Yicong Yang <yangyicong@...ilicon.com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Ingo Molnar <mingo@...hat.com>,
"Dietmar Eggemann" <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Hillf Danton <hdanton@...a.com>,
Honglei Wang <wanghonglei@...ichuxing.com>,
Len Brown <len.brown@...el.com>,
Chen Yu <yu.chen.surf@...il.com>,
Tianchen Ding <dtcccc@...ux.alibaba.com>,
Joel Fernandes <joel@...lfernandes.org>,
Josh Don <joshdon@...gle.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH v4 0/2] sched/fair: Choose the CPU where short task
is running during wake up
On 2022-12-29 at 12:46:59 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> Including the detailed results from testing below.
>
> tl;dr
>
> o There seem to be 3 noticeable regressions:
> - tbench for lower number of clients. The schedstat data shows
> an increase in wait time.
> - SpecJBB MultiJVM performance drops as the workload prefers
> an idle CPU over a busy one.
> - Unixbench-pipe benchmark performance drops.
>
> o Most benchmark numbers remain the same.
>
> o Small gains seen for ycsb-mongodb and unixbench-syscall.
>
Thanks, Prateek, for the testing.
> On 12/16/2022 11:38 AM, Chen Yu wrote:
> > The main purpose of this change is to avoid too many unnecessary
> > cross-CPU wake-ups. Frequent cross-CPU wake-ups cause significant
> > damage to some workloads, especially on high-core-count systems.
> >
> > This patch set inhibits cross-CPU wake-ups by placing the wakee on
> > the waking CPU or its previous CPU if both the waker and the wakee
> > are short-duration tasks.
> >
> > The first patch introduces the definition of a short-duration
> > task. The second patch leverages the first to choose a local or
> > previous CPU for the wakee.
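For readers skimming the thread, the core idea of PATCH 2/2 can be
summarized by the sketch below. This is a simplified userspace model
with made-up names (task_lite, pick_wakeup_cpu(), the threshold value),
not the kernel implementation itself:

    #include <stdbool.h>
    #include <stdio.h>

    struct task_lite {
            unsigned long long dur_avg_ns;  /* smoothed per-slice runtime */
            int cpu;                        /* CPU the task last ran on */
    };

    /* Illustrative value; the series uses sysctl_sched_min_granularity / 8,
     * see the changelog below. */
    static const unsigned long long short_threshold_ns = 3000000ULL / 8;

    static bool is_short_task(const struct task_lite *p)
    {
            return p->dur_avg_ns < short_threshold_ns;
    }

    /* Return the CPU to place the wakee on, or -1 for the normal idle scan. */
    static int pick_wakeup_cpu(const struct task_lite *waker,
                               const struct task_lite *wakee)
    {
            if (is_short_task(waker) && is_short_task(wakee))
                    return waker->cpu;  /* stay local, skip the cross-CPU wakeup */
            return -1;
    }

    int main(void)
    {
            struct task_lite waker = { .dur_avg_ns = 100000, .cpu = 3 };
            struct task_lite wakee = { .dur_avg_ns = 200000, .cpu = 7 };

            printf("place wakee on CPU %d\n", pick_wakeup_cpu(&waker, &wakee));
            return 0;
    }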
> >
> > Changes since v3:
> > 1. Honglei and Josh had concerns that the threshold for short
> > task duration could be too long. Decreased the threshold from
> > sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8);
> > the '8' comes from get_update_sysctl_factor().
> > 2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion.
> > 3. Move the calculation of average duration from put_prev_task_fair()
> > to dequeue_task_fair(), fixing an issue in v3 where
> > put_prev_task_fair() is not invoked by pick_next_task_fair()
> > in the fast path, so dur_avg could not be updated in time.
> > 4. Fix the comment in PATCH 2/2 that "WRITE_ONCE(CPU1->ttwu_pending, 1);"
> > on CPU0 happens earlier than CPU1 reading "ttwu_list->p0", per Tianchen.
> > 5. Move the scan for a CPU running a short-duration task from
> > select_idle_cpu() to select_idle_sibling(), because there is no CPU
> > scan involved, per Yicong.
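To make items 1 and 3 above concrete: the per-task duration can be
thought of as an EWMA folded in at dequeue time, with the 1/8 weight
mirroring the '8' from get_update_sysctl_factor(). Below is a simplified
sketch (se_lite and update_dur_avg() are illustrative names, not the
patch itself); with item 2, the smoothed value can then be inspected
via /proc/<pid>/sched:

    #include <stdio.h>

    struct se_lite {
            unsigned long long dur_avg;        /* smoothed runtime per slice, ns */
            unsigned long long prev_sum_exec;  /* sum_exec_runtime at last dequeue */
    };

    /* Called at dequeue: fold the just-finished slice into the average. */
    static void update_dur_avg(struct se_lite *se,
                               unsigned long long sum_exec_runtime)
    {
            long long dur = (long long)(sum_exec_runtime - se->prev_sum_exec);
            long long delta = dur - (long long)se->dur_avg;

            se->prev_sum_exec = sum_exec_runtime;
            se->dur_avg += delta / 8;          /* EWMA with weight 1/8 */
    }

    int main(void)
    {
            struct se_lite se = { 0, 0 };
            unsigned long long sum = 0;

            for (int i = 0; i < 4; i++) {
                    sum += 100000;             /* each slice ran 100us */
                    update_dur_avg(&se, sum);
                    printf("dur_avg = %llu ns\n", se.dur_avg);
            }
            return 0;
    }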
>
> Following are the results from running standard benchmarks on a
> dual socket Zen3 (2 x 64C/128T) machine configured in different
> NPS modes.
>
> NPS modes are used to logically divide a single socket into
> multiple NUMA regions.
> Following is the NUMA configuration for each NPS mode on the system:
>
> NPS1: Each socket is a NUMA node.
> Total 2 NUMA nodes in the dual socket machine.
>
> Node 0: 0-63, 128-191
> Node 1: 64-127, 192-255
>
> NPS2: Each socket is further logically divided into 2 NUMA regions.
> Total 4 NUMA nodes exist over 2 sockets.
>
> Node 0: 0-31, 128-159
> Node 1: 32-63, 160-191
> Node 2: 64-95, 192-223
> Node 3: 96-127, 224-255
>
> NPS4: Each socket is logically divided into 4 NUMA regions.
> Total 8 NUMA nodes exist over 2 sockets.
>
> Node 0: 0-15, 128-143
> Node 1: 16-31, 144-159
> Node 2: 32-47, 160-175
> Node 3: 48-63, 176-191
> Node 4: 64-79, 192-207
> Node 5: 80-95, 208-223
> Node 6: 96-111, 224-239
> Node 7: 112-127, 240-255
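(Side note for anyone reproducing the setup: the per-node CPU lists
above can be cross-checked with numactl --hardware, or read straight
out of sysfs. A minimal reader, assuming contiguously numbered nodes:)

    #include <stdio.h>

    int main(void)
    {
            char path[64], buf[256];

            for (int node = 0; ; node++) {
                    snprintf(path, sizeof(path),
                             "/sys/devices/system/node/node%d/cpulist", node);
                    FILE *f = fopen(path, "r");
                    if (!f)
                            break;  /* assume nodes are numbered 0..N-1 */
                    if (fgets(buf, sizeof(buf), f))
                            printf("Node %d: %s", node, buf);
                    fclose(f);
            }
            return 0;
    }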
>
> Benchmark Results:
>
> Kernel versions:
> - tip: 6.1.0-rc2 tip sched/core
> - sis_short: 6.1.0-rc2 tip sched/core + this series
>
> When the testing started, the tip was at:
> commit d6962c4fe8f9 "sched: Clear ttwu_pending after enqueue_task()"
>
OK, I'll rebase the code and check whether I can reproduce the regression
with SNC enabled. Previously I tested v3 with SNC enabled and did not see a
regression, so I did not enable SNC when testing v4. I'll rerun the test with
SNC enabled.
[...]
>
> o NPS1
>
> Test Metric Parallelism tip sis_short
> unixbench-dhry2reg Hmean unixbench-dhry2reg-1 48929419.48 ( 0.00%) 48992339.28 ( 0.13%)
> unixbench-dhry2reg Hmean unixbench-dhry2reg-512 6266355505.80 ( 0.00%) 6251441423.60 ( -0.24%)
> unixbench-syscall Amean unixbench-syscall-1 2994319.73 ( 0.00%) 2665595.13 * 10.98%*
> unixbench-syscall Amean unixbench-syscall-512 7349715.87 ( 0.00%) 7645690.70 * -4.03%*
> unixbench-pipe Hmean unixbench-pipe-1 2830206.03 ( 0.00%) 2508957.89 * -11.35%* *
> unixbench-pipe Hmean unixbench-pipe-512 326207828.01 ( 0.00%) 306588592.66 * -6.01%* *
I'll also run unixbench-pipe in my environment. Since will-it-scale's
context_switch1 also stresses pipes and did show an improvement, I'll check
why unixbench-pipe regresses.
[...]
> total waittime by tasks on this processor (in jiffies) : 176365, 258702 | 46.69| * Wait time is much longer
This seems to fall back to the v1 behavior, where we inhibited the idle CPU
scan and caused problems.
> total timeslices run on this cpu : 116797, 106922 | -8.45|
> -------------------------------------------------------------------------------------------------------------------------
> < ----------------------------------------------------------------- Wakeup info: ------------------------------------ >
> Wakeups on same SMT cpus = all_cpus (avg) : 0, 0
> Wakeups on same MC cpus = all_cpus (avg) : 116689, 106797 | -8.48|
> Wakeups on same DIE cpus = all_cpus (avg) : 2, 4
> Wakeups on same NUMA cpus = all_cpus (avg) : 5, 7
> Affine wakeups on same SMT cpus = all_cpus (avg) : 0, 0
> Affine wakeups on same MC cpus = all_cpus (avg) : 116667, 106781 | -8.47|
> Affine wakeups on same DIE cpus = all_cpus (avg) : 2, 4
> Affine wakeups on same NUMA cpus = all_cpus (avg) : 5, 6
> --------------------------------------------------------------------------------------------------------------------------
>
> The rq->rq_sched_info.pcount and rq->sched_count seem to
> have reduced proportionally.
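As a side note, the same wait-time signal is also visible per task: with
schedstats enabled (sysctl kernel.sched_schedstats=1), each task's
/proc/<pid>/schedstat reports on-CPU time, runqueue wait time, and the
number of timeslices. A minimal reader:

    #include <stdio.h>

    int main(int argc, char **argv)
    {
            char path[64];
            unsigned long long oncpu, wait, slices;
            FILE *f;

            if (argc != 2) {
                    fprintf(stderr, "usage: %s <pid>\n", argv[0]);
                    return 1;
            }
            snprintf(path, sizeof(path), "/proc/%s/schedstat", argv[1]);
            f = fopen(path, "r");
            if (!f) {
                    fprintf(stderr, "failed to open %s\n", path);
                    return 1;
            }
            if (fscanf(f, "%llu %llu %llu", &oncpu, &wait, &slices) != 3) {
                    fprintf(stderr, "failed to parse %s\n", path);
                    fclose(f);
                    return 1;
            }
            printf("on-cpu %llu ns, wait %llu ns, %llu timeslices\n",
                   oncpu, wait, slices);
            fclose(f);
            return 0;
    }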
>
> >
> > Changes since v2:
> >
> > 1. Peter suggested comparing the duration of the waker and the cost to
> > scan for an idle CPU: if the cost is higher than the task duration,
> > do not waste time finding an idle CPU; choose the local or previous
> > CPU directly. A prototype was created based on this suggestion.
> > However, according to the test results, this prototype did not inhibit
> > the cross-CPU wakeup and did not bring improvement, because the cost
> > to find an idle CPU is small in the problematic scenario. The root
> > cause of the problem is a race condition between scanning for an idle
> > CPU and task enqueue (please refer to the commit log in PATCH 2/2).
> > So v3 does not change the core logic of v2, with some refinement based
> > on Peter's suggestion.
> >
> > 2. Simplify the logic to record the task duration per Peter and Abel's suggestion.
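The rejected prototype in item 1 above boils down to the comparison
sketched below (illustrative names; in reality the scan cost would come
from measured scan statistics rather than a constant):

    #include <stdbool.h>
    #include <stdio.h>

    /* Skip the idle-CPU scan when scanning is expected to cost more
     * than the waker typically runs. */
    static bool should_skip_idle_scan(unsigned long long waker_dur_avg_ns,
                                      unsigned long long avg_scan_cost_ns)
    {
            return avg_scan_cost_ns > waker_dur_avg_ns;
    }

    int main(void)
    {
            /* In the problematic scenario the scan cost is small, so the
             * scan is almost never skipped and the enqueue race remains. */
            printf("skip=%d\n", should_skip_idle_scan(50000, 2000));
            return 0;
    }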
> >
> > This change brings overall improvement on some microbenchmarks, on
> > both Intel and AMD platforms.
> >
> > v3: https://lore.kernel.org/lkml/cover.1669862147.git.yu.c.chen@intel.com/
> > v2: https://lore.kernel.org/all/cover.1666531576.git.yu.c.chen@intel.com/
> > v1: https://lore.kernel.org/lkml/20220915165407.1776363-1-yu.c.chen@intel.com/
> >
> > Chen Yu (2):
> > sched/fair: Introduce short duration task check
> > sched/fair: Choose the CPU where short task is running during wake up
> >
> > include/linux/sched.h | 3 +++
> > kernel/sched/core.c | 2 ++
> > kernel/sched/debug.c | 1 +
> > kernel/sched/fair.c | 32 ++++++++++++++++++++++++++++++++
> > kernel/sched/features.h | 1 +
> > 5 files changed, 39 insertions(+)
> >
>
> All numbers are with turbo and C2 enabled. I wonder if the
> check "(5 * nr < 3 * sd->span_weight)" in v2 helped workloads
> like tbench and SpecJBB. I'll queue some runs with the condition
> added back and a separate run with turbo and C2 disabled to see
> if they help. I'll update the thread once the results are in.
Thanks for helping check whether the nr check from v2 could bring the
improvement back. However, Peter seems to have concerns regarding the nr
check; I'll think about it more.
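For reference, the v2 gate that Prateek mentions amounts to taking the
short-task shortcut only while fewer than 60% of the domain's CPUs have
runnable tasks; a toy version (nr and span_weight stand in for the
domain-wide running-task count and sd->span_weight):

    #include <stdbool.h>
    #include <stdio.h>

    static bool shortcut_allowed(unsigned int nr, unsigned int span_weight)
    {
            return 5 * nr < 3 * span_weight;   /* i.e. nr < 60% of span */
    }

    int main(void)
    {
            /* 70 running tasks on a 128-CPU LLC: 350 < 384, shortcut allowed */
            printf("%d\n", shortcut_allowed(70, 128));
            return 0;
    }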
thanks,
Chenyu
>
> --
> Thanks and Regards,
> Prateek