Message-ID: <Y8lrnvLzGGJmlPor@chenyu5-mobl1>
Date: Fri, 20 Jan 2023 00:11:10 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Yicong Yang <yangyicong@...wei.com>
CC: Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Tim Chen <tim.c.chen@...el.com>,
Mel Gorman <mgorman@...hsingularity.net>,
<yangyicong@...ilicon.com>, Juri Lelli <juri.lelli@...hat.com>,
Rik van Riel <riel@...riel.com>,
Aaron Lu <aaron.lu@...el.com>,
Abel Wu <wuyun.abel@...edance.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
"Ingo Molnar" <mingo@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
"Daniel Bristot de Oliveira" <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Hillf Danton <hdanton@...a.com>,
Honglei Wang <wanghonglei@...ichuxing.com>,
Len Brown <len.brown@...el.com>,
Chen Yu <yu.chen.surf@...il.com>,
Tianchen Ding <dtcccc@...ux.alibaba.com>,
"Joel Fernandes" <joel@...lfernandes.org>,
Josh Don <joshdon@...gle.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH v4 0/2] sched/fair: Choose the CPU where short task
is running during wake up
Hi Yicong,
On 2023-01-17 at 10:53:25 +0800, Yicong Yang wrote:
> Hi Chen Yu,
>
> On 2022/12/16 14:08, Chen Yu wrote:
> > The main purpose of this change is to avoid too many unnecessary
> > cross-CPU wakeups. Frequent cross-CPU wakeups bring significant
> > damage to some workloads, especially on high core count systems.
> >
> > This patch set inhibits cross-CPU wakeups by placing the wakee on
> > the waking CPU or its previous CPU, if both the waker and the wakee
> > are short-duration tasks.
> >
> > The first patch introduces the definition of a short-duration task.
> > The second patch leverages the first patch to choose the local or
> > previous CPU for the wakee.
> >
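To recap for anyone following the thread, the logic in this series is roughly
the following sketch (simplified; the exact helper names and conditions in
kernel/sched/fair.c may differ from the posted patches):

/*
 * Patch 1 (sketch): a task is treated as short-duration if its average
 * run duration per wakeup, p->se.dur_avg, is below
 * sysctl_sched_min_granularity / 8 (the divisor comes from
 * get_update_sysctl_factor()).  The dur_avg check also guards against
 * tasks that have not accumulated a duration yet.
 */
static inline int is_short_task(struct task_struct *p)
{
	return sched_feat(SIS_SHORT) && p->se.dur_avg &&
	       p->se.dur_avg < sysctl_sched_min_granularity / 8;
}

/*
 * Patch 2 (sketch): inside select_idle_sibling(), if both the task
 * currently running on the target CPU (the waker) and the wakee are
 * short-duration tasks, skip the idle CPU scan and keep the wakee
 * local.  The posted patch applies more conditions (including the
 * previous CPU case), so treat this only as an illustration.
 */
	if (is_short_task(p) && is_short_task(cpu_curr(target)))
		return target;
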
>
> Tested with tbench and netperf locally and MySQL remotely. Unfortunately
> I cannot reproduce the improvement and I see some regression.
Thank you for the test.
> Detailed information below. The tests are done on 6.2-rc1 on a 2x64-CPU
> Kunpeng 920 server with 4 NUMA nodes, no SMT, and CONFIG_SCHED_CLUSTER
> enabled.
>
> tbench and netperf were tested on a single node and on two nodes of one socket.
>
Does this platform have 2 sockets, with 2 NUMA nodes per socket and 32 CPUs
per node, and are the benchmarks bound to the first socket?
And may I know how many CPUs there are in one LLC? Is it also 32?
> Further looked into the 64-thread case since it regresses most, mainly on the
> server (mysqld) side. In this case most of the load is on node 0 and partly
> on node 1. With SIS_SHORT the utilization on node 0 stays almost the same,
> but the utilization on node 1 drops to nearly half compared to the baseline.
> That may partly answer why the performance decreases.
>
OK, I see. So it seems that the NUMA sched domain load balance does not spread
tasks fast enough while SIS_SHORT frequently wakes tasks up on the same node.
But I thought select_idle_sibling() would wake the tasks up on the same
node (LLC) anyway, so why does SIS_SHORT inhibit the load balance more?
> mpstat 60s, non-idle%
> vanilla sis_short
> node 0 94.25 91.67 -2.73%
> node 1 46.41 30.03 -35.29%
>
> Previously there's an argument that sync wakeup from interrupts may be incorrect,
I see. Do you mean that wakeups from interrupts might cause tasks to stack on
the same CPU, because the interrupts are delivered to that CPU? I thought the
userspace irq load balancer would spread the interrupts. Or do you mean
something else?
> and SIS_SHORT seems to consolidate that and tie more tasks to the interrupt node.
>
> SIS_SHORT will likely make short tasks wake up locally but seems to increase the
> wait time in this case. schedstat for CPUs on node 0 shows:
>
Agree.
> vanilla sis_short
> time elapsed 60.201085s 60.200395s
> [cpu Average, counts/sec]
> #0 cpu zeros 0.0 0.0
> #1 sched_yield() 0.03 0.12
> #2 zeros 0.0 0.0
> #3 schedule() 409415.36 340304.86
> #4 sched->idle() 125616.92 91078.52
> #5 try_to_wake_up() 317404.25 280896.88
> #6 try_to_wakeup(local) 115451.87 140567.68
> #7 run time 26071494674.87 26164834556.84
> #8 wait time 3861592891.75 7361228637.54 (+ 90.63%)
> #9 timeslices, 283747.1 249182.34
>
> Since it's a 128-CPU machine, the min_granularity is 3000000ns, which means a
> task running within 375000ns is treated as a short task. Most mysql worker
> threads' se.dur_avg is around 110000ns, so they will be regarded as short tasks.
>
> To exclude the influence of Patch 1 (which always records the dur_avg and related
> information), I tested the 64-thread case with SIS_SHORT enabled or disabled through
> the debugfs control, and it seems Patch 2 mainly dominates the results.
>
> NO_SIS_SHORT SIS_SHORT
> TPS 17446.54 16464.83 -5.63%
> QPS 279144.61 263437.21 -5.63%
> avg lat 3.67 3.89 -5.99%
> 95th lat 5.09 5.37 -5.50%
>
> Please let me know if you need more information.
>
Very useful information. I'm thinking of taking nr_llc into consideration to
adjust the threshold for a short task, so that on a system with a smaller LLC
we do not inhibit the idle CPU scan too much.
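Something like the untested sketch below, where sd_llc_size is the number of
CPUs sharing the LLC and the reference size of 64 is purely illustrative:

/*
 * Untested idea, for illustration only: scale the short-task threshold
 * by the LLC size, so that a platform with a smaller LLC (e.g. 32 CPUs
 * per LLC) uses a proportionally smaller threshold and keeps scanning
 * for idle CPUs.  The reference size of 64 is arbitrary.
 */
static inline unsigned int short_task_threshold(int cpu)
{
	int nr_llc = per_cpu(sd_llc_size, cpu);

	return (sysctl_sched_min_granularity / 8) * nr_llc / 64;
}

Assuming 32 CPUs per LLC on your platform, that would roughly halve the
375000ns threshold; the ~110000ns mysqld workers would still be classified
as short, though, so the exact scaling probably needs more thought.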
thanks,
Chenyu
> Thanks.
>
> > Changes since v3:
> > 1. Honglei and Josh had concerns that the threshold of short
> > task duration could be too long. Decreased the threshold from
> > sysctl_sched_min_granularity to (sysctl_sched_min_granularity / 8),
> > and the '8' comes from get_update_sysctl_factor().
> > 2. Export p->se.dur_avg to /proc/{pid}/sched per Yicong's suggestion.
> > 3. Move the calculation of average duration from put_prev_task_fair()
> > to dequeue_task_fair(), because there is an issue in v3:
> > put_prev_task_fair() will not be invoked by pick_next_task_fair()
> > in the fast path, thus the dur_avg could not be updated in time.
> > 4. Fix the comment in PATCH 2/2, that "WRITE_ONCE(CPU1->ttwu_pending, 1);"
> > on CPU0 is earlier than CPU1 getting "ttwu_list->p0", per Tianchen.
> > 5. Move the scan for CPU with short duration task from select_idle_cpu()
> > to select_idle_siblings(), because there is no CPU scan involved, per
> > Yicong.
> >
> > Changes since v2:
> >
> > 1. Peter suggested comparing the duration of the waker and the cost to
> > scan for an idle CPU: if the cost is higher than the task duration,
> > do not waste time finding an idle CPU, choose the local or previous
> > CPU directly. A prototype was created based on this suggestion.
> > However, according to the test results, this prototype did not inhibit
> > the cross-CPU wakeup and did not bring improvement, because the cost
> > to find an idle CPU is small in the problematic scenario. The root
> > cause of the problem is a race condition between scanning for an idle
> > CPU and task enqueue (please refer to the commit log in PATCH 2/2).
> > So v3 does not change the core logic of v2, with some refinement based
> > on Peter's suggestion.
> >
> > 2. Simplify the logic to record the task duration per Peter and Abel's suggestion.
> >
> > This change brings overall improvement on some microbenchmarks, on both
> > Intel and AMD platforms.
> >
> > v3: https://lore.kernel.org/lkml/cover.1669862147.git.yu.c.chen@intel.com/
> > v2: https://lore.kernel.org/all/cover.1666531576.git.yu.c.chen@intel.com/
> > v1: https://lore.kernel.org/lkml/20220915165407.1776363-1-yu.c.chen@intel.com/
> >
> > Chen Yu (2):
> > sched/fair: Introduce short duration task check
> > sched/fair: Choose the CPU where short task is running during wake up
> >
> > include/linux/sched.h | 3 +++
> > kernel/sched/core.c | 2 ++
> > kernel/sched/debug.c | 1 +
> > kernel/sched/fair.c | 32 ++++++++++++++++++++++++++++++++
> > kernel/sched/features.h | 1 +
> > 5 files changed, 39 insertions(+)
> >