[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <a71a227a-5f1b-8ebc-bfde-0babf267b7d9@linux.alibaba.com>
Date: Mon, 5 Dec 2022 10:36:25 +0800
From: Tianchen Ding <dtcccc@...ux.alibaba.com>
To: Chen Yu <yu.c.chen@...el.com>
Cc: Juri Lelli <juri.lelli@...hat.com>,
Rik van Riel <riel@...riel.com>,
Aaron Lu <aaron.lu@...el.com>,
Abel Wu <wuyun.abel@...edance.com>,
K Prateek Nayak <kprateek.nayak@....com>,
Yicong Yang <yangyicong@...ilicon.com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Ingo Molnar <mingo@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
Hillf Danton <hdanton@...a.com>,
Honglei Wang <wanghonglei@...ichuxing.com>,
Len Brown <len.brown@...el.com>,
Chen Yu <yu.chen.surf@...il.com>,
Joel Fernandes <joel@...lfernandes.org>,
Josh Don <joshdon@...gle.com>, linux-kernel@...r.kernel.org,
kernel test robot <yujie.liu@...el.com>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Tim Chen <tim.c.chen@...el.com>,
Mel Gorman <mgorman@...hsingularity.net>
Subject: Re: [PATCH v3 2/2] sched/fair: Choose the CPU where short task is
running during wake up
Hi Chen.
On 2022/12/1 16:44, Chen Yu wrote:
> [Problem Statement]
> For a workload that is doing frequent context switches, the throughput
> scales well until the number of instances reaches a peak point. After
> that peak point, the throughput drops significantly if the number of
> instances continues to increase.
>
> The will-it-scale context_switch1 test case exposes the issue. The
> test platform has 112 CPUs per LLC domain. The will-it-scale launches
> 1, 8, 16 ... 112 instances respectively. Each instance is composed
> of 2 tasks, and each pair of tasks would do ping-pong scheduling via
> pipe_read() and pipe_write(). No task is bound to any CPU.
> It is found that, once the number of instances is higher than
> 56(112 tasks in total, every CPU has 1 task), the throughput
> drops accordingly if the instance number continues to increase:
>
> ^
> throughput|
> | X
> | X X X
> | X X X
> | X X
> | X X
> | X
> | X
> | X
> | X
> |
> +-----------------.------------------->
> 56
> number of instances
>
> [Symptom analysis]
>
> The performance downgrading was caused by a high system idle
> percentage(around 20% ~ 30%). The CPUs waste a lot of time in
> idle and do nothing. As a comparison, if set CPU affinity to
> these workloads and stops them from migrating among CPUs,
> the idle percentage drops to nearly 0%, and the throughput
> increases by about 300%. This indicates room for optimization.
>
> The reason for the high idle percentage is different before/after
> commit f3dd3f674555 ("sched: Remove the limitation of WF_ON_CPU
> on wakelist if wakee cpu is idle").
>
> [Before the commit]
> The bottleneck is the runqueue spinlock.
>
> nr_instance rq lock percentage
> 1 1.22%
> 8 1.17%
> 16 1.20%
> 24 1.22%
> 32 1.46%
> 40 1.61%
> 48 1.63%
> 56 1.65%
> --------------------------
> 64 3.77% |
> 72 5.90% | increase
> 80 7.95% |
> 88 9.98% v
> 96 11.81%
> 104 13.54%
> 112 15.13%
>
> And top 2 rq lock hot paths:
>
> (path1):
> raw_spin_rq_lock_nested.constprop.0;
> try_to_wake_up;
> default_wake_function;
> autoremove_wake_function;
> __wake_up_common;
> __wake_up_common_lock;
> __wake_up_sync_key;
> pipe_write;
> new_sync_write;
> vfs_write;
> ksys_write;
> __x64_sys_write;
> do_syscall_64;
> entry_SYSCALL_64_after_hwframe;write
>
> (path2):
> raw_spin_rq_lock_nested.constprop.0;
> __sched_text_start;
> schedule_idle;
> do_idle;
> cpu_startup_entry;
> start_secondary;
> secondary_startup_64_no_verify
>
> task A tries to wake up task B on CPU1, then task A grabs the
> runqueue lock of CPU1. If CPU1 is about to quit idle, it needs
> to grab its lock which has been taken by someone else. Then
> CPU1 takes more time to quit which hurts the performance.
>
> [After the commit]
> The cause is the race condition between select_task_rq() and
> the task enqueue.
>
> Suppose there are nr_cpus pairs of ping-pong scheduling
> tasks. For example, p0' and p0 are ping-pong scheduling,
> so do p1' <=> p1, and p2'<=> p2. None of these tasks are
> bound to any CPUs. The problem can be summarized as:
> more than 1 wakers are stacked on 1 CPU, which slows down
> waking up their wakees:
>
> CPU0 CPU1 CPU2
>
> p0' p1' => idle p2'
>
> try_to_wake_up(p0) try_to_wake_up(p2);
> CPU1 = select_task_rq(p0); CPU1 = select_task_rq(p2);
> ttwu_queue(p0, CPU1); ttwu_queue(p2, CPU1);
> __ttwu_queue_wakelist(p0, CPU1); => ttwu_list->p0
> quiting cpuidle_idle_call()
>
> ttwu_list->p2->p0 <= __ttwu_queue_wakelist(p2, CPU1);
>
> WRITE_ONCE(CPU1->ttwu_pending, 1); WRITE_ONCE(CPU1->ttwu_pending, 1);
>
> p0' => idle
> sched_ttwu_pending()
> enqueue_task(p2 and p0)
>
> idle => p2
>
> ...
> p2 time slice expires
> ...
> !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> <=== !!! p2 delays the wake up of p0' !!!
> !!! causes long idle on CPU0 !!!
> p2 => p0 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> p0 wakes up p0'
>
> idle => p0'
>
>
>
> Since there are many waker/wakee pairs in the system, the chain reaction
> causes many CPUs to be victims. These idle CPUs wait for their waker to
> be scheduled.
>
> Actually Tiancheng has mentioned above issue here[2].
>
First I want to say that this issue (race condition between selecting idle cpu
and enqueuing task) always exists before or after the commit f3dd3f674555. And I
know this is a big issue so in [2] I only try to fix a small part of it. Of
course I'm glad to see you trying solving this issue too :-)
There may be a bit wrong in your comment about the order.
"WRITE_ONCE(CPU1->ttwu_pending, 1);" on CPU0 is earlier than CPU1 getting
"ttwu_list->p0", right?
Thanks.
Powered by blists - more mailing lists