Message-ID: <20250307110733.GA10571@amazon.com>
Date: Fri, 7 Mar 2025 11:07:33 +0000
From: Hagar Hemdan <hagarhem@...zon.com>
To: Dietmar Eggemann <dietmar.eggemann@....com>
CC: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Steven Rostedt <rostedt@...dmis.org>, "Ben
Segall" <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin
Schneider <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>,
<abuehaze@...zon.com>
Subject: Re: [PATCH] /sched/core: Fix Unixbench spawn test regression
On Thu, Mar 06, 2025 at 05:26:35PM +0100, Dietmar Eggemann wrote:
> Hagar reported a 30% drop in UnixBench spawn test with commit
> eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config
> autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
> (aarch64) (single level MC sched domain) [1].
>
> There is an early bail from sched_move_task() if p->sched_task_group is
> equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are
> pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope'
> (Ubuntu '22.04.5 LTS').
>
> So in:
>
>   do_exit()
>
>     sched_autogroup_exit_task()
>
>       sched_move_task()
>
>         if sched_get_task_group(p) == p->sched_task_group
>           return
>
>         /* p is enqueued */
>         dequeue_task()              \
>         sched_change_group()        |
>           task_change_group_fair()  |
>             detach_task_cfs_rq()    |                              (1)
>             set_task_rq()           |
>             attach_task_cfs_rq()    |
>         enqueue_task()              /
>
> (1) isn't called for p anymore.
>
> Turns out that the regression is related to sgs->group_util in
> group_is_overloaded() and group_has_capacity(). If (1) isn't called for
> all the 'spawn' tasks then sgs->group_util is ~900 and
> sgs->group_capacity = 1024 (single CPU sched domain) and this leads to
> group_is_overloaded() returning true (2) and group_has_capacity() false
> (3) much more often compared to the case when (1) is called.
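
For illustration only, the arithmetic behind (2) and (3) can be sketched
as below. This is a trimmed-down model, not the code in
kernel/sched/fair.c: the struct, the function names and the
utilization-only checks are hypothetical simplifications (the runnable
checks are left out), and imbalance_pct = 117 is an assumption that the
mail above does not state. With that value the threshold sits at a
group_util of roughly 875 for a capacity of 1024, so ~900 lands on the
"busy" side.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for the per-group wakeup stats. */
struct sg_stats {
	unsigned int nr_running;	/* tasks running in the group */
	unsigned int group_weight;	/* number of CPUs in the group */
	unsigned long capacity;		/* 1024 for one full-size CPU */
	unsigned long util;		/* summed utilization of the group */
};

enum group_type { GROUP_HAS_SPARE, GROUP_FULLY_BUSY, GROUP_OVERLOADED };

/* Utilization-only overload check; the runnable check is omitted here. */
static bool is_overloaded(unsigned int imbalance_pct, const struct sg_stats *s)
{
	if (s->nr_running <= s->group_weight)
		return false;
	return s->capacity * 100 < s->util * imbalance_pct;
}

/* Utilization-only capacity check; the runnable check is omitted here. */
static bool has_capacity(unsigned int imbalance_pct, const struct sg_stats *s)
{
	if (s->nr_running < s->group_weight)
		return true;
	return s->capacity * 100 > s->util * imbalance_pct;
}

/* Only the three outcomes discussed in this thread are modelled. */
static enum group_type classify(unsigned int imbalance_pct,
				const struct sg_stats *s)
{
	assert(imbalance_pct >= 100);	/* the margin is a percentage >= 100 */

	if (is_overloaded(imbalance_pct, s))
		return GROUP_OVERLOADED;	/* (2) */
	if (!has_capacity(imbalance_pct, s))
		return GROUP_FULLY_BUSY;	/* (3) */
	return GROUP_HAS_SPARE;			/* (4) */
}
```

In this model a single-CPU group with util 900 is classified as
overloaded with two running tasks and as fully busy with one, while
util 800 yields has_spare in both cases.
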
>
> I.e. there are many more cases of 'group_is_overloaded' and
> 'group_fully_busy' in the WF_FORK wakeup path of
> sched_balance_find_dst_cpu(), which then returns a CPU !=
> smp_processor_id() much more often (5).
>
> This isn't good for these extremely short-running tasks (FORK + EXIT)
> and also involves calling sched_balance_find_dst_group_cpu()
> unnecessarily (single CPU sched domain).
>
> Instead if (1) is called for 'p->flags & PF_EXITING' then the path
> (4),(6) is taken much more often.
>
>   select_task_rq_fair(..., wake_flags = WF_FORK)
>
>     cpu = smp_processor_id()
>
>     new_cpu = sched_balance_find_dst_cpu(..., cpu, ...)
>
>       group = sched_balance_find_dst_group(..., cpu)
>
>         do {
>
>           update_sg_wakeup_stats()
>
>             sgs->group_type = group_classify()
>
>               if group_is_overloaded()                             (2)
>                 return group_overloaded
>
>               if !group_has_capacity()                             (3)
>                 return group_fully_busy
>
>               return group_has_spare                               (4)
>
>         } while group
>
>         if local_sgs.group_type > idlest_sgs.group_type
>           return idlest                                            (5)
>
>         case group_has_spare:
>
>           if local_sgs.idle_cpus >= idlest_sgs.idle_cpus
>             return NULL                                            (6)
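
As a rough sketch of the final comparison in the flow above, the choice
between staying local (6) and moving to the idlest group (5) can be
modelled as below. The struct and function names are hypothetical, only
the branches quoted in this thread are covered, and the fallback for
equal non-spare types is a simplification of this sketch rather than the
kernel's behaviour.

```c
#include <stdbool.h>

/* Ordered so that a larger value means a busier group. */
enum group_type { GROUP_HAS_SPARE, GROUP_FULLY_BUSY, GROUP_OVERLOADED };

/* Hypothetical summary of one sched group's wakeup stats. */
struct sg_summary {
	enum group_type type;
	unsigned int idle_cpus;
};

/*
 * Returns true when the forked task stays in the local group (the
 * "return NULL" outcome (6)) and false when the idlest group is
 * picked (5).
 */
static bool stay_local(const struct sg_summary *local,
		       const struct sg_summary *idlest)
{
	if (local->type < idlest->type)
		return true;		/* local group is less busy */
	if (local->type > idlest->type)
		return false;		/* (5): idlest group is less busy */

	if (local->type == GROUP_HAS_SPARE)
		return local->idle_cpus >= idlest->idle_cpus;	/* (6) */

	/* Other equal-type cases are not modelled; assume idlest wins. */
	return false;
}
```

The point of the model: once the local group is classified as overloaded
or fully busy more often, the first two comparisons send the task away
before the idle_cpus check in (6) is ever reached.
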
>
> Unixbench Tests './Run -c 4 spawn' on:
>
> (a) VM AWS instance (m7gd.16xlarge) with v6.13 ('maxcpus=4 nr_cpus=4')
> and Ubuntu 22.04.5 LTS (aarch64).
>
> Shell & test run in '/user.slice/user-1000.slice/session-1.scope'.
>
>     w/o patch    w/ patch
>     21005        27120
>
> (b) i7-13700K with tip/sched/core ('nosmt maxcpus=8 nr_cpus=8') and
> Ubuntu 22.04.5 LTS (x86_64).
>
> Shell & test run in '/A'.
>
>     w/o patch    w/ patch
>     67675        88806
>
> CONFIG_SCHED_AUTOGROUP=y & /proc/sys/kernel/sched_autogroup_enabled equal
> 0 or 1.
>
> [1] https://lkml.kernel.org/r/20250205151026.13061-1-hagarhem@amazon.com
>
> Reported-by: Hagar Hemdan <hagarhem@...zon.com>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@....com>
> ---
> kernel/sched/core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b00f884701a6..ca0e3c2eb94a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9064,7 +9064,7 @@ void sched_move_task(struct task_struct *tsk)
>  	 * group changes.
>  	 */
>  	group = sched_get_task_group(tsk);
> -	if (group == tsk->sched_task_group)
> +	if ((group == tsk->sched_task_group) && !(tsk->flags & PF_EXITING))
>  		return;
>
>  	update_rq_clock(rq);
> --
> 2.34.1
>
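
Read as a predicate, the one-line change in the diff amounts to the
following. This is a userspace sketch with hypothetical stand-in types
and helper names; the PF_EXITING value is assumed to mirror
include/linux/sched.h and is only used here to make the example
self-contained.

```c
#include <stdbool.h>

/* Assumed to mirror the kernel's definition; illustrative only. */
#define PF_EXITING 0x00000004

/* Hypothetical stand-ins for the kernel structures. */
struct task_group { int id; };
struct task {
	unsigned int flags;
	const struct task_group *sched_task_group;
};

/* Early bail before the patch: skip whenever the group is unchanged. */
static bool skip_before_patch(const struct task *t,
			      const struct task_group *target)
{
	return target == t->sched_task_group;
}

/* Early bail with the patch: exiting tasks no longer take the shortcut. */
static bool skip_after_patch(const struct task *t,
			     const struct task_group *target)
{
	return target == t->sched_task_group && !(t->flags & PF_EXITING);
}
```

Only an exiting task whose group is unchanged behaves differently: it
used to return early and now goes through the dequeue/detach/attach/
enqueue sequence marked (1) again.
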
Thank you very much for submitting the fix and for all the explanations.
Could you please add a "Fixes:" tag for commit eff6c8ce8d4d to your patch,
so that it gets backported to stable 6.12?
Also, this was actually discovered internally by <abuehaze@...zon>, so
please add Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@...zon.com> and
Tested-by: Hagar Hemdan <hagarhem@...zon.com>.
Thanks,
Hagar