Message-ID: <20250303135715.GA21308@amazon.com>
Date: Mon, 3 Mar 2025 13:57:15 +0000
From: Hagar Hemdan <hagarhem@...zon.com>
To: Dietmar Eggemann <dietmar.eggemann@....com>
CC: <hagarhem@...zon.com>, <abuehaze@...zon.com>,
<linux-kernel@...r.kernel.org>
Subject: Re: BUG Report: Fork benchmark drop by 30% on aarch64
On Mon, Mar 03, 2025 at 11:05:01AM +0100, Dietmar Eggemann wrote:
> On 21/02/2025 07:44, Hagar Hemdan wrote:
> > On Mon, Feb 17, 2025 at 11:51:45PM +0100, Dietmar Eggemann wrote:
> >> On 13/02/2025 19:55, Dietmar Eggemann wrote:
> >>> On 11/02/2025 22:40, Hagar Hemdan wrote:
> >>>> On Tue, Feb 11, 2025 at 05:27:47PM +0100, Dietmar Eggemann wrote:
> >>>>> On 10/02/2025 22:31, Hagar Hemdan wrote:
> >>>>>> On Mon, Feb 10, 2025 at 11:38:51AM +0100, Dietmar Eggemann wrote:
> >>>>>>> On 07/02/2025 12:07, Hagar Hemdan wrote:
> >>>>>>>> On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
> >>>>>>>>> Hi Hagar,
> >>>>>>>>>
> >>>>>>>>> On 05/02/2025 16:10, Hagar Hemdan wrote:
>
> [...]
>
> >> './Run -c 4 spawn' on AWS instance (m7gd.16xlarge) with v6.13, 'mem=16G
> >> maxcpus=4 nr_cpus=4' and Ubuntu '22.04.5 LTS':
> >>
> >> CFG_SCHED_AUTOGROUP | sched_ag_enabled | eff6c8ce8d4d | Fork (lps)
> >> --------------------+------------------+--------------+------------------
> >>          y          |        1         |      y       | 21005 (27120 **)
> >>          y          |        0         |      y       | 21059 (27012 **)
> >>          n          |        -         |      y       | 21299
> >>          y          |        1         |      n       | 27745 *
> >>          y          |        0         |      n       | 27493 *
> >>          n          |        -         |      n       | 20928
> >>
> >> (*) So here the higher numbers are only achieved when
> >> 'sched_autogroup_exit_task() -> sched_move_task() ->
> >> sched_change_group()' is called for the 'spawn' tasks.
> >>
> >> (**) When I apply the fix from
> >> https://lkml.kernel.org/r/4a9cc5ab-c538-4427-8a7c-99cb317a283f@arm.com.
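(Side note for anyone trying to reproduce this: UnixBench's 'spawn' test is
essentially a fork()+exit() loop whose iterations are counted in loops per
second. Below is only a rough standalone approximation of that workload for
illustration, not the UnixBench source.)

/* Rough approximation of the 'spawn' workload: count how many
 * fork() + immediate _exit() + waitpid() cycles complete in a fixed
 * interval and report loops per second (lps). Illustration only.
 */
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	const int duration = 10;	/* seconds per measurement run */
	unsigned long loops = 0;
	time_t end = time(NULL) + duration;

	while (time(NULL) < end) {
		pid_t pid = fork();

		if (pid == 0)		/* child: exit right away */
			_exit(0);
		if (pid < 0) {
			perror("fork");
			return 1;
		}
		waitpid(pid, NULL, 0);	/* parent: reap and loop */
		loops++;
	}
	printf("%.1f lps\n", (double)loops / duration);
	return 0;
}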
> > Thanks!
> > Will you submit that fix upstream?
>
> I will, I just had to understand in detail why this regression happens.
>
> Looks like the issue is rather related to 'sgs->group_util' in
> group_is_overloaded() and group_has_capacity(). If we don't
> 'dequeue/detach + attach/enqueue' (1) the task in sched_move_task(), then
> sgs->group_util is ~900 (you run 4 CPUs flat in a single MC sched domain,
> so sgs->group_capacity = 1024), and this leads to group_is_overloaded()
> returning true and group_has_capacity() returning false much more often
> than if we did (1).
>
> I.e. we have many more cases of 'group_is_overloaded' and
> 'group_fully_busy' in the WF_FORK wakeup path sched_balance_find_dst_cpu(),
> which then (a) much more often returns a CPU != smp_processor_id() (which
> isn't good for these extremely short-running tasks (FORK + EXIT)) and
> (b) unnecessarily involves calling sched_balance_find_dst_group_cpu()
> (since we deal with single-CPU sched domains).
>
> select_task_rq_fair(..., wake_flags = WF_FORK)
>
> cpu = smp_processor_id()
>
> new_cpu = sched_balance_find_dst_group(..., cpu, ...)
>
> do {
>
> update_sg_wakeup_stats()
>
> sgs->group_type = group_classify()
> w/o patch w/ patch
> if group_is_overloaded() (*)
> return group_overloaded /* 6 */ 457,141 394
>
> if !group_has_capacity() (**)
> return group_fully_busy /* 1 */ 816,629 714
>
> return group_has_spare /* 0 */ 1,158,890 3,157,472
>
> } while group
>
> if local_sgs.group_type > idlest_sgs.group_type
> return idlest 351,598 273
>
> case group_has_spare:
>
> if local_sgs.idle_cpus >= idlest_sgs.idle_cpus
> return NULL 156,760 788,462
>
>
> (*)
>
> if sgs->group_capacity * 100 <
> sgs->group_util * imbalance_pct 951,705 856
> return true
>
> sgs->group_util ~ 900 and sgs->group_capacity = 1024 (1 CPU per sched group)
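Just to make those numbers concrete on my side: assuming the usual non-SMT
default of imbalance_pct = 117 (my assumption, the value isn't stated above),
the (*) check indeed keeps tripping as long as the stale util hangs around:

/* Plugging the values from above into the (*) comparison. The
 * imbalance_pct of 117 and the lower "util after the fix" value are my
 * assumptions for illustration, not numbers taken from the trace.
 */
#include <stdbool.h>
#include <stdio.h>

static bool overloaded(unsigned long capacity, unsigned long util,
		       unsigned int imbalance_pct)
{
	return capacity * 100 < util * imbalance_pct;
}

int main(void)
{
	/* stale util left behind by exited spawn tasks */
	printf("util=900: %d\n", overloaded(1024, 900, 117)); /* 102400 < 105300 -> 1 */
	/* with the util properly detached (illustrative lower value) */
	printf("util=400: %d\n", overloaded(1024, 400, 117)); /* 102400 < 46800  -> 0 */
	return 0;
}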
>
>
> (**)
>
> if sgs->group_capacity * 100 >
> sgs->group_util * imbalance_pct
> return true 1,087,555 3,163,152
>
> return false 1,332,974 882
>
>
> (*) and (**) count both the 'wakeup' and the 'load-balance' path, so
> they don't match the wakeup-only numbers above!
Thank you for the detailed explanation. We appreciate your effort and
will await the fix.
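If I read the trace right, the two decisions that matter here boil down to
something like the condensed sketch below (my reading of the flow above, not
the actual kernel code; returning NULL means the newly forked task stays on
the waking CPU):

/* Condensed sketch of the two branches counted above, using the
 * group_type numbers from the trace. Not the real
 * sched_balance_find_dst_group(); many cases are omitted.
 */
#include <stdio.h>

enum group_type {
	group_has_spare = 0,
	group_fully_busy = 1,
	group_overloaded = 6,
};

struct sg_stats {
	enum group_type group_type;
	unsigned int idle_cpus;
};

struct sched_group;			/* opaque in this sketch */

static struct sched_group *pick_dst_group(const struct sg_stats *local,
					  const struct sg_stats *idlest_sgs,
					  struct sched_group *idlest)
{
	/* Stale group_util classifies the local group as overloaded or
	 * fully busy, so this fires and a remote CPU gets picked: */
	if (local->group_type > idlest_sgs->group_type)
		return idlest;

	/* With the fix both groups usually have spare capacity and the
	 * local group ties on idle CPUs, so the fork stays local: */
	if (local->group_type == group_has_spare &&
	    local->idle_cpus >= idlest_sgs->idle_cpus)
		return NULL;

	return idlest;
}

int main(void)
{
	struct sched_group *idlest = (struct sched_group *)0x1; /* dummy */
	struct sg_stats stale  = { group_overloaded, 0 };	/* local, w/o fix */
	struct sg_stats fixed  = { group_has_spare,  3 };	/* local, w/  fix */
	struct sg_stats remote = { group_has_spare,  3 };

	printf("w/o fix: stays local? %d\n", !pick_dst_group(&stale, &remote, idlest));
	printf("w/  fix: stays local? %d\n", !pick_dst_group(&fixed, &remote, idlest));
	return 0;
}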
>
> In this test run I got 608,092 new wakeups w/o and 789,572 (~+ 30%)
> w/ the patch when running './Run -c 4 -i 1 spawn' on AWS instance
> (m7gd.16xlarge) with v6.13, 'mem=16G maxcpus=4 nr_cpus=4' and
> Ubuntu '22.04.5 LTS'
>
> > Do you think that this fix is the same as reverting commit eff6c8ce8d4d and
> > its follow-up commit fa614b4feb5a? I mean, what does commit eff6c8ce8d4d
> > actually improve?
>
> There are occurrences in which 'group == tsk->sched_task_group' and
> '!(tsk->flags & PF_EXITING)' hold, so there the early bail-out might help
> w/o the negative impact on the sched benchmarks.
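So, if I understand correctly, the direction would be to keep the early
bail-out only when both of those conditions hold, roughly like this
(paraphrasing the idea only, not your actual patch):

/* Paraphrased illustration of the discussed direction: keep the cheap
 * early bail-out of eff6c8ce8d4d when the task group is unchanged AND
 * the task is not exiting, so an exiting task still goes through
 * dequeue/detach + attach/enqueue and leaves no stale util behind.
 * Not the kernel code and not the posted fix.
 */
#include <stdbool.h>
#include <stdio.h>

#define PF_EXITING	0x00000004	/* same flag bit as in the kernel */

struct task_group;			/* opaque in this sketch */

struct task_struct {
	unsigned int flags;
	struct task_group *sched_task_group;
};

/* true => sched_move_task() must still do the full move cycle */
static bool needs_full_move(const struct task_struct *tsk,
			    const struct task_group *new_group)
{
	if (new_group == tsk->sched_task_group && !(tsk->flags & PF_EXITING))
		return false;		/* early bail-out is safe here */
	return true;
}

int main(void)
{
	struct task_group *g = (struct task_group *)0x1;	/* dummy group */
	struct task_struct live  = { .flags = 0,          .sched_task_group = g };
	struct task_struct dying = { .flags = PF_EXITING, .sched_task_group = g };

	printf("live task, same group:    full move? %d\n", needs_full_move(&live, g));
	printf("exiting task, same group: full move? %d\n", needs_full_move(&dying, g));
	return 0;
}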
ok, thanks!