linux-kernel - Re: BUG Report: Fork benchmark drop by 30% on aarch64

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250210213155.GA649@amazon.com>
Date: Mon, 10 Feb 2025 21:31:55 +0000
From: Hagar Hemdan <hagarhem@...zon.com>
To: Dietmar Eggemann <dietmar.eggemann@....com>, <hagarhem@...zon.com>
CC: <abuehaze@...zon.com>, <linux-kernel@...r.kernel.org>
Subject: Re: BUG Report: Fork benchmark drop by 30% on aarch64

On Mon, Feb 10, 2025 at 11:38:51AM +0100, Dietmar Eggemann wrote:
> On 07/02/2025 12:07, Hagar Hemdan wrote:
> > On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
> >> Hi Hagar,
> >>
> >> On 05/02/2025 16:10, Hagar Hemdan wrote:
> >>> Hi,
> >>>
> >>> There is about a 30% drop in fork benchmark [1] on aarch64 and a 10%
> >>> drop on x86_64 using kernel v6.13.1.
> >>>
> >>> Git bisect pointed to commit eff6c8ce8d4d ("sched/core: Reduce cost
> >>> of sched_move_task when config autogroup") which merged starting
> >>> v6.4-rc1.
> >>>
> >>> The regression only happens when number of CPUs is equal to number
> >>> of threads [2] that fork test is creating which means it's only visible
> >>> under CPU contention.
> >>>
> >>> I used m6g.xlarge AWS EC2 Instance with 4 vCPUs and 16 GiB RAM for ARM64
> >>> and m6a.xlarge with also 4 vCPUs and 16 GiB RAM for x86_64.
> >>>
> >>> I noticed this regression exists only when autogroup config is enabled.
> >>
> >> So '# CONFIG_SCHED_AUTOGROUP is not set' in .config so we have: 
> >>
> >> static inline void sched_autogroup_exit_task(struct task_struct *p) { }
> >>
> >> I.e. doing a 'echo 0 > /proc/sys/kernel/sched_autogroup_enabled' still
> >> shows this issue?
> > yes, when I do 'echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled',
> > It behaves like the disable 'CONFIG_SCHED_AUTOGROUP'.
> 
> OK.
> 
> >>> Run the fork test with these combinations and autogroup is enabled:
> >>>
> >>> Arch      | commit eff6c8ce8d4d | Fork Result (lps)  |  %Cpu(s)
> >>> ----------+---------------------+--------------------+------------------
> >>> aarch64   | without             | 28677.0            |  3.2 us, 96.7 sy
> >>> aarch64   | with                | 19860.7 (30% drop) |  2.7 us, 79.4 sy
> >>> x86_64    | without             | 27776.2            |  3.1 us, 96.9 sy
> >>> x86_64    | with                | 25020.6 (10% drop) |  4.1 us, 93.2 sy
> >>> ----------+---------------------+--------------------+------------------
> >>
> >> Can you rerun with:
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 3e5a6bf587f9..62cc50c79a78 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -9057,7 +9057,7 @@ void sched_move_task(struct task_struct *tsk)
> >>          * group changes.
> >>          */
> >>         group = sched_get_task_group(tsk);
> >> -       if (group == tsk->sched_task_group)
> >> +       if ((group == tsk->sched_task_group) && !(tsk->flags & PF_EXITING))
> >>                 return;
> > I tried that and I see it fixed the regression and the cpu utilization
> > is 100% with it.
> 
> W/ this one, we force the 'spawn' tasks into: 
> 
>   sched_change_group() -> task_change_group_fair()
> 
> > I'd like to ask if this like reverting the patch in case of exit path
> > and also means the enqueue/dequeue are needed in case of task exiting,
> > right?
> 
> The 'spawn' tasks in sched_move_task() are 'running' and 'queued' so we
> call dequeue_task(), put_prev_task(), enqueue_task() and
> set_next_task().
> 
> I guess what we need here is the cfs_rq->avg.load_avg (cpu_load() in
> case of root tg) update in:
> 
>   task_change_group_fair() -> detach_task_cfs_rq() -> ...,
>   attach_task_cfs_rq() -> ...
> 
> since this is used for WF_FORK, WF_EXEC handling in wakeup:
> 
>   select_task_rq_fair() -> sched_balance_find_dst_cpu() ->
>   sched_balance_find_dst_group_cpu()
> 
> in form of 'least_loaded_cpu' and 'load = cpu_load(cpu_rq(i)'.
> 
> You mentioned AutoGroups (AG). I don't see this issue on my Debian 12
> Juno-r0 Arm64 board. When I run w/ AG, 'group' is '/' and
> 'tsk->sched_task_group' is '/autogroup-x' so the condition 'if (group ==
> tsk->sched_task_group)' isn't true in sched_move_task(). If I disable AG
> then they match "/" == "/".
> 
> I assume you run Ubuntu on your AWS instances? What kind of
> 'cgroup/taskgroup' related setup are you using?

I'm running AL2023 and use Vanilla kernel 6.13.1 on m6g.xlarge AWS instance.
AL2023 uses cgroupv2 by default.
> 
> Can you run w/ this debug snippet w/ and w/o AG enabled?

I have run that and have attached the trace files to this email.
> 
> -->8--
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3e5a6bf587f9..c696740177b7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9035,6 +9035,8 @@ static void sched_change_group(struct task_struct *tsk, struct task_group *group
>                 set_task_rq(tsk, task_cpu(tsk));
>  }
>  
> +extern void task_group_path(struct task_group *tg, char *path, int plen);
> +
>  /*
>   * Change task's runqueue when it moves between groups.
>   *
> @@ -9048,6 +9050,8 @@ void sched_move_task(struct task_struct *tsk)
>                 DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
>         struct task_group *group;
>         struct rq *rq;
> +       int cpu = cpu_of(rq_of(cfs_rq_of(&tsk->se)));
> +       char buf_1[64], buf_2[64];
>  
>         CLASS(task_rq_lock, rq_guard)(tsk);
>         rq = rq_guard.rq;
> @@ -9057,6 +9061,13 @@ void sched_move_task(struct task_struct *tsk)
>          * group changes.
>          */
>         group = sched_get_task_group(tsk);
> +
> +       task_group_path(group, buf_1, sizeof(buf_1));
> +       task_group_path(tsk->sched_task_group, buf_2, sizeof(buf_2));
> +
> +       trace_printk("%s %d cpu=%d group=%s tsk->sched_task_group=%s\n",
> +                    tsk->comm, tsk->pid, cpu, buf_1, buf_2);
> +
>         if (group == tsk->sched_task_group)
>                 return;
>  
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index a1be00a988bf..c967d55f971f 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -701,7 +701,9 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
>  static DEFINE_SPINLOCK(sched_debug_lock);
>  static char group_path[PATH_MAX];
>  
> -static void task_group_path(struct task_group *tg, char *path, int plen)
> +extern void task_group_path(struct task_group *tg, char *path, int plen);
> +
> +void task_group_path(struct task_group *tg, char *path, int plen)
>  {
>         if (autogroup_path(tg, path, plen))
>                 return;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 26958431deb7..f90d28be9695 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -13198,9 +13198,16 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
>  #endif
>  }
>  
> +extern void task_group_path(struct task_group *tg, char *path, int plen);
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  static void task_change_group_fair(struct task_struct *p)
>  {
> +       int cpu = cpu_of(rq_of(cfs_rq_of(&p->se)));
> +       struct task_group *tg = task_group(p);
> +       unsigned long cfs_rq_load_avg_pre, cfs_rq_load_avg_post;
> +       char buf[64];
> +
>         /*
>          * We couldn't detach or attach a forked task which
>          * hasn't been woken up by wake_up_new_task().
> @@ -13208,14 +13215,33 @@ static void task_change_group_fair(struct task_struct *p)
>         if (READ_ONCE(p->__state) == TASK_NEW)
>                 return;
>  
> +       task_group_path(tg, buf, sizeof(buf));
> +       cfs_rq_load_avg_pre = cfs_rq_of(&p->se)->avg.load_avg;
> +
>         detach_task_cfs_rq(p);
>  
> +       cfs_rq_load_avg_post = cfs_rq_of(&p->se)->avg.load_avg;
> +
> +       trace_printk("%s %d (d) cpu=%d tg=%s load_avg=%lu->%lu\n",
> +                    p->comm, p->pid, cpu, buf, cfs_rq_load_avg_pre,
> +                    cfs_rq_load_avg_post);
> +
>  #ifdef CONFIG_SMP
>         /* Tell se's cfs_rq has been changed -- migrated */
>         p->se.avg.last_update_time = 0;
>  #endif
>         set_task_rq(p, task_cpu(p));
> +
> +       task_group_path(tg, buf, sizeof(buf));
> +       cfs_rq_load_avg_pre = cfs_rq_of(&p->se)->avg.load_avg;
> +
>         attach_task_cfs_rq(p);
> +
> +       cfs_rq_load_avg_post = cfs_rq_of(&p->se)->avg.load_avg;
> +
> +       trace_printk("%s %d (a) cpu=%d tg=%s load_avg=%lu->%lu\n",
> +                    p->comm, p->pid, cpu, buf, cfs_rq_load_avg_pre,
> +                    cfs_rq_load_avg_post);
>  }

Download attachment "fork_traces.tar.gz" of type "application/gzip" (446011 bytes)