linux-kernel - Re: BUG Report: Fork benchmark drop by 30% on aarch64

Message-ID: <5f92761b-c7d4-4b96-9398-183a5bf7556a@arm.com>
Date: Mon, 17 Feb 2025 23:51:45 +0100
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Hagar Hemdan <hagarhem@...zon.com>
Cc: abuehaze@...zon.com, linux-kernel@...r.kernel.org
Subject: Re: BUG Report: Fork benchmark drop by 30% on aarch64

On 13/02/2025 19:55, Dietmar Eggemann wrote:
> On 11/02/2025 22:40, Hagar Hemdan wrote:
>> On Tue, Feb 11, 2025 at 05:27:47PM +0100, Dietmar Eggemann wrote:
>>> On 10/02/2025 22:31, Hagar Hemdan wrote:
>>>> On Mon, Feb 10, 2025 at 11:38:51AM +0100, Dietmar Eggemann wrote:
>>>>> On 07/02/2025 12:07, Hagar Hemdan wrote:
>>>>>> On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
>>>>>>> Hi Hagar,
>>>>>>>
>>>>>>> On 05/02/2025 16:10, Hagar Hemdan wrote:
>>>
>>> [...]
>>>
>>>>> The 'spawn' tasks in sched_move_task() are 'running' and 'queued' so we
>>>>> call dequeue_task(), put_prev_task(), enqueue_task() and
>>>>> set_next_task().
>>>>>
>>>>> I guess what we need here is the cfs_rq->avg.load_avg (cpu_load() in
>>>>> case of root tg) update in:
>>>>>
>>>>>   task_change_group_fair() -> detach_task_cfs_rq() -> ...,
>>>>>   attach_task_cfs_rq() -> ...
>>>>>
>>>>> since this is used for WF_FORK, WF_EXEC handling in wakeup:
>>>>>
>>>>>   select_task_rq_fair() -> sched_balance_find_dst_cpu() ->
>>>>>   sched_balance_find_dst_group_cpu()
>>>>>
>>>>> in form of 'least_loaded_cpu' and 'load = cpu_load(cpu_rq(i))'.
>>>>>
>>>>> You mentioned AutoGroups (AG). I don't see this issue on my Debian 12
>>>>> Juno-r0 Arm64 board. When I run w/ AG, 'group' is '/' and
>>>>> 'tsk->sched_task_group' is '/autogroup-x' so the condition 'if (group ==
>>>>> tsk->sched_task_group)' isn't true in sched_move_task(). If I disable AG
>>>>> then they match "/" == "/".
>>>>>
>>>>> I assume you run Ubuntu on your AWS instances? What kind of
>>>>> 'cgroup/taskgroup' related setup are you using?
>>>>
>>>> I'm running AL2023 and use Vanilla kernel 6.13.1 on m6g.xlarge AWS instance.
>>>> AL2023 uses cgroupv2 by default.
>>>>>
>>>>> Can you run w/ this debug snippet w/ and w/o AG enabled?
>>>>
>>>> I have run that and have attached the trace files to this email.
>>>
>>> Thanks!
>>>
>>> So w/ AG you see that 'group' and 'tsk->sched_task_group' are both
>>> '/user.slice/user-1000.slice/session-1.scope' so we bail for those tasks
>>> w/o doing the 'cfs_rq->avg.load_avg' update I described above.
>>
>> yes, both groups are identical so it returns from sched_move_task()
>> without {de|en}queue and without calling task_change_group_fair().
> 
> OK.
> 
>>> You said that there is no issue w/o AG. 
>>
>> To clarify, by "no regression when autogroup is disabled" I meant that
>> the fork results w/o AG remain consistent with or without the commit
>> "sched/core: Reduce cost of sched_move_task when config autogroup". However,
>> the fork results are consistently lower when AG is disabled compared to when
>> it's enabled (without the commit applied). This is illustrated in the tables
>> provided in the report.
> 
> OK, but I don't quite get yet why w/o AG the results are lower even w/o
> eff6c8ce8d4d? Have to dig further I guess. Maybe there is more than this
> p->se.avg.load_avg update when we go via task_change_group_fair()?

'./Run -c 4 spawn' on AWS instance (m7gd.16xlarge) with v6.13, 'mem=16G
maxcpus=4 nr_cpus=4' and Ubuntu '22.04.5 LTS':

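CFG_SCHED_AUTOGROUP | sched_ag_enabled | eff6c8ce8d4d | Fork (lps)
--------------------+------------------+--------------+-----------------
         y          |        1         |      y       | 21005 (27120 **)
         y          |        0         |      y       | 21059 (27012 **)
         n          |        -         |      y       | 21299
         y          |        1         |      n       | 27745 *
         y          |        0         |      n       | 27493 *
         n          |        -         |      n       | 20928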

(*) So here the higher numbers are only achieved when
'sched_autogroup_exit_task() -> sched_move_task() ->
sched_change_group()' is called for the 'spawn' tasks.

(**) When I apply the fix from
https://lkml.kernel.org/r/4a9cc5ab-c538-4427-8a7c-99cb317a283f@arm.com.
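
To put the table into relative terms, here is a small sketch of the
arithmetic. 'relative_drop_percent()' is a hypothetical helper made up for
illustration; it is not part of UnixBench or the kernel. The inputs are the
Fork (lps) values from the table above.

```c
#include <assert.h>

/*
 * Relative drop of 'measured' against 'baseline', in percent.
 * Hypothetical helper for illustration only.
 */
static double relative_drop_percent(double baseline, double measured)
{
	/* A zero or negative baseline makes the ratio meaningless. */
	assert(baseline > 0.0);

	return (baseline - measured) / baseline * 100.0;
}
```

For the AG-enabled rows, relative_drop_percent(27745, 21005) is roughly 24%,
and with the fix applied (27120) the gap to 27745 shrinks to roughly 2%.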

These results support the story that we need:

  task_change_group_fair() -> detach_task_cfs_rq() -> ...,
  attach_task_cfs_rq() -> ...

i.e. the related 'cfs_rq->avg.load_avg' update during do_exit() so that
WF_FORK handling in wakeup:

  select_task_rq_fair() -> sched_balance_find_dst_cpu() ->
  sched_balance_find_dst_group_cpu()

can use more recent 'load = cpu_load(cpu_rq(i))' values to get a better
'least_loaded_cpu'.
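
To illustrate why stale load values matter here, below is a simplified
user-space model, not the kernel code. It only mimics the load comparison
that yields 'least_loaded_cpu'; the idle-CPU handling the real function also
does is left out. The names 'model_least_loaded_cpu()' and
'model_detach_load()' and the idea of a plain per-CPU array are assumptions
made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Model of the 'least_loaded_cpu' search: return the index of the CPU
 * with the smallest load value. Assumption: load[] stands in for
 * cpu_load(cpu_rq(i)); idle-state checks are not modelled.
 */
static int model_least_loaded_cpu(const unsigned long *load, size_t nr_cpus)
{
	unsigned long min_load;
	int least_loaded_cpu = 0;
	size_t i;

	assert(load != NULL && nr_cpus > 0);

	min_load = load[0];
	for (i = 1; i < nr_cpus; i++) {
		if (load[i] < min_load) {
			min_load = load[i];
			least_loaded_cpu = (int)i;
		}
	}

	return least_loaded_cpu;
}

/*
 * Model of the effect of the detach path on exit: remove the exiting
 * task's contribution from its CPU's load, clamping at zero.
 */
static void model_detach_load(unsigned long *load, size_t cpu,
			      unsigned long contrib)
{
	load[cpu] = load[cpu] > contrib ? load[cpu] - contrib : 0;
}
```

With made-up loads { 300, 200, 400, 500 } and an exiting task that still
contributes 250 on CPU 0, the stale view picks CPU 1. Once the contribution
is detached, CPU 0 drops to 50 and becomes the least loaded CPU.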

The AWS instance runs systemd, so the shell and the test run in a taskgroup
other than root, which trumps autogroups:

  task_wants_autogroup()

     if (tg != &root_task_group)
       return false;

     ...

That's why 'group == tsk->sched_task_group' in sched_move_task() is
true, which is different on my Juno: the shell from which I launch the
tests runs in '/' so that the test ends up in an autogroup, i.e. 'group
!= tsk->sched_task_group'.
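
The two cases can be modelled in user space as follows. This is only a
sketch under assumptions: the 'model_*' names are made up, task groups are
compared by pointer as in the 'group == tsk->sched_task_group' check, and the
checks elided by '...' in task_wants_autogroup() are deliberately not
modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for a task group; only identity matters here. */
struct model_task_group {
	const char *path;
};

static struct model_task_group model_root_task_group = { "/" };

/*
 * Models only the check quoted above: a task in a non-root task group
 * never wants an autogroup. Further checks are not modelled.
 */
static bool model_task_wants_autogroup(const struct model_task_group *tg)
{
	assert(tg != NULL);

	if (tg != &model_root_task_group)
		return false;

	return true;
}

/*
 * Models the early return in sched_move_task(): nothing is done when
 * the target group equals the task's current sched_task_group.
 */
static bool model_move_is_skipped(const struct model_task_group *group,
				  const struct model_task_group *sched_task_group)
{
	return group == sched_task_group;
}
```

In the systemd case both arguments are the session scope, so the move is
skipped. In the Juno case 'group' is '/' and 'tsk->sched_task_group' is
'/autogroup-x', so the move (and with it the load update) happens.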

[...]