linux-kernel - Re: BUG Report: Fork benchmark drop by 30% on aarch64


Message-ID: <14a2aaac-05d5-4b2e-a8c1-617bb4411659@arm.com>
Date: Mon, 3 Mar 2025 11:05:01 +0100
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Hagar Hemdan <hagarhem@...zon.com>
Cc: abuehaze@...zon.com, wuchi.zero@...il.com, linux-kernel@...r.kernel.org
Subject: Re: BUG Report: Fork benchmark drop by 30% on aarch64

On 21/02/2025 07:44, Hagar Hemdan wrote:
> On Mon, Feb 17, 2025 at 11:51:45PM +0100, Dietmar Eggemann wrote:
>> On 13/02/2025 19:55, Dietmar Eggemann wrote:
>>> On 11/02/2025 22:40, Hagar Hemdan wrote:
>>>> On Tue, Feb 11, 2025 at 05:27:47PM +0100, Dietmar Eggemann wrote:
>>>>> On 10/02/2025 22:31, Hagar Hemdan wrote:
>>>>>> On Mon, Feb 10, 2025 at 11:38:51AM +0100, Dietmar Eggemann wrote:
>>>>>>> On 07/02/2025 12:07, Hagar Hemdan wrote:
>>>>>>>> On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
>>>>>>>>> Hi Hagar,
>>>>>>>>>
>>>>>>>>> On 05/02/2025 16:10, Hagar Hemdan wrote:

[...]

>> './Run -c 4 spawn' on AWS instance (m7gd.16xlarge) with v6.13, 'mem=16G
>> maxcpus=4 nr_cpus=4' and Ubuntu '22.04.5 LTS':
>>
>> CFG_SCHED_AUTOGROUP | sched_ag_enabled | eff6c8ce8d4d | Fork (lps)
>>
>>          y          |        1         |      y       | 21005 (27120 **)
>>          y          |        0         |      y       | 21059 (27012 **)
>>          n          |        -         |      y       | 21299
>>          y          |        1         |      n       | 27745 *
>>          y          |        0         |      n       | 27493 *
>>          n          |        -         |      n       | 20928
>>
>> (*) So here the higher numbers are only achieved when
>> 'sched_autogroup_exit_task() -> sched_move_task() ->
>> sched_change_group()' is called for the 'spawn' tasks.
>>
>> (**) When I apply the fix from
>> https://lkml.kernel.org/r/4a9cc5ab-c538-4427-8a7c-99cb317a283f@arm.com.
> Thanks!
> Will you submit that fix upstream?

I will, I just had to understand in detail why this regression happens.

Looks like the issue is rather related to 'sgs->group_util' in
group_is_overloaded() and group_has_capacity(). If we don't
'dequeue/detach + attach/enqueue' (1) the task in sched_move_task() then
sgs->group_util is ~900 (you run 4 CPUs flat in a single MC sched domain
so sgs->group_capacity = 1024), and this leads to group_is_overloaded()
returning true and group_has_capacity() returning false much more often
than if we would do (1).

I.e. we have many more cases of 'group_overloaded' and
'group_fully_busy' in WF_FORK wakeup sched_balance_find_dst_cpu(), which
then (a) returns a CPU != smp_processor_id() much more often (which
isn't good for these extremely short running tasks (FORK + EXIT)) and
(b) also involves calling sched_balance_find_dst_group_cpu() unnecessarily
(since we deal with single CPU sched groups).

select_task_rq_fair(..., wake_flags = WF_FORK)

  cpu = smp_processor_id()

  new_cpu = sched_balance_find_dst_group(..., cpu, ...)

    do {

      update_sg_wakeup_stats()

        sgs->group_type = group_classify()
                                               w/o patch    w/ patch
          if group_is_overloaded() (*)
            return group_overloaded /* 6 */      457,141         394

          if !group_has_capacity() (**)
            return group_fully_busy /* 1 */      816,629         714

          return group_has_spare    /* 0 */    1,158,890   3,157,472

    } while group

    if local_sgs.group_type > idlest_sgs.group_type
      return idlest                              351,598         273

    case group_has_spare:

      if local_sgs.idle_cpus >= idlest_sgs.idle_cpus
        return NULL                              156,760     788,462
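
The group comparison at the end of the trace above can be sketched in
user space roughly as follows. This is only a simplified illustration,
not the kernel function: 'struct sg_stats' and push_to_idlest() are
hypothetical names, the enum lists only the three values shown in the
trace, and the load-based handling of equal non-spare group types is
not modelled (the sketch simply keeps the task local in that case).

```c
#include <stdbool.h>

/* Only the values that appear in the trace; intermediate types omitted. */
enum group_type {
    group_has_spare  = 0,
    group_fully_busy = 1,
    group_overloaded = 6,
};

/* Hypothetical, reduced stand-in for the per-group wakeup statistics. */
struct sg_stats {
    enum group_type group_type;
    unsigned int idle_cpus;
};

/*
 * Returns true when the task would be pushed to the idlest group
 * ("return idlest" in the trace) and false when it stays in the local
 * group ("return NULL" in the trace).
 */
static inline bool push_to_idlest(const struct sg_stats *local,
                                  const struct sg_stats *idlest)
{
    /* Local group is busier than the idlest one: push the task away. */
    if (local->group_type > idlest->group_type)
        return true;

    /* Local group is idler than the idlest one: keep the task local. */
    if (local->group_type < idlest->group_type)
        return false;

    /* Both have spare capacity: prefer the group with more idle CPUs. */
    if (local->group_type == group_has_spare)
        return local->idle_cpus < idlest->idle_cpus;

    /* Assumption: other equal-type cases are not modelled, stay local. */
    return false;
}
```

With the local group classified as group_overloaded and another group
as group_has_spare, push_to_idlest() returns true, which corresponds to
the forked task landing on a CPU != smp_processor_id().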


(*)

  if (sgs->group_capacity * 100) <
      (sgs->group_util * imbalance_pct)          951,705         856
    return true

  sgs->group_util ~ 900 and sgs->group_capacity = 1024 (1 CPU per sched group)


(**)

  if (sgs->group_capacity * 100) >
      (sgs->group_util * imbalance_pct)
    return true                                1,087,555   3,163,152

  return false                                 1,332,974         882
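
To make the arithmetic behind (*) and (**) concrete, here is a minimal
user-space sketch of just the two utilization comparisons. The helper
names are hypothetical, and the real kernel functions check more than
this (e.g. the number of running tasks and the runnable signal), which
is left out. The imbalance_pct value is not stated above; the usage note
assumes 117, which I believe is the default at the MC level.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: only the utilization comparison from (*). */
static inline bool util_says_overloaded(unsigned long capacity,
                                        unsigned long util,
                                        unsigned int imbalance_pct)
{
    assert(capacity > 0);
    return capacity * 100 < util * imbalance_pct;
}

/* Hypothetical helper: only the utilization comparison from (**). */
static inline bool util_says_has_capacity(unsigned long capacity,
                                          unsigned long util,
                                          unsigned int imbalance_pct)
{
    assert(capacity > 0);
    return capacity * 100 > util * imbalance_pct;
}

/* Smallest utilization for which the comparison in (*) becomes true. */
static inline unsigned long overload_threshold(unsigned long capacity,
                                               unsigned int imbalance_pct)
{
    assert(imbalance_pct > 0);
    return capacity * 100 / imbalance_pct + 1;
}
```

Under the assumed imbalance_pct of 117 and group_capacity = 1024, the
comparison in (*) flips at a group_util of 876, so a group_util of ~900
sits just above it: 1024 * 100 = 102400 < 900 * 117 = 105300.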


The counters for (*) and (**) cover both the 'wakeup' and the
'load-balance' paths, so they don't match the wakeup-only numbers above!

In this test run I got 608,092 new wakeups w/o and 789,572 (~ +30%)
w/ the patch when running './Run -c 4 -i 1 spawn' on AWS instance
(m7gd.16xlarge) with v6.13, 'mem=16G maxcpus=4 nr_cpus=4' and
Ubuntu '22.04.5 LTS'.

> Do you think that this fix is the same as reverting commit eff6c8ce8d4d and
> its follow up commit fa614b4feb5a? I mean what does commit eff6c8ce8d4d 
> actually improve?

There are occurrences in which 'group == tsk->sched_task_group' and
'!(tsk->flags & PF_EXITING)' so there the early bail might help w/o
the negative impact on sched benchmarks.