Message-ID: <20250303135715.GA21308@amazon.com>
Date: Mon, 3 Mar 2025 13:57:15 +0000
From: Hagar Hemdan <hagarhem@...zon.com>
To: Dietmar Eggemann <dietmar.eggemann@....com>
CC: <hagarhem@...zon.com>, <abuehaze@...zon.com>,
<linux-kernel@...r.kernel.org>
Subject: Re: BUG Report: Fork benchmark drop by 30% on aarch64
On Mon, Mar 03, 2025 at 11:05:01AM +0100, Dietmar Eggemann wrote:
> On 21/02/2025 07:44, Hagar Hemdan wrote:
> > On Mon, Feb 17, 2025 at 11:51:45PM +0100, Dietmar Eggemann wrote:
> >> On 13/02/2025 19:55, Dietmar Eggemann wrote:
> >>> On 11/02/2025 22:40, Hagar Hemdan wrote:
> >>>> On Tue, Feb 11, 2025 at 05:27:47PM +0100, Dietmar Eggemann wrote:
> >>>>> On 10/02/2025 22:31, Hagar Hemdan wrote:
> >>>>>> On Mon, Feb 10, 2025 at 11:38:51AM +0100, Dietmar Eggemann wrote:
> >>>>>>> On 07/02/2025 12:07, Hagar Hemdan wrote:
> >>>>>>>> On Fri, Feb 07, 2025 at 10:14:54AM +0100, Dietmar Eggemann wrote:
> >>>>>>>>> Hi Hagar,
> >>>>>>>>>
> >>>>>>>>> On 05/02/2025 16:10, Hagar Hemdan wrote:
>
> [...]
>
> >> './Run -c 4 spawn' on AWS instance (m7gd.16xlarge) with v6.13, 'mem=16G
> >> maxcpus=4 nr_cpus=4' and Ubuntu '22.04.5 LTS':
> >>
> >> CFG_SCHED_AUTOGROUP | sched_ag_enabled | eff6c8ce8d4d | Fork (lps)
> >> --------------------+------------------+--------------+------------------
> >>          y          |        1         |      y       | 21005 (27120 **)
> >>          y          |        0         |      y       | 21059 (27012 **)
> >>          n          |        -         |      y       | 21299
> >>          y          |        1         |      n       | 27745 *
> >>          y          |        0         |      n       | 27493 *
> >>          n          |        -         |      n       | 20928
> >>
> >> (*) So here the higher numbers are only achieved when
> >> 'sched_autogroup_exit_task() -> sched_move_task() ->
> >> sched_change_group()' is called for the 'spawn' tasks.
> >>
> >> (**) When I apply the fix from
> >> https://lkml.kernel.org/r/4a9cc5ab-c538-4427-8a7c-99cb317a283f@arm.com.
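(Side note for anyone trying to reproduce this: UnixBench's 'spawn' test is
essentially a fork()+exit() loop whose iterations are counted in loops per
second. Below is only a rough standalone approximation of that workload for
illustration, not the UnixBench source.)

/* Rough approximation of the 'spawn' workload: count how many
 * fork() + immediate _exit() + waitpid() cycles complete in a fixed
 * interval and report loops per second (lps). Illustration only.
 */
#include <stdio.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	const int duration = 10;	/* seconds per measurement run */
	unsigned long loops = 0;
	time_t end = time(NULL) + duration;

	while (time(NULL) < end) {
		pid_t pid = fork();

		if (pid == 0)		/* child: exit right away */
			_exit(0);
		if (pid < 0) {
			perror("fork");
			return 1;
		}
		waitpid(pid, NULL, 0);	/* parent: reap and loop */
		loops++;
	}
	printf("%.1f lps\n", (double)loops / duration);
	return 0;
}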
> > Thanks!
> > Will you submit that fix upstream?
>
> I will, I just had to understand in detail why this regression happens.
>
> Looks like the issue is rather related to 'sgs->group_util' in
> group_is_overloaded() and group_has_capacity(). If we don't
> 'dequeue/detach + attach/enqueue' (1) the task in sched_move_task(), then
> sgs->group_util is ~900 (you run 4 CPUs flat in a single MC sched domain,
> so sgs->group_capacity = 1024), and this leads to group_is_overloaded()
> returning true and group_has_capacity() returning false much more often
> than if we did (1).
>
> I.e. we have many more cases of 'group_is_overloaded' and
> 'group_fully_busy' in the WF_FORK wakeup path sched_balance_find_dst_cpu(),
> which then (a) much more often returns a CPU != smp_processor_id() (which
> isn't good for these extremely short-running tasks (FORK + EXIT)) and
> (b) unnecessarily involves calling sched_balance_find_dst_group_cpu()
> (since we deal with single-CPU sched domains).
>
> select_task_rq_fair(..., wake_flags = WF_FORK)
>
> cpu = smp_processor_id()
>
> new_cpu = sched_balance_find_dst_group(..., cpu, ...)
>
> do {
>
> update_sg_wakeup_stats()
>
> sgs->group_type = group_classify()
> w/o patch w/ patch
> if group_is_overloaded() (*)
> return group_overloaded /* 6 */ 457,141 394
>
> if !group_has_capacity() (**)
> return group_fully_busy /* 1 */ 816,629 714
>
> return group_has_spare /* 0 */ 1,158,890 3,157,472
>
> } while group
>
> if local_sgs.group_type > idlest_sgs.group_type
> return idlest 351,598 273
>
> case group_has_spare:
>
> if local_sgs.idle_cpus >= idlest_sgs.idle_cpus
> return NULL 156,760 788,462
>
>
> (*)
>
> if sgs->group_capacity * 100 <
> sgs->group_util * imbalance_pct 951,705 856
> return true
>
> sgs->group_util ~ 900 and sgs->group_capacity = 1024 (1 CPU per sched group)
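Just to make those numbers concrete on my side: assuming the usual non-SMT
default of imbalance_pct = 117 (my assumption, the value isn't stated above),
the (*) check indeed keeps tripping as long as the stale util hangs around:

/* Plugging the values from above into the (*) comparison. The
 * imbalance_pct of 117 and the lower "util after the fix" value are my
 * assumptions for illustration, not numbers taken from the trace.
 */
#include <stdbool.h>
#include <stdio.h>

static bool overloaded(unsigned long capacity, unsigned long util,
		       unsigned int imbalance_pct)
{
	return capacity * 100 < util * imbalance_pct;
}

int main(void)
{
	/* stale util left behind by exited spawn tasks */
	printf("util=900: %d\n", overloaded(1024, 900, 117)); /* 102400 < 105300 -> 1 */
	/* with the util properly detached (illustrative lower value) */
	printf("util=400: %d\n", overloaded(1024, 400, 117)); /* 102400 < 46800  -> 0 */
	return 0;
}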
>
>
> (**)
>
> if sgs->group_capacity * 100 >
> sgs->group_util * imbalance_pct
> return true 1,087,555 3,163,152
>
> return false 1,332,974 882
>
>
> (*) and (**) count both the 'wakeup' and the 'load-balance' path, so
> they don't match the wakeup-only numbers above!
Thank you for the detailed explanation. We appreciate your effort and
will await the fix.
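If I read the trace right, the two decisions that matter here boil down to
something like the condensed sketch below (my reading of the flow above, not
the actual kernel code; returning NULL means the newly forked task stays on
the waking CPU):

/* Condensed sketch of the two branches counted above, using the
 * group_type numbers from the trace. Not the real
 * sched_balance_find_dst_group(); many cases are omitted.
 */
#include <stdio.h>

enum group_type {
	group_has_spare = 0,
	group_fully_busy = 1,
	group_overloaded = 6,
};

struct sg_stats {
	enum group_type group_type;
	unsigned int idle_cpus;
};

struct sched_group;			/* opaque in this sketch */

static struct sched_group *pick_dst_group(const struct sg_stats *local,
					  const struct sg_stats *idlest_sgs,
					  struct sched_group *idlest)
{
	/* Stale group_util classifies the local group as overloaded or
	 * fully busy, so this fires and a remote CPU gets picked: */
	if (local->group_type > idlest_sgs->group_type)
		return idlest;

	/* With the fix both groups usually have spare capacity and the
	 * local group ties on idle CPUs, so the fork stays local: */
	if (local->group_type == group_has_spare &&
	    local->idle_cpus >= idlest_sgs->idle_cpus)
		return NULL;

	return idlest;
}

int main(void)
{
	struct sched_group *idlest = (struct sched_group *)0x1; /* dummy */
	struct sg_stats stale  = { group_overloaded, 0 };	/* local, w/o fix */
	struct sg_stats fixed  = { group_has_spare,  3 };	/* local, w/  fix */
	struct sg_stats remote = { group_has_spare,  3 };

	printf("w/o fix: stays local? %d\n", !pick_dst_group(&stale, &remote, idlest));
	printf("w/  fix: stays local? %d\n", !pick_dst_group(&fixed, &remote, idlest));
	return 0;
}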
>
> In this test run I got 608,092 new wakeups w/o and 789,572 (~+ 30%)
> w/ the patch when running './Run -c 4 -i 1 spawn' on AWS instance
> (m7gd.16xlarge) with v6.13, 'mem=16G maxcpus=4 nr_cpus=4' and
> Ubuntu '22.04.5 LTS'
>
> > Do you think that this fix is the same as reverting commit eff6c8ce8d4d and
> > its follow-up commit fa614b4feb5a? I mean, what does commit eff6c8ce8d4d
> > actually improve?
>
> There are occurrences in which 'group == tsk->sched_task_group' and
> '!(tsk->flags & PF_EXITING)' hold, so there the early bail-out might help
> w/o the negative impact on the sched benchmarks.
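So, if I understand correctly, the direction would be to keep the early
bail-out only when both of those conditions hold, roughly like this
(paraphrasing the idea only, not your actual patch):

/* Paraphrased illustration of the discussed direction: keep the cheap
 * early bail-out of eff6c8ce8d4d when the task group is unchanged AND
 * the task is not exiting, so an exiting task still goes through
 * dequeue/detach + attach/enqueue and leaves no stale util behind.
 * Not the kernel code and not the posted fix.
 */
#include <stdbool.h>
#include <stdio.h>

#define PF_EXITING	0x00000004	/* same flag bit as in the kernel */

struct task_group;			/* opaque in this sketch */

struct task_struct {
	unsigned int flags;
	struct task_group *sched_task_group;
};

/* true => sched_move_task() must still do the full move cycle */
static bool needs_full_move(const struct task_struct *tsk,
			    const struct task_group *new_group)
{
	if (new_group == tsk->sched_task_group && !(tsk->flags & PF_EXITING))
		return false;		/* early bail-out is safe here */
	return true;
}

int main(void)
{
	struct task_group *g = (struct task_group *)0x1;	/* dummy group */
	struct task_struct live  = { .flags = 0,          .sched_task_group = g };
	struct task_struct dying = { .flags = PF_EXITING, .sched_task_group = g };

	printf("live task, same group:    full move? %d\n", needs_full_move(&live, g));
	printf("exiting task, same group: full move? %d\n", needs_full_move(&dying, g));
	return 0;
}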
ok, thanks!