linux-kernel - Re: [PATCH v4] sched/fair: do not scan twice in detach


Message-ID: <CAKfTPtBxtAu1=p22Z5N7_EMeTMyRvN-gQDa_G==dTDDKtPdYzA@mail.gmail.com>
Date: Tue, 22 Jul 2025 09:53:33 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Valentin Schneider <vschneid@...hat.com>
Cc: Huang Shijie <shijie@...amperecomputing.com>, mingo@...hat.com, peterz@...radead.org, 
	juri.lelli@...hat.com, patches@...erecomputing.com, cl@...ux.com, 
	Shubhang@...amperecomputing.com, dietmar.eggemann@....com, 
	rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4] sched/fair: do not scan twice in detach_tasks()

On Tue, 22 Jul 2025 at 09:49, Vincent Guittot
<vincent.guittot@...aro.org> wrote:
>
> On Mon, 21 Jul 2025 at 13:25, Valentin Schneider <vschneid@...hat.com> wrote:
> >
> > On 21/07/25 11:40, Vincent Guittot wrote:
> > > On Mon, 21 Jul 2025 at 04:40, Huang Shijie
> > > <shijie@...amperecomputing.com> wrote:
> > >>
> > >> detach_tasks() uses struct lb_env.loop_max as an env.src_rq->cfs_tasks
> > >> iteration count limit. It is however set without the source RQ lock held,
> > >> and besides detach_tasks() can be re-invoked after releasing and
> > >> re-acquiring the RQ lock per LBF_NEED_BREAK.
> > >>
> > >> This means that env.loop_max and the actual length of env.src_rq->cfs_tasks
> > >> as observed within detach_tasks() can differ. This can cause some tasks to
> > >
> > > > Why not set env.loop_max only once the rq lock is taken in this case?
> > > >
> > > > Side note: by default, loop_max <= loop_break.
> > >
> >
> > I thought so too and dismissed that due to LBF_NEED_BREAK, but I guess we
> > could still do something like:
> >
> > ---
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b9b4bbbf0af6f..eef3a0d341661 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11643,6 +11643,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> >                 .dst_grpmask    = group_balance_mask(sd->groups),
> >                 .idle           = idle,
> >                 .loop_break     = SCHED_NR_MIGRATE_BREAK,
> > +               .loop_max       = UINT_MAX,
> >                 .cpus           = cpus,
> >                 .fbq_type       = all,
> >                 .tasks          = LIST_HEAD_INIT(env.tasks),
> > @@ -11681,18 +11682,19 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> >         /* Clear this flag as soon as we find a pullable task */
> >         env.flags |= LBF_ALL_PINNED;
> >         if (busiest->nr_running > 1) {
> > +more_balance:
> >                 /*
> >                  * Attempt to move tasks. If sched_balance_find_src_group has found
> >                  * an imbalance but busiest->nr_running <= 1, the group is
> >                  * still unbalanced. ld_moved simply stays zero, so it is
> >                  * correctly treated as an imbalance.
> >                  */
> > -               env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
> > -
> > -more_balance:
> >                 rq_lock_irqsave(busiest, &rf);
> >                 update_rq_clock(busiest);
> >
> > +
> > +               env.loop_max = min3(env.loop_max, sysctl_sched_nr_migrate, busiest->h_nr_running);
> > +
> >                 /*
> >                  * cur_ld_moved - load moved in current iteration
> >                  * ld_moved     - cumulative load moved across iterations
> >
>
> I would prefer something like below:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1b3879850a9e..636798d53798 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11702,12 +11702,16 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>                  * still unbalanced. ld_moved simply stays zero, so it is
>                  * correctly treated as an imbalance.
>                  */
> -               env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
>
>  more_balance:
>                 rq_lock_irqsave(busiest, &rf);
>                 update_rq_clock(busiest);
>
> +               if (!env.loop_max)
> +                       env.loop_max = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_runnable);

It should be h_nr_queued, as mentioned by Huang. Also, my patch has been
mangled by my web browser.

> +               else
> +                       env.loop_max = min(env.loop_max, busiest->cfs.h_nr_queued);
> +
>                 /*
>                  * cur_ld_moved - load moved in current iteration
>                  * ld_moved     - cumulative load moved across iterations
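
For illustration only, and not part of the posted patch: a minimal
user-space sketch of the loop_max update proposed in the diff above. It
assumes that a loop_max of 0 means "not computed yet", that the function
runs with the source runqueue lock held, and that nr_queued plays the role
of busiest->cfs.h_nr_queued. The names update_loop_max() and min_uint() are
hypothetical and do not exist in the kernel.

```c
/* Hypothetical stand-in for the kernel's min() macro. */
static unsigned int min_uint(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Model of the proposed update (assumption: called under the rq lock).
 *
 * loop_max   - current limit; 0 means it has not been computed yet
 * nr_migrate - stands in for sysctl_sched_nr_migrate
 * nr_queued  - stands in for the number of tasks queued on the busiest rq
 *
 * First pass: bound the limit by both the sysctl and the queue length.
 * Later passes (after LBF_NEED_BREAK): only shrink the limit if the
 * queue became shorter while the lock was dropped.
 */
static unsigned int update_loop_max(unsigned int loop_max,
				    unsigned int nr_migrate,
				    unsigned int nr_queued)
{
	if (!loop_max)
		return min_uint(nr_migrate, nr_queued);

	return min_uint(loop_max, nr_queued);
}
```

In this model the limit can never grow on a later pass, so a re-entry after
the lock was released cannot iterate over more entries than are queued at
that moment.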