Date:   Thu, 15 Jun 2023 10:01:44 -0700
From:   Tim Chen <tim.c.chen@...ux.intel.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Ricardo Neri <ricardo.neri@...el.com>,
        "Ravi V . Shankar" <ravi.v.shankar@...el.com>,
        Ben Segall <bsegall@...gle.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Len Brown <len.brown@...el.com>, Mel Gorman <mgorman@...e.de>,
        "Rafael J . Wysocki" <rafael.j.wysocki@...el.com>,
        Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Valentin Schneider <vschneid@...hat.com>,
        Ionela Voinescu <ionela.voinescu@....com>, x86@...nel.org,
        linux-kernel@...r.kernel.org,
        Shrikanth Hegde <sshegde@...ux.vnet.ibm.com>,
        Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
        naveen.n.rao@...ux.vnet.ibm.com,
        Yicong Yang <yangyicong@...ilicon.com>,
        Barry Song <v-songbaohua@...o.com>,
        Chen Yu <yu.c.chen@...el.com>, Hillf Danton <hdanton@...a.com>
Subject: Re: [Patch v2 3/6] sched/fair: Implement prefer sibling imbalance
 calculation between asymmetric groups

On Thu, 2023-06-15 at 13:07 +0200, Peter Zijlstra wrote:
> On Tue, Jun 13, 2023 at 10:46:36AM -0700, Tim Chen wrote:
> > On Mon, 2023-06-12 at 14:05 +0200, Peter Zijlstra wrote:
> 
> > > > +		/* Limit tasks moved from preferred group, don't leave cores idle */
> > > > +		limit = busiest->sum_nr_running;
> > > > +		lsub_positive(&limit, ncores_busiest);
> > > > +		if (imbalance > limit)
> > > > +			imbalance = limit;
> > > 
> > > How does this affect the server parts that have turbo domains
> > > larger than a single core?
> > 
> > Are you thinking about the case where the local group is completely empty
> > so there's turbo headroom and we should move at least one task, even though
> > the CPUs in the busiest group have higher priority?
> 
> Something along those lines; I didn't think it through, just wondered
> about the wisdom of piling everything in the highest priority 'domain',
> which would depress the turbo range.
> 
> Rjw said that on modern client the frequency domains are per-core, but
> that on server parts they are still wider -- or something along those
> lines. So it might make sense to consider some of that.
> 

I see what you are saying. In that case, even for ASYM_PACKING, it probably
makes more sense to simply distribute tasks in proportion to the number of
cores in each sched group, while making sure we do not let the preferred
sched group go idle.
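
Roughly what I have in mind, as a standalone sketch (not kernel code;
the function and variable names are made up for illustration, not taken
from the posted patch):

	/*
	 * Distribute tasks in proportion to core counts, but never
	 * leave the preferred (local) group idle while the busiest
	 * group still has more than one task.
	 */
	static long proportional_imbalance(long busiest_running,
					   long local_running,
					   long ncores_busiest,
					   long ncores_local)
	{
		long total = busiest_running + local_running;
		long target = total * ncores_local /
			      (ncores_local + ncores_busiest);
		long imb = target - local_running;

		if (imb <= 0 && local_running == 0 && busiest_running > 1)
			imb = 1;

		return imb > 0 ? imb : 0;
	}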

> 
> > In other words, are you suggesting we should add
> > 
> > 		if (imbalance == 0 && busiest->sum_nr_running > 0 &&
> > 			local->sum_nr_running == 0)
> > 			imbalance = 1;
> 
> I didn't get that far; and I don't think we have the right topology
> information on the servers to even begin considering the effects of the
> turbo-range, so perhaps it all doesn't matter.
> 
> Just wanted to raise the point for consideration.
> 
> Because as-is, the policy of piling extra on the preferred group doesn't
> immediately make sense. IIRC the whole ITMT preferred core is mostly
> about an extra turbo bin, but if you pile on, the headroom evaporates --
> you'll never turbo that high anymore and special casing it doesn't make
> sense.
> 
> So perhaps I'm saying that not more, but less code is better here.
> 
> Dunno, is any of this measurable either way around?

As far as I know, only client CPUs have ITMT and asymmetric turbo, so this
behavior could only be approximated on a client part's Atom clusters with
the big cores turned off.

> 
> > > > +
> > > > +		goto out;
> > > > +	}
> > > > +
> > > > +	/* Take advantage of resources in an empty sched group */
> > > > +	if (imbalance == 0 && local->sum_nr_running == 0 &&
> > > > +	    busiest->sum_nr_running > 1)
> > > > +		imbalance = 1;
> > > > +out:
> > > > +	return imbalance << 1;
> > > > +}
> > > 
> > > 
> > > But basically you have:
> > > 
> > >         LcBn - BcLn
> > >   imb = -----------
> > >            LcBc
> > > 
> > > Which makes sense, except you then return:
> > > 
> > >   imb * 2
> > > 
> > > which then made me wonder about rounding.
> > > 
> > > Do we want to add (LcBc - 1) or (LcBc / 2) before the division, to
> > > ceil() or round() the result respectively? Because currently it uses
> > > floor().
> > > 
> > > If you evaluate it like:
> > > 
> > > 
> > >         2 * (LcBn - BcLn)
> > >   imb = -----------------
> > >               LcBc
> > > 
> > > The result is different from what you have now.
> > 
> > If I do the rounding after multiplying imb by two (call it imb_new),
> > the difference from the imb I return now (call it imb_old)
> > will be at most 1.  Note that the imb_old returned is always a multiple of 2.
> > 
> > I will be using imb in calculate_imbalance() and divide it
> > by 2 there to get the number of tasks to move from the busiest group.
> > So when there is a difference of 1 between imb_old and imb_new,
> > the difference will be trimmed off by the division by 2.
> > 
> > We will get the same number of tasks to move with either
> > imb_old or imb_new in calculate_imbalance(), so the two
> > computations arrive at the same result eventually.
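> > 
> > A concrete example (hypothetical numbers, just to illustrate):
> > 
> > 	LcBn - BcLn = 3, LcBc = 2
> > 	imb_old = 2 * floor(3/2) = 2  ->  floor(2/2) = 1 task to move
> > 	imb_new = floor(2*3/2)   = 3  ->  floor(3/2) = 1 task to move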
> > 
> > > 
> > > What actual behaviour is desired in these low imbalance cases? and can
> > > you write a comment as to why we do as we do etc..?
> > 
> > I do not keep imb as 
> > 
> >            2 * (LcBn - BcLn)
> >    imb = -----------------
> >                LcBc
> > 
> > as it is easier to leave the factor of 2 out
> > in the middle of the sibling_imbalance() computation,
> > so I can directly interpret imb as the number
> > of tasks to move, and add the factor of two
> > when I actually need to return the imbalance.
> > 
> > Would you like me to add this reasoning in the comments?
> 
> So if we want a multiple of 2, leaving that multiplication off makes
> sense, but I'm not sure I got the argument for or against the rounding.
> 
> floor() gets us 1 task to move when there is at least a whole task's
> worth of imbalance, but less than 2.
> 
> round() would get us 1 task to move when there's at least half a task's
> worth of imbalance but less than 1.5.
> 
> ceil() will migrate on any imbalance, however small -- which will result
> in ping-pong I suppose, so let's disregard that.
> 
> The difference, with the multiplication later, is 0 or 2.
> 
> Does the round() still result in ping-pong?

I will have to experiment to see whether round() is better.
I chose floor() on purpose in my initial implementation
to minimize migrations.
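
For reference, the round() variant I would try just adds half the divisor
before dividing (a sketch with made-up names, not the posted patch):

	/*
	 * diff = LcBn - BcLn, div = LcBc.  imb_floor() is what the
	 * patch does today; imb_round() rounds to nearest instead.
	 */
	static long imb_floor(long diff, long div)
	{
		return diff / div;
	}

	static long imb_round(long diff, long div)
	{
		return (diff + div / 2) / div;
	}

With diff just under one task's worth (say diff = 3, div = 4), floor()
gives 0 and round() gives 1; that extra migration is exactly what the
experiment would be measuring.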

Tim
