Message-ID: <20250220155229.GB89684@pauld.westford.csb>
Date: Thu, 20 Feb 2025 10:52:29 -0500
From: Phil Auld <pauld@...hat.com>
To: Vishal Chourasia <vishalc@...ux.ibm.com>
Cc: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Frederic Weisbecker <frederic@...nel.org>,
Waiman Long <longman@...hat.com>, linux-kernel@...r.kernel.org
Subject: Re: [CHANGE 1/2] sched/isolation: Make use of more than one
housekeeping cpu
Hi Vishal,
On Thu, Feb 20, 2025 at 02:50:40PM +0530 Vishal Chourasia wrote:
> On Tue, Feb 18, 2025 at 10:00:59AM -0500, Phil Auld wrote:
> > Hi Vishal.
> >
> > On Fri, Feb 14, 2025 at 11:08:19AM +0530 Vishal Chourasia wrote:
> > > Hi Phil, Vineeth
> > >
> > > On Thu, Feb 13, 2025 at 09:26:53AM -0500, Phil Auld wrote:
> > > > On Thu, Feb 13, 2025 at 10:14:04AM +0530 Madadi Vineeth Reddy wrote:
> > > > > Hi Phil Auld,
> > > > >
> > > > > On 11/02/25 19:31, Phil Auld wrote:
> > > > > > The existing code uses housekeeping_any_cpu() to select a cpu for
> > > > > > a given housekeeping task. However, this often ends up calling
> > > > > > cpumask_any_and(), which is defined as cpumask_first_and(), which
> > > > > > has the effect of always using the first cpu among those available.
> > > > > >
> > > > > > The same applies when multiple NUMA nodes are involved. In that
> > > > > > case the first cpu in the local node is chosen, which does provide
> > > > > > a bit of spreading, but with multiple HK cpus per node the same
> > > > > > issues arise.
> > > > > >
> > > > > > Spread the HK work out by having housekeeping_any_cpu() and
> > > > > > sched_numa_find_closest() use cpumask_any_and_distribute()
> > > > > > instead of cpumask_any_and().
> > > > > >
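
For anyone following along: cpumask_any_and() is #defined to
cpumask_first_and(), hence the always-first-cpu behavior. Here is a
minimal userspace sketch of the difference. It is not kernel code: a
64-bit mask stands in for the cpumask, and the rotating pick only mimics
the spirit of cpumask_any_and_distribute(), which keeps its own state in
the kernel.

#include <stdio.h>
#include <stdint.h>

/* Always picks the lowest set bit, like cpumask_first_and(). */
static int pick_first(uint64_t mask)
{
	return mask ? __builtin_ctzll(mask) : -1;
}

/* Rotating pick: resume the search after the previously returned bit,
 * wrapping around, so repeated calls spread across the mask. */
static int pick_distribute(uint64_t mask)
{
	static int prev = -1;
	int i;

	if (!mask)
		return -1;
	for (i = 1; i <= 64; i++) {
		int bit = (prev + i) % 64;

		if (mask & (1ULL << bit)) {
			prev = bit;
			return bit;
		}
	}
	return -1;
}

int main(void)
{
	uint64_t hk = 0x0f;	/* pretend cpus 0-3 are housekeeping */
	int i;

	for (i = 0; i < 6; i++)
		printf("first=%d distribute=%d\n",
		       pick_first(hk), pick_distribute(hk));
	return 0;
}

Repeated calls print first=0 every time, while distribute cycles through
0,1,2,3,0,1 which is the kind of spreading the patch is after.
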
> > > > >
> > > > > Got the overall intent of the patch: better load distribution of
> > > > > housekeeping tasks. However, one potential drawback is that
> > > > > spreading HK work across multiple CPUs might reduce the time that
> > > > > some cores can spend in deeper idle states, which can be beneficial
> > > > > for power-sensitive systems.
> > > > >
> > > > > Thoughts?
> > > >
> > > > NOHZ_full setups are not generally used in power-sensitive systems, I think.
> > > > They aren't in our use cases at least.
> > > >
> > > > In cases with many cpus a single housekeeping cpu cannot keep up. Having
> > > > other HK cpus in deep idle states while the one in use is overloaded is
> > > > not a win.
> > >
> > > To me, an overloaded CPU sounds like one where more than one task is ready
> > > to run, and an HK CPU is one receiving periodic scheduling clock
> > > ticks, so an HK CPU is bound to come out of any power-saving state it is in.
> >
> > If the overload is caused by HK and interrupts there is nothing in the
> > system to help. Tasks, sure, can get load balanced.
> >
> > And as you say, the HK cpus will generally have ticks happening anyway.
> >
> > > >
> > > > If your single HK cpu can keep up then only configure that one HK cpu.
> > > > The others will go idle and stay there. And since they are nohz_full
> > > > might get to stay idle even longer.
> > > While it is good to distribute the load across each HK CPU in the HK
> > > cpumask (queuing jobs on different CPUs each time), this can cause
> > > jitter in virtualized environments, unnecessarily evicting other
> > > tenants when it's better to overload a VP than to wake up other VPs of
> > > a tenant.
> > >
> >
> > Sorry, I'm not sure I understand your setup. Are you running virtual
> > tenants on the HK cpus? nohz_full in the guests? Maybe you only need
> > one HK cpu, then it won't matter.
> >
> Firstly, I am unaware of nohz_full being used in virtualized environments.
> Please correct me if it is. I am not saying it can't or shouldn't be used;
> it's just that I don't know if anybody is using it.
>
I've seen some people trying it in various ways and to varying degrees of
success if I recall correctly.
> nohz_full in guests would mean that the tick is disabled inside the guest but
> the host might still be getting ticks. So I am unsure whether it is a good
> idea to have nohz_full in a virtualized environment.
>
> Nevertheless, the idea of nohz_full is to reduce kernel interference
> for CPUs marked as nohz_full. And it does help with guest interference.
>
> I would like to mention that, in an SPLPAR environment, scheduling work on a
> different HK CPU each time can cause VM preemption in a multi-tenant
> setup where the CPUs in the HK cpumask span multiple VPs; it's better to
> consolidate them within a few VPs.
>
> VP is virtual core/processor.
I have a hard time reconciling you saying you are not using virtualization
and then talking about VMs and VPs ;)
Sometimes I forget how interesting PPC can be...
>
> > My concern is that currently there is no point in having more than
> > one HK cpu (per node in a NUMA case). The code as currently implemented
> > is just not doing what it needs to.
> >
> > We have numerous cases where a single HK cpu just cannot keep up and
> > the remote_tick warning fires. It can also lead to other things
> > (orchestration sw, HA keepalives, etc.) on the HK cpus getting starved,
> > which leads to other issues. In these cases we recommend increasing
> > the number of HK cpus. But... that only helps the userspace tasks
> > somewhat. It does not help the actual housekeeping part.
> >
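
For concreteness: for the tick-related HK types the housekeeping set is
just the complement of nohz_full, so the number of HK cpus is a boot-time
choice. On, say, a 64-cpu box (masks purely illustrative):

    nohz_full=1-63    -> cpu 0 is the only HK cpu
    nohz_full=4-63    -> cpus 0-3 are HK cpus

With more than one HK cpu configured, the distribution change in this
patch is what actually lets the kernel-side housekeeping work use them.
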
> > It seems clear to me that the intent of the cpumask_any_and() calls
> > is to pick _any_ cpu in the hk mask, not just the first; otherwise
> > it would just use cpumask_first_and().
> >
> > I'm open to alternate suggestions of how to fix this.
> Your approach looks good to me.
>
> I wanted to mention the case of an overcommitted multi-tenant setup,
> where this will cause a noisy-neighbour sort of situation, but this can be
> avoided by carefully selecting HK CPUs.
Yes, it can be configured away, hopefully.
Still, this could easily be a sched feat if that makes sense to people.
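Roughly along these lines (just a sketch; HK_DISTRIBUTE is a placeholder
name, not necessarily what the actual patch I mentioned uses):

/* kernel/sched/features.h */
SCHED_FEAT(HK_DISTRIBUTE, true)

/* kernel/sched/isolation.c, in housekeeping_any_cpu() */
	if (sched_feat(HK_DISTRIBUTE))
		cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type],
						 cpu_online_mask);
	else
		cpu = cpumask_any_and(housekeeping.cpumasks[type],
				      cpu_online_mask);

That would let setups that prefer consolidating on the first HK cpu turn
the spreading off.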
Thanks for the review!
Cheers,
Phil
>
> Thanks,
> vishalc
> >
> >
> > Cheers,
> > Phil
> >
> > > >
> > > > I do have a patch that has this controlled by a sched feature if that
> > > > is of interest. Then it could be disabled if you don't want it.
> > >
> > > Vishal
> > > >
> > > > Cheers,
> > > > Phil
> > > >
> > > > >
> > > > > Thanks,
> > > > > Madadi Vineeth Reddy
> > > > >
> > > > > > Signed-off-by: Phil Auld <pauld@...hat.com>
> > > > > > Cc: Peter Zijlstra <peterz@...radead.org>
> > > > > > Cc: Juri Lelli <juri.lelli@...hat.com>
> > > > > > Cc: Frederic Weisbecker <frederic@...nel.org>
> > > > > > Cc: Waiman Long <longman@...hat.com>
> > > > > > Cc: linux-kernel@...r.kernel.org
> > > > > > ---
> > > > > > kernel/sched/isolation.c | 2 +-
> > > > > > kernel/sched/topology.c | 2 +-
> > > > > > 2 files changed, 2 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> > > > > > index 81bc8b329ef1..93b038d48900 100644
> > > > > > --- a/kernel/sched/isolation.c
> > > > > > +++ b/kernel/sched/isolation.c
> > > > > > @@ -40,7 +40,7 @@ int housekeeping_any_cpu(enum hk_type type)
> > > > > > if (cpu < nr_cpu_ids)
> > > > > > return cpu;
> > > > > >
> > > > > > - cpu = cpumask_any_and(housekeeping.cpumasks[type], cpu_online_mask);
> > > > > > + cpu = cpumask_any_and_distribute(housekeeping.cpumasks[type], cpu_online_mask);
> > > > > > if (likely(cpu < nr_cpu_ids))
> > > > > > return cpu;
> > > > > > /*
> > > > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > > > > index c49aea8c1025..94133f843485 100644
> > > > > > --- a/kernel/sched/topology.c
> > > > > > +++ b/kernel/sched/topology.c
> > > > > > @@ -2101,7 +2101,7 @@ int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
> > > > > > for (i = 0; i < sched_domains_numa_levels; i++) {
> > > > > > if (!masks[i][j])
> > > > > > break;
> > > > > > - cpu = cpumask_any_and(cpus, masks[i][j]);
> > > > > > + cpu = cpumask_any_and_distribute(cpus, masks[i][j]);
> > > > > > if (cpu < nr_cpu_ids) {
> > > > > > found = cpu;
> > > > > > break;
> > > > >
> > > >
> > > > --
> > > >
> > >
> >
> > --
> >
>
--