Message-ID: <2c8c4cdb-e9b7-40f3-aa83-d82676641162@bytedance.com>
Date: Wed, 12 Mar 2025 00:43:39 +0800
From: Abel Wu <wuyun.abel@...edance.com>
To: Josh Don <joshdon@...gle.com>
Cc: K Prateek Nayak <kprateek.nayak@....com>, Ingo Molnar <mingo@...hat.com>,
 Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 Dietmar Eggemann <dietmar.eggemann@....com>,
 Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
 Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
 Tianchen Ding <dtcccc@...ux.alibaba.com>,
 "open list:SCHEDULER" <linux-kernel@...r.kernel.org>,
 Viresh Kumar <viresh.kumar@...aro.org>
Subject: Re: Re: [RFC PATCH 2/2] sched/fair: Do not specialcase SCHED_IDLE
 cpus in select slowpath

Hi Josh,

On 3/11/25 6:38 AM, Josh Don wrote:
> Thanks Abel,
> 
>> @@ -7481,12 +7481,13 @@ sched_balance_find_dst_group_cpu(struct sched_group *group, struct task_struct *
>>                                  latest_idle_timestamp = rq->idle_stamp;
>>                                  shallowest_idle_cpu = i;
>>                          }
>> -               } else if (shallowest_idle_cpu == -1 && si_cpu == -1) {
>> -                       if (sched_idle_cpu(i)) {
>> -                               si_cpu = i;
>> -                               continue;
>> -                       }
>> -
>> +               } else if (shallowest_idle_cpu == -1) {
>> +                       /*
>> +                        * The SCHED_IDLE cpus do not necessarily mean anything
>> +                        * to @p due to the cgroup hierarchical behavior. But it
>> +                        * is almost certain that the wakee will be better served
>> +                        * if the cpu is less loaded.
>> +                        */
>>                          load = cpu_load(cpu_rq(i));
>>                          if (load < min_load) {
>>                                  min_load = load;
> 
> This seems reasonable due to the case you describe. However, I'm
> wondering if you considered any heuristics here to help identify when
> a target cpu should really be considered sched_idle from the
> perspective of the incoming task. For example, in your cgroup
> hierarchy, if you have a cpu currently only running tasks in your
> besteffort container (and all cpus in the system are busy running
> something), then that cpu should be considered as a good target for a
> waking task in the "guaranteed" container, and not a good target for a
> waking task in the "containerd" container.  A simple way to do this

Yes, it actually cost me several days of trying hard to figure out
whether there is a proper way to achieve what you're suggesting.


Solution A
----------

Mark all the hierarchically idle task_groups by assigning each a
unique prime, and define a tree walk from @n up to the root that
collects all the preemptable nodes.

	struct task_group {
		/*
		 * Set to a unique prime if at least 1 ancestor is idle,
		 * otherwise set to 1.
		 */
		u64	prime;
		/* Product of the primes of all preemptable nodes */
		u64	mul;
	};

	/* Called by sched_group_set_idle() */
	static void calculate_mul(struct task_group *tg)
	{
		struct task_group *curr, *iter;

		lockdep_assert_held(&shares_mutex);
		tg->mul = 1;	/* Recompute the product from scratch */
		for (curr = tg; curr->parent; curr = curr->parent) {
			list_for_each_entry(iter, &curr->parent->children, siblings) {
				/* Can't preempt non-idle siblings */
				if (!iter->idle)
					continue;
				/*
				 * For each node in the subtree rooted at @iter do:
				 *	tg->mul *= node->prime;
				 */
				traverse_subtree(tg, iter);
			}
		}
	}

	int sched_idle_cpu(int cpu, struct task_struct *p)
	{
		/* rq::cached_mul caches rq->donor's tg::mul */
		return p->sched_task_group->mul % cpu_rq(cpu)->cached_mul == 0;
	}

With this we could even drop find_matching_se(), since it becomes quite
easy to tell whether or not one sched_entity can preempt another. But
sadly task_group::mul will overflow very quickly: the product of just
16 distinct small primes already exceeds U64_MAX.
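
To make the overflow concrete, here is a minimal user-space sketch
(plain C, not kernel code; the prime table and the example products
are made up for illustration) of both the divisibility test and how
few distinct primes it takes to exceed U64_MAX:

	#include <stdint.h>
	#include <stdio.h>

	/* Distinct primes handed out to hierarchically idle groups */
	static const uint64_t primes[] = {
		2, 3, 5, 7, 11, 13, 17, 19, 23, 29,
		31, 37, 41, 43, 47, 53, 59, 61, 67, 71,
	};

	int main(void)
	{
		/*
		 * Membership test: a task_group whose ::mul is the
		 * product of the primes of every node it may preempt
		 * "contains" a node iff ::mul is divisible by that
		 * node's prime.
		 */
		uint64_t mul = 3 * 5 * 13;	/* may preempt nodes 3, 5, 13 */
		printf("can preempt 5? %d\n", mul % 5 == 0);	/* 1 */
		printf("can preempt 7? %d\n", mul % 7 == 0);	/* 0 */

		/*
		 * Overflow: the running product of distinct primes
		 * exceeds U64_MAX after only 16 factors.
		 */
		uint64_t prod = 1;
		for (int i = 0; i < 20; i++) {
			if (prod > UINT64_MAX / primes[i]) {
				printf("overflows u64 at prime #%d (%ju)\n",
				       i + 1, (uintmax_t)primes[i]);
				break;
			}
			prod *= primes[i];
		}
		return 0;
	}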


Solution B
----------

Since we do not require 100% accuracy, solution A can be shifted to
using Bloom filters.

	struct task_group {
		u64	key;	/* A random number used for hashing */
		u64	filter;	/* Union of the bits of all preemptable nodes */
	};

	/* Called by sched_group_set_idle() */
	static void calculate_filter(struct task_group *tg)
	{
		struct task_group *curr, *iter;

		lockdep_assert_held(&shares_mutex);
		tg->filter = 0;	/* Recompute the filter from scratch */
		for (curr = tg; curr->parent; curr = curr->parent) {
			list_for_each_entry(iter, &curr->parent->children, siblings) {
				/* Can't preempt non-idle siblings */
				if (!iter->idle)
					continue;
				/*
				 * For each node in the subtree rooted at @iter do:
				 * 	set_bit(bloom_hash(node->key), &tg->filter);
				 */
				traverse_subtree(tg, iter);
			}
		}
	}

	int sched_idle_cpu(int cpu, struct task_struct *p)
	{
		u64 filter = p->sched_task_group->filter;
		u64 cached = cpu_rq(cpu)->cached_filter;

		/* All bits of the running group must be set in @p's filter */
		return (filter & cached) == cached;
	}

False positives are possible, but their probability can be reduced by
tuning the Bloom filter setup.
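
For illustration, here is a minimal user-space sketch (the
multiplicative hashes below are stand-ins for a real bloom_hash(),
and the keys are arbitrary) of encoding the preemptable set into a
single u64 and doing the superset test:

	#include <stdint.h>
	#include <stdio.h>

	/* Map a group's random key to 2 bit positions in a 64-bit filter */
	static uint64_t bloom_bits(uint64_t key)
	{
		uint64_t h1 = (key * 0x9e3779b97f4a7c15ULL) >> 58;	/* 0..63 */
		uint64_t h2 = (key * 0xc2b2ae3d27d4eb4fULL) >> 58;	/* 0..63 */
		return (1ULL << h1) | (1ULL << h2);
	}

	int main(void)
	{
		/* @p's filter: union of the bits of every preemptable node */
		uint64_t filter = bloom_bits(0x1111) | bloom_bits(0x2222);
		/* rq's cached filter: bits of the currently running group */
		uint64_t cached = bloom_bits(0x1111);

		/* Superset test; mind the parentheses, '==' binds tighter than '&' */
		printf("preemptable?     %d\n", (filter & cached) == cached);

		/* An unrelated group may still collide: a false positive */
		uint64_t other = bloom_bits(0x3333);
		printf("unrelated group? %d\n", (filter & other) == other);
		return 0;
	}

With k hash bits per group and s of the 64 filter bits set, the
false-positive rate is roughly (s/64)^k, so the filter should be
kept sparse.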

I chose the simplest way for now to work around the issue we
encountered, while still trying to find a way to get rid of
sched_idle_cpu(). Thoughts?

> would be to do a find_matching_se on the incoming task and the current
> on_cpu task. That does have a drawback of cgroup pointer chasing in a
> hot wakeup path, so I don't love it.

Neither do I. And if we go through find_matching_se(), I think we
would need that rq locked.

> 
> In any case, I'm fine with the change as-is, mostly curious if you
> gave any additional thought to the case mentioned above.

Thanks!
	Abel

