Message-ID: <cd6065dc-837f-474a-a4f5-0faa607da73a@linux.ibm.com>
Date: Thu, 24 Apr 2025 14:52:24 +0530
From: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
To: Chen Yu <yu.c.chen@...el.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
        K Prateek Nayak <kprateek.nayak@....com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
        Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
        Tim Chen <tim.c.chen@...el.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Libo Chen <libo.chen@...cle.com>, Abel Wu <wuyun.abel@...edance.com>,
        Hillf Danton <hdanton@...a.com>, linux-kernel@...r.kernel.org,
        Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [RFC PATCH 4/5] sched: Inhibit cache aware scheduling if the
 preferred LLC is over aggregated

Hi Chen Yu,

On 21/04/25 08:55, Chen Yu wrote:
> It is found that when a process's preferred LLC gets saturated by too many
> threads, task contention becomes very frequent and causes performance
> regressions.
> 
> Save the per-LLC statistics calculated by the periodic load balance. The
> statistics include the average utilization and the average number of
> runnable tasks. The task wakeup path for cache aware scheduling consults
> these statistics to inhibit cache aware scheduling and avoid the
> regression: the cache aware wakeup is disabled when either the average
> utilization of the preferred LLC has reached 25%, or the average number
> of runnable tasks has exceeded 1/3 of the LLC weight. This restriction
> only applies when the process has more threads than the LLC weight.
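
Just to make sure I read the thresholds correctly (my own back-of-the-envelope
numbers, assuming SCHED_CAPACITY_SCALE is 1024): for a hypothetical LLC with
16 CPUs, llc_cap would be 16 * 1024 = 16384, so the cache aware wakeup would
be inhibited once the average utilization reaches 4096 (25%) or the average
number of runnable tasks reaches 6 (since 6 * 3 >= 16), if I am reading the
checks right.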
> 
> Running schbench via mmtests on a Xeon platform with 2 sockets, each having
> 60 cores/120 CPUs. DRAM interleaving is enabled across NUMA nodes via the
> BIOS, so there are 2 "LLCs" in 1 NUMA node.
> 
> compare-mmtests.pl --directory work/log --benchmark schbench --names baseline,sched_cache
>                                    baseline            sched_cache
> Lat 50.0th-qrtle-1          6.00 (   0.00%)        6.00 (   0.00%)
> Lat 90.0th-qrtle-1         10.00 (   0.00%)        9.00 (  10.00%)
> Lat 99.0th-qrtle-1         29.00 (   0.00%)       13.00 (  55.17%)
> Lat 99.9th-qrtle-1         35.00 (   0.00%)       21.00 (  40.00%)
> Lat 20.0th-qrtle-1        266.00 (   0.00%)      266.00 (   0.00%)
> Lat 50.0th-qrtle-2          8.00 (   0.00%)        6.00 (  25.00%)
> Lat 90.0th-qrtle-2         10.00 (   0.00%)       10.00 (   0.00%)
> Lat 99.0th-qrtle-2         19.00 (   0.00%)       18.00 (   5.26%)
> Lat 99.9th-qrtle-2         27.00 (   0.00%)       29.00 (  -7.41%)
> Lat 20.0th-qrtle-2        533.00 (   0.00%)      507.00 (   4.88%)
> Lat 50.0th-qrtle-4          6.00 (   0.00%)        5.00 (  16.67%)
> Lat 90.0th-qrtle-4          8.00 (   0.00%)        5.00 (  37.50%)
> Lat 99.0th-qrtle-4         14.00 (   0.00%)        9.00 (  35.71%)
> Lat 99.9th-qrtle-4         22.00 (   0.00%)       14.00 (  36.36%)
> Lat 20.0th-qrtle-4       1070.00 (   0.00%)      995.00 (   7.01%)
> Lat 50.0th-qrtle-8          5.00 (   0.00%)        5.00 (   0.00%)
> Lat 90.0th-qrtle-8          7.00 (   0.00%)        5.00 (  28.57%)
> Lat 99.0th-qrtle-8         12.00 (   0.00%)       11.00 (   8.33%)
> Lat 99.9th-qrtle-8         19.00 (   0.00%)       16.00 (  15.79%)
> Lat 20.0th-qrtle-8       2140.00 (   0.00%)     2140.00 (   0.00%)
> Lat 50.0th-qrtle-16         6.00 (   0.00%)        5.00 (  16.67%)
> Lat 90.0th-qrtle-16         7.00 (   0.00%)        5.00 (  28.57%)
> Lat 99.0th-qrtle-16        12.00 (   0.00%)       10.00 (  16.67%)
> Lat 99.9th-qrtle-16        17.00 (   0.00%)       14.00 (  17.65%)
> Lat 20.0th-qrtle-16      4296.00 (   0.00%)     4200.00 (   2.23%)
> Lat 50.0th-qrtle-32         6.00 (   0.00%)        5.00 (  16.67%)
> Lat 90.0th-qrtle-32         8.00 (   0.00%)        6.00 (  25.00%)
> Lat 99.0th-qrtle-32        12.00 (   0.00%)       10.00 (  16.67%)
> Lat 99.9th-qrtle-32        17.00 (   0.00%)       14.00 (  17.65%)
> Lat 20.0th-qrtle-32      8496.00 (   0.00%)     8528.00 (  -0.38%)
> Lat 50.0th-qrtle-64         6.00 (   0.00%)        5.00 (  16.67%)
> Lat 90.0th-qrtle-64         8.00 (   0.00%)        8.00 (   0.00%)
> Lat 99.0th-qrtle-64        12.00 (   0.00%)       12.00 (   0.00%)
> Lat 99.9th-qrtle-64        17.00 (   0.00%)       17.00 (   0.00%)
> Lat 20.0th-qrtle-64     17120.00 (   0.00%)    17120.00 (   0.00%)
> Lat 50.0th-qrtle-128        7.00 (   0.00%)        7.00 (   0.00%)
> Lat 90.0th-qrtle-128        9.00 (   0.00%)        9.00 (   0.00%)
> Lat 99.0th-qrtle-128       13.00 (   0.00%)       14.00 (  -7.69%)
> Lat 99.9th-qrtle-128       20.00 (   0.00%)       20.00 (   0.00%)
> Lat 20.0th-qrtle-128    31776.00 (   0.00%)    30496.00 (   4.03%)
> Lat 50.0th-qrtle-239        9.00 (   0.00%)        9.00 (   0.00%)
> Lat 90.0th-qrtle-239       14.00 (   0.00%)       18.00 ( -28.57%)
> Lat 99.0th-qrtle-239       43.00 (   0.00%)       56.00 ( -30.23%)
> Lat 99.9th-qrtle-239      106.00 (   0.00%)      483.00 (-355.66%)
> Lat 20.0th-qrtle-239    30176.00 (   0.00%)    29984.00 (   0.64%)
> 
> We can see overall latency improvement and some throughput degradation
> when the system gets saturated.
> 
> We also ran schbench (old version) on an EPYC 7543 system, which has 4 NUMA
> nodes, each with 4 LLCs, and monitored the 99.0th percentile latency:
> 
> case                    load            baseline(std%)  compare%( std%)
> normal                  4-mthreads-1-workers     1.00 (  6.47)   +9.02 (  4.68)
> normal                  4-mthreads-2-workers     1.00 (  3.25)  +28.03 (  8.76)
> normal                  4-mthreads-4-workers     1.00 (  6.67)   -4.32 (  2.58)
> normal                  4-mthreads-8-workers     1.00 (  2.38)   +1.27 (  2.41)
> normal                  4-mthreads-16-workers    1.00 (  5.61)   -8.48 (  4.39)
> normal                  4-mthreads-31-workers    1.00 (  9.31)   -0.22 (  9.77)
> 
> When the LLC is underloaded, a latency improvement is observed. When the LLC
> gets saturated, we observe some degradation.
> 

[..snip..]

> +static bool valid_target_cpu(int cpu, struct task_struct *p)
> +{
> +	int nr_running, llc_weight;
> +	unsigned long util, llc_cap;
> +
> +	if (!get_llc_stats(cpu, &nr_running, &llc_weight,
> +			   &util))
> +		return false;
> +
> +	llc_cap = llc_weight * SCHED_CAPACITY_SCALE;
> +
> +	/*
> +	 * If this process has many threads, be careful to avoid
> +	 * task stacking on the preferred LLC by checking its
> +	 * utilization and number of runnable tasks. Otherwise, if
> +	 * this process does not have many threads, honor the cache
> +	 * aware wakeup.
> +	 */
> +	if (get_nr_threads(p) < llc_weight)
> +		return true;

IIUC, there might be scenarios where the LLC is already overloaded with
threads of other processes. In that case, we will return true for p in the
above condition and never check the conditions below. Shouldn't we check the
below two conditions either way?
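
Something like the following is what I had in mind -- just a rough, untested
sketch on top of this patch, which only drops the early return so that the
two load checks always apply:

static bool valid_target_cpu(int cpu, struct task_struct *p)
{
	int nr_running, llc_weight;
	unsigned long util, llc_cap;

	if (!get_llc_stats(cpu, &nr_running, &llc_weight, &util))
		return false;

	llc_cap = llc_weight * SCHED_CAPACITY_SCALE;

	/*
	 * The preferred LLC may already be loaded by threads of other
	 * processes, so check its utilization and runnable count
	 * regardless of how many threads p itself has.
	 */
	if (util * 4 >= llc_cap)
		return false;

	if (nr_running * 3 >= llc_weight)
		return false;

	return true;
}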

I tested this patch with the real-life workload Daytrader and didn't see any
regression. It spawns a lot of threads and is CPU intensive, so I think it is
not impacted by the conditions below.

Also, in the schbench numbers you provided, there is a degradation in the
saturated case. Is it due to the overhead of computing the preferred LLC,
which then goes unused because of the conditions below?

Thanks,
Madadi Vineeth Reddy

> +
> +	/*
> +	 * Check if the average utilization has reached 25% of the
> +	 * LLC capacity, or if the average number of runnable tasks
> +	 * has reached 1/3 of the CPUs in the LLC. These are magic
> +	 * numbers that did not cause heavy cache contention on Xeon
> +	 * or Zen.
> +	 */
> +	if (util * 4 >= llc_cap)
> +		return false;
> +
> +	if (nr_running * 3 >= llc_weight)
> +		return false;
> +
> +	return true;
> +}
> +

[..snip..]

