linux-kernel - Re: Lower than expected CPU pressure in PSI

Message-ID: <20200208101957.GU14946@hirez.programming.kicks-ass.net>
Date:   Sat, 8 Feb 2020 11:19:57 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     Ivan Babrou <ivan@...udflare.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        kernel-team <kernel-team@...udflare.com>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>
Subject: Re: Lower than expected CPU pressure in PSI

On Fri, Feb 07, 2020 at 02:08:29PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 09, 2020 at 11:16:32AM -0500, Johannes Weiner wrote:
> > On Wed, Jan 08, 2020 at 11:47:10AM -0800, Ivan Babrou wrote:
> > > We added reporting for PSI in cgroups and results are somewhat surprising.
> > > 
> > > My test setup consists of 3 services:
> > > 
> > > * stress-cpu1-no-contention.service : taskset -c 1 stress --cpu 1
> > > * stress-cpu2-first-half.service    : taskset -c 2 stress --cpu 1
> > > * stress-cpu2-second-half.service   : taskset -c 2 stress --cpu 1
> > > 
> > > First service runs unconstrained, the other two compete for CPU.
> > > 
> > > As expected, I can see 500ms/s sched delay for the latter two and
> > > aggregated 1000ms/s delay for /system.slice, no surprises here.
> > > 
> > > However, CPU pressure reported by PSI says that none of my services
> > > have any pressure on them. I can see around 434ms/s pressure on
> > > /unified/system.slice and 425ms/s pressure on /unified cgroup, which
> > > is surprising for three reasons:
> > > 
> > > * Pressure is absent for my services (I expect it to match sched delay)
> > > * Pressure on /unified/system.slice is lower than both 500ms/s and 1000ms/s
> > > * Pressure on root cgroup is lower than on system.slice
> > 
> > CPU pressure is currently implemented based only on the number of
> > *runnable* tasks, not on who gets to actively use the CPU. This works
> > for contention within cgroups or at the global scope, but it doesn't
> > correctly reflect competition between cgroups. It also doesn't show
> > the effects of e.g. cpu cycle limiting through cpu.max where there
> > might *be* only one runnable task, but it's not getting the CPU.
> > 
> > I've been working on fixing this, but hadn't gotten around to sending
> > the patch upstream. Attaching it below. Would you mind testing it?
> > 
> > Peter, what would you think of the below?
> 
> I'm not loving it; but I see what it does and I can't quickly see an
> alternative.
> 
> My main gripe is doing even more of those cgroup traversals.
> 
> One thing pick_next_task_fair() does is try and limit the cgroup
> traversal to the sub-tree that contains both prev and next. Not sure
> that is immediately applicable here, but it might be worth looking into.

One option, I suppose, would be to replace this:

+static inline void psi_sched_switch(struct task_struct *prev,
+                                   struct task_struct *next,
+                                   bool sleep)
+{
+       if (static_branch_likely(&psi_disabled))
+               return;
+
+       /*
+        * Clear the TSK_ONCPU state if the task was preempted. If
+        * it's a voluntary sleep, dequeue will have taken care of it.
+        */
+       if (!sleep)
+               psi_task_change(prev, TSK_ONCPU, 0);
+
+       psi_task_change(next, 0, TSK_ONCPU);
+}

With something like:

static inline void psi_sched_switch(struct task_struct *prev,
                                   struct task_struct *next,
                                   bool sleep)
{
	struct psi_group *g = NULL, *p = NULL;
	int cpu = task_cpu(prev);
	int set, clear;

	/* mark @next as on-CPU, walking up from its own group */
	set = TSK_ONCPU;
	clear = 0;

	while ((g = iterate_group(next, &g))) {
		u32 nr_running = per_cpu_ptr(g->pcpu, cpu)->tasks[NR_RUNNING];
		if (nr_running) {
			/* if set, we hit the subtree @prev lives in, terminate */
			p = g;
			break;
		}

		/* the rest of psi_task_change */
	}

	/* voluntary sleep: dequeue already cleared TSK_ONCPU on @prev */
	if (sleep)
		return;

	set = 0;
	clear = TSK_ONCPU;
	g = NULL;	/* restart the walk, this time from @prev */

	while ((g = iterate_group(prev, &g))) {
		if (g == p)
			break;

		/* the rest of psi_task_change */
	}
}

That way we avoid clearing and setting the common parents.
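
To make the saving concrete, here is a small userspace model of the idea. It is only a sketch and not kernel code: struct toy_group, struct toy_task and both toy_switch_*() helpers are made-up names, each group has a single parent pointer, there is one CPU, and only the preemption case (sleep == false) is modelled. It keys the early exit on a per-group on-CPU counter, which is an assumption of the model rather than something taken from the patch. The nr_updates field just counts how often a group's state gets touched.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a psi group: one on-CPU counter and a parent link. */
struct toy_group {
	struct toy_group *parent;
	unsigned int nr_oncpu;   /* tasks in this subtree currently on the CPU */
	unsigned int nr_updates; /* how many times this group was touched */
};

/* Toy stand-in for a task: it only knows the leaf group it lives in. */
struct toy_task {
	struct toy_group *group;
};

/* Naive variant: clear @prev on all its ancestors, set @next on all of its. */
static void toy_switch_naive(struct toy_task *prev, struct toy_task *next)
{
	struct toy_group *g;

	for (g = prev->group; g; g = g->parent) {
		assert(g->nr_oncpu > 0);
		g->nr_oncpu--;
		g->nr_updates++;
	}

	for (g = next->group; g; g = g->parent) {
		g->nr_oncpu++;
		g->nr_updates++;
	}
}

/*
 * Common-ancestor variant: walk up from @next until we reach a group that
 * still carries @prev's on-CPU state. With a single CPU that group is the
 * first common ancestor, and neither it nor anything above it changes.
 */
static void toy_switch_common(struct toy_task *prev, struct toy_task *next)
{
	struct toy_group *g, *common = NULL;

	for (g = next->group; g; g = g->parent) {
		if (g->nr_oncpu) {
			common = g;
			break;
		}
		g->nr_oncpu++;
		g->nr_updates++;
	}

	/* Clear @prev only below the common ancestor. */
	for (g = prev->group; g && g != common; g = g->parent) {
		assert(g->nr_oncpu > 0);
		g->nr_oncpu--;
		g->nr_updates++;
	}
}
```

With a tree of root -> slice -> {a, b}, prev in a and next in b, the common-ancestor variant touches only a and b, while the naive one touches all four groups and visits slice and root twice each. Both leave the counters in the same final state.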