linux-kernel - Re: Lower than expected CPU pressure in PSI

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20200115165543.GA47772@cmpxchg.org>
Date:   Wed, 15 Jan 2020 11:55:43 -0500
From:   Johannes Weiner <hannes@...xchg.org>
To:     Ivan Babrou <ivan@...udflare.com>
Cc:     linux-kernel <linux-kernel@...r.kernel.org>,
        kernel-team <kernel-team@...udflare.com>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>
Subject: Re: Lower than expected CPU pressure in PSI

On Fri, Jan 10, 2020 at 11:28:32AM -0800, Ivan Babrou wrote:
> I applied the patch on top of 5.5.0-rc3 and it's definitely better
> now, both competing cgroups report 500ms/s delay. Feel free to add
> Tested-by from me.

Thanks, Ivan!

> I'm still seeing /unified/system.slice at 385ms/s and /unified.slice
> at 372ms/s, do you have an explanation for that part? Maybe it's
> totally reasonable, but warrants a patch for documentation.

Yes, this is a combination of CPU pinning and how pressure is
calculated in SMP systems.

The stall times are defined as lost compute potential - which scales
with the number of concurrent threads - normalized to wallclock
time. See the "Multiple CPUs" section in kernel/sched/psi.c.

By restricting the CPUs in system.slice, there is less compute
available in that group than in the parent, which means that the
relative loss of potential can be higher.

It's a bit unintuitive because most cgroup metrics are plain numbers
that add up to bigger numbers as you go up the tree. If we exported
both the numerator (waste) and denominator (compute potential) here,
the numbers would act more conventionally, with parent numbers always
bigger than the child's. But because pressure is normalized to
wallclock time, you only see the ratio at each level, and that can
shrink as you go up the tree if lower levels are CPU-constrained.

We could have exported both numbers, but for most usecases that would
be more confusing than helpful. And in practice it's the ratio that
really matters: the pressure in the leaf cgroups is high due to the
CPU restriction; but when you go higher up the tree and look at not
just the pinned tasks, but also include tasks in other groups that
have more CPUs available to them, the aggregate productivity at that
level *is* actually higher.

I hope that helps!