linux-kernel - Re: [RFC 0/5] kernel: Introduce CPU Namespace


Message-ID: <YW2g73Lwmrhjg/sv@slm.duckdns.org>
Date:   Mon, 18 Oct 2021 06:29:35 -1000
From:   Tejun Heo <tj@...nel.org>
To:     Pratik Sampat <psampat@...ux.ibm.com>
Cc:     Christian Brauner <christian.brauner@...ntu.com>,
        bristot@...hat.com, christian@...uner.io, ebiederm@...ssion.com,
        lizefan.x@...edance.com, hannes@...xchg.org, mingo@...nel.org,
        juri.lelli@...hat.com, linux-kernel@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, cgroups@...r.kernel.org,
        containers@...ts.linux.dev, containers@...ts.linux-foundation.org,
        pratik.r.sampat@...il.com
Subject: Re: [RFC 0/5] kernel: Introduce CPU Namespace

(cc'ing Johannes for memory sizing part)

Hello,

On Mon, Oct 18, 2021 at 08:59:16PM +0530, Pratik Sampat wrote:
...
> Also, I agree with your point about variability of requirements. If the
> interface we give even though it is in conjunction with the limits set,
> if the applications have to derive metrics from this or from other
> kernel information regardless; then the interface would not be useful.
> If the solution to this problem lies in userspace, then I'm all for it
> as well. However, the intention is to probe if this could potentially be
> solved cleanly in the kernel.

Just to be clear, avoiding application changes would have to involve
userspace (at least parameterization from it), and I think to set that as a
goal for kernel would be more of a distraction. Please note that we should
definitely provide metrics which actually capture what's going on in terms
of resource availability in a way which can be used to size workloads
automatically.

> Yes, these shortcomings exist even without containerization, on a
> dynamically loaded multi-tenant system it becomes very difficult to
> determine what is the maximum amount of resource that can be requested
> before we hurt our own performance.

As I mentioned before, a feedback loop on PSI can work really well in finding
the saturation points for cpu/mem/io and regulating workload size
automatically and dynamically. While such dynamic sizing can work without
any other inputs, it sucks to have to probe the entire range each time and
it'd be really useful if the kernel can provide ballpark numbers that are
needed to estimate the saturation points.
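
As a rough illustration of the kind of PSI feedback loop mentioned above, the
sketch below parses the text format of a pressure file such as
/proc/pressure/cpu and takes one sizing step from it. The function names, the
10% target, and the worker bounds are assumptions made for this example and
are not taken from any existing tool.

```python
def parse_psi(text):
    """Parse the contents of a PSI file such as /proc/pressure/cpu.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    Returns e.g. {"some": {"avg10": 0.0, ..., "total": 0.0}, "full": {...}}.
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def next_worker_count(current, some_avg10, target=10.0, lo=1, hi=64):
    """One step of a hypothetical feedback loop (all thresholds are assumed).

    Shrink the workload by one worker when the 10s "some" pressure is above
    the target percentage, otherwise grow it by one, within [lo, hi].
    """
    if some_avg10 > target:
        return max(lo, current - 1)
    return min(hi, current + 1)
```

A caller would read the pressure file periodically, feed
parse_psi(...)["some"]["avg10"] into next_worker_count(), and resize its
worker pool accordingly. Starting such a loop from a kernel-provided ballpark
rather than from one worker is what would avoid probing the entire range.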

What gets challenging is that there doesn't seem to be a good way to
consistently describe availability for each of the three resources and the
different distribution rules they may be under.

e.g. For CPU, the affinity restrictions from cpuset determine the maximum
number of threads that a workload would need to saturate the available CPUs.
However, conveying the results of cpu.max and cpu.weight controls isn't as
straightforward.
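
To make the CPU case concrete, here is a minimal sketch that combines an
affinity count with the contents of a cgroup v2 cpu.max file ("$MAX $PERIOD",
where $MAX may be the literal "max"). Taking the smaller of the two is only a
heuristic assumed for this example. It says nothing about cpu.weight, whose
effect depends on what sibling cgroups are doing, which is the part that is
hard to convey.

```python
def cpu_max_limit(cpu_max_text):
    """Return the cpu.max bandwidth limit in units of CPUs.

    cpu_max_text is the content of a cgroup v2 cpu.max file, for example
    "200000 100000" (2 CPUs worth of time) or "max 100000" (no limit).
    Returns None when there is no limit.
    """
    quota, period = cpu_max_text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def ballpark_cpus(affinity_count, cpu_max_text):
    """Heuristic (an assumption of this sketch): min(affinity, cpu.max)."""
    limit = cpu_max_limit(cpu_max_text)
    if limit is None:
        return float(affinity_count)
    return min(float(affinity_count), limit)
```

On Linux, the affinity count for the calling process can be obtained with
len(os.sched_getaffinity(0)).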

For memory, it's even trickier because in a lot of cases it's impossible to
tell how much memory is actually available without trying to use it, as the
active working set can only be learned by trying to reclaim memory.

IO is in a somewhat similar boat as CPU in that there are both io.max and
io.weight. However, if io.cost is in use and configured according to the
hardware, we can map those two in terms of iocost.

Another thing is that the dynamic nature of these control mechanisms means
that the numbers can keep changing moment to moment and we'd need to provide
some time averaged numbers. We can probably take the same approach as PSI
and load-avgs and provide running avgs of a few time intervals.
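
The running-average idea can be sketched with exponentially decaying averages
over a few windows, in the spirit of load averages. This is a floating-point
illustration only. The class name and the 10s/60s/300s windows are
assumptions for the example, and it is not the kernel's implementation.

```python
import math


class RunningAverages:
    """Exponentially decaying averages over several time windows (seconds)."""

    def __init__(self, windows=(10.0, 60.0, 300.0)):
        # One running average per window, all starting at zero.
        self.avgs = {window: 0.0 for window in windows}

    def update(self, sample, dt):
        """Fold in a new sample observed dt seconds after the previous one."""
        for window in list(self.avgs):
            decay = math.exp(-dt / window)
            self.avgs[window] = (
                self.avgs[window] * decay + sample * (1.0 - decay)
            )
        return dict(self.avgs)
```

Feeding a constant sample makes the short window converge first, so a reader
of the three numbers can tell a momentary change from a sustained one.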

> The question that I have essentially tries to understand the
> implications of overloading existing interface's definitions to be
> context sensitive.
> The way that the prototype works today is that it does not interfere
> with the information when the system boots or even when it is run in a
> new namespace.
> The effects are only observed when restrictions are applied to it.
> Therefore, what would potentially break if interfaces like these are
> made to divulge information based on restrictions rather than the whole
> system view?

I don't think the problem is that something would necessarily break by doing
that. It's more that it's a dead-end approach which won't get us far for all
the reasons that have been discussed so far. It'd be more productive to
focus on long term solutions and leave backward compatibility to the domains
where they can actually be solved by applying the necessary local knowledge
to emulate and fake whatever numbers are necessary.

Thanks.

-- 
tejun