linux-kernel - Re: [RFC 0/5] kernel: Introduce CPU Namespace

Message-ID: <b5f8505c-38d5-af6f-0de7-4f9df7ae9b9b@linux.ibm.com>
Date:   Mon, 18 Oct 2021 20:59:16 +0530
From:   Pratik Sampat <psampat@...ux.ibm.com>
To:     Tejun Heo <tj@...nel.org>
Cc:     Christian Brauner <christian.brauner@...ntu.com>,
        bristot@...hat.com, christian@...uner.io, ebiederm@...ssion.com,
        lizefan.x@...edance.com, hannes@...xchg.org, mingo@...nel.org,
        juri.lelli@...hat.com, linux-kernel@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, cgroups@...r.kernel.org,
        containers@...ts.linux.dev, containers@...ts.linux-foundation.org,
        pratik.r.sampat@...il.com
Subject: Re: [RFC 0/5] kernel: Introduce CPU Namespace



On 15/10/21 3:44 am, Tejun Heo wrote:
> Hello,
>
> On Tue, Oct 12, 2021 at 02:12:18PM +0530, Pratik Sampat wrote:
>>>> The control and the display interface is fairly disjoint with each
>>>> other. Restrictions can be set through control interfaces like cgroups,
>>> A task wouldn't really opt-in to cpu isolation with CLONE_NEWCPU it
>>> would only affect resource reporting. So it would be one half of the
>>> semantics of a namespace.
>>>
>> I completely agree with you on this, fundamentally a namespace should
>> isolate both the resource as well as the reporting. As you mentioned
>> too, cgroups handles the resource isolation while this namespace
>> handles the reporting and this seems to break the semantics of what a
>> namespace should really be.
>>
>> The CPU resource is unique in that sense, at least in this context,
>> which makes it tricky to design an interface that presents coherent
>> information.
> It's only unique in the context that you're trying to place CPU distribution
> into the namespace framework when the resource in question isn't distributed
> that way. All of the three major local resources - CPU, memory and IO - are
> in the same boat. Computing resources, the physical ones, don't lend
> themselves naturally to accounting and distributing by segmenting _name_
> spaces which ultimately just show and hide names. This direction is a
> dead-end.
>
>> I too think that having a brand new interface altogether and teaching
>> userspace about it is a much cleaner approach.
>> On the same lines, if we were to do that, we could also add more useful
>> metrics in that interface like ballpark number of threads to saturate
>> usage as well as gather more such metrics as suggested by Tejun Heo.
>>
>> My only concern for this would be that if today applications aren't
>> modifying their code to read the existing cgroup interface and would
>> rather resort to using userspace side-channel solutions like LXCFS or
>> wrapping them up in kata containers, would it now be compelling enough
>> to introduce yet another interface?
> While I'm sympathetic to the compatibility argument, identifying available
> resources was never well-defined with the existing interfaces. Most of the
> available information is what hardware is available but there's no
> consistent way of knowing what the software environment is like. Is the
> application the only one on the system? How much memory should be set aside
> for system management, monitoring and other administrative operations?
>
> In practice, the numbers that are available can serve as the starting points
> on top of which application and environment specific knowledge has to be
> applied to actually determine deployable configurations, which in turn would
> go through iterative adjustments unless the workload is self-sizing.
>
> Given such variability in requirements, I'm not sure what numbers should be
> baked into the "namespaced" system metrics. Some numbers, e.g., number of
> CPUs may be mapped from cpuset configuration but even that requires
> quite a bit of assumptions about how cpuset is configured and the
> expectations the applications would have while other numbers - e.g.
> available memory - are a total non-starter.
>
> If we try to fake these numbers for containers, what's likely to happen is
> that the service owners would end up tuning workload size against whatever
> number the kernel is showing factoring in all the environmental factors
> knowingly or just through iterations. And that's not *really* an interface
> which provides compatibility. We're just piping new numbers which don't
> really mean what they used to mean and whose meanings can change depending
> on configuration through existing interfaces and letting users figure out
> what to do with the new numbers.
>
> To achieve compatibility where applications don't need to be changed, I
> don't think there is a solution which doesn't involve going through
> userspace. For other cases and long term, the right direction is providing
> well-defined resource metrics that applications can make sense of and use to
> size themselves dynamically.

I agree that major local resources like CPUs and memory cannot be
distributed cleanly with namespace semantics.
Memory, like CPU, therefore faces similar coherency issues:
/proc/meminfo can differ from the restrictions that are actually in
place.
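
As a rough illustration of that mismatch, here is a minimal sketch that
compares the two views. It assumes cgroup v2, where memory.max holds
either a byte count or the literal "max"; the helper names are
hypothetical, and the result is only an upper bound, since it ignores
limits set on ancestor cgroups and whatever else shares the machine.

```python
# Sketch only: the function names are hypothetical, and the inputs are the
# raw text of /proc/meminfo and of a cgroup v2 memory.max file.

def parse_meminfo_total_kib(meminfo_text):
    """Return the MemTotal value (in KiB) from /proc/meminfo text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Line looks like: "MemTotal:       16384000 kB"
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def parse_memory_max(memory_max_text):
    """Return the cgroup v2 memory.max limit in bytes, or None if unlimited."""
    value = memory_max_text.strip()
    return None if value == "max" else int(value)


def memory_upper_bound_bytes(meminfo_text, memory_max_text):
    """Smaller of the system-wide total and the cgroup limit, in bytes."""
    total = parse_meminfo_total_kib(meminfo_text) * 1024
    limit = parse_memory_max(memory_max_text)
    return total if limit is None else min(total, limit)
```

An application that reads only /proc/meminfo sees the first number; the
restriction that actually applies to it is the second.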

While a CPU namespace may not be the preferred way of solving this
problem, the prototype RFC is meant more for understanding the problems
related to it, as well as other potential directions that we could
explore for solving it.

Also, I agree with your point about the variability of requirements. If
applications still have to derive metrics from the interface we give
them, or from other kernel information, even though that interface
reflects the limits set, then the interface would not be useful.
If the solution to this problem lies in userspace, then I'm all for it
as well. However, the intention is to probe whether this could
potentially be solved cleanly in the kernel.

>> While I concur with Tejun Heo's comment in the mail thread that
>> overloading existing sys and proc interfaces, which were originally
>> designed for system-wide resources, may not be a great idea:
>>
>>> There is a fundamental problem with trying to represent a resource shared
>>> environment controlled with cgroup using system-wide interfaces including
>>> procfs
>> A fundamental question we probably need to ascertain could be -
>> Today, is it incorrect for applications to look at the sys and procfs to
>> get resource information, regardless of their runtime environment?
> Well, it's incomplete even without containerization. Containerization just
> amplifies the shortcomings. All of these problems existed well before
> cgroups / namespaces. How would you know how much resource you can consume
> on a system just looking at hardware resources without implicit knowledge of
> what else is on the system? It's just that we are now more likely to load
> systems dynamically with containerization.

Yes, these shortcomings exist even without containerization. On a
dynamically loaded multi-tenant system it becomes very difficult to
determine the maximum amount of resources that can be requested before
we hurt our own performance.
cgroups and namespace mechanics help give containers some structure
around the maximum amount of resources that they can consume. However,
applications are unable to leverage that in some cases, especially if
they are more inclined to look at a more traditional system-wide
interface like sys and proc.
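
For reference, a minimal sketch of what an application would have to do
today to size itself from the cgroup interface instead. It assumes
cgroup v2, with cpu.max in the "$MAX $PERIOD" form and
cpuset.cpus.effective in the usual CPU list form (e.g. "0-3,8"); the
function names are hypothetical, and rounding the quota up to whole CPUs
is just one possible policy.

```python
import math

# Sketch only: the function names are hypothetical, and the inputs are the
# raw text of the cgroup v2 files cpuset.cpus.effective and cpu.max.

def parse_cpu_list(cpuset_text):
    """Expand a CPU list such as "0-3,8" into a set of CPU ids."""
    cpus = set()
    for part in cpuset_text.strip().split(","):
        if not part:
            continue  # an empty file yields an empty set
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def parse_cpu_max(cpu_max_text):
    """Return quota/period as a float, or None when the quota is "max"."""
    quota, period = cpu_max_text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def usable_cpus(cpuset_text, cpu_max_text):
    """Ballpark CPU count: cpuset size capped by the bandwidth quota."""
    n = len(parse_cpu_list(cpuset_text))
    bandwidth = parse_cpu_max(cpu_max_text)
    if bandwidth is None:
        return n
    # Assumption: round the quota up and never report fewer than one CPU.
    return max(1, min(n, math.ceil(bandwidth)))
```

With a cpuset of "0-3,8" and a cpu.max of "200000 100000" this reports 2,
whereas the system-wide interfaces would still report every online CPU.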

>> Also, if an application were to only be able to view the resources
>> based on the restrictions set regardless of the interface - would there
>> be a disadvantage for them if they could only see an overloaded context
>> sensitive view rather than the whole system view?
> Can you elaborate further? I have a hard time understanding what's being
> asked.

The question that I have essentially tries to understand the
implications of overloading the definitions of existing interfaces to
be context sensitive.
The way the prototype works today, it does not interfere with the
information when the system boots or even when it is run in a new
namespace.
The effects are only observed when restrictions are applied to it.
Therefore, what would potentially break if interfaces like these were
made to report information based on restrictions rather than the
whole-system view?

Thanks
Pratik