linux-kernel - Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

Message-ID: <CAPM31RKx0vT-9VFN=XASYM4iv4U5ZGZW93XRtJd_7mOHwu76NA@mail.gmail.com>
Date:	Thu, 15 Oct 2015 04:42:37 -0700
From:	Paul Turner <pjt@...gle.com>
To:	Tejun Heo <tj@...nel.org>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>, lizefan@...wei.com,
	cgroups <cgroups@...r.kernel.org>,
	LKML <linux-kernel@...r.kernel.org>,
	kernel-team <kernel-team@...com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 3/3] sched: Implement interface for cgroup unified hierarchy

On Thu, Oct 1, 2015 at 11:46 AM, Tejun Heo <tj@...nel.org> wrote:
> Hello, Paul.
>
> Sorry about the delay.  Things were kinda hectic in the past couple
> weeks.

Likewise :-(

>
> On Fri, Sep 18, 2015 at 04:27:07AM -0700, Paul Turner wrote:
>> On Sat, Sep 12, 2015 at 7:40 AM, Tejun Heo <tj@...nel.org> wrote:
>> > On Wed, Sep 09, 2015 at 05:49:31AM -0700, Paul Turner wrote:
>> >> I do not think this is a layering problem.  This is more like C++:
>> >> there is no sane way to concurrently use all the features available,
>> >> however, reasonably self-consistent subsets may be chosen.
>> >
>> > That's just admitting failure.
>> >
>>
>> Alternatively: accepting there are varied use-cases to
>> support.
>
> Analogies like this can go awry but as we're in it anyway, let's push
> it a bit further.  One of the reasons why C++ isn't lauded as an
> example of great engineering is while it does support a vast number of
> use-cases or rather usage-scenarios (it's not necessarily focused on
> utility but just how things are done) it fails to distill the essence
> of the actual utility out of them and condense it.  It's not just an
> aesthetic argument.  That failure exacts heavy costs on its users and
> is one of the reasons why C++ projects are more prone to horrible
> disarrays unless specific precautions are taken.
>
> I'm not against supporting valid and useful use-cases but not all
> usage-scenarios are equal.  If we can achieve the same eventual goals
> with reasonable trade-offs in a simpler and more straight-forward way,
> that's what we should do even though that'd require some modifications
> to specific usage-scenarios.  ie. the usage-scenarios need to be
> scrutinized so that the core of the utility can be extracted and
> abstracted in the, hopefully, minimal way.
>
> This is what worries me when you liken the situation to C++.  You
> probably were trying to make a different point but I'm not sure we're
> on the same page and I think we need to agree at least on this in
> principle; otherwise, we'll just keep talking past each other.


I agree with trying to reach a minimal core functionality that
satisfies all use-cases.  I am only saying, however, that I do not
think we can reduce to an API so minimal that all users will use all
parts of it.  We have to fit more than one usage model in.

>
>> > The kernel does not update all CPU affinity masks when a CPU goes down
>> > or comes up.  It just enforces the intersection and when the
>> > intersection becomes empty, ignores it.  cgroup-scoped behaviors
>> > should reflect what the system does in the global case in general, and
>> > the global behavior here, although missing some bits, is a lot saner
>> > than what cpuset is currently doing.
>>
>> You are conflating two things here:
>> 1) How we maintain these masks
>> 2) The interactions on updates
>>
>> I absolutely agree with you that we want to maintain (1) in a
>> non-pointwise format.  I've already talked about that in other replies
>> on this thread.
>>
>> However for (2) I feel you are:
>>  i) Underestimating the complexity of synchronizing updates with user-space.
>>  ii) Introducing more non-desirable behaviors [partial overwrite] than
>> those you object to [total overwrite].
>
> The thing which bothers me the most is that cpuset behavior is
> different from global case for no good reason.

I've tried to explain above that I believe there are sound reasons
for it working the way it does from an interface perspective.  I do
not think they can be discarded out of hand.  However, I think we
should continue winnowing focus and first resolve the model of
interaction for sub-process hierarchies.

> We don't have a model
> right now.  It's schizophrenic.  And what I was trying to say was that
> maybe this is because we never had a working model in the global case
> either but if that's the case we need to solve the global case too or
> at least figure out where we wanna be in the long term.
>
>> It's the most consistent choice; you've not given any reasons above
>> why a solution with only partial consistency is any better.
>>
>> Any choice here is difficult to coordinate, that two APIs allow
>> manipulation of the same property means that we must always
>> choose some compromise here.  I prefer the one with the least
>> surprises.
>
> I don't think the current situation around affinity mask handling can
> be considered consistent and cpuset is pouring more inconsistencies
> into it.  We need to figure it out one way or the other.
>
> ...
>> I do not yet see a good reason why the threads arbitrarily not sharing an
>> address space necessitates the use of an entirely different API.  The
>> only problems stated so far in this discussion have been:
>>   1) Actual issues involving relative paths, which are potentially solvable.
>
> Also the ownership of organization.  If the use-cases can be
> reasonably served with static grouping, I think it'd definitely be a
> worthwhile trade-off to make.  It's different from process level
> grouping.  There, we can simply state that this is to be arbitrated in
> the userland and that arbitration isn't that difficult as it's among
> administration stack of userspace.
>
> In-process attributes are different.  The process itself can
> manipulate its own attributes but it's also common for external tools
> to peek into processes and set certain attributes.  Even when the two
> parties aren't coordinated, this is usually fine because there's no
> reason for applications to depend on what those attributes are set to
> and even when the different entities do different things, the
> combination is still something coherent.
>
> Now, if you make the in-process grouping dynamic and accessible to
> external entities (and if we aren't gonna do that, why even bother?),
> this breaks down and we have some of the same problems we have with
> allowing applications to directly manipulate cgroup sub-directories.
> This is a fundamental problem.  Setting attributes can be shared but
> organization is an exclusive process.  You can't share that without
> close coordination.

Your concern here is centered on permissions, not the interface.

This seems directly remedied by the following:
  Any sub-process hierarchy we expose would be locked down in terms
of write access.  These would not be generally writable.  You're
absolutely correct that you can't share without close coordination,
and granting the appropriate permissions is part of that.

>
> Assigning the full responsibility of in-process organization to the
> application itself and tying it to static parental relationship allows
> for solid common grounds where these resource operations can be
> performed by different entities without causing structural issues just
> like other similar operations.

But cases have already been presented above where the full
responsibility cannot be delegated to the application, because we
explicitly depend on constraints being provided by the external
environment.

>
> Another point for assigning this responsibility to the application
> itself is that it can't be done without the application's cooperation
> anyway because the group membership of new threads is determined by
> the group the parent belongs to.
>
>>   2) Aesthetic distaste for using file-system abstractions
>
> It's not that but more about what the file-system interface implies.
> It's not just different.  It breaks a lot of expectations a lot of
> application visible kernel interface provides as explained above.
> There are reasons why we usually don't do things this way.

The arguments you've made above are largely centered on permissions
and the right to make modifications.  I don't see what other
expectations you believe are being broken here.  This still feels like
an aesthetic objection.

>
> ...
>> >> 1) It depends on "anchor" threads to define groupings.
>> >
>> > So does cgroupfs.  Any kind of thread or process grouping can't escape
>> > that as soon as things start forking and if things don't fork whether
>> > something is anchor or not doesn't make much difference.
>>
>> The difference is that this ignores how applications are actually written:
>
> It does require the applications to follow certain protocols to
> organize itself but this is a pretty trivial thing to do and comes
> with the benefit that we don't need to introduce a completely new
> grouping concept to applications.

I strongly disagree here:  Applications today do _not_ use sub-process
clone hierarchies.  As a result, this _is_ introducing a completely
new grouping concept because it's one applications have never cared
about outside of a shell implementation.

>
>> A container that is independent of its members (e.g. a cgroup
>> directory) can be created and configured by an application's Init() or
>> within the construction of a data-structure that will use it without
>> dependency on those resources yet being used.
>>
>> As an example:
>>   The resources associated with thread pools are often dynamically
>> managed.  What you're proposing means that some initialization must
>> now be moved into the first thread that pool creates (as opposed to
>> the pool's initialization), that synchronization and identification of
>> this thread is now required, and that it must be treated differently
>> to other threads in the pool (it can no longer be reclaimed).
>
> That should be like a two hour job for most applications.  This is a
> trivial thing to do.  It's difficult for me to consider the difficulty
> of doing this a major decision point.
>

You are seriously underestimating the complexity and API overhead this
introduces.  It cannot be claimed trivial and discarded; it's not.


>> >> 2) It does not allow thread-level hierarchies to be created
>> >
>> > Huh?  That's the only thing it would do.  This obviously wouldn't get
>> > applied to processes.  It's strictly about threads.
>>
>> This allows a single *partition*, not a hierarchy.    As machines
>> become larger, so are many of the processes we run on them.  These
>> larger processes manage resources between threads on scales that we
>> would previously partition between processes.
>
> I don't get it.  Why wouldn't it allow hierarchy?

"- If $TID isn't already a resource group leader, it creates a
  sub-cgroup, sets $KEY to $VAL and moves $PID and all its descendants
  to it.

- If $TID is already a resource group leader, set $KEY to $VAL."

This only allows resource groups at the root level to be created.
There is no way to make $TID2 a resource group leader, parented by
$TID1.
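
To make that reading concrete, here is a rough toy model of the two
quoted rules.  It is only a sketch: ToyProcess, set_resource() and the
"weight" key are names made up for illustration and do not correspond
to any real kernel interface.  The point is that the call carries a tid
and a key/value pair but nothing that names a parent group.

```python
# Toy model of the quoted rule. All names are hypothetical; not a kernel interface.
class ToyProcess:
    def __init__(self, parent_of):
        # parent_of: tid -> tid of the thread that created it (None for the first thread)
        self.parent_of = dict(parent_of)
        self.group_of = {tid: "root" for tid in self.parent_of}
        self.leaders = {}  # leader tid -> {key: val}; no parent group is recorded

    def _descends_from(self, tid, ancestor):
        # True if tid is ancestor itself or was (transitively) created by it.
        while tid is not None:
            if tid == ancestor:
                return True
            tid = self.parent_of[tid]
        return False

    def set_resource(self, tid, key, val):
        if tid not in self.leaders:
            # Not yet a leader: create a group, move tid and its descendants into it.
            self.leaders[tid] = {}
            for t in self.parent_of:
                if self._descends_from(t, tid):
                    self.group_of[t] = tid
        # Already (or now) a leader: just set the key.
        self.leaders[tid][key] = val


# Thread 1 created thread 2, which created thread 3.
p = ToyProcess({1: None, 2: 1, 3: 2})
p.set_resource(2, "weight", 50)
p.set_resource(3, "weight", 10)
print(p.group_of)
print(sorted(p.leaders))
```

In this model threads 2 and 3 each end up leading their own group, and
nothing in the recorded state says that group 3 lives under group 2.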

>
>> >> 3) When coordination with an external agent is desired this defines no
>> >> common interface that can be shared.  Directories are an extremely
>> >> useful container.  Are you proposing applications would need to
>> >> somehow publish the list of anchor-threads from (1)?
>> >
>> > Again, this is an invariant no matter what we do.  As I wrote numerous
>> > times in this thread, this information is only known to the process
>> > itself.  If an external agent want to manipulate these from outside,
>> > it just has to know which thread is doing what.  The difference is
>> > that this doesn't require the process itself to coordinate with
>> > external agent when operating on itself.
>>
>> Nothing about what was previously stated would require any coordination
>> with the process and an external agent when operating on itself.
>> What's the basis for this claim?
>
> I hope this is explained now.

See above regarding permissions.

>
>> This also ignores the cases previously discussed in which the external
>> agent is providing state for threads within a process to attach to.
>> An example of this is repeated below.
>>
>> This isn't even covering that this requires the invention of entirely
>> new user-level APIs and coordination for somehow publishing these
>> magic tids.
>
> We already have those tids.

External management applications do not.  My point was that they would
now need a new API to handle publishing those tids, whereas using the
VFS handles this naturally.

>
>> >> What if I want to set up state that an application will attach
>> >> threads to [consider cpuset example above]?
>> >
>> > It feels like we're running in circles.  Process-level control stays
>> > the same.  That part is not an issue.  Thread-level control requires
>> > cooperation from the process itself no matter what and should stay
>> > confined to the limits imposed on the process as a whole.
>> >
>> > Frankly, cpuset example doesn't make much sense to me because there is
>> > nothing hierarchical about it and it isn't even layered properly.  Can
>> > you describe what you're actually trying to achieve?  But no matter
>> > the specifics of the example, it's almost trivial to achieve
>> > whatever end results.
>>
>> This has been previously detailed, repeating it here:
>>
>> Machines are shared resources, we partition the available cpus into
>> shared and private sets.  These sets are dynamic as when a new
>> application arrives requesting private cpus, we must reserve some cpus
>> that were previously shared.
>>
>> We use sub-cpusets to advertise to applications which of their cpus
>> are shared and which are private.  They can then attach threads to
>> these containers  -- which are dynamically updated as cores shift
>> between public and private configurations.
>
> I see but you can easily do that the other way too, right?  Let the
> applications publish where they put their threads and let the external
> entity set configs on them.

And what API controls the right to do this?

>
>> >> 4) How is the cgroup property to $KEY translation defined?  This feels
>> >> like an ioctl and no more natural than the file-system.  It also does
>> >
>> > How are they even comparable?  Sure ioctl inputs are variable-formed
>> > and its definitions aren't closely scrutinized but other than those
>> > it's a programmable system-call interface and how programs use and
>> > interact with them is completely different from how a program
>> > interacts with cgroupfs.
>>
>> They're exactly comparable in that every cgroup.<property> api now
>> needs some magic equivalent $KEY defined.  I don't understand how
>> you're proposing these would be generated or managed.
>
> Not everything.  Just the ones which make sense in-process.  This is
> exactly the process we need to go through when introducing new
> syscalls.  Why is this a surprise?  We want to scrutinize them, hard.

I'm talking only about the control->$KEY mapping.  Yes it would be a
subset, but this seems a large step back in usability.

>
>> > It doesn't have to parse out the path,
>> > compose the knob path, open and format the data into it
>>
>> There's nothing hard about this.  Further, we already have to do
>> exactly this at the process level; which means abstractions for this
>
> I'm not following.  Why would it need to do that already?

Because the process-level interface will continue to work the way it
does today.  That means we still need to implement these operations.

This same library code could be shared for applications to use on
their private, sub-process, controls.
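
As a rough illustration of how small that shared code is, here is a
minimal sketch.  write_knob() and read_knob() are hypothetical helper
names, and the directory passed in is whatever group directory the
caller already holds; nothing here is an existing library API.

```python
import os


def write_knob(group_dir, knob, value):
    """Write value to the control file `knob` inside `group_dir`.

    On a real cgroup mount the control file already exists; on an
    ordinary directory (as in a test) opening it for write creates it.
    """
    path = os.path.join(group_dir, knob)
    with open(path, "w") as f:
        f.write(f"{value}\n")
    return path


def read_knob(group_dir, knob):
    """Return the current contents of a control file, stripped of whitespace."""
    with open(os.path.join(group_dir, knob)) as f:
        return f.read().strip()
```

The same two functions apply whether group_dir is a process-level
directory or a private sub-process one; only the directory argument
differs.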

>
>> already exist; removing this property does not change their presence
>> or requirement, but instead means they must be duplicated for the
>> in-thread case.
>>
>> Even ignoring that the libraries for this can be shared between thread
>> and process, this is also generally easier to work with than magic
>> $KEY values.
>
> This is like saying syscalls are worse in terms of programmability
> compared to opening and writing formatted strings for setting
> attributes.  If that's what you're saying, let's just agree to disagree
> on this one.

The goal of such a system is as much administration as it is a
programmable interface.  There's a reason much configuration is
specified by sysctls and not syscalls.

>
>> > all the while
>> > not being sure whether the file it's operating on is even the right
>> > one anymore or the sub-hierarchy it's assuming is still there.
>>
>> One possible resolution to this has been proposed several times:
>>   Have the sub-process hierarchy exposed in an independent and fixed location.
>>
>> >> not seem to resolve your concerns regarding races; the application
>> >> must still coordinate internally when concurrently calling
>> >> set_resource().
>> >
>> > I have no idea where you're going with this.  When did the internal
>> > synchronization inside a process become an issue?  Sure, if a thread
>> > does *(int *)=0, we can't protect other threads from it.  Also, why
>> > would it be a problem?  If two perform set_resource() on the same
>> > thread, one will be executed after the other.  What are you talking
>> > about?
>>
>> It was my impression that you'd had atomicity concerns regarding
>> file-system operations such as writes for updates previously.  If you
>> have no concerns within a sub-processes operation then this can be
>> discarded.
>
> That's comparing apples and oranges.  Threads being moved around and
> hierarchies changing beneath them present a whole different set of issues
> than someone else setting an attribute to a different value.  The
> operations might fail, might set properties on the wrong group.
>

There are no differences between using VFS and your proposed API for this.

>> >> 5) How does an external agent coordinate when a resource must be
>> >> removed from a sub-hierarchy?
>> >
>> > That sort of restriction should generally be put at the higher level.
>> > Thread-level resource control should be cooperative with the
>> > application if at all necessary and in those cases just set the limit
>> > on the sub-hierarchy would work.
>> >
>>
>> Could you expand on how you envision this being cooperative?  This
>> seems tricky to me, particularly when limits are involved.
>>
>> How do I even arbitrate which external agents are allowed control?
>
> I think we're talking past each other.  If you wanna put restrictions
> on the process as whole, do it at the higher level.  If you wanna
> fiddle with in-process resource distribution, you just have to assume
> that the application itself is cooperative or at least not malicious.
> No matter what external entities try to do, the application can
> circumvent because that's what ultimately determines the grouping.

I think you misunderstood here.  What I'm asking is equivalently:
- How do I bless a 'good' external agent to be allowed to make
modifications?
- How do I make sure a malicious external process is not able to make
modifications?
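
With a file-system interface, the usual answer to both is file
ownership and mode bits.  As a minimal sketch of that classic check
(may_write() is a hypothetical helper, and it deliberately ignores
ACLs, capabilities and root):

```python
import stat


def may_write(mode, owner_uid, owner_gid, uid, gids):
    """Classic owner/group/other write check for one file.

    Simplified on purpose: ignores ACLs, capabilities and uid 0.
    The first matching class decides, as with ordinary file permissions.
    """
    if uid == owner_uid:
        return bool(mode & stat.S_IWUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


# A control file owned by uid 1000, group 2000, with mode rw-rw-r--.
MODE = 0o664
print(may_write(MODE, 1000, 2000, uid=1000, gids={1000}))  # the application itself
print(may_write(MODE, 1000, 2000, uid=1001, gids={2000}))  # agent added to the group
print(may_write(MODE, 1000, 2000, uid=1002, gids={1002}))  # unrelated process
```

Blessing an agent then amounts to putting it in the owning group;
everyone else only gets the "other" bits.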

>
>> So I was really trying to make sure we covered the interface problems
>> we're trying to solve here.  Are there major ones not listed there?
>>
>> However, I strongly disagree with this statement.  It is much easier
>> for applications to work with named abstract objects than having magic
>> threads that it must track and treat specially.
>
> How is that different?  Sure, the name is created by the threads but
> once you set the resource, the tid would be the resource group ID and
> the thread can go away.  It's still an object named by an ID.

Huh?? If the thread goes away, then the tid can be re-used -- within
the same process.  Now you have non-unique IDs to operate on??

> The
> only difference is that the process of creating the hierarchy is tied
> to the process that threads are created in.
>
>> My implementation must now look like this:
>>   1) I instantiate some abstraction which uses cgroups.
>>   2) In construction I must now coordinate with my chosen threading
>> implementation (an exciting new dependency) to create a new thread and
>> get its tid.  This thread must exist for as long as the associated
>> data-structure.  I must pay a kernel stack, at least one page of
>> thread stack and however much TLS I've declared in my real threads.
>>   3) In destruction I must now wake and free the thread created in (2).
>>   4) If I choose to treat it as a real thread, I must be careful, this
>> thread is special and cannot be arbitrarily released like other
>> threads.
>>   5) To do anything I must go grok the documentation to look up the
>> magic $KEY.  If I get this wrong I am going to have a fun time
>> debugging it since things are no longer reasonably inspect-able.  If I
>> must work with a cgroup that adds features over time things are even
>> more painful since $KEY may or may not exist.
>>
>> Is any of the above unfair with respect to what you've described above?
>
> Yeah, as I wrote above.
>
>> This isn't even beginning to consider the additional pain that a
>> language implementing its own run-time such as Go might incur.
>
> Yeap, it does require userland runtime to have a way to make the
> thread creation history visible to the operating system.  It doesn't
> look like a big price.  Again, I'm looking for a balance.

I know that the current API charges a minimal price here.
I strongly believe that what you're proposing carries a significant price.

>
> You're citing inconveniences from userland side and yeah I get that.
> Making things more rigid and static requires some adjustments from
> userland but we gain from it too.  No need to worry about structural
> inconsistencies and the varied failure modes which can cascade from
> that.

See below.

>
> If the only possible solution is C++-esque everything-goes way, sure,
> we'll have to do that but that's not the case.  We can implement and
> provide the core functionality in a more controlled manner.
>
>> Option B:
>>   We expose sub-process hierarchies via /proc/self/cgroups or similar.
>> They do not appear within the process only cgroup hierarchy.
>>   Only the same user (or a privileged one) has access to this internal
>> hierarchy.  This can be arbitrarily restricted further.
>>   Applications continue to use almost exactly the same cgroup
>> interfaces that exist today, however, the problem of path computation
>> and non-stable paths are now eliminated.
>>
>> Really, what problems does this not solve?
>>
>> It eliminates the unstable mount point, your concerns regarding
>> external entity manipulation, and allows for the parent processes to
>> be moved.  It provides a reasonable place for coordination to occur,
>> with standard mechanisms for access control.  It allows for state to
>> be easily inspected, it does not require new documentation, allows the
>> creation of sub-hierarchies, does not require special threads.
>>
>> This was previously raised as a straw man, but I have not yet seen or
>> thought of good arguments against it.
>
> It allows for structural inconsistencies where applications can end up
> performing operations which are non-sensical.  Breaking that invariant
> is substantial.  Why would we do that if

Can you please provide an example?  I don't know what inconsistencies
you mean here.  In particular, I do not see anything that your
proposed interface resolves versus this; while being _significantly_
simpler for applications to use and implement.

>
> Can we at least agree that we're now venturing into an area where
> things aren't really critical?  The core functionality here is being
> able to hierarchically categorize threads and assign resource limits
> to them.  Can we agree that the minimum core functionality is met in
> both approaches?

I'm not entirely sure how to respond here.  I am deeply concerned that
the API you're proposing is not tenable for providing this core
functionality.  I worry that you're introducing serious new challenges
and too quickly discarding them as manageable.