linux-kernel - Re: [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service management

Message-ID: <alpine.DEB.2.10.1508231100120.13335@vshiva-Udesk>
Date:	Sun, 23 Aug 2015 11:47:49 -0700 (PDT)
From:	Vikas Shivappa <vikas.shivappa@...el.com>
To:	Marcelo Tosatti <mtosatti@...hat.com>
cc:	Vikas Shivappa <vikas.shivappa@...el.com>,
	Matt Fleming <matt@...eblueprint.co.uk>,
	Tejun Heo <tj@...nel.org>,
	Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
	linux-kernel@...r.kernel.org, x86@...nel.org, hpa@...or.com,
	tglx@...utronix.de, mingo@...nel.org, peterz@...radead.org,
	matt.fleming@...el.com, will.auld@...el.com,
	glenn.p.williamson@...el.com, kanaka.d.juvva@...el.com,
	Karen Noel <knoel@...hat.com>
Subject: Re: [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service
 management



On Fri, 21 Aug 2015, Marcelo Tosatti wrote:

> On Thu, Aug 20, 2015 at 05:06:51PM -0700, Vikas Shivappa wrote:
>>
>>
>> On Mon, 17 Aug 2015, Marcelo Tosatti wrote:
>>
>>> Vikas, Tejun,
>>>
>>> This is an updated interface. It addresses all comments made
>>> so far and also covers all use-cases the cgroup interface
>>> covers.
>>>
>>> Let me know what you think. I'll proceed to writing
>>> the test applications.
>>>
>>> Usage model:
>>> ------------
>>>
>>> This document details how CAT technology is
>>> exposed to userspace.
>>>
>>> Each task has a list of task cache reservation entries (TCRE list).
>>>
>>> The init process is created with empty TCRE list.
>>>
>>> There is a system-wide unique ID space, each TCRE is assigned
>>> an ID from this space. ID's can be reused (but no two TCREs
>>> have the same ID at one time).
>>>
>>> The interface accommodates transient and independent cache allocation
>>> adjustments from applications, as well as static cache partitioning
>>> schemes.
>>>
>>> Allocation:
>>> Usage of the system calls require CAP_SYS_CACHE_RESERVATION capability.
>>>
>>> A configurable percentage is reserved to tasks with empty TCRE list.
>
> Hi Vikas,
>
>> And how do you think you will do this without a system controlled
>> mechanism ?
>> Everytime in your proposal you include these caveats
>> which actually mean to include a system controlled interface in the
>> background ,
>> and your below interfaces make no mention of this really ! Why do we
>> want to confuse ourselves like this ?
>> syscall only interface does not seem to work on its own for the
>> cache allocation scenario. This can only be a nice to have interface
>> on top of a system controlled mechanism like cgroup interface. Sure
>> you can do all the things you did with cgroup with the same with
>> syscall interface but the point is what are the use cases that cant
>> be done with this syscall only interface. (ex: to deal with cases
>> you brought up earlier like when an app does cache intensive work
>> for some time and later changes - it could use the syscall interface
>> to quickly relinquish the cache lines or change a clos associated
>> with it)
>
> All use cases can be covered with the syscall interface.
>
> * How to convert from cgroups interface to syscall interface:
> Cgroup: Partition cache in cgroups, add tasks to cgroups.
> Syscall: Partition cache in TCRE, add TCREs to tasks.
>
> You build the same structure (task <--> CBM) either via syscall
> or via cgroups.
>
> Please be more specific, can't really see any problem.

Well, at first you said that the cgroup interface does not support specifying
the size in bytes or as a percentage, and then you eventually agreed with my
explanation that you can easily write a bash script to do the same with cgroup
bitmasks (although I had to go through the pain of reading all the proposals
you sent without being given a chance to explain how it can be used). Then you
were confused about how I explained the co-mounting of cpuset and intel_rdt,
and instead of asking a question or pointing out an issue, you went ahead and
wrote a whole proposal, and in the end even said you would cook up a patch
before I had a chance to explain.
And then you sent proposal after proposal, which varied from modifying the
cgroup interface itself, to slightly modifying cgroups and adding syscalls, to
automatically controlling the cache allocation (with all your extend mask
capabilities), without understanding what the framework is meant to do, or just
asking about or specifically pointing out any issues in the patch. You had been
reviewing the cgroup patches for many versions, unlike others who accepted that
they need time to think about it or that they may not understand the feature
yet.
So what has changed in the patches that is not acceptable now? Many things have
been brought up multiple times, even after you agreed to a solution already
proposed. I was only suggesting that this can be better and less confusing if
you point out the exact issue in the patch, just like Thomas and the other
reviewers have been doing. With the rest of the reviewers I either fix the
issue or point out a flaw in the review.
If you don't like the cgroup interface now, it would be best to indicate or
discuss the specifics of its shortcomings clearly before sending new proposals.
That way we can come up with an interface which does better and works better in
Linux, if we can. Otherwise we may just end up adding more code which does the
same thing.

However, I have been working on an alternate interface as well and have just
sent it for your reference.

>
>> I have repeatedly listed the use cases that can be dealt with , with
>> this interface. How will you address the cases like 1.1 and 1.2 with
>> your syscall only interface ?
>
> Case 1.1:
> --------
>
>  1.1> Exclusive access:  The task cannot give *itself* exclusive
> access from using the cache. For this it needs to have visibility of
> the cache allocation of other tasks and may need to reclaim or
> override others cache allocs which is not feasible (isnt that the
> ability of a system managing agent?).
>
> Answer: if the application has CAP_SYS_CACHE_RESERVATION, it can
> create cache allocation and remove cache allocation from
> other applications. So only the administrator could do it.

1.1 also includes another use case (let's call it 1.1.1), in which the apps
would just allocate a lot of cache and soon run out of space. Hence the first
few apps would get most of the cache (they would get *most* even if you reserve
some % of the cache for others, and again that is difficult to assign to the
others).

Now if you say you want to put a threshold limit on how much each app can
self-allocate, then that turns out to be an interface that can easily be built
on top of the existing cgroup interface. IOW it's just a control you are giving
the app on top of an existing admin-controlled interface (like cgroup). The
threshold can just be the CBM of the cgroup which the tasks belong to. So now
the apps can self-allocate, or reduce their allocation, to something which is a
subset of what the cgroup has (that's one way).

Also, the issue to discuss was self-allocation (a process deciding its own
allocation) vs. a system-controlled mechanism. It wasn't clear which of the
syscalls need this capability (CAP_SYS_CACHE_RESERVATION) and which do not.

>
> Case 1.2 answer below.
>
>> So we expect all the millions of apps
>> like SAP, oracle etc and etc and all the millions of app developers
>> to magically learn our new syscall interface and also cooperate
>> between themselves to decide a cache allocation that is agreeable to
>> all ?  (which btw the interface doesnt list below how to do it) and
>
> They don't have to: the administrator can use "cacheset" application.

The "cacheset" tool wasn't mentioned before. Now you are talking about a tool
which also does centralized or system-controlled allocation. This is where I
pointed out earlier that it's best to keep the discussion to the point and not
randomly expand the scope to a variety of other options. If you want to build a
taskset-like tool, that's again just a system-controlled interface or a
centralized control mechanism, which is what cgroup is. Then it just comes down
to whether the cgroup interface or cacheset is easier or more intuitive. And
why would the already widely used interface for resource allocation not be
intuitive? We first need to answer that, maybe? Or are there any really
required features it lacks?
Also, given that Docker uses cgroups for resource allocation, it seems the best
fit, and that's the feedback I received repeatedly at LinuxCon as well.

>
> If an application wants to control the cache, it can.
>
>> then by some godly powers the noisy neighbour will decide himself
>> to give up the cache ?
>
> I suppose you imagine something like this:
> http://arxiv.org/pdf/1410.6513.pdf
>
> No, the syscall interface does not need to care about that because:
>
> * If you can set cache (CAP_SYS_CACHE_RESERVATION capability),
> you can remove cache reservation from your neighbours.
>
> So this problem does not exist (it assumes participants are
> cooperative).
>
> There is one confusion in the argument for cases 1.1 and case 1.2:
> that applications are supposed to include in their decision of cache
> allocation size the status of the system as a whole. This is a flawed
> argument. Please point specifically if this is not the case or if there
> is another case still not covered.

Like I said, it wasn't clear which syscalls require this capability. Also,
1.1.1 still breaks this; IOW the apps need to have less control than a
system/admin-controlled allocation.

>
> It would be possible to partition the cache into watermarks such
> as:
>
> task group A - can reserve up to 20% of cache.
> task group B - can reserve up to 25% of cache.
> task group C - can reserve 50% of cache.
>
> But i am not sure... Tejun, do you think that is necessary?
> (CAP_SYS_CACHE_RESERVATION is good enough for our usecases).
>
>>  (that should be first ever app to not request
>> more resource in the world for himself and hurt his own performance
>> - they surely dont want to do social service !)
>>
>> And how do we do the case 1.5 where the administrator want to assign
>> cache to specific VMs in a cloud etc - with the hypothetical syscall
>> interface we now should expect all the apps to do the above and now
>> they also need to know where they run (what VM , what socket etc)
>> and then decide and cooperate an allocation : compare this to a
>> container environment like rancher where today the admin can
>> conveniently use docker underneath to allocate mem/storage/compute to
>> containers and easily extend this to include shared l3.
>>
>> http://marc.info/?l=linux-kernel&m=143889397419199
>>
>> without addressing the above the details of the interface below is irrelevant -
>
> You are missing the point, there is supposed to be a "cacheset"
> program which will allow the admin to setup TCRE and assign them to
> tasks.
>
>> Your initial request was to extend the cgroup interface to include
>> rounding off the size of cache (which can easily be done with a bash
>> script on top of cgroup interface !) and now you are proposing a
>> syscall only interface ? this is very confusing and will only
>> unnecessarily delay the process without adding any value.
>
> I suppose you are assuming that its necessary for applications to
> set their own cache. This assumption is not correct.
>
> Take a look at Tuna / sched_getaffinity:
>
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Affinity.html
>
>
>> however like i mentioned the syscall interface or user/app being
>> able to modify the cache alloc could be used to address some very
>> specific use cases on top an existing system managed interface. This
>> is not really a common case in cloud or container environment and
>> neither a feasible deployable solution.
>> Just consider the millions of apps that have to transition to such
>> an interface to even use it - if thats the only way to do it, thats
>> dead on arrival.
>
> Applications should not rely on interfaces that are not upstream.
>
> Is there an explicit request or comment from users about
> their difficulty regarding a change in the interface?

However, there also needs to be some reasoning as to why the cgroup interface
is not good.

>
>> Also please donot include kernel automatically adjusting resources
>> in your reply as thats totally irrelevant and again more confusing
>> as we have already exchanged some >100 emails on this same patch
>> version without meaning anything so far.
>>
>> The debate is purely between a syscall only interface and a system
>> manageable interface(like cgroup where admin or a central entity
>> controls the resources). If not define what is it first before going
>> into details.
>
> See the Tuna / taskset page.
> The administrator could, for example, use "cacheset" from within
> the scripts which initialize the applications.
> Then having control over those scripts, he can view them as a "unified
> system control interface".
>
> Problems with cgroup interface:
>
> 1) Global IPI on CBM <---> task change does not scale.

I don't understand this. How is the IPI related to cgroups? A task is
associated with one closid and it needs to carry that along wherever it goes.
It supports the use cases I explain in the link below (basically cloud/container
and server use cases, mainly):

http://marc.info/?l=linux-kernel&m=144035279828805

> 2) Syscall interface specification is in kbytes, not
> cache ways (which is what must be recorded by the OS
> to allow migration of the OS between different
> hardware systems).

I thought you agreed that a simple bash script can convert the bitmask to bytes
in units of the chunk size. All you need is the cache size from /proc/cpuinfo
and the max CBM bits in the root intel_rdt cgroup. And it's incorrect to say
you can do it in bytes; it's only in chunk sizes really (chunk size = cache
size / max cbm bits).
Apart from that, the mask gives you the ability to choose exclusive,
overlapping, or partially overlapping and partially exclusive masks.
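
To make the arithmetic concrete, here is a rough sketch in C rather than bash.
It assumes the cache size (in KB) and the max CBM bits have already been read
from /proc/cpuinfo and the root intel_rdt cgroup; all helper names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* chunk size = cache size / max cbm bits (both supplied by the caller) */
static uint64_t cbm_chunk_kb(uint64_t cache_kb, unsigned int max_cbm_bits)
{
	assert(max_cbm_bits > 0 && max_cbm_bits <= 64);
	return cache_kb / max_cbm_bits;
}

/* Cache covered by a mask: number of set bits times the chunk size. */
static uint64_t cbm_to_kb(uint64_t cbm, uint64_t cache_kb,
			  unsigned int max_cbm_bits)
{
	unsigned int bits = 0;

	for (; cbm != 0; cbm &= cbm - 1)	/* clear lowest set bit */
		bits++;

	return bits * cbm_chunk_kb(cache_kb, max_cbm_bits);
}

enum cbm_relation {
	CBM_EXCLUSIVE,		/* no shared bits */
	CBM_PARTIAL_OVERLAP,	/* some shared bits, some exclusive */
	CBM_IDENTICAL,		/* fully overlapping */
};

/* Classify how two masks share the cache. */
static enum cbm_relation cbm_compare(uint64_t a, uint64_t b)
{
	if ((a & b) == 0)
		return CBM_EXCLUSIVE;
	if (a == b)
		return CBM_IDENTICAL;
	return CBM_PARTIAL_OVERLAP;
}
```

For example, with a 20480 KB cache and 20 CBM bits the chunk is 1024 KB, so a
mask of 0xF covers 4096 KB. Masks 0x0F and 0xF0 are exclusive, while 0x0F and
0x3C share two chunks and keep two chunks each to themselves.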

> 3) Compilers are able to configure cache optimally for
> given ranges of code inside applications, easily,
> if desired.

This is again not possible because of 1.1.1. It can still be done in a
restricted fashion, like I explained above.

> 4) Does not allow proper usage of shared caches between
> applications. Think of the following scenario:
> 	* AppA has threads which are created/destroyed,
>        but once initialized, want cache reservation.
>        * How is AppA going to coordinate with cgroups
>        system to initialized/shutdown cgroups?
>

Yes, the interface does not support apps self-controlling their cache
allocation. That is accepted. But this is not the main use case we target, like
I explained above, in the link I provided for the new proposal, and before. So
it's not very important as such.
Also, worst case, you can easily design a syscall for apps to self-control,
keeping the cgroup allocation for the task as the max threshold.
So let's nail down this list (of the cgroup flaws you list) before thinking
about changes? This should have been the first thing in the email, really, is
what I was saying.

> I started writing the syscall interface on top of your latest
> patchset yesterday (it should be relatively easy, given
> that most of the low-level code is already there).
>
> Any news on the data/code separation ?

Will send them this week; untested, partially due to the h/w not yet being with
me. They have been ready, but I was waiting to see the discussions on this
patch as well.

more response below -

>
>
>> Thanks,
>> Vikas
>>
>>>
>>> On fork, the child inherits the TCR from its parent.
>>>
>>> Semantics:
>>> Once a TCRE is created and assigned to a task, that task has
>>> guaranteed reservation on any CPU where its scheduled in,
>>> for the lifetime of the TCRE.
>>>
>>> A task can have its TCR list modified without notification.

Why does the task need a list of allocations? A task is tagged with only one
closid and it needs to carry that along. Even if the list is per socket, it
needs to be an array.

>>>
>>> FIXME: Add a per-task flag to not copy the TCR list of a task but delete
>>> all TCR's on fork.
>>>
>>> Interface:
>>>
>>> enum cache_rsvt_flags {
>>>  CACHE_RSVT_ROUND_DOWN   =      (1 << 0),    /* round "kbytes" down */
>>> };

Not really optional, is it? The chunk size is decided by the h/w SKU and you
can only allocate in multiples of that chunk size, not an arbitrary number of
bytes.
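
A sketch of what snapping a "kbytes" request to the chunk size could look
like. The enum is copied from the proposal quoted above; round_to_chunk() is a
hypothetical helper, and rounding up when CACHE_RSVT_ROUND_DOWN is not set is
an assumption, since the proposal does not spell out the default.

```c
#include <assert.h>
#include <stdint.h>

enum cache_rsvt_flags {
	CACHE_RSVT_ROUND_DOWN = (1 << 0),	/* round "kbytes" down */
};

/* Snap a request to a whole number of hardware chunks.
 * Assumption: without CACHE_RSVT_ROUND_DOWN the request rounds up. */
static uint64_t round_to_chunk(uint64_t kbytes, uint64_t chunk_kb, int flags)
{
	uint64_t rem;

	assert(chunk_kb > 0);

	rem = kbytes % chunk_kb;
	if (rem == 0)
		return kbytes;		/* already chunk aligned */
	if (flags & CACHE_RSVT_ROUND_DOWN)
		return kbytes - rem;
	return kbytes + (chunk_kb - rem);
}
```

With a 1024 KB chunk, a request for 1500 KB becomes 2048 KB by default and
1024 KB with CACHE_RSVT_ROUND_DOWN; either way the result is a whole number of
chunks.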

>>>
>>> enum cache_rsvt_type {
>>>  CACHE_RSVT_TYPE_CODE = 0,      /* cache reservation is for code */
>>>  CACHE_RSVT_TYPE_DATA,          /* cache reservation is for data */
>>>  CACHE_RSVT_TYPE_BOTH,          /* cache reservation is for code and data */
>>> };
>>>
>>> struct cache_reservation {
>>>       unsigned long kbytes;

This should really be rounded off to the chunk size. And like I explained
above, the masks let you easily do exclusive or partially exclusive allocations
with an adjustable percentage (say 20% shared and the rest exclusive), or a
tolerated amount of sharing.

>>>       int type;
>>>       int flags;
>>> 	int trcid;
>>> };
>>>
>>> The following syscalls modify the TCR of a task:
>>>
>>> * int sys_create_cache_reservation(struct cache_reservation *rsvt);
>>> DESCRIPTION: Creates a cache reservation entry, and assigns
>>> it to the current task.

So now I assume this is what the task can do itself, and the ones below which
take a pid need the capability? Again this breaks 1.1.1 like I said above, and
any way to restrict it to a max threshold can easily be done on top of the
cgroup allocation, keeping the cgroup allocation as the max threshold.

>>>
>>> returns -ENOMEM if not enough space, -EPERM if no permission.
>>> returns 0 if reservation has been successful, copying actual
>>> number of kbytes reserved to "kbytes", type to type, and tcrid.
>>>
>>> * int sys_delete_cache_reservation(struct cache_reservation *rsvt);
>>> DESCRIPTION: Deletes a cache reservation entry, deassigning it
>>> from any task.
>>>
>>> Backward compatibility for processors with no support for code/data
>>> differentiation: by default code and data cache allocation types
>>> fallback to CACHE_RSVT_TYPE_BOTH on older processors (and return the
>>> information that they done so via "flags").

The change of mode, which is dynamic, needs to be addressed, and it may be more
intuitive to do that in cgroups for the reasons I gave above. Taking an
allocation back from a process may need a callback; that's why it may be best
to design an interface where the apps know their control is very limited and
within the purview of the allocations already set by the root user.

Please check the new proposal, which tries to address most of the comments I
made:
http://marc.info/?l=linux-kernel&m=144035279828805
The framework still lets any kernel-mode or high-level user-mode library
developer build a cacheset-like tool or others on top of it, if that needs to
be more custom and more intuitive.

Thanks,
Vikas

>>>
>>> * int sys_attach_cache_reservation(pid_t pid, unsigned int tcrid);
>>> DESCRIPTION: Attaches cache reservation identified by "tcrid" to
>>> the task identified by pid.
>>> returns 0 if successful.
>>>
>>> * int sys_detach_cache_reservation(pid_t pid, unsigned int tcrid);
>>> DESCRIPTION: Detaches cache reservation identified by "tcrid" from
>>> the task identified by pid.
>>>
>>> The following syscalls list the TCRs:
>>> * int sys_get_cache_reservations(size_t size, struct cache_reservation list[]);
>>> DESCRIPTION: Return all cache reservations in the system.
>>> Size should be set to the maximum number of items that can be stored
>>> in the buffer pointed to by list.
>>>
>>> * int sys_get_tcrid_tasks(unsigned int tcrid, size_t size, pid_t list[]);
>>> DESCRIPTION: Return which pids are associated to tcrid.
>>>
>>> * sys_get_pid_cache_reservations(pid_t pid, size_t size,
>>>                                struct cache_reservation list[]);
>>> DESCRIPTION: Return all cache reservations associated with "pid".
>>> Size should be set to the maximum number of items that can be stored
>>> in the buffer pointed to by list.
>>>
>>> * sys_get_cache_reservation_info()
>>> DESCRIPTION: ioctl to retrieve hardware info: cache round size, whether
>>> code/data separation is supported.
>>>
>>>
>