linux-kernel - Re: [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service management

Message-ID: <alpine.DEB.2.10.1508201713070.13335@vshiva-Udesk>
Date:	Thu, 20 Aug 2015 17:13:58 -0700 (PDT)
From:	Vikas Shivappa <vikas.shivappa@...el.com>
To:	Vikas Shivappa <vikas.shivappa@...el.com>
cc:	Marcelo Tosatti <mtosatti@...hat.com>,
	Matt Fleming <matt@...eblueprint.co.uk>,
	Tejun Heo <tj@...nel.org>,
	Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
	linux-kernel@...r.kernel.org, x86@...nel.org, hpa@...or.com,
	tglx@...utronix.de, mingo@...nel.org, peterz@...radead.org,
	matt.fleming@...el.com, will.auld@...el.com,
	glenn.p.williamson@...el.com, kanaka.d.juvva@...el.com,
	Karen Noel <knoel@...hat.com>
Subject: Re: [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service
 management



On Thu, 20 Aug 2015, Vikas Shivappa wrote:

>
>
> On Mon, 17 Aug 2015, Marcelo Tosatti wrote:
>
>> Vikas, Tejun,
>> 
>> This is an updated interface. It addresses all comments made
>> so far and also covers all use-cases the cgroup interface
>> covers.
>> 
>> Let me know what you think. I'll proceed to writing
>> the test applications.
>> 
>> Usage model:
>> ------------
>> 
>> This document details how CAT technology is
>> exposed to userspace.
>> 
>> Each task has a list of task cache reservation entries (TCRE list).
>> 
>> The init process is created with empty TCRE list.
>> 
>> There is a system-wide unique ID space; each TCRE is assigned
>> an ID from this space. IDs can be reused (but no two TCREs
>> have the same ID at one time).
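The ID-space rule above (system-wide uniqueness, with reuse once an ID is freed) can be illustrated with a small allocator. This is a minimal userspace sketch, not kernel code: the size of the ID space (MAX_TCRID), the lowest-free-ID policy, and the helper names tcrid_alloc()/tcrid_free() are hypothetical choices that the proposal does not specify.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TCRID 64	/* hypothetical size of the system-wide ID space */

static bool tcrid_in_use[MAX_TCRID];

/* Return the lowest free ID, or -1 if every ID is currently assigned. */
static int tcrid_alloc(void)
{
	for (int id = 0; id < MAX_TCRID; id++) {
		if (!tcrid_in_use[id]) {
			tcrid_in_use[id] = true;
			return id;
		}
	}
	return -1;
}

/* Release an ID so that a TCRE created later may reuse it. */
static void tcrid_free(int id)
{
	assert(id >= 0 && id < MAX_TCRID && tcrid_in_use[id]);
	tcrid_in_use[id] = false;
}
```

Because freed IDs return to the pool, two TCREs never hold the same ID at once, but the same number can identify different TCREs over time.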
>> 
>> The interface accommodates transient and independent cache allocation
>> adjustments from applications, as well as static cache partitioning
>> schemes.
>> 
>> Allocation:
>> Usage of the system calls requires CAP_SYS_CACHE_RESERVATION capability.
>> 
>> A configurable percentage of the cache is reserved for tasks with an empty TCRE list.
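One way to read the reserved percentage is as a slice of the cache that TCREs can never claim, so that a new reservation fails with -ENOMEM once the remaining reservable space runs out. A minimal sketch under that assumption; struct cache_pool, reservable_kb() and pool_reserve() are hypothetical names, not part of the proposal.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical accounting state for one cache. */
struct cache_pool {
	unsigned long total_kb;		/* size of the shared cache */
	unsigned int default_pct;	/* share kept for tasks with an empty TCRE list */
	unsigned long reserved_kb;	/* sum of all live TCREs */
};

/* Kbytes that TCREs may claim in total. */
static unsigned long reservable_kb(const struct cache_pool *p)
{
	assert(p->default_pct <= 100);
	return p->total_kb - p->total_kb * p->default_pct / 100;
}

/* Claim "kb" for a new TCRE, or fail the way the proposed
 * sys_create_cache_reservation() is described to (-ENOMEM).
 * Assumes default_pct is not raised while reservations are live. */
static int pool_reserve(struct cache_pool *p, unsigned long kb)
{
	if (kb > reservable_kb(p) - p->reserved_kb)
		return -ENOMEM;
	p->reserved_kb += kb;
	return 0;
}
```

For example, with a 20480 KB cache and default_pct set to 25, reservable_kb() yields 15360 KB, so TCREs together can hold at most that much.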
>
> And how do you think you will do this without a system-controlled mechanism?
> Every time, your proposal includes these caveats, which actually imply a
> system-controlled interface in the background, yet your interfaces below
> make no real mention of this! Why do we want to confuse ourselves like
> this?
>
> A syscall-only interface does not seem to work on its own for the cache
> allocation scenario. It can only be a nice-to-have interface on top of a
> system-controlled mechanism like the cgroup interface. Sure, you can do
> everything you did with cgroup using the syscall interface, but the point
> is: what are the use cases that can't be done with this syscall-only
> interface? (For example, to deal with cases you brought up earlier, like
> when an app does cache-intensive work for some time and later changes, it
> could use the syscall interface to quickly relinquish the cache lines or
> change the CLOS associated with it.)
>
> I have repeatedly listed the use cases that can be dealt with, with this
Big typo: that should read 'use cases that cannot be dealt with'.

> interface. How will you address cases like 1.1 and 1.2 with your
> syscall-only interface? So we expect all the millions of apps like SAP,
> Oracle, etc., and all the millions of app developers to magically learn
> our new syscall interface and also cooperate among themselves to decide on
> a cache allocation that is agreeable to all? (Which, by the way, the
> interface below doesn't explain how to do.) And then, by some godly power,
> the noisy neighbour will decide by himself to give up the cache? (That
> would be the first app in the world to not request more resources for
> itself and to hurt its own performance; apps surely don't want to do
> social service!)
>
> And how do we handle case 1.5, where the administrator wants to assign
> cache to specific VMs in a cloud, etc.? With the hypothetical syscall
> interface, we should now expect all the apps to do the above, and they
> also need to know where they run (which VM, which socket, etc.) and then
> decide and cooperate on an allocation. Compare this to a container
> environment like Rancher, where today the admin can conveniently use
> Docker underneath to allocate memory/storage/compute to containers and
> could easily extend this to include shared L3.
>
> http://marc.info/?l=linux-kernel&m=143889397419199
>
> Without addressing the above, the details of the interface below are
> irrelevant.
>
> Your initial request was to extend the cgroup interface to include
> rounding off the size of the cache (which can easily be done with a bash
> script on top of the cgroup interface!), and now you are proposing a
> syscall-only interface? This is very confusing and will only delay the
> process unnecessarily without adding any value.
>
> However, like I mentioned, the syscall interface, or a user/app being able
> to modify the cache allocation, could be used to address some very
> specific use cases on top of an existing system-managed interface. This is
> not really a common case in cloud or container environments, nor is it a
> feasible, deployable solution. Just consider the millions of apps that
> would have to transition to such an interface to even use it; if that's
> the only way to do it, it's dead on arrival.
>
> Also, please do not bring up the kernel automatically adjusting resources
> in your reply, as that's totally irrelevant and again more confusing; we
> have already exchanged more than 100 emails on this same patch version
> without getting anywhere so far.
>
> The debate is purely between a syscall-only interface and a
> system-manageable interface (like cgroup, where the admin or a central
> entity controls the resources). If it is not that, define what it is
> first before going into details.
>
> Thanks,
> Vikas
>
>> 
>> On fork, the child inherits the TCR from its parent.
>> 
>> Semantics:
>> Once a TCRE is created and assigned to a task, that task has a
>> guaranteed reservation on any CPU where it is scheduled,
>> for the lifetime of the TCRE.
>> 
>> A task can have its TCR list modified without notification.
>> 
>> FIXME: Add a per-task flag to not copy the TCR list of a task, but
>> instead delete all TCRs on fork.
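The fork rule and the FIXME can be modelled together: by default the child gets a copy of the parent's TCRE list, and with the suggested per-task flag it starts with an empty list. This is a userspace model only; struct task_model, MAX_TCRE_PER_TASK, the no_inherit_tcr field and fork_tcr_list() are hypothetical. This sketch does not propagate the flag itself to the child, a choice the proposal leaves open.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_TCRE_PER_TASK 8	/* hypothetical per-task limit */

/* Userspace model of the per-task state; not kernel code. */
struct task_model {
	unsigned int tcrids[MAX_TCRE_PER_TASK];	/* IDs of attached TCREs */
	unsigned int nr_tcre;
	bool no_inherit_tcr;	/* the per-task flag suggested by the FIXME */
};

/* Set up the child's TCRE list at fork time. */
static void fork_tcr_list(const struct task_model *parent,
			  struct task_model *child)
{
	assert(parent->nr_tcre <= MAX_TCRE_PER_TASK);
	memset(child, 0, sizeof(*child));	/* empty list, flag cleared */
	if (parent->no_inherit_tcr)
		return;
	memcpy(child->tcrids, parent->tcrids,
	       parent->nr_tcre * sizeof(parent->tcrids[0]));
	child->nr_tcre = parent->nr_tcre;
}
```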
>> 
>> Interface:
>> 
>> enum cache_rsvt_flags {
>>   CACHE_RSVT_ROUND_DOWN   =      (1 << 0),    /* round "kbytes" down */
>> };
>> 
>> enum cache_rsvt_type {
>>   CACHE_RSVT_TYPE_CODE = 0,      /* cache reservation is for code */
>>   CACHE_RSVT_TYPE_DATA,          /* cache reservation is for data */
>>   CACHE_RSVT_TYPE_BOTH,          /* cache reservation is for code and data */
>> };
>> 
>> struct cache_reservation {
>>        unsigned long kbytes;
>>        int type;
>>        int flags;
>>        int tcrid;
>> };
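The CACHE_RSVT_ROUND_DOWN flag implies that requested sizes are rounded to the hardware granularity (the "cache round size" that sys_get_cache_reservation_info() is meant to report). A sketch, assuming the default is to round up and the flag selects rounding down; round_rsvt_kbytes() is a hypothetical helper.

```c
#include <assert.h>

enum cache_rsvt_flags {
	CACHE_RSVT_ROUND_DOWN = (1 << 0),	/* round "kbytes" down */
};

/* Round a request to a multiple of "round_kb". Assumption: round up
 * unless CACHE_RSVT_ROUND_DOWN is set in "flags". */
static unsigned long round_rsvt_kbytes(unsigned long kbytes,
				       unsigned long round_kb, int flags)
{
	assert(round_kb > 0);
	if (flags & CACHE_RSVT_ROUND_DOWN)
		return kbytes / round_kb * round_kb;
	return (kbytes + round_kb - 1) / round_kb * round_kb;
}
```

For example, with a 1024 KB granularity a 1500 KB request becomes 2048 KB by default and 1024 KB with CACHE_RSVT_ROUND_DOWN. Rounding down can yield 0 for requests smaller than one unit, which an implementation would presumably have to reject.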
>> 
>> The following syscalls modify the TCR of a task:
>> 
>> * int sys_create_cache_reservation(struct cache_reservation *rsvt);
>> DESCRIPTION: Creates a cache reservation entry, and assigns
>> it to the current task.
>> 
>> Returns -ENOMEM if there is not enough space, or -EPERM if the caller
>> lacks permission. Returns 0 if the reservation has been successful,
>> copying the actual number of kbytes reserved to "kbytes", the type to
>> "type", and the assigned ID to "tcrid".
>> 
>> * int sys_delete_cache_reservation(struct cache_reservation *rsvt);
>> DESCRIPTION: Deletes a cache reservation entry, deassigning it
>> from any task.
>> 
>> Backward compatibility for processors with no support for code/data
>> differentiation: by default, code and data cache allocation types
>> fall back to CACHE_RSVT_TYPE_BOTH on older processors (and report
>> that they have done so via "flags").
>> 
>> * int sys_attach_cache_reservation(pid_t pid, unsigned int tcrid);
>> DESCRIPTION: Attaches the cache reservation identified by "tcrid" to
>> the task identified by "pid".
>> Returns 0 if successful.
>> 
>> * int sys_detach_cache_reservation(pid_t pid, unsigned int tcrid);
>> DESCRIPTION: Detaches the cache reservation identified by "tcrid" from
>> the task identified by "pid".
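The backward-compatibility rule for processors without code/data differentiation can be sketched as a normalization step applied to a request before it is accounted. The proposal does not name the bit reported in "flags", so CACHE_RSVT_TYPE_FALLBACK below is hypothetical, as are apply_type_fallback() and its cdp_supported parameter.

```c
#include <assert.h>
#include <stdbool.h>

enum cache_rsvt_flags {
	CACHE_RSVT_ROUND_DOWN    = (1 << 0),	/* round "kbytes" down */
	CACHE_RSVT_TYPE_FALLBACK = (1 << 1),	/* hypothetical: type widened to BOTH */
};

enum cache_rsvt_type {
	CACHE_RSVT_TYPE_CODE = 0,	/* cache reservation is for code */
	CACHE_RSVT_TYPE_DATA,		/* cache reservation is for data */
	CACHE_RSVT_TYPE_BOTH,		/* cache reservation is for code and data */
};

struct cache_reservation {
	unsigned long kbytes;
	int type;
	int flags;
	int tcrid;
};

/* Without code/data differentiation, widen CODE or DATA requests to
 * BOTH and record that in "flags" so the caller can tell. */
static void apply_type_fallback(struct cache_reservation *rsvt,
				bool cdp_supported)
{
	assert(rsvt->type >= CACHE_RSVT_TYPE_CODE &&
	       rsvt->type <= CACHE_RSVT_TYPE_BOTH);
	if (!cdp_supported && rsvt->type != CACHE_RSVT_TYPE_BOTH) {
		rsvt->type = CACHE_RSVT_TYPE_BOTH;
		rsvt->flags |= CACHE_RSVT_TYPE_FALLBACK;
	}
}
```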
>> 
>> The following syscalls list the TCRs:
>> * int sys_get_cache_reservations(size_t size,
>>                                  struct cache_reservation list[]);
>> DESCRIPTION: Return all cache reservations in the system.
>> Size should be set to the maximum number of items that can be stored
>> in the buffer pointed to by list.
>> 
>> * int sys_get_tcrid_tasks(unsigned int tcrid, size_t size, pid_t list[]);
>> DESCRIPTION: Return which pids are associated to tcrid.
>> 
>> * sys_get_pid_cache_reservations(pid_t pid, size_t size,
>>                                 struct cache_reservation list[]);
>> DESCRIPTION: Return all cache reservations associated with "pid".
>> Size should be set to the maximum number of items that can be stored
>> in the buffer pointed to by list.
>> 
>> * sys_get_cache_reservation_info()
>> DESCRIPTION: ioctl to retrieve hardware info: cache round size, whether
>> code/data separation is supported.
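The proposal defines what "size" means for the list-returning calls but not what they return. Assuming they return the total number of entries that exist (so a caller can detect a truncated buffer), a caller could retry with a larger buffer as sketched here. mock_get_cache_reservations() merely stands in for the proposed sys_get_cache_reservations(); the reservation table, fetch_all_reservations() and the return-value convention are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct cache_reservation {
	unsigned long kbytes;
	int type;
	int flags;
	int tcrid;
};

/* Stand-in for the kernel's table of live TCREs (made-up entries;
 * types 2, 1, 0 correspond to BOTH, DATA, CODE). */
static const struct cache_reservation live[] = {
	{ 1024, 2, 0, 0 }, { 2048, 1, 0, 1 }, { 1024, 0, 0, 4 },
};
static const size_t nr_live = sizeof(live) / sizeof(live[0]);

/* Mock: copy at most "size" entries, return the total that exist. */
static int mock_get_cache_reservations(size_t size,
				       struct cache_reservation list[])
{
	for (size_t i = 0; i < size && i < nr_live; i++)
		list[i] = live[i];
	return (int)nr_live;
}

/* Grow the buffer until every entry fits. Returns the entry count and
 * stores the buffer in *out (caller frees), or a negative value on error. */
static int fetch_all_reservations(struct cache_reservation **out)
{
	size_t cap = 1;
	struct cache_reservation *buf = NULL;

	for (;;) {
		struct cache_reservation *tmp = realloc(buf, cap * sizeof(*buf));
		if (!tmp) {
			free(buf);
			return -1;
		}
		buf = tmp;
		int n = mock_get_cache_reservations(cap, buf);
		if (n < 0) {
			free(buf);
			return n;
		}
		if ((size_t)n <= cap) {
			*out = buf;
			return n;
		}
		cap = (size_t)n;	/* retry with room for every entry */
	}
}
```

The same retry pattern would apply to sys_get_pid_cache_reservations() and sys_get_tcrid_tasks() under the same return-value assumption; in a live system the count could change between calls, which the loop tolerates by retrying.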
>> 
>> 
>