linux-kernel - Re: [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service management

Message-ID: <alpine.DEB.2.10.1508201637560.13335@vshiva-Udesk>
Date:	Thu, 20 Aug 2015 17:06:51 -0700 (PDT)
From:	Vikas Shivappa <vikas.shivappa@...el.com>
To:	Marcelo Tosatti <mtosatti@...hat.com>
cc:	Vikas Shivappa <vikas.shivappa@...el.com>,
	Matt Fleming <matt@...eblueprint.co.uk>,
	Tejun Heo <tj@...nel.org>,
	Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
	linux-kernel@...r.kernel.org, x86@...nel.org, hpa@...or.com,
	tglx@...utronix.de, mingo@...nel.org, peterz@...radead.org,
	matt.fleming@...el.com, will.auld@...el.com,
	glenn.p.williamson@...el.com, kanaka.d.juvva@...el.com,
	Karen Noel <knoel@...hat.com>
Subject: Re: [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service
 management



On Mon, 17 Aug 2015, Marcelo Tosatti wrote:

> Vikas, Tejun,
>
> This is an updated interface. It addresses all comments made
> so far and also covers all use-cases the cgroup interface
> covers.
>
> Let me know what you think. I'll proceed to writing
> the test applications.
>
> Usage model:
> ------------
>
> This document details how CAT technology is
> exposed to userspace.
>
> Each task has a list of task cache reservation entries (TCRE list).
>
> The init process is created with empty TCRE list.
>
> There is a system-wide unique ID space, each TCRE is assigned
> an ID from this space. ID's can be reused (but no two TCREs
> have the same ID at one time).
>
> The interface accommodates transient and independent cache allocation
> adjustments from applications, as well as static cache partitioning
> schemes.
>
> Allocation:
> Usage of the system calls requires CAP_SYS_CACHE_RESERVATION capability.
>
> A configurable percentage is reserved to tasks with empty TCRE list.

And how do you think you will do this without a system-controlled mechanism?
Every time, your proposal includes caveats like this one, which in practice
require a system-controlled interface in the background, yet the interfaces
below make no mention of it. Why do we want to confuse ourselves like this?

A syscall-only interface does not seem to work on its own for the cache
allocation scenario. It can only be a nice-to-have interface on top of a
system-controlled mechanism like the cgroup interface. Sure, you can do with
the syscall interface everything you did with cgroup, but the point is what
use cases cannot be handled with a syscall-only interface. (The syscall
interface does suit the cases you brought up earlier, e.g. when an app does
cache-intensive work for some time and later changes: it could use the
syscall interface to quickly relinquish the cache lines or change the CLOS
associated with it.)

I have repeatedly listed the use cases that can be dealt with using this
interface. How will you address cases like 1.1 and 1.2 with your syscall-only
interface? Do we expect all the millions of apps like SAP, Oracle, etc., and
all the millions of app developers, to magically learn our new syscall
interface and also cooperate among themselves to decide a cache allocation
that is agreeable to all (which, by the way, the interface below does not say
how to do)? And then, by some godly powers, the noisy neighbour will decide
by himself to give up the cache? (That would be the first app in the world to
not request more resources for itself and to hurt its own performance; apps
surely don't want to do social service!)

And how do we handle case 1.5, where the administrator wants to assign cache
to specific VMs in a cloud, etc.? With the hypothetical syscall interface we
would expect all the apps to do the above, and they would also need to know
where they run (which VM, which socket, etc.) and then decide on and
coordinate an allocation. Compare this to a container environment like
Rancher, where today the admin can conveniently use Docker underneath to
allocate mem/storage/compute to containers and easily extend this to include
shared L3.

http://marc.info/?l=linux-kernel&m=143889397419199

Without addressing the above, the details of the interface below are
irrelevant.

Your initial request was to extend the cgroup interface to include rounding
off the size of cache (which can easily be done with a bash script on top of
the cgroup interface!), and now you are proposing a syscall-only interface?
This is very confusing and will only delay the process unnecessarily without
adding any value.
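
To illustrate how small the rounding step is when done on top of a
mask-based interface, here is a minimal sketch in C. It assumes the
allocation is ultimately expressed as a contiguous capacity bitmask in which
each bit covers an equal share of the L3. The helper name kbytes_to_cbm and
its parameters (cache_kb, cbm_len, round_down) are hypothetical and come
from neither patch set, and clamping to at least one bit is also an
assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of any posted patch): convert a request in
 * kbytes into a contiguous capacity bitmask starting at bit 0.
 *
 *   cache_kb   - total L3 size in kbytes (assumed known to the caller)
 *   cbm_len    - number of bits in the capacity bitmask (assumed 1..32)
 *   round_down - nonzero to round the request down to a whole number of
 *                mask bits, zero to round it up
 */
static uint32_t kbytes_to_cbm(unsigned long kbytes, unsigned long cache_kb,
                              unsigned int cbm_len, int round_down)
{
    unsigned long way_kb, ways;

    assert(cbm_len >= 1 && cbm_len <= 32);
    assert(cache_kb >= cbm_len);

    way_kb = cache_kb / cbm_len;            /* kbytes covered by one bit */
    ways = round_down ? kbytes / way_kb
                      : kbytes / way_kb + (kbytes % way_kb != 0);

    if (ways < 1)
        ways = 1;                           /* assumption: never an empty mask */
    if (ways > cbm_len)
        ways = cbm_len;                     /* cannot exceed the whole cache */

    return (uint32_t)((UINT64_C(1) << ways) - 1);
}
```

With an illustrative 20480 kB cache and a 20-bit mask, each bit covers
1024 kB, so a 2500 kB request maps to 0x3 when rounding down and 0x7 when
rounding up.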

However, like I mentioned, the syscall interface, or the user/app being able
to modify the cache allocation, could be used to address some very specific
use cases on top of an existing system-managed interface. This is not really
a common case in a cloud or container environment, nor is it a feasible,
deployable solution on its own.
Just consider the millions of apps that would have to transition to such an
interface to even use it. If that's the only way to do it, it is dead on
arrival.

Also, please do not include the kernel automatically adjusting resources in
your reply, as that is totally irrelevant and again more confusing; we have
already exchanged more than 100 emails on this same patch version without
getting anywhere so far.

The debate is purely between a syscall-only interface and a system-manageable
interface (like cgroup, where the admin or a central entity controls the
resources). If not, define what it is first before going into details.

Thanks,
Vikas

>
> On fork, the child inherits the TCR from its parent.
>
> Semantics:
> Once a TCRE is created and assigned to a task, that task has a
> guaranteed reservation on any CPU where it is scheduled in,
> for the lifetime of the TCRE.
>
> A task can have its TCR list modified without notification.
>
> FIXME: Add a per-task flag to not copy the TCR list of a task but delete
> all TCR's on fork.
>
> Interface:
>
> enum cache_rsvt_flags {
>   CACHE_RSVT_ROUND_DOWN   =      (1 << 0),    /* round "kbytes" down */
> };
>
> enum cache_rsvt_type {
>   CACHE_RSVT_TYPE_CODE = 0,      /* cache reservation is for code */
>   CACHE_RSVT_TYPE_DATA,          /* cache reservation is for data */
>   CACHE_RSVT_TYPE_BOTH,          /* cache reservation is for code and data */
> };
>
> struct cache_reservation {
>        unsigned long kbytes;
>        int type;
>        int flags;
>        int tcrid;
> };
>
> The following syscalls modify the TCR of a task:
>
> * int sys_create_cache_reservation(struct cache_reservation *rsvt);
> DESCRIPTION: Creates a cache reservation entry, and assigns
> it to the current task.
>
> returns -ENOMEM if not enough space, -EPERM if no permission.
> returns 0 if reservation has been successful, copying actual
> number of kbytes reserved to "kbytes", type to type, and tcrid.
>
> * int sys_delete_cache_reservation(struct cache_reservation *rsvt);
> DESCRIPTION: Deletes a cache reservation entry, deassigning it
> from any task.
>
> Backward compatibility for processors with no support for code/data
> differentiation: by default code and data cache allocation types
> fall back to CACHE_RSVT_TYPE_BOTH on older processors (and return the
> information that they have done so via "flags").
>
> * int sys_attach_cache_reservation(pid_t pid, unsigned int tcrid);
> DESCRIPTION: Attaches the cache reservation identified by "tcrid" to
> the task identified by pid.
> returns 0 if successful.
>
> * int sys_detach_cache_reservation(pid_t pid, unsigned int tcrid);
> DESCRIPTION: Detaches the cache reservation identified by "tcrid" from
> the task identified by pid.
>
> The following syscalls list the TCRs:
> * int sys_get_cache_reservations(size_t size, struct cache_reservation list[]);
> DESCRIPTION: Return all cache reservations in the system.
> Size should be set to the maximum number of items that can be stored
> in the buffer pointed to by list.
>
> * int sys_get_tcrid_tasks(unsigned int tcrid, size_t size, pid_t list[]);
> DESCRIPTION: Return which pids are associated to tcrid.
>
> * sys_get_pid_cache_reservations(pid_t pid, size_t size,
>                                 struct cache_reservation list[]);
> DESCRIPTION: Return all cache reservations associated with "pid".
> Size should be set to the maximum number of items that can be stored
> in the buffer pointed to by list.
>
> * sys_get_cache_reservation_info()
> DESCRIPTION: ioctl to retrieve hardware info: cache round size, whether
> code/data separation is supported.
>
>
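
For reference, the create/round-down semantics quoted above can be modelled
in userspace roughly as follows. This is only a sketch under stated
assumptions: no such syscall exists, so model_create_cache_reservation and
the model_* state are hypothetical stand-ins, the 1024 kB round size and
4096 kB pool are illustrative, and returning -EINVAL for a NULL pointer and
-ENOMEM for a request that rounds to zero are my guesses, not part of the
proposal. The struct and enums follow the quoted definitions.

```c
#include <errno.h>
#include <stddef.h>

enum cache_rsvt_flags {
    CACHE_RSVT_ROUND_DOWN = (1 << 0),   /* round "kbytes" down */
};

enum cache_rsvt_type {
    CACHE_RSVT_TYPE_CODE = 0,           /* reservation is for code */
    CACHE_RSVT_TYPE_DATA,               /* reservation is for data */
    CACHE_RSVT_TYPE_BOTH,               /* reservation is for code and data */
};

struct cache_reservation {
    unsigned long kbytes;
    int type;
    int flags;
    int tcrid;
};

/* Hypothetical model state; the values are illustrative only. */
#define MODEL_ROUND_KB 1024UL               /* assumed cache round size */
static unsigned long model_free_kb = 4096UL; /* assumed reservable pool */
static int model_next_id = 1;                /* next unused TCRE ID */

/*
 * Userspace model of the proposed create call: round the request to the
 * cache round size, fail with -ENOMEM if it does not fit, otherwise write
 * back the actual size and the new ID.
 */
static int model_create_cache_reservation(struct cache_reservation *rsvt)
{
    unsigned long units;

    if (rsvt == NULL)
        return -EINVAL;                     /* assumption, not in the proposal */

    units = rsvt->kbytes / MODEL_ROUND_KB;
    if (!(rsvt->flags & CACHE_RSVT_ROUND_DOWN) &&
        rsvt->kbytes % MODEL_ROUND_KB != 0)
        units++;                            /* default: round up */

    if (units == 0 || units > model_free_kb / MODEL_ROUND_KB)
        return -ENOMEM;                     /* not enough space */

    model_free_kb -= units * MODEL_ROUND_KB;
    rsvt->kbytes = units * MODEL_ROUND_KB;  /* actual kbytes reserved */
    rsvt->tcrid = model_next_id++;
    return 0;
}
```

In this model a 2500 kB request with CACHE_RSVT_ROUND_DOWN is granted as
2048 kB, and a second 2500 kB request without the flag rounds up to 3072 kB
and fails with -ENOMEM because only 2048 kB remain. Note that nothing in the
model decides who is allowed to take the pool first, which is the policy
question raised earlier in this reply.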