linux-kernel - Re: [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service management

Message-ID: <alpine.DEB.2.10.1508061312250.921@vshiva-Udesk>
Date:	Thu, 6 Aug 2015 13:46:06 -0700 (PDT)
From:	Vikas Shivappa <vikas.shivappa@...el.com>
To:	Marcelo Tosatti <mtosatti@...hat.com>
cc:	Matt Fleming <matt@...eblueprint.co.uk>, Tejun Heo <tj@...nel.org>,
	Vikas Shivappa <vikas.shivappa@...el.com>,
	Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
	linux-kernel@...r.kernel.org, x86@...nel.org, hpa@...or.com,
	tglx@...utronix.de, mingo@...nel.org, peterz@...radead.org,
	matt.fleming@...el.com, will.auld@...el.com,
	glenn.p.williamson@...el.com, kanaka.d.juvva@...el.com
Subject: Re: [PATCH 5/9] x86/intel_rdt: Add new cgroup and Class of service
 management



On Wed, 5 Aug 2015, Marcelo Tosatti wrote:

> On Wed, Aug 05, 2015 at 01:22:57PM +0100, Matt Fleming wrote:
>> On Sun, 02 Aug, at 12:31:57PM, Tejun Heo wrote:
>>>
>>> But we're doing it the wrong way around.  You can do most of what
>>> cgroup interface can do with systemcall-like interface with some
>>> inconvenience.  The other way doesn't really work.  As I wrote in the
>>> other reply, cgroups is a horrible programmable interface and we don't
>>> want individual applications to interact with it directly and CAT's
>>> use cases most definitely include each application programming its own
>>> cache mask.
>>
>> I wager that this assertion is wrong. Having individual applications
>> program their own cache mask is not going to be the most common
>> scenario.
>
> What i like about the syscall interface is that it moves the knowledge
> of cache behaviour close to the application launching (or inside it),
> which allows the following common scenario, say on a multi purpose
> desktop:
>
> Event: launch high performance application: use cache reservation, finish
> quickly.
> Event: cache hog application: do not thrash the cache.
>
> The two cache reservations are logically unrelated in terms of
> configuration, and configured separately do not affect each other.

Letting apps allocate the cache themselves raises several issues. We simply
cannot treat cache allocation like memory allocation; please consider the
scenarios below:

All examples assume a cache size of 10MB and a max CBM length of 10 bits.


  	(1) user programmable syscall:

   1.1> Exclusive access: The task cannot give *itself* exclusive use of the
cache. For this it needs visibility into the cache allocations of other tasks
and may need to reclaim or override others' cache allocations, which is not
feasible (isn't that the job of a system managing agent?).

   eg:
app1 ... app10 each ask for 1MB of exclusive cache.
They get it, as there was 10MB.

But now a large portion of the tasks on the system end up without any cache?
That is not possible.
Or do they share a common pool or a default shared pool? If there is such a
default pool, then it needs to be *managed*, and it reduces the amount of
exclusive cache that can be given out.
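
To make the arithmetic of the example above concrete, here is a small sketch
of a first-come-first-served exclusive allocator under the same assumptions
(10MB cache, 10 CBM bits, hence 1MB per bit). The class and method names are
hypothetical and only model the bookkeeping; this is not kernel code and not
a proposed interface.

```python
CBM_BITS = 10      # assumed max capacity bitmask length (from the example)
MB_PER_BIT = 1     # 10MB cache / 10 bits


class ExclusiveCache:
    """Toy model that hands out non-overlapping, contiguous CBM chunks."""

    def __init__(self, bits=CBM_BITS):
        self.bits = bits
        self.next_free = 0   # lowest bit that has not been handed out yet
        self.owners = {}     # task name -> bitmask

    def request(self, task, mbytes):
        """Return an exclusive bitmask, or None when nothing is left."""
        nbits = -(-mbytes // MB_PER_BIT)   # ceiling division
        if self.next_free + nbits > self.bits:
            return None
        mask = ((1 << nbits) - 1) << self.next_free
        self.next_free += nbits
        self.owners[task] = mask
        return mask


cache = ExclusiveCache()
granted = [cache.request(f"app{i}", 1) for i in range(1, 11)]
late = cache.request("app11", 1)   # every task after the first ten
print([hex(m) for m in granted], late)
```

In this model the first ten requesters consume every bit, so any later task
gets nothing unless some agent reserves or reclaims a shared pool.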

   1.2> Noisy neighbour problem: how does the task itself decide that it is
the noisy neighbour? This is the key requirement the feature wants to address.
We want to address the jitter and inconsistencies in the quality of service,
things like the response times the apps get. If you read the SDM, it is
mentioned clearly there as well. Can the task voluntarily declare itself a
noisy neighbour (how?) and relinquish its cache allocation (how much?)? Even
that is not guaranteed.
How can we expect every application coder to know what system the app is going
to run on and what the optimal amount of cache for the app is? It is not like
memory allocation, see 1.3 and 1.4 below.

   1.3> Cache allocation cannot be treated like memory allocation.
There are system-call alternatives to cgroups (like cpuset) for memory
allocation, but that does not mean we can treat the two resources the same
way.
  	1.3.1> Memory is a very large pool, measured in GBs, and here we are
talking about only a few MBs (~10 - 20), which is orders of magnitude smaller.
So we could easily get into the situation mentioned above, where the first few
apps get all the exclusive cache and the rest starve.
  	1.3.2> Memory is virtualized: each process has its own address space and
we are not even bound by the physical memory capacity, so an app can indeed
ask for more memory than is physically present while other apps do the same.
We cannot do that with cache allocation. Even if we evict the cache, that
defeats the purpose of allocating cache to threads.

   1.4> Specific h/w requirements: with code data prioritization (CDP), the
h/w requires the OS to reset all the capacity bitmasks once we change mode to
or from legacy cache allocation. So naturally we need to remove the tasks
together with all their allocations. We cannot easily take away cache
allocations that users believe are theirs because they allocated them using
the syscall. It would be like a task calling malloc successfully and then
finding midway that its allocation is no longer there.
This also adds to the argument that cache allocation needs to be treated
differently from other resource allocations like memory.

   1.5> In cloud and container environments, say we need to allocate cache
for an entire VM which runs a specific real_time workload, versus VMs which
run, say, a noisy_workload. How can we achieve this by letting each app decide
how much cache it needs? This is best done by an external system manager.

  	(2) cgroup interface:

  (2.1) comparison with the usage above

1.1> and 1.2> above can easily be handled with the cgroup interface.
The key difference is system management versus process self-management of the
cache allocation. When there is a centralized system manager, this works fine.

The administrator can make sure that certain tasks or groups of tasks get
exclusive cache blocks. The administrator can also identify the noisy
neighbour application or workload using cache monitoring and make allocations
appropriately.

A classic use case is described here:
http://www.intel.com/content/www/us/en/communications/cache-allocation-technology-white-paper.html

    $ cd /sys/fs/cgroup/rdt
    $ cd group1
    $ /bin/echo 0xf > intel_rdt.l3_cbm

    $ cd group2
    $ /bin/echo 0xf0 > intel_rdt.l3_cbm

If we want to prevent the system admin from accidentally allocating
overlapping masks, the interface could easily be extended with an
always-exclusive flag.
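
As a rough illustration of what such a flag would have to check, the sketch
below rejects a new mask when it shares any bit with another group's mask.
The function names and the `always_exclusive` parameter are hypothetical;
nothing here is part of the posted patches.

```python
def overlapping_groups(masks):
    """Return sorted pairs of group names whose bitmasks share a bit."""
    names = sorted(masks)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if masks[a] & masks[b]]


def set_cbm(masks, group, new_mask, always_exclusive=True):
    """Return an updated copy of masks, refusing overlaps when asked to."""
    trial = dict(masks)
    trial[group] = new_mask
    if always_exclusive:
        clashes = [pair for pair in overlapping_groups(trial) if group in pair]
        if clashes:
            raise ValueError(f"mask {new_mask:#x} overlaps: {clashes}")
    return trial


masks = {"group1": 0xf, "group2": 0xf0}   # the masks from the example above
print(overlapping_groups(masks))          # the two masks share no bit
try:
    set_cbm(masks, "group2", 0xf8)        # bit 3 already belongs to group1
except ValueError as err:
    print("rejected:", err)
```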

Rounding off: We can easily write a script that calculates and shows the chunk
size and then allocates based on a size in bytes. This is something that can
easily be done on top of this interface.
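
A minimal sketch of such a helper, assuming the chunk size is simply the cache
size divided by the number of CBM bits and that requests are rounded up to
whole chunks. The function names and the `shift` parameter are hypothetical.
The mask is kept as one contiguous run of bits, since CAT capacity bitmasks
are expected to be contiguous.

```python
MB = 1 << 20


def chunk_size(cache_bytes, cbm_bits):
    """Bytes of cache represented by one bit of the capacity bitmask."""
    return cache_bytes // cbm_bits


def bytes_to_cbm(request_bytes, cache_bytes, cbm_bits, shift=0):
    """Round a byte request up to whole chunks and return a contiguous mask."""
    chunk = chunk_size(cache_bytes, cbm_bits)
    nbits = max(1, -(-request_bytes // chunk))   # round up, at least one bit
    if shift + nbits > cbm_bits:
        raise ValueError("request does not fit in the bitmask")
    return ((1 << nbits) - 1) << shift


# With the 10MB / 10 bit assumption used in the examples, one bit is 1MB.
print(chunk_size(10 * MB, 10) // MB, "MB per bit")
print(hex(bytes_to_cbm(4 * MB, 10 * MB, 10)))            # lowest four bits
print(hex(bytes_to_cbm(4 * MB, 10 * MB, 10, shift=4)))   # next four bits
```

The resulting value is what would be written to intel_rdt.l3_cbm for the
group.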

Assign tasks to group2:

    $ /bin/echo PID1 > tasks
    $ /bin/echo PID2 > tasks

If all the threads belonging to a process (Processidx) need to be allocated
cache:
    $ /bin/echo <Processidx> > cgroup.procs


   1.4> above can possibly be addressed in cgroup, but would need some support
which we are planning to send. One way to address it is to tear down the
subsystem by deleting all the existing cgroup directories and then handling
the reset, so that CDP starts fresh with all bitmasks ready to be allocated.

   (2.2) cpu affinity:

Similarly, the rdt cgroup can be used to assign affinity to the entire cgroup
itself. You could always use taskset as well!

Example 2: the commands below allocate '1MB of L3 cache on socket1 to group1'
and '2MB of L3 cache on socket2 to group2'.
This mounts both cpuset and intel_rdt, and hence ls lists the files of both
subsystems.
    $ mount -t cgroup -ocpuset,intel_rdt cpuset,intel_rdt rdt/
    $ ls /sys/fs/cgroup/rdt
    cpuset.cpus
    cpuset.mems
    ...
    intel_rdt.l3_cbm
    tasks

Assign the cache
    $ /bin/echo 0xf > /sys/fs/cgroup/rdt/group1/intel_rdt.l3_cbm
    $ /bin/echo 0xff > /sys/fs/cgroup/rdt/group2/intel_rdt.l3_cbm

Assign tasks for group1 and group2
    $ /bin/echo PID1 > /sys/fs/cgroup/rdt/group1/tasks
    $ /bin/echo PID2 > /sys/fs/cgroup/rdt/group1/tasks
    $ /bin/echo PID3 > /sys/fs/cgroup/rdt/group2/tasks
    $ /bin/echo PID4 > /sys/fs/cgroup/rdt/group2/tasks

Tie the group1 to socket1 and group2 to socket2
    $ /bin/echo <cpumask for socket1> > /sys/fs/cgroup/rdt/group1/cpuset.cpus
    $ /bin/echo <cpumask for socket2> > /sys/fs/cgroup/rdt/group2/cpuset.cpus
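
The same sequence could be driven from a small management script. The sketch
below only builds and prints the list of writes, in the same order as the
commands above, and never touches /sys. The PIDs and the CPU lists ("0-7",
"8-15") are placeholders invented for illustration, since the real socket
masks depend on the machine, and the function name is hypothetical.

```python
import posixpath


def plan_writes(root, groups):
    """Return (path, value) pairs: cache masks, then tasks, then cpus."""
    writes = []
    for name, cfg in groups.items():
        path = posixpath.join(root, name, "intel_rdt.l3_cbm")
        writes.append((path, hex(cfg["cbm"])))
    for name, cfg in groups.items():
        for pid in cfg["pids"]:          # one PID per write to "tasks"
            writes.append((posixpath.join(root, name, "tasks"), str(pid)))
    for name, cfg in groups.items():
        writes.append((posixpath.join(root, name, "cpuset.cpus"), cfg["cpus"]))
    return writes


groups = {
    # PIDs and CPU lists are made-up placeholders, not real values.
    "group1": {"cbm": 0xf, "pids": [1001, 1002], "cpus": "0-7"},
    "group2": {"cbm": 0xff, "pids": [1003, 1004], "cpus": "8-15"},
}

writes = plan_writes("/sys/fs/cgroup/rdt", groups)
for path, value in writes:
    print(f"/bin/echo {value} > {path}")
```

Keeping the policy in one table like this is the point of the centralized
manager: the applications themselves need no knowledge of the masks.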

>
> They should be configured separately.
>
> Also, data/code reservation is specific to the application, so it
> should its specification should be close to the application (its just
> cumbersome to maintain that data somewhere else).
>
>> Only in very specific situations would you trust an
>> application to do that.
>
> Perhaps ulimit can be used to allow a certain limit on applications.

A ulimit is very subjective and depends on the workloads, the amount of cache
space available, the total cache, etc. Note that here you are already moving
towards a controlling agent, which could possibly configure the ulimit to
control what apps get.

>
>> A much more likely use case is having the sysadmin carve up the cache
>> for a workload which may include multiple, uncooperating applications.
>
> Sorry, what cooperating means in this context?

See example 1.2 above: a noisy neighbour cannot be expected to relinquish its
cache allocation by itself. That is one example of an uncooperating app.

>
>> Yes, a programmable interface would be useful, but only for a limited
>> set of workloads. I don't think it's how most people are going to want
>> to use this hardware technology.
>
> It seems syscall interface handles all usecases which the cgroup
> interface handles.
>
>> --
>> Matt Fleming, Intel Open Source Technology Center
>
> Tentative interface, please comment.

Please discuss the interface details once we are solid on the kind of
interface itself, since we have already reviewed one interface and are now
talking about a new one. Otherwise it may miss many requirements, including
hardware requirements like 1.4 above. Without those, can we have a complete
interface?

I understand that the cgroup interface has things like hierarchy which are not
of much use to the intel_rdt cgroup. Is that the key issue here, or is the
whole idea of 'system management of the cache allocation' the issue?

Thanks,
Vikas

>
> The "return key/use key" scheme would allow COSid sharing similarly to
> shmget. Intra-application, that is functional, but i am not experienced
> with shmget to judge whether there is a better alternative. Would have
> to think how cross-application setup would work,
> and in the simple "cacheset" configuration.
> Also, the interface should work for other architectures (TODO item, PPC
> at least has similar functionality).
>
> enum cache_rsvt_flags {
>   CACHE_RSVT_ROUND_UP   =      (1 << 0),    /* round "bytes" up */
>   CACHE_RSVT_ROUND_DOWN =      (1 << 1),    /* round "bytes" down */
>   CACHE_RSVT_EXTAGENTS  =      (1 << 2),    /* allow usage of area common with external agents */
> };
>
> enum cache_rsvt_type {
>   CACHE_RSVT_TYPE_CODE = 0,      /* cache reservation is for code */
>   CACHE_RSVT_TYPE_DATA,          /* cache reservation is for data */
>   CACHE_RSVT_TYPE_BOTH,          /* cache reservation is for code and data */
> };
>
> struct cache_reservation {
>        size_t kbytes;
>        u32 type;
>        u32 flags;
> };
>
> int sys_cache_reservation(struct cache_reservation *cv);
>
> returns -ENOMEM if not enough space, -EPERM if no permission.
> returns keyid > 0 if reservation has been successful, copying actual
> number of kbytes reserved to "kbytes".
>
> -----------------
>
> int sys_use_cache_reservation_key(struct cache_reservation *cv, int
> key);
>
> returns -EPERM if no permission.
> returns -EINVAL if no such key exists.
> returns 0 if instantiation of reservation has been successful,
> copying actual reservation to cv.
>
> Backward compatibility for processors with no support for code/data
> differentiation: by default code and data cache allocation types
> fallback to CACHE_RSVT_TYPE_BOTH on older processors (and return the
> information that they done so via "flags").
>
>
>