Date:	Mon, 3 Aug 2015 11:22:22 -0700 (PDT)
From:	Vikas Shivappa <vikas.shivappa@...el.com>
To:	Marcelo Tosatti <mtosatti@...hat.com>
cc:	Martin Kletzander <mkletzan@...hat.com>,
	Vikas Shivappa <vikas.shivappa@...el.com>,
	"Auld, Will" <will.auld@...el.com>,
	Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"x86@...nel.org" <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
	"tglx@...utronix.de" <tglx@...utronix.de>,
	"mingo@...nel.org" <mingo@...nel.org>,
	"tj@...nel.org" <tj@...nel.org>,
	"peterz@...radead.org" <peterz@...radead.org>,
	"Fleming, Matt" <matt.fleming@...el.com>,
	"Williamson, Glenn P" <glenn.p.williamson@...el.com>,
	"Juvva, Kanaka D" <kanaka.d.juvva@...el.com>
Subject: Re: [PATCH 3/9] x86/intel_rdt: Cache Allocation documentation and
 cgroup usage guide


Hello Marcelo/Martin,

As I mentioned, let me modify the documentation to better explain the usage.
Things like updating each package's bitmask are already in the patches.

Let's discuss offline and come up with a well-defined proposal for any changes,
then update it in the next series. We seem to be looping over the same items.

Thanks,
Vikas

On Mon, 3 Aug 2015, Marcelo Tosatti wrote:

> On Sun, Aug 02, 2015 at 05:48:07PM +0200, Martin Kletzander wrote:
>> On Thu, Jul 30, 2015 at 05:08:13PM -0300, Marcelo Tosatti wrote:
>>> On Thu, Jul 30, 2015 at 10:47:23AM -0700, Vikas Shivappa wrote:
>>>>
>>>>
>>>> Marcelo,
>>>>
>>>>
>>>> On Wed, 29 Jul 2015, Marcelo Tosatti wrote:
>>>>>
>>>>> How about this:
>>>>>
>>>>> desiredclos (closid  p1  p2  p3 p4)
>>>>> 	     1       1   0   0  0
>>>>> 	     2	     0	 0   0  1
>>>>> 	     3	     0   1   1  0
>>>>
>>>> #1 Currently in the rdt cgroup, the root cgroup always has all the
>>>> bits set and can't be changed (the cgroup hierarchy makes this the
>>>> default, since all the children need to have a subset of the root's
>>>> bitmask). So even if the user creates a cgroup and does not put any
>>>> task in it, the tasks in the root cgroup could still be using that
>>>> part of the cache. That's the reason I say we can't have really
>>>> 'exclusive' masks.
>>>>
>>>> Or in other words, there is always a desired clos (0) which has all
>>>> parts set and acts like a default pool.
>>>>
>>>> Also, the parts can overlap.  Please apply this to all the comments
>>>> below, as it changes the way they work.
>>>>
>>>>>
>>>>> p means part.
>>>>
>>>> I am assuming p = (a contiguous cache capacity bit mask)
>>>
>>> Yes.
>>>
>>>>> closid 1 is an exclusive cgroup.
>>>>> closid 2 is a "cache hog" class.
>>>>> closid 3 is "default closid".
>>>>>
>>>>> Desiredclos is what user has specified.
>>>>>
>>>>> Transition 1: desiredclos --> effectiveclos
>>>>> Clean all bits of unused closid's
>>>>> (that must be updated whenever a
>>>>> closid1 cgroup goes from empty->nonempty
>>>>> and vice-versa).
>>>>>
>>>>> effectiveclos (closid  p1  p2  p3 p4)
>>>>> 	       1       0   0   0  0
>>>>> 	       2       0   0   0  1
>>>>> 	       3       0   1   1  0
>>>>
>>>>>
>>>>> Transition 2: effectiveclos --> expandedclos
>>>>> expandedclos (closid  p1  p2  p3 p4)
>>>>> 	       1       0   0   0  0
>>>>> 	       2       0   0   0  1
>>>>> 	       3       1   1   1  0
>>>>> Then you have different inplacecos for each
>>>>> CPU (see pseudo-code below):
>>>>>
>>>>> On the following events.
>>>>>
>>>>> - task migration to new pCPU:
>>>>> - task creation:
>>>>>
>>>>> 	id = smp_processor_id();
>>>>> 	for (part = desiredclos.p1; ...; part++)
>>>>> 		/* if my cosid is set and any other
>>>>> 	   	   cosid is clear, for the part,
>>>>> 		   synchronize desiredclos --> inplacecos */
>>>>> 		if (part[mycosid] == 1 &&
>>>>> 		    part[any_othercosid] == 0)
>>>>> 			wrmsr(part, desiredclos);
>>>>>
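
For concreteness, here is a minimal userspace sketch of the synchronization
rule in the pseudo-code above. The table contents come from the desiredclos
example; the names NR_PARTS, NR_CLOSIDS, write_part_msr() and
sync_inplacecos() are made up for illustration and are not from the patchset.

#include <stdbool.h>
#include <stdio.h>

#define NR_PARTS   4
#define NR_CLOSIDS 4

/* desiredclos[closid][part]: what the user has specified */
static bool desiredclos[NR_CLOSIDS][NR_PARTS] = {
        [1] = { 1, 0, 0, 0 },
        [2] = { 0, 0, 0, 1 },
        [3] = { 0, 1, 1, 0 },
};

/* stand-in for the wrmsr() that programs one part on this CPU */
static void write_part_msr(int part, bool enable)
{
        printf("part %d: %s on this CPU\n", part,
               enable ? "set" : "clear");
}

/*
 * On task creation or migration: for each part, if the current task's
 * closid wants the part and no other closid claims it, push the desired
 * value to this CPU (desiredclos -> inplacecos).
 */
static void sync_inplacecos(int myclosid)
{
        for (int part = 0; part < NR_PARTS; part++) {
                bool other_wants = false;

                for (int closid = 0; closid < NR_CLOSIDS; closid++)
                        if (closid != myclosid &&
                            desiredclos[closid][part])
                                other_wants = true;

                if (desiredclos[myclosid][part] && !other_wants)
                        write_part_msr(part, true);
        }
}

int main(void)
{
        /* e.g. a task with closid 2 lands on this CPU */
        sync_inplacecos(2);
        return 0;
}
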
>>>>
>>>> Currently the root cgroup has all the bits set, so it acts like a
>>>> default cgroup where all the otherwise unused parts (assuming they
>>>> are a set of contiguous cache capacity bits) get used.
>>>>
>>>> Otherwise the question with expandedclos is: who decides to expand
>>>> the closx parts to include some of the unused parts? Could that just
>>>> always be the default root?
>>>
>>> Right, so the problem is that for certain closids you might never want
>>> to expand (because doing so would cause data to be cached in a
>>> cache way which might have a high eviction rate in the future).
>>> See the example from Will.
>>>
>>> But for the default cache (that is, "unclassified applications")
>>> I suppose it is beneficial to expand in most cases, that is, to
>>> use the maximum amount of cache irrespective of eviction rate, which
>>> is the behaviour that exists now without CAT.
>>>
>>> So perhaps a new flag "expand=y/n" can be added to the cgroup
>>> directories... What do you say?
>>>
>>> Userspace representation of CAT
>>> -------------------------------
>>>
>>> Usage model:
>>> 1) measure application performance without L3 cache reservation.
>>> 2) measure application perf with L3 cache reservation and
>>> X number of cache ways until desired performance is attained.
>>>
>>> Requirements:
>>> 1) Persistency of CLOS configuration across hardware. On migration
>>> of operating system or application between different hardware
>>> systems we'd like the following to be maintained:
>>>       - exclusive number of bytes (*) reserved to a certain CLOSid.
>>>       - shared number of bytes (*) reserved between a certain group
>>>         of CLOSid's.
>>>
>>> For both code and data, rounded down or up in cache way size.
>>>
>>> 2) Reasoning:
>>> Different CBM masks on different hardware platforms might be necessary
>>> to specify the same CLOS configuration, in terms of exclusive number of
>>> bytes and shared number of bytes (rounded to cache-way size).
>>> For example, due to L3 allocation by other hardware entities in certain parts
>>> of the cache it might be necessary to relocate the CBM mask to achieve
>>> the same CLOS configuration.
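
As an illustration of that reasoning, here is a sketch of how userspace might
turn a byte request into a contiguous CBM while avoiding ways already claimed
by other entities. way_size, nr_ways and reserved_mask describe an assumed
platform and are not defined by the patchset.

#include <stdint.h>
#include <stdio.h>

/*
 * Pick a contiguous CBM covering enough ways for 'bytes', skipping ways
 * in 'reserved_mask' (e.g. ways used by other hardware entities).
 * Returns 0 if no suitable contiguous region exists.
 */
static uint32_t bytes_to_cbm(uint64_t bytes, uint64_t way_size,
                             unsigned int nr_ways, uint32_t reserved_mask,
                             int round_down)
{
        unsigned int ways = bytes / way_size;

        if (!round_down && (bytes % way_size))
                ways++;                       /* default: round up */
        if (ways == 0 || ways > nr_ways)
                return 0;

        /* slide a window of 'ways' contiguous bits over the free ways */
        for (unsigned int start = 0; start + ways <= nr_ways; start++) {
                uint32_t cbm = ((1u << ways) - 1) << start;

                if (!(cbm & reserved_mask))
                        return cbm;
        }
        return 0;
}

int main(void)
{
        /* 2M on an assumed 20-way, 20M L3 (1M/way), ways 0-1 reserved */
        printf("cbm = 0x%x\n",
               bytes_to_cbm(2 << 20, 1 << 20, 20, 0x3, 0));
        return 0;
}

The same byte-level request would simply be rerun against each platform's way
size and reserved mask, which is how the CLOS configuration could stay
constant while the CBMs change.
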
>>>
>>> 3) Proposed format:
>>>
>>
>> A few questions from a random listener; I apologise if some of them are
>> in the wrong place because I'm missing some information from past threads.
>>
>> I'm not sure whether the following format proposal is the
>> internal structure or what's going to be in cgroups.  If this is a
>> user-visible interface, I think it could be a little less detailed.
>
> User visible interface. The idea is to have userspace code that performs
>
> [ user visible specification ]  ----> [ cbm bitmasks on present hardware
> 				       platform ]
>
> In systemd, probably (or whatever is between the user and the cgroup
> interface).
>
>>> sharedregionK.exclusive - Number of exclusive cache bytes reserved for
>>> 			shared region.
>>> sharedregionK.excl_data - Number of exclusive cache data bytes reserved for
>>> 	 		shared region.
>>> sharedregionK.excl_code - Number of exclusive cache code bytes reserved for
>>> 			shared region.
>>> sharedregionK.round_down - Round down to cache way bytes from respective number
>>> 		     specification (default is round up).
>>> sharedregionK.expand - y/n - Expand shared region to more cache ways
>>> 			when available (default N).
>>>
>>> cgroupN.exclusive - Number of exclusive L3 cache bytes reserved
>>> 		    for cgroup.
>>> cgroupN.excl_data - Number of exclusive L3 data cache bytes reserved
>>> 		    for cgroup.
>>> cgroupN.excl_code - Number of exclusive L3 code cache bytes reserved
>>> 		    for cgroup.
>>
>> By exclusive, you mean that it's exclusive to the tasks in this
>> cgroup?
>
> Correct.
>
>> The thing is that we must differentiate between limiting some
>> process from hogging the cache (like example 2 below) and making
>> some part of the cache exclusive for a particular application (example 1
>> below).
>
> AFAICS there is no difference, because both require exclusive cache
> access: the hog wants exclusive access because any other user of its
> cachelines will be penalized, and the high performance application wants
> exclusive cache access because any other user of its cachelines will
> penalize it.
>
> Where do you see the need to differentiate?
>
>> I just hope we won't need to add something similar to 'isolcpus=' just
>> so we can make sure none of the tasks in the root cgroup can spoil the
>> part of the cache we need to have exclusive.
>>
>> I'm not sure creating a new subgroup and moving all the tasks there
>> would work.  It certainly is not possible with other cgroups, like the
>> cpuset cgroup mentioned beforehand.
>
> Why not? One should be able to place all tasks in a given cgroup
> (I'm trying to set up systemd to do that now...).
>
>> I also don't quite fully understand how the co-mounting with the
>> cpuset cgroup should work, but that's not design-related.
>
> Neither do I.
>
>> One more question: how does this work on systems with multiple L3
>> caches (e.g. large NUMA systems)?  I'm guessing that if the process is
>> running only on some CPUs, the wrmsr() will be called on those
>> particular CPUs, right?
>
> Not in the current patchset, that has to be fixed...
>
>>> cgroupN.round_down - Round down to cache way bytes from respective number
>>> 		     specification (default is round up).
>>> cgroupN.expand - y/n - Expand the cgroup's reservation to more cache ways
>>> 		       when available (default N).
>>> cgroupN.shared = { sharedregion1, sharedregion2, ... } (list of shared
>>> regions)
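
One way userspace might hold this specification before translating it to
CBMs; the struct and field names below simply mirror the keys above and are
illustrative only, not part of the patchset.

#include <stdbool.h>
#include <stdint.h>

/* mirrors the sharedregionK keys above (sizes in bytes) */
struct cat_shared_region {
        uint64_t exclusive;
        uint64_t excl_data;
        uint64_t excl_code;
        bool round_down;        /* round down to cache-way size */
        bool expand;            /* grow into unused ways when available */
};

/* mirrors the cgroupN keys above (sizes in bytes) */
struct cat_cgroup_spec {
        uint64_t exclusive;
        uint64_t excl_data;
        uint64_t excl_code;
        bool round_down;
        bool expand;
        struct cat_shared_region **shared;      /* cgroupN.shared list */
        unsigned int nr_shared;
};
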
>>>
>>> Example 1:
>>> One application with 2M exclusive cache, two applications
>>> with 1M exclusive each, sharing an expansive shared region of 1M.
>>>
>>> cgroup1.exclusive = 2M
>>>
>>> sharedregion1.exclusive = 1M
>>> sharedregion1.expand = Y
>>>
>>> cgroup2.exclusive = 1M
>>> cgroup2.shared = sharedregion1
>>>
>>> cgroup3.exclusive = 1M
>>> cgroup3.shared = sharedregion1
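
For illustration only: on a hypothetical 20-way, 20MB L3 (1MB per way) with
no ways reserved by other hardware entities, the translation layer might
render Example 1 as something like:

        cgroup1        2M exclusive             -> CBM 0x00003 (ways 0-1)
        sharedregion1  1M exclusive, expand=Y   -> CBM 0x00004 (way 2, may grow)
        cgroup2        1M exclusive + shared    -> CBM 0x0000c (way 3 + way 2)
        cgroup3        1M exclusive + shared    -> CBM 0x00014 (way 4 + way 2)
        default                                 -> CBM 0xfffe0 (ways 5-19)

The exact masks would differ on other platforms; the point is that the
byte-level specification stays the same while the CBMs are recomputed.
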
>>>
>>> Example 2:
>>> 3 high performance applications running, one of which is a cache hog
>>> with no cache locality.
>>>
>>> cgroup1.exclusive = 8M
>>> cgroup2.exclusive = 8M
>>>
>>> cgroup3.exclusive = 512K
>>> cgroup3.round_down = Y
>>>
>>> In all cases the default cgroup (which requires no explicit
>>> specification) is expansive and uses the remaining cache
>>> ways, including the ways shared by other hardware entities.
>>>