Message-ID: <20150331011722.GA16792@amt.cnet>
Date:	Mon, 30 Mar 2015 22:17:22 -0300
From:	Marcelo Tosatti <mtosatti@...hat.com>
To:	Vikas Shivappa <vikas.shivappa@...el.com>
Cc:	Vikas Shivappa <vikas.shivappa@...ux.intel.com>, x86@...nel.org,
	linux-kernel@...r.kernel.org, hpa@...or.com, tglx@...utronix.de,
	mingo@...nel.org, tj@...nel.org, peterz@...radead.org,
	matt.fleming@...el.com, will.auld@...el.com,
	glenn.p.williamson@...el.com, kanaka.d.juvva@...el.com
Subject: Re: [PATCH 7/7] x86/intel_rdt: Add CAT documentation and usage guide

On Thu, Mar 26, 2015 at 10:29:27PM -0300, Marcelo Tosatti wrote:
> On Thu, Mar 26, 2015 at 11:38:59AM -0700, Vikas Shivappa wrote:
> > 
> > Hello Marcelo,
> 
> Hi Vikas,
> 
> > On Wed, 25 Mar 2015, Marcelo Tosatti wrote:
> > 
> > >On Thu, Mar 12, 2015 at 04:16:07PM -0700, Vikas Shivappa wrote:
> > >>This patch adds a description of Cache Allocation Technology, an overview
> > >>of the kernel implementation, and usage of the CAT cgroup interface.
> > >>
> > >>Signed-off-by: Vikas Shivappa <vikas.shivappa@...ux.intel.com>
> > >>---
> > >> Documentation/cgroups/rdt.txt | 183 ++++++++++++++++++++++++++++++++++++++++++
> > >> 1 file changed, 183 insertions(+)
> > >> create mode 100644 Documentation/cgroups/rdt.txt
> > >>
> > >>diff --git a/Documentation/cgroups/rdt.txt b/Documentation/cgroups/rdt.txt
> > >>new file mode 100644
> > >>index 0000000..98eb4b8
> > >>--- /dev/null
> > >>+++ b/Documentation/cgroups/rdt.txt
> > >>@@ -0,0 +1,183 @@
> > >>+        RDT
> > >>+        ---
> > >>+
> > >>+Copyright (C) 2014 Intel Corporation
> > >>+Written by vikas.shivappa@...ux.intel.com
> > >>+(based on contents and format from cpusets.txt)
> > >>+
> > >>+CONTENTS:
> > >>+=========
> > >>+
> > >>+1. Cache Allocation Technology
> > >>+  1.1 What is RDT and CAT
> > >>+  1.2 Why is CAT needed
> > >>+  1.3 CAT implementation overview
> > >>+  1.4 Assignment of CBM and CLOS
> > >>+  1.5 Scheduling and Context Switch
> > >>+2. Usage Examples and Syntax
> > >>+
> > >>+1. Cache Allocation Technology (CAT)
> > >>+====================================
> > >>+
> > >>+1.1 What is RDT and CAT
> > >>+-----------------------
> > >>+
> > >>+CAT is a part of Resource Director Technology (RDT), or Platform
> > >>+Shared Resource Control, which provides support for controlling
> > >>+platform shared resources such as cache. Currently cache is the only
> > >>+resource supported in RDT.
> > >>+More information can be found in the Intel SDM, section 17.15.
> > >>+
> > >>+Cache Allocation Technology provides a way for the software (OS/VMM)
> > >>+to restrict cache allocation to a defined 'subset' of the cache, which
> > >>+may overlap with other 'subsets'. This feature is used when allocating
> > >>+a line in the cache, i.e. when pulling new data into the cache.
> > >>+The hardware is programmed via MSRs.
> > >>+
> > >>+The different cache subsets are identified by a CLOS identifier (class
> > >>+of service) and each CLOS has a CBM (cache bit mask). The CBM is a
> > >>+contiguous set of bits which defines the amount of cache resource that
> > >>+is available to each 'subset'.
> > >>+
> > >>+1.2 Why is CAT needed
> > >>+---------------------
> > >>+
> > >>+CAT enables more cache resources to be made available to higher
> > >>+priority applications based on guidance from the execution
> > >>+environment.
> > >>+
> > >>+The architecture also allows these subsets to be changed dynamically
> > >>+at runtime to further optimize the performance of the higher priority
> > >>+application with minimal degradation to the lower priority application.
> > >>+Additionally, resources can be rebalanced for overall system throughput
> > >>+benefit. (Refer to Section 17.15 in the Intel SDM.)
> > >>+
> > >>+This technique may be useful in managing large computer systems with
> > >>+a large LLC. Examples are large servers running instances of
> > >>+web servers or database servers. In such complex systems, these
> > >>+subsets can be used for more careful placement of the available cache
> > >>+resources.
> > >>+
> > >>+The CAT kernel patch provides a basic kernel framework for users to
> > >>+implement such cache subsets.
> > >>+
> > >>+1.3 CAT implementation Overview
> > >>+-------------------------------
> > >>+
> > >>+The kernel implements a cgroup subsystem to support cache allocation.
> > >>+
> > >>+Each cgroup has a CLOSid <-> CBM (cache bit mask) mapping.
> > >>+A CLOS (class of service) is represented by a CLOSid. The CLOSid is
> > >>+internal to the kernel and not exposed to the user. Each cgroup has
> > >>+one CBM and represents one cache 'subset'.
> > >>+
> > >>+The cgroup follows the cgroup hierarchy; mkdir and adding tasks to
> > >>+the cgroup never fail. When a child cgroup is created it inherits
> > >>+the CLOSid and the CBM from its parent. When a user changes the
> > >>+default CBM for a cgroup, a new CLOSid may be allocated if the CBM
> > >>+was not used before. Changing the 'cbm' may fail with -ENOSPC once
> > >>+the kernel runs out of CLOSids.
> > >>+Users can create as many cgroups as they want, but the number of
> > >>+different CBMs in use at the same time is limited by the maximum
> > >>+number of CLOSids (multiple cgroups can have the same CBM).
> > >>+The kernel maintains a CLOSid <-> CBM mapping which keeps a reference
> > >>+count for each cgroup using a CLOSid.
> > >>+
> > >>+The tasks in the cgroup are allowed to fill the portion of the LLC
> > >>+represented by the cgroup's 'cbm' file.
> > >>+
> > >>+The root directory has all available bits set in its 'cbm' file by
> > >>+default.
> > >>+
> > >>+1.4 Assignment of CBM and CLOS
> > >>+------------------------------
> > >>+
> > >>+The 'cbm' needs to be a subset of the parent node's 'cbm'.
> > >>+Any contiguous subset of these bits (with a minimum of 2 bits) may be
> > >>+set to indicate the cache mapping desired. The 'cbm' between two
> > >>+directories can overlap. The 'cbm' represents the cache 'subset' of
> > >>+the CAT cgroup. For example: on a system with a maximum of 16 CBM
> > >>+bits, if a directory has the least significant 4 bits set in its
> > >>+'cbm' file (meaning the 'cbm' is 0xf), it is allocated the right
> > >>+quarter of the last level cache, which means the tasks belonging to
> > >>+this CAT cgroup can fill only the right quarter of the cache. If it
> > >>+has the most significant 8 bits set, it is allocated the left half
> > >>+of the cache (8 bits out of 16 represents 50%).
> > >>+
> > >>+The cache portion defined in the CBM file is available for all tasks
> > >>+within the cgroup to fill, and these tasks are not allowed to allocate
> > >>+space in other parts of the cache.
> > >
> > >Is there a reason to expose the hardware interface rather
> > >than ratios to userspace?
> > >
> > >Say, I'd like to allocate 20% of L3 cache to cgroup A,
> > >80% to cgroup B.
> > >
> > >Well, you'd have to expose the shared percentages between
> > >any two cgroups (that information is there in the
> > >cbm bitmasks, but not in "ratios").
> > >
> > >One problem I see with exposing cbm bitmasks is that on hardware
> > >updates that change the cache size or bitmask length, userspace must
> > >recalculate the bitmasks.
> > >
> > >Another is that it's vendor dependent, while ratios (plus shared
> > >information for two given cgroups) are not.
> > >
> > 
> > Agree that this interface does not give options to directly allocate
> > in terms of percentages. But note that specifying bitmasks allows
> > the user to allocate overlapping cache areas, and since we use
> > cgroups we naturally follow the cgroup hierarchy. The user should be
> > able to convert the bitmasks into the intended percentage or size
> > values based on the cache size information available in places like
> > cpuinfo.
> > 
> > We discussed this in more detail on the older patches; here is one
> > thread where we discussed it, for your reference:
> > http://marc.info/?l=linux-kernel&m=142482002022543&w=2
> > 
> > Thanks,
> > Vikas
> 
> I can't find any discussion relating to exposing the CBM interface
> directly to userspace in that thread?
> 
> The cpu.shares parameter is written in ratio form, which is much more
> natural. Do you see any advantage in maintaining the
> 
> (ratio -> cbm bitmasks)
> 
> translation in userspace rather than in the kernel?
> 
> What about something like:
> 
> 
> 		      root cgroup
> 		   /		  \
> 		  /		    \
> 		/		      \
> 	cgroupA-80			cgroupB-30
> 
> 
> So that whatever exceeds 100% is the ratio of cache 
> shared at that level (cgroup A and B share 10% of cache 
> at that level).
> 
> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu_and_memory-use_case.html
> 
> cpu — the cpu.shares parameter determines the share of CPU resources
> available to each process in all cgroups. Setting the parameter to 250,
> 250, and 500 in the finance, sales, and engineering cgroups respectively
> means that processes started in these groups will split the resources
> with a 1:1:2 ratio. Note that when a single process is running, it
> consumes as much CPU as necessary no matter which cgroup it is placed
> in. The CPU limitation only comes into effect when two or more processes
> compete for CPU resources. 

Vikas,

I see the following resource specifications from the POV of a user/admin:

1) Ratios. 

X%/Y%, as discussed above.
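
For illustration, a minimal sketch of the (ratio -> contiguous CBM)
translation such an interface implies, assuming the maximum CBM width
comes from CPUID, is well below 32 bits, and that the hardware needs
at least 2 contiguous bits; function and parameter names here are
hypothetical, not part of the patchset:

	/*
	 * Map a percentage to a contiguous CBM anchored at bit 0.
	 * cbm_len: number of valid CBM bits (e.g. 16 or 20, < 32).
	 * Rounds up so the group gets at least the requested share
	 * and never fewer than 2 bits.
	 */
	unsigned int percent_to_cbm(unsigned int percent, unsigned int cbm_len)
	{
		unsigned int bits = (percent * cbm_len + 99) / 100;

		if (bits < 2)
			bits = 2;
		if (bits > cbm_len)
			bits = cbm_len;
		return (1U << bits) - 1;
	}

	/*
	 * 20% of a 16-bit CBM -> 4 bits  -> 0x000f
	 * 80% of a 16-bit CBM -> 13 bits -> 0x1fff (rounded up)
	 */

Whoever does the translation would additionally have to decide where
each group's bits are placed, to honor the sharing/overlap
requirements between groups.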

2) Specific kilobyte values.

In accordance with the rest of cgroups, allow a specific kilobyte
specification. See limit_in_bytes, for example, from

https://www.kernel.org/doc/Documentation/cgroups/memory.txt

Of course you would have to convert to cache-way units (a rough
sketch follows below), but I see two use-cases here:

	- User wants the application to not reclaim more than a
	  given number of kilobytes of LLC.
	- User wants the application to be guaranteed a given
	  amount of kilobytes of LLC, even across processor changes.

Again, some precision is lost, since LLC allocation is granular
(one CBM bit covers a fixed fraction of the cache).
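
The kilobyte -> CBM-bit conversion itself is simple once the LLC size
and the CBM width are known (e.g. from CPUID/cacheinfo); a rough
sketch, with hypothetical names:

	/*
	 * How many CBM bits are needed to cover 'kbytes' of LLC?
	 * llc_kbytes: total LLC size in KB (e.g. 20480 for 20 MB)
	 * cbm_len:    number of valid CBM bits (e.g. 20)
	 * Each bit covers llc_kbytes / cbm_len KB, so round up.
	 */
	unsigned int kbytes_to_cbm_bits(unsigned int kbytes,
					unsigned int llc_kbytes,
					unsigned int cbm_len)
	{
		unsigned int kb_per_bit = llc_kbytes / cbm_len;
		unsigned int bits = (kbytes + kb_per_bit - 1) / kb_per_bit;

		if (bits < 2)
			bits = 2;
		if (bits > cbm_len)
			bits = cbm_len;
		return bits;
	}

	/*
	 * e.g. 5120 KB out of a 20480 KB LLC with a 20-bit CBM:
	 * kb_per_bit = 1024, bits = 5, i.e. the precision is limited
	 * to 1024 KB steps.
	 */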

3) Per-CPU differentiation 

The current patchset deals with the following use-case suboptimally:


	CPU1-4				CPU5-8

	die1				die2



* Task groupA is isolated to CPU-8 (die2).
* Task groupA has 50% of the cache reserved.
* Task groupB can reclaim into the other 50% of the cache of die2.
* Task groupB can reclaim into 100% of the cache of die1.

I suppose this is a common scenario which is not handled by
the current patchset (task groupB would be limited to 50%
of the cache of die1 as well).
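
For concreteness, the per-package masks that scenario wants, assuming
a 16-bit CBM and a hypothetical per-die extension of the interface
(this is not what the current patchset provides):

	struct clos_cbm_per_die {
		unsigned int closid;
		unsigned int cbm_die1;	/* CPU1-4 */
		unsigned int cbm_die2;	/* CPU5-8 */
	};

	static const struct clos_cbm_per_die example[] = {
		/* groupA: isolated to die2, 50% of its cache reserved;
		 * the die1 mask is a don't-care since groupA never
		 * runs there. */
		{ .closid = 1, .cbm_die1 = 0xffff, .cbm_die2 = 0xff00 },
		/* groupB: full cache on die1, the other 50% on die2 */
		{ .closid = 2, .cbm_die1 = 0xffff, .cbm_die2 = 0x00ff },
	};

With a single global CBM per CLOS, groupB has to carry the more
restrictive 0x00ff mask on die1 as well, which is the suboptimality
described above.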

