linux-kernel - Re: [RFD] CAT user space interface revisited

Message-ID: <20151119000153.GA27997@amt.cnet>
Date:	Wed, 18 Nov 2015 22:01:54 -0200
From:	Marcelo Tosatti <mtosatti@...hat.com>
To:	Thomas Gleixner <tglx@...utronix.de>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	Peter Zijlstra <peterz@...radead.org>, x86@...nel.org,
	Luiz Capitulino <lcapitulino@...hat.com>,
	Vikas Shivappa <vikas.shivappa@...el.com>,
	Tejun Heo <tj@...nel.org>, Yu Fenghua <fenghua.yu@...el.com>
Subject: Re: [RFD] CAT user space interface revisited

On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> Folks!
> 
> After rereading the mail flood on CAT and staring into the SDM for a
> while, I think we all should sit back and look at it from scratch
> again w/o our preconceptions - I certainly had to put my own away.
> 
> Let's look at the properties of CAT again:
> 
>    - It's a per socket facility
> 
>    - CAT slots can be associated to external hardware. This
>      association is per socket as well, so different sockets can have
>      different behaviour. I missed that detail when staring the first
>      time, thanks for the pointer!
> 
>    - The association itself is per cpu. The COS selection happens on a
>      CPU while the set of masks which are selected via COS are shared
>      by all CPUs on a socket.
> 
> There are restrictions which CAT imposes in terms of configurability:
> 
>    - The bits which select a cache partition need to be consecutive
> 
>    - The number of possible cache association masks is limited
> 
> Let's look at the configurations (CDP omitted and size restricted)
> 
> Default:   1 1 1 1 1 1 1 1
> 	   1 1 1 1 1 1 1 1
> 	   1 1 1 1 1 1 1 1
> 	   1 1 1 1 1 1 1 1
> 
> Shared:	   1 1 1 1 1 1 1 1
> 	   0 0 1 1 1 1 1 1
> 	   0 0 0 0 1 1 1 1
> 	   0 0 0 0 0 0 1 1
> 
> Isolated:  1 1 1 1 0 0 0 0
> 	   0 0 0 0 1 1 0 0
> 	   0 0 0 0 0 0 1 0
> 	   0 0 0 0 0 0 0 1
> 
> Or any combination thereof. Surely some combinations will not make any
> sense, but we really should not make any restrictions on the stupidity
> of a sysadmin. The worst outcome might be L3 disabled for everything,
> so what?
> 
> Now that gets even more convoluted if CDP comes into play and we
> really need to look at CDP right now. We might end up with something
> which looks like this:
> 
>    	   1 1 1 1 0 0 0 0	Code
> 	   1 1 1 1 0 0 0 0	Data
> 	   0 0 0 0 0 0 1 0	Code
> 	   0 0 0 0 1 1 0 0	Data
> 	   0 0 0 0 0 0 0 1	Code
> 	   0 0 0 0 1 1 0 0	Data
> or 
> 	   0 0 0 0 0 0 0 1	Code
> 	   0 0 0 0 1 1 0 0	Data
> 	   0 0 0 0 0 0 0 1	Code
> 	   0 0 0 0 0 1 1 0	Data
> 
> Let's look at partitioning itself. We have two options:
> 
>    1) Per task partitioning
> 
>    2) Per CPU partitioning
> 
> So far we only talked about #1, but I think that #2 has a value as
> well. Let me give you a simple example.
> 
> Assume that you have isolated a CPU and run your important task on
> it. You give that task a slice of cache. Now that task needs kernel
> services which run in kernel threads on that CPU. We really don't want
> to (and cannot) hunt down random kernel threads (think cpu bound
> worker threads, softirq threads ....) and give them another slice of
> cache. What we really want is:
> 
>     	 1 1 1 1 0 0 0 0    <- Default cache
> 	 0 0 0 0 1 1 1 0    <- Cache for important task
> 	 0 0 0 0 0 0 0 1    <- Cache for CPU of important task
> 
> It would even be sufficient for particular use cases to just associate
> a piece of cache to a given CPU and do not bother with tasks at all.
> 
> We really need to make this as configurable as possible from userspace
> without imposing random restrictions to it. I played around with it on
> my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> enabled) makes it really useless if we force the ids to have the same
> meaning on all sockets and restrict it to per task partitioning.
> 
> Even if next generation systems will have more COS ids available,
> there are not going to be enough to have a system wide consistent
> view unless we have COS ids > nr_cpus.
> 
> Aside of that I don't think that a system wide consistent view is
> useful at all.
> 
>  - If a task migrates between sockets, it's going to suffer anyway.
>    Real sensitive applications will simply pin tasks on a socket to
>    avoid that in the first place. If we make the whole thing
>    configurable enough then the sysadmin can set it up to support
>    even the nonsensical case of identical cache partitions on all
>    sockets and let tasks use the corresponding partitions when
>    migrating.
> 
>  - The number of cache slices is going to be limited no matter what,
>    so one still has to come up with a sensible partitioning scheme.
> 
>  - Even if we have enough cos ids the system wide view will not make
>    the configuration problem any simpler as it remains per socket.
> 
> It's hard. Policies are hard by definition, but this one is harder
> than most other policies due to the inherent limitations.
> 
> So now to the interface part. Unfortunately we need to expose this
> very close to the hardware implementation as there are really no
> abstractions which allow us to express the various bitmap
> combinations. Any abstraction I tried to come up with renders that
> thing completely useless.

No, you don't.

> I was not able to identify any existing infrastructure where this
> really fits in. I chose a directory/file based representation. We
> certainly could do the same with a syscall, but that's just an
> implementation detail.
> 
> At top level:
> 
>    xxxxxxx/cat/max_cosids		<- Assume that all CPUs are the same
>    xxxxxxx/cat/max_maskbits		<- Assume that all CPUs are the same
>    xxxxxxx/cat/cdp_enable		<- Depends on CDP availability
> 
> Per socket data:
> 
>    xxxxxxx/cat/socket-0/
>    ...
>    xxxxxxx/cat/socket-N/l3_size
>    xxxxxxx/cat/socket-N/hwsharedbits
> 
> Per socket mask data:
> 
>    xxxxxxx/cat/socket-N/cos-id-0/
>    ...
>    xxxxxxx/cat/socket-N/cos-id-N/inuse
> 				/cat_mask	
> 				/cdp_mask	<- Data mask if CDP enabled

There is no need to expose all this to userspace, but for some unknown 
reason people seem to be fond of that, so let's pretend it's necessary.

> Per cpu default cos id for the cpus on that socket:
> 
>    xxxxxxx/cat/socket-N/cpu-x/default_cosid
>    ...
>    xxxxxxx/cat/socket-N/cpu-N/default_cosid
> 
> The above allows a simple cpu based partitioning. All tasks which do
> not have a cache partition assigned on a particular socket use the
> default one of the cpu they are running on.

A task which does not have a partition assigned to it 
has to use the "other tasks" group (COSid0), so that it does 
not interfere with the cache reservations of other tasks.

All that is necessary are reservations {size,type} and lists of
reservations per task. This is the right level at which to expose this
to userspace, without userspace having to care about unnecessary HW details.

> Now for the task(s) partitioning:
> 
>    xxxxxxx/cat/partitions/
> 
> Under that directory one can create partitions
> 
>    xxxxxxx/cat/partitions/p1/tasks
> 			    /socket-0/cosid
> 			    ...
> 			    /socket-n/cosid
> 
>    The default value for the per socket cosid is COSID_DEFAULT, which
>    causes the task(s) to use the per cpu default id.
> 
> Thoughts?
> 
> Thanks,
> 
> 	tglx

Again: you don't need to look into the MSR table and relate it 
to tasks if you store the data as:

	task group 1 = {
			reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
			reservation-2 = {size = 100Kb, type = code, socketmask = 0xffff}
	}
	
	task group 2 = {
			reservation-1 = {size = 80Kb, type = data, socketmask = 0xffff},
			reservation-3 = {size = 200Kb, type = code, socketmask = 0xffff}
	}

Task group 1 and task group 2 share reservation-1.
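
As a rough illustration only, the layout above could be modelled along
the following lines. Python is used purely for brevity; the class and
field names (Reservation, TaskGroup, kind, size_kb) are hypothetical and
do not correspond to any existing kernel or userspace interface.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reservation:
    """One cache reservation: a size, a type, and the sockets it applies to."""
    size_kb: int
    kind: str        # "data" or "code"
    socketmask: int  # bitmask of sockets, as in the example above


@dataclass
class TaskGroup:
    """A named group of tasks holding a list of reservations."""
    name: str
    reservations: list = field(default_factory=list)


# The three reservations from the example above.
r1 = Reservation(80, "data", 0xffff)
r2 = Reservation(100, "code", 0xffff)
r3 = Reservation(200, "code", 0xffff)

# Both groups reference the same r1 object, i.e. they share reservation-1.
group1 = TaskGroup("task group 1", [r1, r2])
group2 = TaskGroup("task group 2", [r1, r3])


def shared(a, b):
    """Return the reservations of group a that group b also references."""
    return [r for r in a.reservations if any(r is s for s in b.reservations)]
```

Nothing in this representation mentions COS ids or MSR bitmasks; those
would be derived from the reservations by whatever layer sits below.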

This is what userspace is going to expose to users, of course.

If you expose the MSRs to userspace, you force userspace to convert
from this format to the MSRs (minding whether there
are contiguous regions available, and the region shared with HW).

>    - The bits which select a cache partition need to be consecutive
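
To make that conversion step concrete, here is a minimal sketch of what
something has to do to turn a size-based reservation into a contiguous
mask. It assumes every mask bit covers an equal share of the L3 and uses
a simple first-fit search; the function name, its parameters and the
800 KB / 8-bit figures in the usage note are hypothetical, chosen only
to keep the arithmetic easy to follow.

```python
import math


def contiguous_mask(size_kb, l3_size_kb, max_maskbits, busy_mask=0):
    """Return a contiguous bitmask covering at least size_kb, or None.

    busy_mask marks bits that must not be used, e.g. bits already handed
    out to other reservations or bits shared with hardware.
    """
    kb_per_bit = l3_size_kb / max_maskbits
    # Round up: a reservation always gets whole mask bits, at least one.
    nbits = max(1, math.ceil(size_kb / kb_per_bit))
    if nbits > max_maskbits:
        return None
    run = (1 << nbits) - 1
    # First fit: slide the run of set bits from the low end upwards.
    for shift in range(max_maskbits - nbits + 1):
        candidate = run << shift
        if candidate & busy_mask == 0:
            return candidate
    return None
```

With an 800 KB cache and 8 mask bits (100 KB per bit), a 200 KB request
with the two low bits busy yields 0b1100. The same request against a
fragmented busy mask of 0b01010101 fails, although four bits are free,
because no two free bits are adjacent.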

BUT, for our usecase the cgroups interface works as well, so let's
go with that (Tejun apparently had a usecase where tasks were allowed to 
set reservations themselves, in response to external events).

