linux-kernel - Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

Message-ID: <alpine.DEB.2.10.1701201520150.3301@vshiva-Udesk>
Date:   Fri, 20 Jan 2017 15:51:48 -0800 (PST)
From:   Shivappa Vikas <vikas.shivappa@...el.com>
To:     David Carrillo-Cisneros <davidcc@...gle.com>
cc:     Shivappa Vikas <vikas.shivappa@...el.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
        Stephane Eranian <eranian@...gle.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        x86 <x86@...nel.org>, hpa@...or.com,
        Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        "Shankar, Ravi V" <ravi.v.shankar@...el.com>,
        "Luck, Tony" <tony.luck@...el.com>,
        Fenghua Yu <fenghua.yu@...el.com>, andi.kleen@...el.com,
        "H. Peter Anvin" <h.peter.anvin@...el.com>
Subject: Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes



On Fri, 20 Jan 2017, David Carrillo-Cisneros wrote:

> On Fri, Jan 20, 2017 at 1:08 PM, Shivappa Vikas
> <vikas.shivappa@...el.com> wrote:
>>
>>
>> On Fri, 20 Jan 2017, David Carrillo-Cisneros wrote:
>>
>>> On Fri, Jan 20, 2017 at 5:29 AM Thomas Gleixner <tglx@...utronix.de>
>>> wrote:
>>>>
>>>>
>>>> On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
>>>>>
>>>>>
>>>>> If resctrl groups could lift the restriction of one resctrl group per
>>>>> CLOSID, then the user can create many resctrl groups in the way perf
>>>>> cgroups are created now. The advantage is that there won't be a cgroup
>>>>> hierarchy, making things much simpler. Also no need to optimize perf
>>>>> event context switch to make llc_occupancy work.
>>>>
>>>>
>>>> So if I understand you correctly, then you want a mechanism to have
>>>> groups
>>>> of entities (tasks, cpus) and associate them to a particular resource
>>>> control group.
>>>>
>>>> So they share the CLOSID of the control group and each entity group can
>>>> have its own RMID.
>>>>
>>>> Now you want to be able to move the entity groups around between control
>>>> groups without losing the RMID associated to the entity group.
>>>>
>>>> So the whole picture would look like this:
>>>>
>>>> rdt ->  CTRLGRP -> CLOSID
>>>>
>>>> mon ->  MONGRP  -> RMID
>>>>
>>>> And you want to move MONGRP from one CTRLGRP to another.
>>>
>>>
>>> Almost, but not quite. My idea is to have MONGRP and CTRLGRP be the
>>> same thing. Details below.
>>>
>>>>
>>>> Can you please write up in an abstract way what the design requirements
>>>> are that you need. So far we are talking about implementation details and
>>>> unspecified wishlists, but what we really need is an abstract requirement.
>>>
>>>
>>> My pleasure:
>>>
>>>
>>> Design Proposal for Monitoring of RDT Allocation Groups.
>>>
>>> -----------------------------------------------------------------------------
>>>
>>> Currently each CTRLGRP has a unique CLOSID and a (most likely) unique
>>> cache bitmask (CBM) per resource. Non-unique CBMs are possible although
>>> useless. A unique CLOSID forbids more CTRLGRPs than physical CLOSIDs.
>>> CLOSIDs are much more scarce than RMIDs.
>>>
>>> If we lift the condition of unique CLOSID, then the user can create
>>> multiple CTRLGRPs with the same schemata. Internally, those CTRLGRPs
>>> would share the CLOSID and RDT_Allocation must maintain the schemata
>>> to CLOSID relationship (similarly to what the previous CAT driver used
>>> to do). Elements in CTRLGRP.tasks and CTRLGRP.cpus behave the same as
>>> now: adding an element removes it from its previous CTRLGRP.
>>>
>>>
>>> This change would allow further partitioning the allocation groups
>>> into (allocation, monitoring) groups as follows:
>>>
>>> With allocation only:
>>>            CTRLGRP0     CTRLGRP_ALLOC_ONLY
>>> schemata:  L3:0=0xff0       L3:0=0x00f
>>> tasks:       PID0       P0_0,P0_1,P1_0,P1_1
>>> cpus:        0x3                0xC
>>
>>
>> It is not clear what PID0 and P0_0 mean.
>
> PID0, and P*_* are arbitrary PIDs. The tasks file works the same as it
> does now in RDT. I am not changing that.
>
>>
>> If you have to support something like MONGRP and CTRLGRP, do you overall
>> want to allow a task to be present in multiple groups?
>
> I am not proposing to support MONGRP and CTRLGRP. I am proposing to
> allow monitoring of CTRLGRPs only.
>
>>>
>>> If we want to monitor (P0_0,P0_1), (P1_0,P1_1) and CPUs 0xC
>>> independently, with the new model we could create:
>>>            CTRLGRP0     CTRLGRP1     CTRLGRP2        CTRLGRP3
>>> schemata:  L3:0=0xff0   L3:0=0x00f   L3:0=0x00f     L3:0=0x00f
>>> tasks:       PID0         <none>      P0_0,P0_1     P1_0, P1_1
>>> cpus:        0x3           0xC          0x0             0x0
>>>
>>> Internally, CTRLGRP1, CTRLGRP2, and CTRLGRP3 would share the CLOSID for
>>> (L3,0).
>>>
>>>
>>> Now we can ask perf to monitor any of the CTRLGRPs independently -once
>>> we solve how to pass to perf what (CTRLGRP, resource_id) to monitor-.
>>> The perf_event will reserve and assign the RMID to the monitored
>>> CTRLGRP. The RDT subsystem will context switch the whole PQR_ASSOC MSR
>>> (CLOSID and RMID), so perf won't have to.
>>
>>
>> This can be solved by supporting just the -t in perf and a new option in perf
>> to support resctrl group monitoring (something similar to -R). That way we
>> provide the flexible granularity to monitor tasks independent of whether
>> they are in any resctrl group (and hence also a subset).
>
> One of the key points of my proposal is to remove monitoring PIDs
> independently. That simplifies things by letting RDT handle CLOSIDs
> and RMIDs together.
>
>>
>> CTRLGRP         TASKS           MASK
>> CTRLGRP1        PID1,PID2       L3:0=0Xf,1=0xf0
>> CTRLGRP2        PID3,PID4       L3:0=0Xf0,1=0xf00
>>
>> #perf stat -e llc_occupancy -R CTRLGRP1
>>
>> #perf stat -e llc_occupancy -t PID3,PID4
>>
>> The RMID allocation is independent of resctrl CLOSid allocation and hence
>> the RMID is not always married to CLOS which seems like the requirement
>> here.
>
> It is not a requirement. Both the CLOSID and the RMID of a CTRLGRP can
> change in my proposal.
>
>>
>> OR
>>
>> We could have CTRLGRPs with control_only, monitor_only or control_monitor
>> options.
>>
>> Now a task could be present in both a control_only and a monitor_only
>> group, or it could be present only in a control_monitor group. The
>> transitions from one state to another are guarded by this same principle.
>>
>> CTRLGRP         TASKS           MASK                    TYPE
>> CTRLGRP1        PID1,PID2       L3:0=0Xf,1=0xf0         control_only
>> CTRLGRP2        PID3,PID4       L3:0=0Xf0,1=0xf00       control_only
>> CTRLGRP3        PID2,PID3                               monitor_only
>> CTRLGRP4        PID5,PID6       L3:0=0Xf0,1=0xf00       control_monitor
>>
>> CTRLGRP3 allows you to monitor a set of tasks which is not bound to be in
>> the same CTRLGRP and you can add or move tasks into this. Adding and
>> removing tasks is what's easily supported compared to the task
>> granularity, although such a thing could still be supported with the task
>> granularity.
>>
>> CTRLGRP4 allows you to tie the monitor and control together so when tasks
>> move in and out of this we still have that group to consider. And these
>> groups still retain the cpu masks like before so that cpu monitoring is
>> still supported.
>
> Instead of having 3 types of CTRLGRP, I am proposing one kind
> (equivalent to your control_monitor type) that uses a non-zero RMID
> when an appropriate perf_event is attached to it. What advantages do
> you see in having 3 distinct types?

Basically I am trying to collate the requirements that you, Stephane, and
Thomas mentioned, along with those of some OEMs who only care about task
monitoring (probably over a long time or the task's lifetime, so they want it
to be efficient).

To cover all the scenarios, I see these as the design requirements.

A group here is a group of tasks.

1. To set up control groups and be able to monitor them.
2. To be able to monitor the groups from the beginning of task creation
(lifetime or continuous monitoring).
3. To be able to monitor groups without caring about the CLOS (allocation is
"don't care"). That is, a monitor group may be a subset of a control group, a
superset of one, or may intersect multiple control groups. In other words, the
task set is arbitrary.
4. To monitor a task without bothering to create a group.

CTRLGRP         TASKS           MASK                    TYPE
CTRLGRP1        PID1,PID2       L3:0=0Xf,1=0xf0         control_only
CTRLGRP2        PID3,PID4       L3:0=0Xf0,1=0xf00       control_only
CTRLGRP3        PID2,PID3                               monitor_only
CTRLGRP4        PID5,PID6       L3:0=0Xf0,1=0xf00       control_monitor
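
As a rough illustration of how the three types could constrain membership,
here is a small Python sketch. It assumes that a task may be controlled by at
most one group and monitored by at most one group, and that a control_monitor
group takes both roles. The names GroupType, ROLES and can_add are hypothetical
and only serve the example; nothing here is existing kernel or perf code.

```python
from enum import Enum


class GroupType(Enum):
    CONTROL_ONLY = "control_only"
    MONITOR_ONLY = "monitor_only"
    CONTROL_MONITOR = "control_monitor"


# Roles that each group type occupies for its member tasks (assumed model).
ROLES = {
    GroupType.CONTROL_ONLY: {"control"},
    GroupType.MONITOR_ONLY: {"monitor"},
    GroupType.CONTROL_MONITOR: {"control", "monitor"},
}


def can_add(task, new_type, groups):
    """Return True if `task` can join a group of `new_type` without conflict.

    `groups` maps a group name to (GroupType, set of member tasks).
    Assumed rule: a task holds each role at most once.
    """
    held = set()
    for gtype, tasks in groups.values():
        if task in tasks:
            held |= ROLES[gtype]
    return not (held & ROLES[new_type])


# The example table from this mail.
groups = {
    "CTRLGRP1": (GroupType.CONTROL_ONLY, {"PID1", "PID2"}),
    "CTRLGRP2": (GroupType.CONTROL_ONLY, {"PID3", "PID4"}),
    "CTRLGRP3": (GroupType.MONITOR_ONLY, {"PID2", "PID3"}),
    "CTRLGRP4": (GroupType.CONTROL_MONITOR, {"PID5", "PID6"}),
}
```

Under this model PID1 could still join a monitor_only group, while PID5 could
not join any other group because CTRLGRP4 already covers both roles for it.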

Now the implementation of this is either through adding a new option to perf
(reusing some of the cgroup support or not, which is implementation specific),
or through a new tool. By tool I really mean an ioctl mechanism, so we just
need a syscall that can be called; we don't need a user-mode tool per se. For
example, I use resmon as the tool below, but that is equivalent to having a
similar option in perf.

With this, if the user wants to do #1:

# resmon -R CTRLGRP1

For #2:

The resctrl group adds the children of its tasks to the same group, so we can
essentially do continuous monitoring.

# echo $$ > ../ctrlgrp1/tasks
# resmon -R ctrlgrp1
# task1 &
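
A toy Python model of why this gives lifetime monitoring, assuming only the two
behaviours mentioned in this thread: writing a PID to a tasks file moves it out
of its previous group, and a child starts in its parent's group. The Resctrl
class and its method names are made up for the illustration.

```python
class Resctrl:
    """Toy model of group membership; not the kernel's data structures."""

    def __init__(self):
        self.group_of = {}  # task id -> group name

    def move(self, task, group):
        # Writing a PID to <group>/tasks moves it out of any previous group.
        self.group_of[task] = group

    def fork(self, parent, child):
        # A child starts in the same group as its parent, if the parent has one.
        if parent in self.group_of:
            self.group_of[child] = self.group_of[parent]

    def members(self, group):
        return sorted(t for t, g in self.group_of.items() if g == group)


r = Resctrl()
r.move("shell", "ctrlgrp1")   # echo $$ > ../ctrlgrp1/tasks
r.fork("shell", "task1")      # task1 &
r.fork("task1", "worker")     # task1 spawns a helper of its own
```

Everything started from the shell after the echo lands in ctrlgrp1, so a
monitor attached to the group sees those tasks from creation.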

#3 above is the situation where you first want to profile the cache usage of a
bunch of tasks (without caring about the CLOS) and then, based on that usage,
assign or group them into control groups that give them cache. This could be a
typical usage model in large-scale cluster workload distribution or even in a
real-time workload.

Create a monitor_only group for this, like CTRLGRP3 above:

# echo PID1, ... PIDn > ../ctrlgrp3/tasks
# resmon -R ctrlgrp3

#4 above is when you are already monitoring some tasks, say PID2 and PID3,
which are part of a CTRLGRP, but you now want to monitor just PID2. You cannot
create a new CTRLGRP with just PID2 if PID2 is already present in one of the
CTRLGRPs (if we relax this requirement, it complicates the number of groups
that can be created, as we assume there is no hierarchy).

The user can use the -t (task monitoring) option to do this. The user can also
use this option without bothering to create any resctrl groups at all.

# resmon -t PID4, PID5, PID6

I think the email thread is getting very long, and we should probably just meet
face to face next week to iron out the requirements and chalk out a design
proposal.

Overall it looks like we can do without recycling/rotation and just do reuse,
throw an error when we run out of RMIDs, and support per-package RMIDs. But we
need to add support for cpu monitoring and resctrl group monitoring (without
losing the flexibility to support monitoring at granularities other than the
resctrl groups).
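
To make "reuse and throw an error" concrete, here is a minimal Python sketch of
a per-package RMID pool with no rotation. It assumes RMID 0 is kept as the
default, unmonitored id and ignores any delay that might be needed before a
freed RMID can be handed out again. RmidPool and OutOfRmids are hypothetical
names, not anything in the current patches.

```python
class OutOfRmids(Exception):
    """Raised when a package has no free RMID left."""


class RmidPool:
    def __init__(self, packages, rmids_per_package):
        # RMID 0 is treated as the default "not monitored" id (assumption),
        # so each package hands out ids 1 .. rmids_per_package - 1.
        self.free = {p: list(range(rmids_per_package - 1, 0, -1))
                     for p in range(packages)}

    def alloc(self, package):
        # No rotation: fail immediately when the package pool is empty.
        if not self.free[package]:
            raise OutOfRmids(f"package {package}: no free RMID")
        return self.free[package].pop()

    def release(self, package, rmid):
        # Freed ids go straight back to the pool for reuse.
        self.free[package].append(rmid)


pool = RmidPool(packages=2, rmids_per_package=4)
```

Each package is tracked independently, so running out of RMIDs on one package
does not prevent monitoring on another.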

>
>>
>> In this case we would need a new option to support the ctrlgrp monitoring in
>> perf or a new tool to do all this if we don't want to bother perf.
>>
>>
>
> Agree, I like expanding the cgroup fd option to take CTRLGRP fds, as
> described in the Implementation Ideas part of the proposal.
>
>>>
>>> If CTRLGRP's schemata changes, the RDT subsystem will find a new
>>> CLOSID for the new schemata (potentially reusing an existing one) or
>>> fail (just like the old CAT used to). The RMID does not change during
>>> schemata updates.
>>>
>>> If a CTRLGRP dies, the monitoring perf_event continues to exist as a
>>> useless wraith, just as happens with cgroup events now.
>>>
>>> Since CTRLGRPs have no hierarchy, there is no need to handle that in
>>> the new RDT Monitoring PMU, greatly simplifying it over the previously
>>> proposed versions.
>>>
>>> A breaking change in user observed behavior with respect to the
>>> existing CQM PMU is that there wouldn't be task events. A task must be
>>> part of a CTRLGRP and events are created per (CTRLGRP, resource_id)
>>> pair. If a user wants to monitor a task across multiple resources
>>> (e.g. l3_occupancy across two packages), she must create one event per
>>> resource_id and add the two counts.
>>>
>>> I see this breaking change as an improvement, since hiding the cache
>>> topology from user space introduced lots of ugliness and complexity to
>>> the CQM PMU without improving accuracy over user space adding the
>>> events.
>>>
>>> Implementation ideas:
>>>
>>> First idea is to expose one monitoring file per resource in a CTRLGRP,
>>> so the list of CTRLGRP's files would be: schemata, tasks, cpus,
>>> monitor_l3_0, monitor_l3_1, ...
>>>
>>> the monitor_<resource_id> file descriptor is passed to perf_event_open
>>> in the way cgroup file descriptors are passed now. All events to the
>>> same (CTRLGRP,resource_id) share RMID.
>>>
>>> The RMID allocation part can either be handled by RDT Allocation or by
>>> the RDT Monitoring PMU. Either way, the existence of PMU's
>>> perf_events allocates/releases the RMID.
>>>
>>> Also, since this new design removes hierarchy and task events, it
>>> allows for a simple solution of the RMID rotation problem. The removal
>>> of task events eliminates the cgroup vs task event conflict existing
>>> in the upstream version; it also removes the need to ensure that all
>>> active packages have RMIDs at the same time that added complexity to
>>> my version of CQM/CMT. Lastly, the removal of hierarchy removes the
>>> reliance on cgroups, the complex tree based read, and all the hooks
>>> and cgroup files that "raped" the cgroup subsystem.
>>
>>
>> Yes, not sure if the view is same after I sent the implementation details in
>> documentation :) (most likely it is).
>> But the option could be to not support perf_cgroup for cqm and support a new
>> option in perf to monitor resctrl groups and tasks (or some other options
>> like mongrp)
>
> Agree with not supporting cgroups. This proposal is about supporting
> neither cgroups nor tasks and doing all monitoring through CTRLGRPs
> through an expansion of an existing perf option.
>
>>
>> I am so far inclined toward creating a new monitoring interface; that way we
>> don't try to "rape" the existing perf specifics for this RDT or later RDT
>> quirk/features.
>>
>
> On first inspection it seems to me like perf would be fine with this
> approach. It requires no changes to the system call and just some
> changes in the way the cgroup_fd is handled in perf_event_open
> (besides making sure that a context-less PMU doesn't break things). Do
> you foresee any conflict with future features?

The new tool would really add a new syscall instead of modifying the existing
perf_event_open syscall for the cgroup case (please see above).

Thanks,
Vikas

>
> Thanks,
> David
>