Message-ID: <CALcN6mgyXu2Ekqwi+MC24R1X5FVWvG7xcJZH8Ppjh+j+vgm2qg@mail.gmail.com>
Date:   Thu, 19 Jan 2017 08:58:53 -0800
From:   David Carrillo-Cisneros <davidcc@...gle.com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     Shivappa Vikas <vikas.shivappa@...el.com>,
        Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
        Stephane Eranian <eranian@...gle.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        x86 <x86@...nel.org>, hpa@...or.com,
        Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        "Shankar, Ravi V" <ravi.v.shankar@...el.com>,
        "Luck, Tony" <tony.luck@...el.com>,
        Fenghua Yu <fenghua.yu@...el.com>, andi.kleen@...el.com,
        "H. Peter Anvin" <h.peter.anvin@...el.com>
Subject: Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

On Wed, Jan 18, 2017 at 6:09 PM, David Carrillo-Cisneros
<davidcc@...gle.com> wrote:
> On Wed, Jan 18, 2017 at 12:53 AM, Thomas Gleixner <tglx@...utronix.de> wrote:
>> On Tue, 17 Jan 2017, Shivappa Vikas wrote:
>>> On Tue, 17 Jan 2017, Thomas Gleixner wrote:
>>> > On Fri, 6 Jan 2017, Vikas Shivappa wrote:
>>> > > - Issue(1): Inaccurate per-package and system-wide data. It just
>>> > > prints zeros or arbitrary numbers.
>>> > >
>>> > > Fix: The patches fix this by throwing an error if the mode is not
>>> > > supported. The supported modes are task monitoring and cgroup
>>> > > monitoring. The per-package data for, say, socket X is returned with
>>> > > the -C <cpu on socket X> -G cgrpy option. The system-wide data can
>>> > > be looked up by monitoring the root cgroup.
>>> >
>>> > Fine. That just lacks any comment in the implementation. Otherwise I would
>>> > not have asked the question about CPU monitoring. Though I fundamentally
>>> > hate the idea of requiring cgroups for this to work.
>>> >
>>> > If I just want to look at CPU X why on earth do I have to set up all that
>>> > cgroup muck? Just because your main focus is cgroups?
>>>
>>> The upstream per-cpu data is broken because it's not overriding the other
>>> task event RMIDs on that CPU with the CPU event RMID.
>>>
>>> This can be fixed by adding a percpu struct to hold the RMID that's
>>> affinitized to the CPU; however, we then miss all the task llc_occupancy
>>> in it - still evaluating this.
>>
>> The point here is that CQM is closely connected to the cache allocation
>> technology. After a lengthy discussion we ended up having
>>
>>   - per cpu CLOSID
>>   - per task CLOSID
>>
>> where all tasks which do not have a CLOSID assigned use the CLOSID which is
>> assigned to the CPU they are running on.
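
(For concreteness: that fallback is roughly what the accepted resctrl code
does on context switch. Minimal sketch from memory, simplified - the real
code also writes the RMID and sits behind a static key:)

static void rdt_sched_in(void)
{
        u32 closid = current->closid;   /* per-task CLOSID, 0 if unassigned */

        if (!closid)
                closid = this_cpu_read(cpu_closid);     /* per-CPU default */

        /* PQR_ASSOC: low 32 bits carry the RMID, high 32 the CLOSID. */
        wrmsr(MSR_IA32_PQR_ASSOC, 0, closid);
}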
>>
>> So if I configure a system by simply partitioning the cache per CPU, which
>> is the proper way to do it for HPC and RT use cases where workloads are
>> partitioned on CPUs as well, then I really want to have an equally simple
>> way to monitor the occupancy for that reservation.
>>
>> And looking at that from the CAT point of view, which is the proper way to
>> do it, makes it obvious that CQM should be modeled to match CAT.
>>
>> So lets assume the following:
>>
>>    CPU 0-3     default CLOSID 0
>>    CPU 4               CLOSID 1
>>    CPU 5               CLOSID 2
>>    CPU 6               CLOSID 3
>>    CPU 7               CLOSID 3
>>
>>    T1                  CLOSID 4
>>    T2                  CLOSID 5
>>    T3                  CLOSID 6
>>    T4                  CLOSID 6
>>
>>    All other tasks use the per cpu defaults, i.e. the CLOSID of the CPU
>>    they run on.
>>
>> then the obvious basic monitoring requirement is to have a RMID for each
>> CLOSID.
>>
>> So when I monitor CPU4, i.e. CLOSID 1 and T1 runs on CPU4, then I do not
>> care at all about the occupancy of T1 simply because that is running on a
>> separate reservation. Trying to make that an aggregated value in the first
>> place is completely wrong. If you want an aggregate, which is pretty much
>> useless, then user space tools can generate it easily.
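
(Side note: the hardware readout is per-RMID anyway, so per-reservation
monitoring is the natural unit. A sketch of reading llc_occupancy for one
RMID, MSR numbers as in the SDM; locking/cross-package IPIs omitted:)

#define MSR_IA32_QM_EVTSEL      0xc8d
#define MSR_IA32_QM_CTR         0xc8e
#define QOS_L3_OCCUP_EVENT_ID   0x01

/*
 * LLC occupancy for @rmid in hardware units; multiply by the scaling
 * factor from CPUID.0xF.1:EBX to get bytes. Returns (u64)-1 when the
 * hardware flags the count as in error or unavailable.
 */
static u64 read_llc_occupancy(u32 rmid)
{
        u64 val;

        /* Select event 1 (LLC occupancy) for this RMID ... */
        wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid);
        /* ... and read it back; bits 63/62 flag error/unavailable. */
        rdmsrl(MSR_IA32_QM_CTR, val);
        if (val & (3ULL << 62))
                return (u64)-1;
        return val & ((1ULL << 62) - 1);
}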
>>
>> The whole approach you and David have taken is to whack some desired cgroup
>> functionality and whatever into CQM without rethinking the overall
>> design. And that's fundamentally broken because it does not take cache (and
>> memory bandwidth) allocation into account.
>>
>> I seriously doubt, that the existing CQM/MBM code can be refactored in any
>> useful way. As Peter Zijlstra said before: Remove the existing cruft
>> completely and start with completely new design from scratch.
>>
>> And this new design should start from the allocation angle and then add the
>> whole other muck on top, as far as that is possible. Allocation-related
>> monitoring must be the primary focus; everything else is just tinkering.
>>
>
> If in this email you meant "Resource group" where you wrote "CLOSID", then
> please disregard my previous email. It seems like a good idea to me to have
> a 1:1 mapping between RMIDs and "Resource groups".
>
> The distinction matters because changing the schemata in the resource group
> would likely trigger a change of CLOSID, which is useful.
>

Just realized that the sharing of CLOSIDs is not part of the accepted
version of RDT. My mental model was still on the old CAT driver, which did
allow sharing of CLOSIDs between cgroups. Now I understand why CLOSID was
assumed to be equivalent to "Resource groups". Sorry for the noise. The
comments in my previous email therefore hold.

In summary, and in addition to the latest emails:

A 1:1 mapping between CLOSIDs/"Resource groups" and RMIDs, as Fenghua
suggested, is very problematic because the number of CLOSIDs is much
smaller than the number of RMIDs, and, as Stephane mentioned, it is a
common use case to independently monitor many tasks/cgroups inside a
single allocation partition. (A quick way to see the scale difference is
sketched below.)
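
(Sketch; leaf layout as documented in the SDM, and an RDT-capable CPU with
L3 allocation/monitoring assumed:)

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
        unsigned int eax, ebx, ecx, edx;

        /* CPUID.(EAX=10H, ECX=1): L3 CAT; EDX[15:0] = highest CLOSID. */
        __get_cpuid_count(0x10, 1, &eax, &ebx, &ecx, &edx);
        printf("CLOSIDs: %u\n", (edx & 0xffff) + 1);

        /* CPUID.(EAX=0FH, ECX=1): L3 monitoring; ECX = highest RMID. */
        __get_cpuid_count(0x0f, 1, &eax, &ebx, &ecx, &edx);
        printf("RMIDs:   %u\n", ecx + 1);

        return 0;
}

Typically the first number is in the tens and the second in the hundreds,
hence the gap above.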

A 1:many mapping of CLOSIDs to RMIDs may work as a cheap replacement for
cgroup monitoring, but the case where a CLOSID changes would be messy. For
llc_occupancy, if RMIDs are changed, the old RMIDs still hold valid
occupancy for an indefinite time, so either the RMIDs must be preserved
(breaking the 1:many mapping) or the old RMIDs must be tracked while they
remain dirty (see the sketch after this paragraph).
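
(A sketch of the second option, hypothetical names throughout -
rmid_limbo_list, dirty_threshold and rmid_free() don't exist as such;
reuses the read_llc_occupancy() helper sketched earlier:)

struct rmid_entry {
        u32                     rmid;
        struct list_head        list;
};

static LIST_HEAD(rmid_limbo_list);      /* freed, but still dirty */
static u64 dirty_threshold;             /* "clean enough" cutoff, hw units */

/* Return freed-but-dirty RMIDs to the free pool once they have drained. */
static void rmid_limbo_scan(void)
{
        struct rmid_entry *e, *tmp;

        list_for_each_entry_safe(e, tmp, &rmid_limbo_list, list) {
                /* Occupancy only decays as the lines get evicted ... */
                if (read_llc_occupancy(e->rmid) <= dirty_threshold) {
                        list_del(&e->list);
                        rmid_free(e->rmid);     /* hypothetical */
                }
        }
}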

Thanks,
David
