linux-kernel - Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

Message-ID: <CALcN6mhh7ST0fs2K+dcCZBcGdYKyXD+Gvn7zhTbqH1Zn_hrp2Q@mail.gmail.com>
Date:   Thu, 2 Feb 2017 17:40:45 -0800
From:   David Carrillo-Cisneros <davidcc@...gle.com>
To:     "Luck, Tony" <tony.luck@...el.com>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
        "Shivappa, Vikas" <vikas.shivappa@...el.com>,
        Stephane Eranian <eranian@...gle.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        x86 <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
        Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        "Shankar, Ravi V" <ravi.v.shankar@...el.com>,
        "Yu, Fenghua" <fenghua.yu@...el.com>,
        "Kleen, Andi" <andi.kleen@...el.com>,
        "Anvin, H Peter" <h.peter.anvin@...el.com>
Subject: Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

On Thu, Feb 2, 2017 at 3:41 PM, Luck, Tony <tony.luck@...el.com> wrote:
> On Thu, Feb 02, 2017 at 12:22:42PM -0800, David Carrillo-Cisneros wrote:
>> There is no need to change perf(1) to support
>>  # perf stat -I 1000 -e intel_cqm/llc_occupancy {command}
>>
>> the PMU can work with resctrl to provide the support through
>> perf_event_open, with the advantage that tools other than perf could
>> also use it.
>
> I agree it would be better to expose the counters through
> a standard perf_event_open() interface ... but we don't seem
> to have had much luck doing that so far.
>
> That would need the requirements to be re-written with the
> focus of what does resctrl need to do to support each of the
> perf(1) command line modes of operation.  The fact that these
> counters work rather differently from normal h/w counters
> has resulted in massively complex volumes of code trying
> to map them into what perf_event_open() expects.
>
> The key points of weirdness seem to be:
>
> 1) We need to allocate an RMID for the duration of monitoring. While
>    there are quite a lot of RMIDs, it is easy to envision scenarios
>    where there are not enough.
>
> 2) We need to load that RMID into PQR_ASSOC on a logical CPU whenever a process
>    of interest is running.
>
> 3) An RMID is shared by llc_occupancy, local_bytes and total_bytes events
>
> 4) For llc_occupancy the count can change even when none of the processes
>    are running because cache lines are evicted
>
> 5) llc_occupancy measures the delta, not the absolute occupancy. To
>    get a good result requires monitoring from process creation (or
>    lots of patience, or the nuclear option "wbinvd").
>
> 6) RMID counters are package scoped
>
>
> These result in all sorts of hard to resolve situations. E.g. you are
> monitoring local bandwidth coming from logical CPU2 using RMID=22. I'm
> looking at the cache occupancy of PID=234 using RMID=45. The scheduler
> decides to run my process on your CPU.  We can only load one RMID, so
> one of us will be disappointed (unless we have some crazy complex code
> where your instance of perf borrows RMID=45 and reads out the local
> byte count on sched_in() and sched_out() to add to the running count
> you were keeping against RMID=22).
>
> How can we document such restrictions for people who haven't been
> digging in this code for over a year?
>
> I think a perf_event_open() interface would make some simple cases
> work, but result in some swearing once people start running multiple
> complex monitors at the same time.

More problems:

7) Time multiplexing of RMIDs is hard because llc_occupancy cannot be reset.

8) Only one RMID per CPU can be loaded at a time into PQR_ASSOC.
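To make (2) and (8) concrete, here is a toy model in Python (not kernel code) of the CPU2 scenario quoted above. The names Cpu, sched_in, sched_out and run are made up for illustration, and the policy that a task's RMID overrides the CPU-wide one is an assumption of this sketch, not something the hardware mandates.

```python
class Cpu:
    """Toy logical CPU with a single PQR_ASSOC RMID slot."""

    def __init__(self, cpu_id, cpu_rmid=0):
        self.cpu_id = cpu_id
        self.cpu_rmid = cpu_rmid    # RMID used by a CPU-wide monitor
        self.pqr_assoc = cpu_rmid   # only one RMID can be loaded at a time

    def sched_in(self, task_rmid):
        # Assumed policy: a monitored task's RMID replaces the CPU-wide RMID.
        self.pqr_assoc = task_rmid if task_rmid is not None else self.cpu_rmid

    def sched_out(self):
        self.pqr_assoc = self.cpu_rmid


def run(cpu, task_rmid, local_bytes, counters):
    """Run one time slice; traffic is charged to whichever RMID is loaded."""
    cpu.sched_in(task_rmid)
    counters[cpu.pqr_assoc] = counters.get(cpu.pqr_assoc, 0) + local_bytes
    cpu.sched_out()


counters = {}
cpu2 = Cpu(2, cpu_rmid=22)        # CPU-wide local bandwidth monitor, RMID=22
run(cpu2, None, 100, counters)    # unmonitored task: charged to RMID 22
run(cpu2, 45, 300, counters)      # PID=234 with RMID=45 lands on CPU2
print(counters)                   # {22: 100, 45: 300}
```

CPU2 moved 400 bytes in total, but the CPU-wide monitor on RMID=22 only sees 100 unless it also reads RMID=45 around each context switch.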

Most of the complexity in past attempts was caused by:
  A. Task events being defined as system-wide and not package-wide.
     What you describe in points (4) and (6) made this complicated.
  B. The cgroup hierarchy, due to (7) and (8).

A and B caused the bulk of the code by complicating RMID assignment,
reading and rotation.

Now that we've learned from past experience, we have defined
per-domain monitoring and use flat groups. FWICT, that is enough to allow
a simple implementation that can be expressed through perf_event_open.
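
As a rough illustration of why per-domain monitoring with flat groups keeps things simple, here is a toy model in Python (not the proposed kernel code). Domain, MonGroup and the hw dictionary are hypothetical names, and reserving RMID 0 for unmonitored work is an assumption of this sketch.

```python
class Domain:
    """One monitoring domain (e.g. a package) with its own RMID pool."""

    def __init__(self, name, num_rmids):
        self.name = name
        # Assumption: RMID 0 is kept for unmonitored work.
        self.free = list(range(1, num_rmids))

    def alloc(self):
        if not self.free:
            raise RuntimeError(f"no free RMID in domain {self.name}")
        return self.free.pop(0)

    def release(self, rmid):
        self.free.append(rmid)


class MonGroup:
    """A flat monitoring group: no parent and no children."""

    def __init__(self, name, domains):
        self.name = name
        self.rmids = {}
        try:
            for dom in domains:
                self.rmids[dom.name] = dom.alloc()
        except RuntimeError:
            self.destroy(domains)   # roll back a partial allocation
            raise

    def destroy(self, domains):
        for dom in domains:
            rmid = self.rmids.pop(dom.name, None)
            if rmid is not None:
                dom.release(rmid)

    def read(self, domain_name, hw_counters):
        # A read touches exactly one (domain, RMID) pair; nothing is summed
        # across packages or across a group hierarchy.
        return hw_counters.get((domain_name, self.rmids[domain_name]), 0)


pkg0, pkg1 = Domain("pkg0", 4), Domain("pkg1", 4)
domains = [pkg0, pkg1]
g1 = MonGroup("g1", domains)
g2 = MonGroup("g2", domains)

# Stand-in for the per-package hardware counters.
hw = {("pkg0", g1.rmids["pkg0"]): 4096, ("pkg1", g2.rmids["pkg1"]): 512}
print(g1.read("pkg0", hw), g1.read("pkg1", hw), g2.read("pkg1", hw))
```

With no hierarchy, a group never has to aggregate or borrow the RMIDs of other groups, and each value is reported per domain rather than folded into a system-wide total.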