linux-kernel - Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

Message-ID: <CABPqkBQW80CFY7PLjDO_EKRrr0TA+tu3zwoSU7tnL7DgdwV+Wg@mail.gmail.com>
Date:   Tue, 7 Feb 2017 00:08:09 -0800
From:   Stephane Eranian <eranian@...gle.com>
To:     "Luck, Tony" <tony.luck@...el.com>
Cc:     David Carrillo-Cisneros <davidcc@...gle.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
        "Shivappa, Vikas" <vikas.shivappa@...el.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        x86 <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
        Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        "Shankar, Ravi V" <ravi.v.shankar@...el.com>,
        "Yu, Fenghua" <fenghua.yu@...el.com>,
        "Kleen, Andi" <andi.kleen@...el.com>,
        "Anvin, H Peter" <h.peter.anvin@...el.com>
Subject: Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

Hi,

I wanted to take a few steps back and look at the overall goals for
cache monitoring.
From the various threads and discussions, my understanding is as follows.

I think the design must ensure that the following usage models can be monitored:
   - the allocations in your CAT partitions
   - the allocations from a task (inclusive of children tasks)
   - the allocations from a group of tasks (inclusive of children tasks)
   - the allocations from a CPU
   - the allocations from a group of CPUs

All cases but the first one (CAT) are natural usages, so I want to
describe the CAT case in more detail.
The goal, as I understand it, is to monitor what is going on inside
the CAT partition to detect whether it saturates or has room to
"breathe". Let's take a simple example.

Suppose we have a CAT group, cat1:

cat1: 20MB partition (CLOSID1)
    CPUs=CPU0,CPU1
    TASKs=PID20

There can only be one CLOSID active on a CPU at a time. The kernel
chooses to prioritize tasks over CPU when enforcing cases with multiple
CLOSIDs.

Let's review how this works for cat1 and, for each scenario, look at
whether the kernel enforces the cache partition:

 1. ENFORCED: PIDx with no CLOSID runs on CPU0 or CPU1
 2. NOT ENFORCED: PIDx with CLOSIDx (x!=1) runs on CPU0, CPU1
 3. ENFORCED: PID20 runs with CLOSID1 on CPU0, CPU1
 4. ENFORCED: PID20 runs with CLOSID1 on CPUx (x!=0,1) with CPU CLOSIDx (x!=1)
 5. ENFORCED: PID20 runs with CLOSID1 on CPUx (x!=0,1) with no CLOSID
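
The five scenarios above follow from one rule: a task CLOSID, if set,
wins over the CPU CLOSID. Here is a minimal sketch of that rule. It is
an illustration only, not the kernel code; the helper name
effective_closid and the CLOSID values 2 and 3 (standing in for
"x != 1") are assumptions made for the example.

```python
from typing import Optional

CAT1 = 1  # CLOSID1, the partition under discussion


def effective_closid(task_closid: Optional[int],
                     cpu_closid: Optional[int]) -> Optional[int]:
    """Return the CLOSID active on a CPU; None means 'no CLOSID assigned'.

    Assumed rule (from the text): the task CLOSID takes priority over
    the CPU CLOSID.
    """
    return task_closid if task_closid is not None else cpu_closid


# (scenario, task CLOSID, CPU CLOSID); 2 and 3 are hypothetical "x != 1" values
scenarios = [
    ("1. PIDx, no CLOSID, on CPU0/CPU1", None, CAT1),
    ("2. PIDx, CLOSID2, on CPU0/CPU1", 2, CAT1),
    ("3. PID20, CLOSID1, on CPU0/CPU1", CAT1, CAT1),
    ("4. PID20, CLOSID1, on CPUx with CPU CLOSID3", CAT1, 3),
    ("5. PID20, CLOSID1, on CPUx with no CLOSID", CAT1, None),
]

# cat1 is enforced exactly when the effective CLOSID is CLOSID1
enforced = [effective_closid(task, cpu) == CAT1 for _, task, cpu in scenarios]

for (name, _, _), ok in zip(scenarios, enforced):
    print(("ENFORCED     " if ok else "NOT ENFORCED ") + name)
```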

Now, let's review how we could track the allocations done in cat1 using a single
RMID. There can only be one RMID active at a time per CPU. The kernel
chooses to prioritize tasks over CPU:

cat1: 20MB partition (CLOSID1, RMID1)
    CPUs=CPU0,CPU1
    TASKs=PID20

 1. MONITORED: PIDx with no RMID runs on CPU0 or CPU1
 2. NOT MONITORED: PIDx with RMIDx (x!=1) runs on CPU0, CPU1
 3. MONITORED: PID20 with RMID1 runs on CPU0, CPU1
 4. MONITORED: PID20 with RMID1 runs on CPUx (x!=0,1) with CPU RMIDx (x!=1)
 5. MONITORED: PID20 runs with RMID1 on CPUx (x!=0,1) with no RMID

To make sense to a user, the cases where the hardware monitors MUST be
the same as the cases where the hardware enforces the cache
partitioning.

Here we see that it works using a single RMID.
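
The requirement that monitoring and enforcement coincide can be checked
mechanically for the five cat1 scenarios. The sketch below assumes the
same "task wins over CPU" rule for both CLOSIDs and RMIDs; the helper
name pick and the IDs 2 and 3 (standing in for "x != 1") are
hypothetical.

```python
from typing import Optional


def pick(task_id: Optional[int], cpu_id: Optional[int]) -> Optional[int]:
    """Assumed priority rule: the task's ID wins, else fall back to the CPU's."""
    return task_id if task_id is not None else cpu_id


# One row per scenario: (task CLOSID, CPU CLOSID, task RMID, CPU RMID).
# cat1 owns CLOSID1 and RMID1; None means "not assigned".
cases = [
    (None, 1, None, 1),     # 1. PIDx with no IDs on CPU0/CPU1
    (2, 1, 2, 1),           # 2. PIDx with its own IDs on CPU0/CPU1
    (1, 1, 1, 1),           # 3. PID20 on CPU0/CPU1
    (1, 3, 1, 3),           # 4. PID20 on CPUx that has other IDs
    (1, None, 1, None),     # 5. PID20 on CPUx with no IDs
]

enforced = [pick(tc, cc) == 1 for tc, cc, _, _ in cases]
monitored = [pick(tr, cr) == 1 for _, _, tr, cr in cases]

# With a single RMID per group, the two columns line up case by case.
print(enforced == monitored)
```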

However doing so limits certain monitoring modes where a user might want to
get a breakdown per CPU of the allocations, such as with:
  $ perf stat -a -A -e llc_occupancy -R cat1
(where -R points to the monitoring group in rsrcfs). Here this mode would not be
possible because the two CPUs in the group share the same RMID.

Now let's take another scenario, and suppose you have two monitoring groups
as follows:

mon1: RMID1
    CPUs=CPU0,CPU1
mon2: RMID2
    TASKS=PID20

If PID20 runs on CPU0, then RMID2 is activated, and thus allocations
done by PID20 are not counted towards RMID1. There is a blind spot.

Whether or not this is a problem depends on the semantics exported by
the interface for CPU mode:
   1-Count all allocations from any tasks running on CPU
   2-Count all allocations from tasks which are NOT monitoring themselves

If the kernel chooses 1, then there is a blind spot and the measurement
is not as accurate as it could be because of the decision to use only one RMID.
But if the kernel chooses 2, then everything works fine with a single RMID.
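
To make the blind spot concrete, here is a toy model of mon1 and mon2.
It is only a sketch: the trace, PIDs 31 and 32, and the line counts are
made up, RMID 0 stands in for "unmonitored", and allocations are summed
without modelling evictions, which real occupancy counters do reflect.

```python
from collections import Counter

CPU_RMID = {0: 1, 1: 1}  # mon1: RMID1 on CPU0 and CPU1
TASK_RMID = {20: 2}      # mon2: RMID2 on PID20


def active_rmid(pid: int, cpu: int) -> int:
    """Task RMID wins over CPU RMID; 0 is a stand-in for 'unmonitored'."""
    if pid in TASK_RMID:
        return TASK_RMID[pid]
    return CPU_RMID.get(cpu, 0)


# Hypothetical trace: (pid, cpu, cache lines allocated)
trace = [(20, 0, 100), (31, 0, 40), (32, 1, 60), (20, 2, 10)]

# What the hardware would attribute to each RMID
hw = Counter()
for pid, cpu, lines in trace:
    hw[active_rmid(pid, cpu)] += lines

mon1_cpus = set(CPU_RMID)

# Semantics 1: all allocations from any task running on the CPUs
semantic1 = sum(n for pid, cpu, n in trace if cpu in mon1_cpus)

# Semantics 2: only tasks that are NOT monitoring themselves
semantic2 = sum(n for pid, cpu, n in trace
                if cpu in mon1_cpus and pid not in TASK_RMID)

print("RMID1 reads:", hw[1])
print("semantics 1 expects:", semantic1)
print("semantics 2 expects:", semantic2)
print("blind spot:", semantic1 - hw[1])
```

In this trace the RMID1 reading matches what semantics 2 expects, and
the gap against semantics 1 is exactly what PID20 allocated on CPU0.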

If the kernel treats occupancy monitoring like measuring cycles on a CPU, i.e.,
measuring any activity from any thread (choice 1), then a single RMID per group
does not work.

If the kernel treats occupancy monitoring like measuring cycles in a cgroup on a
CPU, i.e., measuring only when threads of the cgroup run on that CPU (choice 2),
then using a single RMID per group works.

Hope this helps clarify the usage model and design choices.