linux-kernel - Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

Message-ID: <20170203222449.GA12894@intel.com>
Date:   Fri, 3 Feb 2017 14:24:51 -0800
From:   "Luck, Tony" <tony.luck@...el.com>
To:     David Carrillo-Cisneros <davidcc@...gle.com>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
        "Shivappa, Vikas" <vikas.shivappa@...el.com>,
        Stephane Eranian <eranian@...gle.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        x86 <x86@...nel.org>, "hpa@...or.com" <hpa@...or.com>,
        Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        "Shankar, Ravi V" <ravi.v.shankar@...el.com>,
        "Yu, Fenghua" <fenghua.yu@...el.com>,
        "Kleen, Andi" <andi.kleen@...el.com>,
        "Anvin, H Peter" <h.peter.anvin@...el.com>
Subject: Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

On Fri, Feb 03, 2017 at 01:08:05PM -0800, David Carrillo-Cisneros wrote:
> On Fri, Feb 3, 2017 at 9:52 AM, Luck, Tony <tony.luck@...el.com> wrote:
> > On Thu, Feb 02, 2017 at 06:14:05PM -0800, David Carrillo-Cisneros wrote:
> >> If we tie allocation groups and monitoring groups, we are tying the
> >> meaning of CPUs and we'll have to choose between the CAT meaning or
> >> the perf meaning.
> >>
> >> Let's allow semantics that will allow perf like monitoring to
> >> eventually work, even if its not immediately supported.
> >
> > Would it work to make monitor groups be "task list only" or "cpu mask only"
> > (unlike control groups that allow mixing).
> 
> That works, but please don't use chmod. Make it explicit by the group
> position (i.e. mon/cpus/grpCPU1, mon/tasks/grpTasks1).

I had been thinking that after writing a PID to "tasks" we'd disallow
writes to "cpus". But it sounds nicer for the user to declare their
intention upfront. Counter-proposal in the naming war:

	.../monitor/bytask/{groupname}
	.../monitor/bycpu/{groupname}

> > Then the intel_rdt_sched_in() code could pick the RMID in ways that
> > give you the perf(1) meaning. I.e. if you create a monitor group and assign
> > some CPUs to it, then we will always load the RMID for that monitor group
> > when running on those cpus, regardless of what group(s) the current process
> > belongs to.  But if you didn't create any cpu-only monitor groups, then we'd
> > assign RMID using same rules as CLOSID (so measurements from a control group
> > would track allocation policies).
> 
> I think that's very confusing for the user. A group's observed
> behavior should be determined by its attributes and not change
> depending on how other groups are configured. Think on multiple users
> monitoring simultaneously.
> 
> >
> > We are already planning that creating monitor only groups will change
> > what is reported in the control group (e.g. you pull some tasks out of
> > the control group to monitor them separately, so the control group only
> > reports the tasks that you didn't move out for monitoring).
> 
> That's also confusing, and the work-around that Vikas proposed of two
> separate files to enumerate tasks (one for control and one for
> monitoring) breaks the concept of a task group.

There are some simple cases where we can make the data shown in the
original control group look the same. E.g. we move a few tasks over to a
/bytask/ group (or several groups if we want a very fine breakdown) and
then have the report from the control group sum the RMIDs from the monitor
groups and add to the total from the native RMID of the control group.
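
As a rough illustration of that summing, here is a user-space sketch. All
names in it (struct ctrl_group, read_rmid_count(), the sim_count[] table,
MAX_SPLIT_GROUPS) are made up for the example and are not taken from any
patch in this series; the table merely stands in for reading the hardware
counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SPLIT_GROUPS 8	/* hypothetical cap on split-off monitor groups */
#define NR_SIM_RMIDS	 64	/* size of the simulated counter table */

/* Simulated per-RMID counters; a stand-in for the real hardware read. */
static uint64_t sim_count[NR_SIM_RMIDS];

static uint64_t read_rmid_count(unsigned int rmid)
{
	assert(rmid < NR_SIM_RMIDS);
	return sim_count[rmid];
}

/* Hypothetical control group: its own RMID plus those of its monitor groups. */
struct ctrl_group {
	unsigned int native_rmid;
	unsigned int mon_rmid[MAX_SPLIT_GROUPS];
	size_t nr_mon;
};

/* Report for the control group = native RMID + every split-off monitor RMID. */
static uint64_t ctrl_group_report(const struct ctrl_group *cg)
{
	uint64_t total = read_rmid_count(cg->native_rmid);

	for (size_t i = 0; i < cg->nr_mon; i++)
		total += read_rmid_count(cg->mon_rmid[i]);
	return total;
}
```

Each read_rmid_count() call here is a separate read, so the total is only
as consistent as the time it takes to walk the list.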

But this falls apart if the user asks a single monitor group to monitor
tasks from multiple control groups.  Perhaps we could disallow this
(when we assign the first task to a monitor group, capture the CLOSID
and then only allow other tasks with the same CLOSID to be added ... unless
the group becomes empty, at which point we can latch onto a new CLOSID).
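
A minimal sketch of that latching rule, assuming a hypothetical
struct task_mon_group that records only the latched CLOSID and a task
count (neither the struct nor the helper names exist in the current code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical /bytask/ monitor group state, for illustration only. */
struct task_mon_group {
	unsigned int closid;	/* CLOSID latched from the first task added */
	unsigned int nr_tasks;	/* number of tasks currently in the group */
};

/*
 * Add a task to the group. The first task into an empty group sets the
 * CLOSID; later tasks are accepted only if their CLOSID matches.
 */
static bool mon_group_add_task(struct task_mon_group *g, unsigned int task_closid)
{
	if (g->nr_tasks == 0)
		g->closid = task_closid;	/* latch onto a (new) CLOSID */
	else if (g->closid != task_closid)
		return false;			/* task is from another control group */
	g->nr_tasks++;
	return true;
}

/* Remove a task; once the count drops to zero the group can re-latch. */
static void mon_group_remove_task(struct task_mon_group *g)
{
	assert(g->nr_tasks > 0);
	g->nr_tasks--;
}
```

A real implementation would also have to deal with a task being moved
to a different control group while it sits in a monitor group; the sketch
ignores that.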

/bycpu/ monitoring is very resource intensive if we have to preserve
the control group reports. We'd need to allocate MAXCLOSID[1] RMIDs for
each group so that we can keep separate counts for tasks from each
control group that run on our CPUs and then sum them to report the
/bycpu/ data (instead of just one RMID, and no math).  This also
puts more memory references into the sched_in path while we
figure out which RMID to load into PQR_ASSOC.
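
To make the cost concrete, a sketch of what a /bycpu/ group would have to
carry under that scheme. MAX_CLOSID is given an arbitrary value here (the
real limit is enumerated from the hardware), and struct cpu_mon_group and
both helpers are hypothetical names for the example:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CLOSID 16	/* illustrative value only */

/* Hypothetical /bycpu/ monitor group: one RMID per possible CLOSID. */
struct cpu_mon_group {
	unsigned int rmid[MAX_CLOSID];
};

/*
 * The extra lookup on the sched_in path: the RMID to load depends on
 * both the CPU's monitor group and the incoming task's CLOSID.
 */
static unsigned int bycpu_pick_rmid(const struct cpu_mon_group *g,
				    unsigned int closid)
{
	assert(closid < MAX_CLOSID);
	return g->rmid[closid];
}

/*
 * The /bycpu/ report is the sum over all of the group's RMIDs.
 * count_by_rmid[] stands in for reading the counter of each RMID.
 */
static uint64_t bycpu_report(const struct cpu_mon_group *g,
			     const uint64_t *count_by_rmid)
{
	uint64_t total = 0;

	for (unsigned int c = 0; c < MAX_CLOSID; c++)
		total += count_by_rmid[g->rmid[c]];
	return total;
}
```

Compare with the cheap case of one RMID per group, where sched_in loads
a single fixed value and the report is a single read.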

I'd want to warn the user in the Documentation that splitting off
too many monitor groups from a control group will result in less
than stellar accuracy in reporting as the kernel cannot read
multiple RMIDs atomically and data is changing between reads.

> I know the present implementation scope is limited, so you could:
>   - support 1) and/or 2) only
>   - do a simple RMID management (e.g. same RMID all packages, allocate
> RMID on creation or fail)
>   - do the custom fs based tool that Vikas mentioned instead of using
> perf_event_open (if it's somehow easier to build and maintain a new
> tool rather than reuse perf(1) ).
> 
> any or all of the above are fine. But please don't choose group
> semantics that will prevent us from eventually supporting full
> perf-like behavior or that we already know explode in user's face.

I'm trying hard to find a way to do this. I.e. start with a patch
that has limited capabilities and needs a custom tool, but can later
grow into something that meets your needs.

-Tony

[1] Lazy allocation means discovering in the middle of a context switch
that there is no free RMID ... not willing to go there.