linux-kernel - Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.20.1701171806220.3495@nanos>
Date:   Tue, 17 Jan 2017 18:31:11 +0100 (CET)
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Vikas Shivappa <vikas.shivappa@...ux.intel.com>
cc:     vikas.shivappa@...el.com, davidcc@...gle.com, eranian@...gle.com,
        linux-kernel@...r.kernel.org, x86@...nel.org, hpa@...or.com,
        mingo@...nel.org, peterz@...radead.org, ravi.v.shankar@...el.com,
        tony.luck@...el.com, fenghua.yu@...el.com, andi.kleen@...el.com,
        h.peter.anvin@...el.com
Subject: Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

On Fri, 6 Jan 2017, Vikas Shivappa wrote:
> Cqm(cache quality monitoring) is part of Intel RDT(resource director
> technology) which enables monitoring and controlling of processor shared
> resources via MSR interface.

We know that already. No need for advertising this over and over.

> Below are the issues and the fixes we attempt-

Emphasis on attempt.

> - Issue(1): Inaccurate data for per package data, systemwide. Just prints
> zeros or arbitrary numbers.
> 
> Fix: Patches fix this by just throwing an error if the mode is not supported. 
> The modes supported is task monitoring and cgroup monitoring. 
> Also the per package
> data for say socket x is returned with the -C <cpu on socketx> -G cgrpy option.
> The systemwide data can be looked up by monitoring root cgroup.

Fine. That just lacks any comment in the implementation. Otherwise I would
not have asked the question about cpu monitoring. Though I fundamentaly
hate the idea of requiring cgroups for this to work.

If I just want to look at CPU X why on earth do I have to set up all that
cgroup muck? Just because your main focus is cgroups?

> - Issue(2): RMIDs are global and dont scale with more packages and hence
> also run out of RMIDs very soon.
> 
> Fix: Support per pkg RMIDs hence scale better with more
> packages, and get more RMIDs to use and use when needed (ie when tasks
> are actually scheduled on the package).

That's fine, just the implementation is completely broken 

> - Issue(5): CAT and cqm/mbm write the same PQR_ASSOC_MSR seperately
> Fix: Integrate the sched in code and write the PQR_MSR only once every switch_to 

Brilliant stuff that. I bet that the seperate MSR writes which we have now
are actually faster even in the worst case of two writes.

> [PATCH 02/12] x86/cqm: Remove cqm recycling/conflict handling

That's the only undisputed and probably correct patch in that whole series.

Executive summary: The whole series except 2/12 is a complete trainwreck.

Major issues:

- Patch split is random and leads to non compilable and non functional
  steps aside of creating a unreviewable mess.

- The core code is updated with random hooks which claim to be a generic
  framework but are completely hardcoded for that particular CQM case.

  The introduced hooks are neither fully documented (semantical and
  functional) nor justified why they are required.

- The code quality is horrible in coding style, technical correctness and
  design.

So how can we proceed from here?

I want to see small patch series which change a certain aspect of the
implementation and only that. They need to be split in preparatory changes,
which refactor code or add new interfaces to the core code, and the actual
implementation in CQM/MBM.

Each of the patches must have a proper changelog explaining the WHY and if
required the semantical properties of a new interface.

Each of the patches must compile without warnigns/errors and be fully
functional vs. the changes they do.

Any attempt to resend this disaster as a whole will be NACKed w/o even
looking at it. I've wasted enough time with this wholesale approach in the
past and it did not work out. I'm not going to play that wasteful game
over and over again.

Thanks,

	tglx