linux-kernel - Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA software controller

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.DEB.2.10.1804031118410.27913@vshiva-Udesk>
Date:   Tue, 3 Apr 2018 11:45:22 -0700 (PDT)
From:   Shivappa Vikas <vikas.shivappa@...el.com>
To:     Thomas Gleixner <tglx@...utronix.de>
cc:     Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
        vikas.shivappa@...el.com, tony.luck@...el.com,
        ravi.v.shankar@...el.com, fenghua.yu@...el.com,
        sai.praneeth.prakhya@...el.com, x86@...nel.org, hpa@...or.com,
        linux-kernel@...r.kernel.org, ak@...ux.intel.com
Subject: Re: [PATCH 1/6] x86/intel_rdt/mba_sc: Add documentation for MBA
 software controller



On Tue, 3 Apr 2018, Thomas Gleixner wrote:

> On Thu, 29 Mar 2018, Vikas Shivappa wrote:
>> +Memory bandwidth(b/w) in MegaBytes
>> +----------------------------------
>> +
>> +Memory bandwidth is a core specific mechanism which means that when the
>> +Memory b/w percentage is specified in the schemata per package it
>> +actually is applied on a per core basis via IA32_MBA_THRTL_MSR
>> +interface. This may lead to confusion in scenarios below:
>> +
>> +1. User may not see increase in actual b/w when percentage values are
>> +   increased:
>> +
>> +This can occur when aggregate L2 external b/w is more than L3 external
>> +b/w. Consider an SKL SKU with 24 cores on a package and where L2
>> +external b/w is 10GBps (hence aggregate L2 external b/w is 240GBps) and
>> +L3 external b/w is 100GBps. Now a workload with '20 threads, having 50%
>> +b/w, each consuming 5GBps' consumes the max L3 b/w of 100GBps although
>> +the percentage value specified is only 50% << 100%. Hence increasing
>> +the b/w percentage will not yeild any more b/w. This is because
>> +although the L2 external b/w still has capacity, the L3 external b/w
>> +is fully used. Also note that this would be dependent on number of
>> +cores the benchmark is run on.
>> +
>> +2. Same b/w percentage may mean different actual b/w depending on # of
>> +   threads:
>> +
>> +For the same SKU in #1, a 'single thread, with 10% b/w' and '4 thread,
>> +with 10% b/w' can consume upto 10GBps and 40GBps although they have same
>> +percentage b/w of 10%. This is simply because as threads start using
>> +more cores in an rdtgroup, the actual b/w may increase or vary although
>> +user specified b/w percentage is same.
>> +
>> +In order to mitigate this and make the interface more user friendly, we
>> +can let the user specify the max bandwidth per rdtgroup in bytes(or mega
>> +bytes). The kernel underneath would use a software feedback mechanism or
>> +a "Software Controller" which reads the actual b/w using MBM counters
>> +and adjust the memowy bandwidth percentages to ensure the "actual b/w
>> +< user b/w".
>> +
>> +The legacy behaviour is default and user can switch to the "MBA software
>> +controller" mode using a mount option 'mba_MB'.
>
> You said above:
>
>> This may lead to confusion in scenarios below:
>
> Reading the blurb after that creates even more confusion than being
> helpful.
>
> First of all this information should not be under the section 'Memory
> bandwidth in MB/s'.
>
> Also please write bandwidth. The weird acronym b/w (band per width???) is
> really not increasing legibility.

Ok will fix and add a seperate section.

>
> What you really want is a general section about memory bandwidth allocation
> where you explain the technical background in purely technical terms w/o
> fairy tale mode. Technical descriptions have to be factual and not
> 'could/may/would'.
>
> If I decode the above correctly then the current percentage based
> implementation was buggy from the very beginning in several ways.
>
> Now the obvious question which is in no way answered by the cover letter is
> why the current percentage based implementation cannot be fixed and we need
> some feedback driven magic to achieve that. I assume you spent some brain
> cycles on that question, so it would be really helpful if you shared that.
>
> If I understand it correctly then the problem is that the throttling
> mechanism is per core and affects the L2 external bandwidth.
>
>  Is this really per core? What about hyper threads. Both threads have that
>  MSR. How is that working?

It is per core mechanism. On hyperthreads, it just takes the lowest bandwidth 
among the thread siblings. We have the below to explain the same - i can add 
more description if needed

"The bandwidth throttling is a core specific mechanism on some of Intel
SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth."

>
> The L2 external bandwidth is higher than the L3 external bandwidth.
>
>  Is there any information available from CPUID or whatever source which
>  allows us to retrieve the bandwidth ratio or the absolute maximum
>  bandwidth per level?

There is no information in cpuid on the bandwidth available. Also we have seen 
from our experiments that the increase is not perfectly linear (delta bandwidth 
increase from 30% to 40% may not be same as 70% to 80%). So we currently 
dynamically caliberate this delta for the software controller.

>
> What's also missing from your explanation is how that feedback loop behaves
> under different workloads.
>
>  Is this assuming that the involved threads/cpus actually try to utilize
>  the bandwidth completely?

No, the feedback loop only guarentees that the usage will not exceed what the 
user specifies as max bandwidth. If it is using below the max value it does not 
matter how much less it is using.

>
>  What happens if the threads/cpus are only using a small set because they
>  are idle or their computations are mostly cache local and do not need
>  external bandwidth? Looking at the implementation I don't see how that is
>  taken into account.

The feedback only kicks into action if a rdtgroup uses more bandwidth than the 
max specified by the user. I specified that it is always "ensure the "actual b/w
354 < user b/w" " and can add more explanation on these scenarios.

Also note that we are using the MBM counters for this feedback loop. Now that 
the interface is much more useful because we have the same rdtgroup that is 
being monitored and controlled. (vs. if we had the perf mbm the group of threads 
in resctrl mba and in mbm could be different and would be hard to measure what 
the threads/cpus in the resctrl are using). So the resctrl being used for both 
of these is a requirement for this and we always check that.

Thanks,
Vikas

>
> Thanks,
>
> 	tglx
>
>
>
>
>
>
>