linux-kernel - Re: [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid

Message-ID: <c8d85eae-e291-99a6-509c-94c41514ac16@arm.com>
Date:   Wed, 8 Mar 2023 17:45:35 +0000
From:   James Morse <james.morse@....com>
To:     Peter Newman <peternewman@...gle.com>
Cc:     x86@...nel.org, linux-kernel@...r.kernel.org,
        Fenghua Yu <fenghua.yu@...el.com>,
        Reinette Chatre <reinette.chatre@...el.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        H Peter Anvin <hpa@...or.com>,
        Babu Moger <Babu.Moger@....com>,
        shameerali.kolothum.thodi@...wei.com,
        D Scott Phillips OS <scott@...amperecomputing.com>,
        carl@...amperecomputing.com, lcherian@...vell.com,
        bobo.shaobowang@...wei.com, tan.shaopeng@...itsu.com,
        xingxin.hx@...nanolis.org, baolin.wang@...ux.alibaba.com,
        Jamie Iles <quic_jiles@...cinc.com>,
        Xin Hao <xhao@...ux.alibaba.com>,
        Stephane Eranian <eranian@...gle.com>
Subject: Re: [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to
 sleep

Hi Peter,

On 06/03/2023 13:14, Peter Newman wrote:
> On Mon, Mar 6, 2023 at 12:34 PM James Morse <james.morse@....com> wrote:
>> On 23/01/2023 15:33, Peter Newman wrote:
>>> On Fri, Jan 13, 2023 at 6:56 PM James Morse <james.morse@....com> wrote:
>>>> MPAM's cache occupancy counters can take a little while to settle once
>>>> the monitor has been configured. The maximum settling time is described
>>>> to the driver via a firmware table. The value could be large enough
>>>> that it makes sense to sleep.
>>>
>>> Would it be easier to return an error when reading the occupancy count
>>> too soon after configuration? On Intel it is already normal for counter
>>> reads to fail on newly-allocated RMIDs.
>>
>> For x86, you have as many counters as there are RMIDs, so there is no issue just accessing
>> the counter.
> 
> I should have said AMD instead of Intel, because their implementations
> have far fewer counters than RMIDs.

Right, I had assumed Intel and AMD behaved the same way here.


>> With MPAM there may be as few as 1 monitor for the CSU (cache storage utilisation)
>> counter, which needs to be multiplexed between different PARTID to find the cache
>> occupancy (This works for CSU because it's a stable count; it doesn't work for the
>> bandwidth monitors).
>> On such a platform the monitor needs to be allocated and programmed before it reads a
>> value for a particular PARTID/CLOSID. If you had two threads trying to read the same
>> counter, they could interleave perfectly to prevent either thread managing to read a value.
>> The 'not ready' time is advertised in a firmware table, and the driver will wait at most
>> that long before giving up and returning an error.

> Likewise, on AMD, a repeating sequence of tasks which are LRU in terms
> of counter -> RMID allocation could prevent RMID event reads from ever
> returning a value.

Isn't that a terrible user-space interface? "If someone else is reading a similar file,
neither of you makes progress".


> The main difference I see with MPAM is that software allocates the
> counters instead of hardware, but the overall behavior sounds the same.
> 
> The part I object to is introducing the wait to the counter read because
> existing software already expects an immediate error when reading a
> counter too soon. To produce accurate data, these readings are usually
> read at intervals of multiple seconds.


> Instead, when configuring a counter, could you use the firmware table
> value to compute the time when the counter will next be valid and return
> errors on read requests received before that?

The monitor might get re-allocated, re-programmed and become valid for a different
PARTID+PMG in the meantime. I don't think these things should remain allocated over a
return to user-space. Without doing that, I don't see how we can return early and still
make progress.

How long should a CSU monitor remain allocated to a PARTID+PMG? Currently it's only for the
duration of the read() syscall on the file.
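
To illustrate the concern, here is a toy user-space model (not kernel code) of a single
CSU monitor shared by two readers under a "fail the read until the settle time has passed"
scheme. Every name here (toy_monitor, toy_read_early_error, TOY_SETTLE_TIME) is
hypothetical, as is the choice of -EBUSY, and the model assumes the monitor is simply
re-programmed for whoever reads next:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model: one CSU monitor multiplexed between PARTID+PMG pairs. */
struct toy_monitor {
	int      owner;     /* PARTID+PMG currently programmed, -1 if none */
	uint64_t valid_at;  /* time at which the count stops being "not ready" */
};

/* Stands in for the maximum settling time described by the firmware table. */
#define TOY_SETTLE_TIME 10

/*
 * "Return early" read: if the monitor is not programmed for @id, re-program
 * it and fail. If it is, fail until the settle time has elapsed.
 */
static int toy_read_early_error(struct toy_monitor *m, int id, uint64_t now)
{
	assert(m);

	if (m->owner != id) {
		m->owner = id;
		m->valid_at = now + TOY_SETTLE_TIME;
		return -EBUSY;
	}
	if (now < m->valid_at)
		return -EBUSY;
	return 0;
}

/* Two readers strictly alternate; returns how many reads succeeded. */
static int toy_interleave(int rounds)
{
	struct toy_monitor m = { .owner = -1, .valid_at = 0 };
	uint64_t now = 0;
	int ok = 0;

	for (int i = 0; i < rounds; i++) {
		ok += toy_read_early_error(&m, 1, now) == 0;
		now += TOY_SETTLE_TIME;
		ok += toy_read_early_error(&m, 2, now) == 0;
		now += TOY_SETTLE_TIME;
	}
	return ok;
}
```

In this model a lone reader succeeds on its second attempt, but two readers that alternate
each find the monitor programmed for the other and never get a value, however long they
wait between reads.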


The problem with MPAM is that too much of it is optional. This particular behaviour only
applies to CSU monitors (llc_occupancy), and then only if your hardware designers didn't
have a value to hand when the monitor is programmed, and need to do a scan of the cache to
come up with a result. The retry is only triggered if the hardware sets NRDY.
This is also only necessary if there aren't enough monitors for every RMID/(PARTID*PMG) to
have its own. If there were enough, the monitors could be allocated and programmed at
startup, and the whole thing becomes cheaper to access.
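
As a rough sketch of the bounded wait described above, again as a user-space toy rather
than the driver's code: the read polls a "not ready" flag, sleeps between polls, and gives
up once the firmware-described maximum has elapsed. The names (toy_csu_mon, toy_csu_read),
the tick-based time and the -EBUSY return value are all assumptions made for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a CSU monitor that needs time to scan the cache. */
struct toy_csu_mon {
	uint32_t scan_ticks_left;  /* ticks until the hardware would clear NRDY */
	uint64_t value;            /* occupancy reported once the scan is done */
};

static bool toy_nrdy(const struct toy_csu_mon *m)
{
	return m->scan_ticks_left > 0;
}

/* Models sleeping for one tick while the hardware keeps scanning. */
static void toy_sleep_one_tick(struct toy_csu_mon *m)
{
	if (m->scan_ticks_left)
		m->scan_ticks_left--;
}

/*
 * Wait for NRDY to clear, but for no more than @max_nrdy_ticks (the value
 * that would come from the firmware table). @val is only written on success.
 */
static int toy_csu_read(struct toy_csu_mon *m, uint32_t max_nrdy_ticks,
			uint64_t *val)
{
	uint32_t waited = 0;

	assert(m && val);

	while (toy_nrdy(m)) {
		if (waited >= max_nrdy_ticks)
			return -EBUSY;
		toy_sleep_one_tick(m);
		waited++;
	}

	*val = m->value;
	return 0;
}
```

If the hardware has a value to hand, scan_ticks_left is zero and the loop body never runs,
so the wait costs nothing on platforms that don't set NRDY.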

If a hardware platform needs time to do this, it has to come from somewhere. I don't think
maintaining an epoch-based list of which monitor secretly belongs to a PARTID+PMG, in the
hope user-space reads the file again 'quickly enough', is going to be maintainable.

If returning errors early is an important use-case, I can suggest ensuring the MPAM driver
allocates CSU monitors up-front if there are enough (today it only does this for MBWU
monitors). We then have to hope that folk who care about this also build hardware
platforms with enough monitors.
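
The "if there are enough" check could look something like the following sketch. It
assumes one monitor is needed per PARTID*PMG combination, as discussed above. The helper
names and the flat partid * num_pmg + pmg indexing are hypothetical and not taken from
the driver:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True if every PARTID*PMG combination can own a monitor permanently, so the
 * monitors could be allocated and programmed once at startup.
 */
static bool toy_can_preallocate(size_t num_monitors, size_t num_partid,
				size_t num_pmg)
{
	/* Treat an overflowing product as "not enough monitors". */
	if (num_pmg != 0 && num_partid > SIZE_MAX / num_pmg)
		return false;

	return num_monitors >= num_partid * num_pmg;
}

/* With static allocation, a monitor can be found by index, no allocator needed. */
static size_t toy_monitor_index(size_t partid, size_t pmg, size_t num_pmg)
{
	assert(pmg < num_pmg);

	return partid * num_pmg + pmg;
}
```

When the check fails, the monitors still have to be multiplexed, and the wait on read
remains.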


Thanks,

James