Message-ID: <CALPaoCgEaT2oax35ezRydUZwL9bMmMFFr2wRqPe4VYAnEQ-GGg@mail.gmail.com>
Date:   Thu, 9 Mar 2023 14:41:08 +0100
From:   Peter Newman <peternewman@...gle.com>
To:     James Morse <james.morse@....com>
Cc:     x86@...nel.org, linux-kernel@...r.kernel.org,
        Fenghua Yu <fenghua.yu@...el.com>,
        Reinette Chatre <reinette.chatre@...el.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        H Peter Anvin <hpa@...or.com>,
        Babu Moger <Babu.Moger@....com>,
        shameerali.kolothum.thodi@...wei.com,
        D Scott Phillips OS <scott@...amperecomputing.com>,
        carl@...amperecomputing.com, lcherian@...vell.com,
        bobo.shaobowang@...wei.com, tan.shaopeng@...itsu.com,
        xingxin.hx@...nanolis.org, baolin.wang@...ux.alibaba.com,
        Jamie Iles <quic_jiles@...cinc.com>,
        Xin Hao <xhao@...ux.alibaba.com>,
        Stephane Eranian <eranian@...gle.com>
Subject: Re: [PATCH v2 09/18] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep

Hi James,

On Wed, Mar 8, 2023 at 6:45 PM James Morse <james.morse@....com> wrote:
> On 06/03/2023 13:14, Peter Newman wrote:
> > On Mon, Mar 6, 2023 at 12:34 PM James Morse <james.morse@....com> wrote:
>
> > Instead, when configuring a counter, could you use the firmware table
> > value to compute the time when the counter will next be valid and return
> > errors on read requests received before that?
>
> The monitor might get re-allocated, re-programmed and become valid for a different
> PARTID+PMG in the meantime. I don't think these things should remain allocated over a
> return to user-space. Without doing that I don't see how we can return-early and make
> progress.
>
> How long should a CSU monitor remain allocated to a PARTID+PMG? Currently it's only for the
> duration of the read() syscall on the file.
>
>
> The problem with MPAM is too much of it is optional. This particular behaviour is only
> valid for CSU monitors, (llc_occupancy), and then, only if your hardware designers didn't
> have a value to hand when the monitor is programmed, and need to do a scan of the cache to
> come up with a result. The retry is only triggered if the hardware sets NRDY.
> This is also only necessary if there aren't enough monitors for every RMID/(PARTID*PMG) to
> have its own. If there were enough, the monitors could be allocated and programmed at
> startup, and the whole thing becomes cheaper to access.
>
> If a hardware platform needs time to do this, it has to come from somewhere. I don't think
> maintaining an epoch based list of which monitor secretly belongs to a PARTID+PMG in the
> hope user-space reads the file again 'quickly enough' is going to be maintainable.
>
> If returning errors early is an important use-case, I can suggest ensuring the MPAM driver
> allocates CSU monitors up-front if there are enough (today it only does this for MBWU
> monitors). We then have to hope that folk who care about this also build hardware
> platforms with enough monitors.

Thanks, this makes more sense now. Since CSU data isn't cumulative, I
see how synchronously collecting a snapshot is useful in this situation.
I was more concerned about understanding the need for the new behavior
than getting errors back quickly.

However, I do want to be sure that MBWU counters will never be silently
deallocated because we will never be able to trust the data unless we
know that the counter has been watching the group's tasks for the
entirety of the measurement window.
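
To make the concern concrete, here is a rough userspace-style sketch. Every
name in it is made up for illustration and none of it comes from the MPAM
driver or resctrl. It assumes each monitor carries an allocation generation
that is bumped on reassignment, so a reader can notice that the counter
changed hands mid-window and reject the sample instead of trusting it:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of one MBWU monitor; not a real MPAM driver structure. */
struct mon_model {
	uint32_t partid_pmg;	/* group currently being watched */
	uint64_t alloc_gen;	/* bumped every time the monitor is reassigned */
	uint64_t bytes;		/* byte count since the last (re)allocation */
};

/* Snapshot taken at the start of a measurement window. */
struct mon_sample {
	uint64_t alloc_gen;
	uint64_t bytes;
};

/* Reassign the monitor to another group: the old count is lost. */
static void mon_model_realloc(struct mon_model *m, uint32_t partid_pmg)
{
	m->partid_pmg = partid_pmg;
	m->alloc_gen++;
	m->bytes = 0;
}

/* Record the monitor state at the start of the window. */
static void mon_model_begin(const struct mon_model *m, struct mon_sample *s)
{
	s->alloc_gen = m->alloc_gen;
	s->bytes = m->bytes;
}

/*
 * Return 0 and the bytes counted during the window, or -ESTALE if the
 * monitor was reallocated at any point since mon_model_begin().
 */
static int mon_model_end(const struct mon_model *m,
			 const struct mon_sample *s, uint64_t *delta)
{
	if (m->alloc_gen != s->alloc_gen)
		return -ESTALE;

	*delta = m->bytes - s->bytes;
	return 0;
}
```

The point is only that a reallocation has to be visible to whoever is
computing the bandwidth, whether through an error like this or by never
reallocating at all.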

Unlike on AMD, MPAM allows software to control which PARTID+PMG the
monitoring hardware is watching. Could we instead make the user
explicitly request the mbm_{total,local}_bytes events be allocated to
monitoring groups after creating them? Or even just allocating the
events on monitoring group creation only when they're available could
also be marginally usable if a single user agent is managing rdtgroups.
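
As a rough illustration of the second option, here is a userspace-style
sketch of "allocate on group creation only when available". All of the
names and the pool size are hypothetical, not taken from the MPAM driver.
A group either owns an MBWU monitor for its whole lifetime or reports the
mbm events as unavailable:

```c
#include <errno.h>
#include <stdbool.h>

#define NUM_MBWU_MONITORS 2	/* hypothetical: deliberately small */

/* Hypothetical pool of MBWU monitors; not a real MPAM driver structure. */
struct mbwu_pool {
	bool in_use[NUM_MBWU_MONITORS];
};

/* Hypothetical monitoring group state. */
struct mon_group_model {
	int mbwu_idx;	/* index of the owned monitor, or -1 if none */
};

/* Group creation: take a monitor if one is free, else leave it unassigned. */
static void mon_group_create(struct mbwu_pool *pool,
			     struct mon_group_model *grp)
{
	grp->mbwu_idx = -1;
	for (int i = 0; i < NUM_MBWU_MONITORS; i++) {
		if (!pool->in_use[i]) {
			pool->in_use[i] = true;
			grp->mbwu_idx = i;
			return;
		}
	}
}

/* The monitor is only released when the group itself goes away. */
static void mon_group_destroy(struct mbwu_pool *pool,
			      struct mon_group_model *grp)
{
	if (grp->mbwu_idx >= 0)
		pool->in_use[grp->mbwu_idx] = false;
	grp->mbwu_idx = -1;
}

/* Reading mbm_*_bytes: fail outright rather than report a partial count. */
static int mon_group_mbm_readable(const struct mon_group_model *grp)
{
	return grp->mbwu_idx >= 0 ? 0 : -ENOENT;
}
```

With this shape a group created after the pool is exhausted never gets
bandwidth counts, which is why it would only be marginally usable, but a
group that does get a monitor can trust every value it reads.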

Thanks!
-Peter