linux-kernel - Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

Message-ID: <874kf11yoz.ffs@nanos.tec.linutronix.de>
Date:   Mon, 17 May 2021 11:45:00 +0200
From:   Thomas Gleixner <tglx@...utronix.de>
To:     Len Brown <lenb@...nel.org>, Borislav Petkov <bp@...en8.de>
Cc:     Willy Tarreau <w@....eu>, Andy Lutomirski <luto@...nel.org>,
        Florian Weimer <fweimer@...hat.com>,
        "Bae\, Chang Seok" <chang.seok.bae@...el.com>,
        Dave Hansen <dave.hansen@...el.com>, X86 ML <x86@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>, linux-api@...r.kernel.org,
        "libc-alpha\@sourceware.org" <libc-alpha@...rceware.org>,
        Rich Felker <dalias@...c.org>, Kyle Huey <me@...ehuey.com>,
        Keno Fischer <keno@...iacomputing.com>,
        Arjan van de Ven <arjan@...ux.intel.com>
Subject: Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

Len,

On Sun, May 02 2021 at 11:27, Len Brown wrote:
> Here is how it works:
>
> 1. The kernel boots and sees the feature in CPUID.
>
> 2. If the kernel supports that feature, it sets XCR0[feature].
>
>     For some features, there may be a bunch of kernel support,
>     while simple features may require only state save/restore.
>
> 2a.  If the kernel doesn't support the feature, XCR0[feature] remains cleared.
>
> 3. user-space sees the feature in CPUID
>
> 4. user-space checks for the feature via xgetbv[XCR0]
>
> 5. If the feature is enabled in XCR0, the user happily uses it.
>
>     For AMX, Linux implements "transparent first use"
>     so that it doesn't have to allocate 8KB context switch
>     buffers for tasks that don't actually use AMX.
>     It does this by arming XFD for all tasks, and taking a #NM
>     to allocate a context switch buffer only for those tasks
>     that actually execute AMX instructions.
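
For reference, steps 3 to 5 of the quoted flow can be sketched in user
space roughly as follows. This is only an illustration, assuming GCC or
Clang on x86; the helper names (read_xcr0, cpu_has_amx_tile,
amx_enabled_in_xcr0) are made up for this sketch and do not come from
any library. The bit positions are the architectural ones: CPUID.1:ECX
bit 27 (OSXSAVE), CPUID.(EAX=7,ECX=0):EDX bit 24 (AMX-TILE), and XCR0
bits 17/18 (XTILECFG/XTILEDATA).

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

#define XCR0_XTILECFG  (1ULL << 17)   /* AMX tile configuration state */
#define XCR0_XTILEDATA (1ULL << 18)   /* AMX tile data state */

/* Hypothetical helper: read XCR0, or return 0 if XGETBV is not usable. */
static uint64_t read_xcr0(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    uint32_t lo, hi;

    /* XGETBV faults unless the OS has set CR4.OSXSAVE (CPUID.1:ECX[27]). */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx) || !(ecx & (1u << 27)))
        return 0;

    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
#else
    return 0;   /* not an x86 build */
#endif
}

/* Step 3: does CPUID enumerate AMX-TILE? */
static bool cpu_has_amx_tile(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;
    return (edx & (1u << 24)) != 0;
#else
    return false;
#endif
}

/* Steps 4 and 5: has the kernel enabled both AMX state components in XCR0? */
static bool amx_enabled_in_xcr0(void)
{
    const uint64_t need = XCR0_XTILECFG | XCR0_XTILEDATA;

    return cpu_has_amx_tile() && (read_xcr0() & need) == need;
}
```

Under the quoted proposal, a true result from amx_enabled_in_xcr0()
would be all that user space checks before executing AMX instructions.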

I thought more about this and it's absolutely the wrong way to go for
several reasons.

AMX (or whatever comes next) is nothing other than a device and it
should simply be treated as such. The fact that it is not exposed
via a driver and a device node does not matter at all.

Not doing so requires this awkward buffer allocation via #NM with
all its downsides; it's just wrong to force the kernel to manage
resources of a user space task without being able to return a proper
error code.

It also prevents fine-grained control over access to this
functionality. As AMX is clearly a shared resource which is not per HT
thread (maybe not even per core) and it has an impact on power/frequency,
it is important to be able to restrict access at per-process/cgroup
scope.

Having a proper interface (syscall, prctl) which user space can use to
ask for permission and allocation of the necessary buffer(s) clearly
avoids these downsides and provides the necessary mechanisms for proper
control and failure handling.

It's not the end of the world if something which wants to utilize this
has to issue a syscall during detection. It does not matter whether
that's a library or just the application code itself.

That's a one-off operation and every involved entity can cache the
result in TLS.
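
A minimal sketch of what such a one-off request with a TLS-cached result
could look like from user space. The request code and feature number
below mirror the arch_prctl() permission request that later Linux
kernels provide for AMX tile data; they are not part of this mail and
are defined locally as an assumption, since installed headers may lack
them. The helper and variable names are made up for illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* for syscall() */
#endif
#include <errno.h>
#include <stdbool.h>

#if defined(__linux__) && defined(__x86_64__)
#include <sys/syscall.h>
#include <unistd.h>
#endif

/* Assumed values, defined here in case the system headers lack them. */
#define ARCH_REQ_XCOMP_PERM 0x1023
#define XFEATURE_XTILEDATA  18

/* Per-thread cache: 0 = not asked yet, 1 = granted, -1 = refused. */
static _Thread_local int amx_perm_cache;
/* errno of the failed request, or 0 on success. */
static _Thread_local int amx_perm_errno;

/*
 * Hypothetical helper: ask the kernel once for permission to use AMX
 * and remember the answer. Unlike a #NM-driven allocation, a refusal
 * comes back as an ordinary error code the caller can act on.
 */
static bool amx_request_permission(void)
{
    if (amx_perm_cache != 0)
        return amx_perm_cache > 0;

#if defined(__linux__) && defined(__x86_64__)
    long rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM,
                      XFEATURE_XTILEDATA);
    amx_perm_errno = (rc == 0) ? 0 : errno;
#else
    amx_perm_errno = ENOSYS;     /* no such interface on this platform */
#endif

    amx_perm_cache = (amx_perm_errno == 0) ? 1 : -1;
    return amx_perm_cache > 0;
}
```

A library would call amx_request_permission() during feature detection
and fall back to a non-AMX code path when it returns false; later calls
from the same thread only read the cached value.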

AVX512 has already proven that XSTATE management is fragile and
error-prone, so we really have to stop this instead of creating yet
another half-baked solution.

Thanks,

        tglx