linux-kernel - Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJvTdKkYp+zP_9tna6YsrOz2_nmEUDLJaL_i-SNog0m2T9wZ=Q@mail.gmail.com>
Date:   Thu, 20 May 2021 11:35:53 -0400
From:   Len Brown <lenb@...nel.org>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     Borislav Petkov <bp@...en8.de>, Willy Tarreau <w@....eu>,
        Andy Lutomirski <luto@...nel.org>,
        Florian Weimer <fweimer@...hat.com>,
        "Bae, Chang Seok" <chang.seok.bae@...el.com>,
        Dave Hansen <dave.hansen@...el.com>, X86 ML <x86@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux API <linux-api@...r.kernel.org>,
        "libc-alpha@...rceware.org" <libc-alpha@...rceware.org>,
        Rich Felker <dalias@...c.org>, Kyle Huey <me@...ehuey.com>,
        Keno Fischer <keno@...iacomputing.com>,
        Arjan van de Ven <arjan@...ux.intel.com>
Subject: Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

Hi Thomas,

On Mon, May 17, 2021 at 5:45 AM Thomas Gleixner <tglx@...utronix.de> wrote:

> AMX (or whatever comes next) is nothing else than a device and it
> just should be treated as such. The fact that it is not exposed
> via a driver and a device node does not matter at all.

TMM registers are part of the CPU architectural state.
If TMM registers exist for one logical CPU, they exist for all CPUs --
including HT siblings.
(Intel supports only homogeneous ISA)

Ditto for the instructions that access and operate on TMM registers.

One can reasonably predict, that like Intel has done for all other registers,
there will be future instructions added to the ISA to operate on TMM registers,
including in combination with non-TMM registers that are also part
of the architectural state.

It is an unfortunate word choice that some documentation calls the
TMUL instruction
an "accelerator".  It isn't.  It is part of the ISA, like any other instruction.

I agree that a device interface may make sense for real accelerators
that don't run x86 instructions, I don't see long term viability for attempting
to carve a sub-set of x86 instructions into a device, particularly when
the set of instructions will continue to evolve.

> Not doing so requires this awkward buffer allocation issue via #NM with
> all it's downsides; it's just wrong to force the kernel to manage
> resources of a user space task without being able to return a proper
> error code.

The hardware #NM support for fault on first use is a feature to allow the OS
to optimize space so that pages do not have to be dedicated to back registers
unless/until they are actually used.

There is absolutely no requirement that a particular
OS take advantage of that feature.  If you think that this optimization is
awkward, we can easily delete/disable it and simply statically allocate buffers
for all threads at initialization time.  Though you'll have to convince me
why the word "awkward" applies, rather than "elegant".

Regarding error return for allocation failures.

I'm not familiar with the use-case where vmalloc would be likely to fail today,
and I'd be interested if anybody can detail that use-case.
But even if there is none today, I grate that Linux could evolve to make vmalloc
fail in the future, and so an interface to reqeust pre-allocation of buffers
is reasonable insurance.  Chang has implemented this prctl in v5
of the TMUL patch series.

> It also prevents fine grained control over access to this
> functionality. As AMX is clearly a shared resource which is not per HT
> thread (maybe not even per core) and it has impact on power/frequency it
> is important to be able to restrict access on a per process/cgroup
> scope.

AMX is analogous to the multiplier used by AVX-512.
The architectural state must exist on every CPU, including HT siblings.
Today, the HT siblings share the same execution unit,
and I have no reason to expect that will change.

I thought we already addressed the FUD surrounding power/frequency.
As with every kind of instruction -- those that use
more power will leave less power for their peers, and there is a mechanism
to track that power budget.  I acknowledge that the mechanism was overly
conservative and slow to recover in initial AVX-512 systems, and that issue
persists even with the latest publically available hardware today.
I acknowledge that you do not trust that Intel has addressed this
(for both AVX-512 and AMX) in the first hardware that supports AMX.

> Having a proper interface (syscall, prctl) which user space can use to
> ask for permission and allocation of the necessary buffer(s) is clearly
> avoiding the downsides and provides the necessary mechanisms for proper
> control and failure handling.
>
> It's not the end of the world if something which wants to utilize this
> has do issue a syscall during detection. It does not matter whether
> that's a library or just the application code itself.
>
> That's a one off operation and every involved entity can cache the
> result in TLS.
>
> AVX512 has already proven that XSTATE management is fragile and error
> prone, so we really have to stop this instead of creating yet another
> half baked solution.

We fixed the glibc ABI issue.  It is available now and production
release is this summer.
Yes, it should have been addressed when AVX-512 was deployed.

thanks
Len Brown, Intel Open Source Technology Center