Message-ID: <b3c16707-c2d3-15fe-cac7-027ef022cfb7@intel.com>
Date: Mon, 3 May 2021 07:14:43 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: Florian Weimer <fweimer@...hat.com>
Cc: Len Brown <lenb@...nel.org>, Borislav Petkov <bp@...en8.de>,
Willy Tarreau <w@....eu>, Andy Lutomirski <luto@...nel.org>,
"Bae, Chang Seok" <chang.seok.bae@...el.com>,
X86 ML <x86@...nel.org>, LKML <linux-kernel@...r.kernel.org>,
linux-abi@...r.kernel.org,
"libc-alpha@...rceware.org" <libc-alpha@...rceware.org>,
Rich Felker <dalias@...c.org>, Kyle Huey <me@...ehuey.com>,
Keno Fischer <keno@...iacomputing.com>
Subject: Re: Candidate Linux ABI for Intel AMX and hypothetical new related
features
On 5/3/21 6:47 AM, Florian Weimer wrote:
> * Dave Hansen:
>
>> On 5/2/21 10:18 PM, Florian Weimer wrote:
>>>> 5. If the feature is enabled in XCR0, the user happily uses it.
>>>>
>>>> For AMX, Linux implements "transparent first use"
>>>> so that it doesn't have to allocate 8KB context switch
>>>> buffers for tasks that don't actually use AMX.
>>>> It does this by arming XFD for all tasks, and taking a #NM
>>>> to allocate a context switch buffer only for those tasks
>>>> that actually execute AMX instructions.
>>> What happens if the kernel cannot allocate that additional context
>>> switch buffer?
>> Well, it's vmalloc()'d and currently smaller than the kernel stack,
>> which is also vmalloc()'d. While it can theoretically fail, if it
>> happens you have bigger problems on your hands.
> Not sure if I understand.
>
> Is your position that the kernel should terminate processes if it runs
> out of memory instead of reporting proper errors, even if memory overcommit
> is disabled?
I assume you mean sysctl vm.overcommit_memory=2 by "overcommit is disabled"?
> When this flag is 2, the kernel uses a "never overcommit"
> policy that attempts to prevent any overcommit of memory.
> Note that user_reserve_kbytes affects this policy.
Note the "attempts".
So, no, the kernel should not be terminating processes when it runs out
of memory. It *attempts* not to do that. What you are seeing here with
a demand-based XSAVE buffer allocation driven by a #NM fault is the
*addition* of a case where those attempts can fail, not the creation of
the first one.
The addition of this case doesn't bother me because I don't think it
will ultimately be visible to end users.
If I'm wrong, and our HPC friends who are so enamored with
"vm.overcommit_memory=2" end up seeing lots of SIGSEGVs where they would
rather see syscall failures, there's an easy solution: disable first-use
detection. Stop dynamically allocating XSAVE buffers on faults.
Actually, if we don't have a tunable or boot parameter for that now, we
should add one.
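
Something along these lines would be enough. Again, purely illustrative;
the parameter name and the variable it sets are invented here:

	/* Sketch of a possible opt-out knob -- the "xstate.no_dynamic"
	 * parameter name and the flag it sets are hypothetical. */
	#include <linux/cache.h>
	#include <linux/init.h>
	#include <linux/kernel.h>
	#include <linux/types.h>

	/* When true, never arm XFD: allocate the full XSAVE buffer for
	 * every task up front instead of on first use, so there is no
	 * allocation in the #NM path that can fail. */
	bool xstate_disable_first_use __ro_after_init;

	static int __init xstate_no_dynamic_setup(char *str)
	{
		xstate_disable_first_use = true;
		pr_info("x86/fpu: dynamic xstate allocation disabled\n");
		return 1;
	}
	__setup("xstate.no_dynamic", xstate_no_dynamic_setup);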