Message-ID: <b3c16707-c2d3-15fe-cac7-027ef022cfb7@intel.com>
Date: Mon, 3 May 2021 07:14:43 -0700
From: Dave Hansen <dave.hansen@...el.com>
To: Florian Weimer <fweimer@...hat.com>
Cc: Len Brown <lenb@...nel.org>, Borislav Petkov <bp@...en8.de>,
Willy Tarreau <w@....eu>, Andy Lutomirski <luto@...nel.org>,
"Bae, Chang Seok" <chang.seok.bae@...el.com>,
X86 ML <x86@...nel.org>, LKML <linux-kernel@...r.kernel.org>,
linux-abi@...r.kernel.org,
"libc-alpha@...rceware.org" <libc-alpha@...rceware.org>,
Rich Felker <dalias@...c.org>, Kyle Huey <me@...ehuey.com>,
Keno Fischer <keno@...iacomputing.com>
Subject: Re: Candidate Linux ABI for Intel AMX and hypothetical new related
features
On 5/3/21 6:47 AM, Florian Weimer wrote:
> * Dave Hansen:
>
>> On 5/2/21 10:18 PM, Florian Weimer wrote:
>>>> 5. If the feature is enabled in XCR0, the user happily uses it.
>>>>
>>>> For AMX, Linux implements "transparent first use"
>>>> so that it doesn't have to allocate 8KB context switch
>>>> buffers for tasks that don't actually use AMX.
>>>> It does this by arming XFD for all tasks, and taking a #NM
>>>> to allocate a context switch buffer only for those tasks
>>>> that actually execute AMX instructions.
>>> What happens if the kernel cannot allocate that additional context
>>> switch buffer?
>> Well, it's vmalloc()'d and currently smaller than the kernel stack,
>> which is also vmalloc()'d. While it can theoretically fail, if it
>> happens you have bigger problems on your hands.
> Not sure if I understand.
>
> Is your position that the kernel should terminate processes if it runs
> out of memory instead of reporting proper errors, even if memory overcommit
> is disabled?
I assume you mean sysctl vm.overcommit_memory=2 by "overcommit is disabled"?
> When this flag is 2, the kernel uses a "never overcommit"
> policy that attempts to prevent any overcommit of memory.
> Note that user_reserve_kbytes affects this policy.
Note the "attempts".
So, no, the kernel should not be terminating processes when it runs out
of memory. It *attempts* not to do that. What you are seeing here with
a demand-based XSAVE buffer allocation driven by a #NM fault is the
*addition* of a case where those attempts can fail, not the creation of
the first one.
The addition of this case doesn't bother me because I don't think it
will ultimately be visible to end users.
If I'm wrong, and our HPC friends who are so enamored with
"vm.overcommit_memory=2" end up seeing lots of SIGSEGVs where they would
rather see syscall failures, there's an easy solution: disable first-use
detection. Stop dynamically allocating XSAVE buffers on faults.
Actually, if we don't have a tunable or boot parameter for that now, we
should add one.
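
Something along these lines would be enough. Again, purely illustrative;
the parameter name and the variable it sets are invented here:

	/* Sketch of a possible opt-out knob -- the "xstate.no_dynamic"
	 * parameter name and the flag it sets are hypothetical. */
	#include <linux/cache.h>
	#include <linux/init.h>
	#include <linux/kernel.h>
	#include <linux/types.h>

	/* When true, never arm XFD: allocate the full XSAVE buffer for
	 * every task up front instead of on first use, so there is no
	 * allocation in the #NM path that can fail. */
	bool xstate_disable_first_use __ro_after_init;

	static int __init xstate_no_dynamic_setup(char *str)
	{
		xstate_disable_first_use = true;
		pr_info("x86/fpu: dynamic xstate allocation disabled\n");
		return 1;
	}
	__setup("xstate.no_dynamic", xstate_no_dynamic_setup);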