[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALCETrW8u5rUsZvoo5t4YtC+4boBVcK__-srtA1+-YX06QYD1w@mail.gmail.com>
Date: Wed, 14 Oct 2020 09:10:24 -0700
From: Andy Lutomirski <luto@...capital.net>
To: Dave Hansen <dave.hansen@...el.com>
Cc: "Brown, Len" <len.brown@...el.com>,
Andy Lutomirski <luto@...nel.org>,
"Bae, Chang Seok" <chang.seok.bae@...el.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...nel.org>, Borislav Petkov <bp@...e.de>,
X86 ML <x86@...nel.org>, "Liu, Jing2" <jing2.liu@...el.com>,
"Shankar, Ravi V" <ravi.v.shankar@...el.com>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area
on first use
On Tue, Oct 13, 2020 at 11:03 PM Dave Hansen <dave.hansen@...el.com> wrote:
>
> On 10/13/20 6:11 PM, Andy Lutomirski wrote:
> > I have no problem with vmalloc(), but I do have a problem with
> > vfree() due to the IPIs that result. We need a cache or something.
>
> This sounds like the kind of thing we should just build into vmalloc()
> instead of having a bunch of callers implement their own caches. It
> shouldn't be too much of a challenge to have vmalloc() keep a cacheline
> or two of stats about common vmalloc() sizes and keep some pools around.
>
> It's not going to be hard to implement caches to reduce vfree()-induced
> churn, but I'm having a hard time imaging that it'll have anywhere near
> the benefits that it did for stacks. Tasks fundamentally come and go a
> *lot*, and those paths are hot.
>
> Tasks who go to the trouble to populate 8k or 64k of register state
> fundamentally *can't* come and go frequently. We also (probably) don't
> have to worry about AMX tasks doing fork()/exec() pairs and putting
> pressure on vmalloc(). Before an app can call out to library functions
> to do the fork, they've got to save the state off themselves and likely
> get it back to the init state. The fork() code can tell AMX is in the
> init state and decline to allocate actual space for the child.
>
> > I have to say: this mechanism is awful. Can we get away with skipping
> > the dynamic XSAVES mess entirely? What if we instead allocate
> > however much space we need as an array of pages and have one percpu
> > contiguous region. To save, we XSAVE(S or C) just the AMX state to
> > the percpu area and then copy it. To restore, we do the inverse. Or
> > would this kill the modified optimization and thus be horrible?
>
> Actually, I think the modified optimization would survive such a scheme:
>
> * copy page array into percpu area
> * XRSTORS from percpu area, modified optimization tuple is saved
> * run userspace
> * XSAVES back to percpu area. tuple matches, modified optimization
> is still in play
> * copy percpu area back to page array
>
> Since the XRSTORS->XSAVES pair is both done to the percpu area, the
> XSAVE tracking hardware never knows it isn't working on the "canonical"
> buffer (the page array).
I was suggesting something a little bit different. We'd keep XMM,
YMM, ZMM, etc state stored exactly the way we do now and, for
AMX-using tasks, we would save the AMX state in an entirely separate
buffer. This way the pain of having a variable xstate layout is
confined just to AMX tasks.
I'm okay with vmalloc() too, but I do think we need to deal with the
various corner cases like allocation failing.
Powered by blists - more mailing lists