Message-ID: <877d5rt0uz.fsf@oracle.com>
Date: Thu, 09 Jun 2022 00:54:20 +0530
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Ankur Arora <ankur.a.arora@...cle.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Linux-MM <linux-mm@...ck.org>,
the arch/x86 maintainers <x86@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Mike Kravetz <mike.kravetz@...cle.com>,
Ingo Molnar <mingo@...nel.org>,
Andrew Lutomirski <luto@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>,
Borislav Petkov <bp@...en8.de>,
Peter Zijlstra <peterz@...radead.org>,
Andi Kleen <ak@...ux.intel.com>, Arnd Bergmann <arnd@...db.de>,
Jason Gunthorpe <jgg@...dia.com>, jon.grimm@....com,
Boris Ostrovsky <boris.ostrovsky@...cle.com>,
Konrad Rzeszutek Wilk <konrad.wilk@...cle.com>,
Joao Martins <joao.m.martins@...cle.com>
Subject: Re: [PATCH v3 00/21] huge page clearing optimizations
Linus Torvalds <torvalds@...ux-foundation.org> writes:
> On Tue, Jun 7, 2022 at 8:10 AM Ankur Arora <ankur.a.arora@...cle.com> wrote:
>>
>> For highmem and page-at-a-time archs we would need to keep some
>> of the same optimizations (via the common clear/copy_user_highpages().)
>
> Yeah, I guess that we could keep the code for legacy use, just make
> the existing code be marked __weak so that it can be ignored for any
> further work.
>
> IOW, the first patch might be to just add that __weak to
> 'clear_huge_page()' and 'copy_user_huge_page()'.
>
> At that point, any architecture can just say "I will implement my own
> versions of these two".
>
> In fact, you can start with just one or the other, which is probably
> nicer to keep the patch series smaller (ie do the simpler
> "clear_huge_page()" first).
Agreed. Best way to iron out all the kinks too.
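So, concretely, something like this (just a sketch of the __weak split;
the file placement is illustrative):

	/* mm/memory.c: the existing generic implementation, now marked
	 * __weak so an architecture can override it. */
	void __weak clear_huge_page(struct page *page,
				    unsigned long addr_hint,
				    unsigned int pages_per_huge_page)
	{
		/* current page-at-a-time code stays here */
	}

	/* arch/x86/mm/: the strong definition silently wins over the
	 * __weak generic one at link time. */
	void clear_huge_page(struct page *page,
			     unsigned long addr_hint,
			     unsigned int pages_per_huge_page)
	{
		/* x86-64 specific version */
	}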
> I worry a bit about the insanity of the "gigantic" pages, and the
> mem_map_next() games it plays, but that code is from 2008 and I really
> doubt it makes any sense to keep around at least for x86. The source
> of that abomination is powerpc, and I do not think that whole issue
> with MAX_ORDER_NR_PAGES makes any difference on x86, at least.
Looking at it now, it seems to stem from the wide range of
MAX_ZONEORDER values that powerpc allows? It made my head hurt, so I
didn't try to figure it out in detail.
But, even on x86, AFAICT gigantic pages could straddle section
boundaries (MAX_SECTION_BITS)? An arch-specific clear_huge_page()
could, however, handle 1GB pages via some kind of static loop over
(30 - MAX_SECTION_BITS) sections.
I'm a little fuzzy on the CONFIG_SPARSEMEM_EXTREME and
!SPARSEMEM_VMEMMAP configs, but I think we should be able to avoid the
pfn_to_page()/page_to_pfn() lookups entirely, or at least keep them out
of the inner loop. Hand-wavy sketch of what I mean below.
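(Entirely hypothetical -- clear_pages_nt() is made up, and 30 is just
log2(1GB)):

	static void clear_gigantic_page(struct page *page)
	{
		unsigned long pfn = page_to_pfn(page);
		unsigned long sec_pages = 1UL << (MAX_SECTION_BITS - PAGE_SHIFT);
		unsigned int i, nr_secs = 1U << (30 - MAX_SECTION_BITS);

		/* One pfn_to_page() per section; none in the inner loop. */
		for (i = 0; i < nr_secs; i++, pfn += sec_pages) {
			void *addr = page_address(pfn_to_page(pfn));

			clear_pages_nt(addr, sec_pages);	/* hypothetical */
		}
	}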
> It most definitely makes no sense when there is no highmem issues, and
> all those 'struct page' games should just be deleted (or at least
> relegated entirely to that "legacy __weak function" case so that sane
> situations don't need to care).
Yeah, I'm hoping to do exactly that.
> For that same HIGHMEM reason it's probably a good idea to limit the
> new case just to x86-64, and leave 32-bit x86 behind.
Ack that.
>> Right. Or doing the whole contiguous area in one or a few chunks,
>> and then touching the faulting cachelines towards the end.
>
> Yeah, just add a prefetch for the 'addr_hint' part at the end.
>
>> > Maybe an architecture could do even more radical things like "let's
>> > just 'rep stos' for the whole area, but set a special thread flag that
>> > causes the interrupt return to break it up on return to kernel space".
>> > IOW, the "latency fix" might not even be about chunking it up, it
>> > might look more like our exception handling thing.
>>
>> When I was thinking about this earlier, I had a vague inkling of
>> setting a thread flag and defer writes to the last few cachelines
>> for just before returning to user-space.
>> Can you elaborate a little about what you are describing above?
>
> So 'process_huge_page()' (and the gigantic page case) does three very
> different things:
>
> (a) that page chunking for highmem accesses
>
> (b) the page access _ordering_ for the cache hinting reasons
>
> (c) the chunking for _latency_ reasons
>
> and I think all of them are basically "bad legacy" reasons, in that
>
> (a) HIGHMEM doesn't exist on sane architectures that we care about these days
>
> (b) the cache hinting ordering makes no sense if you do non-temporal
> accesses (and might then be replaced by a possible "prefetch" at the
> end)
>
> (c) the latency reasons still *do* exist, but only with PREEMPT_NONE
>
> So what I was alluding to with those "more radical approaches" was
> that PREEMPT_NONE case: we would probably still want to chunk things
> up for latency reasons and do that "cond_resched()" in between
> chunks.
Thanks for the detail. That helps.
> Now, there are alternatives here:
>
> (a) only override that existing disgusting (but tested) function when
> both CONFIG_HIGHMEM and CONFIG_PREEMPT_NONE are false
>
> (b) do something like this:
>
> void clear_huge_page(struct page *page,
>                      unsigned long addr_hint,
>                      unsigned int pages_per_huge_page)
> {
>         void *addr = page_address(page);
> #ifdef CONFIG_PREEMPT_NONE
>         for (int i = 0; i < pages_per_huge_page; i++) {
>                 clear_page(addr + i * PAGE_SIZE);
>                 cond_resched();
>         }
> #else
>         nontemporal_clear_big_area(addr, PAGE_SIZE * pages_per_huge_page);
>         prefetch(addr_hint);
> #endif
> }
We'll need a preemption point there for CONFIG_PREEMPT_VOLUNTARY as
well, right? Either way, as you said earlier, we could chunk things up
in bigger units than a single page.
(In the numbers I had posted earlier, chunking in units of up to 1MB
gave ~25% higher clearing bandwidth. I don't think the microcode setup
costs are that high, but I don't have a good explanation for the
difference.) Rough sketch of what I have in mind below.
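(clear_pages() being a hypothetical multi-page clearing primitive):

	#define CLEAR_CHUNK_PAGES	(SZ_1M / PAGE_SIZE)

	void clear_huge_page(struct page *page, unsigned long addr_hint,
			     unsigned int pages_per_huge_page)
	{
		void *addr = page_address(page);
		unsigned int i, n;

		for (i = 0; i < pages_per_huge_page; i += n) {
			n = min_t(unsigned int, CLEAR_CHUNK_PAGES,
				  pages_per_huge_page - i);
			clear_pages(addr + i * PAGE_SIZE, n); /* hypothetical */
			cond_resched();	/* preemption point every 1MB */
		}
		prefetch(addr_hint);
	}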
> or (c), do that "more radical approach", where you do something like this:
>
> void clear_huge_page(struct page *page,
>                      unsigned long addr_hint,
>                      unsigned int pages_per_huge_page)
> {
>         void *addr = page_address(page);
>
>         set_thread_flag(TIF_PREEMPT_ME);
>         nontemporal_clear_big_area(addr, PAGE_SIZE * pages_per_huge_page);
>         clear_thread_flag(TIF_PREEMPT_ME);
>         prefetch(addr_hint);
> }
>
> and then you make the "return to kernel mode" check the TIF_PREEMPT_ME
> case and actually force preemption even on a non-preempt kernel.
I like this one. I'll try out (b) and (c) and see how the code shakes
out.
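For (c), I imagine the return-to-kernel-mode path would grow a check
along these lines (TIF_PREEMPT_ME being the hypothetical flag above):

	/* Called on return to kernel mode, e.g. from irqentry_exit():
	 * force a preemption point even on a !PREEMPT kernel while
	 * TIF_PREEMPT_ME is set. */
	if (test_thread_flag(TIF_PREEMPT_ME) && need_resched())
		preempt_schedule_irq();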
Just one minor point -- seems to me that the choice of non-temporal vs
temporal might have to be based on a hint to clear_huge_page().
Basically, the non-temporal path is only faster when
(pages_per_huge_page * PAGE_SIZE) exceeds the LLC size.
So in the page-fault path it might make sense to use the temporal path
(except for gigantic pages); in the prefault path, non-temporal might
be better.
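Something like this for the check (llc_size() is made up; on x86 it
could presumably come from cacheinfo):

	static bool want_nontemporal(unsigned int pages_per_huge_page)
	{
		/* Non-temporal stores only win when the clear is bigger
		 * than the last-level cache. */
		return (unsigned long)pages_per_huge_page * PAGE_SIZE >
		       llc_size();
	}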
Thanks
--
ankur