Message-ID: <fcbb726d-fe6a-8fe4-20fd-6a10cdef007a@intel.com>
Date: Fri, 17 Dec 2021 10:26:38 -0800
From: Dave Hansen <dave.hansen@...el.com>
To: Nikita Yushchenko <nikita.yushchenko@...tuozzo.com>,
Will Deacon <will@...nel.org>,
"Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Nick Piggin <npiggin@...il.com>,
Peter Zijlstra <peterz@...radead.org>,
Catalin Marinas <catalin.marinas@....com>,
Heiko Carstens <hca@...ux.ibm.com>,
Vasily Gorbik <gor@...ux.ibm.com>,
Christian Borntraeger <borntraeger@...ux.ibm.com>,
"David S. Miller" <davem@...emloft.net>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Arnd Bergmann <arnd@...db.de>
Cc: x86@...nel.org, linux-kernel@...r.kernel.org,
linux-arch@...r.kernel.org, linux-mm@...ck.org,
linuxppc-dev@...ts.ozlabs.org, linux-s390@...r.kernel.org,
sparclinux@...r.kernel.org, kernel@...nvz.org
Subject: Re: [PATCH/RFC] mm: add and use batched version of
__tlb_remove_table()
On 12/17/21 12:19 AM, Nikita Yushchenko wrote:
> When batched page table freeing via struct mmu_table_batch is used, the
> final freeing in __tlb_remove_table_free() executes a loop, calling
> arch hook __tlb_remove_table() to free each table individually.
>
> Shift that loop down to archs. This allows archs to optimize it, by
> freeing multiple tables in a single release_pages() call. This is
> faster than individual put_page() calls, especially with memcg
> accounting enabled.
Could we quantify "faster"? There's a non-trivial amount of code being
added here and it would be nice to back it up with some cold, hard numbers.
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -95,11 +95,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
>
> static void __tlb_remove_table_free(struct mmu_table_batch *batch)
> {
> - int i;
> -
> - for (i = 0; i < batch->nr; i++)
> - __tlb_remove_table(batch->tables[i]);
> -
> + __tlb_remove_tables(batch->tables, batch->nr);
> free_page((unsigned long)batch);
> }
This leaves a single call-site for __tlb_remove_table():
> static void tlb_remove_table_one(void *table)
> {
> tlb_remove_table_sync_one();
> __tlb_remove_table(table);
> }
Is that worth it, or could it just be:
__tlb_remove_tables(&table, 1);
?
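To make the question concrete, here is a small userspace model of the single-table path going through the batched hook. Everything prefixed mock_ is a stand-in invented for illustration, not a kernel definition; the real hooks free page-table pages, while this version only counts calls.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins; none of these are the kernel's definitions. */
static int freed_count;	/* tables the mock hook "freed" */
static int sync_count;	/* times the mock sync step ran */

/* Stand-in for tlb_remove_table_sync_one(). */
static void mock_sync_one(void)
{
	sync_count++;
}

/* Stand-in for the proposed batched hook: handles nr tables per call. */
static void mock_remove_tables(void **tables, int nr)
{
	assert(tables != NULL && nr >= 0);
	for (int i = 0; i < nr; i++)
		if (tables[i])
			freed_count++;
}

/* The single-table path, expressed as a batch of one. */
static void mock_remove_table_one(void *table)
{
	mock_sync_one();
	mock_remove_tables(&table, 1);
}
```

In this model the one-at-a-time hook has no remaining callers, which is the point of the question above.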
> -void free_pages_and_swap_cache(struct page **pages, int nr)
> +static void __free_pages_and_swap_cache(struct page **pages, int nr,
> + bool do_lru)
> {
> - struct page **pagep = pages;
> int i;
>
> - lru_add_drain();
> + if (do_lru)
> + lru_add_drain();
> for (i = 0; i < nr; i++)
> - free_swap_cache(pagep[i]);
> - release_pages(pagep, nr);
> + free_swap_cache(pages[i]);
> + release_pages(pages, nr);
> +}
> +
> +void free_pages_and_swap_cache(struct page **pages, int nr)
> +{
> + __free_pages_and_swap_cache(pages, nr, true);
> +}
> +
> +void free_pages_and_swap_cache_nolru(struct page **pages, int nr)
> +{
> + __free_pages_and_swap_cache(pages, nr, false);
> }
This went unmentioned in the changelog. But, it seems like there's a
specific optimization here. In the existing code,
free_pages_and_swap_cache() is wasteful if no page in pages[] is on the
LRU. It doesn't need the lru_add_drain().
Any code that knows it is freeing all non-LRU pages can call
free_pages_and_swap_cache_nolru() which should perform better than
free_pages_and_swap_cache().
Should we add this to the for loop in __free_pages_and_swap_cache()?
	for (i = 0; i < nr; i++) {
		if (!do_lru)
			VM_WARN_ON_ONCE_PAGE(PageLRU(pages[i]),
					     pages[i]);
		free_swap_cache(...);
	}
But, even more than that, do all the architectures even need the
free_swap_cache()? PageSwapCache() will always be false on x86, which
makes the loop kinda silly. x86 could, for instance, just do:
static inline void __tlb_remove_tables(void **tables, int nr)
{
	release_pages((struct page **)tables, nr);
}
I _think_ this will work everywhere that has whole pages as page tables.
Taking that one step further, what if we only had one generic:
static inline void tlb_remove_tables(void **tables, int nr)
{
#ifdef ARCH_PAGE_TABLES_ARE_FULL_PAGE
	release_pages((struct page **)tables, nr);
#else
	arch_tlb_remove_tables(tables, nr);
#endif
}
Architectures that set ARCH_PAGE_TABLES_ARE_FULL_PAGE (or whatever)
don't need to implement __tlb_remove_table() at all *and* can do
release_pages() directly.
This avoids all the confusion with the swap cache and LRU naming.
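For illustration, the compile-time dispatch can be modeled in userspace as below. The mock_ functions and counters are invented for this sketch, and ARCH_PAGE_TABLES_ARE_FULL_PAGE is still only a placeholder name; it is defined here to model an architecture whose page tables are whole pages.

```c
#include <assert.h>

/* Placeholder name from the discussion; defined here to model an
 * architecture whose page tables are whole pages. */
#define ARCH_PAGE_TABLES_ARE_FULL_PAGE 1

int generic_calls;	/* calls that took the generic path */
int arch_calls;		/* calls that took the arch-specific path */
int tables_freed;	/* total tables handed to either path */

/* Stand-in for release_pages(): only counts. */
void mock_release_pages(void **pages, int nr)
{
	(void)pages;
	generic_calls++;
	tables_freed += nr;
}

/* Stand-in for an arch hook used when tables are not whole pages. */
void mock_arch_tlb_remove_tables(void **tables, int nr)
{
	(void)tables;
	arch_calls++;
	tables_freed += nr;
}

/* One generic entry point; the arch only picks the branch. */
static inline void mock_tlb_remove_tables(void **tables, int nr)
{
	assert(nr >= 0);
#ifdef ARCH_PAGE_TABLES_ARE_FULL_PAGE
	mock_release_pages(tables, nr);
#else
	mock_arch_tlb_remove_tables(tables, nr);
#endif
}
```

With the macro defined, the arch hook is never reached, so such an architecture would not have to provide one.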