Message-ID: <c1ff17e2-902d-87e6-3c1d-fc5db2428b69@linux.ibm.com>
Date: Tue, 17 Sep 2019 08:26:01 +0530
From: "Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>
To: Leonardo Bras <leonardo@...ux.ibm.com>,
linuxppc-dev@...ts.ozlabs.org, linux-kernel@...r.kernel.org
Cc: Benjamin Herrenschmidt <benh@...nel.crashing.org>,
Paul Mackerras <paulus@...ba.org>,
Michael Ellerman <mpe@...erman.id.au>,
Christophe Leroy <christophe.leroy@....fr>,
Andrew Morton <akpm@...ux-foundation.org>,
Mike Rapoport <rppt@...ux.ibm.com>,
Nicholas Piggin <npiggin@...il.com>,
Thomas Gleixner <tglx@...utronix.de>,
Jason Gunthorpe <jgg@...pe.ca>,
Vlastimil Babka <vbabka@...e.cz>,
Dan Williams <dan.j.williams@...el.com>,
Christoph Lameter <cl@...ux.com>,
"Tobin C. Harding" <tobin@...nel.org>,
Jann Horn <jannh@...gle.com>,
Jesper Dangaard Brouer <brouer@...hat.com>,
Souptick Joarder <jrdr.linux@...il.com>,
Ralph Campbell <rcampbell@...dia.com>,
Andrey Ryabinin <aryabinin@...tuozzo.com>,
Davidlohr Bueso <dave@...olabs.net>
Subject: Re: [PATCH 1/1] powerpc: mm: Check if serialize_against_pte_lookup()
really needs to run
On 9/17/19 2:25 AM, Leonardo Bras wrote:
> If a process (qemu) with a lot of CPUs (128) tries to munmap() a large chunk
> of memory (496GB) mapped with THP, it takes an average of 275 seconds,
> which can cause a lot of problems for the workload (in the qemu case, the
> guest will lock up for this time).
>
> Trying to find the source of this bug, I found out most of this time is
> spent in serialize_against_pte_lookup(). This function will take a lot of
> time in smp_call_function_many() if there are more than a couple of CPUs
> running the user process. Since it has to happen for every THP mapped, it
> will take a very long time for large amounts of memory.
>
> According to the docs, serialize_against_pte_lookup() is needed in order to
> prevent the pmd_t to pte_t casting inside find_current_mm_pte() from
> happening concurrently with the next part of the functions it's called in.
> It does so by calling do_nothing() on each CPU in mm->cpu_bitmap[].
>
> So, as far as I could understand, if there is no find_current_mm_pte()
> running, there is no need to call serialize_against_pte_lookup().
>
> So, to avoid the cost of running serialize_against_pte_lookup(), I propose
> a counter that keeps track of how many find_current_mm_pte() calls are
> currently running, and if there are none, just skip smp_call_function_many().
>
> On my workload (qemu), I could see munmap's time reduction from 275 seconds
> to 418ms.
>
> Signed-off-by: Leonardo Bras <leonardo@...ux.ibm.com>
We could possibly avoid that serialize for a full task exit unmap, i.e.,
when tlb->fullmm == 1. But that won't help the Qemu case, because it
does an unmap of the guest RAM range, for which tlb->fullmm != 1.
>
> ---
> I need more experienced people's help in order to understand if this is
> really a valid improvement, and if mm_struct is the best place to put such
> counter.
>
> Thanks!
> ---
> arch/powerpc/include/asm/pte-walk.h | 3 +++
> arch/powerpc/mm/book3s64/pgtable.c | 2 ++
> include/linux/mm_types.h | 1 +
> 3 files changed, 6 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/pte-walk.h b/arch/powerpc/include/asm/pte-walk.h
> index 33fa5dd8ee6a..3b82cb3bd563 100644
> --- a/arch/powerpc/include/asm/pte-walk.h
> +++ b/arch/powerpc/include/asm/pte-walk.h
> @@ -40,6 +40,8 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
> {
> pte_t *pte;
>
> + atomic64_inc(&current->mm->lockless_ptlookup_count);
> +
> VM_WARN(!arch_irqs_disabled(), "%s called with irq enabled\n", __func__);
> VM_WARN(pgdir != current->mm->pgd,
> "%s lock less page table lookup called on wrong mm\n", __func__);
> @@ -53,6 +55,7 @@ static inline pte_t *find_current_mm_pte(pgd_t *pgdir, unsigned long ea,
> if (hshift)
> WARN_ON(*hshift);
> #endif
> + atomic64_dec(&current->mm->lockless_ptlookup_count);
> return pte;
> }
That is not correct. We need to keep the count updated across the
local_irq_disable()/local_irq_enable(). That is, we should most likely
have a variant like start_lockless_pgtbl_walk()/end_lockless_pgtbl_walk().
Also, there is hash_page_mm(), which we need to make sure is protected
w.r.t. pmd collapse.
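
To illustrate why the count has to span the whole lockless walk rather than
just the lookup call, here is a rough userspace sketch using C11 atomics. It
is only a model under stated assumptions: struct mm_model, mm_model_init(),
the ipi_broadcasts field and serialize_against_pte_lookup_model() are
hypothetical stand-ins, and the start/end helpers are not existing kernel
functions. In the kernel, the start/end pair would presumably sit next to
the irq disable/enable around the walk.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for per-mm state; not the kernel's mm_struct. */
struct mm_model {
	atomic_long lockless_walkers;	/* walkers currently inside a lockless walk */
	long ipi_broadcasts;		/* simulated smp_call_function_many() calls */
};

static void mm_model_init(struct mm_model *mm)
{
	atomic_init(&mm->lockless_walkers, 0);
	mm->ipi_broadcasts = 0;
}

/* Marks the start of the window in which a looked-up pte pointer may be used. */
static void start_lockless_pgtbl_walk(struct mm_model *mm)
{
	atomic_fetch_add(&mm->lockless_walkers, 1);
}

/* Marks the end of that window; the pte pointer must not be used afterwards. */
static void end_lockless_pgtbl_walk(struct mm_model *mm)
{
	long prev = atomic_fetch_sub(&mm->lockless_walkers, 1);

	assert(prev > 0);	/* an end without a matching start is a bug */
}

/* Returns true if the (simulated) cross-CPU broadcast was needed. */
static bool serialize_against_pte_lookup_model(struct mm_model *mm)
{
	atomic_thread_fence(memory_order_seq_cst);	/* stands in for smp_mb() */
	if (atomic_load(&mm->lockless_walkers) == 0)
		return false;		/* no walker in flight, skip the broadcast */
	mm->ipi_broadcasts++;
	return true;
}
```

With this shape, a serialize issued after the lookup has returned but before
end_lockless_pgtbl_walk() still sees a non-zero count, which is the case the
inc/dec inside find_current_mm_pte() would miss.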
>
> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
> index 7d0e0d0d22c4..8f6fc2f80071 100644
> --- a/arch/powerpc/mm/book3s64/pgtable.c
> +++ b/arch/powerpc/mm/book3s64/pgtable.c
> @@ -95,6 +95,8 @@ static void do_nothing(void *unused)
> void serialize_against_pte_lookup(struct mm_struct *mm)
> {
> smp_mb();
> + if (atomic64_read(&mm->lockless_ptlookup_count) == 0)
> + return;
> smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
> }
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 6a7a1083b6fb..97fb2545e967 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -518,6 +518,7 @@ struct mm_struct {
> #endif
> } __randomize_layout;
>
> + atomic64_t lockless_ptlookup_count;
> /*
> * The mm_cpumask needs to be at the end of mm_struct, because it
> * is dynamically sized based on nr_cpu_ids.
>
Move that to the ppc64-specific mm_context_t.
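
As a rough userspace sketch of that layout, the counter would live in the
arch-specific context embedded in the mm rather than in the generic struct.
All type and function names below are hypothetical stand-ins, not the
kernel's definitions; only the idea of reaching the counter through
mm->context is taken from the suggestion above.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the ppc64 mm_context_t. */
typedef struct {
	atomic_long lockless_ptlookup_count;	/* arch-specific, so it lives here */
} mm_context_model_t;

/* Hypothetical stand-in for the generic mm_struct, which embeds the context. */
struct mm_struct_model {
	int generic_field;		/* placeholder for arch-independent state */
	mm_context_model_t context;	/* arch-specific part */
};

static void mm_struct_model_init(struct mm_struct_model *mm)
{
	mm->generic_field = 0;
	atomic_init(&mm->context.lockless_ptlookup_count, 0);
}

/* Accessors go through mm->context, leaving the generic struct unchanged. */
static void lockless_ptlookup_inc(struct mm_struct_model *mm)
{
	atomic_fetch_add(&mm->context.lockless_ptlookup_count, 1);
}

static void lockless_ptlookup_dec(struct mm_struct_model *mm)
{
	atomic_fetch_sub(&mm->context.lockless_ptlookup_count, 1);
}

static long lockless_ptlookup_read(struct mm_struct_model *mm)
{
	return atomic_load(&mm->context.lockless_ptlookup_count);
}
```

Other architectures would then not pay for a field they never use.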
-aneesh