Message-ID: <c8284d0-91cb-b65e-4c95-bfeb627234f@google.com>
Date: Wed, 21 Jun 2023 19:36:11 -0700 (PDT)
From: Hugh Dickins <hughd@...gle.com>
To: Jason Gunthorpe <jgg@...pe.ca>
cc: Hugh Dickins <hughd@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Gerald Schaefer <gerald.schaefer@...ux.ibm.com>,
Vasily Gorbik <gor@...ux.ibm.com>,
Mike Kravetz <mike.kravetz@...cle.com>,
Mike Rapoport <rppt@...nel.org>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Matthew Wilcox <willy@...radead.org>,
David Hildenbrand <david@...hat.com>,
Suren Baghdasaryan <surenb@...gle.com>,
Qi Zheng <zhengqi.arch@...edance.com>,
Yang Shi <shy828301@...il.com>,
Mel Gorman <mgorman@...hsingularity.net>,
Peter Xu <peterx@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Will Deacon <will@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
Alistair Popple <apopple@...dia.com>,
Ralph Campbell <rcampbell@...dia.com>,
Ira Weiny <ira.weiny@...el.com>,
Steven Price <steven.price@....com>,
SeongJae Park <sj@...nel.org>,
Lorenzo Stoakes <lstoakes@...il.com>,
Huang Ying <ying.huang@...el.com>,
Naoya Horiguchi <naoya.horiguchi@....com>,
Christophe Leroy <christophe.leroy@...roup.eu>,
Zack Rusin <zackr@...are.com>,
Axel Rasmussen <axelrasmussen@...gle.com>,
Anshuman Khandual <anshuman.khandual@....com>,
Pasha Tatashin <pasha.tatashin@...een.com>,
Miaohe Lin <linmiaohe@...wei.com>,
Minchan Kim <minchan@...nel.org>,
Christoph Hellwig <hch@...radead.org>,
Song Liu <song@...nel.org>,
Thomas Hellstrom <thomas.hellstrom@...ux.intel.com>,
Russell King <linux@...linux.org.uk>,
"David S. Miller" <davem@...emloft.net>,
Michael Ellerman <mpe@...erman.id.au>,
"Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>,
Heiko Carstens <hca@...ux.ibm.com>,
Christian Borntraeger <borntraeger@...ux.ibm.com>,
Claudio Imbrenda <imbrenda@...ux.ibm.com>,
Alexander Gordeev <agordeev@...ux.ibm.com>,
Jann Horn <jannh@...gle.com>,
Vishal Moola <vishal.moola@...il.com>,
Vlastimil Babka <vbabka@...e.cz>,
linux-arm-kernel@...ts.infradead.org, sparclinux@...r.kernel.org,
linuxppc-dev@...ts.ozlabs.org, linux-s390@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v2 05/12] powerpc: add pte_free_defer() for pgtables
sharing page
On Tue, 20 Jun 2023, Jason Gunthorpe wrote:
> On Tue, Jun 20, 2023 at 12:54:25PM -0700, Hugh Dickins wrote:
> > On Tue, 20 Jun 2023, Jason Gunthorpe wrote:
> > > On Tue, Jun 20, 2023 at 12:47:54AM -0700, Hugh Dickins wrote:
> > > > Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
> > > > pte_free_defer() will be called inside khugepaged's retract_page_tables()
> > > > loop, where allocating extra memory cannot be relied upon. This precedes
> > > > the generic version to avoid build breakage from incompatible pgtable_t.
> > > >
> > > > This is awkward because the struct page contains only one rcu_head, but
> > > > that page may be shared between PTE_FRAG_NR pagetables, each wanting to
> > > > use the rcu_head at the same time: account concurrent deferrals with a
> > > > heightened refcount, only the first making use of the rcu_head, but
> > > > re-deferring if more deferrals arrived during its grace period.
> > >
> > > You didn't answer my question why we can't just move the rcu to the
> > > actual free page?
> >
> > I thought that I had answered it, perhaps not to your satisfaction:
> >
> > https://lore.kernel.org/linux-mm/9130acb-193-6fdd-f8df-75766e663978@google.com/
> >
> > My conclusion then was:
> > Not very good reasons: good enough, or can you supply a better patch?
>
> Oh, I guess I didn't read that email as answering the question..
>
> I was saying to make pte_fragment_free() unconditionally do the
> RCU. It is the only thing that uses the page->rcu_head, and it means
> PPC would double RCU the final free on the TLB path, but that is
> probably OK for now. This means pte_free_defer() won't do anything
> special on PPC as PPC will always RCU free these things, this address
> the defer concern too, I think. Overall it is easier to reason about.
>
> I looked at fixing the TLB stuff to avoid the double rcu but quickly
> got scared that ppc was using a kmem_cache to allocate other page
> table sizes so there is not a reliable struct page to get a rcu_head
> from. This looks like the main challenge for ppc... We'd have to teach
> the tlb code to not do its own RCU stuff for table levels that the
> arch is already RCU freeing - and that won't get us to full RCU
> freeing on PPC.

Sorry for being so dense all along: yes, your way is unquestionably
much better than mine. I guess I must have been obsessive about
keeping pte_free_defer()+pte_free_now() "on the outside", as they
were on x86, and never perceived how much easier it is with a small
tweak inside pte_fragment_free(); and never reconsidered it since.

But I'm not so keen on the double-RCU, extending this call_rcu() to
all the normal cases, while still leaving the TLB batching in place:
here is the replacement patch I'd prefer us to go forward with now.

Many thanks!

[PATCH v3 05/12] powerpc: add pte_free_defer() for pgtables sharing page

Add powerpc-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon. This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This is awkward because the struct page contains only one rcu_head, but
that page may be shared between PTE_FRAG_NR pagetables, each wanting to
use the rcu_head at the same time. But powerpc never reuses a fragment
once it has been freed: so mark the page Active in pte_free_defer(),
before calling pte_fragment_free() directly; and there call_rcu() to
pte_free_now() when last fragment is freed and the page is PageActive.

Suggested-by: Jason Gunthorpe <jgg@...pe.ca>
Signed-off-by: Hugh Dickins <hughd@...gle.com>
---
 arch/powerpc/include/asm/pgalloc.h |  4 ++++
 arch/powerpc/mm/pgtable-frag.c     | 29 ++++++++++++++++++++++++++---
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
 	pte_fragment_free((unsigned long *)ptepage, 0);
 }
 
+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..0c6b68130025 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -106,6 +106,15 @@ pte_t *pte_fragment_alloc(struct mm_struct *mm, int kernel)
 	return __alloc_for_ptecache(mm, kernel);
 }
 
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pgtable_pte_page_dtor(page);
+	__free_page(page);
+}
+
 void pte_fragment_free(unsigned long *table, int kernel)
 {
 	struct page *page = virt_to_page(table);
@@ -115,8 +124,22 @@ void pte_fragment_free(unsigned long *table, int kernel)
 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
 
 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
-		if (!kernel)
-			pgtable_pte_page_dtor(page);
-		__free_page(page);
+		if (kernel)
+			__free_page(page);
+		else if (TestClearPageActive(page))
+			call_rcu(&page->rcu_head, pte_free_now);
+		else
+			pte_free_now(&page->rcu_head);
 	}
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	SetPageActive(page);
+	pte_fragment_free((unsigned long *)pgtable, 0);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
--
2.35.3
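
For readers following the thread, the scheme in the v3 commit message can
be sketched as a small userspace model. This is only an illustration, not
kernel code: every name below (model_page, model_fragment_free(),
model_free_defer(), model_grace_period(), MODEL_FRAG_NR) is hypothetical,
the atomics are replaced by plain integers, and the RCU grace period is
reduced to a flag that the caller clears by hand. It shows that a single
"active" mark on the shared page is enough to route the final free through
the one available deferred callback, however many fragments were deferred.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_FRAG_NR 4	/* hypothetical stand-in for PTE_FRAG_NR */

/* Hypothetical model of the few struct page fields the patch relies on. */
struct model_page {
	int frag_refcount;	/* models page->pt_frag_refcount */
	bool active;		/* models the PageActive flag */
	bool rcu_queued;	/* models a pending call_rcu() callback */
	bool freed;		/* set once the page would be released */
};

/* Models pte_free_now(): the actual release of the page. */
static void model_free_now(struct model_page *page)
{
	page->freed = true;
}

/* Models pte_fragment_free(): only the last fragment triggers a free. */
static void model_fragment_free(struct model_page *page, bool kernel)
{
	assert(page->frag_refcount > 0);
	if (--page->frag_refcount == 0) {
		if (kernel) {
			model_free_now(page);
		} else if (page->active) {
			/* Some fragment was deferred: wait for a grace period. */
			page->active = false;
			page->rcu_queued = true;
		} else {
			model_free_now(page);
		}
	}
}

/* Models pte_free_defer(): mark the page, then drop the fragment. */
static void model_free_defer(struct model_page *page)
{
	page->active = true;
	model_fragment_free(page, false);
}

/* Models the end of a grace period: run the queued callback, if any. */
static void model_grace_period(struct model_page *page)
{
	if (page->rcu_queued) {
		page->rcu_queued = false;
		model_free_now(page);
	}
}

/* Walks one shared page through the sequence; returns 0 on success. */
static int model_demo(void)
{
	struct model_page page = { .frag_refcount = MODEL_FRAG_NR };

	model_free_defer(&page);		/* one fragment deferred */
	model_fragment_free(&page, false);	/* two freed normally */
	model_fragment_free(&page, false);
	if (page.freed || page.rcu_queued)
		return 1;			/* page still has a user */
	model_fragment_free(&page, false);	/* last fragment goes */
	if (!page.rcu_queued || page.freed)
		return 2;			/* free must be deferred */
	model_grace_period(&page);
	return page.freed ? 0 : 3;
}
```

Calling model_demo() from a main() should return 0 under these
assumptions. The model relies on the property stated in the commit
message, that a freed fragment is never reused: without it, the mark set
by one deferral could outlive the mapping it was meant to protect.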