lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c8284d0-91cb-b65e-4c95-bfeb627234f@google.com>
Date:   Wed, 21 Jun 2023 19:36:11 -0700 (PDT)
From:   Hugh Dickins <hughd@...gle.com>
To:     Jason Gunthorpe <jgg@...pe.ca>
cc:     Hugh Dickins <hughd@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Gerald Schaefer <gerald.schaefer@...ux.ibm.com>,
        Vasily Gorbik <gor@...ux.ibm.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Mike Rapoport <rppt@...nel.org>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Matthew Wilcox <willy@...radead.org>,
        David Hildenbrand <david@...hat.com>,
        Suren Baghdasaryan <surenb@...gle.com>,
        Qi Zheng <zhengqi.arch@...edance.com>,
        Yang Shi <shy828301@...il.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Peter Xu <peterx@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Will Deacon <will@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
        Alistair Popple <apopple@...dia.com>,
        Ralph Campbell <rcampbell@...dia.com>,
        Ira Weiny <ira.weiny@...el.com>,
        Steven Price <steven.price@....com>,
        SeongJae Park <sj@...nel.org>,
        Lorenzo Stoakes <lstoakes@...il.com>,
        Huang Ying <ying.huang@...el.com>,
        Naoya Horiguchi <naoya.horiguchi@....com>,
        Christophe Leroy <christophe.leroy@...roup.eu>,
        Zack Rusin <zackr@...are.com>,
        Axel Rasmussen <axelrasmussen@...gle.com>,
        Anshuman Khandual <anshuman.khandual@....com>,
        Pasha Tatashin <pasha.tatashin@...een.com>,
        Miaohe Lin <linmiaohe@...wei.com>,
        Minchan Kim <minchan@...nel.org>,
        Christoph Hellwig <hch@...radead.org>,
        Song Liu <song@...nel.org>,
        Thomas Hellstrom <thomas.hellstrom@...ux.intel.com>,
        Russell King <linux@...linux.org.uk>,
        "David Sc. Miller" <davem@...emloft.net>,
        Michael Ellerman <mpe@...erman.id.au>,
        "Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>,
        Heiko Carstens <hca@...ux.ibm.com>,
        Christian Borntraeger <borntraeger@...ux.ibm.com>,
        Claudio Imbrenda <imbrenda@...ux.ibm.com>,
        Alexander Gordeev <agordeev@...ux.ibm.com>,
        Jann Horn <jannh@...gle.com>,
        Vishal Moola <vishal.moola@...il.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        linux-arm-kernel@...ts.infradead.org, sparclinux@...r.kernel.org,
        linuxppc-dev@...ts.ozlabs.org, linux-s390@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v2 05/12] powerpc: add pte_free_defer() for pgtables
 sharing page

On Tue, 20 Jun 2023, Jason Gunthorpe wrote:
> On Tue, Jun 20, 2023 at 12:54:25PM -0700, Hugh Dickins wrote:
> > On Tue, 20 Jun 2023, Jason Gunthorpe wrote:
> > > On Tue, Jun 20, 2023 at 12:47:54AM -0700, Hugh Dickins wrote:
> > > > Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
> > > > pte_free_defer() will be called inside khugepaged's retract_page_tables()
> > > > loop, where allocating extra memory cannot be relied upon.  This precedes
> > > > the generic version to avoid build breakage from incompatible pgtable_t.
> > > > 
> > > > This is awkward because the struct page contains only one rcu_head, but
> > > > that page may be shared between PTE_FRAG_NR pagetables, each wanting to
> > > > use the rcu_head at the same time: account concurrent deferrals with a
> > > > heightened refcount, only the first making use of the rcu_head, but
> > > > re-deferring if more deferrals arrived during its grace period.
> > > 
> > > You didn't answer my question why we can't just move the rcu to the
> > > actual free page?
> > 
> > I thought that I had answered it, perhaps not to your satisfaction:
> > 
> > https://lore.kernel.org/linux-mm/9130acb-193-6fdd-f8df-75766e663978@google.com/
> > 
> > My conclusion then was:
> > Not very good reasons: good enough, or can you supply a better patch?
> 
> Oh, I guess I didn't read that email as answering the question..
> 
> I was saying to make pte_fragment_free() unconditionally do the
> RCU. It is the only thing that uses the page->rcu_head, and it means
> PPC would double RCU the final free on the TLB path, but that is
> probably OK for now. This means pte_free_defer() won't do anything
> special on PPC as PPC will always RCU free these things, this address
> the defer concern too, I think. Overall it is easier to reason about.
> 
> I looked at fixing the TLB stuff to avoid the double rcu but quickly
> got scared that ppc was using a kmem_cache to allocate other page
> table sizes so there is not a reliable struct page to get a rcu_head
> from. This looks like the main challenge for ppc... We'd have to teach
> the tlb code to not do its own RCU stuff for table levels that the
> arch is already RCU freeing - and that won't get us to full RCU
> freeing on PPC.

Sorry for being so dense all along: yes, your way is unquestionably
much better than mine.  I guess I must have been obsessive about
keeping pte_free_defer()+pte_free_now() "on the outside", as they
were on x86, and never perceived how much easier it is with a small
tweak inside pte_fragment_free(); and never reconsidered it since.

But I'm not so keen on the double-RCU, extending this call_rcu() to
all the normal cases, while still leaving the TLB batching in place:
here is the replacement patch I'd prefer us to go forward with now.

Many thanks!

[PATCH v3 05/12] powerpc: add pte_free_defer() for pgtables sharing page

Add powerpc-specific pte_free_defer(), to free table page via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This is awkward because the struct page contains only one rcu_head, but
that page may be shared between PTE_FRAG_NR pagetables, each wanting to
use the rcu_head at the same time.  But powerpc never reuses a fragment
once it has been freed: so mark the page Active in pte_free_defer(),
before calling pte_fragment_free() directly; and there call_rcu() to
pte_free_now() when last fragment is freed and the page is PageActive.

Suggested-by: Jason Gunthorpe <jgg@...pe.ca>
Signed-off-by: Hugh Dickins <hughd@...gle.com>
---
 arch/powerpc/include/asm/pgalloc.h |  4 ++++
 arch/powerpc/mm/pgtable-frag.c     | 29 ++++++++++++++++++++++++++---
 2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
 	pte_fragment_free((unsigned long *)ptepage, 0);
 }
 
+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..0c6b68130025 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -106,6 +106,15 @@ pte_t *pte_fragment_alloc(struct mm_struct *mm, int kernel)
 	return __alloc_for_ptecache(mm, kernel);
 }
 
+static void pte_free_now(struct rcu_head *head)
+{
+	struct page *page;
+
+	page = container_of(head, struct page, rcu_head);
+	pgtable_pte_page_dtor(page);
+	__free_page(page);
+}
+
 void pte_fragment_free(unsigned long *table, int kernel)
 {
 	struct page *page = virt_to_page(table);
@@ -115,8 +124,22 @@ void pte_fragment_free(unsigned long *table, int kernel)
 
 	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
 	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
-		if (!kernel)
-			pgtable_pte_page_dtor(page);
-		__free_page(page);
+		if (kernel)
+			__free_page(page);
+		else if (TestClearPageActive(page))
+			call_rcu(&page->rcu_head, pte_free_now);
+		else
+			pte_free_now(&page->rcu_head);
 	}
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	SetPageActive(page);
+	pte_fragment_free((unsigned long *)pgtable, 0);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-- 
2.35.3

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ