[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210812210846.bfalflrvn4bfpyyh@box.shutemov.name>
Date: Fri, 13 Aug 2021 00:08:46 +0300
From: "Kirill A. Shutemov" <kirill@...temov.name>
To: Dave Hansen <dave.hansen@...el.com>
Cc: Borislav Petkov <bp@...en8.de>, Andy Lutomirski <luto@...nel.org>,
Sean Christopherson <seanjc@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Joerg Roedel <jroedel@...e.de>,
Andi Kleen <ak@...ux.intel.com>,
Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@...ux.intel.com>,
David Rientjes <rientjes@...gle.com>,
Vlastimil Babka <vbabka@...e.cz>,
Tom Lendacky <thomas.lendacky@....com>,
Thomas Gleixner <tglx@...utronix.de>,
Peter Zijlstra <peterz@...radead.org>,
Paolo Bonzini <pbonzini@...hat.com>,
Ingo Molnar <mingo@...hat.com>,
Varad Gautam <varad.gautam@...e.com>,
Dario Faggioli <dfaggioli@...e.com>, x86@...nel.org,
linux-mm@...ck.org, linux-coco@...ts.linux.dev,
linux-kernel@...r.kernel.org,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>
Subject: Re: [PATCH 1/5] mm: Add support for unaccepted memory
On Tue, Aug 10, 2021 at 01:50:57PM -0700, Dave Hansen wrote:
> On 8/10/21 11:13 AM, Dave Hansen wrote:
> >> @@ -1001,6 +1004,9 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
> >> if (page_reported(page))
> >> __ClearPageReported(page);
> >>
> >> + if (PageOffline(page))
> >> + clear_page_offline(page, order);
> >> +
> >> list_del(&page->lru);
> >> __ClearPageBuddy(page);
> >> set_page_private(page, 0);
> > So, this is right in the fast path of the page allocator. It's a
> > one-time thing per 2M page, so it's not permanent.
> >
> > *But* there's both a global spinlock and a firmware call hidden in
> > clear_page_offline(). That's *GOT* to hurt if you were, for instance,
> > running a benchmark while this code path is being tickled. Not just to
> >
> > That could be just downright catastrophic for scalability, albeit
> > temporarily.
>
> One more thing...
>
> How long are these calls? You have to make at least 512 calls into the
> SEAM module. Assuming they're syscall-ish, so ~1,000 cycles each,
> that's ~500,000 cycles, even if we ignore the actual time it takes to
> zero that 2MB worth of memory and all other overhead within the SEAM module.
I hope to get away with 2 calls per 2M: one MapGPA and one TDACCEPTPAGE
(or 3 for MAXORDER -- 4M -- pages). I don't have any numbers yet.
> So, we're sitting on one CPU with interrupts off, blocking all the other
> CPUs from doing page allocation in this zone.
I agree that's not good. Let's see if it's going to be okay with accepting
in 2M chunks.
> Then, we're holding a global lock which prevents any other NUMA nodes
> from accepting pages.
Looking at this again, the global lock is aviodable: the caller owns the
pfn range so nobody can touch these bits in the bitmap. We can replace
bitmap_clear() with atomic clear_bit() loop and drop the lock completely.
> If the other node happens to *try* to do an
> accept, it will sit with its zone lock held waiting for this one.
> Maybe nobody will ever notice. But, it seems like an awfully big risk
> to me. I'd at least *try* do these calls outside of the zone lock.
> Then the collateral damage will at least be limited to things doing
> accepts rather than all zone->lock users.
>
> Couldn't we delay the acceptance to, say the place where we've dropped
> the zone->lock and do the __GFP_ZERO memset() like at prep_new_page()?
> Or is there some concern that the page has been split at that point?
It *will* be split by the point. Like if you ask for order-0 page and you
don't any left page allocator will try higher orders until finds anything.
On order-9 it would hit unaccepted. At that point the page going to split
and put on the free lists accordingly. That's all happens under zone lock.
__rmqueue_smallest ->
del_page_from_free_list()
expand()
> I guess that makes it more complicated because you might have a 4k page
> but you need to go accept a 2M page. You might end up having to check
> the bitmap 511 more times because you might see 511 more PageOffline()
> pages come through.
>
> You shouldn't even need the bitmap lock to read since it's a one-way
> trip from unaccepted->accepted.
Yeah. Unless we don't want to flip it back on making the range share.
I think we do. Otherwise it will cause problems for kexec.
--
Kirill A. Shutemov
Powered by blists - more mailing lists