Message-ID: <e678affb-5eee-a055-7af1-1d29a965663b@google.com>
Date: Tue, 4 Jul 2023 10:03:57 -0700 (PDT)
From: Hugh Dickins <hughd@...gle.com>
To: Gerald Schaefer <gerald.schaefer@...ux.ibm.com>
cc: Hugh Dickins <hughd@...gle.com>, Jason Gunthorpe <jgg@...pe.ca>,
Andrew Morton <akpm@...ux-foundation.org>,
Vasily Gorbik <gor@...ux.ibm.com>,
Mike Kravetz <mike.kravetz@...cle.com>,
Mike Rapoport <rppt@...nel.org>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Matthew Wilcox <willy@...radead.org>,
David Hildenbrand <david@...hat.com>,
Suren Baghdasaryan <surenb@...gle.com>,
Qi Zheng <zhengqi.arch@...edance.com>,
Yang Shi <shy828301@...il.com>,
Mel Gorman <mgorman@...hsingularity.net>,
Peter Xu <peterx@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Will Deacon <will@...nel.org>, Yu Zhao <yuzhao@...gle.com>,
Alistair Popple <apopple@...dia.com>,
Ralph Campbell <rcampbell@...dia.com>,
Ira Weiny <ira.weiny@...el.com>,
Steven Price <steven.price@....com>,
SeongJae Park <sj@...nel.org>,
Lorenzo Stoakes <lstoakes@...il.com>,
Huang Ying <ying.huang@...el.com>,
Naoya Horiguchi <naoya.horiguchi@....com>,
Christophe Leroy <christophe.leroy@...roup.eu>,
Zack Rusin <zackr@...are.com>,
Axel Rasmussen <axelrasmussen@...gle.com>,
Anshuman Khandual <anshuman.khandual@....com>,
Pasha Tatashin <pasha.tatashin@...een.com>,
Miaohe Lin <linmiaohe@...wei.com>,
Minchan Kim <minchan@...nel.org>,
Christoph Hellwig <hch@...radead.org>,
Song Liu <song@...nel.org>,
Thomas Hellstrom <thomas.hellstrom@...ux.intel.com>,
Russell King <linux@...linux.org.uk>,
"David S. Miller" <davem@...emloft.net>,
Michael Ellerman <mpe@...erman.id.au>,
"Aneesh Kumar K.V" <aneesh.kumar@...ux.ibm.com>,
Heiko Carstens <hca@...ux.ibm.com>,
Christian Borntraeger <borntraeger@...ux.ibm.com>,
Claudio Imbrenda <imbrenda@...ux.ibm.com>,
Alexander Gordeev <agordeev@...ux.ibm.com>,
Jann Horn <jannh@...gle.com>,
Vishal Moola <vishal.moola@...il.com>,
Vlastimil Babka <vbabka@...e.cz>,
linux-arm-kernel@...ts.infradead.org, sparclinux@...r.kernel.org,
linuxppc-dev@...ts.ozlabs.org, linux-s390@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page

On Tue, 4 Jul 2023, Gerald Schaefer wrote:
> On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
> Hugh Dickins <hughd@...gle.com> wrote:
> > On Thu, 29 Jun 2023, Hugh Dickins wrote:
> > >
> > > I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> > > extent that I've not even tried to verify it; but I think I do get the
> > > point now, that we need further info than just PPHHAA to know whether
> > > the page is on the list or not. But I think that if we move where the
> > > call_rcu() is done, then the page can stay on or off the list by same
> > > rules as before (but need to check HH bits along with PP when deciding
> > > whether to allocate, and whether to list_add_tail() when freeing).
> >
> > No, not quite the same rules as before: I came to realize that using
> > list_add_tail() for the HH pages would be liable to put a page on the
> > list which forever blocked reuse of PP list_add_tail() pages after it
> > (could be solved by a list_move() somewhere, but we have agreed to
> > prefer simplicity).
> >
> > I've dropped the HH bits, I'm using PageActive like we did on powerpc,
> > I've dropped most of the pte_free_*() helpers, and list_del_init() is
> > an easier way of dealing with those "is it on the list" questions.
> > I expect that we shall be close to reaching agreement on...
>
> This looks really nice, almost too good and easy to be true. I did not
> find any obvious flaw, just some comments below. It also survived LTP
> without any visible havoc, so I guess this approach is the best so far.

Phew! I'm of course glad to hear this: thanks for your efforts on it.
...
> > --- a/arch/s390/mm/pgalloc.c
> > +++ b/arch/s390/mm/pgalloc.c
> > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> > * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
> > * while the PP bits are never used, nor such a page is added to or removed
> > * from mm_context_t::pgtable_list.
> > + *
> > + * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
> > + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> > + * guarantee that its pgtable cannot be reused before the RCU grace period
> > + * has elapsed (which page_table_free_rcu() does not actually guarantee).
>
> Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
> allow reuse before the grace period has elapsed. And I hope that it does
> so, by setting the PP bits, which would be noticed in page_table_alloc(),
> in case the page would be seen there.
>
> Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
> end of the list, and so they could be seen in page_table_alloc(), but they
> should not be reused before the grace period has elapsed and
> __tlb_remove_table() has cleared the PP bits, as far as I understand.
>
> So what exactly do you mean by "which page_table_free_rcu() does not actually
> guarantee"?

I'll answer without locating and re-reading what Jason explained earlier,
perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table():
he may have explained it better. And without working out again all the
MMU_GATHER #defines, and which of them do and do not apply to s390 here.
The detail that sticks in my mind is the fallback in tlb_remove_table()
in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot
batch the tables for freeing by RCU, and resorts instead to an immediate
TLB flush (I think: that again involves chasing definitions) followed by
tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU,
and is commented:
/*
* This isn't an RCU grace period and hence the page-tables cannot be
* assumed to be actually RCU-freed.
*
* It is however sufficient for software page-table walkers that rely on
* IRQ disabling.
*/
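
To make that concrete, the fallback path, as I read it, condenses to
something like this (a sketch from memory of mm/mmu_gather.c, so the
helper names and details deserve checking against the tree, and they
vary with the MMU_GATHER config options):

void tlb_remove_table(struct mmu_gather *tlb, void *table)
{
	struct mmu_table_batch **batch = &tlb->batch;

	if (*batch == NULL) {
		*batch = (struct mmu_table_batch *)
				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
		if (*batch == NULL) {
			/* No batch page, so no rcu_head: cannot call_rcu() */
			tlb_table_invalidate(tlb);	/* immediate TLB flush */
			tlb_remove_table_one(table);
			return;
		}
		(*batch)->nr = 0;
	}
	(*batch)->tables[(*batch)->nr++] = table;
	if ((*batch)->nr == MAX_TABLE_BATCH)
		tlb_table_flush(tlb);
}

static void tlb_remove_table_one(void *table)
{
	tlb_remove_table_sync_one();	/* the IPI to each CPU, as above */
	__tlb_remove_table(table);	/* freed at once, no grace period */
}
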
Whether that's good for your PP pages or not, I've given no thought:
I've just taken it on trust that what s390 has working today is good.
If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(),
then I would not have written "(which page_table_free_rcu() does not
actually guarantee)". But it cannot use call_rcu() because it does
not have an rcu_head to work with - it's in some generic code, and
there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set.
And Jason would have much preferred us to address the issue from that
angle; but not only would doing so destroy my sanity, I'd also destroy
20 architectures' TLB-flushing, unbuilt and untested, in the attempt.
...
> > @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > */
> > mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> > mask >>= 24;
> > - if (mask & 0x03U)
> > + if ((mask & 0x03U) && !PageActive(page)) {
> > + /*
> > + * Other half is allocated, and neither half has had
> > + * its free deferred: add page to head of list, to make
> > + * this freed half available for immediate reuse.
> > + */
> > list_add(&page->lru, &mm->context.pgtable_list);
> > - else
> > - list_del(&page->lru);
> > + } else {
> > + /* If page is on list, now remove it. */
> > + list_del_init(&page->lru);
> > + }
>
> Ok, we might end up with some unnecessary list_del_init() here, e.g. if
> the other half is still allocated, when called from pte_free_defer() on a
> fully allocated page, which was not on the list (and with PageActive, and
> (mask & 0x03U) true).
> Not sure if adding an additional mask check to the else path would be
> needed, but it seems that list_del_init() should also be able to handle
> this.

list_del_init() is very cheap in the unnecessary case: the cachelines
required are already there. You don't want a flag to say whether to
call it or not; it is already the efficient approach.
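
To spell that out (quoting include/linux/list.h from memory, so worth
verifying against the tree):

static inline void list_del_init(struct list_head *entry)
{
	__list_del_entry(entry);	/* neighbours linked past entry */
	INIT_LIST_HEAD(entry);		/* entry->next = entry->prev = entry */
}

Once INIT_LIST_HEAD() has pointed the entry back at itself, a repeated
list_del_init() just rewrites the entry's own two pointers, in cachelines
already touched: no crash, no corruption of the list, no flag required.
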
(But you were right not to use it in your pt_frag_refcount version,
because there we were still trying to do the call_rcu() per fragment
rather than per page, so page->lru could have been on the RCU queue.)
>
> Same thought applies to the similar logic in page_table_free_rcu()
> below.
>
> > spin_unlock_bh(&mm->context.lock);
> > mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
> > mask >>= 24;
> > @@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> > }
> >
> > page_table_release_check(page, table, half, mask);
> > - pgtable_pte_page_dtor(page);
> > - __free_page(page);
> > + if (TestClearPageActive(page))
> > + call_rcu(&page->rcu_head, pte_free_now);
> > + else
> > + pte_free_now(&page->rcu_head);
>
> This ClearPageActive, and the similar thing in __tlb_remove_table() below,
> worries me a bit, because it is done outside the spin_lock. It "feels" like
> there could be some race with the PageActive checks inside the spin_lock,
> but when drawing some pictures, I could not find any such scenario yet.
> Also, our existing spin_lock is probably not supposed to protect against
> PageActive changes anyway, right?

Here (and similarly in __tlb_remove_table()) is where we are about to free
the page table page: both of the fragments have already been released,
there is nobody left who could be racing against us to set PageActive.
I chose PageActive for its name, not for any special behaviour of that
flag: nothing else could be setting or clearing it while we own the page.
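
For completeness, since it is not in the hunks quoted above: the only
setter is pte_free_defer() itself, which (sketching the idea, not quoting
the patch verbatim) amounts to no more than

void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
{
	struct page *page = virt_to_page(pgtable);

	SetPageActive(page);	/* mark this table's free as deferred */
	page_table_free(mm, (unsigned long *)pgtable);
}

so the flag is set before its fragment is released, and by the time both
halves have been freed there is nobody left who could set it again.
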
Hugh