[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7789261d-6a64-c47b-be6c-c9be680e5d33@redhat.com>
Date: Wed, 1 Sep 2021 18:13:07 +0200
From: David Hildenbrand <david@...hat.com>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Qi Zheng <zhengqi.arch@...edance.com>, akpm@...ux-foundation.org,
tglx@...utronix.de, hannes@...xchg.org, mhocko@...nel.org,
vdavydov.dev@...il.com, kirill.shutemov@...ux.intel.com,
mika.penttila@...tfour.com, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
songmuchun@...edance.com
Subject: Re: [PATCH v2 6/9] mm: free user PTE page table pages
On 01.09.21 17:32, Jason Gunthorpe wrote:
> On Wed, Sep 01, 2021 at 03:57:09PM +0200, David Hildenbrand wrote:
>> On 01.09.21 15:53, Jason Gunthorpe wrote:
>>> On Thu, Aug 19, 2021 at 11:18:55AM +0800, Qi Zheng wrote:
>>>
>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>> index 2630ed1bb4f4..30757f3b176c 100644
>>>> +++ b/mm/gup.c
>>>> @@ -500,6 +500,9 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>>>> if (unlikely(pmd_bad(*pmd)))
>>>> return no_page_table(vma, flags);
>>>> + if (!pte_try_get(mm, pmd))
>>>> + return no_page_table(vma, flags);
>>>> +
>>>> ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
>>>
>>> This is not good on a performance path, the pte_try_get() is
>>> locking/locking the same lock that pte_offset_map_lock() is getting.
>>
>> Yes, and we really need patch #8, anything else is just confusing reviewers.
>
> It is a bit better with patch 8, but it is still not optimal, we don't
> need to do the atomic work at all if the entire ptep is accessed while
> locked. So the above is stil not what I would expect here, even with
> RCU.
>
> eg I would expect that this kind of change would work first with the
> existing paired acessors, ie
>
> pte = pte_offset_map(pmd, address);
> pte_unmap(pte);
>
> Should handle the refcount under the covers, and same kind of idea for
> the _locked/_unlocked varient.
See my other mail.
>
> Only places that don't already use that pairing should get modified.
>
> To do this we have to extend the API so that pte_offset_map() can
> fail, or very cleverly return some kind of global non-present pte page
> (I wonder if the zero page would work?)
I explored both ideas (returning NULL, return a specially prepared page)
and it didn't work in some cases where we unmap+remap etc.
>
>>> Also, I don't really understand how this scheme works with
>>> get_user_pages_fast.
>>
>> With the RCU change it in #8 it should work just fine, because RCU
>> synchronize has to wait either until all other CPUs have left the RCU read
>> section, or re-enabled interrupts.
>
> So at this point in the series fast gup is broken, that does mean the
> series presentation really needs to be reworked. The better
> presentation is to add the API changes, with a
> no-functional-difference implementation, push the new API in well
> split patches to all the consumption sites, then change the API to
> have the new semantics.
Exactly my thoughts.
>
> RCU and refcount to free the page levels seems like a reasonable
> approach, but I have to say I haven't thought it through fully - are
> all the contexts that have the pte deref safe to do call_rcu?
Very good question. I'd assume so.
--
Thanks,
David / dhildenb
Powered by blists - more mailing lists