[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <19A86A16-B440-4B73-98FE-922A09484DFD@oracle.com>
Date: Mon, 5 Aug 2019 09:56:45 -0600
From: William Kucharski <william.kucharski@...cle.com>
To: "Kirill A. Shutemov" <kirill@...temov.name>
Cc: LKML <linux-kernel@...r.kernel.org>, linux-mm@...ck.org,
linux-fsdevel@...r.kernel.org,
Dave Hansen <dave.hansen@...ux.intel.com>,
Song Liu <songliubraving@...com>,
Bob Kasten <robert.a.kasten@...el.com>,
Mike Kravetz <mike.kravetz@...cle.com>,
Chad Mynhier <chad.mynhier@...cle.com>,
"Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
Johannes Weiner <jweiner@...com>,
Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH v3 2/2] mm,thp: Add experimental config option
RO_EXEC_FILEMAP_HUGE_FAULT_THP
> On Aug 5, 2019, at 7:28 AM, Kirill A. Shutemov <kirill@...temov.name> wrote:
>
>>
>> Is there different terminology you'd prefer to see me use here to clarify
>> this?
>
> My point is that maybe we should just use ~HPAGE_P?D_MASK in code. The new
> HPAGE_P?D_OFFSET doesn't add much for readability in my opinion.
Fair enough, I'll make that change.
>> OK, I can do that; I didn't want to unnecessarily eliminate the
>> VM_BUG_ON_PAGE(PageTransHuge(page)) call for everyone given this
>> is CONFIG experimental code.
>
> If you bring the feature, you bring the feature. Just drop these VM_BUGs.
OK.
> I'm not sure. It will be costly comparing to PageTransCompound/Huge as we
> need to check more than flags.
>
> To exclude hugetlb pages here, use VM_BUG_ON_PAGE(PageHuge(page), page).
> It will allow to catch wrong usage of the function.
That will also do it and that way we will better know if it ever happens,
which of course it shouldn't.
>> This routine's main function other than validation is to make sure the page
>> cache has not been polluted between when we go out to read the large page
>> and when the page is added to the cache (more on that coming up.) For
>> example, the way I was able to tell readahead() was polluting future
>> possible THP mappings is because after a buffered read I would typically see
>> 52 (the readahead size) PAGESIZE pages for the next 2M range in the page
>> cache.
>
> My point is that you should only see compound pages here if they are
> HPAGE_PMD_ORDER, shouldn't you? Any other order of compound page would be
> unexpected to say the least.
Yes, compound pages should only be HPAGE_PMD_ORDER.
The routine and the check will need to be updated if we ever can
allocate/cache larger pages.
>> It's my understanding that pages in the page cache should be locked, so I
>> wanted to check for that.
>
> No. They are get locked temporary for some operation, but not all the
> time.
OK, thanks for that.
>> I don't really care if the start of the VMA is suitable, just whether I can map
>> the current faulting page with a THP. As far as I know, there's nothing wrong
>> with mapping all the pages before the VMA hits a properly aligned bound with
>> PAGESIZE pages and then aligned chunks in the middle with THP.
>
> You cannot map any paged as huge into wrongly aligned VMA.
>
> THP's ->index must be aligned to HPAGE_PMD_NR, so if the combination VMA's
> ->vm_start and ->vm_pgoff doesn't allow for this, you must fallback to
> mapping the page with PTEs. I don't see it handled properly here.
It was my assumption that if say a VMA started at an address say one page
before a large page alignment, you could map that page with a PAGESIZE
page but if VMA size allowed, there was a fault on the next page, and
VMA size allowed, you could map that next range with a large page, taking
taking the approach of mapping chunks of the VMA with the largest page
possible.
Is it that the start of the VMA must always align or that the entire VMA
must be properly aligned and a multiple of the PMD size (so you either map
with all large pages or none)?
>> This is the page that content was just read to; readpage() will unlock the page
>> when it is done with I/O, but the page needs to be locked before it's inserted
>> into the page cache.
>
> Then you must to lock the page properly with lock_page().
>
> __SetPageLocked() is fine for just allocated pages that was not exposed
> anywhere. After ->readpage() it's not the case and it's not safe to use
> __SetPageLocked() for them.
In the current code, it's assumed it is not exposed, because a single read
of a large page that does no readahead before the page is inserted into the
cache means there are no external users of the page.
Regardless, I can make this change as part of the changes I will need to to
reorder when the page is inserted into the cache.
>> I can make that change; originally alloc_set_pte() didn't use the second
>> parameter at all when mapping a read-only page.
>>
>> Even now, if the page isn't writable, it would only be dereferenced by a
>> VM_BUG_ON_PAGE() call if it's COW.
>
> Please do change this. It has to be future-proof.
OK, thanks.
>> My thinking had been if any part of reading a large page and mapping it had
>> failed, the code could just put_page() the newly allocated page and fallback
>> to mapping the page with PAGESIZE pages rather than add a THP to the cache.
>
> I think it's must change. We should not allow inconsistent view on page
> cache.
Yes, I can see where it would be confusing to anyone looking at it that assumed
the page must already be in the cache before readpage() is called.
>> If mprotect() is called, wouldn't the pages be COWed to PAGESIZE pages the
>> first time the area was written to? I may be way off on this assumption.
>
> Okay, fair enough. COW will happen for private mappings.
>
> But for private mappings you don't need to enforce even RO. All permission
> mask should be fine.
Thanks.
>> Once again, the question is whether we want to make this just RO or RO + EXEC
>> to maintain my goal of just mapping program TEXT via THP. I'm willing to
>> hear arguments either way.
>
> It think the goal is to make feature useful and therefore we need to make
> it available for widest possible set of people.
>
> I think is should be allowed for RO (based on how file was opened, not on
> PROT_*) + SHARED and for any PRIVATE mappings.
That makes sense.
>> I did that because the existing code just blindly sets VM_MAYWRITE and I
>> obviously didn't want to, so making it a variable allowed me to shut it off
>> if it was a THP mapping.
>
> I think touching VM_MAYWRITE here is wrong. It should reflect what file
> under the mapping allows.
Fair enough.
Thanks again!
-- Bill
Powered by blists - more mailing lists