lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c8d02a3b-e1ad-2b95-ce15-13d3ed4cca87@oracle.com>
Date:   Sat, 3 Aug 2019 16:27:51 -0600
From:   William Kucharski <william.kucharski@...cle.com>
To:     "Kirill A. Shutemov" <kirill@...temov.name>
Cc:     linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        linux-fsdevel@...r.kernel.org,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Song Liu <songliubraving@...com>,
        Bob Kasten <robert.a.kasten@...el.com>,
        Mike Kravetz <mike.kravetz@...cle.com>,
        Chad Mynhier <chad.mynhier@...cle.com>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        Johannes Weiner <jweiner@...com>,
        Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH v3 2/2] mm,thp: Add experimental config option
 RO_EXEC_FILEMAP_HUGE_FAULT_THP



On 8/1/19 6:36 AM, Kirill A. Shutemov wrote:

>>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> -#define HPAGE_PMD_SHIFT PMD_SHIFT
>> -#define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
>> -#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))
>> -
>> -#define HPAGE_PUD_SHIFT PUD_SHIFT
>> -#define HPAGE_PUD_SIZE	((1UL) << HPAGE_PUD_SHIFT)
>> -#define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))
>> +#define HPAGE_PMD_SHIFT		PMD_SHIFT
>> +#define HPAGE_PMD_SIZE		((1UL) << HPAGE_PMD_SHIFT)
>> +#define HPAGE_PMD_OFFSET	(HPAGE_PMD_SIZE - 1)
>> +#define HPAGE_PMD_MASK		(~(HPAGE_PMD_OFFSET))
>> +
>> +#define HPAGE_PUD_SHIFT		PUD_SHIFT
>> +#define HPAGE_PUD_SIZE		((1UL) << HPAGE_PUD_SHIFT)
>> +#define HPAGE_PUD_OFFSET	(HPAGE_PUD_SIZE - 1)
>> +#define HPAGE_PUD_MASK		(~(HPAGE_PUD_OFFSET))
> 
> OFFSET vs MASK semantics can be confusing without reading the definition.
> We don't have anything similar for base page size, right (PAGE_OFFSET is
> completely different thing :P)?

I came up with the OFFSET definitions, the MASK definitions already existed
in huge_mm.h, e.g.:

#define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))

Is there different terminology you'd prefer to see me use here to clarify
this?


>> +#ifdef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
>> +extern vm_fault_t filemap_huge_fault(struct vm_fault *vmf,
>> +			enum page_entry_size pe_size);
>> +#endif
>> +
> 
> No need for #ifdef here.

I wanted to avoid referencing an extern that wouldn't exist if the config
option wasn't set; I can remove it.


>> +
>> +#ifndef	CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
>>   	if (PageSwapBacked(page)) {
>>   		__mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr);
>>   		if (PageTransHuge(page))
>> @@ -206,6 +208,13 @@ static void unaccount_page_cache_page(struct address_space *mapping,
>>   	} else {
>>   		VM_BUG_ON_PAGE(PageTransHuge(page), page);
>>   	}
>> +#else
>> +	if (PageSwapBacked(page))
>> +		__mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr);
>> +
>> +	if (PageTransHuge(page))
>> +		__dec_node_page_state(page, NR_SHMEM_THPS);
>> +#endif
> 
> Again, no need for #ifdef: the new definition should be fine for
> everybody.

OK, I can do that; I didn't want to unnecessarily eliminate the
VM_BUG_ON_PAGE(PageTransHuge(page)) call for everyone given this
is CONFIG experimental code.

> PageCompound() and PageTransCompound() are the same thing if THP is
> enabled at compile time.

> PageHuge() check here is looking out of place. I don't thing we can ever
> will see hugetlb pages here.

What I'm trying to do is sanity check that what the cache contains is a THP 
page. I added the PageHuge() check simply because PageTransCompound() is true 
for both THP and hugetlbfs pages, and there's no routine that returns true JUST 
for THP pages; perhaps there should be?

>> +		 *	+ the enbry is a page page with an order other than
> 
> Typo.

Thanks, fixed.

> 
>> +		 *	  HPAGE_PMD_ORDER
> 
> If you see unexpected page order in page cache, something went horribly
> wrong, right?

This routine's main function other than validation is to make sure the page 
cache has not been polluted between when we go out to read the large page and 
when the page is added to the cache (more on that coming up.) For example, the 
way I was able to tell readahead() was polluting future possible THP mappings is 
because after a buffered read I would typically see 52 (the readahead size) 
PAGESIZE pages for the next 2M range in the page cache.

>> +		 *	+ the page's index is not what we expect it to be
> 
> Same here.

Same rationale.

> 
>> +		 *	+ the page is not up-to-date
>> +		 *	+ the page is unlocked
> 
> Confused here.

These should never be true, but I wanted to double check for them anyway. I can 
eliminate the checks if we are satisfied these states can "never" happen for a 
cached page.

> 
> Do you expect caller to lock page before the check? If so, state it in the
> comment for the function.

It's my understanding that pages in the page cache should be locked, so I wanted 
to check for that.

This routine is validating entries we find in the page cache to see whether they 
are conflicts or valid cached THP pages.

> Wow. That's unreadable. Can we rewrite it something like (commenting each
> check):

I can definitely break it down into multiple checks; it is a bit dense, thus the 
comment but you're correct, it will read better if broken down more.


> You also need to check that VMA alignment is suitable for huge pages.
> See transhuge_vma_suitable().

I don't really care if the start of the VMA is suitable, just whether I can map
the current faulting page with a THP. As far as I know, there's nothing wrong
with mapping all the pages before the VMA hits a properly aligned bound with
PAGESIZE pages and then aligned chunks in the middle with THP.

>> +	if (unlikely(!(PageCompound(new_page)))) {
> 
> How can it happen?

That check was already removed for a pending v4, thanks. I wasn't sure if
__page_cache_alloc() could ever erroneously return a non-compound page so
I wanted to check for it.

>> +	__SetPageLocked(new_page);
> 
> Again?

This is the page that content was just read to; readpage() will unlock the page
when it is done with I/O, but the page needs to be locked before it's inserted
into the page cache.

>> +	/* did it get truncated? */
>> +	if (unlikely(new_page->mapping != mapping)) {
> 
> Hm. IIRC this path only reachable for just allocated page that is not
> exposed to anybody yet. How can it be truncated?

Matthew advised I duplicate the similar routine from filemap_fault(), but that 
may be because of the normal way pages get added to the cache, which I may need 
to modify my code to do.

>> +	ret = alloc_set_pte(vmf, NULL, hugepage);
> 
> It has to be
> 
> 	ret = alloc_set_pte(vmf, vmf->memcg, hugepage);
> 
> right?

I can make that change; originally alloc_set_pte() didn't use the second 
parameter at all when mapping a read-only page.

Even now, if the page isn't writable, it would only be dereferenced by a
VM_BUG_ON_PAGE() call if it's COW.

> It looks backwards to me. I believe the page must be in page cache
> *before* it got mapped.
> 
> I expect all sorts of weird bug due to races when the page is mapped but
> not visible via syscalls.

You may be correct.

My original thinking on this was that as a THP is going to be rarer and more 
valuable to the system, I didn't want to add it to the page cache until its 
contents had been fully read and it was mapped. Talking with Matthew it seems I 
may need to change to do things the same way as PAGESIZE pages, where the page 
is added to the cache prior to the readpage() call and we rely upon PageUptodate 
to see if the reads were successful.

My thinking had been if any part of reading a large page and mapping it had 
failed, the code could just put_page() the newly allocated page and fallback to 
mapping the page with PAGESIZE pages rather than add a THP to the cache.


>> +#ifndef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
> 
> IS_ENABLED()?
> 
>>   	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
>>   		goto out;
>> +#endif

This code short-circuits the address generation routine if the memory isn't DAX,
and if this code is enabled I need it not to goto out but rather fall through to
__thp_get_unmapped_area().

>> +	if ((prot & PROT_READ) && (prot & PROT_EXEC) &&
>> +		(!(prot & PROT_WRITE)) && (flags & MAP_PRIVATE) &&
> 
> Why require PROT_EXEC && PROT_READ. You only must ask for !PROT_WRITE.
> 
> And how do you protect against mprotect() later? Should you ask for
> ro-file instead?

My original goal was to only map program TEXT with THP, which means only
RO EXEC code, not just any non-writable address space.

If mprotect() is called, wouldn't the pages be COWed to PAGESIZE pages the
first time the area was written to? I may be way off on this assumption.

> All size considerations are already handled by thp_get_unmapped_area(). No
> need to duplicate it here.

Thanks, I'll remove them.

> You might want to add thp_ro_get_unmapped_area() that would check file for
> RO, before going for THP-suitable mapping.

Once again, the question is whether we want to make this just RO or RO + EXEC
to maintain my goal of just mapping program TEXT via THP. I'm willing to hear 
arguments either way.

> 
>> +		addr = thp_get_unmapped_area(file, addr, len, pgoff, flags);
>> +
>> +		if (addr && (!(addr & HPAGE_PMD_OFFSET)))
>> +			vm_maywrite = 0;
> 
> Oh. That's way too hacky. Better to ask for RO file instead.

I did that because the existing code just blindly sets VM_MAYWRITE and I 
obviously didn't want to, so making it a variable allowed me to shut it off if 
it was a THP mapping.


>> +#ifndef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
>>   		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
>> +#endif
>> +
> 
> Just remove it. Don't add more #ifdefs.

OK; once again I didn't want to remove the existing VM_BUG_ON_PAGE() call
because this was an experimental config for now.


>> +#ifndef CONFIG_RO_EXEC_FILEMAP_HUGE_FAULT_THP
>>   		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
>> +#endif
>> +
> 
> Ditto.

Same rationale.

Thanks for looking this over; I'm curious as to what others think about the need 
for an RO file and the issue of when the large page gets added to the page cache.

     -- Bill

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ