lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Fri, 9 Oct 2020 17:25:30 -0700
From:   Mike Kravetz <mike.kravetz@...cle.com>
To:     Hugh Dickins <hughd@...gle.com>
Cc:     Qian Cai <cai@....pw>, js1304@...il.com,
        Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, kernel-team@....com,
        Vlastimil Babka <vbabka@...e.cz>,
        Christoph Hellwig <hch@...radead.org>,
        Roman Gushchin <guro@...com>,
        Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
        Michal Hocko <mhocko@...e.com>,
        Joonsoo Kim <iamjoonsoo.kim@....com>
Subject: Re: [PATCH v3 7/8] mm/mempolicy: use a standard migration target
 allocation callback

On 10/9/20 3:23 PM, Hugh Dickins wrote:
> On Fri, 9 Oct 2020, Mike Kravetz wrote:
>> On 10/8/20 10:50 PM, Hugh Dickins wrote:
>>>
>>> It's a problem I've faced before in tmpfs, keeping a hold on the
>>> mapping while page lock is dropped.  Quite awkward: igrab() looks as
>>> if it's the right thing to use, but turns out to give no protection
>>> against umount.  Last time around, I ended up with a stop_eviction
>>> count in the shmem inode, which shmem_evict_inode() waits on if
>>> necessary.  Something like that could be done for hugetlbfs too,
>>> but I'd prefer to do it without adding extra, if there is a way.
>>
>> Thanks.
> 
> I failed to come up with anything neater than a stop_eviction count
> in the hugetlbfs inode.
> 
> But that's not unlike a very special purpose rwsem added into the
> hugetlbfs inode: and now that you're reconsidering how i_mmap_rwsem
> got repurposed, perhaps you will end up with an rwsem of your own in
> the hugetlbfs inode.

We have this in the Oracle UEK kernel.
https://github.com/oracle/linux-uek/commit/89260f55df008bb01c841714fb2ad26c300c8c88
Usage has been expanded to cover more cases.

When I proposed the same upstream, the suggestion was to try and use
i_mmap_rwsem.  That is what I have been trying to do.  Certainly, a
hugetlbfs specific rwsem is easier to manage and less disruptive.

> So I won't distract you with a stop_eviction patch unless you ask:
> that's easy, what you're looking into is hard - good luck!

Thanks Hugh!

>> In parallel to working the locking issue, I am also exploring enhanced
>> backout code to handle all the needed cases.

I'm now confident that we need this or some other locking in place.  Why?

I went back to a code base before locking changes.  My idea was to check
for races and backout changes as necessary.  Page fault handling code will
do something like this:

ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
... do some stuff, possibly allocate a page ...
ptl = huge_pte_lock(h, mm, ptep);

Because of pmd sharing, we can not be sure the ptep is valid until
after holding the ptl.  So, I was going to add something like this
after obtaining the ptl.

while(ptep != huge_pte_offset(mm, haddr, huge_page_size(h))) {
	spin_unlock(ptl);
	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
	if (!ptep) {
		ret = VM_FAULT_OOM;
		goto backout_unlocked;
	}
	ptl = huge_pte_lock(h, mm, ptep);
}

Unfortunately, the problem is worse than I thought.  I knew the PMD
pointed to by the ptep could be unshared before attempting to acquire
the ptl.  However, it is possible that the page which was the PMD may
be repurposed before we even try to acquire the ptl.  Since the ptl is
a spinlock within the struct page of what was the PMD, it may no longer
be valid.  I actually experienced addressing exceptions in the spinlock
code within huge_pte_lock. :(

So, it seems that we need some way to prevent PMD unsharing between the
huge_pte_alloc() and huge_pte_lock().  It is going to have to be
i_mmap_rwsem or some other rwsem.
-- 
Mike Kravetz

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ