Date:   Thu, 15 Oct 2020 20:59:09 -0700
From:   Mike Kravetz <mike.kravetz@...cle.com>
To:     HORIGUCHI NAOYA (堀口 直也) 
        <naoya.horiguchi@....com>
Cc:     "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Hugh Dickins <hughd@...gle.com>,
        Michal Hocko <mhocko@...nel.org>,
        Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
        "Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        "Kirill A.Shutemov" <kirill.shutemov@...ux.intel.com>,
        Davidlohr Bueso <dave@...olabs.net>,
        Prakash Sangappa <prakash.sangappa@...cle.com>,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [RFC PATCH 2/3] hugetlbfs: introduce hinode_rwsem for pmd sharing
 synchronization

On 10/15/20 4:05 PM, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Tue, Oct 13, 2020 at 04:10:59PM -0700, Mike Kravetz wrote:
>> Due to pmd sharing, the huge PTE pointer returned by huge_pte_alloc
>> may not be valid.  This can happen if a call to huge_pmd_unshare for
>> the same pmd is made in another thread.
>>
>> To address this issue, add a rw_semaphore (hinode_rwsem) to the hugetlbfs
>> inode.
>> - hinode_rwsem is taken in read mode before calling huge_pte_alloc, and
>>   held until finished with the returned pte pointer.
>> - hinode_rwsem is held in write mode whenever huge_pmd_unshare is called.
>>
>> In the locking hierarchy, hinode_rwsem must be taken before a page lock.
>>
>> In an effort to minimize performance impacts, hinode_rwsem is not taken
>> if the caller knows the target cannot possibly be part of a shared pmd.
>> lockdep_assert calls are added to huge_pmd_share and huge_pmd_unshare to
>> help catch callers not using the proper locking.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@...cle.com>
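
(To illustrate the protocol above, the read-side pattern in the fault
path ends up looking roughly like the sketch below.  hinode_lock_read()
is the helper added by this patch; hinode_unlock_read() is assumed to be
its counterpart.)

	mapping = vma->vm_file->f_mapping;

	hinode_lock_read(mapping, vma, haddr);
	ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
	if (!ptep) {
		hinode_unlock_read(mapping, vma, haddr);
		return VM_FAULT_OOM;
	}
	/*
	 * ptep remains valid here: huge_pmd_unshare() callers must hold
	 * hinode_rwsem in write mode, so unsharing is excluded until we
	 * drop the semaphore.
	 */
	/* ... fault handling using ptep ... */
	hinode_unlock_read(mapping, vma, haddr);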
> 
> Hi Mike,
> 
> I didn't find a problem with the main idea of introducing hinode_rwsem,
> so I'm fine once the known problems are fixed.

Thank you for taking a look, Naoya!

I have been trying to address these race issues for some time; they have
been present since the pmd sharing code was introduced.  Fortunately, the
races are not easy to hit.  However, targeted test programs can trigger
BUGs.
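
The window those tests hit is essentially:

    faulting thread                         unmapping thread
    ---------------                         ----------------
    ptep = huge_pte_alloc(mm, haddr, sz)
                                            huge_pmd_unshare()
                                              /* shared pmd page unmapped,
                                                 may be freed; ptep now
                                                 points into it (stale) */
    entry = huge_ptep_get(ptep)
      /* dereference of a stale page table pointer -> BUG */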

> I have one question.  This patch seems to make sure that huge_pmd_unshare()
> is called while holding hinode_rwsem in write mode in some cases.  Some
> callers of try_to_unmap(), like shrink_page_list() and unmap_page(), seem
> not to hold it, which is OK because they never call try_to_unmap() for
> hugetlb pages.  And unmap_ref_private() doesn't take hinode_rwsem either,
> which is also OK because that function never handles the pmd sharing case.
> So what about unmap_single_vma()?  It seems that this generic function
> could reach huge_pmd_unshare() without hinode_rwsem, so what prevents the
> race there?  (Maybe I am missing some assumption or condition around this
> race...)

You are not missing anything.  I mistakenly left out the locking code in
unmap_single_vma().  If I had run some tests with lockdep enabled, the new
lock checking code would have caught it.
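
For the next revision, I expect the hugetlb branch of unmap_single_vma()
to end up looking something like this (sketch only; hinode_lock_write()
and hinode_unlock_write() are assumed write-mode counterparts of the
helpers in this patch):

	if (unlikely(is_vm_hugetlb_page(vma))) {
		if (vma->vm_file) {
			struct address_space *mapping = vma->vm_file->f_mapping;

			/*
			 * __unmap_hugepage_range_final() can call
			 * huge_pmd_unshare(), so hinode_rwsem must be
			 * held in write mode across the call.
			 */
			hinode_lock_write(mapping, vma, start);
			i_mmap_lock_write(mapping);
			__unmap_hugepage_range_final(tlb, vma, start, end, NULL);
			i_mmap_unlock_write(mapping);
			hinode_unlock_write(mapping, vma, start);
		}
	}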

> 
> I left a few other minor comments below ...

I will address the issues below in the next revision.

Thanks again for taking a look.
-- 
Mike Kravetz

> 
>> @@ -4424,6 +4442,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>  
>>  	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
>>  	if (ptep) {
>> +		/*
>> +		 * Since we hold no locks, ptep could be stale.  That is
>> +		 * OK as we are only making decisions based on content and
>> +		 * not actually modifying content here.
>> +		 */
> 
> nice comment, thank you.
> 
>>  		entry = huge_ptep_get(ptep);
>>  		if (unlikely(is_hugetlb_entry_migration(entry))) {
>>  			migration_entry_wait_huge(vma, mm, ptep);
>> @@ -4431,20 +4454,32 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>>  		} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
>>  			return VM_FAULT_HWPOISON_LARGE |
>>  				VM_FAULT_SET_HINDEX(hstate_index(h));
>> -	} else {
>> -		ptep = huge_pte_alloc(mm, haddr, huge_page_size(h));
>> -		if (!ptep)
>> -			return VM_FAULT_OOM;
>>  	}
>>  
>> +	/*
>> +	 * Acquire hinode_sem before calling huge_pte_alloc and hold
> 
>                    hinode_rwsem?
> 
>> +	 * until finished with ptep.  This prevents huge_pmd_unshare from
>> +	 * being called elsewhere and making the ptep no longer valid.
>> +	 *
>> +	 * ptep could have already been assigned via huge_pte_offset.  That
>> +	 * is OK, as huge_pte_alloc will return the same value unless
>> +	 * something has changed.
>> +	 */
> 
> ... 
> 
>> @@ -278,10 +278,14 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>>  		BUG_ON(dst_addr >= dst_start + len);
>>  
>>  		/*
>> -		 * Serialize via hugetlb_fault_mutex
>> +		 * Serialize via hinode_rwsem hugetlb_fault_mutex.
>                                              ^ "and" here?
> 
>> +		 * hinode_rwsem ensures the dst_pte remains valid even
>> +		 * in the case of shared pmds.  fault mutex prevents
>> +		 * races with other faulting threads.
>>  		 */
>>  		idx = linear_page_index(dst_vma, dst_addr);
>>  		mapping = dst_vma->vm_file->f_mapping;
>> +		hinode_lock_read(mapping, dst_vma, dst_addr);
>>  		hash = hugetlb_fault_mutex_hash(mapping, idx);
>>  		mutex_lock(&hugetlb_fault_mutex_table[hash]);
> 
> 
> Thanks,
> Naoya Horiguchi
> 
