Message-Id: <e61f71d0-1427-4a30-6dc3-b94a8fd609e8@linux.ibm.com>
Date: Wed, 24 Apr 2019 12:33:11 +0200
From: Laurent Dufour <ldufour@...ux.ibm.com>
To: Jerome Glisse <jglisse@...hat.com>
Cc: akpm@...ux-foundation.org, mhocko@...nel.org, peterz@...radead.org,
kirill@...temov.name, ak@...ux.intel.com, dave@...olabs.net,
jack@...e.cz, Matthew Wilcox <willy@...radead.org>,
aneesh.kumar@...ux.ibm.com, benh@...nel.crashing.org,
mpe@...erman.id.au, paulus@...ba.org,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, hpa@...or.com,
Will Deacon <will.deacon@....com>,
Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
sergey.senozhatsky.work@...il.com,
Andrea Arcangeli <aarcange@...hat.com>,
Alexei Starovoitov <alexei.starovoitov@...il.com>,
kemi.wang@...el.com, Daniel Jordan <daniel.m.jordan@...cle.com>,
David Rientjes <rientjes@...gle.com>,
Ganesh Mahendran <opensource.ganesh@...il.com>,
Minchan Kim <minchan@...nel.org>,
Punit Agrawal <punitagrawal@...il.com>,
vinayak menon <vinayakm.list@...il.com>,
Yang Shi <yang.shi@...ux.alibaba.com>,
zhong jiang <zhongjiang@...wei.com>,
Haiyan Song <haiyanx.song@...el.com>,
Balbir Singh <bsingharora@...il.com>, sj38.park@...il.com,
Michel Lespinasse <walken@...gle.com>,
Mike Rapoport <rppt@...ux.ibm.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
haren@...ux.vnet.ibm.com, npiggin@...il.com,
paulmck@...ux.vnet.ibm.com, Tim Chen <tim.c.chen@...ux.intel.com>,
linuxppc-dev@...ts.ozlabs.org, x86@...nel.org,
Vinayak Menon <vinmenon@...eaurora.org>
Subject: Re: [PATCH v12 18/31] mm: protect against PTE changes done by
dup_mmap()
On 22/04/2019 at 22:32, Jerome Glisse wrote:
> On Tue, Apr 16, 2019 at 03:45:09PM +0200, Laurent Dufour wrote:
>> Vinayak Menon and Ganesh Mahendran reported that the following scenario may
>> lead to a thread being blocked due to data corruption:
>>
>> CPU 1                       CPU 2                       CPU 3
>> Process 1,                  Process 1,                  Process 1,
>> Thread A                    Thread B                    Thread C
>>
>> while (1) {                 while (1) {                 while (1) {
>>   pthread_mutex_lock(l)       pthread_mutex_lock(l)       fork
>>   pthread_mutex_unlock(l)     pthread_mutex_unlock(l)   }
>> }                           }
>>
>> In more detail, this happens because:
>>
>> CPU 1               CPU 2                       CPU 3
>> fork()
>>   copy_pte_range()
>>     set PTE rdonly
>> got to next VMA...
>>  .                  PTE is seen rdonly          PTE still writable
>>  .                  thread is writing to page
>>  .                  -> page fault
>>  .                  copy the page               Thread writes to page
>>  .                   .                          -> no page fault
>>  .                  update the PTE
>>  .                  flush TLB for that PTE
>> flush TLB                                       PTEs are now rdonly
>
> Should the fork be on CPU 3 to be consistent with the diagram above (just to
> make it easier to read and map one to the other, as a thread can move
> from one CPU to another)?
Sure, it's quite confusing this way ;)
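
For context, the reported scenario boils down to something like the
following userspace sketch. To be clear, this is only an illustration I am
adding for readers of this thread, not the reporters' actual test program,
and all the names are made up:

/* Illustrative only: threads A and B hammer a process-private pthread
 * mutex while thread C fork()s in a loop, which is the pattern that
 * exposed the COW race described above. Build with -pthread. */
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t l = PTHREAD_MUTEX_INITIALIZER;

static void *locker(void *arg)
{
        (void)arg;
        for (;;) {
                pthread_mutex_lock(&l);
                pthread_mutex_unlock(&l);
        }
        return NULL;
}

static void *forker(void *arg)
{
        (void)arg;
        for (;;) {
                pid_t pid = fork();
                if (pid == 0)
                        _exit(0);               /* child exits right away */
                else if (pid > 0)
                        waitpid(pid, NULL, 0);
        }
        return NULL;
}

int main(void)
{
        pthread_t a, b, c;

        pthread_create(&a, NULL, locker, NULL);
        pthread_create(&b, NULL, locker, NULL);
        pthread_create(&c, NULL, forker, NULL);
        pthread_join(a, NULL);                  /* never returns */
        return 0;
}

If a write to the mutex word is lost during the page copy, thread A or B
ends up blocked forever even though nobody holds the lock.
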
>>
>> So the write done by CPU 3 interferes with the page copy operation done
>> by CPU 2, leading to data corruption.
>>
>> To avoid this, we mark all the VMAs involved in the COW mechanism as
>> changing by calling vm_write_begin(). This ensures that the speculative
>> page fault handler will not try to handle a fault on these pages.
>> The marker remains set until the TLB is flushed, ensuring that all the
>> CPUs will see the PTEs as not writable.
>> Once the TLB is flushed, the marker is removed by calling vm_write_end().
>>
>> The variable last is used to keep track of the latest marked VMA, to
>> handle the error path where only part of the VMAs may have been marked.
>>
>> Since multiple VMAs from the same mm may have their sequence count
>> increased during this process, the use of vm_raw_write_begin/end() is
>> required to avoid false lockdep warnings.
>>
>> Reported-by: Ganesh Mahendran <opensource.ganesh@...il.com>
>> Reported-by: Vinayak Menon <vinmenon@...eaurora.org>
>> Signed-off-by: Laurent Dufour <ldufour@...ux.ibm.com>
>
> A minor comment (see below)
>
> Reviewed-by: Jérome Glisse <jglisse@...hat.com>
Thanks for the review, Jérôme.
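
For readers following along: vm_write_begin()/vm_write_end() in this
series are built on a per-VMA sequence count, and the _raw variants skip
the lockdep annotation, which is why they are needed when several VMAs of
the same mm are marked at once. Roughly, from memory of the earlier
patches of this set (a simplified sketch, not a verbatim copy):

/* Simplified sketch of the SPF write-side helpers. */
static inline void vm_write_begin(struct vm_area_struct *vma)
{
        write_seqcount_begin(&vma->vm_sequence);
}

static inline void vm_raw_write_begin(struct vm_area_struct *vma)
{
        /* No lockdep tracking: dup_mmap() write-locks many VMAs of
         * the same mm at the same time. */
        raw_write_seqcount_begin(&vma->vm_sequence);
}

static inline void vm_raw_write_end(struct vm_area_struct *vma)
{
        raw_write_seqcount_end(&vma->vm_sequence);
}

The speculative handler samples this seqcount before and after walking the
page tables and falls back to the classic path if it changed, so bumping
it here keeps SPF away from these VMAs until the TLB flush is done.
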
>> ---
>>  kernel/fork.c | 30 ++++++++++++++++++++++++++++--
>>  1 file changed, 28 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index f8dae021c2e5..2992d2c95256 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -462,7 +462,7 @@ EXPORT_SYMBOL(free_task);
>>  static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>                                       struct mm_struct *oldmm)
>>  {
>> -        struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
>> +        struct vm_area_struct *mpnt, *tmp, *prev, **pprev, *last = NULL;
>>          struct rb_node **rb_link, *rb_parent;
>>          int retval;
>>          unsigned long charge;
>> @@ -581,8 +581,18 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>                  rb_parent = &tmp->vm_rb;
>>
>>                  mm->map_count++;
>> -                if (!(tmp->vm_flags & VM_WIPEONFORK))
>> +                if (!(tmp->vm_flags & VM_WIPEONFORK)) {
>> +                        if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
>> +                                /*
>> +                                 * Mark this VMA as changing to prevent the
>> +                                 * speculative page fault handler from
>> +                                 * processing it until the TLB is flushed below.
>> +                                 */
>> +                                last = mpnt;
>> +                                vm_raw_write_begin(mpnt);
>> +                        }
>>                          retval = copy_page_range(mm, oldmm, mpnt);
>> +                }
>>
>>                  if (tmp->vm_ops && tmp->vm_ops->open)
>>                          tmp->vm_ops->open(tmp);
>> @@ -595,6 +605,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>>  out:
>>          up_write(&mm->mmap_sem);
>>          flush_tlb_mm(oldmm);
>> +
>> +        if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
>
> You do not need to check for CONFIG_SPECULATIVE_PAGE_FAULT, as last
> will always be NULL if it is not enabled, but maybe the compiler will
> miss the optimization opportunity if you only have the for() loop
> below.
I didn't check the generated code; perhaps the compiler will optimize
that correctly.
This being said, I think the if block is better for code readability,
highlighting that this block is only needed in the SPF case.
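
For what it's worth, IS_ENABLED() expands to a compile-time constant, so I
would expect the whole block to be discarded when SPF is not configured.
A quick illustration (my understanding, not checked against the generated
assembly; the flag checks are omitted for brevity):

/* IS_ENABLED(CONFIG_FOO) evaluates to 1 when CONFIG_FOO is set to 'y'
 * or 'm', and to 0 otherwise, so the condition below is a constant and
 * the dead branch can be dropped by the compiler, even though 'last'
 * stays in scope. */
if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) {
        for (; last; last = last->vm_prev)
                vm_raw_write_end(last);
}
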
>> +                /*
>> +                 * Since the TLB has been flushed, we can safely unmark the
>> +                 * copied VMAs and allow the speculative page fault handler
>> +                 * to process them again.
>> +                 * Walk back the VMA list from the last marked VMA.
>> +                 */
>> +                for (; last; last = last->vm_prev) {
>> +                        if (last->vm_flags & VM_DONTCOPY)
>> +                                continue;
>> +                        if (!(last->vm_flags & VM_WIPEONFORK))
>> +                                vm_raw_write_end(last);
>> +                }
>> +        }
>> +
>>          up_write(&oldmm->mmap_sem);
>>          dup_userfaultfd_complete(&uf);
>>  fail_uprobe_end:
>> --
>> 2.21.0
>>
>