linux-kernel - Re: [PATCH 2/3] mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170728084634.foo3wjhsyydml6yj@techsingularity.net>
Date:   Fri, 28 Jul 2017 09:46:34 +0100
From:   Mel Gorman <mgorman@...hsingularity.net>
To:     Minchan Kim <minchan@...nel.org>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        kernel-team <kernel-team@....com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Rik van Riel <riel@...hat.com>,
        Ingo Molnar <mingo@...hat.com>, x86@...nel.org,
        Russell King <linux@...linux.org.uk>,
        linux-arm-kernel@...ts.infradead.org,
        Tony Luck <tony.luck@...el.com>, linux-ia64@...r.kernel.org,
        Martin Schwidefsky <schwidefsky@...ibm.com>,
        "David S. Miller" <davem@...emloft.net>,
        Heiko Carstens <heiko.carstens@...ibm.com>,
        linux-s390@...r.kernel.org,
        Yoshinori Sato <ysato@...rs.sourceforge.jp>,
        linux-sh@...r.kernel.org, Jeff Dike <jdike@...toit.com>,
        user-mode-linux-devel@...ts.sourceforge.net,
        linux-arch@...r.kernel.org, Nadav Amit <nadav.amit@...il.com>
Subject: Re: [PATCH 2/3] mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem

On Fri, Jul 28, 2017 at 03:41:51PM +0900, Minchan Kim wrote:
> Nadav reported parallel MADV_DONTNEED on same range has a stale TLB
> problem and Mel fixed it[1] and found same problem on MADV_FREE[2].
> 
> Quote from Mel Gorman
> 
> "The race in question is CPU 0 running madv_free and updating some PTEs
> while CPU 1 is also running madv_free and looking at the same PTEs.
> CPU 1 may have writable TLB entries for a page but fail the pte_dirty
> check (because CPU 0 has updated it already) and potentially fail to flush.
> Hence, when madv_free on CPU 1 returns, there are still potentially writable
> TLB entries and the underlying PTE is still present so that a subsequent write
> does not necessarily propagate the dirty bit to the underlying PTE any more.
> Reclaim at some unknown time at the future may then see that the PTE is still
> clean and discard the page even though a write has happened in the meantime.
> I think this is possible but I could have missed some protection in madv_free
> that prevents it happening."
> 
> This patch aims for solving both problems all at once and is ready for
> other problem with KSM, MADV_FREE and soft-dirty story[3].
> 
> TLB batch API(tlb_[gather|finish]_mmu] uses [set|clear]_tlb_flush_pending
> and mmu_tlb_flush_pending so that when tlb_finish_mmu is called, we can catch
> there are parallel threads going on. In that case, flush TLB to prevent
> for user to access memory via stale TLB entry although it fail to gather
> pte entry.
> 
> I confiremd this patch works with [4] test program Nadav gave so this patch
> supersedes "mm: Always flush VMA ranges affected by zap_page_range v2"
> in current mmotm.
> 
> NOTE:
> This patch modifies arch-specific TLB gathering interface(x86, ia64,
> s390, sh, um). It seems most of architecture are straightforward but s390
> need to be careful because tlb_flush_mmu works only if mm->context.flush_mm
> is set to non-zero which happens only a pte entry really is cleared by
> ptep_get_and_clear and friends. However, this problem never changes the
> pte entries but need to flush to prevent memory access from stale tlb.
> 
> Any thoughts?
> 

The cc list is somewhat ..... extensive, given the topic. Trim it if
there is another version.

> index 3f2eb76243e3..8c26961f0503 100644
> --- a/arch/arm/include/asm/tlb.h
> +++ b/arch/arm/include/asm/tlb.h
> @@ -163,13 +163,26 @@ tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long start
>  #ifdef CONFIG_HAVE_RCU_TABLE_FREE
>  	tlb->batch = NULL;
>  #endif
> +	set_tlb_flush_pending(tlb->mm);
>  }
>  
>  static inline void
>  tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
>  {
> -	tlb_flush_mmu(tlb);
> +	/*
> +	 * If there are parallel threads are doing PTE changes on same range
> +	 * under non-exclusive lock(e.g., mmap_sem read-side) but defer TLB
> +	 * flush by batching, a thread has stable TLB entry can fail to flush
> +	 * the TLB by observing pte_none|!pte_dirty, for example so flush TLB
> +	 * if we detect parallel PTE batching threads.
> +	 */
> +	if (mm_tlb_flush_pending(tlb->mm, false) > 1) {
> +		tlb->range_start = start;
> +		tlb->range_end = end;
> +	}
>  
> +	tlb_flush_mmu(tlb);
> +	clear_tlb_flush_pending(tlb->mm);
>  	/* keep the page table cache within bounds */
>  	check_pgt_cache();
>  

mm_tlb_flush_pending shouldn't be taking a barrier specific arg. I expect
this to change in the future and cause a conflict. At least I think in
this context, it's the conditional barrier stuff.

That aside, it's very unfortunate that the return value of
mm_tlb_flush_pending really matters. Knowing why 1 is magic there requires
knowledge of the internals on a per-arch basis which is a bit nuts.
Consider renaming this to mm_tlb_flush_parallel() to return true if there
is a nr_pending > 1 with comments explaining why. I don't think any of
the callers expect a nr_pending of 0 ever. That removes some knowledge of
the specifics.

The arch-specific changes to tlb_gather_mmu are almost all identical.
It's a little tricky to split the arch layer and core mm to have all
the set/clear of mm_tlb_flush_pending handled by the core mm.  It's not
required but it would be preferred. The set one is obvious. rename
tlb_gather_mmu to arch_tlb_gather_mmu (including the generic implementation)
and create a tlb_gather_mmu alias that calls arch_tlb_gather_mmu and
set_tlb_flush_pending.

The clear is not as straight-forward but can be done by creating a new
arch helper that handles this hunk on a per-arch basis

> +     if (mm_tlb_flush_pending(tlb->mm, false) > 1) {
> +             tlb->start = start;
> +             tlb->end = end;
> +     }

It'll be churn initially but it means any different handling in the TLB
batching area will be mostly a core concern.

-- 
Mel Gorman
SUSE Labs