linux-kernel - Re: [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty page state

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20201121024922.GA1363491@google.com>
Date:   Fri, 20 Nov 2020 19:49:22 -0700
From:   Yu Zhao <yuzhao@...gle.com>
To:     Will Deacon <will@...nel.org>
Cc:     linux-kernel@...r.kernel.org, kernel-team@...roid.com,
        Catalin Marinas <catalin.marinas@....com>,
        Minchan Kim <minchan@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Anshuman Khandual <anshuman.khandual@....com>,
        linux-mm@...ck.org, linux-arm-kernel@...ts.infradead.org
Subject: Re: [PATCH 4/6] mm: proc: Invalidate TLB after clearing soft-dirty
 page state

On Fri, Nov 20, 2020 at 01:22:53PM -0700, Yu Zhao wrote:
> On Fri, Nov 20, 2020 at 02:35:55PM +0000, Will Deacon wrote:
> > Since commit 0758cd830494 ("asm-generic/tlb: avoid potential double flush"),
> > TLB invalidation is elided in tlb_finish_mmu() if no entries were batched
> > via the tlb_remove_*() functions. Consequently, the page-table modifications
> > performed by clear_refs_write() in response to a write to
> > /proc/<pid>/clear_refs do not perform TLB invalidation. Although this is
> > fine when simply aging the ptes, in the case of clearing the "soft-dirty"
> > state we can end up with entries where pte_write() is false, yet a
> > writable mapping remains in the TLB.
> 
> I don't think we need a TLB flush in this context, same reason as we
> don't have one in copy_present_pte() which uses ptep_set_wrprotect()
> to write-protect a src PTE.
> 
> ptep_modify_prot_start/commit() and ptep_set_wrprotect() guarantee
> either the dirty bit is set (when a PTE is still writable) or a PF
> happens (when a PTE has become r/o) when h/w page table walker races
> with kernel that modifies a PTE using the two APIs.

After we remove the writable bit, if we end up with a clean PTE, any
subsequent write will trigger a page fault. We can't have a stale
writable tlb entry. The architecture-specific APIs guarantee this.

If we end up with a dirty PTE, then yes, there will be a stale
writable tlb entry. But this won't be a problem because when we
write-protect a page (not PTE), we always check both pte_dirty()
and pte_write(), i.e., write_protect_page() and page_mkclean_one().
When they see this dirty PTE, they will flush. And generally, only
callers of pte_mkclean() should flush tlb; otherwise we end up one
extra if callers of pte_mkclean() and pte_wrprotect() both flush.

Now let's take a step back and see why we got
tlb_gather/finish_mmu() here in the first place. Commit b3a81d0841a95
("mm: fix KSM data corruption") explains the problem clearly. But
to fix a problem created by two threads clearing pte_write() and
pte_dirty() independently, we only need one of them to set
mm_tlb_flush_pending(). Given only removing the writable bit requires
tlb flush, that thread should be the one, as I just explained. Adding
tlb_gather/finish_mmu() is unnecessary in that fix. And there is no
point in having the original flush_tlb_mm() either, given data
integrity is already guaranteed. Of course, with it we have more
accurate access tracking.

Does a similar problem exist for page_mkclean_one()? Possibly. It
checks pte_dirty() and pte_write() but not mm_tlb_flush_pending().
At the moment, madvise_free_pte_range() only supports anonymous
memory, which doesn't do writeback. But the missing
mm_tlb_flush_pending() just seems to be an accident waiting to happen.
E.g., clean_record_pte() calls pte_mkclean() and does batched flush.
I don't know what it's for, but if it's called on file VMAs, a similar
race involving 4 CPUs can happen. This time CPU 1 runs
clean_record_pte() and CPU 3 runs page_mkclean_one().