linux-kernel - Re: [patch -mm] i386: use pte_update_defer in ptep_test_and_clear

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <461FEF1C.5010806@vmware.com>
Date:	Fri, 13 Apr 2007 13:59:08 -0700
From:	Zachary Amsden <zach@...are.com>
To:	Hugh Dickins <hugh@...itas.com>
CC:	David Rientjes <rientjes@...gle.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org
Subject: Re: [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young}

Hugh Dickins wrote:
> David (clear_refs_pte_range) is only using ptep_test_and_clear_young,
> though he did change the ptep_test_and_clear_dirty definition to be
> consistent with it.  old/young is never so serious as clean/dirty, so
> it may be that there's very little problem with what's in there now;
> it just becomes a shame if the wrong decision gets made too often e.g.
> if the misflushing is such that his clear_youngs never really take
> effect.  I simply cannot tell whether or not that's the case myself.
>   

If the logic is too difficult to reason about without entering ganda 
bherundasana, then it is simpler to just drop the _defer suffix.  Old / 
young actually is just as serious, not because A-bit is critical 
(although misvirtualizing it will do strange things to the page 
clocking), but because the entire PTE update is delayed - and thus an 
intervening PTE update which remaps the physical page can come in before 
the A-bit update is flushed.  The A-bit update carries the PPN of the 
original physical page, destroying the remap.

> But once the pte_update_defers get moved away from their flush_tlb_pages,
> as is the case now, it feels like we're on thin ice.
>   

Yes, that is why I think I did the original hack of having #defines 
without implementations - to ensure the flush stayed coupled with the 
use of pte_update_defer.

> Actually, I don't really get pte_update_defer at all: I can understand
> wanting to defer a call down, but it appears to be a call down to say
> we're deferring the shadow update?  Why not just do it with pte_update?
> I'd be happier without it until this could be restructured more safely
> (but my incomprehension is not necessarily the best guiding principle).
>   

So pte_update does the update immediately; it is always safe.  For 
select cases, where there is an immediate pte_update + flush 
combination, it is beneficial to combine the two into one hypercall, 
saving the several thousand cycle round trip into the hypervisor from 
happening twice.  This is why pte_update_defer was used close to the 
flushes.  I thought David was just defining the ptep_test_and_clear_xxx 
functions properly; I didn't realize he was actually using them to 
decouple the flush across a spinlock or logically complex code region.  
If that is the case, I agree pte_update instead of defer is absolutely 
the right thing to do.

I'm a little worried about the TLB consistency here, on real hardware.  
If you drop the spinlock, you could end up preempted.  If the preemption 
runs a kernel thread, which accesses the same memory for which you just 
modified the page table mappings, the TLB could still have a stale 
entry.  So, a clear of a dirty bit might happen in a writeback thread, 
but get preempted before the flush, then the hardware TLB still has a 
dirty entry.  This means writes through the same virtual address will 
skip the A/D bit update in the pte.  Now, for regular user memory 
access, that might not be a problem, but if it is the kernel touching 
user memory through kmap, the kernel virtual address might well ..  I'm 
off in the weeds.  But still slightly concerned about leaving that TLB 
entry stale across a possible preemption.

As an aside, in the tree I'm looking at, I don't see any users of 
ptep_clear_flush_dirty at all.. doesn't that mean writeback which 
undirties pages will leave the PTEs with latent dirty bits, causing them 
to get written back again when the process exits?  Or does rmap take 
care of this.  Ahh, I just discovered page_mkclean_one.  Never mind.  
But still no users of ptep_clear_flush_dirty.

Zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/