[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Pine.LNX.4.64.0704132108430.14740@blonde.wat.veritas.com>
Date: Fri, 13 Apr 2007 21:24:55 +0100 (BST)
From: Hugh Dickins <hugh@...itas.com>
To: Zachary Amsden <zach@...are.com>
cc: David Rientjes <rientjes@...gle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
linux-kernel@...r.kernel.org
Subject: Re: [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young}
On Fri, 13 Apr 2007, Zachary Amsden wrote:
> Hugh Dickins wrote:
> > Zach, while looking at your recent patches, I ran across the comment
> > on pte_update_defer, and where it was being used, and now think that
> > David's patch is actually incorrect. Previously pte_update_defer
> > was being used where a flush_tlb_page followed immediately after
> > within the same macro; with David's patch, mm's clear_refs_pte_range
> > is calling ptep_test_and_clear_young (including pte_update_defer) on
> > several ptes, then unlocking the page table, and later flushing TLB.
> > That's exactly wrong for pte_update_defer, isn't it?
> >
>
> Ok, disregard most of my last e-mail.
Phew! That's a lot quicker than digesting it ;)
But thanks for going to so much trouble.
> It is fine to decouple the flush from
> the update, as long as they stay close enough that you can reason they happen
> together. I guess I hadn't seen the other parts of the patch which release
> the page table spinlock in between the two, and somehow missed it again when
> responding to the above as I got too excited explaining why the decoupling is
> ok. It is not ok to release the spinlock when using shadow page tables on
> SMP. There are some rather complex races that can result. Here's one case:
>
> CPU-0 CPU-1
> ----------------------- ---------------------------
> test_and_clear_dirty(x)
> spin_unlock(ptl)
> write address mapped by X
> (harware updates dirty bit)
> spin_lock(ptl)
> set_pte_wrprotect(x)
> flush
> flush
>
> Now, the write protected pte which maps a dirty page gets broken in two ways;
> it is unclear if dirty bit or entiry PTE from CPU-0 is deferred until flush,
> so either write protected PTE for modified page loses the dirty bit (BAD!), or
> write protected PTE loses both dirty and write protect bits (VERY BAD!).
>
> To prevent this, we need a flush before dropping the spinlock. If that gets
> too complicated, we can drop the defer logic and just use pte_update instead,
> which notifies the hypervisor immediately of the mapping change.
David (clear_refs_pte_range) is only using ptep_test_and_clear_young,
though he did change the ptep_test_and_clear_dirty definition to be
consistent with it. old/young is never so serious as clean/dirty, so
it may be that there's very little problem with what's in there now;
it just becomes a shame if the wrong decision gets made too often e.g.
if the misflushing is such that his clear_youngs never really take
effect. I simply cannot tell whether or not that's the case myself.
But once the pte_update_defers get moved away from their flush_tlb_pages,
as is the case now, it feels like we're on thin ice.
Actually, I don't really get pte_update_defer at all: I can understand
wanting to defer a call down, but it appears to be a call down to say
we're deferring the shadow update? Why not just do it with pte_update?
I'd be happier without it until this could be restructured more safely
(but my incomprehension is not necessarily the best guiding principle).
Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists