[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4df4ef0c0801181435y67fee713h83f8e7f2a5b4e803@mail.gmail.com>
Date: Sat, 19 Jan 2008 01:35:25 +0300
From: "Anton Salikhmetov" <salikhmetov@...il.com>
To: "Linus Torvalds" <torvalds@...ux-foundation.org>
Cc: "Miklos Szeredi" <miklos@...redi.hu>, peterz@...radead.org,
linux-mm@...ck.org, jakob@...hought.net,
linux-kernel@...r.kernel.org, valdis.kletnieks@...edu,
riel@...hat.com, ksm@...dk, staubach@...hat.com,
jesper.juhl@...il.com, akpm@...ux-foundation.org,
protasnb@...il.com, r.e.wolff@...wizard.nl,
hidave.darkstar@...il.com, hch@...radead.org
Subject: Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files
2008/1/19, Linus Torvalds <torvalds@...ux-foundation.org>:
>
>
> On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
> >
> > The page_check_address() function is called from the
> > page_mkclean_one() routine as follows:
>
> .. and the page_mkclean_one() function is totally different.
>
> Lookie here, this is the correct and complex sequence:
>
> > entry = ptep_clear_flush(vma, address, pte);
> > entry = pte_wrprotect(entry);
> > entry = pte_mkclean(entry);
> > set_pte_at(mm, address, pte, entry);
>
> That's a rather expensive sequence, but it's done exactly because it has
> to be done that way. What it does is to
>
> - *atomically* load the pte entry _and_ clear the old one in memory.
>
> That's the
>
> entry = ptep_clear_flush(vma, address, pte);
>
> thing, and it basically means that it's doing some
> architecture-specific magic to make sure that another CPU that accesses
> the PTE at the same time will never actually modify the pte (because
> it's clear and not valid)
>
> - it then - while the page table is actually clear and invalid - takes
> the old value and turns it into the new one:
>
> entry = pte_wrprotect(entry);
> entry = pte_mkclean(entry);
>
> - and finally, it replaces the entry with the new one:
>
> set_pte_at(mm, address, pte, entry);
>
> which takes care to write the new entry in some specific way that is
> atomic wrt other CPU's (ie on 32-bit x86 with a 64-bit page table
> entry it writes the high word first, see the write barriers in
> "native_set_pte()" in include/asm-x86/pgtable-3level.h
>
> Now, compare that subtle and correct thing with what is *not* correct:
>
> if (pte_dirty(*pte) && pte_write(*pte))
> *pte = pte_wrprotect(*pte);
>
> which makes no effort at all to make sure that it's safe in case another
> CPU updates the accessed bit.
>
> Now, arguably it's unlikely to cause horrible problems at least on x86,
> because:
>
> - we only do this if the pte is already marked dirty, so while we can
> lose the accessed bit, we can *not* lose the dirty bit. And the
> accessed bit isn't such a big deal.
>
> - it's not doing any of the "be careful about" ordering things, but since
> the really important bits aren't changing, ordering probably won't
> practically matter.
>
> But the problem is that we have something like 24 different architectures,
> it's hard to make sure that none of them have issues.
>
> In other words: it may well work in practice. But when these things go
> subtly wrong, they are *really* nasty to find, and the unsafe sequence is
> really not how it's supposed to be done. For example, you don't even flush
> the TLB, so even if there are no cross-CPU issues, there's probably going
> to be writable entries in the TLB that now don't match the page tables.
>
> Will it matter? Again, probably impossible to see in practice. But ...
Linus, I am very grateful to you for your extremely clear explanation
of the issue I have overlooked!
Back to the msync() issue, I'm going to come back with a new design
for the bug fix.
Thank you once again.
Anton
>
> Linus
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists