Message-ID: <20090312093335.6dd67251@skybase>
Date: Thu, 12 Mar 2009 09:33:35 +0100
From: Martin Schwidefsky <schwidefsky@...ibm.com>
To: Matt Mackall <mpm@...enic.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Gerald Schaefer <gerald.schaefer@...ibm.com>,
akpm@...ux-foundation.org
Subject: Re: [PATCH] fix/improve generic page table walker

On Wed, 11 Mar 2009 12:24:23 -0500
Matt Mackall <mpm@...enic.com> wrote:

> On Wed, 2009-03-11 at 14:49 +0100, Martin Schwidefsky wrote:
> > From: Martin Schwidefsky <schwidefsky@...ibm.com>
> >
> > On s390 the /proc/pid/pagemap interface is currently broken. This is
> > caused by the unconditional loop over all pgd/pud entries as specified
> > by the address range passed to walk_page_range. The tricky bit here
> > is that the pgd++ in the outer loop may only be done if the page table
> > really has 4 levels. For the pud++ in the second loop the page table needs
> > to have at least 3 levels. With the dynamic page tables on s390 we can have
> > page tables with 2, 3 or 4 levels. Which means that the pgd and/or the
> > pud pointer can get out-of-bounds causing all kinds of mayhem.
>
> Not sure why this should be a problem without delving into the S390
> code. After all, x86 has 2, 3, or 4 levels as well (at compile time) in
> a way that's transparent to the walker.

It's hard to understand without looking at the s390 details. The main
difference between x86 and s390 in that respect is that on s390 the
number of page table levels is determined at runtime on a per-process
basis. A compat process uses 2 levels; a 64-bit process starts with 3
levels and can "upgrade" to 4 levels if something gets mapped above
4TB. This means that a *pgd can point to a region-second table (2**53
bytes), a region-third table (2**42 bytes) or a segment table (2**31
bytes), and a *pud can point to a region-third table or a segment
table. The page table primitives know about these semantics; in
particular, pud_offset and pmd_offset check the type of the page table
pointed to by *pgd and *pud and do nothing with the pointer if it is a
lower-level page table.

The only operation I cannot "patch" is the pgd++/pud++ operation.
The current implementation requires that the address bits of the
non-existent higher-order page tables in the page table walkers are
zero. This is where the vmas come into play. If there is a vma, then it
is guaranteed that all the levels needed to cover the addresses in the
vma are allocated.

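To make the level arithmetic concrete, here is a small user-space
sketch. It is not s390 kernel code: span_for_levels(), levels_needed()
and walk_addr_ok() are hypothetical names made up for illustration, and
the only facts taken from above are the three sizes (2**31, 2**42,
2**53) and the rule that the address bits above the top-level table
must be zero.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Address space covered by a page table with the given number of
 * levels, using the sizes quoted above. Hypothetical helper, not a
 * kernel primitive.
 */
static uint64_t span_for_levels(int levels)
{
    switch (levels) {
    case 2: return 1ULL << 31;  /* segment table on top (compat) */
    case 3: return 1ULL << 42;  /* region-third table on top (4TB) */
    case 4: return 1ULL << 53;  /* region-second table on top */
    default:
        assert(!"levels must be 2, 3 or 4");
        return 0;
    }
}

/* Smallest number of levels that can map addr, or -1 if none can. */
static int levels_needed(uint64_t addr)
{
    for (int levels = 2; levels <= 4; levels++)
        if (addr < span_for_levels(levels))
            return levels;
    return -1;
}

/*
 * A walker that blindly does pgd++/pud++ is only safe for addresses
 * whose bits above the top-level span are zero.
 */
static int walk_addr_ok(int levels, uint64_t addr)
{
    return addr < span_for_levels(levels);
}
```

With these figures an address just below 4TB fits a 3-level table,
while 1ULL << 42 is the first address that needs the 4-level upgrade.
A walk over that address in a process that still has 3 levels is the
out-of-bounds case described above.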
> > The proposed solution is to fast-forward over the hole between the start
> > address and the first vma and the hole between the last vma and the end
> > address. The pgd/pud/pmd/pte loops are used only for the address range
> > between the first and last vma. This guarantees that the page table
> > pointers stay in range for s390. For the other architectures this is
> > a small optimization.
>
> I've gone to lengths to keep VMAs out of the equation, so I can't say
> I'm excited about this solution.

The minimum fix is to take the mmap_sem. If a vma is unmapped while you
walk the page tables, the page tables can get freed. So you do have a
dependency on the vma list. All the other page table walkers in mm/
start with the vma, then do the four loops. It would be consistent if
the generic page table walker did the same.

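For reference, the fast-forward idea from the patch description can be
modelled in user space roughly like this. It is only a sketch under
assumptions: struct range is a hypothetical stand-in for a vma (just
the address range), clamp_to_mapped() is a made-up name, the array is
assumed sorted and non-overlapping, and locking (mmap_sem) is not
modelled at all. It is not the code from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a vma; only the address range is modelled. */
struct range {
    uint64_t start;   /* inclusive */
    uint64_t end;     /* exclusive */
};

/*
 * Shrink [*start, *end) to the span between the first and the last
 * mapped range that intersects it, skipping the holes at both ends.
 * 'maps' must be sorted by address and non-overlapping.
 * Returns 1 if something is mapped in the request, 0 otherwise
 * (in which case *start and *end are left unchanged).
 */
static int clamp_to_mapped(const struct range *maps, size_t n,
                           uint64_t *start, uint64_t *end)
{
    uint64_t lo = 0, hi = 0;
    int found = 0;

    for (size_t i = 0; i < n; i++) {
        if (maps[i].end <= *start || maps[i].start >= *end)
            continue;   /* no overlap with the requested range */
        if (!found) {
            /* first intersecting range fixes the lower bound */
            lo = maps[i].start > *start ? maps[i].start : *start;
            found = 1;
        }
        /* every later intersecting range moves the upper bound */
        hi = maps[i].end < *end ? maps[i].end : *end;
    }
    if (!found)
        return 0;
    *start = lo;
    *end = hi;
    return 1;
}
```

The four pgd/pud/pmd/pte loops would then run only over the clamped
range, so the pointers never step past tables that are guaranteed to
be allocated.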

Having thought about the problem again, I think I found a way to deal
with it in the s390 page table primitives. The fix is not exactly nice
but it will work. With it s390 will be able to walk addresses outside
of the vma address range.

--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.