[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+55aFxYnBxGZr3ed0i46SpSdOj+3VSVBZiqRbdJuwFMuTmxDw@mail.gmail.com>
Date: Thu, 13 Nov 2014 15:50:02 -0800
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Jerome Glisse <j.glisse@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
linux-mm <linux-mm@...ck.org>, Joerg Roedel <joro@...tes.org>,
Mel Gorman <mgorman@...e.de>, "H. Peter Anvin" <hpa@...or.com>,
Peter Zijlstra <peterz@...radead.org>,
Andrea Arcangeli <aarcange@...hat.com>,
Johannes Weiner <jweiner@...hat.com>,
Larry Woodman <lwoodman@...hat.com>,
Rik van Riel <riel@...hat.com>,
Dave Airlie <airlied@...hat.com>,
Brendan Conoboy <blc@...hat.com>,
Joe Donohue <jdonohue@...hat.com>,
Duncan Poole <dpoole@...dia.com>,
Sherry Cheung <SCheung@...dia.com>,
Subhash Gutti <sgutti@...dia.com>,
John Hubbard <jhubbard@...dia.com>,
Mark Hairgrove <mhairgrove@...dia.com>,
Lucien Dunning <ldunning@...dia.com>,
Cameron Buschardt <cabuschardt@...dia.com>,
Arvind Gopalakrishnan <arvindg@...dia.com>,
Shachar Raindel <raindel@...lanox.com>,
Liran Liss <liranl@...lanox.com>,
Roland Dreier <roland@...estorage.com>,
Ben Sander <ben.sander@....com>,
Greg Stoner <Greg.Stoner@....com>,
John Bridgman <John.Bridgman@....com>,
Michael Mantor <Michael.Mantor@....com>,
Paul Blinzer <Paul.Blinzer@....com>,
Laurent Morichetti <Laurent.Morichetti@....com>,
Alexander Deucher <Alexander.Deucher@....com>,
Oded Gabbay <Oded.Gabbay@....com>,
Jérôme Glisse <jglisse@...hat.com>
Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table
(gpt) v2.
On Mon, Nov 10, 2014 at 3:53 PM, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> So I am fine with that, it's the details that confuse me. The thing
> doesn't seem to be generic enough to be used for arbitrary page
> tables, with (for example) the shifts fixed by the VM page size and
> the size of the pte entry type. Also, the levels seem to be very
> infexible, with the page table entries being the simple case, but then
> you have that "pdep" thing that seems to be just _one_ level of page
> directory.
Ok, so let me just put my money where my mouth is, and show some
example code of a tree walker that I think is actually more generic.
Sorry for the delay, I got distracted by other things, and I wanted to
write something to show what I think might be a better approach.
NOTE NOTE NOTE! I'm not saying you have to do it this way. But before
I even show the patch, let me show you the "tree descriptor" from my
stupid test-program that uses it, and hopefully that will show what
I'm really aiming for:
struct tree_walker_definition x86_64_def = {
.total_bits = 48,
.start = 0,
.end = 0x7fffffffffff,
.levels = {
{ .level_bits = 9, .lookup = pgd_lookup },
{ .level_bits = 9, .lookup = pud_lookup },
{ .level_bits = 9, .lookup = pmd_lookup },
{ .level_bits = 9, .walker = pte_walker }
}
};
so basically, the *concept* is that you can describe a real page table
by actually *describing* it. What the above does is tell you:
- the amount of bits the tables can cover (48 is four levels of 9
bits each, leaving 12 bits - 4096 bytes - for the actual pages)
- limit the range that can be walked (this isn't really all that
important, but it does, for example, mean that the walker will
fundamentally refuse to give access to the kernel mapping)
- show how the different levels work, and what their sizes are and
how you look them up or walk them, starting from the top-most.
Anyway, I think a descriptor like the above looks *understandable*. It
kind of stands on its own, even without showing the actual code.
Now, the code to actually *walk* the above tree looks like this:
struct tree_walker walk = {
.first = 4096,
.last = 4096*512*3,
.walk = show_walk,
.hole = show_hole,
.pre_walk = show_pre_walk,
.post_walk = show_post_walk,
};
walk_tree((struct tree_entry *)pgd, &x86_64_def, &walk);
ie you use the "walk_tree()" function to walk a particular tree (in
this case it's a fake page table directory in "pgd", see the details
in the stupid test-application), giving it the tree definition and the
"walk" parameters that show what should happen for particular details
(quite often hole/pre-walk/post-walk may be NULL, my test app just
shows them being called).
Now,. in addition to that, each tree description obviously needs the
functions to show how to look up the different levels ("lookup" for
moving from one level to another, and "walker" for actually walking
the last level page table knowing how "present" bits etc work.
Now, your code had a "uint64_t mask" for the present bits, which
probably works reasonably well in practice, but I really prefer to
just have that "walker" callback instead. That way the page tables can
look like *anything*, and you can walk them, without having magic
rules that there has to be a particular bit pattern that says it's
"present".
Also, my walker actually does super-pages - ie one of the mid-level
page tables could map one big area. I didn't much test it, but the
code is actually fairly straightforward, the way it's all been set up.
So it might be buggy, but it's *close*.
Now, one place we differ is on locking. I actually think that the
person who asks to walk the tree should just do the locking
themselves. You can't really walk the tree without knowing what kind
of tree it is, and so I think the caller should just do the locking.
Obviously, the tree walker itself may have some locking in the
"pre_walk/post_walk" thing and in its lookup routines, so the
description of the tree can contain some locking of its own, but I did
*not* want to make the infrastructure itself force any particular
locking strategy.
So this does something quite different from what your patch actually
did, and does that different thing very differently. It may not really
match what you are aiming for, but I'd *really* like the first
implementation of HMM that gets merged to not over-design the locking
(which I think yours did), and I want it to make *sense* (which I
don't think your patch did).
Also, please note that this *is* just an example. It has an example
user (that is just a stupid user-level toy app to show how it all is
put together), but it's not necessarily all that featureful, and it's
definitely not very tested.
But the code is actually fairly simple. But judge for yourself.
Linus
View attachment "patch.diff" of type "text/plain" (8795 bytes)
Powered by blists - more mailing lists