lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <Pine.LNX.4.64.0703231456230.4133@skynet.skynet.ie>
Date:	Fri, 23 Mar 2007 14:57:23 +0000 (GMT)
From:	Mel Gorman <mel@....ul.ie>
To:	Christoph Hellwig <hch@...radead.org>
Cc:	Adam Litke <agl@...ibm.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Arjan van de Ven <arjan@...radead.org>,
	William Lee Irwin III <wli@...omorphy.com>,
	Ken Chen <kenchen@...gle.com>,
	Linux Memory Management List <linux-mm@...ck.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: pagetable_ops: Hugetlb character device example


On Thu, 22 Mar 2007, Christoph Hellwig wrote:

> On Thu, Mar 22, 2007 at 03:42:27PM +0000, Mel Gorman wrote:
> > A year ago, I may have agreed with you. However, Linus not only veto'd
> > it but
> > stamped on it repeatadly at VM Summit. He couldn't have made it clearer
> > if
> > he wore a t-shirt a hat and held up a neon sign. The assertion at the
> > time
> > was that variable page support of any sort had to be outside of the core
> > VM
> > because automatic support will get it wrong in some cases and makes the
> > core
> > VM harder to understand (because it's super-clear at the moment). Others
> > attending agreed with the position. That position rules out drivers or
> > filesystems giving hints about superpage sizes in the foreseeable
> > future.
> 
> Actually I think the only way to get it right is to do it in the core
> (or inm the architecture code for the really nasty bits of course), but
> then again this isn't the point I want to make here..
>

Maybe ultimatly it's the right thing to do, more can be done "on the side"
until such time as it's really worth handling the complexity in the core VM.

> > What they did not have any problem with was providing better interfaces
> > to
> > program against as long as they were on the side of the VM like
> > hugetlbfs
> > and not in the core. The character device for private mappings is an
> > example of an interface that is easier to program against than
> > hugetlbfs.
> > It's far easier for an application to mmap a file at a fixed location
> > than
> > trying to discover if hugetlbfs is mounted or not. However, to support
> > that
> > sort of interface, there needs to be a way of telling the VM to call the
> > an
> > alternative pagetable handler - hence Adam's patches.
> 
> .. and this is where we get into problems.  There should be no need to
> use all kinds of pseudo-OO obsfucation to get there.  A VMA flag that
> means 'this is hugetlb backed anonymous memory' is much nicer to archive
> this.

Except in the example he posted, the fault handler for the char device and
the hugetlbfs case are doing slightly different things. The fault handler
for hugetlbfs assumes the existance of a file mapping where as the char
device is inserting the page directly. Having the bit is not enough for the
core is not enough to determine that a slightly different fault handler was
needed.

With the current code, it is almost impossible for a driver with a different
pagetable layout to express different semantics to hugetlbfs with respects
to how their pagetables are setup. Altering hugetlbfs much is very difficult
because any alteration becomes global in nature. The ops would allow a
experimental drivers and interfaces to be developed without breaking
existing users of hugetlbfs and let us figure out things like "Is it worth
supporting 1GiB pages in Opterons" without breaking everything else in the
process.

> Because it makes clear there is exactly one special case here
> and no carte blanche for drivers to do whatever they want.

Drivers can already cause all sorts of mayhem through the existing hooks if
they are perverse enough. It is never encouraged of course but nothing
prevents them.

> I would prefer
> to even get rid of that single special case as mentioned above, but I'm
> definitly set dead against at making this special case totally open for
> random bits of the kernel to mess with.
>

As kernel memory is already backed by huge tlb entries in many cases, random
drivers should have little or no interest in doing anything mad with
pagetable ops. All it gets them is pain and entertaining posts from the
mailing list.

That said, your main objection seems to be opening to door to arbitrary
drivers to change the pagetable ops. I can see your point and Hughs on why
this could lead to some hilarity down the road, particularly if out-of-tree
drivers entering into the mess so how about the following;

Instead of having a vma->pagetable_ops with a structure of pointers,
it would be a simple integer  into a fixed list of pagetable operation
handlers. Something like....

#define PAGETABLE_OP_DEFAULT      0
#define PAGETABLE_OP_HUGETLB_FS   1
#define PAGETABLE_OP_HUGETLB_CHAR 2

struct pagetable_operations_struct[] pagetable_ops_lookup_table = {
 	/* PAGETABLE_OP_DEFAULT assuming we always used the table */
 	{
 		.fault = handle_pte_fault
 		....
 	},

 	/* PAGETABLE_OP_HUGETLB_FS */
 	{
 		.fault    = hugetlb_fault
 		.copy_vma = copy_hugetlb_page_range
 		......
 	},

 	/* PAGETABLE_OP_HUGETLB_CHAR */
 	{
 		.fault = whatever
 	}
};

Drivers would only be able to set an index in the VMA for this table.
The lookup would be about the same cost as what is currently there. However,
random drivers cannot mess with the pagetable ops - they would have to be
known by the core. Experimental drivers would have to update the table but
that shouldn't be an issue. Out-of-tree drivers would have no ability to
mess here at all which is a good thing.

> > Someone with sufficient energy could try implementing variable page
> > support
> > entirely as a device using Adam's interface.
> 
> Hopefully not, doing this in a driver would be utterly braindead and
> certainly not mergeable.
>

Indeed, but exposing superpages only through a magic filesystem with special
semantics doesn't win the nobel prize either. The point is to have the
ability to develop alternatives to try them out without breaking existing
users of hugepages. How else can it be shown that further churn in the
core superpage support would be worth it?

All of that said, while I was writing this mail up Ken Chen posted a patch 
for a char device that sits on top of hugetlbfs that looks promising. That 
can be kicked that around a bit and put the pagetable ops on the 
back-burner until such time as there is a driver of interest that really 
needs to do something different semantically different to hugetlbfs with 
respects to page tables.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ