Message-Id: <3A6E599E-07AC-4273-8643-ADFC8014D3E6@cam.ac.uk>
Date: Wed, 19 Sep 2007 08:34:47 +0100
From: Anton Altaparmakov <aia21@....ac.uk>
To: Christoph Lameter <clameter@....com>
Cc: Christoph Hellwig <hch@....de>, Mel Gorman <mel@...net.ie>,
David Chinner <dgc@....com>,
Jens Axboe <jens.axboe@...cle.com>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [00/17] [RFC] Virtual Compound Page Support
Hi Christoph,
On 19 Sep 2007, at 04:36, Christoph Lameter wrote:
> Currently there is a strong tendency to avoid larger page allocations
> in the kernel because of past fragmentation issues and because the
> current defragmentation methods are still evolving. It is not clear to
> what extent they can provide reliable allocations for higher order
> pages (plus the definition of "reliable" seems to be in the eye of the
> beholder).
>
> Currently we use vmalloc allocations in many locations to provide a
> safe way to allocate larger arrays. That is due to the danger of
> higher order allocations failing. Virtual Compound pages allow the use
> of regular page allocator allocations that will fall back only if
> there is an actual problem with acquiring a higher order page.
>
> This patch set provides a way for a higher order page allocation to
> fall back. Instead of a physically contiguous page, a virtually
> contiguous one is provided. The functionality of the vmalloc layer is
> used to provide the necessary page tables and control structures to
> establish a virtually contiguous area.
I like this a lot. It will get rid of all the silly games we have to
play when we need both large allocations and, where possible, efficient
allocations. In NTFS I could then just allocate higher order pages
instead of having to mess about with the allocation size, allocating a
single page if the requested size is <= PAGE_SIZE and using vmalloc()
if it is bigger. It would also be faster, because with your patchset a
higher order allocation will succeed most of the time without having to
resort to vmalloc().

So I could get rid of the below mess that I currently have in
fs/ntfs/malloc.h completely and just use the normal page
allocator/deallocator instead...
static inline void *__ntfs_malloc(unsigned long size, gfp_t gfp_mask)
{
        if (likely(size <= PAGE_SIZE)) {
                BUG_ON(!size);
                /* kmalloc() has per-CPU caches so is faster for now. */
                return kmalloc(PAGE_SIZE, gfp_mask & ~__GFP_HIGHMEM);
                /* return (void *)__get_free_page(gfp_mask); */
        }
        if (likely(size >> PAGE_SHIFT < num_physpages))
                return __vmalloc(size, gfp_mask, PAGE_KERNEL);
        return NULL;
}
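
With your patchset the whole thing could presumably shrink to something
like the below. This is only a sketch from reading your description: I
am assuming that GFP_VFALLBACK can simply be ORed into the gfp mask and
that vmalloc_address() is the page_address() replacement that also
works when the allocation fell back to a virtual mapping:

static inline void *__ntfs_malloc(unsigned long size, gfp_t gfp_mask)
{
        struct page *page;

        BUG_ON(!size);
        /*
         * One call for all sizes; the allocator falls back to a
         * virtual mapping on its own if the order cannot be had.
         */
        page = alloc_pages(gfp_mask | GFP_VFALLBACK, get_order(size));
        if (!page)
                return NULL;
        /* page_address() may not work for the fallback case. */
        return vmalloc_address(page);
}

The two-way split on the size (and the num_physpages sanity check)
would simply disappear, which is the whole point.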
Other places in the kernel could make use of the same thing. I think
XFS does something very similar to NTFS for its larger allocations at
least, and there are probably more places I cannot think of off the
top of my head...

I am looking forward to your patchset going into mainline. (-:
Best regards,
Anton
> Advantages:
>
> - If higher order allocations are failing then virtual compound pages
>   consisting of a series of order-0 pages can stand in for those
>   allocations.
>
> - "Reliability" as long as the vmalloc layer can provide virtual
> mappings.
>
> - Ability to reduce the use of the vmalloc layer significantly by
>   using physically contiguous memory instead of virtually contiguous
>   memory. Most uses of vmalloc() can be converted to page allocator
>   calls.
>
> - The use of physically contiguous memory instead of vmalloc may allow
>   the use of larger TLB entries, thus reducing TLB pressure. It also
>   reduces the need for page table walks.
>
> Disadvantages:
>
> - In order to use the fallback, the logic accessing the memory must be
>   aware that the memory could be backed by a virtual mapping and take
>   precautions: virt_to_page() and page_address() may not work, and
>   vmalloc_to_page() and vmalloc_address() (introduced through this
>   patch set) may have to be called instead.
>
> - Virtual mappings are less efficient than physical mappings.
>   Performance will drop once the virtual fallback occurs.
>
> - Virtual mappings have more memory overhead. vm_area control
>   structures, page tables, page arrays etc. need to be allocated and
>   managed to provide the virtual mappings.
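
On the first disadvantage: I take it that code which can see either
backing just has to check which side of the fence the address is on
before doing the lookup, something like the sketch below.
vmalloc_to_page() and the VMALLOC_START/VMALLOC_END range check are the
existing kernel facilities; the helper name is made up:

static struct page *any_addr_to_page(void *addr)
{
        unsigned long a = (unsigned long)addr;

        /* Virtually mapped fallback area: walk the page tables. */
        if (a >= VMALLOC_START && a < VMALLOC_END)
                return vmalloc_to_page(addr);
        /* Directly mapped memory: plain pointer arithmetic. */
        return virt_to_page(addr);
}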
>
> The patchset provides this functionality in stages. Stage 1 introduces
> the basic fallback mechanism necessary to replace vmalloc allocations
> with
>
>         alloc_page(GFP_VFALLBACK, order, ....)
>
> which signifies to the page allocator that a higher order page is to
> be found but a virtual mapping may stand in if there is an issue with
> fragmentation.
>
> Stage 1 functionality does not allow allocation and freeing of virtual
> mappings from interrupt contexts.
>
> The stage 1 series ends with the conversion of a few key uses of
> vmalloc in the VM to alloc_pages(): the allocation of sparsemem's
> memmap table and of the wait table in each zone. Other uses of
> vmalloc could be converted in the same way.
>
>
> Stage 2 functionality enhances the fallback even more, allowing
> allocations and frees in interrupt context.
>
> SLUB is then modified to use the virtual mappings for slab caches
> that are marked with SLAB_VFALLBACK. If a slab cache is marked this
> way then we drop all the restraints regarding page order and allocate
> good large memory areas that fit lots of objects so that we rarely
> have to use the slow paths.
>
> Two slab caches, the dentry cache and the buffer_heads, are then
> flagged that way. Others could be converted in the same way.
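
For what it is worth, converting another cache then presumably is just
a matter of ORing your new flag into the cache's creation flags,
roughly like this (a sketch only; SLAB_VFALLBACK is the flag from this
patchset, "my_cache" and struct my_object are made up, and the call is
modulo the exact kmem_cache_create() prototype of the day):

        struct kmem_cache *cache;

        /* Request large, object-packed areas with virtual fallback. */
        cache = kmem_cache_create("my_cache", sizeof(struct my_object),
                                  0, SLAB_VFALLBACK, NULL);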
>
> The patch set also provides a debugging aid through setting
>
>         CONFIG_VFALLBACK_ALWAYS
>
> If set then all GFP_VFALLBACK allocations fall back to the virtual
> mappings. This is useful for verification tests. This patch set was
> tested by enabling that option and compiling a kernel.
>
>
> Stage 3 functionality could be adding support for the large buffer
> size patchset. That is not done yet and I am not sure if it would be
> useful to do.
>
> Much of this patchset may only be needed for special cases in which
> the existing defragmentation methods fail for some reason. It may be
> better to have the system operate without such a safety net and make
> sure that the page allocator can return large orders in a reliable
> way.
>
> The initial idea for this patchset came from Nick Piggin's fsblock
> and from his arguments about reliability and guarantees. Since his
> fsblock uses the virtual mappings, I think it is legitimate to
> generalize the use of virtual mappings to support higher order
> allocations in this way. The application of these ideas to the large
> block size patchset etc. is straightforward. If wanted, I can base the
> next rev of the largebuffer patchset on this one and implement the
> fallback.
>
> Contrary to Nick, I still doubt that any of this provides a
> "guarantee". Having said that, I have to deal with various failure
> scenarios in the VM daily and I'd certainly like to see it work in a
> more reliable manner.
>
> IMHO getting rid of the various workarounds to deal with the small 4k
> pages and avoiding additional layers that group these pages in
> subsystem specific ways is something that can simplify the kernel and
> make the kernel more reliable overall.
>
> If people feel that a virtual fallback is needed then so be it. Maybe
> we can shed our security blanket later when the approaches to deal
> with fragmentation have matured.
>
> The patch set is also available via git from the largeblock git tree:
>
>         git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/largeblocksize.git vcompound
>
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/