[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20070215003810.GE29797@wotan.suse.de>
Date: Thu, 15 Feb 2007 01:38:10 +0100
From: Nick Piggin <npiggin@...e.de>
To: Lee Schermerhorn <Lee.Schermerhorn@...com>
Cc: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Linux Memory Management List <linux-mm@...ck.org>
Subject: Re: [patch] mm: NUMA replicated pagecache
On Wed, Feb 14, 2007 at 03:32:04PM -0500, Lee Schermerhorn wrote:
> On Tue, 2007-02-13 at 07:09 +0100, Nick Piggin wrote:
> > Hi,
> >
> > Just tinkering around with this and got something working, so I'll see
> > if anyone else wants to try it.
> >
> > Not proposing for inclusion, but I'd be interested in comments or results.
> >
> > Thanks,
> > Nick
>
> I've included a small patch below that allow me to build and boot with
> these patches on an HP NUMA platform. I'm still seeing an "unable to
Thanks Lee. Merged.
> > - Would like to be able to control replication via userspace, and maybe
> > even internally to the kernel.
> How about per cpuset? Consider a cpuset, on a NUMA system, with cpus
> and memories from a specific set of nodes. One might choose to have
> page cache pages referenced by tasks in this cpuset to be pulled into
> the cpuset's memories for local access. The remainder of the system may
> choose not to replicate page cache pages--e.g., to conserve memory.
> However, "unreplicating" on write would still need to work system wide.
>
> But, note: may [probably] want option to disable replication for shmem
> pages? I'm thinking here of large data base shmem regions that, at any
> time, might have a lot of pages accessed "read only". Probably wouldn't
> want a lot of replication/unreplication happening behind the scene.
Yeah cpusets is an interesting possibility. A per-inode attribute could be
another one. The good old global sysctl is also a must :)
> > - Ideally, reclaim might reclaim replicated pages preferentially, however
> > I aim to be _minimally_ intrusive.
> > - Would like to replicate PagePrivate, but filesystem may dirty page via
> > buffers. Any solutions? (currently should mount with 'nobh').
> Linux migrates pages with PagePrivate using a per mapping migratepage
> address space op to handle the buffers. File systems can provide their
> own or use a generic version. How about a "replicatepage" aop?
I guess the main problem is those filesystems which dirty the page via
the buffers, via b_this_page, or b_data. However AFAIKS, these only happen
for things like directories. I _think_ we can safely assume that regular
file pages will not get modified (that would be data corruption!).
> > +struct page * find_get_page_readonly(struct address_space *mapping, unsigned long offset)
> > +{
> > + struct page *page;
> > +
> > +retry:
> > + read_lock_irq(&mapping->tree_lock);
> > + if (radix_tree_tag_get(&mapping->page_tree, offset,
> > + PAGECACHE_TAG_REPLICATED)) {
> > + int nid;
> > + struct pcache_desc *pcd;
> > +replicated:
> > + nid = numa_node_id();
> > + pcd = radix_tree_lookup(&mapping->page_tree, offset);
> ??? possible NULL pcd? I believe I'm seeing one here...
Hmm, OK. I'll have to do some stress testing. I'm sure there are a few bugs
left.
>
> > + if (!node_isset(nid, pcd->nodes_present)) {
> Do this check [and possible replicate] only if replication enabled
> [system wide?, per cpuset? based on explicit replication policy?, ...]?
Yep.
> > + struct page *repl_page;
> > +
> > + page = pcd->master;
> > + page_cache_get(page);
> > + read_unlock_irq(&mapping->tree_lock);
> > + repl_page = alloc_pages_node(nid,
> > + mapping_gfp_mask(mapping), 0);
> ??? don't try to hard to allocate page, as it's only a performance
> optimization. E.g., add in GFP_THISNODE and remove and __GFP_WAIT?
I think that has merit. The problem if we remove __GFP_WAIT is that the
page allocator gives us access to some reserves. __GFP_NORETRY should
be reasonable?
>
> > + if (!repl_page)
> > + return page;
> > + copy_highpage(repl_page, page);
> > + flush_dcache_page(repl_page);
> > + page->mapping = mapping;
> > + page->index = offset;
> > + SetPageUptodate(repl_page); /* XXX: nonatomic */
> > + page_cache_release(page);
> > + write_lock_irq(&mapping->tree_lock);
> > + __insert_replicated_page(repl_page, mapping, offset, nid);
> ??? can this fail due to race? Don't care because we retry the lookup?
> page freed [released] in the function...
Yeah, I told you it was ugly :P Sorry you had to wade through this, but
it can be cleaned up..
> > EXPORT_SYMBOL(find_lock_page);
> ??? should find_trylock_page() handle potential replicated page?
> until it is removed, anyway?
It is removed upstream, but in 2.6.20 it has no callers anyway so I didn't
worry about it.
Thanks for the comments & patch.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists