Message-ID: <20151214211410.GP8474@quack.suse.cz>
Date: Mon, 14 Dec 2015 22:14:10 +0100
From: Jan Kara <jack@...e.cz>
To: Andreas Grünbacher
<andreas.gruenbacher@...il.com>
Cc: Jan Kara <jack@...e.cz>, Ted Tso <tytso@....edu>,
linux-ext4@...r.kernel.org, Laurent GUERBY <laurent@...rby.net>,
Andreas Dilger <adilger@...ger.ca>
Subject: Re: [PATCH 0/6] ext[24]: MBCache rewrite
Hi,
On Sat 12-12-15 00:57:01, Andreas Grünbacher wrote:
> 2015-12-09 18:57 GMT+01:00 Jan Kara <jack@...e.cz>:
> > Hello,
> >
> > inspired by recent reports [1] of problems with mbcache I had a look at
> > what we could do to improve it. I found the current code rather
> > overengineered (accounting for a single entry being in several indices,
> > a homegrown implementation of an rw semaphore, ...).
> >
> > After some thinking I've decided to just reimplement mbcache instead of
> > improving the original code in small steps, since the fundamental changes
> > in locking and layout would actually be harder to review in small steps
> > than in one big chunk, and overall the new mbcache is a pretty simple
> > piece of code (~450 lines).
> >
> > The result of the rewrite is smaller code (almost half the original
> > size), smaller cache entries (7 longs instead of 13), and better
> > performance (see below for details).
>
> I agree that mbcache has scalability problems worth fixing; you may
> also be right about replacing instead of fixing it.
>
> I would prefer an actual replacement over adding mbcache2 though: the
> two existing users will be converted immediately; there is no point in
> keeping the old version around. For that, can the current mbcache be
> converted to the API of the new one in a separate patch first (alloc +
> insert vs. create, get + release/free vs. delete_block)?
Well, conversion to the new API isn't that trivial because of the
lifetime / locking differences. If people prefer, I can add a patch that
just renames everything from mb2 to mb after the old code is removed.
Opinions?
> The corner cases that mbcache has problems with are:
>
> (1) Many files with the same xattrs: Right now, an xattr block can be
> shared among at most EXT[24]_XATTR_REFCOUNT_MAX = 2^10 inodes. If 2^20
Do you know why this limit exists, BTW? The on-disk format can support up
to 2^32 references...
> inodes are cached, they will have at least 2^10 xattr blocks, all of
> which will end up in the same hash chain. An xattr block should be
> removed from the mbcache once it has reached its maximum refcount, but
> if I haven't overlooked something, this doesn't happen right now.
> Fixing that should be relatively easy.
Yeah, that sounds like a good optimization. I'll try that.
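
For illustration, something like the following userspace toy (names,
the single-chain model, and the REFCOUNT_MAX stand-in are all made up,
not the actual ext4 code): once a block takes its last possible
reference, it gets unlinked from the hash chain so full blocks stop
lengthening lookups.

```c
#include <stdlib.h>

#define REFCOUNT_MAX 1024	/* stand-in for EXT[24]_XATTR_REFCOUNT_MAX */

struct entry {
	struct entry *next;
	unsigned int refcount;
};

static struct entry *chain;	/* one hash chain is enough for the toy */

static void cache_insert(void)
{
	struct entry *e = malloc(sizeof(*e));

	e->refcount = 1;	/* the inserting inode holds the first ref */
	e->next = chain;
	chain = e;
}

/*
 * Find a reusable block and take another reference to it. Once the
 * block can accept no further sharers, unlink it from the chain so it
 * no longer slows down future lookups; the caller still reuses it
 * this one last time.
 */
static struct entry *cache_get_reusable(void)
{
	struct entry *e = chain;

	if (!e)
		return NULL;
	if (++e->refcount >= REFCOUNT_MAX)
		chain = e->next;	/* block is full: drop from cache */
	return e;
}
```

The real fix would of course key lookups by content hash and free the
entry once the last user drops it; the sketch only shows the eviction
point.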
> (2) Very many files with unique xattrs. We might be able to come up
> with a reasonable heuristic or tweaking knob for detecting this case;
> if not, we could at least use a resizable hash table to keep the hash
> chains reasonably short.
So far we limit the number of entries in the cache, which keeps hash
chains short as well. Using a resizable hash table and letting the system
balance the number of cached entries purely via the shrinker is certainly
possible; however, I'm not sure the complexity is really worth it.
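
To make that tradeoff concrete, here is a toy model of the capped scheme
(names and the head-of-chain eviction are made up; the real cache evicts
via LRU/shrinker): with B buckets and at most C entries, the average
chain length is bounded by C/B without any table resizing.

```c
#include <stdlib.h>

#define HASH_BUCKETS	256
#define CACHE_MAX	1024	/* global limit on cached entries */

struct centry {
	struct centry *next;
	unsigned int key;
};

static struct centry *buckets[HASH_BUCKETS];
static unsigned int nr_entries;

static unsigned int bucket_of(unsigned int key)
{
	return (key * 2654435761u) % HASH_BUCKETS; /* multiplicative hash */
}

/*
 * Insert a new entry. When the global cap is reached, evict the head
 * of the target chain first (a crude stand-in for LRU eviction). The
 * cap alone keeps the average chain length bounded by
 * CACHE_MAX / HASH_BUCKETS.
 */
static void cache_add(unsigned int key)
{
	unsigned int b = bucket_of(key);
	struct centry *e;

	if (nr_entries >= CACHE_MAX && buckets[b]) {
		struct centry *victim = buckets[b];

		buckets[b] = victim->next;
		free(victim);
		nr_entries--;
	}
	e = malloc(sizeof(*e));
	e->key = key;
	e->next = buckets[b];
	buckets[b] = e;
	nr_entries++;
}
```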
Regarding detection of unique xattrs: We could certainly detect thrashing
of mbcache relatively easily. The difficult part is how to detect when to
enable caching again, because the workload can change. I'm thinking about
some backoff mechanism like caching only every k-th entry asked to be
inserted (starting with k = 1), doubling k if we don't reach some
low-watermark cache hit ratio within some number of cache lookups, and
halving k if we reach the high-watermark hit ratio. However, I'm again
not yet convinced complex schemes like this are worth it for mbcache
(although it might be an interesting research topic as such), so I didn't
try anything like this for the initial posting.
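
In userspace pseudocode the backoff would look roughly like this (all
constants, names, and the window size are invented for illustration;
this is not proposed kernel code):

```c
#include <stdbool.h>

#define WINDOW		1024	/* lookups per measurement window */
#define LOW_PERCENT	2	/* hit ratio below this: back off harder */
#define HIGH_PERCENT	10	/* hit ratio above this: cache more again */
#define K_MAX		64

static unsigned int k = 1;	/* insert only every k-th candidate */
static unsigned int insert_seq;
static unsigned int lookups, hits;

/* Should this candidate entry actually go into the cache? */
static bool backoff_should_insert(void)
{
	return insert_seq++ % k == 0;
}

/*
 * Account one cache lookup. At each window boundary, double k when
 * the hit ratio stayed under the low watermark (the cache is being
 * thrashed) and halve it when the ratio cleared the high watermark
 * (the workload looks cache-friendly again).
 */
static void backoff_account_lookup(bool hit)
{
	lookups++;
	if (hit)
		hits++;
	if (lookups < WINDOW)
		return;
	if (hits * 100 < LOW_PERCENT * lookups && k < K_MAX)
		k *= 2;
	else if (hits * 100 > HIGH_PERCENT * lookups && k > 1)
		k /= 2;
	lookups = hits = 0;
}
```

The halving path is what lets the cache recover once the workload turns
shareable again, which is the hard part noted above.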
Honza
--
Jan Kara <jack@...e.com>
SUSE Labs, CR
--