[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <49775B38.7010505@google.com>
Date: Wed, 21 Jan 2009 09:28:24 -0800
From: Mike Waychison <mikew@...gle.com>
To: Andi Kleen <andi@...stfloor.org>
CC: linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH v1 0/8] Deferred dput() and iput() -- reducing lock contention
Andi Kleen wrote:
> On Tue, Jan 20, 2009 at 10:22:00PM -0800, Mike Waychison wrote:
>> Andi Kleen wrote:
>>> Mike Waychison <mikew@...gle.com> writes:
>>>
>>>> livelock on dcache_lock/inode_lock (specifically in
>>>> atomic_dec_and_lock())
>>> I'm not sure how something can livelock in atomic_dec_and_lock which
>>> doesn't take a spinlock itself? Are you saying you run into NUMA memory
>>> unfairness here? Or did I misparse you?
>> By atomic_dec_and_lock, I really meant to say _atomic_dec_and_lock().
>
> Ok. So it's basically just the lock that is taken?
>
> In theory one could likely provide an x86 specific dec-and_lock that
> might perform better and doesn't lock if the count is still > 0, but that
> would only help if the reference count is still > 0. Is that a common
> situation in your test?
>
This is precisely what _atomic_dec_and_lock does. It only locks if it
looks like the counter will be hitting 0 with this dec.
It is a common case for the count to be > 1, however the troubles we see
happen with there are a large number of final dputs (and thus also final
iputs)
>> It takes the spinlock if the cmpxchg hidden inside atomic_dec_unless fails.
>>
>> There are likely NUMA unfairness issues at play, but it's not the main
>> worry at this point.
>>
>>>> This patchset is an attempt to try and reduce the locking overheads
>>>> associated
>>>> with final dput() and final iput(). This is done by batching dentries and
>>>> inodes into per-process queues and processing them in 'parallel' to
>>>> consolidate
>>>> some of the locking.
>>> I was wondering what this does to the latencies when dput/iput
>>> is only done for very objects. Does it increase costs then
>>> significantly?
>> very objects?
>
> Sorry.
>
> "is only done for very few objects". Somnhow the few got lost.
> Basically latency in the unloaded case.
>
> I always worry when people do complicated things for the high
> load case how the more usual "do it for a single object" workload
> fares.
Hmm. I'm not sure. The inline latency should fair out to be better
than before on average (common case is to disable pre-emption, enqueue
on the postponed list and re-enable pre-emption, which is fast). It's
the clearing of the queues that may take a little longer, though the
locking overhead should be the same. The trouble is that the queue
itself may not be cache hot and that would add a bit of latency.
Would you know of a good way to detect this latency for fewer objects?
>
>>> As a high level comment it seems like a lot of work to work
>>> around global locks, like the inode_lock, where it might be better to
>>> just split the lock up? Mind you I don't have a clear proposal
>>> how to do that, but surely it's doable somehow.
>>>
>> Perhaps.. the only plausible way I can think this would be doable would
>> be to rework the global resources (like the global inode_unused LRU list
>
> One simple way would be to just use multiple lists with an own lock
> each. I doubt that would impact the LRU behaviour very much.
>
>> and deal with inode state transitions), but even then, some sort of
>> consistency needs to happen at the super_block level,
>
> The sb could also look at multiple lists?
Maybe (hashing on inode nr perhaps?).
The other trouble is that dcache_lock which for the most part stands in
from of the inode_lock. While developing this patchset, I was
originally chasing dcache_lock contention issues. Once I relieved the
contention via deferred batching and parallel processing of dput(), I
was surprised to see that performance didn't get any better because now
I was mostly contending on inode_lock :\
The dcache_lock again could maybe be broken out to be per super_block,
but I have no idea how to make that work with mountpoint traversals.
Making this guy become a set of smaller locks seems difficult at best.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists