Message-ID: <56DDBCEB.8060307@hpe.com>
Date: Mon, 7 Mar 2016 12:39:55 -0500
From: Waiman Long <waiman.long@....com>
To: Dave Chinner <dchinner@...hat.com>
CC: Tejun Heo <tj@...nel.org>,
Christoph Lameter <cl@...ux-foundation.org>, <xfs@....sgi.com>,
<linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Scott J Norton <scott.norton@...com>,
Douglas Hatch <doug.hatch@...com>
Subject: Re: [RFC PATCH 0/2] percpu_counter: Enable switching to global counter
On 03/05/2016 01:34 AM, Dave Chinner wrote:
> On Fri, Mar 04, 2016 at 09:51:37PM -0500, Waiman Long wrote:
>> This patchset allows the degeneration of per-cpu counters back to
>> global counters when:
>>
>> 1) The number of CPUs in the system is large, hence a high cost for
>> calling percpu_counter_sum().
>> 2) The initial count value is small so that it has a high chance of
>> excessive percpu_counter_sum() calls.
>>
>> When the above 2 conditions are true, this patchset allows the user of
>> per-cpu counters to selectively degenerate them into global counters
>> with lock. This is done by calling the new percpu_counter_set_limit()
>> API after percpu_counter_set(). Without this call, there is no change
>> in the behavior of the per-cpu counters.
>>
>> Patch 1 implements the new percpu_counter_set_limit() API.
>>
>> Patch 2 modifies XFS to call the new API for the m_ifree and m_fdblocks
>> per-cpu counters.
>>
>> Waiman Long (2):
>> percpu_counter: Allow falling back to global counter on large system
>> xfs: Allow degeneration of m_fdblocks/m_ifree to global counters
> NACK.
>
> This change turns off the per-cpu counters for the XFS free block
> counters on 32p machines. We proved 10 years ago that a global
> lock for these counters was a massive scalability limitation for
> concurrent buffered writes on 16p machines.
>
> IOWs, this change is going to cause fast path concurrent sequential
> write regressions for just about everyone, even on empty
> filesystems.
That is not really the case here. The patch won't change anything if
there are enough free blocks available in the filesystem. It will turn on
the global lock at mount time only if the number of free blocks available
is less than the given limit. For XFS, that limit is 12MB per CPU; on
the 80-thread system that I used for testing, it works out to a bit less
than 1GB. Even if the global lock is enabled at mount time, the counter
will transition back to per-cpu mode as soon as enough free blocks become
available.
I am aware that if there are enough threads pounding on the lock, it can
cause a scalability bottleneck. However, the qspinlock used on x86
should greatly alleviate the scalability impact compared with 10 years
ago, when we used the ticket lock. BTW, what exactly was the
microbenchmark that you used to exercise concurrent sequential writes? I
would like to try it out on the new hardware and kernel.
The AIM7 microbenchmark that I used was not able to generate more than
1% CPU time in spinlock contention for __percpu_counter_add() on my
80-thread test system. On the other hand, the overhead of doing
percpu_counter_sum() consumed more than 18% of CPU time with the
same microbenchmark when the filesystem was small. If the number of
__percpu_counter_add() calls is large enough to cause significant
spinlock contention, I think the time wasted in percpu_counter_sum()
will be even greater for a small filesystem. In the borderline case where
the filesystem is small enough to trigger the use of the global lock with
my patch, but not small enough to trigger excessive percpu_counter_sum()
calls, my patch will cause a degradation in performance.
So I don't think this patch will cause any problem with the free block
count. The other per-cpu counter, m_ifree, however, is a problem in the
current code. It uses the default batch size, which on my 80-thread
system is 12800 (2*nr_cpus^2). However, the number of free inodes in
the various XFS filesystems was less than 2k, so
percpu_counter_sum() was called every time xfs_mod_ifree() was called.
That cost about 3% of CPU time with my microbenchmark, which was also
eliminated by my patch.
> The behaviour you are seeing only occurs when the filesystem is near
> to ENOSPC. As I asked you last time - if you want to make this
> problem go away, please increase the size of the filesystem you are
> running your massively concurrent benchmarks on.
>
> IOWs, please stop trying to optimise a filesystem slow path that:
>
> a) 99.9% of production workloads never execute,
> b) where we expect performance to degrade as allocation gets
> computationally expensive as we close in on ENOSPC,
> c) we start to execute blocking data flush operations that
> slow everything down massively, and
> d) is indicative that the workload is about to suffer
> from a fatal, unrecoverable error (i.e. ENOSPC)
>
I totally agree. I am not trying to optimize a filesystem slowpath.
There are use cases, however, where we may want to create relatively
small filesystems. One example that I cited in patch 2 is the
battery-backed NVDIMMs that I have played with recently. They can be used
for log files or other small files. Each DIMM is 8GB, and you can have a
few of them available, so the filesystem size could be 32GB or so. That
can come close to the limit where excessive percpu_counter_sum() calls
can happen. What I want to do here is to reduce the chance of excessive
percpu_counter_sum() calls causing a performance problem. For a large
filesystem that is nowhere near ENOSPC, my patch will have no
performance impact whatsoever.
Cheers,
Longman