Date:	Sat, 05 Apr 2014 11:28:17 +0800
From:	Daniel J Blueman <>
To:	Theodore Ts'o <>
CC:, LKML <>,
	Steffen Persvold <>,
	Andreas Dilger <>
Subject: Re: ext4 performance falloff

On 04/05/2014 04:56 AM, Theodore Ts'o wrote:
> On Sat, Apr 05, 2014 at 01:00:55AM +0800, Daniel J Blueman wrote:
>> On a larger system 1728 cores/4.5TB memory and 3.13.9, I'm seeing very low
>> 600KB/s cached write performance to a local ext4 filesystem:

> Thanks for the heads up.  Most (all?) of the ext4 developers don't have
> systems with thousands of cores, so these issues generally don't come up
> for us, and so we're not likely (hell, very unlikely!) to notice potential
> problems caused by these sorts of uber-large systems.

Hehe. It's not every day we get access to these systems either.

>> Analysis shows that ext4 is reading from all cores' cpu-local data (thus
>> expensive off-NUMA-node access) for each block written:
>> if (free_clusters - (nclusters + rsv + dirty_clusters) <
>> 				EXT4_FREECLUSTERS_WATERMARK) {
>> 	free_clusters  = percpu_counter_sum_positive(fcc);
>> 	dirty_clusters = percpu_counter_sum_positive(dcc);
>> }
>> This threshold is defined as:
>> #define EXT4_FREECLUSTERS_WATERMARK (4 * (percpu_counter_batch *
>> nr_cpu_ids))
> The problem we are trying to solve here is that when we do delayed
> allocation, we're making an implicit promise that there will be space
> available
> I've done the calculations, and 4 * 32 * 1728 cores = 221184 blocks,
> or 864 megabytes.  That would mean that the file system is over 98%
> full, so that's actually pretty reasonable; most of the time there's
> more free space than that.

The filesystem is empty after the mkfs; the approach here may make sense
if we want to allow all cores to write to this FS, but here only one
core is writing.

Instrumenting shows that free_clusters=16464621 nclusters=1 rsv=842790
dirty_clusters=0 percpu_counter_batch=3456 nr_cpu_ids=1728; with less than
91GB of free space, we'd hit this issue. It feels more sensible to start
this behaviour when the FS is, say, 98% full, irrespective of the number
of cores, but that's not why the behaviour is there.
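
To make the arithmetic behind the 91GB figure explicit, here is a minimal
userspace sketch, not kernel code. It assumes 4KiB clusters (no bigalloc);
the function names are hypothetical, and only the formula and the comparison
follow the snippets quoted above.

```c
#include <assert.h>

/* Same shape as EXT4_FREECLUSTERS_WATERMARK:
 * 4 * (percpu_counter_batch * nr_cpu_ids). */
static long long freeclusters_watermark(long long batch, long long ncpus)
{
	return 4 * (batch * ncpus);
}

/* Convert a cluster count to bytes; the cluster size is an assumption
 * supplied by the caller (4096 for the figures in this thread). */
static long long clusters_to_bytes(long long clusters, long long cluster_bytes)
{
	assert(cluster_bytes > 0);
	return clusters * cluster_bytes;
}

/* Mirrors the quoted comparison: nonzero means the expensive
 * percpu_counter_sum_positive() path would be taken. */
static int takes_slow_path(long long free_clusters, long long nclusters,
			   long long rsv, long long dirty_clusters,
			   long long watermark)
{
	return free_clusters - (nclusters + rsv + dirty_clusters) < watermark;
}
```

With percpu_counter_batch=3456 and nr_cpu_ids=1728 the watermark is
23887872 clusters, or 97844723712 bytes (about 91GiB) at 4KiB per cluster.
The instrumented values give 16464621 - (1 + 842790 + 0) = 15621830, which
is below that, so every write takes the slow path. With a batch of 32, as
in the quoted calculation, the watermark would be 221184 clusters and the
same write would stay on the fast path.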

Since these block devices are attached to a single NUMA node's IO link,
there is a scaling limitation there anyway, so perhaps there is a
rationale for limiting this to min(256, nr_cpu_ids)?
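
As a rough illustration of that suggestion only, and not a proposed patch,
the cap could look like the userspace sketch below. The name
capped_watermark and the WATERMARK_CPU_CAP constant are hypothetical, and
256 is just the number floated above, not a tuned value.

```c
#include <assert.h>

/* Hypothetical cap on the CPU count used in the watermark. */
#define WATERMARK_CPU_CAP 256LL

/* Watermark with the CPU term clamped to min(WATERMARK_CPU_CAP, ncpus).
 * batch is passed through unchanged. */
static long long capped_watermark(long long batch, long long ncpus)
{
	long long n = ncpus < WATERMARK_CPU_CAP ? ncpus : WATERMARK_CPU_CAP;

	assert(batch > 0 && ncpus > 0);
	return 4 * (batch * n);
}
```

For batch=3456 and 1728 CPUs this gives 3538944 clusters, which is 13.5GiB
at an assumed 4KiB per cluster, instead of about 91GiB. The batch term is
left alone here, so the result still grows with the machine size if
percpu_counter_batch does.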

> It looks like the real problem is that we're using nr_cpu_ids, which
> is the maximum possible number of cpu's that the system can support,
> which is different from the number of cpu's that you currently have.
> For normal kernels nr_cpu_ids is small, so that has never been a
> problem, but I bet you have nr_cpu_ids set to something really large,
> right?
> If you change nr_cpu_ids to total_cpus in the definition of
> EXT4_FREECLUSTERS_WATERMARK, does that make things better for your
> system?

I have reproduced this with CPU hotplug disabled, so nr_cpu_ids is 
nicely at 1728.

Daniel J Blueman
Principal Software Engineer, Numascale