linux-kernel - Re: Terrible disk performance when files cached > 4GB

Message-ID: <20160415135619.GA2558@blaptop>
Date:	Fri, 15 Apr 2016 22:56:19 +0900
From:	Minchan Kim <minchan@...nel.org>
To:	Colum Paget <colum.paget@...omgb.com>
Cc:	linux-kernel@...r.kernel.org
Subject: Re: Terrible disk performance when files cached > 4GB

On Fri, Apr 15, 2016 at 10:20:33AM +0100, Colum Paget wrote:
> Hi all,
> 
> I suspect that many people will have reported this, but I thought I'd drop you 
> a line just in case everyone figures someone else has reported it. It's 
> possible we're just doing something wrong and so encountering this problem, 
> but I can't find anyone saying they've found a solution, and the problem 
> doesn't seem to be present in 3.x kernels, which makes us think it could be a 
> bug.
> 
> We are seeing a problem in 4.4.5 and 4.4.6 32-bit 'hugemem' kernels running on 
> machines with > 4GB ram. The problem results in disk performance dropping 
> from 120 MB/s to 1MB/s or even less. 3.18.x 32-bit kernels do not seem to 
> exhibit this behaviour, or at least we can't make it happen reliably. We've 
> tried 3.14.65 and 3.14.65 and they don't exhibit the same degree of problem. 
> We've not yet been able to test 64-bit kernels; it will be a while before we 
> can. We've been able to reproduce the problem on multiple machines with 
> different hardware configs, and with different kernel configs as regards 
> SMP, NUMA support and transparent hugepages.
> 
> This problem can be reproduced thusly:
> 
> Unpack/transfer a *large* number of files onto disk. As they unpack one can 
> monitor the amount of memory being used for file caching with 'free'. Disk 
> transfer speeds can be tested by 'dd'-ing a large file locally. Initially the 
> transfer rate for this file will be over 100 MB/s. However, when the amount of 
> cached memory exceeds some figure (this was 4GB on some systems, 10GB on 
> others) disk performance will start to dramatically degrade. Very swiftly the 
> disks become unusable.
> 
> On some machines this situation can be recovered by:
> 
>   echo 3 > /proc/sys/vm/drop_caches
> 
> However, we've seen some cases where even this doesn't seem to help, and the 
> machine has to be rebooted.
> 
> We believe the problem is that the memory cache gets so big that searching 
> through it becomes slower than reading files directly off disk. One problem 
> with this theory is that we're always copying the same file over and over in 
> our tests, so the file is unlikely to be a 'cache miss'. Personally, I would 
> have expected performance to be bad only for cache misses, but it's bad for 
> everything, so maybe our theory is wrong.
> 
> For our purposes, we're fine running with 3.14.x series kernels, but I thought 
> I should let you know.
> 
> regards,
> 
> Colum
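
The reproduction steps quoted above could be scripted roughly as follows.
This is only a sketch, not something from the original report: the helper
names cached_kb, copy_rate and monitor are made up for illustration, and it
assumes GNU dd and a Linux-style /proc/meminfo.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not from the original report.

# Print the page cache size in kB from the "Cached:" line of a
# meminfo-style file (defaults to /proc/meminfo).
cached_kb() {
    awk '$1 == "Cached:" { print $2 }' "${1:-/proc/meminfo}"
}

# Copy SRC to DST with dd and print dd's final summary line, which
# includes the measured throughput. Assumes GNU dd (bs=1M, conv=fsync).
copy_rate() {
    dd if="$1" of="$2" bs=1M conv=fsync 2>&1 | tail -n 1
}

# Log cache size and copy throughput every INTERVAL seconds (default 10)
# until interrupted with Ctrl-C.
monitor() {
    src=$1 dst=$2 interval=${3:-10}
    while :; do
        printf '%s kB cached; %s\n' "$(cached_kb)" "$(copy_rate "$src" "$dst")"
        sleep "$interval"
    done
}
```

Running something like "monitor /data/bigfile /data/bigfile.copy" (both paths
hypothetical) in one terminal while the files unpack in another should show
whether throughput falls off as the cached figure grows.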

Did you see this patch?

https://lkml.org/lkml/2016/4/3/237

It fixes a bug introduced by commit 6b4f7799c6a5 ("mm: vmscan: invoke slab
shrinkers from shrink_zone()"), which was applied in v3.19. In other words,
kernels up to and including 3.18 were okay.
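
A quick way to tell which side of that boundary a given kernel falls on might
look like the sketch below. The function name has_commit_6b4f7799 is
hypothetical, and the check only compares version numbers against v3.19: it
cannot see vendor backports, and it says nothing about whether the fix from
the patch above has been applied.

```shell
# Sketch only: succeed if kernel release $1 (e.g. "4.4.5") is v3.19 or
# later, i.e. new enough to contain commit 6b4f7799c6a5 per the note above.
# Assumption: plain upstream version numbering, no backports considered.
has_commit_6b4f7799() {
    major=${1%%.*}            # text before the first dot
    rest=${1#*.}              # text after the first dot
    minor=${rest%%[!0-9]*}    # leading digits of the remainder
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "${minor:-0}" -ge 19 ]; }
}
```

For example, has_commit_6b4f7799 "$(uname -r)" && echo "v3.19 or later" would
report on the running kernel.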

Thanks.