[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <58e701f4-6af1-d47a-7b3e-5cadf9e27296@molgen.mpg.de>
Date: Thu, 26 Aug 2021 12:41:25 +0200
From: Paul Menzel <pmenzel@...gen.mpg.de>
To: LKML <linux-kernel@...r.kernel.org>
Cc: it+linux-xfs@...gen.mpg.de, linux-fsdevel@...r.kernel.org,
linux-xfs@...r.kernel.org, linux-mm@...ck.org
Subject: Minimum inode cache size? (was: Slow file operations on file server
with 30 TB hardware RAID and 100 TB software RAID)
Dear Linux folks,
Am 20.08.21 um 16:39 schrieb Paul Menzel:
> Am 20.08.21 um 16:31 schrieb Paul Menzel:
>
>> Short problem statement: Sometimes changing into a directory on a file
>> server wit 30 TB hardware RAID and 100 TB software RAID both formatted
>> with XFS takes several seconds.
>>
>>
>> On a Dell PowerEdge T630 with two Xeon CPU E5-2603 v4 @ 1.70GHz and 96
>> GB RAM a 30 TB hardware RAID is served by the hardware RAID controller
>> and a 100 TB MDRAID software RAID connected to a Microchip 1100-8e
>> both formatted using XFS. Currently, Linux 5.4.39 runs on it.
>>
>> ```
>> $ more /proc/version
>> Linux version 5.4.39.mx64.334 (root@....molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP Thu May 7 14:27:50 CEST 2020
>> $ dmesg | grep megar
>> [ 10.322823] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
>> [ 10.331910] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
>> [ 10.345055] megaraid_sas 0000:03:00.0: BAR:0x1 BAR's base_addr(phys):0x0000000092100000 mapped virt_addr:0x0000000059ea5995
>> [ 10.345057] megaraid_sas 0000:03:00.0: FW now in Ready state
>> [ 10.351868] megaraid_sas 0000:03:00.0: 63 bit DMA mask and 32 bit consistent mask
>> [ 10.361655] megaraid_sas 0000:03:00.0: firmware supports msix : (96)
>> [ 10.369433] megaraid_sas 0000:03:00.0: requested/available msix 13/13
>> [ 10.377113] megaraid_sas 0000:03:00.0: current msix/online cpus : (13/12)
>> [ 10.385190] megaraid_sas 0000:03:00.0: RDPQ mode : (disabled)
>> [ 10.392092] megaraid_sas 0000:03:00.0: Current firmware supports maximum commands: 928 LDIO threshold: 0
>> [ 10.403895] megaraid_sas 0000:03:00.0: Configured max firmware commands: 927
>> [ 10.416840] megaraid_sas 0000:03:00.0: Performance mode :Latency
>> [ 10.424029] megaraid_sas 0000:03:00.0: FW supports sync cache : No
>> [ 10.431417] megaraid_sas 0000:03:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
>> [ 10.486158] megaraid_sas 0000:03:00.0: FW provided supportMaxExtLDs: 1 max_lds: 64
>> [ 10.495502] megaraid_sas 0000:03:00.0: controller type : MR(2048MB)
>> [ 10.502988] megaraid_sas 0000:03:00.0: Online Controller Reset(OCR) : Enabled
>> [ 10.511445] megaraid_sas 0000:03:00.0: Secure JBOD support : No
>> [ 10.518543] megaraid_sas 0000:03:00.0: NVMe passthru support : No
>> [ 10.525834] megaraid_sas 0000:03:00.0: FW provided TM TaskAbort/Reset timeout: 0 secs/0 secs
>> [ 10.536251] megaraid_sas 0000:03:00.0: JBOD sequence map support : No
>> [ 10.543931] megaraid_sas 0000:03:00.0: PCI Lane Margining support : No
>> [ 10.574406] megaraid_sas 0000:03:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
>> [ 10.585995] megaraid_sas 0000:03:00.0: INIT adapter done
>> [ 10.592409] megaraid_sas 0000:03:00.0: JBOD sequence map is disabled megasas_setup_jbod_map 5660
>> [ 10.603273] megaraid_sas 0000:03:00.0: pci id : (0x1000)/(0x005d)/(0x1028)/(0x1f42)
>> [ 10.612815] megaraid_sas 0000:03:00.0: unevenspan support : yes
>> [ 10.619919] megaraid_sas 0000:03:00.0: firmware crash dump : no
>> [ 10.627013] megaraid_sas 0000:03:00.0: JBOD sequence map : disabled
>> $ dmesg | grep 1100-8e
>> [ 25.853170] smartpqi 0000:84:00.0: added 11:2:0:0 0000000000000000 RAID Adaptec 1100-8e
>> [ 25.867069] scsi 11:2:0:0: RAID Adaptec 1100-8e 2.93 PQ: 0 ANSI: 5
>> $ xfs_info /dev/sdc
>> meta-data=/dev/sdc isize=512 agcount=28, agsize=268435455 blks
>> = sectsz=512 attr=2, projid32bit=1
>> = crc=1 finobt=1, sparse=0, rmapbt=0
>> = reflink=0
>> data = bsize=4096 blocks=7323648000, imaxpct=5
>> = sunit=0 swidth=0 blks
>> naming =version 2 bsize=4096 ascii-ci=0, ftype=1
>> log =internal log bsize=4096 blocks=521728, version=2
>> = sectsz=512 sunit=0 blks, lazy-count=1
>> realtime =none extsz=4096 blocks=0, rtextents=0
>> $ xfs_info /dev/md0
>> meta-data=/dev/md0 isize=512 agcount=102, agsize=268435328 blks
>> = sectsz=4096 attr=2, projid32bit=1
>> = crc=1 finobt=1, sparse=0, rmapbt=0
>> = reflink=0
>> data = bsize=4096 blocks=27348633088, imaxpct=1
>> = sunit=128 swidth=1792 blks
>> naming =version 2 bsize=4096 ascii-ci=0, ftype=1
>> log =internal log bsize=4096 blocks=521728, version=2
>> = sectsz=4096 sunit=1 blks, lazy-count=1
>> realtime =none extsz=4096 blocks=0, rtextents=0
>> $ df -i /dev/sdc
>> Filesystem Inodes IUsed IFree IUse% Mounted on
>> /dev/sdc 2929459200 4985849 2924473351 1% /home/pmenzel
>> $ df -i /dev/md0
>> Filesystem Inodes IUsed IFree IUse% Mounted on
>> /dev/md0 2187890624 5331603 2182559021 1% /jbod/M8015
>> ```
>>
>> After not using a directory for a while (over 24 hours), changing into
>> it (locally) takes over five seconds or doing some git operations. For
>> example the Linux kernel source git tree located in my home directory.
>> (My shell has some git integration showing the branch name in the
>> prompt (`/usr/share/git-contrib/completion/git-prompt.sh`.) Once in
>> that directory, everything reacts instantly again. When waiting the
>> Linux pressure stall information (PSI) shows IO resource contention.
>>
>> Before:
>>
>> $ grep -R . /proc/pressure/
>> /proc/pressure/io:some avg10=0.40 avg60=0.10 avg300=0.10 total=48330841502
>> /proc/pressure/io:full avg10=0.40 avg60=0.10 avg300=0.10 total=48067233340
>> /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=755842910
>> /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=2530206336
>> /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=2318140732
>>
>> During `git log stable/linux-5.10.y`:
>>
>> $ grep -R . /proc/pressure/
>> /proc/pressure/io:some avg10=26.20 avg60=9.72 avg300=2.37 total=48337351849
>> /proc/pressure/io:full avg10=26.20 avg60=9.72 avg300=2.37 total=48073742033
>> /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=755843898
>> /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=2530209046
>> /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=2318143440
>>
>> The current explanation is, that over night several maintenance
>> scripts like backup/mirroring and accounting scripts are run, which
>> touch all files on the devices. Additionally sometimes other users run
>> cluster jobs with millions of files on the software RAID. Such things
>> invalidate the inode cache, and “my” are thrown out. When I use it
>> afterward it’s slow in the beginning. There is still free memory
>> during these times according to `top`.
>
> $ free -h
> total used free shared buff/cache available
> Mem: 94G 8.3G 5.3G 2.3M 80G 83G
> Swap: 0B 0B 0B
>
>> Does that sound reasonable with ten million inodes? Is that easily
>> verifiable?
>
> If an inode consume 512 bytes with ten million inodes, that would be
> around 500 MB, which should easily fit into the cache, so it does not
> need to be invalidated?
Something is wrong with that calculation, and the cache size is much bigger.
Looking into `/proc/slabinfo` and XFS’ runtime/internal statistics [1],
it turns out that the inode cache is likely the problem.
XFS’ internal stats show that only one third of the inodes requests are
answered from cache.
$ grep ^ig /sys/fs/xfs/stats/stats
ig 1791207386 647353522 20111 1143854223 394 1142080045 10683174
During the problematic time, the SLAB size is around 4 GB and, according
to slabinfo, the inode cache only has around 200.000 (sometimes even as
low as 50.000).
$ sudo grep inode /proc/slabinfo
nfs_inode_cache 16 24 1064 3 1 : tunables 24
12 8 : slabdata 8 8 0
rpc_inode_cache 94 138 640 6 1 : tunables 54
27 8 : slabdata 23 23 0
mqueue_inode_cache 1 4 896 4 1 : tunables 54
27 8 : slabdata 1 1 0
xfs_inode 1693683 1722284 960 4 1 : tunables 54
27 8 : slabdata 430571 430571 0
ext2_inode_cache 0 0 768 5 1 : tunables 54
27 8 : slabdata 0 0 0
reiser_inode_cache 0 0 760 5 1 : tunables 54
27 8 : slabdata 0 0 0
hugetlbfs_inode_cache 2 12 608 6 1 : tunables
54 27 8 : slabdata 2 2 0
sock_inode_cache 346 670 768 5 1 : tunables 54
27 8 : slabdata 134 134 0
proc_inode_cache 121 288 656 6 1 : tunables 54
27 8 : slabdata 48 48 0
shmem_inode_cache 2249 2827 696 11 2 : tunables 54
27 8 : slabdata 257 257 0
inode_cache 209098 209482 584 7 1 : tunables 54
27 8 : slabdata 29926 29926 0
(What is the difference between `xfs_inode` and `inode_cache`?)
Then going through all the files with `find -ls`, the inode cache grows
to four to five million and the SLAB size grows to around 8 GB. Over
night it shrinks back to the numbers above and the page cache grows back.
In the discussions [2], adji`vfs_cache_pressure` is recommended, but –
besides setting it to 0 – it only seems to delay the shrinking of the
cache. (As it’s an integer 1 is the lowest non-zero (positive) number,
which would delay it by a factor of 100.
Is there a way to specify the minimum numbers of entries in the inode
cache, or a minimum SLAB size up to that the caches should not be decreased?
Kind regards,
Paul
[1]: https://xfs.org/index.php/Runtime_Stats#ig
[2]:
https://linux-xfs.oss.sgi.narkive.com/qa0AYeBS/improving-xfs-file-system-inode-performance
"Improving XFS file system inode performance" from 2010
Powered by blists - more mailing lists