Message-ID: <4F954604.2030704@gmail.com>
Date:	Mon, 23 Apr 2012 14:07:32 +0200
From:	Zdenek Kaspar <zkaspar82@...il.com>
To:	linux-kernel@...r.kernel.org
Cc:	linux-mm@...ck.org
Subject: Re: Over-eager swapping

On 23.4.2012 11:27, Richard Davies wrote:
> We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
> virtual machines on each of them, and I'm having some trouble with over-eager
> swapping on some of the machines. This is resulting in load spikes during the
> swapping and customer reports of very poor response latency from the virtual
> machines which have been swapped out, despite the hosts apparently having
> large amounts of free memory, and running fine if swap is turned off.
> 
> 
> All of the hosts are currently running a 3.1.4 or 3.2.2 kernel and have ksm
> enabled with 64GB of RAM and 2x eight-core AMD Opteron 6128 processors.
> However, we have seen this same problem since 2010 on a 2.6.32.7 kernel and
> older hardware - see http://marc.info/?l=linux-mm&m=128075337008943
> (previous helpful contributors cc:ed here - thanks).
> 
> We have /proc/sys/vm/swappiness set to 0. The kernel config is here:
> http://users.org.uk/config-3.1.4
> 
> 
> The rrd graphs at http://imgur.com/a/Fklxr show a typical incident.
> 
> We estimate memory used from /proc/meminfo as:
> 
>   = MemTotal - MemFree - Buffers + SwapTotal - SwapFree
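
A minimal Python sketch of the estimate above, for anyone who wants to
reproduce the numbers. The function names are hypothetical and not taken
from the original tooling; only the formula and the /proc/meminfo field
names come from this message.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict (kB or counts)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def memory_used_kb(m):
    """MemTotal - MemFree - Buffers + SwapTotal - SwapFree, as in the message."""
    return (m["MemTotal"] - m["MemFree"] - m["Buffers"]
            + m["SwapTotal"] - m["SwapFree"])


def swap_used_kb(m):
    """SwapTotal - SwapFree."""
    return m["SwapTotal"] - m["SwapFree"]


# Only read the live file when run directly on a system that has it.
if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print("memory used: %d kB" % memory_used_kb(info))
    print("swap used:   %d kB" % swap_used_kb(info))
```

Fed the /proc/meminfo values quoted further down, this gives 32356672 kB
memory used and 2986760 kB swap used.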
> 
> The first rrd shows memory used increasing as a VM starts, but not getting
> near the 64GB of physical RAM.
> 
> The second rrd shows the heavy swapping this VM start caused.
> 
> The third rrd shows a multi-gigabyte jump in swap used = SwapTotal - SwapFree
> 
> The fourth rrd shows the large load spike (from 1 to 15) caused by this swap
> storm.
> 
> 
> It is obviously hard to capture all of the relevant data actually during an
> incident. However, as of this morning, the relevant stats are as below.
> 
> Any help much appreciated! Our strong belief is that there is unnecessary
> swapping going on here, and causing these load spikes. We would like to run
> with swap for real out-of-memory situations, but at present it is causing
> these kinds of load spikes on machines which run completely happily with swap
> disabled.
> 
> Thanks,
> 
> Richard.
> 
> 
> # cat /proc/meminfo
> MemTotal:       65915384 kB
> MemFree:          271104 kB
> Buffers:        36274368 kB
> Cached:            31048 kB
> SwapCached:      1830860 kB
> Active:         30594144 kB
> Inactive:       32295972 kB
> Active(anon):   21883428 kB
> Inactive(anon):  4695308 kB
> Active(file):    8710716 kB
> Inactive(file): 27600664 kB
> Unevictable:        6740 kB
> Mlocked:            6740 kB
> SwapTotal:      33054708 kB
> SwapFree:       30067948 kB
> Dirty:              1044 kB
> Writeback:             0 kB
> AnonPages:      24962708 kB
> Mapped:             7320 kB
> Shmem:                48 kB
> Slab:            2210964 kB
> SReclaimable:    1013272 kB
> SUnreclaim:      1197692 kB
> KernelStack:        6816 kB
> PageTables:       129248 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:    66012400 kB
> Committed_AS:   67375852 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:      259380 kB
> VmallocChunk:   34308695568 kB
> HardwareCorrupted:     0 kB
> AnonHugePages:    155648 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> DirectMap4k:         576 kB
> DirectMap2M:     2095104 kB
> DirectMap1G:    65011712 kB
> 
> # cat /proc/sys/vm/zone_reclaim_mode
> 0
> 
> # cat /proc/sys/vm/min_unmapped_ratio
> 1
> 
> # cat /proc/slabinfo
> slabinfo - version: 2.1
> # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> ext4_groupinfo_1k     32     32    128   32    1 : tunables    0    0    0 : slabdata      1      1      0
> RAWv6                 34     34    960   34    8 : tunables    0    0    0 : slabdata      1      1      0
> UDPLITEv6              0      0    960   34    8 : tunables    0    0    0 : slabdata      0      0      0
> UDPv6                544    544    960   34    8 : tunables    0    0    0 : slabdata     16     16      0
> tw_sock_TCPv6          0      0    320   25    2 : tunables    0    0    0 : slabdata      0      0      0
> TCPv6                 72     72   1728   18    8 : tunables    0    0    0 : slabdata      4      4      0
> nf_conntrack_expect    592    592    216   37    2 : tunables    0    0    0 : slabdata     16     16      0
> nf_conntrack_ffffffff8199a280    933   1856    280   29    2 : tunables    0    0    0 : slabdata     64     64      0
> dm_raid1_read_record      0      0   1064   30    8 : tunables    0    0    0 : slabdata      0      0      0
> dm_snap_pending_exception      0      0    104   39    1 : tunables    0    0    0 : slabdata      0      0      0
> dm_crypt_io         1811   2574    152   26    1 : tunables    0    0    0 : slabdata     99     99      0
> kcopyd_job             0      0   3240   10    8 : tunables    0    0    0 : slabdata      0      0      0
> dm_uevent              0      0   2608   12    8 : tunables    0    0    0 : slabdata      0      0      0
> cfq_queue              0      0    232   35    2 : tunables    0    0    0 : slabdata      0      0      0
> bsg_cmd                0      0    312   26    2 : tunables    0    0    0 : slabdata      0      0      0
> mqueue_inode_cache     36     36    896   36    8 : tunables    0    0    0 : slabdata      1      1      0
> udf_inode_cache        0      0    656   24    4 : tunables    0    0    0 : slabdata      0      0      0
> fuse_request           0      0    608   26    4 : tunables    0    0    0 : slabdata      0      0      0
> fuse_inode             0      0    704   46    8 : tunables    0    0    0 : slabdata      0      0      0
> ntfs_big_inode_cache      0      0    832   39    8 : tunables    0    0    0 : slabdata      0      0      0
> ntfs_inode_cache       0      0    280   29    2 : tunables    0    0    0 : slabdata      0      0      0
> isofs_inode_cache      0      0    600   27    4 : tunables    0    0    0 : slabdata      0      0      0
> fat_inode_cache        0      0    664   24    4 : tunables    0    0    0 : slabdata      0      0      0
> fat_cache              0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
> hugetlbfs_inode_cache     28     28    568   28    4 : tunables    0    0    0 : slabdata      1      1      0
> squashfs_inode_cache      0      0    640   25    4 : tunables    0    0    0 : slabdata      0      0      0
> jbd2_journal_handle   2720   2720     24  170    1 : tunables    0    0    0 : slabdata     16     16      0
> jbd2_journal_head    818   1620    112   36    1 : tunables    0    0    0 : slabdata     45     45      0
> jbd2_revoke_record   2048   4096     32  128    1 : tunables    0    0    0 : slabdata     32     32      0
> ext4_inode_cache    2754   5328    864   37    8 : tunables    0    0    0 : slabdata    144    144      0
> ext4_xattr             0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
> ext4_free_data      1168   2628     56   73    1 : tunables    0    0    0 : slabdata     36     36      0
> ext4_allocation_context    540    540    136   30    1 : tunables    0    0    0 : slabdata     18     18      0
> ext4_io_end            0      0   1128   29    8 : tunables    0    0    0 : slabdata      0      0      0
> ext4_io_page         256    256     16  256    1 : tunables    0    0    0 : slabdata      1      1      0
> configfs_dir_cache      0      0     88   46    1 : tunables    0    0    0 : slabdata      0      0      0
> kioctx                 0      0    384   42    4 : tunables    0    0    0 : slabdata      0      0      0
> inotify_inode_mark     30     30    136   30    1 : tunables    0    0    0 : slabdata      1      1      0
> kvm_async_pf         448    448    144   28    1 : tunables    0    0    0 : slabdata     16     16      0
> kvm_vcpu              64     94  13856    2    8 : tunables    0    0    0 : slabdata     47     47      0
> UDP-Lite               0      0    768   42    8 : tunables    0    0    0 : slabdata      0      0      0
> xfrm_dst_cache         0      0    448   36    4 : tunables    0    0    0 : slabdata      0      0      0
> ip_fib_trie          219    219     56   73    1 : tunables    0    0    0 : slabdata      3      3      0
> arp_cache            417    500    320   25    2 : tunables    0    0    0 : slabdata     20     20      0
> RAW                  672    672    768   42    8 : tunables    0    0    0 : slabdata     16     16      0
> UDP                  672    672    768   42    8 : tunables    0    0    0 : slabdata     16     16      0
> tw_sock_TCP          512   1088    256   32    2 : tunables    0    0    0 : slabdata     34     34      0
> TCP                  345    357   1536   21    8 : tunables    0    0    0 : slabdata     17     17      0
> blkdev_queue         414    440   1616   20    8 : tunables    0    0    0 : slabdata     22     22      0
> blkdev_requests      945   2209    344   47    4 : tunables    0    0    0 : slabdata     47     47      0
> sock_inode_cache     456    475    640   25    4 : tunables    0    0    0 : slabdata     19     19      0
> shmem_inode_cache   2063   2375    632   25    4 : tunables    0    0    0 : slabdata     95     95      0
> Acpi-ParseExt       3848   3864     72   56    1 : tunables    0    0    0 : slabdata     69     69      0
> Acpi-Namespace    633667 1059270     40  102    1 : tunables    0    0    0 : slabdata  10385  10385      0
> task_delay_info     1238   1584    112   36    1 : tunables    0    0    0 : slabdata     44     44      0
> taskstats            384    384    328   24    2 : tunables    0    0    0 : slabdata     16     16      0
> proc_inode_cache    2460   3250    616   26    4 : tunables    0    0    0 : slabdata    125    125      0
> sigqueue             400    400    160   25    1 : tunables    0    0    0 : slabdata     16     16      0
> bdev_cache           701    714    768   42    8 : tunables    0    0    0 : slabdata     17     17      0
> sysfs_dir_cache    31662  34425     80   51    1 : tunables    0    0    0 : slabdata    675    675      0
> inode_cache         2546   3886    552   29    4 : tunables    0    0    0 : slabdata    134    134      0
> dentry              9452  14868    192   42    2 : tunables    0    0    0 : slabdata    354    354      0
> buffer_head       8175114 8360937    104   39    1 : tunables    0    0    0 : slabdata 214383 214383      0
> vm_area_struct     35344  35834    176   46    2 : tunables    0    0    0 : slabdata    782    782      0
> files_cache          736    874    704   46    8 : tunables    0    0    0 : slabdata     19     19      0
> signal_cache        1011   1296    896   36    8 : tunables    0    0    0 : slabdata     36     36      0
> sighand_cache        682    945   2112   15    8 : tunables    0    0    0 : slabdata     63     63      0
> task_struct         1057   1386   1520   21    8 : tunables    0    0    0 : slabdata     66     66      0
> anon_vma            2417   2856     72   56    1 : tunables    0    0    0 : slabdata     51     51      0
> shared_policy_node   4877   6800     48   85    1 : tunables    0    0    0 : slabdata     80     80      0
> numa_policy        45589  48450     24  170    1 : tunables    0    0    0 : slabdata    285    285      0
> radix_tree_node   227192 248388    568   28    4 : tunables    0    0    0 : slabdata   9174   9174      0
> idr_layer_cache      603    660    544   30    4 : tunables    0    0    0 : slabdata     22     22      0
> dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-1024       0      0   1024   32    8 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-512        0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-256        0      0    256   32    2 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-128        0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-192        0      0    192   42    2 : tunables    0    0    0 : slabdata      0      0      0
> dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
> kmalloc-8192          88    100   8192    4    8 : tunables    0    0    0 : slabdata     25     25      0
> kmalloc-4096        3567   3704   4096    8    8 : tunables    0    0    0 : slabdata    463    463      0
> kmalloc-2048       55140  55936   2048   16    8 : tunables    0    0    0 : slabdata   3496   3496      0
> kmalloc-1024        5960   6496   1024   32    8 : tunables    0    0    0 : slabdata    203    203      0
> kmalloc-512        12185  12704    512   32    4 : tunables    0    0    0 : slabdata    397    397      0
> kmalloc-256       195078 199040    256   32    2 : tunables    0    0    0 : slabdata   6220   6220      0
> kmalloc-128        45645  47328    128   32    1 : tunables    0    0    0 : slabdata   1479   1479      0
> kmalloc-64        14647251 14776576     64   64    1 : tunables    0    0    0 : slabdata 230884 230884      0
> kmalloc-32          5573   7552     32  128    1 : tunables    0    0    0 : slabdata     59     59      0
> kmalloc-16          7550  10752     16  256    1 : tunables    0    0    0 : slabdata     42     42      0
> kmalloc-8          13805  14848      8  512    1 : tunables    0    0    0 : slabdata     29     29      0
> kmalloc-192        47641  50883    192   42    2 : tunables    0    0    0 : slabdata   1214   1214      0
> kmalloc-96          3673   6006     96   42    1 : tunables    0    0    0 : slabdata    143    143      0
> kmem_cache            32     32    256   32    2 : tunables    0    0    0 : slabdata      1      1      0
> kmem_cache_node      495    576     64   64    1 : tunables    0    0    0 : slabdata      9      9      0
> 
> # cat /proc/buddyinfo
> Node 0, zone      DMA      0      0      1      0      2      1      1      0      1      1      3
> Node 0, zone    DMA32   9148   1941    657    673    131     53     18      2      0      0      0
> Node 0, zone   Normal   8080     13      0      2      0      2      1      0      1      0      0
> Node 1, zone   Normal  19071   3239    675    200    413     37      4      1      2      0      0
> Node 2, zone   Normal  37716   3924    154      9      3      1      2      0      1      0      0
> Node 3, zone   Normal  20015   4590   1768    996    334     20      1      1      1      0      0
> 
> # grep MemFree /sys/devices/system/node/node*/meminfo
> /sys/devices/system/node/node0/meminfo:Node 0 MemFree:          201460 kB
> /sys/devices/system/node/node1/meminfo:Node 1 MemFree:          283224 kB
> /sys/devices/system/node/node2/meminfo:Node 2 MemFree:          287060 kB
> /sys/devices/system/node/node3/meminfo:Node 3 MemFree:          316928 kB
> 
> # cat /proc/vmstat
> nr_free_pages 224933
> nr_inactive_anon 1173838
> nr_active_anon 5209232
> nr_inactive_file 6998686
> nr_active_file 2180311
> nr_unevictable 1685
> nr_mlock 1685
> nr_anon_pages 5940145
> nr_mapped 1836
> nr_file_pages 9635092
> nr_dirty 603
> nr_writeback 0
> nr_slab_reclaimable 253121
> nr_slab_unreclaimable 299440
> nr_page_table_pages 32311
> nr_kernel_stack 854
> nr_unstable 0
> nr_bounce 0
> nr_vmscan_write 50485772
> nr_writeback_temp 0
> nr_isolated_anon 0
> nr_isolated_file 0
> nr_shmem 12
> nr_dirtied 5630347228
> nr_written 5625041387
> numa_hit 28372623283
> numa_miss 4761673976
> numa_foreign 4761673976
> numa_interleave 30490
> numa_local 28372334279
> numa_other 4761962980
> nr_anon_transparent_hugepages 76
> nr_dirty_threshold 8192
> nr_dirty_background_threshold 4096
> pgpgin 9523143630
> pgpgout 23124688920
> pswpin 57978726
> pswpout 50121412
> pgalloc_dma 0
> pgalloc_dma32 1132547190
> pgalloc_normal 32421613044
> pgalloc_movable 0
> pgfree 39379011152
> pgactivate 751722445
> pgdeactivate 591205976
> pgfault 41103638391
> pgmajfault 11853858
> pgrefill_dma 0
> pgrefill_dma32 24124080
> pgrefill_normal 540719764
> pgrefill_movable 0
> pgsteal_dma 0
> pgsteal_dma32 297677595
> pgsteal_normal 4784595717
> pgsteal_movable 0
> pgscan_kswapd_dma 0
> pgscan_kswapd_dma32 241277864
> pgscan_kswapd_normal 4004618399
> pgscan_kswapd_movable 0
> pgscan_direct_dma 0
> pgscan_direct_dma32 65729843
> pgscan_direct_normal 1012932822
> pgscan_direct_movable 0
> zone_reclaim_failed 0
> pginodesteal 66
> slabs_scanned 668153728
> kswapd_steal 4063341017
> kswapd_inodesteal 2063
> kswapd_low_wmark_hit_quickly 9834
> kswapd_high_wmark_hit_quickly 488468
> kswapd_skip_congestion_wait 580150
> pageoutrun 22006623
> allocstall 926752
> pgrotated 28467920
> compact_blocks_moved 522323130
> compact_pages_moved 5774251432
> compact_pagemigrate_failed 5267247
> compact_stall 121045
> compact_fail 68349
> compact_success 52696
> htlb_buddy_alloc_success 0
> htlb_buddy_alloc_fail 0
> unevictable_pgs_culled 19976952
> unevictable_pgs_scanned 0
> unevictable_pgs_rescued 33137561
> unevictable_pgs_mlocked 35042070
> unevictable_pgs_munlocked 33138335
> unevictable_pgs_cleared 0
> unevictable_pgs_stranded 0
> unevictable_pgs_mlockfreed 1024
> thp_fault_alloc 263176
> thp_fault_fallback 717335
> thp_collapse_alloc 21307
> thp_collapse_alloc_failed 91103
> thp_split 90328
> 
> 

I see this issue too.

Does anyone on the list know whether it is possible to disable memory
reclaim for specific processes without patching their binaries?

It is really frustrating to see latency-sensitive processes swapped
out, for example:

./getswap.pl
  PID         COMMAND        SWSIZE
    1            init        116 kB
  594           udevd        288 kB
 1211        dhclient        344 kB
 1255        rsyslogd        208 kB
 1274         rpcbind        140 kB
 1292       rpc.statd        444 kB
 1310           mdadm         84 kB
 1421          upsmon        436 kB
 1422          upsmon        408 kB
 1432            sshd        556 kB
 1454        ksmtuned         96 kB
 1463           crond        552 kB
 1494          smartd        164 kB
 1502        mingetty         76 kB
 2200            smbd        620 kB
 2212            smbd        748 kB
 2213            nmbd        532 kB
 2265      rpc.mountd        428 kB
 2282            tgtd         92 kB
 2283            tgtd         96 kB
 2328        qemu-vm3      15512 kB
 2366        qemu-vm2      13204 kB
 2410        qemu-vm4      17140 kB
 2448        qemu-vm5      38532 kB
 2495        qemu-vm6      19148 kB
 2534        qemu-vm7      44552 kB
 2579        qemu-vm9      18788 kB
 2620       qemu-vm10      19256 kB
 2699        qemu-vm8      40204 kB
 6376        qemu-vm1      29232 kB
 7646            ntpd        280 kB
32322            smbd        468 kB
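
For reference, a listing like the one above can be approximated with a
short Python sketch. This is not the getswap.pl script itself, whose source
is not shown here. It assumes the kernel exposes a VmSwap line in
/proc/<pid>/status (present since 2.6.34), and the function names are
hypothetical.

```python
import os


def parse_status(text):
    """Return (name, swap_kb) from /proc/<pid>/status text; swap_kb may be None."""
    name, swap_kb = None, None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        if key == "Name":
            name = value.strip()
        elif key == "VmSwap":
            swap_kb = int(value.split()[0])
    return name, swap_kb


def swap_by_process(proc_root="/proc"):
    """Collect (pid, name, swap_kb) for every process with swap in use."""
    rows = []
    if not os.path.isdir(proc_root):
        return rows
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                name, swap_kb = parse_status(f.read())
        except OSError:
            continue  # process exited or is not readable
        if swap_kb:  # skips kernel threads (no VmSwap) and zero usage
            rows.append((int(entry), name, swap_kb))
    return sorted(rows)


def format_row(pid, name, swap_kb):
    """Format one row in roughly the column layout shown above."""
    return "%5d %15s %10d kB" % (pid, name, swap_kb)


if __name__ == "__main__":
    print("%5s %15s %13s" % ("PID", "COMMAND", "SWSIZE"))
    for pid, name, swap_kb in swap_by_process():
        print(format_row(pid, name, swap_kb))
```

Reading other users' status files may need root, so an unprivileged run can
show fewer rows than the listing above.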

NOTE to OP: I simply avoid using swap where possible. With qemu-kvm you can
also use hugetlbfs as a workaround, but you will sacrifice some
functionality, such as KSM and possibly memory ballooning.

HTH, Z.
