linux-kernel - Re: [PATCH v2] mm/vmscan: fix high cpu usage of kswapd if there are no reclaimable pages

Message-ID: <20170224084949.GA19161@dhcp22.suse.cz>
Date:   Fri, 24 Feb 2017 09:49:50 +0100
From:   Michal Hocko <mhocko@...nel.org>
To:     Jia He <hejianet@...il.com>
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Vlastimil Babka <vbabka@...e.cz>,
        Minchan Kim <minchan@...nel.org>,
        Rik van Riel <riel@...hat.com>
Subject: Re: [PATCH v2] mm/vmscan: fix high cpu usage of kswapd if there are
 no reclaimable pages

On Fri 24-02-17 14:49:52, Jia He wrote:
> On a NUMA server, the topology looks like this:
> available: 3 nodes (0,2-3)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 2 cpus: 0 1 2 3 4 5 6 7
> node 2 size: 15299 MB
> node 2 free: 289 MB
> node 3 cpus:
> node 3 size: 15336 MB
> node 3 free: 184 MB
> node distances:
> node   0   2   3
>   0:  10  40  40
>   2:  40  10  20
>   3:  40  20  10
>  
> When I try to dynamically allocate more hugepages than the system's total
> free memory, e.g.
> echo 4000 >/proc/sys/vm/nr_hugepages
>
> kswapd takes 100% CPU for a long time (more than 3 hours, with no sign of
> ending).
> top result:
> top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
> Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
> KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND    
>    76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3 
> 
> The root cause is that kswapd3 is woken up and tries to reclaim again and
> again, but it makes no progress. In the end, fewer than 4000 hugepages are
> allocated:
> HugePages_Total:    1864
> HugePages_Free:     1864
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:      16384 kB
>   
> At that point, even though there are no reclaimable pages on node 3, kswapd3
> does not go to sleep.
> Node 3, zone      DMA
>   per-node stats
>       nr_inactive_anon 0
>       nr_active_anon 0
>       nr_inactive_file 0
>       nr_active_file 0
>       nr_unevictable 0
>       nr_isolated_anon 0
>       nr_isolated_file 0
>       nr_pages_scanned 0
>       workingset_refault 0
>       workingset_activate 0
>       workingset_nodereclaim 0
>       nr_anon_pages 0
>       nr_mapped    0
>       nr_file_pages 0
>       nr_dirty     0
>       nr_writeback 0
>       nr_writeback_temp 0
>       nr_shmem     0
>       nr_shmem_hugepages 0
>       nr_shmem_pmdmapped 0
>       nr_anon_transparent_hugepages 0
>       nr_unstable  0
>       nr_vmscan_write 0
>       nr_vmscan_immediate_reclaim 0
>       nr_dirtied   0
>       nr_written   0
>   pages free     2951
>         min      2821
>         low      3526
>         high     4231
>    node_scanned  0
>         spanned  245760
>         present  245760
>         managed  245388
>       nr_free_pages 2951
>       nr_zone_inactive_anon 0
>       nr_zone_active_anon 0
>       nr_zone_inactive_file 0
>       nr_zone_active_file 0
>       nr_zone_unevictable 0
>       nr_zone_write_pending 0
>       nr_mlock     0
>       nr_slab_reclaimable 46
>       nr_slab_unreclaimable 90
>       nr_page_table_pages 0
>       nr_kernel_stack 0
>       nr_bounce    0
>       nr_zspages   0
>       numa_hit     2257
>       numa_miss    0
>       numa_foreign 0
>       numa_interleave 982
>       numa_local   0
>       numa_other   2257
>       nr_free_cma  0
>         protection: (0, 0, 0, 0) 
> This could be called a misconfiguration, but it seems quite easy to hit on
> NUMA machines whose nodes differ greatly in size.
> 
> Furthermore, once most of the memory on node 3 is consumed, every allocation
> slow path might wake up kswapd3, which makes things worse:
> __alloc_pages_slowpath
>     wake_all_kswapds
>         wakeup_kswapd
> 
> This patch addresses the issue in two ways:
> 1. In prepare_kswapd_sleep, kswapd keeps reclaiming instead of sleeping only
>   when a zone is not balanced and there are reclaimable pages in that zone.
> 2. kswapd is not woken up if there are no reclaimable pages on that node.
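
[Illustration, not part of the posted patch.] The two checks described above
can be sketched as a small user-space model. This is only a sketch under
stated assumptions: struct zone_model, struct node_model and every function
below are hypothetical names invented for illustration, the watermark test is
reduced to a single comparison, and "reclaimable" is taken to mean LRU pages
only. The real logic lives in mm/vmscan.c and is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for per-zone counters (not struct zone). */
struct zone_model {
    unsigned long free_pages;        /* "pages free" in /proc/zoneinfo */
    unsigned long high_wmark;        /* "high" watermark */
    unsigned long reclaimable_pages; /* assumed: LRU pages that could be reclaimed */
};

/* Hypothetical stand-in for a NUMA node: just an array of zones. */
struct node_model {
    const struct zone_model *zones;
    size_t nr_zones;
};

/* Simplified balance test: free pages are above the high watermark. */
static bool zone_model_balanced(const struct zone_model *z)
{
    return z->free_pages > z->high_wmark;
}

/*
 * Aspect 1: kswapd should skip sleeping only if some zone is both
 * unbalanced and has something left to reclaim.
 */
static bool node_model_should_keep_reclaiming(const struct node_model *node)
{
    assert(node != NULL);
    for (size_t i = 0; i < node->nr_zones; i++) {
        const struct zone_model *z = &node->zones[i];

        if (!zone_model_balanced(z) && z->reclaimable_pages > 0)
            return true;
    }
    return false;
}

/*
 * Aspect 2: waking kswapd is pointless when no zone on the node has
 * reclaimable pages.
 */
static bool node_model_should_wake_kswapd(const struct node_model *node)
{
    assert(node != NULL);
    for (size_t i = 0; i < node->nr_zones; i++) {
        if (node->zones[i].reclaimable_pages > 0)
            return true;
    }
    return false;
}
```

With the node 3 numbers quoted above (free 2951, high 4231, no LRU pages), both
helpers return false in this model: the zone is unbalanced, but there is
nothing to reclaim, so kswapd3 would neither keep running nor be woken up.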
> 
> After this patch:
> top - 07:29:43 up 3 min,  1 user,  load average: 0.12, 0.13, 0.06
> Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.0 us,  0.2 sy,  0.0 ni, 97.8 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem:  31371520 total,   938112 used, 30433408 free,     5504 buffers
> KiB Swap:  6284224 total,        0 used,  6284224 free.   632448 cached Mem
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND    
>    78 root      20   0       0      0      0 S 0.000 0.000   0:00.00 kswapd3    
> 
> Changes:
> V2: - fix incorrect condition for assignment of node_has_reclaimable_pages
>     - make commit description better

I believe we should pursue the proposal from Johannes, which is more
generic and copes with corner cases much better.
-- 
Michal Hocko
SUSE Labs