linux-kernel - Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

Message-ID: <20190127133132.GA9565@techsingularity.net>
Date:   Sun, 27 Jan 2019 14:09:36 +0000
From:   Mel Gorman <mgorman@...hsingularity.net>
To:     valdis.kletnieks@...edu
Cc:     Pavel Machek <pavel@....cz>,
        kernel list <linux-kernel@...r.kernel.org>,
        Andrew Morton <akpm@...l.org>, vbabka@...e.cz,
        aarcange@...hat.com, rientjes@...gle.com, mhocko@...nel.org,
        zi.yan@...rutgers.edu, hannes@...xchg.org, Jan Kara <jack@...e.cz>
Subject: Re: [regression -next0117] What is kcompactd and why is he eating
 100% of my cpu?

Adding Jan Kara to cc because the lockup appears to be within
buffer_migrate_page_norefs, which changed recently.

On Sat, Jan 26, 2019 at 09:56:53PM -0500, valdis.kletnieks@...edu wrote:
> On Sat, 26 Jan 2019 21:00:05 +0100, Pavel Machek said:
> 
> > top - 13:38:51 up  1:42, 16 users,  load average: 1.41, 1.93, 1.62
> > Tasks: 182 total,   3 running, 138 sleeping,   0 stopped,   0 zombie
> > %Cpu(s):  2.3 us, 57.8 sy,  0.0 ni, 39.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> > KiB Mem:   3020044 total,  2429420 used,   590624 free,    27468 buffers
> > KiB Swap:  2097148 total,        0 used,  2097148 free.  1924268 cached Mem
> >
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> >   608 root      20   0       0      0      0 R  99.6  0.0  11:34.38 kcompactd0
> >  9782 root      20   0       0      0      0 I   7.9  0.0   0:59.02 kworker/0:
> >  2971 root      20   0   46624  23076  13576 S   4.3  0.8   2:50.22 Xorg
> 
> I've noticed this as well on earlier kernels (next-20181224 to 20190115)
> 
> Some more info:
> 
> 1) echo 3 > /proc/sys/vm/drop_caches  unwedges kcompactd in 1-3 seconds.
> 
> 2) Typical kcompactd traceback:
> 
> cat /proc/27/stack
> [<0>] retint_kernel+0x1b/0x2d
> [<0>] lock_is_held_type+0x1b/0x50
> [<0>] ___might_sleep+0xad/0x220
> [<0>] __might_sleep+0x113/0x130
> [<0>] on_each_cpu_cond_mask+0x12a/0x140
> [<0>] on_each_cpu_cond+0x18/0x20
> [<0>] invalidate_bh_lrus+0x29/0x30
> [<0>] __buffer_migrate_page+0x154/0x340
> [<0>] buffer_migrate_page_norefs+0x14/0x20
> [<0>] move_to_new_page+0x8e/0x360
> [<0>] migrate_pages+0x3cc/0xfd8
> [<0>] compact_zone+0xb70/0x1380
> [<0>] kcompactd_do_work+0x15b/0x500
> [<0>] kcompactd+0x74/0x340
> [<0>] kthread+0x158/0x170
> [<0>] ret_from_fork+0x3a/0x50
> [<0>] 0xffffffffffffffff
> 
> I've also seen khugepaged hung up:
> 
> cat /proc/29/stack
> [<0>] ___preempt_schedule+0x16/0x18
> [<0>] page_vma_mapped_walk+0x60/0x840
> [<0>] remove_migration_pte+0x67/0x390
> [<0>] rmap_walk_file+0x186/0x380
> [<0>] rmap_walk+0xa3/0xd0
> [<0>] remove_migration_ptes+0x69/0x70
> [<0>] migrate_pages+0xb6d/0xfd8
> [<0>] compact_zone+0xb70/0x1370
> [<0>] compact_zone_order+0xd8/0x120
> [<0>] try_to_compact_pages+0xe5/0x550
> [<0>] __alloc_pages_direct_compact+0x6d/0x1a0
> [<0>] __alloc_pages_slowpath+0x6c9/0x1640
> [<0>] __alloc_pages_nodemask+0x558/0x5b0
> [<0>] khugepaged+0x499/0x810
> [<0>] kthread+0x158/0x170
> [<0>] ret_from_fork+0x3a/0x50
> [<0>] 0xffffffffffffffff
> 
> Looks like something has gone astray with compact_zone.
> 
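
Both quoted traces go through migrate_pages called from compact_zone. As a
rough aid for comparing such dumps, here is a minimal Python sketch that
extracts symbol names from /proc/<pid>/stack-style text and lists the frames
two traces have in common. The helper names (parse_frames, shared_frames) and
the regular expression are illustrative assumptions, not an existing tool, and
the sample data is an excerpt of the traces quoted above.

```python
import re

# Matches frames such as "[<0>] migrate_pages+0x3cc/0xfd8". The trailing
# "[<0>] 0xffffffffffffffff" marker has no symbol name and is skipped.
# Leading "> " quote prefixes are ignored because we search, not match.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_frames(text):
    """Return symbol names in the order they appear (innermost first)."""
    return [m.group(1) for m in FRAME_RE.finditer(text)]


def shared_frames(first, second):
    """Return the frames of `first` that also occur in `second`, in order."""
    second_names = set(second)
    return [name for name in first if name in second_names]


# Excerpt of the kcompactd trace quoted above.
KCOMPACTD = """\
> [<0>] invalidate_bh_lrus+0x29/0x30
> [<0>] __buffer_migrate_page+0x154/0x340
> [<0>] buffer_migrate_page_norefs+0x14/0x20
> [<0>] move_to_new_page+0x8e/0x360
> [<0>] migrate_pages+0x3cc/0xfd8
> [<0>] compact_zone+0xb70/0x1380
> [<0>] kcompactd_do_work+0x15b/0x500
> [<0>] kcompactd+0x74/0x340
> [<0>] kthread+0x158/0x170
> [<0>] ret_from_fork+0x3a/0x50
> [<0>] 0xffffffffffffffff
"""

# Excerpt of the khugepaged trace quoted above.
KHUGEPAGED = """\
> [<0>] remove_migration_ptes+0x69/0x70
> [<0>] migrate_pages+0xb6d/0xfd8
> [<0>] compact_zone+0xb70/0x1370
> [<0>] compact_zone_order+0xd8/0x120
> [<0>] try_to_compact_pages+0xe5/0x550
> [<0>] khugepaged+0x499/0x810
> [<0>] kthread+0x158/0x170
> [<0>] ret_from_fork+0x3a/0x50
> [<0>] 0xffffffffffffffff
"""

if __name__ == "__main__":
    common = shared_frames(parse_frames(KCOMPACTD), parse_frames(KHUGEPAGED))
    print("\n".join(common))
```

Ignoring the generic kthread and ret_from_fork entries, the overlap is
migrate_pages and compact_zone, which matches the observation above. The
sketch compares names only; offsets and function sizes are deliberately
dropped.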

-- 
Mel Gorman
SUSE Labs