linux-kernel - Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:   Wed, 12 Dec 2018 10:48:32 +0100
From:   Michal Hocko <mhocko@...nel.org>
To:     "Kirill A. Shutemov" <kirill@...temov.name>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Liu Bo <bo.liu@...ux.alibaba.com>, Jan Kara <jack@...e.cz>,
        Dave Chinner <david@...morbit.com>,
        Theodore Ts'o <tytso@....edu>,
        Johannes Weiner <hannes@...xchg.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>, linux-mm@...ck.org,
        linux-fsdevel@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
        Hugh Dickins <hughd@...gle.com>
Subject: Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback

On Wed 12-12-18 12:42:49, Kirill A. Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@...e.com>
> > 
> > Liu Bo has experienced a deadlock between memcg (legacy) reclaim and the
> > ext4 writeback
> > task1:
> > [<ffffffff811aaa52>] wait_on_page_bit+0x82/0xa0
> > [<ffffffff811c5777>] shrink_page_list+0x907/0x960
> > [<ffffffff811c6027>] shrink_inactive_list+0x2c7/0x680
> > [<ffffffff811c6ba4>] shrink_node_memcg+0x404/0x830
> > [<ffffffff811c70a8>] shrink_node+0xd8/0x300
> > [<ffffffff811c73dd>] do_try_to_free_pages+0x10d/0x330
> > [<ffffffff811c7865>] try_to_free_mem_cgroup_pages+0xd5/0x1b0
> > [<ffffffff8122df2d>] try_charge+0x14d/0x720
> > [<ffffffff812320cc>] memcg_kmem_charge_memcg+0x3c/0xa0
> > [<ffffffff812321ae>] memcg_kmem_charge+0x7e/0xd0
> > [<ffffffff811b68a8>] __alloc_pages_nodemask+0x178/0x260
> > [<ffffffff8120bff5>] alloc_pages_current+0x95/0x140
> > [<ffffffff81074247>] pte_alloc_one+0x17/0x40
> > [<ffffffff811e34de>] __pte_alloc+0x1e/0x110
> > [<ffffffffa06739de>] alloc_set_pte+0x5fe/0xc20
> > [<ffffffff811e5d93>] do_fault+0x103/0x970
> > [<ffffffff811e6e5e>] handle_mm_fault+0x61e/0xd10
> > [<ffffffff8106ea02>] __do_page_fault+0x252/0x4d0
> > [<ffffffff8106ecb0>] do_page_fault+0x30/0x80
> > [<ffffffff8171bce8>] page_fault+0x28/0x30
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > task2:
> > [<ffffffff811aadc6>] __lock_page+0x86/0xa0
> > [<ffffffffa02f1e47>] mpage_prepare_extent_to_map+0x2e7/0x310 [ext4]
> > [<ffffffffa08a2689>] ext4_writepages+0x479/0xd60
> > [<ffffffff811bbede>] do_writepages+0x1e/0x30
> > [<ffffffff812725e5>] __writeback_single_inode+0x45/0x320
> > [<ffffffff81272de2>] writeback_sb_inodes+0x272/0x600
> > [<ffffffff81273202>] __writeback_inodes_wb+0x92/0xc0
> > [<ffffffff81273568>] wb_writeback+0x268/0x300
> > [<ffffffff81273d24>] wb_workfn+0xb4/0x390
> > [<ffffffff810a2f19>] process_one_work+0x189/0x420
> > [<ffffffff810a31fe>] worker_thread+0x4e/0x4b0
> > [<ffffffff810a9786>] kthread+0xe6/0x100
> > [<ffffffff8171a9a1>] ret_from_fork+0x41/0x50
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > He adds
> > : task1 is waiting for the PageWriteback bit of the page that task2 has
> > : collected in mpd->io_submit->io_bio, and tasks2 is waiting for the LOCKED
> > : bit the page which tasks1 has locked.
> > 
> > More precisely task1 is handling a page fault and it has a page locked
> > while it charges a new page table to a memcg. That in turn hits a memory
> > limit reclaim and the memcg reclaim for legacy controller is waiting on
> > the writeback but that is never going to finish because the writeback
> > itself is waiting for the page locked in the #PF path. So this is
> > essentially ABBA deadlock.
> 
> Side node:
> 
> Do we have PG_writeback vs. PG_locked ordering documentated somewhere?

I am not aware of any

> IIUC, the trace from task2 suggests that we must not wait for writeback
> on the locked page.
> 
> But that not what I see for many wait_on_page_writeback() users: it usally
> called with the page locked. I see it for truncate, shmem, swapfile,
> splice...
> 
> Maybe the problem is within task2 codepath after all?

Jack and David have explained that this is due to an optimization
multiple filesystems do. They lock and set wribeback on multiple pages
and then send a largeer IO at once. So in this case we have the
following pattern

                                        lock_page(B)
                                        SetPageWriteback(B)
                                        unlock_page(B)
lock_page(A)
                                        lock_page(A)
pte_alloc_pne
  shrink_page_list
    wait_on_page_writeback(B)
                                        SetPageWriteback(A)
                                        unlock_page(A)

                                        # flush A, B to clear the writeback

-- 
Michal Hocko
SUSE Labs