Date:   Thu, 14 Jan 2021 14:38:58 +0900
From:   Sergey Senozhatsky <sergey.senozhatsky@...il.com>
To:     Hugh Dickins <hughd@...gle.com>
Cc:     Sergey Senozhatsky <sergey.senozhatsky.work@...il.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "Kirill A. Shutemov" <kirill@...temov.name>,
        Suleiman Souhlal <suleiman@...gle.com>,
        Matthew Wilcox <willy@...radead.org>,
        Andrea Arcangeli <aarcange@...hat.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: madvise(MADV_REMOVE) deadlocks on shmem THP

On (21/01/13 20:31), Hugh Dickins wrote:
> > We are running into lockups during memory-pressure tests on our
> > boards, which essentially end in an NMI panic. In short, the test case is
> > 
> > - THP shmem
> >     echo advise > /sys/kernel/mm/transparent_hugepage/shmem_enabled
> > 
> > - And a user-space process doing madvise(MADV_HUGEPAGE) on new mappings,
> >   and madvise(MADV_REMOVE) when it wants to remove the page range
> > 
> > The problem boils down to the reverse locking chain:
> > 	kswapd does
> > 
> > 		lock_page(page) -> down_read(page->mapping->i_mmap_rwsem)
> > 
> > 	madvise() process does
> > 
> > 		down_write(page->mapping->i_mmap_rwsem) -> lock_page(page)
> > 
> > 
> > 
> > CPU0                                                       CPU1
> > 
> > kswapd                                                     vfs_fallocate()
> >  shrink_node()                                              shmem_fallocate()
> >   shrink_active_list()                                       unmap_mapping_range()
> >    page_referenced() << lock page:PG_locked >>                unmap_mapping_pages()  << down_write(mapping->i_mmap_rwsem) >>
> >     rmap_walk_file()                                           zap_page_range_single()
> >      down_read(mapping->i_mmap_rwsem) << W-locked on CPU1>>     unmap_page_range()
> >       rwsem_down_read_failed()                                   __split_huge_pmd()
> >        __rwsem_down_read_failed_common()                          __lock_page()  << PG_locked on CPU0 >>
> >         schedule()                                                 wait_on_page_bit_common()
> >                                                                     io_schedule()
> 
> Very interesting, Sergey: many thanks for this report.

Thanks for the quick feedback.

> There is no doubt that kswapd is right in its lock ordering:
> __split_huge_pmd() is in the wrong to be attempting lock_page().
> 
> Which used not to be done, but was added in 5.8's c444eb564fb1 ("mm:
> thp: make the THP mapcount atomic against __split_huge_pmd_locked()").

Hugh, I forgot to mention: we are facing these issues on 4.19.
Let me check whether we have cherry-picked c444eb564fb1.

	-ss
