Date:   Tue, 11 Dec 2018 17:21:49 +0100
From:   Michal Hocko <mhocko@...nel.org>
To:     "Kirill A. Shutemov" <kirill@...temov.name>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Liu Bo <bo.liu@...ux.alibaba.com>, Jan Kara <jack@...e.cz>,
        Dave Chinner <david@...morbit.com>,
        Theodore Ts'o <tytso@....edu>,
        Johannes Weiner <hannes@...xchg.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>, linux-mm@...ck.org,
        linux-fsdevel@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] mm, memcg: fix reclaim deadlock with writeback

On Tue 11-12-18 18:15:42, Kirill A. Shutemov wrote:
> On Tue, Dec 11, 2018 at 02:26:45PM +0100, Michal Hocko wrote:
[...]
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2993,6 +2993,17 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	vm_fault_t ret;
> >  
> > +	/*
> > +	 * Preallocate pte before we take page_lock because this might lead to
> > +	 * deadlocks for memcg reclaim which waits for pages under writeback.
> > +	 */
> > +	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
> > +		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm, vmf->address);
> > +		if (!vmf->prealloc_pte)
> > +			return VM_FAULT_OOM;
> > +		smp_wmb(); /* See comment in __pte_alloc() */
> > +	}
> > +
> >  	ret = vma->vm_ops->fault(vmf);
> >  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
> >  			    VM_FAULT_DONE_COW)))
> 
> Sorry, but I don't think it fixes anything. Just hides it a level deeper.
> 
> The trick with ->prealloc_pte works for faultaround because we can rely on
> ->map_pages() to not sleep and we know how it will setup page table entry.
> Basically, core controls most of the path.
> 
> It's not the case with ->fault(). It is free to sleep and allocate
> whatever it wants.

Yeah, but if the fault callback wants to allocate then it has to
consider the usual allocation restrictions, e.g. NOFS if the allocation
itself can trip over fs locks.

> For instance, DAX page fault will setup page table entry on its own and
> return VM_FAULT_NOPAGE. It uses vmf_insert_mixed() to setup the page table
> and ignores your pre-allocated page table.

Does this happen with a page locked and with a __GFP_ACCOUNT allocation? I
am not familiar with that code but I do not see it from a quick look.
 
> But it's just an example. The problem is that ->fault() is not bounded on
> what it can do, unlike ->map_pages().

That is a fair point but the primary issue here is that the generic #PF
code breaks the underlying assumption and performs
__GFP_ACCOUNT|GFP_KERNEL allocation from within a fs owned locked page.
-- 
Michal Hocko
SUSE Labs
