linux-kernel - Re: [PATCH] mm: swap: do not wait for lock_page() in unuse_pte

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20200722184841.GC841369@xps-13>
Date:   Wed, 22 Jul 2020 20:48:41 +0200
From:   Andrea Righi <andrea.righi@...onical.com>
To:     Matthew Wilcox <willy@...radead.org>
Cc:     Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm: swap: do not wait for lock_page() in
 unuse_pte_range()

On Wed, Jul 22, 2020 at 07:04:25PM +0100, Matthew Wilcox wrote:
> On Wed, Jul 22, 2020 at 07:44:36PM +0200, Andrea Righi wrote:
> > Waiting for lock_page() with mm->mmap_sem held in unuse_pte_range() can
> > lead to stalls while running swapoff (i.e., not being able to ssh into
> > the system, inability to execute simple commands like 'ps', etc.).
> > 
> > Replace lock_page() with trylock_page() and release mm->mmap_sem if we
> > fail to lock it, giving other tasks a chance to continue and prevent
> > the stall.
> 
> I think you've removed the warning at the expense of turning a stall
> into a potential livelock.
> 
> > @@ -1977,7 +1977,11 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >  			return -ENOMEM;
> >  		}
> >  
> > -		lock_page(page);
> > +		if (!trylock_page(page)) {
> > +			ret = -EAGAIN;
> > +			put_page(page);
> > +			goto out;
> > +		}
> 
> If you look at the patterns we have elsewhere in the MM for doing
> this kind of thing (eg truncate_inode_pages_range()), we iterate over the
> entire range, take care of the easy cases, then go back and deal with the
> hard cases later.
> 
> So that would argue for skipping any page that we can't trylock, but
> continue over at least the VMA, and quite possibly the entire MM until
> we're convinced that we have unused all of the required pages.
> 
> Another thing we could do is drop the MM semaphore _here_, sleep on this
> page until it's unlocked, then go around again.
> 
> 		if (!trylock_page(page)) {
> 			mmap_read_unlock(mm);
> 			lock_page(page);
> 			unlock_page(page);
> 			put_page(page);
> 			ret = -EAGAIN;
> 			goto out;
> 		}
> 
> (I haven't checked the call paths; maybe you can't do this because
> sometimes it's called with the mmap sem held for write)
> 
> Also, if we're trying to scale this better, there are some fun
> workloads where readers block writers who block subsequent readers
> and we shouldn't wait for I/O in swapin_readahead().  See patches like
> 6b4c9f4469819a0c1a38a0a4541337e0f9bf6c11 for more on this kind of thing.

Thanks for the review, Matthew. I'll see if I can find a better solution
following your useful hints!

-Andrea