lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 12 Jan 2022 21:26:02 +0800
From:   Peter Xu <peterx@...hat.com>
To:     Hugh Dickins <hughd@...gle.com>
Cc:     linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        David Hildenbrand <david@...hat.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Yang Shi <shy828301@...il.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Alistair Popple <apopple@...dia.com>,
        "Kirill A . Shutemov" <kirill@...temov.name>,
        Matthew Wilcox <willy@...radead.org>
Subject: Re: [PATCH RFC v2 1/2] mm: Don't skip swap entry even if zap_details
 specified

On Wed, Jan 12, 2022 at 09:18:09PM +0800, Peter Xu wrote:
> On Sat, Jan 08, 2022 at 05:19:04PM -0800, Hugh Dickins wrote:
> > On Mon, 15 Nov 2021, Peter Xu wrote:
> > 
> > > This check existed since the 1st git commit of Linux repository, but at that
> > > time there's no page migration yet so I think it's okay.
> > 
> > //git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
> > helps with the history.  Yes, the check was okay back then,
> > but a lot of changes have come in since: I'll tell more of those below.
> 
> Thanks for looking at this.  By the way, the link is greatly helpful. It's
> always good to be able to read into the history.
> 
> > 
> > You are absolutely right to clean this up and fix the bugs that
> > have crept in, but I think the patch itself is not quite right yet.
> 
> Do you mean the pmd path on checking mapping?  If so I agree, and I'll add that
> in v2 (even if as you pointed out that shouldn't be a real bug, iiuc, as you
> analyzed below).
> 
> Let me know if I missed anything else..
> 
> > 
> > > 
> > > With page migration enabled, it should logically be possible that we zap some
> > > shmem pages during migration.
> > 
> > Yes.
> > 
> > > When that happens, IIUC the old code could have
> > > the RSS counter accounted wrong on MM_SHMEMPAGES because we will zap the ptes
> > > without decreasing the counters for the migrating entries.  I have no unit test
> > > to prove it as I don't know an easy way to trigger this condition, though.
> > 
> > In the no-details case, yes, it does look like that. I ought to try
> > and reproduce that, but responding to mails seems more important.
> 
> Please let me know if you know how to reproduce it, since I don't know yet a
> real reproducer.
> 
> What I can do, though, is if with below patch applied to current linux master:
> 
> =============
> diff --git a/mm/memory.c b/mm/memory.c
> index 8f1de811a1dc..51fe02a22ea9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1383,8 +1383,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                 }
>  
>                 /* If details->check_mapping, we leave swap entries. */
> -               if (unlikely(details))
> +               if (unlikely(details)) {
> +                       WARN_ON_ONCE(is_migration_entry(entry));
>                         continue;
> +               }
>  
>                 if (!non_swap_entry(entry))
>                         rss[MM_SWAPENTS]--;
> =============
> 
> Then I can easily trigger it if with the help of a test program attached
> (zap-migrate.c).

(Attaching for real; also copy Matthew)

> 
> IMHO it won't really reproduce because it seems all relevant shmem operations
> (e.g. hole punch, unmap..) will take the page lock so it won't really race with
> migration (which needs the page lock too), as mentioned in previous cover
> letters when I was still working on the old versions of this.  But it could be
> possible that I missed something.
> 
> While the WARN_ON_ONCE() can trigger only because of the fast path in hole
> punching we have the pre-unmap without lock:
> 
> 		if ((u64)unmap_end > (u64)unmap_start)
> 			unmap_mapping_range(mapping, unmap_start,
> 					    1 + unmap_end - unmap_start, 0);
> 		shmem_truncate_range(inode, offset, offset + len - 1);
> 
> But that will be fixed up right afterwards in shmem_truncate_range(), afaict,
> which is with-lock.  Hence we have a small window to at least trigger the
> warning above, even if it won't really mess up the accounting finally.
> 
> > 
> > > 
> > > Besides, the optimization itself is already confusing IMHO to me in a few points:
> > 
> > It is confusing and unnecessary and wrong, I agree.
> > 
> > > 
> > >   - The wording "skip swap entries" is confusing, because we're not skipping all
> > >     swap entries - we handle device private/exclusive pages before that.
> > 
> > I'm entirely ignorant of device pages, so cannot comment on your 2/2,
> > but of course it's good if you can bring the cases closer together.
> > 
> > > 
> > >   - The skip behavior is enabled as long as zap_details pointer passed over.
> > >     It's very hard to figure that out for a new zap caller because it's unclear
> > >     why we should skip swap entries when we have zap_details specified.
> > 
> > History below will clarify that.
> > 
> > > 
> > >   - With modern systems, especially performance critical use cases, swap
> > >     entries should be rare, so I doubt the usefulness of this optimization
> > >     since it should be on a slow path anyway.
> > > 
> > >   - It is not aligned with what we do with huge pmd swap entries, where in
> > >     zap_huge_pmd() we'll do the accounting unconditionally.
> > 
> > The patch below does not align with what's done in zap_huge_pmd() either;
> > but I think zap_huge_pmd() is okay without "details" because its only swap
> > entries are migration entries, and we do not use huge pages when COWing
> > from file huge pages.
> > 
> > > 
> > > This patch drops that trick, so we handle swap ptes coherently.  Meanwhile we
> > > should do the same mapping check upon migration entries too.
> > > 
> > > Signed-off-by: Peter Xu <peterx@...hat.com>
> > > ---
> > >  mm/memory.c | 6 ++----
> > >  1 file changed, 2 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 8f1de811a1dc..e454f3c6aeb9 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1382,16 +1382,14 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
> > >  			continue;
> > >  		}
> > >  
> > > -		/* If details->check_mapping, we leave swap entries. */
> > > -		if (unlikely(details))
> > > -			continue;
> > > -
> > 
> > Good.
> > 
> > >  		if (!non_swap_entry(entry))
> > >  			rss[MM_SWAPENTS]--;
> > >  		else if (is_migration_entry(entry)) {
> > >  			struct page *page;
> > >  
> > >  			page = pfn_swap_entry_to_page(entry);
> > > +			if (unlikely(zap_skip_check_mapping(details, page)))
> > > +				continue;
> > 
> > Good; though I don't think it's worth an "unlikely", and I say again,
> 
> Sure, I'll drop this "unlikely".  Meanwhile, shall we drop all the rest of the
> "unlikely" too around this check?
> 
> > more forcefully, that "zap_skip_check_mapping" is a terrible name.
> > 
> > David suggested naming its inversion should_zap_page(), yes, that's
> > much better; I'd toyed with do_not_zap_page() and zap_skip_page(),
> > but should_zap_page() for the inversion sounds good to me.
> 
> Ah sure, I'll rename it to should_zap_page() in a new patch before this.
> 
> > 
> > And I'm pleased to see in one of Matthew's diffs that he intends to
> > bring zap_details and zap_skip_check_mapping() back into mm/memory.c
> > from include/linux/mm.h.
> 
> Yeah it's only used in memory.c.  I put it there initially because I wanted
> zap_details user can reference what's that check mapping is all about, but
> maybe that's not necessary.
> 
> If you or Matthew could provide a link to that patch, I'll be happy to pick
> that up too with this series.  Or I can also do nothing assuming Matthew will
> handle it elsewhere.
> 
> > 
> > >  			rss[mm_counter(page)]--;
> > >  		}
> > >  		if (unlikely(!free_swap_and_cache(entry)))
> > > -- 
> > > 2.32.0
> > 
> > History.  zap_details came in 2.6.6, and it was mostly to manage
> > truncation on the non-linear vmas we had at that time (remap_file_pages
> > syscall packing non-sequential offsets into a single vma, with pte_file
> > entries), where index could not be deduced from position of pte in vma:
> > truncation range had to be passed down in zap_details; and an madvise
> > needed it too, so it could not be private to mm/memory.c then.
> > 
> > But at the same time, I added the even_cows argument to
> > unmap_mapping_range(), to distinguish truncation (which POSIX requires
> > to unmap even COWed pages) from invalidation (for page cache coherence,
> > which shouldn't touch private COWs).  However, there appear to be no
> > users of that in 2.6.6, though I wouldn't have added that complication
> > just for the fun of confusing everyone: best guess would be that there
> > was parallel development, and the use for !even_cows got removed in
> > the very same release that it was being added.
> > 
> > (PageAnon was brand new in 2.6.6: maybe it could have been used instead
> > of comparing details->check_mapping, or maybe there's some other case
> > I've forgotten which actually needs the exact mapping check.)
> > 
> > Eventually a few drivers came to use unmap_shared_mapping_range(),
> > the !even_cows caller; but more to the point, hole-punching came in,
> > and I felt very strongly that hole-punching on a file had no right
> > to delete private user data.  So then !even_cows became useful.
> > 
> > IIRC, I've seen Linus say he *detests* zap_details.  It had much better
> > justification in the days of non-linear, and I was sad to add
> > single_page to it quite recently; but hope that can go away later
> > (when try_to_unmap_one() is properly extended to THPs).
> > 
> > Now, here's what may clear up a lot of the confusion.
> > The 2.6.7 zap_pte_range() got a line at the head of zap_pte_range()
> > 	if (details && !details->check_mapping && !details->nonlinear_vma)
> > 		details = NULL;
> > which paired with the
> > 		/*
> > 		 * If details->check_mapping, we leave swap entries;
> > 		 * if details->nonlinear_vma, we leave file entries.
> > 		 */
> > 		if (unlikely(details))
> > 			continue;
> > lower down.  I haven't followed up the actual commits, but ChangeLog-2.6.7
> > implies that 2.6.6 had a "details = NULL;" placed elsewhere but buggily.
> > In 2.6.12 it moved from zap_pte_range() to unmap_page_range().
> > It was intended, not so much to optimize, as to simplify the flow;
> > but in fact has caused all this confusion.
> > 
> > When Kirill discontinued non-linear mapping support in 4.0 (no tears
> > were shed and) the nonlinear_vma comment above got deleted, leaving
> > just the then more puzzling check_mapping comment.
> > 
> > Then in 4.6 the "details = NULL;" seems to have got deleted as part of
> > aac453635549 ("mm, oom: introduce oom reaper"), which added some new
> > details (ignore_dirty and check_swap_entries); which got reverted in
> > 4.11, but apparently the "details = NULL;" not restored.
> > 
> > My stamina is suddenly exhausted, without actually pinpointing a commit
> > for "Fixes:" in your eventual cleanup.  Sorry, I've had enough!
> 
> Yeah it's in most cases a pain for digging all these trivial details, thanks
> for digging already most of it out of the mist.
> 
> That's really what I hope this patch can help: not only because the uffd work
> will rely on it, but also on resolving this early (we do use some wordings like
> "technical debt" sometimes, I think it's the same here but different form) when
> the above "history.git" is still functional so we can still reference.
> 
> With your help and the history.git I can try a better commit message because
> obviously some of the contents needs amending (it's not a pure optimization at
> all), but I assume the patch content will be mostly the same, with the tweaks
> applied.
> 
> Per stated so far I don't know any real reproducer so maybe it's not a real
> issue in any production system?  Maybe that'll make it a bit easier, because
> then we don't strongly require a Fixes (which could be another hard question to
> answer besides the issue itself).
> 
> Thanks,
> 
> -- 
> Peter Xu

-- 
Peter Xu

View attachment "zap-migrate.c" of type "text/plain" (3300 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ