Date:   Thu, 17 Dec 2020 17:02:03 -0500
From:   Pavel Tatashin <pasha.tatashin@...een.com>
To:     Jason Gunthorpe <jgg@...pe.ca>
Cc:     LKML <linux-kernel@...r.kernel.org>, linux-mm <linux-mm@...ck.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Michal Hocko <mhocko@...e.com>,
        David Hildenbrand <david@...hat.com>,
        Oscar Salvador <osalvador@...e.de>,
        Dan Williams <dan.j.williams@...el.com>,
        Sasha Levin <sashal@...nel.org>,
        Tyler Hicks <tyhicks@...ux.microsoft.com>,
        Joonsoo Kim <iamjoonsoo.kim@....com>, mike.kravetz@...cle.com,
        Steven Rostedt <rostedt@...dmis.org>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Mel Gorman <mgorman@...e.de>,
        Matthew Wilcox <willy@...radead.org>,
        David Rientjes <rientjes@...gle.com>,
        John Hubbard <jhubbard@...dia.com>,
        Linux Doc Mailing List <linux-doc@...r.kernel.org>,
        Ira Weiny <ira.weiny@...el.com>,
        linux-kselftest@...r.kernel.org
Subject: Re: [PATCH v4 08/10] mm/gup: limit number of gup migration failures,
 honor failures

Hi Jason,

Thank you for your comments. My replies below.

On Thu, Dec 17, 2020 at 3:50 PM Jason Gunthorpe <jgg@...pe.ca> wrote:
>
> On Thu, Dec 17, 2020 at 01:52:41PM -0500, Pavel Tatashin wrote:
> > +/*
> > + * Verify that there are no unpinnable (movable) pages; if so, return true.
> > + * Otherwise, if an unpinnable page is found, return false and unpin all pages.
> > + */
> > +static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages,
> > +                               unsigned int gup_flags)
> > +{
> > +     unsigned long i, step;
> > +
> > +     for (i = 0; i < nr_pages; i += step) {
> > +             struct page *head = compound_head(pages[i]);
> > +
> > +             step = compound_nr(head) - (pages[i] - head);
>
> You can't assume that all of a compound head is in the pages array,
> this assumption would only work inside the page walkers if the page
> was found in a PMD or something.

I am not sure I understand your comment. The compound head is not
taken from the pages array, and it is not assumed to be in it. This is
exactly the same logic that we currently have:
https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1565
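
For illustration, when pages[i] points into the middle of a compound
page, the step computation simply skips over the remaining subpages of
that compound page (a worked example, not code from the patch):

    /*
     * Illustrative example: suppose pages[i] is the third subpage of
     * an order-3 compound page, i.e. pages[i] == head + 2 and
     * compound_nr(head) == 8. Then:
     *
     *     step = compound_nr(head) - (pages[i] - head)
     *          = 8 - 2 = 6
     *
     * so the loop advances past the remaining six subpages and
     * re-checks pinnability at the next distinct compound head.
     */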

>
> > +     if (gup_flags & FOLL_PIN) {
> > +             unpin_user_pages(pages, nr_pages);
>
> So we throw everything away? Why? That isn't how the old algorithm worked

It is exactly how the old algorithm worked: if there are any pages to
be migrated (i.e. not-pinnable pages), we unpinned everything.
See here:
https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1603

If cma_page_list is not empty, we unpin everything. The list is not
empty if we isolated some pages, and we isolated pages only when some
of them are not pinnable. Now we do exactly the same thing, but more
cleanly, and we handle errors. We must unpin everything because if we
fail, no pages should stay pinned; also, if we migrated some pages,
the pages array must be updated, so we need to call
__get_user_pages_locked() again to re-pin and repopulate the pages
array.
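
To make that concrete, the old flow at the link above is, heavily
condensed (a sketch, not the literal code; mtc is the
migration_target_control set up earlier in that function):

    if (!list_empty(&cma_page_list)) {
            if (gup_flags & FOLL_PIN)
                    unpin_user_pages(pages, nr_pages);  /* drop every pin */

            migrate_pages(&cma_page_list, alloc_migration_target, NULL,
                          (unsigned long)&mtc, MIGRATE_SYNC,
                          MR_CONTIG_RANGE);
            /* re-pin so that pages[] is repopulated after migration */
            goto check_again;
    }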

>
> > @@ -1654,22 +1664,55 @@ static long __gup_longterm_locked(struct mm_struct *mm,
> >                                 struct vm_area_struct **vmas,
> >                                 unsigned int gup_flags)
> >  {
> > -     unsigned long flags = 0;
> > +     int migrate_retry = 0;
> > +     int isolate_retry = 0;
> > +     unsigned int flags;
> >       long rc;
> >
> > -     if (gup_flags & FOLL_LONGTERM)
> > -             flags = memalloc_pin_save();
> > +     if (!(gup_flags & FOLL_LONGTERM))
> > +             return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> > +                                            NULL, gup_flags);
> >
> > -     rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL,
> > -                                  gup_flags);
> > +     /*
> > +      * Without FOLL_WRITE fault handler may return zero page, which can
> > +      * be in a movable zone, and also will fail to isolate during migration,
> > +      * thus the longterm pin will fail.
> > +      */
> > +     gup_flags &= FOLL_WRITE;
>
> Is &= what you mean here? |= right?

Right, I meant |=.

>
> Seems like we've ended up in a weird place if FOLL_LONGTERM always
> includes FOLL_WRITE. Putting the zero page in ZONE_MOVABLE seems like
> a bad idea, no?

I am not sure. I just found this problem during testing, and this is
the solution I am proposing. I am worried about limiting the zero
page to a non-movable zone, but let's see what others think about
this.
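
To make the failure mode concrete: a read-only long-term pin of an
untouched anonymous page can end up pinning the shared zero page,
which may sit in ZONE_MOVABLE and will fail isolation, so the
long-term pin fails. Forcing FOLL_WRITE makes the fault handler break
CoW and allocate a real, pinnable page. The corrected line is simply:

    gup_flags |= FOLL_WRITE;    /* was mistakenly "&=" in the patch */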

>
> > +     /*
> > +      * Migration may fail, we retry before giving up. Also, because after
> > +      * migration pages[] becomes outdated, we unpin and repin all pages
> > +      * in the range, so pages array is repopulated with new values.
> > +      * Also, because of this we cannot retry migration failures in a loop
> > +      * without pinning/unpinning pages.
> > +      */
>
> The old algorithm made continuous forward progress and only went back
> to the first migration point.

That is not right; the old code went back to the beginning. Making
continuous progress is possible, but we won't see any performance
benefit from it, because migration failures are already exceptional
scenarios where the machine is under memory stress. The truth is that
if we fail to migrate, it is unlikely we will succeed by retrying
right away, so leaving some time between retries may even be
beneficial. Also, with continuous progress we would need to take care
of some corner cases where we have to unpin the pages that already
succeeded in case forward progress is not possible, and to adjust the
pages array, the start address, etc.

>
> > +     for (; ; ) {
>
> while (true)?

Hm, the same thing? :)

>
> > +             rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
> > +                                          NULL, gup_flags);
>
> > +             /* Return if error or if all pages are pinnable */
> > +             if (rc <= 0 || check_and_unpin_pages(rc, pages, gup_flags))
> > +                     break;
>
> So we sweep the pages list twice now?

Yes, but the complexity is still O(N); no new operation is added.
Before, we had something like this:

while (npages)
        check if pinnable
                isolate
while (npages)
        unpin
migrate
while (npages)
        pin

Now:

while (npages)
        check if pinnable
while (npages)
        unpin
while (npages)
        isolate
migrate
pin

>
> > +             /* Some pages are not pinnable, migrate them */
> > +             rc = migrate_movable_pages(rc, pages);
> > +
> > +             /*
> > +              * If there is an error, and we tried maximum number of times
> > +              * bail out. Notice: we return an error code, and all pages are
> > +              * unpinned
> > +              */
> > +             if (rc < 0 && migrate_retry++ >= PINNABLE_MIGRATE_MAX) {
> > +                     break;
> > +             } else if (rc > 0 && isolate_retry++ >= PINNABLE_ISOLATE_MAX) {
> > +                     rc = -EBUSY;
>
> I don't like this at all. It shouldn't be so flakey
>
> Can you do migration without the LRU?

I do not think that is possible; we must isolate pages before migration.
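
For context, the kernel's migration API only accepts pages that have
already been taken off the LRU; a minimal sketch of the standard
sequence (using the existing isolate_lru_page()/migrate_pages()
interfaces; mtc stands in for a migration_target_control as in the
current gup.c code):

    LIST_HEAD(movable_list);

    /* migrate_pages() operates on an isolated list, not live LRU pages */
    if (!isolate_lru_page(page))
            list_add_tail(&page->lru, &movable_list);

    if (!list_empty(&movable_list))
            migrate_pages(&movable_list, alloc_migration_target, NULL,
                          (unsigned long)&mtc, MIGRATE_SYNC,
                          MR_CONTIG_RANGE);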

Pasha
