linux-kernel - Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKgT0Ud6ZQ4ZTm1cAUKCdb8FMu0fk9vXgf-bnmb0aY5ndDHwyA@mail.gmail.com>
Date:   Wed, 12 Aug 2020 19:17:49 -0700
From:   Alexander Duyck <alexander.duyck@...il.com>
To:     Alex Shi <alex.shi@...ux.alibaba.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Tejun Heo <tj@...nel.org>, Hugh Dickins <hughd@...gle.com>,
        Konstantin Khlebnikov <khlebnikov@...dex-team.ru>,
        Daniel Jordan <daniel.m.jordan@...cle.com>,
        Yang Shi <yang.shi@...ux.alibaba.com>,
        Matthew Wilcox <willy@...radead.org>,
        Johannes Weiner <hannes@...xchg.org>,
        kbuild test robot <lkp@...el.com>,
        linux-mm <linux-mm@...ck.org>,
        LKML <linux-kernel@...r.kernel.org>, cgroups@...r.kernel.org,
        Shakeel Butt <shakeelb@...gle.com>,
        Joonsoo Kim <iamjoonsoo.kim@....com>,
        Wei Yang <richard.weiyang@...il.com>,
        "Kirill A. Shutemov" <kirill@...temov.name>,
        Rong Chen <rong.a.chen@...el.com>
Subject: Re: [PATCH v17 14/21] mm/compaction: do page isolation first in compaction

On Wed, Aug 12, 2020 at 6:47 PM Alex Shi <alex.shi@...ux.alibaba.com> wrote:
>
>
>
> 在 2020/8/13 上午12:51, Alexander Duyck 写道:
> > On Wed, Aug 12, 2020 at 4:44 AM Alex Shi <alex.shi@...ux.alibaba.com> wrote:
> >>
> >>
> >>
> >> 在 2020/8/11 下午10:47, Alexander Duyck 写道:
> >>> On Tue, Aug 11, 2020 at 1:23 AM Alex Shi <alex.shi@...ux.alibaba.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> 在 2020/8/10 下午10:41, Alexander Duyck 写道:
> >>>>> On Mon, Aug 10, 2020 at 6:10 AM Alex Shi <alex.shi@...ux.alibaba.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> 在 2020/8/7 下午10:51, Alexander Duyck 写道:
> >>>>>>> I wonder if this entire section shouldn't be restructured. This is the
> >>>>>>> only spot I can see where we are resetting the LRU flag instead of
> >>>>>>> pulling the page from the LRU list with the lock held. Looking over
> >>>>>>> the code it seems like something like that should be possible. I am
> >>>>>>> not sure the LRU lock is really protecting us in either the
> >>>>>>> PageCompound check nor the skip bits. It seems like holding a
> >>>>>>> reference on the page should prevent it from switching between
> >>>>>>> compound or not, and the skip bits are per pageblock with the LRU bits
> >>>>>>> being per node/memcg which I would think implies that we could have
> >>>>>>> multiple LRU locks that could apply to a single skip bit.
> >>>>>>
> >>>>>> Hi Alexander,
> >>>>>>
> >>>>>> I don't find problem yet on compound or skip bit usage. Would you clarify the
> >>>>>> issue do you concerned?
> >>>>>>
> >>>>>> Thanks!
> >>>>>
> >>>>> The point I was getting at is that the LRU lock is being used to
> >>>>> protect these and with your changes I don't think that makes sense
> >>>>> anymore.
> >>>>>
> >>>>> The skip bits are per-pageblock bits. With your change the LRU lock is
> >>>>> now per memcg first and then per node. As such I do not believe it
> >>>>> really provides any sort of exclusive access to the skip bits. I still
> >>>>> have to look into this more, but it seems like you need a lock per
> >>>>> either section or zone that can be used to protect those bits and deal
> >>>>> with this sooner rather than waiting until you have found an LRU page.
> >>>>> The one part that is confusing though is that the definition of the
> >>>>> skip bits seems to call out that they are a hint since they are not
> >>>>> protected by a lock, but that is exactly what has been happening here.
> >>>>>
> >>>>
> >>>> The skip bits are safe here, since even it race with other skip action,
> >>>> It will still skip out. The skip action is try not to compaction too much,
> >>>> not a exclusive action needs avoid race.
> >>>
> >>> That would be the case if it didn't have the impact that they
> >>> currently do on the compaction process. What I am getting at is that a
> >>> race was introduced when you placed this test between the clearing of
> >>> the LRU flag and the actual pulling of the page from the LRU list. So
> >>> if you tested the skip bits before clearing the LRU flag then I would
> >>> be okay with the code, however because it is triggering an abort after
> >>
> >> Hi Alexander,
> >>
> >> Thanks a lot for comments and suggestions!
> >>
> >> I have tried your suggestion:
> >>
> >> Signed-off-by: Alex Shi <alex.shi@...ux.alibaba.com>
> >> ---
> >>  mm/compaction.c | 14 +++++++-------
> >>  1 file changed, 7 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/mm/compaction.c b/mm/compaction.c
> >> index b99c96c4862d..6c881dee8c9a 100644
> >> --- a/mm/compaction.c
> >> +++ b/mm/compaction.c
> >> @@ -988,6 +988,13 @@ static bool too_many_isolated(pg_data_t *pgdat)
> >>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> >>                         goto isolate_fail_put;
> >>
> >> +               /* Try get exclusive access under lock */
> >> +               if (!skip_updated) {
> >> +                       skip_updated = true;
> >> +                       if (test_and_set_skip(cc, page, low_pfn))
> >> +                               goto isolate_fail_put;
> >> +               }
> >> +
> >>                 /* Try isolate the page */
> >>                 if (!TestClearPageLRU(page))
> >>                         goto isolate_fail_put;
> >
> > I would have made this much sooner. Probably before you call
> > get_page_unless_zero so as to avoid the unnecessary atomic operations.
> >
> >> @@ -1006,13 +1013,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
> >>
> >>                         lruvec_memcg_debug(lruvec, page);
> >>
> >> -                       /* Try get exclusive access under lock */
> >> -                       if (!skip_updated) {
> >> -                               skip_updated = true;
> >> -                               if (test_and_set_skip(cc, page, low_pfn))
> >> -                                       goto isolate_abort;
> >> -                       }
> >> -
> >>                         /*
> >>                          * Page become compound since the non-locked check,
> >>                          * and it's on LRU. It can only be a THP so the order
> >> --
> >>
> >> Performance of case-lru-file-mmap-read in vm-scalibity is dropped a bit. not
> >> helpful
> >
> > So one issue with this change is that it is still too late to be of
> > much benefit. Really you should probably be doing this much sooner,
> > for example somewhere before the get_page_unless_zero(). Also the
> > thing that still has me scratching my head is the "Try get exclusive
> > access under lock" comment. The function declaration says this is
> > supposed to be a hint, but we were using the LRU lock to synchronize
> > it. I'm wondering if we should really be protecting this with the zone
> > lock since we are modifying the pageblock flags which also contain the
> > migration type value for the pageblock and are only modified while
> > holding the zone lock.
>
> zone lock is probability better. you can try and test.

So I spent a good chunk of today looking the code over and what I
realized is that we probably don't even really need to have this code
protected by the zone lock since the LRU bit in the pageblock should
do most of the work for us. In addition we can get rid of the test
portion of this and just make it a set only operation if I am not
mistaken.

> >>> the LRU flag is cleared then you are creating a situation where
> >>> multiple processes will be stomping all over each other as you can
> >>> have each thread essentially take a page via the LRU flag, but only
> >>> one thread will process a page and it could skip over all other pages
> >>> that preemptively had their LRU flag cleared.
> >>
> >> It increase a bit crowd here, but lru_lock do reduce some them, and skip_bit
> >> could stop each other in a array check(bitmap). So compare to whole node
> >> lru_lock, the net profit is clear in patch 17.
> >
> > My concern is that what you can end up with is multiple threads all
> > working over the same pageblock for isolation. With the old code the
> > LRU lock was used to make certain that test_and_set_skip was being
> > synchronized on the first page in the pageblock so you would only have
> > one thread going through and working a single pageblock. However after
> > your changes it doesn't seem like the test_and_set_skip has that
> > protection since only one thread will ever be able to successfully
> > call it for the first page in the pageblock assuming that the LRU flag
> > is set on the first page in the pageblock block.
> >
> >>>
> >>> If you take a look at the test_and_set_skip the function only acts on
> >>> the pageblock aligned PFN for a given range. WIth the changes you have
> >>> in place now that would mean that only one thread would ever actually
> >>> call this function anyway since the first PFN would take the LRU flag
> >>> so no other thread could follow through and test or set the bit as
> >>
> >> Is this good for only one process could do test_and_set_skip? is that
> >> the 'skip' meaning to be?
> >
> > So only one thread really getting to fully use test_and_set_skip is
> > good, however the issue is that there is nothing to synchronize the
> > testing from the other threads. As a result the other threads could
> > have isolated other pages within the pageblock before the thread that
> > is calling test_and_set_skip will get to complete the setting of the
> > skip bit. This will result in isolation failures for the thread that
> > set the skip bit which may be undesirable behavior.
> >
> > With the old code the threads were all synchronized on testing the
> > first PFN in the pageblock while holding the LRU lock and that is what
> > we lost. My concern is the cases where skip_on_failure == true are
> > going to fail much more often now as the threads can easily interfere
> > with each other.
>
> I have a patch to fix this, which is on
>         https://github.com/alexshi/linux.git lrunext

I don't think that patch helps to address anything. You are now
failing to set the bit in the case that something modifies the
pageblock flags while you are attempting to do so. I think it would be
better to just leave the cmpxchg loop as it is.

> >
> >>> well. The expectation before was that all threads would encounter this
> >>> test and either proceed after setting the bit for the first PFN or
> >>> abort after testing the first PFN. With you changes only the first
> >>> thread actually runs this test and then it and the others will likely
> >>> encounter multiple failures as they are all clearing LRU bits
> >>> simultaneously and tripping each other up. That is why the skip bit
> >>> must have a test and set done before you even get to the point of
> >>> clearing the LRU flag.
> >>
> >> It make the things warse in my machine, would you like to have a try by yourself?
> >
> > I plan to do that. I have already been working on a few things to
> > clean up and optimize your patch set further. I will try to submit an
> > RFC this evening so we can discuss.
> >
>
> Glad to see your new code soon. Would you like do it base on
>                 https://github.com/alexshi/linux.git lrunext

I can rebase off of that tree. It may add another half hour or so. I
have barely had any time to test my code. When I enabled some of the
debugging features in the kernel related to using the vm-scalability
tests the boot time became incredibly slow so I may just make certain
I can boot and not mess the system up before submitting my patches as
an RFC. I can probably try testing them more tomorrow.

> >>>
> >>>>> The point I was getting at with the PageCompound check is that instead
> >>>>> of needing the LRU lock you should be able to look at PageCompound as
> >>>>> soon as you call get_page_unless_zero() and preempt the need to set
> >>>>> the LRU bit again. Instead of trying to rely on the LRU lock to
> >>>>> guarantee that the page hasn't been merged you could just rely on the
> >>>>> fact that you are holding a reference to it so it isn't going to
> >>>>> switch between being compound or order 0 since it cannot be freed. It
> >>>>> spoils the idea I originally had of combining the logic for
> >>>>> get_page_unless_zero and TestClearPageLRU into a single function, but
> >>>>> the advantage is you aren't clearing the LRU flag unless you are
> >>>>> actually going to pull the page from the LRU list.
> >>>>
> >>>> Sorry, I still can not follow you here. Compound code part is unchanged
> >>>> and follow the original logical. So would you like to pose a new code to
> >>>> see if its works?
> >>>
> >>> No there are significant changes as you reordered all of the
> >>> operations. Prior to your change the LRU bit was checked, but not
> >>> cleared before testing for PageCompound. Now you are clearing it
> >>> before you are testing if it is a compound page. So if compaction is
> >>> running we will be seeing the pages in the LRU stay put, but the
> >>> compound bit flickering off and on if the compound page is encountered
> >>> with the wrong or NULL lruvec. What I was suggesting is that the
> >>
> >> The lruvec could be wrong or NULL here, that is the base stone of whole
> >> patchset.
> >
> > Sorry I had a typo in my comment as well as it is the LRU bit that
> > will be flickering, not the compound. The goal here is to avoid
> > clearing the LRU bit unless we are sure we are going to take the
> > lruvec lock and pull the page from the list.
> >
> >>> PageCompound test probably doesn't need to be concerned with the lock
> >>> after your changes. You could test it after you call
> >>> get_page_unless_zero() and before you call
> >>> __isolate_lru_page_prepare(). Instead of relying on the LRU lock to
> >>> protect us from the page switching between compound and not we would
> >>> be relying on the fact that we are holding a reference to the page so
> >>> it should not be freed and transition between compound or not.
> >>>
> >>
> >> I have tried the patch as your suggested, it has no clear help on performance
> >> on above vm-scaliblity case. Maybe it's due to we checked the same thing
> >> before lock already.
> >>
> >> diff --git a/mm/compaction.c b/mm/compaction.c
> >> index b99c96c4862d..cf2ac5148001 100644
> >> --- a/mm/compaction.c
> >> +++ b/mm/compaction.c
> >> @@ -985,6 +985,16 @@ static bool too_many_isolated(pg_data_t *pgdat)
> >>                 if (unlikely(!get_page_unless_zero(page)))
> >>                         goto isolate_fail;
> >>
> >> +                       /*
> >> +                        * Page become compound since the non-locked check,
> >> +                        * and it's on LRU. It can only be a THP so the order
> >> +                        * is safe to read and it's 0 for tail pages.
> >> +                        */
> >> +                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
> >> +                               low_pfn += compound_nr(page) - 1;
> >> +                               goto isolate_fail_put;
> >> +                       }
> >> +
> >>                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> >>                         goto isolate_fail_put;
> >>
> >> @@ -1013,16 +1023,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
> >>                                         goto isolate_abort;
> >>                         }
> >>
> >> -                       /*
> >> -                        * Page become compound since the non-locked check,
> >> -                        * and it's on LRU. It can only be a THP so the order
> >> -                        * is safe to read and it's 0 for tail pages.
> >> -                        */
> >> -                       if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
> >> -                               low_pfn += compound_nr(page) - 1;
> >> -                               SetPageLRU(page);
> >> -                               goto isolate_fail_put;
> >> -                       }
> >>                 } else
> >>                         rcu_read_unlock();
> >>
> >
> > So actually there is more we could do than just this. Specifically a
> > few lines below the rcu_read_lock there is yet another PageCompound
> > check that sets low_pfn yet again. So in theory we could combine both
> > of those and modify the code so you end up with something more like:
> > @@ -968,6 +974,16 @@ isolate_migratepages_block(struct compact_control
> > *cc, unsigned long low_pfn,
> >                 if (unlikely(!get_page_unless_zero(page)))
> >                         goto isolate_fail;
> >
> > +               if (PageCompound(page)) {
> > +                       const unsigned int order = compound_order(page);
> > +
> > +                       if (likely(order < MAX_ORDER))
> > +                               low_pfn += (1UL << order) - 1;
> > +
> > +                       if (unlikely(!cc->alloc_contig))
> > +                               goto isolate_fail_put;
> >
>
> The current don't check this unless locked changed. But anyway check it
> every page may have no performance impact.

Yes and no. The same code is also ran outside the lock and that is why
I suggested merging the two and creating this block of logic. It will
be clearer once I have done some initial smoke testing and submitted
my patch.

> +               }
> > +
> >                 if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
> >                         goto isolate_fail_put;
> >
> > Doing this you would be more likely to skip over the entire compound
> > page in a single jump should you not be able to either take the LRU
> > bit or encounter a busy page in __isolate_Lru_page_prepare. I had
> > copied this bit from an earlier check and modified it as I was not
> > sure I can guarantee that this is a THP since we haven't taken the LRU
> > lock yet. However I believe the page cannot be split up while we are
> > holding the extra reference so the PageCompound flag and order should
> > not change until we call put_page.
> >
>
> It looks like the lock_page protect this instead of get_page that just works
> after split func called.

So I thought that the call to page_ref_freeze that is used in
functions like split_huge_page_to_list is meant to address this case.
What it is essentially doing is setting the reference count to zero if
the count is at the expected value. So with the get_page_unless_zero
it would either fail because the value is already zero, or the
page_ref_freeze would fail because the count would be one higher than
the expected value. Either that or I am still missing another piece in
the understanding of this.

Thanks.

- Alex