Message-ID: <CAGsJ_4zdVyxh4ka7q8PA2Wi_ZPpjY_sv_bGS9E9NP_vpAFLOHg@mail.gmail.com>
Date: Sun, 17 Nov 2024 16:26:30 +1300
From: Barry Song <21cnbao@...il.com>
To: Chen Ridong <chenridong@...weicloud.com>
Cc: akpm@...ux-foundation.org, mhocko@...e.com, hannes@...xchg.org, 
	yosryahmed@...gle.com, yuzhao@...gle.com, david@...hat.com, 
	willy@...radead.org, ryan.roberts@....com, linux-mm@...ck.org, 
	linux-kernel@...r.kernel.org, chenridong@...wei.com, wangweiyang2@...wei.com, 
	xieym_ict@...mail.com
Subject: Re: [RFC PATCH v2 1/1] mm/vmscan: move the written-back folios to the
 tail of LRU after shrinking

On Sat, Nov 16, 2024 at 10:26 PM Chen Ridong <chenridong@...weicloud.com> wrote:
>
> From: Chen Ridong <chenridong@...wei.com>
>
> An issue was found with the following testing steps:
> 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y.
> 2. Mount memcg v1, create a memcg named test_memcg, and set
>    memory.limit_in_bytes=2.1G and memory.memsw.limit_in_bytes=3G.
> 3. Create a 1G swap file, and allocate 2.2G of anon memory in test_memcg.
>
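
For reference, step 3 amounts to something like the following userspace
sketch (illustrative only; the cgroup setup and the swap file from steps
1-2 are assumed, and the process must be run inside test_memcg):

	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		/* ~2.2G of anonymous memory, as in step 3 */
		size_t sz = (size_t)2200 << 20;
		char *p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		/* touch every page so the memory is actually allocated */
		memset(p, 1, sz);
		/* keep the mapping alive while observing the counters */
		pause();
		return 0;
	}
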
> It was found that:
>
> cat memory.usage_in_bytes
> 2144940032
> cat memory.memsw.usage_in_bytes
> 2255056896
>
> free -h
>               total        used        free
> Mem:           31Gi       2.1Gi        27Gi
> Swap:         1.0Gi       618Mi       405Mi
>
> As shown above, test_memcg used only about 105M of swap
> (memsw.usage_in_bytes - usage_in_bytes = 2255056896 - 2144940032), but
> 618M of swap was in use system-wide. This means roughly 500M of swap may
> be wasted, because other memcgs cannot use that swap space.
>
> It can be explained as follows:
> 1. When entering shrink_inactive_list, folios are isolated from the LRU
>    from tail to head. Suppose, for simplicity, that it takes only folioN.
>
>    inactive lru: folio1<->folio2<->folio3...<->folioN-1
>    isolated list: folioN
>
> 2. In the shrink_folio_list function, if folioN is a THP (2M), it may be
>    split and added to the swap cache folio by folio. After a folio is
>    added to the swap cache, IO is submitted to write it back to swap,
>    which is asynchronous. When shrink_folio_list finishes, the isolated
>    folios are moved back to the head of the inactive LRU. The inactive
>    LRU may then look like this, with the 512 folios moved to the head:
>
>    folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
>
>    IO was committed for folioN1 through folioN512 in that order, and each
>    folio was then added to the head of 'ret_folios' in shrink_folio_list.
>    As a result, the order became folioN512->folioN511->...->folioN1.
>
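
As an aside on the ordering above: it is simply the effect of list_add()
always inserting at the head. A minimal userspace sketch of the same
pattern (plain C rather than the kernel's list.h, for illustration):

	#include <stdio.h>
	#include <stdlib.h>

	struct node {
		int id;
		struct node *next;
	};

	int main(void)
	{
		struct node *head = NULL;

		/* "commit" items 1..8 in order, inserting each at the head,
		 * as shrink_folio_list() does with
		 * list_add(&folio->lru, &ret_folios) */
		for (int i = 1; i <= 8; i++) {
			struct node *n = malloc(sizeof(*n));

			n->id = i;
			n->next = head;
			head = n;
		}

		/* walking from the head now prints 8 7 ... 1, i.e. the
		 * commit order reversed */
		for (struct node *n = head; n; n = n->next)
			printf("folioN%d ", n->id);
		printf("\n");
		return 0;
	}
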
> 3. When a folio's writeback IO completes, the folio may be rotated to the
>    tail of the LRU one by one. Assume folioN1, folioN2, ..., folioN512
>    complete in order (the order in which their IO was committed); they
>    are then rotated to the tail of the LRU in that order
>    (folioN1<->...folioN511<->folioN512). Folios at the tail of the LRU
>    are therefore reclaimed as soon as possible.
>
>    folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
>
> 4. However, shrink_folio_list and folio writeback are asynchronous. If a
>    THP is split, shrink_folio_list loops at least 512 times, which means
>    some folios' writeback may complete before shrink_folio_list itself
>    does, and those folios then fail to be rotated to the tail of the LRU.
>    The LRU may look like this:
>
>    folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
>    folioN51<->folioN52<->...folioN511<->folioN512
>
>    Although folios N1-N50 have finished writing back, they are still at
>    the head of the LRU. Their writeback completed while
>    shrink_folio_list() was still looping, so folio_end_writeback()'s
>    folio_rotate_reclaimable() failed to move them to the tail of the LRU:
>    at that point they were not on the LRU but still on the private
>    'folio_list'. Since isolation scans the LRU from tail to head, it is
>    difficult to scan these folios again.
>
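
The failure in step 4 comes down to the LRU check in
folio_rotate_reclaimable() (mm/swap.c); roughly, in simplified form
(not the verbatim kernel code):

	void folio_rotate_reclaimable(struct folio *folio)
	{
		if (folio_test_locked(folio) || folio_test_dirty(folio) ||
		    folio_test_unevictable(folio))
			return;

		folio_get(folio);
		/* An isolated folio is not PG_lru, so this test fails for
		 * folios still sitting on shrink_folio_list()'s private
		 * 'folio_list' and the rotation is silently skipped. */
		if (!folio_test_clear_lru(folio)) {
			folio_put(folio);
			return;
		}
		/* ... otherwise queue the folio to be moved to the tail of
		 * its LRU list ... */
	}
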
> All of the above can leave a large number of folios sitting in the swap
> cache without being reclaimed in time, which reduces reclaim efficiency
> and prevents other memcgs from using this swap space even when they
> trigger OOM.
>
> To fix this issue, folios whose writeback has completed should be moved
> to the tail of the LRU, instead of always being placed at the head of
> the LRU when shrink_folio_list finishes. This can be realized with the
> following steps:
> 1. In the shrink_page_list function, the folios whose are committed to

It seems like there's a grammatical error here—whose something?

>    are added to the head of 'folio_list', which will be returned to the
>    caller.
> 2. When shrink_folio_list finishes, it is known how many folios have
>    been paged out, and they are all at the head of 'folio_list', ready
>    to be moved back to the LRU. So, in the move_folios_to_lru() function,
>    if the first 'nr_io' folios (those that have been paged out) have
>    completed writeback, move them to the tail of the LRU; otherwise,
>    move them to the head of the LRU.
>
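
To make step 2 concrete with small numbers (illustrative only): suppose
stat.nr_pageout = 3 and 'folio_list' starts with three paged-out
single-page folios A, B and C, followed by the remaining folios.
move_folios_to_lru() then sees A with nr_io = 3, B with nr_io = 2 and C
with nr_io = 1, so each of the three is eligible for tail placement if
its writeback has already finished; from the fourth folio onwards nr_io
is 0 and folios go back to the head of the LRU as before.
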
> Signed-off-by: Chen Ridong <chenridong@...wei.com>
> ---
>  mm/vmscan.c | 37 +++++++++++++++++++++++++++++--------
>  1 file changed, 29 insertions(+), 8 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 76378bc257e3..04f7eab9d818 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1046,6 +1046,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>         struct folio_batch free_folios;
>         LIST_HEAD(ret_folios);
>         LIST_HEAD(demote_folios);
> +       LIST_HEAD(pageout_folios);
>         unsigned int nr_reclaimed = 0;
>         unsigned int pgactivate = 0;
>         bool do_demote_pass;
> @@ -1061,7 +1062,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                 struct address_space *mapping;
>                 struct folio *folio;
>                 enum folio_references references = FOLIOREF_RECLAIM;
> -               bool dirty, writeback;
> +               bool dirty, writeback, is_pageout = false;
>                 unsigned int nr_pages;
>
>                 cond_resched();
> @@ -1384,6 +1385,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>                                         nr_pages = 1;
>                                 }
>                                 stat->nr_pageout += nr_pages;
> +                               is_pageout = true;
>
>                                 if (folio_test_writeback(folio))
>                                         goto keep;
> @@ -1508,7 +1510,10 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  keep_locked:
>                 folio_unlock(folio);
>  keep:
> -               list_add(&folio->lru, &ret_folios);
> +               if (is_pageout)
> +                       list_add(&folio->lru, &pageout_folios);
> +               else
> +                       list_add(&folio->lru, &ret_folios);
>                 VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
>                                 folio_test_unevictable(folio), folio);
>         }
> @@ -1551,6 +1556,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>         free_unref_folios(&free_folios);
>
>         list_splice(&ret_folios, folio_list);
> +       list_splice(&pageout_folios, folio_list);

Do we really need this pageout_folios list? Is the below not sufficient?

 +               if (nr_io > 0 &&
 +                   !folio_test_reclaim(folio) &&
 +                   !folio_test_writeback(folio))
 +                       lruvec_add_folio_tail(lruvec, folio);
 +               else
 +                       lruvec_add_folio(lruvec, folio);

>         count_vm_events(PGACTIVATE, pgactivate);
>
>         if (plug)
> @@ -1826,11 +1832,14 @@ static bool too_many_isolated(struct pglist_data *pgdat, int file,
>
>  /*
>   * move_folios_to_lru() moves folios from private @list to appropriate LRU list.
> + * @lruvec: the lruvec the folios are moved to.
> + * @list: the list of folios to be moved to the lruvec.
> + * @nr_io: the number of folios at the head of @list with committed IO.
>   *
>   * Returns the number of pages moved to the given lruvec.
>   */
>  static unsigned int move_folios_to_lru(struct lruvec *lruvec,
> -               struct list_head *list)
> +               struct list_head *list, unsigned int nr_io)
>  {
>         int nr_pages, nr_moved = 0;
>         struct folio_batch free_folios;
> @@ -1880,9 +1889,21 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
>                  * inhibits memcg migration).
>                  */
>                 VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
> -               lruvec_add_folio(lruvec, folio);
> +               /*
> +                * If the folio had IO committed and has been written back
> +                * completely, add it to the tail of the LRU so it can be
> +                * reclaimed as soon as possible.
> +                */
> +               if (nr_io > 0 &&
> +                   !folio_test_reclaim(folio) &&
> +                   !folio_test_writeback(folio))
> +                       lruvec_add_folio_tail(lruvec, folio);
> +               else
> +                       lruvec_add_folio(lruvec, folio);
> +
>                 nr_pages = folio_nr_pages(folio);
>                 nr_moved += nr_pages;
> +               nr_io = nr_io > nr_pages ? (nr_io - nr_pages) : 0;
>                 if (folio_test_active(folio))
>                         workingset_age_nonresident(lruvec, nr_pages);
>         }
> @@ -1960,7 +1981,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>         nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);
>
>         spin_lock_irq(&lruvec->lru_lock);
> -       move_folios_to_lru(lruvec, &folio_list);
> +       move_folios_to_lru(lruvec, &folio_list, stat.nr_pageout);
>
>         __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
>                                         stat.nr_demoted);
> @@ -2111,8 +2132,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
>          */
>         spin_lock_irq(&lruvec->lru_lock);
>
> -       nr_activate = move_folios_to_lru(lruvec, &l_active);
> -       nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
> +       nr_activate = move_folios_to_lru(lruvec, &l_active, 0);
> +       nr_deactivate = move_folios_to_lru(lruvec, &l_inactive, 0);
>
>         __count_vm_events(PGDEACTIVATE, nr_deactivate);
>         __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
> @@ -4627,7 +4648,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
>
>         spin_lock_irq(&lruvec->lru_lock);
>
> -       move_folios_to_lru(lruvec, &list);
> +       move_folios_to_lru(lruvec, &list, 0);

I'm not entirely convinced about using the 'nr_io' argument here.
Is the goal to differentiate between two cases?
1. we need to take care of written-back folios
2. we don't need to take care of written-back folios

Would a bool be better, or would separate helpers be better?
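
For instance (hypothetical names, just to illustrate the shape):

	/* keep the common body private, expose intent-revealing wrappers */
	static unsigned int move_folios_to_lru(struct lruvec *lruvec,
			struct list_head *list)
	{
		return __move_folios_to_lru(lruvec, list, 0);
	}

	static unsigned int move_reclaim_folios_to_lru(struct lruvec *lruvec,
			struct list_head *list, unsigned int nr_io)
	{
		return __move_folios_to_lru(lruvec, list, nr_io);
	}

Then shrink_active_list() and evict_folios() could keep their current
call signature unchanged.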

>
>         walk = current->reclaim_state->mm_walk;
>         if (walk && walk->batched) {
> --
> 2.34.1
>

Thanks
Barry
