[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20241116091658.1983491-2-chenridong@huaweicloud.com>
Date: Sat, 16 Nov 2024 09:16:58 +0000
From: Chen Ridong <chenridong@...weicloud.com>
To: akpm@...ux-foundation.org,
mhocko@...e.com,
hannes@...xchg.org,
yosryahmed@...gle.com,
yuzhao@...gle.com,
david@...hat.com,
willy@...radead.org,
ryan.roberts@....com,
baohua@...nel.org
Cc: linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
chenridong@...wei.com,
wangweiyang2@...wei.com,
xieym_ict@...mail.com
Subject: [RFC PATCH v2 1/1] mm/vmscan: move the written-back folios to the tail of LRU after shrinking
From: Chen Ridong <chenridong@...wei.com>
An issue was found with the following testing step:
1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y
2. Mount memcg v1, and create memcg named test_memcg and set
usage_in_bytes=2.1G, memsw.usage_in_bytes=3G.
3. Create a 1G swap file, and allocate 2.2G anon memory in test_memcg.
It was found that:
cat memory.usage_in_bytes
2144940032
cat memory.memsw.usage_in_bytes
2255056896
free -h
total used free
Mem: 31Gi 2.1Gi 27Gi
Swap: 1.0Gi 618Mi 405Mi
As shown above, the test_memcg used about 100M swap, but 600M+ swap memory
was used, which means that 500M may be wasted because other memcgs can not
use these swap memory.
It can be explained as follows:
1. When entering shrink_inactive_list, it isolates folios from lru from
tail to head. If it just takes folioN from lru(make it simple).
inactive lru: folio1<->folio2<->folio3...<->folioN-1
isolated list: folioN
2. In shrink_page_list function, if folioN is THP(2M), it may be splited
and added to swap cache folio by folio. After adding to swap cache,
it will submit io to writeback folio to swap, which is asynchronous.
When shrink_page_list is finished, the isolated folios list will be
moved back to the head of inactive lru. The inactive lru may just look
like this, with 512 filioes have been move to the head of inactive lru.
folioN512<->folioN511<->...filioN1<->folio1<->folio2...<->folioN-1
It committed io from folioN1 to folioN512, the later folios committed
was added to head of the 'ret_folios' in the shrink_page_list function.
As a result, the order was shown as folioN512->folioN511->...->folioN1.
3. When folio writeback io is completed, the folio may be rotated to tail
of the lru one by one. It's assumed that filioN1,filioN2, ...,filioN512
are completed in order(commit io in this order), and they are rotated to
the tail of the LRU in order (filioN1<->...folioN511<->folioN512).
Therefore, those folios that are tail of the lru will be reclaimed as
soon as possible.
folio1<->folio2<->...<->folioN-1<->filioN1<->...folioN511<->folioN512
4. However, shrink_page_list and folio writeback are asynchronous. If THP
is splited, shrink_page_list loops at least 512 times, which means that
shrink_page_list is not completed but some folios writeback have been
completed, and this may lead to failure to rotate these folios to the
tail of lru. The lru may look likes as below:
folioN50<->folioN49<->...filioN1<->folio1<->folio2...<->folioN-1<->
folioN51<->folioN52<->...folioN511<->folioN512
Although those folios (N1-N50) have been finished writing back, they
are still at the head of the lru. This is because their writeback_end
occurred while it were still looping in shrink_folio_list(), causing
folio_end_writeback()'s folio_rotate_reclaimable() to fail in moving
these folios, which are not in the LRU but still in the 'folio_list',
to the tail of the LRU.
When isolating folios from lru, it scans from tail to head, so it is
difficult to scan those folios again.
What mentioned above may lead to a large number of folios have been added
to swap cache but can not be reclaimed in time, which may reduce reclaim
efficiency and prevent other memcgs from using this swap memory even if
they trigger OOM.
To fix this issue, the folios whose writeback has been completed should be
move to the tail of the LRU instead of always placing them at the head of
the LRU when the shrink_page_list is finished. It can be realized by
following steps.
1. In the shrink_page_list function, the folios whose are committed to
are added to the head of 'folio_list', which will be return to the
caller.
2. When shrink_page_list finishes, it is known that how many folios have
been pageout, and they are all at the head of 'folio_list', which is
ready be moved back to LRU. So, in the 'move_folios_to_lru function'
function, if the first 'nr_io' folios (which have been pageout) have
been written back completely, move them to the tail of LRU. Otherwise,
move them to the head of the LRU.
Signed-off-by: Chen Ridong <chenridong@...wei.com>
---
mm/vmscan.c | 37 +++++++++++++++++++++++++++++--------
1 file changed, 29 insertions(+), 8 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 76378bc257e3..04f7eab9d818 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1046,6 +1046,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
struct folio_batch free_folios;
LIST_HEAD(ret_folios);
LIST_HEAD(demote_folios);
+ LIST_HEAD(pageout_folios);
unsigned int nr_reclaimed = 0;
unsigned int pgactivate = 0;
bool do_demote_pass;
@@ -1061,7 +1062,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
struct address_space *mapping;
struct folio *folio;
enum folio_references references = FOLIOREF_RECLAIM;
- bool dirty, writeback;
+ bool dirty, writeback, is_pageout = false;
unsigned int nr_pages;
cond_resched();
@@ -1384,6 +1385,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
nr_pages = 1;
}
stat->nr_pageout += nr_pages;
+ is_pageout = true;
if (folio_test_writeback(folio))
goto keep;
@@ -1508,7 +1510,10 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
keep_locked:
folio_unlock(folio);
keep:
- list_add(&folio->lru, &ret_folios);
+ if (is_pageout)
+ list_add(&folio->lru, &pageout_folios);
+ else
+ list_add(&folio->lru, &ret_folios);
VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
folio_test_unevictable(folio), folio);
}
@@ -1551,6 +1556,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
free_unref_folios(&free_folios);
list_splice(&ret_folios, folio_list);
+ list_splice(&pageout_folios, folio_list);
count_vm_events(PGACTIVATE, pgactivate);
if (plug)
@@ -1826,11 +1832,14 @@ static bool too_many_isolated(struct pglist_data *pgdat, int file,
/*
* move_folios_to_lru() moves folios from private @list to appropriate LRU list.
+ * @lruvec: The LRU vector the list is moved to.
+ * @list: The folio list are moved to lruvec
+ * @nr_io: The first nr folios of the list that have been committed io.
*
* Returns the number of pages moved to the given lruvec.
*/
static unsigned int move_folios_to_lru(struct lruvec *lruvec,
- struct list_head *list)
+ struct list_head *list, unsigned int nr_io)
{
int nr_pages, nr_moved = 0;
struct folio_batch free_folios;
@@ -1880,9 +1889,21 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
* inhibits memcg migration).
*/
VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
- lruvec_add_folio(lruvec, folio);
+ /*
+ * If the folio have been committed io and writed back completely,
+ * it should be added to the tailed to the lru, so it can
+ * be relaimed as soon as possible.
+ */
+ if (nr_io > 0 &&
+ !folio_test_reclaim(folio) &&
+ !folio_test_writeback(folio))
+ lruvec_add_folio_tail(lruvec, folio);
+ else
+ lruvec_add_folio(lruvec, folio);
+
nr_pages = folio_nr_pages(folio);
nr_moved += nr_pages;
+ nr_io = nr_io > nr_pages ? (nr_io - nr_pages) : 0;
if (folio_test_active(folio))
workingset_age_nonresident(lruvec, nr_pages);
}
@@ -1960,7 +1981,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);
spin_lock_irq(&lruvec->lru_lock);
- move_folios_to_lru(lruvec, &folio_list);
+ move_folios_to_lru(lruvec, &folio_list, stat.nr_pageout);
__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(),
stat.nr_demoted);
@@ -2111,8 +2132,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
*/
spin_lock_irq(&lruvec->lru_lock);
- nr_activate = move_folios_to_lru(lruvec, &l_active);
- nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
+ nr_activate = move_folios_to_lru(lruvec, &l_active, 0);
+ nr_deactivate = move_folios_to_lru(lruvec, &l_inactive, 0);
__count_vm_events(PGDEACTIVATE, nr_deactivate);
__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
@@ -4627,7 +4648,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
spin_lock_irq(&lruvec->lru_lock);
- move_folios_to_lru(lruvec, &list);
+ move_folios_to_lru(lruvec, &list, 0);
walk = current->reclaim_state->mm_walk;
if (walk && walk->batched) {
--
2.34.1
Powered by blists - more mailing lists