Message-ID: <20240312210822.GB65481@cmpxchg.org>
Date: Tue, 12 Mar 2024 17:08:22 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: Yu Zhao <yuzhao@...gle.com>
Cc: Axel Rasmussen <axelrasmussen@...gle.com>,
	Yafang Shao <laoar.shao@...il.com>,
	Chris Down <chris@...isdown.name>, cgroups@...r.kernel.org,
	kernel-team@...com, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org
Subject: Re: MGLRU premature memcg OOM on slow writes

On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote:
> Yes, these two are among the differences between the active/inactive
> LRU and MGLRU, but their roles, IMO, are not as important as the LRU
> positions of dirty pages. The active/inactive LRU moves dirty pages
> all the way to the end of the line (reclaim happens at the front)
> whereas MGLRU moves them into the middle, during direct reclaim. The
> rationale for MGLRU was that this way those dirty pages would still
> be counted as "inactive" (or cold).

Note that activating the page is not a statement on the page's
hotness. It's simply to park it away from the scanner. We could as
well have moved it to the unevictable list - this is just easier.

folio_end_writeback() will call folio_rotate_reclaimable() and move it
back to the inactive tail, to make it the very next reclaim target as
soon as it's clean.
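
As a rough illustration of that park-and-rotate behavior, here is a toy userspace model. It is only a sketch: ToyLRU, Folio, reclaim_flag and demo_rotate are made-up names for illustration, not kernel code, and the model assumes the right end of the inactive deque is the tail the scanner reclaims from.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(eq=False)  # compare folios by identity, not by field values
class Folio:
    name: str
    dirty: bool = False
    reclaim_flag: bool = False  # hypothetical stand-in for "rotate me when clean"


class ToyLRU:
    """Toy two-list LRU; the right end of `inactive` is the next reclaim target."""

    def __init__(self, folios):
        self.inactive = deque(folios)
        self.active = deque()
        self.reclaimed = []

    def scan_one(self):
        """Take the folio at the inactive tail; park it if dirty, else reclaim it."""
        folio = self.inactive.pop()
        if folio.dirty:
            # Not a hotness statement: just move it out of the scanner's way.
            folio.reclaim_flag = True
            self.active.append(folio)
            return None
        self.reclaimed.append(folio.name)
        return folio.name

    def end_writeback(self, folio):
        """Writeback finished: rotate a parked folio back to the inactive tail."""
        folio.dirty = False
        if folio.reclaim_flag:
            folio.reclaim_flag = False
            self.active.remove(folio)
            self.inactive.append(folio)


def demo_rotate():
    a, b, d = Folio("a"), Folio("b"), Folio("d", dirty=True)
    lru = ToyLRU([a, b, d])   # d sits at the tail and is scanned first
    lru.scan_one()            # d is dirty, so it gets parked on the active list
    lru.end_writeback(d)      # d is clean now and goes back to the inactive tail
    while lru.inactive:
        lru.scan_one()
    return lru.reclaimed
```

In this model the parked folio is the first thing reclaimed once its writeback completes, ahead of the clean folios the scanner had not reached yet.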

> This theory can be quickly verified by comparing how much
> nr_vmscan_immediate_reclaim grows, i.e.,
> 
>   Before the copy
>     grep nr_vmscan_immediate_reclaim /proc/vmstat
>   And then after the copy
>     grep nr_vmscan_immediate_reclaim /proc/vmstat
> 
> The growth should be trivial for MGLRU and nontrivial for the
> active/inactive LRU.
>
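
The before/after comparison quoted above could be scripted along these lines. This is a minimal sketch: the helper names (parse_vmstat, counter_growth, read_vmstat) are made up for illustration, and it assumes /proc/vmstat keeps its usual "name value" line format.

```python
def parse_vmstat(text):
    """Parse "name value" lines, as found in /proc/vmstat, into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def counter_growth(before, after, name="nr_vmscan_immediate_reclaim"):
    """Return how much one counter grew between two vmstat snapshots."""
    return parse_vmstat(after)[name] - parse_vmstat(before)[name]


def read_vmstat(path="/proc/vmstat"):
    """Take a snapshot of the vmstat file as raw text."""
    with open(path) as f:
        return f.read()
```

Usage would be to call read_vmstat() before the copy, run the copy, call read_vmstat() again, and pass both snapshots to counter_growth().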
> If this is indeed the case, I'd appreciate very much if anyone could
> try the following (I'll try it myself too later next week).
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4255619a1a31..020f5d98b9a1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>  	}
>  
>  	/* waiting for writeback */
> -	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> -	    (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> -		gen = folio_inc_gen(lruvec, folio, true);
> -		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> +	if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> +		DEFINE_MAX_SEQ(lruvec);
> +		int old_gen, new_gen = lru_gen_from_seq(max_seq);
> +
> +		old_gen = folio_update_gen(folio, new_gen);
> +		lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> +		list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);
>  		return true;

Right, because MGLRU sorts these pages out before calling the scanner,
so they never get marked for immediate reclaim.

But that also implies they won't get rotated back to the tail when
writeback finishes. Doesn't that mean that you now have pages that

a) came from the oldest generation and were only deferred due to their
   writeback state, and

b) are now clean and should be reclaimed. But since they're
   permanently advanced to the next gen, you'll instead reclaim pages
   that were originally ahead of them, and likely hotter.

Isn't that an age inversion?
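
To make the inversion concrete, here is a toy ordering model. It is a sketch under stated assumptions, not the MGLRU implementation: reclaim_order is a made-up helper, generations[0] is the oldest generation, there are at least two generations, writeback on the deferred pages finishes before reclaim resumes, and a page advanced to the next generation is reclaimed last within it.

```python
def reclaim_order(generations, dirty, rotate_back):
    """Return the order in which pages would be reclaimed in this toy model.

    generations: list of lists of page names, oldest generation first,
                 each list in reclaim order.
    dirty:       set of page names that are dirty when first scanned.
    rotate_back: True  -> deferred pages return to the front of the reclaim
                          queue once clean (rotate-on-writeback-end);
                 False -> deferred pages stay in the next generation.
    """
    gens = [list(g) for g in generations]
    deferred = [p for p in gens[0] if p in dirty]
    gens[0] = [p for p in gens[0] if p not in dirty]

    if rotate_back:
        # Clean again: they become the very next reclaim targets.
        gens[0] = deferred + gens[0]
    else:
        # Permanently advanced: everything in the next generation goes first.
        gens[1] = gens[1] + deferred

    order = []
    for gen in gens:
        order.extend(gen)
    return order
```

With generations [["d", "x"], ["y", "z"]] and "d" dirty, the rotate-back variant reclaims d first, while the permanent-advance variant reclaims y and z, which started out younger, before d.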

Back to the broader question though: if reclaim demand outstrips clean
pages and the only viable candidates are dirty ones (e.g. an
allocation spike in the presence of dirty/writeback pages), there only
seem to be 3 options:

1) sleep-wait for writeback
2) continue scanning, aka busy-wait for writeback + age inversions
3) find nothing and declare OOM

Since you're not doing 1), it must be one of the other two, no? One
way or another it has to either pace-match to IO completions, or OOM.
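
The three options can be compared in a toy timing model. This is only a sketch with made-up names (simulate, io_ticks, max_passes) and simplifying assumptions: every candidate page is under writeback, one writeback completes every io_ticks ticks, and a scan pass takes one tick.

```python
def simulate(need, io_ticks, policy, max_passes=0):
    """Return (outcome, ticks_elapsed, scan_passes) for one reclaim attempt.

    need:       number of pages the reclaimer must free.
    io_ticks:   ticks between writeback completions (one page each).
    policy:     "sleep"   -> block until each IO completion (option 1)
                "scan"    -> rescan every tick until pages turn clean (option 2)
                "give_up" -> rescan, but declare OOM after max_passes (option 3)
    """
    if policy == "sleep":
        # One wakeup, and one successful scan, per completed writeback.
        return ("ok", need * io_ticks, need)

    ticks = passes = reclaimed = 0
    while reclaimed < need:
        if policy == "give_up" and passes >= max_passes:
            return ("oom", ticks, passes)
        ticks += 1
        passes += 1
        if ticks % io_ticks == 0:
            reclaimed += 1  # a writeback finished; this pass finds a clean page
    return ("ok", ticks, passes)
```

In this model "sleep" and "scan" finish at the same time, since both are paced by IO completions; "scan" just burns more passes getting there. A bounded scanner that does not wait hits OOM as soon as its budget is smaller than the IO latency it needs to cover.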
