Message-ID: <Ydu6fXg2FmrseQOn@google.com>
Date:   Sun, 9 Jan 2022 21:47:57 -0700
From:   Yu Zhao <yuzhao@...gle.com>
To:     Michal Hocko <mhocko@...e.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Andi Kleen <ak@...ux.intel.com>,
        Catalin Marinas <catalin.marinas@....com>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Hillf Danton <hdanton@...a.com>, Jens Axboe <axboe@...nel.dk>,
        Jesse Barnes <jsbarnes@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Jonathan Corbet <corbet@....net>,
        Matthew Wilcox <willy@...radead.org>,
        Mel Gorman <mgorman@...e.de>,
        Michael Larabel <Michael@...haellarabel.com>,
        Rik van Riel <riel@...riel.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Will Deacon <will@...nel.org>,
        Ying Huang <ying.huang@...el.com>,
        linux-arm-kernel@...ts.infradead.org, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        page-reclaim@...gle.com, x86@...nel.org,
        Konstantin Kharlamov <Hi-Angel@...dex.ru>
Subject: Re: [PATCH v6 6/9] mm: multigenerational lru: aging

On Fri, Jan 07, 2022 at 03:44:50PM +0100, Michal Hocko wrote:
> On Tue 04-01-22 13:22:25, Yu Zhao wrote:
> [...]
> > +static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> > +{
> > +	static const struct mm_walk_ops mm_walk_ops = {
> > +		.test_walk = should_skip_vma,
> > +		.p4d_entry = walk_pud_range,
> > +	};
> > +
> > +	int err;
> > +#ifdef CONFIG_MEMCG
> > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > +#endif
> > +
> > +	walk->next_addr = FIRST_USER_ADDRESS;
> > +
> > +	do {
> > +		unsigned long start = walk->next_addr;
> > +		unsigned long end = mm->highest_vm_end;
> > +
> > +		err = -EBUSY;
> > +
> > +		rcu_read_lock();
> > +#ifdef CONFIG_MEMCG
> > +		if (memcg && atomic_read(&memcg->moving_account))
> > +			goto contended;
> > +#endif
> > +		if (!mmap_read_trylock(mm))
> > +			goto contended;
> 
> Have you evaluated the behavior under mmap_sem contention? I mean what
> would be an effect of some mms being excluded from the walk? This path
> is called from direct reclaim and we do allocate with exclusive mmap_sem
> IIRC and the trylock can fail in a presence of pending writer if I am
> not mistaken so even the read lock holder (e.g. an allocation from the #PF)
> can bypass the walk.

You are right. Here it must be a trylock; otherwise it can deadlock.

I think there might be a misunderstanding: the aging doesn't rely
exclusively on page table walks to gather the accessed bit. It
prefers page table walks, but it can also fall back to the rmap-based
function, i.e., lru_gen_look_around(), which only gathers the accessed
bit from at most 64 PTEs and is therefore less efficient. But it still
retains about 80% of the performance gains.

> Or is this considered statistically insignificant thus a theoretical
> problem?

Yes. People who work on the maple tree and SPF at Google expressed the
same concern during the design review meeting (all stakeholders on the
mailing list were also invited). So we had a counter to monitor the
contention in previous versions, i.e., MM_LOCK_CONTENTION in v4 here:
https://lore.kernel.org/lkml/20210818063107.2696454-8-yuzhao@google.com/

And we also combined this patchset with the SPF patchset to see if the
latter makes any difference. Our conclusion was that the contention
has no statistically significant effect on performance under memory
pressure.

This can be explained by how often we create a new generation. (We
only walk page tables when we create a new generation. And it's
similar to the low inactive condition for the active/inactive lru.)

Usually we only do so every few seconds. We'd run into problems with
other parts of the kernel, e.g., lru lock contention, i/o congestion,
etc. if we created more than a few generations every second.
