Message-ID: <CAOUHufYjLf2RD2Y7wcebjvvDRHDT9cCG6sPbQ2JaTyyB306JOw@mail.gmail.com>
Date:   Thu, 13 Apr 2023 15:47:56 -0600
From:   Yu Zhao <yuzhao@...gle.com>
To:     Kalesh Singh <kaleshsingh@...gle.com>, akpm@...ux-foundation.org
Cc:     minchan@...gle.com, surenb@...gle.com, wvw@...gle.com,
        android-mm@...gle.com, kernel-team@...roid.com,
        Minchan Kim <minchan@...nel.org>,
        Oleksandr Natalenko <oleksandr@...alenko.name>,
        "Jan Alexander Steffens (heftig)" <heftig@...hlinux.org>,
        Suleiman Souhlal <suleiman@...gle.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] mm: Multi-gen LRU: remove wait_event_killable()

On Thu, Apr 13, 2023 at 3:43 PM Kalesh Singh <kaleshsingh@...gle.com> wrote:
>
> Android 14 and later default to MGLRU [1] and field telemetry showed
> occasional long tail latency (>100ms) in the reclaim path.
>
> Tracing revealed priority inversion in the reclaim path. In
> try_to_inc_max_seq(), high priority tasks were blocked on
> wait_event_killable() while the low priority task responsible for
> calling wake_up_all() was preempted, so those high priority tasks
> waited longer than necessary. In general, this problem is not
> different from others of its kind, e.g., one caused by mutex_lock().
> However, it is specific to MGLRU because MGLRU introduced the new
> wait queue lruvec->mm_state.wait.
>
> The purpose of this new wait queue is to avoid the thundering herd
> problem. If many direct reclaimers rush into try_to_inc_max_seq(),
> only one can succeed, i.e., the one to wake up the rest, and the rest
> who failed might cause premature OOM kills if they do not wait. So far
> there is no evidence supporting this scenario, based on how often the
> wait has been hit. This raises the question of how useful the wait
> queue is in practice.
>
> Based on Minchan's recommendation, which is in line with his commit
> 6d4675e60135 ("mm: don't be stuck to rmap lock on reclaim path") and
> the rest of the MGLRU code which also uses trylock when possible,
> remove the wait queue.
>
> [1] https://android-review.googlesource.com/q/I7ed7fbfd6ef9ce10053347528125dd98c39e50bf
>
> Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
> Cc: Yu Zhao <yuzhao@...gle.com>
> Cc: Minchan Kim <minchan@...nel.org>
> Reported-by: Wei Wang <wvw@...gle.com>
> Suggested-by: Minchan Kim <minchan@...nel.org>
> Signed-off-by: Kalesh Singh <kaleshsingh@...gle.com>

Acked-by: Yu Zhao <yuzhao@...gle.com>
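
For readers following the thread, the "try once and back off instead of sleeping" pattern that the patch description refers to can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel change itself: struct aging_state, aging_state_init() and try_advance_seq() are hypothetical names, and a pthread mutex stands in for whatever serialization the kernel uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-lruvec aging state; not a kernel structure. */
struct aging_state {
	pthread_mutex_t lock;	/* held by the one thread doing the aging work */
	unsigned long max_seq;	/* generation counter, protected by lock */
};

static void aging_state_init(struct aging_state *s)
{
	pthread_mutex_init(&s->lock, NULL);
	s->max_seq = 0;
}

/*
 * Try to advance max_seq past the value the caller last observed.
 * Returns true only if this caller performed the advance. If another
 * thread holds the lock, or max_seq has already moved on, return false
 * immediately instead of sleeping until the other thread finishes.
 */
static bool try_advance_seq(struct aging_state *s, unsigned long seen_seq)
{
	bool advanced = false;
	int rc;

	if (pthread_mutex_trylock(&s->lock) != 0)
		return false;	/* someone else is already working; back off */

	if (s->max_seq == seen_seq) {
		s->max_seq++;
		advanced = true;
	}

	rc = pthread_mutex_unlock(&s->lock);
	assert(rc == 0);
	(void)rc;

	return advanced;
}
```

A caller that gets false simply carries on (or retries later) rather than waiting, so a preempted low priority lock holder cannot stall a high priority caller. The trade-off raised in the patch description still applies: callers that fail and do not wait have to tolerate making no progress on that attempt.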
