linux-kernel - Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit

Message-ID: <YdLn+192/0HfNJyl@dhcp22.suse.cz>
Date:   Mon, 3 Jan 2022 13:11:39 +0100
From:   Michal Hocko <mhocko@...e.com>
To:     Suren Baghdasaryan <surenb@...gle.com>
Cc:     Johannes Weiner <hannes@...xchg.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        David Rientjes <rientjes@...gle.com>,
        Matthew Wilcox <willy@...radead.org>,
        Roman Gushchin <guro@...com>, Rik van Riel <riel@...riel.com>,
        Minchan Kim <minchan@...nel.org>,
        "Kirill A. Shutemov" <kirill@...temov.name>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Christian Brauner <christian@...uner.io>,
        Christoph Hellwig <hch@...radead.org>,
        Oleg Nesterov <oleg@...hat.com>,
        David Hildenbrand <david@...hat.com>,
        Jann Horn <jannh@...gle.com>,
        Shakeel Butt <shakeelb@...gle.com>,
        Andy Lutomirski <luto@...nel.org>,
        Christian Brauner <christian.brauner@...ntu.com>,
        Florian Weimer <fweimer@...hat.com>,
        Jan Engelhardt <jengelh@...i.de>,
        Tim Murray <timmurray@...gle.com>,
        linux-mm <linux-mm@...ck.org>,
        LKML <linux-kernel@...r.kernel.org>,
        kernel-team <kernel-team@...roid.com>
Subject: Re: [PATCH 4/3] mm: drop MMF_OOM_SKIP from exit_mmap

On Thu 30-12-21 09:29:40, Suren Baghdasaryan wrote:
> On Thu, Dec 30, 2021 at 12:24 AM Michal Hocko <mhocko@...e.com> wrote:
> >
> > On Wed 29-12-21 21:59:55, Suren Baghdasaryan wrote:
> > [...]
> > > After some more digging I think there are two acceptable options:
> > >
> > > 1. Call unlock_range() under mmap_write_lock and then downgrade it to
> > > read lock so that both exit_mmap() and __oom_reap_task_mm() can unmap
> > > vmas in parallel like this:
> > >
> > >     if (mm->locked_vm) {
> > >         mmap_write_lock(mm);
> > >         unlock_range(mm->mmap, ULONG_MAX);
> > >         mmap_write_downgrade(mm);
> > >     } else
> > >         mmap_read_lock(mm);
> > > ...
> > >     unmap_vmas(&tlb, vma, 0, -1);
> > >     mmap_read_unlock(mm);
> > >     mmap_write_lock(mm);
> > >     free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > > ...
> > >     mm->mmap = NULL;
> > >     mmap_write_unlock(mm);
> > >
> > > This way exit_mmap() might block __oom_reap_task_mm() but for a much
> > > shorter time during unlock_range() call.
> >
> > IIRC unlock_range depends on the page lock at some stage, and that can
> > mean this will block for a long time, or forever, when the holder of the
> > lock depends on a memory allocation. This was the primary reason why the
> > oom reaper skips over mlocked vmas.
> 
> Oh, I missed that detail. I thought __oom_reap_task_mm() skips locked
> vmas only to avoid destroying pgds from under follow_page().
> 
> >
> > > 2. Introduce another vm_flag mask similar to VM_LOCKED which is set
> > > before munlock_vma_pages_range() clears VM_LOCKED so that
> > > __oom_reap_task_mm() can identify vmas being unlocked and skip them.
> > >
> > > Option 1 seems cleaner to me because it keeps the locking pattern
> > > around unlock_range() in exit_mmap() consistent with all other places
> > > it is used (in mremap() and munmap()) with mmap_write_lock taken.
> > > WDYT?
> >
> > It would be really great to make unlock_range oom reaper aware IMHO.
> 
> What exactly do you envision? Say unlock_range() knows that it's
> racing with __oom_reap_task_mm() and that calling follow_page() is
> unsafe without locking, what should it do?

My original plan was to make the page lock conditional and use
trylocking from the oom reaper (aka lockless context). It is OK to
simply bail out and leave some mlocked memory behind if there is
contention on a specific page. The overall objective is to free as much
memory as possible, not all of it.
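
To illustrate the trylock-and-bail-out idea only, here is a minimal
userspace sketch. It is not kernel code and not a proposed patch:
struct fake_page and reap_trylock() are hypothetical names, and a pthread
mutex stands in for the page lock. Each page is attempted with a trylock,
and a contended page is skipped rather than waited on.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a page; not a kernel structure. */
struct fake_page {
    pthread_mutex_t lock;   /* plays the role of the page lock */
    bool mlocked;           /* plays the role of the mlocked state */
};

/*
 * Walk the pages and clear the mlocked state of each one whose lock can
 * be taken without blocking. A contended page is left behind untouched.
 * Returns the number of pages actually processed.
 */
static size_t reap_trylock(struct fake_page *pages, size_t n)
{
    size_t done = 0;

    for (size_t i = 0; i < n; i++) {
        /* Non-zero return means the lock is held: bail out on this page. */
        if (pthread_mutex_trylock(&pages[i].lock) != 0)
            continue;

        pages[i].mlocked = false;
        pthread_mutex_unlock(&pages[i].lock);
        done++;
    }

    return done;
}
```

The caller never sleeps on a lock, so it cannot get stuck behind a lock
holder that is itself waiting for memory. The price is that some pages
may stay mlocked, which is acceptable given the objective stated above.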

IIRC Hugh was not a fan of this approach and he has mentioned that the
lock might not even be needed and that the area would benefit from a
cleanup rather than oom reaper specific hacks. I do tend to agree with
that. I just never managed to find any spare time for it, though, and
heavily mlocked oom victims tend to be really rare.
-- 
Michal Hocko
SUSE Labs