linux-kernel - Re: Memory allocation on speculative fastpaths


Message-ID: <CAL36u31s_4TYPRtAzbGUpQVw2ButNv3vtKLhBkfJAhFSfcNDSg@mail.gmail.com>
Date:   Wed, 4 May 2022 01:20:39 -0700
From:   Michel Lespinasse <walken.cr@...il.com>
To:     Matthew Wilcox <willy@...radead.org>
Cc:     "Paul E. McKenney" <paulmck@...nel.org>,
        Michal Hocko <mhocko@...e.com>,
        Liam Howlett <liam.howlett@...cle.com>, hannes@...xchg.org,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        Davidlohr Bueso <dave@...olabs.net>, David <david@...hat.com>
Subject: Re: Memory allocation on speculative fastpaths

(For context: this came up during a discussion of implementation
details for speculative page faults.)

On Tue, May 3, 2022 at 11:28 AM Matthew Wilcox <willy@...radead.org> wrote:
> Johannes (I think it was?) made the point to me that if we have another
> task very slowly freeing memory, a task in this path can take advantage
> of that other task's hard work and never go into reclaim.  So the
> approach we should take is:
>
> p4d_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> pud_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> pmd_alloc(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
>
> if (failure) {
>   rcu_read_unlock();
>   do_reclaim();
>   return FAULT_FLAG_RETRY;
> }

I don't think this works. The problem with allocating page tables is
not just that it may break an RCU read-side critical section; the code
inserting the new page tables into the mm's page table tree also needs
to synchronize with any munmap() that may be running concurrently. RCU
isn't sufficient here, and we would need a proper lock when wiring new
page tables (current code relies on the mmap lock for this).
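A minimal userspace sketch of the point above, assuming a toy model
in which a single pthread mutex stands in for whatever lock excludes
munmap() (the mmap lock in current code). The names toy_mm,
toy_install_table() and toy_munmap() are hypothetical and exist only
for this illustration; this is not how the kernel implements it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical toy model; none of these names exist in the kernel. */
struct toy_mm {
	pthread_mutex_t lock;	/* stands in for the lock that excludes munmap() */
	bool mapped;		/* cleared once the range has been unmapped */
	void *table;		/* one slot in the upper-level page table */
};

/*
 * Wire a new lower-level table into the slot.
 * Returns 0 if the slot is populated on return, -1 otherwise.
 */
static int toy_install_table(struct toy_mm *mm)
{
	/* Allocate before taking the lock so the locked section stays short. */
	void *new_table = calloc(1, 4096);
	int ret = -1;

	if (!new_table)
		return -1;

	pthread_mutex_lock(&mm->lock);
	if (mm->mapped) {
		if (!mm->table) {
			mm->table = new_table;	/* slot was empty: publish ours */
			new_table = NULL;
		}
		assert(mm->table != NULL);
		ret = 0;
	}
	pthread_mutex_unlock(&mm->lock);

	/* Lost the race, or the range is gone; free(NULL) is a no-op. */
	free(new_table);
	return ret;
}

/* Tear down the range and its table, excluding concurrent installers. */
static void toy_munmap(struct toy_mm *mm)
{
	pthread_mutex_lock(&mm->lock);
	mm->mapped = false;
	free(mm->table);
	mm->table = NULL;
	pthread_mutex_unlock(&mm->lock);
}
```

Because the installer rechecks mm->mapped under the same lock that
toy_munmap() holds, it can never wire a table into a range that has
already been torn down. An RCU read-side section alone would only
delay freeing; it would not stop the installer from publishing a
table after the teardown.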

> ... but all this is now moot since the approach we agreed to yesterday
> is:
>
> rcu_read_lock();
> vma = vma_lookup();
> if (down_read_trylock(&vma->sem)) {
>         rcu_read_unlock();
> } else {
>         rcu_read_unlock();
>         mmap_read_lock(mm);
>         vma = vma_lookup();
>         down_read(&vma->sem);
> }
>
> ... and we then execute the page table allocation under the protection of
> the vma->sem.
>
> At least, that's what I think we agreed to yesterday.

I don't remember discussing any of this yesterday. As I remember it,
the discussion was about having one large RCU section vs several small
ones linked by sequence count checks to verify the validity of the vma
at the start of each RCU section.
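
A rough userspace sketch of the "several small sections linked by
sequence count checks" idea, assuming a per-vma counter that writers
bump before and after each modification. The toy_vma type and the
toy_* helpers are hypothetical names for illustration, not kernel
APIs, and the RCU calls appear only as comments.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy model; these names are not kernel APIs. */
struct toy_vma {
	atomic_uint seq;	/* odd while a writer is modifying the vma */
	unsigned long start, end;
};

/* Writer side: bump the count before and after changing the vma. */
static void toy_vma_write_begin(struct toy_vma *vma)
{
	unsigned int old = atomic_fetch_add(&vma->seq, 1);

	assert((old & 1) == 0);	/* no nested writers in this toy */
	(void)old;
}

static void toy_vma_write_end(struct toy_vma *vma)
{
	unsigned int old = atomic_fetch_add(&vma->seq, 1);

	assert((old & 1) == 1);	/* must pair with toy_vma_write_begin() */
	(void)old;
}

/* Reader side: remember the count seen when the vma was looked up. */
static unsigned int toy_vma_snapshot(struct toy_vma *vma)
{
	return atomic_load(&vma->seq);
}

/*
 * One small speculative section. It starts by checking that the vma
 * has not changed since the snapshot was taken. A false return means
 * the caller should give up and fall back to the non-speculative path.
 */
static bool toy_fault_step(struct toy_vma *vma, unsigned int snap)
{
	bool valid;

	/* rcu_read_lock() would go here in the kernel. */
	valid = !(snap & 1) && atomic_load(&vma->seq) == snap;
	/* ... this step's work would run here, only if valid ... */
	/* rcu_read_unlock() would go here. */
	return valid;
}
```

A fault handler built this way would take one snapshot at lookup time
and call toy_fault_step() once per section; any writer activity in
between makes the next step fail. A real implementation would also
need appropriate memory ordering around reads of the vma fields, which
this sketch does not attempt to model.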