linux-kernel - Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJuCfpGGU9TpL62EzwUCjsUy0frmR33Nyk5BQiN=AiQUkiq7yg@mail.gmail.com>
Date:   Wed, 18 Jan 2023 09:36:44 -0800
From:   Suren Baghdasaryan <surenb@...gle.com>
To:     Michal Hocko <mhocko@...e.com>
Cc:     Jann Horn <jannh@...gle.com>, peterz@...radead.org,
        Ingo Molnar <mingo@...hat.com>, Will Deacon <will@...nel.org>,
        akpm@...ux-foundation.org, michel@...pinasse.org,
        jglisse@...gle.com, vbabka@...e.cz, hannes@...xchg.org,
        mgorman@...hsingularity.net, dave@...olabs.net,
        willy@...radead.org, liam.howlett@...cle.com,
        ldufour@...ux.ibm.com, laurent.dufour@...ibm.com,
        paulmck@...nel.org, luto@...nel.org, songliubraving@...com,
        peterx@...hat.com, david@...hat.com, dhowells@...hat.com,
        hughd@...gle.com, bigeasy@...utronix.de, kent.overstreet@...ux.dev,
        punit.agrawal@...edance.com, lstoakes@...il.com,
        peterjung1337@...il.com, rientjes@...gle.com,
        axelrasmussen@...gle.com, joelaf@...gle.com, minchan@...gle.com,
        shakeelb@...gle.com, tatashin@...gle.com, edumazet@...gle.com,
        gthelen@...gle.com, gurua@...gle.com, arjunroy@...gle.com,
        soheil@...gle.com, hughlynch@...gle.com, leewalsh@...gle.com,
        posk@...gle.com, linux-mm@...ck.org,
        linux-arm-kernel@...ts.infradead.org,
        linuxppc-dev@...ts.ozlabs.org, x86@...nel.org,
        linux-kernel@...r.kernel.org, kernel-team@...roid.com
Subject: Re: [PATCH 12/41] mm: add per-VMA lock and helper functions to
 control it

On Wed, Jan 18, 2023 at 7:11 AM 'Michal Hocko' via kernel-team
<kernel-team@...roid.com> wrote:
>
> On Wed 18-01-23 14:23:32, Jann Horn wrote:
> > On Wed, Jan 18, 2023 at 1:28 PM Michal Hocko <mhocko@...e.com> wrote:
> > > On Tue 17-01-23 19:02:55, Jann Horn wrote:
> > > > +locking maintainers
> > > >
> > > > On Mon, Jan 9, 2023 at 9:54 PM Suren Baghdasaryan <surenb@...gle.com> wrote:
> > > > > Introduce a per-VMA rw_semaphore to be used during page fault handling
> > > > > instead of mmap_lock. Because there are cases when multiple VMAs need
> > > > > to be exclusively locked during VMA tree modifications, instead of the
> > > > > usual lock/unlock patter we mark a VMA as locked by taking per-VMA lock
> > > > > exclusively and setting vma->lock_seq to the current mm->lock_seq. When
> > > > > mmap_write_lock holder is done with all modifications and drops mmap_lock,
> > > > > it will increment mm->lock_seq, effectively unlocking all VMAs marked as
> > > > > locked.
> > > > [...]
> > > > > +static inline void vma_read_unlock(struct vm_area_struct *vma)
> > > > > +{
> > > > > +       up_read(&vma->lock);
> > > > > +}
> > > >
> > > > One thing that might be gnarly here is that I think you might not be
> > > > allowed to use up_read() to fully release ownership of an object -
> > > > from what I remember, I think that up_read() (unlike something like
> > > > spin_unlock()) can access the lock object after it's already been
> > > > acquired by someone else.
> > >
> > > Yes, I think you are right. From a look into the code it seems that
> > > the UAF is quite unlikely as there is a ton of work to be done between
> > > vma_write_lock used to prepare vma for removal and actual removal.
> > > That doesn't make it less of a problem though.
> > >
> > > > So if you want to protect against concurrent
> > > > deletion, this might have to be something like:
> > > >
> > > > rcu_read_lock(); /* keeps vma alive */
> > > > up_read(&vma->lock);
> > > > rcu_read_unlock();
> > > >
> > > > But I'm not entirely sure about that, the locking folks might know better.
> > >
> > > I am not a locking expert but to me it looks like this should work
> > > because the final cleanup would have to happen rcu_read_unlock.
> > >
> > > Thanks, I have completely missed this aspect of the locking when looking
> > > into the code.
> > >
> > > Btw. looking at this again I have fully realized how hard it is actually
> > > to see that vm_area_free is guaranteed to sync up with ongoing readers.
> > > vma manipulation functions like __adjust_vma make my head spin. Would it
> > > make more sense to have a rcu style synchronization point in
> > > vm_area_free directly before call_rcu? This would add an overhead of
> > > uncontended down_write of course.
> >
> > Something along those lines might be a good idea, but I think that
> > rather than synchronizing the removal, it should maybe be something
> > that splats (and bails out?) if it detects pending readers. If we get
> > to vm_area_free() on a VMA that has pending readers, we might already
> > be in a lot of trouble because the concurrent readers might have been
> > traversing page tables while we were tearing them down or fun stuff
> > like that.
> >
> > I think maybe Suren was already talking about something like that in
> > another part of this patch series but I don't remember...
>
> This http://lkml.kernel.org/r/20230109205336.3665937-27-surenb@google.com?

Yes, I spent a lot of time ensuring that __adjust_vma locks the right
VMAs and that VMAs are freed or isolated under VMA write lock
protection to exclude any readers. If the VM_BUG_ON_VMA in the patch
Michal mentioned gets hit then it's a bug in my design and I'll have
to fix it. But please, let's not add synchronize_rcu() in the
vm_area_free(). That will slow down any path that frees a VMA,
especially the exit path which might be freeing thousands of them. I
had an SPF version with synchronize_rcu() in the vm_area_free() and
phone vendors started yelling at me the very next day. call_rcu() with
CONFIG_RCU_NOCB_CPU (which Android uses for power saving purposes) is
already bad enough to show up in the benchmarks and that's why I had
to add call_rcu() batching in
https://lore.kernel.org/all/20230109205336.3665937-40-surenb@google.com.

>
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@...roid.com.
>