Message-ID: <CAG48ez2p=X-=9V64RfDQoP9C86D0mMoKS5Yc6OeWTnDp+vohvg@mail.gmail.com>
Date: Thu, 15 Jan 2026 16:48:36 +0100
From: Jann Horn <jannh@...gle.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Dmitry Vyukov <dvyukov@...gle.com>, 
	syzbot <syzbot+f5d897f5194d92aa1769@...kaller.appspotmail.com>, 
	Liam.Howlett@...cle.com, akpm@...ux-foundation.org, david@...nel.org, 
	harry.yoo@...cle.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	riel@...riel.com, syzkaller-bugs@...glegroups.com, vbabka@...e.cz
Subject: Re: [syzbot] [mm?] KCSAN: data-race in __anon_vma_prepare / __vmf_anon_prepare

On Wed, Jan 14, 2026 at 10:16 PM Lorenzo Stoakes
<lorenzo.stoakes@...cle.com> wrote:
> On Wed, Jan 14, 2026 at 07:23:37PM +0100, Jann Horn wrote:
> > On Wed, Jan 14, 2026 at 7:02 PM Lorenzo Stoakes
> > <lorenzo.stoakes@...cle.com> wrote:
> > > On Wed, Jan 14, 2026 at 06:48:37PM +0100, Jann Horn wrote:
> > > > On Wed, Jan 14, 2026 at 6:29 PM Jann Horn <jannh@...gle.com> wrote:
> > > > > On Wed, Jan 14, 2026 at 6:06 PM Dmitry Vyukov <dvyukov@...gle.com> wrote:
> > > > > > On Wed, 14 Jan 2026 at 18:00, Jann Horn <jannh@...gle.com> wrote:
> > > > > > > On Wed, Jan 14, 2026 at 5:43 PM Dmitry Vyukov <dvyukov@...gle.com> wrote:
> > > > > > > > On Wed, 14 Jan 2026 at 17:32, syzbot
> > > > > > > > <syzbot+f5d897f5194d92aa1769@...kaller.appspotmail.com> wrote:
> > > > > > > One scenario to cause such a data race is to create a new anonymous
> > > > > > > VMA, then trigger two concurrent page faults inside this VMA. Assume a
> > > > > > > configuration with VMA locking disabled for simplicity, so that both
> > > > > > > faults happen under the mmap lock in read mode. This will lead to two
> > > > > > > concurrent calls to __vmf_anon_prepare()
> > > > > > > (https://elixir.bootlin.com/linux/v6.18.5/source/mm/memory.c#L3623),
> > > > > > > both threads only holding the mmap_lock in read mode.
> > > > > > > __vmf_anon_prepare() is essentially this (from
> > > > > > > https://elixir.bootlin.com/linux/v6.18.5/source/mm/memory.c#L3623,
> > > > > > > with VMA locking code removed):
> > > > > > >
> > > > > > > vm_fault_t __vmf_anon_prepare(struct vm_fault *vmf)
> > > > > > > {
> > > > > > >         struct vm_area_struct *vma = vmf->vma;
> > > > > > >         vm_fault_t ret = 0;
> > > > > > >
> > > > > > >         if (likely(vma->anon_vma))
> > > > > > >                 return 0;
> > > > > > >         [...]
> > > > > > >         if (__anon_vma_prepare(vma))
> > > > > > >                 ret = VM_FAULT_OOM;
> > > > > > >         [...]
> > > > > > >         return ret;
> > > > > > > }
> > > > > > >
> > > > > > > int __anon_vma_prepare(struct vm_area_struct *vma)
> > > > > > > {
> > > > > > >         struct mm_struct *mm = vma->vm_mm;
> > > > > > >         struct anon_vma *anon_vma, *allocated;
> > > > > > >         struct anon_vma_chain *avc;
> > > > > > >
> > > > > > >         [...]
> > > > > > >
> > > > > > >         [... allocate stuff ...]
> > > > > > >
> > > > > > >         anon_vma_lock_write(anon_vma);
> > > > > > >         /* page_table_lock to protect against threads */
> > > > > > >         spin_lock(&mm->page_table_lock);
> > > > > > >         if (likely(!vma->anon_vma)) {
> > > > > > >                 vma->anon_vma = anon_vma;
> > > > > > >                 [...]
> > > > > > >         }
> > > > > > >         spin_unlock(&mm->page_table_lock);
> > > > > > >         anon_vma_unlock_write(anon_vma);
> > > > > > >
> > > > > > >         [... cleanup ...]
> > > > > > >
> > > > > > >         return 0;
> > > > > > >
> > > > > > >         [... error handling ...]
> > > > > > > }
> > > > > > >
> > > > > > > So if one thread reaches the "vma->anon_vma = anon_vma" assignment
> > > > > > > while the other thread is running the "if (likely(vma->anon_vma))"
> > > > > > > check, you get a (AFAIK benign) data race.
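
To spell out the racing accesses: both are plain (unmarked) accesses,
which is what KCSAN flags. A minimal sketch of the interleaving (the
timing is illustrative, not taken from the report):

    CPU 0: __vmf_anon_prepare()        CPU 1: __anon_vma_prepare()
    ---------------------------        ---------------------------
                                       spin_lock(&mm->page_table_lock);
    if (likely(vma->anon_vma))         vma->anon_vma = anon_vma;
            /* plain load */                   /* plain store */

CPU 0 never takes page_table_lock around its check, so nothing orders
the load against the store.
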
> > > > > >
> > > > > > Thanks for checking, Jann.
> > > > > >
> > > > > > To double-check:
> > > > > >
> > > > > > "vma->anon_vma = anon_vma" is done w/o store-release, so the lockless
> > > > > > readers can't read anon_vma contents, is it correct? So none of them
> > > > > > really reading anon_vma, right?
> > > > >
> > > > > I think you are right that this should be using store-release;
> > > > > searching around, I also mentioned this in
> > > > > <https://lore.kernel.org/all/CAG48ez0qsAM-dkOUDetmNBSK4typ5t_FvMvtGiB7wQsP-G1jVg@mail.gmail.com/>:
> > > > >
> > > > > | > +Note that there are some exceptions to this - the `anon_vma` field
> > > > > | > +is permitted to be written to under mmap read lock and is instead
> > > > > | > +serialised by the `struct mm_struct` field `page_table_lock`. In
> > > > > | > +addition the `vm_mm` and all
> > > > > |
> > > > > | Hm, we really ought to add some smp_store_release() and READ_ONCE(),
> > > > > | or something along those lines, around our ->anon_vma accesses...
> > > > > | especially the "vma->anon_vma = anon_vma" assignment in
> > > > > | __anon_vma_prepare() looks to me like, on architectures like arm64
> > > > > | with write-write reordering, we could theoretically end up making a
> > > > > | new anon_vma pointer visible to a concurrent page fault before the
> > > > > | anon_vma has been initialized? Though I have no idea if that is
> > > > > | practically possible, stuff would have to be reordered quite a bit for
> > > > > | that to happen...
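
A minimal sketch (untested, not a real patch) of what those annotations
could look like, again with the locking code elided:

    /* writer side, in __anon_vma_prepare(): */
    spin_lock(&mm->page_table_lock);
    if (likely(!vma->anon_vma)) {
            /* orders the initialising stores before the publication */
            smp_store_release(&vma->anon_vma, anon_vma);
            [...]
    }
    spin_unlock(&mm->page_table_lock);

    /* reader side, in __vmf_anon_prepare(), pairing with the release: */
    if (likely(smp_load_acquire(&vma->anon_vma)))
            return 0;

Whether READ_ONCE() plus the address dependency would suffice on the
reader side instead of a full acquire is a separate question.
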
>
> OK, I'm confused: how can we end up with an uninitialised anon_vma, actually?
>
> The write gets ordered before the initialisation, somehow?
>
>         anon_vma = find_mergeable_anon_vma(vma);
>         allocated = NULL;
>         if (!anon_vma) {
>                 anon_vma = anon_vma_alloc();
>
> WHICH IS (maybe inlined):
> ******************************
>         anon_vma = kmem_cache_alloc(anon_vma_cachep, GFP_KERNEL);
>         if (anon_vma) {
> |-----------------------> ??
> |               atomic_set(&anon_vma->refcount, 1);
> |               anon_vma->num_children = 0;
> |               anon_vma->num_active_vmas = 0;
> |               anon_vma->parent = anon_vma;
> |               /*
> |                * Initialise the anon_vma root to point to itself. If called
> |                * from fork, the root will be reset to the parent's anon_vma.
> |                */
> |               anon_vma->root = anon_vma;
> |       }
> |       return anon_vma;
> |*****************************
> |
> |               anon_vma->num_children++; /* self-parent link for new root */
> |               allocated = anon_vma;
> |       }
> |
> |       anon_vma_lock_write(anon_vma);
> |       /* page_table_lock to protect against threads */
> |       spin_lock(&mm->page_table_lock);
> |       if (likely(!vma->anon_vma)) {
> |---------------vma->anon_vma = anon_vma;
>
> Am I totally misunderstanding?
>
> How likely is this?
>
> Given the anon_vma_lock_write() and spin_lock(), are we not avoiding this anyway?

Acquiring a lock is an ACQUIRE operation; the "vma->anon_vma =
anon_vma" can't move up before the ACQUIRE operations, but something
like "anon_vma->num_children = 0" can be reordered down below the
ACQUIRE.

So the sequence

anon_vma->num_children = 0
spin_lock(&mm->page_table_lock)
vma->anon_vma = anon_vma

can be reordered into:

spin_lock(&mm->page_table_lock)
vma->anon_vma = anon_vma
anon_vma->num_children = 0

See https://www.kernel.org/doc/Documentation/memory-barriers.txt , which says:
<<<
 (1) ACQUIRE operation implication:

     Memory operations issued after the ACQUIRE will be completed after the
     ACQUIRE operation has completed.

     Memory operations issued before the ACQUIRE may be completed after
     the ACQUIRE operation has completed.
>>>
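
The RELEASE rule in the same list is the mirror image, which is why
publishing the pointer with smp_store_release() would close this window:
<<<
 (2) RELEASE operation implication:

     Memory operations issued before the RELEASE will be completed before the
     RELEASE operation has completed.

     Memory operations issued after the RELEASE may be completed before the
     RELEASE operation has completed.
>>>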

> > >
> > > As far as the page fault is concerned, it only really cares about whether
> > > the anon_vma exists or not, not whether it's initialised.
> >
> > Hmm, yeah, I'm not sure if anything in the page fault path actually
> > directly accesses the anon_vma. The page fault path does eventually
> > re-publish the anon_vma pointer with `WRITE_ONCE(folio->mapping,
> > (struct address_space *) anon_vma)` in __folio_set_anon() though,
> > which could then potentially allow a third thread to walk through
> > folio->mapping and observe the uninitialized anon_vma...
>
> But how would it be uninitialised at that point?
>
> See above.
>
> >
> > Looking at the situation on latest stable (v6.18.5), two racing faults
> > on _adjacent_ anonymous VMAs could also end up with one thread writing
> > ->anon_vma while the other thread executes reusable_anon_vma(),
>
>         if (anon_vma_compatible(a, b)) {
>                 struct anon_vma *anon_vma = READ_ONCE(old->anon_vma);
>
>                 if (anon_vma && list_is_singular(&old->anon_vma_chain))
>                         return anon_vma;
>         }
>
> Hmm... again, I don't see how we're finding a mergeable anon_vma in the
> adjacent VMA that is somehow uninitialised?
>
> > loading the pointer to that anon_vma and accessing its
> > ->anon_vma_chain.
>
> The VMA's anon_vma_chain you mean? anon_vma doesn't have that field.

Ah, oops, I misread that code; so I guess the first potentially
uninitialized read would occur when __anon_vma_prepare() passes the
reused anon_vma to anon_vma_lock_write().

> Is it again based on the assumption that on some architectures we might see
> a write of an allocated-but-not-initialised anon_vma?
>
> But I also don't see how this is harmful, as anything that touches
> anon_vma state meaningfully has to take the rmap lock anyway.

I think the rmap lock itself could theoretically be uninitialized, too.
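
(For reference, anon_vma_lock_write() is just:

    static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
    {
            /* ->root is one of the fields set up in anon_vma_alloc() */
            down_write(&anon_vma->root->rwsem);
    }

so a thread that observes the published pointer too early could follow an
uninitialised ->root and take a garbage rwsem.)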
