lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230922135155.tqrelea66hcj5cog@revolver>
Date:   Fri, 22 Sep 2023 09:51:55 -0400
From:   "Liam R. Howlett" <Liam.Howlett@...cle.com>
To:     Mirsad Todorovac <mirsad.todorovac@....unizg.hr>
Cc:     maple-tree@...ts.infradead.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: BUG: maple_tree: KCSAN: data-race in mas_topiary_replace /
 mtree_range_walk

* Mirsad Todorovac <mirsad.todorovac@....unizg.hr> [230921 17:01]:
> Hi,
> 

Hello!

> On the 6.6-rc2 vanilla torvalds tree kernel, KCSAN had reported a number of data-races:
> 
> [ 6413.367326] ==================================================================
> [ 6413.367349] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
> 
> [ 6413.367375] write to 0xffff8883a0c5db00 of 8 bytes by task 5475 on cpu 24:
> [ 6413.367386] mas_topiary_replace (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:491 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:1701 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2590)
> [ 6413.367399] mas_spanning_rebalance.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2664 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2961)
> [ 6413.367413] mas_wr_spanning_store.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:3894)
> [ 6413.367428] mas_wr_store_entry.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4242)
> [ 6413.367442] mas_store_prealloc (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:256 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:5408)
> [ 6413.367455] vma_merge (/home/marvin/linux/kernel/torvalds2/mm/internal.h:1127 /home/marvin/linux/kernel/torvalds2/mm/mmap.c:1005)
> [ 6413.367466] mprotect_fixup (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:632)
> [ 6413.367480] do_mprotect_pkey (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:819)
> [ 6413.367494] __x64_sys_mprotect (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:837)
> [ 6413.367503] do_syscall_64 (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:50 /home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:80)
> [ 6413.367517] entry_SYSCALL_64_after_hwframe (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/entry_64.S:120)
> 
> [ 6413.367534] read to 0xffff8883a0c5db00 of 8 bytes by task 5558 on cpu 11:
> [ 6413.367545] mtree_range_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:539 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2789)
> [ 6413.367556] mas_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:251 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4844)
> [ 6413.367567] lock_vma_under_rcu (/home/marvin/linux/kernel/torvalds2/mm/memory.c:5436)
> [ 6413.367579] do_user_addr_fault (/home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1357)
> [ 6413.367593] exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/paravirt.h:695 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1513 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1561)
> [ 6413.367602] asm_exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/idtentry.h:570)
> 
> [ 6413.367617] value changed: 0xffff888341d43116 -> 0xffff888340f92116
> 
> [ 6413.367632] Reported by Kernel Concurrency Sanitizer on:
> [ 6413.367640] CPU: 11 PID: 5558 Comm: ThreadPoolForeg Tainted: G             L     6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
> [ 6413.367653] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
> [ 6413.367660] ==================================================================
> 
> For your convenience, took the trouble of finding those suspicious lines of code:
> 
> The write side:
> 
> lib/maple_tree.c:491
> --------------------
>   456 /*
>   457  * mas_set_parent() - Set the parent node and encode the slot
>   458  * @enode: The encoded maple node.
>   459  * @parent: The encoded maple node that is the parent of @enode.
>   460  * @slot: The slot that @enode resides in @parent.
>   461  *
>   462  * Slot number is encoded in the enode->parent bit 3-6 or 2-6, depending on the
>   463  * parent type.
>   464  */
>   465 static inline
>   466 void mas_set_parent(struct ma_state *mas, struct maple_enode *enode,
>   467                     const struct maple_enode *parent, unsigned char slot)
>   468 {
>   469         unsigned long val = (unsigned long)parent;
>   470         unsigned long shift;
>   471         unsigned long type;
>   472         enum maple_type p_type = mte_node_type(parent);
>   473
>   474         MAS_BUG_ON(mas, p_type == maple_dense);
>   475         MAS_BUG_ON(mas, p_type == maple_leaf_64);
>   476
>   477         switch (p_type) {
>   478         case maple_range_64:
>   479         case maple_arange_64:
>   480                 shift = MAPLE_PARENT_SLOT_SHIFT;
>   481                 type = MAPLE_PARENT_RANGE64;
>   482                 break;
>   483         default:
>   484         case maple_dense:
>   485         case maple_leaf_64:
>   486                 shift = type = 0;
>   487                 break;
>   488         }
>   489
>   490         val &= ~MAPLE_NODE_MASK; /* Clear all node metadata in parent */
> → 491         val |= (slot << shift) | type;
>   492         mte_to_node(enode)->parent = ma_parent_ptr(val);
>   493 }
> 
> lib/maple_tree.c:1701
> ---------------------
>   1682 /*
>   1683  * mas_adopt_children() - Set the parent pointer of all nodes in @parent to
>   1684  * @parent with the slot encoded.
>   1685  * @mas - the maple state (for the tree)
>   1686  * @parent - the maple encoded node containing the children.
>   1687  */
>   1688 static inline void mas_adopt_children(struct ma_state *mas,
>   1689                 struct maple_enode *parent)
>   1690 {
>   1691         enum maple_type type = mte_node_type(parent);
>   1692         struct maple_node *node = mte_to_node(parent);
>   1693         void __rcu **slots = ma_slots(node, type);
>   1694         unsigned long *pivots = ma_pivots(node, type);
>   1695         struct maple_enode *child;
>   1696         unsigned char offset;
>   1697
>   1698         offset = ma_data_end(node, type, pivots, mas->max);
>   1699         do {
>   1700                 child = mas_slot_locked(mas, slots, offset);
> → 1701                 mas_set_parent(mas, child, parent, offset);
>   1702         } while (offset--);
>   1703 }
> 
>   2562 static inline void mas_topiary_replace(struct ma_state *mas,
>   2563                 struct maple_enode *old_enode)
>   2564 {
>   2565         struct ma_state tmp[3], tmp_next[3];
>   2566         MA_TOPIARY(subtrees, mas->tree);
>   2567         bool in_rcu;
>   2568         int i, n;
>   2569
>   2570         /* Place data in tree & then mark node as old */
>   2571         mas_put_in_tree(mas, old_enode);
>   2572
>   2573         /* Update the parent pointers in the tree */
>   2574         tmp[0] = *mas;
>   2575         tmp[0].offset = 0;
>   2576         tmp[1].node = MAS_NONE;
>   2577         tmp[2].node = MAS_NONE;
>   2578         while (!mte_is_leaf(tmp[0].node)) {
>   2579                 n = 0;
>   2580                 for (i = 0; i < 3; i++) {
>   2581                         if (mas_is_none(&tmp[i]))
>   2582                                 continue;
>   2583
>   2584                         while (n < 3) {
>   2585                                 if (!mas_find_child(&tmp[i], &tmp_next[n]))
>   2586                                         break;
>   2587                                 n++;
>   2588                         }
>   2589
> → 2590                        mas_adopt_children(&tmp[i], tmp[i].node);
>   2591                 }
>   2592
>   2593                 if (MAS_WARN_ON(mas, n == 0))
>   2594                         break;
>   2595
>   2596                 while (n < 3)
>   2597                         tmp_next[n++].node = MAS_NONE;
>   2598
>   2599                 for (i = 0; i < 3; i++)
>   2600                         tmp[i] = tmp_next[i];
>   2601         }
>   2602
>   2603         /* Collect the old nodes that need to be discarded */
>   2604         if (mte_is_leaf(old_enode))
>   2605                 return mas_free(mas, old_enode);
>   2606
>   2607         tmp[0] = *mas;
>   2608         tmp[0].offset = 0;
>   2609         tmp[0].node = old_enode;
>   2610         tmp[1].node = MAS_NONE;
>   2611         tmp[2].node = MAS_NONE;
>   2612         in_rcu = mt_in_rcu(mas->tree);
>   2613         do {
>   2614                 n = 0;
>   2615                 for (i = 0; i < 3; i++) {
>   2616                         if (mas_is_none(&tmp[i]))
>   2617                                 continue;
> 
> The read side:
> 
>    527 /*
>    528  * ma_dead_node() - check if the @enode is dead.
>    529  * @enode: The encoded maple node
>    530  *
>    531  * Return: true if dead, false otherwise.
>    532  */
>    533 static inline bool ma_dead_node(const struct maple_node *node)
>    534 {
>    535         struct maple_node *parent;
>    536
>    537         /* Do not reorder reads from the node prior to the parent check */
>    538         smp_rmb();
> →  539         parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
>    540         return (parent == node);
>    541 }
> 
>   2767 static inline void *mtree_range_walk(struct ma_state *mas)
>   2768 {
>   2769         unsigned long *pivots;
>   2770         unsigned char offset;
>   2771         struct maple_node *node;
>   2772         struct maple_enode *next, *last;
>   2773         enum maple_type type;
>   2774         void __rcu **slots;
>   2775         unsigned char end;
>   2776         unsigned long max, min;
>   2777         unsigned long prev_max, prev_min;
>   2778
>   2779         next = mas->node;
>   2780         min = mas->min;
>   2781         max = mas->max;
>   2782         do {
>   2783                 offset = 0;
>   2784                 last = next;
>   2785                 node = mte_to_node(next);
>   2786                 type = mte_node_type(next);
>   2787                 pivots = ma_pivots(node, type);
>   2788                 end = ma_data_end(node, type, pivots, max);
> → 2789                 if (unlikely(ma_dead_node(node)))
>   2790                         goto dead_node;
>   2791
>   2792                 if (pivots[offset] >= mas->index) {
>   2793                         prev_max = max;
>   2794                         prev_min = min;
>   2795                         max = pivots[offset];
>   2796                         goto next;
>   2797                 }
>   2798
>   2799                 do {
>   2800                         offset++;
>   2801                 } while ((offset < end) && (pivots[offset] < mas->index));
> 
> As it is evident, ma_dead_node() expands to:
> 
>    527 /*
>    528  * ma_dead_node() - check if the @enode is dead.
>    529  * @enode: The encoded maple node
>    530  *
>    531  * Return: true if dead, false otherwise.
>    532  */
>    533 static inline bool ma_dead_node(const struct maple_node *node)
>    534 {
>    535         struct maple_node *parent;
>    536
>    537         /* Do not reorder reads from the node prior to the parent check */
>    538         smp_rmb();
> →  539         parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
>    540         return (parent == node);
>    541 }
> 
> as above, but the smb_rmb() protection is here, so the KCSAN warning should be double-checked for
> validity.
> 
> But I do not really understand maple trees to their depth, I am only reporting the symptomatic
> outlook of the reported data-race.

This is the most complex part of the maple tree, and unique to the
ability to store a range over multiple existing entries.  I recently
rewrote this particular section to avoid a potential live-lock issue.

I am confident that the parent pointer will not be set to the node
pointer as checked by ma_dead_node().  But in an abundance of caution
and to resolve this potential false-positive, I am looking at this in
more detail.

This race is to see if the walk down the tree into unchanged data will
be seen before it is placed in the new subtree with its data also
unchanged.  Since we know the parent can never be the node, I feel this
is safe - but I'm not willing to live with the complaints from kasan.

> 
> This is all-in-all a very interested subject, and I wish there was a way to just slurp all those
> interesting kernel intrinsics into the brain, but it just ain't that easy. Forgive me for being
> open ...

I appreciate that and your detailed analysis with code pointers of where
this happens.  Is this easy to recreate?  If so, how?  Can you attach
your kernel config?

Thanks,
Liam

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ