lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5d908c60-88fb-49c0-a150-bfb468eaa513@alu.unizg.hr>
Date:   Sat, 23 Sep 2023 09:26:17 +0200
From:   Mirsad Todorovac <mirsad.todorovac@....unizg.hr>
To:     "Liam R. Howlett" <Liam.Howlett@...cle.com>,
        maple-tree@...ts.infradead.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: BUG: maple_tree: KCSAN: data-race in mas_topiary_replace /
 mtree_range_walk

On 9/22/23 15:51, Liam R. Howlett wrote:
> * Mirsad Todorovac <mirsad.todorovac@....unizg.hr> [230921 17:01]:
>> Hi,
>>
> 
> Hello!

Hi again,

>> On the 6.6-rc2 vanilla torvalds tree kernel, KCSAN had reported a number of data-races:
>>
>> [ 6413.367326] ==================================================================
>> [ 6413.367349] BUG: KCSAN: data-race in mas_topiary_replace / mtree_range_walk
>>
>> [ 6413.367375] write to 0xffff8883a0c5db00 of 8 bytes by task 5475 on cpu 24:
>> [ 6413.367386] mas_topiary_replace (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:491 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:1701 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2590)
>> [ 6413.367399] mas_spanning_rebalance.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2664 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2961)
>> [ 6413.367413] mas_wr_spanning_store.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:3894)
>> [ 6413.367428] mas_wr_store_entry.isra.0 (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4242)
>> [ 6413.367442] mas_store_prealloc (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:256 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:5408)
>> [ 6413.367455] vma_merge (/home/marvin/linux/kernel/torvalds2/mm/internal.h:1127 /home/marvin/linux/kernel/torvalds2/mm/mmap.c:1005)
>> [ 6413.367466] mprotect_fixup (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:632)
>> [ 6413.367480] do_mprotect_pkey (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:819)
>> [ 6413.367494] __x64_sys_mprotect (/home/marvin/linux/kernel/torvalds2/mm/mprotect.c:837)
>> [ 6413.367503] do_syscall_64 (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:50 /home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:80)
>> [ 6413.367517] entry_SYSCALL_64_after_hwframe (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/entry_64.S:120)
>>
>> [ 6413.367534] read to 0xffff8883a0c5db00 of 8 bytes by task 5558 on cpu 11:
>> [ 6413.367545] mtree_range_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:539 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:2789)
>> [ 6413.367556] mas_walk (/home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:251 /home/marvin/linux/kernel/torvalds2/lib/maple_tree.c:4844)
>> [ 6413.367567] lock_vma_under_rcu (/home/marvin/linux/kernel/torvalds2/mm/memory.c:5436)
>> [ 6413.367579] do_user_addr_fault (/home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1357)
>> [ 6413.367593] exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/paravirt.h:695 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1513 /home/marvin/linux/kernel/torvalds2/arch/x86/mm/fault.c:1561)
>> [ 6413.367602] asm_exc_page_fault (/home/marvin/linux/kernel/torvalds2/./arch/x86/include/asm/idtentry.h:570)
>>
>> [ 6413.367617] value changed: 0xffff888341d43116 -> 0xffff888340f92116
>>
>> [ 6413.367632] Reported by Kernel Concurrency Sanitizer on:
>> [ 6413.367640] CPU: 11 PID: 5558 Comm: ThreadPoolForeg Tainted: G             L     6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
>> [ 6413.367653] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
>> [ 6413.367660] ==================================================================
>>
>> For your convenience, took the trouble of finding those suspicious lines of code:
>>
>> The write side:
>>
>> lib/maple_tree.c:491
>> --------------------
>>    456 /*
>>    457  * mas_set_parent() - Set the parent node and encode the slot
>>    458  * @enode: The encoded maple node.
>>    459  * @parent: The encoded maple node that is the parent of @enode.
>>    460  * @slot: The slot that @enode resides in @parent.
>>    461  *
>>    462  * Slot number is encoded in the enode->parent bit 3-6 or 2-6, depending on the
>>    463  * parent type.
>>    464  */
>>    465 static inline
>>    466 void mas_set_parent(struct ma_state *mas, struct maple_enode *enode,
>>    467                     const struct maple_enode *parent, unsigned char slot)
>>    468 {
>>    469         unsigned long val = (unsigned long)parent;
>>    470         unsigned long shift;
>>    471         unsigned long type;
>>    472         enum maple_type p_type = mte_node_type(parent);
>>    473
>>    474         MAS_BUG_ON(mas, p_type == maple_dense);
>>    475         MAS_BUG_ON(mas, p_type == maple_leaf_64);
>>    476
>>    477         switch (p_type) {
>>    478         case maple_range_64:
>>    479         case maple_arange_64:
>>    480                 shift = MAPLE_PARENT_SLOT_SHIFT;
>>    481                 type = MAPLE_PARENT_RANGE64;
>>    482                 break;
>>    483         default:
>>    484         case maple_dense:
>>    485         case maple_leaf_64:
>>    486                 shift = type = 0;
>>    487                 break;
>>    488         }
>>    489
>>    490         val &= ~MAPLE_NODE_MASK; /* Clear all node metadata in parent */
>> → 491         val |= (slot << shift) | type;
>>    492         mte_to_node(enode)->parent = ma_parent_ptr(val);
>>    493 }
>>
>> lib/maple_tree.c:1701
>> ---------------------
>>    1682 /*
>>    1683  * mas_adopt_children() - Set the parent pointer of all nodes in @parent to
>>    1684  * @parent with the slot encoded.
>>    1685  * @mas - the maple state (for the tree)
>>    1686  * @parent - the maple encoded node containing the children.
>>    1687  */
>>    1688 static inline void mas_adopt_children(struct ma_state *mas,
>>    1689                 struct maple_enode *parent)
>>    1690 {
>>    1691         enum maple_type type = mte_node_type(parent);
>>    1692         struct maple_node *node = mte_to_node(parent);
>>    1693         void __rcu **slots = ma_slots(node, type);
>>    1694         unsigned long *pivots = ma_pivots(node, type);
>>    1695         struct maple_enode *child;
>>    1696         unsigned char offset;
>>    1697
>>    1698         offset = ma_data_end(node, type, pivots, mas->max);
>>    1699         do {
>>    1700                 child = mas_slot_locked(mas, slots, offset);
>> → 1701                 mas_set_parent(mas, child, parent, offset);
>>    1702         } while (offset--);
>>    1703 }
>>
>>    2562 static inline void mas_topiary_replace(struct ma_state *mas,
>>    2563                 struct maple_enode *old_enode)
>>    2564 {
>>    2565         struct ma_state tmp[3], tmp_next[3];
>>    2566         MA_TOPIARY(subtrees, mas->tree);
>>    2567         bool in_rcu;
>>    2568         int i, n;
>>    2569
>>    2570         /* Place data in tree & then mark node as old */
>>    2571         mas_put_in_tree(mas, old_enode);
>>    2572
>>    2573         /* Update the parent pointers in the tree */
>>    2574         tmp[0] = *mas;
>>    2575         tmp[0].offset = 0;
>>    2576         tmp[1].node = MAS_NONE;
>>    2577         tmp[2].node = MAS_NONE;
>>    2578         while (!mte_is_leaf(tmp[0].node)) {
>>    2579                 n = 0;
>>    2580                 for (i = 0; i < 3; i++) {
>>    2581                         if (mas_is_none(&tmp[i]))
>>    2582                                 continue;
>>    2583
>>    2584                         while (n < 3) {
>>    2585                                 if (!mas_find_child(&tmp[i], &tmp_next[n]))
>>    2586                                         break;
>>    2587                                 n++;
>>    2588                         }
>>    2589
>> → 2590                        mas_adopt_children(&tmp[i], tmp[i].node);
>>    2591                 }
>>    2592
>>    2593                 if (MAS_WARN_ON(mas, n == 0))
>>    2594                         break;
>>    2595
>>    2596                 while (n < 3)
>>    2597                         tmp_next[n++].node = MAS_NONE;
>>    2598
>>    2599                 for (i = 0; i < 3; i++)
>>    2600                         tmp[i] = tmp_next[i];
>>    2601         }
>>    2602
>>    2603         /* Collect the old nodes that need to be discarded */
>>    2604         if (mte_is_leaf(old_enode))
>>    2605                 return mas_free(mas, old_enode);
>>    2606
>>    2607         tmp[0] = *mas;
>>    2608         tmp[0].offset = 0;
>>    2609         tmp[0].node = old_enode;
>>    2610         tmp[1].node = MAS_NONE;
>>    2611         tmp[2].node = MAS_NONE;
>>    2612         in_rcu = mt_in_rcu(mas->tree);
>>    2613         do {
>>    2614                 n = 0;
>>    2615                 for (i = 0; i < 3; i++) {
>>    2616                         if (mas_is_none(&tmp[i]))
>>    2617                                 continue;
>>
>> The read side:
>>
>>     527 /*
>>     528  * ma_dead_node() - check if the @enode is dead.
>>     529  * @enode: The encoded maple node
>>     530  *
>>     531  * Return: true if dead, false otherwise.
>>     532  */
>>     533 static inline bool ma_dead_node(const struct maple_node *node)
>>     534 {
>>     535         struct maple_node *parent;
>>     536
>>     537         /* Do not reorder reads from the node prior to the parent check */
>>     538         smp_rmb();
>> →  539         parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
>>     540         return (parent == node);
>>     541 }
>>
>>    2767 static inline void *mtree_range_walk(struct ma_state *mas)
>>    2768 {
>>    2769         unsigned long *pivots;
>>    2770         unsigned char offset;
>>    2771         struct maple_node *node;
>>    2772         struct maple_enode *next, *last;
>>    2773         enum maple_type type;
>>    2774         void __rcu **slots;
>>    2775         unsigned char end;
>>    2776         unsigned long max, min;
>>    2777         unsigned long prev_max, prev_min;
>>    2778
>>    2779         next = mas->node;
>>    2780         min = mas->min;
>>    2781         max = mas->max;
>>    2782         do {
>>    2783                 offset = 0;
>>    2784                 last = next;
>>    2785                 node = mte_to_node(next);
>>    2786                 type = mte_node_type(next);
>>    2787                 pivots = ma_pivots(node, type);
>>    2788                 end = ma_data_end(node, type, pivots, max);
>> → 2789                 if (unlikely(ma_dead_node(node)))
>>    2790                         goto dead_node;
>>    2791
>>    2792                 if (pivots[offset] >= mas->index) {
>>    2793                         prev_max = max;
>>    2794                         prev_min = min;
>>    2795                         max = pivots[offset];
>>    2796                         goto next;
>>    2797                 }
>>    2798
>>    2799                 do {
>>    2800                         offset++;
>>    2801                 } while ((offset < end) && (pivots[offset] < mas->index));
>>
>> As it is evident, ma_dead_node() expands to:
>>
>>     527 /*
>>     528  * ma_dead_node() - check if the @enode is dead.
>>     529  * @enode: The encoded maple node
>>     530  *
>>     531  * Return: true if dead, false otherwise.
>>     532  */
>>     533 static inline bool ma_dead_node(const struct maple_node *node)
>>     534 {
>>     535         struct maple_node *parent;
>>     536
>>     537         /* Do not reorder reads from the node prior to the parent check */
>>     538         smp_rmb();
>> →  539         parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
>>     540         return (parent == node);
>>     541 }
>>
>> as above, but the smb_rmb() protection is here, so the KCSAN warning should be double-checked for
>> validity.
>>
>> But I do not really understand maple trees to their depth, I am only reporting the symptomatic
>> outlook of the reported data-race.
> 
> This is the most complex part of the maple tree, and unique to the
> ability to store a range over multiple existing entries.  I recently
> rewrote this particular section to avoid a potential live-lock issue.

I see.

> I am confident that the parent pointer will not be set to the node
> pointer as checked by ma_dead_node().  But in an abundance of caution
> and to resolve this potential false-positive, I am looking at this in
> more detail.
> 
> This race is to see if the walk down the tree into unchanged data will
> be seen before it is placed in the new subtree with its data also
> unchanged.  Since we know the parent can never be the node, I feel this
> is safe - but I'm not willing to live with the complaints from kasan.

I cannot analyse a couple of thousand lines at such a short notice.

It looks suspicious because val in line 491 in a local variable and thus
isn't available elsewhere.

Maybe the compiler (gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0) did something
to the code?

>> This is all-in-all a very interested subject, and I wish there was a way to just slurp all those
>> interesting kernel intrinsics into the brain, but it just ain't that easy. Forgive me for being
>> open ...
> 
> I appreciate that and your detailed analysis with code pointers of where
> this happens.  Is this easy to recreate?  If so, how?  Can you attach
> your kernel config?

Got that attached first, so I do not forget. :-/

Recreate? Actually it happened quite a number of times on my configuration (480+?).

(Please NOTE: the log is stack decoded, but not deduplicated.)

Apparently, everything and anything uses page allocation, so maybe the config
will help you recreate.

> Thanks,
> Liam

Best regards,
Mirsad
Download attachment "config-6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0.xz" of type "application/x-xz" (58472 bytes)

Download attachment "mtree-kcsan-01-decoded.log.xz" of type "application/x-xz" (84452 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ