lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <016537ec-538a-4acf-9dc8-c7ba416c12dc@alu.unizg.hr>
Date:   Sat, 23 Sep 2023 17:30:59 +0200
From:   Mirsad Todorovac <mirsad.todorovac@....unizg.hr>
To:     linux-btrfs@...r.kernel.org
Cc:     Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>,
        David Sterba <dsterba@...e.com>, linux-kernel@...r.kernel.org
Subject: BUG: btrfs: KCSAN: data-race in btrfs_block_rsv_release [btrfs] /
 btrfs_preempt_reclaim_metadata_space [btrfs]

Hi again,

Adding to the previous bug report, on the vanilla torvalds tree 6.6-rc2 kernel on Ubuntu 22.04 LTS,
there is an additional bug report from KCSAN (please find the config attached):

[146315.465137] ==================================================================
[146315.465747] BUG: KCSAN: data-race in btrfs_block_rsv_release [btrfs] / btrfs_preempt_reclaim_metadata_space [btrfs]

[146315.466957] write to 0xffff888125878128 of 8 bytes by task 40555 on cpu 1:
[146315.466968] btrfs_block_rsv_release (/home/marvin/linux/kernel/torvalds2/fs/btrfs/block-rsv.c:122 /home/marvin/linux/kernel/torvalds2/fs/btrfs/block-rsv.c:293) btrfs
[146315.467591] btrfs_trans_release_metadata (/home/marvin/linux/kernel/torvalds2/fs/btrfs/transaction.c:1005) btrfs
[146315.468191] __btrfs_end_transaction (/home/marvin/linux/kernel/torvalds2/fs/btrfs/transaction.c:1022) btrfs
[146315.468792] btrfs_end_transaction (/home/marvin/linux/kernel/torvalds2/fs/btrfs/transaction.c:1064) btrfs
[146315.469397] btrfs_evict_inode (/home/marvin/linux/kernel/torvalds2/fs/btrfs/inode.c:5274) btrfs
[146315.470000] evict (/home/marvin/linux/kernel/torvalds2/fs/inode.c:664)
[146315.470010] iput.part.0 (/home/marvin/linux/kernel/torvalds2/fs/inode.c:1803)
[146315.470019] iput (/home/marvin/linux/kernel/torvalds2/fs/inode.c:1803 (discriminator 2))
[146315.470027] do_unlinkat (/home/marvin/linux/kernel/torvalds2/fs/namei.c:4407)
[146315.470038] __x64_sys_unlink (/home/marvin/linux/kernel/torvalds2/fs/namei.c:4444)
[146315.470048] do_syscall_64 (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:50 /home/marvin/linux/kernel/torvalds2/arch/x86/entry/common.c:80)
[146315.470060] entry_SYSCALL_64_after_hwframe (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/entry_64.S:120)

[146315.470077] read to 0xffff888125878128 of 8 bytes by task 259386 on cpu 18:
[146315.470088] btrfs_preempt_reclaim_metadata_space (/home/marvin/linux/kernel/torvalds2/fs/btrfs/space-info.c:1167) btrfs
[146315.470698] process_one_work (/home/marvin/linux/kernel/torvalds2/kernel/workqueue.c:2630)
[146315.470710] worker_thread (/home/marvin/linux/kernel/torvalds2/kernel/workqueue.c:2697 /home/marvin/linux/kernel/torvalds2/kernel/workqueue.c:2784)
[146315.470720] kthread (/home/marvin/linux/kernel/torvalds2/kernel/kthread.c:388)
[146315.470729] ret_from_fork (/home/marvin/linux/kernel/torvalds2/arch/x86/kernel/process.c:147)
[146315.470741] ret_from_fork_asm (/home/marvin/linux/kernel/torvalds2/arch/x86/entry/entry_64.S:312)

[146315.470756] value changed: 0x0000000000080000 -> 0x0000000000000000

[146315.470770] Reported by Kernel Concurrency Sanitizer on:
[146315.470778] CPU: 18 PID: 259386 Comm: kworker/u65:12 Tainted: G             L     6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0 #41
[146315.470791] Hardware name: ASRock X670E PG Lightning/X670E PG Lightning, BIOS 1.21 04/26/2023
[146315.470798] Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space [btrfs]
[146315.471413] ==================================================================

<SMALLTALK>
Note that the first report with the btrfs_calculate_inode_block_rsv_size [btrfs] / btrfs_use_block_rsv [btrfs]
"EXPERIMENTAL PATCH" thread addressed the issue of the read section reading various parts of the block_rsv
which was not done atomically, so I think there could be a real data-race if a part of the structure is changed
by another thread while reading the structure members and making the execution path decisions based on their value.

I hate locks, but this seems like a place for a read lock to ensure the integrity of data while processing
the structure members. Unless locking was taken somewhere previously before calling all those functions
in the stacktrace ... But I don't have that deep insight into the btrfs module (150 K lines).

Of course, I did not run the kernel build w/o your review, even when the logic seems OK.

(But the patch was fixing the symptom and not the cause, so I realise that adding such would make the module
bloated and unstable ... However, somebody wise said that "more eyes see more" ...).
</SMALLTALK>

However, encouraged by your care for btrfs file system integrity I have decided to submit the
3rd bug report.

This time, and for your convenience, I took the liberty of just reviewing the KCSAN-suspected lines:

The write side:

fs/btrfs/block-rsv.c
====================
   105 static u64 block_rsv_release_bytes(struct btrfs_fs_info *fs_info,
   106                                     struct btrfs_block_rsv *block_rsv,
   107                                     struct btrfs_block_rsv *dest, u64 num_bytes,
   108                                     u64 *qgroup_to_release_ret)
   109 {
   110         struct btrfs_space_info *space_info = block_rsv->space_info;
   111         u64 qgroup_to_release = 0;
   112         u64 ret;
   113
   114         spin_lock(&block_rsv->lock);
   115         if (num_bytes == (u64)-1) {
   116                 num_bytes = block_rsv->size;
   117                 qgroup_to_release = block_rsv->qgroup_rsv_size;
   118         }
   119         block_rsv->size -= num_bytes;
   120         if (block_rsv->reserved >= block_rsv->size) {
   121                 num_bytes = block_rsv->reserved - block_rsv->size;
→ 122                 block_rsv->reserved = block_rsv->size;
   123                 block_rsv->full = true;
   124         } else {
   125                 num_bytes = 0;
   126         }
   127         if (qgroup_to_release_ret &&
   128             block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) {
   129                 qgroup_to_release = block_rsv->qgroup_rsv_reserved -
   130                                     block_rsv->qgroup_rsv_size;
   131                 block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size;
   132         } else {
   133                 qgroup_to_release = 0;
   134         }
   135         spin_unlock(&block_rsv->lock);
   136
   137         ret = num_bytes;
   138         if (num_bytes > 0) {
   139                 if (dest) {
   140                         spin_lock(&dest->lock);
   141                         if (!dest->full) {
   142                                 u64 bytes_to_add;
   143
   144                                 bytes_to_add = dest->size - dest->reserved;
   145                                 bytes_to_add = min(num_bytes, bytes_to_add);
   146                                 dest->reserved += bytes_to_add;
   147                                 if (dest->reserved >= dest->size)
   148                                         dest->full = true;
   149                                 num_bytes -= bytes_to_add;
   150                         }
   151                         spin_unlock(&dest->lock);
   152                 }
   153                 if (num_bytes)
   154                         btrfs_space_info_free_bytes_may_use(fs_info,
   155                                                             space_info,
   156                                                             num_bytes);
   157         }
   158         if (qgroup_to_release_ret)
   159                 *qgroup_to_release_ret = qgroup_to_release;
   160         return ret;
   161 }

The read side:

fs/btrfs/space-info.c
=====================
   1133 static void btrfs_preempt_reclaim_metadata_space(struct work_struct *work)
   1134 {
   1135         struct btrfs_fs_info *fs_info;
   1136         struct btrfs_space_info *space_info;
   1137         struct btrfs_block_rsv *delayed_block_rsv;
   1138         struct btrfs_block_rsv *delayed_refs_rsv;
   1139         struct btrfs_block_rsv *global_rsv;
   1140         struct btrfs_block_rsv *trans_rsv;
   1141         int loops = 0;
   1142
   1143         fs_info = container_of(work, struct btrfs_fs_info,
   1144                                preempt_reclaim_work);
   1145         space_info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
   1146         delayed_block_rsv = &fs_info->delayed_block_rsv;
   1147         delayed_refs_rsv = &fs_info->delayed_refs_rsv;
   1148         global_rsv = &fs_info->global_block_rsv;
   1149         trans_rsv = &fs_info->trans_block_rsv;
   1150
   1151         spin_lock(&space_info->lock);
   1152         while (need_preemptive_reclaim(fs_info, space_info)) {
   1153                 enum btrfs_flush_state flush;
   1154                 u64 delalloc_size = 0;
   1155                 u64 to_reclaim, block_rsv_size;
   1156                 u64 global_rsv_size = global_rsv->reserved;
   1157
   1158                 loops++;
   1159
   1160                 /*
   1161                  * We don't have a precise counter for the metadata being
   1162                  * reserved for delalloc, so we'll approximate it by subtracting
   1163                  * out the block rsv's space from the bytes_may_use.  If that
   1164                  * amount is higher than the individual reserves, then we can
   1165                  * assume it's tied up in delalloc reservations.
   1166                  */
→ 1167                 block_rsv_size = global_rsv_size +
→ 1168                         delayed_block_rsv->reserved +
→ 1169                         delayed_refs_rsv->reserved +
→ 1170                         trans_rsv->reserved;
   1171                 if (block_rsv_size < space_info->bytes_may_use)
   1172                         delalloc_size = space_info->bytes_may_use - block_rsv_size;
   1173
   1174                 /*
   1175                  * We don't want to include the global_rsv in our calculation,
   1176                  * because that's space we can't touch.  Subtract it from the
   1177                  * block_rsv_size for the next checks.
   1178                  */
   1179                 block_rsv_size -= global_rsv_size;
   1180
   1181                 /*
   1182                  * We really want to avoid flushing delalloc too much, as it
   1183                  * could result in poor allocation patterns, so only flush it if
   1184                  * it's larger than the rest of the pools combined.
   1185                  */
   1186                 if (delalloc_size > block_rsv_size) {
   1187                         to_reclaim = delalloc_size;
   1188                         flush = FLUSH_DELALLOC;
   1189                 } else if (space_info->bytes_pinned >
   1190                            (delayed_block_rsv->reserved +
   1191                             delayed_refs_rsv->reserved)) {
   1192                         to_reclaim = space_info->bytes_pinned;
   1193                         flush = COMMIT_TRANS;
   1194                 } else if (delayed_block_rsv->reserved >
   1195                            delayed_refs_rsv->reserved) {
   1196                         to_reclaim = delayed_block_rsv->reserved;
   1197                         flush = FLUSH_DELAYED_ITEMS_NR;
   1198                 } else {
   1199                         to_reclaim = delayed_refs_rsv->reserved;
   1200                         flush = FLUSH_DELAYED_REFS_NR;
   1201                 }
   1202
   1203                 spin_unlock(&space_info->lock);
   1204
   1205                 /*
   1206                  * We don't want to reclaim everything, just a portion, so scale
   1207                  * down the to_reclaim by 1/4.  If it takes us down to 0,
   1208                  * reclaim 1 items worth.
   1209                  */
   1210                 to_reclaim >>= 2;
   1211                 if (!to_reclaim)
   1212                         to_reclaim = btrfs_calc_insert_metadata_size(fs_info, 1);
   1213                 flush_space(fs_info, space_info, to_reclaim, flush, true);
   1214                 cond_resched();
   1215                 spin_lock(&space_info->lock);
   1216         }
   1217
   1218         /* We only went through once, back off our clamping. */
   1219         if (loops == 1 && !space_info->reclaim_size)
   1220                 space_info->clamp = max(1, space_info->clamp - 1);
   1221         trace_btrfs_done_preemptive_reclaim(fs_info, space_info);
   1222         spin_unlock(&space_info->lock);
   1223 }

It beats me ... if all the locks were acquired, where did data-race occur?

btrfs_preempt_reclaim_metadata_space() was executed from the work queue, so it is obviously asynchronous ...

Somewhat innocuous, in btrfs_preempt_reclaim_metadata_space() called from the work queue, space_info->lock
is held, while in the path of:

write to 0xffff888125878128 of 8 bytes by task 40555 on cpu 1:
btrfs_block_rsv_release (fs/btrfs/block-rsv.c:122 /home/marvin/linux/kernel/torvalds2/fs/btrfs/block-rsv.c:293) btrfs
btrfs_trans_release_metadata (fs/btrfs/transaction.c:1005) btrfs
__btrfs_end_transaction (fs/btrfs/transaction.c:1022) btrfs
btrfs_end_transaction (fs/btrfs/transaction.c:1064) btrfs
btrfs_evict_inode (fs/btrfs/inode.c:5274) btrfs
evict (fs/inode.c:664)

This lock isn't being acquired:

$ grep 'space_info->lock' -r fs/btrfs/block-rsv.c fs/btrfs/transaction.c fs/btrfs/inode.c fs/inode.c
$

fs/btrfs/block-rsv.c: block_rsv_release_bytes() acquired block_rsv->lock, but not space_info->lock.

Now Vulcan logic says that there could be concurrency there while processing:

→ 1167                 block_rsv_size = global_rsv_size +
→ 1168                         delayed_block_rsv->reserved +
→ 1169                         delayed_refs_rsv->reserved +
→ 1170                         trans_rsv->reserved;

but I could also be missing something obvious.

I find the btrfs code so challenging, and especially I am anticipating the RCU part.

Forgive me this rant, but filesystems are so interesting and such a great exercise for my little grey cells ;-)

Best regards,
Mirsad Todorovac
Download attachment "config-6.6.0-rc2-kcsan-00143-gb5cbe7c00aa0.xz" of type "application/x-xz" (58472 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ