Message-ID: <6839133c-56ca-45dc-8115-2fbcc907938e@kernel.dk>
Date: Wed, 18 Sep 2024 23:20:33 -0600
From: Jens Axboe <axboe@...nel.dk>
To: Matthew Wilcox <willy@...radead.org>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
 Dave Chinner <david@...morbit.com>, Chris Mason <clm@...a.com>,
 Christian Theune <ct@...ingcircus.io>, linux-mm@...ck.org,
 "linux-xfs@...r.kernel.org" <linux-xfs@...r.kernel.org>,
 linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
 Daniel Dao <dqminh@...udflare.com>, regressions@...ts.linux.dev,
 regressions@...mhuis.info
Subject: Re: Known and unfixed active data loss bug in MM + XFS with large
 folios since Dec 2021 (any kernel from 6.1 upwards)

On 9/18/24 10:46 PM, Jens Axboe wrote:
> On 9/18/24 10:36 PM, Matthew Wilcox wrote:
>> On Wed, Sep 18, 2024 at 09:38:41PM -0600, Jens Axboe wrote:
>>> On 9/18/24 9:12 PM, Linus Torvalds wrote:
>>>> On Thu, 19 Sept 2024 at 05:03, Linus Torvalds
>>>> <torvalds@...ux-foundation.org> wrote:
>>>>>
>>>>> I think we should just do the simple one-liner of adding a
>>>>> "xas_reset()" to after doing xas_split_alloc() (or do it inside the
>>>>> xas_split_alloc()).
>>>>
>>>> .. and obviously that should be actually *verified* to fix the issue
>>>> not just with the test-case that Chris and Jens have been using, but
>>>> on Christian's real PostgreSQL load.
>>>>
>>>> Christian?
>>>>
>>>> Note that the xas_reset() needs to be done after the check for errors
>>>> - or like Willy suggested, xas_split_alloc() needs to be re-organized.
>>>>
>>>> So the simplest fix is probably to just add a
>>>>
>>>>                         if (xas_error(&xas))
>>>>                                 goto error;
>>>>                 }
>>>> +               xas_reset(&xas);
>>>>                 xas_lock_irq(&xas);
>>>>                 xas_for_each_conflict(&xas, entry) {
>>>>                         old = entry;
>>>>
>>>> in __filemap_add_folio() in mm/filemap.c
>>>>
>>>> (The above is obviously a whitespace-damaged pseudo-patch for the
>>>> pre-6758c1128ceb state. I don't actually carry a stable tree around on
>>>> my laptop, but I hope it's clear enough what I'm rambling about)
>>>
>>> I kicked off a quick run with this on 6.9 with my debug patch as well,
>>> and it still fails for me... I'll double check everything is sane. For
>>> reference, below is the 6.9 filemap patch.
>>>
>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index 30de18c4fd28..88093e2b7256 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -883,6 +883,7 @@ noinline int __filemap_add_folio(struct address_space *mapping,
>>>  		if (order > folio_order(folio))
>>>  			xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
>>>  					order, gfp);
>>> +		xas_reset(&xas);
>>>  		xas_lock_irq(&xas);
>>>  		xas_for_each_conflict(&xas, entry) {
>>>  			old = entry;
>>
>> My brain is still mushy, but I think there is still a problem (both with
>> the simple fix for 6.9 and indeed with 6.10).
>>
>> For splitting a folio, we have the folio locked, so we know it's not
>> going anywhere.  The tree may get rearranged around it while we don't
>> have the xa_lock, but we're somewhat protected.
>>
>> In this case we're splitting something that was, at one point, a shadow
>> entry.  There's no struct there to lock.  So I think we can have a
>> situation where we replicate 'old' (in 6.10) or xa_load() (in 6.9)
>> into the nodes we allocate in xas_split_alloc().  In 6.10, that's at
>> least guaranteed to be a shadow entry, but in 6.9, it might already be a
>> folio by this point because we've raced with something else also doing a
>> split.
>>
>> Probably xas_split_alloc() needs to just do the alloc, like the name
>> says, and drop the 'entry' argument.  ICBW, but I think it explains
>> what you're seeing?  Maybe it doesn't?
> 
> Since I can hit it pretty reliably and quickly, I'm happy to test
> whatever you want on top of 6.9. From the other email, I backported:
> 
> a4864671ca0b ("lib/xarray: introduce a new helper xas_get_order")
> 6758c1128ceb ("mm/filemap: optimize filemap folio adding")
> 
> to 6.9 and kicked off a test with that 5 min ago, and it's still going.
> I'd say with 90% confidence that it should've hit already, but let's
> leave it churning for an hour and see what pops out the other end.

45 min later, I think I can conclusively call the backport of those two
on top of 6.9 good.

Below is what I'm running, which is those two commits (modulo the test
bits, for clarity). Rather than attempt to fix this differently for 6.9,
perhaps it's not a bad idea to just get those two into stable? It's not a
lot of churn, and at least that keeps it consistent rather than doing
something differently for stable.

I'll try and do a patch that just ensures the order is consistent across
lock cycles as Linus suggested, just to verify that this is indeed the
main issue. Will keep the xas_reset() as well.
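In case it helps review, here's a condensed sketch of the retry scheme
those two commits use in __filemap_add_folio() (pseudocode only, the real
thing is in the diff below):

	for (;;) {
		xas_lock_irq(&xas);
		/* walk the conflicting entries; note the order of a larger one */
		...
		/* nodes allocated on a previous pass are only usable if the
		 * entry and its order are unchanged since we dropped the lock */
		if (alloced_order && (old != alloced_shadow || order != alloced_order))
			xas_destroy(&xas);
		...
		xas_unlock_irq(&xas);

		/* split needed: allocate outside the lock, remember what it
		 * was allocated against, and retry */
		if (split_order) {
			xas_split_alloc(&xas, old, split_order, gfp);
			alloced_shadow = old;
			alloced_order = split_order;
			xas_reset(&xas);
			continue;
		}
		if (!xas_nomem(&xas, gfp))
			break;
	}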

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index cb571dfcf4b1..da2f5bba7944 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -1548,6 +1551,7 @@ void xas_create_range(struct xa_state *);
 
 #ifdef CONFIG_XARRAY_MULTI
 int xa_get_order(struct xarray *, unsigned long index);
+int xas_get_order(struct xa_state *xas);
 void xas_split(struct xa_state *, void *entry, unsigned int order);
 void xas_split_alloc(struct xa_state *, void *entry, unsigned int order, gfp_t);
 #else
@@ -1556,6 +1560,11 @@ static inline int xa_get_order(struct xarray *xa, unsigned long index)
 	return 0;
 }
 
+static inline int xas_get_order(struct xa_state *xas)
+{
+	return 0;
+}
+
 static inline void xas_split(struct xa_state *xas, void *entry,
 		unsigned int order)
 {
diff --git a/lib/xarray.c b/lib/xarray.c
index 5e7d6334d70d..c0514fb16d33 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1765,39 +1780,52 @@ void *xa_store_range(struct xarray *xa, unsigned long first,
 EXPORT_SYMBOL(xa_store_range);
 
 /**
- * xa_get_order() - Get the order of an entry.
- * @xa: XArray.
- * @index: Index of the entry.
+ * xas_get_order() - Get the order of an entry.
+ * @xas: XArray operation state.
+ *
+ * Called after xas_load, the xas should not be in an error state.
  *
  * Return: A number between 0 and 63 indicating the order of the entry.
  */
-int xa_get_order(struct xarray *xa, unsigned long index)
+int xas_get_order(struct xa_state *xas)
 {
-	XA_STATE(xas, xa, index);
-	void *entry;
 	int order = 0;
 
-	rcu_read_lock();
-	entry = xas_load(&xas);
-
-	if (!entry)
-		goto unlock;
-
-	if (!xas.xa_node)
-		goto unlock;
+	if (!xas->xa_node)
+		return 0;
 
 	for (;;) {
-		unsigned int slot = xas.xa_offset + (1 << order);
+		unsigned int slot = xas->xa_offset + (1 << order);
 
 		if (slot >= XA_CHUNK_SIZE)
 			break;
-		if (!xa_is_sibling(xas.xa_node->slots[slot]))
+		if (!xa_is_sibling(xa_entry(xas->xa, xas->xa_node, slot)))
 			break;
 		order++;
 	}
 
-	order += xas.xa_node->shift;
-unlock:
+	order += xas->xa_node->shift;
+	return order;
+}
+EXPORT_SYMBOL_GPL(xas_get_order);
+
+/**
+ * xa_get_order() - Get the order of an entry.
+ * @xa: XArray.
+ * @index: Index of the entry.
+ *
+ * Return: A number between 0 and 63 indicating the order of the entry.
+ */
+int xa_get_order(struct xarray *xa, unsigned long index)
+{
+	XA_STATE(xas, xa, index);
+	int order = 0;
+	void *entry;
+
+	rcu_read_lock();
+	entry = xas_load(&xas);
+	if (entry)
+		order = xas_get_order(&xas);
 	rcu_read_unlock();
 
 	return order;
diff --git a/mm/filemap.c b/mm/filemap.c
index 30de18c4fd28..b8d525825d3f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -852,7 +852,9 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 		struct folio *folio, pgoff_t index, gfp_t gfp, void **shadowp)
 {
 	XA_STATE(xas, &mapping->i_pages, index);
-	bool huge = folio_test_hugetlb(folio);
+	void *alloced_shadow = NULL;
+	int alloced_order = 0;
+	bool huge;
 	bool charged = false;
 	long nr = 1;
 
@@ -869,6 +871,7 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 
 	VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);
 	xas_set_order(&xas, index, folio_order(folio));
+	huge = folio_test_hugetlb(folio);
 	nr = folio_nr_pages(folio);
 
 	gfp &= GFP_RECLAIM_MASK;
@@ -876,13 +879,10 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 	folio->mapping = mapping;
 	folio->index = xas.xa_index;
 
-	do {
-		unsigned int order = xa_get_order(xas.xa, xas.xa_index);
+	for (;;) {
+		int order = -1, split_order = 0;
 		void *entry, *old = NULL;
 
-		if (order > folio_order(folio))
-			xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
-					order, gfp);
 		xas_lock_irq(&xas);
 		xas_for_each_conflict(&xas, entry) {
 			old = entry;
@@ -890,19 +890,33 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 				xas_set_err(&xas, -EEXIST);
 				goto unlock;
 			}
+			/*
+			 * If a larger entry exists,
+			 * it will be the first and only entry iterated.
+			 */
+			if (order == -1)
+				order = xas_get_order(&xas);
+		}
+
+		/* entry may have changed before we re-acquire the lock */
+		if (alloced_order && (old != alloced_shadow || order != alloced_order)) {
+			xas_destroy(&xas);
+			alloced_order = 0;
 		}
 
 		if (old) {
-			if (shadowp)
-				*shadowp = old;
-			/* entry may have been split before we acquired lock */
-			order = xa_get_order(xas.xa, xas.xa_index);
-			if (order > folio_order(folio)) {
+			if (order > 0 && order > folio_order(folio)) {
 				/* How to handle large swap entries? */
 				BUG_ON(shmem_mapping(mapping));
+				if (!alloced_order) {
+					split_order = order;
+					goto unlock;
+				}
 				xas_split(&xas, old, order);
 				xas_reset(&xas);
 			}
+			if (shadowp)
+				*shadowp = old;
 		}
 
 		xas_store(&xas, folio);
@@ -918,9 +932,24 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 				__lruvec_stat_mod_folio(folio,
 						NR_FILE_THPS, nr);
 		}
+
 unlock:
 		xas_unlock_irq(&xas);
-	} while (xas_nomem(&xas, gfp));
+
+		/* split needed, alloc here and retry. */
+		if (split_order) {
+			xas_split_alloc(&xas, old, split_order, gfp);
+			if (xas_error(&xas))
+				goto error;
+			alloced_shadow = old;
+			alloced_order = split_order;
+			xas_reset(&xas);
+			continue;
+		}
+
+		if (!xas_nomem(&xas, gfp))
+			break;
+	}
 
 	if (xas_error(&xas))
 		goto error;

-- 
Jens Axboe
