Message-Id: <87tts63d3w.fsf@doe.com>
Date: Thu, 07 Sep 2023 08:26:35 +0530
From: Ritesh Harjani (IBM) <ritesh.list@...il.com>
To: Matthew Wilcox <willy@...radead.org>
Cc: Theodore Ts'o <tytso@....edu>, Zorro Lang <zlang@...nel.org>,
linux-ext4@...r.kernel.org, fstests@...r.kernel.org,
regressions@...ts.linux.dev,
Andrew Morton <akpm@...ux-foundation.org>,
Jan Kara <jack@...e.cz>
Subject: Re: [fstests generic/388, 455, 475, 482 ...] Ext4 journal recovery test fails
Matthew Wilcox <willy@...radead.org> writes:
> On Wed, Sep 06, 2023 at 01:38:23PM +0100, Matthew Wilcox wrote:
>> > Is this code path a possibility which could cause the above logs?
>> >
>> > ptr = jbd2_alloc() -> kmem_cache_alloc()
>> > <..>
>> > new_folio = virt_to_folio(ptr)
>> > new_offset = offset_in_folio(new_folio, ptr)
>> >
>> > And then I am still not sure what the problem really is.
>> > Is it because, at the time of checkpointing, the path is still not
>> > fully converted to folios?
>>
>> Oh yikes! I didn't know that the allocation might come from kmalloc!
>> Yes, slab might use high-order allocations. I'll have to look through
>> this and figure out what the problem might be.
>
> I think the probable cause is bh_offset(). Before these patches, if
> we allocated a buffer at offset 9kB into an order-2 slab, we'd fill in
> b_page with the third page of the slab and calculate bh_offset as 1kB.
> With these patches, we set b_page to the first page of the slab, and
> bh_offset still comes back as 1kB so we read from / write to entirely
> the wrong place.
>
> With this redefinition of bh_offset(), we calculate the offset relative
> to the base page if it's a tail page, and relative to the folio if it's
> a folio. Works out nicely ;-)
Thanks, Matthew, for explaining the problem clearly.
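If I follow: with 4K pages, an order-2 slab folio is 16K, and for a buffer
9kB into it the old code paired the *third* page with an offset of
9k & ~PAGE_MASK == 1k, which together addressed the right bytes. Now that
b_page is the head page, the offset has to be 9k. A quick user-space
illustration of just the masking (not kernel code; 4K/16K sizes assumed):

	#include <stdio.h>

	int main(void)
	{
		const unsigned long page_size  = 4096;		/* one 4K page */
		const unsigned long folio_size = 4 * 4096;	/* order-2 slab folio */
		unsigned long off = 9 * 1024;			/* b_data is 9kB into the slab */

		/* old bh_offset()/offset_in_page(): offset within a single page */
		printf("old: %lu\n", off & (page_size - 1));	/* 1024 */
		/* new bh_offset(): offset within the whole folio */
		printf("new: %lu\n", off & (folio_size - 1));	/* 9216 */
		return 0;
	}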
>
> I have three other things I'm trying to debug right now, so this isn't
> tested, but if you have time you might want to give it a run.
Sure, I gave it a try.
>
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 6cb3e9af78c9..dc8fcdc40e95 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -173,7 +173,10 @@ static __always_inline int buffer_uptodate(const struct buffer_head *bh)
>  	return test_bit_acquire(BH_Uptodate, &bh->b_state);
>  }
>  
> -#define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
> +static inline unsigned long bh_offset(struct buffer_head *bh)
> +{
> +	return (unsigned long)(bh)->b_data & (page_size(bh->b_page) - 1);
> +}
>  
>  /* If we *know* page->private refers to buffer_heads */
>  #define page_buffers(page)					\
I used "const" for bh to avoid warnings from fs/nilfs2/alloc.c:
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 4ede47649a81..b61fa79cb7f5 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -171,7 +171,10 @@ static __always_inline int buffer_uptodate(const struct buffer_head *bh)
 	return test_bit_acquire(BH_Uptodate, &bh->b_state);
 }
 
-#define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
+static inline unsigned long bh_offset(const struct buffer_head *bh)
+{
+	return (unsigned long)(bh)->b_data & (page_size(bh->b_page) - 1);
+}
 
 /* If we *know* page->private refers to buffer_heads */
 #define page_buffers(page)					\
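(For reference, the warning the const avoids looks like this in an
artificial sketch, where a caller only holds a const pointer:

	unsigned long f(const struct buffer_head *bh)
	{
		/* with a non-const bh_offset(), the compiler warns that
		 * the 'const' qualifier is being discarded here */
		return bh_offset(bh);
	}
)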
But this change alone was still giving me failures. Looking into the
users of b_data, I found that jbd2 uses offset_in_page() in a few places
where it should now use bh_offset(). So I made the below changes in
fs/jbd2 to replace offset_in_page() with bh_offset()...
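(For reference, offset_in_page() in include/linux/mm.h is

	#define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)

i.e. exactly the masking the old bh_offset() macro did, so these call
sites go wrong the same way once b_page points at the head page.)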
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 1073259902a6..0c25640714ac 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -304,7 +304,7 @@ static __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh)
 
 	addr = kmap_atomic(page);
 	checksum = crc32_be(crc32_sum,
-		(void *)(addr + offset_in_page(bh->b_data)), bh->b_size);
+		(void *)(addr + bh_offset(bh)), bh->b_size);
 	kunmap_atomic(addr);
 
 	return checksum;
@@ -333,7 +333,7 @@ static void jbd2_block_tag_csum_set(journal_t *j, journal_block_tag_t *tag,
 
 	seq = cpu_to_be32(sequence);
 	addr = kmap_atomic(page);
 	csum32 = jbd2_chksum(j, j->j_csum_seed, (__u8 *)&seq, sizeof(seq));
-	csum32 = jbd2_chksum(j, csum32, addr + offset_in_page(bh->b_data),
+	csum32 = jbd2_chksum(j, csum32, addr + bh_offset(bh),
 			     bh->b_size);
 	kunmap_atomic(addr);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 4d1fda1f7143..2ac57f7a242d 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -942,7 +942,7 @@ static void jbd2_freeze_jh_data(struct journal_head *jh)
 
 	J_EXPECT_JH(jh, buffer_uptodate(bh), "Possible IO failure.\n");
 	page = bh->b_page;
-	offset = offset_in_page(bh->b_data);
+	offset = bh_offset(bh);
 	source = kmap_atomic(page);
 	/* Fire data frozen trigger just before we copy the data */
 	jbd2_buffer_frozen_trigger(jh, source + offset, jh->b_triggers);
With all of the above diffs, here are the results:
ext4/1k: 15 tests, 1 failures, 1709 seconds
  generic/455  Pass     43s
  generic/475  Pass    128s
  generic/482  Pass    183s
  generic/455  Pass     43s
  generic/475  Pass    134s
  generic/482  Pass    191s
  generic/455  Pass     41s
  generic/475  Pass    139s
  generic/482  Pass    135s
  generic/455  Pass     46s
  generic/475  Pass    132s
  generic/482  Pass    146s
  generic/455  Pass     47s
  generic/475  Failed  145s
  generic/482  Pass    156s
Totals: 15 tests, 0 skipped, 1 failures, 0 errors, 1709s
I guess the above failure (generic/475) could be due to the flaky
behaviour of that test which Ted was mentioning.
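FWIW, a quick search for the remaining users of this pattern, something
along the lines of

	$ grep -rn "offset_in_page(.*->b_data" fs/

also turns up reiserfs.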
So, while we are at it, I think we should make the same change in
reiserfs too, from offset_in_page() to bh_offset():
diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index 015bfe4e4524..23411ec163d4 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -4217,7 +4217,7 @@ static int do_journal_end(struct reiserfs_transaction_handle *th, int flags)
 			page = cn->bh->b_page;
 			addr = kmap(page);
 			memcpy(tmp_bh->b_data,
-			       addr + offset_in_page(cn->bh->b_data),
+			       addr + bh_offset(cn->bh),
 			       cn->bh->b_size);
 			kunmap(page);
 			mark_buffer_dirty(tmp_bh);
I will also run the "auto" group with ext4/1k with all of the above
changes, and will update the results once that is done.
-ritesh