Message-ID: <c2419d01-5c84-3fb4-189e-4db519d08796@suse.com>
Date: Fri, 26 Jul 2019 18:29:00 +0300
From: Nikolay Borisov <nborisov@...e.com>
To: Chris Mason <clm@...com>
Cc: Kernel Team <Kernel-team@...com>,
"axboe@...nel.dk" <axboe@...nel.dk>, Tejun Heo <tj@...nel.org>,
David Sterba <dsterba@...e.com>, "jack@...e.cz" <jack@...e.cz>,
"josef@...icpanda.com" <josef@...icpanda.com>,
"linux-btrfs@...r.kernel.org" <linux-btrfs@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 3/5] Btrfs: only associate the locked page with one
async_cow struct
On 11.07.19 at 22:52, Chris Mason wrote:
> On 11 Jul 2019, at 12:00, Nikolay Borisov wrote:
>
>> On 10.07.19 at 22:28, Tejun Heo wrote:
>>> From: Chris Mason <clm@...com>
>>>
>>> The btrfs writepages function collects a large range of pages flagged
>>> for delayed allocation, and then sends them down through the COW code
>>> for processing. When compression is on, we allocate one async_cow
>>
>> nit: The code no longer uses async_cow to represent in-flight chunks
>> but
>> the more aptly named async_chunk. Presumably this patchset predates
>> those changes.
>
> Not by much, but yes.
>
>>
>>>
>>> The end result is that cowA is completed and cleaned up before cowB
>>> even starts processing. This means we can free locked_page and reuse
>>> it elsewhere. If we get really lucky, it'll have the same
>>> page->index in its new home as it did before.
>>>
>>> While we're processing cowB, we might decide we need to fall back to
>>> uncompressed IO, and so compress_file_range() will call
>>> __set_page_dirty_nobuffers() on cowB->locked_page.
>>>
>>> Without cgroups in use, this creates a phantom dirty page, which
>>> isn't great but isn't the end of the world. With cgroups in use, we
>>
>> Having a phantom dirty page is not great but not terrible without
>> cgroups. But apart from that, does it have any other implications?
>
> Best case, it'll go through the writepage fixup worker and go through
> the whole cow machinery again. Worst case, we go through this code
> more than once:
>
> 	/*
> 	 * if page_started, cow_file_range inserted an
> 	 * inline extent and took care of all the unlocking
> 	 * and IO for us. Otherwise, we need to submit
> 	 * all those pages down to the drive.
> 	 */
> 	if (!page_started && !ret)
> 		extent_write_locked_range(inode,
> 				async_extent->start,
> 				async_extent->start +
> 				async_extent->ram_size - 1,
> 				WB_SYNC_ALL);
> 	else if (ret)
> 		unlock_page(async_chunk->locked_page);
> else if (ret)
> unlock_page(async_chunk->locked_page);
>
>
> That never happened in production as far as I can tell, but it seems
> possible.
>
>>
>>
>>> might crash in the accounting code because page->mapping->i_wb isn't
>>> set.
>>>
>>> [ 8308.523110] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
>>> [ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
>>> [ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
>>> [ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
>>> [ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
>>> [ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
>>> [ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
>>> [ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
>>> [ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
>>> [ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
>>> [ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
>>> [ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
>>> [ 8308.582520] FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
>>> [ 8308.585440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
>>> [ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [ 8308.594469] Call Trace:
>>> [ 8308.595149] account_page_cleaned+0x15b/0x1f0
>>> [ 8308.596340] __cancel_dirty_page+0x146/0x200
>>> [ 8308.599395] truncate_cleanup_page+0x92/0xb0
>>> [ 8308.600480] truncate_inode_pages_range+0x202/0x7d0
>>> [ 8308.617392] btrfs_evict_inode+0x92/0x5a0
>>> [ 8308.619108] evict+0xc1/0x190
>>> [ 8308.620023] do_unlinkat+0x176/0x280
>>> [ 8308.621202] do_syscall_64+0x63/0x1a0
>>> [ 8308.623451] entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>
>>> The fix here is to make async_cow->locked_page NULL everywhere but the
>>> one async_cow struct that's allowed to do things to the locked page.
>>>
>>> Signed-off-by: Chris Mason <clm@...com>
>>> Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
>>> Reviewed-by: Josef Bacik <josef@...icpanda.com>
>>> ---
>>>  fs/btrfs/extent_io.c |  2 +-
>>>  fs/btrfs/inode.c     | 25 +++++++++++++++++++++----
>>>  2 files changed, 22 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>>> index 5106008f5e28..a31574df06aa 100644
>>> --- a/fs/btrfs/extent_io.c
>>> +++ b/fs/btrfs/extent_io.c
>>> @@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct address_space *mapping,
>>>  		if (page_ops & PAGE_SET_PRIVATE2)
>>>  			SetPagePrivate2(pages[i]);
>>>
>>> -		if (pages[i] == locked_page) {
>>> +		if (locked_page && pages[i] == locked_page) {
>>
>> Why not make the check just if (locked_page) and then clean it up,
>> since if __process_pages_contig is called from the owner of the page
>> then it's guaranteed that the page will fall within its range.
>
> I'm not convinced that every single caller of __process_pages_contig is
> making sure to only send locked_page for ranges that correspond to the
> locked_page. I'm not sure exactly what you're asking for, though; it
> looks like it would require some larger changes to the flow of that
> loop.
What I meant is to simply factor out the code dealing with the locked
page from the loop, while still keeping it inside
__process_pages_contig (a rough sketch follows the call-chain analysis
below). Also, looking at the way locked_page is passed across the
different call chains, I arrive at:
compress_file_range <-- locked page is NULL
    extent_clear_unlock_delalloc
        __process_pages_contig

submit_compressed_extents <-- locked page is NULL
    extent_clear_unlock_delalloc
        __process_pages_contig

btrfs_run_delalloc_range | run_delalloc_nocow
    cow_file_range <-- when called from btrfs_run_delalloc_range we are
    all fine and dandy, because it always iterates a range which
    belongs to the page. So we can free the page and set it to NULL for
    subsequent passes of the loop.

Looking at run_delalloc_nocow, I see the page is used 5 times. 2 of
those, at the beginning and end of the function, are only used during
error cases. Another 2 are when cow_start is different than -1, which
happens if !nocow is true. I've yet to wrap my head around
run_delalloc_nocow, but I think it should also be safe to pass the
locked page just once.

cow_file_range_async <-- always called with the correct locked page; in
this case the function is called before any async chunks are going to
be submitted.
    extent_clear_unlock_delalloc
        __process_pages_contig

btrfs_run_delalloc_range <-- this one is called with locked_page
belonging to the passed delalloc range.
    run_delalloc_nocow
        extent_clear_unlock_delalloc
            __process_pages_contig

writepage_delalloc <-- calls find_lock_delalloc_range only if we aren't
called from the compress path, and the start of the range always
belongs to the page.
    find_lock_delalloc_range <-- if the range is not delalloc it will
    retry, but that function is also called with the correct page.
        lock_delalloc_pages <-- skips the locked page itself
        __unlock_for_delalloc <-- skips the locked page itself
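
To make the factoring-out suggestion concrete, here is a rough,
untested sketch of what I have in mind (the helper name is mine):
handle the locked page once, outside the gang-lookup loop, so the loop
itself only ever has to skip it:

	/*
	 * All of the work on the locked page happens here, once,
	 * instead of on every iteration of the loop in
	 * __process_pages_contig.
	 */
	static void process_locked_page(struct page *locked_page,
					unsigned long page_ops)
	{
		if (page_ops & PAGE_SET_PRIVATE2)
			SetPagePrivate2(locked_page);
		/*
		 * Deliberately no PAGE_UNLOCK handling here - the
		 * owner of locked_page is responsible for unlocking
		 * it, which matches what the current loop does.
		 */
	}

The loop body in __process_pages_contig then shrinks to a plain
"pages[i] == locked_page -> put_page(pages[i]); continue;" test, and
the NULL check lives in exactly one place.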
<snip>