linux-kernel - Re: [PATCH btrfs/for-next] btrfs: fix fatal extent

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAL3q7H7-WjkgSm5Hayj9nrSRHzTPn-_m_oz=jtiDkSXJYshpCg@mail.gmail.com>
Date:   Wed, 17 Jun 2020 18:36:16 +0100
From:   Filipe Manana <fdmanana@...il.com>
To:     Boris Burkov <boris@....io>
Cc:     Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>,
        David Sterba <dsterba@...e.com>,
        linux-btrfs <linux-btrfs@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        kernel-team@...com
Subject: Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead
 vs releasepage race

On Wed, Jun 17, 2020 at 6:20 PM Filipe Manana <fdmanana@...il.com> wrote:
>
> On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov <boris@....io> wrote:
> >
> > Under somewhat convoluted conditions, it is possible to attempt to
> > release an extent_buffer that is under io, which triggers a BUG_ON in
> > btrfs_release_extent_buffer_pages.
> >
> > This relies on a few different factors. First, extent_buffer reads done
> > as readahead for searching use WAIT_NONE, so they free the local extent
> > buffer reference while the io is outstanding. However, they should still
> > be protected by TREE_REF. However, if the system is doing signficant
> > reclaim, and simultaneously heavily accessing the extent_buffers, it is
> > possible for releasepage to race with two concurrent readahead attempts
> > in a way that leaves TREE_REF unset when the readahead extent buffer is
> > released.
> >
> > Essentially, if two tasks race to allocate a new extent_buffer, but the
> > winner who attempts the first io is rebuffed by a page being locked
> > (likely by the reclaim itself) then the loser will still go ahead with
> > issuing the readahead. The loser's call to find_extent_buffer must also
> > race with the reclaim task reading the extent_buffer's refcount as 1 in
> > a way that allows the reclaim to re-clear the TREE_REF checked by
> > find_extent_buffer.
> >
> > The following represents an example execution demonstrating the race:
> >
> >             CPU0                                                         CPU1                                           CPU2
> > reada_for_search                                            reada_for_search
> >   readahead_tree_block                                        readahead_tree_block
> >     find_create_tree_block                                      find_create_tree_block
> >       alloc_extent_buffer                                         alloc_extent_buffer
> >                                                                   find_extent_buffer // not found
> >                                                                   allocates eb
> >                                                                   lock pages
> >                                                                   associate pages to eb
> >                                                                   insert eb into radix tree
> >                                                                   set TREE_REF, refs == 2
> >                                                                   unlock pages
> >                                                               read_extent_buffer_pages // WAIT_NONE
> >                                                                 not uptodate (brand new eb)
> >                                                                                                             lock_page
> >                                                                 if !trylock_page
> >                                                                   goto unlock_exit // not an error
> >                                                               free_extent_buffer
> >                                                                 release_extent_buffer
> >                                                                   atomic_dec_and_test refs to 1
> >         find_extent_buffer // found
> >                                                                                                             try_release_extent_buffer
> >                                                                                                               take refs_lock
> >                                                                                                               reads refs == 1; no io
> >           atomic_inc_not_zero refs to 2
> >           mark_buffer_accessed
> >             check_buffer_tree_ref
> >               // not STALE, won't take refs_lock
> >               refs == 2; TREE_REF set // no action
> >     read_extent_buffer_pages // WAIT_NONE
> >                                                                                                               clear TREE_REF
> >                                                                                                               release_extent_buffer
> >                                                                                                                 atomic_dec_and_test refs to 1
> >                                                                                                                 unlock_page
> >       still not uptodate (CPU1 read failed on trylock_page)
> >       locks pages
> >       set io_pages > 0
> >       submit io
> >       return
> >     release_extent_buffer
>
> Small detail, missing the call to free_extent_buffer() from
> readahead_tree_block(), which is what ends calling
> release_extent_buffer().
>
> >       dec refs to 0
> >       delete from radix tree
> >       btrfs_release_extent_buffer_pages
> >         BUG_ON(io_pages > 0)!!!
> >
> > We observe this at a very low rate in production and were also able to
> > reproduce it in a test environment by introducing some spurious delays
> > and by introducing probabilistic trylock_page failures.
> >
> > To fix it, we apply check_tree_ref at a point where it could not
> > possibly be unset by a competing task: after io_pages has been
> > incremented. There is no race in write_one_eb, that we know of, but for
> > consistency, apply it there too. All the codepaths that clear TREE_REF
> > check for io, so they would not be able to clear it after this point.
> >
> > Signed-off-by: Boris Burkov <boris@....io>
>
> Btw, if you have a stack trace, it would be good to include it in the
> change log for grep-ability in case someone runs into the same
> problem.
>
> > ---
> >  fs/btrfs/extent_io.c | 45 ++++++++++++++++++++++++++++----------------
> >  1 file changed, 29 insertions(+), 16 deletions(-)
> >
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index c59e07360083..f6758ebbb6a2 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
> >         clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
> >         num_pages = num_extent_pages(eb);
> >         atomic_set(&eb->io_pages, num_pages);
> > +       /*
> > +        * It is possible for releasepage to clear the TREE_REF bit before we
> > +        * set io_pages. See check_buffer_tree_ref for a more detailed comment.
> > +        */
> > +       check_buffer_tree_ref(eb);
>
> This is a whole different case from the one described in the
> changelog, as this is in the write path.
> Why do we need this one?
>
> At this point the eb pages are marked with the dirty bit, and
> btree_releasepage() returns 0 if the page is dirty and we don't end up
> calling try_release_extent_buffer().

In addition to that, when write_one_eb() is called we have incremented
its ref count before and EXTENT_BUFFER_WRITEBACK is set on the eb,
try_release_extent_buffer() should return and never call
release_extent_buffer(), since the refcount is > 1 and
extent_buffer_under_io() returns true.

> So I don't understand why this hunk is needed, what variant of the
> problem it's trying to solve.
>
> >
> >         /* set btree blocks beyond nritems with 0 to avoid stale content. */
> >         nritems = btrfs_header_nritems(eb);
> > @@ -5086,25 +5091,28 @@ struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
> >  static void check_buffer_tree_ref(struct extent_buffer *eb)
> >  {
> >         int refs;
> > -       /* the ref bit is tricky.  We have to make sure it is set
> > -        * if we have the buffer dirty.   Otherwise the
> > -        * code to free a buffer can end up dropping a dirty
> > -        * page
> > +       /*
> > +        * The TREE_REF bit is first set when the extent_buffer is added
> > +        * to the radix tree. It is also reset, if unset, when a new reference
> > +        * is created by find_extent_buffer.
> >          *
> > -        * Once the ref bit is set, it won't go away while the
> > -        * buffer is dirty or in writeback, and it also won't
> > -        * go away while we have the reference count on the
> > -        * eb bumped.
> > +        * It is only cleared in two cases: freeing the last non-tree
> > +        * reference to the extent_buffer when its STALE bit is set or
> > +        * calling releasepage when the tree reference is the only reference.
> >          *
> > -        * We can't just set the ref bit without bumping the
> > -        * ref on the eb because free_extent_buffer might
> > -        * see the ref bit and try to clear it.  If this happens
> > -        * free_extent_buffer might end up dropping our original
> > -        * ref by mistake and freeing the page before we are able
> > -        * to add one more ref.
> > +        * In both cases, care is taken to ensure that the extent_buffer's
> > +        * pages are not under io. However, releasepage can be concurrently
> > +        * called with creating new references, which is prone to race
> > +        * conditions between the calls to check_tree_ref in those codepaths
> > +        * and clearing TREE_REF in try_release_extent_buffer.
> >          *
> > -        * So bump the ref count first, then set the bit.  If someone
> > -        * beat us to it, drop the ref we added.
> > +        * The actual lifetime of the extent_buffer in the radix tree is
> > +        * adequately protected by the refcount, but the TREE_REF bit and
> > +        * its corresponding reference are not. To protect against this
> > +        * class of races, we call check_buffer_tree_ref from the codepaths
> > +        * which trigger io after they set eb->io_pages. Note that once io is
> > +        * initiated, TREE_REF can no longer be cleared, so that is the
> > +        * moment at which any such race is best fixed.
> >          */
> >         refs = atomic_read(&eb->refs);
> >         if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> > @@ -5555,6 +5563,11 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
> >         clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
> >         eb->read_mirror = 0;
> >         atomic_set(&eb->io_pages, num_reads);
> > +       /*
> > +        * It is possible for releasepage to clear the TREE_REF bit before we
> > +        * set io_pages. See check_buffer_tree_ref for a more detailed comment.
> > +        */
> > +       check_buffer_tree_ref(eb);
>
> This is the hunk that fixes the problem described in the change log,
> and it looks good to me.
>
> Thanks.
>
> >         for (i = 0; i < num_pages; i++) {
> >                 page = eb->pages[i];
> >
> > --
> > 2.24.1
> >
>
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”