linux-ext4 - Re: [PATCH] ext4_sb_breadahead_unmovable: Change to be no-blocking

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANFwon18AaMB4iTwC633=WhupfqJjxs0+U0F0MWNynXByDhcZA@mail.gmail.com>
Date:   Fri, 11 Aug 2023 15:21:46 +0800
From:   Hui Zhu <teawater@...il.com>
To:     Hui Zhu <teawaterz@...ux.alibaba.com>
Cc:     viro@...iv.linux.org.uk, brauner@...nel.org, tytso@....edu,
        adilger.kernel@...ger.ca, akpm@...ux-foundation.org, jack@...e.cz,
        willy@...radead.org, yi.zhang@...wei.com, hare@...e.de,
        p.raghav@...sung.com, ritesh.list@...il.com, mpatocka@...hat.com,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-ext4@...r.kernel.org, teawater@...group.com
Subject: Re: [PATCH] ext4_sb_breadahead_unmovable: Change to be no-blocking

Oops, this is the version 2 of the patch.

Best,
Hui

Hui Zhu <teawaterz@...ux.alibaba.com> 于2023年8月11日周五 15:15写道：
>
> From: Hui Zhu <teawater@...group.com>
>
> This version fix the gfp flags in the callers instead of working this
> new "bool" flag through the buffer head layers according to the comments
> from Matthew Wilcox.
>
> Encountered an issue where a large number of filesystem reads and writes
> occurred suddenly within a container.  At the same time, other tasks on
> the same host that were performing filesystem read and write operations
> became blocked.  It was observed that many of the blocked tasks were
> blocked on the ext4 journal lock. For example:
> PID: 171453 TASK: ffff926566c9440 CPU: 54 COMMAND: "Thread"
> 0 [] __schedule at xxxxxxxxxxxxxxx
> 1 [] schedule at xxxxxxxxxxxxxxx
> 2 [] wait_transaction_locked at xxxxxxxxxxxxxxx
> 3 [] add_transaction_credits at xxxxxxxxxxxxxxx
> 4 [] start_this_handle at xxxxxxxxxxxxxxx
> 5 [] jbd2__journal_start at xxxxxxxxxxxxxxx
> 6 [] ext4_journal_start_sb at xxxxxxxxxxxxxxx
> 7 [] ext4_dirty_inode at xxxxxxxxxxxxxxx
> 8 [] mark_inode_dirty at xxxxxxxxxxxxxxx
> 9 [] generic_update_time at xxxxxxxxxxxxxxx
>
> Meanwhile, it was observed that the task holding the ext4 journal lock
> was blocked for an extended period of time on "shrink_page_list" due to
> "ext4_sb_breadahead_unmovable".
> 0 [] __schedule at xxxxxxxxxxxxxxx
> 1 [] _cond_resched at xxxxxxxxxxxxxxx
> 2 [] shrink_page_list at xxxxxxxxxxxxxxx
> 3 [] shrink_inactive_list at xxxxxxxxxxxxxxx
> 4 [] shrink_lruvec at xxxxxxxxxxxxxxx
> 5 [] shrink_node_memcgs at xxxxxxxxxxxxxxx
> 6 [] shrink_node at xxxxxxxxxxxxxxx
> 7 [] shrink_zones at xxxxxxxxxxxxxxx
> 8 [] do_try_to_free_pages at xxxxxxxxxxxxxxx
> 9 [] try_to_free_mem_cgroup_pages at xxxxxxxxxxxxxxx
> 10 [] try_charge at xxxxxxxxxxxxxxx
> 11 [] mem_cgroup_charge at xxxxxxxxxxxxxxx
> 12 [] __add_to_page_cache_locked at xxxxxxxxxxxxxxx
> 13 [] add_to_page_cache_lru at xxxxxxxxxxxxxxx
> 14 [] pagecache_get_page at xxxxxxxxxxxxxxx
> 15 [] grow_dev_page at xxxxxxxxxxxxxxx
> 16 [] __getblk_slow at xxxxxxxxxxxxxxx
> 17 [] ext4_sb_breadahead_unmovable at xxxxxxxxxxxxxxx
> 18 [] __ext4_get_inode_loc at xxxxxxxxxxxxxxx
> 19 [] ext4_get_inode_loc at xxxxxxxxxxxxxxx
> 20 [] ext4_reserve_inode_write at xxxxxxxxxxxxxxx
> 21 [] __ext4_mark_inode_dirty at xxxxxxxxxxxxxxx
> 22 [] add_dirent_to_buf at xxxxxxxxxxxxxxx
> 23 [] ext4_add_entry at xxxxxxxxxxxxxxx
> 24 [] ext4_add_nondir at xxxxxxxxxxxxxxx
> 25 [] ext4_create at xxxxxxxxxxxxxxx
> 26 [] vfs_create at xxxxxxxxxxxxxxx
>
> The function "grow_dev_page" increased the gfp mask with "__GFP_NOFAIL",
> causing longer blocking times.
>         /*
>          * XXX: __getblk_slow() can not really deal with failure and
>          * will endlessly loop on improvised global reclaim.  Prefer
>          * looping in the allocator rather than here, at least that
>          * code knows what it's doing.
>          */
>         gfp_mask |= __GFP_NOFAIL;
> However, "ext4_sb_breadahead_unmovable" is a prefetch function and
> failures are acceptable.
>
> Therefore, this commit changes "ext4_sb_breadahead_unmovable" to be
> non-blocking.
> Change gfp to ~__GFP_DIRECT_RECLAIM when ext4_sb_breadahead_unmovable
> calls sb_getblk_gfp.
> Modify grow_dev_page to will not be blocked by the allocation of folio
> if gfp is ~__GFP_DIRECT_RECLAIM.
>
> Signed-off-by: Hui Zhu <teawater@...group.com>
> ---
>  fs/buffer.c     | 27 +++++++++++++++++++--------
>  fs/ext4/super.c |  3 ++-
>  2 files changed, 21 insertions(+), 9 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index bd091329026c..330cf19c77b1 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1038,6 +1038,7 @@ static sector_t folio_init_buffers(struct folio *folio,
>   * Create the page-cache page that contains the requested block.
>   *
>   * This is used purely for blockdev mappings.
> + * Will not blocking by allocate folio if gfp is ~__GFP_DIRECT_RECLAIM.
>   */
>  static int
>  grow_dev_page(struct block_device *bdev, sector_t block,
> @@ -1050,18 +1051,27 @@ grow_dev_page(struct block_device *bdev, sector_t block,
>         int ret = 0;
>         gfp_t gfp_mask;
>
> -       gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp;
> +       gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS);
> +       if (gfp == ~__GFP_DIRECT_RECLAIM)
> +               gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> +       else {
> +               gfp_mask |= gfp;
>
> -       /*
> -        * XXX: __getblk_slow() can not really deal with failure and
> -        * will endlessly loop on improvised global reclaim.  Prefer
> -        * looping in the allocator rather than here, at least that
> -        * code knows what it's doing.
> -        */
> -       gfp_mask |= __GFP_NOFAIL;
> +               /*
> +                * XXX: __getblk_slow() can not really deal with failure and
> +                * will endlessly loop on improvised global reclaim.  Prefer
> +                * looping in the allocator rather than here, at least that
> +                * code knows what it's doing.
> +                */
> +               gfp_mask |= __GFP_NOFAIL;
> +       }
>
>         folio = __filemap_get_folio(inode->i_mapping, index,
>                         FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp_mask);
> +       if (IS_ERR(folio)) {
> +               ret = PTR_ERR(folio);
> +               goto out;
> +       }
>
>         bh = folio_buffers(folio);
>         if (bh) {
> @@ -1091,6 +1101,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
>  failed:
>         folio_unlock(folio);
>         folio_put(folio);
> +out:
>         return ret;
>  }
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index c94ebf704616..6a529509b83b 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -254,7 +254,8 @@ struct buffer_head *ext4_sb_bread_unmovable(struct super_block *sb,
>
>  void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block)
>  {
> -       struct buffer_head *bh = sb_getblk_gfp(sb, block, 0);
> +       struct buffer_head *bh = sb_getblk_gfp(sb, block,
> +                                              ~__GFP_DIRECT_RECLAIM);
>
>         if (likely(bh)) {
>                 if (trylock_buffer(bh))
> --
> 2.19.1.6.gb485710b
>