linux-kernel - Re: [PATCH] shmem, memcg: enable memcg aware shrinker

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ffdff0be-f2b6-c7c0-debc-9c5e8a33ae4e@virtuozzo.com>
Date:   Mon, 1 Jun 2020 14:05:10 +0300
From:   Kirill Tkhai <ktkhai@...tuozzo.com>
To:     Greg Thelen <gthelen@...gle.com>, Hugh Dickins <hughd@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "Kirill A. Shutemov" <kirill@...temov.name>
Cc:     linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        stable@...r.kernel.org
Subject: Re: [PATCH] shmem, memcg: enable memcg aware shrinker

Hi, Greg,

good finding. See comments below.

On 01.06.2020 06:22, Greg Thelen wrote:
> Since v4.19 commit b0dedc49a2da ("mm/vmscan.c: iterate only over charged
> shrinkers during memcg shrink_slab()") a memcg aware shrinker is only
> called when the per-memcg per-node shrinker_map indicates that the
> shrinker may have objects to release to the memcg and node.
> 
> shmem_unused_huge_count and shmem_unused_huge_scan support the per-tmpfs
> shrinker which advertises per memcg and numa awareness.  The shmem
> shrinker releases memory by splitting hugepages that extend beyond
> i_size.
> 
> Shmem does not currently set bits in shrinker_map.  So, starting with
> b0dedc49a2da, memcg reclaim avoids calling the shmem shrinker under
> pressure.  This leads to undeserved memcg OOM kills.
> Example that reliably sees memcg OOM kill in unpatched kernel:
>   FS=/tmp/fs
>   CONTAINER=/cgroup/memory/tmpfs_shrinker
>   mkdir -p $FS
>   mount -t tmpfs -o huge=always nodev $FS
>   # Create 1000 MB container, which shouldn't suffer OOM.
>   mkdir $CONTAINER
>   echo 1000M > $CONTAINER/memory.limit_in_bytes
>   echo $BASHPID >> $CONTAINER/cgroup.procs
>   # Create 4000 files.  Ideally each file uses 4k data page + a little
>   # metadata.  Assume 8k total per-file, 32MB (4000*8k) should easily
>   # fit within container's 1000 MB.  But if data pages use 2MB
>   # hugepages (due to aggressive huge=always) then files consume 8GB,
>   # which hits memcg 1000 MB limit.
>   for i in {1..4000}; do
>     echo . > $FS/$i
>   done
> 
> v5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg
> aware") maintains the per-node per-memcg shrinker bitmap for THP
> shrinker.  But there's no such logic in shmem.  Make shmem set the
> per-memcg per-node shrinker bits when it modifies inodes to have
> shrinkable pages.
> 
> Fixes: b0dedc49a2da ("mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()")
> Cc: <stable@...r.kernel.org> # 4.19+
> Signed-off-by: Greg Thelen <gthelen@...gle.com>
> ---
>  mm/shmem.c | 61 +++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 35 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index bd8840082c94..e11090f78cb5 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1002,6 +1002,33 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
>  	return 0;
>  }
>  
> +/*
> + * Expose inode and optional page to shrinker as having a possibly splittable
> + * hugepage that reaches beyond i_size.
> + */
> +static void shmem_shrinker_add(struct shmem_sb_info *sbinfo,
> +			       struct inode *inode, struct page *page)
> +{
> +	struct shmem_inode_info *info = SHMEM_I(inode);
> +
> +	spin_lock(&sbinfo->shrinklist_lock);
> +	/*
> +	 * _careful to defend against unlocked access to ->shrink_list in
> +	 * shmem_unused_huge_shrink()
> +	 */
> +	if (list_empty_careful(&info->shrinklist)) {
> +		list_add_tail(&info->shrinklist, &sbinfo->shrinklist);
> +		sbinfo->shrinklist_len++;
> +	}
> +	spin_unlock(&sbinfo->shrinklist_lock);
> +
> +#ifdef CONFIG_MEMCG
> +	if (page && PageTransHuge(page))
> +		memcg_set_shrinker_bit(page->mem_cgroup, page_to_nid(page),
> +				       inode->i_sb->s_shrink.id);
> +#endif
> +}
> +
>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>  {
>  	struct inode *inode = d_inode(dentry);
> @@ -1048,17 +1075,13 @@ static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>  			 * to shrink under memory pressure.
>  			 */
>  			if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> -				spin_lock(&sbinfo->shrinklist_lock);
> -				/*
> -				 * _careful to defend against unlocked access to
> -				 * ->shrink_list in shmem_unused_huge_shrink()
> -				 */
> -				if (list_empty_careful(&info->shrinklist)) {
> -					list_add_tail(&info->shrinklist,
> -							&sbinfo->shrinklist);
> -					sbinfo->shrinklist_len++;
> -				}
> -				spin_unlock(&sbinfo->shrinklist_lock);
> +				struct page *page;
> +
> +				page = find_get_page(inode->i_mapping,
> +					(newsize & HPAGE_PMD_MASK) >> PAGE_SHIFT);
> +				shmem_shrinker_add(sbinfo, inode, page);
> +				if (page)
> +					put_page(page);

1)I'd move PageTransHuge() check from shmem_shrinker_add() to here. In case of page is not trans huge,
  it looks strange and completely useless to add inode to shrinklist and then to avoid memcg_set_shrinker_bit().
  Nothing should be added to the shrinklist in this case.

2)In general I think this "last inode page spliter" does not fit SHINKER_MEMCG_AWARE conception, and
  shmem_unused_huge_shrink() should be reworked as a new separate !memcg-aware shrinker instead of
  .nr_cached_objects callback of generic fs shrinker.

CC: Kirill Shutemov

Kirill, are there any fundamental reasons to keep this shrinking logic in the generic fs shrinker?
Are there any no-go to for conversion this as a separate !memcg-aware shrinker?

>  			}
>  		}
>  	}
> @@ -1889,21 +1912,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  	if (PageTransHuge(page) &&
>  	    DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE) <
>  			hindex + HPAGE_PMD_NR - 1) {
> -		/*
> -		 * Part of the huge page is beyond i_size: subject
> -		 * to shrink under memory pressure.
> -		 */
> -		spin_lock(&sbinfo->shrinklist_lock);
> -		/*
> -		 * _careful to defend against unlocked access to
> -		 * ->shrink_list in shmem_unused_huge_shrink()
> -		 */
> -		if (list_empty_careful(&info->shrinklist)) {
> -			list_add_tail(&info->shrinklist,
> -				      &sbinfo->shrinklist);
> -			sbinfo->shrinklist_len++;
> -		}
> -		spin_unlock(&sbinfo->shrinklist_lock);
> +		shmem_shrinker_add(sbinfo, inode, page);
>  	}
>  
>  	/*

Thanks,
Kirill