linux-kernel - Re: [PATCH v4] mm/hugetlb: add mempolicy check in the reservation routine

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200728132453.GB14854@MiWiFi-R3L-srv>
Date:   Tue, 28 Jul 2020 21:24:53 +0800
From:   Baoquan He <bhe@...hat.com>
To:     Muchun Song <songmuchun@...edance.com>
Cc:     mike.kravetz@...cle.com, akpm@...ux-foundation.org,
        mhocko@...nel.org, rientjes@...gle.com, mgorman@...e.de,
        walken@...gle.com, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org,
        Jianchao Guo <guojianchao@...edance.com>
Subject: Re: [PATCH v4] mm/hugetlb: add mempolicy check in the reservation
 routine

Hi Muchun,

On 07/28/20 at 11:49am, Muchun Song wrote:
> In the reservation routine, we only check whether the cpuset meets
> the memory allocation requirements. But we ignore the mempolicy of
> MPOL_BIND case. If someone mmap hugetlb succeeds, but the subsequent
> memory allocation may fail due to mempolicy restrictions and receives
> the SIGBUS signal. This can be reproduced by the follow steps.
> 
>  1) Compile the test case.
>     cd tools/testing/selftests/vm/
>     gcc map_hugetlb.c -o map_hugetlb
> 
>  2) Pre-allocate huge pages. Suppose there are 2 numa nodes in the
>     system. Each node will pre-allocate one huge page.
>     echo 2 > /proc/sys/vm/nr_hugepages
> 
>  3) Run test case(mmap 4MB). We receive the SIGBUS signal.
>     numactl --membind=0 ./map_hugetlb 4

I think supporting the  mempolicy of MPOL_BIND case is a good idea.
I am wondering what about the other mempolicy cases, e.g MPOL_INTERLEAVE,
MPOL_PREFERRED. Asking these because we already have similar handling in
sysfs, proc nr_hugepages_mempolicy writting. Please see
__nr_hugepages_store_common() for detail.

Thanks
Baoquan

> 
> With this patch applied, the mmap will fail in the step 3) and throw
> "mmap: Cannot allocate memory".
> 
> Signed-off-by: Muchun Song <songmuchun@...edance.com>
> Reported-by: Jianchao Guo <guojianchao@...edance.com>
> Suggested-by: Michal Hocko <mhocko@...nel.org>
> Reviewed-by: Mike Kravetz <mike.kravetz@...cle.com>
> ---
> changelog in v4:
>  1) Fix compilation errors with !CONFIG_NUMA.
> 
> changelog in v3:
>  1) Do not allocate nodemask on the stack.
>  2) Update comment.
> 
> changelog in v2:
>  1) Reuse policy_nodemask().
> 
>  include/linux/mempolicy.h | 14 ++++++++++++++
>  mm/hugetlb.c              | 22 ++++++++++++++++++----
>  mm/mempolicy.c            |  2 +-
>  3 files changed, 33 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index ea9c15b60a96..0656ece1ccf1 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -152,6 +152,15 @@ extern int huge_node(struct vm_area_struct *vma,
>  extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
>  extern bool mempolicy_nodemask_intersects(struct task_struct *tsk,
>  				const nodemask_t *mask);
> +extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy);
> +
> +static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
> +{
> +	struct mempolicy *mpol = get_task_policy(current);
> +
> +	return policy_nodemask(gfp, mpol);
> +}
> +
>  extern unsigned int mempolicy_slab_node(void);
>  
>  extern enum zone_type policy_zone;
> @@ -281,5 +290,10 @@ static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
>  static inline void mpol_put_task_policy(struct task_struct *task)
>  {
>  }
> +
> +static inline nodemask_t *policy_nodemask_current(gfp_t gfp)
> +{
> +	return NULL;
> +}
>  #endif /* CONFIG_NUMA */
>  #endif
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 589c330df4db..a34458f6a475 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3463,13 +3463,21 @@ static int __init default_hugepagesz_setup(char *s)
>  }
>  __setup("default_hugepagesz=", default_hugepagesz_setup);
>  
> -static unsigned int cpuset_mems_nr(unsigned int *array)
> +static unsigned int allowed_mems_nr(struct hstate *h)
>  {
>  	int node;
>  	unsigned int nr = 0;
> +	nodemask_t *mpol_allowed;
> +	unsigned int *array = h->free_huge_pages_node;
> +	gfp_t gfp_mask = htlb_alloc_mask(h);
> +
> +	mpol_allowed = policy_nodemask_current(gfp_mask);
>  
> -	for_each_node_mask(node, cpuset_current_mems_allowed)
> -		nr += array[node];
> +	for_each_node_mask(node, cpuset_current_mems_allowed) {
> +		if (!mpol_allowed ||
> +		    (mpol_allowed && node_isset(node, *mpol_allowed)))
> +			nr += array[node];
> +	}
>  
>  	return nr;
>  }
> @@ -3648,12 +3656,18 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
>  	 * we fall back to check against current free page availability as
>  	 * a best attempt and hopefully to minimize the impact of changing
>  	 * semantics that cpuset has.
> +	 *
> +	 * Apart from cpuset, we also have memory policy mechanism that
> +	 * also determines from which node the kernel will allocate memory
> +	 * in a NUMA system. So similar to cpuset, we also should consider
> +	 * the memory policy of the current task. Similar to the description
> +	 * above.
>  	 */
>  	if (delta > 0) {
>  		if (gather_surplus_pages(h, delta) < 0)
>  			goto out;
>  
> -		if (delta > cpuset_mems_nr(h->free_huge_pages_node)) {
> +		if (delta > allowed_mems_nr(h)) {
>  			return_unused_surplus_pages(h, delta);
>  			goto out;
>  		}
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 93fcfc1f2fa2..fce14c3f4f38 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1873,7 +1873,7 @@ static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
>   * Return a nodemask representing a mempolicy for filtering nodes for
>   * page allocation
>   */
> -static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
> +nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
>  {
>  	/* Lower zones don't get a nodemask applied for MPOL_BIND */
>  	if (unlikely(policy->mode == MPOL_BIND) &&
> -- 
> 2.11.0
> 
>