Date:	Mon, 21 Apr 2008 14:05:25 -0500
From:	Adam Litke <agl@...ibm.com>
To:	Mel Gorman <mel@....ul.ie>
Cc:	wli@...omorphy.com, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org
Subject: Re: [RFC] Reserve huge pages for reliable MAP_PRIVATE hugetlbfs
	mappings


On Mon, 2008-04-21 at 19:36 +0100, Mel Gorman wrote:
> MAP_SHARED mappings on hugetlbfs reserve huge pages at mmap() time. This is
> so that all future faults will be guaranteed to succeed. Applications are not
> expected to use mlock() as this can result in poor NUMA placement.
> 
> MAP_PRIVATE mappings do not reserve pages. This can result in an application
> being SIGKILLed later if a huge page is not available at fault time. This
> makes huge page usage very ill-advised in some cases, as the unexpected
> application failure is intolerable. Forcing potentially poor placement with
> mlock() is not a great solution either.
> 
> This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings,
> similar to what happens for MAP_SHARED mappings. Once mmap() succeeds, the
> application developer knows that future faults will also succeed. However,
> there is no guarantee that children of the process will be able to
> write-fault the same mapping. The assumption is that the majority of
> applications that fork() either use MAP_SHARED as an IPC mechanism or are
> calling exec().
> 
> Opinions?

Very sound idea in my opinion, and definitely a step in the right
direction toward even more reliable huge pages.  With this patch (and my
comments addressed), the only remaining cause of unexpected SIGKILLs
would be copy-on-write.
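
For reference, here is the failure mode being fixed, as a minimal
userspace sketch (the /mnt/huge mount point and 2MB huge page size are
assumptions for illustration):

/* private-demo.c: MAP_PRIVATE hugetlb failure mode */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumed huge page size */
#define NR_PAGES	64

int main(void)
{
	unsigned long i;
	int fd = open("/mnt/huge/demo", O_CREAT | O_RDWR, 0600);
	char *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	p = mmap(NULL, NR_PAGES * HPAGE_SIZE, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		/* With the patch, an unsatisfiable reservation fails
		 * here with ENOMEM... */
		perror("mmap");
		return 1;
	}
	/* ...instead of the process being SIGKILLed on some later write
	 * fault once the huge page pool has run dry. */
	for (i = 0; i < NR_PAGES; i++)
		p[i * HPAGE_SIZE] = 1;
	munmap(p, NR_PAGES * HPAGE_SIZE);
	close(fd);
	return 0;
}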

> @@ -40,6 +40,34 @@ static int hugetlb_next_nid;
>   */
>  static DEFINE_SPINLOCK(hugetlb_lock);
> 
> +/* Helpers to track the number of pages reserved for a MAP_PRIVATE vma */
> +static unsigned long vma_resv_huge_pages(struct vm_area_struct *vma)
> +{
> +	if (!(vma->vm_flags & VM_MAYSHARE))
> +		return (unsigned long)vma->vm_private_data;
> +	return 0;
> +}
> +
> +static void adjust_vma_resv_huge_pages(struct vm_area_struct *vma,
> +						int delta)
> +{
> +	BUG_ON((unsigned long)vma->vm_private_data > 100);
> +	WARN_ON_ONCE(vma->vm_flags & VM_MAYSHARE);
> +	if (!(vma->vm_flags & VM_MAYSHARE)) {
> +		unsigned long reserve;
> +		reserve = (unsigned long)vma->vm_private_data + delta;
> +		vma->vm_private_data = (void *)reserve;
> +	}
> +}
> +
> +static void set_vma_resv_huge_pages(struct vm_area_struct *vma,
> +						unsigned long reserve)
> +{
> +	BUG_ON((unsigned long)vma->vm_private_data > 100);

I assume you're just playing it safe for this RFC, but surely a 100-page
maximum reservation is not sufficient (especially since we have a whole
unsigned long to work with).  Also, I am not sure a BUG_ON would be an
appropriate response to exceeding the maximum.
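
For what it's worth, a sketch of what I had in mind: return an error and
let the caller deal with it (the -EINVAL choice and dropping the debug
cap entirely are my assumptions):

static int adjust_vma_resv_huge_pages(struct vm_area_struct *vma,
						int delta)
{
	unsigned long reserve;

	WARN_ON_ONCE(vma->vm_flags & VM_MAYSHARE);
	if (vma->vm_flags & VM_MAYSHARE)
		return 0;

	reserve = (unsigned long)vma->vm_private_data;
	/* Refuse to underflow the reservation count rather than BUG. */
	if (delta < 0 && reserve < (unsigned long)(-delta))
		return -EINVAL;
	vma->vm_private_data = (void *)(reserve + delta);
	return 0;
}

Callers would then need to check the return value, of course.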

<snip>

> @@ -1223,52 +1305,30 @@ static long region_truncate(struct list_
>  	return chg;
>  }
> 
> -static int hugetlb_acct_memory(long delta)
> +int hugetlb_reserve_pages(struct inode *inode,
> +					long from, long to,
> +					struct vm_area_struct *vma)
>  {
> -	int ret = -ENOMEM;
> +	long ret, chg;
> 
> -	spin_lock(&hugetlb_lock);
>  	/*
> -	 * When cpuset is configured, it breaks the strict hugetlb page
> -	 * reservation as the accounting is done on a global variable. Such
> -	 * reservation is completely rubbish in the presence of cpuset because
> -	 * the reservation is not checked against page availability for the
> -	 * current cpuset. Application can still potentially OOM'ed by kernel
> -	 * with lack of free htlb page in cpuset that the task is in.
> -	 * Attempt to enforce strict accounting with cpuset is almost
> -	 * impossible (or too ugly) because cpuset is too fluid that
> -	 * task or memory node can be dynamically moved between cpusets.
> -	 *
> -	 * The change of semantics for shared hugetlb mapping with cpuset is
> -	 * undesirable. However, in order to preserve some of the semantics,
> -	 * we fall back to check against current free page availability as
> -	 * a best attempt and hopefully to minimize the impact of changing
> -	 * semantics that cpuset has.
> +	 * Shared mappings and read-only mappings should base their reservation
> +	 * on the number of pages that are already allocated on behalf of the
> +	 * file. Private mappings that are writable need to reserve the full
> +	 * area. Note that a read-only private mapping that subsequently calls
> +	 * mprotect() to make it read-write may not work reliably.
>  	 */
> -	if (delta > 0) {
> -		if (gather_surplus_pages(delta) < 0)
> -			goto out;
> -
> -		if (delta > cpuset_mems_nr(free_huge_pages_node)) {
> -			return_unused_surplus_pages(delta);
> -			goto out;
> -		}
> +	if (vma->vm_flags & VM_SHARED)
> +		chg = region_chg(&inode->i_mapping->private_list, from, to);
> +	else {
> +		if (vma->vm_flags & VM_MAYWRITE)
> +			chg = to - from;
> +		else
> +			chg = region_chg(&inode->i_mapping->private_list,
> +								from, to);
> +		set_vma_resv_huge_pages(vma, chg);

To promote reliability, might it be advisable to just reserve the pages
regardless of VM_MAYWRITE?  Otherwise we might want to consider
reserving the pages in hugetlb_change_protection().
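
To illustrate the case the comment above warns about, from userspace
(again, the mount point and page size are just for illustration):

/* mprotect-demo.c: read-only private mapping upgraded to read-write */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumed huge page size */

int main(void)
{
	int fd = open("/mnt/huge/demo", O_CREAT | O_RDWR, 0600);
	char *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Read-only private mapping: the reservation covers only pages
	 * already instantiated in the file, not the whole range. */
	p = mmap(NULL, HPAGE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* Upgrading to read-write takes no additional reservation... */
	if (mprotect(p, HPAGE_SIZE, PROT_READ | PROT_WRITE) != 0) {
		perror("mprotect");
		return 1;
	}
	/* ...so this first write fault needs an unreserved huge page and
	 * can still SIGKILL the task if the pool is exhausted. */
	p[0] = 1;
	munmap(p, HPAGE_SIZE);
	close(fd);
	return 0;
}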

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center

