lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Wed, 11 Feb 2009 10:30:01 +0000
From:	Mel Gorman <mel@....ul.ie>
To:	Andy Whitcroft <apw@...onical.com>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Hugh Dickins <hugh@...itas.com>,
	Lee Schermerhorn <Lee.Schermerhorn@...com>,
	Greg KH <gregkh@...e.de>,
	Maksim Yevmenkin <maksim.yevmenkin@...il.com>,
	Nick Piggin <npiggin@...e.de>,
	Andrew Morton <akpm@...ux-foundation.org>,
	will@...wder-design.com, Rik van Riel <riel@...hat.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Mikos Szeredi <miklos@...redi.hu>, wli@...ementarian.org
Subject: Re: [PATCH] Do not account for the address space used by hugetlbfs
	using VM_ACCOUNT V2 (Was Linus 2.6.29-rc4)

On Wed, Feb 11, 2009 at 09:43:29AM +0000, Andy Whitcroft wrote:
> On Tue, Feb 10, 2009 at 02:02:27PM +0000, Mel Gorman wrote:
> > On Sun, Feb 08, 2009 at 01:01:04PM -0800, Linus Torvalds wrote:
> > > 
> > > Another week (and a half), another -rc.
> > > 
> > > Arch updates (sparc, blackfin), driver updates (dvb, mmc, ide), ACPI 
> > > updates, FS updates (ubifs, btrfs), you name it. It's all there.
> > > 
> > > But more importantly, people really have been working on regressions, and 
> > > hopefully this closes a nice set of the top one, and hopefully without 
> > > introducing too many new ones.
> > > 
> > 
> > Hugepages are currently busted due to commit
> > fc8744adc870a8d4366908221508bb113d8b72ee and the problem is in
> > 2.6.29-rc4. There was a bit of a discussion but it didn't get very far and
> > then I went offline for the weekend. Here is a revised patch that tries to
> > bring hugetlbfs more in line with what the core VM is doing by dealing with
> > VM_ACCOUNT and VM_NORESERVE.
> > 
> > =============
> > Do not account for the address space used by hugetlbfs using VM_ACCOUNT
> > 
> > When overcommit is disabled, the core VM accounts for pages used by anonymous
> > shared, private mappings and special mappings. It keeps track of VMAs that
> > should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
> > with VM_NORESERVE.
> > 
> > Overcommit for hugetlbfs is much riskier than overcommit for base pages
> > due to contiguity requirements. It avoids overcommiting on both shared and
> > private mappings using reservation counters that are checked and updated
> > during mmap(). This ensures (within limits) that hugepages exist in the
> > future when faults occurs or it is too easy to applications to be SIGKILLed.
> > 
> > As hugetlbfs makes its own reservations of a different unit to the base page
> > size, VM_ACCOUNT should never be set. Even if the units were correct, we would
> > double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
> > be set because an application can request no reserves be made for hugetlbfs
> > at the risk of getting killed later.
> > 
> > With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
> > VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
> > breaks the accounting for both the core VM and hugetlbfs, can trigger an
> > OOM storm when hugepage pools are too small lockups and corrupted counters
> > otherwise are used. This patch brings hugetlbfs more in line with how the
> > core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.
> > 
> > Signed-off-by: Mel Gorman <mel@....ul.ie>
> > --- 
> >  fs/hugetlbfs/inode.c    |    8 +++++---
> >  include/linux/hugetlb.h |    5 +++--
> >  include/linux/mm.h      |    3 +--
> >  ipc/shm.c               |    8 +++++---
> >  mm/fremap.c             |    2 +-
> >  mm/hugetlb.c            |   39 +++++++++++++++++++++++++--------------
> >  mm/mmap.c               |   38 ++++++++++++++++++++++----------------
> >  mm/mprotect.c           |    5 +++--
> >  8 files changed, 65 insertions(+), 43 deletions(-)
> > 
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index 6903d37..9b800d9 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -108,7 +108,8 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> >  
> >  	if (hugetlb_reserve_pages(inode,
> >  				vma->vm_pgoff >> huge_page_order(h),
> > -				len >> huge_page_shift(h), vma))
> > +				len >> huge_page_shift(h), vma,
> > +				vma->vm_flags))
> >  		goto out;
> >  
> >  	ret = 0;
> > @@ -947,7 +948,7 @@ static int can_do_hugetlb_shm(void)
> >  			can_do_mlock());
> >  }
> >  
> > -struct file *hugetlb_file_setup(const char *name, size_t size)
> > +struct file *hugetlb_file_setup(const char *name, size_t size, int acctflag)
> >  {
> >  	int error = -ENOMEM;
> >  	struct file *file;
> > @@ -981,7 +982,8 @@ struct file *hugetlb_file_setup(const char *name, size_t size)
> >  
> >  	error = -ENOMEM;
> >  	if (hugetlb_reserve_pages(inode, 0,
> > -			size >> huge_page_shift(hstate_inode(inode)), NULL))
> > +			size >> huge_page_shift(hstate_inode(inode)), NULL,
> > +			acctflag))
> >  		goto out_inode;
> >  
> >  	d_instantiate(dentry, inode);
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index f1d2fba..af09660 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -33,7 +33,8 @@ unsigned long hugetlb_total_pages(void);
> >  int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> >  			unsigned long address, int write_access);
> >  int hugetlb_reserve_pages(struct inode *inode, long from, long to,
> > -						struct vm_area_struct *vma);
> > +						struct vm_area_struct *vma,
> > +						int acctflags);
> >  void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
> >  
> >  extern unsigned long hugepages_treat_as_movable;
> > @@ -138,7 +139,7 @@ static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
> >  
> >  extern const struct file_operations hugetlbfs_file_operations;
> >  extern struct vm_operations_struct hugetlb_vm_ops;
> > -struct file *hugetlb_file_setup(const char *name, size_t);
> > +struct file *hugetlb_file_setup(const char *name, size_t, int);
> >  int hugetlb_get_quota(struct address_space *mapping, long delta);
> >  void hugetlb_put_quota(struct address_space *mapping, long delta);
> >  
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index e8ddc98..3235615 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1129,8 +1129,7 @@ extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
> >  	unsigned long flag, unsigned long pgoff);
> >  extern unsigned long mmap_region(struct file *file, unsigned long addr,
> >  	unsigned long len, unsigned long flags,
> > -	unsigned int vm_flags, unsigned long pgoff,
> > -	int accountable);
> > +	unsigned int vm_flags, unsigned long pgoff);
> >  
> >  static inline unsigned long do_mmap(struct file *file, unsigned long addr,
> >  	unsigned long len, unsigned long prot,
> > diff --git a/ipc/shm.c b/ipc/shm.c
> > index f8f69fa..05d51d2 100644
> > --- a/ipc/shm.c
> > +++ b/ipc/shm.c
> > @@ -340,6 +340,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
> >  	struct file * file;
> >  	char name[13];
> >  	int id;
> > +	int acctflag = 0;
> >  
> >  	if (size < SHMMIN || size > ns->shm_ctlmax)
> >  		return -EINVAL;
> > @@ -364,11 +365,12 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
> >  
> >  	sprintf (name, "SYSV%08x", key);
> >  	if (shmflg & SHM_HUGETLB) {
> > -		/* hugetlb_file_setup takes care of mlock user accounting */
> > -		file = hugetlb_file_setup(name, size);
> > +		/* hugetlb_file_setup applies strict accounting */
> > +		if (shmflg & SHM_NORESERVE)
> > +			acctflag = VM_NORESERVE;
> > +		file = hugetlb_file_setup(name, size, acctflag);
> 
> Does this change semantics?  At first look it appears that you are now
> implementing SHM_NORESERVE for SHM_HUGETLB where we did not previously
> do so?  Not that this is necesarily a bad thing, just feels like its
> changing.
> 

Yes, semantics for shared mappings are changed so that they obey
SHM_NORESERVE. This was part of bringing hugetlbfs closer to the
behaviour of the core VM but I should have called it out better in the
changelog.

> >  		shp->mlock_user = current_user();
> >  	} else {
> > -		int acctflag = 0;
> >  		/*
> >  		 * Do not allow no accounting for OVERCOMMIT_NEVER, even
> >  	 	 * if it's asked for.
> > diff --git a/mm/fremap.c b/mm/fremap.c
> > index 736ba7f..b6ec85a 100644
> > --- a/mm/fremap.c
> > +++ b/mm/fremap.c
> > @@ -198,7 +198,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
> >  			flags &= MAP_NONBLOCK;
> >  			get_file(file);
> >  			addr = mmap_region(file, start, size,
> > -					flags, vma->vm_flags, pgoff, 1);
> > +					flags, vma->vm_flags, pgoff);
> >  			fput(file);
> >  			if (IS_ERR_VALUE(addr)) {
> >  				err = addr;
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 618e983..2074642 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -2269,14 +2269,12 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
> >  
> >  int hugetlb_reserve_pages(struct inode *inode,
> >  					long from, long to,
> > -					struct vm_area_struct *vma)
> > +					struct vm_area_struct *vma,
> > +					int acctflag)
> >  {
> > -	long ret, chg;
> > +	long ret = 0, chg;
> >  	struct hstate *h = hstate_inode(inode);
> >  
> > -	if (vma && vma->vm_flags & VM_NORESERVE)
> > -		return 0;
> > -
> >  	/*
> >  	 * Shared mappings base their reservation on the number of pages that
> >  	 * are already allocated on behalf of the file. Private mappings need
> > @@ -2285,22 +2283,25 @@ int hugetlb_reserve_pages(struct inode *inode,
> >  	 */
> >  	if (!vma || vma->vm_flags & VM_SHARED)
> >  		chg = region_chg(&inode->i_mapping->private_list, from, to);
> 
> I thought the region map for a VM_SHARED mapping is meant to contain
> those pages for which we already have a reservation allocated.  So that
> if we have overlapping VM_RESERVE and VM_NORESERVE mappings we know that
> we did have a page reserved at fault time and know whether we can take
> it from the reserve portion of the pool.  By letting this get executed
> for VM_NORESERVE mappings that would seem to be getting out of sync,
> which doesn't sound right. 

This part doesn't get executed for NORESERVE so while region_chg() took
place to calculate a reservation, region_add() did not get called to
commit it.

> I don't see any updates to the documentation
> on the meanings of reservation map if you are changing it?
> 
> Here is the commentry from mm/hugetlb.c:
> 
>  * The private mapping reservation is represented in a subtly different
>  * manner to a shared mapping.  A shared mapping has a region map associated
>  * with the underlying file, this region map represents the backing file
>  * pages which have ever had a reservation assigned which this persists even
>  * after the page is instantiated.  A private mapping has a region map
>  * associated with the original mmap which is attached to all VMAs which
>  * reference it, this region map represents those offsets which have consumed
>  * reservation ie. where pages have been instantiated.
> 

This is still accurate.

> > -	else {
> > -		struct resv_map *resv_map = resv_map_alloc();
> > -		if (!resv_map)
> > -			return -ENOMEM;
> > -
> > +	else
> >  		chg = to - from;
> >  
> > -		set_vma_resv_map(vma, resv_map);
> > -		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
> > -	}
> > -
> >  	if (chg < 0)
> >  		return chg;
> >  
> >  	if (hugetlb_get_quota(inode->i_mapping, chg))
> >  		return -ENOSPC;
> > +
> > +	/*
> > +	 * Only apply hugepage reservation if asked. We still have to
> > +	 * take the filesystem quota because it is an upper limit
> > +	 * defined for the mount and not necessarily memory as a whole
> > +	 */
> > +	if (acctflag & VM_NORESERVE) {
> > +		reset_vma_resv_huge_pages(vma);
> 
> Why do we now need to do this in the non-reserve case?  We didn't need
> to do it before.
> 

True, it was largely defensive against anything being in page_private
that would make it think it had reserves. 

> > +		return 0;
> > +	}
> > +
> 
> This also seems like a semantic change.  Previously NO_RESERVE did not
> take quota, now it does.  NO_RESERVE used to mean that we took our
> chances on there being pages available at fault time both for quota and
> in the pools. Now it means that we only risk there being no pages.
> Does that not significantly change semantics. 

Yes, and this was a mistake. For noreserve mappings, we may now be taking
twice the amount of quota and probably leaking it. This is wrong and I need
to move the check for quota below the check for VM_NORESERVE. Good spot.

> The main use case for
> NO_RESERVE was mapping a massive area to expand into, this might now be
> refused for quota when previously it was not.
> 
> >  	ret = hugetlb_acct_memory(h, chg);
> >  	if (ret < 0) {
> >  		hugetlb_put_quota(inode->i_mapping, chg);
> > @@ -2308,6 +2309,16 @@ int hugetlb_reserve_pages(struct inode *inode,
> >  	}
> >  	if (!vma || vma->vm_flags & VM_SHARED)
> >  		region_add(&inode->i_mapping->private_list, from, to);
> > +	else {
> > +		struct resv_map *resv_map = resv_map_alloc();
> > +
> > +		if (!resv_map)
> > +			return -ENOMEM;
> > +
> > +		set_vma_resv_map(vma, resv_map);
> > +		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
> > +	}
> > +
> >  	return 0;
> >  }
> >  
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 214b6a2..6f70db8 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -918,7 +918,6 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
> >  	struct inode *inode;
> >  	unsigned int vm_flags;
> >  	int error;
> > -	int accountable = 1;
> >  	unsigned long reqprot = prot;
> >  
> >  	/*
> > @@ -1019,8 +1018,6 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
> >  					return -EPERM;
> >  				vm_flags &= ~VM_MAYEXEC;
> >  			}
> > -			if (is_file_hugepages(file))
> > -				accountable = 0;
> >  
> >  			if (!file->f_op || !file->f_op->mmap)
> >  				return -ENODEV;
> > @@ -1053,8 +1050,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
> >  	if (error)
> >  		return error;
> >  
> > -	return mmap_region(file, addr, len, flags, vm_flags, pgoff,
> > -			   accountable);
> > +	return mmap_region(file, addr, len, flags, vm_flags, pgoff);
> >  }
> >  EXPORT_SYMBOL(do_mmap_pgoff);
> >  
> > @@ -1092,17 +1088,23 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
> >  
> >  /*
> >   * We account for memory if it's a private writeable mapping,
> > - * and VM_NORESERVE wasn't set.
> > + * not hugepages and VM_NORESERVE wasn't set.
> >   */
> > -static inline int accountable_mapping(unsigned int vm_flags)
> > +static inline int accountable_mapping(struct file *file, unsigned int vm_flags)
> >  {
> > +	/*
> > +	 * hugetlb has its own accounting separate from the core VM
> > +	 * VM_HUGETLB may not be set yet so we cannot check for that flag.
> > +	 */
> > +	if (file && is_file_hugepages(file))
> > +		return 0;
> > +
> >  	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
> >  }
> >  
> >  unsigned long mmap_region(struct file *file, unsigned long addr,
> >  			  unsigned long len, unsigned long flags,
> > -			  unsigned int vm_flags, unsigned long pgoff,
> > -			  int accountable)
> > +			  unsigned int vm_flags, unsigned long pgoff)
> >  {
> >  	struct mm_struct *mm = current->mm;
> >  	struct vm_area_struct *vma, *prev;
> > @@ -1128,18 +1130,22 @@ munmap_back:
> >  
> >  	/*
> >  	 * Set 'VM_NORESERVE' if we should not account for the
> > -	 * memory use of this mapping. We only honor MAP_NORESERVE
> > -	 * if we're allowed to overcommit memory.
> > +	 * memory use of this mapping.
> >  	 */
> > -	if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER)
> > -		vm_flags |= VM_NORESERVE;
> > -	if (!accountable)
> > -		vm_flags |= VM_NORESERVE;
> > +	if ((flags & MAP_NORESERVE)) {
> > +	 	/* * We honor MAP_NORESERVE if allowed to overcommit */
> > +		if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
> > +			vm_flags |= VM_NORESERVE;
> > +
> > +		/* hugetlb applies strict overcommit unless MAP_NORESERVE */
> > +		if (file && is_file_hugepages(file))
> > +			vm_flags |= VM_NORESERVE;
> > +	}
> >  
> >  	/*
> >  	 * Private writable mapping: check memory availability
> >  	 */
> > -	if (accountable_mapping(vm_flags)) {
> > +	if (accountable_mapping(file, vm_flags)) {
> >  		charged = len >> PAGE_SHIFT;
> >  		if (security_vm_enough_memory(charged))
> >  			return -ENOMEM;
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index abe2694..258197b 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -151,10 +151,11 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> >  	/*
> >  	 * If we make a private mapping writable we increase our commit;
> >  	 * but (without finer accounting) cannot reduce our commit if we
> > -	 * make it unwritable again.
> > +	 * make it unwritable again. hugetlb mapping were accounted for
> > +	 * even if read-only so there is no need to account for them here
> >  	 */
> >  	if (newflags & VM_WRITE) {
> > -		if (!(oldflags & (VM_ACCOUNT|VM_WRITE|
> > +		if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
> >  						VM_SHARED|VM_NORESERVE))) {
> >  			charged = nrpages;
> >  			if (security_vm_enough_memory(charged))
> > 
> 

Thanks, I'll get the quota thing fixed up.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ