Message-ID: <20090211094329.GE26746@shadowen.org>
Date: Wed, 11 Feb 2009 09:43:29 +0000
From: Andy Whitcroft <apw@...onical.com>
To: Mel Gorman <mel@....ul.ie>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
Hugh Dickins <hugh@...itas.com>,
Lee Schermerhorn <Lee.Schermerhorn@...com>,
Greg KH <gregkh@...e.de>,
Maksim Yevmenkin <maksim.yevmenkin@...il.com>,
Nick Piggin <npiggin@...e.de>,
Andrew Morton <akpm@...ux-foundation.org>,
will@...wder-design.com, Rik van Riel <riel@...hat.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
Mikos Szeredi <miklos@...redi.hu>, wli@...ementarian.org
Subject: Re: [PATCH] Do not account for the address space used by hugetlbfs
using VM_ACCOUNT V2 (Was Linus 2.6.29-rc4)
On Tue, Feb 10, 2009 at 02:02:27PM +0000, Mel Gorman wrote:
> On Sun, Feb 08, 2009 at 01:01:04PM -0800, Linus Torvalds wrote:
> >
> > Another week (and a half), another -rc.
> >
> > Arch updates (sparc, blackfin), driver updates (dvb, mmc, ide), ACPI
> > updates, FS updates (ubifs, btrfs), you name it. It's all there.
> >
> > But more importantly, people really have been working on regressions, and
> > hopefully this closes a nice set of the top one, and hopefully without
> > introducing too many new ones.
> >
>
> Hugepages are currently busted due to commit
> fc8744adc870a8d4366908221508bb113d8b72ee and the problem is in
> 2.6.29-rc4. There was a bit of a discussion but it didn't get very far and
> then I went offline for the weekend. Here is a revised patch that tries to
> bring hugetlbfs more in line with what the core VM is doing by dealing with
> VM_ACCOUNT and VM_NORESERVE.
>
> =============
> Do not account for the address space used by hugetlbfs using VM_ACCOUNT
>
> When overcommit is disabled, the core VM accounts for pages used by anonymous
> shared, private mappings and special mappings. It keeps track of VMAs that
> should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
> with VM_NORESERVE.
>
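For reference, the base-page rule being described boils down to the
pre-patch accountable_mapping() check in mm/mmap.c, roughly (simplified
sketch; kernel flag definitions assumed):

	/* Charge the overcommit limit only for private, writable
	 * mappings that were not marked VM_NORESERVE. */
	static inline int accountable_mapping(unsigned int vm_flags)
	{
		return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE))
				== VM_WRITE;
	}
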
> Overcommit for hugetlbfs is much riskier than overcommit for base pages
> due to contiguity requirements. It avoids overcommitting on both shared and
> private mappings using reservation counters that are checked and updated
> during mmap(). This ensures (within limits) that hugepages exist in the
> future when faults occur; otherwise it is too easy for applications to be
> SIGKILLed.
>
> As hugetlbfs makes its own reservations in a different unit from the base
> page size, VM_ACCOUNT should never be set. Even if the units were correct,
> the usage would be double accounted by both the core VM and hugetlbfs.
> VM_NORESERVE may be set because an application can request that no reserves
> be made for hugetlbfs, at the risk of getting killed later.
>
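The unit mismatch is visible in the patch itself: the core VM charges
len >> PAGE_SHIFT base pages while hugetlbfs reserves
len >> huge_page_shift(h) huge pages. A throwaway userspace illustration,
assuming hypothetical 4 KiB base pages and 2 MiB huge pages:

#include <stdio.h>

int main(void)
{
	unsigned long len = 8UL << 20;	/* hypothetical 8 MiB mapping */

	printf("core VM would charge %lu base pages\n", len >> 12);
	printf("hugetlbfs would reserve %lu huge pages\n", len >> 21);
	return 0;
}
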
> With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
> VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings.
> This breaks the accounting for both the core VM and hugetlbfs: it can
> trigger an OOM storm when hugepage pools are too small, and lockups and
> corrupted counters otherwise. This patch brings hugetlbfs more in line with
> how the core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.
>
> Signed-off-by: Mel Gorman <mel@....ul.ie>
> ---
> fs/hugetlbfs/inode.c | 8 +++++---
> include/linux/hugetlb.h | 5 +++--
> include/linux/mm.h | 3 +--
> ipc/shm.c | 8 +++++---
> mm/fremap.c | 2 +-
> mm/hugetlb.c | 39 +++++++++++++++++++++++++--------------
> mm/mmap.c | 38 ++++++++++++++++++++++----------------
> mm/mprotect.c | 5 +++--
> 8 files changed, 65 insertions(+), 43 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 6903d37..9b800d9 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -108,7 +108,8 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>
> if (hugetlb_reserve_pages(inode,
> vma->vm_pgoff >> huge_page_order(h),
> - len >> huge_page_shift(h), vma))
> + len >> huge_page_shift(h), vma,
> + vma->vm_flags))
> goto out;
>
> ret = 0;
> @@ -947,7 +948,7 @@ static int can_do_hugetlb_shm(void)
> can_do_mlock());
> }
>
> -struct file *hugetlb_file_setup(const char *name, size_t size)
> +struct file *hugetlb_file_setup(const char *name, size_t size, int acctflag)
> {
> int error = -ENOMEM;
> struct file *file;
> @@ -981,7 +982,8 @@ struct file *hugetlb_file_setup(const char *name, size_t size)
>
> error = -ENOMEM;
> if (hugetlb_reserve_pages(inode, 0,
> - size >> huge_page_shift(hstate_inode(inode)), NULL))
> + size >> huge_page_shift(hstate_inode(inode)), NULL,
> + acctflag))
> goto out_inode;
>
> d_instantiate(dentry, inode);
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index f1d2fba..af09660 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -33,7 +33,8 @@ unsigned long hugetlb_total_pages(void);
> int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, int write_access);
> int hugetlb_reserve_pages(struct inode *inode, long from, long to,
> - struct vm_area_struct *vma);
> + struct vm_area_struct *vma,
> + int acctflags);
> void hugetlb_unreserve_pages(struct inode *inode, long offset, long freed);
>
> extern unsigned long hugepages_treat_as_movable;
> @@ -138,7 +139,7 @@ static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
>
> extern const struct file_operations hugetlbfs_file_operations;
> extern struct vm_operations_struct hugetlb_vm_ops;
> -struct file *hugetlb_file_setup(const char *name, size_t);
> +struct file *hugetlb_file_setup(const char *name, size_t, int);
> int hugetlb_get_quota(struct address_space *mapping, long delta);
> void hugetlb_put_quota(struct address_space *mapping, long delta);
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e8ddc98..3235615 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1129,8 +1129,7 @@ extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
> unsigned long flag, unsigned long pgoff);
> extern unsigned long mmap_region(struct file *file, unsigned long addr,
> unsigned long len, unsigned long flags,
> - unsigned int vm_flags, unsigned long pgoff,
> - int accountable);
> + unsigned int vm_flags, unsigned long pgoff);
>
> static inline unsigned long do_mmap(struct file *file, unsigned long addr,
> unsigned long len, unsigned long prot,
> diff --git a/ipc/shm.c b/ipc/shm.c
> index f8f69fa..05d51d2 100644
> --- a/ipc/shm.c
> +++ b/ipc/shm.c
> @@ -340,6 +340,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
> struct file * file;
> char name[13];
> int id;
> + int acctflag = 0;
>
> if (size < SHMMIN || size > ns->shm_ctlmax)
> return -EINVAL;
> @@ -364,11 +365,12 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
>
> sprintf (name, "SYSV%08x", key);
> if (shmflg & SHM_HUGETLB) {
> - /* hugetlb_file_setup takes care of mlock user accounting */
> - file = hugetlb_file_setup(name, size);
> + /* hugetlb_file_setup applies strict accounting */
> + if (shmflg & SHM_NORESERVE)
> + acctflag = VM_NORESERVE;
> + file = hugetlb_file_setup(name, size, acctflag);
Does this change semantics? At first glance it appears that you are now
implementing SHM_NORESERVE for SHM_HUGETLB where we did not previously
do so. Not that this is necessarily a bad thing, it just feels like it is
changing.
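For concreteness, the call whose behaviour appears to change is something
along these lines (userspace illustration only; the size is hypothetical
and the SHM_* fallback values are those from linux/shm.h):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
# define SHM_HUGETLB	04000	/* from linux/shm.h */
#endif
#ifndef SHM_NORESERVE
# define SHM_NORESERVE	010000	/* from linux/shm.h */
#endif

int main(void)
{
	/* With the patch, SHM_NORESERVE on a SHM_HUGETLB segment would
	 * skip the huge page reservation at shmget() time rather than
	 * being ignored as it seemingly was before. */
	int id = shmget(IPC_PRIVATE, 8UL << 20,
			IPC_CREAT | SHM_HUGETLB | SHM_NORESERVE | 0600);
	if (id < 0)
		perror("shmget");
	return 0;
}
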
> shp->mlock_user = current_user();
> } else {
> - int acctflag = 0;
> /*
> * Do not allow no accounting for OVERCOMMIT_NEVER, even
> * if it's asked for.
> diff --git a/mm/fremap.c b/mm/fremap.c
> index 736ba7f..b6ec85a 100644
> --- a/mm/fremap.c
> +++ b/mm/fremap.c
> @@ -198,7 +198,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
> flags &= MAP_NONBLOCK;
> get_file(file);
> addr = mmap_region(file, start, size,
> - flags, vma->vm_flags, pgoff, 1);
> + flags, vma->vm_flags, pgoff);
> fput(file);
> if (IS_ERR_VALUE(addr)) {
> err = addr;
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 618e983..2074642 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2269,14 +2269,12 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
>
> int hugetlb_reserve_pages(struct inode *inode,
> long from, long to,
> - struct vm_area_struct *vma)
> + struct vm_area_struct *vma,
> + int acctflag)
> {
> - long ret, chg;
> + long ret = 0, chg;
> struct hstate *h = hstate_inode(inode);
>
> - if (vma && vma->vm_flags & VM_NORESERVE)
> - return 0;
> -
> /*
> * Shared mappings base their reservation on the number of pages that
> * are already allocated on behalf of the file. Private mappings need
> @@ -2285,22 +2283,25 @@ int hugetlb_reserve_pages(struct inode *inode,
> */
> if (!vma || vma->vm_flags & VM_SHARED)
> chg = region_chg(&inode->i_mapping->private_list, from, to);
I thought the region map for a VM_SHARED mapping was meant to contain
those pages for which we already have a reservation allocated, so that
if we have overlapping VM_RESERVE and VM_NORESERVE mappings we know that
we did have a page reserved at fault time and know whether we can take
it from the reserve portion of the pool. By letting this get executed
for VM_NORESERVE mappings, that map would seem to be getting out of sync,
which doesn't sound right. I don't see any updates to the documentation
on the meaning of the reservation map if you are changing it.
Here is the commentary from mm/hugetlb.c:
* The private mapping reservation is represented in a subtly different
* manner to a shared mapping. A shared mapping has a region map associated
* with the underlying file, this region map represents the backing file
* pages which have ever had a reservation assigned which this persists even
* after the page is instantiated. A private mapping has a region map
* associated with the original mmap which is attached to all VMAs which
* reference it, this region map represents those offsets which have consumed
* reservation ie. where pages have been instantiated.
> - else {
> - struct resv_map *resv_map = resv_map_alloc();
> - if (!resv_map)
> - return -ENOMEM;
> -
> + else
> chg = to - from;
>
> - set_vma_resv_map(vma, resv_map);
> - set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
> - }
> -
> if (chg < 0)
> return chg;
>
> if (hugetlb_get_quota(inode->i_mapping, chg))
> return -ENOSPC;
> +
> + /*
> + * Only apply hugepage reservation if asked. We still have to
> + * take the filesystem quota because it is an upper limit
> + * defined for the mount and not necessarily memory as a whole
> + */
> + if (acctflag & VM_NORESERVE) {
> + reset_vma_resv_huge_pages(vma);
Why do we now need to do this in the non-reserve case? We didn't need
to do it before.
> + return 0;
> + }
> +
This also seems like a semantic change. Previously NO_RESERVE did not
take quota; now it does. NO_RESERVE used to mean that we took our
chances on there being pages available at fault time, both for quota and
in the pools. Now it means that we only risk there being no pages.
Does that not significantly change semantics? The main use case for
NO_RESERVE was mapping a massive area to expand into; this might now be
refused for quota when previously it was not.
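That use case would be something like the following from userspace
(illustration only; the hugetlbfs mount point, file name and sizes are
hypothetical):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* A large MAP_NORESERVE mapping of a hugetlbfs file, intended to
	 * be expanded into lazily.  The concern is that with this patch
	 * the mapping now consumes filesystem quota up front even though
	 * no huge page reservation is taken. */
	int fd = open("/mnt/huge/bigarea", O_CREAT | O_RDWR, 0600);
	void *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	p = mmap(NULL, 1UL << 30, PROT_READ | PROT_WRITE,
		 MAP_SHARED | MAP_NORESERVE, fd, 0);
	if (p == MAP_FAILED)
		perror("mmap");
	return 0;
}
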
> ret = hugetlb_acct_memory(h, chg);
> if (ret < 0) {
> hugetlb_put_quota(inode->i_mapping, chg);
> @@ -2308,6 +2309,16 @@ int hugetlb_reserve_pages(struct inode *inode,
> }
> if (!vma || vma->vm_flags & VM_SHARED)
> region_add(&inode->i_mapping->private_list, from, to);
> + else {
> + struct resv_map *resv_map = resv_map_alloc();
> +
> + if (!resv_map)
> + return -ENOMEM;
> +
> + set_vma_resv_map(vma, resv_map);
> + set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
> + }
> +
> return 0;
> }
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 214b6a2..6f70db8 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -918,7 +918,6 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
> struct inode *inode;
> unsigned int vm_flags;
> int error;
> - int accountable = 1;
> unsigned long reqprot = prot;
>
> /*
> @@ -1019,8 +1018,6 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
> return -EPERM;
> vm_flags &= ~VM_MAYEXEC;
> }
> - if (is_file_hugepages(file))
> - accountable = 0;
>
> if (!file->f_op || !file->f_op->mmap)
> return -ENODEV;
> @@ -1053,8 +1050,7 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
> if (error)
> return error;
>
> - return mmap_region(file, addr, len, flags, vm_flags, pgoff,
> - accountable);
> + return mmap_region(file, addr, len, flags, vm_flags, pgoff);
> }
> EXPORT_SYMBOL(do_mmap_pgoff);
>
> @@ -1092,17 +1088,23 @@ int vma_wants_writenotify(struct vm_area_struct *vma)
>
> /*
> * We account for memory if it's a private writeable mapping,
> - * and VM_NORESERVE wasn't set.
> + * not hugepages and VM_NORESERVE wasn't set.
> */
> -static inline int accountable_mapping(unsigned int vm_flags)
> +static inline int accountable_mapping(struct file *file, unsigned int vm_flags)
> {
> + /*
> + * hugetlb has its own accounting separate from the core VM
> + * VM_HUGETLB may not be set yet so we cannot check for that flag.
> + */
> + if (file && is_file_hugepages(file))
> + return 0;
> +
> return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
> }
>
> unsigned long mmap_region(struct file *file, unsigned long addr,
> unsigned long len, unsigned long flags,
> - unsigned int vm_flags, unsigned long pgoff,
> - int accountable)
> + unsigned int vm_flags, unsigned long pgoff)
> {
> struct mm_struct *mm = current->mm;
> struct vm_area_struct *vma, *prev;
> @@ -1128,18 +1130,22 @@ munmap_back:
>
> /*
> * Set 'VM_NORESERVE' if we should not account for the
> - * memory use of this mapping. We only honor MAP_NORESERVE
> - * if we're allowed to overcommit memory.
> + * memory use of this mapping.
> */
> - if ((flags & MAP_NORESERVE) && sysctl_overcommit_memory != OVERCOMMIT_NEVER)
> - vm_flags |= VM_NORESERVE;
> - if (!accountable)
> - vm_flags |= VM_NORESERVE;
> + if ((flags & MAP_NORESERVE)) {
> + /* We honor MAP_NORESERVE if allowed to overcommit */
> + if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
> + vm_flags |= VM_NORESERVE;
> +
> + /* hugetlb applies strict overcommit unless MAP_NORESERVE */
> + if (file && is_file_hugepages(file))
> + vm_flags |= VM_NORESERVE;
> + }
>
> /*
> * Private writable mapping: check memory availability
> */
> - if (accountable_mapping(vm_flags)) {
> + if (accountable_mapping(file, vm_flags)) {
> charged = len >> PAGE_SHIFT;
> if (security_vm_enough_memory(charged))
> return -ENOMEM;
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index abe2694..258197b 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -151,10 +151,11 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
> /*
> * If we make a private mapping writable we increase our commit;
> * but (without finer accounting) cannot reduce our commit if we
> - * make it unwritable again.
> + * make it unwritable again. hugetlb mappings were accounted for
> + * even if read-only so there is no need to account for them here.
> */
> if (newflags & VM_WRITE) {
> - if (!(oldflags & (VM_ACCOUNT|VM_WRITE|
> + if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
> VM_SHARED|VM_NORESERVE))) {
> charged = nrpages;
> if (security_vm_enough_memory(charged))
>
-apw