lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Wed, 7 Mar 2012 16:27:20 -0800 From: Andrew Morton <akpm@...ux-foundation.org> To: David Gibson <david@...son.dropbear.id.au> Cc: hughd@...gle.com, paulus@...ba.org, linux-kernel@...r.kernel.org, Andrew Barry <abarry@...y.com>, Mel Gorman <mgorman@...e.de>, Minchan Kim <minchan.kim@...il.com>, Hillf Danton <dhillf@...il.com> Subject: Re: [PATCH 2/2] hugepages: Fix use after free bug in "quota" handling On Wed, 7 Mar 2012 15:48:14 +1100 David Gibson <david@...son.dropbear.id.au> wrote: > hugetlbfs_{get,put}_quota() are badly named. They don't interact with the > general quota handling code, and they don't much resemble its behaviour. > Rather than being about maintaining limits on on-disk block usage by > particular users, they are instead about maintaining limits on in-memory > page usage (including anonymous MAP_PRIVATE copied-on-write pages) > associated with a particular hugetlbfs filesystem instance. > > Worse, they work by having callbacks to the hugetlbfs filesystem code from > the low-level page handling code, in particular from free_huge_page(). > This is a layering violation of itself, but more importantly, if the kernel > does a get_user_pages() on hugepages (which can happen from KVM amongst > others), then the free_huge_page() can be delayed until after the > associated inode has already been freed. If an unmount occurs at the > wrong time, even the hugetlbfs superblock where the "quota" limits are > stored may have been freed. > > Andrew Barry proposed a patch to fix this by having hugepages, instead of > storing a pointer to their address_space and reaching the superblock from > there, had the hugepages store pointers directly to the superblock, bumping > the reference count as appropriate to avoid it being freed. Andrew Morton > rejected that version, however, on the grounds that it made the existing > layering violation worse. > > This is a reworked version of Andrew's patch, which removes the extra, and > some of the existing, layering violation. It works by introducing the > concept of a hugepage "subpool" at the lower hugepage mm layer - that is > a finite logical pool of hugepages to allocate from. hugetlbfs now creates > a subpool for each filesystem instance with a page limit set, and a pointer > to the subpool gets added to each allocated hugepage, instead of the > address_space pointer used now. The subpool has its own lifetime and is > only freed once all pages in it _and_ all other references to it (i.e. > superblocks) are gone. > > subpools are optional - a NULL subpool pointer is taken by the code to mean > that no subpool limits are in effect. > > Previous discussion of this bug found in: "Fix refcounting in hugetlbfs > quota handling.". See: https://lkml.org/lkml/2011/8/11/28 or > http://marc.info/?l=linux-mm&m=126928970510627&w=1 > > v2: Fixed a bug spotted by Hillf Danton, and removed the extra parameter to > alloc_huge_page() - since it already takes the vma, it is not necessary. Looks good - thanks for doing this. Some comments, nothing serious: > > ... > > @@ -896,38 +910,12 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent) > sb->s_root = root; > return 0; > out_free: > + if (sbinfo->spool) > + kfree(sbinfo->spool); kfree(NULL) is OK, and omitting the NULL test is idiomatic. > kfree(sbinfo); > return -ENOMEM; > } > > > ... > > --- a/include/linux/hugetlb.h > +++ b/include/linux/hugetlb.h > @@ -14,6 +14,15 @@ struct user_struct; > #include <linux/shm.h> > #include <asm/tlbflush.h> > > +struct hugepage_subpool { > + spinlock_t lock; > + long count; > + long max_hpages, used_hpages; > +}; Could we get this struct docmented please? Why it exists, what it is used for. Basically a copy-n-paste from the (nice) changelog. Also please doucment the fields. I'm sitting here reading the code trying to work out what these three fields semantically *mean*. Armed with enough information to understand the code, I then get to read it all again :( I don't think any of these things can go negative, and a negative page count is absurd. So perhaps they should have an unsigned type. And maybe a 32-bit type? How much memory is 2^32 hugepages, and does the present code even work correctly in that situation? > +struct hugepage_subpool *hugepage_new_subpool(long nr_blocks); > +void hugepage_put_subpool(struct hugepage_subpool *spool); > + > int PageHuge(struct page *page); > > > ... > > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -53,6 +53,84 @@ static unsigned long __initdata default_hstate_size; > */ > static DEFINE_SPINLOCK(hugetlb_lock); > > +static inline void unlock_or_release_subpool(struct hugepage_subpool *spool) Strange name. unlock_and_release() would be closer? > +{ > + bool free = (spool->count == 0) && (spool->used_hpages == 0); I wish I knew what those fields do. > + spin_unlock(&spool->lock); > + > + /* If no pages are used, and no other handles to the subpool > + * remain, free the subpool the subpool remain */ Comment is mangled. Please use the normal multiline comment layout? There are examples in this file. > + if (free) > + kfree(spool); > +} > + > +struct hugepage_subpool *hugepage_new_subpool(long nr_blocks) nr_blocks is logically an unsigned thing. And maybe 32-bit. > +{ > + struct hugepage_subpool *spool; > + > + spool = kmalloc(sizeof(*spool), GFP_KERNEL); > + if (!spool) > + return NULL; > + > + spin_lock_init(&spool->lock); > + spool->count = 1; > + spool->max_hpages = nr_blocks; > + spool->used_hpages = 0; > + > + return spool; > +} hm, I wonder why we didn't support a modular hugetlbfs driver. Perhaps because of core hugetlb code which fiddles around in hugetlbfs. I wonder if this patch fixed that? > +void hugepage_put_subpool(struct hugepage_subpool *spool) > +{ > + spin_lock(&spool->lock); > + BUG_ON(!spool->count); > + spool->count--; > + unlock_or_release_subpool(spool); > +} ah-hah! ->count is in fact a refcount? Then why didn't we call it "refcount" to save me all that head scratching? Here I was thinking it must count hugepages. > +static int hugepage_subpool_get_pages(struct hugepage_subpool *spool, > + long delta) > +{ > + int ret = 0; > + > + if (!spool) > + return 0; > + > + spin_lock(&spool->lock); > + if ((spool->used_hpages + delta) <= spool->max_hpages) { > + spool->used_hpages += delta; > + } else { > + ret = -ENOMEM; > + } Can lose all the braces here. > + spin_unlock(&spool->lock); > + > + return ret; > +} And some documeentation would be nice. It's strange to have a foo_get_pages() which doesn't get any pages! And what does it mean when spool==0? Why would anyone call this function without a subpool? `delta' could be an unsigned int? > +static void hugepage_subpool_put_pages(struct hugepage_subpool *spool, > + long delta) > +{ > + if (!spool) > + return; > + > + spin_lock(&spool->lock); > + spool->used_hpages -= delta; > + /* If hugetlbfs_put_super couldn't free spool due to > + * an outstanding quota reference, free it now. */ This is the sole remaining use of the term "quota" in hugetlb (yay). Can we make this one go away as well? > + unlock_or_release_subpool(spool); > +} > + > +static inline struct hugepage_subpool *subpool_inode(struct inode *inode) > +{ > + return HUGETLBFS_SB(inode->i_sb)->spool; > +} > + > +static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma) > +{ > + return subpool_inode(vma->vm_file->f_dentry->d_inode); > +} > + > /* > * Region tracking -- allows tracking of reservations and instantiated pages > * across the pages in a mapping. > @@ -533,9 +611,9 @@ static void free_huge_page(struct page *page) > */ > struct hstate *h = page_hstate(page); > int nid = page_to_nid(page); > - struct address_space *mapping; > + struct hugepage_subpool *spool = > + (struct hugepage_subpool *)page_private(page); Could do struct hugepage_subpool *spool; spool = (struct hugepage_subpool *)page_private(page); and save the Ingo-upsetting 80-col trickery. I'm shocked that we weren't already using the .private field of hugetlb head pages. > - mapping = (struct address_space *) page_private(page); > set_page_private(page, 0); > page->mapping = NULL; > BUG_ON(page_count(page)); > @@ -551,8 +629,7 @@ static void free_huge_page(struct page *page) > enqueue_huge_page(h, page); > } > spin_unlock(&hugetlb_lock); > - if (mapping) > - hugetlb_put_quota(mapping, 1); > + hugepage_subpool_put_pages(spool, 1); > } > > static void prep_new_huge_page(struct hstate *h, struct page *page, int nid) > > ... > > @@ -2316,7 +2395,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma, > */ > address = address & huge_page_mask(h); > pgoff = vma_hugecache_offset(h, vma, address); > - mapping = (struct address_space *)page_private(page); > + mapping = vma->vm_file->f_dentry->d_inode->i_mapping; Does mapping = vma->vm_file->f_mapping; work? Avoiding all this pointer hopping is why we added f_mapping, actually. > /* > * Take the mapping lock for the duration of the table walk. As > > ... > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@...r.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists