linux-kernel - Re: msync(2) bug(?), returns AOP_WRITEPAGE

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <200710282023.l9SKNKK0031790@agora.fsl.cs.sunysb.edu>
Date:	Sun, 28 Oct 2007 16:23:20 -0400
From:	Erez Zadok <ezk@...sunysb.edu>
To:	Hugh Dickins <hugh@...itas.com>
Cc:	Erez Zadok <ezk@...sunysb.edu>,
	Pekka Enberg <penberg@...helsinki.fi>,
	Ryan Finnie <ryan@...nie.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	cjwatson@...ntu.com, linux-mm@...ck.org
Subject: Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland 

Huge,

I took your advise regarding ~(__GFP_FS|__GFP_IO), AOP_WRITEPAGE_ACTIVATE,
and such.  I revised my unionfs_writepage and unionfs_sync_page, and tested
it under memory pressure: I have a couple of live CDs that use tmpfs and can
deterministically reproduce the conditions resulting in A_W_A.  I also went
back to using grab_cache_page but with the gfp_mask suggestions you made.

I'm happy to report that it all works great now!  Below is the entirety of
the new unionfs_mmap and unionfs_sync_page code.  I'd appreciate if you and
others can look it over and see if you find any problems.

Thanks,
Erez.


static int unionfs_writepage(struct page *page, struct writeback_control *wbc)
{
	int err = -EIO;
	struct inode *inode;
	struct inode *lower_inode;
	struct page *lower_page;
	char *kaddr, *lower_kaddr;
	struct address_space *mapping; /* lower inode mapping */
	gfp_t old_gfp_mask;

	inode = page->mapping->host;
	lower_inode = unionfs_lower_inode(inode);
	mapping = lower_inode->i_mapping;

	/*
	 * find lower page (returns a locked page)
	 *
	 * We turn off __GFP_IO|__GFP_FS so as to prevent a deadlock under
	 * memory pressure conditions.  This is similar to how the loop
	 * driver behaves (see loop_set_fd in drivers/block/loop.c).
	 * If we can't find the lower page, we redirty our page and return
	 * "success" so that the VM will call us again in the (hopefully
	 * near) future.
	 */
	old_gfp_mask = mapping_gfp_mask(mapping);
	mapping_set_gfp_mask(mapping, old_gfp_mask & ~(__GFP_IO|__GFP_FS));

	lower_page = grab_cache_page(mapping, page->index);
	mapping_set_gfp_mask(mapping, old_gfp_mask);

	if (!lower_page) {
		err = 0;
		set_page_dirty(page);
		goto out;
	}

	/* get page address, and encode it */
	kaddr = kmap(page);
	lower_kaddr = kmap(lower_page);

	memcpy(lower_kaddr, kaddr, PAGE_CACHE_SIZE);

	kunmap(page);
	kunmap(lower_page);

	BUG_ON(!mapping->a_ops->writepage);

	/* call lower writepage (expects locked page) */
	clear_page_dirty_for_io(lower_page); /* emulate VFS behavior */
	err = mapping->a_ops->writepage(lower_page, wbc);

	/* b/c grab_cache_page locked it and ->writepage unlocks on success */
	if (err)
		unlock_page(lower_page);
	/* b/c grab_cache_page increased refcnt */
	page_cache_release(lower_page);

	if (err < 0) {
		ClearPageUptodate(page);
		goto out;
	}
	/*
	 * Lower file systems such as ramfs and tmpfs, may return
	 * AOP_WRITEPAGE_ACTIVATE so that the VM won't try to (pointlessly)
	 * write the page again for a while.  But those lower file systems
	 * also set the page dirty bit back again.  Since we successfully
	 * copied our page data to the lower page, then the VM will come
	 * back to the lower page (directly) and try to flush it.  So we can
	 * save the VM the hassle of coming back to our page and trying to
	 * flush too.  Therefore, we don't re-dirty our own page, and we
	 * don't return AOP_WRITEPAGE_ACTIVATE back to the VM (we consider
	 * this a success).
	 */
	if (err == AOP_WRITEPAGE_ACTIVATE)
		err = 0;

	/* all is well */
	SetPageUptodate(page);
	/* lower mtimes has changed: update ours */
	unionfs_copy_attr_times(inode);

	unlock_page(page);

out:
	return err;
}


static void unionfs_sync_page(struct page *page)
{
	struct inode *inode;
	struct inode *lower_inode;
	struct page *lower_page;
	struct address_space *mapping; /* lower inode mapping */
	gfp_t old_gfp_mask;

	inode = page->mapping->host;
	lower_inode = unionfs_lower_inode(inode);
	mapping = lower_inode->i_mapping;

	/*
	 * Find lower page (returns a locked page).
	 *
	 * We turn off __GFP_IO|__GFP_FS so as to prevent a deadlock under
	 * memory pressure conditions.  This is similar to how the loop
	 * driver behaves (see loop_set_fd in drivers/block/loop.c).
	 * If we can't find the lower page, we redirty our page and return
	 * "success" so that the VM will call us again in the (hopefully
	 * near) future.
	 */
	old_gfp_mask = mapping_gfp_mask(mapping);
	mapping_set_gfp_mask(mapping, old_gfp_mask & ~(__GFP_IO|__GFP_FS));

	lower_page = grab_cache_page(mapping, page->index);
	mapping_set_gfp_mask(mapping, old_gfp_mask);

	if (!lower_page) {
		printk(KERN_ERR "unionfs: grab_cache_page failed\n");
		goto out;
	}

	/* do the actual sync */

	/*
	 * XXX: can we optimize ala RAIF and set the lower page to be
	 * discarded after a successful sync_page?
	 */
	if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
		mapping->a_ops->sync_page(lower_page);

	/* b/c grab_cache_page locked it */
	unlock_page(lower_page);
	/* b/c grab_cache_page increased refcnt */
	page_cache_release(lower_page);

out:
	return;
}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/