[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20080611150845.GA21910@skywalker>
Date: Wed, 11 Jun 2008 20:38:45 +0530
From: "Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: cmm@...ibm.com, jack@...e.cz, linux-ext4@...r.kernel.org,
linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] ext2: Use page_mkwrite vma_operations to get mmap
write notification.
On Thu, Jun 05, 2008 at 12:30:45PM -0700, Andrew Morton wrote:
> On Thu, 5 Jun 2008 22:35:12 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@...ux.vnet.ibm.com> wrote:
>
> > We would like to get notified when we are doing a write on mmap
> > section. The changes are needed to handle ENOSPC when writing to an
> > mmap section of files with holes.
> >
>
> Whoa. You didn't copy anything like enough mailing lists for a change
> of this magnitude. I added some.
>
> This is a large change in behaviour!
>
> a) applications will now get a synchronous SIGBUS when modifying a
> page over an ENOSPC filesystem. Whereas previously they could have
> proceeded to completion and then detected the error via an fsync().
Or not detect the error at all if we don't call fsync() right ? Isn't a
synchronous SIGBUS the right behaviour ?
>
> It's going to take more than one skimpy little paragraph to
> justify this, and to demonstrate that it is preferable, and to
> convince us that nothing will break from this user-visible behaviour
> change.
>
> b) we're now doing fs operations (and some I/O) in the pagefault
> code. This has several implications:
>
> - performance changes
>
> - potential for deadlocks when a process takes the fault from
> within a copy_to_user() in, say, mm/filemap.c
>
> - performing additional memory allocations within that
> copy_to_user(). Possibility that these will reenter the
> filesystem.
>
> And that's just ext2.
>
> For ext3 things are even more complex, because we have the
> journal_start/journal_end pair which is effectively another "lock" for
> ranking/deadlock purposes. And now we're taking i_alloc_sem and
> lock_page and we're doing ->writepage() and its potential
> journal_start(), all potentially within the context of a
> copy_to_user().
One of the reason why we would need this in ext3/ext4 is that we cannot
do block allocation in the writepage with the recent locking changes.
The locking changes involve changing the locking order of journal_start
and page_lock. With writepage we are already called with page_lock and
we can't start new transaction needed for block allocation.
But if we agree that we should not do block allocation in page_mkwrite
we need to add writepages and allocate blocks in writepages.
>
> Now, things become easier because copy_to_user() only happens on the
> read() side of things, where we don't hold lock_page() and things are
> generally simpler.
>
> But still, this is a high-risk change. I think we should require a lot
> of convincing that issues such as the above have been suitably
> considered and addressed, and that the change has had *intense*
> testing.
>
> > index 47d88da..cc2e106 100644
> > --- a/fs/ext2/ext2.h
> > +++ b/fs/ext2/ext2.h
> > @@ -136,6 +136,7 @@ extern void ext2_get_inode_flags(struct ext2_inode_info *);
> > int __ext2_write_begin(struct file *file, struct address_space *mapping,
> > loff_t pos, unsigned len, unsigned flags,
> > struct page **pagep, void **fsdata);
> > +extern int ext2_page_mkwrite(struct vm_area_struct *vma, struct page *page);
> >
> > /* ioctl.c */
> > extern long ext2_ioctl(struct file *, unsigned int, unsigned long);
> > diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> > index 5f2fa9c..d539dcf 100644
> > --- a/fs/ext2/file.c
> > +++ b/fs/ext2/file.c
> > @@ -18,6 +18,7 @@
> > * (jj@...site.ms.mff.cuni.cz)
> > */
> >
> > +#include <linux/mm.h>
> > #include <linux/time.h>
> > #include "ext2.h"
> > #include "xattr.h"
> > @@ -38,6 +39,24 @@ static int ext2_release_file (struct inode * inode, struct file * filp)
> > return 0;
> > }
> >
> > +static struct vm_operations_struct ext2_file_vm_ops = {
> > + .fault = filemap_fault,
> > + .page_mkwrite = ext2_page_mkwrite,
> > +};
> > +
> > +static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > + struct address_space *mapping = file->f_mapping;
> > +
> > + if (!mapping->a_ops->readpage)
> > + return -ENOEXEC;
>
> this copied-and-pasted test can now be removed.
>
> > + file_accessed(file);
> > + vma->vm_ops = &ext2_file_vm_ops;
> > + vma->vm_flags |= VM_CAN_NONLINEAR;
> > + return 0;
> > +}
> > +
> > +
> > /*
> > * We have mostly NULL's here: the current defaults are ok for
> > * the ext2 filesystem.
> > @@ -52,7 +71,7 @@ static int ext2_release_file (struct inode * inode, struct file * filp)
> > #ifdef CONFIG_COMPAT
> > .compat_ioctl = ext2_compat_ioctl,
> > #endif
> > - .mmap = generic_file_mmap,
> > + .mmap = ext2_file_mmap,
> > .open = generic_file_open,
> > .release = ext2_release_file,
> > .fsync = ext2_sync_file,
> > diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> > index 384fc0d..d4c5c23 100644
> > --- a/fs/ext2/inode.c
> > +++ b/fs/ext2/inode.c
> > @@ -1443,3 +1443,8 @@ int ext2_setattr(struct dentry *dentry, struct iattr *iattr)
> > error = ext2_acl_chmod(inode);
> > return error;
> > }
> > +
> > +int ext2_page_mkwrite(struct vm_area_struct *vma, struct page *page)
> > +{
> > + return block_page_mkwrite(vma, page, ext2_get_block);
> > +}
> > --
-aneesh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists