Date:	Sat, 8 Sep 2007 16:28:27 +1000
From:	Nick Piggin <nickpiggin@...oo.com.au>
To:	Goswin von Brederlow <brederlo@...ormatik.uni-tuebingen.de>
Cc:	Bernd Schubert <bs@...eap.de>,
	Randy Dunlap <randy.dunlap@...cle.com>,
	linux-kernel@...r.kernel.org,
	"J. Bruce Fields" <bfields@...ldses.org>, brian@...sterfs.com
Subject: Re: patch: improve generic_file_buffered_write() (2nd try 1/2)

On Saturday 08 September 2007 06:01, Goswin von Brederlow wrote:
> Nick Piggin <nickpiggin@...oo.com.au> writes:
> > On Thursday 06 September 2007 03:41, Bernd Schubert wrote:
> > Minor nit: when resubmitting a patch, you should include everything
> > (ie. the full changelog of problem statement and fix description) in a
> > single mail. It's just a bit easier...
>
> Will do next time.
>
> > So I believe the problem is that for a multi-segment iovec, we currently
> > prepare_write/commit_write once for each segment, right? We do this
>
> It is more complex.
>
> Currently a __grab_cache_page, a_ops->prepare_write,
> filemap_copy_from_user[_iovec] and a_ops->commit_write is done
> whenever we hit
>
>   a) a page boundary

This is required by the prepare_write/commit_write API. The
write_begin/write_end API is also page-based, but in future we are looking
at a more general API; we haven't completely decided on the form yet.
"perform_write" is one proposal you can look for.
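
Roughly, the current loop is (a simplified sketch of 2.6.23's
mm/filemap.c, not the exact code; error handling and the iovec
advancing are left out, and __grab_cache_page()'s extra caching
arguments are trimmed):

while (count) {
        index = pos >> PAGE_CACHE_SHIFT;
        offset = pos & (PAGE_CACHE_SIZE - 1);
        bytes = min(PAGE_CACHE_SIZE - offset, count);   /* a) page clamp */

        page = __grab_cache_page(mapping, index);       /* returned locked */
        status = a_ops->prepare_write(file, page, offset, offset + bytes);
        copied = filemap_copy_from_user(page, offset, buf, bytes);
        status = a_ops->commit_write(file, page, offset, offset + bytes);
        unlock_page(page);
        page_cache_release(page);

        pos += copied;
        buf += copied;
        count -= copied;
}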


>   b) a segment boundary

This is done, as I said, because of the deadlock issue. While the issue is
more completely fixed in -mm, a special case for kernel memory (e.g. nfsd)
is in the latest mainline kernels.
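
The special case is along these lines (a sketch of the idea rather
than the exact commit):

/*
 * Only clamp the copy to the current segment (and prefault it) for
 * userspace writes; kernel callers such as nfsd pass down pages that
 * are already resident and may safely span segments.
 */
if (!segment_eq(get_fs(), KERNEL_DS)) {
        bytes = min(bytes, cur_iov->iov_len - iov_base);  /* b) segment clamp */
        fault_in_pages_readable(buf, bytes);
}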


> Those two cases don't have to, and from the stats basically never,
> coincide. For NFSd this means we do this TWICE per segment and TWICE
> per page.

The page boundary doesn't matter so much (well, it does for other reasons,
but we've never been good at them...). The segment boundary means that we
aren't able to do block-sized writes very well and end up doing a lot of
read-modify-write operations that could be avoided.
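
To spell out the read-modify-write: if the range handed to
->prepare_write() doesn't cover a whole block, a block-based filesystem
has to read that block in before the copy. Roughly (a sketch of the
per-buffer test in __block_prepare_write(); in the real code the wait
happens after all the reads have been issued):

if (!buffer_uptodate(bh) &&
    (block_start < from || block_end > to)) {
        ll_rw_block(READ, 1, &bh);      /* the "read" in read-modify-write */
        wait_on_buffer(bh);
}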


> > because there is a nasty deadlock in the VM (copy_from_user being
> > called with a page locked), and copying multiple segs dramatically
> > increases the chances that one of these copies will cause a page fault
> > and thus potentially deadlock.
>
> What actually locks the page? Is it __grab_cache_page or
> a_ops->prepare_write?

prepare_write must be given a locked page.
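
So the deadlock has this shape (sketch only):

page = __grab_cache_page(mapping, index);       /* page comes back locked */
a_ops->prepare_write(file, page, offset, offset + bytes);
copy_from_user(kaddr + offset, buf, bytes);
/*
 * If buf faults and the faulting address happens to be a mapping of
 * the very page we hold locked, the fault handler blocks on
 * lock_page() of that page -> deadlock.
 */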


> Note that the patch does not change the number of copy_from_user calls
> being made nor does it change their arguments. If we need 2 (or more)
> segments to fill a page we still do 2 separate calls to
> filemap_copy_from_user_iovec, both only spanning (part of) one
> segment.
>
> What the patch changes is the number of copy_from_user calls between
> __grab_cache_page and a_ops->commit_write.

So you're doing all copy_from_user calls within a prepare_write? Then
you're increasing the chances of deadlock. If not, then you're breaking
the API contract.


> Copying a full PAGE_SIZE bytes from multiple segments in one go would
> be a further improvement if that is possible.
>
> > The fix you have I don't think can work because a filesystem must be
> > notified of the modification _before_ it has happened. (If I understand
> > correctly, you are skipping the prepare_write potentially until after
> > some data is copied?).
>
> Yes. We changed the order of copy_from_user calls and
> a_ops->prepare_write by mistake. We will rectify that and do the
> prepare_write for the full page (when possible) before copying the
> data into the page.

OK, that is what used to be done, but the API is broken due to this
deadlock. write_begin/write_end fixes it properly.
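
For reference, the -mm code copies with page faults disabled and
retries, roughly like this (a simplified sketch of the
generic_perform_write() loop from the write_begin/write_end patches):

again:
        /* fault the source in while no page lock is held */
        iov_iter_fault_in_readable(&i, bytes);

        a_ops->write_begin(file, mapping, pos, bytes, flags, &page, &fsdata);

        pagefault_disable();
        copied = iov_iter_copy_from_user_atomic(page, &i, offset, bytes);
        pagefault_enable();

        a_ops->write_end(file, mapping, pos, bytes, copied, page, fsdata);

        if (unlikely(copied == 0)) {
                /*
                 * The source got unmapped under us.  Shrink to a single
                 * segment and retry; the atomic copy cannot sleep, so
                 * the old deadlock cannot happen.
                 */
                bytes = min(bytes, iov_iter_single_seg_count(&i));
                goto again;
        }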


> > Anyway, there are fixes for this deadlock in Andrew's -mm tree, but
> > also a workaround for the NFSD problem in git commit 29dbb3fc. Did
> > you try a later kernel to see if it is fixed there?
>
> Later than 2.6.23-rc5?

No, it would be included earlier. The "segment_eq" check should be
allowing kernel writes (nfsd) to write multiple segments. If you have a
patch that changes this significantly, it would indicate the existing
logic has a problem (or you've got a userspace application doing
the writev, which should be fixed by the write_begin patches in -mm).
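
For completeness, the userspace case means something like this
hypothetical test, where a single writev() carries several segments
(with current mainline each segment is still copied separately):

#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>

int main(void)
{
        static char a[3000], b[3000];   /* two segments, together crossing a page */
        struct iovec iov[2] = {
                { .iov_base = a, .iov_len = sizeof(a) },
                { .iov_base = b, .iov_len = sizeof(b) },
        };
        int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return 1;
        memset(a, 'a', sizeof(a));
        memset(b, 'b', sizeof(b));
        writev(fd, iov, 2);             /* one syscall, two segments */
        return 0;
}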
