lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Tue, 29 May 2012 10:35:46 -0700
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Fengguang Wu <fengguang.wu@...el.com>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	"Myklebust, Trond" <Trond.Myklebust@...app.com>,
	linux-fsdevel@...r.kernel.org,
	Linux Memory Management List <linux-mm@...ck.org>
Subject: Re: write-behind on streaming writes

On Tue, May 29, 2012 at 8:57 AM, Fengguang Wu <fengguang.wu@...el.com> wrote:
>
> Actually O_SYNC is pretty close to the below code for the purpose of
> limiting the dirty and writeback pages, except that it's not on by
> default, hence means nothing for normal users.

Absolutely not.

O_SYNC syncs the *current* write, syncs your metadata, and just
generally makes your writer synchronous. It's just a f*cking moronic
idea. Nobody sane ever uses it, since you are much better off just
using fsync() if you want that kind of behavior. That's one of those
"stupid legacy flags" things that have no sane use.

The whole point is that doing that is never the right thing to do. You
want to sync *past* writes, and you never ever want to wait on them
unless you just sent more (newer) writes to the disk that you are
*not* waiting on - so that you always have more IO pending.

O_SYNC is the absolutely anti-thesis of that kind of "multiple levels
of overlapping IO". Because it requires that the IO is _done_ by the
time you start more, which is against the whole point.

> It seems to me all about optimizing the 1-dd case for desktop users,
> and the most beautiful thing about per-file write behind is, it keeps
> both the number of dirty and writeback pages low in the system when
> there are only one or two sequential dirtier tasks. Which is good for
> responsiveness.

Yes, but I don't think it's about a single-dd case - it's about just
trying to handle one common case (streaming writes) efficiently and
naturally. Try to get those out of the system so that you can then
worry about the *other* cases knowing that they don't have that kind
of big streaming behavior.

For example, right now our main top-level writeback logic is *not*
about streaming writes (just dirty counts), but then we try to "find"
the locality by making the lower-level writeback do the whole "write
back by chunking inodes" without really having any higher-level
information.

I just suspect that we'd be better off teaching upper levels about the
streaming. I know for a fact that if I do it by hand, system
responsiveness was *much* better, and IO throughput didn't go down at
all.

> Note that the above user space code won't work well when there are 10+
> dirtier tasks. It effectively creates 10+ IO submitters on different
> regions of the disk and thus create lots of seeks.

Not really much more than our current writeback code does. It
*schedules* data for writing, but doesn't wait for it until much
later.

You seem to think it was synchronous. It's not. Look at the second
sync_file_range() thing, and the important part is the "index-1". The
fact that you confused this with O_SYNC seems to be the same thing.
This has absolutely *nothing* to do with O_SYNC.

The other important part is that the chunk size is fairly large. We do
read-ahead in 64k kind of things, to make sense the write-behind
chunking needs to be in "multiple megabytes".  8MB is probably the
minimum size it makes sense.

The write-behind would be for things like people writing disk images
and video files. Not for random IO in smaller chunks.

                       Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ