[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <50A52133.9050204@cs.utoronto.ca>
Date: Thu, 15 Nov 2012 12:06:59 -0500
From: Ryan Johnson <ryan.johnson@...utoronto.ca>
To: General Discussion of SQLite Database <sqlite-users@...ite.org>
CC: Vladislav Bolkhovitin <vst@...b.net>,
Nico Williams <nico@...ptonector.com>,
linux-fsdevel@...r.kernel.org, "Theodore Ts'o" <tytso@....edu>,
linux-kernel <linux-kernel@...r.kernel.org>,
Richard Hipp <drh@...ci.com>
Subject: Re: [sqlite] light weight write barriers
On 14/11/2012 8:17 PM, Vladislav Bolkhovitin wrote:
> Nico Williams, on 11/13/2012 02:13 PM wrote:
>> declaring groups of internally-unordered writes where the groups are
>> ordered with respect to each other... is practically the same as
>> barriers.
>
> Which barriers? Barriers meaning cache flush or barriers meaning
> commands order, or barriers meaning both?
>
> There's no such thing as "barrier". It is fully artificial
> abstraction. After all, at the bottom of your stack, you will have to
> translate it either to cache flush, or commands order enforcement, or
> both.
Isn't that why we *have* "the stack" in the first place? So apps
*don't* have to worry about how the OS implements an artificial (=
high-level and portable) abstraction on a given device?
>
> Are you going to invent 3 types of barriers?
One will do, it just needs to be a good one.
Maybe I'm missing something here, so I'm going to back up a bit and
recap what I understand.
The filesystem abstracts the concept of encoding patterns of bits in
some physical media (data), and making it easy to find and retrieve
those bits later (metadata, incl. file name). When users read(), they
expect to see whatever they most recently sent to write(). They also
expect that what they write will still be there later, in spite of any
failure that leaves the disk itself intact.
Operating systems cheat by not actually writing to disk -- for
performance reasons -- and users are (mostly, usually) OK with that,
because the performance gains are so attractive and things usually work
out anyway. Disks cheat too, in the same way and for the same reason.
The cheating works great most of the time, but breaks down -- badly --
if we actually care about what is on disk after a crash (or if we use a
network filesystem). Enough people do care that fsync() was added to the
toolbox. It is defined to transfer "all modified in-core data of the
file referred to by the file descriptor fd to the disk device" and
"blocks until the device reports that the transfer has completed"
(quoting from the fsync(2) man page). Translation: "Stop cheating. Make
sure the stuff I already wrote actually got written. And tell the disk
to stop cheating, too."
Problem is, this definition is asymmetric: it says what happens to
writes issued before the fsync, but nothing about those issued after the
fsync starts and before it returns [1]. The reader has to assume
fsync() makes no promises whatsoever about these later writes: making
fsync capture them exposes callers of fsync() to DoS attacks, and them
from reaching disk until all outstanding fsync calls complete would add
complexity the spec doesn't currently demand, leading to understandable
reluctance by kernel devs to code it up. Unfortunately, we're left with
the filesystem equivalent of what we in the database world call
"eventual consistency" -- easy to implement, nice and fast, but very
difficult to write reliable code against unless you're willing to pay
the cost of being fully synchronous, all the time. Having tried that for
a few years, many people are "returning" to better-specified concurrency
models, trading some amount of performance for comfort that the app will
at least work predictably when things go wrong in strange and
unanticipated ways.
The request, then, is to tighten up fsync semantics in two conceptually
straightforward ways [2]: First, guarantee that later writes to an fd do
not hit disk until earlier calls to fsync() complete. Second, make the
call asynchronous. That's all.
Note that both changes are necessary. The improved ordering semantic
useless by itself, because it's still not safe to request a blocking
fsync from one thread and and then let other threads continue issuing
writes: there's a race between broadcasting that fsync has begun and
issuing the actual syscall that begins it. An asynchronous fsync is also
useless by itself, because it only benefits uncoordinated writes (which
evidently don't care what data actually reaches disk anyway).
The easiest way to implement this fsync would involve three things:
1. Schedule writes for all dirty pages in the fs cache that belong to
the affected file, wait for the device to report success, issue a cache
flush to the device (or request ordering commands, if available) to make
it tell the truth, and wait for the device to report success. AFAIK this
already happens, but without taking advantage of any request ordering
commands.
2. The requesting thread returns as soon as the kernel has identified
all data that will be written back. This is new, but pretty similar to
what AIO already does.
3. No write is allowed to enqueue any requests at the device that
involve the same file, until all outstanding fsync complete [3]. This is
new.
The performance hit for #1 can be reduced significantly if the storage
hardware at hand happens to support some form of request ordering. The
amount of reduction could vary greatly depending on how sophisticated
such request ordering is, and how much effort the kernel and/or device
driver are willing to work for it. In any case, fsync should already do
this [4].
The performance hit for #3 can be minimized by buffering small or
otherwise convenient writes in the fs cache and letting the call return
immediately, as usual. The corresponding pages just have to be marked in
some way to prevent them from being written back too soon. Sequence
numbers work well for this sort of thing. Big requests may have to
block, but they probably would have anyway, if the buffer cache couldn't
absorb them. As with #1, fancy command ordering capabilities in the
underlying device just allow additional performance optimizations.
A carefully-written app (e.g. free of I/O races) would do pretty well
with this extended fsync, certainly far better than the current state of
the art allows.
Note that this still offers no protection for reads: no matter how many
times a thread issues fsync(), it still risks reading non-durable data
because reads are not ordered wrt either writes or fsync. That's not the
problem we're trying to solve, though.
Please feel free to point out where I've gone wrong, but this just
doesn't look like as complex or crazy an idea as you make it out to be.
[1] Maybe posix.1-1001 is more specific, but it's not publicly available
that I could see.
[2] I'm fully aware that implementing the request might require
significant -- perhaps even unreasonably complex -- changes to the way
the kernel currently does things (though I do doubt it). That's not a
good excuse to claim the idea itself is unreasonably complex or
ill-specified. Just say that it's not a good fit for the current code base.
[3] Another concern is whether fsync calls operate on the file or a
particular fd. What if a process opens the same file multiple times, or
multiple processes have fds pointing to the same file (whether by open
or fork)? I would argue for file-level barriers, because it leads to a
vastly simpler design (the fs cache doesn't track which process wrote
what via what fd). Besides, no app that cares about what ends up on disk
will allow uncoordinated writes anyway, so why do extra work just to
ensure I/O races stay fast?
[4] Really, device support for request ordering commands is a bit of a
red herring: the only way it helps significantly is if (a) the storage
device has a massive cache compared to the fs cache, (b) it allows I/O
scheduling to reduce latency of reads and/or writes (which fsync should
do already, and which matters little for flash), and (c) a logging
filesystem is not being used (else it's all sequential writes anyway).
In other words, it can help performance a bit but has little other
impact on what is essentially a software matter.
>
>> There's a lot to be said for simplicity... as long as the system is
>> not so simple as to not work at all.
>>
>> My p.o.v. is that a filesystem write barrier is effectively the same
>> as fsync() with the ability to return sooner (before writes hit stable
>> storage) when the filesystem and hardware support on-disk layouts and
>> primitives which can be used to order writes preceding and succeeding
>> the barrier.
>
> Your mistake is that you are considering barriers as something real,
> which can do something real for you, while it is just a artificial
> abstraction apparently invented by people with limited knowledge how
> storage works, hence having very foggy vision how barriers supposed to
> be processed by it. A simple wrong answer.
Storage: Accepts writes and ostensibly makes them available via reads
even after power failures. Reorders requests nearly arbitrarily and lies
about whether writes actually took effect, unless you issue appropriate
cache flushing and/or request ordering commands (and sometimes even
then, if it was a cheap consumer drive).
OS: Accepts writes and ostensibly makes them available via reads even
after power failures, reboots, etc. Reorders requests nearly arbitrarily
and lies about whether writes actually took effect, unless you issue a
stop-the-world, one-sided write barrier lovingly known as fsync
(assuming the actually disk listens when you tell it to stop cheating).
Wish: a two-sided write barrier that not only ensures previously-issued
writes complete before it reports success, but also prevents
later-issued writes from completing while it is in progress, giving a
reasonably simple way to enforce some ordering of writes in the system.
Can be implemented entirely in software, as the latter has full control
over which requests it chooses to schedule at the device, and also
decides whether to block the requesting thread or not. Can be made
virtually as fast as current writes, by maintaining a little extra
information in the fs cache.
Please, enlighten me: in what way does my limited knowledge of storage,
or my foggy vision of what is desired, make this feature impossible to
implement or useless if implemented?
>
> Generally, you can invent any abstraction convenient for you, but
> farther your abstractions from reality of your hardware => less you
> will get from it with bigger effort.
>
> There are no barriers in Linux and not going to be. Accept it. And
> start instead thinking about offload capabilities your storage can
> offer to you.
Apologies if this comes off as flame-bait, but I start to wonder whose
abstraction is broken here...
What I understand the above to mean is: "Linux file system abstractions
are too far from the reality of storage hardware, so it takes lots of
effort to accomplish little [in the way of enforcing write ordering].
Accept it. And start thinking instead about talking directly to a
storage controller that offers proper write barriers."
I hope I misread what you said, because that's a depressing thing to
hear from your OS.
Ryan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists