Message-ID: <Pine.LNX.4.64.0810071126001.504@hs20-bc2-1.build.redhat.com>
Date: Tue, 7 Oct 2008 11:44:34 -0400 (EDT)
From: Mikulas Patocka <mpatocka@...hat.com>
To: david@...g.hm
cc: Nick Piggin <nickpiggin@...oo.com.au>,
Andrew Morton <akpm@...ux-foundation.org>,
linux-kernel@...r.kernel.org, agk@...hat.com, mbroz@...hat.com,
chris@...chsys.com
Subject: Re: application syncing options (was Re: [PATCH] Memory management
livelock)
> > If you invent a new interface that allows submitting several ordered IOs
> > from userspace, it will require excessive maintenance overhead over a
> > long period of time. So it is only justified if the performance
> > improvement is excessive as well.
> >
> > It should not be like "here you improve 10% performance on some synthetic
> > benchmark in one application that was rewritten to support the new
> > interface" and then create a few more security vulnerabilities (because of
> > the complexity of the interface) and damage overall Linux progress,
> > because everyone is catching bugs in the new interface and checking it for
> > correctness.
>
> the same benchmarks that show that it's far better for the in-kernel
> filesystem code to use write barriers should apply for FUSE filesystems.
FUSE is slow by design, and it is used in cases where performance isn't
crucial.
> this isn't a matter of a few % in performance, if an application is
> sync-limited in a way that can be converted to write-ordered the potential is
> for the application to speed up by many times.
>
> programs that maintain indexes or caches of data that lives in other files
> will be able to write data && barrier && write index && fsync and double their
> performance vs write data && fsync && write index && fsync
They can do this instead: write the data with O_SYNC; then write the other
piece of data with O_SYNC.
The only difference from barriers is the wait after the first O_SYNC write
completes before the second I/O is submitted (a delay that wouldn't happen
with barriers).
And since I/O latency is measured in milliseconds while process wakeup time
is tens of microseconds, it doesn't look like eliminating the wakeup would
gain more than a few percent.
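To make the comparison concrete, here is a minimal sketch of the two
orderings (not from the original mail; fd_data and fd_idx are hypothetical
descriptors, error handling omitted):

    #include <unistd.h>

    /* Variant A: both descriptors opened with O_SYNC; each write()
     * returns only after the data is on stable storage, so the index
     * write is naturally submitted after the data write completes. */
    void update_osync(int fd_data, int fd_idx,
                      const void *data, size_t dlen,
                      const void *idx, size_t ilen)
    {
            write(fd_data, data, dlen);  /* blocks until data is stable */
            write(fd_idx, idx, ilen);    /* submitted strictly afterwards */
    }

    /* Variant B: plain descriptors with explicit fsync() between the
     * writes, as in the "write && fsync && write && fsync" pattern
     * quoted above. */
    void update_fsync(int fd_data, int fd_idx,
                      const void *data, size_t dlen,
                      const void *idx, size_t ilen)
    {
            write(fd_data, data, dlen);
            fsync(fd_data);              /* data must be stable first */
            write(fd_idx, idx, ilen);
            fsync(fd_idx);               /* make the index stable too */
    }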
> databases can potentially do even better, today they need to fsync data to
> disk before they can update their journal to indicate that the data has been
> written, with a barrier they could order the writes so that the write to the
> journal doesn't happen until after the writes of the data. they would never
> need to call an fsync at all (when emptying the journal)
Good databases can pack several user transactions into one fsync() write.
If the database server is properly engineered, it accumulates all user
transactions waiting to commit into one chunk, writes that chunk with one
fsync() call, and only then reports successful commit to the clients.
So if you increase fsync() latency, it should have no effect on
transactional throughput --- only on the latency of individual
transactions. Similarly, if you decrease fsync() latency, it won't
increase the number of processed transactions.
Certainly, there are primitive embedded database libraries that fsync()
after each transaction, but they don't have good performance anyway.
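The group-commit pattern looks roughly like this (my sketch, not from any
particular database; struct txn, dequeue_all() and ack_commit() are
hypothetical):

    #include <unistd.h>

    struct txn {
            struct txn *next;
            const void *record;       /* serialized log record */
            size_t len;
    };

    struct txn *dequeue_all(void);    /* grab every txn queued for commit */
    void ack_commit(struct txn *t);   /* report success to the client */

    /* One flush commits the whole batch: append all pending log
     * records, fsync() once, then acknowledge every transaction. */
    void group_commit(int log_fd)
    {
            struct txn *t, *batch = dequeue_all();

            for (t = batch; t; t = t->next)
                    write(log_fd, t->record, t->len);
            fsync(log_fd);            /* single flush for the batch */
            for (t = batch; t; t = t->next)
                    ack_commit(t);
    }

Longer fsync() latency just makes each batch bigger; throughput stays the
same.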
> for systems without solid-state drives or battery-backed caches, the ability
> to eliminate fsyncs by being able to rely on the order of the writes is a huge
> benefit.
I may ask --- where are the applications that are actually hurt by slow
fsync() latency? Databases are not among them; they batch transactions.
If you want to improve things, you can try:
* implement O_DSYNC (like O_SYNC, but doesn't synchronously write inode
metadata such as mtime)
* implement range_fsync and range_fdatasync (sync on a file range --- the
kernel already has internal support for that, you would just need to add a
syscall; see the sketch below)
* turn on the FUA bit for O_DSYNC writes; that eliminates the need to flush
the drive cache on each O_DSYNC call
--- these are definitely less invasive than a new I/O submission interface.
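For comparison, the nearest existing interface is sync_file_range() (merged
in 2.6.17). A sketch of syncing a byte range with it --- note that it only
initiates and waits on page writeback; it flushes neither the drive cache
nor file metadata, so it is weaker than a true range fsync:

    #define _GNU_SOURCE
    #include <fcntl.h>

    /* Write out and wait on the dirty pages in [off, off + len). */
    void flush_range(int fd, off64_t off, off64_t len)
    {
            sync_file_range(fd, off, len,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER);
    }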
Mikulas
> David Lang
>