linux-kernel - Re: O_DIRECT question

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20070131022847.GU8030@opteron.random>
Date:	Wed, 31 Jan 2007 03:28:47 +0100
From:	Andrea Arcangeli <andrea@...e.de>
To:	Phillip Susi <psusi@....rr.com>
Cc:	Denis Vlasenko <vda.linux@...glemail.com>,
	Bill Davidsen <davidsen@....com>,
	Michael Tokarev <mjt@....msk.ru>,
	Linus Torvalds <torvalds@...l.org>, Viktor <vvp01@...ox.ru>,
	Aubrey <aubreylee@...il.com>, Hua Zhong <hzhong@...il.com>,
	Hugh Dickins <hugh@...itas.com>,
	Linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: O_DIRECT question

On Tue, Jan 30, 2007 at 06:07:14PM -0500, Phillip Susi wrote:
> It most certainly matters where the error happened because "you are 
> screwd" is not an acceptable outcome in a mission critical application. 

An I/O error is not an acceptable outcome in a mission critical app,
all mission critical setups should be fault tolerant, so if raid
cannot recover at the first sign of error the whole system should
instantly go down and let the secondary takeover from it. See slony
etc...

Trying to recover the recoverable by mucking up with data making even
_more_ writes on a failing disk before doing physical mirror image of
the disk (the readable part) isn't a good idea IMHO. At best you could
retry writing on the same sector hoping somebody disconnected the scsi
cable by mistake.

>  A well engineered solution will deal with errors as best as possible, 
> not simply give up and tell the user they are screwed because the 
> designer was lazy.  There is a reason that read and write return the 
> number of bytes _actually_ transfered, and the application is supposed 
> to check that result to verify proper operation.

You can track the range where it happened with fsync too like said in
previous email, and you can take the big database lock and then
read-write read-write every single block in that range until you find
the failing place if you really want to. read-write in place should be
safe.

> No, there is a slight difference.  An fsync() flushes all dirty buffers 
> in an undefined order.  Using O_DIRECT or O_SYNC, you can control the 
> flush order because you can simply wait for one set of writes to 
> complete before starting another set that must not be written until 
> after the first are on the disk.  You can emulate that by placing an 
> fsync between both sets of writes, but that will flush any other
> dirty 

Doing fsync after every write will provide the same ordering
guarantee as O_SYNC, thought it was obvious what I meant here.

The whole point is that most of the time you don't need it, you need
an fsync after a couple of writes. All smtp servers uses fsync for the
same reason, they also have to journal their writes to avoid losing
email when there is a power loss.

If you use writev or aio pwrite you can do well with O_SYNC too though.

> buffers whose ordering you do not care about.  Also there is no aio 
> version of fsync.

please have a second look at aio_abi.h:

	IOCB_CMD_FSYNC = 2,
	IOCB_CMD_FDSYNC = 3,

there must be a reason why they exist, right?

> sync has no effect on reading, so that test is pointless.  direct saves 
> the cpu overhead of the buffer copy, but isn't good if the cache isn't 
> entirely cold.  The large buffer size really has little to do with it, 

direct bypasses the cache so the cache is freezing not just cold.

> rather it is the fact that the writes to null do not block dd from 
> making the next read for any length of time.  If dd were blocking on an 
> actual output device, that would leave the input device idle for the 
> portion of the time that dd were blocked.

The objective was to measure the pipeline stall, if you stall it for
other reason anyway what's the point?

> In any case, this is a totally different example than your previous one 
> which had dd _writing_ to a disk, where it would block for long periods 
> of time due to O_SYNC, thereby preventing it from reading from the input 
> buffer in a timely manner.  By not reading the input pipe frequently, it 
> becomes full and thus, tar blocks.  In that case the large buffer size 
> is actually a detriment because with a smaller buffer size, dd would not 
> be blocked as long and so it could empty the pipe more frequently 
> allowing tar to block less.

It would run slower with smaller buffer size because it would block
too and it would read and write slower too. For my backup usage
keeping tar blocked is actually a feature, so the load of the backup
decreases. To me it's important the MB/sec of the writes and the
MB/sec of the reads (to lower the load), I don't care too much about
how long it takes as far as things runs as efficiently as possible
when they run. The rate limiting effect of the blocking isn't a
problem to me.

> You seem to have missed the point of this thread.  Denis Vlasenko's 
> message that you replied to simply pointed out that they are 
> semantically equivalent, so O_DIRECT can be dropped provided that O_SYNC 
> + madvise could be fixed to perform as well.  Several people including 
> Linus seem to like this idea and think it is quite possible.

I answered to that email to point out the fundamental differences
between O_SYNC and O_DIRECT, if you don't like what I said I'm sorry
but that's how things are running today and I don't see quite possible
to change (unless of course we remove performance from the equation,
then indeed they'll be much the same).

Perhaps a IOCB_CMD_PREADAHEAD plus MAP_SHARED backed by lagepages
loaded with a new syscall that reads a piece at time into the large
pagecache, could be an alternative design, or perhaps splice could
obsolete O_DIRECT. I've just a very hard time to see how
O_SYNC+madvise could ever obsolete O_DIRECT.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/