linux-kernel - Re: O_DIRECT question

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <45B4E3A3.40706@cfl.rr.com>
Date:	Mon, 22 Jan 2007 11:17:39 -0500
From:	Phillip Susi <psusi@....rr.com>
To:	Denis Vlasenko <vda.linux@...glemail.com>
CC:	Michael Tokarev <mjt@....msk.ru>,
	Linus Torvalds <torvalds@...l.org>, Viktor <vvp01@...ox.ru>,
	Aubrey <aubreylee@...il.com>, Hua Zhong <hzhong@...il.com>,
	Hugh Dickins <hugh@...itas.com>, linux-kernel@...r.kernel.org,
	hch@...radead.org, kenneth.w.chen@...el.com, akpm@...l.org
Subject: Re: O_DIRECT question

Denis Vlasenko wrote:
> The difference is that you block exactly when you try to access
> data which is not there yet, not sooner (potentially much sooner).
> 
> If application (e.g. database) needs to know whether data is _really_ there,
> it should use aio_read (or something better, something which doesn't use signals.
> Do we have this 'something'? I honestly don't know).

The application _IS_ using aio, which is why it can go and perform other 
work while it waits to be told that the read has completed.  This is not 
possible with mmap because the task is blocked while faulting in pages, 
and unless it tries to access the pages, they won't be faulted in.

> In some cases, evne this is not needed because you don't have any other
> things to do, so you just do read() (which returns early), and chew on
> data. If your CPU is fast enough and processing of data is light enough
> so that it outruns disk - big deal, you block in page fault handler
> whenever a page is not read for you in time.
> If CPU isn't fast enough, your CPU and disk subsystem are nicely working
> in parallel.

Being blocked in the page fault handler means the cpu is now idle 
because you can't go chew on data that _IS_ in core.  The aio + O_DIRECT 
allows you to control when IO is started rather than rely on the kernel 
to decide when is a good time for readahead, and to KNOW when that IO is 
done so you can chew on the data.

> With O_DIRECT, you alternate:
> "CPU is idle, disk is working" / "CPU is working, disk is idle".

You have this completely backwards.  With mmap this is what you get 
because you chew data, page fault... chew data... page fault...

> What do you want to do on I/O error? I guess you cannot do much -
> any sensible db will shutdown itself. When your data storage
> starts to fail, it's pointless to continue running.

Ever hear of error recovery?  A good db will be able to cope with one or 
two bad blocks, or at the very least continue operating the other tables 
or databases it is hosting, or flush transactions and switch to read 
only mode, or any number of things other than abort().

> You do not need to know which read() exactly failed due to bad disk.
> Filename and offset from the start is enough. Right?
> 
> So, SIGIO/SIGBUS can provide that, and if your handler is of
> 	void (*sa_sigaction)(int, siginfo_t *, void *);
> style, you can get fd, memory address of the fault, etc.
> Probably kernel can even pass file offset somewhere in siginfo_t...

Sure... now what does your signal handler have to do in order to handle 
this error in such a way as to allow the one request to be failed and 
the task to continue handling other requests?  I don't think this is 
even possible, yet alone clean.

> You can still be multithreaded. The point is, with O_DIRECT
> you _are forced_ to_ be_ multithreaded, or else perfomance will suck.

Or use aio.  Simple read/write with the kernel trying to outsmart the 
application is nice for very simple applications, but it does not 
provide very good performance.  This is why we have aio and O_DIRECT; 
because the application can manage the IO better than the kernel because 
it actually knows what it needs and when.

Yes, the application ends up being more complex, but that is the price 
you pay.  You simply can't get it perfect in a general purpose kernel 
that has to guess what the application is really trying to do.

> You think "Oracle". But this application may very well be
> not Oracle, but diff, or dd, or KMail. I don't want to care.
> I want all big writes to be efficient, not just those done by Oracle.
> *Including* single threaded ones.

Then redesign those applications to use aio and O_DIRECT.  Incidentally 
I have hacked up dd to do just that and have some very nice performance 
numbers as a result.

> Well, I too currently work with Oracle.
> Apparently people who wrote damn thing have very, eh, Oracle-centric
> world-view. "We want direct writes to the disk. Period." Why? Does it
> makes sense? Are there better ways? - nothing. They think they know better.

Nobody has shown otherwise to date.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/