Message-ID: <4B9060DE.4080104@gmail.com>
Date:	Fri, 05 Mar 2010 02:39:42 +0100
From:	M vd S <mvds.00@...il.com>
To:	linux-kernel@...r.kernel.org
Subject: Re: O_NONBLOCK is NOOP on block devices

> > > If O_NONBLOCK is meaningful whatsoever (see man page docs for
> > > semantics) against block devices, one would expect a nonblocking io
> >
> > It isn't...
>
> Thanks for the reply. It's good to get confirmation that I am not all
> alone in an alternate non blocking universe. The linux man pages
> actually had me convinced O_NONBLOCK would actually keep a process
> from blocking on device io :-)
>

You're even less alone, I'm running into the same issue just now. But I 
think I've found a way around it, see below.

> > The manual page says "When possible, the file is opened in non-blocking
> > mode". Your write is probably not blocking - but the memory allocation
> > for it is forcing other data to disk to make room. ie it didn't block it
> > was just "slow".
>
> Even though I know quite well what blocking is, I am not sure how we
> define "slowness". Perhaps when we do define it, we can also define
> "immediately" to mean anything less than five seconds ;-)
>
> You are correct that io to the disk is precisely what must happen to
> complete, and last time I checked, that was the very definition of
> blocking. Not only are writes blocking, even reads are blocking. The
> docs for read(2) also says it will return EAGAIN if "Non-blocking I/O
> has been selected using O_NONBLOCK and no data was immediately
> available for reading."
>

The read(2) manpage reads, under NOTES:

"Many file systems and disks were considered to be fast enough that the 
implementation of O_NONBLOCK was deemed unnecessary.  So, O_NONBLOCK may 
not be available on files and/or disks."

The statement ("fast enough") may only reflect the state of affairs
at that time - a 10 ms seek is an eternity at 3 GHz (some 30 million
cycles), and times 100k it adds up to about 1000 seconds, an eternity
IRL as well. I would define "immediately" as: the data is available
from kernel (or disk) buffers.
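
To see what that NOTES paragraph means in practice, here is a minimal C
sketch. The helper name read_nonblock() is made up for illustration, and
it assumes Linux behaviour for regular files, where O_NONBLOCK is
accepted by open() but has no effect on read():

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open `path` with O_NONBLOCK and read up to `len`
 * bytes. Returns whatever read() returned; on -1, errno is preserved. */
static ssize_t read_nonblock(const char *path, void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        return -1;

    /* On a regular file this either returns data or sleeps until the
     * disk delivers it; it is not expected to fail with EAGAIN. */
    ssize_t got = read(fd, buf, len);

    int saved = errno;          /* close() may clobber errno */
    close(fd);
    errno = saved;
    return got;
}
```

If the data is not in the page cache, the call simply takes as long as
the disk needs, which is exactly the complaint in this thread.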

I need to do vast amounts (100k+) of scattered and unordered small reads
from a hard disk and want to keep my seeks short by sorting them. I
have done some measurements and it seems perfectly possible to derive
the physical disk layout from statistics on some 10-100k random seeks,
so I can solve everything in userland. But before writing my own I/O
scheduler I thought I'd give the kernel and/or SATA's NCQ tricks a shot.
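
For reference, the userland sorting step could be as small as the sketch
below. struct read_req and sort_reqs() are hypothetical names, and the
sketch assumes that ascending file offset roughly matches ascending disk
position, which need not hold for a fragmented file:

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical request record: where to read and how much. */
struct read_req {
    off_t  offset;
    size_t len;
};

/* qsort() comparator: ascending offset, written to avoid overflow
 * that a plain subtraction of two off_t values could cause. */
static int cmp_req_offset(const void *a, const void *b)
{
    const struct read_req *ra = a;
    const struct read_req *rb = b;
    return (ra->offset > rb->offset) - (ra->offset < rb->offset);
}

/* Sort the pending requests so they can be issued in one sweep. */
static void sort_reqs(struct read_req *reqs, size_t n)
{
    assert(reqs != NULL || n == 0);
    if (n > 1)
        qsort(reqs, n, sizeof *reqs, cmp_req_offset);
}
```

With a measured physical layout, the comparator would compare the
derived disk position instead of the raw offset.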

Now the problem is how to tell the kernel/disk which data I want without
blocking. readv(2) apparently reads the requests in array order.
Multithreading doesn't sound too good for just this purpose.

posix_fadvise(2) sounds promising: "POSIX_FADV_WILLNEED initiates a
non-blocking read of the specified region into the page cache."
But there's apparently no signalling to the process that an actual
read() will indeed not block.
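
A rough sketch of how the posix_fadvise() route might look: hint every
range first, so the kernel can reorder the requests, then fetch them
with pread(). prefetch_ranges() and read_batch() are made-up names, all
ranges are assumed to have the same length, and nothing here guarantees
that the later pread() calls won't block:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Phase 1: hint every range. Returns how many hints were accepted. */
static size_t prefetch_ranges(int fd, const off_t *offsets, size_t n,
                              size_t len)
{
    size_t accepted = 0;

    assert(offsets != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* posix_fadvise() returns 0 on success or an error number;
         * it does not set errno. */
        if (posix_fadvise(fd, offsets[i], (off_t)len,
                          POSIX_FADV_WILLNEED) == 0)
            accepted++;
    }
    return accepted;
}

/* Phase 2: read range i into out[i * len]. Returns the number of
 * complete reads. */
static size_t read_batch(int fd, const off_t *offsets, size_t n,
                         size_t len, char *out)
{
    size_t complete = 0;

    prefetch_ranges(fd, offsets, n, len);
    for (size_t i = 0; i < n; i++) {
        /* May still sleep if the readahead has not finished yet. */
        if (pread(fd, out + i * len, len, offsets[i]) == (ssize_t)len)
            complete++;
    }
    return complete;
}
```

This only moves the waiting to the pread() loop; it does not tell the
process which ranges are already in the page cache.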

readahead(2) blocks until the specified data has been read.

aio_read(3) apparently doesn't issue a real non-blocking read request,
so you will get the unneeded overhead of one thread per outstanding request.


mmap(2) / madvise(2) / mincore(2) may be a way around things (although
non-atomic), but I haven't tested it yet. It might also solve the
problem that started this thread, at least for the reading part of it.
Writing a small read()-like function that operates through mmap()
doesn't seem too complicated. As for writing, you could use msync() with
MS_ASYNC to initiate a write. I'm not sure how to find out whether a
write has indeed taken place, but at least initiating a non-blocking
write is possible. munmap() might then still block.
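
An untested-in-anger sketch of such a read()-like function, for the
reading part only. map_file_ro() and mmap_read_nb() are hypothetical
names, the EAGAIN convention is borrowed from read(2), and the 64-page
limit per call is an arbitrary choice to keep the example short:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Map a whole file read-only. Returns NULL on error or empty file. */
static const char *map_file_ro(const char *path, size_t *len_out)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}

/* Like pread() on the mapping: returns the byte count, 0 at or past the
 * end, or -1 with errno == EAGAIN if some page is not resident yet. */
static ssize_t mmap_read_nb(const char *map, size_t map_len, off_t off,
                            void *buf, size_t len)
{
    unsigned char vec[64];      /* mincore() result, one byte per page */
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);

    if (off < 0 || (size_t)off >= map_len || len == 0)
        return 0;
    if (len > map_len - (size_t)off)
        len = map_len - (size_t)off;

    /* mincore() and madvise() need a page-aligned start address. */
    uintptr_t first = (uintptr_t)(map + off) & ~(uintptr_t)(pagesz - 1);
    size_t span = (size_t)((uintptr_t)(map + off) + len - first);
    size_t pages = (span + pagesz - 1) / pagesz;
    assert(pages <= sizeof vec);

    if (mincore((void *)first, span, vec) != 0)
        return -1;
    for (size_t i = 0; i < pages; i++) {
        if (!(vec[i] & 1)) {
            /* Not cached: request readahead, let the caller retry. */
            madvise((void *)first, span, MADV_WILLNEED);
            errno = EAGAIN;
            return -1;
        }
    }
    /* Non-atomic: a page may be evicted between mincore() and memcpy(),
     * in which case the copy faults and blocks after all. */
    memcpy(buf, map + off, len);
    return (ssize_t)len;
}
```

A caller would keep a list of pending requests and retry the ones that
returned EAGAIN. As far as I know, recent kernels only report real
residency through mincore() for files the caller owns or may write, so
treat the result as a hint.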

Maybe some guru here can tell beforehand if such an approach would work?

Cheers,
M.
