Date:	Wed, 18 Jun 2008 09:22:48 -0700 (PDT)
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Bron Gondwana <brong@...tmail.fm>
cc:	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Nick Piggin <npiggin@...e.de>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Rob Mueller <robm@...tmail.fm>,
	Andi Kleen <andi@...stfloor.org>, Ingo Molnar <mingo@...e.hu>,
	Ken Murchison <murch@...rew.cmu.edu>
Subject: Re: Cyrus mmap vs lseek/write usage - (WAS: BUG: mmapfile/writev
 spurious zero bytes (x86_64/not i386, bisected, reproducable))



On Wed, 18 Jun 2008, Bron Gondwana wrote:
> On Tue, Jun 17, 2008 at 09:03:17PM -0700, Linus Torvalds wrote:
> >
> > Is there any reason it doesn't use mmap(MAP_SHARED) and make the 
> > modifications that way too?
> 
> Portability[tm].

Hmm.. I'm pretty sure that using MAP_SHARED for writing is _more_ portable 
than mixing mmap() and "write()" - or at least more _consistent_.

That said, it's probably six one way, and half a dozen the other. The 
shared writable mmap() doesn't work well on unix-lookalikes (ie "not real 
unix"). That does include really really old Linux versions (ie 1.x 
series), but more relevantly probably includes things like QNX etc.

On the other hand, the mmap()+write(), as mentioned, doesn't work well on 
various hardware platforms where there can be cache aliases, and that 
includes HP-UX (as you apparently have noticed), but I'm pretty certain 
there are other cases too.

The cache alias issue can actually be really thorny, because it's going to 
be very hard to see and essentially random: if your working set is big 
enough (or the cache is small enough) that the cache basically gets 
flushed between the write and the access through the mmap (and vice 
versa), you'll never see any problems.

But then, _occasionally_, you'll have really hard-to-replicate corruption 
due to cache aliases (ie you read something from the mmap() after the 
write, but you don't actually see the newly written data, because it's 
cached at a different virtual address).

Linux tries really hard to be coherent between mmap and read/write even on 
those kinds of platforms, but I would definitely not call it "portable". 
It really is a fundamentally nasty thing, and depends deeply on the CPU 
architecture, not just the OS.
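
To make the access pattern concrete, here is a minimal sketch of the 
mmap()+write() mix: the file is read through a read-only MAP_SHARED 
mapping and modified with pwrite(). The function name and the test 
strings are made up for illustration. On a coherent platform it should 
return 1, but a clean run proves nothing about portability, since the 
aliasing failure is intermittent and depends on cache state.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if data written with pwrite() is visible
 * through an already-established read-only MAP_SHARED mapping, 0 if the
 * mapping still shows stale data, and -1 on a setup error.
 */
int write_visible_through_map(const char *path)
{
    const char before[] = "old-data";
    const char after[]  = "new-data";
    const size_t len = sizeof before;
    char *map;
    int result = -1;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (pwrite(fd, before, len, 0) != (ssize_t)len)
        goto out_close;

    map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out_close;

    /* Read through the mapping first, so the old contents have been
     * accessed (and possibly cached) via the user-space address. */
    if (memcmp(map, before, len) != 0)
        goto out_unmap;

    /* Modify the file through the write() path ... */
    if (pwrite(fd, after, len, 0) != (ssize_t)len)
        goto out_unmap;

    /* ... and look for the new data through the mapping. */
    result = (memcmp(map, after, len) == 0);

out_unmap:
    munmap(map, len);
out_close:
    close(fd);
    unlink(path);
    return result;
}
```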

> It actually does use MAP_SHARED already, but only for reading.
> Writing is all done with seeks and writes, followed by a map
> "refresh", which is really just an unmmap/mmap if the file has
> extended past the "SLOP" (between 8 and 16 k after the end of
> the file length at last mapping).

Yeah, I can certainly see that working. That said, I can also see it 
failing, partly because of the CPU virtual indexing cache issues, but 
partly because it's such an unusual thing to do (partly because it simply 
is known not to work on some systems, ie HP-UX). And that will mean that 
it is probably not a well-tested path.. As you found out.
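
For reference, the "refresh" scheme described in the quoted text could 
look roughly like the sketch below. This is not the Cyrus code: the 
struct, the function name, the SLOP value and the rounding formula are 
all assumptions chosen so that the mapping extends between 8k and 16k 
past the file size seen at mapping time. The remap only happens once the 
file has grown past that slop.

```c
#include <assert.h>
#include <fcntl.h>      /* open(), for callers */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>     /* write(), ftruncate(), close(), for callers */

#define SLOP (8 * 1024)     /* assumed slop granularity */

/* Hypothetical bookkeeping for a read-only shared mapping. */
struct ro_map {
    const char *base;   /* NULL when nothing is mapped */
    size_t map_len;     /* bytes of address space mapped (file + slop) */
    size_t file_len;    /* file size seen at the last refresh */
};

/*
 * Returns 1 if the file was (re)mapped, 0 if the existing mapping still
 * covers the file, and -1 on error. Only bytes below file_len are safe
 * to read; pages wholly beyond end-of-file fault with SIGBUS.
 */
int ro_map_refresh(struct ro_map *m, int fd)
{
    struct stat st;
    size_t new_len, map_len;
    void *p;

    assert(m != NULL);

    if (fstat(fd, &st) < 0)
        return -1;
    new_len = (size_t)st.st_size;

    /* The file still fits inside the slop: keep the old mapping. */
    if (m->base != NULL && new_len <= m->map_len) {
        m->file_len = new_len;
        return 0;
    }

    /* Otherwise do the unmap/mmap dance with fresh slop. */
    if (m->base != NULL)
        munmap((void *)m->base, m->map_len);

    /* Round up so that map_len - new_len lands between 8k and 16k. */
    map_len = (new_len + 2 * SLOP - 1) & ~(size_t)(SLOP - 1);
    p = mmap(NULL, map_len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        m->base = NULL;
        m->map_len = m->file_len = 0;
        return -1;
    }

    m->base = p;
    m->map_len = map_len;
    m->file_len = new_len;
    return 1;
}
```

A caller would start with a zeroed struct ro_map and call 
ro_map_refresh() after every write() that may have changed the file. 
Note that the data itself still arrives via write(), so this sketch 
inherits the aliasing problem discussed above.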

(Side note: I mention HP-UX just because it is known to historically have 
totally and utterly brain-damaged and useless mmap support. It _may_ be 
that they've fixed it in more modern versions. It literally used to be a 
mix of horrible hardware problems - the virtual cache issue - _and_ a VM 
system that was based on some really old BSD code).

So the more traditional way would be to do an all-mmap thing, and extend 
the file with ftruncate(), not write. That's something that programs like 
nntpd have been doing for decades, so it's a very "traditional" model and 
thus much more likely to be safe. It also avoids all the aliasing issues, 
if all accesses are done the same way.
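
A minimal sketch of that all-mmap model, assuming a file descriptor 
opened O_RDWR: grow the file with ftruncate(), then store the new bytes 
through a writable MAP_SHARED mapping. The function name is made up, and 
mapping the whole file on every append is only for brevity; a real 
program would keep one long-lived mapping and remap when it runs out of 
room.

```c
#include <assert.h>
#include <fcntl.h>      /* open(), for callers */
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: append len bytes to the file behind fd without
 * ever calling write(). Returns 0 on success, -1 on error.
 */
int append_via_mmap(int fd, const void *data, size_t len)
{
    struct stat st;
    off_t old_size, new_size;
    char *map;
    int rc;

    assert(data != NULL || len == 0);

    if (len == 0)
        return 0;
    if (fstat(fd, &st) < 0)
        return -1;
    old_size = st.st_size;
    new_size = old_size + (off_t)len;

    /* Extend the file first: storing to mapped pages that lie wholly
     * beyond end-of-file raises SIGBUS. */
    if (ftruncate(fd, new_size) < 0)
        return -1;

    map = mmap(NULL, (size_t)new_size, PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    /* All modifications go through the shared mapping. */
    memcpy(map + old_size, data, len);

    /* Push the dirty pages to the file before dropping the mapping. */
    rc = msync(map, (size_t)new_size, MS_SYNC);
    munmap(map, (size_t)new_size);
    return rc;
}
```

Readers that also go through a MAP_SHARED mapping of the same file then 
see the data the same way it was written, which is the point of doing 
all accesses one way.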

That said, you _would_ need to have alternate strategies to access things, 
but apparently Cyrus already has such strategies at least for HP-UX.

> Ahh - I found the explanation in doc/internal/hacking in
> the Cyrus source tree.  While 'ack' is a nice tool, it
> doesn't check files with no extension by default.  Ho hum:
> 
> - map_refresh and map_free
> 
>   - In many cases, it is far more effective to read a file via the operating
>     system's mmap facility than it is to via the traditional read() and
>     lseek system calls.  To this end, Cyrus provides an operating system
>     independent wrapper around the mmap() services (or lack thereof) of the
>     operating system.

One of the issues here is that in order to give coherency for mmap + 
read/write access, the OS may need to map the area uncached or at least 
flush caches when writing. So from a pure performance standpoint, it can 
also cause problems.

Of course, even an uncached mmap() _can_ certainly be faster than using 
just read()/write(), depending on the access patterns. So maybe Cyrus is 
doing the right thing, it just sounds rather fragile and prone to 
unexpected and hard-to-debug problems.

			Linus
