linux-kernel - Re: [PATCH] issue storage device flush via sync

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <49CB8EF3.2020409@redhat.com>
Date:	Thu, 26 Mar 2009 10:19:31 -0400
From:	Ric Wheeler <rwheeler@...hat.com>
To:	Eric Sandeen <sandeen@...deen.net>
CC:	Jeff Garzik <jeff@...zik.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Theodore Tso <tytso@....edu>, Ingo Molnar <mingo@...e.hu>,
	Alan Cox <alan@...rguk.ukuu.org.uk>,
	Arjan van de Ven <arjan@...radead.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>,
	Nick Piggin <npiggin@...e.de>, David Rees <drees76@...il.com>,
	Jesper Krogh <jesper@...gh.cc>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] issue storage device flush via sync_blockdev() (was Re:
 Linux 2.6.29)

On 03/25/2009 10:31 PM, Eric Sandeen wrote:
> Jeff Garzik wrote:
>    
>> Eric Sandeen wrote:
>>      
>
>    
>>> What about when you're running over a big raid device with
>>> battery-backed cache, and you trust the cache as much as much as the
>>> disks.  Wouldn't this unconditional cache flush be painful there on any
>>> of the callers even if they're rare?  (fs unmounts, freezes, unmounts,
>>> etc?  Or a fat filesystem on that device doing an fsync?)
>>>        
>> What exactly do you think sync_blockdev() does?  :)
>>      
>
> It used to push os cached data to the storage.  Now it tells the storage
> to flush cache too (with your patch).  This seems fine in general,
> although it's not a panacea for all the various data integrity issues
> that are being tossed about in this thread.  :)
>
>    
>> It is used right before a volume goes away.  If that's not a time to
>> flush the cache, I dunno what is.
>>
>> The _whole purpose_ of sync_blockdev() is to push out the data to
>> permanent storage.  Look at the users -- unmount volume, journal close,
>> etc.  Things that are OK to occur after those points include: power off,
>> device unplug, etc.
>>      
>
> Sure.  But I was thinking about enterprise raids with battery backup
> which may last for days.
>
> But, ok, I wasn't thinking quite right about the unmount situations etc;
> even on enterprise raids like this, flushing things out on unmount makes
> sense in the case where you lose power post-unmount and can't restore
> power before the battery backup dies.
>    

Enterprise raid systems don't have this issue. They all have sufficient 
battery power to safely destage the volatile cache to persistent storage 
on power outage (i.e., they keep enough drives spinning and so on to 
empty the cache).

Hardware RAID cards that you have in some servers (not external RAID 
arrays) would have this issue though and are probably fairly common.

> I also wondered if a cache flush on one lun issues a cache flush for the
> entire controller, or just for that lun.  Hopefully the latter, in which
> case it's not as big a deal.
>    

Probably just for that LUN - LUN's are usually independent in almost all 
ways. Firmware of course could do anything, I would assume that most 
high end arrays would probably ignore a flush command in any case.

>    
>> A secondary purpose of sync_blockdev() is as a hack, for simple/ancient
>> bdev-based filesystems that do not wish to bother with barriers and all
>> the associated complexity to tracking what writes do/do not need flushing.
>>      
>
> Yep, I get that and it seems reasonable except ....
>
>    
>>> xfs, reiserfs, ext4 all avoid the blkdev flush on fsync if barriers are
>>> not enabled, I think for that reason...
>>>        
>> Enabling barriers causes slowdowns far greater than that of simply
>> causing fsync(2) to trigger FLUSH CACHE, because barriers imply FLUSH
>> CACHE issuance for all in-kernel filesystem journalled/atomic
>> transactions, in addition to whatever syscalls userspace is issuing.
>>
>> The number of FLUSH CACHES w/ barriers is orders of magnitude larger
>> than the number of fsync/fdatasync calls.
>>      
>
> I understand all that.
>
> My point is that the above filesystems (xfs, reiserfs, ext4) skip the
> blkdev flush on fsync when barriers are explicitly disabled.
>
> They do this because if an admin disables barriers, they are trusting
> that the write cache is nonvolatile and will be able to destage fully
> even if external power is lost for some time.
>
> In that case you don't need a blkdev_issue_flush on fsync either (or are
> at least willing to live with the diminished risk, thanks to the battery
> backup), and on xfs, ext4 etc you can turn it off (it goes away w/ the
> barriers off setting).  With this change to the simple generic fsync
> path, you can't turn it off for those filesystems that use it for fsync.
>
> But I suppose it's rare that anybody ever uses a filesystem which uses
> this generic sync method on any sort of interesting storage like I'm
> talking about, and it's not a big deal...  (or maybe that interesting
> storage just ignores cache flushes anyway, I dunno).
>
> My main concerns were that these extra cache flushes for fsync aren't
> tunable, and that flushes on one lun might affect other luns.  I guess
> I've talked myself out of those concerns in a couple different ways now.  ;)
>
> -Eric
>    

I do enthusiastically agree that we should not be doing barriers and the 
blkdev flush for file systems that do barriers correctly.

ric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/