linux-kernel - Re: [PATCH 11/16] f2fs: add inode operations for special inodes

Message-Id: <201210170835.02902.arnd@arndb.de>
Date:	Wed, 17 Oct 2012 08:35:02 +0000
From:	Arnd Bergmann <arnd@...db.de>
To:	Jaegeuk Kim <jaegeuk.kim@...il.com>
Cc:	Jaegeuk Kim <jaegeuk.kim@...sung.com>,
	"'Changman Lee'" <cm224.lee@...il.com>,
	"'Vyacheslav Dubeyko'" <slava@...eyko.com>,
	viro@...iv.linux.org.uk, "'Theodore Ts'o'" <tytso@....edu>,
	gregkh@...uxfoundation.org, linux-kernel@...r.kernel.org,
	chur.lee@...sung.com, cm224.lee@...sung.com,
	jooyoung.hwang@...sung.com
Subject: Re: [PATCH 11/16] f2fs: add inode operations for special inodes

On Tuesday 16 October 2012, Jaegeuk Kim wrote:
> 2012-10-16 (Tue), 16:14 +0000, Arnd Bergmann:
> > On Tuesday 16 October 2012, Jaegeuk Kim wrote:
> > For the lower bound, being able to support as little as 2 logs for
> > cheap hardware would be nice, but 4 logs is the important one.
> > 
> > 5 logs is probably not all that important, as long as you have the
> > choice between 4 and 6. If you implement three different ways, I
> > would prefer to have the choice of 2/4/6 over 4/5/6 logs.
> 
> Ok, I'll try, but in the case of 2 logs, it may need to change recovery
> routines.

Ok, I see. If it needs any changes that require a lot of extra code or
if it would make the common (six logs) case less efficient, then
you should probably not do it.
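
To make the 2/4/6 choice concrete, here is a minimal sketch in C of how
six logs could fold down to four or two. It assumes the six logs
correspond to hot, warm, and cold classes for node and data blocks; the
function name log_index() and the particular folding (merging hot with
warm for four logs, keeping only node versus data for two) are
hypothetical and not taken from the patch set.

```c
/* Hypothetical block classification; names are illustrative only. */
enum blk_kind { KIND_NODE = 0, KIND_DATA = 1 };
enum blk_temp { TEMP_HOT = 0, TEMP_WARM = 1, TEMP_COLD = 2 };

/*
 * Map a (kind, temperature) pair to a log index in [0, nr_logs).
 * Returns -1 for an unsupported number of logs.
 */
static int log_index(enum blk_kind kind, enum blk_temp temp, int nr_logs)
{
	switch (nr_logs) {
	case 6:
		/* One log per (kind, temperature) pair. */
		return (int)kind * 3 + (int)temp;
	case 4:
		/* Assumed folding: hot and warm share a log, cold is separate. */
		return (int)kind * 2 + (temp == TEMP_COLD ? 1 : 0);
	case 2:
		/* Only separate node blocks from data blocks. */
		return (int)kind;
	default:
		return -1;
	}
}
```

With this folding, the six-log case keeps its own straight-line path, so
supporting fewer logs would not have to slow down the common case.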

> > I fear that this might not be good enough for a lot of cases when
> > the page sizes grow and there is not a sufficient amount of nonvolatile
> > write cache in the device. I wonder whether there is something that can
> > be done to ensure we always write with a minimum alignment, and pad
> > out the data with zeroes if necessary in order to avoid getting into
> > garbage collection on devices that can't handle sub-page writes.
> 
> You're very familiar with flash. :)
> Yes, as the page size grows, the sub-page write issue is one of the
> most critical problems.
> I have thought about this before, but I have not reached a conclusion
> so far.
> I don't know how other companies deal with this, but I've seen many
> firmware developers at Samsung try to reduce this overhead by adopting
> various schemes.
> My cautious guess is that other companies also handle this well.
> So I still question whether the file system should take care of this
> completely or not.

My guess is that most devices would be able to handle this well enough
as long as the writes are only in the log areas, but some would fail
when there are cached sub-page writes by the time you update the metadata
in the beginning of the drive.

Besides the extreme case of getting into garbage collection when the device
runs out of nonvolatile cache to keep sub-pages, there is also the
problem that not needing the NV cache at all is always more efficient than
having to use it for sub-page writes. This is especially true if the
NV cache is implemented as a log on a regular flash block. In those cases,
it would be better to pad the current write with zeroes to the next
page boundary and rely on garbage collection to do the compaction later.
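
As a rough illustration of the zero-padding idea, here is a minimal
sketch in C that computes how many zero-filled filesystem blocks would
have to follow a write so that it ends on a flash page boundary. The
function name zero_pad_blocks() and the blks_per_page parameter (the
flash page size expressed in 4KB blocks, e.g. 4 for a 16KB page) are
assumptions made for this example, not part of f2fs.

```c
/*
 * Number of zero-filled blocks needed so that a write of nr_blks blocks
 * starting at block offset start_blk ends on a flash page boundary.
 *
 * blks_per_page is the (assumed) flash page size in filesystem blocks
 * and must be non-zero.
 */
static unsigned int zero_pad_blocks(unsigned long start_blk,
				    unsigned int nr_blks,
				    unsigned int blks_per_page)
{
	unsigned long end_blk = start_blk + nr_blks;
	unsigned int rem = (unsigned int)(end_blk % blks_per_page);

	/* Already aligned: no padding needed. */
	if (rem == 0)
		return 0;

	return blks_per_page - rem;
}
```

For example, with 4 blocks per flash page, a 5-block write starting at
a page boundary would need 3 zero blocks, and garbage collection would
later reclaim the space they occupy.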

As I mentioned before, my design avoided the problem by using larger
clusters to start with and then mitigating the space overhead from this
by allowing multiple inodes to share a single cluster. The tradeoffs
from this are very different from what you have with a fixed 4KB block
size, and it's probably not worth redesigning f2fs to handle this on
such a global scale.

One thing that you can do though is pad each flash page with data from
garbage collection: There should basically always be data that needs
to be GC'd, and as soon as you have decided that you want to write
a block to a new location and the hardware requires that it writes
a block of data to pad the page, you might just as well send down that
block. In the opposite case where you have a full page worth of actual
data that needs to be written (e.g. for a sync()) and half a page
worth of data from garbage collection, you can decide not to send the GC
data in order to stay on a page boundary.
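
The decision described above can be sketched in C as follows. This is
only an illustration under stated assumptions: the write is assumed to
start on a flash page boundary, and struct flush_plan, plan_flush(), and
the blks_per_page parameter are hypothetical names that do not exist in
f2fs.

```c
/* Hypothetical result type: what to append after the real data. */
struct flush_plan {
	unsigned int gc_blks;	/* GC blocks used as padding */
	unsigned int zero_blks;	/* zero blocks still needed after that */
};

/*
 * data_blks:     blocks of real data that must be written now
 * gc_avail:      blocks currently waiting to be moved by GC
 * blks_per_page: assumed flash page size in filesystem blocks (non-zero)
 */
static struct flush_plan plan_flush(unsigned int data_blks,
				    unsigned int gc_avail,
				    unsigned int blks_per_page)
{
	struct flush_plan plan = { 0, 0 };
	unsigned int rem = data_blks % blks_per_page;
	unsigned int gap;

	/* Data already ends on a page boundary: hold the GC data back. */
	if (rem == 0)
		return plan;

	/* Fill the rest of the last page with GC blocks, then zeroes. */
	gap = blks_per_page - rem;
	plan.gc_blks = gc_avail < gap ? gc_avail : gap;
	plan.zero_blks = gap - plan.gc_blks;
	return plan;
}
```

With 4 blocks per page, a full page of data plus 2 pending GC blocks
sends no GC data, while 3 data blocks plus 2 pending GC blocks sends one
GC block as padding and leaves the other for a later write.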

Doing this systematically would allow using the eMMC-4.5 "large-unit"
context for all of the logs, which can be a significant performance
improvement, depending on the underlying implementation.

	Arnd