linux-kernel - RE: [RFC PATCH] f2fs: add extent cache base on rb-tree

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:	Mon, 12 Jan 2015 15:06:06 +0800
From:	Chao Yu <chao2.yu@...sung.com>
To:	'Changman Lee' <cm224.lee@...sung.com>
Cc:	'Jaegeuk Kim' <jaegeuk@...nel.org>,
	linux-f2fs-devel@...ts.sourceforge.net,
	linux-kernel@...r.kernel.org
Subject: RE: [RFC PATCH] f2fs: add extent cache base on rb-tree

Hi Changman,

> -----Original Message-----
> From: Changman Lee [mailto:cm224.lee@...sung.com]
> Sent: Wednesday, January 07, 2015 7:17 PM
> To: Chao Yu
> Cc: 'Jaegeuk Kim'; linux-f2fs-devel@...ts.sourceforge.net; linux-kernel@...r.kernel.org
> Subject: Re: [RFC PATCH] f2fs: add extent cache base on rb-tree
> 
> Hi Chao,
> 
> On Sun, Jan 04, 2015 at 11:19:28AM +0800, Chao Yu wrote:
> > Hi Changman,
> >
> > Sorry for replying late!
> >
> > > -----Original Message-----
> > > From: Changman Lee [mailto:cm224.lee@...sung.com]
> > > Sent: Tuesday, December 30, 2014 8:32 AM
> > > To: Jaegeuk Kim
> > > Cc: Chao Yu; linux-f2fs-devel@...ts.sourceforge.net; linux-kernel@...r.kernel.org
> > > Subject: Re: [RFC PATCH] f2fs: add extent cache base on rb-tree
> > >
> > > Hi all,
> > >
> > > On Mon, Dec 29, 2014 at 01:23:00PM -0800, Jaegeuk Kim wrote:
> > > > Hi Chao,
> > > >
> > > > On Mon, Dec 29, 2014 at 03:19:18PM +0800, Chao Yu wrote:
> > > >
> > > > [snip]
> > > >
> > > > Nice draft. :)
> > > >
> > > > >
> > > > > Please see the draft below.
> > > > >
> > > > > 1) Extent management:
> > > > > If we use global management that managing all extents which are from different
> > > > > inodes in sbi, we will face with serious lock contention when we access these
> > > > > extents belong to different inodes concurrently, the loss may outweights the
> > > > > gain.
> > > >
> > > > Agreed.
> > > >
> > > > > So we choose a local management for extent which means all extents are
> > > > > managed by inode itself to avoid above lock contention. Addtionlly, we manage
> > > > > all extents globally by linking all inode into a global lru list for extent
> > > > > cache shrinker.
> > > > > Approach:
> > > > > 	a) build extent tree/rwlock/lru list/extent count in each inode.
> > > > > 		*extent tree: link all extent in rb-tree;
> > > > > 		*rwlock: protect fields when accessing extent cache concurrently;
> > > > > 		*lru list: sort all extents in accessing time order;
> > > > > 		*extent count: record total count of extents in cache.
> > > > > 	b) use lru shrink list in sbi to manage all inode which cached extents.
> > > > > 		*inode will be added or repostioned in this global list whenever
> > > > > 		extent is being access in this inode.
> > > > > 		*use spinlock to protect this shrink list.
> > > >
> > > > 1. How about adding a data structure with inode number instead of referring
> > > > inode pointer?
> > > >
> > > > 2. How about managing extent entries globally and setting an upper bound to
> > > > the number of extent entries instead of limiting them per each inode?
> > > > (The rb-tree will handle many extents per inode.)
> > > >
> > > > 3. It needs to set a minimum length for the candidate of extent cache.
> > > >  (e.g., 64)
> > > >
> > >
> > > Agreed.
> > >
> > > > So, for example,
> > > > struct ino_entry_for_extents {
> > > > 	inode number;
> > > > 	rb_tree for extent_entry objects;
> > > > 	rwlock;
> > > > };
> > > >
> > > > struct extent_entry {
> > > > 	blkaddr, len;
> > > > 	list_head *;
> > > > };
> > > >
> > > > Something like this.
> > > >
> > > > [A, B, C, ... are extent entry]
> > > >
> > > > The sbi has
> > > > 1. an extent_list: (LRU) A -> B -> C -> D -> E -> F -> G (MRU)
> > > > 2. radix_tree:  ino_entry_for_extents (#10) has D, B in rb-tree
> > > >               ` ino_entry_for_extents (#11) has A, C in rb-tree
> > > >               ` ino_entry_for_extents (#12) has F    in rb-tree
> > > >               ` ino_entry_for_extents (#13) has G, E in rb-tree
> > > >
> > > > In f2fs_update_extent_cache and __get_data_block for #10,
> > > >   ino_entry_for_extents (#10) was founded and updated D or B.
> > > >   Then, updated entries are moved to MRU.
> > > >
> > > > In f2fs_evict_inode for #11, A and C are moved to LRU.
> > > > But, if this inode is unlinked, all the A, C, and ino_entry_for_extens (#11)
> > > > should be released.
> > > >
> > > > In f2fs_balance_fs_bg, some LRU extents are released according to the amount
> > > > of consumed memory. Then, it frees any ino_entry_for_extents having no extent.
> > > >
> > > > IMO, we don't need to consider readahead for this, since get_data_block will
> > > > be called by VFS readahead.
> > > >
> > > > Furthermore, we need to think about whether LRU is really best or not.
> > > > IMO, the extent cache aims to improve second access speed, rather than initial
> > > > cold misses. So, maybe MRU or another algorithms would be better.
> > > >
> > >
> > > Right. It's very comflicated to judge which is better.
> > > In read or write path, extents could be made every time. At that time, we should
> > > decide which extent evicts instead of new extents if we set upper bound.
> > > In update, one extent could be seperated into 3. It requires 3 insertion and 1 deletion.
> > > So if update happends frequently, we could give up extent management for some ranges.
> > > And we need to bring ideas from vm managemnt. For example,
> > > active/inactive list and second chance to promotion, or batch work for insertion/deletion
> > >
> > > I thought suddenly 'Simple is best'.
> > > Let's think about better ideas together.
> >
> > Yeah, how about using an opposite way to the way of page cache manager?
> >
> > for example:
> > node page A,B,C,D is in page cache;
> > extent a,b,c,d is in extent cache;
> > extent a is built from page A, ..., d is built from page D.
> > page cache: LRU A -> B -> C -> D MRU
> > extent cache: LRU a -> b -> c -> d MRU
> >
> > If we use
> > 1) the same way LRU, cache pair A-a, B-b, ... may be reclaimed in the same time as OOM.
> > 2) the opposite way, maybe A,B in page cache and d,c in extent cache will be reclaimed,
> > but we still can hit whole cache in combination of page cache and extent cache.
> >
> > So by the way '2)' we can increase the total coverage area of page cache + extent cache.
> 
> Good idea. :)
> If we decide which one is better between global lru and per inode lru,
> it's expected that f2fs read performance will be more improved.
> I'll wait your patch.

Please review and comment on following RFC patch set, thank you! :)

Thanks,

> 
> Thanks,
> 
> >
> > >
> > > > Thanks,
> > > >
> > > > >
> > > > > 2) Limitation:
> > > > > In one inode, as we split or add extent in extent cache when read/write, extent
> > > > > number will enlarge, so memory and CPU overhead will increase.
> > > > > In order to control the overhead of memory and CPU, we try to set a upper bound
> > > > > number to limit total extent number in each inode, This number is global
> > > > > configuration which is visable to all inode. This number will be exported to
> > > > > sysfs for configuring according to requirement of user. By default, designed
> > > > > number is 8.
> > > > >
> > >
> > > Chao,
> > > It's better which # of extent are controlled globally rather than limit extents
> > > per inode as Jaegeuk said to reduce extent management overhead.
> >
> > It's OK.
> >
> > >
> > > > > 3) Shrinker:
> > > > > There are two shrink paths:
> > > > > 	a) one is triggered when extent count has exceed the upper bound of
> > > > > 	inode's extent cache. We will try to release extent(s) from head of
> > > > > 	inode's inner extent lru list until extent count is equal to upper bound.
> > > > > 	This operation could be in f2fs_update_extent_cache().
> > > > > 	b) the other one is triggered when memory util exceed threshold, we try
> > > > > 	get inode from head of global lru list(s), and release extent(s) with
> > > > > 	fixed number (by default: 64 extents) in inode one by one.
> > > > > 	This operation could be in f2fs_balance_fs_bg().
> > > > >
> > >
> > > Let's consider to use register_shrinker which is called by vm when
> > > memory pressure happens.
> >
> > Great, thanks for your reminding! :-)
> >
> > Thanks,
> > Yu
> >
> > >
> > > > > 4) Revalidation:
> > > > > In ->evict(), extent cache will be released, in order to reuse extent cache
> > > > > of inode when reopen for high hit ratio, a proper way is to add cached extent
> > > > > tree into a hash table (could be radix tree), revalidate extent tree and recover
> > > > > it to inode when reopen.
> > > > > Besides, a global lru list is needed here for shrinker.
> > > > >
> > > > > 5) Readahead:
> > > > > Expand extent cache by readaheading when we call get_dnode_of_data in non-emergency
> > > > > path. Note, ra can affect lock contention for both ->rwlock in extent cache and
> > > > > ->lock of global shrink list.
> > > > >

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/