Date:	Tue, 09 Oct 2012 14:13:29 +0900
From:	Chul Lee <chur.lee@...sung.com>
To:	dave@...os.cz, 'Jaegeuk Kim' <jaegeuk.kim@...sung.com>
Cc:	viro@...iv.linux.org.uk, 'Theodore Ts'o' <tytso@....edu>,
	gregkh@...uxfoundation.org, linux-kernel@...r.kernel.org,
	cm224.lee@...sung.com, jooyoung.hwang@...sung.com,
	linux-fsdevel@...r.kernel.org
Subject: RE: [PATCH 03/16] f2fs: add superblock and major in-memory structures

Dear David Sterba,

David Sterba wrote: 
> On Fri, Oct 05, 2012 at 08:57:46PM +0900, Jaegeuk Kim wrote:
> > +struct f2fs_nm_info {
> > +	block_t nat_blkaddr;		/* base disk address of NAT */
> > +	unsigned int nat_segs;		/* the number of nat segments */
> > +	unsigned int nat_blocks;	/* the number of nat blocks of
> > +					   one size */
> > +	nid_t max_nid;		/* */
> > +
> > +	unsigned int nat_cnt;		/* the number of nodes in NAT Buffer */
> > +	struct radix_tree_root nat_root;
> > +	rwlock_t nat_tree_lock;		/* Protect nat_tree_lock */
> > +	struct list_head nat_entries;	/* cached nat entry list (clean) */
> > +	struct list_head dirty_nat_entries; /* cached nat entry list (dirty) */
> > +
> > +	unsigned int fcnt;		/* the number of free node id */
> > +	struct mutex build_lock;	/* lock for build free nids */
> > +	struct list_head free_nid_list;	/* free node list */
> > +	spinlock_t free_nid_list_lock;	/* Protect pre-free nid list */
> > +
> > +	spinlock_t stat_lock;		/* Protect status variables */
> 
> A mutex and 2 spinlocks will probably share the same cacheline (I counted
> only roughly; it looks like it's the 2nd cacheline). This may incur a
> performance drop if the locks are contended frequently and all at once.
> 
> Verifying whether this is the case would require being more familiar with
> the codepaths and access patterns than I am, so this is JFYI.
> 


Thanks for the info. We'll omit one spinlock (stat_lock) and consider
rearranging the others.
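For example, a rearrangement along these lines (just a sketch; the grouping,
padding and struct name below are our assumption, not a final layout) would
keep each lock next to the data it protects and push the two hottest locks
onto their own cachelines:

#include <linux/cache.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/spinlock.h>

/* sketch only: field names follow the patch, the layout is illustrative */
struct f2fs_nm_locks_sketch {
	struct list_head nat_entries;		/* cached nat entry list (clean) */
	struct list_head dirty_nat_entries;	/* cached nat entry list (dirty) */
	rwlock_t nat_tree_lock ____cacheline_aligned_in_smp;

	struct mutex build_lock;		/* lock for building free nids */
	struct list_head free_nid_list;		/* free node list */
	spinlock_t free_nid_list_lock ____cacheline_aligned_in_smp;
};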


> > +
> > +	int nat_upd_blkoff[3];		/* Block offset
> > +					   in current journal segment
> > +					   where the last NAT update happened */
> > +	int lst_upd_blkoff[3];		/* Block offset
> > +					   in current journal segment */
> > +
> > +	unsigned int written_valid_node_count;
> > +	unsigned int written_valid_inode_count;
> > +	char *nat_bitmap;		/* NAT bitmap pointer */
> > +	int bitmap_size;		/* bitmap size */
> > +
> > +	nid_t init_scan_nid;	/* the first nid to be scanned */
> > +	nid_t next_scan_nid;	/* the next nid to be scanned */
> > +};
> > +
> > +struct dnode_of_data {
> > +	struct inode *inode;
> > +	struct page *inode_page;
> > +	struct page *node_page;
> > +	nid_t nid;
> > +	unsigned int ofs_in_node;
> > +	int ilock;
> 
> A variable named like a lock but not a lock type? This at least looks
> strange and a comment would not hurt. From a quick look I don't see any
> global lock that would protect it against any races, but also I don't
> see a scenario where a race condition can occur.
> 


The name is indeed confusing. We intended ilock to indicate that inode_page
is already locked and can be updated without taking the lock again; if ilock
is not set, the caller must call lock_page() on inode_page first.
Also, struct dnode_of_data is just a container for parameters that are
commonly passed together to several functions, which may be why you don't
see any global lock protecting it.

We'll consider renaming it and adding a comment.
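A minimal sketch of what we have in mind (the name inode_page_locked below is
only an assumption):

struct dnode_of_data {
	struct inode *inode;
	struct page *inode_page;
	struct page *node_page;
	nid_t nid;
	unsigned int ofs_in_node;
	bool inode_page_locked;	/* true: inode_page is already locked by the
				   caller and may be updated directly;
				   false: callers must lock_page(inode_page)
				   before updating it */
	block_t	data_blkaddr;
};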


> > +	block_t	data_blkaddr;
> > +};
> > +
> > +struct f2fs_sb_info {
> > +	struct super_block *sb;			/* Pointer to VFS super block */
> > +	int s_dirty;
> 
> Is s_dirty actually used? I can see it only set and reset at checkpoint
> time, not e.g. synced with an on-disk block to signal a dirty status.
> 


s_dirty is checked when sync_fs is called.
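Roughly, the intent is as in the sketch below (a simplification, assuming the
types from this patch's f2fs.h; in the real function a checkpoint is written
where the comment is):

static int f2fs_sync_fs(struct super_block *sb, int sync)
{
	struct f2fs_sb_info *sbi = sb->s_fs_info;

	/* skip work when nothing was dirtied since the last checkpoint */
	if (!sbi->s_dirty && !sync)
		return 0;

	/* a checkpoint would be written here, which clears s_dirty again */
	return 0;
}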


> > +	struct f2fs_super_block *raw_super;	/* Pointer to the super block
> > +						   in the buffer */
> > +	struct buffer_head *raw_super_buf;	/* Buffer containing
> > +						   the f2fs raw super block */
> > +	struct f2fs_checkpoint *ckpt;		/* Pointer to the checkpoint
> > +						   in the buffer */
> > +	struct mutex orphan_inode_mutex;
> > +	spinlock_t dir_inode_lock;
> > +	struct mutex cp_mutex;
> > +	/* orphan Inode list to be written in Journal block during CP  */
> > +	struct list_head orphan_inode_list;
> > +	struct list_head dir_inode_list;
> > +	unsigned int n_orphans, n_dirty_dirs;
> > +
> > +	unsigned int log_sectorsize;
> > +	unsigned int log_sectors_per_block;
> > +	unsigned int log_blocksize;
> > +	unsigned int blocksize;
> > +	unsigned int root_ino_num;		/* Root Inode Number*/
> > +	unsigned int node_ino_num;		/* Root Inode Number*/
> > +	unsigned int meta_ino_num;		/* Root Inode Number*/
> > +	unsigned int log_blocks_per_seg;
> > +	unsigned int blocks_per_seg;
> > +	unsigned int log_segs_per_sec;
> > +	unsigned int segs_per_sec;
> > +	unsigned int secs_per_zone;
> > +	unsigned int total_sections;
> > +	unsigned int total_node_count;
> > +	unsigned int total_valid_node_count;
> > +	unsigned int total_valid_inode_count;
> > +	unsigned int segment_count[2];
> > +	unsigned int block_count[2];
> > +	unsigned int last_victim[2];
> > +	block_t user_block_count;
> > +	block_t total_valid_block_count;
> > +	block_t alloc_valid_block_count;
> > +	block_t last_valid_block_count;
> > +	atomic_t nr_pages[NR_COUNT_TYPE];
> > +
> > +	struct f2fs_mount_info mount_opt;
> > +
> > +	/* related to NM */
> > +	struct f2fs_nm_info *nm_info;		/* Node Manager information */
> > +
> > +	/* related to SM */
> > +	struct f2fs_sm_info *sm_info;		/* Segment Manager
> > +						   information */
> > +	int total_hit_ext, read_hit_ext;
> > +	int rr_flush;
> > +
> > +	/* related to GC */
> > +	struct proc_dir_entry *s_proc;
> > +	struct f2fs_gc_info *gc_info;		/* Garbage Collector
> > +						   information */
> > +	struct mutex gc_mutex;			/* mutex for GC */
> > +	struct mutex fs_lock[NR_LOCK_TYPE];	/* mutex for GP */
> > +	struct mutex write_inode;		/* mutex for write inode */
> > +	struct mutex writepages;		/* mutex for writepages() */
> 
> wow, that's 1+8+2 = 11 mutexes in a row! The ones hidden under
> NR_LOCK_TYPE may hurt, as they seem to protect various and common file
> operations (guessed from the lock_type names).


Yes, they protect global state shared between the various file operations
and checkpoint.
Could you tell me what you would recommend? Should each fs_lock under
NR_LOCK_TYPE have its own lock name?
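For instance, would something along these lines be preferable (purely a
sketch; the struct and mutex names below are made up for illustration)?

/*
 * Sketch: individually named mutexes instead of fs_lock[NR_LOCK_TYPE], so
 * that contention can be attributed per operation class in lockdep/lockstat.
 */
struct f2fs_op_locks_sketch {
	struct mutex data_op_lock;	/* regular data block operations */
	struct mutex node_op_lock;	/* node block operations */
	struct mutex dentry_op_lock;	/* directory entry operations */
	struct mutex cp_op_lock;	/* operations serialized against checkpoint */
};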


> 
> > +	struct f2fs_gc_kthread	*gc_thread;	/* GC thread */
> > +	int bg_gc;
> > +	int last_gc_status;
> > +	int por_doing;
> > +
> > +	struct inode *node_inode;
> > +	struct inode *meta_inode;
> > +
> > +	struct bio *bio[NR_PAGE_TYPE];
> > +	sector_t last_block_in_bio[NR_PAGE_TYPE];
> > +	struct rw_semaphore bio_sem;
> > +	void *ckpt_mutex;			/* mutex protecting
> > +						   node buffer */
> 
> Where do you use the ckpt_mutex variable? Also, same concern about naming
> vs. type ...


We'll remove the obsolete variable. Thanks.

> 
> > +	spinlock_t stat_lock;			/* lock for handling the number
> > +						   of valid blocks and
> > +						   valid nodes */
> 
> and again a sequence of synchronization primitives.
> 

We'll omit the stat_lock in the f2fs_nm_info.

> > +};
> 
> > +static inline unsigned char inc_node_version(unsigned char version)
> > +{
> > +	(version == 255) ? version = 0 : ++version;
> 
> Isn't this equivalent to simply
> 
> 	return ++version;
> 
> ?


That would be great. Thanks.
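That is, since version is an unsigned char, the wrap from 255 back to 0
happens automatically:

static inline unsigned char inc_node_version(unsigned char version)
{
	/* unsigned char arithmetic wraps 255 -> 0, no explicit check needed */
	return ++version;
}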

> 
> > +	return version;
> > +}
> > +
> > +struct nat_entry {
> > +	struct node_info ni;
> > +	bool checkpointed;
> > +	struct list_head list;	/* clean/dirty list */
> > +} __packed;
> 
> This is packed, but only an in-memory structure, so reorder 'checkpointed'
> and 'list' so that 'list' is aligned. Otherwise the compiler will cope with
> it, but it generates extra code on architectures that don't like unaligned
> access.
> 

Thanks for the info. We'll reorder them.
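Something like the following, so that 'list' starts at an aligned offset (a
sketch; whether __packed is kept at all is a separate question):

struct nat_entry {
	struct list_head list;	/* clean/dirty list, now naturally aligned */
	struct node_info ni;
	bool checkpointed;
};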


> > +static inline uint64_t div64_64(uint64_t dividend, uint64_t divisor)
> 
> Duplicating an existing function? (Or the variant 64/64 is not exported?)

Right. We should've used div64_u64().
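That is, the whole helper can be replaced with the existing kernel function
(illustration only; div64_u64() comes from <linux/math64.h> and the wrapper
name below is made up):

#include <linux/math64.h>

static inline uint64_t f2fs_div64(uint64_t dividend, uint64_t divisor)
{
	/* delegate to the generic 64-by-64 division helper */
	return div64_u64(dividend, divisor);
}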


> 
> > +{
> > +	uint32_t d = divisor;
> > +
> > +	if (divisor > 0xffffffffUll) {
> > +		unsigned int shift = fls(divisor >> 32);
> > +		d = divisor >> shift;
> > +		dividend >>= shift;
> > +	}
> > +
> > +	if (dividend >> 32)
> > +		do_div(dividend, d);
> > +	else
> > +		dividend = (uint32_t) dividend / d;
> > +
> > +	return dividend;
> > +}
> 
> david

Chul

