linux-kernel - Re: [PATCH 0/6] writeback time order/delay fixes take 3

Message-ID: <20070822011841.GA8090@mail.ustc.edu.cn>
Date:	Wed, 22 Aug 2007 09:18:41 +0800
From:	Fengguang Wu <wfg@...l.ustc.edu.cn>
To:	Chris Mason <chris.mason@...cle.com>
Cc:	Andrew Morton <akpm@...l.org>, Ken Chen <kenchen@...gle.com>,
	linux-kernel@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	Jens Axboe <jens.axboe@...cle.com>
Subject: Re: [PATCH 0/6] writeback time order/delay fixes take 3

On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
> On Sun, 12 Aug 2007 17:11:20 +0800
> Fengguang Wu <wfg@...l.ustc.edu.cn> wrote:
> 
> > Andrew and Ken,
> > 
> > Here are some more experiments on the writeback stuff.
> > Comments are highly welcome~ 
> 
> I've been doing benchmarks lately to try and trigger fragmentation, and
> one of them is a simulation of make -j N.  It takes a list of all
> the .o files in the kernel tree, randomly sorts them and then
> creates bogus files with the same names and sizes in clean kernel trees.
> 
> This is basically creating a whole bunch of files in random order in a
> whole bunch of subdirectories.
> 
> The results aren't pretty:
> 
> http://oss.oracle.com/~mason/compilebench/makej/compare-compile-dirs-0.png
> 
> The top graph shows one dot for each write over time.  It shows that
> ext3 is basically writing all over the place the whole time.  But, ext3
> actually wins the read phase, so the layout isn't horrible.  My guess
> is that if we introduce some write clustering by sending a group of
> inodes down at the same time, it'll go much much better.
> 
> Andrew has mentioned bringing a few radix trees into the writeback paths
> before, it seems like file servers and other general uses will benefit
> from better clustering here.
> 
> I'm hoping to talk you into trying it out ;)

Thank you for the description of the problem. So far I have a similar one
in mind: if we are to delay writeback of atime-dirty-only inodes by
more than 1 hour, some grouping/piggy-backing scheme would be
beneficial.  (Which I guess does not deserve the complexity now that
we have Ingo's make-relatime-default patch.)

My vague idea is to
- keep the s_io/s_more_io as a FIFO/cyclic writeback dispatching queue.
- convert s_dirty to some radix-tree/rbtree based data structure.
  It would have dual functions: delayed-writeback and clustered-writeback.
  
clustered-writeback:
- Use the inode number as a hint of locality, and hence as the key for
  the sorted tree.
- Drain some more s_dirty inodes into s_io on every kupdate wakeup,
  but do it in the ascending order of inode number instead of
  ->dirtied_when. 
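
A minimal user-space sketch of the clustered-writeback draining described
above, not kernel code. The DirtyTree class, its sorted list standing in for
a radix-tree/rbtree, and the resume cursor are all assumptions made for
illustration; only the ascending-inode-number drain order comes from the
idea itself.

```python
import bisect


class DirtyTree:
    """Toy stand-in for an s_dirty tree keyed by inode number (hypothetical)."""

    def __init__(self):
        self._inos = []    # dirty inode numbers, kept sorted
        self._cursor = 0   # inode number at which the next sweep resumes

    def mark_dirty(self, ino):
        # Insert in sorted position; an already-dirty inode is not duplicated.
        i = bisect.bisect_left(self._inos, ino)
        if i == len(self._inos) or self._inos[i] != ino:
            self._inos.insert(i, ino)

    def drain(self, nr):
        """Remove and return up to nr inodes in ascending inode order.

        The returned batch is what would be moved to s_io on one kupdate
        wakeup. The sweep resumes where the previous one stopped and wraps
        to the smallest key once it has passed the largest one.
        """
        start = bisect.bisect_left(self._inos, self._cursor)
        if start == len(self._inos):
            start = 0  # swept past the largest key: wrap around
        batch = self._inos[start:start + nr]
        del self._inos[start:start + nr]
        self._cursor = batch[-1] + 1 if batch else 0
        return batch
```

With inodes 12, 7, 300, 8 and 301 dirtied in that order, drain(3) hands out
7, 8, 12 regardless of when each one was dirtied. An inode dirtied behind
the cursor waits for the next pass, which is the cyclic behaviour assumed
here.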

delayed-writeback:
- Make sure that a full scan of the s_dirty tree takes <=30s, i.e.
  dirty_expire_interval.
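
The <=30s bound can be turned into a per-wakeup drain quota. The sketch
below is only back-of-envelope arithmetic: the drain_quota name is
hypothetical, the 5s wakeup period is assumed from the default
dirty_writeback_interval, and inodes dirtied in the middle of a scan are
ignored.

```python
import math


def drain_quota(nr_dirty, expire_secs=30.0, wakeup_secs=5.0):
    """Inodes to drain per kupdate wakeup so a full scan fits in expire_secs.

    expire_secs is dirty_expire_interval; wakeup_secs is the assumed
    kupdate wakeup period.
    """
    # Number of wakeups available within one expire interval (at least one).
    wakeups_per_scan = max(1, math.floor(expire_secs / wakeup_secs))
    # Round up so the last wakeup does not leave a remainder behind.
    return math.ceil(nr_dirty / wakeups_per_scan)
```

With 6000 dirty inodes and the defaults above, six wakeups fit into one
expire interval, so each wakeup would need to drain 1000 inodes.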

Notes:
(1) I'm not sure inode number is correlated to disk location in
    filesystems other than ext2/3/4. Or parent dir?
(2) It duplicates some of the elevators' functionality. Why is it necessary?
    Maybe we have no clue on the exact data location at this time?

Fengguang
