Message-ID: <17965.29252.950216.971096@notabene.brown>
Date: Tue, 24 Apr 2007 12:58:12 +1000
From: Neil Brown <neilb@...e.de>
To: Peter Zijlstra <a.p.zijlstra@...llo.nl>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
miklos@...redi.hu, akpm@...ux-foundation.org, dgc@....com,
tomoki.sekiyama.qu@...achi.com, nikita@...sterfs.com,
trond.myklebust@....uio.no, yingchao.zhou@...il.com
Subject: Re: [PATCH 10/10] mm: per device dirty threshold
On Friday April 20, a.p.zijlstra@...llo.nl wrote:
> Scale writeback cache per backing device, proportional to its writeout speed.
So it works like this:
We account for writeout in full pages.
When a page has the Writeback flag cleared, we account that as a
successfully retired write for the relevant bdi.
By using floating averages we keep track of how many writes each bdi
has retired 'recently' where the unit of time in which we understand
'recently' is a single page written.
We keep a floating average for each bdi, and a floating average for
the total writeouts (that 'average' is, of course, 1.)
Using these numbers we can calculate what fraction of 'recently'
retired writes were retired by each bdi (get_writeout_scale).
Multiplying this fraction by the system-wide number of pages that are
allowed to be dirty before write-throttling, we get the number of
pages that the bdi can have dirty before write-throttling the bdi.
I note that the same fraction is *not* applied to background_thresh.
Should it be? I guess not - there would be interesting starting
transients, as a bdi which had done no writeout would not be allowed
any dirty pages, so background writeout would start immediately,
which isn't what you want... or is it?
For each bdi we also track the number of (dirty, writeback, unstable)
pages and do not allow this to exceed the limit set for this bdi.
The calculations involving 'reserve' in get_dirty_limits are a little
confusing.  It looks like you are calculating how much total head-room
there is for the bdi (pages that the system can still dirty - pages
this bdi has dirty) and making sure the number returned in pbdi_dirty
doesn't allow more than that to be used. This is probably a
reasonable thing to do but it doesn't feel like the right place. I
think get_dirty_limits should return the raw threshold, and
balance_dirty_pages should do both tests - the bdi-local test and the
system-wide test.
Currently you have a rather odd situation where
+ if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+ break;
might include numbers obtained with bdi_stat_sum being compared with
numbers obtained with bdi_stat.
With these patches, the VM still (I think) assumes that each BDI has
a reasonable queue limit, so that writeback_inodes will block on a
full queue. If a BDI has a very large queue, balance_dirty_pages
will simply turn lots of DIRTY pages into WRITEBACK pages and then
think "We've done our duty" without actually blocking at all.
With the extra accounting that we now have, I would like to see
balance_dirty_pages wait until RECLAIMABLE+WRITEBACK is
actually less than 'threshold'.  This would probably mean that we
would need to support per-bdi background_writeout to smooth things
out.  Maybe that is fodder for another patch-set.
You set:
+ vm_cycle_shift = 1 + ilog2(vm_total_pages);
Can you explain that?  My experience is that scaling dirty limits
with main memory isn't what we really want.  When you get machines
with very large memory, the amount that you want to be dirty is more
a function of the speed of your IO devices than of the amount of
memory; otherwise you can sometimes see large filesystem lags
('sync' taking minutes?).
I wonder if it makes sense to try to limit the dirty data for a bdi
to the amount that it can write out in some period of time - maybe 3
seconds. Probably configurable. You seem to have almost all the
infrastructure in place to do that, and I think it could be a
valuable feature.
At least, I think vm_cycle_shift should be tied (loosely) to
dirty_ratio * vm_total_pages
??
On the whole, looks good!
Thanks,
NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/