linux-kernel - dirty_expire_centisecs, msync behavior

Message-ID: <522BBE46.9060809@symas.com>
Date:	Sat, 07 Sep 2013 17:01:10 -0700
From:	Howard Chu <hyc@...as.com>
To:	Linux Kernel Mailing List <Linux-Kernel@...r.Kernel.ORG>
Subject: dirty_expire_centisecs, msync behavior

The documentation for dirty_expire_centisecs states: "Data which has been 
dirty in-memory for longer than this interval will be written out next time a 
flusher thread wakes up."

In practice, it appears that once the expire time has passed, all dirty pages 
get flushed, regardless of their age. This behavior makes the setting fairly 
useless, and it appears to have been the behavior for most of 2.6 and 3.x. Can 
anyone explain whether the current behavior is really as intended, and whether 
the doc is just out of date?
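
For anyone who wants to check these settings on a test machine, here is a minimal C sketch that reads the two timers from /proc/sys/vm. The helper names read_tunable and print_writeback_timers are made up for illustration; only the two /proc paths are the kernel's own. Both values are in centiseconds.

```c
#include <assert.h>
#include <stdio.h>

/* Read one integer tunable from a sysctl-style file.
 * Returns 0 on success and stores the value in *out; returns -1 on failure.
 * (Hypothetical helper, not part of any kernel or libc API.) */
int read_tunable(const char *path, long *out)
{
    assert(path != NULL && out != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    long value;
    int matched = fscanf(f, "%ld", &value);
    fclose(f);
    if (matched != 1)
        return -1;

    *out = value;
    return 0;
}

/* Print the expire and writeback timers in seconds, skipping any that
 * cannot be read (for example on a non-Linux system). */
void print_writeback_timers(void)
{
    long expire_cs, writeback_cs;

    if (read_tunable("/proc/sys/vm/dirty_expire_centisecs", &expire_cs) == 0)
        printf("dirty_expire:    %.2f s\n", expire_cs / 100.0);
    if (read_tunable("/proc/sys/vm/dirty_writeback_centisecs", &writeback_cs) == 0)
        printf("dirty_writeback: %.2f s\n", writeback_cs / 100.0);
}
```

Calling print_writeback_timers() before and after changing a setting is a quick way to confirm which values a test run actually used.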

On a slightly related note, what was the key problem with this patch "msync: 
support syncing a small part of the file"? 
http://thread.gmane.org/gmane.linux.kernel/1313767/focus=1317498

Andrew Morton's message states that Paolo's patch would break nonlinear 
mappings, and the matter was dropped. Why wasn't it possible to write a patch 
that would also work with nonlinear mappings? I couldn't find any earlier 
context for that subject, pointers welcome.

My interest in both of these questions stems from what I've observed while 
testing the LMDB memory-mapped database. On a machine with 32GB RAM, using a 
database that occupies about 18GB of memory, doing continuous writes to the DB 
without ever calling msync, and default writeback settings, I see DB 
throughput spike downward every time the flusher wakes up. The DB is a mmap'd 
file on an XFS partition, and a DB write operation simply dirties a random set 
of pages. After the program has been running for more than 
dirty_expire_centisecs, every dirty_writeback_centisecs the DB app basically 
stops while the flusher writes out all the dirty pages.
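
The write pattern described above can be approximated with a small C sketch. This is not LMDB code: dirty_random_pages is a hypothetical helper, and it assumes the caller has already set up a writable MAP_SHARED mapping of the file and passes its base address, page count, and page size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Dirty `count` randomly chosen pages by modifying the first byte of each.
 * `base` points at `npages` pages of `page_size` bytes; in the scenario
 * above this would be a writable MAP_SHARED file mapping (an assumption,
 * the function itself works on any writable memory). The same page may be
 * picked more than once. `seed` holds the rand_r() state. */
void dirty_random_pages(unsigned char *base, size_t npages, size_t page_size,
                        size_t count, unsigned int *seed)
{
    assert(base != NULL && seed != NULL && npages > 0 && page_size > 0);

    for (size_t i = 0; i < count; i++) {
        size_t page = (size_t)rand_r(seed) % npages;
        base[page * page_size]++;   /* one store is enough to dirty the page */
    }
}
```

No msync call appears anywhere in this pattern, so writeback timing is left entirely to the kernel's flusher and the vm.dirty_* settings.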

I'm curious about a couple of things. Since the DB knows which pages it is 
dirtying in a given transaction, would it help overall throughput if the DB 
told the OS (via msync) exactly which ranges to flush? Obviously not, in the 
current implementation of msync, but can a patch like Paolo's make this 
better? And can the dirty_expire_centisecs behavior be fixed, so that it's 
only writing out a smaller set of pages on each wakeup? What else can we do to 
minimize the impact of the flusher? If I turn it off completely the throughput 
nearly doubles, from 5100 DB writes/sec to 9000/sec. If I turn off the timed 
flush and just use dirty_background_bytes, the throughput only drops to around 
7000/sec.
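
To make the msync idea concrete, here is a minimal sketch of flushing one known-dirty range of a shared mapping. flush_range is a hypothetical helper name; the only real calls are sysconf() and msync(). Since msync() requires a page-aligned start address, the helper rounds the address down to a page boundary and lengthens the span to match.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Ask the kernel to flush the pages covering [addr, addr + len).
 * `flags` is passed straight to msync(), e.g. MS_SYNC.
 * Returns msync()'s result: 0 on success, -1 with errno set on failure. */
int flush_range(void *addr, size_t len, int flags)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    /* Round the start down to a page boundary (page size is a power of two). */
    uintptr_t start = (uintptr_t)addr & ~((uintptr_t)page - 1);
    size_t span = len + (size_t)((uintptr_t)addr - start);

    return msync((void *)start, span, flags);
}
```

A transaction commit could call flush_range(page_addr, page_len, MS_SYNC) once per dirtied range. Whether the kernel then writes only that range or the whole file depends on the kernel's msync implementation, which is exactly the question raised above. Note also that, according to the msync(2) man page, MS_ASYNC has been a no-op on Linux since 2.6.19, so it would not trigger any writeback here.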

It seems to me the main slowdown is because the OS is locking dirty pages 
indiscriminately. The DB does copy-on-write, so pages that it dirties in one 
transaction will not be written again in the next transaction. I would have 
expected read-only accesses to these pages to be able to progress without any 
delay, but that doesn't seem to be the case.

-- 
   -- Howard Chu
   CTO, Symas Corp.           http://www.symas.com
   Director, Highland Sun     http://highlandsun.com/hyc/
   Chief Architect, OpenLDAP  http://www.openldap.org/project/