Date:   Tue, 3 Oct 2017 00:50:31 +0300
From:   Konstantin Khlebnikov <khlebnikov@...dex-team.ru>
To:     Jens Axboe <axboe@...nel.dk>, linux-fsdevel@...r.kernel.org,
        linux-mm@...ck.org, linux-kernel@...r.kernel.org
Cc:     Michal Hocko <mhocko@...e.com>, Mel Gorman <mgorman@...e.de>,
        Johannes Weiner <hannes@...xchg.org>,
        Tejun Heo <tj@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [PATCH RFC] mm: implement write-behind policy for sequential file
 writes

On 02.10.2017 23:00, Jens Axboe wrote:
> On 10/02/2017 03:54 AM, Konstantin Khlebnikov wrote:
>> Traditional writeback tries to accumulate as much dirty data as possible.
>> This is a worthwhile strategy for extremely short-lived files and for
>> batching writes to save battery power. But for workloads where disk latency
>> matters, this policy generates periodic disk load spikes which increase
>> latency for concurrent operations.
>>
>> The present writeback engine allows tuning only the dirty data size or the
>> expiration time. Such tuning cannot eliminate the spikes - it just makes them
>> smaller and more frequent. The other option is switching to sync mode, which
>> flushes written data right after each write; obviously this has a significant
>> performance impact. Such tuning is also system-wide and affects memory-mapped
>> and randomly written files, which flusher threads handle much better.
>>
>> This patch implements a write-behind policy which tracks sequential writes
>> and starts background writeback once enough dirty pages accumulate in a row.
> 
> This is a great idea in general. My only concerns would be around cases
> where we don't expect the writes to ever make it to media. It's not an
> uncommon use case - app dirties some memory in a file, and expects
> to truncate/unlink it before it makes it to disk. We don't want to trigger
> writeback for those. Arguably that should be app hinted.

Yes, this is a case where serious degradation might happen.

The 256k threshold saves small files from being written out.
Big temporary files have a good chance of being pushed to disk
by memory pressure or the flusher thread anyway.
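
For the app-hinted side, something like this sketch: posix_fadvise() is the
standard interface, the reactions to RANDOM/NOREUSE are the ones described in
the patch, and the filenames are just examples.

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		/* Scratch file the app expects to unlink before it hits disk:
		 * POSIX_FADV_RANDOM opts it out of write-behind entirely. */
		int tmp = open("/tmp/scratch", O_RDWR | O_CREAT | O_TRUNC, 0600);
		if (tmp >= 0)
			posix_fadvise(tmp, 0, 0, POSIX_FADV_RANDOM);

		/* Verbose log: POSIX_FADV_NOREUSE lets write-behind drop fully
		 * written pages from the cache once writeback completes. */
		int log = open("/tmp/app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
		if (log >= 0)
			posix_fadvise(log, 0, 0, POSIX_FADV_NOREUSE);

		/* ... normal sequential writes go here ... */

		if (tmp >= 0)
			close(tmp);
		if (log >= 0)
			close(log);
		return 0;
	}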

> 
>> Write-behind tracks the current write position and looks at two windows
>> behind it: the first holds not-yet-written dirty pages, the second covers
>> pages under async writeback.
>>
>> The next write starts background writeback when the first window exceeds the
>> threshold, and waits for pages that fall behind the async writeback window.
>> This allows combining small writes into bigger requests and maintaining an
>> optimal io-depth.
>>
>> This affects only writes via syscalls; memory-mapped writes are unchanged.
>> Write-behind also doesn't affect files marked with fadvise POSIX_FADV_RANDOM.
>>
>> If the async window is set to 0, write-behind skips dirty pages for a
>> congested disk and never waits for writeback. This is used for files opened
>> with O_NONBLOCK.
>>
>> Also, for files marked with fadvise POSIX_FADV_NOREUSE, write-behind
>> automatically evicts completely written pages from the cache. This is perfect
>> for writing verbose logs without pushing more important data out of the cache.
>>
>> As a bonus, write-behind makes blkio throttling much smoother for bulk file
>> operations like copying or downloading, which write sequentially.
>>
>> The minimal write-behind request size is set in:
>> /sys/block/$DISK/bdi/min_write_behind_kb
>> The default is 256KB; 0 disables write-behind for this disk.
>>
>> The async window size is set in:
>> /sys/block/$DISK/bdi/async_write_behind_kb
>> The default is 1024KB; 0 disables synchronous write-behind.
> 
> Should we expose these, or just make them a function of the IO limitations
> exposed by the device? Something like 2x max request size, or similar.

The window depends on IO latency expectations for the parallel workload
and on concurrency at all levels.
Also, it seems that RAIDs need special treatment.
For now I think this is the minimal possible interface.
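
Tuning from userspace is then just a plain sysfs write; a sketch (these
files exist only with the patch applied, and the disk name and values are
arbitrary examples):

	#include <stdio.h>

	/* Write a kilobyte value into one of the proposed bdi knobs.
	 * Returns 0 on success, -1 if the knob is missing or the write fails. */
	static int set_bdi_kb(const char *disk, const char *knob, unsigned int kb)
	{
		char path[256];
		FILE *f;

		snprintf(path, sizeof(path), "/sys/block/%s/bdi/%s", disk, knob);
		f = fopen(path, "w");
		if (!f)
			return -1;
		fprintf(f, "%u\n", kb);
		return fclose(f) ? -1 : 0;
	}

	int main(void)
	{
		/* Example values: trigger write-behind after 512KB of sequential
		 * dirty data, keep a 2MB async window ("sda" is just an example). */
		set_bdi_kb("sda", "min_write_behind_kb", 512);
		set_bdi_kb("sda", "async_write_behind_kb", 2048);
		return 0;
	}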

> 
> Finally, do you have any test results?
> 

Nothing in particular yet.

For example:
$ fio --name=test --rw=write --filesize=1G --ioengine=sync --blocksize=4k --end_fsync=1

With the patch it finishes earlier:
9.0s -> 8.2s for HDD
5.4s -> 4.7s for SSD
because writing to disk starts earlier. Both runs use the old sq/cfq.
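
For reference, that fio job is roughly this in plain C (a sketch only:
sequential 4k writes totalling 1G, then fsync; the filename is arbitrary):

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		/* Same workload as the fio job above: 1G written sequentially
		 * in 4k blocks, followed by a single fsync at the end. */
		const size_t bs = 4096;
		const size_t total = 1UL << 30;
		char buf[4096];
		size_t done;
		int fd;

		memset(buf, 'x', sizeof(buf));
		fd = open("test.file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		for (done = 0; done < total; done += bs) {
			if (write(fd, buf, bs) != (ssize_t)bs) {
				perror("write");
				return 1;
			}
		}
		if (fsync(fd)) {
			perror("fsync");
			return 1;
		}
		close(fd);
		return 0;
	}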
