linux-kernel - Re: [PATCH] [RFC] vmscan.c: add a sysctl entry for controlling memory reclaim IO congestion


Message-ID: <afb916d8-2c19-f4b7-649f-0d819c2f7e08@wangsu.com>
Date:   Thu, 19 Sep 2019 10:20:28 +0800
From:   Lin Feng <linf@...gsu.com>
To:     Matthew Wilcox <willy@...radead.org>
Cc:     corbet@....net, mcgrof@...nel.org, akpm@...ux-foundation.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        keescook@...omium.org, mchehab+samsung@...nel.org,
        mgorman@...hsingularity.net, vbabka@...e.cz, mhocko@...e.com,
        ktkhai@...tuozzo.com, hannes@...xchg.org
Subject: Re: [PATCH] [RFC] vmscan.c: add a sysctl entry for controlling memory
 reclaim IO congestion_wait length

Hi,

On 9/18/19 19:38, Matthew Wilcox wrote:
> On Wed, Sep 18, 2019 at 11:21:04AM +0800, Lin Feng wrote:
>>> Adding a new tunable is not the right solution.  The right way is
>>> to make Linux auto-tune itself to avoid the problem.  For example,
>>> bdi_writeback contains an estimated write bandwidth (calculated by the
>>> memory management layer).  Given that, we should be able to make an
>>> estimate for how long to wait for the queues to drain.
>>>
>>
>> Yes, I had considered that; auto-tuning is definitely the senior AI way.
>> But considering all kinds of production environments, hybrid storage
>> solutions are also common today: the bdi devices behind a server's dirty
>> pages can span from high-end SSDs to low-end SATA disks. So we would have
>> to think of a *formula (AI core)* using the amount of dirty pages and the
>> bdis' write bandwidth as factors. This AI core would depend on whether the
>> estimated write bandwidth is sane and, if the bdi is a rotational disk, on
>> whether the dirty pages to be written back are sequential or random. It is
>> likely to give a not-sane number and hurt people who don't want that, while
>> considering only SSDs is relatively simple.
>>
>> So IMHO it's not sane to brute-force guessing logic into the memory writeback
>> code and pray for a formula that caters to everyone's needs.
>> Adding a sysctl entry may be the right choice: it helps people who need it
>> and doesn't hurt people who don't want it.
> 
> You're making this sound far harder than it is.  All the writeback code
> needs to know is "How long should I sleep for in order for the queues
> to drain a substantial amount".  Since you know the bandwidth and how
> many pages you've queued up, it's a simple calculation.
> 
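
As a rough sketch of the calculation described above (not kernel code): given
the number of queued pages and an estimated write bandwidth in pages per
second, compute how long to sleep for a chosen fraction of the queue to drain.
The function name, the drain_percent parameter and the 1..100 ms clamp are
assumptions made up for illustration; the upper bound merely mirrors the HZ/10
timeout that reclaim passes to congestion_wait().

```c
#include <stdint.h>

/* Hypothetical bounds chosen for illustration, not taken from the kernel. */
#define MIN_WAIT_MS 1u
#define MAX_WAIT_MS 100u	/* same order as the HZ/10 congestion_wait() timeout */

/*
 * Estimate how many milliseconds to sleep so that drain_percent of the
 * queued pages can be written back at the estimated bandwidth.
 * bw_pages_per_sec is assumed to be an estimate in pages per second.
 */
unsigned int estimate_wait_ms(uint64_t queued_pages,
			      uint64_t bw_pages_per_sec,
			      unsigned int drain_percent)
{
	uint64_t target_pages, ms;

	/* No usable bandwidth estimate: fall back to the longest wait. */
	if (bw_pages_per_sec == 0)
		return MAX_WAIT_MS;

	target_pages = queued_pages * drain_percent / 100;
	ms = target_pages * 1000 / bw_pages_per_sec;

	/* Clamp so a bad estimate can neither busy-loop nor stall reclaim. */
	if (ms < MIN_WAIT_MS)
		return MIN_WAIT_MS;
	if (ms > MAX_WAIT_MS)
		return MAX_WAIT_MS;
	return (unsigned int)ms;
}
```

For example, 1000 queued pages at an estimated 10000 pages/s with
drain_percent = 50 gives 500 * 1000 / 10000 = 50 ms. Whether the bandwidth
estimate is trustworthy on mixed or rotational storage is exactly the concern
raised earlier in the thread; the clamp only limits the damage of a bad guess.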

Ah, I should have read more of the writeback code ;-)
Based on Michal's comments:
 > the underlying problem. Both congestion_wait and wait_iff_congested
 > should wake up early if the congestion is handled. Is this not the case?
If the process is woken up once the bdi congestion is cleared, the length of
this timeout seems not that important. I need to trace more to see whether I
can reproduce this issue without online network traffic. But the weird thing is
that once I set the people-disliked tunable, iowait drops instantly; the two
observations are contradictory.

Anyway, thanks a lot for your suggestions!
linfeng