Message-ID: <7c16bc60-5243-07e8-5783-089e477c2693@redhat.com>
Date: Thu, 14 Nov 2019 14:23:40 +0100
From: David Hildenbrand <david@...hat.com>
To: Hillf Danton <hdanton@...a.com>,
Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-mm <linux-mm@...ck.org>, Rong Chen <rong.a.chen@...el.com>,
Fengguang Wu <fengguang.wu@...el.com>,
Tejun Heo <tj@...nel.org>, Shakeel Butt <shakeelb@...gle.com>,
Minchan Kim <minchan@...nel.org>, Mel Gorman <mgorman@...e.de>,
linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [RFC v3] writeback: add elastic bdi in cgwb bdp
On 14.11.19 10:38, Hillf Danton wrote:
>
> On Tue, 12 Nov 2019 17:02:36 -0800 Andrew Morton wrote:
>>
>> On Tue, 12 Nov 2019 11:42:27 +0800 Hillf Danton <hdanton@...a.com> wrote:
>>>
>>> The elastic bdi (ebdi), a mirror bdi of the spinning disks, SSDs
>>> and USB keys on the market, is introduced for balancing dirty
>>> pages (bdp).
>>>
>>> The risk arises that the system runs out of free memory when too
>>> many dirty pages are produced too soon, so bdp is needed in the
>>> field.
>>>
>>> Ebdi facilitates bdp in elastic time intervals, e.g. from a jiffy
>>> to one HZ, depending on the time it would take to increase dirty
>>> pages by the amount defined by the variable ratelimit_pages.
>>>
>>> During cgroup writeback (cgwb) bdp, ebdi helps observe the
>>> changes both in cgwb's dirty pages (dirty speed) and in
>>> written-out pages (laundry speed) in elastic time intervals,
>>> until a balance is established between the two parties, i.e.
>>> the two speeds are statistically equal.
>>>
>>> The above mechanism of elastic equilibrium effectively prevents
>>> dirty page hogs, as no chance is left for dirty pages to pile up,
>>> and thus cuts the risk that system free memory falls to an unsafe
>>> level.
>>>
>>> Thanks to Rong Chen for testing.
>>
>> That sounds like a Tested-by:
>>
> Yes, Sir, will add Tested-by: Rong Chen <rong.a.chen@...el.com>
>
>> The changelog has no testing results. Please prepare results which
>> show, amongst other things, the change in performance when the kernel
>> isn't tight on memory. As well as the alteration in behaviour when
>> memory is short.
>>
> Will do.
>
>> Generally, please work on making this code much more understandable?
>>
> Will do.
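>
> To make the core idea concrete: the balance check driving the whole
> series boils down to the following (simplified from the hunk quoted
> below; wb_stat_error() merely allows for per-cpu counter drift):
>
> 	/* has wb's dirty speed run away from its laundry speed? */
> 	bool throttle = wb_stat(wb, WB_DIRTIED) >
> 			wb_stat(wb, WB_WRITTEN) + wb_stat_error();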
>
>>>
>>> ...
>>>
>>> --- a/fs/fs-writeback.c
>>> +++ b/fs/fs-writeback.c
>>> @@ -811,6 +811,8 @@ static long wb_split_bdi_pages(struct bd
>>>  	if (nr_pages == LONG_MAX)
>>>  		return LONG_MAX;
>>>
>>> +	return nr_pages;
>>> +
>>>  	/*
>>>  	 * This may be called on clean wb's and proportional distribution
>>>  	 * may not make sense, just use the original @nr_pages in those
>>> @@ -1604,6 +1606,7 @@ static long writeback_chunk_size(struct
>>>  		pages = min(pages, work->nr_pages);
>>>  		pages = round_down(pages + MIN_WRITEBACK_PAGES,
>>>  				   MIN_WRITEBACK_PAGES);
>>> +		pages = work->nr_pages;
>>
>> It's unclear what this is doing, but it makes the three preceding
>> statements non-operative.
>>
> This change, and the above one as well, tries to bypass the current
> bandwidth control; a couple of rounds of cleanup will follow once it
> survives the LTP.
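>
> After that cleanup, the tail of writeback_chunk_size() would
> presumably collapse to something like (a sketch, assuming the
> WB_SYNC_ALL special case at the top of the function stays):
>
> 	} else {
> 		pages = work->nr_pages;
> 	}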
>
>>>  	}
>>>
>>>  	return pages;
>>> @@ -2092,6 +2095,9 @@ void wb_workfn(struct work_struct *work)
>>>  		wb_wakeup_delayed(wb);
>>>
>>>  	current->flags &= ~PF_SWAPWRITE;
>>> +
>>> +	if (waitqueue_active(&wb->bdp_waitq))
>>> +		wake_up_all(&wb->bdp_waitq);
>>
>> Please add a comment explaining why this is being done here.
>>
> After writing out some dirty pages, it is a checkpoint to see
> whether a balance has been established between the dirty speed and
> the laundry speed. Those under throttling will be unthrottled once
> a balance is in place.
>
> A comment will be added.
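>
> Something like (draft wording; the code is the hunk quoted above):
>
> 	/*
> 	 * A chunk of dirty pages has just been written out, so give
> 	 * the dirtiers throttled in cgwb_bdp() a chance to recheck
> 	 * whether dirty speed and laundry speed are in balance, and
> 	 * to bail out of the throttle if they are.
> 	 */
> 	if (waitqueue_active(&wb->bdp_waitq))
> 		wake_up_all(&wb->bdp_waitq);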
>
>>>  }
>>>
>>>  /*
>>> --- a/mm/page-writeback.c
>>> +++ b/mm/page-writeback.c
>>> @@ -1830,6 +1830,67 @@ pause:
>>>  		wb_start_background_writeback(wb);
>>>  }
>>>
>>> +/**
>>> + * cgwb_bdp_should_throttle() tell if a wb should be throttled
>>> + * @wb bdi_writeback to throttle
>>> + *
>>> + * To avoid the risk of exhausting the system free memory, we check
>>> + * and try much to prevent too many dirty pages from being produced
>>> + * too soon.
>>> + *
>>> + * For cgroup writeback, it is essencially to keep an equilibrium
>>
>> "it is essential"?
>>
> Yes Sir.
>
>>> + * between its dirty speed and laundry speed i.e. dirty pages are
>>> + * written out as fast as they are produced in an ideal state.
>>> + */
>>> +static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb)
>>> +{
>>> +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
>>> +
>>> +	if (fatal_signal_pending(current))
>>> +		return false;
>>> +
>>> +	gdtc.avail = global_dirtyable_memory();
>>> +
>>> +	domain_dirty_limits(&gdtc);
>>> +
>>> +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
>>> +		     global_node_page_state(NR_UNSTABLE_NFS) +
>>> +		     global_node_page_state(NR_WRITEBACK);
>>> +
>>> +	if (gdtc.dirty < gdtc.bg_thresh)
>>> +		return false;
>>> +
>>> +	if (!writeback_in_progress(wb))
>>> +		wb_start_background_writeback(wb);
>>
>> This is a bit ugly. Something called "bool cgwb_bdp_should_throttle()"
>> should just check whether we should throttle. But here it is, also
>> initiating writeback. That's an inappropriate thing for this function
>> to do?
>>
> It is the current bdp behavior to try to keep dirty pages below the
> user-configurable background threshold by waking up the flushers,
> because no dirty page will be sent to disk without the flusher's
> efforts; please see 143dfe8611a6 ("writeback: IO-less
> balance_dirty_pages()").
>
> Will try to find some chance to pinch it out.
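>
> One way to pinch it out: a tiny helper called from cgwb_bdp() before
> the throttle check, so the predicate stays free of side effects
> (an untested sketch; the helper name is made up):
>
> 	static void cgwb_kick_flusher(struct bdi_writeback *wb)
> 	{
> 		if (!writeback_in_progress(wb))
> 			wb_start_background_writeback(wb);
> 	}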
>
>> Also, we don't know *why* this is being done here, because there's no
>> code comment explaining the reasoning to us.
>>
> Will add a comment.
>
>>
>>> +	if (gdtc.dirty < gdtc.thresh)
>>> +		return false;
>>> +
>>> +	/*
>>> +	 * throttle wb if there is the risk that wb's dirty speed is
>>> +	 * running away from its laundry speed, better with statistic
>>> +	 * error taken into account.
>>> +	 */
>>> +	return wb_stat(wb, WB_DIRTIED) >
>>> +	       wb_stat(wb, WB_WRITTEN) + wb_stat_error();
>>> +}
>>> +
>>>
>>> ...
>>>
>>> @@ -1888,29 +1945,38 @@ void balance_dirty_pages_ratelimited(str
>>>  	 * 1000+ tasks, all of them start dirtying pages at exactly the same
>>>  	 * time, hence all honoured too large initial task->nr_dirtied_pause.
>>>  	 */
>>> -	p = this_cpu_ptr(&bdp_ratelimits);
>>> -	if (unlikely(current->nr_dirtied >= ratelimit))
>>> -		*p = 0;
>>> -	else if (unlikely(*p >= ratelimit_pages)) {
>>> -		*p = 0;
>>> -		ratelimit = 0;
>>> -	}
>>> +	dirty = this_cpu_ptr(&bdp_ratelimits);
>>> +
>>>  	/*
>>>  	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
>>>  	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
>>>  	 * the dirty throttling and livelock other long-run dirtiers.
>>>  	 */
>>> -	p = this_cpu_ptr(&dirty_throttle_leaks);
>>> -	if (*p > 0 && current->nr_dirtied < ratelimit) {
>>> -		unsigned long nr_pages_dirtied;
>>> -		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
>>> -		*p -= nr_pages_dirtied;
>>> -		current->nr_dirtied += nr_pages_dirtied;
>>> +	leak = this_cpu_ptr(&dirty_throttle_leaks);
>>> +
>>> +	if (*dirty + *leak < ratelimit_pages) {
>>> +		/*
>>> +		 * nothing to do as it would take some more time to
>>> +		 * eat out ratelimit_pages
>>> +		 */
>>> +		try_bdp = false;
>>> +	} else {
>>> +		try_bdp = true;
>>> +
>>> +		/*
>>> +		 * bdp in flight helps detect dirty page hogs soon
>>> +		 */
>>
>> How? Please expand on this comment a lot.
>>
> We should be cautious here in the red zone after paying the
> ratelimit_pages price; we might soon have to tackle a deluge of
> dirty page hogs.
>
> Will cut it.
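>
> For context, the throttling side, cgwb_bdp(), is not quoted in this
> mail; it is essentially a wait on wb->bdp_waitq, roughly:
>
> 	static void cgwb_bdp(struct bdi_writeback *wb)
> 	{
> 		wait_event_interruptible_timeout(wb->bdp_waitq,
> 				!cgwb_bdp_should_throttle(wb), HZ);
> 	}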
>
>>> +		flights = this_cpu_ptr(&bdp_in_flight);
>>> +
>>> +		if ((*flights)++ & 1) {
>>
>> What is that "& 1" doing?
>>
> It helps to tell if a bdp is already in flight: the post-increment
> tests the old counter value, so the branch is taken on every second
> crossing of ratelimit_pages.
>
> It would have been something like
>
> 	if (*flights == 0) {
> 		(*flights)++;
> 	} else {
> 		*flights = 0;
>>> +			*dirty = *dirty + *leak - ratelimit_pages;
>>> +			*leak = 0;
>>> +		}
>
> but I was curious to see the flights in the long run.
>
> Thanks
> Hillf
>
>>>  	}
>>>  	preempt_enable();
>>>
>>> -	if (unlikely(current->nr_dirtied >= ratelimit))
>>> -		balance_dirty_pages(wb, current->nr_dirtied);
>>> +	if (try_bdp)
>>> +		cgwb_bdp(wb);
>>>
>>>  	wb_put(wb);
>
>
Friendly note that your mail client is messing up the thread hierarchy
again (I think it was correct for a while):
Message-Id: <20191114093832.8504-1-hdanton@...a.com>
In-Reply-To: <20191112034227.3112-1-hdanton@...a.com>
References: <20191112034227.3112-1-hdanton@...a.com>
I assume a kernel developer can set up a mail client or switch to a sane
one. Please don't prove me wrong. ;)
--
Thanks,
David / dhildenb