Message-ID: <7c16bc60-5243-07e8-5783-089e477c2693@redhat.com>
Date: Thu, 14 Nov 2019 14:23:40 +0100
From: David Hildenbrand <david@...hat.com>
To: Hillf Danton <hdanton@...a.com>,
Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-mm <linux-mm@...ck.org>, Rong Chen <rong.a.chen@...el.com>,
Fengguang Wu <fengguang.wu@...el.com>,
Tejun Heo <tj@...nel.org>, Shakeel Butt <shakeelb@...gle.com>,
Minchan Kim <minchan@...nel.org>, Mel Gorman <mgorman@...e.de>,
linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [RFC v3] writeback: add elastic bdi in cgwb bdp
On 14.11.19 10:38, Hillf Danton wrote:
>
> On Tue, 12 Nov 2019 17:02:36 -0800 Andrew Morton wrote:
>>
>> On Tue, 12 Nov 2019 11:42:27 +0800 Hillf Danton <hdanton@...a.com> wrote:
>>>
>>> The elastic bdi (ebdi), a mirror bdi of the spinning disks, SSDs
>>> and USB keys on the market, is introduced for balancing dirty
>>> pages (bdp).
>>>
>>> The risk arises that the system runs out of free memory when too
>>> many dirty pages are produced too soon, so bdp is needed in the
>>> field.
>>>
>>> Ebdi facilitates bdp in elastic time intervals, e.g. from a jiffy
>>> to one HZ, depending on the time it would take to increase dirty
>>> pages by the amount defined by the variable ratelimit_pages.
>>>
>>> During cgroup writeback (cgwb) bdp, ebdi helps observe the
>>> changes both in cgwb's dirty pages (dirty speed) and in
>>> written-out pages (laundry speed) in elastic time intervals,
>>> until a balance is established between the two parties, i.e.
>>> the two speeds are statistically equal.
>>>
>>> The above mechanism of elastic equilibrium effectively prevents
>>> dirty page hogs, as no chance is left for dirty pages to pile up,
>>> and thus cuts the risk that system free memory falls to an unsafe
>>> level.
>>>
>>> Thanks to Rong Chen for testing.
>>
>> That sounds like a Tested-by:
>>
> Yes, Sir, will add Tested-by: Rong Chen <rong.a.chen@...el.com>
>
>> The changelog has no testing results. Please prepare results which
>> show, amongst other things, the change in performance when the kernel
>> isn't tight on memory. As well as the alteration in behaviour when
>> memory is short.
>>
> Will do.
>
>> Generally, please work on making this code much more understandable?
>>
> Will do.
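>
> To make the core idea concrete: the balance check driving the whole
> series boils down to the following (simplified from the hunk quoted
> below; wb_stat_error() merely allows for per-cpu counter drift):
>
> 	/* has wb's dirty speed run away from its laundry speed? */
> 	bool throttle = wb_stat(wb, WB_DIRTIED) >
> 			wb_stat(wb, WB_WRITTEN) + wb_stat_error();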
>
>>>
>>> ...
>>>
>>> --- a/fs/fs-writeback.c
>>> +++ b/fs/fs-writeback.c
>>> @@ -811,6 +811,8 @@ static long wb_split_bdi_pages(struct bd
>>>  	if (nr_pages == LONG_MAX)
>>>  		return LONG_MAX;
>>>
>>> +	return nr_pages;
>>> +
>>>  	/*
>>>  	 * This may be called on clean wb's and proportional distribution
>>>  	 * may not make sense, just use the original @nr_pages in those
>>> @@ -1604,6 +1606,7 @@ static long writeback_chunk_size(struct
>>>  		pages = min(pages, work->nr_pages);
>>>  		pages = round_down(pages + MIN_WRITEBACK_PAGES,
>>>  				   MIN_WRITEBACK_PAGES);
>>> +		pages = work->nr_pages;
>>
>> It's unclear what this is doing, but it makes the three preceding
>> statements non-operative.
>>
> This change, and the above one as well, tries to bypass the current
> bandwidth control; a couple of rounds of cleanup will follow once it
> survives the LTP.
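>
> After that cleanup, the tail of writeback_chunk_size() would
> presumably collapse to something like (a sketch, assuming the
> WB_SYNC_ALL special case at the top of the function stays):
>
> 	} else {
> 		pages = work->nr_pages;
> 	}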
>
>>>  	}
>>>
>>>  	return pages;
>>> @@ -2092,6 +2095,9 @@ void wb_workfn(struct work_struct *work)
>>>  		wb_wakeup_delayed(wb);
>>>
>>>  	current->flags &= ~PF_SWAPWRITE;
>>> +
>>> +	if (waitqueue_active(&wb->bdp_waitq))
>>> +		wake_up_all(&wb->bdp_waitq);
>>
>> Please add a comment explaining why this is being done here.
>>
> After writing out some dirty pages, it is a checkpoint to see
> whether a balance has been established between the dirty speed and
> the laundry speed. Those under throttling will be unthrottled once
> a balance is in place.
>
> A comment will be added.
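>
> Something like (draft wording; the code is the hunk quoted above):
>
> 	/*
> 	 * A chunk of dirty pages has just been written out, so give
> 	 * the dirtiers throttled in cgwb_bdp() a chance to recheck
> 	 * whether dirty speed and laundry speed are in balance, and
> 	 * to bail out of the throttle if they are.
> 	 */
> 	if (waitqueue_active(&wb->bdp_waitq))
> 		wake_up_all(&wb->bdp_waitq);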
>
>>>  }
>>>
>>>  /*
>>> --- a/mm/page-writeback.c
>>> +++ b/mm/page-writeback.c
>>> @@ -1830,6 +1830,67 @@ pause:
>>>  		wb_start_background_writeback(wb);
>>>  }
>>>
>>> +/**
>>> + * cgwb_bdp_should_throttle() tell if a wb should be throttled
>>> + * @wb bdi_writeback to throttle
>>> + *
>>> + * To avoid the risk of exhausting the system free memory, we check
>>> + * and try much to prevent too many dirty pages from being produced
>>> + * too soon.
>>> + *
>>> + * For cgroup writeback, it is essencially to keep an equilibrium
>>
>> "it is essential"?
>>
> Yes Sir.
>
>>> + * between its dirty speed and laundry speed i.e. dirty pages are
>>> + * written out as fast as they are produced in an ideal state.
>>> + */
>>> +static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb)
>>> +{
>>> +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
>>> +
>>> +	if (fatal_signal_pending(current))
>>> +		return false;
>>> +
>>> +	gdtc.avail = global_dirtyable_memory();
>>> +
>>> +	domain_dirty_limits(&gdtc);
>>> +
>>> +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
>>> +		     global_node_page_state(NR_UNSTABLE_NFS) +
>>> +		     global_node_page_state(NR_WRITEBACK);
>>> +
>>> +	if (gdtc.dirty < gdtc.bg_thresh)
>>> +		return false;
>>> +
>>> +	if (!writeback_in_progress(wb))
>>> +		wb_start_background_writeback(wb);
>>
>> This is a bit ugly. Something called "bool cgwb_bdp_should_throttle()"
>> should just check whether we should throttle. But here it is, also
>> initiating writeback. That's an inappropriate thing for this function
>> to do?
>>
> It is the current bdp behavior to try to keep dirty pages below the
> user-configurable background threshold by waking up the flushers,
> because no dirty page will be sent to disk without the flusher's
> efforts; please see 143dfe8611a6 ("writeback: IO-less
> balance_dirty_pages()").
>
> Will try to find some chance to pinch it out.
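>
> One way to pinch it out: a tiny helper called from cgwb_bdp() before
> the throttle check, so the predicate stays free of side effects
> (an untested sketch; the helper name is made up):
>
> 	static void cgwb_kick_flusher(struct bdi_writeback *wb)
> 	{
> 		if (!writeback_in_progress(wb))
> 			wb_start_background_writeback(wb);
> 	}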
>
>> Also, we don't know *why* this is being done here, because there's no
>> code comment explaining the reasoning to us.
>>
> Will add a comment.
>
>>
>>> +	if (gdtc.dirty < gdtc.thresh)
>>> +		return false;
>>> +
>>> +	/*
>>> +	 * throttle wb if there is the risk that wb's dirty speed is
>>> +	 * running away from its laundry speed, better with statistic
>>> +	 * error taken into account.
>>> +	 */
>>> +	return wb_stat(wb, WB_DIRTIED) >
>>> +	       wb_stat(wb, WB_WRITTEN) + wb_stat_error();
>>> +}
>>> +
>>>
>>> ...
>>>
>>> @@ -1888,29 +1945,38 @@ void balance_dirty_pages_ratelimited(str
>>>  	 * 1000+ tasks, all of them start dirtying pages at exactly the same
>>>  	 * time, hence all honoured too large initial task->nr_dirtied_pause.
>>>  	 */
>>> -	p = this_cpu_ptr(&bdp_ratelimits);
>>> -	if (unlikely(current->nr_dirtied >= ratelimit))
>>> -		*p = 0;
>>> -	else if (unlikely(*p >= ratelimit_pages)) {
>>> -		*p = 0;
>>> -		ratelimit = 0;
>>> -	}
>>> +	dirty = this_cpu_ptr(&bdp_ratelimits);
>>> +
>>>  	/*
>>>  	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
>>>  	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
>>>  	 * the dirty throttling and livelock other long-run dirtiers.
>>>  	 */
>>> -	p = this_cpu_ptr(&dirty_throttle_leaks);
>>> -	if (*p > 0 && current->nr_dirtied < ratelimit) {
>>> -		unsigned long nr_pages_dirtied;
>>> -		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
>>> -		*p -= nr_pages_dirtied;
>>> -		current->nr_dirtied += nr_pages_dirtied;
>>> +	leak = this_cpu_ptr(&dirty_throttle_leaks);
>>> +
>>> +	if (*dirty + *leak < ratelimit_pages) {
>>> +		/*
>>> +		 * nothing to do as it would take some more time to
>>> +		 * eat out ratelimit_pages
>>> +		 */
>>> +		try_bdp = false;
>>> +	} else {
>>> +		try_bdp = true;
>>> +
>>> +		/*
>>> +		 * bdp in flight helps detect dirty page hogs soon
>>> +		 */
>>
>> How? Please expand on this comment a lot.
>>
> We should be cautious here in the red zone after paying the
> ratelimit_pages price; we might soon have to tackle a deluge of
> dirty page hogs.
>
> Will cut it.
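>
> For context, the throttling side, cgwb_bdp(), is not quoted in this
> mail; it is essentially a wait on wb->bdp_waitq, roughly:
>
> 	static void cgwb_bdp(struct bdi_writeback *wb)
> 	{
> 		wait_event_interruptible_timeout(wb->bdp_waitq,
> 				!cgwb_bdp_should_throttle(wb), HZ);
> 	}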
>
>>> +		flights = this_cpu_ptr(&bdp_in_flight);
>>> +
>>> +		if ((*flights)++ & 1) {
>>
>> What is that "& 1" doing?
>>
> It helps to tell if a bdp is already in flight: the post-increment
> tests the old counter value, so the branch is taken on every second
> crossing of ratelimit_pages.
>
> It would have been something like
>
> 	if (*flights == 0) {
> 		(*flights)++;
> 	} else {
> 		*flights = 0;
>>> +			*dirty = *dirty + *leak - ratelimit_pages;
>>> +			*leak = 0;
>>> +		}
>
> but I was curious to see the flights in the long run.
>
> Thanks
> Hillf
>
>>>  	}
>>>  	preempt_enable();
>>>
>>> -	if (unlikely(current->nr_dirtied >= ratelimit))
>>> -		balance_dirty_pages(wb, current->nr_dirtied);
>>> +	if (try_bdp)
>>> +		cgwb_bdp(wb);
>>>
>>>  	wb_put(wb);
>
>
Friendly note that your mail client is messing up the thread hierarchy
again (I think it was correct for a while):
Message-Id: <20191114093832.8504-1-hdanton@...a.com>
In-Reply-To: <20191112034227.3112-1-hdanton@...a.com>
References: <20191112034227.3112-1-hdanton@...a.com>
I assume a kernel developer can set up a mail client or switch to a sane
one. Please don't prove me wrong. ;)
--
Thanks,
David / dhildenb