Message-ID: <66e8e0cc-1abb-a283-1e0d-068124a84790@suse.cz>
Date: Thu, 14 Oct 2021 17:42:51 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Mel Gorman <mgorman@...hsingularity.net>
Cc: Linux-MM <linux-mm@...ck.org>, NeilBrown <neilb@...e.de>,
Theodore Ts'o <tytso@....edu>,
Andreas Dilger <adilger.kernel@...ger.ca>,
"Darrick J . Wong" <djwong@...nel.org>,
Matthew Wilcox <willy@...radead.org>,
Michal Hocko <mhocko@...e.com>,
Dave Chinner <david@...morbit.com>,
Rik van Riel <riel@...riel.com>,
Johannes Weiner <hannes@...xchg.org>,
Jonathan Corbet <corbet@....net>,
Linux-fsdevel <linux-fsdevel@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/8] mm/vmscan: Throttle reclaim until some writeback
completes if congested
On 10/14/21 12:47, Mel Gorman wrote:
> Thanks Vlastimil
>
> On Wed, Oct 13, 2021 at 05:39:36PM +0200, Vlastimil Babka wrote:
>> > +/*
>> > + * Account for pages written if tasks are throttled waiting on dirty
>> > + * pages to clean. If enough pages have been cleaned since throttling
>> > + * started then wakeup the throttled tasks.
>> > + */
>> > +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
>> > + int nr_throttled)
>> > +{
>> > + unsigned long nr_written;
>> > +
>> > + __inc_node_page_state(page, NR_THROTTLED_WRITTEN);
>>
>> Is this intentionally using the __ version that normally expects irqs to be
>> disabled (AFAIK they are not in this path)? I think this is a rarely used
>> cold path, so it doesn't seem worth trading accuracy for speed.
>>
>
> It was intentional because IRQs can be disabled, and even if it's race-prone
> it's not overly problematic. But you're right, better to be safe. I changed
> it to the safe variant as it's mostly free on x86, arm64 and s390, and for
> other architectures this is a slow path.
Great, thanks.
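For anyone reading along, the difference is roughly the following; this is a
paraphrase of the mm/vmstat.c fallback path from memory, not an exact quote.
The __ variant does an unprotected per-cpu update and relies on the caller
having IRQs disabled (or tolerating a lost update), while the plain variant
protects itself. On x86, arm64 and s390 the IRQ-safe this_cpu ops make that
protection essentially free; elsewhere it falls back to disabling interrupts:

void mod_node_page_state(struct pglist_data *pgdat,
                         enum node_stat_item item, long delta)
{
        unsigned long flags;

        /* The only extra cost compared to the __ variant. */
        local_irq_save(flags);
        __mod_node_page_state(pgdat, item, delta);
        local_irq_restore(flags);
}

void inc_node_page_state(struct page *page, enum node_stat_item item)
{
        mod_node_page_state(page_pgdat(page), item, 1);
}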
>> > + nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
>> > + READ_ONCE(pgdat->nr_reclaim_start);
>>
>> Even if the inc above were safe, node_page_state() returns only the global
>> counter, so the value we read here will only actually advance when some
>> CPU's per-cpu delta overflows its threshold and gets folded back, so it
>> will be "bursty". Maybe it's ok, just worth documenting?
>>
>
> I didn't think the penalty of doing an accurate read while the system is
> writeback throttled was worth it. I'll add a comment.
>
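To spell out the burstiness for anyone else following the thread: the per-cpu
counters batch updates in a small delta and only fold it into the global node
counter once a threshold is crossed, so a cheap node_page_state() read only
advances in threshold-sized steps. Roughly (again a paraphrase of mm/vmstat.c,
not the exact code):

void __mod_node_page_state(struct pglist_data *pgdat,
                           enum node_stat_item item, long delta)
{
        struct per_cpu_nodestat __percpu *pcp = pgdat->per_cpu_nodestats;
        s8 __percpu *p = pcp->vm_node_stat_diff + item;
        long x = delta + __this_cpu_read(*p);

        if (unlikely(abs(x) > __this_cpu_read(pcp->stat_threshold))) {
                /* Only here does the global counter actually move. */
                node_page_state_add(x, pgdat, item);
                x = 0;
        }
        __this_cpu_write(*p, x);
}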
>> > +
>> > + if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
>> > + wake_up_all(&pgdat->reclaim_wait);
>>
>> Hm, it seems a bit weird that the more tasks are throttled, the longer we
>> wait, and then wake up all of them. Theoretically this will lead to even
>> more bursty/thundering-herd behavior. Could it be better to wake up a
>> single task per SWAP_CLUSTER_MAX pages cleaned, and bump nr_reclaim_start?
>> But maybe it's not a problem in practice due to the HZ/10 timeouts being
>> short enough?
>>
>
> Yes, the more tasks are throttled the longer tasks wait because tasks are
> allocating faster than writeback can complete so I wanted to reduce the
> allocation pressure. I considered waking one task at a time but there is
> no prioritisation of tasks on the waitqueue and it's not clear that the
> additional complexity is justified. With inaccurate counters, a light
> allocator could get throttled for the full timeout unnecessarily.
>
> Even if we were to wake one task at a time, I would prefer it was done
> as a potential optimisation on top.
Fair enough.
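For the record, the single-wakeup variant I had in mind was something along
these lines - purely illustrative and untested, and it would also need the
waiters to queue exclusively so that wake_up() wakes only one task:

        /*
         * Hypothetical alternative to wake_up_all(), not the patch:
         * wake one waiter per SWAP_CLUSTER_MAX cleaned pages and move
         * the baseline forward so the next wakeup needs a fresh batch.
         */
        if (nr_written >= SWAP_CLUSTER_MAX) {
                wake_up(&pgdat->reclaim_wait);
                WRITE_ONCE(pgdat->nr_reclaim_start,
                           READ_ONCE(pgdat->nr_reclaim_start) +
                           SWAP_CLUSTER_MAX);
        }

But agreed, if it's ever needed it can be done as an optimisation on top.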
> Diff on top based on review feedback:
Thanks, with that you can add
Acked-by: Vlastimil Babka <vbabka@...e.cz>
to the updated version.
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bcd22e53795f..735b1f2b5d9e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1048,7 +1048,15 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
> {
> unsigned long nr_written;
>
> - __inc_node_page_state(page, NR_THROTTLED_WRITTEN);
> + inc_node_page_state(page, NR_THROTTLED_WRITTEN);
> +
> + /*
> + * This is an inaccurate read as the per-cpu deltas may not
> + * be synchronised. However, given that the system is
> + * writeback throttled, it is not worth taking the penalty
> + * of getting an accurate count. At worst, the throttle
> + * timeout guarantees forward progress.
> + */
> nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
> READ_ONCE(pgdat->nr_reclaim_start);
>
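A side note for anyone reading only this hunk: the "throttle timeout" in the
comment refers to the waiter side of the mechanism. As I read the patch it is
essentially a timed sleep on pgdat->reclaim_wait, along the lines of the
sketch below; the function name and the omitted nr_throttled accounting are
placeholders, see the actual patch for the real code.

/* Hypothetical sketch of the throttling side, not the patch itself. */
static void reclaim_throttle_sketch(pg_data_t *pgdat)
{
        DEFINE_WAIT(wait);

        /*
         * Sleep until __acct_reclaim_writeback() wakes us or the HZ/10
         * timeout expires, so forward progress is guaranteed even if
         * the cleaned-page counter never reaches the wakeup condition.
         */
        prepare_to_wait(&pgdat->reclaim_wait, &wait, TASK_INTERRUPTIBLE);
        schedule_timeout(HZ / 10);
        finish_wait(&pgdat->reclaim_wait, &wait);
}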