[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <63898e7a-0846-3105-96b5-76c89635e499@suse.cz>
Date: Wed, 13 Oct 2021 17:39:36 +0200
From: Vlastimil Babka <vbabka@...e.cz>
To: Mel Gorman <mgorman@...hsingularity.net>,
Linux-MM <linux-mm@...ck.org>
Cc: NeilBrown <neilb@...e.de>, Theodore Ts'o <tytso@....edu>,
Andreas Dilger <adilger.kernel@...ger.ca>,
"Darrick J . Wong" <djwong@...nel.org>,
Matthew Wilcox <willy@...radead.org>,
Michal Hocko <mhocko@...e.com>,
Dave Chinner <david@...morbit.com>,
Rik van Riel <riel@...riel.com>,
Johannes Weiner <hannes@...xchg.org>,
Jonathan Corbet <corbet@....net>,
Linux-fsdevel <linux-fsdevel@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/8] mm/vmscan: Throttle reclaim until some writeback
completes if congested
On 10/8/21 15:53, Mel Gorman wrote:
> Page reclaim throttles on wait_iff_congested under the following conditions
>
> o kswapd is encountering pages under writeback and marked for immediate
> reclaim implying that pages are cycling through the LRU faster than
> pages can be cleaned.
>
> o Direct reclaim will stall if all dirty pages are backed by congested
> inodes.
>
> wait_iff_congested is almost completely broken with few exceptions. This
> patch adds a new node-based workqueue and tracks the number of throttled
> tasks and pages written back since throttling started. If enough pages
> belonging to the node are written back then the throttled tasks will wake
> early. If not, the throttled tasks sleeps until the timeout expires.
>
> [neilb@...e.de: Uninterruptible sleep and simpler wakeups]
> [hdanton@...a.com: Avoid race when reclaim starts]
> Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>
Seems mostly OK, have just some doubts wrt NR_THROTTLED_WRITTEN mechanics,
that may ultimately be just a point of comments to add.
...
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1006,6 +1006,56 @@ static void handle_write_error(struct address_space *mapping,
> unlock_page(page);
> }
>
> +static void
> +reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> + long timeout)
> +{
> + wait_queue_head_t *wqh = &pgdat->reclaim_wait;
> + long ret;
> + DEFINE_WAIT(wait);
> +
> + /*
> + * Do not throttle IO workers, kthreads other than kswapd or
> + * workqueues. They may be required for reclaim to make
> + * forward progress (e.g. journalling workqueues or kthreads).
> + */
> + if (!current_is_kswapd() &&
> + current->flags & (PF_IO_WORKER|PF_KTHREAD))
> + return;
> +
> + if (atomic_inc_return(&pgdat->nr_reclaim_throttled) == 1) {
> + WRITE_ONCE(pgdat->nr_reclaim_start,
> + node_page_state(pgdat, NR_THROTTLED_WRITTEN));
> + }
> +
> + prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> + ret = schedule_timeout(timeout);
> + finish_wait(wqh, &wait);
> + atomic_dec(&pgdat->nr_reclaim_throttled);
> +
> + trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
> + jiffies_to_usecs(timeout - ret),
> + reason);
> +}
> +
> +/*
> + * Account for pages written if tasks are throttled waiting on dirty
> + * pages to clean. If enough pages have been cleaned since throttling
> + * started then wakeup the throttled tasks.
> + */
> +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
> + int nr_throttled)
> +{
> + unsigned long nr_written;
> +
> + __inc_node_page_state(page, NR_THROTTLED_WRITTEN);
Is this intentionally using the __ version that normally expects irqs to be
disabled (AFAIK they are not in this path)? I think this is rarely used cold
path so it doesn't seem worth to trade off speed for accuracy.
> + nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
> + READ_ONCE(pgdat->nr_reclaim_start);
Even if the inc above was safe, node_page_state() will return only the
global counter, so the value we read here will only actually increment when
some cpu's counter overflows, so it will be "bursty". Maybe it's ok, just
worth documenting?
> +
> + if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
> + wake_up_all(&pgdat->reclaim_wait);
Hm it seems a bit weird that the more tasks are throttled, the more we wait,
and then wake up all. Theoretically this will lead to even more
bursty/staggering herd behavior. Could be better to wake up single task each
SWAP_CLUSTER_MAX, and bump nr_reclaim_start? But maybe it's not a problem in
practice due to HZ/10 timeouts being short enough?
> +}
> +
Powered by blists - more mailing lists