lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 13 Oct 2021 17:39:36 +0200
From:   Vlastimil Babka <vbabka@...e.cz>
To:     Mel Gorman <mgorman@...hsingularity.net>,
        Linux-MM <linux-mm@...ck.org>
Cc:     NeilBrown <neilb@...e.de>, Theodore Ts'o <tytso@....edu>,
        Andreas Dilger <adilger.kernel@...ger.ca>,
        "Darrick J . Wong" <djwong@...nel.org>,
        Matthew Wilcox <willy@...radead.org>,
        Michal Hocko <mhocko@...e.com>,
        Dave Chinner <david@...morbit.com>,
        Rik van Riel <riel@...riel.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Jonathan Corbet <corbet@....net>,
        Linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/8] mm/vmscan: Throttle reclaim until some writeback
 completes if congested

On 10/8/21 15:53, Mel Gorman wrote:
> Page reclaim throttles on wait_iff_congested under the following conditions
> 
> o kswapd is encountering pages under writeback and marked for immediate
>   reclaim implying that pages are cycling through the LRU faster than
>   pages can be cleaned.
> 
> o Direct reclaim will stall if all dirty pages are backed by congested
>   inodes.
> 
> wait_iff_congested is almost completely broken with few exceptions. This
> patch adds a new node-based workqueue and tracks the number of throttled
> tasks and pages written back since throttling started. If enough pages
> belonging to the node are written back then the throttled tasks will wake
> early. If not, the throttled tasks sleeps until the timeout expires.
> 
> [neilb@...e.de: Uninterruptible sleep and simpler wakeups]
> [hdanton@...a.com: Avoid race when reclaim starts]
> Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>

Seems mostly OK, have just some doubts wrt NR_THROTTLED_WRITTEN mechanics,
that may ultimately be just a point of comments to add.

...

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1006,6 +1006,56 @@ static void handle_write_error(struct address_space *mapping,
>  	unlock_page(page);
>  }
>  
> +static void
> +reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> +							long timeout)
> +{
> +	wait_queue_head_t *wqh = &pgdat->reclaim_wait;
> +	long ret;
> +	DEFINE_WAIT(wait);
> +
> +	/*
> +	 * Do not throttle IO workers, kthreads other than kswapd or
> +	 * workqueues. They may be required for reclaim to make
> +	 * forward progress (e.g. journalling workqueues or kthreads).
> +	 */
> +	if (!current_is_kswapd() &&
> +	    current->flags & (PF_IO_WORKER|PF_KTHREAD))
> +		return;
> +
> +	if (atomic_inc_return(&pgdat->nr_reclaim_throttled) == 1) {
> +		WRITE_ONCE(pgdat->nr_reclaim_start,
> +			node_page_state(pgdat, NR_THROTTLED_WRITTEN));
> +	}
> +
> +	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> +	ret = schedule_timeout(timeout);
> +	finish_wait(wqh, &wait);
> +	atomic_dec(&pgdat->nr_reclaim_throttled);
> +
> +	trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
> +				jiffies_to_usecs(timeout - ret),
> +				reason);
> +}
> +
> +/*
> + * Account for pages written if tasks are throttled waiting on dirty
> + * pages to clean. If enough pages have been cleaned since throttling
> + * started then wakeup the throttled tasks.
> + */
> +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
> +							int nr_throttled)
> +{
> +	unsigned long nr_written;
> +
> +	__inc_node_page_state(page, NR_THROTTLED_WRITTEN);

Is this intentionally using the __ version that normally expects irqs to be
disabled (AFAIK they are not in this path)? I think this is rarely used cold
path so it doesn't seem worth to trade off speed for accuracy.

> +	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
> +		READ_ONCE(pgdat->nr_reclaim_start);

Even if the inc above was safe, node_page_state() will return only the
global counter, so the value we read here will only actually increment when
some cpu's counter overflows, so it will be "bursty". Maybe it's ok, just
worth documenting?

> +
> +	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
> +		wake_up_all(&pgdat->reclaim_wait);

Hm it seems a bit weird that the more tasks are throttled, the more we wait,
and then wake up all. Theoretically this will lead to even more
bursty/staggering herd behavior. Could be better to wake up single task each
SWAP_CLUSTER_MAX, and bump nr_reclaim_start? But maybe it's not a problem in
practice due to HZ/10 timeouts being short enough?

> +}
> +

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ