[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <rm43wgtlvwowjolzcf6gj4un4qac4myngxqnd2jwt5yqxree62@t66scnrruttc>
Date: Tue, 31 Oct 2023 10:53:41 +0100
From: Michal Hocko <mhocko@...e.com>
To: Gregory Price <gourry.memverge@...il.com>
Cc: linux-kernel@...r.kernel.org, linux-cxl@...r.kernel.org,
linux-mm@...ck.org, ying.huang@...el.com,
akpm@...ux-foundation.org, aneesh.kumar@...ux.ibm.com,
weixugc@...gle.com, apopple@...dia.com, hannes@...xchg.org,
tim.c.chen@...el.com, dave.hansen@...el.com, shy828301@...il.com,
gregkh@...uxfoundation.org, rafael@...nel.org,
Gregory Price <gregory.price@...verge.com>
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
On Mon 30-10-23 20:38:06, Gregory Price wrote:
> This patchset implements weighted interleave and adds a new sysfs
> entry: /sys/devices/system/node/nodeN/accessM/il_weight.
>
> The il_weight of a node is used by mempolicy to implement weighted
> interleave when `numactl --interleave=...` is invoked. By default
> il_weight for a node is always 1, which preserves the default round
> robin interleave behavior.
>
> Interleave weights may be set from 0-100, and denote the number of
> pages that should be allocated from the node when interleaving
> occurs.
>
> For example, if a node's interleave weight is set to 5, 5 pages
> will be allocated from that node before the next node is scheduled
> for allocations.
I find this semantic rather weird TBH. First of all why do you think it
makes sense to have those weights global for all users? What if
different applications have different view on how to spred their
interleaved memory?
I do get that you might have a different tiers with largerly different
runtime characteristics but why would you want to interleave them into a
single mapping and have hard to predict runtime behavior?
[...]
> In this way it becomes possible to set an interleaving strategy
> that fits the available bandwidth for the devices available on
> the system. An example system:
>
> Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
> Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
> Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
> Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
>
> In this setup, the effective weights for nodes 0-3 for a task
> running on Node 0 may be [60, 20, 10, 10].
>
> This spreads memory out across devices which all have different
> latency and bandwidth attributes at a way that can maximize the
> available resources.
OK, so why is this any better than not using any memory policy rely
on demotion to push out cold memory down the tier hierarchy?
What is the actual real life usecase and what kind of benefits you can
present?
--
Michal Hocko
SUSE Labs
Powered by blists - more mailing lists