lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250109171821.3203865-1-joshua.hahnjy@gmail.com>
Date: Thu,  9 Jan 2025 09:18:18 -0800
From: Joshua Hahn <joshua.hahnjy@...il.com>
To: Gregory Price <gourry@...rry.net>
Cc: Hyeonggon Yoo <hyeonggon.yoo@...com>,
	"Huang, Ying" <ying.huang@...ux.alibaba.com>,
	kernel_team@...ynix.com,
	42.hyeyoo@...il.com,
	"rafael@...nel.org" <rafael@...nel.org>,
	"lenb@...nel.org" <lenb@...nel.org>,
	"gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
	"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
	김홍규(KIM HONGGYU) System SW <honggyu.kim@...com>,
	김락기(KIM RAKIE) System SW <rakie.kim@...com>,
	"dan.j.williams@...el.com" <dan.j.williams@...el.com>,
	"Jonathan.Cameron@...wei.com" <Jonathan.Cameron@...wei.com>,
	"dave.jiang@...el.com" <dave.jiang@...el.com>,
	"horen.chuang@...ux.dev" <horen.chuang@...ux.dev>,
	"hannes@...xchg.org" <hannes@...xchg.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"kernel-team@...a.com" <kernel-team@...a.com>
Subject: Re: [External Mail] Re: [External Mail] [RFC PATCH] mm/mempolicy: Weighted interleave auto-tuning

On Thu, 9 Jan 2025 10:56:20 -0500 Gregory Price <gourry@...rry.net> wrote:

> On Wed, Jan 08, 2025 at 10:19:19AM +0900, Hyeonggon Yoo wrote:
> > Hi, hope you all had a nice year-end holiday :)
> > 
> ... snip ...
> > Please let me know if there's any point we discussed that I am missing.
> > 
> > Additionally I would like to mention that within an internal discussion
> > my colleague Honggyu suggested introducing 'mode' parameter which can be
> > either 'manual' or 'auto' instead of 'use_defaults' to be provide more
> > intuitive interface.
> > 
> > With Honggyu's suggestion and the points we've discussed,
> > I think the interface could be:
> > 
> > # At booting, the mode is 'auto' where the kernel can automatically
> > # update any weights.
> > 
> > mode             auto         # User hasn't specified any weight yet.
> > effective        [2, 1, -, -] # Using system defaults for node 0-1,
> >                               # and node 2-3 not populated yet.
> > 
> > # When a new NUMA node is added (e.g. via hotplug) in the 'auto' mode,
> > # all weights are re-calculated based on ACPI HMAT table, including the
> > # weight of the new node.
> > 
> > mode             auto         # User hasn't specified weights yet.
> > effective        [2, 1, 1, -] # Using system defaults for node 0-2,
> >                               # and node 3 not populated yet.
> > 
> > # When user set at least one weight value, change the mode to 'manual'
> > # where the kernel does not update any weights automatically without
> > # user's consent.
> > 
> > mode             manual       # User changed the weight of node 0 to 4,
> >                               # changing the mode to manual config mode.
> > effective        [4, 1, 1, -]
> > 
> > 
> > # When a new NUMA node is added (e.g. via hotplug) in the manual mode,
> > # the new node's weight is zero because it's in manual mode and user
> > # did not specify the weight for the new node yet.
> > 
> > mode             manual
> > effective        [4, 1, 1, 0]
> > 
> 
> 0's cannot show up in the effective list - the allocators can never
> percieve a 0 as there are (race) conditions where that may cause a div0.
> 
> The actual content of the list may be 0, but the allocator will see '1'.
> 
> IIRC this was due to lock/sleep limitations in the allocator paths and
> accessing this RCU protected memory. If someone wants to take another
> look at the allocator paths and characterize the risk more explicitly,
> this would be helpful.

Hi Gregory and Hyeonggon,

Based on a quick look, I see that there can be a problematic scenario
in alloc_pages_bulk_array_weighted_interleave where we sum up all
the weights from iw_table and divide by this sum. This _can_ be problematic
for two reasons, one of them being the div0 mentioned.

Currently, you can access the weights in one of two ways:
The first way is to call get_il_weight, which will retrieve a specified
node's weight under an rcu read lock. Within this function, it first
checks if the value at iw_table[nid] is 0, and if it is, returns 1.
Although this prevents a div0 scenario by ensuring that all weights are
nonzero, there is a coherency problem, since each instance of get_il_weight
creates a new rcu read lock. Therefore, retrieving node weights within a
loop creates a race condition in which the state of iw_table may change
in between iterations of the loop.

The second way is to directly dereference iw_table under a rcu lock,
copy its contents locally, then free the lock. This is how
alloc_pages_bulk_array_weighted_interleave currently calculates the sum.
The problem here is that even though we solve the coherency issue, there
is no check to ensure that this sum is zero. Thus, while having an array of
weights [0,0,0,0] gets translated into [1,1,1,1] when inspecting each
node individually using get_il_weight, it is still stored internally as 0
and can lead to a div0 here.

There are a few workarounds:
- Check that weight_total != 0 before performing the division.
- During the weight sum iteration, add by weights[node] ? weights[node] : 1
  like it is calculated within get_il_weight
- Prevent users from ever storing 0 into a node.

Of course, we can implement all three of these changes to make sure that
there are no unforunate div0s. However, there are realistic scenarios
where we may want the node to actually have a weight of 0, so perhaps
it makes sense to just do the first to checks. I can write up a quick
patch to perform these checks, if it looks good to everyone.

Please let me know if I missed anything as well.

Hope you all have a great day!
Joshua

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ