linux-kernel - Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local>
Date: Sat, 17 May 2025 20:01:52 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: Usama Arif <usamaarif642@...il.com>,
        Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
        hannes@...xchg.org, shakeel.butt@...ux.dev, riel@...riel.com,
        ziy@...dia.com, laoar.shao@...il.com, baolin.wang@...ux.alibaba.com,
        Liam.Howlett@...cle.com, npache@...hat.com, ryan.roberts@....com,
        linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
        kernel-team@...a.com, SeongJae Park <sj@...nel.org>
Subject: Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the
 process

+cc SJ.

On Fri, May 16, 2025 at 01:57:18PM +0100, Lorenzo Stoakes wrote:
> On Fri, May 16, 2025 at 01:24:18PM +0200, David Hildenbrand wrote:
> > Looking forward to hearing what your magic thinking cap can do! :)
>
> OK so just to say at the outset, this is purely playing around with a
> theoretical idea here, so if it's crazy just let me know :))
>
> Right now madvise() has limited utility because:
>
> - You have little control over how the operation is done
> - You get little feedback about what's actually succeeded or not
> - While you can perform multiple operations at once via process_madvise(),
>   even to the current process (after my changes to extend it), it's limited
>   to a single advice over 8 ranges.

SJ raised the point that I am just wrong about this (see [0]).

So this makes the interface even more compelling in my view, we can either:

1. Simply remove the 'madvise_ranges' stuff below, and replace with an iovec,
   and simply forward that to process_madvise() (simpler, probably preferable).
2. Keep it as-is so we get to perform _multiple_ advice operations in a batch.
3. Use an iovec but also have some other array specifying operations to get the
   same thing, but maybe extend process_madvise()?

I really like the idea however of just - using the existing process_madvise()
code - picking up all the recent improvements - avoiding duplication, etc.

The key idea of this interface is more-so being able to control certain
behaviours (such as stopping on gaps etc.)

Yet Another (TM) alternative would be to use the -currently unused-
process_madvise() flags field to specify options as mentioned here, including
the 'set default' thing.

Now _that_ is really interesting actually.

It will give us less flexibility, but require a much much less major change.

OK damn, that's quite compelling... maybe I will do an RFC patch for that... :)

Happy to hear thoughts on it...

[0]: https://lore.kernel.org/all/20250517162048.36347-1-sj@kernel.org/

> - You can't say 'ignore errors just try'
> - You get the weird gap behaviour.
>
> So the concept is - make everything explicit and add a new syscall that
> wraps the existing madvise() stuff and addresses all the above issues.
>
> Specifically pertinent to the case at hand - also add a 'set_default'
> boolean (you'll see shortly exactly where) to also tell madvise() to make
> all future VMAs default to the specified advice. We'll whitelist what we're
> allowed to use here and should be able to use mm->def_flags.
>
> So the idea is we'll use a helper struct-configured function (hey, it's me,
> I <3 helper structs so of course) like:
>
> int madvise_ranges(struct madvise_range_control *ctl);
>
> With the data structures as follows (untested, etc. etc.):
>
> enum madvise_range_type {
> 	MADVISE_RANGE_SINGLE,
> 	MADVISE_RANGE_MULTI,
> 	MADVISE_RANGE_ALL,
> };
>
> struct madvise_range {
> 	const void *addr;
> 	size_t size;
> 	int advice;
> };
>
> struct madvise_ranges {
> 	const struct madvise_range *arr;
> 	size_t count;
> };
>
> struct madvise_range_stats {
> 	struct madvise_range range;
> 	bool success;
> 	bool partial;
> };
>
> struct madvise_ranges_stats {
> 	unsigned long nr_mappings_advised;
> 	unsigned long nr_mappings_skipped;
> 	unsigned long nr_pages_advised;
> 	unsigned long nr_pages_skipped;
> 	unsigned long nr_gaps;
>
> 	/*
> 	 * Useful for madvise_range_control->ignore_errors:
> 	 *
> 	 * If non-NULL, points to an array of size equal to the number of ranges
> 	 * specified. Indiciates the specified range, whether it succeeded, and
> 	 * whether that success was partial (that is, the range specified
> 	 * multiple mappings, only some of which had advice applied
> 	 * successfully).
> 	 *
> 	 * Not valid for MADVISE_RANGE_ALL.
> 	 */
>  	struct madvise_range_stats *per_range_stats;
>
> 	/* Error details. */
> 	int err;
> 	unsigned long failed_address;
> 	size_t offset; /* If multi, at which offset did this occur? */
> };
>
> struct madvise_ranges_control {
> 	int version; /* Allow future updates to API. */
>
> 	enum madvise_range_type type;
>
> 	union {
> 		struct madvise_range range; /* MADVISE_RANGE_SINGLE */
> 		struct madvise_ranges ranges; /* MADVISE_RANGE_MULTI */
> 		struct all { /* MADVISE_RANGE_ALL */
> 			int advice;
> 			/*
> 			 * If set, also have all future mappings have this applied by default.
> 			 *
> 			 * Only whitelisted advice may set this, otherwise -EINVAL will be returned.
> 			 */
> 			bool set_default;
> 		};
> 	};
> 	struct madvise_ranges_stats *stats; /* If non-NULL, report information about operation. */
>
> 	int pidfd; /* If is_remote set, the remote process. */
>
> 	/* Options. */
> 	bool is_remote :1; /* Target remote process as specified by pidfd. */
> 	bool ignore_errors :1; /* If error occurs applying advice, carry on to next VMA. */
> 	bool single_mapping_only :1; /* Error out if any range is not a single VMA. */
> 	bool stop_on_gap :1; /* Stop operation if input range includes unmapped memory. */
> };
>
> So the user can specify whether to apply advice to a single range,
> multiple, or the whole address space, with real control over how the operation proceeds.
>
> This basically solves the problem this series tries to address while also
> providing an improved madvise() API at the same time.
>
> Thoughts? Have I finally completely lost my mind?