[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local>
Date: Fri, 16 May 2025 13:57:18 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: Usama Arif <usamaarif642@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
hannes@...xchg.org, shakeel.butt@...ux.dev, riel@...riel.com,
ziy@...dia.com, laoar.shao@...il.com, baolin.wang@...ux.alibaba.com,
Liam.Howlett@...cle.com, npache@...hat.com, ryan.roberts@....com,
linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
kernel-team@...a.com
Subject: Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the
process
On Fri, May 16, 2025 at 01:24:18PM +0200, David Hildenbrand wrote:
> Looking forward to hearing what your magic thinking cap can do! :)
OK so just to say at the outset, this is purely playing around with a
theoretical idea here, so if it's crazy just let me know :))
Right now madvise() has limited utility because:
- You have little control over how the operation is done
- You get little feedback about what's actually succeeded or not
- While you can perform multiple operations at once via process_madvise(),
even to the current process (after my changes to extend it), it's limited
to a single advice over 8 ranges.
- You can't say 'ignore errors just try'
- You get the weird gap behaviour.
So the concept is - make everything explicit and add a new syscall that
wraps the existing madvise() stuff and addresses all the above issues.
Specifically pertinent to the case at hand - also add a 'set_default'
boolean (you'll see shortly exactly where) to also tell madvise() to make
all future VMAs default to the specified advice. We'll whitelist what we're
allowed to use here and should be able to use mm->def_flags.
So the idea is we'll use a helper struct-configured function (hey, it's me,
I <3 helper structs so of course) like:
int madvise_ranges(struct madvise_range_control *ctl);
With the data structures as follows (untested, etc. etc.):
enum madvise_range_type {
MADVISE_RANGE_SINGLE,
MADVISE_RANGE_MULTI,
MADVISE_RANGE_ALL,
};
struct madvise_range {
const void *addr;
size_t size;
int advice;
};
struct madvise_ranges {
const struct madvise_range *arr;
size_t count;
};
struct madvise_range_stats {
struct madvise_range range;
bool success;
bool partial;
};
struct madvise_ranges_stats {
unsigned long nr_mappings_advised;
unsigned long nr_mappings_skipped;
unsigned long nr_pages_advised;
unsigned long nr_pages_skipped;
unsigned long nr_gaps;
/*
* Useful for madvise_range_control->ignore_errors:
*
* If non-NULL, points to an array of size equal to the number of ranges
* specified. Indiciates the specified range, whether it succeeded, and
* whether that success was partial (that is, the range specified
* multiple mappings, only some of which had advice applied
* successfully).
*
* Not valid for MADVISE_RANGE_ALL.
*/
struct madvise_range_stats *per_range_stats;
/* Error details. */
int err;
unsigned long failed_address;
size_t offset; /* If multi, at which offset did this occur? */
};
struct madvise_ranges_control {
int version; /* Allow future updates to API. */
enum madvise_range_type type;
union {
struct madvise_range range; /* MADVISE_RANGE_SINGLE */
struct madvise_ranges ranges; /* MADVISE_RANGE_MULTI */
struct all { /* MADVISE_RANGE_ALL */
int advice;
/*
* If set, also have all future mappings have this applied by default.
*
* Only whitelisted advice may set this, otherwise -EINVAL will be returned.
*/
bool set_default;
};
};
struct madvise_ranges_stats *stats; /* If non-NULL, report information about operation. */
int pidfd; /* If is_remote set, the remote process. */
/* Options. */
bool is_remote :1; /* Target remote process as specified by pidfd. */
bool ignore_errors :1; /* If error occurs applying advice, carry on to next VMA. */
bool single_mapping_only :1; /* Error out if any range is not a single VMA. */
bool stop_on_gap :1; /* Stop operation if input range includes unmapped memory. */
};
So the user can specify whether to apply advice to a single range,
multiple, or the whole address space, with real control over how the operation proceeds.
This basically solves the problem this series tries to address while also
providing an improved madvise() API at the same time.
Thoughts? Have I finally completely lost my mind?
Powered by blists - more mailing lists