lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAH5fLghMi+CHs=UOJCs+COi3rOq6Bhi4ZkiVwcYC+fKwXmGmOA@mail.gmail.com>
Date: Wed, 25 Sep 2024 17:06:12 +0200
From: Alice Ryhl <aliceryhl@...gle.com>
To: Dave Chinner <david@...morbit.com>
Cc: Miguel Ojeda <ojeda@...nel.org>, Andrew Morton <akpm@...ux-foundation.org>, 
	Qi Zheng <zhengqi.arch@...edance.com>, Roman Gushchin <roman.gushchin@...ux.dev>, 
	Muchun Song <muchun.song@...ux.dev>, Boqun Feng <boqun.feng@...il.com>, 
	Gary Guo <gary@...yguo.net>, Björn Roy Baron <bjorn3_gh@...tonmail.com>, 
	Benno Lossin <benno.lossin@...ton.me>, Andreas Hindborg <a.hindborg@...sung.com>, 
	Trevor Gross <tmgross@...ch.edu>, linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	rust-for-linux@...r.kernel.org
Subject: Re: [PATCH] rust: shrinker: add shrinker abstraction

On Tue, Sep 17, 2024 at 3:59 AM Dave Chinner <david@...morbit.com> wrote:
>
> On Mon, Sep 16, 2024 at 12:22:40PM +0200, Alice Ryhl wrote:
> > On Mon, Sep 16, 2024 at 4:59 AM Dave Chinner <david@...morbit.com> wrote:
> > >
> > > On Thu, Sep 12, 2024 at 09:54:01AM +0000, Alice Ryhl wrote:
> > > > Rust Binder holds incoming transactions in a read-only mmap'd region
> > > > where it manually manages the pages. These pages are only in use until
> > > > the incoming transaction is consumed by userspace, but the kernel will
> > > > keep the pages around for future transactions. Rust Binder registers a
> > > > shrinker with the kernel so that it can give back these pages if the
> > > > system comes under memory pressure.
> > > >
> > > > Separate types are provided for registered and unregistered shrinkers.
> > > > The unregistered shrinker type can be used to configure the shrinker
> > > > before registering it. Separating it into two types also enables the
> > > > user to construct the private data between the calls to `shrinker_alloc`
> > > > and `shrinker_register` and avoid constructing the private data if
> > > > allocating the shrinker fails.
> > >
> > > This seems a bit nasty. It appears to me that the code does an
> > > unsafe copy of the internal shrinker state between the unregistered
> > > and registered types. Shrinkers have additional internal state when
> > > SHRINKER_MEMCG_AWARE and/or SHRINKER_NUMA_AWARE flags are set,
> > > and this abstraction doesn't seem to handle either those flags or
> > > the internal state they use at all.
> >
> > Doing an unsafe copy of the internal shrinker state is certainly not
> > my intent. The UnregisteredShrinker and RegisteredShrinker types just
> > wrap a non-null pointer, so the moves inside `register` are not copies
> > of the whole `struct shrinker` but just copies of a pointer to the
> > shrinker.
>
> So you're playing type casting games with the same structure just to
> avoid the potential doing a little bit of unnecessary work in an
> error situation?
>
> This all seems rather convoluted and largely unnecessary to me as
> the shrinker_free() C code transparent handles both registered and
> unregistered shrinkers being passed to it. There really is no
> difference between the two cases in terms of error handling and
> teardown....

Well another thing it does is make it possible to specify shrinker
without making it mandatory. I don't think it's so bad to have a
builder type, but I can drop it if you prefer that.

If we replace the builder with a single function for constructing it,
then we need to pick one of:
* Not accept a value for "seeks" at all.
* Always require an explicit value for "seeks".
* Make a "shrinker options" struct that we can pass to the constructor.
* Have multiple constructors.

> > > > +impl UnregisteredShrinker {
> > > > +    /// Create a new shrinker.
> > > > +    pub fn alloc(name: &CStr) -> Result<Self, AllocError> {
> > > > +        // SAFETY: Passing `0` as flags is okay. Using `%s` as the format string is okay when we
> > > > +        // pass a nul-terminated string as the string for `%s` to print.
> > > > +        let ptr =
> > > > +            unsafe { bindings::shrinker_alloc(0, c_str!("%s").as_char_ptr(), name.as_char_ptr()) };
> > >
> > > This passes flags as 0, so doesn't support SHRINKER_MEMCG_AWARE or
> > > SHRINKER_NUMA_AWARE shrinker variants. These flags should be
> > > passed through here.
> >
> > I don't have a user for memcg/numa aware shrinkers in Rust, which is
> > why I did not extend these abstractions to support them. However, if
> > you would like me to, I'm happy to do so.
>
> Please do. A partial implementation only makes it harder for the
> next person to use the API....

Alright.

> > It is easy to add a flags
> > argument to this method, but I imagine they also need a few other
> > additions elsewhere in the API to really be supported?
>
> In the shrinker API, nothing extra needs to be done.
>
> If you haven't added bindings to the list_lru interfaces yet, then
> you are going to need to, including the memcg and numa aware
> interfaces for that.

I'm not familiar with the memcg/numa aware interfaces for list_lru, as
I only know about Binder's usage. Is there a good example of using
them somewhere?

Or does it not require anything beyond just adding Rust versions of
list_lru_shrink_walk and list_lru_shrink_count that accept a
ShrinkControl from a shrinker?

> > Now that I think about it, perhaps Binder's shrinker *could* be memcg
> > aware? It uses the list_lru abstraction, and currently calls
> > `list_lru_count` in `count_objects`, but maybe we could just use
> > `list_lru_shrink_count` instead and magically become memcg aware ...?
>
> The scan code also needs to use list_lru_shrink_walk(), too,
> not list_lru_walk().
>
> And, yes, that's really the whole point of the list_lru abstraction -
> almost nothing extra needs to be done to enable both NUMA and memcg
> based reclaim context support for a cache once it's been converted
> to use list_lru for tracking reclaimable objects.

If it's that easy, then that's great!

> > > > +    /// Set the number of seeks needed to recreate an object.
> > > > +    pub fn set_seeks(&mut self, seeks: u32) {
> > > > +        unsafe { (*self.shrinker.as_ptr()).seeks = seeks as c_int };
> > > > +    }
> > >
> > > Seems kinda weird to have a separate function for setting seeks,
> > > when...
> > >
> > > > +
> > > > +    /// Register the shrinker.
> > > > +    ///
> > > > +    /// The provided pointer is used as the private data, and the type `T` determines the callbacks
> > > > +    /// that the shrinker will use.
> > > > +    pub fn register<T: Shrinker>(self, private_data: T::Ptr) -> RegisteredShrinker<T> {
> > >
> > > .... all the other data needed to set up a shrinker is provided to
> > > this function....
> > >
> > > > +        let shrinker = self.shrinker;
> > > > +        let ptr = shrinker.as_ptr();
> > > > +
> > > > +        // The destructor of `self` calls `shrinker_free`, so skip the destructor.
> > > > +        core::mem::forget(self);
> > > > +
> > > > +        let private_data_ptr = <T::Ptr as ForeignOwnable>::into_foreign(private_data);
> > > > +
> > > > +        // SAFETY: We own the private data, so we can assign to it.
> > > > +        unsafe { (*ptr).private_data = private_data_ptr.cast_mut() };
> > > > +        // SAFETY: The shrinker is not yet registered, so we can update this field.
> > > > +        unsafe { (*ptr).count_objects = Some(rust_count_objects::<T>) };
> > > > +        // SAFETY: The shrinker is not yet registered, so we can update this field.
> > > > +        unsafe { (*ptr).scan_objects = Some(rust_scan_objects::<T>) };
> > >
> > > .... and implemented exactly the same way.
> >
> > Well .. this design makes setting `seeks` optional, but `private_data`
> > mandatory. We *have* to set the two function pointers, but you don't
> > have to set `seeks` explicitly.
> >
> > It's not obvious, so it's probably worth mentioning that this design
> > doesn't actually require you to make an allocation and store a real
> > pointer in the private data.
>
> Yes, definitely worth mentioning, because I think we need to require
> new shrinkers to provide a direct pointer to the context the
> shrinker will operate on. We want to move away from anonymous
> "global" shrinkers that don't provide a context for the shrinker to
> operation on for a number of reasons...
>
> If we change the C code to work this way, then the private pointer
> in the rust binding will become mandatory. Hence it might be worth
> making the rust bindings and shrinker implementations behave this
> way from the start....

Thanks for the context. That is good to know.

> > When implementing the Shrinker trait for
> > your custom shrinker, you have to choose which pointer type to use. If
> > you choose the unit type `()` as the pointer type, then no real
> > pointer is actually stored.
> >
> > > > +/// How many objects are there in the cache?
> > > > +///
> > > > +/// This is used as the return value of [`Shrinker::count_objects`].
> > > > +pub struct CountObjects {
> > > > +    inner: c_ulong,
> > > > +}
> > > > +
> > > > +impl CountObjects {
> > > > +    /// Indicates that the number of objects is unknown.
> > > > +    pub const UNKNOWN: Self = Self { inner: 0 };
> > > > +
> > > > +    /// Indicates that the number of objects is zero.
> > > > +    pub const EMPTY: Self = Self {
> > > > +        inner: SHRINK_EMPTY,
> > > > +    };
> > > > +
> > > > +    /// The maximum possible number of freeable objects.
> > > > +    pub const MAX: Self = Self {
> > > > +        // The shrinker code assumes that it can multiply this value by two without overflow.
> > > > +        inner: c_ulong::MAX / 2,
> > > > +    };
> > > > +
> > > > +    /// Creates a new `CountObjects` with the given value.
> > > > +    pub fn from_count(count: usize) -> Self {
> > > > +        if count == 0 {
> > > > +            return Self::EMPTY;
> > > > +        }
> > >
> > > No. A return count of 0 is not the same as a return value of
> > > SHRINK_EMPTY.
> > >
> > > A return value of 0 means "no reclaim work can be done right now".
> > >
> > > This implies that there are objects that can be reclaimed in the near
> > > future, but right now they are unavailable for reclaim. This can be
> > > due to a trylock protecting the count operation failing, the all the
> > > objects in the cache being dirty and needing work to be done before
> > > they can be reclaimed, etc.
> > >
> > > A return value of SHRINK_EMPTY means "there are no reclaimable
> > > objects at all".
> > >
> > > This implies that the cache is empty - it has absolutely nothing in
> > > it that can be reclaimed either now or in the near future.
> > >
> > > These two return values are treated very differently by the high
> > > level code. SHRINK_EMPTY is used by shrink_slab_memcg() to maintain
> > > a "shrinker needs to run" bit in the memcg shrinker search bitmap.
> > > The bit is cleared when SHRINK_EMPTY is returned, meaning the
> > > shrinker will not get called again until a new object is queued
> > > to it's LRU. This sets the "shrinker needs to run" bit again, and
> > > so the shrinker will run next time shrink_slab_memcg() is called.
> > >
> > > In constrast, a return value of zero (i.e. no work to be done right
> > > now) does not change the "shrinker needs to run" state bit, and
> > > hence it will always be called the next time the shrink_slab_memcg()
> > > is run to try to reclaim objects from that memcg shrinker...
> >
> > This is a part of the API that I was pretty unsure about, so very
> > happy to get your input. For both of the shrinkers I am familiar with,
> > they know the exact count of objects and there's no error case for any
> > lock. The shrinkers just return the exact count directly from
> > count_objects, but that seemed incorrect to me, as this means that the
> > shrinker returns 0 when it really should return SHRINK_EMPTY.
>
> Yes, they should be returning SHRINK_EMPTY and not 0 in this case.
> See super_cache_count() for example.

Okay.

> > That's
> > why I implemented `from_count` this way - the intent is that you use
> > `CountObjects::from_count` when you know the exact count, so if you
> > pass 0 to that method, then that means the shrinker really is empty.
> > If you don't know what the count is, then you return
> > `CountObjects::UNKNOWN` instead.
>
> Understood, and I see your point. I think it's a good approach to
> take, but I don't think that the rust binding implementation is the
> appropriate place to introduce such API/semantic enhancements.
>
> I would really like to avoid the situation where rust-based
> shrinkers have explicitly different API and behavioural semantics to
> the C shrinkers. Hence I would prefer the rust bindings to implement
> the existing C semantics and then, as a separate piece of work, we
> move everything (both C and Rust) to the new set of semantics as a
> single changeset.
>
> I know that is more difficult to do and co-ordinate, but I think we
> are going to be much better off in the long run if we start from a
> common base and then make improvements to both sides of the API at
> the same time. Note that I'm not asking you to change all the C code
> here - I can certainly help there - but I am asking that we try
> really hard to keep APIs and behavioural semantics as close together
> as possible.

That is reasonable. What do you prefer that I do with the CountObjects
type I have now? I can change it to this:

* CountObjects::EMPTY
* CountObjects::new(number)

Where `new` just uses the provided number as-is (unless it's too big).

I can also get rid of the CountObjects type entirely.

> > That said, I guess there is some reason that the C shrinker API is
> > designed to use the value 0 for unknown rather than empty, so perhaps
> > I should not try to do it differently.
>
> No reason, it just evolved organically that way over the past decade
> - SHRINK_EMPTY was introduced (historically speaking) fairly
> recently and only propogated into the shrinkers that needed to set
> it (i.e. the memcg aware shrinkers).
>
> This is the sort of API fragmentation I would like us to avoid - I
> don't want to see some shrinkers with SHRINK_EMPTY, some with 0, and
> then yet another new subset with UNKNOWN, yet all largely meaning
> the same thing in a global shrinker context but 2 of the three types
> being incorrect in a memcg shrinker context...

Yeah, that makes sense. I'm on board with keeping things consistent.

> > > > +impl ScanObjects {
> > > > +    /// Indicates that the shrinker should stop trying to free objects from this cache due to
> > > > +    /// potential deadlocks.
> > > > +    pub const STOP: Self = Self { inner: SHRINK_STOP };
> > > > +
> > > > +    /// The maximum possible number of freeable objects.
> > > > +    pub const MAX: Self = Self {
> > > > +        // The shrinker code assumes that it can multiply this value by two without overflow.
> > > > +        inner: c_ulong::MAX / 2,
> > >
> > > No it doesn't. This is purely a count of objects freed by the
> > > shrinker.
> >
> > In mm/shrinker.c it multiplies by two here:
> > total_scan = min(total_scan, (2 * freeable));
>
> freeable is derived from the return of ->count_objects, not
> from ->scan_objects (i.e. it is a CountObjects type, not a
> ScanObjects type).
>
> ->scan_objects() returns SHRINK_STOP to indicate that scanning
> should stop immediately or the number of objects freed. The only
> thing that is done with value other than SHRINK_STOP is to aggregate
> it across repeated calls to ->scan_objects().

Fair point. I can change the maximum to SHRINK_STOP-1. Or do you want
me to get rid of it completely?

> > > > +    /// Determines whether it is safe to recurse into filesystem code.
> > > > +    pub fn gfp_fs(&self) -> bool {
> > > > +        // SAFETY: Okay by type invariants.
> > > > +        let mask = unsafe { (*self.ptr.as_ptr()).gfp_mask };
> > > > +
> > > > +        (mask & bindings::__GFP_FS) != 0
> > > > +    }
> > >
> > > This needs a check for __GFP_IO as well, as there are block layer
> > > shrinkers that need to be GFP_NOIO aware...
> >
> > Would you prefer that I add additional methods, or that I modify this
> > method to also check __GFP_IO, or that I just expose a way to get the
> > `gfp_mask` value directly?
>
> If it was C code, I would suggest the API would something like:
>
>         if (!reclaim_allowed(sc, __GFP_FS))
>                 return SHRINK_STOP;
>
> or
>         if (!reclaim_allowed(sc, __GFP_IO))
>                 return SHRINK_STOP;
>
> As this matches the typical usage (e.g. see super_cache_scan()).

I guess there are several options. A few Rust versions:

fn reclaim_allowed(&self, mode: ReclaimMode) -> bool;
enum ReclaimMode {
    __GFP_FS,
    __GFP_IO,
}

or perhaps just:

fn reclaim_fs_allowed(&self) -> bool;
fn reclaim_io_allowed(&self) -> bool;

> > > > +    /// Returns the number of objects that `scan_objects` should try to reclaim.
> > > > +    pub fn nr_to_scan(&self) -> usize {
> > > > +        // SAFETY: Okay by type invariants.
> > > > +        unsafe { (*self.ptr.as_ptr()).nr_to_scan as usize }
> > > > +    }
> > > > +
> > > > +    /// The callback should set this value to the number of objects processed.
> > > > +    pub fn set_nr_scanned(&mut self, val: usize) {
> > > > +        let mut val = val as c_ulong;
> > > > +        // SAFETY: Okay by type invariants.
> > > > +        let max = unsafe { (*self.ptr.as_ptr()).nr_to_scan };
> > > > +        if val > max {
> > > > +            val = max;
> > > > +        }
> > >
> > > No. Shrinkers are allowed to scan more objects in a batch than
> > > the high level code asked them to scan. If they do this, they
> > > *must* report back the count of all the objects they scanned so the
> > > batched scan accounting can adjust the remaining amount of work that
> > > needs to be done appropriately.
> >
> > Acknowledged.
> >
> > This is related to another question that I had, actually. Other than
> > returning SHRINK_STOP, in what situations should `nr_scanned` be
> > different from the return value of `scan_objects`?
>
> nr_scanned will almost always be different to the return value of
> ->scan_objects.
>
> There are 3 values of note here.
>
> - nr_scanned is set by the shrinker scan setup code to the batch
>   size the shrinker should scan in it's next invocation.
>
>   If the shrinker ignores nr_to_scan batch limits, it can indicate
>   how many objects it scanned by setting the number it scanned to
>   nr_scanned before it returns.
>
> - nr_to_scan is set to nr_scanned at setup.
>
>   The shrinker should then decrement nr_scanned for each object that
>   is scanned by the shrinker.  The shrinker should stop scanning and
>   return nr_freed when nr_to_scan reaches zero.
>
>   If the value of this is non-zero when the scan returns, it is an
>   indication that the shrinker either did not scan the entire
>   batch or the batch size was larger than the remaining number of
>   objects in the cache.
>
> - nr_freed: return value from the shrinker
>
>   This is a count of the number of objects the shrinker freed during
>   the scan.
>
>   If the shrinker cannot make progress for any reason it
>   should immediately return SHRINK_STOP.
>
> So on a typical shrinker pass, we'll exit with:
>
>         nr_scanned = batch size;
>         nr_to_scan = 0;
>         nr_freed = [0..batch size];
>
> Some shrinkers, however, have variable sized objects, and so they
> will normalise nr_freed and nr_scanned to the object size. The
> i915/gem shrinker does this, as it considers a page to be an object,
> but in reality is has a smaller number of variable sized multi-page
> objects being tracked. Hence it might be asked to scan 128 objects
> (default batch size), but the scan iterated over objects that
> contain 256 pages.
>
> In that case, it will return:
>
>         nr_scanned = 256;
>         nr_to_scan = 0;
>         nr_freed = [0..256]
>
> And the high level shrinker code will treat that as having scanned
> 256 objects rather than the 128 objects it asked for.
>
> There are several shrinkers that play weird accounting games like
> this. It's not really all that nice and it would be ice to clean all
> this up, but there's been no real urgency to do this because it's
> just a slightly different manner of accounting reclaim progress....

Ahh ... so the numbers can be different if you scan an object without
freeing it.

Alice

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ