[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <d77ac703-2fa8-48f8-846a-fb88d45b9287@igalia.com>
Date: Mon, 28 Apr 2025 10:10:41 +0900
From: Changwoo Min <changwoo@...lia.com>
To: "Rafael J. Wysocki" <rafael@...nel.org>, Lukasz Luba <lukasz.luba@....com>
Cc: christian.loehle@....com, tj@...nel.org, len.brown@...el.com,
kernel-dev@...lia.com, linux-pm@...r.kernel.org,
linux-kernel@...r.kernel.org, pavel@...nel.org
Subject: Re: [PATCH] PM: EM: Add inotify support when the energy model is
updated.
Gentle ping as it reaches around 1.5 weeks.
Regards,
Changwoo Min
On 4/17/25 22:28, Changwoo Min wrote:
> Hi Lukasz and Rafael,
>
> Thank you super much for the comments!
>
> On 4/16/25 02:05, Rafael J. Wysocki wrote:
>> On Tue, Apr 15, 2025 at 10:31 AM Lukasz Luba <lukasz.luba@....com> wrote:
>>>
>>> Hi Changwoo,
>>>
>>> Thanks for the patch.
>>>
>>> On 4/15/25 01:42, Changwoo Min wrote:
>>>> The sched_ext schedulers [1] currently access the energy model
>>>> through the
>>>> debugfs to make energy-aware scheduling decisions [2]. The userspace
>>>> part
>>>> of a sched_ext scheduler feeds the necessary (post-processed)
>>>> energy-model
>>>> information to the BPF part of the scheduler.
>>>
>>> This is very interesting use case!
>>>
>>>>
>>>> However, there is a limitation in the current debugfs support of the
>>>> energy
>>>> model. When the energy model is updated (em_dev_update_perf_domain),
>>>> there
>>>> is no way for the userspace part to know such changes (besides
>>>> polling the
>>>> debugfs files).
>>>>
>>>> Therefore, add inotify support (IN_MODIFY) when the energy model is
>>>> updated. With this inotify support, the sched_ext scheduler can
>>>> monitor the
>>>> energy model change in userspace using the regular inotify interface
>>>> and
>>>> feed the updated energy model information to make energy-aware
>>>> scheduling
>>>> decisions.
>>>>
>>>> [1] https://lwn.net/Articles/922405/
>>>> [2] https://github.com/sched-ext/scx/pull/1624
>>>>
>>>> Signed-off-by: Changwoo Min <changwoo@...lia.com>
>>>> ---
>>>> kernel/power/energy_model.c | 47 ++++++++++++++++++++++++++++++++
>>>> +++++
>>>> 1 file changed, 47 insertions(+)
>>>>
>>>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>>>> index d9b7e2b38c7a..0c06e0278df6 100644
>>>> --- a/kernel/power/energy_model.c
>>>> +++ b/kernel/power/energy_model.c
>>>> @@ -14,6 +14,7 @@
>>>> #include <linux/cpumask.h>
>>>> #include <linux/debugfs.h>
>>>> #include <linux/energy_model.h>
>>>> +#include <linux/fsnotify.h>
>>>> #include <linux/sched/topology.h>
>>>> #include <linux/slab.h>
>>>>
>>>> @@ -156,9 +157,53 @@ static int __init em_debug_init(void)
>>>> return 0;
>>>> }
>>>> fs_initcall(em_debug_init);
>>>> +
>>>> +static void em_debug_update_ps(struct em_perf_domain *em_pd, int i,
>>>> + struct dentry *pd)
>>>> +{
>>>> + static const char *names[] = {
>>>> + "frequency",
>>>> + "power",
>>>> + "cost",
>>>> + "performance",
>>>> + "inefficient",
>>>> + };
>>>> + struct em_perf_state *table;
>>>> + unsigned long freq;
>>>> + struct dentry *d, *cd;
>>>> + char name[24];
>>>> + int j;
>>>> +
>>>> + rcu_read_lock();
>>>> + table = em_perf_state_from_pd(em_pd);
>>>> + freq = table[i].frequency;
>>>> + rcu_read_unlock();
>>>> +
>>>> + snprintf(name, sizeof(name), "ps:%lu", freq);
>>>> + d = debugfs_lookup(name, pd);
>>>> +
>>>> + for (j = 0; j < ARRAY_SIZE(names); j++) {
>>>> + cd = debugfs_lookup(names[j], d);
>>>> + if (!cd)
>>>> + return;
>>>> + fsnotify_dentry(cd, FS_MODIFY);
>>>> + cond_resched();
>>>> + }
>>>> +}
>>>> +
>>>> +static void em_debug_update(struct device *dev)
>>>> +{
>>>> + struct dentry *d;
>>>> + int i;
>>>> +
>>>> + d = debugfs_lookup(dev_name(dev), rootdir);
>>>> + for (i = 0; i < dev->em_pd->nr_perf_states; i++)
>>>> + em_debug_update_ps(dev->em_pd, i, d);
>>>> +}
>>>> #else /* CONFIG_DEBUG_FS */
>>>> static void em_debug_create_pd(struct device *dev) {}
>>>> static void em_debug_remove_pd(struct device *dev) {}
>>>> +static void em_debug_update(struct device *dev) {}
>>>> #endif
>>>>
>>>> static void em_release_table_kref(struct kref *kref)
>>>> @@ -323,6 +368,8 @@ int em_dev_update_perf_domain(struct device *dev,
>>>>
>>>> em_table_free(old_table);
>>>>
>>>> + em_debug_update(dev);
>>>> +
>>>
>>> I would move this out of the locked section, below the mutex
>>> unlock. Looking at the code in em_debug_update() you are trying
>>> to send such notification for each EM's table entry * number of
>>> fields, which is heavy. The RCU copy that you get will make sure
>>> you have consistent view on the data and you don't have to
>>> be under the mutex lock.
>
> That makes sense. I will change it in the next version as you suggested.
>
>>>
>>> A different question would be if the notification has to be
>>> that heavy?
>
> This is a good question. In this version, I tried to mimic the situation
> such that all the files under a performance domain are updated, so I
> sent inotify for all files.
>
> However, that is *not* necessary considering how a userspace code
> monitors the update of the energy model. Now, I think it will be better
> to inotify just the directory of a performance domain (e.g., /sys/
> kernel/debug/energy_model/cpu0). A userspace application just monitors
> '/sys/kernel/debug/energy_model' to get a notification for an update on
> any performance domain.
>
> The change will be minimal as follows:
>
> @@ -14,6 +14,7 @@
> #include <linux/cpumask.h>
> #include <linux/debugfs.h>
> #include <linux/energy_model.h>
> +#include <linux/fsnotify.h>
> #include <linux/sched/topology.h>
> #include <linux/slab.h>
>
> @@ -156,9 +157,18 @@ static int __init em_debug_init(void)
> return 0;
> }
> fs_initcall(em_debug_init);
> +
> +void em_debug_update(struct device *dev)
> +{
> + struct dentry *d;
> +
> + d = debugfs_lookup(dev_name(dev), rootdir);
> + fsnotify_dentry(d, FS_MODIFY);
> +}
> #else /* CONFIG_DEBUG_FS */
> static void em_debug_create_pd(struct device *dev) {}
> static void em_debug_remove_pd(struct device *dev) {}
> +static void em_debug_update(struct device *dev) {}
> #endif
>
> static void em_destroy_table_rcu(struct rcu_head *rp)
> @@ -335,6 +345,8 @@ int em_dev_update_perf_domain(struct device *dev,
> em_table_free(old_table);
>
> mutex_unlock(&em_pd_mutex);
> +
> + em_debug_update(dev);
> return 0;
> }
> EXPORT_SYMBOL_GPL(em_dev_update_perf_domain);
>
>>> Can we just 'ping' the user-space that there is a change and ask to read
>>> the new values?
>
> Besides the inotify, another 'ping' mechanism that I can think of is
> using kprobe/kretprobe. The BPF code hooks em_dev_update_perf_domain()
> using kprobe/kretprobe. When the function is called, the BPF code tells
> the userspace to re-read the energy model.
>
> This is pretty ad-hoc and has two (or more) problems. Firstly, it
> requires another communication mechanism (maybe BPF ring buffer) from
> the BPF code to the userspace. This will eventually mimic what inotify
> does. More importantly, the BPF code has a dependency on the function
> name. The changes in the function name/prototype will break the BPF
> code. We may consider adding a tracepoint to em_dev_update_perf_domain()
> to avoid this, but to me, it feels like another bandage over a bandage.
>
> Due to this, I think inotify is the right solution to the problem.
>
> @Lukasz -- If you have anything particular in mind other than kprobe and
> inotify, please let me know.
>
>>>
>>> Another question, but this time to Rafael would be if for such use case
>>> we can use debugfs, or we need a sysfs?
>>
>> debugfs is not really suitable for this IMV and the problem at hand is
>> a symptom of that (but there will be more issues in the future
>> AFAICS).
>
> I agree debugfs is not ideal in the sense that it is not considered as a
> stable interface. We can consider moving from the debugfs to the sysfs
> if there is a consensus. In my view, moving to sysfs is reasonable since
> there is no major change in the debugfs hierarchy since its inception.
> If necessary, I will add the sysfs implementation (dropping the debugfs
> code) to the next version.
>
> @Rafael -- Could you elaborate a bit more about 'the problem at hand'
> and 'more issues in the future'? Once inotify-ing only the performance
> domain directory as described above, I think overhead is no longer a
> problem. Do you have anything particular concerning about?
>
>>
>> Before starting to invent new interfaces for user space, though, I'm
>> wondering why the BPF code cannot obtain the energy model information
>> from the kernel?
>
> First of all, we prefer a userspace interface over adding new BPF kfuncs
> to access the energy model. The user space has much more freedom than
> the BPF code (e.g., using external libraries and floating point
> arithmetics). For instance, one sched_ext scheduler may leverage the
> energy model information to find the best subset of CPUs given the
> workload using machine learning or some numerical optimization
> techniques. Such approaches are infeasible (if not impossible) in the
> BPF/kernel code.
>
> Secondly, I am unsure if the current energy model data structures are
> accessible without any modifications through new BPF kfuncs. In
> particular, there are two flexible array members (cpus[] in
> em_perf_domain and state[] in em_perf_table). I guess that the BPF
> verifier may not be able to know the bounds of the arrays.
>
> Finally, even if the current EM data structures are accessible without
> any modifications, I still think adding new BPF kfuncs is not ideal
> because that creates dependencies between the EM data structures and the
> BPF code. If the EM data structures change (e.g., struct name change,
> field name change), the BPF code will break. While there is a BPF
> compatibility feature (CO-RE), managing such changes in the BPF code
> will be painful.
>
> Regards,
> Changwoo Min
>>
>>>> mutex_unlock(&em_pd_mutex);
>>>> return 0;
>>>> }
>>
>>
>
Powered by blists - more mailing lists