linux-kernel - Re: [PATCH] PM: EM: Add inotify support when the energy model is updated.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <d77ac703-2fa8-48f8-846a-fb88d45b9287@igalia.com>
Date: Mon, 28 Apr 2025 10:10:41 +0900
From: Changwoo Min <changwoo@...lia.com>
To: "Rafael J. Wysocki" <rafael@...nel.org>, Lukasz Luba <lukasz.luba@....com>
Cc: christian.loehle@....com, tj@...nel.org, len.brown@...el.com,
 kernel-dev@...lia.com, linux-pm@...r.kernel.org,
 linux-kernel@...r.kernel.org, pavel@...nel.org
Subject: Re: [PATCH] PM: EM: Add inotify support when the energy model is
 updated.

Gentle ping as it reaches around 1.5 weeks.

Regards,
Changwoo Min

On 4/17/25 22:28, Changwoo Min wrote:
> Hi Lukasz and Rafael,
> 
> Thank you super much for the comments!
> 
> On 4/16/25 02:05, Rafael J. Wysocki wrote:
>> On Tue, Apr 15, 2025 at 10:31 AM Lukasz Luba <lukasz.luba@....com> wrote:
>>>
>>> Hi Changwoo,
>>>
>>> Thanks for the patch.
>>>
>>> On 4/15/25 01:42, Changwoo Min wrote:
>>>> The sched_ext schedulers [1] currently access the energy model 
>>>> through the
>>>> debugfs to make energy-aware scheduling decisions [2]. The userspace 
>>>> part
>>>> of a sched_ext scheduler feeds the necessary (post-processed) 
>>>> energy-model
>>>> information to the BPF part of the scheduler.
>>>
>>> This is very interesting use case!
>>>
>>>>
>>>> However, there is a limitation in the current debugfs support of the 
>>>> energy
>>>> model. When the energy model is updated (em_dev_update_perf_domain), 
>>>> there
>>>> is no way for the userspace part to know such changes (besides 
>>>> polling the
>>>> debugfs files).
>>>>
>>>> Therefore, add inotify support (IN_MODIFY) when the energy model is
>>>> updated. With this inotify support, the sched_ext scheduler can 
>>>> monitor the
>>>> energy model change in userspace using the regular inotify interface 
>>>> and
>>>> feed the updated energy model information to make energy-aware 
>>>> scheduling
>>>> decisions.
>>>>
>>>> [1] https://lwn.net/Articles/922405/
>>>> [2] https://github.com/sched-ext/scx/pull/1624
>>>>
>>>> Signed-off-by: Changwoo Min <changwoo@...lia.com>
>>>> ---
>>>>    kernel/power/energy_model.c | 47 ++++++++++++++++++++++++++++++++ 
>>>> +++++
>>>>    1 file changed, 47 insertions(+)
>>>>
>>>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>>>> index d9b7e2b38c7a..0c06e0278df6 100644
>>>> --- a/kernel/power/energy_model.c
>>>> +++ b/kernel/power/energy_model.c
>>>> @@ -14,6 +14,7 @@
>>>>    #include <linux/cpumask.h>
>>>>    #include <linux/debugfs.h>
>>>>    #include <linux/energy_model.h>
>>>> +#include <linux/fsnotify.h>
>>>>    #include <linux/sched/topology.h>
>>>>    #include <linux/slab.h>
>>>>
>>>> @@ -156,9 +157,53 @@ static int __init em_debug_init(void)
>>>>        return 0;
>>>>    }
>>>>    fs_initcall(em_debug_init);
>>>> +
>>>> +static void em_debug_update_ps(struct em_perf_domain *em_pd, int i,
>>>> +                            struct dentry *pd)
>>>> +{
>>>> +     static const char *names[] = {
>>>> +             "frequency",
>>>> +             "power",
>>>> +             "cost",
>>>> +             "performance",
>>>> +             "inefficient",
>>>> +     };
>>>> +     struct em_perf_state *table;
>>>> +     unsigned long freq;
>>>> +     struct dentry *d, *cd;
>>>> +     char name[24];
>>>> +     int j;
>>>> +
>>>> +     rcu_read_lock();
>>>> +     table = em_perf_state_from_pd(em_pd);
>>>> +     freq = table[i].frequency;
>>>> +     rcu_read_unlock();
>>>> +
>>>> +     snprintf(name, sizeof(name), "ps:%lu", freq);
>>>> +     d = debugfs_lookup(name, pd);
>>>> +
>>>> +     for (j = 0; j < ARRAY_SIZE(names); j++) {
>>>> +             cd = debugfs_lookup(names[j], d);
>>>> +             if (!cd)
>>>> +                     return;
>>>> +             fsnotify_dentry(cd, FS_MODIFY);
>>>> +             cond_resched();
>>>> +     }
>>>> +}
>>>> +
>>>> +static void em_debug_update(struct device *dev)
>>>> +{
>>>> +     struct dentry *d;
>>>> +     int i;
>>>> +
>>>> +     d = debugfs_lookup(dev_name(dev), rootdir);
>>>> +     for (i = 0; i < dev->em_pd->nr_perf_states; i++)
>>>> +             em_debug_update_ps(dev->em_pd, i, d);
>>>> +}
>>>>    #else /* CONFIG_DEBUG_FS */
>>>>    static void em_debug_create_pd(struct device *dev) {}
>>>>    static void em_debug_remove_pd(struct device *dev) {}
>>>> +static void em_debug_update(struct device *dev) {}
>>>>    #endif
>>>>
>>>>    static void em_release_table_kref(struct kref *kref)
>>>> @@ -323,6 +368,8 @@ int em_dev_update_perf_domain(struct device *dev,
>>>>
>>>>        em_table_free(old_table);
>>>>
>>>> +     em_debug_update(dev);
>>>> +
>>>
>>> I would move this out of the locked section, below the mutex
>>> unlock. Looking at the code in em_debug_update() you are trying
>>> to send such notification for each EM's table entry * number of
>>> fields, which is heavy. The RCU copy that you get will make sure
>>> you have consistent view on the data and you don't have to
>>> be under the mutex lock.
> 
> That makes sense. I will change it in the next version as you suggested.
> 
>>>
>>> A different question would be if the notification has to be
>>> that heavy?
> 
> This is a good question. In this version, I tried to mimic the situation
> such that all the files under a performance domain are updated, so I
> sent inotify for all files.
> 
> However, that is *not* necessary considering how a userspace code
> monitors the update of the energy model. Now, I think it will be better
> to inotify just the directory of a performance domain (e.g., /sys/
> kernel/debug/energy_model/cpu0). A userspace application just monitors
> '/sys/kernel/debug/energy_model' to get a notification for an update on
> any performance domain.
> 
> The change will be minimal as follows:
> 
> @@ -14,6 +14,7 @@
>   #include <linux/cpumask.h>
>   #include <linux/debugfs.h>
>   #include <linux/energy_model.h>
> +#include <linux/fsnotify.h>
>   #include <linux/sched/topology.h>
>   #include <linux/slab.h>
> 
> @@ -156,9 +157,18 @@ static int __init em_debug_init(void)
>       return 0;
>   }
>   fs_initcall(em_debug_init);
> +
> +void em_debug_update(struct device *dev)
> +{
> +    struct dentry *d;
> +
> +    d = debugfs_lookup(dev_name(dev), rootdir);
> +    fsnotify_dentry(d, FS_MODIFY);
> +}
>   #else /* CONFIG_DEBUG_FS */
>   static void em_debug_create_pd(struct device *dev) {}
>   static void em_debug_remove_pd(struct device *dev) {}
> +static void em_debug_update(struct device *dev) {}
>   #endif
> 
>   static void em_destroy_table_rcu(struct rcu_head *rp)
> @@ -335,6 +345,8 @@ int em_dev_update_perf_domain(struct device *dev,
>       em_table_free(old_table);
> 
>       mutex_unlock(&em_pd_mutex);
> +
> +    em_debug_update(dev);
>       return 0;
>   }
>   EXPORT_SYMBOL_GPL(em_dev_update_perf_domain);
> 
>>> Can we just 'ping' the user-space that there is a change and ask to read
>>> the new values?
> 
> Besides the inotify, another 'ping' mechanism that I can think of is
> using kprobe/kretprobe. The BPF code hooks em_dev_update_perf_domain()
> using kprobe/kretprobe. When the function is called, the BPF code tells
> the userspace to re-read the energy model.
> 
> This is pretty ad-hoc and has two (or more) problems. Firstly, it
> requires another communication mechanism (maybe BPF ring buffer) from
> the BPF code to the userspace. This will eventually mimic what inotify
> does. More importantly, the BPF code has a dependency on the function
> name. The changes in the function name/prototype will break the  BPF
> code. We may consider adding a tracepoint to em_dev_update_perf_domain()
> to avoid this, but to me, it feels like another bandage over a bandage.
> 
> Due to this, I think inotify is the right solution to the problem.
> 
> @Lukasz -- If you have anything particular in mind other than kprobe and
> inotify, please let me know.
> 
>>>
>>> Another question, but this time to Rafael would be if for such use case
>>> we can use debugfs, or we need a sysfs?
>>
>> debugfs is not really suitable for this IMV and the problem at hand is
>> a symptom of that (but there will be more issues in the future
>> AFAICS).
> 
> I agree debugfs is not ideal in the sense that it is not considered as a
> stable interface. We can consider moving from the debugfs to the sysfs
> if there is a consensus. In my view, moving to sysfs is reasonable since
> there is no major change in the debugfs hierarchy since its inception.
> If necessary, I will add the sysfs implementation (dropping the debugfs
> code) to the next version.
> 
> @Rafael -- Could you elaborate a bit more about 'the problem at hand'
> and 'more issues in the future'? Once inotify-ing only the performance
> domain directory as described above, I think overhead is no longer a
> problem. Do you have anything particular concerning about?
> 
>>
>> Before starting to invent new interfaces for user space, though, I'm
>> wondering why the BPF code cannot obtain the energy model information
>> from the kernel?
> 
> First of all, we prefer a userspace interface over adding new BPF kfuncs
> to access the energy model. The user space has much more freedom than
> the BPF code (e.g., using external libraries and floating point
> arithmetics). For instance, one sched_ext scheduler may leverage the
> energy model information to find the best subset of CPUs given the
> workload using machine learning or some numerical optimization
> techniques. Such approaches are infeasible (if not impossible) in the
> BPF/kernel code.
> 
> Secondly,  I am unsure if the current energy model data structures are
> accessible without any modifications through new BPF kfuncs. In
> particular, there are two flexible array members (cpus[] in
> em_perf_domain and state[] in em_perf_table). I guess that the BPF
> verifier may not be able to know the bounds of the arrays.
> 
> Finally, even if the current EM data structures are accessible without
> any modifications, I still think adding new BPF kfuncs is not ideal
> because that creates dependencies between the EM data structures and the
> BPF code. If the EM data structures change (e.g., struct name change,
> field name change), the BPF code will break. While there is a BPF
> compatibility feature (CO-RE), managing such changes in the BPF code
> will be painful.
> 
> Regards,
> Changwoo Min
>>
>>>>        mutex_unlock(&em_pd_mutex);
>>>>        return 0;
>>>>    }
>>
>>
>