linux-kernel - Re: [PATCH V4 4/5] soc: qcom: Introduce SCMI based Memlat (Memory Latency) governor

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4399f7da-9478-57c0-7405-e51dc35d90ba@quicinc.com>
Date: Tue, 17 Dec 2024 15:46:24 +0530
From: Sibi Sankar <quic_sibis@...cinc.com>
To: Dmitry Baryshkov <dmitry.baryshkov@...aro.org>
CC: <myungjoo.ham@...sung.com>, <Kyungmin.park@...sung.com>,
        <cw00.choi@...sung.com>, Viresh Kumar <viresh.kumar@...aro.org>,
        <sudeep.holla@....com>, <cristian.marussi@....com>,
        <andersson@...nel.org>, <konrad.dybcio@...aro.org>,
        <robh+dt@...nel.org>, <krzysztof.kozlowski+dt@...aro.org>,
        <linux-kernel@...r.kernel.org>, <linux-arm-msm@...r.kernel.org>,
        <devicetree@...r.kernel.org>, <linux-arm-kernel@...ts.infradead.org>,
        <quic_rgottimu@...cinc.com>, <quic_kshivnan@...cinc.com>,
        <conor+dt@...nel.org>, <arm-scmi@...r.kernel.org>,
        Amir Vajid <avajid@...cinc.com>
Subject: Re: [PATCH V4 4/5] soc: qcom: Introduce SCMI based Memlat (Memory
 Latency) governor



On 12/5/24 17:00, Dmitry Baryshkov wrote:
> On Thu, 5 Dec 2024 at 12:53, Sibi Sankar <quic_sibis@...cinc.com> wrote:
>>
>>
>>
>> On 11/14/24 18:02, Dmitry Baryshkov wrote:
>>> On Thu, Nov 14, 2024 at 09:43:53AM +0530, Sibi Sankar wrote:
>>>>
>>>>
>>>> On 10/26/24 23:46, Dmitry Baryshkov wrote:
>>>>> On Tue, Oct 22, 2024 at 01:48:25PM +0530, Sibi Sankar wrote:
>>>>>>
>>>>>>
>>>>>> On 10/7/24 23:27, Dmitry Baryshkov wrote:
>>>>>>> On Mon, Oct 07, 2024 at 11:40:22AM GMT, Sibi Sankar wrote:
>>>
>>>>>>>
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +struct map_param_msg {
>>>>>>>> +        u32 hw_type;
>>>>>>>> +        u32 mon_idx;
>>>>>>>> +        u32 nr_rows;
>>>>>>>> +        struct map_table tbl[MAX_MAP_ENTRIES];
>>>>>>>> +} __packed;
>>>>>>>> +
>>>>>>>> +struct node_msg {
>>>>>>>> +        u32 cpumask;
>>>>>>>> +        u32 hw_type;
>>>>>>>> +        u32 mon_type;
>>>>>>>> +        u32 mon_idx;
>>>>>>>> +        char mon_name[MAX_NAME_LEN];
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +struct scalar_param_msg {
>>>>>>>> +        u32 hw_type;
>>>>>>>> +        u32 mon_idx;
>>>>>>>> +        u32 val;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +enum common_ev_idx {
>>>>>>>> +        INST_IDX,
>>>>>>>> +        CYC_IDX,
>>>>>>>> +        CONST_CYC_IDX,
>>>>>>>> +        FE_STALL_IDX,
>>>>>>>> +        BE_STALL_IDX,
>>>>>>>> +        NUM_COMMON_EVS
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +enum grp_ev_idx {
>>>>>>>> +        MISS_IDX,
>>>>>>>> +        WB_IDX,
>>>>>>>> +        ACC_IDX,
>>>>>>>> +        NUM_GRP_EVS
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +#define EV_CPU_CYCLES           0
>>>>>>>> +#define EV_INST_RETIRED         2
>>>>>>>> +#define EV_L2_D_RFILL           5
>>>>>>>> +
>>>>>>>> +struct ev_map_msg {
>>>>>>>> +        u32 num_evs;
>>>>>>>> +        u32 hw_type;
>>>>>>>> +        u32 cid[NUM_COMMON_EVS];
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +struct cpufreq_memfreq_map {
>>>>>>>> +        unsigned int cpufreq_mhz;
>>>>>>>> +        unsigned int memfreq_khz;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +struct scmi_monitor_info {
>>>>>>>> +        struct cpufreq_memfreq_map *freq_map;
>>>>>>>> +        char mon_name[MAX_NAME_LEN];
>>>>>>>> +        u32 mon_idx;
>>>>>>>> +        u32 mon_type;
>>>>>>>> +        u32 ipm_ceil;
>>>>>>>> +        u32 mask;
>>>>>>>> +        u32 freq_map_len;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +struct scmi_memory_info {
>>>>>>>> +        struct scmi_monitor_info *monitor[MAX_MONITOR_CNT];
>>>>>>>> +        u32 hw_type;
>>>>>>>> +        int monitor_cnt;
>>>>>>>> +        u32 min_freq;
>>>>>>>> +        u32 max_freq;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +struct scmi_memlat_info {
>>>>>>>> +        struct scmi_protocol_handle *ph;
>>>>>>>> +        const struct qcom_generic_ext_ops *ops;
>>>>>>>> +        struct scmi_memory_info *memory[MAX_MEMORY_TYPES];
>>>>>>>> +        u32 cluster_info[NR_CPUS];
>>>>>>>> +        int memory_cnt;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +static int populate_cluster_info(u32 *cluster_info)
>>>>>>>> +{
>>>>>>>> +        char name[MAX_NAME_LEN];
>>>>>>>> +        int i = 0;
>>>>>>>> +
>>>>>>>> +        struct device_node *cn __free(device_node) = of_find_node_by_path("/cpus");
>>>>>>>> +        if (!cn)
>>>>>>>> +                return -ENODEV;
>>>>>>>> +
>>>>>>>> +        struct device_node *map __free(device_node) = of_get_child_by_name(cn, "cpu-map");
>>>>>>>> +        if (!map)
>>>>>>>> +                return -ENODEV;
>>>>>>>> +
>>>>>>>> +        do {
>>>>>>>> +                snprintf(name, sizeof(name), "cluster%d", i);
>>>>>>>> +                struct device_node *c __free(device_node) = of_get_child_by_name(map, name);
>>>>>>>> +                if (!c)
>>>>>>>> +                        break;
>>>>>>>> +
>>>>>>>> +                *(cluster_info + i) = of_get_child_count(c);
>>>>>>>> +                i++;
>>>>>>>> +        } while (1);
>>>>>>>
>>>>>>> Can you use existing API from drivers/base/arch_topology.c? If not, can
>>>>>>> it be extended to support your usecase?
>>>>>>
>>>>>> ack. But I'm pretty sure it's going to take a while for reaching such
>>>>>> an agreement so I'll drop this feature during the next re-spin.
>>>>>
>>>>> Why? What kind of API do you actually need? The arch_topology.c simply
>>>>> exports a table of struct cpu_topology. Is it somehow different from
>>>>> what you are parsing manually?
>>>>
>>>> yup, we had to figure out the physical id of the cpu
>>>> since cpus can be disabled by the bootloader using
>>>> status = "failed" property and we have to pass this
>>>> onto the cpucp memlat algorithm.
>>>
>>> Isn't it equal to the index in the cpu_topology table?
>>
>> from what I see cpu_topology indexes remain unpopulated
>> for cpus that are disabled since get_cpu_for_node
>> ignores those?
> 
> Why do you need cpu_topology for disabled aka non-existing CPU devices?

sorry was out sick couldn't back earlier. We need the know
what cpus are disbled to pass on the correct mask of cpus
enabled to the SCP.

> 
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +
>>>>>>>> +        return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void populate_physical_mask(struct device_node *np, u32 *mask, u32 *cluster_info)
>>>>>>>> +{
>>>>>>>> +        struct device_node *dev_phandle __free(device_node);
>>>>>>>> +        int cpu, i = 0, physical_id;
>>>>>>>> +
>>>>>>>> +        do {
>>>>>>>> +                dev_phandle = of_parse_phandle(np, "cpus", i++);
>>>>>>>> +                cpu = of_cpu_node_to_id(dev_phandle);
>>>>>>>> +                if (cpu != -ENODEV) {
>>>>>>>> +                        physical_id = topology_core_id(cpu);
>>>>>>>> +                        for (int j = 0; j < topology_cluster_id(cpu); j++)
>>>>>>>> +                                physical_id += *(cluster_info + j);
>>>>>>>> +                        *mask |= BIT(physical_id);
>>>>>>>> +                }
>>>>>>>> +        } while (dev_phandle);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static struct cpufreq_memfreq_map *init_cpufreq_memfreq_map(struct device *dev,
>>>>>>>> +                                                            struct scmi_memory_info *memory,
>>>>>>>> +                                                            struct device_node *of_node,
>>>>>>>> +                                                            u32 *cnt)
>>>>>>>> +{
>>>>>>>> +        struct device_node *tbl_np __free(device_node), *opp_np __free(device_node);
>>>>>>>> +        struct cpufreq_memfreq_map *tbl;
>>>>>>>> +        int ret, i = 0;
>>>>>>>> +        u32 level, len;
>>>>>>>> +        u64 rate;
>>>>>>>> +
>>>>>>>> +        tbl_np = of_parse_phandle(of_node, "operating-points-v2", 0);
>>>>>>>
>>>>>>> Please use existing API to parse OPP tables or document a reason why it
>>>>>>> can't be used.
>>>>>>
>>>>>> Thanks, I had them documented as opens in the coverletter. Dropped them
>>>>>> since no one had any comments on it during V3. Will add them as comments
>>>>>> to this driver instead.
>>>>>>
>>>>>> https://lore.kernel.org/lkml/20240702191440.2161623-1-quic_sibis@quicinc.com/
>>>>>>
>>>>>> re-copying things again:
>>>>>> opp-tables are used but they don't get to be added to the scmi device
>>>>>> (thus we rely on a lot of manual parsing) because the memlat client driver
>>>>>> doesn't vote on these resources clocks/interconnects/power-domain
>>>>>> from the kernel and some of the resources aren't modeled in the first
>>>>>> place like DDR_QOS.
>>>>>
>>>>> As discussed offline, please consider extending the OPP to be able to
>>>>> get the struct opp_table for the particular phandle. Another option
>>>>> might be to change the memlat driver by having a separate device for
>>>>> each monitor. This way you can use existing API to parse OPP tables and
>>>>> to get necessary data from those tables.
>>>>
>>>> + Viresh
>>>>
>>>> Spoke with Viresh offline and he had stricter requirements
>>>> than what you proposed. He definitely wanted the opp-tables
>>>> to be assoiciated with devices at the very least and have
>>>> all opp parsing logic within the opp-framework. Given that
>>>> we have to model all these dummy devices just to add the
>>>> tables I'll re-check the feasibility of movign the tables
>>>> into the driver itself. Will move the patch series back
>>>> into RFC and re-post just the vendor protocol since that's
>>>> close to merge
>>>
>>> I don't think it's sensible to move the tables to the driver. Instead
>>> adding a device per monitor sounds like a better idea.
>>
>> yeah, I would like to keep this in dt as well. But in order
>> to be able to do that through the opp core we would need
>> to put in a clock and interconnect paths so that the framework
>> is able to add the table to the device.
>> I should be able
>> to list the scmi perf domain as the clock phandle but inorder
>> to be able to convert the kbps values for the interconnect I
>> would need to store the bus width and so on which is currently
>> abstracted by the interconnect framework. Also in the future
>> we may have to model dummy devices just to get the table parsed
>> if the devices aren't modelled and controlled in the kernel.
>> All of these seems to indicate that having the tables in the
>> driver is a better alternative, lol.
> 
> Or fix the OPP API. I don't think that having monitor tables for the
> SCMI-based platforms in the driver code is going to scale.

There is nothing to fix in the OPP API, looking at number
of clocks, interconnects and regulators to allocate
memory and associate a opp-table to a device is already
the right thing to do. So to adhere to this we would
need to list dummy clocks, interconnects even when it
isn't modelled or controlled by the kernel. Sure I'll
continue to have it in dt in the next re-spin and take
this series back in RFC.

-Sibi

> 
>>
>> -Sibi
>>
>>>
>>>>>>>> +        if (!tbl_np)
>>>>>>>> +                return ERR_PTR(-ENODEV);
>>>>>>>> +
>>>>>>>> +        len = min(of_get_available_child_count(tbl_np), MAX_MAP_ENTRIES);
>>>>>>>> +        if (len == 0)
>>>>>>>> +                return ERR_PTR(-ENODEV);
>>>>>>>> +
>>>>>>>> +        tbl = devm_kzalloc(dev, (len + 1) * sizeof(struct cpufreq_memfreq_map),
>>>>>>>> +                           GFP_KERNEL);
>>>>>>>> +        if (!tbl)
>>>>>>>> +                return ERR_PTR(-ENOMEM);
>>>>>>>> +
>>>>>>>> +        for_each_available_child_of_node(tbl_np, opp_np) {
>>>>>>>> +                ret = of_property_read_u64_index(opp_np, "opp-hz", 0, &rate);
>>>>>>>> +                if (ret < 0)
>>>>>>>> +                        return ERR_PTR(ret);
>>>>>>>> +
>>>>>>>> +                tbl[i].cpufreq_mhz = rate / HZ_PER_MHZ;
>>>>>>>> +
>>>>>>>> +                if (memory->hw_type != QCOM_MEM_TYPE_DDR_QOS) {
>>>>>>>> +                        ret = of_property_read_u64_index(opp_np, "opp-hz", 1, &rate);
>>>>>>>> +                        if (ret < 0)
>>>>>>>> +                                return ERR_PTR(ret);
>>>>>>>> +
>>>>>>>> +                        tbl[i].memfreq_khz = rate / HZ_PER_KHZ;
>>>>>>>> +                } else {
>>>>>>>> +                        ret = of_property_read_u32(opp_np, "opp-level", &level);
>>>>>>>> +                        if (ret < 0)
>>>>>>>> +                                return ERR_PTR(ret);
>>>>>>>> +
>>>>>>>> +                        tbl[i].memfreq_khz = level;
>>>>>>>> +                }
>>>>>>>> +
>>>>>>>> +                dev_dbg(dev, "Entry%d CPU:%u, Mem:%u\n", i, tbl[i].cpufreq_mhz, tbl[i].memfreq_khz);
>>>>>>>> +                i++;
>>>>>>>> +        }
>>>>>>>> +        *cnt = len;
>>>>>>>> +
>>>>>>>> +        return tbl;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int process_scmi_memlat_of_node(struct scmi_device *sdev, struct scmi_memlat_info *info)
>>>>>>>> +{
>>>>>>>> +        struct scmi_monitor_info *monitor;
>>>>>>>> +        struct scmi_memory_info *memory;
>>>>>>>> +        char name[MAX_NAME_LEN];
>>>>>>>> +        u64 memfreq[2];
>>>>>>>> +        int ret;
>>>>>>>> +
>>>>>>>> +        ret = populate_cluster_info(info->cluster_info);
>>>>>>>> +        if (ret < 0) {
>>>>>>>> +                dev_err_probe(&sdev->dev, ret, "failed to populate cluster info\n");
>>>>>>>> +                goto err;
>>>>>>>> +        }
>>>>>>>> +
>>>>>>>> +        of_node_get(sdev->dev.of_node);
>>>>>>>> +        do {
>>>>>>>> +                snprintf(name, sizeof(name), "memory-%d", info->memory_cnt);
>>>>>>>> +                struct device_node *memory_np __free(device_node) =
>>>>>>>> +                        of_find_node_by_name(sdev->dev.of_node, name);
>>>>>>>> +
>>>>>>>> +                if (!memory_np)
>>>>>>>> +                        break;
>>>>>>>> +
>>>>>>>> +                if (info->memory_cnt >= MAX_MEMORY_TYPES)
>>>>>>>> +                        return dev_err_probe(&sdev->dev, -EINVAL,
>>>>>>>> +                                             "failed to parse unsupported memory type\n");
>>>>>>>> +
>>>>>>>> +                memory = devm_kzalloc(&sdev->dev, sizeof(*memory), GFP_KERNEL);
>>>>>>>> +                if (!memory) {
>>>>>>>> +                        ret = -ENOMEM;
>>>>>>>> +                        goto err;
>>>>>>>> +                }
>>>>>>>> +
>>>>>>>> +                ret = of_property_read_u32(memory_np, "qcom,memory-type", &memory->hw_type);
>>>>>>>> +                if (ret) {
>>>>>>>> +                        dev_err_probe(&sdev->dev, ret, "failed to read memory type\n");
>>>>>>>> +                        goto err;
>>>>>>>> +                }
>>>>>>>> +
>>>>>>>> +                ret = of_property_read_u64_array(memory_np, "freq-table-hz", memfreq, 2);
>>>>>>>> +                if (ret && (ret != -EINVAL)) {
>>>>>>>> +                        dev_err_probe(&sdev->dev, ret, "failed to read min/max freq\n");
>>>>>>>> +                        goto err;
>>>>>>>> +                }
>>>>>>>
>>>>>>> Can we get this information from the OPP table instead?
>>>>>>
>>>>>> we don't list all the available ddr/llcc freqs in the opp-table
>>>>>> so that we can keep the table constant across platforms.
>>>>>
>>>>> NO. Use opp-supported-hw to limit data to a particular platform. There
>>>>> is no reason to keep min/max out of the OPP table.
>>>>
>>>> if we are movign the opp-tables into driver data for the reasons
>>>> described above, this can probably stay?
>>>
>>> No. They duplicate the information that can be a part of the tables. It
>>> doesn't matter if the tables are in the driver or in DT.
>>>
>>>
> 
> 
>