netdev - Re: [PATCH net-next v9 2/2] net: ti: icssg_prueth: add TAPRIO offload support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <69f01921-545a-4ddd-85ee-d1b7fc635df5@ti.com>
Date: Wed, 12 Jun 2024 15:45:18 +0530
From: MD Danish Anwar <danishanwar@...com>
To: Vladimir Oltean <vladimir.oltean@....com>
CC: Jan Kiszka <jan.kiszka@...mens.com>,
        Dan Carpenter
	<dan.carpenter@...aro.org>,
        Andrew Lunn <andrew@...n.ch>, Simon Horman
	<horms@...nel.org>,
        Diogo Ivo <diogo.ivo@...mens.com>,
        Wolfram Sang
	<wsa+renesas@...g-engineering.com>,
        Randy Dunlap <rdunlap@...radead.org>,
        Niklas Schnelle <schnelle@...ux.ibm.com>,
        Vignesh Raghavendra
	<vigneshr@...com>,
        Richard Cochran <richardcochran@...il.com>,
        Roger Quadros
	<rogerq@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
        Jakub Kicinski
	<kuba@...nel.org>, Eric Dumazet <edumazet@...gle.com>,
        "David S. Miller"
	<davem@...emloft.net>,
        <linux-arm-kernel@...ts.infradead.org>, <netdev@...r.kernel.org>,
        <linux-kernel@...r.kernel.org>, <srk@...com>,
        Jacob
 Keller <jacob.e.keller@...el.com>,
        Roger Quadros <rogerq@...com>
Subject: Re: [PATCH net-next v9 2/2] net: ti: icssg_prueth: add TAPRIO offload
 support



On 06/06/24 7:47 pm, Vladimir Oltean wrote:
> On Thu, Jun 06, 2024 at 04:33:58PM +0530, MD Danish Anwar wrote:
>>>>>> +static void tas_reset(struct prueth_emac *emac)
>>>>>> +{
>>>>>> +	struct tas_config *tas = &emac->qos.tas.config;
>>>>>> +	int i;
>>>>>> +
>>>>>> +	for (i = 0; i < TAS_MAX_NUM_QUEUES; i++)
>>>>>> +		tas->max_sdu_table.max_sdu[i] = 2048;
>>>>>
>>>>> Macro + short comment for the magic number, please.
>>>>>
>>>>
>>>> Sure I will add it. Each elements in this array is a 2 byte value
>>>> showing the maximum length of frame to be allowed through each gate.
>>>
>>> Is the queueMaxSDU[] array active even with the TAS being in the reset
>>> state? Does this configuration have any impact upon the device MTU?
>>> I don't know why 2048 was chosen.
>>
>> I talked to the firmware team. The value of 248 is actually wrong. It
>> should be the device mtu only i.e. PRUETH_MAX_MTU.
> 
> There was another comment about the value of 0, sent separately.
> 

Yes, I have replied to that.

>>> If you're replacing an existing active schedule with a shadow one, the
>>> ICSSG_EMAC_PORT_TAS_ENABLE command isn't needed because the TAS is
>>> already enabled on the port, right? In fact it will be suppressed by
>>> tas_set_state() without even generating an emac_set_port_state() call,
>>> right?
>>>
>>
>> As this point TAS is not enabled. TAS is enabled on the port only when
>> ICSSG_EMAC_PORT_TAS_ENABLE is sent. Which happens at the end of
>> emac_taprio_replace().
> 
> "If you're replacing an existing active schedule" => emac_taprio_replace()
> was already called once, and we're calling it again, with no emac_taprio_destroy()
> in between.
> 
> This is done using the "tc qdisc replace" command. You can keep the
> mqprio parameters the same, just change the schedule parameters.
> The transition from the old to the new schedule is supposed to be
> seamless and at a well-defined time, according to the IEEE definitions.
> 

I talked with the FW team. During "tc qdisc replace" there is no need to
send ICSSG_EMAC_PORT_TAS_ENABLE as it is already enabled. I'll fix it.

>>>> The following three offsets are configured in this function,
>>>> 1. TAS_ADMIN_CYCLE_TIME → admin cycle time
>>>> 2. TAS_CONFIG_CHANGE_CYCLE_COUNT → number of cycles after which the
>>>> admin list is taken as operating list.
>>>> This parameter is calculated based on the base_time, cur_time and
>>>> cycle_time. If the base_time is in past (already passed) the
>>>> TAS_CONFIG_CHANGE_CYCLE_COUNT is set to 1. If the base_time is in
>>>> future, TAS_CONFIG_CHANGE_CYCLE_COUNT is calculated using
>>>> DIV_ROUND_UP_ULL(base_time - cur_time, cycle_time)
>>>> 3. TAS_ADMIN_LIST_LENGTH → Number of window entries in the admin list.
>>>>
>>>> After configuring the above three parameters, the driver gives the
>>>> trigger signal to the firmware using the R30 command interface with
>>>> ICSSG_EMAC_PORT_TAS_TRIGGER command.
>>>>
>>>> The schedule starts based on TAS_CONFIG_CHANGE_CYCLE_COUNT. Those cycles
>>>> are relative to time remaining in the base_time from now i.e. base_time
>>>> - cur_time.
>>>
>>> So you're saying that the firmware executes the schedule switch at
>>>
>>> 	now                  +      TAS_ADMIN_CYCLE_TIME * TAS_CONFIG_CHANGE_CYCLE_COUNT ns
>>> 	~~~
>>> 	time of reception of
>>> 	ICSSG_EMAC_PORT_TAS_TRIGGER
>>> 	R30 command
>>>
>>> ?
>>>
>>
>> I talked to the firmware team on this topic. Seems like this is actually
>> a bug in the firmware design. This *now* is very relative and it will
>> always introduce jitter as you have mentioned.
>>
>> The firmware needs to change to handle the below two cases that you have
>> mentioned.
>>
>> The schedule should start at base-time (given by user). Instead of
>> sending the cycle count from now to base-time to firmware. Driver should
>> send the absolute cycle count corresponding to the base-time. Firmware
>> can then check the curr cycle count and when it matches the count set by
>> driver firmware will start scheduling.
>>
>> change_cycle_count = base-time / cycle-time;
>>
>> This way the irregularity with *now* will be removed. Now even if we run
>> the same command on two different ICSSG devices(whose clocks are synced
>> with PTP), the scheduling will happen at same time.
>>
>> As the change_cycle_count will be same for both of them. Since the
>> clocks are synced the current cycle count (read from
>> TIMESYNC_FW_WC_CYCLECOUNT_OFFSET) will also be same for both the devices
> 
> You could pass the actual requested base-time to the firmware, and let
> the firmware calculate a cycle count or whatever the hardware needs.
> Otherwise, you advance the base-time in the driver into what was the
> future at the time, but by the time the r30 command reaches the
> firmware, the passed number of cycles has already elapsed.
> 

Yes that would work too.

>>> I'm not really interested in how the driver calculates the cycle count,
>>> just in what are the primitives that the firmware ABI wants.
>>>
>>> Does the readb_poll_timeout() call from tas_update_oper_list() actually
>>> wait until this whole time elapses? It is user space input, so it can
>>> keep a task waiting in the kernel, with rtnl_lock() acquired, for a very
>>> long time if the base_time is far away in the future.
>>>
>>
>> readb_poll_timeout() call from tas_update_oper_list() waits for exactly
>> 10 msecs. Driver send the trigger_list_change command and sets
>> config_change register to 1 (details in tas_set_trigger_list_change()).
>> Driver waits for 10 ms for firmware to clear this register. If the
>> register is not cleared, list wasn't changed by firmware. Driver will
>> then return err.
> 
> And the firmware clears this register when? Quickly upon reception of
> the TAS_TRIGGER command, or after the TAS is actually triggered (after
> change_cycle_count cycles)?
> 

tas->config_list->config_change is set to 1 by driver in
tas_set_trigger_list_change() before sending the TAS_TRIGGER command.
tas->config_list->config_change indicates that the driver has filled the
shadow list.

Firmware then receives the TAS_TRIGGER command and clears
`tas->config_list->config_change` as soon as new shadow list is received.

Firmware then goes on to copy the shadow list to active list and once
that is done, firmware clears `tas->config_list->config_pending`
register. After this TAS is actually triggered when change_cycle_count
cycles have passed.

>>> 2. You cannot apply a phase offset between the schedules on two ICSSG
>>> devices in the same network.
>>>
>>> Since there is a PHY-dependent propagation delay on each link, network
>>> engineers typically delay the schedules on switch ports along the path
>>> of a stream.
>>>
>>> Say for example there is a propagation delay of 800 ns on a switch with
>>> base-time 0. On the next switch, you could add the schedule like this:
>>>
>>> tc qdisc replace dev swp0 parent root taprio \
>>> 	num_tc 8 \
>>> 	map 0 1 2 3 4 5 6 7 \
>>> 	queues 1@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 \
>>> 	base-time 800 \
>>> 	sched-entry S 0x81 100000 \
>>> 	sched-entry S 0x01 900000 \
>>> 	flags 0x2 \
>>> 	max-sdu 0 0 0 0 0 0 0 79
>>>
>>> Same schedule, phase-shifted by 800 ns, so that if the packet goes
>>> through an open gate in the first switch, it will also go through an
>>> open gate through the second.
>>>
>>> According to your own calculations and explanations, the firmware ABI
>>> makes no difference between base-time 0 and base-time 800.
>>>
>>
>> In the new implementation base-time 0 and base-time 800 will make a
>> difference. as the change_cycle_count will be different from both the cases.
>> In case of base-time 0, change_cycle_count will be 1. Implying schedule
>> will start on the very next cycle.
>>
>> In case of base-time 800, change_cycle_count will be 800 / cycle-time.
> 
> In this example, cycle-time is (much) larger than 800 ns, so 800 / cycle-time is 0.
> Simply put, base-time 0 and base-time 800 will still be treated equally,
> if the firmware only starts the schedule upon integer multiples of the
> cycle time. A use case is offsetting schedules by a small value, smaller
> than the cycle time.
> 
> The base-time value of 800 should be advanced by the smallest integer
> multiple of the cycle-time that satisfies the inequality
> new-base-time = (base-time + N * cycle-time) >= now.
> 
> You can see that for the same value of N and cycle-time, new-base-time
> will different when base-time = 0 vs when base-time = 800. Taprio
> expects that difference to be reflected into the schedule.
> 
>>> In this case they are probably both smaller than the current time, so
>>> TAS_CONFIG_CHANGE_CYCLE_COUNT will be set to the same "1" in both cases.
>>>
>>
>> If cycle-time is larger then both 0 and 800 then the change_cycle_count
>> would be 1 in both the cases.
>>
>>> But even assuming a future base-time, it still will make no difference.
>>> The firmware seems to operate only on integer multiples of a cycle-time
>>> (here 1000000).
>>
>> Yes, the firmware works only on multiple of cycle time. If the base-time
>> is not a multiple of cycle time, the scheduling will start on the next
>> cycle count.
>>
>> i.e. change_cycle_count = ceil (base-time / cycle-time)
>>> Summarized, the blocking problems I see are:
>>>
>>> - For issue #2, the driver should not lie to the user space that it
>>>   applied a schedule with a base-time that isn't a precise multiple of
>>>   the cycle-time, because it doesn't do that.
>>>
>>
>> Yes, I acknowledge it's a limitation. Driver can print "requested
>> base-time is not multiple of cycle-time, secheduling will start on the
>> next available cycle from base-time". I agree the driver shouldn't lie
>> about this. Whenever driver encounters a base time which is not multiple
>> of cycle-time. It can still do the scheduling but throw a print so that
>> user is aware of this.
> 
> Is that a firmware or a hardware limitation? You're making it sound as
> if we shouldn't be expecting for it to be lifted.
> 

It is a firmware limitation and I have conveyed this issue to the
firmware team. They are working on fixing this as well. In my earlier
reply I meant to say that if this is not fixed or if it takes very long
to fix this, we can still go ahead with the driver by mentioning this
limitation by a print.

>>> - For issue #1, the bigger problem is that there is always a
>>>   software-induced jitter which makes whatever the user space has
>>>   requested irrelevant.
>>>
>>
>> As a I mentioned earlier, the new implementation will take care of this.
>>
>> I will work with the firmware team to get this fixed. Once that's done I
>> will send a new revision.
>>
>> Thanks for all the feedbacks. Please let me know if some more
>> clarification is needed.
> 
> Ok, so we're waiting for a new firmware release, and a check in the
> driver that the firmware version >= some minimum requirement, else
> -EOPNOTSUPP?

Yes, we are waiting on a new firmware release. The check however might
not be necessary as ICSSG firmware is not yet public. It's not part of
Linux-firmware. We are planning on integrating the ICSSG firmware with
Linux-firmware. However as of now it's not public. Since the firmware is
not public yet and the first public version will most probably be after
this things are fixed and driver support is upstream-ed, I don't think
version check will be needed.

Once the firmware is public and the some change is done in driver that
depends on firmware versions then the check will be necessary.

-- 
Thanks and Regards,
Danish