Message-ID: <1045511371.220520131.1638894949373.JavaMail.zimbra@uliege.be>
Date: Tue, 7 Dec 2021 17:35:49 +0100 (CET)
From: Justin Iurman <justin.iurman@...ege.be>
To: Jakub Kicinski <kuba@...nel.org>
Cc: netdev@...r.kernel.org, davem@...emloft.net, dsahern@...nel.org,
yoshfuji@...ux-ipv6.org, linux-mm@...ck.org, cl@...ux.com,
penberg@...nel.org, rientjes@...gle.com,
iamjoonsoo kim <iamjoonsoo.kim@....com>,
akpm@...ux-foundation.org, vbabka@...e.cz
Subject: Re: [RFC net-next 2/2] ipv6: ioam: Support for Buffer occupancy
data field
On Dec 7, 2021, at 4:50 PM, Jakub Kicinski kuba@...nel.org wrote:
> On Tue, 7 Dec 2021 12:54:04 +0100 (CET) Justin Iurman wrote:
>> >> The function kmem_cache_size is used to retrieve the size of a slab
>> >> object. Note that it returns the "object_size" field, not the "size"
>> >> field. If needed, a new function (e.g., kmem_cache_full_size) could be
>> >> added to return the "size" field. To match the definition from the
>> >> draft, the number of bytes is computed as follows:
>> >>
>> >> slabinfo.active_objs * size
>> >
>> > Implementing the standard is one thing but how useful is this
>> > in practice?
>>
>> IMHO, very useful. To be honest, if I were to implement only a few data
>> fields, both of these would be included. Take the example of CLT [1]
>> where the queue length data field is used to detect low-level issues
>> from inside an L5-7 distributed tracing tool. And this is just one
>> example among many others. The queue length data field is very specific
>> to TX queues, but we could also use the buffer occupancy data field to
>> detect more global loads on a node. Actually, the goal for operators
>> running their IOAM domain is to quickly detect a problem along a path
>> and react accordingly (human or automatic action). For example, if you
>> monitor TX queues along a path and detect an increasing queue on a
>> router, you could choose to, e.g., rebalance its queues. With the
>> buffer occupancy, you could detect heavily loaded nodes in general and,
>> e.g., rebalance traffic to another branch. Again, this is just one
>> example among others. Apart from more accurate ECMPs, you could for
>> instance deploy a smart (micro)service selection based on different
>> metrics, etc.
>>
>> [1] https://github.com/Advanced-Observability/cross-layer-telemetry
>
> Ack, my question was more about whether the metric as implemented

Oh, sorry about that.

> provides the best signal. Since the slab cache scales dynamically
> (AFAIU) it's not really a big deal if it's full as long as there's
> memory available on the system.

Well, I have the same understanding as you. However, we deliberately do not
report a value meaning "X percent used" because, as you pointed out, it
wouldn't make much sense for a cache that scales dynamically. So I think it
is sound to report the current value, even if it is quite dynamic. What
matters here is knowing how many bytes are in use, and that is exactly what
this value provides. If a node is under heavy load, the value would be very
high. The operator could define a threshold for each node and detect
abnormal values.

We probably want the metadata included for accuracy as well (e.g.,
kmem_cache_size vs. a new function kmem_cache_full_size).
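
To make the difference concrete, here is a minimal user-space sketch of the
computation discussed above (active_objs multiplied by a per-object size),
once with the payload size and once with the full size including metadata,
plus the kind of per-node threshold check an operator might apply. It is an
illustration only, not the patch: struct cache_stats and the three helper
functions are hypothetical stand-ins and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the few slab cache values discussed in this
 * thread. This is NOT the kernel's struct kmem_cache.
 */
struct cache_stats {
	uint32_t object_size;	/* payload bytes per object ("object_size") */
	uint32_t size;		/* bytes per object including metadata ("size") */
	uint64_t active_objs;	/* objects currently in use (slabinfo.active_objs) */
};

/* Bytes in use when only the object payload is counted. */
uint64_t occupancy_payload(const struct cache_stats *c)
{
	assert(c != NULL);
	return c->active_objs * (uint64_t)c->object_size;
}

/* Bytes in use when per-object metadata is counted as well. */
uint64_t occupancy_full(const struct cache_stats *c)
{
	assert(c != NULL);
	return c->active_objs * (uint64_t)c->size;
}

/* Operator-side check: true if the value exceeds the node's threshold. */
bool occupancy_abnormal(uint64_t bytes, uint64_t threshold)
{
	return bytes > threshold;
}
```

With made-up values of object_size = 232, size = 256 and active_objs = 1000,
the two helpers give 232000 and 256000 bytes respectively, so the gap grows
linearly with the number of active objects.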