Message-ID: <ef799587-f7c2-472a-8550-9c40a395eccb@arm.com>
Date: Mon, 23 Sep 2024 09:55:16 +0100
From: Steven Price <steven.price@....com>
To: Adrián Larumbe <adrian.larumbe@...labora.com>
Cc: Boris Brezillon <boris.brezillon@...labora.com>,
 Liviu Dudau <liviu.dudau@....com>,
 Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
 Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
 David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
 Sumit Semwal <sumit.semwal@...aro.org>,
 Christian König <christian.koenig@....com>,
 kernel@...labora.com, dri-devel@...ts.freedesktop.org,
 linux-kernel@...r.kernel.org, linux-media@...r.kernel.org,
 linaro-mm-sig@...ts.linaro.org
Subject: Re: [PATCH v6 1/5] drm/panthor: introduce job cycle and timestamp
 accounting

On 20/09/2024 23:36, Adrián Larumbe wrote:
> Hi Steve, thanks for the review.

Hi Adrián,

> I've applied all of your suggestions for the next patch series revision, so I'll
> only be answering your question about the calc_profiling_ringbuf_num_slots
> function further below.
> 

[...]

>>> @@ -3003,6 +3190,34 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
>>>  	.free_job = queue_free_job,
>>>  };
>>>  
>>> +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>>> +				       u32 cs_ringbuf_size)
>>> +{
>>> +	u32 min_profiled_job_instrs = U32_MAX;
>>> +	u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
>>> +
>>> +	/*
>>> +	 * We want to calculate the minimum size of a profiled job's CS,
>>> +	 * because profiled jobs need additional instructions for sampling
>>> +	 * performance metrics and might therefore take up more slots in
>>> +	 * the queue's ringbuffer. This means we might not need as many job
>>> +	 * slots for keeping track of their profiling information. What we
>>> +	 * need is the maximum number of slots we should allocate to this end,
>>> +	 * which matches the maximum number of profiled jobs we can place
>>> +	 * simultaneously in the queue's ring buffer.
>>> +	 * That has to be calculated separately for every single job profiling
>>> +	 * flag, but not for the case where job profiling is disabled, since
>>> +	 * unprofiled jobs don't need to keep track of this at all.
>>> +	 */
>>> +	for (u32 i = 0; i < last_flag; i++) {
>>> +		if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
>>> +			min_profiled_job_instrs =
>>> +				min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>>> +	}
>>> +
>>> +	return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
>>> +}
>>
>> I may be missing something, but is there a situation where this is
>> different to calc_job_credits(0)? AFAICT the infrastructure you've added
>> can only add extra instructions to the no-flags case - whereas this
>> implies you're thinking that instructions may also be removed (or replaced).
>>
>> Steve
> 
> Since we create a separate kernel BO to hold the profiling information slots, we
> need one that is able to accommodate as many slots as the maximum number of
> profiled jobs we can insert simultaneously into the queue's ring buffer. Because
> profiled jobs always take more instructions than unprofiled ones, we would
> usually need fewer slots than the number of unprofiled jobs we could insert at
> once in the ring buffer.
> 
> Because we represent profiling metrics with a bit mask, we need to test the
> size of the CS for every single metric enabled in isolation, since enabling more
> than one will always mean a bigger CS, and therefore fewer jobs tracked at once
> in the queue's ring buffer.
> 
> In our case, calling calc_job_credits(0) would simply tell us the number of
> instructions we need for a normal job with no profiled features enabled, which
> always requires fewer instructions than a profiled one, and would therefore
> imply more slots in the profiling info kernel BO. But we don't need to keep
> track of profiling numbers for unprofiled jobs, so there's no point in
> calculating this number.
> 
> At first I was simply allocating a profiling info kernel BO as big as the number
> of simultaneous unprofiled job slots in the ring queue, but Boris pointed out
> that since queue ringbuffers can be as big as 2GiB, a lot of this memory would
> be wasted: profiled jobs always take up more ring buffer slots because they hold
> more instructions, so fewer profiling slots are actually needed in said kernel BO.
> 
> The value of this approach will eventually manifest if we decide to keep track of
> more profiling metrics, since this code won't have to change at all, other than
> adding new profiling flags in the panthor_device_profiling_flags enum.

Thanks for the detailed explanation. I think what I was missing is that
the loop is checking each bit flag independently and *not* checking
calc_job_credits(0).
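Just to convince myself with some made-up numbers (the real instruction
counts live in calc_job_credits() and the real flags in the
panthor_device_profiling_flags enum, so all figures below are purely
illustrative):

	calc_job_credits(0)      -> 16 instrs (unprofiled job)
	calc_job_credits(BIT(0)) -> 24 instrs (e.g. cycle counting)
	calc_job_credits(BIT(1)) -> 20 instrs (e.g. timestamps)

	min_profiled_job_instrs = min(24, 20) = 20

	/* With a 1 MiB ring buffer: */
	slot_count = DIV_ROUND_UP(1 MiB, 20 * sizeof(u64)) = 6554

	/* vs. sizing for unprofiled jobs: 1 MiB / (16 * 8) = 8192 slots */

So the BO is sized for the smallest profiled job, which is still larger
than an unprofiled one, and hence needs fewer slots than a naive
"one per unprofiled job" sizing would give.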

The check for (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL) is probably what
confused me - that should be completely redundant. Or at least we need
something more intelligent if we have profiling bits which are not
mutually compatible.
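Something like this is what I had in mind - untested, and it only holds
as long as PANTHOR_DEVICE_PROFILING_ALL stays a contiguous mask of the
low bits, so that every i < last_flag is a valid flag:

	for (u32 i = 0; i < last_flag; i++)
		min_profiled_job_instrs =
			min(min_profiled_job_instrs, calc_job_credits(BIT(i)));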

I'm also not entirely sure that the amount of RAM saved is significant,
but you've already written the code so we might as well have the saving ;)
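To put a very rough (and, again, entirely hypothetical) number on it: with
the illustrative credit counts above and, say, a 16-byte
struct panthor_job_profiling_data, a 2 GiB ring buffer - the worst case
Boris mentioned - would need 2 GiB / 128 = 16M slots when sized for
unprofiled jobs versus 2 GiB / 160 = ~13.4M slots with this scheme, i.e.
roughly 50 MiB of BO memory saved per queue. For typical ring buffer sizes
the saving is correspondingly small.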

Thanks,
Steve

> Regards,
> Adrian
> 
>>> +
>>>  static struct panthor_queue *
>>>  group_create_queue(struct panthor_group *group,
>>>  		   const struct drm_panthor_queue_create *args)
>>> @@ -3056,9 +3271,35 @@ group_create_queue(struct panthor_group *group,
>>>  		goto err_free_queue;
>>>  	}
>>>  
>>> +	queue->profiling.slot_count =
>>> +		calc_profiling_ringbuf_num_slots(group->ptdev, args->ringbuf_size);
>>> +
>>> +	queue->profiling.slots =
>>> +		panthor_kernel_bo_create(group->ptdev, group->vm,
>>> +					 queue->profiling.slot_count *
>>> +					 sizeof(struct panthor_job_profiling_data),
>>> +					 DRM_PANTHOR_BO_NO_MMAP,
>>> +					 DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
>>> +					 DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
>>> +					 PANTHOR_VM_KERNEL_AUTO_VA);
>>> +
>>> +	if (IS_ERR(queue->profiling.slots)) {
>>> +		ret = PTR_ERR(queue->profiling.slots);
>>> +		goto err_free_queue;
>>> +	}
>>> +
>>> +	ret = panthor_kernel_bo_vmap(queue->profiling.slots);
>>> +	if (ret)
>>> +		goto err_free_queue;
>>> +
>>> +	/*
>>> +	 * The credit limit argument tells us the total number of instructions
>>> +	 * across all CS slots in the ringbuffer, with some jobs requiring
>>> +	 * twice as many as others, depending on their profiling status.
>>> +	 */
>>>  	ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
>>>  			     group->ptdev->scheduler->wq, 1,
>>> -			     args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
>>> +			     args->ringbuf_size / sizeof(u64),
>>>  			     0, msecs_to_jiffies(JOB_TIMEOUT_MS),
>>>  			     group->ptdev->reset.wq,
>>>  			     NULL, "panthor-queue", group->ptdev->base.dev);
>>> @@ -3354,6 +3595,7 @@ panthor_job_create(struct panthor_file *pfile,
>>>  {
>>>  	struct panthor_group_pool *gpool = pfile->groups;
>>>  	struct panthor_job *job;
>>> +	u32 credits;
>>>  	int ret;
>>>  
>>>  	if (qsubmit->pad)
>>> @@ -3407,9 +3649,16 @@ panthor_job_create(struct panthor_file *pfile,
>>>  		}
>>>  	}
>>>  
>>> +	job->profiling.mask = pfile->ptdev->profile_mask;
>>> +	credits = calc_job_credits(job->profiling.mask);
>>> +	if (credits == 0) {
>>> +		ret = -EINVAL;
>>> +		goto err_put_job;
>>> +	}
>>> +
>>>  	ret = drm_sched_job_init(&job->base,
>>>  				 &job->group->queues[job->queue_idx]->entity,
>>> -				 1, job->group);
>>> +				 credits, job->group);
>>>  	if (ret)
>>>  		goto err_put_job;
>>>  
> 

