Date:	Tue, 17 May 2016 13:05:48 +0100
From:	Dave Gordon <david.s.gordon@...el.com>
To:	Tvrtko Ursulin <tvrtko.ursulin@...ux.intel.com>,
	Chris Wilson <chris@...is-wilson.co.uk>
Cc:	linux-kernel@...r.kernel.org
Subject: Re: [Intel-gfx] [PATCH 3/3] Introduce & use new lightweight SGL
 iterators

On 17/05/16 11:34, Tvrtko Ursulin wrote:
>
> On 16/05/16 16:19, Dave Gordon wrote:
>> The existing for_each_sg_page() iterator is somewhat heavyweight, and is
>> limiting i915 driver performance in a few benchmarks. So here we
>> introduce somewhat lighter-weight iterators, primarily for use with GEM
>> objects or other cases where we need only deal with whole aligned pages.
>
> Interesting idea, if for nothing then for eliminating the dreaded
> st->nents of for_each_sg_page. :)
>
> Do you know which benchmarks it improves, and by how much?

I know nothing :)

But last time I posted some easy-to-use iterators, Chris Wilson said 
they didn't address his complaint, which was that the existing ones were 
too slow.

https://lkml.org/lkml/2016/5/9/325

> It generates more code for me but that seems to be by design, yes?
> Because what were previously calls to __sg_page_iter_next is now
> effectively inlined and instead there is only a call to sg_next, which
> __sg_page_iter_next would call anyway.

In the light of Chris' comments, I decided to try some versions that 
were mostly focussed on avoiding the excessive number of function calls 
in the old ones. If this still isn't fast enough, we could also try 
inlining sg_next().

>> Unlike the old iterator, the new iterators use an internal state
>> structure which is not intended to be accessed by the caller; instead
>> each takes as a parameter an output variable which is set before each
>> iteration. This makes them particularly simple to use :)
>>
>> One of the new iterators provides the caller with the DMA address of
>> each page in turn; the other provides the 'struct page' pointer required
>> by many memory management operations.
>>
>> Various uses of for_each_sg_page() are then converted to the new macros.
>>
>> Signed-off-by: Dave Gordon <david.s.gordon@...el.com>
>> Cc: Chris Wilson <chris@...is-wilson.co.uk>
>> Cc: linux-kernel@...r.kernel.org
>> ---
>>   drivers/gpu/drm/i915/i915_drv.h         | 62
>> +++++++++++++++++++++++++--
>>   drivers/gpu/drm/i915/i915_gem.c         | 20 ++++-----
>>   drivers/gpu/drm/i915/i915_gem_fence.c   | 14 +++---
>>   drivers/gpu/drm/i915/i915_gem_gtt.c     | 76
>> ++++++++++++++++-----------------
>>   drivers/gpu/drm/i915/i915_gem_userptr.c |  7 ++-
>>   5 files changed, 116 insertions(+), 63 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h
>> b/drivers/gpu/drm/i915/i915_drv.h
>> index 72f0b02..c0fc6aa 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -2125,6 +2125,10 @@ struct drm_i915_gem_object_ops {
>>   #define INTEL_FRONTBUFFER_ALL_MASK(pipe) \
>>       (0xff << (INTEL_FRONTBUFFER_BITS_PER_PIPE * (pipe)))
>>
>> +void i915_gem_track_fb(struct drm_i915_gem_object *old,
>> +               struct drm_i915_gem_object *new,
>> +               unsigned frontbuffer_bits);
>> +
>>   struct drm_i915_gem_object {
>>       struct drm_gem_object base;
>>
>> @@ -2256,9 +2260,61 @@ struct drm_i915_gem_object {
>>   };
>>   #define to_intel_bo(x) container_of(x, struct drm_i915_gem_object,
>> base)
>>
>> -void i915_gem_track_fb(struct drm_i915_gem_object *old,
>> -               struct drm_i915_gem_object *new,
>> -               unsigned frontbuffer_bits);
>> +/*
>> + * New optimised SGL iterator for i915 GEM objects
>> + */
>> +struct sgt_iter {
>> +    struct scatterlist *sgp;
>> +    union {
>> +        unsigned long pfn;
>> +        unsigned long dma;
>
> dma_addr_t
>
>> +    } ix;
>> +    unsigned int curr;
>> +    unsigned int max;
>
> I think since this counts in bytes it should match obj->base.size which
> is a size_t?

Could be, although I think no GEM object can be greater than 4G.

> Also, maybe the iterator would be more efficient if it counted in pages?
> Hm, probably makes no difference, but maybe worth exploring. Was just
> thinking about avoiding the need to convert byte position to page
> position in every iteration.

I tried that too; you still end up with byte/page conversions, just in 
different places. Anyway, it's only a single shift instruction.

>> +};
>> +
>> +/* Constructor for new iterator */
>> +static inline struct sgt_iter
>> +__sgt_iter(struct scatterlist *sgl, bool dma)
>> +{
>> +    struct sgt_iter s = { .sgp = sgl };
>> +
>> +    if (sgl) {
>> +        s.max = s.curr = sgl->offset;
>> +        s.max += sgl->length;
>> +        if (dma)
>> +            s.ix.dma = sg_dma_address(sgl);
>> +        else
>> +            s.ix.pfn = page_to_pfn(sg_page(sgl));
>
> I suppose we can rely on the compiler to optimize one branch away rather
> than having to provide two iterators.

Yes; the bool is always a compile-time const so the compiler only 
generates the relevant side of the branch.

>> +    }
>> +
>> +    return s;
>> +}
>> +
>> +/**
>> + * for_each_sgt_dma - iterate over the DMA addresses of the given
>> sg_table
>> + * @__dmap:    DMA address (output)
>> + * @__iter:    'struct sgt_iter' (iterator state, internal)
>> + * @__sgt:    sg_table to iterate over (input)
>> + */
>> +#define for_each_sgt_dma(__dmap, __iter, __sgt)                \
>> +    for ((__iter) = __sgt_iter((__sgt)->sgl, true);            \
>> +         ((__dmap) = (__iter).ix.dma + (__iter).curr);        \
>> +         (((__iter).curr += PAGE_SIZE) < (__iter).max) ||        \
>> +        ((__iter) = __sgt_iter(sg_next((__iter).sgp), true), 0))
>> +
>> +/**
>> + * for_each_sgt_page - iterate over the pages of the given sg_table
>> + * @__pp:    page pointer (output)
>> + * @__iter:    'struct sgt_iter' (iterator state, internal)
>> + * @__sgt:    sg_table to iterate over (input)
>> + */
>> +#define for_each_sgt_page(__pp, __iter, __sgt)                \
>> +    for ((__iter) = __sgt_iter((__sgt)->sgl, false);        \
>> +         ((__pp) = (__iter).ix.pfn == 0 ? NULL :            \
>> +        pfn_to_page((__iter).ix.pfn + ((__iter).curr >> PAGE_SHIFT)));\
>
> I couldn't figure out what the pfn == 0 check is for. The other loop has no
> such check, so one of them looks suspicious or I am missing something.

The other *does* have a similar check -- this is the loop-break 
condition :) But here we need to avoid passing the invalid pfn 0 to 
pfn_to_page(), whereas in the other one we're just doing an addition, so 
we can terminate on (ix.dma == curr == 0).

> Also I think to be worth it you would have to handle the offset as well.
> That way all usages of for_each_sg_page could be converted; without
> that, it leaves a mix of new and old.

Do you mean offset-within-page? That actually does work for DMA 
addresses (you get the same offset within each page), and it's ignored 
for page iteration.

Or do you mean starting-offset? That's really hard :( You can't find the 
n'th page without walking the entire list from the start. So I didn't 
write an iterator with a starting offset, and the few cases which need 
that can't easily be converted.

(I do have a "convenience" iterator with an offset as previously posted, 
but it's not a fast inlined version).

I'll try a couple more tweaks to this ...

.Dave.
