How do you improve optimization for integer calculations?

How do you improve optimization for integer calculations?

Nikita Zlobin
Hello.

I've just made an attempt at writing a serious application, considering the
language (C), the toolkit (gtk3, cairo, pango, etc.) and the fact that it is
an audio app. Although many audio projects use plain "float" for audio data,
I noticed some posts where a tension towards integer math appeared. While
that is not my case (my app is JACK-based, and audio data is floating-point
everywhere), I got slightly distracted from the audio side.

My app for now is a spectrum analyzer, which should later evolve into a
couple of spectral-editing helper utilities (more exactly, to generate a
spectrogram and apply changes either by resynthesis or by filtering with a
diff spectrogram).

I got distracted into experiments with graphics post-processing for instant
spectrum-view rendering. Due to cairo's nature, color data in a cairo image
surface can at best be argb32, rgb24 or rgb30 (for now I have implemented
support only for the first two). As a result, the post-processing is all done
in integer mode. I can understand this, since I noticed that the channel
order in memory depends on endianness, which suggests that cairo relies on
masks and bit shifts rather than byte-value arrays where possible.

I decided to look in the debugger at how gcc optimizes these ops, and
noticed the following. Below is a code chunk from kdbg with some asm blocks
expanded. There is only one line where I tried to use floating-point ops,
just for comparison (plus some context for better understanding).

            for (int l = 0; l < 3; l++)
                for (int c = 0; c < 3; c++)
                {
                    unsigned char
                        * p = bpix[l][c];

                    col[0] += p[0] ,
0x5b27 movzbl 0x60(%rsp),%esi
                    col[1] += p[1] ,
0x5a34 movzbl 0x65(%rsp),%eax
                    col[2] += p[2] ,
0x5a42 movzbl 0x62(%rsp),%edx
                    col[3] += p[3];
0x5a47 movzbl 0x67(%rsp),%esi
0x5a4c mov    0x10(%rsp),%rbp
                }
            col[0] /= 9.0, col[1] /= 9.0, col[2] /= 9.0, col[3] /= 9.0;
0x5a39 pxor   %xmm0,%xmm0
            op[0] += col[0] ,
0x5b8d add    %sil,0x0(%rbp)
            op[1] += col[1] ,
0x5b94 add    %cl,0x1(%rbp)
            op[2] += col[2] ,
0x5b97 add    %dl,0x2(%rbp)
            op[3] += col[3];
0x5b91 add    %al,0x3(%rbp)

            for (int l = 0; l < 3; l++)
                ip[l] += 3;
            op += ch_n;
        }

Notice how the line containing the chain of divisions is compiled to a
single SSE operation. What is interesting, this is the only asm line
involving xmm registers - I can't find anything else, such as a mov,
involving them.

With division by the integer 9 it is still one line, but no xmm register is
involved.

                    col[0] += p[0] ,
0x5b17 movzbl 0x64(%rsp),%eax
                    col[1] += p[1] ,
0x5a35 movzbl 0x65(%rsp),%eax
                    col[2] += p[2] ,
0x5a4a movzbl 0x7e(%rsp),%esi
                    col[3] += p[3];
0x5a4f movzbl 0x7f(%rsp),%ecx
                }
            col[0] /= 9, col[1] /= 9, col[2] /= 9, col[3] /= 9;
0x5a3a mov    $0x38e38e39,%r8d
            op[0] += col[0] ,
0x5b78 add    %dl,0x0(%rbp)
            op[1] += col[1] ,
0x5b3a add    %dil,0x1(%rbp)

As for the GCC options, I used the unusual combo "-march=native -O3 -g" in
order to get exhaustive optimization while still keeping the kdbg
disassembly readable enough to take a look.

While searching for info about SSE integer support, I was pointed to the
Wikipedia page about x86 instructions, where I found an almost sufficient
instruction set - logical ops, add, sub and mul, signed and unsigned, for
words and bytes (at least in SSE2, which according to the description let
the MMX instruction set use SSE registers; integer support was primary in
MMX).

So, I'm in a maze - why doesn't gcc use such ops in integer mode? Could it
be that what it did is better without SSE?

I'm about to eventually try more FP ops in this place, but I'm unsure about
the possible conversions, since the source and destination are cairo
surfaces with integer data anyway.
_______________________________________________
Linux-audio-dev mailing list
[hidden email]
https://lists.linuxaudio.org/listinfo/linux-audio-dev

Re: How do you improve optimization for integer calculations?

Maarten de Boer-3
Hi,


[…]
           col[0] /= 9.0, col[1] /= 9.0, col[2] /= 9.0, col[3] /= 9.0;
0x5a39 pxor   %xmm0,%xmm0
[…]

Notice how the line containing the chain of divisions is compiled to a
single SSE operation.

I don’t see any SSE operation here. The pxor is just to zero the xmm0 register.

It’s a bit difficult to know what you are doing here, not having context and not knowing the datatypes, but it does indeed look like this code could benefit from vectorisation, since you are doing calculations in blocks of 4. E.g. you can multiply 4 floating points in a single SSE instruction, add 4 floating points in a single SSE instruction, etc.

e.g.

__m128 factor = _mm_set_ps1 (1.0f/9.0f);
__m128 result = _mm_mul_ps (packed, factor);

would divide the 4 floats in packed by 9. (We could use _mm_div_ps , but multiplication is faster than division)


Depending on your data, it might be faster to stay in the floating point domain as long as possible to use SSE floating point operations, and convert to integer at the last moment.

If you do want/need to stay in the integer domain, note that there is no SIMD instruction for integer division, but you could use a multiplication here as well:

__m128i _mm_mulhi_epi16 (__m128i a, __m128i b)

multiplies the packed 16-bit integers in a and b (so 8 at the same time), producing intermediate 32-bit integers, and stores the high 16 bits of the intermediate integers in the result.

Taking the high 16 bits of the 32 bit intermediate result is effectively dividing by 65536. Since x/9 can be expressed (with some error) as x*7282/65536:

__m128i factor = _mm_set1_epi16 (7282);
__m128i result = _mm_mulhi_epi16(packed, factor)

Of course you would have to get your 8 bit integers (I assume) into/out of the packed 16 bit registers.
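Putting those pieces together, a sketch of the whole round trip might look like this (assuming SSE2; the function name is mine):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Divide 16 unsigned bytes by 9 at once: widen to 16-bit lanes,
   multiply by 7282 ~= 65536/9 keeping only the high 16 bits of each
   product, then narrow back to bytes. The small error in 7282/65536
   never crosses a floor boundary for inputs 0..255, so the result is
   exact in that range. */
static __m128i div9_epu8(__m128i bytes)
{
    const __m128i zero   = _mm_setzero_si128();
    const __m128i factor = _mm_set1_epi16(7282);

    __m128i lo = _mm_unpacklo_epi8(bytes, zero);  /* low 8 bytes -> uint16 */
    __m128i hi = _mm_unpackhi_epi8(bytes, zero);  /* high 8 bytes -> uint16 */
    lo = _mm_mulhi_epi16(lo, factor);             /* (x * 7282) >> 16 */
    hi = _mm_mulhi_epi16(hi, factor);
    return _mm_packus_epi16(lo, hi);              /* saturating pack to bytes */
}
```

A scalar loop comparing this against `x / 9` for all 256 byte values confirms the constant is safe in this range.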

That said, whether you want to do this kind of vectorisation by hand is a different matter. The compiler is pretty good at these kinds of optimisations. Make sure you pass the right flags to turn on SSE and AVX at the levels you want to support. But it certainly is possible to improve on what the compiler does. I have obtained significant speed boosts through rewriting inner loops with SSE intrinsics. But even if you choose to stay in C, having some knowledge of the SSE instruction set certainly might help.

Maarten


Re: How do you improve optimization for integer calculations?

Nikita Zlobin
On Sun, 7 Apr 2019 17:13:31 +0200
Maarten de Boer <[hidden email]> wrote:

> Hi,
>
>
> > […]
> >            col[0] /= 9.0, col[1] /= 9.0, col[2] /= 9.0, col[3] /=
> > 9.0; 0x5a39 pxor   %xmm0,%xmm0
> > […]
> >
> > Notice, how line, containing chain of divisions, is compiled to
> > single sse operation.  
>
> I don’t see any SSE operation here. The pxor is just to zero the xmm0
> register.
>
> It’s a bit difficult to know what you are doing here, not having
> context and not knowing the datatypes, but it does indeed look like
> this code could benefit from vectorisation, since you are doing
> calculation in blocks of 4. E.g. you can multiply 4 floating points
> in a single SSE instruction, add 4 floating points in a single SSE
> instructions, etc.
>
> e.g.
>
> __m128 factor = _mm_set_ps1 (1.0f/9.0f);
> __m128 result = _mm_mul_ps (packed, factor);
>
> would divide the 4 floats in packed by 9. (We could use _mm_div_ps ,
> but multiplication is faster than division)
>
> (See https://software.intel.com/sites/landingpage/IntrinsicsGuide/
> <https://software.intel.com/sites/landingpage/IntrinsicsGuide/> )
>
> Depending on your data, it might be faster to stay in the floating
> point domain as long as possible to use SSE floating point
> operations, and convert to integer at the last moment.
>
> If you do want/need to stay in the integer domain, note that there is
> no SIMD instruction for integer division, but you could use a
> multiplication here as well:
>
> __m128i _mm_mulhi_epi16 (__m128i a, __m128i b)
>
> multiplies the packed 16-bit integers in a and b (so 8 at the same
> time), producing intermediate 32-bit integers, and stores the high 16
> bits of the intermediate integers in the result.
>
> Taking the high 16 bits of the 32 bit intermediate result is
> effectively dividing by 65536. Since x/9 can be expressed (with some
> error) as x*7282/65536:
>
> __m128i factor = _mm_set1_epi16 (7282);
> __m128i result = _mm_mulhi_epi16(packed, factor)
>
> Of course you would have to get your 8 bit integers (I assume)
> into/out of the packed 16 bit registers.
>
> That said, whether you want to do this kind of vectorisation by hand
> is a different matter. The compiler is pretty good in doing these
> kind of optimisations. Make sure you pass the right flags to turn on
> SSE and AVX at the levels you want to support. But it certainly is
> possible to improve what the compiler does. I have obtained
> significant speed boosts through rewriting inner loops with SSE
> intrinsics. But even if you choose to stay in C, having some
> knowledge of the SSE instruction set certainly might help.
>
> Maarten


Thanks, Maarten.

Looks like you propose using Intel-specific intrinsics. I already looked
through the gcc docs hoping to find something similar, but only found the
vectorization section among the C extensions. I hoped for something
gcc-specific rather than Intel-specific. In earlier experiments I managed
to get pure SSE code while multiplying or adding elements of a fixed-size
array created as an automatic variable.

I really did not recognize that nasty trick of clearing xmm0 :).
Also, I now understand why SSE can't be used there: without integer
division support it is undoable with SSE - replacing the division with a
multiplication means conversion to float.

Yet, just as an experiment, I replaced this line and the 4 add lines with:
    op[0] = (col[1] + col[2]) / 2;
to see whether it would involve PAVGW or similar - it did not.

I can't really understand the meaning of that single pxor line without
other SSE stuff. But after I changed the gcc options to -O2, more
meaningful lines appeared where expected. Probably -O3 shuffled the code
too hard to represent it correctly in the debugger, even with -g :/ .

I'm now certain I will eventually try the FP way.

More about my app: I'm studying software engineering at university. It so
happened that right at the beginning of this course I made the final
decision to begin what I had been searching for long ago (these spectral
editing helpers), and during the session a new subject appeared where we
had to write an application relying on input/output (such as any GUI
program), so I dedicated my plan to this.
I'm still uncertain whether it is OK to publish it before it is defended;
otherwise it is ready for that.

As for the post-processing: I'm experimenting with subpixel rendering
(another weakness of mine :) ). I take 3x3 pixel blocks (so-called
super-pixels) and pass them through the complete chain. Probably 3x1 could
be good, if cairo has no problem rendering to surfaces with such a virtual
pixel ratio, but for now it is 3x3. In my sequence the image is split into
a grey minimum and a color remainder, with the grey part mixed down at the
subpixel level while the color part is simply pixelated; both are summed
into the destination. The code chunk I showed in the previous post is for
this remainder-averaging part.

With the current implementation, and all cairo rendering to the
3x-resolution surface commented out, it is about 30% faster than simple
downsampling with cairo itself (when the 3x surface is used directly as the
drawing source), and that is with the source and destination surfaces still
being created and destroyed on each draw() callback run (an issue I'm just
about to solve).

Re: How do you improve optimization for integer calculations?

Maarten de Boer-3

Looks like you propose using Intel-specific intrinsics. I already looked
through the gcc docs hoping to find something similar, but only found the
vectorization section among the C extensions. I hoped for something
gcc-specific rather than Intel-specific.

I am not sure if I understand what you mean, but Intel’s SSE intrinsics are well supported by gcc.


[…] Probably -O3 shuffled code too hard to
correctly represent in debugger even with -g :/ 

Instead of using the debugger to look at the assembly, you could use objdump -S -l on the object file

-S, --source     Intermix source code with disassembly
-l, --line-numbers     Include line numbers and filenames in output

Good luck.

Maarten


Re: How do you improve optimization for integer calculations?

Nikita Zlobin
On Sun, 7 Apr 2019 22:27:34 +0200
Maarten de Boer <[hidden email]> wrote:

> > Looks like you propose to use intel-specific intrinsics. I already
> > looked gcc docs in hope to find something similar, but only found
> > vectorization section, from C extensions. Hoped to see something
> > probably gcc-specific, but not Intel-specific.  
>
> I am not sure if I understand what you mean, but Intel’s SSE
> intrinsics are well supported by gcc.
>
> This might be a good read:
> https://www.it.uu.se/edu/course/homepage/hpb/vt12/lab4.pdf
> <https://www.it.uu.se/edu/course/homepage/hpb/vt12/lab4.pdf>
>
> > […] Probably -O3 shuffled code too hard to
> > correctly represent in debugger even with -g :/  
>
> Instead of using the debugger to look at the assembly, you could use
> objdump -S -l on the object file
>
> -S, --source     Intermix source code with disassembly
> -l, --line-numbers     Include line numbers and filenames in output
>
> Good luck.
>
> Maarten

Thanks!
I tried the gcc option -save-temps just before objdump, but the second,
with the given options, seems to be way more useful :).

Re: How do you improve optimization for integer calculations?

Nikita Zlobin
In reply to this post by Nikita Zlobin
On Sun, 7 Apr 2019 23:06:45 +0500
Nikita Zlobin <[hidden email]> wrote:

> I really did not recognize that nasty trick, clearing xmm0 :).
> Also i understood, why SSE can't be used there. Without integer
> division support it is undoable with SSE - replacing with
> multiplication means conversion to float.
>

I recently discovered the fast integer division algorithm that accelerates
multiple divisions by the same divisor. I got it working this way, but then
discovered that gcc already uses this method, so it is still doable with
SSE. Though, on the other hand, I still can't find enough places where the
benefit of working with colors as single integers rather than as separate
color values would be meaningful... One such place is the accumulator used
for averaging. While the input is uint8_t[4], the accumulator is
uint16_t[4]. I have to either work with them element by element or use
masks, bit shifts and OR for each element... just to prepare a single value
and store it (either as uint32_t[2] or as one uint64_t).

It looks like benchmarks are necessary, along with these intrinsics, to
test whether integer SSE is really better than what gcc proposes.
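For the record, the 0x38e38e39 constant from the disassembly in my first post is exactly this method; in scalar C it amounts to (the wrapper name is my own):

```c
#include <stdint.h>

/* Unsigned division by 9 without a div instruction: multiply by
   ceil(2^33 / 9) = 0x38e38e39 and keep bits 33 and up of the 64-bit
   product. This is the same constant gcc emitted for `col[i] /= 9`,
   and it is exact for every 32-bit unsigned x. */
static uint32_t div9_u32(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0x38e38e39u) >> 33);
}
```

On x86 the compiler expresses this as a widening mul followed by a shift of the high half, which is why only the constant load shows up on the source line.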

Re: How do you improve optimization for integer calculations?

Nikita Zlobin
In reply to this post by Maarten de Boer-3
On Sun, 7 Apr 2019 22:27:34 +0200
Maarten de Boer <[hidden email]> wrote:

> > Looks like you propose to use intel-specific intrinsics. I already
> > looked gcc docs in hope to find something similar, but only found
> > vectorization section, from C extensions. Hoped to see something
> > probably gcc-specific, but not Intel-specific.  
>
> I am not sure if I understand what you mean, but Intel’s SSE
> intrinsics are well supported by gcc.
>
> This might be a good read:
> https://www.it.uu.se/edu/course/homepage/hpb/vt12/lab4.pdf
> <https://www.it.uu.se/edu/course/homepage/hpb/vt12/lab4.pdf>
>
> > […] Probably -O3 shuffled code too hard to
> > correctly represent in debugger even with -g :/  
>
> Instead of using the debugger to look at the assembly, you could use
> objdump -S -l on the object file
>
> -S, --source     Intermix source code with disassembly
> -l, --line-numbers     Include line numbers and filenames in output
>
> Good luck.
>
> Maarten
>


Thanks for the advice. It is probably time for some feedback about the
solution.

While it is not about audio, I hope it will be useful here too, as it
concerns more general gcc SIMD usage. In two words: I found a way to get
packed SIMD without using intrinsics.

I casually found the SIMD intrinsics in 'info gcc', under C extensions /
Target builtins, with mmx/sse among the x86 builtins (__builtin_ia32_paddb
and such). However, I hoped for a less restrictive solution. And this is
when I understood the real meaning of the gcc vector extension
__attribute__(( vector_size (2^n) )).

After I used this, I got the packed instructions I had so wanted, for both
the float and int versions of the code. What is interesting, the float
variant before this improvement was 2-3 times slower than the int-based
variant (I'm in a maze as to why that isolated pxor xmm0,xmm0 instruction
would ever appear without other SIMD code). After rewriting the float
variant with the vector extension, it was barely close to the unmodified
int variant in performance :) .

The int variant also got some SIMD ops (packed only - there are no scalar
SSE ops for int, probably because they would make no sense). And it is
still ahead of the FP variant :) . It only seems that I'm now stuck at a
data-transfer-speed bottleneck.

For an exact example, my code is like this:

/**************/
typedef uint16_t v8u16 __attribute__(( vector_size(16) ));

v8u16
        vect1 = {...some uint8_t values...},
        vect2 = {............same.........},
        vect3 = {...};
vect1 = (vect1 + vect2 + vect3) / 3;

/** writing vect1 elements back to dynamic memory **/

Each value in a vector is filled from a byte value loaded from dynamic
memory. The speedup is noticeable even when using only half of the vector's
capacity, and it is still a tiny bit better than just using the MMX
4*uint16 way. But when trying to use the full capacity, the extra speedup
is negligible. My CPU is an Intel B950, Sandy Bridge. I'm pretty sure that
without this bottleneck, or with more complex processing, the speedup would
be more notable.

Another factor here appears to be the dynamic memory access order. The
above example is an almost exact excerpt from my code, with some omissions
to save space. Data from dynamic memory are processed in blocks of 3*4
uint8_t (3 ARGB32 pixels). Each block is distributed between all 3 vectors
so that it occupies exactly 4 matching elements in each, allowing 2 blocks
to be loaded into 3 x 128-bit vectors. But if I try to do it in a single
step during initialization, each vector has to jump between the 2 blocks
during init (it could be more blocks for avx2 and avx512, though I lack
them). This reduces performance even below what it was before vectorizing
the code.

The only way to avoid losing performance seems to be half-initialization,
with a following per-element init like:
        vect1[4] = , vect1[5] = ,
and so on.

What is really good about these vectors is that they are safe to use even
if the target arch doesn't support the necessary vector size or has no
SIMD at all. According to 'info gcc':

> Specifying a combination that is not valid for the current
> architecture causes GCC to synthesize the instructions using a
> narrower mode.  For example, if you specify a variable of type `V4SI'
> and your architecture does not allow for this specific SIMD type, GCC
> produces code that uses 4 `SIs'.

So, theoretically, it is possible to engage vector size 64 (avx512) and
still expect it to work on lesser architectures. The only issue: I can't
tell whether it decreases performance on those, compared to dedicated
vector sizes or vectorless code. Sometimes I got a slowdown, but sometimes
it was just fine. I'll probably need more tests, maybe even finally trying
some intrinsics :) (I hoped to avoid them, but why not just try).