Clive McCarthy
January 16th, 2012, 11:59 PM
I'm trying some simple experiments using gcc and SSE intrinsics to speed up a quite conventional image filter convolution.
I compile with: -g -march=pentium4 -Wall -Wextra -O3 -mmmx -msse -msse2 -msse3 -mssse3 -mfpmath=sse -mstackrealign -D_REENTRANT -std=gnu89
I have found that my first attempt at using SSE is significantly slower than conventional coding. Here's a couple of code fragments from the very heart of the convolver:
/* my SSE SIMD method */
pixel[0] = kr; pixel[1] = kg; pixel[2] = kb; pixel[3] = ALPHA_OPAQUE;
//Load 4 floats from a 16-byte aligned address.
pixel_m128 = _mm_load_ps(pixel);
// Load 1 float into all 4 fields of an __m128
weight_m128 = _mm_load1_ps(&weight);
// Multiply corresponding floats
prod_m128 = _mm_mul_ps(weight_m128, pixel_m128);
// Add corresponding floats
sum_m128 = _mm_add_ps(prod_m128, sum_m128);
/* my conventional method */
r_sum += kr * weight; g_sum += kg * weight; b_sum += kb * weight;
I wasn't expecting the SSE result to be 25% slower than my conventional method. Am I missing something? Is the performance hit all from the type conversions to and from the SSE registers? This is my first attempt at using SSE intrinsics so I could be doing something trivially stupid. Help please.
I compile with: -g -march=pentium4 -Wall -Wextra -O3 -mmmx -msse -msse2 -msse3 -mssse3 -mfpmath=sse -mstackrealign -D_REENTRANT -std=gnu89
I have found that my first attempt at using SSE is significantly slower than conventional coding. Here's a couple of code fragments from the very heart of the convolver:
/* my SSE SIMD method */
pixel[0] = kr; pixel[1] = kg; pixel[2] = kb; pixel[3] = ALPHA_OPAQUE;
//Load 4 floats from a 16-byte aligned address.
pixel_m128 = _mm_load_ps(pixel);
// Load 1 float into all 4 fields of an __m128
weight_m128 = _mm_load1_ps(&weight);
// Multiply corresponding floats
prod_m128 = _mm_mul_ps(weight_m128, pixel_m128);
// Add corresponding floats
sum_m128 = _mm_add_ps(prod_m128, sum_m128);
/* my conventional method */
r_sum += kr * weight; g_sum += kg * weight; b_sum += kb * weight;
I wasn't expecting the SSE result to be 25% slower than my conventional method. Am I missing something? Is the performance hit all from the type conversions to and from the SSE registers? This is my first attempt at using SSE intrinsics so I could be doing something trivially stupid. Help please.