All,
Background: Some code I've been writing required a lightweight matrix/vector class with some basic linear algebra. After writing up something quick, I decided to compare the running time of my matrix-matrix multiply with Boost's ublas library. My code was slower and I was confused why as matrix-matrix multiplication is fairly straight forward. A colleague of mine suggested replacing the for loops with corresponding decrementing while loops. To my shock, using while loops improved my running time and it was as fast as Boost's. This is confusing to me, as I would have expected g++ to produce the same assembly for such a straight forward segment of code.
My friend didn't know why decrementing while loops were faster than incrementing for loops, but in all the vision code he works with he noticed that while loops were always used in preference to for loops. I did some quick googling, and the best I could come up with was that most CPUs tend to have better branching tools for >= 0 tests. But replacing the incrementing for loops with decrementing for loops still did not net the gain in performance found when using decrementing while loops.
Questions:
1. Does anyone here have a guess as to why g++ does not produce the same assembly code for the corresponding methods?
2. Why is the decrementing while loop implementation much faster than the incrementing for loops (at least for the first run)?
3. The subsequent improvements in running time for the FOR loop implementation when calculating an average time over 20 runs is probably due to caching; but the running time eventually overtakes that of the while loop! Why is the performance increase in the FOR loop over subsequent calls so large, eventually beating the while loop?
The methods:
Code:
Matrix<T> mult_FOR(const Matrix<T>& rhs) const{
assert(nCol_ == rhs.size1() );
size_t rhs_size2 = rhs.size2();
Matrix<T> theMatrix(nRow_, rhs_size2);
size_t dataLoc;
for (size_t r=0; r < nRow_; ++r) {
for (size_t c=0 ; c < rhs_size2; ++c) {
T& x = theMatrix(r, c);
dataLoc = nCol_*r;
for (size_t i=0; i < nCol_; ++i) {
x += data_[dataLoc++] * rhs(i,c);
}
}
}
return theMatrix;
}
Code:
Matrix<T> mult_WHILE(const Matrix<T>& rhs) const{
assert(nCol_ == rhs.size1() );
int rhs_size2 = rhs.size2();
Matrix<T> theMatrix(nRow_, rhs_size2);
int dataLoc;
int r, c, i;
r = nRow_ -1;
do{
c = rhs_size2 -1;
do{
T& x = theMatrix(r, c);
i = nCol_ -1;
dataLoc = nCol_*r + i;
do{
x += data_[dataLoc--] * rhs(i,c);
--i;
} while(i >= 0);
--c;
} while (c >= 0);
--r;
} while(r >=0);
return theMatrix;
}
and execution:
Code:
g++ -O2 Post.cpp
./a.out
First time for mult_FOR (seconds): 2.76
Average time for mult_FOR (seconds): 2.0005
Deviation for mult_FOR (seconds): 0.81
First time for mult_WHILE (seconds): 2.28
Average time for mult_WHILE (seconds): 2.27
Deviation for mult_WHILE (seconds): 0.02
If anyone wants the entire file, I have attached it (I had to name it Post.txt as it would not let me upload Post.cpp). here: Post.txt
Bookmarks