Tuesday, December 25, 2007

TBB vs. OMP


I was reading this blog post at Intel and I decided to make my own stupid parallel_for benchmark.
The idea is to run a not-so-simple loop in parallel, so I made the work for each iteration dependent on the index:
for(i=0;i<N;i++)
  for(j=0;j<i*i;j++)
    b[i] += j;
The results are quite interesting (IMHO). I think that because OpenMP doesn't have task stealing, it wastes a lot of time waiting for the other thread to finish its chunk of work. (At least, that's what my Activity Monitor shows: just one core in use most of the time for large i with OMP.)
I would like to test this with more cores to see how it scales.
If you want to test it yourself, here's the code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/tick_count.h"

using namespace tbb;
using namespace std;

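// Loop body for tbb::parallel_for: per-iteration work grows as i*i,
// so the chunks of the range are deliberately unbalanced.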
class ApplyFoo {
  float *const my_a;
public:
  void operator() ( const blocked_range<size_t>& r ) const {
    float *a = my_a;
    for( size_t i=r.begin(); i!=r.end(); i++ ){
      for( size_t j=0; j!=i*i; j++ )
        a[i] += j;
    }
  }
  ApplyFoo( float a[] ) :
    my_a(a)
  {}
};

int main(int argc, char *argv[]) {
 task_scheduler_init init;
 const int N = 8000;
 float a[N] = {0}, b[N] = {0};
 int i,j; 
 tbb::tick_count t0 = tbb::tick_count::now();
 parallel_for(blocked_range<size_t>(0, N),
     ApplyFoo( a ), auto_partitioner() );
 tbb::tick_count t1 = tbb::tick_count::now();
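 // Same loop, parallelized with a single OpenMP pragma.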
#pragma omp parallel for private (i,j) 
 for(i=0;i<N;i++){
   for(j=0;j!=i*i;j++)
     b[i] += j;
 }
 tbb::tick_count t2 = tbb::tick_count::now();
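 // Same loop again, run serially as a baseline.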
 for(i=0;i<N;i++){
   for(j=0;j!=i*i;j++)
     b[i] += j;
 }
 tbb::tick_count t3 = tbb::tick_count::now();
 printf("TBB %g seconds\n",(t1 - t0).seconds());
 printf("OpenMP %g seconds\n",(t2 - t1).seconds());
 printf("Serial %g seconds\n",(t3 - t2).seconds());
 printf("%d %g %g %g\n",N,(t1 - t0).seconds(), (t2 - t1).seconds(), (t3 - t2).seconds());


 return 0;
}
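To build it, something like "g++ -O2 -fopenmp bench.cpp -ltbb" should work, assuming g++ and a TBB install the compiler can find; the file name and exact flags here are just an example.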

1 comment:

Unknown said...

The problem with your program, as you correctly pointed out, is poor load balancing. You are parallelizing the loop over i, and the work increases steadily as a function of i, so the thread given the last block of iterations takes longer ... hence the poor load balancing. OpenMP doesn't have work stealing, but we do have a schedule clause that you can use to address this situation. You would add the clause to the "omp parallel for". Try "schedule(static, 100)" or perhaps even "schedule(dynamic, 100)". You may want to play around with the chunk size a bit, or compute it relative to the size of N: "schedule(static, N/100)". Adding this simple clause would dramatically improve the OpenMP performance.

But performance doesn't matter. I'm much more interested in programmer effort. With OpenMP, you basically add one simple pragma. In my mind, that's pretty cool.
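For reference, here is what the commenter's suggestion would look like applied to the loop above (a sketch; the chunk size of 100 is just a starting point and worth tuning):

 // With schedule(dynamic, 100), idle threads grab the next chunk of 100
 // iterations as they finish, so the cheap early iterations and the
 // expensive late ones get spread across the cores.
#pragma omp parallel for private (i,j) schedule(dynamic, 100)
 for(i=0;i<N;i++){
   for(j=0;j!=i*i;j++)
     b[i] += j;
 }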