So, I have this comment by timothy_g telling me to add schedule(dynamic,100) to the pragma of my code to increase performance.
I did it, and WOW, OpenMP is almost half the time of serial version (e.g. for N=3000, TBB 15.8937 s, OpenMP 10.2118 s, Serial 18.9615 s).
The "problem" is the poor load balance of my code, so I must tell to OMP to change its default schedule and obtain dynamically the next chunk of work once finished the current. Pretty nice.
As I see this,in order of making OMP faster I have to know and say more. Roland Barthes said once that "La langue, comme performance de tout langage, n'est ni réactionnaire, ni progressiste ; elle est tout simplement : fasciste ; car le fascisme, ce n'est pas d'empêcher de dire, c'est d'obliger à dire."
What I like of this quote is that the fascism is to bind to say.
For me, and in this contex, OpenMP is more fascist than TBB, because you must say more. And don't matter if is just a line of code, the semantics of what you say is the important, not the syntax.
So, here is the version 2 of the code (I changed to an auto_partitioner() in TBB in order of say less):
parallel_for(blocked_range(0, N), ApplyFoo( a ), auto_partitioner() );
#pragma omp parallel for private (i,j) schedule(static, 100) for(i=0;i<=n;i++){ for(j=0;j!=i*i;j++) b[i] += j; }
3 comments:
There is a problem in your code that affects correctness of the measurements. You allocated arrays on stack, and then you increase array elements with +=; and you did not care about initializing them. However you should care, despite that the arrays are just temporal storages. An uninitialized floating-point value can be NaN; when you add something to it the result also NaN, so you repeatedly do NaN calculations. The problem is that it can be _extremely_ slow comparing to regular FP calculations - thus it slows down your test, or parts of it.
To avoid the issue, either make the arrays global so that they are zeroed, or initialize every element to some value before timing.
Your test should have nearly 2x speedup on a dual core CPU with both TBB and OpenMP, assuming grainsize and scheduling policy are set properly.
Well well well......
Post a Comment