Help me understand this optimization failure.
Feb 17th, 2009, 7:35pm
 
Ok, this is just weird, and I don't know what to make of it.  I've got some code that does a simple least-squares fit of a line against a set of pixels.  The points are fixed (x,y) locations within an image, and the pixels are the output of an edge detection function: black where there are no edges, bright pixels at the detected edges.

The goal is to be able to estimate the slope of the edge at location (x,y).

So what I do is to sample some angles from 0 to 180 degrees, and for each one generate the coefficients of a line passing through that point at that angle (the ax+by+c=0 form of a line).  Then, examine a rectangular box of pixels around the (x,y) location.  For each pixel within the box, if it is brighter than some cutoff value, then calculate the squared error of that point to the line under consideration.  So we have:

loop over angles {
 loop over y-values in the box {
   loop over x-values in the box {
     get brightness of pixel;
     if greater than cutoff {
       calculate and accumulate the squared distance to the line of that pixel;
     }
   }
 }
}

It's a lot of nested looping, so I thought I'd try to make it faster.  Particularly, since the set of pixels that are brighter than the cutoff is fixed for a given (x,y) location, it seems silly to loop over that box of pixels and test each one with the brightness() function every time.  Why not find those first and put them in an array, then run the calculation against that list of points rather than against the pixels directly?

On paper, it sounds great.  Loop over the pixels in the box only once, instead of once for every angle I consider (which, since I sample the angle every 5 degrees, is 36 times).

Except that doing this makes the code run about six times slower.  Here's the code as it stands now:

// Estimate the tangent angle of the edge at pixel x,y; returns an angle in radians.
// (x,y) is the location for which to estimate the slope, e is the image containing
// the results of the edge detection algorithm.
float estimateTangent(int x, int y, PImage e)
{
 int boxRadius = 5;          // search an 11x11 (11 = 2*5+1) box around x,y
 int xmin, xmax, ymin, ymax; // the bounding coordinates of the box (may be smaller than 11x11 if the box falls off the edge)
 float theta, bestTheta = 0.0;
 float error, bestError;
 float dSquared;

 xmin = max(0, x-boxRadius);
 ymin = max(0, y-boxRadius);
 xmax = min(e.width-1, x+boxRadius);
 ymax = min(e.height-1, y+boxRadius);

 // Find and count the bright pixels in the box around (x,y):
 // To avoid having to loop over the box once for each theta estimation, we'll make lists
 // of the coordinates of the bright points within the box and then work off of those.
 // To avoid having to count the points prior to creating these arrays, we'll just size the
 // arrays to as many pixels as could possibly be in the box.
 int[] px, py;
 int numPoints = 0;
 px = new int[(2*boxRadius+1) * (2*boxRadius+1)];
 py = new int[(2*boxRadius+1) * (2*boxRadius+1)];
 for(int row = ymin; row <= ymax; row++)
   for(int col = xmin; col <= xmax; col++)
     if(brightness(e.pixels[row*e.width + col]) >= cutoff)
     {
       px[numPoints]   = col;
       py[numPoints++] = row;
     }
 
 bestError = float(boxRadius*boxRadius * (boxRadius*2+1) * (boxRadius*2+1));
 for(theta = 0; theta < PI; theta += 5.0 / 57.2957795)  // step by 5 degrees, converted to radians
 {
   // calculate values of a, b, c for the above formula, given angle theta and point x,y.
   // m is the slope of the line in y=mx+i (slope-intercept) form, from which we derive
   // the a, b, c coefficients of the ax+by+c=0 form of the line, which is needed for
   // the perpendicular distance formula d(x,y) = |ax+by+c| / sqrt(a^2+b^2).
   float m = -sin(theta) / cos(theta); // Use -sin() because our coordinates have y increasing down the image.
   float i = y-m*x;                    // this is the y-intercept of the line
   float a = m,  // these variables are just aliases for code clarity.
         b = -1, // obviously, we could use m and i directly.
         c = i;
   // Loop over the bright pixels; for each one, calculate its distance squared and add that to the cumulative error.
   // Some algebra on the perpendicular distance formula yields: d^2(x,y) = (ax+by+c)^2 / (a^2 + b^2).  The denominator of that
   // result is constant, so we'll pre-calculate that.
   error = 0.0;
   float denominator = a*a+b*b;
   for(int j = 0; j < numPoints; j++)
     error += (a*px[j] + b*py[j] + c)*(a*px[j] + b*py[j] + c) / denominator;
   
   if(error < bestError) // best one so far?  If so, remember it.
   {
     bestError = error;
     bestTheta = theta;
   }
 }  
 return bestTheta;
}

The version before I tried this optimization eliminated all the stuff about creating and filling the px[] and py[] arrays, and instead had this:

 for(int row = ymin; row <= ymax; row++)
   for(int col = xmin; col <= xmax; col++)
     if(brightness(e.pixels[row*e.width + col]) >= cutoff)
       error += (a*col + b*row + c)*(a*col + b*row + c) / denominator;

instead of:

   for(int j = 0; j < numPoints; j++)
     error += (a*px[j] + b*py[j] + c)*(a*px[j] + b*py[j] + c) / denominator;

So there you have it.  Any idea why my "optimized" code is six times slower than the naive implementation which does lots of extra looping and works directly off the pixels in PImage e?
Re: Help me understand this optimization failure.
Reply #1 - Feb 19th, 2009, 10:58am
 
Optimization is a delicate art, half based on knowledge of the CPU and the VM plus experience, half experimental and made of trial and error... :-) Of course, a good profiler can help a lot too (to find where the time is actually spent).

Anyway, a possible culprit here is the garbage collector: for each pixel you process, you create two arrays of 121 entries each. Java has to garbage collect these arrays once you leave the method, although collection can be deferred until enough garbage has accumulated.
Even though it isn't very nice, I would try externalizing boxRadius, px and py and allocating these arrays only once. Since you only use the part between 0 and numPoints, you don't care about the previous values.
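A minimal sketch of what I mean, keeping your names, with px and py hoisted to sketch level:

// Sketch-level declarations: allocated once and reused by every call to estimateTangent().
int boxRadius = 5;
int[] px = new int[(2*boxRadius+1) * (2*boxRadius+1)];
int[] py = new int[(2*boxRadius+1) * (2*boxRadius+1)];

// Inside estimateTangent(), drop the declarations and the two "new int[...]" lines and keep only
//   int numPoints = 0;
// Anything left in px/py beyond numPoints is stale data from an earlier call, but it is never read.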

Another minor optimization, which might have no influence but is worth trying, is to make a sqr() function (just return x*x).
Why, when it adds function call overhead? Because it forces the argument to be computed only once: an expression like a*px[j] + b*py[j] + c involves two array accesses, two multiplications and two additions. Computing it once (perhaps with a temp variable if not using a function) might be beneficial. Or not.
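Something like this, to be concrete (a minimal sketch; sqr is whatever name you like):

float sqr(float v)
{
 return v*v;  // the expression passed in is evaluated once, then squared
}

// ...so the inner-loop line becomes:
//   error += sqr(a*px[j] + b*py[j] + c) / denominator;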

Of course, do these as separate experiments, to see the influence of each.
Re: Help me understand this optimization failure.
Reply #2 - Feb 19th, 2009, 6:50pm
 
Yeah, it is a delicate art, and I'm new enough to Java that I don't know a lot about how the VM is going to deal with some of this stuff.

I did try externalizing the arrays and including a check to only initialize them once.  It made zero difference.

Also, on the theory that perhaps it was a cache coherency issue (using separate arrays for the x and y values could end up causing a lot of cache misses), I did this total hack using one array that was twice as long, with the x values in the even-numbered indexes and the y values in the odd-numbered indexes.
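Concretely, the indexing went along these lines (pxy being the doubled array):

 // One interleaved array instead of two parallel ones:
 int[] pxy = new int[2 * (2*boxRadius+1) * (2*boxRadius+1)];
 // ...filled with
 //   pxy[2*numPoints]   = col;
 //   pxy[2*numPoints+1] = row;
 //   numPoints++;
 // ...and read back in the angle loop with
 //   error += (a*pxy[2*j] + b*pxy[2*j+1] + c)*(a*pxy[2*j] + b*pxy[2*j+1] + c) / denominator;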

Again, zero difference.

For the time being, I've reverted the code back to the non-optimized version (hey, if the brute-force solution is six times faster, why not use it?). I'm just mightily curious as to why the optimization didn't work.

Can you suggest some profiling/debugging tools that might help?  As I say, I'm very new to Java.  Been programming everything else under the sun since I was a kid, just not Java.

Also odd: the suggestion to calculate the ax+by+c term only once also results in a slight slow-down, taking the code from 6723.5 lines per second to 6632.2. (That's overall program speed, though, including all other aspects of the program, so the difference would be larger if measured only against the least-squares estimation function.) I hadn't done so originally because modern compilers are supposed to handle optimizations like that (common sub-expression elimination) automatically, and I was presuming the JVM was at least good enough to do that. Apparently it is, and trying to do its work for it just messes it up somehow.
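For reference, the once-per-point variant of the inner loop looked roughly like this (d is just a local temporary):

    for(int j = 0; j < numPoints; j++)
    {
      float d = a*px[j] + b*py[j] + c;  // compute the ax+by+c term once per point
      error += d*d / denominator;
    }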

Very weird.  Makes me want to not try to optimize my code at all.
Re: Help me understand this optimization failure.
Reply #3 - Feb 19th, 2009, 7:17pm
 
In short, yes, micro-optimizations are often useless because modern compilers, particularly Java's, do a lot of optimizing on their own.
And indeed, I fail to see why the brute force solution is faster...
Re: Help me understand this optimization failure.
Reply #4 - Feb 19th, 2009, 11:18pm
 
I give up on the brute force thing.

But it did occur to me that there's no reason to calculate the brightness() values dynamically.  I replaced the direct comparisons of the cutoff value against the brightness of each pixel with comparisons against a one-dimensional array of pre-computed brightness values.  Basically, I just pre-compute the brightness of each pixel in the edge-detect results, and work against that instead of the PImage with the edges.
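In sketch form, it's something like this (edgeBrightness is just the name I'll use here; it gets rebuilt whenever the edge image changes):

float[] edgeBrightness;  // one pre-computed brightness value per pixel of the edge-detect image

void precomputeBrightness(PImage e)
{
 e.loadPixels();
 edgeBrightness = new float[e.pixels.length];
 for(int i = 0; i < e.pixels.length; i++)
   edgeBrightness[i] = brightness(e.pixels[i]);
}

// The test inside estimateTangent() then becomes:
//   if(edgeBrightness[row*e.width + col] >= cutoff)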

Result:  230% speedup.  Woo hoo!

Lesson: brightness() is slow.  Knowing how they work under the covers, I would expect hue() and saturation() to be similarly slow for anyone else who deals a lot with those functions.