Efficiently working with large volumes of data
May 25th, 2010, 5:53am
 
I'm working on an application that takes input from very large CSV files (up to several million rows, many columns). The data is usually numeric (floats), but may have portions that are NaN. Whilst I can generally deal with the data reasonably well, some timing investigation reveals two areas of significant performance loss that I'd really like suggestions on improving:

Casting string values to floats
I'm using something like:
Code:
Float[] pieces = float(trim(lines[i].split(",|\\t"))); 


The conversion to float seems to take an inordinate amount of time. I've tried double and other parsing methods (Float.parseFloat()), but can't seem to get the time down.

Catching NaN values
My data may contain NaN values at various points. I need to perform certain functions on the data such as max(), min(), etc., but if a NaN value is passed to them, the result is NaN. So I need to check for NaN before passing each value to max() or min(), which I'm doing manually (not accessing the array directly).

Using Float.isNaN(value) works, but seems to take ages, particularly over several million entries.

I currently have a workaround where, at import, I catch NaN values once and override them, setting them to 98.7654. I then catch this value with a simple == check before passing to max() or min(). This works and is really fast, but seems a very ugly implementation. Suggestions?
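In code, the workaround looks roughly like this (sketch only; row stands for a float[] holding one parsed line):
Code:
// At import: replace NaN entries once with a sentinel value
for (int j = 0; j < row.length; j++) {
  if (Float.isNaN(row[j])) row[j] = 98.7654f;
}

// Later: skip sentinel entries before using max()/min()
float biggest = -Float.MAX_VALUE;
for (int j = 0; j < row.length; j++) {
  if (row[j] == 98.7654f) continue;  // treat as missing
  biggest = max(biggest, row[j]);
}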

Any help greatly appreciated.
Re: Efficiently working with large volumes of data
Reply #1 - May 25th, 2010, 6:29am
 
Quote:
I currently have a workaround where, at import, I catch NaN values once and override them, setting them to 98.7654. I then catch this value with a simple == check before passing to max() or min(). This works and is really fast, but seems a very ugly implementation. Suggestions?

A slightly-less-ugly way to do it is to use the constant Float.MAX_VALUE as your "NaN", which is even less likely than 98.7654 to be part of your data set.  (It would probably also make it a little more obvious what you're doing to someone reading your code, if that's a concern.)

Beyond that, if you really need to speed up the import process, your best bet is probably to write your own parsing/conversion method to go from a string to a float, and write it so that it takes advantage of any assumptions you can make about your data.  Doing this will give you a routine that's less flexible, but probably much faster for your data set.

For example:  

The standard conversion routines are written to handle numbers in scientific/engineering notation like "4.147e25".  Are there any of those in your data set?  If not, then you can write a routine that doesn't need to scan the string for occurrences of the letter e.

Are all your values expressed to a fixed number of decimal places?  For example, if you know you'll always have exactly two decimal places, then you don't need to scan for a period; you always know it'll be the third-to-last character.

Are all your values positive?  Then you can skip looking for a minus sign, etc.
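For instance, a stripped-down routine along those lines might look something like this (an untested sketch assuming plain positive decimals: no sign, no exponent, no surrounding whitespace, and no error checking):
Code:
float fastParse(String s) {
  // Integer part: accumulate digits until we hit the decimal point (if any)
  float result = 0;
  int i = 0;
  while (i < s.length() && s.charAt(i) != '.') {
    result = result * 10 + (s.charAt(i) - '0');
    i++;
  }
  // Fractional part: accumulate digits after the decimal point
  float scale = 0.1f;
  for (i = i + 1; i < s.length(); i++) {
    result += (s.charAt(i) - '0') * scale;
    scale *= 0.1f;
  }
  return result;
}

Every assumption you drop (signs, exponents, stray spaces) adds branches back in, which is exactly why the general-purpose routines are slower.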

EDIT:  Also, if you're in control of the creation of the data files, it might be worth changing the program that generates them to create some regularity, e.g., forcing it to always use a fixed decimal precision, etc.

Re: Efficiently working with large volumes of data
Reply #2 - May 25th, 2010, 6:57am
 
Note that split(",|\\t") can be a culprit for your performance issue!
It is fine for small data sets, but if it is called millions of times, it can slow down the processing of data: unless Java does some optimization (caching compiled regexes?), it takes the regular expression, compiles it, matches it against the string, and repeats that for each line.
If you know in advance whether you have a comma or a tab, you should instead split the line manually.
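For example, a simple indexOf()-based split on a known comma (untested sketch, splitFast is just a made-up name) avoids compiling a regex for every line:
Code:
// Untested sketch: split a line on a single known delimiter (',')
// without building a regular expression for every line.
String[] splitFast(String line, char delim) {
  // Count fields first so we can allocate the result array once
  int n = 1;
  for (int i = 0; i < line.length(); i++) {
    if (line.charAt(i) == delim) n++;
  }
  String[] parts = new String[n];
  int start = 0;
  for (int k = 0; k < n; k++) {
    int idx = line.indexOf(delim, start);
    if (idx == -1) idx = line.length();  // last field
    parts[k] = line.substring(start, idx);
    start = idx + 1;
  }
  return parts;
}

I believe Processing's own split(line, ',') with a char delimiter does something similar without a regex, so that is another option.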

Other than that, the string to float conversion will always take some time, as it isn't simple to do...

Also beware: you used Float[] pieces. That means each generated float will be wrapped in a Float object thanks (?) to auto-boxing. In other words, you get an object creation for every value: both a memory hog and a performance hit.
Re: Efficiently working with large volumes of data
Reply #3 - May 25th, 2010, 7:26am
 
Quote:
A slightly-less-ugly way to do it is to use the constant Float.MAX_VALUE as your "NaN", which is even less likely than 98.7654 to be part of your data set

Thanks. That's certainly nicer, though I was hoping I was missing something. It seems daft that assigning a custom number is 100 times more efficient than checking isNaN.

Quote:
Beyond that, if you really need to speed up the import process, your best bet is probably to write your own parsing/conversion method

I'd like to avoid this, as I'm more than likely to make a mistake. The file import doesn't happen too often, so it can be lived with, though it does make the initial load painful.

Quote:
The standard conversion routines are written to handle numbers in scientific/engineering notation like "4.147e25".  Are there any of those in your data set?  If not, then you can write a routine that doesn't need to scan the string for occurrences of the letter e.

Unfortunately my data is unknown, and may well include engineering notation (indeed I was pleasantly surprised when I found it interpreting them).

Thanks for the advice.

Quote:
Note that split(",|\\t") can be a culprit for your performance issue!

I don't think it's too bad. Running the same split assigned to a String[] without the float conversion is two orders of magnitude faster. As I don't know what my input data will be, I'd like to catch as much as I can.

Quote:
Also beware: you used Float[] pieces. That means each generated float will be wrapped in a Float object thanks (?) to auto-boxing.

I hope this isn't happening. My rough order of events:
  • Am first declaring data[][] with length of input file and width of columns.
  • Using float[] to populate an array of entries from a single row.
  • Test each entry for isNaN. If false, data[0][a] = float[a]. If true, set data[0][a] = Float.MAX_VALUE.
  • Repeat for each item in float[]
  • Redeclare float[] for the next line of the source data.
  • Repeat for all rows.


If I'm right, float[] should only ever exist for the period that the line is being parsed. After that it should be reused for the other lines.
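Roughly, in code (an untested sketch; indexing follows the data[row][column] layout described above):
Code:
for (int i = 0; i < lines.length; i++) {
  // This float[] only exists while the current line is handled
  float[] row = float(trim(lines[i].split(",|\\t")));
  for (int a = 0; a < row.length; a++) {
    data[i][a] = Float.isNaN(row[a]) ? Float.MAX_VALUE : row[a];
  }
}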

---

Thanks both for input. Other thoughts most welcome.
Re: Efficiently working with large volumes of data
Reply #4 - May 25th, 2010, 8:28am
 
RedTar wrote on May 25th, 2010, 7:26am:
Quote:
Also beware: you used Float[] pieces. That means each generated float will be wrapped in a Float object thanks (?) to auto-boxing.

I hope this isn't happening. My rough order of events:
  • Am first declaring data[][] with length of input file and width of columns.
  • Using float[] to populate an array of entries from a single row.
  • Test each entry for isNaN. If false, data[0][a] = float[a]. If true, set data[0][a] = Float.MAX_VALUE.
  • Repeat for each item in float[]
  • Redeclare float[] for the next line of the source data.
  • Repeat for all rows.


If I'm right, float[] should only ever exist for the period that the line is being parsed. After that it should be reused for the other lines.

I'm not an expert, but I think PhiLho is probably correct on this--the Float objects aren't re-used, they're destroyed and then a new set is created each time the code reaches this line.

I assume you need the data to be in a Float[] array instead of a float[] array for some reason.  You might want to try something like this:

Code:

Float[] pieces = new Float[100];
// (Or whatever number is safely large enough
// to hold all the data in a line.)
int nPieces;

for (int i = 0; i < lines.length; i++) {
  float[] temp = float(trim(lines[i].split(",|\\t")));
  nPieces = temp.length;
  for (int j = 0; j < nPieces; ++j) {
    pieces[j] = temp[j];
  }
}

This avoids re-allocating the Float[] array for every line (auto-boxing still creates a Float object per assignment, though), and might be faster.  (You need int nPieces, of course, because pieces.length will always be 100 regardless of how many actual data points you have stored in it.)
Re: Efficiently working with large volumes of data
Reply #5 - May 25th, 2010, 8:37am
 
RedTar wrote on May 25th, 2010, 7:26am:
I hope this isn't happening. My rough order of events:
  • Am first declaring data[][] with length of input file and width of columns.
  • Using float[] to populate an array of entries from a single row.

You are missing something: you showed a Float[], but now you write about float[]. These are different things! The first is an array of objects; the second is an array of primitive float values.
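A tiny illustration of the difference (just a sketch):
Code:
float[] primitives = new float[3];  // three raw 32-bit values, no objects allocated
Float[] objects = new Float[3];     // three object references, all initially null
primitives[0] = 1.5f;               // plain value copy
objects[0] = 1.5f;                  // auto-boxing: allocates a new Float object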
Re: Efficiently working with large volumes of data
Reply #6 - May 25th, 2010, 8:43am
 
Yes, if you didn't know there was a difference between Float and float, then try just changing your Float[] to float[].  

My post above only applies if you are deliberately using Float for some reason.
Re: Efficiently working with large volumes of data
Reply #7 - May 25th, 2010, 8:43am
 
PhiLho  wrote on May 25th, 2010, 8:37am:
RedTar wrote on May 25th, 2010, 7:26am:
I hope this isn't happening. My rough order of events:
  • Am first declaring data[][] with length of input file and width of columns.
  • Using float[] to populate an array of entries from a single row.

You are missing something: you showed a Float[], but now you write about float[]. These are different things! The first is an array of objects; the second is an array of primitive float values.


A typo on my part ;) I'll have to check my code when I get off work, but I strongly suspect it's float[], as I only need it to hold the values from the split line for as long as it takes to check them for isNaN.
Re: Efficiently working with large volumes of data
Reply #8 - May 25th, 2010, 8:59am
 
Note that float(array) is just a convenience provided by Processing: it creates an array of floats of the same size as the array of strings, then does the conversion for you.
You can do the same explicitly (untested):
Code:
String[] piecesStr = lines[i].split(",|\\t");
// Unlike trim(array), doesn't create a new array
for (int j = 0; j < piecesStr.length; j++) piecesStr[j] = piecesStr[j].trim();
float[] pieces = new float[piecesStr.length];
for (int j = 0; j < piecesStr.length; j++)
{
 pieces[j] = parseFloat(piecesStr[j], Float.MIN_VALUE);
}

I used parseFloat here, which is Float(str).floatValue() wrapped in a try/catch to handle cases where there is no float number (e.g. a string like "NaN"...). In this case, the second parameter is used as the default value.
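Conceptually it is something like this (just a sketch of the idea, not Processing's actual source; parseFloatOr is a made-up name):
Code:
float parseFloatOr(String s, float otherwise) {
  try {
    // Attempt the normal conversion
    return new Float(s).floatValue();
  } catch (NumberFormatException e) {
    // No valid float number: fall back to the supplied default
    return otherwise;
  }
}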
Re: Efficiently working with large volumes of data
Reply #9 - May 27th, 2010, 6:07am
 
Quote:
I used parseFloat here, which is Float(str).floatValue() wrapped in a try/catch to handle cases where there is no float number (e.g. a string like "NaN"...). In this case, the second parameter is used as the default value.
Thanks, I wasn't aware you could pass two args to parseFloat.

FYI ended up implementing something along the lines of:
Code:
for (int i = 0; i < lines.length; i++) {
  String[] piecesStr = lines[i].split(",|\\t");
  for (int j = 0; j < piecesStr.length; j++) {
    data[j][i] = parseFloat(piecesStr[j], Float.MAX_VALUE);
  }
}

It looks like parseFloat handles the trimming itself, so that wasn't needed.

The whole procedure only shaves 15% off the time, but it's certainly a lot cleaner, and good for my learning.

Thanks both for the help.