Split() comma but exclude within quotes
Jun 29th, 2009, 7:06pm
 
I'm trying to parse CSV (comma-separated values) files, so I'm doing a split on commas. However, some of the strings contain commas themselves. This is usually handled by surrounding the string in quotes.

Code:

// Naive approach: splits on every comma, including commas inside quotes
for (int i = 0; i < lines.length; i++) {
  String[] pieces = split(lines[i], ',');
}


For example, given this line:

This, is, a, "Test, yes, a test", thanks

I want to produce:
pieces[0] = This
pieces[1] = is
pieces[2] = a
pieces[3] = "Test, yes, a test"
pieces[4] = thanks
Re: Split() comma but exclude within quotes
Reply #1 - Jun 30th, 2009, 1:00am
 
You can do the parsing yourself. It isn't so hard, but there are always some corner cases to handle (e.g. multiline strings).
If you need a quick, reliable result, I suggest just looking for a Java CSV library.
If you want to write it yourself, you have to parse character by character using a finite state automaton.
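As a rough illustration of the character-by-character approach, here is a minimal sketch in plain Java (which Processing can run). The class and method names are hypothetical, and it rests on several assumptions: fields never span multiple lines, a doubled quote inside a quoted field stands for one literal quote, surrounding quotes are stripped from the result, and every field is trimmed (which also trims whitespace inside quotes).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Processing or any library.
class CsvLineSplitter {
    // Splits one CSV line on commas, ignoring commas inside double quotes.
    static String[] splitCsvLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false; // the automaton's only state

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote: literal quote
                        i++;                 // skip the second quote
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas are kept as data here
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString().trim()); // field boundary
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString().trim()); // last field
        return fields.toArray(new String[0]);
    }
}
```

Calling `CsvLineSplitter.splitCsvLine(lines[i])` on the sample line from the question should give five fields, with the fourth being `Test, yes, a test` (without the quotes). If the quotes must be kept, they can be appended in the two quote branches instead of being dropped.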
Re: Split() comma but exclude within quotes
Reply #2 - Jun 30th, 2009, 1:58am
 
One simple approach that might just work is to split on " first.  You should then have an array where each odd-indexed element (1, 3, 5, ...) is text that was enclosed in inverted commas.

You can then create a new array to contain all elements:

Iterate through the array produced by splitting on inverted commas.  For the contents of even indices (0, 2, 4, ...), split on commas and add the resulting pieces to the new array.  For odd indices, you simply put back the inverted commas, which were removed by the first split, and add the element to the new array.

This is untested, but my gut feeling is that it should work and avoid having to write complex parsing routines ;)
(You might want to test what happens if the first character is an inverted comma.)
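The split-on-quotes idea above could be sketched roughly as follows in plain Java. The class name is hypothetical, and the sketch makes simplifying assumptions that should be stated loudly: empty pieces left over around the quoted parts are discarded (so genuinely empty fields such as `a,,b` are lost), and quotes escaped by doubling inside a value are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the "split on quotes first" idea.
class QuoteFirstSplitter {
    static List<String> split(String line) {
        List<String> result = new ArrayList<>();
        // The -1 limit keeps trailing empty chunks, so odd/even positions stay aligned.
        String[] chunks = line.split("\"", -1);
        for (int i = 0; i < chunks.length; i++) {
            if (i % 2 == 1) {
                // Odd index: text that was inside quotes; restore the quotes.
                result.add("\"" + chunks[i] + "\"");
            } else {
                // Even index: ordinary text; split it on commas.
                for (String piece : chunks[i].split(",", -1)) {
                    String trimmed = piece.trim();
                    if (!trimmed.isEmpty()) { // assumption: drop empty leftovers
                        result.add(trimmed);
                    }
                }
            }
        }
        return result;
    }
}
```

Under these assumptions, a line that starts with a quote is not a special case: the first chunk is simply empty and gets skipped.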

Another approach might be to use indexOf to find the next inverted comma.  Split on commas up to that point.  Call indexOf again to find the closing inverted comma: the text in between is a quoted string containing commas, so strip it out and add it to your array as a single element.  Then call indexOf again, split up to that point, and so on.
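The indexOf variant might look something like this sketch in plain Java. The names are hypothetical, and the same caveats are assumed: empty pieces are dropped, doubled quotes inside a value are not supported, and an unterminated quote simply swallows the rest of the line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the indexOf-based scan.
class IndexOfSplitter {
    static List<String> split(String line) {
        List<String> result = new ArrayList<>();
        int pos = 0;
        while (pos < line.length()) {
            int open = line.indexOf('"', pos);
            if (open < 0) {
                // No more quotes: the remainder is ordinary comma-separated text.
                addUnquoted(line.substring(pos), result);
                break;
            }
            addUnquoted(line.substring(pos, open), result);
            int close = line.indexOf('"', open + 1);
            if (close < 0) {
                // Assumption: an unterminated quote takes the rest of the line.
                result.add(line.substring(open));
                break;
            }
            result.add(line.substring(open, close + 1)); // quoted part, quotes kept
            pos = close + 1;
        }
        return result;
    }

    // Splits ordinary text on commas, trimming pieces and skipping empty ones.
    private static void addUnquoted(String text, List<String> result) {
        for (String piece : text.split(",", -1)) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
    }
}
```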
Re: Split() comma but exclude within quotes
Reply #3 - Jun 30th, 2009, 2:30am
 
blindfish wrote on Jun 30th, 2009, 1:58am:
One simple approach that might just work is to split on " first.
What about quotes in values?
Reminder: in CSV, they are escaped by doubling them.

Ad hoc parsing is OK if you are sure of what the file will or will not contain. Otherwise, you are in for surprises... :)
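To make the doubling rule concrete, here is a small sketch in plain Java of how a single, already isolated field could be unescaped. The class and method names are hypothetical, and the sketch assumes the field has already been separated from its neighbours correctly.

```java
// Hypothetical helper: turns one raw CSV field into its actual value.
class CsvFieldUnescaper {
    static String unescape(String field) {
        String f = field.trim();
        // Only fields wrapped in quotes carry escaped (doubled) quotes.
        if (f.length() >= 2 && f.startsWith("\"") && f.endsWith("\"")) {
            f = f.substring(1, f.length() - 1) // drop the surrounding quotes
                 .replace("\"\"", "\"");       // "" becomes a single "
        }
        return f;
    }
}
```

For instance, the raw field `"He said ""hi"""` should come out as `He said "hi"`. A splitter that treats every quote as a delimiter would break this field into several pieces instead.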
Re: Split() comma but exclude within quotes
Reply #4 - Jun 30th, 2009, 2:49am
 
Good point.  I must admit I don't enjoy parsing text for just that reason: far too many surprises, i.e. what you think should work, more often than not, doesn't...  That'll teach me for thinking aloud ;)

In this case, if I had control over the source file, I might simply have tried to avoid the problem altogether by using something other than commas to separate the contents...
Re: Split() comma but exclude within quotes
Reply #5 - Jun 30th, 2009, 3:39am
 
Sure, TSV (tab-separated values) is an excellent format for a lot of cases.
But you often have no control over the source of the input, alas, be it an export from a rigid application, data provided by a site, etc.
Re: Split() comma but exclude within quotes
Reply #6 - Jun 30th, 2009, 6:04am
 
Thanks!!! I think that will get me started. :)

This was one of the reasons I chose Processing... the community rocks!