We closed this forum 18 June 2010. It has served us well since 2005 as the ALPHA forum did before it from 2002 to 2005. New discussions are ongoing at the new URL http://forum.processing.org. You'll need to sign up and get a new user account. We're sorry about that inconvenience, but we think it's better in the long run. The content on this forum will remain online.
text analysis not working with loaded text (Read 1451 times)
text analysis not working with loaded text
Sep 15th, 2009, 4:26pm
 
Hi,

I've been doing some on and off Processing stuff over the last year and thoroughly enjoyed it. Fantastic program. But this one has got me stumped.

I wanted to make an array of all the words used in a passage of text, but with no duplicates. So if "hello" was used in the first sentence and in the third, it would only be recorded from the first sentence and ignored the second time. A bit like making a dictionary. Eventually I would like to rank them to find the most popular word.

I'm not sure how you would approach this, but I'm using an array of all the words in the passage, plus another array to store each word once it has been checked against the words already seen, and looping through both arrays with for loops.

It works fine using an array of words I write in the program. But if I load the text from an external source, it seems to add the word regardless of whether it has appeared before. The program runs fine and there are no errors.

Below is the code. It loads the text from gutenberg.org but shortens the array to only 50 words so you don't have to wait too long for the results! You can comment out those lines in the setup() part and uncomment the words // = {"hello"... initializer to try it with an internal array.

Any help is gratefully received.

Code:

String delimiters = " ,.?!;:[]";
String[] words;// = {"hello", "my", "name", "is", "david", "and", "david", "is", "my", "name" };
String[] checks = new String[0];
int counter = 1509;

String url = "http://www.gutenberg.org/dirs/etext97/1ws3310.txt";

int c=0;

void setup() {
  String[] rawText = loadStrings(url);
  String joinedText = join(rawText, "" );
  words = splitTokens(joinedText, delimiters);
  words = subset(words, 1509, 50);

  checks = append(checks, words[0] );

  frameRate(5);
}


void draw() {
  println("running " + c + ", " + words.length + ", " + checks.length);

  String theWord = words[c];
  println(theWord);
  println("///");

  for(int j=0; j<checks.length; j++) {
    String theCheck = checks[j];
    println(theCheck);
    if(theCheck == theWord) {
      println("break");
      break;
    } else if((theCheck != theWord) & (j== checks.length-1)) {
      println(theWord + ", " + theCheck);
      checks = append(checks, theWord);
    }
  }

  c++;

  if(c == words.length) {

    println("//////");
    println(checks);
    println(words.length);
    println(checks.length);

    noLoop();
  }
}

Re: text analysis not working with loaded text
Reply #1 - Sep 15th, 2009, 5:37pm
 
OK, looks like I needed to use string.equals(string) rather than string == string to compare them.
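
A minimal plain-Java sketch of why the hard-coded array worked while the loaded text did not. In Java, == on Strings compares object references, and string literals written in the source are shared, so the comparison happened to succeed for the internal array. Strings built at run time (as when splitting loaded text) are separate objects, so only equals() sees them as the same word. The class and method names here are hypothetical, for illustration only.

```java
class StringCompareDemo {

    // True only when both variables refer to the very same String object.
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    // True when both Strings contain the same characters.
    static boolean sameText(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String literal = "hello";
        // Built at run time, like a word produced by splitting loaded text.
        String built = new String(new char[] {'h', 'e', 'l', 'l', 'o'});

        System.out.println(sameObject(literal, "hello")); // true: literals are shared
        System.out.println(sameObject(literal, built));   // false: different objects
        System.out.println(sameText(literal, built));     // true: same characters
    }
}
```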
Re: text analysis not working with loaded text
Reply #2 - Sep 16th, 2009, 5:08am
 
After solving that and a few modifications it works fine. But now I want to sort the results so the word that is used the most is the first element in the array.

I didn't see how this would be possible with two separate arrays (one for the String of each word and one for the int of times used), and I don't think a 2D array can hold one type of value in one column and another type in the other, e.g. myArray[0][0] = "hello", myArray[0][1] = 5;

So I have created an array of Entry class objects. Each object has an itsWord field for the passed word and a total field for how many times the word appears. It's gathering the data and spitting it back out correctly as before, but I can't get it to sort. I have tried Arrays.sort(), which was recommended on other threads, but it gives me a ClassCastException error. I don't understand the Comparator, if that's what I need to use.

Any help is greatly appreciated.

I've also looked at hashmaps but can't find a way of sorting them, and not sure if you could sort them by their values rather than their keys.

cheers, david

Code:

String delimiters = " ,.?!;:[]";
String[] words;
Entry[] checks = new Entry[1];
int starter = 1509;
int c=0;

String url = "http://www.gutenberg.org/dirs/etext97/1ws3310.txt";

void setup() {

  String[] rawText = loadStrings(url);
  String joinedText = join(rawText, "" );
  words = splitTokens(joinedText, delimiters);
  words = subset(words, starter, 50);

  checks[0] = new Entry(words[0], 0);
}


void draw() {

  String theWord = words[c];

  for(int j=0; j<checks.length; j++) {
    String theCheck = checks[j].getWord();
    if(theCheck.equals(theWord)) {
      checks[j].increment();
      break;
    } else if(j == checks.length-1) {
      // reached the last entry without a match: add the word with a count of 0;
      // the next pass of this loop finds the new entry and increments it
      checks = (Entry[]) append(checks, new Entry(theWord, 0));
    }
  }

  c++;

  if(c == words.length) {
    //error cropping up here
    Arrays.sort(checks);
    //
    println("//////");
    for(int i=0; i<checks.length; i++) {
      println("pp " + checks[i].itsWord + " " + checks[i].total);
    }

    println(words.length);
    println(checks.length);
    noLoop();
  }
}

//////////////////////////////
class Entry {

  String itsWord;
  float total;

  Entry(String wor, float num) {
    itsWord = wor;
    total = num;
  }

  void increment() {
    total++;
  }

  String getWord() {
    return itsWord;
  }
}
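
A note on the ClassCastException: Arrays.sort(Object[]) with no second argument casts each element to Comparable, and the Entry class above does not implement that interface. One possible fix, sketched here in plain Java, is to have the class implement Comparable so it carries its own ordering. The class name CountedWord and the tie-breaking rule (alphabetical) are assumptions for illustration, not part of the original sketch.

```java
import java.util.Arrays;

// Hypothetical stand-in for the Entry class, with a natural ordering.
class CountedWord implements Comparable<CountedWord> {

    final String word;
    int total;

    CountedWord(String word, int total) {
        this.word = word;
        this.total = total;
    }

    // Higher totals sort first; equal totals fall back to alphabetical order.
    @Override
    public int compareTo(CountedWord other) {
        if (total != other.total) {
            return Integer.compare(other.total, total);
        }
        return word.compareTo(other.word);
    }

    public static void main(String[] args) {
        CountedWord[] entries = {
            new CountedWord("lear", 2),
            new CountedWord("the", 5),
            new CountedWord("king", 2)
        };
        Arrays.sort(entries); // no exception: elements are Comparable
        for (CountedWord e : entries) {
            System.out.println(e.word + " " + e.total);
        }
    }
}
```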

Re: text analysis not working with loaded text
Reply #3 - Sep 16th, 2009, 8:01am
 
If I paste Arrays.sort() in the search field above, I get a number of relevant threads, some of them explaining how to use it, to use Comparator or Comparable, etc.
I suppose that's what you did, since you mention Comparator.
I don't think pasting a code sample similar to those in these threads would help you. What do you not understand about the Comparator?

I started a tutorial about sorting (arrays and Collections) but never finished it. Perhaps I should go back to work...

Note: HashMaps are basically unsorted, mostly useful to get a quantity for a given string, for example. TreeMaps are a sorted variant (the given thread looks quite similar to your use case...).
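
A rough plain-Java sketch of the distinction made in the note above: a HashMap counts occurrences per word with no defined order, while a TreeMap built from it keeps its keys sorted (alphabetically for Strings). Note that a TreeMap sorts by key, not by count. The class and method names are hypothetical; the sample words are the ones from the first post.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class WordCounts {

    // Counts how many times each word occurs; iteration order is unspecified.
    static Map<String, Integer> count(String[] words) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String w : words) {
            Integer seen = counts.get(w);
            counts.put(w, seen == null ? 1 : seen + 1);
        }
        return counts;
    }

    // Copies the counts into a TreeMap, which keeps keys in sorted order.
    static TreeMap<String, Integer> alphabetical(Map<String, Integer> counts) {
        return new TreeMap<String, Integer>(counts);
    }

    public static void main(String[] args) {
        String[] words = {"hello", "my", "name", "is", "david",
                          "and", "david", "is", "my", "name"};
        Map<String, Integer> counts = count(words);
        System.out.println(counts.get("david"));  // look up a count by word
        System.out.println(alphabetical(counts)); // keys in alphabetical order
    }
}
```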
Re: text analysis not working with loaded text
Reply #4 - Sep 16th, 2009, 8:36am
 
Thanks for the response PhiLho.

I found an example for strings using a comparator class (which is where I was getting stuck) and with a little rejigging got it to work with numbers. I used a tutorial you suggested on that same thread, which was really helpful in explaining it all. http://lkamal.blogspot.com/2008/07/java-sorting-comparator-vs-comparable.html

I come from a Flash background, so I get a bit stuck sometimes, especially on the strictness of Processing and Java. All the tutorials on Processing and on integrating Java into Processing are really helpful, especially for novices like me, but there is definitely room for more tutorials.
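
The poster's final code is not shown in the thread, so what follows is only a guess at its general shape: a separate Comparator class that orders Entry objects by their total, passed as the second argument to Arrays.sort(). The Entry fields match the earlier post; the names ComparatorSketch, ByTotalDescending and sortByTotal are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

class ComparatorSketch {

    // Same fields as the Entry class posted earlier in the thread.
    static class Entry {
        String itsWord;
        float total;

        Entry(String wor, float num) {
            itsWord = wor;
            total = num;
        }
    }

    // Orders entries so that the largest total comes first.
    static class ByTotalDescending implements Comparator<Entry> {
        public int compare(Entry a, Entry b) {
            return Float.compare(b.total, a.total);
        }
    }

    // Sorts in place; Entry itself does not need to implement Comparable.
    static void sortByTotal(Entry[] entries) {
        Arrays.sort(entries, new ByTotalDescending());
    }

    public static void main(String[] args) {
        Entry[] entries = {
            new Entry("king", 2), new Entry("the", 5), new Entry("lear", 1)
        };
        sortByTotal(entries);
        for (Entry e : entries) {
            System.out.println(e.itsWord + " " + e.total);
        }
    }
}
```

Compared with making the class Comparable, a Comparator leaves the class untouched and allows several orderings (by count, by word) to coexist.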

Thanks and all the best.

PS. Having got the text analysis working, I can now tell you "the" is the most commonly occurring word in King Lear by Shakespeare. Wow, makes it all worthwhile!
Re: text analysis not working with loaded text
Reply #5 - Sep 16th, 2009, 9:30am
 
I am glad you sorted it yourself. That's the best way to learn! :)

Perhaps you already know it, but I found the examples/Topics/Advanced Data/HashMapClass example quite fascinating in showing word frequency in a visual (word cloud, unsorted here) way.
Re: text analysis not working with loaded text
Reply #6 - Sep 16th, 2009, 4:38pm
 
It's a lovely visual implementation.

I did stumble upon that and thought it might be the way to go, but I didn't find any methods for sorting the hashmap.
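
For what it's worth, a HashMap has no sort method of its own, but its entries can be copied into a list and that list sorted by value. A minimal plain-Java sketch, assuming a map from word to count; the class and method names are hypothetical, and breaking ties alphabetically is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SortMapByValue {

    // Returns the map's entries ordered by count (largest first), then by word.
    static List<Map.Entry<String, Integer>> byCountDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                int byCount = b.getValue().compareTo(a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            }
        });
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        counts.put("king", 2);
        counts.put("the", 5);
        counts.put("lear", 2);
        for (Map.Entry<String, Integer> e : byCountDescending(counts)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

The map itself stays unsorted; only the copied list is ordered, which is usually enough for printing or drawing a ranking.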