Splitting strings into individual words then comparing with words from different strings

edited December 2015 in Programming Questions

Hello,

I am trying to do the following:

Load in a single sentence (an unknown number of words, but not many). This part doesn't need solving; I already have sentences coming in.

Split it into individual words and save the words.

Load the next string.

Split the string into words and check whether any of the words are already saved (and if so, add 1 to the quantity of that specific word). If a word isn't already saved, save it.

Repeat with x number of sentences.

Essentially the idea is to build up an ArrayList(?) of words, and then display the top 4 most frequently used words in the sketch window.

Note: I've set up a bit of code that removes all common words such as "the", "and", "to", etc.

Here is the relevant piece of code I have been experimenting with (it only addresses part of the issue at the moment):

    String[] splitWords = split(sentence, " "); //split sentence into individual words

    int numberOfWords2 = splitWords.length; //count number of words in sentence

    for (int p = 0; p < splitWords.length; p++) { //add words to the list (p must stay below length)
      myWords.append(splitWords[p]);
    }

    sentenceCount = sentenceCount +1; //count number of sentences
    println(sentenceCount);

    numberOfWords = numberOfWords+numberOfWords2; //total number of words

The main issue I am having is adding the words to the arrayList AFTER the existing words, rather than replacing them.

I am also unsure of the best way to keep a count of how many times each word has appeared across all sentences.

Thanks!
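
One common cause of words being replaced rather than appended is creating the list anew for every sentence. As a minimal sketch in plain Java (not Processing-specific, and not code from this thread; `WordCollector` and `addSentence` are made-up names), a single list can be kept alive across sentences and appended to:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects the words of every sentence it is given.
class WordCollector {
  // Created once, so earlier words survive when a new sentence arrives.
  private final List<String> words = new ArrayList<>();

  // Splits one sentence on runs of whitespace and appends its words.
  void addSentence(String sentence) {
    for (String w : sentence.trim().split("\\s+")) {
      if (!w.isEmpty()) words.add(w); // skip the empty token produced by blank input
    }
  }

  List<String> words() {
    return words;
  }
}
```

Using a for-each loop here also avoids having to get the index bounds right by hand.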

Answers

  • I think you want this

    https://www.processing.org/reference/HashMap.html

    This has been done many times; try searching for it here in the forum.

    import java.util.Map;
    
    // Note the HashMap's "key" is a String and "value" is an Integer
    HashMap<String, Integer> hm = new HashMap<String, Integer>();
    
    // Putting key-value pairs in the HashMap
    hm.put("Casey", 36);
    
    // We can also access values by their key
    int val = hm.get("Casey");
    println("Casey is " + val);
    val++;
    hm.put("Casey", val);
    val = hm.get("Casey");
    println("Casey is " + val);
    
  • edited December 2015

    Though not as performant as a HashMap<String, Integer>, the container class IntDict with its increment() method is the easiest way to pull that off:

    final IntDict words = new IntDict();
    
    words.increment("apple");
    words.increment("orange");
    words.increment("apple");
    
    println(words); // IntDict size=2 { "apple": 2, "orange": 1 }
    exit();
    
  • @Chrisir , why do you import java.util.Map; if you end up not using it? :-/

  • import java.util.Map;
    
    // Note the HashMap's "key" is a String and "value" is an Integer
    HashMap<String, Integer> hm = new HashMap<String, Integer>();
    
    String[] testList = {
      "Ralph", "John", "Oliver", "Who", "Oliver", "What", "Oliver", "John"
    }  ; 
    
    // ------------------------------------------------
    // the core functions 
    
    void setup() {
      size(200, 200);
    
      // build the Map 
      for (String myWord : testList) {
        add(myWord);
      }
    
      // Using an enhanced loop to iterate over each entry
      for (Map.Entry me : hm.entrySet()) {
        print(me.getKey() + " is ");
        println(me.getValue());
      }
    }
    
    void draw() {
      // empty
    }
    
    // ------------------------------------------------
    // other functions 
    
    void add(String wordLocal) {
      // add a word or when it's already there increase the counter 
    
      // is it new? 
      if (!hm.containsKey(wordLocal)) {
        // new one 
        // Putting key-value pairs in the HashMap
        hm.put(wordLocal, 1);
      } // if 
      else {
        // old one : increase counter 
        // We can also access values by their key
        int val = hm.get(wordLocal);
        val++;
        hm.put(wordLocal, val); // Putting key-value pairs in the HashMap
      } // else
    }
    //
    
  • import java.util.Map; is needed in the broader scheme of things, e.g. in my 2nd sketch ;-)

  • edited December 2015

    I'd rather import java.util.Map.Entry; and use:
    for (Entry<String, Integer> me : hm.entrySet()) {}

    As for import java.util.Map; I'd use it for: ;)
    final Map<String, Integer> hm = new HashMap<String, Integer>();

  • A simpler addWord() for @Chrisir's HashMap<String, Integer> version:

    static final void addWord(Map<String, Integer> map, String word) {
      Integer count = map.get(word);
      map.put(word, count == null? 1 : count + 1);
    }
    

    Invoke it like this: for (String word : testList) addWord(hm, word);
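
For comparison, in plain Java 8 or later the same update can be written with `Map.merge`, which stores 1 for a new key and otherwise adds 1 to the existing value. This is a sketch outside Processing; whether a given Processing version's preprocessor accepts the method reference is an assumption to check, and `MergeCount` is a hypothetical class name.

```java
import java.util.HashMap;
import java.util.Map;

class MergeCount {
  // Adds one occurrence of word to the tally held in map.
  static void addWord(Map<String, Integer> map, String word) {
    map.merge(word, 1, Integer::sum);
  }

  public static void main(String[] args) {
    Map<String, Integer> hm = new HashMap<>();
    String[] testList = {"Ralph", "John", "Oliver", "Who", "Oliver", "What", "Oliver", "John"};
    for (String w : testList) addWord(hm, w);
    System.out.println(hm); // iteration order of a HashMap is not guaranteed
  }
}
```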

  • Wouldn't it crash with an NPE if the word was new?

  • edited December 2015

    I'm checking for null inside put(): count == null? 1 : count + 1

    And method get() returns null in case "key" doesn't exist or the value was already null:
    http://docs.Oracle.com/javase/8/docs/api/java/util/Map.html#get-java.lang.Object-

    Returns:
    the value to which the specified key is mapped, or null if this map contains no mapping for the key

  • edited December 2015

    Ah, is this because you used Integer instead of int?

    Because I think this gives an NPE (NullPointerException) if wordLocal is new:

    int val = hm.get(wordLocal);
    
  • edited December 2015

    Primitive datatypes can't have null assigned to them.
    And HashMap<String, Integer> already holds its values as Integer objects, after all.
    So it was only natural to have an Integer variable to receive the result of get().
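
The difference can be seen in a small plain-Java sketch (the class name `UnboxDemo` is made up for illustration). Assigning the result of `get()` to an `Integer` simply yields `null` for a missing key, while assigning it to an `int` forces auto-unboxing and throws a `NullPointerException`. In the earlier sketch the `containsKey` check guards that line, so it is only reached for keys that already exist.

```java
import java.util.HashMap;
import java.util.Map;

class UnboxDemo {
  // Returns true if reading key into an int throws a NullPointerException.
  static boolean unboxingThrows(Map<String, Integer> map, String key) {
    try {
      int val = map.get(key); // auto-unboxing: fails when get() returns null
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    Map<String, Integer> hm = new HashMap<>();
    Integer boxed = hm.get("missing"); // fine: boxed is simply null
    System.out.println(boxed == null);
    System.out.println(unboxingThrows(hm, "missing"));
  }
}
```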

  • thanks, mate!

  • Great thanks, the next step I am having trouble with is:

    Instead of the test array, I want to use the words from a tweet. So when the tweet is received, I need to split it into individual words and then add each word to the array. This could potentially become a pretty large array.

    The main difficulties I am having are:

    1. Instead of the test string array I want to add each word from a tweet, and then keep adding the words of future tweets. (Should I use an array, ArrayList or HashMap for this, since the size is unknown?)

    2. As tweets come in I want to update 4 variables:

    variable 1 = first most common word across all tweets
    variable 2 = second most common word across all tweets
    variable 3 = third most common word across all tweets
    variable 4 = fourth most common word across all tweets

    ...then print the 4 most common words after each tweet is loaded. At the moment tweets are coming in as a string named 'tweet'.

    Thanks again.

  • You could add the incoming words to an ArrayList of String, for example,

    and also add them to the HashMap (see above).

    For the 4 highest-ranking words:

    after line 48 (the val++; line in add()) insert (pseudo code):

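    val++;
    
    // still a sketch: tracks two leaders, extend the same pattern for places 3 and 4
    if (val > valOfPlace1) {
        if (!wordLocal.equals(wordOfPlace1)) {
            // the previous leader drops to second place
            valOfPlace2 = valOfPlace1;
            wordOfPlace2 = wordOfPlace1;
        }
        valOfPlace1 = val;
        wordOfPlace1 = wordLocal;
    }
    else if (val > valOfPlace2) {
        valOfPlace2 = val;
        wordOfPlace2 = wordLocal;
    }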
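
An alternative to tracking the leaders by hand is to sort the map's entries by count whenever the top words are needed. The following is a plain-Java sketch, not code from this thread; `TopWords` and `top` are hypothetical names, and breaking ties alphabetically is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TopWords {
  // Returns up to n words, most frequent first; equal counts are ordered alphabetically.
  static List<String> top(Map<String, Integer> counts, int n) {
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
    entries.sort((a, b) -> {
      int byCount = Integer.compare(b.getValue(), a.getValue()); // descending by count
      return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    });
    List<String> result = new ArrayList<>();
    for (int i = 0; i < Math.min(n, entries.size()); i++) {
      result.add(entries.get(i).getKey());
    }
    return result;
  }
}
```

Sorting the whole map after every tweet costs more than updating four variables, but it stays correct when a word climbs several places at once.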
    
  • _vk
    edited December 2015

    Perhaps after adding to the HashMap and counting the words, you don't need to keep them in the "receiver" ArrayList. If so, you may want to look for a FIFO structure. I've never used one, but something like Deque, I guess. @GoToLoop will know, I'm sure ;)
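
As a rough illustration of the FIFO idea, `java.util.ArrayDeque` can act as a queue: incoming tweets are added at the tail and removed from the head once their words have been counted. This plain-Java sketch is only an assumption about how it might be wired up; `TweetQueue` and its methods are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

class TweetQueue {
  private final Deque<String> pending = new ArrayDeque<>();
  private final Map<String, Integer> counts = new HashMap<>();

  // Called when a tweet arrives: store it until it can be processed.
  void receive(String tweet) {
    pending.addLast(tweet);
  }

  // Counts the words of the oldest waiting tweet and discards the tweet.
  // Returns false when nothing was waiting.
  boolean processNext() {
    String tweet = pending.pollFirst(); // null if the queue is empty
    if (tweet == null) return false;
    for (String w : tweet.trim().split("\\s+")) {
      if (!w.isEmpty()) counts.merge(w.toLowerCase(), 1, Integer::sum);
    }
    return true;
  }

  // Number of times word (lowercase) has been seen so far.
  int count(String word) {
    return counts.getOrDefault(word, 0);
  }

  int pendingSize() {
    return pending.size();
  }
}
```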

  • edited December 2015 Answer ✓
    // forum.Processing.org/two/discussion/13882/
    // splitting-strings-into-individual-words-
    // then-comparing-with-words-from-different-strings
    
    // GoToLoop (2015-Dec-11)
    
    final IntDict words = new IntDict();
    
    words.increment("apple");
    words.increment("orange");
    words.increment("apple");
    words.increment("açaí");
    words.increment("açaí");
    words.increment("apple");
    words.increment("mangosteen");
    
    words.sortValuesReverse();
    println(words, ENTER);
    
    for (int i = 0, ranks = min(4, words.size()); i != ranks; ++i)
      println(i+1, words.key(i), words.value(i)); 
    
    exit(); 
    
  • One final question (hopefully!):

    I am trying to get the following piece of code to work:

    String tweet;
    final IntDict words = new IntDict();
    
    String[] individualWords = split(tweet, ' '); // separate tweet into individual words
    int numWords = individualWords.length; // number of words in tweet
    
    println(tweet + "LENGTH =" + numWords); // print tweet and its length (number of words)
    
    for (int wordVar = 0; wordVar <= numWords; wordVar++) { // 0 to number of words in tweet
      words.increment(individualWords[wordVar]); // save each word in the tweet
    }
    

    I must be doing something wrong as it doesn't like the contents of the for loop.

    Thanks again!
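
One likely culprit in the loop above is the `<=` in its condition: valid indices run from 0 to `numWords - 1`, so the last pass reads one element past the end of the array and Java throws an `ArrayIndexOutOfBoundsException`. (`tweet` is also never assigned in the snippet as posted.) A plain-Java sketch of the corrected loop, using a `HashMap` in place of Processing's `IntDict` and a hypothetical class name `LoopBounds`:

```java
import java.util.HashMap;
import java.util.Map;

class LoopBounds {
  // Counts each word of tweet, splitting on single spaces as in the post.
  static Map<String, Integer> countWords(String tweet) {
    Map<String, Integer> words = new HashMap<>();
    String[] individualWords = tweet.split(" ");
    int numWords = individualWords.length;
    // Strictly less than: the last valid index is numWords - 1.
    for (int wordVar = 0; wordVar < numWords; wordVar++) {
      words.merge(individualWords[wordVar], 1, Integer::sum);
    }
    return words;
  }
}
```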

  • edited December 2015 Answer ✓
    // forum.Processing.org/two/discussion/13882/
    // splitting-strings-into-individual-words-
    // then-comparing-with-words-from-different-strings
    
    // GoToLoop (2015-Dec-12)
    
    final IntDict words = new IntDict();
    
    String tweets = "Apple  orange\fapple açaÍ aÇaí \t apple \r mangosteen  \n";
    println(tweets, ENTER);
    
    for (String w : splitTokens(tweets))  words.increment(w.toLowerCase());
    
    words.sortValuesReverse();
    println(words, ENTER);
    
    for (int i = 0, ranks = min(4, words.size()); i != ranks; ++i)
      println(i+1, words.key(i), words.value(i)); 
    
    exit();
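
Outside Processing, the same idea can be sketched in plain Java, assuming that splitting on runs of whitespace is close enough to what `splitTokens()` does with its default delimiters. `TweetTally` is a hypothetical name, and `Locale.ROOT` is used so that lowercasing does not depend on the machine's language settings.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class TweetTally {
  // Lowercases and counts every whitespace-separated word in tweets.
  static Map<String, Integer> tally(String tweets) {
    Map<String, Integer> words = new LinkedHashMap<>();
    for (String w : tweets.trim().split("\\s+")) {
      if (w.isEmpty()) continue; // happens only for blank input
      words.merge(w.toLowerCase(Locale.ROOT), 1, Integer::sum);
    }
    return words;
  }
}
```

Lowercasing before counting is what lets "Apple" and "apple" land on the same key, as in the answer above.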
    