How to set a list of stop words to be subtracted from a larger list of words.

Hi, I am currently studying the text and typography chapters of the Handbook. To go a little further I watched the beautiful series of videos about the topic by Daniel Shiffman on Vimeo. (for reference: https://vimeo.com/channels/introcompmedia/page:1 Videos 17.0/18.0/18.1/18.2)

As an exercise I decided to reproduce the word-counting program he builds. My question is: once I have the full list of keys and values, how can I exclude a fixed set of keys from the keys array? I was thinking of creating a new array containing the stop words I want to exclude, and checking the main key array against it with the hasKey() method. I am a bit lost :(

Here is the code I have so far. It is pretty much the same as in the video, but I tried to replicate it on my own to understand how it works. (PS: sorry for the annotations; if they bother you, I can clean up the code real quick :) )

//Array that will hold the individual words once the text is split on tokens.
String[] words;
//Set a new IntDict.
IntDict concordance;

void setup() {
  size(800, 600);
  colorMode(HSB, 360, 100, 100);
  //Import an external file; The file is an array of lines.
  String[] lines = loadStrings("The Rime Of The Ancient Mariner.txt");
  //Once I have the file, I need to deconstruct the array by joining all the lines.
  String entireplay = join(lines, " ");
  //Once I have all the lines, I can set tokens to get an array of words. (PS. Create a new words Array first!)
  words = splitTokens(entireplay, ",.!?:;\"--() ");
  //At this point I need a way to link every word to its value. (PS. Create a new IntDict first!)
  concordance = new IntDict();

  //Loop through the words array; every time a word appears, increment its count by one.
  //PS. toLowerCase() avoids counting capitalized and lowercase forms of a word separately!
  for(int i = 0; i < words.length; i++) {
    concordance.increment(words[i].toLowerCase());
  }
  //Now I can sort by values (or keys); here the values are sorted in descending order. (See Ref Page)
  concordance.sortValuesReverse();

}

void draw() {
  background(0, 0, 99);
  //With keys and values available, I can loop through the key array and look up the corresponding count.
  String[] keys = concordance.keyArray();
  for(int i = 0; i < keys.length; i++) {
    int count = concordance.get(keys[i]);
    println(keys[i], count);
  }
  noLoop();

}

So this was my idea, but I am having a hard time translating it into code, so I tried it in English first :)

Create a new String array and name it stopWords.
Load a file containing the stop words, join the lines, split on tokens, and get the final array of words.
Then, in draw() (since I still want to count the stop words, not ignore them completely),
while looping through the key array,
if the current key is one of the stop words (checked with hasKey()?),
then don't draw that key.

I hope this was not too messy. I will appreciate any input :) Thank you! R.

Answers

  • Answer ✓

    before setup

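    String[] stopWordsHelper = {"this","the", "that", "here", "there", "it", "and"};
    // Lookup table for the stop words, so hasKey() can test membership later.
    IntDict stopWords = new IntDict();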
    

    in setup()

      for(int i = 0; i < stopWordsHelper.length; i++) {
        stopWords.increment(stopWordsHelper[i].toLowerCase());
      }
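
With the stop words stored as keys, the display loop can skip any key that is present in the lookup while leaving the counts untouched. Here is a minimal sketch of that check in plain Java. It assumes a `LinkedHashMap` standing in for `IntDict` and a `HashSet` for the stop-word lookup so that it runs outside Processing; the names `StopWordFilter`, `count` and `visibleKeys` are made up for illustration. Inside a sketch, the equivalent test would presumably be `if (!stopWords.hasKey(keys[i]))` around the `println`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: a plain-Java stand-in for the IntDict-based sketch.
class StopWordFilter {

  // Count every word (stop words included), ignoring case.
  static Map<String, Integer> count(String[] words) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String w : words) {
      counts.merge(w.toLowerCase(), 1, Integer::sum);
    }
    return counts;
  }

  // Collect the keys to display, skipping stop words; the counts are not modified.
  static List<String> visibleKeys(Map<String, Integer> counts, Set<String> stopWords) {
    List<String> visible = new ArrayList<>();
    for (String key : counts.keySet()) {
      if (!stopWords.contains(key)) {
        visible.add(key);
      }
    }
    return visible;
  }

  public static void main(String[] args) {
    String[] words = {"It", "is", "an", "ancient", "Mariner", "and", "it", "stoppeth"};
    Set<String> stopWords = new HashSet<>(Arrays.asList("it", "is", "an", "and"));

    Map<String, Integer> counts = count(words);
    for (String key : visibleKeys(counts, stopWords)) {
      System.out.println(key + " " + counts.get(key));
    }
  }
}
```

Because the filter only decides what gets shown, the stop words are still counted and remain available if they are needed later.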
    
  • edited May 2015 Answer ✓

    https://processing.org/reference/IntDict_remove_.html

    // forum.processing.org/two/discussion/10742/
    // how-to-set-a-list-of-stop-words-to-be-subtracted-from-a-larger-list-of-words
    
    final IntDict concordance = new IntDict(
      new String[] {"the", "iron", "throne", "is", "at", "king's", "landing"},
      new int[] {2, 1, 1, 1, 1, 1, 1}
    );
    
    final String[] stops = {
      "the", "a", "an", "of", "in", "on", "at", "it", "is"
    };
    
    println(concordance);
    
    for (String s : stops)  concordance.remove(s);
    
    println(concordance);
    exit();
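
Note that `remove()` deletes the entries from the dictionary itself, so the stop-word counts are gone afterwards. If they should be kept, one option is to filter a copy instead. The following is a rough sketch in plain Java, assuming a `LinkedHashMap` as a stand-in for `IntDict`; `StopWordCopy` and `withoutStops` are hypothetical names. In Processing, `IntDict.copy()` should be able to play the same role as the copy constructor used here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: a plain-Java stand-in for the IntDict example.
class StopWordCopy {

  // Return a filtered copy of the counts; the original map is left unchanged.
  static Map<String, Integer> withoutStops(Map<String, Integer> counts, List<String> stops) {
    Map<String, Integer> filtered = new LinkedHashMap<>(counts);
    filtered.keySet().removeAll(stops); // removing from the key set removes the entries
    return filtered;
  }

  public static void main(String[] args) {
    Map<String, Integer> concordance = new LinkedHashMap<>();
    concordance.put("the", 2);
    concordance.put("iron", 1);
    concordance.put("throne", 1);
    concordance.put("is", 1);

    List<String> stops = Arrays.asList("the", "a", "an", "of", "in", "on", "at", "it", "is");

    Map<String, Integer> filtered = withoutStops(concordance, stops);
    System.out.println(filtered);    // counts with the stop words removed
    System.out.println(concordance); // full counts, still including the stop words
  }
}
```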
    
  • Thanks a lot to both of you, clear and helpful! I don't know why I fixated on checking only the String reference page and neglected the IntDict one. I thought the problem was somewhere else! Anyway, thanks again!
