I am unsure how to parse a text file containing words and frequency of use into an array

edited August 2015 in How To...

"non 145637 di 129302 che 128567 è 105309 e 99519 la 92752 il 83487 un 81762 a 78680 per 59380 in 48533 una 48229 mi 46947 sono 45347 ho 39064 ma 36205 l' 35870 lo 35124 ha 34790 le 34746 si 33241 ti 32019..." is an extract of content from the text file. I want to get it into a format associating the number with the word preceding it.. There are about 500 000 words in the text file representing the frequency of occurrence ( taken from Wikipedia). I am very new to Processing so apologise for not even having attempted a Processing script. Its the logic to use to separately identify alpha and numeric that I am not sure about.

Answers

  • edited August 2015 Answer ✓

    If there is a line break after each pair (e.g. non 145637), you are almost there.

    step 1

        String lines[] = loadStrings("list.txt");
    
        println("there are " + lines.length + " lines");
    
        for (int i = 0 ; i < lines.length; i++) {
          println(lines[i]);
        }
    

    What does it give you? Your text file must be named list.txt (or change the name list.txt in the sketch / code).

    step 2

    Now use splitTokens() to divide each line into 2 parts:

    size (1000, 600);
    
    String lines[] = loadStrings("list.txt");
    
    println("there are " + lines.length + " lines");
    
    println("---------------------");
    
    // one slot per line, so the arrays can never be too small
    String[] wordsList     = new String[lines.length];
    int[]    frequencyList = new int[lines.length];
    
    for (int i = 0; i < lines.length; i++) {
      // splitTokens() splits on whitespace: temp[0] is the word, temp[1] the number
      String[] temp = splitTokens(lines[i]);
      if (temp.length < 2) {
        continue;  // skip empty or incomplete lines
      }
      println(lines[i] + " -> " + temp[0] + " - " + temp[1]);
    
      wordsList[i] = temp[0];
      frequencyList[i] = int(temp[1]);
    }
    
    println("---------------------");
    
    // show result 
    for (int i = 0; i < lines.length; i++) {
      if (wordsList[i] == null) {
        continue;  // this line was skipped above
      }
      println(wordsList[i] + " : " + frequencyList[i]);
    }
    

    ;-)
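
If the file turns out to hold all pairs on a single line, as the extract in the question suggests, the same idea should still work: split on whitespace and read the tokens two at a time. Below is a minimal plain-Java sketch of that logic. It is not part of the original answer, and the class name `PairParser` and its methods are made up for illustration. It assumes frequencies are plain digit strings of at most 9 digits; a word that is itself all digits would be skipped.

```java
class PairParser {
  // True if the token is a run of 1 to 9 digits, i.e. a frequency
  // that Integer.parseInt can read without overflowing.
  static boolean isNumber(String token) {
    return token.matches("\\d{1,9}");
  }

  // Walks the tokens two at a time and fills the two parallel arrays.
  // Returns the number of pairs stored; stray tokens are skipped.
  static int parse(String text, String[] words, int[] freqs) {
    String[] tokens = text.trim().split("\\s+");
    int capacity = Math.min(words.length, freqs.length);
    int count = 0;
    int i = 0;
    while (i + 1 < tokens.length && count < capacity) {
      if (!isNumber(tokens[i]) && isNumber(tokens[i + 1])) {
        words[count] = tokens[i];
        freqs[count] = Integer.parseInt(tokens[i + 1]);
        count++;
        i += 2;
      } else {
        i++; // out of step: move on one token and try again
      }
    }
    return count;
  }

  public static void main(String[] args) {
    String text = "non 145637 di 129302 che 128567 l' 35870";
    String[] words = new String[10];
    int[] freqs = new int[10];
    int n = parse(text, words, freqs);
    for (int i = 0; i < n; i++) {
      System.out.println(words[i] + " : " + freqs[i]);
    }
  }
}
```

In a Processing sketch, `splitTokens()` could stand in for `split("\\s+")`, and the two arrays play the same role as wordsList and frequencyList.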

  • Thank you so much for your help. I will study the method you have shown me to understand it better. Now I need to display the results in the sketch window.

  • Answer ✓
      // show graphical: put these lines inside the for loop over i
      text(wordsList[i], 20, i*20+29);
      line(120, i*20+29, 
        120+frequencyList[i]/500, i*20+29);
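
The arithmetic in that snippet can be checked outside Processing. The plain-Java sketch below only restates the numbers used above (rows 20 pixels apart starting at y = 29, one pixel per 500 occurrences); the class `BarLayout` and its method names are hypothetical. It also shows how many rows fit into a window of a given height, which matters with a list this long.

```java
class BarLayout {
  static final int ROW_HEIGHT = 20; // vertical distance between rows, as in the snippet
  static final int TOP_OFFSET = 29; // y position of row 0
  static final int SCALE = 500;     // one pixel per 500 occurrences

  // y coordinate of row i, i.e. i*20+29
  static int rowY(int i) {
    return i * ROW_HEIGHT + TOP_OFFSET;
  }

  // Bar length in pixels; integer division, so anything below 500 gives 0.
  static int barLength(int frequency) {
    return frequency / SCALE;
  }

  // Number of rows whose y coordinate is still inside the window.
  static int rowsThatFit(int windowHeight) {
    if (windowHeight < TOP_OFFSET) {
      return 0;
    }
    return (windowHeight - TOP_OFFSET) / ROW_HEIGHT + 1;
  }

  public static void main(String[] args) {
    System.out.println("bar for 145637: " + barLength(145637) + " px");
    System.out.println("rows in a 600 px window: " + rowsThatFit(600));
  }
}
```

With the 600-pixel-high window from the sketch, only the first 29 rows are visible, so the drawing loop could stop there instead of running over every line.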
    
  • Thanks very much for your help. Much appreciated

  • edited August 2015

    there are different ways to do this

    • you could scale the size of the words depending on their frequency

    • you could go 3D

    • you could have vertical lines

    • you could place them in a circle

    • you could have a mouse-over effect that displays the frequency in a small rectangle
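
As a rough illustration of the first idea in the list, the plain-Java sketch below maps a frequency to a text size by linear interpolation. The class `TextScale`, the method `sizeFor` and the size range of 10 to 48 are assumptions chosen for the example, not something from the thread.

```java
class TextScale {
  // Linearly maps a frequency in [minFreq, maxFreq] to a size in [minSize, maxSize].
  static float sizeFor(int freq, int minFreq, int maxFreq, float minSize, float maxSize) {
    if (maxFreq <= minFreq) {
      return minSize; // degenerate range: avoid dividing by zero
    }
    float t = (float) (freq - minFreq) / (maxFreq - minFreq);
    t = Math.max(0f, Math.min(1f, t)); // clamp values outside the range
    return minSize + t * (maxSize - minSize);
  }

  public static void main(String[] args) {
    // Highest and lowest frequencies from the extract in the question.
    int maxFreq = 145637;
    int minFreq = 32019;
    int[] samples = {145637, 92752, 32019};
    for (int freq : samples) {
      System.out.println(freq + " -> size " + sizeFor(freq, minFreq, maxFreq, 10f, 48f));
    }
  }
}
```

In a sketch, Processing's built-in `map()` does the same interpolation (without the clamping), and the result would be passed to `textSize()` before calling `text()`.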

  • at the end of the sketch:

    text("Word frequency", width-111, height-322);
    text("scale 1:500", width-111, height-299);
    println("done ---------------------");
    
  • use colors with fill() and thicker bars with rect() instead of my lines

  • for mouse interaction you need setup() and draw()

    When you count your own words, use a HashMap or similar, iirc.
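
For counting words yourself, one possible approach with a `HashMap` looks like the plain-Java sketch below. The class name `WordCounter` is made up for illustration, and splitting on whitespace after lower-casing is a simplifying assumption; real text would also need punctuation handling.

```java
import java.util.HashMap;
import java.util.Map;

class WordCounter {
  // Counts how often each whitespace-separated token occurs, ignoring case.
  static Map<String, Integer> count(String text) {
    Map<String, Integer> counts = new HashMap<>();
    for (String token : text.toLowerCase().split("\\s+")) {
      if (token.isEmpty()) {
        continue; // split() yields an empty token for empty or leading-space input
      }
      counts.merge(token, 1, Integer::sum); // insert 1 or add 1 to the existing count
    }
    return counts;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = count("la casa e la strada e la piazza");
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      System.out.println(entry.getKey() + " : " + entry.getValue());
    }
  }
}
```

A `HashMap` does not keep its entries in any particular order, so the counts would still need sorting before being drawn as a ranked list.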
