Use 2D array to find coincidences of words

Rayle · March 2018

I am trying to find out how many times words from a list appear together in the sentences of a given text. The first thing I did was to find the sentences containing a particular word ("law" in the example below), put them in an array, and count how many times the other words appear in them with an IntDict.

This is the code

StringList results;
IntDict coincidence;

void setup() {

  String [] text = loadStrings("text.txt");
  String onePhrase = join(text, " ").toLowerCase();
  String [] phrases = splitTokens(onePhrase, ".?!");
  String search = "\\blaw.?\\b";
  String [] wordList = loadStrings("wordList.txt");

  // Both containers must be created before they are used
  results = new StringList();
  coincidence = new IntDict();

  // Keep only the sentences that match the search pattern
  for (int i = 0; i < phrases.length; i++) {
    String [] matching = match(phrases[i], search);
    if (matching != null) {
      results.append(phrases[i]);
    }
  }

  String [] resultsArray = results.array();

  String joinResults = join(resultsArray, " ");
  String [] splitResults = splitTokens(joinResults, " -,.:;¿?¡!/_");

  // Count every token that is also in the word list
  for (int i = 0; i < splitResults.length; i++) {
    for (String searching : wordList) {
      if (splitResults[i].equals(searching)) {
        coincidence.increment(splitResults[i]);
      }
    }
  }
}

But if I want to count how many times each word of the list appears together with each of the others, I have to repeat the process many times.

I tried with a 2D array, but it doesn't work. Now I am stuck and I don't know how to proceed. I would appreciate any ideas.

StringList list;
String [] search;

void setup() {

  String [] text = loadStrings("text.txt");
  String onePhrase = join(text, " ").toLowerCase();
  String [] phrases = splitTokens(onePhrase, ".?!");

  Table table = loadTable("listOfWords.csv", "header");

  // The list must be created before appending to it
  list = new StringList();

  for (int i = 0; i < table.getRowCount(); i++) {
    TableRow row = table.getRow(i);
    list.append(row.getString("Word"));
  }
  // Convert to an array once, after all the words are loaded
  search = list.array();

  // Print the matches of every word in every sentence
  for (int i = 0; i < phrases.length; i++) {
    for (int j = 0; j < search.length; j++) {
      String [][] matching = matchAll(phrases[i], search[j]);

      printArray(matching);
    }
  }
}

Chrisir · March 2018

Please post the data files you are using.

You wrote:

"how many times words from a list appear together in the sentences of a given text"

This means you need to count something.

The rest is quite unclear.

"words from a list"

Do you mean one word? Or two? Or all of them?

"words from a list appear together"

Do you mean in the order they have in the list, or in any order?

So the given text might be "The law helps society."

Word list: law, society?

Count this or not?

Rayle · March 2018

Thank you, Chrisir. I am sorry for not being clear enough. The idea is to know how many times a word, like "law", appears in the same sentence as each of the other words: the co-occurrences in the sentences. Something like this: [image: Sample]

I have, for example, this list of words in a csv file with the times they appear in the full text.

Word        Times
power       415
sect        253
government  215
men         207
will        205
nature      205
society     203
people      203
state       197
man         195

With the first code it is possible to know which words from the list appear in the same sentence as "power" (for example), and how many times:

{"Matches": [
  {
    "Word": "power",
    "Times": 415
  },
  {
    "Word": "people",
    "Times": 102
  },
  {
    "Word": "society",
    "Times": 100
  },
  {
    "Word": "government",
    "Times": 75
  },
  {
    "Word": "nature",
    "Times": 73
  },
  {
    "Word": "man",
    "Times": 70
  },
  {
    "Word": "will",
    "Times": 69
  },
  {
    "Word": "state",
    "Times": 65
  }
]}

If I want to know which words of my list appear in the same sentence as the word "people", and how many times, I have to run the code again, changing the word "power" to "people" in the String search, and so on. It is boring and not very practical. So I was thinking about how to do the same thing for all the words in the list at once.

The text I was experimenting with is the Second Treatise of Government, downloaded from here: gutenberg.org/cache/epub/7370/pg7370.txt.

This is the list of the 20 most frequent words:

Word         Times
power        415
government   215
men          207
will         205
nature       205
society      203
people       203
state        197
man          195
law          194
laws         168
legislative  144
force        119
property     106
common       102
war          98
good         96
consent      95
authority    92
life         90

Chrisir · March 2018

If I want to know how many words of my list, and how many times, they appear in the same sentence as the word "people", I have to run the code again, changing in the String search the word power for people, and so on. It is boring and not very practical.

Not sure, but the technical solution is to go over the list of the 20 most frequent words in a nested for loop.

This means the outer for loop runs the inner for loop once per word,

so each word is checked against every other word.

The visual equivalent is a grid like a chess board (or your result table above) where you visit each field once: 1 2 3 4 5 6 ...

for (int i = 0; i < table20Words.length; i++) {
  for (int j = 0; j < table20Words.length; j++) {

       boolean result = occurTogetherInOneSentence( table20Words[i],table20Words[j]);
       if(result==true) {
            resultTable[i][ j] = resultTable[i][ j] + 1; 
       } // if

   }//for
 }//for

The function occurTogetherInOneSentence receives two Strings and returns a boolean value. It loops over all the sentences and checks for both words in each sentence, e.g. if (sentence[i].contains(str1) && sentence[i].contains(str2)) return true;
resultTable contains the values above, like 415, 214...
Define it before setup(): int [][] resultTable = new int [20][20];

koogs · March 2018

I think that can be improved on - looping over all the sentences 400 times can't be the most efficient way.

My mind says hashmaps. But it's 8am on a bank holiday so that's all it says at the moment.

For each sentence, find the popular words that the sentence contains, increment the relevant counts... That last bit may be non trivial.

Chrisir · March 2018

ah, true

Loop over all sentences with sentence_i:

 for (int sentence_i = 0; sentence_i < sentence.length; sentence_i++) {

    for (int i = 0; i < table20Words.length; i++) {
      for (int j = 0; j < table20Words.length; j++) {

           boolean result = false;

           // does this sentence contain both words?
           if (sentence[sentence_i].contains(table20Words[i]) && sentence[sentence_i].contains(table20Words[j]))
                result = true;
           if (result == true) {
                resultTable[i][j] = resultTable[i][j] + 1;
           } // if

       }//for
     }//for

 }//for

Chrisir · March 2018

"increment the relevant counts... That last bit may be non trivial."

I solved this with the 2d array resultTable

Chrisir · March 2018

My mind screams hashMap all the time too, but we can leave that for later.

koogs · March 2018

For each sentence
  // Find
  Clear the list
  For each word
    If sentence contains word
      Add word to list
    End
  End

  // Collate
  For each word1 in list
    For each word2 in list
      Increment count for word1/word2 combination
    End
  End

End
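
The find and collate steps of the pseudocode above could be sketched like this. It is plain Java, not code from the thread: the class name FindAndCollate and the method name count are made up for illustration, and it uses contains() for the matching, as the thread does at this point. In a Processing sketch the class wrapper and the static keyword would be dropped.

```java
import java.util.ArrayList;
import java.util.List;

class FindAndCollate {
  // Returns a words.length x words.length table where table[a][b] is the
  // number of sentences that contain both words[a] and words[b].
  static int[][] count(String[] sentences, String[] words) {
    int[][] table = new int[words.length][words.length];
    List<Integer> found = new ArrayList<>();
    for (String sentence : sentences) {
      // Find: remember the index of every listed word in this sentence
      found.clear();
      for (int w = 0; w < words.length; w++) {
        if (sentence.contains(words[w])) found.add(w);
      }
      // Collate: increment the count for every pair of found words
      for (int a : found) {
        for (int b : found) {
          table[a][b]++;
        }
      }
    }
    return table;
  }
}
```

With this layout the diagonal entry table[a][a] ends up holding the number of sentences that contain words[a] at all, and each sentence is scanned once per word instead of once per pair of words.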

koogs · March 2018

(my non trivial worry was that you might have 5 matches in a sentence and working out all the pairs of those things in order to increment the correct counts. But, yeah, it's just a nested loop over the same list twice)

koogs · March 2018

It was more obvious at 9:30 than it was at 8

Rayle · March 2018

Thank you very much, Chrisir and koogs. Your ideas are very inspiring, and they point to a nice solution I had not thought of. I will try the 2D array before trying a HashMap.

Rayle · March 2018

Thank you again for your valuable ideas. This is the final code:

String[] textLines;
String oneLine;
StringList words;
String[] sentences;
String[] forSearching;
String[] matchsentences;
Table tabla;
int[] lines;
int[][] resultTable = new int [50][50];

float radianes = radians(360-45);

void setup() {

  size(800, 600);

  textLines = loadStrings("TratadoGovCivil.txt");
  oneLine = join(textLines, " ").toLowerCase();
  sentences = splitTokens(oneLine, ".?!");
  words = new StringList();
  tabla = loadTable("numeroPalabras.csv", "header");
  lines = new int[50];

  for (int i = 0; i < tabla.getRowCount(); i ++) {
    TableRow fila = tabla.getRow(i);
    words.append(fila.getString("Palabra"));
    forSearching = words.array();
  }

  //========CODE BASED IN CHRISIR AND KOOGS IDEAS==========
  // https://forum.processing.org/two/discussion/27481/use-2d-array-to-find-coincidences-of-words

  for( int a = 0; a <sentences.length; a ++){
    for (int i = 0; i < forSearching.length; i ++) {
      for (int j = 0; j < forSearching.length; j ++) {

        boolean result = false;

        if (sentences[a].contains(forSearching[i]) && sentences[a].contains(forSearching[j]))
          result = true;
        if (result == true) {
          resultTable[i][j] = resultTable[i][j] + 1;
      }
      int[] lines2 = new int[50*50];
      lines = append(lines2,resultTable[i][j]);
     }
    }
   }

  //============================================
}


void draw() {

  background(255);
  fill(0);

  for(int i = 0; i < forSearching.length; i ++) {
    pushMatrix();
    if (i == 0) translate(140*(i+1), 40);
    else translate((40*(i+1))+100, 40);
    rotate(radianes);
    textAlign(LEFT);
    text(forSearching[i], 0, 0);
    popMatrix();
    textAlign(RIGHT);
    text(forSearching[i], 120, (20*(i+1))+40);
  }

  for(int i = 0; i < 50; i ++) {
    for (int j = 0; j < 50; j ++) {
      pushMatrix();
      translate(100, 40);
      textAlign(CENTER);
      text(resultTable[i][j], 40*(i+1), 20*(j+1));
      popMatrix();
    }
  }
  noLoop();
}

And this is the nice output:

[image: Untitled]

koogs · March 2018

What are lines 44 and 45 doing? In fact, what is lines?

Lines 37, 41 and 42 look pointless

Chrisir · March 2018

it's true, 37/40/41 can be optimized. My bad.

Rayle · March 2018

You are right, koogs: lines 44 and 45 are the remains of an attempt to write code that saves the 2D array to a csv file. I forgot to delete them.

String[] textLines;
String oneLine;
StringList words;
String[] sentences;
String[] forSearching;
String[] matchsentences;
Table tabla;
int[][] resultTable = new int [50][50];

float radianes = radians(360-45);

void setup() {

  size(800, 600);

  textLines = loadStrings("TratadoGovCivil.txt");
  oneLine = join(textLines, " ").toLowerCase();
  sentences = splitTokens(oneLine, ".?!");
  words = new StringList();
  tabla = loadTable("numeroPalabras.csv", "header");


  for (int i = 0; i < tabla.getRowCount(); i ++) {
    TableRow fila = tabla.getRow(i);
    words.append(fila.getString("Palabra"));
    forSearching = words.array();
  }

  //==========CODE BASED IN CHRISIR AND KOOGS IDEAS============================
  // https://forum.processing.org/two/discussion/27481/use-2d-array-to-find-coincidences-of-words

  for( int a = 0; a <sentences.length; a ++){
    for (int i = 0; i < forSearching.length; i ++) {
      for (int j = 0; j < forSearching.length; j ++) {
        if (sentences[a].contains(forSearching[i]) && sentences[a].contains(forSearching[j]))
          resultTable[i][j] = resultTable[i][j] + 1;
     }
    }
   }

  //===============================================================
}


void draw() {

  background(255);
  fill(0);

  for(int i = 0; i < forSearching.length; i ++) {
    pushMatrix();
    if (i == 0) translate(140*(i+1), 40);
    else translate((40*(i+1))+100, 40);
    rotate(radianes);
    textAlign(LEFT);
    text(forSearching[i], 0, 0);
    popMatrix();
    textAlign(RIGHT);
    text(forSearching[i], 120, (20*(i+1))+40);
  }

  for(int i = 0; i < 50; i ++) {
    for (int j = 0; j < 50; j ++) {
      pushMatrix();
      translate(100, 40);
      textAlign(CENTER);
      text(resultTable[i][j], 40*(i+1), 20*(j+1));
      popMatrix();
    }
  }
  noLoop();
}

Chrisir · March 2018

Well done!

koogs · March 2018

println("It was the fault of the government".contains("men"));
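
The line above prints true, because contains() looks for substrings. A minimal sketch of a whole-word check, in plain Java with a hypothetical helper name (containsWord), shows the difference. It relies on the regular expression word boundary \b and on Pattern.quote, both standard in java.util.regex.

```java
import java.util.regex.Pattern;

class WholeWord {
  // True when `word` occurs in `sentence` as a complete word,
  // not as part of a longer word.
  static boolean containsWord(String sentence, String word) {
    // Pattern.quote keeps any special characters in the word literal
    Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
    return p.matcher(sentence).find();
  }
}
```

This sketch matches the exact word only, so "men" would not count a plural or possessive form.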

Chrisir · March 2018

Good point

Also the diagonal of the table should be empty

power shouldn’t be checked against power

koogs · March 2018

the sentence splitting code also doesn't handle abbreviations, like "i.e." and "U.S."
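
One possible way around this, sketched in plain Java under stated assumptions: the dots of known abbreviations are hidden behind a placeholder character before splitting and restored afterwards. The class name SentenceSplitter, the abbreviation list passed in, and the placeholder character are all assumptions of this sketch, not something from the thread.

```java
import java.util.ArrayList;
import java.util.List;

class SentenceSplitter {
  // Splits text on . ? ! while keeping the listed abbreviations intact.
  static String[] split(String text, String[] abbreviations) {
    final char mark = (char) 1; // control character, assumed absent from the text
    String guarded = text;
    for (String abbr : abbreviations) {
      // hide the dots of each abbreviation, e.g. "u.s." loses its dots
      guarded = guarded.replace(abbr, abbr.replace('.', mark));
    }
    List<String> out = new ArrayList<>();
    for (String part : guarded.split("[.?!]+")) {
      String sentence = part.replace(mark, '.').trim(); // restore the dots
      if (!sentence.isEmpty()) out.add(sentence);
    }
    return out.toArray(new String[0]);
  }
}
```

A known limitation: when an abbreviation ends a sentence, its final dot is hidden too, so that sentence is merged with the next one.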

Rayle · April 2018

As usual, you both go straight to the point. I was thinking about how to avoid the problems you mention. For the first one ("government" contains "men", etc.) I think it is possible to use a regular expression with the match function (lines 38, 39, and 59). Also, we can erase abbreviations like "i.e." or "U.S." to avoid the problem when we split the string. I think it is not an optimal solution, because "i.e." is not very significant, but "U.S." could be important. Anyway, there is an English stopwords list to remove the abbreviations (lines 22, 26, 42-51).

Of course, the diagonal is interesting. In fact, we only need half of the square matrix (the upper or the lower half). If we put 1 in the diagonal values and replace one half of the matrix with the reciprocals of the other half, we obtain a positive reciprocal matrix (I have to confess I don't know how to do this in the for loop, so I use a trick in line 97). If we think of the co-occurrences as weights, it is possible to calculate the eigenvector of the matrix and compare this vector with those from other texts or authors. The results could be interesting.
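
A minimal sketch of how such a matrix could be filled inside a loop, and how its principal eigenvector could be approximated by power iteration. This is plain Java rather than the sketch's own code. The class name ReciprocalMatrix, both method names, and the choice to replace zero counts with 1 (so that no entry is divided by zero) are assumptions made for illustration.

```java
import java.util.Arrays;

class ReciprocalMatrix {
  // Builds a positive reciprocal matrix from a square count table:
  // 1 on the diagonal, the count above it, 1/count below it.
  // Assumption: zero counts are replaced by 1 so every entry stays positive.
  static double[][] fromCounts(int[][] counts) {
    int n = counts.length;
    double[][] m = new double[n][n];
    for (int i = 0; i < n; i++) {
      m[i][i] = 1.0;
      for (int j = i + 1; j < n; j++) {
        double c = Math.max(1, counts[i][j]);
        m[i][j] = c;
        m[j][i] = 1.0 / c;
      }
    }
    return m;
  }

  // Approximates the principal eigenvector by power iteration,
  // normalised so that its entries sum to 1.
  static double[] principalEigenvector(double[][] m, int iterations) {
    int n = m.length;
    double[] v = new double[n];
    Arrays.fill(v, 1.0 / n);
    for (int it = 0; it < iterations; it++) {
      double[] next = new double[n];
      double sum = 0;
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) next[i] += m[i][j] * v[j];
        sum += next[i];
      }
      for (int i = 0; i < n; i++) next[i] /= sum;
      v = next;
    }
    return v;
  }
}
```

For a 2x2 table with a count of 4 between the two words, the matrix is [[1, 4], [0.25, 1]] and the normalised eigenvector is (0.8, 0.2).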

Thank you for your useful comments. It is very kind of you.

Here is the new code:

String[] textLines;
String oneLine;
String depuratedLine;
StringList words;
StringList depuratedText;
String[] finalText;
String[] sentences;
String[] erase;
String[] allWords;
String[] forSearching;
String[] forHeader;
Table tabla;
int[][] resultTable = new int [50][50];

float radianes = radians(360-45);

void setup() {

  size(800, 600);

  textLines = loadStrings("TratadoGovCivil.txt");
  erase = loadStrings("stopwords-en.txt");
  oneLine = join(textLines, " ").toLowerCase();
  allWords = splitTokens(oneLine, " -,:;¿¡/_");
  //sentences = splitTokens(oneLine, ".?!");
  depuratedText = new StringList(allWords);
  words = new StringList();
  tabla = loadTable("numeroPalabras.csv", "header");


  for (int i = 0; i < tabla.getRowCount(); i ++) {
    TableRow fila = tabla.getRow(i);
    words.append(fila.getString("Palabra"));
    forSearching = words.array();
    forHeader = words.array();
  }

  for (int i = 0; i<forSearching.length; i ++) {
    forSearching[i] = "\\b" + forSearching[i] + ".?\\b";
  }

  for (int i = 0; i <erase.length; i ++) {
    for (int j = 0; j <depuratedText.size(); j ++) {
   if (depuratedText.get(j).equals(erase[i])) {
     depuratedText.remove(j);
     finalText = depuratedText.array();
     }
    }
    depuratedLine = join(finalText, " ");
    sentences = splitTokens(depuratedLine, ".?!");
  }

  //==========CODE BASED IN CHRISIR AND KOOGS IDEAS============================
  // https://forum.processing.org/two/discussion/27481/use-2d-array-to-find-coincidences-of-words

  for( int a = 0; a <sentences.length; a ++){
    for (int i = 0; i < forSearching.length; i ++) {
      for (int j = 0; j < forSearching.length; j ++) {
        if (match(sentences[a], forSearching[i]) != null && match(sentences[a], forSearching[j]) != null){
        if(i == j) {
          resultTable[i][j] = 1;
        } else {
          resultTable[i][j] = resultTable[i][j] + 1; 
        }  
     }
    }
   }
  }
  //===============================================================
}


void draw() {

  background(255);
  fill(0);

  for(int i = 0; i < forSearching.length; i ++) {
    pushMatrix();
    if (i == 0) translate(140*(i+1), 40);
    else translate((40*(i+1))+100, 40);
    rotate(radianes);
    textAlign(LEFT);
    text(forHeader[i], 0, 0);
    popMatrix();
    textAlign(RIGHT);
    text(forHeader[i], 120, (20*(i+1))+40);
  }

  for(int i = 0; i < 50; i ++) {
    for (int j = 0; j < 50; j ++) {
      pushMatrix();
      translate(100, 40);
      textAlign(CENTER);
      if (j > i) {
        fill(191);
        text("1/" + resultTable[i][j], 40*(i+1), 20*(j+1));
      }else{
      fill(0);
      text(resultTable[i][j], 40*(i+1), 20*(j+1));
      }
      popMatrix();
    }
  }
  noLoop();
}

[image: Untitled 2]

jeremydouglass · April 2018

An alternative to building your own in Java:

If you are also planning to do language processing operations other than building a co-occurrence matrix, it might be worth developing in Processing.py (Python mode) so that you can use NLTK.

For example:

https://stackoverflow.com/questions/17458751/python-symmetric-word-matrix-using-nltk

Note in particular the discussions of sparse matrix approaches vs. dense matrix approaches.
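
To illustrate the difference while staying in Java: a dense approach allocates one cell for every pair of words (like the int[50][50] table in this thread), while a sparse approach stores only the pairs that actually occur. A minimal sketch of the sparse variant follows. The class name PairCounts and the "a|b" key format are assumptions for illustration, and the matching uses contains(), so it shares the substring caveat raised earlier.

```java
import java.util.HashMap;
import java.util.Map;

class PairCounts {
  // Counts, for every unordered pair of listed words, the number of
  // sentences containing both. Keys look like "law|power", with the
  // two words in alphabetical order. Pairs that never occur have no entry.
  static Map<String, Integer> count(String[] sentences, String[] words) {
    Map<String, Integer> counts = new HashMap<>();
    for (String sentence : sentences) {
      for (int i = 0; i < words.length; i++) {
        if (!sentence.contains(words[i])) continue;
        for (int j = i + 1; j < words.length; j++) {
          if (!sentence.contains(words[j])) continue;
          String a = words[i], b = words[j];
          String key = a.compareTo(b) < 0 ? a + "|" + b : b + "|" + a;
          counts.merge(key, 1, Integer::sum);
        }
      }
    }
    return counts;
  }
}
```

For 20 or 50 words the dense table is small and simpler to draw; the map mostly pays off when the word list grows and most pairs never co-occur.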

Rayle · April 2018

Great idea. Thank you jeremydouglass.

koogs · April 2018

for( int a = 0; a <sentences.length; a ++){
  for (int i = 0; i < forSearching.length; i ++) {

For the upper right of the matrix only you can start the inner loop at the diagonal (or just beyond it). Basically, if m > n then you've already counted n,m so don't bother with m,n

for( int a = 0; a <sentences.length; a ++){
  for (int i = a + 1; i < forSearching.length; i ++) {

Rayle · April 2018

Thanks, koogs. I didn't think about it. It's easy.

koogs · April 2018

I'd test match() in isolation too, I'm not sure it doesn't work just the same as contains() in the way you are calling it.

Rayle · April 2018

Thanks for the suggestion. I tested it with a short text, and it seems to work well. Words like "punishment" or "government" are not counted as "men".

koogs · April 2018

The last message I wrote with the loops is all wrong, those two lines should be the i and j loops, not the a and i loops (typing on phone is tricky...)
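
With that correction applied, the three loops could look like the sketch below: the sentence loop stays as it was and the j loop starts just beyond the diagonal. It is plain Java with hypothetical names (UpperTriangle, count) and uses contains() only to keep the example short.

```java
class UpperTriangle {
  // Fills only the cells above the diagonal: table[i][j] with j > i.
  static int[][] count(String[] sentences, String[] words) {
    int n = words.length;
    int[][] table = new int[n][n];
    for (int a = 0; a < sentences.length; a++) {
      for (int i = 0; i < n; i++) {
        if (!sentences[a].contains(words[i])) continue;
        for (int j = i + 1; j < n; j++) { // start just beyond the diagonal
          if (sentences[a].contains(words[j])) table[i][j]++;
        }
      }
    }
    return table;
  }
}
```

The diagonal and the lower half stay at 0, so they can be left empty or filled in afterwards from the upper half.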

Rayle · April 2018

Thanks everyone for your help. I think the final result is nice. [image: archivo-01]

jeremydouglass · April 2018

Beautiful work!

(Should law / laws be lemmatized into one entry, or are they conceptually distinct here?)

Rayle · April 2018

Thank you, jeremydouglass. Usually, in this text (the Second Treatise of John Locke) when the author says law, he is talking about the natural law. The "laws" are those promulgated by the government. So I thought it was interesting to keep both terms.

jeremydouglass · April 2018

@Rayle -- Very interesting. Some of your most common sentence co-occurrences seem like they might be part of recurring phrases that would show up in bigram or trigram counts: law / nature ("natural law") state / nature ("natural state" "state of nature") etc.
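
A minimal sketch of bigram counting in plain Java, to show what such counts look like. The class name Bigrams is hypothetical, the tokenizer keeps only the letters a-z (an assumption that suits plain English text), and bigrams are allowed to run across sentence boundaries.

```java
import java.util.HashMap;
import java.util.Map;

class Bigrams {
  // Counts every pair of consecutive words, e.g. "of nature".
  static Map<String, Integer> count(String text) {
    String[] tokens = text.toLowerCase().split("[^a-z]+");
    Map<String, Integer> counts = new HashMap<>();
    String previous = null;
    for (String token : tokens) {
      if (token.isEmpty()) continue; // split can yield a leading empty token
      if (previous != null) {
        counts.merge(previous + " " + token, 1, Integer::sum);
      }
      previous = token;
    }
    return counts;
  }
}
```

Trigram counts follow the same pattern with the two previous tokens kept instead of one.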

Interesting that 'law' and 'laws' are usually distinct but have very high sentence co-occurrence. If comparison and contrast is a key part of the treatise (which I do not know well) then perhaps you might try separating all sentences containing only 'law' or 'laws' and then analyzing those two groups separately (if there is enough text to do this). Or try topic modeling....

Rayle · April 2018

Very good idea @jeremydouglass. I have coded a way to obtain short sentences from the text with the key word in the middle ("law", "laws", etc.). I can see the context now. This is the code:

String [] lines;
String oneLine;
String [] erase;

String [] sentences;

IntDict coincidence;
String [] dictionary;

String searching = "\\blaw.?\\b";

StringList output;

void setup() {

  size(800, 700);

  lines = loadStrings("TratadoGovCivil.txt");
  oneLine = join(lines, " ").toLowerCase();
  sentences = split(oneLine, " ");
  output = new StringList();    
}

void draw() {

  background(255);
  fill(0);

  textAlign(CENTER);

  for (int i = 0; i < sentences.length; i ++) {
   String [] matching = match(sentences[i], searching);
   if(matching != null && i <= 10) {
     output.append(subset(sentences, i - i, 20));
   }
   else if (matching != null && i > 10) {
     output.append(subset(sentences, i - 10, 20));
   }
  }

  String [] finalOutput = output.array();
  String [] shortSentences = new String[finalOutput.length/20];

  int j = -1;

  for (int i = 0; i < shortSentences.length; i++) shortSentences[i] = "";
  for (int i = 0; i < finalOutput.length; i ++) {
   if (i%20 == 0) j ++;
   shortSentences[j] = shortSentences[j] + finalOutput[i] + " ";

}
  for(int i = 0; i < shortSentences.length; i++) {
  text(shortSentences[i], width/2, 25*i);
  text("*****", width/2, (25*i)+15);
  }

  saveStrings("data/short_sentences.txt", shortSentences);
  noLoop();
}

And a screenshot of the output:

[image: sentences]

jeremydouglass · April 2018

Very, very nice! And thanks so much for sharing it with the forum.

I'm not sure if anyone has ever implemented a word dendrogram for Processing, but these kinds of visualizations are excellent for text exploration of the roles specific keywords play in relation to other words. To increase the power of seeing keyword relationships in a dendrogram or markov chain visualization of what comes before and after "law", you can strip function words and stopwords out of your sentences before feeding them in.

Chrisir · April 2018

Did you see the PI visualization recently?

You might arrange your words in a circle, connect those that appear in the same sentence with a line, and make the line thicker the more sentences they share.
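
The geometry behind that idea is small enough to sketch. The following plain Java computes evenly spaced points on a circle and maps a shared-sentence count to a line thickness. The class and method names are hypothetical, and the linear mapping from count to thickness is just one possible choice.

```java
class CircleLayout {
  // Returns n points (x, y) evenly spaced on a circle centred at (cx, cy).
  static double[][] positions(int n, double cx, double cy, double radius) {
    double[][] points = new double[n][2];
    for (int i = 0; i < n; i++) {
      double angle = 2 * Math.PI * i / n;
      points[i][0] = cx + radius * Math.cos(angle);
      points[i][1] = cy + radius * Math.sin(angle);
    }
    return points;
  }

  // Maps a co-occurrence count to a thickness between 0 and maxWeight.
  static double lineWeight(int count, int maxCount, double maxWeight) {
    if (count <= 0 || maxCount <= 0) return 0;
    return maxWeight * count / maxCount;
  }
}
```

In a Processing sketch the points would be passed to line() and the thickness to strokeWeight(), with resultTable[i][j] supplying the count for words i and j.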

jeremydouglass · April 2018

Here is a link to the Pi visualization that @Chrisir mentioned.

https://forum.processing.org/two/discussion/27283/pi-visualization

This form of visualization is called a chord diagram:

https://en.wikipedia.org/wiki/Chord_diagram

With text, chord diagrams tend to be more legible with smaller sets of words, often filtered by type (e.g. only nouns as they co-occur in sentences, or only verbs, et cetera).

Rayle · April 2018

The Pi visualization is very nice. Thanks for the reference. I will try to do the same thing with the words. I don't know if it will be equally clear with 20 or 50 words instead of 10 digits. The dendrogram is also a good idea, and perhaps clearer. In fact, I was trying to code an arc diagram in Processing, like this one made with Protovis

http://mbostock.github.io/protovis/ex/arc.html

[image: ArcGraph4]

but it is not very convincing to me. Now I am trying to use the 2D array from the first code to make an ordered matrix that shows regularities, something like Jacques Bertin's physical matrices

http://www.aviz.fr/diyMatrix/

I must confess that it is proving difficult to work out the code to reorder the columns and rows of the matrix. If I get it right I will share it with you.
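
One simple starting point, sketched in plain Java: compute an ordering of the indices, then apply the same permutation to rows and columns. Ordering by descending row total is only a basic heuristic chosen for this sketch; it is not Bertin's reordering method. The class and method names are hypothetical, and the table is assumed to be square.

```java
import java.util.Arrays;
import java.util.Comparator;

class MatrixOrdering {
  // Returns the index order that sorts rows by descending row total.
  static Integer[] orderByRowSum(int[][] table) {
    int n = table.length;
    final int[] sums = new int[n];
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; i++) {
      order[i] = i;
      for (int j = 0; j < n; j++) sums[i] += table[i][j];
    }
    Arrays.sort(order, Comparator.comparingInt((Integer i) -> -sums[i]));
    return order;
  }

  // Applies the same permutation to both rows and columns.
  static int[][] permute(int[][] table, Integer[] order) {
    int n = order.length;
    int[][] out = new int[n][n];
    for (int r = 0; r < n; r++) {
      for (int c = 0; c < n; c++) {
        out[r][c] = table[order[r]][order[c]];
      }
    }
    return out;
  }
}
```

The word labels have to be permuted with the same order array, otherwise the headers no longer match the cells. Any other ordering (alphabetical, for instance) can be plugged in by producing a different order array.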

Thanks again for your valuable ideas. You go beyond a simple help with the code. It is very kind of you.

jeremydouglass · April 2018

@Rayle -- interesting idea about the arc diagram. When you say "it is not very convincing for me" I think I might agree -- I find that both arc diagrams and sometimes non-ribbon chord diagrams

https://datavizcatalogue.com/methods/non_ribbon_chord_diagram.html

tend to give a general impression of "this is complicated!" rather than being easy to read and conveying details of the data structure. Being able to see the ribbon thickness is what makes them useful to me.

https://datavizcatalogue.com/methods/chord_diagram.html

So, for example, adding ribbon thickness to a basic arc diagram

...like in the Bible visualization arc diagram by Philipp Steinweber and Andreas Koller.

Advanced chord diagram systems like Circos have lots of ideas about how you can add details to make complex systems more readable....

http://circos.ca/

...but at a basic level, having a limited number of nodes and varying the thickness of the ribbons seems like the most important thing for indicating relationships in a way that viewers can easily understand.

jb4x · April 2018

Hi everyone,

Another form of graph that can be used to render the result is the force-directed graph.

It is used to show the co-occurrence of things. It can be messy, but it can also look pretty cool!

jeremydouglass · April 2018

@jb4x -- Interesting idea. Force-directed graphs don't always work well for actually understanding sentence word data (as you say, messy) but they can be done in Processing -- an example:

https://zshiba.github.io/visualization/force_directed_graph.html

Rayle · April 2018

With the help of jb4x I made these two images, inspired by the Pi visualization. They are very nice. Nevertheless, as jeremydouglass said, they seem too complicated. Maybe changing the thickness of the ribbons could be the solution, or using different colors. These graphs offer different information from the matrix: the connections are between two consecutive words. They are more related to the evolution of the texts than to their inner meaning. Anyway, they are nice and give us some clues to discover the similarities and differences between authors.

[image: Locke]

[image: Hobbes]

jeremydouglass · April 2018

They are beautiful, @Rayle!

A quick improvement to make comparative reading easier (if that is the goal) would be to change the radial sorting order to alphabetical, rather than sorting by descending frequency.

Then the word sequence would be sorted in the same order, and e.g. "power" would have the same color in both images (purple), regardless of arc size, making visual comparisons between the two arc sizes much easier (more power in Locke). This would also make it easier for viewers to visually look up relationships, e.g. "god <-> man", without as much visual hunting.

I might also suggest printing the radial keywords in white with a color-coordinated underline or a colored dot next to them. You really have to squint to see e.g. "law" in Hobbes.

Rayle · April 2018

Changing the colors was easy. I'm afraid that sorting the words alphabetically takes a little more time.

[image: Lockedef-01]

jeremydouglass · April 2018

Very nice! Love the color palette.

Using a fixed ordering (such as alphabetical) rather than ordering by descending frequency is mainly important for comparison between two images of different data -- looking up terms is a secondary benefit.

If you sort your beginning anchor points by ending destination along each arc then the collections of lines will more strongly resemble ribbons, even if you don't change anything else about how the lines are rendered. Everything will appear less wispy and more banded.

jb4x · April 2018

The renders look really nice, very well done ! :-bd

As you wrote, there are two limitations with that way of rendering. The first is that it is hard to read, and the second is that, as we discussed, some data are hidden because you use the progression of the text and miss some co-occurrence links if there are more than 2 words in a sentence.

For the first point I think that the solution of changing the thickness of the ribbon, as you wrote, would be the way to go.

For the second point, I don't really have a concrete idea yet, but maybe you should first treat the sentences that have 2 words in them, then the ones that have 3 words, then the ones that have 4, etc. This way you can have full control over the render. Another thought is to play with ribbon transparency. Let's say you have a thick ribbon connecting "law" to "nature". Maybe 40% of those "law" are also connected to "laws", so you have another ribbon with 40% of the thickness of the first one, overlaid to show that link.

Rayle · April 2018

Very inspiring ideas. I will try some of them: a new sorting method and ribbons with different thickness. Don't forget that the original code for the graphs is from jb4x. Thanks to his suggestions, I made only slight modifications to his original idea.

Since the goal is to show the similarities between authors who are usually considered very different (something like the reference jeremydouglass gave us: similardiversity.net/), I think it is possible to give up some information without a problem.

Thanks to both of you.

jeremydouglass · April 2018

Re: original code by @jb4x -- ah, yes, of course, the PI visualization code.

Yes, for that original PI visualization sorting wouldn't make any sense -- if I understand that one right, the whole point is to see the way that random distributions of digit correspondences form a texture, like a diaphanous surface.

However, in your case, you want to see arc-to-arc relationships, like in Circos, so sorting makes a lot more sense. The order in which digits occur in PI is the randomness itself, which is the subject of that viz -- but the order in which individual bigrams occur in your source text doesn't have anything to do with the co-occurrence ratios you are interested in, and it partially hides those ratios.

I didn't implement arc sorting for the original PI viz, and I don't know what your updated code looks like, but here is a simple example of a data structure that creates a sorted correspondence list for each arc in the PI code -- there are 10 IntLists, 0-9, and each one contains bands of arc destinations, sorted 0-9.

  IntList[] digitSorted = new IntList[10];
  for (int i = 0; i < 10; i++) {
    digitSorted[i] = new IntList();
  }
  for (int i = 0; i < dataString.length(); i++) {
    digitSorted[i%10].append(Character.getNumericValue(dataString.charAt(i)));
  }
  for (int i = 0; i < 10; i++) {
    digitSorted[i].sort();
  }
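
For the word version, where each number in the sequence stands for a word, one possible reading of the same idea is sketched below in plain Java: every consecutive pair contributes its second word to the list of its first word, and each list is then sorted. The class name SortedDestinations and the int[] input are assumptions; the thread does not show how the word sequence is actually stored.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SortedDestinations {
  // sequence holds word indices in text order, each in 0..wordCount-1.
  // Returns, for every word, the sorted list of words that follow it.
  static List<List<Integer>> build(int[] sequence, int wordCount) {
    List<List<Integer>> byOrigin = new ArrayList<>();
    for (int i = 0; i < wordCount; i++) byOrigin.add(new ArrayList<>());
    for (int i = 0; i + 1 < sequence.length; i++) {
      // the word at position i is the origin, the next word the destination
      byOrigin.get(sequence[i]).add(sequence[i + 1]);
    }
    for (List<Integer> destinations : byOrigin) Collections.sort(destinations);
    return byOrigin;
  }
}
```

Drawing each origin's lines in the order of its sorted list groups the lines that end on the same arc, which is what makes them read as bands.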

Rayle · April 2018

The only updates I made in the PI viz code were to transform the words into a String of numbers, (like PI), each number corresponding to a word. I just needed to loadTable with the correspondences to return the word afterwards.

I will try with your code. Thanks.
