Is there a design pattern for an Associative dictionary (aka multifile concordance) ?

procklipr..

in Programming Questions • 3 years ago

Greetings all,

I am trying to create an associative dictionary. This is a database which contains a record for every unique word occurring in an arbitrary set of text files. Each record should contain the word, a list of the files which contain the word, and, for each of those files, the indices of every occurrence of the word in that file. Something like this:

Word       Files        Occurrences
able       file1.txt    0, 20, 35, ...
           file2.txt    330, 450, 453, ...
baker      file1.txt    1, 15
           file3.txt    1, 4, 9
charlie    fileN.txt    i, i1, i2, ...

What I have so far is a HashMap using a Word class as the value. The Word class contains another HashMap, this time for each file that contains an occurrence of a given word. It also contains an ArrayList for the indices at which that word occurs. I've been struggling with this for a while. Is there a design pattern for this situation? I'd hate to reinvent the wheel, especially as I'm a Processing (and programming generally) noob.

I've attached my code below - it's messy, I know. Sorry.

HashMap words;    // Maps each word to its Word object
String[] tokens;  // Array of all words from the current input file
int counter;      // Index of the current token within the current file

void setup() {
  String path = sketchPath+"/data";
  String[] files = listFileNames(path);
  println(files);
  words = new HashMap();
  for(int a=0; a<files.length; a++) { // repeat this block for each file in 'files'
    // Load file and chop it up
    String[] lines = loadStrings(files[a]);
    String allText = join(lines, " ");
    tokens = splitTokens(allText, " ,.?!:;[]-");
    println("Done chopping up file #: "+a+" Called "+files[a]);
    counter = 0; // restart the token index for each file
    // Look at words one at a time
    while(counter<tokens.length) {
      String s = tokens[counter];
      // Is the word in the HashMap
      if (words.containsKey(s)) {
        println("database already has an entry for "+s.toUpperCase());
        Word w=(Word) words.get(s);
        w.updateWord(files[a],counter);
      }
      else {
        println("make a new entry for "+s.toUpperCase());
        Word w = new Word(files[a], s,counter);
        words.put(s, w);
      }
      counter++;
    } // end of while loop for each word in 'tokens'
  } // End of for loop for each file
  dumpDictionary();
}

/* A routine to print out the contents of the main hash map
*/
void dumpDictionary() {
  Iterator i = words.keySet().iterator();
  while (i.hasNext()) {
    String w = (String) i.next();
    print(w+": ");
    Word wd = (Word) words.get(w);
    HashMap fi = wd.fileIndices;
    int fs = fi.size();
    println(fs);
    //Iterator j = fi.values().iterator();
    /*
    while(j.hasNext()) {
      ArrayList ix = (ArrayList) fi.get(w);
      int sz = ix.size();
      for(int k=0; k<sz; k++){
        Integer m = (Integer) ix.get(k);
      }
    }
    */
  }
}

// This function returns all the files in a directory as an array of Strings
String[] listFileNames(String dir) {
  File file = new File(dir);
  if (file.isDirectory()) {
    String names[] = file.list();
    return names;
  } else {
    // If it's not a directory
    return null;
  }
}

/* Word object to store in associative dictionary.
Each object should store the indices of the occurrence of the word in every file that's sent to it.
Indices should be kept as an Array this time; use append() to update
For next version try using an ArrayList.
ArrayList version seems to be working, now try it in hash map experiment.
Handle all checking inside class
*/
class Word {
  int count;
  String word;
  HashMap fileIndices; // A hashMap of fileIndexes, 1 entry per file

  Word() { // null-arg constructor
    fileIndices = new HashMap();
  }

  Word(String fileName, String word, int index) {
    fileIndices = new HashMap();
    this.word = word;
    count = 1;
    ArrayList indices = new ArrayList();
    indices.add(new Integer(index));
    fileIndices.put(fileName, indices);
  }

  /*'updateWord()' method takes a file name and index as parameters.
  If 'fileName' is already stored, just update the count for that file, otherwise add 'fileName' to database,
  and set its first entry as 'index'.
  */
  void updateWord(String fileName, int index) {
    //println("update this word with "+fileName+" and index "+index);
    // First, Check to see if this file has already been added to this word
    if(fileIndices.containsKey(fileName)) {
      //println("we've already had this name");
      ArrayList temp = (ArrayList) fileIndices.get(fileName);
      temp.add(new Integer(index));
      fileIndices.put(fileName, temp);
    }
    // otherwise, add a new binding to hashMap
    else {
      //println("this name is new");
      ArrayList temp = new ArrayList();
      temp.add(new Integer(index));
      fileIndices.put(fileName, temp);
    }
    count++; // add 1 to word count
  }

  void dumpWord() { // a debugging method to print the word's indices
    int len = fileIndices.size();
    println("\nthis word has "+len+" indices");
    Iterator i = fileIndices.keySet().iterator();
    while(i.hasNext()) {
      String n = (String) i.next();
      ArrayList ids = (ArrayList) fileIndices.get(n);
      print("key "+n+" has these indices: ");
      for(int j=0; j<ids.size(); j++) {
        Integer x = (Integer) ids.get(j);
        print(x.val+" | ");
      }
      println();
    }
  }

  int getSize() { // Return size of hashMap
    return(fileIndices.size());
  }

  //'incCount()' method just increments the total count variable
  void incCount() {
    count++;
  }
} // End of Word class

class Integer {
  int val;

  Integer() {
    this.val = 0;
  }

  Integer(int i) {
    this.val = i;
  }

  int integerVal() {
    return(this.val);
  }
}
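
For comparison, the structure described in the question (word, then file, then positions) is commonly called an inverted index or concordance. Below is a minimal plain-Java sketch of it, not the poster's code: the class name Concordance and its methods are made up for illustration, it takes text as strings instead of reading files, and it assumes Java 8 or later for computeIfAbsent().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a concordance: word -> file name -> token positions.
class Concordance {
  // TreeMap keeps words and file names sorted, which suits a printed dictionary.
  private final Map<String, Map<String, List<Integer>>> index = new TreeMap<>();

  // Tokenise one file's text and record the token position of every word.
  void addFile(String fileName, String text) {
    // Same delimiter set as the splitTokens() call in the sketch above.
    String[] tokens = text.split("[ ,.?!:;\\[\\]-]+");
    int position = 0;
    for (String token : tokens) {
      if (token.isEmpty()) continue; // split() can yield a leading empty token
      index.computeIfAbsent(token, k -> new TreeMap<>())
           .computeIfAbsent(fileName, k -> new ArrayList<>())
           .add(position);
      position++;
    }
  }

  // Positions of 'word' in 'fileName', or an empty list if there are none.
  List<Integer> positions(String word, String fileName) {
    Map<String, List<Integer>> files = index.get(word);
    if (files == null) return new ArrayList<>();
    return files.getOrDefault(fileName, new ArrayList<>());
  }

  public static void main(String[] args) {
    Concordance c = new Concordance();
    c.addFile("file1.txt", "able baker able");
    c.addFile("file2.txt", "baker, charlie");
    System.out.println(c.positions("able", "file1.txt"));  // prints [0, 2]
    System.out.println(c.positions("baker", "file2.txt")); // prints [0]
  }
}
```

The two computeIfAbsent() calls replace the "is the word known? is the file known?" branching: each creates the missing inner container on first use and otherwise returns the existing one.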

Replies(12)

kooogy

Re: Is there a design pattern for an Associative dictionary (aka multifile concordance) ?

3 years ago

i think this in updateWord is wrong (or at least inefficient)

ArrayList temp = (ArrayList) fileIndices.get(fileName);
temp.add(new Integer(index));
fileIndices.put(fileName, temp);

and it would be possible to

fileIndices.get(filename).add(new Integer(index));

(ie get the arraylist associated with this file and add the new index to it)

but i do know that your Integer class really shouldn't be there - it'll get in the way of the standard Integer class. just use the real one instead.
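
A small sketch of why the second put() is unnecessary: get() returns a reference to the list the map already holds, so adding to it changes the stored list. The class and method names here are hypothetical, and the map is given type parameters so the one-liner compiles; with the untyped HashMap used in the sketch above, the result of get() would need an (ArrayList) cast first.

```java
import java.util.ArrayList;
import java.util.HashMap;

class InPlaceUpdate {
  // Appends 'index' to the list stored under 'fileName'; assumes the key exists.
  static void addIndex(HashMap<String, ArrayList<Integer>> fileIndices,
                       String fileName, int index) {
    // get() hands back a reference to the stored list, not a copy,
    // so add() changes the list the map already holds. No put() is needed.
    fileIndices.get(fileName).add(Integer.valueOf(index));
  }

  public static void main(String[] args) {
    HashMap<String, ArrayList<Integer>> fileIndices =
        new HashMap<String, ArrayList<Integer>>();
    fileIndices.put("file1.txt", new ArrayList<Integer>());
    addIndex(fileIndices, "file1.txt", 20);
    addIndex(fileIndices, "file1.txt", 35);
    System.out.println(fileIndices.get("file1.txt")); // prints [20, 35]
  }
}
```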

procklipr..

Re: Re: Is there a design pattern for an Associative dictionary (aka multifile concordance) ?

3 years ago

Thanks for the feedback - I'm new to OOP, and sometimes stringing lots of .dot methods together is hard for me to grasp. Regarding the Integer class, I created it because I was unable to make an ArrayList using simple int's. Am I doing something wrong?

but i do know that your Integer class really shouldn't be there - it'll get in the way of the standard Integer class. just use the real one instead.

procklipr..

Re: Re: Is there a design pattern for an Associative dictionary (aka multifile concordance) ?

3 years ago

So I went and tried again to create an ArrayList of ints. And of course this time it worked. I must have been doing something wrong. I'll try to rewrite it that way.

UPDATE:

The problem isn't when I create the ArrayList of ints, but when I try to access it. For example, this doesn't work:

int k = (int) indices.get(i);

where indices is an ArrayList with elements added thus indices.add(1).

kooogy

Re: Re: Is there a design pattern for an Associative dictionary (aka multifile concordance) ?

3 years ago

my point was that there's already an Integer class, you defining another one will just confuse it.

so you can still use new Integer() but phi.lho has pointed out that Integer.valueOf() is better.

(in java5 this is easier due to the autoboxing - you give it an int and it'll automatically convert it to an Integer. also you'd be able to define your main HashMap as HashMap<String, HashMap<String, ArrayList<Integer>>> which (believe it or not) is clearer.)

btw use intValue() to convert from the Integer you extract to an int, i don't think you can just cast it. (i always have to look these things up, or use netbeans' completion)
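
A minimal sketch of the typed declaration mentioned above, assuming plain Java 5 or later (whether a given Processing release accepts the angle-bracket syntax is a separate question). The class name TypedDictionary and the helper firstPosition() are invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;

class TypedDictionary {
  // Hypothetical helper: first recorded position of 'word' in 'fileName'.
  // Assumes both keys exist and the list is not empty.
  static int firstPosition(HashMap<String, HashMap<String, ArrayList<Integer>>> words,
                           String word, String fileName) {
    // No casts: the compiler knows what each get() returns.
    ArrayList<Integer> positions = words.get(word).get(fileName);
    Integer boxed = positions.get(0); // the list stores Integer objects
    return boxed.intValue();          // explicit unboxing; 'return boxed;' also compiles
  }

  public static void main(String[] args) {
    HashMap<String, HashMap<String, ArrayList<Integer>>> words =
        new HashMap<String, HashMap<String, ArrayList<Integer>>>();
    ArrayList<Integer> positions = new ArrayList<Integer>();
    positions.add(330);               // autoboxing: the int 330 becomes an Integer
    HashMap<String, ArrayList<Integer>> files =
        new HashMap<String, ArrayList<Integer>>();
    files.put("file2.txt", positions);
    words.put("able", files);
    System.out.println(firstPosition(words, "able", "file2.txt")); // prints 330
  }
}
```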

procklipr..

Re: Re: Is there a design pattern for an Associative dictionary (aka multifile concordance) ?

3 years ago

(in java5 this is easier due to the autoboxing - you give it an int and it'll automatically convert it to an Integer. also you'd be able to define your main HashMap as HashMap<String, HashMap<String, ArrayList<Integer>>> which (believe it or not) is clearer.)

Is this something I can do in Processing? How? I can see how it's clearer. It's dense, but self-evident; you don't have to refer to other classes to know what it is.

btw, I didn't know about an Integer class in Java, I was just hacking around trying to get something to work when I was trying to get an integer back out of the ArrayList.

kooogy

Re: Re: Is there a design pattern for an Associative dictionary (aka multifile concordance) ?

3 years ago

i think the answer is 'it's complicated'. processing uses a version of java that'll handle it but the preparser (which handles the extra bits like 'color') couldn't. but i read somewhere (probably here) that this has changed, but don't know the details.

oh, wait, http://processing.org/bugs/bugzilla/598.html comment 7 onwards, says that it does. dunno what version of processing this relates to though.

procklipr..

Re: Re: Is there a design pattern for an Associative dictionary (aka multifile concordance) ?

3 years ago

i think the answer is 'it's complicated'.

It sure is! Too much for me right now. Trudging the javadocs learning about HashMap and ArrayList methods is hard enough for me right now! (I've been doing some more tinkering, see new comment)

PhiLho

Re: Is there a design pattern for an Associative dictionary (aka multifile concordance) ?

3 years ago

fileIndices.get(filename).add(new Integer(index));

It is better to use:

fileIndices.get(filename).add(Integer.valueOf(index));

instead... It may create fewer objects, so it is generally faster.

procklipr..

Re: Is there a design pattern for an Associative dictionary (aka multifile concordance) ?

3 years ago

Okay, I've been doing some more tinkering and I feel I've gotten a bit wiser. Here's a new test sketch that seems to be working. I'll try it with some real files soon. The thing that's been stumping me is that even though I populate the ArrayList with ints, when I retrieve them they come back as 'Objects'. I don't know how to convert them to ints. HOWEVER, I did find a nifty toString() method in the javadocs. With that I could first convert to a String and then to an int (see getNthIndex() at the end of the class). What I found curious was that I couldn't return int(r) directly, but had to cast it first: (int) int(r). Don't know why, but that works!

/* Word class: another attempt at using a HashMap within the class
*/
class Word {
  String word;          // The word to be tracked
  int count;            // Global count of occurrences of word
  HashMap fileIndexes;  // Maps filenames to ArrayList of ints

  // CONSTRUCTORS
  Word() { // Null-arg constructor
    fileIndexes = new HashMap();
  }

  Word(String word, String fileName, int index) {
    fileIndexes = new HashMap();
    this.word = word;
    count = 1;
    fileIndexes.put(fileName,new ArrayList());
    ArrayList temp = (ArrayList) fileIndexes.get(fileName);
    temp.add(index);
  }

  // METHODS
  void printWord() {
    print("\nThe word: "+word);
    int s = fileIndexes.size();
    println(" appears "+count+" times in "+s+" files. ");
    Iterator i = fileIndexes.keySet().iterator();
    while(i.hasNext()){
      String w = (String) i.next();
      print(w+" @ ");
      ArrayList temp = (ArrayList) fileIndexes.get(w);
      println(temp);
    }
    println();
  }

  void addFile(String fileName, int index) {
    ArrayList temp = new ArrayList();
    temp.add(index);
    fileIndexes.put(fileName, temp);
    count++;
  }

  void updateFile(String fileName, int index) {
    ArrayList temp = (ArrayList) fileIndexes.get(fileName);
    temp.add(index);
    fileIndexes.put(fileName, temp);
    count++;
  }

  int getNthIndex(String fileName, int nth) { // return the index value of the Nth occurrence
    ArrayList temp = (ArrayList) fileIndexes.get(fileName);
    if(nth<0 || nth>=temp.size()) { // valid positions run from 0 to size()-1
      return(-1); // return -1 if request is out of bounds
    }
    else {
      String r = (String) temp.get(nth).toString();
      //int rr = int(r);
      //return(rr);
      return( (int) int(r) );
    }
  }
}

kooogy

Re: Re: Is there a design pattern for an Associative dictionary (aka multifile concordance) ?

3 years ago

i said in one of the other comments "btw use intValue() to convert from the Integer you extract to an int"

but that was a bit vague and i forgot to mention that you need to cast from the Object to Long (Object is the base class in java, everything is an Object)

so this:

String r = (String) temp.get(nth).toString();
return( (int) int(r) );

becomes

return (Long)temp.get(nth).intValue(); // untested!

(get nth Object from temp, cast to a Long, get the int value of the long. this is where ArrayList<Long> is useful - means the compiler knows that the values in the list are longs so you don't have to remember to cast them.)

PhiLho

Re: Is there a design pattern for an Associative dictionary (aka multifile concordance) ?

3 years ago

kooogy, I hope you don't mind if I correct your helpful remarks...

I don't know why you mention the Long type, as it isn't mentioned earlier in the thread.
Besides, (Long)temp.get(nth).intValue(); won't do what you expect (though I haven't tested it either): if I am not mistaken, it will apply intValue() to the result of get(), which won't compile if the ArrayList isn't typed (as you showed it can be). Then it will cast the result to Long, which probably fails too (casting a primitive value to an object; I don't think autoboxing will work there).
It is better to do: ((Long) temp.get(nth)).intValue(); i.e. cast the result of get() to Long (supposing that is what was put there; otherwise an (Integer) cast is needed), then get its intValue().
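
The parenthesised form can be sketched as follows for an untyped list holding Integer objects, which is what the sketches in this thread store. The class and method names are hypothetical, and the bounds check is an added assumption about what an out-of-range request should return.

```java
import java.util.ArrayList;

class RawListAccess {
  // Hypothetical helper: returns the nth stored value, or -1 if nth is out of range.
  @SuppressWarnings("rawtypes")
  static int nthIndex(ArrayList temp, int nth) {
    if (nth < 0 || nth >= temp.size()) {
      return -1;
    }
    // The inner parentheses cast the Object returned by get() to Integer first;
    // only then is intValue() called, on the Integer.
    return ((Integer) temp.get(nth)).intValue();
  }

  @SuppressWarnings({"rawtypes", "unchecked"})
  public static void main(String[] args) {
    ArrayList indices = new ArrayList();      // untyped list, elements come back as Object
    indices.add(Integer.valueOf(1));
    indices.add(Integer.valueOf(15));
    System.out.println(nthIndex(indices, 1)); // prints 15
    System.out.println(nthIndex(indices, 5)); // prints -1
  }
}
```

Without the inner parentheses the method call binds more tightly than the cast, so the compiler looks for intValue() on Object and rejects the expression.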

To procklipressing: In Java, thus Processing, int and Integer are different things (and long and Long, Boolean and boolean, etc.).
The first one is a primitive type, taking a small amount of memory. The second one is an object, wrapping up the int value, offering some methods on it, making it slower and bigger, but allowing it to be accepted by collections like ArrayLists, which handle only objects.
Since Java 1.5, the boundary is blurred by the auto(un)boxing feature, where you can do:

int ip = 42;
Integer io = ip;

while previously you had to do something like Integer io = new Integer(ip); ie. explicitly wrap the primitive value in an object.
Last note about Integer.valueOf(ip): it maintains a small cache of Integer objects (values from -128 to 127 are always cached), so if the requested value is already cached it returns the existing object instead of creating a duplicate. This is useful if you have, say, a million such objects with lots of duplicates.
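
A short sketch of the caching behaviour, using only Integer.valueOf() from the standard library; the class and method names are made up. The Javadoc guarantees the cache only for values from -128 to 127, so the example makes no claim about object identity outside that range.

```java
class ValueOfCache {
  // True when two valueOf() calls for the same int return the very same object.
  static boolean sameObject(int value) {
    Integer a = Integer.valueOf(value);
    Integer b = Integer.valueOf(value);
    return a == b; // reference comparison, not a comparison of values
  }

  public static void main(String[] args) {
    // Values from -128 to 127 are always served from the cache.
    System.out.println(sameObject(42)); // prints true
    // Outside that range identity depends on the JVM, so compare values with equals().
    System.out.println(Integer.valueOf(1000).equals(Integer.valueOf(1000))); // prints true
  }
}
```

The practical consequence is to compare Integer objects with equals() (or unbox them) and never rely on == for values.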

procklipr..

Re: Is there a design pattern for an Associative dictionary (aka multifile concordance) ?

3 years ago

Thanks all, this is proving to be a valuable education.

To procklipressing: In Java, thus Processing, int and Integer are different things (and long and Long, Boolean and boolean, etc.).
The first one is a primitive type, taking a small amount of memory. The second one is an object, wrapping up the int value, offering some methods on it, making it slower and bigger, but allowing it to be accepted by collections like ArrayLists, which handle only objects.

For example, that was something about which I had no idea. I assumed that if I populated an ArrayList with ints, I would get ints out when I called ArrayList.get(). I quickly learned that even though I passed ints to ArrayList.add(), the ArrayList turned them into objects. Ignorant of the Integer class, I was also ignorant of any way to extract an int value. So thanks!

Also, this little nugget is especially illuminating:

Besides, (Long)temp.get(nth).intValue(); won't do what you expect (though I haven't tested it either): if I am not mistaken, it will apply intValue() to the result of get(), which won't compile if the ArrayList isn't typed (as you showed it can be). Then it will cast the result to Long, which probably fails too (casting a primitive value to an object; I don't think autoboxing will work there).
It is better to do: ((Long) temp.get(nth)).intValue(); i.e. cast the result of get() to Long (supposing that is what was put there; otherwise an (Integer) cast is needed), then get its intValue().

This is exactly the sort of thing that I'm struggling to get a firm grasp on. When 'intValue()' was first introduced into the thread, I had no idea what to do with it. I assumed it was a method, but of which class? That I didn't know. Explaining the difference the extra parens make in the above example helps me realize that an expression evaluates to a type (class), and so has access to the methods of that class even if it's not an explicit object variable. Until now, I more or less assumed that one needed a variable of a given class to which to append the .dot methods of that class. In other words, I thought this form:

<var. name>.<method>

was required, and had no idea this form:

(<expression that evaluates to type 'class'>).<method in 'class'>

...was even possible. I have seen in the online documentation some examples of complex use of .dot methods, but I don't think I grokked it 'til now.

BTW, the main hashmap code seems to be working now. I'm now working on the next step: how to make use of the dictionary in my application. I've pasted the code here (the Word class is the same as above, so I won't include it now).

HashMap words;    // HashMap object
String[] tokens;  // Array of all words from input file
int counter;

void setup() {
  String path = sketchPath+"/data";
  String[] files = listFileNames(path);
  words = new HashMap();
  for(int a=0; a<files.length; a++) { // repeat this block for each file in 'files'
    String[] lines = loadStrings(files[a]); // Load file and chop it up
    String allText = join(lines, " ");
    tokens = splitTokens(allText, " ,.?!:;[]-");
    while(counter<tokens.length) { // Look at words one at a time
      String s = tokens[counter];
      if (words.containsKey(s)) { // Is the word in the HashMap?
        Word w=(Word) words.get(s);
        if(w.hasFile(files[a])) { // if File already in word, then update()
          w.updateFile(files[a],counter);
        }
        else { // else add the file to the word with addFile()
          w.addFile(files[a],counter);
        }
      }
      else {
        Word w = new Word(s, files[a], counter); // Constructed with word, filename, index
        words.put(s, w);
      }
      counter++;
    } // end of while loop for each word in 'tokens'
    counter = 0;
  } // End of for loop for each file
  // Now we've populated the hashmap, lets see if we can see what's in it
  Iterator i = words.keySet().iterator();
  while (i.hasNext()) {
    String fn = (String) i.next();
    Word wd = (Word) words.get(fn);
    wd.printWord();
  }
}
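
The setup() code above calls w.hasFile(), which does not appear in the Word class posted earlier in the thread. A plausible sketch, assuming it simply checks the per-file map, is given below in plain Java. The class name WordSketch and the record() method are hypothetical; record() shows one way the addFile()/updateFile() branches could be folded together so the caller no longer needs the check.

```java
import java.util.ArrayList;
import java.util.HashMap;

// Hypothetical reduced version of the Word class; only the parts needed here.
class WordSketch {
  int count; // total occurrences across all files
  HashMap<String, ArrayList<Integer>> fileIndexes =
      new HashMap<String, ArrayList<Integer>>();

  // Assumed implementation: true if this word has been seen in 'fileName'.
  boolean hasFile(String fileName) {
    return fileIndexes.containsKey(fileName);
  }

  // Adds one occurrence, creating the per-file list on first use.
  void record(String fileName, int index) {
    ArrayList<Integer> positions = fileIndexes.get(fileName);
    if (positions == null) {
      positions = new ArrayList<Integer>();
      fileIndexes.put(fileName, positions);
    }
    positions.add(index);
    count++;
  }

  public static void main(String[] args) {
    WordSketch w = new WordSketch();
    w.record("file1.txt", 0);
    w.record("file1.txt", 20);
    w.record("file2.txt", 330);
    System.out.println(w.hasFile("file1.txt")); // prints true
    System.out.println(w.hasFile("file3.txt")); // prints false
    System.out.println(w.count);                // prints 3
  }
}
```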
