Is there a design pattern for an associative dictionary (aka multifile concordance)?
in Programming Questions • 3 years ago
Greetings all,
I am trying to create an associative dictionary. This is a database which contains a record for every unique word occurring in an arbitrary set of text files. Each record should contain the word, a list of the files which contain the word, and, for each of those files, the indices of every occurrence of the word in that file. Something like this:
Word      Files       Occurrences
able      file1.txt   0, 20, 35, ....
          file2.txt   330, 450, 453....
baker     file1.txt   1, 15
          file3.txt   1, 4, 9
charlie   fileN.txt   i, i1, i2, ...
What I have so far is a HashMap keyed by word, using a Word class as the value. The Word class contains another HashMap, this time with one entry for each file that contains an occurrence of the given word. Each of those entries holds an ArrayList of the indices at which the word occurs in that file. I've been struggling with this for a while. Is there a design pattern for this situation? I'd hate to reinvent the wheel, especially as I'm a Processing (and programming generally) noob.
I've attached my code below - it's messy, I know. Sorry.
HashMap words;    // Maps each word (String) to its Word record
String[] tokens;  // Array of all words from the current input file
int counter;      // Index of the current word within 'tokens'

void setup() {
  // Note: in Processing 3 and later this is a method call, sketchPath()
  String path = sketchPath+"/data";
  String[] files = listFileNames(path);
  println(files);
  words = new HashMap();
  for (int a=0; a<files.length; a++) {  // repeat this block for each file in 'files'
    // Load file and chop it up
    String[] lines = loadStrings(files[a]);
    String allText = join(lines, " ");
    tokens = splitTokens(allText, " ,.?!:;[]-");
    println("Done chopping up file #: "+a+" Called "+files[a]);
    // Look at words one at a time.
    // The index must restart at 0 for every file, otherwise every file
    // after the first is skipped or only partly read.
    counter = 0;
    while (counter<tokens.length) {
      String s = tokens[counter];
      // Is the word in the HashMap?
      if (words.containsKey(s)) {
        println("database already has an entry for "+s.toUpperCase());
        Word w = (Word) words.get(s);
        w.updateWord(files[a], counter);
      }
      else {
        println("make a new entry for "+s.toUpperCase());
        Word w = new Word(files[a], s, counter);
        words.put(s, w);
      }
      counter++;
    }  // end of while loop for each word in 'tokens'
  }  // End of for loop for each file
  dumpDictionary();
}
/* A routine to print out the contents of the main hash map:
   each word, the number of files it occurs in, and its indices per file
*/
void dumpDictionary() {
  Iterator i = words.keySet().iterator();
  while (i.hasNext()) {
    String w = (String) i.next();
    print(w+": ");
    Word wd = (Word) words.get(w);
    HashMap fi = wd.fileIndices;
    int fs = fi.size();
    println(fs);
    // The inner map is keyed by file name (not by word), so iterate over its keys
    Iterator j = fi.keySet().iterator();
    while (j.hasNext()) {
      String fileName = (String) j.next();
      ArrayList ix = (ArrayList) fi.get(fileName);
      print("  "+fileName+": ");
      int sz = ix.size();
      for (int k=0; k<sz; k++) {
        Integer m = (Integer) ix.get(k);
        print(m.val+" ");
      }
      println();
    }
  }
}
// This function returns all the files in a directory as an array of Strings
String[] listFileNames(String dir) {
  File file = new File(dir);
  if (file.isDirectory()) {
    String names[] = file.list();
    return names;
  } else {
    // If it's not a directory, return an empty array so the caller's loop
    // simply does nothing instead of failing on null
    return new String[0];
  }
}
/* Word object to store in the associative dictionary.
   Each object stores the indices of every occurrence of the word in every file that's sent to it.
   Indices were first kept in an array (updated with append()); this version keeps an
   ArrayList per file instead.
   All checking is handled inside the class.
*/
class Word {
  int count;            // total number of occurrences across all files
  String word;
  HashMap fileIndices;  // A HashMap of file name -> ArrayList of indices, 1 entry per file

  Word() {  // null-arg constructor
    fileIndices = new HashMap();
  }

  Word(String fileName, String word, int index) {
    fileIndices = new HashMap();
    this.word = word;
    count = 1;
    ArrayList indices = new ArrayList();
    indices.add(new Integer(index));
    fileIndices.put(fileName, indices);
  }

  /* 'updateWord()' method takes a file name and index as parameters.
     If 'fileName' is already stored, just append 'index' to that file's list; otherwise add
     'fileName' to the database and set its first entry to 'index'.
  */
  void updateWord(String fileName, int index) {
    //println("update this word with "+fileName+" and index "+index);
    // First, check to see if this file has already been added to this word
    if (fileIndices.containsKey(fileName)) {
      //println("we've already had this name");
      // The map holds a reference to the list, so adding to it is enough
      ArrayList temp = (ArrayList) fileIndices.get(fileName);
      temp.add(new Integer(index));
    }
    // otherwise, add a new binding to the HashMap
    else {
      //println("this name is new");
      ArrayList temp = new ArrayList();
      temp.add(new Integer(index));
      fileIndices.put(fileName, temp);
    }
    count++;  // add 1 to word count
  }

  void dumpWord() {  // a debugging method to print the word's indices
    int len = fileIndices.size();
    println("\nthis word occurs in "+len+" file(s)");
    Iterator i = fileIndices.keySet().iterator();
    while (i.hasNext()) {
      String n = (String) i.next();
      ArrayList ids = (ArrayList) fileIndices.get(n);
      print("key "+n+" has these indices: ");
      for (int j=0; j<ids.size(); j++) {
        Integer x = (Integer) ids.get(j);
        print(x.val+" | ");
      }
      println();
    }
  }

  int getSize() {  // Return size of the HashMap (number of files containing the word)
    return(fileIndices.size());
  }

  // 'incCount()' method just increments the total count variable
  void incCount() {
    count++;
  }
}  // End of Word class
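// Small wrapper so that int indices can be stored in an ArrayList.
// Note: inside this sketch the class hides java.lang.Integer.
class Integer {
  int val;

  Integer() {
    this.val = 0;
  }

  Integer(int i) {
    this.val = i;
  }

  int integerVal() {
    return(this.val);
  }
}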
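
For comparison, the word -> file -> list-of-indices structure described at the top is commonly called an inverted index (a positional one, since it keeps the offsets). Below is a minimal sketch of that same shape using typed collections in plain Java. It assumes Java 8 or later, and the names `Concordance`, `add`, `addFile` and `lookup` are made up for illustration; they are not part of the sketch above or of Processing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class: a typed version of the nested word/file/indices structure.
class Concordance {
    // word -> (file name -> token positions of the word within that file)
    private final Map<String, Map<String, List<Integer>>> index = new TreeMap<>();

    // Record one occurrence of 'word' at token position 'position' in 'fileName'.
    void add(String word, String fileName, int position) {
        index.computeIfAbsent(word, w -> new TreeMap<>())
             .computeIfAbsent(fileName, f -> new ArrayList<>())
             .add(position);
    }

    // Split 'text' on the same delimiter characters the sketch passes to
    // splitTokens(), and index every token with a per-file position counter.
    void addFile(String fileName, String text) {
        String[] tokens = text.split("[ ,.?!:;\\[\\]-]+");
        int position = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue;  // a leading delimiter produces one empty token
            }
            add(token, fileName, position++);
        }
    }

    // All files containing 'word', each with its list of positions
    // (an empty map if the word was never seen).
    Map<String, List<Integer>> lookup(String word) {
        return index.getOrDefault(word, new TreeMap<>());
    }

    public static void main(String[] args) {
        Concordance c = new Concordance();
        c.addFile("file1.txt", "able baker able");
        c.addFile("file2.txt", "baker, charlie.");
        System.out.println(c.lookup("able"));   // prints {file1.txt=[0, 2]}
        System.out.println(c.lookup("baker"));  // prints {file1.txt=[1], file2.txt=[0]}
    }
}
```

TreeMap is used here only so that words and file names print in sorted order; a HashMap would work the same way if ordering does not matter. With the generic types in place, no casts and no hand-written Integer wrapper are needed.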