I need to process text stored in a text file. My questions are:
1) How can I detect only the English words?
2) Is it possible to detect only valid English words (i.e. those you can find in a dictionary)?
Answers
Well, just read an English dictionary (a simple text file with one word per line, easy to find on the Internet), put the words in a HashSet, and for each word you find in your text file, check whether it is in the dictionary. You can expect some misses (e.g. proper nouns, abbreviations, numbers, etc.), but above a given percentage of hits you can probably tell the language. You can also check against several dictionaries to see which one gets the most hits.
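A minimal sketch of that idea in Java, in case it helps. The class name `DictionaryCheck`, the method names, and the command-line arguments are all made up for illustration, and the tokenizer is a deliberate simplification: it lower-cases the text and splits on anything that is not a letter or an apostrophe, so numbers are dropped rather than counted as misses.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, not part of any library.
class DictionaryCheck {

    // Load a word list (one word per line) into a HashSet, lower-cased.
    public static Set<String> loadDictionary(String wordListPath) throws IOException {
        Set<String> words = new HashSet<>();
        for (String line : Files.readAllLines(Paths.get(wordListPath), StandardCharsets.UTF_8)) {
            String w = line.trim().toLowerCase(Locale.ROOT);
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Assumption: a "word" is a run of letters and apostrophes.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}']+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Keep only the tokens that appear in the dictionary.
    public static List<String> validWords(String text, Set<String> dictionary) {
        List<String> valid = new ArrayList<>();
        for (String token : tokenize(text)) {
            if (dictionary.contains(token)) {
                valid.add(token);
            }
        }
        return valid;
    }

    // Fraction of tokens found in the dictionary (0.0 for empty input).
    public static double hitRatio(String text, Set<String> dictionary) {
        List<String> tokens = tokenize(text);
        if (tokens.isEmpty()) {
            return 0.0;
        }
        return (double) validWords(text, dictionary).size() / tokens.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java DictionaryCheck <wordlist> <textfile>");
            return;
        }
        Set<String> dictionary = loadDictionary(args[0]);
        String text = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
        System.out.println("Valid words: " + validWords(text, dictionary));
        System.out.println("Hit ratio: " + hitRatio(text, dictionary));
    }
}
```

To compare languages, you could call `hitRatio` once per dictionary and pick the highest score. What ratio counts as "English enough" is something you would have to tune on your own data.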
Also, this Java library looks promising for language detection:
http://code.google.com/p/language-detection/