Topic: parser bother (Read 1024 times) |
|
ryan* Guest
|
parser bother
« on: Feb 20th, 2003, 1:13pm » |
|
Hi, I was wondering if either Fry or Reas could explain how the parser commands work and what they do. I'm trying to work towards a simple HTML parser that just scans for certain things, like images. The commands I mean are splitInts(), splitFloats(), splitStrings() and join(). Thanks!
|
|
|
|
skloopy
|
Re: parser bother
« Reply #1 on: Feb 21st, 2003, 2:19am » |
|
BTW, sorry if this is too much trouble.. :^)
|
|
|
|
REAS
|
Re: parser bother
« Reply #2 on: Feb 21st, 2003, 3:40am » |
|
join() is not implemented yet, probably in _52_.
all the splits work like this:
splitInts(string to be split, token that is separating)
for example:
Code:
String s = "0001+0002";
int[] data = splitInts(s, '+');
println(data[0]);
println(data[1]);
let me know if you need more...
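For readers without the alpha build at hand, here is a minimal plain-Java sketch of what a split-and-convert call of this shape might do. The class and method names (SplitSketch, splitIntsSketch) are hypothetical and this is not Processing's actual implementation; it only assumes the behaviour described above, a string cut at a single separator character with each piece converted to an int.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Hypothetical stand-in for splitInts(); not Processing's own code.
    // Cuts s at every occurrence of delim and parses each piece as an int.
    static int[] splitIntsSketch(String s, char delim) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == delim) {
                pieces.add(s.substring(start, i));
                start = i + 1;
            }
        }
        pieces.add(s.substring(start)); // the piece after the last separator

        int[] out = new int[pieces.size()];
        for (int i = 0; i < out.length; i++) {
            // Integer.parseInt accepts leading zeros, so "0001" becomes 1.
            // An empty or non-numeric piece throws NumberFormatException.
            out[i] = Integer.parseInt(pieces.get(i).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        int[] data = splitIntsSketch("0001+0002", '+');
        System.out.println(data[0]); // prints 1
        System.out.println(data[1]); // prints 2
    }
}
```

The sketch throws on pieces that are not numbers; whether the real splitInts() does the same is not stated in this thread.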
|
|
|
|
skloopy
|
Re: parser bother
« Reply #3 on: Feb 22nd, 2003, 7:39am » |
|
thanks! Are these the same methods you use to isolate certain words in the Processing parser, or is there another Java or custom class? Basically all I need to do is take the body of text, search for <img src=" and then get the string from there until the next " . Do you have any tips on how I can do that? I mean, I could just write something that scans brute-force and looks for letter sequences, but you already wrote a parser in Processing..
|
|
|
|
benelek
|
Re: parser bother
« Reply #4 on: Feb 22nd, 2003, 11:33am » |
|
the java spec for the String class provides several good methods for operating on a string:
http://java.sun.com/products/jdk/1.2/docs/api/java/lang/String.html
this may take longer than Casey's way, but if you're intent on using built-in java stuff...
Code:
String theCode = "bladidada <img src=theAddress>";
// 9 is the number of characters in: <img src=
int startIndex = theCode.indexOf("<img src=");
int endIndex = theCode.indexOf(">", startIndex);
String theAddress = theCode.substring(startIndex+9, endIndex);
println(theAddress);
| -jacob
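The same indexOf()/substring() idea can be extended to collect every quoted image address in a page and to stop cleanly when nothing more is found. This is only a sketch under stated assumptions: the class and method names (IndexOfSketch, quotedImageSources) are made up for illustration, and it assumes tags written exactly as <img src="..."> in lower case with double quotes.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexOfSketch {
    // Hypothetical helper: returns the text between <img src=" and the next "
    // for every such tag in html. Case-sensitive and quote-dependent.
    static List<String> quotedImageSources(String html) {
        String marker = "<img src=\"";
        List<String> out = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = html.indexOf(marker, from);
            if (start < 0) break;               // no more image tags
            int valueStart = start + marker.length();
            int end = html.indexOf('"', valueStart);
            if (end < 0) break;                 // unterminated quote, give up
            out.add(html.substring(valueStart, end));
            from = end + 1;                     // continue after this address
        }
        return out;
    }

    public static void main(String[] args) {
        String page = "x <img src=\"a.gif\"> y <img src=\"b.jpg\"> z";
        for (String src : quotedImageSources(page)) {
            System.out.println(src);
        }
    }
}
```

Checking for -1 before calling substring() matters here, because indexOf() returns -1 when the text is missing and substring() throws on a negative index.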
|
|
|
|
benelek
|
Re: parser bother
« Reply #5 on: Feb 22nd, 2003, 11:38am » |
|
mmm, actually this brings me to something that's been nagging me for a while. does anybody know why i can't use " and ' interchangeably in P5, as in javascript?
|
|
|
|
fry
|
Re: parser bother
« Reply #6 on: Feb 22nd, 2003, 3:33pm » |
|
on Feb 22nd, 2003, 11:38am, benelek wrote: mmm, actually this brings me to something that's been nagging me for a while. does anybody know why i can't use " and ' interchangeably in P5, as in javascript?
hm, hadn't even thought of implementing it. you might post that to suggestions and see if others are into it as well.
|
|
|
|
fry
|
Re: parser bother
« Reply #7 on: Feb 22nd, 2003, 3:50pm » |
|
on Feb 22nd, 2003, 7:39am, Ryan wrote: thanks! Are these the same methods you use to isolate certain words in the Processing parser, or is there another Java or custom class? Basically all I need to do is take the body of text, search for <img src=" and then get the string from there until the next " . Do you have any tips on how I can do that? I mean, I could just write something that scans brute-force and looks for letter sequences, but you already wrote a parser in Processing..
there are a couple parsers at work behind the scenes. the one that gives us all the quirky stuff with the code is based on oro-matcher, which uses 'regular expressions' as a more advanced way to do pattern matching. (it's a hack to use oro-matcher, so it's our fault not theirs.) we need to move to a 'real' parser which breaks things into a sort of tree based on a grammar, which will eventually fix those bugs..
split() and friends are simple specific-use methods that we included because we find them useful for our own work. so for instance, if you had your entire html file as a String, you could solve your original problem with:
Code:
String pieces[] = splitStrings(htmlstring, '<');
for (int i = 0; i < pieces.length; i++) {
  // uses toLowerCase so that it doesn't care
  // whether it's img src or IMG SRC or IMG src etc
  if (pieces[i].toLowerCase().indexOf("img src=") == 0) {
    // this is an image tag
    // 9 is for the number of characters in: img src="
    String filename = pieces[i].substring(9);
    int quote = filename.indexOf("\"");
    filename = filename.substring(0, quote);
    // now do something with the filename
  }
}
this quickly gets messy when you have to make exceptions for whether or not the page designer put quotes around the filename after src=, or if someone uses a tag like <IMG BORDER=0 SRC=blahblha.gif>. it's not difficult, it just gets messy. this is where a more robust parser comes into play.. regular expressions allow you to do conditional matching (i.e. i can state that quotes are optional) and aren't as brittle. a full parser (not just matching) would do a better job of dealing with those quirky scenarios too, since it's easier to specify those exceptional cases in the parser 'grammar'.
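To make the "quotes are optional" point concrete, here is a small sketch using the java.util.regex package from the standard Java library (not oro-matcher, and not anything Processing itself ships in this thread). The class name, method name and the pattern are illustrative assumptions; the pattern accepts src in any position inside the tag, in any letter case, with double quotes, single quotes or no quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImgSrcSketch {
    // <img, then any attributes, then src = value, where value is one of:
    //   "double quoted"  (group 1)
    //   'single quoted'  (group 2)
    //   unquoted, up to whitespace or >  (group 3)
    static final Pattern IMG_SRC = Pattern.compile(
        "<img\\b[^>]*?\\bsrc\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns every src value found in html.
    static List<String> findImageSources(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            // Exactly one of the three groups is non-null for each match.
            for (int g = 1; g <= 3; g++) {
                if (m.group(g) != null) {
                    found.add(m.group(g));
                    break;
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String page = "<img src=\"a.png\"> <IMG BORDER=0 SRC=blahblha.gif> "
                    + "<img alt='x' src='c.jpg'>";
        for (String src : findImageSources(page)) {
            System.out.println(src);
        }
    }
}
```

This is still pattern matching rather than a grammar-based parser, so it can be fooled by unusual markup (for example an attribute such as data-src appearing before src), which is the kind of case a full parser handles more gracefully.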
|
« Last Edit: Feb 22nd, 2003, 3:54pm by fry » |
|
|
|
|
benelek
|
Re: parser bother
« Reply #8 on: Feb 23rd, 2003, 12:27am » |
|
i haven't had any experience with regular expressions (besides the stuff that usually comes out of my mouth, hehe), would you mind explaining what they involve?
|
|
|
|
skloopy
|
Re: parser bother
« Reply #9 on: Feb 23rd, 2003, 1:58am » |
|
Thanks for the help. I'll post the result when I'm done. It seems like it's impossible right now to import a library (like a parser) into Processing; I've tried the command line method. But for me, the split() command may be all I need, even if the code gets a little long. It would be really cool if you could have multiple java files in a Processing project. Maybe you could use a page metaphor?
|
« Last Edit: Feb 23rd, 2003, 1:58am by skloopy » |
|
|
|
|
benelek
|
Re: parser bother
« Reply #11 on: Feb 23rd, 2003, 5:39am » |
|
cool, thanks Mike.
|
|
|
|
|