ProHTML: how to get text inside <p> tags?

codekiln

in Contributed Library Questions • 2 years ago

I'm using ProHTML to extract the biographies from congress.gov. Here's an example page:

http://bioguide.congress.gov/scripts/biodisplay.pl?index=H000213

I'd like to get the text inside the <p> tag that contains the biography. For some reason, though, I'm unable to get at the text inside the StandAloneElement. For each StandAloneElement sae, sae.hasChildren() returns false, even though there must be a text node inside the tag. What am I missing?

import prohtml.*;
import java.util.List;

HtmlElementFinder htmlElementFinder;
String url = "http://bioguide.congress.gov/scripts/biodisplay.pl?index=H000213";
htmlElementFinder = new HtmlElementFinder(url, "p");
List bio = htmlElementFinder.getElements();
for (int i = 0; i < bio.size(); i++) {
  StandAloneElement sae = (StandAloneElement) bio.get(i);
  println(sae.hasChildren()); // prints false for some reason
}

I tried doing this first with XMLElement, but I couldn't figure out how to cast the HTML to a valid XML document.

Replies(4)

codekiln

Re: ProHTML: how to get text inside tags?

2 years ago

I don't think ProHTML is the tool I want. There's no way to just get a block of text, and within the text the punctuation is thrown out the window (see example below). I'd rather use XMLElement. Anyone know a way to use loadStrings() as an input to XMLElement?

import prohtml.*;
import java.util.List;

HtmlTree htmlTree;
String url = "http://bioguide.congress.gov/scripts/biodisplay.pl?index=H000213";
htmlTree = new HtmlTree(url);
HtmlElement html = htmlTree.pageTree;
//html.printElementTree(". ");
List lst = html.getSpecificElements("p");
for (int i = 0; i < lst.size(); i++) {
  HtmlElement elem = (HtmlElement) lst.get(i);
  List lst2 = elem.getChildren();
  // keep only text and letter elements; walk backwards so remove() does not skip entries
  for (int j = lst2.size() - 1; j >= 0; j--) {
    Element child = (Element) lst2.get(j);
    if (!(child.type() == Conts.TEXT_ELEMENT || child.type() == Conts.LETTER_ELEMENT)) {
      lst2.remove(j);
    }
  }
  println(lst2);
}

Output:

[a, Representative, from, California, born, in, New, York, NY, June, 28, 1945, graduated, from, University, High, School, Los, Angeles, Calif, 1962, BA, Smith, College, Northampton, Mass, 1966, JD, Harvard, University, School, of, Law, Cambridge, Mass, 1969, staff, for, United, States, Senator, John, V, Tunney, of, California, 19721973, adjunct, professor, Georgetown, University, Law, Center, Washington, DC, 19741975, chief, counsel, and, staff, director, United, States, Senate, Judiciary, subcommittee, on, constitutional, rights, 19751977, deputy, secretary, to, the, cabinet, The, White, House, 19771978, special, counsel, Department, of, Defense, 1979, elected, as, a, Democrat, to, the, One, Hundred, Third, and, to, the, two, succeeding, Congresses, January, 3, 1993January, 3, 1999, was, not, a, candidate, for, reelection, to, One, Hundred, Sixth, Congress, in, 1998, but, was, an, unsuccessful, candidate, for, nomination, as, governor, of, California, elected, as, a, Democrat, to, the, One, Hundred, Seventh, and, to, the, five, succeeding, Congresses, until, her, resignation, on, February, 28, 2011, January, 3, 2001February, 28, 2011]

If I try to do this by instantiating an XMLElement with the url, I get an error:

XMLElement xml = new XMLElement(this, "http://bioguide.congress.gov/scripts/biodisplay.pl?index=H000213");
Expected: delimited string, SystemID='file:.', Line=6

A quick search on the Processing forum shows that this is because the encoding of the page is wrong for XMLElement. If I try to do this by using the Reader constructor, it fails silently:

BufferedReader reader;
String line;

void setup() {
  noLoop();
  // open the page as a stream and wrap it in a reader
  reader = createReader(createInput("http://bioguide.congress.gov/scripts/biodisplay.pl?index=H000213"));
  do {
    try {
      line = reader.readLine();
    } catch (IOException e) {
      e.printStackTrace();
      line = null;
    }
    if (line != null) {
      //println( line );
    }
  } while (line != null);
  println(reader.getClass().getName()); // java.io.BufferedReader extends from Reader
  // note: the loop above has already read the stream to its end
  XMLElement xml = new XMLElement(reader); // using the Reader constructor
  println(xml.toString()); // nothing
  println(xml.getChildCount()); // 0
}
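
One possible contributor to the empty result above: the do/while loop reads the stream all the way to its end before the reader is handed to XMLElement, so the parser may have nothing left to read. The plain Java sketch below shows only that BufferedReader behaviour. The class name ReaderExhaustion, the helper drain() and the sample markup are made up for illustration, and it says nothing about whether XMLElement would accept this page's HTML from a fresh reader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class ReaderExhaustion {

  // Reads lines until readLine() returns null and reports how many were seen.
  static int drain(BufferedReader reader) throws IOException {
    int count = 0;
    while (reader.readLine() != null) {
      count++;
    }
    return count;
  }

  public static void main(String[] args) throws IOException {
    // Invented two-line input standing in for a downloaded page.
    BufferedReader reader = new BufferedReader(new StringReader("<p>one</p>\n<p>two</p>"));
    System.out.println("first pass:  " + drain(reader));
    // The same reader is now at end-of-stream, so a second consumer gets no lines.
    System.out.println("second pass: " + drain(reader));
  }
}
```

If that is what happens here, creating a second reader with createReader() just before constructing the XMLElement would at least give the parser the full input to work on.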

PhiLho

Re: ProHTML: how to get text inside tags?

2 years ago

That's old-fashioned pure HTML (3.2), so an XML parser will choke on it. You can use a library like jSoup to parse such a file.
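
For illustration, here is a rough sketch of the same idea without any extra library: the JDK ships a lenient HTML parser in javax.swing.text.html that tolerates markup an XML parser rejects, such as unquoted attributes. This is a swapped-in technique, not jSoup and not ProHTML. The class name ParagraphText, the method paragraphs() and the sample markup are assumptions made up for the example, not the real page.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ParagraphText {

  // Returns the text found inside each <p> element, one string per paragraph.
  static List<String> paragraphs(Reader html) throws IOException {
    final List<String> result = new ArrayList<String>();
    HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
      private StringBuilder current = null; // non-null while inside a <p>

      @Override
      public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.P) {
          current = new StringBuilder();
        }
      }

      @Override
      public void handleText(char[] data, int pos) {
        if (current != null) {
          current.append(data); // punctuation is kept as-is
        }
      }

      @Override
      public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.P && current != null) {
          result.add(current.toString().trim());
          current = null;
        }
      }
    };
    // true = ignore any charset declaration inside the document
    new ParserDelegator().parse(html, callback, true);
    return result;
  }

  public static void main(String[] args) throws IOException {
    // Invented sample with an unquoted attribute, which strict XML parsing would reject.
    String sample = "<html><body><P align=left>a Representative from California; "
        + "born in New York, N.Y.</P><p>Second paragraph.</p></body></html>";
    for (String p : paragraphs(new StringReader(sample))) {
      System.out.println(p);
    }
  }
}
```

To try it on a real page, the StringReader could be replaced with a reader over the downloaded HTML. A dedicated HTML library is still likely to be more convenient for anything beyond simple text extraction.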

helgosam

Re: ProHTML: how to get text inside tags?

2 years ago

@codekiln

Did you ever fix this using proHTML in the end? Or did you have to use jSoup?

@phi.lho

Is there a way to use jSoup from within Processing? Can I drop the jSoup library into my Processing libraries folder and then access it from there? Are there any tutorials on how to access and use Java libraries from inside Processing?

Many thanks, Sam

PhiLho

Re: ProHTML: how to get text inside tags?

2 years ago

I haven't tried jSoup with Processing, but you can just put the jar library in a code folder (or just drop it on the PDE), and then use it as you would do from Java.
