ProHTML: how to get text inside <p> tags?
2 years ago
I'm using ProHTML to extract the biographies from congress.gov. Here's an example page:
I'd like to get the text inside the <p> tag that contains the biography. For some reason, though, I can't get at the text inside the StandAloneElement objects that the finder returns. For every one of them, sae.hasChildren() returns false, even though there must be a text node inside the <p> tag. What am I missing?
import prohtml.*;
import java.util.List;

HtmlElementFinder htmlElementFinder;
String url = "http://bioguide.congress.gov/scripts/biodisplay.pl?index=H000213";
// Collect every <p> element on the page
htmlElementFinder = new HtmlElementFinder(url, "p");
List bio = htmlElementFinder.getElements();
for (int i = 0; i < bio.size(); i++) {
  StandAloneElement sae = (StandAloneElement) bio.get(i);
  println(sae.hasChildren()); // prints false for some reason
}
I first tried doing this with XMLElement, but I couldn't figure out how to convert the HTML into a valid XML document.
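To make the goal concrete, here is a minimal sketch of the result I'm after, written in plain Java without ProHTML. It is not the ProHTML way of doing this. The class name ParagraphText, the method extractParagraphs, and the sample HTML string are all made up for illustration, and the regular expression assumes every <p> has a matching </p>, which real pages may not guarantee.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ProHTML: pulls the text out of <p>...</p> blocks.
public class ParagraphText {
    // Matches <p ...> ... </p>, ignoring case and spanning line breaks.
    // The \b keeps tags such as <pre> from being treated as <p>.
    private static final Pattern P_BLOCK = Pattern.compile(
        "<p\\b[^>]*>(.*?)</p\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches any remaining tag inside a paragraph, e.g. <b> or </font>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    public static List<String> extractParagraphs(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = P_BLOCK.matcher(html);
        while (m.find()) {
            // Drop nested tags, then collapse runs of whitespace to one space.
            String text = TAG.matcher(m.group(1)).replaceAll("");
            text = text.replaceAll("\\s+", " ").trim();
            if (!text.isEmpty()) {
                out.add(text); // skip paragraphs that contain no text
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample markup, not taken from the congress.gov page.
        String html = "<p>First <b>bold</b>\n paragraph.</p>"
                    + "<pre>x</pre><P class=\"a\">Second one.</P><p> </p>";
        for (String paragraph : extractParagraphs(html)) {
            System.out.println(paragraph);
        }
    }
}
```

A regular expression is not a real HTML parser, so this only shows the kind of output I want (one string per paragraph, with nested tags removed). I'd still prefer to get the same thing through ProHTML's element tree.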