How to read the text of a web page

For a statistics project I've made a program that counts and stores which words are used and how often. At the moment it reads from a .txt file, but I would like to extend the statistics to the internet. There are tutorials on how to read the text from a web page, but I just need the written text, not the code part; is there a distinction I can use? The program should read every file in a domain; as an example, say the link is "https://www.processing.org/" and it should read the text of the presentations, the tutorials, etc., but not the HTML part.

It started as a project on my favourite singer, but I soon discovered I'm wayyyy too lazy to copy and paste his 100+ songs into a .txt file.

Answers

  • If you just want to scrape data from a web page it's probably possible using Processing, but I'd look at using established web technology specifically designed for the purpose. I've used CasperJS to do exactly this sort of thing in the past, though I've also used a Google Chrome plugin (Web Scraper - there are others) which may be more accessible if you're not familiar with JS.

    Although you say you just want the raw text, you probably want to split it into separate fields - I'm guessing things like track name, year etc. - and these will hopefully be embedded in separate and consistent HTML elements within the pages. In that case you just need to figure out selectors for each container element and grab its contents, rather than grabbing the whole lot as a string and then trying to parse data out of that (which will most likely be a lot of extra work); see the sketch below.
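
    To make that concrete, here's a minimal CasperJS sketch (save it as e.g. scrape.js and run it with casperjs scrape.js). The URL and the '.lyrics' selector are placeholders, not anything from your site - you'd swap in the real page and whatever container element the inspector shows you:

    var casper = require('casper').create();

    casper.start('http://example.com/songs/some-song.html', function() {
      // evaluate() runs inside the page; '.lyrics' is a hypothetical selector
      var text = this.evaluate(function() {
        var nodes = document.querySelectorAll('.lyrics'),
            out = [],
            i;
        for (i = 0; i < nodes.length; i++) {
          out.push(nodes[i].innerText);
        }
        return out.join('\n');
      });
      // print the scraped text to the terminal
      this.echo(text);
    });

    casper.run();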

  • Not really; what I mean is that, for example, if my program were reading this page (made in a second): sacchidisale.altervista.org/nuova-pagina.html

    it should read:

    " This is a paragraph.

    hello my name is jimmi I like processing "

    instead of:

    " <!doctype html>

    (function(d, s) { var fjs = d.getElementsByTagName(s)[0], js = d.createElement(s); js.src = "//tb.altervista.org/js/script.js?1"; fjs.parentNode.insertBefore(js, fjs); }(document, 'script'));

    This is a paragraph.

    hello my name is jimmi I like processing

    "

    The program finds more things, but the Processing forum renders them as links so I can't quote them.

    that's it...

  • You may try loadXML() for the HTML and "scrape" it w/ XML's methods:
    https://processing.org/reference/loadXML_.html
    https://processing.org/reference/XML.html

  • Whatever: the JS approach suggested above is still a better and more straightforward choice for scraping text from a web page.

    Open your example page in Chrome (though the following should work in most browsers). Hit F12 to open the console. Copy and paste the following JS into the console and hit Enter:

    (function foo() {
      var text = "",
          // this selector gets all paragraphs
          paragraphs = document.getElementsByTagName('p'),
          pLength = paragraphs.length;

      for (var i = 0; i < pLength; i++) {
        text += paragraphs[i].innerText + "\n";
      }
      return text;
    })()
    

    This is just an example. The trick is writing selectors to grab the text you want from the relevant elements. What you're suggesting (grabbing all text) is equivalent to the following:

    document.getElementsByTagName('body')[0].innerText

    Obviously it depends on the target content, but you may get way more back than you want and will then have to do a lot of work to filter out the noise...

  • You may try loadXML() for the HTML and "scrape" it w/ XML's methods

    @GoToLoop: that's a possibility... but I suspect it's going to be far more tedious than using JS. As a quick example, here's how you can scrape every link's href attribute from the page:

    (function() {
      var allLinks = document.getElementsByTagName('a'),
          len = allLinks.length,
          linksOutput = [];

      for (var i = 0; i < len; i++) {
        // TODO: add condition to check the link is on the target site's domain
        linksOutput.push(allLinks[i].getAttribute('href'));
      }

      // outputting as a string; but in practice you'd work with the array
      return linksOutput.toString();
    })();
    

    I don't even want to begin thinking about how I'd do that with those XML methods :((

    @ondsinet_: Note that the above example is another reason to scrape HTML elements (whether you choose the pain of loadXML or not): you can use the links on a page to find sub-pages to scrape (nowadays sites generally don't hand you a site map on a plate). Just check that the links target the same domain and keep a record of pages you've already visited; there's a rough sketch of the idea below. You can also search online for examples of setting up a CasperJS script that will recursively work its way through a site using exactly this technique...
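
    To make the same-domain check and the visited record concrete, here's a rough sketch - a modern-browser alternative to a CasperJS script, using fetch and DOMParser, which you'd paste into the console while on a page of the target site. It ignores query strings and skips the error handling and politeness delays a real crawler would need:

    // crude same-origin crawler: breadth-first over <a> links,
    // collecting each page's text
    (async function crawl(startPath) {
      var visited = new Set(),
          queue = [startPath],
          texts = [];

      while (queue.length > 0) {
        var path = queue.shift();
        if (visited.has(path)) continue;
        visited.add(path);

        var html = await (await fetch(path)).text(),
            // parse the raw source into a DOM we can query (nothing is rendered)
            doc = new DOMParser().parseFromString(html, 'text/html'),
            links = doc.getElementsByTagName('a');

        texts.push(doc.body.textContent);

        for (var i = 0; i < links.length; i++) {
          // resolve relative hrefs and keep only same-domain pages
          var url = new URL(links[i].getAttribute('href') || '', location.href);
          if (url.origin === location.origin) {
            queue.push(url.pathname);
          }
        }
      }
      return texts.join('\n');
    })(location.pathname);

    It returns a promise, so await it (or chain .then(console.log)) to see the collected text.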

    If you've already started a project in Processing, you could just use JS to scrape the content into a data file and import that into Processing ;)
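
    For the "scrape into a data file" step, one way (a sketch, assuming a browser with Blob and object URL support; saveAsTxt is a made-up helper name) is to build a download link on the fly:

    // offer scraped text as a .txt download that a Processing sketch can load
    function saveAsTxt(text, filename) {
      var a = document.createElement('a');
      a.href = URL.createObjectURL(new Blob([text], { type: 'text/plain' }));
      a.download = filename;
      document.body.appendChild(a); // some browsers want the link in the DOM
      a.click();
      document.body.removeChild(a);
    }

    // e.g. saveAsTxt(document.body.innerText, "lyrics.txt");

    The resulting .txt can then be read from the Processing sketch with loadStrings().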

  • Yet I don't understand how to distinguish plain text from code/tags etc.

  • Yet I don't understand how to distinguish plain text from code/tags etc.

    You're not following: both loadXML and my suggested JS approach do that for you; loadXML by parsing the HTML source as XML (which assumes the markup is well-formed) and JS by leveraging the browser's built-in DOM model. The result of both is essentially an object representing the page contents, which you can then inspect with its methods.

    When you request a web page, by whatever means, you will always get the HTML source in response, and you're advised to use existing libraries to parse it rather than trying to do so yourself. If all you really want is the rendered text of the entire page, then the JS example I gave should do (I imagine the XML equivalent would be far, far longer):

    document.getElementsByTagName('body')[0].innerText

    You would, however, make your task far easier, and more effective, by being more selective. For example, how will you stop repeated text in extraneous content, such as navigation and the page template, from skewing your results? It's relatively simple to use the browser inspector (try right-click > Inspect Element) to find a suitable wrapping element within the target page and target it with a selector to get the relevant content only...
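
    For instance, if the inspector showed the main copy living in something like <div id="content"> (a made-up id; the real one will differ from site to site), you'd grab just that and skip the navigation:

    // hypothetical: scrape only the main content wrapper, not nav/footer
    document.querySelector('#content').innerText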
