Removing url from string

edited April 2016 in Library Questions

Hi all

I'm trying to remove the URLs from strings generated by tweets (using twitter4j). I've been trying all sorts of things with no luck.

So let's say I have a string containing

"Hello, follow me! http://google.com"

or

"This is a tweet with an instagram photo https://instagr.am/blablabla"

I want to strip just the URL from that string, but that URL may be anything.

I've been trying things with regex, and other instructions I've found on the web, but I'm really lost. Could anyone point me in the right direction?

Answers

  • _vk_vk
    edited April 2016 Answer ✓

    Don't...

    :)

    I mean, Twitter already provides a lot of data for each "entity", such as the URL in both its long and short versions. And twitter4j handles those for you:

    https://twitter4j.org/javadoc/twitter4j/EntitySupport.html

    I have some code at home that I can post later if you need it.

    It is something like:

    // getURLEntities() returns an array with one URLEntity per link in the tweet
    String url = tweet.getURLEntities()[0].getURL();
    

    If no URL is present in the Status it returns null I think.

  • Thanks for the reply! So then can I use getURLEntities() to remove the url from the string?

    I haven't quite figured out how to remove a string from a string yet. Is it possible without splitting the string into separate words first?

    Also, this might be more appropriate in the Library Questions department, since the solution seems to be part of twitter4j.

  • _vk_vk
    Answer ✓

    Those are not for extracting a URL from a String. You get a String that is just the URL, directly from Twitter. See:

    https://dev.twitter.com/overview/api/entities-in-twitter-objects

    To deal with the Strings yourself, follow GoToLoop's path. But as you noticed, dealing with this from scratch is not trivial, so Twitter already supplies this for you in an easy way: the entities.

  • Alright, thanks for pointing me in the right direction guys! I've got this! I'll post my results when I get there :)

  • edited April 2016

    @GoToLoop, I know of the existence of replace(), but it can only replace characters with something else (or nothing), it won't work with a String. Is there a workaround? It would really facilitate the process of stripping the urls from tweets...

    The other way to go would involve splitting the tweet String into words, comparing each word to each possible URL entity, then putting together only the words that don't match a URL. I've already tried constructing something like this, but that didn't really work out so well.
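For reference, the split-and-compare idea described above could be sketched roughly as follows. This is only an illustration in plain Java: the class name `WordFilterDemo`, the method `dropUrls`, and passing the URLs as a `String[]` are assumptions made for the example, standing in for the values that `getURL()` would supply.

```java
// Hypothetical sketch: names are illustrative and not part of twitter4j.
final class WordFilterDemo {

  // Keeps every whitespace-separated word that is not exactly equal to one of the URLs.
  static String dropUrls(String tweet, String[] urls) {
    StringBuilder kept = new StringBuilder();
    for (String word : tweet.split("\\s+")) {
      boolean isUrl = false;
      for (String url : urls) {
        if (word.equals(url)) {
          isUrl = true;
          break;
        }
      }
      // Skip URLs and the empty token that split() yields for leading whitespace
      if (!isUrl && !word.isEmpty()) {
        if (kept.length() > 0) kept.append(' ');
        kept.append(word);
      }
    }
    return kept.toString();
  }

  public static void main(String[] args) {
    String tweet = "Hello, follow me! http://google.com";
    System.out.println(dropUrls(tweet, new String[] { "http://google.com" }));
  }
}
```

Two limitations of this approach: a URL glued to punctuation (for example followed by a comma) will not match, and runs of whitespace in the original tweet are collapsed to single spaces.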

  • _vk_vk
    edited April 2016 Answer ✓

    Together with a URL entity there is an array of ints containing the start and end indices of that specific URL in the tweet string, like:

    This is a tweet http://someurl.com

    Would have [16, 34] (if I counted right :)
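To check the counting, here is a small plain-Java sketch that uses the example string and the [16, 34] pair from the post above. It assumes the end index is exclusive (one past the last character of the URL), which is how `substring()` treats it; the class name `UrlIndexDemo` and the helper `cut` are made up for the illustration.

```java
// Hypothetical sketch: only the tweet text and the index pair come from the thread.
final class UrlIndexDemo {

  // Removes the characters in [start, end): start is inclusive, end is exclusive.
  static String cut(String text, int start, int end) {
    return text.substring(0, start) + text.substring(end);
  }

  public static void main(String[] args) {
    String tweet = "This is a tweet http://someurl.com";
    // The slice between the two indices is the URL itself
    System.out.println(tweet.substring(16, 34));
    // Brackets make the leftover trailing space visible
    System.out.println("[" + cut(tweet, 16, 34) + "]");
  }
}
```

With real tweets the two numbers would come from the entity rather than being counted by hand.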

  • That is another brilliant hint. Thanks a lot!

  • edited April 2016

    replace(), but it can only replace characters with something else (or nothing), it won't work with a String.

    replace() is a method of class String, so I don't get it when you say it won't work on one!

    If you've got ahold of the URL String and wanna remove it outta the bigger String, it's as simple as:
    biggerStr = biggerStr.replace(urlString, "");
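A self-contained way to try the one-liner above without any Twitter connection is sketched below. The tweet text is taken from the original question; the class name `ReplaceDemo` and the method `strip` are illustrative only.

```java
// Hypothetical sketch: shows String.replace(CharSequence, CharSequence) on whole substrings.
final class ReplaceDemo {

  // Removes every occurrence of urlString from biggerStr.
  static String strip(String biggerStr, String urlString) {
    return biggerStr.replace(urlString, "");
  }

  public static void main(String[] args) {
    String tweet = "Hello, follow me! http://google.com";
    // Brackets make the leftover trailing space visible
    System.out.println("[" + strip(tweet, "http://google.com") + "]");
  }
}
```

The space that preceded the URL is left behind, so calling `trim()` on the result may be worthwhile when the URL sits at the end of the tweet.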

  • Ok, this is what I've got working atm:

    Status status = tweets.get(currentTweet);
    URLEntity[] urlS = status.getURLEntities();
    int numberOfUrlS = urlS.length;
    //println(numberOfUrlS);

    // With no URLs both indices stay 0, so the whole text is kept
    int urlStart = 0;
    int urlEnd = 0;

    if (numberOfUrlS >= 1) {
      urlStart = urlS[0].getStart();            // start of the first URL
      urlEnd = urlS[numberOfUrlS-1].getEnd();   // end of the last URL
    }
    String tweetBruto = status.getText();
    String tweetTarra1 = tweetBruto.substring(0, urlStart);  // text before the first URL
    String tweetTarra2 = tweetBruto.substring(urlEnd);       // text after the last URL
    String tweetNetto = tweetTarra1 + tweetTarra2;
    

    This removes everything from the start of the first URL to the end of the last one. URLs usually sit at the end of a tweet, so I could do without urlEnd and skip combining the tweetTarra strings, but I've kept it like this just in case. Any text between the first and last URL is lost, though. I think I could do this prettier and better with a for-loop, but I'm not quite there yet :)
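The for-loop version hinted at above might look something like the following plain-Java sketch. It assumes each URL is described by a {start, end} pair with an exclusive end, and that the pairs are sorted and do not overlap; the `int[][]` array is a stand-in for the values `getStart()` and `getEnd()` would return, and the names `SpanRemovalDemo` and `removeSpans` are invented for the example.

```java
// Hypothetical sketch: removes each URL span separately so text between URLs survives.
final class SpanRemovalDemo {

  // spans: {start, end} pairs, end exclusive, sorted ascending, non-overlapping (assumed).
  static String removeSpans(String text, int[][] spans) {
    StringBuilder sb = new StringBuilder(text);
    // Work from the last span to the first so earlier indices stay valid after each delete
    for (int i = spans.length - 1; i >= 0; i--) {
      sb.delete(spans[i][0], spans[i][1]);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String tweet = "See http://a.co and http://b.co now";
    int[][] spans = { { 4, 15 }, { 20, 31 } };
    System.out.println(removeSpans(tweet, spans));
  }
}
```

Deleting back to front is the design choice that matters here: removing the first URL first would shift every later index, and the remaining pairs would point at the wrong characters.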

  • something like this:

    String removeURL (Status s) {
      String fullText = s.getText();
      URLEntity[] URLs = s.getURLEntities();
      if (URLs.length > 0) {
        for (URLEntity ue : URLs) {
          fullText = fullText.replace(ue.getURL(), "");
        }
      }
      return fullText;
    }
    
  • Nice job, @_vk! A shorter solution using my replace() hint! :)
    Just one tip: there's no need to check the length. As long as URLs isn't null, for ( : ) won't crash; on an empty array it simply runs zero times.

    static final String removeURL(final Status s) {
      if (s == null)  return "";
      String fullText = s.getText();
    
      for (final URLEntity ue : s.getURLEntities())
        fullText = fullText.replace(ue.getURL(), "");
    
      return fullText;
    }
    
  • :-c :)

    Thanks, GTL
