Processing 1.0 - Processing Discourse - character conversion for urls (xmlelement)

verwirrt

character conversion for urls (xmlelement)
May 27th, 2008, 10:10pm

hi there,

i'm trying to load xml files (from last.fm) with XMLElement.
my code works just fine except when there are "foreign" characters involved (like ú ò ö ü etc.)

for example:
xmlartisttag = new XMLElement(this, "http://ws.audioscrobbler.com/1.0/artist/múm/toptags.xml");

will crash saying the file does not exist...in my browser it comes out fine.

i've tried replacing the ú with "&uacute;" but then it sends just an "m" instead of "múm". unicode escapes don't work either.

any suggestions on how to convert such characters so that XMLElement understands them?

thanks a lot in advance!
mikko

PhiLho

Re: character conversion for urls (xmlelement)
Reply #1 - May 28th, 2008, 12:13am

In URLs, such characters must be encoded with URL escape sequences: %xx, where xx is the hexadecimal code of the character, or, beyond ASCII, %xx%yy, where xx and yy are the bytes of the character's UTF-8 encoding.
You can see this when searching for múm on Wikipedia: http://en.wikipedia.org/wiki/M%C3%BAm
Here is code to convert the URLs:
Code:

String str = "http://ws.audioscrobbler.com/1.0/artist/múm/toptags.xml";
byte[] utf8 = null;
byte[] conv = new byte[1];
try
{
  utf8 = str.getBytes("UTF-8");
} catch (Exception e) {}
StringBuffer sb = new StringBuffer();
for (int i = 0; i < utf8.length; i++)
{
  if (utf8[i] < 0) // Beyond Ascii: high bit is set, hence negative byte
  {
    sb.append("%" + Integer.toString(256 + (int)utf8[i], 16));
  }
  else
  {
    conv[0] = utf8[i];
    try
    {
      sb.append(new String(conv, "ASCII")); // Convert back to Ascii
    } catch (Exception e) {}
  }
}
println(sb);

PhiLho

Re: character conversion for urls (xmlelement)
Reply #2 - May 28th, 2008, 1:24pm

OK, it was a quick hack, interesting because it showed the basics of how URL encoding works, but incomplete: it handles neither the other forbidden (or discouraged) characters nor spaces.

But there is a simpler way I overlooked: ~~using java.net.URLEncoder.encode()~~
Er, no, it also encodes :// and all the slashes of the URL!
It probably aims at encoding parameters in a URL (eg. passing a URL to a script).
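
To make the problem concrete, here is a small plain-Java sketch comparing the two uses. The class name UrlEncoderScope and both helper methods are made up for this illustration; only URLEncoder.encode() itself comes from the JDK. The ú is written as \u00fa so the result does not depend on the encoding of the source file.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class, not part of Processing or of the code above.
class UrlEncoderScope {
  // Encodes the entire URL: the "://" and every slash get escaped too.
  static String encodeWhole(String url) {
    try {
      return URLEncoder.encode(url, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 should always be available", e);
    }
  }

  // Encodes only the variable part and leaves the fixed URL structure alone.
  static String encodeNameOnly(String prefix, String name, String suffix) {
    return prefix + encodeWhole(name) + suffix;
  }

  public static void main(String[] args) {
    String prefix = "http://ws.audioscrobbler.com/1.0/artist/";
    String suffix = "/toptags.xml";
    String name = "m\u00fam";
    // prints http%3A%2F%2Fws.audioscrobbler.com%2F1.0%2Fartist%2Fm%C3%BAm%2Ftoptags.xml
    System.out.println(encodeWhole(prefix + name + suffix));
    // prints http://ws.audioscrobbler.com/1.0/artist/m%C3%BAm/toptags.xml
    System.out.println(encodeNameOnly(prefix, name, suffix));
  }
}
```

The first result is no longer usable as a URL, while the second only works when you know in advance which part of the URL is variable.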

The solution is given by ~~http://mindprod.com/jgloss/urlencoded.html page~~
Er, still not (I experiment as I type...), it points out that accents are not encoded!

So I came up with a compound, a bit convoluted, solution:
Code:

import java.net.URI;
import java.net.URISyntaxException;
import java.net.MalformedURLException;
import java.io.UnsupportedEncodingException;

String str = "http://ws.audioscrobbler.com/1.0/artist/múm/toptags.xml";

String EncodeURL(String url)
{
  // Use URI to encode low Ascii characters depending on context of various parts
  // For some reason, uri = new URI(url) chokes on space, so we have to split the URL
  String scheme = null;  // http, ftp, etc.
  String ssp = null; // scheme-specific part
  String fragment = null; // #anchor for example
  int colonPos = url.indexOf(":");
  if (colonPos < 0) return "Not an URL";
  scheme = url.substring(0, colonPos);
  ssp = url.substring(colonPos + 1);
  int fragPos = ssp.lastIndexOf("#");
  if (fragPos >= 0)
  {
    // Won't work if there is no real anchor/fragment
    // but this char is part of one parameter of the query,
    // but it is a bit unlikely...
    // That's probably why Java doesn't want to do it automatically,
    // it must be disambiguated manually
    fragment = ssp.substring(fragPos + 1);
    ssp = ssp.substring(0, fragPos);
  }

  URI uri = null;
  try
  {
    uri = new URI(scheme, ssp, fragment);
  } catch (URISyntaxException use) { return use.toString(); }
  String encodedURL1 = null;
  try
  {
    encodedURL1 = uri.toURL().toString();
  } catch (MalformedURLException mue) { return mue.toString(); }
  // Here, we still have Unicode chars unchanged

  byte[] utf8 = null;
  // Convert whole string to UTF-8 at once: low Ascii (below 0x80) is unchanged, other stuff is converted
  // to UTF-8 multi-byte sequences, whose bytes always have the high bit set.
  try
  {
    utf8 = encodedURL1.getBytes("UTF-8");
  } catch (UnsupportedEncodingException uee) { return uee.toString(); }

  StringBuffer encodedURL = new StringBuffer();

  byte[] conv = new byte[1];
  for (int i = 0; i < utf8.length; i++)
  {
    if (utf8[i] < 0) // Beyond Ascii: high bit is set, hence negative byte
    {
      encodedURL.append("%" + Integer.toString(256 + (int)utf8[i], 16));
    }
    else
    {
      conv[0] = utf8[i];
      try
      {
        encodedURL.append(new String(conv, "ASCII")); // Convert back to Ascii
      } catch (UnsupportedEncodingException uee) { return uee.toString(); }
    }
  }

  return encodedURL.toString();
}

void setup()
{
  println(EncodeURL(str));
  println(EncodeURL("http://www.example.com/you & I 10%? weird & weirder neé"));
  println(EncodeURL("http://www.example.com/Éric.html#CV"));
  println(EncodeURL("http://www.example.com/éditer.php?p1=déjà vu&p2=sl/ash#meh"));
  exit();
}

There might be better, simpler ways... I would be happy to see them.
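
One possibly simpler route, offered as a sketch rather than a tested replacement: java.net.URI has a toASCIIString() method that escapes non-ASCII characters as UTF-8 percent sequences, which seems to make the toURL() step and the manual byte loop unnecessary. The class name UriAsciiSketch and the method encode() are invented for this example, and the splitting at the first ':' and the last '#' is the same assumption EncodeURL() makes. Input that already contains '%' is not covered here.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class for illustration only.
class UriAsciiSketch {
  // Splits the URL like EncodeURL() does, then lets java.net.URI quote
  // illegal characters (such as spaces) and escape the non-ASCII ones.
  static String encode(String url) {
    int colonPos = url.indexOf(':');
    if (colonPos < 0) {
      throw new IllegalArgumentException("No scheme in: " + url);
    }
    String scheme = url.substring(0, colonPos);
    String ssp = url.substring(colonPos + 1); // scheme-specific part
    String fragment = null;
    int fragPos = ssp.lastIndexOf('#');
    if (fragPos >= 0) {
      // Same caveat as above: a '#' inside a query parameter is misread
      fragment = ssp.substring(fragPos + 1);
      ssp = ssp.substring(0, fragPos);
    }
    try {
      return new URI(scheme, ssp, fragment).toASCIIString();
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(encode("http://ws.audioscrobbler.com/1.0/artist/m\u00fam/toptags.xml"));
    System.out.println(encode("http://www.example.com/d\u00e9j\u00e0 vu"));
    System.out.println(encode("http://www.example.com/\u00c9ric.html#CV"));
  }
}
```

Unlike EncodeURL(), this version throws an exception on bad input instead of returning the error message as if it were a URL.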

fry

Re: character conversion for urls (xmlelement)
Reply #3 - May 28th, 2008, 3:05pm

a little simpler...
Code:


String prefix = "http://ws.audioscrobbler.com/1.0/artist/";
String suffix = "/toptags.xml"; 

String encodeURL(String name) {
  StringBuffer sb = new StringBuffer();
  sb.append(prefix);
  byte[] utf8 = name.getBytes("UTF-8");
  for (int i = 0; i < utf8.length; i++) {
    int value = utf8[i] & 0xff;
    if (value < 33 || value > 126) {
      sb.append('%');
      sb.append(hex(value, 2));
    } else {
      sb.append((char) value[i]);
    }
  }
  sb.append(suffix);
  return sb.toString();
}

PhiLho

Re: character conversion for urls (xmlelement)
Reply #4 - May 28th, 2008, 4:41pm

You cheat! ;)

I was trying to make a generic solution...
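Besides, it needs a few small changes: getBytes("UTF-8") throws a checked exception that must be caught, and value is an int, not an array: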
Code:

String prefix = "http://ws.audioscrobbler.com/1.0/artist/";
String suffix = "/toptags.xml";  
 
String encodeURL(String name) {
  StringBuffer sb = new StringBuffer();
  sb.append(prefix);
  byte[] utf8 = null;
  try { utf8 = name.getBytes("UTF-8"); } catch (Exception e) {}
  for (int i = 0; i < utf8.length; i++) {
    int value = utf8[i] & 0xff;
    if (value < 33 || value > 126) {
      sb.append('%');
      sb.append(hex(value, 2));
    } else {
      sb.append((char) value);
    }
  }
  sb.append(suffix);
  return sb.toString();
} 

void setup()
{
  println(encodeURL("múm"));
  exit();
}

If you go this way, there is even simpler:
Code:

String encodeURL(String name) {
  String encoded = null;
  try { encoded = prefix + java.net.URLEncoder.encode(name, "UTF-8") + suffix; } catch (Exception e) {}
  return encoded;
}

Actually, this version is the better choice if the name contains special characters such as a question mark or an exclamation mark.

I appreciate the simpler way to convert from byte to char (I forgot that append() accepts a char); sometimes I take convoluted routes...
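
One caveat with URLEncoder.encode(): it produces form-style encoding, so a space comes out as '+' rather than %20. Whether a server accepts '+' inside a path depends on the server. If it does not, a sketch like the following may help; the class PathSegmentSketch and the method encodeSegment() are names made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class for illustration only.
class PathSegmentSketch {
  static String encodeSegment(String name) {
    try {
      // A literal '+' in the input has already become %2B at this point,
      // so every '+' left in the result stands for a space.
      return URLEncoder.encode(name, "UTF-8").replace("+", "%20");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 should always be available", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(encodeSegment("m\u00fam"));     // prints m%C3%BAm
    System.out.println(encodeSegment("Iron & Wine")); // prints Iron%20%26%20Wine
    System.out.println(encodeSegment("+/-"));         // prints %2B%2F-
  }
}
```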

verwirrt

Re: character conversion for urls (xmlelement)
Reply #5 - May 29th, 2008, 8:21am

thanks a lot to both of you!

of course i tried the simplest code first ;)

and it works like a charm :D

that saved my day

EDIT:

just in case someone wants to do the same as myself.
still had a problem with weird artist names like "Iron & Wine" or "+/-".
but with this workaround it behaves well:

Code:

String convertencoding(String thestring) {
  // percent-encode thestring as UTF-8
  String encoded = null;
  try {
    encoded = java.net.URLEncoder.encode(thestring, "UTF-8");
  } catch (Exception e) {}

  // workaround for the "&" problem with artists like "Iron & Wine"
  String Strlist1[] = split(encoded, "%26");
  encoded = join(Strlist1, "%2526");

  // workaround for the "+" problem with artists like "+/-"
  String Strlist2[] = split(encoded, "%2B");
  encoded = join(Strlist2, "%252B");

  // workaround for the "/" problem with artists like "+/-"
  String Strlist3[] = split(encoded, "%2F");
  encoded = join(Strlist3, "%252F");

  return encoded;
}
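
What the workaround does is encode the percent sign itself a second time: %26 becomes %2526, because %25 is the escape for '%'. Presumably the server decodes the path twice, but that is a guess and not something confirmed in this thread. Below is a plain-Java sketch of the same three substitutions using String.replace() instead of split() and join(); the class name DoubleEncodeSketch is made up, and the expected results follow from the substitutions alone, not from testing against last.fm.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical class for illustration; mirrors convertencoding() above.
class DoubleEncodeSketch {
  static String convertEncoding(String s) {
    String encoded;
    try {
      encoded = URLEncoder.encode(s, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 should always be available", e);
    }
    // Re-encode the '%' of the escapes for '&', '+' and '/' only
    return encoded
        .replace("%26", "%2526")
        .replace("%2B", "%252B")
        .replace("%2F", "%252F");
  }

  public static void main(String[] args) {
    System.out.println(convertEncoding("Iron & Wine")); // prints Iron+%2526+Wine
    System.out.println(convertEncoding("+/-"));         // prints %252B%252F-
    System.out.println(convertEncoding("m\u00fam"));     // prints m%C3%BAm
  }
}
```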

« Last Edit: May 29th, 2008, 11:42am by verwirrt »