We closed this forum 18 June 2010. It has served us well since 2005 as the ALPHA forum did before it from 2002 to 2005. New discussions are ongoing at the new URL http://forum.processing.org. You'll need to sign up and get a new user account. We're sorry about that inconvenience, but we think it's better in the long run. The content on this forum will remain online.
character conversion for urls (xmlelement) (Read 976 times)
character conversion for urls (xmlelement)
May 27th, 2008, 10:10pm
 
hi there,

i'm trying to load xml files (from last.fm) with XMLElement.
my code works just fine except when there are "foreign" characters involved (like ú ò ö ü etc.)

for example:
xmlartisttag = new XMLElement(this, "http://ws.audioscrobbler.com/1.0/artist/múm/toptags.xml");

will crash saying the file does not exist...in my browser it comes out fine.

i've tried replacing the ú with "&uacute;" but then it sends just an "m" instead of "múm". unicode escapes don't work either.

any suggestions on how to convert such characters so that XMLElement understands them?

thanks a lot in advance!
mikko
Re: character conversion for urls (xmlelement)
Reply #1 - May 28th, 2008, 12:13am
 
In URLs, such characters must be percent-encoded: %xx, where xx is the hexadecimal code of the character, or, beyond ASCII, one %xx per byte of the character's UTF-8 encoding (%xx%yy for a two-byte character such as ú).
You can see that when searching múm on Wikipedia: http://en.wikipedia.org/wiki/M%C3%BAm
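To see where the %C3%BA in that Wikipedia link comes from, here is a minimal sketch in plain Java (not a Processing sketch). The class and method names (Utf8EscapeSketch, escapeAllBytes) are made up for illustration, and it assumes Java 7 or later for StandardCharsets. It escapes every byte, including plain ASCII, purely to make the byte values visible.

```java
import java.nio.charset.StandardCharsets;

class Utf8EscapeSketch {
    // Hypothetical helper: writes each UTF-8 byte of s as %XX (two uppercase hex digits).
    static String escapeAllBytes(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // b & 0xff turns the signed byte into its unsigned value 0..255
            sb.append(String.format("%%%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00fa is the character ú, written as an escape so the file encoding does not matter
        System.out.println(escapeAllBytes("\u00fa")); // two bytes: %C3%BA
        System.out.println(escapeAllBytes("m"));      // one byte:  %6D
    }
}
```

A real encoder would leave letters such as "m" alone; only the bytes with the high bit set (and a few reserved ASCII characters) need escaping.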
Here is code to convert the URLs:
Code:
String str = "http://ws.audioscrobbler.com/1.0/artist/múm/toptags.xml";
byte[] utf8 = null;
byte[] conv = new byte[1];
try
{
  utf8 = str.getBytes("UTF-8");
} catch (Exception e) {}
StringBuffer sb = new StringBuffer();
for (int i = 0; i < utf8.length; i++)
{
  if (utf8[i] < 0) // Beyond Ascii: high bit is set, hence negative byte
  {
    sb.append("%" + Integer.toString(256 + (int)utf8[i], 16));
  }
  else
  {
    conv[0] = utf8[i];
    try
    {
      sb.append(new String(conv, "ASCII")); // Convert back to Ascii
    } catch (Exception e) {}
  }
}
println(sb);
Re: character conversion for urls (xmlelement)
Reply #2 - May 28th, 2008, 1:24pm
 
OK, that was a quick hack. It is interesting because it shows the basics of how URL encoding works, but it is incomplete: it handles neither spaces nor the other characters that are forbidden (or discouraged) in URLs.

But there is a simpler way I overlooked: using java.net.URLEncoder.encode()
Er, no, it also encodes :// and all the slashes of the URL!
It is probably meant for encoding parameter values in a URL (e.g. passing a URL to a script).
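A small sketch of that effect, in plain Java rather than a Processing sketch. The class and method names (EncodeWholeUrlSketch, encodeEverything) are invented for illustration; URLEncoder.encode(String, String) is the standard library call.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class EncodeWholeUrlSketch {
    // Hypothetical helper: runs URLEncoder over the whole string, separators included.
    static String encodeEverything(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u00fa is ú, escaped so the source file encoding does not matter
        String url = "http://ws.audioscrobbler.com/1.0/artist/m\u00fam/toptags.xml";
        // The ':' and every '/' come back as %3A and %2F, so the result is no longer a usable URL
        System.out.println(encodeEverything(url));
    }
}
```

The accented character is handled correctly (m%C3%BAm), which is why the method remains useful when applied to a single path segment or parameter value instead of the full URL.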

The solution is given on the page http://mindprod.com/jgloss/urlencoded.html
Er, still not quite (I experiment as I type...): it points out that accents are not encoded!

So I came up with a composite, somewhat convoluted solution:
Code:
import java.net.URI;
import java.net.URISyntaxException;
import java.net.MalformedURLException;
import java.io.UnsupportedEncodingException;

String str = "http://ws.audioscrobbler.com/1.0/artist/múm/toptags.xml";

String EncodeURL(String url)
{
  // Use URI to encode low Ascii characters depending on context of various parts
  // For some reason, uri = new URI(url) chokes on space, so we have to split the URL
  String scheme = null; // http, ftp, etc.
  String ssp = null; // scheme-specific part
  String fragment = null; // #anchor for example
  int colonPos = url.indexOf(":");
  if (colonPos < 0) return "Not an URL";
  scheme = url.substring(0, colonPos);
  ssp = url.substring(colonPos + 1);
  int fragPos = ssp.lastIndexOf("#");
  if (fragPos >= 0)
  {
    // Won't work if there is no real anchor/fragment
    // but this char is part of one parameter of the query,
    // but it is a bit unlikely...
    // That's probably why Java doesn't want to do it automatically,
    // it must be disambiguated manually
    fragment = ssp.substring(fragPos + 1);
    ssp = ssp.substring(0, fragPos);
  }

  URI uri = null;
  try
  {
    uri = new URI(scheme, ssp, fragment);
  } catch (URISyntaxException use) { return use.toString(); }
  String encodedURL1 = null;
  try
  {
    encodedURL1 = uri.toURL().toString();
  } catch (MalformedURLException mue) { return mue.toString(); }
  // Here, we still have Unicode chars unchanged

  byte[] utf8 = null;
  // Convert whole string to UTF-8 at once: low Ascii (below 0x80) is unchanged, other stuff is converted
  // to UTF-8, which always have the high bit set.
  try
  {
    utf8 = encodedURL1.getBytes("UTF-8");
  } catch (UnsupportedEncodingException uee) { return uee.toString(); }

  StringBuffer encodedURL = new StringBuffer();

  byte[] conv = new byte[1];
  for (int i = 0; i < utf8.length; i++)
  {
    if (utf8[i] < 0) // Beyond Ascii: high bit is set, hence negative byte
    {
      encodedURL.append("%" + Integer.toString(256 + (int)utf8[i], 16));
    }
    else
    {
      conv[0] = utf8[i];
      try
      {
        encodedURL.append(new String(conv, "ASCII")); // Convert back to Ascii
      } catch (UnsupportedEncodingException uee) { return uee.toString(); }
    }
  }

  return encodedURL.toString();
}

void setup()
{
  println(EncodeURL(str));
  println(EncodeURL("http://www.example.com/you & I 10%? weird & weirder neé"));
  println(EncodeURL("http://www.example.com/Éric.html#CV"));
  println(EncodeURL("http://www.example.com/éditer.php?p1=déjà vu&p2=sl/ash#meh"));
  exit();
}

There might be better, simpler ways... I would be happy to see them.
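One candidate for a simpler way, offered as a sketch rather than a tested replacement: java.net.URI has a toASCIIString() method that percent-encodes non-ASCII characters as UTF-8, so the manual byte loop may not be needed. The version here keeps the same split at the first ':' and the last '#' as EncodeURL. The class and method names (AsciiUrlSketch, toAsciiUrl) are made up for illustration, and it is plain Java rather than a Processing sketch.

```java
import java.net.URI;
import java.net.URISyntaxException;

class AsciiUrlSketch {
    // Hypothetical helper: quotes illegal ASCII via the URI constructor,
    // then lets toASCIIString() escape the non-ASCII characters.
    static String toAsciiUrl(String url) {
        int colonPos = url.indexOf(':');
        if (colonPos < 0) throw new IllegalArgumentException("no scheme in: " + url);
        String scheme = url.substring(0, colonPos);
        String ssp = url.substring(colonPos + 1); // scheme-specific part
        String fragment = null;
        int fragPos = ssp.lastIndexOf('#');
        if (fragPos >= 0) {
            // Same caveat as EncodeURL: a '#' inside a query parameter would be misread as a fragment
            fragment = ssp.substring(fragPos + 1);
            ssp = ssp.substring(0, fragPos);
        }
        try {
            return new URI(scheme, ssp, fragment).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // \u00fa is ú, escaped so the source file encoding does not matter
        System.out.println(toAsciiUrl("http://ws.audioscrobbler.com/1.0/artist/m\u00fam/toptags.xml"));
        System.out.println(toAsciiUrl("http://www.example.com/a b"));
    }
}
```

Unlike EncodeURL, this sketch throws on bad input instead of returning the error text as if it were a URL; that is a design choice of the sketch, not something the thread settles.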
Re: character conversion for urls (xmlelement)
Reply #3 - May 28th, 2008, 3:05pm
 
a little simpler...
Code:

String prefix = "http://ws.audioscrobbler.com/1.0/artist/";
String suffix = "/toptags.xml";

String encodeURL(String name) {
  StringBuffer sb = new StringBuffer();
  sb.append(prefix);
  byte[] utf8 = name.getBytes("UTF-8");
  for (int i = 0; i < utf8.length; i++) {
    int value = utf8[i] & 0xff;
    if (value < 33 || value > 126) {
      sb.append('%');
      sb.append(hex(value, 2));
    } else {
      sb.append((char) value[i]);
    }
  }
  sb.append(suffix);
  return sb.toString();
}

Re: character conversion for urls (xmlelement)
Reply #4 - May 28th, 2008, 4:41pm
 
You cheat! ;) I was trying to make a generic solution...
Besides, it needs a few small changes:
Code:
String prefix = "http://ws.audioscrobbler.com/1.0/artist/";
String suffix = "/toptags.xml";

String encodeURL(String name) {
  StringBuffer sb = new StringBuffer();
  sb.append(prefix);
  byte[] utf8 = null;
  try { utf8 = name.getBytes("UTF-8"); } catch (Exception e) {}
  for (int i = 0; i < utf8.length; i++) {
    int value = utf8[i] & 0xff;
    if (value < 33 || value > 126) {
      sb.append('%');
      sb.append(hex(value, 2));
    } else {
      sb.append((char) value);
    }
  }
  sb.append(suffix);
  return sb.toString();
}

void setup()
{
  println(encodeURL("múm"));
  exit();
}

If you go this way, there is an even simpler option:
Code:
String encodeURL(String name) {
  String encoded = null;
  try { encoded = prefix + java.net.URLEncoder.encode(name, "UTF-8") + suffix; } catch (Exception e) {}
  return encoded;
}

:D
Actually, this version is the better one when the name contains special characters such as a question mark or an exclamation mark.
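A side-by-side sketch of that difference, in plain Java (no Processing hex() helper, so String.format stands in for it). The class and method names are invented, and "Therapy?" is just an example name chosen to contain a question mark.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class NameEncodingSketch {
    // The byte loop from this thread: escapes only bytes outside the range 33..126.
    static String loopEncode(String name) {
        StringBuilder sb = new StringBuilder();
        for (byte b : name.getBytes(StandardCharsets.UTF_8)) {
            int value = b & 0xff;
            if (value < 33 || value > 126) {
                sb.append(String.format("%%%02X", value));
            } else {
                sb.append((char) value);
            }
        }
        return sb.toString();
    }

    // The library call: also escapes reserved ASCII such as '?', '!' and '&'.
    static String libraryEncode(String name) {
        try {
            return URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(loopEncode("Therapy?"));    // the '?' is left as is and would start a query string
        System.out.println(libraryEncode("Therapy?")); // the '?' becomes %3F
    }
}
```

Both agree on accented letters; they differ only on the reserved ASCII characters, and URLEncoder also turns a space into "+" where the loop produces %20.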

I appreciate the simpler way to convert from byte to char (I forgot that append() accepts a char); sometimes I take convoluted routes...
Re: character conversion for urls (xmlelement)
Reply #5 - May 29th, 2008, 8:21am
 
thanks a lot to both of you!

of course i tried the simplest code first ;) and it works like a charm :D

that saved my day :)

EDIT:

just in case someone wants to do the same as myself.
still had a problem with weird artist names like "Iron & Wine" or "+/-".
but with this workaround it behaves well:

Code:
String convertencoding(String thestring) {
  // URL-encode thestring, using UTF-8 for non-ASCII characters
  String encoded = null;
  try {
    encoded = java.net.URLEncoder.encode(thestring, "UTF-8");
  } catch (Exception e) {}

  // workaround for the "&" problem with artists like "Iron & Wine"
  String Strlist1[] = split(encoded, "%26");
  encoded = join(Strlist1, "%2526");

  // workaround for the "+" problem with artists like "+/-"
  String Strlist2[] = split(encoded, "%2B");
  encoded = join(Strlist2, "%252B");

  // workaround for the "/" problem with artists like "+/-"
  String Strlist3[] = split(encoded, "%2F");
  encoded = join(Strlist3, "%252F");

  return encoded;
}
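
For anyone doing this outside Processing, the same workaround can be sketched in plain Java with String.replace in place of split() and join(). The class and method names are invented here. What the workaround does is escape the percent sign a second time (%26 becomes %2526); presumably the server decodes the path once more than expected, but that is a guess, since nothing in this thread confirms why it works.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class DoubleEscapeSketch {
    // Hypothetical plain-Java equivalent of convertencoding().
    static String convertEncoding(String name) {
        String encoded;
        try {
            encoded = URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
        // Escape the '%' of these three sequences again: "%26" -> "%2526", and so on
        return encoded.replace("%26", "%2526")  // '&'
                      .replace("%2B", "%252B")  // '+'
                      .replace("%2F", "%252F"); // '/'
    }

    public static void main(String[] args) {
        System.out.println(convertEncoding("Iron & Wine"));
        System.out.println(convertEncoding("+/-"));
    }
}
```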

