As part of an application I’m developing, I needed to store tags from multiple sources, and I chose to use Flickr’s method of storing raw and clean tags. I needed to figure out how Flickr converts raw tags to clean ones. This article by Terrell Russell helped a lot, but missed a few elements (and I needed it in Java).
Russell's original regular expression did not include the comma, and I also found that certain special characters are substituted rather than removed: so far ß becomes s, and the final sigma ς becomes σ. I expect to find more of these as I keep comparing against Flickr tags.
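public static String cleanRawTag(String raw, boolean isMachineTag)
{
    if (isMachineTag)
    {
        // raw   = geo:lat=13.751193
        // clean = geo:lat=13751193
        // Only the value after '=' is cleaned; the "namespace:predicate=" part is just lowercased.
        // If there is no '=', indexOf returns -1 and the whole tag is cleaned like a regular tag.
        int equals = raw.indexOf('=');
        return raw.substring(0, equals + 1).toLowerCase() + cleanRawTag(raw.substring(equals + 1), false);
    }
    else
    {
        // Strip whitespace and punctuation. The hyphen, brackets and backslash must be escaped
        // inside the character class; an unescaped ",-_" would be a range that also removes digits.
        String clean = raw.replaceAll("[\\s\"!@#$%^&*():,\\-_+='/.;`<>\\[\\]?\\\\]", "").toLowerCase();
        // Characters that are substituted rather than removed.
        return clean.replace('ß', 's').replace('ς', 'σ');
    }
}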
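To make the behaviour easier to check, here is a minimal, self-contained sketch of the same logic with a few sample inputs. The class name `TagCleanerSketch` and the sample tags are my own assumptions for illustration. The printed results are simply what this code produces and have not been verified against Flickr itself. The only deliberate differences from the method above are that the pattern is compiled once and that the two special characters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper class; only the cleaning logic comes from the method above.
final class TagCleanerSketch {
    // Same character class as above, compiled once so it can be reused across calls.
    private static final Pattern STRIPPED =
            Pattern.compile("[\\s\"!@#$%^&*():,\\-_+='/.;`<>\\[\\]?\\\\]");

    static String cleanRawTag(String raw, boolean isMachineTag) {
        if (isMachineTag) {
            // Keep "namespace:predicate=" (lowercased) and clean only the value.
            int equals = raw.indexOf('=');
            return raw.substring(0, equals + 1).toLowerCase()
                    + cleanRawTag(raw.substring(equals + 1), false);
        }
        String clean = STRIPPED.matcher(raw).replaceAll("").toLowerCase();
        // \u00DF is the sharp s, \u03C2 the final sigma, \u03C3 the regular sigma.
        return clean.replace('\u00DF', 's').replace('\u03C2', '\u03C3');
    }

    public static void main(String[] args) {
        System.out.println(cleanRawTag("New York", false));          // newyork
        System.out.println(cleanRawTag("Rock & Roll!", false));      // rockroll
        System.out.println(cleanRawTag("geo:lat=13.751193", true));  // geo:lat=13751193
    }
}
```

Storing both the raw string and the result of `cleanRawTag` lets the application display tags as the user typed them while matching on the clean form.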