tolower() – catching errors from unmappable characters

The tolower() function throws an error when its input contains characters it cannot convert to wide characters (the 'utf8towcs' step named in the error message) – a common occurrence when analysing social media data with emoticons.

Emoticons are symbols such as 😊🐶🎅 that are commonly used on mobile phones but aren't recognised on all platforms.

For example, when converting tweets sent to @delta (Delta Airlines) to lower case, I got the following error:

Error in tolower(text) :
invalid input '@ActualALove: First time I've seen a foot-rest in first class! Oh @Delta, how I love thee \ud83d\ude0a✈\ud83d\udc78 http://t.co/noKI9CiM' in 'utf8towcs'

When I looked up the actual tweet, it looked like this:

[Image: screenshot of the original tweet (20130106-194554.jpg)]

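The two Unicode characters that weren't recognised were \ud83d\ude0a (U+1F60A, SMILING FACE WITH SMILING EYES) and \ud83d\udc78 (U+1F478, PRINCESS). Each appears in the error message as a UTF-16 surrogate pair, because both code points lie above U+FFFF.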
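
To check which character an escaped pair stands for, the standard UTF-16 surrogate arithmetic can be done directly in R. This is only a small sketch; surrogate_to_codepoint() is a name I've made up for illustration, not a base R function.

```r
# Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
# high must be in 0xD800..0xDBFF and low in 0xDC00..0xDFFF.
surrogate_to_codepoint <- function(high, low) {
  stopifnot(high >= 0xD800, high <= 0xDBFF,
            low  >= 0xDC00, low  <= 0xDFFF)
  0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
}

# The two pairs from the error message above
smile    <- surrogate_to_codepoint(0xD83D, 0xDE0A)
princess <- surrogate_to_codepoint(0xD83D, 0xDC78)

sprintf("U+%X", smile)     # "U+1F60A"
sprintf("U+%X", princess)  # "U+1F478"
```

The resulting code points can then be looked up in the Unicode character charts to get the official names.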

Gaston Sanchez has posted a solution to this problem in his blog Data Analysis Visually Enforced. I've used the code and it works well. When I have time, I'll extend it to replace the offending characters instead of returning NA for the entire string.
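
For reference, here is a minimal sketch of the two approaches mentioned above. It is not a copy of the code from that blog, and the function names safe_tolower() and strip_then_lower() are my own, chosen for illustration. The first wraps tolower() in tryCatch() and returns NA for any string that fails. The second is one possible way to replace the offending characters: it assumes the input is UTF-8 and uses iconv() to drop every byte that can't be represented in ASCII, which also removes legitimate accented letters and symbols such as ✈, so it may be too blunt for some data.

```r
# Hypothetical helper: lower-case each string on its own, returning NA
# for any string that tolower() rejects instead of stopping with an error.
safe_tolower <- function(x) {
  vapply(x, function(s) {
    tryCatch(tolower(s), error = function(e) NA_character_)
  }, character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: drop bytes that cannot be converted to ASCII,
# then lower-case what remains. Assumes the input is UTF-8 encoded.
strip_then_lower <- function(x) {
  cleaned <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  tolower(cleaned)
}

tweets <- c("Oh @Delta, how I love thee", "First Class FOOT-REST")
safe_tolower(tweets)
strip_then_lower(tweets)
```

Processing the strings one at a time in safe_tolower() means a single bad tweet yields one NA rather than causing the whole vector to fail.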
