tolower() – error catching unmappable characters

The tolower() function throws an error when its input contains characters it can't convert in the data's Unicode encoding, which is a common occurrence when analysing social media data that contains emoticons.

Emoticons are symbols such as 😊🐶🎅 that are commonly used on mobile phones but aren't recognised on all platforms.

For example, when converting tweets sent to @delta (Delta Air Lines) to lower case, I got the following error:

Error in tolower(text) :
invalid input '@ActualALove: First time I've seen a foot-rest in first class! Oh @Delta, how I love thee \ud83d\ude0a✈\ud83d\udc78' in 'utf8towcs'

When I looked up the actual tweet, it looked like this.


The two Unicode characters that weren't recognised were \ud83d\ude0a (U+1F60A, SMILING FACE WITH SMILING EYES) and \ud83d\udc78 (U+1F478, PRINCESS).
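
Each escaped sequence in the error message is a UTF-16 surrogate pair, that is, two 16-bit units that together encode one character outside the basic plane. As a small worked example, the standard surrogate arithmetic recovers the code point from the pair. The function name surrogate_to_codepoint is my own, made up for illustration.

```r
# Combine a UTF-16 high/low surrogate pair into a single Unicode code point.
# (Hypothetical helper name; the formula is the standard UTF-16 decoding rule.)
surrogate_to_codepoint <- function(high, low) {
  as.integer(0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00))
}

smiling  <- surrogate_to_codepoint(0xD83D, 0xDE0A)
princess <- surrogate_to_codepoint(0xD83D, 0xDC78)

print(sprintf("U+%X", smiling))   # "U+1F60A"
print(sprintf("U+%X", princess))  # "U+1F478"
```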

Gaston Sanchez has posted a solution to this problem in his blog Data Analysis Visually Enforced. I've used the code and it works well. When I have time, I'll extend it to replace the offending characters instead of returning NA for the entire string.
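
To give a feel for the approach, here is a minimal sketch of a wrapper of this kind. I have written it from the description above rather than copied it from the blog, so treat the details as assumptions. tryTolower returns NA when tolower() fails. cleanTolower is a hypothetical variant of my own that drops bytes that are not valid UTF-8 (using iconv() with sub = "") before lower-casing, which is one possible way to keep the rest of the string; it assumes the text is meant to be UTF-8, and iconv() behaviour can vary between platforms.

```r
# Sketch: lower-case a single string, returning NA if tolower() raises an error.
tryTolower <- function(x) {
  tryCatch(tolower(x), error = function(e) NA_character_)
}

# Hypothetical variant: drop bytes that are not valid UTF-8, then lower-case.
# Assumes the input is intended to be UTF-8.
cleanTolower <- function(x) {
  tolower(iconv(x, from = "UTF-8", to = "UTF-8", sub = ""))
}

# Example input: one clean string and one containing an invalid byte (0xFF).
bad <- "Oh @Delta \xff"
Encoding(bad) <- "UTF-8"
tweets <- c("First Time in FIRST class", bad)

# Apply element by element so that one bad tweet does not affect the others.
lowered <- vapply(tweets, tryTolower, character(1), USE.NAMES = FALSE)
cleaned <- cleanTolower(tweets)
```

With the first function the bad tweet comes back as NA; with the second only the offending bytes are lost.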


2 thoughts on “tolower() – error catching unmappable characters”

  1. Hi,
    Sorry, I am a newbie at using R. I have exactly the same issue as you, but I don’t know how to apply the function to the text I’m mining. Where should I copy the function? Can you explain it in a very simple way, please?
    Thank you for your time.

  2. Thanks for the comment L.

    The easiest way is to copy the code into a new R script file and save it under a name (e.g. tryTolower.r). Then, in the R script where you’re doing the text mining, use the source command to load the function. For example:

    source("c:/path-name/tryTolower.r")

    After this line of code, you can call the tryTolower function like this:

    result = tryTolower(sentence)

    Hope this helps; if not, leave another comment. Ray

