Unicode is a computing industry standard for the consistent encoding and representation of text in most of the world's writing systems. Unicode assigns a unique number (called a code point) to each character from nearly every language in active use today, plus many other characters such as mathematical symbols. I often run into Unicode characters in tabular text data, particularly foreign characters that carry an accent or other diacritical mark.
Unicode is really a problem when I use the R knitr package in RStudio to create PDF reports of data analyses; knitr chokes on Unicode characters that are displayed in lists or data frames.
In contrast to Unicode, ASCII only assigns values to 128 characters (a-z, A-Z, 0-9, space, some punctuation, and some control characters). For every character that has an ASCII value, the Unicode code point and the ASCII value are the same. We can use this overlap to remove non-ASCII Unicode characters from data.
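A quick way to see this overlap is to inspect code points directly. The sketch below uses Python (chosen just for illustration; the same idea applies in any language): `ord()` returns a character's Unicode code point, which for ASCII characters is identical to its ASCII value.

```python
# For ASCII characters, the Unicode code point equals the ASCII value;
# accented characters fall outside the 0-127 ASCII range.
for ch in ("A", "z", "0", "é"):
    print(ch, ord(ch), ord(ch) <= 127)
```

Running this shows `A`, `z`, and `0` at code points 65, 122, and 48 (all ASCII), while `é` sits at 233, outside the ASCII range.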
This regular expression will identify non-ASCII values (Unicode code points outside of 0-127): `[^\x00-\x7F]`
I use TextMate (Mac) or Notepad++ (Windows) to find-and-replace the non-ASCII Unicode characters with a question mark. The expression matches any character that is not ASCII (the ^ symbol inside the brackets means 'not', and \x00-\x7F covers the entire ASCII table). Because the expression only matches values outside of hex 00 (ASCII code 0) through hex 7F (ASCII code 127), it won't replace horizontal tabs or carriage returns, which fall inside that range.
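The same find-and-replace can be scripted. Here is a minimal sketch in Python (the function name and the `?` placeholder are my choices, not from the original workflow), applying the identical regular expression:

```python
import re

def to_ascii(text, replacement="?"):
    # Replace every character outside the ASCII range (hex 00-7F)
    # with a placeholder, mirroring the editor find-and-replace.
    # Tabs and carriage returns are ASCII, so they are untouched.
    return re.sub(r"[^\x00-\x7F]", replacement, text)

print(to_ascii("café naïve résumé"))  # → "caf? na?ve r?sum?"
```

Since the pattern operates on code points rather than bytes, it behaves the same whether the input came from a UTF-8 file or was typed directly.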
I haven't yet figured out how to do this directly in R (it's a work in progress).