Systems biologist and bioinformatician

Walter Jessen

How to Remove Non-ASCII Unicode Characters from Data

2 min read

Unicode is a computing industry standard for the consistent encoding and representation of text expressed in most of the world's writing systems. Unicode assigns unique numbers (called code points) to characters from nearly all languages in active use today, plus many other characters such as mathematical symbols. I often run into non-ASCII Unicode characters in tabular text data, particularly letters that carry an accent or other diacritical mark.

Unicode is really a problem when I use the R knitr package in RStudio to create PDF reports of data analyses; knitr chokes on Unicode characters that are displayed in lists or data frames.

In contrast to Unicode, ASCII assigns values to only 128 characters (a-z, A-Z, 0-9, space, some punctuation, and some control characters). For every character that has an ASCII value, the Unicode code point and the ASCII value are the same. We can use this information to remove non-ASCII Unicode characters from data.
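To make the overlap concrete, here is a minimal JavaScript sketch. The helper name `isAscii` and the sample strings are made up for illustration; `codePointAt` is the standard string method that returns a character's Unicode code point.

```javascript
// ASCII covers code points 0 through 127 (hex 00 through 7F).
const ASCII_MAX = 0x7f;

// Hypothetical helper: true when every code point in the text is in the ASCII range.
function isAscii(text) {
  for (const ch of text) {
    if (ch.codePointAt(0) > ASCII_MAX) {
      return false;
    }
  }
  return true;
}

console.log("A".codePointAt(0));      // 65, the same value ASCII assigns to "A"
console.log("\u00e9".codePointAt(0)); // 233 (e with acute accent), outside ASCII
console.log(isAscii("Jessen"));       // true
console.log(isAscii("M\u00fcller"));  // false (u with umlaut is code point 252)
```

Any character whose code point is above 127 is the kind of character the rest of this post is about removing.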

This regular expression will identify non-ASCII values (Unicode code points outside of 0-127):

[^\x00-\x7F]+

I use TextMate (Mac) or Notepad++ (Windows), and replace the non-ASCII Unicode characters with a question mark. The expression matches any run of one or more characters outside the ASCII range (the ^ symbol means 'not' and \x00-\x7F is the entire ASCII table). Because the expression only matches values outside of Hex 00 (ASCII code 0) to Hex 7F (ASCII code 127), it won't replace horizontal tabs (Hex 09) or carriage returns (Hex 0D), which are themselves ASCII characters.
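The same find-and-replace can be sketched in code. This is a JavaScript illustration rather than the editor workflow described above; the function name `replaceNonAscii` and the sample text are assumptions, and the `g` flag is added so that every match is replaced, not just the first.

```javascript
// Same pattern as above: one or more characters outside hex 00 to 7F.
const NON_ASCII = /[^\x00-\x7F]+/g;

// Hypothetical helper: swap each run of non-ASCII characters for a replacement string.
function replaceNonAscii(text, replacement = "?") {
  return text.replace(NON_ASCII, replacement);
}

// The tab between the two words is ASCII, so it is left untouched.
const cleaned = replaceNonAscii("Caf\u00e9\tM\u00fcller");
console.log(cleaned === "Caf?\tM?ller"); // true
```

Note that because of the trailing +, a run of several adjacent non-ASCII characters collapses into a single question mark. Dropping the + would give one question mark per matched character instead.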

I haven't yet figured out how to do this directly in R (it's a work in progress).

Walter Jessen

Celebrate Science Indiana 2015

Science festival

Location: Indiana State Fair Grounds

An annual public science festival held in Indianapolis, Indiana, that has hands-on, interactive activities for kids and their families.

http://www.celebratescienceindiana.org

Walter Jessen

Getting JavaScript Output at JS Bin

1 min read

I'm sharpening my JavaScript skills and have been working through lessons on Pluralsight and YouTube over the last few weeks. One tool that frequently gets used in the lessons is JS Bin.

JS Bin is an online text editor primarily focused on JavaScript. When you type in the JS Bin editor panels (HTML, CSS or JavaScript), you can see the output being generated in real time in the output panel. Both the code and a complete output of the code can also be saved and shared.

I spent some time today trying to figure out how to make the output work for JavaScript code. I know ... duh. It's very simple. Nevertheless, I wasn't able to find documentation online, so here it is:

Add this to the JavaScript editor panel at the top:

Add this to the HTML editor panel between body tags:

You will now be able to render JavaScript function calls in the output panel.
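As a rough sketch of how such a setup can look (not necessarily the exact snippets referred to above): the HTML panel holds a placeholder element between the body tags, for example `<div id="output"></div>`, and the JavaScript panel starts with a small helper that writes into it. The id `output` and the names `makeRenderer`, `render` and `add` are all assumptions made for this example.

```javascript
// Hypothetical helper: returns a function that appends a value to the target element.
function makeRenderer(target) {
  return function render(value) {
    target.innerHTML += String(value) + "<br>";
  };
}

// Example function whose result we want to see in the output panel.
function add(a, b) {
  return a + b;
}

// Only runs in a browser context such as JS Bin, where `document` exists.
// Assumes the HTML panel contains <div id="output"></div>.
if (typeof document !== "undefined") {
  const outputElement = document.getElementById("output");
  if (outputElement) {
    const render = makeRenderer(outputElement);
    render(add(2, 3)); // appends the result of the call to the output element
  }
}
```

Passing the element into `makeRenderer` instead of looking it up inside the helper keeps the helper usable with any element on the page.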

Code on!

Walter Jessen

Yes: Network Science and Alzheimer’s Disease, the IADC’s annual scientific symposium, is designed to help scientists and clinicians to better understand the complex multi-level systems involved in Alzheimer’s disease, from altered brain connectivity and gene networks to the social networks surrounding patients and caregivers.

Walter Jessen

Coho salmon with cilantro lime sauce

I really don't cook all that much. But in an effort to manage my cholesterol by diet, today I decided I was going to cook fish. For dinner tonight I made steamed wild-caught Coho salmon covered in a cilantro lime sauce with zucchini, squash and grape tomatoes. Very tasty!
