Matthew O’Connor and I recently worked on a project that sent SMS messages to mobile customers. Unfortunately, the SMS aggregator we used on the project rejected messages containing non-ASCII characters.
One approach we considered was to strip our messages of any characters that were not ASCII and send the rest as is. After looking through some of the rejected messages, we realized most of the problems came from Unicode punctuation. Instead of simply deleting the characters, we tried transliterating them to their ASCII equivalents.
Our first approach used Iconv:
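As a rough illustration of the strip-only idea, a minimal sketch in plain Ruby might look like this. The method name `strip_non_ascii` is hypothetical and was not part of our project code.

```ruby
# Hypothetical helper: drop every character outside the 7-bit ASCII range.
def strip_non_ascii(text)
  text.to_s.gsub(/[^\x00-\x7f]/, "")
end

# The curly quotes and the euro sign are removed, not replaced.
strip_non_ascii("“foo” costs 5€")
```

Deleting this way loses the quotation marks entirely, which is what made transliteration look more attractive.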
require 'iconv'

module SmsEncoder
  def self.convert(utf8_text)
    text = Iconv.iconv("US-ASCII//TRANSLIT", "UTF-8", utf8_text).first
    text.gsub(/`/, "'")
  rescue Iconv::Failure
    ""
  end
end
For some reason the backtick (`) also caused problems, so we converted it to an apostrophe after running Iconv.
This approach worked perfectly on OS X, but as soon as we moved to the Linux servers the libiconv behavior changed: most untranslatable characters became question marks instead of empty strings.
Instead of wrestling with libiconv, we looked for a solution written entirely in Ruby. We found unidecode, which got us most of the way there. Unidecode did a little more than we wanted, though: it translated Chinese and Japanese characters to their approximate sounds. For example, 今年1月 gets transliterated to Jin Nian 1Yue.
We decided to transliterate only extended Latin characters, punctuation, and money symbols.
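To make that decision concrete, here is a minimal sketch of a codepoint check for those character classes. The constant and method names are illustrative only, and the ranges are our reading of the Unicode block boundaries, not something unidecode provides.

```ruby
# Illustrative only: the Unicode blocks we were willing to transliterate.
TRANSLITERABLE_BLOCKS = [
  0x00A0..0x00FF, # Latin-1 Supplement (printable part)
  0x0100..0x017F, # Latin Extended-A
  0x0180..0x024F, # Latin Extended-B
  0x2000..0x206F, # General Punctuation
  0x20A0..0x20CF  # Currency Symbols
]

# True when the character's codepoint falls inside one of the blocks above.
def transliterable?(character)
  codepoint = character.ord
  TRANSLITERABLE_BLOCKS.any? { |block| block.include?(codepoint) }
end
```

With this check, characters such as ñ, € and “ qualify for transliteration, while CJK characters such as 今 do not and would simply be dropped.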
Here is the final code with the unidecode monkey patch:
require 'set'
require 'unidecode'

module SmsEncoder
  def self.convert(utf8_text)
    # Drop unidecode's "[?]" placeholder and swap backticks for apostrophes.
    Unidecoder.decode(utf8_text.to_s).gsub("[?]", "").gsub(/`/, "'").strip
  end
end

module Unidecoder
  class << self
    # Replace each character outside printable ASCII with its transliteration,
    # or with nothing if it falls outside the ranges we care about.
    def decode(string)
      string.gsub(/[^\x20-\x7e]/u) do |character|
        codepoint = character.unpack("U").first
        if should_transliterate?(codepoint)
          CODEPOINTS[code_group(character)][grouped_point(character)] rescue ""
        else
          ""
        end
      end
    end

    private

    # c.f. http://unicode.org/roadmaps/bmp/
    CODE_POINT_RANGES = {
      :basic_latin         => Set.new(32 .. 126),
      :latin1_supplement   => Set.new(160 .. 255),
      :latin1_extended_a   => Set.new(256 .. 383),
      :latin1_extended_b   => Set.new(384 .. 591),
      :general_punctuation => Set.new(8192 .. 8303),
      :currency_symbols    => Set.new(8352 .. 8399),
    }

    def should_transliterate?(codepoint)
      # Union of every range; inject(:|) works in plain Ruby, whereas sum would start from 0.
      @all_ranges ||= CODE_POINT_RANGES.values.inject(:|)
      @all_ranges.include? codepoint
    end
  end
end
And the tests:
require 'test/unit'

class SmsEncoderTest < Test::Unit::TestCase
  def test_transliteration_of_blank
    assert_equal "", SmsEncoder.convert(nil)
    assert_equal "", SmsEncoder.convert("")
  end

  def test_transliteration_of_whitespace
    assert_equal "", SmsEncoder.convert(" \t\n")
  end

  def test_transliteration_of_text_surrounded_by_space
    assert_equal "abc", SmsEncoder.convert(" abc ")
  end

  def test_transliteration_of_ascii
    orig_text = "!\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
    conv_text = SmsEncoder.convert(orig_text)
    assert_equal orig_text.gsub(/`/, "'"), conv_text
  end

  def test_transliteration_of_unicode_punctuation
    utf8_text = "“foo” ‹foo› ‘foo’ ,foo, –foo— {foo} (foo) `foo`"
    ascii_text = SmsEncoder.convert(utf8_text)
    assert_equal "\"foo\" <foo> 'foo' ,foo, foo-- {foo} (foo) 'foo'", ascii_text
  end

  def test_transliteration_of_common_latin1_characters
    utf8_text = "ñ ò ^ ¡ ¿ Æ æ ß Ç §"
    ascii_text = SmsEncoder.convert(utf8_text)
    assert_equal "n o ^ ! ? AE ae ss C SS", ascii_text
  end

  def test_transliteration_of_money_characters
    utf8_text = "€ £ $ ¥"
    ascii_text = SmsEncoder.convert(utf8_text)
    assert_equal "EU PS $ Y=", ascii_text
  end

  def test_untransliterable_characters
    utf8_text = "ɏ \x1f \x01 \x00 Ʌ \x7f"
    ascii_text = SmsEncoder.convert(utf8_text)
    assert_equal "", ascii_text
  end

  def test_transliteration_of_chinese_characters
    utf8_text = "ウェブ全体から検索"
    ascii_text = SmsEncoder.convert(utf8_text)
    assert_equal "", ascii_text
  end
end