Unicode Transliteration to ASCII

November 24, 2008 Pivotal Labs

Matthew O’Connor and I recently worked on a project that sent SMS messages to mobile customers. Unfortunately, the SMS aggregator we used on the project rejected messages containing non-ASCII characters.

One approach we considered was to strip any non-ASCII characters from our messages and send what remained. After looking through some of the rejected messages, we realized that most of the problems came from Unicode punctuation. Instead of simply deleting those characters, we tried transliterating them to their ASCII equivalents.
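
A minimal sketch of the stripping approach, assuming Ruby 1.9 or later where String#encode is available. This is not the code from the project, and strip_non_ascii is a hypothetical name. It shows why plain deletion is lossy: the curly quotes and the dash vanish instead of becoming their ASCII look-alikes.

```ruby
# Hypothetical helper: drop every character that has no US-ASCII encoding.
def strip_non_ascii(utf8_text)
  utf8_text.to_s.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
end

# Curly quotes (U+201C, U+201D) and an en dash (U+2013) around plain text.
puts strip_non_ascii("\u201CHi\u201D \u2013 ok")
# prints "Hi  ok" (without the quotes): the punctuation is gone, not converted
```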

Our first approach used Iconv:

require 'iconv'

module SmsEncoder
  def self.convert(utf8_text)
    text = Iconv.iconv("US-ASCII//TRANSLIT", "UTF-8", utf8_text).first
    text.gsub(/`/, "'")
  rescue Iconv::Failure
    ""
  end
end

For some reason the backtick (`) also caused problems, so we replaced it with an apostrophe after the Iconv conversion.

This approach worked perfectly on OS X, but as soon as we moved to our Linux servers the libiconv behavior changed: most untranslatable characters became question marks instead of empty strings.

Instead of wrestling with libiconv, we looked for a solution written entirely in Ruby. We found unidecode, which got us most of the way there. Unidecode did a little more than we wanted, though: it also transliterates Chinese and Japanese characters to their approximate sounds, e.g. 今年1月 becomes Jin Nian 1Yue.

We decided to transliterate only extended Latin characters, punctuation, and currency symbols.
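
The decision above comes down to a membership test on the code point. A small sketch of that test in isolation, assuming block boundaries from the Unicode BMP roadmap; TRANSLITERABLE_BLOCKS and transliterable_block are hypothetical names used only for illustration. It uses ranges with cover? so no set of individual code points has to be built.

```ruby
# Block boundaries from the Unicode BMP roadmap; names are illustrative only.
TRANSLITERABLE_BLOCKS = {
  latin1_supplement:   0x00A0..0x00FF,
  latin_extended_a:    0x0100..0x017F,
  latin_extended_b:    0x0180..0x024F,
  general_punctuation: 0x2000..0x206F,
  currency_symbols:    0x20A0..0x20CF,
}.freeze

# Returns the block name for a single character, or nil when the character
# lies outside every allowed block and should simply be dropped.
def transliterable_block(char)
  codepoint = char.ord
  name, _range = TRANSLITERABLE_BLOCKS.find { |_name, range| range.cover?(codepoint) }
  name
end

p transliterable_block("\u00F1") # ñ  => :latin1_supplement
p transliterable_block("\u4ECA") # 今 => nil
```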

Here is the final code with the unidecode monkey patch:

require 'set'
require 'unidecode'

module SmsEncoder
  def self.convert(utf8_text)
    Unidecoder.decode(utf8_text.to_s).gsub("[?]", "").gsub(/`/, "'").strip
  end
end

module Unidecoder
  class << self
    def decode(string)
      string.gsub(/[^\x20-\x7e]/u) do |character|
        codepoint = character.unpack("U").first
        if should_transliterate?(codepoint)
          CODEPOINTS[code_group(character)][grouped_point(character)] rescue ""
        else
          ""
        end
      end
    end

    private

    # c.f. http://unicode.org/roadmaps/bmp/
    CODE_POINT_RANGES = {
      :basic_latin => Set.new(32 .. 126),
      :latin1_supplement => Set.new(160 .. 255),
      :latin_extended_a => Set.new(256 .. 383),
      :latin_extended_b => Set.new(384 .. 591),
      :general_punctuation => Set.new(8192 .. 8303),
      :currency_symbols => Set.new(8352 .. 8399),
    }

    def should_transliterate?(codepoint)
      # Union the block sets once; plain Ruby's Array#sum starts from 0 and cannot add Sets.
      @all_ranges ||= CODE_POINT_RANGES.values.reduce(:|)
      @all_ranges.include? codepoint
    end
  end
end

And the tests:

require 'test/unit'

class SmsEncoderTest < Test::Unit::TestCase
  def test_transliteration_of_blank
    assert_equal "", SmsEncoder.convert(nil)
    assert_equal "", SmsEncoder.convert("")
  end

  def test_transliteration_of_whitespace
    assert_equal "", SmsEncoder.convert(" \t\n")
  end

  def test_transliteration_of_text_surrounded_by_space
    assert_equal "abc", SmsEncoder.convert("  abc  ")
  end

  def test_transliteration_of_ascii
    orig_text = "!\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
    conv_text = SmsEncoder.convert(orig_text)
    assert_equal orig_text.gsub(/`/, "'"), conv_text
  end

  def test_transliteration_of_unicode_punctuation
    utf8_text = "“foo” ‹foo› ‘foo’ ,foo, –foo— {foo} (foo) `foo`"
    ascii_text = SmsEncoder.convert(utf8_text)
    assert_equal "\"foo\" <foo> 'foo' ,foo, -foo-- {foo} (foo) 'foo'", ascii_text
  end

  def test_transliteration_of_common_latin1_characters
    utf8_text = "ñ ò ^ ¡ ¿ Æ æ ß Ç §"
    ascii_text = SmsEncoder.convert(utf8_text)
    assert_equal "n o ^ ! ? AE ae ss C SS", ascii_text
  end

  def test_transliteration_of_money_characters
    utf8_text = "€ £ $ ¥"
    ascii_text = SmsEncoder.convert(utf8_text)
    assert_equal "EU PS $ Y=", ascii_text
  end

  def test_untransliterable_characters
    utf8_text = "ɏ \x1f \x01 \x00 Ʌ \x7f"
    ascii_text = SmsEncoder.convert(utf8_text)
    assert_equal "", ascii_text

  end

  def test_transliteration_of_japanese_characters
    utf8_text = "ウェブ全体から検索"
    ascii_text = SmsEncoder.convert(utf8_text)
    assert_equal "", ascii_text
  end
end
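
The patched decode calls two unidecode helpers that are not shown here, code_group and grouped_point. The sketch below illustrates one way such a paged lookup can work, assuming tables keyed by the code point's high byte and indexed by its low byte. SAMPLE_PAGES, page_key, page_offset and lookup are hypothetical names with a handful of hand-written entries, not the gem's actual data or API.

```ruby
# Hypothetical stand-in for per-page transliteration tables.
SAMPLE_PAGES = {
  "x00" => { 0xF1 => "n", 0xA3 => "PS" },              # Latin-1 Supplement entries
  "x20" => { 0x1C => '"', 0x1D => '"', 0x14 => "--" }, # General Punctuation entries
}.freeze

# Page name built from the high byte of the code point, e.g. U+201C -> "x20".
def page_key(char)
  format("x%02x", char.ord >> 8)
end

# Position inside the page, taken from the low byte, e.g. U+201C -> 0x1C.
def page_offset(char)
  char.ord & 0xFF
end

# Unknown pages or offsets fall back to an empty string, so the character is dropped.
def lookup(char)
  SAMPLE_PAGES.fetch(page_key(char), {}).fetch(page_offset(char), "")
end

puts lookup("\u00F1") # ñ is found in page "x00" at offset 0xF1
puts lookup("\u2014") # em dash is found in page "x20" at offset 0x14
```

Splitting the code point this way keeps each table small and lets a caller load only the pages a message actually touches.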
