Detecting invalid encoding in CSV uploads

January 17, 2009 Pivotal Labs

We ran into an odd bug using FasterCSV to import some data. We required the CSV files to be UTF-8 encoded, but some users uploaded files in other encodings. FasterCSV choked on bytes that weren’t valid UTF-8: it dropped the data from that point to the end of the line and left fields blank. We didn’t want to ask the user to select an encoding, because they’d probably get it wrong anyway, so we decided to reject any file containing characters that would cause problems. The trick, then, is how to detect them.

First, the tests. We want to detect whether an input string is valid UTF-8, and we need to handle both strings and IO (File or StringIO) objects (more on that in a bit).

describe "::encoding_is_utf8? checks strings and IOs" do
  before do
    @utf8 = "This is a test with ç international characters"
  end

  it "returns true when all characters are valid" do
    Importer.encoding_is_utf8?(@utf8).should be_true
    Importer.encoding_is_utf8?(StringIO.new(@utf8)).should be_true
  end

  it "returns false when any characters are invalid" do
    bogus = Iconv.conv('ISO-8859-1', 'UTF-8', @utf8)
    Importer.encoding_is_utf8?(bogus).should be_false
    Importer.encoding_is_utf8?(StringIO.new(bogus)).should be_false
  end
end

Here’s the implementation:

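require 'iconv'

class Importer
  # Returns true if every line of the String or IO is valid UTF-8.
  def self.encoding_is_utf8?(file_or_string)
    # Wrap a String in an Array so it can be enumerated like an IO.
    file_or_string = [file_or_string] if file_or_string.is_a?(String)
    # Valid UTF-8 survives a UTF-8 to UTF-8 conversion unchanged; //IGNORE drops invalid bytes.
    is_utf8 = file_or_string.all? { |line| Iconv.conv('UTF-8//IGNORE', 'UTF-8', line) == line }
    # Reset the read position so the CSV parser can start from the beginning.
    file_or_string.rewind if file_or_string.respond_to?(:rewind)
    is_utf8
  end
#...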

The meat of the check is the Iconv library. We convert from an assumed UTF-8 to UTF-8 with //IGNORE, which silently drops any byte sequences that aren’t valid UTF-8. If the output and input aren’t identical, there were bogus characters and the uploaded file should be rejected.
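Iconv was removed from Ruby's standard library in 2.0, so here is a minimal sketch of the same round-trip idea for newer Rubies. It assumes String#scrub (Ruby 2.1+) as a stand-in for the //IGNORE conversion. The helper name round_trips_as_utf8? is made up for illustration and is not part of our Importer.

```ruby
# Hypothetical helper, not part of the original Importer class.
# Drops invalid byte sequences and compares the result with the input,
# mirroring the Iconv 'UTF-8//IGNORE' round trip.
def round_trips_as_utf8?(line)
  # Work on a copy and treat the bytes as UTF-8, whatever label they arrived with.
  candidate = line.dup.force_encoding(Encoding::UTF_8)
  candidate.scrub('') == candidate
end

valid = "This is a test with ç international characters"
# Re-encode to ISO-8859-1 so ç becomes the single byte 0xE7, then keep the raw bytes.
latin1_bytes = valid.encode(Encoding::ISO_8859_1).b

puts round_trips_as_utf8?(valid)        # prints true
puts round_trips_as_utf8?(latin1_bytes) # prints false
```

The lone 0xE7 byte looks like the start of a multi-byte UTF-8 sequence but is followed by a space, so it gets dropped and the comparison fails.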

The #rewind is needed to reset the read position in the file so FasterCSV can start over from the beginning. Specs for that aren’t included here.
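To make the read-position problem concrete, here is a small sketch using StringIO. The check_and_rewind helper is hypothetical. It also moves the rewind into an ensure clause, a variation on our code, so the position is reset even when the check raises.

```ruby
require 'stringio'

# Hypothetical helper: runs a line-by-line check, then always rewinds so the
# next reader (the CSV parser, in our case) starts from the first byte.
def check_and_rewind(io)
  io.all? { |line| yield line }
ensure
  io.rewind if io.respond_to?(:rewind)
end

io = StringIO.new("name,city\nAda,London\n")
ok = check_and_rewind(io) { |line| line.include?(",") }

puts ok      # prints true
puts io.pos  # prints 0, so a parser reading next sees the whole file
```

Without the rewind, the enumeration would leave the position at the end of the stream (or partway through, if a line failed the check), and the parser would see little or nothing.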

Then in our controller, we ensure the CSV doesn’t have any bad characters before we give it to FasterCSV. We extracted that check into its own method, shown here:

def require_utf8!(csv_content)
  unless Importer.encoding_is_utf8?(csv_content)
    raise "Import file must be UTF-8 only. You can paste non-UTF-8 CSV directly into the CSV Text field for automatic conversion."
  end
end

As you can read in the exception message (which ends up in the flash), the user can work around the encoding issue by pasting the CSV into a textarea input in the browser, which automatically transcodes the data into UTF-8. Aren’t browsers awesome? The other option would be to transcode the CSV file, but the textarea is easier if the files aren’t gigundous. And since CSV can arrive as either a file or a textarea string, #encoding_is_utf8? needs to check both files and strings.
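For the transcoding route, here is a rough sketch on Ruby 1.9 or later. It assumes the source encoding is already known to be ISO-8859-1, which is exactly the piece of information we didn't trust users to supply. The latin1_to_utf8 name is hypothetical.

```ruby
# Hypothetical helper. Assumption: the bytes really are ISO-8859-1.
# Label the bytes with their true encoding, then transcode to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

latin1 = "fran\xE7ais".b  # raw bytes, with ç stored as the single byte 0xE7
utf8 = latin1_to_utf8(latin1)

puts utf8.valid_encoding?  # prints true
puts utf8 == "français"    # prints true
```

If the guess about the source encoding is wrong, the result is still valid UTF-8 but the characters come out garbled, which is arguably worse than rejecting the file.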

This approach and implementation seem fine to me. I get the feeling there might be a much simpler way, though. Anyone got a better idea?
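One candidate, assuming Ruby 1.9 or later: strings carry an encoding there, and String#valid_encoding? reports whether the bytes are well formed for it, so no conversion library is needed. The sketch below is a hypothetical rewrite, not our production code, and the EncodingCheck module name is invented for the example.

```ruby
require 'stringio'

# Hypothetical rewrite of the check for Ruby 1.9+; not the original Importer.
module EncodingCheck
  def self.utf8?(file_or_string)
    # Wrap a String in an Array so Strings and IOs are enumerated the same way.
    lines = file_or_string.is_a?(String) ? [file_or_string] : file_or_string
    # Treat each line's bytes as UTF-8 and ask Ruby whether they are well formed.
    lines.all? { |line| line.dup.force_encoding(Encoding::UTF_8).valid_encoding? }
  ensure
    # Reset the read position for whoever reads the IO next.
    file_or_string.rewind if file_or_string.respond_to?(:rewind)
  end
end

utf8  = "This is a test with ç international characters"
bogus = utf8.encode(Encoding::ISO_8859_1).b  # same text as raw ISO-8859-1 bytes

puts EncodingCheck.utf8?(utf8)                 # prints true
puts EncodingCheck.utf8?(StringIO.new(bogus))  # prints false
```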
