How To Test PDFs with Capybara

July 22, 2013 Joe Masilotti

Validating that content is rendered correctly is an integral part of testing web apps. Displaying user-submitted input in HTML is the core functionality of almost any website. For example, the last couple of web sites you used most likely had you enter information in a text box that was later shown to you on another page. A common way for Rails developers to test this is with RSpec and Capybara. Capybara provides lots of matchers to narrow down how to find elements on the page. But what if you aren’t displaying HTML to your users? What if the output is in a different format, like a PDF? How do you test PDFs with Capybara and RSpec?

Invalid Byte Sequence in UTF-8?

In one of my feature specs I tried calling page.should have_content('some content') on a rendered PDF. Unfortunately, this raised an error: invalid byte sequence in UTF-8. The error signals that the document contains bytes that are not valid in the expected encoding. The problem could be a couple of unknown characters or a document that is missing an encoding type. Either way, Capybara needs a way to grab the actual contents in a format it can understand.
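The encoding problem can be seen outside of Capybara. The following is a minimal sketch in plain Ruby. The method name readable_as_utf8? and both sample strings are assumptions made for illustration; the PDF string is a hand-written stand-in, not output from a real renderer.

```ruby
# Returns true when the given bytes can be read as UTF-8 text.
def readable_as_utf8?(bytes)
  bytes.dup.force_encoding('UTF-8').valid_encoding?
end

# Illustrative stand-ins (assumptions, not real responses).
html_body = '<p>some content</p>'
pdf_body  = "%PDF-1.4\n%\xE2\xE3\xCF\xD3\n".b # header line plus high-bit bytes

puts "HTML readable as UTF-8: #{readable_as_utf8?(html_body)}"
puts "PDF readable as UTF-8:  #{readable_as_utf8?(pdf_body)}"
```

Tagging binary data as UTF-8 does not make it valid UTF-8, which is why the PDF text has to be extracted before text matchers can work on it.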

wicked_pdf’s debug option

I used the wicked_pdf gem to transform HTML/CSS pages to PDF via the wkhtmltopdf shell utility. They have both been around for some time now and reliably render output using the Rails stack.

My first attempt to test the content of a PDF used wicked_pdf’s debug option. By adding :show_as_html => params[:debug].present? to my rendering line, I could access the HTML (before PDF rendering) by appending ?debug=1 to my request. Now that the content was standard HTML, Capybara’s page worked normally.
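As a rough sketch, the debug switch could be isolated in a small helper that builds the options passed to render. The helper name pdf_render_options and the file name 'report' are hypothetical; :pdf and :show_as_html are the wicked_pdf render options. A plain-Ruby check stands in for params[:debug].present? so the snippet runs without Rails.

```ruby
# Hypothetical helper: builds the options hash for wicked_pdf's `render`.
# `params` is any hash-like object, as in a Rails controller.
def pdf_render_options(params)
  debug = !params[:debug].to_s.strip.empty? # plain-Ruby stand-in for `.present?`
  { :pdf => 'report', :show_as_html => debug }
end

# In a controller action this might be used as:
#   render pdf_render_options(params)

p pdf_render_options({})                # normal request, renders the PDF
p pdf_render_options({ :debug => '1' }) # request with ?debug=1, renders HTML
```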

However, the PDF may not match the HTML exactly when using this method. This discrepancy causes problems when using complex CSS to render your pages; just because the spec passes against the HTML doesn’t guarantee the PDF will do the same. For example, problems arise when using certain combinations of the display: property.

Asset Pipeline

Another issue with this approach is how the accompanying CSS is referenced. Any assets must be given an absolute reference, since the wkhtmltopdf binary is run outside of your Rails application. For example, a CSS file is referenced with the wicked_pdf_stylesheet_link_tag "pdf" helper method.

As a result, the HTML-rendered views grab CSS via a relative path while the PDF grabs CSS via an absolute path. While this is not an issue for a small app, this can become difficult to maintain once there are multiple CSS modules for different output formats.

PDF::Reader to the Rescue

A different approach is to leverage pdf-reader’s text rendering. You can then set the content of the Capybara page directly. First, render the PDF document and save it to a temporary local file. Then tell pdf-reader to parse the text to a standard format for Capybara to use. Finally, directly set Capybara’s @body variable to the (now valid) PDF text contents.

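require 'tempfile'
require 'pdf-reader'

def convert_pdf_to_page
    temp_pdf = Tempfile.new('pdf')
    temp_pdf.binmode # PDF data is binary, so skip encoding conversion
    temp_pdf.write(page.source)
    temp_pdf.rewind # PDF::Reader must read from the start of the file
    reader = PDF::Reader.new(temp_pdf)
    # Join every page's text into one string for Capybara's matchers
    pdf_text = reader.pages.map(&:text).join("\n")
    page.driver.response.instance_variable_set('@body', pdf_text)
end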

Once your page has loaded, call convert_pdf_to_page and use Capybara’s page normally. All of the text matchers should work.

page.should have_content('some PDF content')

The code stays succinct by using Capybara’s built-in matchers for both HTML pages and PDF files, and you don’t need to worry about learning a new DSL for PDF specs.

This post was heavily inspired by Matthew O’Riordan’s gist. It uses an older version of pdf-reader, so it needs a bunch of manual string parsing to work. Lucky for us, one of pdf-reader’s new API methods now makes it easy to test PDFs with Capybara.

Improvements

Move convert_pdf_to_page to spec_helper.rb to gain access to the helper method in all of your specs.
Automatically convert PDF to text when the rendered content is detected as a PDF.
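
The second improvement needs some way to recognize a PDF response. One possible sketch checks the Content-Type header first and falls back to the %PDF- marker that PDF files start with. The method name pdf_response? is hypothetical, and how the header and body are obtained depends on the Capybara driver in use.

```ruby
# Hypothetical helper: guesses whether a response is a PDF.
# content_type - value of the Content-Type header (may be nil)
# body         - raw response body as a String
def pdf_response?(content_type, body)
  return true if content_type.to_s.downcase.start_with?('application/pdf')

  # Fall back to the "%PDF-" marker at the start of the file.
  body.to_s.b.start_with?('%PDF-')
end

puts pdf_response?('application/pdf', '')
puts pdf_response?(nil, "%PDF-1.4\n...")
puts pdf_response?('text/html; charset=utf-8', '<html></html>')
```

A spec helper could then call convert_pdf_to_page only when this check passes, leaving HTML pages untouched.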
