Despite plenty of Google searching for an existing tool that would build an identical zip file every time for the same inputs, I came up empty-handed. So I decided to dig as deep as necessary to figure out what prevents us from creating the same zip file every time for the same inputs. My particular use case was creating a zip file using the built-in zip command line utility on OS X.
(Not sure why someone might want reproducible zip files? There are several reasons to want reproducible builds in general. For zip files specifically, if everyone can take the same inputs and generate an identical zip, that eliminates one level of necessary trust for anything distributed via zip file. It could simplify testing, too, in some circumstances.)
One of the simplest tests you can do to determine whether you are creating an identical zip is to look at the checksum of your generated zip file, e.g.:
zip -r /tmp/foo.zip /tmp/foo; md5 /tmp/foo.zip
If you run that command twice in quick succession, you might see the same checksum; if you run it by hand twice in a row, you’ll almost certainly get two different checksums. This is a really good indication that the zip file has some kind of embedded timestamp.
If you carefully inspect the internal headers of a zip file generated with a plain zip command and cross-reference those with PKWARE’s .ZIP file format specification, you’ll see that your zip file has an “extended timestamp” field. That sounds promising, but because it’s an “extra field” defined by a third party, the zip specification doesn’t go into detail about the content of that field. One Google search later and you’ll find an article on unzip that indicates the extended timestamp field includes an access timestamp! Since zip has to access each file to read the content that goes into the archive, that extended timestamp field can change every time. The simplest workaround I’ve found is to pass the -X option to zip, which is supposed to exclude all extra fields.
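To check whether an archive carries extra fields at all, you can read the extra field length straight out of the first local file header. The sketch below assumes the layout in PKWARE’s specification, where that length is a 2-byte little-endian value at byte offset 28 of the local file header. The function name extra_field_len is made up for this example.

```shell
# Hypothetical helper: print the "extra field length" of the first entry
# in a zip file. Per the PKWARE spec, the local file header stores it as
# a 2-byte little-endian value at byte offset 28.
extra_field_len() {
    # od prints the two bytes as decimals; awk combines the low and high
    # byte. NF >= 2 skips any blank trailing line some od builds emit.
    od -An -tu1 -j28 -N2 "$1" | awk 'NF >= 2 { print $1 + 256 * $2; exit }'
}

# Usage (assumes a foo directory in the current directory):
#   zip -q -r plain.zip foo;       extra_field_len plain.zip
#   zip -q -X -r stripped.zip foo; extra_field_len stripped.zip
```

With a plain zip run the number should be non-zero; with -X it should drop to 0 if the extra fields really were left out.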
The -X option will get you pretty far. You should be able to run
zip -X -r foo1.zip foo; sleep 5; zip -X -r foo2.zip foo; cmp foo1.zip foo2.zip
and see that foo1.zip and foo2.zip are identical.
However, zip also includes the modification times for all included files. This means that running
zip -X -r foo1.zip foo; sleep 5; touch foo/*; zip -X -r foo2.zip foo; cmp foo1.zip foo2.zip
will show you that foo1.zip and foo2.zip differ. If the two sets of modification times fall in different minutes,
unzip -l foo1.zip; unzip -l foo2.zip
will show you the different modification times.
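The modification time lives in the local file header itself, which is why -X does not remove it. As a rough sketch (the function name first_entry_mtime is made up here), the following decodes the MS-DOS time and date fields that the PKWARE spec places at byte offsets 10 and 12 of the first local file header. The format has 2-second resolution, so it can reveal differences that a listing with only minutes would hide.

```shell
# Hypothetical helper: print the modification time stored for the first
# entry of a zip file, decoded from the MS-DOS time/date fields at byte
# offsets 10-13 of its local file header (little-endian, per the spec).
first_entry_mtime() {
    # Four bytes as decimals: time low, time high, date low, date high.
    set -- $(od -An -tu1 -j10 -N4 "$1")
    t=$(( $1 + 256 * $2 ))
    d=$(( $3 + 256 * $4 ))
    # Date: 7 bits year-1980, 4 bits month, 5 bits day.
    # Time: 5 bits hour, 6 bits minute, 5 bits seconds/2.
    printf '%04d-%02d-%02d %02d:%02d:%02d\n' \
        $(( (d >> 9) + 1980 )) $(( (d >> 5) & 15 )) $(( d & 31 )) \
        $(( t >> 11 )) $(( (t >> 5) & 63 )) $(( (t & 31) * 2 ))
}

# Usage:
#   first_entry_mtime foo1.zip; first_entry_mtime foo2.zip
```

The value is stored as local time with no time zone information, so it prints as whatever wall-clock time zip saw when the archive was built.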
You can use a terrible hack to force all the files in the foo directory to have the same timestamp:
find foo -exec touch -t 201401010000 {} +
This sets the modification times for everything in the foo directory to Jan 1 2014. The zip file seems to be reproducible at that point, but it’s entirely dependent on a known, static timestamp. It would probably be better if individual files had individual, meaningful timestamps (that script is outside the scope of this post).
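A slightly more reusable form of that hack, as a sketch: the function name pin_mtimes and its optional second argument are made up for this example, and the default timestamp is simply the one used above.

```shell
# Hypothetical helper: set every file and directory under a tree to one
# fixed timestamp (touch -t format: CCYYMMDDhhmm, in local time).
# Note that touch -t resets the access time as well as the
# modification time.
pin_mtimes() {
    dir=$1
    stamp=${2:-201401010000}   # default: Jan 1 2014, 00:00
    find "$dir" -exec touch -t "$stamp" {} +
}

# Usage:
#   pin_mtimes foo && zip -X -r foo.zip foo
```

Because touch -t works in local time, two people in different time zones would need to agree on a zone (for example by setting TZ) to end up with the same stored value.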
The last and probably most trivial barrier to reproducible zip files is the order of the contents of the zip file. That is,
zip -X foo12.zip foo/1.txt foo/2.txt
will always produce a different zip from
zip -X foo21.zip foo/2.txt foo/1.txt
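One way to take ordering out of the picture is to hand zip a sorted file list on standard input via its -@ option. This is a minimal sketch with the made-up function name sorted_file_list. LC_ALL=C is there so the sort order does not depend on the user’s locale.

```shell
# Hypothetical helper: list the regular files under a directory in a
# stable, locale-independent order, one name per line.
sorted_file_list() {
    find "$1" -type f | LC_ALL=C sort
}

# Usage: zip reads the names to add from stdin when given -@
#   sorted_file_list foo | zip -X -@ foo.zip
```

This sketch only lists regular files, so directory entries and symlinks are left out of the archive, and file names containing newlines would break the one-name-per-line list.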
Initially I was surprised at the lack of tooling to make reproducible zip files, but after looking into the issues in depth, it seems that the modified timestamp is perhaps the most difficult problem to solve in a generic way that suits most people’s needs. Also, my cursory search of zip libraries in both Ruby and Go did not turn up any that make it obviously easy to rewrite the internal headers in zip files. That being said, if you have particular needs for reproducible zip files, I hope this post helps you understand what modifications you need to apply to a zip file to address the “less deterministic” data in it. If you write an open source tool to accomplish this, please let us all know about it in the comments.