On the importance of distfiles

Frederic Cambus September 17, 2023 [FreeBSD] [NetBSD] [OpenBSD]

I've been involved with the OpenBSD ports collection since 2015, and have accumulated some notes on the topic over the years. This is an attempt at writing them up, mostly for my personal use. While some of these notes are specific to the OpenBSD ports collection, most of them also apply to the other BSDs and to Linux distributions.

It will come as no surprise that distfiles are at the core of the problem domain. The ports system fetches distribution files, most often tarballs, verifies their checksums and starts building programs. In order to be able to reliably build packages from the source tarballs, we need both availability and integrity.

Because MASTER_SITES can be down, either temporarily or permanently, each of the BSDs maintains its own distfiles mirrors, or caches.

As a rule of thumb though, MASTER_SITES (or nowadays simply SITES) should not be set to point to ftp.openbsd.org, mostly because there is no guarantee that the files will be cached there forever, and also to avoid putting unnecessary load on the server. To remedy this, some porters maintain their own distfiles hosting sites.

For checking distfiles integrity, each BSD uses a different combination of cryptographic hashes:

OpenBSD and FreeBSD both use SHA256, while NetBSD uses BLAKE2s and SHA512. The hashes are stored in a file called distinfo.

Here is an example from OpenBSD's distinfo for binutils:

SHA256 (binutils-2.41.tar.bz2) = pMS+wFL3uDcAJOYDieGUN38/SLVmGEGOpRBn9nqqsws=
SIZE (binutils-2.41.tar.bz2) = 37132937

And another excerpt from NetBSD's pkgsrc distinfo for binutils:

BLAKE2s (binutils-2.41.tar.bz2) = bd20a803c6f86632b62e27fce2cb07eb0ee4aa06fb374d80c8ba235568466dd2
SHA512 (binutils-2.41.tar.bz2) = 8c4303145262e84598d828e1a6465ddbf5a8ff757efe3fd981948854f32b311afe5b154be3966e50d85cf5d25217564c1f519d197165aac8e82efcadc9e1e47c
Size (binutils-2.41.tar.bz2) = 37132937 bytes
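The two excerpts differ in encoding as well as in algorithm: the OpenBSD SHA256 value is base64-encoded, while the pkgsrc values are hexadecimal. The following Python sketch computes the same set of values for an arbitrary byte string. The function name and the dictionary keys are made up for illustration and are not part of any ports tooling.

```python
import base64
import hashlib


def distfile_digests(data: bytes) -> dict:
    """Compute the digests shown in the excerpts above (illustrative helper)."""
    sha256_raw = hashlib.sha256(data).digest()
    return {
        # OpenBSD-style: SHA256, base64-encoded
        "SHA256 (base64)": base64.b64encode(sha256_raw).decode("ascii"),
        # pkgsrc-style: BLAKE2s and SHA512, hex-encoded
        "BLAKE2s (hex)": hashlib.blake2s(data).hexdigest(),
        "SHA512 (hex)": hashlib.sha512(data).hexdigest(),
        # Both record the size in bytes
        "SIZE": len(data),
    }


if __name__ == "__main__":
    # Placeholder content standing in for a real tarball.
    for field, value in distfile_digests(b"example distfile content").items():
        print(f"{field} = {value}")
```

To check a real distfile, the bytes would be read from disk in binary mode and passed to the function.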

Checksums can fail because there was a network failure while downloading the source file, or because the file itself changed.
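As a rough illustration of what such a check involves, here is a Python sketch that parses OpenBSD-style distinfo lines and reports which fields do not match. The line format is inferred from the binutils excerpt only, and parse_distinfo and check are hypothetical names, not the actual ports infrastructure.

```python
import base64
import hashlib
import re

# Matches lines such as "SHA256 (name.tar.bz2) = value" (format assumed
# from the excerpt; the real parser may be stricter or more lenient).
LINE_RE = re.compile(r"^(\w+) \((.+)\) = (\S+)$")


def parse_distinfo(text: str) -> dict:
    """Return {filename: {field: value}} for every recognised line."""
    entries = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            field, name, value = match.groups()
            entries.setdefault(name, {})[field] = value
    return entries


def check(name: str, data: bytes, entries: dict) -> list:
    """Return the list of mismatching fields (empty when the file verifies)."""
    expected = entries[name]
    problems = []
    if len(data) != int(expected["SIZE"]):
        problems.append("size")
    digest = base64.b64encode(hashlib.sha256(data).digest()).decode("ascii")
    if digest != expected["SHA256"]:
        problems.append("sha256")
    return problems
```

A download cut short by a network failure would typically show up as both a size and a checksum mismatch. A file of the expected size with a different checksum means its content has changed.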

If the distfile changed, there can be several causes:

Re-rolling tarballs can happen for software which is not versioned, or when upstream tries to fix minor issues shortly after a release without issuing a new one, in order to avoid bumping version numbers.

The last possible cause has been a problem with GitHub auto-generated tarballs in the past. More information can be found in sthen@'s "Porters, please read re GitHub auto-generated tarballs vs releases" post on the ports mailing list back in 2018.

In January 2023, GitHub updated the Git version used on their platform. Because Git had switched to its internal gzip implementation for generating tarballs, the generated tarballs ended up with different checksums. The change was quickly reverted.
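The effect is easy to reproduce in miniature: compressing identical content in two different ways yields different bytes, and therefore different checksums, even though the decompressed data is the same. The Python sketch below is only an analogy. It uses two compression levels of the same library to stand in for two gzip implementations, and the payload is made-up data rather than a real archive.

```python
import gzip
import hashlib

# Stand-in for an uncompressed tar archive (hypothetical content).
payload = b"".join(b"file %d: some source code\n" % i for i in range(2000))

# Two "implementations", modelled here by two compression levels.
# mtime=0 fixes the gzip header timestamp so only the compressor differs.
tarball_a = gzip.compress(payload, compresslevel=1, mtime=0)
tarball_b = gzip.compress(payload, compresslevel=9, mtime=0)


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


same_content = gzip.decompress(tarball_a) == gzip.decompress(tarball_b)
same_checksum = sha256_hex(tarball_a) == sha256_hex(tarball_b)

print("same content: ", same_content)
print("same checksum:", same_checksum)
```

Since distinfo records the checksum of the compressed file, a change like this is indistinguishable from a modified distfile until someone unpacks both versions and compares them.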

Each persistent checksum failure requires maintainers to spend time analyzing the changes to make sure that nothing malicious has been introduced.

Ideally, distfiles should be as small as possible, to avoid wasting bandwidth when fetching them and CPU cycles when unpacking them. Unfortunately, that's not always the case because some projects vendor their dependencies, and everyone has to pay the cost.
