Building PyTorch: The Dependencies

This is the first in a series of blog posts where I’ll be discussing our experience building a distribution of PyTorch for Chainguard Libraries. While the topics are drawn from PyTorch specifically, much of the content applies generally to building native Python wheels.

Building PyTorch wasn’t new for us. We’ve been maintaining PyTorch builds for our Chainguard Images for over a year now. I’d recently led a significant refactor of those builds, so the process was still fresh in my brain. But for the library build, we took on the additional challenge of achieving 100% feature parity with the upstream builds. Feature parity would mean integrating a number of optional dependencies. For native (C/C++) build dependencies, we typically package them in our OS and consume them from there. This allows us to leverage the automation we already have in place that keeps all of our OS packages up to date and CVE-free. For PyTorch, we first needed to figure out which projects these are and how to integrate them.

What are the dependencies?

You can get a good first approximation of the dependencies by dumping the build configuration of an existing upstream build. This is from PyTorch 2.8.0:

$ python -c 'import torch; print(torch.__config__.show())'
PyTorch built with:
  - GCC 13.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, COMMIT_SHA=a1cb3cc05d46d198467bebbb6e8fba50a325d4e7, CUDA_VERSION=12.8, CUDNN_VERSION=9.8.0, CXX_COMPILER=/opt/rh/gcc-toolset-13/root/usr/bin/c++, CXX_FLAGS= -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-dangling-reference -Wno-error=dangling-reference -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, USE_XCCL=OFF, USE_XPU=OFF,

But ultimately you need to look at the source, and specifically the CI configuration, to get the full picture.
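For a quick parity check of a specific build, the availability helpers under torch.backends and torch.distributed cover many of the optional features shown above. A minimal sketch, run inside the environment you want to inspect:

import torch
import torch.distributed as dist

# Most of these reflect build-time options; a few (e.g. cuDNN) also depend
# on runtime libraries being loadable.
checks = {
    "MKL": torch.backends.mkl.is_available(),
    "oneDNN (MKLDNN)": torch.backends.mkldnn.is_available(),
    "OpenMP": torch.backends.openmp.is_available(),
    "CUDA (built)": torch.backends.cuda.is_built(),
    "cuDNN": torch.backends.cudnn.is_available(),
}
# The distributed backends only report if distributed support was built in.
if dist.is_available():
    checks["NCCL"] = dist.is_nccl_available()
    checks["Gloo"] = dist.is_gloo_available()
    checks["MPI"] = dist.is_mpi_available()
for name, available in checks.items():
    print(f"{name}: {available}")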

Teasing Apart MKL.*

“MKL” pops up in a number of places. The PyTorch configuration has settings for MKL, MKL_STATIC, MKLDNN, and MKLDNN_ACL. It’s easy to discover that MKL is an acronym for “Math Kernel Library”. MKLDNN doesn’t exist anymore. It was renamed to oneDNN, though PyTorch still refers to it by the old name. oneDNN is an open source project, with seemingly no relation to the Math Kernel Library other than… it’s also performance related? MKLDNN_ACL actually refers to the ARM Compute Library, which oneDNN can use. I used ChatGPT to figure much of this out. I was convinced it was hallucinating.

oneDNN and the ARM Compute Library were straightforward to integrate, so I won’t discuss them further. MKL itself was more complicated.

MKL: The Actual Math Kernel Library

PyTorch upstream consumes this dependency as a wheel from PyPI, but we only pull from our own indexes during builds. It’s a proprietary artifact, so building our own wheel from source wasn’t an option. We decided to consume it from one of our internal APK repositories. But where is the canonical upstream? There is no standalone “Math Kernel Library” project. We found that these proprietary libraries were later renamed to oneMKL, and then merged into a project called oneAPI MKL.

oneAPI MKL is available from Intel as part of a toolkit, and as a standalone download. We chose the “standalone” download because we only need the MKL libraries. This “standalone” download includes a number of projects, jumbled together under various licenses, which complicated the packaging. While testing the integration with PyTorch, we discovered that PyTorch only needs the “mkl-classic” subset of the packages, which contains the math libraries themselves.
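One way to sanity-check which parts of MKL a build actually uses is to look at the libmkl_* shared objects the torch native libraries resolve at runtime. A rough sketch for a Linux install (a build that links MKL statically will print an empty list):

import pathlib
import subprocess

import torch

# Locate the main native library inside the installed torch wheel.
lib = pathlib.Path(torch.__file__).parent / "lib" / "libtorch_cpu.so"

# ldd shows which MKL shared objects (if any) the build links dynamically.
out = subprocess.run(["ldd", str(lib)], capture_output=True, text=True).stdout
print([line.strip() for line in out.splitlines() if "mkl" in line])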

Note: we ended up integrating the latest version of MKL, but noticed that upstream is still a version behind. Since our builds passed the upstream tests, we opened a PR to let upstream know.

CUDA

We were able to easily adapt our internal CUDA development tooling to build our PyTorch wheels. But handling the CUDA runtime dependencies required more work. PyTorch upstream does not distribute CUDA along with their wheels. Instead, they add pinned dependencies on the wheels that provide the CUDA libraries they built against. That seems like a good idea. Let’s do that.
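You can see exactly what upstream pins by reading the wheel metadata of an installed upstream CUDA build, for example:

import importlib.metadata

# The upstream CUDA wheels pin their nvidia-* runtime dependencies
# (often behind platform markers); CPU-only wheels won't list them.
for requirement in importlib.metadata.requires("torch") or []:
    if requirement.startswith("nvidia-"):
        print(requirement)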

The problem is that the CUDA libraries we build against are installed by our APKs. Our APKs have different names from the wheels, and there is no guarantee that an APK version we built against is actually available in wheel format. We don’t want to build wheels that can’t be installed due to missing dependencies!

One option we investigated was to use the Python distribution of the CUDA toolkit. That would mean we could depend on exactly what we built with, and know those packages will also be available to our customers. This would also give us the option of pinning the exact versions of libraries that PyTorch upstream used in their release, rather than using the latest available versions that we’d get from our APKs. We tried that. I implemented logic that installs the Python CUDA distribution and feeds its metadata into the package dependency generation code. But it turns out that the Python distribution of the CUDA toolkit is not the complete toolkit, at least not on Linux. While it includes an nvcc package (the CUDA compiler), it doesn’t actually provide the nvcc command at all. So yeah, that won’t build anything.
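The gap is easy to demonstrate. Assuming a fresh environment with the PyPI nvcc package installed (nvidia-cuda-nvcc-cu12 for CUDA 12), a compiler still doesn’t show up on the PATH:

import shutil

# Per the experience above, the PyPI distribution doesn't put an nvcc
# executable on the PATH, so this prints None even with the nvcc
# package installed.
print(shutil.which("nvcc"))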

We rolled that back and re-implemented the logic to convert the APK names and versions to their wheel equivalents, which are then fed into the build to generate dependencies. One problem this has caused relates to the NCCL component of the CUDA toolkit. NCCL is an open source project, which allows us to build it completely from source for our APK distribution. However, the Python packaging is not distributed along with the source, which currently prevents us from generating our own Python wheel. For now, we treat it as an external dependency like the other CUDA libraries. Our APK automation builds, tests, and publishes new releases of packages, typically within hours of upstream applying a new tag. But there is a lag between when upstream tags a release of NCCL and when that release is published on PyPI. This means we have a window between when we build PyTorch with runtime dependencies on the latest NCCL source release and when a binary wheel is available to satisfy those dependencies. We’re exploring options for avoiding this situation.
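As a rough illustration of that name and version translation (the APK names below are placeholders, not our real package names; the wheel names are the public nvidia-* projects on PyPI):

# Hypothetical APK-to-wheel name mapping; the APK names are illustrative
# placeholders, while the wheel names are real PyPI projects.
APK_TO_WHEEL = {
    "cuda-cudart-12": "nvidia-cuda-runtime-cu12",
    "cuda-cublas-12": "nvidia-cublas-cu12",
    "cudnn-9": "nvidia-cudnn-cu12",
    "nccl": "nvidia-nccl-cu12",
}

def wheel_requirement(apk_name: str, apk_version: str) -> str:
    """Translate an installed APK into a pinned wheel requirement string."""
    return f"{APK_TO_WHEEL[apk_name]}=={apk_version}"

# Example version only.
print(wheel_requirement("nccl", "2.0.0"))  # -> nvidia-nccl-cu12==2.0.0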

MAGMA

We made several attempts at integrating the MAGMA project before landing on a final approach. Our first attempt was to build an APK for it and consume that APK at build time. But we found that this exploded our wheel sizes: the MAGMA binaries added 4 GB! We took a closer look at how upstream integrates MAGMA and discovered that they apply several customizations to the build. For upstream parity, we would need to do the same. But maintaining those deltas outside our PyTorch library build recipe seemed like a maintenance nightmare, so instead we elected to build MAGMA inline with our PyTorch wheels.

This was sufficient to complete our builds. However, during testing we observed a number of tests being skipped due to missing MAGMA support:

test_meta.py::TestMetaCUDA::test_dispatch_meta_outplace_logdet_cuda_float64 SKIPPED [0.0018s] (no MAGMA library detected) [ 62%]
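A quick way to confirm whether a given build picked up MAGMA support is the torch.cuda.has_magma flag:

import torch

# False here means the wheel was built without MAGMA, which is what the
# skipped tests above are reporting.
print(torch.cuda.has_magma)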

It turns out that other users had reported the same issue. We were able to root cause the problem – and it circles back to MKL. I’ll discuss that in a future post!