TensorFlow 2.4 Release Includes CUDA 11 Support and API Updates


The TensorFlow project announced the release of version 2.4.0 of the deep-learning framework, featuring support for CUDA 11 and NVIDIA's Ampere GPU architecture, as well as new strategies and profiling tools for distributed training. Other API updates include a stable mixed precision API in Keras and an experimental NumPy frontend.

The TensorFlow team provided an overview of the release in a blog post. For distributed training, a new experimental ParameterServerStrategy is introduced, while MultiWorkerMirroredStrategy moves from experimental to stable. The TensorFlow Profiler also includes support for profiling MultiWorkerMirroredStrategy jobs. In the Keras API, mixed precision has moved from experimental to stable, and several base classes have been refactored to improve memory consumption and simplify custom training logic.

An experimental frontend for NumPy allows users to write NumPy code to run on top of TensorFlow, taking advantage of TensorFlow's execution optimizations. The binaries for the 2.4 release have been built to run with CUDA 11 and cuDNN 8, which support NVIDIA's new Ampere GPU hardware.

Data parallelism is a distributed training technique that slices the training data across multiple workers. TensorFlow's tf.distribute module supports several different data parallel strategies for synchronous and asynchronous distributed training scenarios involving a single machine with multiple GPUs or a cluster of multiple machines.

In synchronous training, each worker independently computes gradients from its own slice of data; the results are aggregated to update the model, requiring each worker to wait, or synchronize, before running a new batch. The MultiWorkerMirroredStrategy, which moves from experimental to stable with the new release, implements synchronous distributed training across multiple machines, and the TensorFlow Profiler has been updated to support MultiWorkerMirroredStrategy training jobs.
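The synchronous scheme described above can be sketched in a few lines of plain Python. This is a single-process illustration, not TensorFlow code: the one-parameter model y = w * x, the helper names (shard, local_gradient, synchronous_step) and the toy dataset are all assumptions made for the example.

```python
def shard(data, num_workers):
    """Split the dataset into one equal, contiguous slice per worker."""
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]


def local_gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x on one slice."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def synchronous_step(w, shards, lr):
    # Each worker computes a gradient from its own slice of the data.
    grads = [local_gradient(w, batch) for batch in shards]
    # Synchronization point: all gradients must arrive before averaging.
    avg_grad = sum(grads) / len(grads)
    # Every replica applies the same update, so the model copies stay identical.
    return w - lr * avg_grad


# Hypothetical toy data: samples of y = 3x.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = shard(data, num_workers=4)

w = 0.0
for _ in range(200):
    w = synchronous_step(w, shards, lr=0.01)
print(w)
```

Because the slices are the same size, the averaged gradient equals the gradient over the full batch, which is why synchronous data parallelism behaves like single-machine training with a larger batch.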

The new release also includes an experimental ParameterServerStrategy, which allows workers to asynchronously update a model hosted by a parameter server; this provides greater fault tolerance for the cluster, as workers are not dependent on each other.
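As a rough illustration of the asynchronous pattern, the sketch below simulates a parameter server in a single process. It is not the ParameterServerStrategy API: the ParameterServer class, the train helper, the failed_workers argument and the toy data are hypothetical, and staleness is modeled simply by letting every worker read the weights before any of them writes.

```python
class ParameterServer:
    """Holds the single authoritative copy of the model parameter."""

    def __init__(self, w, lr):
        self.w = w
        self.lr = lr

    def pull(self):
        return self.w

    def push(self, grad):
        # Updates are applied as they arrive; there is no barrier across workers.
        self.w -= self.lr * grad


def local_gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x on one slice."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def train(shards, steps, failed_workers=()):
    server = ParameterServer(w=0.0, lr=0.002)
    for _ in range(steps):
        # Live workers read the weights first, so most pushes use stale values.
        pulled = {i: server.pull()
                  for i in range(len(shards)) if i not in failed_workers}
        for i, w_stale in pulled.items():
            server.push(local_gradient(w_stale, shards[i]))
    return server.w


# Hypothetical toy data: samples of y = 3x, two samples per worker.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i:i + 2] for i in range(0, len(data), 2)]

print(train(shards, steps=200))
print(train(shards, steps=200, failed_workers={3}))  # one worker is lost
```

In this toy setup training still converges when a worker drops out, since no other worker waits for it. The trade-off is that gradients may be computed from stale weights, which in this sketch required a smaller learning rate than the synchronous version would need.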

Keras, the high-level framework for writing models on top of TensorFlow, includes a mixed precision API. Mixed precision allows training to use faster 16-bit arithmetic for some portions of a model, while retaining 32-bit precision in other parts for numeric stability, which can improve performance "more than 3 times." With the new release, the mixed precision API moves from experimental to stable. The Keras Functional API has also been refactored to improve memory consumption and simplify calling logic.
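To see why some parts of a model are kept in 32-bit precision, the following sketch emulates half precision with Python's struct module. It does not use the Keras API: the to_float16 and to_float32 helpers, the update size and the scale factor of 1024 are assumptions chosen for illustration.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_float32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# 1) Small weight updates vanish if the weight itself is stored in float16.
update = 1e-4
w16 = to_float16(1.0)
master = to_float32(1.0)
for _ in range(100):
    w16 = to_float16(w16 + update)        # rounds back to 1.0 every time
    master = to_float32(master + update)  # float32 copy accumulates the updates
print(w16, master)

# 2) Tiny gradients underflow to zero in float16 unless they are scaled first.
grad = 1e-8
scale = 1024.0
unscaled = to_float16(grad)
recovered = to_float16(grad * scale) / scale
print(unscaled, recovered)
```

The first part shows the motivation for keeping variables in float32 while computing in float16; the second shows the idea behind loss scaling, where values are multiplied up before the 16-bit computation and divided back afterwards.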

A new experimental module, tf.experimental.numpy, exposes a subset of APIs from the popular NumPy scientific computing package. The module provides an implementation of NumPy's ndarray that internally wraps a TensorFlow Tensor object. Users can then write NumPy-compatible code that takes advantage of TensorFlow's runtime accelerations. Because the module is experimental, it is subject to future breaking changes; in addition, the documentation lists several NumPy features that will not be supported, including the NumPy C API, Swig integration, and Fortran storage order.

The TensorFlow binaries for the release are built for CUDA 11, supporting the new features of NVIDIA's A100 Ampere GPUs, which achieve improved training times compared to the previous V100 GPUs. TensorFlow 2.4 on CUDA 11 by default enables support for the TensorFloat-32 data type; this results in faster matrix math operations, albeit at reduced precision. PyTorch, the other major deep-learning framework, released its version 1.7 in October, also with support for CUDA 11 and the A100 GPU, as well as distributed training options for the Windows operating system.
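The precision trade-off of TensorFloat-32 can be approximated in plain Python. TF32 keeps float32's 8 exponent bits but only 10 of its 23 mantissa bits; the to_tf32 helper below is a hypothetical emulation that truncates the extra bits, so actual hardware rounding may differ.

```python
import struct


def to_tf32(x):
    """Approximate TF32 by keeping 10 of float32's 23 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # clear the low 13 mantissa bits (truncation)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# A value that float32 represents exactly but TF32 cannot.
x = 1.0 + 2 ** -12
print(x, to_tf32(x))

# The exponent range matches float32, so large values stay in range.
big = to_tf32(1e30)
print(big)

# Effect on a small dot product when the inputs are reduced to TF32.
a = [x] * 4
dot_fp32 = sum(v * v for v in a)
dot_tf32 = sum(to_tf32(v) * to_tf32(v) for v in a)
print(dot_fp32, dot_tf32)
```

For workloads that need full float32 precision, TensorFlow 2.4 provides tf.config.experimental.enable_tensor_float_32_execution, which can be called with False to turn the default off.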

In response to the release announcement, Leigh Johnson, a machine-learning engineer at Slack, tweeted:

Can't wait to try out Parameter Server Training. Coordinating distributed interruptible (spot) workloads is a hard problem, with significant impact on the [cost] and compute footprint of modeling.

The TensorFlow source code and 2.4 release notes are available on GitHub.
