A Short Note on Automatic Differentiation

Mykhailo Iaremko
C++ developer

calculus

Do you remember your undergraduate calculus course? Frankly speaking, I don’t. But I do remember a big blackboard completely covered in chalk scribbles. That was a nightmare! Hopefully, those times are gone. I mean, why would we care about derivatives now?

Here at ELEKS, we do care about derivatives. We create and deliver precise models of various dynamic processes. The application field doesn’t matter: it may be physics, finances, or biology. What really matters is once you need to simulate something – most likely you’ll need to compute derivatives. And you need to compute them fast and with great precision. Thus, we created ADEL, a C++ template based-library that fits our needs. But let’s start with some background information.

The requirements listed above do not allow for solutions such as symbolic derivation or finite difference schemes. The first approach generates the exact formulae for derivatives of all levels and directions. This is a very memory/time-consuming method. It produces enormous formulae. While the second one computes the numerical approximations of the value of a derivative. These suffer from various round-offs, cancellation and discretization errors.

Fortunately, there is a middle ground: automatic differentiation (AD). The way it computes derivatives is essentially through a chain rule. It generates evaluations (but not formulae) of the derivatives. As a result, there are none of the drawbacks (accuracy loss) of numerical differentiation where the step size needs to be carefully chosen to avoid errors. All intermediate expressions are evaluated as soon as possible; this saves memory, and removes the need for later simplification. Moreover AD is not as complicated as symbolic representation.

There are two modes of automatic differentiation: the forward mode and the reverse mode. Both use different directions of the chain rules to propagate the derivative information. There are two main implementations of automatic differentiations:

  1. Operator overloading is straightforward and easy to understand: one overloads the operators not only to compute function values, but also to compute derivative information or buildup the computational graph.
  2. Source transformation works like a compiler, which reads in the code that computes the function,analyzes the code and exports a source code that computes the derivatives. This implementation normally generates more optimized programs but involves a lot of implementation efforts.

ADEL contains the implementation of operator overloaded versions of both forward and reverse modes.

The main feature of ADEL’s forward mode is careful handmade optimization. Template based implementation with loop unrolling is compiled into very efficient sequential code. This implementation suits GPU computing architecture very well. The support of CUDA makes the solution truly unique.

The reverse mode of the ADEL library is special in the way it stores the computational graph. It uses the stack of the application for this purpose. Each overloaded operation creates the data structure that is placed on the top of the stack; the life-time of these structures is just enough to compute the derivative data. This makes the algorithm extremely efficient since the most of the operations could be inlined.

We compared our implementation with some of already existing solutions in order to see how it stands up against the others. We implemented Newton’s method for testing the accuracy of our tools. For testing the performance, a random expressions generator was used. It is capable of generating some monstrous things (your professor would never dream of such things). Here is an example of a test function:

template<typename ADType>
ADType TestFunction(const ADType(&x)[6]) {
  y[0] = x[1]; y[1] = -2.730; y[2] = 4.555;
  for (int i = 0; i < 1000; i++) {
    ADType y[3];
    y[0] = x[1]+x[5]+4.624/x[0]+(x[3]-y[0])/x[4]+(atan(-y[2])-x[2])*y[1];
    y[1] = (sin(y[1])-y[2])/x[2]/x[4]*(x[3]+log(x[0])/(y[0]-x[5]))+x[1];
    y[2] = atan(x[5])*(-y[2])/x[3]/(x[4]/(x[2]/x[1]+acos(y[1])))*y[0];
  }
  return 2.658/log(x[0])*(log(y[2])/x[4])*(x[1]+x[5])*(y[0]/y[1])*x[3];
};

The main characteristic of the test case is the number of independent (x) and dependent (y) variables as well as total number of functions/operators. The given task was to compute the gradient and hessian of the return value with respect to all independent variables. In the example above, we have 6 independent and 3 temporary (dependent) variables that produce a single output.

We tested the following third-party AD implementations:

  1. FADBAD: a lightweight, user-friendly library. We used a combined mode: the gradient was computed via the reverse mode, while the hessian – via the forward mode.
  2. ADOL-C: a huge math package that is capable of computing literally everything. The reverse mode was used in the experiments.

There were four test cases of differing sizes with 5/10/20/30 input and 1/5/10/15 intermediate variables. The results were normalized, in such way that the slowest execution was scored with 100 time units, therefore less is better.

adel-benchmark

Looking at the results one can see that we did not create an ultimate AD tool – not yet. In smaller tests both ADEL modes greatly overrun their rivals. When the number of variables increases, the performance of ADEL [Forward] degrades because of the large memory overhead per variable. ADEL [Reverse] shows the best average performance; FADBAD comes second; ADOL-C received no prize at all.

Automatic differentiation is a very promising but extremely underused tool. The majority of the community still does not recognize it. This article is just a small but solid step into the bright future of AD tools. As a final word, we encourage you to download the source code and to experiment with ADEL on your own. If there is something we can improve, feel free to contact us.

tags

Comments

  • Can your stack-based approach to AD handle loops in the functions? We do a lot of multivariate coding for probability models in Stan, for which we wrote our own forward and reverse-mode AD:

    http://mc-stan.org/

    We focused on efficient vectorization and arena-based memory management, as well as unfolding derivatives for matrix and linear algebra ops and vectorized probability functions.

  • Hi,

    yeah we can handle loops: only one single iteration of the loop is stored in a memory, but not whole computational graph. This is not canonical reverse mode, we call it hybrid, but is shows to be viable.

    I didn’t mention in the article but the test were performed under Win platform. That’s why we didn’t include so many of existing AD tools to compare with.

    If you wish we can run some tests of yours(we’ll go *nix), but not these meaningless random tests.