PARALLEL INCOMPLETE LU FACTORIZATIONS BASED ON
ALTERNATING TRIANGULAR SOLVES
MARC A. TUNNELL (Purdue University, mtunnell@purdue.edu)
AND ERIK G. BOMAN (Sandia National Labs, egboman@sandia.gov)
Abstract.
Incomplete factorizations are popular preconditioners and are well known to be effective for a wide
range of problems. Additionally, these preconditioners can be used as a “black box” and do not rely on
any a priori knowledge of the problem. However, traditional algorithms for computing these incomplete
factorizations are based on Gaussian elimination and do not parallelize well. Recently, a more parallel
incomplete factorization algorithm was proposed by Chow and Patel [7], where the factors are computed
iteratively. In this paper, we propose a new iterative approach that is based on alternating triangular
solves of L and U . We develop two versions: ATS-ILU for a static sparsity pattern, and ATS-ILUT for a
dynamic pattern (using thresholding). We show that this new method is similar to the fine-grained iterative
ILU method by Chow but has the added advantage that it allows greater reuse of memory and is fully
deterministic in parallel, meaning the results do not depend on scheduling. We evaluate the new method
on several test matrices from the SuiteSparse collection and show that it is competitive with current ILU
methods.
Key words. preconditioner, incomplete factorization, ILU, parallel, least squares
1. Introduction. Preconditioning is well known to be essential for improving the speed
of convergence of Krylov methods such as Conjugate Gradient (CG) and Generalized Min-
imal Residual (GMRES) [4, 13, 22, 23]. Incomplete Lower-Upper (ILU) factorizations are
a popular class of preconditioners as they can be used as a “black box” on a wide range of
problems. There are two main types of ILU factorizations, level-based ILU(k) [4, 18, 22]
and threshold-based ILUT [20]. However, these methods are inherently sequential and do
not parallelize well.
There has been interest in the parallelization of these more classical interpretations of
ILU, largely through graph partitioning schemes. These graph partition-based methods,
such as [10, 14, 15, 17], offer a promising approach to parallelizing classical ILU methods.
By decomposing the graph corresponding to the matrix and determining variables that
can be eliminated in parallel, these methods aim to distribute the computational load more
evenly across processors. While these strategies have shown effectiveness for certain types of
problems [4], their implementation can be highly complex. Additionally, their performance
can be problem-dependent, requiring consideration of the underlying graph structure when
choosing a parallelization strategy.
More recently, there has been progress on computing ILU factors iteratively, potentially
giving up some of the robustness of the classical methods in exchange for better parallel
properties [7]. Iterative ILU methods, such as those introduced by Chow [7], offer significant
advantages in terms of scalability on modern parallel architectures. For the remainder of
this paper, we refer to the method introduced by Chow as ParILU and its thresholded
counterpart as ParILUT [2, 7]. These methods approximate the ILU factors through a
series of iterative updates, which can be more easily distributed across multiple processors
or offloaded to accelerators.
One of the primary benefits of iterative ILU methods is their ability to handle larger
and more complex systems efficiently and be utilized as a true “black box” preconditioner
by domain scientists. By breaking down each iterative update into smaller approximate
subproblems and solving them independently, different parts of the factorization can be
computed in parallel without the need for complex graph-partitioning algorithms. This
approach allows for the use of iterative ILU methods on a wide range of problems, including
those with complex or irregular graph structures that may preclude high levels of parallelism
in the graph-partitioned classical ILU methods.
Furthermore, iterative ILU methods are adaptable to various hardware accelerators
such as graphics processing units (GPUs) [3], which are increasingly important for high-
performance computing. By leveraging the parallel processing capability of these accelera-
tors, iterative ILU methods can significantly reduce the real-world time required to compute
the ILU factors for large-scale problems, thereby speeding up the overall solution process.
In this paper, we propose a new class of iteratively-computed ILU preconditioners,
which we call Alternating Triangular Solves ILU (ATS-ILU). This method builds upon the
strengths of existing iterative ILU approaches while leveraging improved memory reuse and
determinism in parallel. We provide an analysis of the method and evaluate its performance
compared to the state of the art on a variety of test matrices. We show that our method
is competitive with current ILU methods and has the potential to be a powerful tool for
solving large-scale problems on modern parallel architectures. We implement our algorithm
in Julia [5, 19], as well as Kokkos [11, 25], and utilize the Kokkos ecosystem [24] for their
implementations of ParILUT and GMRES.
This paper is organized as follows. First, we give an introduction to classical ILU meth-
ods as well as their iterative counterparts in section 1. Next, we present needed background
information in section 2, including an introduction of our notation, a review of a classical
ILU method, and the fine-grained ILU method by Chow [7]. In section 3, we introduce
our new class of iteratively computed ILU preconditioners based on alternating triangular
solves and discuss its relation to ParILU. The algorithms for ATS-ILU and ATS-ILUT are
presented in subsection 3.3 and subsection 3.5, respectively. In section 4, we provide im-
plementation details for our new method. In section 5, we discuss the parallelization of
our method. We evaluate the performance of our method on a variety of test matrices in
section 6. Finally, we discuss future work and concluding remarks in section 7.
2. Background Information. We start this section by describing the notation used
in this paper. We use largely standard notation but introduce some new notation for the
sake of clarity. We use A to denote a matrix, A^T to denote its transpose, a to denote
a vector, I to refer to the identity matrix whose size is determined by context, and L and
U to reference sparse matrices that are lower and upper triangular in structure. We subscript
these matrices with the row and column indices separated by a comma. The notation a_{i,j}
refers to a scalar entry of the matrix A at row i and column j, while a_{i,:} refers to the row
vector of A at row i. We utilize MATLAB [16] style slicing notation, where a_{i,j:k} refers to
the elements of a_{i,:} at the column indices j through k. We additionally use non-contiguous
slicing, where a_{i,idx} refers to the elements of a_{i,:} at the column indices in idx.
Non-contiguous submatrices may also be referenced in a similar manner, where A_{idx1,idx2}
refers to the submatrix of A whose entries correspond to the Cartesian product of idx1 and
idx2. We assume that slicing with a set of indices is performed in the natural order of the
indices in the set. In the case of the Cartesian product, a lexicographic ordering is used:
entries are ordered first by the row index and then by the column index.
We use N to denote the set of natural numbers, which we define to start at 1, and R
to denote the set of real numbers. The notation N^2 refers to the set of pairs of natural
numbers, and R^{n×m} refers to the set of n × m matrices with real entries. We use A^k to
refer to the k-th power of the matrix A, whereas L^{(k)} refers to the k-th iteration of a method
applied to L. We are consistent with our use of 1-based indexing throughout this paper and
assume all loops are inclusive of their endpoints unless otherwise stated.
We use notation like L to denote the graph of the sparsity pattern of L and are consistent
in our use of S to denote some arbitrary initial sparsity pattern. We use standard notation
to define operations on a graph, where L\U refers to the graph L with the edges that exist
in U removed. Similarly, L ∪ U refers to the graph L with the edges that exist in U added.
We assume that the edges in a graph are uniquely defined by a set and often use set notation
to define the entries of an adjacency matrix that fully describes the edges of a graph.
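For readers who want to map this notation onto code, the slicing described above corresponds directly to array indexing with index vectors. A small Julia illustration (with arbitrary example values chosen only for this snippet) is:

using SparseArrays

# Hypothetical 3-by-3 matrix used only to illustrate the slicing notation.
A = sparse([4.0 1.0 0.0;
            1.0 3.0 2.0;
            0.0 2.0 5.0])
idx1, idx2 = [1, 3], [2, 3]

A[2, :]         # a_{2,:}, the second row of A
A[2, 2:3]       # a_{2,2:3}, a contiguous slice of that row
A[idx1, idx2]   # A_{idx1,idx2}, a non-contiguous submatrix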
2.1. Related work. Incomplete factorizations are some of the oldest preconditioners
[1] that are still used today. In addition to the standard ILU(k) method based on levels of
fill, variations based on thresholding (adaptive pattern) have been proposed [21].
Calgaro et al. [6] proposed an incremental ILU method for the case of a sequence of
matrices. Chow and Patel [7] proposed a novel iterative, fine-grained method for computing
the incomplete factors that is well suited for parallel computing. Anzt et al. [2, 3] further
improved this method to a thresholded ILU and showed results on GPUs.
Our method is also highly parallel and similar to the iterative fine-grained method by
Chow, but is derived in a different way.
2.2. Classical ILU Method. The classical ILU(k) method is based on Gaussian
elimination, but with entries dropped to avoid or reduce the amount of fill-in, and given a
pattern, S, of non-zero entries in the factorization. These non-zero locations, in the case
of an ILU(0) factorization, are often chosen to be the same as the non-zero pattern of the
matrix, A, but this is not required. For higher levels of fill, a common choice is to use the
sparsity pattern of A
k+1
. This can be seen as an approximation to the true LU factors of
A, where the equation LU = A is satisfied exactly on the pattern of A and may be violated
elsewhere.
We show this algorithm, given a fixed non-zero pattern S, in Algorithm 1. This algo-
rithm computes the ILU factors by iterating through the columns and rows of the input
matrix A, updating the non-zero entries according to the sparsity pattern S. For each
column j, the algorithm computes the entries in the j-th column of L by scaling the entries
of the j-th column of U below the diagonal by the diagonal entry u_{j,j}, and then updates
the remaining entries in the corresponding rows of U. This algorithm is inherently sequential,
as the updates to the factors depend on the previously processed rows and columns.
Algorithm 1 Classic ILU
Input: Sparse matrix A, sparsity pattern S.
Output: Sparse ILU factors L and U, where LU = A along the sparsity pattern S.
L ← I
U ← A, where u_{i,j} = 0 if (i, j) ∉ S
for j = 1 to n − 1 do
  for i = j + 1 to n and (i, j) ∈ S do
    ℓ_{i,j} ← u_{i,j} / u_{j,j}
    for k = j + 1 to n and (i, k) ∈ S do
      u_{i,k} ← u_{i,k} − ℓ_{i,j} · u_{j,k}
    end for
  end for
end for
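For concreteness, the following Julia sketch mirrors Algorithm 1 on a dense working copy of A; representing the pattern S as a set of (i, j) pairs and using dense storage are illustrative choices for this sketch, not the data structures used later in the paper.

using LinearAlgebra

# Sketch of Algorithm 1: ILU restricted to a fixed pattern S (a Set of (i, j) pairs).
function classic_ilu(A::AbstractMatrix, S::Set{Tuple{Int,Int}})
    n = size(A, 1)
    L = Matrix{Float64}(I, n, n)                         # unit lower triangular
    U = [(i, j) in S ? float(A[i, j]) : 0.0 for i in 1:n, j in 1:n]
    for j in 1:n-1
        for i in j+1:n
            (i, j) in S || continue
            L[i, j] = U[i, j] / U[j, j]                  # scale by the diagonal entry u_{j,j}
            U[i, j] = 0.0                                # this entry now lives in L
            for k in j+1:n
                (i, k) in S || continue
                U[i, k] -= L[i, j] * U[j, k]             # update the remainder of row i
            end
        end
    end
    return L, U
end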
2.3. Fine-Grained Iterative ILU Method by Chow. Like the classical ILU method,
ParILU computes an approximation to the true LU factors of A but does so iteratively using
a series of smaller, approximate subproblems. Along the given sparsity pattern S, the Par-
ILU method updates the non-zero entries of the factors L and U by relaxing each variable
independently. This independent relaxation is performed in parallel, allowing for better
handling of large-scale and complex systems, with the potential for significant speedup on
modern parallel architectures. This algorithm is given in Algorithm 2.
Algorithm 2 ParILU [7]
1: Input: Sparse matrix A, sparsity pattern S, starting factors L and U
2: Output: Updated factors L and U
3: while not converged do
4:   for (i, j) ∈ S do
5:     if i > j then
6:       ℓ_{i,j} ← (a_{i,j} − Σ_{k=1}^{j−1} ℓ_{i,k} u_{k,j}) / u_{j,j}
7:     else
8:       u_{i,j} ← (a_{i,j} − Σ_{k=1}^{i−1} ℓ_{i,k} u_{k,j}) / ℓ_{i,i}
9:     end if
10:   end for
11: end while
Each update can be performed in parallel, as the updates to each row or column are
independent. In practice, the algorithm can be implemented in a manner that is either
deterministic or non-deterministic in parallel. In the non-deterministic version, the updates
are performed atomically and directly on the factors L and U. In the deterministic version,
the updates are performed directly on the matrix L but are stored in a temporary matrix
for U.
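A serial Julia sketch of one such sweep is shown below; it follows the deterministic variant just described, with L updated in place and the U updates staged in a temporary copy. Dense working arrays and the assumption that L starts with a unit diagonal are simplifications of this sketch.

# One fixed-point sweep of Algorithm 2 (ParILU), with deterministic staging for U.
function parilu_sweep!(L::Matrix{Float64}, U::Matrix{Float64},
                       A::AbstractMatrix, S::Vector{Tuple{Int,Int}})
    Unew = copy(U)
    for (i, j) in S
        s = A[i, j]
        if i > j                          # lower-triangular entry: update L
            for k in 1:j-1
                s -= L[i, k] * U[k, j]
            end
            L[i, j] = s / U[j, j]
        else                              # upper-triangular entry: update (staged) U
            for k in 1:i-1
                s -= L[i, k] * U[k, j]
            end
            Unew[i, j] = s / L[i, i]      # L has unit diagonal, so this divides by one
        end
    end
    copyto!(U, Unew)
    return L, U
end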
3. Alternating Triangular Solves Method. In this section, we introduce our new
method for computing ILU factors, ATS-ILU. This method is based on the idea of alternating
iterative updates to the L and U factors of the matrix A. The basic idea is the same as
before, where we iteratively update the factors L and U until convergence, but where the
updates are performed in an alternating manner. This general process is a common method
for solving bilinear systems and is outlined in Algorithm 3.
Algorithm 3 Alternating ILU
U^{(0)} ← triu(A)
k ← 0
while not converged do
  Solve L^{(k)} U^{(k)} = A for L^{(k)}
  Solve L^{(k)} U^{(k+1)} = A for U^{(k+1)}
  Check convergence
  k ← k + 1
end while
One way to perform this procedure would be to apply Algorithm 4 (or its upper-triangular
analogue) with the entirety of U^{(k)}, with the rows of A as the right-hand sides, to solve
for L^{(k+1)}, and similarly to solve for U^{(k+1)}. This entire process can largely be performed
in parallel, as each row of L and column of U can be solved independently. Despite the
potential for high levels of parallelism, it is still extremely computationally expensive and
likely suffers from significant levels of fill-in during intermediate steps. The computational
cost could be reduced by using an approximation. A simple approximation is the Neumann
series: (I − L)^{-1} = I + L + L^2 + ⋯. This has been explored [3] in the context of solving
for dense vectors. Sparsity may be preserved to some degree, but the fill will increase when
summing higher powers. Therefore, we do not consider this option any further.
Additionally, the algorithm as stated above does not guarantee that L and U remain
lower and upper triangular, respectively. One method to address this issue would be to solve
for L only in the lower triangular part of A and for U only in the upper triangular part
of A. This would ensure that the factors remain lower and upper triangular, respectively,
but would still leave the problem of significant levels of fill-in. Instead, we suggest a more
practical approach where we impose fixed sparsity patterns on L and U. These sparsity
patterns can be chosen to be the same as the sparsity pattern of A, which is the choice we
make in this paper.
In order to get around the issue of fill-in, we propose a method where we approximately
solve for L and U along their given sparsity patterns, which we discuss next.
3.1. Sparse Triangular Solve. Because we introduce an approximate triangular solve
in the following section, we provide a brief overview of the classical sparse triangular solve
algorithm for reference in Algorithm 4. This algorithm is used to solve a lower triangular
system of equations, where the solution vector x is updated in a forward sweep. The
algorithm performs an exact inversion of the sparse lower triangular matrix L that has
sparsity pattern L. Although it is accurate, it is computationally expensive and does not
parallelize particularly well. Therefore, it is only suitable for fairly small matrices.
Algorithm 4 Sparse Triangular Solve: Column-by-Column Elimination (Lower Triangular)
1: procedure triangularSolve(L, S, b)
2:   Input: Lower triangular matrix L of size n × n, sparsity pattern S, vector b of size n
3:   Output: Solution vector x of size n
4:   x ← copy(b)
5:   for j = 1 to n do
6:     x_j ← x_j / ℓ_{j,j}
7:     for i = j + 1 to n and (i, j) ∈ S do
8:       x_i ← x_i − x_j · ℓ_{i,j}
9:     end for
10:   end for
11:   return x
12: end procedure
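In Julia, where sparse matrices are stored by columns (CSC), Algorithm 4 can be sketched as follows; taking the pattern to be the stored nonzeros of L, rather than a separately supplied S, is a simplification of this sketch.

using SparseArrays

# Sketch of Algorithm 4: column-oriented forward solve of L x = b.
function sparse_lower_solve(L::SparseMatrixCSC{Float64,Int}, b::Vector{Float64})
    x = copy(b)
    rows, vals = rowvals(L), nonzeros(L)
    for j in 1:size(L, 1)
        x[j] /= L[j, j]                          # divide by the diagonal entry
        for p in nzrange(L, j)                   # stored entries of column j
            i = rows[p]
            i > j && (x[i] -= x[j] * vals[p])    # eliminate below the diagonal
        end
    end
    return x
end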
3.2. Approximate Sparse Triangular Solve. The exact sparse triangular solve may
introduce fill. We want an approximate solve with no fill. Our approximate sparse triangular
solve algorithm is given in Algorithm 5.
Typically, the right hand side b is sparse and B is the sparsity pattern of b. Our use
case is when b is a column of U . The special case of a sparse solve with a sparse right hand
side has been well studied [8, 12], and the sparsity in x is determined by the reachability set.
However, as we wish to preserve sparsity, we instead impose that the sparsity pattern of x
shall be the same as that of b. The intuition behind our method is to extract the nonzero
parts of b and solve for the corresponding submatrix of L, though the implementation is
slightly different to avoid extracting such a submatrix. This results in an approximate solve
that preserves sparsity.
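A Julia sketch of this approximate solve (stated formally as Algorithm 5 below) is shown here; it assumes the pattern B is given as a sorted vector of indices together with the matching values of b, and it never creates entries outside of B.

using SparseArrays

# Approximate forward solve restricted to the pattern B of the right-hand side.
function approx_lower_solve(L::SparseMatrixCSC{Float64,Int},
                            B::Vector{Int}, bvals::Vector{Float64})
    x = Dict(zip(B, bvals))                       # x restricted to the pattern B
    rows, vals = rowvals(L), nonzeros(L)
    for j in B                                    # only columns j in B are eliminated
        x[j] /= L[j, j]
        for p in nzrange(L, j)
            i = rows[p]
            (i > j && haskey(x, i)) || continue   # only rows i in B are updated
            x[i] -= x[j] * vals[p]
        end
    end
    return [x[i] for i in B]                      # solution along the pattern B
end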
We describe the ATS-ILU algorithm next.
3.3. Alternating Triangular Solves ILU(k) Algorithm. The ATS-ILU(k) algo-
rithm is based on the idea of approximately solving for L and U in an alternating fashion
along only their given sparsity patterns. Again, the rows of L and the columns of U can
Algorithm 5 Sparse Triangular Solve: Column-by-Column Elimination (Approximate)
1: procedure ApproximateTriangularSolve(L, L, b, B)
2:   Input: Lower triangular matrix L ∈ R^{n×n}, sparsity pattern L, vector b ∈ R^n, and pattern B along which to approximate the application of the inverse to b
3:   Output: Solution vector x of size n
4:   x ← copy(b)
5:   for j = 1 to n and j ∈ B do
6:     x_j ← x_j / ℓ_{j,j}
7:     for i = j + 1 to n and i ∈ B and (i, j) ∈ L do
8:       x_i ← x_i − x_j · ℓ_{i,j}
9:     end for
10:   end for
11:   return x
12: end procedure
be solved independently, allowing for a high level of parallelism. The algorithm is shown
in Algorithm 6. We present the algorithm for a general pattern S, but in practice this will
correspond to the pattern of A^k for some small power k.
Algorithm 6 ATS-ILU(k)
1: Input: Sparse matrix A, sparsity pattern S, starting factors L and U
2: while not converged do
3:   for i ∈ {1, 2, . . . , n} do
4:     idx ← {j ∈ N | (i, j) ∈ S, j ≤ i}
5:     ℓ_{i,idx} ← a_{i,idx} (U_{idx,idx})^{−1}
6:   end for
7:   for j ∈ {1, 2, . . . , n} do
8:     idx ← {i ∈ N | (i, j) ∈ S, i ≤ j}
9:     u_{idx,j} ← (L_{idx,idx})^{−1} a_{idx,j}
10:   end for
11: end while
In this algorithm, we solve for each row of L and each column of U independently.
Recall from section 2 that the notation a_{i,idx} refers to the i-th row of A restricted to the
indices in idx, and similarly for U_{idx,idx} and L_{idx,idx}. These submatrices can be viewed as
the (dense) non-contiguous submatrices of L and U that correspond to the sparsity pattern
S along the given row or column. We show an example of this in Figure 3.1.
We note that solving for the solution of these non-contiguous submatrices and vectors
can either be viewed as an exact solution of an approximate problem or an approximate
solution of an exact problem. The former applies the inverse of the dense submatrix to the
right-hand side vector, while the latter utilizes the approximate solve algorithm shown in
Algorithm 5. These two views are equivalent in the sense that they both result in the same
solution vector.
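As an illustration of the row update on line 5 of Algorithm 6, the following Julia sketch takes the first view, the exact solution of an approximate problem: it gathers the dense non-contiguous submatrix U_{idx,idx} and solves the small triangular system directly. The explicit gather is purely illustrative; as discussed in section 4, our implementation instead looks the entries up in place.

using SparseArrays, LinearAlgebra

# One ATS-ILU row update: given row i and idx = {j : (i, j) in S, j <= i},
# compute new values for ℓ_{i,idx} from ℓ_{i,idx} U_{idx,idx} = a_{i,idx}.
function ats_row_update(A::SparseMatrixCSC{Float64,Int},
                        U::SparseMatrixCSC{Float64,Int},
                        i::Int, idx::Vector{Int})
    Usub = UpperTriangular(Matrix(U[idx, idx]))   # small dense non-contiguous submatrix
    rhs  = Vector(A[i, idx])                      # matching right-hand side from A
    return Usub' \ rhs                            # solve (U_{idx,idx})^T ℓ^T = (a_{i,idx})^T
end

Solving with the transposed submatrix mirrors the CSC-as-transposed-CSR trick described in section 4.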
3.4. Relation between ATS-ILU and ParILU. We again note that the ATS-
ILU(k) algorithm has some similarities to the ParILU algorithm by Chow, but there are
some key differences. Both algorithms perform iterative updates to the L and U factors
of the matrix A until convergence, both algorithms seek to solve the equation LU = A
approximately, and both algorithms can be implemented in a parallel fashion.
[Figure 3.1 shows five panels: the original matrix, the lower triangular factor, the upper triangular factor, the non-contiguous target row (row 17) in the original matrix, and the non-contiguous submatrix extracted from the upper triangular factor.]
Fig. 3.1. This figure illustrates the extraction of the small non-contiguous submatrix from U given row
17 of L. Given the non-zero index locations of the given row, we extract the non-contiguous submatrix
from U that corresponds to these columns and rows. Similarly, we extract the right-hand side vector from
A that corresponds to these columns at the given row.
However,
in Chow’s method, a single inner product is used to relax one value at a time, while our
method is coarser-grained, as we operate on an entire row (column) at a time.
A second difference is that ParILU is asynchronous and updates L and U simultane-
ously, while we alternate between L and U updates. However, there is a close relation-
ship. Suppose the execution order in Chow’s method is to update first the lower triangular
part, then the upper part. In this case, ParILU becomes synchronous and very similar
to ATS-ILU. Conversely, one could modify ATS-ILU to update L and U simultaneously
(asynchronously) but we do not explore that here.
We claim ATS-ILU uses exactly the same number of flops as ParILU per sweep (or
update iteration). This can be seen from Figure 3.1. Consider the flops needed to update
row 17 in L. In ATS-ILU, this is given by the number of nonzeros in the small submatrix
shown in the bottom right. These correspond to the red nonzeros in U . On the other hand,
ParILU requires sparse inner products between row 17 of L and columns 1 to 17 of U .
Interestingly, this corresponds exactly to the same red nonzeros. Therefore, the number of
flops is the same.
Note that even if the flop count is the same, the methods are different and will typically
produce different factors. For example, ATS-ILU updates the diagonal of L so it will
typically not be unit while ParILU strictly enforces unit diagonal.
We believe our method is likely more memory efficient as the memory access pattern
is more regular. Also, we avoid the sparse inner products, which are difficult to implement
efficiently.
3.5. Alternating Triangular Solves Algorithm with Thresholding (ILUT). In
this subsection, we describe the ATS-ILUT algorithm, which is a thresholded variant of the
ATS-ILU algorithm that allows for a variable sparsity pattern based on the level of fill-in.
The variable sparsity pattern utilized in the algorithm is based on the one used in the
ParILUT method [2, 3]. After testing, we found that the “candidate fill-in” method used
in ParILUT was highly competitive, and we utilize something similar. This method creates
a new sparsity pattern based on the sparsity pattern of the residual matrix,
R = A − LU [2]. As in [2], candidate locations for fill-in are chosen from
C = R\(L ∪ U),
where R is the sparsity pattern of R = A − LU, and L and U are the sparsity patterns of L
and U , respectively. The lower and upper triangular candidate locations are added to the
sparsity pattern of L and U , respectively, prior to updating the factors in each iteration and
the original locations of L and U are retained. This method allows for a variable sparsity
pattern that can be adjusted based on the level of fill-in in the factors L and U .
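A minimal Julia sketch of this candidate selection, assuming the factors and the residual are held as sparse matrices, is:

using SparseArrays

# Candidate fill-in locations C = R\(L ∪ U): positions where the residual
# R = A - LU is nonzero but not yet in the pattern of L or U.
function candidate_locations(A::SparseMatrixCSC, L::SparseMatrixCSC, U::SparseMatrixCSC)
    R = dropzeros!(A - L * U)                     # residual with exact zeros removed
    pattern(M) = Set(zip(findnz(M)[1], findnz(M)[2]))
    return setdiff(pattern(R), union(pattern(L), pattern(U)))
end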
At the end of each iteration, the factors L and U are thresholded to remove elements
with a magnitude below a certain threshold. This threshold is chosen to be the magnitude
of the k-th largest element in each of the factors L and U, where k is the maximum number
of elements allowed in L and U, respectively. This method of thresholding is both simple
and effective, and it reduces the number of variables the end user needs to tune in order
to get a good approximation of the ILU factors. We note that this thresholding strategy is
also used in the ParILUT method [2].
The full algorithm is given in Algorithm 7.
Algorithm 7 ATS-ILUT
1: Input: Sparse matrix A ∈ R^{n×n}, starting factors L and U, starting sparsity pattern S, maximum number of elements L_max and U_max in L and U, respectively
2: R ← (A − LU)
3: L ← {(i, j) ∈ N^2 | j ≤ i, r_{i,j} ≠ 0}
4: for i ∈ {1, 2, . . . , n} do
5:   idx ← {j ∈ N | j ≤ i, (i, j) ∈ L ∪ S}
6:   ℓ_{i,idx} ← a_{i,idx} (U_{idx,idx})^{−1}
7: end for
8: R ← (A − LU)
9: U ← {(i, j) ∈ N^2 | j ≥ i, r_{i,j} ≠ 0}
10: for j ∈ {1, 2, . . . , n} do
11:   idx ← {i ∈ N | j ≥ i, (i, j) ∈ U ∪ S}
12:   u_{idx,j} ← (L_{idx,idx})^{−1} a_{idx,j}
13: end for
14: τ ← the L_max-th largest (in magnitude) element of {ℓ_{i,j} | (i, j) ∈ N^2, ℓ_{i,j} ≠ 0}
15: Threshold L to elements with larger magnitude than τ
16: τ ← the U_max-th largest (in magnitude) element of {u_{i,j} | (i, j) ∈ N^2, u_{i,j} ≠ 0}
17: Threshold U to elements with larger magnitude than τ
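To make the thresholding step on lines 14–17 concrete, a small Julia sketch is shown below; here kmax plays the role of L_max or U_max, and ties at the threshold magnitude are simply kept, which is a simplification of this sketch.

using SparseArrays

# Thresholding step of Algorithm 7: keep only the kmax entries of largest
# magnitude in the factor M, dropping everything smaller.
function threshold!(M::SparseMatrixCSC{Float64,Int}, kmax::Int)
    nnz(M) <= kmax && return M
    tau = partialsort(abs.(nonzeros(M)), kmax; rev = true)   # kmax-th largest magnitude
    droptol!(M, prevfloat(tau))                              # drop entries with magnitude below tau
    return M
end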
Note that on lines 2 and 8 of Algorithm 7, only the lower and upper portions of the
residual are needed. As an implementation detail, one could compute half of the residual at
each half step to reduce the total expended work by a significant margin. Next we briefly
discuss options for the starting sparsity pattern.
3.6. Initial Guess and Sparsity Pattern. The initial guess for L and U makes a
difference. In fact, there is no guarantee that the alternating method converges to the global
solution (though empirically it usually works).
There are several options for the starting sparsity pattern S. One option is to use the
sparsity pattern of the matrix A, which was used in the ParILUT method [2]. Alternatively,
one could use the sparsity pattern of the matrix A^k for some k > 1 or k = 0. We find it
hard to justify using the sparsity pattern of A^k for k > 1. The sparsity pattern of A^0 = I
is the diagonal of A, which is a reasonable choice. Later, in section 6, we test the sparsity
patterns of A^k for k ∈ {0, 1, 2} and report our findings.
Next, we discuss important implementation details of the two algorithms.
4. Implementation details. This section is dedicated to the implementation details
of the ATS-ILU and ATS-ILUT algorithms. For simplicity, we focus on how to solve for L
given U . How to compute U given L is analogous.
For reasons relating to the manner in which we iterate over L and U , we store L in a
compressed sparse row (CSR) format and U in a compressed sparse column (CSC) format.
For this reason, we can treat U stored in CSC format as U^T stored in CSR format, which
transforms the solve in lines 6 and 12 of Algorithm 7 and line 5 of Algorithm 6 from
ℓ_{i,idx} ← a_{i,idx} (U_{idx,idx})^{−1}
to
(ℓ_{i,idx})^T ← ((U_{idx,idx})^T)^{−1} (a_{i,idx})^T,
which allows us to use nearly the same algorithm for the update of both L and U. In
implementation, the only difference between the two update routines is the manner in which
we look up the elements of A that we are solving for. If we were to pass A^T instead of A to
the update routine when solving for U, the implementation would be identical.
We implement the approximate sparse triangular solve algorithm shown in Algorithm 5
using a custom kernel that looks up values in U and A given the input indices from L via
binary search. We call this kernel binary_search_triangular_solve. This implementation
performs a lookup of the values in the non-contiguous submatrix of U and the right-hand
side vector of A. In the first pass of the triangular solve, the row values of L are repopulated
with their corresponding elements in A, and are treated as the non-contiguous right-hand
side vector.
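The lookup at the core of this kernel can be sketched in Julia as follows; we show it for a CSC matrix, so the binary search runs over the stored row indices of one column, which is the mirror image of the CSR row lookup described above.

using SparseArrays

# Binary-search lookup of one entry: returns the stored value at (i, j),
# or 0.0 if (i, j) is not in the sparsity pattern of M.
function lookup(M::SparseMatrixCSC{Float64,Int}, i::Int, j::Int)
    rows, vals = rowvals(M), nonzeros(M)
    rng = nzrange(M, j)                              # stored entries of column j
    p = searchsortedfirst(view(rows, rng), i)        # binary search over the row indices
    return (p <= length(rng) && rows[rng[p]] == i) ? vals[rng[p]] : 0.0
end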
We implemented the ATS-ILUT algorithm with a large number of different variations.
These different variations are controlled by options that conditionally compile different parts
of the code using constexpr variables and C-style preprocessing. The options we imple-
mented are as follows:
1. USE_NEW: This option utilizes the newest factor in the update step, i.e., L^{(k+1)} is
used to update U^{(k)}, and similarly for the other factor.
2. UPDATE_RESIDUAL_BETWEEN: This option updates the residual matrix R between the
two update steps.
3. USE_RESIDUAL_LHS: This option augments the left-hand side of the solve with the
residual matrix, i.e., for some modified matrix M, we let m_{i,j} = u_{i,j} if u_{i,j} ≠ 0
and r_{i,j} otherwise.
4. USE_RESIDUAL_RHS: This option augments the right-hand side of the solve with the
residual matrix, i.e., given a solve for some row i of L we have b, where b_j = a_{i,j} if
a_{i,j} ≠ 0 and r_{i,j} otherwise.
5. ADDITIONAL_SWEEPS_BEFORE: This option performs additional sweeps prior to the
beginning of the main loop.
6. ADDITIONAL_SWEEPS_AFTER: This option performs additional sweeps after the thresh-
olding step of the main loop.
We have exhaustively tested all combinations of these options (with options 5 and 6 limited
to {0, 1}) on several matrices in the SuiteSparse collection. In section 6, we will discuss a
selection of the results from these tests.
5. Parallel Aspects. A key advantage of ATS-ILU compared to the traditional ILU
algorithm is that it is highly parallel. Thus, it is well suited for modern architectures, such
as multicore CPUs and accelerators such as GPUs.
The ParILU method is also parallel, but not deterministic. The results will vary de-
pending on the thread execution order. It is possible to impose a certain execution order,
but this will slow the method down. In contrast, ATS-ILU is naturally deterministic. This
is easy to see as L is used to update U and vice versa. It is also possible to update L and U
simultaneously (asynchronously) but we did not explore that as we view the deterministic
behavior as an advantage.
In ParILU, there is one thread per nonzero in the matrix. In ATS-ILU, the parallelism
is over rows and columns. In our current implementation, one thread is assigned one row (or
column), giving a coarser parallelism than in ParILU. However, another option is to assign
a thread team for the sparse triangular solve in each row (column). This would require
a team-level sparse triangular solve, which is not widely available. Thus, we consider this
option future work.
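As an illustration of this row-level parallelism, a threaded Julia sketch of the L half-sweep is shown below; Lrows, row_idx, and ats_row_update are hypothetical helpers (the last being the row solve sketched in section 3) and are assumptions of this example rather than part of our Kokkos implementation.

using Base.Threads

# Row-parallel half-sweep for L: each thread updates whole rows, so the
# individual row solves are independent and the result does not depend on
# the thread schedule. The column half-sweep for U is analogous.
function update_L_rows!(Lrows::Vector{Vector{Float64}}, A, U, row_idx::Vector{Vector{Int}})
    @threads for i in eachindex(row_idx)
        Lrows[i] = ats_row_update(A, U, i, row_idx[i])   # new values of ℓ_{i,idx}
    end
    return Lrows
end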
6. Experiments and Results. We implement both the ATS-ILU and ATS-ILUT
algorithms in C++ using the Kokkos library [25]. We test the algorithms on a variety
of matrices from the SuiteSparse collection [9] and compare the results to the ParILUT
algorithm. We note that we use the Kokkos implementation of ParILUT, which is a shared-
memory parallel implementation of the ParILUT algorithm [2]. This implementation per-
forms thresholding using the same method as the [2] implementation and not the fast ap-
proximate approach described in [3]. This is a valid comparison because we use the same
thresholding strategy in our ATS-ILUT algorithm and the approximate one would be simi-
larly applicable to our algorithm. All of the experiments given below were run on a single
node with 24 threads and 128 GB of memory.
We compare the ATS-ILUT algorithm implementation against the ParILUT algorithm.
In [3], they found that their thresholding variant not only outperformed their non-thresholded
level-based variant but also often outperformed the classical level-based method. Because the
results they obtained in their paper largely carry over to ours, we do not repeat that exper-
iment in particular. Instead, we focus on the improved thresholding-based variants of ATS
and ParILU.
Throughout this section, our measure of goodness is the number of GMRES iterations
to converge to a relative residual of 10^{−10}. When referring to something as worse in terms
of this measure, we mean that it required more iterations to converge. Conversely, when
referring to something as better, we mean that it required fewer iterations to converge. We
set the maximum subspace size of GMRES to 500 and the maximum number of iterations to
1500. We scale each matrix to have unit diagonal with Jacobi scaling. We use the GMRES
implementation from the KokkosKernels library [25].
We tested 36 combinations of the options described in section 4 on all sparsity pat-
terns stated above. We limited the two sweep-related options to either 0 or 1 additional
sweeps. We share the results for the best two configurations below. Both of these con-
figurations have USE_NEW and UPDATE_RESIDUAL_BETWEEN enabled, and USE_RESIDUAL_LHS
and USE_RESIDUAL_RHS disabled. We found the performance of the algorithm to be largely
unaffected by the ADDITIONAL_SWEEPS_BEFORE option but do notice a difference with the
ADDITIONAL_SWEEPS_AFTER option. Because there was little difference in performing an ini-
tial sweep before the first iteration, we share the results with it disabled. The results for the
two remaining configurations are given in Table 6.1. Note that “No Extra” refers to
the configuration with ADDITIONAL_SWEEPS_AFTER disabled and “One Extra” refers to the
configuration with ADDITIONAL_SWEEPS_AFTER set to 1. These configurations were tested on
atmosmodd, torso2, majorbasis, and venkat01 from the SuiteSparse collection [9].
Additionally, we tested with starting sparsity patterns of A^k for k ∈ {0, 1, 2}. The
behavior with the starting sparsity pattern of A^0 was similar to, sometimes marginally better
but often marginally worse than, the behavior with the starting sparsity pattern of A^1.
Because the behavior is similar between these two patterns and because A^1 was the starting
pattern used in the ParILUT method, we do not share the results with the starting sparsity
pattern of A^0 in this paper.
The behavior with the starting sparsity pattern A^2 was inconsistent and was largely
uncompetitive with the other starting sparsity patterns. With this starting sparsity pattern,
the preconditioner after the first update cycle was always worse than with the other starting
patterns. In some cases, the resulting factors were better after the fifth update cycle, but
this was not consistent across all matrices or even levels of fill for a given matrix. We discuss
potential reasons for this behavior in section 7 but do not share the results in this paper
due to space.
Table 6.1
Comparison of ATS-ILU Variants with ParILUT across Different Matrices, Fill Levels, and Iterations with an A^1 Starting Sparsity Pattern. The best at each iteration is bolded.

Matrices:                    atmosmodd       |    torso2     |   majorbasis   |     venkat01
Method        Fill   Iter:  1    2    3    4    5 | 1  2  3  4  5 | 1  2  3  4  5  | 1   2   3   4   5
ATS-ILUT      1.0         166  150  147  146  146 | 9  8  7  8  7 | 12  9 10  9 10 | 23  23  23  24  25
 (No Extra)   2.0         132  101   96   95   98 | 9  7  6  8  7 | 12  8  9  9  9 | 17  14  13  12  12
              3.0         132   95   86   83   82 | 9  7  6  8  7 | 12  8  9  9  9 | 17  13  11  10  10
ATS-ILUT      1.0         147  145  145  145  145 | 7  8  8  9  9 | 12 10 10 10 10 | 23  23  24  25  25
 (One Extra)  2.0         114   97   97   97   98 | 7  8  8  9  9 | 11  9  9 10 10 | 17  14  13  12  12
              3.0         114   89   86   85   86 | 7  8  8  9  9 | 11  9  9 10 10 | 17  13  11  10  10
ParILUT       1.0         154  145  145  145  145 | 8  7  6  6  6 | 12 10  9  9  9 | 23  23  24  24  24
              2.0         120   98   97   97   97 | 8  5  4  4  3 | 12  8  6  6  5 | 19  15  13  12  12
              3.0         120   91   86   85   85 | 8  5  4  4  3 | 12  8  6  5  5 | 19  15  12  11  10
The results in Table 6.1 show that the ATS-ILUT algorithm is competitive with the
ParILUT algorithm. In most cases, the resulting factors after 5 update cycles are similar
in performance to the ParILUT factors. However, the ATS-ILUT algorithm is often able to
achieve a significantly better result after the first update cycle relative to ParILUT at the
same update cycle.
Next, we show results for four more matrices. We test on abnormal sandia, an internal
Sandia matrix that arises when solving a nonlinear thermal-fluid problem, as well as three
more SuiteSparse matrices: af_shell3, G3_circuit, and parabolic_fem. These results are
given in Table 6.2.
Similarly, these results largely show parity between the ATS-ILUT and ParILUT algo-
rithms. ATS-ILUT with the “No Extra” configuration was often better at lower fill levels,
while the “One Extra” configuration tended to converge to a good preconditioner faster.
We also tested these methods on the matrices ecology2 and offshore. No algorithm
produced a preconditioner that converged to the tolerance within the specified number of
GMRES iterations for the ecology2 matrix. For the offshore matrix, all methods produced
Table 6.2
Comparison of ATS-ILU Variants with ParILUT across Different Matrices, Fill Levels, and Iterations with an A^1 Starting Sparsity Pattern. The best at each iteration is bolded.

Matrices:              abnormal sandia   |         af_shell3        |          G3_circuit           |        parabolic_fem
Method        Fill   Iter: 1   2   3   4   5 |   1    2    3    4    5 |    1     2     3     4     5 |    1     2     3     4     5
ATS-ILUT      1.0         54  44  42  42  42 | 905  723  594  605  568 | 1169  1140  1148  1133  1131 | 1183  1089  1090  1083  1071
 (No Extra)   2.0         51  33  25  24  23 | 872  564  395  334  299 |  860   639   467   426   396 |  765   561   464   492   444
              3.0         51  31  22  18  17 | 872  559  378  308  258 |  860   638   448   373   318 |  765   551   422   434   381
ATS-ILUT      1.0         50  45  45  46  45 | 797  657  638  625  631 | 1153  1186  1189  1187  1183 | 1261  1192  1224  1220  1201
 (One Extra)  2.0         42  29  27  27  27 | 651  402  319  289  274 |  690   520   408   424   414 |  729   719   505   547   446
              3.0         42  24  20  20  20 | 651  397  290  226  204 |  690   520   357   326   305 |  729   715   681   525   378
ParILUT       1.0         54  45  45  45  45 | 822  597  581  616  592 | 1188  1170  1180  1215  1217 | 1232  1168  1190  1201  1197
              2.0         49  32  26  25  25 | 752  415  311  279  268 |  758   531   390   365   360 |  864   482   379   353   354
              3.0         49  30  21  18  17 | 752  415  293  234  204 |  758   531   379   295   269 |  864   479   320   219   191
a preconditioner that converged to the tolerance within the specified number of iterations
if left to update for only one or two cycles. When updating for three cycles, all methods
(all tested variants of ATS-ILUT and ParILUT) exhibited strange behavior and did not
converge to the tolerance within the specified number of GMRES iterations. There was
little difference in the number of GMRES iterations required to converge between the ATS-
ILUT and ParILUT algorithms for the offshore matrix after the first or second update.
7. Conclusions. In this paper we have introduced the ATS-ILU and ATS-ILUT algo-
rithms based on alternating inexact triangular solves. The ATS-ILU algorithm is a novel
approach to computing ILU(k)-type factors that is deterministic and parallel. The ATS-
ILUT algorithm is a thresholded variant of the ATS-ILU algorithm that allows for a variable
sparsity pattern based on the level of fill-in. We have shown that the ATS-ILU and ATS-
ILUT algorithms are competitive with the ParILUT algorithm on a variety of matrices from
the SuiteSparse collection as well as the abnormal Sandia matrix. Our algorithm is deter-
ministic without a loss in performance, which is a significant advantage over the ParILU
algorithm. Additionally, our algorithm has better memory reuse than the ParILU algorithm,
which is important for performance on modern architectures.
In terms of the number of GMRES iterations to converge, there was relatively little
difference in the performance between the ParILUT and ATS-ILUT algorithms after a few
update cycles. One reason for this could be that both algorithms are successful in solving
for factors that are nearly-optimal given the sparsity pattern. We note that both of our
algorithms use a similar method for the addition of new candidate locations to the sparsity
pattern at each iteration. Indeed, we found that utilizing a different starting sparsity pattern
(A^2) sometimes resulted in better factors ultimately, but not in a manner that was consistent
across all matrices or levels of fill for a given matrix. We believe that there is further work
to be done in two main domains: the choice of starting sparsity pattern and the method for
adding new candidate locations to the sparsity pattern. We believe that the choice of
starting sparsity pattern is important and that the behavior with the A^2 starting sparsity
pattern provides some empirical evidence for this.
Even with our algorithm ultimately creating factors that are similarly effective to the
ParILUT factors, we believe that there is still merit in further exploring the ATS-ILU and
ATS-ILUT algorithms. Our algorithm produces factors that work as a decent preconditioner
for GMRES with only a single update cycle (sweep). Thus, our method has an advantage
when short setup time is important. For certain applications, such as one in which the
preconditioner needs to be updated frequently or the matrix changes rapidly, this
could be a significant advantage. We believe that the ATS-ILU and ATS-ILUT algorithms
are a promising new approach to computing thresholded ILU factors and that there is further
work to be done in this area.
Acknowledgments. Sandia National Laboratories is a multimission laboratory man-
aged and operated by National Technology and Engineering Solutions of Sandia, LLC.,
a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of
Energy’s National Nuclear Security Administration under contract DE-NA-0003525.
REFERENCES
[1] J. A. Meijerink and H. A. van der Vorst, An iterative solution method for linear systems of which the
coefficient matrix is a symmetric M-matrix, Mathematics of Computation, 31 (1977), pp. 148–162.
[2] H. Anzt, E. Chow, and J. Dongarra, ParILUT—a new parallel threshold ILU factorization,
SIAM Journal on Scientific Computing, 40 (2018), pp. C503–C519, https://doi.org/10.1137/
16M1079506.
[3] H. Anzt, T. Ribizel, G. Flegar, E. Chow, and J. Dongarra, ParILUT—a parallel threshold ILU
for GPUs, in 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS),
IEEE, May 2019, https://doi.org/10.1109/IPDPS.2019.00033.
[4] M. Benzi, Preconditioning techniques for large linear systems: A survey, Journal of Computational
Physics, 182 (2002), p. 418–477, https://doi.org/10.1006/jcph.2002.7176, http://dx.doi.org/
10.1006/jcph.2002.7176.
[5] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, Julia: A fresh approach to numerical
computing, SIAM Review, 59 (2017), pp. 65–98, https://doi.org/10.1137/141000671, https:
//epubs.siam.org/doi/10.1137/141000671.
[6] C. Calgaro, J.-P. Chehab, and Y. Saad, Incremental incomplete LU factorizations with applica-
tions, Numerical Linear Algebra with Applications, 17 (2010), pp. 811–837, https://doi.org/10.
1002/nla.756.
[7] E. Chow and A. Patel, A fine-grained parallel ILU factorization, SIAM Journal on Scientific
Computing, 37 (2015), pp. C169–C197.
[8] T. Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006.
[9] T. A. Davis and Y. Hu, The university of florida sparse matrix collection, ACM Transactions on
Mathematical Software, 38 (2011), p. 1–25, https://doi.org/10.1145/2049662.2049663, http:
//dx.doi.org/10.1145/2049662.2049663.
[10] X. Dong and G. Cooperman, A Bit-Compatible Parallelization for ILU(k) Preconditioning, Springer
Berlin Heidelberg, 2011, p. 66–77, https://doi.org/10.1007/978-3-642-23397-5_8, http://dx.
doi.org/10.1007/978-3-642-23397-5_8.
[11] H. C. Edwards, C. R. Trott, and D. Sunderland, Kokkos: Enabling manycore performance porta-
bility through polymorphic memory access patterns, Journal of Parallel and Distributed Com-
puting, 74 (2014), pp. 3202–3216, https://doi.org/10.1016/j.jpdc.2014.07.003.
[12] J. R. Gilbert and T. Peierls, Sparse partial pivoting in time proportional to arithmetic operations,
SIAM Journal on Scientific and Statistical Computing, 9 (1988), pp. 862–874, https://doi.org/
10.1137/0909058, https://doi.org/10.1137/0909058.
[13] M. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, Journal
of Research of the National Bureau of Standards, 49 (1952), p. 409, https://doi.org/10.6028/
jres.049.044, http://dx.doi.org/10.6028/jres.049.044.
[14] D. Hysom and A. Pothen, Efficient parallel computation of ilu(k) preconditioners, in Proceedings of
the 1999 ACM/IEEE conference on Supercomputing, SC ’99, ACM, Jan. 1999, https://doi.org/
10.1145/331532.331561, http://dx.doi.org/10.1145/331532.331561.
[15] P. Hénon and Y. Saad, A parallel multistage ILU factorization based on a hierarchical graph decom-
position, SIAM Journal on Scientific Computing, 28 (2006), pp. 2266–2293, https://doi.org/10.
1137/040608258.
[16] The MathWorks Inc., MATLAB version 9.13.0 (R2022b), 2022, https://www.mathworks.com.
[17] G. Karypis and V. Kumar, A fast and high quality multilevel scheme for partitioning irregular
graphs, SIAM Journal on Scientific Computing, 20 (1998), p. 359–392, https://doi.org/10.1137/
s1064827595287997, http://dx.doi.org/10.1137/S1064827595287997.
[18] N. Li, Y. Saad, and E. Chow, Crout versions of ilu for general sparse matrices, SIAM Journal
on Scientific Computing, 25 (2003), p. 716–728, https://doi.org/10.1137/s1064827502405094,
http://dx.doi.org/10.1137/S1064827502405094.
[19] A. Montoison and D. Orban, Krylov.jl: A Julia basket of hand-picked Krylov methods, Journal of
Open Source Software, 8 (2023), p. 5187, https://doi.org/10.21105/joss.05187.
[20] Y. Saad, ILUT: A dual threshold incomplete LU factorization, Numerical Linear Algebra with Applica-
tions, 1 (1994), pp. 387–402, https://doi.org/10.1002/nla.1680010405.
[21] Y. Saad, ILUT: A dual threshold incomplete LU factorization, Numerical Linear Algebra with Applica-
tions, 1 (1994), pp. 387–402, https://doi.org/10.1002/nla.1680010405.
[22] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, PA, 2nd ed., 2003.
[23] Y. Saad and M. H. Schultz, Gmres: A generalized minimal residual algorithm for solving nonsym-
metric linear systems, SIAM Journal on Scientific and Statistical Computing, 7 (1986), p. 856–869,
https://doi.org/10.1137/0907058, http://dx.doi.org/10.1137/0907058.
[24] C. Trott, L. Berger-Vergiat, D. Poliakoff, S. Rajamanickam, D. Lebrun-Grandie, J. Mad-
sen, N. Al Awar, M. Gligoric, G. Shipman, and G. Womeldorff, The kokkos ecosystem:
Comprehensive performance portability for high performance computing, Computing in Science
Engineering, 23 (2021), pp. 10–18, https://doi.org/10.1109/MCSE.2021.3098509.
[25] C. R. Trott, D. Lebrun-Grandié, D. Arndt, J. Ciesko, V. Dang, N. Ellingwood, R. Gaya-
tri, E. Harvey, D. S. Hollman, D. Ibanez, N. Liber, J. Madsen, J. Miles, D. Poliakoff,
A. Powell, S. Rajamanickam, M. Simberg, D. Sunderland, B. Turcksin, and J. Wilke,
Kokkos 3: Programming model extensions for the exascale era, IEEE Transactions on Parallel
and Distributed Systems, 33 (2022), pp. 805–817, https://doi.org/10.1109/TPDS.2021.3097283.