PARALLEL INCOMPLETE LU FACTORIZATIONS BASED ON
ALTERNATING TRIANGULAR SOLVES
MARC A. TUNNELL (Purdue University, mtunnell@purdue.edu)
AND ERIK G. BOMAN (Sandia National Labs, egboman@sandia.gov)
Abstract.
Incomplete factorizations are popular preconditioners and are well known to be effective for a wide
range of problems. Additionally, these preconditioners can be used as a “black box” and do not rely on
any a priori knowledge of the problem. However, traditional algorithms for computing these incomplete
factorizations are based on Gaussian elimination and do not parallelize well. Recently, a more parallel
incomplete factorization algorithm was proposed by Chow and Patel [7], where the factors are computed
iteratively. In this paper, we propose a new iterative approach that is based on alternating triangular
solves of L and U . We develop two versions: ATS-ILU for a static sparsity pattern, and ATS-ILUT for a
dynamic pattern (using thresholding). We show that this new method is similar to the fine-grained iterative
ILU method by Chow but has the added advantage that it allows greater reuse of memory and is fully
deterministic in parallel, meaning the results do not depend on scheduling. We evaluate the new method
on several test matrices from the SuiteSparse collection and show that it is competitive with current ILU
methods.
Key words. preconditioner, incomplete factorization, ILU, parallel, least squares
1. Introduction. Preconditioning is well known to be essential for improving the speed
of convergence of Krylov methods such as Conjugate Gradient (CG) and Generalized Min-
imal Residual (GMRES) [4, 13, 22, 23]. Incomplete Lower-Upper (ILU) factorizations are
a popular class of preconditioners as they can be used as a “black box” on a wide range of
problems. There are two main types of ILU factorizations, level-based ILU(k) [4, 18, 22]
and threshold-based ILUT [20]. However, these methods are inherently sequential and do
not parallelize well.
There has been interest in the parallelization of these more classical interpretations of
ILU, largely through graph partitioning schemes. These graph partition-based methods,
such as [10, 14, 15, 17], offer a promising approach to parallelizing classical ILU methods.
By decomposing the graph corresponding to the matrix and determining variables that
can be eliminated in parallel, these methods aim to distribute the computational load more
evenly across processors. While these strategies have shown effectiveness for certain types of
problems [4], their implementation can be highly complex. Additionally, their performance
can be problem-dependent, requiring consideration of the underlying graph structure when
choosing a parallelization strategy.
More recently, there has been progress on computing ILU factors iteratively, potentially
giving up some of the robustness of the classical methods in exchange for better parallel
properties [7]. Iterative ILU methods, such as those introduced by Chow [7], offer significant
advantages in terms of scalability on modern parallel architectures. For the remainder of
this paper, we refer to the method introduced by Chow as ParILU and its thresholded
counterpart as ParILUT [2, 7]. These methods approximate the ILU factors through a
series of iterative updates, which can be more easily distributed across multiple processors
or offloaded to accelerators.
One of the primary benefits of iterative ILU methods is their ability to handle larger
and more complex systems efficiently and be utilized as a true “black box” preconditioner
by domain scientists. By breaking down each iterative update into smaller approximate
subproblems and solving them independently, different parts of the factorization can be
computed in parallel without the need for complex graph-partitioning algorithms. This
approach allows for the use of iterative ILU methods on a wide range of problems, including
those with complex or irregular graph structures that may preclude high levels of parallelism
in the graph-partitioned classical ILU methods.
Furthermore, iterative ILU methods are adaptable to various hardware accelerators
such as graphics processing units (GPUs) [3], which are increasingly important for high-
performance computing. By leveraging the parallel processing capability of these accelera-
tors, iterative ILU methods can significantly reduce the real-world time required to compute
the ILU factors for large-scale problems, thereby speeding up the overall solution process.
In this paper, we propose a new class of iteratively-computed ILU preconditioners,
which we call Alternating Triangular Solves ILU (ATS-ILU). This method builds upon the
strengths of existing iterative ILU approaches while leveraging improved memory reuse and
determinism in parallel. We provide an analysis of the method and evaluate its performance
compared to the state of the art on a variety of test matrices. We show that our method
is competitive with current ILU methods and has the potential to be a powerful tool for
solving large-scale problems on modern parallel architectures. We implement our algorithm
in Julia [5, 19], as well as Kokkos [11, 25], and utilize the Kokkos ecosystem [24] for their
implementations of ParILUT and GMRES.
This paper is organized as follows. First, we give an introduction to classical ILU meth-
ods as well as their iterative counterparts in section 1. Next, we present needed background
information in section 2, including an introduction of our notation, a review of a classical
ILU method, and the fine-grained ILU method by Chow [7]. In section 3, we introduce
our new class of iteratively computed ILU preconditioners based on alternating triangular
solves and discuss its relation to ParILU. The algorithms for ATS-ILU and ATS-ILUT are
presented in subsection 3.3 and subsection 3.5, respectively. In section 4, we provide im-
plementation details for our new method. In section 5, we discuss the parallelization of
our method. We evaluate the performance of our method on a variety of test matrices in
section 6. Finally, we discuss future work and concluding remarks in section 7.
2. Background Information. We start this section by describing the notation used
in this paper. We use largely standard notation but introduce some new notation for the
sake of clarity. We use A to denote a matrix, A^T to denote its transpose, a to denote
a vector, I to refer to the identity matrix whose size is determined by context, and L and
U to reference sparse matrices that are lower and upper triangular in structure. We subscript
these matrices with the row and column indices separated by a comma. The notation a_{i,j}
refers to a scalar entry of the matrix A at row i and column j, while a_{i,:} refers to the row
vector of A at row i. We utilize MATLAB [16] style slicing notation, where a_{i,j:k} refers to
the elements of a_{i,:} at the column indices j through k. We additionally use non-contiguous
slicing, where a_{i,idx} refers to the elements of a_{i,:} at the column indices in idx.
Non-contiguous submatrices may also be referenced in a similar manner, where A_{idx1,idx2}
refers to the submatrix of A whose entries correspond to the Cartesian product of idx1 and
idx2. We assume that slicing with a set of indices is performed in the natural order of the
indices in the set. In the case of the Cartesian product, a lexicographic ordering is used:
entries are ordered first by the row index and then by the column index.
We use N to denote the set of natural numbers, which we define to start at 1, and R
to denote the set of real numbers. The notation N^2 refers to the set of pairs of natural
numbers, and R^{n×m} refers to the set of n × m matrices with real entries. We use A^k to
refer to the k-th power of the matrix A, whereas L^{(k)} refers to the k-th iteration of a method
applied to L. We are consistent with our use of 1-based indexing throughout this paper and
assume all loops are inclusive of their endpoints unless otherwise stated.
We use notation like L to denote the graph of the sparsity pattern of L and are consistent
in our use of S to denote some arbitrary initial sparsity pattern. We use standard notation
to define operations on a graph, where L\U refers to the graph L with the edges that exist
in U removed. Similarly, L ∪ U refers to the graph L with the edges that exist in U added.
We assume that the edges in a graph are uniquely defined by a set and often use set notation
to define the entries of an adjacency matrix that fully describes the edges of a graph.
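For readers who want to map this notation onto code, the slicing described above corresponds directly to array indexing with index vectors. A small Julia illustration (with arbitrary example values chosen only for this snippet) is:

using SparseArrays

# Hypothetical 3-by-3 matrix used only to illustrate the slicing notation.
A = sparse([4.0 1.0 0.0;
            1.0 3.0 2.0;
            0.0 2.0 5.0])
idx1, idx2 = [1, 3], [2, 3]

A[2, :]         # a_{2,:}, the second row of A
A[2, 2:3]       # a_{2,2:3}, a contiguous slice of that row
A[idx1, idx2]   # A_{idx1,idx2}, a non-contiguous submatrix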
2.1. Related work. Incomplete factorizations are some of the oldest preconditioners
[1] that are still used today. In addition to the standard ILU(k) method based on levels of
fill, variations based on thresholding (adaptive pattern) have been proposed [21].
Calgaro et al. [6] proposed an incremental ILU method for the case of a sequence of
matrices. Chow and Patel [7] proposed a novel iterative, fine-grained method for computing
the incomplete factors that is well suited for parallel computing. Anzt et al. [2, 3] further
improved this method to a thresholded ILU and showed results on GPUs.
Our method is also highly parallel and similar to the iterative fine-grained method by
Chow, but is derived in a different way.
2.2. Classical ILU Method. The classical ILU(k) method is based on Gaussian
elimination, but with entries dropped to avoid or reduce the amount of fill-in, and given a
pattern, S, of non-zero entries in the factorization. These non-zero locations, in the case
of an ILU(0) factorization, are often chosen to be the same as the non-zero pattern of the
matrix, A, but this is not required. For higher levels of fill, a common choice is to use the
sparsity pattern of A
k+1
. This can be seen as an approximation to the true LU factors of
A, where the equation LU = A is satisfied exactly on the pattern of A and may be violated
elsewhere.
We show this algorithm, given a fixed non-zero pattern S, in Algorithm 1. This algo-
rithm computes the ILU factors by iterating through the columns and rows of the input
matrix A, updating the non-zero entries according to the sparsity pattern S. For each
column j, the algorithm computes the entries in the j-th column of L by scaling the entries
of the j-th column of U below the diagonal by the diagonal entry u_{j,j}, and then updates
the remaining entries in the corresponding rows of U. This algorithm is inherently sequential,
as the updates to the factors depend on the previously processed rows and columns.
Algorithm 1 Classic ILU
Input: Sparse matrix A, sparsity pattern S.
Output: Sparse ILU factors L and U, where LU = A along the sparsity pattern S.
L ← I
U ← A, where u_{i,j} = 0 if (i, j) ∉ S
for j = 1 to n − 1 do
  for i = j + 1 to n and (i, j) ∈ S do
    ℓ_{i,j} ← u_{i,j} / u_{j,j}
    for k = j + 1 to n and (i, k) ∈ S do
      u_{i,k} ← u_{i,k} − ℓ_{i,j} · u_{j,k}
    end for
  end for
end for
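For concreteness, the following Julia sketch mirrors Algorithm 1 on a dense working copy of A; representing the pattern S as a set of (i, j) pairs and using dense storage are illustrative choices for this sketch, not the data structures used later in the paper.

using LinearAlgebra

# Sketch of Algorithm 1: ILU restricted to a fixed pattern S (a Set of (i, j) pairs).
function classic_ilu(A::AbstractMatrix, S::Set{Tuple{Int,Int}})
    n = size(A, 1)
    L = Matrix{Float64}(I, n, n)                         # unit lower triangular
    U = [(i, j) in S ? float(A[i, j]) : 0.0 for i in 1:n, j in 1:n]
    for j in 1:n-1
        for i in j+1:n
            (i, j) in S || continue
            L[i, j] = U[i, j] / U[j, j]                  # scale by the diagonal entry u_{j,j}
            U[i, j] = 0.0                                # this entry now lives in L
            for k in j+1:n
                (i, k) in S || continue
                U[i, k] -= L[i, j] * U[j, k]             # update the remainder of row i
            end
        end
    end
    return L, U
end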
2.3. Fine-Grained Iterative ILU Method by Chow. Like the classical ILU method,
ParILU computes an approximation to the true LU factors of A but does so iteratively using
a series of smaller, approximate subproblems. Along the given sparsity pattern S, the Par-
ILU method updates the non-zero entries of the factors L and U by relaxing each variable
independently. This independent relaxation is performed in parallel, allowing for better
handling of large-scale and complex systems, with the potential for significant speedup on
modern parallel architectures. This algorithm is given in Algorithm 2.
Algorithm 2 ParILU [7]
1: Input: Sparse matrix A, sparsity pattern S, starting factors L and U
2: Output: Updated factors L and U
3: while not converged do
4:   for (i, j) ∈ S do
5:     if i > j then
6:       ℓ_{i,j} ← (a_{i,j} − Σ_{k=1}^{j−1} ℓ_{i,k} u_{k,j}) / u_{j,j}
7:     else
8:       u_{i,j} ← (a_{i,j} − Σ_{k=1}^{i−1} ℓ_{i,k} u_{k,j}) / ℓ_{i,i}
9:     end if
10:   end for
11: end while
Each update can be performed in parallel, as the updates to each row or column are
independent. In practice, the algorithm can be implemented in a manner that is either
deterministic or non-deterministic in parallel. In the non-deterministic version, the updates
are performed atomically and directly on the factors L and U. In the deterministic version,
the updates are performed directly on the matrix L but are stored in a temporary matrix
for U.
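A serial Julia sketch of one such sweep is shown below; it follows the deterministic variant just described, with L updated in place and the U updates staged in a temporary copy. Dense working arrays and the assumption that L starts with a unit diagonal are simplifications of this sketch.

# One fixed-point sweep of Algorithm 2 (ParILU), with deterministic staging for U.
function parilu_sweep!(L::Matrix{Float64}, U::Matrix{Float64},
                       A::AbstractMatrix, S::Vector{Tuple{Int,Int}})
    Unew = copy(U)
    for (i, j) in S
        s = A[i, j]
        if i > j                          # lower-triangular entry: update L
            for k in 1:j-1
                s -= L[i, k] * U[k, j]
            end
            L[i, j] = s / U[j, j]
        else                              # upper-triangular entry: update (staged) U
            for k in 1:i-1
                s -= L[i, k] * U[k, j]
            end
            Unew[i, j] = s / L[i, i]      # L has unit diagonal, so this divides by one
        end
    end
    copyto!(U, Unew)
    return L, U
end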
3. Alternating Triangular Solves Method. In this section, we introduce our new
method for computing ILU factors, ATS-ILU. This method is based on the idea of alternating
iterative updates to the L and U factors of the matrix A. The basic idea is the same as
before, where we iteratively update the factors L and U until convergence, but where the
updates are performed in an alternating manner. This general process is a common method
for solving bilinear systems and is outlined in Algorithm 3.
Algorithm 3 Alternating ILU
U^{(0)} ← triu(A)
k ← 0
while not converged do
  Solve L^{(k)} U^{(k)} = A for L^{(k)}
  Solve L^{(k)} U^{(k+1)} = A for U^{(k+1)}
  Check convergence
  k ← k + 1
end while
One way to perform this procedure would be to apply Algorithm 4 (or its upper-triangular
analogue) with the entirety of U^{(k)}, with the rows of A as the right-hand sides, to solve
for L^{(k+1)}, and similarly to solve for U^{(k+1)}. This entire process can largely be performed
in parallel, as each row of L and column of U can be solved independently. Despite the
potential for high levels of parallelism, it is still extremely computationally expensive and
likely suffers from significant levels of fill-in during intermediate steps. The computational
cost could be reduced by using an approximation. A simple approximation is the Neumann
series: (I − L)^{-1} = I + L + L^2 + ⋯. This has been explored [3] in the context of solving
for dense vectors. Sparsity may be preserved to some degree, but the fill will increase when
summing higher powers. Therefore, we do not consider this option any further.
Additionally, the algorithm as stated above does not guarantee that L and U remain
lower and upper triangular, respectively. One method to address this issue would be to solve
for L only in the lower triangular part of A and for U only in the upper triangular part
of A. This would ensure that the factors remain lower and upper triangular, respectively,
but would still leave the problem of significant levels of fill-in. Instead, we suggest a more
practical approach where we impose fixed sparsity patterns on L and U. These sparsity
patterns can be chosen to be the same as the sparsity pattern of A, which is the choice we
make in this paper.
In order to get around the issue of fill-in, we propose a method where we approximately
solve for L and U along their given sparsity patterns, which we discuss next.
3.1. Sparse Triangular Solve. Because we introduce an approximate triangular solve
in the following section, we provide a brief overview of the classical sparse triangular solve
algorithm for reference in Algorithm 4. This algorithm is used to solve a lower triangular
system of equations, where the solution vector x is updated in a forward sweep. The
algorithm performs an exact inversion of the sparse lower triangular matrix L that has
sparsity pattern L. Although it is accurate, it is computationally expensive and does not
parallelize particularly well. Therefore, it is only suitable for fairly small matrices.
Algorithm 4 Sparse Triangular Solve: Column-by-Column Elimination (Lower Triangular)
1: procedure triangularSolve(L, S, b)
2:   Input: Lower triangular matrix L of size n × n, sparsity pattern S, vector b of size n
3:   Output: Solution vector x of size n
4:   x ← copy(b)
5:   for j = 1 to n do
6:     x_j ← x_j / ℓ_{j,j}
7:     for i = j + 1 to n and (i, j) ∈ S do
8:       x_i ← x_i − x_j · ℓ_{i,j}
9:     end for
10:   end for
11:   return x
12: end procedure
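In Julia, where sparse matrices are stored by columns (CSC), Algorithm 4 can be sketched as follows; taking the pattern to be the stored nonzeros of L, rather than a separately supplied S, is a simplification of this sketch.

using SparseArrays

# Sketch of Algorithm 4: column-oriented forward solve of L x = b.
function sparse_lower_solve(L::SparseMatrixCSC{Float64,Int}, b::Vector{Float64})
    x = copy(b)
    rows, vals = rowvals(L), nonzeros(L)
    for j in 1:size(L, 1)
        x[j] /= L[j, j]                          # divide by the diagonal entry
        for p in nzrange(L, j)                   # stored entries of column j
            i = rows[p]
            i > j && (x[i] -= x[j] * vals[p])    # eliminate below the diagonal
        end
    end
    return x
end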
3.2. Approximate Sparse Triangular Solve. The exact sparse triangular solve may
introduce fill. We want an approximate solve with no fill. Our approximate sparse triangular
solve algorithm is given in Algorithm 5.
Typically, the right hand side b is sparse and B is the sparsity pattern of b. Our use
case is when b is a column of U . The special case of a sparse solve with a sparse right hand
side has been well studied [8, 12], and the sparsity in x is determined by the reachability set.
However, as we wish to preserve sparsity, we instead impose that the sparsity pattern of x
shall be the same as that of b. The intuition behind our method is to extract the nonzero
parts of b and solve for the corresponding submatrix of L, though the implementation is
slightly different to avoid extracting such a submatrix. This results in an approximate solve
that preserves sparsity.
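A Julia sketch of this approximate solve (stated formally as Algorithm 5 below) is shown here; it assumes the pattern B is given as a sorted vector of indices together with the matching values of b, and it never creates entries outside of B.

using SparseArrays

# Approximate forward solve restricted to the pattern B of the right-hand side.
function approx_lower_solve(L::SparseMatrixCSC{Float64,Int},
                            B::Vector{Int}, bvals::Vector{Float64})
    x = Dict(zip(B, bvals))                       # x restricted to the pattern B
    rows, vals = rowvals(L), nonzeros(L)
    for j in B                                    # only columns j in B are eliminated
        x[j] /= L[j, j]
        for p in nzrange(L, j)
            i = rows[p]
            (i > j && haskey(x, i)) || continue   # only rows i in B are updated
            x[i] -= x[j] * vals[p]
        end
    end
    return [x[i] for i in B]                      # solution along the pattern B
end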
We describe the ATS-ILU algorithm next.
3.3. Alternating Triangular Solves ILU(k) Algorithm. The ATS-ILU(k) algo-
rithm is based on the idea of approximately solving for L and U in an alternating fashion
along only their given sparsity patterns. Again, the rows of L and the columns of U can
Algorithm 5 Sparse Triangular Solve: Column-by-Column Elimination (Approximate)
1: procedure ApproximateTriangularSolve(L, L, b, B)
2:   Input: Lower triangular matrix L ∈ R^{n×n}, sparsity pattern L, vector b ∈ R^n, and pattern B along which to approximate the application of the inverse to b
3:   Output: Solution vector x of size n
4:   x ← copy(b)
5:   for j = 1 to n and j ∈ B do
6:     x_j ← x_j / ℓ_{j,j}
7:     for i = j + 1 to n and i ∈ B and (i, j) ∈ L do
8:       x_i ← x_i − x_j · ℓ_{i,j}
9:     end for
10:   end for
11:   return x
12: end procedure
be solved independently, allowing for a high level of parallelism. The algorithm is shown
in Algorithm 6. We present the algorithm for a general pattern S, but in practice this will
correspond to the pattern of A^k for some small power k.
Algorithm 6 ATS-ILU(k)
1: Input: Sparse matrix A, sparsity pattern S, starting factors L and U
2: while not converged do
3:   for i ∈ {1, 2, . . . , n} do
4:     idx ← {j ∈ N | (i, j) ∈ S, j ≤ i}
5:     ℓ_{i,idx} ← a_{i,idx} (U_{idx,idx})^{−1}
6:   end for
7:   for j ∈ {1, 2, . . . , n} do
8:     idx ← {i ∈ N | (i, j) ∈ S, i ≤ j}
9:     u_{idx,j} ← (L_{idx,idx})^{−1} a_{idx,j}
10:   end for
11: end while
In this algorithm, we solve for each row of L and each column of U independently.
Recall from section 2 that the notation a_{i,idx} refers to the i-th row of A restricted to the
indices in idx, and similarly for U_{idx,idx} and L_{idx,idx}. These submatrices can be viewed as
the (dense) non-contiguous submatrices of L and U that correspond to the sparsity pattern
S along the given row or column. We show an example of this in Figure 3.1.
We note that solving for the solution of these non-contiguous submatrices and vectors
can either be viewed as an exact solution of an approximate problem or an approximate
solution of an exact problem. The former applies the inverse of the dense submatrix to the
right-hand side vector, while the latter utilizes the approximate solve algorithm shown in
Algorithm 5. These two views are equivalent in the sense that they both result in the same
solution vector.
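As an illustration of the row update on line 5 of Algorithm 6, the following Julia sketch takes the first view, the exact solution of an approximate problem: it gathers the dense non-contiguous submatrix U_{idx,idx} and solves the small triangular system directly. The explicit gather is purely illustrative; as discussed in section 4, our implementation instead looks the entries up in place.

using SparseArrays, LinearAlgebra

# One ATS-ILU row update: given row i and idx = {j : (i, j) in S, j <= i},
# compute new values for ℓ_{i,idx} from ℓ_{i,idx} U_{idx,idx} = a_{i,idx}.
function ats_row_update(A::SparseMatrixCSC{Float64,Int},
                        U::SparseMatrixCSC{Float64,Int},
                        i::Int, idx::Vector{Int})
    Usub = UpperTriangular(Matrix(U[idx, idx]))   # small dense non-contiguous submatrix
    rhs  = Vector(A[i, idx])                      # matching right-hand side from A
    return Usub' \ rhs                            # solve (U_{idx,idx})^T ℓ^T = (a_{i,idx})^T
end

Solving with the transposed submatrix mirrors the CSC-as-transposed-CSR trick described in section 4.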
3.4. Relation between ATS-ILU and ParILU. We again note that the ATS-
ILU(k) algorithm has some similarities to the ParILU algorithm by Chow, but there are
some key differences. Both algorithms perform iterative updates to the L and U factors
of the matrix A until convergence, both algorithms seek to solve the equation LU = A
approximately, and both algorithms can be implemented in a parallel fashion.
[Figure 3.1 shows five panels: the original matrix, the lower triangular factor, the upper triangular factor, the non-contiguous target row (row 17) in the original matrix, and the non-contiguous submatrix extracted from the upper triangular factor.]
Fig. 3.1. This figure illustrates the extraction of the small non-contiguous submatrix from U given row
17 of L. Given the non-zero index locations of the given row, we extract the non-contiguous submatrix
from U that corresponds to these columns and rows. Similarly, we extract the right-hand side vector from
A that corresponds to these columns at the given row.
However,
in Chow’s method, a single inner product is used to relax one value at a time, while our
method is coarser-grained, as we operate on an entire row (column) at a time.
A second difference is that ParILU is asynchronous and updates L and U simultane-
ously, while we alternate between L and U updates. However, there is a close relation-
ship. Suppose the execution order in Chow’s method is to update first the lower triangular
part, then the upper part. In this case, ParILU becomes synchronous and very similar
to ATS-ILU. Conversely, one could modify ATS-ILU to update L and U simultaneously
(asynchronously) but we do not explore that here.
We claim ATS-ILU uses exactly the same number of flops as ParILU per sweep (or
update iteration). This can be seen from Figure 3.1. Consider the flops needed to update
row 17 in L. In ATS-ILU, this is given by the number of nonzeros in the small submatrix
shown in the bottom right. These correspond to the red nonzeros in U . On the other hand,
ParILU requires sparse inner products between row 17 of L and columns 1 to 17 of U .
Interestingly, this corresponds exactly to the same red nonzeros. Therefore, the number of
flops is the same.
Note that even if the flop count is the same, the methods are different and will typically
produce different factors. For example, ATS-ILU updates the diagonal of L so it will
typically not be unit while ParILU strictly enforces unit diagonal.
We believe our method is likely more memory efficient as the memory access pattern
is more regular. Also, we avoid the sparse inner products, which are difficult to implement
efficiently.
3.5. Alternating Triangular Solves Algorithm with Thresholding (ILUT). In
this subsection, we describe the ATS-ILUT algorithm, which is a thresholded variant of the
ATS-ILU algorithm that allows for a variable sparsity pattern based on the level of fill-in.
The variable sparsity pattern utilized in the algorithm is based on the one used in the
ParILUT method [2, 3]. After testing, we found that the “candidate fill-in” method used
in ParILUT was highly competitive, and we utilize something similar. This method creates
a new sparsity pattern based on the sparsity pattern of the residual matrix,
R = A − LU [2]. As in [2], candidate locations for fill-in are chosen from
C = R\(L ∪ U),
where R is the sparsity pattern of R = A − LU, and L and U are the sparsity patterns of L
and U , respectively. The lower and upper triangular candidate locations are added to the
sparsity pattern of L and U , respectively, prior to updating the factors in each iteration and
the original locations of L and U are retained. This method allows for a variable sparsity
pattern that can be adjusted based on the level of fill-in in the factors L and U .
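A minimal Julia sketch of this candidate selection, assuming the factors and the residual are held as sparse matrices, is:

using SparseArrays

# Candidate fill-in locations C = R\(L ∪ U): positions where the residual
# R = A - LU is nonzero but not yet in the pattern of L or U.
function candidate_locations(A::SparseMatrixCSC, L::SparseMatrixCSC, U::SparseMatrixCSC)
    R = dropzeros!(A - L * U)                     # residual with exact zeros removed
    pattern(M) = Set(zip(findnz(M)[1], findnz(M)[2]))
    return setdiff(pattern(R), union(pattern(L), pattern(U)))
end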
At the end of each iteration, the factors L and U are thresholded to remove elements
with a magnitude below a certain threshold. This threshold is chosen to be the magnitude
of the k-th largest element in each of the factors L and U, where k is the maximum number
of elements allowed in L and U, respectively. This method of thresholding is both simple
and effective, and it reduces the number of variables the end user needs to tune in order
to get a good approximation of the ILU factors. We note that this thresholding strategy is
also used in the ParILUT method [2].
The full algorithm is given in Algorithm 7.
Algorithm 7 ATS-ILUT
1: Input: Sparse matrix A ∈ R^{n×n}, starting factors L and U, starting sparsity pattern S, maximum number of elements L_max and U_max in L and U, respectively
2: R ← (A − LU)
3: L ← {(i, j) ∈ N^2 | j ≤ i, r_{i,j} ≠ 0}
4: for i ∈ {1, 2, . . . , n} do
5:   idx ← {j ∈ N | j ≤ i, (i, j) ∈ L ∪ S}
6:   ℓ_{i,idx} ← a_{i,idx} (U_{idx,idx})^{−1}
7: end for
8: R ← (A − LU)
9: U ← {(i, j) ∈ N^2 | j ≥ i, r_{i,j} ≠ 0}
10: for j ∈ {1, 2, . . . , n} do
11:   idx ← {i ∈ N | j ≥ i, (i, j) ∈ U ∪ S}
12:   u_{idx,j} ← (L_{idx,idx})^{−1} a_{idx,j}
13: end for
14: τ ← the L_max-th largest (in magnitude) element of {ℓ_{i,j} | (i, j) ∈ N^2, ℓ_{i,j} ≠ 0}
15: Threshold L to elements with larger magnitude than τ
16: τ ← the U_max-th largest (in magnitude) element of {u_{i,j} | (i, j) ∈ N^2, u_{i,j} ≠ 0}
17: Threshold U to elements with larger magnitude than τ
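To make the thresholding step on lines 14–17 concrete, a small Julia sketch is shown below; here kmax plays the role of L_max or U_max, and ties at the threshold magnitude are simply kept, which is a simplification of this sketch.

using SparseArrays

# Thresholding step of Algorithm 7: keep only the kmax entries of largest
# magnitude in the factor M, dropping everything smaller.
function threshold!(M::SparseMatrixCSC{Float64,Int}, kmax::Int)
    nnz(M) <= kmax && return M
    tau = partialsort(abs.(nonzeros(M)), kmax; rev = true)   # kmax-th largest magnitude
    droptol!(M, prevfloat(tau))                              # drop entries with magnitude below tau
    return M
end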
Note that on lines 2 and 8 of Algorithm 7, only the lower and upper portions of the
residual are needed. As an implementation detail, one could compute half of the residual at
each half step to reduce the total expended work by a significant margin. Next we briefly
discuss options for the starting sparsity pattern.
3.6. Initial Guess and Sparsity Pattern. The initial guess for L and U makes a
difference. In fact, there is no guarantee that the alternating method converges to the global
solution (though empirically it usually works).
There are several options for the starting sparsity pattern S. One option is to use the
sparsity pattern of the matrix A, which was used in the ParILUT method [2]. Alternatively,
one could use the sparsity pattern of the matrix A^k for some k > 1 or k = 0. We find it
hard to justify using the sparsity pattern of A^k for k > 1. The sparsity pattern of A^0 = I
is the diagonal of A, which is a reasonable choice. Later, in section 6, we test the sparsity
patterns of A^k for k ∈ {0, 1, 2} and report our findings.
Next, we discuss important implementation details of the two algorithms.
4. Implementation details. This section is dedicated to the implementation details
of the ATS-ILU and ATS-ILUT algorithms. For simplicity, we focus on how to solve for L
given U . How to compute U given L is analogous.
For reasons relating to the manner in which we iterate over L and U , we store L in a
compressed sparse row (CSR) format and U in a compressed sparse column (CSC) format.
For this reason, we can treat U stored in CSC format as U^T stored in CSR format, which
transforms the solve in lines 6 and 12 of Algorithm 7 and line 5 of Algorithm 6 from
ℓ_{i,idx} ← a_{i,idx} (U_{idx,idx})^{−1}
to
(ℓ_{i,idx})^T ← ((U_{idx,idx})^T)^{−1} (a_{i,idx})^T,
which allows us to use nearly the same algorithm for the update of both L and U. In
implementation, the only difference between the two update routines is the manner in which
we look up the elements of A that we are solving for. If we were to pass A^T instead of A to
the update routine when solving for U, the implementation would be identical.
We implement the approximate sparse triangular solve algorithm shown in Algorithm 5
using a custom kernel that looks up values in U and A given the input indices from L via
binary search. We call this kernel binary_search_triangular_solve. This implementation
performs a lookup of the values in the non-contiguous submatrix of U and the right-hand
side vector of A. In the first pass of the triangular solve, the row values of L are repopulated
with their corresponding elements in A, and are treated as the non-contiguous right-hand
side vector.
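The lookup at the core of this kernel can be sketched in Julia as follows; we show it for a CSC matrix, so the binary search runs over the stored row indices of one column, which is the mirror image of the CSR row lookup described above.

using SparseArrays

# Binary-search lookup of one entry: returns the stored value at (i, j),
# or 0.0 if (i, j) is not in the sparsity pattern of M.
function lookup(M::SparseMatrixCSC{Float64,Int}, i::Int, j::Int)
    rows, vals = rowvals(M), nonzeros(M)
    rng = nzrange(M, j)                              # stored entries of column j
    p = searchsortedfirst(view(rows, rng), i)        # binary search over the row indices
    return (p <= length(rng) && rows[rng[p]] == i) ? vals[rng[p]] : 0.0
end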
We implemented the ATS-ILUT algorithm with a large number of different variations.
These different variations are controlled by options that conditionally compile different parts
of the code using constexpr variables and C-style preprocessing. The options we imple-
mented are as follows:
1. USE_NEW: This option utilizes the newest factor in the update step, i.e., L^{(k+1)} is
used to update U^{(k)}, and similarly for the other factor.
2. UPDATE_RESIDUAL_BETWEEN: This option updates the residual matrix R between the
two update steps.
3. USE_RESIDUAL_LHS: This option augments the left-hand side of the solve with the
residual matrix, i.e., for some modified matrix M, we let m_{i,j} = u_{i,j} if u_{i,j} ≠ 0
and r_{i,j} otherwise.
4. USE_RESIDUAL_RHS: This option augments the right-hand side of the solve with the
residual matrix, i.e., given a solve for some row i of L we have b, where b_j = a_{i,j} if
a_{i,j} ≠ 0 and r_{i,j} otherwise.
5. ADDITIONAL_SWEEPS_BEFORE: This option performs additional sweeps prior to the
beginning of the main loop.
6. ADDITIONAL_SWEEPS_AFTER: This option performs additional sweeps after the thresh-
olding step of the main loop.
We have exhaustively tested all combinations of these options (with options 5 and 6 limited
to {0, 1}) on several matrices in the SuiteSparse collection. In section 6, we will discuss a
selection of the results from these tests.
5. Parallel Aspects. A key advantage of ATS-ILU compared to the traditional ILU
algorithm is that it is highly parallel. Thus, it is well suited for modern architectures, such
as multicore CPUs and accelerators such as GPUs.
The ParILU method is also parallel, but not deterministic. The results will vary de-
pending on the thread execution order. It is possible to impose a certain execution order,
but this will slow the method down. In contrast, ATS-ILU is naturally deterministic. This
is easy to see as L is used to update U and vice versa. It is also possible to update L and U
simultaneously (asynchronously) but we did not explore that as we view the deterministic
behavior as an advantage.
In ParILU, there is one thread per nonzero in the matrix. In ATS-ILU, the parallelism
is over rows and columns. In our current implementation, one thread is assigned one row (or
column), giving a coarser parallelism than in ParILU. However, another option is to assign
a thread team for the sparse triangular solve in each row (column). This would require
a team-level sparse triangular solve, which is not widely available. Thus, we consider this
option future work.
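As an illustration of this row-level parallelism, a threaded Julia sketch of the L half-sweep is shown below; Lrows, row_idx, and ats_row_update are hypothetical helpers (the last being the row solve sketched in section 3) and are assumptions of this example rather than part of our Kokkos implementation.

using Base.Threads

# Row-parallel half-sweep for L: each thread updates whole rows, so the
# individual row solves are independent and the result does not depend on
# the thread schedule. The column half-sweep for U is analogous.
function update_L_rows!(Lrows::Vector{Vector{Float64}}, A, U, row_idx::Vector{Vector{Int}})
    @threads for i in eachindex(row_idx)
        Lrows[i] = ats_row_update(A, U, i, row_idx[i])   # new values of ℓ_{i,idx}
    end
    return Lrows
end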
6. Experiments and Results. We implement both the ATS-ILU and ATS-ILUT
algorithms in C++ using the Kokkos library [25]. We test the algorithms on a variety
of matrices from the SuiteSparse collection [9] and compare the results to the ParILUT
algorithm. We note that we use the Kokkos implementation of ParILUT, which is a shared-
memory parallel implementation of the ParILUT algorithm [2]. This implementation per-
forms thresholding using the same method as the [2] implementation and not the fast ap-
proximate approach described in [3]. This is a valid comparison because we use the same
thresholding strategy in our ATS-ILUT algorithm and the approximate one would be simi-
larly applicable to our algorithm. All of the experiments given below were run on a single
node with 24 threads and 128 GB of memory.
We compare the ATS-ILUT algorithm implementation against the ParILUT algorithm.
In [3], they found that their thresholding variant not only outperformed their non-thresholded
level-based variant but also often outperformed the classical level-based method. Because the
results they obtained in their paper largely carry over to ours, we do not repeat that exper-
iment in particular. Instead, we focus on the improved thresholding-based variants of ATS
and ParILU.
Throughout this section, our measure of goodness is the number of GMRES iterations
to converge to a relative residual of 10^{−10}. When referring to something as worse in terms
of this measure, we mean that it required more iterations to converge. Conversely, when
referring to something as better, we mean that it required fewer iterations to converge. We
set the maximum subspace size of GMRES to 500 and the maximum number of iterations to
1500. We scale each matrix to have unit diagonal with Jacobi scaling. We use the GMRES
implementation from the KokkosKernels library [25].
We tested 36 combinations of the options described in section 4 on all sparsity pat-
terns stated above. We limited the two sweep-related options to either 0 or 1 additional
sweeps. We share the results for the best two configurations below. Both of these con-
figurations have USE_NEW and UPDATE_RESIDUAL_BETWEEN enabled, and USE_RESIDUAL_LHS
and USE_RESIDUAL_RHS disabled. We found the performance of the algorithm to be largely
unaffected by the ADDITIONAL_SWEEPS_BEFORE option but do notice a difference with the
ADDITIONAL_SWEEPS_AFTER option. Because there was little difference in performing an ini-
tial sweep before the first iteration, we share the results with it disabled. The results for the
two remaining configurations are given in Table 6.1. Note that “No Extra” refers to
the configuration with ADDITIONAL_SWEEPS_AFTER disabled and “One Extra” refers to the
configuration with ADDITIONAL_SWEEPS_AFTER set to 1. These configurations were tested on
atmosmodd, torso2, majorbasis, and venkat01 from the SuiteSparse collection [9].
Additionally, we tested with starting sparsity patterns of A^k for k ∈ {0, 1, 2}. The
behavior with the starting sparsity pattern of A^0 was similar to, sometimes marginally better
but often marginally worse than, the behavior with the starting sparsity pattern of A^1.
Because the behavior is similar between these two patterns and because A^1 was the starting
pattern used in the ParILUT method, we do not share the results with the starting sparsity
pattern of A^0 in this paper.
The behavior with the starting sparsity pattern A^2 was inconsistent and was largely
uncompetitive with the other starting sparsity patterns. With this starting sparsity pattern,
the preconditioner after the first update cycle was always worse than with the other starting
patterns. In some cases, the resulting factors were better after the fifth update cycle, but
this was not consistent across all matrices or even levels of fill for a given matrix. We discuss
potential reasons for this behavior in section 7 but do not share the results in this paper
due to space.
Table 6.1
Comparison of ATS-ILU Variants with ParILUT across Different Matrices, Fill Levels, and Iterations with an A^1 Starting Sparsity Pattern. The best at each iteration is bolded.

Matrices:                    atmosmodd       |    torso2     |   majorbasis   |     venkat01
Method        Fill   Iter:  1    2    3    4    5 | 1  2  3  4  5 | 1  2  3  4  5  | 1   2   3   4   5
ATS-ILUT      1.0         166  150  147  146  146 | 9  8  7  8  7 | 12  9 10  9 10 | 23  23  23  24  25
 (No Extra)   2.0         132  101   96   95   98 | 9  7  6  8  7 | 12  8  9  9  9 | 17  14  13  12  12
              3.0         132   95   86   83   82 | 9  7  6  8  7 | 12  8  9  9  9 | 17  13  11  10  10
ATS-ILUT      1.0         147  145  145  145  145 | 7  8  8  9  9 | 12 10 10 10 10 | 23  23  24  25  25
 (One Extra)  2.0         114   97   97   97   98 | 7  8  8  9  9 | 11  9  9 10 10 | 17  14  13  12  12
              3.0         114   89   86   85   86 | 7  8  8  9  9 | 11  9  9 10 10 | 17  13  11  10  10
ParILUT       1.0         154  145  145  145  145 | 8  7  6  6  6 | 12 10  9  9  9 | 23  23  24  24  24
              2.0         120   98   97   97   97 | 8  5  4  4  3 | 12  8  6  6  5 | 19  15  13  12  12
              3.0         120   91   86   85   85 | 8  5  4  4  3 | 12  8  6  5  5 | 19  15  12  11  10
The results in Table 6.1 show that the ATS-ILUT algorithm is competitive with the
ParILUT algorithm. In most cases, the resulting factors after 5 update cycles are similar
in performance to the ParILUT factors. However, the ATS-ILUT algorithm is often able to
achieve a significantly better result after the first update cycle relative to ParILUT at the
same update cycle.
Next, we show results for four more matrices. We test on abnormal sandia, an internal
Sandia matrix that arises when solving a nonlinear thermal-fluid problem, as well as three
more SuiteSparse matrices: af_shell3, G3_circuit, and parabolic_fem. These results are
given in Table 6.2.
Similarly, these results largely show parity between the ATS-ILUT and ParILUT algo-
rithms. ATS-ILUT with the “No Extra” configuration was often better at lower fill levels,
while the “One Extra” configuration tended to converge to a good preconditioner faster.
We also tested these methods on the matrices ecology2 and offshore. No algorithm
produced a preconditioner that converged to the tolerance within the specified number of
GMRES iterations for the ecology2 matrix. For the offshore matrix, all methods produced
Table 6.2
Comparison of ATS-ILU Variants with ParILUT across Different Matrices, Fill Levels, and Iterations with an A^1 Starting Sparsity Pattern. The best at each iteration is bolded.

Matrices:              abnormal sandia   |         af_shell3        |          G3_circuit           |        parabolic_fem
Method        Fill   Iter: 1   2   3   4   5 |   1    2    3    4    5 |    1     2     3     4     5 |    1     2     3     4     5
ATS-ILUT      1.0         54  44  42  42  42 | 905  723  594  605  568 | 1169  1140  1148  1133  1131 | 1183  1089  1090  1083  1071
 (No Extra)   2.0         51  33  25  24  23 | 872  564  395  334  299 |  860   639   467   426   396 |  765   561   464   492   444
              3.0         51  31  22  18  17 | 872  559  378  308  258 |  860   638   448   373   318 |  765   551   422   434   381
ATS-ILUT      1.0         50  45  45  46  45 | 797  657  638  625  631 | 1153  1186  1189  1187  1183 | 1261  1192  1224  1220  1201
 (One Extra)  2.0         42  29  27  27  27 | 651  402  319  289  274 |  690   520   408   424   414 |  729   719   505   547   446
              3.0         42  24  20  20  20 | 651  397  290  226  204 |  690   520   357   326   305 |  729   715   681   525   378
ParILUT       1.0         54  45  45  45  45 | 822  597  581  616  592 | 1188  1170  1180  1215  1217 | 1232  1168  1190  1201  1197
              2.0         49  32  26  25  25 | 752  415  311  279  268 |  758   531   390   365   360 |  864   482   379   353   354
              3.0         49  30  21  18  17 | 752  415  293  234  204 |  758   531   379   295   269 |  864   479   320   219   191
a preconditioner that converged to the tolerance within the specified number of iterations
if left to update for only one or two cycles. When updating for three cycles, all methods
(all tested variants of ATS-ILUT and ParILUT) exhibited strange behavior and did not
converge to the tolerance within the specified number of GMRES iterations. There was
little difference in the number of GMRES iterations required to converge between the ATS-
ILUT and ParILUT algorithms for the offshore matrix after the first or second update.
7. Conclusions. In this paper we have introduced the ATS-ILU and ATS-ILUT algo-
rithms based on alternating inexact triangular solves. The ATS-ILU algorithm is a novel
approach to computing ILU(k)-type factors that is deterministic and parallel. The ATS-
ILUT algorithm is a thresholded variant of the ATS-ILU algorithm that allows for a variable
sparsity pattern based on the level of fill-in. We have shown that the ATS-ILU and ATS-
ILUT algorithms are competitive with the ParILUT algorithm on a variety of matrices from
the SuiteSparse collection as well as the abnormal Sandia matrix. Our algorithm is deter-
ministic without a loss in performance, which is a significant advantage over the ParILU
algorithm. Additionally, our algorithm has better memory reuse than the ParILU algorithm,
which is important for performance on modern architectures.
In terms of the number of GMRES iterations to converge, there was relatively little
difference in the performance between the ParILUT and ATS-ILUT algorithms after a few
update cycles. One reason for this could be that both algorithms are successful in solving
for factors that are nearly-optimal given the sparsity pattern. We note that both of our
algorithms use a similar method for the addition of new candidate locations to the sparsity
pattern at each iteration. Indeed, we found that utilizing a different starting sparsity pattern
(A^2) sometimes resulted in better factors ultimately, but not in a manner that was consistent
across all matrices or levels of fill for a given matrix. We believe that there is further work
to be done in two main domains: the choice of starting sparsity pattern and the method for
adding new candidate locations to the sparsity pattern. We believe that the choice of
starting sparsity pattern is important and that the behavior with the A^2 starting sparsity
pattern provides some empirical evidence for this.
Even with our algorithm ultimately creating factors that are similarly effective to the
ParILUT factors, we believe that there is still merit in further exploring the ATS-ILU and
ATS-ILUT algorithms. Our algorithm produces factors that work as a decent preconditioner
for GMRES with only a single update cycle (sweep). Thus, our method has an advantage
when short setup time is important. For certain applications, such as one in which the
preconditioner needs to be updated frequently or the matrix changes rapidly, this
could be a significant advantage. We believe that the ATS-ILU and ATS-ILUT algorithms
are a promising new approach to computing thresholded ILU factors and that there is further
work to be done in this area.
Acknowledgments. Sandia National Laboratories is a multimission laboratory man-
aged and operated by National Technology and Engineering Solutions of Sandia, LLC.,
a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of
Energy’s National Nuclear Security Administration under contract DE-NA-0003525.
REFERENCES
[1] J. A. Meijerink and H. A. van der Vorst, An iterative solution method for linear systems of which the
coefficient matrix is a symmetric M-matrix, Mathematics of Computation, 31 (1977), pp. 148–162.
[2] H. Anzt, E. Chow, and J. Dongarra, ParILUT—a new parallel threshold ILU factorization,
SIAM Journal on Scientific Computing, 40 (2018), pp. C503–C519, https://doi.org/10.1137/
16M1079506.
[3] H. Anzt, T. Ribizel, G. Flegar, E. Chow, and J. Dongarra, ParILUT—a parallel threshold ILU
for GPUs, in 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS),
IEEE, May 2019, https://doi.org/10.1109/IPDPS.2019.00033.
[4] M. Benzi, Preconditioning techniques for large linear systems: A survey, Journal of Computational
Physics, 182 (2002), p. 418–477, https://doi.org/10.1006/jcph.2002.7176, http://dx.doi.org/
10.1006/jcph.2002.7176.
[5] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, Julia: A fresh approach to numerical
computing, SIAM Review, 59 (2017), pp. 65–98, https://doi.org/10.1137/141000671, https:
//epubs.siam.org/doi/10.1137/141000671.
[6] C. Calgaro, J.-P. Chehab, and Y. Saad, Incremental incomplete LU factorizations with applica-
tions, Numerical Linear Algebra with Applications, 17 (2010), pp. 811–837, https://doi.org/10.
1002/nla.756.
[7] E. Chow and A. Patel, A fine-grained parallel ILU factorization, SIAM Journal on Scientific
Computing, 37 (2015), pp. C169–C197.
[8] T. Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006.
[9] T. A. Davis and Y. Hu, The university of florida sparse matrix collection, ACM Transactions on
Mathematical Software, 38 (2011), p. 1–25, https://doi.org/10.1145/2049662.2049663, http:
//dx.doi.org/10.1145/2049662.2049663.
[10] X. Dong and G. Cooperman, A Bit-Compatible Parallelization for ILU(k) Preconditioning, Springer
Berlin Heidelberg, 2011, p. 66–77, https://doi.org/10.1007/978-3-642-23397-5_8, http://dx.
doi.org/10.1007/978-3-642-23397-5_8.
[11] H. C. Edwards, C. R. Trott, and D. Sunderland, Kokkos: Enabling manycore performance porta-
bility through polymorphic memory access patterns, Journal of Parallel and Distributed Com-
puting, 74 (2014), pp. 3202–3216, https://doi.org/10.1016/j.jpdc.2014.07.003.
[12] J. R. Gilbert and T. Peierls, Sparse partial pivoting in time proportional to arithmetic operations,
SIAM Journal on Scientific and Statistical Computing, 9 (1988), pp. 862–874, https://doi.org/
10.1137/0909058, https://doi.org/10.1137/0909058.
[13] M. Hestenes and E. Stiefel, Methods of conjugate gradients for solving linear systems, Journal
of Research of the National Bureau of Standards, 49 (1952), p. 409, https://doi.org/10.6028/
jres.049.044, http://dx.doi.org/10.6028/jres.049.044.
[14] D. Hysom and A. Pothen, Efficient parallel computation of ilu(k) preconditioners, in Proceedings of
the 1999 ACM/IEEE conference on Supercomputing, SC ’99, ACM, Jan. 1999, https://doi.org/
10.1145/331532.331561, http://dx.doi.org/10.1145/331532.331561.
[15] P. Hénon and Y. Saad, A parallel multistage ILU factorization based on a hierarchical graph decom-
position, SIAM Journal on Scientific Computing, 28 (2006), pp. 2266–2293, https://doi.org/10.
1137/040608258.
[16] The MathWorks Inc., MATLAB version 9.13.0 (R2022b), 2022, https://www.mathworks.com.
[17] G. Karypis and V. Kumar, A fast and high quality multilevel scheme for partitioning irregular
graphs, SIAM Journal on Scientific Computing, 20 (1998), p. 359–392, https://doi.org/10.1137/
s1064827595287997, http://dx.doi.org/10.1137/S1064827595287997.
[18] N. Li, Y. Saad, and E. Chow, Crout versions of ilu for general sparse matrices, SIAM Journal
on Scientific Computing, 25 (2003), p. 716–728, https://doi.org/10.1137/s1064827502405094,
http://dx.doi.org/10.1137/S1064827502405094.
[19] A. Montoison and D. Orban, Krylov.jl: A Julia basket of hand-picked Krylov methods, Journal of
Open Source Software, 8 (2023), p. 5187, https://doi.org/10.21105/joss.05187.
[20] Y. Saad, ILUT: A dual threshold incomplete LU factorization, Numerical Linear Algebra with Applica-
tions, 1 (1994), pp. 387–402, https://doi.org/10.1002/nla.1680010405.
[21] Y. Saad, ILUT: A dual threshold incomplete LU factorization, Numerical Linear Algebra with Applica-
tions, 1 (1994), pp. 387–402, https://doi.org/10.1002/nla.1680010405.
[22] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Philadelphia, PA, 2nd ed., 2003.
[23] Y. Saad and M. H. Schultz, Gmres: A generalized minimal residual algorithm for solving nonsym-
metric linear systems, SIAM Journal on Scientific and Statistical Computing, 7 (1986), p. 856–869,
https://doi.org/10.1137/0907058, http://dx.doi.org/10.1137/0907058.
[24] C. Trott, L. Berger-Vergiat, D. Poliakoff, S. Rajamanickam, D. Lebrun-Grandie, J. Mad-
sen, N. Al Awar, M. Gligoric, G. Shipman, and G. Womeldorff, The kokkos ecosystem:
Comprehensive performance portability for high performance computing, Computing in Science
Engineering, 23 (2021), pp. 10–18, https://doi.org/10.1109/MCSE.2021.3098509.
[25] C. R. Trott, D. Lebrun-Grandié, D. Arndt, J. Ciesko, V. Dang, N. Ellingwood, R. Gaya-
tri, E. Harvey, D. S. Hollman, D. Ibanez, N. Liber, J. Madsen, J. Miles, D. Poliakoff,
A. Powell, S. Rajamanickam, M. Simberg, D. Sunderland, B. Turcksin, and J. Wilke,
Kokkos 3: Programming model extensions for the exascale era, IEEE Transactions on Parallel
and Distributed Systems, 33 (2022), pp. 805–817, https://doi.org/10.1109/TPDS.2021.3097283.