Summary: Converting Deep Neural Networks to Shallow (arxiv.org)
One Line
This study presents a constructive proof and an algorithm for converting deep ReLU networks into functionally identical shallow networks, representing them as a partition of the input space with a linear model on each region, thereby improving interpretability and explainability.
Key Points
- The paper presents a method for converting deep neural networks into shallow networks.
- The authors prove that every deep ReLU network can be rewritten as a functionally identical shallow network.
- An explicit algorithm for converting a trained deep network into its shallow equivalent is provided.
- The authors discuss the interpretability of shallow networks and how they can be used to compute SHAP values.
- The study establishes a connection between linear programming and ReLU activation patterns.
- The study aims to enhance the understanding and interpretability of deep neural networks by converting them into functionally identical shallow networks.
Summary
The paper presents a method for converting deep neural networks into shallow networks, proving that every deep ReLU network can be rewritten as a functionally identical shallow network. The study focuses on feed-forward networks with element-wise ReLU activations and introduces the concept of activation patterns: within each region of the input space, the pattern of active and inactive units is fixed, so the deep network decomposes into local linear models applied to specific regions.
The conversion algorithm constructs the shallow network from this decomposition of the input space into regions and the linear model within each region. It identifies the feasible activation patterns, computes the corresponding hyperplanes, and determines the half-space conditions that define each region; a formal correspondence between linear programs and regions establishes which patterns are feasible, connecting linear programming to ReLU activation patterns.
The implementation constructs layers that encode the linear models and the half-space conditions, and uses activation vectors to select the correct linear model for a given input, so the resulting shallow network computes exactly the same function as the deep network. The paper supports these findings with mathematical proofs, definitions, and examples, shows how the shallow representation can be used to compute SHAP values, and discusses the limitations and future directions of the research.
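As a concrete illustration of the local-linear-model view, here is a minimal sketch (not the paper's code; the two-layer network and its weights W1, b1, W2, b2 are hypothetical) that reads off the activation pattern at an input and the affine map the network computes on that input's region:

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # hidden layer, ReLU
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)   # linear output layer

def activation_pattern(x):
    """0/1 vector recording which hidden ReLUs are active at x."""
    return (W1 @ x + b1 > 0).astype(float)

def local_linear_model(x):
    """Affine map (A, c) that the network computes on x's region."""
    d = activation_pattern(x)            # the pattern selects the region
    A = W2 @ (d[:, None] * W1)           # compose layers with ReLUs frozen
    c = W2 @ (d * b1) + b2
    return A, c

x = rng.normal(size=3)
A, c = local_linear_model(x)
deep_out = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2
assert np.allclose(A @ x + c, deep_out)  # same function on this region

Repeating this over all feasible patterns yields the collection of regions and linear models from which the shallow network is assembled.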
The study extends previous theoretical work on neural networks and their linear regions, building on results that represent ReLU networks as collections of local linear models. The authors provide a constructive proof that every deep ReLU network can be rewritten as a functionally identical shallow network, along with an algorithm that finds the weights of the shallow network from a trained deep network.
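One step the summary highlights, deciding which activation patterns are feasible, reduces to linear programming. The following is a minimal sketch of that idea for a single layer of half-space conditions (an illustration of the general LP-region correspondence, not necessarily the paper's exact formulation; the weights here are again hypothetical):

import numpy as np
from scipy.optimize import linprog

def pattern_is_feasible(W, b, d, eps=1e-6):
    """True if some x satisfies every half-space condition encoded by d."""
    s = 2.0 * np.asarray(d, dtype=float) - 1.0   # +1 active, -1 inactive
    # Require s_i * (w_i . x + b_i) >= eps, rewritten as A_ub @ x <= b_ub.
    A_ub = -s[:, None] * W
    b_ub = s * b - eps
    res = linprog(c=np.zeros(W.shape[1]), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * W.shape[1])
    return res.status == 0   # status 0 means a feasible point was found

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)
x = rng.normal(size=3)
d = (W @ x + b > 0).astype(float)     # a pattern observed at a real input
assert pattern_is_feasible(W, b, d)   # ... is feasible by construction

A zero objective is used because only feasibility matters: the LP succeeds exactly when the region named by the pattern is non-empty.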
The contributions of this study include the constructive proof, the weight-finding algorithm, and the improvements in explainability that arise from the shallow network construction. Because the authors provide explicit weights for the shallow network, various interpretability metrics can be computed directly, including fast SHAP values.
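To see why explicit local weights make explanation cheap: within one region the network is an exact linear model, and for a linear model with independent features SHAP values have the closed form phi_i = w_i * (x_i - E[x_i]). Below is a minimal sketch of that standard Linear SHAP computation (an illustration of the general principle, not the paper's specific procedure; the weights and data are hypothetical):

import numpy as np

def linear_shap(w, x, background):
    """Exact SHAP values of f(x) = w @ x + b under feature independence."""
    return w * (x - background.mean(axis=0))

rng = np.random.default_rng(1)
background = rng.normal(size=(100, 3))        # reference data set
w, x = rng.normal(size=3), rng.normal(size=3)
phi = linear_shap(w, x, background)
# Efficiency property: contributions sum to f(x) - E[f(X)].
assert np.isclose(phi.sum(), w @ x - w @ background.mean(axis=0))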
Overall, this study aims to enhance the understanding and interpretability of deep neural networks by converting them into functionally identical shallow networks.