1 Introduction
Deep neural networks (DNNs) [23] have produced state-of-the-art results in many challenging tasks, including image classification [13, 22, 47, 20, 53, 36, 41, 48, 35] and object detection [38, 39, 62]. One of the key factors behind this success lies in the innovation of neural architectures, such as VGG [45] and ResNet [15]. However, designing effective neural architectures is often very labor-intensive and relies heavily on human expertise. More critically, such a human-designed process can hardly explore the whole architecture design space. As a result, the resultant architectures are often redundant and may not be optimal. Hence, there is a growing interest in replacing the manual process of architecture design with an automatic one, called Neural Architecture Search (NAS). Recently, substantial studies [28, 34, 67, 12] have shown that automatically discovered architectures are able to achieve highly competitive performance compared to handcrafted architectures. However, there are some limitations to NAS-based architecture design methods. In fact, since the search space is extremely large [34, 67] (e.g., billions of candidate architectures), these methods are hard to train and often produce suboptimal architectures, leading to limited representation performance or substantial computational cost. Thus, even for the architectures found by NAS methods, it is still necessary to optimize their redundant operations to achieve better performance and/or reduce the computational cost.
To optimize the architectures, Luo et al. recently proposed a Neural Architecture Optimization (NAO) method [30]. Specifically, NAO first encodes an architecture into an embedding in the continuous space and then conducts gradient descent to obtain a better embedding. After that, it uses a decoder to map the embedding back to obtain an optimized architecture. However, NAO has its own set of limitations. First, NAO often produces a totally different architecture from the input architecture, making it hard to analyze the relationship between the optimized and the original architectures (See Fig. 1). Second, NAO may improve the architecture design at the expense of introducing extra parameters or computational cost. Third, similar to the NAS methods, NAO has a very large search space, which may not be necessary for architecture optimization and may make the optimization problem very expensive to solve. An illustrative comparison between our methods and NAO can be found in Fig. 1.
Unlike existing methods that design/find neural architectures, we have proposed a Neural Architecture Transformer (NAT) [14] method to automatically optimize neural architectures to achieve better performance and/or lower computational cost. To this end, NAT replaces the expensive operations or redundant modules in an architecture with more efficient ones. Note that NAT can be used as a general architecture optimizer that takes any architecture as input and outputs an optimized architecture. NAT has shown great performance in optimizing various architectures on several benchmark datasets. However, NAT only considers three operation transitions, i.e., remaining unchanged, replacing with null connection, and replacing with skip connection. Such a small search/transition space may hamper the performance of architecture optimization. Thus, it is important and necessary to enlarge the search space of architecture optimization.
In this paper, based on NAT, we propose a Neural Architecture Transformer++ (NAT++) method which considers a larger search space to conduct architecture optimization in a finer manner. To this end, we present a two-level transition rule to simultaneously change both the type and the kernel size of an operation during architecture optimization. Specifically, NAT++ encourages operations to have more efficient types (e.g., convolution → separable convolution) or smaller kernel sizes. For convenience, we use valid transitions to denote those transitions that do not increase the computational cost. Note that different operations may have different valid transitions. To make NAT++ accommodate all the considered operations, we propose a Binary-Masked Softmax (BM-Softmax) layer to omit all the invalid transitions that violate the transition rule. In this way, NAT++ is able to predict the optimal transitions for operations with different valid transitions simultaneously. Extensive experiments show that our NAT++ significantly outperforms existing methods.
The contributions of this paper are summarized as follows.


We propose a Neural Architecture Transformer (NAT) method which optimizes arbitrary architectures for better performance and/or less computational cost. To this end, NAT either removes the redundant operations or replaces them with skip connections. To better exploit the adjacency information of operations in an architecture, we build the architecture optimization model upon a graph convolutional network (GCN).

Based on NAT, we propose a Neural Architecture Transformer++ (NAT++) method which considers a larger search space for architecture optimization. Specifically, NAT++ presents a two-level transition rule which encourages operations to have a more efficient type and/or a smaller kernel size. Thus, NAT++ is able to automatically obtain the valid transitions (i.e., the transitions to more efficient operations).

To accommodate the operations which may have different valid transitions, we propose a Binary-Masked Softmax (BM-Softmax) layer to build a general NAT++ model which predicts the optimal transitions for all the operations simultaneously.

Extensive experiments on several benchmark datasets show that our NAT and NAT++ consistently improve the design of various architectures, including both handcrafted and NAS-based architectures. Compared to the original architectures, the optimized architectures tend to yield significantly better performance and/or lower computational cost.
This paper extends our preliminary version [14] in several aspects. 1) We propose an advanced version, NAT++, which enlarges the search space to improve the performance of architecture optimization. 2) We present a two-level transition rule to automatically obtain the valid transitions for each operation at both the operation type level and the kernel size level. 3) We propose a Binary-Masked Softmax (BM-Softmax) layer to omit all the invalid transitions. 4) We compare the computational cost of different operations and analyze the effect of the transitions among them on our method. 5) We provide more analysis of the impact of different operations on the convergence speed of architectures. 6) We investigate the possible bias towards architectures with too many skip connections in the proposed method. 7) We provide more empirical results to show the effectiveness of NAT and NAT++ on various architectures.
2 Related Work
2.1 Handcrafted Architecture Design
Many studies have proposed a series of deep neural architectures, such as AlexNet [22], VGG [45] and so on. Based on these models, many efforts have been made to further increase the representation ability of deep networks. Szegedy et al. propose GoogLeNet [49], which consists of a set of convolutions with different kernel sizes. He et al. propose the residual network (ResNet) [15] by introducing residual shortcuts between different layers. To design more compact models, MobileNet [18, 40] employs depthwise separable convolutions to reduce model size and computational overhead. ShuffleNet [61, 31] exploits pointwise group convolution and channel shuffle to significantly reduce computational cost while maintaining comparable accuracy. However, the human-designed process often requires substantial human effort and cannot fully explore the whole architecture space.
2.2 Neural Architecture Search
Recently, neural architecture search (NAS) has been proposed to automate the process of architecture design [2, 63, 3, 50, 46]. Specifically, Zoph et al. [67] use a recurrent neural network as the controller to construct each convolution by determining the optimal stride as well as the number and shape of filters. Pham et al. propose a weight sharing technique [34] to significantly improve search efficiency. Liu et al. propose a differentiable NAS method, called DARTS [28], which relaxes the search space to be continuous. Recently, Luo et al. propose the Neural Architecture Optimization (NAO) [30] method to perform architecture search in a continuous space by exploiting an encoding-decoding technique. Unlike these methods, our method optimizes architectures without introducing extra computational cost (See comparisons in Fig. 1).
2.3 Architecture Adaptation and Model Compression
Several methods have been proposed to adapt architectures to some specific platform or to compress existing architectures. To obtain compact models, [58, 24, 9, 6] adapt architectures into more compact ones by learning the optimal settings of each convolution. One can also exploit model compression methods [25, 17, 29, 66] to remove the redundant channels. Recently, ESNAC [4] uses Bayesian optimization techniques to search for a compressed network via layer removal, layer shrinkage, and adding skip connections. APS [52] proposes an affine parameter sharing method to search for the optimal channel number of each layer. Nevertheless, these methods have to learn a compressed model for a specific architecture and have limited generalization ability across different architectures. Unlike these methods, we seek to learn a general optimizer for arbitrary architectures.
3 Neural Architecture Transformer
3.1 Problem Definition
Following [34, 28], we consider a cell as the basic block to build the entire network. Given a cell-based architecture space $\Omega$, we can represent an architecture $\alpha \in \Omega$ as a directed acyclic graph (DAG), i.e., $\alpha = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is a set of nodes that denote the feature maps in DNNs and $\mathcal{E}$ is an edge set [67, 34, 28], as shown in Fig. 2. Here, a DAG contains two input nodes that denote the outputs of two previous cells, several intermediate nodes, and an output node that concatenates all the intermediate nodes. Each intermediate node is able to connect with all previous nodes. A directed edge denotes some operation (e.g., convolution or max pooling) that transforms the feature map from one node to another. For convenience, we divide the edges in $\mathcal{E}$ into three categories, namely $S$, $N$, and $O$, as shown in Fig. 2. Here, $S$ denotes the skip connection, $N$ denotes the null connection (i.e., no edge between two nodes), and $O$ denotes the operations other than skip connection or null connection (e.g., convolution or max pooling). Note that different operations have different costs. Specifically, let $c(\cdot)$ be a function that evaluates the computational cost. Obviously, we have $c(N) < c(S) < c(O)$.

In this paper, we propose an architecture optimization method, called Neural Architecture Transformer (NAT), to optimize any given architecture to achieve better performance and/or less computational cost. To avoid introducing extra computational cost, an intuitive way is to replace an operation with a cheaper one, e.g., a skip or null connection. Although the skip connection has a slightly higher cost than the null connection, it often significantly improves the performance [15, 16]. Thus, we also enable the transition from null connection to skip connection to increase the representation ability of deep networks. In summary, we constrain the possible transitions among $O$, $S$, and $N$ in Fig. 2 in order to reduce the computational cost.
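To make the cost comparison concrete, the following minimal Python sketch encodes a cell as a dictionary of edges and totals an illustrative per-operation cost. The operation names and numeric costs are our own placeholders, not the paper's implementation; only the ordering $c(N) < c(S) < c(O)$ matters.

```python
# Illustrative sketch of a cell as a DAG: each edge carries one operation.
# The numeric costs are placeholders; the paper only assumes the ordering
# c(null) < c(skip) < c(other operations).
OP_COST = {"null": 0.0, "skip": 0.1, "max_pool_3x3": 1.0,
           "sep_conv_3x3": 2.0, "conv_3x3": 4.0}

def cell_cost(edges):
    """edges maps (src_node, dst_node) -> operation name."""
    return sum(OP_COST[op] for op in edges.values())

# A toy cell: nodes -2 and -1 are the outputs of the two previous cells.
cell = {(-2, 0): "conv_3x3", (-1, 0): "skip",
        (-2, 1): "null", (0, 1): "sep_conv_3x3"}
```

Replacing the `conv_3x3` edge with `skip` would lower `cell_cost(cell)`, which is exactly the kind of transition NAT is allowed to make.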
3.2 Markov Decision Process for NAT
In this paper, we seek to learn a general architecture optimizer which takes any given architecture as input and outputs the corresponding optimized architecture. Let $\beta$ be the input architecture, which follows some distribution $p(\cdot)$, e.g., a multivariate uniform discrete distribution. We seek to obtain the optimized architecture $\alpha$ by learning a mapping $\alpha = \mathrm{NAT}(\beta; \theta)$, where $\theta$ denotes the learnable parameters. Let $w_{\alpha}$ and $w_{\beta}$ be the well-learned model parameters of architectures $\alpha$ and $\beta$, respectively. We measure the performance of $\alpha$ and $\beta$ by some metric $\mathrm{Acc}(\alpha; w_{\alpha})$ and $\mathrm{Acc}(\beta; w_{\beta})$, e.g., accuracy. For convenience, we define the performance improvement between $\alpha$ and $\beta$ by $R(\alpha|\beta) = \mathrm{Acc}(\alpha; w_{\alpha}) - \mathrm{Acc}(\beta; w_{\beta})$. To illustrate our method, we first discuss the architecture optimization problem for a specific architecture and then generalize it to the problem for different architectures.
To learn a good architecture transformer to optimize a specific $\beta$, we can maximize the performance improvement $R(\alpha|\beta)$. However, simply maximizing $R(\alpha|\beta)$ may easily find an architecture $\alpha$ with much higher computational cost than the input counterpart $\beta$. Instead, we seek to obtain optimized architectures with better performance without introducing additional computational cost. To this end, we introduce a constraint $c(\alpha) \le c(\beta)$ to encourage the optimized architecture to have lower computational cost than the input one. Moreover, it is worth mentioning that directly obtaining the optimal $\alpha$ w.r.t. the input architecture is non-trivial [67]. Following [67, 34], we instead learn a policy $\pi(\cdot|\beta; \theta)$ and use it to produce an optimized architecture, i.e., $\alpha \sim \pi(\cdot|\beta; \theta)$. To learn the policy, we seek to solve the following optimization problem:

$$\max_{\theta} \; \mathbb{E}_{\alpha \sim \pi(\cdot|\beta; \theta)}\big[R(\alpha|\beta)\big], \quad \text{s.t.} \;\; c(\alpha) \le c(\beta), \tag{1}$$

where $\mathbb{E}_{\alpha \sim \pi(\cdot|\beta; \theta)}[\cdot]$ denotes the expectation operation over $\alpha \sim \pi(\cdot|\beta; \theta)$.
However, the optimization problem in Eqn. (1) only focuses on a single input architecture. To learn a general architecture transformer that is able to optimize any given architecture, we maximize the expectation of the performance improvement over the distribution of input architectures $p(\cdot)$. Formally, the expected performance improvement over different input architectures can be formulated as $\mathbb{E}_{\beta \sim p(\cdot)}\,\mathbb{E}_{\alpha \sim \pi(\cdot|\beta; \theta)}[R(\alpha|\beta)]$. Consequently, the optimization problem becomes

$$\max_{\theta} \; \mathbb{E}_{\beta \sim p(\cdot)}\,\mathbb{E}_{\alpha \sim \pi(\cdot|\beta; \theta)}\big[R(\alpha|\beta)\big], \quad \text{s.t.} \;\; c(\alpha) \le c(\beta). \tag{2}$$
Unlike conventional neural architecture search (NAS) methods that design/find an architecture from scratch [34, 28], we hope to optimize any given architecture by replacing redundant operations (e.g., convolution) in the input architecture with more efficient ones (e.g., skip connection). Since we only allow the transitions that do not increase the computational cost (also called valid transitions) in Fig. 2, the optimized architecture $\alpha$ would have less or at least the same computational cost as the input architecture $\beta$. Thus, the proposed method naturally satisfies the cost constraint $c(\alpha) \le c(\beta)$.
As mentioned above, our NAT only takes a single architecture as input to predict the optimized architecture. However, one may obtain a better optimized architecture by considering the previous success and failure records of optimizing other architectures. In this case, the optimization problem would become extremely complicated and hard to solve. To alleviate the training difficulty of the optimization problem, we formulate it as a Markov Decision Process (MDP). Specifically, we exploit the Markov property to optimize the current architecture without considering the previous optimization results (similar to the MDP formulation in the multi-armed bandit problem [51, 1]). In this way, the MDP formulation greatly simplifies the decision process. We put more discussions on our MDP formulation in the supplementary.
MDP formulation details. A typical MDP [42] is defined by a tuple $(\mathcal{S}, \mathcal{A}, P, R, q, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P$ is the state transition distribution, $R$ is the reward function, $q$ is the distribution of the initial state, and $\gamma$ is a discount factor. Here, we define an architecture as a state and a transformation mapping as an action, and we use the accuracy improvement on the validation set as the reward. Since the problem is a one-step MDP, we can omit the discount factor $\gamma$. Based on this definition, we transform any $\beta$ into an optimized architecture $\alpha$ with the policy $\pi(\cdot|\beta; \theta)$.
3.3 Policy Learning by Graph Convolutional Network
As mentioned in Section 3.2, NAT takes an architecture graph as input and outputs the optimization policy $\pi(\cdot|\beta; \theta)$. Since the optimization of an operation/edge in the architecture graph depends on the adjacent nodes and edges, we consider both the current edge and its neighbors when learning the optimal policy. Therefore, we build the controller model with a graph convolutional network (GCN) [21] to exploit the adjacency information of the operations in the architecture. Here, an architecture graph can be represented by a data pair $(\mathbf{A}, \mathbf{X})$, where $\mathbf{A}$ denotes the adjacency matrix of the graph and $\mathbf{X}$ denotes the attributes of the nodes together with their two input edges. We put more details in the supplementary.
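As a rough illustration of this controller, the pure-Python sketch below runs a two-layer GCN forward pass on toy dense matrices. All dimensions and weight values here are hypothetical; a practical implementation would use a deep learning framework.

```python
import math

def matmul(A, B):
    """Dense matrix product on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def relu(M):
    return [[max(0.0, x) for x in row] for row in M]

def softmax_rows(M):
    """Row-wise softmax (numerically stabilized)."""
    out = []
    for row in M:
        e = [math.exp(x - max(row)) for x in row]
        s = sum(e)
        out.append([x / s for x in e])
    return out

def gcn_controller(A, X, W0, W1, Wfc):
    """Two graph convolutions followed by a fully-connected classifier:
    Z = Softmax(sigma(A * sigma(A X W0) W1) Wfc)."""
    H = relu(matmul(matmul(A, X), W0))   # first graph convolution
    H = relu(matmul(matmul(A, H), W1))   # second graph convolution
    return softmax_rows(matmul(H, Wfc))  # per-edge transition distribution

# Toy example: 2 nodes with self-loops, identity features/weights, 3 classes.
A = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0]]
W0 = [[1.0, 0.0], [0.0, 1.0]]
W1 = [[1.0, 0.0], [0.0, 1.0]]
Wfc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
Z = gcn_controller(A, X, W0, W1, Wfc)
```

Each row of `Z` sums to one and plays the role of a per-edge distribution over candidate transitions.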
Note that a graph convolutional layer is able to extract features by aggregating the information from the neighbors of each node (i.e., one-hop neighbors) [55]. Nevertheless, building the model with too many graph convolutional layers (i.e., a high-order model) may introduce redundant information [65] and hamper the performance (See results in Fig. 7(a)). In practice, we build our NAT with a two-layer GCN, which can be formulated as

$$Z = \mathrm{Softmax}\big(\sigma(\mathbf{A}\,\sigma(\mathbf{A} \mathbf{X} W^{(0)}) W^{(1)}) W^{\mathrm{FC}}\big), \tag{3}$$

where $W^{(0)}$ and $W^{(1)}$ denote the weights of the two graph convolution layers, $W^{\mathrm{FC}}$ denotes the weight of the fully-connected layer, $\sigma(\cdot)$ is a non-linear activation function (e.g., ReLU [33]), $\mathrm{Softmax}(\cdot)$ denotes the softmax layer, and $Z$ refers to the probability distribution over the 3 transitions on the edges, i.e., “remaining unchanged”, “replacing with null connection”, and “replacing with skip connection”. It is worth mentioning that the controller model is essentially a 3-class GCN-based classifier. Given $|\mathcal{E}|$ edges in an architecture, NAT outputs the probability distribution $Z \in \mathbb{R}^{|\mathcal{E}| \times 3}$. For convenience, we denote $\theta$ as the parameters of the controller model of NAT.
3.4 Training Method for NAT
As shown in Fig. 3, given an architecture $\beta$ as input, NAT outputs the policy/distribution $\pi(\cdot|\beta; \theta)$ over the different candidate transitions. Based on $\pi$, we conduct sampling to obtain the optimized architecture $\alpha$. After that, we compute the reward $R(\alpha|\beta)$ to guide the search process. To learn NAT, we first update the supernet parameters $w$ and then update the architecture transformer parameters $\theta$ in each iteration. We show the detailed training procedure in Algorithm 1.
Training the parameters of the supernet $w$. Given any $\theta$, we need to update the supernet parameters $w$ based on the training data. To accelerate the training process, we adopt the parameter sharing technique [34], i.e., we use the shared parameters $w$ to represent the parameters for different architectures. For any architecture $\beta$, let $\mathcal{L}(\beta, w)$ be the loss function, e.g., the cross-entropy loss. Then, given $m$ sampled architectures $\{\beta_i\}_{i=1}^{m}$, the updating rule for $w$ with parameter sharing can be given by $w \leftarrow w - \eta \frac{1}{m} \sum_{i=1}^{m} \nabla_{w} \mathcal{L}(\beta_i, w)$, where $\eta$ is the learning rate.

Training the parameters of the controller model $\theta$. We train the transformer with reinforcement learning (i.e., policy gradient) [54] for several reasons. First, from Eqn. (2), there are no supervision signals (i.e., “ground-truth” better architectures) to train the model in a supervised manner. Second, the metrics of both accuracy and computational cost are non-differentiable. As a result, gradient-based methods cannot be directly used for training. To address these issues, we use reinforcement learning to train our model by maximizing the expected reward over the optimization results of different architectures. To encourage exploration, we add an entropy regularization term to the objective to prevent the transformer from converging to a local optimum too quickly [68], e.g., selecting the “original” option for all the operations. The objective can be formulated as
$$J(\theta) = \mathbb{E}_{\beta \sim p(\cdot)}\Big[\mathbb{E}_{\alpha \sim \pi(\cdot|\beta; \theta)}\big[R(\alpha|\beta)\big] + \lambda H\big(\pi(\cdot|\beta; \theta)\big)\Big], \tag{4}$$

where $p(\beta)$ is the probability of sampling some architecture $\beta$ from the distribution $p(\cdot)$, $\pi(\alpha|\beta; \theta)$ is the probability of sampling some architecture $\alpha$ from the distribution $\pi(\cdot|\beta; \theta)$, $H(\cdot)$ evaluates the entropy of the policy, and $\lambda$ controls the strength of the entropy regularization term. For each input architecture, we sample $n$ optimized architectures from the distribution $\pi(\cdot|\beta; \theta)$ in each iteration. Thus, the gradient of Eqn. (4) w.r.t. $\theta$ becomes

$$\nabla_{\theta} J(\theta) \approx \frac{1}{m} \sum_{j=1}^{m} \Big[ \frac{1}{n} \sum_{i=1}^{n} R(\alpha_{ij}|\beta_j)\, \nabla_{\theta} \log \pi(\alpha_{ij}|\beta_j; \theta) + \lambda\, \nabla_{\theta} H\big(\pi(\cdot|\beta_j; \theta)\big) \Big], \tag{5}$$

where $m$ is the number of sampled input architectures and $\alpha_{ij}$ is the $i$-th optimized architecture sampled for the input architecture $\beta_j$.
The regularization term encourages the distribution to have high entropy, i.e., high diversity in the decisions on the edges. Thus, the decisions for some operations would be encouraged to choose the “skip” or “null” operations during training. In this sense, NAT is able to explore the whole search space to find the optimal architecture.
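For a single categorical edge decision, the per-sample gradient contribution of the reward term and the entropy regularizer can be sketched as below. This is our illustrative derivation for one edge (not the paper's implementation), using the standard identities for a softmax policy: the gradient of the log-probability of the chosen action is one-hot(action) minus the probabilities, and the gradient of the entropy w.r.t. logit $i$ is $-p_i(\log p_i + H)$.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def reinforce_entropy_grad(logits, action, reward, lam=0.0):
    """Gradient of reward * log pi(action) + lam * H(pi) w.r.t. the logits of
    one categorical edge decision (a sketch of one term of the policy gradient)."""
    p = softmax(logits)
    # d/dz_i log pi(action) = 1[i == action] - p_i
    grad_logp = [(1.0 if i == action else 0.0) - pi for i, pi in enumerate(p)]
    # Entropy and its logit gradient: dH/dz_i = -p_i * (log p_i + H)
    H = -sum(pi * math.log(pi) for pi in p)
    grad_H = [-pi * (math.log(pi) + H) for pi in p]
    return [reward * g + lam * gh for g, gh in zip(grad_logp, grad_H)]
```

At a uniform policy the entropy gradient vanishes, so early in training the reward term dominates while the regularizer only acts once the policy starts concentrating.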
3.5 Inferring the Optimized Architectures
After the training process in Algorithm 1, we obtain the parameters $\theta$ of the architecture transformer model. Based on the NAT model, we take any given architecture $\beta$ as input and output the architecture optimization policy $\pi(\cdot|\beta; \theta)$. Then, we conduct sampling according to the learned policy to obtain the optimized architecture, i.e., $\alpha \sim \pi(\cdot|\beta; \theta)$. Specifically, we predict the optimal transition among the three candidate transitions (i.e., “remaining unchanged”, “replacing with null connection”, and “replacing with skip connection”) for each edge in the architecture graph. Note that the sampling method is not an iterative process and we perform sampling once for each operation/edge. We can also obtain the optimized architecture by selecting the transition with the maximum probability, which, however, tends to reach a local optimum and yields worse results than the sampling-based method (See results in supplementary).
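The inference step, i.e., drawing one transition per edge from the learned policy, can be sketched as follows (the transition names follow the three NAT options; the function name is ours):

```python
import random

def infer_optimized_architecture(edge_probs, rng=None):
    """Draw one transition per edge from the learned per-edge distributions.
    Sampling happens once per edge; the process is not iterative."""
    rng = rng or random.Random(0)  # fixed seed for a reproducible sketch
    options = ("unchanged", "null", "skip")
    return [rng.choices(options, weights=p, k=1)[0] for p in edge_probs]
```

An argmax variant would replace `rng.choices` with `max(zip(p, options))`, which corresponds to the maximum-probability decoding the paper reports as weaker than sampling.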
4 Neural Architecture Transformer++
As mentioned in Section 3, NAT replaces the redundant operations in $O$ with null connections or skip connections according to the transition scheme in Fig. 2. However, there are still several limitations of NAT. First, merely replacing an operation with a null or skip connection makes the search space very small and may hamper the performance of architecture optimization. Second, when we divide $O$ into more specific operations, the number of transitions between every two categories significantly increases. As a result, it is non-trivial to manually design valid transitions for each operation in NAT. Third, since operations may have different valid transitions that reduce the computational cost, it is hard to build a general GCN-based classifier to predict the optimal transitions for all the operations.
To address the above limitations, we further consider more possible operation transitions to enlarge the search space and develop more flexible operation transition rules. The proposed method is called Neural Architecture Transformer++ (NAT++), whose operation transition scheme is shown in Fig. 4. In NAT++, we propose a two-level transition rule which encourages operations to have more efficient types or smaller kernel sizes to produce more compact architectures. Note that different operations may have different valid transitions. To predict the optimal transitions for operations with different valid transitions, we propose a Binary-Masked Softmax (BM-Softmax) layer to build the NAT++ model. We depict NAT++ in the following.
4.1 Operation Transition Scheme for NAT++
Note that NAT [14] only considers three operation transitions, i.e., remaining unchanged, replacing with null connection, and replacing with skip connection. As a result, the search space may be very limited and may hamper the performance of architecture optimization. To consider a larger search space, we propose a two-level transition scheme which encourages operations to have more efficient types and/or smaller kernel sizes (See Fig. 4).
4.1.1 Two-level Transition Scheme
In NAT++, we consider a larger search space to enable more possible transitions for architecture optimization. Specifically, we allow the transitions among six operation types, namely standard convolution, separable convolution, dilated separable convolution, max/average pooling, skip connection, and null connection. For each convolution type, we consider three kernel sizes (we put the details about all the considered operations in the supplementary). To optimize both the type and the kernel size of operations, we design a type transition rule and a kernel transition rule, respectively.

Type Transition: We seek to reduce the computational cost by changing an operation into a more computationally efficient type. According to Fig. 4, the operation type is only allowed to transition along the direction of non-increasing cost, e.g., from a standard convolution towards a separable convolution, and ultimately towards a skip or null connection. Since max pooling has a similar computational cost to average pooling, we also enable the transition between max pooling and average pooling.

Kernel Transition: Given a specific operation type, one can also adjust the kernel size to change the operation. In general, a larger kernel induces higher computational cost. Thus, to make sure that all the transitions reduce the computational cost, we only allow the kernel size to stay the same or become smaller.
It is worth noting that using only one of the two rules cannot guarantee a reduction of the computational cost. Specifically, according to Fig. 4, if we only follow the rule on the operation type, there may still exist transitions that increase the computational cost by changing the operation type to a more efficient one while enlarging the kernel, e.g., replacing a small-kernel standard convolution with a separable convolution with a much larger kernel. Similarly, if we only reduce the kernel size, there may also exist transitions that introduce extra computational cost by changing the operation type to a more expensive one, e.g., replacing a separable convolution with a smaller-kernel standard convolution. Thus, in practice, we require all the transitions to satisfy both rules simultaneously to avoid increasing the computational cost. With the proposed two-level transition rule, unlike NAT, our NAT++ is able to automatically obtain the valid transitions for all the operations.
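Under the two-level rule, validity can be checked mechanically. The sketch below is our reading of Fig. 4: the type ranking is an assumption (ordered from most to least expensive), with max and average pooling sharing a rank so that the transition between them remains valid.

```python
# Assumed efficiency ranking, from most to least expensive operation type.
# Max and average pooling share a rank, so max <-> avg remains a valid move.
TYPE_RANK = {"conv": 0, "sep_conv": 1, "dil_sep_conv": 2,
             "max_pool": 3, "avg_pool": 3, "skip": 4, "null": 5}

def is_valid_transition(src_op, dst_op):
    """src_op/dst_op are (type, kernel_size) pairs. A transition is valid only
    when the type gets no more expensive AND the kernel gets no larger."""
    (t1, k1), (t2, k2) = src_op, dst_op
    return TYPE_RANK[t2] >= TYPE_RANK[t1] and k2 <= k1
```

Both counterexamples from the text are rejected: a cheaper type with a larger kernel fails the kernel test, and a smaller kernel with a more expensive type fails the type test.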
4.1.2 Search Space of NAT++
NAT++ allows more possible transitions than NAT and thus has a larger search space. Given a cell structure with $|\mathcal{E}|$ edges, we consider 13 operations/states in total (See more details in Fig. 4 and supplementary). For a given input architecture, the size of the largest search space of NAT++ is $13^{|\mathcal{E}|}$, which is much larger than the largest search space of NAT, whose size is $3^{|\mathcal{E}|}$. Therefore, NAT++ has the ability to find architectures with better performance and lower computational cost than NAT (See results in Section 5). Note that NAT++ also allows the transitions considered by NAT, i.e., remaining unchanged, replacing with null connection, and replacing with skip connection. Hence, the search space of NAT is a true subset of the search space of NAT++.
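The search-space sizes follow directly from the number of per-edge states; the cell size below (14 learnable edges, a common DARTS-style setup) is a hypothetical example, not a figure from the paper.

```python
def search_space_size(num_states, num_edges):
    """Upper bound on the search space: every edge may take any candidate state."""
    return num_states ** num_edges

# Hypothetical cell with 14 learnable edges (a common DARTS-style setting):
nat_space = search_space_size(3, 14)     # NAT: 3 states per edge
natpp_space = search_space_size(13, 14)  # NAT++: 13 states per edge
```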
4.1.3 Complexity Analysis of Different Operations
Note that our NAT and NAT++ seek to replace operations with more efficient ones to avoid introducing additional computational cost. To determine which operations are more efficient, we compare the computational cost of different operations in terms of the number of multiply-adds (MAdds) and the number of parameters.
In Fig. 4, we sort the operations according to the number of parameters and MAdds in descending order. From Fig. 4, we draw the following observations. First, given a fixed kernel size, different operation types have different computational cost. Specifically, separable and dilated separable convolution have lower computational cost than the standard convolution. The max/average pooling, skip connection, and null connection have less or even no computational cost. Second, when we fix the operation type, the kernel size is also an important factor that affects the computational cost of operations. In general, a smaller kernel tends to have a lower computational cost.
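The gap between operation types can be made concrete with the usual MAdds formulas. The sketch below compares a standard convolution with a depthwise-separable one (a single depthwise pass plus a pointwise projection, which is a simplification of the stacked variant used in some search spaces); it is illustrative, not the paper's exact accounting.

```python
def conv_madds(k, c_in, c_out, h, w):
    """Multiply-adds of a standard k x k convolution on an h x w feature map."""
    return k * k * c_in * c_out * h * w

def sep_conv_madds(k, c_in, c_out, h, w):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise projection."""
    return k * k * c_in * h * w + c_in * c_out * h * w
```

For typical channel counts the separable variant is far cheaper, and shrinking `k` reduces the cost of either type, matching both observations above.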
4.2 Policy Learning for NAT++
To learn the optimal policy for NAT++, we also use a GCN-based classifier to predict the optimal transition for each operation/edge. However, it is hard to directly apply the GCN-based classifier of NAT to predict the optimal transitions for operations with different valid transitions. Note that, in NAT, all the operations share the same valid transitions, i.e., remaining unchanged, replacing with null connection, and replacing with skip connection. In NAT++, however, each operation has its own valid transitions, and these transitions directly determine the classes considered by the GCN-based classifier. As a result, we would have to design a separate GCN classifier for each operation, which is very expensive in practice.
To address this issue, we make the following changes to build the GCN model of NAT++. First, we increase the number of output channels of the final FC layer to match all the considered operations. In this way, NAT++ is able to consider more possible transitions than NAT. Second, according to the transition scheme in Fig. 4, we replace the standard softmax layer in Eqn. (3) with a Binary-Masked Softmax (BM-Softmax) layer to omit all the invalid transitions that violate the two-level transition rule. Specifically, we represent the transitions for each operation as a binary mask $\mathbf{m} \in \{0, 1\}^{K}$ (1 for valid transitions and 0 for invalid transitions), where $K$ is the number of considered transitions. To omit the invalid transitions, NAT++ only computes the probabilities of the valid transitions and leaves the probabilities of the invalid ones at zero. Let $\mathbf{z}$ be the logits predicted by NAT++ over the $K$ transitions. We compute the probability of the $i$-th transition by

$$p_i = \frac{m_i \exp(z_i)}{\sum_{j=1}^{K} m_j \exp(z_j)}. \tag{6}$$
Based on BM-Softmax, NAT++ is able to determine the optimal transition for operations with different valid transitions.
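Eqn. (6) amounts to a softmax restricted to the valid entries; a minimal sketch:

```python
import math

def bm_softmax(logits, mask):
    """Softmax over valid transitions only; masked-out (invalid) transitions
    receive probability exactly zero, matching Eqn. (6)."""
    exps = [m * math.exp(z) for z, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]
```

In practice one would subtract the maximum logit among the valid entries before exponentiating for numerical stability; the plain form above mirrors the equation directly.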
4.3 Possible Bias Risk of NAT and NAT++
As shown in Figs. 2 and 4, both NAT and NAT++ seek to replace redundant operations with skip connections when optimizing architectures. However, the architectures with more skip connections tend to converge faster than other architectures [59, 7]. As a result, the competition between skip connections and other operations may easily become unfair [8] and mislead the search process. Consequently, the NAS methods may incur a bias towards those architectures which converge faster but may yield poor generalization performance [44, 59, 7]. More analysis on the bias issue can be found in supplementary.
To address the bias issue, Zhou et al. introduce a binary gate for each operation and propose a path-depth-wise regularization method to encourage the gates along the long paths in the supernet [64]. Such a regularization forces NAS methods to explore architectures with slow convergence speed. It is worth mentioning that NAT and NAT++ can alleviate the bias issue without the need for such a complex regularization. As shown in Algorithm 1, unlike ENAS [34] and DARTS [28], we decouple the supernet training from the architecture search by sampling architectures from a uniform distribution rather than the learned policy $\pi$. Since all the operations have the same probability of being sampled, we provide an equal opportunity to train architectures with different operations. In this sense, we can alleviate the possible bias issue (See results in Section 6.5). More critically, our methods are able to find better architectures than the architecture searched by [64] on ImageNet (See Table III).
CIFAR  ImageNet
TABLE I: Comparison of different methods on handcrafted architectures (CIFAR-10/100 panel followed by ImageNet panel).

| Model | Method | #Params (M) | #MAdds (M) | CIFAR-10 Acc. (%) | CIFAR-100 Acc. (%) |
|---|---|---|---|---|---|
| VGG16 | / | 15.2 | 313 | 93.56 | 71.83 |
| | NAO [30] | 19.5 | 548 | 95.72 | 74.67 |
| | ESNAC [4] | 14.6 | 295 | 95.26 | 74.43 |
| | APS [52] | 15.0 | 305 | 95.53 | 74.79 |
| | NAT | 15.2 | 315 | 96.04 | 75.02 |
| | NAT++ | 14.4 | 301 | 96.16 | 75.23 |
| ResNet20 | / | 0.3 | 41 | 91.37 | 68.88 |
| | NAO [30] | 0.4 | 61 | 92.44 | 71.22 |
| | ESNAC [4] | 0.3 | 40 | 92.87 | 71.58 |
| | APS [52] | 0.3 | 42 | 93.14 | 71.84 |
| | NAT | 0.3 | 42 | 93.05 | 71.67 |
| | NAT++ | 0.3 | 39 | 93.23 | 71.97 |
| ResNet56 | / | 0.9 | 127 | 93.21 | 71.54 |
| | NAO [30] | 1.3 | 199 | 95.27 | 74.25 |
| | ESNAC [4] | 0.8 | 125 | 95.33 | 74.30 |
| | APS [52] | 0.8 | 123 | 94.54 | 73.58 |
| | NAT | 0.9 | 129 | 95.40 | 74.33 |
| | NAT++ | 0.8 | 124 | 95.47 | 74.41 |
| ShuffleNet | / | 0.9 | 161 | 92.29 | 71.14 |
| | NAO [30] | 1.4 | 251 | 93.16 | 72.04 |
| | ESNAC [4] | 0.8 | 153 | 93.21 | 72.14 |
| | APS [52] | 0.9 | 161 | 93.47 | 72.40 |
| | NAT | 0.8 | 158 | 93.37 | 72.34 |
| | NAT++ | 0.7 | 147 | 93.53 | 72.61 |
| MobileNetV2 | / | 2.3 | 91 | 94.47 | 73.66 |
| | NAO [30] | 2.9 | 131 | 94.75 | 73.79 |
| | ESNAC [4] | 2.1 | 84 | 94.87 | 73.94 |
| | APS [52] | 2.3 | 90 | 95.03 | 74.14 |
| | NAT/NAT++ | 2.3 | 92 | 95.17 | 74.22 |

| Model | Method | #Params (M) | #MAdds (M) | Top-1 Acc. (%) | Top-5 Acc. (%) |
|---|---|---|---|---|---|
| VGG16 | / | 138 | 15620 | 71.6 | 90.4 |
| | NAO [30] | 148 | 18896 | 72.9 | 91.3 |
| | ESNAC [4] | 133 | 14523 | 73.6 | 91.5 |
| | APS [52] | 137 | 15220 | 73.9 | 91.7 |
| | NAT | 138 | 15693 | 74.3 | 92.0 |
| | NAT++ | 131 | 14907 | 74.7 | 92.2 |
| ResNet18 | / | 11.7 | 1580 | 69.8 | 89.1 |
| | NAO [30] | 17.9 | 2246 | 70.8 | 89.7 |
| | ESNAC [4] | 11.2 | 1544 | 71.0 | 89.9 |
| | APS [52] | 11.2 | 1547 | 70.9 | 90.0 |
| | NAT | 11.7 | 1588 | 71.1 | 90.0 |
| | NAT++ | 11.0 | 1516 | 71.3 | 90.2 |
| ResNet50 | / | 25.6 | 3530 | 76.2 | 92.9 |
| | NAO [30] | 34.8 | 4505 | 77.4 | 93.2 |
| | ESNAC [4] | 25.0 | 3484 | 77.4 | 93.3 |
| | APS [52] | 24.9 | 3461 | 77.6 | 93.4 |
| | NAT | 25.6 | 3547 | 77.7 | 93.5 |
| | NAT++ | 24.8 | 3452 | 77.8 | 93.6 |
| ShuffleNet | / | 2.4 | 138 | 68.0 | 86.4 |
| | NAO [30] | 3.5 | 217 | 68.2 | 86.5 |
| | ESNAC [4] | 2.2 | 131 | 68.4 | 86.6 |
| | APS [52] | 2.4 | 138 | 68.9 | 87.0 |
| | NAT | 2.3 | 136 | 68.7 | 86.8 |
| | NAT++ | 2.1 | 125 | 68.8 | 87.0 |
| MobileNetV2 | / | 3.4 | 300 | 72.0 | 90.3 |
| | NAO [30] | 4.5 | 513 | 72.2 | 90.6 |
| | ESNAC [4] | 3.1 | 277 | 72.4 | 90.8 |
| | APS [52] | 3.4 | 303 | 72.3 | 90.6 |
| | NAT/NAT++ | 3.4 | 302 | 72.5 | 91.0 |
TABLE II: Comparison of different methods on face recognition datasets (LFW, CFP-FP, and AgeDB-30).

| Model | Method | #Params (M) | #MAdds (M) | LFW Acc. (%) | CFP-FP Acc. (%) | AgeDB-30 Acc. (%) |
|---|---|---|---|---|---|---|
| LResNet34E-IR [10] | / | 31.8 | 7104 | 99.72 | 96.39 | 98.03 |
| | NAO [30] | 43.7 | 9874 | 99.73 | 96.41 | 98.07 |
| | ESNAC [4] | 31.7 | 7002 | 99.77 | 96.52 | 98.19 |
| | APS [52] | 31.6 | 6997 | 99.80 | 96.64 | 98.30 |
| | NAT | 31.8 | 7107 | 99.79 | 96.66 | 98.28 |
| | NAT++ | 31.5 | 7023 | 99.83 | 96.72 | 98.35 |
| MobileFaceNet [5] | / | 1.0 | 441 | 99.50 | 92.23 | 95.63 |
| | NAO [30] | 1.3 | 584 | 99.53 | 92.28 | 95.75 |
| | ESNAC [4] | 0.9 | 408 | 99.59 | 92.37 | 95.98 |
| | APS [52] | 1.0 | 437 | 99.63 | 92.41 | 96.13 |
| | NAT/NAT++ | 1.0 | 443 | 99.76 | 92.50 | 96.36 |
5 Experiments
We apply our method to optimize several well-designed architectures, including handcrafted architectures and NAS-based architectures. We have released the code for both NAT (https://github.com/guoyongcs/NAT) and NAT++ (https://github.com/guoyongcs/NATv2).
5.1 Implementation Details
TABLE III: Comparison of different methods on NAS-based architectures (CIFAR-10/100 panel followed by ImageNet panel).

| Model | Method | #Params (M) | #MAdds (M) | CIFAR-10 Acc. (%) | CIFAR-100 Acc. (%) |
|---|---|---|---|---|---|
| AmoebaNet [37] | / | 3.2 | – | 96.73 | – |
| PNAS [27] | / | 3.2 | – | 96.67 | 81.13 |
| SNAS [56] | / | 2.9 | – | 97.08 | 82.47 |
| GHN [60] | / | 5.7 | – | 97.22 | – |
| PR-DARTS [64] | / | 3.4 | – | 97.68 | 83.55 |
| ENAS [34] | / | 4.6 | 804 | 97.11 | 82.87 |
| | NAO [30] | 4.5 | 763 | 97.05 | 82.57 |
| | ESNAC [4] | 4.1 | 717 | 97.13 | 83.15 |
| | APS [52] | 4.4 | 744 | 97.26 | 83.45 |
| | NAT | 4.6 | 804 | 97.24 | 83.43 |
| | NAT++ | 3.7 | 580 | 97.31 | 83.51 |
| DARTS [28] | / | 3.3 | 533 | 97.06 | 83.03 |
| | NAO [30] | 3.5 | 577 | 97.09 | 83.12 |
| | ESNAC [4] | 2.8 | 457 | 97.21 | 83.36 |
| | APS [52] | 3.2 | 515 | 97.25 | 83.44 |
| | NAT | 2.7 | 424 | 97.28 | 83.49 |
| | NAT++ | 2.5 | 395 | 97.30 | 83.56 |
| NAONet [30] | / | 128 | 66016 | 97.89 | 84.33 |
| | NAO [30] | 143 | 73705 | 97.91 | 84.42 |
| | ESNAC [4] | 107 | 55187 | 97.98 | 84.49 |
| | APS [52] | 125 | 63468 | 97.96 | 84.47 |
| | NAT | 113 | 58326 | 98.01 | 84.53 |
| | NAT++ | 101 | 51976 | 98.07 | 84.60 |
| PC-DARTS [57] | / | 3.6 | 570 | 97.43 | 84.21 |
| | NAO [30] | 4.7 | 725 | 97.49 | 84.30 |
| | ESNAC [4] | 3.3 | 503 | 97.44 | 84.20 |
| | APS [52] | 3.4 | 529 | 97.47 | 84.28 |
| | NAT | 3.4 | 518 | 97.51 | 84.31 |
| | NAT++ | 3.3 | 512 | 97.57 | 84.37 |

| Model | Method | #Params (M) | #MAdds (M) | Top-1 Acc. (%) | Top-5 Acc. (%) |
|---|---|---|---|---|---|
| AmoebaNet [37] | / | 5.1 | 555 | 74.5 | 92.0 |
| PNAS [27] | / | 5.1 | 588 | 74.2 | 91.9 |
| SNAS [56] | / | 4.3 | 522 | 72.7 | 90.8 |
| GHN [60] | / | 6.1 | 569 | 73.0 | 91.3 |
| PR-DARTS [64] | / | 5.0 | 543 | 75.9 | 92.7 |
| ENAS [34] | / | 5.6 | 607 | 73.8 | 91.7 |
| | NAO [30] | 5.5 | 589 | 73.7 | 91.7 |
| | ESNAC [4] | 5.0 | 542 | 73.5 | 91.4 |
| | APS [52] | 5.5 | 591 | 74.0 | 91.9 |
| | NAT | 5.6 | 607 | 73.9 | 91.8 |
| | NAT++ | 5.4 | 582 | 74.3 | 92.1 |
| DARTS [28] | / | 4.7 | 574 | 73.1 | 91.0 |
| | NAO [30] | 5.0 | 621 | 73.3 | 91.1 |
| | ESNAC [4] | 4.0 | 494 | 73.5 | 91.2 |
| | APS [52] | 4.5 | 539 | 73.3 | 91.2 |
| | NAT | 4.0 | 441 | 73.7 | 91.4 |
| | NAT++ | 3.8 | 413 | 73.9 | 91.5 |
| NAONet [30] | / | 11.3 | 1360 | 74.3 | 91.8 |
| | NAO [30] | 11.8 | 1417 | 74.5 | 92.0 |
| | ESNAC [4] | 9.5 | 1139 | 74.6 | 92.1 |
| | APS [52] | 11.0 | 1286 | 74.5 | 92.1 |
| | NAT | 8.4 | 1025 | 74.8 | 92.3 |
| | NAT++ | 8.1 | 992 | 75.0 | 92.5 |
| PC-DARTS [57] | / | 5.3 | 597 | 75.8 | 92.7 |
| | NAO [30] | 6.7 | 706 | 76.0 | 92.8 |
| | ESNAC [4] | 4.7 | 529 | 75.9 | 92.7 |
| | APS [52] | 5.0 | 557 | 76.0 | 92.7 |
| | NAT | 4.9 | 546 | 76.1 | 92.8 |
| | NAT++ | 4.8 | 540 | 76.3 | 93.0 |
We build the supernet by stacking 8 cells with an initial channel number of 20, and train the transformer for 200 epochs. Following the settings of [28], we set the hyperparameters in Eqn. (5). To cover all possible architectures, we set the sampling distribution to be uniform. To evaluate the resultant networks, we replace the original cells with the optimized cells and train the models from scratch. For all the considered architectures, we follow the settings of the original papers, i.e., we build the models with the same number of layers and channels as the original ones. We only apply cutout to the NAS-based architectures on CIFAR.

5.2 Results on Handcrafted Architectures
In this experiment, we apply both NAT and NAT++ to four popular handcrafted architectures, namely VGG [45], ResNet [15], ShuffleNet [61], and MobileNetV2 [40]. To make all architectures share the graph representation defined in Section 3.2, we add null connections into the handcrafted architectures so that each node has exactly two input nodes (See Fig. 5). Note that each handcrafted architecture may have multiple graph representations; however, our methods yield stable results across different representations (See results in the supplementary).
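The padding step above can be sketched as follows. This is a minimal illustration: the dictionary-based cell representation and node names are our own assumptions, not the paper's released data structure.

```python
# Each cell is a DAG: node -> list of (input_node, operation) pairs.
# To give every node exactly two inputs, we pad with "null" connections,
# which carry no signal and exist only to unify the graph format.
def pad_with_null_connections(cell):
    """Return a copy of the cell in which every node has exactly two inputs."""
    padded = {}
    for node, inputs in cell.items():
        inputs = list(inputs)
        while len(inputs) < 2:
            inputs.append((None, "null"))  # null connection from a dummy input
        padded[node] = inputs
    return padded

# A chain-style handcrafted block (e.g., two stacked convolutions) has one
# real input per node, so each node gains exactly one null connection.
chain = {"n1": [("in", "conv_3x3")], "n2": [("n1", "conv_3x3")]}
padded = pad_with_null_connections(chain)
assert all(len(v) == 2 for v in padded.values())
```

After padding, every node has the same in-degree, so handcrafted and cell-based NAS architectures can be processed by the same transformer.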
5.2.1 Quantitative Results
From Table I, our NAT-based models consistently outperform the original models by a large margin with approximately the same computational cost. Compared to NAT, NAT++ produces better-optimized architectures with higher accuracy and lower computational cost. These results show that, by enlarging the search space, NAT++ further improves architecture optimization. Moreover, compared to existing methods (i.e., NAO, ESNAC, and APS), NAT++ produces architectures with higher accuracy and lower computational cost. Note that NAT and NAT++ yield the same results when optimizing MobileNetV2. The main reason is that the operations in MobileNetV2 are either conv_ or sep_conv_, which are already very efficient; with very few valid operation transitions, there is little for the extended transition scheme of NAT++ to exploit.
We also evaluate our method on face recognition tasks. In this experiment, we consider three benchmark datasets (i.e., LFW [19], CFP-FP [43], and AgeDB-30 [32]) and two baselines (i.e., LResNet34E-IR [10] and MobileFaceNet [5]). We adopt the same settings as [10]; more training details can be found in the supplementary. From Table II, the models optimized by NAT consistently outperform the original models without introducing extra computational cost. Moreover, NAT++ yields the best optimization results for both architectures on all datasets.
5.2.2 Visualization of the Optimized Architectures
In this section, we visualize the original and optimized handcrafted architectures in Fig. 5. As shown, NAT improves the architecture design by introducing additional skip connections. Unlike NAT, NAT++ conducts architecture optimization in a finer-grained manner: it replaces some standard convolutions with separable convolutions in VGG and ResNet. In this way, NAT++ not only reduces the number of parameters and the computational cost but also further improves performance (See Table I).
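As a rough sanity check on why this substitution reduces cost, one can compare the parameter counts of a standard convolution and a depthwise-separable convolution. This is a generic calculation (bias terms omitted); the exact separable operations used in the paper may differ in detail.

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def sep_conv_params(c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    return c_in * k * k + c_in * c_out

# Example: a 128 -> 128 channel layer with a 3 x 3 kernel.
std = conv_params(128, 128, 3)      # 128 * 128 * 9  = 147456
sep = sep_conv_params(128, 128, 3)  # 128 * 9 + 128 * 128 = 17536
assert sep < std / 8  # roughly an order of magnitude fewer parameters
```

The same factorization also shrinks the multiply-add count, which is consistent with the lower #MAdds of the NAT++ models in Table I.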
5.3 Results on NAS-Based Architectures
We also apply the proposed methods to automatically searched architectures. In this experiment, we consider four state-of-the-art NAS-based architectures, namely DARTS [28], ENAS [34], NAONet [30], and PC-DARTS [57]. Moreover, we compare our optimized architectures with other NAS-based architectures, including AmoebaNet [37], PNAS [27], SNAS [56], GHN [60], and PR-DARTS [64].
From Table III, given different input architectures, the architectures obtained by NAT consistently yield higher accuracy than both their original counterparts and the architectures optimized by existing methods. For example, given DARTS as the input, NAT reduces the number of parameters by 15% and the computational cost by 23% while achieving a 0.6% improvement in Top-1 accuracy on ImageNet. For NAONet, NAT reduces both the parameters and the computational cost by approximately 25% and achieves a 0.5% improvement in Top-1 accuracy. Moreover, we also evaluate the architectures optimized by NAT++. As shown in Table III, equipped with the extended transition scheme, NAT++ finds architectures with higher accuracy and lower computational cost than those found by NAT and existing methods. Due to the page limit, we show the visualizations of the optimized architectures in the supplementary. These results demonstrate the effectiveness of the proposed method.
6 Further Experiments
6.1 Results on Randomly Sampled Architectures
We apply NAT and NAT++ to 20 architectures randomly sampled from the whole architecture space. We train all architectures using SGD with momentum and a batch size of 128 for 600 epochs. From Table IV and Fig. 6, the architectures optimized by NAT surpass the original ones in terms of both accuracy and computational cost. Moreover, equipped with the two-level transition scheme, NAT++ further improves the architecture optimization results. To better illustrate this, we show the result for each architecture in Fig. 6: the models optimized by NAT++ achieve higher accuracy with fewer parameters than those optimized by NAT. In this sense, our method generalizes well across a wide range of architectures, making it applicable in real-world scenarios.
TABLE IV: Comparison of the original and optimized randomly sampled architectures (mean ± standard deviation over 20 architectures).

| Method | Original | NAT | NAT++ |
|---|---|---|---|
| #Params (M) | 6.40 ± 2.04 | 4.67 ± 1.36 | 3.66 ± 1.23 |
| #MAdds (G) | 1.07 ± 0.32 | 0.79 ± 0.21 | 0.52 ± 0.20 |
| Test accuracy (%) | 95.83 ± 1.08 | 96.56 ± 0.47 | 96.79 ± 0.32 |
6.2 Effect of the Number of Layers in GCN
We investigate the effect of the number of layers in the GCN on the performance of our method. Specifically, we apply both NAT and NAT++ to optimize 20 randomly sampled architectures, using four GCN models with 1, 2, 5, and 10 layers, respectively. Note that a graph convolutional layer extracts features by aggregating information from the immediate (i.e., one-hop) neighbors of each node [55]. A GCN with multiple layers is thus able to exploit information from multi-hop neighbors in a graph [26, 11].
From Fig. 7(a), a single-layer GCN yields very poor performance since it cannot exploit information from nodes more than one hop away. However, building the GCN with 5 or 10 layers also hampers performance, since models with too many graph convolutional layers (i.e., high-order models) may introduce redundant information [65]. To learn a good policy, we therefore build a two-layer GCN in practice.
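The aggregation behavior described above can be sketched as a tiny two-layer GCN forward pass. All shapes are illustrative, and the adjacency normalization of [21] is omitted for brevity; the point is only that stacking two layers mixes information from two-hop neighbors.

```python
import numpy as np

def gcn_layer(A_hat, H, W):
    """One graph convolution: aggregate one-hop neighbors via A_hat,
    transform with W, then apply a ReLU. A_hat includes self-loops."""
    return np.maximum(A_hat @ H @ W, 0.0)

rng = np.random.default_rng(0)
n_nodes, d_in, d_hid, d_out = 6, 8, 16, 4
A = (rng.random((n_nodes, n_nodes)) < 0.3).astype(float)  # random sparse graph
A_hat = A + np.eye(n_nodes)  # add self-loops (degree normalization omitted)

H0 = rng.standard_normal((n_nodes, d_in))   # initial node features
W1 = rng.standard_normal((d_in, d_hid))
W2 = rng.standard_normal((d_hid, d_out))

# Two stacked layers: the second layer already sees two-hop information.
H1 = gcn_layer(A_hat, H0, W1)
H2 = gcn_layer(A_hat, H1, W2)
assert H2.shape == (n_nodes, d_out)
```

A one-layer model stops at `H1` (one-hop only), which matches the poor single-layer result in Fig. 7(a), while stacking many more layers mainly re-aggregates already-seen information.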
6.3 Effect of the Trade-off Parameter in Eqn. (4)
In this part, we investigate the effect of the trade-off parameter in Eqn. (4), which balances the reward and the entropy term, on the performance of architecture optimization. We train NAT and NAT++ with different values of this parameter and report the average accuracy over the optimization results of 20 randomly sampled architectures. From Fig. 7(b), as the parameter increases, the entropy term gradually becomes more important and encourages the model to explore the search space. In this way, it prevents the model from converging to a local optimum and helps find better-optimized architectures. However, if the parameter becomes too large, the entropy term overwhelms the objective function and hampers performance: with a very large value, the search process becomes approximately the same as random search and yields architectures even worse than the original counterparts. In practice, we therefore choose a moderate value.
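The two regimes above can be made concrete with a toy objective that combines an expected reward with an entropy bonus over a categorical policy. This is our own illustrative formulation, and `lam` is a stand-in name for the trade-off parameter in Eqn. (4), whose symbol and exact form we do not reproduce here.

```python
import numpy as np

def entropy_regularized_objective(log_probs, rewards, lam):
    """Expected reward plus lam times the policy entropy.
    lam = 0 recovers pure reward maximization; a huge lam makes the
    objective ignore the reward, degenerating toward random search."""
    probs = np.exp(log_probs)
    expected_reward = float(np.sum(probs * rewards))
    entropy = float(-np.sum(probs * log_probs))
    return expected_reward + lam * entropy

probs = np.array([0.7, 0.2, 0.1])
log_probs = np.log(probs)
rewards = np.array([1.0, 0.2, 0.1])

small = entropy_regularized_objective(log_probs, rewards, lam=0.003)
large = entropy_regularized_objective(log_probs, rewards, lam=100.0)
# With a huge lam, the entropy term dominates the objective entirely.
assert large > 10 * small
```

With a small `lam`, the objective tracks the reward and the entropy term only discourages premature collapse; with a huge `lam`, reward differences between architectures become irrelevant, matching the degradation reported in Fig. 7(b).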
TABLE V: Effect of the number of sampled input architectures m (top) and the number of sampled optimized architectures n (bottom) in Eqn. (5). Accuracy is reported as mean ± standard deviation over 20 randomly sampled architectures.

| m | 1 | 5 | 10 | 30 |
|---|---|---|---|---|
| NAT accuracy (%) | 96.56 ± 0.47 | 96.58 ± 0.41 | 96.61 ± 0.33 | 96.59 ± 0.37 |
| NAT search cost | 5.3 | 19.9 | 38.9 | 122.7 |
| NAT++ accuracy (%) | 96.79 ± 0.32 | 96.80 ± 0.37 | 96.83 ± 0.29 | 96.81 ± 0.34 |
| NAT++ search cost | 5.7 | 20.8 | 40.4 | 114.3 |

| n | 1 | 5 | 10 | 30 |
|---|---|---|---|---|
| NAT accuracy (%) | 96.56 ± 0.47 | 96.59 ± 0.41 | 96.57 ± 0.43 | 96.58 ± 0.39 |
| NAT search cost | 5.3 | 17.1 | 33.3 | 82.2 |
| NAT++ accuracy (%) | 96.79 ± 0.32 | 96.80 ± 0.35 | 96.82 ± 0.35 | 96.84 ± 0.37 |
| NAT++ search cost | 5.7 | 18.2 | 35.1 | 86.7 |
6.4 Effect of m and n in Eqn. (5)
In this section, we investigate the effect of the hyperparameters m and n on the performance of our method. When we gradually increase m during training, more input architectures have to be evaluated via additional forward propagations through the supernet to compute the reward, so the search cost increases significantly with m. From Table V, we do not observe an obvious performance improvement for large m. One possible reason is that, under the uniform sampling distribution, even sampling one architecture in each iteration provides sufficient diversity of input architectures to train our model. Thus, we set m = 1 in practice.
We also investigate the effect of the hyperparameter n, which controls the number of sampled optimized architectures for each input architecture. With a large n, we have to evaluate more optimized architectures to compute the reward in each iteration, again significantly increasing the search cost. As shown in Table V, similar to m, our model achieves only marginal performance improvement as n increases. In practice, n = 1 works well for both NAT and NAT++. The main reason is that most of the sampled architectures can be very similar under a fixed policy/distribution; as a result, increasing the number of sampled optimized architectures provides limited benefit for training. A similar phenomenon is also observed in ENAS [34].
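The roles of m and n can be sketched as a nested Monte Carlo estimate of the training reward. The function names below are illustrative stand-ins for the paper's components (input-architecture sampler, policy, supernet evaluator), not the released implementation.

```python
def estimate_policy_reward(sample_input_arch, optimize, evaluate, m=1, n=1):
    """Estimate the reward by drawing m input architectures and, for each,
    n optimized architectures from the current policy. Each evaluate() call
    costs one forward pass through the supernet, so cost grows with m * n."""
    total = 0.0
    for _ in range(m):
        beta = sample_input_arch()      # input architecture
        for _ in range(n):
            alpha = optimize(beta)      # optimized architecture from the policy
            total += evaluate(alpha)    # e.g., validation accuracy via supernet
    return total / (m * n)

# Toy check: with a deterministic evaluator the estimate is exact for any m, n,
# which mirrors why larger m and n buy little here beyond extra cost.
r = estimate_policy_reward(lambda: "arch", lambda b: b, lambda a: 0.9, m=3, n=5)
assert abs(r - 0.9) < 1e-9
```

Since the estimator's cost is proportional to m * n while its variance reduction saturates quickly for near-deterministic rewards, m = n = 1 is the natural operating point suggested by Table V.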
6.5 Discussions on the Possible Bias Risk
In this section, we investigate whether our methods suffer from a possible bias towards architectures that converge fast in the early stage but generalize poorly. In this experiment, we randomly collect a set of architectures and use NAT and NAT++ to optimize them. Then, we compare the convergence curves of the original and optimized architectures on CIFAR-10. From Fig. 8, some of the original architectures exhibit the issue of "fast convergence in the early stage but poor generalization performance", e.g., Arch2 and Arch4. In contrast, all of the architectures optimized by NAT and NAT++ have a relatively stable convergence speed and yield better generalization performance than their original counterparts. These results indicate that the bias problem is not obvious in our methods. The main reason is that, in NAT and NAT++, all operations have the same probability of being sampled, so architectures with different operations receive an equal opportunity to be trained. In this sense, we alleviate the overly fast convergence issue incurred by skip connections. Due to the page limit, we put the convergence curves of more architectures in the supplementary.
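The uniform sampling strategy that underlies this equal-opportunity training can be sketched as follows. The operation set here is an assumption for illustration and need not match the paper's exact search space.

```python
import random

# Candidate operations per edge (an assumed set for illustration).
OPS = ["null", "skip_connect", "conv_3x3", "sep_conv_3x3",
       "avg_pool_3x3", "max_pool_3x3"]

def sample_uniform_architecture(num_edges, rng):
    """Every operation is equally likely on every edge, so operations with
    fast early convergence (e.g., skip connections) receive no extra
    supernet training compared to the other operations."""
    return [rng.choice(OPS) for _ in range(num_edges)]

rng = random.Random(0)
counts = {op: 0 for op in OPS}
for _ in range(6000):
    for op in sample_uniform_architecture(num_edges=14, rng=rng):
        counts[op] += 1

# Empirical frequencies are close to uniform (1/6 each over 84000 draws).
freqs = [c / (6000 * 14) for c in counts.values()]
assert all(abs(f - 1 / len(OPS)) < 0.01 for f in freqs)
```

Sampling from a learned policy instead would progressively concentrate training on whichever operations look good early, which is exactly the bias this design avoids.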
7 Conclusion
In this paper, we have proposed a novel Neural Architecture Transformer (NAT) for the task of architecture optimization, which seeks to replace existing operations with more computationally efficient ones. Specifically, NAT replaces redundant or insignificant operations with skip connections or null connections. Moreover, we design an advanced NAT++ that further enlarges the search space: a two-level transition rule encourages operations to transition to a more efficient type or a smaller kernel size, producing more compact architectures. To verify the proposed methods, we apply NAT and NAT++ to optimize both handcrafted architectures and Neural Architecture Search (NAS) based architectures. Extensive experiments on several benchmark datasets demonstrate their effectiveness in improving both the accuracy and the compactness of neural architectures.
References
 [1] (1987) Asymptotically efficient allocation rules for the multi-armed bandit problem with multiple plays, Part I: I.I.D. rewards. IEEE Transactions on Automatic Control 32 (11), pp. 968–976. Cited by: §3.2.
 [2] (2017) Designing neural network architectures using reinforcement learning. In International Conference on Learning Representations, Cited by: §2.2.
 [3] (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In International Conference on Learning Representations, Cited by: §2.2.
 [4] (2019) Learnable embedding space for efficient neural architecture compression. In International Conference on Learning Representations, Cited by: §2.3, TABLE I, TABLE II, TABLE III.
 [5] (2018) MobileFaceNets: efficient CNNs for accurate real-time face verification on mobile devices. In Chinese Conference on Biometric Recognition, pp. 428–438. Cited by: TABLE II, §5.2.1.
 [6] (2016) Net2Net: accelerating learning via knowledge transfer. In International Conference on Learning Representations, Cited by: §2.3.
 [7] (2020) Stabilizing differentiable architecture search via perturbation-based regularization. In International Conference on Machine Learning, Cited by: §4.3.
 [8] (2019) FairNAS: rethinking evaluation fairness of weight sharing neural architecture search. arXiv preprint arXiv:1907.01845. Cited by: §4.3.
 [9] (2019) ChamNet: towards efficient network design through platform-aware model adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 11398–11407. Cited by: §2.3.
 [10] (2019) ArcFace: additive angular margin loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699. Cited by: TABLE II, §5.2.1.
 [11] (2019) Cognitive graph for multi-hop reading comprehension at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Cited by: §6.2.
 [12] (2020) Breaking the curse of space explosion: towards efficient NAS with curriculum search. In International Conference on Machine Learning, pp. 3822–3831. Cited by: §1.
 [13] (2018) Double forward propagation for memorized batch normalization. In AAAI Conference on Artificial Intelligence, pp. 3134–3141. Cited by: §1.
 [14] (2019) NAT: neural architecture transformer for accurate and compact architectures. In Advances in Neural Information Processing Systems, pp. 735–747. Cited by: §1, §4.1.
 [15] (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: Fig. 2, §1, §2.1, §3.1, §5.2.
 [16] (2016) Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Cited by: §3.1.
 [17] (2017) Channel pruning for accelerating very deep neural networks. In IEEE International Conference on Computer Vision, pp. 1398–1406. Cited by: §2.3.
 [18] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §2.1.
 [19] (2007) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst. Cited by: §5.2.1.
 [20] (2017) Variational deep embedding: An unsupervised and generative approach to clustering. In International Joint Conference on Artificial Intelligence, pp. 1965–1972. Cited by: §1.
 [21] (2016) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, Cited by: §3.3.
 [22] (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §1, §2.1.
 [23] (1989) Backpropagation applied to handwritten ZIP code recognition. Neural Computation 1 (4), pp. 541–551. Cited by: §1.
 [24] (2019) Structured pruning of neural networks with budget-aware regularization. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9108–9116. Cited by: §2.3.
 [25] (2017) Pruning filters for efficient convnets. In International Conference on Learning Representations, Cited by: §2.3.
 [26] (2018) Multi-hop knowledge graph reasoning with reward shaping. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §6.2.
 [27] (2018) Progressive neural architecture search. In European Conference on Computer Vision, pp. 19–34. Cited by: §5.3, TABLE III.
 [28] (2019) Darts: differentiable architecture search. In International Conference on Learning Representations, Cited by: §1, §2.2, §3.1, §3.2, §4.3, §5.1, §5.3, TABLE III.
 [29] (2019) ThiNet: pruning CNN filters for a thinner net. In IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41, pp. 2525–2538. Cited by: §2.3.
 [30] (2018) Neural architecture optimization. In Advances in Neural Information Processing Systems, pp. 7816–7827. Cited by: Fig. 1, §1, §2.2, TABLE I, TABLE II, §5.3, TABLE III.
 [31] (2018) ShuffleNet V2: practical guidelines for efficient CNN architecture design. In European Conference on Computer Vision, pp. 116–131. Cited by: §2.1.
 [32] (2017) AgeDB: the first manually collected, in-the-wild age database. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 51–59. Cited by: §5.2.1.
 [33] (2010) Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning, pp. 807–814. Cited by: §3.3.
 [34] (2018) Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning, pp. 4095–4104. Cited by: §1, §2.2, §3.1, §3.2, §3.4, §4.3, §5.3, TABLE III, §6.4.
 [35] (2017) HyperFace: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (1), pp. 121–135. Cited by: §1.
 [36] (2018) Runtime network routing for efficient image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
 [37] (2019) Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §5.3, TABLE III.
 [38] (2016) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149. Cited by: §1.
 [39] (2016) Object Detection Networks on Convolutional Feature Maps. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
 [40] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §2.1, §5.2.
 [41] (2015) Facenet: A Unified Embedding for Face Recognition and Clustering. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823. Cited by: §1.
 [42] (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897. Cited by: §3.2.
 [43] (2016) Frontal to profile face verification in the wild. In IEEE Winter Conference on Applications of Computer Vision, pp. 1–9. Cited by: §5.2.1.
 [44] (2019) Understanding architectures learnt by cellbased neural architecture search. In International Conference on Learning Representations, Cited by: §4.3.
 [45] (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §1, §2.1, §5.2.
 [46] (2019) The evolved transformer. In International Conference on Machine Learning, Cited by: §2.2.
 [47] (2015) Training Very Deep Networks. In Advances in Neural Information Processing Systems, pp. 2377–2385. Cited by: §1.
 [48] (2015) Deeply Learned Face Representations are Sparse, Selective, and Robust. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2892–2900. Cited by: §1.
 [49] (2015) Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §2.1.
 [50] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §2.2.
 [51] (2005) Multi-armed bandit algorithms and empirical evaluation. In European Conference on Machine Learning, pp. 437–448. Cited by: §3.2.
 [52] (2020) Revisiting parameter sharing for automatic neural channel number search. Advances in Neural Information Processing Systems 33. Cited by: §2.3, TABLE I, TABLE II, TABLE III.
 [53] (2015) HCP: a flexible cnn framework for multilabel image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (9), pp. 1901–1907. Cited by: §1.
 [54] (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3–4), pp. 229–256. Cited by: §3.4.
 [55] (2020) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §3.3, §6.2.
 [56] (2019) SNAS: Stochastic neural architecture search. In International Conference on Learning Representations, Cited by: §5.3, TABLE III.
 [57] (2020) PC-DARTS: partial channel connections for memory-efficient differentiable architecture search. In International Conference on Learning Representations, Cited by: §5.3, TABLE III.
 [58] (2018) NetAdapt: platform-aware neural network adaptation for mobile applications. In European Conference on Computer Vision, pp. 285–300. Cited by: §2.3.
 [59] (2019) Understanding and robustifying differentiable architecture search. In International Conference on Learning Representations, Cited by: §4.3.
 [60] (2019) Graph hypernetworks for neural architecture search. In International Conference on Learning Representations, Cited by: §5.3, TABLE III.
 [61] (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856. Cited by: §2.1, §5.2.
 [62] (2016) Accelerating Very Deep Convolutional Networks for Classification and Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (10), pp. 1943–1955. Cited by: §1.
 [63] (2018) Practical block-wise neural network architecture generation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2423–2432. Cited by: §2.2.
 [64] (2020) Theory-inspired path-regularized differential network architecture search. In Advances in Neural Information Processing Systems, Cited by: §4.3, §5.3, TABLE III.
 [65] (2019) Multi-hop convolutions on weighted graphs. arXiv preprint arXiv:1911.04978. Cited by: §3.3, §6.2.
 [66] (2018) Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pp. 875–886. Cited by: §2.3.
 [67] (2017) Neural architecture search with reinforcement learning. In International Conference on Learning Representations, Cited by: §1, §2.2, §3.1, §3.2.
 [68] (2018) Learning transferable architectures for scalable image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710. Cited by: §3.4.