ENTROPY LEARNING IN NEURAL NETWORKS

In this paper, an entropy term is used in the learning phase of a neural network. As learning progresses, more hidden nodes are driven into saturation. The early creation of such saturated nodes may impair generalisation, so an entropy approach is proposed to dampen their early creation. Entropy learning also increases the importance of relevant nodes while dampening the less important ones. At the end of learning, the less important nodes can then be eliminated to reduce the memory requirements of the neural network.


INTRODUCTION
The mapping capability of a Back-propagation neural network depends on its structure, that is, the number of layers and hidden nodes. In most cases, given a certain application and training data, the network structure must be pre-determined. A network with a structure simpler than necessary cannot give good approximations to the patterns in the training set, while a network with more layers and hidden nodes can perform more complicated mappings. However, better performance on unseen data, that is, generalisation ability, implies a lower-order mapping. A network that is more complicated than necessary, with too many hidden nodes, will over-fit the data and thus perform poorly on untrained data. Bigger networks also need larger data samples for training: it has been pointed out, based on an information-theoretic measure, that the required number of patterns in the training set grows almost linearly with the number of hidden nodes [1]. With a smaller network, less memory is required to store the connection weights and the computational cost of each iteration decreases.
In this paper, a total cost function that includes an additional entropy penalty term is proposed for the training process of Back-propagation neural networks. At the end of entropy learning, the inactive nodes created can be eliminated without affecting the performance of the original network.

THE ENTROPY LEARNING METHOD
Mutual information has been applied to input pruning as well as hidden node pruning [2, 3]. Deco et al. [4] used mutual information measures to eliminate over-training in supervised multilayer networks. Their entropy cost function is

    E = \sum_{p=1}^{N_p} \sum_{k=1}^{N_o} (T_k^p - O_k^p)^2
        + \lambda \Big[ \sum_{j=1}^{N_h} h(\bar{P}_j)
        - \frac{1}{N_p} \sum_{p=1}^{N_p} \sum_{j=1}^{N_h} h(P_j^p) \Big]    (1)

with h(x) = -x \ln x - (1 - x) \ln(1 - x), where the three terms are referred to as A, B and C in order of appearance, N_p is the number of patterns, N_h is the number of hidden nodes, N_o is the number of output nodes, T_k^p is the target output of the kth output node for pattern p, O_k^p is the activation value of the kth output node for pattern p, P_j^p is the normalised activation value of the jth hidden node for pattern p, \bar{P}_j is the average of the normalised activation values of the jth hidden node, and \lambda weights the entropy terms.
In Equation 1, term B tries to move \bar{P}_j over time towards the two extreme ends of the Sigmoid function, and term C tries to move P_j^p towards the 0.5 region of the Sigmoid function.
However, term C acts temporally across the batch of N_p patterns and thus has a smoothing effect on the activation encoding of sequential pattern inputs. In dynamic time-series forecasting problems, this term is added to average the time-series points [5], as most of the hidden node activations lie in the linear zone of the Sigmoid function. In static pattern classification problems such as those discussed in this paper, there is no reason to apply this constraint across the temporal domain, as it leads to poorer classification accuracies. Similarly, the entropy penalty term introduced in this paper helps the learning process move away from the saturation zones. Furthermore, term A, the sum-squared error in Equation 1, has been replaced by the cross-entropy error in our proposed approach.
Kamimura and Nakanishi [6] added a simple entropy measure to the cost term, as shown in Equation 2:

    E = \sum_{p=1}^{N_p} \sum_{k=1}^{N_o} (T_k^p - O_k^p)^2 + \alpha H    (2)

where H is the entropy of the normalised hidden node activations. The entropy measure moves the normalised hidden node representations towards their extremes and speeds up training. Kamimura and Nakanishi [6] argued that the entropy measure has the ability to concentrate information on a few hidden nodes, but no details were given on the network behaviour. Only terms A and B of Equation 1 are used in their work, which is similar to the approach described in this paper. Our paper enhances the incremental weight update approach used by Kamimura and Nakanishi. Kamimura and Nakanishi [7] tried to solve the saturation problem by introducing the additional constraint \sum_{j=1}^{N_h} y_j^p = \theta into their equation; however, the setting of the \theta value is problematic. In this paper, entropy is proposed to solve the early saturation problem. The effect of entropy learning is investigated using five relevance criteria, three of which are proposed in this paper. The proposed method also differs from Kamimura-Nakanishi in that no inhibitory connections are used.

Network entropy
A Back-propagation neural network structure is shown in Figure 1. Suppose that an entropy function can be defined with respect to the activity Y i of the hidden node i. If the entropy is minimised, only a certain subset of the hidden nodes will be turned on. On the other hand, if the entropy is maximised, all the hidden nodes are nearly equally activated in their linear region.
Y_i, the activation value of hidden node i, is obtained by

    Y_i = A_i(net_i) = \frac{1}{1 + e^{-net_i}}    (3)

where A_i is the Sigmoid function and net_i is defined by

    net_i = \sum_{k=1}^{N_k} V_{ik} X_k    (4)

where X_k is the activation value of input node k, V_{ik} is the weight of the connection from input node k to hidden node i, and N_k is the number of input nodes. The activation value Y_i is normalised as follows:

    P_i = \frac{Y_i}{\sum_{j=1}^{N_h} Y_j}    (5)

By using this normalised activity, an entropy function can be formulated:

    H = -\sum_{p=1}^{N_p} \sum_{i=1}^{N_h} P_i^p \ln P_i^p    (6)

In the learning phase, the entropy function is minimised or maximised together with the cross-entropy cost function, as shown in Figure 2. Through this cost function, the growth of the hidden nodes' representation can be controlled. By minimising the entropy term, the hidden nodes' activations can be forced to develop extreme representations near 0 and 1, which can be used for pruning the redundant nodes through saturation. By maximising the entropy term, the hidden activation values are moved towards the 0.5 region. Values in this range require net_i to be near zero, which causes many of the weights V_{ik} to move towards zero, similar to the effect of weight decay.
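As an illustration, the activation, normalisation and entropy definitions above can be sketched in a few lines of NumPy (the array shapes, sizes and variable names here are illustrative assumptions, not taken from the paper's implementation):

```python
import numpy as np

def hidden_entropy(X, V):
    """Network entropy of the normalised hidden activations.

    X: (N_p, N_k) input patterns; V: (N_h, N_k) input-hidden weights.
    """
    net = X @ V.T                         # net_i for every pattern
    Y = 1.0 / (1.0 + np.exp(-net))        # Sigmoid activation Y_i
    P = Y / Y.sum(axis=1, keepdims=True)  # normalise across the hidden nodes
    H = -np.sum(P * np.log(P))            # entropy over all nodes and patterns
    return Y, P, H

rng = np.random.default_rng(0)
X = rng.random((4, 3))                    # 4 patterns, 3 input nodes
V = rng.normal(size=(5, 3))               # 5 hidden nodes
Y, P, H = hidden_entropy(X, V)
```

Minimising H drives each row of P towards a one-hot, extreme representation; maximising it drives the activations towards the uniform 0.5-like regime.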
A state with minimum entropy means that most nodes are operating in the non-linear region near the extreme values. As training proceeds, the hidden nodes' activations are pushed towards their extreme values while some strong and pertinent nodes remain active in the linear region.
As training progresses, more nodes go into saturation around 0 or 1. It is important to note that the early creation of such nodes may impair generalisation performance. To prevent the network from being driven into saturation before it starts to learn, training is conducted in a form analogous to annealing [8], as detailed in Section 2.3, by introducing a relaxing cycle called the entropy cycle. The inactive nodes can then be eliminated sequentially according to the five criteria described in Section 3 without affecting the performance of the original network. If necessary, the entropy reduction process can be halted and a recovery phase of training cycles carried out to move the relevant hidden nodes back to the linear region.

The cost function
Our proposed total cost function, with the additional entropy penalty term αH, can be formulated as follows:

    E_{total} = \beta E + \alpha H    (7)

where α is the entropy rate, H is the entropy cost function of Equation 6, β is the learning rate, and E is the cross-entropy cost function defined as

    E = -\sum_{p=1}^{N_p} \sum_{k=1}^{N_o} \bigl[ T_k^p \ln O_k^p + (1 - T_k^p) \ln (1 - O_k^p) \bigr]    (8)

The cross-entropy error function has far fewer local minima than the mean squared error function [9].
Differentiating Equation 7 with respect to a hidden-output weight W_{ji}, the general learning rule can be obtained from

    \frac{\partial E_{total}}{\partial W_{ji}} = \beta \frac{\partial E}{\partial W_{ji}}    (9)

where the entropy penalty contributes nothing, since H depends only on the hidden activations. Differentiating Equation 7 with respect to an input-hidden weight V_{ik}, the general learning rule can be obtained from

    \frac{\partial E_{total}}{\partial V_{ik}} = \beta \frac{\partial E}{\partial V_{ik}} + \alpha \frac{\partial H}{\partial V_{ik}}    (10)

So the weight update equations for entropy minimisation are

    \Delta W_{ji} = -\frac{\partial E_{total}}{\partial W_{ji}}, \qquad \Delta V_{ik} = -\frac{\partial E_{total}}{\partial V_{ik}}    (11)
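The update rules can be checked numerically: the sketch below takes one descent step on the input-hidden weights of the total cost βE + αH using finite differences, which avoids committing to a particular closed form of the gradients (all shapes, rates and the random data are illustrative assumptions):

```python
import numpy as np

def total_cost(V, W, X, T, alpha, beta):
    """beta*E + alpha*H for a single-hidden-layer Sigmoid network."""
    Y = 1.0 / (1.0 + np.exp(-(X @ V.T)))                  # hidden activations
    O = 1.0 / (1.0 + np.exp(-(Y @ W.T)))                  # output activations
    E = -np.sum(T * np.log(O) + (1 - T) * np.log(1 - O))  # cross-entropy
    P = Y / Y.sum(axis=1, keepdims=True)                  # normalised activity
    H = -np.sum(P * np.log(P))                            # entropy penalty
    return beta * E + alpha * H

def numeric_step(V, W, X, T, alpha=0.1, beta=1.0, lr=0.02, eps=1e-6):
    """One descent step on V using central-difference gradients."""
    grad = np.zeros_like(V)
    for idx in np.ndindex(V.shape):
        dV = np.zeros_like(V)
        dV[idx] = eps
        grad[idx] = (total_cost(V + dV, W, X, T, alpha, beta)
                     - total_cost(V - dV, W, X, T, alpha, beta)) / (2 * eps)
    return V - lr * grad

rng = np.random.default_rng(1)
X = rng.random((4, 3))
T = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])
V = rng.normal(scale=0.5, size=(5, 3))
W = rng.normal(scale=0.5, size=(2, 5))
c0 = total_cost(V, W, X, T, alpha=0.1, beta=1.0)
V = numeric_step(V, W, X, T)
c1 = total_cost(V, W, X, T, alpha=0.1, beta=1.0)          # should be lower than c0
```

With α > 0 the step trades error reduction against entropy reduction, which is the behaviour the analytic rules above implement directly.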

Entropy cycle
The entropy penalty term is added to the total cost function for some number N of iterations and then removed, by setting the entropy rate α to 0, for the next N iterations. N is termed the entropy cycle. The reason for including such an entropy cycle is to allow learning to take place alongside the entropy reduction process, as demonstrated in the experiments of Section 4.2.
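A minimal sketch of the cycle's scheduling (the cycle length N and the on-value of α below are illustrative values, not the paper's settings):

```python
def entropy_rate(iteration, N, alpha_on):
    """Entropy rate for this training iteration: alpha_on for N iterations,
    then 0 for the next N iterations, alternating."""
    return alpha_on if (iteration // N) % 2 == 0 else 0.0

# First 100 iterations apply the penalty, the next 100 relax it, and so on.
schedule = [entropy_rate(t, 100, 0.1) for t in (0, 99, 100, 199, 200)]
```

The relaxation halves give the cross-entropy term alone control of the weights, which is what lets learning proceed between bouts of entropy reduction.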

CRITERIA FOR HIDDEN NODE PRUNING
Many researchers [10, 11, 12, 13, 14] have worked on criteria that correctly reflect the relevance of hidden nodes in a neural network. Methods such as sensitivity analysis [12], optimal brain damage [15] and many others are expensive to compute. In this paper, two existing criteria, proposed by Mozer and Smolensky [10] and by Kamimura and Nakanishi [6], are discussed and three new, relatively inexpensive criteria are proposed.

Mozer-Smolensky relevance criterion [10]
The relevance ρ_i of the ith hidden node is computed by

    \rho_i = \sum_{p=1}^{N_p} \sum_{k=1}^{N_o} \bigl( O_k^p - {}^i O_k^p \bigr)^2    (12)

where {}^i O_k^p is the network output when the ith hidden node is eliminated. A high relevance value implies that the hidden node was important during network training. The Mozer-Smolensky relevance criterion requires two passes through each pattern and is thus computationally intensive.
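The two passes can be sketched as follows: run the network once intact and once with node i silenced, then accumulate the output difference (the squared-difference form, the shapes and the random data are assumptions for illustration):

```python
import numpy as np

def forward(X, V, W, drop=None):
    """Forward pass of a single-hidden-layer Sigmoid network; optionally
    eliminate one hidden node by zeroing its activation."""
    Y = 1.0 / (1.0 + np.exp(-(X @ V.T)))
    if drop is not None:
        Y[:, drop] = 0.0
    return 1.0 / (1.0 + np.exp(-(Y @ W.T)))

def relevance(X, V, W, i):
    """Mozer-Smolensky-style relevance of hidden node i."""
    O = forward(X, V, W)                  # first pass: full network
    O_i = forward(X, V, W, drop=i)        # second pass: node i eliminated
    return np.sum((O - O_i) ** 2)         # summed over patterns and outputs

rng = np.random.default_rng(0)
X = rng.random((4, 3))
V = rng.normal(size=(5, 3))
W = rng.normal(size=(2, 5))
W[:, 2] = 0.0                             # node 2 feeds nothing forward
```

A node whose outgoing weights are all zero has zero relevance, since eliminating it cannot change the outputs.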

Kamimura-Nakanishi variance criterion
Kamimura and Nakanishi [6] proposed a measure that is easier to compute and strongly resembles the Mozer-Smolensky relevance criterion. Their measure is the variance of the input-hidden connections, based on their observation that hidden nodes playing important roles tend to have a large variance in their incoming weights compared with less important hidden nodes. The variance R_i of the ith hidden node is defined by

    R_i = \frac{1}{N_k} \sum_{k=1}^{N_k} \bigl( V_{ik} - \bar{V}_i \bigr)^2    (13)

where \bar{V}_i is defined as

    \bar{V}_i = \frac{1}{N_k} \sum_{k=1}^{N_k} V_{ik}    (14)

This measure is approximately equivalent to the Mozer-Smolensky relevance criterion, but its magnitude is usually low.
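Equation 13 is cheap to evaluate directly from the weight matrix; a sketch (the weight values below are illustrative):

```python
import numpy as np

def kn_variance(V):
    """Kamimura-Nakanishi variance R_i of each hidden node's incoming
    weights; V has shape (N_h, N_k)."""
    V_bar = V.mean(axis=1, keepdims=True)        # mean incoming weight per node
    return ((V - V_bar) ** 2).mean(axis=1)

R = kn_variance(np.array([[1.0, 1.0, 1.0],       # uniform incoming weights
                          [0.0, 2.0, 4.0]]))     # spread incoming weights
```

The first node's uniform weights give R = 0, while the spread weights of the second give R = 8/3, so the second node would be ranked as more relevant.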

Hidden-output weight variance criterion
A measure similar to R_i is proposed in this paper to compute the variance of the hidden-output connections. This measure is usually higher in magnitude than R_i. The variance HOV_i of hidden node i is defined by

    HOV_i = \frac{1}{N_o} \sum_{j=1}^{N_o} \bigl( W_{ji} - \bar{W}_i \bigr)^2    (15)

where \bar{W}_i is defined as

    \bar{W}_i = \frac{1}{N_o} \sum_{j=1}^{N_o} W_{ji}    (16)

This measure correlates well with the Mozer-Smolensky relevance criterion \rho_i but requires at least two output nodes.

Hidden node activation variance criterion
A fourth measure is proposed to examine the variance HV_i of the ith hidden node's activation, defined as

    HV_i = \frac{1}{N_p} \sum_{p=1}^{N_p} \bigl( Y_i^p - \bar{Y}_i \bigr)^2    (17)

where \bar{Y}_i is the mean activation value of the ith hidden node across the N_p pattern presentations:

    \bar{Y}_i = \frac{1}{N_p} \sum_{p=1}^{N_p} Y_i^p    (18)

Hidden node activation differential criterion
Another criterion, suited to networks trained with the entropy learning cost function, is also proposed. The temporal behaviour of pattern encoding in the hidden nodes varies across pattern presentations, and this variation of a hidden node's activation from one pattern to the next can serve as a good measure of relevance if the encoding has extreme representation. The hidden node activation differential HA_i of the ith hidden node is defined as

    HA_i = \sum_{p=2}^{N_p} \bigl| Y_i^p - Y_i^{p-1} \bigr|    (19)

This measure is based on the assumption that if a hidden node varies very little in its encoding, it carries minimal information content, as it becomes a bias to the output node.
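The three criteria proposed in this paper translate directly into a few lines each; the sketch below follows their textual definitions, with array shapes and sample values as assumptions:

```python
import numpy as np

def hov(W):
    """Hidden-output weight variance HOV_i; W has shape (N_o, N_h)."""
    return ((W - W.mean(axis=0, keepdims=True)) ** 2).mean(axis=0)

def hv(Y):
    """Hidden node activation variance HV_i; Y has shape (N_p, N_h)."""
    return ((Y - Y.mean(axis=0, keepdims=True)) ** 2).mean(axis=0)

def ha(Y):
    """Hidden node activation differential HA_i: total absolute change in
    activation from one pattern presentation to the next."""
    return np.abs(np.diff(Y, axis=0)).sum(axis=0)

W = np.array([[1.0, 0.0],
              [3.0, 0.0]])                # two output nodes, two hidden nodes
Y = np.array([[0.5, 0.0],
              [0.5, 1.0]])                # two patterns, two hidden nodes
```

A node whose activation never changes scores zero on both hv and ha, flagging it as a candidate bias node for pruning.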

EXPERIMENTAL RESULTS
Three sets of experiments are conducted to highlight the use of entropy in pruning: the XOR, 3-bits parity, and Iris [16] classification problems. The XOR problem is formulated as a 2-bit classification task: Output 1 is activated for (0, 1) or (1, 0) inputs and Output 2 for (1, 1) or (0, 0) inputs. Like the XOR problem, the 3-bits parity problem is a 2-bit classification problem. There are three inputs (i.e. 3 bits) to the neural network; Output 1 is activated when the number of set bits in the input is odd and Output 2 when it is even. The Iris problem is to classify an input pattern into one of three Iris flower species. The input pattern has four features: sepal length, sepal width, petal length and petal width.
In these three problems, the behaviour of the entropy term is first examined in the Back-propagation learning process using only the cross-entropy cost function, i.e. βE. Then, the proposed entropy reduction algorithm with the entropy penalty term αH is used to steer the activation levels of the hidden nodes in the network. Finally, a sequential pruning process is conducted to illustrate the number of relevant nodes remaining after entropy learning, using the five criteria detailed in Section 3.
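The sequential pruning step can be sketched as a simple loop: rank nodes by a relevance criterion and remove the least relevant one at a time while accuracy holds (the accuracy callback, threshold and scores below are illustrative assumptions, not the paper's procedure in detail):

```python
import numpy as np

def sequential_prune(scores, accuracy_fn, min_accuracy):
    """Prune hidden nodes in ascending order of relevance `scores`.

    accuracy_fn(pruned) returns the network accuracy with the listed
    nodes removed; pruning stops at the first removal that hurts accuracy.
    """
    order = np.argsort(scores)            # least relevant node first
    pruned = []
    for node in order:
        if accuracy_fn(pruned + [int(node)]) >= min_accuracy:
            pruned.append(int(node))
        else:
            break
    return pruned

# Toy example: nodes 1 and 3 are essential, nodes 0 and 2 are redundant.
scores = np.array([0.1, 5.0, 0.2, 4.0])
acc = lambda pruned: 0.5 if (1 in pruned or 3 in pruned) else 1.0
removed = sequential_prune(scores, acc, min_accuracy=0.9)
```

Here the two low-scoring redundant nodes are removed and pruning stops as soon as an essential node would be lost.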

Entropy of hidden nodes
The following experiments are conducted to show that learning is an entropy reduction process. In all the classification tasks, the entropy term H eventually decreases with training, although the direction of entropy reduction may jitter at times in the error minimisation phase. It can be seen from Figure 3 that the entropy measure decreases steadily in the Iris classification problem. This corresponds to the convergence of the hidden node representations towards extreme representations near 0 or 1. Such a smooth decrease in entropy is due to the fact that there are not many local minima in the training process. In the other two non-linear problems, the XOR and 3-bits parity problems, the hidden nodes are seen to develop extreme behaviour at the beginning of learning. They are forced to change direction after encountering the saddle region in the error surface and converge towards the linear operating mode around 0.5. Upon getting out of the saddle region, the hidden nodes enter the non-linear operating mode and converge towards an optimal solution. The entropy thus reflects the status of the learning process: it can be seen through the entropy whether the network is stuck in a basin of attraction. The entropy curve flattens or moves upwards when the network is searching for a solution in the saddle region, and decreases rapidly when the network finds the correct direction and converges. The hidden entropy thus indicates whether or not the majority of the nodes are operating in the linear region: if they are, the entropy is high; if the nodes develop extreme behaviour near the flat non-linear ends of the curve, the entropy is low.
The entropy cycles dampen the extreme minimisation of the network entropy and improve generalisation. Figure 4 shows better accuracy (generalisation) with the use of entropy cycles. Figure 5 shows that without entropy cycles, the hidden nodes go into saturation very quickly without capturing the essence of the training data.

XOR problem
Fifty trials are conducted using a single layer Back-propagation neural network starting with ten hidden nodes. The five criteria introduced in Section 3 are used to prune the ten-hidden-node network with and without the entropy penalty term. The results in Figure 7 show the magnitude of each criterion and the network classification accuracy with and without the entropy penalty term. To interpret, Figure 7(a) shows that Node A is the least relevant node and Node J the most relevant, so the sequence of pruning should be Node A, Node B, ..., Node J. Figure 7(b) shows that if Node A is pruned, the network still achieves 100% accuracy with and without the entropy penalty term. The accuracy of the network starts to drop when Node I is pruned.
The Mozer-Smolensky relevance is high for the relevant nodes, i.e. Nodes I and J, when the entropy penalty term is used (Figure 7 a, b). Without the entropy penalty term, only one node, Node J, is high (Figure 7 a). The Kamimura-Nakanishi variance is small and fails to capture the relevance of the nodes (Figure 7 c, d); however, better accuracy is observed when the entropy penalty term is used. The hidden-output weight variance is high for the more important nodes (Figure 7 e, f). The hidden node activation variance does not reflect the importance of the hidden nodes and is thus not a good measure for pruning (Figure 7 g, h). Relevant nodes have a higher activation differential with the entropy penalty term (Figure 7 i, j).

Three-bits parity problem
Fifty trials are conducted using a single layer Back-propagation neural network starting with ten hidden nodes. Figure 8 shows the experimental results of using the five criteria on the 3-bits parity problem. The results are comparable to those of the XOR problem, except that the Kamimura-Nakanishi variance is higher with the entropy penalty term (Figure 8 c) but still fails to capture the relevance of the nodes (Figure 8 d). In most cases the network is effectively pruned down to an optimal size of three hidden nodes. In particular, the hidden node activation differential (Figure 8 i, j) performs well with entropy learning: the network is pruned down to three nodes, as compared to four nodes without entropy learning. The magnitude of each criterion is also higher after entropy learning.

Iris classification problem
One hundred and fifty samples of three Iris flower species are used. Twenty-five specimens from each of the three species are used to train the network and the remaining half are used for testing.
The Mozer-Smolensky relevance is higher for important nodes with the entropy penalty term (Figure 9 a, b). In terms of accuracy, two nodes can be retained without the entropy penalty term. For the Kamimura-Nakanishi variance (Figure 9 c, d), there is an accuracy drop from pruning node F to node G for the network without the entropy penalty term. When the entropy penalty term is used, three nodes can be retained; the accuracy remains constant while pruning from node A to F and increases from node F to G. The Kamimura-Nakanishi variance is also seen to be higher for important nodes with the entropy penalty term, and the ordering of the important nodes aligns well for sequential pruning. The hidden-output weight variance (Figure 9 e, f) is higher for important nodes with the entropy penalty term; the network is pruned down to three nodes, as compared to five nodes without it. The hidden node activation variance (Figure 9 g, h) fails to capture the importance of the hidden nodes. The hidden node activation differential (Figure 9 i, j) clearly isolates the important nodes with the entropy penalty term. This criterion is more effective with the entropy penalty term, as more nodes are forced to encode at the non-linear ends of the Sigmoid function while some persist in the linear region. A network accuracy of 96% is achieved with three nodes, as compared to five nodes without the entropy penalty term.