Monday, March 6, 2023

This is part two of Chaos Networks Part 1.

In the previous post, Chaos Networks Part 1, we discussed the lack of chaos current feed-forward layer-based neural networks have, and proposed our hypothesis that this lack of chaos, or rather, the paradigm of using layers, is detrimental, and leads to extremely inefficient neural networks that have magnitudes more weights and connections than required (I don't think we used these exact words, but now we know the point of all this). Please note, this is similar to something already well established called NEAT (neural evolution of augmenting topologies), though I have not seen it implemented as we do.

It seems, although not thoroughly tested, our hypothesis holds some weight.

Quick Recap

====================================================================================================================================================================================================================================================================================================================================

Last post we created a graph based feed-forward neural network from scratch, including the writing of our own very simple automatic differentiation library that handles 1-dimensional `tensors`

. We trained this network on the MNIST dataset, and were able to achieve about 91% accuracy, so not great.

We hypothesized that the main contributor to our lack of validation accuracy was the lack of using batches. As we will see, it was not, or at least, not entirely that.

Finding the Hidden Error

====================================================================================================================================================================================================================================================================================================================================

Debugging our Chaos Network was painful. There was no undefined behaviour, there were no visible bugs, only a gut feeling that our network should be performing better.

To keep this section short I will give a bullet point list of the steps taken to rectify this bug.

- Compare each and every Tensor Operation with that of Pytorch's
- Get stuck checking the Negative Log Likelihood for three days. It was fine
- Go back through the other Tensor Operations
- Curse the fates for letting me be born in the era of computing
- Recreate the exact layout, weights, and order of operations of my Chaos Network in Pytorch to compare gradients when going backwards
- Add the
`split_on_add`

`tensor`

operation

I'm not going to do a deep dive into the code for our automatic differentiation library, as we did a small dive in Part 1 (we will cover some updates later), but the main issue was fixed when adding the following `tensor`

operation:

```
impl<const N: usize> Tensor1D<N> {
pub fn split_on_add(self, count: usize) -> Vec<Self> {
match &self.tape {
Some(tape) => {
let mut new_tensors: Vec<Tensor1D<N>> = (0..count)
.map(|_i| Self::new_with_tape(self.data, Some(tape.clone())))
for t in new_tensors.iter_mut() {
let self_id = t.grad_for;
let old_self_id = self.grad_for;
// For each split element grab their gradients, and the parent gradients, and
// add them together
t.tape.as_mut().unwrap().write().unwrap().add_operation((
self_id,
Box::new(move |g| {
let tg = g.remove(self_id);
let mut tg2 = g.remove_or_0(old_self_id);
tg2.data = element_wise_addition::<N>(&tg2.data, &tg.data);
g.insert(old_self_id, tg2);
}),
));
}
new_tensors
}
None => (0..count)
.map(|_i| Tensor1D::new_without_tape(self.data))
.collect(),
}
}
}
```

For brevity's sake, in version 0.1 of our Chaos Network, in our forward function, we would multiply the weight of the current `node`

once for every connection it had, but our automatic differentiation library was not designed to handle multiple operations by the same `tensor`

. By adding the above function, we first `split_on_add`

the `tensor`

we want to perform multiple operations with, and then can simply pop a version off of the returned Vec of our `tensors`

.

It doesn't matter if the above makes sense, the context required to understand it fully is missing, the main point is in Chaos Networks Part 1 we had an error accumulating gradients while going backwards, and that is now fixed!

Minor Upgrades and Improvements

====================================================================================================================================================================================================================================================================================================================================

One of the improvements we made is switching from a traditional graph tree-structure where each node holds its connections to the next, to a streamlined version where the network holds a `HashMap`

of edges. Because we are using Rust, this greatly reduced the use of `Rc`

and `RefCell`

although this was probably not a limiting factor for speed.

As it turns out, the two greatest restrictions to our speed were the `HashMap`

we used when accumulating gradients while going forward through the network, and each `tensor`

having to join its held tape after each operation.

Improving the speed of the `HashMap`

was relatively simple. We used rustc_hash which greatly improved our hashing speeds.

Removing the need for `tensors`

to handle their own `tape`

was a little more involved. In Part 1 we mentioned the inspiration for our automatic differentiation implementation came from dfdx, an absolutely awesome deep learning library for Rust. Each `tensors`

`tape`

is really just `Vec`

x<dyn FnOnce(&mut Gradients`vectors`

than I do, but the combining of these `tapes`

(hundreds of thousands every second) would greatly slow down the runtime performance. In the end, I switched to a system where there is one `tape`

of type `Arc`

Lock<Tape`tensor`

has a copy.

These two minor improvements greatly increased the runtime performance of our Chaos Network.

Let There Be Batches

There are two reasons why we went through the pain of creating our own automatic differentiation library when libraries like dfdx already exist: to learn how they work, and to specialize it for our use case. The automatic differentiation library we wrote is highly specialized for our Chaos Network. It only has the functions we need, does not work in many general purpose situations unless great care is taken, and in the case of batches, is specially built to manage the gradients for 1-dimensional `tensors`

differently.

Let's examine the new and updated code we created to support batches in the `mul`

implementation.

```
#[derive(Debug, Clone)]
pub struct Tensor1D<const N: usize> {
pub id: u64,
pub grad_for: u64,
pub data: [f64; N],
pub tape: Option<Arc<RwLock<Tape<N>>>>,
}
```

First we introduce the concept of a 1-dimensional `tensor`

(above).

```
impl<'a, 'b, const N: usize> Mul<&'b mut Tensor1D<N>> for &'a mut Tensor0D<N> {
type Output = Tensor1D<N>;
fn mul(self, other: &'b mut Tensor1D<N>) -> Self::Output {
let new_data: [f64; N] = other.data.map(|d| self.data * d);
let mut new = Tensor1D::new_without_tape(new_data);
match (&self.tape, &other.tape) {
(Some(self_tape), Some(_other_tape)) => {
let new_id = new.grad_for;
let self_id = self.grad_for;
let other_id = other.grad_for;
let self_data = self.data;
let other_data = other.data.clone();
self_tape.write().unwrap().add_operation((
new_id,
Box::new(move |g| {
let mut tg1 = g.remove(new_id);
let mut tg2 = tg1.clone();
tg1.data = element_wise_mul::<N>(&tg1.data, &other_data);
tg2.data = tg2.data.map(|x| x * self_data);
g.insert(self_id, tg1);
g.insert(other_id, tg2);
}),
));
new.set_tape(self.tape.clone());
}
(Some(self_tape), None) => {
let new_id = new.grad_for;
let self_id = self.grad_for;
let other_data = other.data.clone();
self_tape.write().unwrap().add_operation((
new_id,
Box::new(move |g| {
let mut tg = g.remove(new_id);
tg.data = element_wise_mul::<N>(&tg.data, &other_data);
g.insert(self_id, tg);
}),
));
new.set_tape(self.tape.clone());
}
(None, Some(_other_tape)) => {
panic!("Switch operator orientation");
}
(None, None) => (),
}
new
}
}
```

The above is the only implementation of `Mul`

for both 0-dimensional and 1-dimensional `tensors`

. That would not make sense in the context of a normal automatic differentiation library, but because we have highly specialized one built for Chaos Networks, not only does it make sense, but it saves us an immense amount of work and computation.

Let's examine one of the passing test cases for our implementation.

```
#[test]
fn test_mul_1d() {
let tape: Arc<RwLock<Tape<3>>> = Arc::new(RwLock::new(Tape::new()));
let mut a = Tensor0D::new_with_tape(2., Some(tape.clone()));
let mut b = Tensor1D::new_without_tape([1., 2., 3.]);
let mut c = &mut a * &mut b;
// Check value match
assert_eq!([2., 4., 6.], c.data);
// Check gradients
let mut grads = c.backward();
let a_grads = grads.remove(a.grad_for);
assert_eq!([1., 2., 3.], a_grads.data);
}
```

If you haven't done a lot of work with automatic differentiation everything may seem normal but the above code is noteworthy because it is different than what Pytorch would yield.

```
import torch as torch
import torch.nn as nn
w = torch.tensor(2., dtype=torch.float64, requires_grad=True)
x = torch.tensor([1, 2, 3], dtype=torch.float64)
y = w * x
print(y) # Prints tensor([2., 4., 6.], dtype=torch.float64, grad_fn=<MulBackward0>)
y.backward(torch.ones_like(x))
print(w.grad) # Prints tensor(6., dtype=torch.float64)
```

Notice how the gradients of `w--+`

he analogous Python variable for `mut`

-+ above) are the summation of the gradient vector produced by our implementation of `Mul`

. This is done with purpose. Notice the way the tapes are combined in `Mul`

function above. Most noticeably, they are combined through element wise multiplication.

All of the above to say, 1-dimensional `tensors`

are treated as a batch. Each element is a separate input corresponding with that batch. This is why gradients are not combined. If you are curious to see how this works, check out the repo!

Adding this form of batching improved the runtime performance of the network by a factor close to the batch size being used (MASSIVE IMPROVEMENT).

Training the Chaos Network

We have a neural network represented by an acyclic graph, now we must train it. Before showing any code, let's ask ourselves a few questions.

What should the structure of the network be? We have complete freedom in how we want to structure our Chaos Network, but complete freedom is just that, complete freedom. It is our job to establish laws regulating this structure.

How should the structure of the network change as we train it? One of the core ideas behind the Chaos Network is that it is chaotic, morphing as we train it. Once again, it is up to us to create the laws regulating this growth.

The first question has a few relatively simple answers, and I went with the easiest.

```
#[derive(Default)]
pub struct Network<const N: usize> {
pub inputs_count: usize,
pub leaves_count: usize,
pub nodes: Vec<Node<N>>,
pub connections_to: HashMap<i32, Vec<usize>>,
pub mode: NetworkMode,
pub tape: Arc<RwLock<Tape<N>>>,
}
impl<const N: usize> Network<N> {
pub fn new(inputs: usize, outputs: usize) -> Self {
let mut network: Network<N> = Network::default();
network.add_nodes(NodeKind::Leaf, outputs);
network.add_nodes(NodeKind::Input, inputs);
network
}
pub fn add_nodes(&mut self, kind: NodeKind, count: usize) {
match kind {
NodeKind::Normal => {
let node_index = self.batch_insert_normal_nodes(count);
for i in 0..(count) {
for _ii in 0..10 {
self.add_node_connection_to(node_index + i);
self.add_node_connection_from(node_index + i);
}
}
}
NodeKind::Input => {
self.inputs_count += count;
let mut nodes: Vec<Node<N>> = (0..count).map(|_i| Node::new(kind)).collect();
for _i in 0..count {
self.nodes.insert(0, nodes.remove(0));
}
self.shift_all_connections_after(0, count, ShiftDirection::Forward);
if self.leaves_count == 0 {
panic!("This should be called after leaves are added for now")
}
let mut rng = rand::thread_rng();
let odds_of_being_picked = 100. / (count as f64);
for i in 0..count {
if odds_of_being_picked > rng.gen::<f64>() {
let input_node_index = distribution_between.sample(&mut rng);
self.add_connection_between(i, input_node_index);
}
}
}
NodeKind::Leaf => {
self.leaves_count += count;
for _i in 0..count {
self.nodes.push(Node::new(kind));
}
}
}
}
}
```

The above is the definition of the `Network`

and the `new`

and `add_nodes`

implemented functions. `Nodes`

are one of three variations: `Input`

, `Normal`

and `Leaf`

. `Input`

des--+ receive input, `Normal`

des--+ would be like nodes in hidden layers and `Leaf`

des--+ are output.

The `new`

function is syntactic sugar around the `add_nodes`

function. The `add_nodes`

function takes the `NodeKind`

and the count, and adds the count of new `NodeKinds`

. Most notably, not all `Input`

des--+ start with connections as this is part of the over-paramaterization Chaos Networks are trying to fix.

The above answers our first posed question, but the second question about how the network should grow is a bit trickier. Outlining the problem further, at its core we have a graph that we want to change, and a heuristic (the validation accuracy) that we can use to evaluate that change. This sounds like a perfect use for evolutionary algorithms. For those not familiar, we essentially have some starting `population`

, make some changes to that `population`

, take the best subset N from the new `population`

evaluated using our heuristic function, and repeat.

In our case, we start with some N Chaos Networks, train those and some randomly modified versions of those, evaluate how well they perform, mix the current `population`

and the new randomly modified `population`

, and repeat.

The code doing this is below. I've had to remove some of the context to keep things smaller.

```
pub struct StandardClassificationNetworkHandler<const I: usize, const O: usize, const N: usize> {
max_training_steps: usize,
steps_per_training_step: usize,
train_data: Arc<RepeatingNetworkData<I, N>>,
test_data: Arc<RepeatingNetworkData<I, N>>,
validation_steps: usize,
}
impl<const I: usize, const O: usize, const N: usize> StandardClassificationNetworkHandler<I, O, N> {
pub fn train(&mut self) {
// Create the initial population
let mut population: Vec<Network<N>> =
(0..POPULATION_SIZE).map(|_i| Network::new(I, O)).collect();
// Prep some stuff for getting random data points
let train_data_distribution = Uniform::from(0..self.train_data.len());
let test_data_distribution = Uniform::from(0..self.test_data.len());
let mut rng = thread_rng();
// Do the actual training
for training_step in 0..self.max_training_steps {
// Prep the random indexes
let train_data_random_indexes = (0..self.steps_per_training_step)
.map(|_i| {
(0..N)
.map(|_ii| train_data_distribution.sample(&mut rng))
.collect::<Vec<usize>>()
.try_into()
.unwrap()
})
.collect::<Vec<[usize; N]>>();
let test_data_random_indexes = (0..self.validation_steps)
.map(|_i| {
(0..N)
.map(|_ii| test_data_distribution.sample(&mut rng))
.collect::<Vec<usize>>()
.try_into()
.unwrap()
})
.collect::<Vec<[usize; N]>>();
// Current population and maybe new networks
population = if training_step != 0 && training_step % 10 == 0 {
let new_networks = population.iter().map(|x| x.clone()).collect();
let new_morphed_networks = self.train_population(
new_networks,
&train_data_random_indexes,
&test_data_random_indexes,
true,
);
let new_population_networks = self.train_population(
population,
&train_data_random_indexes,
&test_data_random_indexes,
false,
);
// Merge them together
new_population_networks
.into_iter()
.zip(new_morphed_networks.into_iter())
.rev()
.skip(POPULATION_SIZE / 2)
.map(|(a, b)| vec![a.0, b.0])
.flatten()
.collect()
} else {
self.train_population(
population,
&train_data_random_indexes,
&test_data_random_indexes,
false,
)
.into_iter()
.map(|(n, _, _)| n)
.collect()
}
}
}
fn train_population(
&self,
population: Vec<Network<N>>,
train_data_random_indexes: &Vec<[usize; N]>,
test_data_random_indexes: &Vec<[usize; N]>,
do_morph: bool,
) -> Vec<(Network<N>, f64, f64)> {
// Do the training and validation
let new_networks: Vec<(Network<N>, f64)> = population
.into_par_iter()
.map(|mut network| {
if do_morph {
let mut rng = rand::thread_rng();
let percent_nodes_to_add = rng.gen::<f64>() / 10.;
let percent_connections_to_add = rng.gen::<f64>() / 10.;
let percent_connections_to_remove = (rng.gen::<f64>() / 10.) * 3.;
grow(
&mut network,
percent_nodes_to_add,
percent_connections_to_add,
);
prune(&mut network, percent_connections_to_remove);
}
let train_data = self.train_data.clone();
let test_data = self.test_data.clone();
for step in 0..self.steps_per_training_step {
train_next_batch(
&mut network,
train_data.next(&train_data_random_indexes[step]),
);
}
let average_validation_accuracy: f64 = (0..self.validation_steps)
.map(|step| {
validate_next_batch(
&mut network,
test_data.next(&test_data_random_indexes[step]),
)
})
.sum::<f64>()
/ (self.validation_steps as f64);
(network, average_validation_accuracy)
})
.collect();
let network_sizes: Vec<f64> = new_networks
.iter()
.map(|(n, _)| n.get_connection_count() as f64)
.collect();
let min_network_size = network_sizes
.iter()
.min_by(|a, b| a.partial_cmp(&b).unwrap())
.unwrap();
let mut new_networks: Vec<(Network<N>, f64, f64)> = new_networks
.into_iter()
.map(|(n, ava)| {
let connection_count = n.get_connection_count() as f64;
(n, ava, ava + ((min_network_size / connection_count) * 0.05))
})
.collect();
new_networks.sort_by(|a, b| b.2.partial_cmp(&a.2).unwrap());
new_networks
}
}
```

Essentially, this does exactly what is described above. Let's walk through it:

- We first create a
`population`

of Chaos Networks (for most cases I tried, the`population`

size was between 10 and 100) - Every
`training_step`

we train the`population`

for a selected number of batches - Every 10
`training_steps`

we take the`population`

, clone it, and perform some random adding and removing of nodes and connections - We remove the bottom 50% of the
`population`

and replace it with the top 50% of the new networks created from making random changes to the existing`population`

I made the following changes to the traditional evolutionary algorithm. One, our `population`

size is tiny. This is simply due to hardware restrictions. Two, we have no crossover between the `population`

. Most of the time, there is an idea of crossover when mutating the `population`

. We do not do this, instead each member of the `population`

has its own "child" that gets mutated randomly. It may be beneficial to come back and explore the idea of crossover and changing the mutation function. Three, we do not only chose Chaos Networks with the highest score. When evaluating which Chaos Networks should survive to the next generation, we take the best 50% of the current `population`

and the best 50% of the new morphed `population`

. This is not standard, but the motivation for this is the same as the motivation in the next point. Four, we do not mutate the `population`

every iteration. I found that mutating the `population`

every training step did not give the new additions enough time to increase their `average_validation_accuracy`

enough to stick around through another mutation such that our starting networks would become the only networks that ever last a round in the `population`

.

The heuristic function used is worth mentioning. We score not only on `average_validation_accuracy`

over some number of batches, but also on the comparison of the size of the networks. This encourages not only a high `average_validation_accuracy`

but a low network size.

Evaluating the Chaos Network

How does it perform? Not very well, but it is important to qualify that statement.

Through testing, I have found Chaos Networks with about 1,300 connections that have an `average_validation_accuracy`

of about 93.0% on the MNIST dataset (you could probably try for longer and get better). I have not tested it on other datasets yet.

Now 93.00% is not impressive, but the size of the network for that percent is, from my understanding, very good. To put this into perspective, a standard feed-forward neural network that will solve MNIST will have an input layer with 784 nodes, a hidden layer with 512 nodes, and an output layer with 10 nodes. This standard feed-forward network will have over 400,000 connections. I am confident with some tweaking to the heuristic and evolutionary algorithm approach we can improve these results even further.

Haters will say you could build a feed-forward neural network that is even smaller and performs better for MNIST which is (probably) correct, but the Chaos Network is not meant solely for MNIST, it is a framework for building feed-forward neural networks that are not over-paramaterized.

Thanks for reading!

----------------------------------------

Github | Twitter | LinkedIn | Newsletter

© 2024 Silas Marvin. No tracking, no cookies, just plain HTML and CSS.