This is part two of Chaos Networks.
In the previous post, Chaos Networks Part 1, we discussed the lack of chaos in current feed-forward layer-based neural networks, and proposed our hypothesis that this lack of chaos, or rather, the paradigm of using layers, is detrimental, and leads to extremely inefficient neural networks with orders of magnitude more weights and connections than required (I don't think we used these exact words, but now we know the point of all this). Please note, this is similar to something already well established called NEAT (NeuroEvolution of Augmenting Topologies), though I have not seen it implemented the way we do it.
It seems, although not thoroughly tested, our hypothesis holds some weight.
Last post we created a graph-based feed-forward neural network from scratch, including writing our own very simple automatic differentiation library that handles 1-dimensional tensors. We trained this network on the MNIST dataset and achieved about 91% accuracy, so not great.
We hypothesized that the main contributor to our low validation accuracy was the lack of batching. As we will see, it was not, or at least, not entirely.
Debugging our Chaos Network was painful. There was no undefined behaviour, there were no visible bugs, only a gut feeling that our network should be performing better.
To keep this section short, I'll cut straight to the fix: a split_on_add tensor operation. I'm not going to do a deep dive into the code for our automatic differentiation library, as we did a small dive in Part 1 (we will cover some updates later), but the main issue was fixed by adding the following tensor operation:
impl<const N: usize> Tensor1D<N> {
pub fn split_on_add(self, count: usize) -> Vec<Self> {
match &self.tape {
Some(tape) => {
                let mut new_tensors: Vec<Tensor1D<N>> = (0..count)
                    .map(|_i| Self::new_with_tape(self.data, Some(tape.clone())))
                    .collect();
for t in new_tensors.iter_mut() {
let self_id = t.grad_for;
let old_self_id = self.grad_for;
                    // For each split element, grab its gradient and the parent's
                    // gradient, and add them together
t.tape.as_mut().unwrap().write().unwrap().add_operation((
self_id,
Box::new(move |g| {
let tg = g.remove(self_id);
let mut tg2 = g.remove_or_0(old_self_id);
tg2.data = element_wise_addition::<N>(&tg2.data, &tg.data);
g.insert(old_self_id, tg2);
}),
));
}
new_tensors
}
None => (0..count)
.map(|_i| Tensor1D::new_without_tape(self.data))
.collect(),
}
}
}
For brevity's sake: in version 0.1 of our Chaos Network, in our forward function, we would multiply the weight of the current node once for every connection it had, but our automatic differentiation library was not designed to handle multiple operations on the same tensor. By adding the above function, we first split_on_add the tensor we want to perform multiple operations with, and then simply pop a version of the tensor off of the returned Vec.
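As a hedged usage sketch (the names here are illustrative, not from the repo): a node whose output feeds three connections splits its output tensor first, and each connection consumes one copy.
// Hypothetical usage: `node_output` (a Tensor1D) feeds three outgoing
// connections, so we split it first; `some_node_forward` is illustrative.
let node_output = some_node_forward();
let mut copies = node_output.split_on_add(3);
// Each outgoing connection pops and consumes its own copy; on the backward
// pass the three gradients are summed back into the original tensor's slot.
let for_connection_a = copies.pop().unwrap();
let for_connection_b = copies.pop().unwrap();
let for_connection_c = copies.pop().unwrap();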
It doesn't matter if the above makes sense in full; the context required to understand it completely is missing. The main point is that in Chaos Networks Part 1 we had an error accumulating gradients during the backward pass, and that is now fixed!
One of the improvements we made is switching from a traditional tree-style graph structure, where each node holds its connections to the next, to a streamlined version where the network holds a HashMap of edges. Because we are using Rust, this greatly reduced the use of Rc and RefCell, although this was probably not a limiting factor for speed.
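A minimal sketch of the change (types simplified here; the real Network definition appears later in this post):
use std::collections::HashMap;

pub struct Node; // fields elided

// Before (roughly): each node owned shared, mutable handles to its children:
//     connections: Vec<Rc<RefCell<Node>>>
// After: nodes live in a flat Vec and the network owns one edge map from a
// node's id to the indexes of the nodes it connects to.
pub struct Network {
    pub nodes: Vec<Node>,
    pub connections_to: HashMap<i32, Vec<usize>>,
}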
As it turns out, the two greatest restrictions on our speed were the HashMap we used when accumulating gradients while going forward through the network, and each tensor having to join its held tape after each operation.
Improving the speed of the HashMap was relatively simple: we used rustc_hash, which greatly improved our hashing speed.
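The swap is nearly drop-in. A minimal sketch (the key and value types here are stand-ins for the real gradient map):
// FxHashMap is std's HashMap with a much faster (non-DoS-resistant) hasher.
use rustc_hash::FxHashMap;

// Hypothetical gradient map keyed by tensor id:
let mut gradients: FxHashMap<u64, [f64; 3]> = FxHashMap::default();
gradients.insert(0, [0.0; 3]);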
Removing the need for tensors to handle their own tape was a little more involved. In Part 1 we mentioned the inspiration for our automatic differentiation implementation came from dfdx, an absolutely awesome deep learning library for Rust. Each tensor's tape is really just a Vec<Box<dyn FnOnce(&mut Gradients<N>)>>. dfdx surely handles these vectors better than I do, but combining tapes (hundreds of thousands of times every second) would greatly slow down runtime performance. In the end, I switched to a system where there is one tape of type Arc<RwLock<Tape<N>>> of which every tensor has a copy.
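A minimal sketch of the shared-tape idea (simplified; the real Tape stores boxed backward closures keyed by tensor ids, as seen in add_operation above):
use std::sync::{Arc, RwLock};

pub struct Gradients<const N: usize>; // stand-in for the real gradient map

// One shared tape of backward operations, keyed by the id of the tensor
// whose gradient each closure produces. Every tensor holds an
// Option<Arc<RwLock<Tape<N>>>>: cloning the handle is just a refcount bump,
// and there is nothing to merge after each operation.
pub struct Tape<const N: usize> {
    operations: Vec<(u64, Box<dyn FnOnce(&mut Gradients<N>) + Send + Sync>)>,
}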
These two minor improvements greatly increased the runtime performance of our Chaos Network.
There are two reasons why we went through the pain of creating our own automatic differentiation library when libraries like dfdx already exist: to learn how they work, and to specialize it for our use case. The automatic differentiation library we wrote is highly specialized for our Chaos Network. It only has the functions we need, does not work in many general-purpose situations unless great care is taken, and, in the case of batches, is specially built to manage the gradients of 1-dimensional tensors differently.
Let's examine the new and updated code we created to support batches in the Mul implementation.
#[derive(Debug, Clone)]
pub struct Tensor1D<const N: usize> {
pub id: u64,
pub grad_for: u64,
pub data: [f64; N],
pub tape: Option<Arc<RwLock<Tape<N>>>>,
}
First we introduce the concept of a 1-dimensional tensor (above). Note the const generic N: as we will see, it is the batch size, with data holding one f64 per batch element.
impl<'a, 'b, const N: usize> Mul<&'b mut Tensor1D<N>> for &'a mut Tensor0D<N> {
type Output = Tensor1D<N>;
fn mul(self, other: &'b mut Tensor1D<N>) -> Self::Output {
let new_data: [f64; N] = other.data.map(|d| self.data * d);
let mut new = Tensor1D::new_without_tape(new_data);
match (&self.tape, &other.tape) {
(Some(self_tape), Some(_other_tape)) => {
let new_id = new.grad_for;
let self_id = self.grad_for;
let other_id = other.grad_for;
let self_data = self.data;
let other_data = other.data.clone();
self_tape.write().unwrap().add_operation((
new_id,
Box::new(move |g| {
let mut tg1 = g.remove(new_id);
let mut tg2 = tg1.clone();
tg1.data = element_wise_mul::<N>(&tg1.data, &other_data);
tg2.data = tg2.data.map(|x| x * self_data);
g.insert(self_id, tg1);
g.insert(other_id, tg2);
}),
));
new.set_tape(self.tape.clone());
}
(Some(self_tape), None) => {
let new_id = new.grad_for;
let self_id = self.grad_for;
let other_data = other.data.clone();
self_tape.write().unwrap().add_operation((
new_id,
Box::new(move |g| {
let mut tg = g.remove(new_id);
tg.data = element_wise_mul::<N>(&tg.data, &other_data);
g.insert(self_id, tg);
}),
));
new.set_tape(self.tape.clone());
}
(None, Some(_other_tape)) => {
panic!("Switch operator orientation");
}
(None, None) => (),
}
new
}
}
The above is the only implementation of Mul for both 0-dimensional and 1-dimensional tensors. That would not make sense in the context of a normal automatic differentiation library, but because ours is highly specialized for Chaos Networks, not only does it make sense, it saves us an immense amount of work and computation.
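A hedged sketch of why this one operator suffices: every connection in a Chaos Network is a scalar node weight applied to a batched value, so the forward pass only ever needs a Tensor0D times a Tensor1D (the values below are illustrative):
// A single connection's forward step: scalar node weight times the batched
// incoming value yields the batched outgoing value.
let tape: Arc<RwLock<Tape<3>>> = Arc::new(RwLock::new(Tape::new()));
let mut weight = Tensor0D::new_with_tape(0.5, Some(tape.clone()));
let mut incoming = Tensor1D::new_without_tape([1.0, 2.0, 3.0]);
let outgoing = &mut weight * &mut incoming; // outgoing.data == [0.5, 1.0, 1.5]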
Let's examine one of the passing test cases for our implementation.
#[test]
fn test_mul_1d() {
let tape: Arc<RwLock<Tape<3>>> = Arc::new(RwLock::new(Tape::new()));
let mut a = Tensor0D::new_with_tape(2., Some(tape.clone()));
let mut b = Tensor1D::new_without_tape([1., 2., 3.]);
let mut c = &mut a * &mut b;
// Check value match
assert_eq!([2., 4., 6.], c.data);
// Check gradients
let mut grads = c.backward();
let a_grads = grads.remove(a.grad_for);
assert_eq!([1., 2., 3.], a_grads.data);
}
If you haven't done a lot of work with automatic differentiation everything may seem normal, but the above code is noteworthy because it is different from what PyTorch would yield.
import torch
w = torch.tensor(2., dtype=torch.float64, requires_grad=True)
x = torch.tensor([1, 2, 3], dtype=torch.float64)
y = w * x
print(y) # Prints tensor([2., 4., 6.], dtype=torch.float64, grad_fn=<MulBackward0>)
y.backward(torch.ones_like(x))
print(w.grad) # Prints tensor(6., dtype=torch.float64)
Notice how the gradients of w (the analogous Python variable for a above) are the summation of the gradient vector produced by our implementation of Mul. This is deliberate. Notice the way the gradients are combined in the Mul function above: most noticeably, they are combined through element-wise multiplication.
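Concretely, for the inputs above: our test_mul_1d keeps the per-element gradients [1., 2., 3.] for a, while PyTorch reports the single summed value w.grad = 6.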
All of the above is to say: 1-dimensional tensors are treated as a batch, and each element is a separate input within that batch. This is why gradients are not combined. If you are curious to see how this works, check out the repo!
Adding this form of batching improved the runtime performance of the network by a factor close to the batch size being used (MASSIVE IMPROVEMENT).
We have a neural network represented by an acyclic graph, now we must train it. Before showing any code, let's ask ourselves a few questions.
What should the structure of the network be? We have complete freedom in how we want to structure our Chaos Network, but complete freedom is just that, complete freedom. It is our job to establish laws regulating this structure.
How should the structure of the network change as we train it? One of the core ideas behind the Chaos Network is that it is chaotic, morphing as we train it. Once again, it is up to us to create the laws regulating this growth.
The first question has a few relatively simple answers, and I went with the easiest.
#[derive(Default)]
pub struct Network<const N: usize> {
pub inputs_count: usize,
pub leaves_count: usize,
pub nodes: Vec<Node<N>>,
pub connections_to: HashMap<i32, Vec<usize>>,
pub mode: NetworkMode,
pub tape: Arc<RwLock<Tape<N>>>,
}
impl<const N: usize> Network<N> {
pub fn new(inputs: usize, outputs: usize) -> Self {
let mut network: Network<N> = Network::default();
network.add_nodes(NodeKind::Leaf, outputs);
network.add_nodes(NodeKind::Input, inputs);
network
}
pub fn add_nodes(&mut self, kind: NodeKind, count: usize) {
match kind {
NodeKind::Normal => {
let node_index = self.batch_insert_normal_nodes(count);
for i in 0..(count) {
for _ii in 0..10 {
self.add_node_connection_to(node_index + i);
self.add_node_connection_from(node_index + i);
}
}
}
NodeKind::Input => {
self.inputs_count += count;
let mut nodes: Vec<Node<N>> = (0..count).map(|_i| Node::new(kind)).collect();
for _i in 0..count {
self.nodes.insert(0, nodes.remove(0));
}
self.shift_all_connections_after(0, count, ShiftDirection::Forward);
if self.leaves_count == 0 {
panic!("This should be called after leaves are added for now")
}
let mut rng = rand::thread_rng();
let odds_of_being_picked = 100. / (count as f64);
for i in 0..count {
if odds_of_being_picked > rng.gen::<f64>() {
                        // `distribution_between` (its definition is elided here)
                        // samples an existing node index for this input to connect to
                        let input_node_index = distribution_between.sample(&mut rng);
self.add_connection_between(i, input_node_index);
}
}
}
NodeKind::Leaf => {
self.leaves_count += count;
for _i in 0..count {
self.nodes.push(Node::new(kind));
}
}
}
}
}
The above is the definition of the Network and its new and add_nodes functions. Nodes come in one of three variations: Input, Normal, and Leaf. Input nodes receive input, Normal nodes are like nodes in hidden layers, and Leaf nodes are output.
The new function is syntactic sugar around the add_nodes function. The add_nodes function takes a NodeKind and a count, and adds that count of new nodes of the given kind. Most notably, not all Input nodes start with connections, as this is part of the over-parameterization Chaos Networks are trying to fix.
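As a quick usage sketch (MNIST-sized dimensions; the batch size of 32 is an arbitrary example):
// 784 input nodes and 10 leaf (output) nodes, with a batch size of 32.
// Leaves are added first, then inputs, a random handful of which receive
// starting connections.
let network: Network<32> = Network::new(784, 10);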
That settles our first posed question, but the second question, how the network should grow, is a bit trickier. Outlining the problem further: at its core we have a graph that we want to change, and a heuristic (the validation accuracy) that we can use to evaluate that change. This sounds like a perfect use for evolutionary algorithms. For those not familiar: we essentially have some starting population, make some changes to that population, take the best subset N from the new population evaluated using our heuristic function, and repeat.
In our case, we start with some N Chaos Networks, train those and some randomly modified versions of those, evaluate how well they perform, mix the current population and the new randomly modified population, and repeat.
The code doing this is below. I've had to remove some of the context to keep things smaller.
pub struct StandardClassificationNetworkHandler<const I: usize, const O: usize, const N: usize> {
max_training_steps: usize,
steps_per_training_step: usize,
train_data: Arc<RepeatingNetworkData<I, N>>,
test_data: Arc<RepeatingNetworkData<I, N>>,
validation_steps: usize,
}
impl<const I: usize, const O: usize, const N: usize> StandardClassificationNetworkHandler<I, O, N> {
pub fn train(&mut self) {
// Create the initial population
let mut population: Vec<Network<N>> =
(0..POPULATION_SIZE).map(|_i| Network::new(I, O)).collect();
// Prep some stuff for getting random data points
let train_data_distribution = Uniform::from(0..self.train_data.len());
let test_data_distribution = Uniform::from(0..self.test_data.len());
let mut rng = thread_rng();
// Do the actual training
for training_step in 0..self.max_training_steps {
// Prep the random indexes
let train_data_random_indexes = (0..self.steps_per_training_step)
.map(|_i| {
(0..N)
.map(|_ii| train_data_distribution.sample(&mut rng))
.collect::<Vec<usize>>()
.try_into()
.unwrap()
})
.collect::<Vec<[usize; N]>>();
let test_data_random_indexes = (0..self.validation_steps)
.map(|_i| {
(0..N)
.map(|_ii| test_data_distribution.sample(&mut rng))
.collect::<Vec<usize>>()
.try_into()
.unwrap()
})
.collect::<Vec<[usize; N]>>();
// Current population and maybe new networks
population = if training_step != 0 && training_step % 10 == 0 {
let new_networks = population.iter().map(|x| x.clone()).collect();
let new_morphed_networks = self.train_population(
new_networks,
&train_data_random_indexes,
&test_data_random_indexes,
true,
);
let new_population_networks = self.train_population(
population,
&train_data_random_indexes,
&test_data_random_indexes,
false,
);
// Merge them together
new_population_networks
.into_iter()
.zip(new_morphed_networks.into_iter())
.rev()
.skip(POPULATION_SIZE / 2)
.map(|(a, b)| vec![a.0, b.0])
.flatten()
.collect()
} else {
self.train_population(
population,
&train_data_random_indexes,
&test_data_random_indexes,
false,
)
.into_iter()
.map(|(n, _, _)| n)
.collect()
}
}
}
fn train_population(
&self,
population: Vec<Network<N>>,
train_data_random_indexes: &Vec<[usize; N]>,
test_data_random_indexes: &Vec<[usize; N]>,
do_morph: bool,
) -> Vec<(Network<N>, f64, f64)> {
// Do the training and validation
let new_networks: Vec<(Network<N>, f64)> = population
.into_par_iter()
.map(|mut network| {
if do_morph {
let mut rng = rand::thread_rng();
let percent_nodes_to_add = rng.gen::<f64>() / 10.;
let percent_connections_to_add = rng.gen::<f64>() / 10.;
let percent_connections_to_remove = (rng.gen::<f64>() / 10.) * 3.;
grow(
&mut network,
percent_nodes_to_add,
percent_connections_to_add,
);
prune(&mut network, percent_connections_to_remove);
}
let train_data = self.train_data.clone();
let test_data = self.test_data.clone();
for step in 0..self.steps_per_training_step {
train_next_batch(
&mut network,
train_data.next(&train_data_random_indexes[step]),
);
}
let average_validation_accuracy: f64 = (0..self.validation_steps)
.map(|step| {
validate_next_batch(
&mut network,
test_data.next(&test_data_random_indexes[step]),
)
})
.sum::<f64>()
/ (self.validation_steps as f64);
(network, average_validation_accuracy)
})
.collect();
let network_sizes: Vec<f64> = new_networks
.iter()
.map(|(n, _)| n.get_connection_count() as f64)
.collect();
let min_network_size = network_sizes
.iter()
.min_by(|a, b| a.partial_cmp(&b).unwrap())
.unwrap();
let mut new_networks: Vec<(Network<N>, f64, f64)> = new_networks
.into_iter()
.map(|(n, ava)| {
let connection_count = n.get_connection_count() as f64;
(n, ava, ava + ((min_network_size / connection_count) * 0.05))
})
.collect();
new_networks.sort_by(|a, b| b.2.partial_cmp(&a.2).unwrap());
new_networks
}
}
Essentially, this does exactly what is described above. Let's walk through it:
- We create a starting population of Chaos Networks (for most cases I tried, the population size was between 10 and 100)
- Every training_step we train the population for a selected number of batches
- Every 10 training_steps we take the population, clone it, and perform some random adding and removing of nodes and connections
- We drop the bottom 50% of the current population and replace it with the top 50% of the new networks created from making random changes to the existing population
I made the following changes to the traditional evolutionary algorithm. One, our population size is tiny. This is simply due to hardware restrictions. Two, we have no crossover between members of the population. Most of the time, there is an idea of crossover when mutating the population. We do not do this; instead, each member of the population has its own "child" that gets mutated randomly. It may be beneficial to come back and explore the idea of crossover and changing the mutation function. Three, we do not only choose the Chaos Networks with the highest score. When evaluating which Chaos Networks should survive to the next generation, we take the best 50% of the current population and the best 50% of the new morphed population. This is not standard, but the motivation for it is the same as the motivation in the next point. Four, we do not mutate the population every iteration. I found that mutating the population every training step did not give the new additions enough time to increase their average_validation_accuracy enough to survive another mutation, so our starting networks would become the only networks that ever lasted a round in the population.
The heuristic function used is worth mentioning. We score not only on average_validation_accuracy over some number of batches, but also on the relative size of the networks. This encourages not only a high average_validation_accuracy but also a low network size.
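Concretely, the score computed in train_population above is average_validation_accuracy + (min_network_size / connection_count) * 0.05: the smallest network in a generation receives the full 0.05 bonus, while, for example, a network with twice the connections of the smallest receives only 0.025.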
How does it perform? Not very well, but it is important to qualify that statement.
Through testing, I have found Chaos Networks with about 1,300 connections that reach an average_validation_accuracy of about 93.0% on the MNIST dataset (you could probably train for longer and do better). I have not tested it on other datasets yet.
Now 93.0% is not impressive, but the size of the network achieving it is, from my understanding, very good. To put this into perspective, a standard feed-forward neural network that will solve MNIST will have an input layer with 784 nodes, a hidden layer with 512 nodes, and an output layer with 10 nodes. This standard feed-forward network will have over 400,000 connections (784 × 512 + 512 × 10 = 406,528). I am confident that with some tweaking to the heuristic and the evolutionary algorithm approach we can improve these results even further.
Haters will say you could build a feed-forward neural network that is even smaller and performs better on MNIST, which is (probably) correct, but the Chaos Network is not meant solely for MNIST; it is a framework for building feed-forward neural networks that are not over-parameterized.
Thanks for reading!