A redwood tree stump, Janith Wanniarachchi 2025

Winning the Neural Network Lottery by Chance

How I spent a year trying to fit neural networks with the worst luck imaginable
Janith Wanniarachchi
[email protected]
Supervised by Prof. Dianne Cook, Dr. Kate Saunders, Dr. Patricia Menendez, Dr. Thiyanga Talagala

Let’s be honest here

Building a good model is hard

Explaining how a good model works is even harder


Exhibit A: The good model

What if you could
poke around and find out
how this model works?

Introducing

Explainable AI (XAI) methods!

XAI has a lot of facets

XAI can help you look at

How the model reacts to different features overall using Global Interpretability Methods

How the model arrives at a prediction for one single instance using Local Interpretability Methods

Explaining one prediction

There are several key local interpretability methods that are related to each other in how they approach the problem:

  1. LIME (Local Interpretable Model-agnostic Explanations)
  2. SHAP (SHapley Additive exPlanations)
  3. Anchors
  4. Counterfactuals

But what if,


instead of looking at the numerical values of these XAI methods


we represented the numbers as visual objects within the data space itself?

LIME

LIME works by finding the simplest interpretable model within the local neighbourhood of an observation that approximates the original black-box model as closely as possible. For a given observation, the LIME explanations are therefore the coefficients of that interpretable model (e.g. a Generalized Linear Model).
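
As a rough sketch of what this looks like in practice (using the lime and caret R packages; the dataset and model here are placeholders, not the ones used in this work):

library(caret)
library(lime)

# Placeholder black-box model: a random forest fit through caret on the iris data
model <- train(iris[, 1:4], iris$Species, method = "rf")

# Build a LIME explainer around the training data and the black-box model
explainer <- lime(iris[, 1:4], model)

# Explain a single observation: fit a simple local model and keep its top 4 features
explanation <- explain(iris[1, 1:4], explainer, n_labels = 1, n_features = 4)

# feature_weight holds the coefficients of the local interpretable model
explanation[, c("feature", "feature_weight")]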

Figure 1: Geometric representation of LIME in two dimensions

SHAP

Similar to LIME, SHAP builds a linear model around the given observation, with the features mapped to a binary vector indicating whether each feature is present or not.

SHAP is based on Shapley values, which distribute a reward among cooperating players in a game. In this context the players are the features of the model and the reward is the prediction.

The coefficients of the linear model are then given by the Shapley values and can be interpreted as the contribution each feature makes towards the prediction.
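
For completeness, the classical Shapley value that SHAP approximates can be written as (this is the standard formulation, restated here rather than taken from this work):

\[ \phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{j\}}(\boldsymbol{x}) - f_{S}(\boldsymbol{x}) \right] \]

where \(F\) is the set of all features and \(f_S\) denotes the model evaluated using only the features in the coalition \(S\). The prediction then decomposes additively as \(f(\boldsymbol{x}) \approx \phi_0 + \sum_j \phi_j z_j\), where \(z_j\) is the binary indicator mentioned above.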

Figure 2: Geometric representation of SHAP in two dimensions.

Counterfactuals

A counterfactual explanation \(\boldsymbol{x}_i^{(c)}\) for a given \(\boldsymbol{x}_i\) and a desired outcome value \(y_{i}^{(\exp)}\), is defined as an observation satisfying the following conditions:

  1. \(y_{i}^{(\exp)} \approx f(\boldsymbol{x}_i^{(c)})\).
  2. \(\boldsymbol{x}_i\) and \(\boldsymbol{x}_i^{(c)}\) are close to each other in the data space.
  3. \(\boldsymbol{x}_i^{(c)}\) differs from \(\boldsymbol{x}_i\) only in a few components.
  4. \(\boldsymbol{x}_i^{(c)}\) is a plausible data point according to the distribution of each dimension.
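
One common way to operationalise the first two conditions (following the formulation of Wachter et al.; a standard sketch rather than the exact objective used in this work) is:

\[ \boldsymbol{x}_i^{(c)} = \arg\min_{\boldsymbol{x}'} \left( f(\boldsymbol{x}') - y_{i}^{(\exp)} \right)^2 + \lambda \, d(\boldsymbol{x}_i, \boldsymbol{x}') \]

where \(d\) is a distance in the data space and \(\lambda\) trades off reaching the desired outcome against staying close to the original observation; the sparsity and plausibility conditions (3 and 4) are typically handled with additional penalty terms or constraints.
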
Figure 3: Geometric representation of Counterfactuals in two dimensions. Hollow diamond shapes represent the counterfactual observations for the observations in solid circles connected through a line.

Anchors

An anchor is defined as a rule, or a set of predicates, that the given instance satisfies and that is a sufficient condition for \(f(\boldsymbol{x}_i)\) with high probability. A predicate is a logical condition that an observation may or may not satisfy.

Finding an anchor for a given instance can be defined as the solution to the following optimization problem,

\[ \max_{\mathcal{A} \text{ s.t. } \text{Pr}(\text{Prec}(\mathcal{A}) \ge \tau) \ge 1 - \delta} \text{Coverage}(\mathcal{A}) \]

The target is then to maximize the coverage while ensuring, with probability at least \(1 - \delta\), that the precision stays above the tolerance level \(\tau\).
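
Here precision and coverage follow the usual definitions from the Anchors literature (restated here for completeness):

\[ \text{Prec}(\mathcal{A}) = \mathbb{E}_{\mathcal{D}(\boldsymbol{x} \mid \mathcal{A})} \left[ \mathbb{1}_{f(\boldsymbol{x}) = f(\boldsymbol{x}_i)} \right], \qquad \text{Coverage}(\mathcal{A}) = \mathbb{E}_{\mathcal{D}(\boldsymbol{x})} \left[ \mathcal{A}(\boldsymbol{x}) \right] \]

that is, precision is the probability that a perturbed sample satisfying the anchor receives the same prediction as \(\boldsymbol{x}_i\), and coverage is the probability that a sample from the data distribution satisfies the anchor at all.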

Figure 4: Geometric representation of Anchors in two dimensions

Kultarr R package

The existing implementations of Anchors were quite hard to work with and quite slow, as they relied on an existing Java package.

Kultarr is an R package that aims to provide an implementation of Anchors using a simpler algorithm and a complete set of orthogonal predicates.

Try the package out from https://github.com/janithwanni/kultarr

The proposed geometric representations,


LIME can be seen as regression lines
SHAP can be seen as force vectors
Counterfactuals can be seen as connecting lines
Anchors can be seen as boxes

Further updates to detourr

  • Converted the structure of the widget to a single unified widget.
  • Officially supported features now include:
    • Clicking on points in the detourr widget now returns the identifier.
    • A proxy was created to communicate with an existing rendered detourr widget.
    • Added support for adding points and edges connecting new points.
    • Options to customize aesthetics of points and edges.

Putting it all together

Rosella R package

The current state of the XAI explorer was bundled into an R package so that anyone with compute can run the Shiny app on their own server.

rosella is an R package that lets you run the XAI explorer locally on your laptop. Support for generating XAI explanations on users' own datasets is still under development.

Try the package out from https://github.com/janithwanni/rosella

Let’s join a cult

How long has it been since you last spoke to an LLM?

Being agnostic is fun and all,

Model-agnostic XAI methods can poke and prod at models like neural networks, but

we want to crack open these models to understand their specific internals and possibly simplify them.

So let’s fit a neural network

Simple stuff first,

Let’s take a look at the following training dataset

Quick refresher on how neural networks work

Playground

All right, for our dataset,

Fixed hyperparameters (for the plot)

  • Number of epochs: 100
  • Batch size: 71
  • Loss function: Binary Cross Entropy loss
  • Optimizer: Adam
  • Training set size: 5,000
  • Testing set size: 5,000
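
As a minimal sketch of this setup (written with R's torch package for illustration; the app itself fits the models with PyTorch, and the seed and layer width below are only placeholders):

library(torch)

torch_manual_seed(2025)          # placeholder seed: controls the initial weights
n_neurons <- 4                   # placeholder: the quantity we are about to vote on

model <- nn_sequential(
  nn_linear(2, n_neurons),       # two input features
  nn_relu(),
  nn_linear(n_neurons, 1),
  nn_sigmoid()                   # probability output for binary classification
)

loss_fn   <- nnf_binary_cross_entropy                  # Binary Cross Entropy loss
optimizer <- optim_adam(model$parameters, lr = 0.01)   # lr matches the backend flowchart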

What shall we pick as the number of neurons in a single layer?

Join the online poll at menti.com with code 8902 9625

So based on your results,

This is the result for a single layer neural network with,

Are we sure though?

Wait, Hol’ up

Any guesses on what exactly happened here?

For an entire year I used the wrong seed

The random seed has an effect on how neural networks are built.

But why does the seed have an effect?

  • It can primarily be due to the initial weights used in training, an effect that has been discussed in Model-Agnostic Meta-Learning (MAML) and in many other settings (see the sketch after this list).
  • It can also be due to the shuffling of mini-batches.
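
A minimal sketch of the first point, again using R's torch package:

library(torch)

# The same layer initialised under two different seeds starts from different weights,
# so gradient descent follows a different path through the loss surface.
torch_manual_seed(1)
w1 <- nn_linear(2, 4)$weight

torch_manual_seed(2)
w2 <- nn_linear(2, 4)$weight

torch_allclose(w1, w2)  # FALSE: training begins at a different point in weight space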

Cool story mate, so what?

Smaller models for small datasets

I could have chucked in a neural network with two hidden layers of 20 neurons each and trained it for 1,000 epochs,

But for this dataset? Does this decision boundary demand a model with more parameters than there are bends in the decision boundary?

Fixing things before it starts


In the majority of deep learning educational material, the effect of the random seed is very rarely discussed.


That is why I built a tool for data science educators, so that students can try out multiple model variants and see the differences for themselves.

And today you get to try that out as well

As the second batch of students to experience the app

Introducing The Blind Box Simulator

Visit random-nn-playground.janithwanni.com

The architecture behind the app

graph TB
    
    %% Main Application Layer
    subgraph "Main Application (R/Shiny)"
        %% Frontend Layer
        subgraph "Frontend Layer"
            UI[Fomantic UI Components]
            Squiggler[Squiggler Tool<br/>Svelte → JS/CSS]
        end

        Shiny[Shiny Web Framework<br/>Rhino for<br/>Codebase Management]
        
        %% Visualization Components
        subgraph "Visualization Libraries"
            GGPlot[ggplot2<br/>Static Plots]
            Beeswarm[ggbeeswarm<br/>Plot Geometry]
            Ggiraph[ggiraph<br/>Interactivity]
            Gifski[gifski<br/>GIF Animations<br/>Rust-powered]
        end

        %% Backend Processing Layer
        subgraph "Backend Processing"
            Python[Python Background Tasks<br/>Model Fitting<br/>with PyTorch<br/>]
            Parquet[Parquet Files<br/>for Data Transfer]
        end

    end
    
    
    %% Data Layer
    %% subgraph "Data Storage"
    %% end
    
    %% Connections
    UI --> Shiny
    Squiggler --> Shiny
    
    %% Visualization connections
    Shiny --> GGPlot
    Shiny --> Beeswarm
    Shiny --> Ggiraph
    Shiny --> Gifski
    
    %% Interactive plot creation
    Beeswarm --> Ggiraph
    
    %% Backend connections
    Shiny -.->|Background Task| Python
    Python --> Parquet
    Shiny --> Parquet
    
    %% Styling
    classDef frontend fill:#e1f5fe
    classDef rshiny fill:#f3e5f5
    classDef backend fill:#fff3e0
    classDef data fill:#e8f5e8
    
    class UI,Squiggler frontend
    class Shiny,GGPlot,Beeswarm,Ggiraph,Gifski rshiny
    class Python backend
    class Parquet data
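
A rough sketch of the Shiny-to-Python hand-off implied by this diagram (the script, config, and file names here are hypothetical; the real app wires this up through Shiny's background task machinery):

library(arrow)
library(processx)

# Launch the PyTorch fitting job as a background process (hypothetical script/config names)
fit_process <- process$new("python", c("fit_models.py", "config.json"))

# The Shiny session carries on; once the job finishes, the fitted decision boundaries
# are read back from the parquet files the Python side wrote out
fit_process$wait()
boundaries <- read_parquet("results/boundaries.parquet")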

The backend

flowchart LR

    A["Decision Boundary Function f
    <br/>
    Neuron Sizes N"] --> C 

    C["Generate 10,000 Samples
    <br/>
    Uniformly from [-10, 10] × [-10, 10] 
    <br/>
    using same seed"] --> P2
   
    subgraph P2 ["Train"]
        direction TB
        G["For Each Neuron Size n ∈ N"]
        H["Select Random Seed s ∈ [1, 99999]"]
        I["Initialize Neural Network:
        <br/>Input: 2,
        <br/>Hidden: n (ReLU),
        <br/>Output: 1 (Sigmoid)"]
        J["Configure: 
        <br/>Batch=71,<br/> Adam lr=0.01,<br/> BCE Loss"]
        
        K["Fit model for 100 training epochs"]  
        O{More Seeds/Neurons?}
        
        G --> H --> I --> J --> K
        K --> O
        O -->|Yes| H
    end
    
    P2 --> P3
    
    subgraph P3 ["Evaluate"]
        direction TB
        P[For Each Trained Model M]
        Q[Calculate F1-Score & Accuracy on the testing dataset]
        R[Apply Model to Grid of 100x100 points]
        T[Return: Models, Metrics, Boundaries]
        
        P --> Q --> R --> T 
    end
    

Where to from here?

Attention is all you need, but did we need this much attention?

What are we planning on doing?

Using a simple toy transformer model trained to solve problems of the form

5 + 7 % 10 = ?.
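
Assuming the task is addition modulo 10, the full training set for such a toy model is tiny; a sketch of generating it in R:

# Every pair of single-digit operands and their sum modulo 10
problems <- expand.grid(a = 0:9, b = 0:9)
problems$answer <- (problems$a + problems$b) %% 10
subset(problems, a == 5 & b == 7)   # 5 + 7 mod 10 = 2, matching the example above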

Toy Transformer Models

Visualise how the attention heads learn by using Sparse Autoencoders

Training Dynamics in Toy Models of Superposition

Looking back and forward

Thesis Structure

mindmap
  root((The thesis))
    How can traditional machine learning models be explained visually?
      Kultarr: A simple, visual rewrite of Anchors
        Geometric representations of XAI
          Rosella: An XAI explorer
    What are the hurdles to explain deep learning models?
      Complex models over parsimonious models
        The effect of random numbers used in the model
    How can we make a dent in understanding large language models?
      Visualizing the trajectory of weights over time
      Identifying circuits between attention heads

Timeline

Thank you!

Have any suggestions or ideas?

The colour palette for these slides is inspired by the photograph by Bill Henson, part of the art installation Oneiroi in the Hellenic Museum, Melbourne.

Janith Wanniarachchi

@janithwanni
janith-wanniarachchi
janithwanni.netlify.app