
CodaLM SAE Guide

How we built this and how to read it.

What we did

Sperm whales communicate with codas: short bursts of clicks with specific timing patterns. Whales self-sort into vocal clans based on shared coda dialects. We wanted to understand what a language model learns about these dialects internally.

We trained a small transformer (CodaLM) on Pacific coda sequences to predict the next coda in a sequence. Then we applied a sparse autoencoder (SAE) to decompose the model's internal representations into individual, interpretable features.

The SAE discovered features that encode clan identity even though the model never saw clan labels. Injecting a single clan feature's direction back into the model steers its predictions toward that feature's clan, which indicates the feature is causally used rather than merely correlated with clan.
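As a rough illustration of what "injecting a feature's direction" can mean, the sketch below adds a scaled, unit-normalized direction vector to a hidden state. It assumes the direction is the feature's SAE decoder vector and that steering is a simple additive edit. The function name `steer`, the `scale` parameter, and the toy values are hypothetical and are not taken from the CodaLM code.

```python
def steer(hidden, direction, scale):
    """Return hidden + scale * unit(direction), using plain lists of floats."""
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + scale * d / norm for h, d in zip(hidden, direction)]


# Toy 3-dimensional example (CodaLM's hidden states are 128-dimensional).
hidden = [0.5, -1.0, 2.0]
direction = [0.0, 3.0, 4.0]  # stand-in for one feature's decoder vector
steered = steer(hidden, direction, scale=5.0)
```

In a real experiment the steered hidden state would be passed through the rest of the model, and the shift in next-coda predictions compared against the unsteered run.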

Pipeline

Coda sequences (32 per window)
↓
CodaLM (4 layers, 4 heads, d=128, ~800K params)
↓ hidden states (128-dim)
TopK SAE (k=12, 512 features)
↓ sparse activations
191 alive features → this explorer

The SAE encoder maps each 128-dimensional hidden state to 512 features, with only 12 active at a time. Of the 512, 191 fire on at least 1% of windows. The remaining 321 fall below that threshold and are treated as dead. This explorer shows only the alive features.
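The encoder step and the alive-feature filter can be sketched as follows. This is a minimal illustration, not the project's implementation. It assumes a linear encoder followed by ReLU and a top-k mask, and it assumes each row of activations corresponds to one window (how per-coda activations are pooled per window is not specified here). The names `topk_encode` and `alive_features` are hypothetical, and the weights are random stand-ins for the trained SAE.

```python
import random


def topk_encode(hidden, W_enc, b_enc, k):
    """Linear encoder + ReLU, then zero everything except the k largest values."""
    pre = [max(0.0, sum(w * h for w, h in zip(row, hidden)) + b)
           for row, b in zip(W_enc, b_enc)]
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    sparse = [0.0] * len(pre)
    for i in keep:
        sparse[i] = pre[i]
    return sparse


def alive_features(activations, min_rate=0.01):
    """Indices of features that fire (> 0) on at least min_rate of the rows."""
    n_rows = len(activations)
    n_features = len(activations[0])
    rates = [sum(1 for row in activations if row[j] > 0.0) / n_rows
             for j in range(n_features)]
    return [j for j, rate in enumerate(rates) if rate >= min_rate]


# Toy run with the sizes from this guide; weights and inputs are random.
rng = random.Random(0)
d_model, n_features, k = 128, 512, 12
W_enc = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(n_features)]
b_enc = [0.0] * n_features
hidden_states = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(8)]
activations = [topk_encode(h, W_enc, b_enc, k) for h in hidden_states]
alive = alive_features(activations)
```

Plain lists keep the sketch dependency-free. A practical version would use a tensor library and batch the encoder over all hidden states at once.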

Data

23,555 codas from the Pacific Ocean (Hersh et al., 2022), spanning 7 vocal clans recorded at 23 locations between 1978 and 2017. Each coda is represented by its inter-click interval (ICI) vector. We slide 32-coda windows across recording sessions with stride 16, yielding 1,131 windows.
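The windowing step might look like the sketch below. It assumes windows never cross a recording-session boundary and that a trailing partial window is dropped. Neither detail is stated above, and the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(session, size=32, stride=16):
    """Return full windows of `size` items, advancing by `stride`.

    A trailing chunk shorter than `size` is dropped (an assumption).
    """
    return [session[start:start + size]
            for start in range(0, len(session) - size + 1, stride)]


# A toy session of 80 codas, represented here by their indices.
session = list(range(80))
windows = sliding_windows(session)
```

With a stride of half the window length, consecutive windows overlap by 16 codas, so most codas appear in two windows.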

How to read the explorer

Sidebar
All 191 alive features. Each row shows the feature index, dominant clan (colored dot), type badge, firing rate, and selectivity score. Click to view details. Search by index, clan, or coda type.
Selectivity
How concentrated a feature's activation is across clans. 1.0 = fires for only one clan. ~0.14 (about 1/7) = fires equally across all seven. This is the large number on each sidebar row.
Clan activation bars
Mean activation for each clan. Lopsided bars = clan-selective. Flat bars = shared pattern.
Top activating windows
Coda sequences where this feature fires most strongly. Each colored chip is a coda type. Brighter = more relevant to this feature, faded = less relevant. Click any chip to see that coda type's full page.
Logit lens
What this feature does to next-coda prediction. "Promotes" = pushes toward predicting these types next. "Suppresses" = pushes away. Click any type to see its detail page.
Co-firing
Features that activate on the same windows. High percentage = they encode related patterns. Click to jump to that feature.
Histogram
Distribution of activation strengths when the feature fires. Tight peak = consistent. Long tail = graded response.
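One definition of selectivity that reproduces both reference values given above (1.0 for a single clan, about 0.14 for seven equal clans) is the largest clan's share of the summed per-clan mean activation. The sketch below uses that definition as an assumption. The explorer's actual formula is not stated here, and the function name `selectivity` is hypothetical.

```python
def selectivity(clan_means):
    """Largest clan's share of total mean activation (assumed definition)."""
    total = sum(clan_means)
    if total == 0:
        return 0.0  # feature never fires; treat as unselective
    return max(clan_means) / total


one_clan = selectivity([2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
all_equal = selectivity([1.0] * 7)
```

Under this definition the score is bounded below by 1/7 for any feature that fires at all, which is why a value near 0.14 reads as "no clan preference".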

Feature types

Clan-specific features fire primarily for one or two clans. These encode cultural vocal patterns.

Universal features fire across all clans. These encode structural or sequential patterns common to all whales.

Intermediate features are in between: shared by some clans but not others.

Keyboard

j / k: browse features (vim-style)