How we built this and how to read it.
Sperm whales communicate with codas: short bursts of clicks with specific timing patterns. Whales self-sort into vocal clans based on shared coda dialects. We wanted to understand what a language model learns about these dialects internally.
We trained a small transformer (CodaLM) on Pacific coda sequences to predict the next coda in a sequence. Then we applied a sparse autoencoder (SAE) to decompose the model's internal representations into individual, interpretable features.
The SAE discovered features that encode clan identity without the model ever seeing clan labels. Injecting a single feature's direction back into the model causally steers its predictions to the target clan.
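As a rough illustration of what injecting a feature's direction can look like, the sketch below adds a scaled, unit-normalized decoder row to a batch of hidden states. The names `steer`, `W_dec`, and `alpha`, the decoder layout (one row per feature), and the steering strength are all assumptions made for this example. The text does not say which layer is modified or how strongly.

```python
import numpy as np

def steer(h, W_dec, feature_idx, alpha=4.0):
    """Add one SAE feature's direction to hidden states.

    h           : (n, d_model) hidden states
    W_dec       : (n_features, d_model) decoder matrix (assumed layout)
    feature_idx : index of the feature to inject
    alpha       : steering strength (hypothetical value)
    """
    direction = W_dec[feature_idx]
    direction = direction / np.linalg.norm(direction)  # unit length
    return h + alpha * direction  # broadcast over the batch

# Toy usage with random stand-ins for real weights and activations.
rng = np.random.default_rng(0)
W_dec = rng.normal(size=(512, 128))
h = rng.normal(size=(4, 128))
h_steered = steer(h, W_dec, feature_idx=7)
```

In a real run, the steered hidden states would be passed through the rest of the model, and the shift in its next-coda predictions compared against the unsteered baseline.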
The SAE encoder maps each 128-dimensional hidden state to 512 features, of which only 12 are active for any given hidden state. Of the 512 features, 191 fire on at least 1% of windows; the remaining 321 fall below that threshold and are treated as dead. This explorer shows only the 191 alive features.
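The sizes above can be made concrete with a small sketch. It assumes the sparsity comes from a top-k rule (keep the 12 largest activations, zero the rest) and that there is one hidden state per window. The text does not confirm either, and the weights here are random placeholders, not the trained SAE.

```python
import numpy as np

D_MODEL, N_FEATURES, K = 128, 512, 12  # sizes quoted in the text

def topk_encode(h, W_enc, b_enc, k=K):
    """Map hidden states (n, D_MODEL) to sparse features (n, N_FEATURES).

    Applies ReLU, then keeps the k largest activations in each row.
    The top-k rule is an assumption about how sparsity is enforced.
    """
    pre = np.maximum(h @ W_enc + b_enc, 0.0)
    top = np.argpartition(pre, -k, axis=1)[:, -k:]  # k largest per row
    rows = np.arange(pre.shape[0])[:, None]
    z = np.zeros_like(pre)
    z[rows, top] = pre[rows, top]
    return z

def alive_mask(z, min_rate=0.01):
    """True for features that are nonzero on at least min_rate of rows."""
    firing_rate = (z > 0).mean(axis=0)
    return firing_rate >= min_rate

# Random stand-ins: 1,131 windows, one hidden state each (assumed).
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(D_MODEL, N_FEATURES)) / np.sqrt(D_MODEL)
b_enc = np.zeros(N_FEATURES)
h = rng.normal(size=(1131, D_MODEL))

z = topk_encode(h, W_enc, b_enc)
alive = alive_mask(z)
```

With trained weights, `alive.sum()` would be the count of features that pass the 1% threshold.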
The dataset contains 23,555 codas from the Pacific Ocean (Hersh et al., 2022), spanning 7 vocal clans recorded at 23 locations between 1978 and 2017. Each coda is represented by its inter-click interval (ICI) vector. We slide 32-coda windows across each recording session with a stride of 16 codas, yielding 1,131 windows.
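The windowing step can be sketched as follows. This version assumes that windows never cross session boundaries, that sessions shorter than 32 codas produce no windows, and that a trailing partial window is dropped. The text does not state these details, and the function names and session layout are hypothetical.

```python
def window_starts(n_codas, window=32, stride=16):
    """Start indices of all full windows inside one session."""
    if n_codas < window:
        return []
    return list(range(0, n_codas - window + 1, stride))

def make_windows(sessions, window=32, stride=16):
    """Slice each session into overlapping windows.

    sessions : dict mapping a session id to its list of codas
               (each coda would be an ICI vector in practice)
    Returns a list of (session_id, window_of_codas) pairs.
    """
    out = []
    for session_id, codas in sessions.items():
        for start in window_starts(len(codas), window, stride):
            out.append((session_id, codas[start:start + window]))
    return out

# Toy sessions: integers stand in for ICI vectors.
sessions = {"s1": list(range(70)), "s2": list(range(20))}
windows = make_windows(sessions)
```

With a stride of half the window length, consecutive windows from the same session share 16 codas.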
Clan-specific features fire primarily for one or two clans. These encode cultural vocal patterns.
Universal features fire across all clans. These encode structural or sequential patterns common to all whales.
Intermediate features fall in between: they fire for several clans but not for all of them.
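One way to turn the three categories above into a rule is to compare each feature's firing rate across clans. The sketch below is only an illustration: the relative threshold of 0.25 and the functions `clan_firing_rates` and `categorize` are assumptions, not the criteria used to label features in this explorer.

```python
import numpy as np

def clan_firing_rates(z, clan_ids, n_clans):
    """Per-clan firing rate of each feature -> (n_features, n_clans).

    z        : (n_windows, n_features) SAE activations
    clan_ids : (n_windows,) integer clan label per window
    Assumes every clan has at least one window.
    """
    fired = z > 0
    return np.stack(
        [fired[clan_ids == c].mean(axis=0) for c in range(n_clans)], axis=1
    )

def categorize(rates, rel_threshold=0.25):
    """Label each feature by how many clans it fires for.

    A clan counts as active for a feature when its firing rate is at
    least rel_threshold times the feature's highest per-clan rate.
    Intended for alive features only (a feature that never fires has
    no active clans and would be labeled clan-specific).
    """
    peak = np.maximum(rates.max(axis=1, keepdims=True), 1e-12)
    n_active = (rates >= rel_threshold * peak).sum(axis=1)
    n_clans = rates.shape[1]
    return np.where(
        n_active <= 2,
        "clan-specific",
        np.where(n_active == n_clans, "universal", "intermediate"),
    )

# Toy rates for three features across four clans.
rates = np.array([
    [0.90, 0.01, 0.00, 0.00],
    [0.50, 0.40, 0.50, 0.45],
    [0.80, 0.70, 0.60, 0.00],
])
labels = categorize(rates)
```

A relative threshold keeps the rule independent of how often a feature fires overall; an absolute firing-rate cutoff would be an equally plausible choice.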