Interactive visualization: explore how Delta Activations cluster finetuned LLMs by domain and inspect each model's nearest neighbors.
In the Delta Activation Embedding Space, finetuned models cluster by domain, enabling efficient retrieval of finetuned models by task or domain.
The success of powerful pretrained Large Language Models (LLMs) has enabled the community to create a vast collection of post-trained models adapted to specific tasks and domains. However, navigating and understanding these models remains challenging due to inconsistent metadata and unstructured repositories.
We introduce Delta Activations, a method to represent finetuned models as vector embeddings by measuring shifts in their internal activations relative to a base model. This representation allows for effective clustering by domain and task, revealing structure in the model landscape.
Delta Activations also demonstrate desirable properties: they are robust across finetuning settings, exhibit an additive property when finetuning datasets are mixed, and can embed tasks via finetuning on few-shot examples. We apply our approach to prototype model hubs and show its potential applications in model selection and model merging.
The difference between a finetuned model's hidden state and the base model's hidden state on a shared input quantifies the effect of finetuning.
Delta Activations are computed by passing a fixed set of five generic Alpaca instruction templates through both the base model and the post-trained model. For each probe prompt we extract the final-layer hidden state at the last token and average the difference between the two models' internal representations:

$$v_f = \frac{1}{|P|} \sum_{x \in P} \big( h_f(x) - h_{\text{base}}(x) \big)$$

where $P$ is the probe set, and $h_f(x)$ and $h_{\text{base}}(x)$ denote the final-layer hidden states of the finetuned and base models, respectively. The resulting vector $v_f \in \mathbb{R}^d$ serves as a standalone representation that captures how post-training has shifted the model's internal computations.
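A minimal sketch of this computation using Hugging Face transformers; the model paths, probe prompt, and helper names below are illustrative placeholders rather than the exact setup used here.

```python
# Sketch: compute a Delta Activation embedding for one finetuned model.
# Model IDs and probe prompts are placeholders, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "meta-llama/Llama-3.1-8B"        # base model (assumption)
FINETUNED_ID = "path/to/finetuned-model"   # any model finetuned from the base

# A small set of generic instruction-style probe prompts (placeholders).
PROBE_PROMPTS = [
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n### Instruction:\nDescribe your "
    "capabilities.\n\n### Response:\n",
    # ... plus a few more generic Alpaca-style templates
]

def last_token_hidden_state(model, tokenizer, prompt, device="cuda"):
    """Final-layer hidden state of the last input token."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1, :]   # shape: (hidden_dim,)

def delta_activation(base, finetuned, tokenizer, prompts, device="cuda"):
    """Average shift in final-layer last-token activations over the probe set."""
    deltas = [
        last_token_hidden_state(finetuned, tokenizer, p, device)
        - last_token_hidden_state(base, tokenizer, p, device)
        for p in prompts
    ]
    return torch.stack(deltas).mean(dim=0)   # v_f in R^d

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16, device_map="cuda")
finetuned = AutoModelForCausalLM.from_pretrained(FINETUNED_ID, torch_dtype=torch.bfloat16, device_map="cuda")
v_f = delta_activation(base, finetuned, tokenizer, PROBE_PROMPTS)
```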
Delta Activations offer several technical advantages over weight-based and output-based alternatives, which the comparisons below quantify.
We evaluate Delta Activations by finetuning three base models (LLaMA-3.1-8B, Gemma-2-9B, Qwen-2.5-7B) on datasets from five domains (Legal/LegalBench, Mathematics/GSM-8K, Medical/PubMedQA, Commonsense/HellaSwag, Coding/OpenCoder) using LoRA (r=8, ฮฑ=16) and full finetuning. Each model pool contains 15 finetuned models (3 per domain) trained for 3 epochs with learning rate 1e-4.
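A hedged sketch of such a finetuning configuration using the peft library; only the hyperparameters quoted above (r=8, alpha=16, 3 epochs, lr 1e-4) come from the text, while the output path, dataset, and trainer wiring are omitted or assumed.

```python
# Sketch: LoRA finetuning configuration matching the reported hyperparameters.
# Dataset loading and Trainer setup are intentionally omitted.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

lora_config = LoraConfig(
    r=8,                      # LoRA rank, as reported above
    lora_alpha=16,            # LoRA scaling, as reported above
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = get_peft_model(base, lora_config)

training_args = TrainingArguments(
    output_dir="out/legal-lora-seed0",   # illustrative path
    num_train_epochs=3,
    learning_rate=1e-4,
)
```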
t-SNE visualization showing Delta Activations form clean domain clusters while baseline methods fail to achieve clear separation.
Clustering quality (silhouette score, higher is better) per base model:

| Embedding Space | Dimension | LLaMA | Gemma | Qwen | Average |
|---|---|---|---|---|---|
| Flattened weights | ~2·10⁷ | -.035 | -.060 | -.034 | -.043 |
| PCA on flattened weights | 14 | -.004 | -.007 | -.004 | -.005 |
| Output sentence embeddings | 384 | .221 | -.053 | .096 | .087 |
| Delta Activations | 4096 | .645 | .545 | .653 | .614 |
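A minimal sketch of how such a clustering score could be computed from a pool of embeddings with scikit-learn; the random embeddings, label layout, and cosine metric below are placeholders and assumptions, not the paper's exact evaluation code.

```python
# Sketch: score how well a pool of model embeddings clusters by domain.
# `embeddings` stands in for (n_models, d) Delta Activations (or a baseline),
# `domains` for the ground-truth domain label of each model.
import numpy as np
from sklearn.metrics import silhouette_score

embeddings = np.random.randn(15, 4096)   # placeholder: 15 models, d = 4096
domains = np.repeat(["legal", "math", "medical", "commonsense", "code"], 3)

score = silhouette_score(embeddings, domains, metric="cosine")  # metric is an assumption
print(f"silhouette score: {score:.3f}")  # higher = cleaner domain clusters
```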
When a model is finetuned on a mixture of two datasets $D_1$ and $D_2$, its Delta Activation approximates the sum of the single-domain Delta Activations:

$$v_{D_1 \cup D_2} \approx v_{D_1} + v_{D_2}$$
We test this by comparing cosine similarities:
| Domain D1 | Domain D2 | Mixed vs D1 | Mixed vs D2 | Mixed vs Sum |
|---|---|---|---|---|
| Math | Commonsense | 0.58 | 0.48 | 0.65 |
| Math | Code | 0.70 | 0.27 | 0.73 |
| Medical | Legal | 0.41 | 0.68 | 0.70 |
The mixed model's embedding is consistently closer to the sum than to either individual domain embedding.
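A small sketch of this check, assuming the single-domain and mixed-data Delta Activations have already been computed as sketched earlier; the random tensors below merely stand in for real embeddings.

```python
# Sketch: test the additive property by comparing the mixed-data embedding
# against each single-domain embedding and against their sum.
import torch
import torch.nn.functional as F

def cos(a, b):
    return F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()

# Placeholders standing in for real Delta Activations (d = 4096).
v_math, v_code, v_mixed = (torch.randn(4096) for _ in range(3))

print("mixed vs D1: ", cos(v_mixed, v_math))
print("mixed vs D2: ", cos(v_mixed, v_code))
# With real Delta Activations, the sum is expected to score highest.
print("mixed vs sum:", cos(v_mixed, v_math + v_code))
```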
Delta Activations are stable across training configurations: models maintain domain-specific clustering even when trained with different hyperparameters.
Silhouette scores remain consistently above 0.58 across all variations, confirming the robustness of the representation.
"Some patients have had no ill effects from these medications..."
โ Medical model response to generic prompt
Using only 20 examples, Delta Activations can embed a task and locate the relevant model cluster; on Gemma, retrieval accuracy reaches 100%:
Few-shot embeddings (circles) correctly locate full model clusters on Gemma
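A hedged sketch of the retrieval step: the task embedding `v_task` would come from briefly finetuning the base model on the ~20 examples and computing its Delta Activation as sketched earlier, after which pool models are ranked by cosine similarity. The function name, placeholder pool, and dimensions below are illustrative.

```python
# Sketch: rank a pool of finetuned models against a few-shot task embedding.
import torch
import torch.nn.functional as F

def rank_models(v_task, model_embeddings, top_k=5):
    """Return the top_k model names whose Delta Activations best match the task."""
    scores = {
        name: F.cosine_similarity(v_task.unsqueeze(0), v.unsqueeze(0)).item()
        for name, v in model_embeddings.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Placeholder pool: in practice these are precomputed Delta Activations.
pool = {f"model_{i}": torch.randn(4096) for i in range(15)}
v_task = torch.randn(4096)
print(rank_models(v_task, pool, top_k=3))
```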
Delta Activations extend beyond supervised finetuning to preference-alignment methods.
Delta Activations also work beyond domain-specific finetuning. On Tulu v2 instruction splits with diverse output formats, clustering quality (silhouette score) stays well above that of output embeddings:
| Method | LLaMA | Gemma | Qwen |
|---|---|---|---|
| Output Emb. | 0.02 | -0.03 | 0.10 |
| Delta Act. | 0.49 | 0.32 | 0.48 |
Models finetuned on: CoT, GPT4-Alpaca, ShareGPT, CodeAlpaca, Science splits
We validate model selection for merging on LoraHub, a pool of ~200 FLAN-T5 LoRA models, evaluated on Big-Bench Hard (26 tasks):
| Method | Accuracy | Improvement |
|---|---|---|
| Random Selection | 34.3% | – |
| Delta Activations | 36.3% | +2.0% |
Strategy: select the single most similar model as an anchor plus 19 random models for merging. Interestingly, selecting all 20 most similar models yields only 30.3% due to model interference.
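A sketch of this selection policy; `model_embeddings`, `v_task`, the pool size, and the embedding dimension are assumptions, and the merging step itself (e.g. LoraHub-style composition) is out of scope here.

```python
# Sketch: pick the most similar model as an anchor, then fill with random models,
# mirroring the 1-anchor + 19-random strategy described above.
import random
import torch
import torch.nn.functional as F

def select_for_merging(v_task, model_embeddings, n_total=20, seed=0):
    """Return one nearest-neighbor anchor plus (n_total - 1) random models."""
    anchor = max(
        model_embeddings,
        key=lambda name: F.cosine_similarity(
            v_task.unsqueeze(0), model_embeddings[name].unsqueeze(0)
        ).item(),
    )
    others = [name for name in model_embeddings if name != anchor]
    random.seed(seed)
    return [anchor] + random.sample(others, n_total - 1)

# Placeholder pool standing in for ~200 precomputed LoRA-module embeddings.
pool = {f"lora_{i}": torch.randn(1024) for i in range(200)}
task = torch.randn(1024)
selected = select_for_merging(task, pool)   # 1 anchor + 19 random models
```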
We introduce Delta Activations, a method to represent post-trained large language models as vector embeddings by measuring shifts in their internal activations relative to a base model. Our experiments demonstrate that this representation effectively captures model specialization, forming coherent clusters by domain and task without requiring access to training data or external metadata.
The empirical analysis reveals several desirable properties of the Delta Activation embedding space. The additive property enables compositional understanding of models trained on mixed datasets, while robustness across training configurations ensures consistent representations. The ability to embed tasks through few-shot finetuning opens applications in model selection and retrieval from large repositories.
Our validation on real-world model hubs demonstrates practical applicability, with Delta Activations achieving superior clustering quality compared to weight-based and output-based alternatives. The method extends beyond supervised finetuning to preference optimization paradigms, suggesting broad applicability across post-training techniques.
While Delta Activations provide effective model representations, several limitations warrant investigation. The choice of probe prompts influences embedding quality, and optimal prompt design remains an open question. Additionally, the method assumes access to the base model, which may not always be available for proprietary systems. Future work could explore applications to model merging and investigate the theoretical basis for why generic prompts elicit domain-specific signals.