Interactive visualization: explore how Delta Activations cluster finetuned LLMs by domain and inspect each model's nearest neighbors.
In the Delta Activation Embedding Space, finetuned models cluster by domain, enabling efficient retrieval of finetuned models by task or domain.
The success of powerful pretrained Large Language Models (LLMs) has enabled the community to create a vast collection of post-trained models adapted to specific tasks and domains. However, navigating and understanding these models remains challenging due to inconsistent metadata and unstructured repositories.
We introduce Delta Activations, a method to represent finetuned models as vector embeddings by measuring shifts in their internal activations relative to a base model. This representation allows for effective clustering by domain and task, revealing structure in the model landscape.
Delta Activations also demonstrate desirable properties: they are robust across finetuning settings, exhibit an additive property when finetuning datasets are mixed, and can embed tasks via finetuning on few-shot examples. We apply our approach to prototype model hubs and show its potential applications in model selection and model merging.
The difference between a finetuned model's hidden state and the base model's hidden state on a shared input quantifies the effect of finetuning.
Delta Activations are computed by passing a fixed set of five generic Alpaca instruction templates through both the base model and the post-trained model. For each probe prompt we extract the final-layer hidden state at the last token and average the difference between the two models' internal representations:

$$v_f = \frac{1}{|P|} \sum_{x \in P} \big( h_f(x) - h_{\text{base}}(x) \big)$$

where $P$ is the probe set, and $h_f(x)$ and $h_{\text{base}}(x)$ denote the final-layer hidden states of the finetuned and base models, respectively. The resulting vector $v_f \in \mathbb{R}^d$ serves as a standalone representation that captures how post-training has shifted the model's internal computations.
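A minimal sketch of this computation using Hugging Face transformers; the model paths, probe prompt, and helper names below are illustrative placeholders rather than the exact setup used here.

```python
# Sketch: compute a Delta Activation embedding for one finetuned model.
# Model IDs and probe prompts are placeholders, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "meta-llama/Llama-3.1-8B"        # base model (assumption)
FINETUNED_ID = "path/to/finetuned-model"   # any model finetuned from the base

# A small set of generic instruction-style probe prompts (placeholders).
PROBE_PROMPTS = [
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n### Instruction:\nDescribe your "
    "capabilities.\n\n### Response:\n",
    # ... plus a few more generic Alpaca-style templates
]

def last_token_hidden_state(model, tokenizer, prompt, device="cuda"):
    """Final-layer hidden state of the last input token."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1, :]   # shape: (hidden_dim,)

def delta_activation(base, finetuned, tokenizer, prompts, device="cuda"):
    """Average shift in final-layer last-token activations over the probe set."""
    deltas = [
        last_token_hidden_state(finetuned, tokenizer, p, device)
        - last_token_hidden_state(base, tokenizer, p, device)
        for p in prompts
    ]
    return torch.stack(deltas).mean(dim=0)   # v_f in R^d

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16, device_map="cuda")
finetuned = AutoModelForCausalLM.from_pretrained(FINETUNED_ID, torch_dtype=torch.bfloat16, device_map="cuda")
v_f = delta_activation(base, finetuned, tokenizer, PROBE_PROMPTS)
```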
Delta Activations offer several technical advantages over weight-based and output-based alternatives, which the comparisons below quantify.
We evaluate Delta Activations by finetuning three base models (LLaMA-3.1-8B, Gemma-2-9B, Qwen-2.5-7B) on datasets from five domains (Legal/LegalBench, Mathematics/GSM-8K, Medical/PubMedQA, Commonsense/HellaSwag, Coding/OpenCoder) using LoRA (r=8, ฮฑ=16) and full finetuning. Each model pool contains 15 finetuned models (3 per domain) trained for 3 epochs with learning rate 1e-4.
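A hedged sketch of such a finetuning configuration using the peft library; only the hyperparameters quoted above (r=8, alpha=16, 3 epochs, lr 1e-4) come from the text, while the output path, dataset, and trainer wiring are omitted or assumed.

```python
# Sketch: LoRA finetuning configuration matching the reported hyperparameters.
# Dataset loading and Trainer setup are intentionally omitted.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

lora_config = LoraConfig(
    r=8,                      # LoRA rank, as reported above
    lora_alpha=16,            # LoRA scaling, as reported above
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = get_peft_model(base, lora_config)

training_args = TrainingArguments(
    output_dir="out/legal-lora-seed0",   # illustrative path
    num_train_epochs=3,
    learning_rate=1e-4,
)
```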
t-SNE visualization showing Delta Activations form clean domain clusters while baseline methods fail to achieve clear separation.
Clustering quality (silhouette score, higher is better) per base model:

| Embedding Space | Dimension | LLaMA | Gemma | Qwen | Average |
|---|---|---|---|---|---|
| Flattened weights | ~2·10⁷ | -.035 | -.060 | -.034 | -.043 |
| PCA on flattened weights | 14 | -.004 | -.007 | -.004 | -.005 |
| Output sentence embeddings | 384 | .221 | -.053 | .096 | .087 |
| Delta Activations | 4096 | .645 | .545 | .653 | .614 |
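A minimal sketch of how such a clustering score could be computed from a pool of embeddings with scikit-learn; the random embeddings, label layout, and cosine metric below are placeholders and assumptions, not the paper's exact evaluation code.

```python
# Sketch: score how well a pool of model embeddings clusters by domain.
# `embeddings` stands in for (n_models, d) Delta Activations (or a baseline),
# `domains` for the ground-truth domain label of each model.
import numpy as np
from sklearn.metrics import silhouette_score

embeddings = np.random.randn(15, 4096)   # placeholder: 15 models, d = 4096
domains = np.repeat(["legal", "math", "medical", "commonsense", "code"], 3)

score = silhouette_score(embeddings, domains, metric="cosine")  # metric is an assumption
print(f"silhouette score: {score:.3f}")  # higher = cleaner domain clusters
```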
When a model is finetuned on a mixture of two datasets $D_1$ and $D_2$, its Delta Activation approximates the sum of the single-domain Delta Activations:

$$v_{D_1 \cup D_2} \approx v_{D_1} + v_{D_2}$$
We test this by comparing cosine similarities:
| Domain D1 | Domain D2 | Mixed vs D1 | Mixed vs D2 | Mixed vs Sum |
|---|---|---|---|---|
| Math | Commonsense | 0.58 | 0.48 | 0.65 |
| Math | Code | 0.70 | 0.27 | 0.73 |
| Medical | Legal | 0.41 | 0.68 | 0.70 |
The mixed model's embedding is consistently closer to the sum than to either individual domain embedding.
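A small sketch of this check, assuming the single-domain and mixed-data Delta Activations have already been computed as sketched earlier; the random tensors below merely stand in for real embeddings.

```python
# Sketch: test the additive property by comparing the mixed-data embedding
# against each single-domain embedding and against their sum.
import torch
import torch.nn.functional as F

def cos(a, b):
    return F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item()

# Placeholders standing in for real Delta Activations (d = 4096).
v_math, v_code, v_mixed = (torch.randn(4096) for _ in range(3))

print("mixed vs D1: ", cos(v_mixed, v_math))
print("mixed vs D2: ", cos(v_mixed, v_code))
# With real Delta Activations, the sum is expected to score highest.
print("mixed vs sum:", cos(v_mixed, v_math + v_code))
```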
Delta Activations are stable across training configurations: models maintain domain-specific clustering even when trained with different hyperparameters.
Silhouette scores remain consistently above 0.58 across all variations, confirming the robustness of the representation.
"Some patients have had no ill effects from these medications..."
โ Medical model response to generic prompt
Using only 20 examples, Delta Activations can embed a task and locate the relevant model cluster; on Gemma, retrieval accuracy reaches 100%:
Few-shot embeddings (circles) correctly locate full model clusters on Gemma
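A hedged sketch of the retrieval step: the task embedding `v_task` would come from briefly finetuning the base model on the ~20 examples and computing its Delta Activation as sketched earlier, after which pool models are ranked by cosine similarity. The function name, placeholder pool, and dimensions below are illustrative.

```python
# Sketch: rank a pool of finetuned models against a few-shot task embedding.
import torch
import torch.nn.functional as F

def rank_models(v_task, model_embeddings, top_k=5):
    """Return the top_k model names whose Delta Activations best match the task."""
    scores = {
        name: F.cosine_similarity(v_task.unsqueeze(0), v.unsqueeze(0)).item()
        for name, v in model_embeddings.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Placeholder pool: in practice these are precomputed Delta Activations.
pool = {f"model_{i}": torch.randn(4096) for i in range(15)}
v_task = torch.randn(4096)
print(rank_models(v_task, pool, top_k=3))
```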
Delta Activations extend beyond supervised finetuning to preference-alignment methods.
Delta Activations also work beyond domain-specific finetuning. On Tulu v2 instruction splits with diverse output formats, clustering quality (silhouette score) stays well above that of output embeddings:
| Method | LLaMA | Gemma | Qwen |
|---|---|---|---|
| Output Emb. | 0.02 | -0.03 | 0.10 |
| Delta Act. | 0.49 | 0.32 | 0.48 |
Models finetuned on: CoT, GPT4-Alpaca, ShareGPT, CodeAlpaca, Science splits
We validate model selection for merging on LoraHub, a pool of ~200 FLAN-T5 LoRA models, evaluated on Big-Bench Hard (26 tasks):
| Method | Accuracy | Improvement |
|---|---|---|
| Random Selection | 34.3% | – |
| Delta Activations | 36.3% | +2.0% |
Strategy: select the single most similar model as an anchor plus 19 random models for merging. Interestingly, selecting all 20 most similar models yields only 30.3% due to model interference.
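A sketch of this selection policy; `model_embeddings`, `v_task`, the pool size, and the embedding dimension are assumptions, and the merging step itself (e.g. LoraHub-style composition) is out of scope here.

```python
# Sketch: pick the most similar model as an anchor, then fill with random models,
# mirroring the 1-anchor + 19-random strategy described above.
import random
import torch
import torch.nn.functional as F

def select_for_merging(v_task, model_embeddings, n_total=20, seed=0):
    """Return one nearest-neighbor anchor plus (n_total - 1) random models."""
    anchor = max(
        model_embeddings,
        key=lambda name: F.cosine_similarity(
            v_task.unsqueeze(0), model_embeddings[name].unsqueeze(0)
        ).item(),
    )
    others = [name for name in model_embeddings if name != anchor]
    random.seed(seed)
    return [anchor] + random.sample(others, n_total - 1)

# Placeholder pool standing in for ~200 precomputed LoRA-module embeddings.
pool = {f"lora_{i}": torch.randn(1024) for i in range(200)}
task = torch.randn(1024)
selected = select_for_merging(task, pool)   # 1 anchor + 19 random models
```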
We introduce Delta Activations, a method to represent post-trained large language models as vector embeddings by measuring shifts in their internal activations relative to a base model. Our experiments demonstrate that this representation effectively captures model specialization, forming coherent clusters by domain and task without requiring access to training data or external metadata.
The empirical analysis reveals several desirable properties of the Delta Activation embedding space. The additive property enables compositional understanding of models trained on mixed datasets, while robustness across training configurations ensures consistent representations. The ability to embed tasks through few-shot finetuning opens applications in model selection and retrieval from large repositories.
Our validation on real-world model hubs demonstrates practical applicability, with Delta Activations achieving superior clustering quality compared to weight-based and output-based alternatives. The method extends beyond supervised finetuning to preference optimization paradigms, suggesting broad applicability across post-training techniques.
While Delta Activations provide effective model representations, several limitations warrant investigation. The choice of probe prompts influences embedding quality, and optimal prompt design remains an open question. Additionally, the method assumes access to the base model, which may not always be available for proprietary systems. Future work could explore applications to model merging and investigate the theoretical basis for why generic prompts elicit domain-specific signals.