Delta Activations: A Representation for Finetuned Large Language Models

Zhiqiu Xu¹*    Amish Sethi¹*    Mayur Naik¹    Ser-Nam Lim²
¹University of Pennsylvania      ²University of Central Florida

Interactive Model Embedding Explorer: explore how Delta Activations cluster 66 finetuned LLMs by domain; selecting a model shows its nearest neighbors in the embedding space.

Embedding Finetuned Models Concept

In the Delta Activations embedding space, finetuned models cluster by domain, enabling efficient retrieval of models by task or domain.

Abstract

The success of powerful pretrained Large Language Models (LLMs) has enabled the community to create a vast collection of post-trained models adapted to specific tasks and domains. However, navigating and understanding these models remains challenging due to inconsistent metadata and unstructured repositories.

We introduce Delta Activations, a method to represent finetuned models as vector embeddings by measuring shifts in their internal activations relative to a base model. This representation allows for effective clustering by domain and task, revealing structure in the model landscape.

Delta Activations also demonstrate desirable properties: they are robust across finetuning settings, exhibit an additive property when finetuning datasets are mixed, and can be used to embed tasks by finetuning on few-shot examples. We apply our approach to prototype model hubs and show its potential applications in model selection and model merging.

Method Overview

Computing Delta Activations

The difference between a finetuned model's hidden state and the base model's hidden state on a shared input quantifies the effect of finetuning.

Approach

Delta Activations are computed by passing a fixed set of five generic Alpaca-style instruction templates through both the base model and the post-trained model. For each probe prompt, we extract the final-layer hidden state at the last token position and take the difference between the two models' internal representations:

$$\Delta_f(x) = h_f(x) - h_{\text{base}}(x)$$
$$v_f = \frac{1}{N} \sum_{i=1}^{N} \Delta_f(x_i)$$

where $h_f(x)$ and $h_{\text{base}}(x)$ denote the final-layer, last-token hidden states of the finetuned and base models, respectively, and $x_1, \dots, x_N$ are the $N$ probe prompts. The resulting vector $v_f \in \mathbb{R}^d$ serves as a standalone representation that captures how post-training has shifted the model's internal computations.
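To make this concrete, here is a minimal sketch of the computation with Hugging Face transformers. The probe prompts are illustrative Alpaca-style templates rather than the exact five used in our experiments, and the finetuned model ID is a placeholder.

```python
# Sketch: compute a Delta Activation embedding for one finetuned model.
# Probe prompts below are illustrative Alpaca-style templates (assumption).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "meta-llama/Llama-3.1-8B"            # base model
FINETUNED_ID = "your-org/llama-3.1-8b-gsm8k"   # hypothetical finetuned model

PROBES = [
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n### Instruction:\nDescribe your "
    "capabilities.\n\n### Response:\n",
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n### Instruction:\nSummarize what "
    "you can help with.\n\n### Response:\n",
]

@torch.no_grad()
def last_token_states(model_id: str, prompts: list[str]) -> torch.Tensor:
    """Return the final-layer hidden state at the last token for each prompt."""
    tok = AutoTokenizer.from_pretrained(BASE_ID)  # shared base tokenizer (assumption)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model.eval()
    states = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        out = model(**inputs, output_hidden_states=True)
        states.append(out.hidden_states[-1][0, -1, :])  # final layer, last token
    return torch.stack(states)

# Delta Activation: mean over probes of (finetuned - base) hidden states.
h_ft = last_token_states(FINETUNED_ID, PROBES)
h_base = last_token_states(BASE_ID, PROBES)
v = (h_ft - h_base).mean(dim=0)   # v lives in R^d (d = 4096 for LLaMA-3.1-8B)
```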

Method Characteristics

Delta Activations provide several technical advantages over existing approaches:

  • Computational efficiency: Requires only a single forward pass through the model with fixed probe prompts
  • Independence from external data: Does not require access to training datasets or evaluation metrics
  • Stability: Embeddings remain fixed when new models are added to the repository
  • Generality: Applicable to both model characterization and task embedding through few-shot finetuning

Clustering Quality Evaluation

We evaluate Delta Activations by finetuning three base models (LLaMA-3.1-8B, Gemma-2-9B, Qwen-2.5-7B) on datasets from five domains (Legal/LegalBench, Mathematics/GSM-8K, Medical/PubMedQA, Commonsense/HellaSwag, Coding/OpenCoder) using LoRA (r=8, α=16) and full finetuning. Each model pool contains 15 finetuned models (3 per domain), trained for 3 epochs with learning rate 1e-4.
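For concreteness, the sketch below shows a LoRA configuration matching the hyperparameters above, using Hugging Face peft and transformers. The target modules, batch size, and output path are illustrative assumptions, not the exact training script used in our experiments.

```python
# Minimal LoRA finetuning setup mirroring the reported hyperparameters
# (r=8, alpha=16, lr=1e-4, 3 epochs); target modules and paths are assumptions.
import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

base_id = "meta-llama/Llama-3.1-8B"   # one of the three base models
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=8,                                   # LoRA rank
    lora_alpha=16,                         # LoRA scaling factor
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir="./llama-gsm8k-lora",       # hypothetical output path
    num_train_epochs=3,
    learning_rate=1e-4,
    per_device_train_batch_size=4,
)
# A Trainer (e.g. trl's SFTTrainer) is then constructed with one of the
# domain datasets, such as GSM-8K, and trained as usual.
```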

t-SNE visualization of different embedding spaces: Delta Activations form clean domain clusters, while baseline methods fail to achieve clear separation.

Clustering Performance Across Methods

| Embedding Space | Dimension | LLaMA | Gemma | Qwen | Average |
|---|---|---|---|---|---|
| Flattened weights | ~2·10⁷ | −.035 | −.060 | −.034 | −.043 |
| PCA on flattened weights | 14 | −.004 | −.007 | −.004 | −.005 |
| Output sentence embeddings | 384 | .221 | −.053 | .096 | .087 |
| Delta Activations | 4096 | .645 | .545 | .653 | .614 |
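Assuming the table reports silhouette scores (the metric quoted in the robustness and DPO results on this page), the evaluation amounts to scoring a labeled pool of embeddings. A minimal sketch with scikit-learn, where random vectors stand in for real Delta Activations and the cosine metric is an assumption:

```python
# Sketch: clustering-quality evaluation via silhouette score.
# A random matrix stands in for the (15, d) pool of Delta Activations;
# metric="cosine" is an assumption (scikit-learn defaults to Euclidean).
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(15, 4096))   # 15 finetuned models, d = 4096
domains = np.repeat(["legal", "math", "medical", "commonsense", "coding"], 3)

score = silhouette_score(embeddings, domains, metric="cosine")
print(f"silhouette score: {score:.3f}")    # higher = cleaner domain clusters
```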

Properties and Applications

Additive Property

When a model is finetuned on mixed datasets, its Delta Activation approximates the sum of individual domain activations:

$$v(\text{model trained on } D_1 \cup D_2) \approx v(\text{model trained on } D_1) + v(\text{model trained on } D_2)$$

We test this by comparing cosine similarities:

| Domains Mixed | Mixed vs. D1 | Mixed vs. D2 | Mixed vs. Sum |
|---|---|---|---|
| Math + Commonsense | 0.58 | 0.48 | 0.65 |
| Math + Code | 0.70 | 0.27 | 0.73 |
| Medical + Legal | 0.41 | 0.68 | 0.70 |

The mixed model's embedding is consistently closer to the sum of the two single-domain embeddings than to either one individually.
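A minimal sketch of the comparison behind this table, with random vectors standing in for the actual Delta Activations of the single-domain and mixed-data models:

```python
# Sketch: test the additive property by comparing cosine similarities.
# v_d1, v_d2, v_mixed would be Delta Activations of models finetuned on
# D1, D2, and D1 ∪ D2; toy vectors stand in for them here.
import torch
import torch.nn.functional as F

d = 4096
v_d1, v_d2 = torch.randn(d), torch.randn(d)
v_mixed = 0.6 * v_d1 + 0.5 * v_d2 + 0.1 * torch.randn(d)  # toy mixed-model embedding

def cos(a: torch.Tensor, b: torch.Tensor) -> float:
    return F.cosine_similarity(a, b, dim=0).item()

print("mixed vs D1: ", cos(v_mixed, v_d1))
print("mixed vs D2: ", cos(v_mixed, v_d2))
print("mixed vs sum:", cos(v_mixed, v_d1 + v_d2))  # expected to be the largest
```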

Robustness

Delta Activations demonstrate stability across various training configurations. Models maintain domain-specific clustering even when trained with different hyperparameters:

  • Learning rates: 5e-5 to 2e-4
  • Training epochs: 1 to 5
  • Batch sizes: 2 to 8

Silhouette scores remain consistently above 0.58 across all variations, confirming the robustness of the representation.

Task Embedding via Few-Shot Finetuning

"Some patients have had no ill effects from these medications..."
โ€” Medical model response to generic prompt

Using only 20 examples, Delta Activations embed tasks and locate relevant model clusters. Gemma achieves 100% retrieval accuracy:

Task embedding visualization: few-shot embeddings (circles) correctly locate the full model clusters on Gemma.
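A sketch of the retrieval step: the task embedding comes from briefly finetuning the base model on the few-shot examples and computing its Delta Activation as above, and retrieval is nearest-neighbor search by cosine similarity over the hub's precomputed embeddings. Model names and vectors below are placeholders.

```python
# Sketch: retrieve finetuned models whose Delta Activations are closest to a
# few-shot task embedding. `hub_embeddings` maps model names to precomputed
# Delta Activations; `v_task` is the few-shot task embedding (placeholders).
import torch
import torch.nn.functional as F

hub_embeddings = {
    "gemma-legalbench": torch.randn(4096),
    "gemma-gsm8k": torch.randn(4096),
    "gemma-pubmedqa": torch.randn(4096),
}
v_task = torch.randn(4096)   # Delta Activation of the 20-example finetune

ranked = sorted(
    hub_embeddings.items(),
    key=lambda kv: F.cosine_similarity(v_task, kv[1], dim=0).item(),
    reverse=True,
)
for name, emb in ranked[:3]:
    print(name, round(F.cosine_similarity(v_task, emb, dim=0).item(), 3))
```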

Preference Optimization

Delta Activations extend beyond supervised finetuning to preference alignment methods:

  • DPO clustering: 0.93 silhouette score
  • Clear separation by preference type
  • Works across different reward models

Beyond Domains: Tulu v2

Delta Activations work beyond domain-specific finetuning. On Tulu v2 instruction splits with diverse output formats:

| Method | LLaMA | Gemma | Qwen |
|---|---|---|---|
| Output Emb. | 0.02 | −0.03 | 0.10 |
| Delta Act. | 0.49 | 0.32 | 0.48 |

Models finetuned on: CoT, GPT4-Alpaca, ShareGPT, CodeAlpaca, Science splits

Model Selection: LoraHub

We validate model selection on LoraHub, using ~200 finetuned FLAN-T5 models evaluated on Big-Bench Hard (26 tasks):

| Method | Accuracy | Improvement |
|---|---|---|
| Random Selection | 34.3% | – |
| Delta Activations | 36.3% | +2.0% |

Strategy: select the single most similar model as an anchor plus 19 random models for merging. Interestingly, selecting all 20 most similar models yields only 30.3%, due to interference between the merged models.
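A minimal sketch of this selection strategy, assuming precomputed Delta Activations for the LoRA modules and a few-shot task embedding; the merging step itself (e.g. LoraHub's weighted combination) is not shown, and all names and vectors are placeholders.

```python
# Sketch: pick the 1 most-similar module as anchor + 19 random modules to merge.
import random
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
pool = {f"lora_module_{i}": rng.normal(size=512) for i in range(200)}  # placeholder pool
v_task = rng.normal(size=512)                                          # task embedding

# Anchor: module whose Delta Activation is most similar to the task embedding.
anchor = max(pool, key=lambda name: cosine(pool[name], v_task))

# Remaining 19 modules are drawn at random from the rest of the pool.
others = random.sample([n for n in pool if n != anchor], k=19)
selected = [anchor] + others
print(selected[:5])
```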

Conclusion

We introduce Delta Activations, a method to represent post-trained large language models as vector embeddings by measuring shifts in their internal activations relative to a base model. Our experiments demonstrate that this representation effectively captures model specialization, forming coherent clusters by domain and task without requiring access to training data or external metadata.

The empirical analysis reveals several desirable properties of the Delta Activation embedding space. The additive property enables compositional understanding of models trained on mixed datasets, while robustness across training configurations ensures consistent representations. The ability to embed tasks through few-shot finetuning opens applications in model selection and retrieval from large repositories.

Our validation on real-world model hubs demonstrates practical applicability, with Delta Activations achieving superior clustering quality compared to weight-based and output-based alternatives. The method extends beyond supervised finetuning to preference optimization paradigms, suggesting broad applicability across post-training techniques.

Limitations and Future Work

While Delta Activations provide effective model representations, several limitations warrant investigation. The choice of probe prompts influences embedding quality, and optimal prompt design remains an open question. Additionally, the method assumes access to the base model, which may not always be available for proprietary systems. Future work could explore applications to model merging and investigate the theoretical basis for why generic prompts elicit domain-specific signals.

BibTeX

@article{xu2025delta,
  title   = {Delta Activations: A Representation for Finetuned Large Language Models},
  author  = {Xu, Zhiqiu and Sethi, Amish and Naik, Mayur and Lim, Ser-Nam},
  journal = {arXiv preprint},
  year    = {2025}
}