Journal — Preprint —
CEKALA: CKA-Guided Layer Selection for Efficient Vision–Language Adaptation
Tasnimul Hossain Tomal, Md Fahim, Mir Sazzat Hossain, Md Farhad Alam Bhuiyan
Abstract
Pre-trained vision-language models (VLMs) such as CLIP generalize well but remain difficult to adapt in few-shot settings, particularly when deciding which layers to adapt and how to align cross-modal representations. Existing multimodal adaptation methods apply adapters uniformly across a fixed set of layers, which assumes that all layers are equally important and that the vision and text encoders are implicitly aligned depth-wise. This assumption neglects layer-wise heterogeneity and cross-modal semantic misalignment. To overcome these limitations, we propose the Centered Kernel Alignment (CKA)-based Layer Adapter (CEKALA), a representation measurement framework that uses CKA to guide selective layer adaptation and cross-modal alignment. CEKALA first computes layer-wise CKA scores to quantify each layer’s contribution to downstream performance, then uses CKA scores to identify semantically aligned vision–text layer pairs. Shared cross-modal adapters are injected only into aligned layer pairs, while unpaired layers receive modality-specific adapters, preserving semantic consistency while using parameters efficiently. CEKALA thus enables fine-grained, interpretable, and performance-aware layer selection for vision-language models. Empirical results demonstrate that CEKALA improves few-shot generalization and cross-modal alignment while maintaining strong parameter efficiency.
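The abstract does not spell out how CKA is computed or how layer pairs are chosen, so the following is only a minimal sketch under stated assumptions. It uses the standard linear form of CKA on centered feature matrices, and the pairing rule (best-matching text layer per vision layer, kept only above a threshold) is a hypothetical illustration, not the authors' procedure. The names `linear_cka`, `aligned_pairs`, and `threshold` are introduced here for illustration.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    The two matrices must share the sample axis; feature dims may differ.
    """
    # Center each feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(x.T @ y, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def aligned_pairs(vision_feats, text_feats, threshold=0.5):
    """Hypothetical pairing rule (an assumption, not taken from the paper).

    vision_feats / text_feats: lists of per-layer feature matrices, all
    computed on the same n_samples image-text pairs.
    Returns the full CKA score matrix and the (vision, text) layer pairs
    whose best score reaches the threshold.
    """
    scores = np.array(
        [[linear_cka(v, t) for t in text_feats] for v in vision_feats]
    )
    pairs = []
    for i, row in enumerate(scores):
        j = int(row.argmax())  # best-matching text layer for vision layer i
        if row[j] >= threshold:
            pairs.append((i, j))
    return scores, pairs
```

Under this sketch, the returned pairs would be the candidates for shared cross-modal adapters, and any layer absent from `pairs` would fall back to a modality-specific adapter. The threshold value is arbitrary and would need tuning.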
Cite
@article{pqnbT2bcN3wC,
title = {CEKALA: CKA-Guided Layer Selection for Efficient Vision–Language Adaptation},
author = {Tasnimul Hossain Tomal and Md Fahim and Mir Sazzat Hossain and Md Farhad Alam Bhuiyan},
journal = {},
year = {}
}