New Fine-Tuning Method Reduces LLM Forgetting by 65%

London, UK – A new research paper titled "Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance" introduces a novel approach to multi-task fine-tuning for large language models (LLMs), significantly reducing the common problem of "catastrophic forgetting" and the "seesaw effect." The method, dubbed Core Parameter Isolation Fine-Tuning (CPI-FT), is reported to cut forgetting by 65% compared to standard fine-tuning techniques.

The core issue in multi-task fine-tuning, often referred to as the "seesaw effect," arises when improving an LLM's performance on one task inadvertently degrades its ability on others. This occurs because traditional fine-tuning indiscriminately updates all model weights, leading to conflicts when different tasks require distinct parameter adjustments. As summarized by Rohan Paul on social media, "The problem is the seesaw effect, improving one task hurts another because all weights get pushed together."

The authors, including Yao Wang, Di Liang, and Minlong Peng, hypothesize that different skills reside in specific "pockets" of the network, meaning parameter updates should not be uniform. Their CPI-FT framework addresses this by first independently fine-tuning an LLM on each task to identify its "core parameter regions"—the top 1% of weights that change most significantly for that task. Tasks with overlapping core regions are then grouped, allowing related skills to be trained synergistically.
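The article doesn't include the authors' reference code, but the core-region step is straightforward to sketch. The PyTorch snippet below is a minimal illustration under assumed names (`core_region`, `region_overlap`, and the Jaccard-style overlap score are this article's inventions, not necessarily the paper's exact procedure): it compares a task-tuned checkpoint against the base model, keeps the top 1% of weights by absolute change, and scores pairwise overlap so tasks with shared core regions can be grouped.

```python
import torch

def core_region(base_model, tuned_model, fraction=0.01):
    """Find a task's "core parameter region": indices of the top
    `fraction` of weights, ranked by absolute change after single-task
    fine-tuning against the shared base model."""
    deltas = {
        name: (p_tuned.detach() - p_base.detach()).abs().flatten()
        for (name, p_base), (_, p_tuned) in zip(
            base_model.named_parameters(), tuned_model.named_parameters()
        )
    }
    # One global threshold across all tensors. (Illustrative only: for a
    # 7B-parameter model you would threshold per tensor or in chunks.)
    all_deltas = torch.cat(list(deltas.values()))
    k = max(1, int(fraction * all_deltas.numel()))
    threshold = torch.topk(all_deltas, k).values.min()
    return {
        name: (d >= threshold).nonzero().squeeze(-1)
        for name, d in deltas.items()
    }

def region_overlap(region_a, region_b):
    """Jaccard overlap between two tasks' core regions; tasks whose
    regions overlap strongly are grouped and trained together."""
    inter = union = 0
    empty = torch.empty(0, dtype=torch.long)
    for name in set(region_a) | set(region_b):
        a = set(region_a.get(name, empty).tolist())
        b = set(region_b.get(name, empty).tolist())
        inter += len(a & b)
        union += len(a | b)
    return inter / union if union else 0.0
```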

CPI-FT then constructs a single backbone model. Each task's core weights are transplanted directly into this backbone, while non-core parameters from the different task-specialized models are blended smoothly using Spherical Linear Interpolation (SLERP). A final, lightweight staged fine-tuning pass on mixed data follows, with all identified core parameters frozen so that new training cannot overwrite previously acquired skills. This strategic isolation and gentle blending are key to mitigating interference.
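Again as an illustration rather than the authors' implementation: blending two task-specialized models could look like the sketch below, where `slerp` interpolates flattened weight tensors on the sphere and `build_backbone` (a hypothetical helper, shown for two tasks) transplants each task's core coordinates verbatim on top of the blend.

```python
import torch

def slerp(w_a, w_b, t=0.5):
    """Spherical linear interpolation between two weight tensors."""
    a, b = w_a.flatten().float(), w_b.flatten().float()
    cos_omega = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm() + 1e-12),
                            -1.0, 1.0)
    omega = torch.acos(cos_omega)
    if omega < 1e-6:  # vectors nearly parallel: plain lerp is stable
        out = (1 - t) * a + t * b
    else:
        out = (torch.sin((1 - t) * omega) * a
               + torch.sin(t * omega) * b) / torch.sin(omega)
    return out.view_as(w_a).to(w_a.dtype)

def build_backbone(base, tuned_a, tuned_b, region_a, region_b, t=0.5):
    """SLERP the non-core weights of two task models, then overwrite
    each task's core coordinates with its own fine-tuned values."""
    pa, pb = dict(tuned_a.named_parameters()), dict(tuned_b.named_parameters())
    empty = torch.empty(0, dtype=torch.long)
    backbone = {}
    for name, _ in base.named_parameters():
        merged = slerp(pa[name].detach(), pb[name].detach(), t).flatten()
        ia, ib = region_a.get(name, empty), region_b.get(name, empty)
        merged[ia] = pa[name].detach().flatten()[ia]  # task A core, verbatim
        merged[ib] = pb[name].detach().flatten()[ib]  # task B core, verbatim
        backbone[name] = merged.view_as(pa[name])
    return backbone
```

Note that where `ib` overlaps `ia`, task B's core weights win in this sketch; in the paper's pipeline, tasks with heavily overlapping cores are grouped and trained together first, so such clashes between transplanted cores should be rare.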
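For the final mixed-data pass, the core coordinates have to stay fixed. Because a core region is a subset of each tensor rather than whole tensors, setting `requires_grad = False` is too coarse; one common workaround (an assumption here, not necessarily what the authors do) is a gradient-masking hook:

```python
import torch

def freeze_core(model, core_region):
    """Zero gradients at core coordinates so the final staged pass on
    mixed data cannot overwrite the transplanted skills."""
    empty = torch.empty(0, dtype=torch.long)
    for name, p in model.named_parameters():
        idx = core_region.get(name, empty).to(p.device)
        if idx.numel() == 0:
            continue
        mask = torch.ones_like(p).flatten()
        mask[idx] = 0.0
        mask = mask.view_as(p)
        # The hook multiplies each incoming gradient by the mask, so the
        # optimizer never moves the masked (core) coordinates.
        p.register_hook(lambda grad, m=mask: grad * m)
```

One caveat with this approach: optimizers with decoupled weight decay such as AdamW still shrink parameters regardless of their gradients, so in practice you would also disable weight decay for these tensors or restore the core values after each optimizer step.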

Extensive experiments across diverse tasks, including mathematical reasoning, code generation, logical reasoning, and conversational abilities, demonstrated that the fused CPI-FT model consistently outperforms both plain multi-task and staged fine-tuning baselines on various base models like LLaMA-2-7B, Mistral-8B, Qwen1.5-7B, and Gemma-9B. The researchers conclude that by protecting the "tiny set of task-critical weights" and blending the rest gently, CPI-FT offers a robust solution to a long-standing challenge in LLM development.