(Website Under Construction) Peculiarities of Mixture-of-Experts optimization

Massachusetts Institute of Technology

*Indicates Equal Contribution

Method Overview


We present Router Policy Optimization (RPO), a novel training mechanism for MoE within transformer models, structured in two distinct phases: router policy optimization (blue) and expert optimization (red). In the router policy optimization phase, input tokens are directed through a router policy, which determines the subset of experts to use. The outputs of the selected experts are then weighted and aggregated. The accompanying pseudocode details our alternating training algorithm: the expectation step updates the router policy with reinforcement learning, touching only the router parameters, while the maximization step updates the remaining model parameters, excluding the routers'. By applying this alternating EM approach, RPO addresses expert collapse and the balancing of expert contributions, leading to more precise and efficient routing that significantly improves model performance.
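As a rough illustration of this alternating loop, the sketch below implements a minimal version under our own assumptions (top-1 routing, REINFORCE with the negative task loss as reward); names such as MoELayer and rpo_step are illustrative and are not the paper's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model, n_experts):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # router policy
        self.experts = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_experts)
        )

    def forward(self, x):
        # x: (batch, d_model). Sample one expert per input from the router policy.
        probs = F.softmax(self.router(x), dim=-1)
        idx = torch.multinomial(probs, 1).squeeze(-1)
        log_prob = torch.log(probs.gather(-1, idx.unsqueeze(-1)).squeeze(-1))
        out = torch.stack([self.experts[int(i)](x[b]) for b, i in enumerate(idx)])
        return out, log_prob

def rpo_step(layer, x, y, router_opt, expert_opt):
    # E-step: update only the router with a policy gradient (reward = -task loss).
    out, log_prob = layer(x)
    reward = -F.mse_loss(out, y, reduction="none").mean(dim=-1).detach()
    router_loss = -(reward * log_prob).mean()
    router_opt.zero_grad(); router_loss.backward(); router_opt.step()

    # M-step: update only the experts with the ordinary task loss.
    out, _ = layer(x)
    expert_loss = F.mse_loss(out, y)
    expert_opt.zero_grad(); expert_loss.backward(); expert_opt.step()
    return expert_loss.item()

In practice the two steps use disjoint parameter groups, e.g. torch.optim.Adam(layer.router.parameters()) for the E-step and torch.optim.Adam(layer.experts.parameters()) for the M-step, alternated over batches.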

Abstract

Mixture-of-Experts (MoE) has proven to be a pivotal technique for reducing computational demands while maintaining scaling behaviors. Yet its optimization intricacies and potential pitfalls remain under-explored. In this paper, we investigate challenges with router optimization in MoE, particularly within the context of transformers. We formalize the MoE training objective and use it to dissect the operational nuances of these models. We highlight key challenges, including expert collapse, where experts converge to similar representations; monotonicity barriers, a requirement for a smooth linear combination of the experts; the expert weighting dilemma, a dichotomy between the expert choice probability and the expert weighting; and routerless routing, where expert/attention mechanisms can learn to route even without explicit router training. To counteract these challenges, we present Router Policy Optimization (RPO), a novel approach that combines expectation-maximization and policy optimization. We conduct a comprehensive analysis of MoE models under various scenarios to identify when these problems can occur and propose strategies to overcome them.
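For background, the generic top-k MoE layer output can be written as follows; this is the standard formulation and may differ in detail from the notation formalized in the paper.

\[
  y(x) \;=\; \sum_{i \in \mathrm{TopK}(g(x),\, k)} g_i(x)\, E_i(x),
  \qquad
  g(x) \;=\; \mathrm{softmax}(W_r x),
\]

where \(E_1, \dots, E_N\) are the experts, \(W_r\) is the router weight matrix, and \(k\) is the number of active experts per token.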

Router Optimization Visualization


We demonstrate the challenges in router optimization. In a toy scenario, experts are represented by vectors in a regression problem aimed at minimizing the distance to a red-star target. The routing trajectory is visualized, contrasting the performance of the different methods. RPO successfully identifies the most effective routing, while standard optimization and its variants fall short, a gap that is especially evident when more than one expert is involved.
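The toy setting can be reproduced roughly as follows (our own reconstruction under stated assumptions, not the authors' code): fixed 2-D expert vectors, a trainable router that produces mixture weights, and a squared-distance objective to the target point.

import torch
import torch.nn.functional as F

# Experts are fixed 2-D vectors; only the router logits are trained.
target = torch.tensor([1.0, 1.0])                                # the red-star target
experts = torch.tensor([[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])   # fixed expert vectors
router_logits = torch.zeros(3, requires_grad=True)

opt = torch.optim.SGD([router_logits], lr=0.1)
for step in range(200):
    weights = F.softmax(router_logits, dim=0)   # expert weighting from the router
    prediction = weights @ experts              # weighted combination of experts
    loss = (prediction - target).pow(2).sum()   # squared distance to the target
    opt.zero_grad(); loss.backward(); opt.step()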

Expert Collapse


We capture the phenomenon of expert collapse in MoE models. When routers and experts are optimized together in a conventional manner, the experts tend to converge toward each other in their outputs. This concept is illustrated by starting with experts at optimal positions in a two-dimensional space, marked by a dashed line and circle. Over the course of training, we observe that standard optimization tends to merge the distinct expert representations, as shown by the transition from lighter to darker shades. In contrast, Router Policy Optimization (RPO) maintains the diversity of the experts, ensuring they remain true to the initial optimal assignments. We also observe the effect of weight decay (wd), which prevents the norm of the router weight matrix from exploding.
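A simple way to track this collapse effect in practice (our own diagnostic, not the paper's metric) is the mean pairwise cosine similarity between expert outputs on the same inputs; values drifting toward 1 over training indicate that the experts are converging to near-identical functions.

import itertools
import torch
import torch.nn.functional as F

def expert_similarity(experts, x):
    # One output per expert on the same batch of inputs.
    outputs = [expert(x) for expert in experts]
    # Cosine similarity for every pair of experts, averaged.
    sims = [
        F.cosine_similarity(a.flatten(), b.flatten(), dim=0)
        for a, b in itertools.combinations(outputs, 2)
    ]
    return torch.stack(sims).mean()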

GPT2 Optimal Routing Recovery


Here the experts are fixed. We initialize an MoE model from GPT2 pre-trained on OpenWebText, comprising 4 experts: one expert is initialized with the pre-trained weights, while the remaining three are set as sub-optimal experts. We optimize only the router to highlight the failure to recover the optimal routing. On the left, the three sub-optimal experts start from random weights (Model DIFF); on the right, they are initialized with perturbed optimal weights (Model SIM). When the weights are randomly initialized, an unnormalized MoE arrives at the correct routing. However, when the sub-optimal experts can partially solve the problem, standard training fails, and only Router Policy Optimization (RPO) accurately recovers the correct solution.
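A hedged sketch of this router-only setup is given below; build_experts and its arguments are our own illustrative names, not the authors' code. One expert copies the pre-trained GPT2 MLP, and the other three are made sub-optimal, either re-initialized (Model DIFF) or perturbed copies of the optimum (Model SIM); all experts are frozen, so only the router receives gradients.

import copy
import torch
import torch.nn as nn

def build_experts(pretrained_mlp: nn.Module, n_experts: int = 4, noise: float = 0.0):
    experts = nn.ModuleList()
    for i in range(n_experts):
        expert = copy.deepcopy(pretrained_mlp)
        if i > 0:  # experts 1..3 are sub-optimal
            with torch.no_grad():
                for p in expert.parameters():
                    if noise > 0.0:
                        p.add_(noise * torch.randn_like(p))   # Model SIM: perturbed optimum
                    else:
                        p.normal_(0.0, 0.02)                  # Model DIFF: re-initialized
        for p in expert.parameters():
            p.requires_grad_(False)                           # experts stay frozen
        experts.append(expert)
    return experts

A router placed over these frozen experts can then be trained with the standard task loss or with RPO, as in the figure.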
