
MoE

Mixture of experts

  • Replace the big feedforward layer with many small feedforward networks and a selector layer (see the sketch after this list).
  • At the same FLOP count, more parameters do better.
  • MoEs are faster to train.
  • Parallelizable across many devices, since each expert can live on its own device.
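
A minimal PyTorch sketch of the idea, assuming a learned linear router with top-k selection; all names and sizes here (MoELayer, d_model, n_experts, ...) are illustrative, not from the source.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """One big FFN replaced by n_experts small FFNs plus a selector."""
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, k=2):
        super().__init__()
        # Many small feedforward networks in place of one big one.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # Selector (router): scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):                            # x: (n_tokens, d_model)
        scores = self.router(x)                      # (n_tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)   # each token keeps its top-k experts
        weights = F.softmax(weights, dim=-1)         # mixing weights over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for j in range(self.k):
                mask = idx[:, j] == e                # tokens whose j-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, j:j+1] * expert(x[mask])
        return out
```

Only the k chosen experts run per token, which is why FLOPs stay close to a dense layer of expert size while total parameters scale with n_experts.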

What varies?

Routing function

  • How we route (i.e., match) tokens to experts.
  • Not all experts process every token.
Token Choice (used by almost all MoEs)
  • Each token has a ranked preference over experts, and picks the top-k experts per token (contrasted with expert choice in the sketch after this list).
Expert Choice
  • Each expert has a ranked preference over tokens, and picks the top-k tokens per expert, so per-expert load is balanced by construction.
Global Assignment
  • Solve for a mapping between experts and tokens that is balanced overall.
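
A hypothetical comparison of token choice vs. expert choice on one score matrix; the shapes and the values of k and c below are made up for illustration, not taken from the source.

```python
import torch

torch.manual_seed(0)
scores = torch.randn(16, 4)  # router logits: (n_tokens=16, n_experts=4)

# Token choice: each token ranks experts and keeps its top-k.
# Per-expert load is uneven: popular experts may be over-subscribed.
k = 2
tok_weights, tok_experts = scores.topk(k, dim=-1)  # (16, k) expert ids per token

# Expert choice: each expert ranks tokens and keeps its top-c tokens,
# so every expert processes exactly c tokens (some tokens may get
# zero experts, others several).
c = 8
exp_weights, exp_tokens = scores.topk(c, dim=0)    # (c, 4) token ids per expert
```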

Common routing variants in detail

  • Top-k: each token routes to its k highest-scoring experts (k = 1 or 2 is common).
  • Hashing: a fixed hash of the token assigns the expert, with no learned router (sketch below).
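
A sketch of hash routing, assuming the hash is taken over vocabulary ids (as in the Hash Layers approach); the modulo hash and the ids below are illustrative, not from the source.

```python
import torch

def hash_route(token_ids: torch.Tensor, n_experts: int) -> torch.Tensor:
    # A fixed hash of the token id picks the expert: no routing
    # parameters are learned, so assignments are stable all through training.
    return token_ids % n_experts

token_ids = torch.tensor([17, 42, 9, 42])
print(hash_route(token_ids, n_experts=8))  # the same id always maps to the same expert
```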

Expert sizes

Training objectives