We’re trying to answer MMoE related questions about this paper:
https://dl.acm.org/doi/10.1145/3298689.3346997
Prompt
Flash 2.0
Okay, let's break down how MMoE (Multi-gate Mixture-of-Experts) works in this paper and what the authors mean by adding experts on top of a shared hidden layer.
Understanding Multi-gate Mixture-of-Experts (MMoE)
At its core, MMoE is a neural network architecture designed for multi-task learning: scenarios where you have multiple related, but potentially conflicting, objectives to optimize simultaneously. In the context of this paper, those objectives are user engagement behaviors (e.g., clicks and time spent watching) and user satisfaction behaviors (e.g., likes and ratings).
Here's how MMoE works:
- Shared Bottom Layer: The input features are first processed by a shared hidden layer. This layer learns a common representation of the input that's useful for all the different tasks. This is similar to how traditional multi-task learning often works, where parameters are shared between tasks.
- Experts: Instead of directly passing the output of the shared layer to each task's prediction layer, MMoE introduces "experts." Each expert is a separate neural network (in this case, a multilayer perceptron). These experts are designed to learn different aspects of the data. Think of them as specialists that focus on different patterns or relationships in the input.
- Gating Networks: For each task, there is a separate gating network. Each gating network takes the output of the shared hidden layer as input and produces, via a softmax, a weight vector whose length equals the number of experts (so the weights are non-negative and sum to 1). This weight vector determines how much each expert contributes to the prediction for that specific task.
- Weighted Combination of Experts: The output of the experts is combined using the weights provided by the gating network for each task. So, for a particular task, some experts might have a higher weight (contributing more to the final prediction), while others might have a lower weight (contributing less).
- Task-Specific Prediction Layers: Finally, the weighted combination of the experts' outputs is passed through a task-specific prediction layer (e.g., a sigmoid output for binary classification or a linear output for regression). A minimal code sketch of this whole flow is given right after this list.
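Here is a minimal sketch of that flow in PyTorch. This is not the authors' code; the layer sizes, number of experts, and the single-logit towers are illustrative assumptions.

```python
# Minimal MMoE sketch (assumed sizes, not the paper's actual configuration).
import torch
import torch.nn as nn


class MMoE(nn.Module):
    def __init__(self, input_dim, shared_dim=256, expert_dim=128,
                 num_experts=4, num_tasks=2):
        super().__init__()
        # 1. Shared bottom hidden layer: common representation for all tasks.
        self.shared_layer = nn.Sequential(nn.Linear(input_dim, shared_dim), nn.ReLU())
        # 2. Experts: independent MLPs that can specialize on different patterns.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(shared_dim, expert_dim), nn.ReLU())
            for _ in range(num_experts)
        ])
        # 3. Gating networks: one per task, producing softmax weights over experts.
        self.gates = nn.ModuleList([
            nn.Linear(shared_dim, num_experts) for _ in range(num_tasks)
        ])
        # 5. Task-specific prediction layers (towers); one logit per task here.
        self.towers = nn.ModuleList([
            nn.Linear(expert_dim, 1) for _ in range(num_tasks)
        ])

    def forward(self, x):
        shared = self.shared_layer(x)                                  # (batch, shared_dim)
        # Stack expert outputs: (batch, num_experts, expert_dim)
        expert_out = torch.stack([e(shared) for e in self.experts], dim=1)
        predictions = []
        for gate, tower in zip(self.gates, self.towers):
            # Per-task softmax weights over experts: (batch, num_experts)
            weights = torch.softmax(gate(shared), dim=-1)
            # 4. Weighted combination of experts for this task: (batch, expert_dim)
            combined = (weights.unsqueeze(-1) * expert_out).sum(dim=1)
            predictions.append(tower(combined))
        return predictions  # one prediction tensor per task


# Example: two tasks (say, one engagement and one satisfaction objective).
model = MMoE(input_dim=64)
outputs = model(torch.randn(8, 64))  # list of two (8, 1) tensors
```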
Key Idea: The crucial aspect of MMoE is that each task has its own gating network, allowing the model to dynamically choose which experts are most relevant for that specific task. This "soft parameter sharing" allows tasks to leverage shared knowledge while also specializing on their own unique needs. This is unlike the "hard parameter sharing" of a shared bottom, where all tasks must use the same representation.
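In equation form, following the MMoE formulation the paper builds on, the per-task prediction looks roughly like this (here x is the input to the experts and gates, which in this paper's variant is the output of the shared hidden layer):

```latex
% n is the number of experts, f_i is the i-th expert, g^k is the gating
% network of task k, and h^k is task k's prediction layer (tower).
\[
  y_k = h^k\!\left(f^k(x)\right), \qquad
  f^k(x) = \sum_{i=1}^{n} g^k_{(i)}(x)\, f_i(x), \qquad
  g^k(x) = \operatorname{softmax}\!\left(W_{g^k}\, x\right)
\]
```

Because each task k has its own gate g^k, two tasks can weight the same pool of experts very differently, which is exactly the soft parameter sharing described above.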
The Sentence Explained: "For our ranking system, we propose to add experts on top of a shared hidden layer, as shown in Figure 2b..."
Let's break this down piece by piece:
- "For our ranking system, we propose to add experts on top of a shared hidden layer..."
- This refers to the architecture described above, with one tweak relative to the original MMoE: instead of putting the experts (and their gates) directly on top of the input layer (the raw features), the authors place them on top of a shared hidden layer. The paper motivates this as a way to keep training and serving cost manageable, since the input layer is very high-dimensional. The small sketch after this list shows how this changes what the gating networks see.
- Think of it like this: the shared hidden layer first performs general processing of the input, producing a more abstract representation. The experts, combined with their gating networks, then specialize for the different tasks on top of that representation.
- "...as shown in Figure 2b."
- Figure 2b in the paper visually demonstrates this architecture.
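To make that placement concrete, here is a tiny sketch (assumed dimensions, same PyTorch style as above) contrasting what the gating networks consume in the original MMoE versus the Figure 2b variant:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 64)                                       # raw input features
shared_layer = nn.Sequential(nn.Linear(64, 256), nn.ReLU())
shared = shared_layer(x)                                     # shared hidden layer output

gate_on_input = nn.Linear(64, 4)     # original MMoE: gates read the raw features
gate_on_shared = nn.Linear(256, 4)   # Figure 2b variant: gates read the shared output

w_original = torch.softmax(gate_on_input(x), dim=-1)         # (8, 4) expert weights
w_variant = torch.softmax(gate_on_shared(shared), dim=-1)    # (8, 4) expert weights
```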