TikTok’s Best Practice on Flink Failovers

Tech First
4 min read · Oct 6, 2020


Intro

More than 2,000 real-time jobs are running online to serve TikTok’s global users, so the latency and stability of these jobs directly affect user experience all over the world.

For these jobs, a few characteristics matter:

  • they process massive data volumes with high concurrency
  • most topologies join multiple streams and depend very little on checkpoints
  • losing a small portion of the data is acceptable, but continuity of output is strictly required.

We studied how Flink behaves in practice: in a topology with multi-stream joins, a single task failure causes almost all tasks to fail with it. This is unacceptable for TikTok, and what’s worse, restarting the whole job takes too much time (up to minutes), which we cannot tolerate.

To provide a better experience for TikTok’s global users, we proposed a single-point recovery solution that works at the network layer to restart or recover the failed tasks quickly.

With this solution:

  • the whole topology is not restarted; only the failed tasks fail over
  • running tasks are not affected

01. Proposed Ideas

The basic idea we proposed is:

  • when a host goes down, only the tasks on it fail over
  • upstream and downstream tasks are notified of the failed tasks
  • upstream tasks discard the data buffered for the failed tasks
  • upstream tasks resume sending data once the failed tasks have recovered
  • downstream tasks discard all incomplete data produced by the failed tasks
  • downstream tasks resume receiving data only after the failed upstream tasks have recovered

To achieve this, the key points are:

  • how upstream and downstream tasks communicate with each other
  • how to remove the incomplete data
  • how to re-establish connections once the failed tasks have recovered

Based on these key points, we decided to refactor the task failure-handling workflow on top of the existing network framework to satisfy our needs.
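
To make this concrete, below is a minimal, hypothetical sketch of the upstream-side behavior; the class and method names (UpstreamOutputChannel, onDownstreamFailed, and so on) are illustrative only and are not Flink’s actual classes:

```java
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Hypothetical sketch of an upstream-side output channel. When the
 * downstream task fails, data buffered for it is discarded and sending
 * pauses; once the task recovers, sending resumes with new data.
 */
public class UpstreamOutputChannel {

    private final Queue<byte[]> buffered = new ArrayDeque<>();
    private volatile boolean downstreamAlive = true;

    /** Called by the operator to emit a serialized record downstream. */
    public void emit(byte[] record) {
        if (!downstreamAlive) {
            // Downstream is down: drop the record instead of failing the job.
            return;
        }
        buffered.add(record);
        flush();
    }

    /** Invoked when the network layer reports the downstream task as failed. */
    public void onDownstreamFailed() {
        downstreamAlive = false;
        buffered.clear();              // abandon data prepared for the failed task
    }

    /** Invoked when the recovered downstream task re-establishes its connection. */
    public void onDownstreamRecovered() {
        downstreamAlive = true;        // resume sending from new data onwards
    }

    private void flush() {
        byte[] next;
        while ((next = buffered.poll()) != null) {
            send(next);
        }
    }

    private void send(byte[] record) {
        // In a real setup this would hand the buffer to the network stack.
    }
}
```

The important property is that a downstream failure never propagates upward as an exception: the upstream task simply drops the affected data and keeps running, which matches the requirement that output continuity matters more than completeness.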

02. The Architecture

The following picture shows the basic communication workflow between Flink’s tasks:

Accordingly, the tasks’ communication workflow can be broken down into the following failure cases (a minimal detection sketch follows the list):

  • Failure triggered by OOM, coding errors, etc.: the task actively releases all of its network resources and sends #CHANNEL_CLOSE# and the exception to its upstream and downstream tasks respectively.
  • Failure triggered by a YARN kill: the TaskManager is killed and its TCP connections are closed, so the Netty server/client on the other side still receives a close event.
  • Failure triggered by a physical cause (the machine goes down): the TCP connections are aborted abnormally, so the Netty server/client may not receive the expected message. We add special handling when deploying the new tasks to trigger the notification.
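
For the first two cases, the loss of the connection itself carries the signal. A minimal sketch of how such a notification could be derived from Netty’s channel events is shown below; TaskFailureDetector and FailureListener are hypothetical names, and the machine-down case still needs the extra handling described above:

```java
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

/**
 * Sketch: a Netty inbound handler that turns connection loss into a
 * "remote task failed" notification for the local task.
 */
public class TaskFailureDetector extends ChannelInboundHandlerAdapter {

    /** Hypothetical callback, not part of Flink or Netty. */
    public interface FailureListener {
        void remoteTaskFailed(String reason);
    }

    private final FailureListener listener;

    public TaskFailureDetector(FailureListener listener) {
        this.listener = listener;
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        // Covers the YARN-kill case: the TCP connection is closed cleanly.
        listener.remoteTaskFailed("channel closed: " + ctx.channel().remoteAddress());
        super.channelInactive(ctx);
    }

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        // Covers OOM/coding errors surfaced as exceptions on the connection.
        listener.remoteTaskFailed("channel error: " + cause.getMessage());
        ctx.close();
    }
}
```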

Thus, in most cases a task can learn the state of its upstream and downstream tasks, and we can build single-point recovery on top of these preconditions.

03. Solutions

Based on these ideas, the following graph shows the communication workflow when a task fails (a rough code sketch follows the list):

  • Map(1) fails.
  • It sends ERROR_MSG to Source(1), Sink(1), and the JobManager.
  • The JobManager allocates new resources to prepare the failover and, at the same time, cuts off the channels between Source(1)/Sink(1) and Map(1).
  • Map(1) is recovered and rebuilds the connection with its upstream tasks.
  • The JobManager notifies Sink(1) to rebuild the connection with Map(1).
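
The JobManager-side handling of these steps could look roughly like the following; SinglePointFailoverHandler, TaskGateway, and SlotPool are hypothetical names used only to illustrate the sequence, not Flink’s real API:

```java
/**
 * Hypothetical sketch of the JobManager-side handling of a single failed
 * task, mirroring the five steps above.
 */
public class SinglePointFailoverHandler {

    interface TaskGateway {
        void disconnectFrom(String failedTask);
        void reconnectTo(String recoveredTask);
    }

    interface SlotPool {
        String allocateSlotFor(String task);   // returns the new slot/location
    }

    private final SlotPool slotPool;

    public SinglePointFailoverHandler(SlotPool slotPool) {
        this.slotPool = slotPool;
    }

    /** Called when ERROR_MSG for a task (e.g. "Map(1)") reaches the JobManager. */
    public void onTaskFailed(String failedTask,
                             TaskGateway upstream,    // e.g. Source(1)
                             TaskGateway downstream)  // e.g. Sink(1)
    {
        // Cut off the channels so upstream stops feeding the failed task and
        // downstream stops waiting on it.
        upstream.disconnectFrom(failedTask);
        downstream.disconnectFrom(failedTask);

        // Allocate a slot and redeploy only the failed task.
        String newLocation = slotPool.allocateSlotFor(failedTask);
        deploy(failedTask, newLocation);

        // The recovered task reconnects to its upstream by itself; the
        // JobManager tells the downstream task to reconnect.
        downstream.reconnectTo(failedTask);
    }

    private void deploy(String task, String location) {
        // Submit the task description to the TaskManager at 'location'.
    }
}
```

Reserving resources in advance presumably lets the slot allocation return immediately instead of waiting for a new container, which would explain the larger recovery-time reduction reported in the tests below.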

Within this workflow, the communication with upstream tasks is shown in the following graph:

The communication with downstream tasks is:
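
A rough, hypothetical sketch of the downstream side: since a record may span several network buffers, any partially assembled record from the failed upstream task is dropped, and receiving resumes only after the JobManager signals the reconnection. DownstreamInputChannel and its methods are illustrative names only:

```java
import java.io.ByteArrayOutputStream;

/**
 * Hypothetical sketch of a downstream-side input channel that discards
 * incomplete data from a failed upstream task.
 */
public class DownstreamInputChannel {

    private final ByteArrayOutputStream partialRecord = new ByteArrayOutputStream();
    private volatile boolean upstreamAlive = true;

    /** Called for every network buffer arriving from the upstream task. */
    public void onBuffer(byte[] chunk, boolean endOfRecord) {
        if (!upstreamAlive) {
            return;                      // ignore stale data from a dead connection
        }
        partialRecord.write(chunk, 0, chunk.length);
        if (endOfRecord) {
            deliver(partialRecord.toByteArray());
            partialRecord.reset();
        }
    }

    /** Invoked when the upstream task is reported as failed. */
    public void onUpstreamFailed() {
        upstreamAlive = false;
        partialRecord.reset();           // abandon the incomplete record
    }

    /** Invoked when the JobManager asks this task to reconnect to the recovered upstream. */
    public void onUpstreamRecovered() {
        upstreamAlive = true;
    }

    private void deliver(byte[] record) {
        // Hand the complete record to the operator for processing.
    }
}
```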

The improved JobManager recovery strategy:

04. Conclusion & Tests

We ran tests with this solution (parallelism = 4000, TMs = 1000, slots per TM = 4), and the results are impressive.

  • Recovery time decreased by ~78% without reserved resources;
  • Recovery time decreased by ~94% with reserved resources;
  • Only ~1/1000 of the data was lost during recovery.
