Forced Alignment: How to match audio with a transcript via Machine Learning?

Tech First
Jan 29, 2020


1. Problems / Use Cases

It is a sunny Monday and Tom is enjoying his coffee when his boss Jerry comes to him and says:

“Here’s the transcript of a movie, but it doesn’t have the timeline information. Could you take some time to watch the movie, and match the video timestamp with the transcript? This is easy and fun, right?”

The following figure shows the problem Tom needs to handle:

“Ehh… okay,” Tom just cannot refuse.

This task does look easy, but if Tom did exactly what Jerry said and watched the movie while manually aligning the transcript with the movie timeline sentence by sentence, it wouldn’t be fun. It would only drive Tom crazy.

Tom feels so bad.

2. Forced Alignment Overview

2.1 What is Forced Alignment

“What are you frowning for?” asks Aaron, the machine learning engineer who sits beside Tom, trying to help.

After Tom explains the problem he is facing, Aaron says with a smile: “Oh, easy! This is just a forced alignment problem, not a big deal!”

“Not a big deal?” Tom is about to complain, but Aaron adds: “It might just take a few minutes to finish. Don’t worry, dude.”

“A few minutes?”, Tom is suspicious but excited at the same time, “so, what is Forced Alignment?”

“According to Wiki [1], forced alignment refers to the process by which orthographic transcriptions are aligned to audio recordings to automatically generate phone-level segmentation,” Aaron continues. “In short, forced alignment is simply aligning the audio with the text.”

“That’s what I want to do!” Tom is so excited.

2.2 Forced Alignment Tools

“Okay, now I know the academic term ‘Forced Alignment’,” Tom says, “but how can I solve my issue with forced alignment?”

“Actually, there are lots of tools that can help with this. Check this list [2]:”

After presenting the tool list to Tom, Aaron says: “Just pick any one you like, and follow its documentation to install the client or use the website. Most of them only ask for the audio file and the transcript text file, and then output the aligned transcript.”

“Thanks so much, Aaron, it’s amazing!” Tom starts researching those tools.
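For example, with Aeneas (one of the tools in [2]), the whole job can be scripted in a few lines of Python. The snippet below is a minimal sketch based on Aeneas’ task API; the file paths and the configuration string are placeholders you would adapt to your own audio and transcript.

```python
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# Describe the task: language of the audio, how the text file is structured,
# and the output format of the sync map (the aligned transcript).
config_string = u"task_language=eng|is_text_type=plain|os_task_file_format=json"

task = Task(config_string=config_string)
task.audio_file_path_absolute = u"/path/to/movie_audio.mp3"     # placeholder
task.text_file_path_absolute = u"/path/to/transcript.txt"       # placeholder
task.sync_map_file_path_absolute = u"/path/to/alignment.json"   # placeholder

# Run the alignment and write the fragment <-> time mapping to disk.
ExecuteTask(task).execute()
task.output_sync_map_file()
```

With a plain-text transcript (one fragment per line), the resulting JSON lists the begin and end time of each line in the audio.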

Tom’s story ends there, but if you want to know more about the machine learning details, please read the following sections.

3. Advanced: Forced Alignment Algorithms

3.1 DTW (Dynamic Time Warping)

Typical Tool/Lib: Aeneas

In this section, we introduce the basic DTW algorithm, which will help you understand the optimized DTW variants used in different forced alignment tools.

> Introduction

DTW is a time series alignment algorithm, which aims at aligning two sequences of feature vectors by warping the time axis iteratively until an optimal match between the two sequences is found.

> Input

2 sequences:

  • A = [a1, a2, …, an]
  • B = [b1, b2, …, bm]

> Output

Distance (cost) matrix: each cell holds the distance between a pair of points, one from each sequence.

Optimal matched path (shown as the red dots):

  • W = [w1, w2, …, wK], where w_k = (a_i, b_j)

> Time Complexity

O(m * n)

> Conditions

  1. Boundaries
  • w1 = (a1, b1)
  • wK = (an, bm)

  2. Monotonicity

Suppose w_(k-1) = (a_i, b_j) and w_k = (a_ii, b_jj); then:

  • ii >= i
  • jj >= j

  3. Continuity

With the same notation:

  • ii - i <= 1
  • jj - j <= 1

In other words, the path starts and ends at the corners of the matrix, never moves backward, and advances at most one step in each sequence at a time.

> Optimization Target

There will be lots of paths W that satisfy the above conditions, but the target of DTW is to find the optimal W, the one with the minimal total distance.

Mathematically, the target is

DTW(A, B) = min_W Σ_{k=1..K} d(w_k), where d(w_k) is the distance between the matched pair (a_i, b_j).

Based on this target, we can define a cumulative distance and solve the problem via dynamic programming. Reference [5] gives a very good example of how to solve DTW via dynamic programming; please check it for more details.
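To make the cumulative-distance idea concrete, here is a minimal, unoptimized sketch of the classic O(m * n) dynamic programming solution for two 1-D sequences; it assumes the absolute difference as the point-wise distance and returns both the DTW distance and the optimal path W.

```python
import numpy as np

def dtw(a, b):
    """Return the DTW distance and the optimal warping path for 1-D sequences a, b."""
    n, m = len(a), len(b)
    # acc[i, j] = cumulative distance of the best path ending at (a[i], b[j])
    acc = np.full((n, m), np.inf)
    acc[0, 0] = abs(a[0] - b[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            d = abs(a[i] - b[j])                                    # point-wise distance
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,                 # advance in a only
                acc[i, j - 1] if j > 0 else np.inf,                 # advance in b only
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # advance in both
            )
            acc[i, j] = d + best_prev
    # backtrack from (n-1, m-1) to (0, 0) to recover the optimal path W
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append(((i - 1, j - 1), acc[i - 1, j - 1]))
        if i > 0:
            candidates.append(((i - 1, j), acc[i - 1, j]))
        if j > 0:
            candidates.append(((i, j - 1), acc[i, j - 1]))
        (i, j), _ = min(candidates, key=lambda c: c[1])
        path.append((i, j))
    return acc[n - 1, m - 1], path[::-1]

dist, path = dtw([1, 2, 3, 4], [1, 1, 2, 3, 3, 4])
print(dist)   # 0.0 -- the sequences match perfectly after warping
print(path)   # [(0, 0), (0, 1), (1, 2), (2, 3), (2, 4), (3, 5)]
```

The acc matrix here is exactly the cumulative distance mentioned above: each cell stores the cost of the best warping path that ends at that cell.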

> How to Use DTW in Forced Alignment

So far you might be wondering how DTW is actually used in forced alignment. DTW is not the only algorithm involved. Let’s take Aeneas as an example; the key steps of how it uses DTW are:

  1. use TTS (text-to-speech) to convert the transcript text file into audio, and keep the word <-> time mapping of that synthesized audio
  2. convert both the real audio and the synthesized audio into time series data (feature sequences)
  3. use DTW to match the 2 series
  4. compute the real word <-> time mapping based on the DTW path

That is the key workflow; a rough end-to-end sketch is shown below, and you can read Reference [3] if you are interested in the implementation details of Aeneas.
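The sketch below is not Aeneas’ actual code; it is an illustration of the same four steps under a few assumptions: librosa for feature extraction and DTW, a hypothetical tts_word_spans list produced by whatever TTS engine you use, and placeholder file names.

```python
import numpy as np
import librosa

SR = 16000    # sample rate used for both signals
HOP = 512     # hop length for framing; used for frame <-> time conversion

# Step 1 (assumed done elsewhere): a TTS engine synthesized the transcript to
# "tts.wav" and reported each word's time span in that *synthetic* audio.
tts_word_spans = [("hello", 0.00, 0.41), ("world", 0.41, 0.85)]  # placeholder values

real, _ = librosa.load("movie_audio.wav", sr=SR)   # the real recording (placeholder)
synth, _ = librosa.load("tts.wav", sr=SR)          # the synthesized transcript (placeholder)

# Step 2: convert both signals into comparable time series (MFCC features).
mfcc_real = librosa.feature.mfcc(y=real, sr=SR, n_mfcc=13, hop_length=HOP)
mfcc_synth = librosa.feature.mfcc(y=synth, sr=SR, n_mfcc=13, hop_length=HOP)

# Step 3: DTW between the two feature sequences; librosa returns the
# cumulative cost matrix and the warping path (end-to-start).
_, wp = librosa.sequence.dtw(X=mfcc_synth, Y=mfcc_real)
wp = wp[::-1]   # (synth_frame, real_frame) pairs, start-to-end

# Step 4: map each synthetic word boundary onto the real audio via the path.
def synth_time_to_real_time(t):
    synth_frame = librosa.time_to_frames(t, sr=SR, hop_length=HOP)
    idx = min(np.searchsorted(wp[:, 0], synth_frame), len(wp) - 1)
    return float(librosa.frames_to_time(wp[idx, 1], sr=SR, hop_length=HOP))

aligned = [(word, synth_time_to_real_time(start), synth_time_to_real_time(end))
           for word, start, end in tts_word_spans]
print(aligned)   # [(word, start_in_real_audio, end_in_real_audio), ...]
```

The important idea is step 4: the DTW path tells you, for every frame of the synthesized audio (whose word timings you know), which frame of the real audio it corresponds to.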

Moreover, because of DTW’s O(m * n) time complexity, most libraries use optimized DTW variants to find the path. In my article “Correlation Analysis in Time Series” [6], I mentioned some speed-up approaches for DTW; please check it if you are interested.

3.2 ASR (Automatic Speech Recognition) via HMM (Hidden Markov Models)

> Introduction

HMM is a statistical Markov model in which the system being modeled is assumed to be a Markov process with hidden states.

With an HMM, we can infer the hidden states or the transition probabilities based on the observed states.

> Definitions

  1. N: the number of hidden states
  2. Mat(N): the transition probability matrix between the N hidden states
  3. OS = [OS1, OS2, …]: the chain of observed states
  4. HS = [HS1, HS2, …]: the chain of hidden states

Let’s take an example: suppose we have 3 dice, named D6, D4, and D8.

D6 has 6 faces, D4 has 4, and D8 has 8, as the following picture shows:

Thus, in this example, N = 3.

Suppose their transition probabilities are as the following picture shows, so Mat(N) = [ [0.2, 0.3, 0.5], [0.4, 0.5, 0.1], [0.1, 0.1, 0.8] ].

If we throw the dice repeatedly, we get 2 sequences, as the following picture shows.

The red numbers are the values observed each time we threw a die; the blue boxes show which die we threw at that time. Thus,

  • the observed states sequence OS = [1, 6, 3, 5, …]
  • the hidden states sequence HS = [D6, D8, D6, …]
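To make these definitions concrete, here is a small sketch that simulates the dice HMM with numpy. The row/column order of Mat(N) is assumed to be [D6, D4, D8] (the picture defines the actual order), and the probability of each observed face simply follows from the number of faces of the die being thrown.

```python
import numpy as np

rng = np.random.default_rng(0)

dice = ["D6", "D4", "D8"]            # hidden states (assumed row order of Mat(N))
faces = {"D6": 6, "D4": 4, "D8": 8}  # number of faces of each die

# Transition probability matrix Mat(N) from the article.
trans = np.array([[0.2, 0.3, 0.5],
                  [0.4, 0.5, 0.1],
                  [0.1, 0.1, 0.8]])

def simulate(steps, start=0):
    """Roll the hidden-state chain and the observed face values for `steps` throws."""
    hs, os_ = [], []
    state = start
    for _ in range(steps):
        hs.append(dice[state])
        os_.append(int(rng.integers(1, faces[dice[state]] + 1)))  # uniform face value
        state = int(rng.choice(3, p=trans[state]))                # next die via Mat(N)
    return hs, os_

HS, OS = simulate(10)
print("HS =", HS)   # the dice we actually threw (hidden)
print("OS =", OS)   # the numbers we observed
```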

> What HMM can help to solve

There are some typical problems that HMM can help to solve:

  • Known N, Mat(N) and OS, figure out HS (decoding; a sketch follows below)
  • Known N and OS, figure out Mat(N) (learning)
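The first problem (recovering HS from OS) is typically solved with the Viterbi algorithm. Below is a minimal sketch for the dice example above; it assumes a uniform initial distribution over the three dice, the same [D6, D4, D8] ordering of Mat(N), and observation probabilities derived from the number of faces of each die.

```python
import numpy as np

dice = ["D6", "D4", "D8"]
faces = np.array([6, 4, 8])
trans = np.array([[0.2, 0.3, 0.5],
                  [0.4, 0.5, 0.1],
                  [0.1, 0.1, 0.8]])

def emission(value):
    """P(observed value | die): 1/faces if the die can show that value, else 0."""
    return np.where(value <= faces, 1.0 / faces, 0.0)

def viterbi(observations):
    """Most likely hidden dice sequence for the observed values (log domain)."""
    n_states = len(dice)
    log_trans = np.log(trans)
    # initial step: uniform prior over the three dice (an assumption)
    delta = np.log(np.full(n_states, 1.0 / n_states)) + np.log(emission(observations[0]) + 1e-300)
    backptr = []
    for value in observations[1:]:
        scores = delta[:, None] + log_trans        # scores[i, j]: come from state i, go to j
        backptr.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + np.log(emission(value) + 1e-300)
    # backtrack the best path
    state = int(delta.argmax())
    path = [state]
    for bp in reversed(backptr):
        state = int(bp[state])
        path.append(state)
    return [dice[s] for s in reversed(path)]

print(viterbi([1, 6, 3, 5, 7, 8]))
```

Any observation of 7 or 8 can only come from D8, so those positions are forced to D8; the remaining positions are chosen to maximize the probability of the whole sequence.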

> How does HMM contribute to Forced Alignment

Previously, when we talked about DTW, we converted the text to audio (via TTS) and then aligned the two sequences.

However, a more intuitive method is to directly recognize every word in the audio file and record each word’s start and end time, which gives the forced alignment directly. This is the idea behind ASR/HMM-based aligners such as the Montreal Forced Aligner [7].

4. References

  1. Forced Alignment
  2. Forced Alignment Tools
  3. Aeneas: How it works
  4. Dynamic Time Warping
  5. Introduction To Dynamic Time Warping
  6. Correlation Analysis in Time Series
  7. Montreal Forced Aligner
  8. Hidden Markov model
  9. HMM Introduction
  10. Forced Alignment and Speech Recognition Systems
