Extracting subtitles from Matroska video files

DVD-style subtitles embedded in a video file (the dvdsub tracks in the example below) are essentially a sequence of images overlaid on the video frames during playback. To convert them to a text file, one first has to extract the subtitle images and then run them through OCR to turn the images into text.

On Fedora, the procedure requires installing a few packages. First, enable the rpmfusion repositories if you haven't done so yet. (Side note: the livna repository is not dead yet and you may need it for some DVDs, so enable livna too.) Then install the following packages using yum or dnf:

mplayer
mkvtoolnix
subtitleripper
tesseract

The following should work on Matroska containers. I have a video in the MyFilm.mkv file with several tracks in it: video, several audio tracks and subtitles. I found MPlayer to be the simplest tool for identifying which track to work with:

[tom@localhost ~]$ mplayer MyFilm.mkv
MPlayer SVN-r37150-4.8.3 (C) 2000-2014 MPlayer Team
Playing MyFilm.mkv.
Cache fill:  0.00% (0 bytes)

libavformat version 55.19.104 (external)
libavformat file format detected.
[lavf] stream 0: video (mpeg4), -vid 0
[lavf] stream 1: audio (vorbis), -aid 0, -alang jpn
[lavf] stream 2: audio (vorbis), -aid 1, -alang fre
[lavf] stream 3: audio (vorbis), -aid 2, -alang eng
[lavf] stream 4: audio (vorbis), -aid 3, -alang ger
[lavf] stream 5: audio (vorbis), -aid 4, -alang jpn, Audio-Kommentar
[lavf] stream 6: subtitle (dvdsub), -sid 0, -slang fre
[lavf] stream 7: subtitle (dvdsub), -sid 1, -slang eng
[lavf] stream 8: subtitle (dvdsub), -sid 2, -slang dut
[lavf] stream 9: subtitle (dvdsub), -sid 3, -slang ger
[lavf] stream 10: subtitle (dvdsub), -sid 4, -slang pol
[lavf] stream 11: subtitle (dvdsub), -sid 5, -slang fre, Audio-Kommentar
[lavf] stream 12: subtitle (dvdsub), -sid 6, -slang eng, Audio-Kommentar
[lavf] stream 13: subtitle (dvdsub), -sid 7, -slang dut, Audio-Kommentar
[lavf] stream 14: subtitle (dvdsub), -sid 8, -slang ger, Audio-Kommentar
[lavf] stream 15: subtitle (dvdsub), -sid 9, -slang pol, Audio-Kommentar
...

Alternatively, you can use mkvmerge -i MyFilm.mkv to obtain the list of tracks. The advantage of MPlayer is that it also tells you which language each subtitle track is in. With mkvmerge you have to guess.
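
To avoid scanning the MPlayer listing by eye, the stream number can be picked out with a small filter. This is only a sketch: it assumes the "[lavf] stream N: subtitle (...), -sid M, -slang xxx" line format shown in the listing above, the function name find_subtitle_stream is made up for this example, and it reports only the first track in the requested language.

```shell
#!/bin/sh
# find_subtitle_stream LANG   (hypothetical helper, not part of MPlayer)
# Reads MPlayer's startup output on stdin and prints the stream number of the
# first subtitle track whose -slang value equals LANG (e.g. "eng").
# Assumes lines of the form:
#   [lavf] stream 7: subtitle (dvdsub), -sid 1, -slang eng
find_subtitle_stream() {
    awk -v lang="$1" '
        $1 == "[lavf]" && $2 == "stream" && $4 == "subtitle" {
            for (i = 5; i < NF; i++) {
                if ($i == "-slang") {
                    code = $(i + 1)
                    sub(/,$/, "", code)      # "eng," -> "eng"
                    if (code == lang) {
                        id = $3
                        sub(/:$/, "", id)    # "7:" -> "7"
                        print id
                        exit
                    }
                }
            }
        }'
}
```

It could be fed like this, assuming the -frames 0 -vo null -ao null options make MPlayer print the listing and quit without playing anything: mplayer -frames 0 -vo null -ao null MyFilm.mkv 2>/dev/null | find_subtitle_stream eng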

Assume I'm interested in the English subtitles from the example above. MPlayer tells me they are in track (stream) 7. Let's extract them using mkvextract:

[tom@localhost ~]$ mkvextract tracks MyFilm.mkv 7:subtitles.sub

This produces two new files: subtitles.sub (VOBSUB format) and subtitles.idx (a separate text index file). Now I can feed the two files to the vobsub2pgm utility. The program takes only the base name, without the .sub and .idx suffixes.

[tom@localhost ~]$ vobsub2pgm subtitles title

This produces a lot of PGM bitmaps with names like title????.pgm and a title.srtx file, which comes in handy when putting the final result together.

The next step is to run the images through some OCR. There is a pgm2txt utility that uses gocr for the task, but I found its results very unsatisfactory [1]. I had much better luck with tesseract:

[tom@localhost ~]$ for i in *.pgm; do tesseract "$i" "$i"; done
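
The OCR pass over hundreds of images can take a while, so a variant that can be interrupted and restarted may be handy. The following is a rough sketch rather than a tested recipe: the function name ocr_images and the OCR_CMD override are invented for this example, and it assumes tesseract's usual "image outputbase -l lang" argument order, with tesseract adding the .txt suffix itself.

```shell
#!/bin/sh
# ocr_images [LANG]   (hypothetical helper)
# Runs an OCR command over every *.pgm file in the current directory, skipping
# images that already have a non-empty .pgm.txt result from an earlier run.
# OCR_CMD is an override introduced for this sketch; it defaults to tesseract.
ocr_images() {
    lang="${1:-eng}"
    for img in *.pgm; do
        [ -e "$img" ] || continue       # no matches: the glob stays literal
        [ -s "$img.txt" ] && continue   # already recognized, do not redo it
        # tesseract appends ".txt" to the output base name on its own
        "${OCR_CMD:-tesseract}" "$img" "$img" -l "$lang"
    done
}
```

Calling ocr_images eng in the directory with the title????.pgm files should leave the same .pgm.txt files behind as the one-line loop. An image whose OCR result is legitimately empty is retried on every run, which is harmless but wasteful.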

Now we have a .pgm.txt file for each image and we're almost done. The last step is to replace the file references in the .srtx file with the contents of the referenced files. Before I knew about srttool I used awk for this, but srttool makes it much simpler:

[tom@localhost ~]$ srttool -s -i title.srtx -o subtitles.srt
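
For the curious, the awk route can be sketched roughly as below. This is not the original script, just an illustration under assumptions: the .srtx file is taken to look like an ordinary SRT file whose text lines are file names ending in .txt, the function name srtx_to_srt is made up, and blank lines in the OCR output are dropped because they would otherwise split a subtitle entry in two.

```shell
#!/bin/sh
# srtx_to_srt FILE.srtx   (hypothetical helper, a stand-in for "srttool -s")
# Prints FILE.srtx to stdout, replacing every line that ends in ".txt" and
# names a readable file with that file's content. Other lines pass through.
srtx_to_srt() {
    awk '
        /\.txt$/ {
            file = $0
            status = (getline line < file)
            if (status < 0) { print; next }   # not readable: keep the line as is
            while (status > 0) {
                gsub(/\f/, "", line)          # some tesseract versions emit a form feed
                if (line !~ /^[ \t]*$/) print line
                status = (getline line < file)
            }
            close(file)
            next
        }
        { print }
    ' "$1"
}
```

It would be run from the directory holding the .pgm.txt files, for example srtx_to_srt title.srtx > subtitles.srt.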

That's it. The subtitles.srt file is a text version of the embedded subtitles, and you can do whatever you want with it.

[1]	And the sad thing is that I'm the gocr Fedora maintainer...