Monday, February 22, 2010

How to fix movie subtitle (and other text) encoding issues

Summary: Fixing subtitle encoding in DivX videos is easy... once you know how to do it.

I have been hunting for a copy of Moi Ivan, toi Abraham (AKA "Ivan and Abram", "Я - Иван, ты - Абрам") since I saw the movie on cable in the mid-90s. The movie has not been released on DVD, and I do not have a VHS player, but fortunately, I got a decent DivX version of the movie with Russian subtitles (the movie is mostly in Yiddish).


Unfortunately, instead of legitimate Cyrillic, the subtitle captions displayed garbage (accented characters). As I later found out, the subtitle file was encoded in the Windows-1251 (Cyrillic) code page, but my system read it using a Western code page (such as Windows-1252), so the subtitles appear fine only on a Russian version of Windows. So, what's a girl to do? I ran a few Google searches and found some posts from people running into a similar problem, but none of them contained any answers. I thought I would write a post explaining how I fixed the problem (really easy), hoping that it would help someone.

First, a quick intro to subtitles in DivX. Well, I do not really know much about this, but this is how much you -- a typical movie viewer -- need to know (if I misstate or omit something important, feel free to correct me). A typical DivX (AVI) file does not contain embedded subtitles. Subtitles normally come from a separate file, such as SRT, SUB, or SSA/ASS. Normally, a subtitle file has the same name as the DivX file, with a different extension. For example, this would be a matching pair of DivX (AVI) and subtitle (SRT) files:
Moi Ivan, toi Abraham.avi
Moi Ivan, toi Abraham.srt
There is nothing magic about a subtitle file: it's just a text file, which conforms to a certain data format. Here is the format of the SubRip (SRT) subtitle file (directly from Wikipedia):
Subtitle number
Start time --> End time
Text of subtitle (one or more lines)
Blank line
Here is an example:
1
00:00:18,700 --> 00:00:21,889
<i>Говорят по-цыгански</i>

2
00:03:16,190 --> 00:03:21,760
Я - ИВАН, ТЫ - АБРАМ
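To make the format concrete, here is a minimal Python sketch that splits SRT text into its entries. This is my own illustration, not code from any player: the names parse_srt and TIMING are made up for this example, and it assumes the text has already been decoded correctly and follows the simple layout shown above.

```python
import re

# Matches a SubRip timing line such as "00:00:18,700 --> 00:00:21,889".
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(text):
    """Split SRT text into (number, start, end, caption) tuples."""
    entries = []
    # Drop a leading byte order mark, if any, then split on blank lines.
    for block in re.split(r"\n\s*\n", text.lstrip("\ufeff").strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # incomplete entry: number, timing, and text are required
        match = TIMING.match(lines[1].strip())
        if not match:
            continue  # second line is not a timing line, so skip the entry
        caption = "\n".join(lines[2:])  # captions may span several lines
        entries.append((int(lines[0].strip()), match.group(1), match.group(2), caption))
    return entries


sample = """1
00:00:18,700 --> 00:00:21,889
<i>Говорят по-цыгански</i>

2
00:03:16,190 --> 00:03:21,760
Я - ИВАН, ТЫ - АБРАМ
"""

for number, start, end, caption in parse_srt(sample):
    print(number, start, end, caption)
```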
Many popular video players (KMPlayer, VLC, etc.), as well as DVD players, will automatically load and display the default subtitles from a file that has the same name as the DivX file and sits in the same folder, but you can also load additional subtitle files manually (e.g. you may have subtitles translated into several languages). In my favorite KMPlayer, you can load non-default subtitles via the Subtitles - Load Subtitle menu.

The original subtitle file I got looked like this:
1
00:00:18,700 --> 00:00:21,889
<i>Ãîâîðÿò ïî-öûãàíñêè</i>

2
00:03:16,190 --> 00:03:21,760
ß - ÈÂÀÍ, ÒÛ - ÀÁÐÀÌ
Although this text looks like garbage, it's not useless: it just needs to be re-encoded from one code page to another (and preferably to something non-code-page-specific, e.g. to Unicode). But how do you do it?
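To see why the garbage is recoverable, here is a small Python sketch (my own illustration; fix_mojibake is a made-up name). It assumes the text was stored as Windows-1251 bytes and displayed as Windows-1252, which Python calls cp1251 and cp1252. Encoding the garbled text back to cp1252 restores the original bytes, and decoding those bytes as cp1251 yields the Cyrillic.

```python
def fix_mojibake(garbled, wrong="cp1252", right="cp1251"):
    """Undo a wrong decode: recover the original bytes, then decode them properly."""
    return garbled.encode(wrong).decode(right)


print(fix_mojibake("Ãîâîðÿò ïî-öûãàíñêè"))   # Говорят по-цыгански
print(fix_mojibake("ß - ÈÂÀÍ, ÒÛ - ÀÁÐÀÌ"))  # Я - ИВАН, ТЫ - АБРАМ
```

This round trip only works when the wrong code page has a character for every byte involved, which is the case for the Cyrillic letters here.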

Help comes from Mozilla Firefox (and I suspect from any other web browser). If you need to fix the encoding of a subtitle file (or any other text file), here is what you need to do (you can use a similar approach to recover text in other types of documents, such as email, text files, and so on).
  1. Launch Firefox (or your favorite web browser).
  2. Open the subtitle file. To locate the file in Firefox 3.5, use the File - Open File menu; in IE 8, use the File - Open menu, and click the Browse button; in Google Chrome 4.0, press the CTRL + O keys (when using Google Chrome, you need to change the extension of the subtitle file to .TXT before opening the file; otherwise, it will launch the default program associated with the original file extension instead of displaying the file text in the browser).
  3. Once the browser opens the file, it may automatically adjust the encoding. If you still see garbage, select a different encoding option until the text appears correctly. To change the encoding in Firefox 3.5, select the appropriate encoding from the View - Character Encoding menu (the Auto-Detect menu for the appropriate language can be helpful); in IE 8, use the View - Encoding menu; in Google Chrome, click the Control the current page toolbar button and pick the appropriate option from the Encoding menu (again, the Auto detect option may help).
  4. Once you select the correct encoding option and verify that the text is displayed correctly, highlight all text (you can use CTRL + A), and copy the selected text to the clipboard (press CTRL + C).
  5. Open Notepad (or your favorite plain text editor, such as Notepad++, PSPad, etc.), create a new file (File - New menu option in Notepad) and paste the contents of the clipboard into the new file (press CTRL + V).
  6. Save the text file as the new subtitle file. If you decide to overwrite the original subtitle file, make sure that you first make a backup in case something goes wrong. When saving the file, you will most likely be prompted to change the default ANSI encoding, so pick the Unicode encoding.
  7. Close the newly created subtitle file in Notepad (or your text editor), and reopen it to verify that the encoding is still intact and the text appears correctly, and if so, use it as the new subtitle file.
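The trial-and-error part of step 3 can also be sketched in Python. The snippet below is only an illustration under my own assumptions: preview_encodings and the CANDIDATES list are made up for this example, and the candidates are simply a few encodings commonly seen with Russian text. It decodes the same bytes under each candidate so you can eyeball which preview looks right.

```python
# Python codec names for a few encodings worth trying with Russian text.
CANDIDATES = ["utf-8", "cp1251", "koi8_r", "cp866", "cp1252"]


def preview_encodings(raw, candidates=CANDIDATES, limit=60):
    """Decode the same bytes under each candidate and return short previews."""
    previews = {}
    for name in candidates:
        try:
            previews[name] = raw.decode(name)[:limit]
        except UnicodeDecodeError:
            previews[name] = None  # these bytes are not valid in this encoding
    return previews


# Stand-in for the raw bytes of a real file, e.g. open(path, "rb").read().
raw = "Я - ИВАН, ТЫ - АБРАМ".encode("cp1251")

for name, text in preview_encodings(raw).items():
    print(f"{name:8} {text!r}")
```

The script cannot tell which preview is correct; as in the browser, a human still has to pick the one that reads as real text.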
Now, if you need the Unicode version of the Russian subtitle file for Moi Ivan, toi Abraham, you can download it from here:
Moi Ivan, toi Abraham.srt
UPDATE: As I recently found out, correcting code page issues in subtitles can be even easier, assuming that you have the free text editor Notepad++ installed. What you need to do is:
  1. Back up the subtitles file (just in case something goes wrong).
  2. Open the subtitles file in Notepad++.
  3. From the Encoding menu, select the Character Sets option.
  4. Under the character set, select the appropriate language family and then the code page (you may need to try a few code pages if you don't know which one to use).
  5. When you see the characters appearing in the correct format, select the Convert to UTF-8 option under the Encoding menu.
  6. Save the file.
That should be it.
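For those who prefer a script, the same back-up-then-convert routine can be sketched in Python. This is a minimal sketch under stated assumptions, not what Notepad++ does internally: convert_to_utf8 is a made-up name, the file name in the usage comment is only an example, and the source encoding defaults to cp1251 (Windows-1251) because that is what my file used.

```python
import shutil
from pathlib import Path


def convert_to_utf8(path, source_encoding="cp1251"):
    """Back up a subtitle file, then rewrite it in place as UTF-8."""
    path = Path(path)
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)  # keep the original in case something goes wrong
    # Work on raw bytes so that line endings are preserved exactly.
    text = path.read_bytes().decode(source_encoding)
    # Plain UTF-8 without a byte order mark. If your player only recognizes
    # UTF-8 with a byte order mark (an assumption to check against the player),
    # "utf-8-sig" can be used here instead.
    path.write_bytes(text.encode("utf-8"))
    return backup


# Example usage (hypothetical file name):
# convert_to_utf8("Moi Ivan, toi Abraham.srt")
```

If the source encoding is wrong, the result will be readable UTF-8 garbage rather than an error, so check the output and restore the .bak copy if needed.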
See also:
The 3 Best Subtitle Sites For Your Movies & TV Series
How To Add Subtitles To A Movie Or TV Series
SubDownloader: Fast and Easy Subtitle Downloader
DivX Subtitles
DivXLand Media Subtitler Embeds Subtitles into Movie Files
Sublight Labs: Searching subtitles has never been this easy
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

16 comments:

  1. Exactly what I needed! Thank you so much!

  2. SO SIMPLE !!!!!!! You saved me a couple of time searching for a solution !!! Thank you

  3. Thnx for that.. I was here looking for a way to change kmplayer encoding to be unicode, or a method to change encoding for multiple files at once

    Here is how I change 1 file encoding without firefox
    1- open .srt file by ms word
    2-it will notify you about the bad encoding
    3- choose the encoding language u need
    4-now you will see subtitles in word file as u need to see them
    5- open srt file via notepad
    6-ctrl+a all data in word file and ctrl+c
    7- ctrl+ v all subtitles inside the notepad file
    8- save as and change encoding to be unicode

  4. That's an option, too. There are many ways to do this.

  5. Many Thanks, that was really simple, and helpful

  6. notpad can't save new file !!!!!!!

  7. Thank you for great solution!!!

  8. Hello,

    Now there is subtitle-index.org, it provides a lot of subtitles. They are ranked against multiple criterias, and the best one can be directly downloaded as UTF-8.

  9. Really helpful!! Thx for sharing!!

  10. That's excellent. I've been working in the area since '81 and I often rant about how most explanations seems to be wild guesses, calculated to waste your time, or just some groidy little groid's ego trip, and this answer proves it. It's very possible to be direct, clear and useful...all at the same time!

  11. the best solution ever i seen in the whole sites .. Thanks dude and all the best -:)

  12. Thank youuuuuuuuuu
    you are the best man!

  13. Didn't work for me on Windows Media Player, Win 10, even though it looked fine in Notepad++. But when I tried it on VLC it worked fine!
