Pleco Dictionary for Chinese Topolects

Anyone who is learning Chinese has probably come across Pleco, a Chinese dictionary app for your phone. It offers a lot of different features; some of my favourites are offline access, the clipboard reader, and the Anki integration.

However, while it is a great resource for learning Mandarin and Cantonese, it doesn't offer much when it comes to other Chinese topolects. You may ask, what are topolects? Basically dialects, except that they are in most cases mutually unintelligible. The term corresponds to the Chinese word 方言1.

OK, so what can I do about it? My first thought was to just fork the project and implement support for other topolects myself. But you see, Pleco is not Free Software, so I don't even have access to the source code… After some web searching, though, I found out that you can actually create your own custom dictionaries in Pleco, but you first need to purchase their 10-dollar flashcard add-on.

Before I actually sent them my money, I obviously had to figure out how custom dictionaries work, and, well, it ain't pretty. You can import either .pqb files or plain .txt files. The first option is only really useful for importing dictionaries that other people have already made, because .pqb seems to be Pleco's own database file format, which I don't know anything about and which is not human-readable at all. So my only real option was to create a plain .txt file. The format works like this:

SIMPLIFIED[TRADITIONAL]<tab>PINYIN<tab>DEFINITION<newline>
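As a sanity check of the format above, here's a tiny Haskell helper that assembles one entry line. The function name and arguments are my own; the format itself is undocumented, so treat this as a sketch:

```haskell
-- Sketch: build one line of Pleco's plain-text dictionary format,
-- i.e. SIMPLIFIED[TRADITIONAL]<tab>PINYIN<tab>DEFINITION.
-- Function and field names are mine; the format is undocumented.
plecoEntry :: String -> String -> String -> String -> String
plecoEntry simplified traditional pinyin definition =
  simplified ++ "[" ++ traditional ++ "]" ++ "\t" ++ pinyin ++ "\t" ++ definition
```

So `plecoEntry "你好" "你好" "ni3hao3" "hello"` produces one complete dictionary line, with the trailing newline added when the lines are joined and written out.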

I found out about this on their forum2, but there doesn't seem to be any official documentation for it. But here comes the really ugly part: to insert newlines in the definition section, you need to use a weird private-use Unicode character (U+EAB1), which my browser isn't even able to display. There are also four or five other weird Unicode characters for making things bold, coloured, etc. It would be much better if they used JSON, CSV, or some other well-known format, instead of making up their own.
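Going by those forum posts, handling that newline character in Haskell could look like the sketch below. That U+EAB1 acts as the in-definition newline is an assumption based on the forum, not official documentation:

```haskell
-- Sketch: replace real newlines in a definition with Pleco's
-- private-use "newline" character U+EAB1 (assumption from forum
-- posts; there is no official documentation for this).
escapeNewlines :: String -> String
escapeNewlines = map (\c -> if c == '\n' then '\xEAB1' else c)
```

Anything else in the definition passes through unchanged; only literal `'\n'` characters are swapped for the private-use character.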

OK, rant over. So what did I end up doing? First, I paid for the flashcard add-on. Then I came across a project called Wikihan, which extracts the pronunciation of characters in different topolects from Wiktionary and stores the result in a TSV file. So, kind of exactly what I was looking for, great! They have a GitHub repo with some scripts to generate everything, and I modified it to suit my needs3. Then I wrote a Haskell program to convert the TSV file to Pleco's weird plain-text format.
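A stripped-down sketch of that conversion step. I'm assuming a simplified Wikihan-style row of `character<tab>reading<tab>reading…` (the real column layout differs), and I'm putting the readings into the definition field with the pinyin field left empty, which is one way to fit topolect data into Pleco's Mandarin-centric format:

```haskell
import Data.List (intercalate)

-- Split a string on a delimiter character (simple stand-in for a
-- proper TSV parser).
splitOn :: Char -> String -> [String]
splitOn d s = case break (== d) s of
  (a, [])     -> [a]
  (a, _:rest) -> a : splitOn d rest

-- Sketch: convert one TSV row to a Pleco dictionary line.
-- Assumed row layout: character<tab>reading<tab>reading…
-- The readings go in the definition field; the pinyin field stays empty.
convertRow :: String -> Maybe String
convertRow row = case splitOn '\t' row of
  (ch : readings@(_:_)) ->
    Just (ch ++ "[" ++ ch ++ "]" ++ "\t\t" ++ intercalate " / " readings)
  _ -> Nothing
```

Rows without any readings are dropped via `Nothing`, which matches the fact that not every character has data in every topolect.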

Finally, I imported the file into Pleco, and now I know how to pronounce characters in Hakka, Hokkien, Gan, etc. Not all characters have data in every topolect, but it's better than nothing, of course.


  1. See https://en.wiktionary.org/wiki/topolect

  2. They used a Python library called Epitran, which isn't packaged in Nixpkgs. Since it was only used to convert different romanizations to IPA, which I didn't really care about, I removed the dependency on it so that I could use Nix.

  3. They used a Python library called Epitran, which isn't packaged in Nixpkgs. Since it was only used to convert different romanizations to IPA, which I didn't really care about, I removed the dependency on it so that I could use Nix.