Rich cloudy content

In this post I outline the technologies I used to create a prototype for the system I describe here.

As I mention in the post outlining the idea, I strongly believe that a content consumption aid like this is only going to be useful if it is available on all content consumption devices, and if all interactions with the system have a minimum of friction. Since the original post, where I had developed a Google Chrome browser extension with the server-side components running on my laptop, I have moved the server applications to the (Alibaba) Cloud and ported the browser extension to Firefox. I also created a basic interface for interacting with the lexical database that is much easier to use for adding new words than standard Anki. These are now live and I am using them to help my own learning.

Server-side

The system currently has 3 major components, all installed on a single 4GB VM on Alibaba's cloud in their Hong Kong data centre. I am currently in Yunnan, China, and I need to have the application "freely available". There appear to be sporadic GFW issues (that disappear when going over a VPN) so I'll likely soon move the applications to Alibaba's Shenzhen DC. The content-enrichment clients make very large numbers of API calls so having the DC physically close is important. Ping times for the HK DC are acceptable, even with the GFW, but when ports get blocked/hijacked the system becomes unusable, so moving to the mainland is pretty much obligatory. Ping times over a VPN (at least the one I'm using) slow the system down too much to be pleasant to use.

You need a licence to host a website inside China though, so I'll need to make sure that my API is not considered a website before spending much more time on it.

I am currently also using the Azure Text Translation API as one of my dictionary providers and for sentence translation and transliteration. I have been investigating using some open-source machine translation systems but it would require a lot of time and testing to do better than Azure, so that hasn't been attempted yet. Azure offers 2M characters (dictionary lookups, translations or transliterations) for free each month, which is far more than I need for the moment, so internalising this can definitely wait. Azure doesn't (currently) offer word alignment for its Chinese -> English translations, so I will definitely need to move off it at some stage, as having alignment would probably quite dramatically improve the quality of the glosses.
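
For the curious, this is roughly what the calls look like from Python against v3.0 of the Translator Text API; the key and region are obviously placeholders, and the code is a sketch rather than a copy of what the enricher actually does:

```python
import requests

AZURE_ENDPOINT = "https://api.cognitive.microsofttranslator.com"
AZURE_KEY = "<subscription-key>"    # placeholder
AZURE_REGION = "<resource-region>"  # placeholder, e.g. "eastasia"

def _azure_post(path, payload, extra_params=None):
    """Shared plumbing for the Translator Text API v3.0."""
    params = {"api-version": "3.0", "from": "zh-Hans", "to": "en"}
    params.update(extra_params or {})
    resp = requests.post(
        f"{AZURE_ENDPOINT}{path}",
        params=params,
        headers={
            "Ocp-Apim-Subscription-Key": AZURE_KEY,
            "Ocp-Apim-Subscription-Region": AZURE_REGION,
            "Content-Type": "application/json",
        },
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

def azure_translate(text):
    """Sentence (or single word) translation: one result, no POS information."""
    return _azure_post("/translate", [{"text": text}])[0]["translations"][0]["text"]

def azure_dictionary_lookup(word):
    """Dictionary lookup: alternative translations with Azure's coarse POS tags."""
    return [{"translation": t["displayTarget"], "pos": t["posTag"]}
            for t in _azure_post("/dictionary/lookup", [{"text": word}])[0]["translations"]]
```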

Text parsing and analysis

This is done using an off-the-shelf installation of Stanford's CoreNLP with the unmodified Chinese module. As the system doesn't currently do anything really sophisticated (like the stuff Azab et al. 2013 do with CoreNLP), I am able to get away with allocating only 1GB of RAM to the process, significantly reducing instance costs.
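
A minimal sketch of the call the enricher makes, assuming the CoreNLP server is running locally on its default port (9000) with the Chinese models on its classpath:

```python
import json
import requests

CORENLP_URL = "http://localhost:9000"  # assumption: local CoreNLP server on the default port

def parse_chinese(text):
    """Send raw Chinese text to the CoreNLP server and get back tokenised,
    POS-tagged sentences as JSON."""
    properties = {
        "pipelineLanguage": "zh",             # pulls in the default Chinese properties
        "annotators": "tokenize,ssplit,pos",  # all the system currently needs
        "outputFormat": "json",
    }
    resp = requests.post(
        CORENLP_URL,
        params={"properties": json.dumps(properties)},
        data=text.encode("utf-8"),
        timeout=30,
    )
    resp.raise_for_status()
    # {"sentences": [{"tokens": [{"word": "...", "pos": "NN", ...}, ...]}, ...]}
    return resp.json()
```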

Lexical database/Spaced Repetition software

In order to be able to continue using Anki for spaced repetition (and not have to develop a client), I needed a net-accessible "Ankiweb" interface, both to keep synchronising between my various devices and to query in order to establish which words need to be glossed and which words I already know. Because I need to interact with the API in non-standard ways and do some pretty intense querying, the standard Ankiweb wasn't going to cut it. It would also be a very serious breach of the terms of use and would likely result in a ban from the service! Luckily, there are a couple of "copycats" out there that implement at least part of the Ankiweb API. Unfortunately, the only one that is still being maintained (and has been migrated to Python 3 and Anki 2.1 compatibility) only supports synchronisation from official Anki clients, and doesn't support any other sort of interaction with the underlying DB. That meant I had some coding to do to add the necessary functionality. For various reasons adding the relatively simple functionality took me far too long, but it is there now: I created endpoints for adding, updating and querying notes, and for checking card states.

Turns out this is all pretty hackish, and written in Python with underlying per-user sqlite3 databases. This will definitely not scale nicely so if the project goes further, the SRS part of the system will almost certainly need to be completely replaced. It would likely scale out to 50-100 users with reasonable hardware though, so this definitely doesn't need to be tackled until the project has proven it is really worth it.
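
For illustration, the enricher talks to the modified sync server along these lines; the endpoint names and payloads here are illustrative rather than the exact ones I added:

```python
import requests

SYNC_SERVER = "https://srs.example.com"  # hypothetical base URL for the modified anki-sync-server

def is_known(user, word):
    """Ask the sync server whether the user already has a (sufficiently mature)
    card for this word. Endpoint name and payload are hypothetical."""
    resp = requests.post(f"{SYNC_SERVER}/card_state",
                         json={"username": user, "word": word}, timeout=10)
    resp.raise_for_status()
    return resp.json().get("known", False)

def add_note(user, word, definition):
    """Add a new note, to be picked up by the official Anki clients on their next sync."""
    resp = requests.post(f"{SYNC_SERVER}/add_note",
                         json={"username": user,
                               "fields": {"Simplified": word, "Meaning": definition}},
                         timeout=10)
    resp.raise_for_status()
    return resp.json()
```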

Text Enricher

This component was developed from scratch. While there may be some open source software out there that could have been adapted, this is the core so I really needed to be comfortable with it. I am most comfortable with Python/Django+Postgresql, so that is what I went with.
This component receives API requests for enrichment (for a given user) and, roughly as sketched in code below:
    - gets the parsed version from CoreNLP
    - goes through the words and sentences CoreNLP has identified and gets them translated using internal dictionaries and the Azure Text API
    - goes through the words CoreNLP has identified and determines which are known by querying the anki-sync-server
    - returns this to the clients in JSON format
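
Stripped of the Django plumbing and error handling, the flow is roughly the following; the helper functions refer to the sketches elsewhere in this post, and the exact JSON shape is illustrative:

```python
def enrich(text, user):
    """Rough shape of the enrichment endpoint for a block of Chinese text."""
    parsed = parse_chinese(text)                       # CoreNLP, sketched above
    sentences = []
    for sentence in parsed["sentences"]:
        original = "".join(t["word"] for t in sentence["tokens"])
        tokens = []
        for token in sentence["tokens"]:
            word, pos = token["word"], token["pos"]
            tokens.append({
                "word": word,
                "pos": pos,
                "gloss": best_translation(word, pos),  # dictionaries + Azure, sketched below
                "known": is_known(user, word),         # anki-sync-server, sketched above
            })
        sentences.append({
            "original": original,
            "translation": azure_translate(original),  # full-sentence translation
            "tokens": tokens,
        })
    return {"sentences": sentences}                    # serialised to JSON for the clients
```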

Dictionaries

As I mention above, the Azure API doesn't do Chinese -> English word alignment for sentence translations (it *does* do English -> Chinese :( ), so I have a simple algorithm that attempts to choose the "best" translation for the context from the dictionary lookups done. What I do is similar to what Azab et al. 2013, mentioned above, report doing (basically looking for something with the right part of speech/POS). While having word alignment would be nice, I am pretty sure that for this to be done properly it will require adapting an existing system or even building a new one. The issue is that what this system needs is the best translation for each given word. It doesn't want the best translation for phrases/sentences or entire texts, and any optimising for that will likely make the system worse. As I can't think of any other use-case where you don't want to optimise for translating phrases or longer units, it is unlikely anyone has built exactly what this system needs. I have a strong suspicion that there are systems out there that either have this sort of thing as an intermediate result in the translation pipeline or could be tweaked (using weights or whatever) to produce it pretty easily. That will need the advice of an expert though. In any case, the algorithm I have is clunky and often wrong, but it is still good enough to make the system usable.
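
A simplified sketch of that selection logic; the real code is clunkier, `abc_lookup` and `cedict_lookup` are illustrative helper names, and the dictionary entries are assumed to already be normalised onto a common coarse POS set (see the mapping section below):

```python
def best_translation(word, corenlp_pos):
    """Pick a gloss for `word`: prefer a dictionary entry whose POS matches the one
    CoreNLP assigned, then any dictionary entry, then the bare Azure translation."""
    target_pos = CORENLP_TO_SIMPLE.get(corenlp_pos)    # POS mapping, sketched below
    candidates = []
    for lookup in (abc_lookup, cedict_lookup, azure_dictionary_lookup):
        candidates.extend(lookup(word) or [])
    # 1. an entry with the "right" part of speech for this context
    for entry in candidates:
        if target_pos and entry.get("pos") == target_pos:
            return entry["translation"]
    # 2. failing that, any dictionary entry at all
    if candidates:
        return candidates[0]["translation"]
    # 3. last resort: the single, POS-less translation from Azure
    return azure_translate(word)
```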

In addition to the above, Azure is used for dictionary lookups. Making calls to the Azure API and waiting for the responses can lead to some pretty nasty response times, so *during testing*, to keep latencies down, I cache responses and don't repeat calls for lookups I have already done.
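
The cache is currently nothing fancier than in-memory memoisation along these lines (a shared cache such as Django's cache framework or Redis would be the obvious next step):

```python
# Crude in-memory cache used during testing so identical lookups never hit Azure twice.
_LOOKUP_CACHE = {}

def cached_dictionary_lookup(word):
    if word not in _LOOKUP_CACHE:
        _LOOKUP_CACHE[word] = azure_dictionary_lookup(word)
    return _LOOKUP_CACHE[word]
```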

I also use 2 other "local" dictionaries - the free CC-Cedict and the ABC Chinese to English dictionary. The CC-Cedict dictionary lacks part-of-speech information, so it is pretty limited in how it can be used in the system.

A digital version of the ABC Chinese dictionary is available via a developer licence (for exclusively personal use) that you can request when you purchase the Wenlin software suite. The Wenlin software was truly amazing back in the 90s but it hasn't had the resources to evolve much over the last couple of decades. As it is not open source, they have had to rely on internal resources, which has meant progress has been very slow indeed, and there isn't even really an Android version, let alone iOS or a web-extension version like the one my system has. So at this point I simply load a parsed version of the dictionary into memory from a text-format export of the ABC digital database/dictionary.
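
Loading it is nothing more sophisticated than parsing the export into a dict at startup. The format below is purely illustrative (a flat tab-separated headword / POS / definition file); the real ABC export uses its own record format and needs a slightly more involved parser:

```python
from collections import defaultdict

def load_abc(path):
    """Load a text export of the ABC dictionary into memory, keyed by headword."""
    entries = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 3 or parts[0].startswith("#"):
                continue
            headword, pos, definition = parts[0], parts[1], parts[2]
            entries[headword].append({
                "pos": normalise_abc_pos(pos),   # see the POS mapping section below
                "translation": definition,
            })
    return entries

ABC_ENTRIES = load_abc("abc_export.txt")  # hypothetical path to the export

def abc_lookup(word):
    return ABC_ENTRIES.get(word, [])
```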

CoreNLP to Azure/CC-Cedict/ABC POS mapping

One of the reasons I wanted to integrate the ABC dictionary was that it can be loaded into memory and has POS information, in addition to being far more comprehensive than the free CC-Cedict (and probably Azure/Bing too). Unfortunately, my initial assumption that determining what POS a given word falls into would be fairly standard and uncontroversial turned out to be very, very wrong. The POS categories returned by CoreNLP don't map very cleanly to the very basic Azure POS categories, and they don't map very well *at all* to the categories used by the ABC! How can there be disagreement on whether a language has adjectives or not?!?! So that makes finding the best translation for a word, given the POS, significantly more challenging than I thought it would be! After using the system for a while, I have also noticed a number of cases where Azure will, say, have a noun and a verb translation for a given word, while the ABC will have a verb and an adjective translation, each missing forms the other has.
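
To give a flavour of the mapping tables involved: both of the ones below are incomplete, the ABC labels in particular are quoted from memory, and the coarse target set simply mirrors Azure's posTag values, so treat this as an illustration of the problem rather than a definitive mapping:

```python
# CoreNLP's Chinese models emit Penn Chinese Treebank tags; collapse them onto
# the coarse categories Azure's dictionary lookup uses (NOUN, VERB, ADJ, ...).
CORENLP_TO_SIMPLE = {
    "NN": "NOUN", "NR": "NOUN", "NT": "NOUN",
    "VV": "VERB", "VC": "VERB", "VE": "VERB",
    "VA": "ADJ",   # predicative adjective / stative verb: already contentious
    "JJ": "ADJ",
    "AD": "ADV",
    "PN": "PRON",
    "P": "PREP",
    "CC": "CONJ", "CS": "CONJ",
}

# The ABC's own labels, squeezed onto the same coarse set. The interesting
# disagreements are exactly the ones a table like this papers over, e.g.
# whether stative verbs should count as "adjectives" at all.
ABC_TO_SIMPLE = {
    "n.": "NOUN",
    "v.": "VERB", "v.o.": "VERB", "r.v.": "VERB",
    "s.v.": "ADJ",
    "adv.": "ADV",
    "pr.": "PRON",
    "conj.": "CONJ",
}

def normalise_abc_pos(label):
    return ABC_TO_SIMPLE.get(label.strip(), "OTHER")
```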

Another situation that arises regularly is bad parsing from CoreNLP. When CoreNLP does a poor job at word identification and splits up words or combines several words into one, that makes dictionary lookup simply impossible. To at least have something, I also always perform an Azure translation (in addition to the Azure dictionary lookup) on all words that CoreNLP identifies. The translation API always gives a translation (only one, and with no POS) so that covers these cases pretty well. It is sometimes not very good but is usually good enough to not require further investigation from the user (me!).

These issues should be temporary though, until a system can be found or created that does best-translation-for-word-in-context translations.

Client-side

The initial version created the enriched HTML on the server and returned it for the Google Chrome extension to swap in for the original HTML. The amounts of text involved were pretty big though, and as the system will soon be used in non-HTML environments, I converted it to return JSON instead (and maybe protobuf or Thrift later on).

Porting to Firefox

As Chrome for Android doesn't support web extensions while Firefox for Android does support (unmodified) extensions, I ported the extension to Firefox. The extension systems of both browsers are based on the same WebExtensions architecture, so in theory it is just a matter of changing a couple of lines in the manifest.json file and you're away laughing. As anyone who has ever drunk the "write once, run everywhere" kool-aid at some point in their programming career can attest, this always turns out to be a rather sick joke played on us by the platform developers. It is NEVER that simple. Ever!!! Debugging on Firefox is also much harder than on Chrome, as Chrome's internals give back much better error messages. After much hair pulling I finally got a version that works well on Firefox across all of my devices: laptop, Android tablet and two Android phones (a Nexus 6P and a Chinese OnePlus 6 - H2OS, not OxygenOS).

I have had to use the Developer Edition of Firefox on the desktop because the normal version requires extensions to go through a certification and validation process to get signed. I initially spent a lot of time trying to get a nightly version of Firefox installed on my OnePlus 6 (which can't get access to the Play Store, even with a VPN) before finding out that you can indeed install unsigned extensions on the normal Firefox for Android. I understand (and respect?) the reasoning behind demanding only signed extensions in the consumer version of Firefox; I just can't see the justification for requiring it on the desktop and not on mobile...

Data entry extension

As terms I want/need to know come up in conversation that I haven't come across in text form, I needed a good way to get these words into Anki. I initially had a small web interface where I could enter a word in Chinese characters, which would then be looked up using exactly the same APIs as the normal text enrichment. This means that with a single extra click I get the same definitions and workflow for adding a new word as when I come across something in a webpage and want to add it to my learning list. Other plugins/extensions exist for getting dictionary definitions and adding them to Anki (either on the desktop only, or by breaking the TOS and pushing to Ankiweb) but nothing exactly like what I needed, and nothing that hooked into my system of dictionaries/lookups/translations.

I quickly found out that making full-page HTML requests to a normal HTTPS website hosted outside China that hasn't been whitelisted is highly unreliable - much more so than making AJAX calls only, or at least that is how it seems... As I initially thought AJAX-only would mean an easier time with the firewall, I made another Firefox extension to do the job. This also has the distinct advantage *of not being a website*. I don't know whether there is a properly detailed definition of what constitutes a "website" for the purposes of China's website registration licences, but an API that only serves JSON *might* not be covered...

Open Source?

I have been a fan and promoter of open source since I discovered the movement 17-odd years ago, and I will likely open source all of the components I can (so not the ABC dictionary, for example...) at some point. That will require removing the hard-coded passwords and doing lots of cleanup, adding tests, etc. There is probably 1-2 months of dev work for that to happen, so it likely won't be in the near future unless a really good reason to do so appears. The other issue is that for it to scale, pretty much everything will need to be reimplemented, likely not in Python. I love Python but it is incredibly slow. My JavaScript skills are also very, very poor and the web extensions are really horribly coded. Starting again from scratch (and writing tests beforehand...) is probably a much better way to go than trying to understand or extend what I have written.

Ideally I also want to get the different components set up in containers, probably using Gitlab's build and deploy pipeline (AutoDevops) to get them pushed directly from build to a Kubernetes cloud somewhere (Alibaba or Tencent probably). That is going to take quite a bit more than what I am doing now, which is just doing a push to a private Gitlab repo and then pulling from the VM to "deploy".

Anyway, I hope to make something available at some point in the not-too-distant future to get other peoples' ideas and evolve it forward!
