[DRAFT] Live image/video translation
This post is part of the series On-device translation for Android
In the previous post, we got some high-quality image translation running reasonably fast on-device.
We can push it further though, and avoid the need to take a picture, select it, and translate in discrete steps.
This post contains a bunch of jittery video; you might get seasick :).
With the existing image pipeline, we can run the 'detect' phase of the model, at low resolution (640x480?) in about 80ms.
Then, we can run the recognizer in another 50~100ms. This lets us build a very laggy live overlay.
I spent a long while here trying to make that overlay usable: matching the contours back to their positions, using heuristics to map the previous frame's detections to the current frame's, smoothing the movement, and so on. It was never going to work, for two reasons:
- detection is too slow to run live
- the heatmap doesn't carry enough information to track detections across runs
So, what to do?
Well, the only answer is to not run detection often. Instead, run it rarely, in the background, and project the detected labels onto the new frames.
This effectively means tracking the position of 'objects' and how they move through the scene over time.
The standard approach for this is to track features: distinctive points, such as corners, that can be found again in later frames.
Then, use a detector and descriptor such as FAST and BRIEF (ORB combines the two, but the exact choice matters little) to match those features between frames and estimate how they are moving on the screen.
From the matched points we can calculate a homography, and use it to re-warp the original detection onto the current frame.
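To make the re-warping step concrete, here is a minimal sketch in plain Python. It assumes we already have four matched feature points between the frame where detection ran and the current frame. The names `homography_from_4` and `project` are made up for this example and are not part of the app. A real pipeline would have many noisy matches and would use a robust estimator (for example OpenCV's `findHomography` with RANSAC) instead of this exact four-point solve.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a*x = b with Gauss-Jordan elimination and partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_4(src: Sequence[Point], dst: Sequence[Point]) -> List[List[float]]:
    """3x3 homography (bottom-right entry fixed to 1) mapping src[i] to dst[i]."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), and the same for v.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def project(h: List[List[float]], p: Point) -> Point:
    """Apply the homography to one point (with the perspective divide)."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Tracked features in the detection frame, and where they are now
# (here the camera simply shifted the scene by 10px right and 5px down).
old_pts = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
new_pts = [(10.0, 5.0), (110.0, 5.0), (110.0, 105.0), (10.0, 105.0)]
H = homography_from_4(old_pts, new_pts)

# Corners of a text box found by the (slow) detector on the old frame.
text_box = [(20.0, 30.0), (60.0, 30.0), (60.0, 50.0), (20.0, 50.0)]
warped_box = [project(H, p) for p in text_box]
```

The overlay for the current frame is then drawn at `warped_box`, and detection only needs to run again when tracking drifts or the scene changes.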
Performance
Even if Tesseract is not very good, it is 'quite fast': in my initial tests, it was about 2-3x faster than the naive PP-OCR pipeline.
The first easy win was to quantize the model. Doing the math in FP32 gives more precision than necessary, and moving to FP16 reduced compute time by ~30%.
Next, the time to run inference on the detection model scales with pixel count. I was passing the original image to the model, but it can detect text down to a very small scale, so capping the image to 900px on its largest side worked pretty well and took detection from ~300ms to ~100ms.
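The size cap can be sketched as a small helper. This is an illustration in Python, not the app's code, and `capped_size` is a hypothetical name. It also returns the scale factor, on the assumption that detected boxes need to be mapped back to original-image coordinates afterwards.

```python
from typing import Tuple


def capped_size(width: int, height: int, max_side: int = 900) -> Tuple[int, int, float]:
    """Return (new_width, new_height, scale) so the longest side is at most max_side.

    Images that are already small enough are left untouched (scale = 1.0).
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height, 1.0
    scale = max_side / longest
    # Keep the aspect ratio; never let a side collapse to 0 pixels.
    return max(1, round(width * scale)), max(1, round(height * scale)), scale


new_w, new_h, scale = capped_size(4000, 3000)

# A box detected at x=450 on the downscaled image sits at x / scale on the original.
original_x = 450 / scale
```

Dividing detected coordinates by `scale` puts the boxes back on the full-resolution image, so the recognizer can still crop its strips from the original pixels.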
The recognition model's runtime also scales with strip width (...obviously). Because this step is parallelized, a single long strip can hold up the entire pipeline. A good way to ensure no strip is 'too wide' is to cap the maximum width by splitting on spaces.
TODO SPACE ALGO
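As a placeholder, here is a minimal sketch of one way the splitting could work. It is not necessarily what the app does. It assumes each strip has been reduced to a per-column "ink" count (number of text pixels in that column), treats runs of empty columns as spaces, and greedily cuts at the last space that keeps a piece under the cap. `find_gaps`, `split_strip`, and `min_gap` are names invented for this example.

```python
from typing import List, Tuple


def find_gaps(ink: List[int], min_gap: int = 4) -> List[Tuple[int, int]]:
    """Return [start, end) column ranges with no ink, at least min_gap wide."""
    gaps = []
    start = None
    for x, value in enumerate(ink):
        if value == 0:
            if start is None:
                start = x
        else:
            if start is not None and x - start >= min_gap:
                gaps.append((start, x))
            start = None
    return gaps  # blank columns trailing after the last glyph are ignored


def split_strip(ink: List[int], max_width: int, min_gap: int = 4) -> List[Tuple[int, int]]:
    """Split a strip into [left, right) pieces no wider than max_width."""
    if max_width <= 0:
        raise ValueError("max_width must be positive")
    cuts = [(start + end) // 2 for start, end in find_gaps(ink, min_gap)]
    pieces = []
    left, width = 0, len(ink)
    while width - left > max_width:
        # Prefer the right-most space that still fits in this piece.
        candidates = [c for c in cuts if left < c <= left + max_width]
        # No usable space: fall back to a hard cut, which may split a word.
        cut = candidates[-1] if candidates else left + max_width
        pieces.append((left, cut))
        left = cut
    pieces.append((left, width))
    return pieces


# Three 30px "words" separated by 6px spaces, capped at 50px per piece.
profile = [1] * 30 + [0] * 6 + [1] * 30 + [0] * 6 + [1] * 30
pieces = split_strip(profile, max_width=50)
```

Cutting in the middle of a gap leaves a little padding on both sides of each piece. The hard-cut fallback is the weak spot, since a cut through a word will likely hurt recognition for that piece.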
On impostor syndrome
I have mixed feelings about this project. I am very happy that it exists. Math is hard, and this claudio guy can do it. I wouldn't have attempted it without him. Yet, comparing this to working with juniors: I have given away implementation before, and it did not feel like this.