Apple recently published an interesting article on their latest work to scale up Chinese handwriting recognition on mobile devices from 3,000 to 30,000 characters. While 3,755 characters is often quoted as enough for reading comprehension, it’s common to use a small number of less frequent characters, for example in names, poems and quotations.
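To make the 3,755 figure concrete, here is a minimal sketch using only the Python standard library. It assumes the figure refers to the level-1 block of GB2312 (rows 16 to 55), which is my reading rather than something stated above; the helper names `gb2312_level1_chars` and `outside_level1` are made up for illustration.

```python
def gb2312_level1_chars():
    """Collect the characters in rows 16-55 of GB2312 (assumed to be the
    commonly quoted 3,755-character level-1 set)."""
    chars = set()
    for hi in range(0xB0, 0xD8):        # lead bytes for rows 16..55
        for lo in range(0xA1, 0xFF):    # trail bytes for cells 1..94
            try:
                chars.add(bytes([hi, lo]).decode("gb2312"))
            except UnicodeDecodeError:
                pass  # unassigned cell at the end of row 55
    return chars


LEVEL1 = gb2312_level1_chars()


def outside_level1(text):
    """Return the CJK ideographs in `text` (U+4E00..U+9FFF only) that are
    not in the level-1 set, e.g. rarer or traditional-form characters."""
    return [c for c in text if "\u4e00" <= c <= "\u9fff" and c not in LEVEL1]


if __name__ == "__main__":
    print(len(LEVEL1))
    print(outside_level1("花華"))
```

Running a name or a line of a poem through `outside_level1` gives a rough idea of how often text strays outside the common set, which is the gap a 30,000-character recogniser is meant to close.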
What’s the best model or training image set available in the public domain, for example for Keras or TensorFlow? I’m particularly interested in traditional and ancient scripts, for example to recognise inscriptions on artwork.
Apple collected tens of millions of samples as training data to cope with different writing styles, but that’s not publicly available, and compiling a good data set is a research project in itself.
Our character recognition system thus focused on the Hànzì part of GB18030-2005, HKSCS-2008, Big5E, a core ASCII set, and a set of visual symbols and emojis, for a total of approximately 30,000 characters, which we felt represented the best compromise for most Chinese users.
As scripts get more cursive, they’re often pretty difficult even for humans to read:
How many of these last do you recognise as being the same character as 花?