Datatang adds 5,000 traditional Chinese characters to its OCR database
Beijing-based artificial intelligence (AI) company Datatang has updated its optical character recognition (OCR) database to include 5,000 traditional Chinese handwritten characters.
In a webpage dedicated to the new set, Datatang said the characters were collected from various samples written on A4 paper, square paper, and lined paper, among others.
By adding the characters to its software suite, Datatang lets customers recognize these traditional Chinese characters when they encounter them in real-world documents. In other words, users who scan text with a smartphone and the Datatang application will now be able to automate data entry and form filling.
OCR is sometimes implemented for scanning documents in digital identity verification and onboarding applications.
According to the company, an annotation qualifies only if the error at each vertex of the quadrilateral bounding box around a character is less than five pixels. Datatang also said that neither the accuracy of the bounding boxes nor the accuracy of the text transcription would fall below 97%.
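To make the stated thresholds concrete, here is a minimal Python sketch of how such a check might look. It is not Datatang's tooling: the function names are hypothetical, and it assumes the vertex error is the Euclidean distance to a reference vertex, which the company does not specify.

```python
import math

# Thresholds quoted in the article.
PIXEL_TOLERANCE = 5.0   # each vertex error must be under five pixels
ACCURACY_TARGET = 0.97  # stated floor for box and transcription accuracy


def vertex_errors(annotated, reference):
    """Distance between corresponding vertices of two quadrilaterals.

    Assumption: error is measured as Euclidean distance in pixels.
    """
    if len(annotated) != 4 or len(reference) != 4:
        raise ValueError("each quadrilateral needs exactly four (x, y) vertices")
    return [math.dist(a, r) for a, r in zip(annotated, reference)]


def box_is_qualified(annotated, reference, tolerance=PIXEL_TOLERANCE):
    """True if every vertex lies strictly within the pixel tolerance."""
    return all(err < tolerance for err in vertex_errors(annotated, reference))


def accuracy(flags):
    """Share of items marked correct, as a fraction between 0 and 1."""
    flags = list(flags)
    if not flags:
        raise ValueError("cannot compute accuracy of an empty set")
    return sum(flags) / len(flags)


if __name__ == "__main__":
    reference = [(0, 0), (10, 0), (10, 10), (0, 10)]
    close = [(1, 1), (10, 0), (10, 10), (0, 10)]   # worst vertex is about 1.4 px off
    far = [(6, 0), (10, 0), (10, 10), (0, 10)]     # first vertex is 6 px off

    results = [box_is_qualified(q, reference) for q in (close, far)]
    print(results)
    print(accuracy(results) >= ACCURACY_TARGET)
```

In this sketch a box fails as soon as a single vertex reaches the tolerance, and the 97% figure is applied to the share of boxes (or transcriptions) that pass.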
The addition of the new dataset comes months after Datatang executives said their speech recognition datasets were created with native speakers and exceeded industry standards.
Most recently, the company showcased its synthetic data generation technology at the 2022 Conference on Computer Vision and Pattern Recognition (CVPR 2022).