Mixed Text and Formula Recognition Using ResNet and Transformer

Authors

  • Chenhe Yang

DOI:

https://doi.org/10.62051/evgcmr19

Keywords:

Deep Learning; LaTeX Conversion; Mathematical Formula Recognition; Transformer; ResNet.

Abstract

Recent advances in deep learning have enabled increasingly accurate recognition of images, objects, texts, and other complex structures. Scene text recognition in particular has attracted considerable attention in the computer vision community, yet existing algorithms still leave substantial room for improvement. This paper applies deep learning to recognize text and mathematical formulas in images and convert them into LaTeX format; recognizing images that mix text and formulas more accurately and efficiently directly improves the quality of the generated LaTeX code. The model uses ResNet as an encoder for feature extraction and a Transformer as a decoder for text generation, improving the accuracy of the generated output. Two types of datasets were used: pure formulas, and a mixture of text and formulas. The model performs well on both, with character error rates (CERs) below 0.05 and a loss of only 0.10. Accuracy on pure formula recognition is higher than on mixed content, likely because more pure-formula data is available and because text features are harder to extract accurately in mixed images. This method can automatically parse and generate content containing mathematical formulas, supporting applications in education and improving the efficiency and quality of digital content generation.
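The CER reported in the abstract is conventionally computed as the character-level edit (Levenshtein) distance between the predicted and reference strings, divided by the reference length. A minimal sketch of that metric (the example LaTeX strings are illustrative, not taken from the paper's data):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            # deletion, insertion, substitution
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

# A single-character slip in an 11-character LaTeX string:
print(round(cer(r"\frac{a}{b}", r"\frac{a}{c}"), 3))  # -> 0.091
```

A CER below 0.05 thus means fewer than one character error per twenty reference characters, averaged over the test set.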


References

Kantipudi, M., Kumar, S., & Jha, A. (2021). Scene Text Recognition Based on Bidirectional LSTM and Deep Neural Network. Computational Intelligence and Neuroscience, 2021, Article 2676780. https://doi.org/10.1155/2021/2676780

Long, S., He, X., & Yao, C. (2021). Scene Text Detection and Recognition: The Deep Learning Era. International Journal of Computer Vision, 129(1), 161-184. https://doi.org/10.1007/s11263-020-01369-0

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems 30 (NIPS 2017).

García, S., Luengo, J., & Herrera, F. (2014). Data Preprocessing in Data Mining. Springer. https://doi.org/10.1007/978-3-319-10247-4

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770-778). Las Vegas, NV, USA. https://doi.org/10.1109/CVPR.2016.90

Chen, X., Kar, S., & Ralescu, D.A. (2012). Cross-entropy measure of uncertain variables. Information Sciences, 201, 53-60. https://doi.org/10.1016/j.ins.2012.02.049

Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

Lee, S., Lee, J., Moon, H., Park, C., Seo, J., Eo, S., Koo, S., & Lim, H. (2023). A Survey on Evaluation Metrics for Machine Translation. Mathematics, 11(4), Article 1006. https://doi.org/10.3390/math11041006

Reiter, E. (2018). A Structured Review of the Validity of BLEU. Computational Linguistics, 44(3), 393–401. https://doi.org/10.1162/coli_a_00322

Perez, L., & Wang, J. (2017). The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621.

Jiang, W., Zhang, K., Wang, N., & Yu, M. (2020). MeshCut data augmentation for deep learning in computer vision. PLOS ONE, 15(12), e0243613. https://doi.org/10.1371/journal.pone.0243613

Published

12-08-2024

How to Cite

Yang, C. (2024) “Mixed Text and Formula Recognition Using ResNet and Transformer”, Transactions on Computer Science and Intelligent Systems Research, 5, pp. 739–746. doi:10.62051/evgcmr19.