Characterization and recognition of handwritten digits using Julia
M. A. Jishan, M. S. Alam, Afrida Islam, I. R. Mazumder, K. R. Mahmud, A. K. Al Azad
Faculty of Statistics, Technische Universität Dortmund, Germany
Department of Computer Science and Engineering, University of Liberal Arts Bangladesh
{md-asifuzzaman.jishan, md-shahabub.alam, afrida.islam, imran.mazumder}@tu-dortmund.de
{raqib.mahmud, abul.azad}@ulab.edu.bd

Abstract—Automatic image and digit recognition is a computationally challenging task for image processing and pattern recognition, requiring an adequate appreciation of the syntactic and semantic importance of the image for the identification of handwritten digits. Image and pattern recognition has been identified as one of the driving forces of current research because of its wide range of applications, such as security frameworks, clinical systems, entertainment, and so on. In this study, we implemented a hybrid neural network model that is capable of recognizing the digits of the MNIST dataset and achieved a remarkable result. The proposed neural network model can extract features from the image and recognize those features layer by layer. Beyond that, it is important to understand how the proposed model works in each layer and how it generates its output. Besides, the model can also realize the auto-encoding and variational auto-encoding systems for the MNIST dataset. This study explores these issues, explains them, and shows how the associated difficulties can be overcome.
Index Terms—Neural network, CNN feature recognition, convolutional neural network, classification, autoencoder, MNIST handwritten digit dataset
I. INTRODUCTION
The importance of good metrics and simplified problems cannot be downplayed, particularly in critical and fast-moving fields such as artificial intelligence, computer vision, and pattern recognition. Such tasks offer a simple, quantitative, and rational way to break down and analyze the evolution of learning methods and strategies. In particular, where the task is intuitive and quick to grasp, observers can rapidly understand the layout and characteristics of the methods and algorithms. As a single dataset may cover only one particular task, the presence of a varied set of benchmark tasks is important in permitting a more comprehensive way to evaluate and characterize the performance of an algorithm or system. In machine learning there are a few standardized datasets that are broadly used and have become highly competitive. These include the MNIST dataset [1], the CIFAR-10 and CIFAR-100 datasets, the STL-10 dataset, and the Street View House Numbers (SVHN) dataset [2], [3], [4], [5], [6], [7].

The MNIST dataset is the most widely recognized and used dataset for image recognition and computer vision. It defines a 10-class handwritten digit classification task and was first published in 1998. To remain useful and to guarantee its longevity, a good dataset must pose a sufficiently challenging problem. This is perhaps where MNIST has suffered under deep learning and convolutional neural networks despite their very high precision: different research groups have published more than 99.7 percent classification accuracy, a level at which questions can be raised about the continued value of the benchmark [5], [6], [7], [8], [9].

Image processing refers to the capability of recognizing locations, objects, artifacts, structures, faces, and other object features. We exchange huge amounts of visual data with software, organizations, and websites, and cell phones with cameras produce countless digital images and graphics [10], [11], [12]. This vast quantity of data is used to communicate with the people who access it and to provide better and more intelligent services. The key steps in image recognition technology are data compilation and sorting, predictive model creation, and photo identification. These models aim to match the capacity of the human visual system (HVS). The human eye perceives an image as a set of colors processed by the visual cortex in the cerebrum, and these perceptions are related to concepts and objects stored in memory. Image recognition tries to emulate this process: a computer considers the image to be a group of pixels with discrete numerical color values (red, green, and blue). Visual perception is a big part of computer vision. The main aim of machine learning is to produce algorithms that, from input data and statistical analysis, predict appropriate output values; the machine receives data and uses it to reason when it faces a new configuration. In interaction with the world, machines constantly learn and extend their knowledge. The concept of machine learning is frequently found at the core of such learning and inference systems, extended toward intelligent systems.

The Julia programming language is designed specifically for scientific computing.
It is a flexible dynamic language with performance comparable to traditional statically typed languages. Julia aims to provide a single environment that is productive enough for prototyping and efficient enough for industrial applications. It has a very intuitive model that suits what is expected from a machine learning programming language. We used the Julia programming language and its machine learning libraries for the characterization and recognition of the MNIST handwritten dataset.

The research investigates the redundancy of the filters, which are model parameters, by a visualization method. The key point of the research is using a target dataset, e.g. MNIST, implemented in Julia; our system follows the conventional procedure of extracting features from an image using a simple convolutional neural network. Moreover, we report the threshold values, the number of filters, the ratio of similar pairs, a visualization of the similarities, a test input image, and the output of the activation function for the similar filters in the first convolutional layer. We used the cosine similarity measure to compute the similarity between two filters, which is responsible for detecting similar filters. Furthermore, we show the results for similar filters in the second convolutional layer, for two different filter sizes, with the number of filters ranging from 32 to 256. Finally, we present auto-encoding and variational auto-encoding results on MNIST. By visualizing the activation maps of the filters, we have validated the system.
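As a minimal illustration of this similarity measure, cosine similarity between two filters can be computed in Julia as follows; the 3 x 3 filter shape below is only an assumption for the example and does not reproduce the paper's own filter sizes:

    using LinearAlgebra

    # Cosine similarity between two filters, treated as flattened weight vectors.
    # Values close to 1 indicate that the two filters detect nearly the same pattern.
    cosine_similarity(a, b) = dot(vec(a), vec(b)) / (norm(vec(a)) * norm(vec(b)))

    # Two hypothetical 3x3 filters, standing in for weights taken from a trained layer.
    w1 = randn(3, 3)
    w2 = randn(3, 3)
    println(cosine_similarity(w1, w2))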
II. RECENT WORK

The paper "Characterization of Symbolic Rules Embedded in Deep DIMLP Networks: A Challenge to Transparency of Deep Learning" studies rule extraction from ensembles of Discretized Interpretable Multi-Layer Perceptrons (DIMLP) and from deep DIMLPs for predictive accuracy on digit recognition. Feature detectors created by neural networks over MNIST, as well as the complexity of the extracted rule sets, are analyzed to keep a good balance between accuracy and interpretability [13], [14]. Elsewhere, a way of reducing the number of parameters in fully connected layers of a neural network using pre-defined sparsity has been derived, indicating that convolutional neural networks can operate without any loss of accuracy at less than 0.5-5 percent overall network connection density [15], [16], [17]; the results are demonstrated on MNIST.

Restricted Boltzmann Machines (RBM) can generate graded and distributed representations of data. The work "Emergence of Compositional Representations in Restricted Boltzmann Machines" has shown how to characterize the structural conditions that allow an RBM to operate in such a compositional phase when it is trained on the handwritten digits of MNIST [18]. Moreover, as neural network structures are inherently compute- and power-intensive, hardware accelerators emerge as a promising solution. Through a High Level Synthesis (HLS) approach that characterizes the vulnerability of several components of a Register-Transfer Level (RTL) model of a neural network, it has been shown that the severity of faults depends on the application-level specification and on the neural network data [19], [20].

By using response characterization methods, a systematic pipeline for interpreting individual hidden-state dynamics within Recurrent Neural Networks (RNNs), especially Long Short-Term Memory networks (LSTMs), can be defined at the cellular level. This method can uniquely identify neurons with insightful dynamics and test accuracy through ablation analysis [20]. Another method has been proposed for discovering the features required for separating images using a deep autoencoder. This methodology automatically learns the image representation features for clustering and groups similar images into one cluster while simultaneously separating dissimilar images into another cluster [18], [19], [20], [21].

A comparison of four neural networks on the MNIST dataset has also been carried out: Convolutional Neural Networks (CNN), Deep Residual Networks (ResNet), Dense Convolutional Networks (DenseNet), and an improvement on the CNN baseline, the Capsule Network (CapsNet), for image recognition. CapsNet is considered to give excellent performance despite using a small amount of data [22]. Finally, "EMNIST: an extension of MNIST to handwritten letters" alters the MNIST dataset into the Extended MNIST, constituting a more challenging dataset while allowing direct compatibility with all existing classifiers and systems [22], [23].

III. DATASET
A neural system works on the dataset it is fed. The MNIST (Modified National Institute of Standards and Technology) database is an open-source handwritten digits dataset, widely used for training and testing in the fields of machine learning, computer vision, and image processing. This work is done using the MNIST dataset, fed into the neural network system, directly imported and downloaded from Keras. The dataset contains 60,000 training and 10,000 testing images stored in a simple file format designed for storing vectors as well as multidimensional matrices. The database is a subset of the samples of black-and-white images of NIST's original datasets, which were collected by LeCun et al. from United States Census Bureau employees and high school students. In MNIST, the handwritten digits are size-normalized and centered in fixed-size 28 x 28 images with corresponding labels, split into a training, validation, and test set. The training set lets the neural system estimate the various features, whereas the validation set is used to check its predictions. The testing set provides an unbiased estimate of the final model fitted to the training dataset. The target of this work is to prepare the model to assign these pictures to the correct class (0-9) [3], [4], [5].
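A minimal sketch of how the dataset could be loaded and prepared in Julia is shown below; it assumes the Flux.Data.MNIST loader and the onehotbatch utility from Flux, since the paper does not show its exact data-loading code:

    using Flux
    using Flux: onehotbatch
    using Flux.Data.MNIST

    imgs   = MNIST.images()      # 60,000 training images, each a 28x28 grayscale array
    labels = MNIST.labels()      # the corresponding integer labels 0-9

    # Flatten each image, stack them column-wise, and reshape to the
    # 28x28x1xN layout expected by Flux's convolutional layers.
    X = reshape(reduce(hcat, [Float32.(vec(img)) for img in imgs]), 28, 28, 1, :)

    # One-hot encode the labels over the ten digit classes.
    Y = onehotbatch(labels, 0:9)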
IV. METHODOLOGY

In the area of computer vision, a neural network framework requires complex computation that enables the computational framework to discover patterns by coordinating complex relationships in the input data, much like a human mind.

A Convolutional Neural Network (CNN) is a deep neural network model that selects attributes in the input picture and separates them from others. In earlier years, filters such as blurring, sharpening, and edge detection had to be hand-designed and required considerable preparation before a neural network could become an integral factor. The wide adoption of this approach has, for example, led to successes in facial expression recognition, image classification and object identification, recommendation frameworks, handwriting recognition, and image-to-natural-language processing systems [24].
Fig. 1. Image characterization with the CNN part of the proposed model.
We utilized four primary layer types in the CNN design: a convolutional layer, a pooling layer, a rectified linear unit, and a fully connected layer. The convolutional layer sits at the core of the network and performs convolutions, i.e. linear operations that multiply a set of weights, called a filter or kernel, with patches of the input data. The fundamental purpose of convolution is to bring out high-level features such as edges as well as lower-level features such as color, gradient orientation, and so on [24], [25], [26]. Using the same filter to detect a particular object in the picture has proven powerful, as the filter is applied systematically everywhere across the image where the object may appear [27], [28].

Next comes the pooling layer. Its primary goal is to progressively reduce the spatial size of the representation, which reduces the number of parameters and the amount of computation in the network as well as controlling overfitting [29], [30]. Using the MAX operation, it works independently on each depth slice of the input and resizes it spatially. Fully connected layer: neurons in an FC layer have full connections to all activations in the previous layer, so their activations can be computed with a matrix multiplication followed by a bias offset. It is possible to convert FC layers to convolutional layers, as there is very little difference between them. The proposed model is shown in Figure 1.
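The effect of these layers is easiest to see by tracking tensor shapes. A minimal sketch in Julia (Flux) follows, with an assumed 3 x 3 filter and 32 channels purely for illustration:

    using Flux

    x = rand(Float32, 28, 28, 1, 1)        # a single 28x28 grayscale input image

    conv = Conv((3, 3), 1 => 32, relu)      # convolutional layer: 3x3 filters, 32 output channels
    pool = MaxPool((2, 2))                  # pooling layer: halves each spatial dimension

    h = conv(x)                             # 26 x 26 x 32 x 1 after the (valid) convolution
    p = pool(h)                             # 13 x 13 x 32 x 1 after max pooling
    println(size(h), " -> ", size(p))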
V. SIMULATION SETUP

A. Image Handling
We implemented a convolutional neural network for our model. The CNN model comprises two convolutional layers and one MaxPooling2D layer that down-samples the output from the convolutional blocks. We utilized one filter size for the first convolutional layer and another filter size for the second convolutional layer. The input image of the model is of shape 28 x 28 with one color channel.

We used the raw picture files of the dataset alongside CNN and VGG features. We set the pixel size to 28 x 28. The pictures of the MNIST dataset are grayscale images with pixel values running from 0 to 255 and a resolution of 28 x 28, so before feeding the data into the model it is indispensable to pre-process it. First, each 28 x 28 picture of the dataset is converted into a matrix, which can then be fed into the CNN implementation. We focused on the Julia programming language for implementing this MNIST experiment and used the flux, plots, statistics, mnist, onehotbatch, onecold, crossentropy, throttle, and repeated packages for the multilayer perceptron section. For auto-encoding, we used the flux, mnist, epochs, onehotbatch, mse, plots, throttle, partition, and juno packages. For variational auto-encoding, we used flux, mnist, throttle, params, juno, plots, distributions, and epochs.

This requires the picture to be passed through several preprocessing steps that convert the image into an acceptable shape. After the last convolutional layer there is a flattening layer with 1936 neurons. The CNN is still processing matrices, and those units have to be converted into a vector so they can be fed into the fully connected network for the final result. So we apply a flattening layer here; it converts the network activations from matrices to a vector. At the end of the network there are two fully connected layers: the first fully connected layer has 128 neurons and the second one has 50 neurons. Finally, we utilized the softmax classifier for ordering the output; as the dataset has 10 distinct classes, the output is characterized by 10 neurons.

We implemented categorical_crossentropy as the loss function to measure how good our model is, i.e., how well the model works and produces its output. For classification we used a softmax classifier, placed at the output layer of the neural network; it is commonly used in multi-class learning problems. The softmax function takes an N-dimensional vector of real numbers and transforms it into a vector of real numbers in the range (0, 1) that sums to 1.
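A minimal sketch of a Flux model with this layout is given below. The 3 x 3 filter sizes and the 32/64 channel counts are assumptions made only so the example runs (the paper varies the number of filters from 32 to 256), while the 128-, 50-, and 10-neuron dense layers follow the description above:

    using Flux
    using Flux: crossentropy

    model = Chain(
        Conv((3, 3), 1 => 32, relu),    # first convolutional layer (assumed 3x3 filters)
        Conv((3, 3), 32 => 64, relu),   # second convolutional layer
        MaxPool((2, 2)),                # MaxPooling2D layer that down-samples the feature maps
        Flux.flatten,                   # flattening layer feeding the fully connected part
        Dense(9216, 128, relu),         # first fully connected layer: 128 neurons (9216 = 12*12*64 here)
        Dense(128, 50, relu),           # second fully connected layer: 50 neurons
        Dense(50, 10),                  # one output per digit class
        softmax,                        # softmax classifier over the 10 classes
    )

    # Categorical cross-entropy loss, as described in the text.
    loss(x, y) = crossentropy(model(x), y)

With the paper's own filter sizes and channel counts, the flattened width would instead be the 1936 units reported above.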
B. Technique of the Optimization

For the optimization part of the CNN, we utilized a rectified linear unit (ReLU) based setup for the MNIST dataset using Julia, with a learning rate of 0.001, a decay rate of 1e-6, momentum of 0.9, and Nesterov updates enabled. We cross-validated the learning rate and the weight decay. We used dropout regularization techniques in all layers. We utilized the ReLU and softmax activation functions, set the dropout layer, and used distinct threshold values for the different convolutional layers. We also track accuracy and loss in the CNN part, measuring accuracy and loss versus epoch.
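This reads like a Keras-style SGD configuration. A rough Flux equivalent is sketched below, reusing the model, loss, X, and Y names from the earlier sketches and Flux's Nesterov optimizer; the 1e-6 decay term would need a separate learning-rate schedule, which is omitted here:

    using Flux
    using Flux: throttle, onecold
    using Statistics: mean
    using Base.Iterators: repeated

    # Nesterov momentum with the quoted learning rate and momentum.
    opt = Nesterov(0.001, 0.9)

    # Classification accuracy, used to monitor training alongside the loss.
    accuracy(x, y) = mean(onecold(model(x)) .== onecold(y))

    dataset = repeated((X, Y), 50)                       # 50 passes over the training data
    evalcb  = throttle(() -> @show(loss(X, Y), accuracy(X, Y)), 10)

    Flux.train!(loss, Flux.params(model), dataset, opt; cb = evalcb)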
VI. RESULTS AND DISCUSSION
This research focused on handwritten digit recognition using MNIST and the Julia programming language, and also on the different types of filters in the neural network and on fine-tuning them. The research also investigated which filters in the neural network may be redundant: whether they have negligibly small values, whether their functionality is duplicated by another filter, and how the similarity rate in the various convolutional layers behaves while increasing or decreasing the number of filters. During the training period, this research trained on the full dataset for 50 epochs and achieved 0.991926 training accuracy. The training accuracy and loss are shown graphically in Figures 2 and 3. This research also implemented a threshold, which determines the similarity range; for this, we organized the system with different setups.
Fig. 2. Graphical representation of the loss during training.
Fig. 3. Graphical representation of the accuracy during training.
After building up the framework, we fed the dataset into it and trained for up to 50 epochs. Assume that in the first layer the number of filters is 256 and in the second convolutional layer the number of filters is likewise 256, with a batch size of 128. This research achieved a validation accuracy of 0.981411; the validation accuracy and loss are represented graphically in Figures 4 and 5.

Fig. 4. Graphical representation of validation time accuracy.
Fig. 5. Graphical representation of validation time loss.

The model predicts the comparable pairs of filters for a given filter size and number of filters using cosine similarity. First, we trained our framework with the MNIST dataset using Julia and asked whether the framework is learning the same features over and over. To that end, we ran several tests to assess our framework and justify our exploration. We took our convolutional layers with two filter sizes and the number of filters ranging from 32 to 256, with two distinct threshold values of 0.5 and 0.6, respectively. We extracted just one pair of similar filters and visualized it. Moreover, we report the threshold value, the filter number, the ratio of similar pairs, a visualization of the similarities, a test input image, and the output of the activation function for the similar filters in the first convolutional layer in Figure 7. Furthermore, in Figure 8, we show the results for similar filters in the second convolutional layer with the first filter size, varying the number of filters from 32 to 256. In addition, we also show similar filters in the second convolutional layer for the second filter size in Figure 9.
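A minimal sketch of how such a threshold-based ratio of similar filter pairs could be computed in Julia is given below, reusing the cosine_similarity helper from the earlier sketch; the 3 x 3 filter shape is again an assumption, while the 256 filters and the 0.5 threshold follow the setup described above:

    # Ratio of filter pairs whose cosine similarity exceeds the threshold.
    function similar_pair_ratio(W, threshold)
        nfilters = size(W, 4)
        hits, total = 0, 0
        for i in 1:nfilters-1, j in i+1:nfilters
            total += 1
            hits  += cosine_similarity(W[:, :, :, i], W[:, :, :, j]) > threshold
        end
        return hits / total
    end

    # Illustrative weights; in practice W would be read from the trained layer,
    # e.g. the weight field of the first Conv layer in the earlier model sketch.
    W = randn(Float32, 3, 3, 1, 256)
    println("ratio of similar pairs: ", similar_pair_ratio(W, 0.5))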
A. Auto-encoder

An auto-encoder is an unsupervised neural network learning method that reduces the data dimensionality by learning to ignore noise in the data. An effective data representation (encoding) is learned by training the network to suppress "noise" signals. For the auto-encoding shown here, the input is a 28 x 28 image and the output dimension of the encoder is 32. The auto-encoding output is shown in Figure 6.

Fig. 6. Output of the auto-encoding system for the MNIST dataset.
Fig. 7. Similar filters of the first convolutional layer.
Fig. 8. Similar filters of the second convolutional layer (first filter size).
Fig. 9. Similar filters of the second convolutional layer (second filter size).
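A minimal sketch of an auto-encoder with this shape in Julia (Flux) follows, assuming dense encoder and decoder layers over the flattened 784-pixel images, a 32-dimensional code as stated above, and the mean-squared-error loss (mse) that the paper lists among its imports:

    using Flux
    using Flux: mse

    encoder = Dense(784, 32, relu)       # compress a flattened 28x28 image to a 32-dimensional code
    decoder = Dense(32, 784, sigmoid)    # reconstruct the flattened image from the code

    autoencoder = Chain(encoder, decoder)

    # Reconstruction loss: how far the decoded image is from the original input.
    ae_loss(x) = mse(autoencoder(x), x)

    x_flat = rand(Float32, 784, 16)      # stand-in for a batch of 16 flattened MNIST images
    @show ae_loss(x_flat)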
B. Variational Autoencoder

In neural network terms, a VAE comprises an encoder, a decoder, and a loss function. In probabilistic model terms, the variational autoencoder refers to approximate inference in a latent Gaussian model where the approximate posterior and the model likelihood are parameterized by neural networks. In Figure 10, we show the variational autoencoder results for our research, implemented with MNIST and Julia.
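A minimal sketch of such a variational autoencoder in Julia (Flux) is given below, with an encoder that outputs the mean and log standard deviation of the approximate posterior, the usual reparameterization trick, and a negative-ELBO loss; all layer and latent sizes here are illustrative assumptions, not the paper's values:

    using Flux

    latent_dim = 10                                  # illustrative latent dimensionality

    enc_hidden = Dense(784, 400, relu)
    enc_mu     = Dense(400, latent_dim)              # mean of the approximate posterior
    enc_logs   = Dense(400, latent_dim)              # log standard deviation
    decoder    = Chain(Dense(latent_dim, 400, relu), Dense(400, 784, sigmoid))

    # Reparameterization trick: z = mu + sigma .* eps with eps drawn from N(0, I).
    function encode_sample(x)
        h = enc_hidden(x)
        mu, logs = enc_mu(h), enc_logs(h)
        z = mu .+ exp.(logs) .* randn(Float32, size(mu)...)
        return z, mu, logs
    end

    # Negative ELBO: squared reconstruction error plus KL divergence to the standard normal prior.
    function vae_loss(x)
        z, mu, logs = encode_sample(x)
        x_hat = decoder(z)
        rec = sum((x_hat .- x) .^ 2) / size(x, 2)
        kl  = sum(0.5f0 .* (exp.(2 .* logs) .+ mu .^ 2 .- 1 .- 2 .* logs)) / size(x, 2)
        return rec + kl
    end

    x_flat = rand(Float32, 784, 16)                  # stand-in for a batch of flattened images
    @show vae_loss(x_flat)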
Fig. 10. Output of the variational auto-encoding system for the MNIST dataset.
C. Discussion
We illustrated a CNN model that is capable of characterizing a handwritten digit dataset using Julia. The model categorized the MNIST dataset, and the training accuracy, which is 0.991926, and the training loss are shown graphically in Figures 2 and 3. After that, we also illustrated the validation result, which is 0.981411, in Figures 4 and 5. Moreover, we showed the threshold value, the filter number, the ratio of similar pairs, a visualization of the similarities, a test input image, and the output of the activation function for the similar filters in the first convolutional layer in Figure 7.

Furthermore, in Figure 8, we showed the results for similar filters in the second convolutional layer, varying the number of filters from 32 to 256. In addition, we also showed similar filters in the second convolutional layer for the second filter size in Figure 9. Finally, we demonstrated our auto-encoding and variational autoencoder results in Figures 6 and 10.

VII. CONCLUSION
In this study, we propose a neural network model that is capable of characterizing and recognizing the MNIST dataset and identifying the handwritten digits using Julia. Our proposed model achieved good accuracy during training and testing. Moreover, this research was also concerned with implementing the threshold, which determines the similarity range; for this, we organized the system with different setups. In the first layer of the model the number of filters was 256, and in the second convolutional layer the number of filters was likewise 256, with a batch size of 128. Further, we took our convolutional layers with two filter sizes, the number of filters ranging from 32 to 256, and two distinct threshold values of 0.5 and 0.6, respectively, and our study showed how one can change the image size and visualization. After that, we also presented the auto-encoding result and the variational autoencoder result using Julia. In the future, we intend to improve accuracy by implementing in Julia an extended characterization and recognition of handwritten datasets.

REFERENCES

[1] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y., "Show, attend and tell: Neural image caption generation with visual attention", In International Conference on Machine Learning, pp. 2048-2057, 2015.
[2] Vinyals, O., Toshev, A., Bengio, S., and Erhan, D., "Show and tell: A neural image caption generator", In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156-3164, 2015.
[3] You, Q., Jin, H., Wang, Z., Fang, C., and Luo, J., "Image captioning with semantic attention", In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4651-4659, 2016.
[4] Vinyals, O., Toshev, A., Bengio, S., and Erhan, D., "Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge", IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 652-663, 2017.
[5] Johnson, J., Karpathy, A., and Fei-Fei, L., "DenseCap: Fully convolutional localization networks for dense captioning", In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4565-4574, 2016.
[6] Kiros, R., Salakhutdinov, R., and Zemel, R., "Multimodal neural language models", In International Conference on Machine Learning, pp. 595-603, 2014.
[7] Lu, J., Xiong, C., Parikh, D., and Socher, R., "Knowing when to look: Adaptive attention via a visual sentinel for image captioning", In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 6, pp. 2, 2017.
[8] M. A. Jishan, K. R. Mahmud, A. K. Al Azad, M. S. Alam, and A. M. Khan, "Hybrid deep neural network for Bangla automated image descriptor", International Journal of Advances in Intelligent Informatics, vol. 6, no. 2, pp. 109-122, Jul. 2020.
[9] Jia, X., Gavves, E., Fernando, B., and Tuytelaars, T., "Guiding the long-short term memory model for image caption generation", In Proceedings of the IEEE International Conference on Computer Vision, pp. 2407-2415, 2015.
[10] Feng, Y., and Lapata, M., "Automatic caption generation for news images", IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4), 797-812, 2013.
[11] Peng, H., and Li, N., "Generating Chinese captions for Flickr30K images", 2016.
[12] Miyazaki, T., and Shimizu, N., "Cross-lingual image caption generation", In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1780-1790, 2016.
[13] Najman, M., "Image Captioning with Convolutional Neural Networks", 2017.
[14] Kavitha, S., Keerthana, V., and Bharanidharan, A., "Automatic Image Caption Generation", 2017.
[15] D. D. Sapkal, Pratik Sethi, Rohan Ingle, Shantanu Kumar Vashishtha, and Yash Bhan, "A Survey on Auto Image Captioning", Vol. 5, Issue 2, 2016.
[16] Talwar, A., and Kumar, Y., "Machine Learning: An artificial intelligence methodology", International Journal of Engineering and Computer Science, 2(12), 2013.
[17] Rahman, M., Mohammed, N., Mansoor, N., and Momen, S., "Chittron: An automatic Bangla image captioning system", Procedia Computer Science, 154, 636-642, 2019.
[18] Gurney, K., "An introduction to neural networks", CRC Press, 2014.
[19] M. A. Jishan, K. R. Mahmud, and A. K. Al Azad, "Natural language description of images using hybrid recurrent neural network", International Journal of Electrical and Computer Engineering (IJECE), vol. 9, no. 4, pp. 2932-2940, Aug. 2019.
[20] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P., "Gradient-based learning applied to document recognition", Proceedings of the IEEE, 86(11), 2278-2324, 1998.
[21] Simonyan, K., and Zisserman, A., "Very deep convolutional networks for large-scale image recognition", arXiv preprint arXiv:1409.1556, 2014.
[22] Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y., "OverFeat: Integrated recognition, localization and detection using convolutional networks", arXiv preprint arXiv:1312.6229, 2013.
[23] Long, J., Shelhamer, E., and Darrell, T., "Fully convolutional networks for semantic segmentation", In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.
[24] Lipton, Z. C., Berkowitz, J., and Elkan, C., "A critical review of recurrent neural networks for sequence learning", arXiv preprint arXiv:1506.00019, 2015.
[25] Graves, A., Mohamed, A. R., and Hinton, G., "Speech recognition with deep recurrent neural networks", In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645-6649, IEEE, 2013.
[26] LeCun, Y., Bengio, Y., and Hinton, G., "Deep learning", Nature, 521(7553), 436, 2015.
[27] Hochreiter, S., and Schmidhuber, J., "Long short-term memory", Neural Computation, 9(8), 1735-1780, 1997.
[28] Bernardi, R., Cakici, R., Elliott, D., Erdem, A., Erdem, E., Ikizler-Cinbis, N., and Plank, B., "Automatic description generation from images: A survey of models, datasets, and evaluation measures", Journal of Artificial Intelligence Research, 55, 409-442, 2016.
[29] M. A. Jishan, K. R. Mahmud, A. K. Al Azad, M. R. A. Rashid, B. Paul, and M. S. Alam, "Bangla language textual image description by hybrid neural network model", Indonesian Journal of Electrical Engineering and Computer Science, vol. 21, no. 2, pp. 757-767, Feb. 2021.
[30] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, "Explain images with multimodal recurrent neural networks", arXiv preprint arXiv:1410.1090, 2014.