
[WSS17] Image-to-Latex


The following is a short summary of a project I did at the Wolfram Summer School, 2017.

In this project we aim to convert a given mathematical expression (printed or even handwritten) into LaTeX syntax. We implement the algorithm in the Wolfram Language using the built-in neural network functionality, loosely following the algorithm proposed in the paper https://arxiv.org/pdf/1609.04938v1.pdf .

The dataset is imported from https://zenodo.org/record/56198#.WVzy-caZORt. It includes a total of approximately 100k formulae and images, split into training, validation, and test sets. The file "im2latex_formulas.lst" contains 103558 LaTeX formulae separated by "\n", which we import into Mathematica:

formulae = StringSplit[Import["/Users/Himanshu/Desktop/Wolfram Assignments/Wolfram Project/im2latex_formulas.lst", "String"], "\n"];

Using just Import["/Users/Himanshu/Desktop/Project/im2latex_formulas.lst", "String"] leads to an incorrect line count of 104563. We then process this data. First we strip off unnecessary elements in the LaTeX formulae (like \label{eqn}) and normalize white-space as follows:

labels = StringTrim[
    StringReplace[
     StringReplace[#, {
       "\\label{" ~~ ShortestMatch[___] ~~ "}" :> "", (* strip \label{...} tags *)
       "\t" | "\\," | WhitespaceCharacter | "\\:" | "\\;" -> "~"}], (* map spacing commands and whitespace to "~" *)
     "~" .. -> "~"], (* collapse runs of "~" *)
    "~"] & /@ formulae; (* trim leading/trailing "~" *)

The formula images, stored in the folder "formula_images", are PNG renderings of the LaTeX expressions on a transparent background. Although they occupy little space on disk, the cropped images grow considerably in size once imported into the Mathematica kernel. The following function builds a Dataset from one of the index files, each line of which carries a formula index and an image file name:

fileDataset[s_String, import_?BooleanQ, folder_String: ""] :=
 Dataset[
  (* pair each image (or a lazy File reference) with its formula label *)
  <|"Input" -> Last@#, "Target" -> labels[[First@# + 1]]|> & /@
   ({ToExpression@#[[1]],
       If[import, Import, Identity]@File[folder <> #[[2]] <> ".png"]} &~
     ParallelMap~StringSplit[ReadList[s, String]][[;; , {1, 2}]])]

The following code generates the training, test, and validation datasets. WARNING: this is time-consuming and requires at least 20 GB of disk space.

trainDataset = 
   fileDataset["im2latex_train.lst", False, 
    "formula_images/"]; // AbsoluteTiming
testDataset = 
   fileDataset["im2latex_test.lst", False, 
    "formula_images/"]; // AbsoluteTiming
validateDataset = 
   fileDataset["im2latex_validate.lst", False, 
    "formula_images/"]; // AbsoluteTiming

We also pad the formula images so that they all have the same size; this avoids distortion. The final training / validation / test data is a Dataset

[screenshot: preview of the generated Dataset]

which is nothing but a list of rules of the following form

[screenshot: an image -> formula rule]

The training data contains 83884 examples, the test data 10355, and the validation data 9320. After processing, they occupy approximately 20 GB on disk. Since the generated Dataset is huge, it is not practical to upload it to the GPU machine. Therefore, instead of creating, saving, and then uploading the dataset, we regenerate it from the raw data on the training machine and keep it in a temporary in-memory variable, which is fed into the network for training.
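
The padding step can be sketched as follows; this is a minimal illustration in which the 640 x 160 target canvas and the variable images are placeholder assumptions, not values from the project:

padToSize[img_Image, {targetW_Integer, targetH_Integer}] :=
 Module[{w, h},
  {w, h} = ImageDimensions[img];
  (* split the margins evenly so the formula stays centered *)
  ImagePad[img, {{Floor[(targetW - w)/2], Ceiling[(targetW - w)/2]},
    {Floor[(targetH - h)/2], Ceiling[(targetH - h)/2]}}, White]]

padToSize[#, {640, 160}] & /@ images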

Next we turn to the neural network architecture, which is divided into three stages:

  1. Convolutional Network: The visual features of an image are extracted with a multi-layer convolutional neural network (CNN) interleaved with max-pooling layers. The CNN takes the raw input and produces a feature grid $\tilde{V}$ of size $D \times H \times W$, where $D$ denotes the number of channels and $H$ and $W$ are the resulting feature-map height and width.

  2. Row Encoder: The feature grid $\tilde{V}$ produced by the CNN is fed into a row encoder, which runs a recurrent neural network (RNN) over each row of $\tilde{V}$ and produces a new feature grid $V$. For OCR it is important for the encoder to localize relative positions within the source image.

  3. Decoder: The target markup tokens $\{y_t\}$ are then generated by a decoder based on the row-encoded feature grid $V$. The decoder (equipped with an attention mechanism) is trained to compute the conditional probability of the token $y_{t+1}$ at position $t+1$ given the sequence $\{y_0, y_1, \ldots, y_t\}$ and the feature grid $V$; the resulting factorization is written out below.
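
Written out, the decoder models the probability of the full markup sequence via the standard chain-rule factorization (this is the generic form implied by the description above, not anything specific to this implementation):

$$ p(y_1, \ldots, y_T \mid V) \;=\; \prod_{t=0}^{T-1} p(y_{t+1} \mid y_0, y_1, \ldots, y_t, V) $$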

However, in my implementation of the network I have not incorporated the attention mechanism. The final network used for training is shown schematically below

[schematic: full encoder-decoder network]

where the encoder is a NetChain consisting of convolution layers interleaved with pooling and batch-normalization layers, and the decoder is a NetGraph, shown schematically below

[schematic: decoder NetGraph]
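
For concreteness, here is a minimal sketch of such an encoder and decoder in the Wolfram Language. The layer counts, channel widths, the 256 x 64 input size, the GRU width of 256, and vocabSize are all illustrative assumptions on my part, not the trained configuration:

(* CNN: extracts a D x H x W feature grid from the image *)
cnn = NetChain[{
    ConvolutionLayer[64, {3, 3}, "PaddingSize" -> 1], Ramp,
    PoolingLayer[{2, 2}, "Stride" -> 2],
    ConvolutionLayer[128, {3, 3}, "PaddingSize" -> 1],
    BatchNormalizationLayer[], Ramp,
    PoolingLayer[{2, 2}, "Stride" -> 2]},
   "Input" -> NetEncoder[{"Image", {256, 64}, ColorSpace -> "Grayscale"}]];

(* Row encoder: reorder D x H x W to H x W x D, then run a GRU along each row *)
rowEncoder = NetChain[{
    TransposeLayer[{1 <-> 2, 2 <-> 3}],
    NetMapOperator[GatedRecurrentLayer[256]]}];

encoder = NetChain[{cnn, rowEncoder}];

(* Decoder without attention: pool the encoded grid into a single vector, use it as
   the initial GRU state, and predict a distribution over the next token at each step *)
vocabSize = 500; (* placeholder vocabulary size *)
decoder = NetGraph[<|
    "pool" -> AggregationLayer[Mean, {1, 2}],
    "embed" -> EmbeddingLayer[256, vocabSize],
    "gru" -> GatedRecurrentLayer[256],
    "predict" -> NetMapOperator[LinearLayer[vocabSize]],
    "softmax" -> SoftmaxLayer[]|>,
   {NetPort["Features"] -> "pool" -> NetPort[{"gru", "State"}],
    NetPort["Tokens"] -> "embed" -> "gru" -> "predict" -> "softmax"}];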

In this work we have implemented the first two stages of the algorithm and a slightly modified version of the third stage that omits the attention mechanism. We trained the network on a 12 GB NVIDIA GPU, with the training batch size set to 16 due to the limited GPU memory. After 7 training rounds the loss drops to 1.48. Adding an attention layer to the network is expected to improve the loss further.
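
For reference, the training call matching this setup would look roughly as follows, where net stands for the assembled network and trainingData for the in-memory list of rules described above:

NetTrain[net, trainingData, BatchSize -> 16, TargetDevice -> "GPU",
  MaxTrainingRounds -> 7]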

For the original and previous work in this direction, please refer to the paper and dataset linked above.

Acknowledgements

I would like to thank Giulio Alessandrini, Matteo Salvarezza, and Daniel George for useful discussions and for helping me out at various stages of the project.

POSTED BY: Himanshu Raj