Part 1 : Text Localization

Text information extraction is a growing area of research. A great deal of work has gone into extracting text regions from scene text images efficiently and robustly. Various text extraction models have been proposed, each made up of several stages. Among these stages, text localization is an important one, and researchers therefore focus mainly on it.

Text extraction is an arduous task in document image analysis. Text in images carries noteworthy information that can be used for structuring images or indexing databases. It is also useful in many real-world applications, such as navigation, license plate detection, and the conversion of paper-based collections to e-books using OCR. Text embedded in images carries systematic information that can provide content-based access to digital images, if it is extracted and harnessed efficiently. Text appearing in images is broadly classified into three categories: document text, scene text, and digital-born text. Among these, extracting text from scene text images is the most challenging, since this type of text can vary in orientation, font size, and so on. It may also be distorted by perspective projection and may have low contrast. A generalized Text Information Extraction (TIE) model is shown below. Despite extensive research aimed at an efficient model, several properties of scene text images make automatic text extraction a challenging task.



TIE model stages, in order: Input Image, Text Detection and Localization, Text Extraction, Text Enhancement, Text Recognition, Output Image.

When an image is fed to a Text Information Extraction system, there is no prior information about whether the image contains text. Text detection techniques therefore help identify the presence of text in an image. Features of the text such as color, intensity, and geometry are used to classify the localization techniques. The two broad classes of localization techniques are region-based approaches and texture-based approaches.

1. Region Based Methods:

Region-based methods use the color or grayscale properties of the text region, or their differences from the corresponding properties of the background. They rely on the fact that color varies little within text and that text is sufficiently distinct from its background. Text can be obtained by thresholding the image at an intensity level between that of the text and that of its immediate background. The approach is implemented bottom-up: substructures are identified and merged into homogeneous regions, which are marked with bounding boxes. This method is further divided into two sub-approaches:

Connected component based (CC-based): In CC-based methods, small, similar components of the input image are grouped bottom-up into successively larger components. This process continues iteratively until all regions of the image have been identified. Non-text regions are filtered out by a geometrical analysis, which merges components according to their spatial arrangement and marks the boundaries of the text regions.
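As a rough illustration of the region-based idea, the sketch below thresholds a toy grayscale image midway between the text and background intensities, groups the foreground pixels into 4-connected components, drops tiny components, and merges neighbouring boxes. The helper names, the minimum area of 4 pixels, and the maximum gap of 2 pixels are assumptions chosen for this toy example, not values from any particular paper.

```python
from collections import deque
import numpy as np

def connected_components(mask):
    """Return bounding boxes (top, left, bottom, right), inclusive,
    of the 4-connected foreground regions in a boolean mask."""
    height, width = mask.shape
    seen = np.zeros(mask.shape, dtype=bool)
    boxes = []
    for r in range(height):
        for c in range(width):
            if not mask[r, c] or seen[r, c]:
                continue
            queue = deque([(r, c)])
            seen[r, c] = True
            top, left, bottom, right = r, c, r, c
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny, nx] and not seen[ny, nx]):
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def box_area(box):
    top, left, bottom, right = box
    return (bottom - top + 1) * (right - left + 1)

def merge_horizontal(boxes, max_gap=2):
    """Merge boxes that share rows and lie at most max_gap columns apart."""
    merged = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if merged:
            t, l, b, r = merged[-1]
            rows_overlap = box[0] <= b and t <= box[2]
            if rows_overlap and box[1] - r - 1 <= max_gap:
                merged[-1] = (min(t, box[0]), l, max(b, box[2]), max(r, box[3]))
                continue
        merged.append(box)
    return merged

# Toy image: dark "characters" on a light background, plus one noise pixel.
image = np.full((5, 12), 220, dtype=np.uint8)
image[1:4, 1:3] = 30   # first character-like blob
image[1:4, 4:6] = 35   # second blob, one column to the right
image[4, 10] = 25      # isolated dark speck

mask = image < (30 + 220) / 2                   # threshold between text and background
boxes = connected_components(mask)
kept = [b for b in boxes if box_area(b) >= 4]   # geometric filter (assumed rule)
text_boxes = merge_horizontal(kept, max_gap=2)
print(text_boxes)  # [(1, 1, 3, 5)]
```

Real CC-based pipelines typically use richer geometric cues (aspect ratio, stroke width, alignment) than the area and gap rules assumed here.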

Edge Based Methods: Edges are the portions of an image that correspond to object boundaries. Text usually contrasts strongly with its background, and this sharp variation forms edges. Non-text regions are filtered out by first identifying the edges of the text boundaries and then applying various heuristics. Generally, an edge filter is used to identify the edges in an image, and an averaging operation or a morphological operator is used for the merging stage.

Taxonomy of text localization methods: region-based methods (CC-based, edge-based), texture-based methods, and hybrid methods (morphological operations).
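The two edge-based steps, edge detection followed by merging, can be sketched as follows. This is a minimal illustration that assumes a plain neighbour-difference filter in place of a proper edge filter such as Sobel or Canny, and a horizontal dilation as the morphological merging operator; the function names and the `min_jump` threshold are hypothetical.

```python
import numpy as np

def edge_map(gray, min_jump=50):
    """Mark pixels whose intensity differs sharply from the right or lower neighbour."""
    g = gray.astype(np.int32)
    edges = np.zeros(g.shape, dtype=bool)
    edges[:, :-1] |= np.abs(np.diff(g, axis=1)) >= min_jump
    edges[:-1, :] |= np.abs(np.diff(g, axis=0)) >= min_jump
    return edges

def dilate_horizontal(mask, radius=1):
    """Binary dilation with a 1 x (2 * radius + 1) structuring element."""
    padded = np.pad(mask, ((0, 0), (radius, radius)), constant_values=False)
    out = np.zeros_like(mask)
    for shift in range(2 * radius + 1):
        out |= padded[:, shift:shift + mask.shape[1]]
    return out

# Toy image: two dark vertical strokes on a light background.
gray = np.full((5, 10), 200, dtype=np.uint8)
gray[1:4, 2] = 20
gray[1:4, 5] = 20

edges = edge_map(gray)
merged = dilate_horizontal(edges, radius=1)

edge_columns = np.flatnonzero(edges.any(axis=0)).tolist()
merged_columns = np.flatnonzero(merged.any(axis=0)).tolist()
print(edge_columns)    # [1, 2, 4, 5]
print(merged_columns)  # [0, 1, 2, 3, 4, 5, 6]
```

Before merging, the edges of the two strokes are separated by a gap at column 3; after dilation they form one contiguous run, which is the kind of region a bounding box would then be drawn around.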

2. Texture Based Methods

Texture-based methods exploit the textural features of text, which distinguish it from its background. Parameters such as energy, contrast, correlation, and entropy characterize the texture of an image. Texture-based methods use techniques such as the FFT, wavelets, spatial variance, etc.
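To make the texture parameters concrete, here is a small sketch that computes energy, contrast, and entropy from a horizontal gray-level co-occurrence matrix (GLCM). Using a GLCM is an assumption on my part, as is the 4-level quantization; "energy" is taken here as the sum of squared co-occurrence probabilities, and correlation is omitted for brevity.

```python
import numpy as np

def glcm_features(gray, levels=4):
    """Energy, contrast, and entropy of horizontally adjacent pixel pairs."""
    q = (gray.astype(np.int64) * levels) // 256      # quantize to `levels` bins
    left, right = q[:, :-1].ravel(), q[:, 1:].ravel()
    glcm = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(glcm, (left, right), 1.0)              # count co-occurring level pairs
    p = glcm / glcm.sum()                            # normalize to probabilities
    i, j = np.indices(p.shape)
    nonzero = p[p > 0]
    return {
        "energy": float(np.sum(p ** 2)),
        "contrast": float(np.sum(p * (i - j) ** 2)),
        "entropy": float(-np.sum(nonzero * np.log2(nonzero))),
    }

flat = np.full((4, 6), 200, dtype=np.uint8)                     # uniform background patch
striped = np.tile(np.array([0, 255], dtype=np.uint8), (4, 3))   # stroke-like stripes

flat_features = glcm_features(flat)
striped_features = glcm_features(striped)
print(flat_features)
print(striped_features)
```

The uniform patch has maximal energy and zero contrast, while the striped patch, which loosely mimics the alternation of strokes and background in text, has high contrast and non-zero entropy.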

3. Hybrid : Morphological Approach

Mathematical morphology is a topological and geometrical approach to image analysis. Character recognition and document analysis use morphological operations because operations such as translation, rotation, and scaling do not affect the geometrical shape of the image. The method therefore works robustly under different image alterations.
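One way to read the translation part of this claim is that a morphological operator gives the same result whether the shape is moved before or after the operator is applied. The sketch below checks this for binary dilation with a 3x3 structuring element; the helper names are hypothetical, and the check assumes the shape stays away from the image border.

```python
import numpy as np

def dilate3x3(mask):
    """Binary dilation with a 3x3 square structuring element (zero padding)."""
    padded = np.pad(mask, 1, constant_values=False)
    out = np.zeros_like(mask)
    h, w = mask.shape
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def shift(mask, dy, dx):
    """Translate a mask by (dy, dx), filling vacated pixels with background."""
    out = np.zeros_like(mask)
    h, w = mask.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        mask[max(-dy, 0):h - max(dy, 0), max(-dx, 0):w - max(dx, 0)]
    return out

# A small L-shaped "character" away from the image border.
mask = np.zeros((8, 10), dtype=bool)
mask[1:4, 1] = True
mask[3, 1:3] = True

moved_then_dilated = dilate3x3(shift(mask, 2, 3))
dilated_then_moved = shift(dilate3x3(mask), 2, 3)
print(np.array_equal(moved_then_dilated, dilated_then_moved))  # True
```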

Data Preparation:

Data preparation for text localization is somewhat similar to that for image segmentation; however, special care is needed, since most text images can be black-and-white images.

Image segmentation:

It is a computer vision task in which we label specific regions of an image according to what’s being shown.

“What’s in this image, and where in the image is it located?”

More specifically, the goal of semantic image segmentation is to label each pixel of an image with a corresponding class of what is being represented. Because we’re predicting for every pixel in the image, this task is commonly referred to as dense prediction.

An example of semantic segmentation, where the goal is to predict class labels for each pixel in the image.

One important thing to note is that we’re not separating instances of the same class; we only care about the category of each pixel. In other words, if you have two objects of the same category in your input image, the segmentation map does not inherently distinguish these as separate objects. There exists a different class of models, known as instance segmentation models, which do distinguish between separate objects of the same class.

Segmentation models are useful for a variety of tasks, including:

  • Autonomous vehicles
    We need to equip cars with the necessary perception to understand their environment so that self-driving cars can safely integrate into our existing roads.

A real-time segmented road scene for autonomous driving.

  • Medical image diagnostics
    Machines can augment analysis performed by radiologists, greatly reducing the time required to run diagnostic tests.

A chest x-ray with the heart (red), lungs (green), and clavicles (blue) segmented.

Representing the task

Simply, our goal is to take either an RGB color image (height×width×3) or a grayscale image (height×width×1) and output a segmentation map where each pixel contains a class label represented as an integer (height×width×1).


Note: For visual clarity, we have labeled a low-resolution prediction map. In reality, the segmentation label resolution should match the original input’s resolution.

Similar to how we treat standard categorical values, we’ll create our target by one-hot encoding the class labels – essentially creating an output channel for each of the possible classes.
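As a small sketch of this encoding, the snippet below turns an integer label map into a one-hot target with one channel per class. The `one_hot` helper and the 3-class toy label map are illustrative assumptions.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn an (H, W) integer label map into an (H, W, num_classes) one-hot target."""
    return np.eye(num_classes, dtype=np.uint8)[labels]

labels = np.array([[0, 0, 1],
                   [2, 1, 1]])
target = one_hot(labels, num_classes=3)
print(target.shape)   # (2, 3, 3)
print(target[1, 0])   # [0 0 1]
```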


A prediction can be collapsed into a segmentation map (as shown in the first image) by taking the argmax of each depth-wise pixel vector.
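Collapsing a prediction along the class axis can be sketched with an argmax. The scores below are made-up values for a 2×2 image with three classes.

```python
import numpy as np

# Hypothetical per-pixel class scores with shape (height, width, num_classes).
scores = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.2, 0.3, 0.5], [0.6, 0.3, 0.1]],
])

# Pick the highest-scoring class for every pixel.
segmentation_map = np.argmax(scores, axis=-1)
print(segmentation_map)
```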

We can easily inspect a target by overlaying it onto the observation.


When we overlay a single channel of our target (or prediction), we refer to this as a mask which illuminates the regions of an image where a specific class is present.
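A minimal sketch of such an overlay, assuming an RGB image stored as a uint8 array and a simple alpha blend; the `overlay_mask` helper, the red highlight color, and the blend factor are illustrative choices.

```python
import numpy as np

def overlay_mask(image, mask, color=(255, 0, 0), alpha=0.4):
    """Blend `color` into an RGB `image` wherever the boolean `mask` is True."""
    out = image.astype(np.float64)
    out[mask] = (1.0 - alpha) * out[mask] + alpha * np.asarray(color, dtype=np.float64)
    return out.round().astype(np.uint8)

image = np.full((2, 2, 3), 100, dtype=np.uint8)   # flat gray RGB image
labels = np.array([[0, 1],
                   [1, 0]])
mask = labels == 1                                # single-class mask
blended = overlay_mask(image, mask)
print(blended[0, 1])  # masked pixel, tinted red
print(blended[0, 0])  # unmasked pixel, unchanged
```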

For more in depth information on the topic please refer :

Text Image Segmentation

Unlike object detection, text detection/localization needs special care, since most of the data sources happen to be black-and-white images.

Natural scene text detection is different though — and much more challenging.

Due to the proliferation of cheap digital cameras, not to mention the fact that nearly every smartphone now has a camera, we need to be highly concerned with the conditions under which an image was captured and, furthermore, with what assumptions we can and cannot make. I’ve included below a summarized version of the natural scene text detection challenges described by Celine Mancas-Thillou and Bernard Gosselin in their excellent 2007 paper, Natural Scene Text Understanding:

  • Image/sensor noise: Sensor noise from a handheld camera is typically higher than that of a traditional scanner. Additionally, low-priced cameras will typically interpolate the pixels of raw sensors to produce real colors.
  • Viewing angles: Natural scene text can naturally have viewing angles that are not parallel to the text, making the text harder to recognize.
  • Blurring: Uncontrolled environments tend to have blur, especially if the end user is utilizing a smartphone that does not have some form of stabilization.
  • Lighting conditions: We cannot make any assumptions regarding our lighting conditions in natural scene images. It may be near dark, the flash on the camera may be on, or the sun may be shining brightly, saturating the entire image.
  • Resolution: Not all cameras are created equal — we may be dealing with cameras with sub-par resolution.
  • Non-paper objects: Most, but not all, paper is not reflective (at least in context of paper you are trying to scan). Text in natural scenes may be reflective, including logos, signs, etc.
  • Non-planar objects: Consider what happens when you wrap text around a bottle — the text on the surface becomes distorted and deformed. While humans may still be able to easily “detect” and read the text, our algorithms will struggle. We need to be able to handle such use cases.
  • Unknown layout: We cannot use any a priori information to give our algorithms “clues” as to where the text resides.

For example, in EAST: An Efficient and Accurate Scene Text Detector, data preparation involves creating 5-channel data labels that capture text orientation, size, etc.
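To give a feel for what such a label might look like, here is a rough sketch for a single axis-aligned text box. It assumes that four channels hold each pixel's distance to the top, right, bottom, and left box edges and that the fifth holds the box rotation angle. This is a simplified illustration, not the reference EAST data pipeline; the function name and the inclusive box convention are hypothetical.

```python
import numpy as np

def rbox_style_label(height, width, box, angle=0.0):
    """Build a (height, width, 5) label for one axis-aligned text box.

    box is (top, left, bottom, right) in inclusive pixel indices (assumed convention).
    Channels 0-3: distance from each inside pixel to the top, right, bottom, left edge.
    Channel 4: rotation angle of the box in radians (0 for an axis-aligned box).
    Pixels outside the box stay zero.
    """
    top, left, bottom, right = box
    label = np.zeros((height, width, 5), dtype=np.float32)
    ys, xs = np.mgrid[top:bottom + 1, left:right + 1]
    region = label[top:bottom + 1, left:right + 1]   # view into `label`
    region[..., 0] = ys - top
    region[..., 1] = right - xs
    region[..., 2] = bottom - ys
    region[..., 3] = xs - left
    region[..., 4] = angle
    return label

label = rbox_style_label(6, 8, box=(1, 2, 3, 6))
print(label.shape)   # (6, 8, 5)
print(label[2, 4])   # distances to top, right, bottom, left, then angle
```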

Note: More details to be followed in upcoming blogs.

Reference Git :
Paper :
Dataset ICDAR : 2019

Deep Learning Models:

Of course, a convolutional network will be the default choice here. For those who need to refresh their memories, click here, and head here to read about some common architectures.

Among the several convolutional network architectures available, variants of U-Net and ResNet are the most notable for this task.

If you wanna refresh your brain on ResNet click here.