
TensorFlow Object Detection


If you think about it, you have probably spent a lot of valuable time looking for your keys in a messy room. It happens to everyone and is among the most frustrating experiences. Today, computer algorithms can solve this kind of problem, and that is the true power of object detection algorithms. Finding keys is not what they were primarily designed for; these powerful deep learning algorithms are more commonly employed for tasks such as round-the-clock surveillance and real-time vehicle detection in smart cities.

With recent advances in deep learning-based computer vision models, object detection applications are easier to develop than ever before. TensorFlow’s Object Detection API is an open source framework built on top of TensorFlow that makes it easy to construct, train, and deploy object detection models. Besides significant performance improvements, these techniques leverage models trained on massive image datasets to reduce the need for large task-specific datasets. Moreover, current approaches focus on end-to-end pipelines, which has significantly improved performance and enabled real-time use cases.


Let’s have a look at the following concepts in this object detection tutorial using TensorFlow.

What is Object Detection?

Object detection is a computer technology related to image processing and computer vision. It deals with detecting instances of semantic objects of different classes, such as buildings, human beings, and cars, in digital images and videos. Well-researched domains of object detection include pedestrian detection and face detection. Object detection has numerous applications in areas like image retrieval and video surveillance.

Some of the major applications of object detection include face recognition and video object co-segmentation. It is also used for tracking objects, such as a person in a video or the movement of a cricket bat.

People often confuse image classification with object detection. When the main aim is to classify the image into a certain category, image classification is used. On the other hand, to identify the location of objects in an image or to count the number of instances of an object, object detection is used. Labelled data is needed to train a custom model. In the context of object detection, labelled data means images with corresponding class labels and bounding box coordinates.

In a typical classification network, an image is passed through many convolution and pooling layers, and the output is the class of the object. For each input image, there is a corresponding class as output. To detect objects, the input image is first divided into various regions.


Each of these regions is treated as a separate image. The regions are then passed to a convolutional neural network (CNN), which classifies them into various classes. Once each region has been assigned to a class, all the regions are combined to get the original image with the detected objects.

However, there are problems with such a naive algorithm, because the objects in an image might have different aspect ratios and spatial locations. Covering them can require a large number of regions, which increases the computation time.


Applications of Object Detection

Object detection has many real-life applications and can be used in different scenarios. New algorithms and models keep outperforming previous ones, and object detection is one of the most rapidly maturing areas of computer vision. Some of its applications are described below.

Face Recognition

For instance, a group of researchers at Facebook developed DeepFace, a facial recognition system based on deep learning. Google also has its own facial recognition system, which can automatically group photos based on the people in them.

Object detection is a computer technology connected to image processing and computer vision that detects instances of objects such as buildings, human faces, cars, and trees. The primary job of face detection is to determine whether there is any face in an image. It is the first and most essential step in face recognition. Face detection is used in areas like security, law enforcement, biometrics, personal safety, and entertainment.

Faces can be detected in real-time and it helps to track persons or objects. The face detection methods can be appearance-based, feature-based, knowledge-based, or template matching.

People Counting

Another important use of object detection is people counting. It can be used for analyzing store performance or recording crowd statistics during festivals or other activities. However, it can be difficult at times as people move out of the frames very quickly.

Off-the-shelf people counters are not very expensive but the data generated by them is tied to proprietary systems that limit the options for data extraction and KPI optimization. An embedded DIP using your own camera and SBC would save time and money and offer the freedom to tailor the application to the KPIs you need. Insights can be extracted from the cloud that would not be possible in other cases.

The overall functionality of your DIP IoT application can be enhanced using the cloud. Visualization, alerting, and reporting offer increased capabilities, as does cross-referencing outside data sources.

Industrial Quality Check

Object detection is often used in industrial processes to identify products. Finding a specific object through visual inspection is a basic task involved in various industrial processes, including inventory management, sorting, quality management, machining, and packaging. Inventory management can be tricky because it is hard to track items in real time. Localization and automatic object counting help improve inventory accuracy.

Several challenges need to be taken into account when performing object detection. Objects come in different sizes, shapes, colors, and orientations. Additional noise comes from variations in illumination, viewpoint, shadows, and occlusions. It is important to reach the desired accuracy without requiring too many training examples.

Self-Driving Cars

Self-driving cars are clearly part of the future. However, making them work is tricky, because many different techniques are required to perceive the surroundings, including laser light, GPS, radar, computer vision, and odometry. Advanced control systems interpret the sensory information to identify appropriate navigation paths and obstructions. When a sign of a living being is found in the path, the car automatically stops. The process is very fast and is a huge step towards self-driving cars.

Self-driving cars are being designed with the intention of saving lives. A lot of people are involved in road accidents every year. Autonomous vehicles aim to make transportation more accurate and safer and to lower needless death tolls. Object detection is performed in two steps: image classification and image localization. Image classification determines what the objects in the image are, and image localization provides the specific location of the objects.


Security

Object detection plays a very important role in security. Police personnel use it to access security feeds and match them against an existing database. It helps to detect criminals or their vehicles, and it can even be used to locate stolen products. The applications are nearly limitless. For some of these tasks, the ability of machines to look out for objects has surpassed that of human beings.

Using technology to perform surveillance is a lot more efficient. Surveillance is a repetitive and mundane task, so human performance can dip over time. Letting technology do the task allows human beings to focus on the actions to be taken if something goes wrong. A lot of personnel might be needed to survey a large strip of land. Mobile surveillance bots, along with stationary cameras, can mitigate these problems.

Object Detection Workflow with examples

Computer vision tasks can be grouped into a few categories.

  1. Image Classification: Image classification is among the most common computer vision problems. An algorithm looks at an image and classifies the object in it. Image classification has a wide range of applications, from face detection to cancer detection in medicine.
  2. Object classification and localization: Object localization algorithms detect not only the presence of an object but also its location. A bounding box is drawn around the object in the image.
  3. Multiple object detection and localization: There can be multiple objects in the image, which is very common in self-driving cars. The algorithm needs to detect not only other cars but also motorcycles, pedestrians, trees, and other objects. In the context of deep learning, the basic algorithmic difference between these tasks lies in choosing the relevant inputs and outputs.

Image Classification

  • An input image is convolved with n filters.
  • The output of the convolution is then passed through a non-linear activation such as ReLU, followed by max pooling.
  • The above operations of convolution, ReLU, and max pooling are performed multiple times.
  • The output of the final layer is sent to the softmax layer, which converts it into numbers between 0 and 1 that are interpreted as the probabilities of the image belonging to each class. The loss is minimized so that the predictions from the last layer are as close as possible to the actual values.
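The steps above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the activations are a hypothetical one-dimensional list rather than a real feature map, the pooling is one-dimensional, and the helper names `relu`, `max_pool_1d`, `softmax`, and `cross_entropy` are made up for this example rather than taken from TensorFlow.

```python
import math

def relu(values):
    """Element-wise ReLU: negative activations become zero."""
    return [max(0.0, v) for v in values]

def max_pool_1d(values, size=2):
    """Keep the largest value in each non-overlapping window of `size`."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Loss that shrinks as the probability of the true class approaches 1."""
    return -math.log(probs[true_class])

# Hypothetical activations coming out of a convolution.
features = max_pool_1d(relu([-1.0, 2.0, 0.5, -3.0, 1.0, 4.0]))
probs = softmax(features)                  # one probability per class
loss = cross_entropy(probs, true_class=2)  # value that training would minimize
print(features, probs, loss)
```

In a real network the same operations are applied to two-dimensional feature maps, and the loss is minimized by gradient descent over the filter weights.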

Object Classification and Localization

The output labels are changed so that they also describe the bounding box around an object. This lets the model learn both the class of the object and its position in the image. Four parameters are added to the output layer: the two coordinates of the centroid, and the height and width of the bounding box as proportions of the image. Further output units can be added to predict the Cartesian coordinates of specific positions, or landmarks, to be recognized. These landmarks are consistent for a particular type of object.
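As a rough illustration of such a label, the sketch below builds a target vector from a class id and a pixel bounding box. The function name `make_localization_label` and the layout (one-hot class followed by centroid x, centroid y, width, height) are assumptions chosen for this example; real datasets and models order these values in different ways.

```python
def make_localization_label(class_id, num_classes, box, image_w, image_h):
    """Build [one-hot class..., cx, cy, w, h] with box values scaled to 0..1.

    `box` is (xmin, ymin, xmax, ymax) in pixels.
    """
    xmin, ymin, xmax, ymax = box
    one_hot = [1.0 if i == class_id else 0.0 for i in range(num_classes)]
    cx = (xmin + xmax) / 2 / image_w   # centroid x as a fraction of image width
    cy = (ymin + ymax) / 2 / image_h   # centroid y as a fraction of image height
    w = (xmax - xmin) / image_w        # box width as a fraction of image width
    h = (ymax - ymin) / image_h        # box height as a fraction of image height
    return one_hot + [cx, cy, w, h]

# A class-1 object in a 400x500 image (hypothetical values).
label = make_localization_label(1, 3, (100, 50, 300, 250), 400, 500)
print(label)
```

Scaling the box to the 0..1 range keeps the targets independent of the image resolution.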

Multiple object detection and localization

If we are trying to detect multiple objects in the image, we can use the same technique that was being used in object localization. The difference is that we would want the algorithm to be able to classify and localize all the different objects in the image and not just one. The simple idea is to crop the image into multiple images and run the same algorithm for all these cropped images.

The procedure is as follows:

  • A window much smaller than the actual image is chosen. The part of the image under the window is cropped and passed to the CNN, which makes a prediction for it.
  • The window is slid across the image, and each cropped image is passed to the CNN.
  • After all the portions of the image have been cropped at this window size, the steps are repeated for bigger window sizes. These crops are again passed to the CNN for predictions.
  • At the end, there is a set of cropped images that contain an object, along with the class and the bounding box of the object.
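The window generation described above can be sketched as follows. The function names, the stride parameter, and the tiny 8x8 image size are assumptions for illustration; in practice each returned box would be used to crop the image before passing the crop to the CNN.

```python
def sliding_windows(image_w, image_h, window, stride):
    """Yield (xmin, ymin, xmax, ymax) for square windows fully inside the image."""
    for y in range(0, image_h - window + 1, stride):
        for x in range(0, image_w - window + 1, stride):
            yield (x, y, x + window, y + window)

def multi_scale_windows(image_w, image_h, window_sizes, stride):
    """Repeat the scan for every window size, smallest first."""
    crops = []
    for window in sorted(window_sizes):
        crops.extend(sliding_windows(image_w, image_h, window, stride))
    return crops

# An 8x8 image scanned with 4x4 and 8x8 windows at a stride of 2 pixels.
crops = multi_scale_windows(8, 8, window_sizes=[4, 8], stride=2)
print(len(crops), crops[0], crops[-1])
```

Even this toy image produces ten crops, which hints at how quickly the number of CNN evaluations grows for real image sizes.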

Expensive computation

Cropping multiple images and passing each of them through the CNN is computationally very expensive. The cost can be reduced with a convolutional implementation of the sliding window method. The fully connected layers are replaced with convolutional layers, so that for a given window size the input image is passed through the network only once. In an actual implementation, the cropped images are not passed one at a time; the entire image is passed at once.

Inaccurate bounding boxes

The sliding window model has another drawback. Square windows are slid all over the image, but the object may be rectangular, or none of the squares may match the actual object well. The algorithm might be able to find and localize multiple objects in the image, but the resulting bounding boxes are often quite inaccurate.
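One way to put a number on this mismatch is intersection over union (IoU), a commonly used overlap measure. The text above does not name a metric, so using IoU here is an assumption, and the box coordinates are made-up values for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

wide_object = (0, 0, 8, 4)     # a rectangular object, twice as wide as tall
square_window = (0, 0, 4, 4)   # a square window lying entirely inside it
print(iou(wide_object, square_window))
```

A perfect match gives an IoU of 1, so a square window that covers only half of a wide object scores well below that.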


YOLO (You Only Look Once) is a solution that is much more accurate and faster than the sliding window algorithm. It is based on a minor tweak to the algorithms above: the image is divided into a grid of cells, and the labels are changed so that the classification and localization algorithm can be applied to each grid cell. The algorithm proceeds as follows:

  1. The image is divided into a grid of cells. A 4x4 grid is a convenient illustration, but the actual implementation of YOLO uses a different number of cells.
  2. The training data is labelled. If C is the number of unique object classes in the data and the image is split into S*S grid cells, the length of the output vector is S*S*(C+5).
  3. A deep CNN is built whose loss function is the error between the label vector and the output activations. The model predicts the output for all the grid cells in a single forward pass of the input image through the CNN.
  4. The label for an object is assigned to the grid cell that contains the centroid of the object. This ensures that the object is not counted multiple times in different grid cells.
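The labelling in steps 2 and 4 can be sketched as below. The function name `encode_yolo_label` and the per-cell layout (objectness flag, centroid, width, height, then a one-hot class) are assumptions for illustration, and the box values are kept relative to the whole image for simplicity, which is not necessarily how a real YOLO implementation encodes them.

```python
def encode_yolo_label(objects, grid_size, num_classes):
    """Return an S x S grid where each cell holds a vector of length C + 5.

    Each object is (class_id, cx, cy, w, h) with values normalized to 0..1.
    Assumed cell layout: [objectness, cx, cy, w, h, one-hot class...].
    """
    cell_len = num_classes + 5
    grid = [[[0.0] * cell_len for _ in range(grid_size)]
            for _ in range(grid_size)]
    for class_id, cx, cy, w, h in objects:
        # The cell containing the centroid owns the object.
        col = min(int(cx * grid_size), grid_size - 1)
        row = min(int(cy * grid_size), grid_size - 1)
        one_hot = [1.0 if i == class_id else 0.0 for i in range(num_classes)]
        grid[row][col] = [1.0, cx, cy, w, h] + one_hot
    return grid

S, C = 4, 3
label = encode_yolo_label([(2, 0.6, 0.3, 0.2, 0.4)], S, C)
flat = [v for row in label for cell in row for v in cell]
print(len(flat))  # S * S * (C + 5) values in total
```

Because only the cell containing the centroid is marked, each object appears exactly once in the label, whatever its size.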

But still, there are some problems. Multiple objects in the same grid cell cannot be detected. Choosing a finer grid with smaller cells reduces the issue, but the algorithm can still fail in certain cases, for instance, a flock of birds. Anchor boxes address this: instead of having C+5 labels for each grid cell, there are (C+5)*A labels for each grid cell, where A is the number of anchor boxes. If one object is assigned to one anchor box in a grid cell, another object can be assigned to a different anchor box of the same cell.
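A minimal sketch of how two objects in one cell might be spread over anchor boxes is shown below. Matching each object to the free anchor with the most similar shape is a common convention but is an assumption here, as are the helper names and the anchor and object sizes.

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes that share a center, compared by width and height only."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def assign_anchors(objects, anchors):
    """Give each (w, h) object in one grid cell its best-matching free anchor."""
    free = set(range(len(anchors)))
    assignment = {}
    for obj_id, wh in enumerate(objects):
        if not free:
            break  # more objects than anchors: the rest cannot be labelled
        best = max(free, key=lambda a: shape_iou(wh, anchors[a]))
        assignment[obj_id] = best
        free.remove(best)
    return assignment

anchors = [(0.2, 0.6), (0.6, 0.2)]     # hypothetical tall and wide anchors
objects = [(0.5, 0.25), (0.15, 0.5)]   # a wide object and a tall one in one cell
assignment = assign_anchors(objects, anchors)
print(assignment)

S, C, A = 4, 3, len(anchors)
output_length = S * S * A * (C + 5)    # label length with anchor boxes
print(output_length)
```

With A anchors per cell the label grows by a factor of A, and a cell can describe up to A objects instead of one.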

What is TensorFlow?

TensorFlow is one of the most widely used deep learning libraries today. It was developed by Google, which uses machine learning across its products to improve translation, search, image captioning, and recommendations. With artificial intelligence, Google users get a faster and more refined search. Google uses machine learning to take advantage of its massive datasets and give users the best experience. Researchers, programmers, and data scientists all use machine learning.

TensorFlow was built as a framework to help developers and researchers work together on an AI model. Once a model has been developed and scaled, many people can use it.

Creating an Object Detection Algorithm

Creating an object detection algorithm is the best way to understand how everything works. TensorFlow provides the necessary building blocks, and an entire object detection pipeline can be created as follows. However, you need to take care of two things before you start:

  • Getting prerequisites
  • Setting up the environment

Getting prerequisites

A few things need to be installed on the system to get the job done:

  • Python
  • TensorFlow
  • TensorBoard
  • Protobuf v3.4 and above

Setting up the environment

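TensorFlow can be installed using the pip or conda commands. The code in this tutorial uses the TensorFlow 1.x graph and session API (tf.Session, tf.GraphDef), so a 1.x release is assumed: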

# For CPU
pip install tensorflow
# For GPU
pip install tensorflow-gpu

The other libraries are also installed using the pip or conda commands. The following commands would work.

pip install --user Cython
pip install --user contextlib2
pip install --user pillow
pip install --user lxml
pip install --user jupyter
pip install --user matplotlib

Protocol Buffers are a language-neutral, platform-neutral, extensible mechanism for serializing structured data, similar to XML but smaller and much simpler. Version 3.4 or above needs to be downloaded. The TensorFlow models repository needs to be cloned or downloaded from GitHub. Both the models and protobuf should be placed in the same folder. After that, it is time to run the protobuf compiler (protoc) from the research folder.

 "path_of_protobuf's bin"./bin/protoc object_detection/protos/

Code for Object Detection

1. You need to start by importing all the libraries.

import numpy as np
import os
import six.moves.urllib as urllib
import sys
import tarfile
import tensorflow as tf
import zipfile
from collections import defaultdict
from io import StringIO
from matplotlib import pyplot as plt
from PIL import Image
from object_detection.utils import ops as utils_ops
from object_detection.utils import label_map_util
from object_detection.utils import visualization_utils as vis_util

2. Specify the model to use and the path to the frozen inference graph generated by TensorFlow.

MODEL_NAME = 'ssd_mobilenet_v1_coco_2017_11_17'
MODEL_FILE = MODEL_NAME + '.tar.gz'
DOWNLOAD_BASE = 'http://download.tensorflow.org/models/object_detection/'

# Path to the frozen detection graph. This is the actual model used for detection.
PATH_TO_CKPT = MODEL_NAME + '/frozen_inference_graph.pb'

# Label map that gives the display name for each class id.
PATH_TO_LABELS = os.path.join('data', 'mscoco_label_map.pbtxt')

# Number of classes in the COCO label map.
NUM_CLASSES = 90

3. Download the model archive, extract the frozen inference graph from it, and load the graph into memory.

opener = urllib.request.URLopener()
opener.retrieve(DOWNLOAD_BASE + MODEL_FILE, MODEL_FILE)
tar_file = tarfile.open(MODEL_FILE)
for file in tar_file.getmembers():
  file_name = os.path.basename(file.name)
  if 'frozen_inference_graph.pb' in file_name:
    tar_file.extract(file, os.getcwd())

detection_graph = tf.Graph()
with detection_graph.as_default():
  od_graph_def = tf.GraphDef()
  with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
    serialized_graph = fid.read()
    # Parse the serialized graph before importing it.
    od_graph_def.ParseFromString(serialized_graph)
    tf.import_graph_def(od_graph_def, name='')

4. The labels need to be loaded.

label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
categories = label_map_util.convert_label_map_to_categories(label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
category_index = label_map_util.create_category_index(categories)

5. The images need to be converted into NumPy arrays to be processed.

def load_image_into_numpy_array(image):
  (im_width, im_height) = image.size
  return np.array(image.getdata()).reshape(
      (im_height, im_width, 3)).astype(np.uint8)

6. The paths to the test images are then defined.

PATH_TO_TEST_IMAGES_DIR = 'test_images'
TEST_IMAGE_PATHS = [ os.path.join(PATH_TO_TEST_IMAGES_DIR, 'image{}.jpg'.format(i)) for i in range(1, 8) ]

7. Inference is run for a single image, and the detected objects are returned with their bounding boxes.

def run_inference_for_single_image(image, graph):
  with graph.as_default():
    with tf.Session() as sess:
      # Get handles to input and output tensors
      ops = tf.get_default_graph().get_operations()
      all_tensor_names = {output.name for op in ops for output in op.outputs}
      tensor_dict = {}
      for key in [
          'num_detections', 'detection_boxes', 'detection_scores',
          'detection_classes', 'detection_masks'
      ]:
        tensor_name = key + ':0'
        if tensor_name in all_tensor_names:
          tensor_dict[key] = tf.get_default_graph().get_tensor_by_name(
              tensor_name)
      if 'detection_masks' in tensor_dict:
        # The following processing is only for single image
        detection_boxes = tf.squeeze(tensor_dict['detection_boxes'], [0])
        detection_masks = tf.squeeze(tensor_dict['detection_masks'], [0])
        # Reframe is required to translate mask from box coordinates to image coordinates and fit the image size.
        real_num_detection = tf.cast(tensor_dict['num_detections'][0], tf.int32)
        detection_boxes = tf.slice(detection_boxes, [0, 0], [real_num_detection, -1])
        detection_masks = tf.slice(detection_masks, [0, 0, 0], [real_num_detection, -1, -1])
        detection_masks_reframed = utils_ops.reframe_box_masks_to_image_masks(
            detection_masks, detection_boxes, image.shape[0], image.shape[1])
        detection_masks_reframed = tf.cast(
            tf.greater(detection_masks_reframed, 0.5), tf.uint8)
        # Follow the convention by adding back the batch dimension
        tensor_dict['detection_masks'] = tf.expand_dims(
            detection_masks_reframed, 0)
      image_tensor = tf.get_default_graph().get_tensor_by_name('image_tensor:0')
      # Run inference
      output_dict = sess.run(tensor_dict,
                             feed_dict={image_tensor: np.expand_dims(image, 0)})
      # all outputs are float32 numpy arrays, so convert types as appropriate
      output_dict['num_detections'] = int(output_dict['num_detections'][0])
      output_dict['detection_classes'] = output_dict[
          'detection_classes'][0].astype(np.uint8)
      output_dict['detection_boxes'] = output_dict['detection_boxes'][0]
      output_dict['detection_scores'] = output_dict['detection_scores'][0]
      if 'detection_masks' in output_dict:
        output_dict['detection_masks'] = output_dict['detection_masks'][0]
  return output_dict

8. In the final part, all the functions are called and inference is run on all the input images.

for image_path in TEST_IMAGE_PATHS:
  image = Image.open(image_path)
  # the array based representation of the image will be used later in order to prepare the
  # result image with boxes and labels on it.
  image_np = load_image_into_numpy_array(image)
  # Actual detection. The function adds the batch dimension itself.
  output_dict = run_inference_for_single_image(image_np, detection_graph)
  # Visualization of the results of a detection.
  vis_util.visualize_boxes_and_labels_on_image_array(
      image_np,
      output_dict['detection_boxes'],
      output_dict['detection_classes'],
      output_dict['detection_scores'],
      category_index,
      instance_masks=output_dict.get('detection_masks'),
      use_normalized_coordinates=True,
      line_thickness=8)
  plt.figure(figsize=(12, 8))
  plt.imshow(image_np)
plt.show()
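If the detections are needed as data rather than as a drawn image, the dictionary returned in step 7 can be post-processed. The sketch below is an illustration under stated assumptions: `filter_detections` is a hypothetical helper, the score threshold is arbitrary, the input dictionary is hand-written, and boxes are assumed to be normalized [ymin, xmin, ymax, xmax] as in the Object Detection API's 'detection_boxes' output.

```python
def filter_detections(output_dict, image_w, image_h, min_score=0.5):
    """Keep detections scoring at least `min_score` and convert boxes to pixels."""
    results = []
    for box, cls, score in zip(output_dict['detection_boxes'],
                               output_dict['detection_classes'],
                               output_dict['detection_scores']):
        if score < min_score:
            continue  # drop low-confidence candidates
        ymin, xmin, ymax, xmax = box
        results.append({
            'class_id': int(cls),
            'score': float(score),
            # (xmin, ymin, xmax, ymax) in pixel coordinates
            'box_px': (round(xmin * image_w), round(ymin * image_h),
                       round(xmax * image_w), round(ymax * image_h)),
        })
    return results

# Hand-written stand-in for the output of one image with two candidates.
fake_output = {
    'detection_boxes': [[0.1, 0.2, 0.5, 0.6], [0.0, 0.0, 0.3, 0.3]],
    'detection_classes': [1, 18],
    'detection_scores': [0.92, 0.30],
}
detections = filter_detections(fake_output, image_w=640, image_h=480)
print(detections)
```

The class ids can then be looked up in category_index to get display names, or counted per class for applications such as people counting.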

Real-Time Object Detection Using Tensorflow

To perform real-time object detection with TensorFlow, the same code can be used with a few tweaks. OpenCV is used here to read the live feed from the webcam. The code can be summarised as follows:

import numpy as np
import os
import six.moves.urllib as urllib
import sys
import tarfile
import tensorflow as tf
import zipfile

from collections import defaultdict
from io import StringIO
from matplotlib import pyplot as plt
from PIL import Image
import cv2

from object_detection.utils import label_map_util
from object_detection.utils import visualization_utils as vis_util

# Open the default webcam.
cap = cv2.VideoCapture(0)

MODEL_NAME = 'ssd_mobilenet_v1_coco_11_06_2017'
MODEL_FILE = MODEL_NAME + '.tar.gz'
DOWNLOAD_BASE = 'http://download.tensorflow.org/models/object_detection/'
# Path to frozen detection graph. This is the actual model that is used for the object detection.
PATH_TO_CKPT = MODEL_NAME + '/frozen_inference_graph.pb'
# List of the strings that is used to add correct label for each box.
PATH_TO_LABELS = os.path.join('data', 'mscoco_label_map.pbtxt')
# Number of classes in the COCO label map.
NUM_CLASSES = 90

# Download the model archive and extract the frozen inference graph.
opener = urllib.request.URLopener()
opener.retrieve(DOWNLOAD_BASE + MODEL_FILE, MODEL_FILE)
tar_file = tarfile.open(MODEL_FILE)
for file in tar_file.getmembers():
  file_name = os.path.basename(file.name)
  if 'frozen_inference_graph.pb' in file_name:
    tar_file.extract(file, os.getcwd())

# Load the frozen graph into memory.
detection_graph = tf.Graph()
with detection_graph.as_default():
  od_graph_def = tf.GraphDef()
  with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
    serialized_graph = fid.read()
    od_graph_def.ParseFromString(serialized_graph)
    tf.import_graph_def(od_graph_def, name='')

label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
categories = label_map_util.convert_label_map_to_categories(label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
category_index = label_map_util.create_category_index(categories)

with detection_graph.as_default():
  with tf.Session(graph=detection_graph) as sess:
    # Look up the input and output tensors once, before the loop.
    image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
    # Each box represents a part of the image where a particular object was detected.
    detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
    # Each score represents the level of confidence for each of the objects.
    # Score is shown on the result image, together with the class label.
    detection_scores = detection_graph.get_tensor_by_name('detection_scores:0')
    detection_classes = detection_graph.get_tensor_by_name('detection_classes:0')
    num_detections = detection_graph.get_tensor_by_name('num_detections:0')
    while True:
      ret, image_np = cap.read()
      if not ret:
        break  # no frame could be read from the camera
      # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
      image_np_expanded = np.expand_dims(image_np, axis=0)
      # Actual detection.
      (boxes, scores, classes, num) = sess.run(
          [detection_boxes, detection_scores, detection_classes, num_detections],
          feed_dict={image_tensor: image_np_expanded})
      # Visualization of the results of a detection.
      vis_util.visualize_boxes_and_labels_on_image_array(
          image_np,
          np.squeeze(boxes),
          np.squeeze(classes).astype(np.int32),
          np.squeeze(scores),
          category_index,
          use_normalized_coordinates=True,
          line_thickness=8)
      cv2.imshow('object detection', cv2.resize(image_np, (800, 600)))
      # Press 'q' to quit.
      if cv2.waitKey(25) & 0xFF == ord('q'):
        break

# Release the camera and close the preview window.
cap.release()
cv2.destroyAllWindows()

Object detection is becoming common today. Its significance in face detection and face recognition is well understood, and it is gaining wide acceptance in surveillance and security. TensorFlow is one of the libraries that helps users achieve great results in object detection with relatively little effort.

The algorithms are constantly being updated, as is typical of machine learning. Old algorithms are being outperformed, and object detection is increasingly finding its way into self-driving cars and other sophisticated areas.

Last updated: 08 February 2023
About Author
Ravindra Savaram

Ravindra Savaram is a Content Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.
