MongoDB Tutorial

This MongoDB tutorial introduces Big Data and walks you through MongoDB, its features, and its advantages.

Big Data is used to describe data that has massive volume, varied structure, and is generated at high velocity. This poses challenges to the traditional RDBMS systems used for storing and processing data, and it is paving the way for newer approaches to processing and storing data.


Big Data, along with Cloud, Social, Analytics and Mobility, is a buzzword in the Information Technology world today. Data is being generated from varied sources in varied formats such as video, text, speech, log files and images, at very high speed.

The advent of Social media sites, smartphones, and other data generating consumer devices including PCs, Laptops and Tablets is leading to an explosion of data.

The availability of the Internet and electronic devices to the masses is increasing every day. With it, the number of connected devices is increasing, and thus the data generated is growing at an enormous speed.

The availability of photo and video technologies and their ease of use are generating huge amounts of graphic data. Each second of high-definition video, for example, generates more than 2,000 times as many bytes as are required to store a single page of text. Enterprises are realizing the impact of Big Data on their business applications and environments. The potential of Big Data is enormous: from understanding consumer behavior to fraud detection to military applications, big data is playing an ever increasing role.

The following figures depict statistics of the social networking sites Facebook and Twitter.


Fig: Facebook statistics (Source: Facebook)

A few statistics for the year 2013:

1. Worldwide, there are over 1.19 billion active Facebook users. (Source: Facebook)

2. 4.5 billion likes are generated daily as of May 2013, a 67 percent increase from August 2012. (Source: Facebook)

3. 728 million people log onto Facebook daily, which represents a 25% increase from 2012. (Source: Facebook, as of September 2013)


Fig: If you printed Twitter

To illustrate with an example, consider the amount of data that a single event, going to and attending a movie, can generate.

You may search for the movie on movie review sites, read reviews, and post queries. You may tweet about the movie or post photographs of going to the movie on Facebook. Maybe you also take a video while walking around the theater and upload it. While travelling, your GPS-enabled mobile generates tracking data. You may also use Google Latitude to update your status so that your friends can know where you are.

Thus a simple activity like above can generate enormous amounts of data.

Hence, to summarize: the combination of smartphones, social networking sites, media, and data internal to companies is creating a flood of data for companies to process and store.

“Big Data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” – The McKinsey Global Institute, 2011



"Big Data is defined as data which has high Volume, high Velocity and multiple Varieties."

Let's next look at a few facts and figures about Big Data.

Facts on Big Data


Fig: Big Data – A growing torrent

Various research teams around the world have analyzed the amount of data being generated; some of their findings are mentioned below for reference. One common theme that comes across is that the amount of data being generated and stored is increasing at an ever growing rate.

Report on enterprise server information, January 2011; and Martin Hilbert and Priscila López, "The world's technological capacity to store, communicate, and compute information," Science, February 10, 2011. IDC has published a series of white papers, sponsored by EMC, including "The expanding digital universe," March 2007; "The diverse and exploding digital universe," March 2008; "As the economy contracts, the digital universe expands," May 2009; and "The digital universe decade - Are you ready?," May 2010. All are available at www.emc.com/leadership/programs/digital-universe.htm.

The IDC's analysis revealed that in 2007 the amount of digital data generated in a year exceeded the world's total storage capacity, which means there was no way to store all the data being generated. It also revealed that the rate at which data is generated will soon outgrow the rate at which data storage capacity is expanding.

The following are citations from the MGI (McKinsey Global Institute, established in 1990) report (http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation) published in May 2011.

The Big Data Size Varies across sectors

MGI estimates that enterprises around the world used more than 7 exabytes of incremental disk drive data storage capacity in 2010; nearly 80 percent of that total appeared to be duplicate data that had been stored elsewhere.


Fig 1-6: Data Size variations Across Sectors

As is obvious from the figure above some sectors are particularly heavy on data storage.

Financial services are highly transaction oriented and are also regulated to store data; thus the analysis shows them to be big users of data storage. Some businesses, such as media and communications, are highly dependent on the creation and transfer of data and thus also figure in the list as heavy users of data storage.

The Big Data Type also Varies across sectors

The MGI research shows that the type of data stored also varies by sector.


Fig 1-7: Variety of Data across sectors

In terms of geographic spread of big data, North America and Europe currently hold 70% of the global total. With the advent of cloud computing, data generated in one region can be stored in another country's datacenter; thus countries with a significant cloud and hosting provider presence tend to store large amounts of data.

Big Data Sources

In this section we will cover the major factors contributing to the ever increasing size of data. The figure below depicts a few of the major contributing sources. As can be seen, internet usage via mobile devices, the popularity of social media sites, the proliferation of networked sensor applications, and the expansion of multimedia content are a few of the major sources contributing to the ever increasing data.


Fig: Sources of Data

As highlighted in the MGI report, the major sources contributing to the ever increasing data set are:

1. Enterprises are collecting data with more granularity now, attaching more details to every transaction in order to understand consumer behavior.

2. Increased use of multimedia across sectors such as health care and consumer-facing industries.

3. Increased popularity of social media sites such as Facebook and Twitter.

4. Rapid adoption of smartphones, enabling users to actively use these sites and other internet applications.

5. Increased use of sensors and devices in the day-to-day world, connected by networks to computing resources.

The MGI report also projects that the number of connected nodes in the Internet of Things— sometimes also referred to as machine-to-machine (M2M) devices— deployed in the world will grow at a rate exceeding 30 percent annually over the next five years.


Fig 1-10: Data generation Report

Hence as seen above the rate of growth of data is increasing and so is the diversity.

Also, the model of data generation has changed from a few companies generating data and others consuming it, to everyone generating data and everyone consuming it. As explained earlier, this is due to the penetration of consumer IT and internet technologies along with trends like social media.

Three V’s of Big Data

We have defined the big data as data with 3 V’s – Volume, Velocity and Variety.


Fig: 3V’s of Big Data

The “Big” in Big Data isn’t just about volume.

In this section we will look at the three V’s. It is imperative that organizations and IT leaders focus on these 3 V’s.


Volume in Big Data means the size of the data. As discussed in the previous sections, various factors contribute to the size of Big Data: as businesses become more transaction oriented, we see ever increasing numbers of transactions; more devices are getting connected to the internet, adding to the volume of data being generated and tracked; and increased internet usage and digitization of content also result in increased volumes of data.


Fig: Digital Universe Size

In today's scenario data is not just generated from within the enterprise; it is also generated from transactions with the extended enterprise and with customers, whose data enterprises now maintain extensively. In addition, machine data generated by smartphones, sensors, etc. and the proliferation of social networking sites are contributing to the growth. Petabyte scale data is becoming commonplace these days.

This huge volume of data is the biggest challenge for Big Data technologies. The challenge is the storage and processing power needed to store, process and make the data accessible in a timely and cost effective manner.


As we have seen in the previous sections, the proliferation of data sources such as social sites, sensor devices, mobiles and smartphones introduces different types of data. The data generated from these devices and sources follows no fixed format or structure. Compared to text, CSV or RDBMS data, the data generated here varies from text files, log files, streaming videos, photos, meter readings, stock ticker data and PDFs to audio and various other unstructured formats.

There is no control over the structure of the data these days; new sources and structures of data are being created at a rapid pace. Thus the onus is on technology to find a solution to analyze and visualize the huge variety of data that is out there. As an example, to provide alternate routes to commuters, a traffic analysis application will need data feeds from millions of smartphones and sensors to provide accurate analytics on traffic conditions and alternate routes.


Fig: Data Variety

A variety of formats and unstructured data leads to a technology challenge to consume and analyze this disparate data.


Velocity in Big Data refers to the speed at which data is generated and the speed at which it is required to be processed. If data cannot be processed at the required speed, it loses its significance. Due to data streaming from social media sites, sensors, tickers, metering and monitoring, it is becoming important for organizations to process data speedily, both while it is in motion and while it is at rest. Reacting and processing quickly enough to deal with the velocity of data is one more challenge for big data technology.


Fig: Speed of Data

Real time insight is essential in many big data use cases. For example, algorithmic trading systems take real time feeds from the market and from social media sites like Twitter to make real time decisions on stock trading. Any delay in processing this data can mean millions of dollars in lost opportunity on a stock trade.

To summarize Big Data is defined by three V’s which are Volume, Velocity and Variety. The technology challenge in Big Data is the need for it to be processed within the available processing and storage resources in a cost effective manner.


Fig: The three aspects of Data

In addition to the 3 V's there is a fourth V which comes up whenever Big Data is discussed. The fourth V is Veracity, which means that not all the data out there is important; hence it is very important to identify, from this humongous data source, what is important and will provide meaningful insight, and what should be ignored.

Usage of Big Data

We have seen in the above sections what Big Data is. In this section we will focus on ways of using Big Data to create value for organizations.

Before we delve into how big data can be made usable to the organizations we will first look at why this big data is important.

Big data, as we have seen above, is a completely new source of data: data which is generated when you post on a blog, like a product, or even move around. Previously such fine-grained information was not captured; now that it is, organizations which embrace such data pave the way for new innovations, improve their agility and ultimately increase their profitability.

Various ways are available in which the big data can create value for any organization.

As listed in the MGI report this can be broadly categorized into five ways of usage of big data:


Making big data accessible in a timely fashion to relevant stakeholders creates a tremendous amount of value.

Let's understand this with an example. Consider a manufacturing company whose R&D, engineering and manufacturing departments are dispersed geographically. If the data is accessible across all these departments and can be readily integrated, it can not only reduce search and processing time but will also help in improving product quality in time, according to present needs.

Discover and analyze information

Most of the value of big data comes when data collected from outside sources can be merged with the organization's internal data. Organizations are capturing detailed data, be it on inventories, employees or customers. Using all this data they can discover and analyze new information and patterns, and this information and knowledge can then be used to improve processes and performance.

Segmentation and Customization

Big data enables organizations to create highly specific segmentations and to tailor products and services to meet those needs. This can also be used in the social sector to accurately segment populations and target benefit schemes to specific needs.

Segmentation of customers based on various parameters can aid in targeted marketing campaigns and tailoring of products to suit the needs of customers.

Aid Decision Making

This data can substantially improve decision making, minimize risks, and unearth valuable insights that would otherwise remain hidden. Automated Fraud Alert systems in credit card processing to automatic fine tuning of inventory are examples of systems which aid or automate decision making based on Big Data Analytics.


Big data enables innovation of new ideas in the form of products and services. It enables innovation in existing ones in order to reach out to large segments of people. Using data gathered from actual products, manufacturers can not only innovate to create the next generation product, but they can also innovate in their after-sales offerings.

As an example real time data from machines and vehicles can be analyzed to provide insight into maintenance schedules, wear and tear of machines monitored to make resilient machines, fuel consumption monitoring can lead to higher efficiency engines. Real time traffic information is already making life easier for commuters by providing them an option to take alternate routes.

Thus big data is not only about size; it is also about the opportunity to find meaningful insights in the ever increasing pool of data and to help organizations make more informed decisions, making them more agile. Hence it not only provides an opportunity for organizations to strengthen their existing business by making informed decisions, it also helps them expand their horizons.

Big Data Challenges

Big Data also poses some challenges. In this section we will highlight a few of them:

Policies and Procedures

As more and more data is gathered, digitized and moved around the globe, policy and compliance issues become increasingly important. Data privacy, security, intellectual property and the protection of big data are of immense importance to organizations.

Compliance with various statutory and legal requirements poses a challenge in data handling. Issues around ownership of and liability for data are important legal aspects which need to be dealt with in the case of Big Data.

Also a lot of big data projects leverage the scalability features of public cloud computing providers. These two aspects combined together pose a challenge for Compliance.

Policy questions on who owns the data, what is defined as fair use of data, who is responsible for accuracy and confidentiality of data need to be answered.

Access to data

Accessing data for consumption is going to be a challenge for Big Data projects. Some of the data may be available with third parties and gaining access can be a legal, contractual challenge.

Data about a product or service is available on Facebook, in Twitter feeds, reviews and blogs; how does the product owner access this data from various sources owned by various providers?

The contractual clauses and the economic incentives for accessing big data need to be tied in to enable the availability and access of data to the consumer.

Technology and techniques

Newer tools and technologies built specifically with the needs of big data in mind will have to be leveraged, rather than trying to address these needs through legacy systems. Thus the inadequacy of legacy systems to deal with Big Data on one hand, and the lack of experienced resources in the newer technologies on the other, is a challenge that any big data project has to manage.

Legacy Systems and Big Data

In this section we will discuss the challenges that organizations face in managing big data using legacy systems.

Structure of Big Data

Legacy systems are designed to work with structured data, where tables with columns are defined and the format of the data held in the columns is known.

As opposed to this, Big Data is data with varied structures; it is largely unstructured data such as images, videos, logs, etc.

Since Big Data can be unstructured, legacy systems, which were created to perform fast queries and analysis through techniques like indexing based on the particular data types held in various columns, cannot be used to hold or process it.

Data Storage

Legacy systems use big servers and NAS and SAN systems to store data. As the data increases, the server size and the backend storage size have to be increased. Traditional legacy systems typically work in a scale-up model, where more and more compute, memory and storage has to be added to a server to meet the increased data needs.

Since a single server can only be scaled up so far, the time it takes to process the data increases sharply as the data outgrows the server's capacity, defeating the other important requirement of Big Data, i.e. Velocity.

Data Processing

The algorithms in legacy systems are designed to work with structured data such as strings and integers, and are also limited by the size of data they can handle.

Thus the legacy systems are not capable of handling the processing of unstructured data, huge volume of data and the speed at which the processing needs to be performed.

Hence to capture value from big data, new technologies (e.g., storage, computing, and analytical software) and techniques (i.e., new types of analyses algorithms and techniques) need to be deployed.

Big Data Technologies

We have seen what big data is; in this section we will briefly look at the technologies that can enable handling this humongous source of data. The technologies in question are needed not only to accept the data efficiently but also to process the different types of data in an acceptable time.

Hence a big data technology should not only be capable of collecting large amount of data but should also be able to process and analyze the data.

The recent technology advancements that enable organizations to make the most of their big data are:

  1. New Storage and processing technologies designed specifically for large unstructured data.
  2. Parallel processing
  3. Clustering
  4. Large grid environments
  5. High connectivity and high throughput
  6. Cloud computing and scale out architectures.

There are a growing number of technologies which are making use of the technological advancements.

In this book we will discuss MongoDB, one of the technologies that can be used to store and process big data.

What is NoSQL

"NoSQL is a new way of designing internet scale database solutions. It is not a product or technology but a term that defines a set of database technologies which are not based on the traditional RDBMS principles."

In this chapter we will cover the definition and basics of NoSQL. We will introduce the reader to the CAP theorem and talk about the NRW notation. We will compare the ACID and BASE approaches and finally conclude the chapter by comparing NoSQL and SQL database technologies.


The idea of the RDBMS was born from E. F. Codd's 1970 whitepaper titled "A relational model of data for large shared data banks". The language used to query RDBMS systems is SQL (Structured Query Language).

RDBMS systems are well suited for structured data held in columns and rows which can be queried using SQL. The RDBMS systems are based on the concept of “ACID” transactions.

ACID stands for Atomic, Consistent, Isolated and Durable where:

  • Atomic means that in a transaction either all the changes are applied completely or not applied at all.
  • Consistent means the data is in a consistent state after the transaction is applied which means after a transaction is committed the queries fetching a particular data will see the same result.
  • Isolated means the transactions that are applied to the same set of data are independent of each other. Thus one transaction will not interfere with the other transaction.
  • Finally durable means the changes are permanent in the system and will not be lost in case of any failures.


NoSQL is a term used to define non-relational databases. Thus it encompasses the majority of data stores which are not based on conventional RDBMS principles and which handle large data sets at internet scale.

Big data, as discussed in the previous chapter, is posing challenges to the traditional ways of storing and processing data, i.e. the RDBMS systems. Thus we see the rise of NoSQL databases, which are designed to process this huge amount and variety of data within time and cost constraints.

Thus NoSQL databases evolved from the need of handling Big Data where the traditional RDBMS technologies did not provide adequate solutions.

Some examples of Big Data use cases which NOSQL Databases can meet are:

Social Network Graph: Who is connected to whom? Whose post should be visible on the user’s wall or homepage on a social-network site.

Search and Retrieve: Search all relevant pages with a particular keyword ranked by the number of times a keyword appears on a page.


Fig: Structured vs Un/Semi-structured Data


NoSQL doesn't have a formal definition. It represents a form of persistence/data storage that is fundamentally different from RDBMS. Hence, if we have to define NoSQL, it would be as follows: NoSQL is an umbrella term for data stores that don't follow the RDBMS principles.

The term was initially used to mean "do not use SQL if you want to scale"; later it was redefined as "not only SQL", which means that in addition to SQL other complementary database solutions also exist.

Brief History of NoSQL

In 1998 Carlo Strozzi coined the term NoSQL to refer to his open source, lightweight database. He used this term because the database didn't have a SQL interface. The term resurfaced in early 2009 when Eric Evans (a Rackspace employee) used it at an event on open source distributed databases to refer to distributed databases that were non-relational and did not follow the ACID features of relational databases.


In the introduction we noted that traditional RDBMS applications have focused on ACID transactions.

However essential these qualities may seem, they are quite incompatible with the availability and performance requirements of web-scale applications.

For example, say we have a company like OLX, which is used for selling products such as unused household goods (e.g. old furniture, vehicles, etc.), with an RDBMS as its database.

Let’s consider two scenarios

First Scenario: Consider an ecommerce shopping site where a user is buying a product. The application may lock part of the database, the inventory, for update while the purchase is in progress so that the inventory is correctly reflected. In this case every other user ends up waiting until the user who holds the lock completes the transaction.

In another scenario, the application might end up using cached data or even unlocked records, resulting in inconsistency. In this case two users might end up buying the product when the inventory was actually zero.

The system may become slow impacting scalability and user experience.

Now let us look at how NOSQL tries to solve this problem using an approach popularly called as “BASE”. Before explaining BASE, let’s understand the concept of CAP theorem.

CAP Theorem (Brewer’s Theorem)

Eric Brewer outlined the CAP theorem in 2000. This is one of the important concepts that needs to be well understood by developers and architects dealing with distributed databases. The theorem states that when designing an application in a distributed environment there are three basic requirements: Consistency, Availability and Partition tolerance, where:

  • Consistency means the data remains consistent after any operation that changes it is performed, and all users or clients accessing the application see the same updated data.
  • Availability means the system is always available with no downtime.
  • Partition Tolerance means the system will continue to function even if it’s partitioned into groups of servers which are not able to communicate with one another.

The CAP theorem states that at any point of time a distributed system can fulfill only two of the above three guarantees.


Fig: CAP Theorem


Eric Brewer coined the BASE acronym.

BASE can be explained as:

  • Basically Available means the system does guarantee availability, in terms of the CAP theorem.
  • Soft state indicates that the system state may change over time even if no input is received. This is in compliance with the eventual consistency model.
  • Eventual consistency indicates that the system will become consistent over a period of time provided no input is sent to the system during that time.

Hence BASE is in contrast with the RDBMS ACID transactions.

We have seen that NoSQL databases are eventually consistent but the eventual consistency implementation may vary across different NoSQL databases.

NRW is the notation used to describe how the eventual consistency model is implemented across NoSQL databases, where:

  • N is the number of data copies that the database maintains.
  • R is the number of copies that an application must read before returning a read request's output.
  • W is the number of data copies that must be written to before a write operation is marked as completed successfully.

Hence using these notation configurations the databases implement the model of eventual consistency.

Consistency can be implemented at both read and write operation level.

Write Operations

N = W implies that the write operation will update all data copies before returning the control to the client and marking the write operation as successful. This is similar to how the traditional RDBMS databases work when implementing synchronous replication. This setting will slow down the write performance.

If write performance is a concern, meaning you want writes to happen fast, then we can set W = 1, R = N. This implies that a write will update just one copy and be marked successful, but whenever the user issues a read request all copies will be read to return the result. If any copy is not updated, it is brought up to date first and only then does the read succeed; this implementation slows down read performance.

Hence most NoSQL implementations use N > W > 1, which implies that more than one write must complete, but not all nodes need to be updated immediately.

Read Operations

If R is set to 1, the read operation will read any single data copy, which may be outdated. If R > 1, more than one copy is read and the most recent value will be returned, but this can slow down the read operation.

Using W + R > N ensures that a read always retrieves the latest value. The reason is that the number of copies written and the number of copies read are together high enough to guarantee that at least one copy of the latest version is always included in the read set. This is referred to as quorum assembly.
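As a rough MongoDB-flavored sketch of this quorum idea (the three-member replica set, collection name and document below are assumptions for illustration; a majority read concern also assumes a sufficiently recent server version):

// Write acknowledged by a majority of the N = 3 replica set members (W = 2)
db.inventory.insert(
    { item: "book", qty: 1 },
    { writeConcern: { w: "majority", wtimeout: 5000 } }
);

// Read that only returns majority-acknowledged data (R = 2), so W + R > N
db.inventory.find({ item: "book" }).readConcern("majority");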

Table: NRW configurations


Hence if we have to compare ACID vs. BASE it’ll appear as below

Table: ACID vs BASE



NoSQL Advantages and Disadvantages

In this section we will look at the advantages and disadvantages of NoSQL databases.

Advantages of NoSQL

High scalability

Traditional RDBMS systems scale up by adding more resources to a single server, and this scaling up approach fails when transaction rates and fast response requirements increase. In contrast, the new generation of NoSQL databases is designed to scale out, i.e. to expand horizontally using low-end commodity servers.

Manageability and Administration

NoSQL databases are designed to work mostly with automated repairs, distributed data and simpler data models, leading to lower manageability and administration effort.

Low Cost

NoSQL databases are typically designed to work with clusters of cheap commodity servers, enabling users to store and process more data at a lower cost.

Flexible data models

NoSQL databases have a very flexible data model, enabling them to work with any type of data; they don't comply with the rigid RDBMS data models. Hence any application changes which involve updating the database schema can be easily implemented.

Challenges of NoSQL

In addition to the above mentioned advantages there are many impediments that users need to be aware of before they start developing applications using these platforms.


Most NoSQL databases are still in pre-production versions, with key features yet to be implemented. Hence, while deciding on a NoSQL database, organizations should analyze the product carefully to ensure that their requirements are fully implemented and not still on the to-do list.


Support is one limitation users need to be aware of. Most NoSQL databases are open source projects, and although there are one or more firms offering support for each of them, these are generally small startups which, compared to the established enterprise software companies, may not have global reach or extensive support resources.

Limited Query Capabilities

As NoSQL databases were generally developed to meet the scaling requirements of web-scale applications, they provide limited querying capabilities. Even a simple querying requirement may involve significant programming expertise.


Though NoSQL is designed to provide a no-admin solution, it still requires skill and effort to install and maintain.


Since NoSQL is an evolving area, expertise in the technology is limited in the developer and administrator community.

Though NoSQL is becoming an important part of the database landscape, users need to be aware of the limitations and advantages of the products to make a correct choice of NoSQL database platform.

SQL vs. NoSQL Databases

We have seen what NoSQL databases are. Though NoSQL is increasingly being adopted as a database solution, it is not here to replace SQL or RDBMS databases. In this section we will look at the differences between SQL and NoSQL databases.


Fig: NoSQL World

Let's do a quick recap of the RDBMS system. RDBMS systems have prevailed for about 30 years, and even now they are the default choice of the solution architect for an application's data storage. If we list a few of the good points of these systems, the first and foremost is the use of SQL, a rich declarative query language for data processing that is well understood by users; in addition, they provide ACID support for transactions, which is a must in many applications such as banking.

However, the biggest drawbacks of RDBMS systems are the difficulty of handling schema changes and the scaling issues that arise as data increases. As the data grows, read/write performance degrades. We face scaling issues with RDBMS systems because they are mostly designed to scale up and not scale out.

In contrast to the SQL RDBMS databases NoSQL promotes the data stores which break away from the RDBMS paradigm.

Let’s talk about technical scenarios and how they compare in RDBMS Vs NoSQL:

Schema flexibility: This is a must for easy future enhancements and integration with external applications (outbound or inbound).

RDBMS are quite inflexible in their design. More often than not, adding a column is an absolute no-no, especially if the table has some data. The reasons range from default values to indexes and performance implications. More often than not we end up creating new tables and increasing complexity by introducing relationships across the tables.

Complex queries: The traditional design of tables leads to developers writing complex JOIN queries which are not only difficult to implement and maintain but also take substantial database resources to execute.

Data update: Updating data across tables is probably one of the more complex scenarios, especially if it is part of a transaction. Note that keeping a transaction open for a long duration hampers performance. One also has to plan for propagating the updates to multiple nodes across the system. And if the system does not support multiple masters or writing to multiple nodes simultaneously, there is a risk of node failure and the entire application moving to read-only mode.

Scalability: More often than not, the only scalability that may be required is for read operations. However, several factors impact this speed as operations grow. Some of the key questions to ask are:

  • What is the time taken to synchronize the data across physical database instances?
  • What is the time taken to synchronize the data across datacenters?
  • What is the bandwidth requirement to synchronize data? Is the data exchanged optimized?
  • What is the latency when any update is synchronized across servers? Typically, the records will be locked during an update.

NoSQL-based solutions provide answers to most of the challenges listed above. Let us now see what NoSQL has to offer against each technical question mentioned above:

Schema flexibility: Column-oriented databases store data as columns as opposed to rows in RDBMS. This allows flexibility of adding one or more columns as required, on the fly. Similarly, document stores that allow storing semi-structured data are also good options.

Complex queries: NoSQL databases do not have support for relationships or foreign keys. There are no complex queries. There are no JOIN statements.

Is that a drawback? How does one query across tables?

It is a functional drawback, definitely. To query across tables, multiple queries must be executed. A database is a shared resource, used across application servers, and must be released from use as quickly as possible. The options involve a combination of simplifying the queries to be executed, caching data, and performing complex operations in the application tier. A lot of databases provide built-in entity-level caching. This means that as and when a record is accessed, it may be automatically and transparently cached by the database. The cache may be an in-memory distributed cache for performance and scale.
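As a hedged sketch of querying "across tables" without a JOIN (the collections and fields below are invented for illustration), the application issues two queries and assembles the result itself:

// Fetch the order, then fetch the referenced customer with a second query
var order = db.orders.findOne({ _id: 1001 });
var customer = db.customers.findOne({ _id: order.customerId });
// The application tier combines the two results (and may cache the customer document)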

Data update: Data update and synchronization across physical instances are difficult engineering problems to solve. Synchronization across nodes within a datacenter has a different set of requirements compared to synchronization across multiple datacenters. One would want the latency within a couple of milliseconds, or tens of milliseconds at best. NoSQL solutions offer good synchronization options. MongoDB, for example, allows concurrent updates across nodes, synchronization with conflict resolution and, eventually, consistency across datacenters within an acceptable time of a few milliseconds. As such, MongoDB has no concept of isolation. Note that because the complexity of managing the transaction may be moved out of the database, the application will have to do some hard work.

An example of this is a two-phase commit while implementing transactions (http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/).

A plethora of databases offer multiversion concurrency control (MCC) to achieve transactional consistency (http://en.wikipedia.org/wiki/Multiversion_concurrency_control).

Surprisingly, eBay does not use transactions at all (http://www.infoq.com/interviews/dan-pritchett-ebay-architecture).

Well, as Dan Pritchett (http://www.addsimplicity.com/), Technical Fellow at eBay, puts it, eBay.com does not use transactions. Note that PayPal does use transactions.

Scalability: NoSQL solutions provide greater scalability for obvious reasons. A lot of the complexity required for transaction oriented RDBMS does not exist in ACID non-compliant NoSQL databases. Interestingly, since NoSQL databases do not provide cross-table references, no JOIN queries are possible, and one cannot write a single query to collate data across multiple tables, one simple and logical solution is, at times, to duplicate the data across tables. In some scenarios, embedding the information within the primary entity, especially in one-to-one mapping cases, may be a great idea.

Let’s compare SQL and NOSQL technologies:

Table: SQL vs. NoSQL





Categories of NoSQL database

In this section we will quickly analyze the NoSQL Landscape. We will look at the emerging categories of NoSQL databases. The following table is a listing of few of the projects in the NoSQL landscape with the types and the players in each category.

The NoSQL databases are categorized on the basis of how the data is stored. NoSQL mostly follows a horizontal structure because of the need to provide curated information from large volumes, generally in near real-time. They are optimized for insert and retrieve operations on a large scale with built-in capabilities for replication and clustering.

Table: NoSQL categories



The following table briefly provides a feature-wise comparison between the various categories of NoSQL.

Table: Feature-wise comparison


The important thing while considering a NoSQL project is the feature set the users are interested in.

Hence, when deciding on a NoSQL product, we first need to understand the problem requirements very carefully, and then we should look at others who have already used the NoSQL product to solve similar problems. Since NoSQL is still maturing, this enables us to learn from peers and previous deployments and make better choices.

In addition, we also need to consider the following: how big is the data that needs to be handled, what throughput is acceptable for reads and writes, how consistency is achieved in the system, whether the system needs to support high write performance or high read performance, how easy maintenance and administration are, what needs to be queried, and what the benefit of using NoSQL is. Finally, start small but significant, and consider a hybrid approach wherever possible.

Introducing MongoDB

MongoDB is one of the leading NoSQL document store platforms, enabling organizations to handle Big Data. MongoDB is the platform of choice for some of the leading enterprise and consumer IT companies, who have leveraged its scaling and geospatial indexing capabilities in their products and solutions.

MongoDB derives its name from the word "humongous". Like other NoSQL databases, MongoDB doesn't comply with the RDBMS principles. It doesn't have the concepts of tables, rows and columns, and it doesn't provide ACID compliance, JOINs, foreign keys, etc.


"MongoDB is a leading NoSQL open-source document based database system developed and supported by 10gen. It provides high performance, high availability and easy scalability."

MongoDB stores data as Binary JSON Documents (also known as BSON). The documents can have different schemas hence enabling the schema to change as the application evolves. MongoDB is built for scalability, performance and high-availability.

MongoDB Design Philosophy

MongoDB wasn’t designed in a lab. We built MongoDB from our own experiences building large scale, high availability, and robust systems. We didn’t start from scratch; we really tried to figure out what was broken, and tackle that. So the way I think about MongoDB is that if you take MySql, and change the data model from relational to document based, you get a lot of great features: embedded docs for speed, manageability, agile development with schema-less databases, easier horizontal scalability because joins aren’t as important. There are lots of things that work great in relational databases: indexes, dynamic queries and updates to name a few, and we haven’t changed much there. For example, the way you design your indexes in MongoDB should be exactly the way you do it in MySql or Oracle , you just have the option of indexing an embedded field.

—Eliot Horowitz, 10gen CTO and Co-founder

In this section we will briefly look at some of the design decisions that led to what MongoDB is today.

Speed, Scalability and Agility

The design team's goal while designing MongoDB was to create a database that is fast, massively scalable and easy to use. To achieve speed and horizontal scalability in a partitioned database, as explained by Brewer's CAP theorem, consistency and transactional support have to be compromised. Thus, per the CAP theorem, MongoDB provides high availability, scalability and partitioning at the cost of consistency and transactional support.

Thus MongoDB is a platform of choice for applications needing a flexible schema, speed and partitioning capabilities while it may not be suited for applications which require consistency and atomicity. Instead of tables and rows MongoDB uses documents to make it flexible, scalable and fast.

Non-Relational Approach

Traditional RDBMS platforms provide scalability using a scale-up approach: one needs to provision a faster, bigger server to increase performance.

The following issues with RDBMS led to MongoDB and other NoSQL databases being designed the way they are:

In order to scale out, an RDBMS needs to link the data available on two or more systems in order to report back the result. This is difficult to achieve in RDBMSs since they are designed to work when all the data is available for computation together; thus the data has to be available for processing at a single location.

In the case of multiple active-active servers, when both are getting updated from multiple sources, there is a challenge in determining which update is correct.

If an application tries to read data from the second server, and the information has been updated on the first server but has yet to be synchronized with the second server, the information returned may be stale.

Hence the MongoDB team decided to take a non-relational approach to solving this problem and providing a scalable, high performance database.

As mentioned above, MongoDB stores its data in BSON documents in which all the related data is placed together, which means everything that is needed is in one place. Queries in MongoDB are based on keys in the document, so the documents can be spread across multiple servers. Each server checks its own set of documents and returns the result. This enables linear scalability and improved performance.

MongoDB has master-slave replication in which only one master exists and accepts write requests. If write performance needs to be improved, then sharding can be used, which splits the data across multiple machines and enables different machines to update different parts of the dataset. Sharding is automatic in MongoDB; as more machines are added, data is distributed automatically.
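As a brief, hedged sketch of what enabling sharding looks like from the mongo shell (this assumes a sharded cluster is already running behind a mongos router; the database, collection and shard key names are invented for illustration):

// Enable sharding for a database, then shard one of its collections on a chosen key
sh.enableSharding("ecommerce");
sh.shardCollection("ecommerce.orders", { customerId: 1 });
sh.status();   // shows how chunks of data are distributed across the shards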

We will be discussing the replication, sharding, data storage in detail in the subsequent sections.

Transactional Support

In order to make MongoDB scale horizontally the transactional support had to be compromised.

This is an important design consideration for many NoSQL databases. We discussed this aspect with the CAP theorem: a distributed database system can only cater to two of the three parameters, so to provide availability and partition tolerance, consistency is compromised. We will discuss this aspect in detail in subsequent chapters.

MongoDB doesn't provide transactional support. Although atomicity is guaranteed at the document level, if an update involves multiple documents there is no guarantee of atomicity. Similarly, "isolation" is not supported, which means data being read by one client might be modified by another concurrent client.

JSON based Document Store

MongoDB uses a JSON (JavaScript Object Notation) based document store for its data. JSON/BSON (Binary JSON) provides a schemaless model, which gives flexibility in terms of database design. Unlike in RDBMSs, the design is schemaless and flexible, and changes to the schema can be made seamlessly.

This design also makes it high performance by providing for grouping of relevant data together internally and making it easily searchable.

A JSON document contains the actual data and is comparable to a row in SQL. However, in contrast to an RDBMS row, documents have a dynamic schema, which means documents in the same collection can have different fields or structures, and common fields can hold different types of data.

A document is a set of Key-Value pairs.

Let’s understand this with an example:

{
    "Name": "ABC",
    "Phone": ["1111111", "222222"],
    "Fax": ""
}



Keys and values come in pairs. The value of a key in a document can be left blank. Thus in the above example the document has three keys, namely "Name", "Phone" and "Fax", and the "Fax" key has no value.

Performance vs. Features

In order to make MongoDB high performance and fast, certain features that are commonly available in RDBMS systems are not available in MongoDB.

Things like Unique Key Constraints, Multi-Document Updates are not available in MongoDB.

MongoDB is a document-oriented DBMS where data is stored as documents. It does not support joins or transactions; however, it provides secondary indexes, it enables users to query using query documents, and it supports atomic updates at the per-document level.

It provides replica sets, which are a form of master-slave replication with automated failover, and built-in horizontal scaling via automated range-based partitioning.
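To make the replica set idea concrete, here is a hedged mongo shell sketch (the replica set name "rs0" and the host names are assumptions; each member would be a mongod started with --replSet rs0):

// Run once against one member to form a three-node replica set
rs.initiate({
    _id: "rs0",
    members: [
        { _id: 0, host: "host1:27017" },
        { _id: 1, host: "host2:27017" },
        { _id: 2, host: "host3:27017" }
    ]
});
rs.status();   // reports which member is currently the primary (master)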

Running the Database Anywhere

One of the main design decisions was the ability to run the database anywhere, which means it should be able to run on servers, VMs or even in the cloud using a pay-for-what-you-use service.

MongoDB is implemented in C++, which helps it achieve this goal. The 10gen site provides binaries for different OS platforms, enabling MongoDB to run on practically any type of machine, be it physical, virtual or in the cloud.

SQL Comparison

In this part of the chapter before concluding we will briefly summarize points of how MongoDB is different from SQL.

  1. MongoDB stores its data in "documents", which can have a flexible schema (documents in the same collection can have different fields), enabling users to store nested or multi-value fields such as arrays, hashes, etc. In an RDBMS, by contrast, the schema is fixed: a column's values must share the same data type, and arrays or nested values cannot be stored in a cell.
  2. MongoDB doesn't provide support for "JOIN" operations as in SQL. However, it lets the user store all relevant data together in a single document, which largely avoids the need for JOINs; this is its workaround for the issue, and a short sketch illustrating it follows this list. We will discuss this in more detail in the data modeling considerations section.
  3. MongoDB doesn't provide support for "transactions" in the SQL sense. However, it guarantees atomicity at the document level. It also doesn't guarantee "isolation", which means data being read by one client can have its values modified by another client concurrently. We will look at the solutions available to overcome these limitations of MongoDB in the coming sections.
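A hedged sketch of point 2 (the "orders" collection, its fields and values are invented for illustration): data that SQL would normalize into separate joined tables can be embedded in a single document and queried directly:

// One order document embeds its customer details and line items, so no JOIN is needed
db.orders.insert({
    _id: 1001,
    customer: { name: "ABC", city: "New York" },
    items: [
        { sku: "A1", qty: 2, price: 10 },
        { sku: "B2", qty: 1, price: 25 }
    ]
});

// Query on a field inside the embedded array
db.orders.find({ "items.sku": "A1" });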

Data Model in MongoDB

"MongoDB is designed to work with documents, without any need for predefined columns or data types the way relational databases require, making the data model extremely flexible."

In this chapter we will introduce you to the MongoDB data model. We will also introduce what a flexible schema (polymorphic schema) means and why it is a significant consideration in the MongoDB data model.

In the previous chapter we saw that MongoDB is a document based database designed to work with documents that can have a flexible schema, which means that documents within a collection can have different (or the same) sets of fields. This gives users more flexibility when dealing with data.

In this chapter we will understand the MongoDB flexible data model, and wherever required we will demonstrate how the approach differs from that of an RDBMS.

A MongoDB deployment can have many databases. Each database is a set of collections. Collections are similar to the concept of tables in SQL; however, they are schemaless.


Fig: MongoDB Database Model

Each collection can have multiple documents. Think of a document as a row in SQL but schema less.

In an RDBMS system, since the table structures and the data types for each column are fixed, one can only add data of a particular data type to a column. In MongoDB, a collection is a set of documents where data is stored as key-value pairs.

Let’s understand with an example how data is stored in a document. The following document holds the Name and Phone Numbers of the Users.

{
    "Name": "ABC",
    "Phone": ["1111111", "222222"]
}

A dynamic schema implies that documents within the same collection can have the same or different sets of fields or structure, and even common fields can store different types of values across documents. This means there is no rigidity in the way data is stored in the documents of a collection.

Let’s take an example of a Region Collection:


{
    "R_ID": "REG001",
    "Name": "United States"
}

{
    "R_ID": 1234,
    "Name": "New York",
    "Country": "United States"
}


In the above example we have two documents in the Region collection.

If we observe, we find that though both documents are part of a single collection, they have different structures: the second document has an additional field of information, namely country. In fact, if we look at the "R_ID" field, it stores a string value in the first document whereas it is a number in the second document.

Thus a collection's documents can have entirely different schemas. It is up to the application to store documents in a particular collection together or to use multiple collections. As such, there is no performance difference between having multiple collections or a single collection.
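A hedged mongo shell sketch of the Region example above (the collection name is taken from the text; the shell commands themselves are an illustration):

// Insert the two differently shaped documents into the same collection
db.Region.insert({ R_ID: "REG001", Name: "United States" });
db.Region.insert({ R_ID: 1234, Name: "New York", Country: "United States" });

// Both can be queried through the common "Name" field
db.Region.find({ Name: "United States" });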


We have seen MongoDB is a document based database. It uses Binary JSON for storing its data.

In this section we will briefly understand what JSON and Binary JSON (BSON) are.

JSON stands for JavaScript Object Notation. It’s a standard used for data interchange in today’s modern web (along with XML). The format is human and machine readable. It is not only a great way for exchanging data but also a nice way for storing data.

All the basic data types such as Strings, Number, Boolean Values along with Arrays, Hashes are supported by JSON.

Let's take an example to see what a JSON document looks like:

{
    "_id": 1,
    "name": { "first": "John", "last": "Doe" },
    "publications": [
        {
            "title": "First Book",
            "year": 1989,
            "publisher": "publisher1"
        },
        {
            "title": "Second Book",
            "year": 1999,
            "publisher": "publisher2"
        }
    ]
}


JSON enables users to keep all the related pieces of information together in one place, which provides excellent performance. It also enables updates to documents to be independent of each other. It is schema-less.

Binary JSON (BSON)

MongoDB stores the JSON document in binary encoded format. This is termed as BSON. The BSON Data model is an extended form of JSON data model.

MongoDB's implementation of BSON documents is fast, highly traversable and lightweight.

It supports embedding arrays and objects within other arrays, and it also enables MongoDB to reach inside objects to build indexes and match objects against query expressions on both top-level and nested BSON keys.
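For instance, using the document shown above (stored in a hypothetical "authors" collection), indexing and querying nested BSON keys might look like the following sketch:

// Index on a key nested inside the "name" sub-document
db.authors.ensureIndex({ "name.last": 1 });

// Match against a key nested inside the embedded "publications" array
db.authors.find({ "publications.year": { $gt: 1990 } });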

The Identifier (_id)

We have seen that MongoDB stores data in documents, which are made up of key-value pairs. Though a document can be compared to a row in an RDBMS, unlike a row it has a flexible schema. A key, which is nothing but a label, can be roughly compared to a column name in an RDBMS, and keys are used for querying data from documents. Hence, like an RDBMS primary key (used to uniquely identify each row), we need a key that uniquely identifies each document within a collection. This is referred to as _id in MongoDB.

If the user has not explicitly specified a value for this key, a unique value is automatically generated and assigned to it by MongoDB.

This key value is immutable and can be of any data type except arrays.
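A small, hedged illustration in the mongo shell (the "users" collection and its contents are assumptions):

// No _id supplied: MongoDB generates a unique ObjectId automatically
db.users.insert({ Name: "ABC" });

// _id supplied explicitly by the application (must be unique within the collection)
db.users.insert({ _id: 101, Name: "DEF" });

db.users.findOne({ _id: 101 });   // documents are looked up by their _id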

Capped Collection

We are already well versed with collections and documents. Let's talk about a special type of collection called a "capped collection".

MongoDB has a concept of capping a collection. A capped collection stores documents in insertion order, and as the collection reaches its size limit, documents are removed in FIFO (first in, first out) order. This implies that the least recently inserted documents are removed first.

These features are good for use cases where the order of insertion needs to be maintained automatically and deletion of records after a fixed size is required. An example of such a use case is log data that gets automatically truncated after a certain size.

MongoDB itself uses capped collection for maintaining its replication logs.

Capped collection guarantees preservation of the data in insertion order, hence

  1. Queries retrieving data in insertion order return results quickly and don’t need an index.
  2. Updates that change the document size are not allowed.
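Creating and checking a capped collection from the mongo shell might look like the following hedged sketch (the collection name, byte size and document limit are assumptions for illustration):

// Create a capped collection of roughly 1 MB holding at most 5000 documents
db.createCollection("logs", { capped: true, size: 1048576, max: 5000 });

db.logs.isCapped();   // returns true

// Once the limit is reached, inserting new entries silently removes the oldest ones
db.logs.insert({ ts: new Date(), msg: "application started" });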

Polymorphic Schemas

As readers are already conversant with the schemaless nature of the MongoDB data structure, let's understand polymorphic schemas and their use cases.

A polymorphic schema is a schema where a collection has documents of different types or schemas.

A good example of this schema is a "Users" collection: some user documents might have an extra fax number or email addresses, while some might have only phone numbers, yet all these documents coexist within the same "Users" collection.

Such a schema is generally referred to as a polymorphic schema.

In this part of the chapter, we’ll explore the various reasons for using a polymorphic schema.

Object-Oriented Programming

Object-oriented programming enables programmers to have classes share data and behaviors using inheritance. It also enables them to define functions in the parent class that can be overridden in a child class and hence behave differently in different contexts, i.e. we can use the same function name to manipulate the child as well as the parent class even though the underlying implementations may differ. This feature is referred to as polymorphism.

Hence the requirement in this case is the ability to have a schema wherein all the related set of objects or objects within a hierarchy can fit in together and can also be retrieved identically.

Let's understand the above with an example. Suppose we have an application that enables users to upload and share different content types such as HTML pages, documents, images, videos, etc.

Many fields will be common across all of these content types, such as Name, ID, Author, and Upload Date and Time, but not all fields will be identical: an image will have a binary field holding the image content, whereas an HTML page will have a large text field holding the HTML content.

In this scenario a polymorphic schema can be used: all the content node types are stored in the same collection, say "LoadContent", and each document contains only the fields relevant to its type.

// "LoadContent" collection - "HTMLPage" document

{
  _id: 1,
  title: "Hello",
  type: "HTMLpage",
  text: "<html> Hi.. Welcome to my world </html>"
}

// The same collection also holds a "Picture" document

{
  _id: 3,
  title: "Family Photo",
  type: "JPEG",
  sizeInMB: 10,
  ...
}

This schema not only enables users to store related data with different structures in the same collection, it also simplifies querying. The same collection can be used to run queries on common fields, such as fetching all content uploaded on a particular date and time, as well as queries on type-specific fields, such as finding images larger than a given size.
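
For instance, both kinds of queries can run against the same "LoadContent" collection; a minimal sketch, assuming an uploadDate field exists on every document (it is not shown in the sample documents above):

// query on an assumed common field
> db.LoadContent.find({ uploadDate: { $gte: ISODate("2014-01-01") } })

// query on a type-specific field: images larger than 5 MB
> db.LoadContent.find({ type: "JPEG", sizeInMB: { $gt: 5 } })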

Hence object-oriented programming is one of the use cases where having a polymorphic schema makes sense.

Schema Evolution

When working with databases, one of the most important considerations is schema evolution, i.e. the impact of schema changes on the running application. The design should aim for minimal or no impact on the application: little or no downtime and few or no code changes.

Typically, schema evolution happens by executing a migration script that upgrades the database schema from the old version to the new one. If the database is not yet in production, the script can simply drop and re-create the database. If the database is in a production environment and contains live data, however, the migration script is more complex because the data needs to be preserved, and the script must take this into consideration. In MongoDB the update() operation can be used to add a new field to every document in a collection, but imagine the impact of doing this when the collection holds thousands of documents: it would be slow and could degrade the performance of the underlying application. An alternative is to include the new structure only in new documents being added to the collection, and then gradually migrate the existing documents in the background while the application is still running. This is one of the many use cases where having a polymorphic schema is advantageous.

For example, say we are working with a Tickets collection, where we have documents with ticket details as shown below.

// "Ticket1" document (stored in the "Tickets" collection)

{
  _id: 1,
  Priority: "High",
  type: "Incident",
  text: "Printer not working"
}
...

At some point the application team decides to introduce a "short description" field in the ticket document structure. The best alternative is to introduce this new field only in new ticket documents. Within the application, embed a piece of code that handles retrieving both "old style" documents (without a short description field) and "new style" documents (with a short description field), and then gradually migrate the old style documents to the new style. Once the migration is complete, the code that handles the missing field can be removed if required.
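
A hedged sketch of both halves of this approach; the field name shortDescription and the default value used here are assumptions made for illustration, not part of the original example:

// read path: tolerate old-style documents that lack the field
var t = db.Tickets.findOne({ _id: 1 });
var shortDesc = (t.shortDescription !== undefined) ? t.shortDescription : "(not available)";

// background migration: gradually back-fill old-style documents
db.Tickets.update(
    { shortDescription: { $exists: false } },
    { $set: { shortDescription: "" } },
    { multi: true }
);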

MongoDB Installation in Linux and Windows

"MongoDB is a cross-platform database. Let's get started with setting up our instance of MongoDB."

In this post we will go over the process of installing MongoDB on Windows as well as Linux.

Select Your Version

MongoDB runs on most platforms. All available packages can be downloaded from the MongoDB Downloads page (http://mongodb.org/downloads).

The correct version for your environment depends on your server's operating system and processor architecture. MongoDB supports both 32-bit and 64-bit architectures, but it is recommended to use 64-bit in production environments.

32-bit limitations: MongoDB uses memory-mapped files, which limits 32-bit builds to around 2 GB of data. Hence 64-bit builds are recommended for production environments, also for performance reasons.

The Latest MongoDB production release is 2.4.9 at the time of writing this book.


Fig: MongoDB Downloads

Downloads for MongoDB are available for Windows, Solaris, Mac OS X and Linux. The MongoDB download page is divided into the following three sections:

  1. Production Release (2.4.9) – 1/10/2014
  2. Previous Releases (stable)
  3. Development Releases (unstable)

As can be seen, the Production Release is the most recent stable version available, which at the time of writing this book is 2.4.9. When a new version is released, the prior stable release is moved under the Previous Releases section.

The development releases, as the name suggests, are versions that are still under development and hence are tagged as unstable. These versions can have additional features, but they may not be stable as they are still in the development phase. You can use the development versions to try out new features and provide feedback to 10gen on the features and issues faced.

Installing MongoDB under Linux

In this section we will install MongoDB on a Linux system. For the following demonstration we will be using an Ubuntu Linux distribution. We can install MongoDB either manually or by using repositories. We will walk the reader through both options.

Installing using Repositories

Repositories are basically online directories of software packages. Aptitude is the software used to install packages on Ubuntu. Though MongoDB might be present in the default repositories, there is a possibility of an out-of-date version, hence the first step is to configure aptitude to look at a custom repository.

"apt" and "dpkg" are Ubuntu package management tools which ensure package consistency and authenticity by requiring that distributors sign packages with GPG keys.

Issue the following command to import the MongoDB public GPG key:

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10

Next, create the /etc/apt/sources.list.d/mongodb.list file using the following command:

echo 'deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' | sudo tee /etc/apt/sources.list.d/mongodb.list

Finally issue the following command to reload the repository

sudo apt-get update

By the end of this step aptitude is aware of the manually added repository.

Next we need to install the software. We will be issuing the following command in the shell to install the current stable version:

sudo apt-get install mongodb-10gen

If we wish to install an unstable version from the development releases, then the following command can be issued instead:

sudo apt-get install mongodb-10gen-unstable

The completion of the above steps confirms the successful installation of MongoDB, and that's all there is to it.

Installing Manually

In this section we will see how MongoDB can be installed manually. This knowledge is specifically important in the following cases:

  1. When the Linux distribution doesn’t use Aptitude.
  2. When the version of interest is not available through the repositories.
  3. When it’s required to run multiple MongoDB versions simultaneously.

The first step in manual installation is to decide on the MongoDB version to use and download it from the site. Next, the package needs to be extracted using the following command:

$ tar -xvf mongodb-linux-i686-2.4.9\ \(1\).tgz
mongodb-linux-i686-2.4.9/README
mongodb-linux-i686-2.4.9/THIRD-PARTY-NOTICES
mongodb-linux-i686-2.4.9/GNU-AGPL-3.0
mongodb-linux-i686-2.4.9/bin/mongodump
...
mongodb-linux-i686-2.4.9/bin/mongosniff
mongodb-linux-i686-2.4.9/bin/mongod
mongodb-linux-i686-2.4.9/bin/mongos
mongodb-linux-i686-2.4.9/bin/mongo

As can be seen, this step extracts the package contents into a new directory called mongodb-linux-i686-2.4.9, located under the user's current directory.

The directory contains many subdirectories and files. The main executable files are under the sub directory bin.

This completes the MongoDB installation, and that's all there is to it.

Installing MongoDB on Windows

Installing MongoDB on Windows is a simple matter of downloading the zip file, extracting the contents, creating the data directory, and running the application itself.

The first step is deciding on the build that need to be downloaded.

Begin by downloading the zip file with the binaries of the latest stable version. Next, extract the archive to the root of C:\. The contents are extracted to a new directory called mongodb-win32-x86_64-xxxx-yy-zz, located under C:\.

After extracting the contents, we will run the Command Prompt as administrator (right-click on Command Prompt and select Run as administrator).

Issue the following commands in the command prompt

C:\Users\Administrator> cd \

C:\> move C:\mongodb-win32-* C:\mongodb

After successful completion of the above steps, we have a directory in C:\ with all the relevant applications in the bin folder, which in this case is C:\mongodb\bin\. That's all there is to it.

Start Running MongoDB

We have seen how to install MongoDB after choosing the appropriate version for our platform. It’s finally time to now look at how we can start running and using MongoDB.


MongoDB requires a data folder to store its files, which by default is C:\data\db on Windows and /data/db on Linux systems.

These data directories are not created by MongoDB, hence before starting MongoDB the data directory needs to be created manually, and we need to ensure that proper permissions are set, i.e. that MongoDB has read, write and directory-creation permissions.
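
A minimal sketch of creating the default data directory on each platform (adjust the path, ownership and permissions to the account that will run mongod):

On Linux:
sudo mkdir -p /data/db

On Windows (from an administrator command prompt):
md \data\db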

If MongoDB is started before the folder is created, it will throw an error message and fail to run.

Start the Service

Once the directories are created and the permissions are in place, execute the mongod application (located under the bin directory) to start the MongoDB core database service.

Continuing from our installation above, the service can be started by opening the command prompt in Windows (run as administrator) and executing the following:

C:\> C:\mongodb\bin\mongod.exe

On Linux, the mongod process is started from the shell.

This will start the MongoDB database on the localhost interface, listening for connections from the mongo shell on port 27017.

As mentioned above, the data folder (by default C:\data\db) needs to exist before starting the database; an alternative path can also be provided when starting the database service using the --dbpath parameter.

C:\> C:\mongodb\bin\mongod.exe --dbpath C:\NewDBPath\DBContents

Verifying the Installation

The relevant executables will be present under the bin subdirectory. Hence the following can be checked under the bin directory in order to verify the success of the installation:

mongod – the core database server

mongo – the database shell

mongos – the auto-sharding process

mongoexport – export utility

mongoimport – import utility

Apart from the above listed we have several other applications available under the bin folder.

The mongo application launches the mongo shell which enables the users to access the database contents and fire selective queries or executes aggregation against the data in MongoDB.

The mongod application as we have seen above is used to start the database service or daemon as it’s called.

Multiple flags can be set when launching these applications. For example, as mentioned above, --dbpath can be used to specify an alternative path where the database files will be stored. To get the list of all available options, include the --help flag when launching the service.

Using the MongoDB Shell

The mongo shell is part of the standard MongoDB distribution. It provides a full JavaScript environment, with complete access to the JavaScript language and all standard functions, as well as a full database interface for MongoDB.
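
As a small illustration of that JavaScript environment (not tied to any particular database), ordinary JavaScript statements can be typed directly at the prompt:

> var total = 0
> for (var i = 1; i <= 5; i++) { total += i }
> print("sum = " + total)
sum = 15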

Once the database services have been started, we can fire the mongo shell and start using MongoDB. This can be done using Shell in LINUX or command prompt in windows (run as administrator).

We need to ensure that we refer to the exact location of the executables; in our case the executables are available under the C:\mongodb\bin\ folder in the Windows environment.

Open the command prompt (run as administrator), type mongo.exe at the command prompt and press the Return key. This will start the mongo shell.

C:\> C:\mongodb\bin\mongo.exe

MongoDB shell version: 2.4.9

connecting to: test


If no parameters are specified when starting the shell, it will by default connect to the test database on the localhost instance.

The database is created automatically once we connect to it and insert data; this is in line with MongoDB's behavior where, if an attempt is made to use a database that doesn't exist, it will automatically be created.

We will be covering more on working with the Mongo Shell in the next chapter.

Securing the Deployment

We have seen how we can install and start using MongoDB using default configurations. Next we need to ensure that the data which is stored within the database is secure in all aspects. Hence in this section we will look at how we can secure our data. We will change the configuration of the default installation to ensure that our database is more secure.

Using Authentication and Authorization

Authentication means users will be able to access the database only if they log in using credentials that have access to that database. This disables anonymous access to the database.

After the user is authenticated, authorization can be used to ensure that the user has only the access required to accomplish the tasks at hand. In MongoDB, authentication and authorization are supported at a per-database level.

Users exist in the context of a single logical database and are stored in the system.users collection within the database.

system.users – This collection stores information for authentication and authorization on that database. It stores the user’s credentials for authentication and users privileges information for authorization.

MongoDB uses a role-based approach for authorization; examples of roles are read, readWrite, readAnyDatabase, etc.

A privilege document within the system.users collection stores the roles and credentials of each user who has access to the database.

Hence a user can have multiple roles and may have different roles on different databases.
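
As a hedged sketch, a privilege document in system.users might look roughly like this (the user name, hash placeholder and roles are illustrative):

{
    "user" : "Alice",
    "pwd" : "<password hash>",
    "roles" : [ "read", "dbAdmin" ]
}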

The available roles are

read – This provides a read only access of all the collections for the specified database.

readWrite – This provides a read and write access to any collection within the specified database.

dbAdmin – This enables the users to perform administrative actions within the specified database such as index management using ensureIndex, dropIndexes, reIndex, indexStats, renaming collections, create collections etc.

userAdmin – This enables the user to perform read/write operations on the system.users collection of the specified database; it also enables them to modify permissions for existing users and create new users. This is effectively the superuser role for the specified database.

clusterAdmin – This role enables the user to grant access to administration operations which affects or present information about the whole system. clusterAdmin is applicable only on the admin database, and does not confer any access to the local or config databases.

readAnyDatabase – This role enables user to read from any database in the MongoDB environment.

readWriteAnyDatabase – This role is similar to readWrite except it is for all databases.

userAdminAnyDatabase – This role is similar to the userAdmin role except that it applies to all databases.

dbAdminAnyDatabase – This role is the same as dbAdmin, except that it applies to all databases.

Enabling Authentication

Authentication is disabled by default; use the auth setting to enable it, i.e. start the server with mongod --auth. Before enabling authentication we need to have at least one admin user.

As noted above, an admin user is a user who is responsible for creating and managing other users; this user can create or modify any other user and can assign them any privileges.

It is recommended that in production deployments such users are created solely for managing users and should not be used for any other roles. This is the first user that needs to be created for a MongoDB deployment and then this user can create other users in the system.

The admin user can be created either way, that is, before enabling authentication or after enabling authentication.

In our example we will first create the admin user and then enable the auth setting. The steps below were executed on the Windows platform.

Start the mongod with default settings

C:\> C:\mongodb\bin\mongod.exe

C:\mongodb\bin\mongod.exe --help for help and startup options
Thu Feb 06 03:23:29.473 [initandlisten] MongoDB starting : pid=2872 port=27017 ......
Thu Feb 06 03:23:30.333 [initandlisten] waiting for connections on port 27017
Thu Feb 06 03:23:30.333 [websvr] admin web console waiting for connections on port 28017

Create the admin user

Run another instance of command prompt by running it as an administrator and execute the mongo application.

C:\> C:\mongodb\bin\mongo.exe

MongoDB shell version: 2.4.9

connecting to: test


Switch to the admin database

The admin db is a privileged database which the user needs access to in order to execute certain administrative commands, such as creating an admin user in this example.

> db = db.getSiblingDB('admin')


Add the user with either the userAdmin role or the userAdminAnyDatabase role:

> db.addUser({ user: "AdminUser", pwd: "password", roles: ["userAdminAnyDatabase"] })
{
    "user" : "AdminUser",
    "pwd" : "2c14878340ab813b4e83c47b88918cdc",
    "roles" : [
        "userAdminAnyDatabase"
    ],
    "_id" : ObjectId("52f37218424122fb61569a89")
}
To authenticate as this user, you must authenticate against the admin database. Restart mongod with the auth setting:

C:\> C:\mongodb\bin\mongod.exe --auth

Thu Feb 06 18:41:36.289 [initandlisten] MongoDB starting : pid=3384 port=27017 dbpath=\data\db\ 64-bit host=ANOC9
Thu Feb 06 18:41:36.290 [initandlisten] db version v2.4.9

Thu Feb 06 18:41:36.314 [initandlisten] waiting for connections on port 27017
Thu Feb 06 18:41:36.314 [websvr] admin web console waiting for connections on port 28017

Start the mongo console and authenticate against the admin database using the AdminUser user created above.

C:\> C:\mongodb\bin\mongo.exe

MongoDB shell version: 2.4.9

connecting to: test

> use admin

switched to db admin

> db.auth("AdminUser", "password")



Create a User and enable Authorization

In this section we will create a user and assign a role to the newly created user. We have already authenticated using the admin user as shown below

C:\> C:\mongodb\bin\mongo.exe

MongoDB shell version: 2.4.9

connecting to: test

> use admin

switched to db admin

> db.auth("AdminUser", "password")



Switch to the products database, create the user Alice, and assign read access on the products database.

> use products

switched to db products

> db.addUser({ user: "Alice",
... pwd: "Moon1234",
... roles: ["read"]
... }
... )
{
    "user" : "Alice",
    "pwd" : "c5f0f00ee1b145fbd9ccc3a6ad60f01a",
    "roles" : [
        "read"
    ],
    "_id" : ObjectId("52f38d7dbd14c630732401a0")
}



We will next validate that the user has read only access on the database

> db
products
> show users
{
    "_id" : ObjectId("52f38d7dbd14c630732401a0"),
    "user" : "Alice",
    "pwd" : "c5f0f00ee1b145fbd9ccc3a6ad60f01a",
    "roles" : [
        "read"
    ]
}




Next we can connect from a new mongo console and log in as Alice to the products database to issue read-only commands.

C:\> C:\mongodb\bin\mongo.exe -u Alice -p Moon1234 products

MongoDB shell version: 2.4.9

connecting to: products

Controlling access over network

In this section we will look at the configuration options used for restricting network exposure. The commands below were executed on the Windows platform.

C:\> C:\mongodb\bin\mongod.exe --bind_ip 127.0.0.1 --port 27017 --rest

Thu Feb 06 19:06:05.995 [initandlisten] MongoDB starting : pid=3384 port=27017 dbpath=\data\db\ 64-bit host=ANOC9
Thu Feb 06 19:06:05.996 [initandlisten] db version v2.4.9

Thu Feb 06 19:06:06.018 [initandlisten] waiting for connections on port 27017
Thu Feb 06 19:06:06.018 [websvr] admin web console waiting for connections on port 28017

We have started the server with bind_ip set to 127.0.0.1, the localhost interface. The bind_ip setting limits the network interfaces on which the program listens for incoming connections; a comma-separated list of IP addresses can be specified. In our case we have restricted mongod to listen only on the localhost interface.

When a mongod instance is started, by default it waits for incoming connections on port 27017. This can be changed using --port.

Simply changing the port does not meaningfully reduce risk or limit exposure. To properly secure the environment we need to allow only trusted clients to connect to the port, using firewall settings.

Changing this port also indirectly changes the port for the HTTP status interface, which by default is 28017. This interface is always available on the port numbered 1,000 greater than the connection port; hence if mongod listens on port X, the status page is available on port X + 1000.

This web page exposes diagnostic and monitoring information, including a variety of operational data, logs, and status reports regarding the database instance, i.e. management-level statistics that can be used for administration purposes. The page is read-only by default; to make it fully interactive we use the REST setting. That configuration makes the page fully interactive, which helps administrators troubleshoot performance issues when they are encountered. Only trusted client access should be allowed on this port, using firewalls.

It is recommended to disable the HTTP Status page as well as the REST configuration in the production environment.

Use Firewalls

Firewalls are used to control access within a network. They can be used to allow access from a specific IP address to specific ports, or to block access from untrusted hosts. They can thus be used to create a trusted environment for our mongod instance, where we specify which IP addresses or hosts can connect to which ports or interfaces of mongod.

On the windows platform we have used netsh to allow all incoming traffic to port 27017, so that any application server can connect to our mongod instance.

C:\> netsh advfirewall firewall add rule name="Open mongod port 27017" dir=in action=allow protocol=TCP localport=27017
Ok.

C:\>

The rule allows all incoming traffic on port 27017, which allows the application server to connect to the mongod instance.

Encrypt Data

We have seen that MongoDB stores its data files in the data directory, which defaults to C:\data\db on Windows and /data/db on Linux. The files stored in this directory are unencrypted, as MongoDB doesn't provide a method to automatically encrypt them. Any attacker with access to the file system can view the data stored in the files. Hence it's the application's responsibility to ensure that sensitive information is encrypted before it's written to the database.

Additionally operating system level mechanisms such as file system level encryption and permissions should be implemented in order to prevent unauthorized access to the files.

In order to encrypt and secure sensitive data, MongoDB has a partnership with Gazzang. Gazzang provides solutions for encrypting MongoDB data and making it more secure. More information on Gazzang is available at http://www.gazzang.com.

Encrypt communication

It is often required to ensure that the communication between mongod and the client (the mongo shell, for instance) is encrypted.

Hence in this setup we will see how we can add one more level of security to our above installation by configuring SSL, so that the communication between the mongod and mongo shell (client) happens using SSL certificate and key.

It is recommended to use SSL for communication between the server and the client.

SSL is not supported in the default distribution of MongoDB. In order to use SSL you need to either build MongoDB locally with the "--ssl" option or use MongoDB Enterprise.

The commands below were executed on an Ubuntu system and assume that the installed MongoDB build includes SSL support, with SSL support at the client driver level too.

The first step is to generate the . pem file which will contain the public key certificate and the private key. The following command generates a self-signed certificate and private key.

cd /etc/ssl/

sudo openssl req -new -x509 -days 365 -nodes -out mongodb-cert.crt -keyout mongodb-cert.key

Next the certificate and private key is concatenated to a .pem file

cat mongodb-cert.key mongodb-cert.crt > mongodb.pem

We will now include the following run-time options while starting up mongod:

mongod --sslOnNormalPorts --sslPEMKeyFile /etc/ssl/mongodb.pem

We will next see how we can connect to the mongod running with SSL using mongo shell.

Start the mongo shell using the --ssl and --sslPEMKeyFile (this specifies the signed certificate key file) options:

mongo --ssl --sslPEMKeyFile /etc/ssl/client.pem

If we need to connect to a mongod instance that requires only SSL encryption mode, then we need to start the mongo shell with --ssl only, as shown below:

mongo --ssl

Basic Querying Using Mongo Shell


In this post we will briefly understand about the common database operations i.e. CRUD in MongoDB where CRUD stands for Create, Read, Update and Delete.

Instead of using a standardized query language such as SQL, MongoDB uses its own JSON-like query language to retrieve information from the stored data.

The first step is always to start the database server. Open the command prompt (run as administrator) and issue the command cd \. Next, run the command C:\mongodb\bin\mongod.exe (if the installation is in some other folder then the path will change accordingly; for the examples in this chapter the installation is in the C:\mongodb folder). This will start the database server.

C:\> C:\mongodb\bin\mongod.exe

  1. Connect to the mongo shell.
  2. Switch to our database, which is "MyDB" in this case.
  3. Check the collections that exist in the MyDB database using show collections.
  4. Check the count of the collection that was imported using the import tool.
  5. Finally, execute the find() command to check the data in the new collection.

Now we will start with the usage of the mongo shell. To start the mongo shell, run the command prompt as administrator, issue the command C:\mongodb\bin\mongo.exe (the path will vary based on the installation folder; in this example the folder is C:\mongodb\) and press Enter. By default this connects to the localhost database server, which is listening on port 27017.

Use the --port and --host options to connect to a server on a different port or interface.

C:\> C:\mongodb\bin\mongo.exe

MongoDB shell version: 2.4.9

connecting to: test

As we can see above, by default the database test is used as the context.

At any point of time executing db command will show the current database the shell is connected to.

> db


In order to display all the database names, the user can run the show dbs command; executing it lists all the databases on the connected server.

> show dbs

At any point, help can be accessed using the help command.

> help

db.help()                 help on db methods
db.mycoll.help()          help on collection methods
sh.help()                 sharding helpers
rs.help()                 replica set helpers
help admin                administrative help
help connect              connecting to a db help
help keys                 key shortcuts
help misc                 misc things to know
help mr                   mapreduce
show dbs                  show database names
show collections          show collections in current database
show users                show users in current database
... ... ... ...
exit                      quit the mongo shell

Before we start our exploration, let's first briefly look at the MongoDB terminology and concepts corresponding to SQL terminology and concepts. This is summarized in the table below.

SQL and MongoDB terminology table

SQL                                              MongoDB
Database                                         Database
Table                                            Collection
Row                                              Document
Column                                           Field
Index                                            Index
Table joins                                      Embedded documents and linking
Primary key (a column or group of columns)       Primary key (automatically set to the _id field)

Let's start our exploration of querying in MongoDB. Switch to the mydbpoc database.

> use mydbpoc
switched to db mydbpoc

This switches the context from test to MYDBPOC. The same can be confirmed using the db command.

> db


Though the context has switched to mydbpoc, the database name will not appear if the show dbs command is issued, because MongoDB doesn't create a database until data is inserted into it. This is in line with MongoDB's dynamic approach to data, facilitating dynamic namespace allocation and a simplified, accelerated development process.

  • If we issue show dbs command at this point it will not list MYDBPOC database in the list of databases, as the database is not created until data is inserted into the database.

The following example assumes a polymorphic collection named users which contains the documents of the following prototypes:


{
  _id: ObjectId(),
  FName: "First Name",
  LName: "Last Name",
  Age: 30,
  Gender: "M",
  Country: "Country"
}

{
  _id: ObjectId(),
  Name: "Full Name",
  Gender: "M",
  Country: "Country"
}

{
  _id: ObjectId(),
  Name: "Full Name",
  Age: 30
}

Create and Insert Statements in MongoDB


Create and Insert

We will now look at how databases and collections are created. The documents will be specified in JSON.

First by issuing db command we will confirm that the context is mydbpoc database.




We will first see how we can create documents

The first document complies with the first prototype whereas the second document complies with the second prototype. We will create two variables, user1 and user2.

> user1 = {FName: "Test", LName: "User", Age: 30, Gender: "M", Country: "US"}
{
    "FName" : "Test",
    "LName" : "User",
    "Age" : 30,
    "Gender" : "M",
    "Country" : "US"
}
> user2 = {Name: "Test User", Age: 45, Gender: "F", Country: "US"}
{ "Name" : "Test User", "Age" : 45, "Gender" : "F", "Country" : "US" }


We will next add both these documents i.e. user1 and user2 to the users collection using the following sequence of operations.

> db.users.insert(user1)
> db.users.insert(user2)


The above operations will not only insert the two documents into the users collection but will also create the collection as well as the database. This can be verified using the show collections and show dbs commands: show collections displays the list of collections in the current database, and show dbs, as mentioned above, displays the list of databases.

> show dbs

admin 0.203125GB

local 0.078125GB

mydb 0.203125GB

mydb1 (empty)

mydbnew 0.203125GB

mydbpoc 0.203125GB

products 0.203125GB

> show collections
system.indexes
users


As seen in the output, along with the users collection a system.indexes collection is also displayed. This system.indexes collection is created by default when the database is created. It manages the information about all the indexes of all collections within the database.

Executing the command db.users.find() will display the documents in the users collection.

> db.users.find()
{ "_id" : ObjectId("52f48cf474f8fdcfcae84f79"), "FName" : "Test", "LName" : "User", "Age" : 30, "Gender" : "M", "Country" : "US" }
{ "_id" : ObjectId("52f48cfb74f8fdcfcae84f7a"), "Name" : "Test User", "Age" : 45, "Gender" : "F", "Country" : "US" }


We can see the two documents we created being displayed. In addition to the fields we added to the document there’s an additional _id field which is being generated for all the documents.

  • All documents must have a unique _id field. If not explicitly specified by the user the same will be auto assigned as a unique Object ID by MongoDB as happened in our example above.
  • We didn’t explicitly insert an _id field but when we use find() command to display the documents we can see an _id field associated with each document.

The reason is that, by default, an index is created on the _id field, which can be validated by issuing a find command on the system.indexes collection.

> db.system.indexes.find()
{ "v" : 1, "key" : { "_id" : 1 }, "ns" : "mydbpoc.users", "name" : "_id_" }


New indexes can be added to or removed from a collection using the ensureIndex() and dropIndex() commands, which we will cover later in this chapter. By default an index is created on the _id field of every collection; this default index cannot be dropped.

Explicit Create Collection:

In the above example the first insert operation implicitly created the collection. However the user can also explicitly create a collection before executing the insert statement.

db.createCollection("users")

Insert documents using Loop:

In the above example we created two document variables and inserted them to a collection one by one. Documents can also be added to the collection using for Loop. In the next example we will insert users using for loop.

> for (var i = 1; i <= 20; i++) db.users.insert({"Name" : "Test User" + i, "Age" : 10 + i, "Gender" : "F", "Country" : "India"})



In order to verify that the save is successful we will run the find command on the collection

> db.users.find()
{ "_id" : ObjectId("52f48cf474f8fdcfcae84f79"), "FName" : "Test", "LName" : "User", "Age" : 30, "Gender" : "M", "Country" : "US" }
{ "_id" : ObjectId("52f48cfb74f8fdcfcae84f7a"), "Name" : "Test User", "Age" : 45, "Gender" : "F", "Country" : "US" }

{ "_id" : ObjectId("52f48eeb74f8fdcfcae84f8c"), "Name" : "Test User18", "Age" : 28, "Gender" : "F", "Country" : "India" }
Type "it" for more


The users appear in the collection. Before we move any further, let's understand what the statement Type "it" for more means.

The find command returns a cursor to the result set. Instead of displaying all documents (which could be thousands or millions of results) in one go on the screen, the cursor displays the first 20 documents and waits for a request to iterate (it) to display the next 20, and so on until the entire result set is displayed.

The resulting cursor can also be assigned to a variable and then iterated over programmatically using a while loop. The cursor object can also be manipulated as an array.

In our case if we type “it” and press enter then the following screen will appear.

> it

{ "_id" : ObjectId("52f48eeb74f8fdcfcae84f8d"), "Name" : "Test User19", "Age" : 29, "Gender" : "F", "Country" : "India" }
{ "_id" : ObjectId("52f48eeb74f8fdcfcae84f8e"), "Name" : "Test User20", "Age" : 30, "Gender" : "F", "Country" : "India" }


Since only two documents were remaining it has displayed the two documents.

Insert with explicitly specifying _id:

In the previous examples of Insert the _id field was not specified hence was implicitly added. In the following example we will see how we can explicitly specify _id field while inserting the documents within a collection. While explicitly specifying the _id field we have to keep in mind the uniqueness of the field otherwise the insert will fail.

The following command explicitly specify the _id field

> db.users.insert({"_id" : 10, "Name" : "explicit id"})

The insert operation creates the following document in the users collection:

{ “_id” : 10, “Name” : “explicit id” }

This can be confirmed by issuing the following command:

> db.users.find ()

How to write Update Query in MongoDB

When working on a real-world application we often come across schema evolution, where we might end up adding or removing fields from the documents. We will next see how we can perform these alterations in a MongoDB database.

We have seen that there's no structure enforcement at the collection level, hence there's no structure to alter at the collection level. However, the update() operation can be used at the document level to update an existing document or a set of documents within a collection.

The update() method by default updates a single document, but by using the multi option it can update all documents that match the selection criteria.

Let's begin by updating the values of existing fields. The $set operator will be used for updating the records. The following command updates the country to UK for all female users.

> db.users.update({"Gender" : "F"}, {$set : {"Country" : "UK"}})

To check whether the update has happened we will issue a find command to check all the female users.

> db.users.find({"Gender" : "F"})
{ "_id" : ObjectId("52f48cfb74f8fdcfcae84f7a"), "Name" : "Test User", "Age" : 45, "Gender" : "F", "Country" : "UK" }
{ "_id" : ObjectId("52f48eeb74f8fdcfcae84f7b"), "Name" : "Test User1", "Age" : 11, "Gender" : "F", "Country" : "India" }
{ "_id" : ObjectId("52f48eeb74f8fdcfcae84f7c"), "Name" : "Test User2", "Age" : 12, "Gender" : "F", "Country" : "India" }

Type "it" for more


If we check the output we see that only the first matching document was updated, which is the default behavior of update() since no multi option was specified.

Now let’s change the update command and include the multi option.

> db.users.update({"Gender" : "F"}, {$set : {"Country" : "UK"}}, {multi : true})


Now again we will issue the find command to check whether the Country is updated for all the female employees or not. Issuing the find command will return the following output.

> db.users.find({"Gender" : "F"})
{ "_id" : ObjectId("52f48cfb74f8fdcfcae84f7a"), "Name" : "Test User", "Age" : 45, "Gender" : "F", "Country" : "UK" }

Type "it" for more


As seen, the country is now updated to UK for the matching documents. Hence if we need to update all documents matching the criteria, we need to set the multi option to true; otherwise only the first matching document is updated.

We will next look at how to add new fields to the documents. In order to add fields we will again use the update() command with the $set operator and the multi option. If $set is used with a field name that doesn't exist, the field is added to the documents. The following command adds the field Company to all the documents.

> db.users.update({}, {$set : {"Company" : "TestComp"}}, {multi : true})


Issuing a find command against the users collection, we find the new field added to all documents.

> db.users.find()
{ "Age" : 30, "Company" : "TestComp", "Country" : "US", "FName" : "Test", "Gender" : "M", "LName" : "User", "_id" : ObjectId("52f48cf474f8fdcfcae84f79") }
{ "Age" : 45, "Company" : "TestComp", "Country" : "UK", "Gender" : "F", "Name" : "Test User", "_id" : ObjectId("52f48cfb74f8fdcfcae84f7a") }
{ "Age" : 11, "Company" : "TestComp", "Country" : "UK", "Gender" : "F", ..........

Type "it" for more


Hence if we execute the update() command with fields that already exist in the document, their values are updated; if a field is not present in the document, it is added.

We will next see how we can use the same update() command with the $unset operator to remove fields from the documents.

The following command will remove the field company from all the documents.

> db.users.update({}, {$unset : {"Company" : ""}}, {multi : true})


The same can be checked by issuing find() command against the Users collection.

> db.users.find()
{ "Age" : 30, "Country" : "US", "FName" : "Test", "Gender" : "M", "LName" : "User", "_id" : ObjectId("52f48cf474f8fdcfcae84f79") }


Type “it” for more

This concludes how to write update query in MongoDB.

How to Write Delete and Read Query In MongoDB


Having covered how to insert documents into a collection and how to change the document structure by adding or removing fields, we will now see how to delete data from the database. To delete documents in a collection we use the remove() method. If we specify a selection criteria, only the documents meeting the criteria are deleted; if no criteria is specified, all documents are deleted.

The following command will delete the documents where the Gender = ‘M’.

> db.users.remove({"Gender" : "M"})


The same can be verified by issuing the find() command on Users

> db.users.find({"Gender" : "M"})


No documents are returned.

Next command will delete all documents.

> db.users.remove()
> db.users.find()


As we can see no documents are returned.

Finally if we want to drop the collection the following command will drop the collection.

> db.users.drop()



In order to validate whether the collection is dropped or not we will issue the show collections command.

> show collections



As we can see above the collection name is not displayed hence confirming that the collection is removed from the database.

Having covered the basic Create, Update and Delete operation, next in this section we will look at how we perform Read operation.


In this part of the chapter we will look at various examples illustrating the querying functionality available as part of MongoDB which enables the users to read the stored data from within the database.

In order to start with basic querying, we will first re-create the users collection and insert data using the following insert commands.

> user1 = {FName: "Test", LName: "User", Age: 30, Gender: "M", Country: "US"}
{
    "FName" : "Test",
    "LName" : "User",
    "Age" : 30,
    "Gender" : "M",
    "Country" : "US"
}
> user2 = {Name: "Test User", Age: 45, Gender: "F", Country: "US"}
{ "Name" : "Test User", "Age" : 45, "Gender" : "F", "Country" : "US" }

> db.users.insert(user1)
> db.users.insert(user2)
> for (var i = 1; i <= 20; i++) db.users.insert({"Name" : "Test User" + i, "Age" : 10 + i, "Gender" : "F", "Country" : "India"})


We will next start with basic querying.

The find() command is used for retrieving data from the database; we have been using its basic form throughout this chapter. Firing a find() command without arguments returns all the documents within the collection.

> db.users.find()
{ "_id" : ObjectId("52f4a823958073ea07e15070"), "FName" : "Test", "LName" : "User", "Age" : 30, "Gender" : "M", "Country" : "US" }
{ "_id" : ObjectId("52f4a826958073ea07e15071"), "Name" : "Test User", "Age" : 45, "Gender" : "F", "Country" : "US" }

{ "_id" : ObjectId("52f4a83f958073ea07e15083"), "Name" : "Test User18", "Age" : 28, "Gender" : "F", "Country" : "India" }
Type "it" for more


What are Query Documents in MongoDB?


MongoDB provides a rich query system to filter the documents in a collection.

In order to achieve the same “query documents” can be passed as parameter to the find method.

A query document is specified within opening "{" and closing "}" curly braces. The query document is matched against all the documents in the collection before the result set is returned.

Using the find() command without any query document, or with an empty query document, i.e. find({}), returns all the documents within the collection.

A query document can contain selectors and projectors. A selector is like a WHERE condition in SQL, a filter used to narrow down the results. A projector is like the SELECT list, used to specify which data fields to display.


We will now see how to use the selector.

The following command will return all the female users.

> db.users.find({"Gender" : "F"})
{ "_id" : ObjectId("52f4a826958073ea07e15071"), "Name" : "Test User", "Age" : 45, "Gender" : "F", "Country" : "US" }

{ "_id" : ObjectId("52f4a83f958073ea07e15084"), "Name" : "Test User19", "Age" : 29, "Gender" : "F", "Country" : "India" }
Type "it" for more


Let's step it up a notch. MongoDB also supports operators that let us merge different conditions together in order to refine our search. Let's refine the above query to look for female users from the country India. The command below returns the same.

> db.users.find({"Gender" : "F", $or : [{"Country" : "India"}]})
{ "_id" : ObjectId("52f4a83f958073ea07e15072"), "Name" : "Test User1", "Age" : 11, "Gender" : "F", "Country" : "India" }

{ "_id" : ObjectId("52f4a83f958073ea07e15085"), "Name" : "Test User20", "Age" : 30, "Gender" : "F", "Country" : "India" }


Next, if we want to find all female users who belong to either India or the US, we execute the following command:

> db.users.find({"Gender" : "F", $or : [{"Country" : "India"}, {"Country" : "US"}]})
{ "_id" : ObjectId("52f4a826958073ea07e15071"), "Name" : "Test User", "Age" : 45, "Gender" : "F", "Country" : "US" }

{ "_id" : ObjectId("52f4a83f958073ea07e15084"), "Name" : "Test User19", "Age" : 29, "Gender" : "F", "Country" : "India" }
Type "it" for more


Hence executing a find() command returns, as output, the list of users satisfying the conditions specified as the selector in the query document. If the requirement is just to know the count of matching users, or to perform some other aggregation on the result set rather than printing the list, then aggregate functions need to be used. We will look at the aggregate functions in more detail in coming examples.

For now, let's just look at how to get the count of records. Say, for example, that instead of displaying the documents we want to find the count of female users who stay in either India or the US; in that case we execute the following command.

> db.users.find({"Gender" : "F", $or : [{"Country" : "India"}, {"Country" : "US"}]}).count()



If we want to find out count of users irrespective of any selectors then we need to execute the following command.

> db.users.find().count()




We have seen how to use a selector to filter documents within the collection. In the above examples the find() command returns all fields of the matching documents. Let's now add a projector to the query document: in addition to the selector we also mention the specific fields to be displayed. For example, suppose we want to display only the name and age of all female users. In this case a projector is used along with the selector, so that instead of the complete documents only a few fields are displayed. We execute the following command to return the desired result set.

> db.users.find({"Gender" : "F"}, {"Name" : 1, "Age" : 1})
{ "_id" : ObjectId("52f4a826958073ea07e15071"), "Name" : "Test User", "Age" : 45 }

Type "it" for more


sort ()

In MongoDB the sort order is specified as follows: 1 for ascending and -1 for descending. If in the above example we want to sort the records by ascending order of age, we execute the following command.

> db.users.find({"Gender" : "F"}, {"Name" : 1, "Age" : 1}).sort({"Age" : 1})
{ "_id" : ObjectId("52f4a83f958073ea07e15072"), "Name" : "Test User1", "Age" : 11 }
{ "_id" : ObjectId("52f4a83f958073ea07e15073"), "Name" : "Test User2", "Age" : 12 }
{ "_id" : ObjectId("52f4a83f958073ea07e15074"), "Name" : "Test User3", "Age" : 13 }

{ "_id" : ObjectId("52f4a83f958073ea07e15085"), "Name" : "Test User20", "Age" : 30 }
Type "it" for more

If we want to display the records in descending order of Name and ascending order of Age, we execute the following command:

> db.users.find({"Gender" : "F"}, {"Name" : 1, "Age" : 1}).sort({"Name" : -1, "Age" : 1})
{ "_id" : ObjectId("52f4a83f958073ea07e1507a"), "Name" : "Test User9", "Age" : 19 }
{ "_id" : ObjectId("52f4a83f958073ea07e15079"), "Name" : "Test User8", "Age" : 18 }
{ "_id" : ObjectId("52f4a83f958073ea07e15078"), "Name" : "Test User7", "Age" : 17 }

{ "_id" : ObjectId("52f4a83f958073ea07e15072"), "Name" : "Test User1", "Age" : 11 }
Type "it" for more



We will now look at how to limit the number of records returned. For example, in huge collections with thousands of documents, if we want to return only 5 matching documents, the limit command is used, which enables us to do exactly that.

Let’s say in our previous query of Female users who belongs to either country India or US, we want to limit the result set and return only two users. Then the following command needs to be executed

> db.users.find({"Gender" : "F", $or : [{"Country" : "India"}, {"Country" : "US"}]}).limit(2)
{ "_id" : ObjectId("52f4a826958073ea07e15071"), "Name" : "Test User", "Age" : 45, "Gender" : "F", "Country" : "US" }
{ "_id" : ObjectId("52f4a83f958073ea07e15072"), "Name" : "Test User1", "Age" : 11, "Gender" : "F", "Country" : "India" }

skip ()

If the requirement is to skip the first two records and return the third and fourth users, the skip command is used. The following command needs to be executed.

> db.users.find({"Gender" : "F", $or : [{"Country" : "India"}, {"Country" : "US"}]}).limit(2).skip(2)
{ "_id" : ObjectId("52f4a83f958073ea07e15073"), "Name" : "Test User2", "Age" : 12, "Gender" : "F", "Country" : "India" }
{ "_id" : ObjectId("52f4a83f958073ea07e15074"), "Name" : "Test User3", "Age" : 13, "Gender" : "F", "Country" : "India" }



findOne()

Similar to find(), there is a findOne() command which returns a single document from the collection. findOne() takes the same parameters as find(), but rather than returning a cursor it returns a single document. Say, for example, we want to return one female user, showing only the name and age; this can be achieved using the following command.

> db.users.findOne({"Gender" : "F"}, {"Name" : 1, "Age" : 1})
{
    "_id" : ObjectId("52f4a826958073ea07e15071"),
    "Name" : "Test User",
    "Age" : 45
}



Similarly, if we want to return the first record irrespective of any selector, we can use findOne() without arguments, and it will return the first document in the collection.

> db.users.findOne()
{
    "_id" : ObjectId("52f4a823958073ea07e15070"),
    "FName" : "Test",
    "LName" : "User",
    "Age" : 30,
    "Gender" : "M",
    "Country" : "US"
}

Using Cursor:

When the find() method is used, MongoDB returns the results of the query as a cursor object. The mongo shell then iterates over the cursor to display the results. The maximum number of documents iterated and displayed on the screen at a time is 20; when the user executes the it command, the shell iterates over the next batch of 20 records. Thus all records are not displayed in one go.
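
As a side note, the batch of 20 is a shell display setting and can be changed from the shell itself; the value 50 below is only an example:

> DBQuery.shellBatchSize = 50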

MongoDB enables users to work with the cursor object returned by the find method. In the next example we will see how the user can store the cursor object in a variable and manipulate it using a while loop.

Say we want to return all the users in the country US. We declare a variable, assign the resulting cursor object to it, and then print the full result set by iterating over the variable with a while loop.

The code snippet is as below:

> var c = db.users.find({"Country" : "US"})
> while (c.hasNext()) printjson(c.next())


{
    "_id" : ObjectId("52f4a823958073ea07e15070"),
    "FName" : "Test",
    "LName" : "User",
    "Age" : 30,
    "Gender" : "M",
    "Country" : "US"
}
{
    "_id" : ObjectId("52f4a826958073ea07e15071"),
    "Name" : "Test User",
    "Age" : 45,
    "Gender" : "F",
    "Country" : "US"
}



The hasNext() function returns true if the cursor has more documents. The next() method returns the next document. The printjson() method renders the document in a JSON-like format.
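
As an aside, the same iteration can be written more compactly with the cursor's forEach() method, which the shell also provides:

> db.users.find({"Country" : "US"}).forEach(printjson)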

The variable to which the cursor object is assigned can also be manipulated as an array. For example, instead of looping through the variable, if we want to display the document at array index 1 we can run the following command:

> var c = db.users.find({"Country" : "US"})
> printjson(c[1])
{
    "_id" : ObjectId("52f4a826958073ea07e15071"),
    "Name" : "Test User",
    "Age" : 45,
    "Gender" : "F",
    "Country" : "US"
}


We have seen how we can do basic querying, sorting, limiting etc. We also saw how we can manipulate the result set using a while loop or as an array. In the next section we will look at indexes.

  • Important note: In a query, the value portion needs to be a constant determined before the query is made; it cannot be based on other attributes of the same document. For example, given a collection of "Persons", it is not possible to express a query that returns persons whose weight is larger than 10 times their height.

explain ()

The explain() function can be used to see what steps MongoDB runs while executing a query. The following command displays the steps executed while filtering on the Name field.

> db.users.find({"Name" : "Test User"}).explain()
{
    "cursor" : "BasicCursor",
    "isMultiKey" : false,
    "n" : 1,
    "nscannedObjects" : 22,
    "nscanned" : 22,
    "nscannedObjectsAllPlans" : 22,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "millis" : 0,
    "server" : "ANOC9:27017"
}

Using Indexes in Mongo Shell


Indexes provide high-performance read operations for frequently used queries. By default, whenever a collection is created and documents are added to it, an index is created on the _id field of the documents. In this section we will look at how different types of indexes can be created. Let's begin by inserting 1 million documents using a for loop in a new collection, say testindx.

> for (i = 0; i < 1000000; i++) { db.testindx.insert({"Name" : "user" + i, "Age" : Math.floor(Math.random() * 120)}) }

Next we will run a find() command to fetch a Name with the value user101. We append explain() to check what steps MongoDB executes in order to return the result set.

> db.testindx.find({"Name" : "user101"}).explain()
{
    "cursor" : "BasicCursor",
    "isMultiKey" : false,
    "n" : 1,
    "nscannedObjects" : 1000000,
    "nscanned" : 1000000,
    "nscannedObjectsAllPlans" : 1000000,
    "nscannedAllPlans" : 1000000,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 464,
    "indexBounds" : {
    },
    "server" : "ANOC9:27017"
}



As we can see in the above output, the database has scanned the entire collection. This has a significant performance impact, and it happens because there are no indexes.

Single Key Index:

We will next create an index on the Name field of the documents. ensureIndex() is used to create the index.

> db.testindx.ensureIndex({"Name" : 1})

The index creation will take a few minutes depending on the server and the collection size. Let's run the same query with explain() to check what steps the database executes after index creation. Check the "n", "nscanned" and "millis" fields in the output.

> db.testindx.find({"Name" : "user101"}).explain()
{
    "cursor" : "BtreeCursor Name_1",
    "isMultiKey" : false,
    "n" : 1,
    "nscannedObjects" : 1,
    "nscanned" : 1,
    "nscannedObjectsAllPlans" : 1,
    "nscannedAllPlans" : 1,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 2,
    "indexBounds" : {
        "Name" : [
            [
                "user101",
                "user101"
            ]
        ]
    },
    "server" : "ANOC9:27017"
}



As you can see in the results, there is no table scan. The index creation makes a significant difference in the query execution time.

The index we created above is a single-key index.

Compound Index:

While creating an index we should keep in mind that the index should cover most of our queries. For example, if we sometimes query on only the Name field and at other times on both the Name and the Age fields, then creating a compound index on Name and Age will be more efficient than a single-key index, because the compound index can be used for both queries.

The following command creates a compound index on Name and Age fields of the testindx Collection.

> db.testindx.ensureIndex({"Name" : 1, "Age" : 1})

Compound indexes help MongoDB execute queries with multiple clauses more efficiently.

While creating a compound index it is also very important to keep in mind that the fields which will be used for exact matches (e.g. "Name": "S1") come first, followed by the fields which are used in ranges (e.g. "Age": {"$gt": 20}).

Hence the above index will be beneficial for the below query.

> db.testindx.find({"Name": "user5", "Age": {"$gt": 25}}).explain()

{
    "cursor" : "BasicCursor",
    "isMultiKey" : false,
    "n" : 0,
    "nscannedObjects" : 22,
    "nscanned" : 22,
    "nscannedObjectsAllPlans" : 22,
    "nscannedAllPlans" : 22,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    "indexBounds" : {
    },
    "server" : "ANOC9:27017"
}


Support for sort Operations:

In MongoDB sort operations that sort documents based on an indexed field provide the greatest performance. Indexes in MongoDB, as in other databases, have an order: as a result, using an index to access documents returns results in the same order as the index.

A compound index needs to be created when sorting on multiple fields. With compound indexes, the results can be in the sorted order of either the full index or an index prefix.

An index prefix is a subset of a compound index; the subset consists of one or more fields at the start of the index, in order.

  • For example, given an index { a: 1, b: 1, c: 1, d: 1 }, the following subsets are index prefixes:

{ a: 1 }, { a: 1, b: 1 }, { a: 1, b: 1, c: 1 }

A compound index can only help with sorting if it is a prefix of the sort. For example, consider a compound index on "Age", "Name" and "Class":

> db.testindx.ensureIndex({"Age": 1, "Name": 1, "Class": 1})

This index will be useful for the following queries:

> db.testindx.find().sort({"Age": 1})

> db.testindx.find().sort({"Age": 1, "Name": 1})

> db.testindx.find().sort({"Age": 1, "Name": 1, "Class": 1})

The above index won't be of much help in the following query:

> db.testindx.find().sort({"Gender": 1, "Age": 1, "Name": 1})

You can diagnose how MongoDB is processing the query by using the explain() command.

Unique index:

Creating an index on a field doesn't ensure uniqueness: if an index is created on the Name field, two or more documents can have the same name. However, if uniqueness is one of the constraints that needs to be enforced, the unique property needs to be set to true while creating the index. The below command will create a unique index on the Name field of the testindx collection.

> db.testindx.ensureIndex({"Name": 1}, {"unique": true})

Now if we try to insert duplicate names in the collection as shown below, MongoDB returns an error and does not allow insertion of duplicate records.

> db.testindx.insert({"Name": "uniquename"})

> db.testindx.insert({"Name": "uniquename"})

"E11000 duplicate key error index: mydbpoc.testindx.$Name_1 dup key: { : "uniquename" }"

If you check the collection, you'll see that only the first "uniquename" was stored.

> db.testindx.find({"Name": "uniquename"})

{ "_id" : ObjectId("52f4b3c3958073ea07f092ca"), "Name" : "uniquename" }


Uniqueness can be enabled for compound indexes also, which means that though individual fields can have duplicate values, the combination will always be unique.

For example, if we have a unique index on {"Name": 1, "Age": 1},

> db.testindx.ensureIndex({"Name": 1, "Age": 1}, {"unique": true})


Then the following inserts will be permissible

> db.testindx.insert({"Name": "usercit"})

> db.testindx.insert({"Name": "usercit", "Age": 30})

However if we execute the below command

> db.testindx.insert({"Name": "usercit", "Age": 30})

It'll throw an error

E11000 duplicate key error index: mydbpoc.testindx.$Name_1_Age_1 dup key: { : "usercit", : 30.0 }

At times it might be the case that we create the collection and insert the documents first, and only then create an index on the collection. If we create a unique index on a collection whose indexed fields already contain duplicate values, the index creation will fail.

To cater to this scenario, MongoDB provides the "dropDups" option.

The "dropDups" option will save the first document found and remove any subsequent documents with duplicate values.

The following command will create a unique index on the Name field and will delete any duplicate documents, if any.

> db.testindx.ensureIndex({"Name": 1}, {"unique": true, "dropDups": true})


system.indexes:

Whenever we create a database, by default a system.indexes collection is created. All of the information about a database's indexes is stored in the system.indexes collection. This is a reserved collection, so you cannot modify its documents or remove documents from it. You can manipulate it only through ensureIndex() and the dropIndexes database command.

Whenever an index is created, its meta information can be seen in system.indexes. The following command can be used to fetch all the index information about the mentioned collection:

db.collectionname.getIndexes()

For example, the below command will return all indexes created on the testindx collection.

> db.testindx.getIndexes()









[
    ...
    {
        "key" : {
            ...
        },
        "ns" : "mydbpoc.testindx",
        "name" : "name_1_age_1"
    },
    {
        "key" : {
            "Name" : 1
        },
        "unique" : true,
        "ns" : "mydbpoc.testindx",
        "name" : "Name_1",
        "dropDups" : true
    }
]
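Since system.indexes is an ordinary (if reserved) collection, the same information can also be read by querying it directly; a quick hedged example against the mydbpoc database used above:

> db.system.indexes.find({"ns": "mydbpoc.testindx"})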




dropIndex:

The dropIndex command is used for removing an index.

The following command will remove the Name field index from the testindx collection.

> db.testindx.dropIndex({"Name": 1})

{ “nIndexesWas” : 3, “ok” : 1 }



reIndex:

When we have performed a number of insertions and deletions on the collection, at times it is required to rebuild the indexes so that they can be used optimally. reIndex is used for rebuilding the indexes. The following command rebuilds all indexes on the collection in a single operation.


This operation drops all indexes including the _id index and then rebuilds all the indexes.

The following command rebuilds indexes of the testindx collection.

> db.testindx.reIndex()

{
    "msg" : "indexes dropped for collection",
    ...
    "ok" : 1
}



We will discuss the different types of indexes available in MongoDB in detail in the next chapter.

How Indexing works:

MongoDB stores indexes in a B-tree structure, hence range queries are automatically supported.

When there are multiple selection criteria in a query, MongoDB attempts to use one single best index to select a candidate set and then sequentially iterate through them to evaluate other criteria.

When handling a query for the first time, MongoDB creates multiple execution plans (one for each available index) and lets them take turns (within a certain number of ticks) executing until the fastest plan finishes. The result of the fastest executor is returned and the system remembers the index used by that executor.

Subsequent queries will use the remembered index until a certain number of updates has happened in the collection, after which the system repeats the process to figure out the best index at that time.

As collections change over time, the query optimizer deletes a query plan and reevaluates the query plan after any of the following events:

  1. The collection receives 1,000 write operations.
  2. The reIndex rebuilds the index.
  3. You add or drop an index.
  4. The mongod process restarts.

Note: If you want to override MongoDB's default index selection, the same can be done using the hint() method.
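As a hedged sketch on the testindx collection, assuming the compound {"Name": 1, "Age": 1} index created earlier is still in place, hint() takes the index specification that should be used and explain() confirms the chosen cursor:

> db.testindx.find({"Name": "user101", "Age": {"$gt": 20}}).hint({"Name": 1, "Age": 1}).explain()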

Since only one index will be used per query, it is important to look at the search or sorting criteria of the query and build additional compound indexes to match the query better.

Maintaining an index is not without cost, as the index needs to be updated when documents are created, deleted, and updated, which incurs overhead for write operations.

To maintain an optimal balance, we need to periodically measure the effectiveness of having an index (e.g. the read/write ratio) and delete less efficient indexes.

How to use Conditional Operators in Mongo Shell

Stepping Beyond the Basics

In the previous posts we looked at how to create a database and perform basic operations on it such as search, update and delete. We also used selectors. Selectors give users much more fine-grained control over finding the data that we really want.

From this post onwards we will be covering advanced querying using conditional operators and regular expressions in the selector part. Each of these successively provides us with more fine-grained control over the queries we can write and, consequently, the information that we can extract from our MongoDB databases.

Using Conditional Operators:

Conditional operators, as the name implies, are operators that refine the conditions that the query must match when extracting data from the database. We will be focusing on the following conditional operators: $lt, $gt, $lte, $gte, $in, $nin and $not.

Let’s look and understand each one in turn.

  • The following example assumes a collection named Students that contains documents of the following prototype:


{
    _id: ObjectId(),
    Name: "Full Name",
    Age: 30,
    Gender: "M",
    Class: "C1",
    Score: 95
}
We will first create the collection and insert a few sample documents.

> db.students.insert({Name: "S1", Age: 25, Gender: "M", Class: "C1", Score: 95})

> db.students.insert({Name: "S2", Age: 18, Gender: "M", Class: "C1", Score: 85})

> db.students.insert({Name: "S3", Age: 18, Gender: "F", Class: "C1", Score: 85})

> db.students.insert({Name: "S4", Age: 18, Gender: "F", Class: "C1", Score: 75})

> db.students.insert({Name: "S5", Age: 18, Gender: "F", Class: "C2", Score: 75})

> db.students.insert({Name: "S6", Age: 21, Gender: "M", Class: "C2", Score: 100})

> db.students.insert({Name: "S7", Age: 21, Gender: "M", Class: "C2", Score: 100})

> db.students.insert({Name: "S8", Age: 25, Gender: "F", Class: "C2", Score: 100})

> db.students.insert({Name: "S9", Age: 25, Gender: "F", Class: "C2", Score: 90})

> db.students.insert({Name: "S10", Age: 28, Gender: "F", Class: "C3", Score: 90})

> db.students.find()

{ "_id" : ObjectId("52f874faa13cd6a65998734d"), "Name" : "S1", "Age" : 25, "Gender" : "M", "Class" : "C1", "Score" : 95 }
...
{ "_id" : ObjectId("52f8758da13cd6a659987356"), "Name" : "S10", "Age" : 28, "Gender" : "F", "Class" : "C3", "Score" : 90 }



$lt and $lte

Let’s start with $lt and $lte operators. They stand for “less than” and “less than or equal to” operators respectively.

Let’s say we want to find all students with age<25. Hence for this we will execute the following find with selector

> db.students.find({"Age": {"$lt": 25}})

{ "_id" : ObjectId("52f8750ca13cd6a65998734e"), "Name" : "S2", "Age" : 18, "Gender" : "M", "Class" : "C1", "Score" : 85 }
...
{ "_id" : ObjectId("52f87556a13cd6a659987353"), "Name" : "S7", "Age" : 21, "Gender" : "M", "Class" : "C2", "Score" : 100 }

Next, if we want to find out all students with Age <= 25, then the following will be executed:

> db.students.find({"Age": {"$lte": 25}})

{ "_id" : ObjectId("52f874faa13cd6a65998734d"), "Name" : "S1", "Age" : 25, "Gender" : "M", "Class" : "C1", "Score" : 95 }
...
{ "_id" : ObjectId("52f87578a13cd6a659987355"), "Name" : "S9", "Age" : 25, "Gender" : "F", "Class" : "C2", "Score" : 90 }


$gt and $gte

These operators stand for "Greater than" and "Greater than or equal to" respectively.

Let's find out all the students with Age > 25. This can be achieved by executing the following command:

> db.students.find({"Age": {"$gt": 25}})

{ "_id" : ObjectId("52f8758da13cd6a659987356"), "Name" : "S10", "Age" : 28, "Gender" : "F", "Class" : "C3", "Score" : 90 }

If we change the above example to return students with Age >= 25, then the command is as below:

> db.students.find({"Age": {"$gte": 25}})

{ "_id" : ObjectId("52f874faa13cd6a65998734d"), "Name" : "S1", "Age" : 25, "Gender" : "M", "Class" : "C1", "Score" : 95 }
...
{ "_id" : ObjectId("52f8758da13cd6a659987356"), "Name" : "S10", "Age" : 28, "Gender" : "F", "Class" : "C3", "Score" : 90 }


$in and $nin

Let's find out all students who belong to either class C1 or C2. The command for the same is as below:

> db.students.find({"Class": {"$in": ["C1", "C2"]}})

{ "_id" : ObjectId("52f874faa13cd6a65998734d"), "Name" : "S1", "Age" : 25, "Gender" : "M", "Class" : "C1", "Score" : 95 }
...
{ "_id" : ObjectId("52f87578a13cd6a659987355"), "Name" : "S9", "Age" : 25, "Gender" : "F", "Class" : "C2", "Score" : 90 }

The inverse of this can be returned by using $nin.

Let's next find out the students who don't belong to class C1 or C2. The command is as below:

> db.students.find({"Class": {"$nin": ["C1", "C2"]}})

{ "_id" : ObjectId("52f8758da13cd6a659987356"), "Name" : "S10", "Age" : 28, "Gender" : "F", "Class" : "C3", "Score" : 90 }


Let’s next see how we can combine all the above operators and write a query.

Say we want to find out all students whose gender is "M" or who belong to class "C1" or "C2", and whose age is greater than or equal to 25. This can be achieved by executing the following command:

> db.students.find({"$or": [{"Gender": "M"}, {"Class": {"$in": ["C1", "C2"]}}], "Age": {"$gte": 25}})

{ "_id" : ObjectId("52f874faa13cd6a65998734d"), "Name" : "S1", "Age" : 25, "Gender" : "M", "Class" : "C1", "Score" : 95 }
...



What are Regular Expressions in MongoDB

Regular Expressions

In this post we will look at how we can use regular expressions. Regular expressions are useful in scenarios where we want to find, say, students whose name starts with "A".

In order to understand this, let's add a few more students with different names.

> db.students.insert({Name: "Student1", Age: 30, Gender: "M", Class: "Biology", Score: 90})

> db.students.insert({Name: "Student2", Age: 30, Gender: "M", Class: "Chemistry", Score: 90})

> db.students.insert({Name: "Test1", Age: 30, Gender: "M", Class: "Chemistry", Score: 90})

> db.students.insert({Name: "Test2", Age: 30, Gender: "M", Class: "Chemistry", Score: 90})

> db.students.insert({Name: "Test3", Age: 30, Gender: "M", Class: "Chemistry", Score: 90})

Say we want to find all students with names starting with "St" or "Te" and whose class begins with "Che". The same can be filtered using regular expressions as shown below:

> db.students.find({"Name": /(St|Te)*/i, "Class": /(Che)/i})

{ "_id" : ObjectId("52f89ecae451bb7a56e59086"), "Name" : "Student2", "Age" : 30, "Gender" : "M", "Class" : "Chemistry", "Score" : 90 }
...
{ "_id" : ObjectId("52f89f06e451bb7a56e59089"), "Name" : "Test3", "Age" : 30, "Gender" : "M", "Class" : "Chemistry", "Score" : 90 }


In order to understand how the regular expression works, let's take the query apart: we have mentioned "Name": /(St|Te)*/i.

The trailing i indicates that whatever we mention between the slashes is matched as a case-insensitive regular expression.

(St|Te) indicates that the start of the Name string should be either "St" or "Te".

The * at the end means it will match anything after that.

When we put everything together, we are doing a case-insensitive match of names that have either "St" or "Te" at the beginning. The same kind of regex is used for the Class field.
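Strictly speaking, the * in /(St|Te)*/ also lets names that do not begin with "St" or "Te" match, because * allows zero occurrences; the Class filter is what narrows the result here. A more explicit, anchored form (our own variation, not part of the original example) would be:

> db.students.find({"Name": /^(St|Te)/i, "Class": /^Che/i})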

Next let’s complicate the query a bit by combining it with one of the conditional operators that we covered above.

Find out all students with names like Student1, Student2, etc. who are male and whose Age >= 25.

The command for this is as shown below.

> db.students.find({"Name": /(student*)/i, "Age": {"$gte": 25}, "Gender": "M"})

{ "_id" : ObjectId("52f89eb1e451bb7a56e59085"), "Name" : "Student1", "Age" : 30, "Gender" : "M", "Class" : "Biology", "Score" : 90 }

{ "_id" : ObjectId("52f89ecae451bb7a56e59086"), "Name" : "Student2", "Age" : 30, "Gender" : "M", "Class" : "Chemistry", "Score" : 90 }

What is MapReduce and Aggregation?


MapReduce is a process where the aggregation of data can be split up and executed across a cluster of computers to reduce the time that it takes to determine an aggregate result on a set of data. It’s made up of two parts: Map and Reduce.

A more specific description:

MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured).

“Map” step: The master node takes the input, partitions it up into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

“Reduce” step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

In order to understand how it works let’s consider a small example, where we will find out the number of male and female students in our collection.

This involves the following steps: first we will create the map and reduce functions, and then we will call the mapReduce function and pass the necessary arguments.

Let’s start with defining the map function

> var map = function(){ emit(this.Gender, 1); };


This step takes the document as input and, based on the "Gender" field, emits documents of the type {"F", 1} or {"M", 1}.

Next we will create the reduce function

> var reduce = function(key, value){ return Array.sum(value); };


This will group the documents emitted by the map function on the key field, which in our example is "Gender", and return the sum of the values, which in the above example are emitted as "1". Hence the output of the reduce function defined above is a gender-wise count.

Finally we will put them together using the mapReduce function as follows

> db.students.mapReduce(map, reduce, {out: "mapreducecount1"})

{
    "result" : "mapreducecount1",
    "timeMillis" : 280,
    "counts" : {
        "input" : 15,
        "emit" : 15,
        "reduce" : 2,
        "output" : 2
    },
    "ok" : 1
}



This applies the map and reduce functions we defined to the students collection. The final result is stored in a new collection called mapreducecount1.

In order to verify the result, we will run the find() command on the mapreducecount1 collection as shown below.

> db.mapreducecount1.find()

{ “_id” : “F”, “value” : 6 }

{ “_id” : “M”, “value” : 9 }


We will use one more example to explain the working of MapReduce. We will next use MapReduce to find out the class-wise average score. As we saw in the above example, we first need to create the map function, then the reduce function, and finally we combine them to store the output in a collection in our database. The code snippet is as shown below:

> var map_1 = function(){ emit(this.Class, this.Score); };

> var reduce_1 = function(key, value){ return Array.avg(value); };

> db.students.mapReduce(map_1, reduce_1, {out: "MR_ClassAvg_1"})

{
    "result" : "MR_ClassAvg_1",
    "timeMillis" : 57,
    "counts" : {
        "input" : 15,
        "emit" : 15,
        "reduce" : 3,
        "output" : 5
    },
    "ok" : 1
}

> db.MR_ClassAvg_1.find()

{ “_id” : “Biology”, “value” : 90 }

{ “_id” : “C1”, “value” : 85 }

{ “_id” : “C2”, “value” : 93 }

{ “_id” : “C3”, “value” : 90 }

{ “_id” : “Chemistry”, “value” : 90 }


The first step defines the map function, which loops through the collection documents and returns output of the form {"Class": Score}, e.g. {"C1": 95}.

The second step groups on the class and computes the average of the scores for that class.

The third step combines the results; it defines the collection to which the map and reduce functions need to be applied and finally defines where to store the output, which in this case is a new collection MR_ClassAvg_1.

In the last step we use find in order to check the resulting output.

aggregate():

In the previous section we covered an introduction to MapReduce. In this section we will take a glimpse of the aggregation framework of MongoDB, which provides a means to compute aggregated values without having to use MapReduce.

We will reproduce the two outputs discussed above using the aggregate function. The first output was to find the count of male and female students. The same can be achieved by executing the following command:

> db.students.aggregate({$group: {_id: "$Gender", totalstudent: {$sum: 1}}})


{
    "result" : [
        {
            "_id" : "F",
            "totalstudent" : 6
        },
        {
            "_id" : "M",
            "totalstudent" : 9
        }
    ],
    "ok" : 1
}



Similarly, in order to find out the class-wise average score the following command can be executed:

> db.students.aggregate({$group: {_id: "$Class", avgscore: {$avg: "$Score"}}})

{
    "result" : [
        {
            "_id" : "Biology",
            "avgscore" : 90
        },
        {
            "_id" : "C1",
            "avgscore" : 85
        },
        ...
    ],
    "ok" : 1
}



What is Relational Data Modeling and Normalization

Relational Data Modeling and Normalization

Before jumping into MongoDB's approach, we'll take a little detour into how we would model this in relational (SQL) databases.

In relational databases the data modeling typically progresses by defining the tables and gradually removing data redundancy to achieve a Normal form.

Normal form:

In relational databases, normalization typically begins by creating tables as per the application requirements and then gradually removing redundancy to achieve the highest normal form, also termed the third normal form or 3NF. In order to understand this better, let's try to put the blogging application data in tabular form. The initial data might be of the following form:


This data is actually in first normal form. It will have lots of redundancy, because we can have multiple comments against a post and multiple tags can be associated with it. The problem with redundancy, of course, is that it introduces the possibility of inconsistency, where various copies of the same data may have different values. To remove this redundancy, we need to further normalize the data by splitting it into multiple tables. As part of this step, we must identify a key column which uniquely identifies each row in the table so that we can create links between the tables. The above scenario, when modeled using 3NF, will look like as depicted below:

Fig: RDBMS Diagram


In this case, we have a data model that is free of redundancy, allowing us to update without having to worry about updating multiple rows. In particular, we no longer need to worry about inconsistency in the data model.

Problem with the normal forms:

As already mentioned, the nice thing about normalization is that it allows for easy updating without any redundancy, i.e. it helps keep the data consistent. Updating a user name just requires updating the name in the Users table.

However, the problem arises when we try to get the data back out. For instance, to find out all tags and comments associated with posts by a specific user, the relational database programmer uses a JOIN. Using JOINs the database returns all the data as per the application screen design, but the real problem is what operation the database performs to get that result set.

Generally an RDBMS reads from disk, and disk seeks take well over 99% of the time spent reading a row. When it comes to disk access, random seeks are the enemy. The reason why this is so important in this context is that JOINs typically require random seeks. The JOIN operation is one of the most expensive operations within a relational database. Additionally, if you end up needing to scale your database to multiple servers, you introduce the problem of generating a distributed join, a complex and generally slow operation.

How to Approach MongoDB document data model

MongoDB document Data Model Approach

We have already seen that in MongoDB, data is stored in documents. Fortunately for us as application designers, that opens up some new possibilities in schema design. Unfortunately for us, it also complicates our schema design process. There is no longer a "garden path" of normalized database design to go down, and the go-to answer when faced with general schema design problems in MongoDB is "it depends."

If we have to model the above using the MongoDB document model, then we might store the blog data in a document as follows:


{
    "_id" : ObjectId("508d27069cc1ae293b36928d"),
    "title" : "this is the title",
    "body" : "this is the body text.",
    "tags" : [
        ...
    ],
    "created_date" : ISODate("2012-10-28T12:41:39.110Z"),
    "author" : "author1",
    "category_id" : ObjectId("508d29709cc1ae293b369295"),
    "comments" : [
        {
            "subject" : "This is comment 1",
            "body" : "This is the body of comment 1.",
            "author" : "author 2",
            "created_date" : ISODate("2012-10-28T13:34:23.929Z")
        },
        ...
    ]
}



As we can see, we have embedded the comments and tags within the single document itself. Alternatively, we could "normalize" the model a bit by referencing the comments and tags by their _id fields:

// Authors document:
{
    "_id" : ObjectId("508d280e9cc1ae293b36928e"),
    "name" : "Jenny",
    ...
}

// Tags document:
{
    "_id" : ObjectId("508d35349cc1ae293b369299"),
    "TagName" : "Tag1",
    ...
}

// Comments document:
{
    "_id" : ObjectId("508d359a9cc1ae293b3692a0"),
    "Author" : ObjectId("508d27069cc1ae293b36928d"),
    ...
    "created_date" : ISODate("2012-10-28T13:34:59.336Z")
}

// Category document:
{
    "_id" : ObjectId("508d29709cc1ae293b369295"),
    ...
}

// Posts document:
{
    "_id" : ObjectId("508d27069cc1ae293b36928d"),
    "title" : "This is the title",
    "body" : "This is the body text.",
    "tags" : [
        ObjectId("508d35349cc1ae293b369299"),
        ObjectId("508d35349cc1ae293b36929c")
    ],
    "created_date" : ISODate("2012-10-28T12:41:39.110Z"),
    "author_id" : ObjectId("508d280e9cc1ae293b36928e"),
    "category_id" : ObjectId("508d29709cc1ae293b369295"),
    "comments" : [
        ObjectId("508d359a9cc1ae293b3692a0"),
        ...
    ]
}

The remainder of this chapter is devoted to identifying which solution will work in our context, i.e. whether to use referencing or to embed.


In this post we will look at the cases where embedding will have a positive impact on performance.

Embedding can be useful when we want to fetch a set of data together and display it on the screen together. For example, in the above case, say we have a page which displays the comments associated with a blog post; in that case the comments can be embedded in the Blogs document.

The benefit of this approach is that since MongoDB stores documents contiguously on disk, all the related data can be fetched in a single seek. Apart from this, since JOINs are not supported, if we had instead used referencing the application would have to do something like the following two steps to fetch the comments data associated with the blog (see the shell sketch after these steps).

  1. Fetch the associated comment _ids from the blogs document.
  2. Fetch the comments documents based on the comment _ids found in the first step.
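A minimal shell sketch of these two round trips, assuming the referenced Posts and Comments documents shown earlier (the variable names are ours):

> var post = db.posts.findOne({"_id": ObjectId("508d27069cc1ae293b36928d")})

> var comments = db.comments.find({"_id": {"$in": post.comments}}).toArray()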

If we take this approach, not only does the database have to do multiple seeks to find our data, but also additional latency is introduced into the lookup since it now takes two round-trips to the database to retrieve our data.

Hence if the application frequently accesses the comments data along with the blogs then almost certainly embedding the comments within the blog documents will have a positive impact on the performance.

Another concern that weighs in favor of embedding is the desire for atomicity and isolation when writing data. MongoDB is designed without multi-document transactions, i.e. it only provides atomic operations at the level of a single document, hence data that needs to be updated together atomically needs to be placed together in a single document.

When we update data in our database, we want to ensure that our update either succeeds or fails entirely, never having a “partial success,” and that any other database reader never sees an incomplete write operation.
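As a hedged illustration using the embedded blog document shown earlier (the content of the new comment is made up for the example), appending a comment touches a single document and is therefore atomic:

> db.posts.update({"_id": ObjectId("508d27069cc1ae293b36928d")}, {"$push": {"comments": {"subject": "This is comment 2", "body": "This is the body of comment 2.", "author": "author 3", "created_date": new Date()}}})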


We have seen that embedding is the approach that will provide the best performance in many cases and also provides data consistency guarantees. However, in some cases, a more normalized model works better in MongoDB.

One reason for having multiple collections and adding references is the increased flexibility it gives when querying the data. Let’s understand this with the blogging example we mentioned above.

We have already seen how the schema will be when we use embedded schema which will work very well when displaying all the data together on a single page i.e. the page which will display the Blog Post followed by all the associated comments.

Now suppose we have a requirement to search for the comments posted by a particular user, the query (using this embedded schema) would be as follows:

db.posts.find({'comments.author': 'author2'}, {'comments': 1})

The result of this query, then, would be documents of the following form:


{
    "_id" : ObjectId("508d27069cc1ae293b36928d"),
    "comments" : [
        {
            "subject" : "This is comment 1",
            "body" : "This is the body of comment 1.",
            "author_id" : "author2",
            "created_date" : ISODate("2012-10-28T13:34:23.929Z")
        },
        ...
    ]
}
...


The major drawback of this approach is that we get back much more data than we actually need. In particular, we can't ask for just author2's comments; we have to ask for posts that author2 has commented on, which includes all the other comments on those posts as well. This data will require further filtering within the application code.

On the other hand, suppose we decide to use a normalized schema. In this case we will have three collections: "Authors", "Posts" and "Comments".

The "Authors" documents will have author-specific content such as Name, Age, Gender, etc., and the "Posts" documents will have post-specific details such as post creation time, author of the post, the actual content and the subject of the post.

The "Comments" documents will have the post's comments, such as the commented-on datetime, the author of the comment and the text of the comment. The same is depicted below:

// Authors document:
{
    "_id" : ObjectId("508d280e9cc1ae293b36928e"),
    "name" : "Jenny",
    ...
}

// Posts document:
{
    "_id" : ObjectId("508d27069cc1ae293b36928d"),
    ...
}

// Comments document:
{
    "_id" : ObjectId("508d359a9cc1ae293b3692a0"),
    "Author" : ObjectId("508d27069cc1ae293b36928d"),
    "created_date" : ISODate("2012-10-28T13:34:59.336Z"),
    "Post_id" : ObjectId("508d27069cc1ae293b36928d"),
    ...
}



In this scenario, the query to find the comments by, say, author "author2" can be satisfied by a simple find() on the comments collection:

db.comments.find({“author”: “author2”})

In general, if your application's query pattern is well-known and data tends to be accessed in only one way, an embedded approach works well. Alternatively, if your application may query data in many different ways, or you are not able to anticipate the patterns in which data may be queried, a more "normalized" approach may be better.

For instance, in the above schema, we will be able to sort the comments or return a more restricted set of comments using limit, skip operators. Whereas in the embedded case we’re stuck retrieving all the comments in the same order in which they are stored in the post.
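For instance, a hedged sketch against the normalized Comments collection (field names follow the query above; the sort order and paging values are arbitrary):

> db.comments.find({"author": "author2"}).sort({"created_date": -1}).skip(0).limit(10)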

Another factor that may weigh in favor of using document references is when you have one-to-many relationships.

For instance, a popular blog with a large amount of reader engagement may have hundreds or even thousands of comments for a given post. In this case, embedding carries significant penalties with it:

Effect on read performance: As the document size increases, it will occupy more RAM. MongoDB caches frequently accessed documents in RAM, and the larger the documents become, the lower the probability of them fitting in RAM. This leads to more page faults while retrieving documents, which causes random disk I/O and further slows down performance.

Effect on update performance: As the size increases and an update operation is performed on such a document to append data, eventually MongoDB will need to move the document to an area with more space available. This movement, when it happens, significantly slows update performance.

Apart from this, MongoDB documents have a hard size limit of 16 MB. Although this is something to be aware of, you will usually run into problems due to memory pressure and document copying well before you reach the 16 MB size limit.

One final factor that weighs in favor of using document references is the case of many-to-many or M:N relationships.

For instance, in our above example we have tags: each blog can have multiple tags and each tag can be associated with multiple blog entries.

One approach to implement the Blogs-Tags M:N relationship is to have the following three collections

  1. Tags Collection which will store the Tags Details
  2. Blogs collection which will have Blogs Details
  3. A third collection, Tag-To-Blog Mapping, which will have the mapping between the tags and the blogs.

This approach is similar to the one we have in the relational databases but this will negatively impact the application performance as the queries will end up doing a lot of application-level “joins”.

Alternatively, we can use the embedding model where we embed the tags within the Blogs document, but this will lead to data duplication. Though this will simplify the read operation a bit, it will increase the complexity of the update operation, because while updating a tag's details the user needs to ensure that the updated tag is updated at each and every place where it has been embedded in other blog documents.

Hence for many-to-many joins, a compromise approach is often best, embedding a list of _id values rather than the full document:

// Tags document:
{
    "_id" : ObjectId("508d35349cc1ae293b369299"),
    "TagName" : "Tag1",
    ...
}

// Posts document with tag _ids added as references:
{
    "_id" : ObjectId("508d27069cc1ae293b36928d"),
    ...
    "tags" : [
        ObjectId("508d35349cc1ae293b369299"),
        ObjectId("508d35349cc1ae293b36929a"),
        ObjectId("508d35349cc1ae293b36929b"),
        ObjectId("508d35349cc1ae293b36929c")
    ],
    ...
}



Though querying will be a bit more complicated, we no longer need to worry about updating a tag everywhere.
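For example, as a rough sketch using the documents above (the posts and tags collection names are our assumption), all posts carrying "Tag1" can be found by matching its _id inside the tags array, and the tag documents of a given post can then be fetched with $in:

> db.posts.find({"tags": ObjectId("508d35349cc1ae293b369299")})

> var post = db.posts.findOne({"_id": ObjectId("508d27069cc1ae293b36928d")})

> db.tags.find({"_id": {"$in": post.tags}})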

In summary, schema design in MongoDB is one of the very early decisions that we need to take, and it is dependent on the application requirements and queries.

As we have seen above, when we need to access the data together or make atomic updates, embedding will have a positive impact; however, if we need more flexibility while querying or have many-to-many relationships, using references is a worthy decision.

Ultimately, the decision depends on the access patterns of your application, and there are no hard-and-fast rules in MongoDB; the data model needs to be thought through and decided on the basis of the access patterns. In the next section we will cover various data modeling considerations.

Data modeling decisions:

This involves determining how to structure the documents to model the data effectively.

The most important decision is whether to embed the data or to add a reference to it (i.e. use references).

This point is best demonstrated with an example. Suppose we have a book review site which will have authors and books as well as reviews with threaded comments.

Now the question is how we should structure the collections.

The decision lies in the use cases, i.e. it depends on the number of comments expected per book and how frequently read vs. write operations will be performed.
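As a rough sketch of the two options (all field and collection names here are our own illustration, not from the original text), the embedded form suits a small set of comments that is always read together with the book, while the referenced form suits large or independently queried review threads:

// Option 1: reviews and their comments embedded in the book document
{ "title" : "Some Book", "author" : "Some Author", "reviews" : [ { "text" : "...", "comments" : [ ... ] } ] }

// Option 2: reviews kept in their own collection and linked back to the book
{ "_id" : ObjectId("..."), "title" : "Some Book", "author" : "Some Author" }
{ "book_id" : ObjectId("..."), "text" : "...", "comments" : [ ... ] }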

Operational considerations:

In addition to the way the elements interact with each other i.e. whether to store the documents in an embedded manner or use references, a number of other operational factors are important while designing a data model for the application. These factors include the following:

Data Lifecycle management:

This feature needs to be used if our application has datasets which need to be persisted in the database only for a limited time period.

Say, for example, in our above scenario, if we need to retain the data related to reviews and comments for only a month, then this feature can be taken into consideration.

This is implemented by using the Time To Live (TTL) feature of a collection, which expires documents after a period of time.

Additionally if the application requirement is to work with only the recently inserted documents then using Capped collections will help optimize the performance.
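As a hedged sketch (the collection names, field names and values are our own assumptions): a TTL index on a date field makes MongoDB expire documents that many seconds after the date, and a capped collection keeps only the most recently inserted documents within a fixed size:

// Expire review comments roughly a month (30 days) after their creation date
> db.comments.ensureIndex({"created_date": 1}, {"expireAfterSeconds": 2592000})

// Fixed-size (1 MB) capped collection that retains only the most recent documents
> db.createCollection("recentreviews", {capped: true, size: 1048576})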


Indexes:

Indexes can be created to support commonly used queries to increase performance.

By default an index is created by MongoDB on the _id field.

A few points which we need to take into consideration while creating indexes are:

  1. Each index requires at least 8KB of data space.
  2. Adding an index has some negative performance impact for write operations. For collections with high write-to-read ratio, indexes are expensive as each insert must add keys to each index.
  3. Collections with high proportion of read operations to write operations often benefit from additional indexes. Indexes do not affect un-indexed read operations.


Sharding:

Among the various factors, one of the important decisions while designing the application model is whether to partition the data or not. This is implemented using sharding in MongoDB.

Sharding is also referred as partitioning of data.

In MongoDB, sharding involves partitioning a collection within a database to distribute the collection's documents across a cluster of machines, which are termed shards. This can have a significant impact on performance. We will discuss more about sharding in the MongoDB explained chapter.
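As a minimal hedged sketch (to be run against the mongos of an already configured sharded cluster; the database name and shard key here are just assumptions):

> sh.enableSharding("mydbpoc")

> sh.shardCollection("mydbpoc.testindx", {"Name": 1})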

Large number of collections:

The design considerations for having multiple collections versus storing data in a single collection are as below:

  • There is no performance penalty in choosing multiple collections for storing data.
  • Having distinct collections for different types of data can have performance improvements in high throughput batch processing applications.
  • When using models that have a large number of collections, we need to consider the following behaviors:
  • Each collection has a certain minimum overhead of a few kilobytes.
  • Each index, including the index on _id, requires at least 8KB of data space.

As mentioned above in the discussion of how data is stored in MongoDB, there is a single <database>.ns file which stores all the meta-data for each database.

Each index and collection has its own entry in the namespace file, so we need to consider the limits on the size of namespace files in MongoDB when thinking of implementing a large number of collections.

Document growth:

Certain updates to documents can increase the document size, such as pushing elements to an array and adding new fields. If the document size exceeds the allocated space for that document, MongoDB relocates the document on disk. This internal relocation can be both time and resource consuming. Although MongoDB automatically provides padding to minimize the occurrence of relocations, you may still need to manually handle document growth.
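As a hedged sketch for the MMAPv1-era deployments this tutorial assumes (the collection name is just an example), one way to reduce relocations is to enable power-of-two record allocation for a collection with the collMod command:

> db.runCommand({collMod: "posts", usePowerOf2Sizes: true})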
