Hadoop Uses

Prince Prashant Saini
Mar 14, 2021

What is Big Data?

Big Data is still data, but at a huge scale. The term describes collections of data that are enormous in volume and growing exponentially over time. In short, such data is so large and complex that traditional data management tools cannot store or process it efficiently.

Some Examples of Big Data

1. The New York Stock Exchange generates about one terabyte of new trade data per day.

2. Social Media

Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated by photo and video uploads, message exchanges, comments, etc.

3. Jet Engine

A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.

Characteristics of Big Data

1. Volume: The name Big Data itself refers to an enormous size. The volume of data plays a crucial role in determining its value, and whether a particular dataset can be considered Big Data at all depends on its volume. Hence, 'Volume' is one characteristic that must be considered when dealing with Big Data.

2. Variety: Variety refers to the heterogeneous sources and nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only data sources most applications considered. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses challenges for storing, mining, and analyzing data.

3. Velocity: The term 'velocity' refers to the speed at which data is generated. How fast data is generated and processed to meet demand determines its real potential.

Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.

4. Variability: This refers to the inconsistency the data can show at times, which hampers the process of handling and managing the data effectively.

What is Hadoop?

Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
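To make the MapReduce model concrete, here is a minimal single-machine sketch of its two phases, a word count, which is the classic MapReduce example. This is only an illustration of the programming model; real Hadoop jobs are written against the Hadoop API (typically in Java) and run across many nodes.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # Mapper: emit a (word, 1) pair for every word in the input split.
    for word in document.lower().split():
        yield (word, 1)

def reduce_phase(pairs):
    # Shuffle/sort: group all pairs by key, then the reducer sums
    # the counts for each word.
    pairs = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

documents = ["big data is big", "hadoop processes big data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(mapped))
print(counts)  # {'big': 3, 'data': 2, 'hadoop': 1, 'is': 1, 'processes': 1}
```

The key idea is that the map and reduce functions are independent per record and per key, so Hadoop can run them in parallel on whichever nodes hold the data.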

How Facebook is Using Big Data?

Have you ever seen one of the videos on Facebook that shows a “flashback” of posts, likes, or images — like the ones you might see on your birthday or on the anniversary of becoming friends with someone? If so, you have seen examples of how Facebook uses Big Data.

According to a survey report, "companies with more than 1,000 employees already had more than 200 terabytes of data about their customers' lives stored. Consider adding that startling amount of stored data to the rapid growth of data provided to social media platforms since then. There are trillions of tweets, billions of Facebook likes, and other social media sites like Snapchat, Instagram, and Pinterest are only adding to this social media data deluge."

In my view, the convergence of social media and Big Data gives birth to a whole new level of technology.

Distributed Storage

In Big Data, we often deal with multiple computers, grouped into clusters. The whole idea of Big Data is to distribute data across the nodes of a cluster and to use the computing power of each node to process information. A distributed file system is a system that handles accessing data spread across those nodes.

Distributed File System

A distributed file system (DFS) is a file system with data stored on a server. The data is accessed and processed as if it was stored on the local client machine. The DFS makes it convenient to share information and files among users on a network in a controlled and authorized way. The server allows the client users to share files and store data just like they are storing the information locally. However, the servers have full control over the data and give access control to the clients.
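The idea of distributed storage can be sketched in a few lines: a file is split into fixed-size blocks, and each block is copied onto several nodes so the data survives a node failure. This is a toy illustration of HDFS-style block placement; the block size, replication factor, and node names below are made up for the example (HDFS defaults to 128 MB blocks and a replication factor of 3, and its placement policy is rack-aware rather than round-robin).

```python
BLOCK_SIZE = 8          # bytes per block (illustrative; HDFS default is 128 MB)
REPLICATION = 2         # copies kept of each block (HDFS default is 3)
NODES = ["node1", "node2", "node3"]  # hypothetical cluster nodes

def place_blocks(data: bytes):
    # Split the file into fixed-size blocks, then assign each block
    # to REPLICATION distinct nodes in round-robin fashion.
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = {}
    for idx, block in enumerate(blocks):
        replicas = [NODES[(idx + r) % len(NODES)] for r in range(REPLICATION)]
        placement[idx] = {"block": block, "nodes": replicas}
    return placement

layout = place_blocks(b"distributed storage in action")
for idx, info in layout.items():
    print(idx, info["nodes"], info["block"])
```

A client reading the file asks a central name service which nodes hold each block, then streams the blocks from those nodes and reassembles them, so the file appears local even though no single machine stores all of it.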

Let me tell you some Facts

  1. Every 60 seconds, 136,000 photos are uploaded, 510,000 comments are posted, and 293,000 status updates are posted. That is a LOT of data.
  2. Every 2 days we create as much data as we did from the beginning of time until 2003.
  3. Google now processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide.
  4. The big data and analytics industry in India is expected to experience 8X growth, hitting $16 billion by 2025 from the current $2 billion.
  5. In the next seven years, the Indian analytics industry will expand its horizons further and demand more analytics professionals.
