What is big data?
Data so big that it can’t be processed on a local machine with limited RAM and CPU resources. More precisely, big data is characterized along 4 dimensions: volume, variety, velocity, and veracity (the quality and trustworthiness of the data). Therefore, we need a dedicated setup to deal with it.
What is this setup?
The Hadoop ecosystem is well suited for the storage and processing of big data. It can be set up using Apache Hadoop (open-source software) on a few pieces of commodity hardware (like 5 laptops) connected to each other; through enterprise distributions such as Hortonworks, MapR, and Cloudera; or on any of the major clouds like AWS, Azure, and GCP. Most of the traditional distributions have put the Ambari management server behind a paywall, making setup difficult for people who have just started learning. I recommend signing up for free credits on any of the major clouds and then going for a managed option, such as EMR clusters in AWS, HDInsight in Azure, or Dataproc in GCP, to start your learning.
What are these clusters?
Clusters are collections of nodes, each with dedicated RAM and CPU. They are divided into master and worker nodes (master-slave architecture). Master nodes are responsible for resource management, i.e. allocating CPU and RAM, as well as storing the metadata (information like names of files, permissions, etc.), while worker nodes are responsible for storing and processing the actual data. In Hadoop, we bring the processing to the location where the data is stored, i.e. processing happens on the data nodes. Those who are learning can create a cluster with a single node that acts as both master and worker; it will perform the tasks of management as well as storage and processing.
What services are installed on these clusters?
In a traditional setup, you have to configure all the services manually, but cloud-provisioned clusters come with most of the services pre-installed: HDFS (Hadoop Distributed File System), the MapReduce engine, Spark, Pig, Hive, etc. MapReduce is the traditional processing engine. The code is written in Java and packaged into a JAR (Java archive) file. This JAR file is then submitted to YARN with the input file passed as an argument, and a job is executed, consisting of Map and Reduce tasks. For processing, I recommend Apache Spark. It is faster than the traditional engine because it uses in-memory operations, and you can write code in Scala, Java, or Python (it is a polyglot framework). Python is the most popular and the easiest of the three, and thus highly recommended.
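To see what the Map and Reduce tasks conceptually do, here is a plain-Python sketch of the MapReduce word-count flow. This is only an illustration on a small made-up input; real MapReduce runs these phases as distributed Java tasks scheduled by YARN.

```python
from collections import defaultdict

# Plain-Python sketch of the MapReduce word-count phases.
# The input lines here are made up for illustration.
lines = ["we the people", "of the united states"]

# Map phase: each line is turned into (word, 1) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: pairs are grouped by key (the word)
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: the counts for each word are summed
counts = {word: sum(values) for word, values in grouped.items()}

print(counts)  # "the" appears twice, every other word once
```

The same map, shuffle, and reduce steps happen on a cluster, except each phase runs in parallel across the worker nodes.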
How to proceed?
- Take your source file and put it into HDFS using the hadoop fs -put command, then ls the HDFS directory to check that the file was uploaded successfully.
1. hadoop fs -put /nodedirectory/constitution.txt /hadoopdirectory/constitution.txt
2. hadoop fs -ls /hadoopdirectory/
On any of the node clients, start your processing engine.
For PySpark, just enter pyspark and it will start the engine. We are trying to count the total number of words in a text file. Write code to fetch the file and assign it to a variable. This will create a Spark RDD.
3. constitution = sc.textFile("hdfs:///hadoopdirectory/constitution.txt")
Create a new RDD and do all the transformations and aggregations.
4. sentence_words_length_new = constitution.filter(lambda sent: len(sent) > 0).map(lambda sent: len(sent.split(" ")))
# this gives us an RDD holding the number of words in each non-empty line
Action: this will create a job and submit it to the cluster.
5. total_words = sentence_words_length_new.reduce(lambda a, b: a + b)
# takes the sum of all the counts, i.e. gives the total number of words in the text file
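The RDD pipeline above can be mirrored with plain Python built-ins to check the logic locally; the list of lines here is made up for illustration, standing in for the HDFS file.

```python
from functools import reduce

# Local, plain-Python mirror of the RDD pipeline above.
# Made-up sample lines stand in for the HDFS text file.
lines = [
    "We the People of the United States",
    "",
    "in Order to form a more perfect Union",
]

# filter(...) drops empty lines; map(...) counts the words in each line
word_counts = [len(line.split(" ")) for line in lines if len(line) > 0]

# reduce(...) sums the per-line counts into a grand total
total_words = reduce(lambda a, b: a + b, word_counts)

print(word_counts, total_words)
```

On the cluster, the filter and map run in parallel on the data nodes, and only the reduce brings the partial results together.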
- The RDD was the original core abstraction in Apache Spark. Later versions of Spark introduced the DataFrame concept, which has a table-like structure, is much easier to work with than raw RDDs, and can be used by both data engineers and data scientists.
- Behind the scenes, a DataFrame reduces to RDD operations. It is an abstraction layer added for people who are analytics-oriented rather than programmers.
Thanks to Rakesh Jha for the edits and review.
Hope you find this article helpful. Please like, share and comment. Thanks!