Elastic Map Reduce (Sahara): Introduction, Integration with Swift and Installation

Today’s blog walks you through an introduction to Sahara, how it integrates with Swift, and the steps to install it on your system. Elastic Map Reduce (Sahara) is one of the components of OpenStack, and with it we continue our series on OpenStack components!

Sahara:

Sahara aims to give users a simple way to provision Hadoop clusters by specifying a few parameters such as cluster topology, Hadoop version, and the hardware details of the nodes. Once the user has filled in all the parameters, Sahara deploys the cluster within a few minutes.

Sahara also offers ways to scale an already provisioned cluster by adding or removing worker nodes on demand. The solution addresses the following use cases:

Fast provisioning of Hadoop clusters on OpenStack for development and Quality Assurance.
Use of otherwise idle compute capacity in a general-purpose OpenStack IaaS cloud.
Analytics as a service for ad hoc or bursty analytic workloads, similar to AWS EMR.

Main features of Sahara are: –

As mentioned earlier, Sahara is designed as a component of OpenStack.
Sahara is managed through a REST API, with a user interface available as part of the OpenStack Dashboard.
Sahara supports different Hadoop distributions and versions in two ways:
A pluggable system of Hadoop installation engines.
Integration with vendor-specific management tools such as Apache Ambari or Cloudera Management Console.
Predefined templates of Hadoop configurations, with the ability to change the parameters.

Sahara interacts with the following OpenStack components:

Horizon: –

It provides the graphical user interface through which all of Sahara's features can be used.

Keystone: –

It authenticates users and provides a security token that is used to work with OpenStack, thereby limiting a user's abilities in Sahara to their OpenStack privileges.

Nova: –

Nova is used to provision the virtual machines required for the Hadoop cluster.

Glance: –

The Hadoop virtual machine images are stored in Glance. Each image contains a preinstalled operating system and Hadoop, and having Hadoop preinstalled gives the nodes a head start at startup.

Swift: –

It can be used as storage for the data that Hadoop jobs will process.

Generic Workflow of Sahara: –

Sahara offers two levels of abstraction for its API and user interface, based on the two use cases it addresses: cluster provisioning and analytics as a service.
For fast cluster provisioning, the general workflow is as follows:
First, choose the Hadoop version.
Then choose a base image with or without Hadoop preinstalled. For base images without Hadoop, Sahara supports pluggable deployment engines integrated with vendor tooling.
Next, define the cluster configuration, including the size and topology of the cluster and the various Hadoop parameters such as heap size. Configurable templates are provided to simplify setting these parameters.
Provision the cluster: Sahara provisions the virtual machines, then installs and configures Hadoop.
Operate on the cluster: add or remove nodes.
Terminate the cluster when it is no longer needed.
For analytics as a service, the general workflow is as follows:
Choose one of the predefined Hadoop versions.
Configure the job: select the job type (Pig, Hive, JAR file, etc.), supply the job script source or the location of the JAR, choose the input and output data locations (initially only Swift is supported), and choose the log location.
Set the limit for the cluster size.
Run the job: all cluster provisioning and job execution happen transparently to the user, and the cluster is removed automatically once the job completes.
Get the results of the computation, for example from Swift.
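
The cluster-provisioning workflow can be pictured as building a cluster description and then adjusting it. The following is a minimal Python sketch of that idea, not Sahara's actual API: the function names (describe_cluster, scale_workers, total_nodes), the dictionary keys, and the example values such as the version string and image ID are all assumptions made for illustration.

```python
import copy


def describe_cluster(name, hadoop_version, image_id, workers, heap_size_mb=1024):
    """Build a plain dictionary describing a cluster (hypothetical layout)."""
    if workers < 1:
        raise ValueError("a cluster needs at least one worker node")
    return {
        "name": name,
        "hadoop_version": hadoop_version,  # choose the Hadoop version
        "image_id": image_id,              # choose the base image
        "node_groups": [                   # size and layout of the cluster
            {"name": "master", "count": 1},
            {"name": "worker", "count": workers},
        ],
        "hadoop_params": {"heap_size_mb": heap_size_mb},
    }


def scale_workers(cluster, delta):
    """Return a copy of the description with worker nodes added or removed."""
    scaled = copy.deepcopy(cluster)
    for group in scaled["node_groups"]:
        if group["name"] == "worker":
            new_count = group["count"] + delta
            if new_count < 1:
                raise ValueError("cannot scale below one worker node")
            group["count"] = new_count
    return scaled


def total_nodes(cluster):
    """Count every node across all node groups."""
    return sum(group["count"] for group in cluster["node_groups"])


# Illustrative values only; none of them refer to a real image or release.
cluster = describe_cluster("demo", "2.7.1", "hypothetical-image-id", workers=3)
bigger = scale_workers(cluster, +2)
print(total_nodes(cluster), total_nodes(bigger))
```

Returning a copy from scale_workers keeps the original description untouched, which makes it easy to compare the cluster before and after scaling.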

Integration of Sahara with Swift:

As discussed in the previous blog, Swift is the standard object storage service in an OpenStack environment, the equivalent of Amazon S3. As a rule, it is deployed on bare-metal machines. Since Hadoop is already available on OpenStack, it is natural to use it to process the data stored in Swift, and a few enhancements are on their way to make this easier.

First, a file system implementation for Swift: with HADOOP-8545 in place, Hadoop jobs can work with Swift as naturally as with HDFS.

Second, there is a change request on the Swift side: Change I6b1ba25b (merged). It implements the ability to list the endpoints of an object, account, or container, which makes it possible to integrate Swift with software that relies on data-locality information to avoid network overhead.
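
To make the idea of a job reading from Swift concrete, here is a small Python sketch that builds the kind of location string a job could be given. It assumes the swift://<container>.<provider>/<object> URL form used by the Hadoop Swift file system; the helper name swift_url, the default provider name "sahara", and the example container and paths are assumptions, so check them against your own deployment.

```python
def swift_url(container, object_path, provider="sahara"):
    """Build a swift:// URL for a job input or output location (sketch)."""
    if not container:
        raise ValueError("container name must not be empty")
    if "/" in container or "." in container:
        # In this sketch a dot would be ambiguous with the provider separator.
        raise ValueError("container name must not contain '/' or '.'")
    return "swift://{}.{}/{}".format(container, provider, object_path.lstrip("/"))


# Hypothetical container and object names, for illustration only.
job_input = swift_url("logs", "/2014/input.txt")
job_output = swift_url("results", "wordcount/output")
print(job_input)
print(job_output)
```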

Pluggable Installation and Monitoring:

In addition to the monitoring capabilities provided by vendor-specific Hadoop management tooling, Sahara will offer pluggable integration with external monitoring systems such as Nagios or Zabbix.

Both the installation and the monitoring tools will be deployed on a stand-alone virtual machine, so that a single instance can manage several clusters at the same time.

Architecture of Sahara:

The architecture of Sahara consists of the following components:

Cluster Configuration Manager:

All of the business logic resides here.

Auth component:

It is responsible for authenticating and authorizing the client.

Data Access Layer or DAL:

The DAL persists Sahara's internal models in the database.

Provisioning of Virtual Machines:

This component is responsible for interacting with Nova and Glance.

Installation:

This is a pluggable mechanism responsible for installing Hadoop on the provisioned virtual machines; management solutions such as Apache Ambari and Cloudera Management Console can be used for this purpose.

REST API:

It exposes the functionality of Sahara through a REST API.
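
As an illustration of how a client might talk to such a REST API using a Keystone token, the Python sketch below only builds an HTTP request and does not send it. The URL layout (/v1.1/<project>/clusters), the port 8386, the host name, and the helper name build_cluster_request are assumptions for illustration; consult the Sahara API reference for the actual paths and request bodies. The X-Auth-Token header is the usual way of passing a Keystone token.

```python
import json
import urllib.request


def build_cluster_request(endpoint, project_id, token, cluster):
    """Prepare (but do not send) a POST request describing a cluster."""
    # Assumed URL layout; verify it against the API reference before use.
    url = "{}/v1.1/{}/clusters".format(endpoint.rstrip("/"), project_id)
    body = json.dumps(cluster).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "X-Auth-Token": token,  # token previously issued by Keystone
            "Content-Type": "application/json",
        },
    )


# Hypothetical endpoint, project, and token values.
req = build_cluster_request(
    "http://controller:8386", "demo-project", "example-token", {"name": "demo"}
)
print(req.get_method(), req.full_url)
```

Keeping the request construction separate from sending it makes the example safe to run without a live OpenStack installation.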

Python Sahara Client:

Like other OpenStack components, Sahara has its own Python client.

Sahara Pages:

The graphical user interface for Sahara is located in Horizon.

Installation of Sahara:

Before moving on to the installation, note that we recommend installing Sahara in a way that keeps your system in a consistent state. To that end, we suggest one of the following three options:

Install Sahara through Fuel
Install Sahara through RDO Havana+
Install Sahara into a virtual environment.

Let’s discuss each of these installation methods:

To install Sahara with Fuel:

Begin the installation and configuration of OpenStack by following the Quickstart.
Enable the Sahara service during installation.

To install Sahara with RDO:

Begin the installation and configuration of OpenStack by following the Quickstart.
Install the sahara-api service using yum:

$ yum install openstack-sahara

Then configure the sahara-api service as needed. The configuration file is located at:

/etc/sahara/sahara.conf

Create the database schema as follows:

$ sahara-db-manage --config-file /etc/sahara/sahara.conf upgrade head

Start the sahara-api service:

$ service openstack-sahara-api start

To install Sahara into a virtual environment:

First, use your operating system's package manager to install the required packages, which depend on the operating system you are using.

For Ubuntu, run the following command:

$ sudo apt-get install python-setuptools python-virtualenv python-dev

For Fedora, run the following command:

$ sudo yum install gcc python-setuptools python-virtualenv python-devel

For CentOS, run the following commands:

$ sudo yum install gcc python-setuptools python-devel

$ sudo easy_install pip

$ sudo pip install virtualenv

Set up a virtual environment for Sahara:

$ virtualenv sahara-venv

The above command installs a Python virtual environment into the sahara-venv directory in your current working directory. It does not require superuser privileges and can be run in any directory to which the current user has write permission.

You can install the latest version of Sahara from PyPI as follows:

$ sahara-venv/bin/pip install sahara

Alternatively, you can get a Sahara archive from http://tarballs.openstack.org/sahara/

And install it using pip:

$ sahara-venv/bin/pip install 'http://tarballs.openstack.org/sahara/sahara-master.tar.gz'

Important Note:

Remember that sahara-master.tar.gz contains the latest changes and may therefore be unstable. We recommend that you browse http://tarballs.openstack.org/sahara/

and choose the latest stable version of Sahara.

Once the installation is finished, create a configuration file from the sample config located at sahara-venv/share/sahara/sahara.conf.sample-basic:

$ mkdir sahara-venv/etc

$ cp sahara-venv/share/sahara/sahara.conf.sample-basic sahara-venv/etc/sahara.conf

Then make the required changes in:

sahara-venv/etc/sahara.conf

If you use Sahara with a MySQL database, then to store large job binaries in Sahara's internal database you must raise the maximum allowed packet size. Edit my.cnf and change the parameter as follows:

…

[mysqld]

…

max_allowed_packet = 256M

After that, restart the MySQL server.
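
If you would like to double-check the setting before restarting, the following Python sketch reads my.cnf-style text and reports max_allowed_packet in bytes. It is an illustrative helper rather than part of Sahara or MySQL: the function names parse_size and max_allowed_packet and the sample text are assumptions, and real files may contain directives such as !includedir that this sketch does not handle.

```python
import configparser

# Size suffixes as powers of 1024.
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(value):
    """Convert a size such as '256M' into a number of bytes."""
    value = value.strip().upper()
    if value and value[-1] in _UNITS:
        return int(value[:-1]) * _UNITS[value[-1]]
    return int(value)


def max_allowed_packet(cnf_text):
    """Return max_allowed_packet from the [mysqld] section, or None if unset."""
    parser = configparser.ConfigParser(
        allow_no_value=True,  # my.cnf allows bare option names
        strict=False,
        interpolation=None,   # treat '%' in values literally
    )
    parser.read_string(cnf_text)
    raw = parser.get("mysqld", "max_allowed_packet", fallback=None)
    return None if raw is None else parse_size(raw)


# Sample text standing in for the relevant part of my.cnf.
sample = """
[mysqld]
skip-name-resolve
max_allowed_packet = 256M
"""

size = max_allowed_packet(sample)
print(size, size >= parse_size("256M"))
```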

Finally, create the database schema as follows:

$ sahara-venv/bin/sahara-db-manage --config-file sahara-venv/etc/sahara.conf upgrade head

To start Sahara, run:

$ sahara-venv/bin/sahara-api --config-file sahara-venv/etc/sahara.conf

This completes the installation of Sahara!

That’s all for today! Please do not forget to leave a comment in the comment section below. Thank you for reading the blog. See you soon with another interesting blog!

Vishwajit Kale

Vishwajit Kale blazed onto the digital marketing scene back in 2015 and is the digital marketing strategist of Hostripples, a company that aims to provide affordable web hosting solutions. Vishwajit is experienced in digital and content marketing along with SEO. He's fond of writing technology blogs, traveling and reading.