Pragmatic Hadoop Ecosystem: 2014

Monday, August 18, 2014

Overview of Apache Spark

The following link gives a very helpful and effective overview of Apache Spark:

http://stanford.edu/~rezab/sparkworkshop/slides/holden.pdf

Apache Storm WordCount Example by Hortonworks

Follow the simple and flawless guidelines at the link below for the setup and the basic WordCount example in Apache Storm:

http://hortonworks.com/hadoop-tutorial/processing-streaming-data-near-real-time-apache-storm/

THE ONLY MISSING POINT: STOPPING THE JOB, i.e. THE TOPOLOGY
TURN OFF YOUR JOB AFTER A FEW MINUTES. BECAUSE THIS IS STREAM PROCESSING, THE TOPOLOGY NEVER FINISHES ON ITS OWN, AND YOUR WORKER LOGS /usr/lib/storm/logs/worker*.log WILL KEEP GROWING.

To do so:
1) Go to the Storm UI: http://localhost:8744/
2) Under Topology Summary, click on WordCount
3) On the page that opens, under Topology Actions, click on "Deactivate" (pauses the topology) or "Kill" (stops and removes it)
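
The topology can also be stopped from the command line with "storm kill". The following is only a minimal sketch in Python, assuming the storm client is on the PATH and the topology is named WordCount as in the UI steps above. The helper names build_kill_command and kill_topology are made up for this illustration.

```python
import subprocess


def build_kill_command(topology_name, wait_secs=None):
    """Build the argument list for: storm kill <name> [-w <secs>]."""
    cmd = ["storm", "kill", topology_name]
    if wait_secs is not None:
        # -w sets how long Storm waits between deactivating the spouts
        # and shutting the workers down.
        cmd += ["-w", str(wait_secs)]
    return cmd


def kill_topology(topology_name, wait_secs=None):
    """Run the kill command and return the exit code of the storm client.

    Assumption: the storm executable is installed and on the PATH.
    """
    return subprocess.call(build_kill_command(topology_name, wait_secs))
```

Calling kill_topology("WordCount", wait_secs=10) would run "storm kill WordCount -w 10".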

Saturday, August 2, 2014

YARN : Complete picture of Apache Hadoop Ecosystem

The schematic above gives a complete overview of the Apache Hadoop ecosystem using YARN for the following kinds of operations:

- Batch
- Interactive
- Realtime
- Search
- In Memory

The following image shows a broad view of data ingestion, operations, and management for the whole process.

Source: http://hortonworks.com/blog/pivotal-hortonworks-shared-vision-operations-enterprise-hadoop/

Thursday, June 19, 2014

Apache YARN - Hadoop 2.x Concept

The following is an excellent explanation of the idea behind YARN by Arun Murthy and Rohit Bakshi from Hortonworks. It makes clear why the new architecture was so badly needed: it supports not only MapReduce applications but also other applications such as Tez and Storm on the same cluster and the same HDFS storage.

A must watch!

Tuesday, February 11, 2014

Hue : Oozie workflow : Shell - To run a shell script in HDFS using an Oozie workflow

Click on the Create button.

Thursday, February 6, 2014

Hive: How to see Table Create Definition

show create table TABLE_NAME

Tuesday, February 4, 2014

Hive: Add A Partition Only If It Does Not Exist

Alter Partition

Add Partitions

ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location1']
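
When partitions are added from a script, the statement above can be generated instead of typed by hand. The following is a rough sketch in Python; the function name add_partition_sql and the table and column names in the usage note are hypothetical. It assumes every partition value can be written as a quoted string literal and contains no single quotes.

```python
def add_partition_sql(table, partition_spec, location=None):
    """Return an ALTER TABLE ... ADD IF NOT EXISTS PARTITION statement.

    partition_spec is a list of (column, value) pairs, kept in order.
    Assumption: values contain no single quotes, so no escaping is done.
    """
    # Render each pair as col='value' and join them with commas.
    spec = ", ".join(
        "{0}='{1}'".format(col, val) for col, val in partition_spec
    )
    sql = "ALTER TABLE {0} ADD IF NOT EXISTS PARTITION ({1})".format(table, spec)
    if location is not None:
        # The LOCATION clause is optional, as in the syntax above.
        sql += " LOCATION '{0}'".format(location)
    return sql + ";"
```

For example, add_partition_sql("page_view", [("dt", "2014-02-04"), ("country", "us")]) returns a statement that is safe to run repeatedly, because IF NOT EXISTS turns it into a no-op when the partition is already there.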

Hive: Run Hive Script File Having Batch of HQL Queries

Hive Command Line Options

To get help, run "hive -H" or "hive --help".
Usage (as it is in Hive 0.9.0):

usage: hive
 -d,--define <key=value>          Variable substitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
 -h <hostname>                    Connecting to Hive Server on remote host
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable substitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -p <port>                        Connecting to Hive Server on port number
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)


Example of running a script non-interactively

   $HIVE_HOME/bin/hive -f /home/my/hive-script.sql

Example of running a script non-interactively and in silent mode

   $HIVE_HOME/bin/hive -S -f /home/my/hive-script.sql

Example of running a query from the command line

   $HIVE_HOME/bin/hive -e 'select a.col from tab1 a'

Example of dumping data out from a query into a file using silent mode

   $HIVE_HOME/bin/hive -S -e 'select a.col from tab1 a' > a.txt

SOURCE: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
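
The options listed above can be combined freely, which is easier to manage from a script than by hand. The following is a small sketch in Python that assembles such a command line; the function name build_hive_command is made up for this illustration, and it assumes the hive executable is on the PATH. Only flags from the usage listing (-S, -f, -e, --hivevar) are used.

```python
def build_hive_command(script=None, query=None, silent=False, hivevars=None):
    """Assemble a hive CLI argument list from the options listed above."""
    # Exactly one of script (-f) or query (-e) must be given.
    if (script is None) == (query is None):
        raise ValueError("pass exactly one of script or query")
    cmd = ["hive"]
    if silent:
        cmd.append("-S")
    # Sort the variables so the resulting command is deterministic.
    for key, value in sorted((hivevars or {}).items()):
        cmd += ["--hivevar", "{0}={1}".format(key, value)]
    if script is not None:
        cmd += ["-f", script]
    else:
        cmd += ["-e", query]
    return cmd
```

For example, build_hive_command(script="/home/my/hive-script.sql", silent=True) gives the argument list for "hive -S -f /home/my/hive-script.sql". Note that -S comes before -f, because -f expects the file name as its next argument.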

Step 1: Create a file, write your Hive queries in it, and save it as filename.hql

e.g.

The file h1.hql contains only the following line:

select * from ipdpt_tony limit 100;

Step 2:

You can now run h1.hql using the -f option explained above, i.e.

> hive -f h1.hql

If you want to redirect the output to a file, then

> hive -f h1.hql > result.dat

If you want to run the above command from a script written in Shell, Perl, or Python, you can make a system call with the line "hive -f h1.hql > result.dat"

e.g. in python:

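import os

# Print the current date and time (handy as a timestamp in job logs)
os.system("date")
# Run the HQL script in silent mode and redirect the query output to d.txt
os.system("hive -S -f h1.hql > d.txt")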

There you go!!!

Now you can automate your HQL work from any scripting language.
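
For anything beyond a quick one-off, it helps to know whether Hive actually succeeded, which os.system with a shell redirect does not make obvious. The following is one possible sketch using the subprocess module. The helper names run_to_file and run_hql are hypothetical, and it assumes hive is on the PATH and h1.hql is in the current directory.

```python
import subprocess


def run_to_file(cmd, out_path):
    """Run cmd, write its standard output to out_path, return the exit code."""
    with open(out_path, "wb") as out:
        # stdout goes straight to the file; stderr stays on the console.
        return subprocess.call(cmd, stdout=out)


def run_hql(script, out_path, hive_bin="hive"):
    """Run an HQL script in silent mode and fail loudly on a non-zero exit.

    Assumption: hive_bin names a hive executable that is on the PATH.
    """
    code = run_to_file([hive_bin, "-S", "-f", script], out_path)
    if code != 0:
        raise RuntimeError("hive exited with status {0}".format(code))
    return out_path
```

Calling run_hql("h1.hql", "d.txt") does the same job as the os.system line above, but raises an error if the query fails instead of silently leaving an empty or partial d.txt.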
