Sunday, August 3, 2014

Apache Storm WordCount Example by Hortonworks

The Hortonworks tutorial below gives simple, reliable instructions for setting up Apache Storm and running the basic WordCount example.

http://hortonworks.com/hadoop-tutorial/processing-streaming-data-near-real-time-apache-storm/

THE ONLY MISSING POINT: STOPPING THE JOB, i.e. THE TOPOLOGY
TURN OFF YOUR JOB AFTER A FEW MINUTES. OTHERWISE, BECAUSE THIS IS STREAM PROCESSING, THE TOPOLOGY NEVER FINISHES ON ITS OWN AND YOUR WORKER LOGS /usr/lib/storm/logs/worker*.log WILL KEEP GROWING.

To do so:
1) Go to the Storm UI: http://localhost:8744/
2) Under Topology Summary, click on WordCount
3) On the page that opens, under Topology Actions, click on "Deactivate" (pauses the spouts but keeps the topology on the cluster) or "Kill" (removes the topology)
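
The same thing can be done from the command line with the storm client, which has "kill" and "deactivate" subcommands. Below is a minimal Python sketch of that, assuming the storm executable is on your PATH (that, and the helper names storm_command and stop_topology, are my assumptions and not part of the tutorial):

```python
import subprocess


def storm_command(action, topology, wait_secs=None, storm_bin="storm"):
    """Build the argument list for `storm kill` or `storm deactivate`."""
    if action not in ("kill", "deactivate"):
        raise ValueError("action must be 'kill' or 'deactivate'")
    cmd = [storm_bin, action, topology]
    # `storm kill` accepts -w <seconds>: how long Storm waits between
    # deactivating the spouts and shutting the workers down.
    if action == "kill" and wait_secs is not None:
        cmd += ["-w", str(wait_secs)]
    return cmd


def stop_topology(topology, wait_secs=10):
    """Kill the topology; raises CalledProcessError if the CLI fails."""
    subprocess.run(storm_command("kill", topology, wait_secs), check=True)


# Usage (needs a running cluster, so it is left commented out):
# stop_topology("WordCount")
```

Passing the command as a list avoids any shell quoting issues with the topology name.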






Saturday, August 2, 2014

YARN : Complete picture of Apache Hadoop Ecosystem



The schematic above gives a complete overview of the Apache Hadoop ecosystem, with YARN supporting the following kinds of operations:
- Batch
- Interactive
- Realtime
- Search
- In Memory


The following image shows the broad view of data ingestion, operations and management for the whole process:



Source: http://hortonworks.com/blog/pivotal-hortonworks-shared-vision-operations-enterprise-hadoop/

Thursday, June 19, 2014

Apache YARN - Hadoop 2.x Concept

The following is an excellent explanation of the idea behind YARN by Arun Murthy and Rohit Bakhshi from Hortonworks. It makes clear why the new architecture was so badly needed: it supports not only MapReduce applications but also other applications such as Tez and Storm on the same cluster, on top of the same HDFS.

A must watch!


Tuesday, February 4, 2014

Hive: Add A Partition Only If It Does Not Exist


Alter Partition

  Add Partitions

  ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location1']
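
With IF NOT EXISTS, the statement does nothing when the partition is already there instead of failing, so it is safe to run again and again from a scheduled job. Here is a small Python sketch that builds such a statement. The helper name add_partition_sql, the table page_view, the partition column dt and the location path are only illustrative assumptions, and the values are not escaped, so use it with trusted input only:

```python
def add_partition_sql(table, partition, location=None):
    """Build an ALTER TABLE ... ADD IF NOT EXISTS PARTITION statement.

    partition is a dict of partition column -> value,
    e.g. {"dt": "2014-02-04"}; every value is quoted as a string.
    """
    spec = ", ".join(
        "{}='{}'".format(col, val) for col, val in partition.items()
    )
    sql = "ALTER TABLE {} ADD IF NOT EXISTS PARTITION ({})".format(table, spec)
    if location is not None:
        sql += " LOCATION '{}'".format(location)
    return sql + ";"


# Hypothetical table, partition column and HDFS path
print(add_partition_sql("page_view", {"dt": "2014-02-04"},
                        location="/data/page_view/dt=2014-02-04"))
```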



Hive: Run Hive Script File Having Batch of HQL Queries

Hive Command Line Options

To get help, run "hive -H" or "hive --help".
Usage (as it is in Hive 0.9.0):
usage: hive
 -d,--define <key=value>          Variable substitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
 -h <hostname>                    Connecting to Hive Server on remote host
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable substitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -p <port>                        Connecting to Hive Server on port number
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)


  • Example of running a script non-interactively
       $HIVE_HOME/bin/hive -f /home/my/hive-script.sql
  • Example of running a script non-interactively and in silent mode
       $HIVE_HOME/bin/hive -S -f /home/my/hive-script.sql
  • Example of running a query from the command line
       $HIVE_HOME/bin/hive -e 'select a.col from tab1 a'
  • Example of dumping data out from a query into a file using silent mode
       $HIVE_HOME/bin/hive -S -e 'select a.col from tab1 a' > a.txt

SOURCE: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli
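
The -S, --hivevar and -e options listed above can be combined. The sketch below only builds the argument list for such a call. The helper name hive_query_command and the variable n are my own, and I am assuming the variable is referenced as ${hivevar:n} inside the query:

```python
def hive_query_command(query, hivevars=None, silent=True, hive_bin="hive"):
    """Build the argument list for `hive [-S] [--hivevar k=v ...] -e query`."""
    cmd = [hive_bin]
    if silent:
        cmd.append("-S")
    # One --hivevar option per variable, as in the usage text
    for key, value in (hivevars or {}).items():
        cmd += ["--hivevar", "{}={}".format(key, value)]
    cmd += ["-e", query]
    return cmd


# tab1 and col come from the examples above; n is a hypothetical variable
cmd = hive_query_command(
    "select a.col from tab1 a limit ${hivevar:n}", {"n": 10}
)
print(cmd)
```

The resulting list can be handed to subprocess.run as it is, so the query needs no extra shell quoting.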



Step 1: Write your Hive queries in a file and save it as filename.hql, ending each query with a semicolon.

e.g. the file h1.hql contains just the following line:

select * from ipdpt_tony limit 100;


Step 2:

You can now run h1.hql with the -f option explained above, i.e.

> hive -f h1.hql 


If you want to redirect the output to a file:

> hive -f h1.hql > result.dat


If you want to run the above command from a script written in Shell, Perl, or Python, you can make a system call with the same line, "hive -f h1.hql > result.dat".

e.g. in Python:

import os

os.system("date")  # print a timestamp before the query starts
# Run the script in silent mode and save the query output to d.txt
os.system("hive -S -f h1.hql > d.txt")

There you go!!!

Now you can automate your HQL from any scripting language.
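
One catch with os.system is that it just returns the exit status, so a failed query is easy to miss. Below is a slightly more defensive variant of the same call, sketched with the subprocess module. The helper name run_hive_script is mine, and it assumes the hive executable is on your PATH:

```python
import subprocess


def run_hive_script(script_path, output_path, hive_bin="hive"):
    """Run `hive -S -f script_path`, writing the query output to output_path."""
    with open(output_path, "w") as out:
        # check=True raises CalledProcessError on a non-zero exit code,
        # whereas os.system would only return that code
        subprocess.run(
            [hive_bin, "-S", "-f", script_path], stdout=out, check=True
        )


# Usage (needs a working Hive installation, so it is left commented out):
# run_hive_script("h1.hql", "d.txt")
```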