Big Data & Hadoop Training Material

Big Data & HadoopArchitecture and Development Raghavan Solium Big Data Consultant [email protected] Day - 1 • Understanding Big Data • • • • What is Big Data Challenges with Big Data Why not RDBMS / EDW? Distributed Computing & MapReduce Model • What is Apache Hadoop • • • • • Hadoop & its eco system Components of Hadoop (Architecture) Hadoop deployment modes Install & Configure Hadoop Hands on with Standalone mode Day - 1 • HDFS – The Hadoop DFS • • • • • Building Blocks Name Node & Data Node Starting HDFS Services HDFS Commands Hands on • Configure HDFS • Start & Examine the daemons • Export & Import files into HDFS • Map Reduce Anatomy • • • • MapReduce Workflow Job Tracker & Task Tracker Starting MapReduce Services Hands on • Configure MapReduce • Start & Examine the daemons Speculative Execution. Zero & One Reducer Distributed Cache Job Chaining HDFS Federation HDFS HA Hadoop Cluster Administration .2 • MapReduce Programming • • • • Java API Data Types Input & Output Formats Hands on • Advance Topics • • • • • • • • • Combiner Partitioner Counters Compression.Day . Day .3 • Pig • • • • • What is Pig Latin? Pig Architecture Install & Configure Pig Data Types & Common Query algorithms Hands On • Hive • • • • What is Hive? Hive Architecture Install & Configure Hive Hive Data Models • Hive Metastore • Partitioning and Bucketing • Hands On . 4 • Sqoop • • • • What is Sqoop Install & Configure Sqoop Import & Export Hands On • Introduction to Amazon Cloud • What is AWS • EC2. S3 • How to leverage AWS for Hadoop .Day . Day .4 • Hadoop Administration • • • • • • • • • • • HDFS Persistent Data Structure HDFS Safe Mode HDFS File system Check HDFS Block Scanner HDFS Balancer Logging Hadoop Routine Maintenance Commissioning & Decommissioning of nodes Cluster Machine considerations Network Topology Security . and machines leave so much data behind summing up to many Terabytes and many times. we want to analyze the data) where valuable business insight is mined out of historical data • But we also live in the age of crazy data where every individuals. it’s mostly machine produced. and it is only expected to grow • Good news. Petabytes. • We are in the age of advanced analytics (that’s where all the problem is. enterprises.What the “BIG” hype about Big Data? • May be it is in the hype.. More data means better precision • More data usually beats better algorithms • But How are we going to analyze? • Traditional database or warehouse systems crawl or crack at these volumes • Inflexible to handle most of these formats • This is the very characteristic of Big Data • Nature of Big Data • Huge volumes of data that can not be handled by traditional database or warehouse systems.. most of it is unstructured and grows at high velocity 7 . Blessing in disguise. real and big value. but the problems are big.. How?. • At the end of 2010 The Large Hadron Collider near Geneva. Switzerland has about 150 petabytes of data Velocity • The New York Stock Exchange generates about one terabyte of new trade data every day • The Large Hadron Collider produce s about 15 petabytes of data per year • Weather sensors collect data every hour at many locations across the globe and gather a large volume of log data 8 .Let’s Define Variety • Sensor Data • Machine logs • Social media data • Scientific data • RFID readers • sensor networks • vehicle GPS traces • Retail transactions Volume • The New York Stock Exchange has several petabytes of data for analysis • Facebook hosts approximately 10 billion photos. 
taking up one petabytes of storage. involves whole data scan. • At these volumes access speed of the data devices will dominate overall analysis time. combing etc • Traditional RDBMS/ EDW cannot handle these with their limited scalability options and architectural limitations • You can incorporate better servers. split your data file into small enough pieces across the drives and do parallel reads and processing • Hardware Reliability (Failure of any drive) is a challenge • Resolving Data interdependency between drives is a notorious challenge • Number of disk drives that can be added to a server is limited • Analysis • Much of Big Data is unstructured.Inflection Points • Data Storage • Big Data ranges from several Terabytes to Petabytes. processors and throw in more RAM but there is a limit to it 9 . joining. Traditional RDBMS/ EDW cannot handle them • Lot of Big Data analysis is adhoc in nature. • A Terabyte of data requires 2. referencing itself.5 hours to be read from a 100 MBPS drive • Writing will even be slower • Is divide the data and rule a solution here? • Have multiple disk drives. MapReduce are such models Let us see what MapReduce is 10 .Inflection Points • We need a Drastically different approach • A distributed file system with high capacity and high reliability • A process engine that can handle structure / Unstructured data • A computation model that can operate on distributed data and abstracts data dispersion • PRAM. V1) (K3. V3) Input file Split s Output files Computer 1 Split 1 Input File / Data Sort Map Reduce Computer 2 Split 2 Computer 1 Sort Map Computer 2 Reduce Computer 3 Split 3 Part 1 Part 2 Sort Map 11 . V2) Intermediate Key/Value pairs (K1.What is MapReduce Model (K2. • Map processes transforms input key/ value pairs to an intermediate key/value pairs. 12 . • There are two stages of processing in MapReduce model to achieve the final result. ‘Map’ and ‘Reduce’. • The MapReduce model expects the input data to be split and distributed to the machines on the cluster so the each split can be processed independently and in parallel. MapReduce framework passes this output to reduce processes which will transform this to get the final result which again will be in the form of key/ Value pairs. • Map processes the input splits. • The model treats data at every stage as Key and Values pairs.What is MapReduce Model • MapReduce is a computation model that supports parallel processing on distributed data using a cluster of computers. Every computer in the cluster can run independent map and reduce processes. transforming one set of Key/ Value pairs into different set of Key/ value pairs to arrive at the end result. The output of map is distributed again to the reduce processes to combine the map output to give final expected result. MapReduce Model • MapReduce should have • Ability to initiate and monitor parallel processes and coordinate between them • A mechanism to pass the same key outputs from map processes to a single reduce process • Recover from any failures transparently 13 . Big Data Universe Evolving and expanding………. 14 .. So what’s going to happen to our good friend RDBMS? • We don’t know! As of now it looks like they are going to coexists • Hadoop is a batch oriented analysis system. read many times Dynamic schema Low Linear (Some of these things are debatable as the Big Data and Hadoop eco systems are fast evolving and moving to higher degree of maturity and flexibility. For example Hbase brings in the ability to point queries ) 15 . 
It is not suitable for low-latency data operations
• MapReduce systems can output the analysis outcome to RDBMS/EDW systems for online access and point queries

RDBMS / EDW compared to MapReduce

               Traditional RDBMS             MapReduce
  Data size    Gigabytes                     Petabytes
  Access       Interactive and batch         Batch
  Updates      Read and write many times     Write once, read many times
  Structure    Static schema                 Dynamic schema
  Integrity    High                          Low
  Scaling      Nonlinear                     Linear

(Some of these points are debatable, as the Big Data and Hadoop ecosystems are evolving fast and moving to a higher degree of maturity and flexibility; HBase, for example, brings in the ability to serve point queries.)

Some Use Cases
• Web/content indexing
• Finance & insurance – fraud detection, sentiment analysis
• Retail – trend analysis, personalized promotions
• Scientific simulation & analysis – aeronautics, particle physics, DNA analysis
• Machine learning
• Log analytics

What is Apache Hadoop and how can it help with Big Data?
• It is an open-source Apache project for handling Big Data
• It addresses the data storage and analysis (processing) problems through its HDFS file system and its implementation of the MapReduce computation model
• It is designed for massive scalability and reliability
• The model enables leveraging cheap commodity servers, keeping the cost in check

Who Loves it?
• Yahoo! runs 20,000 servers running Hadoop
• The largest Hadoop cluster is 4,000 servers with 16 PB of raw storage (is it Yahoo?)
• Facebook runs 2,000 Hadoop servers, with 24 PB of raw storage and 100 TB of raw logs per day
• eBay and LinkedIn have production use of Hadoop
• Sears retail uses Hadoop

Hadoop & its ecosystem (architecture diagram): Oozie (workflow), HBase, ZooKeeper (coordination service), Pig, Hive, and Mahout sit on top of the MapReduce framework and HDFS (Hadoop Distributed File System); Sqoop feeds in structured data while Flume feeds in log files and other unstructured data.

Hadoop & its ecosystem
• Avro: A serialization system for efficient, cross-language RPC and persistent data storage.
• MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
• HDFS: A distributed file system that runs on large clusters of commodity machines.
• Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
• Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (translated by the runtime engine to MapReduce jobs) for querying the data.
• HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
• ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
• Sqoop: A tool for efficient bulk transfer of data between structured data stores (such as relational databases) and HDFS.
• Oozie: A service for running and scheduling workflows of Hadoop jobs (including MapReduce, Pig, Hive, and Sqoop jobs).

Hadoop Requirements
• Supported platforms
  • GNU/Linux is supported as a development and production platform
  • Win32 is supported for development only; Cygwin is required for running on Windows
• Required software
  • Java 1.6.x
  • ssh must be installed and sshd must be running (for launching the daemons on the cluster with passwordless entry)
• Development environment
  • Eclipse 3.5 or above

Lab Requirements
• Windows 7, 64-bit OS, minimum 4 GB RAM
• VMware Player 5.0
• Linux VM: Ubuntu 12.04 LTS (user: hadoop, password: hadoop123)
• Java 6 installed on the Linux VM
• OpenSSH installed on the Linux VM
• Putty – for opening Telnet/SSH sessions to the Linux VM
• WinSCP – for transferring files between Windows and the VM
• Eclipse 3.5
Hands On • Using the VM • Install & Configure hadoop • • • • • Install & Configure ssh Set up Putty & WinScp Set up lab directories Install open JDK Install & Verify hadoop 22 . Starting VM 23 . Starting VM Enter user ID/ Password : hadoop / hadoop123 24 . ssh/authorized_keys >>chmod 700 ~/.Install & Configure ssh • Install ssh >>sudo apt-get install ssh • Check ssh installation >>which ssh >>which sshd >>which ssh-keygen • Generate ssh Key >>ssh-keygen -t rsa -P ‘’ -f ~/.ssh/authorized_keys 25 .pub ~/.ssh/id_rsa • Copy public key as an authorized key (equivalent to slaves) >>cp ~/.ssh/id_rsa.ssh >>chmod 600 ~/. Verify ssh • Verify SSH by logging into target (localhost here) >>ssh localhost • This command should log you into the machine localhost 26 . Accessing VM Putty & WinSCP • Get IP address of the Linux VM >>ifconfig • Use Putty to telnet to VM • Use WinSCP to FTP to VM 27 . Lab – VM Directory Structure • User Home Directory for user “hadoop” (Created default by OS) /home/hadoop • Working directory for the lab session /home/hadoop/lab • Downloads directory (installables downloaded and stored under this) /home/hadoop/lab/downloads • Data directory (sample data is stored under this) /home/hadoop/lab/data • Create directory for installing the tools /home/hadoop/lab/install 28 . Install & Configure Java • Install Open JDK >>sudo apt-get install openjdk-6-jdk • Check Installation >>java -version • Configure Java Home in environment • Add a line to .bash_profile to set Java Home export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64 • Hadoop will use this during runtime . 0.bash_profile • Add below two lines and execute bash profile >>export HADOOP_INSTALL=~/lab/install/hadoop-1.0.0.0.tar.0.3/hadoop-1.3 • Configure environment in .0.3” >>ls -l hadoop-1.techartifact.gz • FTP the file to Linux VM into ~/lab/downloads folder • Untar (execute the following commands) >>cd ~/lab/install >>tar xvf ~/lab/downloads/hadoop-1.3 >>export PATH=$PATH:$HADOOP_INSTALL/bin >>. .3.bash_profile 30 .com/mirror/hadoop/common/hado op-1.3.tar.gz • Check the extracted directory “hadoop-1.Install Hadoop • Download Hadoop Jar • http://apache. 3.jar • Will provide the list of classes in the above jar file >>hadoop jar hadoop-examples-1.jar wordcount <input directory> <output directory> 31 .3.0.Run an Example • Verify Hadoop installation >> hadoop version • Try the following >>hadoop • Will provide command usage >>cd $HADOOP_INSTALL >>hadoop jar hadoop-examples-1.0. Component of Core Hadoop Client Name Node Job Tracker Networked Secondary Name Node Data Node Data Node Data Node Data Node Task Tracker Task Tracker Task Tracker Task Tracker Map Map Map Map Red Red Red Red (Hadoop supports many other file systems other than HDFS itself . For one to leverage Hadoop’s abilities completely HDFS is one of the most reliable file systems) 32 . Components of Core Hadoop At a high-level Hadoop architectural components can be classified into two categories • Distributed File management system – HDFS This has central and distributed sub components • NameNode – Centrally Monitors and controls the whole file system • DataNode – Take care of the local file segments and constantly communicates with NameNode • Secondary NameNode – Do not confuse. 
This just backs up the file system status from the NameNode periodically • Distributed computing system – MapReduce Framework This again has central and distributed sub components • Job Tracker – Centrally Monitors the submitted Job and controls all processes running on the nodes (computers) of the cluster. This is not a NameNode Backup. This constantly communicates with Job Tracker daemon to report the task progress When the Hadoop system is running in a distributed mode all these daemons would be running in the respective computer 33 . This communicated with Name Node for file system access • Task Tracker – Take care of the local job execution on the local file segment. This talks to DataNode for file information. Job Tracker & Secondary Name Node on a separate machine. 34 Rest of the machines in the cluster run a Data Node and Task Tracker Daemons .Hadoop Operational Modes Hadoop can be run in one of the three modes • Standalone (Local) Mode • No daemons launched • Everything runs in single JVM • Suitable for development • Pseudo Distributed Mode • All daemons are launched on a single machine thus simulating a cluster environment • Suitable for testing & debugging • Fully Distributed Mode • The Hadoop daemons run in a cluster environment • Each daemons run on machines respectively assigned to them • Suitable for Integration Testing / Production A typical distributed mode runs Name Node on a separate machine. and the task log for the tasktracker child process 35 .properties Java Properties Properties for controlling how metrics are published in Hadoop log4j.xml Hadoop configuration XML Configuration settings for MapReduce daemons: the jobtracker and the tasktrackers masters Plain text List of machines (one per line) that run a secondary namenode slaves Plain text List of machines (one per line) that each run a datanode and a tasktracker hadoop-metrics .xml Hadoop configuration XML Configuration settings for HDFS daemons: the namenode.properties Java Properties Properties for system logfiles. and the datanodes mapred-site.Hadoop Configuration Files The configuration files can be found under “conf” Directory File Name Format Description hadoop-env. the namenode audit log.sh Bash script Environment variables that are used in the scripts to run Hadoop core-site. the secondary namenode.xml Hadoop configuration XML Configuration settings for Hadoop Core. such as I/O settings that are common to HDFS and MapReduce hdfs-site. tracker mapredsite.job.xml NA 1 3 (default) mapred.xml local (default) Localhost:8021 Jobtracket:8021 36 .name core-site.Key Configuration Properties Property Name Conf File Standalone Pseudo Distributed Fully Distributed fs.default.xml file:/// (default) hdfs://localhost/ hdfs://namenode/ dfs.replication hdfs-site. HDFS 37 . Design of HDFS • HDFS is hadoop’s Distributed File System • Designed for storing very large files (of sized petabytes) • Single file can be stored across several the disks • Designed for streaming data access patterns • Not suitable for low-latency data access • Designed to be highly fault tolerant hence can run on commodity hardware 38 . HDFS Concepts • Like in any file system HDFS stores files by breaking it into smallest units called Blocks • The default HDFS block size is 64 MB • The large block size helps in maintaining high throughput • Each Block is replicated across multiple machines in the cluster for redundancy 39 . 
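Both the block size and the replication factor are ordinary HDFS configuration properties. As a hedged illustration (the names below, dfs.block.size in bytes and dfs.replication, are the Hadoop 1.x property names; later releases rename the former to dfs.blocksize), an hdfs-site.xml fragment raising the block size to 128 MB could look like this:

<configuration>
  <!-- Block size in bytes: 128 MB instead of the 64 MB default -->
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
  <!-- Keep three copies of every block (the usual production default) -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

With these settings a 200 MB file would occupy two blocks (128 MB + 72 MB); the final, partially filled block consumes only the space its data actually needs.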
Design of HDFS .Daemons Get block information for the file Secondary Name Node Name Node Networked Client Read Blocks Data Node Hadoop Cluster Data Node Data Node Data Node Data Blocks 40 . the HDFS is inaccessible • Secondary NameNode • Not a backup for the NameNode • Just helps in merging filesystem image with edit log to avoid edit log 41 becoming too large .Daemons The HDFS file system is managed by two daemons • NameNode & DataNode • NameNode & DataNode function in master/ slave fashion • NameNode Manages File system namespace • Maintains file system tree and the metadata of all the files and directories • Filesystem Image • Edit log • Datanodes store and retrieve the blocks for the files when they are told by NameNode • NameNode maintains the information on which DataNodes all the blocks for a given file are located • DataNodes report to NameNode periodically with the list of blocks they are storing • With NameNode off.Design of HDFS . Hands On • Configure HDFS file system for hadoop • Format HDFS • Start & Verify HDFS services • Verify HDFS • Stop HDFS services • Change replication 42 . xml <?xml version="1.default.sh export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64 • The property is used on the remote machines • Set up core-site.default.core-site.xml --> <configuration> <property> <name>fs.0"?> <!-. Name node runs at port 8020 by default if no port is specified 43 .xml (Pseudo Distributed Mode) • Set JAVA_HOME in conf/hadoop-env.Configuring HDFS – core-site.name</name> <value>hdfs://localhost/</value> </property> </configuration> Add “fs.name” property under configuration tag to specify NameNode location. “localhost” for Pseudo distributed mode. Starting HDFS • Format NameNode >>hadoop namenode -format • Creates empty file system with storage directories and persistent data structures • Data nodes are not involved • Start dfs services & verify daemons >>start-dfs.sh >>jps • List / Check HDFS >>hadoop >>hadoop >>hadoop fs -ls fsck / -files -blocks fs -mkdir testdir 44 . Verify HDFS • List / Check HDFS again >>hadoop >>hadoop fs -ls fsck / -files -blocks • Stop dfs services >>stop-dfs.sh >>jps • No java processes should be running 45 . replication</name> <value>1</value> </property> </configuration> Add “dfs.0"?> <!-.hdfs-site.hdfs-site. value is set to 1 so that no replication is done 46 .xml (Pseudo Distributed Mode) <?xml version="1.xml --> <configuration> <property> <name>dfs.replication” property under configuration tag.Configuring HDFS . tmp.tmp.dir Directories where DataNode stores blocks.xml (Pseudo Distributed Mode) Property Name Description Default Value dfs.dir Directories for NameNode to store it’s persistent data (Comma separated directory names).name.dir}/dfs/ name dfs.hdfs-site.tmp.Configuring HDFS .dir}/dfs/ namesecondary 47 .data.checkpoint. Each block is stored in only one of these directories ${hadoop.dir}/dfs/ data fs. A copy of the checkpoint is stored in each of the listed directory ${hadoop.dir Directories where secondary NameNode stores checkpoints. A copy of metadata is stored in each of the listed directory ${hadoop. 
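For the "Change replication" step in the hands-on above, one possible approach (the file path and the new factor are only illustrative) is either to edit dfs.replication in hdfs-site.xml and restart DFS, or to change the factor of data already in HDFS with setrep:

# Change the replication factor of an existing file to 2 and wait for it to take effect
hadoop fs -setrep -w 2 /user/hadoop/testdir/somefile.txt

# Confirm the new factor (look at the replication reported per file/block)
hadoop fsck /user/hadoop/testdir/somefile.txt -files -blocks

Editing hdfs-site.xml only affects files written after the change; setrep is the way to adjust files that are already stored.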
Basic HDFS Commands • Creating Directory • hadoop fs -mkdir <dirname> • Removing Directory • hadoop fs -rm <dirname> • Copying files to HDFS from local filesystem • hadoop fs -copyFromLocal <local dir>/<filename> <hdfs dir Name>/<hdfs file name> • Copying files from HDFS to local filesystem • hadoop fs -copyToLocal <local dir>/<filename> <hdfs dir Name>/<hdfs file name> • List files and directories • hadoop fs -ls <dir name> • List the blocks that make up each file in HDFS • hadoop fsck / -files -blocks 48 Hands On • Create data directories for • NameNode • Secondary NameNode • DataNode • Configure the nodes • Format HDFS • Start DFS service and verify daemons • Create directory “retail” in HDFS • Copy files from lab/data/retail directory to HDFS retail directory • Verify the blocks created • Do fsck on HDFS to check the health of HDFS file system 49 Create data directories for HDFS • Create directory for NameNode >>cd ~/lab >>mkdir hdfs >>cd hdfs >>mkdir namenode >>mkdir secondarynamenode >>mkdir datanode >>chmod 755 datanode 50 dir</name> <value>/home/hadoop/lab/hdfs/namenode</value> <final>true</final> </property> <property> <name>dfs.Configuring data directories for HDFS • Configure HDFS directories • Add the following properties in hdfs-site.dir</name> <value>/home/hadoop/lab/hdfs/datanode</value> <final>true</final> </property> <property> <name>fs.xml <property> <name>dfs.name.dir</name> <value>/home/hadoop/ lab/hdfs/secondarynamenode</value> <final>true</final> </property> 51 .data.checkpoint. HDFS Web UI • Hadoop provides a web UI for viewing HDFS • Available at http://<VM host IP>:50070/ • Browse file system • Log files 52 . MapReduce 53 . MapReduce • A distributed parallel processing engine of Hadoop • Processes the data in sequential parallel steps called • Map • Reduce • Best run with a DFS supported by hadoop to exploit it’s parallel processing abilities • Has the ability to run on a cluster of computers • Each computer called as a node • Input/output data at every stage is handled in terms of key/value pairs • Key/ Value can be chosen by programmer • Mapper output with the same key are sent to the same reducer • Input to Reducer is always sorted by key • Number of mappers and reducers per node can be configured 54 . 1 and. 1 down. 1 Computer 1 Reduce Computer 2 Reduce up. 1 up. 1 go. 1 the. 1 go. 1 down. 1 down. 1 the. 1 and Down go If 2 2 3 1 up you the Health Weight 2 1 2 1 1 weight. 1 up. 1 if. 1 and. 1 health. V3) and. 1 55 . 1 health. 1 go. 1 you. V1) Input file Split s on the DFS Computer 1 If you go up and down Input file If you go up and down The weight go down and the health go up Map Computer 2 The weight go down and Map Computer 3 the health go up Map Intermediate Key/Value pairs Output (K2. 1 down. 1 up. 1 go.MapReduce Workflow – Word count (K1. 1 go. 1 the. 1 you. 1 weight. 1 if. 1 go. 1 and. V2) (K3. 1 the. Daemons The MapReduce system is managed by two daemons • JobTracker & TaskTracker • JobTracker & TaskTracker function in master/ slave fashion • JobTracker coordinates the entire job execution • TaskTracker runs the individual tasks of map and reduce • JobTracker does the bookkeeping of all the tasks run on the cluster • One map task is created for each input split • Number of reduce tasks is configurable • mapred.Design of MapReduce .reduce.tasks 56 . Daemons Client Job Tracker Networked HDFS Task Tracker Task Tracker Task Tracker Task Tracker Map Red Map Red Map Red Map Red Map Red Map Red Map Red Map Red 57 .Design of MapReduce . 
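Before moving on to the MapReduce daemons' configuration, here is a possible command sequence for the HDFS hands-on described earlier in this section (creating a retail directory and loading the sample files). The local paths assume the lab layout under /home/hadoop/lab/data and the file names used later in the course (txn.csv, custs.csv):

# Create the target directory in HDFS (relative paths resolve under /user/hadoop)
hadoop fs -mkdir retail

# Copy the sample transaction and customer files from the local lab directory
hadoop fs -copyFromLocal /home/hadoop/lab/data/retail/txn.csv retail/txn.csv
hadoop fs -copyFromLocal /home/hadoop/lab/data/retail/custs.csv retail/custs.csv

# Verify the files and see which blocks were created
hadoop fs -ls retail
hadoop fsck /user/hadoop/retail -files -blocks

# Overall health of the file system
hadoop fsck /

The fsck report lists each file's blocks together with their replication, which is a quick way to confirm the dfs.replication setting chosen earlier.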
Hands On • Configure MapReduce • Start MapReduce daemons • Verify the daemons • Stop the daemons 58 . xml .mapred-site.tracker</name> <value>localhost:8021</value> </property> </configuration> Add “mapred.tracker” property under configuration tag to specify JobTracker location.job.mapred-site. “localhost:8021” for Pseudo distributed mode.xml --> <configuration> <property> <name>mapred. 59 .job.0"?> <!-.Pseudo Distributed Mode <?xml version="1. Starting Hadoop – MapReduce Daemons • Start MapReduce Services >>start-mapred.sh >>jps • Stop MapReduce Services >>stop-mapred.sh >>jps 60 MapReduce Programming 61 MapReduce Programming • Having seen the functioning of MapReduce, to perform a job in hadoop a programmer needs to create • A MAP function • A REDUCE function • A Driver to communicate with the framework, configure and launch the job Execution Environment Map Reduce Framework Execution Parameters Map Map Reduce Framework Map Red uce Map Reduce Framework Output Red Driver 62 147.Corbett.Justin.4006641.4005497.Lacrosse.43.Rachel.4003754.csv Cust ID Fst Nam Lst Nam Age Profession 4009983.Water Sports.Water Sports.19.credit 00999991.06-09-2011.San Antonio.78.California.66.Windsurfing.097.66.Human resources assistant 63 .csv Txn ID TXN Date Cust ID Amt Category Sub-Cat Addr-1 Addr-2 Credit/ Cash 00999990.35.08-19-2011.Melvin.San Diego.Surfing.Coach 4009984.126.10-09-2011.credit • Customer details in custs.Loan officer 4009985.Tate.Retail Use case • Set of transactions in txn.Bellevue.credit 00999992.Jordan.Texas.Team Sports.Washington. V1 are the types of the input key / value pair • K2. K2.Map Function • The Map function is represented by Mapper class. to which the transformed key/ values can be written to 64 . V1. V2> • K1. which declares an abstract method map() • Mapper class is generic type with four type parameters for the input and output key/ value pairs • Mapper <K1. V2 are the types of the output key / value pair • Hadoop provides it’s own types that are optimized for network serialization • Text • LongWritable • IntWritable Corresponds to Java String Corresponds to Java Long Corresponds to Java Int • The map() method must be implemented to achieve the input key/ value transformation • map() is called by MapReduce framework passing the input key/ values from the input split • map() is provided with a context object in it’s call. hasMoreTokens()) { word.toString().toLowerCase()).write(word. one).. " \t\n\r\f. } } } 65 . InterruptedException { StringTokenizer itr = new StringTokenizer(value. while (itr. private Text word = new Text(). Text. Text. Text value.set(itr. @Override public void map(LongWritable key. IntWritable> { private final static IntWritable one = new IntWritable(1).:?![]'"). Context context) throws IOException.nextToken().Mapper – Word Count public static class TokenizerMapper extends Mapper<LongWritable. context.. which declares an abstract method reduce() • Reducer class is generic type with four type parameters for the input and output key/ value pairs • Reducer <K2.Reduce Function • The Reduce function is represented by Reducer class. K3. V2. V3 are the types of the output key/ value pair • The reduce() method must be implemented to achieve the desired transformation of input key/ value • reduce() method is called by MapReduce framework passing the input key/ values from out of map phase • MapReduce framework guarantees that the records with the same key from all the map tasks will reach a single reduce task • Similar to the map. 
this type of this pair must match the output types of Mapper • K3. V2 are the types of the input key/ value pair. reduce method is provided with a context object to which the transformed key/ values can be written to 66 . V3> • K2. IntWritable> { @Override public void reduce(Text key. } context. for (IntWritable value : values) { sum += value. IntWritable. } } 67 .get(). Context context) throws IOException.write(key. Iterable<IntWritable> values. Text. InterruptedException { int sum = 0. new IntWritable(sum) ).Reducer – Word Count public static class IntSumReducer extends Reducer< Text. • Job object gives you control over how the job is run • Set the jar file containing mapper and reducer for distribution around the cluster Job.class). job.class). • Set Mapper and Reducer output types • Set Input and Output formats • Input key/ value types are controlled by the Input Formats 68 . path).class). path).addInputPath(job.setOutputPath(job.setJarByClass(wordCount.Driver – MapReduce Job • The job object forms the specification of a job Configuration conf = new Configuration(). • Input/ output location is specified by calling static methods on FileInputFormat and FileOutputFormat classes by passing the job FileInputFormat. FileOutputFormat. • Set Mapper and Reducer classes job. Job job = new Job(conf. “Word Count”).setMapperClass(TokenizerMapper.setReducerClass(IntSumReducer. setMapOutputKeyClass(Text.addInputPath(job.class). new Path(args[0]) ). System. FileInputFormat.MapReduce Job – Word Count Public class WordCount { public static void main(String args[]) throws Exception { if (args.setOutputPath(job.setOutputValueClass(IntWritable.class). job. job.err.class). “Word Count”).class).class). job.println(“Usage: WordCount <input Path> <output Path>”). Job job = new Job(conf.setMapOutputValueClass(IntWritable. } Configuration conf = new Configuration().class). job. FileOutputFormat. } } 69 .length != 2) { System. job. job.setJarByClass(WordCount.exit(job.setMapperClass(TokenizerMapper.setOutputKeyClass(Text.waitForCompletion(true) ? 0 : 1).setReducerClass(IntSumReducer. new Path(args[1]) ).class). job. System.exit(-1). The MapReduce Web UI • Hadoop provides a web UI for viewing job information • • • • • Available at http://<VM host IP>:50030/ follow job’s progress while it is running find job statistics View job logs Task Details 70 . 71 .class).setCombinerClass(<combinerclassname>.Combiner • Combiner function helps to aggregate the map output before passing on to reduce function • Reduces intermediate data to be written to disk • Reduces data to be transferred over network • Combiner is represented by same interface as Reducer • Combiner for a job is specified as job. setOutputValueClass(IntWritable. “Word Count”). job. job.class). new Path(args[1]) ).class).waitForCompletion(true) ? 0 : 1).class). job. Job job = new Job(conf. job.class).class).class). FileOutputFormat. System.setCombinerClass(IntSumReducer. System.class). job. new Path(args[0]) ). job.length != 2) { System.setOutputPath(job.addInputPath(job.class). Otherwise a separate combiner needs to be created FileInputFormat.setMapperClass(TokenizerMapper. } } 72 .setMapOutputValueClass(IntWritable. } Configuration conf = new Configuration().Word Count – With Combiner Public class WordCount { public static void main(String args[]) throws Exception { if (args. job.exit(job.setOutputKeyClass(Text. 
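A caveat worth spelling out before the combiner-enabled word count below: because the framework may run the combiner zero, one, or many times on the map output, the operation must be commutative and associative for the result to stay correct. Counting and summing qualify: (1 + 2) + 6 = 1 + (2 + 6) = 9. Averaging does not: avg(avg(1, 2), 6) = avg(1.5, 6) = 3.75, while avg(1, 2, 6) = 3. An average therefore has to be carried through the combiner as partial (sum, count) pairs and only divided in the reducer.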
In case of cumulative & associative functions the reducer can work as combiner.err.println(“Usage: WordCount <input Path> <output Path>”).setJarByClass(WordCount.setReducerClass(IntSumReducer.exit(-1). job.setMapOutputKeyClass(Text. setPartitionerClass(<customPartitionerClass>. VALUE value.Partitioning • Map tasks partition their output keys by the number of reducers • There can be many keys in a partition • All records for a given key will be in a single partition • A Partitioner class controls partitioning based on the Key • Hadoop uses hash partition by default (HashPartitioner) • The default behavior can be changed by implementing the getPartition() method in the Partitioner (abstract) class public abstract class Partitioner<KEY. } • A custom partitioner for a job can be set as job. VALUE> { public abstract int getPartition(KEY key. 73 . int numPartitions).class). */ //return (ch. IntWritable value. //round robin based on ASCI value return 0.Partitioner Example public class WordPartitioner extends Partitioner <Text. IntWritable>{ @Override public int getPartition(Text key.1). /*if (ch.charAt(0) % numPartitions). } else if (ch.matches("[abcdefghijklm]")) { return 0. // default behavior } } 74 .matches("[nopqrstuvwxyz]")) { return 1. int numPartitions) { String ch = key. } return 2.substring(0.toString(). setNumReduceTasks().One or Zero Reducers • Number of reducers is to be set by the developer job.reduce. OR mapred.tasks=10 • One Reducer • Maps output data is not partitioned. all key /values will reach the only reducer • Only one output file is created • Output file is sorted by Key • Good way of combining files or producing a sorted output for small amounts of data • Zero Reducers or Map-only • The job will have only map tasks • Each mapper output is written into a separate file (similar to multiple reducers case) into HDFS • Useful in cases where the input split can be processed independent of other parts 75 . Comparable<T> { } • Keys are compared with each other during the sorting phase • Respective registered RawComparator is used comparison public interface RawComparator<T> extends Comparator<T> { public int compare(byte[] b1. int l1. void readFields(DataInput in) throws IOException. int l2). int s1.Data Types • Hadoop provides it’s own data types • Data types implement Writable interface public interface Writable { void write(DataOutput out) throws IOException. byte[] b2. } 76 . int s2. } • Optimized for network serialization • Key data types implement WritableComparable interface which enables key comparison public interface WritableComparable<T> extends Writable. Data Types Writable wrapper classes for Java primitives Java primitive Writable Serialized size implementation (bytes) Boolean Byte Short Int BooleanWritable ByteWritable ShortWritable IntWritable VIntWritable FloatWritable LongWritable VLongWritable DoubleWritable Float Long Double 1 1 2 4 1–5 4 8 1–9 8 NullWritable • Special writable class with zero length serialization • Used as a place holder for a key/ value when you do not need to use that position 77 . define(<custDataType>. } • WritableComparator is a general purpose RawComparator • Custom comparators for a job can also be set as job. implementation of 78 .class). byte[] b2. int s2. new CustComparator()).class).setGroupingComparatorClass(GroupComparator. int l2) { } } static { WritableComparator.class.setSortComparatorClass(KeyComparator. } @Override public int compare(byte[] b1.class). int s1. int l1. 
job.Data Types (Custom) • Custom Data types (Custom Writables) • Custom and complex data types can be implemented per need to be used as key and values • key data types must implement WritableComparable • Values data types need to implement at least Writable • Custom types can implement raw comparators for speed public static class CustComparator extends WritableComparator { public CustComparator () { super(<custDataType>. class).setInputFormatClass(<Input Format Class Name>. the input data is divided into equal chunks called splits • Each split is processed by a separate map task • Each split in turn is divided into records based on Input Format and passed with each map call • The Key and the value from the input record is determined by the Input Format (including types) • All input Formats implement InputFormat interface • Input format for a job is set as follows • job.Input Formats • An Input Format determines how the input data to be interpreted and passed on to the mapper • Based on an Input Format. • Two categories of Input Formats • File Based • Non File Based 79 . Input Formats 80 . addInputPath(job. path) • Each Split corresponds to either all or part of a single file except for CombineFileInputFormat • File Input Formats • Text Based • • • • TextInputFormat KeyValueTextInputFormat NLineInputFormat CombineFileInputFormat (meant for lot of small files to avoid too many splits) • Binary • SequenceFileInputFormat 81 .File Input Formats • FileInputFormat is the base class for all file based data sources • Implements InputFormat interface • FileInputFormat offers static convenience methods for setting a Job’s input paths FileInputFormart. 320.450.Karev.Kumar.39.Kumar.1st Main.27.File Input Formats .lombard.2nd Main.2nd Block.39.Hackensack.Sunny Brook.TextInputFormat • Each line is treated as a record • Key is byte offset of the line from beginning of the file • Value is the entire line Input File 2001220.Lonely Beach.325.lombard.NY.Web Designer LongWritable Input to Mapper Text K1 = 0 V1 = “2001220.Vinay.Lonely Beach.Sunny Brook.320.manager 2001221.27.35.450.Sys Admin” K3 = 102 V3 = “2001223.NY.Hackensack.2nd Main.2nd Block.NJ.Yong.Web Designer” • TextInputFormat is the default input format if none specified 82 .peter.peter.1st Main.NJ.Yong.Vinay.325.NJ.John .NJ.Sys Admin 2001223.35.John .Karev.manager” K2 = 54 V2 = “2001221. linespermap • CobineFileInputFormat • A Splits can consist of multiple files (based on max split size) • Typically used for lot of small files • This is an abstracts class and one need to implement to use 83 .input.separator • NLineInputFormat • Each File Splits contains fixed number of lines • The default is one. which can be changed by setting the property mapreduce.lineinputformat.input.keyvaluelinerecordreader.Others • KeyValueTextInputFormat • Splits each line into key/ value based on specified delimiter • Key is the part of the record before the first appearance of the delimiter and rest is the value • Default delimiter is tab character • A different delimiter can be set through the property mapreduce.key.File Input Formats .value. values into Text Objects • SequnceFileAsBinaryInputFormat • Retrieves the keys and values as BytesWritable Objects 84 .File Input Formats . 
which makes a sequence file splittable • The key / values can be stored compressed or without • Two types of compressions • Record • Block • SequenceFileInputFormat • Enables reading data from a Sequence File • Can read MapFiles as well • Variants of SequnceFileInputFormat • SequnceFileAsTextInputFormat • Converts key.SequenceFileInputFormat • Sequence File • provides persistent data structure for binary key-value pairs • Provides sync points in the file at regular intervals. DBInputFormat • DBInputFormat is an input format to read data from RDBMS through JDBC 85 .Non File Input Formats . Output Formats OutputFormat class hierarchy 86 . nnnnn is an designating the part number. • One file per reducer is created (default file name : part-r-nnnnn). starting from zero • TextOutputFormat • SequenceFileOutputFormat • SequenceFileAsBinaryOutputFormat • MapFileOutputFormat • NullOutputFormat • DBOutputFormat • Output format to dump output data to RDBMS through JDBC 87 .setOutputFormatClass(TextOutputFormat.setOutputPath(job. • FileBased • FileOutputFormat is the Base class • FileOutputFormat offers static method for setting output path FileOutputFormat.Output Formats .class).Types • Output Format for a job is set as job. path). class) Instead of Job. TextOutputFormat. even if there is no record to write • LazyOutputFormat can be used to delay output file creation until there is a record to write LazyOutputFormat.setOutputFormatClass(job.setOutputFormatClass(TextOutputFormat.Lazy Output • FileOutputFormat subclasses will create output files.class) 88 . V2> has methods to run a Reducer by passing input key value and expected key values 89 . V1.MRUnit • MRUnit is a unit testing library for MapReduce program • Mapper and Reducer can be tested independently by passing inputs • MapDriver<K1. K2. V2> has methods to run a mapper by passing input key value and expected key values • ReduceDriver< MapDriver<K1.Unit Testing . V1. K2. Counters • Useful means of • Monitoring job progress • Gathering statistics • Problem diagnosis • Built-in-counters fall into below groups. MapReduce framework aggregates them across all maps and reduces to produce a grand total at the end of the job 90 . • • • • • MapReduce task counters Filesystem counters FileInput-Format counters FileOutput-Format counters Job counters • Each counter will either be task counter or job counter • Counters are global. getCounter(Temperature.Temperature.MAP_INPUT_RECORDS). • Dynamic counters • Counters can also be set without predefining as enums context.Counter. • Counters are retrieved as Counters cntrs = job.getCounter(“grounName”. “counterName”).MISSING).User Defined Counters • Counters are defined in a job by Java enum enum Temperature { MISSING.increment(1).getCounter(Task. MALFORMED } • Counters are set and incremented as context.increment(1). long missing = cntrs. 91 .MISSING). long total = cntrs.getCounters().getCounter(MaxTemperatureWithCounters. “Binoculars”). “54.get(“Product”). conf.Side Data Distribution • Side data: typically the read only data needed by the job for processing the main dataset • Two methods to make such data available to task trackers • Using Job Configuration • Using Distributed Cache • Using Job Configuration • Small amount of metadata can be set as key value pairs in the job configuration Configuration conf = new Configuration(). String product = conf. Else will put pressure on memory of daemons 92 .65”).trim(). 
• The same can be retrieved in the map or reduce tasks Configuration conf = context.getConfiguration().set(“Product”. • Effective only for small amounts of data (few KB). conf.set(“Conversion”. Side Data Distribution – Distributed Cache • A mechanism for copying read only data in files/ archives to the task nodes just in time • Can be used to provide 3rd party jar files • Hadoop copies these files to DFS then tasktracker copies them to the local disk relative to task’s working directory • Distributed cache for a job can be set up by calling methods on Job Job.getLocalCacheArchives(). Job.getLocalCacheFiles(). Path[] localArchives = context. Job. • The files can be retrieved from the distributed cache through the methods on JobContext Path[] localPaths = context. 93 .addFileToClasspath(new Path(<file path>/<file name>).addCacheFile(new URI(<file path>/<file name>).addCacheArchives(new URI(<file path>/<file name>). inputPath1.class).class). inputPath1. <inputformat>. • No need to set input path. inputPath2.Multiple Inputs • Often in real life you get the related data from different sources in different formats • Hadoop provide MultipleInputs class to handle the situation • MultipleInputs. InputFormat class separately • You can even have separate Mapper class for each input file • MultipleInputs.addInputPath(job.addInputPath(job. MapperClass2. <inputformat>. MapperClass1. inputPath2.class).addInputPath(job. • MultipleInputs. • MultipleInputs.class). <inputformat>. <inputformat>.class. • Both Mappers must emit same key/ value types 94 .class.addInputPath(job. Joins • More than one record sets to be joined based on a key • Two techniques for joining data in MapReduce • Map side join (Replicated Join) • Possible only when • one of the data sets is small enough to be distributed across the data nodes and fits into the memory for maps to independently join OR • Both the data sets are portioned in such a way that they have equal number of partitions. sorted by same key and all records for a given key must reside in the same partition • The smaller data set is used for the look up using the join key • Faster as the data is loaded into the memory 95 . Joins • Reduce side join • Mapper will tag the records from both the data sets distinctly • Join key is used as map’s output key • The records for the same key are brought together in the reducer and reducer will complete the joining process • Less efficient as both the data sets have to go through mapreduce shuffle 96 . JobControl jc = new JobControl(“Chained Job”).Job Chaining • Multiple jobs can be run in a linear or complex dependent fashion • Simple way is to call the job drivers one after the other with respective configurations JobClient. jc. jc.run(). ControlledJob cjob2 = new ControlledJob(conf2). jc. • Here the second job is not launched until first job is completed • For complex dependencies you can use JobControl. cjob2.addjob(cjob1). and ControlledJob classes ControlledJob cjob1 = new ControlledJob(conf1).addjob(cjob2).addDependingJob(cjob1). • JobControl can run jobs in parallel if there is no dependency or the dependencies are met 97 . JobClient.runJob(conf2).runJob(conf1). 
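Tying the distributed-cache and map-side-join ideas above together, here is a sketch of a replicated join over the retail data: the small custs.csv file is shipped to every task through the distributed cache, loaded into an in-memory map in setup(), and looked up while txn.csv streams through the mapper. Class names, field positions, and the cache file URI are illustrative, and the cache calls used are the ones listed in the slides above (on older Hadoop 1.x releases the equivalent calls live on the DistributedCache class), so treat this as a sketch rather than a drop-in implementation.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReplicatedJoin {

  // Map-only job: each mapper joins transactions against the cached customer file
  public static class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> customers = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      // Illustrative: read the local copy of the cached custs.csv into memory
      Path[] cached = context.getLocalCacheFiles();
      BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = reader.readLine()) != null) {
        String[] f = line.split(",");            // cust_id, first name, last name, age, profession
        customers.put(f[0], f[1] + " " + f[2]);  // key by cust_id, keep the customer name
      }
      reader.close();
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] f = value.toString().split(",");  // txn_id, date, cust_id, amt, category, ...
      String name = customers.get(f[2]);         // in-memory lookup replaces the shuffle
      if (name != null) {
        context.write(new Text(f[2]), new Text(name + "," + f[3]));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "Replicated Join");
    job.setJarByClass(ReplicatedJoin.class);
    job.setMapperClass(JoinMapper.class);
    job.setNumReduceTasks(0);                       // map-only, as described for map-side joins
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.addCacheFile(new URI("retail/custs.csv"));  // illustrative cache path
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. retail/txn.csv
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}

Because the lookup table lives in each mapper's memory, this approach only works while custs.csv stays small; otherwise the reduce-side join described above is the fallback.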
Speculative Execution • MapReduce job’s execution time is typically determined by the slowest running task • Job is not complete until all tasks are completed • One slow job could bring down overall performance of the job • Tasks could be slow due to various reasons • Hardware degradation • Software issues • Hadoop Strategy – Speculative Execution • • • • Determines when a task is running longer than expected Launches another equivalent task as backup Output is taken from the task whichever completes first Any duplicate tasks running are killed post that 98 . map.execution to true/ false • For Reduce Tasks • mapred.reduce.Settings • Is ON by default • The behavior can be controlled independently for map and reduce tasks • For Map Tasks • mapred.execution to true/ false 99 .Speculative Execution .speculative.speculative.tasks.tasks. skip.enabled = true 100 .Skipping Bad Records • While handling a large datasets you may not anticipate every possible error scenario • This will result in unhandled exception leading to task failure • Hadoop retries failed tasks(task can fail due to other reasons) up to four times before marking the whole job as failed • Hadoop provides skipping mode for automatically skipping bad records • The mode is OFF by default • Can be enabled by setting mapred.mode. max.Skipping Bad Records • When skipping mode is enabled.attempts mapred.reduce.max. the record is noted during the third time and skipped during the fourth attempt • The number of total attempts for map and reduce tasks can be increased by setting mapred.attempts • Bad records are stored under _logs/skip directory as sequence file 101 . if the task fails for two times.map. har extension • HAR files can be accessed by application using har URI 102 .har /my/files • hadoop fs -ls /my /my • HAR files always be with . increasing the disk seeks • Too many small files take lot of nameNode memory/ MB • Hadoop Archives (HAR) is hadoop’s file format that packs files into HDFS blocks efficiently • HAR files are not compressed • HAR files reduces namenode memory usage • HAR files can be used as mapreduce input directly • hadoop archive is the command to work on HAR files • hadoop archive -archiveName files.Hadoop Archive Files • HDFS stores small files (size << block size) inefficiently • Each file is stored in a block. Summing • Secondary Sort • Searching. Validation and transformation • Statistical Computations • Grouping • Unions and Intersections • Inverted Index 103 . Counting.Some Operations for Thought • Sorting. Disadvantages of MapReduce • MapReduce (Java API) is difficult to program. PIG and HIVE are in the leading front 104 . long development cycle • Need to rewrite trivial operations like Join. filter to achieve in map/reduce/Key/value concepts • Locked with Java which makes it impossible for data analysts to work with hadoop • There are several abstraction layers on top of MapReduce which make working with Hadoop simple. PIG 105 . • Designed to be extensible and reusable • Programmers can develop own functions and use (UDFs) • Programmer friendly • Allows to introspect data structures • Can do sample run on a representative subset of your input • PIG internally converts each transformation into a MapReduce job and submits to hadoop cluster • 40 percent of Yahoo’s Hadoop jobs are run with PIG 106 . 
which means the data is processed in a sequence of steps transforming the data • The transformations support relational-style operations such as filter.PIG • PIG is an abstraction layer on top of MapReduce that frees analysts from the complexity of MapReduce programming • Architected towards handling unstructured and semi structured data • It’s a dataflow language. group and join. union. PIG Architecture • Pig runs as a client side application. there is no need to install anything on the cluster Pig Script Grunt Shell Map Red Map Red Hadoop Cluster 107 . .gz • Configure Environment Variables .apache.y. This will be Pig’s home directory >>tar xvf pig-x.tar.y.add in .html • Untar into a designated folder.bash_profile • export PIG_INSTALL=/<parent directory path>/pig-x.Install & Configure PIG • Download a version of PIG compatible with your hadoop installation • http://pig.z • export PATH=$PATH:$PIG_INSTALL/bin >>.org/releases.z.bash_profile • Verify Installation >>pig -help • Displays command usage >>pig • Takes you into Grunt shell grunt> 108 . PIG Execution Modes • Local Mode • • • • Runs in a single JVM Operates on local file system Suitable for small datasets and for development To run PIG in local mode >>pig -x local • MapReduce Mode • In this mode the queries are translated into MapReduce jobs and run on hadoop cluster • PIG version must be compatible with hadoop version • Set HADOOP_HOME environment variable to indicate pig which hadoop client to use • export HADOOP_HOME=$HADOOP_INSTALL • If not set it will uses the bundled version of hadoop 109 . much like you can use JDBC • For programmatic access to Grunt. use PigRunner 110 .pig • It is also possible to run Pig scripts from Grunt shell using run and exec.Ways of Executing PIG programs • Grunt • An interactive shell for running Pig commands • Grunt is started when the pig command is run without any options • Script • Pig commands can be executed directly from a script file >>pig pigscript. • Embedded • You can run Pig programs from Java using the PigServer class. sub_cat. amt.csv' USING PigStorage('.') AS (txn_id. trans_type). adr2. FILTER grunt> txn_grpd = GROUP txn_100plus BY cat. grunt> txn_cnt_bycat = FOREACH txn_grpd GENERATE group. LOAD grunt> txn_100plus = FILTER transactions BY amt > 100. cust_id. adr1. A relation is created with every statement 111 .An Example A Sequence of transformation steps to get the end result grunt> transactions = LOAD 'retail/txn. txn_dt. COUNT(txn_100plus). GROUP AGGREGATE grunt> DUMP txn_cnt_bycat.00. cat. Data Types Simple Types Category Numeric Text Binary Type int long float double chararray bytearray Description 32-bit signed integer 64-bit signed integer 32-bit floating-point number 64-bit floating-point number Character array in UTF-16 format Byte array 112 . (2)} ['a'#'pomegranate'] 113 . keys must be character arrays.'pomegranate') {(1. possibly with duplicates A set of key-value pairs. but values may be any type Example (1.Data Types Complex Types Type Tuple Bag map Description Sequence of fields of any type An unordered collection of tuples.'pomegranate'). ………. f2:int. records=LOAD ‘sales. records=LOAD ‘sales. f3:float).<field name3>:dataType)] • Loads data from a file into a relation • Uses the PigStorage load function as default unless specified otherwise with the USING option • The data can be given a schema using the AS option.txt’ AS (f1:chararray.LOAD Operator <relation name> = LOAD ‘<input file with path>’ [USING UDF()] [AS (<field name1>:dataType. 
f2:int.txt’ USING PigStorage(‘\t’). <field name2>:dataType.txt’ USING PigStorage(‘\t’) AS (f1:chararray.txt’. records=LOAD ‘sales. 114 . f3:float). • The default data type is bytearray if not specified records=LOAD ‘sales. Diagnostic Operators • DESCRIBE • Describes the schema of a relation • EXPLAIN • Display the execution plan used to compute a relation • ILLUSTRATE • Illustrate step-by-step how data is transformed • Uses sample of the input data to simulate the execution. 115 . Data Write Operators • LIMIT • Limits the number of tuples from a relation • DUMP • Display the tuples from a relation • STORE • Store the data from a relation into a directory. • The directory must not exists 116 . • ORDER • Sort a relation based on one or more fields • Further processing (FILTER. DISTINCT.) may destroy the ordering ordered_list = ORDER cust BY name DESC.Relational Operators • FILTER • Selects tuples based on Boolean expression teenagers = FILTER cust BY age < 20. • DISTINCT • Removes duplicate tuples unique_custlist = DISTINCT cust. etc. 117 . Relational Operators • GROUP BY • Within a relation. SUM 118 . countByProfession=FOREACH groupByProfession GENERATE group. MAX. count(cust). group tuples with the same group key • GROUP ALL will group all tuples into one group groupByProfession=GROUP cust BY profession groupEverything=GROUP cust ALL • FOREACH • Loop through each tuple in nested_alias and generate new tuple(s). COUNT. • Built in aggregate functions AVG. MIN. and SAMPLE are allowed operations in nested_op to operate on the inner bag(s). group tuples with the same group key • GROUP ALL will group all tuples into one group groupByProfession=GROUP cust BY profession groupEverything=GROUP cust ALL • FOREACH • Loop through each tuple in nested_alias and generate new tuple(s). • At least one of the fields of nested_alias should be a bag • DISTINCT. LIMIT. COUNT. ORDER. SUM 119 . • countByProfession=FOREACH groupByProfession GENERATE group. • Built in aggregate functions AVG. MIN.Relational Operators • GROUP BY • Within a relation. FILTER. count(cust). MAX. Operating on Multiple datasets • JOIN • Compute inner join of two or more relations based on common field values. (2.1) (8. B BY b1.4.5) DUMP B.4) (4.2.4) (8.2.3) (7.9) (1.9) (1. X = JOIN A BY a1.2.7) (7.2.3) (2.2.5.7.8.3. DUMP X.1.3. DUMP A. (1.9) 120 .3.9) (7.3.3) (8.3) (4. 4). based on common group values.3. (1. (2.2. >>DUMP X.3) (4.7) (7.3)}.3.2.2.2.5)}.2. {} ) 121 . (8.4) (8.(2.5) >>DUMP B.3) (7. {(1.3.3)} ) {(8.9) >>X = COGROUP A BY a1.4)}.9)} ) {(7. {(8.2. {(2.3) (2. {(7. (1. B BY b1.1).4) (4.9) (1. >>DUMP A.3. (4.9)} ) {}. (2.1) (8.3)}.7)} ) {(4.Operating on Multiple datasets • COGROUP • Group tuples from two or more relations. (4. (7. {(1. 2.1) (8.4) >>DUMP B.4) (8.3) (4. (1.2.3) (4.9) • Splits a relation into two or more relations.1) (2. >>DUMP X.2.1) (8.3. based on a Boolean expressions. (1. D IF a1 > 5.Operating on Multiple datasets • UNION • Creates the union of two or more relations >>X = UNION A.4) (2. >>Y = SPLIT X INTO C IF a1 <5. >>DUMP C.3.2.2.3) (4. B. (2.3.9) 122 .4) (8.9) (1. • SPLIT >>DUMP A.4) (8. >>DUMP D.2.4) (8. Operating on Multiple datasets • SAMPLE • Randomly samples a relation as per given sampling factor. • There is no guarantee that the same number of tuples are returned every time.01. >>sample_data = SAMPLE large_data 0. • Above statement generates a 1% sample of data in relation large_data 123 . 
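Putting several of these operators together, here is a small Pig Latin sketch that joins the retail transactions with the customer file and ranks professions by total spend. Relation and field names follow the txn.csv and custs.csv layouts used earlier in the course, and the output path is only an example:

txns  = LOAD 'retail/txn.csv'  USING PigStorage(',')
        AS (txn_id, txn_dt, cust_id, amt:double, cat, sub_cat, adr1, adr2, trans_type);
custs = LOAD 'retail/custs.csv' USING PigStorage(',')
        AS (cust_id, fname, lname, age:int, profession);

-- JOIN on the common customer id, then GROUP by profession
joined = JOIN txns BY cust_id, custs BY cust_id;
byprof = GROUP joined BY custs::profession;
spend  = FOREACH byprof GENERATE group AS profession,
                                 SUM(joined.txns::amt) AS total_spend;
ranked = ORDER spend BY total_spend DESC;

STORE ranked INTO 'retail/spend_by_profession';   -- the output directory must not already exist

Each statement defines a new relation, and nothing runs until DUMP or STORE triggers the underlying MapReduce jobs.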
UDFs • PIG lets users define their own functions and lets them be used in the statements • The UDFs can be developed in Java.myfunc.isCustomerTeen() • filtered= FILTER cust BY isCustomerTeen(age) 124 . • DEFINE <funcName> com. } • Load UDF • To be subclassed of LoadFunc • Define and use an UDF • REGISTER pig-examples.jar. Python or Javascript • Filter UDF • To be subclassed of FilterFunc which is a subclass of EvalFunc • Eval UDF • To be subclassed of EvalFunc public abstract class EvalFunc<T> { public abstract T exec(Tuple input) throws IOException.training. }. $Y = FOREACH A GENERATE group. 125 . max_field) RETURNS Y { A = GROUP $X by $group_key. group_key. in which case they need to be imported IMPORT ‘<path>/<macrofile>'. MAX($X. year. temperature).Macros • Package reusable pieces of Pig Latin code • Define a Macro DEFINE max_by_group(X. max_temp = max_by_group(filtered_records.$max_field). • Macros can be defined in separate files to Pig scripts for reuse. HIVE 126 . HIVE • A datawarehousing framework built on top of hadoop • Abstracts MapReduce complexity behind • Target users are generally data analysts who are comfortable with SQL • SQL Like Language and called HiveQL • Hive meant only for structured data • You can interact with Hive using several methods • CLI (Command Line Interface) • A Web GUI • JDBC 127 . HIVE Architecture CLI Hive Metastore WEB JDBC Parser/ Planner/ Optimizer Map Red Map Red Hadoop Cluster 128 . y. This will be HIVE’s home directory >>tar xvf hive-x.z • export PATH=$PATH:$HIVE_INSTALL/bin • Verify Installation >>hive -help • Displays command usage >>hive • Takes you into hive shell hive> 129 .z.bash_profile • export HIVE_INSTALL=/<parent directory path>/hive-x.tar.html • Untar into a designated folder.Install & Configure HIVE • Download a version of HIVE compatible with your hadoop installation • http://hive.apache.y.org/releases.gz • Configure • Environment Variables – add in . Install & Configure HIVE • Hadoop needs to be running • Configure to hadoop • Create hive-site.job.name • mapred.tracker • If not set.default.just like they do in Hadoop • Create following directories under HDFS • /tmp • /user/hive/warehouse • chmod g+w for both above directories 130 . they default to the local file system and the local (in-process) job runner .xml under conf directory • specify the filesystem and jobtracker using the hadoop properties • fs. Install & Configure HIVE • Data store • Hive stores data under /user/hive/warehouse by default • Metastore • Out-of-the-box hive comes with light weight SQL database Derby to store and manage meta data • This can be configured to other databases like MySQL 131 . Hive Data Models • Databases • Tables • Partitions • Buckets 132 . b STRING} • MAPS • ARRAYS • *‘a’.Hive Data Types • TINYINT – 1 byte integer • SMALLINT – 2 byte integer • INT – 4 byte integer • BIGINT – 8 byte integer • BOOLEAN – true / false • FLOAT – single precision • DOUBLE – double precision • STRING – sequence of characters • STRUCT • A column can be of type STRUCT with data {a INT. ‘c’+ 133 . ‘b’. name String. • View Table Schema SHOW TABLES.Tables • A Hive table is logically made up of the data being stored and the associated metadata • Creating a Table CREATE TABLE emp_table (id INT. • An external table is a table which is outside the warehouse directory 134 .csv’ OVERWRITE INTO TABLE emp_table. DESCRIBE emp_table. address STRING) PARTITIONED BY (designation STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘\t’ STORED AS SEQUENCEFILE. 
• Loading data
  LOAD DATA INPATH '/home/hadoop/employee.csv' OVERWRITE INTO TABLE emp_table;

Hands On
• Create the retail database and the retail_trans and customers tables
  hive> CREATE DATABASE retail;
  hive> USE retail;
  hive> CREATE TABLE retail_trans (txn_id INT, txn_date STRING, Cust_id INT, Amount FLOAT, Category STRING, Sub_Category STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
  hive> CREATE TABLE customers (Cust_id INT, FirstName STRING, LastName STRING, Profession STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
  hive> SHOW TABLES;
  hive> DESCRIBE retail_trans;

Hands On
• Load data and run queries
  hive> LOAD DATA INPATH 'retail/txn.csv' INTO TABLE retail_trans;
  hive> LOAD DATA INPATH 'retail/custs.csv' INTO TABLE customers;
  hive> SELECT Category, count(*) FROM retail_trans GROUP BY Category;
  hive> SELECT Category, count(*) FROM retail_trans WHERE Amount > 100 GROUP BY Category;
  hive> SELECT concat(cu.FirstName, ' ', cu.LastName), rt.Category, count(*) FROM retail_trans rt JOIN customers cu ON rt.cust_id = cu.cust_id GROUP BY cu.FirstName, cu.LastName, rt.Category;

Queries
• SELECT
  SELECT id, name FROM emp_table WHERE designation = 'manager';
  SELECT designation, count(*) FROM emp_table GROUP BY designation;
  SELECT count(*) FROM emp_table;
• JOIN
  SELECT emp_table.*, detail.age FROM emp_table JOIN detail ON (emp_table.id = detail.id);
• INSERT
  INSERT OVERWRITE TABLE new_emp SELECT * FROM emp_table WHERE id > 100;
• Inserting into a local directory
  INSERT OVERWRITE LOCAL DIRECTORY 'tmp/results' SELECT * FROM emp_table WHERE id > 100;

Partitioning & Bucketing
• Hive can organize tables into partitions based on columns
• Partitions are specified at table creation time
• When we load data into a partitioned table, the partition values are specified explicitly:
  LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');
• Bucketing
  • Bucketing imposes extra structure on the table
  • Makes sampling more efficient
  CREATE TABLE bucketed_users (id INT, name STRING) CLUSTERED BY (id) INTO 4 BUCKETS;

UDFs
• Hive UDFs have to be written in Java
• They have to subclass UDF (org.apache.hadoop.hive.ql.exec.UDF)
• A UDF must implement at least one evaluate() method
  public class Strip extends UDF {
    public Text evaluate(Text str) {
      ...
      return str1;
    }
  }
  ADD JAR /path/to/hive-examples.jar;
  CREATE TEMPORARY FUNCTION strip AS 'com.hadoopbook.hive.Strip';
  SELECT strip(' bee ') FROM dummy;
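Before moving on to Sqoop, a hedged sketch of how the partitioned logs table and the bucketed bucketed_users table defined above are typically used; the second data file and the partition values here are assumed purely for illustration:

  -- Load another file into its own partition (path and values assumed)
  LOAD DATA LOCAL INPATH 'input/hive/partitions/file2'
  INTO TABLE logs PARTITION (dt='2001-01-02', country='US');

  -- Filtering on partition columns lets Hive read only the matching directories
  SELECT * FROM logs WHERE dt = '2001-01-02' AND country = 'US';

  -- Read a single bucket of the four declared by CLUSTERED BY (id) INTO 4 BUCKETS
  SELECT * FROM bucketed_users TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);

Because the bucketing column and bucket count are fixed in the table definition, TABLESAMPLE can read roughly a quarter of the data instead of scanning the whole table.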
SQOOP

SQOOP
• Sqoop allows users to extract data from a structured data store into Hadoop for analysis
• Sqoop can also export the data back to the structured stores
• Installing & configuring Sqoop
  • Download a version of Sqoop compatible with your Hadoop installation
  • Untar into a designated folder; this will be SQOOP's home directory
    >>tar xvf sqoop-x.y.z.tar.gz
  • Configure
    • Environment variables – add in .bash_profile
      export SQOOP_HOME=/<parent directory path>/sqoop-x.y.z
      export PATH=$PATH:$SQOOP_HOME/bin
  • Verify the installation
    >>sqoop
    >>sqoop help

Importing Data (overview)
• 1. The Sqoop client examines the table schema in the RDBMS
• 2. It generates code for the table (e.g. MyClass.java)
• 3. It launches multiple map tasks on the Hadoop cluster
• 4. The map tasks use the generated code to copy the rows into the cluster

Importing Data
• Copy the MySQL JDBC driver to Sqoop's lib directory
  • Sqoop does not come with the JDBC driver
• Sample import
  >>sqoop import --connect jdbc:mysql://localhost/retail --table transactions -m 1
  >>hadoop fs -ls transactions
• The import tool runs a MapReduce job that connects to the database and reads the table
• By default, four map tasks are used
• The output is written to a directory named after the table, under the user's HDFS home directory
• Generates comma-delimited text files by default
• In addition to downloading data, the import tool also generates a Java class matching the table schema

Codegen
• The code can also be generated without an import action
  >>sqoop codegen --connect jdbc:mysql://localhost/hadoopguide --table widgets --class-name Widget
• The generated class can hold a single record retrieved from the table
• The generated code can be used in MapReduce programs to manipulate the data

Working along with Hive
• Importing data into Hive
  • Generate a Hive table definition directly from the source
    >>sqoop create-hive-table --connect jdbc:mysql://localhost/retail --table transactions --fields-terminated-by ','
  • Generate the table definition and import the data into Hive in one step
    >>sqoop import --connect jdbc:mysql://localhost/retail --table transactions -m 1 --hive-import
• Exporting data from Hive
  • Create the table in the MySQL database first
    >>sqoop export --connect jdbc:mysql://localhost/retail -m 1 --table customers --export-dir /user/hive/warehouse/retail.db/customers --input-fields-terminated-by ','

Administration

NameNode Persistent Data Structure
• A newly formatted Namenode creates the directory structure shown below
  ${dfs.name.dir}/
    current/
      VERSION
      edits
      fsimage
      fstime
• VERSION: Java properties file with the HDFS version
• edits: any write operation, such as creating or moving a file, is logged into edits
• fsimage: persistent checkpoint of the file system metadata; this is updated whenever the edit log rolls over
• fstime: records the time when fsimage was last updated

Persistent Data Structure – DataNode & SNN
• Secondary Namenode directory structure
  ${fs.checkpoint.dir}/
    current/
      VERSION
      edits
      fsimage
      fstime
    previous.checkpoint/
      VERSION
      edits
      fsimage
      fstime
• Datanode directory structure
  • Need not be formatted explicitly; datanodes create their directories on startup
  ${dfs.data.dir}/
    current/
      VERSION
      blk_<id_1>
      blk_<id_1>.meta
      blk_<id_2>
      blk_<id_2>.meta
      subdir0/
      subdir1/

HDFS Safe Mode
• When a Namenode starts it enters safe mode
• It loads fsimage into memory and applies the edits from the edit log
• During this time it does not serve any requests
• Safe mode is exited when the minimal replication condition is met, plus an extension time of 30 seconds
• Check whether the Namenode is in safe mode
  hadoop dfsadmin -safemode get
• Wait until safe mode is off
  hadoop dfsadmin -safemode wait
• Enter or leave safe mode
  hadoop dfsadmin -safemode enter / leave

HDFS Filesystem Check
• Hadoop provides the fsck utility to check the health of HDFS
  hadoop fsck /
• Options to either move (to lost+found) or delete affected files
  hadoop fsck / -move
  hadoop fsck / -delete
• Finding the blocks for a given file
  hadoop fsck /user/hadoop/weather/1901 -files -blocks -racks

HDFS Block Scanner
• Datanodes periodically run the block scanner utility to verify the blocks stored on them and guard against disk errors
• The default period is 3 weeks (dfs.datanode.scan.period.hours)
• Corrupt blocks are reported to the Namenode for fixing
• The block scan report for a datanode can be accessed at http://<datanode>:50075/blockScannerReport
• The list of blocks can be accessed by appending ?listblocks to the above URL
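As a usage sketch, the commands above can be combined into a routine health check that an administrator might run after a Namenode restart; the host name datanode1 is a placeholder, and the sequence is illustrative rather than prescriptive:

  # Wait for the Namenode to leave safe mode before running checks
  hadoop dfsadmin -safemode wait

  # Check overall filesystem health; add -move or -delete only after reviewing the report
  hadoop fsck /

  # Show block locations and rack placement for one file
  hadoop fsck /user/hadoop/weather/1901 -files -blocks -racks

  # Fetch a datanode's block scanner report, including the per-block list
  curl 'http://datanode1:50075/blockScannerReport?listblocks'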
HDFS Balancer
• Over a period of time, the block distribution across the cluster may become unbalanced
• This affects data locality for MapReduce and puts strain on the highly utilized datanodes
• Hadoop's balancer daemon redistributes blocks to restore the balance
• The balancing act can be initiated through start-balancer.sh
• It produces a log file in the standard log directory
• The bandwidth available to the balancer can be changed by setting the dfs.balance.bandwidthPerSec property in hdfs-site.xml
  • Default bandwidth: 1 MB/sec

Logging
• All Hadoop daemons produce their respective log files
• Log files are stored under $HADOOP_INSTALL/logs
• The location can be changed by setting the property HADOOP_LOG_DIR in hadoop-env.sh
• The log levels can be set in log4j.properties
  • Name Node: log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=WARN
  • Job Tracker: log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG
• Stack traces
  • The stack traces for all the Hadoop daemons can be obtained at the /stacks page of the web UI each daemon exposes
  • Job tracker stack trace: http://<jobtracker-host>:50030/stacks

Hadoop Routine Maintenance
• Metadata backup
  • Good practice to keep copies of different ages (one hour, one day, one week, etc.)
  • One way is to periodically archive the secondary namenode's previous.checkpoint directory to an offsite location
  • Test the integrity of the copy regularly
• Data backup
  • HDFS replication is not a substitute for data backup
  • As the data volume is very high, it is good practice to prioritize the data to be backed up
    • Business-critical data
    • Data that cannot be regenerated
  • distcp is a good tool for backing up from HDFS to other filesystems
• Run the filesystem check (fsck) and balancer tools regularly

Commissioning of New Nodes
• The datanodes that are permitted to connect to the Namenode are specified in a file pointed to by the property dfs.hosts
• The tasktrackers that are permitted to connect to the Jobtracker are specified in a file pointed to by the property mapred.hosts
• This prevents an arbitrary machine from connecting to the cluster and compromising data integrity and security
• To add new nodes
  • Add the network address of the new node to the above files
  • Run the commands to refresh the Namenode and Jobtracker
    hadoop dfsadmin -refreshNodes
    hadoop mradmin -refreshNodes
  • Update the slaves file with the new nodes
  • Note that the slaves file is not used by the Hadoop daemons; it is used by the control scripts for cluster-wide operations

Decommissioning of Nodes
• Before removing nodes from the cluster, the Namenode and the Jobtracker must be informed
• The decommissioning process is controlled by an exclude file, whose location is set through a property
  • For HDFS it is dfs.hosts.exclude
  • For MapReduce it is mapred.hosts.exclude
• To remove nodes from the cluster
  • Add their network addresses to the respective exclude files
  • Run the commands to update the Namenode and Jobtracker
    hadoop dfsadmin -refreshNodes
    hadoop mradmin -refreshNodes
• During the decommission process the Namenode replicates the affected data to other datanodes
• Finally, remove the nodes from the include file as well as from the slaves file
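A hedged sketch of how the include/exclude files described above are usually wired together; the /etc/hadoop/conf paths are assumed — any location readable by the daemons works:

  <!-- hdfs-site.xml -->
  <property>
    <name>dfs.hosts</name>
    <value>/etc/hadoop/conf/include</value>
  </property>
  <property>
    <name>dfs.hosts.exclude</name>
    <value>/etc/hadoop/conf/exclude</value>
  </property>

  <!-- mapred-site.xml -->
  <property>
    <name>mapred.hosts</name>
    <value>/etc/hadoop/conf/include</value>
  </property>
  <property>
    <name>mapred.hosts.exclude</name>
    <value>/etc/hadoop/conf/exclude</value>
  </property>

After editing the include or exclude files, apply the change without a restart:
  hadoop dfsadmin -refreshNodes
  hadoop mradmin -refreshNodes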
Cluster & Machine Considerations
• Several options
  • Build your own cluster from scratch
  • Use offerings that provide Hadoop as a service in the cloud
• While building your own, choose server-grade commodity machines (commodity does not mean low-end)
• Unix / Linux platform
• Hadoop is designed to use multiple cores and disks
• A typical machine for running a datanode and tasktracker:
  • Processor: two quad-core 2–2.5 GHz CPUs
  • Memory: 16–24 GB ECC RAM
  • Storage: four 1 TB SATA disks
  • Network: Gigabit Ethernet
• Cluster size is typically estimated from the storage capacity and its expected growth

Master Node Scenarios
• The machines running the master daemons should be resilient, as failure of these would lead to data loss and unavailability of the cluster
• On a small cluster (a few tens of nodes) you can run all master daemons on a single machine
• As the cluster grows, their memory requirements grow and they need to be run on separate machines
• The control scripts should be run as follows
  • Run the HDFS control scripts from the namenode machine
    • The masters file should contain the address of the secondary namenode
  • Run the MapReduce scripts from the jobtracker machine
  • The slaves file on both machines should be kept in sync so that each node runs one datanode and one tasktracker

Network Topology
• A common architecture consists of a two-level network topology: a core switch (1 Gb+ uplinks) connecting 1 Gb rack switches, with 30 to 40 servers per rack

Network Topology
• For a multirack cluster, the admin needs to map nodes to racks so Hadoop is network-aware and can place data, as well as MapReduce tasks, as close as possible to the data
• Two ways to define the network map
  • Implement the Java interface DNSToSwitchMapping
    public interface DNSToSwitchMapping {
      public List<String> resolve(List<String> names);
    }
    • Have the property topology.node.switch.mapping.impl point to the implemented class; the namenode and jobtracker will make use of it
  • A user-provided script pointed to by the property topology.script.file.name
• The default behavior is to map all nodes to the same rack

Cluster Setup and Installation
• Use automated installation tools (such as Kickstart or Debian's automated installer) to install software on the nodes
  • Create one master script and use it to automate the rollout
• Steps to complete the cluster setup
  • Install Java (6 or later) on all nodes
  • Create a user account on all nodes for Hadoop activities
    • Use the same user name on all nodes
    • Having an NFS-mounted home directory makes SSH key distribution simple
  • Install Hadoop on all nodes and change the owner of the files
  • Install SSH; the Hadoop control scripts (not the daemons) rely on SSH to perform cluster-wide operations
    • Generate an RSA key pair and share the public key with all nodes
  • Configure Hadoop; a better way of doing this is with tools like Chef or Puppet

Memory Requirements – Worker Node
• The memory allocated to each daemon is controlled by the HADOOP_HEAPSIZE setting in hadoop-env.sh
  • The default value is 1 GB
• The tasktracker launches separate JVMs to run map and reduce tasks
• The memory for each child JVM is set by mapred.child.java.opts; the default value is 200 MB
• The number of map and reduce tasks that can run at any time is set by the properties
  • Map: mapred.tasktracker.map.tasks.maximum
  • Reduce: mapred.tasktracker.reduce.tasks.maximum
  • The default is two for both map and reduce tasks

Memory Requirements – Worker Node
• The number of tasks that can run simultaneously on a tasktracker also depends on the number of processors
  • A good rule of thumb is to run a factor of between one and two more tasks than processors
• If you have an eight-core machine
  • One core for the datanode and tasktracker daemons
  • On the remaining seven cores we can run 7 map and 7 reduce tasks
  • Increasing the memory for each child JVM to 400 MB, the total memory required is 7.6 GB
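As a sketch of where these knobs live, the fragment below simply restates the figures from the eight-core example above; treat the numbers as a starting point to tune, not a recommendation:

  # hadoop-env.sh — heap for each daemon (the slide's 1 GB default)
  export HADOOP_HEAPSIZE=1000

  <!-- mapred-site.xml -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
  </property>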
Other Properties to Consider
• Cluster membership
• Buffer size
• HDFS block size
• Reserved storage space
• Trash
• Job scheduler
• Reduce slow start
• Task memory limits

Security
• Hadoop uses Kerberos for authentication
  • Kerberos does not manage the permissions for Hadoop
• To enable Kerberos authentication, set the property hadoop.security.authentication in core-site.xml to kerberos
• Enable service-level authorization by setting hadoop.security.authorization to true in the same file
• To control which users and groups can do what, configure Access Control Lists (ACLs) in hadoop-policy.xml

Security Policies
• Allow only alice, bob and users in the mapreduce group to submit jobs
  <property>
    <name>security.job.submission.protocol.acl</name>
    <value>alice,bob mapreduce</value>
  </property>
• Allow only users in the datanode group to communicate with the Namenode
  <property>
    <name>security.datanode.protocol.acl</name>
    <value>datanode</value>
  </property>
• Allow any user to talk to the HDFS cluster as a DFSClient
  <property>
    <name>security.client.protocol.acl</name>
    <value>*</value>
  </property>
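A minimal core-site.xml sketch that enables the two settings described above; a working deployment additionally needs a configured Kerberos realm and keytabs, which are outside the scope of these slides:

  <!-- core-site.xml -->
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>  <!-- the default value is "simple" -->
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>

The per-protocol ACLs shown on the previous slide then go into hadoop-policy.xml.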
Recommended Readings