BigData Objective


Hadoop Questions and Answers – History of Hadoop

This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on "History of Hadoop".

1. IBM and ________ have announced a major initiative to use Hadoop to support university courses in distributed computer programming.
a) Google Latitude
b) Android (operating system)
c) Google Variations
d) Google
Answer: d
Explanation: Google and IBM announced a joint university initiative to address internet-scale computing.

2. Point out the correct statement:
a) Hadoop is an ideal environment for extracting and transforming small volumes of data
b) Hadoop stores data in HDFS and supports data compression/decompression
c) The Giraph framework is less useful than a MapReduce job to solve graph and machine learning problems
d) None of the mentioned
Answer: b
Explanation: Data compression can be achieved using compression algorithms like bzip2, gzip, LZO, etc. Different algorithms can be used in different scenarios based on their capabilities.

3. What license is Hadoop distributed under?
a) Apache License 2.0
b) Mozilla Public License
c) Shareware
d) Commercial
Answer: a
Explanation: Hadoop is open source, released under the Apache License 2.0.

4. Sun also has the Hadoop Live CD ________ project, which allows running a fully functional Hadoop cluster using a live CD.
a) OpenOffice.org
b) OpenSolaris
c) GNU
d) Linux
Answer: b
Explanation: The OpenSolaris Hadoop LiveCD project built a bootable CD-ROM image.

5. Which of the following genres does Hadoop produce?
a) Distributed file system
b) JAX-RS
c) Java Message Service
d) Relational Database Management System
Answer: a
Explanation: The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably and to stream those data sets at high bandwidth to user applications.

6. What was Hadoop written in?
a) Java (software platform)
b) Perl
c) Java (programming language)
d) Lua (programming language)
Answer: c
Explanation: The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts.

7. Which of the following platforms does Hadoop run on?
a) Bare metal
b) Debian
c) Cross-platform
d) Unix-like
Answer: c
Explanation: Hadoop has cross-platform operating system support.

8. Hadoop achieves reliability by replicating the data across multiple hosts, and hence does not require ________ storage on hosts.
a) RAID
b) Standard RAID levels
c) ZFS
d) Operating system
Answer: a
Explanation: With the default replication value of 3, data is stored on three nodes: two on the same rack and one on a different rack.

9. Above the file systems comes the ________ engine, which consists of one JobTracker, to which client applications submit MapReduce jobs.
a) MapReduce
b) Google
c) Functional programming
d) Facebook
Answer: a
Explanation: The MapReduce engine is used to distribute work around a cluster.

10. The Hadoop list includes the HBase database, the Apache Mahout ________ system, and matrix operations.
a) Machine learning
b) Pattern recognition
c) Statistical classification
d) Artificial intelligence
Answer: a
Explanation: The Apache Mahout project's goal is to build a scalable machine learning tool.
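Question 8 above touches on HDFS replication (a default factor of 3, with rack-aware placement). The following minimal sketch shows one way the replication factor can be read and changed programmatically; it assumes a reachable HDFS cluster, and the file path used is hypothetical.

```java
// Minimal sketch: reading and overriding the HDFS replication factor.
// Assumes a reachable HDFS cluster; the path below is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // dfs.replication defaults to 3: two replicas on one rack, one on another.
    conf.setInt("dfs.replication", 3);

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/example.txt");   // hypothetical path

    // Raise the replication factor of an existing file to 4.
    fs.setReplication(file, (short) 4);
    System.out.println("Replication: "
        + fs.getFileStatus(file).getReplication());
  }
}
```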
1. As companies move past the experimental phase with Hadoop, many cite the need for additional capabilities, including:
a) Improved data storage and information retrieval
b) Improved extract, transform and load features for data integration
c) Improved data warehousing functionality
d) Improved security, workload management and SQL support
Answer: d
Explanation: Adding security to Hadoop is challenging because not all of the interactions follow the classic client-server pattern.

2. Point out the correct statement:
a) Hadoop does need specialized hardware to process the data
b) Hadoop 2.0 allows live stream processing of real-time data
c) In the Hadoop programming framework output files are divided into lines or records
d) None of the mentioned
Answer: b
Explanation: Hadoop batch processes data distributed over a number of computers ranging in the hundreds and thousands.

3. According to analysts, for what can traditional IT systems provide a foundation when they're integrated with big data technologies like Hadoop?
a) Big data management and data mining
b) Data warehousing and business intelligence
c) Management of Hadoop clusters
d) Collecting and storing unstructured data
Answer: a
Explanation: Data warehousing integrated with Hadoop would give a better understanding of data.

4. Hadoop is a framework that works with a variety of related tools. Common cohorts include:
a) MapReduce, Hive and HBase
b) MapReduce, MySQL and Google Apps
c) MapReduce, Hummer and Iguana
d) MapReduce, Heron and Trumpet
Answer: a
Explanation: To use Hive with HBase you'll typically want to launch two clusters, one to run HBase and the other to run Hive.

5. Point out the wrong statement:
a) Hadoop's processing capabilities are huge and its real advantage lies in the ability to process terabytes and petabytes of data
b) Hadoop uses a programming model called "MapReduce"; all the programs should conform to this model in order to work on the Hadoop platform
c) The programming model, MapReduce, used by Hadoop is difficult to write and test
d) All of the mentioned
Answer: c
Explanation: The programming model, MapReduce, used by Hadoop is simple to write and test.

6. What was Hadoop named after?
a) Creator Doug Cutting's favorite circus act
b) Cutting's high school rock band
c) The toy elephant of Cutting's son
d) A sound Cutting's laptop made during Hadoop's development
Answer: c
Explanation: Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant.

7. All of the following accurately describe Hadoop, EXCEPT:
a) Open source
b) Real-time
c) Java-based
d) Distributed computing approach
Answer: b
Explanation: Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware.

8. __________ can best be described as a programming model used to develop Hadoop-based applications that can process massive amounts of data.
a) MapReduce
b) Mahout
c) Oozie
d) All of the mentioned
Answer: a
Explanation: MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm.

9. __________ has the world's largest Hadoop cluster.
a) Apple
b) Datamatics
c) Facebook
d) None of the mentioned
Answer: c
Explanation: Facebook has many Hadoop clusters; the largest among them is the one that is used for data warehousing.

10. Facebook Tackles Big Data With _______ based on Hadoop.
a) 'Project Prism'
b) 'Prism'
c) 'Project Big'
d) 'Project Data'
Answer: a
Explanation: Prism automatically replicates and moves data wherever it's needed across a vast network of computing facilities.
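Question 8 above describes MapReduce as the programming model behind Hadoop applications. Below is a minimal word-count sketch using the org.apache.hadoop.mapreduce API; the class name and input/output paths are illustrative only and are not taken from the questions themselves.

```java
// Minimal word-count sketch of the MapReduce programming model.
// Class name and paths are illustrative assumptions.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {

  // map: (byte offset, line of text) -> (word, 1)
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // emit intermediate key/value pairs
      }
    }
  }

  // reduce: (word, [1, 1, ...]) -> (word, total count)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum)); // consolidate results per key
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count sketch");
    job.setJarByClass(WordCountSketch.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```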
Hadoop Questions and Answers – Hadoop Ecosystem

This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on "Hadoop Ecosystem".

1. Point out the correct statement:
a) Hive is not a relational database, but a query engine that supports the parts of SQL specific to querying data
b) Hive is a relational database with SQL support
c) Pig is a relational database with SQL support
d) All of the mentioned
Answer: a
Explanation: Hive is a SQL-based data warehouse system for Hadoop that facilitates data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.

2. ________ is a platform for constructing data flows for extract, transform, and load (ETL) processing and analysis of large datasets.
a) Pig Latin
b) Oozie
c) Pig
d) Hive
Answer: c
Explanation: Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.

3. _________ hides the limitations of Java behind a powerful and concise Clojure API for Cascading.
a) Scalding
b) HCatalog
c) Cascalog
d) All of the mentioned
Answer: c
Explanation: Cascalog also adds Logic Programming concepts inspired by Datalog. Hence the name "Cascalog" is a contraction of Cascading and Datalog.

4. Hive also supports custom extensions written in:
a) C#
b) Java
c) C
d) C++
Answer: b
Explanation: Hive also supports custom extensions written in Java, including user-defined functions (UDFs) and serializer-deserializers for reading and optionally writing custom formats.

5. Point out the wrong statement:
a) Elastic MapReduce (EMR) is Facebook's packaged Hadoop offering
b) Amazon Web Service Elastic MapReduce (EMR) is Amazon's packaged Hadoop offering
c) Scalding is a Scala API on top of Cascading that removes most Java boilerplate
d) All of the mentioned
Answer: a
Explanation: Rather than building Hadoop deployments manually on EC2 (Elastic Compute Cloud) clusters, users can spin up fully configured Hadoop installations using simple invocation commands, either through the AWS Web Console or through command-line tools.

6. ________ is the most popular high-level Java API in the Hadoop Ecosystem.
a) Scalding
b) HCatalog
c) Cascalog
d) Cascading
Answer: d
Explanation: Cascading hides many of the complexities of MapReduce programming behind more intuitive pipes and data flow abstractions.

7. ___________ is a general-purpose computing model and runtime system for distributed data analytics.
a) Mapreduce
b) Drill
c) Oozie
d) None of the mentioned
Answer: a
Explanation: Mapreduce provides a flexible and scalable foundation for analytics, from traditional reporting to leading-edge machine learning algorithms.

8. The Pig Latin scripting language is not only a higher-level data flow language but also has operators similar to:
a) SQL
b) JSON
c) XML
d) All of the mentioned
Answer: a
Explanation: Pig Latin, in essence, is designed to fill the gap between the declarative style of SQL and the low-level procedural style of MapReduce.

9. _______ jobs are optimized for scalability but not latency.
a) Mapreduce
b) Drill
c) Oozie
d) Hive
Answer: d
Explanation: Hive queries are translated to MapReduce jobs to exploit the scalability of MapReduce.

10. ______ is a framework for performing remote procedure calls and data serialization.
a) Drill b) BigTop c) Avro d) Chukwa Answer:c Explanation:In the context of Hadoop, Avro can be used to pass data from one program or language to another. Hadoop Questions and Answers – Introduction to Mapreduce This set of Multiple Choice Questions & Answers (MCQs) focuses on “MapReduce”. 1. A ________ node acts as the Slave and is responsible for executing a Task assigned to it by the JobTracker. a) MapReduce b) Mapper c) TaskTracker d) JobTracker Answer:c Explanation:TaskTracker receives the information necessary for execution of a Task from JobTracker, Executes the Task, and Sends the Results back to JobTracker. 2. Point out the correct statement : a) MapReduce tries to place the data and the compute as close as possible b) Map Task in MapReduce is performed using the Mapper() function c) Reduce Task in MapReduce is performed using the Map() function d) All of the mentioned Answer:a Explanation:This feature of MapReduce is “Data Locality”. 3. ___________ part of the MapReduce is responsible for processing one or more chunks of data and producing the output results. a) Maptask b) Mapper c) Task execution d) All of the mentioned Answer:a Explanation:Map Task in MapReduce is performed using the Map() function. 4. _________ function is responsible for consolidating the results produced by each of the Map() functions/tasks. a) Reduce b) Map c) Reducer d) All of the mentioned Answer:a Explanation:Reduce function collates the work and resolves the results. 5. Point out the wrong statement : a) A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner b) The MapReduce framework operates exclusively on pairs c) Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods d) None of the mentioned Answer:d Explanation: The MapReduce framework takes care of scheduling tasks, monitoring them and reexecutes the failed tasks. 6. Although the Hadoop framework is implemented in Java , MapReduce applications need not be written in : a) Java b) C c) C# d) None of the mentioned Answer:a Explanation:Hadoop Pipes is a SWIG- compatible C++ API to implement MapReduce applications (non JNITM based). 7. ________ is a utility which allows users to create and run jobs with any executables as the mapper and/or the reducer. a) Hadoop Strdata b) Hadoop Streaming c) Hadoop Stream d) None of the mentioned Answer:b Explanation:Hadoop streaming is one of the most important utilities in the Apache Hadoop distribution. configureable d) None of the mentioned .configure b) JobConfigurable.8. 9. 1. __________ maps input key/value pairs to a set of intermediate key/value pairs. 10. a) HashPar b) Partitioner c) HashPartitioner d) None of the mentioned Answer:c Explanation: The default partitioner in Hadoop is the HashPartitioner which has a method called getPartition to partition. _________ is the default Partitioner for partitioning key space. Hadoop Questions and Answers – Analyzing Data with Hadoop This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Analyzing Data with Hadoop”. The number of maps is usually driven by the total size of : a) inputs b) outputs c) tasks d) None of the mentioned Answer:a Explanation:Total size of inputs means total number of blocks of the input files. Mapper implementations are passed the JobConf for the job via the ________ method a) JobConfigure.configure c) JobConfigurable. 
a) Mapper b) Reducer c) Both Mapper and Reducer d) None of the mentioned Answer:a Explanation:Maps are the individual tasks that transform input records into intermediate records. The right number of reduces seems to be : a) 0. but increases load balancing and lowers the cost of failures c) It is legal to set the number of reduce-tasks to zero if no reduction is desired d) The framework groups Reducer inputs by keys (since different mappers may have output the . sorted outputs are always stored in a simple (key-len.95 or 1. via HTTP.Answer:b Explanation:JobConfigurable. Point out the wrong statement : a) Reducer has 2 primary phases b) Increasing the number of reduces increases the framework overhead.75. 3.configure method is overridden to initialize themselves. Point out the correct statement : a) Applications can use the Reporter to report progress b) The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job c) The intermediate.36 d) 0. 4. value-len. 5. value) format d) All of the mentioned Answer: d Explanation:Reporters can be used to set application-level status messages and update Counters. key.80 c) 0.95 Answer:d Explanation: The right number of reduces seems to be 0.90 b) 0. 2. a) Reducer b) Mapper c) Shuffle d) All of the mentioned Answer:a Explanation:In Shuffle phase the framework fetches the relevant partition of the output of all the mappers. Input to the _______ is the sorted output of the mappers. Mapper and Reducer implementations can use the ________ to report progress or just indicate that they are alive.same key) in sort stage Answer:a Explanation:Reducer has 3 primary phases: shuffle. sort and reduce. __________ is a generalization of the facility provided by the MapReduce framework to collect data output by the Mapper or the Reducer a) Partitioner b) OutputCollector c) Reporter d) All of the mentioned . The output of the Reducer is not sorted. 8. a) Mapper b) Cascader c) Scalding d) None of the mentioned Answer:d Explanation: The output of the reduce task is typically written to the FileSystem. The output of the _______ is not sorted in the Mapreduce framework for Hadoop. a) Partitioner b) OutputCollector c) Reporter d) All of the mentioned 9. Which of the following phases occur simultaneously ? a) Shuffle and Sort b) Reduce and Sort c) Shuffle and Map d) All of the mentioned Answer:a Explanation: The shuffle and sort phases occur simultaneously. 6. 7. while map-outputs are being fetched they are merged. data-warehouse-ish type of workload b) HDFS runs on a small cluster of commodity-class nodes c) NEWSQL is frequently the collection point for big data d) None of the mentioned Answer:a Explanation:Hadoop together with a relational data warehouse. 2. a) Map Parameters b) JobConf c) MemoryConf d) None of the mentioned Answer:b Explanation:JobConf represents a MapReduce job configuration. Point out the correct statement : a) Hadoop is ideal for the analytical. Hadoop Questions and Answers – Scaling out in Hadoop This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Scaling out in Hadoop”. . they can form very effective data warehouse infrastructure. 1. ________ systems are scale-out file-based (HDD) systems moving to more uses of memory in the nodes. a) NoSQL b) NewSQL c) SQL d) All of the mentioned Answer:a Explanation: NoSQL systems make the most sense whenever the application is based on data with varying data types and the data can be stored in key-value notation. post-operational. 10. 
and partitioners. _________ is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution. reducers.Answer:b Explanation:Hadoop MapReduce comes bundled with a library of generally useful mappers. 6. Hadoop data is not sequenced and is in 64MB to 256 MB block sizes of delimited record values with schema applied on read based on: a) HCatalog b) Hive c) Hbase d) All of the mentioned Answer:a Explanation:Other means of tagging the values also can be used. HDFS and NoSQL file systems focus almost exclusively on adding nodes to : a) Scale out b) Scale up c) Both Scale out and up d) None of the mentioned . 5. __________ are highly resilient and eliminate the single-point-of-failure risk with traditional Hadoop deployments a) EMR b) Isilon solutions c) AWS d) None of the mentioned Answer:b Explanation:enterprise data protection and security options including file system auditing and data-at-rest encryption to address compliance requirements is also provided by Isilon solution. Point out the wrong statement : a) EMC Isilon Scale-out Storage Solutions for Hadoop combine a powerful yet simple and highly efficient storage platform b) Isilon’s native HDFS integration means you can avoid the need to invest in a separate Hadoop infrastructure c) NoSQL systems do provide high latency access and accommodate less concurrent users d) None of the mentioned Answer:c Explanation:NoSQL systems do provide low latency access and accommodate many concurrent users. 4.3. path and LD_LIBRARY_PATH. scalable Big Data store that lets you host very large tables — billions of rows multiplied by millions of columns — on clusters built with commodity hardware. HBase provides ___________ like capabilities on top of Hadoop and HDFS. 7. a) DataCache b) DistributedData c) DistributedCache d) All of the mentioned Answer:c Explanation: The child-jvm always has its current working directory added to the java. _______ refers to incremental costs with no major impact on solution design.library. a) TopTable b) BigTop c) Bigtable d) None of the mentioned Answer:c Explanation: Google Bigtable leverages the distributed data storage provided by the Google File System. The ___________ can also be used to distribute both jars and native libraries for use in the map and/or reduce tasks.Answer:a Explanation:HDFS and NoSQL file systems focus almost exclusively on adding nodes to increase performance (scale-out) but even they require node configuration with elements of scale up. Which is the most popular NoSQL database for scalable big data store with Hadoop ? a) Hbase b) MongoDB c) Cassandra d) None of the mentioned Answer:a Explanation:HBase is the Hadoop database: a distributed. 10. 9. performance and complexity. a) Scale-out . 8. Point out the correct statement : a) You can specify any executable as the mapper and/or the reducer b) You cannot supply a Java class as the mapper and/or the reducer c) The class you supply for the output format should return key/value pairs of Text class d) All of the mentioned Answer:a Explanation:If you do not specify an input format class. Hadoop Questions and Answers – Hadoop Streaming This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Hadoop Streaming”. Streaming supports streaming command options as well as _________ command options. Which of the following Hadoop streaming command option parameter is required ? 
a) output directoryname b) mapper executable c) input directoryname d) All of the mentioned .b) Scale-down c) Scale-up d) None of the mentioned Answer:c Explanation:dding more CPU/RAM/Disk capacity to Hadoop DataNode that is already part of a cluster does not require additional network switches. otherwise the command will fail. a) generic b) tool c) library d) task Answer:a Explanation:Place the generic options before the streaming options. 2. the TextInputFormat is used as the default. 3. 1. Answer:d Explanation:Required parameters is used for Input and Output location for mapper. The ________ option allows you to copy jars locally to the current working directory of tasks and automatically unjar the files. ______________ class allows the Map/Reduce framework to partition the map outputs based on certain key fields. 4. a) KeyFieldPartitioner b) KeyFieldBasedPartitioner c) KeyFieldBased d) None of the mentioned . a) archives b) files c) task d) None of the mentioned Answer:a Explanation:Archives options is also a generic option. simply specify “-reducer aggregate”: 6. simply specify “-mapper aggregate” d) None of the mentioned Answer:c Explanation:To use Aggregate. not the whole keys. To set an environment variable in a streaming command use: a) -cmden EXAMPLE_DIR=/home/example/dictionaries/ b) -cmdev EXAMPLE_DIR=/home/example/dictionaries/ c) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/ d) -cmenv EXAMPLE_DIR=/home/example/dictionaries/ Answer:c Explanation:Environment Variable is set using cmdenv command. 5. Point out the wrong statement : a) Hadoop has a library package called Aggregate b) Aggregate allows you to define a mapper plugin class that is expected to generate “aggregatable items” for each input key/value pair of the mappers c) To use Aggregate. 7. lib. 10.apache. . 9.mapred. and the combination of the primary and secondary keys is used for sorting. that is useful for many applications. Which of the following class provides a subset of features provided by the Unix/GNU Sort ? a) KeyFieldBased b) KeyFieldComparator c) KeyFieldBasedComparator d) All of the mentioned Answer:c Explanation:Hadoop has a library class. KeyFieldBasedComparator.FieldSelectionMapReduce. 8.hadoop.Answer:b Explanation: The primary key is used for partitioning. that effectively allows you to process text data like the unix ______ utility. org.Hadoop has a library class. and a list of simple aggregators that perform aggregations such as “sum”. Hadoop Questions and Answers – Introduction to HDFS This set of Multiple Choice Questions & Answers (MCQs) focuses on “Hadoop Filesystem – HDFS”. “min” and so on over a sequence of values. a) Copy b) Cut c) Paste d) Move Answer:b Explanation: The map function defined in the class treats each input key/value pair as a list of fields. “max”. Which of the following class is provided by Aggregate package ? a) Map b) Reducer c) Reduce d) None of the mentioned Answer:b Explanation:Aggregate provides a special reducer class and a special combiner class. 1. a) Data Node b) NameNode c) Data block d) Replication Answer:b Explanation:All the metadata related to HDFS including the information about data nodes. a) master-worker b) master-slave c) worker/slave. a) Rack b) Data c) Secondary d) None of the mentioned Answer:c Explanation:Secondary namenode is used for all time availability and reliability. A ________ serves as the master and there is only one NameNode per cluster. ________ NameNode is used when the Primary NameNode goes down. 
are stored and maintained on the NameNode. files stored on HDFS. d) All of the mentioned Answer:a Explanation:NameNode servers as the master and each DataNode servers as a worker/slave 4. HDFS works in a __________ fashion. Point out the wrong statement : a) Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file . etc. 2. Point out the correct statement : a) DataNode is the slave/worker node and holds the user data in the form of Data Blocks b) Each incoming file is broken into 32 MB by default c) Data blocks are replicated across different nodes in the cluster to ensure a low degree of fault tolerance d) None of the mentioned Answer:a Explanation:There can be any number of DataNodes in a Hadoop Cluster. 3. 5. and Replication. 8. 7. 6. with data replicated across them. The need for data replication can arise in various scenarios like : a) Replication Factor is changed b) DataNode goes down c) Data Blocks get corrupted d) All of the mentioned Answer:d Explanation:Data is replicated across different DataNodes to ensure a high degree of faulttolerance. .level b) Block Report from each DataNode contains a list of all the blocks that are stored on that DataNode c) User data is stored on the local file system of DataNodes d) DataNode is aware of the files to which the blocks stored on it belong to Answer:d Explanation: NameNode is aware of the files to which the blocks stored on it belong to. A functional filesystem has more than one DataNode. a) DataNode b) NameNode c) Data block d) Replication Answer:a Explanation: A DataNode stores data in the [HadoopFileSystem]. ________ is the slave/worker node and holds the user data in the form of Data Blocks. Which of the following scenario may not be a good fit for HDFS ? a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file b) HDFS is suitable for storing data related to applications requiring low latency data access c) HDFS is suitable for storing data related to applications requiring low latency data access d) None of the mentioned Answer:a Explanation:HDFS can be used for storing archive data since it is cheaper as HDFS allows storing the data on low cost commodity hardware while ensuring a high degree of faulttolerance. these can be used in conjunction to simulate secondary sort on values d) All of the mentioned .9. 10. Point out the correct statement : a) The framework groups Reducer inputs by keys b) The shuffle and sort phases occur simultaneously i. HDFS is implemented in _____________ programming language. 1. Hadoop Questions and Answers – Java Interface This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Java Interface”. instance of __________ is required. a) filesystem b) datastream c) outstream d) inputstream Answer:a Explanation:InputDataStream is used to read data from file. a) C++ b) Java c) Scala d) None of the mentioned Answer:b Explanation:HDFS is implemented in Java and any computer which can run Java can host a NameNode/DataNode on it.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped. 2. HDFS provides a command line interface called __________ used to interact with HDFS. while outputs are being fetched they are merged c) Since JobConf.e. a) “HDFS Shell” b) “FS Shell” c) “DFS Shell” d) None of the mentioned Answer:b Explanation: The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS). 
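The questions above cover the NameNode/DataNode roles and the FS shell, and the Java-interface questions that follow turn to the programmatic equivalent. Below is a minimal sketch of reading an HDFS file through the FileSystem API and IOUtils; it assumes a configured cluster, and the URI used is hypothetical.

```java
// Minimal sketch: streaming an HDFS file to standard output with the
// FileSystem API and IOUtils.copyBytes. The URI below is hypothetical.
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCatSketch {
  public static void main(String[] args) throws Exception {
    String uri = "hdfs://namenode:8020/data/example.txt"; // hypothetical
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    InputStream in = null;
    try {
      in = fs.open(new Path(uri));                    // FSDataInputStream
      IOUtils.copyBytes(in, System.out, 4096, false); // copy bytes to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}
```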
In order to read any file in HDFS. 5.Answer:d Explanation:If equivalence rules for keys while grouping the intermediates are different from those for grouping keys before reduction. a) IOUtils b) Utils c) IUtils d) All of the mentioned Answer:a Explanation:IOUtils class is static method in Java interface. Interface ____________ reduces a set of intermediate values which share a key to a smaller set of values. a) write() b) read() c) readwrite() d) All of the mentioned Answer:a Explanation:readfully method can also be used instead of read method. ______________ is method to copy byte from input stream to any other stream in Hadoop. then one may specify a Comparator. Point out the wrong statement : a) The framework calls reduce method for each pair in the grouped inputs b) The output of the Reducer is re-sorted c) reduce method reduces values for a given key d) None of the mentioned Answer:b Explanation: The output of the Reducer is not re-sorted. 4. . _____________ is used to read data from bytes buffers . 6. 3. a) Mapper b) Reducer c) Writable d) Readable Answer:b Explanation:Reducer implementations can access the JobConf for the job. fetches the relevant partition of the output of all the Mappers. Reporter) method is called for each pair in the grouped inputs. 10. Applications can use the _________ provided to report progress or just indicate that they are alive.7. 8. 9. OutputCollector. Which of the following parameter is to collect keys and combined values ? a) key b) values c) reporter d) output . a) Collector b) Reporter c) Dashboard d) None of the mentioned Answer:b Explanation:In scenarios where the application takes a significant amount of time to process individual key/value pairs. Iterator. The output of the reduce task is typically written to the FileSystem via : a) OutputCollector b) InputCollector c) OutputCollect d) All of the mentioned Answer:a Explanation:In reduce phase the reduce(Object. this is crucial since the framework might assume that the task has timed-out and kill that task. for each Reducer. via HTTP. Reducer is input the grouped output of a : a) Mapper b) Reducer c) Writable d) Readable Answer:a Explanation:In the phase the framework. 4. a) Hive b) MapReduce c) Pig d) Lucene Answer:b Explanation:MapReduce is the heart of hadoop. 1. striving to keep the work as close to the data as possible a) DataNodes b) TaskTracker c) ActionNodes . 3. Hadoop Questions and Answers – Data Flow This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Data Flow”. 2. ________ is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks.Answer:d Explanation:reporter parameter is for facility to report progress. a) job-tracker b) map-tracker c) reduce-tracker d) All of the mentioned Answer:a Explanation:Map-Reduce jobs are submitted on job-tracker. The JobTracker pushes work out to available _______ nodes in the cluster. The daemons associated with the MapReduce phase are ________ and task-trackers. Point out the correct statement : a) Data locality means movement of algorithm to the data instead of data to algorithm b) When the processing is done on the data algorithm is moved across the Action Nodes rather than data to the algorithm c) Moving Computation is expensive than Moving Data d) None of the mentioned Answer:a Explanation:Data flow framework possesses the feature of data locality. . list(V2)) → list(K3. 7. 
Point out the wrong statement : a) The map function in Hadoop MapReduce have the following general form:map:(K1. the map task passes the split to the createRecordReader() method on InputFormat to obtain a _________ for that split. 6. a) puts b) gets c) getSplits d) All of the mentioned Answer:c Explanation:InputFormat uses their storage locations to schedule map tasks to process them on the tasktrackers. On a tasktracker. V1) → list(K2.d) All of the mentioned Answer:b Explanation:A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status whether the node is dead or alive. V2) b) The reduce function in Hadoop MapReduce have the following general form: reduce: (K2. 5. InputFormat class calls the ________ function and computes splits for each file and then sends them to the jobtracker. V3) c) MapReduce has a complex model of data processing: inputs and outputs for the map and reduce functions are key-value pairs d) None of the mentioned Answer:c Explanation:MapReduce is relatively simple model to implement in Hadoop. a) InputReader b) RecordReader c) OutputReader d) None of the mentioned Answer:b Explanation: The RecordReader loads data from its source and converts into key-value pairs suitable for reading by mapper. 10. _________ is the name of the archive you would like to create. a) Collector b) Partitioner c) InputFormat d) None of the mentioned Answer:b Explanation: The output of the mapper is sent to the partitioner. The default InputFormat is __________ which treats each value of input a new value and the associated key is byte offset.8. 1. a) TextFormat b) TextInputFormat c) InputFormat d) All of the mentioned Answer:b Explanation:A RecordReader is little more than an iterator over records. a) archive b) archiveName c) Name d) None of the mentioned . and the map task uses one to generate record key-value pairs. Hadoop Questions and Answers – Hadoop Archives This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Hadoop Archives”. Output of the mapper is first written on the local disk for sorting and _________ process. 9. a) shuffling b) secondary sorting c) forking d) reducing Answer:a Explanation:All values corresponding to the same key will go the same reducer. __________ controls the partitioning of the keys of the intermediate map-outputs. har extension.har extension d) All of the mentioned Answer:d Explanation:A Hadoop archive directory contains metadata (in the form of _index and _masterindex) and data (part-*) files. Point out the wrong statement : a) The Hadoop archive exposes itself as a file system layer b) Hadoop archives are immutable c) Archive rename’s.Answer:b Explanation: The name should have a *. The __________ guarantees that excess resources taken from a queue will be restored to it within N minutes of its need for them. 4. 3. a) capacitor b) scheduler c) datanode d) None of the mentioned Answer:b Explanation:Free resources can be allocated to any queue beyond its guaranteed capacity. a) Hive b) Pig c) MapReduce d) All of the mentioned Answer:c Explanation:Hadoop Archives is exposed as a file system MapReduce will be able to use all the logical input files in Hadoop Archives as input. 5. 2. Point out the correct statement : a) A Hadoop archive maps to a file system directory b) Hadoop archives are special format archives c) A Hadoop archive always has a *. Using Hadoop Archives in __________ is as easy as specifying a different input filesystem than the default file system. 
deletes and creates return an error d) None of the mentioned . 8.Answer:d Explanation:All the fs shell commands in the archives work but with a different URI. a) Flow Scheduler b) Data Scheduler c) Capacity Scheduler d) None of the mentioned Answer:c Explanation: The Capacity Scheduler supports for multiple queues. a) -archiveName <name> b) <source> c) <destination> d) None of the mentioned Answer:d Explanation: identifies destination directory which would contain the archive. where a job is submitted to a queue. 7. 9. _________ is a pluggable Map/Reduce scheduler for Hadoop which provides a way to share large clusters. _________ identifies filesystem pathnames which work as usual with regular expressions. Which of the following parameter describes destination directory which would contain the archive ? a) -archiveName b) <source> c) <destination> d) None of the mentioned Answer:c Explanation: -archiveName is the name of the archive to be created. __________ is the parent argument used to specify the relative path to which the files should be archived to a) -archiveName <name> b) -p <parent_path> c) <destination> d) <source> . 6. a file that contains other files. 10. Which of the following is a valid syntax for hadoop archive ? a) hadooparchive [ Generic Options ] archive -archiveName <name> [-p <parent>] <source> <destination> b) hadooparch [ Generic Options ] archive -archiveName <name> [-p <parent>] <source> <destination> c) hadoop [ Generic Options ] archive -archiveName <name> [-p <parent>] <source> <destination> d) None of the mentioned Answer:c Explanation: The Hadoop archiving tool can be invoked using the following command format: hadoop archive -archiveName name -p * .Answer:b Explanation: The hadoop archive command creates a Hadoop archive. a) methods b) commands c) classes d) None of the mentioned Answer:d Explanation:Hadoop I/O consist of primitives for serialization and deserialization. Apache Hadoop’s ___________ provides a persistent data structure for binary key-value pairs. a) GetFile b) SequenceFile c) Putfile d) All of the mentioned Answer:b Explanation:SequenceFile is append-only. 2.Hadoop Questions and Answers – Hadoop I/O This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Hadoop I/O”. 4. adding or removing it. How many formats of SequenceFile are present in Hadoop I/O ? a) 2 b) 3 c) 4 d) 5 . 1. via reflection. you can’t seek to a specified key editing. Hadoop I/O Hadoop comes with a set of ________ for data I/O. 3. for reading d) All of the mentioned Answer:d Explanation:In contrast with other persistent key-value data structures like B-Trees. Point out the correct statement : a) The sequence file also can contain a “secondary” key-value list that can be used as file Metadata b) SequenceFile formats share a header that contains some information which allows the reader to recognize is format c) There’re Key and Value Class Name’s that allow the reader to instantiate those classes. and is written to the file during the initialization that happens in the SequenceFile. 8. 7. a) Array b) Index c) Immutable . A “Record Compressed” format and a “Block-Compressed”. a) ReduceFile b) MapperFile c) MapFile d) None of the mentioned Answer:c Explanation:Sequence files are data file (“/data”) and the index file (“/index”).Answer:b Explanation:SequenceFile has 3 available formats: An “Uncompressed” format. Point out the wrong statement : a) The data file contains all the key. 5. The __________ is a directory that contains two SequenceFile. 
Which of the following format is more compression-aggressive ? a) Partition Compressed b) Record Compressed c) Block-Compressed d) Uncompressed Answer:c Explanation:SequenceFile key-value list can be just a Text/Text pair. value records but key N + 1 must be greater then or equal to the key N b) Sequence file is a kind of hadoop file based data structure c) Map file type is splittable as it contains a sync point after several records d) None of the mentioned Answer:c Explanation:Map file is again a kind of hadoop file based data structure and it differs from a sequence file in a matter of the order. 6. The ______ file is populated with the key and a LongWritable that contains the starting byte position of the record. d) All of the mentioned Answer:b Explanation:Index does’t contains all the keys but just a fraction of the keys. 10. Hadoop Questions and Answers – Compression This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Compression”. a) Snapcheck b) Snappy c) FileCompress d) None of the mentioned Answer:b Explanation:Snappy has fast compression and decompression speeds. a) Oozie b) Avro c) cTakes d) Lucene Answer:b Explanation:Avro is a splittable data format with a metadata section at the beginning and then a sequence of avro serialized objects. The _________ codec from Google provides modest compression ratios. count + 1. a) SetFile b) ArrayFile c) BloomMapFile d) None of the mentioned Answer:b Explanation: The SetFile instead of append(key. 1. The _________ as just the value field append(value) and the key is a LongWritable that contains the record number. value) as just the key field append(key) and the value is always the NullWritable instance. ____________ data file takes is based on avro serializaton framework which was primarily created for hadoop. . 9. Point out the correct statement : a) Snappy is licensed under the GNU Public License (GPL) b) BgCIK needs to create an index when it compresses a file c) The Snappy codec is integrated into Hadoop Common. 5. LZO and Gzip are similar. a set of common utilities that supports other Hadoop subprojects d) None of the mentioned Answer:c Explanation:You can use Snappy as an add-on for more recent versions of Hadoop that do not yet provide Snappy codec support. but it’s much slower c) Gzip is a compression utility that was adopted by the GNU project d) None of the mentioned Answer:a Explanation:From a usability standpoint. Bzip2 and Gzip are similar. b) Bzip2 generates a better compression ratio than does Gzip. Which of the following compression is similar to Snappy compression ? a) LZO b) Bzip2 c) Gzip d) All of the mentioned Answer:a Explanation:LZO is only really desirable if you need to compress text files. Which of the following supports splittable compression ? a) LZO b) Bzip2 c) Gzip d) All of the mentioned Answer:a Explanation:LZO enables the parallel processing of compressed text file splits by your MapReduce jobs. 4.2. . 3. Point out the wrong statement : a) From a usability standpoint. gzp d) . a) LZO b) Bzip2 c) Gzip d) All of the mentioned Answer:b Explanation:bzip2 is a freely available. patent free (see below). Which of the following is the slowest compression technique ? a) LZO b) Bzip2 c) Gzip d) All of the mentioned Answer:b Explanation:Of all the available compression codecs in Hadoop. a) .gzip b) . 8. 9.6. high-quality data compressor. 10. Bzip2 is by far the slowest. including Gzip. which is a combination of LZ77 and Huffman Coding. 
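The compression questions above compare Snappy, LZO, bzip2 and gzip. The sketch below shows how map-output and final job-output compression might be enabled on a MapReduce job; it is a minimal illustration using the Hadoop 2.x property names (older releases used mapred.* equivalents), and the surrounding job setup is assumed.

```java
// Minimal sketch: enabling map-output and job-output compression.
// Property names are the Hadoop 2.x ones; values are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfigSketch {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();

    // Compress intermediate map outputs with Snappy (fast, modest ratio).
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compression-sketch");

    // Compress the final job output with gzip (DEFLATE-based, not splittable).
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    return job;
  }
}
```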
The LZO compression format is composed of approximately __________ blocks of compressed data. Which of the following is based on the DEFLATE algorithm ? a) LZO b) Bzip2 c) Gzip d) All of the mentioned Answer:c Explanation:gzip is based on the DEFLATE algorithm. __________ typically compresses files to within 10% to 15% of the best available techniques.gz c) . Gzip (short for GNU zip) generates compressed files that have a _________ extension. .g Answer:b Explanation:You can use the gunzip command to decompress files that were created by a number of compression utilities. 7. 2. a) DataNode b) NameNode c) ActionNode d) All of the mentioned . a) metastore b) parity c) checksum d) None of the mentioned Answer:c Explanation:When a client creates an HDFS file. The ___________ machine is a single point of failure for an HDFS cluster. The HDFS client software implements __________ checking on the contents of HDFS files. 3. d) None of the mentioned Answer:a Explanation:A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. Hadoop Questions and Answers – Data Integrity This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Data Integrity”. Point out the correct statement : a) The HDFS architecture is compatible with data rebalancing schemes b) Datablocks support storing a copy of data at a particular instant of time. c) HDFS currently support snapshots. 1.a) 128k b) 256k c) 24k d) 36k Answer:b Explanation:LZO was designed with speed in mind: it decompresses about twice as fast as gzip. meaning it’s fast enough to keep up with hard drive read speeds. it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. Automatic restart and ____________ of the NameNode software to another machine is not supported. Currently. The ____________ and the EditLog are central data structures of HDFS. a) failover b) end c) scalability .Answer:b Explanation:If the NameNode machine fails. 4. a) Data Image b) Datanots c) Snapshots d) All of the mentioned Answer:c Explanation:One usage of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time. __________ support storing a copy of data at a particular instant of time. a) DsImage b) FsImage c) FsImages d) All of the mentioned Answer:b Explanation:A corruption of these files can cause the HDFS instance to be non-functional 5. automatic restart and failover of the NameNode software to another machine is not supported. 7. 6. Point out the wrong statement : a) HDFS is designed to support small files only b) Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously c) NameNode can be configured to support maintaining multiple copies of the FsImage and EditLog d) None of the mentioned Answer:a Explanation:HDFS is designed to support very large files. manual intervention is necessary. 2 b) 1. a) DataNode b) NameNode c) ActionNode d) None of the mentioned Answer:b Explanation:HDFS tolerates failures of storage servers (called DataNodes) and its disks 10. Apache _______ is a serialization framework that produces data in a compact binary format. by default. manual intervention is necessary. Hadoop Questions and Answers – Serialization This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Serialization”. 
1.3 d) All of the mentioned Answer:a Explanation:HDFS has a simple yet robust architecture that was explicitly designed for data reliability in the face of faults and failures in disks. replicates each data block _____ times on different nodes and on at least ____ racks.2 c) 2. nodes and networks. a) Oozie b) Impala . _________ stores its metadata on multiple disks that typically include a non-local file server. 9. a) 3. a) DataNode b) NameNode c) ActionNode d) None of the mentioned Answer:b Explanation:When the HDFS NameNode is restarted it recovers its metadata. HDFS. 8. The HDFS file system is temporarily unavailable whenever the HDFS ________ is down.d) All of the mentioned Answer:a Explanation:If the NameNode machine fails. Point out the correct statement : a) Apache Avro is a framework that allows you to serialize data in a format that has a schema built in b) The serialized data is in a compact binary format that doesn’t require proxy objects or code generation c) Including schemas with the Avro messages allows any application to deserialize the data d) All of the mentioned Answer:d Explanation:Instead of using generated proxy libraries and strong typing. 3. Avro relies heavily on the schemas that are sent along with the serialized data.c) kafka d) Avro Answer:d Explanation:Apache Avro doesn’t require proxy objects or code generation. The ____________ is an iterator which reads through the file and returns objects using the next() method. 2. 4. Avro schemas describe the format of the message and are defined using : a) JSON b) XML c) JS d) All of the mentioned Answer:a Explanation: The JSON schema content is put into a file. a) DatReader b) DatumReader c) DatumRead d) None of the mentioned Answer:b Explanation:DatumReader reads the content through the DataFileReader implementation. 5. Point out the wrong statement : a) Java code is used to deserialize the contents of the file into objects b) Avro allows you to use complex data structures within Hadoop MapReduce jobs . 6. a) AvroReducer b) Mapper c) AvroMapper d) None of the mentioned Answer:c Explanation:AvroMapper is used to provide the ability to collect or map data.c) The m2e plug-in automatically downloads the newly added JAR files and their dependencies d) None of the mentioned Answer:d Explanation:A unit test is useful because you can make assertions to verify that the values of the deserialized object are the same as the original values. 9. The ________ method in the ModelCountReducer class “reduces” the values the mapper collects into a derived value a) count b) add c) reduce d) All of the mentioned Answer:c Explanation:In some case. Which of the following works well with Avro ? a) Lucene b) kafka c) MapReduce . it can be simple sum of the values. 7. a) AvroReducer b) Mapper c) AvroMapper d) None of the mentioned Answer:a Explanation:AvroReducer summarizes them by looping through the values. ____________ class accepts the values that the ModelCountMapper object has collected. 8. The ____________ class extends and implements several Hadoop-supplied interfaces. d) None of the mentioned Answer:c Explanation:You can use Avro and MapReduce together to process many items serialized with Avro’s small binary format. 10. __________ tools is used to generate proxy objects in Java to easily work with the objects. a) Lucene b) kafka c) MapReduce d) Avro Answer:d Explanation:Avro serialization includes the schema with it — in JSON format — which allows you to have different versions of objects. 
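The serialization questions above center on Avro's schema-with-the-data model. Below is a minimal sketch of defining a schema in JSON and writing and reading a record with Avro's generic API; the schema fields and output file name are illustrative assumptions.

```java
// Minimal sketch: an inline JSON Avro schema plus a generic-record round trip.
// Schema fields and the file name are illustrative only.
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroRoundTripSketch {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Ada");
    user.put("age", 36);

    File file = new File("users.avro");
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);   // the schema is embedded in the data file
      writer.append(user);
    }

    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      while (reader.hasNext()) {
        System.out.println(reader.next()); // deserialized using the embedded schema
      }
    }
  }
}
```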
Hadoop Questions and Answers – Avro-1 This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Avro-1”. 1. Avro schemas are defined with _____ a) JSON b) XML c) JAVA d) All of the mentioned Answer:a Explanation:JSON implementation facilitates implementation in languages that already have JSON libraries. 2. Point out the correct statement : a) Avro provides functionality similar to systems such as Thrift b) When Avro is used in RPC, the client and server exchange data in the connection handshake c) Apache Avro, Avro, Apache, and the Avro and Apache logos are trademarks of The Java Foundation d) None of the mentioned Answer:a Explanation:Avro differs from these systems in the fundamental aspects like untagged data. 3. __________ facilitates construction of generic data-processing systems and languages. a) Untagged data b) Dynamic typing c) No manually-assigned field IDs d) All of the mentioned Answer:b Explanation:Avro does not require that code be generated. 4. With ______ we can store data and read it easily with various programming languages a) Thrift b) Protocol Buffers c) Avro d) None of the mentioned Answer:c Explanation:Avro is optimized to minimize the disk space needed by our data and it is flexible. 5. Point out the wrong statement : a) Apache Avro™ is a data serialization system b) Avro provides simple integration with dynamic languages c) Avro provides rich data structures d) All of the mentioned Answer:d Explanation: Code generation is not required to read or write data files nor to use or implement RPC protocols in Avro. 6. ________ are a way of encoding structured data in an efficient yet extensible format. a) Thrift b) Protocol Buffers c) Avro d) None of the mentioned Answer:b Explanation:Google uses Protocol Buffers for almost all of its internal RPC protocols and file formats. 7. Thrift resolves possible conflicts through _________ of the field. a) Name b) Static number c) UID d) None of the mentioned Answer:b Explanation:Avro resolves possible conflicts through the name of the field. 8. Avro is said to be the future _______ layer of Hadoop. a) RMC b) RPC c) RDC d) All of the mentioned Answer:b Explanation:When Avro is used in RPC, the client and server exchange schemas in the connection handshake. 9. When using reflection to automatically build our schemas without code generation, we need to configure Avro using : a) AvroJob.Reflect(jConf); b) AvroJob.setReflect(jConf); c) Job.setReflect(jConf); d) None of the mentioned Answer:c Explanation:For strongly typed languages like Java, it also provides a generation code layer, including RPC services code generation. 10. We can declare the schema of our data either in a ______ file. a) JSON b) XML c) SQL d) R Answer:c Explanation:Schema can be declared using an IDL or simply through Java beans by using reflection-based schema building. Hadoop Questions and Answers – Avro-2 This set of Interview Questions and Answers focuses on “Avro”. unions and fixed. Which of the following is a primitive data type in Avro ? a) null b) boolean c) float d) All of the mentioned Answer:d Explanation:Primitive type names are also defined type names. 4. 2. 3. maps. enums. 
Point out the correct statement : a) Records use the type name “record” and support three attributes b) Enum are represented using JSON arrays c) Avro data is always serialized with its schema d) All of the mentioned Answer:a Explanation:A record is encoded by encoding the values of its fields in the order that they are declared.________ are encoded as a series of blocks. a) 3 b) 4 c) 6 d) 7 Answer:d Explanation:Avro supports six kinds of complex types: records. Avro supports ______ kinds of complex types. a) Arrays b) Enum c) Unions d) Maps . arrays. 1. Answer:a Explanation:Each block of array consists of a long count value. a) Codec b) Data Marker c) Syncronization markers . Point out the wrong statement : a) Record. A block with count zero indicates the end of the array. 5. Each item is encoded per the array’s item schema. a) Fixed b) Enum c) Unions d) Maps Answer:a Explanation:Except for unions. _____________ are used between blocks to permit efficient splitting of files for MapReduce processing. followed by that many array items. ________ instances are encoded using the number of bytes declared in the schema. 8. enums and fixed are named types b) Unions may immediately contain other unions c) A namespace is a dot-separated sequence of such names d) All of the mentioned Answer:b Explanation:Unions may not immediately contain other unions. a) Complex Data type b) Order c) Sort Order d) All of the mentioned Answer:c Explanation:Avro binary-encoded data can be efficiently ordered without deserializing it to objects. 7. ________ permits data written by one system to be efficiently sorted by another system. the JSON encoding is the same as is used to encode field default values. 6. Hadoop Questions and Answers – Metrics in Hbase This set of Interview Questions & Answers focuses on “Hbase”. Avro messages are framed as a list of _________ a) buffers b) frames c) rows d) None of the mentioned Answer:b Explanation:Framing is a layer between messages and the transport. _______ can change the maximum number of cells of a column family. like many technologies that come from Google. Point out the correct statement : a) You can add a column family to a table using the method addColumn() b) Using alter. 2. 10. a) null b) snappy c) deflate d) None of the mentioned Answer:b Explanation:Snappy is a compression library developed at Google. Snappy was designed to be fast. The __________ codec uses Google’s Snappy compression library. It exists to optimize certain operations. 9.d) All of the mentioned Answer:c Explanation:Avro includes a simple object container file format. and. 1. a) set b) reset c) alter d) select Answer:c Explanation:Alter is the command used to make changes to an existing table. you can also create a column family . READONLY. Point out the wrong statement : a) To read data from an HBase table. You can delete a column family from a table using the method _________ of HBAseAdmin class. 5. you can truncate a column family d) None of the mentioned Answer:a Explanation:Columns can also be added through HbaseAdmin. or get a set of rows by a set of row ids. use the get() method of the HTable class b) You can retrieve data from the HBase table using the get() method of the HTable class c) While retrieving data. you can set and remove table scope operators such as MAX_FILESIZE. etc. 3. a) delColumn() b) removeColumn() c) deleteColumn() d) All of the mentioned Answer:c Explanation:Alter command also can be used to delete a column family. 
or scan an entire table or a subset of rows d) None of the mentioned Answer:d Explanation:You can retrieve an HBase table data using the add method variants in Get class. __________ class adds HBase configuration files to its object. you can get a single row by id. DEFERRED_LOG_FLUSH. MEMSTORE_FLUSHSIZE. a) Configuration b) Collector c) Component . 6. 4. Which of the following is not a table scope operator ? a) MEMSTORE_FLUSH b) MEMSTORE_FLUSHSIZE c) MAX_FILESIZE d) All of the mentioned Answer:a Explanation:Using alter.c) Using disable-all. 10. _________ is the main configuration file of HBase.xml b) hbase-site.xml d) None of the mentioned Answer:b Explanation:Set the data directory to an appropriate location by opening the HBase home folder in /usr/local/HBase. a) Master Server b) Region Server c) Htable d) All of the mentioned Answer:b Explanation:Region Server handle read and write requests for all the regions under it. HBase uses the _______ File System to store its data. a) Get b) Result c) Put d) Value Answer:b Explanation:Get the result by passing your Get class instance to the get method of the HTable class. a) hbase. 7. 9. This method returns the Result class object. which holds the requested result.d) None of the mentioned Answer:a Explanation:You can create a configuration object using the create() method of the HbaseConfiguration class.xml c) hbase-site-conf. The ________ class provides the getValue() method to read the values from its instance. 8. a) Hive b) Imphala c) Hadoop . ________ communicate with the client and handle data-related operations. 2. The Hadoop MapReduce framework spawns one map task for each __________ generated by the InputFormat for the job.configure(JobConf) method and override it to initialize themselves.d) Scala Answer:c Explanation: The data storage will be in the form of regions (tables). The Mapper implementation processes one line at a time via _________ method. adoop Questions and Answers – Mapreduce Development-2 This set of Questions & Answers focuses on “Hadoop MapReduce”. a) map b) reduce c) mapper d) reducer Answer:a Explanation: The Mapper outputs are sorted and then partitioned per Reducer. . a) OutputSplit b) InputSplit c) InputSplitStream d) All of the mentioned Answer:b Explanation:Mapper implementations are passed the JobConf for the job via the JobConfigurable. 1. Point out the correct statement : a) Mapper maps input key/value pairs to a set of intermediate key/value pairs b) Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods c) Mapper and Reducer interfaces form the core of the job d) None of the mentioned Answer:d Explanation: The transformed intermediate records do not need to be of the same type as the input records. 3. These regions will be split up and stored in region servers. key. Applications can use the ____________ to report progress and set application-level status messages a) Partitioner b) OutputSplit c) Reporter d) All of the mentioned Answer:c Explanation:Reporter is also used to update Counters. value-len. 5. Users can control which keys (and hence records) go to which Reducer by implementing a custom : a) Partitioner b) OutputSplit c) Reporter d) All of the mentioned Answer:a Explanation:Users can control the grouping by specifying a Comparator via JobConf. 6. value) format d) None of the mentioned Answer:d Explanation:All intermediate values associated with a given output key are subsequently grouped by the framework.setOutputKeyComparatorClass(Class). 
or just indicate that they are alive. sorted outputs are always stored in a simple (key-len. and passed to the Reducer(s) to determine the final output. Point out the wrong statement : a) The Mapper outputs are sorted and then partitioned per Reducer b) The total number of partitions is the same as the number of reduce tasks for the job c) The intermediate. The right level of parallelism for maps seems to be around _________ maps per-node a) 1-10 b) 10-100 c) 100-150 d) 150-200 .4. 7. get c) OutputCollector.receive d) OutputCollector. 9.setNumMapTasks(int) d) All of the mentioned Answer:b Explanation:Reducer has 3 primary phases: shuffle. 8. a) sort b) shuffle c) reduce d) None of the mentioned Answer:a Explanation: The shuffle and sort phases occur simultaneously.setNumReduceTasks(int) c) JobConf.setNumTasks(int) b) JobConf. while map-outputs are being fetched they are merged. Which of the following is the default Partitioner for Mapreduce ? a) MergePartitioner b) HashedPartitioner c) HashPartitioner d) None of the mentioned .put Answer:a Explanation: The output of the Reducer is not sorted. sort and reduce.Answer:b Explanation:Task setup takes a while.collect b) OutputCollector. The framework groups Reducer inputs by key in _________ stage. 10. Hadoop Questions and Answers – MapReduce Features-1 This set of Hadoop Questions & Answers for freshers focuses on “MapReduce Features”. The output of the reduce task is typically written to the FileSystem via _____________ . so it is best if the maps take at least a minute to execute. a) OutputCollector. 1. The number of reduces for the job is set by the user via : a) JobConf. Answer:c Explanation: The total number of partitions is the same as the number of reduce tasks for the job. 4. Point out the wrong statement : a) It is legal to set the number of reduce-tasks to zero if no reduction is desired b) The outputs of the map-tasks go directly to the FileSystem c) The Mapreduce framework does not sort the map-outputs before writing them out to the FileSystem . reducers. Point out the correct statement : a) The right number of reduces seems to be 0. 2.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing. Which of the following partitions the key space ? a) Partitioner b) Compactor c) Collector d) All of the mentioned Answer:a Explanation:Partitioner controls the partitioning of the keys of the intermediate map-outputs. ____________ is a generalization of the facility provided by the MapReduce framework to collect data output by the Mapper or the Reducer a) OutputCompactor b) OutputCollector c) InputCollector d) All of the mentioned Answer:b Explanation:Hadoop MapReduce comes bundled with a library of generally useful mappers.75 b) Increasing the number of reduces increases the framework overhead c) With 0. 3.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish d) All of the mentioned Answer:c Explanation:With 1.95 or 1. 5. and partitioners. Partitioner. a) JobTracker b) TaskTracker c) TaskScheduler d) None of the mentioned Answer:a Explanation: The child-task inherits the environment of the parent TaskTracker.spill. combiner (if any). OutputFormat and OutputCommitter implementations. and any sub-process it launches recursively. Reducer.percent b) io.percent . 
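Since the questions above cover the default HashPartitioner, user-supplied Partitioners, and JobConf.setNumReduceTasks(int), here is a small sketch of a custom partitioner written against the older org.apache.hadoop.mapred API used throughout this set; FirstCharPartitioner and MyDriver are illustrative names, not Hadoop classes.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Route keys to reducers by their first character instead of the default
// HashPartitioner behaviour ((hash & Integer.MAX_VALUE) % numReduceTasks).
public class FirstCharPartitioner implements Partitioner<Text, IntWritable> {
    @Override
    public void configure(JobConf job) { /* no per-job setup needed */ }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.toString().isEmpty()) {
            return 0;
        }
        return (key.toString().charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}

// In the driver:
//   JobConf conf = new JobConf(MyDriver.class);
//   conf.setPartitionerClass(FirstCharPartitioner.class);
//   conf.setNumReduceTasks(4);   // total partitions == number of reduce tasks

All records sharing a partition number go to the same reduce task, which is why the number of partitions always equals the number of reduce tasks.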
Maximum virtual memory of the launched child-task is specified using : a) mapv b) mapred c) mapvim d) All of the mentioned Answer:b Explanation:Admins can also specify the maximum virtual memory of the launched child-task.record. a) JobConfig b) JobConf c) JobConfiguration d) All of the mentioned Answer:b Explanation:JobConf is typically used to specify the Mapper. __________ is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution. into the output path set by setOutputPath(Path). InputFormat. 7. The ___________ executes the Mapper/ Reducer task as a child process in a separate jvm. using mapred. 9.d) None of the mentioned Answer:d Explanation:Outputs of the map-tasks go directly to the FileSystem.sort. Which of the following parameter is the threshold for the accounting and serialization buffers ? a) io. 6.sort. 8. sort.sort.sort.reduce.merge.inmem.shuffle.reduce.job. ____________ specifies the number of segments on disk to be merged at the same time.threshold d) io.percent b) mapred.c) io.buffer. a) mapred.input. Hadoop Questions and Answers – MapReduce Features-2 This set of Interview Questions & Answers focuses on “MapReduce”.shuffle.inmem.job. Point out the correct statement : a) The number of sorted map outputs fetched into memory before being merged to disk b) The memory threshold for fetched map outputs before an in-memory merge is finished c) The percentage of memory relative to the maximum heapsize in which map outputs may not be retained during the reduce d) None of the mentioned .merge. 1. ______________ is percentage of memory relative to the maximum heapsize in which map outputs may be retained during the reduce.factor Answer:d Explanation:io.merge. 10.buffer.percent b) mapred. a) mapred.threshold d) io.mb d) None of the mentioned Answer:a Explanation:When percentage of either buffer has filled.factor Answer:b Explanation:When the reduce begins.job.percen c) mapred.factor limits the number of open files and compression codecs during the merge. their contents will be spilled to disk in the background.input.job. map outputs will be merged to disk until those that remain are under the resource limit this defines.merge. 2.sort.percen c) mapred. 3. During the execution of a streaming job.num.num. Map output larger than ___ percent of the memory allocated to copying map outputs.recycle. a) vmap b) mapvim c) mapreduce d) mapred .num.reuse. a) 10 b) 15 c) 25 d) 35 Answer:c Explanation:Map output will be written directly to disk without first staging through memory.job.num.reuse.job. 5.jvm. map outputs will be merged to disk until those that remain are under the resource limit this defines. the names of the _______ parameters are transformed.tasks.job.tasks b) mapissue. Jobs can enable task JVMs to be reused by specifying the job configuration : a) mapred.tasks d) All of the mentioned Answer:b Explanation:Many of my tasks had performance improved over 50% using mapissue. 4.job. 6.jvm.jvm. Point out the wrong statement : a) The task tracker has local directory to create localized cache and localized job b) The task tracker can define multiple local directories c) The Job tracker cannot define multiple local directories d) None of the mentioned Answer:d Explanation:When the job starts.jvm.tasks c) mapred.reuse.Answer:a Explanation:When the reduce begins. task tracker creates a localized job directory relative to the local directory specified in the configuration. track their progress. 9.path and LD_LIBRARY_PATH. 
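The spill, sort, and JVM-reuse properties quizzed above can be set programmatically on a JobConf. The sketch below uses the MRv1 property names cited in the questions (io.sort.mb, io.sort.spill.percent, io.sort.factor, mapred.job.reuse.jvm.num.tasks); the numeric values are arbitrary examples, and YARN-era releases expose equivalent mapreduce.* names.

import org.apache.hadoop.mapred.JobConf;

public class TuningSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf(TuningSketch.class);

        // Map-side sort buffer and spill behaviour.
        conf.setInt("io.sort.mb", 200);                    // in-memory buffer for map output
        conf.setFloat("io.sort.spill.percent", 0.80f);     // spill to disk when 80% full
        conf.setInt("io.sort.factor", 25);                 // segments merged at the same time

        // Reduce-side shuffle/merge knobs mentioned in the questions above.
        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
        conf.setFloat("mapred.job.reduce.input.buffer.percent", 0.0f);

        // Reuse task JVMs and cap the launched child task's heap.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1); // -1 = no limit on reuse
        conf.set("mapred.child.java.opts", "-Xmx512m");
    }
}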
access componenttasks’ reports and logs. __________ is used to filter log files from the output directory listing. get the MapReduce cluster’s status information and so on. The standard output (stdout) and error (stderr) streams of the task are read by the TaskTracker and logged to : a) ${HADOOP_LOG_DIR}/user b) ${HADOOP_LOG_DIR}/userlogs c) ${HADOOP_LOG_DIR}/logs d) None of the mentioned Answer:b Explanation: The child-jvm always has its current working directory added to the java. 10. The _____________ can also be used to distribute both jars and native libraries for use in the map and/or reduce tasks.library.loadLibrary or System. ____________ is the primary interface by which user-job interacts with the JobTracker. 8. a) OutputLog b) OutputLogFilter c) DistributedLog d) DistributedJars . a) JobConf b) JobClient c) JobServer d) All of the mentioned Answer:b Explanation:JobClient provides facilities to submit jobs. 7.Answer:d Explanation:To get the values in a streaming job’s mapper/reducer use the parameter names with the underscores.load. a) DistributedLog b) DistributedCache c) DistributedJars d) None of the mentioned Answer:b Explanation:Cached libraries can be loaded via System. 4.xml d) All of the mentioned Answer:b Explanation:core-default. 3. no subsequently-loaded resource can alter that value.xml is read-only defaults for hadoop. 2. ___________ gives site-specific configuration for a given hadoop installation.xml c) coredefault.d) None of the mentioned Answer:b Explanation:User can view the history logs summary in specified directory using the following command $ bin/hadoop job -history output-dir. Hadoop by default specifies two resources c) Configuration class provides access to configuration parameters d) None of the mentioned Answer:a Explanation:Once a resource declares a value final. . 1. Which of the following class provides access to configuration parameters ? a) Config b) Configuration c) OutputConfig d) None of the mentioned Answer:b Explanation:Configurations are specified by resources. Hadoop Questions and Answers – Hadoop Configuration This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Hadoop Configuration”. a) core-default. Administrators typically define parameters as final in __________ for values that user applications may not alter. Point out the correct statement : a) Configuration parameters may be declared static b) Unless explicitly turned off.xml b) core-site. 8.xml d) All of the mentioned Answer:b Explanation:Value strings are first processed for variable expansion. 7. 6. 5.xml c) coredefault. a) clear b) addResource c) getClass d) None of the mentioned Answer:a Explanation:getClass is used to get the value of the name property as a Class. Point out the wrong statement : a) addDeprecations adds a set of deprecated keys to the global deprecations b) Configuration parameters cannot be declared final c) addDeprecations method is lockless d) None of the mentioned Answer:b Explanation:Configuration parameters may be declared final. ________ checks whether the given key is deprecated. a) isDeprecated b) setDeprecated c) isDeprecatedif . ________ method adds the deprecated key to the global deprecation map. a) addDeprecits b) addDeprecation c) keyDeprecation d) None of the mentioned Answer:b Explanation:addDeprecation does not override any existing entries in the deprecation map. _________ method clears all keys from the configuration.xml b) core-site.a) core-default. 
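To make the Configuration-related questions concrete, here is a minimal sketch of how resources are layered: core-default.xml (read-only defaults) first, then core-site.xml (site-specific values), then any resource added with addResource(), with later values overriding earlier ones unless a property was marked final. The extra resource path and the properties read back are only for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ConfSketch {
    public static void main(String[] args) {
        // By default a Configuration loads core-default.xml followed by core-site.xml.
        Configuration conf = new Configuration();

        // Later resources override earlier ones, unless a property was declared
        // <final>true</final> in an earlier resource.
        conf.addResource(new Path("/etc/hadoop/conf/my-extra-site.xml")); // path is illustrative

        String fsUri = conf.get("fs.defaultFS", "file:///");
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);
        System.out.println(fsUri + " buffer=" + bufferSize);
    }
}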
a) SSL b) Kerberos c) SSH d) None of the mentioned Answer:b Explanation:Each service reads auhenticate information saved in keytab file with appropriate permission. a) addResource b) setDeprecatedProperties c) addDefaultResource d) None of the mentioned Answer:b Explanation:setDeprecatedProperties sets all deprecated properties that are not currently set but have a corresponding new property that is set. Hadoop Questions and Answers – Security This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Security”.d) All of the mentioned Answer:a Explanation:Method returns true if the key is deprecated and false otherwise. For running hadoop service daemons in Hadoop in secure mode. _________ is useful for iterating the properties when all deprecated properties for currently set properties need to be present. 9. 10. . ___________ principals are required. 1. unless they were marked final. Which of the following adds a configuration resource ? a) addResource b) setDeprecatedProperties c) addDefaultResource d) addResource Answer:d Explanation: The properties of this resource will override properties of previously added resources. address d) None of the mentioned Answer:d Explanation:Authentication is based on the assumption that the attacker won’t be able to get root privileges.group. Point out the wrong statement : a) Data transfer protocol of DataNode does not use the RPC framework of Hadoop b) Apache Oozie which access the services of Hadoop on behalf of end users need to be able to impersonate end users c) DataNode must authenticate itself by using privileged ports which are specified by dfs.datanode. 3.datanode.address and dfs. .security.http.2. a) auth b) kinit c) authorize d) All of the mentioned Answer:b Explanation:HTTP web-consoles should be served by principal different from RPC’s one. The simplest way to do authentication is using _________ command of Kerberos. Data transfer between Web-console and clients are protected by using : a) SSL b) Kerberos c) SSH d) None of the mentioned Answer:a Explanation:AES offers the greatest cryptographic strength and the best performance 5. 4. Point out the correct statement : a) Hadoop does have the definition of group by itself b) MapReduce JobHistory server run as same user such as mapred c) SSO environment is managed using Kerberos with LDAP for Hadoop in secure mode d) None of the mentioned Answer:c Explanation:You can change a way of mapping by specifying the name of mapping provider as a value of hadoop.mapping. security. The __________ provides a proxy between the web applications exported by an application and an end user. a) Container b) ContainerExecutor c) Executor d) All of the mentioned Answer:b Explanation: The container process has the same Unix user as the NodeManager. ___________ used by YARN framework which define how any container launched and controlled.local-dirs a) TaskController b) LinuxTaskController c) LinuxController d) None of the mentioned .authentication property to : a) zero b) kerberos c) false d) None of the mentioned Answer:b Explanation:Security settings need to be modified properly for robustness. 9. In order to turn on RPC authentication in hadoop.nodemanager.6. 8. a) ProxyServer b) WebAppProxy c) WebProxy d) None of the mentioned Answer:b Explanation:If security is enabled it will warn users before accessing a potentially unsafe web application. 7. The ____________ requires that paths including and leading up to the directories specified in yarn. 
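A hedged sketch of what kinit-style authentication looks like from Java when Hadoop runs in secure mode: hadoop.security.authentication is set to kerberos and a service or client logs in from its keytab. The principal name and keytab path below are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureLoginSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Turn on Kerberos authentication for Hadoop RPC, as described above.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // A daemon (or client) authenticates with the principal and keytab it was
        // issued; both values here are invented placeholders.
        UserGroupInformation.loginUserFromKeytab(
                "hdfs/node1.example.com@EXAMPLE.COM",
                "/etc/security/keytabs/hdfs.service.keytab");

        System.out.println("Logged in as " + UserGroupInformation.getCurrentUser());
    }
}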
Authentication and authorization using the proxy is handled just like any other privileged web application. set the value of hadoop. The configuration file must be owned by the user running : a) DataManager b) NodeManager c) ValidationManager d) None of the mentioned Answer:b Explanation:To re-cap. 1. 10. block replicas are stored according to the storage type list b) One_SSD is used for storing all replicas in SSD c) Hot policy is useful only for single replica blocks d) All of the mentioned Answer:a Explanation: The first phase of Heterogeneous Storage changed datanode storage model from a single storage. ___________ is added for supporting writing single replica files in memory.local file-sysytem permissions need to be modified Hadoop Questions and Answers – MapReduce Job-1 This set of Hadoop Interview Questions & Answers for freshers focuses on “MapReduce Job”. a) ROM_DISK b) ARCHIVE c) RAM_DISK . __________ storage is a solution to decouple growing storage capacity from compute capacity.Answer:b Explanation:LinuxTaskController keeps track of all paths and directories on datanode. a) DataNode b) Archival c) Policy d) None of the mentioned Answer:b Explanation:Nodes with higher density and less expensive storage with low compute power are becoming available. 2. 3. Point out the correct statement : a) When there is enough space. 5.enabled is used for enabling/disabling the storage policy feature d) None of the mentioned Answer:d Explanation: The effective storage policy can be retrieved by the “dfsadmin -getStoragePolicy” command. Point out the wrong statement : a) A Storage policy consists of the Policy ID b) The storage policy can be specified using the “dfsadmin -setStoragePolicy” command c) dfs. 7. 6. Which of the following storage policy is used for both storage and compute ? a) Hot b) Cold c) Warm d) All_SSD Answer:a Explanation:When a block is hot. Which of the following has high storage density ? a) ROM_DISK b) ARCHIVE c) RAM_DISK d) All of the mentioned Answer:b Explanation:Little compute power is added for supporting archival storage.storage. all replicas are stored in DISK. 4. Which of the following is only for storage with limited compute ? a) Hot b) Cold c) Warm d) All_SSD .d) All of the mentioned Answer:c Explanation:DISK is the default storage type.policy. ___________ is used for writing blocks with single replica in memory. 1. a) Mover b) Hiver c) Serde . ____________ is used for storing one of the replicas in SSD. _________ is a data migration tool added for archiving data. 8. a) Hot b) Lazy_Persist c) One_SSD d) All_SSD Answer:c Explanation: The remaining replicas are stored in DISK. 10. 9. a) Hot b) Lazy_Persist c) One_SSD d) All_SSD Answer:b Explanation: The replica is first written in RAM_DISK and then it is lazily persisted in DISK. When a block is warm. some of its replicas are stored in DISK and the remaining replicas are stored in : a) ROM_DISK b) ARCHIVE c) RAM_DISK d) All of the mentioned Answer:b Explanation:Warm storage policy is partially hot and partially cold. all replicas are stored in ARCHIVE. Hadoop Questions and Answers – MapReduce Job-2 This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “MapReduce Job2”.Answer:b Explanation:When a block is cold. 3. Point out the correct statement : a) Mover is not similar to Balancer b) hdfs dfsadmin -setStoragePolicy puts a storage policy to a file or a directory. 
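The storage-policy questions (HOT, COLD, WARM, ONE_SSD, LAZY_PERSIST, ARCHIVE) can also be exercised from Java rather than the dfsadmin shell. The sketch below assumes Hadoop 2.6 or later, where DistributedFileSystem exposes setStoragePolicy(); the NameNode URI and the directory paths are invented for illustration.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class StoragePolicySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // URI is illustrative

        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // COLD: all replicas on ARCHIVE storage (high density, limited compute).
            dfs.setStoragePolicy(new Path("/archive/clickstream"), "COLD");
            // ONE_SSD: one replica on SSD, the remaining replicas on DISK.
            dfs.setStoragePolicy(new Path("/hot/dashboards"), "ONE_SSD");
            // LAZY_PERSIST: a single replica written to RAM_DISK, lazily persisted to DISK.
            dfs.setStoragePolicy(new Path("/tmp/scratch"), "LAZY_PERSIST");
        }
    }
}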
4.d) None of the mentioned Answer:a Explanation:Mover periodically scans the files in HDFS to check if the block placement satisfies the storage policy. Which of the following is used to list out the storage policies ? a) hdfs storagepolicies b) hdfs storage c) hd storagepolicies d) All of the mentioned Answer:a Explanation:Arguments are none for the hdfs storagepolicies command. c) addCacheArchive add archives to be localized d) None of the mentioned Answer:c Explanation:addArchiveToClassPath(Path archive) adds an archive path to the current set of classpath entries. 2. Which of the following node is responsible for executing a Task assigned to it by the JobTracker ? . 1. Which of the following statement can be used get the storage policy of a file or a directory ? a) hdfs dfsadmin -getStoragePolicy path b) hdfs dfsadmin -setStoragePolicy path policyName c) hdfs dfsadmin -listStoragePolicy path policyName d) All of the mentioned Answer:a Explanation: refers to the path referring to either a directory or a file. Hadoop Questions and Answers – Task Execution This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Task Execution”. Executes the Task. 3. 2. ___________ part of the MapReduce is responsible for processing one or more chunks of data and producing the output results.a) MapReduce b) Mapper c) TaskTracker d) JobTracker Answer:c Explanation:TaskTracker receives the information necessary for execution of a Task from JobTracker. d) All of the mentioned Answer:a Explanation:This feature of MapReduce is “Data Locality”. a) Reduce b) Map c) Reducer d) All of the mentioned Answer:a Explanation:Reduce function collates the work and resolves the results. 5. _________ function is responsible for consolidating the results produced by each of the Map() functions/tasks. Point out the correct statement : a) MapReduce tries to place the data and the compute as close as possible b) Map Task in MapReduce is performed using the Mapper() function. Point out the wrong statement : a) A MapReduce job usually splits the input data-set into independent chunks which are . c) Reduce Task in MapReduce is performed using the Map() function. and Sends the Results back to JobTracker. 4. a) Maptask b) Mapper c) Task execution d) All of the mentioned Answer:a Explanation:Map Task in MapReduce is performed using the Map() function. MapReduce applications need not be written in : a) Java b) C c) C# d) None of the mentioned Answer:a Explanation:Hadoop Pipes is a SWIG. a) Hadoop Strdata b) Hadoop Streaming c) Hadoop Stream d) None of the mentioned Answer:b Explanation:Hadoop streaming is one of the most important utilities in the Apache Hadoop distribution. __________ maps input key/value pairs to a set of intermediate key/value pairs.compatible C++ API to implement MapReduce applications (non JNITM based). 7. a) Mapper b) Reducer c) Both Mapper and Reducer d) None of the mentioned . Although the Hadoop framework is implemented in Java . monitoring them and reexecutes the failed tasks. 8. 6.processed by the map tasks in a completely parallel manner b) The MapReduce framework operates exclusively on pairs c) Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods d) None of the mentioned Answer:d Explanation: The MapReduce framework takes care of scheduling tasks. ________ is a utility which allows users to create and run jobs with any executable as the mapper and/or the reducer. component tasks need to create and/or write to side-files. 
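Because several of the questions above hinge on the Mapper turning input key/value pairs into intermediate pairs and on the Reporter being used to signal that a task is alive, here is a short mapper sketch against the older org.apache.hadoop.mapred API; TokenMapper is an illustrative name.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Maps input key/value pairs (byte offset, line of text) to intermediate
// pairs (word, 1), and uses the Reporter to signal progress/liveness.
public class TokenMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                output.collect(word, ONE);
            }
        }
        reporter.progress();   // tell the framework the task is still alive
    }
}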
a) MapReduce b) Map c) Reducer d) All of the mentioned Answer:a Explanation: In some applications. 9. Point out the correct statement : a) YARN also extends the power of Hadoop to incumbent and new technologies found within the data center . ________ is the architectural center of Hadoop that allows multiple data processing engines. Hadoop Questions and Answers – YARN-1 This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “YARN-1”. 1. a) YARN b) Hive c) Incubator d) Chuckwa Answer:a Explanation:YARN is the prerequisite for Enterprise Hadoop. Running a ___________ program involves running mapping tasks on many or all of the nodes in our cluster. 10. security. 2. providing resource management and a central platform to deliver consistent operations. and data governance tools across Hadoop clusters.Answer:a Explanation:Maps are the individual tasks that transform input records into intermediate records. The number of maps is usually driven by the total size of : a) inputs b) outputs c) tasks d) None of the mentioned Answer:a Explanation:Total size of inputs means total number of blocks of the input files. which differ from the actual job-output files. 3. a) Hive b) MapReduce c) Imphala d) All of the mentioned Answer:b Explanation:Multi-tenant data processing improves an enterprise’s return on its Hadoop investments. YARN’s dynamic allocation of cluster resources improves utilization over more static _______ rules used in early versions of Hadoop. Point out the wrong statement : a) From the system perspective. monitoring their resource usage d) None of the mentioned . the ApplicationMaster runs as a normal container. which is responsible for launching the applications’ containers. 5.b) YARN is the central point of investment for Hortonworks within the Apache community c) YARN enhances a Hadoop compute cluster in many ways d) All of the mentioned Answer:d Explanation:YARN provides ISVs and developers a consistent framework for writing data access applications that run IN Hadoop. which is responsible for launching the applications’ containers c) The NodeManager is the per-machine slave. The __________ is a framework-specific entity that negotiates resources from the ResourceManager a) NodeManager b) ResourceManager c) ApplicationMaster d) All of the mentioned Answer:c Explanation:Each ApplicationMaster has responsibility for negotiating appropriate resource containers from the schedule. 4. b) The ResourceManager is the per-machine slave. Answer:b Explanation:ResourceManager has a scheduler. 6. The __________ is responsible for allocating resources to the various running applications subject to familiar constraints of capacities. MapReduce has undergone a complete overhaul in hadoop : a) 0. 8. the NodeManager (NM). 7.23 c) 0.24 d) 0. according to constraints such as queue capacities and user limits. Apache Hadoop YARN stands for : a) Yet Another Reserve Negotiator b) Yet Another Resource Network c) Yet Another Resource Negotiator d) All of the mentioned Answer:c Explanation:YARN is a cluster management technology. form the datacomputation framework. a) NodeManager b) ResourceManager c) ApplicationMaster d) All of the mentioned Answer:b Explanation: The ResourceManager and per-node slave. 9.26 Answer:b Explanation: The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker.21 b) 0. queues etc. a) Manager b) Master c) Scheduler . which is responsible for allocating resources to the various applications running in the cluster. 
The ____________ is the ultimate authority that arbitrates resources among all the applications in the system. a) hive b) bin c) hadoop d) home Answer:b Explanation:Running the yarn script without any arguments prints the description for all commands. which is responsible for partitioning the cluster resources among the various queues. Hadoop Questions and Answers – YARN-2 This set of Hadoop Question Bank focuses on “YARN”. Point out the correct statement : a) Each queue has strict ACLs which controls which users can submit applications to individual queues b) Hierarchy of queues is supported to ensure resources are shared among the sub-queues of an organization c) Queues are allocated a fraction of the capacity of the grid in the sense that a certain capacity of resources will be at their disposal d) All of the mentioned . applications etc. Yarn commands are invoked by the ________ script.d) None of the mentioned Answer:c Explanation: The Scheduler is pure scheduler in the sense that it performs no monitoring or tracking of status for the application. 2. 1. a) Networked b) Hierarchial c) Partition d) None of the mentioned Answer:b Explanation: The Scheduler has a pluggable policy plug-in. The CapacityScheduler supports _____________ queues to allow for more predictable sharing of cluster resources. 10. ACLs can be changed. The CapacityScheduler has a pre-defined queue called : a) domain b) root c) rear d) All of the mentioned Answer:b Explanation:All queueus in the system are children of the root queue. Point out the wrong statement : a) The multiple of the queue capacity which can be configured to allow a single user to acquire more resources b) Changing queue properties and adding new queues is very simple c) Queues cannot be deleted.Answer:d Explanation:All applications submitted to a queue will have access to the capacity allocated to the queue. 5. 3. a) tolerant b) capacity c) speed d) All of the mentioned Answer:b Explanation:Administrators can add additional queues at runtime. 4. but queues cannot be deleted at runtime. The updated queue configuration should be a valid one i. 6. only addition of new queues is supported d) None of the mentioned Answer:d Explanation:You need to edit conf/capacity-scheduler.e.xml and run yarn rmadmin -refreshQueues for changing queue properties. queue-capacity at each level should be equal to : a) 50% b) 75% c) 100% . The queue definitions and properties such as ________. at runtime. only addition of new queues is supported. a) java b) jar c) C code d) xml Answer:b Explanation:Usage: yarn jar [mainClass] args… 8. 7. 10. Users can bundle their Yarn code in a _________ file and execute it using jar command. 9.d) 0% Answer:c Explanation:Queues cannot be deleted. a) -format-state b) -form-state-store c) -format-state-store d) None of the mentioned Answer:c Explanation:-format-state-store formats the RMStateStore. __________ will clear the RMStateStore and is useful if past applications are no longer needed. Which of the following command is used to dump the log container ? a) logs b) log c) dump d) All of the mentioned Answer:a Explanation:Usage: yarn logs -applicationId . Which of the following command runs ResourceManager admin client ? a) proxyserver b) run c) admin d) rmadmin . a) textformat b) split c) datanode d) All of the mentioned .Answer:d Explanation:proxyserver command starts the web proxy server. 1. In _____________. 
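As a practical companion to the ResourceManager, NodeManager, and CapacityScheduler questions, the sketch below uses the YARN client API (Hadoop 2.x) to list the running nodes and the queues hanging off the predefined root queue; it assumes yarn-site.xml is available on the classpath.

import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.api.records.QueueInfo;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnInfoSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml from the classpath
        yarnClient.start();

        // The ResourceManager arbitrates cluster resources; NodeManagers report per-machine capacity.
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + " capability=" + node.getCapability());
        }

        // CapacityScheduler queues form a hierarchy under the predefined "root" queue.
        List<QueueInfo> queues = yarnClient.getAllQueues();
        for (QueueInfo q : queues) {
            System.out.println(q.getQueueName() + " capacity=" + q.getCapacity());
        }

        yarnClient.stop();
    }
}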
although the reduce output types may be different again b) The map input key and value types (K1 and V1) are different from the map output types c) The partition function operates on the intermediate key d) All of the mentioned Answer:d Explanation:In practice. 4. ___________ generates keys of type LongWritable and values of type Text. to the Java equivalent. you don’t need to call setMapOutputKeyClass(). 2. a) TextOutputFormat b) TextInputFormat c) OutputInputFormat d) None of the mentioned Answer:b Explanation:If K2 and K3 are the same. the default job is similar. 3. Hadoop Questions and Answers – Mapreduce Types This set of Hadoop Questions & Answers for experienced focuses on “MapReduce Types”. An input _________ is a chunk of the input that is processed by a single map. the partition is determined solely by the key (the value is ignored). but not identical. a) Mapreduce b) Streaming c) Orchestration d) All of the mentioned Answer:b Explanation:MapReduce Types and Formats MapReduce has a simple model of data processing. Point out the correct statement : a) The reduce input must have the same types as the map output. and the map processes each record—a key-value pair—in turn. An ___________ is responsible for creating the input splits. which it passes to the map function. and the map task uses one to generate record key-value pairs. you only need to use setOutputValueClass() b) The overall effect of Streaming job is to perform a sort of the input c) A Streaming application can control the separator that is used when a key-value pair is turned into a series of bytes and sent to the map or reduce process over standard input d) None of the mentioned Answer:d Explanation:If a combine function is used then it is the same form as the reduce function. Which of the following is the only way of running mappers ? a) MapReducer b) MapRunner .Answer:b Explanation:Each split is divided into records. 6. you don’t need to deal with InputSplits directly. ______________ is another implementation of the MapRunnable interface that runs mappers concurrently in a configurable number of threads. except its output types are the intermediate key and value types (K2 and V2). as they are created by an InputFormat. 5. so they can feed the reduce function. a) MultithreadedRunner b) MultithreadedMap c) MultithreadedMapRunner d) SinglethreadedMapRunner Answer:c Explanation:A RecordReader is little more than an iterator over records. and dividing them into records. 7. Point out the wrong statement : a) If V2 and V3 are the same. a) TextOutputFormat b) TextInputFormat c) OutputInputFormat d) InputFormat Answer:d Explanation:As a MapReduce application writer. 8. a) FileTextFormat b) FileInputFormat c) FileOutputFormat d) None of the mentioned Answer:b Explanation:FileInputFormat provides implementation for generating splits for the input files. 1. The split size is normally the size of an ________ block. 10. a) generic b) task c) library d) HDFS . which is appropriate for most applications. the client sends them to the jobtracker.c) MapRed d) All of the mentioned Answer:b Explanation:Having calculated the splits. 9. Which of the following method add a path or paths to the list of inputs ? a) setInputPaths() b) addInputPath() c) setInput() d) None of the mentioned Answer:b Explanation:FileInputFormat offers four static convenience methods for setting a JobConf’s input paths. Hadoop Questions and Answers – Mapreduce Formats-1 This set of Hadoop Interview Questions & Answers for experienced focuses on “MapReduce Formats”. 
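The MapReduce Types questions describe the (K1, V1) → (K2, V2) → (K3, V3) progression and the FileInputFormat path helpers. Below is a minimal old-API driver sketch showing where each type is declared; the input and output paths are invented, and a complete driver would also set its Mapper and Reducer classes and submit the job (left as comments with hypothetical class names).

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class TypesDriverSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TypesDriverSketch.class);

        // TextInputFormat generates LongWritable keys (byte offsets) and Text values (lines).
        conf.setInputFormat(TextInputFormat.class);

        // Intermediate (map output) types K2/V2; only needed when they differ
        // from the final output types K3/V3 declared below.
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);

        // Final output types and format.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // FileInputFormat convenience methods for the job's input paths.
        FileInputFormat.setInputPaths(conf, new Path("/data/in"));      // replaces the list
        FileInputFormat.addInputPath(conf, new Path("/data/in-extra")); // appends to the list
        FileOutputFormat.setOutputPath(conf, new Path("/data/out"));

        // conf.setMapperClass(TokenMapper.class);   // hypothetical mapper producing (Text, IntWritable)
        // conf.setReducerClass(SumReducer.class);   // hypothetical reducer with matching types
        // JobClient.runJob(conf);                   // submit once mapper/reducer are set
    }
}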
_________ is the base class for all implementations of InputFormat that use files as their data source . 3. 4. 5. forcing splits to be smaller than a block.Answer:d Explanation:FileInputFormat splits only large files(Here “large” means larger than an HDFS block). Point out the correct statement : a) The minimum split size is usually 1 byte. c) The maximum split size defaults to the maximum value that can be represented by a Java long type d) All of the mentioned Answer:a Explanation: The maximum split size has an effect only when it is less than the block size. although some formats have a lower bound on the split size b) Applications may impose a minimum split size. Point out the wrong statement : a) Hadoop works better with a small number of large files than a large number of small files b) CombineFileInputFormat is designed to work well with small files c) CombineFileInputFormat does not compromise the speed at which it can process the input in a typical MapReduce job . To set an environment variable in a streaming command use: a) -cmden EXAMPLE_DIR=/home/example/dictionaries/ b) -cmdev EXAMPLE_DIR=/home/example/dictionaries/ c) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/ d) -cmenv EXAMPLE_DIR=/home/example/dictionaries/ Answer:c Explanation:Environment Variable is set using cmdenv command. Which of the following Hadoop streaming command option parameter is required ? a) output directoryname b) mapper executable c) input directoryname d) All of the mentioned Answer:d Explanation:Required parameters is used for Input and Output location for mapper. 2. d) None of the mentioned Answer:c Explanation:If the file is very small (“small” means significantly smaller than an HDFS block) and there are a lot of them. KeyFieldBasedComparator. 7. 6. ______________ class allows the Map/Reduce framework to partition the map outputs based on certain key fields. Which of the following class provides a subset of features provided by the Unix/GNU Sort ? a) KeyFieldBased b) KeyFieldComparator c) KeyFieldBasedComparator d) All of the mentioned Answer:c Explanation:Hadoop has a library class. a) archives b) files c) task d) None of the mentioned Answer:a Explanation:Archives options is also a generic option. a) KeyFieldPartitioner b) KeyFieldBasedPartitioner c) KeyFieldBased d) None of the mentioned Answer:b Explanation: The primary key is used for partitioning. 9. that is useful for many applications. Which of the following class is provided by Aggregate package ? a) Map . then each map task will process very little input. and the combination of the primary and secondary keys is used for sorting. The ________ option allows you to copy jars locally to the current working directory of tasks and automatically unjar the files. each of which imposes extra bookkeeping overhead. and there will be a lot of them (one per file). not the whole keys. 8. each mapper receives a variable number of lines of input b) StreamXmlRecordReader. 10. 2.b) Reducer c) Reduce d) None of the mentioned Answer:b Explanation:Aggregate provides a special reducer class and a special combiner class.hadoop. that effectively allows you to process text data like the unix ______ utility. Hadoop Questions and Answers – Mapreduce Formats-2 This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Mapreduce Formats-2”. 1. and a list of simple aggregators that perform aggregations such as “sum”. 
a) Copy b) Cut c) Paste d) Move Answer:b Explanation: The map function defined in the class treats each input key/value pair as a list of fields. org.mapred. the page elements can be interpreted as records for processing by a mapper .apache. “min” and so on over a sequence of values.Hadoop has a library class.lib. Point out the correct statement : a) With TextInputFormat and KeyValueTextInputFormat. “max”. ___________ takes node and rack locality into account when deciding which blocks to place in the same split a) CombineFileOutputFormat b) CombineFileInputFormat c) TextFileInputFormat d) None of the mentioned Answer:b Explanation:CombineFileInputFormat does not compromise the speed at which it can process the input in a typical MapReduce job.FieldSelectionMapReduce. Hadoop’s default OutputFormat. d) None of the mentioned Answer:c Explanation:SequenceFileAsBinaryInputFormat is used for reading keys. a) LongReadable b) LongWritable c) LongWritable d) All of the mentioned Answer:b Explanation: The value is the contents of the line. The key. Point out the wrong statement : a) Hadoop’s sequence file format stores sequences of binary key-value pairs b) SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat that retrieves the sequence file’s keys and values as opaque binary objects c) SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that retrieves the sequence file’s keys and values as opaque binary objects. is the byte offset within the file of the beginning of the line. _________ is the output produced by TextOutputFor mat. excluding any line terminators (newline. and is packaged as a Text object. 4. d) All of the mentioned Answer:d Explanation:Large XML documents that are composed of a series of “records” can be broken into these records using simple string or regular-expression matching to find start and end tags of records. carriage return). a ____________. values from SequenceFiles in binary (raw) format.c) The number depends on the size of the split and the length of the lines. KeyValueTextInputFormat is appropriate. 3. a) KeyValueTextInputFormat b) KeyValueTextOutputFormat c) FileValueTextInputFormat d) All of the mentioned Answer:b Explanation:To interpret such files correctly. 5. . with all records that share the same key being processed by the same reduce task. ___________ is an input format for reading data from a relational database. the other a binary sequence file. they may have different representations.6. 7. a) MultipleOutputs b) MultipleInputs c) SingleInputs d) None of the mentioned Answer:b Explanation:One might be tab-separated plain text. 9. Even if they are in the same format. records will be allocated evenly across reduce tasks. __________ class allows you to specify the InputFormat and Mapper to use on a per-path basis. 8. __________ is a variant of SequenceFileInputFormat that converts the sequence file’s keys and values to Text objects a) SequenceFile b) SequenceFileAsTextInputFormat c) SequenceAsTextInputFormat d) All of the mentioned Answer:b Explanation:With multiple reducers. a) DBInput b) DBInputFormat c) DBInpFormat d) All of the mentioned Answer:b Explanation:DBInputFormat is the most frequently used format for reading data. Which of the following is the default output format ? a) TextFormat b) TextOutput c) TextOutputFormat d) None of the mentioned . and therefore need to be parsed differently. using JDBC. 3. value-len. Which of the following writes MapFiles as output ? 
a) DBInpFormat b) MapFileOutputFormat c) SequenceFileAsBinaryOutputFormat d) None of the mentioned Answer:c Explanation:SequenceFileAsBinaryOutputFormat writes keys and values in raw binary format into a SequenceFile container. value) format d) All of the mentioned Answer: d Explanation:Reporters can be used to set application-level status messages and update Counters. Point out the correct statement : a) Applications can use the Reporter to report progress b) The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job c) The intermediate. Hadoop Questions and Answers – Hadoop Cluster-1 This set of Questions and Answers focuses on “Hadoop Cluster” 1. key. Mapper implementations are passed the JobConf for the job via the ________ method a) JobConfigure. 10. 2. sorted outputs are always stored in a simple (key-len.Answer:c Explanation:TextOutputFormat keys and values may be of any type.configure b) JobConfigurable.configure c) JobConfigurable.configure method is overrided to initialize themselves. a) Reducer b) Mapper c) Shuffle .configureable d) None of the mentioned Answer:b Explanation:JobConfigurable. Input to the _______ is the sorted output of the mappers. sort and reduce. 5. The right number of reduces seems to be : a) 0. 7.95 or 1. 4. The output of the _______ is not sorted in the Mapreduce framework for Hadoop.36 d) 0. but increases load balancing and lowers the cost of failures c) It is legal to set the number of reduce-tasks to zero if no reduction is desired d) The framework groups Reducer inputs by keys (since different mappers may have output the same key) in sort stage Answer:a Explanation:Reducer has 3 primary phases: shuffle. Point out the wrong statement : a) Reducer has 2 primary phases b) Increasing the number of reduces increases the framework overhead. The output of the Reducer is not sorted. Which of the following phases occur simultaneously ? a) Shuffle and Sort b) Reduce and Sort c) Shuffle and Map .d) All of the mentioned Answer:a Explanation:In Shuffle phase the framework fetches the relevant partition of the output of all the mappers.80 c) 0. a) Mapper b) Cascader c) Scalding d) None of the mentioned Answer:d Explanation: The output of the reduce task is typically written to the FileSystem. via HTTP.90 b) 0.95 Answer:d Explanation: The right number of reduces seems to be 0.75. 6. 8. a) Map Parameters b) JobConf c) MemoryConf d) None of the mentioned Answer:b Explanation:JobConf represents a MapReduce job configuration. Mapper and Reducer implementations can use the ________ to report progress or just indicate that they are alive. 10. and partitioners. while map-outputs are being fetched they are merged. a) Partitioner b) OutputCollector c) Reporter d) All of the mentioned Answer:c Explanation:Reporter is a facility for MapReduce applications to report progress. __________ is a generalization of the facility provided by the MapReduce framework to collect data output by the Mapper or the Reducer a) Partitioner b) OutputCollector c) Reporter d) All of the mentioned Answer:b Explanation:Hadoop MapReduce comes bundled with a library of generally useful mappers. 9. set applicationlevel status messages and update Counters. reducers. _________ is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution. Hadoop Questions and Answers – Hadoop Cluster-2 .d) All of the mentioned Answer:a Explanation: The shuffle and sort phases occur simultaneously. 1. 2. 
Point out the correct statement : a) Hadoop is ideal for the analytical.This set of Hadoop assessment questions focuses on “Hadoop Cluster”. 4. data-warehouse-ish type of workload b) HDFS runs on a small cluster of commodity-class nodes c) NEWSQL is frequently the collection point for big data d) None of the mentioned Answer:a Explanation:Hadoop together with a relational data warehouse. Hadoop data is not sequenced and is in 64MB to 256 MB block sizes of delimited record values with schema applied on read based on: a) HCatalog b) Hive c) Hbase d) All of the mentioned Answer:a Explanation:Other means of tagging the values also can be used. a) NoSQL b) NewSQL c) SQL d) All of the mentioned Answer:a Explanation: NoSQL systems make the most sense whenever the application is based on data with varying data types and the data can be stored in key-value notation. 3. ________ systems are scale-out file-based (HDD) systems moving to more uses of memory in the nodes. post-operational. __________ are highly resilient and eliminate the single-point-of-failure risk with traditional Hadoop deployments a) EMR b) Isilon solutions c) AWS d) None of the mentioned . they can form very effective data warehouse infrastructure. Answer:b Explanation:enterprise data protection and security options including file system auditing and data-at-rest encryption to address compliance requirements is also provided by Isilon solution. 5. 8. HDFS and NoSQL file systems focus almost exclusively on adding nodes to : a) Scale out b) Scale up c) Both Scale out and up d) None of the mentioned Answer:a Explanation:HDFS and NoSQL file systems focus almost exclusively on adding nodes to increase performance (scale-out) but even they require node configuration with elements of scale up. scalable Big Data store that lets you host very large tables — billions of rows multiplied by millions of columns — on clusters built with commodity hardware. The ___________ can also be used to distribute both jars and native libraries for use in the map and/or reduce tasks. 7. 6. Point out the wrong statement : a) EMC Isilon Scale-out Storage Solutions for Hadoop combine a powerful yet simple and highly efficient storage platform b) Isilon’s native HDFS integration means you can avoid the need to invest in a separate Hadoop infrastructure c) NoSQL systems do provide high latency access and accommodate less concurrent users d) None of the mentioned Answer:c Explanation:NoSQL systems do provide low latency access and accommodate many concurrent users. Which is the most popular NoSQL database for scalable big data store with Hadoop ? a) Hbase b) MongoDB c) Cassandra d) None of the mentioned Answer:a Explanation:HBase is the Hadoop database: a distributed. . 10. 1. Hadoop Questions and Answers – HDFS Maintenance This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “HDFS Maintenance”. a) Scale-out b) Scale-down c) Scale-up d) None of the mentioned Answer:c Explanation:dding more CPU/RAM/Disk capacity to Hadoop DataNode that is already part of a cluster does not require additional network switches. a) TopTable b) BigTop c) Bigtable d) None of the mentioned Answer:c Explanation: Google Bigtable leverages the distributed data storage provided by the Google File System.a) DataCache b) DistributedData c) DistributedCache d) All of the mentioned Answer:c Explanation: The child-jvm always has its current working directory added to the java.library. Which of the following is a common hadoop maintenance issue ? 
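Since DistributedCache appears again here as the mechanism for shipping jars and native libraries to map and reduce tasks, a short sketch of the MRv1-era calls follows; every path and URI in it is invented for illustration.

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CacheSketch.class);

        // Ship a lookup file to every task's working directory.
        DistributedCache.addCacheFile(new URI("/apps/lookup/countries.txt"), conf);

        // Distribute a jar and an archive (e.g. native libraries) for use in tasks.
        DistributedCache.addFileToClassPath(new Path("/apps/lib/parsers.jar"), conf);
        DistributedCache.addCacheArchive(new URI("/apps/lib/native.tgz"), conf);

        // Symlinks make the cached files visible under their own names in the task cwd,
        // where they can be picked up via java.library.path / LD_LIBRARY_PATH.
        DistributedCache.createSymlink(conf);
    }
}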
a) Lack of tools b) Lack of configuration management c) Lack of web interface d) None of the mentioned . HBase provides ___________ like capabilities on top of Hadoop and HDFS. performance and complexity. _______ refers to incremental costs with no major impact on solution design.path and LD_LIBRARY_PATH. 9. service. Which of the following is a configuration management system ? a) Alex b) Puppet c) Acem d) None of the mentioned Answer:b Explanation:Administrators may use configuration management systems such as Puppet and Chef to manage processes. 5. 3. then any roles running on that host are put into effective maintenance mode c) Putting a component into maintenance mode prevent events from being logged . role. you end up with a number of issues that can cascade just as your usage picks up. a) Safe b) Maintenance c) Secure d) All of the mentioned Answer:b Explanation:Maintenance mode can be useful when you need to take actions in your cluster and do not want to see the alerts that will be generated due to those actions. 2. 4. or even the entire cluster.Answer:b Explanation:Without a centralized configuration management framework. ___________ mode allows you to suppress alerts for a host. then its roles (HBase Master and all Region Servers) are put into effective maintenance mode b) If you set a host into maintenance mode. Point out the wrong statement : a) If you set the HBase service into maintenance mode. Point out the correct statement : a) RAID is turned off by default b) Hadoop is designed to be a highly redundant distributed system c) Hadoop has a networked configuration system d) None of the mentioned Answer:b Explanation:Hadoop deployment is sometimes difficult to implement. a) Microsoft b) Cloudera c) Amazon d) None of the mentioned Answer:b Explanation:Manager’s Service feature presents health and performance data in a variety of formats. Which of the following is a common reason to restart hadoop process ? a) Upgrade Hadoop b) React to incidents c) Remove worker nodes d) All of the mentioned Answer:d Explanation: The most common reason administrators restart Hadoop processes is to enact configuration changes.d) None of the mentioned Answer:c Explanation:Maintenance mode only suppresses the alerts that those events would otherwise generate. 9. 6. 7. Which of the tab shows all the role instances that have been instantiated for this service ? a) Service b) Status c) Instance d) All of the mentioned Answer:c Explanation: The Instances page displays the results of the configuration validation checks it performs for all the role instances for this service. 8. __________ Manager’s Service feature monitors dozens of service health and performance metrics about the services and role instances running on your cluster. a) JVX b) JVM . __________ is a standard Java API for monitoring and managing applications. 1. 2.c) JMX d) None of the mentioned Answer:c Explanation:Hadoop includes several managed beans (MBeans). files stored on HDFS. a) Data Node b) NameNode c) Resource d) Replication Answer:c Explanation:All the metadata related to HDFS including the information about data nodes. Hadoop Questions and Answers – Monitoring HDFS This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on “Monitoring HDFS”. NameNode is monitored and upgraded in a __________ transition. the ___________ Manager UI provides host and port information. etc. For YARN. 
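The JMX question above can be made concrete: Hadoop daemons publish their managed beans (MBeans) as JSON through the /jmx servlet of their web UIs, so JMX-aware tools or plain HTTP clients can read the metrics. The sketch below polls the NameNode's FSNamesystem bean; the host name is invented and 50070 is only the Hadoop 2.x default HTTP port.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class JmxMetricsSketch {
    public static void main(String[] args) throws Exception {
        // Query a single MBean from the NameNode's /jmx servlet.
        URL url = new URL("http://namenode.example.com:50070/jmx"
                + "?qry=Hadoop:service=NameNode,name=FSNamesystem");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // JSON with CapacityUsed, BlocksTotal, etc.
            }
        }
    }
}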
a) safemode b) securemode c) servicemode d) None of the mentioned Answer:b Explanation: The HDFS service has some unique functions that may result in additional information on its Status and Instances pages. Point out the correct statement : a) The Hadoop framework publishes the job flow status to an internally running web server on the master nodes of the Hadoop cluster b) Each incoming file is broken into 32 MB by default c) Data blocks are replicated across different nodes in the cluster to ensure a low degree of fault tolerance d) None of the mentioned . and Replication. 10. which expose Hadoop metrics to JMX-aware applications. are stored and maintained on the NameNode. 4. ZooKeeper information. Which of the following scenario may not be a good fit for HDFS ? a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file b) HDFS is suitable for storing data related to applications requiring low latency data access c) HDFS is suitable for storing data related to applications requiring low latency data access d) None of the mentioned . debug dumps. a) HBase b) Oozie c) Kafka d) All of the mentioned Answer:a Explanation:HBase Master UI provides information about the number of live.Answer:a Explanation: The web interface for the Hadoop Distributed File System (HDFS) shows information about the NameNode itself. 6. a) Rack b) Data c) Secondary d) None of the mentioned Answer:c Explanation:Secondary namenode is used for all time availability and reliability. ________ NameNode is used when the Primary NameNode goes down. For ________. and thread stacks. the HBase Master UI provides information about the HBase Master uptime. 3. logs. Point out the wrong statement : a) Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file level b) Block Report from each DataNode contains a list of all the blocks that are stored on that DataNode c) User data is stored on the local file system of DataNodes d) DataNode is aware of the files to which the blocks stored on it belong to Answer:d Explanation:NameNode is aware of the files to which the blocks stored on it belong to. 5. dead and transitional servers. A functional filesystem has more than one DataNode. During start up. a) “HDFS Shell” b) “FS Shell” c) “DFS Shell” d) None of the mentioned Answer:b Explanation: The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System. ________ is the slave/worker node and holds the user data in the form of Data Blocks. with data replicated across them. a) DataNode b) NameNode c) Data block d) Replication Answer:a Explanation: A DataNode stores data in the [HadoopFileSystem]. 10.Answer:a Explanation:HDFS can be used for storing archive data since it is cheaper as HDFS allows storing the data on low cost commodity hardware while ensuring a high degree of faulttolerance. 9. HDFS provides a command line interface called __________ used to interact with HDFS. The need for data replication can arise in various scenarios like : a) Replication Factor is changed b) DataNode goes down c) Data Blocks get corrupted d) All of the mentioned Answer:d Explanation:Data is replicated across different DataNodes to ensure a high degree of faulttolerance. 8. the ___________ loads the file system state from the fsimage and the edits log file. a) DataNode b) NameNode c) ActionNode . 7. 
d) None of the mentioned Answer:b Explanation:HDFS is implemented in Java, so any machine that can run Java can host a NameNode or a DataNode.
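To round off the Monitoring HDFS questions, here is a small sketch of the Java FileSystem API doing roughly what the FS Shell does: listing paths with their replication factors and raising a file's replication factor, which is one of the re-replication triggers listed above. The file path is invented for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsShellSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up fs.defaultFS from core-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Roughly what "hadoop fs -ls /" reports, including per-file replication.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "  replication=" + status.getReplication());
        }

        // Raising the replication factor causes the NameNode to schedule new replicas.
        fs.setReplication(new Path("/data/important.csv"), (short) 5);

        fs.close();
    }
}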