MapR GA 3.0.1 Docs Final

MapR Administrator Training, April 2012
Version 3.0.2, October 28, 2013
Quick Start, Installation, Administration, Development, Reference

1. Home
  1.1 Start Here
  1.2 Quick Start - Test Drive MapR on a Virtual Machine
    1.2.1 Installing the MapR Virtual Machine
    1.2.2 A Tour of the MapR Virtual Machine
    1.2.3 Getting Started with Pig
    1.2.4 Getting Started with Hive
    1.2.5 Getting Started with HBase
    1.2.6 Getting Started with MapR Native Tables
    1.2.7 Working with Snapshots, Mirrors, and Schedules
  1.3 Installation Guide
    1.3.1 Planning the Cluster
    1.3.2 Preparing Each Node
    1.3.3 Installing MapR Software
      1.3.3.1 MapR Repositories and Package Archives
    1.3.4 Bringing Up the Cluster
    1.3.5 Installing Hadoop Components
      1.3.5.1 Cascading
      1.3.5.2 Flume
      1.3.5.3 HBase
      1.3.5.4 Hive
      1.3.5.5 Impala
      1.3.5.6 Mahout
      1.3.5.7 MultiTool
      1.3.5.8 Oozie
      1.3.5.9 Pig
      1.3.5.10 Sqoop
      1.3.5.11 Whirr
    1.3.6 Next Steps After Installation
    1.3.7 Setting Up the Client
  1.4 Upgrade Guide
    1.4.1 Planning the Upgrade Process
    1.4.2 Preparing to Upgrade
    1.4.3 Upgrading MapR Packages
      1.4.3.1 Offline Upgrade
      1.4.3.2 Rolling Upgrade
      1.4.3.3 Scripted Rolling Upgrade
    1.4.4 Configuring the New Version
    1.4.5 Troubleshooting Upgrade Issues
      1.4.5.1 NFS incompatible when upgrading to MapR v1.2.8 or later
  1.5 M7 - Native Storage for MapR Tables
    1.5.1 Setting Up MapR-FS to Use Tables
      1.5.1.1 Mapping Table Namespace Between Apache HBase Tables and MapR Tables
      1.5.1.2 Working With MapR Tables and Column Families
        1.5.1.2.1 Schema Design for MapR Tables
        1.5.1.2.2 Supported Regular Expressions in MapR Tables
    1.5.2 MapR Table Support for Apache HBase Interfaces
    1.5.3 Using AsyncHBase with MapR Tables
      1.5.3.1 Using OpenTSDB with AsyncHBase and MapR Tables
    1.5.4 Protecting Table Data
    1.5.5 Displaying Table Region Information
    1.5.6 Integrating Hive and MapR Tables
    1.5.7 Migrating Between Apache HBase Tables and MapR Tables
  1.6 Administration Guide
    1.6.1 Monitoring
      1.6.1.1 Alarms and Notifications
      1.6.1.2 Centralized Logging
      1.6.1.3 Monitoring Node Metrics
      1.6.1.4 Service Metrics
      1.6.1.5 Job Metrics
      1.6.1.6 Setting up the MapR Metrics Database
      1.6.1.7 Third-Party Monitoring Tools
        1.6.1.7.1 Ganglia
        1.6.1.7.2 Nagios Integration
      1.6.1.8 Configuring Email for Alarm Notifications
    1.6.2 Managing Data with Volumes
      1.6.2.1 Mirror Volumes
      1.6.2.2 Schedules
      1.6.2.3 Snapshots
    1.6.3 Data Protection
    1.6.4 Managing the Cluster
      1.6.4.1 Balancers
      1.6.4.2 Central Configuration
      1.6.4.3 Disks
        1.6.4.3.1 Setting Up Disks for MapR
        1.6.4.3.2 Specifying Disks or Partitions for Use by MapR
        1.6.4.3.3 Working with a Logical Volume Manager
      1.6.4.4 Nodes
        1.6.4.4.1 Adding Nodes to a Cluster
        1.6.4.4.2 Managing Services on a Node
        1.6.4.4.3 Node Topology
        1.6.4.4.4 Isolating CLDB Nodes
        1.6.4.4.5 Isolating ZooKeeper Nodes
        1.6.4.4.6 Removing Roles
        1.6.4.4.7 Task Nodes
      1.6.4.5 Services
        1.6.4.5.1 Assigning Services to Nodes for Best Performance
        1.6.4.5.2 Changing the User for MapR Services
        1.6.4.5.3 CLDB Failover
        1.6.4.5.4 Dial Home
      1.6.4.6 Startup and Shutdown
      1.6.4.7 TaskTracker Blacklisting
      1.6.4.8 Uninstalling MapR
      1.6.4.9 Designating NICs for MapR
    1.6.5 Placing Jobs on Specified Nodes
    1.6.6 Security
      1.6.6.1 PAM Configuration
      1.6.6.2 Secured TaskTracker
      1.6.6.3 Subnet Whitelist
    1.6.7 Users and Groups
      1.6.7.1 Managing Permissions
      1.6.7.2 Managing Quotas
      1.6.7.3 Setting the Administrative User
      1.6.7.4 Converting a Cluster from Root to Non-root User
    1.6.8 Working with Multiple Clusters
    1.6.9 Setting Up MapR NFS
      1.6.9.1 High Availability NFS
      1.6.9.2 Setting Up VIPs for NFS
    1.6.10 Setting up a MapR Cluster on Amazon Elastic MapReduce
    1.6.11 Troubleshooting Cluster Administration
      1.6.11.1 MapR Control System doesn't display on Internet Explorer
      1.6.11.2 'ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils' in cldb.log, caused by mixed-case cluster name in mapr-clusters.conf
      1.6.11.3 Error 'mv Failed to rename maprfs...' when moving files across volumes
      1.6.11.4 How to find a node's serverid
      1.6.11.5 Out of Memory Troubleshooting
    1.6.12 Client Compatibility Matrix
    1.6.13 Mirroring with Multiple Clusters
  1.7 Development Guide
    1.7.1 Accessing MapR-FS in C Applications
    1.7.2 Accessing MapR-FS in Java Applications
      1.7.2.1 My application that includes maprfs-0.1.jar is now missing dependencies and fails to link
      1.7.2.2 Garbage Collection in MapR
    1.7.3 Working with MapReduce
      1.7.3.1 Configuring MapReduce
        1.7.3.1.1 Job Scheduling
        1.7.3.1.2 Standalone Operation
        1.7.3.1.3 Tuning Your MapR Install
      1.7.3.2 Compiling Pipes Programs
    1.7.4 Working with MapR-FS
      1.7.4.1 Chunk Size
      1.7.4.2 Compression
    1.7.5 Working with Data
      1.7.5.1 Accessing Data with NFS
      1.7.5.2 Copying Data from Apache Hadoop
      1.7.5.3 Provisioning Applications
        1.7.5.3.1 Provisioning for Capacity
        1.7.5.3.2 Provisioning for Performance
    1.7.6 MapR Metrics and Job Performance
    1.7.7 Maven Repository and Artifacts for MapR
    1.7.8 Working with Cascading
      1.7.8.1 Upgrading Cascading
    1.7.9 Working with Flume
      1.7.9.1 Upgrading Flume
    1.7.10 Working with HBase
      1.7.10.1 HBase Best Practices
      1.7.10.2 Upgrading HBase
      1.7.10.3 Enabling HBase Access Control
    1.7.11 Working with HCatalog
      1.7.11.1 Upgrading HCatalog
    1.7.12 Working with Hive
      1.7.12.1 Hive ODBC Connector
        1.7.12.1.1 Hive ODBC Connector License and Copyright Information
      1.7.12.2 Using HiveServer2
      1.7.12.3 Upgrading Hive
      1.7.12.4 Troubleshooting Hive Issues
        1.7.12.4.1 Error 'Hive requires Hadoop 0.20.x' after upgrading to MapR v2.1
      1.7.12.5 Using HCatalog and WebHCat with Hive
    1.7.13 Working with Mahout
      1.7.13.1 Upgrading Mahout
    1.7.14 Working with Oozie
      1.7.14.1 Upgrading Oozie
    1.7.15 Working with Pig
      1.7.15.1 Upgrading Pig
    1.7.16 Working with Sqoop
      1.7.16.1 Upgrading Sqoop
    1.7.17 Working with Whirr
      1.7.17.1 Upgrading Whirr
    1.7.18 Integrating MapR's GitHub Repositories With Your IDE
    1.7.19 Troubleshooting Development Issues
    1.7.20 Integrating MapR's GitHub and Maven Repositories With Your IDE
  1.8 Migration Guide
    1.8.1 Planning the Migration
    1.8.2 Initial MapR Deployment
    1.8.3 Component Migration
    1.8.4 Application Migration
    1.8.5 Data Migration
    1.8.6 Node Migration
  1.9 Third Party Solutions
    1.9.1 Datameer
    1.9.2 Karmasphere
    1.9.3 HParser
    1.9.4 Pentaho
  1.10 Documentation for Previous Releases
  1.11 Release Notes
  1.12 Reference Guide
    1.12.1 MapR Control System
      1.12.1.1 Cluster Views
      1.12.1.2 MapR-FS Views
      1.12.1.3 NFS HA Views
      1.12.1.4 Alarms Views
        1.12.1.4.1 Alarms
      1.12.1.5 System Settings Views
      1.12.1.6 Other Views
        1.12.1.6.1 CLDB View
        1.12.1.6.2 HBase View
        1.12.1.6.3 JobTracker View
        1.12.1.6.4 Nagios View
        1.12.1.6.5 Terminal View
      1.12.1.7 Node-Related Dialog Boxes
    1.12.2 Hadoop Commands
      1.12.2.1 hadoop archive
      1.12.2.2 hadoop classpath
      1.12.2.3 hadoop daemonlog
      1.12.2.4 hadoop distcp
      1.12.2.5 hadoop fs
      1.12.2.6 hadoop jar
      1.12.2.7 hadoop job
      1.12.2.8 hadoop jobtracker
      1.12.2.9 hadoop mfs
      1.12.2.10 hadoop mradmin
      1.12.2.11 hadoop pipes
      1.12.2.12 hadoop queue
      1.12.2.13 hadoop tasktracker
      1.12.2.14 hadoop version
      1.12.2.15 hadoop conf
    1.12.3 API Reference
      1.12.3.1 acl
        1.12.3.1.1 acl edit
        1.12.3.1.2 acl set
        1.12.3.1.3 acl show
      1.12.3.2 alarm
        1.12.3.2.1 alarm clear
        1.12.3.2.2 alarm clearall
        1.12.3.2.3 alarm config load
        1.12.3.2.4 alarm config save
        1.12.3.2.5 alarm list
        1.12.3.2.6 alarm names
        1.12.3.2.7 alarm raise
      1.12.3.3 config
        1.12.3.3.1 config load
        1.12.3.3.2 config save
      1.12.3.4 dashboard
        1.12.3.4.1 dashboard info
      1.12.3.5 dialhome
        1.12.3.5.1 dialhome ackdial
        1.12.3.5.2 dialhome enable
        1.12.3.5.3 dialhome lastdialed
        1.12.3.5.4 dialhome metrics
        1.12.3.5.5 dialhome status
      1.12.3.6 disk
        1.12.3.6.1 disk add
        1.12.3.6.2 disk list
        1.12.3.6.3 disk listall
        1.12.3.6.4 disk remove
      1.12.3.7 dump
        1.12.3.7.1 dump balancerinfo
        1.12.3.7.2 dump balancermetrics
        1.12.3.7.3 dump changeloglevel
        1.12.3.7.4 dump cldbnodes
        1.12.3.7.5 dump containerinfo
        1.12.3.7.6 dump replicationmanagerinfo
        1.12.3.7.7 dump replicationmanagerqueueinfo
        1.12.3.7.8 dump rereplicationinfo
        1.12.3.7.9 dump rolebalancerinfo
        1.12.3.7.10 dump rolebalancermetrics
        1.12.3.7.11 dump volumeinfo
        1.12.3.7.12 dump volumenodes
        1.12.3.7.13 dump zkinfo
      1.12.3.8 entity
        1.12.3.8.1 entity info
        1.12.3.8.2 entity list
        1.12.3.8.3 entity modify
      1.12.3.9 job
        1.12.3.9.1 job changepriority
        1.12.3.9.2 job kill
        1.12.3.9.3 job linklogs
        1.12.3.9.4 job table
      1.12.3.10 license
        1.12.3.10.1 license add
        1.12.3.10.2 license addcrl
        1.12.3.10.3 license apps
        1.12.3.10.4 license list
        1.12.3.10.5 license listcrl
        1.12.3.10.6 license remove
        1.12.3.10.7 license showid
      1.12.3.11 Metrics API
      1.12.3.12 nagios
        1.12.3.12.1 nagios generate
      1.12.3.13 nfsmgmt
        1.12.3.13.1 nfsmgmt refreshexports
      1.12.3.14 node
        1.12.3.14.1 add-to-cluster
        1.12.3.14.2 node allow-into-cluster
        1.12.3.14.3 node cldbmaster
        1.12.3.14.4 node heatmap
        1.12.3.14.5 node list
        1.12.3.14.6 node listcldbs
        1.12.3.14.7 node listcldbzks
        1.12.3.14.8 node listzookeepers
        1.12.3.14.9 node maintenance
        1.12.3.14.10 node metrics
        1.12.3.14.11 node move
        1.12.3.14.12 node remove
        1.12.3.14.13 node services
        1.12.3.14.14 node topo
      1.12.3.15 rlimit
        1.12.3.15.1 rlimit get
        1.12.3.15.2 rlimit set
      1.12.3.16 schedule
        1.12.3.16.1 schedule create
        1.12.3.16.2 schedule list
        1.12.3.16.3 schedule modify
        1.12.3.16.4 schedule remove
      1.12.3.17 service list
      1.12.3.18 setloglevel
        1.12.3.18.1 setloglevel cldb
        1.12.3.18.2 setloglevel fileserver
        1.12.3.18.3 setloglevel hbmaster
        1.12.3.18.4 setloglevel hbregionserver
        1.12.3.18.5 setloglevel jobtracker
        1.12.3.18.6 setloglevel nfs
        1.12.3.18.7 setloglevel tasktracker
      1.12.3.19 table
        1.12.3.19.1 table attr
        1.12.3.19.2 table cf
        1.12.3.19.3 table create
        1.12.3.19.4 table delete
        1.12.3.19.5 table listrecent
        1.12.3.19.6 table region
      1.12.3.20 task
        1.12.3.20.1 task failattempt
        1.12.3.20.2 task killattempt
        1.12.3.20.3 task table
      1.12.3.21 trace
        1.12.3.21.1 trace dump
        1.12.3.21.2 trace info
        1.12.3.21.3 trace print
        1.12.3.21.4 trace reset
        1.12.3.21.5 trace resize
        1.12.3.21.6 trace setlevel
        1.12.3.21.7 trace setmode
      1.12.3.22 urls
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591 1.12.3.23 userconfig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592 1.12.3.23.1 userconfig load . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592 1.12.3.24 virtualip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593 1.12.3.24.1 virtualip add . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594 1.12.3.24.2 virtualip edit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595 1.12.3.24.3 virtualip list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595 1.12.3.24.4 virtualip move . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596 1.12.3.24.5 virtualip remove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597 1.12.3.25 volume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597 1.12.3.25.1 volume create . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598 1.12.3.25.2 volume dump create . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601 1.12.3.25.3 volume dump restore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603 1.12.3.25.4 volume fixmountpath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604 1.12.3.25.5 volume info . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605 1.12.3.25.6 volume link create . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607 1.12.3.25.7 volume link remove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608 1.12.3.25.8 volume list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609 1.12.3.25.9 volume mirror push . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613 1.12.3.25.10 volume mirror start . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614 1.12.3.25.11 volume mirror stop . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 615 1.12.3.25.12 volume modify . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616 1.12.3.25.13 volume mount . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618 1.12.3.25.14 volume move . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 619 1.12.3.25.15 volume remove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 619 1.12.3.25.16 volume rename . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620 1.12.3.25.17 volume showmounts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620 1.12.3.25.18 volume snapshot create . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621 1.12.3.25.19 volume snapshot list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622 1.12.3.25.20 volume snapshot preserve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624 1.12.3.25.21 volume snapshot remove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625 1.12.3.25.22 volume unmount . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627 1.12.4 Alarms Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627 1.12.5 Utilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640 1.12.5.1 configure.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640 1.12.5.2 disksetup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643 1.12.5.3 fsck . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644 1.12.5.4 gfsck . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646 1.12.5.5 mapr-support-collect.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 649 1.12.5.6 mapr-support-dump.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651 1.12.5.7 mrconfig . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653 1.12.5.7.1 mrconfig dg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653 1.12.5.7.2 mrconfig info . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655 1.12.5.7.3 mrconfig sp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664 1.12.5.7.4 mrconfig disk help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 671 1.12.5.7.5 mrconfig disk init . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 671 1.12.5.7.6 mrconfig disk list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673 1.12.5.7.7 mrconfig disk load . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673 1.12.5.7.8 mrconfig disk remove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674 1.12.5.8 pullcentralconfig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675 1.12.5.9 rollingupgrade.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675 1.12.6 Environment Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676 1.12.7 Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677 1.12.7.1 .dfs_attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677 1.12.7.2 cldb.conf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678 1.12.7.3 core-site.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681 1.12.7.4 daemon.conf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685 1.12.7.5 disktab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685 1.12.7.6 hadoop-metrics.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 685 1.12.7.7 mapr-clusters.conf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
1.12.7.8 mapred-default.xml
1.12.7.9 mapred-site.xml
1.12.7.10 mfs.conf
1.12.7.11 taskcontroller.cfg
1.12.7.12 warden.conf
1.12.7.13 exports
1.12.7.14 zoo.cfg
1.12.7.15 db.conf
1.12.8 MapR Environment
1.12.9 MapR Parameters
1.12.10 Ports Used by MapR
1.12.11 Best Practices
1.12.12 Glossary
1.12.13 Source Code for MapR Software

Home

Welcome to MapR! If you are not sure how to get started, here are a few places to find the information you are looking for:
Quick Start - Test Drive MapR on a Virtual Machine - Try out a single-node cluster that is ready to roll, right out of the box!
MapR Release Notes - Read more about what's new with the current release
Installation Guide - Learn how to set up a production cluster, large or small
Administration Guide - Learn how to configure and tune a MapR cluster for performance
Development Guide - Read more about what you can do with a MapR cluster

Start Here

The MapR Distribution for Apache Hadoop is the easiest, most dependable, and fastest Hadoop distribution on the planet.
It is the only Hadoop distribution that allows direct data input and output via MapR Direct Access NFS™ with realtime analytics, and the first to provide true High Availability (HA) at all levels. MapR introduces logical volumes to Hadoop. A volume is a way to group data and apply policy across an entire data set. MapR provides hardware status and control with the MapR Control System, a comprehensive UI that includes a Heatmap™ displaying the health of the entire cluster at a glance.

In this section, you can learn about MapR's unique features and how they provide the highest-performing, lowest-cost Hadoop available. To learn more about MapR, including information about MapR partners, see the following sections:
MapR Provides Complete Hadoop Compatibility
Intuitive, Powerful Cluster Management with the MapR Control System
Reliability, Fault-Tolerance, and Data Recovery with MapR
High-Performance Hadoop Clusters with MapR Direct Shuffle
Get Started

MapR Provides Complete Hadoop Compatibility

MapR is a complete Hadoop distribution. For more information, see the Version 2.0 Release Notes.

Intuitive, Powerful Cluster Management with the MapR Control System

The MapR Control System webapp provides powerful hardware insight down to the node level, as well as complete control of users, volumes, quotas, mirroring, and snapshots. Filterable alarms and notifications provide immediate warnings about hardware failures or other conditions that require attention, allowing a cluster administrator to detect and resolve problems quickly. MapR lets you control data access and placement, so that multiple concurrent Hadoop jobs can safely share the cluster. Provisioning resources is simple: you can easily create a volume for a project or department in a few clicks. MapR integrates with NIS and LDAP, making it easy to manage users and groups.

The MapR Control System provides a flexible web-based user interface to cluster administration. From the MapR Control System, you can assign user or group quotas, limit the amount of data a user or group can write, or limit a volume's size. Setting recovery time objective (RTO) and recovery point objective (RPO) points for a data set is a simple matter of scheduling snapshots and mirrors on a volume through the MapR Control System. You can set read and write permissions on volumes directly via NFS or using hadoop fs commands, and volumes provide administrative delegation through Access Control Lists (ACLs). Through the MapR Control System you can control who can mount, unmount, snapshot, or mirror a volume. Because MapR is a complete Hadoop distribution, you can run your Hadoop jobs the way you always have.

Unrestricted Writes to the Cluster with MapR Direct Access NFS

The MapR NFS service lets you access data on a licensed MapR cluster via the NFS protocol. You can mount a cluster through NFS on a variety of NFS clients. Clusters with the M3 license can run MapR NFS on one node, enabling you to mount your cluster as a standard POSIX-compliant filesystem. Once your cluster is mounted over NFS, you can use standard shell scripting to read and write live data in the cluster. You can run multiple NFS server nodes by upgrading to the M5 license level, and you can use virtual IP addresses (VIPs) to provide transparent NFS failover across multiple NFS servers. You can also have each node in your cluster self-mount NFS to make all of your cluster's data available from every node. These NFS self-mounts enable you to run standard shell scripts that work with the cluster's Hadoop data directly.
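For example, once an NFS gateway is running, working with cluster data looks like ordinary filesystem access. This is a minimal sketch; the gateway hostname, cluster name, and file paths below are placeholders for your own environment:

# Mount the cluster via an NFS gateway node (hypothetical host "nfs-node01")
sudo mount -o nolock nfs-node01:/mapr /mapr

# Use standard shell tools directly against data stored in the cluster
ls /mapr/my.cluster.com/user/mapr
cp /tmp/app.log /mapr/my.cluster.com/user/mapr/in/
grep -c ERROR /mapr/my.cluster.com/user/mapr/in/app.log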
Data Protection, Availability, and Performance with Volume Management

With volumes, you can control access to data, set the replication factor, and place specific data sets on specific racks or nodes for performance or data protection. Volumes control data access for specific users or groups with Linux-style permissions that integrate with existing LDAP and NIS directories. Use volume quotas to prevent data overruns from consuming excessive storage capacity.

One of the most powerful aspects of the volume concept is the ways in which a volume provides data protection:
To enable point-in-time recovery and easy backups, volumes have manual and policy-based snapshot capability.
For true business continuity, you can manually or automatically mirror volumes and synchronize them between clusters or datacenters to enable easy disaster recovery.
You can set volume read/write permissions and delegate administrative functions to control data access.
You can export volumes with MapR Direct Access NFS with HA, allowing data read and write operations directly to Hadoop without the need for temporary storage or log collection. Multiple NFS nodes provide the same view of the cluster regardless of where the client connects.

Realtime Hadoop Analytics: Intuitive and Powerful Performance Metrics

New in the 2.0 release, the MapR Job Metrics service provides in-depth access to the performance statistics of your cluster and the jobs that run on it. With MapR Job Metrics, you can examine trends in resource use, diagnose unusual node behavior, or examine how changes in your job configuration affect a job's execution. The MapR Node Metrics service, also new in the 2.0 release, provides detailed information on the activity and resource usage of specific nodes within your cluster. Critical MapR services collect information on cluster resource utilization and activity that you can write directly to a file or integrate into the third-party Ganglia tool.

Expand Your Capabilities with Third-Party Solutions

MapR has partnered with Datameer, which provides a self-service Business Intelligence platform that runs best on the MapR Distribution for Apache Hadoop. Your download of MapR includes a 30-day trial version of Datameer Analytics Solution (DAS), which provides spreadsheet-style analytics, ETL, and data visualization capabilities.

For More Information
Read about Provisioning Applications
Learn about Direct Access NFS
Check out Datameer

Reliability, Fault-Tolerance, and Data Recovery with MapR

With clusters growing to thousands of nodes, hardware failures are inevitable even with the most reliable machines in place. The MapR Distribution for Hadoop has been designed from the ground up to seamlessly tolerate hardware failure. MapR is the first Hadoop distribution to provide true high availability (HA) and failover at all levels, including a MapR Distributed HA NameNode™. If a disk or node in the cluster fails, MapR automatically restarts any affected processes on another node without requiring administrative intervention. The HA JobTracker ensures that any tasks interrupted by a node or disk failure are restarted on another TaskTracker node. In the event of any failure, the job's completed task state is preserved and no tasks are lost. For additional data reliability, every bit of data on the wire is compressed and CRC-checked.

For more information:
Take a look at the Heatmap
Learn about Volumes, Snapshots, and Mirroring
Explore Data Protection scenarios
Read about Job Metrics and Node Metrics
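The volume controls described in the volume management section above (quotas, replication factor, and placement) can also be applied from the command line when a volume is created. This is a minimal, hypothetical sketch: the volume name, mount path, and values are placeholders, and the exact flags can vary by release.

maprcli volume create -name project-data -path /project-data -replication 3 -quota 50G
maprcli volume info -name project-data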
High-Performance Hadoop Clusters with MapR Direct Shuffle

The MapR distribution for Hadoop achieves up to three times the performance of any other Hadoop distribution, and can reduce your equipment costs by half. MapR Direct Shuffle uses the Distributed NameNode to drastically improve Reduce-phase performance. Unlike Hadoop distributions that use the local filesystem for shuffle and HTTP to transport shuffle data, MapR shuffle data is readable directly from anywhere on the network. MapR stores data with Lockless Storage Services™, a sharded system that eliminates contention and overhead from data transport and retrieval. Automatic, transparent client-side compression reduces network overhead and footprint on disk, while direct block device I/O provides throughput at hardware speed with no additional overhead. As an additional performance boost, with MapR Realtime Hadoop, you can read files while they are still being written.

MapR gives you ways to tune the performance of your cluster. Using mirrors, you can load-balance reads on highly-accessed data to alleviate bottlenecks and improve read bandwidth for multiple users. You can run MapR Direct Access NFS on many nodes – all nodes in the cluster, if desired – and load-balance reads and writes across the entire cluster. Volume topology helps you further tune performance by allowing you to place resource-intensive Hadoop jobs and high-activity data on the fastest machines in the cluster.

For more information:
Read about Tuning Your MapR Install
Read about Provisioning for Performance

Get Started

Now that you know a bit about how the features of the MapR Distribution for Apache Hadoop work, take a quick tour to see for yourself how they can work for you:
Quick Start - Test Drive MapR on a Virtual Machine - Try out a single-node cluster that's ready to roll, right out of the box!
Installation Guide - Learn how to set up a production cluster, large or small
Development Guide - Read more about what you can do with a MapR cluster
Administration Guide - Learn how to configure and tune a MapR cluster for performance

Quick Start - Test Drive MapR on a Virtual Machine

The MapR Virtual Machine is a fully-functional single-node Hadoop cluster capable of running MapReduce programs and working with applications like Hive, Pig, and HBase. You can try the MapR Virtual Machine on nearly any 64-bit computer by downloading the free VMware Player. The MapR Virtual Machine comes with popular open source components Hive 0.11, Pig 0.11, and HBase 0.94.5 already installed.

The MapR Virtual Machine desktop contains the following icons:
MapR Control System - navigates to the graphical control system for managing the cluster
MapR User Guide - navigates to the MapR online documentation
MapR NFS - navigates to the NFS-mounted cluster storage layer
Tour of the MapR VM - a link to A Tour of the MapR Virtual Machine

Ready for a tour? The following documents will help you get started:
Installing the MapR Virtual Machine
A Tour of the MapR Virtual Machine
Getting Started with Hive
Getting Started with Pig
Getting Started with HBase
Getting Started with MapR Native Tables

Installing the MapR Virtual Machine

The MapR Virtual Machine runs on VMware Player, a free desktop application that lets you run a virtual machine on a Windows or Linux PC. You can download VMware Player from the VMware web site.
To install VMware Player, see the VMware documentation:
For Linux and Windows, download the free VMware Player
For Mac, purchase VMware Fusion
Use of VMware Player is subject to the VMware Player end user license terms, and VMware provides no support for VMware Player. For self-help resources, see the VMware Player FAQ.

Requirements

The MapR Virtual Machine requires at least 20 GB of free hard disk space and 2 GB of RAM on the host system. You will see higher performance with more RAM and more free hard disk space. To run the MapR Virtual Machine, the host system must have one of the following 64-bit x86 architectures:
A 1.3 GHz or faster AMD CPU with segment-limit support in long mode
A 1.3 GHz or faster Intel CPU with VT-x support
If you have an Intel CPU with VT-x support, you must verify that VT-x support is enabled in the host system BIOS. The BIOS settings that must be enabled for VT-x support vary depending on the system vendor. See the VMware knowledge base article at http://kb.vmware.com/kb/1003944 for information about how to determine whether VT-x support is enabled.

Installing and Running the MapR Virtual Machine

1. Choose whether to install the M3 Edition or the M5 Edition, and download the corresponding archive file:
M3 Edition - http://package.mapr.com/releases/v2.1.3.2/vmdemo/MapR-VM-2.1.3.20987.GA-3-m3.tbz2
M5 Edition - http://package.mapr.com/releases/v2.1.3.2/vmdemo/MapR-VM-2.1.3.20987.GA-3-m5.tbz2
2. On a UNIX system, use the tar command to extract the archive to your home directory or another directory of your choosing:
tar -xvf MapR-VM-<version>.tbz2
On a Windows system, use a decompression utility such as 7-zip to extract the archive.
3. Run the VMware Player.
4. Click Open a Virtual Machine, navigate to the directory into which you extracted the archive, then open the MapR-VM.vmx virtual machine.
Tip for VMware Fusion: If you are running VMware Fusion, make sure to select Open or Open and Run instead of creating a new virtual machine.
To log on to the MapR Control System, use the username mapr and the password mapr (all lowercase). Once the virtual machine is fully started, you can proceed with the tour.

A Tour of the MapR Virtual Machine

In this tutorial, you'll get familiar with the MapR Control System dashboard, learn how to get data into the cluster (and organized), and run some MapReduce jobs on Hadoop. You can read the following sections in order or browse them as you explore on your own:
The Dashboard
Working with Volumes
Exploring NFS
Running a MapReduce Job
Once you feel comfortable working with the MapR Virtual Machine, you can move on to more advanced topics:
Working with Snapshots, Mirrors, and Schedules
Getting Started with Hive
Getting Started with Pig
Getting Started with HBase

The Dashboard

The dashboard, the main screen in the MapR Control System, shows the health of the cluster at a glance. To get to the dashboard, click the MapR Control System link on the desktop of the MapR Virtual Machine and log on with the username mapr and the password mapr. If it is your first time using the MapR Control System, you will need to accept the terms of the license agreement to proceed.

Before You Start: Make sure your VMware Player's networking settings are set to Bridged. You can access these settings under Virtual Machine > Virtual Machine Settings. Select Network Adapter in the settings dialog, then select Bridged in the Network Connection pane.
Parts of the dashboard:
To the left, the navigation pane lets you navigate to other views that display more detailed information about nodes in the cluster, volumes in the MapR Storage Services layer, NFS settings, Alarms Views, and System Settings Views.
In the center, the main dashboard view displays the nodes in a "heat map" that uses color to indicate node health--since there is only one node in the MapR Virtual Machine cluster, there is a single green square.
To the right, information about cluster usage is displayed.
Try clicking the Health button at the top of the heat map. You will see different kinds of information that can be displayed in the heat map. Try clicking the green square representing the node. You will see more detailed information about the status of the node.
The browser is pre-configured with the following bookmarks, which you will find useful as you gain experience with Hadoop, MapReduce, and the MapR Control System:
MapR Control System
JobTracker Status
TaskTracker Status
HBase Master
CLDB Status
Don't worry if you aren't sure what those are yet.

Exploring NFS

With MapR, you can mount the cluster via NFS and browse it as if it were a filesystem. First, make sure the cluster is mounted via NFS:
1. Click the terminal icon at the top of the screen to open the terminal.
2. Type showmount in the terminal to see what hosts are mounted on mapr-desktop. Example:
mapr@mapr-desktop:~$ showmount
Hosts on mapr-desktop:
127.0.1.1
3. If no hosts are listed, use the mount command to mount the cluster. Example:
mapr@mapr-desktop:~$ sudo mount -o nolock mapr-desktop:/mapr /mapr
4. Use showmount again to verify that the cluster is successfully mounted.
With the cluster mounted via NFS, try double-clicking the MapR NFS icon on the MapR Virtual Machine desktop. When you navigate to mapr > my.cluster.com > user you can see the volume that is preconfigured in the VM.
Try copying some files to the volume; a good place to start is the files constitution.txt and sample-table.txt, which are attached to this page. Both are text files, which will be useful when running the Word Count example later. To download them, select Tools > Attachments from the menu at the top right of this document (the one you are reading now) and then click the links for those two files. Once they are downloaded, you can add them to the cluster. Since you'll be using them as input to MapReduce jobs in a few minutes, create a directory with the same name as the user, which is mapr, and another directory called in under the mapr directory in the user volume, and drag the files there.
By the way, if you want to verify that you are really copying the files into the Hadoop cluster, you can open a terminal on the MapR Virtual Machine (select Applications > Accessories > Terminal) and type hadoop fs -ls /user/mapr/in to see that the files are there.

The Terminal

When you run MapReduce jobs, and when you use Hive, Pig, or HBase, you'll be working with the Linux terminal. Open a terminal window by selecting Applications > Accessories > Terminal.

Running a MapReduce Job

In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files (like the ones you copied to the cluster in the previous section). The Word Count program reads files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise you will use /user/mapr/in for the input and /user/mapr/out for the output. The input directory must exist and must contain the input files before running the job; the output directory must not exist, because the Word Count example creates it.
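If you prefer the command line to drag-and-drop over NFS, you can stage the input with hadoop fs as well. This is a small sketch; it assumes constitution.txt was downloaded to your current local directory:

hadoop fs -mkdir /user/mapr/in
hadoop fs -put constitution.txt /user/mapr/in/
hadoop fs -ls /user/mapr/in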
Try MapReduce
1. On the MapR Virtual Machine, open a terminal (select Applications > Accessories > Terminal).
2. Copy a couple of text files into the cluster. If you are not sure how, see the previous section. Create the directory /user/mapr/in and put the files there.
3. Type the following line to run the Word Count job:
hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /user/mapr/in /user/mapr/out
4. Look in the newly-created /user/mapr/out for a file called part-r-00000 containing the results.
That's it! If you're ready, you can try working with MapR tables.

Getting Started with Pig

Apache Pig is a platform for parallelized analysis of large data sets via a language called Pig Latin. (For more information about Pig, see the Pig project page.)
You'll be working with Pig from the Linux shell. Open a terminal by selecting Applications > Accessories > Terminal (see A Tour of the MapR Virtual Machine).
Note: Although this tutorial was originally designed for users of the MapR Virtual Machine, you can easily adapt these instructions for a node in a cluster, for example by using a different directory structure.
In this tutorial, we'll use version 0.11 of Pig to run a MapReduce job that counts the words in the file constitution.txt in the mapr user's /in directory on the cluster, and store the results in the directory /user/mapr/wordcount.
1. First, make sure you have downloaded the file: on the page A Tour of the MapR Virtual Machine, select Tools > Attachments and right-click constitution.txt to save it.
2. Make sure the file is loaded onto the cluster, in the directory /user/mapr/in. If you are not sure how, look at the NFS tutorial in A Tour of the MapR Virtual Machine.
3. Open a Pig shell and get started:
In the terminal, type the pig command to start the Pig shell.
At the grunt> prompt, type the following lines (press ENTER after each):
A = LOAD '/user/mapr/in' USING TextLoader() AS (words:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(*));
C = GROUP B BY $0;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO '/user/mapr/wordcount';
After you type the last line, Pig starts a MapReduce job to count the words in the file constitution.txt. When the MapReduce job is complete, type quit to exit the Pig shell and take a look at the contents of the directory /user/mapr/wordcount to see the results.
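Just like the Word Count example above, the Pig job writes its output as part files under the output directory. A quick way to check the results from the terminal (a small sketch; the part filename can vary with the number of reducers):

hadoop fs -ls /user/mapr/wordcount
hadoop fs -cat /user/mapr/wordcount/part-r-00000 | head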
Getting Started with Hive

Hive is a data warehouse system for Hadoop that uses a SQL-like language called HiveQL to query structured data stored in a distributed filesystem. (For more information about Hive, see the Apache Hive project page.)
You'll be working with version 0.11.0 of Hive from the Linux shell. To use Hive, open a terminal by selecting Applications > Accessories > Terminal (see A Tour of the MapR Virtual Machine).
Note: Although this tutorial was originally designed for users of the MapR Virtual Machine, you can easily adapt these instructions for a node in a cluster, for example by using a different directory structure.
In this tutorial, you'll create a Hive table, load data from a tab-delimited text file, and run a couple of basic queries against the table.

First, make sure you have downloaded the sample table: on the page A Tour of the MapR Virtual Machine, select Tools > Attachments, right-click on sample-table.txt, select Save Link As... from the pop-up menu, select a directory to save to, then click OK. If you're working on the MapR Virtual Machine, we'll be loading the file from the MapR Virtual Machine's local file system (not the cluster storage layer), so save the file in the MapR home directory (for example, /home/mapr).

Take a look at the source data

First, take a look at the contents of the file using the terminal.
Note: If you are using HiveServer2, you will use the BeeLine CLI instead of the Hive shell shown below. For details on setting up HiveServer2 and starting BeeLine, see Using HiveServer2.
1. Make sure you are in the home directory where you saved sample-table.txt (type cd ~ if you are not sure).
2. Type cat sample-table.txt to display the following output:
mapr@mapr-desktop:~$ cat sample-table.txt
1320352532 1001 http://www.mapr.com/doc http://www.mapr.com 192.168.10.1
1320352533 1002 http://www.mapr.com http://www.example.com 192.168.10.10
1320352546 1001 http://www.mapr.com http://www.mapr.com/doc 192.168.10.1
Notice that the file consists of only three lines, each of which contains a row of data fields separated by the TAB character. The data in the file represents a web log.

Create a table in Hive and load the source data:
1. Set the location of the Hive scratch directory by editing the file /opt/mapr/hive/hive-<version>/conf/hive-site.xml to add the following block, replacing /tmp/mydir with the path to a directory in the user volume:
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/mydir</value>
<description>Scratch space for Hive jobs</description>
</property>
Alternately, use the -hiveconf hive.exec.scratchdir=<scratch directory> option in the following step to specify the scratch directory's location, or use set hive.exec.scratchdir=<scratch directory> at the command line.
2. Type the following command to start the Hive shell, using tab-completion to expand the <version>:
/opt/mapr/hive/hive-0.9.0/bin/hive
3. At the hive> prompt, type the following command to create the table:
CREATE TABLE web_log(viewTime INT, userid BIGINT, url STRING, referrer STRING, ip STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
4. Type the following command to load the data from sample-table.txt into the table:
LOAD DATA LOCAL INPATH '/home/mapr/sample-table.txt' INTO TABLE web_log;

Run basic queries against the table:
1. Try the simplest query, one that displays all the data in the table:
SELECT web_log.* FROM web_log;
This query would be inadvisable with a large table, but with the small sample table it returns very quickly.
2. Try a simple SELECT to extract only data that matches a desired string:
SELECT web_log.* FROM web_log WHERE web_log.url LIKE '%doc';
This query launches a MapReduce job to filter the data.
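If you would like one more query to try, a simple aggregation also runs as a MapReduce job. This is an optional extra, not part of the original tutorial; it uses only the web_log table defined above and can be typed at the same hive> prompt:

SELECT ip, COUNT(*) FROM web_log GROUP BY ip;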
Getting Started with HBase

HBase is the Hadoop database, designed to provide random, realtime read/write access to very large tables — billions of rows and millions of columns — on clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable. (For more information about HBase, see the HBase project page.)
We'll be working with version 0.94.9 of HBase from the Linux shell. Open a terminal by selecting Applications > Accessories > Terminal (see A Tour of the MapR Virtual Machine).
Note: Although this tutorial was originally designed for users of the MapR Virtual Machine, you can easily adapt these instructions for a node in a cluster, for example by using a different directory structure.
In this tutorial, we'll create an HBase table on the cluster, enter some data, query the table, then clean up the data and exit.
HBase tables are organized by column, rather than by row. Furthermore, the columns are organized in groups called column families. When creating an HBase table, you must define the column families before inserting any data. Column families should not be changed often, nor should there be too many of them, so it is important to think carefully about what column families will be useful for your particular data. Each column family, however, can contain a very large number of columns. Columns are named using the format family:qualifier.
Unlike columns in a relational database, which reserve empty space for columns with no values, HBase columns simply don't exist for rows where they have no values. This not only saves space, but means that different rows need not have the same columns; you can use whatever columns you need for your data on a per-row basis.

Create a table in HBase:
1. Start the HBase shell by typing the following command:
/opt/mapr/hbase/hbase-0.94.5/bin/hbase shell
2. Create a table called weblog with one column family named stats:
create 'weblog', 'stats'
3. Verify the table creation by listing everything:
list
4. Add a test value to the daily column in the stats column family for row 1:
put 'weblog', 'row1', 'stats:daily', 'test-daily-value'
5. Add a test value to the weekly column in the stats column family for row 1:
put 'weblog', 'row1', 'stats:weekly', 'test-weekly-value'
6. Add a test value to the weekly column in the stats column family for row 2:
put 'weblog', 'row2', 'stats:weekly', 'test-weekly-value'
7. Type scan 'weblog' to display the contents of the table. Sample output:
ROW COLUMN+CELL
row1 column=stats:daily, timestamp=1321296699190, value=test-daily-value
row1 column=stats:weekly, timestamp=1321296715892, value=test-weekly-value
row2 column=stats:weekly, timestamp=1321296787444, value=test-weekly-value
2 row(s) in 0.0440 seconds
8. Type get 'weblog', 'row1' to display the contents of row 1. Sample output:
COLUMN CELL
stats:daily timestamp=1321296699190, value=test-daily-value
stats:weekly timestamp=1321296715892, value=test-weekly-value
2 row(s) in 0.0330 seconds
9. Type disable 'weblog' to disable the table.
10. Type drop 'weblog' to drop the table and delete all data.
11. Type exit to exit the HBase shell.

Getting Started with MapR Native Tables

We'll be working with MapR tables from the Linux shell. Open a terminal by selecting Applications > Accessories > Terminal (see A Tour of the MapR Virtual Machine).
Note: Although this tutorial was originally designed for users of the MapR Virtual Machine, you can easily adapt these instructions for a node in a cluster, for example by using a different directory structure.
In this tutorial, we'll create a MapR table on the cluster, enter some data, query the table, then clean up the data and exit.
MapR tables are organized by column, rather than by row. Furthermore, the columns are organized in groups called column families. A column's name is known as the qualifier for that column family. When creating a MapR table, define the column families before inserting any data.
Changing your column families can be difficult after creating the table, so it is important to think carefully about what column families will be useful for your particular data. Each column family can contain a very large number of columns. Columns are named using the format family:qualifier.
In a MapR table, columns don't exist for rows where they have no values, a quality called sparseness. Sparse tables save space, and different rows can have different columns. Use whatever columns you need for your data on a per-row basis.

Before you start: The user directory

MapR tables are stored natively in your cluster's filesystem, just as files are. The virtual machine's cluster is mounted over NFS in the /mapr/my.cluster.com directory. The cluster already has a /user directory in it. Make a directory for the MapR user under the /user directory with this command:
$ mkdir /mapr/my.cluster.com/user/mapr
Now MapR can track the activity on the tables you create.

Example: Creating a MapR Table

You can create a MapR table from the HBase shell, from the MapR Control System, or with the MapR CLI interface. Expand any of the following sections for detailed instructions.

With the HBase shell

This example creates a table called development in the directory /user/mapr with a column family called stage, using system defaults. In this example, we first start the HBase shell from the command line with hbase shell, and then use the create command to create the table. After creating the table, we use the alter command to add a column family.
$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.92.2, rUnknown, Mon Dec 17 09:23:31 PST 2012
hbase(main):001:0> create '/user/mapr/development', 'stage'
hbase(main):002:0> alter '/user/mapr/development', {NAME => 'status'}
Type get '/user/mapr/development', 'row1' to display the contents of row 1. Sample output:
COLUMN CELL
stats:daily timestamp=1321296699190, value=test-daily-value
stats:weekly timestamp=1321296715892, value=test-weekly-value
2 row(s) in 0.0330 seconds
Type drop '/user/mapr/development' to drop the table and delete all data.
Type exit to exit the HBase shell.

With the MapR Control System

1. From the terminal, create the /analysis/tables directory under the cluster's /user directory with the following command:
$ mkdir /mapr/my.cluster.com/user/analysis/tables
2. In the MCS Navigation pane under the MapR Data Platform group, click Tables. The Tables tab appears in the main window.
3. Click the New Table button.
4. Type a complete path for the new table: /user/analysis/tables/table01
5. Click OK. The MCS displays a tab for the new table.
The screen-capture below demonstrates the creation of the table table01 in location /user/analysis/tables/.

To add a column family with the MapR Control System:
1. In the MCS Navigation pane under the MapR Data Platform group, click Tables. The Tables tab appears in the main window.
2. Find the table you want to work with, using one of the following methods.
a. Scan for the table under Recently Opened Tables on the Tables tab.
b. Enter a regular expression for part of the table pathname in the Go to table field and click Go.
3. Click the desired table name. A Table tab appears in the main MCS pane, displaying information for the specific table.
4. Click the Column Families tab.
5. Click New Column Family. The Create Column Family dialog appears.
6. Enter values for the following fields:
Column Family Name - Required.
Max Versions - The maximum number of versions of a cell to keep in the table.
Min Versions - The minimum number of versions of a cell to keep in the table.
Compression - The compression algorithm used on the column family's data. Select a value from the drop-down. The default value is Inherited, which uses the same compression type as the table. Available compression methods are LZF, LZ4, and ZLib. Select OFF to disable compression.
Time-To-Live - The minimum time-to-live for cells in this column family. Cells older than their time-to-live stamp are purged periodically.
In memory - Preference for a column family to reside in memory for fast lookup.
You can change any column family properties at a later time using the MCS or the maprcli table cf edit command.
The screen-capture below demonstrates the creation of the column family userinfo for the table at location /user/analysis/tables/table01.

With the MapR CLI

1. Use the maprcli table create command at a command line. For details, type maprcli table create -help at a command line. The following example demonstrates creation of a table named table02 in cluster location /user/analysis/tables/. The cluster my.cluster.com is mounted at /mapr/.
$ maprcli table create -path /user/analysis/tables/table02
2. List the tables in the directory to verify that table02 was successfully created:
$ ls -l /mapr/my.cluster.com/user/analysis/tables
lrwxr-xr-x 1 mapr mapr 2 Oct 24 16:14 table01 -> mapr::table::2056.62.17034
lrwxr-xr-x 1 mapr mapr 2 Oct 24 16:13 table02 -> mapr::table::2056.56.17022
3. Use the maprcli table listrecent command to show recent table activity.
$ maprcli table listrecent
path
/user/analytics/tables/table01
/user/analytics/tables/table02
4. Add a column family with the maprcli table cf create command. For details see table cf create or type maprcli table cf create -help at a command line. The following example demonstrates addition of a column family named casedata in table /user/analysis/tables/table01, using lzf compression and keeping a maximum of 5 versions of cells in the column family.
$ maprcli table cf create -path /user/analysis/tables/table01 \
 -cfname casedata -compression lzf -maxversions 5
$ maprcli table cf list -path /user/analysis/tables/table01
inmemory cfname compression ttl maxversions minversions
true userinfo lz4 0 3 0
false casedata lzf 0 5 0
$
You can change the properties of a column family with the maprcli table cf edit command.
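To confirm that the new table behaves like the HBase tables from the earlier tutorial, you can put and read a test value through the HBase shell using the table's path. This is an optional sanity check, not part of the original example; it assumes the userinfo column family created above:

hbase(main):001:0> put '/user/analysis/tables/table01', 'row1', 'userinfo:name', 'test-value'
hbase(main):002:0> get '/user/analysis/tables/table01', 'row1'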
Working with Snapshots, Mirrors, and Schedules

Snapshots, mirrors, and schedules help you protect your data from user error, make backup copies, and in larger clusters provide load balancing for highly-accessed data. These features are available under the M5 license. If you are working with an M5 virtual machine, you can use this section to get acquainted with snapshots, mirrors, and schedules. If you are working with the M3 virtual machine, you should proceed to the sections about Getting Started with Hive, Getting Started with Pig, and Getting Started with HBase.

Taking Snapshots

A snapshot is a point-in-time image of a volume that protects data against user error. Although other strategies such as replication and mirroring provide good protection, they cannot protect against accidental file deletion or corruption. You can create a snapshot of a volume manually before embarking on risky jobs or operations, or set a snapshot schedule on the volume to ensure that you can always roll back to specific points in time.

Try creating a snapshot manually:
1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the volume MyVolume (which you created during the previous tutorial).
3. Expand the MapR Virtual Machine window or scroll the browser to the right until the New Snapshot button is visible.
4. Click New Snapshot to display the Snapshot Name dialog.
5. Type a name for the new snapshot in the Name field.
6. Click OK to create the snapshot.

Try scheduling snapshots:
1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name MyVolume (which you created during the previous tutorial), or by selecting the checkbox beside MyVolume and clicking the Properties button.
3. In the Replication and Snapshot Scheduling section, choose a schedule from the Snapshot Schedule dropdown menu.
4. Click Modify Volume to save changes to the volume.

Viewing Snapshot Contents

All the snapshots of a volume are available in a directory called .snapshot at the volume's top level. For example, the snapshots of the volume MyVolume, which is mounted at /myvolume, are available in the /myvolume/.snapshot directory. You can view the snapshots using the hadoop fs -ls command or via NFS. If you list the contents of the top-level directory in the volume, you will not see .snapshot — but it's there.
To view the snapshots for /myvolume on the command line, type hadoop fs -ls /myvolume/.snapshot
To view the snapshots for /myvolume in the file browser via NFS, navigate to /myvolume and use CTRL-L to specify an explicit path, then add .snapshot to the end.

Creating Mirrors

A mirror is a full read-only copy of a volume, which you can use for backups, data transfer to another cluster, or load balancing. A mirror is itself a type of volume; after you create a mirror volume, you can sync it with its source volume manually or set a schedule for automatic sync.
Try creating a mirror volume:
1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Click the New Volume button to display the New Volume dialog.
3. Select the Local Mirror Volume radio button at the top of the dialog.
4. Type my-mirror in the Mirror Name field.
a. Type MyVolume in the Source Volume Name field.
b. Type /my-mirror in the Mount Path field.
5. To schedule mirror sync, select a schedule from the Mirror Update Schedule dropdown menu.
6. Click OK to create the volume.
You can also sync a mirror manually; it works just like taking a manual snapshot. View the list of volumes, select the checkbox next to a mirror volume, and click Start Mirroring.

Working with Schedules

The MapR Virtual Machine comes pre-loaded with a few schedules, but you can create your own as well. Once you have created a schedule, you can use it for snapshots and mirrors on any volume. Each schedule contains one or more rules that determine when to trigger a snapshot or a mirror sync, and how long to keep snapshot data resulting from the rule.
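The snapshot, mirror, and schedule operations in this section also have maprcli equivalents if you prefer the terminal. This is a minimal sketch; the volume and mirror names are the ones used in these tutorials, the snapshot name is a placeholder, and the output columns vary by release:

maprcli volume snapshot create -volume MyVolume -snapshotname my-first-snapshot
maprcli volume snapshot list
maprcli volume mirror start -name my-mirror
maprcli schedule list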
Try creating a schedule:
1. In the Navigation pane, expand the MapR-FS group and click the Schedules view.
2. Click New Schedule.
3. Type My Schedule in the Schedule Name field.
4. Define a schedule rule in the Schedule Rules section:
a. From the first dropdown menu, select Every 5 min.
b. Use the Retain For field to specify how long the data is to be preserved. Type 1 in the box, and select hour(s) from the dropdown menu.
5. Click Save Schedule to create the schedule.
You can use the schedule "My Schedule" to perform a snapshot or mirror operation automatically every 5 minutes. If you use "My Schedule" to automate snapshots, they will be preserved for one hour (you will have 12 snapshots of the volume, on average).

Next Steps

If you haven't already, try the following tutorials:
Getting Started with Hive
Getting Started with Pig
Getting Started with HBase
Getting Started with MapR Native Tables

Installation Guide

MapR is a complete, industry-standard Hadoop distribution with key improvements. MapR Hadoop is API-compatible and includes or works with the family of Hadoop ecosystem components such as HBase, Hive, Pig, Flume, and others. MapR provides a version of Hadoop and key ecosystem components that have been tested together on specific platforms. For example, while MapR supports the Hadoop FS abstraction interface, MapR specifically improves the performance and robustness of the distributed file system, eliminating the Namenode. The MapR distribution for Hadoop supports continuous read/write access, improving data load and unload processes. To reiterate, MapR Hadoop does not use Namenodes.
The diagram above illustrates the services surrounding the basic Hadoop idea of Map and Reduce operations performed across a distributed storage system. Some services provide management and others run at the application level. The MapR Control System (MCS) is a browser-based management console that provides a way to view and control the entire cluster.

Editions

MapR offers multiple editions of the MapR distribution for Apache Hadoop.
Edition - Description
M3 - Free community edition
M5 - Adds high availability and data protection, including multi-node NFS
M7 - Supports structured table data natively in the storage layer, providing a flexible, NoSQL database compatible with Apache HBase. Available with MapR version 3.0 and later.
The type of license you apply determines which features will be available on the cluster. The installation steps are similar for all editions, but you will plan the cluster differently depending on the license you apply.

Installation Process

This Installation Guide has been designed as a set of sequential steps. Complete each step before proceeding to the next. Installing MapR Hadoop involves these steps (a sketch of the node-level commands follows this list):
1. Planning the Cluster - Determine which services will be run on each node. It is important to see the big picture before installing and configuring the individual management and compute nodes.
2. Preparing Each Node - Check that each node is a suitable platform for its intended use. Nodes must meet minimum requirements for operating system, memory and disk resources, and installed software, such as Java. Including unsuitable nodes in a cluster is a major source of installation difficulty.
3. Installing MapR - Each node in the cluster, even purely data/compute nodes, runs several services. Obtain and install MapR packages, using either a package manager, a local repository, or a downloaded tarball. After installing services on a node, configure it to participate in the cluster, then initialize the raw disk resources.
4. Bringing Up the Cluster - Start the nodes and check the cluster. Verify node communication and that services are up and running. Create one or more volumes to organize data.
5. Installing Hadoop Ecosystem Components - Install additional Hadoop components alongside MapR services.
To begin, start by Planning the Cluster.
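For orientation, here is roughly what steps 3 and 4 look like on a single node once the plan is in place. This is a hedged sketch only: it assumes a Red Hat-style node with the MapR yum repository already configured, the hostnames, package set, and disk list are placeholders, and the full procedure is in the Installing MapR Software and Bringing Up the Cluster sections.

# Install the MapR packages chosen for this node in the cluster plan (example set)
yum install mapr-fileserver mapr-tasktracker

# Point the node at the cluster's CLDB and ZooKeeper nodes (placeholder hostnames; default ports shown)
/opt/mapr/server/configure.sh -C cldb-node:7222 -Z zk-node:5181 -N my.cluster.com

# Format the raw disks listed in /tmp/disks.txt for MapR-FS
/opt/mapr/server/disksetup -F /tmp/disks.txt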
Create one or more volumes to organize data.
5. Installing Hadoop Ecosystem Components
Install additional Hadoop components alongside MapR services.
To begin, start by Planning the Cluster.
Planning the Cluster
A MapR Hadoop installation is usually a large-scale set of individual hosts, called nodes, collectively called a cluster. In a typical cluster, most (or all) nodes are dedicated to data processing and storage, and a smaller number of nodes run other services that provide cluster coordination and management. The first step in deploying MapR is planning which nodes will contribute to the cluster, and selecting the services that will run on each node.
First, plan what computers will serve as nodes in the MapR Hadoop cluster and what specific services (daemons) will run on each node. To determine whether a computer is capable of contributing to the cluster, it may be necessary to check the requirements found in Step 2, Preparing Each Node. Each node in the cluster must be carefully checked against these requirements; unsuitability of a node is one of the most common reasons for installation failure.
The objective of Step 1 is a Cluster Plan that details each node's set of services. The following sections help you create this plan:
Unique Features of the MapR Distribution
Select Services
Cluster Design Objectives
Licensing Choices
Data Workload
High Availability
Cluster Hardware
Service Layout in a Cluster
Node Types
Example Cluster Designs
Plan Initial Volumes
User Accounts
Next Step
Unique Features of the MapR Distribution
Administrators who are familiar with ordinary Apache Hadoop will appreciate the MapR distribution's real-time read/write storage layer. While API-compatible with HDFS, MapR Hadoop does not require Namenodes. Furthermore, MapR utilizes raw disks and partitions without RAID or Logical Volume Manager. Many Hadoop installation documents spend pages discussing HDFS and Namenodes; MapR Hadoop's solution is simpler to install and offers higher performance.
The MapR Filesystem (MapR-FS) stores data in volumes, conceptually in a set of containers distributed across a cluster. Each container includes its own metadata, eliminating the central "Namenode" single point of failure. The Container Location Database (CLDB), a required directory of container locations, improves network performance and can be configured for high availability. Data stored by MapR-FS can be files or tables.
A process called the warden runs on all nodes to manage, monitor, and report on the other services on each node. The MapR cluster uses Apache ZooKeeper to coordinate services. ZooKeeper prevents service conflicts by enforcing a set of rules and conditions that determine which instance of each service is the master. The warden will not start any services unless ZooKeeper is reachable and more than half of the configured ZooKeeper nodes (a quorum) are live.
The MapR M7 Edition provides native table storage in MapR-FS. The MapR HBase Client is used to access table data via the open-standard Apache HBase API. M7 Edition simplifies and unifies administration for both structured table data and unstructured file data on a single cluster. If you plan to use MapR tables exclusively for structured data, then you do not need to install the Apache HBase Master or RegionServer. However, Master and RegionServer services can be deployed on an M7 cluster if your applications require them, for example, during the migration period from Apache HBase to MapR tables.
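As a quick illustration of this compatibility, the following hbase shell session sketches how a MapR table might be created and queried through the standard HBase API on an M7 cluster with the MapR HBase Client installed. The table path /user/mapr/mytable and the column family cf are placeholder names, not defaults of any kind.

$ hbase shell
hbase(main):001:0> create '/user/mapr/mytable', 'cf'
hbase(main):002:0> put '/user/mapr/mytable', 'row1', 'cf:col1', 'value1'
hbase(main):003:0> scan '/user/mapr/mytable'
hbase(main):004:0> exit

Because the table is addressed by a MapR-FS path rather than a flat table name, it can be organized under the same volume and directory structure as ordinary files.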
The MapR HBase Client provides access to both Apache HBase tables and MapR tables. As of MapR version 3.0, table features are included in all MapR-FS fileservers. Table features are enabled by applying an appropriate M7 license.
Select Services
In a typical cluster, most nodes are dedicated to data processing and storage, and a smaller number of nodes run services that provide cluster coordination and management. Some applications run on cluster nodes and others run on clients that can reach the cluster, but which are not part of it. The services that you choose to run on each node will likely evolve over the life of the cluster. Services can be added and removed over time. We will plan for the cluster you're going to start with, but it's useful to think a few steps down the road: Where will services migrate to when you grow the cluster by 10x? 100x?
The following table shows some of the services that can be run on a node, spanning MapReduce, storage, management, and application services.
Service / Description
Warden: The Warden service runs on every node, coordinating the node's contribution to the cluster.
TaskTracker: The TaskTracker service starts and tracks MapReduce tasks on a node. It receives task assignments from the JobTracker service and manages task execution.
FileServer: FileServer is the MapR service that manages disk storage for MapR-FS on each node.
CLDB: Maintains the container location database (CLDB). The CLDB service coordinates data storage services among MapR-FS FileServer nodes, MapR NFS gateways, and MapR clients.
NFS: Provides read-write MapR Direct Access NFS™ access to the cluster, with full support for concurrent read and write access.
MapR HBase Client: Provides access to tables in MapR-FS on an M7 Edition cluster via HBase APIs. Required on all nodes that will access table data in MapR-FS, typically all TaskTracker nodes and edge nodes used for accessing table data.
JobTracker: The Hadoop JobTracker service coordinates the execution of MapReduce jobs by assigning tasks to TaskTracker nodes and monitoring task execution.
ZooKeeper: Enables high availability (HA) and fault tolerance for MapR clusters by providing coordination.
HBase Master: The HBase master service manages the region servers that make up HBase table storage.
Web Server: Runs the MapR Control System and provides the MapR Heatmap™.
Metrics: Provides optional real-time analytics data on cluster and job performance through the Job Metrics interface. If used, the Metrics service is required on all JobTracker and Web Server nodes.
HBase Region Server: The HBase region server is used with the HBase Master service and provides storage for an individual HBase region.
Pig: Pig is a high-level data-flow language and execution framework.
Hive: Hive is a data warehouse that supports SQL-like ad hoc querying and data summarization.
Flume: Flume is a service for aggregating large amounts of log data.
Oozie: Oozie is a workflow scheduler system for managing Hadoop jobs.
HCatalog: HCatalog is a table and storage management layer for Hadoop data.
Cascading: Cascading is an application framework for analyzing and managing big data.
Mahout: Mahout is a set of scalable machine-learning libraries that analyze user behavior.
Sqoop: Sqoop is a tool for transferring bulk data between Hadoop and relational databases.
MapR is a complete Hadoop distribution, but not all services are required. Every Hadoop installation requires the JobTracker and TaskTracker services to manage Map/Reduce tasks.
In addition, MapR requires the ZooKeeper service to coordinate the cluster, and at least one node must run the CLDB service. The WebServer service is required if the browser-based MapR Control System will be used.
MapR Hadoop includes tested versions of the services listed here. MapR provides a more robust, read-write storage system based on volumes and containers. MapR data nodes typically run TaskTracker and FileServer. Do not plan to use packages from other sources in place of the MapR distribution.
Cluster Design Objectives
Begin by understanding the work that the cluster will perform. Establish metrics for data storage capacity and throughput, and characterize the data processing that will typically be performed.
Licensing Choices
The MapR Hadoop distribution is licensed in tiers. If you need to store table data, choose the M7 license. M7 includes all features of the M5 license, and adds support for structured table data natively in the storage layer. M7 Edition provides a flexible, NoSQL database compatible with Apache HBase.
The M5 license enables enterprise-class storage features, such as snapshots and mirrors of individual volumes, and high-availability features, such as the ability to run NFS servers on multiple nodes, which also improves bandwidth and performance.
The free M3 community edition includes MapR improvements, such as the read/write MapR-FS and NFS access to the filesystem, but does not include the level of technical support offered with the M5 or M7 editions. You can obtain an M3 license or an M5 trial license online by registering. To obtain an M7 license, you will need to contact a MapR representative.
Data Workload
While MapR is relatively easy to install and administer, designing and tuning a large production MapReduce cluster is a complex task that begins with understanding your data needs. Consider the kind of data processing that will occur and estimate the storage capacity and throughput speed required. Data movement, independent of MapReduce operations, is also a consideration. Plan for how data will arrive at the cluster, and how it will be made useful elsewhere.
Network bandwidth and disk I/O speeds are related; either can become a bottleneck. CPU-intensive workloads reduce the relative importance of disk or network speed. If the cluster will be performing a large number of big reduces, network bandwidth is important, suggesting that the hardware plan include multiple NICs per node. In general, the more network bandwidth, the faster things will run.
Running NFS on multiple data nodes can improve data transfer performance and make direct loading and unloading of data possible, but multiple NFS instances require an M5 license. For more information about NFS, see Setting Up MapR NFS.
Plan which nodes will provide NFS access according to your anticipated traffic. For instance, if you need 5Gb/s of write throughput and 5Gb/s of read throughput, the following node configurations would be suitable:
12 NFS nodes with a single 1GbE connection each
6 NFS nodes with dual 1GbE connections each
4 NFS nodes with quadruple 1GbE connections each
When you set up NFS on all of the file server nodes, you enable a self-mounted NFS point for each node. A cluster made up of nodes with self-mounted NFS points enables you to run native applications as tasks. You can use round-robin DNS or a hardware load balancer to mount NFS on one or more dedicated gateways outside the cluster to allow controlled access.
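For illustration, a client outside the cluster might mount a MapR NFS gateway along the following lines. The host name nfsnode1 and cluster name my.cluster.com are placeholders; confirm the actual export name with showmount against your own gateway.

$ showmount -e nfsnode1
# create a mount point and mount the cluster export (typically /mapr)
$ sudo mkdir -p /mapr
$ sudo mount -o hard,nolock nfsnode1:/mapr /mapr
$ ls /mapr/my.cluster.com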
High Availability
A properly licensed and configured MapR cluster provides automatic failover for continuity throughout the stack. Configuring a cluster for HA involves redundant instances of specific services, as well as a correct configuration of the MapR NFS service. HA features are not available with the M3 Edition license.
The following table describes the redundant services used for HA:
Service / Strategy / Minimum instances
CLDB: Master/slave--two instances in case one fails. Minimum 2.
ZooKeeper: A majority of ZooKeeper nodes (a quorum) must be up. Minimum 3.
JobTracker: Active/standby--if the first JobTracker fails, the backup is started. Minimum 2.
HBase Master: Active/standby--if the first HBase Master fails, the backup is started. This is only a consideration when deploying Apache HBase on the cluster. Minimum 2.
NFS: The more redundant NFS services, the better. Minimum 2.
On a large cluster, you may choose to have extra nodes available in preparation for failover events. In this case, you keep spare, unused nodes ready to replace nodes running control services--such as CLDB, JobTracker, ZooKeeper, or HBase Master--in case of a hardware failure.
Virtual IP Addresses
You can set up virtual IP addresses (VIPs) for NFS nodes in an M5-licensed MapR cluster, for load balancing or failover. VIPs provide multiple addresses that can be leveraged for round-robin DNS, allowing client connections to be distributed among a pool of NFS nodes. VIPs also enable high availability (HA) NFS. In an HA NFS system, when an NFS node fails, data requests are satisfied by other NFS nodes in the pool. Use a minimum of one VIP per NFS node per NIC that clients will use to connect to the NFS server. If you have four nodes with four NICs each, with each NIC connected to an individual IP subnet, use a minimum of 16 VIPs and direct clients to the VIPs in round-robin fashion. The VIPs should be in the same IP subnet as the interfaces to which they will be assigned. See Setting Up VIPs for NFS for details on enabling VIPs for your cluster.
If you plan to use VIPs on your M5 cluster's NFS nodes, consider the following tips:
Set up NFS on at least three nodes if possible.
All NFS nodes must be accessible over the network from the machines where you want to mount them.
To serve a large number of clients, set up dedicated NFS nodes and load-balance between them. If the cluster is behind a firewall, you can provide access through the firewall via a load balancer instead of direct access to each NFS node. You can run NFS on all nodes in the cluster, if needed.
To provide maximum bandwidth to a specific client, install the NFS service directly on the client machine. The NFS gateway on the client manages how data is sent in or read back from the cluster, using all its network interfaces (that are on the same subnet as the cluster nodes) to transfer data via MapR APIs, balancing operations among nodes as needed.
Use VIPs to provide High Availability (HA) and failover.
Cluster Hardware
When planning the hardware architecture for the cluster, make sure all hardware meets the node requirements listed in Preparing Each Node. The architecture of the cluster hardware is an important consideration when planning a deployment. Among the considerations are anticipated data storage and network bandwidth needs, including intermediate data generated during MapReduce job execution.
The type of workload is important: consider whether the planned cluster usage will be CPU-intensive, I/O-intensive, or memory-intensive. Think about how data will be loaded into and out of the cluster, and how much data is likely to be transmitted over the network. Planning a cluster often involves tuning key ratios, such as disk I/O speed to CPU processing power, storage capacity to network speed, or number of nodes to network speed.
Typically, the CPU is less of a bottleneck than network bandwidth and disk I/O. To the extent possible, network and disk transfer rates should be balanced to meet the anticipated data rates using multiple NICs per node. It is not necessary to bond or trunk the NICs together; MapR is able to take advantage of multiple NICs transparently. Each node should provide raw disks and partitions to MapR, with no RAID or logical volume manager, as MapR takes care of formatting and data protection.
The following example architecture provides specifications for a standard compute/storage node for general purposes, and two sample rack configurations made up of the standard nodes. MapR is able to make effective use of more drives per node than standard Hadoop, so each node should present enough face plate area to allow a large number of drives. The standard node specification allows for either 2 or 4 1Gb/s ethernet network interfaces. MapR recommends 10Gb/s network interfaces for high-performance clusters.
You should use an odd number of ZooKeeper instances. For a high availability cluster, use 5 ZooKeepers, so that the cluster can tolerate 2 ZooKeeper nodes failing and still maintain a quorum. Setting up more than 5 ZooKeeper instances is not recommended.
Standard 50TB Rack Configuration
10 standard compute/storage nodes (10 x 12 x 2 TB storage; 3x replication, 25% margin)
24-port 1 Gb/s rack-top switch with 2 x 10Gb/s uplink
Add a second switch if each node uses 4 network interfaces
Standard 100TB Rack Configuration
20 standard nodes (20 x 12 x 2 TB storage; 3x replication, 25% margin)
48-port 1 Gb/s rack-top switch with 4 x 10Gb/s uplink
Add a second switch if each node uses 4 network interfaces
To grow the cluster, just add more nodes and racks, adding additional service instances as needed. MapR rebalances the cluster automatically.
Service Layout in a Cluster
How you assign services to nodes depends on the scale of your cluster and the MapR license level. For a single-node cluster, no decisions are involved: all of the services you are using run on the single node. On medium clusters, the performance demands of the CLDB and ZooKeeper services require them to be assigned to separate nodes to optimize performance. On large clusters, good cluster performance requires that these services run on separate nodes.
The cluster is flexible and elastic; nodes play different roles over the lifecycle of a cluster. The basic requirements of a node are not different for management or for data nodes. As the cluster size grows, it becomes advantageous to locate control services (such as ZooKeeper and CLDB) on nodes that do not run compute services (such as TaskTracker). The MapR M3 Edition license does not include HA capabilities, which restricts how many instances of certain services can run. The number of nodes and the services they run will evolve over the life cycle of the cluster. When setting up a cluster initially, take into consideration the points in the following section, Assigning Services to Nodes for Best Performance.
Assigning Services to Nodes for Best Performance
The architecture of MapR software allows virtually any service to run on any node, or nodes, to provide a high-availability, high-performance cluster. Below are some guidelines to help plan your cluster's service layout.
Node Types
In a production MapR cluster, some nodes are typically dedicated to cluster coordination and management, and other nodes are tasked with data storage and processing duties. An edge node provides user access to the cluster, concentrating open user privileges on a single host. In smaller clusters, the work is not so specialized and a single node may perform data processing as well as management. It is possible to install MapR Hadoop on a one- or two-node demo cluster. Production clusters may harness hundreds of nodes, but five- or ten-node production clusters are appropriate for some applications.
Nodes Running ZooKeeper and CLDB
High latency on a ZooKeeper node can lead to an increased incidence of ZooKeeper quorum failures. A ZooKeeper quorum failure occurs when the cluster finds too few copies of the ZooKeeper service running. If the ZooKeeper node is also running other services, competition for computing resources can lead to increased latency for that node. If your cluster experiences issues relating to ZooKeeper quorum failures, consider reducing or eliminating the number of other services running on the ZooKeeper node.
The following are guidelines about which services to separate on large clusters:
JobTracker on ZooKeeper nodes: Avoid running the JobTracker service on nodes that are running the ZooKeeper service. On large clusters, the JobTracker service can consume significant resources.
MySQL on CLDB nodes: Avoid running the MySQL server that supports the MapR Metrics service on a CLDB node. Consider running the MySQL server on a machine external to the cluster to prevent the MySQL server's resource needs from affecting services on the cluster.
TaskTracker on CLDB or ZooKeeper nodes: When the TaskTracker service is running on a node that is also running the CLDB or ZooKeeper services, consider reducing the number of task slots that this node's instance of the TaskTracker service provides. See Tuning Your MapR Install.
Webserver on CLDB nodes: Avoid running the webserver on CLDB nodes. Queries to the MapR Metrics service can impose a bandwidth load that reduces CLDB performance.
JobTracker on large clusters: Run the JobTracker service on a dedicated node for clusters with over 250 nodes.
Nodes for Data Storage and Processing
Most nodes in a production cluster are data nodes. Data nodes can be added or removed from the cluster as requirements change over time. Tune TaskTracker for fewer slots on nodes that include both management and data services. See Tuning Your MapR Install.
Edge Nodes
So-called edge nodes provide a common user access point for the MapR webserver and other client tools. Edge nodes may or may not be part of the cluster, as long as the edge node can reach cluster nodes. Nodes on the same network can run client services, MySQL for Metrics, and so on.
Example Cluster Designs
Small M3 Cluster
For a small cluster using the free M3 Edition license, assign the CLDB, JobTracker, NFS, and WebServer services to one node each. A hardware failure on any of these nodes would result in a service interruption, but the cluster can be recovered. Assign the ZooKeeper service to the CLDB node and two other nodes. Assign the FileServer and TaskTracker services to every node in the cluster.
Example Service Configuration for a 5-Node M3 Cluster
This cluster has several single points of failure, at the nodes with CLDB, JobTracker, and NFS.
Small High-Availability M5 Cluster
A small M5 cluster can ensure high availability (HA) for all services by providing at least two instances of each service, eliminating single points of failure. The example below depicts a 5-node, high-availability M5 cluster with HBase installed. ZooKeeper is installed on three nodes. CLDB, JobTracker, and HBase Master services are installed on two nodes each, spread out as much as possible across the nodes.
Example Service Configuration for a 5-Node M5 Cluster
These examples put CLDB and ZooKeeper services on the same nodes and generally place JobTracker services on other nodes, but this is somewhat arbitrary. The JobTracker service can coexist on the same node as ZooKeeper or CLDB services.
Large High-Availability M5 Cluster
On a large cluster designed for high availability (HA), assign services according to the example below, which depicts a 150-node HA M5 cluster. The majority of nodes are dedicated to the TaskTracker service. ZooKeeper, CLDB, and JobTracker are installed on three nodes each, and are isolated from other services. The NFS server is installed on most machines, providing high network bandwidth to the cluster.
Example Service Configuration for a 100+ Node M5 Cluster
Plan Initial Volumes
MapR manages the data in a cluster in a set of volumes. Volumes can be mounted in the Linux filesystem in a hierarchical directory structure, but volumes do not contain other volumes. Each volume has its own policies and other settings, so it is important to define a number of volumes in order to segregate and classify your data.
Plan to define volumes for each user, for each project, and so on. For streaming data, you might plan to create a new volume to store new data every day or week or month. The more volume granularity, the easier it is to specify backup or other policies for subsets of the data. For more information on volumes, see Managing Data with Volumes.
User Accounts
Part of the cluster plan is a list of authorized users of the cluster. It is preferable to give each user an account, because account-sharing makes administration more difficult. Any user of the cluster must be established with the same Linux UID and GID on every node in the cluster. Central directory services, such as LDAP, are often used to simplify user maintenance.
Next Step
It is important to begin installation with a complete Cluster Plan, but plans should not be immutable. Cluster services often change over time, particularly as clusters scale up by adding nodes. Balancing resources to maximize utilization is the goal, and it will require flexibility.
The next step is to prepare each node. Most installation difficulties are traced back to nodes that are not qualified to contribute to the cluster, or which have not been properly prepared. For large clusters, it can save time and trouble to use a configuration management tool such as Puppet or Chef.
Proceed to Preparing Each Node and assess each node.
Preparing Each Node
Each node contributes to the cluster designed in the previous step, so each must be able to run MapR and Hadoop software.
Requirements
CPU: 64-bit
OS: Red Hat, CentOS, SUSE, or Ubuntu
Memory: 4 GB minimum, more in production
Disk: Raw, unformatted drives and partitions
DNS: Hostname, reaches all other nodes
Users: Common users across all nodes; keyless ssh
Java: Must run Java
Other: NTP, Syslog, PAM
Use the following sections as a checklist to make each candidate node suitable for its assigned roles. Once each node has been prepared or disqualified, proceed to Step 3, Installing MapR Software.
2.1 CPU and Operating System
a. Processor is 64-bit
To determine the processor type, run:
$ uname -m
x86_64
If the output includes "x86_64," the processor is 64-bit. If it includes "i386," "i486," "i586," or "i686," it is a 32-bit processor, which is not supported by MapR software. If the results are "unknown," or none of the above, try one of these alternative commands.
$ uname -a
Linux mach-name 2.6.35-22-server #33-Ubuntu SMP Sun Sep 19 20:48:58 UTC 2012 x86_64 GNU/Linux
In the /proc/cpuinfo file, the flag 'lm' (for "long-mode") indicates a 64-bit processor.
$ grep flags /proc/cpuinfo
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc up arch_perfmon pebs bts rep_good xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes hypervisor lahf_lm ida arat
b. Operating System is supported
Run the following command to determine the name and version of the installed operating system. (If the lsb_release command reports "No LSB modules are available," this is not a problem.)
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 10.10
Release: 10.10
Codename: maverick
The operating system must be one of the following:
Operating System / Minimum version
RedHat Enterprise Linux (RHEL) or Community Enterprise Linux (CentOS): 5.4 or later
SUSE Enterprise Linux Server: 11 or later
Ubuntu Linux: 9.04 or later
If the lsb_release command is not found, try one of the following alternatives.
$ cat /proc/version
Linux version 2.6.35-22-server (build@allspice) (gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu4) ) #33-Ubuntu SMP Sun Sep 19 20:48:58 UTC 2012
$ cat /etc/*-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=10.10
DISTRIB_CODENAME=maverick
DISTRIB_DESCRIPTION="Ubuntu 10.10"
If you determine that the node is running an older version of a supported OS, upgrade to at least a supported version and test the upgrade before proceeding. If you find a different Linux distribution, such as Fedora or Gentoo, the node must be reformatted and a supported distro installed.
2.2 Memory and Disk Space
a. Minimum Memory
Run free -g to display total and available memory in gigabytes. The software will run with as little as 4 GB total memory on a node, but performance will suffer with less than 8 GB. MapR recommends at least 16 GB for a production environment, and typical MapR production nodes have 32 GB or more.
$ free -g
             total  used  free  shared  buffers  cached
Mem:             3     2     1       0        0       1
-/+ buffers/cache:     0     2
Swap:            2     0     2
If the free command is not found, there are many alternatives: grep MemTotal: /proc/meminfo, vmstat -s -SM, top, or various GUI system information tools.
You can try MapR out on non-production equipment, but under the demands of a production environment, memory needs to be balanced against disks, network, and CPU.
MapR does not recommend using overcommit because it may lead to the kernel memory manager killing processes to free memory, resulting in killed MapR processes and system instability.
Set vm.overcommit_memory to 0:
Edit the /etc/sysctl.conf file and add the following line:
vm.overcommit_memory=0
Save the file and run:
sysctl -p
b. Storage
Unlike ordinary Hadoop, MapR manages raw (unformatted) devices directly to optimize performance and offer high availability. If this will be a datanode, MapR recommends at least 3 unmounted physical drives or partitions available for use by MapR storage. MapR uses disk spindles in parallel for faster read/write bandwidth and therefore groups disks into sets of three. These raw drives should not use RAID or Logical Volume Management. (MapR can work with these technologies, but they require advanced setup and actually degrade cluster performance.)
MapR requires a minimum of one disk or partition for MapR data. However, file contention for a shared disk will decrease performance. In a typical production environment, multiple physical disks on each node are dedicated to the distributed file system, which results in much better performance.
Minimum Disk Space
OS Partition. Provide at least 10 GB of free disk space on the operating system partition.
Disk. Provide 10 GB free disk space in the /tmp directory (for JobTracker and TaskTracker temporary files) and 128 GB free disk space in the /opt directory (for logs, cores, and support images).
Swap space. Provide sufficient swap space for stability, 10% more than the node's physical memory, but not less than 24 GB and not more than 128 GB.
ZooKeeper. On ZooKeeper nodes, dedicate a partition, if practicable, for the /opt/mapr/zkdata directory to avoid other processes filling that partition with writes and to reduce the possibility of errors due to a full /opt/mapr/zkdata directory. This directory is used to store snapshots that are up to 64 MB. Since the four most recent snapshots are retained, reserve at least 500 MB for this partition. Do not share the physical disk where /opt/mapr/zkdata resides with any MapR File System data partitions, to avoid I/O conflicts that might lead to ZooKeeper service failures.
2.3 Connectivity
a. Hostname
Each node in the cluster must have a unique hostname, resolvable forward and backward with every other node with both normal and reverse DNS name lookup.
Run hostname -f to check the node's hostname. For example:
$ hostname -f
node125
If hostname -f returns a name, run getent hosts `hostname` to return the node's IP address and fully-qualified domain name (FQDN).
$ getent hosts `hostname`
10.250.1.53 node125.corp.example.com
To troubleshoot hostname problems, edit the /etc/hosts file as root. A simple /etc/hosts might contain:
127.0.0.1 localhost
10.10.5.10 mapr-hadoopn.maprtech.prv mapr-hadoopn
A common problem is an incorrect loopback entry (127.0.x.x) that prevents the IP address from being assigned to the hostname. For example, on Ubuntu, the default /etc/hosts file might contain:
127.0.0.1 localhost
127.0.1.1 node125.corp.example.com
A loopback (127.0.x.x) entry with the node's hostname will confuse the installer and other programs. Edit the /etc/hosts file and delete any entries that associate the hostname with a loopback IP. Only associate the hostname with the actual IP address. For more information about Ubuntu's default /etc/hosts file, see https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/871966.
Use the ping command to verify that each node can reach the others using each node's hostname. For more information, see the hosts(5) man page.
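A short script along the following lines can help verify forward lookup, reverse lookup, and reachability for every node before installation; the hostnames listed are placeholders for the nodes in your cluster plan.
#!/bin/bash
# Sketch: check forward lookup, reverse lookup, and reachability for each node.
for host in node125 node126 node127; do
  echo "== $host =="
  ip=$(getent hosts "$host" | awk '{print $1}')
  if [ -z "$ip" ]; then echo "forward lookup FAILED"; continue; fi
  getent hosts "$ip" | grep -q "$host" && echo "reverse lookup OK ($ip)" || echo "reverse lookup FAILED ($ip)"
  ping -c 1 -W 2 "$host" > /dev/null && echo "ping OK" || echo "ping FAILED"
done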
b. Common Users
Users of the cluster must have the same credentials and uid on every node in the cluster. Each real person (or department) that will run MapR jobs needs an account and must belong to a common group (gid). In addition, a MapR user with full privileges to administer the cluster will be created. If a directory service, such as LDAP, is not used, this user will be created on each node.
If a user named 'mapr' does not exist, installing MapR will create it. It is recommended that you create the user named 'mapr' in advance in order to test for connectivity issues before the installation step. Every user (including the 'mapr' user) must have the same uid and primary gid on every node.
To create a user, run the following command as root, substituting a uid for m and a gid for n. (The error "cannot lock /etc/passwd" suggests that the command was not run as root.)
$ useradd mapr --gid n --uid m
To test that the mapr user has been created, run su mapr. Verify that a home directory has been created (usually /home/mapr) and that the mapr user has read-write access to it. The mapr user must have write access to the /tmp directory, or the warden will fail to start services.
c. Optional: Keyless ssh
It is very helpful for the common user to be able to ssh from each webserver node to any other node without providing a password. If so-called "keyless ssh" is not provided, centralized disk management will not be available. Remote, centralized cluster management is convenient, but keyless ssh between nodes is optional because MapR will run without it.
Setting up keyless ssh is straightforward. On each webserver node, generate a key pair and append the key to an authorization file. Then copy this authorization file to each node, so that every node is available from the webserver node.
su mapr (if you are not already logged in as mapr)
ssh-keygen -t rsa -P '' -f ~/filename
The ssh-keygen command creates filename, containing the private key, and filename.pub, containing the public key. For convenience, you may want to name the file for the hostname of the node. For example, on the node with hostname "node10.10.1.1,"
ssh-keygen -t rsa -P '' -f ~/node10.10.1.1
In this example, append the file /home/mapr/node10.10.1.1.pub to the authorized_keys file.
Append each webserver node's public key to a single file, using a command like cat filename.pub >> authorized_keys. (The key file is simple text, so you can append the file in several ways, including a text editor.) When every webserver node's empty passphrase public key has been generated, and the public key file has been appended to the master "authorized_keys" file, copy this master keys file to each node as ~/.ssh/authorized_keys, where ~ refers to the mapr user's home directory (typically /home/mapr).
2.4 Software
a. Java
MapR services require the Java runtime environment. Run java -version. Verify that one of these versions is installed on the node:
Sun Java JDK 1.6 or 1.7
OpenJDK 1.6 (On Ubuntu Linux only, OpenJDK 1.7 is also acceptable)
If the java command is not found, download and install Oracle/Sun Java or use a package manager to install OpenJDK. Obtain the Oracle/Sun Java Runtime Environment (JRE), Standard Edition (Java SE), available at Oracle's Java SE website. Find Java SE 6 in the archive of previous versions.
Use a package manager, such as yum (RedHat or CentOS), apt-get (Ubuntu), or rpm to install or update OpenJDK on the node. The command will be something like one of these:
Red Hat or CentOS
yum install java-1.6.0-openjdk.x86_64
Ubuntu
apt-get install openjdk-6-jdk
SUSE
rpm -I openjdk-1.6.0-21.x86_64.rpm
Sun Java includes the jps command, which lists running Java processes and can show whether the CLDB has started. There are ways to determine this with OpenJDK, but they are more complicated.
b. MySQL
The MapR Metrics service requires access to a MySQL server running version 5.1 or later. MySQL does not have to be installed on a node in the cluster, but it must be on the same network as the cluster. If you do not plan to use MapR Metrics, MySQL is not required.
2.5 Infrastructure
a. Network Time
To keep all cluster nodes time-synchronized, MapR requires software such as a Network Time Protocol (NTP) server to be configured and running on every node. If server clocks in the cluster drift out of sync, serious problems will occur with HBase and other MapR services. MapR raises a Time Skew alarm on any out-of-sync nodes. See http://www.ntp.org/ for more information about obtaining and installing NTP.
Advanced: Installing an internal NTP server keeps your cluster synchronized even when an outside NTP server is inaccessible.
b. Syslog
Syslog must be enabled on each node to preserve logs regarding killed processes or failed jobs. Modern versions such as syslog-ng and rsyslog
#auth required pam_wheel.so use_uid auth include system-auth account sufficient pam_succeed_if.so uid = 0 use_uid quiet account include system-auth password include system-auth session include system-auth session required pam_limits.so session optional pam_xauth.so Setting resource limits on Ubuntu Edit and add the following lines: /etc/security/limits.conf <MAPR_USER> - nofile 64000 <MAPR_USER> - nproc 64000 Edit and uncomment the following line: /etc/pam.d/su session required pam_limits.so Use to verify settings: ulimit Reboot the system. Run the following command as the MapR user (not root) at a command line: ulimit -n The command should report . 64000 d. PAM Nodes that will run the (the service) can take advantage of Pluggable Authentication Modules (PAM) if MapR Control System mapr-webserver found. Configuration files in directory are typically provided for each standard Linux command. MapR can use, but does not /etc/pam.d/ 1. 2. require, its own profile. For more detail about configuring PAM, see . PAM Configuration e. Security - SELinux, AppArmor SELinux (or the equivalent on other operating systems) must be disabled during the install procedure. If the MapR services run as a non-root user, SELinux can be enabled after installation and while the cluster is running. f. TCP Retries On each node, set the number of TCP retries to 5 so that MapR can detect unreachable nodes with less latency. Edit the file and add the following line: /etc/sysctl.conf net.ipv4.tcp_retries2=5 Save the file and run: sysctl -p g. NFS Disable the stock Linux NFS server on nodes that will run the MapR NFS server. h. iptables Enabling iptables on a node may close ports that are used by MapR. If you enable iptables, make sure that remain open. Check required ports your current IP table rules with the following command: $ service iptables status Automated Configuration Some users find tools like Puppet or Chef useful to configure each node in a cluster. Make sure, however, that any configuration tool does not reset changes made when MapR packages are later installed. Specifically, do not let automated configuration tools overwrite changes to the following files: /etc/sudoers /etc/security/limits.conf /etc/udev/rules.d/99-mapr-disk.rules Next Step Each prospective node in the cluster must be checked against the requirements presented here. Failure to ensure that each node is suitable for use generally leads to hard-to-resolve problems with installing Hadoop. After each node has been shown to meet the requirements and has been prepared, you are ready to . Install MapR components Installing MapR Software After you have and , you are ready to install the MapR distribution on each node according to your Cluster planned the cluster prepared each node Plan. Installing MapR software across the cluster involves performing several steps on each node. To make the installation process simpler, we will postpone the installation of Apache Hadoop components, such as HBase or Hive, until Step 5, . However, Installing Hadoop Components experienced administrators can install these components at the same time as MapR software if desired. It is usually easier to bring up the MapR Hadoop cluster successfully before installing Hadoop ecosystem components. 
The following sections describe the steps and options for installing MapR software: Preparing Packages and Repositories Using MapR's Internet repository Using a local repository Using a local path containing or package files rpm deb Installation Installing MapR packages Verify successful installation Setting Environment Variables Configure the Node with the Formatting Disks with the Next Step Preparing Packages and Repositories When installing MapR software, each node must have access to the package files. There are several ways to specify where the packages will be. This section describes the ways to make packages available to each node. The options are: Using MapR's Internet repository Using a local repository Using a local path containing or package files rpm deb You also must consider all packages that the MapR software depends on. You can install dependencies on each node before beginning the MapR installation process, or you can specify repositories and allow the package manager on each node to resolve dependencies. See Packages for details. and Dependencies for MapR Software Starting in the 2.0 release, MapR separates the distribution into two repositories: MapR packages which provide core functionality for MapR clusters, such as the MapR filesystem 1. 2. 3. 1. 2. Hadoop ecosystem packages which are not specific to MapR, such as HBase, Hive and Pig Using MapR's Internet repository The MapR repository on the Internet provides all the packages you need in order to install a MapR cluster using native tools such as on Red yum Hat or CentOS, or on Ubuntu. Installing from MapR's repository is generally the easiest method for installation, but requires the greatest apt-get amount of bandwidth. With this method, each node must be connected to the Internet and will individually download the necessary packages. Below are instructions on setting up repositories for each supported Linux distribution. Adding the MapR repository on Red Hat or CentOS Change to the user (or use for the following commands). root sudo Create a text file called in the directory with the following contents: maprtech.repo /etc/yum.repos.d/ [maprtech] name=MapR Technologies baseurl=http://package.mapr.com/releases/v3.0.1/redhat/ enabled=1 gpgcheck=0 protect=1 [maprecosystem] name=MapR Technologies baseurl=http://package.mapr.com/releases/ecosystem/redhat enabled=1 gpgcheck=0 protect=1 (See the for the correct paths for all past releases.) Release Notes If your connection to the Internet is through a proxy server, you must set the environment variable before installation: http_proxy http_proxy=http://<host>:<port> export http_proxy You can also set the value for the environment variable by adding the following section to the file: http_proxy /etc/yum.conf proxy=http://<host>:<port> proxy_username=<username> proxy_password=<password> To enable the EPEL repository on CentOS or Red Hat 5.x: Download the EPEL repository: wget http://dl.fedoraproject.org/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm The EPEL (Extra Packages for Enterprise Linux) repository contains dependencies for the package on Red mapr-metrics Hat/CentOS. If your Red Hat/CentOS cluster does not use the service, you can skip EPEL configuration. mapr-metrics 2. 1. 2. 1. 2. 3. 4. 5. 6. 1. 2. 
Install the EPEL repository: rpm -Uvh epel-release-5*.rpm To enable the EPEL repository on CentOS or Red Hat 6.x: Download the EPEL repository: wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rp m Install the EPEL repository: rpm -Uvh epel-release-6*.rpm Adding the MapR repository on SUSE Change to the user (or use for the following commands). root sudo Use the following command to add the repository for MapR packages: zypper ar http://package.mapr.com/releases/v3.0.1/suse/ maprtech Use the following command to add the repository for MapR ecosystem packages: zypper ar http://package.mapr.com/releases/ecosystem/suse/ maprecosystem (See the for the correct paths for all past releases.) MapR Release Notes If your connection to the Internet is through a proxy server, you must set the environment variable before installation: http_proxy http_proxy=http://<host>:<port> export http_proxy Update the system package index by running the following command: zypper refresh MapR packages require a compatibility package in order to install and run on SUSE. Execute the following command to install the SUSE compatibility package: zypper install mapr-compat-suse Adding the MapR repository on Ubuntu Change to the user (or use for the following commands). root sudo Add the following lines to : /etc/apt/sources.list 2. 3. 4. 1. 2. 3. 4. 5. deb http://package.mapr.com/releases/v3.0.1/ubuntu/ mapr optional deb http://package.mapr.com/releases/ecosystem/ubuntu binary/ (See the for the correct paths for all past releases.) MapR Release Notes Update the package indexes. apt-get update If your connection to the Internet is through a proxy server, add the following lines to : /etc/apt/apt.conf Acquire { Retries "0"; HTTP { Proxy "http://<user>:<password>@<host>:<port>"; }; }; Using a local repository You can set up a local repository on each node to provide access to installation packages. With this method, the package manager on each node installs from packages in the local repository. Nodes do not need to be connected to the Internet. Below are instructions on setting up a local repository for each supported Linux distribution. These instructions create a single repository that includes both MapR components and the Hadoop ecosystem components. Setting up a local repository requires running a web server that nodes access to download the packages. Setting up a web server is not documented here. Creating a local repository on Red Hat or CentOS Login as on the node. root Create the following directory if it does not exist: /var/www/html/yum/base On a computer that is connected to the Internet, download the following files, substituting the appropriate and <version> <datest : amp> http://package.mapr.com/releases/v<version>/redhat/mapr-v<version>GA.rpm.tgz http://package.mapr.com/releases/ecosystem/redhat/mapr-ecosystem-<datestamp>.r pm.tgz (See for the correct paths for all past releases.) MapR Repositories and Package Archives Copy the files to on the node, and extract them there. /var/www/html/yum/base tar -xvzf mapr-v<version>GA.rpm.tgz tar -xvzf mapr-ecosystem-<datestamp>.rpm.tgz Create the base repository headers: 5. 1. 1. 2. 1. 2. 1. 2. 3. createrepo /var/www/html/yum/base When finished, verify the contents of the new directory: /var/www/html/yum/base/repodata filelists.xml.gz, other.xml.gz, primary.xml.gz, repomd.xml To add the repository on each node Add the following lines to the file: /etc/yum.conf [maprtech] name=MapR Technologies, Inc. 
baseurl=http://<host>/yum/base enabled=1 gpgcheck=0 To enable the EPEL repository on CentOS or Red Hat 5.x: On a computer that is connected to the Internet, download the EPEL repository: wget http://dl.fedoraproject.org/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm Install the EPEL repository: rpm -Uvh epel-release-5*.rpm To enable the EPEL repository on CentOS or Red Hat 6.x: On a computer that is connected to the Internet, download the EPEL repository: wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rp m Install the EPEL repository: rpm -Uvh epel-release-6*.rpm Creating a local repository on SUSE Login as on the node. root Create the following directory if it does not exist: /var/www/html/zypper/base On a computer that is connected to the Internet, download the following files, substituting the appropriate and <version> <datest : amp> The EPEL (Extra Packages for Enterprise Linux) repository contains dependencies for the package on Red mapr-metrics Hat/CentOS. If your Red Hat/CentOS cluster does not use the service, you can skip EPEL configuration. mapr-metrics 3. 4. 5. 1. 1. 2. 3. http://package.mapr.com/releases/v<version>/suse/mapr-v<version>GA.rpm.tgz http://package.mapr.com/releases/ecosystem/suse/mapr-ecosystem-<datestamp>.rpm .tgz (See for the correct paths for all past releases.) MapR Repositories and Package Archives Copy the files to on the node, and extract them there. /var/www/html/zypper/base tar -xvzf mapr-v<version>GA.rpm.tgz tar -xvzf mapr-ecosystem-<datestamp>.rpm.tgz Create the base repository headers: createrepo /var/www/html/zypper/base When finished, verify the contents of the new directory: /var/www/html/zypper/base/repodata filelists.xml.gz, other.xml.gz, primary.xml.gz, repomd.xml To add the repository on each node Use the following commands to add the repository for MapR packages and the MapR ecosystem packages, substituting the appropriate <version>: zypper ar http://<host>/zypper/base/ maprtech Creating a local repository on Ubuntu To create a local repository Login as on the machine where you will set up the repository. root Change to the directory and create the following directories within it: /root ~/mapr . dists binary optional binary-amd64 mapr On a computer that is connected to the Internet, download the following files, substituting the appropriate and <version> <datest . amp> 3. 4. 5. 6. 7. 1. 2. 1. 2. http://package.mapr.com/releases/v<version>/ubuntu/mapr-v<version>GA.deb.tgz http://package.mapr.com/releases/ecosystem/ubuntu/mapr-ecosystem-<datestamp>.d eb.tgz (See for the correct paths for all past releases.) MapR Repositories and Package Archives Copy the files to on the node, and extract them there. /root/mapr/mapr tar -xvzf mapr-v<version>GA.rpm.tgz tar -xvzf mapr-ecosystem-<datestamp>.rpm.tgz Navigate to the directory. /root/mapr/ Use to create in the directory: dpkg-scanpackages Packages.gz binary-amd64 dpkg-scanpackages . /dev/null | gzip -9c > ./dists/binary/optional/binary-amd64/Packages.gz Move the entire directory to the default directory served by the HTTP server (e. g. ) and make sure the /root/mapr /var/www HTTP server is running. To add the repository on each node Add the following line to on each node, replacing with the IP address or hostname of the node /etc/apt/sources.list <host> where you created the repository: deb http://<host>/mapr binary optional On each node update the package indexes (as or with ). 
root sudo apt-get update After performing the above steps, you can use as normal to install MapR software and Hadoop ecosystem components on each apt-get node from the local repository. Using a local path containing or package files rpm deb You can download package files and store them locally, and install from there. This option is useful for clusters that are not connected to the Internet. Using a machine connected to the Internet, download the tarball for the MapR components and the Hadoop ecosystem components, substituting appropriate , and : <platform> <version> <datestamp> <version>/<platform>/mapr-v<version>GA.rpm.tgz http://package.mapr.com/releases/v (or ) .deb.tgz <platform>/mapr-ecosystem-<datestamp>.rpm.tgz http://package.mapr.com/releases/ecosystem/ (or .deb ) .tgz For example, . http://package.mapr.com/releases/v3.0.1/ubuntu/mapr-v3.0.1GA.deb.tgz (See for the correct paths for all past releases.) MapR Repositories and Package Archives 2. 1. 2. 3. Extract the tarball to a local directory, either on each node or on a local network accessible by all nodes. tar -xvzf mapr-v<version>GA.rpm.tgz tar -xvzf mapr-ecosystem-<datestamp>.rpm.tgz MapR package dependencies need to be pre-installed on each node in order for MapR installation to succeed. If you are not using a package manager to install dependencies from Internet repositories, you need to manually download and install other dependency packages as well. Installation After and preparing packages and repositories, you are ready to install the MapR software. making your Cluster Plan To proceed you will need the following from your Cluster Plan: A list of the hostnames (or IP addresses) for all CLDB nodes A list of the hostnames (or IP addresses) for all ZooKeeper nodes A list of all disks and/or partitions to be used for the MapR cluster on all nodes Perform the following steps on each node: Install the planned MapR services Run the script to the node configure.sh configure Format raw drives and partitions allocated to MapR using the script disksetup The following table shows some of the services that can be run on a node, and the name of the package used to install the service. Service Package CLDB mapr-cldb JobTracker mapr-jobtracker MapR Control System mapr-webserver MapR-FS File Server mapr-fileserver Metrics mapr-metrics NFS mapr-nfs TaskTracker mapr-tasktracker ZooKeeper mapr-zookeeper MapR HBase Client mapr-hbase-<version> (See .) #MapR HBase Client Installation on M7 Edition     Hadoop Ecosystem Components Use MapR-tested versions, compatible and in some cases improved components Cascading mapr-cascading Flume mapr-flume HBase mapr-hbase-master mapr-hbase-regionserver Before you proceed, make sure that all nodes meet the . Failure to meet node requirements is the primary Requirements for Installation cause of installation problems. 1. 2. 1. 2. HCatalog mapr-hcatalog mapr-hcatalog-server Hive mapr-hive Mahout mapr-mahout Oozie mapr-oozie Pig mapr-pig Sqoop mapr-sqoop Whirr mapr-whirr MapR HBase Client Installation on M7 Edition MapR M7 Edition, which introduces table storage in MapR-FS, is available in MapR version 3.0 and later. Nodes that will access table data in MapR-FS must have the MapR HBase Client installed. The package name is , where matches the version mapr-hbase-<version> <version> of HBase API to support, such as 0.92.2 or 0.94.5. This version has no impact on the underlying storage format used by the MapR-FS file server. 
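To see which MapR HBase Client versions are available in the configured repositories, and which one is already present on a node, commands along the following lines can be used (shown for both package managers; the exact output varies by release):
Red Hat or CentOS
$ yum list available 'mapr-hbase*'
$ rpm -qa | grep mapr-hbase
Ubuntu
$ apt-cache madison mapr-hbase
$ dpkg -l | grep mapr-hbase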
If you have existing applications written for a specific version of the HBase API, install the MapR HBase Client package with the same version. If you are developing new applications to use MapR tables exclusively, use the highest available version of the MapR HBase Client. Installing MapR packages Based on your Cluster Plan for which services to run on which nodes, use the commands in this section to install the appropriate packages for each node. You can use a package manager such as or , which will automatically resolve and install dependency packages, provided that yum apt-get necessary repositories have been set up correctly. Alternatively, you can use or commands to manually install package files that you rpm dpkg have downloaded and extracted to a local directory. Installing from a repository Installing from a repository on Red Hat or CentOS Change to the user (or use for the following command). root sudo Use the command to install the services planned for the node. For example: yum Use the following command to install TaskTracker and MapR-FS yum install mapr-tasktracker mapr-fileserver Use the following command to install CLDB, JobTracker, Webserver, ZooKeeper, Hive, Pig, Mahout, and the MapR HBase client 0.92.2: yum install mapr-cldb mapr-jobtracker mapr-webserver mapr-zookeeper mapr-hive mapr-pig mapr-mahout mapr-hbase-0.92.2.19720.GA Installing from a repository on SUSE Change to the user (or use for the following command). root sudo Use the command to install the services planned for the node. For example: zypper Use the following command to install TaskTracker and MapR-FS zypper install mapr-tasktracker mapr-fileserver Use the following command to install CLDB, JobTracker, Webserver, ZooKeeper, Hive, Pig, Mahout, and the MapR HBase client 0.92.2: 2. 1. 2. 3. 1. 2. 3. 1. 2. zypper install mapr-cldb mapr-jobtracker mapr-webserver mapr-zookeeper mapr-hive mapr-pig mapr-mahout mapr-hbase-0.92.2.19720.GA Installing from a repository on Ubuntu Change to the user (or use for the following commands). root sudo On all nodes, issue the following command to update the Ubuntu package cache: apt-get update Use the command to install the services planned for the node. For example: apt-get install Use the following command to install TaskTracker and MapR-FS apt-get install mapr-tasktracker mapr-fileserver Use the following command to install CLDB, JobTracker, Webserver, ZooKeeper, Hive, Pig, Mahout, and the MapR HBase client 0.92.2: apt-get install mapr-cldb mapr-jobtracker mapr-webserver mapr-zookeeper mapr-hive mapr-pig mapr-mahout mapr-hbase=0.92.2.19720.GA Installing from package files When installing from package files, you must manually pre-install any dependency packages in order for the installation to succeed. Note that most MapR packages depend on the package . Similarly, many Hadoop ecosystem components have internal dependencies, such as mapr-core the package for . See for details. hbase-internal mapr-hbase-regionserver Packages and Dependencies for MapR Software In the commands that follow, replace with the exact version string found in the package filename. For example, for version 3.0.1, <version> substitute with . mapr-core-<version>.x86_64.rpm mapr-core-3.0.1.GA-1.x86_64.rpm Installing from local files on Red Hat, CentOS, or SUSE Change to the user (or use for the following command). root sudo Change the working directory to the location where the package files are located. rpm Use the command to install the appropriate packages for the node. 
Installing from package files
When installing from package files, you must manually pre-install any dependency packages in order for the installation to succeed. Note that most MapR packages depend on the mapr-core package. Similarly, many Hadoop ecosystem components have internal dependencies, such as the mapr-hbase-internal package for mapr-hbase-regionserver. See Packages and Dependencies for MapR Software for details.
In the commands that follow, replace <version> with the exact version string found in the package filename. For example, for version 3.0.1, substitute mapr-core-<version>.x86_64.rpm with mapr-core-3.0.1.GA-1.x86_64.rpm.

Installing from local files on Red Hat, CentOS, or SUSE
1. Change to the root user (or use sudo for the following command).
2. Change the working directory to the location where the rpm package files are located.
3. Use the rpm command to install the appropriate packages for the node. For example:
Use the following command to install TaskTracker and MapR-FS:
rpm -ivh mapr-core-<version>.x86_64.rpm mapr-fileserver-<version>.x86_64.rpm mapr-tasktracker-<version>.x86_64.rpm
Use the following command to install CLDB, JobTracker, Webserver, ZooKeeper, Hive, Pig, and the MapR HBase client:
rpm -ivh mapr-core-<version>.x86_64.rpm mapr-cldb-<version>.x86_64.rpm \
mapr-jobtracker-<version>.x86_64.rpm mapr-webserver-<version>.x86_64.rpm \
mapr-zk-internal-<version>.x86_64.rpm mapr-zookeeper-<version>.x86_64.rpm \
mapr-hive-<version>.noarch.rpm mapr-pig-<version>.noarch.rpm \
mapr-hbase-<version>.noarch.rpm

Installing from local files on Ubuntu
1. Change to the root user (or use sudo for the following command).
2. Change the working directory to the location where the deb package files are located.
3. Use the dpkg command to install the appropriate packages for the node. For example:
Use the following command to install TaskTracker and MapR-FS:
dpkg -i mapr-core_<version>_amd64.deb mapr-fileserver_<version>_amd64.deb mapr-tasktracker_<version>_amd64.deb
Use the following command to install CLDB, JobTracker, Webserver, ZooKeeper, Hive, Pig, and the MapR HBase client:
dpkg -i mapr-core_<version>_amd64.deb mapr-cldb_<version>_amd64.deb \
mapr-jobtracker_<version>_amd64.deb mapr-webserver_<version>_amd64.deb \
mapr-zk-internal_<version>_amd64.deb mapr-zookeeper_<version>_amd64.deb \
mapr-pig-<version>_all.deb mapr-hive-<version>_all.deb \
mapr-hbase-<version>_all.deb

Verify successful installation
To verify that the software has been installed successfully, check the /opt/mapr/roles directory on each node. The software is installed in the /opt/mapr directory, and a file is created in /opt/mapr/roles for every service that installs successfully. Examine this directory to verify installation for the node. For example:
# ls -l /opt/mapr/roles
total 0
-rwxr-xr-x 1 root root 0 Jan 29 17:59 fileserver
-rwxr-xr-x 1 root root 0 Jan 29 17:58 tasktracker
-rwxr-xr-x 1 root root 0 Jan 29 17:58 webserver
-rwxr-xr-x 1 root root 0 Jan 29 17:58 zookeeper

Setting Environment Variables
Set JAVA_HOME in /opt/mapr/conf/env.sh. This variable must be set before you start ZooKeeper or Warden. Set other environment variables for MapR as described in the Environment Variables section.
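For example, if the Oracle JDK is installed under /usr/lib/jvm/java-6-sun (the path used elsewhere in this documentation; adjust it to match your nodes), the relevant line in env.sh would look like this minimal sketch:

# In /opt/mapr/conf/env.sh -- set JAVA_HOME before starting ZooKeeper or Warden.
# The JDK path below is an example; use the path where your JDK is actually installed.
export JAVA_HOME=/usr/lib/jvm/java-6-sun

# Quick check that the JVM is reachable at that path:
$JAVA_HOME/bin/java -version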
Configure the Node with the configure.sh Script
The configure.sh script configures a node to be part of a MapR cluster, or modifies services running on an existing node in the cluster. The script creates (or updates) configuration files related to the cluster and the services running on the node. Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. You can optionally specify the ports for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are:
CLDB – 7222
ZooKeeper – 5181
Configure the node first, then prepare raw disks and partitions with the disksetup command.
If you plan to license your cluster for M7, run the configure.sh script with the -M7 option to apply M7 settings to the node. If the M7 license is applied to the cluster before the nodes are configured with the M7 settings, the system raises the NODE_ALARM_M7_CONFIG_MISMATCH alarm. To clear the alarm, restart the FileServer service on all of the nodes using the instructions on the Services page.
The configure.sh script takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP addresses (and optionally ports), using the following syntax:
/opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z <host>[:<port>][,<host>[:<port>]...] [-L <logfile>][-N <cluster name>]
Each time you specify the -Z <host>[:<port>] option, you must use the same order for the ZooKeeper node list. If you change the order for any node, the ZooKeeper leader election process will fail.
Example:
/opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster

Formatting Disks with the disksetup Script
If mapr-fileserver is installed on this node, use the following procedure to format disks and partitions for use by MapR. This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure, please read Setting Up Disks for MapR. Run the configure.sh script (described above) before running disksetup.
The disksetup script is used to format disks for use by the MapR cluster. Create a text file /tmp/disks.txt listing the disks and partitions for use by MapR on the node. Each line lists either a single disk or all applicable partitions on a single disk. When listing multiple partitions on a line, separate them by spaces. For example:
/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd
Later, when you run disksetup to format the disks, specify the disks.txt file. For example:
/opt/mapr/server/disksetup -F /tmp/disks.txt
The disksetup script removes all data from the specified disks. Make sure you specify the disks correctly, and that any data you wish to keep has been backed up elsewhere. If you are re-using a node that was used previously in another cluster, it is important to format the disks to remove any traces of data from the old cluster.

Next Step
After you have successfully installed MapR software on each node according to your cluster plan, you are ready to bring up the cluster.

MapR Repositories and Package Archives
This page describes the online repositories and archives for MapR software.
rpm and deb Repositories for MapR Core Software
rpm and deb Repositories for Hadoop Ecosystem Tools
Package Archive for All Releases of Hadoop Ecosystem Tools
GitHub Repositories for Source Code
Maven Repositories for Application Developers
Other Scripts and Tools
History of rpm and deb Repository URLs

rpm and deb Repositories for MapR Core Software
MapR hosts rpm and deb repositories for installing the MapR core software using Linux package management tools. For every release of the core MapR software, a repository is created for each supported platform. These platform-specific repositories are hosted at:
http://package.mapr.com/releases/<version>/<platform>
For a list of the repositories for all MapR releases, see the History of rpm and deb Repository URLs section below.
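As an illustration of how these repositories are typically wired into a package manager, the following is a minimal sketch for a Red Hat or CentOS node pointing at the version 3.0.1 repository listed later on this page. The repository id, file name, and gpgcheck setting are choices made for this example rather than requirements; see Preparing Packages and Repositories for the equivalent apt and zypper setup.

# /etc/yum.repos.d/maprtech.repo -- example repository definition (sketch)
[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v3.0.1/redhat/
enabled=1
gpgcheck=0

# Refresh the metadata so the new repository is picked up:
yum clean all && yum makecache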
rpm and deb Repositories for Hadoop Ecosystem Tools
MapR hosts rpm and deb repositories for installing Hadoop ecosystem tools, such as Cascading, Flume, HBase, HCatalog, Hive, Mahout, Oozie, Pig, Sqoop, and Whirr. At any given time, MapR's recommended versions of the ecosystem tools that work with the latest version of the MapR core software are available here. These platform-specific repositories are hosted at:
http://package.mapr.com/releases/ecosystem/<platform>

Package Archive for All Releases of Hadoop Ecosystem Tools
All of MapR's past and present releases of Hadoop ecosystem tools, such as HBase, Hive, and Oozie, are available at:
http://package.mapr.com/releases/ecosystem-all/<platform>
While this is not a repository, rpm and deb files are archived here, and you can download and install them manually.

GitHub Repositories for Source Code
MapR releases the source code for Hadoop ecosystem components to GitHub, including all patches MapR has applied to the components. MapR's repositories on GitHub include Cascading, Flume, HBase, HCatalog, Hive, Mahout, Oozie, Pig, Sqoop, and Whirr. Source code for all releases since March 2013 is available here. For details, see Source Code for MapR Software or browse to http://github.com/mapr.

Maven Repositories for Application Developers
MapR hosts a Maven repository where application developers can download dependencies on MapR software or Hadoop ecosystem components. Maven artifacts for all releases since March 2013 are available here. For details, see Maven Repository and Artifacts for MapR.

Other Scripts and Tools
Other MapR scripts and tools can be found in the following locations:
http://package.mapr.com/scripts/
http://package.mapr.com/tools/

History of rpm and deb Repository URLs
Here is a list of the paths to the repositories for current and past releases of the MapR distribution for Apache Hadoop.
Version 3.0.1
http://package.mapr.com/releases/v3.0.1/mac/ (Mac)
http://package.mapr.com/releases/v3.0.1/redhat/ (CentOS or Red Hat)
http://package.mapr.com/releases/v3.0.1/suse/ (SUSE)
http://package.mapr.com/releases/v3.0.1/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v3.0.1/windows/ (Windows)
Version 2.1.3
http://package.mapr.com/releases/v2.1.3/mac/ (Mac)
http://package.mapr.com/releases/v2.1.3/redhat/ (CentOS or Red Hat)
http://package.mapr.com/releases/v2.1.3/suse/ (SUSE)
http://package.mapr.com/releases/v2.1.3/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v2.1.3/windows/ (Windows)
Version 2.1.2
http://package.mapr.com/releases/v2.1.2/mac/ (Mac)
http://package.mapr.com/releases/v2.1.2/redhat/ (CentOS or Red Hat)
http://package.mapr.com/releases/v2.1.2/suse/ (SUSE)
http://package.mapr.com/releases/v2.1.2/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v2.1.2/windows/ (Windows)
Version 2.1.1
http://package.mapr.com/releases/v2.1.1/mac/ (Mac)
http://package.mapr.com/releases/v2.1.1/redhat/ (CentOS or Red Hat)
http://package.mapr.com/releases/v2.1.1/suse/ (SUSE)
http://package.mapr.com/releases/v2.1.1/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v2.1.1/windows/ (Windows)
Version 2.1
http://package.mapr.com/releases/v2.1.0/mac/ (Mac)
http://package.mapr.com/releases/v2.1.0/redhat/ (CentOS or Red Hat)
http://package.mapr.com/releases/v2.1.0/suse/ (SUSE)
http://package.mapr.com/releases/v2.1.0/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v2.1.0/windows/ (Windows)
Version 2.0.1
http://package.mapr.com/releases/v2.0.1/redhat/ (CentOS or Red Hat)
http://package.mapr.com/releases/v2.0.1/suse/ (SUSE)
http://package.mapr.com/releases/v2.0.1/ubuntu/ (Ubuntu)
Version 2.0.0
http://package.mapr.com/releases/v2.0.0/mac/ (Mac)
http://package.mapr.com/releases/v2.0.0/redhat/ (CentOS or Red Hat)
http://package.mapr.com/releases/v2.0.0/suse/ (SUSE)
http://package.mapr.com/releases/v2.0.0/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v2.0.0/windows/ (Windows)
Version 1.2.10
http://package.mapr.com/releases/v1.2.10/redhat/ (CentOS or Red Hat)
http://package.mapr.com/releases/v1.2.10/suse/ (SUSE)
http://package.mapr.com/releases/v1.2.10/ubuntu/ (Ubuntu)
Version 1.2.9
http://package.mapr.com/releases/v1.2.9/mac/ (Mac)
http://package.mapr.com/releases/v1.2.9/redhat/ (CentOS, Red Hat, or SUSE)
http://package.mapr.com/releases/v1.2.9/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v1.2.9/windows/ (Windows)
Version 1.2.7
http://package.mapr.com/releases/v1.2.7/mac/ (Mac)
http://package.mapr.com/releases/v1.2.7/redhat/ (CentOS, Red Hat, or SUSE)
http://package.mapr.com/releases/v1.2.7/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v1.2.7/windows/ (Windows)
Version 1.2.3
http://package.mapr.com/releases/v1.2.3/mac/ (Mac)
http://package.mapr.com/releases/v1.2.3/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.2.3/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v1.2.3/windows/ (Windows)
Version 1.2.2
http://package.mapr.com/releases/v1.2.2/mac/ (Mac)
http://package.mapr.com/releases/v1.2.2/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.2.2/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v1.2.2/windows/ (Windows)
Version 1.2.0
http://package.mapr.com/releases/v1.2.0/mac/ (Mac)
http://package.mapr.com/releases/v1.2.0/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.2.0/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v1.2.0/windows/ (Windows)
Version 1.1.3
http://package.mapr.com/releases/v1.1.3/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.1.3/ubuntu/ (Ubuntu)
Version 1.1.2 - Internal maintenance release
Version 1.1.1
http://package.mapr.com/releases/v1.1.1/mac/ (Mac client)
http://package.mapr.com/releases/v1.1.1/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.1.1/ubuntu/ (Ubuntu)
Version 1.1.0
http://package.mapr.com/releases/v1.1.0-sp0/mac/ (Mac client)
http://package.mapr.com/releases/v1.1.0-sp0/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.1.0-sp0/ubuntu/ (Ubuntu)
Version 1.0.0
http://package.mapr.com/releases/v1.0.0-sp0/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.0.0-sp0/ubuntu/ (Ubuntu)

Bringing Up the Cluster
The installation of software across a cluster of nodes will go more smoothly if the services have been pre-planned and each node has been validated. Referring to the cluster design developed in Planning the Cluster, ensure that each node has been prepared and meets the minimum requirements described in Preparing Each Node, and that the MapR packages have been installed on each node in accordance with the plan.
Initialization Sequence
Troubleshooting
Installing the Cluster License
Verifying Cluster Status
Adding Volumes
Next Step
Bringing up the cluster involves starting the ZooKeeper service, starting the CLDB service, setting up the administrative user, and installing a MapR license. Once these initial steps are done, the cluster is functional on a limited set of nodes. Not all services are started yet, but you can use the MapR Control System Dashboard, or the MapR Command Line Interface, to examine nodes and activity on the cluster. You can then proceed to start services on all remaining nodes.
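For example, once the first services are up, you can get a quick view of the nodes and the services running on each from the command line. This is a minimal sketch; the hostname and svc column names used with -columns are the short names commonly shown in maprcli examples and should be treated as assumptions, so if your release does not recognize them, run the command without the -columns argument.

# List the nodes the CLDB knows about, with the services running on each
/opt/mapr/bin/maprcli node list -columns hostname,svc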
Initialization Sequence
First, start the ZooKeeper service. It is important that all ZooKeeper instances start up, because the rest of the system cannot start unless a majority (a quorum) of ZooKeeper instances are up and running. Next, start the warden service on each node, or at least on the nodes that host the CLDB and webserver services. The warden service manages all MapR services on the node (except ZooKeeper) and helps coordinate communications. Starting the warden automatically starts the CLDB.

To bring up the cluster:
1. Start ZooKeeper on all nodes where it is installed, by issuing the following command:
service mapr-zookeeper start
2. Verify that the quorum has been successfully established. Issue the following command and make sure that one ZooKeeper is the Leader and the rest are Followers before starting the warden:
service mapr-zookeeper qstatus
3. Start the warden on all nodes where CLDB is installed by issuing the following command:
service mapr-warden start
4. Verify that a CLDB master is running by issuing the maprcli node cldbmaster command. For example:
# maprcli node cldbmaster
cldbmaster
ServerID: 4553404820491236337 HostName: node-36.boston
Do not proceed until a CLDB master is active.
5. Start the warden on all remaining nodes using the following command:
service mapr-warden start
6. Issue the following command to give full permission to the chosen administrative user:
/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

Troubleshooting
Difficulty bringing up the cluster can seem daunting, but most cluster problems are easily resolved. For the latest support tips, visit http://answers.mapr.com.
Can each node connect with the others? For a list of ports that must be open, see Ports Used by MapR.
Is the warden running on each node? On the node, run the following command as root:
$ service mapr-warden status
WARDEN running as process 18732
If the warden service is not running, check the warden log file, /opt/mapr/logs/warden.log, for clues. To restart the warden service:
$ service mapr-warden start
Before continuing, wait 30 to 60 seconds for the warden to start the CLDB service. Calls to maprcli commands may fail if executed before the CLDB has started successfully.
The ZooKeeper service is not running on one or more nodes:
Check the warden log file for errors related to resources, such as low memory
Check the warden log file for errors related to user permissions
Check for DNS and other connectivity issues between ZooKeeper nodes
The MapR CLI program /opt/mapr/bin/maprcli won't run:
Did you configure this node? See Installing MapR Software.
Permission errors appear in the log:
Check that MapR changes to the following files have not been overwritten by automated configuration management tools:
/etc/sudoers (allows the mapr user to invoke commands as root)
/etc/security/limits.conf (allows MapR services to increase limits on resources such as memory, file handles, threads and processes, and maximum priority level)
/etc/udev/rules.d/99-mapr-disk.rules (covers permissions and ownership of raw disk devices)
Before contacting MapR Support, you can collect your cluster's logs using the mapr-support-collect script.
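If nodes cannot reach each other on the required ports, the symptoms above are common. As a quick convenience check (this is a sketch, not a MapR tool), you can probe the default CLDB and ZooKeeper ports named earlier on this page from any node; the host lists below reuse the example node names from the configure.sh example and should be replaced with the nodes from your cluster plan.

#!/bin/bash
# Reachability check for the default CLDB (7222) and ZooKeeper (5181) ports.
CLDB_NODES="r1n1.sj.us r3n1.sj.us r5n1.sj.us"
ZK_NODES="r1n1.sj.us r2n1.sj.us r3n1.sj.us r4n1.sj.us r5n1.sj.us"

for h in $CLDB_NODES; do
  nc -z -w 3 "$h" 7222 && echo "CLDB port open on $h" || echo "CANNOT reach $h:7222"
done
for h in $ZK_NODES; do
  nc -z -w 3 "$h" 5181 && echo "ZooKeeper port open on $h" || echo "CANNOT reach $h:5181"
done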
Installing the Cluster License
MapR Hadoop requires a valid license file, even for the free M3 Community Edition.

Using the web-based MCS to install the license
On a machine that is connected to the cluster and to the Internet, perform the following steps to open the MapR Control System and install the license:
1. In a browser, view the MapR Control System by navigating to the node that is running the MapR Control System:
https://<webserver node>:8443
Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can ignore the warning this time.
2. The first time MapR starts, you must accept the Terms of Use and choose whether to enable the MapR Dial Home service.
3. Log in to the MapR Control System as the administrative user you designated earlier. Until a license is applied, the MapR Control System dashboard might show some nodes in the amber "degraded" state. Don't worry if not all nodes are green and "healthy" at this stage.
4. In the navigation pane of the MapR Control System, expand the System Settings Views group and click Manage Licenses to display the MapR License Management dialog.
5. Click Add Licenses via Web.
6. If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com and follow the instructions there.

Installing a license from the command line
Use the following steps if it is not possible to connect to the cluster and the Internet at the same time.
1. Obtain a valid license file from MapR.
2. Copy the license file to a cluster node.
3. Run the following command to add the license:
maprcli license add [ -cluster <name> ] -license <filename> -is_file true

Verifying Cluster Status
To view cluster status using the web interface:
1. Log in to the MapR Control System.
2. Under the Cluster group in the left pane, click Dashboard.
3. Check the Services pane and make sure each service is running the correct number of instances, according to your cluster plan.
To view cluster status using the command line interface:
1. Log in to a cluster node.
2. Use the following command to list MapR services:
$ maprcli service list
name state logpath displayname
fileserver 0 /opt/mapr/logs/mfs.log FileServer
webserver 0 /opt/mapr/logs/adminuiapp.log WebServer
cldb 0 /opt/mapr/logs/cldb.log CLDB
hoststats 0 /opt/mapr/logs/hoststats.log HostStats
$ maprcli license list
$ maprcli disk list -host <name or IP address>
Next, start the warden on all remaining nodes using one of the following commands:
service mapr-warden start
/etc/init.d/mapr-warden start

Adding Volumes
Referring to the volume plan created in Planning the Cluster, use the MapR Control System or the maprcli command to create and mount distinct volumes to allow more granularity in specifying policy for subsets of data. If you do not set up volumes, and instead store all data in the single volume mounted at /, it creates problems in administering data policy later as data size grows.

Next Step
Now that the MapR Hadoop cluster is up and running, the final installation step is to install Hadoop Ecosystem Components. If you will not install any Hadoop components, see Next Steps After Installation for a list of post-install considerations.

Installing Hadoop Components
The final step in installing a MapR cluster is to install and bring up Hadoop ecosystem components that are not included in the MapR distribution because not every installation requires them.
This section provides information about integrating the following tools with a MapR cluster: Cascading - Installing and using Cascading on a MapR cluster Flume - Installing and using Flume on a MapR cluster HBase - Installing and using HBase on MapR Hive - Installing and using Hive on a MapR cluster, and setting up a MySQL metastore Impala - Installing and using Impala on a MapR cluster Mahout - Environment variable settings needed to run Mahout on MapR MultiTool - A wrapper for Cascading MultiTool Pig - Installing and using Pig on a MapR cluster Oozie - Installing and using Oozie on a MapR cluster Sqoop - Installing and using Sqoop on a MapR cluster Whirr - Using Whirr to manage services on a MapR cluster After installing all the needed components, see for a list of post-install considerations to configure your cluster. Next Steps After Installation MapR works well with Hadoop monitoring tools, such as: Ganglia - Setting up Ganglia monitoring on a MapR cluster Nagios Integration - Generating a Nagios Object Definition file for use with a MapR cluster MapR works with the leaders in the Hadoop ecosystem to provide the most powerful data analysis solutions. For more information about our partners, take a look at the following pages: Datameer HParser Karmasphere Pentaho Cascading Cascading™ is a Java application framework produced by that enables developers Concurrent, Inc. to quickly and easily build rich enterprise-grade Data Processing and Machine Learning applications that can be deployed and managed across 1. 2. 3. 4. 1. 2. 3. 1. 2. 3. 4. 1. 2. private or cloud-based Hadoop clusters. Installing Cascading The following procedures use the operating system package managers to download and install from the MapR Repository. To install the packages manually, refer to Preparing Packages and . Repositories To install Cascading on an Ubuntu cluster: Execute the following commands as or using . root sudo This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the . Installation Guide Update the list of available packages: apt-get update On each planned Cascading node, install : mapr-cascading apt-get install mapr-cascading To install Cascading on a Red Hat or CentOS cluster: Execute the following commands as or using . root sudo This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the . Installation Guide On each planned Cascading node, install : mapr-cascading yum install mapr-cascading Flume Flume is a reliable, distributed service for collecting, aggregating, and moving large amounts of log data, generally delivering the data to a distributed file system such as MapR-FS. For more information about Flume, see the . Apache Flume Incubation Wiki Installing Flume The following procedures use the operating system package managers to download and install from the MapR Repository. If you want to install this component manually from packages files, see . Packages and Dependencies for MapR Software To install Flume on an Ubuntu cluster: Execute the following commands as or using . root sudo This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the . Installation Guide Update the list of available packages: apt-get update On each planned Flume node, install : mapr-flume apt-get install mapr-flume To install Flume on a Red Hat or CentOS cluster: Execute the following commands as or using . root sudo 2. 3. 1. 2. 3. 4. 5. 6. 7. This procedure is to be performed on a MapR cluster. 
If you have not installed MapR, see the . Installation Guide On each planned Flume node, install : mapr-flume yum install mapr-flume Using Flume For information about configuring and using Flume, see the following documents: Flume User Guide Flume Developer Guide HBase HBase is the Hadoop database, which provides random, realtime read/write access to very large data. See for information about using HBase with MapR Installing HBase See for information about compressing HFile storage Setting Up Compression with HBase See for information about using MapReduce with HBase Running MapReduce Jobs with HBase See for HBase tips and tricks HBase Best Practices Installing HBase Plan which nodes should run the HBase Master service, and which nodes should run the HBase RegionServer. At least one node (generally three nodes) should run the HBase Master; for example, install HBase Master on the ZooKeeper nodes. Only a few of the remaining nodes or all of the remaining nodes can run the HBase RegionServer. When you install HBase RegionServer on nodes that also run TaskTracker, reduce the number of map and reduce slots to avoid oversubscribing the machine. The following procedures use the operating system package managers to download and install from the MapR Repository. To install the packages manually, refer to . Preparing Packages and Repositories To install HBase on an Ubuntu cluster: Execute the following commands as or using . root sudo This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the . Installation Guide Update the list of available packages: apt-get update On each planned HBase Master node, install : mapr-hbase-master apt-get install mapr-hbase-master On each planned HBase RegionServer node, install : mapr-hbase-regionserver apt-get install mapr-hbase-regionserver On all HBase nodes, run with a list of the CLDB nodes and ZooKeeper nodes in the cluster. configure.sh The warden picks up the new configuration and automatically starts the new services. When it is convenient, restart the warden: To use Java 7 with HBase, set the value of the attribute in to the location of your Java 7 JVM. JAVA_HOME /opt/mapr/conf/env.sh Note that this change results in all other Hadoop and MapR Java daemons and code using the specified JVM. 7. 1. 2. 3. 4. 5. 1. 2. 3. 4. # service mapr-warden stop # service mapr-warden start To install HBase on a Red Hat or CentOS cluster: Execute the following commands as or using . root sudo On each planned HBase Master node, install : mapr-hbase-master yum install mapr-hbase-master On each planned HBase RegionServer node, install : mapr-hbase-regionserver yum install mapr-hbase-regionserver On all HBase nodes, run the script with a list of the CLDB nodes and ZooKeeper nodes in the cluster. configure.sh The warden picks up the new configuration and automatically starts the new services. When it is convenient, restart the warden: # service mapr-warden stop # service mapr-warden start Installing HBase on a Client To use the HBase shell from a machine outside the cluster, you can install HBase on a computer running the MapR client. For MapR client setup instructions, see . Setting Up the Client Prerequisites: The MapR client must be installed You must know the IP addresses or hostnames of the ZooKeeper nodes on the cluster To install HBase on a client computer: Execute the following commands as or using . 
root sudo On the client computer, install : mapr-hbase-internal CentOS or Red Hat: yum install mapr-hbase-internal Ubuntu: apt-get install mapr-hbase-internal On all HBase nodes, run with a list of the CLDB nodes and ZooKeeper nodes in the cluster. configure.sh Edit , setting the property to include a comma-separated list of the IP addresses or hbase-site.xml hbase.zookeeper.quorum hostnames of the ZooKeeper nodes on the cluster you will be working with. Example: <property> <name>hbase.zookeeper.quorum</name> <value>10.10.25.10,10.10.25.11,10.10.25.13</value> </property> Getting Started with HBase In this tutorial, we'll create an HBase table on the cluster, enter some data, query the table, then clean up the data and exit. HBase tables are organized by column, rather than by row. Furthermore, the columns are organized in groups called . When column families 1. 2. 3. 4. 5. 6. 7. creating an HBase table, you must define the column families before inserting any data. Column families should not be changed often, nor should there be too many of them, so it is important to think carefully about what column families will be useful for your particular data. Each column family, however, can contain a very large number of columns. Columns are named using the format . family:qualifier Unlike columns in a relational database, which reserve empty space for columns with no values, HBase columns simply don't exist for rows where they have no values. This not only saves space, but means that different rows need not have the same columns; you can use whatever columns you need for your data on a per-row basis. Create a table in HBase: Start the HBase shell by typing the following command: /opt/mapr/hbase/hbase-0.94.5/bin/hbase shell Create a table called with one column family named : weblog stats create 'weblog', 'stats' Verify the table creation by listing everything: list Add a test value to the column in the column family for row 1: daily stats put 'weblog', 'row1', 'stats:daily', 'test-daily-value' Add a test value to the column in the column family for row 1: weekly stats put 'weblog', 'row1', 'stats:weekly', 'test-weekly-value' Add a test value to the column in the column family for row 2: weekly stats put 'weblog', 'row2', 'stats:weekly', 'test-weekly-value' Type to display the contents of the table. Sample output: scan 'weblog' ROW COLUMN+CELL row1 column=stats:daily, timestamp=1321296699190, value=test-daily-value row1 column=stats:weekly, timestamp=1321296715892, value=test-weekly-value row2 column=stats:weekly, timestamp=1321296787444, value=test-weekly-value 2 row(s) in 0.0440 seconds 8. 9. 10. 11. 1. 2. 3. 4. Type to display the contents of row 1. Sample output: get 'weblog', 'row1' COLUMN CELL stats:daily timestamp=1321296699190, value=test-daily-value stats:weekly timestamp=1321296715892, value=test-weekly-value 2 row(s) in 0.0330 seconds Type to disable the table. disable 'weblog' Type to drop the table and delete all data. drop 'weblog' Type to exit the HBase shell. exit Setting Up Compression with HBase Using compression with HBase reduces the number of bytes transmitted over the network and stored on disk. These benefits often outweigh the performance cost of compressing the data on every write and uncompressing it on every read. GZip Compression GZip compression is included with most Linux distributions, and works natively with HBase. To use GZip compression, specify it in the per-column family compression flag while creating tables in HBase shell. 
Example: create 'mytable', {NAME=>'colfam:', COMPRESSION=>'gz'} LZO Compression Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm, included in most Linux distributions, that is designed for decompression speed. Setting up LZO compression for use with HBase: Make sure HBase is installed on the nodes where you plan to run it. See and for more Planning the Deployment Installing MapR Software information. On each HBase node, ensure the native LZO base library is installed: On Ubuntu: apt-get install liblzo2-dev liblzo2 On Red Hat or CentOS: yum install lzo-devel lzo Check out the native connector library from http://svn.codespot.com/a/apache-extras.org/hadoop-gpl-compression/ For 0.20.2 check out branches/branch-0.1 svn checkout http://svn.codespot.com/a/apache-extras.org/hadoop-gpl-compression/branches /branch-0.1/ For 0.21 or 0.22 check out trunk svn checkout http://svn.codespot.com/a/apache-extras.org/hadoop-gpl-compression/branches /trunk/ Set the compiler flags and build the native connector library:4. 5. 6. 7. 8. 9. 1. $ export CFLAGS="-m64" $ ant compile-native $ ant jar Create a directory for the native libraries (use TAB completion to fill in the <version> placeholder): mkdir -p /opt/mapr/hbase/hbase-<version>/lib/native/Linux-amd64-64/ Copy the build results into the appropriate HBase directories on every HBase node. Example: $ cp build/native/Linux-amd64-64/lib/libgplcompression.* /opt/mapr/hbase/hbase-<version>/lib/native/Linux-amd64-64/ Download the hadoop-lzo compression library from . https://github.com/twitter/hadoop-lzo Create a symbolic link under to point to /opt/mapr/hbase/hbase-<version>/lib/native/Linux-amd64-64/ On Ubuntu: ln -s /usr/lib/x86_64-linux-gnu/liblzo2.so.2 /opt/mapr/hbase/hbase-<version>/lib/native/Linux-amd64-64/ On Red Hat or CentOS: ln -s /usr/lib64/liblzo2.so.2 /opt/mapr/hbase/hbase-<version>/lib/native/Linux-amd64-64/liblzo2.so.2 Restart the RegionServer: maprcli node services -hbregionserver restart -nodes <hostname> Once LZO is set up, you can specify it in the per-column family compression flag while creating tables in HBase shell. Example: create 'mytable', {NAME=>'colfam:', COMPRESSION=>'lzo'} Snappy Compression The Snappy compression algorithm is optimized for speed over compression. Snappy is not included in the core MapR distribution, and you will have to build the Snappy libraries to use this compression algorithm. Setting up Snappy compression for use with HBase On a node in the cluster, download, build, and install Snappy from : the project page 1. 2. 3. # wget http://snappy.googlecode.com/files/snappy-1.0.5.tar.gz # tar xvf snappy-1.0.5.tar.gz # cd snappy-1.0.5 # ./configure # make # sudo make install Copy the files from to the directory on all nodes /usr/local/lib/libsnappy* <HADOOP_HOME>/lib/native/Linux-amd64-64/ in the cluster. On all nodes that have the TaskTracker service installed, restart the TaskTracker with the maprcli -tasktracker node services command. restart -nodes <list of nodes> Running MapReduce Jobs with HBase To run MapReduce jobs with data stored in HBase, set the environment variable to the output of the c HADOOP_CLASSPATH hbase classpath ommand (use TAB completion to fill in the placeholder): <version> $ export HADOOP_CLASSPATH=`/opt/mapr/hbase/hbase-<version>/bin/hbase classpath` Note the backticks ( ). 
` Example: Exporting a table named t1 with MapReduce Notes: On a node in a MapR cluster, the output directory /hbase/export_t1 will be located in the mapr hadoop filesystem, so to list the output files in the example below use the following hadoop fs command from the node's command line: # hadoop fs -ls /hbase/export_t1 To view the output: # hadoop fs -cat /hbase/export_t1/part-m-00000 # cd /opt/mapr/hadoop/hadoop-0.20.2 # export HADOOP_CLASSPATH=`/opt/mapr/hbase/hbase-0.90.4/bin/hbase classpath` # ./bin/hadoop jar /opt/mapr/hbase/hbase-0.90.4/hbase-0.90.4.jar export t1 /hbase/export_t1 11/09/28 09:35:11 INFO mapreduce.Export: verisons=1, starttime=0, endtime=9223372036854775807 11/09/28 09:35:11 INFO fs.JobTrackerWatcher: Current running JobTracker is: lohit-ubuntu/10.250.1.91:9001 11/09/28 09:35:12 INFO mapred.JobClient: Running job: job_201109280920_0003 11/09/28 09:35:13 INFO mapred.JobClient: map 0% reduce 0% 11/09/28 09:35:19 INFO mapred.JobClient: Job complete: job_201109280920_0003 11/09/28 09:35:19 INFO mapred.JobClient: Counters: 15 11/09/28 09:35:19 INFO mapred.JobClient: Job Counters 11/09/28 09:35:19 INFO mapred.JobClient: Aggregate execution time of mappers(ms)=3259 11/09/28 09:35:19 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 11/09/28 09:35:19 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 11/09/28 09:35:19 INFO mapred.JobClient: Launched map tasks=1 11/09/28 09:35:19 INFO mapred.JobClient: Data-local map tasks=1 11/09/28 09:35:19 INFO mapred.JobClient: Aggregate execution time of reducers(ms)=0 11/09/28 09:35:19 INFO mapred.JobClient: FileSystemCounters 11/09/28 09:35:19 INFO mapred.JobClient: FILE_BYTES_WRITTEN=61319 11/09/28 09:35:19 INFO mapred.JobClient: Map-Reduce Framework 11/09/28 09:35:19 INFO mapred.JobClient: Map input records=5 11/09/28 09:35:19 INFO mapred.JobClient: PHYSICAL_MEMORY_BYTES=107991040 11/09/28 09:35:19 INFO mapred.JobClient: Spilled Records=0 11/09/28 09:35:19 INFO mapred.JobClient: CPU_MILLISECONDS=780 11/09/28 09:35:19 INFO mapred.JobClient: VIRTUAL_MEMORY_BYTES=759836672 11/09/28 09:35:19 INFO mapred.JobClient: Map output records=5 11/09/28 09:35:19 INFO mapred.JobClient: SPLIT_RAW_BYTES=63 11/09/28 09:35:19 INFO mapred.JobClient: GC time elapsed (ms)=35 HBase Best Practices The HBase write-ahead log (WAL) writes many tiny records, and compressing it would cause massive CPU load. Before using HBase, turn off MapR compression for directories in the HBase volume (normally mounted at . Example: /hbase hadoop mfs -setcompression off /hbase You can check whether compression is turned off in a directory or mounted volume by using to list the file contents. hadoop mfs Example: hadoop mfs -ls /hbase The letter in the output indicates compression is turned on; the letter indicates compression is turned off. See for more Z U hadoop mfs information. On any node where you plan to run both HBase and MapReduce, give more memory to the FileServer than to the RegionServer so that the node can handle high throughput. For example, on a node with 24 GB of physical memory, it might be desirable to limit the 1. 2. 3. RegionServer to 4 GB, give 10 GB to MapR-FS, and give the remainder to TaskTracker. To change the memory allocated to each service, edit the file. See for more information. /opt/mapr/conf/warden.conf Tuning Your MapR Install You can start and stop HBase the same as other services on MapR. 
For example, use the following commands to shut down HBase across the cluster: maprcli node services -hbregionserver stop -nodes <list of RegionServer nodes> maprcli node services -hbmaster stop -nodes <list of HBase Master nodes> Hive Apache Hive is a data warehouse system for Hadoop that uses a SQL-like language called Hive Query Language (HQL) to query structured data stored in a distributed filesystem. For more information about Hive, see the . Apache Hive project page On this page: Installing Hive Installing and Configuring HiveServer2 Getting Started with Hive Using Hive with MapR Volumes Default Hive Directories Hive Scratch Directory Hive Warehouse Directory Setting Up Hive with a MySQL Metastore Prerequisites Configuring Hive for MySQL Hive-HBase Integration Install and Configure Hive and HBase Getting Started with Hive-HBase Integration Getting Started with Hive-MapR Tables Integration Zookeeper Connections Installing Hive The following procedures use the operating system package managers to download and install Hive from the MapR Repository. If you want to install this component manually from packages files, see . This procedure is to be performed on Packages and Dependencies for MapR Software a MapR cluster (see the ) or client (see ). Installation Guide Setting Up the Client Make sure the environment variable is set correctly. Example: JAVA_HOME # export JAVA_HOME=/usr/lib/jvm/java-6-sun Make sure the environment variable is set correctly. Example: HIVE_HOME # export HIVE_HOME=/opt/mapr/hive/hive-<version> After Hive is installed, the executable is located at: /opt/mapr/hive/hive-<version>/bin/hive To install Hive on an Ubuntu cluster: Execute the following commands as or using . root sudo Update the list of available packages: apt-get update On each planned Hive node, install : mapr-hive 3. 1. 2. 1. 2. apt-get install mapr-hive To install Hive on a Red Hat or CentOS cluster: Execute the following commands as or using . root sudo On each planned Hive node, install : mapr-hive yum install mapr-hive Installing and Configuring HiveServer2 Getting Started with Hive In this tutorial, you'll create a Hive table, load data from a tab-delimited text file, and run a couple of basic queries against the table. First, make sure you have downloaded the sample table: On the page , select and A Tour of the MapR Virtual Machine Tools > Attachments right-click on , select from the pop-up menu, select a directory to save to, then click OK. If you're sample-table.txt Save Link As... working on the MapR Virtual Machine, we'll be loading the file from the MapR Virtual Machine's local file system (not the cluster storage layer), so save the file in the MapR Home directory (for example, ). /home/mapr Take a look at the source data First, take a look at the contents of the file using the terminal: Make sure you are in the Home directory where you saved (type if you are not sure). sample-table.txt cd ~ Type to display the following output. cat sample-table.txt This procedure installs Hive 0.10.0. To install Hive 0.9.0, use the string apt-get install mapr-hive=0.9.0-<version> . You can determine the available versions with the command. See the Hive 0.10.0 f apt-cache madison mapr-hive release notes or a list of fixes and new features added since the release of Hive 0.9.0. This procedure installs Hive 0.10.0. To install Hive 0.9.0, use the string yum install 'mapr-hive-0.9.0-*' . See the Hive 0.10.0 for a list of fixes and new features added since the release of Hive 0.9.0. 
release notes MapR's release of Hive 0.9.0 includes HiveServer2 as of March 7, 2013, which allows multiple concurrent Hive connections to the Hive server over a network. In addition to the documentation on this page, refer to for details. Using HiveServer2 If you are using HiveServer2, you will use the BeeLine CLI instead of the Hive shell, as shown below. For details on setting up HiveServer2 and starting BeeLine, see . Using HiveServer2 1. 2. 3. 4. mapr@mapr-desktop:~$ cat sample-table.txt 1320352532 1001 http://www.mapr.com/doc http://www.mapr.com 192.168.10.1 1320352533 1002 http://www.mapr.com http://www.example.com 192.168.10.10 1320352546 1001 http://www.mapr.com http://www.mapr.com/doc 192.168.10.1 Notice that the file consists of only three lines, each of which contains a row of data fields separated by the TAB character. The data in the file represents a web log. Create a table in Hive and load the source data: Set the location of the Hive scratch directory by editing the file to add /opt/mapr/hive/hive-<version>/conf/hive-site.xml the following block, replacing with the path to a directory in the user volume: /tmp/mydir <property> <name>hive.exec.scratchdir</name> <value>/tmp/mydir</value> <description>Scratch space for Hive jobs</description> </property> Alternately, use the option in the following step to specify the scratch -hiveconf hive.exec.scratchdir=scratch directory directory's location or use the at the command line. set hive exec.scratchdir=scratch directory Type the following command to start the Hive shell, using tab-completion to expand the : <version> /opt/mapr/hive/hive-0.9.0/bin/hive At the prompt, type the following command to create the table: hive> CREATE TABLE web_log(viewTime INT, userid BIGINT, url STRING, referrer STRING, ip STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; Type the following command to load the data from into the table: sample-table.txt LOAD DATA LOCAL INPATH '/home/mapr/sample-table.txt' INTO TABLE web_log; Run basic queries against the table: Try the simplest query, one that displays all the data in the table: SELECT web_log.* FROM web_log; This query would be inadvisable with a large table, but with the small sample table it returns very quickly. Try a simple SELECT to extract only data that matches a desired string: SELECT web_log.* FROM web_log WHERE web_log.url LIKE '%doc'; This query launches a MapReduce job to filter the data. When the Hive shell starts, it reads an initialization file called which is located in the or directories. You can .hiverc HIVE_HOME/bin/ $HOME/ edit this file to set custom parameters or commands that initialize the Hive command-line environment, one command per line. When you run the Hive shell, you can specify a MySQL initialization script file using the option. Example: -i hive -i <filename> Using Hive with MapR Volumes Before you run a job, set the Hive scratch directory and Hive warehouse directory in the volume where the data for the Hive job resides. same This is the most efficient way to set up the directory structure. If the Hive scratch directory and the Hive warehouse directory are in volum different es, Hive needs to move data across volumes, which is slower than a move within the same volume. In earlier MapR releases (before version 2.1), setting the scratch and warehouse directories in different MapR volumes can cause errors. The following sections provide additional detail on preparing volumes and directories for use with Hive. 
Default Hive Directories It is not necessary to create and the Hive and directories in the MapR cluster. By default, MapR creates chmod /tmp /user/hive/warehouse and configures these directories for you when you create your first Hive table. These default directories are defined in the file: $HIVE_HOME/conf/hive-default.xml <configuration> ... <property> <name>hive.exec.scratchdir</name> <value>/tmp/hive-$\{user.name}</value> <description>Scratch space for Hive jobs</description> </property> <property> <name>hive.metastore.warehouse.dir</name> <value>/user/hive/warehouse</value> <description>location of default database for the warehouse</description> </property> ... </configuration> If you need to modify the default names for one or both of these directories, create a file for this purpose if $HIVE_HOME/conf/hive-site.xml it doesn't already exist. Copy the and/or the property elements from the file and hive.exec.scratchdir hive.metastore.warehouse.dir hive-default.xml paste them into an XML configuration element in the file. Modify the value elements for these directories in the hive-site.xml hive-site.xm file as desired, and then save and close the file and close the file. l hive-site.xml hive-default.xml Hive Scratch Directory When running an import job on data from a MapR volume, set to a directory in the volume where the data for hive.exec.scratchdir same the job resides. The directory should be under the volume's mount point (as viewed in ) – for example, . Volume Properties /tmp You can set this parameter from the Hive shell. Example: hive> set hive.exec.scratchdir=/myvolume/tmp Hive Warehouse Directory When writing queries that move data between tables, make sure the tables are in the volume. By default, all volumes are created under the same path "/user/hive/warehouse" under the root volume. This value is specified by the property , which you can hive.metastore.warehouse.dir set from the Hive shell. Example: hive> set hive.metastore.warehouse.dir=/myvolume/mydirectory Setting Up Hive with a MySQL Metastore The metadata for Hive tables and partitions are stored in the Hive Metastore (for more information, see the ). By Hive project documentation default, the Hive Metastore stores all Hive metadata in an embedded Apache Derby database in MapR-FS. Derby only allows one connection at a time; if you want multiple concurrent Hive sessions, you can use MySQL for the Hive Metastore. You can run the Hive Metastore on any machine that is accessible from Hive. Prerequisites Make sure MySQL is installed on the machine on which you want to run the Metastore, and make sure you are able to connect to the MySQL Server from the Hive machine. You can test this with the following command: mysql -h <hostname> -u <user> The database administrator must create a database for the Hive metastore data, and the username specified in javax.jdo.Connecti must have permissions to access it. The database can be specified using the parameter. The tables and onUser ConnectionURL schemas are created automatically when the metastore is first started. Download and install the driver for the MySQL JDBC connector. 
Example: $ curl -L 'http://www.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.18.tar.g z/from/http://mysql.he.net/|http://mysql.he.net/' | tar xz $ sudo cp mysql-connector-java-5.1.18/mysql-connector-java-5.1.18-bin.jar /opt/mapr/hive/hive-<version>/lib/ Configuring Hive for MySQL Create the file in the Hive configuration directory ( ) with the contents from the hive-site.xml /opt/mapr/hive/hive-<version>/conf example below. Then set the parameters as follows: You can set a specific port for Thrift URIs by adding the command into the file (if export METASTORE_PORT=<port> hive-env.sh h does not exist, create it in the Hive configuration directory). Example: ive-env.sh export METASTORE_PORT=9083 To connect to an existing MySQL metastore, make sure the parameter and the parameters in ConnectionURL Thrift URIs hive-si point to the metastore's host and port. te.xml Once you have the configuration set up, start the Hive Metastore service using the following command (use tab auto-complete to fill in the ): <version> /opt/mapr/hive/hive-<version>/bin/hive --service metastore You can use to run metastore in the background. nohup hive --service metastore Example hive-site.xml <configuration> <property> <name>hive.metastore.local</name> <value>true</value> <description>controls whether to connect to remove metastore server or open a new metastore server in Hive Client JVM</description> </property> <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value> <description>JDBC connect string for a JDBC metastore</description> </property> <property> <name>javax.jdo.option.ConnectionDriverName</name> <value>com.mysql.jdbc.Driver</value> <description>Driver class name for a JDBC metastore</description> </property> <property> <name>javax.jdo.option.ConnectionUserName</name> <value>root</value> <description>username to use against metastore database</description> </property> <property> <name>javax.jdo.option.ConnectionPassword</name> <value><fill in with password></value> <description>password to use against metastore database</description> </property> <property> <name>hive.metastore.uris</name> <value>thrift://localhost:9083</value> </property> </configuration> Hive-HBase Integration You can create HBase tables from Hive that can be accessed by both Hive and HBase. This allows you to run Hive queries on HBase tables. You can also convert existing HBase tables into Hive-HBase tables and run Hive queries on those tables as well. In this section: Install and Configure Hive and HBase Getting Started with Hive-HBase Integration [Getting Started with Hive-MapR tables Integration|Hive#IntegrateMapR Install and Configure Hive and HBase 1. if it is not already installed. Install and configure Hive 2. if it is not already installed. Install and configure HBase 3. Execute the command and ensure that all relevant Hadoop, HBase and Zookeeper processes are running. jps Example: $ jps 21985 HRegionServer 1549 jenkins.war 15051 QuorumPeerMain 30935 Jps 15551 CommandServer 15698 HMaster 15293 JobTracker 15328 TaskTracker 15131 WardenMain Configure the File hive-site.xml 1. Open the file with your favorite editor, or create a file if it doesn't already exist: hive-site.xml hive-site.xml $ cd $HIVE_HOME $ vi conf/hive-site.xml 2. Copy the following XML code and paste it into the file. 
hive-site.xml Note: If you already have an existing file with a element block, just copy the element block code hive-site.xml configuration property below and paste it inside the element block in the file. configuration hive-site.xml Example configuration: <configuration> <property> <name>hive.aux.jars.path</name> <value>file:///opt/mapr/hive/hive-0.10.0/lib/hive-hbase-handler-0.10.0-mapr.jar,file:/ //opt/mapr/hbase/hbase-0.94.5/hbase-0.94.5-mapr.jar,file:///opt/mapr/zookeeper/zookeep er-3.3.6/zookeeper-3.3.6.jar</value> <description>A comma separated list (with no spaces) of the jar files required for Hive-HBase integration</description> </property> <property> <name>hbase.zookeeper.quorum</name> <value>xx.xx.x.xxx,xx.xx.x.xxx,xx.xx.x.xxx</value> <description>A comma separated list (with no spaces) of the IP addresses of all ZooKeeper servers in the cluster.</description> </property> <property> <name>hbase.zookeeper.property.clientPort</name> <value>5181</value> <description>The Zookeeper client port. The MapR default clientPort is 5181.</description> </property> </configuration> 3. Save and close the file. hive-site.xml If you have successfully completed all the steps in this Install and Configure Hive and HBase section, you're ready to begin the Getting Started with Hive-HBase Integration tutorial in the next section. Getting Started with Hive-HBase Integration In this tutorial you will: Create a Hive table Populate the Hive table with data from a text file Query the Hive table Create a Hive-HBase table Introspect the Hive-HBase table from HBase Populate the Hive-Hbase table with data from the Hive table Query the Hive-HBase table from Hive Convert an existing HBase table into a Hive-HBase table Be sure that you have successfully completed all the steps in the Install and Configure Hive and HBase section before beginning this Getting Started tutorial. This Getting Started tutorial closely parallels the section of the Apache Hive Wiki, and thanks to Samuel Guo and other Hive-HBase Integration contributors to that effort. If you are familiar with their approach to Hive-HBase integration, you should be immediately comfortable with this material. However, please note that there are some significant differences in this Getting Started section, especially in regards to configuration and command parameters or the lack thereof. Follow the instructions in this Getting Started tutorial to the letter so you can have an enjoyable and successful experience. Create a Hive table with two columns: Change to your Hive installation directory if you're not already there and start Hive: $ cd $HIVE_HOME $ bin/hive Execute the CREATE TABLE command to create the Hive pokes table: hive> CREATE TABLE pokes (foo INT, bar STRING); To see if the pokes table has been created successfully, execute the SHOW TABLES command: hive> SHOW TABLES; OK pokes Time taken: 0.74 seconds The table appears in the list of tables. pokes Populate the Hive pokes table with data Execute the LOAD DATA LOCAL INPATH command to populate the Hive table with data from the file. pokes kv1.txt The file is provided in the directory. kv1.txt $HIVE_HOME/examples hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes; A message appears confirming that the table was created successfully, and the Hive prompt reappears: Copying data from file: ... 
OK Time taken: 0.278 seconds hive> Execute a SELECT query on the Hive pokes table: hive> SELECT * FROM pokes WHERE foo = 98; The SELECT statement executes, runs a MapReduce job, and prints the job output: OK 98 val_98 98 val_98 Time taken: 18.059 seconds The output of the SELECT command displays two identical rows because there are two identical rows in the Hive table with a key of 98. pokes Note: This is a good illustration of the concept that Hive tables can have multiple identical keys. As we will see shortly, HBase tables cannot have multiple identical keys, only unique keys. To create a Hive-HBase table, enter these four lines of code at the Hive prompt: hive> CREATE TABLE hbase_table_1(key int, value string) > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val") > TBLPROPERTIES ("hbase.table.name" = "xyz"); After a brief delay, a message appears confirming that the table was created successfully: OK Time taken: 5.195 seconds Note: The TBLPROPERTIES command is not required, but those new to Hive-HBase integration may find it easier to understand what's going on if Hive and HBase use different names for the same table. In this example, Hive will recognize this table as "hbase_table_1" and HBase will recognize this table as "xyz". Start the HBase shell: Keeping the Hive terminal session open, start a new terminal session for HBase, then start the HBase shell: $ cd $HBASE_HOME $ bin/hbase shell HBase Shell; enter 'help<RETURN>' for list of supported commands. Type "exit<RETURN>" to leave the HBase Shell Version 0.90.4, rUnknown, Wed Nov 9 17:35:00 PST 2011 hbase(main):001:0> Execute the list command to see a list of HBase tables: hbase(main):001:0> list TABLE xyz 1 row(s) in 0.8260 seconds HBase recognizes the Hive-HBase table named . This is the same table known to Hive as . xyz hbase_table_1 Display the description of the xyz table in the HBase shell: hbase(main):004:0> describe "xyz" DESCRIPTION ENABLED {NAME => 'xyz', FAMILIES => [{NAME => 'cf1', BLOOMFILTER => 'NONE', REPLICATI true ON_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BL OCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]} 1 row(s) in 0.0190 seconds From the Hive prompt, insert data from the Hive table pokes into the Hive-HBase table hbase_table_1: hive> INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=98; ... 2 Rows loaded to hbase_table_1 OK Time taken: 13.384 seconds Query hbase_table_1 to see the data we have inserted into the Hive-HBase table: hive> SELECT * FROM hbase_table_1; OK 98 val_98 Time taken: 0.56 seconds Even though we loaded two rows from the Hive table that had the same key of 98, only one row was actually inserted into pokes hbase_table_ . This is because is an HBASE table, and although Hive tables support duplicate keys, HBase tables only support unique 1 hbase_table_1 keys. HBase tables arbitrarily retain only one key, and will silently discard all the data associated with duplicate keys. Convert a pre-existing HBase table to a Hive-HBase table To convert a pre-existing HBase table to a Hive-HBase table, enter the following four commands at the Hive prompt. Note that in this example the existing HBase table is . 
Convert a pre-existing HBase table to a Hive-HBase table

To convert a pre-existing HBase table to a Hive-HBase table, enter the following four commands at the Hive prompt. Note that in this example the existing HBase table is named my_hbase_table.

hive> CREATE EXTERNAL TABLE hbase_table_2(key int, value string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:val")
    > TBLPROPERTIES("hbase.table.name" = "my_hbase_table");

Now we can run a Hive query against the pre-existing HBase table my_hbase_table, which Hive sees as hbase_table_2:

hive> SELECT * FROM hbase_table_2 WHERE key > 400 AND key < 410;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
...
OK
401 val_401
402 val_402
403 val_403
404 val_404
406 val_406
407 val_407
409 val_409
Time taken: 9.452 seconds

Getting Started with Hive-MapR Tables Integration

MapR tables, introduced in version 3.0 of the MapR distribution for Hadoop, use the native MapR-FS storage layer. A full tutorial on integrating Hive with MapR tables is available at Integrating Hive and MapR Tables.

Zookeeper Connections

If you see the following error message, ensure that hbase.zookeeper.quorum and hbase.zookeeper.property.clientPort are properly defined in the $HIVE_HOME/conf/hive-site.xml file.

Failed with exception java.io.IOException:org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately. This could be a sign that the server has too many connections (30 is the default). Consider inspecting your ZK server logs for that error and then make sure you are reusing HBaseConfiguration as often as you can. See HTable's javadoc for more information.

Impala

Impala on MapR

Impala is a distributed query execution engine that runs against data stored natively in MapR-FS and HBase.

Building Impala

Prerequisites

Installing prerequisite packages

Run the following command to install the prerequisite packages for Impala:

sudo yum install libevent-devel automake libtool flex bison gcc-c++ openssl-devel make cmake doxygen.x86_64 \
python-devel bzip2-devel svn libevent-devel cyrus-sasl-devel wget git unzip rpm-build

Install MapR packages

Impala on MapR requires the mapr-core and mapr-hive packages. Run the following command to install those packages:

sudo yum install mapr-core mapr-hive

Install Boost

Impala requires Boost 1.42 or later.

1. Add the following entry to /etc/yum.repos.d to add the repository:

[jur-linux]
name=Jur Linux
baseurl=http://jur-linux.org/download/el-updates/6.5/x86_64/
gpgcheck=0
enabled=1

2. Install the packages:

sudo yum install libicu-devel chrpath openmpi-devel mpich2-devel

3. Install Python:

sudo yum install python3-devel

4. Download the Boost RPM:

wget ftp://ftp.icm.edu.pl/vol/rzm2/linux-fedora-secondary/development/rawhide/source/SRPMS/b/boost-1.53.0-6.fc19.src.rpm

5. Use rpmbuild to rebuild the Boost RPM:

sudo rpmbuild --rebuild boost-1.53.0-6.fc19.src.rpm

6. Install the rebuilt Boost packages:

sudo rpm -Uvh /root/rpmbuild/RPMS/x86_64/*

7. Make the following change to /usr/include/boost/move/core.hpp (replace the destructor declaration ~rv(); with ~rv() throw();):

class rv
   : public ::boost::move_detail::if_c
      < ::boost::move_detail::is_class_or_union<T>::value
      , T
      , ::boost::move_detail::empty
      >::type
{
   rv();
---   ~rv();
+++   ~rv() throw();
   rv(rv const&);
   void operator=(rv const&);
} BOOST_MOVE_ATTRIBUTE_MAY_ALIAS;

Install LLVM

The low-level virtual machine (LLVM) is a requirement for Impala. Follow these steps to install LLVM:

1. Download the LLVM source code:

wget http://llvm.org/releases/3.2/llvm-3.2.src.tar.gz

2. Extract the downloaded code:

tar xvzf llvm-3.2.src.tar.gz

3. Change to the directory shown below:
/tools cd llvm-3.2.src/tools Check out the project. clang svn co http://llvm.org/svn/llvm-project/cfe/tags/RELEASE_32/final/ clang Change to the directory. /projects Ubuntu 12.04 (and later) requires the package to work with Thrift v0.9 libevent1-dev 5. 6. 7. 8. 9. 1. 2. 3. cd ../projects Check out the project. compiler-rt svn co http://llvm.org/svn/llvm-project/compiler-rt/tags/RELEASE_32/final/ compiler-rt Configure the build. cd .. ./configure --with-pic Build LLVM. make -j4 REQUIRES_RTTI=1 Install LLVM. sudo make install Install the JDK Impala requires version 6 of the Oracle Java Development Kit (JDK). OpenJDK is not compatible with Impala. Verify that is set in JAVA_HOME your environment by issuing the following command: echo $JAVA_HOME Install Maven The Impala installation process uses Maven to manage code dependencies. Use the following steps to install Maven: Download Maven with the following command: wget http://www.fightrice.com/mirrors/apache/maven/maven-3/3.0.5/binaries/apache-maven -3.0.5-bin.tar.gz Unpack the download with the following command: tar xvf apache-maven-3.0.5.tar.gz && sudo mv apache-maven-3.0.5 /usr/local Add the following three lines to your file: .bashrc 3. 4. 5. export M2_HOME=/usr/local/apache-maven-3.0.5 export M2=$M2_HOME/bin export PATH=$M2:$PATH Apply the changes by logging in to a fresh shell or by running the following command: source ~/.bashrc Confirm the installation by running the following command: mvn -version A successful install will return output similar to: Apache Maven 3.0.5... Building Impala Clone the Impala Repository Download the Impala source code using : git git clone https://github.com/mapr/impala Set the Impala Environment Run the script to set up your environment: impala-config.sh cd impala . bin/impala-config.sh Confirm your environment looks correct: # env | grep "IMPALA.*VERSION" IMPALA_AVRO_VERSION=1.7.1-cdh4.2.0 IMPALA_CYRUS_SASL_VERSION=2.1.23 IMPALA_HBASE_VERSION=0.94.9-mapr IMPALA_SNAPPY_VERSION=1.0.5 IMPALA_GTEST_VERSION=1.6.0 IMPALA_GPERFTOOLS_VERSION=2.0 IMPALA_GFLAGS_VERSION=2.0 IMPALA_GLOG_VERSION=0.3.2 IMPALA_HADOOP_VERSION=1.0.3-mapr-3.0.0 IMPALA_HIVE_VERSION=0.11-mapr IMPALA_MONGOOSE_VERSION=3.3 IMPALA_THRIFT_VERSION=0.9.0 Download Required Third-party Packages Run the script to download the third-party packages that Impala uses: download_thirdparty.sh cd thirdparty ./download_thirdparty.sh Build Impala Build the Impala binary with the following command: cd ${IMPALA_HOME} ./build_public.sh -build_thirdparty After Building The binary is in the directory after a successful build. impalad ${IMPALA_HOME}/be/build/release/service You can start the Impala backend by running the following command: ${IMPALA_HOME}/bin/start-impalad.sh -use_statestore=false To configure Impala's use of MapR-FS, HBase, or the Hive metastore, place the path to the relevant configuration files in the variable CLASSPATH that the script establishes. bin/set-classpath.sh The Impala Shell The Impala shell is a convenient command-line interface to Impala. The following command starts the Impala shell: The script sets environment variables that are necessary for Impala to run successfully. start-impalad.sh ${IMPALA_HOME}/bin/impala-shell.sh Mahout Apache Mahout™ is a scalable machine learning library. For more information about Mahout, see the project. 
Apache Mahout On this page: Installing Mahout Configuring the Mahout Environment Getting Started with Mahout Installing Mahout Mahout can be installed when MapR services are initially installed as discussed in . If Mahout wasn't installed during the Installing MapR Services initial MapR services installation, Mahout can be installed at a later date by executing the instructions in this section. These procedures may be performed on a node in a MapR cluster (see the ) or on a client (see ). Installation Guide Setting Up the Client The Mahout installation procedures below use the operating system's package manager to download and install Mahout from the MapR Repository. If you want to install this component manually from packages files, see . Packages and Dependencies for MapR Software Installing Mahout on a MapR Node Mahout only needs to be installed on the nodes in the cluster from which Mahout applications will be executed. So you may only need to install Mahout on one node. However, depending on the number of Mahout users and the number of scheduled Mahout jobs, you may need to install Mahout on more than one node. Mahout applications may run MapReduce programs, and by default Mahout will use the cluster's default JobTracker to execute MapReduce jobs. Install Mahout on a MapR node running Ubuntu Install Mahout on a MapR node running Ubuntu as or using by executing the following command: root sudo apt-get install # apt-get install mapr-mahout Install Mahout on a MapR node running Red Hat or CentOS Install Mahout on a MapR node running Red Hat or CentOS as or using by executing the following command: root sudo yum install # yum install mapr-mahout Installing Mahout on a Client If you install Mahout on a Linux client, you can run Mahout applications from the client that execute MapReduce jobs on the cluster that your client is configured to use. Tip: You don't have to install Mahout on the cluster in order to run Mahout applications from your client. Install Mahout on a client running Ubuntu Install Mahout on a client running Ubuntu as or using by executing the following command: root sudo apt-get install # apt-get install mapr-mahout Install Mahout on a client running Red Hat or CentOS Install Mahout on a client running Red Hat or CentOS as or using by executing the following command: root sudo yum install # yum install mapr-mahout Configuring the Mahout Environment After installation the Mahout executable is located in the following directory: /opt/mapr/mahout/mahout-<version>/bin/mahout Example: /opt/mapr/mahout/mahout-0.7/bin/mahout To use Mahout with MapR, set the following environment variables: MAHOUT_HOME - the path to the Mahout directory. Example: $ export MAHOUT_HOME=/opt/mapr/mahout/mahout-0.7 JAVA_HOME - the path to the Java directory. Example for Ubuntu: $ export JAVA_HOME=/usr/lib/jvm/java-6-sun JAVA_HOME - the path to the Java directory. Example for Red Hat and CentOS:   $ export JAVA_HOME=/usr/java/jdk1.6.0_24 HADOOP_HOME - the path to the Hadoop directory. Example: $ export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2 HADOOP_CONF_DIR - the path to the directory containing Hadoop configuration parameters. Example: $ export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf You can set these environment variables persistently for all users by adding them to the file as or using . The /etc/environment root sudo order of the environment variables in the file doesn't matter. 
Example entries for setting environment variables in the /etc/environment file for Ubuntu:       JAVA_HOME=/usr/lib/jvm/java-6-sun       MAHOUT_HOME=/opt/mapr/mahout/mahout-0.7       HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2       HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf Example entries for setting environment variables in the /etc/environment file for Red Hat and CentOS:       JAVA_HOME=/usr/java/jdk1.6.0_24       MAHOUT_HOME=/opt/mapr/mahout/mahout-0.7       HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2       HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf After adding or editing environment variables to the file, you can activate them without rebooting by executing the c /etc/environment source ommand: $ source /etc/environment Note: A user who doesn't have or permissions can add these environment variable entries to his or her file. The root sudo ~/.bashrc environment variables will be set each time the user logs in. Getting Started with Mahout To see the sample applications bundled with Mahout, execute the following command: 1. 2. $ ls $MAHOUT_HOME/examples/bin To run the Twenty Newsgroups Classification Example, execute the following commands: $ cd $MAHOUT_HOME $ ./examples/bin/classify-20newsgroups.sh The output from this example will look similar to the following: MultiTool The command is the wrapper around Cascading.Multitool, a command line tool for processing large text files and datasets (like sed and grep mt on unix). The command is located in the directory. To use , change to the directory. mt /opt/mapr/contrib/multitool/bin mt multitool Example: cd /opt/mapr/contrib/multitool ./bin/mt Oozie Oozie is a workflow system for Hadoop. Using Oozie, you can set up that execute MapReduce jobs and that manage workflows coordinators workflows. Installing Oozie The following procedures use the operating system package managers to download and install from the MapR Repository. To install the packages manually, refer to . Preparing Packages and Repositories To install Oozie on a MapR cluster: Oozie's client/server architecture requires you to install two packages, and , on the server node. Client mapr-oozie mapr-oozie-internal Oozie nodes require only the role package . mapr-oozie Execute the following commands as or using . root sudo This procedure is to be performed on a MapR cluster with the MapR repository properly set. If you have not installed MapR, see the Instal 2. 3. 4. 5. 6. 7. 1. . lation Guide If you are installing on Ubuntu, update the list of available packages: apt-get update Install and on the Oozie server node: mapr-oozie mapr-oozie-internal RHEL/CentOS: yum install mapr-oozie mapr-oozie-internal SUSE: zypper install mapr-oozie mapr-oozie-internal Ubuntu: apt-get install mapr-oozie mapr-oozie-internal Install on the Oozie client nodes: mapr-oozie RHEL/CentOS: yum install mapr-oozie SUSE: zypper install mapr-oozie Ubuntu: apt-get install mapr-oozie Start the Oozie daemon: service mapr-oozie start The command returns immediately, but it might take a few minutes for Oozie to start. Use the following command to see if Oozie has started: service mapr-oozie status Enabling the Oozie web UI The Oozie web UI can display your job status, logs, and other related information. The file must include the library to enable oozie.war extjs the web UI. After installing Oozie, perform the following steps to add the ExtJS library to your file: oozie.war Download the library. extjs wget http://extjs.com/deploy/ext-2.2.zip 2. 3. 4. 1. 2. 3. 
If Oozie is running, shut it down: service mapr-oozie stop Run the script and specify the path to the file. oozie-setup.sh extjs cd /opt/mapr/oozie/oozie-<version> bin/oozie-setup.sh prepare-war -extjs ~/ext-2.2.zip Start Oozie. Checking the Status of Oozie Once Oozie is installed, you can check the status using the command line or the Oozie web console. To check the status of Oozie using the command line: Use the command: oozie admin /opt/mapr/oozie/oozie-<version>/bin/oozie admin -oozie http://localhost:11000/oozie -status The following output indicates normal operation: System mode: NORMAL To check the status of Oozie using the web console: Point your browser to http://localhost:11000/oozie Examples After verifying the status of Oozie, set up and try the examples, to get familiar with Oozie. To set up the examples and copy them to the cluster: Extract the oozie examples archive : oozie-examples.tar.gz cd /opt/mapr/oozie/oozie-<version> tar xvfz ./oozie-examples.tar.gz Mount the cluster via NFS (See .). Example: Accessing Data with NFS mkdir /mnt/mapr mount localhost:/mapr /mnt/mapr Create a directory for the examples. Example: 3. 4. 5. 1. 2. 3. 1. 2. 3. mkdir /mnt/mapr/my.cluster.com/user/root/examples Copy the Oozie examples from the local directory to the cluster directory. Example: cp -r /opt/mapr/oozie/oozie-3.0.0/examples/* /mnt/mapr/my.cluster.com/user/root/examples/ Set the environment variable so that you don't have to provide the option when you run each job: OOZIE_URL -oozie export OOZIE_URL="http://localhost:11000/oozie" To run the examples: Choose an example and run it with the command. Example: oozie job /opt/mapr/oozie/oozie-<version>/bin/oozie job -config /opt/mapr/oozie/oozie-<version>/examples/apps/map-reduce/job.properties -run Make a note of the returned job ID. Using the job ID, check the status of the job using the command line or the Oozie web console, as shown below. Using the command line, type the following (substituting the job ID for the placeholder): <job id> /opt/mapr/oozie/oozie-<version>/bin/oozie job -info <job id> Using the Oozie web console, point your browser to and click . http://localhost:11000/oozie All Jobs Pig Apache Pig is a platform for parallelized analysis of large data sets via a language called PigLatin. For more information about Pig, see the Pig . project page Once Pig is installed, the executable is located at: /opt/mapr/pig/pig-<version>/bin/pig Make sure the environment variable is set correctly. Example: JAVA_HOME # export JAVA_HOME=/usr/lib/jvm/java-6-sun Installing Pig The following procedures use the operating system package managers to download and install Pig from the MapR Repository. For instructions on setting up the ecosystem repository (which includes Pig), see . Preparing Packages and Repositories If you want to install this component manually from packages files, see . Packages and Dependencies for MapR Software To install Pig on an Ubuntu cluster: Execute the following commands as or using . root sudo This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the . Installation Guide Update the list of available packages: 3. 4. 1. 2. 3. 1. 2. 3. apt-get update On each planned Pig node, install : mapr-pig apt-get install mapr-pig To install Pig on a Red Hat or CentOS cluster: Execute the following commands as or using . root sudo This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the . 
Installation Guide On each planned Pig node, install : mapr-pig yum install mapr-pig Getting Started with Pig In this tutorial, we'll use of Pig to run a MapReduce job that counts the words in the file in the user' version 0.11 /in/constitution.txt mapr s directory on the cluster, and store the results in the file . wordcount.txt First, make sure you have downloaded the file: On the page , select Tools > Attachments and A Tour of the MapR Virtual Machine right-click to save it. constitution.txt Make sure the file is loaded onto the cluster, in the directory . If you are not sure how, look at the tutorial on /user/mapr/in NFS A Tour . of the MapR Virtual Machine Open a Pig shell and get started: In the terminal, type the command to start the Pig shell. pig At the prompt, type the following lines (press ENTER after each): grunt> A = LOAD '/user/mapr/in' USING TextLoader() AS (words:chararray); B = FOREACH A GENERATE FLATTEN(TOKENIZE(*)); C = GROUP B BY $0; D = FOREACH C GENERATE group, COUNT(B); STORE D INTO '/user/mapr/wordcount'; After you type the last line, Pig starts a MapReduce job to count the words in the file . constitution.txt When the MapReduce job is complete, type to exit the Pig shell and take a look at the contents of the directory quit /myvolume/wordc to see the results. ount Sqoop 1. 2. 3. 4. 1. 2. 3. 1. 2. 3. Sqoop transfers data between MapR-FS and relational databases. You can use Sqoop to transfer data from a relational database management system (RDBMS) such as MySQL or Oracle into MapR-FS and use MapReduce on the transferred data. Sqoop can export this transformed data back into an RDBMS. For more information about Sqoop, see the . Apache Sqoop Documentation Installing Sqoop The following procedures use the operating system package managers to download and install from the MapR Repository. If you want to install this component manually from packages files, see . Packages and Dependencies for MapR Software To install Sqoop on an Ubuntu cluster: Execute the following commands as or using . root sudo Perform this procedure on a MapR cluster. If you have not installed MapR, see the . Installation Guide Update the list of available packages: apt-get update On each planned Sqoop node, install : mapr-sqoop apt-get install mapr-sqoop To install Sqoop on a Red Hat or CentOS cluster: Execute the following commands as or using . root sudo Perform this procedure on a MapR cluster. If you have not installed MapR, see the . Installation Guide On each planned Sqoop node, install : mapr-sqoop yum install mapr-sqoop Using Sqoop For information about configuring and using Sqoop, see the following documents: Sqoop User Guide Sqoop Developer's Guide Whirr Apache Whirr™ is a set of libraries for running cloud services. Whirr provides: A cloud-neutral way to run services. You don't have to worry about the idiosyncrasies of each provider. A common service API. The details of provisioning are particular to the service. Smart defaults for services. You can get a properly configured system running quickly, while still being able to override settings as needed. You can also use Whirr as a command line tool for deploying clusters. Installing Whirr The following procedures use the operating system package managers to download and install from the MapR Repository. To install the packages manually, refer to . Preparing Packages and Repositories To install Whirr on an Ubuntu cluster: Execute the following commands as or using . root sudo This procedure is to be performed on a MapR cluster. 
If you have not installed MapR, see the . Installation Guide Update the list of available packages: 3. 4. 1. 2. 3. apt-get update On each planned Whirr node, install : mapr-whirr apt-get install mapr-whirr To install Whirr on a Red Hat or CentOS cluster: Execute the following commands as or using . root sudo This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the . Installation Guide On each planned Whirr node, install : mapr-cascading yum install mapr-whirr Next Steps After Installation After installing the MapR core and any desired Hadoop components, you might need to perform additional steps to ready the cluster for production. Review the topics below for next steps that might apply to your cluster. Setting up the MapR Metrics Database Setting up Topology Setting Up Volumes Setting Up Central Configuration Designating NICs for MapR Setting up MapR NFS Configuring Authentication Configuring Permissions Setting Usage Quotas Configuring alarm notifications Setting up a Client to Access the Cluster Working with Multiple Clusters Setting up the MapR Metrics Database In order to use MapR Metrics you have to set up a MySQL database where metrics data will be logged. For details see Setting up the MapR . Metrics Database Setting up Topology Your node topology describes the locations of nodes and racks in a cluster. The MapR software uses node topology to determine the location of replicated copies of data. Optimally defined cluster topology results in data being replicated to separate racks, providing continued data availability in the event of rack or node failure. For details see . Node Topology Setting Up Volumes A well-structured volume hierarchy is an essential aspect of your cluster's performance. As your cluster grows, keeping your volume hierarchy efficient maximizes your data's availability. Without a volume structure in place, your cluster's performance will be negatively affected. For details see . Managing Data with Volumes Setting Up Central Configuration MapR services can be configured globally across the cluster, from master configuration files stored in a MapR-FS, eliminating the need to edit configuration files on all nodes individually. For details see . Central Configuration Designating NICs for MapR If multiple NICs are present on nodes, you can configure MapR to use one or more of them, depending on the cluster's need for bandwidth. For details on configuring NICs, see . Review for details on provisioning NICs according to data Designating NICs for MapR Planning the Cluster workload. Setting up MapR NFS The MapR NFS service lets you access data on a licensed MapR cluster via the NFS protocol. You can mount the MapR cluster via NFS and use standard shell scripting to read and write live data in the cluster. NFS access to cluster data can be faster than accessing the same data with the commands. For details, see . You also might also be interested in and hadoop fs Setting Up MapR NFS High Availability NFS Setting Up VIPs . for NFS Configuring Authentication If you use Kerberos, LDAP, or another authentication scheme, make sure PAM is configured correctly to give MapR access. See PAM . Configuration Configuring Permissions By default, users are able to log on to the MapR Control System, but do not have permission to perform any actions. You can grant specific permissions to individual users and groups. See . Managing Permissions Setting Usage Quotas You can set specific quotas for individual users and groups. See . 
Managing Quotas Configuring alarm notifications If an alarm is raised on the cluster, MapR sends an email notification. For example, if a volume goes over its allotted quota, MapR raises an alarm and sends email to the volume creator. To configure notification settings, see . Alarms and Notifications To configure email settings see . Configuring Email for Alarm Notifications Setting up a Client to Access the Cluster You can access the cluster either by logging into a node on the cluster, or by installing MapR client software on a machine with access to the cluster's network. For details see . Setting Up the Client Working with Multiple Clusters If you need to access multiple clusters or mirror data between clusters, see . Working with Multiple Clusters Setting Up the Client MapR provides several interfaces for working with a cluster from a client computer: MapR Control System - manage the cluster, including nodes, volumes, users, and alarms Direct Access NFS™ - mount the cluster in a local directory MapR client - work with MapR Hadoop directly Mac OS X Red Hat/CentOS SUSE Ubuntu Windows MapR Control System The MapR Control System allows you control the cluster through a comprehensive graphical user interface. Browser Compatibility The MapR Control System is web-based, and works with the following browsers: Chrome 1. 2. 1. Safari Firefox 3.0 and above Internet Explorer 10 and above Launching MapR Control System To use the MapR Control System (MCS), navigate to the host that is running the WebServer in the cluster. MapR Control System access to the cluster is typically via HTTP on port 8080 or via HTTPS on port 8443; you can specify the protocol and port in the dialog. You Configure HTTP should disable pop-up blockers in your browser to allow MapR to open help links in new browser tabs. The first time you open the MCS via HTTPS from a new browser, the browser alerts you that the security certificate is unrecognized. This is normal behavior for a new connection. Add an exception in your browser to allow the connection to continue. Direct Access NFS™ You can mount a MapR cluster locally as a directory on a Mac, Linux, or Windows computer. Before you begin, make sure you know the hostname and directory of the NFS share you plan to mount. Example: usa-node01:/mapr - for mounting from the command line nfs://usa-node01/mapr - for mounting from the Mac Finder Mounting NFS to MapR-FS on a Cluster Node To mount NFS to MapR-FS on the cluster at the mount point, add the following line to automatically my.cluster.com /mapr /opt/mapr/conf : /mapr_fstab <hostname>:/mapr/my.cluster.com /mapr hard,nolock Every time your system is rebooted, the mount point is automatically reestablished according to the configuration file. mapr_fstab To mount NFS to MapR-FS at the mount point: manually /mapr Set up a mount point for an NFS share. Example: sudo mkdir /mapr Mount the cluster via NFS. Example: sudo mount -o nolock usa-node01:/mapr/my.cluster.com /mapr Mounting NFS on a Linux Client To mount when your system starts up, add an NFS mount to . Example: automatically /etc/fstab # device mountpoint fs-type options dump fsckorder ... usa-node01:/mapr /mapr nfs rw 0 0 ... To mount NFS on a Linux client : manually Make sure the NFS client is installed. Examples:  sudo yum install nfs-utils (Red Hat or CentOS) sudo apt-get install nfs-common (Ubuntu) The change to will not take effect until warden is restarted. 
/opt/mapr/conf/mapr_fstab When you mount manually from the command line, the mount point does persist after a reboot. not 1. 2. 3. 4. 1. 2. 3. 4. 5. 6. sudo zypper install nfs-client (SUSE) List the NFS shares exported on the server. Example: showmount -e usa-node01 Set up a mount point for an NFS share. Example: sudo mkdir /mapr Mount the cluster via NFS. Example: sudo mount -o nolock usa-node01:/mapr /mapr Mounting NFS on a Mac Client To mount the cluster manually from the command line: Open a terminal (one way is to click on Launchpad > Open terminal). At the command line, enter the following command to become the root user: sudo bash List the NFS shares exported on the server. Example: showmount -e usa-node01 Set up a mount point for an NFS share. Example: sudo mkdir /mapr Mount the cluster via NFS. Example: sudo mount -o nolock usa-node01:/mapr /mapr List all mounted filesystems to verify that the cluster is mounted. mount Mounting NFS on a Windows Client Setting up the Windows NFS client requires you to mount the cluster and configure the user ID (UID) and group ID (GID) correctly, as described in the sections below. In all cases, the Windows client must access NFS using a valid UID and GID from the Linux domain. Mismatched UID or GID will result in permissions problems when MapReduce jobs try to access files that were copied from Windows over an NFS share. Mounting the cluster To mount the cluster on Windows 7 Ultimate or Windows 7 Enterprise The mount point does not persist after reboot when you mount manually from the command line. Because of Windows directory caching, there may appear to be no directory in each volume's root directory. To work around .snapshot the problem, force Windows to re-load the volume's root directory by updating its modification time (for example, by creating an empty file or directory in the volume's root directory). With Windows NFS clients, use the option on the NFS server to prevent the Linux NLM from registering with the -o nolock portmapper. The native Linux NLM conflicts with the MapR NFS server. 1. 2. 3. 4. 5. 1. 2. 3. Open . Start > Control Panel > Programs Select . Turn Windows features on or off Select . Services for NFS Click . OK Mount the cluster and map it to a drive using the tool or from the command line. Example: Map Network Drive mount -o nolock usa-node01:/mapr z: To mount the cluster on other Windows versions Download and install (SFU). You only need to install the NFS Client and the User Name Mapping. Microsoft Windows Services for Unix Configure the user authentication in SFU to match the authentication used by the cluster (LDAP or operating system users). You can map local Windows users to cluster Linux users, if desired. Once SFU is installed and configured, mount the cluster and map it to a drive using the tool or from the command Map Network Drive line. Example: mount -o nolock usa-node01:/mapr z: Mapping a network drive To map a network drive with the Map Network Drive tool 1. 2. 3. 4. 5. 6. 7.   Open . Start > My Computer Select . Tools > Map Network Drive In the Map Network Drive window, choose an unused drive letter from the drop-down list. Drive Specify the by browsing for the MapR cluster, or by typing the hostname and directory into the text field. Folder Browse for the MapR cluster or type the name of the folder to map. This name must follow UNC. Alternatively, click the Browse… button to find the correct folder by browsing available network shares. 
Select to reconnect automatically to the MapR cluster whenever you log into the computer. Reconnect at login Click Finish. See for more information. Accessing Data with NFS MapR Client The MapR client lets you interact with MapR Hadoop directly. With the MapR client, you can submit Map/Reduce jobs and run and hadoop fs h commands. The MapR client is compatible with the following operating systems: adoop mfs CentOS 5.5 or above Mac OS X (Intel) Red Hat Enterprise Linux 5.5 or above Ubuntu 9.04 or above SUSE Enterprise 11.1 or above Windows 7 and Windows Server 2008 To configure the client, you will need the cluster name and the IP addresses and ports of the CLDB nodes on the cluster. The configuration script has the following syntax: configure.sh Linux — configure.sh [-N <cluster name>] -c -C <CLDB node>[:<port>][,<CLDB node>[:<port>]...] Do not install the client on a cluster node. It is intended for use on a computer that has no other MapR server software installed. Do not install other MapR server software on a MapR client computer. MapR server software consists of the following packages: mapr-core mapr-tasktracker mapr-fileserver mapr-nfs mapr-jobtracker mapr-webserver To run commands, establish an session to a node in the cluster. MapR CLI ssh 1. 2. 3. 1. 2. 3. Windows — server\configure.bat -c -C <CLDB node>[:<port>][,<CLDB node>[:<port>]...] Linux or Mac Example: /opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222 Windows Example: server\configure.bat -c -C 10.10.100.1:7222 Installing the MapR Client on CentOS or Red Hat The MapR Client supports Red Hat Enterprise Linux 5.5 or above. Remove any previous MapR software. You can use to get a list of installed MapR packages, then type the rpm -qa | grep mapr packages separated by spaces after the command. Example: rpm -e rpm -qa | grep mapr rpm -e mapr-fileserver mapr-core Install the MapR client for your target architecture: yum install mapr-client.i386 yum install mapr-client.x86_64 Run to configure the client, using the (uppercase) option to specify the CLDB nodes, and the (lowercase) option configure.sh -C -c to specify a client configuration. Example: /opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222 Installing the MapR Client on SUSE The MapR Client supports SUSE Enterprise 11.1 or above. Remove any previous MapR software. You can use to get a list of installed MapR packages, then type the rpm -qa | grep mapr packages separated by spaces after the command. Example: zypper rm rpm -qa | grep mapr zypper rm mapr-fileserver mapr-core Install the MapR client: zypper install mapr-client Run to configure the client, using the (uppercase) option to specify the CLDB nodes, and the (lowercase) option configure.sh -C -c to specify a client configuration. Example: 3. 1. 2. 3. 4. 1. 2. 3. 4. 5. 1. 2. 3. 4. 5. 6. /opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222 Installing the MapR Client on Ubuntu The MapR Client supports Ubuntu 9.04 or above. Remove any previous MapR software. You can use to get a list of installed MapR packages, then type the dpkg -list | grep mapr packages separated by spaces after the command. Example: dpkg -r dpkg -l | grep mapr dpkg -r mapr-core mapr-fileserver Update your Ubuntu repositories. Example: apt-get update Install the MapR client: apt-get install mapr-client Run to configure the client, using the (uppercase) option to specify the CLDB nodes, and the (lowercase) option configure.sh -C -c to specify a client configuration. 
Example: /opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222 Installing the MapR Client on Mac OS X The MapR Client supports Mac OS X (Intel). Download the archive http://package.mapr.com/releases/v3.0.1/mac/mapr-client-3.0.1.21771.GA-1.x86_64.tar.gz Open the application. Terminal Create the directory : /opt sudo mkdir -p /opt Extract mapr-client-2.1.2.18401.GA-1.x86_64.tar.gz into the directory. Example: /opt *sudo tar -C /opt -xvf mapr-client-2.1.2.18401.GA-1.x86_64.tar.gz * Run to configure the client, using the (uppercase) option to specify the CLDB nodes, and the (lowercase) option configure.sh -C -c to specify a client configuration. Example: sudo /opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222 Installing the MapR Client on Windows The MapR Client supports Windows 7 and Windows Server 2008. Make sure Java is installed on the computer, and set correctly. JAVA_HOME Open the command line. Create the directory on your drive (or another hard drive of your choosing)--- either use Windows Explorer, or type the \opt\mapr c: following at the command prompt: mkdir c:\opt\mapr Set to the directory you created in the previous step. Example: MAPR_HOME SET MAPR_HOME=c:\opt\mapr Navigate to : MAPR_HOME cd %MAPR_HOME% Download the correct archive into : MAPR_HOME On a 64-bit Windows machine, download http://package.mapr.com/releases/v3.0.1/windows/mapr-client-3.0.1.21771GA-1.amd6 4.zip On a 32-bit Windows machine, download http://package.mapr.com/releases/v3.0.1/windows/mapr-client-3.0.1.21771GA-1.x86.zi 6. 7. 8. p Extract the archive by right-clicking on the file and selecting Extract All... From the command line, run to configure the client, using the (uppercase) option to specify the CLDB nodes, and configure.bat -C the (lowercase) option to specify a client configuration. Example: -c server\configure.bat -c -C 10.10.100.1:7222 On the Windows client, you can run MapReduce jobs using the command the way you would normally use the command. hadoop.bat hadoop For example, to list the contents of a directory, instead of you would type the following: hadoop fs -ls hadoop.bat fs -ls Before running jobs on the Windows client, set the following properties in %MAPR_HOME%\hadoop\hadoop-<version>\conf\core-site.xm on the Windows machine to match the username, user ID, and group ID that have been set up for you on the cluster: l <property> <name>hadoop.spoofed.user.uid</name> <value>{UID}</value> </property> <property> <name>hadoop.spoofed.user.gid</name> <value>{GID}</value> </property> <property> <name>hadoop.spoofed.user.username</name> <value>{id of user who has UID}</value> </property> To determine the correct UID and GID values for your username, log into a cluster node and type the command. In the following example, the id UID is 1000 and the GID is 2000: $ id uid=1000(pconrad) gid=2000(pconrad) groups=4(adm),20(dialout),24(cdrom),46(plugdev),105(lpadmin),119(admin),122(sambashare ),2000(pconrad) Upgrade Guide This guide describes the process of upgrading the software version on a MapR cluster. This page contains: Upgrade Process Overview Upgrade Methods: Offline Upgrade vs. Rolling Upgrade What Gets Upgraded Goals for Upgrade Process Version-Specific Considerations When upgrading from MapR v1.x When upgrading from MapR v2.x Related Topics Throughout this guide we use the terms version to mean the MapR version you are upgrading , and version to mean a later existing from new version you are upgrading . to You must use the values for and , not the text names. 
That is, use the numeric UID and GID values, not the text names.

On the Windows client, because the native Hadoop library is not present, the hadoop fs -getmerge command is not available.

Upgrade Process Overview

The upgrade process proceeds in the following order:
1. Planning the upgrade process – Determine how and when to perform the upgrade.
2. Preparing to upgrade – Prepare the cluster for upgrade while it is still operational.
3. Upgrading MapR packages – Perform steps that upgrade MapR software in a maintenance window.
4. Configuring the new version – Do any final steps to transition the cluster to the new version.

You will spend the bulk of the time for the upgrade process in planning an appropriate upgrade path and then preparing the cluster for upgrade. Once you have established the right path for your needs, the steps to prepare the cluster are straightforward, and the steps to upgrade the software move rapidly and smoothly. Read through all steps in this guide so that you understand the whole process before you begin to upgrade software packages.

This Upgrade Guide does not address the following "upgrade" operations, which are part of day-to-day cluster administration:
Upgrading the license. Paid features can be enabled by simply applying a new license. If you are upgrading from M3, revisit the cluster's service layout to enable High Availability features.
Adding nodes to the cluster. See Adding Nodes to a Cluster.
Adding disk, memory, or network capacity to cluster hardware. See Adding Disks and Preparing Each Node in the Installation Guide.
Adding Hadoop ecosystem components, such as HBase and Hive. See #Related Topics for links to the appropriate component guides.
Upgrading the local OS on a node. This is not recommended while a node is in service.

Upgrade Methods: Offline Upgrade vs. Rolling Upgrade

You can perform either an offline upgrade or a rolling upgrade, and either method has trade-offs. Offline upgrade is the most popular option, taking the least amount of time, but requiring the cluster to go completely offline for maintenance. Rolling upgrade keeps the filesystem online throughout the upgrade process, accepting reads and writes, but extends the duration of the upgrade process. Rolling upgrade cannot be used for clusters running Hadoop ecosystem components such as HBase and Hive. The figures below show the high-level sequence of events for an offline upgrade and a rolling upgrade. (The arrow lengths do not accurately depict the relative time spent in each stage.)

Figure 1. Offline Upgrade
Figure 2. Rolling Upgrade

All methods described in this guide are for in-place upgrade, which means the cluster runs on the same nodes after upgrade as before upgrade. Adding nodes and disks to the cluster is part of the typical life of a production cluster, but does not involve upgrading software. If you plan to add disk, CPU, or network capacity, use standard administration procedures. See Adding Nodes to a Cluster or Adding Disks for details.

You must upgrade all nodes on the cluster at once. The MapReduce layer requires JobTracker and TaskTracker build IDs to match, and therefore software versions must match across all nodes.
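As a quick sanity check before and after the maintenance window, you can confirm that every node reports the same mapr-core build. This is not an official step from this guide; the commands below are standard package-manager queries, and the hosts.txt file and use of the pssh utility are illustrative assumptions.

# Red Hat / CentOS / SUSE nodes
rpm -qa | grep mapr-core
# Ubuntu nodes
dpkg -l | grep mapr-core
# Run the check across all nodes in parallel (hosts.txt is a placeholder list of node names)
pssh -h hosts.txt -i "rpm -qa | grep mapr-core"

Every node should report the identical mapr-core version and build number.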
What Gets Upgraded Upgrading the MapR core upgrades the following aspects of the cluster: Hadoop MapReduce Layer: JobTracker and TaskTracker services Storage Layer: MapR-FS fileserver and Container Location Database (CLDB) services Cluster Management Services: ZooKeeper and Warden NFS server Web server, including the MapR Control System user interface and REST API to cluster services The commands for managing cluster services from a client maprcli Any new features and performance enhancements introduced with the new version. You typically have to enable new features manually after upgrade, which minimizes uncontrolled changes in cluster behavior during upgrade. This guide focuses on upgrading MapR core software packages, not the Hadoop ecosystem components such as HBase, Hive, Pig, etc. Considerations for ecosystem components are raised where appropriate in this guide, because changes to the MapR core can impact other components in the Hadoop ecosystem. For instructions on upgrading ecosystem components, see the documentation for each specific component. See . If you plan to upgrade both the MapR core and Hadoop ecosystem components, MapR recommends #Related Topics upgrading the core first, and ecosystem second. Upgrading the MapR core does not impact data format of other Hadoop components storing data on the cluster. For example, HBase 0.92.2 data and metadata stored on a MapR 2.1 cluster will work as-is after upgrade to MapR 3.0. Components such as HBase and Hive have their own data migration processes when upgrading the component version, but this is independent of the MapR core version. Once cluster services are started with a new major version, the cluster cannot be rolled back to a previous major version, because the new version writes updated data formats to disk which cannot be reverted. For most minor releases and service updates it is possible to downgrade versions (for example, x.2 to x.1). Goals for Upgrade Process Your MapR deployment is unique to your data workload and the needs of your users. Therefore, your upgrade plan will also be unique. By following this guide, you will make an upgrade plan that fits your needs. This guide bases recommendations on the following principles, regardless of your specific upgrade path. Reduce risk Incremental change Frequent verification of success Minimize down time Plan, prepare and practice first. Then execute. You might also aspire to touch each node the fewest possible times, which can be counteractive to the goal of minimizing down-time. Some steps from can be moved into the flow, reducing the number of times you have to access each node, Preparing to Upgrade Upgrading MapR Packages but increasing the node’s down-time during upgrade. Version-Specific Considerations This section lists upgrade considerations that apply to specific versions of MapR software. When upgrading from MapR v1.x Starting with v1.2.8, a change in NFS file format necessitates remounting NFS mounts after upgrade. See NFS incompatible when . upgrading to MapR v1.2.8 or later Hive release 0.7.x, which is included in the MapR v1.x distribution, does not work with MapR core v2.1 and later. If you plan to upgrade to MapR v2.1 or later, you must also upgrade Hive to 0.9.0 or higher, available in MapR's . repository New features are not enabled automatically. You must enable them as described in . Configuring the New Version To enable the cluster to run as a non-root user, you must explicitly switch to non-root usage as described in . 
Configuring the New Version When you are upgrading from MapR v1.x to MapR v2.1.3 or later, run the script after installing the upgrade upgrade2maprexecute packages but before starting the Warden in order to incorporate changes in how MapR interacts with . sudo When upgrading from MapR v2.x If the existing cluster is running as root and you want to transition to a non-root user as part of the upgrade process, perform the steps described in before proceeding with the upgrade. Converting a Cluster from Root to Non-root User For performance reasons, version 2.1.1 of the MapR core made significant changes to the default MapReduce propeties stored in the files and in the directory . core-site.xml mapred-site.xml /opt/mapr/hadoop/hadoop-<version>/conf/ New filesystem features are not enabled automatically. You must enable them as described in . Configuring the New Version If you are using the table features added to MapR-FS in version 3.0, note the following considerations: You need to apply an M7 Edition license. M3 and M5 licenses do not include MapR table features. A MapR HBase client package must be installed in order to access table data in MapR-FS. If the existing cluster is already running Apache HBase, you must upgrade the MapR HBase client to a version that can access tables in MapR-FS. The HBase package named changes to as of the 3.0 release mapr-hbase-internal-<version> mapr-hbase-<version> (May 1, 2013). When you upgrade to MapR v2.1.3 or later from an earlier version of MapR v2, run the /opt/mapr/server/upgrade2maprexecute script after installing the upgrade packages but before starting the Warden in order to incorporate changes in how MapR interacts with su . do Related Topics Relevant topics from the MapR Installation Guide Planning the Cluster Preparing Each Node Upgrade topics for Hadoop Ecosystem Components Working with Cascading Working with Flume Working with HBase Working with HCatalog Working with Hive Working with Mahout When you upgrade from MapR v2.1.3 to v2.1.3.1 or later, run the script on /opt/mapr/server/upgrade2maprexecute each node in the cluster after upgrading the package to set the correct permissions for the binary. mapr-core maprexecute Working with Oozie Working with Pig Working with Sqoop Working with Whirr Planning the Upgrade Process The first stage to a successful upgrade process is to plan it ahead of time. This page helps you map out an upgrade process that fits the needs of your cluster and users. This page contains the following topics: Choosing Upgrade Method Offline Upgrade Rolling Upgrade Scheduling the Upgrade Considering Ecosystem Components Reviewing Service Layout Choosing Upgrade Method Choose the upgrade method and form your upgrade plans based on this choice. MapR provides a method, as well as a Offline Upgrade Rolling method for clusters that meet certain criteria. The method you choose impacts the flow of events while upgrading packages on nodes, Upgrade and also impacts the duration of the maintenance window. See below for more details. Offline Upgrade In general, MapR recommends offline upgrade because the process is simpler than rolling upgrade, and usually completes faster. Offline upgrade is the default upgrade method when other methods cannot be used. During the maintenance window the administrator stops all jobs on the cluster, stops all cluster services, upgrades packages on all nodes (which can be done in parallel), and then brings the cluster back online at once. Figure 1. 
Offline Upgrade Rolling Upgrade Rolling upgrade keeps the filesystem online throughout the upgrade process, which allows for reads and writes for critical data streams. With this method, the administrator runs the script to upgrade software node by node (or, with the utility, in batches of up to 4 rollingupgrade.sh pssh nodes at a time), while the other nodes stay online with active fileservers and TaskTrackers. After all the other nodes have been upgraded, the ro script stages a graceful failover of the cluster's JobTracker to activate it on the upgraded nodes on the cluster. llingupgrade.sh The following restrictions apply to rolling upgrade: Rolling upgrades only upgrade MapR packages, not open source components. The administrator should block off a maintenance window, during which only critical jobs are allowed to run and users expect longer-than-average run times. The cluster’s compute capacity diminishes by 1 to 4 nodes at a time the upgrade, and then recovers to 100% capacity by the end of the maintenance window. Scheduling the Upgrade Plan the optimal time window for the upgrade. Below are factors to consider when scheduling the upgrade: When will preparation steps be performed? How much of the process can be performed before the maintenance window? What calendar time would minimize disruption in terms of workload, access to data, and other stakeholder needs? How many nodes need to be upgraded? How long will the upgrade process take for each node, and for the cluster as a whole? When should the cluster stop accepting new non-critical jobs? When (or will) existing jobs be terminated? How long will it take to clear the pipeline of current workload? Will other Hadoop ecosystem components (such as HBase or Hive) get upgraded during the same maintenance window? When and how will stakeholders be notified? Considering Ecosystem Components If your cluster runs other Hadoop ecosystem components such as HBase or Hive, consider them in your upgrade plan. In most cases upgrading the MapR core does not necessitate upgrading the ecosystem components. For example, the Hive 0.10.0 package which runs on MapR 2.1 can continue running on MapR 3.0. However, there are some specific cases when upgrading the MapR core requires you to also upgrade one or more Hadoop ecosystem components. Below are related considerations: Will you upgrade ecosystem component(s) too? Upgrading ecosystem components is considered a separate process from upgrading the MapR core. If you choose to also upgrade an ecosystem component, you will first upgrade the MapR core, and then proceed to upgrade the ecosystem component. Do you need to upgrade MapR core services? If your goal is to upgrade an ecosystem component, in most cases you do need to not upgrade the MapR core packages. Simply upgrade the component which needs to be upgraded. See . Related Topics Does the new MapR version necessitate a component upgrade? Verify that all installed ecosystem components support the new version of MapR core. See . Related Topics Which ecosystem components need upgrading? Each component constitutes a separate upgrade process. You can upgrade components independently of each other, but you must verify that the resulting version combinations are supported. Can the component upgrade occur without service disruption? In most cases, upgrading an ecosystem component (except for HBase) does not necessitate a maintenance window for the whole cluster. 
Reviewing Service Layout While planning for upgrade, it is a good time to review the layout of services on nodes. Confirm that the service layout still meets the needs of the cluster. For example, as you grow the cluster over time, you typically move toward isolating cluster management services, such as ZooKeeper and CLDB, onto their own nodes. See in the for a review of MapR’s recommendations. For guidance on moving services, see the Service Layout in a Cluster Installation Guide following topics: Managing Services on a Node Isolating ZooKeeper Nodes Isolating CLDB Nodes Preparing to Upgrade After you have , you are ready to prepare the cluster for upgrade. This page contains action steps you can perform planned your upgrade process now, while your existing cluster is fully operational. This page contains the following topics: 1. Verify System Requirements for All Nodes 2. Prepare Packages and Repositories for Upgrade 3. Stage Configuration Files 4. Perform Version-Specific Steps 5. Design Health Checks 6. Verify Cluster Health 7. Backup Critical Data 8. Move JobTrackers off of CLDB nodes (Rolling Upgrade Only) 9. Run Your Upgrade Plan on a Test Cluster The goal of performing these steps early is to minimize the number of operations within the maintenance window, which reduces downtime and eliminates unnecessary risk. It is possible to move some of these steps into the flow, which will reduce the number of Upgrading MapR Packages times you have to touch each node, but increase down-time during upgrade. Design your upgrade flow according to your needs. 1. Verify System Requirements for All Nodes Verify that all nodes meet the minimum requirements for the new version of MapR software. Check: Software dependencies. Packages dependencies in the MapR distribution can change from version to version. If the new version of MapR has dependencies that were not present in the older version, you must address them on all nodes before upgrading MapR software. Installing dependency packages can be done while the cluster is operational. See Packages and Dependencies for MapR . If you are using a package manager, you can specify a repository that contains the dependency package(s), and allow the Software package manager to automatically install them when you upgrade the MapR packages. If you are installing from package files, you must pre-install dependencies on all nodes manually. Hardware requirements. The newer version of packages might have greater hardware requirements. Hardware requirements must be met before upgrading. See in the . Preparing Each Node Installation Guide OS requirements. MapR’s OS requirements do not change frequently. If the OS on a node doesn’t meet the requirements for the newer version of MapR, plan to decommission the node and re-deploy it with updated OS after the upgrade. For , make sure the node from which you start the upgrade process has keyless ssh access as the root user scripted rolling upgrades to all other nodes in the cluster. To upgrade nodes in parallel, to a maximum of 4, the utility must be present or available in a pssh repository accessible to the node running the upgrade script. 2. Prepare Packages and Repositories for Upgrade When upgrading you can install packages from: MapR’s Internet repository A local repository Individual package files. Prepare the repositories or package files on every node, according to your chosen installation method. See Preparing Packages and Repositories 1. 2. 3. 4. 5. in the . 
If keyless SSH is set up for the root user, you can prepare the repositories or package files on a single node instead. Installation Guide When setting up a repository for the new version, leave in place the repository for the existing version because you might still need it as you prepare to upgrade. 2a. Update Repository Cache If you plan to install from a repository, update the repository cache on all nodes. On RedHat and CentOS # yum clean all On Ubuntu # apt-get update On SUSE # zypper refresh 3. Stage Configuration Files You probably want to re-apply existing configuration customizations after upgrading to the new version of MapR software. New versions commonly introduce changes to configuration properties. It is common for new properties to be introduced and for the default values of existing properties to change. This is true for the MapReduce layer, the storage layer, and all other aspects of cluster behavior. This section guides you through the steps to stage configuration files for the new version, so they are ready to be applied as soon as you perform the upgrade. Active configuration files for the current version of the MapR core are in the following locations: /opt/mapr/conf/ /opt/mapr/hadoop/hadoop-<version>/conf/ When you install or upgrade MapR software, fresh configuration files containing default values are installed to parallel directories /opt/mapr/co and . Configuration files in these directories are not active unless you nf.new /opt/mapr/hadoop/hadoop-<version>/conf.new .new copy them to the active directory. conf If your existing cluster uses default configuration properties only, then you might choose to use the defaults for the new version as well. In this case, you do not need to prepare configuration files, because you can simply copy to after upgrading a node to use the new conf.new conf version's defaults. If you want to propagate customizations in your existing cluster to the new version, you will need to find your configuration changes and apply them to the new version. Below are guidelines to stage configuration files for the new version. Install the existing version of MapR on a test node to get the default configurations files. You will find the files in the /opt/mapr/conf.n and directories. ew /opt/mapr/hadoop/hadoop-<version>/conf.new For each node, diff your existing configuration files with the defaults to produce a list of changes and customizations. Install the new version of MapR on a test node to get the default configuration files. For each node, merge changes in the existing version into the new version’s configuration files. Copy the merged configuration files to a staging directory, such as . You will use these files when /opt/mapr/conf.staging/ upgrading packages on each node in the cluster. Figure 1. Staging Configuration Files for the New Version The procedure does not work on clusters running SUSE. Scripted Rolling Upgrade 1. 2. 3. Note that the Central Configuration feature, which is enabled by default in MapR version 2.1 and later, automatically updates configuration files. If you choose to enable Centralized Configuration as part of your upgrade process, it could overwrite manual changes you've made to configuration files. See and for more details. Central Configuration Configuring the New Version 4. Perform Version-Specific Steps This section contains version-specific preparation steps. 
If you are skipping over a major version (for example, upgrading from 1.2.9 to 3.0), perform the preparation steps for the skipped version(s) as well (in this case, 2.x).

Upgrading from Version 1.x

4a. Set TCP Retries

On each node, set the number of TCP retries to 5 so that the cluster detects unreachable nodes earlier. This also benefits the rolling upgrade process by reducing the graceful failover time for TaskTrackers and JobTrackers.
1. Edit the /etc/sysctl.conf file and add the following line:
net.ipv4.tcp_retries2=5
2. Save the file and run sysctl -p to refresh system settings. For example:
# sysctl -p
...lines removed...
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_retries2 = 5
3. Ensure that the setting has taken effect. Issue the following command, and verify that the output is 5:
# cat /proc/sys/net/ipv4/tcp_retries2
5

4b. Create non-root user and group for MapR services

If you plan for MapR services to run as non-root after upgrading, create a new "mapr user" and group on every node. The mapr user is the user that runs MapR services instead of root. For example, the following commands create a new group and new user, both called mapr, and then set a password. You do not have to use 1001 for the uid and gid, but the values must be consistent across all nodes. The username is typically mapr or hadoop, but can be any valid login.
# groupadd --gid 1001 mapr
# useradd --uid 1001 --gid mapr --create-home mapr
# passwd mapr
To test that the mapr user has been created, switch to the new user with su mapr. Verify that a home directory has been created (usually /home/mapr) and that the mapr user has read-write access to it. The mapr user must have write access to the /tmp directory, or the warden will fail to start services. Later, after MapR software has been upgraded on all nodes, you must perform additional steps to enable cluster services to run as the mapr user.

Upgrading from Version 2.x

4c. Obtain license for new v3.x features

If you are upgrading to gain access to the native table features available in v3.x, you must obtain an M7 license, which enables table storage. Log in at mapr.com and go to the My Clusters area to manage your license.

5. Design Health Checks

Plan what kind of test jobs and scripts you will use to verify cluster health as part of the upgrade process. You will verify cluster health several times before, during, and after upgrade to ensure success at every step and to isolate issues whenever they occur. Create both simple tests that verify cluster services start and respond, and non-trivial tests that verify workload-specific aspects of your cluster.

5a. Design Simple Tests

Examples of simple tests:
Check node health using maprcli commands to verify that no unexpected alarms exist and that services are running where they are expected to be. For example:
# maprcli node list -columns svc
service hostname ip
tasktracker,cldb,fileserver,hoststats centos55 10.10.82.55
tasktracker,hbregionserver,fileserver,hoststats centos56 10.10.82.56
fileserver,tasktracker,hbregionserver,hoststats centos57 10.10.82.57
fileserver,tasktracker,hbregionserver,webserver,hoststats centos58 10.10.82.58
...lines deleted...
# maprcli alarm list
alarm state description entity alarm name alarm statechange time
1 One or more licenses is about to expire within 25 days CLUSTER CLUSTER_ALARM_LICENSE_NEAR_EXPIRATION 1366142919009
1 Can not determine if service: nfs is running.
Check logs at: /opt/mapr/logs/nfsserver.log centos58 NODE_ALARM_SERVICE_NFS_DOWN 1366194786905 In this example you can see that an alarm is raised indicating that MapR is expecting an NFS server to be running on node , centos58 and the of running services confirms that the service is not running on this node. node list nfs Batch create a set of test files. Submit a MapReduce job. Run simple checks on installed Hadoop ecosystem components. For example: Make a Hive query. Do a put and get from Hbase. Run to verify consistency of the HBase datastore. Address any issues that are found. hbase hbck 5b. Design Non-trivial Tests Appropriate non-trivial tests will be specific to your particular cluster’s workload. You may have to work with users to define an appropriate set of tests. Run tests on the existing cluster to calibrate expectations for “healthy” task and job durations. On future iterations of the tests, inspect results for deviations. Some examples: Run performance benchmarks relevant the cluster’s typical workload. Run a suite of common jobs. Inspect for correct results and deviation from expected completion times. Test correct inter-operation of all components in the Hadoop stack and third-party tools. Confirm integrity of critical data stored on cluster. 6. Verify Cluster Health Verify cluster health before beginning the upgrade process. Proceed with the upgrade only if the cluster is in an expected, healthy state. Otherwise, if cluster health does not check out after upgrade, you can’t isolate the cause to be related to the upgrade. 6a. Run Simple Health Checks Run the suite of simple tests to verify that basic features of the MapR core are functioning correctly, and that any alarms are known and accounted for. 6b. Run Non-trivial Health Checks Run your suite of non-trivial tests to verify that the cluster is running as expected for typical workload, including integration with Hadoop ecosystem components and third-party tools. 1. 2. a. 7. Backup Critical Data Data in the MapR cluster persists across upgrades from version to version. However, as a precaution you might want to backup critical data before upgrading. If you deem it practical and necessary, you can do any of the following: Copy data out of the cluster using to a separate, non-Hadoop datastore. distcp Mirror critical volume(s) into a separate MapR cluster, creating a read-only copy of the data which can be accessed via the other cluster. When services for the new version are activated, MapR-FS will update data on disk automatically. The migration is transparent to users and administrators. Once the cluster is active with the new version, you typically cannot roll back. The data format for the MapR filesystem changes between major releases (for example, 2.x to 3.x). For some (but not all) minor releases and service updates (for example, x.1 to x.2, or y.z.1 to y.z.2), it is possible to revert versions. 8. Move JobTrackers off of CLDB nodes (Rolling Upgrade Only) For the manual rolling upgrade process, JobTracker and CLDB services cannot co-exist on the same node. This restriction does not apply to the offline upgrade process. If necessary, move JobTracker services to non-CLDB nodes. You may need to to record this revisit your service layout change in design. Below are steps to remove the JobTracker role from CLDB nodes, and add it to other nodes. If the active JobTracker is among the JobTrackers that need to move, move it . 
In this case, removing the active JobTracker will cause the last cluster to failover and activate a standby JobTracker. Partially-completed MapReduce jobs in progress will resume when the new JobTracker comes online, typically within seconds. If this is an unacceptable disruption of service on your active cluster, you can perform these steps during the upgrade maintenance window. Determine where JobTracker, CDLB and ZooKeeper are installed, and where JobTracker is running, by executing the maprcli node and commands. For the , the option lists where a service is installed to run (but list maprcli node listcldbzks node list csvc might not currently be running), and the option lists where a service is running. Note in the list which node is running the svc actively svc active JobTracker. # maprcli node list -columns svc,csvc # maprcli node listcldbzks The command is not available prior to MapR version 2.0. node listcldbzks For each JobTracker on a CLDB node, use the commands below to add one replacement JobTracker on a non-CLDB node. Install the JobTracker package. Substitute with the specific version. <version> On RedHat and CentOS # yum install mapr-jobtracker-<version> On Ubuntu # apt-get install mapr-jobtracker=<version> On SUSE # zypper install mapr-jobtracker-<version> Install the version of MapR when moving the JobTrackers, because that is the active version on the cluster at existing this stage. Explicitly specify a version number when installing to make sure you don't accidentally install the newer version. Alternatively, you can temporarily disable the repository for the new version. 2. b. 3. a. b. c. 1. Run the script to remove the node role. configure.sh # /opt/mapr/server/configure.sh -R A successful result will produce output like the following, showing that is configured on this node: jobtracker # /opt/mapr/server/configure.sh -R Node setup configuration: fileserver jobtracker webserver Log can be found at: /opt/mapr/logs/configure.log Remove the JobTracker from any CLDB node(s) where it is installed. If you have to remove the active JobTracker, remove it . last If the node is running the active JobTracker, stop the service. # maprcli node services -nodes <JobTracker node> -jobtracker stop Remove the package. mapr-jobtracker On RedHat and CentOS # yum remove mapr-jobtracker On Ubuntu # apt-get purge mapr-jobtracker On SUSE # zypper remove mapr-jobtracker Run the script so the cluster recognizes the changed roles on the node. Confirm that is no longer configure.sh jobtracker configured on the node. # /opt/mapr/server/configure.sh -R Node setup configuration: cldb fileserver tasktracker zookeeper Log can be found at: /opt/mapr/logs/configure.log 9. Run Your Upgrade Plan on a Test Cluster Before executing your upgrade plan on the production cluster, perform a complete "dry run" on a test cluster. You can perform the dry run on a smaller cluster than the production cluster, but make the dry run as similar to the real-world circumstances as possible. For example, install all Hadoop ecosystem components that are in use in production, and replicate data and jobs from the production cluster on the test cluster. The goals for the dry run are: 1. 2. Eliminate surprises. Get familiar with all upgrade operations you will perform as you upgrade the production cluster. Uncover any upgrade-related issues as early as possible so you can accommodate them in your upgrade plan. 
Look for issues in the upgrade process itself, as well as operational and integration issues that could arise after the upgrade. When you have successfully run your upgrade plan on a test cluster, you are ready for . Upgrading MapR Packages Upgrading MapR Packages After you have and performed all , you are ready to upgrade the MapR packages on all nodes in planned your upgrade process preparation steps the cluster. The upgrade process differs depending on whether you are performing offline upgrade or rolling upgrade. Choose your planned installation flow: Offline Upgrade Rolling Upgrade Scripted Rolling Upgrade To complete the upgrade process and end the maintenance window, you need to perform additional cluster configuration steps described in Confi . guring the New Version Offline Upgrade The package upgrade process for the offline upgrade follows the sequence below. 1. Halt Jobs 2. Stop Cluster Services 2a. Disconnect NFS Mounts and Stop NFS Server 2b. Stop Hive and Apache HBase Services 2c. Stop MapR Core Services 3. Upgrade Packages and Configuration Files 3a. Upgrade or Install HBase Client for MapR Tables 3b. Run upgrade2maprexecute 4. Restart Cluster Services 4a. Restart MapR Core Services 4b. Run Simple Health Check 4c. Set the New Cluster Version 4d. Restart Hive and Apache HBase Services 1. 2. 3. 1. 2. 3. 1. 5. Verify Success on Each Node Perform these steps on all nodes in the cluster. For larger clusters, these steps are commonly performed on all nodes in parallel using scripts and/or remote management tools. 1. Halt Jobs As defined by your upgrade plan, halt activity on the cluster in the following sequence before you begin upgrading packages: Notify stakeholders. Stop accepting new jobs. At some later point, terminate any running jobs. The following commands can be used to terminate MapReduce jobs, and you might also need specific commands to terminate custom applications. # hadoop job -list # hadoop job -kill <job-id> # hadoop job -kill-task <task-id> At this point the cluster is ready for maintenance but still operational. The goal is to perform the upgrade and get back to normal operation as safely and quickly as possible. 2. Stop Cluster Services The following sequence will stop cluster services gracefully. When you are done, the cluster will be offline. The commands used in this maprcli section can be executed on any node in the cluster. 2a. Disconnect NFS Mounts and Stop NFS Server Use the steps below to stop the NFS service. Unmount the MapR NFS share from all clients connected to it, including other nodes in the cluster. This allows all processes accessing the cluster via NFS to disconnect gracefully. Assuming the cluster is mounted at : /mapr # umount /mapr Stop the NFS service on all nodes where it is running: # maprcli node services -nodes <list of nodes> -nfs stop Verify that the MapR NFS server is not running on any node. Run the following command and confirm that is not included on any nfs node. # maprcli node list -columns svc | grep nfs 2b. Stop Hive and Apache HBase Services For nodes running Hive or Apache HBase, stop these services so they don’t hit an exception when the filesystem goes offline. Stop the services in this order: HiveServer - The HiveServer runs as a Java process on a node. You can use to find if HiveServer is running on a node, and jps -m 1. 2. a. b. 3. a. b. use to stop it. 
For example: kill -9 # jps -m 16704 RunJar /opt/mapr/hive/hive-0.10.0/lib/hive-service-0.10.0.jar org.apache.hadoop.hive.service.HiveServer 32727 WardenMain /opt/mapr/conf/warden.conf 2508 TaskTracker 17993 Jps -m # kill -9 16704 HBase Master - For all nodes running the HBase Master service, stop HBase services. By stopping the HBase Master first, it won’t detect individual regionservers stopping later, and therefore won’t trigger any fail-over responses. Use the following commands to find nodes running the HBase Master service and to stop it. # maprcli node list -columns svc # maprcli node services -nodes <list of nodes> -hbmaster stop You can the HBase master log file on nodes running the HBase master to track shutdown progress, as shown in the tail example below. The in the log filename will match the cluster's MapR user which runs services. mapr # tail /opt/mapr/hbase/hbase-0.92.2/logs/hbase-mapr-master-centos55.log ...lines removed... 2013-04-15 08:10:53,277 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing because balanced cluster; servers=3 regions=3 average=1.0 mostloaded=2 leastloaded=0 Mon Apr 15 08:14:14 PDT 2013 Killing master HBase regionservers - Soon after stopping the HBase Master, stop the HBase regionservers on all nodes. Use the following commands to find nodes running the HBase Regionserver service and to stop it. It can take a regionserver several minutes to shut down, depending on the cleanup tasks it has to do. # maprcli node list -columns svc # maprcli node services -nodes <list of nodes> -hbregionserver stop You can the regionserver log file on nodes running the HBase regionserver to track shutdown progress, as shown in the tail example below. 3. b. 1. 2. 3. 4. # tail /opt/mapr/hbase/hbase-0.92.2/logs/hbase-mapr-regionserver-centos58.log ...lines removed... 2013-04-15 08:15:16,583 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server centos58,60020,1366023348995; zookeeper connection closed. 2013-04-15 08:15:16,584 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting 2013-04-15 08:15:16,584 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook thread. 2013-04-15 08:15:16,585 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished. If a regionserver's log show no progress and the process does not terminate, you might have to kill it manually. For example: # kill -9 `cat /opt/mapr/logs/hbase-mapr-regionserver.pid` 2c. Stop MapR Core Services Stop MapR core services in the following sequence. Note where CLDB and ZooKeeper services are installed, if you do not already know. # maprcli node list -columns hostname,csvc centos55 tasktracker,hbmaster,cldb,fileserver,hoststats 10.10.82.55 centos56 tasktracker,hbregionserver,cldb,fileserver,hoststats 10.10.82.56 ...more nodes... centos98 fileserver,zookeeper 10.10.82.98 centos99 fileserver,webserver,zookeeper 10.10.82.99 Stop the warden on all nodes with CLDB installed: # service mapr-warden stop stopping WARDEN looking to stop mapr-core processes not started by warden Stop the warden on all remaining nodes: # service mapr-warden stop stopping WARDEN looking to stop mapr-core processes not started by warden Stop the ZooKeeper on all nodes where it is installed: 4. 1. 2. # service mapr-zookeeper stop JMX enabled by default Using config: /opt/mapr/zookeeper/zookeeper-3.3.6/conf/zoo.cfg Stopping zookeeper ... STOPPED At this point the cluster is completely offline. 
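If you want to double-check that nothing is still running before you begin upgrading packages, a quick pass over the nodes can help. The loop below is a sketch only; it assumes keyless SSH as the root user and a hypothetical nodes.txt file listing one cluster hostname per line, neither of which is required by the documented procedure.

# Sketch: confirm that no MapR processes remain on any node.
for h in $(cat nodes.txt); do
    echo "=== $h ==="
    ssh root@$h 'ps -ef | grep -E "warden|mfs|cldb|zookeeper" | grep -v grep || echo "no MapR processes"'
done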
commands will not work, and the browser-based MapR Control System will be unavailable. maprcli 3. Upgrade Packages and Configuration Files Perform the following steps to upgrade the MapR core packages on every node. Use the following command to determine which packages are installed on the node: On RedHat and CentOS # yum list installed 'mapr-*' On Ubuntu dpkg --list 'mapr-*' On SUSE zypper se -i mapr Upgrade the following packages on all nodes where they exist: mapr-cldb mapr-core mapr-fileserver mapr-hbase-<version> - You must specify a version that matches the version of HBase API used by your applications. See #3a. for details. Upgrade or Install HBase Client for MapR Tables mapr-jobtracker mapr-metrics mapr-nfs mapr-tasktracker mapr-webserver mapr-zookeeper mapr-zk-internal On RedHat and CentOS # yum update mapr-cldb mapr-core mapr-fileserver mapr-hbase-<version> mapr-jobtracker mapr-metrics mapr-nfs mapr-tasktracker mapr-webserver mapr-zookeeper mapr-zk-internal Do not use a wildcard such as " " to upgrade all MapR packages, which could erroneously include Hadoop ecosystem mapr-* components such as and . mapr-hive mapr-pig 1. 2. On Ubuntu # apt-get install mapr-cldb mapr-core mapr-fileserver mapr-hbase=<version> mapr-jobtracker mapr-metrics mapr-nfs mapr-tasktracker mapr-webserver mapr-zookeeper mapr-zk-internal On SUSE # zypper update mapr-cldb mapr-core mapr-fileserver mapr-jobtracker mapr-metrics mapr-nfs mapr-tasktracker mapr-webserver mapr-zookeeper mapr-zk-internal Verify that packages installed successfully on all nodes. Confirm that there were no errors during installation, and check that /opt/mapr contains the expected value. For example: /MapRBuildVersion # cat /opt/mapr/MapRBuildVersion 2.1.2.18401.GA Copy the staged configuration files for the new version to , if you created them as part of . /opt/mapr/conf Preparing to Upgrade 3a. Upgrade or Install HBase Client for MapR Tables If you are upgrading from a pre-3.0 version of MapR and you will use MapR tables, you have to install (or upgrade) the MapR HBase client. If you are upgrading the Apache HBase component as part of your overall upgrade plan, then the MapR HBase client will get upgraded as part of that process. See . Upgrading HBase All nodes that access table data in Map-FS must have the MapR HBase Client installed. This typically includes all TaskTracker nodes and any other node that will access data in MapR tables. The package name is , where matches the version of mapr-hbase-<version> <version> HBase API to support, such as 0.92.2 or 0.94.5. This version has no impact on the underlying storage format used by the MapR-FS file server. If you have existing applications written for a specific version of the HBase API, install the MapR HBase client package with the same version. If you are developing new applications to use MapR tables exclusively, use the highest available version of the MapR HBase Client. On RedHat and CentOS # yum install mapr-hbase-<version> On Ubuntu # apt-get install mapr-hbase=<version> On SUSE # zypper install mapr-hbase-<version> 3b. Run upgrade2maprexecute If you are upgrading from a previous version of MapR to version 2.1.3 or later, run the script on /opt/mapr/server/upgrade2maprexecute 1. 2. 3. every node, after installing packages but before bringing up the cluster, in order to apply changes in MapR's interaction with . sudo 4. Restart Cluster Services After you have upgraded packages on all nodes, perform the following sequence on all nodes to restart the cluster. 4a. 
Restart MapR Core Services Run the script using one of the following sets of options: configure.sh If services on nodes remain constant during the upgrade use the option as shown in the example below. -R # /opt/mapr/server/configure.sh -R Node setup configuration: fileserver nfs tasktracker Log can be found at: /opt/mapr/logs/configure.log If you have added or removed packages on a node, use the and options to reconfigure the expected services on the node, -C -Z as shown in the example below. # /opt/mapr/server/configure.sh -C <CLDB nodes> -Z <Zookeeper nodes> [-N <cluster name>] Node setup configuration: fileserver nfs tasktracker Log can be found at: /opt/mapr/logs/configure.log If ZooKeeper is installed on the node, start it: # service mapr-zookeeper start JMX enabled by default Using config: /opt/mapr/zookeeper/zookeeper-3.3.6/conf/zoo.cfg Starting zookeeper ... STARTED Start the warden: # service mapr-warden start Starting WARDEN, logging to /opt/mapr/logs/warden.log. . For diagnostics look at /opt/mapr//logs/ for createsystemvolumes.log, warden.log and configured services log files At this point, MapR core services are running on all nodes. 4b. Run Simple Health Check Run simple health-checks targeting the filesystem and MapReduce services only. Address any issues or alerts that might have come up at this point. 4c. Set the New Cluster Version After restarting MapR services on all nodes, issue the following command on any node in the cluster to update and verify the configured version. The version of the installed MapR software is stored in the file . /opt/mapr/MapRBuildVersion 1. 2. # maprcli config save -values {mapr.targetversion:"`cat /opt/mapr/MapRBuildVersion`"} You can verify that the command worked, as shown in the example below. # maprcli config load -keys mapr.targetversion mapr.targetversion 2.1.2.18401.GA 4d. Restart Hive and Apache HBase Services For all nodes with Hive and/or Apache HBase installed, restart the the services. HBase Master and - Start the HBase Master service first, followed immediately by regionservers. On any node in HBase Regionservers the cluster, use these commands to start HBase services. # maprcli node services -nodes <list of nodes> -hbmaster start # maprcli node services -nodes <list of nodes> -hbregionserver start You can the log files on specific nodes to track status. For example: tail # tail /opt/mapr/hbase/hbase-<version>/logs/hbase-<mapr user>-master-<hostid>.log # tail /opt/mapr/hbase/hbase-<version>/logs/hbase-<mapr user>-regionserver-<hostid>.log HiveServer - The HiveServer or (HiveServer2) process must be started on the node where Hive is installed. The method to start-up is dependent on whether you are using HiveServer or HiveServer2. See for more information. Working with Hive 5. Verify Success on Each Node Below are some simple checks to confirm that the packages have upgraded successfully: All expected nodes show up in a cluster node listing, and the expected services are configured on each node. For example: # maprcli node list -columns hostname,csvc hostname configuredservice ip centos55 tasktracker,hbmaster,cldb,fileserver,hoststats 10.10.82.55 centos56 tasktracker,hbregionserver,cldb,fileserver,hoststats 10.10.82.56 centos57 fileserver,tasktracker,hbregionserver,hoststats,jobtracker 10.10.82.57 centos58 fileserver,tasktracker,hbregionserver,webserver,nfs,hoststats,jobtracker 10.10.82.58 ...more nodes... If a node is not connected to the cluster, commands will not work at all. 
maprcli A master CLDB is active, and all nodes return the same result. For example: # maprcli node cldbmaster cldbmaster ServerID: 8851109109619685455 HostName: centos56 Only one ZooKeeper service claims to be the ZooKeeper leader, and all other ZooKeepers are followers. For example: # service mapr-zookeeper qstatus JMX enabled by default Using config: /opt/mapr/zookeeper/zookeeper-3.3.6/conf/zoo.cfg Mode: follower At this point, MapR packages have been upgraded on all nodes. You are ready to . configure the cluster for the new version Rolling Upgrade This page contains the following topics: Overview Planning the Order of Nodes Why Node Order Matters Move JobTracker Service Off of CLDB Nodes Upgrade ZooKeeper packages on All ZooKeeper Nodes Upgrade Half the Nodes, One-by-One, up to the Active JobTracker Upgrade All Remaining Nodes, Starting with the Active JobTracker Overview In the rolling upgrade process, you upgrade the MapR software one node at a time so that the cluster as a whole remains operational throughout the process. The fileserver service on each node goes offline while packages are upgraded, but its absence is short enough that the cluster does not raise the data under-replication alarm. The rolling upgrade process follows the steps shown in the figure below. In the figure, each table cell represents a service running on a node. For example, stands for a TaskTracker service running the existing version of MapR, and stands for the TaskTracker service upgraded to the T T’ new version. The MapR fileserver service is assumed to run on every node, and it gets upgraded at the same time as TaskTracker. Before you begin, make sure you understand the restrictions for rolling upgrade described in . Planning the Upgrade Process 1. 2. 3. 4. 5. Planning the Order of Nodes Plan the order of nodes before you begin upgrading. The particular services running on each node determines the order to upgrade. The node running the JobTracker is of particular interest, because it can change over time. active You will upgrade nodes in the following order: Upgrade ZooKeeper on all ZooKeeper nodes. This establishes a stable ZooKeeper quorum on the new version, which will remain active through the rest of the upgrade process. Upgrade MapR packages on all CLDB nodes. The upgraded CLDB nodes can support both the existing and the new versions of fileservers, which enables all fileservers to remain in service throughout the upgrade. Upgrade MapR packages on half the nodes, including all JobTracker nodes the active JobTracker node. This step upgrades except for the fileserver, TaskTracker and (where present) JobTracker to the new version. Upgrade the active JobTracker node. This node marks the half-way point in the upgrade. Stopping the active JobTracker (running the existing version) causes the cluster to fail-over to a standby JobTracker (running the new version). At this cross-over point, all the new TaskTrackers become active. All existing-version TaskTrackers become inactive, because they cannot accept tasks from the new JobTracker. Upgrade MapR packages on all remaining nodes in the cluster. The cluster’s MapReduce capacity increases with every TaskTracker node that gets upgraded. Going node by node has the following effects: You avoid compromising high-availability (HA) services, such as CLDB and JobTracker, by leaving as many redundant nodes online as possible throughout the upgrade process. 
You avoid triggering aggressive data replication (or making certain data unavailable altogether), which could result if too many fileservers go offline at once. The cluster alarm VOLUME_ALARM_DATA_UNDER_REPLICATED might trigger when a node’s fileserver goes offline. By default, the cluster will not begin replicating data for several minutes, which allows each node’s upgrade process to complete 1. without incurring any replication burden. Downtime per node will be on the order of 1 minute. To find the node currently running the active JobTracker Shortly before beginning to upgrade nodes, determine where the active JobTracker is running. The following command lists the active services running on each node. The service will appear on exactly one node. jobtracker # maprcli node list -columns hostname,svc hostname service ip centos55 tasktracker,cldb,fileserver,hoststats 10.10.82.55 centos56 tasktracker,cldb,fileserver,hoststats 10.10.82.56 centos57 fileserver,tasktracker,hbregionserver,hoststats 10.10.82.57 centos58 fileserver,tasktracker, webserver,nfs,hoststats,jobtracker 10.10.82.58 ...more nodes... To find where ZooKeeper and CLDB are running Use either of the following command to list which nodes have the ZooKeeper and CLDB service configured. # maprcli node listcldbzks CLDBs: centos55,centos56 Zookeepers: centos10:5181,centos11:5181,centos12:5181 # maprcli node list -columns hostname,csvc hostname configuredservice ip centos55 tasktracker,cldb,fileserver,hoststats 10.10.82.55 centos56 tasktracker,cldb,fileserver,hoststats 10.10.82.56 centos57 fileserver,tasktracker,hoststats,jobtracker 10.10.82.57 centos58 fileserver,tasktracker,webserver,nfs,hoststats,jobtracker 10.10.82.58 ...more nodes... The command is not available prior to MapR version 2.0. node listcldbzks Why Node Order Matters The following aspects of Hadoop and the MapR software are at the root of why node order matters when upgrading. Maintaining a ZooKeeper quorum throughout the upgrade process is critical. Newer versions of ZooKeeper are backward compatible. Therefore, we upgrade ZooKeeper packages first to get this step out of the way while ensuring a stable quorum throughout the rest of the upgrade. Newer versions of the CLDB service can recognize older versions of the fileserver service. The reverse is not true, however. Therefore, after you upgrade the CLDB service on a node (which also updates the fileserver on the node), both the upgraded fileservers and existing fileservers can still access the CLDB. MapReduce binaries and filesystem binaries are installed at the same time, and cannot be separated. When you upgrade the mapr-fil package, the binaries for and also get installed, and vice-versa. eserver mapr-tasktracker mapr-jobtracker Move JobTracker Service Off of CLDB Nodes If you have not already done so as part of preparing to upgrade, move JobTrackers to non-CLDB nodes. This is a preparation step to accommodate the fact that the MapR installer cannot upgrade the CLDB binaries independently of JobTracker. See Move JobTrackers off of in for details. CLDB nodes Preparing to Upgrade Upgrade ZooKeeper packages on All ZooKeeper Nodes Upgrade to the new version on all nodes configured to run the ZooKeeper service. Upgrade one node at a time so that a mapr-zookeeper ZooKeeper quorum is maintained at all times through the process. 1. 2. 3. 4. 1. Stop ZooKeeper. # service mapr-zookeeper stop JMX enabled by default Using config: /opt/mapr/zookeeper/zookeeper-3.3.6/conf/zoo.cfg Stopping zookeeper ... 
STOPPED Upgrade the package. mapr-zookeeper On RedHat and CentOS... # yum upgrade 'mapr-zookeeper*' On Ubuntu... # apt-get install 'mapr-zookeeper*' On SUSE... # zypper upgrade 'mapr-zookeeper*' Restart ZooKeeper. # service mapr-zookeeper start JMX enabled by default Using config: /opt/mapr/zookeeper/zookeeper-3.3.6/conf/zoo.cfg Starting zookeeper ... STARTED Verify quorum status to make sure the service is started. # service mapr-zookeeper qstatus JMX enabled by default Using config: /opt/mapr/zookeeper/zookeeper-3.3.6/conf/zoo.cfg Mode: follower Upgrade Half the Nodes, One-by-One, up to the Active JobTracker You will now begin upgrading packages on nodes, proceeding one node at a time. Perform the following steps, one node at a time, following your planned order of upgrade until you have upgraded half the nodes in the cluster. Do not upgrade the JobTracker node. active Stop the warden: Before you begin to upgrade MapR packages in your planned order, verify that the active JobTracker is still running on the node you expect. 1. 2. 1. 2. # service mapr-warden stop stopping WARDEN looking to stop mapr-core processes not started by warden Upgrade the following packages where they exist: mapr-cldb mapr-core mapr-fileserver mapr-hbase-<version> mapr-jobtracker mapr-metrics mapr-nfs mapr-tasktracker mapr-webserver On RedHat and CentOS # yum upgrade mapr-cldb mapr-core mapr-fileserver mapr-hbase-<version> mapr-jobtracker mapr-metrics mapr-nfs mapr-tasktracker mapr-webserver On Ubuntu # apt-get install mapr-cldb mapr-core mapr-fileserver mapr-hbase=<version> mapr-jobtracker mapr-metrics mapr-nfs mapr-tasktracker mapr-webserver On SUSE # zypper update mapr-cldb mapr-core mapr-fileserver mapr-jobtracker mapr-metrics mapr-nfs mapr-tasktracker mapr-webserver Verify that packages installed successfully. Confirm that there were no errors during installation, and check that /opt/mapr/MapRBuild contains the expected value. For example: Version # cat /opt/mapr/MapRBuildVersion 2.1.2.18401.GA If you are upgrading to MapR version 2.1.3 or later, run the script before bringing up the cluster in order to upgrade2maprexecute apply changes in MapR's interaction with . sudo # /opt/mapr/server/upgrade2maprexecute Do not use a wildcard such as " " to upgrade all MapR packages, which could erroneously include Hadoop ecosystem mapr-* components such as and . mapr-hive mapr-pig 3. 4. 5. Start the warden: # service mapr-warden start Starting WARDEN, logging to /opt/mapr/logs/warden.log. . For diagnostics look at /opt/mapr//logs/ for createsystemvolumes.log, warden.log and configured services log files Verify that the node recognizes the CLDB master and that the command returns expected results. For example: maprcli node list # maprcli node cldbmaster cldbmaster ServerID: 8191791652701999448 HostName: centos55 # maprcli node list -columns hostname,csvc,health,disks hostname configuredservice health disks ip centos55 tasktracker,cldb,fileserver,hoststats 0 6 10.10.82.55 centos56 tasktracker,cldb,fileserver,hoststats 0 6 10.10.82.56 centos57 fileserver,tasktracker,hoststats,jobtracker 0 6 10.10.82.57 centos58 fileserver,tasktracker,webserver,nfs,hoststats,jobtracker 0 6 10.10.82.58 ...more nodes... Copy the staged configuration files for the new version to , if you created them as part of . /opt/mapr/conf Preparing to Upgrade Upgrade All Remaining Nodes, Starting with the Active JobTracker At this point, half the nodes in the cluster are upgraded. 
The (old) JobTracker is still running, and only the TaskTrackers are existing existing active. When you upgrade the active JobTracker node, you will stop the active JobTracker, and a graceful failover event will activate a stand-by JobTracker. The JobTracker runs the new version, and therefore issues tasks only to TaskTrackers. The existing TaskTrackers will new new become inactive until you upgrade them. Starting from the active JobTracker node, follow your planned order of upgrade and continue upgrading the remaining nodes in the cluster. Use the same instructions outlined in section . #Upgrade Half the Nodes, One-by-One, up to the Active JobTracker After upgrading the final, active JobTracker, verify that a new JobTracker is active. # maprcli node list -columns hostname,svc hostname service ip centos55 tasktracker,cldb,fileserver,hoststats 10.10.82.55 centos56 tasktracker,cldb,fileserver,hoststats 10.10.82.56 centos57 fileserver,tasktracker,hbregionserver,hoststats,jobtracker 10.10.82.57 centos58 fileserver,tasktracker, webserver,nfs,hoststats 10.10.82.58 ...more nodes... At this point, MapR packages have been upgraded on all nodes. You are ready to . configure the cluster for the new version Scripted Rolling Upgrade The script upgrades the core packages on each node, logging output to the rolling upgrade log ( rollingupgrade.sh /opt/mapr/logs/roll ). The core design goal for the scripted rolling upgrade process is to keep the cluster running at the highest capacity possible ingupgrade.log during the upgrade process. As of the 3.0.1 release of the MapR distribution for Hadoop, the JobTracker can continue working with a TaskTracker of an earlier version, which allows job execution to continue during the upgrade. Individual node progress, status, and command output is logged to the file on each node. You can use the option to specify a directory that contains the /opt/mapr/logs/singlenodeupgrade.log -p upgrade packages. You can use the option to fetch packages from the or a local repository. -v MapR repository Usage Tips If you specify a local directory with the option, you must either ensure that the directory that contains the packages has the same -p name and is on the same path on on all nodes in the cluster or use the option to automatically copy packages out to each node with -x SCP. If you use the option, the upgrade process copies the packages from the directory specified with the option into the same -x -p directory path on each node. See the page for the path where you can download MapR software. Release Notes In a multi-cluster setting, use to specify which cluster to upgrade. If is not specified, the default cluster is upgraded. -c -c When specifying the version with the parameter, use the format to specify the major, minor, and revision numbers of the target -v x.y.z version. Example: 3.0.1 The package (Red Hat) or (Ubuntu) enables automatic rollback if the upgrade fails. The script attempts to rpmrebuild dpkg-repack install these packages if they are not already present. To determine whether or not the appropriate package is installed on each node, run the following command to see a list of all installed versions of the package: On Red Hat and Centos nodes: rpm -qa | grep rpmrebuild On Ubuntu nodes: dpkg -l | grep dpkg-repack Specify the option to the script to disable rollback on a failed upgrade. -n rollingupgrade.sh Installing a newer version of MapR software might introduce new package dependencies. 
Dependency packages must be installed on all nodes in the cluster in addition to the updated MapR packages. If you are upgrading using a package manager such as or , yum apt-get then the package manager on each node must have access to repositories for dependency packages. If installing from package files, you must pre-install dependencies on all nodes in the cluster prior to upgrading the MapR software. See Packages and Dependencies for . MapR Software Jobs in progress on the cluster will continue to run throughout the upgrade process unless they were submitted from a node in the cluster instead of from a client. The script does not support SUSE. Clusters on SUSE must be upgraded with a manual rolling upgrade or an rollingupgrade.sh offline upgrade. The rolling upgrade script only upgrades MapR core packages, not any of the Hadoop ecosystem components. (See Packages and for a list of the MapR packages and Hadoop ecosystem packages.) Follow the procedures in Dependencies for MapR Software Manual to upgrade your cluster's Hadoop ecosystem components. Upgrade for Hadoop Ecosystem Components 1. 2. 3. 4. 5. 6. 7. 8. 9. 1. 2. 3. 1. 2. 3. 4. There are two ways to perform a rolling upgrade: Via SSH - If keyless SSH for the root user is set up to all nodes from the node where you run the script, use the rollingupgrade.sh - option to automatically upgrade all nodes without user intervention. s Node by node - If SSH is not available, the script prepares the cluster for upgrade and guides the user through upgrading each node. In a node-by-node installation, you must individually run the commands to upgrade each node when instructed by the rollingupgrade.sh script. After upgrading your cluster to MapR 2.x, you can run MapR as a . non-root user Upgrade Process Overview The scripted rolling upgrade goes through the following steps: Checks the old and new version numbers. Identifies critical service nodes: CLDB nodes, ZooKeeper nodes, and JobTracker nodes. Builds a list of all other nodes in the cluster. Verifies the hostnames and IP addresses for the nodes in the cluster. If the options are specified, copies packages to the other nodes in the cluster using SCP. -p -x Pretests functionality by building a dummy volume. If the utility is not already installed and the repository is available, installs . pssh pssh Upgrades nodes in batches of 2 to 4 nodes, in an order determined by the presence of critical services. Post-upgrade check and removal of dummy volume. Requirements On the computer from which you will be starting the upgrade, perform the following steps: Change to the user (or use for the following commands). root sudo If you are starting the upgrade from a computer that is not a MapR client or a MapR cluster node, you must add the MapR repository (see ) and install : Preparing Packages and Repositories mapr-core CentOS or Red Hat: yum install mapr-core Ubuntu: apt-get install mapr-core Run , using to specify the cluster CLDB nodes and to specify the cluster ZooKeeper nodes. Example: configure.sh -C -Z /opt/mapr/server/configure.sh -C 10.10.100.1,10.10.100.2,10.10.100.3 -Z 10.10.100.1,10.10.100.2,10.10.100.3 To enable a fully automatic rolling upgrade, ensure that keyless SSH is enabled to all nodes for the user , from the computer on root which the upgrade will be started. IF you are using the option, perform the following steps on the computer from which you will be starting the upgrade. 
If you are not using the -s - option, perform the following steps on all nodes: s Change to the user (or use for the following commands). root sudo If you are using the option, add the MapR software repository (see ). -v Preparing Packages and Repositories Install rolling upgrade scripts: CentOS or Red Hat: yum install mapr-upgrade Ubuntu: apt-get install mapr-upgrade If you are planning to upgrade from downloaded packages instead of the repository, prepare a directory containing the package files. This directory should reside at the same absolute path on each node unless you are using the options to automatically copy the -s -x packages from the upgrade node. Each NFS node in your cluster must have the utility installed. Type the following command on each NFS node in your cluster to verify showmount the presence of the utilty: Your MapR installation must be version 1.2 or newer to use the scripted rolling upgrade. 1. 2. 3. which showmount Upgrading the Cluster via SSH On the node from which you will be starting the upgrade, issue the command as (or with ) to upgrade the rollingupgrade.sh root sudo cluster: If you have prepared a directory of packages to upgrade, issue the following command, substituting the path to the directory for the <dir placeholder: ectory> /opt/upgrade-mapr/rollingupgrade.sh -s -p -x <directory> If you are upgrading from the MapR software repository, issue the following command, substituting the version (x.y.z) for the <version> placeholder: /opt/upgrade-mapr/rollingupgrade.sh -s -v <version> Upgrading the Cluster Node by Node On the node from which you will be starting the upgrade, use the command as (or with ) to upgrade the cluster: rollingupgrade.sh root sudo Start the upgrade: If you have prepared a directory of packages to upgrade, issue the following command, substituting the path to the directory for the placeholder: <directory> /opt/upgrade-mapr/rollingupgrade.sh -p <directory> If you are upgrading from the MapR software repository, issue the following command, substituting the version (x.y.z) for the <ve placeholder: rsion> /opt/upgrade-mapr/rollingupgrade.sh -v <version> When prompted, run on all nodes other than the active JobTracker and master CLDB node, following the singlenodeupgrade.sh on-screen instructions. When prompted, run on the active JobTracker node, then the master CLDB node, following the on-screen singlenodeupgrade.sh instructions. After upgrading, as usual. configure the new version Configuring the New Version After you have successfully upgraded MapR packages to the new version, you are ready to configure the cluster to enable new features. Not all new features are enabled by default, so that administrators have the option to make the change-over at a specific time. Follow the steps in this section to enable new features. Note that you do not have to enable all new features. This page contains the following topics: Enabling v3.0 Features Enable New Filesystem Features Configure CLDB for the New Version Apply a License to Use Tables Enabling v2.0 Features Enable new filesystem features Enable Centralized Configuration Enable/Disable Centralized Logging Enable Non-Root User Install MapR Metrics Verify Cluster Health Success! If your upgrade process skips a major release boundary (for example, MapR version 1.2.9 to version 3.0), perform the steps for the skipped version too (in this example, 2.0). Enabling v3.0 Features The following are operations to enable features available as of MapR version 3.0. 
Enable New Filesystem Features To enable v3.0 features related to the filesystem, issue the following command on any node in the cluster. The cluster will raise the alarm CLUSTE until you perform this command. R_ALARM_NEW_FEATURES_DISABLED # maprcli config save -values {"cldb.v3.features.enabled":"1"} You can verify that the command worked, as shown in the example below. # maprcli config load -keys cldb.v3.features.enabled cldb.v3.features.enabled 1 Configure CLDB for the New Version Because some CLDB nodes are shut down during the upgrade, those nodes aren't notified of the change in version number, resulting in the NOD alarm raising once the nodes are back up. To set the version number manually, use the following command to E_ALARM_VERSION_MISMATCH make the CLDB aware of the new version: Note: This command is mandatory when upgrading to version 3.x. Once enabled, it cannot be disabled. After enabling v3.0 features, nodes running a pre-3.0 version of the service will fail to register with the cluster. mapr-mfs This command will also enable v2.0 filesystem features. 1. 2. maprcli config save -values {"mapr.targetversion":"'cat /opt/mapr/MapRBuildVersion'"} Apply a License to Use Tables MapR version 3.0 introduced native table storage in the cluster filesystem. To use MapR tables you must purchase and apply an M7 Edition license. Log into the MapR Control System and click to apply an M7 license file. Manage Licenses Enabling v2.0 Features The following are operations to enable features available as of MapR version 2.0. Enable new filesystem features To enable v2.0 features related to the filesystem, issue the following command on any node in the cluster. The cluster will raise the alarm CLUSTE  until you perform this command. R_ALARM_NEW_FEATURES_DISABLED # maprcli config save -values {"cldb.v2.features.enabled":"1"} You can verify that the command worked, as shown in the example below. # maprcli config load -keys cldb.v2.features.enabled cldb.v2.features.enabled 1 Enable Centralized Configuration To enable centralized configuration: On each node in the cluster, add the following lines to the file  . /opt/mapr/conf/warden.conf centralconfig.enabled=true pollcentralconfig.interval.seconds=300 Restart the warden to pick up the new configuration. # service mapr-warden restart Note that the Central Configuration feature, which is enabled by default in MapR version 2.1 and later, automatically updates configuration files. If you choose to enable Centralized Configuration as part of your upgrade process, it could overwrite manual changes you've made to configuration The system raises the alarm if you upgrade your cluster to an M7 license without having NODE_ALARM_M7_CONFIG_MISMATCH configured the FileServer nodes for M7. To clear the alarm, restart the FileServer service on all of the nodes using the instructions on the page. Services Note: This command is mandatory when upgrading to version 2.x. Once enabled, it cannot be disabled. After enabling, nodes running a pre-2.0 version of the service will fail to register with the cluster. mapr-mfs 1. 2. a. b. c. 3. files. See   for more details. Central Configuration Enable/Disable Centralized Logging Depending on the MapR version, the Centralized Logging feature may be on or off in the default configuration files. MapR recommends disabling this feature unless you plan to you use it. Centralized logging is enabled by the   parameter in the file  HADOOP_TASKTRACKER_ROOT_LOGGER /op . 
Setting this parameter to   disables centralized logging, and setting t/mapr/hadoop/hadoop-<version>/conf/hadoop-env.sh INFO,DRFA to   enables it. INFO,maprfsDRFA If you make changes to  , restart TaskTracker on all touched nodes to make the changes take effect: hadoop-env.sh # maprcli node services -nodes <nodes> -tasktracker restart Enable Non-Root User If you want to run MapR services as a non-root user, follow the steps in this section. Note that you do not have to switch the cluster to a non-root user if you do not need this additional level of security. This procedure converts a MapR cluster running as to run as a non-root user. Non-root operation is available from MapR version 2.0 and root later. In addition to converting the MapR user to a non-root user, you can also disable superuser privileges to the cluster for the root user for additional security. To convert a MapR cluster from running as root to running as a non-root user: Create a user with the same UID/GID across the cluster. Assign that user to the environment variable. MAPR_USER On each node: Stop the warden and the ZooKeeper (if present). # service mapr-warden stop # service mapr-zookeeper stop Run the config-mapr-user.sh script to configure the cluster to start as the non-root user. # /opt/mapr/server/config-mapr-user.sh -u <MapR user> [-g <MapR group>] Start the ZooKeeper (if present) and the warden. # service mapr-zookeeper start # service mapr-warden start After the previous step is complete on all nodes in the cluster, run the script on all nodes. upgrade2mapruser.sh # /opt/mapr/server/upgrade2mapruser.sh This command may take several minutes to return. The script waits ten minutes for the process to complete across the entire cluster. If the cluster-wide operation takes longer than ten minutes, the script fails. Re-run the script on all nodes where the script failed. You must perform these steps on all nodes on a stable cluster. Do not perform this procedure concurrently while upgrading packages. The alarm may raise during this process. The alarm will clear when this process is complete on MAPR_UID_MISMATCH 3. 1. 2. To disable superuser access for the root user To disable root user (UID 0) access to the MapR filesystem on a cluster that is running as a non-root user, use either of the following commands: The configuration value treats all requests from UID 0 as coming from UID -2 (nobody): squash root # maprcli config save -values {"cldb.squash.root":"1"} The configuration value automatically fails all filesystem requests from UID 0: reject root # maprcli config save -values {"cldb.reject.root":"1"} You can verify that these commands worked, as shown in the example below. # maprcli config load -keys cldb.squash.root,cldb.reject.root cldb.reject.root cldb.squash.root 1 1 Install MapR Metrics MapR Metrics is a separately-installable package. For details on adding and activating the mapr-metrics service, see Managing Services on a  to add the service and   to configure it. Node Setting up the MapR Metrics Database Verify Cluster Health At this point, the cluster should be fully operational again with new features enabled. Run your simple and non-trivial health checks to verify cluster health. If you experience problems, see . Troubleshooting Upgrade Issues Success! Congratulations! At this point, your cluster is fully upgraded. Troubleshooting Upgrade Issues This section provides information about troubleshooting upgrade problems. Click a subtopic below for more detail. 
NFS incompatible when upgrading to MapR v1.2.8 or later

Starting in MapR release 1.2.8, a change in the NFS file handle format makes NFS file handles incompatible between NFS servers running MapR version 1.2.7 or earlier and servers running MapR 1.2.8 or later. NFS clients that were originally mounted to NFS servers on nodes running MapR version 1.2.7 or earlier must remount the file system when the node is upgraded to MapR version 1.2.8 or later. If you are performing a rolling upgrade and need to maintain NFS service throughout the upgrade process, you can use the guidelines below.
1. Upgrade a subset of the existing NFS server nodes, or install the newer version of MapR on a set of new nodes.
2. If the selected NFS server nodes are using virtual IP numbers (VIPs), reassign those VIPs to other NFS server nodes that are still running the previous version of MapR.
3. Apply the upgrade to the selected set of NFS server nodes.
4. Start the NFS servers on the nodes upgraded to the newer version.
5. Unmount the NFS clients from the NFS servers of the older version.
6. Remount the NFS clients on the upgraded NFS server nodes. Stage these remounts in groups of 100 or fewer clients to prevent performance disruptions.
7. After remounting all NFS clients, stop the NFS servers on the nodes running the older version, then continue the upgrade process.
Due to changes in file handles between versions, cached file IDs cannot persist across this upgrade.

M7 - Native Storage for MapR Tables

Starting in version 3.0, the MapR distribution for Hadoop integrates native tables stored directly in MapR-FS. This page contains the following topics:
About MapR Tables
MapR-FS Handles Structured and Unstructured Data
Benefits of Integrated Tables in MapR-FS
The MapR Implementation of HBase
Effects of Decoupling API and Architecture
The HBase Data Model
Using MapR and Apache HBase Tables Together
Current Limitations in Version 3.0
Administering MapR Tables
Related Topics

About MapR Tables

In the 3.0 release of the MapR distribution for Hadoop, MapR-FS enables you to create and manipulate tables in many of the same ways that you create and manipulate files in a standard UNIX file system. This document discusses how to set up your MapR installation to use MapR tables. For users experienced with standard Apache HBase, this document describes the differences in capabilities and behavior between MapR tables and Apache HBase tables.

MapR-FS Handles Structured and Unstructured Data

The 3.0 release of the MapR distribution for Hadoop features a unified architecture for files and tables, providing distributed data replication for structured and unstructured data. Tables enable you to manage structured data, as opposed to the unstructured data management provided by files. The structure for structured data management is defined by a data model, a set of rules that defines the relationships in the structure. By design, the data model for tables in MapR focuses on columns, similar to the open-source standard Apache HBase system. Like Apache HBase, MapR tables store data structured as a nested sequence of key/value pairs, where the value in one pair serves as the key for another pair.

Apache HBase is compatible with MapR tables. With a properly licensed MapR installation, you can use MapR tables exclusively or work in a mixed environment with Apache HBase tables.
MapR tables are implemented directly within MapR-FS, yielding a familiar, open-standards API that provides a high-performance datastore for tables. MapR-FS is written in C and optimized for performance. As a result, MapR-FS runs significantly faster than JVM-based Apache HBase. The diagram below compares the application stacks for different HBase implementations. Benefits of Integrated Tables in MapR-FS The MapR cluster architecture provides the following benefits for table storage, providing an enterprise-grade HBase environment. MapR clusters with HA features recover instantly from node failures. MapR provides a unified namespace for tables and files, allowing users to group tables in directories by user, project, or any other useful grouping. Tables are stored in volumes on the cluster alongside unstructured files. Storage policy settings for apply to tables as well as volumes files. Volume mirrors and snapshots provide flexible, reliable read-only access. Table storage and MapReduce jobs can co-exist on the same nodes without degrading cluster performance. The use of MapR tables imposes no administrative overhead beyond administration of the MapR cluster. Node upgrades and other administrative tasks do not cause downtime for table storage. The MapR Implementation of HBase MapR's implementation is with the core HBase API. Programmers who are used to writing code for the HBase API will have API compatible immediate, intuitive access to MapR tables. MapR delivers faithfully on the original vision for Google's BigTable paper, using the open-standard HBase API. MapR's implementation of the HBase API provides enterprise-grade high availability (HA), data protection, and disaster recovery features for tables on a distributed Hadoop cluster. MapR tables can be used as the underlying key-value store for Hive, or any other application requiring a high-performance, high-availability key-value datastore. Because MapR tables are with HBase, many legacy HBase applications API-compatible can continue to run without modification. MapR has extended to work with MapR tables in addition to Apache HBase tables. Similar to development for Apache HBase, the hbase shell simplest way to create tables and column families in MapR-FS, and put and get data from them, is to use . MapR tables can be hbase shell created from the (MCS) user interface or from the Linux , without the need to coordinate with a database MapR Control System command line administrator. You can treat a MapR table just as you would a file, specifying a path to a location in a directory, and the table appears in the same namespace as your regular files. You can also create and manage for your table from the MCS or directly from the command line. column families During or other specific scenarios where you need to refer to a MapR table of the same name as an Apache HBase table in the data migration same cluster, you can to enable that operation. map the table namespace The Apache HBase API exposes many low-level administrative functions that can be tuned for performance or reliability. The reliability and functionality of MapR tables renders these low level functions moot, and these low level calls are not supported for MapR tables. Please see the for detailed information. API compatibility tables MapR does not support hooks to manipulate the internal behavior of the datastore, which are common in Apache HBase applications. 
The Apache HBase codebase and community have internalized numerous hacks and workarounds to circumvent the intrinsic limitations of a datastore implemented on a Java Virtual Machine. Some HBase workflows are designed specifically to accommodate limitations in the Apache HBase implementation. HBase code written around those limitations will generally need to be modified in order to work with MapR tables.
To summarize:
The MapR table API is compatible with the core HBase API.
MapR tables implement the HBase feature set.
MapR tables can be used as the datastore for Hive applications.
Unlike Apache HBase tables, MapR tables do not support manipulation of internal storage operations.
Apache HBase applications crafted specifically to accommodate architectural limitations in HBase will require modification in order to run on MapR tables.

Effects of Decoupling API and Architecture
The following features of MapR tables result from decoupling the HBase API from the Apache HBase architecture:
MapR's High Availability (HA) cluster architecture eliminates the RegionServer component of traditional Apache HBase architecture, which was a single point of failure and a bottleneck for scalability. In MapR-FS, MapR tables are HA at all levels, similar to other services on a MapR cluster.
MapR-FS allows an unlimited number of tables, with cells up to 1GB.
MapR tables can have up to 64 column families, with no limit on the number of columns.
MapR-FS automates compaction operations and splitting for MapR tables.
Crash recovery is significantly faster than Apache HBase.

The HBase Data Model
Apache HBase stores structured data as a nested series of maps. Each map consists of a set of key-value pairs, where the value can be the key in another map. Keys are kept in strict lexicographical order: 1, 10, and 113 come before 2, 20, and 213. In descending order of granularity, the elements of an HBase entry are:
Key: Keys define the rows in an HBase table.
Column family: A column family is a key associated with a set of columns. Specify this association according to your individual use case, creating sets of columns. A column family can contain an arbitrary number of columns. MapR tables support up to 64 column families.
Column: Columns are keys that are associated with a series of timestamps that define when the value in that column was updated.
Timestamp: The timestamp in a column specifies a particular data write to that column.
Value: The data written to that column at the specific timestamp.
This structure results in versioned values that you can access flexibly and quickly. Because Apache HBase and MapR tables are sparse, any of the column values for a given key can be null.

Example HBase Table
This example uses JSON notation for representational clarity. In this example, timestamps are arbitrarily assigned.

{
  "arbitraryFirstKey" : {
    "firstColumnFamily" : {
      "firstColumn" : {
        10 : "valueFive",
        7 : "valueThree",
        4 : "valueOne"
      },
      "secondColumn" : {
        16 : "valueEight",
        1 : "valueSeven"
      }
    },
    "secondColumnFamily" : {
      "firstColumn" : {
        37 : "valueFive",
        23 : "valueThree",
        11 : "valueSeven",
        4 : "valueOne"
      },
      "secondColumn" : {
        15 : "valueEight"
      }
    }
  },
  "arbitrarySecondKey" : {
    "firstColumnFamily" : {
      "firstColumn" : {
        10 : "valueFive",
        4 : "valueOne"
      },
      "secondColumn" : {
        16 : "valueEight",
        7 : "valueThree",
        1 : "valueSeven"
      }
    },
    "secondColumnFamily" : {
      "firstColumn" : {
        23 : "valueThree",
        11 : "valueSeven"
      }
    }
  }
}

HBase queries return the most recent timestamp by default. A query for the value in "arbitrarySecondKey"/"secondColumnFamily:firstColumn" returns valueThree. Specifying the timestamp 11 with a query for "arbitrarySecondKey"/"secondColumnFamily:firstColumn" returns valueSeven.
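As a sketch of how these versioned values are read through the standard client API, the hypothetical Java fragment below issues one default read and one read pinned to timestamp 11 against the example table above. The table path /user/alice/example is an assumed placeholder; the family, column, and timestamp values follow the JSON example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "/user/alice/example");  // hypothetical path

        byte[] row = Bytes.toBytes("arbitrarySecondKey");
        byte[] family = Bytes.toBytes("secondColumnFamily");
        byte[] column = Bytes.toBytes("firstColumn");

        // Default read: returns the value with the most recent timestamp (valueThree).
        Result latest = table.get(new Get(row));
        System.out.println(Bytes.toString(latest.getValue(family, column)));

        // Read pinned to timestamp 11: returns the older version (valueSeven).
        Get at11 = new Get(row);
        at11.setTimeStamp(11L);
        Result older = table.get(at11);
        System.out.println(Bytes.toString(older.getValue(family, column)));

        table.close();
    }
}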
Using MapR and Apache HBase Tables Together
MapR table storage is independent from Apache HBase table storage, enabling a single MapR cluster to run both systems. Users typically run both systems concurrently, particularly during the migration phase. Alternately, you can leave Apache HBase running for existing applications and use MapR tables for new applications. You can set up namespace mappings for your cluster to run both MapR tables and Apache HBase tables concurrently, during migration or on an ongoing basis.

Current Limitations in Version 3.0
Custom HBase filters are not supported.
User permissions for column families are not supported. User permissions for tables and columns are supported.
HBase authentication is not supported.
HBase replication is handled with mirror volumes.
Bulk loads using the HFiles workaround are not supported and are not necessary.
HBase coprocessors are not supported.
Filters use a different regular expression library from java.util.regex.Pattern. See Supported Regular Expressions in MapR Tables for a complete list of supported regular expressions.

Administering MapR Tables
The MapR Control System and the command-line interface provide a compact set of features for adding and managing tables. In a traditional HBase environment, cluster administrators are typically involved in provisioning tables and column families because of limitations on the number of tables and column families that Apache HBase can support. MapR supports a virtually unlimited number of tables, each with up to 64 column families, reducing administrative overhead. HBase programmers can use API function calls to create as many tables and column families as needed for a particular application. Programmers can also use tables to store intermediate data in a multi-stage MapReduce program, then delete the tables without assistance from an administrator. See Working With MapR Tables and Column Families for more information.
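The following minimal Java sketch illustrates that administrator-free lifecycle: a program creates a scratch table for intermediate results, writes to it, and deletes it when the job is done. The path /user/alice/tmp/stage1-output and the column family interim are hypothetical, and the snippet assumes the HBase 0.94-era client API described in this guide.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class IntermediateTableSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        String tablePath = "/user/alice/tmp/stage1-output";  // hypothetical path

        // Create a scratch table for intermediate MapReduce output.
        HTableDescriptor desc = new HTableDescriptor(tablePath);
        desc.addFamily(new HColumnDescriptor("interim"));
        admin.createTable(desc);  // synchronous for MapR tables

        // ... the job writes its intermediate results ...
        HTable table = new HTable(conf, tablePath);
        table.put(new Put(Bytes.toBytes("partial-0001"))
                .add(Bytes.toBytes("interim"), Bytes.toBytes("sum"), Bytes.toBytes("1287")));
        table.close();

        // Clean up without involving a database administrator.
        // (Stock Apache HBase would require disabling the table first; for MapR
        // tables the disable step is unnecessary.)
        admin.deleteTable(tablePath);
        admin.close();
    }
}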
Related Topics
Setting Up MapR-FS to Use Tables
Working With MapR Tables and Column Families
Mapping Table Namespace Between Apache HBase Tables and MapR Tables
Protecting Table Data
Migrating Between Apache HBase Tables and MapR Tables

Setting Up MapR-FS to Use Tables
This page describes how to begin using tables natively with MapR-FS. This page contains the following topics:
Installation
Enabling Access to MapR Tables via HBase APIs, hbase shell, and MapReduce Jobs
MapR Tables and Apache HBase Tables on the Same Cluster
Set Up User Directories for MapR Tables
Related Topics

Installation
As of version 3.0 of the MapR distribution, MapR-FS provides storage for structured table data. No additional installation steps are required to install table capabilities. However, you must apply an appropriate license after you've completed the installation process to enable table features. You can also set up a client-only node to connect to your MapR cluster and access tables.

Enabling Access to MapR Tables via HBase APIs, hbase shell, and MapReduce Jobs
You can use the HBase API and the hbase shell command to access your MapR tables. MapR has extended the HBase component to handle access to both MapR tables and Apache HBase tables. MapR tables do not support low-level HBase API calls that are used to manipulate the state of an Apache HBase cluster. See the MapR Table Support for Apache HBase Interfaces page for a full list of supported HBase API and shell commands.
To enable HBase API and hbase shell access, install the mapr-hbase package on every node in the cluster. The HBase component of the MapR distribution for Hadoop is typically installed under /opt/mapr/hbase. To maintain compatibility with your existing HBase applications and workflow, be sure to install the mapr-hbase package that provides the same version number of HBase as your existing Apache HBase. See Installing MapR Software for information about MapR installation procedures, including setting up the proper repositories.

MapR Tables and Apache HBase Tables on the Same Cluster
Apache HBase can run on MapR's distribution of Hadoop, and users can store table data in both Apache HBase tables and MapR tables concurrently. Apache HBase and MapR store table data separately; however, the same mechanisms (HBase APIs and the hbase shell) are used to access data in both systems. On clusters that run Apache HBase on top of MapR, you can set up a namespace mapping to specify whether a given table identifier maps to a MapR table or an Apache HBase table.

Set Up User Directories for MapR Tables
Because MapR tables, like files, are created by users, MapR tracks table activity in a user's home directory on the cluster. Create a home directory at /user/<username> on your cluster for each user that will access MapR tables. After mounting the cluster on NFS, create these directories with the standard Linux mkdir command in the cluster's directory structure. When a user foo does not have a corresponding /user/foo directory on the cluster, querying MapR for a list of tables that belong to that user generates an error reporting the missing directory.

Related Topics
Mapping Table Namespace Between Apache HBase Tables and MapR Tables
Protecting Table Data

Mapping Table Namespace Between Apache HBase Tables and MapR Tables
MapR's implementation of the HBase API differentiates between Apache HBase tables and MapR tables based on the table name. In certain cases, such as migrating code from Apache HBase tables to MapR tables, users need to force the API to access a MapR table even though the table name could map to an Apache HBase table. The hbase.table.namespace.mappings property allows you to map Apache HBase table names to MapR tables. This property is typically set in the configuration file /opt/mapr/hadoop/hadoop-<version>/conf/core-site.xml.
In general, if a table name includes a slash (/), the name is assumed to be a path to a MapR table, because slash is not a valid character for Apache HBase table names. In the case of "flat" table names without a slash, a namespace conflict is possible, and you might need to use table mappings.

Table Mapping Naming Conventions
A table mapping takes the form name:map, where name is the table name to redirect and map is the modification made to the name. The value in name can be a literal string or contain the * wildcard. When mapping a name with a wildcard, the mapping is treated as a directory. Requests to tables with names that match the wildcard are sent to the directory in the mapping.
When mapping a name that is a literal string, you can choose from two different behaviors:
End the mapping with a slash to indicate that this mapping is to a directory. For example, the mapping mytable1:/user/aaa/ sends requests for table mytable1 to the full path /user/aaa/mytable1.
End the mapping without a slash, which creates an alias and treats the mapping as a full path. For example, the mapping mytable1:/user/aaa sends requests for table mytable1 to the full path /user/aaa.

Mappings and Table Listing Behaviors
When you use the list command without specifying a directory, the command's behavior depends on two factors:
Whether a table mapping exists
Whether Apache HBase is installed and running
The version of HBase provided by MapR has been modified to work with MapR tables in addition to Apache HBase. Do not download and install stock Apache HBase on a MapR cluster that uses MapR tables. If you use fat JARs to deploy your application as a single JAR including all dependencies, be aware that the fat JAR may contain versions of HBase that override the installed MapR versions, leading to problems. Check your fat JARs for the presence of stock HBase to prevent this problem.
Here are three different scenarios and the resulting list command behavior for each.
There is a table mapping for *, as in *:/tables. In this case, the list command lists the tables in the mapped directory.
There is no mapping for *, and Apache HBase is installed and running. In this case, the list command lists the HBase tables.
There is no mapping for *, and Apache HBase is not installed or is not running. In this case, the shell will try to connect to an HBase cluster, but will not be able to. After a few seconds, it will give up and fall back to listing the M7 tables in the user's home directory.

Table Mapping Examples
Example 1: Map all HBase tables to MapR tables in a directory
In this example, any flat table name foo is treated as a MapR table in the directory /tables_dir/foo.
<property>
  <name>hbase.table.namespace.mappings</name>
  <value>*:/tables_dir</value>
</property>
Example 2: Map specific Apache HBase tables to specific MapR tables
In this example, the Apache HBase table name mytable1 is treated as a MapR table at /user/aaa/mytable1. The Apache HBase table name mytable2 is treated as a MapR table at /user/bbb/mytable2. All other Apache HBase table names are treated as stock Apache HBase tables.
<property>
  <name>hbase.table.namespace.mappings</name>
  <value>mytable1:/user/aaa/,mytable2:/user/bbb/</value>
</property>
Example 3: Combination of specific table names and wildcards
Mappings are evaluated in order. In this example, the flat table name mytable1 is treated as a MapR table at /user/aaa/mytable1. The flat table name mytable2 is treated as a MapR table at /user/bbb/mytable2. Any other flat table name foo is treated as a MapR table at /tables_dir/foo.
<property>
  <name>hbase.table.namespace.mappings</name>
  <value>mytable1:/user/aaa/,mytable2:/user/bbb/,*:/tables_dir</value>
</property>

Working With MapR Tables and Column Families
About MapR Tables
MapR Tables and Filters
Filesystem Operations
Read and Write
Move
Remove
Copy and Recursive/Directory Copy

About MapR Tables
The MapR Data Platform stores tables in the same namespace as files. You can move, delete, and set attributes for a table similarly to a file, and all filesystem operations remain accessible with the hadoop fs command. You can create MapR tables using the MapR Control System (MCS) and the maprcli interface in addition to the normal HBase shell or HBase API methods. When creating a MapR table, specify a location in the MapR Data Platform directory structure in addition to the name of the table. A user can create a MapR table anywhere on the cluster that the user has write access.
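Because tables live in the filesystem namespace, a program can treat a table path much like a file path. The sketch below assumes that the table move behavior described in the next paragraphs for hadoop fs -mv is also reachable through the standard Hadoop FileSystem.rename call; that assumption, along with both paths shown, is illustrative rather than documented here.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MoveTableSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "maprfs:///" addresses the cluster filesystem; both paths are hypothetical
        // and must live in the same volume, since cross-volume moves are unsupported.
        FileSystem fs = FileSystem.get(URI.create("maprfs:///"), conf);

        boolean moved = fs.rename(new Path("/user/analysis/tables/table01"),
                                  new Path("/user/analysis/archive/table01"));
        System.out.println(moved ? "table moved" : "move failed");
        fs.close();
    }
}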
Volume properties, such as replication factor or rack topology, that apply to the specified location also apply to tables stored at that location. You can move a table with the Linux mv command or the hadoop fs -mv command. Administrators may choose to pre-create tables for a project in order to enforce a designated naming convention, or to store tables in a desired location in the cluster.
The number of tables that can be stored on a MapR cluster is constrained only by the number of open file handles and storage space availability. Each table can have up to 64 column families. You can add, edit, and delete column families in a MapR table with the MapR Control System (MCS) and the maprcli interface. You can also add column families to MapR tables with the HBase shell or API.
When you use Direct Access NFS or the hadoop fs -ls command to access a MapR cluster, tables and files are listed together. Because the client's Linux commands are not table-aware, other Linux file manipulation commands, notably file read and write commands, are not available for MapR tables.
Some Apache HBase table operations are not applicable or required for MapR tables, notably manual compactions, table enables, and table disables. HBase API calls that perform such operations on a MapR table result in the modification being silently ignored. When appropriate, the modification request is cached in the client and returned by API calls to enable legacy HBase applications to run successfully. In addition, the maprcli table listrecent command displays recently-accessed MapR tables, rather than listing tables across the entire file system. See MapR Table Support for Apache HBase Interfaces for a complete list of supported operations.

MapR Tables and Filters
MapR tables support the following built-in filters. These filters work identically to their Apache HBase versions.
ColumnCountGetFilter - Returns the first N columns of a row.
ColumnPaginationFilter
ColumnPrefixFilter
ColumnRangeFilter
CompareFilter
FirstKeyOnlyFilter
FuzzyRowFilter
InclusiveStopFilter
KeyOnlyFilter
MultipleColumnPrefixFilter
PageFilter
PrefixFilter
RandomRowFilter
SingleColumnValueFilter
SkipFilter
TimestampsFilter
WhileMatchFilter
FilterList
RegexStringComparator

Filesystem Operations
This section describes the operations that you can perform on MapR tables through a Linux command line when you access the cluster through NFS or with the hadoop fs commands.
Read and Write
You cannot perform read or write operations on a MapR table from a Linux filesystem context. Among other things, you cannot use the cat command to insert text into a table or search through a table with the grep command. The MapR software returns an error when an application attempts to read or write to a MapR table.
Move
You can move a MapR table within a volume with the mv command over NFS or with the hadoop fs -mv command. These moves are subject to the standard permissions restrictions. Moves across volumes are not currently supported.
Remove
You can remove a table with the rm command over NFS or with the hadoop fs -rm command. These commands remove the table from the namespace and asynchronously reclaim the disk space. You can remove a directory that includes both files and tables with the rm -r or hadoop fs -rmr commands.
Copy and Recursive/Directory Copy
Table copying at the filesystem level is not supported in this release. See Migrating Between Apache HBase Tables and MapR Tables for information on copying tables using the HBase shell.

Example: Creating a MapR Table
With the HBase shell
This example creates a table called development in the directory /user/foo with a column family called stage, using system defaults. In this example, we first start the HBase shell from the command line with hbase shell, and then use the create command to create the table.
$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.1-SNAPSHOT, rUnknown, Mon Dec 17 09:23:31 PST 2012

hbase(main):001:0> create '/user/foo/development', 'stage'
With the MapR Control System
1. In the MCS pane under the MapR Data Platform group, click Tables. The Tables tab appears in the main window.
2. Click the New Table button.
3. Type a complete path for the new table.
4. Click OK. The MCS displays a tab for the new table.
The screen-capture below demonstrates the creation of a table table01 in location /user/analysis/tables/.
With the MapR CLI
Use the maprcli table create command at a command line. For details, type maprcli table create -help at a command line. The following example demonstrates creation of a table table02 in cluster location /user/analysis/tables/. The cluster my.cluster.com is mounted at /mnt/mapr/.
$ maprcli table create -path /user/analysis/tables/table02
$ ls -l /mnt/mapr/my.cluster.com/user/analysis/tables
lrwxr-xr-x 1 mapr mapr 2 Oct 24 16:14 table01 -> mapr::table::2056.62.17034
lrwxr-xr-x 1 mapr mapr 2 Oct 24 16:13 table02 -> mapr::table::2056.56.17022
$ maprcli table listrecent
path
/user/analytics/tables/table01
/user/analytics/tables/table02

Example: Adding a column family
With the HBase shell
This example adds a column family called status to the table development, using system defaults. In this example, we first start the HBase shell from the command line with hbase shell, and then use the alter command to add the column family.
$ hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.1-SNAPSHOT, rUnknown, Mon Dec 17 09:23:31 PST 2012

hbase(main):001:0> alter '/user/foo/development', {NAME => 'status'}
With the MapR Control System
1. In the MCS pane under the MapR Data Platform group, click Tables. The Tables tab appears in the main window.
2. Find the table you want to work with, using one of the following methods.
Scan for the table under Recently Opened Tables on the Tables tab.
Enter a regular expression for part of the table pathname in the Go to table field and click Go.
3. Click the desired table name. A Table tab appears in the main MCS pane, displaying information for the specific table.
4. Click the Column Families tab.
5. Click New Column Family. The Create Column Family dialog appears.
6. Enter values for the following fields:
Column Family Name - Required.
Max Versions - The maximum number of versions of a cell to keep in the table.
Min Versions - The minimum number of versions of a cell to keep in the table.
Compression - The compression algorithm used on the column family's data. Select a value from the drop-down. The default value is Inherited, which uses the same compression type as the table. Available compression methods are LZF, LZ4, and ZLib. Select OFF to disable compression.
Time-To-Live - The minimum time-to-live for cells in this column family. Cells older than their time-to-live stamp are purged periodically.
In memory - Preference for a column family to reside in memory for fast lookup.
You can change any column family properties at a later time using the MCS or maprcli table cf edit from the command line.
The screen-capture below demonstrates the creation of a column family userinfo for the table at location /user/analysis/tables/table01.
With the MapR CLI
Use the maprcli table cf create command at a command line. For details, see table cf create or type maprcli table cf create -help at a command line. The following example demonstrates addition of a column family named casedata in table /user/analysis/tables/table01, using lzf compression and keeping a maximum of 5 versions of cells in the column family.
$ maprcli table cf create -path /user/analysis/tables/table01 \
 -cfname casedata -compression lzf -maxversions 5
$ maprcli table cf list -path /user/analysis/tables/table01
inmemory cfname compression ttl maxversions minversions
true userinfo lz4 0 3 0
false casedata lzf 0 5 0
$
You can change any column family properties at a later time using the maprcli table cf edit command.
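Column families can also be added programmatically. The Java sketch below uses HBaseAdmin.addColumn, which the API compatibility tables later in this guide list as supported for MapR tables, to add a casedata column family that keeps five cell versions. The table path and property values mirror the CLI example above but are otherwise illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class AddColumnFamilySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Describe the new column family; the name and settings mirror the
        // maprcli example above and are otherwise placeholders.
        HColumnDescriptor casedata = new HColumnDescriptor("casedata");
        casedata.setMaxVersions(5);   // keep up to five cell versions
        casedata.setInMemory(false);  // no in-memory preference

        // Add the column family to the existing table at this path.
        admin.addColumn("/user/analysis/tables/table01", casedata);
        admin.close();
    }
}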
Schema Design for MapR Tables
Your database schema defines the data in your tables and how they are related. The choices you make when specifying how to arrange your data in keys and columns, and how those columns are grouped in families, can have a significant effect on query performance.
Row key design
Composite keys
Column family design
Column design
Schemas for MapR tables follow the same general principles as schemas for standard Apache HBase tables, with one important difference. Because MapR tables can use up to 64 column families, you can make more extensive use of the advantages of column families:
Segregate related data into column families for more efficient queries
Optimize column-family-specific parameters to tune for performance
Group related data for more efficient compression
Naming your identifiers: Because the names for the row key, column family, and column identifiers are associated with every value in a table, these identifiers are replicated potentially billions of times across your tables. Keeping these names short can have a significant effect on your tables' storage demands.
Access times for MapR tables are fastest when a single record is looked up based on the full row key. Partial scans within a column family are more demanding on the cluster's resources. A full-table scan is the least efficient way to retrieve data from your table.

Row key design
Because records in Apache HBase tables are stored in lexicographical order, using a sequential generation method for row keys can lead to a hot spot problem. As new rows are created, the table splits. Since the new records are still being created sequentially, all the new entries are still directed to a single node until the next split, and so on. In addition to concentrating activity on a single region, all the other splits remain at half their maximum size. With MapR tables, the cluster handles sequential keys and table splits to keep potential hotspots moving across nodes, decreasing the intensity and performance impact of the hotspot.
To spread write and insert activity across the cluster, you can randomize sequentially generated keys by hashing the keys or inverting the byte order. Note that these strategies come with trade-offs. Hashing keys, for example, makes table scans for key subranges inefficient, since the subrange is spread across the cluster. Instead of hashing the entire key, you can salt the key by prepending a few bytes of the hash to the actual key. For a key based on a timestamp, for instance, a timestamp value of 1364248490 has an MD5 hash that ends with ffe5. By making the key for that row ffe51364248490, you avoid hotspotting. Since the first four characters are known to be the hash salt, you can derive the original timestamp by dropping those characters. A sketch of this salting approach appears after this section.
Be aware that a row key is immutable once created and cannot be renamed. To change a row key's name, the original row must be deleted and then re-created with the new name.
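Here is a minimal, assumption-laden Java sketch of the salting approach: it prepends the last four hex characters of an MD5 hash to a timestamp-based key before writing. The table path /user/alice/events and the column names are hypothetical, and the exact salt characters depend on how the timestamp is encoded before hashing, so they may not reproduce the ffe5 value used in the prose example.

import java.security.MessageDigest;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKeySketch {

    // Prepend the last four hex characters of the key's MD5 hash, mirroring
    // the ffe5 + 1364248490 example above. The salt is deterministic, so a
    // reader can recompute it from the timestamp when doing point lookups.
    static String saltKey(String timestamp) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(timestamp.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.substring(hex.length() - 4) + timestamp;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "/user/alice/events");  // hypothetical path

        String rowKey = saltKey("1364248490");
        Put put = new Put(Bytes.toBytes(rowKey));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("event"), Bytes.toBytes("login"));
        table.put(put);
        table.close();
    }
}

Point lookups can recompute the salt from the timestamp; the trade-off, as noted above, is that scans over raw time order become scattered across the cluster.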
Composite keys
Rows in a MapR table can have only a single row key. You can create composite keys to approximate multiple keys in a table. A composite key contains several individual fields joined together, for example userID and applicationID. You can then scan for the specific segments of the composite row key that represent the original, individual field.
Because rows are stored in sorted order, you can affect the results of the sort by changing the ordering of the fields that make up the composite row key. For example, if your application IDs are generated sequentially but your user IDs are not, using a composite key of userID+applicationID will store all rows with the same user ID close together. If you know the userID for which you want to retrieve rows, you can specify the first userID row and the first userID+1 row as the start and stop rows for your scan, then retrieve the rows you're interested in without scanning the entire table.
When designing a composite key, consider how the data will be queried during production use. Place the fields that will be queried most often towards the front of the composite key, bearing in mind that sequential keys will generate hotspotting.

Column family design
Scanning an entire table for matches can be very performance-intensive. Column families enable you to group related sets of data and restrict queries to a defined subset, leading to better performance. When you design a column family, think about what kinds of queries are going to be used most often, and group your columns accordingly. You can also make a specified column family remain in memory to further increase the speed at which the system accesses that data.
You can specify compression settings for individual column families, which lets you choose settings that prioritize speed of access or efficient use of disk space, according to your needs. Because all data stored in a column family is compressed together, encapsulating similar kinds of data within a column family can improve compression.
Be aware of the approximate number of rows in your column families. This property is called the column family's cardinality. When column families in the same table have very disparate cardinalities, the sparser column family's data can be spread out across multiple nodes, due to the denser column family requiring more splits. Scans on the sparser column family can take longer due to this effect. For example, consider a table that lists products across a small range of model numbers, but with a row for the unique serial number of each individual product manufactured within a given model. Such a table will have a very large difference in cardinality between a column family that relates to the model number compared to a column family that relates to the serial number.
Scans on the model-number column family will have to range across the cluster, since the frequent splits required by the comparatively large numbers of serial-number rows will spread the model-number rows out across many regions on many nodes. Column design MapR tables split at the row level, not the column level. For this reason, extremely wide tables with very large numbers of columns can sometimes reach the recommended size for a table split at a comparatively small number of rows. Because MapR tables are , you can add columns to a table at any time. Null columns for a given row don't take up any storage space. sparse Supported Regular Expressions in MapR Tables MapR tables support the regular expressions provided by the , as well as a subset of the complete Perl- Compatible Regular Expressions library set of regular expressions supported in . For more information on Perl compatible regular expressions, issue the java.util.regex.Pattern m command from a terminal prompt. an pcrepattern Applications for Apache HBase that use regular expressions not supported in MapR tables will need to be rewritten to use supported regular expressions. The tables in the following sections define the subset of Java regular expressions supported in MapR tables. Characters Pattern Description x The character x \\ The backslash character \0n The character with octal value 0n (0 <= n <= 7) \0nn The character with octal value 0nn (0 <= n <= 7) \xhh The character with hexadecimal value 0xhh \t The tab character ('\u0009') In general, design your schema to prioritize more rows and fewer columns. \n The newline (line feed) character ('\u000A') \r The carriage-return character ('\u000D') \f The form-feed character ('\u000C') \a The alert (bell) character ('\u0007') \e The escape character ('\u001B') \cx The control character corresponding to x Character Classes Pattern Description [abc] a, b, or c (simple class) [Supported Regular Expressions in MapR Tables^abc] Any character except a, b, or c (negation) [a-zA-Z] a through z or A through Z, inclusive (range) Predefined Character Classes Pattern Description . Any character (may or may not match line terminators) \d A digit: [0-9] \D A non-digit: [Supported Regular Expressions in MapR Tables^0-9] \s A whitespace character: [ \t\n\x0B\f\r] \S A non-whitespace character: [Supported Regular Expressions in MapR Tables^\s] \w A word character: [a-zA-Z_0-9] \W A non-word character: [Supported Regular Expressions in MapR Tables^\w] Classes for Unicode Blocks and Categories Pattern Description \p{Lu} An uppercase letter (simple category) \p{Sc} A currency symbol Boundaries Pattern Description ^ The beginning of a line $ The end of a line \b A word boundary \B A non-word boundary \A The beginning of the input \G The end of the previous match \Z The end of the input but for the final terminator, if any \z The end of the input Greedy Quantifiers Pattern Description X? X, once or not at all X* X, zero or more times X+ X, one or more times X{n} X, exactly n times X{n,} X, at least n times X{n,m} X, at least n but not more than m times Reluctant Quantifiers Pattern Description X?? X, once or not at all X*? X, zero or more times X+? X, one or more times X{n}? X, exactly n times X{n,}? X, at least n times X{n,m}? 
X, at least n but not more than m times Possessive Quantifiers Pattern Description X?+ X, once or not at all X*+ X, zero or more times X++ X, one or more times X{n}+ X, exactly n times X{n,}+ X, at least n times X{n,m}+ X, at least n but not more than m times Logical Operators Pattern Description XY X followed by Y X|Y Either X or Y (X) X, as a capturing group Back References Pattern Description \n Whatever the nth capturing group matches Quotation Pattern Description \ Nothing, but quotes the following character \Q Nothing, but quotes all characters until \E \E Nothing, but ends quoting started by \Q Special Constructs Pattern Description (?:X) X, as a non-capturing group (?=X) X, via zero-width positive lookahead (?!X) X, via zero-width negative lookahead (?<=X) X, via zero-width positive lookbehind (?<!X) X, via zero-width negative lookbehind (?>X) X, as an independent, non-capturing group MapR Table Support for Apache HBase Interfaces This page lists the supported interfaces for accessing MapR tables. This page contains the following topics: Compatibility with the Apache HBase API HBase Shell Commands Compatibility with the Apache HBase API The API for accessing MapR tables is compatible with the Apache HBase API. Code written for Apache HBase can be easily ported to use MapR tables. MapR tables do not support low-level HBase API calls that are used to manipulate the state of an Apache HBase cluster. HBase API calls that are not supported by MapR tables report successful completion to allow legacy code written for Apache HBase to continue executing, but do not perform any actual operations. For details on the behavior of each function, refer to the . Apache HBase API documentation HBaseAdmin API Available for MapR Tables? Comments void addColumn(String tableName, HColumnDescriptor column) Yes   void close() Yes   void createTable (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HB aseAdmin.html#createTable(org.apa che.hadoop.hbase.HTableDescriptor , byte\[\]\[\]))(HTableDescriptor desc, byte[][] splitKeys) Yes This call is synchronous. void createTableAsync (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HB aseAdmin.html#createTableAsync(or g.apache.hadoop.hbase.HTableDescr iptor, byte\[\]\[\]))(HTableDescr iptor desc, byte[][] splitKeys) Yes For MapR tables, this call is identical to createTable. 
void deleteColumn (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/De lete.html#deleteColumn(byte\[\],% 20byte\[\],%20long))(byte[] family, byte[] qualifier, long timestamp) Yes   void deleteTable(String tableName) Yes   HTableDescriptor[] deleteTables(Pa ttern pattern) Yes   Configuration getConfiguration() Yes   HTableDescriptor getTableDescripto r (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HB aseAdmin.html#getTableDescriptor( byte\[\]))(byte[] tableName) Yes   HTableDescriptor[] getTableDescrip tors(List<String> tableNames) Yes   boolean isTableAvailable(String tableName) Yes   boolean isTableDisabled(String tableName) Yes   boolean isTableEnabled(String tableName) Yes   HTableDescriptor[] listTables() Yes   void modifyColumn(String tableName, HColumnDescriptor descriptor) Yes   void modifyTable (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HB aseAdmin.html#modifyTable(byte\[\ ], org.apache.hadoop.hbase.HTableDes criptor))(byte[] tableName, HTableDescriptor htd) No   boolean tableExists(String tableName) Yes   Pair<Integer, Integer> getAlterSta tus (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HB aseAdmin.html#getAlterStatus(byte \[\]))(byte[] tableName) Yes   CompactionState getCompactionState (String tableNameOrRegionName) Yes Returns . CompactionState.NONE void abort(String why, Throwable e) No   void assign (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HB aseAdmin.html#assign(byte\[\]))(b yte[] regionName) No   boolean balancer() No   boolean balanceSwitch(boolean b) No   void closeRegion(ServerName sn, HRegionInfo hri) No   void closeRegion(String regionname, String serverName) No   boolean closeRegionWithEncodedRegi onName(String encodedRegionName, String serverName) No   void flush(String tableNameOrRegionName) No   ClusterStatus getClusterStatus() No   HConnection getConnection() No   HMasterInterface getMaster() No   String[] getMasterCoprocessors() No   boolean isAborted() No   boolean isMasterRunning() No   void majorCompact(String tableNameOrRegionName) No   void move (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HB aseAdmin.html#move(byte\[\], byte\[\]))(byte[] encodedRegionName, byte[] destServerName) No   byte[][] rollHLogWriter(String serverName) No   boolean setBalancerRunning(boolean on, boolean synchronous) No   void shutdown() No   void stopMaster() No   void stopRegionServer(String hostnamePort) No   void unassign (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HB aseAdmin.html#unassign(byte\[\], boolean))(byte[] regionName, boolean force) No   HTable API Available for MapR Tables? Comments Configuration and State Management     void clearRegionCache() No Operation is silently ignored. void close() Yes   <T extends CoprocessorProtocol, R> Map<byte[], R> coprocessorExec(Class<T> protocol, byte[] startKey, byte[] endKey, Call<T, R> callable) No Returns . null <T extends CoprocessorProtocol> T coprocessorProxy(Class<T> protocol, byte[] row) No Returns . 
null Map<HRegionInfo, HServerAddress> deserializeRegionInfo(DataInput in) Yes   void flushCommits() Yes   Configuration getConfiguration() Yes   HConnection getConnection() No Returns null int getOperationTimeout() No Returns null ExecutorService [getPool() No Returns null int getScannerCaching() No Returns 0 ArrayList<Put> getWriteBuffer() No Returns null long getWriteBufferSize() No Returns 0 boolean isAutoFlush() Yes   void prewarmRegionCache(Map<HRegionInf o, HServerAddress> regionMap) No Operation is silently ignored. void serializeRegionInfo(DataOutput out) Yes   void setAutoFlush(boolean autoFlush, boolean clearBufferOnFail) Same as setAutoFlush(boolean autoFlush)   void setAutoFlush(boolean autoFlush) Yes   void setFlushOnRead(boolean val) Yes   boolean shouldFlushOnRead() Yes   void setOperationTimeout(int operationTimeout) No Operation is silently ignored. void setScannerCaching(int scannerCaching) No Operation is silently ignored. void setWriteBufferSize(long writeBufferSize) No Operation is silently ignored. Atomic operations     Result append(Append append) Yes   boolean checkAndDelete(byte\[\] row, byte\[\] family, byte\[\] qualifier, byte\[\] value, Delete delete) (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HT able.html#checkAndDelete(byte\[\] , byte\[\], byte\[\], byte\[\], org.apache.hadoop.hbase.client.De lete)) Yes   boolean checkAndPut(byte\[\] row, byte\[\] family, byte\[\] qualifier, byte\[\] value, Put put) (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HT able.html#checkAndPut(byte\[\], byte\[\], byte\[\], byte\[\], org.apache.hadoop.hbase.client.Pu t)) Yes   Result increment(Increment increment) Yes   long incrementColumnValue(byte\[\] row, byte\[\] family, byte\[\] qualifier, long amount, boolean writeToWAL) (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HT able.html#incrementColumnValue(by te\[\], byte\[\], byte\[\], long, boolean)) Yes   long incrementColumnValue(byte\[\] row, byte\[\] family, byte\[\] qualifier, long amount) (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HT able.html#incrementColumnValue(by te\[\], byte\[\], byte\[\], long)) Yes   void mutateRow(RowMutations rm) (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HT able.html#incrementColumnValue(by te\[\], byte\[\], byte\[\], long)) Yes   DML operations     void batch(List actions, Object\[\] results) (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HT able.html#batch(java.util.List, java.lang.Object\[\])) Yes   Object[] batch(List<? extends Row> actions) Yes   void delete(Delete delete) Yes   void delete(List<Delete> deletes) Yes   boolean exists(Get get) Yes   Result get(Get get) Yes   Result[] get(List<Get> gets) Yes   Result getRowOrBefore(byte\[\] row, byte\[\] family) (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HT able.html#getRowOrBefore(byte\[\] , byte\[\])) No   ResultScanner getScanner(...) 
(http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HT able.html#getScanner(byte\[\])) Yes   void put(Put put) Yes   void put(List<Put> puts) Yes   Table Schema Information     HRegionLocation getRegionLocation(byte\[\] row, boolean reload) (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HT able.html#getRegionLocation(byte\ [\], boolean)) Yes   Map<HRegionInfo, HServerAddress> getRegionsInfo() Yes   List<HRegionLocation> getRegionsInRange(byte\[\] startKey, byte\[\] endKey) (http://hbase.apache.org/apidocs/ org/apache/hadoop/hbase/client/HT able.html#getRegionsInRange(byte\ [\], byte\[\])) Yes   byte[][] getEndKeys() Yes   byte[][] getStartKeys() Yes   Pair<byte[][], byte[][]> getStartEndKeys() Yes   HTableDescriptor getTableDescript or() Yes   byte[] getTableName() Yes Returns table path Row Locks     RowLock lockRow(byte[] row) No   void unlockRow(RowLock rl) No   HBase Shell Commands The following table lists support information for HBase shell commands for managing MapR tables. Command Available for MapR Tables? Comments alter Yes   alter_async Yes   create Yes   describe Yes   disable Yes   drop Yes   enable Yes   exists Yes   is_disabled Yes   is_enabled Yes   list Yes   disable_all Yes   drop_all Yes   enable_all Yes   show_filters Yes   count Yes   get Yes   put Yes   scan Yes   delete Yes   deleteall Yes   incr Yes   truncate Yes   get_counter Yes   assign No   balance_switch No   balancer No   close_region No   major_compact No   move No   unassign No   zk_dump No   status No   version Yes   whoami Yes   Using AsyncHBase with MapR Tables You can use the to provide asynchronous access to MapR tables. MapR provides a of AsyncHBase modified to AsyncHBase libraries version work with MapR tables. Once your cluster is ready to use MapR tables, it is also ready to use AsyncHBase with MapR tables. After installation, the AsyncHBase JAR file is in the directory . Add that directory to your Java /opt/mapr/hadoop/hadoop-0.20.2/lib CLAS . SPATH See also Documentation for AsyncHBase client Using OpenTSDB with AsyncHBase and MapR Tables The software package provides a time series database that collects user-specified data. Because OpenTSDB depends on OpenTSDB AsyncHBase, MapR provides a customized version of OpenTSDB that works with MapR's version of AsyncHBase in order to provide compatibility with MapR tables. Download the OpenTSDB source from the repository instead of the standard [email protected]:mapr/opentsdb.git gith location. ub.com/OpenTSDB/opentsdb.git Nodes using OpenTSDB must have the following packages installed: One of or mapr-core mapr-client mapr-hbase-<version). To maintain compatibility with your existing HBase applications and workflow, be sure to install the mapr-hb package that provides the same version number of HBase as your existing Apache HBase. ase You can follow the directions at OpenTSDB's page, changing to Getting Started git clone git://github.com/OpenTSDB/opentsdb.git . git clone git://github.com/mapr/opentsdb.git After running the script, replace the contents of the file with the build.sh /opt/mapr/hadoop/hadoop-0.20.2/conf/core-site.xml contents of the on all nodes. Set up a table to specify the full path to the and /opentsdb/core-site.xml.template mapping tsdb tsdb-ui tables. Create the directories in that path before running the script. 
d create_table.sh See also Documentation for OpenTSDB Protecting Table Data This page discusses how to organize tables and files on a MapR cluster by making effective use of directories and volumes. This page contains the following topics: Organizing Tables and Files in Directories Controlling Table Storage Policy with Volumes Mirrors and Snapshots for MapR Tables Comparison to Apache HBase Running on a MapR Cluster Related Topics Organizing Tables and Files in Directories Because the 3.0 release of the MapR distribution for Hadoop mingles unstructured files with structured tables in a directory structure, you can group logically-related files and tables together. For example, tables related to a project housed in directory can be saved in a /user/foo subdirectory, such as . /user/foo/tables Listing the contents of a directory with lists both tables and files stored at that path. Because table data is not structured as a simple character ls stream, you cannot operate on table data with common Linux commands such as , , and . See for more cat more > Filesystem Operations information on Linux file system operations with MapR tables. Example: Creating a MapR table in a directory using the HBase shell In this example, we create a new table in directory on a MapR cluster that already contains a mix of files and tables. In this table3 /user/dave example, the MapR cluster is mounted at . /maprcluster/ $ pwd /maprcluster/user/dave $ ls file1 file2 table1 table2 $ hbase shell hbase(main):003:0> create '/user/dave/table3', 'cf1', 'cf2', 'cf3' 0 row(s) in 0.1570 seconds $ ls file1 file2 table1 table2 table3 $ hadoop fs -ls /user/dave Found 5 items -rw-r--r-- 3 mapr mapr 16 2012-09-28 08:34 /user/dave/file1 -rw-r--r-- 3 mapr mapr 22 2012-09-28 08:34 /user/dave/file2 trwxr-xr-x 3 mapr mapr 2 2012-09-28 08:32 /user/dave/table1 trwxr-xr-x 3 mapr mapr 2 2012-09-28 08:33 /user/dave/table2 trwxr-xr-x 3 mapr mapr 2 2012-09-28 08:38 /user/dave/table3 Note that in the listing, table items are denoted by a bit. hadoop fs -ls t Controlling Table Storage Policy with Volumes MapR provides volumes as a way to organize data and manage cluster performance. A is a logical unit that allows you to apply policies to volume a set of files, directories, and tables. Volumes are used to enforce disk usage limits, set replication levels, define snapshots and mirrors, and establish ownership and accountability. Because MapR tables are stored in volumes, these same storage policy controls apply to table data. As an example, the diagram below depicts a MapR cluster storing table and file data. The cluster has three separate volumes mounted at directories , , and /user/john /user/dave /proj . As shown, each directory contains both file data and table data, grouped together logically. Because each of these directories maps to ect/ads a different volume, data in each directory can have different policy. For example, has a disk-usage quota, while is on /user/john /user/dave a snapshot schedule. Furthermore, two directories, and are mirrored to locations outside the cluster, providing /user/john /project/ads read-only access to high-traffic data, including the tables in those volumes. Example: Restricting table storage with quotas and physical topology This example creates a table with disk usage quota of 100GB restricted to certain data nodes in the cluster. First we create a volume named pro , specifying the quota and restricting storage to nodes in the topology, and mounting it in the local ject-tables-vol /data/rack1 namespace. 
Next we use the HBase shell to create a new table named , specifying a path inside the volume datastore project-tables-vol . $ pwd /mapr/cluster1/user/project $ls bin src $ maprcli volume create -name project-tables-vol -path /user/project/tables \ -quota 100G -topology /data/rack1 $ ls bin src tables $ hbase shell HBase Shell; enter 'help<RETURN>' for list of supported commands. Type "exit<RETURN>" to leave the HBase Shell Version 0.94.1-SNAPSHOT, rUnknown, Thu Oct 25 09:28:51 PDT 2012 hbase(main):001:0> create '/user/project/tables/datastore', 'colfamily1' 0 row(s) in 0.5180 seconds hbase(main):002:0> exit $ ls -l tables total 1 lrwxr-xr-x 1 mapr mapr 2 Oct 25 15:20 datastore -> mapr::table::2252.32.16498 1. 2. Mirrors and Snapshots for MapR Tables Because MapR tables are stored in volumes, you can take advantage of MapR , , and , which operate at the Schedules Mirror Volumes Snapshots volume level. Mirrors and snapshots are read-only copies of specific volumes on the cluster, which can be used to provision for disaster recovery and improved access time for high-bandwidth data. To access tables in snapshots or mirrors, HBase programs access a table path in a mirror or snapshot volume. You can set policy for volumes using the or the commands. For details, see . MapR Control System maprcli Managing Data with Volumes Comparison to Apache HBase Running on a MapR Cluster Prior to MapR version 3.0, the only option for HBase users was to run Apache HBase on top of the MapR cluster. For the purposes of illustration, this section contrasts how running Apache HBase on a MapR cluster differs from the integrated tables in MapR. As shown in the diagram below, installing Apache HBase on a MapR cluster involves storing all HBase components in a single volume mapped to directory in the cluster. Compared to the MapR implementation shown above, this method has the following differences: /hbase Tables are stored in a flat namespace, not grouped logically with related files. Because all Apache HBase data resides in one volume, only one set of storage policies can be applied to the entire Apache HBase datastore. Mirrors and snapshots of the HBase volume do not provide functional replication of the datastore. Despite this limitation, mirrors can be used to backup HLogs and HFiles in order to provide a recovery point for Apache HBase data. Related Topics Managing Data with Volumes Working With MapR Tables and Column Families Displaying Table Region Information MapR tables are split into regions on an ongoing basis. Administrators and developers do not need to manage region splits or data compaction. There are no settings or operations to control region splits or data compaction for MapR tables. You can list region information for tables to get a sense of the size and location of table data on the MapR cluster. Examining Table Region Information in the MapR Control System In the MCS pane under the group, click . The tab appears in the main window. Navigation MapR Data Platform Tables Tables 2. 3. 4. 1. 2. Find the table you want to work with, using one of the following methods. Scan for the table under on the tab. Recently Opened Tables Tables Enter the table pathname in the field and click . Go to table Go Click the desired table name. A tab appears in the main MCS pane, displaying information for the specific table. Table Click the tab. The tab displays region information for the table. 
Regions Regions Listing Table Region Information at the Command Line Use the command: maprcli table region $ maprcli table region list -path <path to table> sk sn ek pn lhb -INFINITY hostname1, hostname2 INFINITY hostname3 0 Integrating Hive and MapR Tables You can create MapR tables from Hive that can be accessed by both Hive and MapR. With this functionality, you can run Hive queries on MapR tables. You can also convert existing MapR tables into Hive-MapR tables, running Hive queries on those tables as well. Install and Configure Hive Configure the the File hive-site.xml Getting Started with Hive-MapR Integration Create a Hive table with two columns: Start the HBase shell: Zookeeper Connections Install and Configure Hive Install and configure Hive if it is not already installed. Execute the command and ensure that all relevant Hadoop, MapR, and Zookeeper processes are running. jps Example: $ jps 21985 HRegionServer 1549 jenkins.war 15051 QuorumPeerMain 30935 Jps 15551 CommandServer 15698 HMaster 15293 JobTracker 15328 TaskTracker 15131 WardenMain Configure the the File hive-site.xml 1. Open the file with your favorite editor, or create a file if it doesn't already exist: hive-site.xml hive-site.xml $ cd $HIVE_HOME $ vi conf/hive-site.xml 2. Copy the following XML code and paste it into the file. hive-site.xml Note: If you already have an existing file with a element block, just copy the element block code hive-site.xml configuration property below and paste it inside the element block in the file. Be sure to use the correct values for the paths to your configuration hive-site.xml auxiliary JARs and ZooKeeper IP numbers. Example configuration: <configuration> <property> <name>hive.aux.jars.path</name> <value>file:///opt/mapr/hive/hive-0.10.0/lib/hive-hbase-handler-0.10.0-mapr.jar,file:/ //opt/mapr/hbase/hbase-0.94.5/hbase-0.94.5-mapr.jar,file:///opt/mapr/zookeeper/zookeep er-3.3.6/zookeeper-3.3.6.jar</value> <description>A comma separated list (with no spaces) of the jar files required for Hive-HBase integration</description> </property> <property> <name>hbase.zookeeper.quorum</name> <value>xx.xx.x.xxx,xx.xx.x.xxx,xx.xx.x.xxx</value> <description>A comma separated list (with no spaces) of the IP addresses of all ZooKeeper servers in the cluster.</description> </property> <property> <name>hbase.zookeeper.property.clientPort</name> <value>5181</value> <description>The Zookeeper client port. The MapR default clientPort is 5181.</description> </property> </configuration> 3. Save and close the file. hive-site.xml If you have successfully completed all of the steps in this section, you're ready to begin the tutorial in the next section. Getting Started with Hive-MapR Integration In this tutorial we will: Create a Hive table Populate the Hive table with data from a text file Query the Hive table Create a Hive-MapR table Introspect the Hive-MapR table from the HBase shell Populate the Hive-MapR table with data from the Hive table Query the Hive-MapR table from Hive Convert an existing MapR table into a Hive-MapR table Be sure that you have successfully completed all of the steps in the and sections Install and Configure Hive Setting Up MapR-FS to Use Tables before beginning this Getting Started tutorial. This Getting Started tutorial is based on the section of the Apache Hive Wiki, and thanks to Samuel Guo and other Hive-HBase Integration contributors to that effort. If you are familiar with their approach to Hive-HBase integration, you should be immediately comfortable with this material. 
However, please note that there are some significant differences in this Getting Started section, especially in regards to configuration and command parameters or the lack thereof. Follow the instructions in this Getting Started tutorial to the letter so you can have an enjoyable and successful experience. Create a Hive table with two columns: Change to your Hive installation directory if you're not already there and start Hive: $ cd $HIVE_HOME $ bin/hive Execute the CREATE TABLE command to create the Hive pokes table: hive> CREATE TABLE pokes (foo INT, bar STRING); To see if the pokes table has been created successfully, execute the SHOW TABLES command: hive> SHOW TABLES; OK pokes Time taken: 0.74 seconds The table appears in the list of tables. pokes Populate the Hive pokes table with data The file is provided in the directory. Execute the LOAD DATA LOCAL INPATH command to populate kv1.txt $HIVE_HOME/examples/files the Hive table with data from the file. pokes kv1.txt hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes; A message appears confirming that the table was created successfully, and the Hive prompt reappears: Copying data from file: ... OK Time taken: 0.278 seconds hive> Execute a SELECT query on the Hive pokes table: hive> SELECT * FROM pokes WHERE foo = 98; The SELECT statement executes, runs a MapReduce job, and prints the job output: OK 98 val_98 98 val_98 Time taken: 18.059 seconds The output of the SELECT command displays two identical rows because there are two identical rows in the Hive table with a key of 98. pokes To create a Hive-MapR table, enter these four lines of code at the Hive prompt: hive> CREATE TABLE mapr_table_1(key int, value string) > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val") > TBLPROPERTIES ("hbase.table.name" = "/user/mapr/xyz"); After a brief delay, a message appears confirming that the table was created successfully: OK Time taken: 5.195 seconds Note: The TBLPROPERTIES command is not required, but those new to Hive-MapR integration may find it easier to understand what's going on if Hive and MapR use different names for the same table. In this example, Hive will recognize this table as "mapr_table_1" and MapR will recognize this table as "xyz". Start the HBase shell: Keeping the Hive terminal session open, start a new terminal session for HBase, then start the HBase shell: Hive tables can have multiple identical keys. As we will see shortly, MapR tables cannot have multiple identical keys, only unique keys. $ cd $HBASE_HOME $ bin/hbase shell HBase Shell; enter 'help<RETURN>' for list of supported commands. Type "exit<RETURN>" to leave the HBase Shell Version 0.90.4, rUnknown, Wed Nov 9 17:35:00 PST 2011 hbase(main):001:0> Execute the list command to see a list of HBase tables: hbase(main):001:0> list TABLE /user/mapr/xyz 1 row(s) in 0.8260 seconds HBase recognizes the Hive-MapR table named in directory . This is the same table known to Hive as . 
xyz /user/mapr mapr_table_1 Display the description of the /user/mapr/xyz table in the HBase shell: hbase(main):004:0> describe "/user/mapr/xyz" DESCRIPTION ENABLED {NAME => '/user/mapr/xyz', FAMILIES => [{NAME => 'cf1', DATA_B true LOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REP LICATION_SCOPE => '0', VERSIONS => '3', MIN_VERSION S => '0', TTL => '2147483647', KEEP_DELETED_CELLS = > 'false', BLOCKSIZE => '65536', IN_MEMORY => 'fals e', ENCODE_ON_DISK => 'true', BLOCKCACHE => 'true'} ]} 1 row(s) in 0.0240 seconds From the Hive prompt, insert data from the Hive table pokes into the Hive-MapR table mapr_table_1: hive> INSERT OVERWRITE TABLE mapr_table_1 SELECT * FROM pokes WHERE foo=98; ... 2 Rows loaded to mapr_table_1 OK Time taken: 13.384 seconds Query mapr_table_1 to see the data we have inserted into the Hive-MapR table: hive> SELECT * FROM mapr_table_1; OK 98 val_98 Time taken: 0.56 seconds Even though we loaded two rows from the Hive table that had the same key of 98, only one row was actually inserted into pokes mapr_table_1 . This is because is a MapR table, and although Hive tables support duplicate keys, MapR tables only support unique keys. mapr_table_1 MapR tables arbitrarily retain only one key, and silently discard all of the data associated with duplicate keys. Convert a pre-existing MapR table to a Hive-MapR table To convert a pre-existing MapR table to a Hive-MapR table, enter the following four commands at the Hive prompt. Note that in this example the existing MapR table is in directory . mapr_table_2 /user/mapr hive> CREATE EXTERNAL TABLE mapr_table_2(key int, value string) > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' > WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:val") > TBLPROPERTIES("hbase.table.name" = "/user/mapr/my_mapr_table"); Now we can run a Hive query against the pre-existing MapR table that Hive sees as : /user/mapr/my_mapr_table mapr_table_2 hive> SELECT * FROM mapr_table_2 WHERE key > 400 AND key < 410; Total MapReduce jobs = 1 Launching Job 1 out of 1 Number of reduce tasks is set to 0 since there's no reduce operator ... OK 401 val_401 402 val_402 403 val_403 404 val_404 406 val_406 407 val_407 409 val_409 Time taken: 9.452 seconds Zookeeper Connections If you see a similar error message to the following, ensure that and hbase.zookeeper.quorum hbase.zookeeper.property.clientPort are properly defined in the file. $HIVE_HOME/conf/hive-site.xml 1. Failed with exception java.io.IOException:org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately. This could be a sign that the server has too many connections (30 is the default). Consider inspecting your ZK server logs for that error and then make sure you are reusing HBaseConfiguration as often as you can. See HTable's javadoc for more information. Migrating Between Apache HBase Tables and MapR Tables MapR tables can be parsed by the ( ). You can use the Apache CopyTable tool org.apache.hadoop.hbase.mapreduce.CopyTable CopyTable tool to migrate data from an Apache HBase table to a MapR table or from a MapR table to an Apache HBase table. Before You Start Before migrating your tables to another platform, consider the following points: Schema Changes. Apache HBase and MapR tables have different limits on the number of column families. If you are migrating to MapR, you may be interested in changing your table's to take advantage of the increased availability of column families. 
Migrating Between Apache HBase Tables and MapR Tables
MapR tables can be parsed by the Apache CopyTable tool (org.apache.hadoop.hbase.mapreduce.CopyTable). You can use the CopyTable tool to migrate data from an Apache HBase table to a MapR table, or from a MapR table to an Apache HBase table.

Before You Start
Before migrating your tables to another platform, consider the following points:
Schema Changes: Apache HBase and MapR tables have different limits on the number of column families. If you are migrating to MapR tables, you may want to change your table's schema to take advantage of the increased availability of column families. Conversely, if you are migrating from MapR tables to Apache HBase, you may need to adjust your schema to reflect the reduced availability of column families.
API Mappings: If you are migrating from Apache HBase to MapR tables, examine your current HBase applications to verify that the APIs and HBase shell commands they use are fully supported.
Namespace Mapping: If the migration will take place over a period of time, be sure to plan your table namespace mappings in advance to ease the transition.
Implementation Limitations: MapR tables do not support HBase coprocessors. If your existing Apache HBase installation uses coprocessors, plan any necessary modifications in advance. MapR tables support a subset of the regular expressions supported in Apache HBase. Check your existing workflow and HBase applications to verify that you are not using unsupported regular expressions.
If you are migrating to MapR tables, be sure to change your Apache HBase client to the MapR client by installing the version of the mapr-hbase package that matches the version of Apache HBase on your source cluster. See Installing MapR Software for information about MapR installation procedures, including setting up the proper repositories.

Compression Mappings
MapR tables support the LZ4, LZF, and ZLIB compression algorithms. When you create a MapR table with the Apache HBase API or the HBase shell and specify the LZ4, LZO, or SNAPPY compression algorithm, the resulting MapR table uses the LZ4 compression algorithm. When you describe a MapR table's schema through the HBase API, the LZ4 and OLDLZF compression algorithms map to the LZ4 compression algorithm.
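To illustrate the mapping, the following HBase shell sketch creates a table that requests SNAPPY compression for its column family and then describes it; the table path and family name are examples only. Per the mapping above, the resulting MapR table is stored with LZ4 compression.
hbase(main):001:0> create '/user/mapr/comp_test', {NAME => 'cf1', COMPRESSION => 'SNAPPY'}
hbase(main):002:0> describe '/user/mapr/comp_test'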
Copying Data
Launch the CopyTable tool with the following command, specifying the source table name and the full destination path of the table with the --new.name parameter:
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=/user/john/foo/mytable01 mytable01
The CopyTable tool launches a MapReduce job. The nodes on your cluster must have the correct version of the mapr-hbase package installed. To maintain compatibility with your existing HBase applications and workflow, be sure to install the mapr-hbase package that provides the same version number of HBase as your existing Apache HBase.

Example: Migrating an Apache HBase table to a MapR table
This example migrates the existing Apache HBase table mytable01 to the MapR table /user/john/foo/mytable01.
1. On the node in the MapR cluster where you will launch the CopyTable tool, modify the value of the hbase.zookeeper.quorum property in the hbase-site.xml file to point at a ZooKeeper node in the source cluster. Alternatively, you can specify the value for the hbase.zookeeper.quorum property on the command line, as this example does.
2. Create the destination table. This example uses the HBase shell; the CLI and the MapR Control System (MCS) are also viable methods.
[user@host] hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.94.3-SNAPSHOT, rUnknown, Thu Mar 7 10:15:47 PST 2013
hbase(main):001:0> create '/user/john/foo/mytable01', 'usernames', 'userpath'
0 row(s) in 0.2040 seconds
3. Exit the HBase shell.
hbase(main):002:0> exit
[user@host]
4. From the HBase command line, use the CopyTable tool to migrate data:
[user@host] hbase org.apache.hadoop.hbase.mapreduce.CopyTable -Dhbase.zookeeper.quorum=zknode1,zknode2,zknode3 --new.name=/user/john/foo/mytable01 mytable01

Verifying Migration
After copying data to the new tables, verify that the migration is complete and successful. In increasing order of complexity:
1. Verify that the destination table exists. From the HBase shell, use the list command, or use the ls /user/john/foo command from a Linux prompt:
hbase(main):006:0> list '/user/john/foo'
TABLE
/user/john/foo/mytable01
1 row(s) in 0.0770 seconds
On a MapR cluster, list only lists the most recently used MapR tables. When you specify a path to the list command, the command verifies that a table exists at that path.
2. Check the number of rows in the source table against the destination table with the count command:
hbase(main):005:0> count '/user/john/foo/mytable01'
30 row(s) in 0.1240 seconds
3. Hash each table, then compare the hashes.

Decommissioning the Source
After verifying a successful migration, you can decommission the source nodes where the tables were originally stored.

Decommissioning a MapR Node
Before you start, drain the node of data by moving the node to the /decommissioned physical topology. All the data on a node in the /decommissioned topology is migrated to volumes and nodes in the /data topology.
1. Run the following command to check whether a given volume is present on the node:
maprcli dump volumenodes -volumename <volume> -json | grep <ip:port>
Run this command for each non-local volume in your cluster to verify that the node being decommissioned is not storing any volume data.
2. Change to the root user (or use sudo for the following commands).
3. Stop the Warden:
service mapr-warden stop
4. If ZooKeeper is installed on the node, stop it:
service mapr-zookeeper stop
5. Determine which MapR packages are installed on the node:
dpkg --list | grep mapr (Ubuntu)
rpm -qa | grep mapr (Red Hat or CentOS)
6. Remove the packages by issuing the appropriate command for the operating system, followed by the list of services. Examples:
apt-get purge mapr-core mapr-cldb mapr-fileserver (Ubuntu)
yum erase mapr-core mapr-cldb mapr-fileserver (Red Hat or CentOS)
7. Remove the /opt/mapr directory to remove any instances of hostid, hostname, zkdata, and zookeeper left behind by the package manager.
8. Remove any MapR cores in the /opt/cores directory.
9. If the node you have decommissioned is a CLDB node or a ZooKeeper node, run configure.sh on all other nodes in the cluster (see Configuring a Node).

Decommissioning Apache HBase Nodes
To decommission nodes running Apache HBase, follow these steps for each node:
1. From the HBase shell, disable the Region Load Balancer by setting the value of balance_switch to false:
hbase(main):001:0> balance_switch false
2. Leave the HBase shell by typing exit.
3. Run the graceful stop script to stop the HBase RegionServer:
[user@host] ./bin/graceful_stop.sh <hostname>

Administration Guide
Welcome to the MapR Administration Guide! This guide is for system administrators tasked with managing MapR clusters. Topics include how to manage data by using volumes; how to monitor the cluster for performance; how to manage users and groups; how to add and remove nodes from the cluster; and more. The focus of the Administration Guide is managing the nodes and services that make up a cluster. For details on fine-tuning MapR for specific jobs, see the Development Guide. The Administration Guide does not cover the details of installing MapR software on a cluster.
See Development Guide Installation for details on planning and installing a MapR cluster. Guide Click on one of the sub-sections below to get started. Monitoring Alarms and Notifications Centralized Logging Monitoring Node Metrics Service Metrics Job Metrics Setting up the MapR Metrics Database Third-Party Monitoring Tools Configuring Email for Alarm Notifications Managing Data with Volumes Mirror Volumes Schedules Snapshots Data Protection Managing the Cluster Balancers Central Configuration Disks Nodes Services Startup and Shutdown TaskTracker Blacklisting Uninstalling MapR Designating NICs for MapR Placing Jobs on Specified Nodes Security PAM Configuration Secured TaskTracker Subnet Whitelist Users and Groups Managing Permissions Managing Quotas Setting the Administrative User Converting a Cluster from Root to Non-root User Working with Multiple Clusters Setting Up MapR NFS High Availability NFS Setting Up VIPs for NFS Setting up a MapR Cluster on Amazon Elastic MapReduce Troubleshooting Cluster Administration The script does not look up the hostname for an IP number. Do not pass an IP number to the script. graceful_stop.sh Check the list of RegionServers in the Apache HBase Master UI to determine the hostname for the node being decommissioned. 1. 2. 1. 2. 3. MapR Control System doesn't display on Internet Explorer 'ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils' in cldb.log, caused by mixed-case cluster name in mapr-clusters.conf Error 'mv Failed to rename maprfs...' when moving files across volumes How to find a node's serverid Out of Memory Troubleshooting Client Compatibility Matrix Mirroring with Multiple Clusters Monitoring This section provides information about monitoring the cluster. Click a subtopic below for more details. Alarms and Notifications Centralized Logging Monitoring Node Metrics Service Metrics Job Metrics Setting up the MapR Metrics Database Third-Party Monitoring Tools Ganglia Nagios Integration Configuring Email for Alarm Notifications Alarms and Notifications On a cluster with an M5 license, MapR raises alarms and sends notifications to alert you to information about the cluster: Cluster health, including disk failures Volumes that are under-replicated or over quota Services not running You can see any currently raised alarms in the view of the MapR Control System, or using the command. For a list of all Alarms Views alarm list alarms, see . Alarms Reference To view cluster alarms using the MapR Control System: In the Navigation pane, expand the group and click the view. Cluster Dashboard All alarms for the cluster and its nodes and volumes are displayed in the pane. Alarms To view node alarms using the MapR Control System: In the Navigation pane, expand the group and click the view. Alarms Node Alarms You can also view node alarms in the view, the view, and the pane of the view. Node Properties NFS Alarm Status Alarms Dashboard To view volume alarms using the MapR Control System: In the Navigation pane, expand the group and click the view. Alarms Volume Alarms You can also view volume alarms in the pane of the view. Alarms Dashboard Notifications When an alarm is raised, MapR can send an email notification to either or both of the following addresses: The owner of the cluster, node, volume, or entity for which the alarm was raised (standard notification) A custom email address for the named alarm. You can set up alarm notifications using the command or from the view in the MapR Control System. 
alarm config save Alarms Views To set up alarm notifications using the MapR Control System: In the Navigation pane, expand the group and click the view. Alarms Alarm Notifications Display the dialog by clicking . Configure Alarm Subscriptions Alarm Notifications For each : Alarm To send notifications to the owner of the cluster, node, volume, or entity: select the checkbox. Standard Notification 3. 4. To send notifications to an additional email address, type an email address in the field. Additional Email Address Click to save the configuration changes. Save Centralized Logging Analyzing log files is an essential part of tracking and tuning Hadoop jobs and tasks. MapR's Centralized Logging feature, new with the v2.0 release, makes job analysis easier than it has ever been before. MapR's Centralized Logging feature provides a job-centric view of all log files generated by tracker nodes throughout the cluster. During or after execution of a job, use the command to create a centralized log directory populated with symbolic links to all log files maprcli job linklogs related to tasks, map attempts, and reduce attempts pertaining to the specified job(s). If MapR-FS is mounted using NFS, you can use standard tools like and to investigate issues which may be distributed across multiple nodes in the cluster. grep find Log files contain details such as which Mapper and Reducer tasks ran on which nodes; how many attempts were tried; and how long each attempt lasted. The distributed nature of MapReduce processing has historically created challenges for analyzing the execution of jobs, because Mapper and Reducer tasks are scattered throughout the cluster. Task-related logs are written out by task trackers running on distributed nodes, and each node might be processing tasks for multiple jobs simultaneously. Without Centralized Logging, a user with access to all nodes would need to access all log files created by task trackers, filter out information from unrelated jobs, and then merge together log details in order to get a complete picture of job execution. Centralized Logging automates all of these steps. Usage Use the to initiate Centralized Logging: maprcli maprcli job linklogs -jobid <jobPattern> -todir <maprfsDir> [ -jobconf <pathToJobXml> ] The following directory structure will be created under specified directory for all matching . <maprfsDir> jobids <jobPattern> <jobid>/hosts/<host>/ contains symbolic links to log directories of tasks executed for on <jobid> <host> <jobid>/mappers/ contains symbolic links to log directories of all map task attempts for across the whole cluster <jobid> <jobid>/reducers/ contains symbolic links to log directories of all reduce task attempts for across the whole cluster <jobid> You can use any glob prefixed , otherwise, is automatically prepended. There is just one match if the full job id is used. job job This command uses the centralized job history location as specified in your current configuration by mapred.job.tracker.history.comple , by default . If location has changed since the job(s) of interest ted.location /var/mapr/cluster/mapred/jobTracker/history/done was run, you can supply the optional parameter. jobconf Examples maprcli job linklogs -jobid job_201204041514_0001 -todir /myvolume/joblogviewdir link logs of a single job  job_201204041514_0001. maprcli job linklogs -jobid job_${USER} -todir /myvolume/joblogviewdir link logs of all jobs by the current shell user. 
maprcli job linklogs -jobid job_*_wordcount1 -todir /myvolume/joblogviewdir link logs all jobs named wordcount1. Managing Centralized Logging The centralized logging feature is enabled by default and is managed by the value of the parameter in HADOOP_TASKTRACKER_ROOT_LOGGER the file. The default value of this parameter is . /opt/mapr/hadoop/hadoop-<version>/conf/hadoop-env.sh INFO,maprfsDRFA Disabling centralized logging To disable centralized logging, set the value of the parameter in the HADOOP_TASKTRACKER_ROOT_LOGGER /opt/mapr/hadoop/hadoop-<v file to . ersion>/conf/hadoop-env.sh INFO,DRFA Enabling Centralized Logging can cause corruption in task attempt log files in some cases. If your cluster is experiencing task attempt log file corruption, central logging. disable 1. 2. 3. Enabling Centralized Logging After Upgrading from Version 1.x If you upgraded your cluster from version 1.x to 2.0 or later, you must explicitly enable the v2.0 features (which include Centralized Logging). New features are not enabled by default after upgrading. Perform the following steps to enable Centralized Logging. Execute the following command from a command line to enable v2.0 features. maprcli config save -values \{cldb.v2.features.enabled:1\} On all nodes running TaskTracker or JobTracker, backup the existing log properties file, and replace it with a new version that enables Centralized Logging. cd /opt/mapr/hadoop/hadoop-<version>/conf mv log4j.properties log4j.properties.old cp ../conf.new/log4j.properties . Restart TaskTracker maprcli node services -tasktracker restart -nodes <list of nodes> For further details, see . Upgrade Guide Monitoring Node Metrics You can examine fine-grained analytics information about the nodes in your cluster by using the API with the command-li Node Metrics maprcli ne tool. You can use this information to examine specific aspects of your node's performance at a very granular level. The API node metrics returns data as a table sent to your terminal's standard output or as a JSON file. The JSON file includes in-depth reports on the activity on each CPU, disk partition, and network interface in your node. The service writes raw metrics data to the volumes as hourly files. The files rotate as hoststats /var/mapr/local/<hostname>/metrics specified by the parameter in the file. metric.file.rotate db.conf The Metrics database holds data in the following tables: METRIC_TRANSACTION: Data is written every 10 seconds and partitioned by day. This table retains up to three days of data. METRIC_TRANSACTION_EVENT: Holds data regarding system events such as service starts/stops and disk additions. METRIC_TRANSACTION_SUMMARY_DAILY: Holds the aggregate, maximum, and minimum values for five-minute intervals of the data in the METRIC_TRANSACTION table. This table is partitioned by day and holds up to 15 days' worth of data. METRIC_TRANSACTION_SUMMARY_YEARLY: Holds the aggregate, maximum, and minimum values for daily intervals of the data in the METRIC_TRANSACTION_SUMMARY_DAILY table. This table is partitioned by year and holds up to 100 years' worth of data. Retention of Node Metrics Data You can purge node metrics data from the database without affecting the cluster's stability. This data will not be restored after purging. Node metrics cover the following general categories: CPU time used Memory used RPC activity Process activity Storage used TaskTracker resources used Service Metrics MapR services produce metrics that can be written to an output file or consumed by . 
The file metrics output is directed by the Ganglia hadoop-m 1. 2. 3. 1. 1. files. etrics.properties By default, the CLDB and FileServer metrics are sent via unicast to the Ganglia gmon server running on localhost. To send the metrics directly to a Gmeta server, change the property to the hostname of the Gmeta server. To send the metrics to a multicast channel, change cldb.servers the property to the IP address of the multicast channel. cldb.servers Metrics Collected Below are the kinds of metrics that can be collected. CLDB FileServers Number of FileServers Number of Volumes Number of Containers Cluster Disk Space Used GB Cluster Disk Space Available GB Cluster Disk Capacity GB Cluster Memory Capacity MB Cluster Memory Used MB Cluster Cpu Busy % Cluster Cpu Total Number of FS Container Failure Reports Number of Client Container Failure Reports Number of FS RW Container Reports Number of Active Container Reports Number of FS Volume Reports Number of FS Register Number of container lookups Number of container assign Number of container corrupt reports Number of rpc failed Number of rpc received FS Disk Used GB FS Disk Available GB Cpu Busy % Memory Total MB Memory Used MB Memory Free MB Network Bytes Received Network Bytes Sent Setting Up Service Metrics To configure metrics for a service: Edit the appropriate file on all CLDB nodes, depending on the service: hadoop-metrics.properties For MapR-specific services, edit /opt/mapr/conf/hadoop-metrics.properties For standard Hadoop services, edit /opt/mapr/hadoop/hadoop-<version>/conf/hadoop-metrics.properties In the sections specific to the service: Un-comment the lines pertaining to the context to which you wish the service to send metrics. Comment out the lines pertaining to other contexts. Restart the service. To enable service metrics: As root (or using sudo), run the following commands: maprcli config save -values '{"cldb.ganglia.cldb.metrics":"1"}' maprcli config save -values '{"cldb.ganglia.fileserver.metrics":"1"}' To disable service metrics: As root (or using sudo), run the following commands: maprcli config save -values '{"cldb.ganglia.cldb.metrics":"0"}' maprcli config save -values '{"cldb.ganglia.fileserver.metrics":"0"}' Example In the following example, CLDB service metrics will be sent to the Ganglia context: #CLDB metrics config - Pick one out of null,file or ganglia. #Uncomment all properties in null, file or ganglia context, to send cldb metrics to that context # Configuration of the "cldb" context for null #cldb.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread #cldb.period=10 # Configuration of the "cldb" context for file #cldb.class=org.apache.hadoop.metrics.file.FileContext #cldb.period=60 #cldb.fileName=/tmp/cldbmetrics.log # Configuration of the "cldb" context for ganglia cldb.class=com.mapr.fs.cldb.counters.MapRGangliaContext31 cldb.period=10 cldb.servers=localhost:8649 cldb.spoof=1 Job Metrics The MapR Metrics service collects and displays analytics information about the Hadoop jobs, tasks, and task attempts that run on the nodes in your cluster. You can use this information to examine specific aspects of your cluster's performance at a very granular level, enabling you to monitor how your cluster responds to changing workloads and optimize your Hadoop jobs or cluster configuration. The analytics information collected by the MapR Metrics service is stored in a MySQL database. The server running MySQL does not have to be a node in the cluster, but the nodes in your cluster must have access to the server. 
View this video for an introduction to Job Metrics... The MapR Control System presents the jobs running on your cluster and the tasks that make up a specific job as a sortable list, along with histograms and line charts that represent the distribution of a particular metric. You can sort the list by the metric you're interested in to quickly find any outliers, then display specific detailed information about a job or task attempt that you want to learn more about. The filtering capabilities of the MapR Control System enable you to narrow down the display of data to the ranges you're interested in. The MapR Control System displays data using histograms (for jobs) and line charts (for jobs and task attempts). All histograms and charts are implemented in HTML5, CSS and JavaScript to enable display on your browser or mobile device without requiring plug-ins. The histograms presented by MapR Metrics divide continuous data, such as a range of job durations, into a sequence of discrete bins. For example, a range of durations from 0 to 10000 seconds could be presented as 20 individual bins that cover a 500-second band each. The height of the histogram's bar for each bin represents the number of jobs with a duration in the bin's range. The line charts in MapR Metrics display the trend over time for the value of a specific metric. An M3 license for MapR displays basic information. The M5 license provides sophisticated graphs, and histograms, providing access to trends and detailed statistics. Either license provides access to MapR Metrics from the MapR Control System and . command line interface The job metrics cover the following categories: Cluster resource use (CPU and memory) Duration Task count (map, reduce, failed map, failed reduce) Map rates (record input and output, byte input and output) Reduce rates (record input and output, shuffle bytes) Task attempt counts (map, reduce, failed map, failed attempt Task attempt durations (average map, average reduce, maximum map, maximum reduce) The task attempt metrics cover the following categories: Times (task attempt duration, garbage collection time, CPU time) Local byte rate (read and written) Mapr-FS byte rate (read and written) Memory usage (bytes of physical and virtual memory) Records rates (map input, map output, reduce input, reduce output, skipped, spilled, combined input, combined output) Reduce task attempt input groups Reduce task attempt shuffle bytes The Job Metrics Database Metrics information is kept in a MySQL database that you when you install MapR. configure The JOB and JOB_ATTRIBUTES tables hold job metadata while a job is running. Transactional data, such as counters, is written to the /var/ma directory on each host. This directory depends on the base path of pr/<cluster name>/mapred/jobTracker/jobs/history/metrics/ the JobTracker directory. These transactional data files are named . <hostname> {jobID>_<fileID>_metrics job Task Metrics Data The TASK, TASK_ATTEMPT, and TASK_ATTEMPT_ATTRIBUTES tables hold information related to a job's tasks and task attempts. These tables update while the job is running. If a job's task data has not been accessed within a configurable time limit, the data from the TASK, TASK_ATTEMPT, and TASK_ATTEMPT_ATTRIBUTES tables is purged. The parameter in the file sets the number of db.joblastaccessed.limit.hours db.conf hours that define this time limit. The default value for this parameter is 48. 
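For example, to keep task-level data for 72 hours instead of the default 48, you could set the parameter in /opt/mapr/conf/db.conf as in the following sketch (the value 72 is only an illustration) and then restart the hoststats service on the nodes that report metrics so the change takes effect:
# in /opt/mapr/conf/db.conf
db.joblastaccessed.limit.hours=72
# then restart hoststats
maprcli node services -name hoststats -action restart -nodes <node list>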
Retention of Job Metrics Data
Information from the JOB and JOB_ATTRIBUTES tables is written to the /var/mapr/<cluster name>/mapred/jobTracker/jobs/history/ directory. If a request is made at the MCS for a job that has already been purged from the Metrics database, that data is reloaded from the relevant directory.

Example: Using MapR Metrics To Diagnose a Faulty Network Interface Card (NIC)
In this example, a node in your cluster has a NIC that is intermittently failing. This condition is leading to abnormally long task completion times because that node is occasionally unreachable. In the Metrics interface, you can display a job's average and maximum task attempt durations for both map and reduce attempts. A high variance between the average and maximum attempt durations suggests that some task attempts are taking an unusually long time. You can sort the list of jobs by maximum map task attempt duration to find jobs with such an unusually high variance. Click the name of a job to display information about the job's tasks, then sort the task attempt list by duration to find the outliers. Because the list of tasks includes information about the node the task is running on, you can see that several of these unusually long-running task attempts are assigned to the same node. This suggests that there may be an issue with that specific node that is causing task attempts to take longer than usual. When you display summary information for that node, you can see that the network I/O speeds are lower than the speeds for other similarly configured nodes in the cluster. You can use that information to examine the node's network I/O configuration and hardware and diagnose the specific cause.

Setting up the MapR Metrics Database
In order to use MapR Metrics, you have to set up a MySQL database where metrics data will be logged. MySQL is not included in the MapR distribution for Apache Hadoop; you need to download and install it separately. Perform the following configuration steps to enable the MapR Metrics database.

Requirements
The MySQL server must have remote access enabled.
The MySQL server must be accessible from all JobTracker and webserver nodes in the cluster.
The MySQL version must be 5.1 or greater.

Installation
Install the MySQL server from the EPEL repository:
# yum install mysql-server

Initial Configuration
1. Start the MySQL server:
# /etc/init.d/mysqld start
2. Set a password for the MySQL root user:
# mysqladmin -u root password <new password>
3. Log into MySQL and create a new database and schema for use with MapR Metrics:
# mysql -u root -p
Enter password:
mysql> CREATE DATABASE metrics;
mysql> SHOW DATABASES;
4. Create a user (in this example, maprmetrics with the password mapr), grant it the necessary privileges, then verify:
mysql> CREATE user 'maprmetrics'@'%' IDENTIFIED BY 'mapr';
mysql> CREATE user 'maprmetrics'@'localhost' IDENTIFIED BY 'mapr';
mysql> grant all privileges on metrics.* to 'maprmetrics'@'%' with grant option;
mysql> grant all privileges on metrics.* to 'maprmetrics'@'localhost' with grant option;
mysql> show grants for 'maprmetrics'@'%';
mysql> show grants for 'maprmetrics'@'localhost';
5. Create the mapr-metrics schema:
# mysql -u maprmetrics -h <hostname or IP where the SQL server is running> -p -vvv < /opt/mapr/bin/setup.sql > /opt/mapr/logs/setup_sql_results.txt
6. Restart the hoststats service:
# maprcli node services -name hoststats -action restart -filter '[hn==mfs*]'

Client Installation
Install the MySQL client on the JobTracker and webserver nodes in your cluster:
# yum install mysql 1. 1. Test the MySQL connection from the JobTracker and webserver nodes to the MySQL server: # mysql -u maprmetrics -h <hostname or IP where sql server is running> -p Enter password: Troubleshooting the Client Check that the port used by the MySQL server is set to the default of 3306. To use a non-default port, specify the port number with the - option for the command. P mysql Check the file to verify that the following lines are present: /etc/my.cnf not # skip-networking tells mysql server to not listen for tcp/ip connections at all skip-networking # bind-address should be 0.0.0.0 which tells mysql server to listen on all interfaces, not only localhost (which is what "bind-address = 127.0.0.1" does) bind-address = 127.0.0.1 Restart the MySQL server after any changes to the : /etc/my.cnf # /etc/init.d/mysqld restart Specify MySQL parameters to MapR On each node in the cluster that has the package installed, specify your MySQL database parameters in one of the mapr-metrics following ways: To specify the MySQL database parameters from the command line, run the script: configure.sh # configure.sh -R -d <host>:<port> -du <database username> -dp <database password> -ds metrics To specify the MySQL database parameters from the (MCS), click to MapR Control System Navigation > System Settings > Metrics display the Configure Metrics Database dialog. In the field, enter the hostname and port of the machine running the MySQL server. URL In the and fields, enter the username and password of the MySQL user. is set to by default. Username Password Schema metrics Customizing the Database Name You can change the name of the database from the default name of by editing the script before sourcing the script. metrics setup.sql Troubleshooting Metrics Verify that the package is installed on all JobTracker and webserver nodes. mapr-metrics Verify that all nodes have the correct config info in the and files /opt/mapr/conf/db.conf /opt/mapr/conf/hibernate.cfg.xml . Check the file for database-related messages or errors similar to this example: /opt/mapr/logs/adminuiapp.log 2012-09-26 12:41:38,740 WARN com.mchange.v2.resourcepool.BasicResourcePool [com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread-#2]: com.mchange.v2.resourcepool.BasicResourcePool$AcquireTask@663f3fbd -- Acquisition Attempt Failed!!! Clearing pending acquires. While trying to acquire a needed new resource, we failed to succeed more than the maximum number of allowed acquisition attempts (30). Last acquisition attempt exception: java.sql.SQLException: Access denied for user 'maprmetrics'@'10.10.80.116' (using password: YES) Verify that the service is running: hoststats # ps -ef | grep hoststats root 8411 8049 0 11:01 pts/0 00:00:00 grep hoststats root 26607 1 0 Sep25 ? 00:03:04 /opt/mapr/server/hoststats 5660 /opt/mapr/logs/TaskTracker.stats The username you provide must have full permissions when logged in from any node in the cluster. When you change the Metrics configuration information from the initial settings, you must restart the service on each node hoststats that reports Metrics data. You can restart the service with the command hoststats maprcli -name hoststats node services . -action restart 1. 2. 3. Check the file for databse-related errors. 
/opt/mapr/logs/hoststats.log Verify that the file exists and contains the following /opt/mapr/hadoop/hadoop-0.20.2/conf/hadoop-metrics.properties entries: maprmepredvariant.class=com.mapr.job.mngmnt.hadoop.metrics.MaprRPCContext maprmepredvariant.period=10 maprmapred.class=com.mapr.job.mngmnt.hadoop.metrics.MaprRPCContextFinal maprmapred.period=10 Add these entries if they are absent, then restart the JobTracker. Verify that the file has the following entries: /opt/mapr/conf/warden.conf rpc.drop=false hs.rpcon=true hs.port=1111 hs.host=localhost Check the JobTracker logs for errors or entries related to the string . com.mapr.job.mngmnt.hadoop.metrics Third-Party Monitoring Tools MapR works with the following third-party monitoring tools: Ganglia Nagios Ganglia Ganglia is a scalable distributed system monitoring tool that allows remote viewing live or historical statistics for a cluster. The Ganglia system consists of the following components: A PHP-based web front end Ganglia monitoring daemon ( ): a multi-threaded monitoring daemon gmond Ganglia meta daemon ( ): a multi-threaded aggregation daemon gmetad A few small utility programs The daemon aggregates metrics from the instances, storing them in a database. The front end pulls metrics from the database gmetad gmond and graphs them. You can aggregate data from multiple clusters by setting up a separate for each, and then a master to gmetad gmetad aggregate data from the others. If you configure Ganglia to monitor multiple clusters, remember to use a separate port for each cluster. MapR with Ganglia The CLDB reports metrics about its own load, as well as cluster-wide metrics such as CPU and memory utilization, the number of active FileServer nodes, the number of volumes created, etc. For a complete list of metrics, see . Service Metrics MapRGangliaContext collects and sends CLDB metrics, FileServer metrics, and cluster-wide metrics to Gmon or Gmeta, depending on the configuration. On the Ganglia front end, these metrics are displayed separately for each FileServer by hostname. The ganglia monitor only needs to be installed on CLDB nodes to collect all the metrics required for monitoring a MapR cluster. To monitor other services such as HBase and MapReduce, install Gmon on nodes running the services and configure them as you normally would. The Ganglia properties for the and contexts are configured in the file cldb fileserver $INSTALL_DIR/conf/hadoop-metrics.properti . Any changes to this file require a CLDB restart. es Installing Ganglia To install Ganglia on Ubuntu: On each CLDB node, install : ganglia-monitor sudo apt-get install ganglia-monitor On the machine where you plan to run the Gmeta daemon, install : gmetad sudo apt-get install gmetad On the machine where you plan to run the Ganglia front end, install : ganglia-webfrontend sudo apt-get install 3. 1. 2. 3. 4. 1. 2. 1. 2. 1. 2. 3. ganglia-webfrontend To install Ganglia on Red Hat: Download the following RPM packages for Ganglia version 3.1 or later: ganglia-gmond ganglia-gmetad ganglia-web On each CLDB node, install : ganglia-monitor rpm -ivh <ganglia-gmond> On the machine where you plan to run the Ganglia meta daemon, install : gmetad rpm -ivh <gmetad> On the machine where you plan to run the Ganglia front end, install : ganglia-webfrontend rpm -ivh <ganglia-web> For more details about Ganglia configuration and installation, see the . Ganglia documentation To start sending CLDB metrics to Ganglia: Make sure the CLDB is configured to send metrics to Ganglia (see ). 
Service Metrics As (or using ), run the following commands: root sudo maprcli config save -values '{"cldb.ganglia.cldb.metrics":"1"}' maprcli config save -values '{"cldb.ganglia.fileserver.metrics":"1"}' To stop sending CLDB metrics to Ganglia: As (or using ), run the following commands: root sudo maprcli config save -values '{"cldb.ganglia.cldb.metrics":"0"}' maprcli config save -values '{"cldb.ganglia.fileserver.metrics":"0"}' Nagios Integration Nagios is an open-source cluster monitoring tool. MapR can generate a Nagios Object Definition File that describes the nodes in the cluster and the services running on each. You can generate the file using the MapR Control System or the command, then save the file in nagios generate the proper location in your Nagios environment. MapR recommends Nagios version 3.3.1 and version 1.4.15 of the plugins. To generate a Nagios file using the MapR Control System: In the Navigation pane, click . Nagios Copy and paste the output, and save as the appropriate Object Definition File in your Nagios environment. For more information, see the . Nagios documentation Configuring Email for Alarm Notifications MapR can notify users by email when certain conditions occur. There are three ways to specify the email addresses of MapR users: From an LDAP directory By domain Manually, for each user To configure email from an LDAP directory: In the MapR Control System, expand the group and click to display the System Settings Views Email Addresses Configure Email dialog. Addresses Select and enter the information about the LDAP directory into the appropriate fields. Use LDAP Click to save the settings. Save 1. 2. 3. 1. 2. 3. 4. 5. 1. 2. 3. 4. 5. To configure email by domain: In the MapR Control System, expand the group and click to display the System Settings Views Email Addresses Configure Email dialog. Addresses Select and enter the domain name in the text field. Use Company Domain Click to save the settings. Save To configure email manually for each user: Create a volume for the user. In the MapR Control System, expand the group and click . MapR-FS User Disk Usage Click the to display the User Properties dialog. username Enter the user's email address in the field. Email Click to save the settings. Save Configuring SMTP Use the following procedure to configure the cluster to use your SMTP server to send mail: In the MapR Control System, expand the group and click to display the dialog. System Settings Views SMTP Configure Sending Email Enter the information about how MapR will send mail: Provider: assists in filling out the fields if you use Gmail. SMTP Server: the SMTP server to use for sending mail. This server requires an encrypted connection (SSL): specifies an SSL connection to SMTP. SMTP Port: the SMTP port to use for sending mail. Full Name: the name MapR should use when sending email. Example: MapR Cluster Email Address: the email address MapR should use when sending email. Username: the username MapR should use when logging on to the SMTP server. SMTP Password: the password MapR should use when logging on to the SMTP server. Click . Test SMTP Connection If there is a problem, check the fields to make sure the SMTP information is correct. Once the SMTP connection is successful, click to save the settings. Save Managing Data with Volumes This page describes how to organize and manage data using volumes, a unique feature of MapR clusters. 
This page contains the following topics:
Introduction to Volumes
Creating a Volume
Viewing a List of Volumes
Viewing Volume Properties
Modifying a Volume
Mounting a Volume
Unmounting a Volume
Removing a Volume or Mirror
Setting Volume Topology
Setting Default Volume Topology
Example: Setting Up CLDB-Only Nodes
Related Topics

Introduction to Volumes
MapR provides volumes as a way to organize data and manage cluster performance. A volume is a logical unit that allows you to apply policies to a set of files, directories, and sub-volumes. A well-structured volume hierarchy is an essential aspect of your cluster's performance. As your cluster grows, keeping your volume hierarchy efficient maximizes your data's availability. Without a volume structure in place, your cluster's performance will be negatively affected. This section discusses fundamental volume concepts.
You can use volumes to enforce disk usage limits, set replication levels, establish ownership and accountability, and measure the cost generated by different projects or departments. Create a volume for each user, department, or project. You can mount volumes under other volumes to build a structure that reflects the needs of your organization. The volume structure defines how data is distributed across the nodes in your cluster. Create multiple small volumes with shallow paths at the top of your cluster's volume hierarchy to spread the load of access requests across the nodes.
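For example, you could create a volume for an analytics project and mount it under a /projects parent directory with a single maprcli command. The volume and mount-path names below are purely illustrative; quotas, topology, ownership, and the other properties described later in this section can be set with additional options at creation time or afterward through the Volume Properties dialog:
maprcli volume create -name project.analytics -path /projects/analytics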
When the replication factor falls below this minimum, re-replication occurs as aggressively as possible to restore the replication level. The minimum number allowed for the minimum replication factor is 1; the default is 2; the maximum number you can set for the minimum replication factor is 6. In all cases, the minimum replication factor cannot be greater than the replication factor. If any containers in the CLDB volume fall below the minimum replication factor, writes are disabled until aggressive re-replication restores the minimum level of replication. If a disk failure is detected, any data stored on the failed disk is re-replicated without regard to the timeout specified in the parameter. cldb.fs.mark.rereplicate.sec Creating a Volume When creating a volume, the only required parameters are the volume type (normal or mirror) and the volume name. You can set the ownership, permissions, quotas, and other parameters at the time of volume creation, or use the dialog to set them later. If you plan to Volume Properties schedule snapshots or mirrors, it is useful to create a ahead of time; the schedule will appear in a drop-down menu in the Volume schedule Properties dialog. By default, the root user and the volume creator have full control permissions on the volume. You can grant specific permissions to other users and groups: Code Allowed Action dump Dump the volume restore Mirror or restore the volume m Modify volume properties, create and delete snapshots d Delete a volume fc Full control (admin access and permission to change volume ACL) You can create a volume using the command, or use the following procedure to create a volume using the MapR Control System. volume create To create a volume using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Volumes Click the button to display the dialog. New Volume New Volume Use the radio button at the top of the dialog to choose whether to create a standard volume, a local mirror, or a remote Volume Type mirror. Type a name for the volume or source volume in the or field. Volume Name Mirror Name 5. 6. 7. 8. a. b. c. 9. a. b. c. 10. a. b. c. 11. 1. 1. 2. 3. 1. 2. 3. 4. If you are creating a mirror volume: Type the name of the source volume in the field. Source Volume Name If you are creating a remote mirror volume, type the name of the cluster where the source volume resides, in the Source Cluster field. Name You can set a mount path for the volume by typing a path in the field. Mount Path You can specify which rack or nodes the volume will occupy by selecting a toplogy from the drop-down selector. Topology You can set permissions using the fields in the section: Ownership & Permissions Click  to display fields for a new permission. [ + Add Permission ] In the left field, type either u: and a user name, or g: and a group name. In the right field, select permissions to grant to the user or group. You can associate a standard volume with an accountable entity and set quotas in the section: Usage Tracking In the field, select or from the dropdown menu and type the user or group name in the text field. Group/User User Group To set an advisory quota, select the checkbox beside and type a quota (in megabytes) in the text field. Volume Advisory Quota To set a quota, select the checkbox beside and type a quota (in megabytes) in the text field. 
Volume Quota You can set the replication factor and choose a snapshot or mirror in the Replication and Snapshot section: schedule Type the desired replication factor in the field. When the number of replicas drops below this threshold, the Replication Factor volume is re-replicated after a timeout period (configurable with the parameter using the cldb.fs.mark.rereplicate.sec c API). onfig Type the minimum replication factor in the field. When the number of replicas drops below this threshold, Minimum Replication the volume is aggressively re-replicated to bring it above the minimum replication factor. To schedule snapshots or mirrors, select a from the dropdown menu or the schedule Snapshot Schedule Mirror Update dropdown menu respectively. Schedule Click to create the volume. OK Viewing a List of Volumes You can view all volumes using the command, or view them in the MapR Control System using the following procedure. volume list To view all volumes using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Volumes Viewing Volume Properties You can view volume properties using the command, or use the following procedure to view them using the MapR Control System. volume info To view the properties of a volume using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Volumes Display the dialog by clicking the volume name, or by selecting the checkbox beside the volume name, then clicking Volume Properties the button. Properties After examining the volume properties, click to exit without saving changes to the volume. Close Modifying a Volume You can modify any attributes of an existing volume, except for the following restriction: Mirror and normal volumes cannot be converted to the other type. You can modify a volume using the command, or use the following procedure to modify a volume using the MapR Control System. volume modify To modify a volume using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Volumes Display the dialog by clicking the volume name, or by selecting the checkbox beside the volume name then clicking Volume Properties the button. Properties Make changes to the fields. See for more information about the fields. Creating a Volume After examining the volume properties, click to save changes to the volume. Modify Volume Mounting a Volume 1. 2. 3. 1. 2. 3. 1. 2. 3. 4. 1. 2. 3. 4. 5. 6. You can mount a volume using the command, or use the following procedure to mount a volume using the MapR Control System. volume mount To mount a volume using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Volumes Select the checkbox beside the name of  each volume you wish to mount. Click the button. Mount You can also mount or unmount a volume using the checkbox in the dialog. See for more Mounted Volume Properties Modifying a Volume information. Unmounting a Volume You can unmount a volume using the command, or use the following procedure to unmount a volume using the MapR Control volume unmount System. To unmount a volume using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Volumes Select the checkbox beside the name of each volume you wish to unmount. Click the button. Unmount You can also mount or unmount a volume using the Mounted checkbox in the dialog. See for more Volume Properties Modifying a Volume information. 
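The same mount and unmount operations are also available from the command line. For example, with an illustrative volume name and mount path:
maprcli volume mount -name project.analytics -path /projects/analytics
maprcli volume unmount -name project.analytics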
Removing a Volume or Mirror You can remove a volume using the command, or use the following procedure to remove a volume using the MapR Control volume remove System. To remove a volume or mirror using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Volumes Click the checkbox next to the volume you wish to remove. Click the button to display the Remove Volume dialog. Remove In the Remove Volume dialog, click the button. Remove Volume Setting Volume Topology You can place a volume on specific racks, nodes, or groups of nodes by setting its topology to an existing node topology. Your node topology describes the locations of nodes and racks in a cluster. The MapR software uses node topology to determine the location of replicated copies of data. Optimally defined cluster topology results in data being replicated to separate racks, providing continued data availability in the event of rack or node failure. Define your cluster's topology by specifying a topology for each node in the cluster. You can use topology to group nodes by rack or switch, depending on how the physical cluster is arranged and how you want MapR to place replicated data. For more information about node topology, see . Node Topology You can set volume topology using the MapR Control System or with the command. volume move To set volume topology using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR Data Platform Volumes Display the dialog by clicking the volume name or by selecting the checkbox beside the volume name, then clicking Volume Properties the button. Properties Click to display the Move Volume dialog. Move Volume Select a topology path that corresponds to the rack or nodes where you would like the volume to reside. Click to return to the Volume Properties dialog. Move Volume Click to save changes to the volume. Modify Volume 1. 2. 1. 2. 3. 4. 1. 2. Setting Default Volume Topology By default, new volumes are created with a topology of . To change the default topology, use the command to change the /data config save c configuration parameter. Example: ldb.default.volume.topology maprcli config save -values "{\"cldb.default.volume.topology\":\"/data/rack02\"}" After running the above command, new volumes have the volume topology by default, which could be useful to restrict new /data/rack02 volume data to subset of the cluster. Example: Setting Up CLDB-Only Nodes In a large cluster (100 nodes or more) create CLDB-only nodes to ensure high performance. This configuration also provides additional control over the placement of the CLDB data, for load balancing, fault tolerance, or high availability (HA). Setting up CLDB-only nodes involves restricting the CLDB volume to its own topology and making sure all other volumes are on a separate topology. Because both the CLDB-only path and the non-CLDB path are children of the root topology path, new non-CLDB volumes are not guaranteed to keep off the CLDB-only nodes. To avoid this problem, set a default volume topology. See . Setting Default Volume Topology To set up a CLDB-only node: SET UP the node as usual: PREPARE the node, making sure it meets the requirements. ADD the MapR Repository. INSTALL the following packages to the node. mapr-cldb mapr-webserver mapr-core mapr-fileserver To set up a volume topology that restricts the CLDB volume to specific nodes: Move all CLDB nodes to a CLDB-only topology (e. g. 
) using the MapR Control System or the following command: /cldbonly maprcli node move -serverids <CLDB nodes> -topology /cldbonly Restrict the CLDB volume to the CLDB-only topology. Use the MapR Control System or the following command: maprcli volume move -name mapr.cldb.internal -topology /cldbonly If the CLDB volume is present on nodes not in /cldbonly, increase the replication factor of mapr.cldb.internal to create enough copies in / using the MapR Control System or the following command: cldbonly maprcli volume modify -name mapr.cldb.internal -replication <replication factor> Once the volume has sufficient copies, remove the extra replicas by reducing the replication factor to the desired value using the MapR Control System or the command used in the previous step. To move all other volumes to a topology separate from the CLDB-only nodes: Move all non-CLDB nodes to a non-CLDB topology (e. g. ) using the MapR Control System or the following command: /defaultRack maprcli node move -serverids <all non-CLDB nodes> -topology /defaultRack Restrict all existing volumes to the topology using the MapR Control System or the following command: /defaultRack maprcli volume move -name <volume> -topology /defaultRack All volumes except are re-replicated to the changed topology automatically. mapr.cluster.root Related Topics For further information on volume-related operations, see the following topics. Mirror Volumes Snapshots To prevent subsequently created volumes from encroaching on the CLDB-only nodes, set a default topology that excludes the CLDB-only topology. Schedules Mirror Volumes A is a read-only physical copy of another volume, the . You can use mirror volumes in the same cluster (local mirror volume source volume mirroring) to provide local load balancing by using mirror volumes to serve read requests for the most frequently accessed data in the cluster. You can also mirror volumes on a separate cluster (remote mirroring) for backup and disaster readiness purposes. Once you've created a mirror volume, keeping your mirror synchronized with its source volume is fast. Because mirror operations are based on a of the source volume, your source volume remains available for read and write operations for the entire duration of the process. snapshot Mirroring Overview Creating a mirror volume is similar to creating a normal read/write volume. However, when you create a mirror volume, you must specify a source volume that the mirror retrieves content from. This retrieval is called the . Like a normal volume, a mirror volume has a mirroring operation configurable replication factor. Only one copy of the data is transmitted from the source volume to the mirror volume; the source and mirror volumes handle their own replication independently. The MapR system creates a temporary of the source volume at the start of a mirroring operation. The mirroring process reads content snapshot from the snapshot into the mirror volume. The source volume remains available for read and write operations during the mirroring process. If the mirroring operation is , the snapshot expires according to the value of the schedule's parameter. Snapshots created schedule-based Retain For during manual mirroring persist until they are deleted manually. The mirroring process transmits only the differences between the source volume and the mirror. The initial mirroring operation copies the entire source volume, but subsequent mirroring operations can be extremely fast. 
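For example, once a mirror volume exists you can trigger a synchronization manually from the command line rather than waiting for a schedule; the mirror volume name below is illustrative:
maprcli volume mirror start -name mirror1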
The mirroring operation never consumes all available network bandwidth, and throttles back when other processes need more network bandwidth. The server sending mirror data continuously monitors the total round-trip time between the data transmission and arrival, and uses this information to restrict itself to 70% of the available bandwidth (continuously calculated). If the network or servers anywhere along the entire path need more bandwidth, the sending server throttles back automatically. If more bandwidth opens up, the sender automatically increases how fast it sends data. During the copy process, the mirror is a fully-consistent image of the source volume. Mirrors are atomically updated at the mirror destination. The mirror does not change until all bits are transferred, at which point all the new files, directories, blocks, etc., are atomically moved into their new positions in the mirror-volume. The previous mirror is left behind as a snapshot, which can be accessed from the directory. These old .snapshot snapshots can be deleted on a . schedule Mirroring is extremely resilient. In the case of a , where some or all of the machines that host the source volume cannot network partition communicate with the machines that host the mirror volume, the mirroring operation periodically retries the connection. Once the network is restored, the mirroring operation resumes. When the root volume on a cluster is mirrored, the source root volume contains a writable volume link, that points to the read/write copies of .rw all local volumes. In that case, the mount path refers to one of the root volume's mirrors, and is read-only. The mount path refers to the / /.rw source volume, and is read/write. A mount path that consists entirely of mirrored volumes refers to a mirrored copy of the specified volume. When a mount path contains volumes that are not mirrored, the path refers to the target volume directly. In cases where a path refers to a mirrored copy, the link is useful for .rw navigating to the read/write source volume. The table below provides examples. Example Volume Topology with Mirrors For the four volumes , , , and , the following table indicates the volumes referred to by example mount paths for particular combinations of / a b c mirrored and not mirrored volumes in the path: / a b c This Path Refers To This Volume... Which is... Mirrored Mirrored Mirrored Mirrored /a/b/c Mirror of c Read-only Mirrored Mirrored Mirrored Mirrored /.rw/a/b/c c directly Read/Write Mirrored Mirrored Not Mirrored Mirrored /a/b/c c directly Read/Write Mirrored Mirrored Not Mirrored Mirrored /a Mirror of a Read-only Not Mirrored Mirrored Mirrored Mirrored /a/b/c c directly Read/Write 1. 2. 3. 4. Setting a Mirror Schedule You can automate mirror synchronization by setting a . You can also use the command to synchronize data schedule volume mirror start manually. Completion time for a mirroring operation is affected by available network bandwidth and the amount of data to transmit. For best performance, set the mirroring schedule according to the anticipated rate of data changes and the available bandwidth for mirroring. Mirror Cascades In a cascade, one mirror synchronizes to the source volume, and each successive mirror uses a previous mirror as its source. Mirror cascades are useful for propagating data over a distance, then re-propagating the data locally instead of transferring the same data remotely again for each copy of the mirror. 
In the example below, the < character indicates a mirror's source:

/ < mirror1 < mirror2 < mirror3

A mirror cascade makes more efficient use of your cluster's network bandwidth, but synchronization can be slower to propagate through the chain. For cases where synchronization of mirrors is a higher priority than network bandwidth optimization, make each mirror read directly from the source volume:

mirror1 >         < mirror2
             /
mirror3 >         < mirror4

You can create or break a mirror cascade made from existing mirror volumes by changing the source volume of each mirror in the Volume Properties dialog.

Other Mirror Operations
For more information on mirror volume operations, see the following sections:
You can set the topology of a mirror volume to determine the placement of the data.
You can change a mirror's source volume by changing the source volume in the Volume Properties dialog.
To create a new mirror volume, see Creating a Volume (requires M5 license and cv permission).
To modify a mirror (including changing its source), see Modifying a Volume.
To remove a mirror, see Removing a Volume or Mirror.

Local Mirroring
A local mirror volume is a mirror volume whose source is on the same cluster. Local mirror volumes are useful for load balancing or for providing a local read-only copy of a data set. You can locate your local mirror volumes on specific servers or on racks with particularly high bandwidth, mounted in a public directory separate from the source volume.

The most frequently accessed volumes in a cluster are likely to be the root volume and its immediate children. To load-balance read operations on these volumes, mirror the root volume (typically mapr.cluster.root, which is mounted at /). By mirroring these volumes, read requests can be served from the mirrors, distributing load across the nodes. Less-frequently accessed volumes that are lower in the hierarchy do not need mirror volumes. Since the mount paths for those volumes are not mirrored throughout, those volumes remain writable.

To create a local mirror using the MapR Control System:
1. Log on to the MapR Control System.
2. In the navigation pane, select MapR-FS > Volumes.
3. Click the New Volume button.
4. In the New Volume dialog, specify the following values:
   a. Select Local Mirror Volume.
   b. Enter a name for the mirror volume in the Mirror Name field. If the mirror is on the same cluster as the source volume, the source and mirror volumes must have different names.
   c. Enter the source volume name (not mount point) in the Source Volume Name field.
   d. To automate mirroring, select a schedule corresponding to critical data, important data, normal data, or a user-defined schedule from the Mirror Schedule dropdown menu.

To create a local mirror using the volume create command:
1. Connect via ssh to a node on the cluster where you want to create the mirror.
2. Use the volume create command to create the mirror volume. Specify the mirror volume name, provide a source volume name, and specify a type of 1. Example:

maprcli volume create -name volume-a-mirror -source volume-a -type 1 -schedule 2

Remote Mirroring
A remote mirror volume is a mirror volume with a source in another cluster. You can use remote mirrors for offsite backup, for data transfer to remote facilities, and for load and latency balancing for large websites. By mirroring the cluster's root volume and all other volumes in the cluster, you can create an entire mirrored cluster that keeps in sync with the source cluster.
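Before going into remote mirroring details, the cascade layout described above can be sketched on the command line, using the same volume create form shown for local mirrors. This is a minimal sketch; the mirror names are hypothetical, and the source is assumed to be mapr.cluster.root (mounted at /):

maprcli volume create -name mirror1 -source mapr.cluster.root -type 1
maprcli volume create -name mirror2 -source mirror1 -type 1
maprcli volume create -name mirror3 -source mirror2 -type 1

Each mirror after the first uses the previous mirror as its source, producing the / < mirror1 < mirror2 < mirror3 chain. To break the cascade later, change each mirror's source volume in the Volume Properties dialog.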
Backup mirrors for disaster recovery can be located on physical media outside the cluster or in a remote cluster. In the event of a disaster affecting the source cluster, you can check the time of last successful synchronization to determine how current the backup is (see b Mirror Status elow). Creating Remote Mirrors Creating remote mirrors is similar to creating local mirrors, except that the source cluster name must also be specified. To create a remote mirror using the MapR Control System: Log on to the . MapR Control System Check the (near the MapR logo). If you are not connected to the cluster on which you want to create a mirror: Cluster Name Click the next to the Cluster Name. [+] In the Available Clusters dialog, click the name of the cluster where you want to create a mirror. In the Launching Web Interface dialog, click that cluster again to connect. In the navigation pane, select . MapR-FS > Volumes Click the button. New Volume In the dialog, specify the following values: New Volume Select or . Local Mirror Volume Remote Mirror Volume Enter a name for the mirror volume in the field. If the mirror is on the same cluster as the source volume, the Volume Name source and mirror volumes must have different names. Enter the source volume name (not mount point) in the field. Source Volume Enter the source cluster name in the field. Source Cluster To automate mirroring, select a from the dropdown menu. schedule Mirror Update Schedule To create a remote mirror using the command: volume create Connect to a node on the cluster where you wish to create the mirror. Use the command to create the mirror volume. Specify the source volume and cluster in the format volume create <volume>@<clus , provide a for the mirror volume, and specify a of . Example: ter> name type 1 maprcli volume create -name volume-a -source volume-a@cluster-1 -type 1 -schedule 2 Moving Large Amounts of Data to a Remote Cluster You can use the command to create volume copies for transport on physical media. The comm volume dump create volume dump create and creates backup files containing the volumes, which can be reconstituted into mirrors at the remote cluster with the c volume dump restore 1. 2. 3. ommand. Associate these mirrors with their source volumes with the command to re-establish synchronization. volume modify Another way to transfer large amounts of data to a remote cluster is to create a small cluster locally and mirror to that local cluster. Then move that cluster to a remote location and enlarge it by adding more nodes. Working with Multiple Clusters To mirror volumes between clusters, create an additional entry in on the source volume's cluster for each additional mapr-clusters.conf cluster that hosts a mirror of the volume. The entry must list the cluster's name, followed by a comma-separated list of hostnames and ports for the cluster's CLDB nodes. To set up multiple clusters On each cluster, make a note of the cluster name and CLDB nodes (the first line in ) mapr-clusters.conf On each webserver and CLDB node, add the remote cluster's CLDB nodes to , using the /opt/mapr/conf/mapr-clusters.conf following format: clustername1 <CLDB> <CLDB> <CLDB> [ clustername2 <CLDB> <CLDB> <CLDB> ] [ ... ] On each cluster, restart the service on all nodes where it is running. mapr-webserver To set up cross-mirroring between clusters You can between clusters, mirroring some volumes from cluster A to cluster B and other volumes from cluster B to cluster A. 
To set cross-mirror up cross-mirroring, create entries in as follows: mapr-clusters.conf Entries in on cluster A nodes: mapr-clusters.conf First line contains cluster name and CLDB servers of cluster A (the local cluster) Second line contains cluster name and CLDB servers of cluster B (the remote cluster) Entries in on cluster B nodes: mapr-clusters.conf First line contains cluster name and CLDB servers of cluster B (the local cluster) Second line contains cluster name and CLDB servers of cluster A (the remote cluster) For example, the file for cluster A with three CLDB nodes (nodeA, nodeB, and nodeC) would look like this: mapr-clusters.conf clusterA <nodeA> <nodeB> <nodeC> clusterB <nodeD> The file for cluster B with one CLDB node (nodeD) would look like this: mapr-clusters.conf clusterB <nodeD> clusterA <nodeA> <nodeB> <nodeC> By creating additional entries in the file, you can mirror from one cluster to several others. mapr-clusters.conf When a mirror volume is created on a remote cluster (according to the entries in the file), the CLDB checks that the local mapr-clusters.conf volume exists in the local cluster. If both clusters are not set up and running, the remote mirror volume cannot be created. To set up a mirror volume, make sure: Each cluster is already set up and running Each cluster has a unique name 1. 2. 3. 1. 2. 3. Every node in each cluster can resolve all nodes in remote clusters, either through DNS or entries in /etc/hosts Mirror Status You can see a list of all mirror volumes and their current status on the view (in the MapR Control System, select then Mirror Volumes MapR-FS M ) or using the command. You can see additional information about mirror volumes on the CLDB status page (in the irror Volumes volume list MapR Control System, select ), which shows the status and last successful synchronization of all mirrors, as well as the container locations CLDB for all volumes. You can also find container locations using the commands. hadoop mfs Starting a Mirror When a mirror , all the data in the source volume is copied into the mirror volume. Starting a mirror volume requires that the mirror volume starts exist and be associated with a source. After you start a mirror, synchronize it with the source volume regularly to keep the mirror current. You can start a mirror using the command, or use the following procedure to start mirroring using the MapR Control System. volume mirror start To start mirroring using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Volumes Select the checkbox beside the name of each volume you wish to mirror. Click the button. Start Mirroring Stopping a Mirror Stopping a mirror halts any replication or synchronization process currently in progress. Stopping a mirror does not delete or remove the mirror volume. Stop a mirror with the command, or use the following procedure to stop mirroring using the MapR Control System. volume mirror stop To stop mirroring using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Volumes Select the checkbox beside the name of each volume you wish to stop mirroring. Click the button. Stop Mirroring Pushing Changes to Mirrors To a mirror means to start pushing data from the source volume to all its local mirrors. You can push source volume changes out to all push mirrors using the command, which returns after the data has been pushed. 
volume mirror push Using Volume Links with Mirrors When you mirror a volume, read requests to the source volume can be served by any of its mirrors on the same cluster via a of type volume link m . A volume link is similar to a normal volume mount point, except that you can specify whether it points to the source volume or its mirrors. irror To write to (and read from) the source volume, mount the source volume normally. As long as the source volume is mounted below a non-mirrored volume, you can read and write to the volume normally via its direct mount path. You can also use a volume link of type wri to write directly to the source volume regardless of its mount point. teable To read from the mirrors, use the command to make a volume link (of type ) to the source volume. Any read volume link create mirror requests from the volume link are distributed among the volume's mirrors. Since the volume link provides access to the mirror volumes, you do not need to mount the mirror volumes. Schedules A schedule is a group of rules that specify recurring points in time at which certain actions are determined to occur. You can use schedules to automate the creation of snapshots and mirrors; after you create a schedule, it appears as a choice in the scheduling menu when you are editing the properties of a task that can be scheduled: To apply a schedule to snapshots, see . Scheduling a Snapshot To apply a schedule to volume mirroring, see . Creating a Volume Schedules require the M5 license. The following sections provide information about the actions you can perform on schedules: To create a schedule, see Creating a Schedule To view a list of schedules, see Viewing a List of Schedules To modify a schedule, see Modifying a Schedule 1. 2. 3. 4. a. b. c. d. 5. 6. 1. 2. 3. a. b. 4. 1. 2. 3. 4. To remove a schedule, see Removing a Schedule Creating a Schedule You can create a schedule using the command, or use the following procedure to create a schedule using the MapR Control schedule create System. To create a schedule using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Schedules Click . New Schedule Type a name for the new schedule in the field. Schedule Name Define one or more schedule rules in the section: Schedule Rules From the first dropdown menu, select a frequency (Once, Yearly, Monthly, etc.)) From the next dropdown menu, select a time point within the specified frequency. For example: if you selected Monthly in the first dropdown menu, select the day of the month in the second dropdown menu. Continue with each dropdown menu, proceeding to the right, to specify the time at which the scheduled action is to occur. Use the field to specify how long the data is to be preserved. For example: if the schedule is attached to a volume for Retain For creating snapshots, the Retain For field specifies how far after creation the snapshot expiration date is set. Click to specify additional schedule rules, as desired. [ + Add Rule ] Click to create the schedule. Save Schedule Viewing a List of Schedules You can view a list of schedules using the command, or use the following procedure to view a list of schedules using the MapR schedule list Control System. To view a list of schedules using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Schedules Modifying a Schedule When you modify a schedule, the new set of rules replaces any existing rules for the schedule. 
You can modify a schedule using the command, or use the following procedure to modify a schedule using the MapR Control schedule modify System. To modify a schedule using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Schedules Click the name of the schedule to modify. Modify the schedule as desired: Change the schedule name in the field. Schedule Name Add, remove, or modify rules in the section. Schedule Rules Click to save changes to the schedule. Save Schedule For more information, see . Creating a Schedule Removing a Schedule You can remove a schedule using the command, or use the following procedure to remove a schedule using the MapR Control schedule remove System. To remove a schedule using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Schedules Click the name of the schedule to remove. Click to display the dialog. Remove Schedule Remove Schedule Click to remove the schedule. Yes Snapshots A snapshot is a read-only image of a at a specific point in time. On clusters with an M5 or higher license, you can create a snapshot volume manually or automate the process with a . Snapshots are useful any time you need to be able to roll back to a known good data set at a schedule specific point in time. For example, before performing a risky operation on a volume, you can create a snapshot to enable rollback capability for the entire volume. A snapshot takes no time to create, and initially uses no disk space, because it stores only the incremental changes needed to roll the volume back to the state at the time the snapshot was created. The storage used by a volume's snapshots does not count against the volume's quota. When you view the list of volumes on your cluster in the , the value of the column is the disk MapR Control System Snap Size space used by all of the snapshots for that volume. The following sections describe procedures associated with snapshots: To view the contents of a snapshot, see Viewing the Contents of a Snapshot To create a snapshot, see (requires M5 or higher license) Creating a Volume Snapshot To view a list of snapshots, see Viewing a List of Snapshots To remove a snapshot, see Removing a Volume Snapshot See a video explanation of snapshots Viewing the Contents of a Snapshot At the top level of each volume is a directory called containing all the snapshots for the volume. You can view the directory with .snapshot hado commands or by mounting the cluster with NFS. To prevent recursion problems, and do not show the di op fs ls hadoop fs -ls .snapshot rectory when the top-level volume directory contents are listed. You must navigate explicitly to the directory to view and list the .snapshot snapshots for the volume. Example: root@node41:/opt/mapr/bin# hadoop fs -ls /myvol/.snapshot Found 1 items drwxrwxrwx - root root 1 2011-06-01 09:57 /myvol/.snapshot/2011-06-01.09-57-49 Creating a Volume Snapshot You can create a snapshot manually or use a to automate snapshot creation. Each snapshot has an expiration date that determines schedule how long the snapshot will be retained: 1. 2. 3. 4. 1. 2. 3. 4. 1. 2. When you create the snapshot manually, specify an expiration date. When you schedule snapshots, the expiration date is determined by the Retain parameter of the . schedule For more information about scheduling snapshots, see . 
Scheduling a Snapshot Creating a Snapshot Manually You can create a snapshot using the command, or use the following procedure to create a snapshot using the MapR volume snapshot create Control System. To create a snapshot using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Volumes Select the checkbox beside the name of each volume for which you want a snapshot, then click the button to display the New Snapshot dialog. Snapshot Name Type a name for the new snapshot in the field. Name... Click to create the snapshot. OK Scheduling a Snapshot You schedule a snapshot by associating an existing schedule with a normal (non-mirror) volume. You cannot schedule snapshots on mirror volumes; in fact, since mirrors are read-only, creating a snapshot of a mirror would provide no benefit. You can schedule a snapshot by passing the ID of a to the command, or you can use the following procedure to choose a schedule for a volume using the MapR schedule volume modify Control System. To schedule a snapshot using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Volumes Display the dialog by clicking the volume name, or by selecting the checkbox beside the name of the volume then Volume Properties clicking the button. Properties In the Replication and Snapshot Scheduling section, choose a from the dropdown menu. schedule Snapshot Schedule Click to save changes to the volume. Modify Volume For information about creating a schedule, see . Schedules Viewing a List of Snapshots Viewing all Snapshots You can view snapshots for a volume with the command or using the MapR Control System. volume snapshot list To view snapshots using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Snapshots Viewing Snapshots for a Volume You can view snapshots for a volume by passing the volume to the command or using the MapR Control System. volume snapshot list To view snapshots using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Volumes Click the button to display the dialog. Snapshots Snapshots for Volume Removing a Volume Snapshot Each snapshot has an expiration date and time, when it is deleted automatically. You can remove a snapshot manually before its expiration, or you can preserve a snapshot to prevent it from expiring. Removing a Volume Snapshot Manually You can remove a snapshot using the command, or use the following procedure to remove a snapshot using the MapR volume snapshot remove Control System. 1. 2. 3. 4. 1. 2. 3. 4. 5. 6. 1. 2. 3. 1. 2. 3. 4. 5. To remove a snapshot using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Snapshots Select the checkbox beside each snapshot you wish to remove. Click to display the dialog. Remove Snapshot Remove Snapshots Click to remove the snapshot or snapshots. Yes To remove a snapshot from a specific volume using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Volumes Select the checkbox beside the volume name. Click Snapshots to display the dialog. Snapshots for Volume Select the checkbox beside each snapshot you wish to remove. Click to display the dialog. Remove Remove Snapshots Click to remove the snapshot or snapshots. Yes Preserving a Volume Snapshot You can preserve a snapshot using the command, or use the following procedure to create a volume using the MapR volume snapshot preserve Control System. 
To remove a snapshot using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Snapshots Select the checkbox beside each snapshot you wish to preserve. Click to preserve the snapshot or snapshots. Preserve Snapshot To remove a snapshot from a specific volume using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Volumes Select the checkbox beside the volume name. Click Snapshots to display the dialog. Snapshots for Volume Select the checkbox beside each snapshot you wish to preserve. Click to preserve the snapshot or snapshots. Preserve Data Protection You can use MapR to protect your data from hardware failures, accidental overwrites, and natural disasters. MapR organizes data into volumes so that you can apply different data protection strategies to different types of data. The following scenarios describe a few common problems and how easily and effectively MapR protects your data from loss. This page contains the following topics: Scenario: Hardware Failure Solution: Topology and Replication Factor Scenario: Accidental Overwrite Solution: Snapshots Scenario: Disaster Recovery Solution: Mirroring to Another Cluster Related Topics Scenario: Hardware Failure Even with the most reliable hardware, growing cluster and datacenter sizes will make frequent hardware failures a real threat to business continuity. In a cluster with 10,000 disks on 1,000 nodes, it is reasonable to expect a disk failure more than once a day and a node failure every few days. Solution: Topology and Replication Factor MapR automatically replicates data and places the copies on different nodes to safeguard against data loss in the event of hardware failure. By default, MapR assumes that all nodes are in a single rack. You can provide MapR with information about the rack locations of all nodes by setting topology paths. MapR interprets each topology path as a separate rack, and attempts to replicate data onto different racks to provide continuity in case of a power failure affecting an entire rack. These replicas are maintained, copied, and made available seamlessly without user intervention. 1. 2. a. b. c. 3. 1. 2. 3. 4. 1. 2. 3. 4. 5. 6. 1. 2. 3. 4. To set up topology and replication: In the MapR Control System, open the MapR-FS group and click to display the view. Nodes Nodes Set up each rack with its own path. For each rack, perform the following steps: Click the checkboxes next to the nodes in the rack. Click the button to display the dialog. Change Topology Change Node Topology In the Change Node Topology dialog, type a path to represent the rack. For example, if the cluster name is and the cluster1 nodes are in rack 14, type . /cluster1/rack14 When creating volumes, choose a of 3 or more to provide sufficient data redundancy. Replication Factor Scenario: Accidental Overwrite Even in a cluster with data replication, important data can be overwritten or deleted accidentally. If a data set is accidentally removed, the removal itself propagates across the replicas and the data is lost. Users or applications can corrupt data, and once the corruption spreads to the replicas the damage is permanent. Solution: Snapshots With MapR, you can create a point-in-time snapshot of a volume, allowing recovery from a known good data set. You can create a manual snapshot to enable recovery to a specific point in time, or schedule snapshots to occur regularly to maintain a recent recovery point. 
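As a command-line sketch of both approaches (the volume name and schedule ID below are hypothetical; the commands are the volume snapshot create and volume modify commands described elsewhere in this guide):

maprcli volume snapshot create -volume project-data -snapshotname before-migration
maprcli volume modify -name project-data -schedule 2

The first command takes an immediate, named snapshot before a risky change; the second attaches an existing schedule (by ID) to the volume so that snapshots are created automatically.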
If data is lost, you can restore the data using the most recent snapshot (or any snapshot you choose). Snapshots do not add a performance penalty, because they do not involve additional data copying operations; a snapshot can be created almost instantly regardless of data size. Example: Creating a Snapshot Manually In the Navigation pane, expand the group and click the view. MapR-FS Volumes Select the checkbox beside the name the volume, then click the button to display the dialog. New Snapshot Snapshot Name Type a name for the new snapshot in the field. Name... Click to create the snapshot. OK Example: Scheduling Snapshots This example schedules snapshots for a volume hourly and retains them for 24 hours. To create a schedule: In the Navigation pane, expand the group and click the view. MapR-FS Schedules Click . New Schedule In the field, type "Every Hour". Schedule Name From the first dropdown menu in the Schedule Rules section, select . Hourly In the field, specify 24 Hours. Retain For Click to create the schedule. Save Schedule To apply the schedule to the volume: In the Navigation pane, expand the group and click the view. MapR-FS Volumes Display the dialog by clicking the volume name, or by selecting the checkbox beside the volume name then clicking Volume Properties the button. Properties In the section, choose "Every Hour." Replication and Snapshot Scheduling Click to apply the changes and close the dialog. Modify Volume Scenario: Disaster Recovery A severe natural disaster can cripple an entire datacenter, leading to permanent data loss unless a disaster plan is in place. Solution: Mirroring to Another Cluster MapR makes it easy to protect against loss of an entire datacenter by mirroring entire volumes to a different datacenter. A mirror is a full read-only copy of a volume that can be synced on a schedule to provide point-in-time recovery for critical data. If the volumes on the original cluster contain a large amount of data, you can store them on physical media using the command and transport them to the mirror cluster. volume dump create Otherwise, you can simply create mirror volumes that point to the volumes on the original cluster and copy the data over the network. The 1. 2. 3. a. b. c. d. e. f. g. 1. 2. 3. a. b. mirroring operation conserves bandwidth by transmitting only the deltas between the source and the mirror, and by compressing the data over the wire. In addition, MapR uses checksums and a latency-tolerant protocol to ensure success even on high-latency WANs. You can set up a cascade of mirrors to replicate data over a distance. For instance, you can mirror data from New York to London, then use lower-cost links to replicate the data from London to Paris and Rome. To set up mirroring to another cluster: Use the command to create a full volume dump for each volume you want to mirror. volume dump create Transport the volume dump to the mirror cluster. For each volume on the original cluster, set up a corresponding volume on the mirror cluster. Restore the volume using the command. volume dump restore In the MapR Control System, click under the MapR-FS group to display the Volumes view. Volumes Click the name of the volume to display the dialog. Volume Properties Set the to Remote Mirror Volume. Volume Type Set the to the source volume name. Source Volume Name Set the to the cluster where the source volume resides. Source Cluster Name In the section, choose a schedule to determine how often the mirror will sync. 
(This setting is in the Replication and Mirror Scheduling section of the Volume Properties dialog.)

To recover volumes from mirrors:
1. Use the volume dump create command to create a full volume dump for each mirror volume you want to restore. Example:
   maprcli volume dump create -e statefile1 -dumpfile fulldump1 -name volume@cluster
2. Transport the volume dump to the rebuilt cluster.
3. For each volume on the mirror cluster, set up a corresponding volume on the rebuilt cluster:
   a. Restore the volume using the volume dump restore command. Example:
      maprcli volume dump restore -name volume@cluster -dumpfile fulldump1
   b. Copy the files to a standard (non-mirror) volume.

Related Topics
Mirror Volumes

Managing the Cluster
This section describes the tools and processes involved in managing a MapR cluster. Topics include upgrading the MapR software version; adding and removing disks and nodes; managing data replication and disk space with balancers; managing the services on a node; managing the topology of a cluster; and more. Choose a subtopic below for more detail.

Balancers
Central Configuration
Disks
  Setting Up Disks for MapR
  Specifying Disks or Partitions for Use by MapR
  Working with a Logical Volume Manager
Nodes
  Adding Nodes to a Cluster
  Managing Services on a Node
  Node Topology
  Isolating CLDB Nodes
  Isolating ZooKeeper Nodes
  Removing Roles
  Task Nodes
Services
  Assigning Services to Nodes for Best Performance
  Changing the User for MapR Services
  CLDB Failover
  Dial Home
  Startup and Shutdown
  TaskTracker Blacklisting
Uninstalling MapR
Designating NICs for MapR

Balancers
The disk space balancer and the replication role balancer redistribute data in the MapR storage layer to ensure maximum performance and efficient use of space:
The disk space balancer works to ensure that the percentage of space used on all disks in the node is similar, so that no nodes are overloaded.
The replication role balancer changes the replication roles of cluster containers so that the replication process uses network bandwidth evenly.

To view balancer configuration values:
Pipe the maprcli config load command through grep. Example:

# maprcli config load -json | grep balancer
"cldb.balancer.disk.max.switches.in.nodes.percentage":"10",
"cldb.balancer.disk.paused":"1",
"cldb.balancer.disk.sleep.interval.sec":"120",
"cldb.balancer.disk.threshold.percentage":"70",
"cldb.balancer.logging":"0",
"cldb.balancer.role.max.switches.in.nodes.percentage":"10",
"cldb.balancer.role.paused":"1",
"cldb.balancer.role.sleep.interval.sec":"900",
"cldb.balancer.startup.interval.sec":"1800",

To set balancer configuration values:
Use the config save command to set the appropriate values. Example:

# maprcli config save -values {"cldb.balancer.disk.max.switches.in.nodes.percentage":"20"}

Disk Space Balancer
The disk space balancer is a tool that balances disk space usage on a cluster by moving containers between storage pools. It distributes containers to storage pools on other nodes that have lower utilization than the average for the cluster. The disk space balancer checks every storage pool on a regular basis and moves containers from a storage pool when that pool's utilization meets the following conditions:
The storage pool is over 70% full.
The storage pool's utilization exceeds the average utilization on the cluster by a specified threshold: When the average cluster storage utilization is below 80%, the threshold is 10%. When the average cluster storage utilization is below 90% but over 80%, the threshold is 3%.
When the average cluster storage utilization is below 94% but over 90%, the threshold is 2%. The disk space balancer aims to ensure that the percentage of space used on all of the disks in the cluster is similar. You can view disk usage on all nodes in the view, by clicking in the Navigation pane and the choosing from the Disks Cluster > Nodes Disks dropdown. By default, the balancers are turned off. To turn on the disk space balancer, use to set to config save cldb.balancer.disk.paused 0 To turn on the replication role balancer, use to set to config save cldb.balancer.role.paused 0 Disk Space Balancer Configuration Parameters Parameter Value Description cldb.balancer.disk.threshold.percentage 70 Threshold for moving containers out of a given storage pool, expressed as utilization percentage. cldb.balancer.disk.paused 1 Specifies whether the disk space balancer runs: 0 - Not paused (normal operation) 1 - Paused (does not perform any container moves) cldb.balancer.disk.max.switches.in.nodes.pe rcentage 10 This can be used to throttle the disk balancer. If it is set to 10, the balancer will throttle the number of concurrent container moves to 10% of the total nodes in the cluster (minimum 2). Disk Space Balancer Status Use the command to view detailed information about the storage pools on a cluster. maprcli dump balancerinfo # maprcli dump balancerinfo usedMB fsid spid percentage outTransitMB inTransitMB capacityMB 209 5567847133641152120 01f8625ba1d15db7004e52b9570a8ff3 1 0 0 15200 209 1009596296559861611 816709672a690c96004e52b95f09b58d 1 0 0 15200 If there are any active container moves at the time the command is run, returns information about the source maprcli dump balancerinfo and destination storage pools. # maprcli dump balancerinfo -json .... { "containerid":7840, "sizeMB":15634, "From fsid":8081858704500413174, "From IP:Port":"10.50.60.64:5660-", "From SP":"9e649bf0ac6fb9f7004fa19d200abcde", "To fsid":3770844641152008527, "To IP:Port":"10.50.60.73:5660-", "To SP":"fefcc342475f0286004fad963f0fghij" } For more information about this command, see . maprcli dump balancerinfo Disk Space Balancer Metrics The command returns a cumulative count of container moves and MB of data moved between storage maprcli dump balancermetrics pools since the current CLDB became the the master CLDB. # maprcli dump balancermetrics -json { "timestamp":1337770325979, "status":"OK", "total":1, "data":[ { "numContainersMoved":10090, "numMBMoved":3147147, "timeOfLastMove": "Wed May 23 03:51:44 PDT 2012" } ] } For more information about this command, see . maprcli dump balancermetrics Replication Role Balancer The is a tool that switches the replication roles of containers to ensure that every node has an equal share of master and replication role balancer replica containers (for name containers) and an equal share of master, intermediate, and tail containers (for data containers). The replication role balancer changes the replication role of the containers in a cluster so that network bandwidth is spread evenly across all nodes during the replication process. A container's replication role determines how it is replicated to the other nodes in the cluster. For name (the volume's first container), replication occurs simultaneously from the master to all replica containers. For , containers data containers replication proceeds from the master to the intermediate container(s) until it reaches the tail containers. Replication occurs over the network between nodes, often in separate racks. 
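Like the disk space balancer, the replication role balancer is paused by default. A minimal sketch of checking its current settings and unpausing it, using the config load and config save commands shown earlier (the parameter names are the ones listed in the configuration output above):

# maprcli config load -json | grep cldb.balancer.role
# maprcli config save -values {"cldb.balancer.role.paused":"0"}

The first command displays the current role balancer parameters; the second sets cldb.balancer.role.paused to 0 so the balancer can begin switching replication roles.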
Replication Role Balancer Configuration Parameters Parameter Value Description cldb.balancer.role.paused 1 Specifies whether the role balancer runs: 0 - Not paused (normal operation) 1 - Paused (does not perform any container replication role switches) cldb.balancer.role.max.switches.in.nodes.per centage 10 This can be used to throttle the role balancer. If it is set to 10, the balancer will throttle the number of concurrent role switches to 10% of the total nodes in the cluster (minimum 2). Replication Role Balancer Status The command returns information the number of active replication role switches. maprcli dump rolebalancerinfo During a replication role switch, the replication role balancer selects a master or intermediate data container and switches its replication role to that of a tail data container. # maprcli dump rolebalancerinfo -json { "timestamp":1335835436698, "status":"OK", "total":1, "data":[ { "containerid": 36659, "Tail IP:Port":"10.50.60.123:5660-", "Updates blocked Since":"Wed May 23 05:48:15 PDT 2012" } ] } For more information about this command, see . maprcli dump rolebalancerinfo Central Configuration MapR provides a central location where you can place customized configuration files for all the services running on the MapR cluster. As a result, you do not have to edit the configuration files on each node individually. Default configuration files for each service are stored locally under . You can edit these files to create customized versions of the /opt/mapr/ configuration files, store them in a central location, and MapR will overwrite the local files in . /opt/mapr Central configuration files are stored in a volume, (mounted at ), that is created just for mapr.configuration /var/mapr/configuration central configuration. This page contains the following topics: Configuration Files for Each Service Using Customized Configuration Files How the Script Works pullcentralconfig Preserving Multiple Versions of Configuration Files Configuration Files for Each Service Each service on a node has one or more configuration files associated with it. The default version of each configuration file can be used as a template that you can modify as needed. The file locations are listed in . You can run /opt/mapr/servicesconf/<service> $cat /opt/mapr/servicesconf/<service> to display the contents. Note that all pathnames are relative to . Sample contents are shown here, and reflect the currently installed $MAPR_HOME versions of hbase and hadoop. Service Pathnames of Configuration Files cldb conf/BaseLicense.txt   conf/cldb.conf   conf/hadoop-metrics.properties   conf/log4j.cldb.properties   conf/log4j.properties   conf/MapRLicenseIssuerCert.der fileserver conf/mfs.conf hbmaster hbase/hbase-0.92.1/conf/ Click here to expand... hadoop-metrics.properties hbase-env.sh hbase-policy.xml hbase-site.xml log4j.properties regionservers hbregionserver hbase/hbase-0.92.1/conf/ Click here to expand... hadoop-metrics.properties hbase-env.sh hbase-policy.xml hbase-site.xml log4j.properties regionservers jobtracker hadoop/hadoop-0.20.2/conf/ Click here to expand... 
capacity-scheduler.xml configuration.xsl core-site.xml fair-scheduler.xml hadoopDefaultMetricsList hadoop-env.sh hadoop-metrics2.properties.example hadoop-metrics.properties hadoop-policy.xml log4j.properties mapred-queue-acls.xml mapred-site.xml masters pools.xml slaves ssl-client.xml.example ssl-server.xml.example taskcontroller.cfg metrics conf/db.conf   bin/setup.sql   conf/hibernate.cfg.xml nfs conf/nfsserver.conf   conf/exports tasktracker hadoop/hadoop-0.20.2/conf/ Click here to expand... capacity-scheduler.xml configuration.xsl core-site.xml fair-scheduler.xml hadoopDefaultMetricsList hadoop-env.sh hadoop-metrics2.properties.example hadoop-metrics.properties hadoop-policy.xml log4j.properties mapred-queue-acls.xml mapred-site.xml masters pools.xml slaves ssl-client.xml.example ssl-server.xml.example taskcontroller.cfg webserver conf/web.conf Using Customized Configuration Files Scenario Suppose you have a cluster with eight nodes, and five of them (host1, host2, host3, host4, and host5) are running the TaskTracker service. Now suppose you want to create one customized configuration file ( ) that applies to host2 through host5 and assign a different mapred-site.xml customized configuration file to host1. Hostname Customized Configuration Files host1 /var/mapr/configuration/ /hadoop/hadoop-0 nodes/host1 .20.2/conf/mapred-site.xml host2 /var/mapr/configuration/ /hadoop/hadoop-0.20. default 2/conf/mapred-site.xml host3 /var/mapr/configuration/ /hadoop/hadoop-0.20. default 2/conf/mapred-site.xml host4 /var/mapr/configuration/ /hadoop/hadoop-0.20. default 2/conf/mapred-site.xml host5 /var/mapr/configuration/ /hadoop/hadoop-0.20. default 2/conf/mapred-site.xml host6 none host7 none host8 none Do change the name of the new configuration file - it must match the name of the original version. not 1. 2. 3. 1. 2. 3. 1. 2. To create a customized configuration file for host2 through host5: Make a copy of the existing default version of the file (so you can use it as a template), and store it in . mapred-site.xml /tmp cp /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml /tmp/mapred-site.xml Edit the copy and put in the changes you want for host2 through host5. Store the new configuration file in the directory. /var/mapr/configuration/default $ hadoop fs -put /tmp/mapred-site.xml /var/mapr/configuration/default/hadoop/hadoop-0.20.2/conf/mapred-site.xml To create a node-specific configuration file for host1: Edit the configuration file in (or you could copy the default version into again and edit that) and create mapred-site.xml /tmp /tmp the node-specific configuration file for host1. Create a sub-directory under : /host1 /var/mapr/configuration/nodes hadoop fs -mkdir /var/mapr/configuration/nodes/host1 Store the new configuration file for host1 in the node-specific directory you just created. $ hadoop fs -put /tmp/mapred-site.xml /var/mapr/configuration/nodes/host1/hadoop/hadoop-0.20.2/conf/mapred-site.xml Verifying the changes Now that you have two separate customized configuration files for your TaskTracker nodes, the script will detect the new pullcentralconfig files in the next time it searches. It overwrites the local version in with the appropriate customized /var/mapr/configuration /opt/mapr version for each of the five TaskTracker nodes. For the changes to take effect: Run the script to put the modified configuration files under the local directory. pullcentralconfig /opt/mapr Run from the command line to overwrite the old files immediately. 
pullcentralconfig /opt/mapr/server/pullcentralconfig or Wait five minutes (the interval between successive checks for updated configuration files) for the script to run automatically. Look at the information messages in . Whenever the timestamp comparison is , an INFO message pullcentralconfig.log false indicates that the older file under is copied into a backup file ( ) and the newer version is copied to replace it. Sample /opt/mapr .bkp log output is shown here: 2. 3. # cat /opt/mapr/logs/pullcentralconfig.log Thu Jun 20 11:24:07 PDT 2013 INFO Check mtimes: false Thu Jun 20 11:24:08 PDT 2013 INFO Copying /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml to /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml.bkp ... Thu Jun 20 11:24:08 PDT 2013 INFO Copied /var/mapr/configuration/default/hadoop/hadoop-0.20.2/conf/mapred-site.xml to /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml Restart the service. From the command line: maprcli node services -nodes <hostname> -<service> restart For example, to restart the TaskTracker on nodes host1 through host5, enter: maprcli nodes services -nodes host1,host2,host3,host4,host5 -tasktracker restart From the MCS: On the tab, use a filter to display the nodes that are running the service you want to restart (TaskTracker in this case). Nodes Mark the check boxes next to the nodes where you want to restart the service. For this example, two nodes are selected. Click on the button at the top to display the dialog, then click the dropdown menu next Manage services for 2 nodes Manage Services to TaskTracker and select , then click . Restart OK 1. 2. How the Script Works pullcentralconfig The script is launched automatically at specified intervals (the default interval is 300 seconds, which is five minutes). For pullcentralconfig each service listed in , the script searches for corresponding configuration files in the central /opt/mapr/roles pullcentralconfig configuration location, . If any configuration files are found, it compares the timestamp ( ) of the files in the /var/mapr/configuration mtimes central configuration location to the timestamp of the local version in . If the central configuration version is newer, /opt/mapr pullcentralcon overwrites the local version with the central configuration file. To ensure that the newer version gets used, you need to fig restart the associated . service Checking for Node-specific vs. Cluster-wide Configuration Files The script checks for central configuration files in this order: pullcentralconfig Node-specific configuration files under . /var/mapr/configuration/nodes/<hostname> If configuration files are found here, does not check . pullcentralconfig /var/mapr/configuration/default Cluster-wide configuration files under . /var/mapr/configuration/default The script only searches here if no node-specific configuration files are found. pullcentralconfig If no configuration files are found in either location, the script finishes and no changes are made to the files in . /opt/mapr Changing the Polling Frequency By default, the script polls the central configuration location every five minutes (300 seconds) to check for configuration pullcentralconfig files. You can change the polling frequency by editing this variable in : warden.conf pollcentralconfig.interval.seconds To make the change take effect, restart warden: root$> service mapr-warden restart Disabling Central Configuration Central configuration is enabled by default. 
You can disable the central configuration feature by editing and setting the following warden.conf variable to : false centralconfig.enabled=false By disabling the feature, the script stops checking for more recent versions of each configuration file. To make the change pullcentralconfig take effect, restart warden: root$> service mapr-warden restart Preserving Multiple Versions of Configuration Files If you want to save multiple versions of customized configuration files, you can take snapshots of to preserve each mapr.configuration version you create. Disks MapR-FS groups disks into usually made up of two or three disks. storage pools, When adding disks to MapR-FS, it is a good idea to add at least two or three at a time so that MapR can create properly-sized storage pools. Each node in a MapR cluster can support up to 36 storage pools. When you remove a disk from MapR-FS, any other disks in the storage pool are also removed automatically from MapR-FS and are no longer in use. Their disk storage goes to 0%, and they are eligible to be re-added to MapR-FS to build a new storage pool. You can either replace the disk and re-add it along with the other disks that were in the storage pool, or just re-add the other disks if you do not plan to replace the disk you removed. MapR maintains a list of disks used by MapR-FS in a file called on each node. disktab The following sections provide procedures for working with disks: Adding Disks - adding disks for use by MapR-FS Removing Disks - removing disks from use by MapR-FS Handling Disk Failure - replacing a disk in case of failure Tolerating Slow Disks - increasing the disk timeout to handle slow disks Adding Disks You can add one or more available disks to MapR-FS using the command or the MapR Control System. In both cases, MapR disk add automatically takes care of formatting the disks and creating storage pools. Only the two most recent versions of a configuration file are preserved in . The file extension indicates the back-up /opt/mapr .bkp configuration file. Before removing or replacing disks, make sure the Replication Alarm ( ) and Data Alarm ( VOLUME_ALARM_DATA_UNDER_REPLICATED ) are not raised. These alarms can indicate potential or actual data loss! If either alarm is raised, VOLUME_ALARM_DATA_UNAVAILABLE it may be necessary to attempt repair using the utility before removing or replacing disks. /opt/mapr/server/fsck Using the utility with the flag to repair a filesystem risks data loss. Call MapR support before /opt/mapr/server/fsck -r using . /opt/mapr/server/fsck -r 1. 2. 3. 4. 5. 1. 2. 3. 4. 5. 6. 1. 2. To add disks using the MapR Control System: Add physical disks to the node or nodes according to the correct hardware procedure. In the Navigation pane, expand the group and click the view. Cluster Nodes Click the name of the node on which you wish to add disks. In the pane, select the checkboxes beside the disks you wish to add. MapR-FS and Available Disks Click to add the disks. Properly-sized storage pools are allocated automatically. Add Disks to MapR-FS Removing Disks You can remove one or more disks from MapR-FS using the command or the MapR Control System. When you remove disks from disk remove MapR-FS, any other disks in the same storage pool are also removed from MapR-FS and become (not in use, and eligible to be available re-added to MapR-FS). 
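Before removing a disk, it can help to confirm how the node's disks are currently organized. A minimal sketch, assuming a node named node-a (the hostname is hypothetical):

maprcli disk list -host node-a
/opt/mapr/server/mrconfig sp list

The first command (run from any node) lists the disks that MapR knows about on node-a; the second (run on node-a itself) lists that node's storage pools, so you can anticipate which other disks will be taken offline along with the one you remove.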
If you are removing and replacing failed disks, you can install the replacements, then re-add the replacement disks and the other disks that were in the same storage pool(s) as the failed disks. If you are removing disks but not replacing them, you can just re-add the other disks that were in the same storage pool(s) as the failed disks. To remove disks using the MapR Control System: In the Navigation pane, expand the group and click the view. Cluster Nodes Click the name of the node from which you wish to remove disks. In the pane, select the checkboxes beside the disks you wish to remove. MapR-FS and Available Disks Click to remove the disks from MapR-FS. Remove Disks from MapR-FS Wait several minutes while the removal process completes. After you remove the disks, any other disks in the same storage pools are taken offline and marked as (not in use by MapR). available Remove the physical disks from the node or nodes according to the correct hardware procedure. Handling Disk Failure When a disk fails, MapR raises the node-level alarm and identifies the failed disks as well as the nodes those NODE_ALARM_DISK_FAILURE disks are on. When you see a disk failure alarm, check the field in the file on that node. Failure Reason /opt/mapr/logs/faileddisk.log The following failure cases may not require disk replacement: Failure Reason: Timeout - Increase the value of the parameter in the file. mfs.io.disk.timeout /opt/mapr/conf/mfs.conf Failure Reason: Disk GUID mismatch - After a node restart, the operating system can reassign the drive labels (for example, ), /sda resulting in drive labels no longer matching the entries in the file. Edit the file according to the instructions in the log disktab disktab to repair the problem. If there are any volume alarms ( ) in the cluster, follow these steps to run the Data Unavailable VOLUME_ALARM_DATA_UNAVAILABLE /o utility on all of the offline storage pools. On each node in the cluster that has raised a disk failure alarm: pt/mapr/server/fsck Run the following command: [user@host] /opt/mapr/server/mrconfig sp list | grep Offline For each storage pool reported by the previous command, run the following command, where specifies the name of an <sp> offline storage pool: 1. 2. 3. If you are running MapR 1.2.2 or earlier, do not use the command or the MapR Control System to add disks to MapR-FS. You disk add must either upgrade to MapR 1.2.3 before adding or replacing a disk, or use the following procedure (which avoids the comma disk add nd): Use the to the failed disk. All other disks in the same storage pool are removed at the same MapR Control System remove time. Make a note of which disks have been removed. Create a text file containing a list of the disks you just removed. See . /tmp/disks.txt Setting Up Disks for MapR Add the disks to MapR-FS by typing the following command (as or with ): root sudo /opt/mapr/server/disksetup -F /tmp/disks.txt 2. 3. 1. 2. 3. 4. 5. 6. 1. 2. 3. 4. [user@host] /opt/mapr/server/fsck -n <sp> -r Verify that all volume alarms are cleared. If volume alarms persist, contact MapR support or Data Unavailable Data Unavailable post on . answers.mapr.com If there any volume alarms ( ) in the cluster, allow reasonable time Data Under Replicated VOLUME_ALARM_DATA_UNDER_REPLICATED for re-replication, then verify that the under-replication alarms are cleared. If volume alarms persist, contact Data Under Replicated MapR support or post on . answers.mapr.com Disk Failure node alarms that persist require disk replacement. 
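As a quick check before replacing hardware, a minimal sketch of confirming the alarm and the reported cause on the affected node (the grep pattern simply searches for the Failure Reason field named above):

maprcli alarm list
grep "Failure Reason" /opt/mapr/logs/faileddisk.log

If the reason is a timeout or a disk GUID mismatch, apply the corresponding remedy described above instead of replacing the disk; otherwise, proceed with the replacement procedures that follow.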
To replace disks using the MapR command-line interface: On the node with failed disks, determine which disk to replace by examining entries in the fil Disk /opt/mapr/logs/faileddisk.log e. Use the command to remove the disk. Run the following command, substituting the hostname or IP address for a disk remove <host> nd a list of disks for : <disks> [user@host] maprcli disk remove -host <host> -disks <disks> The disk removal process can take several minutes to complete. Note any disks that appear in the output from the command fdisk -l that are not listed in the file. These disks are failed disks that have been successfully removed from MapR-FS in the previous disktab step. Replace the failed disks on the node or nodes, following correct procedures for your hardware. Remove the failed disk log file from the directory. These log files are typically named in the pattern /opt/mapr/logs .faile diskname . d.info Use the command to add the replacement disk or disks along with other disks from the same storage pool or pools. Run the disk add following command, substituting the hostname or IP address for and a list of disks for : <host> <disks> [user@host] maprcli disk add -host <host> -disks <disks> Once the disks are added to MapR-FS, the cluster allocates properly sized storage pools automatically. To replace disks using the MapR Control System: Identify the failed disk or disks: In the Navigation pane, expand the group and click the view. Cluster Nodes Click the name of the node on which you wish to replace disks, and look in the pane. MapR-FS and Available Disks Remove the failed disk or disks from MapR-FS: In the pane, select the checkboxes beside the failed disks. MapR-FS and Available Disks Click to remove the disks from MapR-FS. Remove Disks from MapR-FS Wait several minutes while the removal process completes. After you remove the disks, any other disks in the same storage pools are taken offline and marked as (not in use by MapR). available From a command line terminal, remove the failed disk log file from the directory. These log files are typically /opt/mapr/logs named in the pattern . .failed.info diskname Replace the failed disks on the node or nodes according to the correct hardware procedure. Add the replacement and available disks to MapR-FS: In the Navigation pane, expand the group and click the view. Cluster Nodes Click the name of the node on which you replaced the disks. In the pane, select the checkboxes beside the disks you wish to add. MapR-FS and Available Disks Using the utility with the flag to repair a filesystem risks data loss. Call MapR support /opt/mapr/server/fsck -r before using . /opt/mapr/server/fsck -r This step force-formats the disks. Any data on these disks will be lost. 4. 1. 2. 3. 4. 5. Click to add the disks. Properly-sized storage pools are allocated automatically. Add Disks to MapR-FS Tolerating Slow Disks The parameter in determines how long MapR waits for a disk to respond before assuming it has failed. If mfs.io.disk.timeout mfs.conf healthy disks are too slow, and are erroneously marked as failed, you can increase the value of this parameter. Setting Up Disks for MapR MapR formats and uses disks for the Lockless Storage Services layer (MapR-FS), recording these disks in the file . In a production disktab environment, or when testing performance, MapR should be configured to use physical hard drives and partitions. 
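Putting the command-line procedure above together, here is a condensed sketch for a single failed disk, assuming the node is node-a, the failed disk is /dev/sdd, and /dev/sde and /dev/sdf were the other disks in its storage pool (all names are hypothetical):

maprcli disk remove -host node-a -disks /dev/sdd
(physically replace the disk, then delete the old failure log on node-a)
rm /opt/mapr/logs/sdd.failed.info
maprcli disk add -host node-a -disks /dev/sdd,/dev/sde,/dev/sdf

The disk add step lists the replacement together with the other disks that belonged to the same storage pool, since those disks were taken offline when the failed disk was removed; MapR then allocates properly sized storage pools automatically.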
In some cases, it is necessary to reinstall the operating system on a node so that the physical hard drives are available for direct use by MapR. Reinstalling the operating system provides an unrestricted opportunity to configure the hard drives. If the installation procedure assigns hard drives to be managed by the Linux Logical Volume Manager (LVM) by default, you should explicitly remove from LVM configuration the drives you plan to use with MapR. It is common to let LVM manage one physical drive containing the operating system partition(s) and to leave the rest unmanaged by LVM for use with MapR. The following procedures are intended for use on physical clusters or Amazon EC2 instances. On EC2 instances, EBS volumes can be used as MapR storage, although performance will be slow. To determine if a disk or partition is ready for use by MapR: Run the command to determine whether any processes are already using the disk or partition. sudo lsof <partition> There should be no output when running , indicating there is no process accessing the specific disk or sudo fuser <partition> partition. The disk or partition should not be mounted, as checked via the output of the command. If the disk or partition is mounted, mount unmount it using the command. umount The disk or partition should not have an entry in the file; comment out or delete any such entries. /etc/fstab The disk or partition should be accessible to standard Linux tools such as . You should be able to successfully format the partition mkfs using a command like as this is similar to the operations MapR performs during installation. If fa sudo mkfs.ext3 <partition> mkfs ils to access and format the partition, then it is highly likely MapR will encounter the same problem. Any disk or partition that passes the above testing procedure can be added to the list of disks and partitions passed to the command. disksetup To specify disks or partitions for use by MapR: The script is used to format disks for use by the MapR cluster. Create a text file listing the disks and partitions for disksetup /tmp/disks.txt use by MapR on the node. Each line lists either a single disk or all applicable partitions on a single disk. When listing multiple partitions on a line, separate by spaces. For example: /dev/sdb /dev/sdc1 /dev/sdc2 /dev/sdc4 /dev/sdd Later, when you run to format the disks, specify the file. For example: disksetup disks.txt /opt/mapr/server/disksetup -F /tmp/disks.txt If you are re-using a node that was used previously in another cluster, it is important to format the disks to remove any traces of data from the old cluster. If you are using , you do not have to use this procedure; the disks are set up for you automatically. MapR on Amazon EMR The script removes all data from the specified disks. Make sure you specify the disks correctly, and that any data you wish disksetup to keep has been backed up elsewhere. Run only after running . disksetup configure.sh 1. 2. 3. 4. 5. To evaluate MapR using a flat storage file instead of formatting disks: When setting up a small cluster for evaluation purposes, if a particular node does not have physical disks or partitions available to dedicate to the cluster, you can use a flat file on an existing disk partition as the node's storage. Create at least a 16GB file, and include a path to the file in the disk list file for the script. 
The following example creates a 20 GB flat file (bs=1G specifies 1-gigabyte blocks, multiplied by count=20) at /root/storagefile:
$ dd if=/dev/zero of=/root/storagefile bs=1G count=20
Then, you would add the following line to the disk list file /tmp/disks.txt to be used by disksetup:
/root/storagefile

Working with a Logical Volume Manager
The Logical Volume Manager creates symbolic links to each logical volume's block device, from a directory path in the form /dev/<volume group>/<volume name>. MapR needs the actual block location, which you can find by using the ls -l command to list the symbolic links.
1. Make sure you have free, unmounted logical volumes for use by MapR:
Unmount any mounted logical volumes that can be erased and used for MapR.
Allocate any free space in an existing logical volume group to new logical volumes.
2. Make a note of the volume group and volume name of each logical volume.
3. Use ls -l with the volume group and volume name to determine the path of each logical volume's block device. Each logical volume is a symbolic link to a logical block device from a directory path that uses the volume group and volume name:
/dev/<volume group>/<volume name>
The following example shows output that represents a volume group named mapr containing logical volumes named mapr1, mapr2, mapr3, and mapr4:
# ls -l /dev/mapr/mapr*
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr1 -> /dev/mapper/mapr-mapr1
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr2 -> /dev/mapper/mapr-mapr2
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr3 -> /dev/mapper/mapr-mapr3
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr4 -> /dev/mapper/mapr-mapr4
4. Create a text file /tmp/disks.txt containing the paths to the block devices for the logical volumes (one path on each line). Example:
$ cat /tmp/disks.txt
/dev/mapper/mapr-mapr1
/dev/mapper/mapr-mapr2
/dev/mapper/mapr-mapr3
/dev/mapper/mapr-mapr4
5. Pass disks.txt to disksetup:
# sudo /opt/mapr/server/disksetup -F /tmp/disks.txt

Specifying Disks or Partitions for Use by MapR
The disksetup script is used to format disks for use by the MapR cluster. Create a text file /tmp/disks.txt listing the disks and partitions for use by MapR on the node. Each line lists either a single disk or all applicable partitions on a single disk. When listing multiple partitions on a line, separate them by spaces. For example:
/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd
Later, when you run disksetup to format the disks, specify the disks.txt file. For example:
/opt/mapr/server/disksetup -F /tmp/disks.txt
If you are re-using a node that was used previously in another cluster, it is important to format the disks to remove any traces of data from the old cluster.

Testing MapR Without Formatting Physical Disks
When setting up a small cluster for evaluation purposes, if a particular node does not have physical disks or partitions available to dedicate to the cluster, you can use a flat file on an existing disk partition as the node's storage. Create at least a 16GB file, and include a path to the file in the disk list file for the disksetup script.
The following example creates a 20 GB flat file (bs=1G specifies 1-gigabyte blocks, multiplied by count=20) at /root/storagefile:
$ dd if=/dev/zero of=/root/storagefile bs=1G count=20
Then, you would add the following line to the disk list file /tmp/disks.txt to be used by disksetup:
/root/storagefile

Working with a Logical Volume Manager
The Logical Volume Manager creates symbolic links to each logical volume's block device, from a directory path in the form /dev/<volume group>/<volume name>. MapR needs the actual block location, which you can find by using the ls -l command to list the symbolic links.
1. Make sure you have free, unmounted logical volumes for use by MapR:
Unmount any mounted logical volumes that can be erased and used for MapR.
Allocate any free space in an existing logical volume group to new logical volumes.
2. Make a note of the volume group and volume name of each logical volume.
3. Use ls -l with the volume group and volume name to determine the path of each logical volume's block device. Each logical volume is a symbolic link to a logical block device from a directory path that uses the volume group and volume name:
/dev/<volume group>/<volume name>
The following example shows output that represents a volume group named mapr containing logical volumes named mapr1, mapr2, mapr3, and mapr4:
# ls -l /dev/mapr/mapr*
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr1 -> /dev/mapper/mapr-mapr1
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr2 -> /dev/mapper/mapr-mapr2
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr3 -> /dev/mapper/mapr-mapr3
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr4 -> /dev/mapper/mapr-mapr4
The disksetup script removes all data from the specified disks. Make sure you specify the disks correctly, and that any data you wish to keep has been backed up elsewhere.
4. Create a text file /tmp/disks.txt containing the paths to the block devices for the logical volumes (one path on each line). Example:
$ cat /tmp/disks.txt
/dev/mapper/mapr-mapr1
/dev/mapper/mapr-mapr2
/dev/mapper/mapr-mapr3
/dev/mapper/mapr-mapr4
5. Pass disks.txt to disksetup:
# sudo /opt/mapr/server/disksetup -F /tmp/disks.txt

Nodes
This page provides information about managing nodes in the cluster, including the following topics:
Viewing a List of Nodes
Adding a Node
Managing Services
Formatting Disks on a Node
Removing a Node
Decommissioning a Node
Reconfiguring a Node
Stopping a Node
Installing or Removing Software or Hardware
Setting Up a Node
Starting the Node
Renaming a Node
Maintenance Mode for Nodes

Viewing a List of Nodes
You can view all nodes using the node list command, or view them in the MapR Control System using the following procedure.
To view all nodes using the MapR Control System:
In the Navigation pane, expand the Cluster group and click the Nodes view.

Adding a Node
To Add Nodes to a Cluster
1. PREPARE all nodes, making sure they meet the hardware, software, and configuration requirements.
2. PLAN which services to run on the new nodes.
3. INSTALL MapR Software:
On all new nodes, ADD the MapR Repository.
On each new node, INSTALL the planned MapR services.
On all new nodes, RUN configure.sh.
On all new nodes, FORMAT disks for use by MapR.
If any configuration files on your existing cluster's nodes have been modified (for example, warden.conf or mapred-site.xml), replace the default configuration files on all new nodes with the appropriate modified files.
4. Start ZooKeeper on all new nodes that have ZooKeeper installed:
service mapr-zookeeper start
5. Start the warden on all new nodes:
service mapr-warden start
6. If any of the new nodes are CLDB and/or ZooKeeper nodes, RUN configure.sh on all new and existing nodes in the cluster, specifying all CLDB and ZooKeeper nodes.
7. SET UP node topology for the new nodes.
8. On any new nodes running NFS, SET UP NFS for HA.

Managing Services
You can manage node services using the node services command, or in the MapR Control System using the following procedure.
To manage node services using the MapR Control System:
1. In the Navigation pane, expand the Cluster group and click the Nodes view.
2. Select the checkbox beside the node or nodes whose services you wish to manage.
3. Click the Manage Services button to display the Manage Node Services dialog.
4. For each service you wish to start or stop, select the appropriate option from the corresponding drop-down menu.
5. Click Change Node to start and stop the services according to your selections.
You can also display the Manage Node Services dialog by clicking Manage Services in the Node Properties view.

Formatting Disks on a Node
The disksetup script is used to format disks for use by the MapR cluster. Create a text file /tmp/disks.txt listing the disks and partitions for use by MapR on the node. Each line lists either a single disk or all applicable partitions on a single disk. When listing multiple partitions on a line, separate them by spaces. For example:
/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd
Later, when you run disksetup to format the disks, specify the disks.txt file. For example:
/opt/mapr/server/disksetup -F /tmp/disks.txt
If you are re-using a node that was used previously in another cluster, it is important to format the disks to remove any traces of data from the old cluster.
The disksetup script removes all data from the specified disks. Make sure you specify the disks correctly, and that any data you wish to keep has been backed up elsewhere.

Removing a Node
You can remove a node using the node remove command, or in the MapR Control System using the following procedure. Removing a node detaches the node from the cluster, but does not remove the MapR software from the cluster.
To remove a node using the MapR Control System:
1. Before you start, drain the node of data by moving the node to the physical /decommissioned topology. All the data on a node in the /decommissioned topology is migrated to volumes and nodes in the /data topology.
2. Run the following command to check if a given volume is present on the node:
maprcli dump volumenodes -volumename <volume> -json | grep <ip:port>
Run this command for each non-local volume in your cluster to verify that the node being removed is not storing any volume data.
3. In the Navigation pane, expand the Cluster group and click the Nodes view.
4. Select the checkbox beside the node or nodes you wish to remove.
5. Click Manage Services and stop all services on the node.
6. Wait 5 minutes. The Remove button becomes active.
7. Click the Remove button to display the Remove Node dialog.
8. Click Remove Node to remove the node.
You can also remove a node by clicking Remove Node in the Node Properties view.

Decommissioning a Node
Use the following procedures to remove a node and uninstall the MapR software. This procedure detaches the node from the cluster and removes the MapR packages, log files, and configuration files, but does not format the disks.
To decommission a node permanently: Before you start, drain the node of data by moving the node to the physical . All the data on a node in the /decommissioned topology /decommi topology is migrated to volumes and nodes in the topology. ssioned /data Run the following command to check if a given volume is present on the node: maprcli dump volumenodes -volumename <volume> -json | grep <ip:port> Run this command for each non-local volume in your cluster to verify that the node being decommissioned is not storing any volume data. Change to the root user (or use sudo for the following commands). Stop the Warden: service mapr-warden stop If ZooKeeper is installed on the node, stop it: service mapr-zookeeper stop Determine which MapR packages are installed on the node: dpkg --list | grep mapr (Ubuntu) rpm -qa | grep mapr (Red Hat or CentOS) Remove the packages by issuing the appropriate command for the operating system, followed by the list of services. Examples: apt-get purge mapr-core mapr-cldb mapr-fileserver (Ubuntu) yum erase mapr-core mapr-cldb mapr-fileserver (Red Hat or CentOS) Remove the directory to remove any instances of , , , and left behind by the package /opt/mapr hostid hostname zkdata zookeeper manager. Remove any MapR cores in the directory. /opt/cores If the node you have decommissioned is a CLDB node or a ZooKeeper node, then run on all other nodes in the cluster configure.sh (see ). Configuring a Node If you are using Ganglia, restart all gmeta and gmon daemons in the cluster. See . Ganglia Before Decommissioning a Node Make sure any data on the node is replicated and any needed services are running elsewhere. If the node you are decommissioning runs a critical service such as CLDB or ZooKeeper, verify that enough instances of that service are running on other nodes in the cluster. See for recommendations on service assignment to nodes. Planning the Deployment 1. 2. 3. 1. 2. 3. 4. 5. 1. Reconfiguring a Node You can add, upgrade, or remove services on a node to perform a manual software upgrade or to change the roles a node serves. There are four steps to this procedure: Stopping the Node Formatting the Disks (optional) Installing or Removing Software or Hardware Configuring the Node Starting the Node This procedure is designed to make changes to existing MapR software on a machine that has already been set up as a MapR cluster node. If you need to install software for the first time on a machine to create a new node, please see instead. Adding a Node Stopping a Node Change to the root user (or use sudo for the following commands). Stop the Warden: service mapr-warden stop If ZooKeeper is installed on the node, stop it: service mapr-zookeeper stop Installing or Removing Software or Hardware Before installing or removing software or hardware, stop the node using the procedure described in . Stopping the Node Once the node is stopped, you can add, upgrade or remove software or hardware. At some point in time after adding or removing services, it is recommended to restart the warden, to re-optimize memory allocation among all the services on the node. It is not crucial to perform this step immediately; you can restart the warden at a time when the cluster is not busy. To add or remove individual MapR packages, use the standard package management commands for your Linux distribution: apt-get (Ubuntu) yum (Red Hat or CentOS) For information about the packages to install, see . 
Planning the Deployment You can add or remove services from a node after it has been deployed in a cluster. This process involves installing or uninstalling packages on the node, and then updating the cluster to recognize the new role for this node. Adding a service to an existing node: The process of adding a service to a node is similar to the initial installation process for nodes. For further detail see . Installing MapR Software Install the package(s) corresponding to the new role(s) using or . apt-get yum Run with a list of the CLDB nodes and ZooKeeper nodes in the cluster. configure.sh If you added the CLDB or ZooKeeper role, you must run on all other nodes in the cluster. configure.sh -R If you added the fileserver role, run to format and prepare disks for use as storage. disksetup Restart the warden % service mapr-warden restart When the warden restarts, it picks up the new configuration and starts the new services, making them active in the cluster. Removing a service from an existing node: Stop the service you want to remove from the MapR Control System (MCS) or with the command-line tool. The following maprcli example stops the HBase master service: If you are using Ganglia, restart all gmeta and gmon daemons in the cluster. See . Ganglia 1. 2. 3. 4. % maprcli node services -hbmaster stop -nodes mapr-node1 Purge the service packages with the , , or commands, as suitable for your operating system. apt-get yum zypper Run the script with the option. configure.sh -R When you remove the CLDB or ZooKeeper role from a node, run on all nodes in the cluster. configure.sh -R Setting Up a Node Formatting the Disks The script is used to format disks for use by the MapR cluster. Create a text file listing the disks and partitions for disksetup /tmp/disks.txt use by MapR on the node. Each line lists either a single disk or all applicable partitions on a single disk. When listing multiple partitions on a line, separate by spaces. For example: /dev/sdb /dev/sdc1 /dev/sdc2 /dev/sdc4 /dev/sdd Later, when you run to format the disks, specify the file. For example: disksetup disks.txt /opt/mapr/server/disksetup -F /tmp/disks.txt If you are re-using a node that was used previously in another cluster, it is important to format the disks to remove any traces of data from the old cluster. Configuring the Node The script configures a node to be part of a MapR cluster, or modifies services running on an existing node in the cluster. The configure.sh script creates (or updates) configuration files related to the cluster and the services running on the node. Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. You can optionally specify the ports for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are: CLDB – 7222 ZooKeeper – 5181 The script takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP configure.sh addresses (and optionally ports), using the following syntax: /opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z <host>[:<port>][,<host>[:<port>]...] [-L <logfile>][-N <cluster name>] Example: The script removes all data from the specified disks. Make sure you specify the disks correctly, and that any data you wish disksetup to keep has been backed up elsewhere. Each time you specify the option, you must use the for the ZooKeeper node list. 
If you change the -Z <host>[:<port>] same order order for any node, the ZooKeeper leader election process will fail. 1. 2. 1. 2. 3. 4. 5. 6. 1. /opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster Starting the Node If ZooKeeper is installed on the node, start it: service mapr-zookeeper start Start the Warden: service mapr-warden start Renaming a Node To rename a node: Stop the warden on the node. Example: service mapr-warden stop If the node is a ZooKeeper node, stop ZooKeeper on the node. Example: service mapr-zookeeper stop Rename the host: On Red Hat or CentOS, edit the parameter in the file and restart the service or HOSTNAME /etc/sysconfig/network xinetd reboot the node. On Ubuntu, change the old hostname to the new hostname in the and files. /etc/hostname /etc/hosts If the node is a ZooKeeper node or a CLDB node, run with a list of CLDB and ZooKeeper nodes. See . configure.sh configure.sh If the node is a ZooKeeper node, start ZooKeeper on the node. Example: service mapr-zookeeper start Start the warden on the node. Example: service mapr-warden start Maintenance Mode for Nodes You can place a node into a maintenance mode for a specified timeout duration. For the duration of the timeout, the cluster's CLDB does not consider this node's data as lost and does not trigger a resync of the data on this node. To put a node into maintenance mode, use the following : command maprcli node maintenance -timeoutminutes <minutes> Specify a timeout in minutes with the option. -timeoutminutes To take a node out of maintenance mode before the timeout expires, follow this process: From a terminal, use the following command: 1. 2. 1. 2. 3. 4. 5. 6. 7. 8. 1. 2. 3. 4. 5. maprcli node maintenance -timeoutminutes 0 Restart the Warden: service mapr-warden restart Adding Nodes to a Cluster To Add Nodes to a Cluster PREPARE all nodes, making sure they meet the hardware, software, and configuration requirements. PLAN which services to run on the new nodes. INSTALL MapR Software: On all new nodes, the MapR Repository. ADD On each new node, the planned MapR services. INSTALL On all new nodes, configure.sh. RUN On all new nodes, disks for use by MapR. FORMAT If any configuration files on your existing cluster's nodes have been modified (for example, or warden.conf mapred-site.xm ), replace the default configuration files on all new nodes with the appropriate modified files. l Start ZooKeeper on all new nodes that have ZooKeeper installed: service mapr-zookeeper start Start the warden on all new nodes: service mapr-warden start If any of the new nodes are CLDB and/or ZooKeeper nodes, on all new and existing nodes in the cluster, RUN configure.sh specifying all CLDB and ZooKeeper nodes. SET UP node topology for the new nodes. On any new nodes running NFS, NFS for HA. SET UP Managing Services on a Node You can add or remove services from a node after it has been deployed in a cluster. This process involves installing or uninstalling packages on the node, and then updating the cluster to recognize the new role for this node. Adding a service to an existing node: The process of adding a service to a node is similar to the initial installation process for nodes. For further detail see . Installing MapR Software Install the package(s) corresponding to the new role(s) using or . apt-get yum Run with a list of the CLDB nodes and ZooKeeper nodes in the cluster. 
configure.sh If you added the CLDB or ZooKeeper role, you must run on all other nodes in the cluster. configure.sh -R If you added the fileserver role, run to format and prepare disks for use as storage. disksetup Restart the warden Limitations A node that is running both the CLDB and MFS services cannot be put into maintenance mode. You can shut down the CLDB service on the node provide it is a secondary CLDB node or High Availability for CLDB is enabled.5. 1. 2. 3. 4. % service mapr-warden restart When the warden restarts, it picks up the new configuration and starts the new services, making them active in the cluster. Removing a service from an existing node: Stop the service you want to remove from the MapR Control System (MCS) or with the command-line tool. The following maprcli example stops the HBase master service: % maprcli node services -hbmaster stop -nodes mapr-node1 Purge the service packages with the , , or commands, as suitable for your operating system. apt-get yum zypper Run the script with the option. configure.sh -R When you remove the CLDB or ZooKeeper role from a node, run on all nodes in the cluster. configure.sh -R Node Topology Your node topology describes the locations of nodes and racks in a cluster. The MapR software uses node topology to determine the location of replicated copies of data. Optimally defined cluster topology results in data being replicated to separate racks, providing continued data availability in the event of rack or node failure. Define your cluster's topology by specifying a topology for each node in the cluster. You can use topology to group nodes by rack or switch, depending on how the physical cluster is arranged and how you want MapR to place replicated data. Topology paths can be as simple or complex as needed to correspond to your cluster layout. In a simple cluster, each topology path might consist of the rack only (for example, ). In a deployment consisting of multiple large datacenters, each topology path can be much longer (for /rack-1 example, ). MapR uses topology paths to spread out replicated copies of data, /europe/uk/london/datacenter2/room4/row22/rack5/ placing each copy on a separate path. By setting each path to correspond to a physical rack, you can ensure that replicated data is distributed across racks to improve fault tolerance. After you have defined node topology for the nodes in your cluster, you can use volume topology to place volumes on specific racks, nodes, or groups of nodes. See for more information. Setting Volume Topology Recommended Node Topology The node topology described in this section enables you to gracefully migrate data off a node in order to decommission the node for replacement or maintenance while avoiding data under-replication. Establish a topology path to serve as the default topology path for the volumes in that cluster. Establish a topology /data /decommissioned path that is not assigned to any volumes. When you need to migrate a data volume off a particular node, move that node from the path to the path. Since no /data /decommissioned data volumes are assigned to that topology path, standard data replication will migrate the data off that node to other nodes that are still in the /d topology path. ata You can run the following command to check if a given volume is present on a specified node: maprcli dump volumenodes -volumename <volume> -json | grep <ip:port> Run this command for each non-local volume in your cluster. 
Once all the data has migrated off the node, you can decommission the node or place it in maintenance mode. If you need to segregate CLDB data, create a topology node and move the CLDB nodes under . Point the topology for the CLDB /cldb /cldb volume ( ) to . See for details. mapr.cldb.internal /cldb Isolating CLDB Nodes 1. 2. 3. 4. 5. 1. 2. 1. 2. 3. 4. 1. 2. Setting Node Topology Manually You can specify a topology path for one or more nodes using the command, or in the MapR Control System using the following node move procedure. To set node topology using the MapR Control System: In the Navigation pane, expand the group and click the view. Cluster Nodes Select the checkbox beside each node whose topology you wish to set. Click the button to display the dialog. Change Topology Change Node Topology Set the path in the field: New Path To define a new path, type a topology path. Topology paths must begin with a forward slash ('/'). To use a path you have already defined, select it from the dropdown. Click to set the new topology. Move Node Setting Node Topology with a Script For large clusters, you can specify complex topologies in a text file or by using a script. Each line in the text file or script output specifies a single node and the full topology path for that node in the following format: <ip or hostname> <topology> The text file or script must be specified and available on the local filesystem on all CLDB nodes: To set topology with a text file, set in to the text file name net.topology.table.file.name /opt/mapr/conf/cldb.conf To set topology with a script, set in to the script file name net.topology.script.file.name /opt/mapr/conf/cldb.conf If you specify a script and a text file, the MapR system uses the topology specified by the script. Isolating CLDB Nodes In a large cluster (100 nodes or more) create CLDB-only nodes to ensure high performance. This configuration also provides additional control over the placement of the CLDB data, for load balancing, fault tolerance, or high availability (HA). Setting up CLDB-only nodes involves restricting the CLDB volume to its own topology and making sure all other volumes are on a separate topology. Because both the CLDB-only path and the non-CLDB path are children of the root topology path, new non-CLDB volumes are not guaranteed to keep off the CLDB-only nodes. To avoid this problem, set a default volume topology. See . Setting Default Volume Topology To set up a CLDB-only node: SET UP the node as usual: PREPARE the node, making sure it meets the requirements. ADD the MapR Repository. INSTALL the following packages to the node. mapr-cldb mapr-webserver mapr-core mapr-fileserver To set up a volume topology that restricts the CLDB volume to specific nodes: Move all CLDB nodes to a CLDB-only topology (e. g. ) using the MapR Control System or the following command: /cldbonly maprcli node move -serverids <CLDB nodes> -topology /cldbonly Restrict the CLDB volume to the CLDB-only topology. 
Use the MapR Control System or the following command: maprcli volume move -name mapr.cldb.internal -topology /cldbonly If the CLDB volume is present on nodes not in /cldbonly, increase the replication factor of mapr.cldb.internal to create enough copies in / using the MapR Control System or the following command: cldbonly maprcli volume modify -name mapr.cldb.internal -replication <replication factor> Once the volume has sufficient copies, remove the extra replicas by reducing the replication factor to the desired value using the MapR Control System or the command used in the previous step. To move all other volumes to a topology separate from the CLDB-only nodes: Move all non-CLDB nodes to a non-CLDB topology (e. g. ) using the MapR Control System or the following command: /defaultRack maprcli node move -serverids <all non-CLDB nodes> -topology /defaultRack 2. 1. 2. 1. 2. 3. 1. 2. 3. Restrict all existing volumes to the topology using the MapR Control System or the following command: /defaultRack maprcli volume move -name <volume> -topology /defaultRack All volumes except are re-replicated to the changed topology automatically. mapr.cluster.root Isolating ZooKeeper Nodes For large clusters (100 nodes or more), isolate the ZooKeeper on nodes that do not perform any other function. Isolating the ZooKeeper node enables the node to perform its functions without competing for resources with other processes. Installing a ZooKeeper-only node is similar to any typical node installation, but with a specific subset of packages. To set up a ZooKeeper-only node: SET UP the node as usual: PREPARE the node, making sure it meets the requirements. ADD the MapR Repository. INSTALL the following packages to the node. mapr-zookeeper mapr-zk-internal mapr-core Removing Roles To remove roles from an existing node: Purge the packages corresponding to the roles using or . apt-get yum Run with a list of the CLDB nodes and ZooKeeper nodes in the cluster. configure.sh If you have removed the CLDB or ZooKeeper role, run on all nodes in the cluster to read new configuration configure.sh -R information from the and files. cldb.conf warden.conf The warden picks up the new configuration automatically. When it is convenient, restart the warden: # service mapr-warden restart Removing the role requires additional steps. Refer to . mapr-filesystem Removing the Fileserver Role Removing the Fileserver Role Removing the role from a node is more complex than removing other roles. The CLDB tracks data precisely on all fileserver nodes, fileserver and therefore you should direct the cluster CLDB to stop tracking the node before removing the role. fileserver For a planned decommissioning of a node, use node topologies to migrate data off the node before removing the role. For example, fileserver you could move the node out of a live topology into a topology that has no volumes assigned to it, in order to force /data /decommissioned data off the node. Otherwise, some data will be under-replicated as soon as the node is removed. Refer to . Node Topology To Remove the role from a node fileserver Stop the warden, which will halt all MapR services on the node. Wait 5 minutes, after which the CLDB will mark the node as critical. Remove the node from the cluster, to direct the CLDB to stop tracking this node. You can do this in the MapR Control System GUI or with To prevent subsequently created volumes from encroaching on the CLDB-only nodes, set a default topology that excludes the CLDB-only topology. 
Do not install the FileServer package on an isolated ZooKeeper node in order to prevent MapR from using this node for data storage. The following procedure involves halting all MapR services on the node temporarily. If this will disrupt critical services on your cluster, such as CLDB or JobTracker, migrate those services to a different node first, and then proceed. 3. 4. 5. 6. 7. 1. 2. the command. maprcli node remove Remove the role by deleting the file on the node. fileserver /opt/mapr/roles/fileserver Run on the node to reconfigure the node without the role. configure.sh fileserver Start the warden. Remove any volumes that were stored locally on the node. You can do this in the MapR Control System GUI or with the maprcli command. volume remove For example: /opt/mapr # service mapr-warden stop ...wait 5 minutes for CLDB to recognize the node is down... /opt/mapr # maprcli node remove 10.10.80.61 /opt/mapr # rm /opt/mapr/roles/fileserver /opt/mapr # /opt/mapr/server/configure.sh -R /opt/mapr # service mapr-warden start /opt/mapr # maprcli volume remove -name mapr.mapr-desktop.local.logs /opt/mapr # maprcli volume remove -name mapr.mapr-desktop.local.mapred /opt/mapr # maprcli volume remove -name mapr.mapr-desktop.local.metrics Task Nodes "Task Node" refers to a node that contributes only compute resources for TaskTrackers, and does not contribute any disk space to the cluster's storage pools. Generally, when permanently adding a node to a cluster, you want the node to contribute both compute and storage resources. However, there are cases for which it is preferable to prevent the cluster from storing data on a particular node. For example, Task Nodes are useful if you need the ability to add compute resources to the cluster at will, and later remove them spontaneously without provisioning for data on the nodes to safely replicate elsewhere. Task Node Services and Topology Task Nodes run the following services: TaskTracker and Fileserver. The Fileserver service is required, because TaskTrackers require local storage for intermediate data. Node Topology settings prevent the cluster from storing unrelated data on the Task Node. You must assign a Task Node to the /compute-only topology, which has no storage volumes assigned to it. (The topology name is unimportant, so long as it has no storage assigned to it.) By contrast, nodes for data storage are generally assigned to the topology (or a sub-topology of ). /data /data Adding a Task Node to a cluster To add a Task Node to a cluster, follow the steps outlined in , with the following modifications: Adding Nodes to a Cluster Packages to install: , mapr-tasktracker mapr-fileserver Before you start the Warden service (which starts the Fileserver service), add the following line to : /opt/mapr/conf/mfs.conf mfs.network.location=/compute-only Converting an Existing Node to a Task Node If the Fileserver is running on a node assigned to a topology with volume data assigned to it, you will need to use the com maprcli node move mand to move the node to the topology (or some other topology with no volumes assigned to it). For example: /compute-only Find the for the node, which you will use in the next step. (See .) serverid How to find a node's serverid Issue the following command to re-assign the node's topology: After the Fileserver service is running, changing has no effect on the node's topology. mfs.conf 2. 1. 2. 1. 2. 3. 1. 2. 3. 4. 1. 2. 3. 4. 
maprcli node move -serverids <serverid> -topology /compute-only It will then take time for any data stored on the node to transition elsewhere in the cluster. Services Viewing Services on the Cluster You can view services on the cluster using the command, or using the MapR Control System. In the MapR Control System, the dashboard info running services on the cluster are displayed in the pane of the . Services Dashboard To view the running services on the cluster using the MapR Control System: Log on to the MapR Control System. In the Navigation pane, expand the pane and click . Cluster Views Dashboard Viewing Services on a Node You can view services on a single node using the command, or using the MapR Control System. In the MapR Control System, the service list running services on a node are displayed in the . Node Properties View To view the running services on a node using the MapR Control System: Log on to the MapR Control System. In the Navigation pane, expand the pane and click . Cluster Views Nodes Click the hostname of the node you would like to view. The services are displayed in the Manage Node Services pane. Starting Services You can start services using the command, or using the MapR Control System. node services To start specific services on a node using the MapR Control System: Log on to the MapR Control System. In the Navigation pane, expand the pane and click . Cluster Views Nodes Click the hostname of the node you would like to view. The services are displayed in the Manage Node Services pane. Click the checkbox next to each service you would like to start, and click . Start Service Stopping Services You can stop services using the command, or using the MapR Control System. node services To stop specific services on a node using the MapR Control System: Log on to the MapR Control System. In the Navigation pane, expand the pane and click . Cluster Views Nodes Click the hostname of the node you would like to view. The services are displayed in the Manage Node Services pane. Click the checkbox next to each service you would like to stop, and click . Stop Service Adding Services Starting services with the command can lead to cluster issues that are difficult to diagnose. /etc/init.d/<service name> start Interact with MapR services through the provided API. Stopping services with the command can lead to cluster issues that are difficult to diagnose. /etc/init.d/<service name> stop Interact with MapR services through the provided API. 1. 2. 3. 4. 5. 1. 2. 3. 4. Services determine which roles a node fulfills. You can view a list of the roles configured for a given node by listing the direct /opt/mapr/roles ory on the node. To add roles to a node, you must install the corresponding services. You can add or remove services from a node after it has been deployed in a cluster. This process involves installing or uninstalling packages on the node, and then updating the cluster to recognize the new role for this node. Adding a service to an existing node: The process of adding a service to a node is similar to the initial installation process for nodes. For further detail see . Installing MapR Software Install the package(s) corresponding to the new role(s) using or . apt-get yum Run with a list of the CLDB nodes and ZooKeeper nodes in the cluster. configure.sh If you added the CLDB or ZooKeeper role, you must run on all other nodes in the cluster. configure.sh -R If you added the fileserver role, run to format and prepare disks for use as storage. 
disksetup Restart the warden % service mapr-warden restart When the warden restarts, it picks up the new configuration and starts the new services, making them active in the cluster. Removing a service from an existing node: Stop the service you want to remove from the MapR Control System (MCS) or with the command-line tool. The following maprcli example stops the HBase master service: % maprcli node services -hbmaster stop -nodes mapr-node1 Purge the service packages with the , , or commands, as suitable for your operating system. apt-get yum zypper Run the script with the option. configure.sh -R When you remove the CLDB or ZooKeeper role from a node, run on all nodes in the cluster. configure.sh -R Assigning Services to Nodes for Best Performance The architecture of MapR software allows virtually any service to run on any node, or nodes, to provide a high-availability, high-performance cluster. Below are some guidelines to help plan your cluster's service layout. Don't Overload the ZooKeeper High latency on a ZooKeeper node can lead to an increased incidence of ZooKeeper quorum failures. A ZooKeeper quorum failure occurs when the cluster finds too few copies of the ZooKeeper service running. If the ZooKeeper node is also running other services, competition for computing resources can lead to increased latency for that node. If your cluster experiences issues relating to ZooKeeper quorum failures, consider reducing or eliminating the number of other services running on the ZooKeeper node. Reduce TaskTracker Slots Where Necessary Monitor the server load on the nodes in your cluster that are running high-demand services such as ZooKeeper or CLDB. If the TaskTracker service is running on nodes that also run a high-demand service, you can reduce the number of task slots provided by the TaskTracker service. Tune the number of task slots according to the acceptable load levels for nodes in your cluster. Separate High-Demand Services The following are guidelines about which services to separate on large clusters: JobTracker on ZooKeeper nodes: Avoid running the JobTracker service on nodes that are running the ZooKeeper service. On large clusters, the JobTracker service can consume significant resources. MySQL on CLDB nodes: Avoid running the MySQL server that supports the MapR Metrics service on a CLDB node. Consider running the MySQL server on a machine external to the cluster to prevent the MySQL server’s resource needs from affecting services on the cluster. TaskTracker on CLDB or ZooKeeper nodes: When the TaskTracker service is running on a node that is also running the CLDB or 1. 2. 3. 4. 5. 1. 2. 3. 4. 5. 6. 7. 8. ZooKeeper services, consider reducing the number of task slots that this node's instance of the TaskTracker service provides. See Tunin . g Your MapR Install Webserver on CLDB nodes: Avoid running the webserver on CLDB nodes. Queries to the MapR Metrics service can impose a bandwidth load that reduces CLDB performance. JobTracker on large clusters: Run the JobTracker service on a dedicated node for clusters with over 250 nodes. Changing the User for MapR Services All services should be run with same uid/gid on all nodes in the cluster. To fix the alarm, the following steps should be done on the node for which the alarm is raised. 
To run MapR services as the root user: Stop the warden: service mapr-warden stop If ZooKeeper is installed on the node, stop it: service mapr-zookeeper stop Run the script $INSTALL_DIR/server/config-mapr-user.sh -u root If Zookeeper is installed, start it: service mapr-zookeeper start Start the warden: service mapr-warden start To run MapR services as a non-root user: Stop the warden: service mapr-warden stop If ZooKeeper is installed on the node, stop it: service mapr-zookeeper stop If the MAPR_USER does not exist, create the user/group with the same UID and GID. If the MAPR_USER exists, verify that the uid of MAPR_USER is the same same as the value on the CLDB node. Run $INSTALL_DIR/server/config-mapr-user.sh -u MAPR_USER If Zookeeper is installed, start it: service mapr-zookeeper start Start the warden: service mapr-warden start After clearing NODE_ALARM_MAPRUSER_MISMATCH alarms on all nodes, run o $INSTALL_DIR/server/upgrade2mapruser.sh n all nodes wherever the alarm is raised. 1. 2. 3. 4. 5. 6. 7. 1. 2. 3. 4. CLDB Failover The CLDB service automatically replicates its data to other nodes in the cluster, preserving at least two (and generally three) copies of the CLDB data. If the CLDB process dies, it is automatically restarted on the node. All jobs and processes wait for the CLDB to return, and resume from where they left off, with no data or job loss. If the node itself fails, the CLDB data is still safe, and the cluster can continue normally as soon as the CLDB is started on another node. In an M5-licensed cluster, a failed CLDB node automatically fails over to another CLDB node without user intervention and without data loss. It is possible to recover from a failed CLDB node on an M3 cluster, but the procedure is somewhat different. Recovering from a Failed CLDB Node on an M3 Cluster To recover from a failed CLDB node, perform the steps listed below: Restore ZooKeeper - if necessary, install ZooKeeper on an additional node. Locate the CLDB data - locate the nodes where replicates of CLDB data are stored, and choose one to serve as the new CLDB node. Stop the selected node - stop the node you have chosen, to prepare for installing the CLDB service. Install the CLDB on the selected node - install the CLDB service on the new CLDB node. Configure the selected node - run to inform the CLDB node where the CLDB and ZooKeeper services are running. configure.sh Start the selected node - start the new CLDB node. Restart all nodes - stop each node in the cluster, run on it, and start it. configure.sh After the CLDB restarts, there is a 15-minute delay before replication resumes, in order to allow all nodes to register and heartbeat. This delay can be configured using the command to set the parameter. config save cldb.replication.manager.start.mins Restore ZooKeeper If the CLDB node that failed was also running ZooKeeper, install ZooKeeper on another node to maintain the minimum required number of ZooKeeper nodes. Locate the CLDB Data After restoring the ZooKeeper service on the M3 cluster, use the command to identify the latest epoch of the CLDB, maprcli dump zkinfo identify the nodes where replicates of the CLDB are stored, and select one of those nodes to serve the new CLDB node. Perform the following steps on any cluster node: Log in as or use for the following commands. root sudo Issue the command using the flag. maprcli dump zkinfo -json # maprcli dump zkinfo -json The output displays the ZooKeeper znodes. In the directory, locate the CLDB with the latest epoch. 
/datacenter/controlnodes/cldb/epoch/1 { "/datacenter/controlnodes/cldb/epoch/1/KvStoreContainerInfo":" Container ID:1 VolumeId:1 Master:10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--13-VALID Servers: 10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--13-VALID Inactive Servers: Unused Servers: Latest epoch:13" } The Latest Epoch field identifies the current epoch of the CLDB data. In this example, the latest epoch is . 13 Select a CLDB from among the copies at the latest epoch. For example, indicates that the node has a 10.250.2.41:5660--13-VALID copy at epoch 13 (the latest epoch). You can now install a new CLDB on the selected node. 1. 2. 3. 1. 2. 3. 1. 2. Stop the Selected Node Perform the following steps on the node you have selected for installation of the CLDB: Change to the root user (or use sudo for the following commands). Stop the Warden: service mapr-warden stop If ZooKeeper is installed on the node, stop it: service mapr-zookeeper stop Install the CLDB on the Selected Node Perform the following steps on the node you have selected for installation of the CLDB: Login as or use for the following commands. root sudo Install the CLDB service on the node: RHEL/CentOS: yum install mapr-cldb Ubuntu: apt-get install mapr-cldb Wait until the failover delay expires. If you try to start the CLDB before the failover delay expires, the following message appears: CLDB HA check failed: not licensed, failover denied: elapsed time since last failure=<time in minutes> minutes Configure the Selected Node Perform the following steps on the node you have selected for installation of the CLDB: The script configures a node to be part of a MapR cluster, or modifies services running on an existing node in the cluster. The configure.sh script creates (or updates) configuration files related to the cluster and the services running on the node. Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. You can optionally specify the ports for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are: CLDB – 7222 ZooKeeper – 5181 The script takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP configure.sh addresses (and optionally ports), using the following syntax: /opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z <host>[:<port>][,<host>[:<port>]...] [-L <logfile>][-N <cluster name>] Example: /opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster Start the Node Perform the following steps on the node you have selected for installation of the CLDB: If ZooKeeper is installed on the node, start it: service mapr-zookeeper start Start the Warden: service mapr-warden start Each time you specify the option, you must use the for the ZooKeeper node list. If you change the -Z <host>[:<port>] same order order for any node, the ZooKeeper leader election process will fail. 1. 2. 3. 1. 2. 1. 2. 3. 4. 5. 6. Restart All Nodes On all nodes in the cluster, perform the following procedures: Stop the node: Change to the root user (or use sudo for the following commands). 
Stop the Warden: service mapr-warden stop If ZooKeeper is installed on the node, stop it: service mapr-zookeeper stop Configure the node with the new CLDB and ZooKeeper addresses: The script configures a node to be part of a MapR cluster, or modifies services running on an existing node in the cluster. The configure.sh script creates (or updates) configuration files related to the cluster and the services running on the node. Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. You can optionally specify the ports for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are: CLDB – 7222 ZooKeeper – 5181 The script takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP configure.sh addresses (and optionally ports), using the following syntax: /opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z <host>[:<port>][,<host>[:<port>]...] [-L <logfile>][-N <cluster name>] Example: /opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster Start the node: If ZooKeeper is installed on the node, start it: service mapr-zookeeper start Start the Warden: service mapr-warden start Dial Home MapR provides a service called , which automatically collects certain metrics about the cluster for use by support engineers and to help Dial Home us improve and evolve our product. When you first install MapR, you are presented with the option to enable or disable Dial Home. We recommend enabling it. You can enable or disable Dial Home later, using the command. dialhome enable Startup and Shutdown When you shut down a cluster, follow this sequence to preserve your data and replication: Verify that recently data has finished processing Shut down any NFS servers Shut down any ecosystem components that are running Shut down the job and task trackers Shut down the Warden on all nodes that are not running CLDB or ZooKeeper Shut down the Warden on the CLDB and ZooKeeper nodes To shut down the cluster: Each time you specify the option, you must use the for the ZooKeeper node list. If you change the -Z <host>[:<port>] same order order for any node, the ZooKeeper leader election process will fail. 1. 2. 3. 4. 5. 6. 7. Before you start, make sure there are no active MapReduce or HBase processes, and that no data is being loaded to the cluster or being persisted within the cluster. Change to the user (or use for the following commands). root sudo Before shutting down the cluster, you will need a list of NFS nodes, CLDB nodes, and all remaining nodes. Once the CLDB is shut down, you cannot retrieve a list of nodes; it is important to obtain this information at the beginning of the process. Use the comman node list d as follows: Determine which nodes are running the NFS gateway. Example: /opt/mapr/bin/maprcli node list -filter "[rp==/*]and[svc==nfs]" -columns id,h,hn,svc, rp id service hostname health ip 6475182753920016590 fileserver,tasktracker,nfs,hoststats node-252.cluster.us 0 10.10.50.252 8077173244974255917 tasktracker,cldb,fileserver,nfs,hoststats node-253.cluster.us 0 10.10.50.253 5323478955232132984 webserver,cldb,fileserver,nfs,hoststats,jobtracker node-254.cluster.us 0 10.10.50.254 Determine which nodes are running the CLDB. Example: /opt/mapr/bin/maprcli node list -filter "[rp==/*]and[svc==cldb]" -columns id,h,hn,svc, rp List all non-CLDB nodes. 
Example: /opt/mapr/bin/maprcli node list -filter "[rp==/*]and[svc!=cldb]" -columns id,h,hn,svc, rp Shut down all NFS instances. Example: /opt/mapr/bin/maprcli node services -nfs stop -nodes node-252.cluster.us node-253.cluster.us node-254.cluster.us If your cluster is running any ecosystem components, shut down those components on all nodes. Shut down all job and task tracker services on all nodes. Example: # maprcli node services -jobtracker stop # maprcli node services -tasktracker stop SSH into each node that is not running CLDB or the ZooKeeper and stop the warden. Example: # service mapr-warden stop SSH into each CLDB or ZooKeeper node and stop the warden. Example: 7. 8. 1. 2. 3. 4. 5. # service mapr-warden stop If desired, you can shut down the nodes using the Linux command. halt To start up the cluster: If the cluster nodes are not running, start them. Change to the user (or use for the following commands). root sudo Start the ZooKeeper on nodes where it is installed. Example: # service mapr-zookeeper start On all nodes, start the warden. Example: # service mapr-warden start Over a period of time (depending on the cluster size and other factors) the cluster comes up automatically. After the CLDB restarts, there is a 15-minute delay before replication resumes, in order to allow all nodes to register and heartbeat. This delay can be configured using the command to set the parameter. config save cldb.replication.manager.start.mins TaskTracker Blacklisting In the event that a TaskTracker is not performing properly, it can be so that no jobs will be scheduled to run on it. There are two types blacklisted of TaskTracker blacklisting: Per-job blacklisting, which prevents scheduling new tasks from a particular job Cluster-wide blacklisting, which prevents scheduling new tasks from all jobs Per-Job Blacklisting The configuration value in specifies a number of task failures in a specific job after mapred.max.tracker.failures mapred-site.xml which the TaskTracker is blacklisted for that job. The TaskTracker can still accept tasks from other jobs, as long as it is not blacklisted cluster-wide (see below). A job can only blacklist up to 25% of TaskTrackers in the cluster. Cluster-Wide Blacklisting A TaskTracker can be blacklisted cluster-wide for any of the following reasons: The number of blacklists from successful jobs (the ) exceeds fault count mapred.max.tracker.blacklists The TaskTracker has been manually blacklisted using hadoop job -blacklist-tracker <host> The status of the TaskTracker (as reported by a user-provided health-check script) is not healthy If a TaskTracker is blacklisted, any currently running tasks are allowed to finish, but no further tasks are scheduled. If a TaskTracker has been blacklisted due to or using the command, un-blacklisting mapred.max.tracker.blacklists hadoop job -blacklist-tracker <host> requires a TaskTracker restart. Only 50% of the TaskTrackers in a cluster can be blacklisted at any one time. After 24 hours, the TaskTracker is automatically removed from the blacklist and can accept jobs again. Blacklisting a TaskTracker Manually To blacklist a TaskTracker manually, run the following command as the administrative user: 1. 2. 3. 4. 5. 6. 7. hadoop job -blacklist-tracker <hostname> Manually blacklisting a TaskTracker prevents additional tasks from being scheduled on the TaskTracker. Any currently running tasks are allowed to fihish. 
Un-blacklisting a TaskTracker Manually If a TaskTracker is blacklisted per job, you can un-blacklist it by running the following command as the administrative user: hadoop job -unblacklist <jobid> <hostname> If a TaskTracker has been blacklisted cluster-wide due to or using the mapred.max.tracker.blacklists hadoop job command, use the following command as the administrative user to remove that node from the blacklist: -blacklist-tracker <host> hadoop job -unblacklist-tracker <hostname> If a TaskTracker has been blacklisted cluster-wide due to a non-healthy status, correct the indicated problem and run the health check script again. When the script picks up the healthy status, the TaskTracker is un-blacklisted. Uninstalling MapR To re-purpose machines, you may wish to remove nodes and uninstall MapR software. Removing Nodes from a Cluster To remove nodes from a cluster: first uninstall the desired nodes, then run on the remaining nodes. Finally, if you are using configure.sh Ganglia, restart all and daemons in the cluster. gmeta gmon To uninstall a node: On each node you want to uninstall, perform the following steps: Before you start, drain the node of data by moving the node to the physical . All the data on a node in the /decommissioned topology /decommi topology is migrated to volumes and nodes in the topology. ssioned /data Run the following command to check if a given volume is present on the node: maprcli dump volumenodes -volumename <volume> -json | grep <ip:port> Run this command for each non-local volume in your cluster to verify that the node being decommissioned is not storing any volume data. Change to the root user (or use sudo for the following commands). Stop the Warden: service mapr-warden stop If ZooKeeper is installed on the node, stop it: service mapr-zookeeper stop Determine which MapR packages are installed on the node: dpkg --list | grep mapr (Ubuntu) rpm -qa | grep mapr (Red Hat or CentOS) Remove the packages by issuing the appropriate command for the operating system, followed by the list of services. Examples: apt-get purge mapr-core mapr-cldb mapr-fileserver (Ubuntu) yum erase mapr-core mapr-cldb mapr-fileserver (Red Hat or CentOS) Remove the directory to remove any instances of , , , and left behind by the package /opt/mapr hostid hostname zkdata zookeeper manager. 7. 8. Remove any MapR cores in the directory. /opt/cores If the node you have decommissioned is a CLDB node or a ZooKeeper node, then run on all other nodes in the cluster configure.sh (see ). Configuring a Node To reconfigure the cluster: The script configures a node to be part of a MapR cluster, or modifies services running on an existing node in the cluster. The configure.sh script creates (or updates) configuration files related to the cluster and the services running on the node. Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. You can optionally specify the ports for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are: CLDB – 7222 ZooKeeper – 5181 The script takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP configure.sh addresses (and optionally ports), using the following syntax: /opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z <host>[:<port>][,<host>[:<port>]...] 
[-L <logfile>] [-N <cluster name>]
Example:
/opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster
Each time you specify the -Z <host>[:<port>] option, you must use the same order for the ZooKeeper node list. If you change the order for any node, the ZooKeeper leader election process will fail.
If you are using Ganglia, restart all gmeta and gmon daemons in the cluster. See Ganglia.

Designating NICs for MapR
If you do not want MapR to use all NICs on each node, use the MAPR_SUBNETS environment variable to restrict MapR traffic to specific NICs. Set MAPR_SUBNETS to a comma-separated list of up to four subnets in CIDR notation, with no spaces. Example:
export MAPR_SUBNETS=10.10.0.0/16,192.168.34.0/24
If MAPR_SUBNETS is not set, MapR uses all NICs present on the node.
Related Topics
Environment Variables

Placing Jobs on Specified Nodes
You can run jobs on specified nodes or groups of nodes using label-based scheduling – assigning labels to various groups of nodes and then using the labels to specify where jobs run. This feature is used in conjunction with the Fair Scheduler. The labels are mapped to nodes using the node labels file, a file stored in MapR-FS. When you run jobs, you can place them on specified nodes individually or at the queue level.
When using label-based job placement, you cannot use the Fair Scheduler with preemption or task prefetch. For details on prefetch, see the mapreduce.tasktracker.prefetch.maptasks parameter on the mapred-site.xml page.

The Node Labels File
The node labels file defines labels for cluster nodes, to identify them for the purpose of specifying where to run jobs. Each line in the node labels file consists of an identifier that specifies one or more nodes, along with one or more labels to apply to those nodes, separated by whitespace:
<identifier> <labels>
The identifier specifies nodes by matching node names or IP addresses in one of two ways:
Unix-style glob, which supports the ? and * wildcards
Java regular expressions (refer to https://java.net/projects/eval/pages/Home for more information)
The labels are a comma-delimited list of labels to apply to the nodes matched by the identifier. Labels that contain whitespace should be enclosed in single or double quotation marks.

Sample node label file
The following example shows both glob identifiers and regular expression identifiers.
/perfnode200.*/ big, "Production Machines"
/perfnode203.*/ big, 'Development Machines'
perfnode15* good
perfnode201* slow
perfnode204* good, big

If glob identifiers overlap, so that patterns from multiple lines match a particular node, only the last line in the file that matches is applied.
The file path and name are specified in the mapreduce.jobtracker.node.labels.file parameter in mapred-site.xml. If no file is specified, jobs can run on any nodes in the cluster. The mapreduce.jobtracker.node.labels.monitor.interval parameter in mapred-site.xml determines how often the JobTracker polls the node labels file for changes (the default is two minutes). You can also use hadoop job -refreshlabels to manually tell the JobTracker to re-load the node labels file.
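Both node-label parameters live in mapred-site.xml. The following is a minimal sketch, assuming a hypothetical labels file stored at /mylabels in MapR-FS and assuming the polling interval is expressed in milliseconds; verify both against your mapred-site.xml reference:

<property>
<name>mapreduce.jobtracker.node.labels.file</name>
<!-- path in MapR-FS to the node labels file -->
<value>/mylabels</value>
</property>
<property>
<name>mapreduce.jobtracker.node.labels.monitor.interval</name>
<!-- assumed to be in milliseconds; 120000 corresponds to the two-minute default -->
<value>120000</value>
</property>

After editing the labels file itself, running hadoop job -refreshlabels avoids waiting for the next polling interval.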
Placing Jobs
The placement of jobs on nodes or groups of nodes is controlled by labels applied to the queues to which jobs are submitted, or to the jobs themselves.
A queue or job label is an expression that uses logical operators (OR, AND, NOT) to specify the nodes described in the node labels file:
A queue label specifies the node or nodes that will run all jobs submitted to a queue.
A job label specifies the node or nodes that will run a particular job.
Job and queue labels refer to the labels in the node labels file, combined with the operators || (OR) and && (AND). Examples:
"Production Machines" || good — selects nodes that are in either the "Production Machines" group or the good group.
'Development Machines' && good — selects nodes only if they are in both the 'Development Machines' group and the good group.
If a job is submitted with a label that does not match any nodes, the job remains in the PREP state until nodes exist that meet the criteria (or until the job is killed). For example, in the node labels file above, no nodes have been assigned to both the 'Development Machines' group and the good group. If a job is submitted with the label 'Development Machines' && good, it cannot execute unless nodes exist in both groups. If the node labels file is edited so that the 'Development Machines' group and the good group have nodes in common, the job executes as soon as the JobTracker becomes aware of the change — either after the mapreduce.jobtracker.node.labels.monitor.interval elapses or when you execute the hadoop job -refreshlabels command.
Tabs cannot be used for whitespace in the node labels file. If tabs are used, the node label list will be empty. You can use hadoop job -showlabels to view the labels of all active nodes.

Queue Labels
Queue labels are defined using the mapred.queue.<queue-name>.label parameter in mapred-site.xml. The corresponding parameter mapred.queue.<queue-name>.label.policy specifies one of the following policies, which determine the precedence of queue labels and job labels:
PREFER_QUEUE — always use the label set on the queue
PREFER_JOB — always use the label set on the job
AND (default) — job label AND queue label
OR — job label OR queue label
You can set a label and policy for the default queue using mapred.queue.default.label and mapred.queue.default.label.policy.

Example: Setting a policy on the default queue
The following excerpt from mapred-site.xml shows the PREFER_QUEUE policy set on the default queue.
<property>
<name>mapred.queue.default.label</name>
<value>big || good</value>
</property>
<property>
<name>mapred.queue.default.label.policy</name>
<value>PREFER_QUEUE</value>
</property>

Job Labels
There are three ways to set a job label:
Use set() from the Hadoop configuration API in your Java application. Example: conf.set("mapred.job.label","Production Machines");
Pass the label with -Dmapred.job.label when running the job with hadoop jar
Set mapred.job.label in mapred-site.xml

Examples
The following examples show the job placement policy behavior in certain scenarios, using the sample node labels file above.
Job Label | Queue Label | Queue Policy | Outcome
good | big | PREFER_JOB | The job runs on nodes labeled "good" (hostnames match perfnode15* or perfnode204*)
good | big | PREFER_QUEUE | The job runs on nodes labeled "big" (hostnames match /perfnode200.*/ or /perfnode203.*/)
good | big | AND | The job runs only on nodes labeled both "good" and "big" (hostnames match perfnode204*)
good | big | OR | The job runs on nodes labeled either "good" or "big" (hostnames match /perfnode200.*/, /perfnode203.*/, perfnode15*, or perfnode204*)

Security
This section provides information about managing security on a MapR cluster. Click a subtopic below for more detail.
PAM Configuration Secured TaskTracker Subnet Whitelist PAM Configuration MapR uses for user authentication in the MapR Control System. Make sure PAM is installed and Pluggable Authentication Modules (PAM) configured on the node running the . mapr-webserver There are typically several PAM modules (profiles), configurable via configuration files in the directory. Each standard UNIX /etc/pam.d/ program normally installs its own profile. MapR can use (but does not require) its own PAM profile. The MapR Control System mapr-admin webserver tries the following three profiles in order: mapr-admin (Expects that user has created the profile) /etc/pam.d/mapr-admin sudo ( ) /etc/pam.d/sudo sshd ( ) /etc/pam.d/sshd The profile configuration file (for example, ) should contain an entry corresponding to the authentication scheme used by your /etc/pam.d/sudo system. For example, if you are using local OS authentication, check for the following entry: auth sufficient pam_unix.so # For local OS Auth Example: Configuring PAM with mapr-admin Although there are several viable ways to configure PAM to work with the MapR UI, we recommend using the profile. The following mapr-admin example shows how to configure the file. If LDAP is not configured, comment out the LDAP lines. Example /etc/pam.d/mapr-admin file 1. 2. 3. 4. 1. 2. account required pam_unix.so account sufficient pam_succeed_if.so uid < 1000 quiet account [default=bad success=ok user_unknown=ignore] pam_ldap.so account required pam_permit.so auth sufficient pam_unix.so nullok_secure auth requisite pam_succeed_if.so uid >= 1000 quiet auth sufficient pam_ldap.so use_first_pass auth required pam_deny.so password sufficient pam_unix.so md5 obscure min=4 max=8 nullok try_first_pass password sufficient pam_ldap.so password required pam_deny.so session required pam_limits.so session required pam_unix.so session optional pam_ldap.so The following sections provide information about configuring PAM to work with LDAP or Kerberos. LDAP To configure PAM with LDAP: Verify that each MapR user ID has the auxiliary schema . posixAccount Verify that each group ID has the auxiliary schema . posixGroup Install the appropriate PAM packages: On Ubuntu, sudo apt-get install libpam-ldapd On Redhat/Centos, sudo yum install pam_ldap Open and check for the following line: /etc/pam.d/sudo auth sufficient pam_ldap.so # For LDAP Auth Kerberos To configure PAM with Kerberos: Install the appropriate PAM packages: On Redhat/Centos, sudo yum install pam_krb5 On Ubuntu, sudo apt-get install -krb5 Open and check for the following line: /etc/pam.d/sudo auth sufficient pam_krb5.so # For kerberos Auth Secured TaskTracker You can control which users are able to submit jobs to the TaskTracker. By default, the TaskTracker is secured; all TaskTracker nodes should The file should be modified only with care and only when absolutely necessary. /etc/pam.d/sudo 1. 2. 3. 1. 2. 3. 1. 2. 3. 1. 2. have the same user and group databases, and only users who are present on all TaskTracker nodes (same user ID on all nodes) can submit jobs. You can disallow certain users (including or other superusers) from submitting jobs, or remove user restrictions from the TaskTracker root completely. To disallow : root Edit and set on all TaskTracker mapred-site.xml mapred.tasktracker.task-controller.config.overwrite = false nodes. Edit and set on all TaskTracker nodes. taskcontroller.cfg min.user.id=0 Restart all TaskTrackers. 
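The settings used in these procedures are plain configuration entries. The following is a minimal sketch using values drawn from the examples in this section; adjust them to your own policy:

In mapred-site.xml on every TaskTracker node:
<property>
<name>mapred.tasktracker.task-controller.config.overwrite</name>
<!-- as these procedures imply, false preserves manual edits to taskcontroller.cfg -->
<value>false</value>
</property>

In taskcontroller.cfg on every TaskTracker node (plain key=value lines):
min.user.id=1000
banned.users=foo,bar

As in the procedures above and below, restart the TaskTrackers after changing either file.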
To disallow all superusers:
1. Edit mapred-site.xml and set mapred.tasktracker.task-controller.config.overwrite = false on all TaskTracker nodes.
2. Edit taskcontroller.cfg and set min.user.id=1000 on all TaskTracker nodes.
3. Restart all TaskTrackers.

To disallow specific users:
1. Edit mapred-site.xml and set mapred.tasktracker.task-controller.config.overwrite = false on all TaskTracker nodes.
2. Edit taskcontroller.cfg on all TaskTracker nodes and add the banned.users parameter, setting it to a comma-separated list of usernames. Example:
banned.users=foo,bar
3. Restart all TaskTrackers.

To remove all user restrictions and run all jobs as root:
1. Edit mapred-site.xml and set mapred.task.tracker.task-controller = org.apache.hadoop.mapred.DefaultTaskController on all TaskTracker nodes.
2. Restart all TaskTrackers.
When you make the above setting, all jobs submitted by any user will run as root, and will have the ability to overwrite, delete, or damage data regardless of ownership or permissions.
You must restart the TaskTracker after changing a node's hostname.

Subnet Whitelist
To provide additional cluster security, you can limit cluster data access to a whitelist of trusted subnets. The mfs.subnets.whitelist parameter in mfs.conf accepts a comma-separated list of subnets in CIDR notation. If this parameter is set, the FileServer service only accepts requests from the specified subnets.

Users and Groups
Two users are important when installing and setting up the MapR cluster:
root is used to install MapR software on each node.
The "MapR user" is the user that MapR services run as (typically named mapr or hadoop) on each node. The MapR user has full privileges to administer the cluster. Administrative privilege with varying levels of control can be assigned to other users as well.
Before installing MapR, decide on the name, user ID (UID), and group ID (GID) for the MapR user. The MapR user must exist on each node, and the user name, UID, and primary GID must match on all nodes.
MapR uses each node's native operating system configuration to authenticate users and groups for access to the cluster. If you are deploying a large cluster, you should consider configuring all nodes to use LDAP or another user management system. You can use the MapR Control System to give specific permissions to particular users and groups. For more information, see Managing Permissions. Each user can be restricted to a specific amount of disk usage. For more information, see Managing Quotas.
By default, MapR gives the root user full administrative permissions. If the nodes do not have an explicit root login (as is sometimes the case with Ubuntu, for example), you can give full permissions to another user after deployment. See Setting the Administrative User.
On the node where you plan to run the mapr-webserver (the MapR Control System), install Pluggable Authentication Modules (PAM). See PAM Configuration.

To create a volume for a user or group:
1. In the Volumes view, click New Volume.
2. In the New Volume dialog, set the volume attributes:
In Volume Setup, type a volume name. Make sure the Volume Type is set to Normal Volume.
In Ownership & Permissions, set the volume owner and specify the users and groups who can perform actions on the volume.
In Usage Tracking, set the accountable group or user, and set a quota or advisory quota if needed.
In Replication & Snapshot Scheduling, set the replication factor and choose a snapshot schedule.
3. Click OK to save the settings.
See Managing Data with Volumes for more information.
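The same attributes can also be supplied from the command line. The following is a rough sketch; the volume name, mount path, accountable entity, and quota are hypothetical, and the exact option names should be verified against the volume create reference for your MapR version:

/opt/mapr/bin/maprcli volume create -name jsmith-vol -path /user/jsmith -ae jsmith -quota 100G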
You can also create a volume using the command. Managing Data with Volumes volume create You can see users and groups that own volumes in the view or using the command. User Disk Usage entity list Managing Permissions MapR manages permissions using two mechanisms: Cluster and volume permissions use , which specify actions particular users are allowed to perform on a access control lists (ACLs) certain cluster or volume MapR-FS permissions control access to directories and files in a manner similar to Linux file permissions. To manage permissions, you must have permissions. fc Cluster and Volume Permissions Cluster and volume permissions use ACLs, which you can edit using the MapR Control System or the commands. acl Cluster Permissions The following table lists the actions a user can perform on a cluster, and the corresponding codes used in the cluster ACL. Code Allowed Action Includes login Log in to the MapR Control System, use the API and command-line interface, read access on cluster and volumes   ss Start/stop services   cv Create volumes   a Admin access All permissions except fc fc Full control (administrative access and permission to change the cluster ACL) a Setting Cluster Permissions You can modify cluster permissions using the and commands, or using the MapR Control System. acl edit acl set To add cluster permissions using the MapR Control System: Expand the group and click to display the dialog. System Settings Views Permissions Edit Permissions Click to add a new row. Each row lets you assign permissions to a single user or group. [ + Add Permission ] Type the name of the user or group in the empty text field: If you are adding permissions for a user, type , replacing with the username. u:<user> <user> 3. 4. 5. 6. 1. 2. 3. 4. 5. 1. 2. 3. 4. 5. 6. 1. 2. 3. 4. 5. 6. If you are adding permissions for a group, type , replacing with the group name. g:<group> <group> Click the ( ) to expand the Permissions dropdown. Open Arrow Select the permissions you wish to grant to the user or group. Click to save the changes. OK To remove cluster permissions using the MapR Control System: Expand the group and click to display the dialog. System Settings Views Permissions Edit Permissions Remove the desired permissions: To remove all permissions for a user or group: Click the delete button ( ) next to the corresponding row. To change the permissions for a user or group: Click the ( ) to expand the Permissions dropdown. Open Arrow Unselect the permissions you wish to revoke from the user or group. Click to save the changes. OK Volume Permissions The following table lists the actions a user can perform on a volume, and the corresponding codes used in the volume ACL. Code Allowed Action dump Dump the volume restore Mirror or restore the volume m Modify volume properties, create and delete snapshots d Delete a volume fc Full control (admin access and permission to change volume ACL) To mount or unmount volumes under a directory, the user must have read/write permissions on the directory (see ). MapR-FS Permissions You can set volume permissions using the and commands, or using the MapR Control System. acl edit acl set To add volume permissions using the MapR Control System: Expand the group and click . MapR-FS Volumes To create a new volume and set permissions, click to display the dialog. New Volume New Volume To edit permissions on a existing volume, click the volume name to display the dialog. Volume Properties In the section, click to add a new row. 
Each row lets you assign permissions to a single user or Permissions [ + Add Permission ] group. Type the name of the user or group in the empty text field: If you are adding permissions for a user, type , replacing with the username. u:<user> <user> If you are adding permissions for a group, type , replacing with the group name. g:<group> <group> Click the ( ) to expand the Permissions dropdown. Open Arrow Select the permissions you wish to grant to the user or group. Click to save the changes. OK To remove volume permissions using the MapR Control System: Expand the group and click . MapR-FS Volumes Click the volume name to display the dialog. Volume Properties Remove the desired permissions: To remove all permissions for a user or group: Click the delete button ( ) next to the corresponding row. To change the permissions for a user or group: Click the ( ) to expand the Permissions dropdown. Open Arrow Unselect the permissions you wish to revoke from the user or group. Click to save the changes. OK MapR-FS Permissions MapR-FS permissions are similar to the POSIX permissions model. Each file and directory is associated with a user (the ) and a group. You owner can set read, write, and execute permissions separately for: The owner of the file or directory Members of the group associated with the file or directory All other users. The permissions for a file or directory are called its . The mode of a file or directory can be expressed in two ways: mode Text - a string that indicates the presence of the read ( ), write ( ), and execute ( ) permission or their absence ( ) for the owner, group, r w x - and other users respectively. Example: rwxr-xr-x Octal - three octal digits (for the owner, group, and other users), that use individual bits to represent the three permissions. Example: 755 Both and represent the same mode: the owner has all permissions, and the group and other users have read and execute rwxr-xr-x 755 permissions only. Text Modes String modes are constructed from the characters in the following table. Text Description u The file's owner. g The group associated with the file or directory. o Other users (users that are not the owner, and not in the group). a All (owner, group and others). = Assigns the permissions Example: "a=rw" sets read and write permissions and disables execution for all. - Removes a specific permission. Example: "a-x" revokes execution permission from all users without changing read and write permissions. + Adds a specific permission. Example: "a+x" grants execution permission to all users without changing read and write permissions. r Read permission w Write permission x Execute permission Octal Modes To construct each octal digit, add together the values for the permissions you wish to grant: Read: 4 Write: 2 Execute: 1 Syntax You can change the modes of directories and files in the MapR storage using either the command with the option, or using hadoop fs -chmod the command via NFS. The syntax for both commands is similar: chmod hadoop fs -chmod [-R] <MODE>[,<MODE>]... | <OCTALMODE> <URI> [<URI> ...] chmod [-R] <MODE>[,<MODE>]... | <OCTALMODE> <URI> [<URI> ...] Parameters and Options Parameter/Option Description -R If specified, this option applies the new mode recursively throughout the directory structure. MODE A string that specifies a mode. OCTALMODE A three-digit octal number that specifies the new mode for the file or directory. URI A relative or absolute path to the file or directory for which to change the mode. 
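For instance, to apply a mode recursively through a directory tree (the paths here are hypothetical):

hadoop fs -chmod -R 755 /user/jsmith/project

or, through an NFS mount of the cluster:

chmod -R 755 /mapr/<cluster name>/user/jsmith/project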
Examples The following examples are all equivalent: chmod 755 script.sh chmod u=rwx,g=rx,o=rx script.sh chmod u=rwx,go=rx script.sh Managing Quotas Quotas limit the disk space used by a volume or an (user or group) on an M5-licensed cluster, by specifying the amount of disk space the entity volume or entity is allowed to use: A volume quota limits the space used by a volume. A user/group quota limits the space used by all volumes owned by a user or group. Quotas are expressed as an integer value plus a single letter to represent the unit: B - bytes K - kilobytes M - megabytes G - gigabytes T - terabytes P - petabytes Example: 500G specifies a 500 gigabyte quota. The size of a disk space quota is expressed in terms of the actual data stored. System compression and replication affect total disk consumption. Disk consumption is not charged to a user or volume's quota. For example, a 10G file that is compressed to 8G and has a replication factor of 3 consumes 24G (3*8G) of disk space, but charges only 10G to the user or volume's quota. If a volume or entity exceeds its quota, further disk writes are prevented and a corresponding alarm is raised: AE_ALARM_AEQUOTA_EXCEEDED - an entity exceeded its quota VOLUME_ALARM_QUOTA_EXCEEDED - a volume exceeded its quota A quota that prevents writes above a certain threshold is also called a . In addition to the hard quota, you can also set an quot hard quota advisory a for a user, group, or volume. An advisory quota does not enforce disk usage limits, but raises an alarm when it is exceeded: AE_ALARM_AEADVISORY_QUOTA_EXCEEDED - an entity exceeded its advisory quota VOLUME_ALARM_ADVISORY_QUOTA_EXCEEDED - a volume exceeded its advisory quota In most cases, it is useful to set the advisory quota somewhat lower than the hard quota, to give advance warning that disk usage is approaching the allowed limit. To manage quotas, you must have or permissions. a fc 1. 2. 3. 4. 5. 1. 2. 3. 4. 5. 1. 2. 3. 4. 5. 6. 7. 1. 2. Quota Defaults You can set hard quota and advisory quota defaults for users and groups. When a user or group is created, the default quota and advisory quota apply unless overridden by specific quotas. Setting Volume Quotas and Advisory Quotas You can set a volume quota using the command, or use the following procedure to set a volume quota using the MapR Control volume modify System. To set a volume quota using the MapR Control System: In the Navigation pane, expand the group and click the view. MapR-FS Volumes Display the dialog by clicking the volume name, or by selecting the checkbox beside the volume name then clicking Volume Properties the button. Properties In the Usage Tracking section, select the checkbox and type a quota (value and unit) in the field. Example: Volume Quota 500G To set the advisory quota, select the checkbox and type a quota (value and unit) in the field. Example: Volume Advisory Quota 250G After setting the quota, click to exit save changes to the volume. Modify Volume Setting User/Group Quotas and Advisory Quotas You can set a user/group quota using the command, or use the following procedure to set a user/group quota using the MapR entity modify Control System. To set a user or group quota using the MapR Control System: In the Navigation pane, expand the MapR-FS group and click the view. User Disk Usage Select the checkbox beside the user or group name for which you wish to set a quota, then click the button to display the Edit Properties dialog. 
User Properties In the Usage Tracking section, select the checkbox and type a quota (value and unit) in the field. Example: User/Group Quota 500G To set the advisory quota, select the checkbox and type a quota (value and unit) in the field. Example: User/Group Advisory Quota 250 G After setting the quota, click to exit save changes to the entity. OK Setting Quota Defaults You can set an entity quota using the command, or use the following procedure to set an entity quota using the MapR Control entity modify System. To set quota defaults using the MapR Control System: In the Navigation pane, expand the group. System Settings Click the view to display the dialog. Quota Defaults Configure Quota Defaults To set the user quota default, select the checkbox in the User Quota Defaults section, then type a quota Default User Total Quota (value and unit) in the field. To set the user advisory quota default, select the checkbox in the User Quota Defaults section, then type Default User Advisory Quota a quota (value and unit) in the field. To set the group quota default, select the checkbox in the Group Quota Defaults section, then type a quota Default Group Total Quota (value and unit) in the field. To set the group advisory quota default, select the checkbox in the Group Quota Defaults section, then Default Group Advisory Quota type a quota (value and unit) in the field. After setting the quota, click to exit save changes to the entity. Save Setting the Administrative User The administrative user has full control over the cluster, and can assign permissions for other users to manage aspects of the cluster. To give full administrative control to a user Log on to any cluster node as (or use for the following command). root sudo Execute the following command, replacing with the username of the account that will get administrative control: <user> sudo /opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc 1. 2. a. b. c. 3. For general information about users and groups in the cluster, see . Users and Groups Converting a Cluster from Root to Non-root User This procedure converts a MapR cluster running as to run as a non-root user. Non-root operation is available from MapR version 2.0 and root later. In addition to converting the MapR user to a non-root user, you can also disable superuser privileges to the cluster for the root user for additional security. To convert a MapR cluster from running as root to running as a non-root user: Create a user with the same UID/GID across the cluster. Assign that user to the environment variable. MAPR_USER On each node: Stop the warden and the ZooKeeper (if present). # service mapr-warden stop # service mapr-zookeeper stop Run the config-mapr-user.sh script to configure the cluster to start as the non-root user. # /opt/mapr/server/config-mapr-user.sh -u <MapR user> [-g <MapR group>] Start the ZooKeeper (if present) and the warden. # service mapr-zookeeper start # service mapr-warden start After the previous step is complete on all nodes in the cluster, run the script on all nodes. upgrade2mapruser.sh # /opt/mapr/server/upgrade2mapruser.sh This command may take several minutes to return. The script waits ten minutes for the process to complete across the entire cluster. If the cluster-wide operation takes longer than ten minutes, the script fails. Re-run the script on all nodes where the script failed. 
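To grant narrower administrative rights instead of full control, the same acl edit syntax accepts the permission codes listed under Cluster Permissions. The following sketch assumes a hypothetical user jsmith who should only be able to log in, create volumes, and start or stop services:

sudo /opt/mapr/bin/maprcli acl edit -type cluster -user jsmith:login,cv,ss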
To disable superuser access for the root user To disable root user (UID 0) access to the MapR filesystem on a cluster that is running as a non-root user, use either of the following commands: The configuration value treats all requests from UID 0 as coming from UID -2 (nobody): squash root # maprcli config save -values {"cldb.squash.root":"1"} The configuration value automatically fails all filesystem requests from UID 0: reject root You must perform these steps on all nodes on a stable cluster. Do not perform this procedure concurrently while upgrading packages. The alarm may raise during this process. The alarm will clear when this process is complete on MAPR_UID_MISMATCH all nodes. 1. 2. 3. # maprcli config save -values {"cldb.reject.root":"1"} You can verify that these commands worked, as shown in the example below. # maprcli config load -keys cldb.squash.root,cldb.reject.root cldb.reject.root cldb.squash.root 1 1 Working with Multiple Clusters To mirror volumes between clusters, create an additional entry in on the source volume's cluster for each additional mapr-clusters.conf cluster that hosts a mirror of the volume. The entry must list the cluster's name, followed by a comma-separated list of hostnames and ports for the cluster's CLDB nodes. To set up multiple clusters On each cluster, make a note of the cluster name and CLDB nodes (the first line in ) mapr-clusters.conf On each webserver and CLDB node, add the remote cluster's CLDB nodes to , using the /opt/mapr/conf/mapr-clusters.conf following format: clustername1 <CLDB> <CLDB> <CLDB> [ clustername2 <CLDB> <CLDB> <CLDB> ] [ ... ] On each cluster, restart the service on all nodes where it is running. mapr-webserver To set up cross-mirroring between clusters You can between clusters, mirroring some volumes from cluster A to cluster B and other volumes from cluster B to cluster A. To set cross-mirror up cross-mirroring, create entries in as follows: mapr-clusters.conf Entries in on cluster A nodes: mapr-clusters.conf First line contains cluster name and CLDB servers of cluster A (the local cluster) Second line contains cluster name and CLDB servers of cluster B (the remote cluster) Entries in on cluster B nodes: mapr-clusters.conf First line contains cluster name and CLDB servers of cluster B (the local cluster) Second line contains cluster name and CLDB servers of cluster A (the remote cluster) For example, the file for cluster A with three CLDB nodes (nodeA, nodeB, and nodeC) would look like this: mapr-clusters.conf clusterA <nodeA> <nodeB> <nodeC> clusterB <nodeD> The file for cluster B with one CLDB node (nodeD) would look like this: mapr-clusters.conf clusterB <nodeD> clusterA <nodeA> <nodeB> <nodeC> By creating additional entries in the file, you can mirror from one cluster to several others. mapr-clusters.conf When a mirror volume is created on a remote cluster (according to the entries in the file), the CLDB checks that the local mapr-clusters.conf volume exists in the local cluster. If both clusters are not set up and running, the remote mirror volume cannot be created. To set up a mirror volume, make sure: Each cluster is already set up and running Each cluster has a unique name Every node in each cluster can resolve all nodes in remote clusters, either through DNS or entries in /etc/hosts Setting Up MapR NFS The MapR NFS service lets you access data on a licensed MapR cluster via the protocol: NFS M3 license: one NFS node allows you to access your cluster as a standard POSIX-compliant filesystem. 
M5 license: multiple NFS servers allow each node to mount its own MapR-FS via NFS, with VIPs enabled for high availability (HA) and load balancing
You can mount the MapR cluster via NFS and use standard shell scripting to read and write live data in the cluster. NFS access to cluster data can be faster than accessing the same data with the hadoop commands. To mount the cluster via NFS from a client machine, see Setting Up the Client.

Before You Start: NFS Setup Requirements
Make sure the following conditions are met before using the MapR NFS gateway:
The stock Linux NFS service must not be running. Linux NFS and MapR NFS cannot run concurrently.
On RedHat and CentOS v6.0 and higher, the rpcbind service must be running. You can use the command ps ax | grep rpcbind to check.
On RedHat and CentOS v5.x and lower, and on Ubuntu and SUSE, the portmapper service must be running. You can use the command ps ax | grep portmap to check.
The mapr-nfs package must be present and installed. You can list the contents of the /opt/mapr/roles directory and check for nfs in the list.
Make sure you have applied an M3 license or an M5 (paid or trial) license to the cluster. See Adding a License.
Make sure the MapR NFS service is started (see Services).
Verify that the primary group of the user listed for mapr.daemon.user in the /opt/mapr/conf/daemon.conf file is mapr.daemon.group. Restart the Warden after any changes to daemon.conf.
For information about mounting the cluster via NFS, see Setting Up the Client.
To preserve compatibility with 32-bit applications and system calls, MapR-NFS uses 32-bit inode numbers by default. On 64-bit clients, this default forces the client's 64-bit inode numbers to be hashed down to 32 bits, which can potentially cause inode number conflicts. To change the default behavior to 64-bit inode numbers, set the value of the Use32BitFileId property to 0 in the nfsserver.conf file, then restart the NFS server.

NFS on an M3 Cluster
At installation time, choose one node on which to run the NFS gateway. NFS is lightweight and can be run on a node running services such as CLDB or ZooKeeper. To add the NFS service to a running cluster, use the instructions in Managing Services on a Node to install the mapr-nfs package on the node where you would like to run NFS.

NFS on an M5 Cluster
At cluster installation time, plan which nodes should provide NFS access according to your anticipated traffic. For instance, if you need 5Gbps of
Use a minimum of one VIP per NFS node per NIC that clients will use to connect to the NFS server. If you have four nodes with four NICs each, with each NIC connected to an individual IP subnet, use a minimum of 16 VIPs and direct clients to the VIPs in round-robin fashion. The VIPs should be in the same IP subnet as the interfaces to which they will be assigned. See for details on enabling VIPs for your Setting Up VIPs for NFS cluster. Here are a few tips: Set up NFS on at least three nodes if possible. All NFS nodes must be accessible over the network from the machines where you want to mount them. To serve a large number of clients, set up dedicated NFS nodes and load-balance between them. If the cluster is behind a firewall, you can provide access through the firewall via a load balancer instead of direct access to each NFS node. You can run NFS on all nodes in the cluster, if needed. To provide maximum bandwidth to a specific client, install the NFS service directly on the client machine. The NFS gateway on the client manages how data is sent in or read back from the cluster, using all its network interfaces (that are on the same subnet as the cluster nodes) to transfer data via MapR APIs, balancing operations among nodes as needed. Use VIPs to provide High Availability (HA) and failover. To add the NFS service to a running cluster, use the instructions in to install the package on the nodes Managing Services on a Node mapr-nfs where you would like to run NFS. NFS Memory Settings The memory allocated to each MapR service is specified in the file, which MapR automatically configures /opt/mapr/conf/warden.conf based on the physical memory available on the node. You can adjust the minimum and maximum memory used for NFS, as well as the percentage of the heap that it tries to use, by setting the , , and parameters in the file on each NFS node. percent max min warden.conf Example: ... service.command.nfs.heapsize.percent=3 service.command.nfs.heapsize.max=1000 service.command.nfs.heapsize.min=64 ... The percentages need not add up to 100; in fact, you can use less than the full heap by setting the parameters for all heapsize.percent services to add up to less than 100% of the heap size. In general, you should not need to adjust the memory settings for individual services, unless you see specific memory-related problems occurring. Running NFS on a Non-standard Port To run NFS on an arbitrary port, modify the following line in : warden.conf service.command.nfs.start=/etc/init.d/mapr-nfsserver start Add to the end of the line, as in the following example: -p <portnumber> service.command.nfs.start=/etc/init.d/mapr-nfsserver start -p 12345 After modifying , restart the MapR NFS server by issuing the following command: warden.conf maprcli node services -nodes <nodename> -nfs restart You can verify the port change with the command. rpcinfo -p localhost Enabling Debug Logging for NFS Debug-level logging is available to help you isolate and identify NFS-related issues. To enable logging at the debug level, enter this command at the command line: maprcli trace setlevel -port 9998 -level debug where indicates NFS. -port 9998 In default mode, information is logged to a buffer and dumped periodically. To display information immediately instead, enable mod continuous e by entering: maprcli trace setmode -port 9998 -mode continuous Sample log output from an command is shown here: ls Click here to expand... 
From /opt/mapr/logs/nfsserver.log: 2013-06-10 16:13:27,2278 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:555 127.0.0.1[0x5d349889] NFS Proc=NFSPROC3_GETATTR 2013-06-10 16:13:27,2278 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:1022 127.0.0.1[0x5d349889] NFS FileHandle: 2.1012313856.2.2.2 2013-06-10 16:13:28,3774 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:555 MapR uses version 3 of the NFS protocol. NFS version 4 bypasses the port mapper and attempts to connect to the default port only. If you are running NFS on a non-standard port, mounts from NFS version 4 clients time out. Use the option to specify -o nfsvers=3 NFS version 3. The log level provides much more information than the default log level of . debug info 127.0.0.1[0x5e349889] NFS Proc=NFSPROC3_ACCESS 2013-06-10 16:13:28,3774 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:1022 127.0.0.1[0x5e349889] NFS FileHandle: 2.1012313856.2.2.2 2013-06-10 16:13:28,3775 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:555 127.0.0.1[0x5f349889] NFS Proc=NFSPROC3_GETATTR 2013-06-10 16:13:28,3775 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:1022 127.0.0.1[0x5f349889] NFS FileHandle: 2.1012313856.2.2.2 2013-06-10 16:13:28,3776 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:555 127.0.0.1[0x60349889] NFS Proc=NFSPROC3_READDIRPLUS 2013-06-10 16:13:28,3783 INFO nfsserver[30283] fs/nfsd/mount.cc:822 Cluster my.cluster.com, Setting myTopology to /default-rack/ubuntu-n3.jon.prv 2013-06-10 16:13:28,3784 DEBUG nfsserver[30283] fs/nfsd/cache.cc:659 127.0.0.1[0x60349889] Sending CLDB Lookup for cid=3410106368.2049 (sleep=0) ip= cldb=10.10.80.41:7222 2013-06-10 16:13:28,3906 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:555 127.0.0.1[0x61349889] NFS Proc=NFSPROC3_LOOKUP 2013-06-10 16:13:28,3906 DEBUG nfsserver[30283] fs/nfsd/attrs.cc:1032 127.0.0.1[0x61349889] Lookup: my.cluster.com 2013-06-10 16:13:28,3906 DEBUG nfsserver[30283] fs/nfsd/cache.cc:449 127.0.0.1[0x61349889] using existing RpcBinding 2013-06-10 16:13:28,3927 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:555 127.0.0.1[0x62349889] NFS Proc=NFSPROC3_GETATTR 2013-06-10 16:13:28,3927 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:1022 127.0.0.1[0x62349889] NFS FileHandle: 2.1012313856.2.2.2 2013-06-10 16:13:28,8755 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:555 127.0.0.1[0x63349889] NFS Proc=NFSPROC3_GETATTR 2013-06-10 16:13:28,8755 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:1022 127.0.0.1[0x63349889] NFS FileHandle: 0.3410106368.2049.16.2 2013-06-10 16:13:28,8755 DEBUG nfsserver[30283] fs/nfsd/cache.cc:449 127.0.0.1[0x63349889] using existing RpcBinding 2013-06-10 16:13:28,8759 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:555 127.0.0.1[0x64349889] NFS Proc=NFSPROC3_ACCESS 2013-06-10 16:13:28,8759 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:1022 127.0.0.1[0x64349889] NFS FileHandle: 0.3410106368.2049.16.2 2013-06-10 16:13:28,8759 DEBUG nfsserver[30283] fs/nfsd/cache.cc:449 127.0.0.1[0x64349889] using existing RpcBinding 2013-06-10 16:13:28,8763 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:555 127.0.0.1[0x65349889] NFS Proc=NFSPROC3_GETATTR 2013-06-10 16:13:28,8763 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:1022 127.0.0.1[0x65349889] NFS FileHandle: 0.3410106368.2064.16.2 2013-06-10 16:13:28,8763 DEBUG nfsserver[30283] fs/nfsd/cache.cc:659 127.0.0.1[0x65349889] Sending CLDB Lookup for cid=3410106368.2064 (sleep=0) ip= cldb=10.10.80.41:7222 2013-06-10 16:13:28,8886 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:555 127.0.0.1[0x66349889] NFS Proc=NFSPROC3_GETATTR 2013-06-10 16:13:28,8886 DEBUG nfsserver[30283] 
fs/nfsd/nfsserver.cc:1022 127.0.0.1[0x66349889] NFS FileHandle: 0.3410106368.2049.44.66108 2013-06-10 16:13:28,8886 DEBUG nfsserver[30283] fs/nfsd/cache.cc:449 127.0.0.1[0x66349889] using existing RpcBinding 2013-06-10 16:13:28,8889 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:555 127.0.0.1[0x67349889] NFS Proc=NFSPROC3_GETATTR 2013-06-10 16:13:28,8890 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:1022 127.0.0.1[0x67349889] NFS FileHandle: 0.3410106368.2537.16.2 2013-06-10 16:13:28,8890 DEBUG nfsserver[30283] fs/nfsd/cache.cc:659 127.0.0.1[0x67349889] Sending CLDB Lookup for cid=3410106368.2537 (sleep=0) ip= cldb=10.10.80.41:7222 2013-06-10 16:13:28,9185 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:555 127.0.0.1[0x68349889] NFS Proc=NFSPROC3_GETATTR 2013-06-10 16:13:28,9186 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:1022 127.0.0.1[0x68349889] NFS FileHandle: 0.3410106368.2050.16.2 2013-06-10 16:13:28,9186 DEBUG nfsserver[30283] fs/nfsd/cache.cc:659 127.0.0.1[0x68349889] Sending CLDB Lookup for cid=3410106368.2050 (sleep=0) ip= cldb=10.10.80.41:7222 2013-06-10 16:13:28,9312 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:555 127.0.0.1[0x69349889] NFS Proc=NFSPROC3_GETATTR 2013-06-10 16:13:28,9312 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:1022 127.0.0.1[0x69349889] NFS FileHandle: 0.3410106368.2536.16.2 2013-06-10 16:13:28,9312 DEBUG nfsserver[30283] fs/nfsd/cache.cc:659 127.0.0.1[0x69349889] Sending CLDB Lookup for cid=3410106368.2536 (sleep=0) ip= cldb=10.10.80.41:7222 2013-06-10 16:13:28,9432 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:555 127.0.0.1[0x6a349889] NFS Proc=NFSPROC3_GETATTR 2013-06-10 16:13:28,9432 DEBUG nfsserver[30283] fs/nfsd/nfsserver.cc:1022 127.0.0.1[0x6a349889] NFS FileHandle: 0.3410106368.2535.16.2 2013-06-10 16:13:28,9432 DEBUG nfsserver[30283] fs/nfsd/cache.cc:659 1. 2. 3. 4. 5. 6. 7. a. b. 8. 9. 127.0.0.1[0x6a349889] Sending CLDB Lookup for cid=3410106368.2535 (sleep=0) ip= cldb=10.10.80.41:7222 The log shows every operation sent to and received from an NFS client. To return to the default log level and log mode, enter: maprcli trace setlevel -port 9998 -level info maprcli trace setmode -mode default High Availability NFS You can easily set up a pool of NFS nodes with HA and failover using virtual IP addresses (VIPs); if one node fails the VIP will be automatically reassigned to the next NFS node in the pool. If you do not specify a list of NFS nodes, then MapR uses any available node running the MapR NFS service. You can add a server to the pool simply by starting the MapR NFS service on it. Before following this procedure, make sure you are running NFS on the servers to which you plan to assign VIPs. You should install NFS on at least three nodes. If all NFS nodes are connected to only one subnet, then adding another NFS server to the pool is as simple as starting NFS on that server; the MapR cluster automatically detects it and adds it to the pool. You can restrict VIP assignment to specific NFS nodes or MAC addresses by adding them to the NFS pool list manually. VIPs are not assigned to any nodes that are not on the list, regardless of whether they are running NFS. If the cluster's NFS nodes have multiple network interface cards (NICs) connected to different subnets, you should restrict VIP assignment to the NICs that are on the correct subnet: for each NFS server, choose whichever MAC address is on the subnet from which the cluster will be NFS-mounted, then add it to the list. If you add a VIP that is not accessible on the subnet, then failover will not work. 
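As a command-line alternative to the dialog described below, a VIP range can be assigned with the virtualip add command. The following is a sketch with illustrative addresses; the option names shown should be checked against the virtualip add reference for your release:

/opt/mapr/bin/maprcli virtualip add -virtualip 10.10.30.100 -virtualipend 10.10.30.103 -netmask 255.255.255.0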
You can only set up VIPs for failover between network interfaces that are in the same subnet. In large clusters with multiple subnets, you can set up multiple groups of VIPs to provide NFS failover for the different subnets. You can set up VIPs with the command, or using the Add Virtual IPs dialog in the MapR Control System. The Add Virtual IPs dialog virtualip add lets you specify a range of virtual IP addresses and assign them to the pool of servers that are running the NFS service. The available servers are displayed in the left pane in the lower half of the dialog. Servers that have been added to the NFS VIP pool are displayed in the right pane in the lower half of the dialog. Setting Up VIPs for NFS To set up VIPs for NFS using the MapR Control System: In the Navigation pane, expand the group and click the view. NFS HA NFS Setup Click to start the NFS Gateway service on nodes where it is installed. Start NFS Click to display the Add Virtual IPs dialog. Add VIP Enter the start of the VIP range in the field. Starting IP Enter the end of the VIP range in the field. If you are assigning one one VIP, you can leave the field blank. Ending IP Enter the Netmask for the VIP range in the field. Example: Netmask 255.255.255.0 If you wish to restrict VIP assignment to specific servers or MAC addresses: If each NFS node has one NIC, or if all NICs are on the same subnet, select NFS servers in the left pane. If each NFS node has multiple NICs connected to different subnets, select the server rows with the correct MAC addresses in the left pane. Click to add the selected servers or MAC addresses to the list of servers to which the VIPs will be assigned. The servers appear in Add the right pane. Click to assign the VIPs and exit. OK Setting up a MapR Cluster on Amazon Elastic MapReduce The MapR distribution for Hadoop adds enterprise-grade features to the Hadoop platform that make Hadoop easier to use and more dependable. The MapR distribution for Hadoop is fully integrated with Amazon's (EMR) framework, allowing customers to deploy a MapR Elastic MapReduce cluster with ready access to Amazon's cloud infrastructure. MapR provides network file system (NFS) and open database connectivity (ODBC) interfaces, a comprehensive management suite, and automatic compression. MapR provides high availability with a no-NameNode architecture and data protection with snapshots, disaster recovery, and cross-cluster mirroring. For more details on EMR with MapR, visit the Amazon EMR detail page. with the MapR Distribution for Hadoop 1. 2. 3. 4. 5. 6. Starting an EMR Job Flow with the MapR Distribution for Hadoop from the AWS Management Console Log in to your Amazon Web Services Account: Use your normal Amazon Web Services (AWS) credentials to log in to your AWS account. From the AWS Management Console, select . Elastic MapReduce From the drop-down selector at the upper right, select a region where your job flow will run. Click the button in the center of the page. Create New Job Flow Select a MapR Edition and version from the drop-down selector: , , or Hadoop Version MapR M3 Edition MapR M5 Edition MapR M7 . Edition MapR M3 Edition is a complete Hadoop distribution that provides many unique capabilities such as industry-standard NFS and ODBC interfaces, end-to-end management, high reliability and automatic compression. You can manage a MapR cluster via the AWS Management Console, the command line, or a REST API. Amazon EMR's standard rates include the full functionality of MapR M3 at no additional cost. 
MapR M5 Edition expands the capabilities of M3 with enterprise-grade capabilities such as , and high availability snapshots mirror . ing MapR M7 Edition provides native MapR table functionality on MapR-FS, enabling responsive HBase-style flat table databases compatible with snapshots and mirroring. Continue to specify your job flow as described in . Creating a Job Flow Amazon EMR with MapR provides a Debian environment with MapR software running on each node. MapR's NFS interface mounts the cluster is mounted on localhost at the directory. Packages for Hadoop ecosystem components are in the directory. /mapr /home/hadoop/mapr-pkgs The MapR distribution for Hadoop does not support Apache HBase on Amazon EMR. Starting Pig and Hive Sessions as Individual Job Flows To start an interactive Pig session directly, select when you create the job flow, then select Pig program Start an Interactive Pig . Session To start an interactive Hive session directly, select when you create the job flow, then select Hive program Start an Interactive Hive . Session For general information on EMR Job Flows, see Amazon's . documentation Starting an EMR Job Flow with the MapR Distribution for Hadoop from the Command Line Interface Use the parameter with the command to specify a MapR distribution. Specify the MapR --supported-product mapr elastic-mapreduce edition and version by passing arguments with the parameter in the following format: --args --args "--edition,<edition label>,--version,<version number>" You can use to specify the following editions: --edition m3 m5 m7 You can use to specify the following versions: --version 1.2.8 2.1.2 3.0 Use the parameter to specify how much of the instance's storage space to reserve for the MapR file system. This parameter mfs-percentage has a ceiling of 100 and a floor of 50. Specifying percentages outside this range will result in the floor or ceiling being applied instead, and a message written to the log. Storage space not reserved for MapR is available for native Linux file storage. The following table lists parameters that you can specify at the command line and the results as interpreted by MapR: EMR Command Line Parameter Command Processed by MapR --supported-product mapr --edition m3 --supported-product mapr-m5 --edition m5 --supported-product mapr-m3 --edition m3 --with-supported-products mapr-m3 --edition m3 --with-supported-products mapr-m5 --edition m5 --supported-product mapr-m5 --args "--version,1.1" --edition m5 --version 1.1 --supported-product mapr-m5 --args "--edition,m3" Returns an error --supported-product mapr --args "--edition,m5" --edition m5 --supported-product mapr --args "--version,1.1" --edition m3 --version1.1 --supported-product mapr --args "--edition,m5,--key1 value1" --edition m5 --key1 value1 Launching a job flow with MapR M3 The following command launches a job flow with one EC2 Large instance as a master that uses the MapR M3 Edition distribution, version 2.1.2. This instance reserves 75 percent of the storage space for the MapR file system and keeps 25 percent of the storage space available for native Linux file storage. To use the command line interface commands, download and install the . Amazon Elastic MapReduce Ruby Client The parameter is deprecated and does not support arguments such as , , or --with-supported-products --edition --version . --keyN valueN On a Windows system, use the command instead of . 
ruby elastic-mapreduce elastic-mapreduce ./elastic-mapreduce --create --alive \ --instance-type m1.xlarge\ --num-instances 5 \ --supported-product mapr \ --args "--edition,m3,--version,2.1.2,--mfs-percentage,75" To pass bootstrap parameters, add the and parameters before the parameter. The --bootstrap-action --args --instance-type following command launches a job flow and passes a value of 4 to the parameter as a bootstrap mapred.tasktracker.map.tasks.maximum action: ./elastic-mapreduce --create --alive \ --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \ --args -m,mapred.tasktracker.map.tasks.maximum=4 --instance-type m1.xlarge\ --num-instances 5 \ --supported-product mapr \ --args "--edition,m3,--version,2.1.2" \ See for more information about the command's options. the linked article elastic-mapreduce To use MapR commands with a REST API, include the following mandatory parameters: SupportedProducts.member.1=mapr-m3 bootstrap-action=s3://elasticmapreduce/thirdparty/mapr/scripts/mapr_emr_install.sh args="--base-path,s3://elasticmapreduce/thirdparty/mapr/" In the request to , set a member of the list to a value that corresponds to the MapR edition you'd like to run RunJobFlow SupportedProducts the job flow on. See for more information on how to interact with your EMR cluster using a REST API. the documentation Launching an M3 edition MapR cluster with the REST API 1. 2. 3. 4. 5. 6. 1. 2. https://elasticmapreduce.amazonaws.com?Action=RunJobFlow &Name=MyJobFlowName &LogUri=s3n%3A%2F%2Fmybucket%2Fsubdir &Instances.MasterInstanceType=m1.xlarge &Instances.SlaveInstanceType=m1.xlarge &Instances.InstanceCount=4 &Instances.Ec2KeyName=myec2keyname &Instances.Placement.AvailabilityZone=us-east-1a &Instances.KeepJobFlowAliveWhenNoSteps=true &Instances.TerminationProtected=true &Steps.member.1.Name=MyStepName &Steps.member.1.ActionOnFailure=CONTINUE &Steps.member.1.HadoopJarStep.Jar=MyJarFile &Steps.member.1.HadoopJarStep.MainClass=MyMainClass &Steps.member.1.HadoopJarStep.Args.member.1=arg1 &Steps.member.1.HadoopJarStep.Args.member.2=arg2 &SupportedProducts.member.1=mapr-m3 &AuthParams Enabling MCS access for your EMR Cluster After your MapR job flow is running, you need to open port 8453 to enable access to the (MCS) from hosts other than the MapR Control System host that launched the cluster. Follow these steps to open the port. Select your job from the list of jobs displayed in in the tab of the AWS Your Elastic MapReduce Job Flows Elastic MapReduce Management Console, then select the tab in the lower pane. Make a note of the Master Public DNS Name value. Click the Description A tab in the AWS Management Console to open the Amazon EC2 Console Dashboard. mazon EC2 Select from the group in the pane at the left of the EC2 Console Dashboard. Security Groups Network & Security Navigation Select from the list displayed in . Elastic MapReduce-master Security Groups In the lower pane, click the tab. Inbound In , type . Leave the default value in the : field. Port Range: 8453 Source Click , then click . Add Rule Apply Rule Changes You can now navigate to the master node's DNS address. Connect to port 8453 to log in to the MapR Control System. Use the string for hadoop both login and password at the MCS login screen. Testing Your Cluster Follow these steps to create a file and run your first MapReduce job: Connect to the master node with SSH as user hadoop. 
Pass your .pem credentials file to ssh with the -i flag, as in this example: ssh -i /path_to_pemfile/credentials.pem [email protected] Create a simple text file: The standard MapR port is 8443. Use port number 8453 instead of 8443 when you use the MapR REST API calls to a MapR on Amazon EMR cluster. For M5 and M7 Edition MapR clusters on EMR, the MCS web server runs on the primary and secondary CLDB nodes, giving you another entry point to the MCS if the primary fails. 2. 3. 4. cd /mapr/MapR_EMR.amazonaws.com mkdir in echo "the quick brown fox jumps over the lazy dog" > in/data.txt Run the following command to perform a word count on the text file: hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /mapr/MapR_EMR.amazonaws.com/in /mapr/MapR_EMR.amazonaws.com/out As the job runs, you should see terminal output similar to the following: 12/06/09 00:00:37 INFO fs.JobTrackerWatcher: Current running JobTracker is: ip-10-118-194-139.ec2.internal/10.118.194.139:9001 12/06/09 00:00:37 INFO input.FileInputFormat: Total input paths to process : 1 12/06/09 00:00:37 INFO mapred.JobClient: Running job: job_201206082332_0004 12/06/09 00:00:38 INFO mapred.JobClient: map 0% reduce 0% 12/06/09 00:00:50 INFO mapred.JobClient: map 100% reduce 0% 12/06/09 00:00:57 INFO mapred.JobClient: map 100% reduce 100% 12/06/09 00:00:58 INFO mapred.JobClient: Job complete: job_201206082332_0004 12/06/09 00:00:58 INFO mapred.JobClient: Counters: 25 12/06/09 00:00:58 INFO mapred.JobClient: Job Counters 12/06/09 00:00:58 INFO mapred.JobClient: Launched reduce tasks=1 12/06/09 00:00:58 INFO mapred.JobClient: Aggregate execution time of mappers(ms)=6193 12/06/09 00:00:58 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/06/09 00:00:58 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/06/09 00:00:58 INFO mapred.JobClient: Launched map tasks=1 12/06/09 00:00:58 INFO mapred.JobClient: Data-local map tasks=1 12/06/09 00:00:58 INFO mapred.JobClient: Aggregate execution time of reducers(ms)=4875 12/06/09 00:00:58 INFO mapred.JobClient: FileSystemCounters 12/06/09 00:00:58 INFO mapred.JobClient: MAPRFS_BYTES_READ=385 12/06/09 00:00:58 INFO mapred.JobClient: MAPRFS_BYTES_WRITTEN=276 12/06/09 00:00:58 INFO mapred.JobClient: FILE_BYTES_WRITTEN=94449 12/06/09 00:00:58 INFO mapred.JobClient: Map-Reduce Framework 12/06/09 00:00:58 INFO mapred.JobClient: Map input records=1 12/06/09 00:00:58 INFO mapred.JobClient: Reduce shuffle bytes=94 12/06/09 00:00:58 INFO mapred.JobClient: Spilled Records=16 12/06/09 00:00:58 INFO mapred.JobClient: Map output bytes=80 12/06/09 00:00:58 INFO mapred.JobClient: CPU_MILLISECONDS=1530 12/06/09 00:00:58 INFO mapred.JobClient: Combine input records=9 12/06/09 00:00:58 INFO mapred.JobClient: SPLIT_RAW_BYTES=125 12/06/09 00:00:58 INFO mapred.JobClient: Reduce input records=8 12/06/09 00:00:58 INFO mapred.JobClient: Reduce input groups=8 12/06/09 00:00:58 INFO mapred.JobClient: Combine output records=8 12/06/09 00:00:58 INFO mapred.JobClient: PHYSICAL_MEMORY_BYTES=329244672 12/06/09 00:00:58 INFO mapred.JobClient: Reduce output records=8 12/06/09 00:00:58 INFO mapred.JobClient: VIRTUAL_MEMORY_BYTES=3252969472 12/06/09 00:00:58 INFO mapred.JobClient: Map output records=9 12/06/09 00:00:58 INFO mapred.JobClient: GC time elapsed (ms)=18 Check the /mapr/MapR_EMR.amazonaws.com/out directory for a file named part-r-00000 with the results of the job. 4. 
cat out/part-r00000 brown 1 dog 1 fox 1 jumps 1 lazy 1 over 1 quick 1 the 2 Note that the ability to use standard Linux tools such as and in this example are made possible by MapR's ability to mount the cluster echo cat on NFS at . /mapr/MapR_EMR.amazonaws.com Troubleshooting Cluster Administration This section provides information about troubleshooting cluster administration problems. Click a subtopic below for more detail. MapR Control System doesn't display on Internet Explorer 'ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils' in cldb.log, caused by mixed-case cluster name in mapr-clusters.conf Error 'mv Failed to rename maprfs...' when moving files across volumes How to find a node's serverid Out of Memory Troubleshooting MapR Control System doesn't display on Internet Explorer The MapR Control System supports Internet Explorer version 9 and above. In IE9, under the menu must be turned off, Compatibility View Tools or else the user interface will not display correctly. 'ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils' in cldb.log, caused by mixed-case cluster name in mapr-clusters.conf MapR cluster names are case sensitive. However, some versions of MapR v1.2.x have a bug in which the cluster names specified in /opt/mapr are not treated as case sensitive. If you have a cluster with a mixed-case name, after upgrading from v1.2 to /conf/mapr-clusters.conf v2.0+, you may experience CLDB errors (in particular for mirror volumes) which generate messages like the following in : cldb.log 2012-07-31 04:43:50,716 ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils [VolumeMirrorThread]: Unable to reach cluster with name: qacluster1.2.9. No entry found in file /conf/mapr-clusters.conf for cluster qacluster1.2.9. Failing the CLDB RPC with status 133 (The path given in this message is relative to , which might be misleading.) /opt/mapr/ As a work-around after upgrading, to continue working with mirror volumes created in v1.2, duplicate any lines with upper-case letters in mapr-cl , converting all letters to lower case. usters.conf Mirror volumes created in v2.0+ do not exhibit this behavior. Error 'mv Failed to rename maprfs...' when moving files across volumes Prior to version 2.1, you cannot move files across volume boundaries in the MapR Data Platform. You can move files within a volume using the h command, but attempting to move files to a different volume results in an error of the form " adoop fs -mv mv: Failed to rename ". maprfs://<source path> to <destination path> As a workaround, you can copy the file(s) from source volume to destination volume, and then remove the source files. The example below shows the failure occurring. In this example directories and are mount-points for two distinct volumes. /a /b root@node1:~# hadoop fs -ls / Found 2 items drwxrwxrwx - root root 0 2011-12-02 15:14 /a drwxrwxrwx - root root 0 2011-12-02 15:09 /b root@node1:~# hadoop fs -put testfile /a root@node1:~# hadoop fs -ls /a Found 1 items -rwxrwxrwx 3 root root 2048000 2011-12-02 15:18 /a/testfile root@node1:~# hadoop fs -mv /a/testfile /b mv: Failed to rename maprfs://10.10.80.71:7222/a/testfile to /b root@node1:~# The example below shows the work-around, moving a file to directory , and then removing the source file. 
/a/testfile /b root@node1:~# hadoop fs -cp /a/testfile /b/testfile root@node1:~# hadoop fs -ls /b Found 1 items -rwxrwxrwx 3 root root 2048000 2011-12-02 15:19 /b/testfile root@node1:~# hadoop fs -rmr /a/testfile Deleted maprfs://10.10.80.71:7222/a/testfile root@node1:~# hadoop fs -ls /a root@node1:~# This workaround is only necessary if and correspond to different volumes. /a /b How to find a node's serverid Some commands take an argument , which is an unique identifier for each node in a cluster. This id is also sometimes maprcli serverid referred to as the "node id". To find the , use the command, which lists information about all nodes in a cluster. The field is the value to serverid maprcli node list id use for . serverid For example: $ maprcli node list -columns hostname,id id hostname ip 4800813424089433352 node-28.lab 10.10.20.28 6881304915421260685 node-29.lab 10.10.20.29 4760082258256890484 node-31.lab 10.10.20.31 8350853798092330580 node-32.lab 10.10.20.32 2618757635770228881 node-33.lab 10.10.20.33 You can also get this listing as a JSON object by using the option. For example: -json 1. 2. $ maprcli node list -columns id,hostname -json { "timestamp":1358537735777, "status":"OK", "total":5, "data":[ { "id":"4800813424089433352", "ip":"10.10.20.28", "hostname":"node-28.lab" }, { "id":"6881304915421260685", "ip":"10.10.20.29", "hostname":"node-29.lab" }, { "id":"4760082258256890484", "ip":"10.10.20.31", "hostname":"node-31.lab" }, { "id":"8350853798092330580", "ip":"10.10.20.32", "hostname":"node-32.lab" }, { "id":"2618757635770228881", "ip":"10.10.20.33", "hostname":"node-33.lab" } ] } Out of Memory Troubleshooting When the aggregated memory used by MapReduce tasks exceeds the memory reserve on a TaskTracker node, tasks can fail or be killed. MapR attempts to prevent out-of-memory exceptions by killing MapReduce tasks when memory becomes scarce. If you allocate too little Java heap for the expected memory requirements of your tasks, an exception can occur. The following steps can help configure MapR to avoid these problems: If a particular job encounters out-of-memory conditions, the simplest way to solve the problem might be to reduce the memory footprint of the map and reduce functions, and to ensure that the partitioner distributes map output to reducers evenly. If it is not possible to reduce the memory footprint of the application, try increasing the Java heap size (-Xmx) in the client-side MapReduce configuration. If many jobs encounter out-of-memory conditions, or if jobs tend to fail on specific nodes, it may be that those nodes are advertising too many TaskTracker slots. In this case, the cluster administrator should reduce the number of slots on the affected nodes. To reduce the number of slots on a node: Stop the TaskTracker service on the node: $ sudo maprcli node services -nodes <node name> -tasktracker stop 2. 3. 1. 2. 3. Edit the file : /opt/mapr/hadoop/hadoop-<version>/conf/mapred-site.xml Reduce the number of map slots by lowering mapred.tasktracker.map.tasks.maximum Reduce the number of reduce slots by lowering mapred.tasktracker.reduce.tasks.maximum Start the TaskTracker on the node: $ sudo maprcli node services -nodes <node name> -tasktracker start Client Compatibility Matrix Feature Client: 2.x Cluster: 3.x Client: 3.x Cluster: 2.x Comments Mirroring Yes Yes   Snapshots Yes Yes   MapReduce Yes Yes   Hadoop client Yes Yes   MapR Tables No No MapR Tables require both cluster and client to be version 3.0.0 or later. 
For versions 3.0.1 and later, HBase version 94.9 must be present on the cluster. Mirroring with Multiple Clusters To mirror volumes between clusters, create an additional entry in on the source volume's cluster for each additional mapr-clusters.conf cluster that hosts a mirror of the volume. The entry must list the cluster's name, followed by a comma-separated list of hostnames and ports for the cluster's CLDB nodes. To set up multiple clusters On each cluster make a note of the cluster name and CLDB nodes (the first line in ) mapr-clusters.conf On each webserver and CLDB node, add the remote cluster's CLDB nodes to , using the /opt/mapr/conf/mapr-clusters.conf following format: clustername1 <CLDB> <CLDB> <CLDB> [ clustername2 <CLDB> <CLDB> <CLDB> ] [ ... ] On each cluster, restart the service on all nodes where it is running. mapr-webserver To set up cross-mirroring between clusters You can between clusters, mirroring some volumes from cluster A to cluster B and other volumes from cluster B to cluster A. To set cross-mirror up cross-mirroring, create entries in as follows: mapr-clusters.conf Entries in on cluster A nodes: mapr-clusters.conf First line contains name and CLDB servers of cluster A Second line contains name and CLDB servers of cluster B Entries in on cluster B nodes: mapr-clusters.conf First line contains name and CLDB servers of cluster B Second line contains name and CLDB servers of cluster A For example, the file for cluster A with three CLDB nodes (nodeA, nodeB, and nodeC) would look like this: mapr-clusters.conf clusterA <nodeA> <nodeB> <nodeC> clusterB <nodeD> The file for cluster B with one CLDB node (nodeD) would look like this: mapr-clusters.conf clusterB <nodeD> clusterA <nodeA> <nodeB> <nodeC> By creating additional entries, you can mirror from one cluster to several others. Each cluster must already be set up and running. Each cluster must have a unique name. Every node in every cluster must be able to resolve all nodes in other clusters, either through DNS or entries in . /etc/hosts Development Guide Welcome to the MapR Development Guide! This guide is for application developers who write programs using MapReduce and other tools in the Hadoop ecosystem. Click on one of the sub-sections below to get started. Accessing MapR-FS in C Applications Accessing MapR-FS in Java Applications My application that includes maprfs-0.1.jar is now missing dependencies and fails to link Garbage Collection in MapR Working with MapReduce Configuring MapReduce Compiling Pipes Programs Working with MapR-FS Chunk Size Compression Working with Data Accessing Data with NFS Copying Data from Apache Hadoop Provisioning Applications MapR Metrics and Job Performance Maven Repository and Artifacts for MapR Working with Cascading Upgrading Cascading Working with Flume Upgrading Flume Working with HBase HBase Best Practices Upgrading HBase Enabling HBase Access Control Working with HCatalog Upgrading HCatalog Working with Hive Hive ODBC Connector Using HiveServer2 Upgrading Hive Troubleshooting Hive Issues Using HCatalog and WebHCat with Hive Working with Mahout Upgrading Mahout Working with Oozie Upgrading Oozie Working with Pig Upgrading Pig Working with Sqoop Upgrading Sqoop Working with Whirr Upgrading Whirr Integrating MapR's GitHub Repositories With Your IDE Troubleshooting Development Issues Integrating MapR's GitHub and Maven Repositories With Your IDE Related Topics See the for details on managing a MapR cluster. 
Administration Guide See the for details on planning and installing a MapR cluster. Installation Guide See the for details on upgrading the core software on a MapR cluster. Upgrade Guide Accessing MapR-FS in C Applications MapR provides a modified version of that supports access to both MapR-FS and HDFS. MapR-FS is API-compatible with HDFS; if libhdfs.so you already have a client program built to use , you do not have to relink your program just to access the MapR filesystem. libhdfs.so However, re-linking to the MapR-specific shared library will give you better performance, because it does not make any libMapRClient.so Java calls to access the filesystem (unlike ): libhdfs.so If you will be using with a Java MapReduce application, then you must link your program to (see libMapRClient.so libjvm run1.sh , below). If you will be using with a C/C++ client program (no java involved), then you do not need to link to . In this libMapRClient.so libjvm case, use the following options: gcc -Wl -allow-shlib-undefined (see run2.sh) The library provides backward compatibility; if you need to access a distributed filesystem other than MapR-FS, you must link to libhdfs.so li . bhdfs.so The APIs are defined in the header file, which includes documentation for /opt/mapr/hadoop/hadoop-0.20.2/src/c++/libhdfs/hdfs.h each API. Three sample programs are included in the same directory: , , and . hdfs_test.c hdfs_write.c hdfs_read.c Finally, before running your program, some environment variables need to be set depending on what option is chosen. For examples, look at run and . 1.sh run2.sh run1.sh The examples below work with gcc v4.4, and are known to fail to compile with later versions. #!/bin/bash #Ensure JAVA_HOME is defined if [ ${JAVA_HOME} = "" ] ; then echo "JAVA_HOME not defined" exit 1 fi #Setup environment export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2/ export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/mapr/lib:${JAVA_HOME}/jre/lib/amd64/server/ GCC_OPTS="-I. -I${HADOOP_HOME}/src/c++/libhdfs -I${JAVA_HOME}/include -I${JAVA_HOME}/include/linux -L${HADOOP_HOME}/c++/lib -L${JAVA_HOME}/jre/lib/amd64/server/ -L/opt/mapr/lib -lMapRClient -ljvm" #Compile gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_test.c -o hdfs_test gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_read.c -o hdfs_read gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_write.c -o hdfs_write #Run tests ./hdfs_test -m run2.sh #!/bin/bash #Setup environment export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2/ GCC_OPTS="-Wl,--allow-shlib-undefined -I. -I${HADOOP_HOME}/src/c++/libhdfs -L/opt/mapr/lib -lMapRClient" export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/mapr/lib #Compile and Link gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_test.c -o hdfs_test gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_read.c -o hdfs_read gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_write.c -o hdfs_write #Run tests ./hdfs_test -m Accessing MapR-FS in Java Applications This page describes how to access MapR-FS in a Java program, and includes sample code. This page contains the following topics: Using JARs from Maven Using JARs from MapR Installation Writing a Java Application As a high-performance filesystem, portions of the MapR-FS file client are based on a native library. All dependencies required to access files on MapR-FS are included in a one JAR, for 32- and 64-bit Linux, 64-bit Mac OSx, and 32- and 64-bit Windows clients. 
When developing an application, specifying dependence on this JAR allows you to build applications without having to manage platform-specific dependencies. When your application loads the library, if the library for the target OS is not available on the Java , the loader will maprfs maprclient CLASSPATH search the contents of the fat JAR and find there. maprclient Using JARs from Maven MapR publishes Maven artifacts from version 2.1.2 onward at . http://repository.mapr.com/maven/ For example, when compiling for MapR version 2.1.3, add the following dependency to the project's file. This dependency will pull the pom.xml rest of the dependencies from MapR's Maven repository the next time you do a . mvn clean install <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-core</artifactId> <version>1.0.3-mapr-2.1.3.1</version> </dependency> For a complete list of MapR-provided artifacts and further details, see . Maven Repository and Artifacts for MapR Using JARs from MapR Installation You can also find the library JAR in the directory. maprfs /opt/mapr/lib Compiling the below requires only the Hadoop core JAR: #sample code javac -cp /opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar MapRTest.java For example, to run the sample code on MapR version 3.0.0, using the following library path: java -Djava.library.path=/opt/mapr/lib -cp .:$(hadoop classpath)\ /opt/mapr/hadoop/hadoop-0.20.2/conf:\ /opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar:\ /opt/mapr/hadoop/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:\ /opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-1.0.3-mapr-3.0.0.jar:\ /opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-jni-1.0.3-mapr-3.0.0.jar:\ /opt/mapr/hadoop/hadoop-0.20.2/lib/zookeeper-3.3.6.jar \ MapRTest /test Writing a Java Application In your Java application, you will use a object to interface with MapR-FS. When you run your Java application, add the Hadoop Configuration configuration directory to the Java classpath. When you instantiate a object /opt/mapr/hadoop/hadoop-<version>/conf Configuration , it is created with default values drawn from configuration files in that directory. Sample Code The following sample code shows how to interface with MapR-FS using Java. The example creates a directory, writes a file, then reads the contents of the file. /* Copyright (c) 2009 & onwards. MapR Tech, Inc., All rights reserved */ //package com.mapr.fs; import java.net.*; import org.apache.hadoop.fs.*; import org.apache.hadoop.conf.*; /** * Assumes mapr installed in /opt/mapr * * compilation needs only hadoop jars: * javac -cp /opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar MapRTest.java * * Run: * java -Djava.library.path=/opt/mapr/lib -cp /opt/mapr/hadoop/hadoop-0.20.2/conf:/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-d ev-core.jar:/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-0.1.jar:.:/opt/mapr/hadoop/hadoo p-0.20.2/lib/commons-logging-1.0.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/lib/zookeeper-3. 
3.2.jar MapRTest /test */ public class MapRTest { public static void main(String args[]) throws Exception { byte buf[] = new byte[ 65*1024]; int ac = 0; if (args.length != 1) { System.out.println("usage: MapRTest pathname"); return; } // maprfs:/// -> uses the first entry in /opt/mapr/conf/mapr-clusters.conf // maprfs:///mapr/my.cluster.com/ // /mapr/my.cluster.com/ // String uri = "maprfs:///"; String dirname = args[ac++]; Configuration conf = new Configuration(); //FileSystem fs = FileSystem.get(URI.create(uri), conf); // if wanting to use a different cluster FileSystem fs = FileSystem.get(conf); Path dirpath = new Path( dirname + "/dir"); Path wfilepath = new Path( dirname + "/file.w"); //Path rfilepath = new Path( dirname + "/file.r"); Path rfilepath = wfilepath; // try mkdir boolean res = fs.mkdirs( dirpath); if (!res) { System.out.println("mkdir failed, path: " + dirpath); return; } System.out.println( "mkdir( " + dirpath + ") went ok, now writing file"); // create wfile FSDataOutputStream ostr = fs.create( wfilepath, true, // overwrite 512, // buffersize (short) 1, // replication (long)(64*1024*1024) // chunksize ); ostr.write(buf); ostr.close(); System.out.println( "write( " + wfilepath + ") went ok"); // read rfile System.out.println( "reading file: " + rfilepath); FSDataInputStream istr = fs.open( rfilepath); int bb = istr.readInt(); istr.close(); System.out.println( "Read ok"); } } My application that includes maprfs-0.1.jar is now missing dependencies and fails to link As of version 2.1.2 of the MapR distribution, the contents of are separated into two parts: and maprfs-0.1.jar maprfs-<version>.jar map . The refers to the version of the MapR distribution. For example, if you have an existing application rfs-jni-<version>.jar <version> written for and you update it to load , you must also include . This change maprfs-0.1.jar maprfs-2.1.2.jar maprfs-jni-2.1.2.jar was made to enable loading on distributed class-loader environments that use the libraries to access MapR-FS from multiple contexts. maprfs These files are installed in the directory, or can be accessed via the Maven Central JAR /opt/mapr/hadoop/hadoop<version>/lib/ Repository. Garbage Collection in MapR The garbage collection (GC) algorithms in Java provide opportunities for performance optimizations for your application. Java provides the following GC algorithms: Serial GC. This algorithm is typically used in client-style applications that don't require low pause times. Specify   to -XX:+UseSerialGC use this algorithm. Parallel GC, which is optimized to maximize throughput. Specify   to use this algorithm. -XX:+UseParNewGC Mostly-Concurrent or   GC, which is optimized to minimize latency. Specify   to use Concurrent Mark-Sweep -XX:+UseConcMarkSweepGC this algorithm. Garbage First GC, a new GC algorithm intended to replace Concurrent Mark-Sweep GC. Specify   to use this algorithm. -XX:+UseG1GC Consider testing your application with different GC algorithms to determine their effects on performance. Flags for GC Debugging Set the following flags in Java to log the GC algorithm's behavior for later analysis: -verbose:gc -Xloggc:<filename> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime For more information, see the Java   document or the   links. 
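As an illustration of how these debugging flags can be applied to MapReduce tasks, the sketch below passes GC logging options through the standard mapred.child.java.opts property on the command line. The heap size, GC log path, and input/output directories are placeholder values, not settings recommended by MapR; adjust them for your own jobs.

# A minimal sketch: run the bundled wordcount example with the Concurrent
# Mark-Sweep collector and GC logging enabled for the task JVMs.
# The -Xmx value, log path, and input/output paths are placeholders.
hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount \
  -Dmapred.child.java.opts="-Xmx1000m -XX:+UseConcMarkSweepGC -verbose:gc -Xloggc:/tmp/task-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps" \
  /user/mapr/in /user/mapr/out

After the job finishes, the resulting GC logs can be compared across the algorithms listed above to see which collector behaves best for your workload.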
Garbage Collection Tuning Java Garbage Collection Working with MapReduce If you have used Hadoop in the past to run MapReduce jobs, then running jobs on MapR Distribution for Apache Hadoop will be very familiar to you. MapR is a full Hadoop distribution, API-compatible with all versions of Hadoop. MapR provides additional capabilities not present in any other Hadoop distribution. Click on one of the sub-sections below to get started. Configuring MapReduce Job Scheduling Standalone Operation Tuning Your MapR Install Compiling Pipes Programs Configuring MapReduce You can configure your MapR installation in a number of ways to address your specific cluster's needs. This section contains information about the following topics: Job Scheduling - Prioritize the MapReduce jobs that run on your MapR cluster Standalone Operation - Running MapReduce jobs locally, using the local filesystem Tuning Your MapR Install - Strategies for optimizing resources to meet the goals of your application Job Scheduling You can use job scheduling to prioritize the MapReduce jobs that run on your MapR cluster. The MapReduce system supports a minimum of one queue, named . Hence, this parameter's value should always contain the string default de . Some job schedulers, like the Capacity Scheduler, support multiple queues. fault The default job schedule is queue-based and uses FIFO (First In First Out) ordering. In a production environment with multiple users or groups that compete for cluster resources, consider using one of the multiuser schedulers available in MapR: the Fair Scheduler or the Capacity Scheduler. MapR Hadoop supports the following job schedulers: FIFO queue-based scheduler: This is the default scheduler. The FIFO queue scheduler runs jobs based on the order in which the jobs were submitted. You can prioritize a job by changing the value of the property or by calling the mapred.job.priority setJobPriori method. ty() Fair Scheduler: The Fair Scheduler allocates a share of cluster capacity to each user over time. The design goal of the Fair Scheduling is to assign resources in to jobs so that each job receives an equal share of resources over time. The Fair Scheduler enforces fair sharing within each pool. Running jobs share the pool’s resources. Capacity Scheduler: The Capacity Scheduler enables users or organizations to simulate an individual MapReduce cluster with FIFO scheduling for each user or organization. You can define organizations using . queues The Capacity Scheduler The Capacity Scheduler is a multi-user MapReduce job scheduler that enables organizations to simulate a dedicated MapReduce cluster with FIFO scheduling for users or organizations. The Capacity Scheduler divides the cluster into multiple that may identify distinct groups or organizations. Each queue is allocated a queues capacity (a fraction of the total capacity of the grid) and jobs are submitted to queues and scheduled within that queue using FIFO scheduling. Enabling the Capacity Scheduler To enable the Capacity Scheduler on MapR, define the property in the file. mapred.jobtracker.taskScheduler mapred-default.xml Property Value mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.CapacityTaskScheduler Configuring the Capacity Scheduler Setting Up Queues The Capacity Scheduler enables you to define multiple queues to which users and groups can submit jobs. Once queues are defined, users can submit jobs to a queue using the property name in the job configuration. 
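For example, assuming the cluster administrator has defined a queue named research (the queue name and the input and output paths below are illustrative only), a user could direct a job to that queue from the command line:

# Submit the bundled wordcount example to a hypothetical queue named "research".
# The input and output paths are placeholders.
hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount \
  -Dmapred.job.queue.name=research \
  /user/mapr/in /user/mapr/out

The same property can also be set in the job's configuration object before the job is submitted.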
mapred.job.queue.name To define multiple queues, modify the property in the file. mapred.queue.names mapred-site.xml Property Description mapred.queue.names Comma separated list of queues to which jobs can be submitted. A separate configuration file may be used to configure properties for each of the queues managed by the scheduler. For more information see Co nfiguring Properties for Queues Setting Up Job Queue ACLs The Capacity Scheduler enables you to define access control lists (ACLs) to control which users and groups can submit jobs to each queue using the configuration parameters of the form . mapred.queue.queue-name.acl-name To enable and configure ACLs for the queue,  define the following properties in the file. mapred-default.xml Property Description mapred.acls.enabled If , access control lists are supported and are checked whenever true a job is submitted or administered. mapred.queue.<queue-name>.acl-submit-job Specifies a list of users and groups that can submit jobs to the specified . The comma-separated lists of users and queue-name groups are separated by a blank space. For example, user1,user2 . To define a list of groups only, enter a blank space at group1,group2 the beginning of the group list. mapred.queue.<queue-name>.acl-administer-jobs Specifies a list of users and groups that can change the priority or kill jobs submitted to the specified . The comma-separated queue-name lists of users and groups are separated by a blank space. Example: u . To define a list of groups only, enter a ser1,user2 group1,group2 blank space at the beginning of the group list. No matter the ACL, the job owner can always change the priority of or kill a job. Configuring Properties for Queues The Capacity Scheduler enables you to configure queue-specific properties that determine how each queue is managed by the scheduler. All queue-specific properties are defined in the file. By default, a single queue named is configured. conf/capacity-scheduler.xml default To specify a property for a queue that is defined in the site configuration, you should use the property name as mapred.capacity-scheduler . For example, to define the property for queue named , you should .queue.<queue-name>.<property-name> guaranteed-capacity research specify the property name as . mapred.capacity-scheduler.queue.research.guaranteed-capacity The properties defined for queues and their descriptions are listed in the table below: Property Description mapred.capacity-scheduler.queue.<queue-name>.guaran teed-capacity Specifies the percentage of the slots in the cluster that are guaranteed to be available for jobs in this queue. The sum of the guaranteed capacities configured for all queues must be less than or equal to . 100 mapred.capacity-scheduler.queue.<queue-name>.reclai m-time-limit Specifies the amount of time (in seconds) before which resources distributed to other queues will be reclaimed.   mapred.capacity-scheduler.queue.<queue-name>.suppor ts-priority If , the priorities of jobs are taken into account in scheduling true decisions. Jobs with higher priority value are given access to the queue's resources before jobs with the lower priority value. mapred.capacity-scheduler.queue.<queue-name>.minimu m-user-limit-percent Specifies a value that defines the maximum percentage of resources that can be allocated to a user at any given time. The minimum percentage of resources allocated depends on the number of users who have submitted jobs. For example, suppose a value of is set 25 for this property. 
If two users have submitted jobs to a queue, no single user can use more than 50% of the queue resources. If a third user submits a job, no single user can use more than 33% of the queue resources. With four or more users, no user can use more than 25% of the queue's resources. If a value of is set, no user 100 limits are imposed. Memory Management Job Initialization Parameters The Capacity Scheduler initializes jobs before they are scheduled and thereby reduces the memory footprint of the JobTracker.  You can control the "laziness" of the job initialization by defining the following properties in the file. capacity-scheduler.xml 1. 2. 3. 4. 1. 2. Property Description mapred.capacity-scheduler.queue.<queue-name>.maximu m-initialized-jobs-per-user Specifies the maximum number of jobs that can be pre-initialized for a  user in the queue. Once a job starts running, the scheduler no longer takes that job into consideration when it computes the maximum number of jobs each user is allowed to initialize. mapred.capacity-scheduler.init-poll-interval Specifies the time (in milliseconds) used to poll the scheduler job queue for jobs to be initialized. mapred.capacity-scheduler.init-worker-threads Specifies the number of worker threads used to initialize jobs in a set of queues. If the configured value is equal to the number of job queues, each thread is assigned jobs from a single queue. If the configured value is less than number of queues, a single thread can receive jobs from more than one queue; the thread initializes the queues in a round-robin fashion. If the configured value is greater than number of queues, the number of threads spawned is equal to number of job queues. Administering the Capacity Scheduler Once the installation and configuration is completed, you can review it after starting the cluster from the admin UI. Start the Map/Reduce cluster as usual. Open the JobTracker web UI. The queues you have configured should be listed under the Scheduling Information section of the page. The properties for the queues should be visible in the Scheduling Information column against each queue. The Fair Scheduler The Fair Scheduler is a multi-user MapReduce job scheduler that enables organizations to share a large cluster among multiple users and ensure that all jobs get roughly an equal share of CPU time. The Fair Scheduler organizes jobs into and shares resources fairly across all pools. By default, each user is allocated a separate pool and, pools therefore, gets an equal share of the cluster no matter how many jobs they submit. Within each pool, fair sharing is used to share capacity between the running jobs. Pools can also be given weights to share the cluster non-proportionally in the config file. Using the Fair Scheduler, you can define custom pools that are guaranteed minimum capacities. Enabling the Fair Scheduler To enable the Fair Scheduler in your MapR cluster, define the property in the file mapred.jobtracker.taskScheduler mapred-site.xml and set several Fair Scheduler properties in the file. mapred-site.xml Define the property to in the file. mapred.fairscheduler.allocation.file conf/pools.xml mapred-site.xml <property> <name>mapred.fairscheduler.allocation.file</name> <value>conf/pools.xml</value> </property> Define the property in the file. mapred.jobtracker.taskScheduler mapred-site.xml When using the Fair Scheduler with preemption, you must disable and task prefetch. For details on prefetch, label-based job placement see parameter on page . 
For details on preemption, see the mapreduce.tasktracker.prefetch.maptasks mapred-site.xml Apac . he Hadoop documentation on the Fair Scheduler 2. 3. 4. 5. <property> <name>mapred.jobtracker.taskScheduler</name> <value>org.apache.hadoop.mapred.FairScheduler</value> </property> Set the property to in the file. mapred.fairscheduler.assignmultiple true mapred-site.xml <property> <name>mapred.fairscheduler.assignmultiple</name> <value>true</value> </property> Set the property to in the file. mapred.fairscheduler.eventlog.enabled false mapred-site.xml <property> <name>mapred.fairscheduler.eventlog.enabled</name> <value>false</value> </property> Restart the JobTracker, then check that the Fair Scheduler is running by going to on the http://<jobtracker URL>/scheduler JobTracker's web UI. For example, browse to on a node running the job tracker. For more http://localhost:50030/scheduler information about the job scheduler administration page, see . Administering the Fair Scheduler Configuring the Fair Scheduler The following properties can be set in to configure the Fair Scheduler. Whenever you change Fair Scheduler properties, you mapred-site.xml must restart the JobTracker. Property Description mapred.fairscheduler.allocation.file Specifies the path to the XML file ( ) that contains the conf/pools.xml allocations for each pool, as well as the per-pool and per-user limits on number of running jobs. If this property is not provided, allocations are not used. mapred.fairscheduler.assignmultiple A Boolean property that allows the scheduler to assign both a map task and a reduce task on each heartbeat. This improves cluster throughput when there are many small tasks to run. Default: . false mapred.fairscheduler.sizebasedweight If , the size of a job is taken into account in calculating its weight true for fair sharing. The weight given to the job is to the log proportional of the number of tasks required. If , the weight of a job is based false entirely on its priority. mapred.fairscheduler.poolnameproperty Specifies which jobconf property is used to determine the pool that a job belongs in. String, default: user.name (that is, one pool for each user). Some other useful values to set this to are: group.name: to create a pool per Unix group. mapred.job.queue.name: the same property as the queue name in the Capacity Scheduler. mapred.fairscheduler.preemption A Boolean property for enabling preemption. Default: . false The Fair Scheduler ExpressLane MapR provides an express path for small MapReduce jobs to run when all slots are occupied by long tasks. Small jobs are only given this special treatment when the cluster is busy, and only if they meet the criteria specified by the following parameters in : mapred-site.xml Property Value Description mapred.fairscheduler.smalljob.sch edule.enable true Enable small job fast scheduling inside fair scheduler. TaskTrackers should reserve a slot called ephemeral slot which is used for smalljob if cluster is busy. mapred.fairscheduler.smalljob.max .maps 10 Small job definition. Max number of maps allowed in small job. mapred.fairscheduler.smalljob.max .reducers 10 Small job definition. Max number of reducers allowed in small job. mapred.fairscheduler.smalljob.max .inputsize 10737418240 Small job definition. Max input size in bytes allowed for a small job. Default is 10GB. mapred.fairscheduler.smalljob.max .reducer.inputsize 1073741824 Small job definition. Max estimated input size for a reducer allowed in small job. Default is 1GB per reducer. 
mapred.cluster.ephemeral.tasks.me mory.limit.mb 200 Small job definition. Max memory in mbytes reserved for an ephemeral slot. Default is 200mb. This value must be same on JobTracker and TaskTracker nodes. MapReduce jobs that appear to fit the small job definition but are in fact larger than anticipated are killed and re-queued for normal execution. Fair Scheduler Extension Points The Fair Scheduler offers several extension points through which the basic functionality can be extended. For example, the weight calculation can be modified to give a priority boost to new jobs, implementing a "shortest job first" policy which reduces response times for interactive jobs even further. mapred.fairscheduler.weightadjuster Specifies a class that adjusts the weights of running jobs. This class should implement the WeightAdjuster interface. There is currently one example implementation: the NewJobWeightB , which increases the weight of jobs for the first five minutes ooster of their lifetime to let short jobs finish faster. To use it, set the weight property to the full class name, adjuster org.apache.hadoop.m . Then set the duration and boost apred.NewJobWeightBooster factor parameters: mapred.newjobweightbooster.factor: Factor by which new jobs weight should be boosted. Default is 3 mapred.newjobweightbooster.duration: Duration in milliseconds. Default is 300000 (five minutes) mapred.fairscheduler.loadmanager Specifies a class that determines how many maps and reduces can run on a given TaskTracker. This class should implement the LoadManager interface. By default, the task caps in the Hadoop config file are used, but this option could be used to make the load based on available memory and CPU utilization for example. mapred.fairscheduler.taskselector Specifies a class that determines which task from within a job to launch on a given tracker. This can be used to change either the local (for example, keep some jobs within a particular rack) or the ity policy (select when to launch speculative speculative execution algorithm tasks). By default, it uses Hadoop's default algorithms from JobInProgress. Administering the Fair Scheduler You can administer the Fair Scheduler at runtime using two mechanisms: Allocation config file: It is possible to modify pools' allocations and user and pool running job limits at runtime by editing the allocation config file. The scheduler will reload this file 10-15 seconds after it sees that it was modified. Jobtracker web interface: Current jobs, pools, and fair shares can be examined through the JobTracker's web interface, at http://<j . For example, browse to on a node running the job tracker. obtracker URL>/scheduler http://localhost:50030/scheduler In the web interface, you can modify job priorities, move jobs between pools, and see the effects on the fair shares. For each job, the web interface displays the following fields: Field Description Submitted Shows the date and time job was submitted. JobID, User, Name Displays job identifiers as on the standard web UI. Pool Shows the current pool of the job. Select another value to move job to another pool. Priority Shows the current priority of the job. Select another value to change the job's priority. Maps/Reduces Finished Shows the number of tasks finished / total tasks. Maps/Reduces Running Shows the tasks currently running. Map/Reduce Fair Share Shows the average number of task slots that this job should have at any given time according to fair sharing. 
The actual number of tasks will go up and down depending on how much compute time the job has had, but on average it will get its fair share amount. In the advanced web UI (navigate to ), you can view these additional columns that display internal http://<jobtracker URL>/scheduler?advanced calculations: Field Description Maps/Reduce Weight Shows the weight of the job in the fair sharing calculations. The weight of the job depends on its priority and optionally, if the sizeba and properties are enabled, sedweight newjobweightbooster then its size and age. Map/Reduce Deficit Shows the job's scheduling deficit in machine-seconds; that is, the amount of resources the job should have received according to its fair share, minus the amount it actually received. A positive value indicates the job will be scheduled again in the near future because it needs to catch up to its fair share. The Fair Scheduler schedules jobs with higher deficit ahead of others. See #Fair Scheduler for details. Implementation Details Fair Scheduler Implementation Details There are two aspects to implementing fair scheduling: Calculating each job's fair share Choosing which job to run when a task slot becomes available To select jobs to run, the scheduler keeps track of a for each job, which is the difference between the amount of compute time the job deficit should have gotten on an ideal scheduler, and the amount of compute time it actually got. This is a measure of how "unfair" the job's situation is. Every few hundred milliseconds, the scheduler updates the deficit of each job by looking at how many tasks each job had running during this interval vs. its fair share. Whenever a task slot becomes available, it is assigned to the job with the highest deficit. There is one exception: If one or more jobs are not meeting their pool capacity guarantees, the scheduler chooses among only these "needy" jobs, based on their deficit, to ensure that the scheduler meets pool guarantees as soon as possible. The fair shares are calculated by dividing the capacity of the cluster among runnable jobs according to a "weight" for each job. By default the weight is based on priority, with each level of priority having 2x higher weight than the next. (For example, VERY_HIGH has 4x the weight of NORMAL.) However, weights can also be based on job sizes and ages, as described in section . For jobs that are #Configuring the Fair Scheduler in a pool, fair shares also take into account the minimum guarantee for that pool. This capacity is divided among the jobs in that pool according to their weights. When limits on a user's running jobs or a pool's running jobs are in place, the scheduler chooses which jobs get to run by sorting all jobs, first in order of priority, and second in order of submit time, as in the standard Hadoop scheduler. Any jobs that fall after the user/pool's limit in this ordering are queued up and wait idle until they can be run. During this time, they are ignored from the fair sharing calculations and do not gain or lose deficit (that is, their fair share is set to zero). Standalone Operation MapR supports standalone mode for executing jobs. To use standalone mode you must install the MapR core packages - specifically, the mapr-c package. You do not need to install or run any services, including the warden. In standalone mode, a single Java process executes a ore complete Hadoop job within itself using the local file system. 
This is useful for code development and debugging before deploying to a cluster with a larger number of nodes. Note that the MapR client does not support standalone mode; you must install . mapr-core To run MapR in standalone mode, including the use of a local file system, set to and to fs.default.name file:/// mapred.job.tracker . local It is also possible when using standalone mode to access data in MapR-FS. In this case there is still only one process running the job, but either the input or output data is in MapR-FS. In that case, you must of course have a fully configured cluster and you would not set the fs.default.n as shown above but would instead leave the default which uses MapR-FS. ame To set the parameters only for the current job, specify them on the command line. Example: hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount -Dmapred.job.tracker=local -Dfs.default.name=file:/// /root/wcin /root/wcout To make the parameter changes permanent, edit the configuration files: Add the parameter in and set the value to to override the default. fs.default.name core-site.xml file:/// Set to in . mapred.job.tracker local mapred-site.xml Examples Input and output on local filesystem ./bin/hadoop jar hadoop-0.20.2-dev-examples.jar grep -Dmapred.job.tracker=local -Dfs.default.name=file:/// file:///opt/mapr/hadoop/hadoop-0.20.2/input file:///opt/mapr/hadoop/hadoop-0.20.2/output 'dfs[a-z.]+' Input from MapR-FS ./bin/hadoop jar hadoop-0.20.2-dev-examples.jar grep -Dmapred.job.tracker=local input file:///opt/mapr/hadoop/hadoop-0.20.2/output 'dfs[a-z.]+' Output to MapR-FS ./bin/hadoop jar hadoop-0.20.2-dev-examples.jar grep -Dmapred.job.tracker=local file:///opt/mapr/hadoop/hadoop-0.20.2/input output 'dfs[a-z.]+' Tuning Your MapR Install MapR automatically tunes the cluster for most purposes. A service called the determines machine resources on nodes configured to run warden the TaskTracker service, and sets MapReduce parameters accordingly. On nodes with multiple CPUs, MapR uses to reserve CPUs for MapR services: taskset On nodes with five to eight CPUs, CPU 0 is reserved for MapR services On nodes with nine or more CPUs, CPU 0 and CPU 1 are reserved for MapR services In certain circumstances, you might want to manually tune MapR to provide higher performance. For example, when running a job consisting of unusually large tasks, it is helpful to reduce the number of slots on each TaskTracker and adjust the Java heap size. The following sections provide MapReduce tuning tips. If you change any settings in , restart the TaskTracker. mapred-site.xml NFS Write Performance The kernel tunable value represents the number of simultaneous Remote Procedure Call (RPC) requests. sunrpc.tcp_slot_table_entries This tunable's default value is 16. Increasing this value to 128 may improve write speeds. Use the command sysctl -w to set the value. Add an entry to your file to make the setting persist across reboots. sunrpc.tcp_slot_table_entries=128 sysctl.conf NFS write performance varies between different Linux distributions. This suggested change may have no or negative effect on your particular cluster. Inline Setup When inline setup is enabled, each job's setup task runs as a thread directly inside the JobTracker instead of being forked out as a separate task by a TaskTracker. When inline setup is enabled, jobs that require a setup task can show increased performance because those jobs aren't waiting for TaskTrackers to get scheduled and then run the setup task. 
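As a concrete example of the NFS write tuning described earlier in this section, the kernel parameter can be raised at runtime and persisted across reboots as shown below. The value 128 is the suggested starting point mentioned above; as noted, the effect varies by Linux distribution, so measure write throughput on your own cluster before making the change permanent.

# Raise the number of simultaneous RPC requests used by the NFS client.
sudo sysctl -w sunrpc.tcp_slot_table_entries=128

# Persist the setting across reboots.
echo "sunrpc.tcp_slot_table_entries = 128" | sudo tee -a /etc/sysctl.conf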
Enabling the JobTracker to execute user-defined code as the privileged JT user is risky. If your cluster's original installation of MapR was version 1.2.7 or earlier, inline setup is enabled by default. Disable inline setup on production clusters by setting the value of the mapreduce.jobtracke to false in . Add the following section to the file: r.inline.setup.cleanup mapred-site.xml mapred-site.xml <property> <name>mapreduce.jobtracker.inline.setup.cleanup</name> <value>false</value> <description> </description> </property> Memory Settings Memory for MapR Services The memory allocated to each MapR service is specified in the file, which MapR automatically configures /opt/mapr/conf/warden.conf based on the physical memory available on the node. For example, you can adjust the minimum and maximum memory used for the TaskTracker, as well as the percentage of the heap that the TaskTracker tries to use, by setting the appropriate , , and paramet percent max min ers in the file: warden.conf 1. 2. 3. ... service.command.tt.heapsize.percent=2 service.command.tt.heapsize.max=325 service.command.tt.heapsize.min=64 ... The percentages of memory used by the services need not add up to 100; in fact, you can use less than the full heap by setting the heapsize.p parameters for all services to add up to less than 100% of the heap size. In general, you should not need to adjust the memory settings ercent for individual services, unless you see specific memory-related problems occurring. MapReduce Memory The memory allocated for MapReduce tasks normally equals the total system memory minus the total memory allocated for MapR services. If necessary, you can use the parameter to set the maximum physical memory reserved by mapreduce.tasktracker.reserved.physicalmemory.mb MapReduce tasks, or you can set it to to disable physical memory accounting and task management. -1 If the node runs out of memory, MapReduce tasks are killed by the to free memory. You can use (copy OOM-killer mapred.child.oom_adj from to adjust the parameter for MapReduce tasks. The possible values of range from -17 to +15. mapred-default.xml oom_adj oom_adj The higher the score, more likely the associated process is to be killed by the OOM-killer. Job Configuration Map Tasks Map tasks use memory mainly in two ways: The MapReduce framework uses an intermediate buffer to hold serialized (key, value) pairs. The application consumes memory to run the map function. MapReduce framework memory is controlled by . If is less than the data emitted from the mapper, the task ends up io.sort.mb io.sort.mb spilling data to disk. If is too large, the task can run out of memory or waste allocated memory. By default, is set to io.sort.mb io.sort.mb 380MB. Set the value of to approximately 1.5 times the number of data bytes emitted from the mapper. If you cannot resolve io.sort.mb memory problems by adjusting the value of , then try to re-write the application to use less memory in its map function. io.sort.mb Compression To turn off MapR compression for map outputs, set mapreduce.maprfs.use.compression=false To turn on LZO or any other compression, set and mapreduce.maprfs.use.compression=false mapred.compress.map.outpu t=true For more details on selecting a compression algorithm, see . Compression Reduce Tasks If tasks fail because of an Out of Heap Space error, increase the heap space (the option in ) to give -Xmx mapred.reduce.child.java.opts more memory to the tasks. If map tasks are failing, you can also try reducing the value of . 
io.sort.mb (see mapred.map.child.java.opts in mapred-site.xml) TaskTracker Configuration Ideally, the number of map and reduce slots should be decided based on the needs of the application. Map slots should be based on how many map tasks can fit in memory, and reduce slots should be based on the number of CPUs. If each task in a MapReduce job takes 3 GB, and each node has 9GB reserved for MapReduce tasks, then the total number of map slots should be 3. The amount of data each map task must process also affects how many map slots should be configured. If each map task processes 256 MB (the default chunksize in MapR), then each map task should have 800 MB of memory. If there are 4 GB reserved for map tasks, then the number of map slots should be 4000MB/800MB, or 5 slots. There are three ways to tune the calculation for the number of map and reduce slots on each TaskTracker node: Specify the maximum number of map and reduce slots on each TaskTracker node. Define a formula to calculate the maximum number of map and reduce slots. 3. Use default values. Specifying a maximum number of slots You can directly set these parameters to an integer in the file: mapred-site.xml mapred.tasktracker.map.tasks.maximum mapred.tasktracker.reduce.tasks.maximum. Defining a formula You can define a formula for the or param mapred.tasktracker.map.tasks.maximum mapred.tasktracker.reduce.tasks.maximum eter. This formula uses the syntax in and takes the following variables: Eval CPUS - number of CPUs present on the node DISKS - number of disks present on the node MEM - memory reserved for MapReduce tasks The general syntax for these is . For a hypothetical 4-core, 12-disk node, a conditional of the form CONDITIONAL ? TRUE : FALSE (2*CPUS < evaluates to 8, generating 8 map or reduce slots. On a 6-core, 12-disk node, this conditional evaluates to 12, DISKS) ? 2*CPUS : DISKS generating 12 map or reduce slots. Using default values Leave these variables ( and ) set to the mapred.tasktracker.map.tasks.maximum mapred.tasktracker.reduce.tasks.maximum default value of -1. The default indicates that the maximum number of map and reduce slots is calculated by these formulas: Reduce slot calculation The number of reduce slots is calculated first, since leftover memory might be enough for another map slot. 60% of the total RAM available for MapReduce is allocated to reduce tasks. Example: 4994MB of RAM is available for MapReduce, of which 200MB is allocated for small jobs. This leaves 4794MB for map and reduce slots. 60% (4794MB) = 2876MB. Divide the total amount of memory for reduce tasks by 1500MB (default memory size per reduce task). Example: 2876MB/1500MB = 1 slot, with 1376MB left over. Map slot calculation Once the number of reduce slots has been calculated, map slots can be allocated from the remaining memory. Subtract the amount of memory used for reduce slots from the total available for map and reduce slots. Then divide the result by 800MB (default memory size per map task). Example: (4794MB-1500MB)/800MB = 3294MB/800MB = 4 slots with 94MB left over. MapR allows the JobTracker to over-schedule tasks on TaskTracker nodes in advance of the availability of slots, creating a pipeline. This optimization allows TaskTracker to launch each map task as soon as the previous running map task finishes. The number of tasks to over-schedule should be about 25-50% of total number of map slots. You can adjust this number with the parameter mapreduce.tasktracker . 
.prefetch.maptasks ExpressLane MapR provides an express path (called ExpressLane) that works in conjunction with . ExpressLane is for small MapReduce The Fair Scheduler You can assign a different memory size for the reduce task default by changing the value of the mapred.reducetask.memory.defa parameter. ult You can assign a different memory size for the map task default by changing the value of the p mapred.maptask.memory.default arameter. 1. 2. 3. 4. 5. 6. jobs to run when all slots are occupied by long tasks. Small jobs are only given this special treatment when the cluster is busy, and only if they meet the criteria specified by the following parameters in : mapred-site.xml Parameter Value Description mapred.fairscheduler.smalljob.schedule.ena ble true Enable small job fast scheduling inside fair scheduler. TaskTrackers should reserve a slot called ephemeral slot which is used for smalljob if cluster is busy. mapred.fairscheduler.smalljob.max.maps 10 Small job definition. Max number of maps allowed in small job. mapred.fairscheduler.smalljob.max.reducers 10 Small job definition. Max number of reducers allowed in small job. mapred.fairscheduler.smalljob.max.inputsize 10737418240 Small job definition. Max input size in bytes allowed for a small job. Default is 10GB. mapred.fairscheduler.smalljob.max.reducer.i nputsize 1073741824 Small job definition. Max estimated input size for a reducer allowed in small job. Default is 1GB per reducer. mapred.cluster.ephemeral.tasks.memory.limi t.mb 200 Small job definition. Max memory in mbytes reserved for an ephermal slot. Default is 200mb. This value must be same on JobTracker and TaskTracker nodes. MapReduce jobs that appear to fit the small job definition but are in fact larger than anticipated are killed and re-queued for normal execution. Compiling Pipes Programs To facilitate running jobs on various platforms, MapR provides , , and sources. hadoop pipes hadoop pipes util pipes-example To compile the pipes example: Install on all nodes. libssl Set the environment variable as follows: LIBS export LIBS=-lcrypto This is needed for the time being, to fix errors in the configuration script. Change to the directory, and execute the following commands: /opt/mapr/hadoop/hadoop-0.20.2/src/c++/utils chmod +x configure ./configure # resolve any errors make install Change to the directory, and execute the following commands: /opt/mapr/hadoop/hadoop-0.20.2/src/c++/pipes chmod +x configure ./configure # resolve any errors make install The APIs and libraries will be in the directory. /opt/mapr/hadoop/hadoop-0.20.2/src/c++/install When using , all nodes must run the same distribution of the operating system. If you run different distributions (Red Hat and pipes CentOS, for example) on nodes in the same cluster, the compiled application might run on some nodes but not others. 6. 1. 2. Compile : pipes-example cd /opt/mapr/hadoop/hadoop-0.20.2/src/c++ g++ pipes-example/impl/wordcount-simple.cc -Iinstall/include/ -Linstall/lib/ -lhadooputils -lhadooppipes -lss -lcrypto -lpthread -o wc-simple To run the pipes example: Copy the pipes program into MapR-FS. Run the command: hadoop pipes hadoop pipes -Dhadoop.pipes.java.recordreader=true -Dhadoop.pipes.java.recordwriter=true -input <input-dir> -output <output-dir> -program <MapR-FS path to program> Working with MapR-FS This section contains the following subtopics: Chunk Size Compression This page used to contain content that now resides on other pages. The following pages might contain information you are looking for. 
Accessing MapR-FS in C Applications Accessing MapR-FS in Java Applications Chunk Size Files in MapR-FS are split into (similar to Hadoop ) that are normally 256 MB by default. Any multiple of 65,536 bytes is a valid chunks blocks chunk size, but tuning the size correctly is important: Smaller chunk sizes result in larger numbers of map tasks, which can result in lower performance due to task scheduling overhead Larger chunk sizes require more memory to sort the map task output, which can crash the JVM or add significant garbage collection overhead MapR can deliver a single stream at upwards of 300 MB per second, making it possible to use larger chunks than in stock Hadoop. Generally, it is wise to set the chunk size between 64 MB and 256 MB. Chunk size is set at the directory level. Files inherit the chunk size settings of the directory that contains them, as do subdirectories on which chunk size has not been explicitly set. Any files written by a Hadoop application, whether via the file APIs or over NFS, use chunk size specified by the settings for the directory where the file is written. If you change a directory's chunk size settings after writing a file, the file will keep the old chunk size settings. Further writes to the file will use the file's existing chunk size. Setting Chunk Size You can set the chunk size for a given directory in two ways: Change the attribute in the file at the top level of the directory ChunkSize .dfs_attributes Use the command -setchunksize <size> <directory> hadoop mfs For example, if the volume is NFS-mounted at you can set the chunk size to 268,435,456 test /mapr/my.cluster.com/projects/test bytes by editing the file and setting . To accomplish /mapr/my.cluster.com/projects/test/.dfs_attributes ChunkSize=268435456 the same thing from the shell, use the following command: hadoop hadoop mfs -setchunksize 268435456 /mapr/my.cluster.com/projects/test Compression MapR provides compression for files stored in the cluster. Compression is applied automatically to uncompressed files unless you turn compressi . The advantages of compression are: on off Compressed data uses less bandwidth on the network than uncompressed data. Compressed data uses less disk space. This page contains the following topics: Choosing a Compression Setting Setting Compression on Files File Extensions of Compressed Files Turning Compression On or Off on Directories Setting Compression During Shuffle Choosing a Compression Setting MapR supports three different compression algorithms: lz4 (default) lzf zlib Compression algorithms can be evaluated for compression ratio (higher compression means less disk space used), compression speed and decompression speed. The following table gives a comparison for the three supported algorithms. The data is based on a single-thread, Core 2 Duo at 3 GHz. Compression Type Compression Ratio Compression Speed Decompression Speed lz4 2.084 330 MB/s 915 MB/s lzf 2.076 197 MB/s 465 MB/s zlib 3.095 14 MB/s 210 MB/s Note that compression speed depends on various factors including: block size (the smaller the block size, the faster the compression speed) single-thread vs. multi-thread system single-core vs. multi-core system the type of codec used Setting Compression on Files Compression is set at the directory level. Any files written by a Hadoop application, whether via the file APIs or over NFS, are compressed according to the settings for the directory where the file is written. 
Compression
MapR provides compression for files stored in the cluster. Compression is applied automatically to uncompressed files unless you turn compression off. The advantages of compression are:
Compressed data uses less bandwidth on the network than uncompressed data.
Compressed data uses less disk space.
This page contains the following topics:
Choosing a Compression Setting
Setting Compression on Files
File Extensions of Compressed Files
Turning Compression On or Off on Directories
Setting Compression During Shuffle
Choosing a Compression Setting
MapR supports three different compression algorithms:
lz4 (default)
lzf
zlib
Compression algorithms can be evaluated for compression ratio (higher compression means less disk space used), compression speed, and decompression speed. The following table gives a comparison for the three supported algorithms. The data is based on a single thread running on a Core 2 Duo at 3 GHz.
Compression Type | Compression Ratio | Compression Speed | Decompression Speed
lz4 | 2.084 | 330 MB/s | 915 MB/s
lzf | 2.076 | 197 MB/s | 465 MB/s
zlib | 3.095 | 14 MB/s | 210 MB/s
Note that compression speed depends on various factors including:
block size (the smaller the block size, the faster the compression speed)
single-thread vs. multi-thread system
single-core vs. multi-core system
the type of codec used
Setting Compression on Files
Compression is set at the directory level. Any files written by a Hadoop application, whether via the file APIs or over NFS, are compressed according to the settings for the directory where the file is written. Sub-directories on which compression has not been explicitly set inherit the compression settings of the directory that contains them.
If you change a directory's compression settings after writing a file, the file keeps the old compression settings---that is, if you write a file in an uncompressed directory and then turn compression on, the file does not automatically end up compressed, and vice versa. Further writes to the file use the file's existing compression setting.
Only the owner of a directory can change its compression settings or other attributes. Write permission is not sufficient.
File Extensions of Compressed Files
By default, MapR does not compress files whose filename extension indicates they are already compressed. The default list of filename extensions is as follows:
bz2, gz, lzo, snappy, tgz, tbz2, zip, z, Z, mp3, jpg, jpeg, mpg, mpeg, avi, gif, png
The list of filename extensions not to compress is stored as comma-separated values in the mapr.fs.nocompression configuration parameter, and can be modified with the config save command. Example:
maprcli config save -values '{"mapr.fs.nocompression":"bz2,gz,lzo,snappy,tgz,tbz2,zip,z,Z,mp3,jpg,jpeg,mpg,mpeg,avi,gif,png"}'
The list can be viewed with the config load command. Example:
maprcli config load -keys mapr.fs.nocompression
Turning Compression On or Off on Directories
You can turn compression on or off for a given directory in two ways:
Set the value of the Compression attribute in the .dfs_attributes file at the top level of the directory. Set Compression=lzf|lz4|zlib to turn compression on for a directory. Set Compression=false to turn compression off for a directory.
Use the command hadoop mfs -setcompression on|off|lzf|lz4|zlib <dir>.
If you choose -setcompression on without specifying an algorithm, lz4 is used by default. This algorithm has improved compression speeds for MapR's block size of 64 KB.
Example
Suppose the volume test is NFS-mounted at /mapr/my.cluster.com/projects/test. You can turn off compression by editing the file /mapr/my.cluster.com/projects/test/.dfs_attributes and setting Compression=false. To accomplish the same thing from the hadoop shell, use the following command:
hadoop mfs -setcompression off /projects/test
You can view the compression settings for directories using the hadoop mfs -ls command. For example:
# hadoop mfs -ls /
Found 23 items
vrwxr-xr-x Z - root root 13 2012-04-29 10:24 268435456 /.rw
p mapr.cluster.root writeable 2049.35.16584 -> 2049.16.2 scale-50.scale.lab:5660 scale-51.scale.lab:5660 scale-52.scale.lab:5660
vrwxr-xr-x U - root root 7 2012-04-28 22:16 67108864 /hbase
p mapr.hbase default 2049.32.16578 -> 2050.16.2 scale-50.scale.lab:5660 scale-51.scale.lab:5660 scale-52.scale.lab:5660
drwxr-xr-x Z - root root 0 2012-04-29 09:14 268435456 /tmp
p 2049.41.16596 scale-50.scale.lab:5660 scale-51.scale.lab:5660 scale-52.scale.lab:5660
vrwxr-xr-x Z - root root 1 2012-04-27 22:59 268435456 /user
p users default 2049.36.16586 -> 2055.16.2 scale-50.scale.lab:5660 scale-52.scale.lab:5660 scale-51.scale.lab:5660
drwxr-xr-x Z - root root 1 2012-04-27 22:37 268435456 /var
p 2049.33.16580 scale-50.scale.lab:5660 scale-51.scale.lab:5660 scale-52.scale.lab:5660
The symbols for the various compression settings are explained here:
Symbol | Compression Setting
Z | lz4
z | zlib
L | lzf
U | Uncompressed, or previously compressed by another algorithm
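As a further sketch, suppose a directory /projects/archive holds rarely read data (the path is illustrative); zlib trades compression speed for the best ratio of the three algorithms, and the listing symbol confirms the setting:
hadoop mfs -setcompression zlib /projects/archive
hadoop mfs -ls /projects
Files written under /projects/archive after the change are compressed with zlib; files written earlier keep their existing setting, as described above.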
Setting Compression During Shuffle
By default, MapReduce uses compression during the Shuffle phase. You can use the -Dmapreduce.maprfs.use.compression switch to turn compression off during the Shuffle phase of a MapReduce job. For example:
hadoop jar xxx.jar -Dmapreduce.maprfs.use.compression=false
Working with Data
This section contains information about working with data:
Copying Data from Apache Hadoop - using distcp to copy data to MapR from an Apache cluster
Data Protection - how to protect data from corruption or deletion
Accessing Data with NFS - how to mount the cluster via NFS
Managing Data with Volumes - using volumes to manage data
Mirror Volumes - local or remote copies of volumes
Schedules - scheduling for snapshots and mirrors
Snapshots - point-in-time images of volumes
Accessing Data with NFS
Unlike other Hadoop distributions that only allow cluster data import or export as a batch operation, MapR lets you mount the cluster itself via NFS so that your applications can read and write data directly. MapR allows direct file modification and multiple concurrent reads and writes via POSIX semantics. With an NFS-mounted cluster, you can read and write data directly with standard tools, applications, and scripts. For example, you could run a MapReduce job that outputs to a CSV file, then import the CSV file directly into SQL via NFS.
MapR exports each cluster as the directory /mapr/<cluster name> (for example, /mapr/my.cluster.com). If you create a mount point with the local path /mapr, then Hadoop FS paths and NFS paths to the cluster will be the same. This makes it easy to work on the same files via NFS and Hadoop. In a multi-cluster setting, the clusters share a single namespace, and you can see them all by mounting the top-level /mapr directory.
This page contains the following sections:
Mounting the Cluster
Mounting NFS to MapR-FS on a Cluster Node
Mounting NFS on a Linux Client
Mounting NFS on a Mac Client
Mounting NFS on a Windows Client
Mounting the cluster
To mount the cluster on Windows 7 Ultimate or Windows 7 Enterprise
To mount the cluster on other Windows versions
Mapping a network drive
To map a network drive with the Map Network Drive tool
Configuring UID and GID for NFS access
To access NFS share when system is part of Active Directory Domain
To access NFS share from a standalone system
Setting Compression and Chunk Size
See Setting up MapR NFS to set up NFS on a non-standard port.
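For instance, once the cluster is mounted as described in the sections below, ordinary shell tools work directly against cluster paths; the cluster name, volume, and file names here are illustrative:
ls /mapr/my.cluster.com/projects/test
cp results.csv /mapr/my.cluster.com/projects/test/
grep -c error /mapr/my.cluster.com/projects/test/results.csv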
Mounting the Cluster
Before you begin, make sure you know the hostname and directory of the NFS share you plan to mount. Example:
usa-node01:/mapr - for mounting from the command line
nfs://usa-node01/mapr - for mounting from the Mac Finder
Mounting NFS to MapR-FS on a Cluster Node
To automatically mount NFS to MapR-FS on the cluster my.cluster.com at the /mapr mount point, add the following line to /opt/mapr/conf/mapr_fstab:
<hostname>:/mapr/my.cluster.com /mapr hard,nolock
Every time your system is rebooted, the mount point is automatically reestablished according to the mapr_fstab configuration file. The change to /opt/mapr/conf/mapr_fstab will not take effect until the warden is restarted.
To manually mount NFS to MapR-FS at the /mapr mount point:
1. Set up a mount point for an NFS share. Example:
sudo mkdir /mapr
2. Mount the cluster via NFS. Example:
sudo mount -o nolock usa-node01:/mapr/my.cluster.com /mapr
When you mount manually from the command line, the mount point does not persist after a reboot.
Mounting NFS on a Linux Client
MapR uses version 3 of the NFS protocol. NFS version 4 bypasses the port mapper and attempts to connect to the default port only. If you are running NFS on a non-standard port, mounts from NFS version 4 clients time out. Use the -o nfsvers=3 option to specify NFS version 3.
To automatically mount when your system starts up, add an NFS mount to /etc/fstab. Example:
# device mountpoint fs-type options dump fsckorder
...
usa-node01:/mapr /mapr nfs rw 0 0
...
To manually mount NFS on a Linux client:
1. Make sure the NFS client is installed. Examples:
sudo yum install nfs-utils (Red Hat or CentOS)
sudo apt-get install nfs-common (Ubuntu)
sudo zypper install nfs-client (SUSE)
2. List the NFS shares exported on the server. Example:
showmount -e usa-node01
3. Set up a mount point for an NFS share. Example:
sudo mkdir /mapr
4. Mount the cluster via NFS. Example:
sudo mount -o nolock usa-node01:/mapr /mapr
Mounting NFS on a Mac Client
To mount the cluster manually from the command line:
1. Open a terminal (one way is to click Launchpad > Open terminal).
2. At the command line, enter the following command to become the root user:
sudo bash
3. List the NFS shares exported on the server. Example:
showmount -e usa-node01
4. Set up a mount point for an NFS share. Example:
sudo mkdir /mapr
5. Mount the cluster via NFS. Example:
sudo mount -o nolock usa-node01:/mapr /mapr
6. List all mounted filesystems to verify that the cluster is mounted:
mount
Mounting NFS on a Windows Client
Setting up the Windows NFS client requires you to mount the cluster and configure the user ID (UID) and group ID (GID) correctly, as described in the sections below. In all cases, the Windows client must access NFS using a valid UID and GID from the Linux domain. Mismatched UID or GID will result in permissions problems when MapReduce jobs try to access files that were copied from Windows over an NFS share.
The mount point does not persist after reboot when you mount manually from the command line.
Because of Windows directory caching, there may appear to be no .snapshot directory in each volume's root directory. To work around the problem, force Windows to re-load the volume's root directory by updating its modification time (for example, by creating an empty file or directory in the volume's root directory).
With Windows NFS clients, use the -o nolock option on the NFS server to prevent the Linux NLM from registering with the portmapper. The native Linux NLM conflicts with the MapR NFS server.
Mounting the cluster
To mount the cluster on Windows 7 Ultimate or Windows 7 Enterprise
1. Open Start > Control Panel > Programs.
2. Select Turn Windows features on or off.
3. Select Services for NFS.
4. Click OK.
5. Mount the cluster and map it to a drive using the Map Network Drive tool or from the command line. Example:
mount -o nolock usa-node01:/mapr z:
To mount the cluster on other Windows versions
1. Download and install Microsoft Windows Services for Unix (SFU). You only need to install the NFS Client and the User Name Mapping.
2. Configure the user authentication in SFU to match the authentication used by the cluster (LDAP or operating system users). You can map local Windows users to cluster Linux users, if desired.
3. Once SFU is installed and configured, mount the cluster and map it to a drive using the Map Network Drive tool or from the command line. Example:
mount -o nolock usa-node01:/mapr z:
Mapping a network drive
To map a network drive with the Map Network Drive tool
1. Open Start > My Computer.
2. Select Tools > Map Network Drive.
3. In the Map Network Drive window, choose an unused drive letter from the Drive drop-down list.
4. Specify the Folder by browsing for the MapR cluster, or by typing the hostname and directory into the text field.
5. Browse for the MapR cluster or type the name of the folder to map. This name must follow UNC. Alternatively, click the Browse… button to find the correct folder by browsing available network shares.
6. Select Reconnect at login to reconnect automatically to the MapR cluster whenever you log into the computer.
7. Click Finish.
Configuring UID and GID for NFS access
To access an NFS share when the system is part of an Active Directory Domain
You must instruct the NFS client to access an AD server to get uidNumber and gidNumber. At a high level, the process is as follows:
1. Ensure the AD Users schema has the auxiliary class posixAccount.
2. Populate the AD uidNumber and gidNumber fields with the matching uid and gid from Linux.
3. Configure the NFS client to look up uid and gid in the AD DS store.
Refer to details here: http://technet.microsoft.com/en-us/library/hh509016(v=ws.10).aspx
To access an NFS share from a standalone system
For a standalone Windows 7 or Vista machine (not using Active Directory), Windows always uses its configured Anonymous UID and GID for NFS access, which by default are -2. However, you can configure Windows to use specific values, which results in being able to access NFS using those values. The UID and GID values are set in the Windows Registry and are global on the Windows NFS client box. This solution might not work well if your Windows box has multiple users who each need access to NFS with their own permissions, but there is no obvious way to avoid this limitation.
The values are stored in the registry path HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ClientForNFS\CurrentVersion\Default. The two DWORD values are AnonymousUid and AnonymousGid. If they do not exist, you must create them.
Refer to details here: http://blogs.msdn.com/b/sfu/archive/2009/03/27/can-i-set-up-user-name-mapping-in-windows-vista.aspx
Setting Compression and Chunk Size
Each directory in MapR storage contains a hidden file called .dfs_attributes that controls compression and chunk size. To change these attributes, change the corresponding values in the file. Example:
# lines beginning with # are treated as comments
Compression=lz4
ChunkSize=268435456
Valid values:
Compression: lz4, lzf, zlib, or false
Chunk size (in bytes): a multiple of 65,536 (64 KB) or zero (no chunks). Example: 131072
You can also set compression and chunk size using the hadoop mfs command.
By default, MapR does not compress files whose filename extension indicates they are already compressed. The default list of filename extensions is as follows:
bz2, gz, lzo, snappy, tgz, tbz2, zip, z, Z, mp3, jpg, jpeg, mpg, mpeg, avi, gif, png
The list of filename extensions not to compress is stored as comma-separated values in the mapr.fs.nocompression configuration parameter, and can be modified with the config save command. Example:
maprcli config save -values '{"mapr.fs.nocompression":"bz2,gz,lzo,snappy,tgz,tbz2,zip,z,Z,mp3,jpg,jpeg,mpg,mpeg,avi,gif,png"}'
The list can be viewed with the config load command.
Example: config load maprcli config load -keys mapr.fs.nocompression Copying Data from Apache Hadoop There are three ways to copy data from an Apache Hadoop cluster based on the Hadoop Distributed Filesystem (HDFS) to a MapR cluster: If the HDFS cluster uses the same version of the RPC protocol that MapR uses (currently version 4), use normally, as described distcp below. If you are copying very small amounts of data, use . hftp If the HDFS cluster and the MapR cluster do not use the same version of the RPC protocol, or if for some other reason the above steps do not work, you can data from the HDFS cluster. push The sections below describe each method. 1. 2. 3. To copy data from HDFS to MapR using distcp To perform this operation, you need the following information: <NameNode> - the IP address or hostname of the NameNode in the HDFS cluster <NameNode Port> - the port for connecting to the NameNode in the HDFS cluster <HDFS path> - the path to the HDFS directory from which you plan to copy data <MapR-FS path> - the path in the MapR cluster to which you plan to copy HDFS data <file> - a file in the HDFS path Perform the following steps: From a node in the MapR cluster, try to determine whether the MapR cluster can successfully communicate with the hadoop fs -ls HDFS cluster: hadoop fs -ls <NameNode IP>:<NameNode port>/<path> For example, using the default NameNode port for HDFS access: hadoop fs -ls hdfs://nn1:8020/user/sara If the command is successful, try to determine whether the MapR cluster can read file contents hadoop fs -ls hadoop fs -cat from the specified path on the HDFS cluster: hadoop fs -cat <NameNode IP>:<NameNode port>/<HDFS path>/<file> If you are able to communicate with the HDFS cluster and read file contents, use to copy data from the HDFS cluster to the distcp MapR cluster: hadoop distcp hdfs://<NameNode>:<NameNode Port>/<HDFS path> maprfs://<MapR-FS path> For example, using the default NameNode port for HDFS access: hadoop distcp hdfs://nn1:8020/user/sara maprfs:///user/sara Note that the triple slashes in ' ' are not a misprint. maprfs:///... To copy data from HDFS to MapR using HFTP To perform this operation, you need the following information: <NameNode> - the IP address or hostname of the NameNode in the HDFS cluster <NameNode HTTP Port> - the HTTP port on the NameNode in the HDFS cluster <HDFS path> - the path to the HDFS directory from which you plan to copy data <MapR-FS path> - the path in the MapR cluster to which you plan to copy HDFS data Execute the following command on the destination cluster, using over HFTP to copy files: distcp hadoop distcp hftp://<NameNode IP>:<NameNode HTTP Port>/<HDFS path> maprfs://<MapR-FS path> 1. 2. 3. 4. 5. 6. 7. 8. For example, using the default HTTP port on the NameNode: hadoop distcp hftp://nn2:50070/user/lohit maprfs:///user/lohit Note that the triple slashes in ' ' are not a misprint. maprfs:///... To push data from an HDFS cluster Perform the following steps from a MapR client or node (any computer that has either or installed). For more mapr-core mapr-client information about setting up a MapR client, see . Setting Up the Client To perform this operation, you need the following information: <input path> - the HDFS path to the source data <output path> - the MapR-FS path to the target directory <MapR CLDB IP> - the IP address of the master CLDB node on the MapR cluster Log in as the user (or use for the following commands). root sudo Create the directory on the Apache Hadoop JobClient node. 
/tmp/maprfs-client/ Copy the following files from a MapR client or any MapR node to the directory: /tmp/maprfs-client/ /opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-0.1.jar, /opt/mapr/hadoop/hadoop-0.20.2/lib/zookeeper-3.3.2.jar /opt/mapr/hadoop/hadoop-0.20.2/lib/native/Linux-amd64-64/libMapRClient.so Install the files in the correct places on the Apache Hadoop JobClient node: cp /tmp/maprfs-client/maprfs-0.1.jar $HADOOP_HOME/lib/. cp /tmp/maprfs-client/zookeeper-3.3.2.jar $HADOOP_HOME/lib/. cp /tmp/maprfs-client/libMapRClient.so $HADOOP_HOME/lib/native/Linux-amd64-64/libMapRClient.so If you are on a 32-bit client, use in place of above. Linux-i386-32 Linux-amd64-64 If the JobTracker is a different node from the JobClient node, copy and install the files to the JobTracker node as well using the above steps. On the JobTracker node, set in . fs.maprfs.impl=com.mapr.fs.MapRFileSystem $HADOOP_HOME/conf/core-site.xml Restart the JobTracker. You can now copy data to the MapR cluster by running on the JobClient node of the Apache Hadoop cluster. Example: distcp ./bin/hadoop distcp -Dfs.maprfs.impl=com.mapr.fs.MapRFileSystem -libjars /tmp/maprfs-client/maprfs-0.1.jar,/tmp/maprfs-client/zookeeper-3.3.2.jar -files /tmp/maprfs-client/libMapRClient.so <input path> maprfs://<MapR CLDB IP>:7222/<output path> Provisioning Applications Provisioning a new application involves meeting the business goals of performance, continuity, and security while providing necessary resources to a client, department, or project. You'll want to know how much disk space is needed, and what the priorities are in terms of performance and reliability. Once you have gathered all the requirements, you will create a volume to manage the application data. A volume provides convenient control over data placement, performance, protection, and policy for an entire data set. Make sure the cluster has the storage and processing capacity for the application. You'll need to take into account the starting and predicted size of the data, the performance and protection requirements, and the memory required to run all the processes required on each node. Here is the information to gather before beginning: Access How often will the data be read and written? What is the ratio of reads to writes? Continuity What is the desired (RPO)? recovery point objective What is the desired (RTO)? recovery time objective Performance Is the data static, or will it change frequently? Is the goal data storage or data processing? Size How much data capacity is required to start? What is the predicted growth of the data? The considerations in the above table will determine the best way to set up a volume for the application. About Volumes Volumes provide a number of ways to help you meet the performance, access, and continuity goals of an application, while managing application data size: Mirroring - create read-only copies of the data for highly accessed data or multi-datacenter access Permissions - allow users and groups to perform specific actions on a volume Quotas - monitor and manage the data size by project, department, or user Replication - maintain multiple synchronized copies of data for high availability and failure protection Snapshots - create a real-time point-in-time data image to enable rollback Topology - place data on a high-performance rack or limit data to a particular set of machines See . Managing Data with Volumes Mirroring Mirroring means creating , full physical read-only copies of normal volumes for fault tolerance and high performance. 
When you mirror volumes create a mirror volume, you specify a source volume from which to copy data, and you can also specify a schedule to automate re-synchronization of the data to keep the mirror up-to-date. After a mirror is initially copied, the synchronization process saves bandwidth and reads on the source volume by transferring only the deltas needed to bring the mirror volume to the same state as its source volume. A mirror volume need not be on the same cluster as its source volume; MapR can sync data on another cluster (as long as it is reachable over the network). When creating multiple mirrors, you can further reduce the mirroring bandwidth overhead by daisy-chaining the mirrors. That is, set the source volume of the first mirror to the original volume, the source volume of the second mirror to the first mirror, and so on. Each mirror is a full copy of the volume, so remember to take the number of mirrors into account when planning application data size. See . Mirrors Permissions MapR provides fine-grained control over which users and groups can perform specific tasks on volumes and clusters. When you create a volume, keep in mind which users or groups should have these types of access to the volume. You may want to create a specific group to associate with a project or department, then add users to the group so that you can apply permissions to them all at the same time. See . Managing Permissions Quotas You can use quotas to limit the amount of disk space an application can use. There are two types of quotas: User/Group quotas limit the amount of disk space available to a user or group Volume quotas limit the amount of disk space available to a volume When the data owned by a user, group, or volume exceeds the quota, MapR prevents further writes until either the data size falls below the quota again, or the quota is raised to accommodate the data. Volumes, users, and groups can also be assigned . An advisory quota does not limit the disk space available, but raises an alarm advisory quotas and sends a notification when the space used exceeds a certain point. When you set a quota, you can use a slightly lower advisory quota as a warning that the data is about to exceed the quota, preventing further writes. Remember that volume quotas do not take into account disk space used by sub-volumes (because volume paths are logical, not physical). You can set a User/Group quota to manage and track the disk space used by an (a department, project, or application): accounting entity Create a group to represent the accounting entity. Create one or more volumes and use the group as the Accounting Entity for each. Set a User/Group quota for the group. Add the appropriate users to the group. When a user writes to one of the volumes associated with the group, any data written counts against the group's quota. Any writes to volumes not associated with the group are not counted toward the group's quota. See . Managing Quotas Replication When you create a volume, you can choose a to safeguard important data. The factor defines the number of replication factor desired replication replicas that is the standard for your cluster. The factor is a threshold below which your cluster aggressively replicates the minimum replication volume until enough replicas are created. MapR manages the replication automatically, raising an alarm and notification if replication falls below the desired level you have set. A volume's replica is a full copy of the volume. 
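For reference, desired and minimum replication can also be adjusted from the command line with the maprcli volume commands; this is a hedged sketch assuming an existing volume named projects, with illustrative values:
maprcli volume modify -name projects -replication 3 -minreplication 2
The same replication parameters can also be supplied when the volume is first created.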
Consider space requirements for replicas when planning application data size. Snapshots A snapshot is an instant image of a volume at a particular point in time. Snapshots take no time to create, because they only record changes to data over time rather than the data itself. You can manually create a snapshot to enable rollback to a particular known data state, or schedule periodic automatic snapshots to ensure a specific (RPO). You can use snapshots and mirrors to achieve a near-zero recovery point objective reco (RTO). Snapshots store only the deltas between a volume's current state and its state when the snapshot is taken. Initially, very time objective snapshots take no space on disk, but they can grow arbitrarily as a volume's data changes. When planning application data size, take into account how much the data is likely to change, and how often snapshots will be taken. See . Snapshots Topology You can restrict a volume to a particular rack by setting its physical topology attribute. This is useful for placing an application's data on a high-performance rack (for critical applications) or a low-performance rack (to keep it out of the way of critical applications). See Setting Volume . Topology Scenarios Here are a few ways to configure the application volume based on different types of data. If the application requires more than one type of data, you can set up multiple volumes. Data Type Strategy Important Data High replication factor Frequent snapshots to minimize RPO and RTO Mirroring in a remote cluster Highly Acccessed Data High replication factor Mirroring for high-performance reads Topology: data placement on high-performance machines Scratch data No snapshots, mirrors, or replication Topology: data placement on low-performance machines Static data Mirroring and replication set by performance and availability requirements One snapshot (to protect against accidental changes) Volume set to read-only The following documents provide examples of different ways to provision an application to meet business goals: Provisioning for Capacity Provisioning for Performance Setting Up the Application Once you know the course of action to take based on the application's data and performance needs, you can use the MapR Control System to set up the application. Creating a Group and a Volume 1. 2. 3. 4. 5. a. b. c. 6. a. b. 7. 8. Setting Up Mirroring Setting Up Snapshots Setting Up User or Group Quotas Creating a Group and a Volume Create a group and a volume for the application. If you already have a snapshot schedule prepared, you can apply it to the volume at creation time. Otherwise, use the procedure in below, after you have created the volume. Setting Up Snapshots Setting Up Mirroring If you want the mirror to sync automatically, use the procedure in to create a schedule. Creating a Schedule Use the procedure in to create a mirror volume. Make sure to set the following fields: Creating a Volume Volume Type - Mirror Volume Source Volume - the volume you created for the application Responsible Group/User - in most cases, the same as for the source volume Setting Up Snapshots To set up automatic snapshots for the volume, use the procedure in . Scheduling a Snapshot Provisioning for Capacity You can easily provision a volume for maximum data storage capacity by setting a low replication factor, setting hard and advisory quotas, and tracking storage use by users, groups, and volumes. You can also set permissions to limit who can write data to the volume. 
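The procedure below uses the MapR Control System; as a rough command-line counterpart, this sketch creates such a volume with maprcli, where the volume name, mount path, accounting group, and quota values are all illustrative assumptions:
maprcli volume create -name projectdata -path /projects/projectdata -replication 1 -quota 1T -advisoryquota 900G -ae projectgroup -aetype 1
Here -ae names the accounting entity and -aetype 1 marks it as a group rather than a user.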
The replication factor determines how many complete copies of a volume are stored in the cluster. The actual storage requirement for a volume is the volume size multiplied by its replication factor. To maximize storage capacity, set the replication factor on the volume to 1 at the time you create the volume. Volume quotas and user or group quotas limit the amount of data that can be written by a user or group, or the maximum size of a specific volume. When the data size exceeds the advisory quota, MapR raises an alarm and notification but does not prevent additional data writes. Once the data exceeds the hard quota, no further writes are allowed for the volume, user, or group. The advisory quota is generally somewhat lower than the hard quota, to provide advance warning that the data is in danger of exceeding the hard quota. For a high-capacity volume, the volume quotas should be as large as possible. You can use the advisory quota to warn you when the volume is approaching its maximum size. To use the volume capacity wisely, you can limit write access to a particular user or group. Create a new user or group on all nodes in the cluster. In this scenario, storage capacity takes precedence over high performance and data recovery; to maximize data storage, there will be no snapshots or mirrors set up in the cluster. A low replication factor means that the data is less effectively protected against loss in the event that disks or nodes fail. Because of these tradeoffs, this strategy is most suitable for risk-tolerant large data sets, and should not be used for data with stringent protection, recovery, or performance requirements. To create a high-capacity volume: Set up a user or group that will be responsible for the volume. For more information, see . Users & Groups In the MapR Control System, open the MapR-FS group and click to display the view. Volumes Volumes Click the button to display the dialog. New Volume New Volume In the pane, set the volume name and mount path. Volume Setup In the pane: Usage Tracking In the section, select or and enter the user or group responsible for the volume. Group/User User Group In the section, check and enter the maximum capacity of the volume, based on the storage capacity of Quotas Volume Quota your cluster. Example: 1 TB Check and enter a lower number than the volume quota, to serve as advance warning when the data Volume Advisory Quota approaches the hard quota. Example: 900 GB In the pane: Replication & Snapshot Scheduling Set to . Replication 1 Do not select a snapshot schedule. Click OK to create the volume. Set the volume permissions on the volume via NFS or using . You can limit writes to root and the responsible user or group. hadoop fs See for more information. Managing Data with Volumes Provisioning for Performance 1. 2. 3. 4. 1. 2. 3. 4. 1. 2. 3. 4. 5. 6. 1. 2. 3. 4. 5. 6. You can provision a high-performance volume by creating multiple mirrors of the data and defining volume topology to control data placement: store the data on your fastest servers (for example, servers that use SSDs instead of hard disks). When you create mirrors of a volume, make sure your application load-balances reads across the mirrors to increase performance. Each mirror is an actual volume, so you can control data placement and replication on each mirror independently. The most efficient way to create multiple mirrors is to cascade them rather than creating all the mirrors from the same source volume. 
Create the first mirror from the original volume, then create the second mirror using the first mirror as the source volume, and so on. You can mirror the volume within the same cluster or to another cluster, possibly in a different datacenter. You can set node topology paths to specify the physical locations of nodes in the cluster, and volume topology paths to limit volumes to specific nodes or racks. To set node topology: Use the following steps to create a rack path representing the high-performance nodes in your cluster. In the MapR Control System, open the MapR-FS group and click to display the view. Nodes Nodes Click the checkboxes next to the high-performance nodes. Click the button to display the dialog. Change Topology Change Node Topology In the Change Node Topology dialog, type a path to represent the high-performance rack. For example, if the cluster name is cluster1 and the high-performance nodes make up rack 14, type . /cluster1/rack14 To set up the source volume: In the MapR Control System, open the MapR-FS group and click to display the view. Volumes Volumes Click the button to display the dialog. New Volume New Volume In the pane, set the volume name and mount path normally. Volume Setup Set the to limit the volume to the high-performance rack. Example: Topology /default/rack14 To Set Up the First Mirror In the MapR Control System, open the MapR-FS group and click to display the view. Volumes Volumes Click the button to display the dialog. New Volume New Volume In the pane, set the volume name and mount path normally. Volume Setup Choose . Local Mirror Volume Set the to the original volume name. Example: Source Volume Name original-volume Set the to a different rack from the source volume.  Topology To Set Up Subsequent Mirrors In the MapR Control System, open the MapR-FS group and click to display the view. Volumes Volumes Click the button to display the dialog. New Volume New Volume In the pane, set the volume name and mount path normally. Volume Setup Choose . Local Mirror Volume Set the to the previous mirror volume name. Example: Source Volume Name mirror1 Set the to a different rack from the source volume and the other mirror. Topology See for more information. Managing Data with Volumes MapR Metrics and Job Performance The MapR Metrics service collects and displays detailed about the tasks and task attempts that comprise your Hadoop job. You can use analytics the to display charts based on those analytics and diagnose performance issues with a particular job. MapR Control System View this video for an introduction to Job Metrics... The MapR Control System presents the jobs running on your cluster and the tasks that make up a specific job as a sortable list, along with histograms and line charts that represent the distribution of a particular metric. You can sort the list by the metric you're interested in to quickly find any outliers, then display specific detailed information about a job or task attempt that you want to learn more about. The filtering capabilities of the MapR Control System enable you to narrow down the display of data to the ranges you're interested in. For example, if a job lists 100% map task completion and 99% reduce task completion, you can filter the views in the MapR Control System to list only reduce tasks. 
Once you have a list of your job's reduce tasks, you can sort the list by duration to see if any reduce task attempts are taking an abnormally long time to execute, then display detailed information about those task attempts, including log files for those task attempts. You can also use the Metrics displays to gauge performance. Consider two different jobs that perform the same function. One job is written in Python using , and the other job is written in C++ using . To evaluate how these jobs perform on the cluster, you can open two pydoop Pipes browser windows logged into the MapR Control System and filter the display down to the metrics you're most interested in while the jobs are running. Maven Repository and Artifacts for MapR You can use Maven for dependency management when developing applications based on the MapR distribution for Apache Hadoop. MapR's Maven repository is located at . You can also the repository through Nexus. http://repository.mapr.com/maven/ browse The following POM file enables access to MapR's Maven repository: <repositories> <repository> <id>mapr-releases</id> <url>http://repository.mapr.com/maven/</url> <snapshots><enabled>false</enabled></snapshots> <releases><enabled>true</enabled></releases> </repository> </repositories> The table below lists the group ID, artifact ID, version, and name for all artifacts published by MapR. groupId artifactId version artifact org.apache.hadoop hadoop-core 1.0.3-mapr-3.0.1 hadoop-core-1.0.3-mapr-3.0.1.jar com.mapr mapr-root 1.0 mapr-root-1.0.pom com.mapr.fs mapr-hbase 1.0.3-mapr-3.0.1 mapr-hbase-1.0.3-mapr-3.0.1.jar com.mapr.util central-logging 1.0.3-mapr-3.0.1 central-logging-1.0.3-mapr-3.0.1. jar com.mapr.hadoop maprfs-parent 1.0.3-mapr-3.0.1 maprfs-parent-1.0.3-mapr-3.0.1. pom com.mapr.hadoop maprfs 1.0.3-mapr-3.0.1 maprfs-1.0.3-mapr-3.0.1.pom org.apache.mapreduce fair-scheduler 1.0.3-mapr-3.0.1 fair-scheduler-1.0.3-mapr-3.0.1.j ar org.apache.mapreduce capacity-scheduler 1.0.3-mapr-3.0.1 capacity-scheduler-1.0.3-mapr-3 .0.1.jar org.apache.hive hive-shims 0.10.0-mapr hive-shims-0.10.0-mapr.jar org.apache.hive hive-service 0.10.0-mapr hive-service-0.10.0-mapr.jar org.apache.hive hive-serde 0.10.0-mapr hive-serde-0.10.0-mapr.jar org.apache.hive hive-pdk 0.10.0-mapr hive-pdk-0.10.0-mapr.jar org.apache.hive hive-metastore 0.10.0-mapr hive-metastore-0.10.0-mapr.jar org.apache.hive hive-jdbc 0.10.0-mapr hive-jdbc-0.10.0-mapr.jar org.apache.hive hive-hwi 0.10.0-mapr hive-hwi-0.10.0-mapr.jar org.apache.hive hive-hbase-handler 0.10.0-mapr hive-hbase-handler-0.10.0-mapr. 
jar org.apache.hive hive-exec 0.10.0-mapr hive-exec-0.10.0-mapr.jar org.apache.hive hive-contrib 0.10.0-mapr hive-contrib-0.10.0-mapr.jar org.apache.hive hive-common 0.10.0-mapr hive-common-0.10.0-mapr.jar org.apache.hive hive-cli 0.10.0-mapr hive-cli-0.10.0-mapr.jar org.apache.hive hive-builtins 0.10.0-mapr hive-builtins-0.10.0-mapr.jar org.apache.hive hive-anttasks 0.10.0-mapr hive-anttasks-0.10.0-mapr.jar org.apache.hbase hbase 0.94.9-mapr-1308 hbase-0.94.9-mapr-1308.jar org.apache.hbase hbase 0.94.9-mapr-1308 hbase-0.94.9-mapr-1308-tests.ja r org.apache.hbase hbase 0.94.9-mapr-1308 hbase-0.94.9-mapr-1308.jar org.apache.hbase hbase 0.92.2-mapr-1308 hbase-0.92.2-mapr-1308.jar org.apache.hbase hbase 0.92.2-mapr-1308 hbase-0.92.2-mapr-1308-tests.ja r org.apache.hbase hbase 0.92.2-mapr-1308 hbase-0.92.2-mapr-1308-source s.jar org.apache.hadoop s3filesystem 1.0.3-mapr-3.0.1 s3filesystem-1.0.3-mapr-3.0.1.jar org.apache.mahout mahout 0.7-mapr mahout-core-0.7-mapr.jar org.apache.oozie oozie 3.3.0-mapr-1308 oozie-3.3.0-mapr.jar org.apache.oozie oozie 3.3.2-mapr-1309 oozie-3.3.2-mapr.jar org.hbase asynchbase 1.4.1-mapr asynchbase-1.4.1-mapr.jar Working with Cascading Cascading™ is a Java application framework produced by that enables developers Concurrent, Inc. to quickly and easily build rich enterprise-grade Data Processing and Machine Learning applications that can be deployed and managed across private or cloud-based Hadoop clusters. This section contains documentation on working with Cascading on the MapR distribution for Apache Hadoop. You can refer also to documentation available from the Cascading project on the . This section provides all relevant details about using Cascading with MapR, but Concurrent website does not duplicate documentation available from Concurrent, Inc. To install Cascading, see the section of the Administration Guide. Cascading Topics in This Section Upgrading Cascading Related Links Cascading project at Concurrent, Inc. MapR Forum posts related to Cascading Search the MapR Blog for Cascading topics Upgrading Cascading This page contains the following topics describing how to upgrade Cascading in the MapR distribution for Apache Hadoop: Update Repositories or Download Packages Migrating Configuration Files Version-Specific Considerations Upgrading the Software Before you upgrade, make sure that the version of the MapR core software on your cluster supports the version of Cascading you want to 1. 2. 3. upgrade to. See the . Cascading Release Notes Update Repositories or Download Packages MapR's and repositories always contain the Cascading version recommended for the latest release of the MapR core. The repositories rpm deb are located at . You can also prepare a local repository with any version of Cascading http://package.mapr.com/releases/ecosystem/ you need. For more details on setting up repositories, see . Preparing Packages and Repositories If you don't want to install from a repository, you can download the package file for the specific release you want and install it manually. Individual package files are located at . http://package.mapr.com/releases/ecosystem-all/ To update the repository cache If you plan to install from a repository, update the repository cache on each node where Cascading is installed. On RedHat and CentOS yum clean all On Ubuntu apt-get update Migrating Configuration Files If you have changed configuration properties on your current installation of Cascading, you probably want to apply those changes to the updated version. 
Configuration properties are located in . /opt/mapr/cascading/cascading-<version>/conf/ In general, you can migrate your configuration changes with the following procedure: Before upgrade, save configuration files on all nodes where Cascading is installed. Upgrade Cascading software. Migrate custom configuration settings into the new default files in the directory. conf Version-Specific Considerations Before you upgrade the software, note if there are any version-specific considerations that apply to you. There are no known version-specific considerations at this time. Upgrading the Software Use one of the following methods to upgrade the Cascading component: To upgrade with a package manager To keep a prior version and install a newer version To upgrade with a package manager After configuring repositories so that the version you want to install is available, you can use a package manager to install from the repository. On RedHat and CentOS yum upgrade mapr-cascading On Ubuntu apt-get install mapr-cascading 1. 2. To keep a prior version and install a newer version Cascading installs into separate directories named after the version, such as , so the files /opt/mapr/cascading/cascading-<version>/ for multiple versions can co-exist. To keep the prior version when installing a new version, you must manually install the package file for the new version. For example, to install version 2.1 build 18380 while keeping any previously installed version, perform the steps below. On RedHat and CentOS Download the RPM package file for version 2.1 from mapr-cascading http://package.mapr.com/releases/ecosystem-all/ . Install the package with . rpm rpm -i --force mapr-cascading-2.1.20130226.18380-1.noarch.rpm On Ubuntu This process is not supported on Ubuntu, because and cannot manage multiple versions of a package with the same name. apt-get dpkg Working with Flume Apache Flume™ is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application. This section contains documentation on working with Flume on the MapR distribution for Apache Hadoop. You can refer also to documentation available from the . This section provides all relevant details about using Flume with MapR, but does not duplicate Apache Apache Flume project documentation. To install Flume, see the section of the Administration Guide. Flume Topics in This Section Upgrading Flume Related Links Apache Flume project MapR Forum posts related to Flume Search the MapR Blog for Flume topics Upgrading Flume This page contains the following topics describing how to upgrade Flume in the MapR distribution for Apache Hadoop: Update Repositories or Download Packages Migrating Configuration Files Version-Specific Considerations 1. 2. 3. Upgrading the Software Before you upgrade, make sure that the version of the MapR core software on your cluster supports the version of Flume you want to upgrade to. See the . Flume Release Notes Update Repositories or Download Packages MapR's and repositories always contain the Flume version recommended for the latest release of the MapR core. The repositories are rpm deb located at . 
You can also prepare a local repository with any version of Flume you http://package.mapr.com/releases/ecosystem/ need. For more details on setting up repositories, see . Preparing Packages and Repositories If you don't want to install from a repository, you can download the package file for the specific release you want and install it manually. Individual package files are located at . http://package.mapr.com/releases/ecosystem-all/ To update the repository cache If you plan to install from a repository, update the repository cache on each node where Flume is installed. On RedHat and CentOS yum clean all On Ubuntu apt-get update Migrating Configuration Files If you have changed configuration properties on your current installation of Flume, you probably want to apply those changes to the updated version. Configuration properties are located in . /opt/mapr/flume/flume-<version>/conf/ In general, you can migrate your configuration changes with the following procedure: Before upgrade, save configuration files on all nodes where Flume is installed. Upgrade Flume software. Migrate custom configuration settings into the new default files in the directory. conf Version-Specific Considerations Before you upgrade the software, note if there are any version-specific considerations that apply to you. Packaging changes between Flume releases 0.9.4 and 1.2.0 Packaging changes between Flume releases 0.9.4 and 1.2.0 The following points apply when upgrading Flume from 0.9.4 (or earlier) to 1.2.0 (or later). MapR did not distribute any releases of Flume between these two versions. MapR's file naming convention changed between these releases, which caused a special case when upgrading. Because of a reversal in the alphanumeric order of the filenames, package managers incorrectly perceive the newer version to be a downgrade. When upgrading the software, package management tools might require you to specify a particular version, rather than automatically upgrading to the latest version. Flume release 1.2.0 (and onward) is stored in a separate repository than the MapR core software. This release corresponded to the release of MapR core v2.0. Prior to v2.0, Flume packages were located in the same repository with the MapR core. From MapR v2.0 onward, Flume packages are located in a separate repository, which requires some consideration in setting up repositories. See Installing for details. MapR Software MapR packaged Flume 0.9.4 (and earlier) as two separate packages, and . Starting with Flume mapr-flume mapr-flume-internal release 1.2.0, MapR packages Flume as one package . When upgrading from 0.9.4 (or earlier), if you upgrade only the mapr-flume map package, the package manager will leave files in place. You have to explicitly uninstall r-flume mapr-flume-internal mapr-flume to clean the older version from the node. -internal 1. Upgrading the Software Use one of the following methods to upgrade the Flume component: To upgrade with a package manager To manually remove a prior version and install the latest version in the repository To keep a prior version and install a newer version To upgrade with a package manager After configuring repositories so that the version you want to install is available, you can use a package manager to install from the repository. On RedHat and CentOS yum upgrade mapr-flume On Ubuntu apt-get install mapr-flume If you are upgrading from Flume 0.9.4, you might have to specify the particular version you want to upgrade to, because of #Version-Specific . 
Considerations To manually remove a prior version and install the latest version in the repository If you are upgrading from Flume 0.9.4, this process might be necessary to remove the package which is no longer part mapr-flume-internal of MapR's Flume release 1.2.0 and onward. Run the package manager twice, first to remove the old version, and again to install the new version. For example, to upgrade from version 0.7.1 to version 0.10.0, perform the steps below. In this case, we assume the repository is set up on the node and 0.10.0 is the latest version in the repository. On RedHat and CentOS yum remove mapr-flume mapr-flume-internal yum install mapr-flume On Ubuntu apt-get remove mapr-flume mapr-flume-internal apt-get install mapr-flume To keep a prior version and install a newer version Flume installs into separate directories named after the version, such as , so the files for multiple /opt/mapr/flume/flume-<version>/ versions can co-exist. To keep the prior version when installing a new version, you must manually install the package file for the new version. For example, to install version 1.3.1 build 18380 while keeping any previously installed version, perform the steps below. On RedHat and CentOS Download the RPM package file for version 1.3.1 from mapr-flume http://package.mapr.com/releases/ecosystem-all/ Copy custom configuration files in to a safe location before proceeding. /opt/mapr/flume/flume-<version>/conf 1. 2. . Install the package with . rpm rpm -i --force mapr-flume-1.3.1.18380-GA.noarch.rpm On Ubuntu This process is not supported on Ubuntu, because and cannot manage multiple versions of a package with the same name. apt-get dpkg Working with HBase Apache HBase™ is the Hadoop database, a distributed, scalable, big data store. You can use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables – billions of rows X millions of columns – atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and Hadoop-compatible filesystems, such as the MapR-FS. This section contains documentation on working with HBase on the MapR distribution for Apache Hadoop. You can refer also to documentation available from the . This section Apache HBase project provides all relevant details about using HBase with MapR, but does not duplicate Apache documentation. To install HBase, see the section of the Administration Guide. HBase Topics in This Section HBase Best Practices Upgrading HBase Enabling HBase Access Control Related Links Apache HBase Reference Guide Apache HBase project MapR Forum posts related to HBase Search the MapR Blog for HBase topics HBase Best Practices The HBase write-ahead log (WAL) writes many tiny records, and compressing it would cause massive CPU load. Before using HBase, turn off MapR compression for directories in the HBase volume (normally mounted at . Example: /hbase hadoop mfs -setcompression off /hbase You can check whether compression is turned off in a directory or mounted volume by using to list the file contents. hadoop mfs Example: hadoop mfs -ls /hbase The letter in the output indicates compression is turned on; the letter indicates compression is turned off. 
See for more Z U hadoop mfs information. The MapR filesystem provides native storage for table data, compatible with the HBase API. For new applications, consider using MapR tables for increased performance, more versatile table operations, and easier cluster administration. for more Click here information on table storage available in MapR M7 Edition. 1. 2. 3. On any node where you plan to run both HBase and MapReduce, give more memory to the FileServer than to the RegionServer so that the node can handle high throughput. For example, on a node with 24 GB of physical memory, it might be desirable to limit the RegionServer to 4 GB, give 10 GB to MapR-FS, and give the remainder to TaskTracker. To change the memory allocated to each service, edit the file. See for more information. /opt/mapr/conf/warden.conf Tuning Your MapR Install Upgrading HBase This page contains the following topics describing how to upgrade HBase in the MapR distribution for Apache Hadoop: Update Repositories or Download Packages Migrating Configuration Files Planning for Upgrade Version-Specific Considerations Upgrading the Software Configure the Cluster for the New Version Before you upgrade, make sure that the version of the MapR core software on your cluster supports the version of HBase you want to upgrade to. See the . HBase Release Notes Update Repositories or Download Packages MapR's and repositories always contain the HBase version recommended for the latest release of the MapR core. The repositories are rpm deb located at . You can also prepare a local repository with any version of HBase you http://package.mapr.com/releases/ecosystem/ need. For more details on setting up repositories, see . Preparing Packages and Repositories If you don't want to install from a repository, you can download the package file for the specific release you want and install it manually. Individual package files are located at . http://package.mapr.com/releases/ecosystem-all/ To update the repository cache If you plan to install from a repository, update the repository cache on each node where HBase is installed. On RedHat and CentOS yum clean all On Ubuntu apt-get update Migrating Configuration Files If you have changed configuration properties on your current installation of HBase, you probably want to apply those changes to the updated version. Configuration properties are located in . /opt/mapr/hbase/hbase-<version>/conf/ In general, you can migrate your configuration changes with the following procedure: Before upgrade, save configuration files on all nodes where HBase is installed. Upgrade HBase software. Migrate custom configuration settings into the new default files in the directory. conf Planning for Upgrade Upgrading an established deployment of HBase requires planning and consideration before beginning the upgrade process. Below are items to consider as you plan to upgrade: The MapR-FS filesystem provides native storage for table data as of MapR version 3.0. MapR tables are API-compatible with the Apache HBase, and have higher performance, are more versatile for developers, and reduce administrative burden, compared to Apache HBase. Before upgrading HBase, consider whether migrating to MapR tables is appropriate for your needs. The topic of upgrading HBase is discussed in depth in Apache literature. This page covers details of upgrading the HBase packages included in the MapR distribution for Apache Hadoop. However, administrators need to consider migration of data and maintenance of service for HBase clusters. 
For details, refer to the . Apache HBase Reference Guide The data formats for and tables change between minor release boundaries of HBase (such as 0.92.x to 0.94.x). HBase ROOT META handles the data migration process so it is transparent to the administrator. However, after upgrading you cannot downgrade to a previous version without also restoring the pre-upgrade data. Perform health checks and address any concerns before upgrading HBase. As a start, run to check for any inconsistencies in hbck HBase data. Refer to in the Apache HBase Reference Guide for usage details. hbck in Depth /opt/mapr/hbase/hbase-<version>/bin/hbase hbck If you also plan to upgrade the MapR core as part of upgrading your HBase cluster, upgrade the MapR core first. After successfully upgrading the MapR core and verifying cluster health, upgrade the HBase component. While planning to upgrade, it is a good time to review your cluster service layout and determine if the right services are running on the right set of nodes. For example, as your cluster grows, you will tend to isolate cluster-management services from compute services on separate nodes. Review and for details on planning the service layout. Planning the Cluster Installing HBase Because the upgrade process takes HBase services offline and requires careful planning, perform a test upgrade on a development cluster to make sure you understand the process. After you have experienced success on a dev cluster, proceed with your production cluster. Version-Specific Considerations Before you upgrade the software, note if there are any version-specific considerations that apply to you. The has Apache HBase Reference Guide a section dedicated to version-specific upgrade considerations. Upgrading to MapR v3.x Upgrading from 0.92.x to 0.94.x Packaging changes between HBase releases 0.90.6 and 0.92.1 Upgrading from 0.90.x to 0.92.x Upgrading to HBase 0.90.x from 0.20.x or 0.89.x Upgrading to MapR v3.x Due to changes in the MapR HBase client for MapR tables which are available starting with MapR version 3.0, if you are upgrading your cluster to MapR v3.0 or later, you need to upgrade the HBase client packaged with the new version of the MapR distribution. Upgrading from 0.92.x to 0.94.x Refer to the for details. Apache HBase Reference Guide Packaging changes between HBase releases 0.90.6 and 0.92.1 The following points apply when upgrading HBase from 0.90.6 (or earlier) to 0.92.1 (or later). MapR did not distribute any releases of HBase between these two versions. MapR's file naming convention changed between these releases, which caused a special case when upgrading. Because of a reversal in the alphanumeric order of the filenames, package managers incorrectly perceive the newer version to be a downgrade. When upgrading the software, package management tools might require you to specify a particular version, rather than automatically upgrading to the latest version. HBase release 0.92.1 (and onward) is stored in a separate repository than the MapR core software. This release corresponded to the release of MapR core v2.0. Prior to v2.0, HBase packages were located in the same repository with the MapR core. From MapR v2.0 onward, HBase packages are located in a separate repository, which requires some consideration in setting up repositories. See Installin for details. g MapR Software Upgrading from 0.90.x to 0.92.x Refer to the for details. Apache HBase Reference Guide Upgrading to HBase 0.90.x from 0.20.x or 0.89.x Refer to the for details. 
Apache HBase Reference Guide Upgrading the Software 1. 2. Use one of the following methods to upgrade the HBase component: To upgrade with a package manager To upgrade by manually installing packages To upgrade with a package manager After configuring repositories so that the version you want to install is available, you can use a package manager to install from the repository. The upgrade process will remove all but the following directories in the current HBase directory: and . conf logs On RedHat and CentOS To upgrade an HBase region server node: yum upgrade mapr-hbase-internal mapr-hbase-regionserver To upgrade an HBase master node: yum upgrade mapr-hbase-internal mapr-hbase-master On Ubuntu To upgrade an HBase region server node: apt-get install mapr-hbase-internal mapr-hbase-regionserver To upgrade an HBase master node: apt-get install mapr-hbase-internal mapr-hbase-master If you are upgrading from HBase 0.90.6 or earlier, you might have to specify the particular version you want to upgrade to, because of #Version-S . pecific Considerations When the upgrade finishes, the package manager updates the file to contain the correct version, such as /opt/mapr/hbase/hbaseversion 0 . .94.1 $ cat hbase/hbaseversion 0.94.1 After the upgrade, verify that the file exists. If it does not, run the comand /opt/mapr/hbase/hbaseversion echo "<version>" > to re-create the file, substituting the new version. Example: hbaseversion echo "0.94.1" > hbaseversion To upgrade by manually installing packages For example, to install version 0.94.5 build 18380, perform the steps below. On RedHat and CentOS Download the RPM package files , , and for version mapr-hbase-internal mapr-hbase-master mapr-hbase-regionserver 0.94.5 from http://package.mapr.com/releases/ecosystem-all/ . 2. 1. 2. 1. Install the package with . rpm To upgrade an HBase node: region server rpm -i --force mapr-hbase-internal-0.94.5.18380-GA.noarch.rpm mapr-hbase-regionserver-0.94.5.18380-GA.noarch.rpm To upgrade an HBase node: master rpm -i --force mapr-hbase-internal-0.94.5.18380-GA.noarch.rpm mapr-hbase-master-0.94.5.18380-GA.noarch.rpm On Ubuntu Download the RPM package files , , and for version mapr-hbase-internal mapr-hbase-master mapr-hbase-regionserver 0.94.5 from http://package.mapr.com/releases/ecosystem-all/ . Install the package with . dpkg To upgrade an HBase node: region server dpkg -i mapr-hbase-internal-0.94.5.18380_all.deb mapr-hbase-regionserver-0.94.5.18380_all.deb To upgrade an HBase node: master dpkg -i mapr-hbase-internal-0.94.5.18380_all.deb mapr-hbase-master-0.94.5.18380_all.deb Configure the Cluster for the New Version After upgrading the HBase packages, run the script to populate the new properties file with correct configure.sh hbase-site.xml ZooKeeper information. Substitute and with a comma-separated list of the CLDB and ZooKeeper nodes. <CLDBs> <ZooKeepers> /opt/mapr/server/configure.sh -C <CLDBs> -Z <ZooKeepers> Enabling HBase Access Control Starting in the 3.0 release of the MapR distribution for Hadoop, HBase supports Access Control Lists (ACLs) to limit the privileges of users on the system. To enable HBase ACLs on your cluster, perform the following steps: On the HBase Region Server, edit the file and add the following /opt/mapr/hbase/hbase-<version>/conf/hbase-site.xml section: Do not keep a prior version and install a newer version HBase installs into separate directories named after the version, such as , so the files for /opt/mapr/hbase/hbase-<version>/ multiple versions can co-exist. 
However, HBase data cannot be shared between separate versions of the software, and the data format is not backward compatible. Furthermore, HBase master and region-server services are resource intensive. MapR does not recommend keeping multiple versions of HBase on a node. 1. 2. 3. 4. <property> <name>hbase.coprocessor.region.classes</name> <value>org.apache.hadoop.hbase.security.token.TokenProvider,org.apache.hadoop.hba se.security.access.AccessController</value> </property> <property> <name>hbase.superuser</name> <value><admin1>,<admin2>,@<group1>,...</value> <!-- group names are prefixed with '@' --> </property> <property> <name>hbase.rpc.engine</name> <value>org.apache.hadoop.hbase.ipc.SecureRpcEngine</value> </property> On the HBase Master, edit the file and add the following section: /opt/mapr/hbase/hbase-<version>/conf/hbase-site.xml <property> <name>hbase.rpc.engine</name> <value>org.apache.hadoop.hbase.ipc.SecureRpcEngine</value> </property> <property> <name>hbase.coprocessor.master.classes</name> <value>org.apache.hadoop.hbase.security.access.AccessController</value> </property> <property> <name>hbase.superuser</name> <value><admin1>,<admin2>,@<group1>,...</value> <!-- group names are prefixed with '@' --> </property> On every HBase client node, edit the file and add the following /opt/mapr/hbase/hbase-<version>/conf/hbase-site.xml section: <property> <name>hbase.rpc.engine</name> <value>org.apache.hadoop.hbase.ipc.SecureRpcEngine</value> </property> Restart HBase on every node. Using HBase ACLs HBase ACLs support the following privileges: Read Write Create tables Administrator You can grant and remove privileges from users by using the and commands from the HBase shell. The following example grants grant revoke user read privileges from column family of table : jfoo cf1 mytable hbase(main):001:0> grant 'jfoo' 'R' 'mytable','cf1' This example removes user 's administrative privileges on the cluster: kbar hbase(main):001:0> revoke 'kbar' 'A' Working with HCatalog Apache HCatalog™ is a table and storage management service for data created using Apache Hadoop. This includes: Providing a shared schema and data type mechanism. Providing a table abstraction so that users need not be concerned with where or how their data is stored. Providing interoperability across data processing tools such as Pig, Map Reduce, and Hive. Apache HCatalog is in incubation at the Apache Software Foundation (ASF), sponsored by the Apache Incubator PMC. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF. This section contains documentation on working with HCatalog on the MapR distribution for Apache Hadoop. You can refer also to documentation available from the . This section provides all relevant details about using HCatalog with MapR, but does not duplicate Apache HCatalog project Apache documentation. Version 11 of Hive includes HCatalog and WebHCat. To install Hive, see the section of the Administration Guide. 
Hive Topics in This Section Upgrading HCatalog Related Links Apache HCatalog project MapR Forum posts related to HCatalog Search the MapR Blog for HCatalog topics Upgrading HCatalog This page contains the following topics describing how to upgrade HCatalog in the MapR distribution for Apache Hadoop: Update Repositories or Download Packages Migrating Configuration Files Version-Specific Considerations Upgrading the Software Before you upgrade, make sure that the version of the MapR core software on your cluster supports the version of HCatalog you want to upgrade to. See the . HCatalog Release Notes Update Repositories or Download Packages MapR's and repositories always contain the HCatalog version recommended for the latest release of the MapR core. The repositories rpm deb are located at . You can also prepare a local repository with any version of HCatalog http://package.mapr.com/releases/ecosystem/ you need. For more details on setting up repositories, see . Preparing Packages and Repositories If you don't want to install from a repository, you can download the package file for the specific release you want and install it manually. Individual package files are located at . http://package.mapr.com/releases/ecosystem-all/ To update the repository cache If you plan to install from a repository, update the repository cache on each node where HCatalog is installed. 1. 2. 3. 1. 2. On RedHat and CentOS yum clean all On Ubuntu apt-get update Migrating Configuration Files If you have changed configuration properties on your current installation of HCatalog, you probably want to apply those changes to the updated version. Configuration properties are located in . /opt/mapr/hcatalog/hcatalog-<version>/conf/ In general, you can migrate your configuration changes with the following procedure: Before upgrade, save configuration files on all nodes where HCatalog is installed. Upgrade HCatalog software. Migrate custom configuration settings into the new default files in the directory. conf Version-Specific Considerations Before you upgrade the software, note if there are any version-specific considerations that apply to you. There are no known version-specific considerations at this time. Upgrading the Software Use one of the following methods to upgrade the HCatalog component: To upgrade with a package manager To keep a prior version and install a newer version To upgrade with a package manager After configuring repositories so that the version you want to install is available, you can use a package manager to install from the repository. On RedHat and CentOS yum upgrade mapr-hcatalog mapr-hcatalog-server On Ubuntu apt-get install mapr-hcatalog mapr-hcatalog-server To keep a prior version and install a newer version HCatalog installs into separate directories named after the version, such as , so the files for /opt/mapr/hcatalog/hcatalog-<version>/ multiple versions can co-exist. To keep the prior version when installing a new version, you must manually install the package file for the new version. For example, to install version 0.4.0 build 18380 while keeping any previously installed version, perform the steps below. On RedHat and CentOS Download the RPM package files for and version 0.4.0 from mapr-hcatalog mapr-hcatalog-server http://package.mapr.com/releases/ecosystem-all/ . Install the package with . rpm 2. 
rpm -i --force mapr-hcatalog-0.4.18380-GA.noarch.rpm mapr-hcatalog-server-0.4.0.16780-1.noarch.rpm On Ubuntu This process is not supported on Ubuntu, because and cannot manage multiple versions of a package with the same name. apt-get dpkg Working with Hive Apache Hive™ is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems, such as the MapR Data Platform (MDP). Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. This section contains documentation on working with Hive on the MapR distribution for Apache Hadoop. You can refer also to documentation available from the . This section provides all relevant Apache Hive project details about using Hive with MapR, but does not duplicate Apache documentation. To install Hive, see the section of the Administration Guide. Hive Topics in This Section Hive ODBC Connector Using HiveServer2 Upgrading Hive Troubleshooting Hive Issues Using HCatalog and WebHCat with Hive Related Links Apache Hive project MapR Forum posts related to Hive Search the MapR Blog for Hive topics Hive ODBC Connector This page contains details about setting up and using the ODBC Connector for Hive. This page contains the following topics: Before You Begin The SQL Connector Software and Hardware Requirements Installation and Configuration Configuring SSL on a DSN Configuring DSN-less Authentication SQLPrepare Optimization Notes Data Types HiveQL Notes Notes on Applications Microsoft Access Microsoft Excel/Query Tableau Desktop Before You Begin The MapR Hive ODBC Connector is an ODBC driver for Apache Hive 0.7.0 and later that complies with the ODBC 3.52 specification. To use the ODBC driver, configure a (DSN), a definition that specifies how to connect to Hive. DSNs are typically managed by the Data Source Name operating system and may be used by multiple applications. Some applications do not use DSNs. You will need to refer to your particular application’s documentation to understand how it connects using ODBC. 1. 2. a. b. 3. 4. 5. 1. 2. The standard query language for ODBC is SQL. HiveQL, the standard query language for Hive, includes a subset of ANSI SQL-92. Applications that connect to Hive using ODBC may need queries altered if the queries use SQL features that are not present in Hive. Applications that use SQL will recognize HiveQL, but might not provide access to HiveQL-specific features such as multi-table insert. Please refer to the for up-to-date information on HiveQL. HiveQL wiki The SQL Connector The SQL Connector feature translates standard SQL-92 queries into equivalent HiveQL queries. The SQL Connector performs syntactical translations and structural transformations. For example: Quoted Identifiers: When quoting identifiers, HiveQL uses back quotes ( ), while SQL uses double quotes ( ). Even when a driver reports ` " the back quote as the quote character, some applications still generate double-quoted identifiers. Table Aliases: HiveQL does not support the AS keyword between a table reference and its alias. The , , and SQL syntaxes are translated to the HiveQL syntax. JOIN INNER JOIN CROSS JOIN JOIN SQL queries are transformed to HiveQL queries. 
TOP N LIMIT Software and Hardware Requirements To use MapR Hive ODBC Connector on Windows requires: Windows® 7 Professional or Windows® 2008 R2.  Both 32 and 64-bit editions are supported. The Microsoft Visual C++ 2010 Redistributable Package (runtimes required to run applications developed with Visual C++ on a computer that does not have Visual C++ 2010 installed.) A Hadoop cluster with the Hive service installed and running. You should find out from the cluster administrator the hostname or IP address for the Hive service and the port that the service is running on. (The default port for Hive is 10000.) Installation and Configuration There are versions of the connector for 32-bit and 64-bit applications. The 64-bit version of the connector works only with 64-bit DSNs; the 32-bit connector works only with 32-bit DSNs. Because 64-bit Windows machines can run both 64-bit and 32-bit applications, install both versions of the connector in order to set up DSNs to work with both types of applications. If both the 32-bit connector and the 64-bit connector are installed, you must configure DSNs for each independently, in their separate Data Source Administrators. To install the Hive ODBC Connector: Run the installer to get started: To install the 64-bit connector, download and run http://package.mapr.com/tools/MapR-ODBC/MapR_odbc_2.1.0_x6 . 4.exe To install the 32-bit connector, download and run http://package.mapr.com/tools/MapR-ODBC/MapR_odbc_2.1.0_x8 . 6.exe Perform the following steps, clicking after each: Next Accept the license agreement. Select an installation folder. On the Information window, click . Next On the Completing... window, click Finish. Install a DSN corresponding to your Hive server. To create a Data Source Name (DSN) Open the Data Source Administrator from the Start menu. Example: Start > MapR Hive ODBC Driver 2.0 > 64-Bit ODBC Driver Manager On the tab click to open the Create New Data Source dialog. User DSN Add 2. 3. 4. Select and click to open the Hive ODBC Driver DSN Setup window. MapR Hive ODBC Connector Finish Enter the connection information for the Hive instance: Data Source Name — Specify a name for the DSN. Description — Enter an optional description for the DSN. Host — Enter the hostname or IP of the server running HiveServer1 or HiveServer2. Port — Enter the listening port for the Hive service. Database — Leave as to connect to the default Hive database, or enter a specific database name. default Hive Server Type: — Set to HiveServer1 or HiveServer2. Authentication — If you are using HiveServer2, set the following. Mechanism: — Set to the authentication mechanism you're using. The MapR ODBC driver supports user name, user 4. 5. 6. name and password, and username and password over SSL authentication. User Name: — Set the user to run queries as. Password: — The user's password, if your selected authentication mechanism requires one. Click to test the connection. Test When you're sure the connection works, click . Your new connector will appear in the User Data Sources list. Finish Configuring SSL on a DSN Select the DSN from the , then click to display the Setup dialog. From the Setup dialog, ODBC Data Source Administrator Window Configure click to display the Advanced Options dialog. Advanced Options... In the pane, click the box next to to control whether the driver allows the common SSL Allow Common Name Host Name Mismatch name of a CA issued certificate to not match the host name of the Hive server. 
For self-signed certificates, the driver always allow common name of the certificate to not match the host name. If you wish to specify a local trusted certificates file, click next to the field and navigate to the location of Browse Trusted Certificates your file. The default setting uses the trusted CA certificates PEM file that is installed with the driver. cacerts.pem Advanced Options Select the checkbox to disable the SQL Connector feature. The SQL Connector feature has been added to the driver Use Native Query to apply transformations to the queries emitted by an application to convert them into an equivalent form in HiveQL. If the application is Hive aware and already emits HiveQL then turning off the SQL Connector feature avoids the extra overhead of query transformation. Select the checkbox to defer query execution to . When using Native Query mode, the driver will execute Fast SQLPrepare SQLExecute the HiveQL query to retrieve the result set metadata for . As a result, might be slow. Enable this option if the SQLPrepare SQLPrepare result set metadata is not required after calling . SQLPrepare In the field, type the number of rows to be fetched per block. Any positive 32-bit integer is valid. Performance Rows Fetched Per Block The driver always accepts a self-signed SSL certificate. 1. 2. 3. 4. gains are marginal beyond the default value of 10000 rows. In the Length field, type the default column length to use. Hive does not provide the length for c Default String Column String String olumns in its column metadata. This option allows you to tune the length of columns. String In the field, type the maximum number of digits to the right of the decimal point for numeric data types. Decimal Column Scale To allow the common name of a CA issued SSL certificate to not matchthe hostname of the Hive server, select the Allow Common checkbox. This setting is only applicable to User Name and Password (SSL) authentication mechanism and Name Hostname Mismatch will ignored by other authentication mechanisms. Enter the path of the file containing the trusted certificates in the edit box to configure the driver to load the Trusted Certificates certificates from the specified file to authenticate the Hive server when using SSL. This is only applicable to User Name and Password (SSL) authentication mechanisms and will be ignored by other authentication mechanisms. If this setting is not set the driver will default to using the trusted CA certificates PEM file installed by the driver. To create a server-side property, click the button, then type appropriate values in the Key and Value fields, and then click . Click Add OK the button to alter an existing property or to delete a property. Edit Remove If you selected Hive Server 2 as the Hive server type, then select or clear the check box as Apply Server Side Properties with Queries needed. If you selected Hive Server 2, then the check box is selected by default. Selecting Apply Server Side Properties with Queries the check box configures the driver to apply each server-side property you set by executing a query when opening a session to the Hive server. Clearing the check box configures the driver to use a more efficient method to apply server-side properties that does not involve additional network round tripping. Some Hive Server 2 builds are not compatible with the more efficient method. If the server-side properties you set do not take effect when the check box is clear, then select the check box. 
If you selected Hive Server 1 as the Hive server type, then the check box is selected and unavailable. Apply Server Side Properties with Queries Configuring DSN-less Authentication Some client applications, such as Tableau, provide some support for connecting to a data source using a driver without a DSN. Applications that connect using ODBC data sources work with Hive Server 2 by sending the appropriate authentication credentials defined in the data source. Applications that are Hive Server 1 aware but not Hive Server 2 aware and that connect using a DSN-less connection will not have a facility for sending authentication credentials to Hive Server 2. You can configure the ODBC driver with authentication credentials using the Driver Configuration tool. To configure driver authentication for a DSN-less connection: Launch the program from the menu. Driver Configuration Start Select a Hive Server Type from the drop-down. Select an authentication mechanism from the drop-down, then configure any required fields as suited to that mechanism. (optional) Click and configure any desired advanced options. Advanced SQLPrepare Optimization The connector currently uses query execution to determine the result-set’s metadata for SQLPrepare. The down side of this is that SQLPrepare is slow because query execution tends to be slow. You can configure the connector to speed up SQLPrepare if you do not need the result-set’s metadata. To change the behavior for SQLPrepare, create a String value under your DSN. If the value is set to a non-zero NOPSQLPrepare value, SQLPrepare will not use query execution to derive the result-set’s metadata. If this registry entry is not defined, the default value is 0. Notes Data Types Type at the Hive CLI command line or in Beeline to display a list of the Hadoop and Hive server-side properties that set -v your implementation supports. Credentials defined in a data source take precedence over credentials configured using the Driver Configuration tool. Credentials configured using the Driver Configuration tool apply for all connections made using a DSN-less connection unless the client application is Hive Server 2 aware and requests credentials from the user. The MapR ODBC driver only supports the , , and aut User Name User Name and Password User Name and Password (SSL) hentication mechanisms. The following data types are supported: Type Description TINYINT 1-byte integer SMALLINT 2- byte integer INT 4-byte integer BIGINT 8-byte integer FLOAT Single-precision floating-point number DOUBLE Double-precision floating-point number DECIMAL Decimal numbers BOOLEAN True/false value STRING Sequence of characters TIMESTAMP Date and time value Not yet supported: The aggregate types (ARRAY, MAP, and STRUCT) HiveQL Notes CAST Function HiveQL doesn’t support the CONVERT function; it uses the CAST function to perform type conversion. Example: CAST (<expression> AS <type>) Using in HiveQL: CAST Use the HiveQL names for the eight data types supported by Hive in the expression. For example, to convert 1.0 to an integer, use CAST rather than . CAST (1.0 AS INT) CAST (1.0 AS SQL_INTEGER) Hive does not do a range check during operations. For example, returns a of CAST CAST (1000000 AS SQL_TINYINT) TINYINT value 64, rather than the expected error. Unlike SQL, Hive returns instead of an error if it fails to convert the data. For example, returns null. 
null CAST (“STRING” AS INT) Using with values: CAST BOOLEAN The boolean value converts to the numeric value TRUE 1 The boolean value converts to the numeric value FALSE 0 The numeric value converts to the boolean value ; any other number converts to 0 FALSE TRUE The empty string converts to the boolean value ; any other string converts to FALSE TRUE The HiveQL type stores text strings, and corresponds to the data type. The operation successfully converts STRING SQL_LONGVARCHAR CAST strings to numbers if the strings contain only numeric characters; otherwise the conversion fails. You can tune the column length used for columns. To change the default length reported for columns, add the registry entry STRING STRING Def under your DSN and specify a value. If this registry entry is not defined, the default length of 1024 characters is aultStringColumnLength used. Delimiters The connector uses Thrift to connect to the Hive server. Hive returns the result set of a HiveQL query as newline-delimited rows whose fields are tab-delimited. Hive currently does not escape any tab character in the field. Make sure to escape any tab or newline characters in the Hive data, indlucing platform-specific newline character sequences such as line-feed (LF) for UNIX/Linux/Mac OS X/etc, carriage return/line-feed (CR/LF) for Windows, and carriage return (CR) for older Macintosh platforms. Notes on Applications Microsoft Access Version tested "2010" (=14.0), 32 and 64-bit. Notes Linked table is not available currently. Microsoft Excel/Query Version tested "2010" (=14.0), 32 and 64-bit. Notes From the ribbon, use and select either Data From Other Sources Fr or . The former om Data Connection Wizard From Microsoft Query requires a pre-defined DSN while the latter supports creating a DSN on the fly. You can use the ODBC driver via the OLE DB for ODBC Driver bridge. Tableau Desktop Version tested 7.0, 32-bit only. Works with v1 of the ODBC driver only. Notes Prior to version 7.0.n, you will need to install a TDC to maximize the capability of the driver. From version 7.0.n onward, you can specify the driver via the MapR option from the tab. Hadoop Hive Connect to Data Hive ODBC Connector License and Copyright Information Third Party Trademarks ICU License - ICU 1.8.1 and later COPYRIGHT AND PERMISSION NOTICE Copyright (c) 1995-2010 International Business Machines Corporation and others All rights reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, provided that the above copyright notice(s) and this permission notice appear in all copies of the Software and that both the above copyright notice(s) and this permission notice appear in supporting documentation. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE. Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization of the copyright holder. All trademarks and registered trademarks mentioned herein are the property of their respective owners. OpenSSL Copyright (c) 1998-2008 The OpenSSL Project.  All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1.  Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2.  Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3.  All advertising materials mentioning features or use of this software must display the following acknowledgment: "This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit.  ( )" http://www.openssl.org/ 4.  The names "OpenSSL Toolkit" and "OpenSSL Project" must not be used to endorse or promote products derived from this software without prior written permission.  For written permission, please contact [email protected]. 5.  Products derived from this software may not be called "OpenSSL" nor may "OpenSSL" appear in their names without prior written permission of the OpenSSL Project. 6.  Redistributions of any form whatsoever must retain the following acknowledgment: "This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit ( )" http://www.openssl.org/ THIS SOFTWARE IS PROVIDED BY THE OpenSSL PROJECT ``AS IS'' AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE OpenSSL PROJECT OR ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Expat Copyright (c) 1998, 1999, 2000 Thai Open Source Software Center Ltd Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the ""Software""), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 
THE SOFTWARE IS PROVIDED ""AS IS"", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND  NOINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE." Apache Hive Copyright 2008-2011 The Apache Software Foundation. Apache Thrift Copyright 2006-2010 The Apache Software Foundation. Using HiveServer2 HiveServer2 allows multiple concurrent connections to the Hive server over a network. HiveServer2 is included as a patch on the Hive 0.9.0 base release in the MapR distribution for Apache Hadoop. See for installation Installing Hive details. The package includes both HiveServer1 and HiveServer2, and you can choose which one to run. mapr-hive This page contains the following topics: Configuring Hive for HiveServer2 Enabling SSL for HiveServer2 Configuring Security Authentication for HiveServer2 LDAP Authentication using OpenLDAP Setting up Authentication with Pluggable Access Modules Configuring Custom Authentication Starting HiveServer2 Accessing Hive with the BeeLine client Connecting to HiveServer2 JDBC ODBC Configuring JDBC Clients for LDAP Authentication with HiveServer2 Enabling SSL in Clients Enabling SSL for JDBC Passing a truststore and truststore password in the URI string. Passing the truststore parameters as JVM arguments. Using a CA signed certificate in JRE library Using a self-signed SSL certificate Related Topics Configuring Hive for HiveServer2 HiveServer2 accesses Hive data without alteration if you are not changing releases of Hive. You do not need to update or otherwise transform data in order to begin using HiveServer2. Simply enable support, as described below, and run instead of the previous HiveServer. hiveserver2 To configure Hive for use with HiveServer2, include the following configuration properties in the /opt/mapr/hive/hive<version>/conf/hiv configuration file. e-site.xml <property> <name>hive.support.concurrency</name> <description>Enable Hive's Table Lock Manager Service</description> <value>true</value> </property> <property> <name>hive.zookeeper.quorum</name> <description>Zookeeper quorum used by Hive's Table Lock Manager</description> <value><zk node1>,<zk node2>,...,<zk nodeN></value> </property> <property> <name>hive.zookeeper.client.port</name> <value>5181</value> <description>The Zookeeper client port. The MapR default clientPort is 5181.</description> </property> For the property above, substitute the values with a comma-separated list of the node hostnames or IP hive.zookeeper.quorum <zk nodeX> addresses running the ZooKeeper service. For users migrating from HiveServer1, you might need to modify applications and scripts to interact with HiveServer2. If you run a dedicated instance of HiveServer1 on each client because HiveServer1 does not support concurrent connections, you can replace those instances with a single instance of HiveServer2. HiveServer2 uses a different connection URL for the JDBC driver. Existing scripts that use JDBC to communicate with HiveServer1 will need to be modified by changing the JDBC driver URL from to . 
jdbc:hive://<hostname>:<port> jdbc:hive2://<hostname>:<port> Enabling SSL for HiveServer2 To enable SSL for HiveServer2, set the following parameters in the file: hive-site.xml <property> <name>hive.server2.enable.ssl</name> <value>true</value> <description>enable/disable SSL communication</description> </property> <property> <name>hive.server2.ssl.keystore</name> <value><path-to-keystore-file></value> <description>path to keystore file</description> </property> You can specify the keystore password in the file by adding the following parameter: hive-site.xml <property> <name>hive.server2.ssl.keystore.password</name> <value><password></value> <description>keystore password</description> </property> HiveServer2 automatically prompts for the keystore password during startup when no password is stored in the file. hive-site.xml Configuring Security Authentication for HiveServer2 LDAP Authentication using OpenLDAP Include the following properties in the file to enable LDAP authentication using OpenLDAP. hive-site.xml <property> <name>hive.server2.authentication</name> <value>LDAP</value> </property> <property> <name>hive.server2.authentication.ldap.url</name> <value><LDAP URL></value> </property> <property> <name>hive.server2.authentication.ldap.baseDN</name> <value><LDAP Base DN></value> </property> Substitute with the access URL for your LDAP server. Substitute with the base LDAP DN for your LDAP server, <LDAP URL> <LDAP BaseDN> for example, . ou=People,dc=mycompany,dc=com Setting up Authentication with Pluggable Access Modules The following steps enable Pluggable Access Module (PAM) authentication for HiveServer2. If you specify the password in the file, protect the file with the appropriate file permissions. hive-site.xml 1. 2. 3. 4. 1. Prerequisite: PAM authentication with Hive requires the . When the JPam library is not present in the environment JPam native library LD_LIBRARY_PATH variable, Hive throws an error similar to this example: Exception in thread "pool-1-thread-1" java.lang.UnsatisfiedLinkError: no jpam in java.library.path at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1738) at java.lang.Runtime.loadLibrary0(Runtime.java:823) at java.lang.System.loadLibrary(System.java:1028) at net.sf.jpam.Pam.<clinit>(Pam.java:51) at org.apache.hive.service.auth.PamAuthenticationProvider.Authenticate(PamAuthenticationP rovider.java:39) at org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSa slHelper.java:63) at org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:127 ) at org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse (TSaslTransport.java:509) at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:264) at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41) at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTrans port.java:216) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:18 9) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) To resolve this error, follow these steps: Download the package version for your architecture the JPam . downloads page Unzip the package . 
Copy the file to the directory and add the directory to the environment variable: libjpam.so libpjam-dir-path LD_LIBRARY_PATH $ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:libjpam-dir-path Alternately, copy the file to the folder in the JRE installation directory. libjpam.so lib/<amd64/i386> Restart HiveServer2. Once JPam is installed, follow these steps to configure HiveServer2 for PAM authentication: Set the following config properties in the file. The values for the property hive-site.xml login,sudo hive.server2.authentica are examples. You can configure your own list of PAM modules. tion.pam.profiles 1. 2. 3. <property> <name>hive.server2.authentication</name> <value>CUSTOM</value> </property> <property> <name>hive.server2.custom.authentication.class</name> <value>org.apache.hive.service.auth.PamAuthenticationProvider</value> </property> <property> <name>hive.server2.authentication.pam.profiles</name> <value>login,sudo</value> <description>Comma-separated list of PAM modules to verify. Do not use spaces after the comma.</description> </property> Restart HiveServer2 to apply these changes. Enter the username and password in your Hive client. In the following example, Beeline is the client. ~$ beeline Beeline version 0.11-mapr by Apache Hive beeline> !connect jdbc:hive2://<HiveServer2Host>:<port>/default scan complete in 4ms Connecting to jdbc:hive2://<HiveServer2Host>:<port>/default Enter username for jdbc:hive2://<HiveServer2Host>:<port>/default: mapr Enter password for jdbc:hive2://<HiveServer2Host>:<port>/default: ******* Hive history file=/tmp/mapr/hive_job_log_97d1cf06-bbf5-4abf-9bbb-d9ce56667fdf_941674138.txt Connected to: Hive (version 0.11-mapr) Driver: Hive (version 0.11-mapr) Transaction isolation: TRANSACTION_REPEATABLE_READ Configuring Custom Authentication To implement custom authentication for HiveServer2, create a custom Authenticator class derived from the following interface: public interface PasswdAuthenticationProvider { /** * The Authenticate method is called by the HiveServer2 authentication layer * to authenticate users for their requests. * If a user is to be granted, return nothing/throw nothing. * When a user is to be disallowed, throw an appropriate {@link AuthenticationException}. * * For an example implementation, see {@link LdapAuthenticationProviderImpl}. * * @param user - The username received over the connection request * @param password - The password received over the connection request * @throws AuthenticationException - When a user is found to be * invalid by the implementation */ void Authenticate(String user, String password) throws AuthenticationException; } The attached code has an example implementation that has stored usernames and passwords. SampleAuthenticator.java Add the following properties to the file, then restart Hiveserver2: hive-site.xml <property> <name>hive.server2.authentication</name> <value>CUSTOM</value> </property> <property> <name>hive.server2.custom.authentication.class</name> <value>hive.test.SampleAuthenticator</value> </property> Starting HiveServer2 If you are running the metastore in Remote mode, you need to start the metastore before HiveServer2. hive --service metastore To start the service, execute the following command. hiveserver2 hive --service hiveserver2 Accessing Hive with the BeeLine client HiveServer2 uses the BeeLine command line interface, and does not work with the Hive Shell used for HiveServer1. BeeLine is based on the SQ , which is currently the best source of documentation. Refer to the . 
LLine project SQLLine documentation page The following is an example of running basic SQL commands in BeeLine, using the same server that is running HiveServer2. hive --service beeline !connect jdbc:hive2://<hiveserver2 node>:10000 <hive username> <hive user password> org.apache.hive.jdbc.HiveDriver show tables select * from <table name> Substitute , , , and with valid values. <hiveserver2 node> <hive username> <hive user password> <table name> The example session below demonstrates a sample BeeLine session that lists tables and then lists all values in a table. The user is logged in as root, but this is not a requirement. $ hive --service beeline Beeline version 0.11-mapr by Apache Hive beeline> !connect jdbc:hive2://10.10.100.56:10000 root mypasswd \ org.apache.hive.jdbc.HiveDriver Connecting to jdbc:hive2://10.10.100.56:10000 Connected to: Hive (version 0.11-mapr) Driver: Hive (version null) Transaction isolation: TRANSACTION_REPEATABLE_READ 0: jdbc:hive2://10.10.100.56:10000> load data local inpath '/user/data/hive/pokes' overwrite into table pokes; 0: jdbc:hive2://10.10.100.56:10000> show tables; +-----------------+ | tab_name | +-----------------+ | pokes | +-----------------+ 1 rows selected (0.152 seconds) 0: jdbc:hive2://10.10.100.56:10000> select * from pokes; +------+----------+ | foo | bar | +------+----------+ | 238 | val_238 | | 86 | val_86 | | 311 | val_311 | | 27 | val_27 | | 165 | val_165 | | 409 | val_409 | +------+----------+ 6 rows selected (0.201 seconds) 0: jdbc:hive2://10.10.100.56:10000> exit $ Connecting to HiveServer2 JDBC The JDBC connection URI format and driver class for HiveServer2 are different from HiveServer1. HiveServer2 uses URI format and class . jdbc:hive2://<host>:<port> org.apache.hive.jdbc.HiveDriver HiveServer1 uses URI format and class jdbc:hive://<host>:<port> org.apache.hadoop.hive.jdbc.HiveDriver ODBC See for information on connecting to HiveServer2 via ODBC. Hive ODBC Connector Configuring JDBC Clients for LDAP Authentication with HiveServer2 JDBC clients connect using a connection URL as shown below. String url = "jdbc:hive2://hs2node:10000/default;user=<LDAP userid>;password=<password>" Connection connection = DriverManager.getConnection(url); Substitute with the user id and with the password for the client user. <ldap userid> <password> Enabling SSL in Clients Enabling SSL for JDBC To enable SSL for JDBC, add the string to the URI string. JDBC requires a truststore and an truststore password. ssl=true; optional Applications that use JDBC, such as Beeline, can pass the truststore and the optional truststore password in the following ways: Passing a truststore and truststore password in the URI string. 
The following example uses this JDBC command to pass the truststore parameters in the URI string: jdbc:hive2://<host>:<port>/<database>;ssl=true;sslTrustStore=<path-to-truststore>;sslT rustStorePassword=<password> $ beeline Beeline version 0.11-mapr by Apache Hive beeline> !connect jdbc:hive2://127.0.0.1:10000/default;ssl=true;sslTrustStore=truststore.jks;sslTrustSto rePassword=tsp scan complete in 4ms Connecting to jdbc:hive2://127.0.0.1:10000/default;ssl=true;sslTrustStore=truststore.jks;sslTrustSto rePassword=tsp Enter username for jdbc:hive2://127.0.0.1:10000/default;ssl=true;sslTrustStore=truststore.jks;sslTrustSto rePassword=tsp: qa-user1 Enter password for jdbc:hive2://127.0.0.1:10000/default;ssl=true;sslTrustStore=truststore.jks;sslTrustSto rePassword=tsp: **** Connected to: Hive (version 0.11-mapr) Driver: Hive (version 0.11-mapr) Transaction isolation: TRANSACTION_REPEATABLE_READ 0: jdbc:hive2://127.0.0.1:10000/default> show tables; +-------------------+ | tab_name | +-------------------+ | table1 | | table2 | +-------------------+ Passing the truststore parameters as JVM arguments. You can use the environment variable to pass JVM arguments to the Beeline client: HADOOP_OPTS export HADOOP_OPTS="-Djavax.net.ssl.trustStore=<path-to-trust-store-file> -Djavax.net.ssl.trustStorePassword=<password>" $ beeline Beeline version 0.11-mapr by Apache Hive beeline> !connect jdbc:hive2://127.0.0.1:1000/default;ssl=true scan complete in 4ms Connecting to jdbc:hive2://127.0.0.1:10000/default;ssl=true Enter username for jdbc:hive2://127.0.0.1:10000/default;ssl=true: qa-user1 Enter password for jdbc:hive2://127.0.0.1:10000/default;ssl=true: **** Connected to: Hive (version 0.11-mapr) Driver: Hive (version 0.11-mapr) Transaction isolation: TRANSACTION_REPEATABLE_READ 0: jdbc:hive2://127.0.0.1:10000/default> show tables; +-------------------+ | tab_name | +-------------------+ | table1 | | table2 | +-------------------+ Using a CA signed certificate in JRE library In this example, the SSL certificate is signed by a Certified Authority. $ beeline Beeline version 0.11-mapr by Apache Hive beeline> !connect jdbc:hive2://127.0.0.1:1000/default;ssl=true scan complete in 4ms Connecting to jdbc:hive2://127.0.0.1:10000/default;ssl=true Enter username for jdbc:hive2://127.0.0.1:10000/default;ssl=true: qa-user1 Enter password for jdbc:hive2://127.0.0.1:10000/default;ssl=true: **** Connected to: Hive (version 0.11-mapr) Driver: Hive (version 0.11-mapr) Transaction isolation: TRANSACTION_REPEATABLE_READ Using a self-signed SSL certificate When you are using a self-signed certificate, import the certificate into an existing or new truststore file with the following command: keytool -import -alias <alias> -file <path-to-cerficate-file> -keystore <truststorefile> To install the self-signed certificate in the JRE, give the option in the above argument the value -keystore <JRE-HOME/lib/security/cace . rts Related Topics Installing Hive Hive ODBC Connector SQLLine Project Documentation HiveServer2 Thrift API Upgrading Hive This page contains the following topics describing how to upgrade Hive in the MapR distribution for Apache Hadoop: Update Repositories or Download Packages Migrating Configuration Files Version-Specific Considerations Upgrading the Software Updating the Hive Metastore Before you upgrade, make sure that the version of the MapR core software on your cluster supports the version of Hive you want to upgrade to. See the . 
Hive Release Notes Update Repositories or Download Packages MapR's and repositories always contain the Hive version recommended for the latest release of the MapR core. The repositories are rpm deb located at . You can also prepare a local repository with any version of Hive you need. http://package.mapr.com/releases/ecosystem/ For more details on setting up repositories, see . Preparing Packages and Repositories If you don't want to install from a repository, you can download the package file for the specific release you want and install it manually. Individual package files are located at . http://package.mapr.com/releases/ecosystem-all/ To update the repository cache If you plan to install from a repository, update the repository cache on each node where Hive is installed. 1. 2. 3. 1. 2. On RedHat and CentOS yum clean all On Ubuntu apt-get update Migrating Configuration Files If you have changed configuration properties on your current installation of Hive, you probably want to apply those changes to the updated version. Configuration properties are located in . /opt/mapr/hive/hive-<version>/conf/ In general, you can migrate your configuration changes with the following procedure: Before upgrade, save configuration files on all nodes where Hive is installed. Upgrade Hive software. Migrate custom configuration settings into the new default files in the directory. conf Version-Specific Considerations Before you upgrade the software, note if there are any version-specific considerations that apply to you. Packaging changes between Hive releases 0.7.1 and 0.9.0 Removing Hard References to the CLDB in MapR-FS URLs Packaging changes between Hive releases 0.7.1 and 0.9.0 The following points apply when upgrading Hive from 0.7.1 (or earlier) to 0.9.0 (or later). MapR did not distribute any releases of Hive between these two versions. MapR's file naming convention changed between these releases, which caused a special case when upgrading. Because of a reversal in the alphanumeric order of the filenames, package managers incorrectly perceive the newer version to be a downgrade. When upgrading the software, package management tools might require you to specify a particular version, rather than automatically upgrading to the latest version. Hive release 0.9.0 (and onward) is stored in a separate repository than the MapR core software. This release corresponded to the release of MapR core v2.0. Prior to v2.0, Hive packages were located in the same repository with the MapR core. From MapR v2.0 onward, Hive packages are located in a separate repository, which requires some consideration in setting up repositories. See Installing for details. MapR Software MapR packaged Hive 0.7.1 (and earlier) as two separate packages, and . Starting with Hive release mapr-hive mapr-hive-internal 0.9.0, MapR packages Hive as one package . When upgrading from 0.7.1 (or earlier), if you upgrade only the pa mapr-hive mapr-hive ckage, the package manager will leave files in place. You have to explicitly uninstall to mapr-hive-internal mapr-hive-internal clean the older version from the node. Removing Hard References to the CLDB in MapR-FS URLs When you upgrade your cluster to version 2.0 of the MapR distribution for Hadoop or later, perform the following steps to update Hive's behavior with respect to the CLDB nodes. Verify that the property in the Hive table properties shows an IP and port for a CLDB node. location hive> describe extended <table_name>; Update the table to remove the IP and port from the location value. 
2. 3. 4. 5. 6. hive> alter table <table_name> set location 'maprfs:///user/hive/warehouse/<table_name>'; Verify that the DB location URI in the mysql metastore has an IP address and port number. mysql> select * from DBS; Update the DB location URI in the DBS table to remove the IP address and port number. mysql> update DBS set DBS.DB_LOCATION_URI = REPLACE (DBS.DB_LOCATION_URI , 'maprfs://<ip>:PORT' , 'maprfs://' ) WHERE DBS.DB_LOCATION_URI like 'maprfs://<ip>%' ; Verify the property for each entry in the SDS table. location mysql> select * from SDS; Update the property for all entries in the SDS table to remove the IP address and port number. location mysql> update SDS set SDS.LOCATION = REPLACE (SDS.LOCATION , 'maprfs://<ip>:PORT' , 'maprfs://' ) WHERE SDS.LOCATION like 'maprfs://<ip>%' ; Upgrading the Software Use one of the following methods to upgrade the Hive component: To upgrade with a package manager To manually remove a prior version and install the latest version in the repository To keep a prior version and install a newer version To upgrade with a package manager After configuring repositories so that the version you want to install is available, you can use a package manager to install from the repository. On RedHat and CentOS yum upgrade mapr-hive On Ubuntu apt-get install mapr-hive If you are upgrading from Hive 0.7.1, you might have to specify the particular version you want to upgrade to, because of #Version-Specific . Considerations Back up your metastore database before upgrading Hive. 1. 2. To manually remove a prior version and install the latest version in the repository If you are upgrading from Hive 0.7.1, this process might be necessary to remove the package which is no longer part of mapr-hive-internal MapR's Hive release 0.9.0 and onward. Run the package manager twice, first to remove the old version, and again to install the new version. For example, to upgrade from version 0.7.1 to version 0.10.0, perform the steps below. In this case, we assume the repository is set up on the node and 0.10.0 is the latest version in the repository. On RedHat and CentOS yum remove mapr-hive mapr-hive-internal yum install mapr-hive On Ubuntu apt-get remove mapr-hive mapr-hive-internal apt-get install mapr-hive To keep a prior version and install a newer version Hive installs into separate directories named after the version, such as , so the files for multiple versions /opt/mapr/hive/hive-<version>/ can co-exist. To keep the prior version when installing a new version, you must manually install the package file for the new version. For example, to install version 0.10.0 build 18380 while keeping any previously installed version, perform the steps below. On RedHat and CentOS Download the RPM package file for version 0.10.0 from mapr-hive http://package.mapr.com/releases/ecosystem-all/ . Install the package with . rpm rpm -i --force mapr-hive-0.10.18380-GA.noarch.rpm On Ubuntu This process is not supported on Ubuntu, because and cannot manage multiple versions of a package with the same name. apt-get dpkg Updating the Hive Metastore Refer to the README file in the directory /opt/mapr/hive/hive-<version>/scripts/metastore/upgrade/<metastore_database> after upgrading Hive for directions on updating your existing schema to work with the new Hive version. Scripts are provided for metastore_db MySQL and Derby. After the upgrade, verify that that metastore database update completed successfully. 
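For illustration only, updating a MySQL-backed metastore typically amounts to applying the schema upgrade script that the README names for your old and new Hive versions. In the sketch below, hive_meta, <dbuser>, and <upgrade-script>.sql are placeholders; use the database name, credentials, and script file that apply to your installation.
# Back up the metastore database first, as noted above
mysqldump -u <dbuser> -p hive_meta > hive_meta_backup.sql

# Apply the schema upgrade script named in the README for your upgrade path
mysql -u <dbuser> -p hive_meta < /opt/mapr/hive/hive-<version>/scripts/metastore/upgrade/mysql/<upgrade-script>.sql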
A few sample diagnostics: The command in Hive should provide a complete list of all your Hive tables. show tables Perform simple operations on Hive tables that existed before the upgrade. SELECT Copy custom configuration files in to a safe location before proceeding. /opt/mapr/hive/hive-<version>/conf Before running the new version of Hive, you need to update the Metastore to work with the new version. Perform filtered operations on Hive tables that existed before the upgrade. SELECT Troubleshooting Hive Issues This section provides information about troubleshooting development problems. Click a subtopic below for more detail. Error 'Hive requires Hadoop 0.20.x' after upgrading to MapR v2.1 Related Links Apache Hive project MapR Forum posts related to Hive Search the MapR Blog for Hive topics Error 'Hive requires Hadoop 0.20.x' after upgrading to MapR v2.1 Hive version 0.7.x will not work with MapR core version 2.1 and beyond. From MapR version v2.1 the command returns the hadoop version new Hadoop versioning scheme (1.0.x instead of 0.20.x). Hive 0.7.0 does not recognize the new Hadoop version numbers and fails to start. The failure case produces the following error: root@node07:/# /opt/mapr/hive/hive-0.7.1/bin/hive Hive requires Hadoop 0.20.x (x >= 1). 'hadoop version' returned: Hadoop 1.0.3 Source http://mapr.com -r a14862b4053cd61ed69b5035e10d35dfda4615b7 Compiled by root on Sat Nov 24 12:13:28 PST 2012 From source with checksum cd3907397879a0522405ea12c707cd09 Using HCatalog and WebHCat with Hive The service provides applications with a table view of the MapR-FS layer in your cluster, expanding your application's options from HCatalog read/write data streams to add table operations such as gets and stores on table rows. The HCatalog service stores the metadata required for its operations in the Hive Metastore. You can create tables with the utility, in addition to the Hive shell. hcat The utility can execute any of the data definition language (DDL) commands available in Hive that do not involve launching a MapReduce hcat job. The following Hive DDL commands are not supported in HCatalog: IMPORT FROM EXPORT TABLE CREATE TABLE ... AS SELECT ALTER TABLE ... REBUILD ALTER TABLE ... CONCATENATE ANALYZE TABLE ... COMPUTE STATISTICS ALTER TABLE ARCHIVE/UNARCHIVE PARTITION Internally, the utility passes DDL commands to the program. Data stored in the MapR filesystem is serialized and deserialized through hcat hive and statements for records. Fields within a record are parsed with . InputStorageFormats OutputStorageFormats SerDes Accessing HCatalog Tables from Hive To access tables created in HCatalog in Hive, use the following command to append paths to your environment variable: HADOOP_CLASSPATH The JSON serializer/deserializer has not implemented a method and as a result does hive-json-serde-0.2.jar serialize() not function. 1. 2. 3. 4. 5. export HADOOP_CLASSPATH=$HCAT_HOME/share/hcatalog/storage-handlers/hbase/lib/hbase-storage-ha ndler-0.1.0.jar:$HCAT_HOME/share/hcatalog/hcatalog-core-0.11-mapr.jar:$HCAT_HOME/share /hcatalog/hcatalog-pig-adapter-0.11-mapr.jar:$HCAT_HOME/share/hcatalog/hcatalog-server -extensions-0.11-mapr.jar Using HCatalog Interfaces with Pig This section provides an example of using the HCatalog interfaces and to load and retrieve data from Pig. HCatLoader HCatStorer Setup: From a command line terminal, issue the following commands. 
export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.0/ export HIVE_HOME=/opt/mapr/hive/hive-0.11-mapr/ export HCAT_HOME=${HIVE_HOME}/hcatalog/ export PATH=$PATH:${HIVE_HOME}/bin:${HCAT_HOME}/bin Create a table with the utility. hcat hcat -e "create table hcatpig(key int, value string)" Verify that the table and table definition both exist. hcat -e "describe formatted hcatpig" Load data into the table from Pig: Copy the file into the MapRFS file system, then start Pig $HIVE_HOME/examples/files/kv1.txt and load the file with the following commands: pig -useHCatalog -Dmapred.job.tracker=maprfs:/// -Dfs.default.name=maprfs://CLDB_Host:7222/ grunt> A = LOAD 'kv1.txt' using PigStorage('\u0001') AS(key:INT, value:chararray); grunt> STORE A INTO 'hcatpig' USING org.apache.hcatalog.pig.HCatStorer(); Retrieve data from the table with the following Pig commands: hcatpig B = LOAD 'default.hcatpig' USING org.apache.hcatalog.pig.HCatLoader(); dump B; // this should display the records in kv1.txt Another way to verify that the data is loaded into the table is by looking at the contents of hcatpig maprfs://user/hive/warehouse . HCatalog tables are also accessible from the Hive CLI. All Hive queries work on HCatalog tables. /hcatpig/ Example HCatInputFormat and HCatOutputFormat code to read/write data from MapReduce applications. 1. 2. 3. 4. 5. 1. This example uses an example MapReduce program named . The program is attached to this page and can be HCatalogMRTest.java downloaded by clicking > . Tools Attachments From the command line, issue the following commands to define the environment: export LIB_JARS= $HCAT_HOME/share/hcatalog/hcatalog-core-0.11-mapr.jar, $HIVE_HOME/lib/hive-metastore-0.11-mapr.jar, $HIVE_HOME/lib/libthrift-0.9.0.jar, $HIVE_HOME/lib/hive-exec-0.11-mapr.jar, $HIVE_HOME/lib/libfb303-0.9.0.jar, $HIVE_HOME/lib/jdo2-api-2.3-ec.jar, $HIVE_HOME/lib/slf4j-api-1.6.1.jar export HADOOP_CLASSPATH= $HCAT_HOME/share/hcatalog/hcatalog-core-0.11-mapr.jar: $HIVE_HOME/lib/hive-metastore-0.11-mapr.jar: $HIVE_HOME/lib/libthrift-0.9.0.jar: $HIVE_HOME/lib/hive-exec-0.11-mapr.jar: $HIVE_HOME/lib/libfb303-0.9.0.jar: $HIVE_HOME/lib/jdo2-api-2.3-ec.jar: $HIVE_HOME/conf: $HADOOP_HOME/conf: $HIVE_HOME/lib/slf4j-api-1.6.1.jar Compile : HCatalogMRTest.java javac -cp `hadoop classpath`:${HCAT_HOME}/share/hcatalog/hcatalog-core-0.11-mapr.jar HCatalogMRTest.java -d . Create a JAR file: jar -cf hcatmrtest.jar org Create an output table: hcat -e "create table hcatpigoutput(key int, value int)" Run the job: hadoop --config $HADOOP_HOME/conf jar ./hcatmrtest.jar org.myorg.HCatalogMRTest -libjars $LIB_JARS hcatpig hcatpigoutput At the end of the job, the file should have entries in the form . hcatpigoutput key, count Using the HCatReader and HCatWriter interfaces to read and write data from non-MapReduce applications. This example uses an example MapReduce program named . The program is attached to this page and can be TestReaderWriter.java downloaded by clicking > . Tools Attachments 1. 2. 3. 4. 1. 1. 2. Add the following JAR files to your $HADOOP_CLASSPATH environment variable with the following command: export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/mapr/hive/hive-0.11/lib/antlr-runtime-3.4 .jar:/opt/mapr/hive/hive-0.11/lib/hive-cli-0.11-mapr.jar Compile the test program with the following command: javac -cp `hadoop classpath`:${HCAT_HOME}/share/hcatalog/hcatalog-core-0.11-mapr.jar TestReaderWriter.java -d . 
Create a JAR file with the following command: jar -cf hcatrwtest.jar org Run the job with the following command: hadoop jar /root/<username>/hcatalog/hcatrwtest.jar org.apache.hcatalog.data.TestReaderWriter -libjars $LIB_JARS The last command should result in a table named "mytbl" that is populated with data. Using the WebHCat Interfaces The WebHCat service provides a REST-like web API for HCatalog. Applications make HTTP requests to Pig, Hive, and HCatalog DDL from within applications. Setting up WebHCat Create working and log directories for the WebHCat server. Add the paths for those directories to your environment definition file, such as , by adding the following lines to the file: .bashrc export WEBHCAT_PID_DIR=~/webhcat export WEBHCAT_LOG_DIR=~/webhcat/logs Create a file named in the directory . Place the following text in the file: wehcat-env.sh $HCAT_HOME/conf export HADOOP_CONFIG_DIR=${HADOOP_HOME}/conf export HADOOP_PREFIX=${HADOOP_HOME} export TEMPLETON_HOME=${HCAT_HOME} export HCAT_PREFIX=${HCAT_HOME} Create compressed copies of the directories for Pig and Hive from with the following commands: /opt/mapr 2. 3. 4. 5. 6. 7. 8. tar -cvzf hive.tar.gz $HIVE_HOME/* tar -cvzf pig.tar.gz $PIG_HOME/* Place these compressed files in any directory in MapR-FS on your cluster. Copy the file from hadoop-0.20.2-dev-streaming.jar / to the same directory. opt/mapr/hadoop/hadoop-0.20.2/contrib/streaming Create a file named in the directory. Specify the following properties in the webhcat-site.xml $HCAT_HOME/conf webhcat-site.x file: ml Set the value of the property to the MapR-FS URI where the compressed Pig tar.gz file is located, templeton.pig.archive such as . maprfs:///user/sampleuser1/templeton/pig.tar.gz Set the value of the property to the path inside the compressed Pig tar.gz file where the Pig binary is templeton.pig.path located, such as . pig.tar.gz/bin/pig Set the value of the property to the MapR-FS URI where the compressed Hive tar.gz file is templeton.hive.archive located, such as . maprfs:///user/sampleuser1/templeton/hive.tar.gz Set the value of the property to the path inside the compressed Hive tar.gz file where the Hive binary templeton.hive.path is located, such as . hive.tar.gz/bin/pig Set the value of the property to . templeton.storage.class org.apache.hcatalog.templeton.tool.HDFSStorage From the $HCAT_HOME directory, run the command to configure the environment for the WebHCat ./sbin/webhcat_config.sh server. Start from the $HCAT_HOME directory: webhcat ./sbin/webhcat_server.sh start Check the directory for error logs. $WEBHCAT_LOG_DIR Verify the server's status by navigating to . A running server returns the following string: http://hostname:50111/templeton/v1/status Error rendering macro 'code' : Invalid value specified for parameter lang You can specify the value of the port in the property of the file. templeton.port webhcat-site.xml Example webhcat-site.xml file Click here to expand... <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. 
--> <configuration> <property> <name>templeton.port</name> <value>50111</value> </property> <property> <name>templeton.hadoop.config.dir</name> <value>$(env.HADOOP_CONFIG_DIR)</value> <description>The path to the Hadoop configuration.</description> </property> <property> <name>templeton.jar</name> <value>${env.TEMPLETON_HOME}/share/webhcat/svr/webhcat-0.5.0-SNAPSHOT.jar</value> <description>The path to the Templeton jar file.</description> </property> <property> <name>templeton.libjars</name> <value>${env.TEMPLETON_HOME}/share/webhcat/svr/lib/zookeeper-3.4.3.jar</value> <description>Jars to add to the classpath.</description> </property> <!-- <property> <name>templeton.override.jars</name> <value>hdfs:///user/templeton/ugi.jar</value> <description>Jars to add to the HADOOP_CLASSPATH for all Map Reduce jobs. These jars must exist on HDFS.</description> </property> --> <property> <name>templeton.override.enabled</name> <value>false</value> <description>Enable the override path in templeton.override.jars</description> </property> <property> <name>templeton.streaming.jar</name> <value>maprfs:///user/user1/templeton/hadoop-streaming-0.20.2+737.jar</value> <description>The hdfs path to the Hadoop streaming jar file.</description> </property> <property> <name>templeton.hadoop</name> <value>${env.HADOOP_PREFIX}/bin/hadoop</value> <description>The path to the Hadoop executable.</description> </property> <property> <name>templeton.pig.archive</name> <value>maprfs:///user/user1/templeton/pig-p.10.tar.gz</value> <description>The path to the Pig archive.</description> </property> <property> <name>templeton.pig.path</name> <value>pig-p.10.tar.gz/bin/pig</value> <description>The path to the Pig executable.</description> </property> <property> <name>templeton.hcat</name> <value>${env.HCAT_PREFIX}/bin/hcat</value> <description>The path to the Hcatalog executable.</description> </property> <property> <name>templeton.hive.archive</name> <value>maprfs:///user/user1/templeton/hive-0.11-mapr.tar.gz</value> <description>The path to the Hive archive.</description> </property> <property> <name>templeton.hive.path</name> <value>hive-0.11-mapr.tar.gz/bin/hive</value> <description>The path to the Hive executable.</description> </property> <property> <name>templeton.hive.properties</name> <value>hive.metastore.local=false, hive.metastore.uris=thrift://10.250.0.90:9083, hive.metastore.sasl.enabled=false</value> <description>Properties to set when running hive.</description> </property> <property> <name>templeton.exec.encoding</name> <value>UTF-8</value> <description>The encoding of the stdout and stderr data.</description> </property> <property> <name>templeton.exec.timeout</name> <value>10000</value> <description>How long in milliseconds a program is allowed to run on the Templeton box.</description> </property> <property> <name>templeton.exec.max-procs</name> <value>16</value> <description>The maximum number of processes allowed to run at once.</description> </property> <property> <name>templeton.exec.max-output-bytes</name> <value>1048576</value> <description>The maximum number of bytes from stdout or stderr stored in ram.</description> </property> <property> <name>templeton.controller.mr.child.opts</name> <value>-server -Xmx256m -Djava.net.preferIPv4Stack=true</value> <description>Java options to be passed to templeton controller map task.</description> </property> <property> <name>templeton.exec.envs</name> <value>HADOOP_PREFIX,HADOOP_HOME,JAVA_HOME,HIVE_HOME</value> <description>The environment variables passed 
through to exec.</description> </property> <property> <name>templeton.zookeeper.hosts</name> <value>10.250.0.90:2181</value> <description>ZooKeeper servers, as comma separated host:port pairs</description> </property> <property> <name>templeton.zookeeper.session-timeout</name> <value>30000</value> <description>ZooKeeper session timeout in milliseconds</description> </property> <property> <name>templeton.callback.retry.interval</name> <value>10000</value> <description>How long to wait between callback retry attempts in milliseconds</description> </property> <property> <name>templeton.callback.retry.attempts</name> <value>5</value> <description>How many times to retry the callback</description> </property> <property> <name>templeton.storage.class</name> <value>org.apache.hcatalog.templeton.tool.HDFSStorage</value> <description>The class to use as storage</description> </property> <property> <name>templeton.storage.root</name> <value>maprfs:///user/user1/templeton</value> <description>The path to the directory to use for storage</description> </property> <property> <name>templeton.hdfs.cleanup.interval</name> <value>43200000</value> <description>The maximum delay between a thread's cleanup checks</description> </property> <property> <name>templeton.hdfs.cleanup.maxage</name> <value>604800000</value> <description>The maximum age of a templeton job</description> </property> <property> <name>templeton.zookeeper.cleanup.interval</name> <value>43200000</value> <description>The maximum delay between a thread's cleanup checks</description> </property> <property> <name>templeton.zookeeper.cleanup.maxage</name> <value>604800000</value> <description>The maximum age of a templeton job</description> </property> <!-- <property> <name>templeton.kerberos.secret</name> <value>A random value</value> <description>The secret used to sign the HTTP cookie value. The default value is a random value. Unless multiple Templeton instances need to share the secret the random value is adequate.</description> </property> <property> <name>templeton.kerberos.principal</name> <value>None</value> <description>The Kerberos principal to used by the server. As stated by the Kerberos SPNEGO specification, it should be USER/${HOSTNAME}@{REALM}. It does not have a default value.</description> </property> <property> <name>templeton.kerberos.keytab</name> <value>None</value> <description>The keytab file containing the credentials for the Kerberos principal.</description> </property> --> </configuration> REST calls in WebHCat The base URI for REST calls in WebHCat is . The following tables elements appended to the base http://<host>:<port>/templeton/v1/ URI. URI Description Server Information /status Shows WebHCat server status. /version Shows WebHCat server version. DDL Commands /ddl/database List existing databases. /ddl/database/<mydatabase> Shows properties for the database named . mydatabase /ddl/database/<mydatabase>/table Shows tables in the database named . mydatabase /ddl/database/<mydatabase>/table/<mytable> Shows the table definition for the table named in the mytable database named . mydatabase /ddl/database/<mydatabase>/table/<mytable>/property Shows the table properties for the table named in the mytable database named . mydatabase The Job Queue To shot HCatalog jobs for a particular user, navigate to the following address: http://<hostname>:<port>/templeton/v1/queue/?user.name=<username> The default port for HCatalog is 50111. 
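As an illustration, you can exercise these endpoints with any HTTP client such as curl. The host name and user name below are placeholders, and the status response shown is only the typical form of the reply; this is a sketch, not output captured from a MapR cluster.

   # Check WebHCat server status; a healthy server typically returns a small
   # JSON document such as {"status":"ok","version":"v1"}.
   curl -s 'http://<hostname>:50111/templeton/v1/status'

   # List the databases known to HCatalog.
   curl -s 'http://<hostname>:50111/templeton/v1/ddl/database?user.name=<username>'

   # Show the job queue for a particular user.
   curl -s 'http://<hostname>:50111/templeton/v1/queue/?user.name=<username>'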
Working with Mahout The Apache Mahout™ machine learning library's goal is to build scalable machine learning libraries. Mahout currently offers: Collaborative Filtering User and Item based recommenders K-Means, Fuzzy K-Means clustering Mean Shift clustering Dirichlet process clustering Latent Dirichlet Allocation Singular value decomposition Parallel Frequent Pattern mining Complementary Naive Bayes classifier Random forest decision tree based classifier High performance java collections (previously colt collections) A known HCatalog bug exists that fetches information for any valid job instead of checking that the job is an HCatalog job or was started by the specified user. 1. 2. 3. A vibrant community This section contains documentation on working with Mahout on the MapR distribution for Apache Hadoop. You can refer also to documentation available from the . This section provides all relevant details about using Mahout with MapR, but does not duplicate Apache Mahout project Apache documentation. To install Mahout, see the section of the Administration Guide. Mahout Topics in This Section Upgrading Mahout Related Links Apache Mahout project MapR Forum posts related to Mahout Search the MapR Blog for Mahout topics Upgrading Mahout This page contains the following topics describing how to upgrade Mahout in the MapR distribution for Apache Hadoop: Update Repositories or Download Packages Migrating Configuration Files Version-Specific Considerations Upgrading the Software Before you upgrade, make sure that the version of the MapR core software on your cluster supports the version of Mahout you want to upgrade to. See the . Mahout Release Notes Update Repositories or Download Packages MapR's and repositories always contain the Mahout version recommended for the latest release of the MapR core. The repositories are rpm deb located at . You can also prepare a local repository with any version of Mahout you http://package.mapr.com/releases/ecosystem/ need. For more details on setting up repositories, see . Preparing Packages and Repositories If you don't want to install from a repository, you can download the package file for the specific release you want and install it manually. Individual package files are located at . http://package.mapr.com/releases/ecosystem-all/ To update the repository cache If you plan to install from a repository, update the repository cache on each node where Mahout is installed. On RedHat and CentOS yum clean all On Ubuntu apt-get update Migrating Configuration Files If you have changed configuration properties on your current installation of Mahout, you probably want to apply those changes to the updated version. Configuration properties are located in . /opt/mapr/mahout/mahout-<version>/conf/ In general, you can migrate your configuration changes with the following procedure: Before upgrade, save configuration files on all nodes where Mahout is installed. Upgrade Mahout software. Migrate custom configuration settings into the new default files in the directory. conf 1. 2. Version-Specific Considerations Before you upgrade the software, note if there are any version-specific considerations that apply to you. Packaging changes between Mahout releases 0.5 to 0.6 Packaging changes between Mahout releases 0.5 to 0.6 The following points apply when upgrading Mahout from 0.5 (or earlier) to 0.6 (or later). MapR did not distribute any releases of Mahout between these two versions. 
MapR's file naming convention changed between these releases, which caused a special case when upgrading. Because of a reversal in the alphanumeric order of the filenames, package managers incorrectly perceive the newer version to be a downgrade. When upgrading the software, package management tools might require you to specify a particular version, rather than automatically upgrading to the latest version. Mahout release 0.6 (and onward) is stored in a separate repository than the MapR core software. This release corresponded to the release of MapR core v2.0. Prior to v2.0, Mahout packages were located in the same repository with the MapR core. From MapR v2.0 onward, Mahout packages are located in a separate repository, which requires some consideration in setting up repositories. See Installin for details. g MapR Software Upgrading the Software Use one of the following methods to upgrade the Mahout component: To upgrade with a package manager To keep a prior version and install a newer version To upgrade with a package manager After configuring repositories so that the version you want to install is available, you can use a package manager to install from the repository. On RedHat and CentOS yum upgrade mapr-mahout On Ubuntu apt-get install mapr-mahout If you are upgrading from Mahout 0.5 or earlier, you might have to specify the particular version you want to upgrade to, because of #Version-Spe . cific Considerations To keep a prior version and install a newer version Mahout installs into separate directories named after the version, such as , so the files for multiple /opt/mapr/mahout/mahout-<version>/ versions can co-exist. To keep the prior version when installing a new version, you must manually install the package file for the new version. For example, to install version 0.7 build 18380 while keeping any previously installed version, perform the steps below. On RedHat and CentOS Download the RPM package file for version 0.7 from mapr-mahout http://package.mapr.com/releases/ecosystem-all/ . Install the package with . rpm rpm -i --force mapr-mahout-0.7.18380-GA.noarch.rpm On Ubuntu This process is not supported on Ubuntu, because and cannot manage multiple versions of a package with the same name. apt-get dpkg Working with Oozie Apache Oozie™ is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availabilty. Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts). This section contains documentation on working with Oozie on the MapR distribution for Apache Hadoop. You can refer also to documentation available from the . This section provides all relevant details about using Oozie with MapR, but does not duplicate Apache Apache Oozie project documentation. To install Oozie, see the section of the Administration Guide. 
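To make the DAG-of-actions model concrete, the following is a minimal, hypothetical workflow definition; it is not taken from the MapR documentation. All names, paths, and property values are placeholders, and job-specific settings such as mapper and reducer classes are omitted.

   <!-- Hedged sketch: one MapReduce action, then end or kill. -->
   <workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.2">
     <start to="mr-node"/>
     <action name="mr-node">
       <map-reduce>
         <job-tracker>${jobTracker}</job-tracker>
         <name-node>${nameNode}</name-node>
         <configuration>
           <property>
             <name>mapred.input.dir</name>
             <value>/user/${wf:user()}/input</value>
           </property>
           <property>
             <name>mapred.output.dir</name>
             <value>/user/${wf:user()}/output</value>
           </property>
         </configuration>
       </map-reduce>
       <ok to="end"/>
       <error to="fail"/>
     </action>
     <kill name="fail">
       <message>MapReduce action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
     </kill>
     <end name="end"/>
   </workflow-app>

A workflow definition like this is packaged with a job.properties file and submitted with the oozie command-line client.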
Oozie Topics in This Section Upgrading Oozie Related Links Apache Oozie project MapR Forum posts related to Oozie Search the MapR Blog for Oozie topics Upgrading Oozie This page contains the following topics describing how to upgrade Oozie in the MapR distribution for Apache Hadoop: Update Repositories or Download Packages Migrating Configuration Files Version-Specific Considerations Upgrading the Software Before you upgrade, make sure that the version of the MapR core software on your cluster supports the version of Oozie you want to upgrade to. See the . Oozie Release Notes Update Repositories or Download Packages MapR's and repositories always contain the Oozie version recommended for the latest release of the MapR core. The repositories are rpm deb located at . You can also prepare a local repository with any version of Oozie you need. http://package.mapr.com/releases/ecosystem/ For more details on setting up repositories, see . Preparing Packages and Repositories If you don't want to install from a repository, you can download the package file for the specific release you want and install it manually. Individual package files are located at . http://package.mapr.com/releases/ecosystem-all/ To update the repository cache If you plan to install from a repository, update the repository cache on each node where Oozie is installed. On RedHat and CentOS yum clean all On Ubuntu apt-get update Migrating Configuration Files If you have changed configuration properties on your current installation of Oozie, you probably want to apply those changes to the updated 1. 2. 3. version. Configuration properties are located in . /opt/mapr/oozie/oozie-<version>/conf/ In general, you can migrate your configuration changes with the following procedure: Before upgrade, save configuration files on all nodes where Oozie is installed. Upgrade Oozie software. Migrate custom configuration settings into the new default files in the directory. conf Version-Specific Considerations Before you upgrade the software, note if there are any version-specific considerations that apply to you. Packaging changes between Oozie releases 3.0.0 and 3.1.0 Packaging changes between Oozie releases 3.0.0 and 3.1.0 The following points apply when upgrading Oozie from 3.0.0 (or earlier) to 3.1.0 (or later). MapR did not distribute any releases of Oozie between these two versions. MapR's file naming convention changed between these releases, which caused a special case when upgrading. Because of a reversal in the alphanumeric order of the filenames, package managers incorrectly perceive the newer version to be a downgrade. When upgrading the software, package management tools might require you to specify a particular version, rather than automatically upgrading to the latest version. Oozie release 3.1.0 (and onward) is stored in a separate repository than the MapR core software. This release corresponded to the release of MapR core v2.0. Prior to v2.0, Oozie packages were located in the same repository with the MapR core. From MapR v2.0 onward, Oozie packages are located in a separate repository, which requires some consideration in setting up repositories. See Installing for details. MapR Software Oozie and Upgrading Your Core MapR Version When you upgrade the core MapR version on a cluster that already has Oozie installed, a packaging error results in Oozie building its WAR file with an incorrect version of  . 
To work around this issue, delete the  file after the upgrade is complete and re-run the  maprfs.jar oozie.war ooz script to rebuild the WAR file with the correct JARs: ie-setup.sh $ cd /opt/mapr/oozie/oozie-3.3.2 $ mv oozie.war.ORIG oozie.war $ bin/oozie-setup.sh -jars /opt/mapr/lib/zookeeper-3.3.6.jar -hadoop 0.20.2 /opt/mapr/hadoop/hadoop-0.20.2 Upgrading the Software Use one of the following methods to upgrade the Oozie component: To upgrade with a package manager To keep a prior version and install a newer version To upgrade with a package manager After configuring repositories so that the version you want to install is available, you can use a package manager to install from the repository. On RedHat and CentOS yum upgrade mapr-oozie-internal mapr-oozie On Ubuntu apt-get install mapr-oozie-internal mapr-oozie If you are upgrading from Oozie 3.0.0 or earlier, you might have to specify the particular version you want to upgrade to, because of #Version-Spe . cific Considerations 1. 2. 1. 2. 3. To keep a prior version and install a newer version Oozie installs into separate directories named after the version, such as , so the files for multiple /opt/mapr/oozie/oozie-<version>/ versions can co-exist. To keep the prior version when installing a new version, you must manually install the package file for the new version. For example, to install version 3.1.0 build 18380 while keeping any previously installed version, perform the steps below. On RedHat and CentOS Download the RPM package files for and version 3.1.0 from mapr-oozie mapr-oozie-internal http://package.mapr.com/releases/ecosystem-all/ . Install the package with . rpm rpm -i --force mapr-oozie-internal-3.3.0.18380-GA.noarch.rpm mapr-oozie-3.3.0.18380-GA.noarch.rpm On Ubuntu This process is not supported on Ubuntu, because and cannot manage multiple versions of a package with the same name. apt-get dpkg Working with Pig Apache Pig™ is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets. Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called . Pig Latin This section contains documentation on working with Pig on the MapR distribution for Apache Hadoop. You can refer also to documentation available from the . This section provides all relevant details about using Pig with MapR, but does Apache Pig project not duplicate Apache documentation. To install Pig, see the section of the Administration Guide. Pig Topics in This Section Upgrading Pig Integrating Pig and MapR Tables To configure Pig to work with MapR tables, perform the following steps: On the client node where Pig is installed, add the following string to : /opt/mapr/conf/env.sh export PIG_CLASSPATH=$PIG_CLASSPATH:/location-to-hbase-jar If the client node where Pig is installed also has either the or packages installed, mapr-hbase-regionserver mapr-hbase-master add the location of the file to the variable from the previous step: hbase-0.92-1.jar PIG_CLASSPATH export PIG_CLASSPATH="$PIG_CLASSPATH:/opt/mapr/hbase/hbase-0.92.1/hbase-0.92.1.jar" 3. 4. 5. 1. 2. 3. 4. 5. 6. 
If the client node where Pig is installed does not have any HBase packages installed, copy the HBase JAR from a node that does have HBase installed to a location on the Pig client node. Add the HBase JAR's location to the definition from previous steps: export PIG_CLASSPATH=$PIG_CLASSPATH:/opt/mapr/lib/hbase-0.92.1.jar Add the HBase JAR to the Hadoop classpath: export HADOOP_CLASSPATH="/opt/mapr/hbase/hbase-0.94.5/hbase-0.94.5-mapr.jar:$HADOOP_CLAS SPATH" Launch a Pig job and verify that Pig can access HBase tables by using the HBase table name directly. Do not use the prefix. hbase:// Integrating Pig and Apache HBase To configure Pig to work with Apache HBase tables, perform the following steps: On the client node where Pig is installed, add the following string to : /opt/mapr/conf/env.sh export PIG_CLASSPATH=$PIG_CLASSPATH:/location-to-hbase-jar If the client node where Pig is installed also has either the or packages installed, mapr-hbase-regionserver mapr-hbase-master add the location of the file to the variable from the previous step: hbase-0.92-1.jar PIG_CLASSPATH export PIG_CLASSPATH="$PIG_CLASSPATH:/opt/mapr/hbase/hbase-0.92.1/hbase-0.92.1.jar" If the client node where Pig is installed does not have any HBase packages installed, copy the HBase JAR from a node that does have HBase installed to a location on the Pig client node. Add the HBase JAR's location to the definition from previous steps: export PIG_CLASSPATH=$PIG_CLASSPATH:/opt/mapr/lib/hbase-0.92.1.jar List the cluster's zookeeper nodes: maprcli node listzookeepers Add the following variable to the file; /opt/mapr/conf/env.sh export PIG_OPTS="-Dhbase.zookeeper.property.clientPort=5181 -Dhbase.zookeeper.quorum=<comma-separated list of ZooKeeper IP addresses>" Launch a Pig job and verify that Pig can access HBase tables by using the HBase table name directly. Do not use the prefix. hbase:// Sample file for HBase and Pig integration env.sh [root@nmk-centos-60-3 ~]# cat /opt/mapr/conf/env.sh #!/bin/bash # Copyright (c) 2009 & onwards. MapR Tech, Inc., All rights reserved # Please set all environment variable you want to be used during MapR cluster # runtime here. 
# namely MAPR_HOME, JAVA_HOME, MAPR_SUBNETS export PIG_OPTS="-Dhbase.zookeeper.property.clientPort=5181 -Dhbase.zookeeper.quorum=10.10.80.61,10.10.80.62,10.10.80.63" export PIG_CLASSPATH="$PIG_CLASSPATH:/opt/mapr/hbase/hbase-0.92.1/conf:/usr/java/default/lib/ tools.jar:/opt/mapr/hbase/hbase-0.92.1:/opt/mapr/hbase/hbase-0.92.1/hbase-0.92.1.jar" export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$PIG_CLASSPATH" export CLASSPATH="$CLASSPATH:$HADOOP_CLASSPATH" #export JAVA_HOME= #export MAPR_SUBNETS= #export MAPR_HOME= #export MAPR_ULIMIT_U= #export MAPR_ULIMIT_N= #export MAPR_SYSCTL_SOMAXCONN= #export PIG_CLASSPATH=:$PIG_CLASSPATH [root@nmk-centos-60-3 ~]# Sample HBase insertion script [root@nmk-centos-60-3 nabeel]# cat hbase_pig.pig raw_data = LOAD '/user/mapr/input2.csv' USING PigStorage(',') AS ( listing_id: chararray, fname: chararray, lname: chararray ); STORE raw_data INTO 'sample_names' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ( 'info:fname info:lname'); Related Links Apache Pig project MapR Forum posts related to Pig Search the MapR Blog for Pig topics Upgrading Pig This page contains the following topics describing how to upgrade Pig in the MapR distribution for Apache Hadoop: Update Repositories or Download Packages Migrating Configuration Files Version-Specific Considerations Upgrading the Software Before you upgrade, make sure that the version of the MapR core software on your cluster supports the version of Pig you want to upgrade to. See the . Pig Release Notes Update Repositories or Download Packages 1. 2. 3. MapR's and repositories always contain the Pig version recommended for the latest release of the MapR core. The repositories are rpm deb located at . You can also prepare a local repository with any version of Pig you need. http://package.mapr.com/releases/ecosystem/ For more details on setting up repositories, see . Preparing Packages and Repositories If you don't want to install from a repository, you can download the package file for the specific release you want and install it manually. Individual package files are located at . http://package.mapr.com/releases/ecosystem-all/ To update the repository cache If you plan to install from a repository, update the repository cache on each node where Pig is installed. On RedHat and CentOS yum clean all On Ubuntu apt-get update Migrating Configuration Files If you have changed configuration properties on your current installation of Pig, you probably want to apply those changes to the updated version. Configuration properties are located in . /opt/mapr/pig/pig-<version>/conf/ In general, you can migrate your configuration changes with the following procedure: Before upgrade, save configuration files on all nodes where Pig is installed. Upgrade Pig software. Migrate custom configuration settings into the new default files in the directory. conf Version-Specific Considerations Before you upgrade the software, note if there are any version-specific considerations that apply to you. Packaging changes between Pig releases 0.9.0 and 0.10.0 Packaging changes between Pig releases 0.9.0 and 0.10.0 The following points apply when upgrading Pig from 0.9.0 (or earlier) to 0.10.0 (or later). MapR did not distribute any releases of Pig between these two versions. MapR's file naming convention changed between these releases, which caused a special case when upgrading. Because of a reversal in the alphanumeric order of the filenames, package managers incorrectly perceive the newer version to be a downgrade. 
When upgrading the software, package management tools might require you to specify a particular version, rather than automatically upgrading to the latest version. Pig release 0.10.0 (and onward) is stored in a separate repository than the MapR core software. This release corresponded to the release of MapR core v2.0. Prior to v2.0, Pig packages were located in the same repository with the MapR core. From MapR v2.0 onward, Pig packages are located in a separate repository, which requires some consideration in setting up repositories. See Installing for details. MapR Software MapR packaged Pig 0.9.0 (and earlier) as two separate packages, and . Starting with Pig release mapr-pig mapr-pig-internal 0.10.0, MapR packages Pig as one package . When upgrading from 0.9.0 (or earlier), if you upgrade only the pack mapr-pig mapr-pig age, the package manager will leave files in place. You have to explicitly uninstall to clean mapr-pig-internal mapr-pig-internal the older version from the node. Upgrading the Software Use one of the following methods to upgrade the Pig component: To upgrade with a package manager To manually remove a prior version and install the latest version in the repository To keep a prior version and install a newer version 1. 2. To upgrade with a package manager After configuring repositories so that the version you want to install is available, you can use a package manager to install from the repository. On RedHat and CentOS yum upgrade mapr-pig On Ubuntu apt-get install mapr-pig If you are upgrading from Pig 0.9.0, you might have to specify the particular version you want to upgrade to, because of #Version-Specific . Considerations To manually remove a prior version and install the latest version in the repository If you are upgrading from Pig 0.9.0, this process might be necessary to remove the package which is no longer part of mapr-pig-internal MapR's Pig release 0.10.0 and onward. Run the package manager twice, first to remove the old version, and again to install the new version. For example, to upgrade from version 0.9.0 to version 0.10.0, perform the steps below. In this case, we assume the repository is set up on the node and 0.10.0 is the latest version in the repository. On RedHat and CentOS yum remove mapr-pig mapr-pig-internal yum install mapr-pig On Ubuntu apt-get remove mapr-pig mapr-pig-internal apt-get install mapr-pig To keep a prior version and install a newer version Pig installs into separate directories named after the version, such as , so the files for multiple versions can /opt/mapr/pig/pig-<version>/ co-exist. To keep the prior version when installing a new version, you must manually install the package file for the new version. For example, to install version 0.10.0 build 18380 while keeping any previously installed version, perform the steps below. On RedHat and CentOS Download the RPM package file for version 0.10.0 from mapr-pig http://package.mapr.com/releases/ecosystem-all/ . Install the package with . rpm rpm -i --force mapr-pig-0.10.18380-GA.noarch.rpm Copy custom configuration files in to a safe location before proceeding. /opt/mapr/pig/pig-<version>/conf On Ubuntu This process is not supported on Ubuntu, because and cannot manage multiple versions of a package with the same name. apt-get dpkg Working with Sqoop Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. 
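As a simple illustration of that kind of transfer, the sketch below imports one relational table into MapR-FS with the sqoop import tool. The JDBC URL, credentials, table name, and target directory are placeholders, not values from the MapR documentation.

   # Hedged sketch: import a MySQL table into MapR-FS with four parallel tasks.
   sqoop import \
     --connect jdbc:mysql://<dbhost>/<database> \
     --username <dbuser> -P \
     --table orders \
     --target-dir /user/mapr/sqoop/orders \
     -m 4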
This section contains documentation on working with Sqoop on the MapR distribution for Apache Hadoop. You can refer also to documentation available from the . This section provides all relevant Apache Sqoop project details about using Sqoop with MapR, but does not duplicate Apache documentation. To install Sqoop, see the section of the Administration Guide. Sqoop Topics in This Section Upgrading Sqoop Related Links Apache Sqoop project MapR Forum posts related to Sqoop Search the MapR Blog for Sqoop topics Upgrading Sqoop This page contains the following topics describing how to upgrade Sqoop in the MapR distribution for Apache Hadoop: Update Repositories or Download Packages Migrating Configuration Files Version-Specific Considerations Upgrading the Software Before you upgrade, make sure that the version of the MapR core software on your cluster supports the version of Sqoop you want to upgrade to. See the . Sqoop Release Notes Update Repositories or Download Packages MapR's and repositories always contain the Sqoop version recommended for the latest release of the MapR core. The repositories are rpm deb located at . You can also prepare a local repository with any version of Sqoop you http://package.mapr.com/releases/ecosystem/ need. For more details on setting up repositories, see . Preparing Packages and Repositories If you don't want to install from a repository, you can download the package file for the specific release you want and install it manually. Individual package files are located at . http://package.mapr.com/releases/ecosystem-all/ To update the repository cache If you plan to install from a repository, update the repository cache on each node where Sqoop is installed. On RedHat and CentOS yum clean all On Ubuntu apt-get update Migrating Configuration Files If you have changed configuration properties on your current installation of Sqoop, you probably want to apply those changes to the updated version. Configuration properties are located in . /opt/mapr/sqoop/sqoop-<version>/conf/ 1. 2. 3. In general, you can migrate your configuration changes with the following procedure: Before upgrade, save configuration files on all nodes where Sqoop is installed. Upgrade Sqoop software. Migrate custom configuration settings into the new default files in the directory. conf Version-Specific Considerations Before you upgrade the software, note if there are any version-specific considerations that apply to you. Packaging changes between Sqoop releases 1.3.0 and 1.4.1 Packaging changes between Sqoop releases 1.3.0 and 1.4.1 The following points apply when upgrading Sqoop from 1.3.0 (or earlier) to 1.4.1 (or later). MapR did not distribute any releases of Sqoop between these two versions. MapR's file naming convention changed between these releases, which caused a special case when upgrading. Because of a reversal in the alphanumeric order of the filenames, package managers incorrectly perceive the newer version to be a downgrade. When upgrading the software, package management tools might require you to specify a particular version, rather than automatically upgrading to the latest version. Sqoop release 1.4.1 (and onward) is stored in a separate repository than the MapR core software. This release corresponded to the release of MapR core v2.0. Prior to v2.0, Sqoop packages were located in the same repository with the MapR core. From MapR v2.0 onward, Sqoop packages are located in a separate repository, which requires some consideration in setting up repositories. 
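For example, before upgrading you might copy that directory aside; the backup location below is only a suggestion.

   # Hedged sketch: keep a dated copy of the existing Sqoop configuration.
   cp -a /opt/mapr/sqoop/sqoop-<version>/conf ~/sqoop-conf-backup-$(date +%Y%m%d)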
See Installin for details. g MapR Software MapR packaged Sqoop 1.3.0 (and earlier) as two separate packages, and . Starting with Sqoop mapr-sqoop mapr-sqoop-internal release 1.4.1, MapR packages Sqoop as one package . When upgrading from 1.3.0 (or earlier), if you upgrade only the mapr-sqoop map package, the package manager will leave files in place. You have to explicitly uninstall r-sqoop mapr-sqoop-internal mapr-sqoop to clean the older version from the node. -internal Upgrading the Software Use one of the following methods to upgrade the Sqoop component: To upgrade with a package manager To manually remove a prior version and install the latest version in the repository To keep a prior version and install a newer version To upgrade with a package manager After configuring repositories so that the version you want to install is available, you can use a package manager to install from the repository. On RedHat and CentOS yum upgrade mapr-sqoop On Ubuntu apt-get install mapr-sqoop If you are upgrading from Sqoop 1.3.0, you might have to specify the particular version you want to upgrade to, because of #Version-Specific . Considerations To manually remove a prior version and install the latest version in the repository If you are upgrading from Sqoop 1.3.0 (or earlier), this process might be necessary to remove the package which is no mapr-sqoop-internal longer part of MapR's Sqoop release 1.4.1 and onward. Run the package manager twice, first to remove the old version, and again to install the new version. Copy custom configuration files in to a safe location before proceeding. /opt/mapr/sqoop/sqoop-<version>/conf 1. 2. For example, to upgrade from version 1.3.0 to version 1.4.1, perform the steps below. In this case, we assume the repository is set up on the node and 1.4.1 is the latest version in the repository. On RedHat and CentOS yum remove mapr-sqoop mapr-sqoop-internal yum install mapr-sqoop On Ubuntu apt-get remove mapr-sqoop mapr-sqoop-internal apt-get install mapr-sqoop To keep a prior version and install a newer version Sqoop installs into separate directories named after the version, such as , so the files for multiple /opt/mapr/sqoop/sqoop-<version>/ versions can co-exist. To keep the prior version when installing a new version, you must manually install the package file for the new version. For example, to install version 1.4.1 build 18380 while keeping any previously installed version, perform the steps below. On RedHat and CentOS Download the RPM package file for version 1.4.1 from mapr-sqoop http://package.mapr.com/releases/ecosystem-all/ . Install the package with . rpm rpm -i --force mapr-sqoop-1.4.2.18380-1.noarch.rpm On Ubuntu This process is not supported on Ubuntu, because and cannot manage multiple versions of a package with the same name. apt-get dpkg Working with Whirr Apache Whirr™ is a set of libraries for running cloud services. Whirr provides: A cloud-neutral way to run services. You don't have to worry about the idiosyncrasies of each provider. A common service API. The details of provisioning are particular to the service. Smart defaults for services. You can get a properly configured system running quickly, while still being able to override settings as needed. You can also use Whirr as a command line tool for deploying clusters. This section contains documentation on working with Whirr on the MapR distribution for Apache Hadoop. You can refer also to documentation available from the . 
This section provides all relevant Apache Whirr project details about using Whirr with MapR, but does not duplicate Apache documentation. To install Whirr, see the section of the Administration Guide. Whirr Topics in This Section Upgrading Whirr Related Links Apache Whirr project MapR Forum posts related to Whirr 1. 2. 3. Search the MapR Blog for Flume topics Upgrading Whirr This page contains the following topics describing how to upgrade Whirr in the MapR distribution for Apache Hadoop: Update Repositories or Download Packages Migrating Configuration Files Version-Specific Considerations Upgrading the Software Before you upgrade, make sure that the version of the MapR core software on your cluster supports the version of Whirr you want to upgrade to. See the . Whirr Release Notes Update Repositories or Download Packages MapR's and repositories always contain the Whirr version recommended for the latest release of the MapR core. The repositories are rpm deb located at . You can also prepare a local repository with any version of Whirr you need. http://package.mapr.com/releases/ecosystem/ For more details on setting up repositories, see . Preparing Packages and Repositories If you don't want to install from a repository, you can download the package file for the specific release you want and install it manually. Individual package files are located at . http://package.mapr.com/releases/ecosystem-all/ To update the repository cache If you plan to install from a repository, update the repository cache on each node where Whirr is installed. On RedHat and CentOS yum clean all On Ubuntu apt-get update Migrating Configuration Files If you have changed configuration properties on your current installation of Whirr, you probably want to apply those changes to the updated version. Configuration properties are located in . /opt/mapr/whirr/whirr-<version>/conf/ In general, you can migrate your configuration changes with the following procedure: Before upgrade, save configuration files on all nodes where Whirr is installed. Upgrade Whirr software. Migrate custom configuration settings into the new default files in the directory. conf Version-Specific Considerations Before you upgrade the software, note if there are any version-specific considerations that apply to you. Packaging changes between Whirr releases 0.3.0 and 0.7.0 Packaging changes between Whirr releases 0.3.0 and 0.7.0 The following points apply when upgrading Whirr from 0.3.0 (or earlier) to 0.7.0 (or later). MapR did not distribute any releases of Whirr between these two versions. MapR's file naming convention changed between these releases, which caused a special case when upgrading. Because of a reversal in the alphanumeric order of the filenames, package managers incorrectly perceive the newer version to be a downgrade. When upgrading the software, package management tools might require you to specify a particular version, rather than automatically upgrading to the latest version. Whirr release 0.7.0 (and onward) is stored in a separate repository than the MapR core software. This release corresponded to the 1. 2. 1. 2. release of MapR core v2.0. Prior to v2.0, Whirr packages were located in the same repository with the MapR core. From MapR v2.0 onward, Whirr packages are located in a separate repository, which requires some consideration in setting up repositories. See Installing for details. 
MapR Software Upgrading the Software Use one of the following methods to upgrade the Whirr component: To upgrade with a package manager To keep a prior version and install a newer version To upgrade with a package manager After configuring repositories so that the version you want to install is available, you can use a package manager to install from the repository. On RedHat and CentOS yum upgrade mapr-whirr On Ubuntu apt-get install mapr-whirr If you are upgrading from Whirr 0.3.0 or earlier, you might have to specify the particular version you want to upgrade to, because of #Version-Spe . cific Considerations To keep a prior version and install a newer version Whirr installs into separate directories named after the version, such as , so the files for multiple /opt/mapr/whirr/whirr-<version>/ versions can co-exist. To keep the prior version when installing a new version, you must manually install the package file for the new version. For example, to install version 0.8.1 build 18380 while keeping any previously installed version, perform the steps below. On RedHat and CentOS Download the RPM package file for version 0.8.1 from mapr-whirr http://package.mapr.com/releases/ecosystem-all/ . Install the package with . rpm rpm -i --force mapr-whirr-0.8.1.18380-GA.noarch.rpm On Ubuntu This process is not supported on Ubuntu, because and cannot manage multiple versions of a package with the same name. apt-get dpkg Integrating MapR's GitHub Repositories With Your IDE The steps in this procedure walk you through cloning the repository for a MapR open source project into your Eclipse IDE. Open the Git Repository perspective by selecting , then choosing . Window > Open Perspective > Other... Git Repository Exploring From the Git Repository perspective, click the button to display the dialog. Clone Git Repository 2. 3. 4. 5. 6. 7. 1. 2. From a web browser, navigate to the MapR , then select the project you want to clone. repository Copy the git URI from the project page to your clipboard by clicking the button. In the dialog, paste the git URI into the field, then click . Eclipse will connect to github and download the Clone Git Repository URI: Next repository metadata, then display a list of branches. Select the branches you wish to clone, then click . Next Configure the destination directory, then click . Eclipse downloads the project from github and adds it to your view. Finish Troubleshooting Development Issues This section provides information about troubleshooting development problems. Click a subtopic below for more detail. Integrating MapR's GitHub and Maven Repositories With Your IDE The steps in this procedure walk you through cloning the GitHub and Maven repositories for a MapR open source project into your Eclipse IDE. Integrating Git Open the Git Repository perspective by selecting > > , then choosing . Window Open Perspective Other... Git Repository Exploring From the Git Repository perspective, click the button to display the dialog. Clone Git Repository 2. 3. 4. 5. 6. 7. 1. 2. 3. 4. From a web browser, navigate to the MapR , then select the project you want to clone. repository Copy the git URI from the project page to your clipboard by clicking the button. In the dialog, paste the git URI into the field, then click . Eclipse will connect to github and download the Clone Git Repository URI: Next repository metadata, then display a list of branches. Select the branches you wish to clone, then click . Next Configure the destination directory, then click . 
Eclipse downloads the project from github and adds it to your view. Finish Integrating Maven Start a new Maven project, or convert your current project into a Maven project if necessary. Select > > to show your current Maven project. Window Show View Package Explorer Add the following lines to your project's file: pom.xml <repositories> <repository> <id>mapr-releases</id> <url>http://repository.mapr.com/maven/</url> <snapshots><enabled>false</enabled></snapshots> <releases><enabled>true</enabled></releases> </repository> </repositories> In a browser, navigate to the MapR and search for the Maven artifact your project depends on. You can also th Maven Repository browse 4. 5. 6. 7. e repository. In the Package Explorer, right-click your project and select > . Maven Add Dependency Enter the , , and values for the dependency, then click . groupId artifactId version OK Refresh the workspace by pressing F5. Your Maven dependencies download automatically. Migration Guide This guide provides instructions for migrating business-critical data and applications from an Apache Hadoop cluster to a MapR cluster. The MapR distribution is 100% API-compatible with Apache Hadoop, and migration is a relatively straight-forward process. The additional features available in MapR provide new ways to interact with your data. In particular, MapR provides a fully read/write storage layer that can be mounted as a filesystem via NFS, allowing existing processes, legacy workflows, and desktop applications full access to the entire cluster. Migration consists of the following steps: Planning the Migration — Identify the goals of the migration, understand the differences between your current cluster and the MapR cluster, and identify potential gotchas. Initial MapR Deployment — Install, configure, and test the MapR cluster. Component Migration — Migrate your customized components to the MapR cluster. Application Migration — Migrate your applications to the MapR cluster and test using a small set of data. Data Migration — Migrate your data to the MapR cluster and test the cluster against performance benchmarks. Node Migration — Take down old nodes from the previous cluster and install them as MapR nodes. Planning the Migration The first phase of migration is planning. In this phase you will identify the requirements and goals of the migration, identify potential issues in the migration, and define a strategy. The requirements and goals of the migration depend on a number of factors: Data migration — can you move your datasets individually, or must the data be moved all at once? Downtime — can you tolerate downtime, or is it important to complete the migration with no interruption in service? Customization — what custom patches or applications are running on the cluster? Storage — is there enough space to store the data during the migration? The MapR Hadoop distribution is 100% plug-and-play compatible with Apache Hadoop, so you do not need to make changes to your applications to run them on a MapR cluster. MapR Hadoop automatically configures compression and memory settings, task heap sizes, and local volumes for shuffle data. Initial MapR Deployment The initial MapR deployment phase consists of installing, configuring, and testing the MapR cluster and any ecosystem components (such as Hive, HBase, or Pig) on an initial set of nodes. Once you have the MapR cluster deployed, you will be able to begin migrating data and applications. To deploy the MapR cluster on the selected nodes, follow the steps in the . 
Installation Guide 1. 2. 3. 4. Component Migration MapR Hadoop features the complete Hadoop distribution including components such as Hive and HBase. There are a few things to know about migrating Hive and HBase, or about migrating custom components you have patched yourself. Hive Migration Hive facilitates the analysis of large datasets stored in the Hadoop filesystem by organizing that data into tables that can be queried and analyzed using a dialect of SQL called HiveQL. The schemas that define these tables and all other Hive metadata are stored in a centralized repository called the . metastore If you would like to continue using Hive tables developed on an HDFS cluster in a MapR cluster, you can import Hive metadata from the metastore to recreate those tables in MapR. Depending on your needs, you can choose to import a subset of table schemas or the entire metastore in a single go. Importing table schemas into a MapR cluster Use this procedure to import a subset of Hive metastore from an HDFS cluster to a MapR cluster. This method is preferred when you want to test a subset of applications using a smaller subset of data. Use the following procedure to import Hive metastore data into a new metastore running on a node in the MapR cluster. You will need to redirect all of links that formerly pointed to the HDFS ( ) to point to MapR-FS ( ). hdfs://<namenode>:<port number>/<path> maprfs:///<path> Importing an entire Hive metastore into a MapR cluster Use this procedure to import an entire Hive metastore from an HDFS cluster to a MapR cluster. This method is preferred when you want to test all applications using a complete dataset. MySQL is a very popular choice for the Hive metastore and so we’ll use it as an example. If you are using another RDBMS, consult the relevant documentation. Ensure that both Hive and your database are installed on one of the nodes in the MapR cluster. For step-by-step instructions on setting up a standalone MySQL metastore, see . Setting Up Hive with a MySQL Metastore On the HDFS cluster, back up the metastore to a file. mysqldump [options] \--databases db_name... > filename Ensure that queries in the dumpfile point to the MapR-FS rather than HDFS. Search the dumpfile and edit all of the URIs that point to hdf so that they point to instead. s:// maprfs:/// Import the data from the dumpfile into the metastore running on the node in the MapR cluster: mysql [options] db_name < filename Using Hive with MapR volumes MapR-FS does not allow moving or renaming across volume boundaries. Be sure to set the Hive Scratch Directory and Hive Warehouse Directory in the same volume where the data for the Hive job resides before running the job. For more information see . Using Hive with MapR Volumes HBase Migration HBase is the Hadoop database, which provides random, real-time read/write access to very large datasets. The MapR Hadoop distribution includes HBase and is fully integrated with MapR enhancements for speed, usability, and dependability. MapR provides a (normally volume mounted at ) to store HBase data. /hbase HBase bulk load jobs: If you are currently using HBase bulk load jobs to import data into the HDFS, make sure to load your data into a path under the volume. /hbase Compression: The HBase write-ahead log (WAL) writes many tiny records, and compressing it would cause massive CPU load. Before using HBase, turn off MapR compression for directories in the HBase volume. For more information, see . 
Custom Components
If you have applied your own patches to a component and wish to continue to use that customized component with the MapR distribution, you should keep the following considerations in mind:
MapR libraries: All Hadoop components must point to MapR for the Hadoop libraries. Change any absolute paths. Do not hardcode hdfs:// or maprfs:// into your applications. This is also true of Hadoop ecosystem components that are not included in the MapR Hadoop distribution (such as Cascading). For more information, see Working with MapR-FS.
Component compatibility: Before you commit to the migration of a customized component (for example, customized HBase), check the MapR release notes to see if MapR Technologies has issued a patch that satisfies your business requirements. MapR Technologies publishes a list of Hadoop common patches and MapR patches with each release and makes those patches available for our customers to take, build, and deploy. For more information, see the MapR Release Notes.
ZooKeeper coordination service: Certain components, such as HBase, depend on ZooKeeper. When you migrate your customized component from the HDFS cluster to the MapR cluster, make sure it points correctly to the MapR ZooKeeper service.

Application Migration
In this phase you will migrate your applications to the MapR cluster test environment. The goal of this phase is to get your applications running smoothly on the MapR cluster using a subset of data. Once you have confirmed that all applications and components are running as expected, you can begin migrating your data.
Migrating your applications from HDFS to MapR is relatively easy. MapR Hadoop is 100% plug-and-play compatible with Apache Hadoop, so you do not need to make changes to your applications to run them on a MapR cluster.
Application Migration Guidelines
Keep the following considerations in mind when you migrate your applications (see the example launcher sketch after this list):
MapR Libraries — Ensure that your applications can find the libraries and configuration files they expect. Make sure the Java classpath includes the path to maprfs.jar and that java.library.path includes libMapRClient.so.
MapR Storage — Every application must point to MapR-FS (maprfs:///) rather than HDFS (hdfs://). If your application uses fs.default.name, then it will work automatically. If you have hardcoded HDFS links into your applications, you must redirect those links so that they point to MapR-FS. Setting a default path of maprfs:/// tells your applications to use the cluster specified in the first line of mapr-clusters.conf. You can also specify a particular cluster with maprfs://<cluster name>/.
Permissions — The distcp command does not copy permissions; permissions defined in HDFS do not transfer automatically to MapR-FS. MapR uses a combination of access control lists (ACLs) to specify cluster- or volume-level permissions and file permissions to manage directory and file access. You must define these permissions in MapR when you migrate your customized components, applications, and data. For more information, see Managing Permissions.
Memory — Remove explicit memory settings defined in your applications. If memory is set explicitly in the application, the jobs may fail after migration to MapR.
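As an illustration of the MapR Libraries and MapR Storage guidelines above, the sketch below launches a job from a MapR client with the MapR filesystem jar on the classpath and the native client library on java.library.path. The jar name, main class, and input/output paths are invented for the example, and the library paths are the defaults used elsewhere in this guide; adjust everything to match your installation.

   # Put the MapR filesystem jar on the Hadoop classpath
   export HADOOP_CLASSPATH=/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-0.1.jar:${HADOOP_CLASSPATH}

   # Make libMapRClient.so visible to the JVM that launches the job
   export HADOOP_OPTS="-Djava.library.path=/opt/mapr/hadoop/hadoop-0.20.2/lib/native/Linux-amd64-64 ${HADOOP_OPTS}"

   # Refer to MapR-FS paths; with fs.default.name set to maprfs:/// in core-site.xml,
   # the plain paths /user/alice/input and /user/alice/output would resolve to the same locations
   hadoop jar my-application.jar com.example.MyJob \
       maprfs:///user/alice/input maprfs:///user/alice/output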
Application Migration Roadmap
Generally, the best approach to migrating your applications to MapR is to import a small subset of data and test and tune your application using that data in a test environment before you import your production data.
The following procedure offers a simple roadmap for migrating and running your applications in a MapR cluster test environment.
1. Copy a small amount of data over to the MapR cluster. Use the hadoop distcp hftp command to copy over a small number of files:
   $ hadoop distcp hftp://namenode1:50070/foo maprfs:///bar
   You must specify the NameNode IP address, port number, and source directory on the HDFS cluster. For more information, see Copying Data from Apache Hadoop.
2. Run the application.
3. Add more data and test again.
4. When the application is running to your satisfaction, use the same process to test and tune another application.

Data Migration
Once you have installed and configured your MapR cluster in a test environment and migrated your applications to the MapR cluster, you can begin to copy over your data from the Apache Hadoop HDFS to the MapR cluster. In the application migration phase, you should have already moved over small amounts of data using the hadoop distcp hftp command (see Application Migration Roadmap). While this method is ideal for copying the very small amounts of data required for an initial test, you must use different methods to migrate your data in bulk.
There are two ways to migrate large datasets from an HDFS cluster to MapR:
Distributed Copy — Use the hadoop distcp command to copy data from HDFS to MapR-FS. This is the preferred method for moving large amounts of data.
Push Data — If the HDFS cluster and the MapR cluster do not use the same version of the RPC protocol, or if for some other reason you cannot use the hadoop distcp command, you can push data from HDFS to MapR-FS.
Important: Ensure that you have laid out your volumes and defined policies before you migrate your data from the HDFS cluster to the MapR cluster. Note that you cannot copy over permissions defined in HDFS.

Distributed Copy
The hadoop distcp command (distributed copy) enables you to use a MapReduce job to copy large amounts of data between clusters. "The hadoop distcp command expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list."
Note: You can use the hadoop distcp command to migrate data from a Hadoop HDFS cluster to MapR-FS only if the HDFS cluster uses the same version of the RPC protocol as that used by the MapR cluster. Currently, MapR uses version 4. If the clusters do not share the same version of the RPC protocol, you must use the push data method described below.
To copy data from HDFS to MapR using hadoop distcp:
1. From a node in the MapR cluster, try hadoop fs -ls to determine whether the MapR cluster can successfully communicate with the HDFS cluster:
   hadoop fs -ls <NameNode IP>:<NameNode port>/<path>
2. If the hadoop fs -ls command is successful, try hadoop fs -cat to determine whether the MapR cluster can read file contents from the specified path on the HDFS cluster:
   hadoop fs -cat <NameNode IP>:<NameNode port>/<HDFS path>/<file>
3. If you are able to communicate with the HDFS cluster and read file contents, use distcp to copy data from the HDFS cluster to the MapR cluster:
   hadoop distcp <NameNode IP>:<NameNode port>/<HDFS path> <MapR-FS path>
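For example, with a NameNode at the placeholder address 10.10.100.1 listening on 8020 (a common NameNode RPC port; substitute your own address, port, and paths, and note that the hdfs:// scheme is written out here only for clarity), the three checks above might look like this:

   # 1. Can the MapR cluster list the HDFS directory?
   hadoop fs -ls hdfs://10.10.100.1:8020/user/alice

   # 2. Can it read file contents from that directory?
   hadoop fs -cat hdfs://10.10.100.1:8020/user/alice/part-00000

   # 3. If both checks succeed, copy the directory into MapR-FS
   hadoop distcp hdfs://10.10.100.1:8020/user/alice maprfs:///user/alice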
Pushing Data from HDFS to MapR-FS
If the HDFS cluster and the MapR cluster do not use the same version of the RPC protocol, or if for some other reason you cannot use the hadoop distcp command to copy files from HDFS to MapR-FS, you can push data from the HDFS cluster to the MapR cluster.
Perform the following steps from a MapR client or node (any computer that has either mapr-core or mapr-client installed). For more information about setting up a MapR client, see Setting Up the Client. In the steps below:
<input path> — the HDFS path to the source data.
<output path> — the MapR-FS path to the target directory.
<MapR CLDB IP> — the IP address of the master CLDB node on the MapR cluster.
1. Log in to a MapR client or node as the root user (or use sudo for the following commands).
2. Create the directory /tmp/maprfs-client/ on the Apache Hadoop JobClient node.
3. Copy the following files from a MapR client or any MapR node to the /tmp/maprfs-client/ directory:
   /opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-0.1.jar
   /opt/mapr/hadoop/hadoop-0.20.2/lib/zookeeper-3.3.2.jar
   /opt/mapr/hadoop/hadoop-0.20.2/lib/native/Linux-amd64-64/libMapRClient.so
4. Install the files in the correct places on the Apache Hadoop JobClient node:
   cp /tmp/maprfs-client/maprfs-0.1.jar $HADOOP_HOME/lib/
   cp /tmp/maprfs-client/zookeeper-3.3.2.jar $HADOOP_HOME/lib/
   cp /tmp/maprfs-client/libMapRClient.so $HADOOP_HOME/lib/native/Linux-amd64-64/libMapRClient.so
   Note: If you are on a 32-bit client, use Linux-i386-32 in place of Linux-amd64-64 above.
5. If the JobTracker runs on a different node from the JobClient node, copy and install the files to the JobTracker node as well, using the steps above.
6. On the JobTracker node, add the following section to the $HADOOP_HOME/conf/core-site.xml file:
   <property>
     <name>fs.maprfs.impl</name>
     <value>com.mapr.fs.MapRFileSystem</value>
   </property>
7. Restart the JobTracker.
8. Copy data to the MapR cluster by running the hadoop distcp command on the JobClient node of the Apache Hadoop cluster.

Node Migration
Once you have loaded your data and tested and tuned your applications, you can decommission HDFS data nodes and add them to the MapR cluster. This is a three-step process:
Decommissioning nodes on an Apache Hadoop cluster: The Hadoop decommission feature enables you to gracefully remove a set of existing data nodes from a cluster while it is running, without data loss. For more information, see the Hadoop Wiki FAQ.
Meeting minimum hardware and software requirements: Ensure that every data node you want to add to the MapR cluster meets the hardware, software, and configuration requirements.
Adding nodes to a MapR cluster: You can then add those data nodes to the MapR cluster. For more information, see Adding Nodes to a Cluster.
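A rough sketch of one round of node migration follows, assuming a data node named datanode07.example.com, an excludes file referenced by the dfs.hosts.exclude property on the NameNode, and example MapR roles and cluster hostnames. Every name, path, package, and option here is illustrative and should be checked against the Hadoop decommissioning documentation and the MapR installation steps for your versions.

   # On the Apache Hadoop NameNode: mark the node for decommissioning, then refresh
   echo "datanode07.example.com" >> /path/to/dfs.exclude   # the file named by dfs.hosts.exclude
   hadoop dfsadmin -refreshNodes                           # wait for decommissioning to complete

   # On the decommissioned node: install the MapR packages for the roles it will serve
   yum install mapr-fileserver mapr-tasktracker

   # Point the node at the MapR cluster's CLDB and ZooKeeper nodes, then start the warden
   /opt/mapr/server/configure.sh -C cldb1.example.com -Z zk1.example.com,zk2.example.com,zk3.example.com
   service mapr-warden start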
Third Party Solutions
MapR works with the leaders in the Hadoop ecosystem to provide the most powerful data analysis solutions. For more information about our partners, take a look at the following pages:
Datameer
HParser
Karmasphere
Pentaho

Datameer
Datameer provides the world's first business intelligence platform built natively for Hadoop. Datameer delivers powerful, self-service analytics for the BI user through a simple spreadsheet UI, along with point-and-click data integration (ETL) and data visualization capabilities.
MapR provides a pre-packaged version of Datameer Analytics Solution ("DAS"). DAS is delivered as an RPM or Debian package. See How to setup DAS on MapR to add the DAS package to your MapR environment. Visit Demos for MapR to explore several demos included in the package that illustrate the usage of DAS in behavioral analytics and IT systems management use cases. Check out the library of video tutorials with step-by-step walk-throughs on how to use DAS, and demo videos showing various applications. If you have questions about using DAS, please visit the DAS documentation. For information about Datameer, please visit www.datameer.com.

Karmasphere
Karmasphere provides software products for data analysts and data professionals so they can unlock the power of Big Data in Hadoop, opening a whole new world of possibilities to add value to the business. Karmasphere equips analysts with the ability to discover new patterns, relationships, and drivers in any kind of data – unstructured, semi-structured, or structured – that were not possible to find before. The Karmasphere Big Data Analytics product line supports the MapR distributions, M3 and M5 Editions, and includes:
Karmasphere Analyst, which provides data analysts immediate entry to structured and unstructured data on Hadoop, through SQL and other familiar languages, so that they can make ad-hoc queries, interact with the results, and iterate – without the aid of IT.
Karmasphere Studio, which provides developers that support analytic teams a graphical environment to analyze their MapReduce code as they develop custom analytic algorithms and systematize the creation of meaningful datasets for analysts.
To get started with Karmasphere Analyst or Studio:
Request a 30-day trial of Karmasphere Analyst or Studio for MapR
Learn more about Karmasphere Big Data Analytics products
View videos about Karmasphere Big Data Analytics products
Access technical resources
Read documentation for Karmasphere products
If you have questions about Karmasphere, please email [email protected] or visit www.karmasphere.com.

HParser
HParser is a data transformation (data handler) environment optimized for Hadoop. This easy-to-use, codeless parsing software enables processing of any file format inside Hadoop with scale and efficiency. It provides Hadoop developers with out-of-the-box Hadoop parsing capabilities to address the variety and complexity of data sources, including logs, industry standards, documents, and binary or hierarchical data.
MapR has partnered with Informatica to provide the Community Edition of HParser:
The HParser package can be downloaded from Informatica as a Zip archive that includes the HParser engine, the Data Transformation HParser Jar file, HParser Studio, and the HParser Operator Guide.
The HParser engine is also available as an RPM via the MapR repository, making it easier to install the HParser Engine on all nodes in the cluster. HParser can be installed on a MapR cluster running CentOS or Red Hat Enterprise Linux.
To install HParser on a MapR cluster:
1. Register on the Informatica site.
2. Download the Zip file containing the Community Edition of HParser, and extract it.
3. Familiarize yourself with the installation procedure in the HParser Operator Guide.
4. On each node, install HParser Engine from the MapR repository by typing the following command as root or with sudo:
   yum install hparser-engine
5. Choose a Command Node, a node in the cluster from which you will issue HParser commands.
6. Following the instructions in the HParser Operator Guide, copy the HParser Jar file to the Command Node and create the HParser configuration file.

Pentaho
Pentaho Business Analytics is a tightly coupled data integration and business analytics platform that brings together IT and business users for easy access, integration, visualization, and exploration of any data. Pentaho includes data discovery, data integration, and predictive analytics.
With Pentaho, business users are empowered to make information-driven decisions that positively impact their organization’s performance. IT can rapidly deliver a secure, scalable, flexible, and easy to manage business analytics platform for the broadest set of users. To troubleshoot Pentaho on MapR, refer to troubleshooting guide. If you have questions about using Pentaho Business Analytics, Pentaho's please visit the . For information about Pentaho, please visit . Pentaho Infocenter www.pentaho.com Documentation for Previous Releases Here are links to documentation for all major releases of MapR software: Version 3.x - Latest Release 1. 2. Version 2.x Version 1.x (PDF here) Release Notes Release notes for open source Hadoop components in the MapR distribution for Hadoop are available . here Release notes for versions of the MapR distribution for Hadoop are available . here Reference Guide The MapR Reference Guide contains in-depth reference information for MapR Software. Choose a subtopic below for more detail. MapR Release Notes - Known issues and new features, by release MapR Control System - User interface reference API Reference - Information about the command-line interface and the REST API Utilities - MapR tool and utility reference Environment Variables - Environment variables specific to MapR Configuration Files - Information about MapR settings Ports Used by MapR - List of network ports used by MapR services Glossary - Essential MapR terms and definitions Hadoop Commands - Listing of Hadoop commands and options MapR Control System The MapR Control System main screen consists of a navigation pane to the left and a view to the right. Dialogs appear over the main screen to perform certain actions. View this video for an introduction to the MapR Control System dashboard... Logging on to the MapR Control System In a browser, navigate to the node that is running the service: mapr-webserver https://<hostname>:8443 When prompted, enter the username and password of the administrative user. The Dashboard The Navigation pane to the left lets you choose which to display on the right. view The main view groups are: Cluster Views - information about the nodes in the cluster MapR-FS - information about volumes, snapshots and schedules NFS HA Views - NFS nodes and virtual IP addresses Alarms Views - node and volume alarms System Settings Views - configuration of alarm notifications, quotas, users, groups, SMTP, and HTTP Some other views are separate from the main navigation tree: CLDB View - information about the container location database HBase View - information about HBase on the cluster JobTracker View - information about the JobTracker Nagios View - information about the Nagios configuration script Terminal View - an ssh terminal for logging in to the cluster Views Views display information about the system. As you open views, tabs along the top let you switch between them quickly.  Clicking any column name in a view sorts the data in ascending or descending order by that column. Most views contain a that lets you sort data in the view, so you can quickly find the information you want. Filter toolbar Some views contain collapsible panes that provide different types of detailed information. Each collapsible pane has a control at the top left that expands and collapses the pane. 
The control changes to show the state of the pane: - pane is collapsed; click to expand - pane is expanded; click to collapse The Filter Toolbar The Filter toolbar lets you build search expressions to provide sophisticated filtering capabilities for locating specific data on views that display a large number of nodes. Expressions are implicitly connected by the AND operator; any search results satisfy the criteria specified in all expressions. The Filter toolbar has two controls: The button ( ) removes the expression. Minus The button ( ) adds a new expression. Plus Expressions Each expression specifies a semantic statement that consists of a field, an operator, and a value. The first dropdown menu specifies the field to match. The second dropdown menu specifies the type of match to perform. The text field specifies a value to match or exclude in the field. You can use a wildcard to substitute for any part of the string. Cluster Views This section provides reference for the following views in the MapR Control System: Dashboard - Summary of cluster health, activity, and usage Cluster Heatmap Alarms Cluster Utilization MapReduce Services Volumes Nodes - Summary of node information Overview Services Performance Disks MapReduce NFS Nodes Alarm Status Node Properties View - Details about a node Alarms Machine Performance MapR-FS and Available Disks System Disks Manage Node Services MapReduce DB Gets, Puts, Scans Node Heatmap Jobs The Job Pane The Task Table The Task Attempt Pane Dashboard - Summary of cluster health, activity, and usage The Dashboard displays a summary of information about the cluster in six panes. Panes include: Cluster Heatmap - the alarms and health for each node, by rack Alarms - a summary of alarms for the cluster Cluster Utilization - CPU, Memory, and Disk Space usage MapReduce - the number of running and queued jobs, running tasks, running map tasks, running reduce tasks, map task capacity, reduce task capacity, map task prefetch capacity, and blacklisted nodes Services - the number of instances of each service Volumes - the number of available, under-replicated, and unavailable volumes Links in each pane provide shortcuts to more detailed information. The following sections provide information about each pane. Cluster Heatmap The Cluster Heatmap pane displays the health of the nodes in the cluster, by rack. Each node appears as a colored square to show its health at a glance. If you click on the small wrench icon at the upper right of the Cluster Heatmap pane, a key to the color-coded heatmap display slides into view. At the top of the display, you can set the refresh rate for the display (measured in seconds), as well as the number of columns to display (for example, 20 nodes are displayed across two rows for a 10-column display). Click the wrench icon again to slide the display back out of view. The left drop-down menu at the top of the pane lets you choose which data is displayed. Some of the choices are shown below. Heatmap legend by category The heatmap legend changes depending on the criteria you select from the drop-down menu. All the criteria and their corresponding legends are shown here. 
Health Healthy - all services up, MapR-FS and all disks OK, and normal heartbeat Upgrading - upgrade in process Degraded - one or more services down, or no heartbeat for over 1 minute Maintenance - routine maintenance in process Critical - Mapr-FS Inactive/Dead/Replicate, or no heartbeat for over 5 minutes Click to see the legend for all Heatmap displays, such as CPU, memory and disk space... CPU Utilization CPU < 50% CPU < 80% CPU >= 80% Unknown Memory Utilization Memory < 50% Memory < 80% Memory >= 80% Unknown Disk Space Utilization Used < 50% Used < 80% Used >= 80% Unknown Too Many Containers Alarm Containers within limit Containers exceeded limit Duplicate HostId Alarm No duplicate host id detected Duplicate host id detected UID Mismatch Alarm No UID mismatch detected UID mismatch detected No Heartbeat Detected Alarm Node heartbeat detected Node heartbeat not detected TaskTracker Local Dir Full Alarm TaskTracker local directory is not full TaskTracker local directory full PAM Misconfigured Alarm PAM configured PAM misconfigured High FileServer Memory Alarm Fileserver memory OK Fileserver memory high Cores Present Alarm No core files Core files present Installation Directory Full Alarm Installation Directory free Installation Directory full Metrics Write Problem Alarm Metrics writing to Database Metrics unable to write to Database Root Partition Full Alarm Root partition free Root partition full HostStats Down Alarm HostStats running HostStats down Webserver Down Alarm Webserver running Webserver down NFS Gateway Down Alarm NFS Gateway running NFS Gateway down HBase RegionServer Down Alarm HBase RegionServer running HBase RegionServer down HBase Master Down Alarm HBase Master running HBase Master down TaskTracker Down Alarm TaskTracker running TaskTracker down JobTracker Down Alarm JobTracker running JobTracker down FileServer Down Alarm FileServer running FileServer down CLDB Down Alarm CLDB running CLDB down Time Skew Alarm Time OK Time skew alarm(s) Software Installation & Upgrades Alarm Version OK Version alarm(s) Disk Failure(s) Alarm Disks OK Disk alarm(s) Excessive Logging Alarm No debug Debugging Zoomed view You can see a zoomed view of all the nodes in the cluster by moving the zoom slide bar. The zoomed display reveals more details about each node, based on the criteria you chose from the drop-down menu. In this example, CPU Utilization is displayed for each node. Clicking a rack name navigates to the view, which provides more detailed information about the nodes in the rack. Nodes Clicking a colored square navigates to the , which provides detailed information about the node. Node Properties View Alarms The Alarms pane includes these four columns: Alarm - a list of alarms raised on the cluster Last Raised - the most recent time each alarm state changed Summary - how many nodes or volumes have raised each alarm Clear Alarm - clicking on the X clears the corresponding alarm Clicking , , or sorts data in ascending or descending order by that column. Alarm Last Raised Summary Cluster Utilization The Cluster Utilization pane displays a summary of the total usage of the following resources: CPU Memory Disk Space For each resource type, the pane displays the percentage of cluster resources used, the amount used, and the total amount present in the system. A colored dot after the pane's title summarizes the status of the disk and role : Balancers Green: Both balancers are running. Orange: The replication role balancer is running. Yellow: The disk space balancer is running. 
Purple: Neither balancer is running. Click the colored dot to bring up the dialog. Balancer Configuration MapReduce The MapReduce pane shows information about MapReduce jobs: Running Jobs - the number of MapReduce jobs currently running Queued Jobs - the number of MapReduce jobs queued to run Running Tasks - the number of Map and Reduce tasks currently running Running Map Tasks - the number of Map tasks currently running Running Reduce Tasks - the number of Reduce tasks currently running Map Task Capacity - the number of map slots available across all nodes in the cluster Reduce Task Capacity - the number of reduce slots available across all nodes in the cluster Map Task Prefetch Capacity - the number of map tasks that can be queued to fill map slots once they become available Blacklisted Nodes - the number of nodes that have been eliminated from the MapReduce pool Services The Services pane shows information about the services running on the cluster. For each service, the pane displays the following information: Actv - the number of running instances of the service Stby - the number of instances of the service that are configured and standing by to provide failover Stop - the number of instances of the service that have been intentionally stopped Fail - the number of instances of the service that have failed, indicated by a corresponsing Service Down alarm Total - the total number of instances of the service configured on the cluster Clicking a service navigates to the view. Services Volumes The Volumes pane displays the total number of volumes, and the number of volumes that are mounted and unmounted. For each category, the Volumes pane displays the number, percent of the total, and total size. Clicking or navigates to the view. Mounted Unmounted Volumes Nodes - Summary of node information The Nodes view displays the nodes in the cluster, by rack. The Nodes view contains two panes: the Topology pane and the Nodes pane.  The Topology pane shows the racks in the cluster. Selecting a rack displays that rack's nodes in the Nodes pane to the right. Selecting displa Cluster ys all the nodes in the cluster. Clicking any column name sorts data in ascending or descending order by that column. Selecting the checkbox beside one node makes the following buttons available: Properties - navigates to the , which displays detailed information about a single node. Node Properties View Manage Services - displays the dialog, which lets you start and stop services on the node. Manage Node Services Change Topology - displays the dialog, which lets you change the topology path for a node. Change Node Topology Note: If a node has a No Heartbeat alarm raised, the button is also displayed. Forget Node When you click on , the following Message appears: Forget Node When you click on , a dialog is displayed where you can stop, start, or restart the services on the node. Manage Services When you click on , a dialog is displayed where you can choose a different location for the selected node. 
Change Topology Selecting the checkboxes beside multiple nodes changes the text on the buttons to reflect the number of nodes affected: The dropdown menu at the top left specifies the type of information to display: Overview - general information about each node Services - services running on each node Performance - information about memory, CPU, I/O and RPC performance on each node Disks - information about disk usage, failed disks, and the MapR-FS heartbeat from each node MapReduce - information about the JobTracker heartbeat and TaskTracker slots on each node NFS Nodes - the IP addresses and Virtual IPs assigned to each NFS node Alarm Status - the status of alarms on each node  Clicking a node's Hostname navigates to the , which provides detailed information about the node. Node Properties View Selecting the checkbox displays the , which provides additional data filtering options. Filter Filter toolbar Each time you select a filtering option, the option is displayed in the window below the filter checkbox. You can add more options by clicking on the . Overview The Overview displays the following general information about nodes in the cluster: Hlth - each node's health: healthy, degraded, critical, or maintenance Hostname - the hostname of each node Physical IP(s) - the IP address or addresses associated with each node FS HB - time since each node's last heartbeat to the CLDB Physical Topology - the rack path to each node Services The Services view displays the following information about nodes in the cluster: Hlth - eact node's health: healthy, degraded, critical, or maintenance Hostname - the hostname of each node Configured Services - a list of the services specified in the config file Running Services - a list of the services running on each node Physical Topology - each node's physical topology Performance The Performance view displays the following information about nodes in the cluster, including: Hlth - each node's health: healthy, degraded, critical, or maintenance Hostname - DNS hostname for the nodes in this cluster Memory - percentage of memory used and the total memory % CPU - percentage of CPU usage on the node # CPUs - number of CPUs present on the node Bytes Received - number of bytes received in 1 second, through all network interfaces on the node Click to see all Performance metrics... 
Bytes Sent - number of bytes sent in 1 second, through all network interfaces on the node # RPCs - number of RPC calls RPC In Bytes - number of RPC bytes received by this node every second RPC Out Bytes - number of RPC bytes sent by this node every second # Disk Reads - number of disk read operations on this node every second # Disk Writes - number of disk write operations on this node every second Disk Read Bytes - number of bytes read from all the disks on this node every second Disk Write Bytes - number of bytes written to all the disks on this node every second # Disks - number of disks on this node Gets - 1m - number of data retrievals (gets) executed on this region's primary node in a 1-minute interval Puts - 1m - number of data writes (puts) executed on this region's primary node in a 1-minute interval Scans - 1m - number of data seeks (scans) executed on this region's primary node in a 1-minute interval Disks The Disks view displays the following information about nodes in the cluster: Hlth - each node's health: healthy, degraded, or critical Hostname - the hostname of each node # Bad Disks - the number of failed disks on each node Disk Space - the amount of disk used and total disk capacity, in gigabytes MapReduce The MapReduce  view displays the following information about nodes in the cluster: Hlth - each node's health: healthy, degraded, or critical Hostname - the hostname of each node TT Map Slots - the number of map slots on each node TT Map Slots Used - the number of map slots in use on each node TT Reduce Slots - the number of reduce slots on each node TT Reduce Slots Used - the number of reduce slots in use on each node NFS Nodes The NFS Nodes view displays the following information about nodes in the cluster: Hlth - each node's health: healthy, degraded, or critical Hostname - the hostname of each node Physical IP(s) - the IP address or addresses associated with each node Virtual IP(s) - the virtual IP address or addresses assigned to each node Alarm Status The Alarm Status view displays the following information about nodes in the cluster: Hlth - each node's health: healthy, degraded, critical, or maintenance Hostname - DNS hostname for nodes in this cluster Version Alarm - one or more services on the node are running an unexpected version No Heartbeat Alarm - node is not undergoing maintenance, and no heartbeat is detected for over 5 minutes UID Mismatch Alarm - services in the cluster are being run with different user names (UIDs) Duplicate HostId Alarm - two or more nodes in the cluster have the same host id Click to see all Alarm Status alerts... 
Too Many Containers Alarm - number of containers on this node reached the maximum limit Excess Logs Alarm - debug logging is enabled on the node (debug logging generates enormous amounts of data and can fill up disk space) Disk Failure Alarm - a disk has failed on the node Time Skew Alarm - the clock on the node is out of sync with the master CLDB by more than 20 seconds Root Partition Full Alarm - the root partition ("/") on the node is running out of space (99% full) Installation Directory Full Alarm - the partition /opt/mapr on the node is running out of space (95% full) Core Present Alarm - a service on the node has crashed and created a core dump file High FileServer Memory Alarm - memory consumed by service on the node is high fileserver Pam Misconfigured Alarm - the PAM authentication on the node is configured incorrectly TaskTracker Local Directory Full Alarm - the local directory used by the TaskTracker on the specified node(s) is full, and the TaskTracker cannot operate as a result CLDB Alarm - the CLDB service on the node has stopped running FileServer Alarm - the FileServer service on the node has stopped running JobTracker Alarm - the JobTracker service on the node has stopped running TaskTracker Alarm - the TaskTracker service on the node has stopped running HBase Master Alarm - the HBase Master service on the node has stopped running HBase RegionServer Alarm - the HBase RegionServer service on the node has stopped running NFS Gateway Alarm - the NFS service on the node has stopped running WebServer Alarm - the WebServer service on the node has stopped running HostStats Alarm - the HostStats service has stopped running Metrics write problem Alarm - metric data was not written to the database Node Properties View - Details about a node The Node Properties view displays detailed information about a single node in seven collapsible panes: Alarms Machine Performance MapR-FS and Available Disks System Disks Manage Node Services MapReduce DB Gets, Puts, Scans Buttons: Forget Node - displays the dialog box Forget Node Alarms The Alarms pane displays a list of alarms that have been raised on the system, and the following information about each alarm: Alarm - the alarm name Last Raised - the most recent time when the alarm was raised Summary - a description of the alarm Clear Alarm - clicking on the X clears the corresponding alarm Machine Performance The Machine Performance pane displays the following information about the node's performance and resource usage since it last reported to the CLDB: Memory Used - the amount of memory in use on the node Disk Used - the amount of disk space used on the node CPU - The number of CPUs and the percentage of CPU used on the node Network I/O - the input and output to the node per second RPC I/O - the number of RPC calls on the node and the amount of RPC input and output Disk I/O - the amount of data read to and written from the disk # Operations - the number of disk reads and writes MapR-FS and Available Disks The MapR-FS and Available Disks pane displays the disks on the node and information about each disk. 
Information headings include: Status - the status of the disk (healthy, failed, or offline) Mount - whether the disk is mounted (indicated by ) or unmounted Device - the device name File System - the file system on the disk Used - the percentage of memory used out of total memory available on the disk Model # - the model number of the disk Serial # - the serial number of the disk Firmware Version - the version of the firmware being used Add to MAPR-FS - clicking the adds the disk to MAPR-FS storage Remove from MAPR-FS - clicking the displays a dialog that asks you to verify that you want to remove the disk If you confirm by clicking , and data on that disk has not been replicated, a warning dialog appears: OK For more information on disk status, and the proper procedure for adding, removing, and replacing disks, see the page. Disks System Disks The System Disks pane displays information about disks present and mounted on the node: Status - the status of the disk (healthy, failed, or offline) Mount - whether the disk is mounted (indicated by ) or unmounted Device - the device name File System - the file system on the disk Used - the percentage of memory used out of total memory available on the disk Model # - the model number of the disk Serial # - the serial number of the disk Firmware Version - the version of the firmware being used Manage Node Services The Manage Node Services pane displays the status of each service on the node. 1. 2. 3. If you are running MapR 1.2.2 or earlier, do not use the command or the MapR Control System to add disks to MapR-FS. You disk add must either upgrade to MapR 1.2.3 before adding or replacing a disk, or use the following procedure (which avoids the comma disk add nd): Use the to the failed disk. All other disks in the same storage pool are removed at the same MapR Control System remove time. Make a note of which disks have been removed. Create a text file containing a list of the disks you just removed. See . /tmp/disks.txt Setting Up Disks for MapR Add the disks to MapR-FS by typing the following command (as or with ): root sudo /opt/mapr/server/disksetup -F /tmp/disks.txt Service - the name of each service State: Configured: the package for the service is installed and the service is configured for all nodes, but it is not enabled for the particular node Not Configured: the package for the service is not installed and/or the service is not configured ( has not run) configure.sh Running: the service is installed, has been started by the warden, and is currently executing Stopped: the service is installed and has run, but the service is currently not executing configure.sh StandBy: the service is installed Failed: the service was running, but terminated unexpectedly Log Path - the path to where each service stores its logs Stop/Start: click on to stop the service click on to start the service Restart - click on to restart the service Log Settings - displays the Trace Activity dialog where you can set the level of logging for a service on a particular node. When you select a log level, all the levels listed above it are included in the log. Levels include: ERROR WARN INFO DEBUG TRACE You can also start and stop services in the the dialog, by clicking in the view. Manage Node Services Manage Services Nodes MapReduce The MapReduce pane displays the number of map and reduce slots used, and the total number of map and reduce slots on the node. 
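Returning to the disk replacement callout above (which applies to MapR 1.2.2 and earlier), the command-line portion might look like the sketch below, where /dev/sdc and /dev/sdd stand in for whichever disks were actually removed through the MapR Control System; substitute your own device names.

   # List the removed disks, one per line, in /tmp/disks.txt
   printf '%s\n' /dev/sdc /dev/sdd > /tmp/disks.txt

   # Re-add the disks to MapR-FS (run as root or with sudo)
   /opt/mapr/server/disksetup -F /tmp/disks.txt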
DB Gets, Puts, Scans The DB Gets, Puts, Scans pane displays the number of gets, puts, and scan operations performed during various time intervals. Node Heatmap The Node Heatmap view provides a graphical summary of node status across the cluster. This view displays the same information as the Node on the Dashboard, without the other panes that appear on the dashboard. Heatmap pane Jobs The Jobs view displays the data collected by the MapR Metrics service. The Jobs view contains two panes: the chart pane and the data grid. The chart pane displays the data corresponding to the selected metric in histogram form. The data grid lists the jobs running on the cluster. Click on the wrench icon to slide out a menu of information to display. Choices include: Cumulative Job Combine Input Records Cumulative Job Map Input Bytes Cumulative Job Map Input Records Cumulative Job Map Output Bytes Cumulative Job Map Output Records Cumulative Job Reduce Input Records Click to see all Job metrics... Cumulative Job Reduce Output Bytes Cumulative Job Reduce Shuffle Bytes Cumulative Physical Memory Current CPU Current Memory Job Average Map Attempt Duration Job Average Reduce Attempt Duration Job Average Task Duration Job Combine Output Records Job Complete Map Task Count Job Complete Reduce Task Count Job Complete Task Count Job Cumulative CPU Job Data-local Map Tasks Job Duration Job End Time Job Error Count Job Failed Map Task Attempt Count Job Failed Map Task Count Job Failed Reduce Task Attempt Count Job Failed Reduce Task Count Job Failed Task Attempt Count Job Failed Task Count Job Id Job Map CPU Job Map Cumulative Memory Bytes Job Map File Bytes Written Job Map GC Time Job Map Input Bytes/Sec Job Map Input Records/Sec Job Map Output Bytes/Sec Job Map Output Records/Sec Job Map Progress Job Map Reserve Slot Wait Job Map Spilled Records Job Map Split Raw Bytes Job Map Task Attempt Count Job Map Task Count Job Map Tasks Duration Job Map Virtual Memory Bytes Job MapR-FS Map Bytes Read Job MapR-FS Map Bytes Written Job MapR-FS Reduce Bytes Read Job MapR-FS Reduce Bytes Written Job MapR-FS Total Bytes Read Job MapR-FS Total Bytes Written Job Maximum Map Attempt Duration Job Maximum Reduce Attempt Duration Job Maximum Task Duration Job Name Job Non-local Map Tasks Job Rack-local Map Tasks Job Reduce CPU Job Reduce Cumulative Memory Bytes Job Reduce File Bytes Written Job Reduce GC Time Job Reduce Input Groups Job Reduce Input Records/Sec Job Reduce Output Records/Sec Job Reduce Progress Job Reduce Reserve Slot Wait Job Reduce Shuffle Bytes/Sec Job Reduce Spilled Records Job Reduce Split Raw Bytes Job Reduce Task Attempt Count Job Reduce Task Count Job Reduce Tasks Duration Job Reduce Virtual Memory Bytes Job Running Map Task Count Job Running Reduce Task Count Job Running Task Count Job Split Raw Bytes Job Start Time Job Submit Time Job Task Attempt Count Job Total File Bytes Written Job Total GC Time Job Total Spilled Records Job Total Task Count Job User Logs Map Tasks Finish Time Map Tasks Start Time Priority Reduce Tasks Finish Time Reduce Tasks Start Time Status Virtual Memory Bytes Select the checkbox to display the , which provides additional data filtering options. Filter Filter toolbar The drop-down selector lets you change the display scale of the histogram's X axis between a uniform or logarithmic scale. Hover the x-axis: cursor over a bar in the histogram to display the and buttons. 
Filter Zoom Click the button or click the bar to filter the table below the histogram by the data range corresponding to that bar. The selected bar turns Filter yellow. Hover the cursor over the selected bar to display the and buttons. Click the button to remove the filter from Clear Filter Zoom Clear Filter the data range in the table below the histogram. Double-click a bar or click the button to zoom in and display a new histogram that displays metrics constrained to the data range Zoom represented by the bar. The data range applied to the metrics data set displays above the histogram. Click the plus or minus buttons in the filter conditions panel to add or remove filter conditions. Uncheck the checkbox above the histogram Filter to clear the entire filter. Check the box next to a job in the table below the histogram to enable the button. If the job is still running, checking this box also View Job enables the button. Clicking will display a confirmation dialog to choose whether or not to terminate the job. Kill Job Kill Job Click the button or click the job name in the table below the histogram to open the Job tab for that job. View Job The Job Pane From the main page, select a job from the list below the histogram and click . You can also click directly on the name of the job in Jobs View Job the list. The pane displays with the tab selected by default. This pane has three tabs, , , and . If the job is Job Properties Tasks Tasks Charts Info running, the button is enabled. Kill Job The Tasks Tab The tab has two panes. The upper pane displays histograms of metrics for the tasks and task attempts in the selected job. The lower pane Tasks displays a table that lists the tasks and primary task attempts in the selected job. Tasks can be in any of the following states: COMPLETE FAILED KILLED PENDING RUNNING The table of tasks also lists the following information for each task: Task ID. Click the link to display a with information about the task attempts for this task. table Task type: M: Map R: Reduce TC: Task Cleanup JS: Job Setup JC: Job Cleanup Primary task attempt ID. Click the link to display the pane for this task attempt. task attempt Task starting timestamp Task ending timestamp Task duration Host locality Node running the task. Click the link to display the pane for this node. Node Properties You can select the following task histogram metrics for this job from the drop-down selector: Task Duration Task Attempt Duration Task Attempt Local Bytes Read Task Attempt Local Bytes Written Task Attempt MapR-FS Bytes Read Click to see all Task metrics... Task Attempt MapR-FS Bytes Written Task Attempt Garbage Collection Time Task Attempt CPU Time Task Attempt Physical Memory Bytes Task Attempt Virtual Memory Bytes Map Task Attempt Input Records Map Task Attempt Output Records Map Task Attempt Skipped Records Map Task Attempt Input Bytes Map Task Attempt Output Bytes Reduce Task Attempt Input Groups Reduce Task Attempt Shuffle Bytes Reduce Task Attempt Input Records Reduce Task Attempt Output Records Reduce Task Attempt Skipped Records Task Attempt Spilled Records Combined Task Attempt Input Records Combined Task Attempt Output Records Uncheck the box to hide map tasks. Uncheck the box to hide reduce tasks. Check the Show Map Tasks Show Reduce Tasks Show box to display job and task setup and cleanup tasks. Histogram filtering and zoom work in the same way as the pane Setup/Cleanup Tasks Jobs . The Charts Tab Click the tab to display your job's line chart metrics. 
Charts Click the button to add a new line chart. You can use the X and minus buttons at the top-left of each chart to dismiss or hide the chart. Add chart Line charts can display the following metrics for your job: Cumulative CPU used Cumulative physical memory used Number of failed map tasks Number of failed reduce tasks Number of running map tasks Click to see all available Chart metrics... Number of running reduce tasks Number of map task attempts Number of failed map task attempts Number of failed reduce task attempts Rate of map record input Rate of map record output Rate of map input bytes Rate of map output bytes Rate of reduce record output Rate of reduce shuffle bytes Average duration of map attempts Average duration of reduce attempts Maximum duration of map attempts Maximum duration of reduce attempts The Information Tab The tab of the pane displays summary information about the job in three collapsible panes: Information Job Properties The pane displays information about this job's MapReduce activity. MapReduce Framework Counters The pane displays information about the number of this job's map tasks. Job Counters The pane displays information about this job's interactions with the cluster's file system. File System Counters The Task Table The Task table displays a list of the task attempts for the selected task, along with the following information for each task attempt: Status: RUNNING SUCCEEDED FAILED UNASSIGNED KILLED COMMIT PENDING FAILED UNCLEAN KILLED UNCLEAN Task attempt ID. Click the link to display the pane for this task attempt. task attempt Task attempt type: M: Map R: Reduce TC: Task Cleanup JS: Job Setup JC: Job Cleanup Task attempt starting timestamp Task attempt ending timestamp Task attempt shuffle ending timestamp Task attempt sort ending timestamp Task attempt duration Node running the task attempt. Click the link to display the pane for this node. Node Properties A link to the log file for this task attempt Diagnostic information about this task attempt The Task Attempt Pane The pane has two tabs, and . Task Attempt Info Charts The Task Attempt Info Tab The tab displays summary information about this task attempt in three panes: Info The pane displays information about this task attempt's MapReduce activity. MapReduce Framework Counters The pane displays information about the I/O performance in Bytes/sec and Records/sec. MapReduce Throughput Counters The pane displays information about this task attempt's interactions with the cluster's file system. File System Counters The Task Attempt Charts Tab The tab displays line charts for metrics specific to this task attempt. By default, this tab displays charts for these metrics: Task Attempt Charts Cumulative CPU by Time Physical Memory by Time Virtual Memory by Time Click the button to add a new line chart. You can use the X and minus buttons at the top-left of each chart to dismiss or hide the chart. Add chart Line charts can display the following metrics for your task: Combine Task Attempt Input Records Combine Task Attempt Output Records Map Task Attempt Input Bytes Map Task Attempt Input Records Map Task Attempt Output Bytes Click to see all available Task Attempt metrics... 
Map Task Attempt Output Records Map Task Attempt Skipped Records Reduce Task Attempt Input Groups Reduce Task Attempt Input Records Reduce Task Attempt Output Records Reduce Task Attempt Shuffle Bytes Reduce Task Attempt Skipped Records Task Attempt CPU Time Task Attempt Local Bytes Read Task Attempt Local Bytes Written Task Attempt MapR-FS Bytes Read Task Attempt MapR-FS Bytes Written Task Attempt Physical Memory Bytes Task Attempt Spilled Records Task Attempt Virtual Memory Bytes MapR-FS Views The MapR-FS group provides the following views: Tables - information about M7 tables in the cluster Volumes - information about volumes in the cluster Mirror Volumes - information about mirrors User Disk Usage - cluster disk usage Snapshots - information about volume snapshots Schedules - information about schedules Tables The view displays a list of tables in the cluster. Tables The button displays a field where you can enter the path to a new table to create from the MCS. New Table Click the name of a table from the view to display the table detail view. Tables From the table detail view, click to delete this table. Delete Table The table detail view has the following tabs: Column Families Regions The tab displays the following information: Column Families Column Family Name Max Versions Min Versions Compression Time-to-Live In Memory Click the button to change these values. Click the button to delete the selected column families. Edit Column Family Delete Column Family The tab displays the following information: Regions Start Key - The first key in the region range. End Key - The last key in the region range. Physical Size - The physical size of the region with compression. Logical Size - The logical size of the region without compression. # Rows - The number of rows stored in the region. Primary Node - The region's original source for storage and computation. Secondary Nodes - The region's replicated sources for storage and computation. Last HB - The time interval since the last data communication with the region's primary node. Region Identifier - The tablet region identifier. Volumes The view displays the following information about volumes in the cluster: Volumes Mnt - Whether the volume is mounted. ( ) Vol Name - The name of the volume. Mount Path - The path where the volume is mounted. Creator - The user or group that owns the volume. Quota - The volume quota. Vol Size - The size of the volume. Data Size - The size of the volume on the disk before compression. Snap Size - The size of the all snapshots for the volume. As the differences between the snapshot and the current state of the volume grow, the amount of data storage taken up by the snapshots increases. Total Size - The size of the volume and all its snapshots. Replication Factor - The number of copies of the volume. Physical Topology - The rack path to the volume. Clicking any column name sorts data in ascending or descending order by that column. The checkbox specifies whether to show unmounted volumes: Unmounted selected - show both mounted and unmounted volumes unselected - show mounted volumes only The checkbox specifies whether to show system volumes: System selected - show both system and user volumes unselected - show user volumes only Selecting the checkbox displays the , which provides additional data filtering options. Filter Filter toolbar Clicking displays the dialog. New Volume New Volume New Volume The New Volume dialog lets you create a new volume. 
For mirror volumes, the Snapshot Scheduling section is replaced with a section called Mirror Scheduling: The Volume Setup section specifies basic information about the volume using the following fields: Volume Type - a standard volume, or a local or remote mirror volume Volume Name (required) - a name for the new volume Mount Path - a path on which to mount the volume (check the small box at the right to indicate the mount path for the new volume; if the box is not checked, an unmounted volume is created) Topology - the new volume's rack topology Read-only - if checked, prevents writes to the volume The Permissions section lets you grant specific permissions on the volume to certain users or groups: User/Group field - the user or group to which permissions are to be granted (one user or group per row) Permissions field - the permissions to grant to the user or group (see the Permissions table below) Delete button ( ) - deletes the current row [+ Add Permission ] - adds a new row Volume Permissions Code Allowed Action dump Dump/Back up the volume restore Restore/Mirror the volume m Edit volume properties d Delete the volume fc Full control (admin access and permission to change volume ACL) The Usage Tracking section displays cluster usage and sets quotas for the volume using the following fields: Quotas - the volume quotas: Volume Advisory Quota - if selected, enter the advisory quota for the volume expressed as an integer plus the single letter abbreviation for the unit (such as 100G for 100GB). When this quota is reached, an advisory email is sent to the user or group. Volume Hard Quota - if selected, enter the maximum limit for the volume expressed as an integer plus the single letter abbreviation for the unit (such as 128G for 128GB). When this hard limit is reached, no more data is written to the volume. The Replication section contains the following fields: Replication - the requested replication factor for the volume Min Replication - the minimum replication factor for the volume. When the number of replicas drops down to or below this number, the volume is aggressively re-replicated to bring it above the minimum replication factor. Optimize Replication For - the basis for choosing the optimum replication factor (high throughput or low latency) The Snapshot Scheduling section (normal volumes) contains the snapshot schedule, which determines when snapshots will be automatically created. Select an existing schedule from the pop-up menu. The Mirror Scheduling section (local and remote mirror volumes) contains the mirror schedule, which determines when mirror volumes will be automatically created. Select an existing schedule from the pop-up menu. Buttons: OK - creates the new volume Cancel - exits without creating the volume Modify Volume You can modify a volume's properties by selecting the checkbox next to that volume and clicking the button. A dropdown menu of Modify Volume properties you can modify is displayed. To apply one set of changes to multiple volumes, mark the checkboxes next to each volume. Volume Properties Clicking on a volume name displays the Volume Properties dialog where you can view information about the volume, and check or change various settings. You can also remove the volume. If you click on , the following dialog appears: Remove Volume Buttons: OK - removes the volume or volumes Cancel - exits without removing the volume or volumes For information about the fields in the Volume Properties dialog, see . 
New Volume The checkbox fills when containers in this volume are outside the volume's main topology. Partly Out of Topology Snapshots The Snapshots dialog displays the following information about snapshots for the specified volume: Snapshot Name - The name of the snapshot. Disk Used - The total amount of logical storage held by the snapshot. Since the current volume and all of its snapshots will often have storage held in common, the total disk usage reported will often exceed the total storage used by the volume. The value reported in this field is the size the snapshot would have if the difference between the snapshot and the volume's current state is 100%. Created - The date and time the snapshot was created. Expires - The snapshot expiration date and time. Buttons: New Snapshot - Displays the dialog. Snapshot Name Remove - When the checkboxes beside one or more snapshots are selected, displays the dialog. Remove Snapshots Preserve - When the checkboxes beside one or more snapshots are selected, prevents the snapshots from expiring. Close - Closes the dialog. Snapshot Name The Snapshot Name dialog lets you specify the name for a new snapshot you are creating. The Snapshot Name dialog creates a new snapshot with the name specified in the following field: Name For New Snapshot(s) - the new snapshot name Buttons: OK - creates a snapshot with the specified name Cancel - exits without creating a snapshot Remove Snapshots The Remove Snapshots dialog prompts you for confirmation before removing the specified snapshot or snapshots. Buttons Yes - removes the snapshot or snapshots No - exits without removing the snapshot or snapshots Mirror Volumes The Mirror Volumes pane displays information about mirror volumes in the cluster: Mnt - whether the volume is mounted Vol Name - the name of the volume Src Vol - the source volume Src Clu - the source cluster Orig Vol -the originating volume for the data being mirrored Orig Clu - the originating cluster for the data being mirrored Last Mirrored - the time at which mirroring was most recently completed - status of the last mirroring operation % Done - progress of the mirroring operation Error(s) - any errors that occurred during the last mirroring operation   User Disk Usage The User Disk Usage view displays information about disk usage by cluster users: Name - the username Disk Usage - the total disk space used by the user # Vols - the number of volumes Hard Quota - the user's quota Advisory Quota - the user's advisory quota Email - the user's email address Snapshots The Snapshots view displays the following information about volume snapshots in the cluster: Snapshot Name - the name of the snapshot Volume Name - the name of the source volume volume for the snapshot Disk Space used - the disk space occupied by the snapshot Created - the creation date and time of the snapshot Expires - the expiration date and time of the snapshot Clicking any column name sorts data in ascending or descending order by that column. Selecting the checkbox displays the , which provides additional data filtering options. Filter Filter toolbar Buttons: Remove Snapshot - when the checkboxes beside one or more snapshots are selected, displays the dialog Remove Snapshots Preserve Snapshot - when the checkboxes beside one or more snapshots are selected, prevents the snapshots from expiring Schedules The Schedules view lets you view and edit schedules, which can then can be attached to events to create occurrences. 
A schedule is a named group of rules that describe one or more points of time in the future at which an action can be specified to take place. The left pane of the Schedules view lists the following information about the existing schedules: Schedule Name - the name of the schedule; clicking a name displays the schedule details in the right pane for editing In Use - indicates whether the schedule is ( ), or attached to an action in use The right pane provides the following tools for creating or editing schedules: Schedule Name - the name of the schedule Schedule Rules - specifies schedule rules with the following components: A dropdown that specifies frequency (Once, Yearly, Monthly, Weekly, Daily, Hourly, Every X minutes) Dropdowns that specify the time within the selected frequency Retain For - the time for which the scheduled snapshot or mirror data is to be retained after creation [ +Add Rule ] - adds another rule to the schedule Navigating away from a schedule with unsaved changes displays the dialog. Save Schedule Buttons: New Schedule - starts editing a new schedule Remove Schedule - displays the dialog Remove Schedule Save Schedule - saves changes to the current schedule Cancel - cancels changes to the current schedule Remove Schedule The Remove Schedule dialog prompts you for confirmation before removing the specified schedule. Buttons Yes - removes the schedule No - exits without removing the schedule NFS HA Views The NFS view group provides the following views: NFS Setup - information about NFS nodes in the cluster VIP Assignments - information about virtual IP addresses (VIPs) in the cluster NFS Nodes - information about NFS nodes in the cluster NFS Setup The NFS Setup view displays information about NFS nodes in the cluster and any VIPs assigned to them: Starting VIP - the starting IP of the VIP range Ending VIP - the ending IP of the VIP range Node Name(s) - the names of the NFS nodes IP Address(es) - the IP addresses of the NFS nodes MAC Address(es) - the MAC addresses associated with the IP addresses Buttons: Start NFS - displays the Manage Node Services dialog Add VIP - displays the Add Virtual IPs dialog Edit - when one or more checkboxes are selected, edits the specified VIP ranges Remove- when one or more checkboxes are selected, removes the specified VIP ranges Unconfigured Nodes - displays nodes not running the NFS service (in the Nodes view) VIP Assignments - displays the VIP Assignments view VIP Assignments The VIP Assignments view displays VIP assignments beside the nodes to which they are assigned: Virtual IP Address - each VIP in the range Node Name - the node to which the VIP is assigned IP Address - the IP address of the node MAC Address - the MAC address associated with the IP address Buttons: Start NFS - displays the Manage Node Services dialog Add VIP - displays the Add Virtual IPs dialog Unconfigured Nodes - displays nodes not running the NFS service (in the Nodes view) NFS Nodes The NFS Nodes view displays information about nodes running the NFS service: Hlth - the health of the node Hostname - the hostname of the node Physical IP(s) - physical IP addresses associated with the node Virtual IP(s) - virtual IP addresses associated with the node Buttons: Properties - when one or more nodes are selected, navigates to the Node Properties View Forget Node - navigates to the dialog, which lets you remove the node Remove Node Manage Services - navigates to the dialog, which lets you start and stop services on the node Manage Node Services Change Topology - 
navigates to the Change Node Topology dialog, which lets you change the rack or switch path for a node

Alarms Views
The Alarms view group provides the following views:
Node Alarms - information about node alarms in the cluster
Volume Alarms - information about volume alarms in the cluster
User/Group Alarms - information about users or groups that have exceeded quotas
Alarm Notifications - configure where notifications are sent when alarms are raised

The following controls are available on views:
Clicking any column name sorts data in ascending or descending order by that column.
Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.
Clicking the Column Controls icon opens a dialog that lets you select which columns to view. Click any item to toggle its column on or off. You can also specify the refresh rate for updating data on the page.

Node Alarms
The Node Alarms view displays information about alarms on any node in the cluster that has raised an alarm. The first two columns display:
Hlth - a color indicating the status of each node (see Cluster Heat Map)
Hostname - the hostname of the node
The remaining columns are based on alarm type, such as:
Version Alarm - one or more services on the node are running an unexpected version
No Heartbeat Alarm - no heartbeat has been detected for over 5 minutes, and the node is not undergoing maintenance
UID Mismatch Alarm - services in the cluster are being run with different usernames (UIDs)
Duplicate HostId Alarm - two or more nodes in the cluster have the same Host ID
Too Many Containers Alarm - the number of containers on this node has reached the maximum limit
Excess Logs Alarm - debug logging is enabled on this node, which can fill up disk space
Disk Failure Alarm - a disk has failed on the node (the disk health log indicates which one failed)
Time Skew Alarm - the clock on the node is out of sync with the master CLDB by more than 20 seconds
Root Partition Full Alarm - the root partition ("/") on the node is 99% full and running out of space
Installation Directory Full Alarm - the partition /opt/mapr on the node is running out of space (95% full)
Core Present Alarm - a service on the node has crashed and created a core dump file
High FileServer Memory Alarm - the FileServer service on the node has high memory consumption
Pam Misconfigured Alarm - the PAM authentication on the node is configured incorrectly
TaskTracker Local Directory Full Alarm - the local directory used by the TaskTracker is full, and the TaskTracker cannot operate as a result
CLDB Alarm - the CLDB service on the node has stopped running
FileServer Alarm - the FileServer service on the node has stopped running
JobTracker Alarm - the JobTracker service on the node has stopped running
TaskTracker Alarm - the TaskTracker service on the node has stopped running
HBase Master Alarm - the HBase Master service on the node has stopped running
HBase RegionServer Alarm - the HBase RegionServer service on the node has stopped running
NFS Gateway Alarm - the NFS Gateway service on the node has stopped running
Webserver Alarm - the WebServer service on the node has stopped running
HostStats Alarm - the HostStats service on the node has stopped running
Metrics write problem Alarm - metric data was not written to the database, or there were issues writing to a logical volume
See the Alarms Reference.
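Raised alarms can also be inspected from the command line. A minimal sketch, assuming maprcli is available on a cluster node (the hostname below is an example taken from elsewhere in this guide, and the -entity option name should be verified against the alarm list command reference):

maprcli alarm list
maprcli alarm list -entity perfnode51.perf.lab

The first command lists all alarms currently raised in the cluster; the second restricts the listing to a single node.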
Note the following behavior on the Node Alarms view:
Clicking a node's Hostname navigates to the Node Properties View, which provides detailed information about the node.
The left pane of the Node Alarms view displays the available topologies. Click a topology name to view only the nodes in that topology.
Buttons:
Properties - navigates to the Node Properties View
Forget Node - opens the Forget Node dialog to remove the node(s) from active management in this cluster. Services on the node must be stopped before the node can be forgotten.
Manage Services - opens the Manage Node Services dialog, which lets you start and stop services on the node
Change Topology - opens the Change Node Topology dialog, which lets you change the rack or switch path for a node

Volume Alarms
The Volume Alarms view displays information about volume alarms in the cluster:
Mnt - whether the volume is mounted
Vol Name - the name of the volume
Snapshot Alarm - last Snapshot Failed alarm
Mirror Alarm - last Mirror Failed alarm
Replication Alarm - last Data Under-Replicated alarm
Data Alarm - last Data Unavailable alarm
Vol Advisory Quota Alarm - last Volume Advisory Quota Exceeded alarm
Vol Quota Alarm - last Volume Quota Exceeded alarm
Clicking any column name sorts data in ascending or descending order by that column. Clicking a volume name displays the Volume Properties dialog. Selecting the Show Unmounted checkbox shows unmounted volumes as well as mounted volumes. Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.
Buttons:
New Volume - displays the New Volume dialog
Properties - if the checkboxes beside one or more volumes are selected, displays the Volume Properties dialog
Mount - if an unmounted volume is selected, mounts it; if a mounted volume is selected, unmounts it (Unmount)
Remove - if the checkboxes beside one or more volumes are selected, displays the Remove Volume dialog
Start Mirroring - if a mirror volume is selected, starts the mirror sync process
Snapshots - if the checkboxes beside one or more volumes are selected, displays the Snapshots for Volume dialog
New Snapshot - if the checkboxes beside one or more volumes are selected, displays the Snapshot Name dialog

User/Group Alarms
The User/Group Alarms view displays information about user and group quota alarms in the cluster:
Name - the name of the user or group
User Advisory Quota Alarm - the last Advisory Quota Exceeded alarm
User Quota Alarm - the last Quota Exceeded alarm
Buttons:
Edit Properties - opens the User Properties dialog box, which lets you change user properties and clear alarms

Alerts
The Alerts dialog lets you specify which alarms cause a notification event and where email notifications are sent.
Fields:
Alarm Name - select the alarm to configure
Standard Notification - send notification to the default for the alarm type (the cluster administrator or volume creator, for example)
Additional Email Address - specify an additional custom email address to receive notifications for the alarm type
Buttons:
OK - save changes and exit
Cancel - exit without saving changes
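The user and group quotas whose alarms appear in the User/Group Alarms view can also be set from the command line. A sketch, assuming a hypothetical user jsmith; the entity type value (-type 0 is assumed here to denote a user) and the option names should be checked against the entity command reference:

maprcli entity list
maprcli entity modify -name jsmith -type 0 -advisoryquota 80G -quota 100G

The quota values use the same integer-plus-unit format described for volume quotas earlier in this section.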
System Settings Views
The System Settings view group provides the following views:
Email Addresses - specify MapR user email addresses
Permissions - give permissions to users
Quota Defaults - settings for default quotas in the cluster
Balancer Settings - settings to configure the disk space and role replication balancers on the cluster
SMTP - settings for sending email from MapR
HTTP - settings for accessing the MapR Control System via a browser
Manage Licenses - MapR license settings
Metrics Database - settings for the MapR Metrics MySQL database

Email Addresses
The Configure Email Addresses dialog lets you specify whether MapR gets user email addresses from an LDAP directory, or uses a company domain:
Use Company Domain - specify a domain to append after each username to determine each user's email address
Use LDAP - obtain each user's email address from an LDAP server
Buttons:
OK - save changes and exit
Cancel - exit without saving changes

Permissions
The User Permissions dialog lets you grant specific cluster permissions to particular users and groups.
User/Group field - the user or group to which permissions are to be granted (one user or group per row)
Permissions field - the permissions to grant to the user or group (see the Cluster Permissions table below)
Delete button - deletes the current row
[ + Add Permission ] - adds a new row

Cluster Permissions
Code - Allowed Action - Includes
login - Log in to the MapR Control System, use the API and command-line interface, read access on cluster and volumes
ss - Start/stop services
cv - Create volumes
a - Admin access - includes all permissions except fc
fc - Full control (administrative access and permission to change the cluster ACL) - includes a

Buttons:
OK - save changes and exit
Cancel - exit without saving changes

Quota Defaults
The Configure Quota Defaults dialog lets you set the default quotas that apply to users and groups.
The User Quota Defaults section contains the following fields:
Default User Advisory Quota - if selected, sets the advisory quota that applies to all users without an explicit advisory quota
Default User Total Quota - if selected, sets the total quota that applies to all users without an explicit total quota
The Group Quota Defaults section contains the following fields:
Default Group Advisory Quota - if selected, sets the advisory quota that applies to all groups without an explicit advisory quota
Default Group Total Quota - if selected, sets the total quota that applies to all groups without an explicit total quota
Buttons:
OK - saves the settings
Cancel - exits without saving the settings

SMTP
The Configure SMTP dialog lets you configure the email account from which the MapR cluster sends alerts and other notifications. The Configure Sending Email (SMTP) dialog contains the following fields:
Provider - selects Gmail or another email provider; if you select Gmail, the other fields are partially populated to help you with the configuration
SMTP Server - specifies the SMTP server to use when sending email
The server requires an encrypted connection (SSL) - use SSL when connecting to the SMTP server
SMTP Port - the port to use on the SMTP server
Full Name - the name used in the From field when the cluster sends an alert email
Email Address - the email address used in the From field when the cluster sends an alert email
Username - the username used to log onto the email account the cluster will use to send email
SMTP Password - the password to use when sending email. Buttons: OK - saves the settings Cancel - exits without saving the settings Balancer Settings The Balancer Configuration dialog enables you to configure the behaviors of the disk space and role replication . Balancers The Balancer Configuration dialog has the following elements: Balancer Controls: Contains toggle settings for the Disk Balancer and the Role Balancer. Set a balancer's toggle to to enable that ON balancer. Disk Balancer Settings: Configures the behavior of the disk balancer. Disk Balancer Presets: These preconfigured settings enable quick setting of policies for , and disk Rapid Moderate, Relaxed balancing. The default setting is . Moderate Threshold: Move this slider to set a percentage usage of a storage pool that makes the storage pool eligible for rebalancing operations. The default value for this setting is 70%. % Concurrent Disk Rebalancers: Move this slider to set the maximum percentage of data that is actively being rebalanced at a given time. Rebalancing operations will not affect more data than the value of this slider. The default value for this setting is 10%. Role Balancer Settings: Configures the behavior of the role balancer. Role Balancer Presets: These preconfigured settings enable quick setting of policies for , and role Rapid Moderate, Relaxed balancing. The default setting is . Moderate % Concurrent Role Rebalancers: Move this slider to set the maximum percentage of data that is actively being rebalanced at a given time. Role rebalancing operations will not affect more data than the value of this slider. The default value for this setting is 10%. Delay For Active Data: Move this slider to set a time frame in seconds. Role rebalancing operations skip any data that was active within the specified time frame. The default value for this setting is 600 seconds. Buttons: OK - saves the settings Cancel - exits without saving the settings HTTP The Configure HTTP dialog lets you configure access to the MapR Control System via HTTP and HTTPS. The sections in the Configure HTTP dialog let you enable HTTP and HTTPS access, and set the session timeout, respectively: Enable HTTP Access - if selected, configure HTTP access with the following field: HTTP Port - the port on which to connect to the MapR Control System via HTTP Enable HTTPS Access - if selected, configure HTTPS access with the following fields: HTTPS Port - the port on which to connect to the MapR Control System via HTTPS HTTPS Keystore Path - a path to the HTTPS keystore HTTPS Keystore Password - a password to access the HTTPS keystore HTTPS Key Password - a password to access the HTTPS key Session Timeout - the number of seconds before an idle session times out. 
Buttons: OK - saves the settings Cancel - exits without saving the settings Manage Licenses The License Management dialog lets you add and activate licenses for the cluster, and displays the Cluster ID and the following information about existing licenses: Name - the name of each license Issued - the date each license was issued Expires - the expiration date of each license Nodes - the nodes to which each license applies Fields: Cluster ID - the unique identifier needed for licensing the cluster Buttons: Add Licenses via Web - navigates to the MapR licensing form online Add License via Upload - alternate licensing mechanism: upload via browser Add License via Copy/Paste - alternate licensing mechanism: paste license key Apply Licenses - validates the licenses and applies them to the cluster Cancel - closes the dialog. Metrics The Configure Metrics Database dialog enables you to specify the location and login credentials of the MySQL server that stores information for J . ob Metrics Fields: URL - the hostname and port of the machine running the MySQL server Username - the username for the MySQL database metrics Password - the password for the MySQL database metrics Buttons: OK - saves the MySQL information in the fields Cancel - closes the dialog Other Views In addition to the MapR Control System views, there are views that display detailed information about the system: CLDB View - information about the container location database HBase View - information about HBase on the cluster JobTracker View - information about the JobTracker Nagios View - information about the Nagios configuration script Terminal View - an ssh terminal for logging in to the cluster With the exception of the MapR Launchpad, the above views include the following buttons: - Refresh Button (refreshes the view) - Popout Button (opens the view in a new browser window) CLDB View The CLDB view provides information about the Container Location Database (CLDB). The CLDB is a management service that keeps track of container locations and the root of volumes. To display the CLDB view, open the MapR Control System and click in the navigation pane. CLDB The following table describes the fields on the CLDB view: Field Description CLDB Mode The CLDB node can be in the following modes: MASTER_READ_WRITE, SLAVE_READ_ONLY, or / CLDB BuildVersion Lists the build version. CLDB Status Can be RUNNING, or Cluster Capacity Lists the storage capacity for the cluster. Cluster Used Lists the amount of storage in use. Cluster Available Lists the amount of available storage. Active FileServers A list of FileServers, and the following information about each: ServerID (Hex) - The server's ID in hexadecimal notation. ServerID - The server's ID in decimal notation. HostPort - The IP address of the host HostName - The hostname assigned to that file server. Network Location - The network topology for that file server. Last Heartbeat (s) - The timestamp for the last received heartbeat. State - Can be ACTIVE or Capacity (MB) - Total storage capacity on this server. Used (MB) - Storage used on this server. Available (MB) - Storage available on this server. In Transit (MB) - Active NFS Servers A list of NFS servers, and the following information about each: ServerID (Hex) - The server's ID in hexadecimal notation. ServerID - The server's ID in decimal notation. HostPort - The IP address of the host HostName - The hostname assigned to that file server. Last Heartbeat (s) - The timestamp for the last received heartbeat. 
State - Can be Active or Volumes A list of volumes, and the following information about each: Volume Name Mount Point - The path of where the volume is mounted over NFS. Mounted - Can be Y or N. ReadOnly - Can be Y or N. Volume ID - The Volume ID Volume Topology - The path describing the topology to which the volume is assigned. Quota - The total size of the volume's quota. A quota of 0 means no quota is assigned. Advisory Quota - The usage level that triggers a disk usage warning. Used - Total size of data written to the volume LogicalUsed - Actual size of data written to the volume Root Container ID - The ID of the root container. Replication - Guaranteed Replication - Accounting Entities A list of users and groups, and the following information about each: AE Name - AE Type - AE Quota - AE Advisory Quota - AE Used - Mirrors A list of mirrors, and the following information about each: Mirror Volume Name - Mirror ID - Mirror NextID - Mirror Status - Last Successful Mirror Time - Mirror SrcVolume - Mirror SrcRootContainerID - Mirror SrcClusterName - Mirror SrcSnapshot - Mirror DataGenerator Volume - Snapshots A list of snapshots, and the following information about each: Snapshot ID - RW Volume ID - Snapshot Name - Root Container ID - Snapshot Size - Snapshot InProgress - Containers A list of containers, and the following information about each: Container ID - Volume ID - Latest Epoch - SizeMB - Container Master Location - Container Locations - Inactive Locations - Unused Locations - Replication Type - Snapshot Containers A list of snapshot containers, and the following information about each: Snapshot Container ID - unique ID of the container Snapshot ID - ID of the snapshot corresponding to the container RW Container ID - corresponding source container ID Latest Epoch - SizeMB - container size, in MB Container Master Location - location of the container's master replica Container Locations - Inactive Locations - HBase View The HBase View provides information about HBase on the cluster. Field Description Local Logs A link to the HBase Local Logs View Thread Dump A link to the HBase Thread Dump View Log Level A link to the , a form for getting/setting the log HBase Log Level View level Master Attributes A list of attributes, and the following information about each: Attribute Name - Value - Description - Catalog Tables A list of tables, and the following information about each: Table - Description - User Tables   Region Servers A list of region servers in the cluster, and the following information about each: Address - Start Code - Load - Total - HBase Local Logs View The HBase Local Logs view displays a list of the local HBase logs. Clicking a log name displays the contents of the log. Each log name can be copied and pasted into the to get or set the current log level. HBase Log Level View HBase Log Level View The HBase Log Level View is a form for getting and setting log levels that determine which information gets logged. The field accepts a log Log name (which can be copied from the and pasted). The Level field takes any of the following valid log levels: HBase Local Logs View ALL TRACE DEBUG INFO WARN ERROR OFF HBase Thread Dump View The HBase Thread Dump View displays a dump of the HBase thread. 
Example: Process Thread Dump: 40 active threads Thread 318 (1962516546@qtp-879081272-3): State: RUNNABLE Blocked count: 8 Waited count: 32 Stack: sun.management.ThreadImpl.getThreadInfo0(Native Method) sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147) sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123) org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149) org.apache.hadoop.http.HttpServer$StackServlet.doGet(HttpServer.java:695) javax.servlet.http.HttpServlet.service(HttpServlet.java:707) javax.servlet.http.HttpServlet.service(HttpServlet.java:820) org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221 ) org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:826) org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212 ) org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.jav a:230) org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) org.mortbay.jetty.Server.handle(Server.java:326) org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) Thread 50 (perfnode51.perf.lab:60000-CatalogJanitor): State: TIMED_WAITING Blocked count: 1081 Waited count: 1350 Stack: java.lang.Object.wait(Native Method) org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:91) org.apache.hadoop.hbase.Chore.run(Chore.java:74) Thread 49 (perfnode51.perf.lab:60000-BalancerChore): State: TIMED_WAITING Blocked count: 0 Waited count: 270 Stack: java.lang.Object.wait(Native Method) org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:91) org.apache.hadoop.hbase.Chore.run(Chore.java:74) Thread 48 (MASTER_OPEN_REGION-perfnode51.perf.lab:60000-1): State: WAITING Blocked count: 2 Waited count: 3 Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6d1cf4e5 Stack: sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQu euedSynchronizer.java:1925) java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399) java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947) JobTracker View Field Description State   Started   Version   Compiled   Identifier   Cluster Summary The heapsize, and the following information about the cluster: Running Map Tasks - Running Reduce Tasks - Total Submissions - Nodes - Occupied Map Slots - Occupied Reduce Slots - Reserved Map Slots - Reserved Reduce Slots - Map Task Capacity - Reduce Task Capacity - Avg. 
Tasks/Node - Blacklisted Nodes - Excluded Nodes - MapTask Prefetch Capacity - Scheduling Information A list of queues, and the following information about each: Queue name - State - Scheduling Information - Filter A field for filtering results by Job ID, Priority, User, or Name Running Jobs A list of running MapReduce jobs, and the following information about each: JobId - Priority - User - Name - Start Time - Map % Complete - Current Map Slots - Failed MapAttempts - MapAttempt Time Avg/Max - Cumulative Map CPU - Current Map PMem - Reduce % Complete - Current Reduce Slots - Failed ReduceAttempts - ReduceAttempt Time Avg/Max Cumulative Reduce CPU - Current Reduce PMem - Completed Jobs A list of current MapReduce jobs, and the following information about each: JobId - Priority - User - Name - Start Time - Total Time - Maps Launched - Map Total - Failed MapAttempts - MapAttempt Time Avg/Max - Cumulative Map CPU - Reducers Launched - Reduce Total - Failed ReduceAttempts - ReduceAttempt Time Avg/Max - Cumulative Reduce CPU - Cumulative Reduce PMem - Vaidya Reports - Retired Jobs A list of retired MapReduce job, and the following information about each: JobId - Priority - User - Name - State - Start Time - Finish Time - Map % Complete - Reduce % Complete - Job Scheduling Information - Diagnostic Info - Local Logs A link to the local logs JobTracker Configuration A link to a page containing Hadoop JobTracker configuration values JobTracker Configuration View Field Default fs.automatic.close TRUE fs.checkpoint.dir ${hadoop.tmp.dir}/dfs/namesecondary fs.checkpoint.edits.dir ${fs.checkpoint.dir} fs.checkpoint.period 3600 fs.checkpoint.size 67108864 fs.default.name maprfs:/// fs.file.impl org.apache.hadoop.fs.LocalFileSystem fs.ftp.impl org.apache.hadoop.fs.ftp.FTPFileSystem fs.har.impl org.apache.hadoop.fs.HarFileSystem fs.har.impl.disable.cache TRUE fs.hdfs.impl org.apache.hadoop.hdfs.DistributedFileSystem fs.hftp.impl org.apache.hadoop.hdfs.HftpFileSystem fs.hsftp.impl org.apache.hadoop.hdfs.HsftpFileSystem fs.kfs.impl org.apache.hadoop.fs.kfs.KosmosFileSystem fs.maprfs.impl com.mapr.fs.MapRFileSystem fs.ramfs.impl org.apache.hadoop.fs.InMemoryFileSystem fs.s3.block.size 67108864 fs.s3.buffer.dir ${hadoop.tmp.dir}/s3 fs.s3.impl org.apache.hadoop.fs.s3.S3FileSystem fs.s3.maxRetries 4 fs.s3.sleepTimeSeconds 10 fs.s3n.block.size 67108864 fs.s3n.impl org.apache.hadoop.fs.s3native.NativeS3FileSystem fs.trash.interval 0 hadoop.job.history.location file:////opt/mapr/hadoop/hadoop-0.20.2/bin/../logs/history hadoop.logfile.count 10 hadoop.logfile.size 10000000 hadoop.native.lib TRUE hadoop.proxyuser.root.groups root hadoop.proxyuser.root.hosts (none) hadoop.rpc.socket.factory.class.default org.apache.hadoop.net.StandardSocketFactory hadoop.security.authentication simple hadoop.security.authorization FALSE hadoop.security.group.mapping org.apache.hadoop.security.ShellBasedUnixGroupsMapping hadoop.tmp.dir /tmp/hadoop-${user.name} hadoop.util.hash.type murmur io.bytes.per.checksum 512 io.compression.codecs org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io .compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec io.file.buffer.size 8192 io.map.index.skip 0 io.mapfile.bloom.error.rate 0.005 io.mapfile.bloom.size 1048576 io.seqfile.compress.blocksize 1000000 io.seqfile.lazydecompress TRUE io.seqfile.sorter.recordlimit 1000000 io.serializations org.apache.hadoop.io.serializer.WritableSerialization io.skip.checksum.errors FALSE io.sort.factor 256 io.sort.record.percent 0.17 io.sort.spill.percent 
0.99 ipc.client.connect.max.retries 10 ipc.client.connection.maxidletime 10000 ipc.client.idlethreshold 4000 ipc.client.kill.max 10 ipc.client.tcpnodelay FALSE ipc.server.listen.queue.size 128 ipc.server.tcpnodelay FALSE job.end.retry.attempts 0 job.end.retry.interval 30000 jobclient.completion.poll.interval 5000 jobclient.output.filter FAILED jobclient.progress.monitor.poll.interval 1000 keep.failed.task.files FALSE local.cache.size 10737418240 map.sort.class org.apache.hadoop.util.QuickSort mapr.localoutput.dir output mapr.localspill.dir spill mapr.localvolumes.path /var/mapr/local mapred.acls.enabled FALSE mapred.child.oom_adj 10 mapred.child.renice 10 mapred.child.taskset TRUE mapred.child.tmp ./tmp mapred.cluster.ephemeral.tasks.memory.limit.mb 200 mapred.compress.map.output FALSE mapred.fairscheduler.allocation.file conf/pools.xml mapred.fairscheduler.assignmultiple TRUE mapred.fairscheduler.eventlog.enabled FALSE mapred.fairscheduler.smalljob.max.inputsize 10737418240 mapred.fairscheduler.smalljob.max.maps 10 mapred.fairscheduler.smalljob.max.reducer.inputsize 1073741824 mapred.fairscheduler.smalljob.max.reducers 10 mapred.fairscheduler.smalljob.schedule.enable TRUE mapred.healthChecker.interval 60000 mapred.healthChecker.script.timeout 600000 mapred.inmem.merge.threshold 1000 mapred.job.queue.name default mapred.job.reduce.input.buffer.percent 0 mapred.job.reuse.jvm.num.tasks -1 mapred.job.shuffle.input.buffer.percent 0.7 mapred.job.shuffle.merge.percent 0.66 mapred.job.tracker <JobTracker_hostname>:9001 mapred.job.tracker.handler.count 10 mapred.job.tracker.history.completed.location /var/mapr/cluster/mapred/jobTracker/history/done mapred.job.tracker.http.address 0.0.0.0:50030 mapred.job.tracker.persist.jobstatus.active FALSE mapred.job.tracker.persist.jobstatus.dir /var/mapr/cluster/mapred/jobTracker/jobsInfo mapred.job.tracker.persist.jobstatus.hours 0 mapred.jobtracker.completeuserjobs.maximum 100 mapred.jobtracker.instrumentation org.apache.hadoop.mapred.JobTrackerMetricsInst mapred.jobtracker.job.history.block.size 3145728 mapred.jobtracker.jobhistory.lru.cache.size 5 mapred.jobtracker.maxtasks.per.job -1 mapred.jobtracker.port 9001 mapred.jobtracker.restart.recover TRUE mapred.jobtracker.retiredjobs.cache.size 1000 mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.FairScheduler mapred.line.input.format.linespermap 1 mapred.local.dir ${hadoop.tmp.dir}/mapred/local mapred.local.dir.minspacekill 0 mapred.local.dir.minspacestart 0 mapred.map.child.java.opts -XX:ErrorFile=/opt/cores/mapreduce_java_error%p.log mapred.map.max.attempts 4 mapred.map.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec mapred.map.tasks 2 mapred.map.tasks.speculative.execution TRUE mapred.max.maps.per.node -1 mapred.max.reduces.per.node -1 mapred.max.tracker.blacklists 4 mapred.max.tracker.failures 4 mapred.merge.recordsBeforeProgress 10000 mapred.min.split.size 0 mapred.output.compress FALSE mapred.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec mapred.output.compression.type RECORD mapred.queue.names default mapred.reduce.child.java.opts -XX:ErrorFile=/opt/cores/mapreduce_java_error%p.log mapred.reduce.copy.backoff 300 mapred.reduce.max.attempts 4 mapred.reduce.parallel.copies 12 mapred.reduce.slowstart.completed.maps 0.95 mapred.reduce.tasks 1 mapred.reduce.tasks.speculative.execution FALSE mapred.running.map.limit -1 mapred.running.reduce.limit -1 mapred.skip.attempts.to.start.skipping 2 mapred.skip.map.auto.incr.proc.count TRUE 
mapred.skip.map.max.skip.records 0 mapred.skip.reduce.auto.incr.proc.count TRUE mapred.skip.reduce.max.skip.groups 0 mapred.submit.replication 10 mapred.system.dir /var/mapr/cluster/mapred/jobTracker/system mapred.task.cache.levels 2 mapred.task.profile FALSE mapred.task.profile.maps 0-2 mapred.task.profile.reduces 0-2 mapred.task.timeout 600000 mapred.task.tracker.http.address 0.0.0.0:50060 mapred.task.tracker.report.address 127.0.0.1:0 mapred.task.tracker.task-controller org.apache.hadoop.mapred.DefaultTaskController mapred.tasktracker.dns.interface default mapred.tasktracker.dns.nameserver default mapred.tasktracker.ephemeral.tasks.maximum 1 mapred.tasktracker.ephemeral.tasks.timeout 10000 mapred.tasktracker.ephemeral.tasks.ulimit 4294967296> mapred.tasktracker.expiry.interval 600000 mapred.tasktracker.indexcache.mb 10 mapred.tasktracker.instrumentation org.apache.hadoop.mapred.TaskTrackerMetricsInst mapred.tasktracker.map.tasks.maximum (CPUS > 2) ? (CPUS * 0.75) : 1 mapred.tasktracker.reduce.tasks.maximum (CPUS > 2) ? (CPUS * 0.50): 1 mapred.tasktracker.taskmemorymanager.monitoring-interval 5000 mapred.tasktracker.tasks.sleeptime-before-sigkill 5000 mapred.temp.dir ${hadoop.tmp.dir}/mapred/temp mapred.userlog.limit.kb 0 mapred.userlog.retain.hours 24 mapreduce.heartbeat.10 300 mapreduce.heartbeat.100 1000 mapreduce.heartbeat.1000 10000 mapreduce.heartbeat.10000 100000 mapreduce.job.acl-view-job   mapreduce.job.complete.cancel.delegation.tokens TRUE mapreduce.job.split.metainfo.maxsize 10000000 mapreduce.jobtracker.recovery.dir /var/mapr/cluster/mapred/jobTracker/recovery mapreduce.jobtracker.recovery.maxtime 120 mapreduce.jobtracker.staging.root.dir /var/mapr/cluster/mapred/jobTracker/staging mapreduce.maprfs.use.compression TRUE mapreduce.reduce.input.limit -1 mapreduce.tasktracker.outofband.heartbeat FALSE mapreduce.tasktracker.prefetch.maptasks 1 mapreduce.use.fastreduce FALSE mapreduce.use.maprfs TRUE tasktracker.http.threads 2 topology.node.switch.mapping.impl org.apache.hadoop.net.ScriptBasedMapping topology.script.number.args 100 webinterface.private.actions FALSE Nagios View The Nagios view displays a dialog containing a Nagios configuration script. 
Example: ############# Commands ############# define command { command_name check_fileserver_proc command_line $USER1$/check_tcp -p 5660 } define command { command_name check_cldb_proc command_line $USER1$/check_tcp -p 7222 } define command { command_name check_jobtracker_proc command_line $USER1$/check_tcp -p 50030 } define command { command_name check_tasktracker_proc command_line $USER1$/check_tcp -p 50060 } define command { command_name check_nfs_proc command_line $USER1$/check_tcp -p 2049 } define command { command_name check_hbmaster_proc command_line $USER1$/check_tcp -p 60000 } define command { command_name check_hbregionserver_proc command_line $USER1$/check_tcp -p 60020 } define command { command_name check_webserver_proc command_line $USER1$/check_tcp -p 8443 } ################# HOST: perfnode51.perf.lab ############### define host { use linux-server host_name perfnode51.perf.lab address 10.10.30.51 check_command check-host-alive } ################# HOST: perfnode52.perf.lab ############### define host { use linux-server host_name perfnode52.perf.lab address 10.10.30.52 check_command check-host-alive } ################# HOST: perfnode53.perf.lab ############### define host { use linux-server host_name perfnode53.perf.lab address 10.10.30.53 check_command check-host-alive } ################# HOST: perfnode54.perf.lab ############### define host { use linux-server host_name perfnode54.perf.lab address 10.10.30.54 check_command check-host-alive } ################# HOST: perfnode55.perf.lab ############### define host { use linux-server host_name perfnode55.perf.lab address 10.10.30.55 check_command check-host-alive } ################# HOST: perfnode56.perf.lab ############### define host { use linux-server host_name perfnode56.perf.lab address 10.10.30.56 check_command check-host-alive } Terminal View The Terminal View feature does not exist in MapR version 3.x and later. Your browser should redirect to the in a moment. Terminal View page for MapR v2.x Redirecting to http://www.mapr.com/doc/display/MapR2/Terminal+View Node-Related Dialog Boxes This page describes the node-related dialog boxes, which are accessible in most views that list node details. This includes the following dialog boxes: Forget Node Manage Node Services Change Node Topology Forget Node The Forget Node dialog confirms that you wish to remove a node from active management in this cluster. Services on the node must be stopped before the node can be forgotten. Manage Node Services The Manage Node Services dialog lets you start and stop services on a node, or multiple nodes. The Service Changes section contains a dropdown menu for each service: No change - leave the service running if it is running, or stopped if it is stopped Start - start the service Stop - stop the service Restart - restart the service Buttons: OK - start and stop the selected services as specified by the dropdown menus Cancel - returns to the Node Properties View without starting or stopping any services You can also start and stop services in the pane of the view. Manage Node Services Node Properties Change Node Topology The Change Node Topology dialog lets you change the rack or switch path for one or more nodes. 
The Change Node Topology dialog consists of: Nodes affected (the node or nodes to be moved, as specified in the Nodes view) A field with a dropdown menu for the new node topology path The Change Node Topology dialog contains the following buttons: OK - changes the node topology Cancel - returns to the Nodes view without changing the node topology Hadoop Commands All Hadoop commands are invoked by the script. bin/hadoop Usage: hadoop [--config confdir] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS] Hadoop has an option parsing framework that employs parsing generic options as well as running classes. COMMAND_OPTION Description --config confdir Overwrites the default Configuration directory. Default is ${HADOOP_ . HOME}/conf COMMAND Various commands with their options are described in the following sections. GENERIC_OPTIONS The common set of options supported by multiple commands. COMMAND_OPTIONS Various command options are described in the following sections. Commands The following commands may be run on MapR: hadoop Command Description archive -archiveName NAME <src>* <dest> The command creates a Hadoop archive, a file hadoop archive that contains other files. A Hadoop archive always has a exte *.har nsion. classpath The command prints the class path needed to hadoop classpath access the Hadoop JAR and the required libraries. conf The command prints the configuration information for hadoop conf the current node. daemonlog The command may be used to get or set the hadoop daemonlog log level of Hadoop daemons. distcp <source> <destination> The command is a tool for large inter- and hadoop distcp intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. fs The command runs a generic filesystem user client that hadoop fs interacts with the MapR filesystem (MapR-FS). jar <jar> The command runs a JAR file. Users can bundle their hadoop jar MapReduce code in a JAR file and execute it using this command. job Manipulates MapReduce jobs. jobtracker Runs the MapReduce Jobtracker node. mfs The command performs operations on directories in the hadoop mfs cluster. The main purposes of are to display directory hadoop mfs information and contents, to create symbolic links, and to set compression and chunk size on a directory. mradmin Runs a MapReduce admin client. pipes Runs a pipes job. queue Gets information about job queues. tasktracker The command runs a MapReduce hadoop tasktracker tasktracker node. version The command prints the Hadoop software hadoop version version. Generic Options Useful Information Running the script without any arguments prints the description for all commands. hadoop Useful Information Most Hadoop commands print help when invoked without parameters. Implement the interface and the following generic Hadoop command-line options are available for many of the Hadoop commands. Tool Generic options are supported by the , , , , , and Hadoop commands. distcp fs job mradmin pipes queue Generic Option Description -conf <filename1 filename2 ...> Add the specified configuration files to the list of resources available in the configuration. -D <property=value> Set a value for the specified Hadoop configuration property. -fs <local|filesystem URI> Set the URI of the default filesystem. -jt <local|jobtracker:port> Specify a jobtracker for a given host and port. 
This command option is a shortcut for -D mapred.job.tracker=host:port -files <file1,file2,...> Specify files to be copied to the map reduce cluster. -libjars <jar1,jar2,...> Specify JAR files to be included in the classpath of the mapper and reducer tasks. -archives <archive1,archive2,...> Specify archive files (JAR, tar, tar.gz, ZIP) to be copied and unarchived on the task node. CLASSNAME hadoop script can be used to invoke any class. Usage: hadoop CLASSNAME Runs the class named CLASSNAME. hadoop archive The command creates a Hadoop archive, a file that contains other files. A Hadoop archive always has a extension. hadoop archive *.har Syntax hadoop [ Generic Options ] archive -archiveName <name> [-p <parent>] <source> <destination> Parameters Parameter Description -archiveName <name> Name of the archive to be created. -p <parent_path> The parent argument is to specify the relative path to which the files should be archived to. <source> Filesystem pathnames which work as usual with regular expressions. <destination> Destination directory which would contain the archive. Examples Archive within a single directory hadoop archive -archiveName myArchive.har -p /foo/bar /outputdir The above command creates an archive of the directory in the directory . /foo/bar /outputdir Archive to another directory hadoop archive -archiveName myArchive.har -p /foo/bar a/b/c e/f/g The above command creates an archive of the directory in the directory . /foo/bar/a/b/c /foo/bar/e/f/g hadoop classpath The command prints the class path needed to access the Hadoop jar and the required libraries. hadoop classpath Syntax hadoop classpath Output $ hadoop classpath /opt/mapr/hadoop/hadoop-0.20.2/bin/../conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/opt/ mapr/hado op/hadoop-0.20.2/bin/..:/opt/mapr/hadoop/hadoop-0.20.2/bin/../hadoop*core*.jar:/opt/ma pr/hadoop /hadoop-0.20.2/bin/../lib/aspectjrt-1.6.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../li b/aspectj tools-1.6.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-cli-1.2.jar:/opt/map r/hadoop/ hadoop-0.20.2/bin/../lib/commons-codec-1.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../l ib/common s-daemon-1.0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-el-1.0.jar:/opt/m apr/hadoo p/hadoop-0.20.2/bin/../lib/commons-httpclient-3.0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2 /bin/../l ib/commons-logging-1.0.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging -api-1.0. 4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-net-1.4.1.jar:/opt/mapr/hadoop /hadoop-0 .20.2/bin/../lib/core-3.1.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/eval-0.5.jar :/opt/map r/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-capacity-scheduler.jar:/opt/mapr/h adoop/had oop-0.20.2/bin/../lib/hadoop-0.20.2-dev-core.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/.. /lib/hado op-0.20.2-dev-fairscheduler.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hsqldb-1.8.0 .10.jar:/ opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jackson-core-asl-1.5.2.jar:/opt/mapr/hadoop/h adoop-0.2 0.2/bin/../lib/jackson-mapper-asl-1.5.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/ jasper-co mpiler-5.5.12.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jasper-runtime-5.5.12.jar: /opt/mapr /hadoop/hadoop-0.20.2/bin/../lib/jets3t-0.6.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/. ./lib/jet ty-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-servlet-tester-6.1.14.ja r:/opt/ma pr/hadoop/hadoop-0.20.2/bin/../lib/jetty-util-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20. 
2/bin/../ lib/junit-4.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/kfs-0.2.2.jar:/opt/mapr/ha doop/hado op-0.20.2/bin/../lib/log4j-1.2.15.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/loggin g-0.1.jar :/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/maprfs-0.1.jar:/opt/mapr/hadoop/hadoop-0.20 .2/bin/.. /lib/maprfs-test-0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/mockito-all-1.8.2.j ar:/opt/m apr/hadoop/hadoop-0.20.2/bin/../lib/mysql-connector-java-5.0.8-bin.jar:/opt/mapr/hadoo p/hadoop- 0.20.2/bin/../lib/oro-2.0.8.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/servlet-api- 2.5-6.1.1 4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/slf4j-api-1.4.3.jar:/opt/mapr/hadoop/h adoop-0.2 0.2/bin/../lib/slf4j-log4j12-1.4.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/xmlen c-0.52.ja r:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/zookeeper-3.3.2.jar:/opt/mapr/hadoop/hadoo p-0.20.2/ bin/../lib/jsp-2.1/jsp-2.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jsp-2.1/jsp-a pi-2.1.ja r hadoop daemonlog The command gets and sets the log level for each daemon. hadoop daemonlog Hadoop daemons all produce logfiles that you can use to learn about what is happening on the system. You can use the co hadoop daemonlog mmand to temporarily change the log level of a component when debugging the system. Syntax hadoop daemonlog -getlevel | -setlevel <host>:<port> <name> [ <level> ] Parameters The following command options are supported for command: hadoop daemonlog Parameter Description -getlevel <host:port><name> Prints the log level of the daemon running at the specified host and port, by querying http://<host>:<port>/logLevel?log=<name> <host>: The host on which to get the log level. <port>: The port by which to get the log level. <name>: The daemon on which to get the log level. Usually the fully qualified classname of the daemon doing the logging. For example, for org.apache.hadoop.mapred.JobTracker the JobTracker daemon. -setlevel <host:port> <name> <level> Sets the log level of the daemon running at the specified host and port, by querying http://<host>:<port>/logLevel?log=<name> * : The host on which to set the log level. <host> <port>: The port by which to set the log level. <name>: The daemon on which to set the log level. <level: The log level to set the daemon. Examples Getting the log levels of a daemon To get the log level for each daemon enter a command such as the following: hadoop daemonlog -getlevel 10.250.1.15:50030 org.apache.hadoop.mapred.JobTracker Connecting to http://10.250.1.15:50030/logLevel?log=org.apache.hadoop.mapred.JobTracker Submitted Log Name: org.apache.hadoop.mapred.JobTracker Log Class: org.apache.commons.logging.impl.Log4JLogger Effective level: ALL Setting the log level of a daemon To temporarily set the log level for a daemon enter a command such as the following: hadoop daemonlog -setlevel 10.250.1.15:50030 org.apache.hadoop.mapred.JobTracker DEBUG Connecting to http://10.250.1.15:50030/logLevel?log=org.apache.hadoop.mapred.JobTracker&level=DEBUG Submitted Log Name: org.apache.hadoop.mapred.JobTracker Log Class: org.apache.commons.logging.impl.Log4JLogger Submitted Level: DEBUG Setting Level to DEBUG ... Effective level: DEBUG Using this method, the log level is automatically reset when the daemon is restarted. 
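You can confirm that the change took effect by querying the same daemon again with -getlevel, reusing the host, port, and class name from the example above:

hadoop daemonlog -getlevel 10.250.1.15:50030 org.apache.hadoop.mapred.JobTracker

The effective level reported should now be DEBUG.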
To make the change to log level of a daemon persistent, enter a command such as the following: hadoop daemonlog -setlevel 10.250.1.15:50030 log4j.logger.org.apache.hadoop.mapred.JobTracker DEBUG hadoop distcp The command is a tool used for large inter- and intra-cluster copying. It uses MapReduce to effect its distribution, error handling hadoop distcp and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. Syntax hadoop [ Generic Options ] distcp [-p [rbugp] ] [-i ] [-log ] [-m ] [-overwrite ] [-update ] [-f <URI list> ] [-filelimit <n> ] [-sizelimit <n> ] [-delete ] <source> <destination> Parameters Command Options The following command options are supported for the command: hadoop distcp Parameter Description <source> Specify the source URL. <destination> Specify the destination URL. -p [rbugp] Preserve    : replication number r    : block size b    : user u    : group g    : permission p alone is equivalent to . -p -prbugp Modification times are not preserved. When you specify , -update status updates are not synchronized unless the file sizes also differ. -i Ignore failures. As explained in the below, this option will keep more accurate statistics about the copy than the default case. It also preserves logs from failed copies, which can be valuable for debugging. Finally, a failing map will not cause the job to fail before all splits are attempted. -log <logdir> Write logs to . The command keeps logs <logdir> hadoop distcp of each file it attempts to copy as map output. If a map fails, the log output will not be retained if it is re-executed. -m <num_maps> Maximum number of simultaneous copies. Specify the number of maps to copy data. Note that more maps may not necessarily improve throughput. See . Map Sizing -overwrite Overwrite destination. If a map fails and is not specified, all the -i files in the split, not only those that failed, will be recopied. As discussed in the , it also changes Overwriting Files Between Clusters the semantics for generating destination paths, so users should use this carefully. -update Overwrite if size is different from size. <source> <destination> As noted in the preceding, this is not a "sync" operation. The only criterion examined is the source and destination file sizes; if they differ, the source file replaces the destination file. See Updating Files Between Clusters -f <URI list> Use list at <URI list> as source list. This is equivalent to listing each source on the command line. The value of <URI list> must be a fully qualified URI. -filelimit <n> Limit the total number of files to be <= n. See Symbolic . Representations -sizelimit <n> Limit the total size to be <= n bytes. See . Symbolic Representations -delete Delete the files existing in the but not in . <destination> <source> The deletion is done by FS Shell. Generic Options The command supports the following generic options: , , hadoop distcp -conf <configuration file> -D <property=value> -fs , , , <local|file system URI> -jt <local|jobtracker:port> -files <file1,file2,file3,...> -libjars , and . <libjar1,libjar2,libjar3,...> -archives <archive1,archive2,archive3,...> For more information on generic options, see . Generic Options Symbolic Representations The parameter in and can be specified with symbolic representation. 
For example, <n> -filelimit -sizelimit 1230k = 1230 * 1024 = 1259520 891g = 891 * 1024^3 = 956703965184 Map Sizing The command attempts to size each map comparably so that each copies roughly the same number of bytes. Note that files are hadoop distcp the finest level of granularity, so increasing the number of simultaneous copiers (i.e. maps) may not always increase the number of simultaneous copies nor the overall throughput. If is not specified, will attempt to schedule work for wher -m distcp min (total_bytes / bytes.per.map, 20 * num_task_trackers) e defaults to 256MB. bytes.per.map Tuning the number of maps to the size of the source and destination clusters, the size of the copy, and the available bandwidth is recommended for long-running and regularly run jobs. Examples Basic inter-cluster copying The commmand is most often used to copy files between clusters: hadoop distcp hadoop distcp maprfs:///mapr/cluster1/foo \ maprfs:///mapr/cluster2/bar The command in the example expands the namespace under on cluster1 into a temporary file, partitions its contents among a set of /foo/bar map tasks, and starts a copy on each TaskTracker from cluster1 to cluster2. Note that the command expects absolute paths. hadoop distcp Only those files that do not already exist in the destination are copied over from the source directory. Updating files between clusters Use the command to synchronize changes between clusters. hadoop distcp -update $ hadoop distcp -update maprfs:///mapr/cluster1/foo maprfs:///mapr/cluster2/bar/foo Files in the subtree are copied from cluster1 to cluster2 only if the size of the source file is different from that of the size of the destination /foo file. Otherwise, the files are skipped over. Note that using the option changes distributed copy interprets the source and destination paths making it necessary to add the trailing -update / subdirectory in the second cluster. foo Overwriting files between clusters By default, distributed copy skips files that already exist in the destination directory, but you can overwrite those files using the optio -overwrite n. In this example, multiple source directories are specified: $ hadoop distcp -overwrite maprfs:///mapr/cluster1/foo/a \ maprfs:///mapr/cluster1/foo/b \ maprfs:///mapr/cluster2/bar As with using the option, using the changes the way that the source and destination paths are interpreted by distributed -update -overwrite copy: the contents of the source directories are compared to the contents of the destination directory. The distributed copy aborts in case of a conflict. Migrating Data from HDFS to MapR-FS The command can be used to migrate data from an HDFS cluster to a MapR-FS where the HDFS cluster uses the same hadoop distcp version of the RPC protocol as that used by MapR. For a discussion, see . Copying Data from Apache Hadoop $ hadoop distcp namenode1:50070/foo maprfs:///bar You must specify the IP address and HTTP port (usually 50070) for the namenode on the HDFS cluster. hadoop fs The command runs a generic filesystem user client that interacts with the MapR filesystem (MapR-FS). hadoop fs Syntax hadoop [ Generic Options ] fs [-cat <src>] [-chgrp [-R] GROUP PATH...] [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...] [-chown [-R] [OWNER][:[GROUP]] PATH...] [-copyFromLocal <localsrc> ... 
<dst>] [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>] [-count[-q] <path>] [-cp <src> <dst>] [-df <path>] [-du <path>] [-dus <path>] [-expunge] [-get [-ignoreCrc] [-crc] <src> <localdst> [-getmerge <src> <localdst> [addnl]] [-help [cmd]] [-ls <path>] [-lsr <path>] [-mkdir <path>] [-moveFromLocal <localsrc> ... <dst>] [-moveToLocal <src> <localdst>] [-mv <src> <dst>] [-put <localsrc> ... <dst>] [-rm [-skipTrash] <src>] [-rmr [-skipTrash] <src>] [-stat [format] <path>] [-tail [-f] <path>] [-test -[ezd] <path>] [-text <path>] [-touchz <path>] Parameters Command Options The following command parameters are supported for : hadoop fs Parameter Description -cat <src> Fetch all files that match the file pattern defined by the <src> parameter and display their contents on . stdout -fs [local | <file system URI>] Specify the file system to use. If not specified, the current configuration is used, taken from the following, in increasing precedence: inside the hadoop jar file core-default.xml in core-site.xml $HADOOP_CONF_DIR The option means use the local file system as your DFS. local specifies a particular file system to <file system URI> contact. This argument is optional but if used must appear appear first on the command line. Exactly one additional argument must be specified. -ls <path> List the contents that match the specified file pattern. If path is not specified, the contents of /user/<currentUser> will be listed. Directory entries are of the form dirName (full path) <dir> and file entries are of the form . fileName(full path) <r n> size where n is the number of replicas specified for the file and size is the size of the file, in bytes. -lsr <path> Recursively list the contents that match the specified file pattern. Behaves very similarly to , hadoop fs -ls except that the data is shown for all the entries in the subtree. -df [<path>] Shows the capacity, free and used space of the filesystem. If the filesystem has multiple partitions, and no path to a particular partition is specified, then the status of the root partitions will be shown. -du <path> Show the amount of space, in bytes, used by the files that match the specified file pattern. Equivalent to the Unix command in case of a directory, du -sb <path>/* and to in case of a file. du -b <path> The output is in the form name(full path) size (in bytes). -dus <path> Show the amount of space, in bytes, used by the files that match the specified file pattern. Equivalent to the Unix command . The output is in the form du -sb size (in bytes). name(full path) -mv <src> <dst> Move files that match the specified file pattern <src> to a destination . When moving multiple files, the <dst> destination must be a directory. -cp <src> <dst> Copy files that match the file pattern to a <src> destination. When copying multiple files, the destination must be a directory. -rm [-skipTrash] <src> Delete all files that match the specified file pattern. Equivalent to the Unix command . rm <src> The option bypasses trash, if enabled, -skipTrash and immediately deletes <src> -rmr [-skipTrash] <src> Remove all directories which match the specified file pattern. Equivalent to the Unix command rm -rf <src> The option bypasses trash, if enabled, -skipTrash and immediately deletes <src> -put <localsrc> ... <dst> Copy files from the local file system into fs. -copyFromLocal <localsrc> ... <dst> Identical to the command. -put -moveFromLocal <localsrc> ... <dst> Same as , except that the source is -put deleted after it's copied. 
-get [-ignoreCrc] [-crc] <src> <localdst> Copy files that match the file pattern <src> to the local name. <src> is kept. When copying multiple files, the destination must be a directory. -getmerge <src> <localdst> Get all the files in the directories that match the source file pattern and merge and sort them to only one file on local fs. is kept. <src> -copyToLocal [-ignoreCrc] [-crc] <src> <localdst> Identical to the command. -get -moveToLocal <src> <localdst> Not implemented yet -mkdir <path> Create a directory in specified location. -tail [-f] <file> Show the last 1KB of the file. The option shows appended data as the file grows. -f -touchz <path> Write a timestamp in format yyyy-MM-dd HH:mm:ss in a file at . An error is returned if the file exists with non-zero <path> length. -test -[ezd] <path> If file { exists, has zero length, is a directory then return 0, else return 1. -text <src> Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream. -stat [format] <path> Print statistics about the file/directory at <path> in the specified format. Format accepts filesize in blocks (%b), filename (%n), block size (%o), replication (%r), modification date (%y, %Y) -chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH... Changes permissions of a file. This works similar to shell's with a few exceptions. chmod modifies the files recursively. This is the only option currently -R supported. Mode is same as mode used for shell command. MODE chmod Only letters recognized are . That is, rwxXt +t,a+r,g-w,+rwx,o=r Mode specifed in 3 or 4 digits. If 4 digits, the first may OCTALMODE be 1 or 0 to turn the sticky bit on or off, respectively. Unlike shell command, it is not possible to specify only part of the mode E.g. 754 is same as u=rwx,g=rx,o=r If none of 'augo' is specified, 'a' is assumed and unlike shell command, no umask is applied. -chown [-R] [OWNER][:[GROUP]] PATH... Changes owner and group of a file. This is similar to shell's with a few exceptions. chown modifies the files recursively. This is the only option -R currently supported. If only owner or group is specified then only owner or group is modified.The owner and group names may only consists of digits, alphabet, and any of . The names are -.@/' i.e. [-.@/a-zA-Z0-9] case sensitive. -chgrp [-R] GROUP PATH... This is equivalent to -chown ... :GROUP ... -count[-q] <path> Count the number of directories, files and bytes under the paths that match the specified file pattern. The output columns are: or DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME QUOTA REMAINING_QUATA SPACE_QUOTA REMAINING_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME -help [cmd] Displays help for given command or all commands if none is specified. Generic Options The following generic options are supported for the command: , , hadoop fs -conf <configuration file> -D <property=value> -fs , , , <local|file system URI> -jt <local|jobtracker:port> -files <file1,file2,file3,...> -libjars , and . For more information on generic options, <libjar1,libjar2,libjar3,...> -archives <archive1,archive2,archive3,...> Warning WARNING: Avoid using '.' to separate user name and group though Linux allows it. If user names have dots in them and you are using local file system, you might see surprising results since shell command is used for local files. chown see . Generic Options hadoop jar The command runs a program contained in a JAR file. Users can bundle their MapReduce code in a JAR file and execute it using hadoop jar this command. 
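As a minimal sketch of usage (the JAR path and arguments here are hypothetical placeholders), an invocation takes the general form shown below; the Syntax, Parameters, and Examples sections that follow give the details:

$ hadoop jar /path/to/myapp.jar <arguments>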
Syntax hadoop jar <jar> [<arguments>] Parameters The following commands parameters are supported for : hadoop jar Parameter Description <jar> The JAR file. <arguments> Arguments to the program specified in the JAR file. Examples Streaming Jobs Hadoop streaming jobs are run using the command. The Hadoop streaming utility enables you to create and run MapReduce jobs hadoop jar with any executable or script as the mapper and/or the reducer. $ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \ -input myInputDirs \ -output myOutputDir \ -mapper org.apache.hadoop.mapred.lib.IdentityMapper \ -reducer /bin/wc The , , , and streaming command options are all required for streaming jobs. Either an executable or a Java -input -output -mapper -reducer class may be used for the mapper and the reducer. For more information about and examples of streaming jobs, see Streaming Options and at the Apache project's page. Usage Running from a JAR file The simple Word Count program is another example of a program that is run using the command. The functionality is hadoop jar wordcount built into the program. You pass the file, along with the location, to Hadoop with the comm hadoop-0.20.2-dev-examples.jar hadoop jar and and Hadoop reads the JAR file and executes the relevant instructions. The Word Count program reads files from an input directory, counts the words, and writes the results of the job to files in an output directory. $ hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out hadoop job The command enables you to manage MapReduce jobs. hadoop job Syntax hadoop job [Generic Options] [-submit <job-file>] [-status <job-id>] [-counter <job-id> <group-name> <counter-name>] [-kill <job-id>] [-unblacklist <job-id> <hostname>] [-unblacklist-tracker <hostname>] [-set-priority <job-id> <priority>] [-events <job-id> <from-event-#> <#-of-events>] [-history <jobOutputDir>] [-list [all]] [-list-active-trackers] [-list-blacklisted-trackers] [-list-attempt-ids <job-id> <task-type> <task-state>] [-kill-task <task-id>] [-fail-task <task-id>] [-blacklist-tasktracker <hostname>] [-showlabels] Parameters Command Options The following command options are supported for : hadoop job Parameter Description -submit <job-file> Submits the job. -status <job-id> Prints the map and reduce completion percentage and all job counters. -counter <job-id> <group-name> <counter-name> Prints the counter value. -kill <job-id> Kills the job. -unblacklist <job-id> <hostname> Removes a tasktracker job from the jobtracker's blacklist. -unblacklist-tracker <hostname> Admin only. Removes the TaskTracker at from the <hostname JobTracker's global blacklist. -set-priority <job-id> <priority> Changes the priority of the job. Valid priority values are , VERY_HIGH , , and . HIGH, NORMAL LOW VERY_LOW The job scheduler uses this property to determine the order in which jobs are run. -events <job-id> <from-event-#> <#-of-events> Prints the events' details received by jobtracker for the given range. -history <jobOutputDir> Prints job details, failed and killed tip details. -list [all] The option displays all jobs. The command -list all -list without the option displays only jobs which are yet to complete. all -list-active-trackers Prints all active tasktrackers. -list-blackisted-trackers Prints blacklisted tasktrackers. -list-attempt-ids <job-id><task-type> Lists the IDs of task attempts. -kill-task <task-id> Kills the task. Killed tasks are counted against failed attempts. 
not -fail-task <task-id> Fails the task. Failed tasks are counted against failed attempts. -blacklist-tasktracker <hostname> Pauses all current tasktracker jobs and prevent additional jobs from being scheduled on the tasktracker. -showlabels Dumps label information of all active nodes. Generic Options The following generic options are supported for the command: , , hadoop job -conf <configuration file> -D <property=value> -fs , , , <local|file system URI> -jt <local|jobtracker:port> -files <file1,file2,file3,...> -libjars , and . For more information on generic options, <libjar1,libjar2,libjar3,...> -archives <archive1,archive2,archive3,...> see . Generic Options Examples Submitting Jobs The command enables you to submit a job to the specified jobtracker. hadoop job -submit $ hadoop job -jt darwin:50020 -submit job.xml Stopping Jobs Gracefully Use the command to stop a running or queued job. hadoop kill $ hadoop job -kill <job-id> Viewing Job History Logs Run the command to view the history logs summary in specified directory. hadoop job -history $ hadoop job -history output-dir This command will print job details, failed and killed tip details. Additional details about the job such as successful tasks and task attempts made for each task can be viewed by adding the option: -all $ hadoop job -history all output-dir Blacklisting Tasktrackers The command when run as root or using can be used to manually blacklist tasktrackers: hadoop job sudo hadoop job -blacklist-tasktracker <hostname> Manually blacklisting a tasktracker pauses any running jobs and prevents additional jobs from being scheduled. For a detailed discussion see . TaskTracker Blacklisting hadoop jobtracker The command runs the MapReduce jobtracker node. hadoop jobtracker Syntax hadoop jobtracker [-dumpConfiguration] Parameters The command supports the following command options: hadoop jobtracker Parameter Description -dumpConfiguration Dumps the configuration used by the jobtracker along with queue configuration in JSON format into standard output used by the jobtracker and exits. hadoop mfs The command performs operations on directories in the cluster. The main purposes of are to display directory hadoop mfs hadoop mfs information and contents, to create symbolic links, and to set compression and chunk size on a directory. Syntax hadoop mfs [ -ln <target> <symlink> ] [ -ls <path> ] [ -lsd <path> ] [ -lsr <path> ] [ -Lsr <path> ] [ -lsrv <path> ] [ -lss <path> ] [ -setcompression on|off|lzf|lz4|zlib <dir> ] [ -setchunksize <size> <dir> ] [ -help <command> ] Parameters The normal command syntax is to specify a single option from the following table, along with its corresponding arguments. If compression and chunk size are not set explicitly for a given directory, the values are inherited from the parent directory. Parameter Description -ln <target> <symlink> Creates a symbolic link that points to the target path <symlink> <ta , similar to the standard Linux command. rget> ln -s -ls <path> Lists files in the directory specified by . The <path> hadoop mfs command corresponds to the standard comma -ls hadoop fs -ls nd, but provides the following additional information: Chunks used for each file Server where each chunk resides -lsd <path> Lists files in the directory specified by , and also provides <path> information about the specified directory itself: Whether compression is enabled for the directory (indicated by z ) The configured chunk size (in bytes) for the directory. 
-lsr <path> Lists files in the directory and subdirectories specified by , <path> recursively, including dereferencing symbolic links. The hadoop mfs command corresponds to the standard com -lsr hadoop fs -lsr mand, but provides the following additional information: Chunks used for each file Server where each chunk resides -Lsr <path> Equivalent to lsr, but additionally dereferences symbolic links -lsrv <path> Lists all paths recursively without crossing volume links. -lss <path> Lists files in the directory specified by , with an additional <path> column that displays the number of disk blocks per file. Disk blocks are 8192 bytes. -setcompression on|off|lzf|lz4|zlib <dir> Turns compression on or off on the directory specified in , and <dir> sets the compression type: on — turns on compression using the default algorithm (LZ4) off — turns off compression lzf — turns on compression and sets the algorithm to LZF lz4 — turns on compression and sets the algorithm to LZ4 zlib — turns on compression and sets the algorithm to ZLIB -setchunksize <size> <dir> Sets the chunk size in bytes for the directory specified in . The <dir> parameter must be a multiple of 65536. <size> -help <command> Displays help for the command. hadoop mfs Examples The command is used to view file contents. You can use this command to check if compression is turned off in a directory or hadoop mfs mounted volume. For example, # hadoop mfs -ls / Found 23 items vrwxr-xr-x Z - root root 13 2012-04-29 10:24 268435456 /.rw p mapr.cluster.root writeable 2049.35.16584 -> 2049.16.2 scale-50.scale.lab:5660 scale-51.scale.lab:5660 scale-52.scale.lab:5660 vrwxr-xr-x U - root root 7 2012-04-28 22:16 67108864 /hbase p mapr.hbase default 2049.32.16578 -> 2050.16.2 scale-50.scale.lab:5660 scale-51.scale.lab:5660 scale-52.scale.lab:5660 drwxr-xr-x Z - root root 0 2012-04-29 09:14 268435456 /tmp p 2049.41.16596 scale-50.scale.lab:5660 scale-51.scale.lab:5660 scale-52.scale.lab:5660 vrwxr-xr-x Z - root root 1 2012-04-27 22:59 268435456 /user p users default 2049.36.16586 -> 2055.16.2 scale-50.scale.lab:5660 scale-52.scale.lab:5660 scale-51.scale.lab:5660 drwxr-xr-x Z - root root 1 2012-04-27 22:37 268435456 /var p 2049.33.16580 scale-50.scale.lab:5660 scale-51.scale.lab:5660 scale-52.scale.lab:5660 In the above example, the letter indicates LZ4 compression on the directory; the letter indicates that the directory is uncompressed. Z U Output When used with , , , or , displays information about files and directories. For each file or directory -ls -lsd -lsr -lss hadoop mfs hadoop mfs displays a line of basic information followed by lines listing the chunks that make up the file, in the following format: {mode} {compression} {replication} {owner} {group} {size} {date} {chunk size} {name}                           {chunk} {fid} {host} [{host}...]                           {chunk} {fid} {host} [{host}...]                           ... Volume links are displayed as follows: {mode} {compression} {replication} {owner} {group} {size} {date} {chunk size} {name}                           {chunk} {target volume name} {writability} {fid} -> {fid} [{host}...] For volume links, the first is the chunk that stores the volume link itself; the after the arrow ( ) is the first chunk in the target volume. fid fid -> The following table describes the values: mode A text string indicating the read, write, and execute permissions for the owner, group, and other permissions. See also Managing . 
Permissions compression U — uncompressed L — LZf Z (Uppercase) — LZ4 z (Lowercase) — ZLIB replication The replication factor of the file (directories display a dash instead) owner The owner of the file or directory group The group of the file of directory size The size of the file or directory date The date the file or directory was last modified chunk size The chunk size of the file or directory name The name of the file or directory chunk The chunk number. The first chunk is a primary chunk labeled " ", a p 64K chunk containing the root of the file. Subsequent chunks are numbered in order. fid The chunk's file ID, which consists of three parts: The ID of the container where the file is stored The inode of the file within the container An internal version number host The host on which the chunk resides. When several hosts are listed, the first host is the first copy of the chunk and subsequent hosts are replicas. target volume name The name of the volume pointed to by a volume link. writability Displays whether the volume is writable. hadoop mradmin The command runs Map-Reduce administrative commands. hadoop mradmin Syntax hadoop [ Generic Options ] mradmin [-refreshServiceAcl] [-refreshQueues] [-refreshNodes] [-refreshUserToGroupsMappings] [-refreshSuperUserGroupsConfiguration] [-help [cmd]] Parameters The following command parameters are supported for : hadoop mradmin Parameter Description -refreshServiceAcl Reload the service-level authorization policy file Job tracker will reload the authorization policy file. -refreshQueues Reload the queue acls and state JobTracker will reload the mapred-queues.xml file. -refreshUserToGroupsMappings Refresh user-to-groups mappings. -refreshSuperUserGroupsConfiguration Refresh superuser proxy groups mappings. -refreshNodes Refresh the hosts information at the job tracker. -help [cmd] Displays help for the given command or all commands if none is specified. The following generic options are supported for : hadoop mradmin Generic Option Description -conf <configuration file> Specify an application configuration file. -D <property=value> Use value for given property. -fs <local|file system URI> Specify a file system. -jt <local|jobtracker:port> Specify a job tracker. -files <comma separated list of files> Specify comma separated files to be copied to the map reduce cluster. -libjars <comma seperated list of jars> Specify comma separated jar files to include in the classpath. -archives <comma separated list of archives> Specify comma separated archives to be unarchived on the computer machines. hadoop pipes The command runs a pipes job. hadoop pipes Hadoop Pipes is the C++ interface to Hadoop Reduce. Hadoop Pipes uses sockets to enable tasktrackers to communicate processes running the C++ map or reduce functions. See also . Compiling Pipes Programs Syntax hadoop [GENERIC OPTIONS ] pipes [-output <path>] [-jar <jar file>] [-inputformat <class>] [-map <class>] [-partitioner <class>] [-reduce <class>] [-writer <class>] [-program <executable>] [-reduces <num>] Parameters Command Options The following command parameters are supported for : hadoop pipes Parameter Description -output <path> Specify the output directory. -jar <jar file> Specify the jar filename. -inputformat <class> InputFormat class. -map <class> Specify the Java Map class. -partitioner <class> Specify the Java Partitioner. -reduce <class> Specify the Java Reduce class. -writer <class> Specify the Java RecordWriter. -program <executable> Specify the URI of the executable. 
-reduces <num> Specify the number of reduces. Generic Options The following generic options are supported for the command: , , hadoop pipes -conf <configuration file> -D <property=value> -f , , , s <local|file system URI> -jt <local|jobtracker:port> -files <file1,file2,file3,...> -libjars , and . For more information on generic options, <libjar1,libjar2,libjar3,...> -archives <archive1,archive2,archive3,...> see . Generic Options hadoop queue The command displays job queue information. hadoop queue Syntax hadoop [ Generic Options ] queue [-list] | [-info <job-queue-name> [-showJobs]] | [-showacls] Parameters Command Options The command supports the following command options: hadoop queue Parameter Description -list Gets list of job queues configured in the system. Along with scheduling information associated with the job queues. -info <job-queue-name> [-showJobs] Displays the job queue information and associated scheduling information of particular job queue. If option is present, a -showJobs list of jobs submitted to the particular job queue is displayed. -showacls Displays the queue name and associated queue operations allowed for the current user. The list consists of only those queues to which the user has access. Generic Options The following generic options are supported for the command: , , hadoop queue -conf <configuration file> -D <property=value> -f , , , s <local|file system URI> -jt <local|jobtracker:port> -files <file1,file2,file3,...> -libjars , and . For more information on generic options, <libjar1,libjar2,libjar3,...> -archives <archive1,archive2,archive3,...> see . Generic Options hadoop tasktracker The command runs a MapReduce tasktracker node. hadoop tasktracker Syntax hadoop tasktracker Output mapr@mapr-desktop:~$ hadoop tasktracker 12/03/21 21:19:56 INFO mapred.TaskTracker: STARTUP_MSG: /************************************************************ STARTUP_MSG: Starting TaskTracker STARTUP_MSG: host = mapr-desktop/127.0.1.1 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.2-dev STARTUP_MSG: build = -r ; compiled by 'root' on Thu Dec 8 22:43:13 PST 2011 ************************************************************/ 12/03/21 21:19:56 INFO mapred.TaskTracker: /*-------------- TaskTracker System Properties ---------------- java.runtime.name: Java(TM) SE Runtime Environment sun.boot.library.path: /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64 java.vm.version: 20.1-b02 hadoop.root.logger: INFO,console java.vm.vendor: Sun Microsystems Inc. java.vendor.url: http://java.sun.com/ path.separator: : java.vm.name: Java HotSpot(TM) 64-Bit Server VM file.encoding.pkg: sun.io sun.java.launcher: SUN_STANDARD user.country: US sun.os.patch.level: unknown java.vm.specification.name: Java Virtual Machine Specification user.dir: /home/mapr java.runtime.version: 1.6.0_26-b03 java.awt.graphicsenv: sun.awt.X11GraphicsEnvironment java.endorsed.dirs: /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/endorsed os.arch: amd64 java.io.tmpdir: /tmp When you change the name of a node, you must restart the tasktracker. line.separator: hadoop.log.file: hadoop.log java.vm.specification.vendor: Sun Microsystems Inc. os.name: Linux hadoop.id.str: sun.jnu.encoding: UTF-8 java.library.path: /opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/native/Linux-amd64-64: hadoop.home.dir: /opt/mapr/hadoop/hadoop-0.20.2/bin/.. 
java.specification.name: Java Platform API Specification java.class.version: 50.0 sun.management.compiler: HotSpot 64-Bit Tiered Compilers hadoop.pid.dir: /opt/mapr/hadoop/hadoop-0.20.2/bin/../pids os.version: 2.6.32-33-generic user.home: /home/mapr user.timezone: America/Los_Angeles java.awt.printerjob: sun.print.PSPrinterJob file.encoding: UTF-8 java.specification.version: 1.6 java.class.path: /opt/mapr/hadoop/hadoop-0.20.2/bin/../conf:/usr/lib/jvm/java-6-sun-1.6.0.26/lib/tools. jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/..:/opt/mapr/hadoop/hadoop-0.20.2/bin/../hadoop *core*.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/aspectjrt-1.6.5.jar:/opt/mapr/had oop/hadoop-0.20.2/bin/../lib/aspectjtools-1.6.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin /../lib/commons-cli-1.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-codec-1. 4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-daemon-1.0.1.jar:/opt/mapr/had oop/hadoop-0.20.2/bin/../lib/commons-el-1.0.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../ lib/commons-httpclient-3.0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-log ging-1.0.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-api-1.0.4.jar :/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-net-1.4.1.jar:/opt/mapr/hadoop/hado op-0.20.2/bin/../lib/core-3.1.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/eval-0.5 .jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-capacity-scheduler.ja r:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-core.jar:/opt/mapr/hadoo p/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-fairscheduler.jar:/opt/mapr/hadoop/hadoop -0.20.2/bin/../lib/hsqldb-1.8.0.10.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jacks on-core-asl-1.5.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jackson-mapper-asl-1.5 .2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jasper-compiler-5.5.12.jar:/opt/mapr/ hadoop/hadoop-0.20.2/bin/../lib/jasper-runtime-5.5.12.jar:/opt/mapr/hadoop/hadoop-0.20 .2/bin/../lib/jets3t-0.6.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-6.1.14. jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-servlet-tester-6.1.14.jar:/opt/map r/hadoop/hadoop-0.20.2/bin/../lib/jetty-util-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2 /bin/../lib/junit-4.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/kfs-0.2.2.jar:/opt /mapr/hadoop/hadoop-0.20.2/bin/../lib/log4j-1.2.15.jar:/opt/mapr/hadoop/hadoop-0.20.2/ bin/../lib/logging-0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/maprfs-0.1.jar:/o pt/mapr/hadoop/hadoop-0.20.2/bin/../lib/maprfs-test-0.1.jar:/opt/mapr/hadoop/hadoop-0. 20.2/bin/../lib/mockito-all-1.8.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/mysql- connector-java-5.0.8-bin.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/oro-2.0.8.jar:/ opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/servlet-api-2.5-6.1.14.jar:/opt/mapr/hadoop/h adoop-0.20.2/bin/../lib/slf4j-api-1.4.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/ slf4j-log4j12-1.4.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/xmlenc-0.52.jar:/opt /mapr/hadoop/hadoop-0.20.2/bin/../lib/zookeeper-3.3.2.jar:/opt/mapr/hadoop/hadoop-0.20 .2/bin/../lib/jsp-2.1/jsp-2.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jsp-2.1/js p-api-2.1.jar user.name: mapr java.vm.specification.version: 1.0 sun.java.command: org.apache.hadoop.mapred.TaskTracker java.home: /usr/lib/jvm/java-6-sun-1.6.0.26/jre sun.arch.data.model: 64 user.language: en java.specification.vendor: Sun Microsystems Inc. 
hadoop.log.dir: /opt/mapr/hadoop/hadoop-0.20.2/bin/../logs java.vm.info: mixed mode java.version: 1.6.0_26 java.ext.dirs: /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/ext:/usr/java/packages/lib/ext sun.boot.class.path: /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.2 6/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/sunrsasign.jar:/usr/lib/jvm/ java-6-sun-1.6.0.26/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/jce.jar: /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.26 /jre/lib/modules/jdk.boot.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/classes java.vendor: Sun Microsystems Inc. file.separator: / java.vendor.url.bug: http://java.sun.com/cgi-bin/bugreport.cgi sun.io.unicode.encoding: UnicodeLittle sun.cpu.endian: little hadoop.policy.file: hadoop-policy.xml sun.desktop: gnome sun.cpu.isalist: ------------------------------------------------------------*/ 12/03/21 21:19:57 INFO mapred.TaskTracker: /tmp is not tmpfs or ramfs. Java Hotspot Instrumentation will be disabled by default 12/03/21 21:19:57 INFO mapred.TaskTracker: Cleaning up config files from the job history folder 12/03/21 21:19:57 INFO mapred.TaskTracker: TT local config is /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml 12/03/21 21:19:57 INFO mapred.TaskTracker: Loading resource properties file : /opt/mapr//logs/cpu_mem_disk 12/03/21 21:19:57 INFO mapred.TaskTracker: Physical memory reserved for mapreduce tasks = 2105540608 bytes 12/03/21 21:19:57 INFO mapred.TaskTracker: CPUS: 1 12/03/21 21:19:57 INFO mapred.TaskTracker: Total MEM: 1.9610939GB 12/03/21 21:19:57 INFO mapred.TaskTracker: Reserved MEM: 2008MB 12/03/21 21:19:57 INFO mapred.TaskTracker: Reserved MEM for Ephemeral slots 0 12/03/21 21:19:57 INFO mapred.TaskTracker: DISKS: 2 12/03/21 21:19:57 INFO mapred.TaskTracker: Map slots 1, Default heapsize for map task 873 mb 12/03/21 21:19:57 INFO mapred.TaskTracker: Reduce slots 1, Default heapsize for reduce task 1135 mb 12/03/21 21:19:57 INFO mapred.TaskTracker: Ephemeral slots 0, memory given for each ephemeral slot 200 mb 12/03/21 21:19:57 INFO mapred.TaskTracker: Prefetch map slots 1 12/03/21 21:20:07 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog 12/03/21 21:20:08 INFO http.HttpServer: Added global filtersafety (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter) 12/03/21 21:20:08 WARN mapred.TaskTracker: Error while writing to TaskController config filejava.io.FileNotFoundException: /opt/mapr/hadoop/hadoop-0.20.2/bin/../conf/taskcontroller.cfg (Permission denied) 12/03/21 21:20:08 ERROR mapred.TaskTracker: Can not start TaskTracker because java.io.IOException: Cannot run program "/opt/mapr/hadoop/hadoop-0.20.2/bin/../bin/Linux-amd64-64/bin/task-controller": java.io.IOException: error=13, Permission denied at java.lang.ProcessBuilder.start(ProcessBuilder.java:460) at org.apache.hadoop.util.Shell.runCommand(Shell.java:267) at org.apache.hadoop.util.Shell.run(Shell.java:249) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:442) at org.apache.hadoop.mapred.LinuxTaskController.setup(LinuxTaskController.java:142) at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:2149) at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:5216) Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied at java.lang.UNIXProcess.<init>(UNIXProcess.java:148) at java.lang.ProcessImpl.start(ProcessImpl.java:65) at 
java.lang.ProcessBuilder.start(ProcessBuilder.java:453) ... 6 more 12/03/21 21:20:08 INFO mapred.TaskTracker: SHUTDOWN_MSG: /************************************************************ SHUTDOWN_MSG: Shutting down TaskTracker at mapr-desktop/127.0.1.1 ************************************************************/

hadoop version
The hadoop version command prints the Hadoop software version.

Syntax
hadoop version

Output
mapr@mapr-desktop:~$ hadoop version
Hadoop 0.20.2-dev
Subversion -r
Compiled by root on Thu Dec 8 22:43:13 PST 2011
From source with checksum 19fa44df0cb831c45ef984f21feb7110

hadoop conf
The hadoop conf command outputs the configuration information for this node to standard output.

Syntax
hadoop [ generic options ] conf [ -dump ] [ -key <parameter name>]

Parameters
Parameter Description
-dump Dumps the entire configuration set to standard output.
-key <parameter name> Displays the configured value for the specified parameter.

Examples
Dumping a node's entire configuration to a text file:

hadoop conf -dump > nodeconfiguration.txt

The above command creates a text file named nodeconfiguration.txt that contains the node's configuration information. Using the tail utility to examine the last few lines of the file displays the following information:

[user@hostame:01] tail nodeconfiguration.txt
mapred.merge.recordsBeforeProgress=10000
io.mapfile.bloom.error.rate=0.005
io.bytes.per.checksum=512
mapred.cluster.ephemeral.tasks.memory.limit.mb=200
mapred.fairscheduler.smalljob.max.inputsize=10737418240
ipc.client.tcpnodelay=true
mapreduce.tasktracker.reserved.physicalmemory.mb.low=0.80
fs.s3.sleepTimeSeconds=10
mapred.task.tracker.report.address=127.0.0.1:0
*** MapR Configuration Dump: END ***
[user@hostname:02]

Displaying the configured value of a specific parameter:

[user@hostame:01] hadoop conf -key io.bytes.per.checksum
512
[user@hostname:02]

The above command returns 512 as the configured value of the io.bytes.per.checksum parameter.

API Reference
Overview
This guide provides information about the MapR command API. Most commands can be run on the command-line interface (CLI), or by making REST requests programmatically or in a browser. To run CLI commands, use a Client machine or an ssh connection to any node in the cluster. To use the REST interface, make HTTP requests to a node that is running the WebServer service. Each command reference page includes the command syntax, a table that describes the parameters, and examples of command usage. In each parameter table, required parameters are in bold text. For output commands, the reference pages include tables that describe the output fields. Values that do not apply to particular combinations are marked NA.

REST API Syntax
MapR REST calls use the following format:

https://<host>:<port>/rest/<command>[/<subcommand>...]?<parameters>

Construct the <parameters> list from the required and optional parameters, in the format <parameter>=<value>, separated by the ampersand (&) character. Example:

https://r1n1.qa.sj.ca.us:8443/rest/volume/mount?name=test-volume&path=/test

Values in REST API calls must be URL-encoded. For readability, the values in this document use the actual characters, rather than the URL-encoded versions.

Authentication
To make REST calls using curl or wget, provide the username and password.

Curl Syntax
curl -k -u <username> https://<host>:<port>/rest/<command>...

To keep your password secure, do not provide it on the command line. Curl will prompt you for your password, and you can enter it securely.

Wget Syntax
wget --no-check-certificate --user <username> --ask-password https://<host>:<port>/rest/<command>...

To keep your password secure, do not provide it on the command line. Use the --ask-password option instead; then wget will prompt you for your password and you can enter it securely.
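For example, a hypothetical authenticated call with curl might look like the following (the hostname, port, username, and volume name are placeholders taken from the examples in this guide):

$ curl -k -u mapr "https://r1n1.sj.us:8443/rest/volume/mount?name=test-volume&path=/test"

Because no password is supplied after -u, curl prompts for it before sending the request.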
Command-Line Interface (CLI) Syntax
The MapR CLI commands are documented using the following conventions:

[Square brackets] indicate an optional parameter
<Angle brackets> indicate a value to enter

The following syntax example shows that the volume mount command requires the -name parameter, for which you must enter a list of volumes, and all other parameters are optional:

maprcli volume mount
[ -cluster <cluster> ]
-name <volume list>
[ -path <path list> ]

For clarity, the syntax examples show each parameter on a separate line; in practical usage, the command and all parameters and options are typed on a single line. Example:

maprcli volume mount -name test-volume -path /test

Common Parameters
The following parameters are available for many commands in both the REST and command-line contexts.

Parameter Description
cluster The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command.
zkconnect A ZooKeeper connect string, which specifies a list of the hosts running ZooKeeper, and the port to use on each, in the format: '<host>[:<port>][,<host>[:<port>]...]' Default: 'localhost:5181'. In most cases the ZooKeeper connect string can be omitted, but it is useful in certain cases when the CLDB is not running.

Common Options
The following options are available for most commands in the command-line context.

Option Description
-noheader When displaying tabular output from a command, omits the header row.
-long Shows the entire value. This is useful when the command response contains complex information. When -long is omitted, complex information is displayed as an ellipsis (...).
-json Displays command output in JSON format. When -json is omitted, the command output is displayed in tabular format.
-cli.loglevel Specifies a log level for API output. Legal values for this option are: DEBUG INFO ERROR WARN TRACE FATAL

Filters
Some MapR CLI commands use filters, which let you specify large numbers of nodes or volumes by matching specified values in specified fields rather than by typing each name explicitly. Filters use the following format:

[<field><operator>"<value>"]<and|or>[<field><operator>"<value>"] ...

field Field on which to filter. The field depends on the command with which the filter is used.
operator An operator for that field:
== - Exact match
!= - Does not match
> - Greater than
< - Less than
>= - Greater than or equal to
<= - Less than or equal to
value Value on which to filter. Wildcards (using *) are allowed for operators == and !=. There is a special value all that matches all values.

You can use the wildcard (*) for partial matches. For example, you can display all volumes whose owner is root and whose name begins with test as follows:

maprcli volume list -filter [n=="test*"]and[on=="root"]

Response
The commands return responses in JSON or in a tabular format. When you run commands from the command line, the response is returned in tabular format unless you specify JSON using the -json option; when you run commands through the REST interface, the response is returned in JSON.
Success On a successful call, each command returns the error code zero (OK) and any data requested. When JSON output is specified, the data is returned as an array of records along with the status code and the total number of records. In the tabular format, the data is returned as a sequence of rows, each of which contains the fields in the record separated by tabs. JSON { "status":"OK", "total":<number of records>, "data":[ { <record> } ... ] } Tabular status 0 Or <heading> <heading> <heading> ... <field> <field> <field> ... ... Error When an error occurs, the command returns the error code and descriptive message. JSON { "status":"ERROR", "errors":[ { "id":<error code>, "desc":"<command>: <error message>" } ] } Tabular ERROR (<error code>) - <command>: <error message> acl The acl commands let you work with (ACLs): access control lists acl edit - modifies a specific user's access to a cluster or volume acl set - modifies the ACL for a cluster or volume acl show - displays the ACL associated with a cluster or volume In order to use the command, you must have full control ( ) permission on the cluster or volume for which you are running the acl edit fc command. Specifying Permissions Specify permissions for a user or group with a string that lists the permissions for that user or group. To specify permissions for multiple users or groups, use a string for each, separated by spaces. The format is as follows: Users - <user>:<action>[,<action>...][ <user>:<action>[,<action...]] Groups - <group>:<action>[,<action>...][ <group>:<action>[,<action...]] The following tables list the permission codes used by the commands. acl Cluster Permission Codes Code Allowed Action Includes login Log in to the MapR Control System, use the API and command-line interface, read access on cluster and volumes   ss Start/stop services   cv Create volumes   a Admin access All permissions except fc fc Full control (administrative access and permission to change the cluster ACL) a Volume Permission Codes Code Allowed Action dump Dump the volume restore Mirror or restore the volume m Modify volume properties, create and delete snapshots d Delete a volume fc Full control (admin access and permission to change volume ACL) acl edit The command grants one or more specific volume or cluster permissions to a user. To use the command, you must have acl edit acl edit full control ( ) permissions on the volume or cluster for which you are running the command. fc The permissions are specified as a comma-separated list of permission codes. See . You must specify either a or a . When the acl user group ty is , a volume name must be specified using the parameter. pe volume name Syntax CLI maprcli acl edit [ -cluster <cluster name> ] [ -group <group> ] [ -name <name> ] -type cluster|volume [ -user <user> ] REST http[s]://<host:port>/rest/acl/edit?<parameters> Parameters Parameter Description cluster The cluster on which to run the command. group Groups and allowed actions for each group. See . Format: acl <group >:<action>[,<action>...][ <group>:<action>[,<action...]] name The object name. type The object type ( or ). cluster volume user Users and allowed actions for each user. See . Format: acl <user>:< action>[,<action>...][ <user>:<action>[,<action...]] Examples Give the user jsmith dump, restore, and delete permissions for "test-volume": CLI maprcli acl edit -type volume -name test-volume -user jsmith:dump,restore,d acl set The command specifies the entire ACL for a cluster or volume. 
Any previous permissions are overwritten by the new values, and any acl set permissions omitted are removed. To use the command, you must have full control ( ) permissions on the volume or cluster for which acl set fc you are running the command. The permissions are specified as a comma-separated list of permission codes. See . You must specify either a or a . When the acl user group ty is , a volume name must be specified using the parameter. pe volume name Syntax CLI maprcli acl set [ -cluster <cluster name> ] [ -group <group> ] [ -name <name> ] -type cluster|volume [ -user <user> ] REST http[s]://<host:port>/rest/acl/edit?<parameters> Parameters Parameter Description cluster The cluster on which to run the command. The command removes any previous ACL values. If you wish to preserve some of the permissions, you should either use the acl set command instead of , or use to list the values before overwriting them. acl edit acl set acl show group Groups and allowed actions for each group. See . Format: acl <group >:<action>[,<action>...][ <group>:<action>[,<action...]] name The object name. type The object type ( or ). cluster volume user Users and allowed actions for each user. See . Format: acl <user>:< action>[,<action>...][ <user>:<action>[,<action...]] Examples Give the user full control of the cluster and remove all permissions for all other users: root my.cluster.com CLI maprcli acl set -type cluster -cluster my.cluster.com -user root:fc Usage Example # maprcli acl show -type cluster Principal Allowed actions User root [login, ss, cv, a, fc] User lfedotov [login, ss, cv, a, fc] User mapr [login, ss, cv, a, fc] # maprcli acl set -type cluster -cluster my.cluster.com -user root:fc # maprcli acl show -type cluster Principal Allowed actions User root [login, ss, cv, a, fc] Give multiple users specific permissions for the volume and remove all permissions for all other users: test-volume CLI maprcli acl set -type volume -name test-volume -user jsmith:dump,restore,m rjones:fc acl show Displays the ACL associated with an object (cluster or a volume). An ACL contains the list of users who can perform specific actions. Syntax Notice that the specified permissions have overwritten the existing ACL. CLI maprcli acl show [ -cluster <cluster> ] [ -group <group> ] [ -name <name> ] [ -output long|short|terse ] [ -perm ] -type cluster|volume [ -user <user> ] REST http[s]://<host:port>/rest/acl/show?<parameters> Parameters Parameter Description cluster The name of the cluster on which to run the command group The group for which to display permissions name The cluster or volume name output The output format: long short terse perm When this option is specified, displays the permissions acl show available for the object type specified in the parameter. type type Cluster or volume. user The user for which to display permissions Output The actions that each user or group is allowed to perform on the cluster or the specified volume. For information about each allowed action, see a . 
cl Principal Allowed actions User root [r, ss, cv, a, fc] Group root [r, ss, cv, a, fc] All users [r] Examples Show the ACL for "test-volume": CLI maprcli acl show -type volume -name test-volume Show the permissions that can be set on a cluster: CLI maprcli acl show -type cluster -perm alarm The alarm commands perform functions related to system alarms: alarm clear - clears one or more alarms alarm clearall - clears all alarms alarm config load - displays the email addresses to which alarm notifications are to be sent alarm config save - saves changes to the email addresses to which alarm notifications are to be sent alarm list - displays alarms on the cluster alarm names - displays all alarm names alarm raise - raises a specified alarm Alarm Notification Fields The following fields specify the configuration of alarm notifications. Field Description alarm The named alarm. individual Specifies whether individual alarm notifications are sent to the default email address for the alarm type. 0 - do not send notifications to the default email address for the alarm type 1 - send notifications to the default email address for the alarm type email A custom email address for notifications about this alarm type. If specified, alarm notifications are sent to this email address, regardless of whether they are sent to the default email address Alarm Types See . Alarms Reference Alarm History To see a history of alarms that have been raised, look at the file on the master CLDB node. Example: /opt/mapr/logs/cldb.log grep ALARM /opt/mapr/logs/cldb.log alarm clear Clears one or more alarms. Permissions required: or fc a Syntax CLI maprcli alarm clear -alarm <alarm> [ -cluster <cluster> ] [ -entity <host, volume, user, or group name> ] REST http[s]://<host>:<port>/rest/alarm/clear?<parameter s> Parameters Parameter Description alarm The named alarm to clear. See . Alarm Types cluster The cluster on which to run the command. entity The entity on which to clear the alarm. Examples Clear a specific alarm: CLI maprcli alarm clear -alarm NODE_ALARM_DEBUG_LOGGING REST https://r1n1.sj.us:8443/rest/alarm/clear?alarm=NODE _ALARM_DEBUG_LOGGING alarm clearall Clears all alarms. Permissions required: or fc a Syntax CLI maprcli alarm clearall [ -cluster <cluster> ] REST http[s]://<host>:<port>/rest/alarm/clearall?<parame ters> Parameters Parameter Description cluster The cluster on which to run the command. Examples Clear all alarms: CLI maprcli alarm clearall REST https://r1n1.sj.us:8443/rest/alarm/clearall alarm config load Displays the configuration of alarm notifications. Permissions required: or fc a Syntax CLI maprcli alarm config load [ -cluster <cluster> ] [ -output terse|verbose ] REST http[s]://<host>:<port>/rest/alarm/config/load Parameters Parameter Description cluster The cluster on which to run the command. output Whether the output should be terse or verbose. Output A list of configuration values for alarm notifications. Output Fields See . 
Alarm Notification Fields Sample output alarm individual email CLUSTER_ALARM_BLACKLIST_TTS 1 CLUSTER_ALARM_UPGRADE_IN_PROGRESS 1 CLUSTER_ALARM_UNASSIGNED_VIRTUAL_IPS 1 VOLUME_ALARM_SNAPSHOT_FAILURE 1 VOLUME_ALARM_MIRROR_FAILURE 1 VOLUME_ALARM_DATA_UNDER_REPLICATED 1 VOLUME_ALARM_DATA_UNAVAILABLE 1 VOLUME_ALARM_ADVISORY_QUOTA_EXCEEDED 1 VOLUME_ALARM_QUOTA_EXCEEDED 1 NODE_ALARM_CORE_PRESENT 1 NODE_ALARM_DEBUG_LOGGING 1 NODE_ALARM_DISK_FAILURE 1 NODE_ALARM_OPT_MAPR_FULL 1 NODE_ALARM_VERSION_MISMATCH 1 NODE_ALARM_TIME_SKEW 1 NODE_ALARM_SERVICE_CLDB_DOWN 1 NODE_ALARM_SERVICE_FILESERVER_DOWN 1 NODE_ALARM_SERVICE_JT_DOWN 1 NODE_ALARM_SERVICE_TT_DOWN 1 NODE_ALARM_SERVICE_HBMASTER_DOWN 1 NODE_ALARM_SERVICE_HBREGION_DOWN 1 NODE_ALARM_SERVICE_NFS_DOWN 1 NODE_ALARM_SERVICE_WEBSERVER_DOWN 1 NODE_ALARM_SERVICE_HOSTSTATS_DOWN 1 NODE_ALARM_ROOT_PARTITION_FULL 1 AE_ALARM_AEADVISORY_QUOTA_EXCEEDED 1 AE_ALARM_AEQUOTA_EXCEEDED 1 Examples Display the alarm notification configuration: CLI maprcli alarm config load REST https://r1n1.sj.us:8443/rest/alarm/config/load alarm config save Sets notification preferences for alarms. Permissions required: or fc a Alarm notifications can be sent to the default email address and a specific email address for each named alarm. If is set to for a individual 1 specific alarm, then notifications for that alarm are sent to the default email address for the alarm type. If a custom email address is provided, notifications are sent there regardless of whether they are also sent to the default email address. Syntax CLI maprcli alarm config save [ -cluster <cluster> ] -values <values> REST http[s]://<host>:<port>/rest/alarm/config/save?<par ameters> Parameters Parameter Description cluster The cluster on which to run the command. values A comma-separated list of configuration values for one or more alarms, in the following format: <alarm>,<individual>,<email>  See . Alarm Notification Fields Examples Send alert emails for the AE_ALARM_AEQUOTA_EXCEEDED alarm to the default email address and a custom email address: CLI maprcli alarm config save -values "AE_ALARM_AEQUOTA_EXCEEDED,1,[email protected]" REST https://r1n1.sj.us:8443/rest/alarm/config/save?valu es=AE_ALARM_AEQUOTA_EXCEEDED,1,[email protected] alarm list Lists alarms in the system. Permissions required: or fc a You can list all alarms, alarms by type (Cluster, Node or Volume), or alarms on a particular node or volume.  To retrieve a count of all alarm types, pass in the parameter. You can specify the alarms to return by filtering on type and entity. Use and to retrieve only a 1 summary start limit specified window of data. Syntax CLI maprcli alarm list [ -alarm <alarm ID> ] [ -cluster <cluster> ] [ -entity <host or volume> ] [ -limit <limit> ] [ -output (terse|verbose) ] [ -start <offset> ] [ -summary (0|1) ] [ -type <alarm type> ] REST http[s]://<host>:<port>/rest/alarm/list?<parameters > Parameters Parameter Description alarm The alarm type to return. See . Alarm Types cluster The cluster on which to list alarms. entity The name of the cluster, node, volume, user, or group to check for alarms. limit The number of records to retrieve. Default: 2147483647 output Whether the output should be terse or verbose. start The list offset at which to start. summary Specifies the type of data to return: 1 = count by alarm type 0 = List of alarms type The entity type: cluster node volume ae Output Information about one or more named alarms on the cluster, or for a specified node, volume, user, or group. 
Output Fields Field Description alarm state State of the alarm: 0 = Clear 1 = Raised description A description of the condition that raised the alarm entity The name of the volume, node, user, or group. alarm name The name of the alarm. alarm statechange time The date and time the alarm was most recently raised. Sample Output alarm state description entity alarm name alarm statechange time 1 Volume desired replication is 1, current replication is 0 mapr.qa-node173.qa.prv.local.logs VOLUME_ALARM_DATA_UNDER_REPLICATED 1296707707872 1 Volume data unavailable mapr.qa-node173.qa.prv.local.logs VOLUME_ALARM_DATA_UNAVAILABLE 1296707707871 1 Volume desired replication is 1, current replication is 0 mapr.qa-node235.qa.prv.local.mapred VOLUME_ALARM_DATA_UNDER_REPLICATED 1296708283355 1 Volume data unavailable mapr.qa-node235.qa.prv.local.mapred VOLUME_ALARM_DATA_UNAVAILABLE 1296708283099 1 Volume desired replication is 1, current replication is 0 mapr.qa-node175.qa.prv.local.logs VOLUME_ALARM_DATA_UNDER_REPLICATED 1296706343256 Examples List a summary of all alarms CLI maprcli alarm list -summary 1 REST https://r1n1.sj.us:8443/rest/alarm/list?summary=1 List cluster alarms CLI maprcli alarm list -type 0 REST https://r1n1.sj.us:8443/rest/alarm/list?type=0 alarm names Displays a list of alarm names. Permissions required or . fc a Syntax CLI maprcli alarm names REST http[s]://<host>:<port>/rest/alarm/names Examples Display all alarm names: CLI maprcli alarm names REST https://r1n1.sj.us:8443/rest/alarm/names alarm raise Raises a specified alarm or alarms. Permissions required or . fc a Syntax CLI maprcli alarm raise -alarm <alarm> [ -cluster <cluster> ] [ -description <description> ] [ -entity <cluster, entity, host, node, or volume> ] REST http[s]://<host>:<port>/rest/alarm/raise?<parameter s> Parameters Parameter Description alarm The alarm type to raise. See . Alarm Types cluster The cluster on which to run the command. description A brief description. entity The entity on which to raise alarms. Examples Raise a specific alarm: CLI maprcli alarm raise -alarm NODE_ALARM_DEBUG_LOGGING REST https://r1n1.sj.us:8443/rest/alarm/raise?alarm=NODE _ALARM_DEBUG_LOGGING config The config commands let you work with configuration values for the MapR cluster: config load displays the values config save makes changes to the stored values Configuration Fields Field Default Value Description cldb.balancer.disk.max.switches.in.nodes.pe rcentage 10   cldb.balancer.disk.paused 1   cldb.balancer.disk.sleep.interval.sec 2 * 60   cldb.balancer.disk.threshold.percentage 70   cldb.balancer.logging 0   cldb.balancer.role.max.switches.in.nodes.per centage 10   cldb.balancer.role.paused 1   cldb.balancer.role.sleep.interval.sec 15 * 60   cldb.balancer.startup.interval.sec 30 * 60   cldb.cluster.almost.full.percentage 90 The percentage at which the CLUSTER_ALARM_CLUSTER_ALMOST_F ULL alarm is triggered. cldb.container.alloc.selector.algo 0   cldb.container.assign.buffer.sizemb 1 * 1024   cldb.container.create.diskfull.threshold 80   cldb.container.sizemb 16 * 1024   cldb.default.chunk.sizemb 256   cldb.default.volume.topology   The default topology for new volumes. 
cldb.dialhome.metrics.rotation.period 365   cldb.fileserver.activityreport.interval.hb.multip lier 3   cldb.fileserver.containerreport.interval.hb.mul tiplier 1800   cldb.fileserver.heartbeat.interval.sec 1   cldb.force.master.for.container.minutes 1   cldb.fs.mark.inactive.sec 5 * 60   cldb.fs.mark.rereplicate.sec 60 * 60 The number of seconds a node can fail to heartbeat before it is considered dead. Once a node is considered dead, the CLDB re-replicates any data contained on the node. cldb.fs.workallocator.num.volume.workunits 20   cldb.fs.workallocator.num.workunits 80   cldb.ganglia.cldb.metrics 0   cldb.ganglia.fileserver.metrics 0   cldb.heartbeat.monitor.sleep.interval.sec 60   cldb.log.fileserver.timeskew.interval.mins 60   cldb.max.parallel.resyncs.star 2   cldb.min.containerid 1   cldb.min.fileservers 1 The minimum CLDB fileservers. cldb.min.snap.containerid 1   cldb.min.snapid 1   cldb.replication.manager.start.mins 15 The delay between CLDB startup and replication manager startup, to allow all nodes to register and heartbeat cldb.replication.process.num.containers 60   cldb.replication.sleep.interval.sec 15   cldb.replication.tablescan.interval.sec 2 * 60   cldb.restart.wait.time.sec 180   cldb.snapshots.inprogress.cleanup.minutes 30   cldb.topology.almost.full.percentage 90   cldb.volume.default.replication   The default replication for the CLDB volumes. cldb.volume.epoch     cldb.volumes.default.min.replication 2   cldb.volumes.default.replication 3   mapr.domainname   The domain name MapR uses to get operating system users and groups (in domain mode). mapr.entityquerysource   Sets MapR to get user information from LDAP (LDAP mode) or from the operating system of a domain (domain mode): ldap domain mapr.eula.user     mapr.eula.time     mapr.fs.nocompression "bz2,gz,tgz,tbz2, zip,z,Z,mp3,jpg, jpeg,mpg,mpeg,avi, gif,png" The file types that should not be compressed. See Extensions Not . Compressed mapr.fs.permissions.supergroup   The of the MapR-FS layer. super group mapr.fs.permissions.superuser   The of the MapR-FS layer. super user mapr.ldap.attribute.group   The LDAP server group attribute. mapr.ldap.attribute.groupmembers   The LDAP server groupmembers attribute. mapr.ldap.attribute.mail   The LDAP server mail attribute. mapr.ldap.attribute.uid   The LDAP server uid attribute. mapr.ldap.basedn   The LDAP server Base DN. mapr.ldap.binddn   The LDAP server Bind DN. mapr.ldap.port   The port MapR is to use on the LDAP server. mapr.ldap.server   The LDAP server MapR uses to get users and groups (in LDAP mode). mapr.ldap.sslrequired   Specifies whether the LDAP server requires SSL: 0 == no 1 == yes mapr.license.exipry.notificationdays 30   mapr.quota.group.advisorydefault   The default group advisory quota; see Mana . ging Quotas mapr.quota.group.default   The default group quota; see Managing . Quotas mapr.quota.user.advisorydefault   The default user advisory quota; see Managi ng Quotas. mapr.quota.user.default   The default user quota; see Managing Quotas. mapr.smtp.port   The port MapR uses on the SMTP server ( - 1 65535). mapr.smtp.sender.email   The reply-to email address MapR uses when sending notifications. mapr.smtp.sender.fullname   The full name MapR uses in the Sender field when sending notifications. mapr.smtp.sender.password   The password MapR uses to log in to the SMTP server when sending notifications. mapr.smtp.sender.username   The username MapR uses to log in to the SMTP server when sending notifications. 
mapr.smtp.server   The SMTP server that MapR uses to send notifications. mapr.smtp.sslrequired   Specifies whether SSL is required when sending email: 0 == no 1 == yes mapr.targetversion     mapr.webui.http.port   The port MapR uses for the MapR Control System over HTTP (0-65535); if 0 is specified, disables HTTP access. mapr.webui.https.certpath   The HTTPS certificate path. mapr.webui.https.keypath   The HTTPS key path. mapr.webui.https.port   The port MapR uses for the MapR Control System over HTTPS (0-65535); if 0 is specified, disables HTTPS access. mapr.webui.timeout   The number of seconds the MapR Control System allows to elapse before timing out. mapreduce.cluster.permissions.supergroup   The of the MapReduce layer. super group mapreduce.cluster.permissions.superuser   The of the MapReduce layer. super user config load Displays information about the cluster configuration. You can use the parameter to specify which information to display. keys Syntax CLI maprcli config load [ -cluster <cluster> ] -keys <keys> REST http[s]://<host>:<port>/rest/config/load?<parameter s> Parameters Parameter Description cluster The cluster for which to display values. keys The fields for which to display values; see the ta Configuration Fields ble Output Information about the cluster configuration. See the table. Configuration Fields Sample Output { "status":"OK", "total":1, "data":[ { "mapr.webui.http.port":"8080", "mapr.fs.permissions.superuser":"root", "mapr.smtp.port":"25", "mapr.fs.permissions.supergroup":"supergroup" } ] } Examples Display several keys: CLI maprcli config load -keys mapr.webui.http.port,mapr.webui.https.port,mapr.web ui.https.keystorepath,mapr.webui.https.keystorepass word,mapr.webui.https.keypassword,mapr.webui.timeou t REST https://r1n1.sj.us:8443/rest/config/load?keys=mapr. webui.http.port,mapr.webui.https.port,mapr.webui.ht tps.keystorepath,mapr.webui.https.keystorepassword, mapr.webui.https.keypassword,mapr.webui.timeout config save Saves configuration information, specified as key/value pairs. Permissions required: or . fc a See the table. Configuration Fields Syntax CLI maprcli config save [ -cluster <cluster> ] -values <values> REST http[s]://<host>:<port>/rest/config/save?<parameter s> Parameters Parameter Description cluster The cluster on which to run the command. values A JSON object containing configuration fields; see the Configuration table. Fields Examples Configure MapR SMTP settings: CLI maprcli config save -values '{"mapr.smtp.provider":"gmail","mapr.smtp.server":" smtp.gmail.com","mapr.smtp.sslrequired":"true","map r.smtp.port":"465","mapr.smtp.sender.fullname":"Ab Cd","mapr.smtp.sender.email":"[email protected]","mapr. smtp.sender.username":"[email protected]","mapr.smtp.se nder.password":"abc"}' REST https://r1n1.sj.us:8443/rest/config/save?values={"m apr.smtp.provider":"gmail","mapr.smtp.server":"smtp .gmail.com","mapr.smtp.sslrequired":"true","mapr.sm tp.port":"465","mapr.smtp.sender.fullname":"Ab Cd","mapr.smtp.sender.email":"[email protected]","mapr. smtp.sender.username":"[email protected]","mapr.smtp.se nder.password":"abc"} dashboard The command displays a summary of information about the cluster. dashboard info dashboard info Displays a summary of information about the cluster. For best results, use the option when running from the command -json dashboard info line. Syntax CLI maprcli dashboard info [ -cluster <cluster> ] [ -multi_cluster_info true|false. default: false ] [ -version true|false. 
default: false ] [ -zkconnect <ZooKeeper connect string> ] REST http[s]://<host>:<port>/rest/dashboard/info?<parame ters> Parameters Parameter Description cluster The cluster on which to run the command. multi_cluster_info Specifies whether to display cluster information from multiple clusters. version Specifies whether to display the version. zkconnect ZooKeeper Connect String Output A summary of information about the services, volumes, mapreduce jobs, health, and utilization of the cluster. Output Fields Field Description Timestamp The time at which the data was retrieved, dashboard info expressed as a Unix epoch time. Status The success status of the command. dashboard info Total The number of clusters for which data was queried in the dashboard command. info Version The MapR software version running on the cluster. Cluster The following information about the cluster: name — the cluster name ip — the IP address of the active CLDB id — the cluster ID services The number of active, stopped, failed, and total installed services on the cluster: CLDB File server Job tracker Task tracker HB master HB region server volumes The number and size (in GB) of volumes that are: Mounted Unmounted mapreduce The following mapreduce information: Queue time Running jobs Queued jobs Running tasks Blacklisted jobs maintenance The following information about system health: Failed disk nodes Cluster alarms Node alarms Versions utilization The following utilization information: CPU: Memory Disk space compression Sample Output # maprcli dashboard info -json { "timestamp":1336760972531, "status":"OK", "total":1, "data":[ { "version":"2.0.0", "cluster":{ "name":"mega-cluster", "ip":"192.168.50.50", "id":"7140172612740778586" }, "volumes":{ "mounted":{ "total":76, "size":88885376 }, "unmounted":{ "total":1, "size":6 } }, "utilization":{ "cpu":{ "util":14, "total":528, "active":75 }, "memory":{ "total":2128177, "active":896194 }, "disk_space":{ "total":707537, "active":226848 }, "compression":{ "compressed":86802, "uncompressed":116655 } }, "services":{ "fileserver":{ "active":22, "stopped":0, "failed":0, "total":22 }, "nfs":{ "active":1, "stopped":0, "failed":0, "total":1 }, "webserver":{ "active":1, "stopped":0, "failed":0, "total":1 }, "cldb":{ "active":1, "stopped":0, "failed":0, "total":1 }, "tasktracker":{ "active":21, "stopped":0, "failed":0, "total":21 }, "jobtracker":{ "active":1, "standby":0, "stopped":0, "failed":0, "total":1 }, "hoststats":{ "active":22, "stopped":0, "failed":0, "total":22 } }, "mapreduce":{ "running_jobs":1, "queued_jobs":0, "running_tasks":537, "blacklisted":0 } } ] } Examples Display dashboard information: CLI maprcli dashboard info -json REST https://r1n1.sj.us:8443/rest/dashboard/info dialhome The commands let you change the Dial Home status of your cluster: dialhome dialhome ackdial - acknowledges a successful Dial Home transmission. dialhome enable - enables or disables Dial Home. dialhome lastdialed - displays the last Dial Home transmission. dialhome metrics - displays the metrics collected by Dial Home. dialhome status - displays the current Dial Home status. dialhome ackdial Acknowledges the most recent Dial Home on the cluster. Permissions required: or fc a Syntax CLI maprcli dialhome ackdial [ -forDay <date> ] REST http[s]://<host>:<port>/rest/dialhome/ackdial[?para meters] Parameters Parameter Description forDay Date for which the recorded metrics were successfully dialed home. Accepted values: UTC timestamp or a UTC date in MM/DD/YY format. 
Default: yesterday
Examples Acknowledge Dial Home: CLI maprcli dialhome ackdial REST https://r1n1.sj.us:8443/rest/dialhome/ackdial
dialhome enable Enables or disables Dial Home on the cluster. Permissions required: fc or a. Syntax CLI maprcli dialhome enable -enable 0|1 REST http[s]://<host>:<port>/rest/dialhome/enable Parameters Parameter Description enable Specifies whether to enable or disable Dial Home: 0 - Disable 1 - Enable Output A success or failure message. Sample output pconrad@s1-r1-sanjose-ca-us:~$ maprcli dialhome enable -enable 1 Successfully enabled dialhome pconrad@s1-r1-sanjose-ca-us:~$ maprcli dialhome status Dial home status is: enabled Examples Enable Dial Home: CLI maprcli dialhome enable -enable 1 REST https://r1n1.sj.us:8443/rest/dialhome/enable?enable=1
dialhome lastdialed Displays the date of the last successful Dial Home call. Permissions required: fc or a. Syntax CLI maprcli dialhome lastdialed REST http[s]://<host>:<port>/rest/dialhome/lastdialed Output The date of the last successful Dial Home call. Sample output $ maprcli dialhome lastdialed date 1322438400000 Examples Show the date of the most recent Dial Home: CLI maprcli dialhome lastdialed REST https://r1n1.sj.us:8443/rest/dialhome/lastdialed
dialhome metrics Returns a compressed metrics object. Permissions required: fc or a. Syntax CLI maprcli dialhome metrics [ -forDay <date> ] REST http[s]://<host>:<port>/rest/dialhome/metrics Parameters Parameter Description forDay Date for which the recorded metrics were successfully dialed home. Accepted values: UTC timestamp or a UTC date in MM/DD/YY format. Default: yesterday Output Sample output $ maprcli dialhome metrics metrics [B@48067064 Examples Show the Dial Home metrics: CLI maprcli dialhome metrics REST https://r1n1.sj.us:8443/rest/dialhome/metrics
dialhome status Displays the Dial Home status. Permissions required: fc or a. Syntax CLI maprcli dialhome status REST http[s]://<host>:<port>/rest/dialhome/status Output The current Dial Home status. Sample output $ maprcli dialhome status enabled 1 Examples Display the Dial Home status: CLI maprcli dialhome status REST https://r1n1.sj.us:8443/rest/dialhome/status
disk The disk commands let you work with disks: disk add adds a disk to a node disk list lists disks disk listall lists all disks disk remove removes a disk from a node
Disk Fields The following table shows the fields displayed in the output of the disk list and disk listall commands. You can choose which fields (columns) to display, and sort in ascending or descending order by any single field.
Verbose Name / Terse Name / Description
hostname / hn / Hostname of the node that owns this disk or partition.
diskname / n / Name of the disk or partition.
status / st / Disk status: 0 = Good, 1 = Bad disk
powerstatus / pst / Disk power status: 0 = Active/idle (normal operation), 1 = Standby (low power mode), 2 = Sleeping (lowest power mode, drive is completely shut down)
mount / mt / Disk mount status: 0 = unmounted, 1 = mounted
fstype / fs / File system type
modelnum / mn / Model number
serialnum / sn / Serial number
firmwareversion / fw / Firmware version
vendor / ven / Vendor name
totalspace / dst / Total disk space, in MB
usedspace / dsu / Disk space used, in MB
availablespace / dsa / Disk space available, in MB
(none) / err / Disk error message, in English. Note that this message is not translated. Only sent if st == 1.
(none) / ft / Disk failure time, MapR disks only. Only sent if st == 1.
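These field names are what you look for when checking disk health from the command line; the st (status) column shows 1 for a bad disk. A minimal sketch using the disk list command described below (the hostname is only an example):

# Compare the verbose and terse listings for the same node and check the st column
maprcli disk list -host 10.10.100.22 -output verbose
maprcli disk list -host 10.10.100.22 -output terse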
disk add Adds one or more disks to the specified node. Permissions required: fc or a.
Syntax CLI maprcli disk add [ -cluster ] -disks <disk names> -host <host> REST http[s]://<host>:<port>/rest/disk/add?<parameters>
Parameters Parameter Description cluster The cluster on which to add disks. If not specified, the default is the current cluster. disks A comma-separated list of disk names. Examples: /dev/sdc or /dev/sdd,/dev/sde,/dev/sdf host The hostname or IP address of the machine on which to add the disk.
If you are running MapR 1.2.2 or earlier, do not use the disk add command or the MapR Control System to add disks to MapR-FS. You must either upgrade to MapR 1.2.3 before adding or replacing a disk, or use the following procedure (which avoids the disk add command):
1. Use the MapR Control System to remove the failed disk. All other disks in the same storage pool are removed at the same time. Make a note of which disks have been removed.
2. Create a text file /tmp/disks.txt containing a list of the disks you just removed. See Setting Up Disks for MapR.
3. Add the disks to MapR-FS by typing the following command (as root or with sudo): /opt/mapr/server/disksetup -F /tmp/disks.txt
Output Output Fields Field Description ip The IP address of the machine that owns the disk(s). disk The name of a disk or partition, for example sca or sca/sca1. all The string all, meaning all unmounted disks for this node.
Examples Add a disk: CLI maprcli disk add -disks /dev/sda1 -host 10.250.1.79 REST https://r1n1.sj.us:8443/rest/disk/add?disks=["/dev/sda1"]
disk list The maprcli disk list command lists the disks on a node.
Syntax CLI maprcli disk list -host <host> [ -output terse|verbose ] [ -system 1|0 ] REST http[s]://<host>:<port>/rest/disk/list?<parameters>
Parameters Parameter Description host The node on which to list the disks. output Whether the output should be terse or verbose. Default is verbose. system Show only operating system disks: 0 - shows only MapR-FS disks 1 - shows only operating system disks Not specified - shows both MapR-FS and operating system disks
Output Information about the specified disks. See the Disk Fields table.
Examples List disks on a host: CLI maprcli disk list -host 10.10.100.22 REST https://r1n1.sj.us:8443/rest/disk/list?host=10.10.100.22
disk listall Lists all disks in the cluster.
Syntax CLI maprcli disk listall [ -cluster <cluster> ] [ -limit <limit> ] [ -output terse|verbose ] [ -start <offset> ] REST http[s]://<host>:<port>/rest/disk/listall?<parameters>
Parameters Parameter Description cluster The cluster on which to run the command. limit The number of rows to return, beginning at start. Default: 0 output Always the string terse. start The offset from the starting row according to sort. Default: 0
Output Information about all disks. See the Disk Fields table.
Examples List all disks: CLI maprcli disk listall REST https://r1n1.sj.us:8443/rest/disk/listall
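Putting these commands together, a node that has just had new drives cabled in can be brought into MapR-FS and verified from the command line. A minimal sketch (the hostname and device names are only examples for illustration):

# Confirm what the cluster currently sees, add two new drives to one node, then verify
maprcli disk listall -output terse
maprcli disk add -host 10.250.1.79 -disks /dev/sdd,/dev/sde
maprcli disk list -host 10.250.1.79 -system 0    # the new disks should now appear as MapR-FS disks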
disk remove Removes a disk from MapR-FS. Permissions required: fc or a. The disk remove command does not remove a disk containing unreplicated data unless forced. To force disk removal, specify -force with the value 1.
Syntax CLI maprcli disk remove [ -cluster <cluster> ] -disks <disk names> [ -force 0|1 ] -host <host> REST http[s]://<host>:<port>/rest/disk/remove?<parameters>
Parameters Parameter Description cluster The cluster on which to run the command.
Only use the -force 1 option if you are sure that you do not need the data on the disk. This option removes the disk without regard to replication factor or other data protection mechanisms, and may result in permanent data loss. Removing a disk in the storage pool that contains Container ID 1 will stop your cluster. Container ID 1 contains CLDB data for the master CLDB. Run disk remove without the -force 1 option first and examine the warning messages to make sure you aren't removing the disk with Container ID 1. To safely remove such a disk, perform a CLDB failover to make one of the other CLDB nodes the primary CLDB, then remove the disk as normal.
disks A list of disks in the form: ["disk"] or ["disk","disk","disk"...] or [] force Whether to force removal: 0 (default) - do not remove the disk or disks if there is unreplicated data on the disk 1 - remove the disk or disks regardless of data loss or other consequences host The hostname or IP address of the node from which to remove the disk.
Output Output Fields Field Description disk The name of a disk or partition. Example: sca or sca/sca1 all The string all, meaning all unmounted disks attached to the node. disks A comma-separated list of disks that have non-replicated volumes, for example "sca" or "sca/sca1,scb".
Examples Remove a disk: CLI maprcli disk remove -disks ["sda1"] REST https://r1n1.sj.us:8443/rest/disk/remove?disks=["sda1"]
dump The maprcli dump commands can be used to view key information about volumes, containers, storage pools, and MapR cluster services for debugging and troubleshooting.
dump balancerinfo returns detailed information about the storage pools on a cluster. If there are any active container moves, the command returns information about the source and destination storage pools.
dump balancermetrics returns a cumulative count of container moves and MB of data moved between storage pools.
dump cldbnodes returns the IP address and port number of the CLDB nodes on the cluster.
dump containerinfo returns detailed information about one or more specified containers.
dump replicationmanagerinfo returns information about volumes and the containers on those volumes, including the nodes on which the containers have been replicated and the space allocated to each container.
dump replicationmanagerqueueinfo returns information that enables you to identify containers that are under-replicated or over-replicated.
dump rereplicationinfo returns information about the ongoing re-replication of replica containers, including the destination IP address and port number, the ID number of the destination file server, and the ID number of the destination storage pool.
dump rolebalancerinfo returns information about active replication role switches.
dump rolebalancermetrics returns the cumulative number of times that the replication role balancer has switched the replication role of name containers and data containers on the cluster.
dump volumeinfo returns information about volumes and the associated containers.
dump volumenodes returns the IP address and port number of volume nodes.
dump zkinfo returns the ZooKeeper znodes. This command is used by the mapr-support-collect.sh script to gather cluster diagnostics for troubleshooting.
dump balancerinfo The maprcli dump balancerinfo command enables you to see how much space is used in storage pools and to track active container moves. The disk space balancer is a tool that balances disk space usage on a cluster by moving containers between storage pools.
Whenever a storage disk space balancer pool is over 70% full (or a threshold defined by the parameter), the disk space balancer cldb.balancer.disk.threshold.percentage distributes containers to other storage pools that have lower utilization than the average for that cluster. The disk space balancer aims to ensure that the percentage of space used on all the disks in the node is similar. For more information, see . Disk Space Balancer Syntax maprcli dump balancerinfo [-cluster <cluster name>] Parameters Parameter Description -cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command. Output The command returns detailed information about the storage pools on a cluster. If there are any active maprcli dump balancerinfo container moves, the command returns information about the source and destination storage pools. # maprcli dump balancerinfo -cluster my.cluster.com -json { "timestamp":1337036566035, "status":"OK", "total":187, "data":[ { "spid":"4bc329ce06752062004fa1a537abcdef", "fsid":5410063549464613987, "ip:port":"10.50.60.72:5660-", "capacityMB":1585096, "usedMB":1118099, "percentage":70, "fullnessLevel":"AboveAverage", "inTransitMB":0, "outTransitMB":31874 }, { "spid":"761fec1fabf32104004fad9630ghijkl", "fsid":3770844641152008527, "ip:port":"10.50.60.73:5660-", "capacityMB":1830364, "usedMB":793679, "percentage":47, "fullnessLevel":"BelowAverage", "inTransitMB":79096, "outTransitMB":0 }, .... { "containerid":4034, "sizeMB":16046, "From fsid":5410063549464613987, "From IP:Port":"10.50.60.72:5660-", "From SP":"4bc329ce06752062004fa1a537abcefg", "To fsid":3770844641152008527, "To IP:Port":"10.50.60.73:5660-", "To SP":"761fec1fabf32104004fad9630ghijkl" }, Output fields Field Description spid The unique ID number of the storage pool. fsid The unique ID number of the file server. The FSID identifies an MapR-FS instance or a node that has MapR-FS running in the cluster. Typically, each node has a group of storage pools, so the same FSID will correspond to multiple SPIDs. ip:port The host IP address and MapR-FS port. capacityMB The total capacity of the storage pool (in MB). usedMB The amount of space used on the storage pool (in MB). percentage The percentage of the storage pool currently utilized. A ratio of the space used ( ) to the total capacity ( ) of the usedMB capacityMB storage pool. fullnessLevel The fullness of the storage pool relative to the fullness of the rest of the cluster. Possible values are , , , OverUsed AboveAverage Average , and . For more information, see BelowAverage UnderUsed Monitorin below. g storage pool space usage inTransitMB The amount of data (in MB) that the disk space balancer is currently moving into a storage pool. outTransitMB The amount of data (in MB) that the disk space balancer is currently moving out of a storage pool. The following fields are returned only if the disk space balancer is actively moving one or more containers at the time the command is run. Field Description containerid The unique ID number of the container. sizeMB The amount of data (in MB) being moved. From fsid The FSID (file server ID number) of the source file server. From IP:Port The IP address and port number of the source node. From SP The SPID (storage pool ID) of the source storage pool. To fsid The FSID (file server ID number) of the destination file server. 
To IP:Port The IP address and port number of the destination node. To SP The SPID (storage pool ID number) of the destination storage pool. Examples Monitoring storage pool space usage You can use the command to monitor space usage on storage pools. maprcli dump balancerinfo # maprcli dump balancerinfo -json .... { "spid":"4bc329ce06752062004fa1a537abcefg", "fsid":5410063549464613987, "ip:port":"10.50.60.72:5660-", "capacityMB":1585096, "usedMB":1118099, "percentage":70, "fullnessLevel":"AboveAverage", "inTransitMB":0, "outTransitMB":31874 }, Tracking active container moves Using the command you can monitor the activity of the disk space balancer. Whenever there are active maprcli dump balancerinfo container moves, the command returns information about the source and destination storage pools. # maprcli dump balancerinfo -json .... { "containerid":7840, "sizeMB":15634, "From fsid":8081858704500413174, "From IP:Port":"10.50.60.64:5660-", "From SP":"9e649bf0ac6fb9f7004fa19d20rstuvw", "To fsid":3770844641152008527, "To IP:Port":"10.50.60.73:5660-", "To SP":"fefcc342475f0286004fad963flmnopq" } The example shows that a container (7840) is being moved from a storage pool on node 10.50.60.64 to a storage pool on node 10.50.60.73. dump balancermetrics The command returns a cumulative count of container moves and MB of data moved between storage maprcli dump balancermetrics pools. You can run this command periodically to determine how much data has been moved by the disk space balancer between two intervals. The is a tool that balances disk space usage on a cluster by moving containers between storage pools. Whenever a storage disk space balancer pool is over 70% full (or it reaches a threshold defined by the parameter), the disk space cldb.balancer.disk.threshold.percentage balancer distributes containers to other storage pools that have lower utilization than the average for that cluster. The disk space balancer aims to ensure that the percentage of space used on all the disks in the node is similar. For more information, see . Disk Space Balancer Syntax maprcli dump balancermetrics [-cluster <cluster name>] Tip You can use the storage pool IDs (SPIDs) to search the CLDB and MFS logs for activity (balancer moves, container moves, creates, deletes, etc.) related to specific storage pools. Parameters Parameter Description -cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command. Output The command returns a cumulative count of container moves and MB of data moved between storage maprcli dump balancermetrics pools since the current CLDB became the master CLDB. # maprcli dump balancermetrics -json {     "timestamp":1337770325979,     "status":"OK",     "total":1,     "data":[         {             "numContainersMoved":10090,             "numMBMoved":3147147, "timeOfLastMove": "Wed May 23 03:51:44 PDT 2012"         }     ] } Output fields Field Description numContainersMoved The number of containers moved between storage pools by the disk space balancer. numMBMoved The total MB of data moved between storage pools on the cluster. timeOfLastMove The date and time of most recent container move. dump changeloglevel Dumps the change log level. 
Syntax CLI maprcli dump changeloglevel [ -classname <class name> ] [ -loglevel <log level> ] [ -cldbip <host> ] [ -cldbiprt <port> ] REST None Parameters Parameter Description classname The class name. loglevel The log level to dump. cldbip The IP address of the CLDB to use. Default: 127.0.0.1 cldbiprt The port to use on the CLDB. Default: 7222 Examples CLI maprcli dump changeloglevel REST None dump cldbnodes The command lists the nodes that contain (CLDB) data. maprcli dump cldbnodes container location database The CLDB is a service running on one or more MapR nodes that maintains the location of cluster containers, services, and other information. The CLDB automatically replicates its data to other nodes in the cluster, preserving at least two (and generally three) copies of the CLDB data. If the CLDB process dies, it is automatically restarted on the node. Syntax maprcli dump cldbnodes [-cluster <cluster name>] -zkconnect <ZooKeeper Connect String> Parameters Parameter Description -cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command. -zkconnect <ZooKeeper connection string A ZooKeeper connect string, which specifies a list of the hosts running ZooKeeper, and the port to use on each, in the format: '<ho st>[:<port>][,<host>[:<port>]...]' 1. 2. 3. 1. 2. Output The command returns the IP address and port number of the CLDB nodes on the cluster. maprcli dump cldbnodes $ maprcli dump cldbnodes -zkconnect localhost:5181 -json { { "timestamp":1309882069107, "status":"OK", "total":1, "data":[ { "valid":[ "10.10.30.39:5660-10.50.60.39:5660-", "10.10.30.38:5660-10.50.60.38:5660-", "10.10.30.35:5660-10.50.60.35:5660-" ] } ] } Examples Disaster Recovery In the event that all CLDB nodes fail, you can restore the CLDB from a backup. It is a good idea to set up an automatic backup of the CLDB volume at regular intervals. You can use the command to set up cron jobs to back up CLDB volumes locally or to maprcli dump cldbnodes external media such as a USB drive. For more information, see . Disaster Recovery To back up a CLDB volume from a remote cluster: Set up a cron job to save the container information on the remote cluster using the following command: # maprcli dump cldbnodes -zkconnect <ZooKeeper connect string> > <path to file> Set up a cron job to copy the container information file to a volume on the local cluster. Create a mirror volume on the local cluster, choosing the volume from the remote cluster as the source volume. mapr.cldb.internal Set the mirror sync schedule so that it will run at the same time as the cron job. To back up a CLDB volume locally: Set up a cron job to save the container information to a file on external media by running the following command: # maprcli dump cldbnodes -zkconnect <ZooKeeper connect string> > <path to file> Set up a cron job to create a dump file of the local volume on external media. Example: mapr.cldb.internal # maprcli volume dump create -name mapr.cldb.internal -dumpfile <path to file> For information about restoring from a backup of the CLDB, contact MapR Support. dump containerinfo The command enables you to view detailed information about one or more specified containers. maprcli dump containerinfo A is a unit of sharded storage in a MapR cluster. Every container in a MapR volume is either a or a . 
container name container data container The name container is the first container in a volume and holds that volume's namespace and file chunk locations. Depending on its replication role, a name container may be either a (part of the original copy of the volume) or a (one of the replicas in the master container replica container replication chain). Every data container is either a , an , or a . master container intermediate container tail container Syntax maprcli dump containerinfo [-clustername <cluster name>] -ids <id1,id2,id3 ...> Parameters Parameter Description [-clustername <cluster name>] The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command. -ids <id1,id2,id3 ...> Specifies one or more container IDs. Container IDs are comma separated. Output The command returns information about one or more containers. maprcli dump containerinfo # maprcli dump containerinfo -ids 2049 -json { "timestamp":1335831624586, "status":"OK", "total":1, "data":[ { "ContainerId":2049, "Epoch":11, "Master":"10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--11-VALID", "ActiveServers":{ "IP:Port":"10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--11-VALID" }, "InactiveServers":{ }, "UnusedServers":{ }, "OwnedSizeMB":"0 MB", "SharedSizeMB":"0 MB", "LogicalSizeMB":"0 MB", "Mtime":"Thu Mar 22 15:44:22 PDT 2012", "NameContainer":"true", "VolumeName":"mapr.cluster.root", "VolumeId":93501816, "VolumeReplication":3, "VolumeMounted":true } ] } Output fields Field Description ContainerID The unique ID number for the container. Epoch A sequence number that indicates the most recent copy of the container. The CLDB uses the epoch to ensure that an out-of-date copy cannot become the master for the container. Master The physical IP address and port number of the . The master copy master copy is part of the original copy of the volume. ActiveServers The physical IP address and port number of each active node on which the container resides. InactiveServers The physical IP address and port number of each inactive node on which the container resides. UnusedServers The physical IP address and port number of servers from which no "heartbeat" has been received for quite some time. OwnedSizeMB The size on disk (in MB) dedicated to the container. SharedSizeMB The size on disk (in MB) shared by the container. LogicalSizeMB The logical size on disk (in MB) of the container. TotalSizeMB The total size on disk (in MB) allocated to the container. Combines the Owned Size and Shared Size. Mtime The time of the last modification to the contents of the container. NameContainer Indicates if the container is the for the volume. If name container tru , the container holds the volume's namespace information and file e chunk locations. VolumeName The name of the volume. VolumeId The unique ID number of the volume. VolumeReplication The , the number of copies of a volume excluding the replication factor original. VolumeMounted Indicates whether the volume is mounted. If , the volume is true currently mounted. If , the volume is not mounted. false dump replicationmanagerinfo The enables you to see which containers are under or over replicated in a specified volume. For maprcli dump replicationmanagerinfo each container, the command displays the current state of that container. 
Syntax maprcli dump replicationmanagerinfo [-cluster <cluster name>] -volumename <volume name> Parameters Parameter Description -cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command. -volumename <volume name> Specifies the name of the volume. Output The returns information about volumes and the containers on those volumes including the nodes maprcli dump replicationmanagerinfo on which the containers have been replicated and the space allocated to each container. # maprcli dump replicationmanagerinfo -cluster my.cluster.com -volumename mapr.metrics -json { "timestamp":1335830006872, "status":"OK", "total":2, "data":[ { "VolumeName":"mapr.metrics", "VolumeId":182964332, "VolumeTopology":"/", "VolumeUsedSizeMB":1, "VolumeReplication":3, "VolumeMinReplication":2 }, { "ContainerId":2053, "Epoch":9, "Master":"10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--9-VALID", "ActiveServers":{ "IP:Port":"10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--9-VALID" }, "InactiveServers":{ }, "UnusedServers":{ }, "OwnedSizeMB":"1 MB", "SharedSizeMB":"0 MB", "LogicalSizeMB":"1 MB", "Mtime":"Mon Apr 30 16:40:41 PDT 2012", "NameContainer":"true" } ] } Output fields Field Description VolumeName Indicates the name of the volume. VolumeId Indicates the ID number of the volume. VolumeTopology The volume topology corresponds to the node topology of the rack or nodes where the volume resides. By default, new volumes are created with a topology of / (root directory). For more information, see Volume Topology VolumeUsedSizeMB The size on disk (in MB) of the volume. VolumeReplication The desired replication factor, the number of copies of a volume excluding the original. The default value is . 3 VolumeMinReplication The minimum replication factor, the number of copies of a volume (excluding the original) that should be maintained by the MapR cluster for normal operation. When the replication factor falls below this minimum, writes to the volume are disabled. The default value is . 2 ContainerId The unique ID number for the container. Epoch A sequence number that indicates the most recent copy of the container. The CLDB uses the epoch to ensure that an out-of-date copy cannot become the master for the container. Master The physical IP address and port number of the . The master copy master copy is part of the original copy of the volume. ActiveServers The physical IP address and port number of each active node on which the container resides. InactiveServers The physical IP address and port number of each inactive node on which the container resides. UnusedServers The physical IP address and port number of each on which the container does not reside. OwnedSizeMB The size on disk (in MB) dedicated to the container. SharedSizeMB The size on disk (in MB) shared by the container. LogicalSizeMB The logical size on disk (in MB) of the container. Mtime Indicates the time of the last modification to the container's contents. NameContainer Indicates if the container is the for the volume. If name container tru , the container is the volume's first container and replication occurs e simultaneously from the master to the intermediate and tail containers. 
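In practice, this command is useful for spot-checking the replication state of an important volume right after a node outage. A minimal sketch (the cluster and volume names are only examples):

# Check replication details for the cluster root volume, in JSON form
maprcli dump replicationmanagerinfo -cluster my.cluster.com -volumename mapr.cluster.root -json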
dump replicationmanagerqueueinfo The command enables you to determine the status of under-replicated containers and maprcli dump replicationmanagerqueueinfo over-replicated containers. Syntax maprcli dump replicationmanagerqueueinfo [-cluster <cluster name>] -queue <queue> Parameters Parameter Description cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command. queue <queue> The name of the queue. Valid values are , , or . Queue 0 includes 0 1 2 containers that have copies below the minimum replication factor for the volume. Queue 1 includes containers that have copies below the replication for the volume, but above the minimum replication factor. Queue 2 includes containers that are over-replicated. Output The command returns information about one of three queues: 0, 1, or 2. Depending on the maprcli dump replicationmanagerqueueinfo queue value entered, the command displays information about containers that are under-replicated or over-replicated. You can use this information to decide if you need to change the replication factor for that volume. # maprcli dump replicationmanagerqueueinfo -queue 0 Mtime LogicalSizeMB UnusedServers ActiveServers TotalSizeMB NameContainer InactiveServers ContainerId Master Epoch SharedSizeMB OwnedSizeMB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB false 2065 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB false 2064 10.250.1.103:5660--3-VALID 3 0 MB 0 MB 0 MB ... 0 MB true 1 10.250.1.103:5660--8-VALID 8 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB false 2066 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 1 MB ... 0 MB false 2069 10.250.1.103:5660--5-VALID 5 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 1 MB ... 0 MB false 2068 10.250.1.103:5660--5-VALID 5 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB false 2071 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB false 2070 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB false 2073 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB false 2072 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB false 2075 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB false 2074 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB false 2077 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB false 2076 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Thu May 17 10:36:30 PDT 2012 0 MB ... 0 MB true 2049 10.250.1.103:5660--7-VALID 7 0 MB 0 MB Thu May 17 10:36:36 PDT 2012 0 MB ... 0 MB true 2050 10.250.1.103:5660--7-VALID 7 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB true 2051 10.250.1.103:5660--6-VALID 6 0 MB 0 MB Thu May 17 10:37:06 PDT 2012 0 MB ... 0 MB true 2053 10.250.1.103:5660--6-VALID 6 0 MB 0 MB Fri May 18 14:33:44 PDT 2012 0 MB ... 0 MB true 2054 10.250.1.103:5660--5-VALID 5 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB true 2055 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB true 2056 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB false 2057 10.250.1.103:5660--5-VALID 5 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 
0 MB false 2058 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB false 2059 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB false 2060 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB false 2061 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB false 2062 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Thu May 17 10:32:59 PDT 2012 0 MB ... 0 MB false 2063 10.250.1.103:5660--3-VALID 3 0 MB 0 MB Output fields Field Description ContainerID The unique ID number of the container. Epoch A sequence number that indicates the most recent copy of the container. The CLDB uses the epoch to ensure that an out-of-date copy cannot become the master for the container. Master The physical IP address and port number of the . The master copy master copy is part of the original copy of the volume. ActiveServers The physical IP address and port number of each active node on which the container resides. InactiveServers The physical IP address and port number of each inactive node on which the container resides. UnusedServers The physical IP address and port number of servers from which no "heartbeat" has been received for quite some time. OwnedSizeMB The size on disk (in MB) dedicated to the container. SharedSizeMB The size on disk (in MB) shared by the container. LogicalSizeMB The logical size on disk (in MB) of the container. TotalSizeMB The total size on disk (in MB) allocated to the container. Combines the Owned Size and Shared Size. Mtime The time of the last modification to the contents of the container. NameContainer Indicates if the container is the for the volume. If name container tru , the container holds the volume's namespace information and file e chunk locations. dump rereplicationinfo The command enables you to view information about the re-replication of containers. maprcli dump rereplicationinfo Re-replication occurs whenever the number of available replica containers drops below the number prescribed by that volume's replication factor. Re-replication may occur for a variety of reasons including replica container corruption, node unavailability, hard disk failure, or an increase in replication factor. Syntax maprcli dump rereplicationinfo [-cluster <cluster name>] Parameters Parameter Description -cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command. Output The command returns information about the ongoing re-replication of replica containers including the maprcli dump rereplicationinfo destination IP address and port number, the ID number of the destination file server, and the ID number of the destination storage pool. 
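A quick way to gauge overall replication health is to check each queue in turn; queue 0 is the most urgent because those containers are below the minimum replication factor for their volume. A minimal sketch using a shell loop:

# Check under-replicated (queues 0 and 1) and over-replicated (queue 2) containers
for q in 0 1 2; do
  echo "== queue $q =="
  maprcli dump replicationmanagerqueueinfo -queue $q
done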
# maprcli dump rereplicationinfo -json { "timestamp":1338222709331, "status":"OK", "total":7, "data":[ { "containerid":2158, "replica":{ "sizeMB":15467, "To fsid":9057314602141502940, "To IP:Port":"192.0.2.28:5660-", "To SP":"03b5970f41abbe48004f828abaabcdef" } }, { "containerid":3367, "replica":{ "sizeMB":658, "To fsid":3684488804112157043, "To IP:Port":"192.0.2.33:5660-", "To SP":"3b86b4ce5bfd6bbf004f87e9b6ghijkl" } }, { "containerid":3376, "replica":{ "sizeMB":630, "To fsid":3684488804112157043, "To IP:Port":"192.0.2.33:5660-", "To SP":"3b86b4ce5bfd6bbf004f87e9b6ghijkl" } }, { "containerid":3437, "replica":{ "sizeMB":239, "To fsid":6776586767180745590, "To IP:Port":"192.0.2.32:5660-", "To SP":"6cd440fad0426db7004f828b2amnopqr" } }, { "containerid":8833, "replica":{ "sizeMB":7327, "To fsid":9057314602141502940, "To IP:Port":"192.0.2.28:5660-", "To SP":"33885e3c5be9a04d004f828abcstuvwx" } } ] } Output fields Field Description sizeMB The amount of data (in MB) being moved. To fsid The ID number (FSID) of the destination file server. To IP:Port The IP address and port number of the destination node. To SP The ID number (SPID) of the destination storage pool. dump rolebalancerinfo The command enables you to monitor the replication role balancer and view information about active maprcli dump rolebalancerinfo replication role switches. The is a tool that switches the replication roles of containers to ensure that every node has an equal share of master and replication role balancer replica containers (for name containers) and an equal share of master, intermediate, and tail containers (for data containers). The replication role balancer changes the replication role of the containers in a cluster so that network bandwidth is spread evenly across all nodes during the replication process. A container's replication role determines how it is replicated to the other nodes in the cluster. For name (the volume's first container), replication occurs simultaneously from the master to all replica containers. For , containers data containers replication proceeds from the master to the intermediate container(s) until it reaches the tail containers. For more information, see Replication . Role Balancer Syntax maprcli dump rolebalancerinfo [-cluster <cluster name>] Parameters Parameter Description -cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command. Output The command returns information about active replication role switches. maprcli dump rolebalancerinfo # maprcli dump rolebalancerinfo -json { "timestamp":1335835436698, "status":"OK", "total":1, "data":[ { "containerid": 36659, "Tail IP:Port":"10.50.60.123:5660-", "Updates blocked Since":"Wed May 23 05:48:15 PDT 2012" } ] } Output fields Field Description containerid The unique ID number of the container. Tail IP:Port The IP address and port number of the tail container node. Updates blocked Since During a replication role switch, updates to that container are blocked. dump rolebalancermetrics The command enables you to view the number of times that the replication role balancer has maprcli dump rolebalancermetrics switched the replication role of the name containers and data containers to ensure that containers are balanced across the nodes in the cluster. 
The replication role balancer is a tool that switches the replication roles of containers to ensure that every node has an equal share of master and replica containers (for name containers) and an equal share of master, intermediate, and tail containers (for data containers). The replication role balancer changes the replication role of the containers in a cluster so that network bandwidth is spread evenly across all nodes during the replication process. A container's replication role determines how it is replicated to the other nodes in the cluster. For name containers (the volume's first container), replication occurs simultaneously from the master to all replica containers. For data containers, replication proceeds from the master to the intermediate container(s) until it reaches the tail containers. For more information, see Replication Role Balancer.
Syntax maprcli dump rolebalancermetrics [-cluster <cluster name>]
Parameters Parameter Description -cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command.
Output The maprcli dump rolebalancermetrics command returns the cumulative number of times that the replication role balancer has switched the replication role of name containers and data containers on the cluster. # maprcli dump rolebalancermetrics -json { "timestamp":1337777286527, "status":"OK", "total":1, "data":[ { "numNameContainerSwitches":60, "numDataContainerSwitches":28, "timeOfLastMove":"Wed May 23 05:48:00 PDT 2012" } ] }
Output fields Field Description numNameContainerSwitches The number of times that the replication role balancer has switched the replication role of name containers. numDataContainerSwitches The number of times that the replication role balancer has switched the replication role of data containers. timeOfLastMove The date and time of the last replication role change.
dump volumeinfo The maprcli dump volumeinfo command enables you to view information about a volume and the containers within that volume. A volume is a logical unit that allows you to apply policies to a set of files, directories, and sub-volumes. Using volumes, you can enforce disk usage limits, set replication levels, establish ownership and accountability, and measure the cost generated by different projects or departments. For more information, see Managing Data with Volumes.
Syntax maprcli dump volumeinfo [-cluster <cluster name>] -volumename <volume name>
Parameters Parameter Description cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command. volumename <volume name> The name of the volume.
Output The maprcli dump volumeinfo command returns information about the volume and the containers associated with that volume. Volume information includes the ID, volume name, and replication factor. For each container on the specified volume, the command returns information about nodes and storage.
# maprcli dump volumeinfo -volumename mapr.cluster.root -json { "timestamp":1335830155441, "status":"OK", "total":2, "data":[ { "VolumeName":"mapr.cluster.root", "VolumeId":93501816, "VolumeTopology":"/", "VolumeUsedSizeMB":0, "VolumeReplication":3, "VolumeMinReplication":2 }, { "ContainerId":2049, "Epoch":11, "Master":"10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--11-VALID", "ActiveServers":{ "IP:Port":"10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--11-VALID" }, "InactiveServers":{ }, "UnusedServers":{ }, "OwnedSizeMB":"0 MB", "SharedSizeMB":"0 MB", "LogicalSizeMB":"0 MB", "Mtime":"Thu Mar 22 15:44:22 PDT 2012", "NameContainer":"true" } ] } Output fields Field Description VolumeName The name of the volume. VolumeId The unique ID number of the volume. VolumeTopology The volume topology corresponds to the node topology of the rack or nodes where the volume resides. By default, new volumes are created with a topology of / (root directory). For more information, see . Volume Topology VolumeUsedSizeMB The size on disk (in MB) of the volume. VolumeReplication The desired replication factor, the number of copies of a volume. The default value is . The maximum value is . 3 6 VolumeMinReplication The minimum replication factor, the number of copies of a volume (excluding the original) that should be maintained by the MapR cluster for normal operation. When the replication factor falls below this minimum, writes to the volume are disabled. The default value is . 2 ContainerId The unique ID number of the container. Epoch A sequence number that indicates the most recent copy of the container. The CLDB uses the epoch to ensure that an out-of-date copy cannot become the master for the container. Master The physical IP address and port number of the . The master copy master copy is part of the original copy of the volume. ActiveServers The physical IP address and port number of each active node on which the container resides. InactiveServers The physical IP address and port number of each inactive node on which the container resides. UnusedServers The physical IP address and port number of servers from which no "heartbeat" has been received for quite some time. OwnedSizeMB The size on disk (in MB) dedicated to the container. SharedSizeMB The size on disk (in MB) shared by the container. LogicalSizeMB The logical size on disk (in MB) of the container. TotalSizeMB The total size on disk (in MB) allocated to the container. Combines the Owned Size and Shared Size. Mtime Indicates the time of the last modification to the contents of the container. NameContainer Indicates if the container is the for the volume. If name container tru , the container is the volume's first container and replication occurs e simultaneously from the master to the intermediate and tail containers. dump volumenodes The command enables you to view information about the nodes on a volume. maprcli dump volumenodes Syntax maprcli dump volumenodes [-cluster <cluster name>] -volumename <volume name> Parameters Parameter Description cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command. volumename <volume name> The name of the volume. Output The command returns the IP address and port number of volume nodes. 
maprcli dump volumenodes # maprcli dump volumenodes -volumename mapr.hbase -json { "timestamp":1337280188850, "status":"OK", "total":1, "data":[ { "Servers":{ "IP:Port":"10.250.1.103:5660--7-VALID" } } ] } Output fields Field Description IP:Port The IP address and MapR-FS port. dump zkinfo The command enables you to view a snapshot of the data stored in Zookeeper as a result of cluster operations. maprcli dump zkinfo ZooKeeper prevents service coordination conflicts by enforcing a rigid set of rules and conditions, provides cluster-wide information about running services and their configuration, and provides a mechanism for almost instantaneous service failover. The will not start any services warden unless ZooKeeper is reachable and more than half of the configured ZooKeeper nodes are live. The mapr-support-collect.sh script calls the command to gather cluster diagnostics for troubleshooting. For more maprcli dump supportdump information, see . mapr-support-collect.sh Syntax maprcli dump zkinfo [-cluster <cluster name>] [-zkconnect <connect string>] Parameters Parameter Description -cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command. -zkconnect <connection string> A ZooKeeper connect string, which specifies a list of the hosts running ZooKeeper, and the port to use on each, in the format: '<ho st>[:<port>][,<host>[:<port>]...]' Output The command is run as part of support dump tools to view the current state of the Zookeeper service. The command maprcli dump zkinfo should always be run using the flag. Output in the tabular format is not useful. Command output displays the data stored in the ZooKeepr -json hierarchical tree of znodes. # maprcli dump zkinfo -json { "timestamp":1335825202157, "status":"OK", "total":1, "data":[ { "/_Stats":"\ncZxid = 0,ctime = Wed Dec 31 16:00:00 PST 1969,mZxid = 0,mtime = Wed Dec 31 16:00:00 PST 1969,pZxid = 516,cversion = 12,dataVersion = 0,aclVersion = 0,ephemeralOwner = 0,dataLength = 0,numChildren = 13", "/":[ { .... } ] } Output fields You can use the command as you would use a database snapshot. The , , maprcli dump zkinfo /services /services_config /server , and znodes are used by the wardens to store and exchange information. s /*_locks Field Description services The directory is used by the wardens to store and /services exchange information about services. datacenter The directory contains CLDB "vital signs" that you can /datacenter to identify the CLDB master, the most recent epoch, and other key data. For more information, see below. Moving CLDB Data 1. 2. 3. 4. services_config The directory is used by the wardens to store /services_config and exchange information. zookeeper The directory stores information about the ZooKeeper /zookeeper service. servers The directory is used by the wardens to store and /servers exchange information. nodes The directory (znode) stores key information about the /nodes nodes. Examples Moving CLDB Data In an M3-licensed cluster, CLDB data must be recovered from a failed CLDB node and installed on another node. The cluster can continue normally as soon as the CLDB is started on another node. For more information, see . 
Recovering from a Failed CLDB Node on an M3 Cluster Use the command to identify the latest epoch of the CLDB, identify the nodes where replicates of the CLDB are stored, maprcli dump zkinfo and select one of those nodes to serve the new CLDB node. Perform the following steps on any cluster node: Log in as or use for the following commands. root sudo Issue the command using the flag. maprcli dump zkinfo -json # maprcli dump zkinfo -json The output displays the ZooKeeper znodes. In the directory, locate the CLDB with the latest epoch. /datacenter/controlnodes/cldb/epoch/1 { "/datacenter/controlnodes/cldb/epoch/1/KvStoreContainerInfo":" Container ID:1 VolumeId:1 Master:10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--13-VALID Servers: 10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--13-VALID Inactive Servers: Unused Servers: Latest epoch:13" } The Latest Epoch field identifies the current epoch of the CLDB data. In this example, the latest epoch is . 13 Select a CLDB from among the copies at the latest epoch. For example, indicates that the node has a 10.250.2.41:5660--13-VALID copy at epoch 13 (the latest epoch). entity The entity commands let you work with (users and groups): entities entity info shows information about a specified user or group entity list lists users and groups in the cluster entity modify edits information about a specified user or group entity info Displays information about an entity. Syntax CLI maprcli entity info [ -cluster <cluster> ] -name <entity name> [ -output terse|verbose ] -type <type> REST http[s]://<host>:<port>/rest/entity/info?<parameter s> Parameters Parameter Description cluster The cluster on which to run the command. name The entity name. output Whether to display terse or verbose output. type The entity type Output DiskUsage EntityQuota EntityType EntityName VolumeCount EntityAdvisoryquota EntityId 864415 0 0 root 208 0 0 Output Fields             Field Description DiskUsage Disk space used by the user or group EntityQuota The user or group quota EntityType The entity type EntityName The entity name VolumeCount The number of volumes associated with the user or group EntityAdvisoryquota The user or group advisory quota EntityId The ID of the user or group Examples Display information for the user 'root': CLI maprcli entity info -type 0 -name root REST https://r1n1.sj.us:8443/rest/entity/info?type=0&nam e=root entity list Syntax CLI maprcli entity list [ -alarmedentities true|false ] [ -cluster <cluster> ] [ -columns <columns> ] [ -filter <filter> ] [ -limit <rows> ] [ -output terse|verbose ] [ -start <start> ] REST http[s]://<host>:<port>/rest/entity/list?<parameter s> Parameters Parameter Description alarmedentities Specifies whether to list only entities that have exceeded a quota or advisory quota. cluster The cluster on which to run the command. columns A comma-separated list of fields to return in the query. See the Fields table below. filter A filter specifying entities to display. See for more information. Filters limit The number of rows to return, beginning at start. Default: 0 output Specifies whether output should be or . terse verbose start The offset from the starting row according to sort. Default: 0 Output Information about the users and groups. Fields Field Description EntityType Entity type 0 = User 1 = Group EntityName User or Group name EntityId User or Group id EntityQuota Quota, in MB. = no quota. 0 EntityAdvisoryquota Advisory quota, in MB. = no advisory quota. 0 VolumeCount The number of volumes this entity owns. 
DiskUsage Disk space used for all entity's volumes, in MB. Sample Output DiskUsage EntityQuota EntityType EntityName VolumeCount EntityAdvisoryquota EntityId 5859220 0 0 root 209 0 0 Examples List all entities: CLI maprcli entity list REST https://r1n1.sj.us:8443/rest/entity/list entity modify Modifies a user or group quota or email address. Permissions required: or fc a Syntax CLI maprcli entity modify [ -advisoryquota <advisory quota> [ -cluster <cluster> ] [ -email <email>] [ -entities <entities> ] -name <entityname> [ -quota <quota> ] -type <type> REST http[s]://<host>:<port>/rest/entity/modify?<paramet ers> Parameters Parameter Description advisoryquota The advisory quota. cluster The cluster on which to run the command. email Email address. entities A comma-separated list of entities, in the format . <type>:<name> Example: 0:<user1>,0:<user2>,1:<group1>,1:<group2>... name The entity name. quota The quota for the entity. type The entity type: 0=user 1-group Examples Modify the email address for the user 'root': CLI maprcli entity modify -name root -type 0 -email [email protected] REST https://r1n1.sj.us:8443/rest/entity/modify?name=roo t&type=0&[email protected] job The commands enable you to manipulate information about the Hadoop jobs that are running on your cluster: job job changepriority - Changes the priority of a specific job. job kill - Kills a specific job. job linklogs - Uses to create symbolic links to all the logs relating to the activity of a specific job. Centralized Logging job table - Retrieves detailed information about the jobs running on the cluster. job changepriority Changes the priority of the specified job. Syntax CLI maprcli job changepriority [ -cluster cluster name ] -jobid job ID -priority NORMAL|LOW|VERY_LOW|HIGH|VERY_HIGH REST http[s]://<host>:<port>/rest/job/changepriority?<pa rameters> Parameters Parameter Description cluster Cluster name jobid Job ID priority New job priority Examples Changing a Job's Priority: CLI maprcli job changepriority -jobid job_201120603544_8282 -priority LOW REST https://r1n1.sj.us:8443/rest/job/changepriority?job id=job_201120603544_8282&priority=LOW job kill The command kills the specified job. job kill Syntax CLI maprcli job kill [ -cluster cluster name ] -jobid job ID REST http[s]://<host>:<port>/rest/job/kill?[cluster=clus ter_name&]jobid=job_ID Parameters Parameter Description cluster Cluster name jobid Job ID Examples Killing a Job CLI maprcli job kill -jobid job_201120603544_8282 REST https://r1n1.sj.us:8443/rest/job/kill?jobid=job_201 120603544_8282 job linklogs The command performs , which provides a job-centric view of all log files generated by tracker maprcli job linklogs Centralized Logging nodes during job execution. The output of is a directory populated with symbolic links to all log files related to tasks, map attempts, and reduce attempts job linklogs pertaining to the specified job(s). The command can be performed during or after a job. Syntax CLI maprcli job linklogs -jobid <jobPattern> -todir <desinationDirectory> REST http[s]://<host>:<port>/rest/job/linklogs?jobid=<jo bPattern>&todir=<destinationDirectory> Parameters Parameter Description jobid A regular expression specifying the target jobs. todir The target location to dump the Centralized Logging output directories. Output The following directory structure will be created in the location specified by for all jobids matching the parameter. 
todir jobid <jobid>/hosts/<host>/ contains symbolic links to log directories of tasks executed for <jobid> on <host> <jobid>/mappers/ contains symbolic links to log directories of all map task attempts for <jobid> across the whole cluster <jobid>/reducers/ contains symbolic links to log directories of all reduce task attempts for <jobid> across the whole cluster Examples Link logs for all jobs named "wordcount1" and dump output to /myvolume/joblogviewdir: CLI maprcli job linklogs -jobid job_*_wordcount1 -todir /myvolume/joblogviewdir REST https://r1n1.sj.us:8443/rest/job/linklogs?jobid=job _*_wordcount1&todir=/myvolume/joblogviewdir job table Retrieves histograms and line charts for job metrics. Use the API to retrieve for your job. The metrics data can be formatted for histogram display or line chart display. job table job metrics Syntax REST http[s]://<host>:<port>/api/job/table?output=terse& filter=string&chart=chart_type&columns=list_of_colu mns&scale=scale_type<parameters> Parameters Parameter Description filter Filters results to match the value of a specified string. chart Chart type to use: for a line chart, for a histogram. line bar columns Comma-separated list of column to return. names bincount Number of histogram bins. scale Scale to use for the histogram. Specify for a linear scale and linear for a logarithmic scale. log Column Names Parameter Description Notes jmadavg Job Average Map Attempt Duration   jradavg Job Average Reduce Attempt Duration   jtadavg Job Average Task Duration Filter Only jcmtct Job Complete Map Task Count Filter Only jcrtct Job Complete Reduce Task Count Filter Only jctct Job Complete Task Count Filter Only jccpu Job Cumulative CPU   jcmem Job Cumulative Physical Memory   jcpu Job Current CPU Filter Only jmem Job Current Memory Filter Only jfmtact Job Failed Map Task Attempt Count   jfmtct Job Failed Map Task Count   jfrtact Job Failed Reduce Task Attempt Count   jfrtct Job Failed Reduce Task Count   jftact Job Failed Task Attempt Count Filter Only jftct Job Failed Task Count Filter Only jmibps Job Map Input Bytes Rate Per-second throughput rate jmirps Job Map Input Records Rate Per-second throughput rate jmobps Job Map Output Bytes Rate Per-second throughput rate jmorps Job Map Output Records Rate Per-second throughput rate jmtact Job Mask Task Attempt Count Per-second throughput rate jmtct Job Map Task Count   jmadmax Job Maximum Map Attempt Duration   jradmax Job Maximum Reduce Attempt Duration   jtadmax Job Maximum Task Duration Filter Only jribps Job Reduce Input Bytes Rate   jrirps Job Reduce Input Records Rate   jrobps Job Reduce Output Bytes Rate   jrorps Job Reduce Output Records Rate   jrsbps Job Reduce Shuffle Bytes Rate   jrtact Job Reduce Task Attempt Count   jrtct Job Reduce Task Count   jrumtct Job Running Map Task Count Filter Only jrurtct Job Running Reduce Task Count Filter Only jrutct Job Running Task Count Filter Only jtact Job Task Attempt Count Filter Only jtct Job Total Task Count Filter Only jd Job Duration Histogram Only jid Job ID Filter Only jn Job Name Filter Only ju Job User Filter Only js Job Status Filter Only jmcmem Job Map Cumulative Memory Bytes Histogram Only jrcmem Job Reduce Cumulative Memory Bytes Histogram Only jpri Job Priority Filter Only jmpro Job Map Progress Filter Only jrpro Job Reduce Progress Filter Only jmtst Map Tasks Start Time   jmtft Map Tasks Finish Time   jrtst Reduce Tasks Start Time   jrtft Reduce Tasks Finish Time   jsbt Job Submit Time Filter Only jst Job Start Time Filter Only jft Job Finish Time 
Filter Only jrrsw Job Reduce Reserve Slot Wait   jmrsw Job Map Reserve Slot Wait   jdlmt Job Data-local Map Tasks   jolmt Job Non-local Map Tasks   jrlmt Job Rack-local Map Tasks   jrtd Job Reduce Tasks Duration   jmtd Job Map Tasks Duration   jmmbr Job MapR-FS Map Bytes Read   jrmbr Job MapR-FS Reduce Bytes Read   jtmbr Job MapR-FS Total Bytes Read   jmmbw Job MapR-FS Map Bytes Written   jrmbw Job MapR-FS Reduce Bytes Written   jtmbw Job MapR-FS Total Bytes Written   jmfbw Job Map File Bytes Written   jrfbw Job Reduce File Bytes Written   jtfbw Job Total File Bytes Written   jmir Cumulative Job Map Input Records   jrir Cumulative Job Reduce Input Records   jcir Cumulative Job Combine Input Records   jmor Cumulative Job Map Output Records   jror Cumulative Job Reduce Output Records   jcor Job Combine Output Records   jrsb Cumulative Job Reduce Shuffle Bytes   jmsr Job Map Spilled Records   jrsr Job Reduce Spilled Records   jtsr Job Total Spilled Records   jmob Cumulative Job Map Output Bytes   jmib Cumulative Job Map Input Bytes   jmcpu Job Map CPU   jrcpu Job Reduce CPU   jmsrb Job Map Split Raw Bytes   jrsrb Job Reduce Split Raw Bytes   jsrb Job Split Raw Bytes   jrig Job Reduce Input Groups   jvmb Virtual Memory Bytes   jmvmb Job Map Virtual Memory Bytes   jrvmb Job Reduce Virtual Memory Bytes   jgct Job Total GC Time   jmgct Job Map GC Time   jrgct Job Reduce GC Time   Examples Retrieve a Histogram: REST https://r1n1.sj.us:8443/api/job/table?chart=bar&fil ter=[tt!=JOB_SETUP]and[tt!=JOB_CLEANUP]and[jid==job _201129649560_3390]&columns=td&bincount=28&scale=lo g CURL curl -d @json https://r1n1.sj.us:8443/api/job/table In the example above, the file contains a URL-encoded version of the information in the section below. curl json Request Request GENERAL_PARAMS: { [chart: "bar"|"line"], columns: <comma-sep list of column terse names>, [filter: "[<terse_field>{operator}<value>]and[...]",] [output: terse,] [start: int,] [limit: int] } REQUEST_PARAMS_HISTOGRAM: { chart:bar columns:jd filter: <anything> } REQUEST_PARAMS_LINE: { chart:line, columns:jmem, filter: NOT PARSED, UNUSED IN BACKEND } REQUEST_PARAMS_GRID: { columns:jid,jn,js,jd filter:<any real filter expression> output:terse, start:0, limit:50 } Response RESPONSE_SUCCESS_HISTOGRAM: { "status" : "OK", "total" : 15, "columns" : ["jd"], "binlabels" : ["0-5s","5-10s","10-30s","30-60s","60-90s","90s-2m","2m-5m","5m-10m","10m-30m","30m-1h ","1h-2h","2h-6h","6h-12h","12h-24h",">24h"], "binranges" : [ [0,5000], [5000,10000], [10000,30000], [30000,60000], [60000,90000], [90000,120000], [120000,300000], [300000,600000], [600000,1800000], [1800000,3600000], [3600000,7200000], [7200000,21600000], [21600000,43200000], [43200000,86400000], [86400000] ], "data" : [33,919,1,133,9820,972,39,2,44,80,11,93,31,0,0] } RESPONSE_SUCCESS_GRID: { "status": "OK", "total" : 50, "columns" : ["jid","jn","ju","jg","js","jcpu","jccpu","jmem","jcmem","jpri","jmpro","jrpro","jsbt" ,"jst","jft","jd","jmtct","jfmtact","jrtct","jmtact","jrtact","jtact","jfrtact","jftac t","jfmtct","jfrtct","jftct","jtct","jrumtct","jrurtct","jrutct","jctct","jcmtct","jcr tct","jmirps","jmorps","jmibps","jmobps","jrirps","jrorps","jribps","jrobps","jtadavg" ,"jmadavg","jmadmax","jtadmax","jradavg","jradmax"], "data" : [ ["job_201210216041_7311","Billboard Top 10","heman","jobberwockies","PREP",69,9106628,857124,181087410,"LOW",30,48,13099922755 80,1316685654403,1324183149687,7497495284,72489,25227,6171223,95464,6171184,6266648,-3 
8,25189,13115,-4,13111,6243712,5329,4,6243712,6225268,54045,6171223,403,128,570,137,17 2,957,490,179,246335,367645,1758151,1758151,125024,514028], ["job_201129309372_8897","Super Big","srivas","jobberwockies","KILLED",59,3125830,2895159,230270693,"LOW",91,1,1313111 819653,1323504739893,1326859602015,3354862122,8705980,3774739,7691269,12515000,1627363 1,28788631,8196156,11970895,2706470,4365698,7072168,16397249,215570,35,16397249,910947 6,5783940,3325536,707,509,345,463,429,93,88,752,406336,553455,3392429,3392429,259216,5 11285], ["job_201165490737_7144","Trending Human Interaction","mickey","jobberwockies","PREP",100,1304791,504092,728635524,"VERY_LOW",5 7,90,1301684627596,1331548331890,1331592957521,44625631,7389503,3770494,5433308,110114 95,15048822,26060317,9769362,13539856,2544349,4315172,6859521,12822811,21932,327,12822 811,5941031,4823222,1117809,739,654,561,426,925,23,420,597,292024,470314,1854688,18546 88,113733,566672], ["job_201152533959_6159","Star Search","darth","fbi","FAILED",82,7151113,2839682,490527441,"NORMAL",51,61,13053670422 24,1325920952001,1327496965896,1576013895,8939964,2041524,4965024,10795895,8786681,195 82576,3924842,5966366,1130482,3422544,4553026,13904988,833761,2,13904988,8518199,69757 21,1542478,665,916,10,34,393,901,608,916,186814,331708,2500504,2500504,41920,251453] ] } RESPONSE_SUCCESS_LINE: { "status" : "OK", "total" : 22, "columns" : ["jcmem"], "data" : [ [1329891055016,0], [1329891060016,8], [1329891065016,16], [1329891070016,1024], [1329891075016,2310], [1329891080016,3243], [1329891085016,4345], [1329891090016,7345], [1329891095016,7657], [1329891100016,8758], [1329891105016,9466], [1329891110016,10345], [1329891115016,235030], [1329891120016,235897], [1329891125016,287290], [1329891130016,298390], [1329891135016,301355], [1329891140016,302984], [1329891145016,303985], [1329891150016,304403], [1329891155016,503030], [1329891160016,983038] ] } license The license commands let you work with MapR licenses: license add - adds a license license addcrl - adds a certificate revocation list (CRL) license apps - displays the features included in the current license license list - lists licenses on the cluster license listcrl - lists CRLs license remove - removes a license license showid - displays the cluster ID license add Adds a license. Permissions required: or fc a The license can be specified either by passing the license string itself to , or by specifying a file containing the license string. license add Syntax CLI maprcli license add [ -cluster <cluster> ] [ -is_file true|false ] -license <license> REST http[s]://<host>:<port>/rest/license/add?<parameter s> Parameters Parameter Description cluster The cluster on which to run the command. is_file Specifies whether the specifies a file. If , the license false licens parameter contains a long license string. e license The license to add to the cluster. If is true, speci -is_file license fies the filename of a license file. Otherwise, contains the license license string itself. Examples Adding a License from a File Assuming a file containing a license string, the following command adds the license to the cluster. /tmp/license.txt CLI maprcli license add -is_file true -license /tmp/license.txt license addcrl Adds a certificate revocation list (CRL). 
Permissions required: or fc a Syntax CLI maprcli license addcrl [ -cluster <cluster> ] -crl <crl> [ -is_file true|false ] REST http[s]://<host>:<port>/rest/license/addcrl?<parame ters> Parameters Parameter Description cluster The cluster on which to run the command. crl The CRL to add to the cluster. If file is set, crl specifies the filename of a CRL file. Otherwise, crl contains the CRL string itself. is_file Specifies whether the license is contained in a file. license apps Displays the features authorized for the current license. Permissions required: login Syntax CLI maprcli license apps [ -cluster <cluster> ] REST http[s]://<host>:<port>/rest/license/apps?<paramete rs> Parameters Parameter Description cluster The cluster on which to run the command. license list Lists licenses on the cluster. Permissions required: login Syntax CLI maprcli license list [ -cluster <cluster> ] REST http[s]://<host>:<port>/rest/license/list?<paramete rs> Parameters Parameter Description cluster The cluster on which to run the command. license listcrl Lists certificate revocation lists (CRLs) on the cluster. Permissions required: login Syntax CLI maprcli license listcrl [ -cluster <cluster> ] REST http[s]://<host>:<port>/rest/license/listcrl?<param eters> Parameters Parameter Description cluster The cluster on which to run the command. license remove Removes a license. Permissions required: or fc a Syntax CLI maprcli license remove [ -cluster <cluster> ] -license_id <license> REST http[s]://<host>:<port>/rest/license/remove?<parame ters> Parameters Parameter Description cluster The cluster on which to run the command. license_id The license to remove. license showid Displays the cluster ID for use when creating a new license. Permissions required: login Syntax CLI maprcli license showid [ -cluster <cluster> ] REST http[s]://<host>:<port>/rest/license/showid?<parame ters> Parameters Parameter Description cluster The cluster on which to run the command. Metrics API A Hadoop job sets the rules that the JobTracker service uses to break an input data set into discrete tasks and assign those tasks to individual nodes. The MapR Metrics service provides two API calls that enable you to retrieve grids of job data or task attempt data depending on the parameters you send: /api/job/table retrieves information about the jobs running on your cluster. You can use this API to retrieve information about the number of task attempts for jobs on the cluster, job duration, job computing resource use (CPU and memory), and job data throughput (both records and bytes per second). /api/task/table retrieves information about the tasks that make up a specific job, as well as the specific task attempts. You can use this API to retrieve information about a task attempt's data throughput, measured in number of records per second as well as in bytes per second. Both of these APIs provide robust filtering capabilities to display data with a high degree of specificity. nagios The command generates a topology script for Nagios nagios generate nagios generate Generates a Nagios Object Definition file that describes the cluster nodes and the services running on each. Syntax CLI maprcli nagios generate [ -cluster <cluster> ] REST http[s]://<host>:<port>/rest/nagios/generate?<param eters> Parameters Parameter Description cluster The cluster on which to run the command. 
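As a hedged sketch of how this command is typically put to use (the cluster name, file paths, and Nagios layout shown here are hypothetical assumptions, not part of this reference), the generated object definitions can be written to a file and referenced from the main Nagios configuration; the sample output and official examples for this command follow below.

# Write the generated host and command definitions to a file (path assumed).
maprcli nagios generate -cluster cluster-1 > /etc/nagios/objects/mapr_cluster.cfg

# Reference the generated file from the main Nagios configuration
# (assumes a standard nagios.cfg layout; adjust the path for your install).
echo "cfg_file=/etc/nagios/objects/mapr_cluster.cfg" >> /etc/nagios/nagios.cfg

# Verify the combined configuration before reloading Nagios.
nagios -v /etc/nagios/nagios.cfg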
Output Sample Output ############# Commands ############# define command { command_name check_fileserver_proc command_line $USER1$/check_tcp -p 5660 } define command { command_name check_cldb_proc command_line $USER1$/check_tcp -p 7222 } define command { command_name check_jobtracker_proc command_line $USER1$/check_tcp -p 50030 } define command { command_name check_tasktracker_proc command_line $USER1$/check_tcp -p 50060 } define command { command_name check_nfs_proc command_line $USER1$/check_tcp -p 2049 } define command { command_name check_hbmaster_proc command_line $USER1$/check_tcp -p 60000 } define command { command_name check_hbregionserver_proc command_line $USER1$/check_tcp -p 60020 } define command { command_name check_webserver_proc command_line $USER1$/check_tcp -p 8443 } ################# HOST: host1 ############### define host { use linux-server host_name host1 address 192.168.1.1 check_command check-host-alive } ################# HOST: host2 ############### define host { use linux-server host_name host2 address 192.168.1.2 check_command check-host-alive } Examples Generate a nagios configuration, specifying cluster name and ZooKeeper nodes: CLI maprcli nagios generate -cluster cluster-1 REST https://host1:8443/rest/nagios/generate?cluster=clu ster-1 Generate a nagios configuration and save to the file : nagios.conf CLI maprcli nagios generate > nagios.conf nfsmgmt The command refreshes the NFS exports on the specified host and/or port. nfsmgmt refreshexports nfsmgmt refreshexports Refreshes the list of clusters and mount points available to mount with NFS. Permissions required: or . fc a Syntax CLI maprcli nfsmgmt refreshexports [ -nfshost <host> ] [ -nfsport <port> ] REST http[s]://<host><:port>/rest/nfsmgmt/refreshexports ?<parameters> Parameters Parameter Description nfshost The hostname of the node that is running the MapR NFS server. nfsport The port to use. node The node commands let you work with nodes in the cluster: node allow-into-cluster node cldbmaster node heatmap node list node listcldbs node listcldbzks node listzookeepers node maintenance node metrics node move node remove node services node topo add-to-cluster Allows host IDs to join the cluster after duplicates have been resolved. When the CLDB detects duplicate nodes with the same host ID, all nodes with that host ID are removed from the cluster and prevented from joining it again. After making sure that all nodes have unique host IDs, you can use the command to un-ban the node allow-into-cluster host ID that was previously duplicated among several nodes. Syntax CLI maprcli node allow-into-cluster [ -hostids <host IDs> ] REST http[s]://<host>:<port>/rest/node/allow-into-cluste r?<parameters> Parameters Parameter Description hostids A comma-separated list of host IDs. Examples Allow former duplicate host IDs node1 and node2 to join the cluster: CLI maprcli node allow-into-cluster -hostids node1,node2 REST https://r1n1.sj.us:8443/rest/node/allow-into-cluste r?hostids=node1,node2 node allow-into-cluster Allows host IDs to join the cluster after duplicates have been resolved. When the CLDB detects duplicate nodes with the same host ID, all nodes with that host ID are removed from the cluster and prevented from joining it again. After making sure that all nodes have unique host IDs, you can use the command to un-ban the node allow-into-cluster host ID that was previously duplicated among several nodes. 
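The syntax and examples for this command follow. As a hedged illustration of the overall workflow (the host IDs shown are hypothetical), you would first confirm that every node now reports a unique host ID and only then re-admit the previously banned IDs:

# List host IDs and hostnames to confirm that no two nodes share an ID.
maprcli node list -columns id,hostname

# Once the IDs are unique, allow the formerly duplicated IDs back into the cluster.
maprcli node allow-into-cluster -hostids node1,node2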
Syntax CLI maprcli node allow-into-cluster [ -hostids <host IDs> ] REST http[s]://<host>:<port>/rest/node/allow-into-cluste r?<parameters> Parameters Parameter Description hostids A comma-separated list of host IDs. Examples Allow former duplicate host IDs node1 and node2 to join the cluster: CLI maprcli node allow-into-cluster -hostids node1,node2 REST https://r1n1.sj.us:8443/rest/node/allow-into-cluste r?hostids=node1,node2 node cldbmaster Returns the address of the master CLDB node. The API returns the server ID and hostname of the node serving as the CLDB master node. node cldbmaster Syntax CLI maprcli node cldbmaster [ -cluster <cluster name> ] REST http[s]://<host>:<port>/rest/node/cldbmaster?<param eters> Parameters Parameter Description cluster name The name of the cluster for which to return the CLDB master node information. Examples Return the CLDB master node information for the cluster my.cluster.com: CLI maprcli node cldbmaster -cluster my.cluster.com REST https://r1n1.sj.us:8443/rest/node/cldbmaster?cluste r=my.cluster.com node heatmap Displays a heatmap for the specified nodes. Syntax CLI maprcli node heatmap [ -cluster <cluster> ] [ -filter <filter> ] [ -view <view> ] REST http[s]://<host>:<port>/rest/node/heatmap?<paramete rs> Parameters Parameter Description cluster The cluster on which to run the command. filter A filter specifying snapshots to preserve. See for more Filters information. view Name of the heatmap view to show: status = Node status: 0 = Healthy 1 = Needs attention 2 = Degraded 3 = Maintenance 4 = Critical cpu = CPU utilization, as a percent from 0-100. memory = Memory utilization, as a percent from 0-100.  diskspace = MapR-FS disk space utilization, as a percent from 0-100.  DISK_FAILURE = Status of the DISK_FAILURE alarm. if clear, 0 if raised. 1 SERVICE_NOT_RUNNING = Status of the SERVICE_NOT_RUNNING alarm. if clear, if raised. 0 1 CONFIG_NOT_SYNCED = Status of the CONFIG_NOT_SYNCED alarm. if clear, if raised. 0 1 Output Description of the output. { status:"OK", data: [{ "{{rackTopology}}" : { "{{nodeName}}" : {{heatmapValue}}, "{{nodeName}}" : {{heatmapValue}}, "{{nodeName}}" : {{heatmapValue}}, ... }, "{{rackTopology}}" : { "{{nodeName}}" : {{heatmapValue}}, "{{nodeName}}" : {{heatmapValue}}, "{{nodeName}}" : {{heatmapValue}}, ... }, ... }] } Output Fields Field Description rackTopology The topology for a particular rack. nodeName The name of the node in question. heatmapValue The value of the metric specified in the view parameter for this node, as an integer. Examples Display a heatmap for the default rack: CLI maprcli node heatmap REST https://r1n1.sj.us:8443/rest/node/heatmap Display memory usage for the default rack: CLI maprcli node heatmap -view memory REST https://r1n1.sj.us:8443/rest/node/heatmap?view=memo ry node list Lists nodes in the cluster. Syntax CLI maprcli node list [ -alarmednodes 1 ] [ -cluster <cluster ] [ -columns <columns>] [ -filter <filter> ] [ -limit <limit> ] [ -nfsnodes 1 ] [ -output terse|verbose ] [ -start <offset> ] [ -zkconnect <ZooKeeper Connect String> ] REST http[s]://<host>:<port>/rest/node/list?<parameters> Parameters Parameter Description alarmednodes If set to 1, displays only nodes with raised alarms. Cannot be used if nfsnodes is set. cluster The cluster on which to run the command. columns A comma-separated list of fields to return in the query. See the Fields table below. filter A filter specifying nodes on which to start or stop services. See Filters for more information. 
limit The number of rows to return, beginning at start. Default: 0 nfsnodes If set to 1, displays only nodes running NFS. Cannot be used if alarmednodes is set. output Specifies whether the output should be terse or verbose. start The offset from the starting row according to sort. Default: 0 zkconnect ZooKeeper Connect String Output Information about the nodes.See the table below. Fields Sample Output bytesSent dreads davail TimeSkewAlarm servicesHoststatsDownAlarm ServiceHBMasterDownNotRunningAlarm ServiceNFSDownNotRunningAlarm ttmapUsed DiskFailureAlarm mused id mtotal cpus utilization rpcout ttReduceSlots ServiceFileserverDownNotRunningAlarm ServiceCLDBDownNotRunningAlarm dtotal jt-heartbeat ttReduceUsed dwriteK ServiceTTDownNotRunningAlarm ServiceJTDownNotRunningAlarm ttmapSlots dused uptime hostname health disks faileddisks fs-heartbeat rpcin ip dreadK dwrites ServiceWebserverDownNotRunningAlarm rpcs LogLevelAlarm ServiceHBRegionDownNotRunningAlarm bytesReceived service topo(rack) MapRfs disks ServiceMiscDownNotRunningAlarm VersionMismatchAlarm 8300 0 269 0 0 0 0 75 0 4058 6394230189818826805 7749 4 3 141 50 0 0 286 2 10 32 0 0 100 16 Thu Jan 15 16:58:57 PST 1970 whatsup 0 1 0 0 51 10.250.1.48 0 2 0 0 0 0 8236 /third/rack/whatsup 1 0 0 Fields Field Description bytesReceived Bytes received by the node since the last CLDB heartbeat. bytesSent Bytes sent by the node since the last CLDB heartbeat. CorePresentAlarm Cores Present Alarm (NODE_ALARM_CORE_PRESENT): 0 = Clear 1 = Raised cpus The total number of CPUs on the node. davail Disk space available on the node. DiskFailureAlarm Failed Disks alarm (DISK_FAILURE): 0 = Clear 1 = Raised disks Total number of disks on the node. dreadK Disk Kbytes read since the last heartbeat. dreads Disk read operations since the last heartbeat. dtotal Total disk space on the node. dused Disk space used on the node. dwriteK Disk Kbytes written since the last heartbeat. dwrites Disk write ops since the last heartbeat. faileddisks Number of failed MapR-FS disks on the node. 0 = Clear 1 = Raised fs-heartbeat Time since the last heartbeat to the CLDB, in seconds. health Overall node health, calculated from various alarm states: 0 = Healthy 1 = Needs attention 2 = Degraded 3 = Maintenance 4 = Critical hostname The host name. id The node ID. ip A list of IP addresses associated with the node. jt-heartbeat Time since the last heartbeat to the JobTracker, in seconds. LogLevelAlarm Excessive Logging Alarm (NODE_ALARM_DEBUG_LOGGING): 0 = Clear 1 = Raised MapRfs disks   mtotal Total memory, in GB. mused Memory used, in GB. HomeMapRFullAlarm Installation Directory Full Alarm (ODE_ALARM_OPT_MAPR_FULL): 0 = Clear 1 = Raised RootPartitionFullAlarm Root Partition Full Alarm (NODE_ALARM_ROOT_PARTITION_FULL): 0 = Clear 1 = Raised rpcin RPC bytes received since the last heartbeat. rpcout RPC bytes sent since the last heartbeat. rpcs Number of RPCs since the last heartbeat. 
service A comma-separated list of services running on the node: cldb - CLDB fileserver - MapR-FS jobtracker - JobTracker tasktracker - TaskTracker hbmaster - HBase Master hbregionserver - HBase RegionServer nfs - NFS Gateway Example: "cldb,fileserver,nfs" ServiceCLDBDownNotRunningAlarm CLDB Service Down Alarm (NODE_ALARM_SERVICE_CLDB_DOWN) 0 = Clear 1 = Raised ServiceFileserverDownNotRunningAlarm Fileserver Service Down Alarm (NODE_ALARM_SERVICE_FILESERVER_DOWN) 0 = Clear 1 = Raised ServiceHBMasterDownNotRunningAlarm HBase Master Service Down Alarm (NODE_ALARM_SERVICE_HBMASTER_DOWN) 0 = Clear 1 = Raised ServiceHBRegionDownNotRunningAlarm HBase Regionserver Service Down Alarm" (NODE_ALARM_SERVICE_HBREGION_DOWN) 0 = Clear 1 = Raised ServicesHoststatsDownNotRunningAlarm Hoststats Service Down Alarm (NODE_ALARM_SERVICE_HOSTSTATS_DOWN) 0 = Clear 1 = Raised ServiceJTDownNotRunningAlarm Jobtracker Service Down Alarm (NODE_ALARM_SERVICE_JT_DOWN) 0 = Clear 1 = Raised ServiceMiscDownNotRunningAlarm 0 = Clear 1 = Raised ServiceNFSDownNotRunningAlarm NFS Service Down Alarm (NODE_ALARM_SERVICE_NFS_DOWN): 0 = Clear 1 = Raised ServiceTTDownNotRunningAlarm Tasktracker Service Down Alarm (NODE_ALARM_SERVICE_TT_DOWN): 0 = Clear 1 = Raised ServicesWebserverDownNotRunningAlarm Webserver Service Down Alarm (NODE_ALARM_SERVICE_WEBSERVER_DOWN) 0 = Clear 1 = Raised TimeSkewAlarm Time Skew alarm (NODE_ALARM_TIME_SKEW): 0 = Clear 1 = Raised racktopo The rack path. ttmapSlots TaskTracker map slots. ttmapUsed TaskTracker map slots used. ttReduceSlots TaskTracker reduce slots. ttReduceUsed TaskTracker reduce slots used. uptime Date when the node came up. utilization CPU use percentage since the last heartbeat. VersionMismatchAlarm Software Version Mismatch Alarm (NODE_ALARM_VERSION_MISMATCH): 0 = Clear 1 = Raised Examples List all nodes: CLI maprcli node list REST https://r1n1.sj.us:8443/rest/node/list List the health of all nodes: CLI maprcli node list -columns service,health REST https://r1n1.sj.us:8443/rest/node/list?columns=serv ice,health List the number of slots on all nodes: CLI maprcli node list -columns ip,ttmapSlots,ttmapUsed,ttReduceSlots,ttReduceUsed REST https://r1n1.sj.us:8443/rest/node/list?columns=ip,t tmapSlots,ttmapUsed,ttReduceSlots,ttReduceUsed node listcldbs The API returns the hostnames of the nodes in the cluster that are running the CLDB service. node listcldbs Syntax CLI maprcli node listcldbs [ -cluster <cluster name> ] [ -cldb <cldb hostname|ip:port> ] REST http[s]://<host>:<port>/rest/node/listcldbs?<parame ters> Parameters Parameter Description cluster name The name of the cluster for which to return the list of CLDB node hostnames. cldb hostname|ip:port The hostname or IP address and port number of a CLDB node. Examples Return the list of CLDB nodes for the cluster my.cluster.com: CLI maprcli node listcldbs -cluster my.cluster.com REST https://r1n1.sj.us:8443/rest/node/listcldbs?cluster =my.cluster.com node listcldbzks The API returns the hostnames of the nodes in the cluster that are running the CLDB service and the IP addresses and port node listcldbzks numbers for the nodes in the cluster that are running the ZooKeeper service. Syntax CLI maprcli node listcldbzks [ -cluster <cluster name> ] [ -cldb <cldb hostname|ip:port> ] REST http[s]://<host>:<port>/rest/node/listcldbzks?<para meters> Parameters Parameter Description cluster name The name of the cluster for which to return the CLDB and ZooKeeper information. 
cldb hostname|ip:port The hostname or IP address and port number of a CLDB node. Examples Return CLDB and ZooKeeper node information for the cluster my.cluster.com: CLI maprcli node listcldbzks -cluster my.cluster.com REST https://r1n1.sj.us:8443/rest/node/listcldbzks?clust er=my.cluster.com node listzookeepers The API returns the hostnames of the nodes in the cluster that are running the zookeeper service. node listzookeepers Syntax CLI maprcli node listzookeepers [ -cluster <cluster name> ] [ -cldb <cldb hostname|ip:port> ] REST http[s]://<host>:<port>/rest/node/listzookeepers?<p arameters> Parameters Parameter Description cluster name The name of the cluster for which to return the list of zookeeper node hostnames. cldb hostname|ip:port The hostname or IP address and port number of a valid CLDB node. The other CLDB nodes and zookeeper nodes can be discovered from this node. Examples Return the list of zookeeper nodes for the cluster my.cluster.com If you know that the CLDB service is running on a node with hostname , you can enter: host1 CLI maprcli node listzookeepers -cluster my.cluster.com -cldb host1 REST https://r1n1.sj.us:8443/rest/node/listzookeepers?cl uster=my.cluster.com&cldb=host1 node maintenance Places a node into a maintenance mode for a specified timeout duration. For the duration of the timeout, the cluster's CLDB does not consider this node's data as lost and does not trigger a resync of the data on this node. See for more information. Nodes Syntax CLI maprcli node maintenance [ -cluster <cluster> ] [ -serverids <serverids> ] [ -nodes <nodes> ] -timeoutminutes minutes REST http[s]://<host>:<port>/rest/node/maintenance?<para meters> Parameters Parameter Description cluster The cluster on which to run the command. serverids List of server IDs nodes List of nodes timeoutminutes Duration of timeout in minutes node metrics Retrieves metrics information for nodes in a cluster. Use the API to retrieve node data for your job. node metrics Syntax CLI maprcli node metrics -nodes -start start_time -end end_time [ -json ] [ -interval interval ] [ -events ] [ -columns columns ] [ -cluster cluster name ] Parameters Parameter Description nodes A space-separated list of node names. start The start of the time range. Can be a UTC timestamp or a UTC date in MM/DD/YY format. end The end of the time range. Can be a UTC timestamp or a UTC date in MM/DD/YY format. json Specify this flag to return data as a JSON object. interval Data measurement interval in seconds. The minimum value is 10 seconds. events Specify to return node events only. The default value of this TRUE parameter is . FALSE columns Comma-separated list of column to return. names cluster Cluster name. Column Name Parameters The API always returns the (node name), (timestamp string), and (integer timestamp) node metrics NODE TIMESTAMPSTR TIMESTAMP columns. Use the flag to specify a comma-separated list of column names to return. -columns Parameter Description Notes CPUNICE Amount of CPU time used by processes with a positive nice value.   CPUUSER Amount of CPU time used by user processes.   CPUSYSTEM Amount of CPU time used by system processes.   LOAD5PERCENT Percentage of time this node spent at load 5 or below   The , , and parameters return information in . This unit measures one tick of the system timer CPUNICE CPUUSER CPUSYSTEM jiffies interrupt and is usually equivalent to 10 milliseconds, but may vary depending on your particular node configuration. Call sysconf(_SC to determine the exact value for your node. 
_CLK_TCK) LOAD1PERCENT Percentage of time this node spent at load 1 or below   MEMORYCACHED Memory cache size in bytes   MEMORYSHARED Shared memory size in bytes   MEMORYBUFFERS Memory buffer size in bytes   MEMORYUSED Memory used in megabytes   PROCRUN Number of processes running   RPCCOUNT Number of RPC calls   RPCINBYTES Number of bytes passed in by RPC calls   RPCOUTBYTES Number of bytes passed out by RPC calls   SERVAVAILSIZEMB Server storage available in megabytes   SERVUSEDSIZEMB Server storage used in megabytes   SWAPFREE Free swap space in bytes   TTMAPUSED Number of TaskTracker slots used for map tasks   TTREDUCEUSED Number of TaskTracker slots used for reduce tasks   Three column name parameters return data that is too granular to display in a standard table. Use the option to return this information as a -json JSON object. Parameter Description Metrics Returned CPUS Activity on this node's CPUs. Each CPU on the node is numbered from zero, to cpu0 cp . Metrics returned are for each CPU. uN CPUIDLE: Amount of CPU time spent idle. Reported as . jiffies : Amount of CPU time spent CPUIOWAIT waiting for I/O operations. Reported as . jiffies : Total amount of CPU time. CPUTOTAL Reported as . jiffies DISKS Activity on this node's disks. Metrics returned are for each partition. READOPS: Number of read operations. : Number of kilobytes read. READKB : Number of write operations. WRITEOPS : Number of kilobytes written. WRITEKB NETWORK Activity on this node's network interfaces. Metrics returned are for each interface. BYTESIN: Number of bytes received. : Number of bytes sent. BYTESOUT : Number of packets received. PKTSIN : Number of packets sent. PKTSOUT Examples To retrieve the percentage of time that a node spent at the 1 and 5 load levels: [user@host ~]# maprcli node metrics -nodes my.node.lab -start 07/25/12 -end 07/26/12 -interval 7200 -columns LOAD1PERCENT,LOAD5PERCENT NODE LOAD5PERCENT LOAD1PERCENT TIMESTAMPSTR TIMESTAMP my.node.lab Wed Jul 25 12:52:40 PDT 2012 1343245960047 my.node.lab 11 18 Wed Jul 25 14:52:50 PDT 2012 1343253170000 my.node.lab 10 23 Wed Jul 25 16:52:50 PDT 2012 1343260370000 my.node.lab 15 46 Wed Jul 25 18:52:57 PDT 2012 1343267577000 my.node.lab 18 34 Wed Jul 25 20:52:58 PDT 2012 1343274778000 my.node.lab 28 70 Wed Jul 25 22:53:01 PDT 2012 1343281981000 my.node.lab 35 84 Thu Jul 26 00:53:01 PDT 2012 1343289181000 my.node.lab 30 35 Thu Jul 26 02:53:03 PDT 2012 1343296383000 my.node.lab 36 62 Thu Jul 26 04:53:10 PDT 2012 1343303590000 my.node.lab 37 44 Thu Jul 26 06:53:14 PDT 2012 1343310794000 my.node.lab 12 28 Thu Jul 26 08:53:21 PDT 2012 1343318001000 my.node.lab 22 38 Thu Jul 26 10:53:30 PDT 2012 1343325210000 Sample JSON object This JSON object is returned by the following command: [user@host ~]# maprcli node metrics -nodes my.node.lab -json -start 1343290000000 -end 1343300000000 -interval 28800 -columns LOAD1PERCENT,LOAD5PERCENT,CPUS { "timestamp":1343333063869, "status":"OK", "total":3, "data":[ { "NODE":"my.node.lab", "TIMESTAMPSTR":"Wed Jul 25 18:00:05 PDT 2012", "TIMESTAMP":1343264405000, "LOAD1PERCENT":13, "LOAD5PERCENT":12 "CPUS":{ "cpu0":{ "CPUIDLE":169173957, "CPUIOWAIT":2982912, "CPUTOTAL":173897423 }, "cpu1":{ "CPUIDLE":172217855, "CPUIOWAIT":26760, "CPUTOTAL":174016589 }, "cpu2":{ "CPUIDLE":171071574, "CPUIOWAIT":4051, "CPUTOTAL":173957716 }, }, }, { "NODE":"my.node.lab", "TIMESTAMPSTR":"Thu Jul 26 02:00:08 PDT 2012", "TIMESTAMP":1343293208000, "LOAD1PERCENT":17, "LOAD5PERCENT":13 "CPUS":{ "cpu0":{ "CPUIDLE":169173957, "CPUIOWAIT":2982912, 
"CPUTOTAL":173897423 }, "cpu1":{ "CPUIDLE":172217855, "CPUIOWAIT":26760, "CPUTOTAL":174016589 }, "cpu2":{ "CPUIDLE":171071574, "CPUIOWAIT":4051, "CPUTOTAL":173957716 }, }, }, { "NODE":"my.node.lab", "TIMESTAMPSTR":"Thu Jul 26 10:00:08 PDT 2012", "TIMESTAMP":1343322008000, "LOAD1PERCENT":18, "LOAD5PERCENT":13 "CPUS":{ "cpu0":{ "CPUIDLE":169173957, "CPUIOWAIT":2982912, "CPUTOTAL":173897423 }, "cpu1":{ "CPUIDLE":172217855, "CPUIOWAIT":26760, "CPUTOTAL":174016589 }, "cpu2":{ "CPUIDLE":171071574, "CPUIOWAIT":4051, "CPUTOTAL":173957716 }, }, } ] } node move Moves one or more nodes to a different topology. Permissions required: or . fc a Syntax CLI maprcli node move [ -cluster <cluster> ] -serverids <server IDs> -topology <topology> REST http[s]://<host>:<port>/rest/node/move?<parameters> Parameters Parameter Description cluster The cluster on which to run the command. serverids The server IDs of the nodes to move. topology The new topology. To obtain the server ID, run maprcli node list -columns id . Sample output from is shown below. The resulting server ID(s) can be copied and pasted into the maprcli node list -columns id maprcl command. i node move id hostname ip 547819249997313015 node-34.lab 10.10.40.34,10.10.88.34 2130988050310536949 node-36.lab 10.10.40.36,10.10.88.36 8683110801227243688 node-37.lab 10.10.40.37,10.10.88.37 5056865595028557458 node-38.lab 10.10.40.38,10.10.88.38 3111141192171195352 node-39.lab 10.10.40.39,10.10.88.39 node remove The command removes one or more server nodes from the system. Permissions required: or node remove fc a After issuing the command, wait several minutes to ensure that the node has been properly and completely removed. node remove Syntax CLI maprcli node remove [ -filter "<filter>" ] [ -force true|false ] [ -hostids <host IDs. ] [ -nodes <node names> ] [ -zkconnect <ZooKeeper Connect String> ] REST http[s]://<host>:<port>/rest/node/remove?<parameter s> Parameters Parameter Description filter A filter specifying nodes on which to start or stop services. See Filters for more information. force Forces the service stop operations. Default: false hostids A list of host IDs, separated by spaces. nodes A list of node names, separated by spaces. zkconnect . Example: 'host:port,host:port,host:port,...'. ZooKeeper Connect String default: localhost:5181 node services Starts, stops, restarts, suspends, or resumes services on one or more server nodes. Permissions required: , or ss fc a The same set of services applies to all specified nodes; to manipulate different groups of services differently, send multiple requests.  Syntax Note: the suspend and resume actions have not yet been implemented. CLI maprcli node services [ -action restart|resume|start|stop|suspend ] [ -cldb restart|resume|start|stop|suspend ] [ -cluster <cluster> ] [ -fileserver restart|resume|start|stop|suspend ] [ -filter <filter> ] [ -hbmaster restart|resume|start|stop|suspend ] [ -hbregionserver restart|resume|start|stop|suspend ] [ -jobtracker restart|resume|start|stop|suspend ] [ -name <service> ] [ -nfs restart|resume|start|stop|suspend ] [ -nodes <node names> ] [ -tasktracker restart|resume|start|stop|suspend ] [ -zkconnect <ZooKeeper Connect String> ] REST http[s]://<host>:<port>/rest/node/services?<paramet ers> Parameters When used together, the and parameters specify an action to perform on a service. To start the JobTracker, for example, you can action name either specify for the and for the , or simply specify on the . 
start action jobtracker name start jobtracker Parameter Description action An action to perform on a service specified in the parameter: name restart, resume, start, stop, or suspend cldb Starts or stops the cldb service. Values: restart, resume, start, stop, or suspend cluster The cluster on which to run the command. fileserver Starts or stops the fileserver service. Values: restart, resume, start, stop, or suspend filter A filter specifying nodes on which to start or stop services. See Filters for more information. hbmaster Starts or stops the hbmaster service. Values: restart, resume, start, stop, or suspend hbregionserver Starts or stops the hbregionserver service. Values: restart, resume, start, stop, or suspend jobtracker Starts or stops the jobtracker service. Values: restart, resume, start, stop, or suspend name A service on which to perform an action specified by the par action ameter. nfs Starts or stops the nfs service. Values: restart, resume, start, stop, or suspend nodes A list of node names, separated by spaces. tasktracker Starts or stops the tasktracker service. Values: restart, resume, start, stop, or suspend zkconnect ZooKeeper Connect String node topo Lists cluster topology information. Lists internal nodes only (switches/racks/etc) and not leaf nodes (server nodes). Syntax CLI maprcli node topo [ -cluster <cluster> ] [ -path <path> ] REST http[s]://<host>:<port>/rest/node/topo?<parameters> Parameters Parameter Description cluster The cluster on which to run the command. path The path on which to list node topology. Output Node topology information. Sample output { "timestamp":1360704473225, "status":"OK", "total":recordCount, "data": [ { "path":path, }, ...additional structures above for each topology node... ] } Output Fields Field Description path The physical topology path to the node. rlimit The rlimit commands enable you to get and set resource usage limits for your cluster. rlimit get rlimit set rlimit get The API returns the resource usage limit for the cluster's disk resource rlimit get Syntax CLI maprcli rlimit get -resource disk [ -cluster <cluster name> ] REST http[s]://<host>:<port>/rest/rlimit/get?<parameters > Parameters Parameter Description resource The type of resource to get the usage limit for. Currently only the value is supported. disk cluster name The name of the cluster whose usage limit is being queried. Examples Return the disk usage limit for the cluster my.cluster.com: CLI maprcli rlimit get -resource disk -cluster my.cluster.com REST https://r1n1.sj.us:8443/rest/rlimit/get?cluster=my. cluster.com rlimit set The API sets the resource usage limit for the cluster's disk resource rlimit set Syntax CLI maprcli rlimit set -resource disk [ -cluster <cluster name> ] -value <limit> REST http[s]://<host>:<port>/rest/rlimit/set?<parameters > Parameters Parameter Description resource The type of resource to set the usage limit for. Currently only the value is supported. disk cluster name The name of the cluster whose usage limit is being set. limit The value of the limit being set. You can express the value as KB, MB, GB, or TB. 
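As a hedged sketch complementing the examples that follow (the cluster name and limit value are hypothetical), a typical workflow is to check the current disk usage limit before raising it:

# Query the current disk usage limit for the cluster.
maprcli rlimit get -resource disk -cluster my.cluster.com

# Raise the limit; the value can be expressed in KB, MB, GB, or TB.
maprcli rlimit set -resource disk -cluster my.cluster.com -value 100TB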
Examples Set the disk usage limit for the cluster my.cluster.com to 80TB: CLI maprcli rlimit set -resource disk -cluster my.cluster.com -value 80TB REST https://r1n1.sj.us:8443/rest/rlimit/get?resource=di sk&cluster=my.cluster.com&value=80TB schedule The schedule commands let you work with schedules: schedule create creates a schedule schedule list lists schedules schedule modify modifies the name or rules of a schedule by ID schedule remove removes a schedule by ID A schedule is a JSON object that specifies a single or recurring time for volume snapshot creation or mirror syncing. For a schedule to be useful, it must be associated with at least one volume. See and . volume create volume modify Schedule Fields The schedule object contains the following fields: Field Value id The ID of the schedule. name The name of the schedule. inuse Indicates whether the schedule is associated with an action. rules An array of JSON objects specifying how often the scheduled action occurs. See below. Rule Fields Rule Fields The following table shows the fields to use when creating a rules object. Field Values frequency How often to perform the action: once - Once yearly - Yearly monthly - Monthly weekly - Weekly daily - Daily hourly - Hourly semihourly - Every 30 minutes quarterhourly - Every 15 minutes fiveminutes - Every 5 minutes minute - Every minute retain How long to retain the data resulting from the action. For example, if the schedule creates a snapshot, the retain field sets the snapshot's expiration. The retain field consists of an integer and one of the following units of time: mi - minutes h - hours d - days w - weeks m - months y - years time The time of day to perform the action, in 24-hour format: HH date The date on which to perform the action: For single occurrences, specify month, day and year: MM/DD/YYYY For yearly occurrences, specify the month and day: MM/DD For monthly occurrences occurrences, specify the day: DD Daily and hourly occurrences do not require the date field. Example The following example JSON shows a schedule called "snapshot," with three rules. { "id":8, "name":"snapshot", "inuse":0, "rules":[ { "frequency":"monthly", "date":"8", "time":14, "retain":"1m" }, { "frequency":"weekly", "date":"sat", "time":14, "retain":"2w" }, { "frequency":"hourly", "retain":"1d" } ] } schedule create Creates a schedule. Permissions required: or fc a A schedule can be associated with a volume to automate mirror syncing and snapshot creation. See and . volume create volume modify Syntax CLI maprcli schedule create [ -cluster <cluster> ] -schedule <JSON> REST http[s]://<host>:<port>/rest/schedule/create?<param eters> Parameters Parameter Description cluster The cluster on which to run the command. schedule A JSON object describing the schedule. See  for Schedule Objects more information. 
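As a hedged complement to the examples that follow (the hostname, credentials, and schedule contents are hypothetical), the same JSON schedule object described above can also be submitted to the REST endpoint, for instance with curl, which takes care of URL-encoding the JSON:

# Create a weekly snapshot schedule through the REST interface;
# --data-urlencode handles the JSON encoding and -u supplies credentials (assumed).
curl -k -u mapr:mapr -G "https://r1n1.sj.us:8443/rest/schedule/create" \
  --data-urlencode 'schedule={"name":"weekly-snap","rules":[{"frequency":"weekly","date":"sat","time":2,"retain":"2w"}]}'

# Confirm that the new schedule is listed.
maprcli schedule list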
Examples Scheduling a Single Occurrence CLI maprcli schedule create -schedule '{"name":"Schedule-1","rules":[{"frequency":"once", "retain":"1w","time":13,"date":"12/5/2010"}]}' REST https://r1n1.sj.us:8443/rest/schedule/create?schedu le={"name":"Schedule-1","rules":[{"frequency":"once ","retain":"1w","time":13,"date":"12/5/2010"}]} A Schedule with Several Rules CLI maprcli schedule create -schedule '{"name":"Schedule-2","rules":[{"frequency":"weekly ","date":"sun","time":7,"retain":"2w"},{"frequency" :"daily","time":14,"retain":"1w"},{"frequency":"hou rly","retain":"1w"},{"frequency":"yearly","date":"1 1/5","time":14,"retain":"1w"}]}' REST https://r1n1.sj.us:8443/rest/schedule/create?schedu le={"name":"Schedule-1","rules":[{"frequency":"week ly","date":"sun","time":7,"retain":"2w"},{"frequenc y":"daily","time":14,"retain":"1w"},{"frequency":"h ourly","retain":"1w"},{"frequency":"yearly","date": "11/5","time":14,"retain":"1w"}]} schedule list Lists the schedules on the cluster. Syntax CLI maprcli schedule list [ -cluster <cluster> ] [ -output terse|verbose ] REST http[s]://<host>:<port>/rest/schedule/list?<paramet ers> Parameters Parameter Description cluster The cluster on which to run the command. 1. 2. 3. output Specifies whether the output should be terse or verbose. Output A list of all schedules on the cluster. See for more information. Schedule Objects Examples List schedules: CLI maprcli schedule list REST https://r1n1.sj.us:8443/rest/schedule/list schedule modify Modifies an existing schedule, specified by ID. Permissions required: or fc a To find a schedule's ID: Use the command to list the schedules. schedule list Select the schedule to modify Pass the selected schedule's ID in the -id parameter to the command. schedule modify Syntax CLI maprcli schedule modify [ -cluster <cluster> ] -id <schedule ID> [ -name <schedule name ] [ -rules <JSON>] REST http[s]://<host>:<port>/rest/schedule/modify?<param eters> Parameters Parameter Description cluster The cluster on which to run the command. id The ID of the schedule to modify. name The new name of the schedule. rules A JSON object describing the rules for the schedule. If specified, replaces the entire existing rules object in the schedule. For information about the fields to use in the JSON object, see Rule . Fields Examples Modify a schedule CLI maprcli schedule modify -id 0 -name Newname -rules '[{"frequency":"weekly","date":"sun","time":7,"reta in":"2w"},{"frequency":"daily","time":14,"retain":" 1w"}]' REST https://r1n1.sj.us:8443/rest/schedule/modify?id=0&n ame=Newname&rules=[{"frequency":"weekly","date":"su n","time":7,"retain":"2w"},{"frequency":"daily","ti me":14,"retain":"1w"}] schedule remove Removes a schedule. A schedule can only be removed if it is not associated with any volumes. See . volume modify Syntax CLI maprcli schedule remove [ -cluster <cluster> ] -id <schedule ID> REST http[s]://<host>:<port>/rest/schedule/remove?<param eters> Parameters Parameter Description cluster The cluster on which to run the command. id The ID of the schedule to remove. Examples Remove schedule with ID 0: CLI maprcli schedule remove -id 0 REST https://r1n1.sj.us:8443/rest/schedule/remove?id=0 service list Lists all services on the specified node, along with the state and log path for each service. Syntax CLI maprcli service list -node <node name> REST http[s]://<host>:<port>/rest/service/list?<paramete rs> Parameters Parameter Description node The node on which to list the services Output Information about services on the specified node. 
For each service, the status is reported numerically: 0 - NOT_CONFIGURED: the package for the service is not installed and/or the service is not configured ( has not run) configure.sh 2 - RUNNING: the service is installed, has been started by the warden, and is currently executing 3 - STOPPED: the service is installed and has run, but the service is currently not executing configure.sh 5 - STAND_BY: the service is installed and is in standby mode, waiting to take over in case of failure of another instance (mainly used for JobTracker warm standby) setloglevel The setloglevel commands set the log level on individual services: setloglevel cldb - Sets the log level for the CLDB. setloglevel hbmaster - Sets the log level for the HB Master. setloglevel hbregionserver - Sets the log level for the HBase RegionServer. setloglevel jobtracker - Sets the log level for the JobTracker. setloglevel fileserver - Sets the log level for the FileServer. setloglevel nfs - Sets the log level for the NFS. setloglevel tasktracker - Sets the log level for the TaskTracker. setloglevel cldb Sets the log level on the CLDB service. Permissions required: or fc a Syntax CLI maprcli setloglevel cldb -classname <class> -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN -node <node> -port <port> REST http[s]://<host>:<port>/rest/setloglevel/cldb?<para meters> Parameters Parameter Description classname The name of the class for which to set the log level. loglevel The log level to set: DEBUG ERROR FATAL INFO TRACE WARN node The node on which to set the log level. port The CLDB port setloglevel fileserver Sets the log level on the FileServer service. Permissions required: or fc a Syntax CLI maprcli setloglevel fileserver -classname <class> -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN -node <node> -port <port> REST http[s]://<host>:<port>/rest/setloglevel/fileserver ?<parameters> Parameters Parameter Description classname The name of the class for which to set the log level. loglevel The log level to set: DEBUG ERROR FATAL INFO TRACE WARN node The node on which to set the log level. port The MapR-FS port setloglevel hbmaster Sets the log level on the HBase Master service. Permissions required: or fc a Syntax CLI maprcli setloglevel hbmaster -classname <class> -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN -node <node> -port <port> REST http[s]://<host>:<port>/rest/setloglevel/hbmaster?< parameters> Parameters Parameter Description classname The name of the class for which to set the log level. loglevel The log level to set: DEBUG ERROR FATAL INFO TRACE WARN node The node on which to set the log level. port The HBase Master webserver port setloglevel hbregionserver Sets the log level on the HBase RegionServer service. Permissions required: or fc a Syntax CLI maprcli setloglevel hbregionserver -classname <class> -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN -node <node> -port <port> REST http[s]://<host>:<port>/rest/setloglevel/hbregionse rver?<parameters> Parameters Parameter Description classname The name of the class for which to set the log level. loglevel The log level to set: DEBUG ERROR FATAL INFO TRACE WARN node The node on which to set the log level. port The Hbase Region Server webserver port setloglevel jobtracker Sets the log level on the JobTracker service. 
Permissions required: or fc a Syntax CLI maprcli setloglevel jobtracker -classname <class> -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN -node <node> -port <port> REST http[s]://<host>:<port>/rest/setloglevel/jobtracker ?<parameters> Parameters Parameter Description classname The name of the class for which to set the log level. loglevel The log level to set: DEBUG ERROR FATAL INFO TRACE WARN node The node on which to set the log level. port The JobTracker webserver port setloglevel nfs Sets the log level on the NFS service. Permissions required: or fc a Syntax CLI maprcli setloglevel nfs -classname <class> -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN -node <node> -port <port> REST http[s]://<host>:<port>/rest/setloglevel/nfs?<param eters> Parameters Parameter Description classname The name of the class for which to set the log level. loglevel The log level to set: DEBUG ERROR FATAL INFO TRACE WARN node The node on which to set the log level. port The NFS port setloglevel tasktracker Sets the log level on the TaskTracker service. Permissions required: or fc a Syntax CLI maprcli setloglevel tasktracker -classname <class> -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN -node <node> -port <port> REST http[s]://<host>:<port>/rest/setloglevel/tasktracke r?<parameters> Parameters Parameter Description classname The name of the class for which to set the log level. loglevel The log level to set: DEBUG ERROR FATAL INFO TRACE WARN node The node on which to set the log level. port The TaskTracker port table The commands perform functions related to MapR tables: table table attr table cf table create table delete table listrecent table region table attr The commands enable you to list and edit attributes for MapR tables. table attr table attr list - lists the attributes of an existing MapR table table attr edit - edits the attributes of an existing MapR table Table Attributes Name Field Value autoSplit Boolean. Defines whether or not this table is automatically split. The default value is . A value of indicates this table will not true false be automatically split. table attr edit Edits the attributes of a specified MapR table. Syntax CLI maprcli table attr edit -path <path> -attrname <name> -attrvalue <value> REST http[s]://<host>:<port>/rest/table/attr/edit?path=< path>&attrname=<name>&attrvalue=<value> Parameters Parameter Description path Path to the table. attrname The attribute to edit. attrvalue The new value for the attribute. Table Attributes Attribute Name Attribute Value autoSplit Boolean. Defines whether or not this table is automatically split. The default value is . A value of indicates this table will not true false be automatically split. Examples Editing a table's attributes This example changes the value of the attribute for the table from to . autoSplit mytable01 true false CLI maprcli table attr edit -path /my.cluster.com/user/user01/mytable01 -attrname autoSplit -attrvalue false REST https://r1n1.sj.us:8443/rest/table/attr/list?path=% 2Fmy.cluster.com%2Fuser%2Fuser01%2Fmytable01&attrna me=autoSplit&attrvalue=false table attr list Lists the attributes of a specified MapR table. Syntax CLI maprcli table attr list -path <path> REST http[s]://<host>:<port>/rest/table/attr/list?path=< path>&<parameters> Parameters Parameter Description path Path to the table. Output fields Name Field Value autoSplit Boolean. Defines whether or not this table is automatically split. The default value is . A value of indicates this table will not true false be automatically split. 
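As a hedged sketch complementing the example below (the table path is hypothetical), a common sequence is to create a table, disable automatic splitting, and then verify the attribute with table attr list:

# Create a new table (the table create command is described later in this reference).
maprcli table create -path /my.cluster.com/volume1/demotable

# Tables default to autoSplit=true; turn automatic splitting off for this table.
maprcli table attr edit -path /my.cluster.com/volume1/demotable -attrname autoSplit -attrvalue false

# Confirm the change; autoSplit should now report false.
maprcli table attr list -path /my.cluster.com/volume1/demotable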
Examples Listing a table's attributes This example lists the attribute information for the table . mytable01 CLI maprcli table attr list -path /my.cluster.com/user/user01/mytable01 REST https://r1n1.sj.us:8443/rest/table/attr/list?path=% 2Fmy.cluster.com%2Fuser%2Fuser01%2Fmytable01 Example Output [user@node]# maprcli table attr list -path /mapr/my.cluster.com/user/user01/mytable01 name value autoSplit true table cf The commands deal with creating and managing column families for MapR tables. table cf table cf create - creates a new column family table cf edit - edits the properties of an existing column family table cf delete - deletes a column family table cf list - lists information about column families table cf create Creates a column family for a MapR table. Syntax CLI maprcli table cf create -path <path> -cfname <name> [ -compression off|lzf|lz4|zlib ] [ -minversions <integer> ] [ -maxversions <integer> ] [ -ttl <value> ] [ -inmemory true|false ] REST http[s]://<host>:<port>/rest/table/cf/create?path=< path>&cfname=<name>&<parameters> Parameters Parameter Description path Path to the MapR table. cfname The name of the new column family. compression The compression setting to use for the column family. Valid options are , , , and . The default setting is equal to the off lzf lz4 zlib table's compression setting. minversions Minimum number of versions to keep. The default is zero. maxversions Maximum number of versions to keep. The default is three. ttl Time to live. The default value is . When the age of the data forever in this column family exceeds the value of the parameter, the ttl data is purged from the column family. inmemory Boolean. Whether or not to keep this column family in memory. The default value is . false Examples Creating a new column family for a table, keeping four versions in memory CLI maprcli table cf create -path /volume1/mytable -cfname mynewcf -maxversions 4 -inmemory true REST https://r1n1.sj.us:8443/rest/table/cf/create?path=% 2Fvolume1%2Fmytable&cfname=mynewcf&maxversions=4&in memory=true table cf delete Delete a column family from a MapR table, removing all records in the column family. Deletion cannot be undone. Syntax CLI maprcli table cf delete -path <path> -cfname <name> REST http[s]://<host>:<port>/rest/table/cf/delete?path=< path>&cfname=<name> Parameters Parameter Description path Path to the MapR table. cfname The name of the column family to delete. Examples Deleting a column family CLI maprcli table cf delete -path /volume1/thetable -cfname mycf REST https://r1n1.sj.us:8443/rest/table/cf/edit?path=%2F volume1%2Fthetable&cfname=mycf table cf edit Edits a column family definition. You can alter a column family's name, minimum and maximum versions, time to live, compression, and memory residence status. Syntax CLI maprcli table cf edit -path <path> -cfname <name> [ -newcfname <name> ] [ -minversions <integer> ] [ -maxversions <integer> ] [ -ttl <value> ] [ -inmemory true|false ] [ -compression off|lzf|lz4|zlib ] REST http[s]://<host>:<port>/rest/table/cf/create?path=< path>&cfname=<name>&<parameters> Parameters Parameter Description path Path to the table. cfname The name of the column family to edit. newcfname The new name of the column family. minversions Minimum number of versions to keep. The default is zero. maxversions Maximum number of versions to keep. The default is three. ttl Time to live. The default value is . forever inmemory Boolean. Whether or not to keep this column family in memory. The default value is . 
false compression The compression setting to use for the column family. Valid options are , , , and . The default setting is equal to the off lzf lz4 zlib table's compression setting. Examples Changing a column family's name and time to live CLI maprcli table cf edit -path /my.cluster.com/volume1/newtable -cfname mynewcf -newcfname mynewcfname -ttl 3 REST https://r1n1.sj.us:8443/rest/table/cf/edit?path=%2F my.cluster.com%2Fvolume1%2Fnewtable&cfname=mynewcf& newcfname=mynewcfname&ttl=3 table cf list Lists a MapR table's column families. Syntax CLI maprcli table cf list -path <path> [ -cfname <name> ] [ -output verbose|terse ] REST http[s]://<host>:<port>/rest/table/cf/list?path=<pa th>&<parameters> Parameters Parameter Description path Path to the table. cfname The name of the column family to edit. output Valid options are or . Verbose output lists full names verbose terse for column headers. Terse output lists abbreviated column header names. The default value is . verbose Output fields Verbose Field Name Terse Field Name Field Value inmemory inmem Whether or not this column value resides in memory cfname n The column family name maxversions vmax Maximum number of versions for this column family minversions vmin Minimum number of versions for this column family compression comp Compression scheme used for this column family ttl ttl Time to live for this column family Examples Tersely listing the column families for a table This example lists all column families for the table . newtable CLI maprcli table cf list -path /my.cluster.com/volume1/newtable -output terse REST https://r1n1.sj.us:8443/rest/table/cf/list?path=%2F my.cluster.com%2Fvolume1%2Fnewtable&output=terse Example Output [user@node]# maprcli table cf list -path /mapr/default/user/user/newtable -output terse comp inmem vmax n ttl vmin lz4 false 3 dine 2147483647 0 lz4 false 3 nahashchid 2147483647 0 lz4 false 3 wollachee 2147483647 0 table create Creates a new MapR table. Syntax CLI maprcli table create -path <path> REST http[s]://<host>:<port>/rest/table/create?path=<pat h> Parameters Parameter Description path Path to the new MapR table. Examples Creating a new MapR table CLI maprcli table create -path /my.cluster.com/volume1/newtable REST https://r1n1.sj.us:8443/rest/table/create?path=%2Fm y.cluster.com%2Fvolume1%2Fnewtable table delete Deletes a MapR table. Syntax CLI maprcli table delete -path <path> REST http[s]://<host>:<port>/rest/table/create?path=<pat h> Parameters Parameter Description path Path to the MapR table to delete. Examples Deleting a table CLI maprcli table delete -path /my.cluster.com/volume1/table REST https://r1n1.sj.us:8443/rest/table/delete?path=%2Fm y.cluster.com%2Fvolume1%2Ftable table listrecent MapR keeps track of the 50 most recently-accessed tables by each user. When the argument is specified, path maprcli table listrecent verifies if a table exists at that path. When used without , lists the user's recently-accessed tables. path maprcli table listrecent The paths to recently-accessed tables are written to a file, . Referencing a table with any maprfs:///user/<user_name>/.recent_tables of the commands or the MapR Control System will log the table path in . If a user doesn't have a home maprcli table .recent_tables directory on the cluster, throws an error. 
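Because table listrecent depends on the user having a home volume, a quick pre-check can save confusion. The sketch below is assumption-laden: the username user01, the cluster name my.cluster.com, and the NFS mount point /mapr are placeholders for your environment.

# Confirm the home directory exists; listrecent fails without one
hadoop fs -ls maprfs:///user/user01
# If the cluster is mounted over NFS, the tracking file can be inspected directly
cat /mapr/my.cluster.com/user/user01/.recent_tables
# List this user's recently accessed tables
maprcli table listrecent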
maprcli table listrecent Syntax CLI maprcli table listrecent [ -path <path> ] [ -output verbose|terse ] REST http[s]://<host>:<port>/rest/table/listrecent?<para meters> Parameters Parameter Description path Specifies a path to verify if a table exists at that path output Valid options are or . Verbose output lists full names verbose terse for column headers. Terse output lists abbreviated column header names. The default value is . verbose Output fields Verbose Field Name Terse Field Name Field Value path p Path to the table Examples Listing recently-accessed tables CLI maprcli table listrecent REST https://r1n1.sj.us:8443/rest/table/listrecent Listing tables verbosely CLI maprcli table listrecent -path /my.cluster.com/volume1/ -output verbose REST https://r1n1.sj.us:8443/rest/table/listrecent?path= %2Fmy.cluster.com%2Fvolume1&output=verbose table region The command lists the regions associated with a specified MapR table. table region list table region list Lists the regions that make up a specified table. Syntax CLI maprcli table region list -path <path> [ -output verbose|terse ] [ -start offset ] [ -limit number ] REST http[s]://<host>:<port>/rest/table/region/list?path =<path>&<parameters> Parameters Parameter Description path Path to the table. output Valid options are or . Verbose output lists full names verbose terse for column headers. Terse output lists abbreviated column header names. The default value is . verbose start The offset from the starting region. The default value is 0. limit The number of regions to return, counting from the starting region. The default value is 2147483647. Output fields Verbose Field Name Terse Field Name Field Value primarynode pn Host name of the primary node for this region secondarynodes sn Host names of the secondary nodes where this region is replicated startkey sk Value of the start key for this region endkey ek Value of the end key for this region lastheartbeat lhb Time since last heartbeat from the region's primary node puts puts Number of puts on the region's primary node (which may include puts due to other regions residing on the same node) over the last 10 seconds, 1 minute, 5 minutes and 15 minutes. gets gets Number of gets on the region's primary node (which may include gets due to other regions residing on the same node) over the last 10 seconds, 1 minute, 5 minutes and 15 minutes. scans scans Number of scans on the region's primary node (which may include scans due to other regions residing on the same node) over the last 10 seconds, 1 minute, 5 minutes and 15 minutes. Examples Tersely listing the region information for a table This example lists the region information for the table . newtable CLI maprcli table region list -path /my.cluster.com/volume1/newtable -output terse REST https://r1n1.sj.us:8443/rest/table/region/list?path =%2Fmy.cluster.com%2Fvolume1%2Fnewtable&output=ters e Example Output [user@node]# maprcli table region list -path /mapr/default/user/user/newtable -output terse sk sn ek pn lhb -INFINITY perfnode54.perf.lab, perfnode52.perf.lab INFINITY perfnode51.perf.lab 0 task The commands enable you to manipulate information about the Hadoop jobs that are running on your cluster: task task killattempt - Kills a specific task attempt. task failattempt - Ends a specific task attempt as failed. task table - Retrieves detailed information about the task attempts associated with a job running on the cluster. task failattempt The API ends the specified task attempt as failed. 
task failattempt Syntax CLI maprcli task failattempt [ -cluster cluster name ] -taskattemptid task attempt ID REST http[s]://<host>:<port>/rest/task/failattempt?[clus ter=cluster_name&]taskattemptid=task_attempt_ID Parameters Parameter Description cluster Cluster name taskattemptid Task attempt ID Examples Ending a Task Attempt as Failed CLI maprcli task failattempt -taskattemptid attempt_201187941846_1077_300_7707 REST https://r1n1.sj.us:8443/rest/task/failattempt?taska ttemptid=attempt_201187941846_1077_300_7707 task killattempt The API kills the specified task attempt. task killattempt Syntax CLI maprcli task killattempt [ -cluster cluster name ] -taskattemptid task attempt ID REST http[s]://<host>:<port>/rest/task/killattempt?[clus ter=cluster_name&]taskattemptid=task_attempt_ID Parameters Parameter Description cluster Cluster name taskattemptid Task attempt ID Examples Killing a Task Attempt CLI maprcli task killattempt -taskattemptid attempt_201187941846_1077_300_7707 REST https://r1n1.sj.us:8443/rest/task/killattempt?taska ttemptid=attempt_201187941846_1077_300_7707 task table Retrieves histograms and line charts for task metrics. Use the API to retrieve data for your job. The metrics data can be formatted for histogram display or line chart task table task analytics display. Syntax REST http[s]://<host>:<port>/rest/task/table?output=ters e&filter=string&chart=chart_type&columns=list_of_co lumns&scale=scale_type<parameters> Parameters Parameter Description filter Filters results to match the value of a specified string. chart Chart type to use: for a line chart, for a histogram. line bar columns Comma-separated list of column to return. names bincount Number of histogram bins. scale Scale to use for the histogram. Specify for a linear scale and linear for a logarithmic scale. log Column Names The following table lists the terse short names for particular metrics regarding task attempts. Parameter Description tacir Combine Task Attempt Input Records tacor Combine Task Attempt Output Records tamib Map Task Attempt Input Bytes tamir Map Task Attempt Input Records tamob Map Task Attempt Output Bytes tamor Map Task Attempt Output Records tamsr Map Task Attempt Skipped Records tarig Reduce Task Attempt Input Groups tarir Reduce Task Attempt Input Records taror Reduce Task Attempt Output Records tarsb Reduce Task Attempt Shuffle Bytes tarsr Reduce Task Attempt Skipped Records tacput Task Attempt CPU Time talbr Task Attempt Local Bytes Read talbw Task Attempt Local Bytes Written tambr Task Attempt MapR-FS Bytes Read tambw Task Attempt MapR-FS Bytes Written tapmem Task Attempt Physical Memory Bytes taspr Task Attempt Spilled Records tavmem Task Attempt Virtual Memory Bytes tad Task Attempt Duration (histogram only) tagct Task Attempt Garbage Collection Time (histogram only) td Task Duration (histogram only) taid Task Attempt ID (filter only) tat Task Attempt Type (filter only) tas Task Attempt Status (filter only) tapro Task Attempt Progress (filter only) tast Task Attempt Start Time (filter only) taft Task Attempt Finish Time (filter only) tashe Task Attempt Shuffle End tase Task Attempt Sort End tah Task Attempt Host Location talog Location of logs, , and for this task attempt. stderr stdout tadi Freeform information about this task attempt used for diagnosing behaviors. 
tamor Map Task Attempt Output Records tarsg Reduce Task Attempt Skipped Groups (filter only) tasrb Reduce Task Attempt Shuffle Bytes tamirps Map Task Attempt Input Records per Second tarirps Reduce Task Attempt Input Records per Second tamorps Map Task Attempt Output Records per Second tarorps Reduce Task Attempt Output Records per Second tamibps Map Task Attempt Input Bytes per Second tamobps Map Output Bytes per Second tarsbps Reduce Task Attempt Shuffle Bytes per Second ts Task Status (filter only) tid Task Duration tt Task Type (filter only) tsta Primary Task Attempt ID (filter only) tst Task Start Time (filter only) tft Task End Time (filter only) th Task Host Location (filter only) thl Task Host Locality (filter only) Example Retrieve a Task Histogram: REST https://r1n1.sj.us:8443/rest/task/table?chart=bar&f ilter=%5Btt!=JOB_SETUP%5Dand%5Btt!=JOB_CLEANUP%5Dan d%5Bjid==job_201129649560_3390%5D&columns=td&bincou nt=28&scale=log CURL curl -d @json https://r1n1.sj.us:8443/api/task/table In the example above, the file contains a URL-encoded version of the information in the section below. curl json Request Request GENERAL_PARAMS: { [chart: "bar"|"line"], columns: <comma-separated list of column terse names>, [filter: "[<terse_field>{operator}<value>]and[...]",] [output: terse,] [start: int,] [limit: int] } REQUEST_PARAMS_HISTOGRAM: { chart:bar columns:td filter: <anything> } REQUEST_PARAMS_LINE: { chart:line, columns:tapmem, filter: NOT PARSED, UNUSED IN BACKEND } REQUEST_PARAMS_GRID: { columns:tid,tt,tsta,tst,tft filter:<any real filter expression> output:terse, start:0, limit:50 } Response RESPONSE_SUCCESS_HISTOGRAM: { "status" : "OK", "total" : 15, "columns" : ["td"], "binlabels" : ["0-5s","5-10s","10-30s","30-60s","60-90s","90s-2m","2m-5m","5m-10m","10m-30m","30m-1h ","1h-2h","2h-6h","6h-12h","12h-24h",">24h"], "binranges" : [ [0,5000], [5000,10000], [10000,30000], [30000,60000], [60000,90000], [90000,120000], [120000,300000], [300000,600000], [600000,1800000], [1800000,3600000], [3600000,7200000], [7200000,21600000], [21600000,43200000], [43200000,86400000], [86400000] ], "data" : [33,919,1,133,9820,972,39,2,44,80,11,93,31,0,0] } RESPONSE_SUCCESS_GRID: { "status": "OK", "total" : 67, "columns" : ["ts","tid","tt","tsta","tst","tft","td","th","thl"], "data" : [ ["FAILED","task_201204837529_1284_9497_4858","REDUCE","attempt_201204837529_1284_9497_ 4858_3680", 1301066803229,1322663797292,21596994063,"newyork-rack00-8","remote"], ["PENDING","task_201204837529_1284_9497_4858","MAP","attempt_201204837529_1284_9497_48 58_8178", 1334918721349,1341383566992,6464845643,"newyork-rack00-7","unknown"], ["RUNNING","task_201204837529_1284_9497_4858","JOB_CLEANUP","attempt_201204837529_1284 _9497_4858_1954", 1335088225728,1335489232319,401006591,"newyork-rack00-8","local"], ]} RESPONSE_SUCCESS_LINE: { "status" : "OK", "total" : 22, "columns" : ["tapmem"], "data" : [ [1329891055016,0], [1329891060016,8], [1329891065016,16], [1329891070016,1024], [1329891075016,2310], [1329891080016,3243], [1329891085016,4345], [1329891090016,7345], [1329891095016,7657], [1329891100016,8758], [1329891105016,9466], [1329891110016,10345], [1329891115016,235030], [1329891120016,235897], [1329891125016,287290], [1329891130016,298390], [1329891135016,301355], [1329891140016,302984], [1329891145016,303985], [1329891150016,304403], [1329891155016,503030], [1329891160016,983038] ] } trace The trace commands let you view and modify the trace buffer, and the trace levels for the system modules. 
The valid trace levels are: DEBUG INFO ERROR WARN FATAL The following pages provide information about the trace commands: trace dump trace info trace print trace reset trace resize trace setlevel trace setmode trace dump Dumps the contents of the trace buffer into the MapR-FS log. Syntax CLI maprcli trace dump [ -host <host> ] [ -port <port> ] REST None. Parameters Parameter Description host The IP address of the node from which to dump the trace buffer. Default: localhost port The port to use when dumping the trace buffer. Default: 5660 Examples Dump the trace buffer to the MapR-FS log: CLI maprcli trace dump trace info Displays the trace level of each module. Syntax CLI maprcli trace info [ -host <host> ] [ -port <port> ] REST None. Parameters Parameter Description host The IP address of the node on which to display the trace level of each module. Default: localhost port The port to use. Default: 5660 Output A list of all modules and their trace levels. Sample Output RPC Client Initialize **Trace is in DEFAULT mode. **Allowed Trace Levels are: FATAL ERROR WARN INFO DEBUG **Trace buffer size: 2097152 **Modules and levels: Global : INFO RPC : ERROR MessageQueue : ERROR CacheMgr : INFO IOMgr : INFO Transaction : ERROR Log : INFO Cleaner : ERROR Allocator : ERROR BTreeMgr : ERROR BTree : ERROR BTreeDelete : ERROR BTreeOwnership : INFO MapServerFile : ERROR MapServerDir : INFO Container : INFO Snapshot : INFO Util : ERROR Replication : INFO PunchHole : ERROR KvStore : ERROR Truncate : ERROR Orphanage : INFO FileServer : INFO Defer : ERROR ServerCommand : INFO NFSD : INFO Cidcache : ERROR Client : ERROR Fidcache : ERROR Fidmap : ERROR Inode : ERROR JniCommon : ERROR Shmem : ERROR Table : ERROR Fctest : ERROR DONE Examples Display trace info: CLI maprcli trace info trace print Manually dumps the trace buffer to stdout. Syntax CLI maprcli trace print [ -host <host> ] [ -port <port> ] -size <size> REST None. Parameters Parameter Description host The IP address of the node from which to dump the trace buffer to stdout. Default: localhost port The port to use. Default: 5660 size The number of kilobytes of the trace buffer to print. Maximum: 64 Output The most recent bytes of the trace buffer. <size> ----------------------------------------------------- 2010-10-04 13:59:31,0000 Program: mfs on Host: fakehost IP: 0.0.0.0, Port: 0, PID: 0 ----------------------------------------------------- DONE Examples Display the trace buffer: CLI maprcli trace print trace reset Resets the in-memory trace buffer. Syntax CLI maprcli trace reset [ -host <host> ] [ -port <port> ] REST None. Parameters Parameter Description host The IP address of the node on which to reset the trace parameters. Default: localhost port The port to use. Default: 5660 Examples Reset trace parameters: CLI maprcli trace reset trace resize Resizes the trace buffer. Syntax CLI maprcli trace resize [ -host <host> ] [ -port <port> ] -size <size> REST None. Parameters Parameter Description host The IP address of the node on which to resize the trace buffer. Default: localhost port The port to use. Default: 5660 size The size of the trace buffer, in kilobytes. Default: Minimum: 2097152 1 Examples Resize the trace buffer to 1000 CLI maprcli trace resize -size 1000 trace setlevel Sets the trace level on one or more modules. Syntax CLI maprcli trace setlevel [ -host <host> ] -level <trace level> -module <module name> [ -port <port> ] REST None. Parameters Parameter Description host The node on which to set the trace level. 
Default: localhost module The module on which to set the trace level. If set to , sets the all trace level on all modules. level The new trace level. If set to , sets the trace level to its default default. port The port to use. Default: 5660 Examples Set the trace level of the log module to INFO: CLI maprcli trace setlevel -module log -level info Set the trace levels of all modules to their defaults: CLI maprcli trace setlevel -module all -level default trace setmode Sets the trace mode. There are two modes: Default Continuous In default mode, all trace messages are saved in a memory buffer. If there is an error, the buffer is dumped to stdout. In continuous mode, every allowed trace message is dumped to stdout in real time. Syntax CLI maprcli trace setmode [ -host <host> ] -mode default|continuous [ -port <port> ] REST None. Parameters Parameter Description host The IP address of the host on which to set the trace mode mode The trace mode. port The port to use. Examples Set the trace mode to continuous: CLI maprcli trace setmode -mode continuous urls The urls command displays the status page URL for the specified service. Syntax CLI maprcli urls [ -cluster <cluster> ] -name <service name> [ -zkconnect <zookeeper connect string> ] REST http[s]://<host>:<port>/rest/urls/<name> Parameters Parameter Description cluster The name of the cluster on which to save the configuration. name The name of the service for which to get the status page: cldb jobtracker tasktracker zkconnect ZooKeeper Connect String Examples Display the URL of the status page for the CLDB service: CLI maprcli urls -name cldb REST https://r1n1.sj.us:8443/rest/maprcli/urls/cldb userconfig The command displays information about the current user. userconfig load userconfig load Loads the configuration for the specified user. Syntax CLI maprcli userconfig load -username <username> REST http[s]://<host>:<port>/rest/userconfig/load?<param eters> Parameters Parameter Description username The username for which to load the configuration. Output The configuration for the specified user. Sample Output username fsadmin mradmin root 1 1 Output Fields Field Description username The username for the specified user. email The email address for the user. fsadmin Indicates whether the user is a MapR-FS Administrator: 0 = no 1 = yes mradmin Indicates whether the user is a MapReduce Administrator: 0 = no 1 = yes helpUrl URL pattern for locating help files on the server. Example: http://www.mapr.com/doc/display/MapR-<version>/<pag e>#<topic> helpVersion Version of the help content corresponding to this build of MapR. Note that this is different from the build version. Examples View the root user's configuration: CLI maprcli userconfig load -username root REST https://r1n1.sj.us:8443/rest/userconfig/load?userna me=root virtualip The virtualip commands let you work with virtual IP addresses for NFS nodes: virtualip add adds a range of virtual IP addresses virtualip edit edits a range of virtual IP addresses virtualip list lists virtual IP addresses virtualip move reassigns a range of virtual IP addresses to a MAC virtualip remove removes a range of virtual IP addresses Virtual IP Fields Field Description macaddress The MAC address of the virtual IP. netmask The netmask of the virtual IP. virtualipend The virtual IP range end. virtualip add Adds a virtual IP address. 
Permissions required: or fc a Syntax CLI maprcli virtualip add [ -cluster <cluster> ] [ -gateway <gateway> ] [ -macs <MAC address> ] -netmask <netmask> -virtualip <virtualip> [ -virtualipend <virtual IP range end> ] [ -preferredmac <MAC address> ] REST http[s]://<host>:<port>/rest/virtualip/add?<paramet ers> Parameters Parameter Description cluster The cluster on which to run the command. gateway The NFS gateway IP or address macs A list of the MAC addresses that represent the NICs on the nodes that the VIPs in the VIP range can be associated with. Use this list to limit VIP assignment to NICs on a particular subnet when your NFS server is part of multiple subnets. netmask The netmask of the virtual IP. virtualip The virtual IP, or the start of the virtual IP range. virtualipend The end of the virtual IP range. preferredmac The preferred MAC for this virtual IP. When an NFS server restarts, the MapR system attempts to move all of the virtual IP addresses that list a MAC address on this node as a preferred MAC to this node. If the new value is null, this parameter resets the preferred MAC value. virtualip edit Edits a virtual IP (VIP) range. Permissions required: or fc a Syntax CLI maprcli virtualip edit [ -cluster <cluster> ] [ -macs <MAC addresses> ] -netmask <netmask> -virtualip <virtualip> [ -virtualipend <virtual IP range end> ] [ -preferredmac <MAC address> ] REST http[s]://<host>:<port>/rest/virtualip/edit?<parame ters> Parameters Parameter Description cluster The cluster on which to run the command. macs A list of the MAC addresses that represent the NICs on the nodes that the VIPs in the VIP range can be associated with. Use this list to limit VIP assignment to NICs on a particular subnet when your NFS server is part of multiple subnets. netmask The netmask of the virtual IP. virtualip The virtual IP, or the start of the virtual IP range. virtualipend The end of the virtual IP range. preferredmac The preferred MAC for this virtual IP. When an NFS server restarts, the MapR system attempts to move all of the virtual IP addresses that list a MAC address on this node as a preferred MAC to this node. If the new value is null, this parameter resets the preferred MAC value. virtualip list Lists the virtual IP addresses in the cluster. Syntax CLI maprcli virtualip list [ -cluster <cluster> ] [ -columns <columns> ] [ -filter <filter> ] [ -limit <limit> ] [ -nfsmacs <NFS macs> ] [ -output <output> ] [ -range <range> ] [ -start <start> ] REST http[s]://<host>:<port>/rest/virtualip/list?<parame ters> Parameters Parameter Description cluster The cluster on which to run the command. columns The columns to display. filter A filter specifying VIPs to list. See for more information. Filters limit The number of records to return. nfsmacs The MAC addresses of servers running NFS. output Whether the output should be or . terse verbose range The VIP range. start The index of the first record to return. virtualip move The API reassigns a virtual IP or a range of virtual IP addresses to a specified Media Access Control (MAC) address. virtualip move Syntax CLI maprcli virtualip move [ -cluster <cluster name> ] -virtualip <virtualip> [ -virtualipend <virtualip end range> -tomac <mac> REST http[s]://<host>:<port>/rest/virtualip/move?<parame ters> Parameters Parameter Description cluster name The name of the cluster where the virtual IP addresses are being moved. virtualip A virtual IP address. If you provide a value for , this -virtualipend virtual IP address defines the beginning of the range. 
virtualip end range A virtual IP address that defines the end of a virtual IP address range. mac The MAC address that the virtual IP addresses are being assigned. Examples Move a range of three virtual IP addresses to a MAC address for the cluster my.cluster.com: CLI maprcli virtualip move -cluster my.cluster.com -virtualip 192.168.0.8 -virtualipend 192.168.0.10 -tomac 00:FE:ED:CA:FE:99 REST https://r1n1.sj.us:8443/rest/virtualip/move?cluster =my.cluster.com&virtualip=192.168.0.8&virtualipend= 192.168.0.10&tomac=00%3AFE%3AED%3ACA%3AFE%3A99 virtualip remove Removes a virtual IP (VIP) or a VIP range. Permissions required: or fc a Syntax CLI maprcli virtualip remove [ -cluster <cluster> ] -virtualip <virtual IP> [ -virtualipend <Virtual IP Range End> ] REST http[s]://<host>:<port>/rest/virtualip/remove?<para meters> Parameters Parameter Description cluster The cluster on which to run the command. virtualip The virtual IP or the start of the VIP range to remove. virtualipend The end of the VIP range to remove. volume The volume commands let you work with volumes, snapshots and mirrors: volume create creates a volume volume dump create creates a volume dump volume dump restore restores a volume from a volume dump volume info displays information about a volume volume link create creates a symbolic link volume link remove removes a symbolic link volume list lists volumes in the cluster volume mirror push pushes a volume's changes to its local mirrors volume mirror start starts mirroring a volume volume mirror stop stops mirroring a volume volume modify modifies a volume volume mount mounts a volume volume move moves a volume volume remove removes a volume volume rename renames a volume volume showmounts shows the mount points for a volume volume snapshot create creates a volume snapshot volume snapshot list lists volume snapshots volume snapshot preserve prevents a volume snapshot from expiring volume snapshot remove removes a volume snapshot volume unmount unmounts a volume volume create Creates a volume. Permissions required: , , or cv fc a Syntax CLI maprcli volume create -name <volume name> -type 0|1 [ -advisoryquota <advisory quota> ] [ -ae <accounting entity> ] [ -aetype <accounting entity type> ] [ -cluster <cluster> ] [ -createparent 0|1 ] [ -group <list of group:allowMask> ] [ -localvolumehost <localvolumehost> ] [ -localvolumeport <localvolumeport> ] [ -maxinodesalarmthreshold <maxinodesalarmthreshold> ] [ -minreplication <minimum replication factor> ] [ -mount 0|1 ] [ -path <mount path> ] [ -quota <quota> ] [ -readonly <read-only status> ] [ -replication <replication factor> ] [ -replicationtype <type> ] [ -rereplicationtimeoutsec <seconds> ] [ -rootdirperms <root directory permissions> ] [ -schedule <ID> ] [ -source <source> ] [ -topology <topology> ] [ -user <list of user:allowMask> ] REST http[s]://<host>:<port>/rest/volume/create?<paramet ers> Parameters Parameter Description advisoryquota The advisory quota for the volume as plus integer unit. , , , , , Example: quota=500G; Units: B K M G T P ae The accounting entity that owns the volume. aetype The type of accounting entity: 0=user 1=group cluster The cluster on which to create the volume. createparent Specifies whether or not to create a parent volume: 0 = Do not create a parent volume. 1 = Create a parent volume. group Space-separated list of pairs. group:permission localvolumehost The local volume host. localvolumeport The local volume port. Default: 5660 maxinodesalarmthreshold Threshold for the alarm. 
INODES_EXCEEDED minreplication The minimum replication level. Default: 2 When the replication factor falls below this minimum, re-replication occurs as aggressively as possible to restore the replication level. If any containers in the CLDB volume fall below the minimum replication factor, writes are disabled until aggressive re-replication restores the minimum level of replication. mount Specifies whether the volume is mounted at creation time. name The name of the volume to create. path The path at which to mount the volume. quota The quota for the volume as plus integer unit. Example: , , , , , quota=500G; Units: B K M G T P readonly Specifies whether or not the volume is read-only: 0 = Volume is read/write. 1 = Volume is read-only. replication The desired replication level. Default: 3 When the number of copies falls below the desired replication factor, but remains equal to or above the minimum replication factor, re-replication occurs after the timeout specified in the cldb.fs.mark.rereplicate.sec parameter. replicationtype The desired replication type. You can specify (star low_latency replication) or (chain replication). The default high_throughput setting is . high_throughput rereplicationtimeoutsec The re-replication timeout, in seconds. rootdirperms Permissions on the volume root directory. schedule The ID of a schedule. If a schedule ID is provided, then the volume will automatically create snapshots (normal volume) or sync with its source volume (mirror volume) on the specified schedule. Use the sc command to find the ID of the named schedule you wish to hedule list apply to the volume. source For mirror volumes, the source volume to mirror, in the format <sour (Required when creating a mirror ce volume>@<cluster> volume). topology The rack path to the volume. user Space-separated list of pairs. user:permission type The type of volume to create: 0 - standard volume 1 - mirror volume Examples Create the volume "test-volume" mounted at "/test/test-volume": CLI maprcli volume create -name test-volume -path /test/test-volume REST https://r1n1.sj.us:8443/rest/volume/create?name=tes t-volume&path=/test/test-volume Create Volume with a Quota and an Advisory Quota This example creates a volume with the following parameters: advisoryquota: 100M name: volumename path: /volumepath quota: 500M replication: 3 schedule: 2 topology: /East Coast type: 0 CLI maprcli volume create -name volumename -path /volumepath -advisoryquota 100M -quota 500M -replication 3 -schedule 2 -topology "/East Coast" -type 0 1. 2. REST https://r1n1.sj.us:8443/rest/volume/create?advisory quota=100M&name=volumename&path=/volumepath&quota=5 00M&replication=3&schedule=2&topology=/East%20Coast &type=0 Create the mirror volume "test-volume.mirror" from source volume "test-volume" and mount at "/test/test-volume-mirror": CLI maprcli volume create -name test-volume.mirror -source test-volume@src-cluster-name -path /test/test-volume-mirror REST https://r1n1.sj.us:8443/rest/volume/create?name=tes t-volume.mirror&sourcetest-volume@src-cluster-name& path=/test/test-volume-mirror volume dump create The volume dump create command creates a volume containing data from a volume for distribution or restoration. Permissions dump file required: , , or dump fc a You can use volume dump create to create two types of files: full dump files containing all data in a volume incremental dump files that contain changes to a volume between two points in time A full dump file is useful for restoring a volume from scratch. 
An incremental dump file contains the changes necessary to take an existing (or restored) volume from one point in time to another.

Along with the dump file, a full or incremental dump operation can produce a state file (specified by the -e parameter) that contains a table of the version number of every container in the volume at the time the dump file was created. This represents the end point of the dump file, which is used as the start point of the next incremental dump. The main difference between creating a full dump and creating an incremental dump is whether the -s parameter is specified; if -s is not specified, the volume dump create command includes all volume data and creates a full dump file. If you create a full dump followed by a series of incremental dumps, the result is a sequence of dump files and their accompanying state files:

dumpfile1 statefile1
dumpfile2 statefile2
dumpfile3 statefile3
...

To maintain an up-to-date dump of a volume:

1. Create a full dump file. Example:
maprcli volume dump create -name cli-created -dumpfile fulldump1 -e statefile1
2. Periodically, add an incremental dump file. Examples:
maprcli volume dump create -s statefile1 -e statefile2 -name cli-created -dumpfile incrdump1
maprcli volume dump create -s statefile2 -e statefile3 -name cli-created -dumpfile incrdump2
maprcli volume dump create -s statefile3 -e statefile4 -name cli-created -dumpfile incrdump3
...and so on.

You can restore the volume from scratch, using the volume dump restore command with the full dump file, followed by each incremental dump file in sequence.

Syntax

CLI
maprcli volume dump create
    [ -cluster <cluster> ]
    [ -s <start state file> ]
    [ -e <end state file> ]
    [ -o ]
    [ -dumpfile <dump file> ]
    -name <volume name>

REST
None.

Parameters

Parameter    Description
cluster      The cluster on which to run the command.
dumpfile     The name of the dump file (ignored if -o is used).
e            The name of the state file to create for the end point of the dump.
name         A volume name.
o            This option dumps the volume to stdout instead of to a file.
s            The start point for an incremental dump.

Examples

Create a full dump:

CLI
maprcli volume dump create -e statefile1 -dumpfile fulldump1 -name volume

Create an incremental dump:

CLI
maprcli volume dump create -s statefile1 -e statefile2 -name volume -dumpfile incrdump1

volume dump restore

The volume dump restore command restores or updates a volume from a dump file. Permissions required: dump, fc, or a

There are two ways to use volume dump restore:

With a full dump file, volume dump restore recreates a volume from scratch from volume data stored in the dump file.
With an incremental dump file, volume dump restore updates a volume using incremental changes stored in the dump file.

The volume that results from a volume dump restore operation is a mirror volume whose source is the volume from which the dump was created. After the operation, this volume can perform mirroring from the source volume.

When you are updating a volume from an incremental dump file, you must specify an existing volume and an incremental dump file. To restore from a sequence of previous dump files would involve first restoring from the volume's full dump file, then applying each subsequent incremental dump file.

A restored volume may contain mount points that represent volumes that were mounted under the original source volume from which the dump was created.
In the restored volume, these mount points have no meaning and do not provide access to any volumes that were mounted under the source volume. If the source volume still exists, then the mount points in the restored volume will work if the restored volume is associated with the source volume as a mirror. To restore from a full dump plus a sequence of incremental dumps: Restore from the full dump file, using the option to create a new mirror volume and the option to specify the name. Example: -n -name maprcli volume dump restore -dumpfile fulldump1 -name restore1 -n Restore from each incremental dump file in order, specifying the same volume name. Examples: maprcli volume dump restore -dumpfile incrdump1 -name restore1 maprcli volume dump restore -dumpfile incrdump2 -name restore1 maprcli volume dump restore -dumpfile incrdump3 -name restore1 ...and so on. Syntax CLI maprcli volume dump restore [ -cluster <cluster> ] [ -dumpfile dumpfilename ] [ -i ] [ -n ] -name <volume name> REST None. Parameters Parameter Description cluster The cluster on which to run the command. dumpfile The name of the dumpfile (ignored if is used). -i i This option reads the dump file from . stdin n This option creates a new volume if it doesn't exist. name A volume name, in the form volumename Examples Restore a volume from a full dump file: CLI maprcli volume dump restore -name volume -dumpfile fulldump1 Apply an incremental dump file to a volume: CLI maprcli volume dump restore -name volume -dumpfile incrdump1 volume fixmountpath Corrects the mount path of a volume. Permissions required: or fc a The CLDB maintains information about the mount path of every volume. If a directory in a volume's path is renamed (by a command, hadoop fs for example) the information in the CLDB will be out of date. The command does a reverse path walk from the volume volume fixmountpath and corrects the mount path information in the CLDB. Syntax CLI maprcli volume fixmountpath -name <name> [ -cluster <clustername> ] REST http[s]://<host>:<port>/rest/volume/fixmountpath?<p arameters> Parameters Parameter Description name The volume name. clustername The cluster name Examples Fix the mount path of volume v1: CLI maprcli volume fixmountpath -name v1 REST https://r1n1.sj.us:8443/rest/volume/fixmountpath?na me=v1 volume info Displays information about the specified volume. Syntax CLI maprcli volume info [ -cluster <cluster> ] [ -name <volume name> ] [ -output terse|verbose ] [ -path <path> ] REST http[s]://<host>:<port>/rest/volume/info?<parameter s> Parameters You must specify either name or path. Parameter Description cluster The cluster on which to run the command. name The volume for which to retrieve information. output Whether the output should be terse or verbose. path The mount path of the volume for which to retrieve information. Verbose Terse Description acl acl A JSON object that contains the Access Control List for the volume. creator on Name of the user that created the volume aename aen Accountable entity name aetype aet Accountable entity type: 0=user 1=group numreplicas drf Desired number of replicas. Containers with this amount of replicas are not re-replicated. minreplicas mrf Minimum number of replicas before re-replication starts. 
replicationtype dcr Replication type rackpath rp The rack path for this volume readonly ro A value of 1 indicates the volume is read-only mountdir p The path the volume is mounted on volumename n The name of the volume mounted mt A value of 1 indicates the volume is mounted quota qta A value of 0 indicates there are no hard quotas for this volume advisoryquota aqt A value of 0 indicates there are no soft or advisory quotas for this volume snapshotcount sc The number of snapshots for this volume logicalUsed dlu Logical size of disk used by this volume used   Disk space used, in MB, not including snapshots snapshotused ssu Disk space used for all snapshots, in MB totalused   Total space used for volume and snapshots, in MB scheduleid sid The ID of the schedule, if any, used by this volume schedulename sn The name of the schedule, if any, used by this volume volumetype   The volume type volumeid id The volume ID actualreplication arf The actual current replication factor by percentage of the volume, as a zero-based array of integers from 0 to 100. For each position in the array, the value is the percentage of the volume that is replicated index number of times. Example: arf=[5,1 means that 5% is not replicated, 10% 0,85] is replicated once, 85% is replicated twice. nameContainerSizeMB ncsmb   needsGfsck nfsck A value of TRUE indicates this volume requires a filesystem check maxinodesalarmthreshold miath The threshold of inodes in use that will set off the VOLUME_ALARM_INODES_EXCEEDE alarm D partlyOutOfTopology poot A value of 1 indicates this volume is partly out of its topology volume link create Creates a link to a volume. Permissions required: or fc a Syntax CLI maprcli volume link create [ -cluster <clustername> ] -path <path> -type <type> -volume <volume> REST http[s]://<host>:<port>/rest/volume/link/remove?<pa rameters> Parameters Parameter Description path The path parameter specifies the link path and other information, using the following syntax: /link/[maprfs::][volume::]<volume type>::<volume name> link - the link path maprfs - a keyword to indicate a special MapR-FS link volume - a keyword to indicate a link to a volume volume type - writeable or mirror volume name - the name of the volume Example: /abc/maprfs::mirror::abc type The volume type: or . writeable mirror volume The volume name. clustername The cluster name. Examples Create a link to v1 at the path v1. mirror: CLI maprcli volume link create -volume v1 -type mirror -path /v1.mirror REST https://r1n1.sj.us:8443/rest/volume/link/create?pat h=/v1.mirror&type=mirror&volume=v1 volume link remove Removes the specified symbolic link. Permissions required: or fc a Syntax CLI maprcli volume link remove -path <path> [ -cluster <clustername> ] REST http[s]://<host>:<port>/rest/volume/link/remove?<pa rameters> Parameters Parameter Description path The symbolic link to remove. The path parameter specifies the link path and other information about the symbolic link, using the following syntax: /link/[maprfs::][volume::]<volume type>::<volume name> link - the symbolic link path * - a keyword to indicate a special MapR-FS link maprfs volume - a keyword to indicate a link to a volume volume type - or writeable mirror volume name - the name of the volume Example: /abc/maprfs::mirror::abc clustername The cluster name. 
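The volume link create and volume link remove commands are easiest to read as a pair. The following sketch assumes a volume named v1 linked at /v1.mirror, matching the create example above, and assumes the removal path follows the expanded /link/maprfs::<volume type>::<volume name> form described in the path parameter.

# Create a mirror-type link to volume v1 at /v1.mirror
maprcli volume link create -volume v1 -type mirror -path /v1.mirror
# Remove the same link using the expanded MapR-FS link form
maprcli volume link remove -path /v1.mirror/maprfs::mirror::v1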
Examples Remove the link /abc: CLI maprcli volume link remove -path /abc/maprfs::mirror::abc REST https://r1n1.sj.us:8443/rest/volume/link/remove?pat h=/abc/maprfs::mirror::abc volume list Lists information about volumes specified by name, path, or filter. Syntax CLI maprcli volume list [ -alarmedvolumes 1 ] [ -cluster <cluster> ] [ -columns <columns> ] [ -filter <filter> ] [ -limit <limit> ] [ -nodes <nodes> ] [ -output terse | verbose ] [ -start <offset> ] REST http[s]://<host>:<port>/rest/volume/list?<parameter s> Parameters Parameter Description alarmedvolumes Specifies whether to list alarmed volumes only. cluster The cluster on which to run the command. columns A comma-separated list of fields to return in the query. See the Fields table below. filter A filter specifying volumes to list. See for more information. Filters limit The number of rows to return, beginning at start. Default: 0 nodes A list of nodes. If specified, only lists volumes on the volume list specified nodes. output Specifies whether the output should be or . terse verbose start The offset from the starting row according to sort. Default: 0 Output Column Headings Verbose Terse Description volumeid id Unique volume ID. volumetype t Volume type: 0 = normal volume 1 = mirror volume volumename n The name of the volume mounted mt Volume mount status: 0 = unmounted 1 = mounted rackpath rp Rack path. mountdir p The path the volume is mounted on creator on Username of the volume creator. aename aen Accountable entity name. aetype aet Accountable entity type: 0=user 1=group uacl   Users ACL (comma-separated list of user names. gacl   Group ACL (comma-separated list of group names). quota qta Quota, in MB; = no quota. 0 advisoryquota aqt Advisory quota, in MB; = no advisory 0 quota. used dsu Disk space used, in MB, not including snapshots. snapshotused ssu Disk space used for all snapshots, in MB. totalused tsu Total space used for volume and snapshots, in MB. readonly ro Read-only status: 0 = read/write 1 = read-only numreplicas drf Desired replication factor (number of replications). minreplicas mrf Minimum replication factor (number of replications) replicationtype dcr Replication type actualreplication arf The actual current replication factor by percentage of the volume, as a zero-based array of integers from 0 to 100. For each position in the array, the value is the percentage of the volume that is replicated index number of times. Example: arf=[5,1 means that 5% is not replicated, 10% 0,85] is replicated once, 85% is replicated twice. nameContainerSizeMB ncsmb   needsGfsck nfsck A value of TRUE indicates this volume requires a filesystem check maxinodesalarmthreshold miath The threshold of inodes in use that will set off the VOLUME_ALARM_INODES_EXCEEDE alarm D partlyOutOfTopology poot A value of 1 indicates this volume is partly out of its topology schedulename sn The name of the schedule associated with the volume. scheduleid sid The ID of the schedule associated with the volume. mirrorSrcVolumeId   Source volume ID (mirror volumes only). mirrorSrcVolume   Source volume name (mirror volumes only). mirrorSrcCluster   The cluster where the source volume resides (mirror volumes only). lastSuccessfulMirrorTime   Last successful Mirror Time, milliseconds since 1970 (mirror volumes only). mirrorstatus   Mirror Status (mirror volumes only: 0 = success 1 = running 2 = error mirror-percent-complete   Percent completion of last/current mirror (mirror volumes only). snapshotcount sc Snapshot count. 
logicalUsed dlu Logical size of disk used by this volume SnapshotFailureAlarm sfa Status of SNAPSHOT_FAILURE alarm: 0 = Clear 1 = Raised AdvisoryQuotaExceededAlarm aqa Status of VOLUME_ALARM_ADVISORY_QUOTA_EX CEEDED alarm: 0 = Clear 1 = Raised QuotaExceededAlarm qa Status of VOLUME_QUOTA_EXCEEDED alarm: 0 = Clear 1 = Raised MirrorFailureAlarm mfa Status of MIRROR_FAILURE alarm: 0 = Clear 1 = Raised DataUnderReplicatedAlarm   Status of DATA_UNDER_REPLICATED alarm: 0 = Clear 1 = Raised DataUnavailableAlarm dua Status of DATA_UNAVAILABLE alarm: 0 = Clear 1 = Raised Output Information about the specified volumes. mirrorstatus QuotaExceededAlarm numreplicas schedulename DataUnavailableAlarm volumeid rackpath volumename used volumetype SnapshotFailureAlarm mirrorDataSrcVolumeId advisoryquota aetype creator snapshotcount quota mountdir scheduleid snapshotused MirrorFailureAlarm AdvisoryQuotaExceededAlarm minreplicas mirrorDataSrcCluster actualreplication aename mirrorSrcVolumeId mirrorId mirrorSrcCluster lastSuccessfulMirrorTime nextMirrorId mirrorDataSrcVolume mirrorSrcVolume mounted logicalUsed readonly totalused DataUnderReplicatedAlarm mirror-percent-complete 0 0 3 every15min 0 362 / ATS-Run-2011-01-31-160018 864299 0 0 0 0 0 root 3 0 /ATS-Run-2011-01-31-160018 4 1816201 0 0 1 ... root 0 0 0 0 1 2110620 0 2680500 0 0 0 0 3 0 12 / mapr.cluster.internal 0 0 0 0 0 0 root 0 0 /var/mapr/cluster 0 0 0 0 1 ... root 0 0 0 0 1 0 0 0 0 0 0 0 3 0 11 / mapr.cluster.root 1 0 0 0 0 0 root 0 0 / 0 0 0 0 1 ... root 0 0 0 0 1 1 0 1 0 0 0 0 10 0 21 / mapr.jobtracker.volume 1 0 0 0 0 0 root 0 0 /var/mapr/cluster/mapred/jobTracker 0 0 0 0 1 ... root 0 0 0 0 1 1 0 1 0 0 0 0 3 0 1 / mapr.kvstore.table 0 0 0 0 0 0 root 0 0 0 0 0 0 1 ... root 0 0 0 0 0 0 0 0 0 0 Output Fields See the table above. Fields volume mirror push Pushes the changes in a volume to all of its mirror volumes in the same cluster, and waits for each mirroring operation to complete. Use this command when you need to push recent changes. Syntax CLI maprcli volume mirror push [ -cluster <cluster> ] -name <volume name> [ -verbose true|false ] REST None. Parameters Parameter Description cluster The cluster on which to run the command. name The volume to push. verbose Specifies whether the command output should be verbose. Default: true Output Sample Output Starting mirroring of volume mirror1 Mirroring complete for volume mirror1 Successfully completed mirror push to all local mirrors of volume volume1 Examples Push changes from the volume "volume1" to its local mirror volumes: CLI maprcli volume mirror push -name volume1 -cluster mycluster volume mirror start Starts mirroring on the specified volume from its source volume. License required: M5 Permissions required: or fc a When a mirror is started, the mirror volume is synchronized from a hidden internal snapshot so that the mirroring process is not affected by any concurrent changes to the source volume. The command does not wait for mirror completion, but returns immediately. volume mirror start The changes to the mirror volume occur atomically at the end of the mirroring process; deltas transmitted from the source volume do not appear until mirroring is complete. To provide rollback capability for the mirror volume, the mirroring process creates a snapshot of the mirror volume before starting the mirror, with the following naming format: . <volume>.mirrorsnap.<date>.<time> Normally, the mirroring operation transfers only deltas from the last successful mirror. 
Under certain conditions (mirroring a volume repaired by fs , for example), the source and mirror volumes can become out of sync. In such cases, it is impossible to transfer deltas, because the state is ck not the same for both volumes. Use the option to force the mirroring operation to transfer all data to bring the volumes back in sync. -full Syntax CLI maprcli volume mirror start [ -cluster <cluster> ] [ -full true|false ] -name <volume name> REST http[s]://<host>:<port>/rest/volume/mirror/start?<p arameters> Parameters Parameter Description cluster The cluster on which to run the command. full Specifies whether to perform a full copy of all data. If false, only the deltas are copied. name The volume for which to start the mirror. Output Sample Output messages Started mirror operation for volumes 'test-mirror' Examples Start mirroring the mirror volume "test-mirror": CLI maprcli volume mirror start -name test-mirror volume mirror stop Stops mirroring on the specified volume. License required: M5 Permissions required: or fc a The command lets you stop mirroring (for example, during a network outage). You can use the volume mirror stop volume mirror start command to resume mirroring. Syntax CLI maprcli volume mirror stop [ -cluster <cluster> ] -name <volume name> REST http[s]://<host>:<port>/rest/volume/mirror/stop?<pa rameters> Parameters Parameter Description cluster The cluster on which to run the command. name The volume for which to stop the mirror. Output Sample Output messages Stopped mirror operation for volumes 'test-mirror' Examples Stop mirroring the mirror volume "test-mirror": CLI maprcli volume mirror stop -name test-mirror volume modify Modifies an existing volume. Permissions required: , , or m fc a An error occurs if the name or path refers to a non-existent volume, or cannot be resolved. Syntax CLI maprcli volume modify [ -cluster <cluster> ] -name <volume name> [ -source <source> ] [ -replication <replication> ] [ -minreplication <minimum replication> ] [ -user <list of user:allowMask> ] [ -group <list of group:allowMask> ] [ -aetype <aetype> ] [ -ae <accounting entity> ] [ -quota <quota> ] [ -advisoryquota <advisory quota> ] [ -readonly <readonly> ] [ -schedule <schedule ID> ] [ -maxinodesalarmthreshold <threshold> ] REST   http[s]://<host>:<port>/rest/volume/modify?<parameters> Parameters Parameter Description advisoryquota The advisory quota for the volume as plus integer unit. , , , , , Example: quota=500G; Units: B K M G T P ae The accounting entity that owns the volume. aetype The type of accounting entity: 0=user 1=group cluster The cluster on which to run the command. group Space-separated list of pairs. group:permission minreplication The minimum replication level. Default: 0 name The name of the volume to modify. quota The quota for the volume as plus integer unit. Example: , , , , , quota=500G; Units: B K M G T P readonly Specifies whether the volume is read-only.  0 = read/write 1 = read-only replication The desired replication level. Default: 0 schedule A schedule ID. If a schedule ID is provided, then the volume will automatically create snapshots (normal volume) or sync with its source volume (mirror volume) on the specified schedule. source (Mirror volumes only) The source volume from which a mirror volume receives updates, specified in the format . <volume>@<cluster> user Space-separated list of pairs. user:permission threshold Threshold for the alarm. 
INODES_EXCEEDED Examples Change the source volume of the mirror "test-mirror": CLI maprcli volume modify -name test-mirror -source volume-2@my-cluster REST https://r1n1.sj.us:8443/rest/volume/modify?name=tes t-mirror&source=volume-2@my-cluster volume mount Mounts one or more specified volumes. Permissions required: , , or mnt fc a Syntax CLI maprcli volume mount [ -cluster <cluster> ] -name <volume list> [ -path <path list> ] [ -createparent 0|1 ] REST http[s]://<host>:<port>/rest/volume/mount?<paramete rs> Parameters Parameter Description cluster The cluster on which to run the command. name The name of the volume to mount. path The path at which to mount the volume. createparent Specifies whether or not to create a parent volume: 0 = Do not create a parent volume. 1 = Create a parent volume. Examples Mount the volume "test-volume" at the path "/test": CLI maprcli volume mount -name test-volume -path /test REST https://r1n1.sj.us:8443/rest/volume/mount?name=test -volume&path=/test volume move Moves the specified volume or mirror to a different topology. Permissions required: , , or m fc a Syntax CLI maprcli volume move [ -cluster <cluster> ] -name <volume name> -topology <path> REST http[s]://<host>:<port>/rest/volume/move?<parameter s> Parameters Parameter Description cluster The cluster on which to run the command. name The volume name. topology The new rack path to the volume. volume remove Removes the specified volume or mirror. Permissions required: , , or d fc a Syntax CLI maprcli volume remove [ -cluster <cluster> ] [ -force ] -name <volume name> [ -filter <filter> ] REST http[s]://<host>:<port>/rest/volume/remove?<paramet ers> Parameters Parameter Description cluster The cluster on which to run the command. force Forces the removal of the volume, even if it would otherwise be prevented. name The volume name. filter All volumes with names that match the filter are removed. volume rename Renames the specified volume or mirror. Permissions required: , , or m fc a Syntax CLI maprcli volume rename [ -cluster <cluster> ] -name <volume name> -newname <new volume name> REST http[s]://<host>:<port>/rest/volume/rename?<paramet ers> Parameters Parameter Description cluster The cluster on which to run the command. name The volume name. newname The new volume name. volume showmounts The API returns a list of mount points for the specified volume. volume showmounts Syntax CLI maprcli volume showmounts [ -cluster <cluster name> ] -name <volume name> REST http[s]://<host>:<port>/rest/volume/showmounts?<par ameters> Parameters Parameter Description cluster name The name of the cluster hosting the volume. volume name The name of the volume to return a list of mount points for. Examples Return the mount points for volume mapr.user.volume for the cluster my.cluster.com: CLI maprcli volume showmounts -cluster my.cluster.com -name mapr.user.volume REST https://r1n1.sj.us:8443/rest/volume/showmounts?clus ter=my.cluster.com&name=mapr.user.volume volume snapshot create Creates a snapshot of the specified volume, using the specified snapshot name. License required: M5 Permissions required: , , or snap fc a Syntax CLI maprcli volume snapshot create [ -cluster <cluster> ] -snapshotname <snapshot> -volume <volume> REST http[s]://<host>:<port>/rest/volume/snapshot/create ?<parameters> Parameters Parameter Description cluster The cluster on which to run the command. snapshotname The name of the snapshot to create. volume The volume for which to create a snapshot. 
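One common scripting pattern, shown as a sketch below, is to date-stamp manually created snapshots so they sort cleanly alongside scheduled ones; the volume name projects is a placeholder.

# Create a snapshot named with today's date
maprcli volume snapshot create -volume projects -snapshotname projects-$(date +%Y-%m-%d)
# Verify that the snapshot exists
maprcli volume snapshot list -volume projects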
Examples Create a snapshot called "test-snapshot" for volume "test-volume": CLI maprcli volume snapshot create -snapshotname test-snapshot -volume test-volume REST https://r1n1.sj.us:8443/rest/volume/snapshot/create ?volume=test-volume&snapshotname=test-snapshot volume snapshot list Displays info about a set of snapshots. You can specify the snapshots by volumes or paths, or by specifying a filter to select volumes with certain characteristics. Syntax CLI maprcli volume snapshot list [ -cluster <cluster> ] [ -columns <fields> ] ( -filter <filter> | -path <volume path list> | -volume <volume list> ) [ -limit <rows> ] [ -output (terse\|verbose) ] [ -start <offset> ] REST http[s]://<host>:<port>/rest/volume/snapshot/list?< parameters> Parameters Specify exactly one of the following parameters: , , or . volume path filter Parameter Description cluster The cluster on which to run the command. columns A comma-separated list of fields to return in the query.  See the Field table below. Default: none s filter A filter specifying snapshots to preserve. See for more Filters information. limit The number of rows to return, beginning at start. Default: 0 output Specifies whether the output should be or . Default: terse verbose verbose path A comma-separated list of paths for which to preserve snapshots. start The offset from the starting row according to sort. Default: 0 volume A comma-separated list of volumes for which to preserve snapshots. Fields The following table lists the fields used in the sort and columns parameters, and returned as output. Field Description snapshotid Unique snapshot ID. snapshotname Snapshot name. volumeid ID of the volume associated with the snapshot. volumename Name of the volume associated with the snapshot. volumepath Path to the volume associated with the snapshot. ownername Owner (user or group) associated with the volume. ownertype Owner type for the owner of the volume:  0=user 1=group dsu Disk space used for the snapshot, in MB. creationtime Snapshot creation time, milliseconds since 1970 expirytime Snapshot expiration time, milliseconds since 1970; = never expires. 0 Output The specified columns about the specified snapshots. Sample Output creationtime ownername snapshotid snapshotname expirytime diskspaceused volumeid volumename ownertype volumepath 1296788400768 dummy 363 ATS-Run-2011-01-31-160018.2011-02-03.19-00-00 1296792000001 1063191 362 ATS-Run-2011-01-31-160018 1 /dummy 1296789308786 dummy 364 ATS-Run-2011-01-31-160018.2011-02-03.19-15-02 1296792902057 753010 362 ATS-Run-2011-01-31-160018 1 /dummy 1296790200677 dummy 365 ATS-Run-2011-01-31-160018.2011-02-03.19-30-00 1296793800001 0 362 ATS-Run-2011-01-31-160018 1 /dummy dummy 1 14 test-volume-2 /dummy 102 test-volume-2.2010-11-07.10:00:00 0 1289152800001 1289239200001 Output Fields See the table above. Fields Examples List all snapshots: CLI maprcli volume snapshot list REST https://r1n1.sj.us:8443/rest/volume/snapshot/list volume snapshot preserve Preserves one or more snapshots from expiration.  Specify the snapshots by volumes, paths, filter, or IDs. License required: M5 Permissions required: , , or snap fc a Syntax CLI maprcli volume snapshot preserve [ -cluster <cluster> ] ( -filter <filter> | -path <volume path list> | -snapshots <snapshot list> | -volume <volume list> ) REST http[s]://<host>:<port>/rest/volume/snapshot/preser ve?<parameters> Parameters Specify exactly one of the following parameters: volume, path, filter, or snapshots. 
Parameter Description cluster The cluster on which to run the command. filter A filter specifying snapshots to preserve. See for more Filters information. path A comma-separated list of paths for which to preserve snapshots. snapshots A comma-separated list of snapshot IDs to preserve. volume A comma-separated list of volumes for which to preserve snapshots. Examples Preserve two snapshots by ID: First, use to get the IDs of the snapshots you wish to preserve. Example: volume snapshot list # maprcli volume snapshot list creationtime ownername snapshotid snapshotname expirytime diskspaceused volumeid volumename ownertype volumepath 1296788400768 dummy 363 ATS-Run-2011-01-31-160018.2011-02-03.19-00-00 1296792000001 1063191 362 ATS-Run-2011-01-31-160018 1 /dummy 1296789308786 dummy 364 ATS-Run-2011-01-31-160018.2011-02-03.19-15-02 1296792902057 753010 362 ATS-Run-2011-01-31-160018 1 /dummy 1296790200677 dummy 365 ATS-Run-2011-01-31-160018.2011-02-03.19-30-00 1296793800001 0 362 ATS-Run-2011-01-31-160018 1 /dummy dummy 1 14 test-volume-2 /dummy 102 test-volume-2.2010-11-07.10:00:00 0 1289152800001 1289239200001 Use the IDs in the command. For example, to preserve the first two snapshots in the above list, run the volume snapshot preserve commands as follows: CLI maprcli volume snapshot preserve -snapshots 363,364 REST https://r1n1.sj.us:8443/rest/volume/snapshot/preser ve?snapshots=363,364 volume snapshot remove Removes one or more snapshots. License required: M5 Permissions required: , , or snap fc a Syntax CLI maprcli volume snapshot remove [ -cluster <cluster> ] ( -snapshotname <snapshot name> | -snapshots <snapshots> | -volume <volume name> ) REST http[s]://<host>:<port>/rest/volume/snapshot/remove ?<parameters> Parameters Specify exactly one of the following parameters: snapshotname,  snapshots, or volume. Parameter Description cluster The cluster on which to run the command. snapshotname The name of the snapshot to remove. snapshots A comma-separated list of IDs of snapshots to remove. volume The name of the volume from which to remove the snapshot. Examples Remove the snapshot "test-snapshot": CLI maprcli volume snapshot remove -snapshotname test-snapshot REST https://10.250.1.79:8443/api/volume/snapshot/remove ?snapshotname=test-snapshot Remove two snapshots by ID: First, use to get the IDs of the snapshots you wish to remove. Example: volume snapshot list # maprcli volume snapshot list creationtime ownername snapshotid snapshotname expirytime diskspaceused volumeid volumename ownertype volumepath 1296788400768 dummy 363 ATS-Run-2011-01-31-160018.2011-02-03.19-00-00 1296792000001 1063191 362 ATS-Run-2011-01-31-160018 1 /dummy 1296789308786 dummy 364 ATS-Run-2011-01-31-160018.2011-02-03.19-15-02 1296792902057 753010 362 ATS-Run-2011-01-31-160018 1 /dummy 1296790200677 dummy 365 ATS-Run-2011-01-31-160018.2011-02-03.19-30-00 1296793800001 0 362 ATS-Run-2011-01-31-160018 1 /dummy dummy 1 14 test-volume-2 /dummy 102 test-volume-2.2010-11-07.10:00:00 0 1289152800001 1289239200001 Use the IDs in the command. For example, to remove the first two snapshots in the above list, run the commands volume snapshot remove as follows: CLI maprcli volume snapshot remove -snapshots 363,364 REST https://r1n1.sj.us:8443/rest/volume/snapshot/remove ?snapshots=363,364 volume unmount Unmounts one or more mounted volumes. 
Permissions required: , , or mnt fc a Syntax CLI maprcli volume unmount [ -cluster <cluster> ] [ -force 1 ] -name <volume name> REST http[s]://<host>:<port>/rest/volume/unmount?<parame ters> Parameters Parameter Description cluster The cluster on which to run the command. force Specifies whether to force the volume to unmount. name The name of the volume to unmount. Examples Unmount the volume "test-volume": CLI maprcli volume unmount -name test-volume REST https://r1n1.sj.us:8443/rest/volume/unmount?name=te st-volume Alarms Reference This page provides details for all alarm types. User/Group Alarms Entity Advisory Quota Alarm Entity Quota Alarm Cluster Alarms Blacklist Alarm CLDB Low Memory Alarm License Near Expiration License Expired Cluster Almost Full Cluster Full Maximum Licensed Nodes Exceeded alarm New Cluster Features Disabled Upgrade in Progress VIP Assignment Failure Node Alarms CLDB Service Alarm Core Present Alarm Debug Logging Active Disk Failure Duplicate Host ID FileServer Service Alarm HBMaster Service Alarm HBRegion Service Alarm Hoststats Alarm Installation Directory Full Alarm JobTracker Service Alarm MapR-FS High Memory Alarm M7 Configuration Mismatch MapR User Mismatch Metrics Write Problem Alarm NFS Gateway Alarm PAM Misconfigured Alarm Root Partition Full Alarm TaskTracker Service Alarm TaskTracker Local Directory Full Alarm Time Skew Alarm Version Alarm WebServer Service Alarm Volume Alarms Data Unavailable Data Under-Replicated Inodes Limit Exceeded Mirror Failure No Nodes in Topology Snapshot Failure Topology Almost Full Topology Full Alarm Volume Advisory Quota Alarm Volume with Non-Local Containers Volume Quota Alarm User/Group Alarms User/group alarms indicate problems with user or group quotas. The following tables describe the MapR user/group alarms. Entity Advisory Quota Alarm UI Column User Advisory Quota Alarm Logged As AE_ALARM_AEADVISORY_QUOTA_EXCEEDED Meaning A user or group has exceeded its advisory quota. See Managing for more information about user/group quotas. Quotas Resolution No immediate action is required. To avoid exceeding the hard quota, clear space on volumes created by the user or group, or stop further data writes to those volumes. Entity Quota Alarm UI Column User Quota Alarm Logged As AE_ALARM_AEQUOTA_EXCEEDED Meaning A user or group has exceeded its quota. Further writes by the user or group will fail. See for more information about Managing Quotas user/group quotas. Resolution Free some space on the volumes created by the user or group, or increase the user or group quota. Cluster Alarms Cluster alarms indicate problems that affect the cluster as a whole. The following tables describe the MapR cluster alarms. Blacklist Alarm UI Column Blacklist Alarm Logged As CLUSTER_ALARM_BLACKLIST_TTS Meaning The JobTracker has blacklisted a TaskTracker node because tasks on the node have failed too many times. Resolution To determine which node or nodes have been blacklisted, see the JobTracker status page (click in the ). JobTracker Navigation Pane The JobTracker status page provides links to the TaskTracker log for each node; look at the log for the blacklisted node or nodes to determine why tasks are failing on the node. CLDB Low Memory Alarm UI Column CLDB Low Memory Alarm Logged As CLUSTER_ALARM_CLDB_HEAPSIZE Meaning The CLDB process needs more memory to cache containers. 
Resolution    Edit the following settings in the warden.conf file located in $MAPR_HOME/conf/:
service.command.cldb.heapsize.max=<max heap size>
service.command.cldb.heapsize.min=<min heap size>
Example: If the current warden.conf settings start the CLDB service with Xmx=1G, you can change the max and min heap size to 1500 to start the CLDB with Xmx=1.5G.
You can also set enable.overcommit=true in warden.conf to overcommit memory without error. Restart the Warden service after you edit the warden.conf file.

License Near Expiration
UI Column     License Near Expiration Alarm
Logged As     CLUSTER_ALARM_LICENSE_NEAR_EXPIRATION
Meaning       The M5 license associated with the cluster is within 30 days of expiration.
Resolution    Renew the M5 license.

License Expired
UI Column     License Expiration Alarm
Logged As     CLUSTER_ALARM_LICENSE_EXPIRED
Meaning       The M5 license associated with the cluster has expired. M5 features have been disabled.
Resolution    Renew the M5 license.

Cluster Almost Full
UI Column     Cluster Almost Full
Logged As     CLUSTER_ALARM_CLUSTER_ALMOST_FULL
Meaning       The cluster storage is almost full. The percentage of storage used before this alarm is triggered is 90% by default, and is controlled by the configuration parameter cldb.cluster.almost.full.percentage.
Resolution    Reduce the amount of data stored in the cluster. If the cluster storage is less than 90% full, check the cldb.cluster.almost.full.percentage parameter via the config load command, and adjust it if necessary via the config save command.

Cluster Full
UI Column     Cluster Full
Logged As     CLUSTER_ALARM_CLUSTER_FULL
Meaning       The cluster storage is full. MapReduce operations have been halted.
Resolution    Free up some space on the cluster.

Maximum Licensed Nodes Exceeded alarm
UI Column     Licensed Nodes Exceeded Alarm
Logged As     CLUSTER_ALARM_LICENSE_MAXNODES_EXCEEDED
Meaning       The cluster has exceeded the number of nodes specified in the license.
Resolution    Remove some nodes, or upgrade the license to accommodate the added nodes.

New Cluster Features Disabled
UI Column     New Cluster Features Disabled
Logged As     CLUSTER_ALARM_NEW_FEATURES_DISABLED
Meaning       Features added in version 2.0 are disabled on this cluster.
Resolution    Edit the cldb.conf file to add the line cldb.v2.features.enabled=1. Restart the Warden on all nodes.

Upgrade in Progress
UI Column     Software Installation & Upgrades
Logged As     CLUSTER_ALARM_UPGRADE_IN_PROGRESS
Meaning       A rolling upgrade of the cluster is in progress.
Resolution    No action is required. Performance may be affected during the upgrade, but the cluster should still function normally. After the upgrade is complete, the alarm is cleared.

VIP Assignment Failure
UI Column     VIP Assignment Alarm
Logged As     CLUSTER_ALARM_UNASSIGNED_VIRTUAL_IPS
Meaning       MapR was unable to assign a VIP to any NFS servers.
Resolution    Check the VIP configuration, and make sure at least one of the NFS servers in the VIP pool is up and running. See Configuring NFS for HA. This alarm can also indicate that a VIP's hostname exceeds the maximum allowed length of 16. Check the /opt/mapr/logs/nfsmon.log log file for additional information.

Node Alarms

Node alarms indicate problems in individual nodes. The following tables describe the MapR node alarms.

CLDB Service Alarm
UI Column     CLDB Alarm
Logged As     NODE_ALARM_SERVICE_CLDB_DOWN
Meaning       The CLDB service on the node has stopped running.
Resolution    Go to the Manage Services pane of the Node Properties View to check whether the CLDB service is running.
The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter service in . If the warden s.retryinterval.time.sec warden.conf successfully restarts the CLDB service, the alarm is cleared. If the warden is unable to restart the CLDB service, it may be necessary to contact technical support. Core Present Alarm UI Column Core Present Logged As NODE_ALARM_CORE_PRESENT Meaning A service on the node has crashed and created a core dump file. When all core files are removed, the alarm is cleared. Resolution Contact technical support. Debug Logging Active UI Column Excess Logs Alarm Logged As NODE_ALARM_DEBUG_LOGGING Meaning Debug logging is enabled on the node. Resolution Debug logging generates enormous amounts of data, and can fill up disk space. If debug logging is not absolutely necessary, turn it off: either use the pane in the Node Properties view or Manage Services the command. If it is absolutely necessary, make sure that setloglevel the logs in /opt/mapr/logs are not in danger of filling the entire disk. Disk Failure UI Column Disk Failure Alarm Logged As NODE_ALARM_DISK_FAILURE Meaning A disk has failed on the node. Resolution Check the disk health log (/opt/mapr/logs/faileddisk.log) to determine which disk failed and view any SMART data provided by the disk. See Handling Disk Failure Duplicate Host ID UI Column Duplicate Host Id Logged As NODE_ALARM_DUPLICATE_HOSTID Meaning Two or more nodes in the cluster have the same host ID. Resolution Multiple nodes with the same host ID are prevented from joining the cluster, in order to prevent addressing problems that can lead to data loss. To correct the problem and clear the alarm, make sure all host IDs are unique and use the maprcli node allow-into-cluster command to un-ban the affected host IDs. FileServer Service Alarm UI Column FileServer Alarm Logged As NODE_ALARM_SERVICE_FILESERVER_DOWN Meaning The FileServer service on the node has stopped running. Resolution Go to the pane of the Node Properties View to Manage Services check whether the FileServer service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter service in . If the warden s.retryinterval.time.sec warden.conf successfully restarts the FileServer service, the alarm is cleared. If the warden is unable to restart the FileServer service, it may be necessary to contact technical support. HBMaster Service Alarm UI Column HBase Master Alarm Logged As NODE_ALARM_SERVICE_HBMASTER_DOWN Meaning The HBMaster service on the node has stopped running. Resolution Go to the pane of the Node Properties View to Manage Services check whether the HBMaster service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter service in . If the warden s.retryinterval.time.sec warden.conf successfully restarts the HBMaster service, the alarm is cleared. If the warden is unable to restart the HBMaster service, it may be necessary to contact technical support. 
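The node service alarms above and below share the same recovery pattern: check the service from the Manage Services pane, let the warden retry, and restart manually if needed. As a command-line alternative, the sketch below uses the maprcli service list and maprcli node services commands covered in the maprcli reference; the hostname is hypothetical and the exact flag form (for example -hbmaster restart) is an assumption based on that reference, so verify it against your installed version before use.

$ # List the services configured on a node (hostname is hypothetical)
$ maprcli service list -node node1.example.com
$ # Ask the warden to restart the HBase Master on that node (flag form assumed)
$ maprcli node services -nodes node1.example.com -hbmaster restart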
HBRegion Service Alarm UI Column HBase RegionServer Alarm Logged As NODE_ALARM_SERVICE_HBREGION_DOWN Meaning The HBRegion service on the node has stopped running. Resolution Go to the pane of the Node Properties View to Manage Services check whether the HBRegion service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter service in . If the warden s.retryinterval.time.sec warden.conf successfully restarts the HBRegion service, the alarm is cleared. If the warden is unable to restart the HBRegion service, it may be necessary to contact technical support. Hoststats Alarm UI Column HostStats Logged As NODE_ALARM_SERVICE_HOSTSTATS_DOWN Meaning The Hoststats service on the node has stopped running. Resolution Go to the pane of the Node Properties View to Manage Services check whether the Hoststats service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter service in . If the warden s.retryinterval.time.sec warden.conf successfully restarts the service, the alarm is cleared. If the warden is unable to restart the service, it may be necessary to contact technical support. Installation Directory Full Alarm UI Column Installation Directory Full Logged As NODE_ALARM_OPT_MAPR_FULL Meaning The partition on the node is running out of space (95% /opt/mapr full). Resolution Free up some space in on the node. /opt/mapr JobTracker Service Alarm UI Column JobTracker Alarm Logged As NODE_ALARM_SERVICE_JT_DOWN Meaning The JobTracker service on the node has stopped running. Resolution Go to the pane of the Node Properties View to Manage Services check whether the JobTracker service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter service in . If the warden s.retryinterval.time.sec warden.conf successfully restarts the JobTracker service, the alarm is cleared. If the warden is unable to restart the JobTracker service, it may be necessary to contact technical support. MapR-FS High Memory Alarm UI Column High FileServer Memory Alarm Logged As NODE_ALARM_HIGH_MFS_MEMORY Meaning Memory consumed by service on the node is in excess of fileserver the allotted amount. Resolution Log on as root to the node for which the alarm is raised, and restart the Warden: service mapr-warden restart M7 Configuration Mismatch UI Column Node M7 Config Mismatch Logged As NODE_ALARM_M7_CONFIG_MISMATCH Meaning This node's memory is not optimized for tables. The M7 license may have been installed on the cluster after the FileServer service was started on the node. Resolution Restart the FileServer service on the node. MapR User Mismatch UI Column MapR User Mismatch Alarm Logged As NODE_ALARM_MAPRUSER_MISMATCH Meaning The cluster nodes are not all set up to run MapR services as the same user (for example, some nodes are running MapR as whil root e others are running as . mapr_user Resolution For the nodes on which the User Mismatch alarm is raised, follow the steps in . 
Changing the User for MapR Services Metrics Write Problem Alarm UI Column Metrics write problem Alarm Logged As NODE_ALARM_METRICS_WRITE_PROBLEM Meaning Unable to write Metrics data to the database or the MapR-FS local Metrics volume. Resolution This issue can have multiple causes. To clear the alarm, check the log file at for the cause of the /opt/mapr/logs/hoststats.log write failure. In the case of database access failure, restore write access to the MySQL database. For more information, consult the process outlined in . Setting up the MapR Metrics Database NFS Gateway Alarm UI Column NFS Alarm Logged As NODE_ALARM_SERVICE_NFS_DOWN Meaning The NFS service on the node has stopped running. Resolution Go to the pane of the Node Properties View to Manage Services check whether the NFS service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter service in . If the warden s.retryinterval.time.sec warden.conf successfully restarts the NFS service, the alarm is cleared. If the warden is unable to restart the NFS service, it may be necessary to contact technical support. PAM Misconfigured Alarm UI Column Pam Misconfigured Alarm Logged As NODE_ALARM_PAM_MISCONFIGURED Meaning The PAM authentication on the node is configured incorrectly. Resolution See . PAM Configuration Root Partition Full Alarm UI Column Root Partition Full Logged As NODE_ALARM_ROOT_PARTITION_FULL Meaning The root partition ('/') on the node is running out of space (99% full). Resolution Free up some space in the root partition of the node. TaskTracker Service Alarm UI Column TaskTracker Alarm Logged As NODE_ALARM_SERVICE_TT_DOWN Meaning The TaskTracker service on the node has stopped running. Resolution Go to the pane of the Node Properties View to Manage Services check whether the TaskTracker service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter serv in . If the warden ices.retryinterval.time.sec warden.conf successfully restarts the TaskTracker service, the alarm is cleared. If the warden is unable to restart the TaskTracker service, it may be necessary to contact technical support. TaskTracker Local Directory Full Alarm UI Column TaskTracker Local Directory Full Alarm Logged As NODE_ALARM_TT_LOCALDIR_FULL Meaning The local directory used by the TaskTracker on the specified node(s) is full, and the TaskTracker cannot operate as a result. Resolution Delete or move data from the local disks, or add storage to the specified node(s), and try the jobs again. Time Skew Alarm UI Column Time Skew Alarm Logged As NODE_ALARM_TIME_SKEW Meaning The clock on the node is out of sync with the master CLDB by more than 20 seconds. Resolution Use NTP to synchronize the time on all the nodes in the cluster. Version Alarm UI Column Version Alarm Logged As NODE_ALARM_VERSION_MISMATCH Meaning One or more services on the node are running an unexpected version. Resolution Stop the node, Restore the correct version of any services you have modified, and re-start the node. See . Managing Nodes WebServer Service Alarm UI Column Webserver Alarm Logged As NODE_ALARM_SERVICE_WEBSERVER_DOWN Meaning The WebServer service on the node has stopped running. 
Resolution Go to the pane of the Node Properties View to Manage Services check whether the WebServer service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the parameter service in . If the warden s.retryinterval.time.sec warden.conf successfully restarts the WebServer service, the alarm is cleared. If the warden is unable to restart the WebServer service, it may be necessary to contact technical support. Volume Alarms Volume alarms indicate problems in individual volumes. The following tables describe the MapR volume alarms. Data Unavailable UI Column Data Alarm Logged As VOLUME_ALARM_DATA_UNAVAILABLE Meaning This is a potentially very serious alarm that may indicate data loss. Some of the data on the volume cannot be located. This alarm indicates that enough nodes have failed to bring the replication factor of part or all of the volume to zero. For example, if the volume is stored on a single node and has a replication factor of one, the Data Unavailable alarm will be raised if that volume fails or is taken out of service unexpectedly. If a volume is replicated properly (and therefore is stored on multiple nodes) then the Data Unavailable alarm can indicate that a significant number of nodes is down. Resolution Investigate any nodes that have failed or are out of service. You can see which nodes have failed by looking at the Cluster Node Heatmap pane of the . Dashboard Check the cluster(s) for any snapshots or mirrors that can be used to re-create the volume. You can see snapshots and mirrors in the view. MapR-FS Data Under-Replicated UI Column Replication Alarm Logged As VOLUME_ALARM_DATA_UNDER_REPLICATED Meaning The volume replication factor is lower than the desired replication set in . This can be caused by failing disks or factor Volume Properties nodes, or the cluster may be running out of storage space. Resolution Investigate any nodes that are failing. You can see which nodes have failed by looking at the Cluster Node Heatmap pane of the Dashboard . Determine whether it is necessary to add disks or nodes to the cluster. This alarm is generally raised when the nodes that store the volumes or replicas have not sent a for five minutes. To heartbeat prevent re-replication during normal maintenance procedures, MapR waits a specified interval (by default, one hour) before considering the node dead and re-replicating its data. You can control this interval by setting the parameter using the cldb.fs.mark.rereplicate.sec command. config save Inodes Limit Exceeded UI Column Inodes Exceeded Alarm Logged As VOLUME_ALARM_INODES_EXCEEDED Meaning The volume contains too many files. Resolution This alarm indicates that not enough volumes are set up to handle the number of files stored in the cluster. Typically, each user or project should have a separate volume. Mirror Failure UI Column Mirror Alarm Logged As VOLUME_ALARM_MIRROR_FAILURE Meaning A mirror operation failed. Resolution Make sure the CLDB is running on both the source cluster and the destination cluster. Look at the CLDB log (/opt/mapr/logs/cldb.log) and the MapR-FS log (/opt/mapr/logs/mfs.log) on both clusters for more information. If the attempted mirror operation was between two clusters, make sure that both clusters are reachable over the network. 
Make sure the source volume is available and reachable from the cluster that is performing the mirror operation.

No Nodes in Topology
UI Column     No Nodes in Vol Topo
Logged As     VOLUME_ALARM_NO_NODES_IN_TOPOLOGY
Meaning       The path specified in the volume's topology no longer corresponds to a physical topology that contains any nodes, either due to node failures or changes to node topology settings. While this alarm is raised, MapR places data for the volume on nodes outside the volume's topology to prevent write failures.
Resolution    Add nodes to the specified volume topology, either by moving existing nodes or adding nodes to the cluster. See Node Topology.

Snapshot Failure
UI Column     Snapshot Alarm
Logged As     VOLUME_ALARM_SNAPSHOT_FAILURE
Meaning       A snapshot operation failed.
Resolution    Make sure the CLDB is running. Look at the CLDB log (/opt/mapr/logs/cldb.log) and the MapR-FS log (/opt/mapr/logs/mfs.log) for more information. If the attempted snapshot was a scheduled snapshot that was running in the background, try a manual snapshot.

Topology Almost Full
UI Column     Vol Topo Almost Full
Logged As     VOLUME_ALARM_TOPOLOGY_ALMOST_FULL
Meaning       The nodes in the specified topology are running out of storage space.
Resolution    Move volumes to another topology, enlarge the specified topology by adding more nodes, or add disks to the nodes in the specified topology.

Topology Full Alarm
UI Column     Vol Topo Full
Logged As     VOLUME_ALARM_TOPOLOGY_FULL
Meaning       The nodes in the specified topology have run out of storage space.
Resolution    Move volumes to another topology, enlarge the specified topology by adding more nodes, or add disks to the nodes in the specified topology.

Volume Advisory Quota Alarm
UI Column     Vol Advisory Quota Alarm
Logged As     VOLUME_ALARM_ADVISORY_QUOTA_EXCEEDED
Meaning       A volume has exceeded its advisory quota.
Resolution    No immediate action is required. To avoid exceeding the hard quota, clear space on the volume or stop further data writes.

Volume with Non-Local Containers
UI Column     Local Volume containers non-local
Logged As     VOLUME_ALARM_DATA_CONTAINERS_NONLOCAL
Meaning       This is a local volume and its containers should all reside on the same node. Some containers were created on another node, which may cause performance issues in MapReduce jobs.
Resolution    Recreate the local volume or contact support.

Volume Quota Alarm
UI Column     Vol Quota Alarm
Logged As     VOLUME_ALARM_QUOTA_EXCEEDED
Meaning       A volume has exceeded its quota. Further writes to the volume will fail.
Resolution    Free some space on the volume or increase the volume hard quota.

Utilities

This section contains information about the following scripts and commands:

configure.sh - configures a node or client to work with the cluster
disksetup - sets up disks for use by MapR storage
mapr-support-collect.sh - collects cluster information for use by MapR Support
pullcentralconfig - pulls master configuration files from the cluster to the local disk
rollingupgrade.sh - upgrades software on a MapR cluster

configure.sh

Sets up a MapR cluster or client, creates or modifies /opt/mapr/conf/mapr-clusters.conf, and updates the corresponding *.conf and *.xml files.

Each time configure.sh is run, it creates or modifies a line in /opt/mapr/conf/mapr-clusters.conf containing a cluster name followed by a list of CLDB nodes. If you do not specify a name (using the -N parameter), configure.sh applies the default name my.cluster.com to the cluster. Subsequent runs of configure.sh without the -N parameter will operate on this default cluster, as illustrated below.
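As an illustration of the line format described above, a mapr-clusters.conf entry for the default cluster might look like the following sketch. The hostnames are hypothetical and the exact fields written can vary by release; 7222 is the default CLDB port.

my.cluster.com nodeA:7222 nodeB:7222 nodeC:7222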
If you specify a name when you first run configure.sh -N , you can modify the CLDB and ZooKeeper settings corresponding to the named cluster by specifying the same name and configure.sh running again. Whenever you run you must be aware of the existing cluster name or names in configure.sh mapr-clusters.conf mapr-cl and specify the parameter accordingly. If you specify a name that does not exist, a new line is created in usters.conf -N mapr-clusters.co and treated as a configuration for a separate cluster. nf The normal use of is to set up a MapR cluster, or to set up a MapR client for communication with one or more clusters. configure.sh To set up a cluster, run on all nodes specifying the cluster's CLDB and ZooKeeper nodes, and a cluster name if desired. configure.sh If setting up a cluster on virtual machines, use the parameter. -isvm To set up a client, run on the client machine, specifying the CLDB and ZooKeeper nodes of the cluster or clusters. When configure.sh setting up a client to work with multiple clusters, run for each cluster, specifying the CLDB and ZooKeeper nodes configure.sh normally and specifying the name with the parameter. On a client, use both the and parameters. -N -c -C To change services (other than the CLDB and ZooKeeper) running on a node, run with the option. If you change the configure.sh -R location or number of CLDB or ZooKeeper services in a cluster, run and specify the new lineup of CLDB and ZooKeeper configure.sh nodes. To specify a MySQL database to use for storing MapR Metrics data, use the parameters , , and . If these are not specified, you -d -du -dp can configure the database later using the MapR Control System or the MapR API. To specify a user for running MapR services, either set to the username before running , or specify the $MAPR_USER configure.sh username in the parameter when running . -u configure.sh On a Windows client, the script is named but otherwise works in a similar way. configure.sh configure.bat Syntax /opt/mapr/server/configure.sh -C cldb_list (hostname[:port_no] [,hostname[:port_no]...]) -M cldb_mh_list (hostname[:port_no][,[hostname[:port_no]...]) -Z zookeeper_list (hostname[:port_no][,hostname[:port_no]...]) [ -c ] [ --isvm ] [ -J <CLDB JMX port> ] [ -L <log file> ] [ -M7 ] [ -N <cluster name> ] [ -R ] [ -d <host>:<port> ] [ -du <database username> ] [ -dp <database password> ] [ --create-user|-a ] [ -U <user ID> ] [ -u <username> ] [ -G <group ID> ] [ -g <group name> ] [ -f ] Parameters Parameter Description -C Use the option only for CLDB servers that have a single IP -C address each. This option takes a list of the CLDB nodes that this machine uses to connect to the MapR cluster. The list is in the following format: hostname[:port_no] [,hostname[:port_no]...] -M Use the option only for multihomed CLDB servers that have more -M than one IP address. This option takes a list of the multihomed CLDB nodes that this machine uses to connect to the MapR cluster. The list is in the follwing format: hostname[:port_no][,[hostname[:port_no]...]] -Z The option is required unless (lowercase) is specified. This -Z -c option takes a list of the ZooKeeper nodes in the cluster. The list is in the following format: hostname[:port_no][,hostname[:port_no]...] --isvm Specifies virtual machine setup. Required when is configure.sh run on a virtual machine. -c Specifies client setup. See . Setting Up the Client -J Specifies the port for the CLDB. Default: JMX 7220 -L Specifies a log file. If not specified, logs errors to configure.sh /o . 
pt/mapr/logs/configure.log -M7 Use the option to apply M7 settings to FileServer nodes. When -M7 Warden starts the FileServer service on the nodes, it assigns extra memory to them. -N Specifies the cluster name. -R After initial node configuration, specifies that should configure.sh use the previously configured ZooKeeper and CLDB nodes. When -R is specified, the CLDB credentials are read from and the ZooKeeper credentials are read mapr-clusters.conf from . The option is useful for making changes to warden.conf -R the services configured on a node without changing the CLDB and ZooKeeper nodes. The and parameters are not required when -C -Z is specified. -R -d The host and port of the MySQL database to use for storing MapR Metrics data. -du The username for logging into the MySQL database used for storing MapR Metrics data. -dp The password for logging into the MySQL database used for storing MapR Metrics data. --create-user or -a Create a local user to run MapR services, using the specified user from or the environment variable . -u $MAPR_USER -U The user ID to use when creating with the $MAPR_USER --create- or option; corresponds to the or option of the user -a -u --uid use command in Linux. radd -u The user name under which MapR services will run. -G The group ID to use when creating with the $MAPR_USER -create- or option; corresponds to the or option of the user -a -g -gid user command in Linux. add -g The group name under which MapR services will run. -f Specifies that the node should be configured without the system prerequisite check. Examples Add a node (not CLDB or ZooKeeper) to a cluster that is running the CLDB and ZooKeeper on three nodes: On the new node, run the following command: /opt/mapr/server/configure.sh -C nodeA,nodeB,nodeC -Z nodeA,nodeB,nodeC Configure a client to work with cluster my.cluster.com, which has one CLDB at nodeA: On a Linux client, run the following command: /opt/mapr/server/configure.sh -N my.cluster.com -c -C nodeA On a Windows 7 client, run the following command: /opt/mapr/server/configure.bat -N my.cluster.com -c -C nodeA Add a second cluster to the configuration: On a node in the second cluster , run the following command: your.cluster.com configure.sh -C nodeZ -N your.cluster.com -Z <zkNodeA,zkNodeB,zkNodeC> Adding CLDB servers with multiple IP addresses to a cluster: In this example, the cluster has CLDB servers at , , , and . The CLDB servers and h my.cluster.com nodeA nodeB nodeC nodeD nodeB nodeD ave two NICs each at and . eth0 eth1 On a node in the cluster , run the following command: my.cluster.com configure.sh -N my.cluster.com -C nodeAeth0,nodeCeth0 -M nodeBeth0,nodeBeth1 -M nodeDeth0,nodeDeth1 -Z zknodeA disksetup The command formats specified disks for use by MapR storage and adds those disks to the file. See disksetup disktab Setting Up Disks for for more information about when and how to use . MapR disksetup Syntax /opt/mapr/server/disksetup <disk list file> [-F] [-G] [-M] [-W <stripe_width>] Options Option Description -F Forces formatting of all specified disks. Disks that are already formatted for MapR are not reformatted by unless you disksetup specify this option. The option fails when a filesystem has an entry -F in the file, is mounted, or is in use. Call disktab maprcli disk to remove a disk entry from the file. remove disktab -G Generates contents from input disk list, but does not format disktab disks. This option is useful if disk names change after a reboot, or if the file is damaged. 
disktab -M Uses the maximum available number of disks per storage pool. -W Specifies the number of disks per storage pool. Examples Setting up disks specified in the file /tmp/disks.txt: /opt/mapr/server/disksetup -F /tmp/disks.txt Reformatting all disks To reformat all disks, remove the file and issue the command to format the disk: disktab disksetup -F /opt/mapr/server/disksetup -F To reformat a particular disk from the use the and commands. For more information, disktab maprcli disk remove maprcli disk add see . Setting Up Disks for MapR Specifying disks The script is used to format disks for use by the MapR cluster. Create a text file listing the disks and partitions for disksetup /tmp/disks.txt use by MapR on the node. Each line lists either a single disk or all applicable partitions on a single disk. When listing multiple partitions on a line, separate by spaces. For example: /dev/sdb /dev/sdc1 /dev/sdc2 /dev/sdc4 /dev/sdd Later, when you run to format the disks, specify the file. For example: disksetup disks.txt /opt/mapr/server/disksetup -F /tmp/disks.txt If you are re-using a node that was used previously in another cluster, it is important to format the disks to remove any traces of data from the old cluster. Test Purposes Only: Using a Flat File for Storage When setting up a small cluster for evaluation purposes, if a particular node does not have physical disks or partitions available to dedicate to the cluster, you can use a flat file on an existing disk partition as the node's storage. Create at least a 16GB file, and include a path to the file in the disk list file for the script. disksetup The following example creates a 20 GB flat file ( specifies 1 gigabyte blocks, multiplied by ) at : bs=1G count=20 /root/storagefile $ dd if=/dev/zero of=/root/storagefile bs=1G count=20 Then, you would add the following to the disk list file to be used by : /tmp/disks.txt disksetup /root/storagefile fsck Filesystem check (fsck) is used to find and fix inconsistencies in the filesystem. Every storage pool has its own log to journal updates to the storage pool. All operations to a storage pool are done transactionally by journaling all operations to the log before they are applied to storage pool metadata. If a storage pool is not shut down cleanly, metadata can become inconsistent. To make the metadata consistent on the next load of the storage pool, replays the log to recover any data before it does any fsck check or repair. walks the storage pool in question to verify all MapR-FS metadata (and data correctness if specified on the command line), fsck and reports all potentially lost or corrupt containers, directories, tables, files, filelets, and blocks in the storage pool. The local visits every allocated block in the storage pool and recovers any blocks that are part of a corrupted or unconnected metadata fsck chain. fsck can be used on an offline storage pool after a node failure, after a disk failure, or after a MapR-FS process crash, or simply to verify the consistency of data for suspected software bugs. The script removes all data from the specified disks. Make sure you specify the disks correctly, and that any data you wish disksetup to keep has been backed up elsewhere. Typical process flow: Take the affected storage pools offline with the command. mrconfig sp offline Execute the command on the storage pools (or disks) as discussed below. fsck Bring the storage pools back online with the command. 
mrconfig sp online Execute the command on the cluster, volumes, or snapshots that were affected. gfsck The local command can be run in two modes: fsck Verification mode - will only report but not actually attempt to fix or modify any data on disk. can be run in verification mode fsck fsck on an offline storage pool at any time, and if it does not report any errors the storage pool can be brought online without any risk of data loss. Repair mode - will attempt to restore a bad storage pool. If the local is run in repair mode on a storage pool, some volumes fsck fsck might need a global fsck ( ) after bringing the storage pool online. There is potential for loss of data in this case. gfsck Syntax /opt/mapr/server/fsck [ -d ] [ <device-paths> | -n <sp name> ] [ -l <log filename> ] [ -p <MapR-FS port> ] [ -h ] [ -j ] [ -m <memory in MB> ] [ -d ] [ -r ] Parameters Parameter Description -d Perform a CRC on data blocks. By default, will not validate the fsck CRC of user data pages. Enabling this check can take quite a long time to finish. <device-paths> Paths to the disks that make up the storage pool. -n Storage pool name. This option works only if all the disks are in disk . Otherwise the user will have to individually specify all the disks tab that make up the storage pool, using the paramet <device-paths> er. -l The log filename; default /tmp/fsck.<pid>.out -p The MapR-FS port; default 5660 -h Help -j Skip log replay. Should be set only when log recovery fails. Log recovery can fail if the damaged blocks of a disk belong to the log, or if log recovery finds some CRC errors in the metadata blocks. *Using this parameter will typically lead to larger data loss. * -m Cache size for blocks (MB) -d Check data blocks CRC -r Run in repair mode. USE WITH CAUTION AS THIS CAN LEAD TO LOSS OF DATA. gfsck The (global filesystem check) command performs a scan and repair operation on a cluster, volume, or snapshot. gfsck Typical process flow Take the affected storage pools offline with the command. mrconfig sp offline Execute the command on the storage pools (or disks). fsck Bring the storage pools back online with the command. mrconfig sp online Execute the command on the cluster, volumes, or snapshots that were affected. gfsck Syntax /opt/mapr/bin/gfsck [ -h|--help ] [ -c|--clear ] [ -d|--debug ] [ -b|--dbcheck ] [ -r|--repair ] [ -y|--assume-yes ] [ cluster=<cluster name> ] [ rwvolume=<volume name> ] [ snapshot=<snapshot name> ] [ snapshotid=<snapshot-id> ] Parameters Parameter Description  -h --help Prints usage text.  -c --clear Clears previous warnings before performing the global filesystem check.  -d --debug Provides additional information in the output for debug purposes.  -b --dbcheck Checks that every key in a tablet is within that tablet's startKey and endKey range. This option is very IO intensive, and should only be used if database inconsistency is suspected.  -r --repair Indicates that repairs should be performed if needed.  -y --assume-yes Assumes that containers without valid copies (as reported by CLDB) can be deleted automatically. If this option is not specified, will gfsck pause for user input to verify that containers can be deleted - enter ye to delete, to exit , or to quit. s no gfsck ctrl-C cluster Name of the cluster (default: default cluster) rwvolume Name of the volume (default: null) snapshot Name of the snapshot (default: null) snapshotid The snapshot id (default: 0) Example (Debug mode) Execute the command on the read/write volume named with mode turned on. 
gfsck mapr.cluster.root debug /opt/mapr/bin/gfsck rwvolume=mapr.cluster.root -d Sample output is shown below. Starting GlobalFsck: clear-mode = false debug-mode = true dbcheck-mode = false repair-mode = false assume-yes-mode = false cluster = my.cluster.com rw-volume-name = mapr.cluster.root snapshot-name = null snapshot-id = 0 user-id = 0 group-id = 0 get volume properties ... rwVolumeName = mapr.cluster.root (volumeId = 205374230, rootContainerId = 2049, isMirror = false) put volume mapr.cluster.root in global-fsck mode ... get snapshot list for volume mapr.cluster.root ... starting phase one (get containers) for volume mapr.cluster.root(205374230) ... container 2049 (latestEpoch=3, fixedByFsck=false) got volume containers map done phase one starting phase two (get inodes) for volume mapr.cluster.root(205374230) ... get container inode list for cid 2049 +inodelist: fid=2049.32.131224 pfid=-1.16.2 typ=4 styp=0 nch=0 dMe:false dRec: false +inodelist: fid=2049.33.131226 pfid=-1.16.2 typ=2 styp=0 nch=0 dMe:false dRec: false +inodelist: fid=2049.34.131228 pfid=-1.33.131226 typ=4 styp=0 nch=0 dMe:false dRec: false +inodelist: fid=2049.35.131230 pfid=-1.16.2 typ=4 styp=0 nch=0 dMe:false dRec: false +inodelist: fid=2049.36.131232 pfid=-1.16.2 typ=4 styp=0 nch=0 dMe:false dRec: false +inodelist: fid=2049.38.262312 pfid=-1.16.2 typ=2 styp=0 nch=0 dMe:false dRec: false +inodelist: fid=2049.39.262314 pfid=-1.38.262312 typ=1 styp=0 nch=0 dMe:false dRec: false got container inode lists (totalThreads=1) done phase two starting phase three (get fidmaps & tabletmaps) for volume mapr.cluster.root(205374230) ... got fidmap lists (totalFidmapThreads=0) got tabletmap lists (totalTabletmapThreads=0) done phase three === Start of GlobalFsck Report === file-fidmap-filelet union -- 2049.39.262314:P --> primary (nchunks=0) --> AllOk no errors table-tabletmap-tablet union -- empty orphan directories -- none orphan kvstores -- none orphan files -- none orphan fidmaps -- none orphan tables -- none orphan tabletmaps -- none orphan dbkvstores -- none orphan dbfiles -- none orphan dbinodes -- none containers that need repair -- none incomplete snapshots that need to be deleted -- none user statistics -- containers = 1 directories = 2 kvstores = 0 files = 1 fidmaps = 0 filelets = 0 tables = 0 tabletmaps = 0 schemas = 0 tablets = 0 segmaps = 0 spillmaps = 0 overflowfiles = 0 bucketfiles = 0 spillfiles = 0 === End of GlobalFsck Report === remove volume mapr.cluster.root from global-fsck mode (ret = 0) ... GlobalFsck completed successfully (7142 ms); Result: verify succeeded mapr-support-collect.sh Collects information about a cluster's recent activity, to help MapR Support diagnose problems. The "mini-dump" option limits the size of the support output. When the or option is specified along with a size, -m --mini-dump mapr-support collects only a head and tail, each limited to the specified size, from any log file that is larger than twice the specified size. The -collect.sh total size of the output is therefore limited to approximately 2 * size * number of logs. 
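To make the size bound above concrete, here is a worked example with hypothetical numbers: running the script with -m 10m on a node that has 8 log files larger than 20 MB collects at most a 10 MB head and a 10 MB tail from each, so roughly 2 * 10 MB * 8 = 160 MB of log data in the output.

$ /opt/mapr/support/tools/mapr-support-collect.sh -n mysupport-output -m 10m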
The size can be specified in bytes, or using the following suffixes: b - bytes k - kilobytes (1024 bytes) m - megabytes (1024 kilobytes) Syntax /opt/mapr/support/tools/mapr-support-collect.sh [ -h|--hosts <host file> ] [ -H|--host <host entry> ] [ -Q|--no-cldb ] [ -n|--name <name> ] [ -d|--output-dir <path> ] [ -l|--no-logs ] [ -s|--no-statistics ] [ -c|--no-conf ] [ -i|--no-sysinfo ] [ -x|--exclude-cluster ] [ -u|--user <user> ] [ -K|--strict-hostkey ] [ -m|--mini-dump <size> ] [ -f|--filter <filter string> ] [ -L|--no-libraries [ -O|--online ] [ -p|--par <par> ] [ -t|--dump-timeout <dump timeout> ] [ -T|--scp-timeout <SCP timeout> ] [ -C|--cluster-timeout <cluster timeout> ] [ -y|--yes ] [ -S|--scp-port <SCP port> ] [ --collect-cores ] [ --move-cores ] [ --no-hadoop-logs ] [ --no-hbase-logs ] [ --port <port> ] [ --use-hostname ] [ --cldb <CLDB node> ] [ --port <port> ] [ -?|--help ] Parameters Parameter Description -h or --hosts A file containing a list of hosts. Each line contains one host entry, in the format [user@]host[:port] -H or --host One or more hosts in the format [user@]host[:port] -Q or --no-cldb If specified, the command does not query the CLDB for list of nodes -n or --name Specifies the name of the output file. If not specified, the default is a date-named file in the format YYYY-MM-DD-hh-mm-ss.tar -d or --output-dir The absolute path to the output directory. If not specified, the default is /opt/mapr/support/collect/ -l or --no-logs If specified, the command output does not include log files --no-hadoop-logs If specified, the command output does not include Hadoop log files. --no-hbase-logs If specified, the command output does not include HBase log files. -f or --filter <filter string> Use this option to specify a filter string. Support information is only collected for nodes with names that match the filter string. -L or --no-libraries If specified, the command output does not include libraries. -s or --no-statistics If specified, the command output does not include statistics -c or --no-conf If specified, the command output does not include configurations -i or --no-sysinfo If specified, the command output does not include system information -x or --exclude-cluster If specified, the command output does not collect cluster diagnostics -u or --user The username for ssh connections -K or --strict-hostkey If specified, checks for strict host key in the SSH connection. -m, --mini-dump <size> For any log file greater than 2 * <size>, collects only a head and tail each of the specified size. The <size> may have a suffix specifying units: b - blocks (512 bytes) k - kilobytes (1024 bytes) m - megabytes (1024 kilobytes) -O or --online Specifies a space-separated list of nodes from which to gather support output, and uses the warden instead of ssh for transmitting the support data. 
-p or --par The maximum number of nodes from which support dumps will be gathered concurrently (default: 10) -t or --dump-timeout The timeout for execution of the command on mapr-support-dump a node (default: 120 seconds or 0 = no limit) -T or --scp-timeout The timeout for copy of support dump output from a remote node to the local file system (default: 120 seconds or 0 = no limit) -C or --cluster-timeout The timeout for collection of cluster diagnostics (default: 300 seconds or 0 = no limit) -y or --yes If specified, the command does not require acknowledgement of the number of nodes that will be affected -S or --scp-port The local port to which remote nodes will establish an SCP session --collect-cores If specified, the command collects cores of running mfs processes from all nodes (off by default) --move-cores If specified, the command moves mfs and nfs cores from /opt/cores from all nodes (off by default) --use-hostname If specified, uses hostnames instead of IP address for SSH. --cldb <cldbnode> Use this option when the CLDB Service is down to point to a CLDB node. --port The port number used by FileServer (default: 5660) -? or --help Displays usage help text Examples Collect support information and dump it to the file /opt/mapr/support/collect/mysupport-output.tar: /opt/mapr/support/tools/mapr-support-collect.sh -n mysupport-output mapr-support-dump.sh This script collects node and cluster-level information about the node where the script is invoked. The information collected is used to help MapR Support diagnose problems. Use to collect diagnostic information from all nodes in the cluster. mapr-support-collect.sh The "mini-dump" option limits the size of the support output. When the or option is specified along with a size, -m --mini-dump mapr-support collects only a head and tail, each limited to the specified size, from any log file that is larger than twice the specified size. The total -dump.sh size of the output is therefore limited to approximately 2 * size * number of logs. The size can be specified in bytes, or using the following suffixes: b - bytes k - kilobytes (1024 bytes) m - megabytes (1024 kilobytes) Syntax /opt/mapr/support/tools/mapr-support-dump.sh [ -n|--name <name> ] [ -d|--output-dir <path> ] [ -l|--no-logs ] [ -s|--no-statistics ] [ -c|--no-conf ] [ -i|--no-sysinfo ] [ -o|--exclude-cluster ] [ -m|--mini-dump <size> ] [ -O|--online ] [ -z|--only-cluster [ -L|--no-libraries ] [ -A|--logs-age <days> ] [ --no-hadoop-logs ] [ --no-hbase-logs ] [ --collect-cores ] [ --move-cores ] [ --port <port> ] [--nfs-port <port> ] [ -?|--help ] Parameters Parameter Description -n or --name Specifies the name of the output file. If not specified, the default is a date-named file in the format YYYY-MM-DD-hh-mm-ss.tar -d or --output-dir The absolute path to the output directory. If not specified, the default is /opt/mapr/support/collect/ -l or --no-logs If specified, the command output does not include log files -s or --no-statistics If specified, the command output does not include statistics -c or --no-conf If specified, the command output does not include configurations -i or --no-sysinfo If specified, the command output does not include system information -o or --exclude-cluster If specified, the command output does not collect cluster diagnostics -m, --mini-dump <size> For any log file greater than 2 * <size>, collects only a head and tail each of the specified size. 
The <size> may have a suffix specifying units: b - blocks (512 bytes) k - kilobytes (1024 bytes) m - megabytes (1024 kilobytes) -O or --online Specifies a space-separated list of nodes from which to gather support output, and uses the warden instead of ssh for transmitting the support data. -z or --only-cluster Collects diagnostic information at the cluster level only. -L or --no-libraries Excludes libraries. -A or --logs-age <days> Collects logs newer than the specified number of days. The default value for this parameter is 7. Specify a value of 0 to have the mapr-s script collect logs of any age. upport-dump.sh --no-hadoop-logs Excludes Hadoop log files. --no-hbase-logs Excludes HBase log files. --collect-cores If specified, the command collects cores of running mfs processes from all nodes (off by default) --move-cores If specified, the command moves mfs and nfs cores from /opt/cores from all nodes (off by default) --port The port number used by the FileServer. The default value for this parameter is 5660. --nfs-port Specifies the port used by the NFS server. The default value for this parameter is 9998, -? or --help Displays usage help text Examples Collect support information and dump it to the file /opt/mapr/support/collect/mysupport-output.tar: /opt/mapr/support/tools/mapr-support-dump.sh -n mysupport-output mrconfig The commands let you create, remove, and manage storage pools, disk groups, and disks; and provide information about containers. mrconfig mrconfig dg This section discusses the commands that allow you to configure disk groups. mrconfig dg mrconfig dg create The commands let you create disk groups (after you initialize disks with the command and add mrconfig dg create mrconfig disk init them to the node with the command). mrconfig disk load You can create a disk group with one of two formats: Use the command to create a striped disk group with a format. mrconfig dg create raid0 RAID 0 Use the command to create a disk group (one disk after another). mrconfig dg create concat concatenated After you create disk groups you will be ready to on the disk groups. create storage pools See for instructions about running commands. mrconfig mrconfig mrconfig dg create concat The commands provide direct control and access to MapR-FS at a low level. If you are not careful, or do not know what you mrconfig are doing, you can irrevocably destroy valuable data. The command creates a concatenated disk group. When a disk group is created MapR assigns one of the mrconfig dg create concat disks as the device path of the disk group. After you create a disk group you will be ready to on the disk group. create a storage pool See for instructions about running commands. mrconfig mrconfig Syntax /opt/mapr/server/mrconfig dg create concat <path> Parameters Parameter Description path The device path of each of the disks to add to the disk group; example /dev/sdc /dev/sdd /dev/sde Examples Create a concatenated disk group on a local node /opt/mapr/server/mrconfig dg create concat /dev/sdc /dev/sdd /dev/sde mrconfig dg create raid0 The command creates a disk group striped for RAID 0. When the disk group is created MapR assigns one of the mrconfig dg create raid0 disks as the device path of the disk group. After you create a disk group you will be ready to on the disk group. create a storage pool See for instructions about running commands. 
mrconfig mrconfig Syntax /opt/mapr/server/mrconfig dg create raid0 [-d <stripeDepth>] <path> Parameters Parameter Description -h host IP address; default 127.0.0.1 -p The MapR-FS port; default 5660 -d The stripe depth in 8K blocs; default (1 MB) 128 path The device path of each of the disks to add to the disk group; example /dev/sdc /dev/sdd /dev/sde Examples Create a disk group striped for RAID 0 with a stripe depth of 24 on a local node /opt/mapr/server/mrconfig dg create raid0 -d 24 /dev/sdc /dev/sdd /dev/sde mrconfig dg help The command displays online help for disk group commands. mrconfig dg help See for instructions about running commands. mrconfig mrconfig Syntax /opt/mapr/server/mrconfig dg help Examples Display online help for   commands on a local node mrconfig dg /opt/mapr/server/mrconfig dg help mrconfig dg list The   command lists the disk groups on all the MapR-FS disks on a node. mrconfig dg list See for instructions about running commands. mrconfig mrconfig Syntax /opt/mapr/server/mrconfig dg list Examples List the disk groups on all the MapR-FS disks on localhost /opt/mapr/server/mrconfig disk list mrconfig info The   commands provide information about memory, threads, volumes, containers and other information about the MapR mrconfig info filesystem. See for instructions about running commands. mrconfig mrconfig mrconfig info containerchain The   command displays the containerchain for a given container. Example: mrconfig info containerchain $ /opt/mapr/server/mrconfig info containerchain 2050 Container 2050 prev 256000049 next 0. Container 256000049 prev 0 next 2050. See for instructions on running commands. mrconfig mrconfig Syntax mrconfig [-h <host>] [-p <port>] info containerchain <cid> <cid> Parameters Parameter Description -h host IP address; default 127.0.0.1 -p The MapR-FS port; default 5660 cid The container identifier Tip: Use the command to find the container identifiers on a node. mrconfig info dumpcontainers Examples Find the containerchain for a container with a cid of 2049 on a local node /opt/mapr/server/mrconfig info containerchain 2049 Find the containerchain for a container with a cid of 2049 on a remote node with an IP address of xx.xx.xx.xx /opt/mapr/server/mrconfig -h xx.xx.xx.xx info containerchain 2049 mrconfig info containerlist The   command lists read/write container IDs for a specified volume. Example: mrconfig info containerlist $ /opt/mapr/server/mrconfig info containerlist volume1 Volume containers 2050 See for instructions about running commands. mrconfig mrconfig Syntax /opt/mapr/server/mrconfig [-h <host>] [-p <port>] info containerlist <volName> Parameters Parameter Description -h host IP address; default 127.0.0.1 -p The MapR-FS port; default 5660 volName The name of the volume Tips: You can see the names of volumes using: The Volumes view in the MapR-FS group in the MapR Control System. The   command. maprcli volume list Examples Display information about the containers in a volume named on a local node marketing /opt/mapr/server/mrconfig info containerlist marketing Display information about the containers on a volume named on a remote node with an IP address of xx.xx.xx.xx marketing /opt/mapr/server/mrconfig -h xx.xx.xx.xx info containerlist marketing mrconfig info containers The   command displays information about containers. 
mrconfig info containers

The mrconfig info containers command displays information about containers. Example:

$ /opt/mapr/server/mrconfig info containers rw
RW containers: 1 2049 2050
$ /opt/mapr/server/mrconfig info containers resync
$ /opt/mapr/server/mrconfig info containers snapshot
Snapshot containers: 256000049

See mrconfig for instructions about running mrconfig commands.

Syntax

/opt/mapr/server/mrconfig [-h <host>] [-p <port>] info containers <container-type> [path]

Parameters

Parameter  Description
-h  Host IP address; default 127.0.0.1
-p  The MapR-FS port; default 5660
container-type  When specified, lists only containers of the specified type. Possible values: rw, resync, snapshot
path  The path to a storage pool (obtained with mrconfig sp list). When specified, lists only containers on the specified storage pool.

Examples

Display a list of read/write containers on a local node

/opt/mapr/server/mrconfig info containers rw

Display a list of read/write containers on a remote node with an IP address of xx.xx.xx.xx

/opt/mapr/server/mrconfig -h xx.xx.xx.xx info containers rw

mrconfig info dumpcontainers

The mrconfig info dumpcontainers command displays information about containers, including container identifiers, volume identifiers, storage pools, etc. Example:

$ /opt/mapr/server/mrconfig info dumpcontainers
cid:1 volid:1 sp:SP1:/tmp/mapr-scratch/v2.0/clst.TestVol.5660.img spid:f0bbcc261673adf800505808f20016fe prev:0 next:0 issnap:0 isclone:0 deleteinprog:0 fixedbyfsck:0 stale:0 querycldb:0 resyncinprog:0 shared:0 owned:35 logical:35 snapusage:0 snapusageupdated:1
cid:65 volid:0 sp:SP1:/tmp/mapr-scratch/v2.0/clst.TestVol.5660.img spid:f0bbcc261673adf800505808f20016fe prev:0 next:0 issnap:0 isclone:0 deleteinprog:0 fixedbyfsck:0 stale:0 querycldb:0 resyncinprog:0 shared:0 owned:8 logical:8 snapusage:0 snapusageupdated:0
cid:2049 volid:177916405 sp:SP1:/tmp/mapr-scratch/v2.0/clst.TestVol.5660.img spid:f0bbcc261673adf800505808f20016fe prev:0 next:0 issnap:0 isclone:0 deleteinprog:0 fixedbyfsck:0 stale:0 querycldb:0 resyncinprog:0 shared:0 owned:10 logical:10 snapusage:0 snapusageupdated:1
cid:2050 volid:37012938 sp:SP1:/tmp/mapr-scratch/v2.0/clst.TestVol.5660.img spid:f0bbcc261673adf800505808f20016fe prev:256000049 next:0 issnap:0 isclone:0 deleteinprog:0 fixedbyfsck:0 stale:0 querycldb:0 resyncinprog:0 shared:2 owned:9 logical:11 snapusage:10 snapusageupdated:1
cid:256000049 volid:37012938 sp:SP1:/tmp/mapr-scratch/v2.0/clst.TestVol.5660.img spid:f0bbcc261673adf800505808f20016fe prev:0 next:2050 issnap:1 isclone:0 deleteinprog:0 fixedbyfsck:0 stale:0 querycldb:0 resyncinprog:0 shared:0 owned:10 logical:10 snapusage:0 snapusageupdated:0

See mrconfig for instructions about running mrconfig commands.

Syntax

/opt/mapr/server/mrconfig [-h <host>] [-p <port>] info dumpcontainers

Parameters

Parameter  Description
-h  Host IP address; default 127.0.0.1
-p  The MapR-FS port; default 5660

Examples

Display information about containers on a local node

/opt/mapr/server/mrconfig info dumpcontainers

Display information about containers on a remote node with an IP address of xx.xx.xx.xx

/opt/mapr/server/mrconfig -h xx.xx.xx.xx info dumpcontainers
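Based on the sample output above, where every container record includes an sp:<name>:<path> field, a rough tally of containers per storage pool on a node might look like the sketch below; the parsing relies on that output format, which is an assumption rather than a documented interface.

# Count containers per storage pool by tallying the sp: field of each dumpcontainers record
/opt/mapr/server/mrconfig info dumpcontainers | grep -o 'sp:[^ ]*' | sort | uniq -c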
mrconfig info fsstate

The mrconfig info fsstate command displays information about the status of the MapR filesystem, for example whether or not storage pools are loaded.

See mrconfig for instructions about running mrconfig commands.

Syntax

/opt/mapr/server/mrconfig [-h <host>] [-p <port>] info fsstate

Parameters

Parameter  Description
-h  Host IP address; default 127.0.0.1
-p  The MapR-FS port; default 5660

Examples

Display information about the state of the MapR filesystem on a local node

/opt/mapr/server/mrconfig info fsstate

Display information about the state of the MapR filesystem on a remote node with an IP address of xx.xx.xx.xx

/opt/mapr/server/mrconfig -h xx.xx.xx.xx info fsstate

mrconfig info fsthreads

The mrconfig info fsthreads command displays information about threads running on MapR-FS disks on a node.

See mrconfig for instructions about running mrconfig commands.

Syntax

/opt/mapr/server/mrconfig [-h <host>] [-p <port>] info fsthreads

Parameters

Parameter  Description
-h  Host IP address; default 127.0.0.1
-p  The MapR-FS port; default 5660

Examples

Display information about MapR-FS threads on a local node

/opt/mapr/server/mrconfig info fsthreads

Display information about MapR-FS threads on a remote node with an IP address of xx.xx.xx.xx

/opt/mapr/server/mrconfig -h xx.xx.xx.xx info fsthreads

mrconfig info orphanlist

The mrconfig info orphanlist command displays information about a container's orphans.

See mrconfig for instructions about running mrconfig commands.

Syntax

/opt/mapr/server/mrconfig [-h <host>] [-p <port>] info orphanlist <cid>

Parameters

Parameter  Description
-h  Host IP address; default 127.0.0.1
-p  The MapR-FS port; default 5660
cid  The container identifier

Tip: Use the mrconfig info dumpcontainers command to find the container identifiers on a node.

Examples

Display information about the orphans of a container with an identifier of 2049 on a local node

/opt/mapr/server/mrconfig info orphanlist 2049

Display information about the orphans of a container with an identifier of 2049 on a remote node with an IP address of xx.xx.xx.xx

/opt/mapr/server/mrconfig -h xx.xx.xx.xx info orphanlist 2049

mrconfig info replication

The mrconfig info replication command displays information about container replication.

See mrconfig for instructions about running mrconfig commands.

Syntax

/opt/mapr/server/mrconfig [-h <host>] [-p <port>] info replication

Parameters

Parameter  Description
-h  Host IP address; default 127.0.0.1
-p  The MapR-FS port; default 5660

Examples

Display information about container replication on a local node

/opt/mapr/server/mrconfig info replication

Display information about container replication on a remote node with an IP address of xx.xx.xx.xx

/opt/mapr/server/mrconfig -h xx.xx.xx.xx info replication

mrconfig info slabs

The mrconfig info slabs command displays a report about memory usage. This report is sometimes used for troubleshooting by MapR customer support and is typically not used by customers.

See mrconfig for instructions about running mrconfig commands.

Syntax

/opt/mapr/server/mrconfig [-h <host>] [-p <port>] info slabs

Parameters

Parameter  Description
-h  Host IP address; default 127.0.0.1
-p  The MapR-FS port; default 5660

Examples

Display information about memory usage on a local node

/opt/mapr/server/mrconfig info slabs

Display information about memory usage on a remote node with an IP address of xx.xx.xx.xx

/opt/mapr/server/mrconfig -h xx.xx.xx.xx info slabs
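Because the info commands above accept -h <host>, a quick status sweep across several nodes can be scripted as in this sketch; the IP addresses are placeholders for nodes reachable from wherever the script runs.

# Gather filesystem state and replication information from a list of nodes
for node in 10.10.30.11 10.10.30.12 10.10.30.13; do
  echo "==== $node ===="
  /opt/mapr/server/mrconfig -h $node info fsstate
  /opt/mapr/server/mrconfig -h $node info replication
done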
mrconfig info threads

The mrconfig info threads command displays information about threads running on MapR-FS.

See mrconfig for instructions about running mrconfig commands.

Syntax

/opt/mapr/server/mrconfig [-h <host>] [-p <port>] info threads

Parameters

Parameter  Description
-h  Host IP address; default 127.0.0.1
-p  The MapR-FS port; default 5660

Examples

Display information about MapR-FS threads on a local node

/opt/mapr/server/mrconfig info threads

Display information about MapR-FS threads on a remote node with an IP address of xx.xx.xx.xx

/opt/mapr/server/mrconfig -h xx.xx.xx.xx info threads

mrconfig info volume snapshot

The mrconfig info volume snapshot command displays information about volume snapshots.

Note: Snapshots and this command require an upgrade to a MapR M5 license if you don't already have one.

See mrconfig for instructions about running mrconfig commands.

Syntax

/opt/mapr/server/mrconfig [-h <host>] [-p <port>] info volume snapshot <volName> <snapName>

Parameters

Parameter  Description
-h  Host IP address; default 127.0.0.1
-p  The MapR-FS port; default 5660
volName  The name of the volume
snapName  The name of the snapshot

Tips: To find volume and snapshot names:
Navigate to the Volume view and the Snapshot view respectively in the MapR-FS group in the MapR Control System, or
Execute the maprcli volume snapshot list command, which creates a report that displays volume names and snapshot names.

Examples

Display information about snapshot "snap-2012-01-01" of volume "myVolume" on a local node

/opt/mapr/server/mrconfig info volume snapshot myVolume snap-2012-01-01

Display information about snapshot "snap-2012-01-01" of volume "myVolume" on a remote node with an IP address of xx.xx.xx.xx

/opt/mapr/server/mrconfig -h xx.xx.xx.xx info volume snapshot myVolume snap-2012-01-01

mrconfig sp

The mrconfig sp commands create and control storage pools. Storage pools are created on disk groups, so disk groups must be created before storage pools can be created.

MapR-FS reads and writes data (and metadata) to and from logical storage units called volumes. Volumes store data in containers in storage pools. Initially storage pools don't have any containers; the containers are automatically created for a volume as needed. When a container is created it is assigned a container identifier (cid). Storage pools aren't associated with any particular volume; storage pools may hold containers for multiple volumes. Large files may be distributed across multiple containers, and therefore across multiple storage pools. Data replication happens at the container level. Data cannot be written directly to containers; a volume is required. You can create volumes in one of two ways:

Click the New Volume button in the Volumes view of the MapR-FS group in the MapR Control System, or
Execute the maprcli volume create command.

See mrconfig for instructions about running mrconfig commands.

mrconfig sp help

The mrconfig sp help command displays the online help for storage pool commands.

See mrconfig for instructions on running mrconfig commands.

Syntax

mrconfig sp help

Examples

Display the online help for storage pools on a local node

/opt/mapr/server/mrconfig sp help

mrconfig sp list

The mrconfig sp list command displays information about storage pools, including the name, size, free space, and path of each storage pool, whether each storage pool is online or offline, and the total number of storage pools.

See mrconfig for instructions on running mrconfig commands.

Syntax

mrconfig sp list [-v] [sp path]

Parameters

Parameter  Description
path  The device path of the storage pool. If not specified, information about all storage pools is displayed. If specified, only information about the specified storage pool is displayed; example /dev/sdc
-v  Print sp and cluster GUID information and storage pool log (journal) size

Examples

Display information about all storage pools on a local node

/opt/mapr/server/mrconfig sp list

Display information about a storage pool with a path of /dev/sdc on a local node

/opt/mapr/server/mrconfig sp list /dev/sdc
mrconfig sp load

The mrconfig sp load command loads all of the disks associated with a storage pool.

See mrconfig for instructions on running mrconfig commands.

Syntax

mrconfig sp load <sp name>

Parameters

Parameter  Description
sp name  The name of the storage pool; example SP2

Tips:
Use the mrconfig sp list command to see storage pool names (examples SP1, SP2) and the device paths of the storage pools (example /dev/sdc).
Use the mrconfig disk list command to see storage pool names (examples SP1, SP2), the device paths of the storage pools (example /dev/sdc), and the disks associated with each storage pool (examples /dev/sdc, /dev/sdd, /dev/sde).

Examples

Load the disks associated with the storage pool named SP2 on the local node

/opt/mapr/server/mrconfig sp load SP2

mrconfig sp make

The mrconfig sp make command creates a storage pool on a concat disk group.

Warning: Creating a Storage Pool Causes Data Loss. Creating a storage pool on a disk group destroys the data on the disks in the disk group, so be sure that all data on the disks in the disk group is backed up and replicated before creating a storage pool.

See mrconfig for instructions on running mrconfig commands.

Example

1. Assume the disks /dev/sdb, /dev/sdc, and /dev/sdd are available; initialize them with mrconfig disk init:

/opt/mapr/server/mrconfig disk init /dev/sdb
/opt/mapr/server/mrconfig disk init /dev/sdc
/opt/mapr/server/mrconfig disk init /dev/sdd

2. Create a disk group with mrconfig dg create:

/opt/mapr/server/mrconfig dg create raid0 -d 128 /dev/sdb /dev/sdc /dev/sdd

3. At this point, you can use mrconfig dg list to see the layout of the disk group, and which disk is the primary disk. The primary disk can be used in other commands to refer to the disk group as a whole. Example:

/opt/mapr/server/mrconfig dg list

4. From the disk group, create a storage pool with mrconfig sp make:

/opt/mapr/server/mrconfig sp make /dev/sdb

Syntax

mrconfig sp make <dg path> [-P <yes/no>] [-l <LogSize>] [-s <deviceSize>] [-L <Label>] [-F] [-I <cid>]

Parameters

Parameter  Description
-P  Primary partition or not; yes/no
-l  Log size in number of blocks. Note that this is a lowercase letter "l" (ell), not the number "1".
-s  Disk size in GB
-L  Label for this storage pool
-F  Force the overwrite of any existing storage pool
-I  Initialize the storage pool with one container with the specified container identifier, one directory, and one file. Note that this is an uppercase letter "I" (eye), not the letter "l" (ell) or the number "1".
dg path  The device path of the disk group; example /dev/sdc

Examples

Create a storage pool on a disk group with a path of /dev/sdc on a local node

/opt/mapr/server/mrconfig sp make /dev/sdc
mrconfig sp offline

The mrconfig sp offline command takes a loaded storage pool offline. When a storage pool is offline it remains loaded into memory, but it is not available to MapR-FS for reads and writes.

The main use of the mrconfig sp offline command is to take a storage pool offline so the fsck (filesystem check) command can be run on one or more disks or storage pools if there are lost or corrupt containers, directories, tables, files, filelets, or blocks. After running fsck, the storage pool is brought back online with the mrconfig sp online command, and then typically the gfsck (global filesystem check) command would be run on the affected cluster, volumes, or snapshots.

See mrconfig for instructions on running mrconfig commands.

Syntax

mrconfig sp offline <sp path>

Parameters

Parameter  Description
sp path  The device path of the storage pool; example /dev/sdc

Examples

Offline a loaded storage pool with a path of /dev/sdc on localhost

/opt/mapr/server/mrconfig sp offline /dev/sdc

mrconfig sp offline all

The mrconfig sp offline all command takes all of a node's loaded storage pools offline. When a storage pool is offline it remains loaded into memory, but it is not available to MapR-FS for reads and writes.

The main use of the mrconfig sp offline all command is to take all storage pools on a node offline so the fsck (filesystem check) command can be run on disks or storage pools if there are lost or corrupt containers, directories, tables, files, filelets, or blocks. After running fsck, the storage pools are brought back online with the mrconfig sp online command, and then typically the gfsck (global filesystem check) command would be run on the affected cluster, volumes, or snapshots.

See mrconfig for instructions on running mrconfig commands.

Syntax

mrconfig sp offline all

Examples

Offline all storage pools on a local node

/opt/mapr/server/mrconfig sp offline all

mrconfig sp online

The mrconfig sp online command brings an offline storage pool online.

When a storage pool is taken offline with the mrconfig sp offline command, the storage pool is not available for reads and writes. Typically this is done so the fsck (filesystem check) command can be run to check for or repair filesystem inconsistencies.

After the storage pool is put back online with the mrconfig sp online command, the storage pool is once again available for reads and writes, and the gfsck (global filesystem check) command can be run on the affected cluster, volumes, or snapshots.

See mrconfig for instructions on running mrconfig commands.

Syntax

mrconfig sp online <sp path>

Parameters

Parameter  Description
sp path  The device path of the storage pool; example /dev/sdc

Examples

Online a storage pool with a path of /dev/sdc on a local node

/opt/mapr/server/mrconfig sp online /dev/sdc

mrconfig sp refresh

The mrconfig sp refresh command reloads the disktab file and adds any new disks to MapR-FS.

See mrconfig for instructions on running mrconfig commands.

Syntax

mrconfig sp refresh

Examples

Refresh the storage pools on the local node

/opt/mapr/server/mrconfig sp refresh

mrconfig sp shutdown

The mrconfig sp shutdown command offlines all storage pools and stops the MapR filesystem on their disks.

See mrconfig for instructions on running mrconfig commands.

Syntax

mrconfig sp shutdown

Examples

Offline all storage pools on the local node and stop MapR-FS on their disks

/opt/mapr/server/mrconfig sp shutdown
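A typical offline-check-online cycle for a single storage pool, following the sequence described above, might look like the sketch below. The fsck invocation itself is not documented in this section, so it appears only as a placeholder comment; /dev/sdc is a placeholder device path.

# Take the storage pool offline so it is no longer available for reads and writes
/opt/mapr/server/mrconfig sp offline /dev/sdc
# Run the fsck (filesystem check) tool against the storage pool here
# (see the fsck documentation for the exact command and options)
# Bring the storage pool back online, then run gfsck on the affected cluster, volumes, or snapshots
/opt/mapr/server/mrconfig sp online /dev/sdc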
mrconfig sp unload

The mrconfig sp unload command unloads all of the disks associated with a storage pool.

See mrconfig for instructions on running mrconfig commands.

Syntax

mrconfig sp unload <sp name>

Parameters

Parameter  Description
sp name  The name of the storage pool; example SP2

Tips:
Use the mrconfig sp list command to see storage pool names (examples SP1, SP2) and the device paths of the storage pools (example /dev/sdc).
Use the mrconfig disk list command to see storage pool names (examples SP1, SP2), the device paths of the storage pools (example /dev/sdc), and the disks associated with each storage pool (examples /dev/sdc, /dev/sdd).

Examples

Unload the disks associated with the storage pool named SP2 on a local node

/opt/mapr/server/mrconfig sp unload SP2

mrconfig disk help

The mrconfig disk help command displays the help text for mrconfig disk commands.

See mrconfig for instructions about running mrconfig commands.

Syntax

/opt/mapr/server/mrconfig disk help

Example

Display the help text for mrconfig disk commands on a local node

/opt/mapr/server/mrconfig disk help

mrconfig disk init

The mrconfig disk init command initializes a disk and formats it for the MapR filesystem. After executing the mrconfig disk init command, add the disk to the node with the mrconfig disk load command.

See mrconfig for instructions about running mrconfig commands.

Tip: To initialize, format, and load one or more disks in one step, use:
The Add Disk(s) to MapR-FS button in the MapR Control System, or
The maprcli disk add command.

Syntax

/opt/mapr/server/mrconfig disk init <path> [-F]

Parameters

Parameter  Description
-F  Forces formatting of the disk for MapR-FS, regardless of prior formatting or existing data.
path  The device path of the disk; example /dev/sdc

Examples

Initialize a disk for MapR-FS on a local node

/opt/mapr/server/mrconfig disk init /dev/sdc

Initialize and format a disk for MapR-FS on a local node

/opt/mapr/server/mrconfig disk init -F /dev/sdc

Initialize and format a disk for MapR-FS on a remote node with an IP address of xx.xx.xx.xx

/opt/mapr/server/mrconfig -h xx.xx.xx.xx disk init -F /dev/sdc

Warning: Initializing a Disk Causes Data Loss. Initializing a disk destroys the data on the disk, so be sure that all data on a disk is backed up and replicated before initializing the disk.

mrconfig disk list

The mrconfig disk list command lists all of the disks on a node that have a MapR filesystem. It also shows information about the disk groups and storage pools on the node, including whether or not the storage pools are online.

See mrconfig for instructions about running mrconfig commands.

Tip: To list system disks and other available disks on a node in addition to MapR-FS disks, use:
The MapR-FS and Available Disks and System Disks sections of the MapR Control System, or
The maprcli disk list command.

Syntax

/opt/mapr/server/mrconfig disk list [<path>]

Parameters

Parameter  Description
path  The path of the disk; if not included, shows information about all disks on the node; if included, only shows information about the specified disk; example /dev/sdc

Examples

List information about all MapR-FS disks on a local node

/opt/mapr/server/mrconfig disk list

List information about MapR-FS disk /dev/sdc on a local node

/opt/mapr/server/mrconfig disk list /dev/sdc
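As a purely illustrative safeguard (not part of the documented procedure), you can consult the node's disktab file, which lists the disks already in use by MapR-FS, before initializing a device; /dev/sdc is a placeholder.

# Only initialize the device if it is not already listed in /opt/mapr/conf/disktab
if grep -q "^/dev/sdc " /opt/mapr/conf/disktab; then
  echo "/dev/sdc is already part of MapR-FS"
else
  /opt/mapr/server/mrconfig disk init -F /dev/sdc
fi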
mrconfig disk load

After initializing a disk with the mrconfig disk init command, load the disk into memory with the mrconfig disk load command.

See mrconfig for instructions about running mrconfig commands.

Syntax

/opt/mapr/server/mrconfig disk load <path>

Parameters

Parameter  Description
path  The device path of the disk; example /dev/sdc

Examples

Load a disk on a local node

/opt/mapr/server/mrconfig disk load /dev/sdc

mrconfig disk remove

The mrconfig disk remove command removes a disk from MapR-FS. A disk cannot be removed unless its storage pool is offline. The mrconfig disk remove command is typically used when replacing a failed disk on a node.

In the following example, one of three disks in a storage pool has failed, and the storage pool has gone offline. To remove a disk with mrconfig disk remove:

1. Ensure that the data on the surviving disks is backed up/replicated.
2. Remove the failed disk from the node's disktab with the mrconfig disk remove command.
3. Physically remove the failed disk.
4. Physically attach the replacement disk.
5. Run the mrconfig disk init command on the replacement disk and on the other two disks that were in the disk group.
6. Run the mrconfig disk load command on each of the three disks.
7. Use the mrconfig dg create command to create a new disk group with the three disks.
8. Use the mrconfig sp make command to create a storage pool on the new disk group.

Syntax

/opt/mapr/server/mrconfig disk remove [<path>]

Warning: Removing a Disk Causes Data Loss. Removing a disk destroys the data on the disk, so be sure that all data on a disk is backed up and replicated before removing a disk.

Parameters

Parameter  Description
path  The device path of the disk; example /dev/sdc

Examples

Remove a disk from a local node

/opt/mapr/server/mrconfig disk remove /dev/sdc
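The eight-step replacement procedure above can be written out as a shell sketch. It assumes the failed disk was /dev/sdc, the other two members of the disk group were /dev/sdb and /dev/sdd, and the replacement appears under the same device path; adjust the names for your hardware, and back up or replicate the surviving data first (step 1).

# Step 2: remove the failed disk from the node's disktab
/opt/mapr/server/mrconfig disk remove /dev/sdc
# Steps 3-4: physically swap the failed disk for the replacement, then continue
# Step 5: initialize the replacement disk and the other two disks from the disk group
/opt/mapr/server/mrconfig disk init /dev/sdb
/opt/mapr/server/mrconfig disk init /dev/sdc
/opt/mapr/server/mrconfig disk init /dev/sdd
# Step 6: load each of the three disks
/opt/mapr/server/mrconfig disk load /dev/sdb
/opt/mapr/server/mrconfig disk load /dev/sdc
/opt/mapr/server/mrconfig disk load /dev/sdd
# Step 7: create a new disk group from the three disks
/opt/mapr/server/mrconfig dg create concat /dev/sdb /dev/sdc /dev/sdd
# Step 8: create a storage pool on the new disk group
/opt/mapr/server/mrconfig sp make /dev/sdb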
pullcentralconfig

The /opt/mapr/server/pullcentralconfig script on each node pulls master configuration files from /var/mapr/configuration on the cluster to the local disk:

If the master configuration file is newer, the local copy is overwritten by the master copy.
If the local configuration file is newer, no changes are made to the local copy.

The mapr.configuration volume (normally mounted at /var/mapr/configuration) contains directories with master configuration files:

Configuration files in the default directory are applied to all nodes.
To specify custom configuration files for individual nodes, create directories corresponding to individual hostnames. For example, the configuration files in a directory named /var/mapr/configuration/nodes/host1.r1.nyc would only be applied to the machine with the hostname host1.r1.nyc.

The following parameters in warden.conf control whether central configuration is enabled, the path to the master configuration files, and how often pullcentralconfig runs:

centralconfig.enabled - Specifies whether to enable central configuration.
pollcentralconfig.interval.seconds - The frequency to check for configuration updates, in seconds.

rollingupgrade.sh

Upgrades a MapR cluster to a specified version of the MapR software, or to a specific set of MapR packages. By default, any node on which upgrade fails is rolled back to the previous version. To disable rollback, use the -n option. To force installation regardless of the existing version on each node, use the -r option.

For more information about using rollingupgrade.sh, see the Upgrade Guide.

Syntax

/opt/upgrade-mapr/rollingupgrade.sh [-c <cluster name>] [-d] [-h] [-i <identity file>] [-n] [-p <directory>] [-r] [-s] [-u <username>] [-v <version>] [-x]

Parameters

Parameter  Description
-c  Cluster name.
-d  If specified, performs a dry run without upgrading the cluster.
-h  Displays help text.
-i  Specifies an identity file for SSH. See the SSH man page.
-n  Specifies that the node should not be rolled back to the previous version if upgrade fails.
-p  Specifies a directory containing the upgrade packages.
-r  Specifies reinstallation of packages even on nodes that are already at the target version.
-s  Specifies SSH to upgrade nodes.
-u  A username for SSH.
-v  The target upgrade version, using the x.y.z format to specify the major, minor, and revision numbers. Example: 1.2.0
-x  Specifies that packages should be copied to nodes via SCP.

Environment Variables

The following table describes environment variables specific to MapR. Note that environment variables must be set in $MAPR_HOME/conf/env.sh.

Variable  Example Values  Description
JAVA_HOME  /usr/lib/jvm/java-6-sun  The directory where the correct version of Java is installed.
MAPR_HOME  /opt/mapr (default)  The directory in which MapR is installed.
MAPR_SUBNETS  10.10.123/24,10.10.124/24  If you do not want MapR to use all NICs on each node, use the MAPR_SUBNETS environment variable to restrict MapR traffic to specific NICs. Set MAPR_SUBNETS to a comma-separated list of up to four subnets in CIDR notation with no spaces. If MAPR_SUBNETS is not set, MapR uses all NICs present on the node. When MAPR_SUBNETS is set, make sure the node can reach all nodes in the cluster (servers and clients) using the specified subnets.
MAPR_USER  mapr (default)  Used with configure.sh to specify the user under which MapR runs its services. If not explicitly set, it defaults to the user mapr. After configure.sh is run, the value is stored in daemon.conf.

Sample env.sh file

export JAVA_HOME=/usr/lib/jvm/java-6-sun
export MAPR_HOME=/opt/mapr
export MAPR_SUBNETS=10.10.123/24,10.10.124/24
export MAPR_USER=mapr

Configuration Files

This guide contains reference information about the following configuration files:

.dfs_attributes - Controls compression and chunk size for each directory
cldb.conf - Specifies configuration parameters for the CLDB and cluster topology
core-site.xml - Specifies the default filesystem
daemon.conf - Specifies the user and group that MapR services run as
disktab - Lists the disks in use by MapR-FS
exports - Lists NFS exports
hadoop-metrics.properties - Specifies where to output service metric reports
mapr-clusters.conf - Specifies the CLDB nodes for one or more clusters that can be reached from the node or client
mapred-default.xml - Contains MapReduce default settings that can be overridden using mapred-site.xml. Not to be edited directly by users.
mapred-site.xml - Core MapReduce settings
mfs.conf - Specifies parameters about the MapR-FS server on each node
The Roles File - Defines the configuration of services and nodes at install time
taskcontroller.cfg - Specifies TaskTracker configuration parameters
warden.conf - Specifies parameters related to MapR services and the warden. Not to be edited directly by users.
zoo.cfg - Specifies ZooKeeper configuration parameters

.dfs_attributes

Each directory in MapR storage contains a hidden file called .dfs_attributes that controls compression and chunk size. To change these attributes, change the corresponding values in the file. Example:

# lines beginning with # are treated as comments
Compression=lz4
ChunkSize=268435456

Valid values:

Compression: lz4, lzf, zlib, or false
Chunk size (in bytes): a multiple of 65536 (64 K) or zero (no chunks). Example: 131072

You can also set compression and chunk size using the hadoop mfs command.

cldb.conf

The /opt/mapr/conf/cldb.conf file specifies configuration parameters for the CLDB and for cluster topology.
/opt/mapr/conf/cldb.conf Field Default Description cldb.containers.cache.entries 1000000 The maximum number of read/write containers available in the CLDB cache. cldb.default.topology /data The default topology for newly-created volumes. cldb.detect.dup.hostid.enabled false When true, CLDB will disable nodes with all duplicate hostid, including new nodes that try to register with duplicate hostid the and existing node. Alarm NODE_ALARM_DUPLI is raised. This case requires CATE_HOSTID admin intervention to correct the hostid confusion. If duplicate hostid occurs on nodes running CLDB, the cluster may fail to start in which case the alarm will not get raised, but the file in cldb.log /opt/mapr will contain an error message. /logs/ cldb.min.fileservers 1 Number of fileservers that must register with the CLDB before the root volume is created cldb.numthreads 10 The number of threads reserved for use by the CLDB. cldb.port 7222 The port on which the CLDB listens. cldb.v2.features.enabled 1 Enables new features added in MapR version 2.0. Used only during the upgrade process from v1.x to 2.x to control when new features become active. Once enabled, cannot be disabled. cldb.v3.features.enabled 1 Enables new features added in MapR version 3.0. Used only during the upgrade process from a pre-3.0 version to control when new features become active. Once enabled, cannot be disabled. net.topology.script.file.name   The path to a script that associates IP addresses with physical topology paths. The script takes the IP address of a single node as input and returns the physical topology that should be associated with the specified node. This association is used only at the time a node is initially added to the cluster. To change topology for nodes already in the cluster, use the com maprcli node move mand. net.topology.table.file.name   The path to a text file that associates IP addresses with physical topology paths. Each line of the text file is of format "<hostname/ip> <rack>", with the IP address or hostname of one node, followed by the topology to associate with the node. This association is used only at the time a node is initially added to the cluster. To change topology for nodes already in the cluster, use the command. maprcli node move cldb.web.port 7221 The port the CLDB uses for the webserver. cldb.zookeeper.servers   The nodes that are running ZooKeeper, in the format . \<host:port\> hadoop.version   The version of Hadoop supported by the cluster. cldb.jmxremote.port 7220 The CLDB JMX remote port cldb.ignore.stale.zk false When this setting is , the CLDB ignores true the ZooKeeper's information regarding the most recent copy of CLDB data. Change this setting to when the ZooKeeper true information is stale. Restart the CLDB with this setting. After the CLDB starts, change the setting back to then restart the false CLDB again. Example cldb.conf file Only change this setting on CLDB nodes that are known to have the most recent copy of the CLDB data. Shut down all CLDB processes before changing this variable. # # CLDB Config file. # Properties defined in this file are loaded during startup # and are valid for only CLDB which loaded the config. # These parameters are not persisted anywhere else. 
# # Wait until minimum number of fileserver register with # CLDB before creating Root Volume cldb.min.fileservers=1 # CLDB listening port cldb.port=7222 # Number of worker threads cldb.numthreads=10 # CLDB webport cldb.web.port=7221 # Disable duplicate hostid detection cldb.detect.dup.hostid.enabled=false # Number of RW containers in cache #cldb.containers.cache.entries=1000000 # # Topology script to be used to determine # Rack topology of node # Script should take an IP address as input and print rack path # on STDOUT. eg # $>/home/mapr/topo.pl 10.10.10.10 # $>/mapr-rack1 # $>/home/mapr/topo.pl 10.10.10.20 # $>/mapr-rack2 #net.topology.script.file.name=/home/mapr/topo.pl # # Topology mapping file used to determine # Rack topology of node # File is of a 2 column format (space separated) # 1st column is an IP address or hostname # 2nd column is the rack path # Line starting with '#' is a comment # Example file contents # 10.10.10.10 /mapr-rack1 # 10.10.10.20 /mapr-rack2 # host.foo.com /mapr-rack3 #net.topology.table.file.name=/home/mapr/topo.txt # # ZooKeeper address cldb.zookeeper.servers=10.10.40.36:5181,10.10.40.37:5181,10.10.40.38:5181 # Hadoop metrics jar version hadoop.version=0.20.2 # CLDB JMX remote port cldb.jmxremote.port=7220 num.volmirror.threads=1 # Set this to set the default topology for all volumes and nodes # The default for all volumes is /data by default # UNCOMMENT the below to change the default topology. # For e.g., set cldb.default.topology=/mydata to create volumes # in /mydata topology and to place all nodes in /mydata topology # by default #cldb.default.topology=/mydata core-site.xml The file contains configuration information that overrides the default values /opt/mapr/hadoop/hadoop-<version>/conf/core-site.xml for core Hadoop properties. Overrides of the default values for MapReduce configuration properties are stored in the file. mapred-site.xml To override a default value, specify the new value within the tags, using the following format: <configuration> <property> <name> </name> <value> </value> <description> </description> </property> The table of describes the possible entries to place in the and tags. The tag is optional but core parameters <name> <value> <description> recommended for maintainability. Default core-site.xml file <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> </configuration> Core Parameters Parameter Default Value Description dfs.umaskmode 022 This parameter sets the value, which umask sets default permissions for new objects in the file system (files, directories, sockets, pipes). The default value of 022 results in a default permission set of 755. The mask's semantics are identical to the UNIX s umask emantics. fs.automatic.close True The default behavior for filesystem instances is to close when the program exits. This filesystem closure uses a JVM shutdown hook. Set this property to False to disable this behavior. This is an advanced option. Set the value of to fs.automatic.close False only if your server application requires a specific shutdown sequence. You can examine the current configuration information for this node by using the command from a command -dump hadoop conf line. fs.file.impl org.apache.hadoop.fs.LocalFileSystem The filesystem for OS mounts that use file URIs. : fs.ftp.impl org.apache.hadoop.fs.ftp.FTPFileSystem The filesystem for URIs. 
ftp: fs.har.impl.disable.cache True The default value does not cache filesys har tem instances. Set this value to False to enable caching of filesystem instances. har fs.har.impl org.apache.hadoop.fs.HarFileSystem The filesystem for Hadoop archives. fs.hdfs.impl org.apache.hadoop.hdfs.DistributedFileSyste m The filesystem for URIs. hdfs: fs.hftp.impl org.apache.hadoop.hftp.HftpFileSystem The filesystem for URIs. hftp: fs.hsftp.impl org.apache.hadoop.hdfs.HsftpFileSystem The filesystem for URIs. hsftp: fs.kfs.impl org.apache.hadoop.fs.kfs.KosmosFileSystem The filesystem for URIs. kfs: fs.maprfs.impl com.mapr.fs.MapRFileSystem The filesystem for URIs. maprfs: fs.mapr.working.dir /user/$USERNAME/ Working directory for MapR-FS fs.ramfs.impl org.apache.hadoop.fs.InMemoryFileSystem The filesystem for URIs. ramfs: fs.s3.blockSize 33554432 Block size to use when writing files to S3. fs.s3.buffer.dir ${hadoop.tmp.dir}/s3 Specifies the location on the local filesystem where Amazon S3 stores files before the files are sent to the S3 filesystem. This location also stores files retrieved from S3. fs.s3.impl org.apache.hadoop.fs.s3native.NativeS3File System The filesystem for URIs. s3: fs.s3.maxRetries 4 Specifies the maximum number of retries for file read or write operations to S3. After the maximum number of retries has been attempted, Hadoop signals failure to the application. fs.s3n.blockSize 33554432 Block size to use when reading files from the native S3 filesystem using URIs. s3n: fs.s3n.impl org.apache.hadoop.fs.s3native.NativeS3File System The filesystem for URIs. s3n: fs.s3.sleepTimeSeconds 10 The number of seconds to sleep between S3 retries. fs.trash.interval 0 Specified the number of minutes between trash checkpoints. Set this value to zero to disable the trash feature. hadoop.logfile.count 10 This property is deprecated. hadoop.logfile.size 10000000 This property is deprecated. hadoop.native.lib True Specifies whether to use native Hadoop libraries if they are present. Set this value to False to disable the use of native Hadoop libraries. hadoop.rpc.socket.factory.class.default org.apache.hadoop.net.StandardSocketFact ory Specifies the default socket factory. The value for this parameter must be in the format . package.FactoryClassName hadoop.security.authentication simple Specifies authentication protocols to use. The default value of uses no simple authentication. Specify to enable kerberos Kerberos authentication. hadoop.security.authorization False Specifies whether or not service-level authorization is enabled. Specify True to enable service-level authorization. hadoop.security.group.mapping org.apache.hadoop.security.JniBasedUnixGr oupsMappingWithFallback Specifies the user-to-group mapping class that returns the groups a given user is in. hadoop.security.uid.cache.secs 14400 Specifies the timeout for entries in the NativeIO cache of UID-to-UserName pairs. hadoop.tmp.dir /tmp/hadoop-${user.name} Specifies the base directory for other temporary directories. hadoop.workaround.non.threadsafe.getpwuid False Some operating systems or authentication modules are known to have broken implementations of and getpwuid_r getpw that are not thread-safe. Symptoms of gid_r this problem include JVM crashes with a stack trace inside these functions. Enable this configuration parameter to include a lock around the calls as a workaround. 
An incomplete list of some systems known to have this issue is available at http://wiki.apache.org/hadoop/KnownBroken PwuidImplementations io.bytes.per.checksum 512 The number of checksum bytes. Maximum value for this parameter is equal to the value of the parameter. io.file.buffer.size io.compression.codecs org.apache.hadoop.io.compress.DefaultCod ec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Code c, org.apache.hadoop.io.compress.DeflateCod ec, org.apache.hadoop.io.compress.SnappyCod ec A list of compression codec classes available for compression and decompression. io.file.buffer.size 8192 Specifies the buffer size for sequence files. For optimal performance, set this parameter's value to a multiple of the hardware page size (the Intel x86 architecture has a hardware page size of 4096). This value determines how much data is buffered during read and write operations. io.mapfile.bloom.error.rate 0.005 This value specifies the acceptable rate of false positives for , which is BloomFilter-s used in . The size of the BloomMapFile Blo file increases exponentially as omFilter-s the value of this property decreases io.mapfile.bloom.size 1048576 This value sets the maximum number of keys in a file used in BloomFilter-s BloomMapFile. Once a number of keys equal to this value is appended, the next BloomFilter is created inside a DynamicBlo . Larger values decrease the omFilter number of individual filters. A lower number of filters increases performance and consumes more space. io.map.index.skip 0 Number of index entries to skip between each entry. Values for this property larger than zero can facilitate opening large map files using less memory. io.seqfile.compress.blocksize 1000000 The minimum block size for compression in block compressed . SequenceFiles io.seqfile.lazydecompress True . Set this value to False to Deprecated always decompress block-compressed Sequ . enceFiles io.seqfile.sorter.recordlimit 1000000 . The limit on number of records Deprecated kept in memory in a spill in SequenceFiles . .Sorter io.serializations org.apache.hadoop.io.serializer.WritableSeri alization A list of serialization classes available for obtaining serializers and deserializers. io.skip.checksum.errors False Set this property to True to skip an entry instead of throwing an exception when encountering a checksum error while reading a sequence file. ipc.client.connection.maxidletime 10000 This property's value specifies the maximum idle time in milliseconds. Once this idle time elapses, the client drops the connection to the server. ipc.client.connect.max.retries 10 This property's value specifies the maximum number of retry attempts a client makes to establish a server connection. ipc.client.idlethreshold 4000 This property's value specifies number of connections after which connections are inspected for idleness. ipc.client.kill.max 10 This property's value specifies the maximum number of clients to disconnect simultaneously. ipc.client.max.connection.setup.timeout 20 This property's value specifies the time in minutes that a failover RPC from the job client waits while setting up initial connection with the server. ipc.client.tcpnodelay True Change this value to False to enable Nagle's algorithm for the TCP socket connection on the client. Disabling Nagle's algorithm uses a greater number of smaller packets and may decrease latency. ipc.server.listen.queue.size 128 Indicates the length of the listen queue for servers accepting client connections. 
ipc.server.tcpnodelay  True  Change this value to False to enable Nagle's algorithm for the TCP socket connection on the server. Disabling Nagle's algorithm uses a greater number of smaller packets and may decrease latency.

daemon.conf

The /opt/mapr/conf/daemon.conf file specifies the user and group under which MapR services run, and whether all MapR services run as the specified user/group, or only ZooKeeper and FileServer. The configuration parameters operate as follows:

If mapr.daemon.user and mapr.daemon.group are set, the ZooKeeper and FileServer run as the specified user/group. Otherwise, they run as root.
If mapr.daemon.runuser.warden=1, all services started by the warden run as the specified user. Otherwise, they run as root.

Sample daemon.conf file

mapr.daemon.user=mapr
mapr.daemon.group=mapr
mapr.daemon.runuser.warden=1

disktab

On each node, the /opt/mapr/conf/disktab file lists all of the physical drives and partitions that have been added to MapR-FS. The disktab file is created by disksetup and automatically updated when disks are added or removed (either using the MapR Control System, or with the disk add and disk remove commands).

Sample disktab file

# MapR Disks Mon Nov 28 11:46:16 2011
/dev/sdb 47E4CCDA-3536-E767-CD18-0CB7E4D34E00
/dev/sdc 7B6A3E66-6AF0-AF60-AE39-01B8E4D34E00
/dev/sdd 27A59ED3-DFD4-C692-68F8-04B8E4D34E00
/dev/sde F0BB5FB1-F2AC-CC01-275B-08B8E4D34E00
/dev/sdf 678FCF40-926F-0D04-49AC-0BB8E4D34E00
/dev/sdg 46823852-E45B-A7ED-8417-02B9E4D34E00
/dev/sdh 60A99B96-4CEE-7C46-A749-05B9E4D34E00
/dev/sdi 66533D4D-49F9-3CC4-0DF9-08B9E4D34E00
/dev/sdj 44CA818A-9320-6BBB-3751-0CB9E4D34E00
/dev/sdk 587E658F-EC8B-A3DF-4D74-00BAE4D34E00
/dev/sdl 11384F8D-1DA2-E0F3-E6E5-03BAE4D34E00

hadoop-metrics.properties

The hadoop-metrics.properties files direct MapR where to output service metric reports: to an output file (FileContext) or to Ganglia 3.1 (MapRGangliaContext31). A third context, NullContext, disables metrics. To direct metrics to an output file, comment out the lines pertaining to Ganglia and the NullContext for the chosen service; to direct metrics to Ganglia, comment out the lines pertaining to the metrics file and the NullContext. See Service Metrics.

There are two hadoop-metrics.properties files:

/opt/mapr/hadoop/hadoop-<version>/conf/hadoop-metrics.properties specifies output for standard Hadoop services
/opt/mapr/conf/hadoop-metrics.properties specifies output for MapR-specific services

The following table describes the parameters for each service in the hadoop-metrics.properties files.

Parameter  Example Values  Description
<service>.class  org.apache.hadoop.metrics.spi.NullContextWithUpdateThread, org.apache.hadoop.metrics.file.FileContext, com.mapr.fs.cldb.counters.MapRGangliaContext31  The class that implements the interface responsible for sending the service metrics to the appropriate handler. When implementing a class that sends metrics to Ganglia, set this property to the class name.
<service>.period  10, 60  The interval between two exports of service metrics data to the appropriate interface. This is independent of how often the metrics are updated in the framework.
<service>.fileName  /tmp/cldbmetrics.log  The path to the file where service metrics are exported when the <service>.class property is set to FileContext.
<service>.servers  localhost:8649  The location of the gmon or gmeta that is aggregating metrics for this instance of the service, when the <service>.class property is set to GangliaContext.
<service>.spoof 1 Specifies whether the metrics being sent out from the server should be spoofed as coming from another server. All our fileserver metrics are also on cldb, but to make it appear to end users as if these properties were emitted by fileserver host, we spoof the metrics to Ganglia using this property. Currently only used for the FileServer service. Examples The files are organized into sections for each service that provides metrics. Each section is divided into hadoop-metrics.properties subsections for the three contexts. /opt/mapr/hadoop/hadoop-<version>/conf/hadoop-metrics.properties # Configuration of the "dfs" context for null dfs.class=org.apache.hadoop.metrics.spi.NullContext # Configuration of the "dfs" context for file #dfs.class=org.apache.hadoop.metrics.file.FileContext #dfs.period=10 #dfs.fileName=/tmp/dfsmetrics.log # Configuration of the "dfs" context for ganglia # Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter) # dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext # dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31 # dfs.period=10 # dfs.servers=localhost:8649 # Configuration of the "mapred" context for null mapred.class=org.apache.hadoop.metrics.spi.NullContext # Configuration of the "mapred" context for file #mapred.class=org.apache.hadoop.metrics.file.FileContext #mapred.period=10 #mapred.fileName=/tmp/mrmetrics.log # Configuration of the "mapred" context for ganglia # Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter) # mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext # mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31 # mapred.period=10 # mapred.servers=localhost:8649 # Configuration of the "jvm" context for null #jvm.class=org.apache.hadoop.metrics.spi.NullContext # Configuration of the "jvm" context for file #jvm.class=org.apache.hadoop.metrics.file.FileContext #jvm.period=10 #jvm.fileName=/tmp/jvmmetrics.log # Configuration of the "jvm" context for ganglia # jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext # jvm.period=10 # jvm.servers=localhost:8649 # Configuration of the "ugi" context for null ugi.class=org.apache.hadoop.metrics.spi.NullContext # Configuration of the "fairscheduler" context for null #fairscheduler.class=org.apache.hadoop.metrics.spi.NullContext # Configuration of the "fairscheduler" context for file #fairscheduler.class=org.apache.hadoop.metrics.file.FileContext #fairscheduler.period=10 #fairscheduler.fileName=/tmp/fairschedulermetrics.log # Configuration of the "fairscheduler" context for ganglia # fairscheduler.class=org.apache.hadoop.metrics.ganglia.GangliaContext # fairscheduler.period=10 # fairscheduler.servers=localhost:8649 # /opt/mapr/conf/hadoop-metrics.properties ###################################################################################### ##################################### hadoop-metrics.properties ###################################################################################### ##################################### #CLDB metrics config - Pick one out of null,file or ganglia. 
#Uncomment all properties in null, file or ganglia context, to send cldb metrics to that context # Configuration of the "cldb" context for null #cldb.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread #cldb.period=10 # Configuration of the "cldb" context for file #cldb.class=org.apache.hadoop.metrics.file.FileContext #cldb.period=60 #cldb.fileName=/tmp/cldbmetrics.log # Configuration of the "cldb" context for ganglia cldb.class=com.mapr.fs.cldb.counters.MapRGangliaContext31 cldb.period=10 cldb.servers=localhost:8649 cldb.spoof=1 #FileServer metrics config - Pick one out of null,file or ganglia. #Uncomment all properties in null, file or ganglia context, to send fileserver metrics to that context # Configuration of the "fileserver" context for null #fileserver.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread #fileserver.period=10 # Configuration of the "fileserver" context for file #fileserver.class=org.apache.hadoop.metrics.file.FileContext #fileserver.period=60 #fileserver.fileName=/tmp/fsmetrics.log # Configuration of the "fileserver" context for ganglia fileserver.class=com.mapr.fs.cldb.counters.MapRGangliaContext31 fileserver.period=37 fileserver.servers=localhost:8649 fileserver.spoof=1 ###################################################################################### ########################## mapr-clusters.conf The configuration file specifies the CLDB nodes for one or more clusters that can be reached from /opt/mapr/conf/mapr-clusters.conf the node or client on which it is installed. Format: clustername1 <CLDB> <CLDB> <CLDB> [ clustername2 <CLDB> <CLDB> <CLDB> ] [ ... ] The <CLDB> string format can contain multiple space-separated instances of the following: host;ip:port - Host, IP, and port (uses DNS to resolve hostnames, or provided IP if DNS is down) host:port - Hostname and IP (uses DNS to resolve host, specifies port) ip:port - IP and port (avoids using DNS to resolve hosts, specifies port) host - Hostname only (default, uses DNS to resolve host, uses default port) ip - IP only (avoids using DNS to resolve hosts, uses default port) You can edit manually to add more clusters. For example: mapr-clusters.conf clustername2 <CLDB> <CLDB> <CLDB> Adding multihomed CLDB entries to mapr-clusters.conf with configure.sh In this example, the cluster has CLDB servers at , , , and . The CLDB servers and h my.cluster.com nodeA nodeB nodeC nodeD nodeB nodeD ave two NICs each at and . The entries in are separated by spaces for each server's entry. Within a server's eth0 eth1 mapr-clusters.conf entry, individual interfaces are separated by semicolons ( ). ; The command configure.sh -N my.cluster.com -C nodeAeth0,nodeCeth0 -M nodeBeth0,nodeBeth1 -M nodeDeth0,nodeDeth1 -Z zknodeA generates the following entry in : mapr-clusters.conf my.cluster.com nodeAeth0 nodeBeth0;nodeBeth1 nodeCeth0 nodeDeth0;nodeDeth1 mapred-default.xml The configuration file provides defaults that can be overridden using , and is located in the Hadoop mapred-default.xml mapred-site.xml core JAR file ( ). /opt/mapr/hadoop/hadoop-<version>/lib/hadoop-<version>-dev-core.jar The format for a parameter in both and is: mapred-default.xml mapred-site.xml Do not modify directly. Instead, copy parameters into and modify them there. If mapred-default.xml mapred-site.xml mapred-site.xml does not already exist, create it. You can examine the current configuration information for this node by using the command from a command -dump hadoop conf line. 
<property> <name>io.sort.spill.percent</name> <value>0.99</value> <description>The soft limit in either the buffer or record collection buffers. Once reached, a thread will begin to spill the contents to disk in the background. Note that this does not imply any chunking of data to the spill. A value less than 0.5 is not recommended.</description> </property> The element contains the parameter name, the element contains the parameter value, and the optional elem <name> <value> <description> ent contains the parameter description. You can create XML for any parameter from the table below, using the example above as a guide. Parameter Value Description hadoop.job.history.location   If job tracker is static the history files are stored in this single well known place on local filesystem. If No value is set here, by default, it is in the local file system at $<hadoop.log.dir>/history. History files are moved to mapred.jobtracker.history.completed.location which is on MapRFs JobTracker volume. hadoop.job.history.user.location   User can specify a location to store the history files of a particular job. If nothing is specified, the logs are stored in output directory. The files are stored in "_logs/history/" in the directory. User can stop logging by giving the value "none". hadoop.rpc.socket.factory.class.JobSubmissi onProtocol   SocketFactory to use to connect to a Map/Reduce master (JobTracker). If null or empty, then use hadoop.rpc.socket.class.default. io.map.index.skip 0 Number of index entries to skip between each entry. Zero by default. Setting this to values larger than zero can facilitate opening large map files using less memory. io.sort.factor 256 The number of streams to merge at once while sorting files. This determines the number of open file handles. io.sort.mb 380 Buffer used to hold map outputs in memory before writing final map outputs. Setting this value very low may cause spills. A typical value for this parameter is 1.5 the average size of a map output. io.sort.record.percent 0.17 The percentage of io.sort.mb dedicated to tracking record boundaries. Let this value be r, io.sort.mb be x. The maximum number of records collected before the collection thread must block is equal to (r * x) / 4 io.sort.spill.percent 0.99 The soft limit in either the buffer or record collection buffers. Once reached, a thread will begin to spill the contents to disk in the background. Note that this does not imply any chunking of data to the spill. A value less than 0.5 is not recommended. job.end.notification.url http://localhost:8080/jobstatus.php?jobId=$jo bId&amp;jobStatus=$jobStatus Indicates url which will be called on completion of job to inform end status of job. User can give at most 2 variables with URI : $jobId and $jobStatus. If they are present in URI, then they will be replaced by their respective values. job.end.retry.attempts 0 Indicates how many times hadoop should attempt to contact the notification URL job.end.retry.interval 30000 Indicates time in milliseconds between notification URL retry calls jobclient.completion.poll.interval 5000 The interval (in milliseconds) between which the JobClient polls the JobTracker for updates about job status. You may want to set this to a lower value to make tests run faster on a single node system. Adjusting this value in production may lead to unwanted client-server traffic. jobclient.output.filter FAILED The filter for controlling the output of the task's userlogs sent to the console of the JobClient. 
The permissible options are: NONE, KILLED, FAILED, SUCCEEDED and ALL. jobclient.progress.monitor.poll.interval 1000 The interval (in milliseconds) between which the JobClient reports status to the console and checks for job completion. You may want to set this to a lower value to make tests run faster on a single node system. Adjusting this value in production may lead to unwanted client-server traffic. keep.failed.task.files false Should the files for failed tasks be kept. This should only be used on jobs that are failing, because the storage is never reclaimed. It also prevents the map outputs from being erased from the reduce directory as they are consumed. keep.task.files.pattern .*_m_123456_0 Keep all files from tasks whose task names match the given regular expression. Defaults to none. map.sort.class org.apache.hadoop.util.QuickSort The default sort class for sorting keys. mapr.localoutput.dir output The path for local output mapr.localspill.dir spill The path for local spill mapr.localvolumes.path /var/mapr/local The path for local volumes mapred.acls.enabled false Specifies whether ACLs should be checked for authorization of users for doing various queue and job level operations. ACLs are disabled by default. If enabled, access control checks are made by JobTracker and TaskTracker when requests are made by users for queue operations like submit job to a queue and kill a job in the queue and job operations like viewing the job-details (See mapreduce.job.acl-view-job) or for modifying the job (See mapreduce.job.acl-modify-job) using Map/Reduce APIs, RPCs or via the console and web user interfaces. mapred.child.env   User added environment variables for the task tracker child processes. Example : 1) A=foo This will set the env variable A to foo 2) B=$B:c This is inherit tasktracker's B env variable. mapred.child.java.opts   Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@[email protected] The configuration variable mapred.child.ulimit can be used to control the maximum virtual memory of the child processes. mapred.child.oom_adj 10 Increase the OOM adjust for oom killer (linux specific). We only allow increasing the adj value. (valid values: 0-15) mapred.child.renice 10 Nice value to run the job in. on linux the range is from -20 (most favorable) to 19 (least favorable). We only allow reducing the priority. (valid values: 0-19) mapred.child.taskset true Run the job in a taskset. man taskset (linux specific) 1-4 CPUs: No taskset 5-8 CPUs: taskset 1- (processor 0 reserved for infrastructrue processes) 9-n CPUs: taskset 2- (processors 0,1 reserved for infrastructrue processes) mapred.child.tmp ./tmp To set the value of tmp directory for map and reduce tasks. If the value is an absolute path, it is directly assigned. Otherwise, it is prepended with task's working directory. The java tasks are executed with option -Djava.io.tmpdir='the absolute path of the tmp dir'. Pipes and streaming are set with environment variable, TMPDIR='the absolute path of the tmp dir' mapred.child.ulimit   The maximum virtual memory, in KB, of a process launched by the Map-Reduce framework. 
This can be used to control both the Mapper/Reducer tasks and applications using Hadoop Pipes, Hadoop Streaming etc. By default it is left unspecified to let cluster admins control it via limits.conf and other such relevant mechanisms. Note: mapred.child.ulimit must be greater than or equal to the -Xmx passed to JavaVM, else the VM might not start. mapred.cluster.map.memory.mb -1 The size, in terms of virtual memory, of a single map slot in the Map-Reduce framework, used by the scheduler. A job can ask for multiple slots for a single map task via mapred.job.map.memory.mb, upto the limit specified by mapred.cluster.max.map.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off. mapred.cluster.max.map.memory.mb -1 The maximum size, in terms of virtual memory, of a single map task launched by the Map-Reduce framework, used by the scheduler. A job can ask for multiple slots for a single map task via mapred.job.map.memory.mb, upto the limit specified by mapred.cluster.max.map.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off. mapred.cluster.max.reduce.memory.mb -1 The maximum size, in terms of virtual memory, of a single reduce task launched by the Map-Reduce framework, used by the scheduler. A job can ask for multiple slots for a single reduce task via mapred.job.reduce.memory.mb, upto the limit specified by mapred.cluster.max.reduce.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off. mapred.cluster.reduce.memory.mb -1 The size, in terms of virtual memory, of a single reduce slot in the Map-Reduce framework, used by the scheduler. A job can ask for multiple slots for a single reduce task via mapred.job.reduce.memory.mb, upto the limit specified by mapred.cluster.max.reduce.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off. mapred.compress.map.output false Should the outputs of the maps be compressed before being sent across the network. Uses SequenceFile compression. mapred.healthChecker.interval 60000 Frequency of the node health script to be run, in milliseconds mapred.healthChecker.script.args   List of arguments which are to be passed to node health script when it is being launched comma seperated. mapred.healthChecker.script.path   Absolute path to the script which is periodicallyrun by the node health monitoring service to determine if the node is healthy or not. If the value of this key is empty or the file does not exist in the location configured here, the node health monitoring service is not started. mapred.healthChecker.script.timeout 600000 Time after node health script should be killed if unresponsive and considered that the script has failed. mapred.hosts.exclude   Names a file that contains the list of hosts that should be excluded by the jobtracker. If the value is empty, no hosts are excluded. mapred.hosts   Names a file that contains the list of nodes that may connect to the jobtracker. If the value is empty, all hosts are permitted. mapred.inmem.merge.threshold 1000 The threshold, in terms of the number of files for the in-memory merge process. When we accumulate threshold number of files we initiate the in-memory merge and spill to disk. A value of 0 or less than 0 indicates we want to DON'T have any threshold and instead depend only on the ramfs's memory consumption to trigger the merge. 
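As an aside, the node health check parameters above can be combined into a single mapred-site.xml fragment. This is a minimal sketch: the script path and arguments are hypothetical placeholders (no default script is assumed to ship with the product), while the interval and timeout values shown are the documented defaults.
<!-- Illustrative health check configuration; script path and args are examples only -->
<property>
  <name>mapred.healthChecker.script.path</name>
  <value>/home/mapr/bin/check_node_health.sh</value>
</property>
<property>
  <name>mapred.healthChecker.script.args</name>
  <value>--verbose,--timeout=30</value>
</property>
<property>
  <name>mapred.healthChecker.interval</name>
  <value>60000</value>
</property>
<property>
  <name>mapred.healthChecker.script.timeout</name>
  <value>600000</value>
</property>
If no script exists at the configured path, the node health monitoring service is not started, as noted in the mapred.healthChecker.script.path description above.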
mapred.job.map.memory.mb -1 The size, in terms of virtual memory, of a single map task for the job. A job can ask for multiple slots for a single map task, rounded up to the next multiple of mapred.cluster.map.memory.mb and upto the limit specified by mapred.cluster.max.map.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off iff mapred.cluster.map.memory.mb is also turned off (-1). mapred.job.map.memory.physical.mb   Maximum physical memory limit for map task of this job. If limit is exceeded task attempt will be FAILED. mapred.job.queue.name default Queue to which a job is submitted. This must match one of the queues defined in mapred.queue.names for the system. Also, the ACL setup for the queue must allow the current user to submit a job to the queue. Before specifying a queue, ensure that the system is configured with the queue, and access is allowed for submitting jobs to the queue. mapred.job.reduce.input.buffer.percent 0.0 The percentage of memory- relative to the maximum heap size- to retain map outputs during the reduce. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin. mapred.job.reduce.memory.mb -1 The size, in terms of virtual memory, of a single reduce task for the job. A job can ask for multiple slots for a single map task, rounded up to the next multiple of mapred.cluster.reduce.memory.mb and upto the limit specified by mapred.cluster.max.reduce.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off iff mapred.cluster.reduce.memory.mb is also turned off (-1). mapred.job.reduce.memory.physical.mb   Maximum physical memory limit for reduce task of this job. If limit is exceeded task attempt will be FAILED.. mapred.job.reuse.jvm.num.tasks -1 How many tasks to run per jvm. If set to -1, there is no limit. mapred.job.shuffle.input.buffer.percent 0.70 The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle. mapred.job.shuffle.merge.percent 0.66 The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapred.job.shuffle.input.buffer.percent. mapred.job.tracker.handler.count 10 The number of server threads for the JobTracker. This should be roughly 4% of the number of tasktracker nodes. mapred.job.tracker.history.completed.locatio n /var/mapr/cluster/mapred/jobTracker/history/ done The completed job history files are stored at this single well-known location. If nothing is specified, the files are stored at $<hadoop.job.history.location>/done in local filesystem. mapred.job.tracker.http.address 0.0.0.0:50030 The job tracker http server address and port the server will listen on. If the port is 0 then the server will start on a free port. mapred.job.tracker.persist.jobstatus.active false Indicates if persistency of job status information is active or not. mapred.job.tracker.persist.jobstatus.dir /var/mapr/cluster/mapred/jobTracker/jobsInfo The directory where the job status information is persisted in a file system to be available after it drops of the memory queue and between jobtracker restarts. mapred.job.tracker.persist.jobstatus.hours 0 The number of hours job status information is persisted in DFS. The job status information will be available after it drops of the memory queue and between jobtracker restarts. 
With a zero value the job status information is not persisted at all in DFS. mapred.job.tracker localhost:9001 jobTracker address ip:port or use uri maprfs:/// for default cluster or maprfs:///mapr/san_jose_cluster1 to connect 'san_jose_cluster1' cluster. mapred.jobtracker.completeuserjobs.maximu m 100 The maximum number of complete jobs per user to keep around before delegating them to the job history. mapred.jobtracker.instrumentation org.apache.hadoop.mapred.JobTrackerMetri csInst Expert: The instrumentation class to associate with each JobTracker. mapred.jobtracker.job.history.block.size 3145728 The block size of the job history file. Since the job recovery uses job history, its important to dump job history to disk as soon as possible. Note that this is an expert level parameter. The default value is set to 3 MB. mapred.jobtracker.jobhistory.lru.cache.size 5 The number of job history files loaded in memory. The jobs are loaded when they are first accessed. The cache is cleared based on LRU. mapred.jobtracker.maxtasks.per.job -1 The maximum number of tasks for a single job. A value of -1 indicates that there is no maximum. mapred.jobtracker.plugins   Comma-separated list of jobtracker plug-ins to be activated. mapred.jobtracker.port 9001 Port on which JobTracker listens. mapred.jobtracker.restart.recover true "true" to enable (job) recovery upon restart, "false" to start afresh mapred.jobtracker.retiredjobs.cache.size 1000 The number of retired job status to keep in the cache. mapred.jobtracker.taskScheduler.maxRunnin gTasksPerJob   The maximum number of running tasks for a job before it gets preempted. No limits if undefined. mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.JobQueueTaskS cheduler The class responsible for scheduling the tasks. mapred.line.input.format.linespermap 1 Number of lines per split in NLineInputFormat. mapred.local.dir.minspacekill 0 If the space in mapred.local.dir drops under this, do not ask more tasks until all the current ones have finished and cleaned up. Also, to save the rest of the tasks we have running, kill one of them, to clean up some space. Start with the reduce tasks, then go with the ones that have finished the least. Value in bytes. mapred.local.dir.minspacestart 0 If the space in mapred.local.dir drops under this, do not ask for more tasks. Value in bytes. mapred.local.dir $<hadoop.tmp.dir>/mapred/local The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored. mapred.map.child.env   User added environment variables for the task tracker child processes. Example : 1) A=foo This will set the env variable A to foo 2) B=$B:c This is inherit tasktracker's B env variable. mapred.map.child.java.opts -XX:ErrorFile=/opt/cores/hadoop/java_error %p.log Java opts for the map tasks. The following symbol, if present, will be interpolated: @taskid@ is replaced by current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@[email protected] The configuration variable mapred.<map/reduce>.child.ulimit can be used to control the maximum virtual memory of the child processes. MapR: Default heapsize(-Xmx) is determined by memory reserved for mapreduce at tasktracker. Reduce task is given more memory than a map task. 
Default memory for a map task = (Total Memory reserved for mapreduce) * (#mapslots/ (#mapslots + 1.3*#reduceslots)) mapred.map.child.ulimit   The maximum virtual memory, in KB, of a process launched by the Map-Reduce framework. This can be used to control both the Mapper/Reducer tasks and applications using Hadoop Pipes, Hadoop Streaming etc. By default it is left unspecified to let cluster admins control it via limits.conf and other such relevant mechanisms. Note: mapred.<map/reduce>.child.ulimit must be greater than or equal to the -Xmx passed to JavaVM, else the VM might not start. mapred.map.max.attempts 4 Expert: The maximum number of attempts per map task. In other words, framework will try to execute a map task these many number of times before giving up on it. mapred.map.output.compression.codec org.apache.hadoop.io.compress.DefaultCod ec If the map outputs are compressed, how should they be compressed? mapred.map.tasks.speculative.execution true If true, then multiple instances of some map tasks may be executed in parallel. mapred.map.tasks 2 The default number of map tasks per job. Ignored when mapred.job.tracker is "local". mapred.max.tracker.blacklists 4 The number of blacklists for a taskTracker by various jobs after which the task tracker could be blacklisted across all jobs. The tracker will be given a tasks later (after a day). The tracker will become a healthy tracker after a restart. mapred.max.tracker.failures 4 The number of task-failures on a tasktracker of a given job after which new tasks of that job aren't assigned to it. mapred.merge.recordsBeforeProgress 10000 The number of records to process during merge before sending a progress notification to the TaskTracker. mapred.min.split.size 0 The minimum size chunk that map input should be split into. Note that some file formats may have minimum split sizes that take priority over this setting. mapred.output.compress false Should the job outputs be compressed? mapred.output.compression.codec org.apache.hadoop.io.compress.DefaultCod ec If the job outputs are compressed, how should they be compressed? mapred.output.compression.type RECORD If the job outputs are to compressed as SequenceFiles, how should they be compressed? Should be one of NONE, RECORD or BLOCK. mapred.queue.default.state RUNNING This values defines the state , default queue is in. the values can be either "STOPPED" or "RUNNING" This value can be changed at runtime. mapred.queue.names default Comma separated list of queues configured for this jobtracker. Jobs are added to queues and schedulers can configure different scheduling properties for the various queues. To configure a property for a queue, the name of the queue must match the name specified in this value. Queue properties that are common to all schedulers are configured here with the naming convention, mapred.queue.$QUEUE-NAME.$PROPERT Y-NAME, for e.g. mapred.queue.default.submit-job-acl. The number of queues configured in this parameter could depend on the type of scheduler being used, as specified in mapred.jobtracker.taskScheduler. For example, the JobQueueTaskScheduler supports only a single queue, which is the default configured here. Before adding more queues, ensure that the scheduler you've configured supports multiple queues. mapred.reduce.child.env     mapred.reduce.child.java.opts -XX:ErrorFile=/opt/cores/hadoop/java_error %p.log Java opts for the reduce tasks. MapR: Default heapsize(-Xmx) is determined by memory reserved for mapreduce at tasktracker. 
Reduce task is given more memory than map task. Default memory for a reduce task = (Total Memory reserved for mapreduce) * (1.3*#reduceslots / (#mapslots + 1.3*#reduceslots)) mapred.reduce.child.ulimit     mapred.reduce.copy.backoff 300 The maximum amount of time (in seconds) a reducer spends on fetching one map output before declaring it as failed. mapred.reduce.max.attempts 4 Expert: The maximum number of attempts per reduce task. In other words, framework will try to execute a reduce task these many number of times before giving up on it. mapred.reduce.parallel.copies 12 The default number of parallel transfers run by reduce during the copy(shuffle) phase. mapred.reduce.slowstart.completed.maps 0.95 Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job. mapred.reduce.tasks.speculative.execution true If true, then multiple instances of some reduce tasks may be executed in parallel. mapred.reduce.tasks 1 The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when mapred.job.tracker is "local". mapred.skip.attempts.to.start.skipping 2 The number of Task attempts AFTER which skip mode will be kicked off. When skip mode is kicked off, the tasks reports the range of records which it will process next, to the TaskTracker. So that on failures, tasktracker knows which ones are possibly the bad records. On further executions, those are skipped. mapred.skip.map.auto.incr.proc.count true The flag which if set to true, SkipBadRecords.COUNTER_MAP_PROCE SSED_RECORDS is incremented by MapRunner after invoking the map function. This value must be set to false for applications which process the records asynchronously or buffer the input records. For example streaming. In such cases applications should increment this counter on their own. mapred.skip.map.max.skip.records 0 The number of acceptable skip records surrounding the bad record PER bad record in mapper. The number includes the bad record as well. To turn the feature of detection/skipping of bad records off, set the value to 0. The framework tries to narrow down the skipped range by retrying until this threshold is met OR all attempts get exhausted for this task. Set the value to Long.MAX_VALUE to indicate that framework need not try to narrow down. Whatever records(depends on application) get skipped are acceptable. mapred.skip.out.dir   If no value is specified here, the skipped records are written to the output directory at _logs/skip. User can stop writing skipped records by giving the value "none". mapred.skip.reduce.auto.incr.proc.count true The flag which if set to true, SkipBadRecords.COUNTER_REDUCE_PR OCESSED_GROUPS is incremented by framework after invoking the reduce function. This value must be set to false for applications which process the records asynchronously or buffer the input records. For example streaming. In such cases applications should increment this counter on their own. mapred.skip.reduce.max.skip.groups 0 The number of acceptable skip groups surrounding the bad group PER bad group in reducer. The number includes the bad group as well. To turn the feature of detection/skipping of bad groups off, set the value to 0. The framework tries to narrow down the skipped range by retrying until this threshold is met OR all attempts get exhausted for this task. Set the value to Long.MAX_VALUE to indicate that framework need not try to narrow down. 
Whatever groups(depends on application) get skipped are acceptable. mapred.submit.replication 10 The replication level for submitted job files. This should be around the square root of the number of nodes. mapred.system.dir /var/mapr/cluster/mapred/jobTracker/system The shared directory where MapReduce stores control files. mapred.task.cache.levels 2 This is the max level of the task cache. For example, if the level is 2, the tasks cached are at the host level and at the rack level. mapred.task.profile.maps 0-2 To set the ranges of map tasks to profile. mapred.task.profile has to be set to true for the value to be accounted. mapred.task.profile.reduces 0-2 To set the ranges of reduce tasks to profile. mapred.task.profile has to be set to true for the value to be accounted. mapred.task.profile false To set whether the system should collect profiler information for some of the tasks in this job? The information is stored in the user log directory. The value is "true" if task profiling is enabled. mapred.task.timeout 600000 The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string. mapred.task.tracker.http.address 0.0.0.0:50060 The task tracker http server address and port. If the port is 0 then the server will start on a free port. mapred.task.tracker.report.address 127.0.0.1:0 The interface and port that task tracker server listens on. Since it is only connected to by the tasks, it uses the local interface. EXPERT ONLY. Should only be changed if your host does not have the loopback interface. mapred.task.tracker.task-controller org.apache.hadoop.mapred.DefaultTaskCont roller TaskController which is used to launch and manage task execution mapred.tasktracker.dns.interface default The name of the Network Interface from which a task tracker should report its IP address. mapred.tasktracker.dns.nameserver default The host name or IP address of the name server (DNS) which a TaskTracker should use to determine the host name used by the JobTracker for communication and display purposes. mapred.tasktracker.expiry.interval 600000 Expert: The time-interval, in miliseconds, after which a tasktracker is declared 'lost' if it doesn't send heartbeats. mapred.tasktracker.indexcache.mb 10 The maximum memory that a task tracker allows for the index cache that is used when serving map outputs to reducers. mapred.tasktracker.instrumentation org.apache.hadoop.mapred.TaskTrackerMet ricsInst Expert: The instrumentation class to associate with each TaskTracker. mapred.tasktracker.map.tasks.maximum (CPUS > 2) ? (CPUS * 0.75) : 1 The maximum number of map tasks that will be run simultaneously by a task tracker. mapred.tasktracker.memory_calculator_plugi n   Name of the class whose instance will be used to query memory information on the tasktracker. The class must be an instance of org.apache.hadoop.util.MemoryCalculatorPlu gin. If the value is null, the tasktracker attempts to use a class appropriate to the platform. Currently, the only platform supported is Linux. mapred.tasktracker.reduce.tasks.maximum (CPUS > 2) ? (CPUS * 0.50): 1 The maximum number of reduce tasks that will be run simultaneously by a task tracker. mapred.tasktracker.taskmemorymanager.mo nitoring-interval 5000 The interval, in milliseconds, for which the tasktracker waits between two cycles of monitoring its tasks' memory usage. Used only if tasks' memory management is enabled via mapred.tasktracker.tasks.maxmemory. 
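To illustrate the profiling parameters above, the following fragment turns on task profiling for the first three map and reduce task attempts. The ranges shown are the documented defaults for mapred.task.profile.maps and mapred.task.profile.reduces; treat this as a sketch rather than a recommended setting.
<property>
  <name>mapred.task.profile</name>
  <value>true</value>
</property>
<property>
  <name>mapred.task.profile.maps</name>
  <value>0-2</value>
</property>
<property>
  <name>mapred.task.profile.reduces</name>
  <value>0-2</value>
</property>
Profiler output is written to the user log directory, as described for mapred.task.profile.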
mapred.tasktracker.tasks.sleeptime-before-si gkill 5000 The time, in milliseconds, the tasktracker waits for sending a SIGKILL to a process, after it has been sent a SIGTERM. mapred.temp.dir $<hadoop.tmp.dir>/mapred/temp A shared directory for temporary files. mapred.user.jobconf.limit 5242880 The maximum allowed size of the user jobconf. The default is set to 5 MB mapred.userlog.limit.kb 0 The maximum size of user-logs of each task in KB. 0 disables the cap. mapred.userlog.retain.hours 24 The maximum time, in hours, for which the user-logs are to be retained after the job completion. mapreduce.heartbeat.10 300 heartbeat in milliseconds for small cluster (less than or equal 10 nodes) mapreduce.heartbeat.100 1000 heartbeat in milliseconds for medium cluster (11 - 100 nodes). Scales linearly between 300ms - 1s mapreduce.heartbeat.1000 10000 heartbeat in milliseconds for medium cluster (101 - 1000 nodes). Scales linearly between 1s - 10s mapreduce.heartbeat.10000 100000 heartbeat in milliseconds for medium cluster (1001 - 10000 nodes). Scales linearly between 10s - 100s mapreduce.job.acl-modify-job   job specific access-control list for 'modifying' the job. It is only used if authorization is enabled in Map/Reduce by setting the configuration property mapred.acls.enabled to true. This specifies the list of users and/or groups who can do modification operations on the job. For specifying a list of users and groups the format to use is "user1,user2 group1,group". If set to '*', it allows all users/groups to modify this job. If set to ' '(i.e. space), it allows none. This configuration is used to guard all the modifications with respect to this job and takes care of all the following operations: o killing this job o killing a task of this job, failing a task of this job o setting the priority of this job Each of these operations are also protected by the per-queue level ACL "acl-administer-jobs" configured via mapred-queues.xml. So a caller should have the authorization to satisfy either the queue-level ACL or the job-level ACL. Irrespective of this ACL configuration, job-owner, the user who started the cluster, cluster administrators configured via mapreduce.cluster.administrators and queue administrators of the queue to which this job is submitted to configured via mapred.queue.queue-name.acl-administer-jo bs in mapred-queue-acls.xml can do all the modification operations on a job. By default, nobody else besides job-owner, the user who started the cluster, cluster administrators and queue administrators can perform modification operations on a job. mapreduce.job.acl-view-job   job specific access-control list for 'viewing' the job. It is only used if authorization is enabled in Map/Reduce by setting the configuration property mapred.acls.enabled to true. This specifies the list of users and/or groups who can view private details about the job. For specifying a list of users and groups the format to use is "user1,user2 group1,group". If set to '*', it allows all users/groups to modify this job. If set to ' '(i.e. space), it allows none. 
This configuration is used to guard some of the job-views and at present only protects APIs that can return possibly sensitive information of the job-owner like o job-level counters o task-level counters o tasks' diagnostic information o task-logs displayed on the TaskTracker web-UI and o job.xml showed by the JobTracker's web-UI Every other piece of information of jobs is still accessible by any other user, for e.g., JobStatus, JobProfile, list of jobs in the queue, etc. Irrespective of this ACL configuration, job-owner, the user who started the cluster, cluster administrators configured via mapreduce.cluster.administrators and queue administrators of the queue to which this job is submitted to configured via mapred.queue.queue-name.acl-administer-jo bs in mapred-queue-acls.xml can do all the view operations on a job. By default, nobody else besides job-owner, the user who started the cluster, cluster administrators and queue administrators can perform view operations on a job. mapreduce.job.complete.cancel.delegation.t okens true if false - do not unregister/cancel delegation tokens from renewal, because same tokens may be used by spawned jobs mapreduce.job.split.metainfo.maxsize 10000000 The maximum permissible size of the split metainfo file. The JobTracker won't attempt to read split metainfo files bigger than the configured value. No limits if set to -1. mapreduce.jobtracker.recovery.dir /var/mapr/cluster/mapred/jobTracker/recover y Recovery Directory mapreduce.jobtracker.recovery.job.initializati on.maxtime   Maximum time in seconds JobTracker will wait for initializing jobs before starting recovery. By default it is same as mapreduce.jobtracker.recovery.maxtime. mapreduce.jobtracker.recovery.maxtime 480 Maximum time in seconds JobTracker should stay in recovery mode. JobTracker recovers job after talking to all running tasktrackers. On large cluster if many jobs are to be recovered, mapreduce.jobtracker.recovery.maxtime should be increased. mapreduce.jobtracker.staging.root.dir /var/mapr/cluster/mapred/jobTracker/staging The root of the staging area for users' job files In practice, this should be the directory where users' home directories are located (usually /user) mapreduce.maprfs.use.checksum true Deprecated; checksums are always used. mapreduce.maprfs.use.compression true When true, MapReduce uses compression during the Shuffle phase. mapreduce.reduce.input.limit -1 The limit on the input size of the reduce. If the estimated input size of the reduce is greater than this value, job is failed. A value of -1 means that there is no limit set. mapreduce.task.classpath.user.precedence false Set to true if user wants to set different classpath. mapreduce.tasktracker.group   Expert: Group to which TaskTracker belongs. If LinuxTaskController is configured via mapreduce.tasktracker.taskcontroller, the group owner of the task-controller binary should be same as this group. mapreduce.tasktracker.heapbased.memory. management false Expert only: If admin wants to prevent swapping by not launching too many tasks use this option. Task's memory usage is based on max java heap size (-Xmx). By default -Xmx will be computed by tasktracker based on slots and memory reserved for mapreduce tasks. See mapred.map.child.java.opts/mapred.reduce.c hild.java.opts. mapreduce.tasktracker.jvm.idle.time 10000 If jvm is idle for more than mapreduce.tasktracker.jvm.idle.time (milliseconds) tasktracker will kill it. 
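As a concrete illustration of the job ACL parameters described above, the following sketch enables authorization checks and restricts who can view or modify a job. The user and group names are examples only; the value uses the "user1,user2 group1,group2" format described above.
<!-- Cluster side: turn on ACL checks in mapred-site.xml -->
<property>
  <name>mapred.acls.enabled</name>
  <value>true</value>
</property>
<!-- Job side: example users and groups -->
<property>
  <name>mapreduce.job.acl-view-job</name>
  <value>alice,bob analysts</value>
</property>
<property>
  <name>mapreduce.job.acl-modify-job</name>
  <value>alice</value>
</property>
Regardless of these settings, the job owner, the user who started the cluster, cluster administrators, and queue administrators can always view and modify the job, as noted above.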
mapreduce.tasktracker.outofband.heartbeat false Expert: Set this to true to let the tasktracker send an out-of-band heartbeat on task completion for better latency. mapreduce.tasktracker.prefetch.maptasks 0.0 How many map tasks should be scheduled in advance on a tasktracker, expressed as a percentage of map slots. The default is 0.0, which means no overscheduled tasks are allowed on the tasktracker. mapreduce.tasktracker.reserved.physicalmemory.mb   Maximum physical memory the tasktracker should reserve for mapreduce tasks. If tasks use more than the limit, the task using the most memory will be killed. Expert only: Set this value only if the tasktracker should use a certain amount of memory for mapreduce tasks. In the MapR distribution, the warden figures this number based on the services configured on a node. Setting mapreduce.tasktracker.reserved.physicalmemory.mb to -1 disables physical memory accounting and task management. mapreduce.tasktracker.volume.healthcheck.interval 60000 How often the tasktracker should check for the mapreduce volume at ${mapr.localvolumes.path}/mapred/. Value is in milliseconds. mapreduce.use.fastreduce false Expert only. The reducer will not be able to tolerate failures. mapreduce.use.maprfs true If true, then MapReduce uses MapR-FS to store task-related data. tasktracker.http.threads 2 The number of worker threads for the HTTP server. This is used for map output fetching.
mapred-site.xml
The file /opt/mapr/hadoop/hadoop-<version>/conf/mapred-site.xml contains configuration information that overrides the default values for MapReduce parameters. Overrides of the default values for core configuration properties are stored in the core-site.xml file. To override a default value, specify the new value within the <configuration> tags, using the following format:
<configuration> <property> <name> </name> <value> </value> <description> </description> </property> </configuration>
The following configuration tables describe the possible entries to place in the <name> and <value> tags. The <description> tag is optional but recommended for maintainability. There are three parts to mapred-site.xml: JobTracker configuration, TaskTracker configuration, and Job configuration.
Default core-site.xml file
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> </configuration>
JobTracker Configuration
Changing any parameters in this section requires a JobTracker restart. You can examine the current configuration information for this node by using the hadoop conf -dump command from a command line.
Parameter Value Description mapred.job.tracker maprfs:/// JobTracker address, as ip:port, or use the URI maprfs:/// for the default cluster, or maprfs:///mapr/san_jose_cluster1 to connect to the 'san_jose_cluster1' cluster. Replace localhost with one or more IP addresses for the JobTracker. mapred.jobtracker.port 9001 Port on which the JobTracker listens. Read by the JobTracker to start the RPC server. mapreduce.tasktracker.outofband.heartbeat True The task tracker sends an out-of-band heartbeat on task completion to improve latency. Set this value to False to disable this behavior. webinterface.private.actions False By default, jobs cannot be killed from the JobTracker's web interface. Set this value to True to enable this behavior. maprfs.openfid2.prefetch.bytes 0 Expert: number of shuffle bytes to be prefetched by the reduce task. mapr.localoutput.dir output The path for map output files on the shuffle volume. mapr.localspill.dir spill The path for local spill files on the shuffle volume.
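To make the override format above concrete, here is a minimal, illustrative mapred-site.xml that points clients at the default cluster's JobTracker and enables job kills from the JobTracker web interface. The values are examples of the parameters listed in this section, not required settings; JobTracker parameters take effect only after a JobTracker restart.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>maprfs:///</value>
    <description>Use the default cluster's JobTracker</description>
  </property>
  <property>
    <name>webinterface.private.actions</name>
    <value>true</value>
    <description>Allow killing jobs from the JobTracker web UI</description>
  </property>
</configuration>
Per the recommendation noted below, properly secure your interfaces before enabling webinterface.private.actions.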
mapreduce.jobtracker.node.labels.file   The file that specifies the labels to apply to the nodes in the cluster. mapreduce.jobtracker.node.labels.monitor.in terval 120000 Specifies a time interval in milliseconds. The node labels file is polled for changes every time this interval passes. mapred.queue.<queue-name>.label   Specifies a label for the queue named in the placeholder. <queue-name> mapred.queue.<queue-name>.label.policy   Specifies a policy for the label applied to the queue named in the placeh <queue-name> older. The policy controls the interaction between the queue label and the job label: PREFER_QUEUE — always use label set on queue PREFER_JOB — always use label set on job AND (default) — job label AND node label OR — job label OR node label JobTracker Directories When changing any parameters in this section, a JobTracker restart is required. Volume path = mapred.system.dir/../ MapR recommends properly securing your interfaces before enabling this behavior. Parameter Value Description mapred.system.dir /var/mapr/cluster/mapred/jobTracker/system The shared directory where MapReduce stores control files. mapred.job.tracker.persist.jobstatus.dir /var/mapr/cluster/mapred/jobTracker/jobsInfo The directory where the job status information is persisted in a file system to be available after it drops of the memory queue and between JobTracker restarts. mapreduce.jobtracker.staging.root.dir /var/mapr/cluster/mapred/jobTracker/staging The root of the staging area for users' job files In practice, this should be the directory where users' home directories are located (usually /user) mapreduce.job.split.metainfo.maxsize 10000000 The maximum permissible size of the split metainfo file. The JobTracker won't attempt to read split metainfo files bigger than the configured value. No limits if set to -1. mapreduce.maprfs.use.compression True Set this property's value to False to disable the use of MapR-FS compression for shuffle data by MapReduce. mapred.jobtracker.retiredjobs.cache.size 1000 The number of retired job status to keep in the cache. mapred.job.tracker.history.completed.locatio n /var/mapr/cluster/mapred/jobTracker/history/ done The completed job history files are stored at this single well known location. If nothing is specified, the files are stored at ${hadoop.job.history.location}/done in local filesystem. hadoop.job.history.location   If job tracker is static the history files are stored in this single well known place on local filesystem. If No value is set here, by default, it is in the local file system at ${hadoop.log.dir}/history. History files are moved to mapred.jobtracker.history.completed.location which is on MapRFs JobTracker volume. mapred.jobtracker.jobhistory.lru.cache.size 5 The number of job history files loaded in memory. The jobs are loaded when they are first accessed. The cache is cleared based on LRU. JobTracker Recovery When changing any parameters in this section, a JobTracker restart is required. Parameter Value Description mapreduce.jobtracker.recovery.dir /var/mapr/cluster/mapred/jobTracker/recover y Recovery Directory. Stores list of known TaskTrackers. mapreduce.jobtracker.recovery.maxtime 120 Maximum time in seconds JobTracker should stay in recovery mode. mapreduce.jobtracker.split.metainfo.maxsize 10000000 This property's value sets the maximum permissible size of the split metainfo file. The JobTracker does not attempt to read split metainfo files larger than this value. 
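The node label parameters at the top of this table can be combined as in the following sketch. The label file path, the queue name "highmem", the label "hinode", and the policy choice are hypothetical placeholders used only to show the shape of the configuration; the monitor interval shown is the documented default.
<property>
  <name>mapreduce.jobtracker.node.labels.file</name>
  <value>/var/mapr/node.labels</value>
</property>
<property>
  <name>mapreduce.jobtracker.node.labels.monitor.interval</name>
  <value>120000</value>
</property>
<property>
  <name>mapred.queue.highmem.label</name>
  <value>hinode</value>
</property>
<property>
  <name>mapred.queue.highmem.label.policy</name>
  <value>PREFER_QUEUE</value>
</property>
PREFER_QUEUE is one of the documented policies (PREFER_QUEUE, PREFER_JOB, AND, OR); with it, the label set on the queue always takes precedence over the label set on the job.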
mapred.jobtracker.restart.recover true "true" to enable (job) recovery upon restart, "false" to start afresh. mapreduce.jobtracker.recovery.job.initialization.maxtime 480 This property's value specifies the maximum time in seconds that the JobTracker waits to initialize jobs before starting recovery. This property's default value is equal to the value of the mapreduce.jobtracker.recovery.maxtime property.
Enable Fair Scheduler
When changing any parameters in this section, a JobTracker restart is required.
Parameter Value Description mapred.fairscheduler.allocation.file conf/pools.xml   mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.FairScheduler The class responsible for task scheduling. mapred.fairscheduler.assignmultiple true   mapred.fairscheduler.eventlog.enabled false Enable scheduler logging in ${HADOOP_LOG_DIR}/fairscheduler/ mapred.fairscheduler.smalljob.schedule.enable True Set this property's value to False to disable fast scheduling for small jobs in the FairScheduler. TaskTrackers can reserve an ephemeral slot for small jobs when the cluster is under load. mapred.fairscheduler.smalljob.max.maps 10 Small job definition. Maximum number of maps allowed in a small job. mapred.fairscheduler.smalljob.max.reducers 10 Small job definition. Maximum number of reducers allowed in a small job. mapred.fairscheduler.smalljob.max.inputsize 10737418240 Small job definition. Maximum input size in bytes allowed for a small job. Default is 10 GB. mapred.fairscheduler.smalljob.max.reducer.inputsize 1073741824 Small job definition. Maximum estimated input size for a reducer allowed in a small job. Default is 1 GB per reducer. mapred.cluster.ephemeral.tasks.memory.limit.mb 200 Small job definition. Maximum memory in megabytes reserved for an ephemeral slot. Default is 200 MB. This value must be the same on JobTracker and TaskTracker nodes.
TaskTracker Configuration
When changing any parameters in this section, a TaskTracker restart is required. When mapreduce.tasktracker.prefetch.maptasks is greater than 0, you must disable the Fair Scheduler with preemption and label-based job placement.
Parameter Value Description mapred.tasktracker.map.tasks.maximum -1 The maximum number of map task slots to run simultaneously. The default value of -1 specifies that the number of map task slots is based on the total amount of memory reserved for MapReduce by the Warden. Of the memory available for MapReduce (not counting the memory reserved for ephemeral slots), 40% is allocated to map tasks. That total amount of memory is divided by the value of the mapred.maptask.memory.default parameter to determine the total number of map task slots on this node. You can also specify a formula using the following variables: CPUS - the number of CPUs on the node. DISKS - the number of disks on the node. MEM - the amount of memory (in MB) reserved for MapReduce tasks by the Warden. You can assemble these variables with the syntax CONDITIONAL ? TRUE : FALSE (see the illustrative snippet below). For example, the expression 2*CPUS < DISKS ? 2*CPUS : DISKS results in 2*CPUS slots when there are more disks on the node than twice the number of cores, and DISKS slots otherwise. mapreduce.tasktracker.prefetch.maptasks 0.0 The proportion of map tasks that can be scheduled in advance (prefetched) on a TaskTracker. The number is given as a ratio of prefetched tasks to the total number of map slots. For example, 0.25 means the number of prefetched tasks = 25% of the total number of map slots. The default is 0.0, which means no prefetched tasks can be scheduled.
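Following the formula syntax just described, a cluster administrator could cap slots with expressions like the ones below. These expressions are illustrations only (the second restates the formula listed as a default in the earlier parameter table), not recommended values; note that the < character must be escaped as &lt; inside an XML value.
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2*CPUS &lt; DISKS ? 2*CPUS : DISKS</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>(CPUS > 2) ? (CPUS * 0.50) : 1</value>
</property>
A TaskTracker restart is required for these parameters to take effect.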
mapreduce.tasktracker.reserved.physicalme mory.mb.low 0.8 This property's value sets the target memory usage level when the TaskTracker kills tasks to reduce total memory usage. This property's value represents a percentage of the amount in the mapreduce.tasktracke value. r.reserved.physicalmemory.mb mapreduce.tasktracker.task.slowlaunch False Set this property's value to True to wait after each task launch for nodes running critical services like CLDB, JobTracker, and ZooKeeper. mapreduce.tasktracker.volume.healthcheck.i nterval 60000 This property's value defines the frequency in milliseconds that the TaskTracker checks the Mapreduce volume defined in the ${mapr.loc property. alvolumes.path}/mapred/ mapreduce.use.maprfs True Use MapR-FS for shuffle and sort/merge. mapred.userlog.retain.hours 24 This property's value specifies the maximum time, in hours, to retain the user-logs after job completion. mapred.user.jobconf.limit 5242880 The maximum allowed size of the user jobconf. The default is set to 5 MB. mapred.userlog.limit.kb 0 Deprecated: The maximum size of user-logs of each task in KB. 0 disables the cap. mapreduce.use.fastreduce False Expert: Merge map outputs without copying. mapred.tasktracker.reduce.tasks.maximum -1 The maximum number of reduce task slots to run simultaneously. The default value of -1 specifies that the number of reduce task slots is based on the total amount of memory reserved for MapReduce by the Warden. Of the memory available for MapReduce (not counting the memory reserved for ephemeral slots), 60% is allocated to reduce tasks. That total amount of memory is divided by the value of the mapred.reducetask.memory parameter to determine the total .default number of reduce task slots on this node. You can also specify a formula using the following variables: CPUS - The number of CPUs on the node. DISKS - The number of disks on the node. MEM - The amount of memory reserved for MapReduce tasks by the Warden. You can assemble these variables with the syntax CONDITIONAL ? TRUE : FALSE. For example, the expression 2*CPUS < DISKS ? 2*CPUS : DISKS results in 2*CPUS slots when there are more disks on the node than twice the number of cores, and DISKS slots otherwise. mapred.tasktracker.ephemeral.tasks.maximu m 1 Reserved slot for small job scheduling mapred.tasktracker.ephemeral.tasks.timeout 10000 Maximum time in milliseconds a task is allowed to occupy ephemeral slot mapred.tasktracker.ephemeral.tasks.ulimit 4294967296 Ulimit (bytes) on all tasks scheduled on an ephemeral slot mapreduce.tasktracker.reserved.physicalme mory.mb   Maximum phyiscal memory TaskTracker should reserve for mapreduce tasks. If tasks use more than the limit, task using maximum memory will be killed. Expert only: Set this value only if TaskTracker should use a certain amount of memory for mapreduce tasks. In MapR Distro warden figures this number based on services configured on a node. Setting mapreduce.tasktracker.reserved.physicalme mory.mb to -1 will disable physical memory accounting and task management. mapred.tasktracker.expiry.interval 600000 Expert: This property's value specifies a time interval in milliseconds. After this interval expires without any heartbeats sent, a TaskTracker is marked . lost mapreduce.tasktracker.heapbased.memory. management false Expert only: If the admin wants to prevent swapping by not launching too many tasks, use this option. Task's memory usage is based on max java heap size (-Xmx). 
By default, -Xmx will be computed by the TaskTracker based on slots and memory reserved for mapreduce tasks. See mapred.map.child.java.opts/mapred.reduce.child.java.opts. mapreduce.tasktracker.jvm.idle.time 10000 If a JVM is idle for more than mapreduce.tasktracker.jvm.idle.time (milliseconds), the TaskTracker will kill it. mapred.max.tracker.failures 4 The number of task failures on a TaskTracker of a given job after which new tasks of that job aren't assigned to it. mapred.max.tracker.blacklists 4 The number of blacklists for a TaskTracker by various jobs after which the TaskTracker could be blacklisted across all jobs. The TaskTracker will be given tasks later (after a day). The TaskTracker will become healthy after a restart. mapred.task.tracker.http.address 0.0.0.0:50060 This property's value specifies the HTTP server address and port for the TaskTracker. Specify 0 as the port to make the server start on a free port. mapred.task.tracker.report.address 127.0.0.1:0 The IP address and port that the TaskTracker server listens on. Since it is only connected to by the tasks, it uses the local interface. EXPERT ONLY. Only change this value if your host does not have a loopback interface. mapreduce.tasktracker.group mapr Expert: Group to which the TaskTracker belongs. If LinuxTaskController is configured via the mapreduce.tasktracker.taskcontroller value, the group owner of the task-controller binary $HADOOP_HOME/bin/platform/bin/task-controller must be the same as this group. mapred.tasktracker.task-controller.config.overwrite True The LinuxTaskController needs a configuration file set at $HADOOP_HOME/conf/taskcontroller.cfg. The configuration file takes the following parameters: mapred.local.dir = local dir used by the TaskTracker, taken from mapred-site.xml; hadoop.log.dir = hadoop log dir, taken from the system properties of the TaskTracker process; mapreduce.tasktracker.group = groups allowed to run the TaskTracker, see 'mapreduce.tasktracker.group'; min.user.id = don't allow any user below this uid to launch a task; banned.users = users who are not allowed to launch any tasks. If set to True, the TaskTracker will always overwrite the config file with default values: min.user.id = -1 (check disabled), banned.users = bin, mapreduce.tasktracker.group = root. To disable this configuration and use a custom configuration, set this property's value to False and restart the TaskTracker. mapred.tasktracker.indexcache.mb 10 This property's value specifies the maximum amount of memory allocated by the TaskTracker for the index cache. The index cache is used when the TaskTracker serves map outputs to reducers. mapred.tasktracker.instrumentation org.apache.hadoop.mapred.TaskTrackerMetricsInst Expert: The instrumentation class to associate with each TaskTracker. mapred.task.tracker.task-controller org.apache.hadoop.mapred.LinuxTaskController This property's value specifies the TaskController that launches and manages task execution. mapred.tasktracker.taskmemorymanager.killtask.maxRSS False Set this property's value to True to kill tasks that are using maximum memory when the total memory used by MapReduce tasks exceeds the limit specified in the TaskTracker's mapreduce.tasktracker.reserved.physicalmemory.mb property. Tasks are killed in most-recently-launched order. mapred.tasktracker.taskmemorymanager.monitoring-interval 3000 This property's value specifies an interval in milliseconds that the TaskTracker waits between monitoring the memory usage of tasks. This property is only used when task memory management is enabled by setting the mapred.tasktracker.tasks.maxmemory property to True. mapred.tasktracker.tasks.sleeptime-before-sigkill 5000 This property's value sets the time in milliseconds that the TaskTracker waits before sending a SIGKILL to a process after it has been sent a SIGTERM. mapred.temp.dir ${hadoop.tmp.dir}/mapred/temp A shared directory for temporary files. mapreduce.cluster.map.userlog.retain-size -1 This property's value specifies the number of bytes to retain from map task logs. The default value of -1 disables this feature. mapreduce.cluster.reduce.userlog.retain-size -1 This property's value specifies the number of bytes to retain from reduce task logs. The default value of -1 disables this feature. mapreduce.heartbeat.10000 100000 This property's value specifies a heartbeat time in milliseconds for a medium cluster of 1001 to 10000 nodes. Scales linearly between 10s and 100s. mapreduce.heartbeat.1000 10000 This property's value specifies a heartbeat time in milliseconds for a medium cluster of 101 to 1000 nodes. Scales linearly between 1s and 10s. mapreduce.heartbeat.100 1000 This property's value specifies a heartbeat time in milliseconds for a medium cluster of 11 to 100 nodes. Scales linearly between 300ms and 1s. mapreduce.heartbeat.10 300 This property's value specifies a heartbeat time in milliseconds for a small cluster of 1 to 10 nodes. mapreduce.job.complete.cancel.delegation.tokens True Set this property's value to False to avoid unregistering or cancelling delegation tokens from renewal, because the same tokens may be used by spawned jobs. mapreduce.jobtracker.inline.setup.cleanup False Set this property's value to True to make the JobTracker attempt to set up and clean up the job by itself, rather than in a separate setup/cleanup task.
Job Configuration
Set these values on the node from which you plan to submit jobs, before submitting the jobs. If you are using the Hadoop examples, you can set these parameters from the command line. Example: hadoop jar hadoop-examples.jar terasort -Dmapred.map.child.java.opts="-Xmx1000m" When you submit a job, the JobClient creates job.xml by reading parameters from the following files in the following order: 1. mapred-default.xml 2. The local mapred-site.xml - overrides identical parameters in mapred-default.xml 3. Any settings in the job code itself - overrides identical parameters in mapred-site.xml
Parameter Value Description keep.failed.task.files false Should the files for failed tasks be kept. This should only be used on jobs that are failing, because the storage is never reclaimed. It also prevents the map outputs from being erased from the reduce directory as they are consumed. mapred.job.reuse.jvm.num.tasks -1 How many tasks to run per JVM. If set to -1, there is no limit. mapred.map.tasks.speculative.execution true If true, then multiple instances of some map tasks may be executed in parallel. mapred.reduce.tasks.speculative.execution true If true, then multiple instances of some reduce tasks may be executed in parallel. mapred.reduce.tasks 1 The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when the value of the mapred.job.tracker property is local. mapred.job.map.memory.physical.mb   Maximum physical memory limit for a map task of this job. If the limit is exceeded, the task attempt will be FAILED. mapred.job.reduce.memory.physical.mb   Maximum physical memory limit for a reduce task of this job.
If limit is exceeded task attempt will be FAILED. mapreduce.task.classpath.user.precedence false Set to true if user wants to set different classpath. mapred.max.maps.per.node -1 Per-node limit on running map tasks for the job. A value of -1 signifies no limit. mapred.max.reduces.per.node -1 Per-node limit on running reduce tasks for the job. A value of -1 signifies no limit. mapred.running.map.limit -1 Cluster-wide limit on running map tasks for the job. A value of -1 signifies no limit. mapred.running.reduce.limit -1 Cluster-wide limit on running reduce tasks for the job. A value of -1 signifies no limit. mapreduce.tasktracker.cache.local.numberdi rectories 10000 This property's value sets the maximum number of subdirectories to create in a given distributed cache store. Cache items in excess of this limit are expunged whether or not the total size threshold is exceeded. mapred.reduce.child.java.opts -XX:ErrorFile=/opt/cores/mapreduce_java_er ror%p.log Java opts for the reduce tasks. MapR Default heapsize (-Xmx) is determined by memory reserved for mapreduce at TaskTracker. Reduce task is given more memory than map task. Default memory for a reduce task = (Total Memory reserved for mapreduce) * (2*#reduceslots / (#mapslots + 2*#reduceslots)) mapred.reduce.child.ulimit     io.sort.factor 256 The number of streams to merge simultaneously during file sorting. The value of this property determines the number of open file handles. io.sort.mb 380 This value sets the size, in megabytes, of the memory buffer that holds map outputs before writing the final map outputs. Lower values for this property increases the chance of spills. Recommended practice is to set this value to 1.5 times the average size of a map output. io.sort.record.percent 0.17 The percentage of the memory buffer specified by the property that io.sort.mb is dedicated to tracking record boundaries. The maximum number of records that the collection thread can collect before blocking is one-fourth of ( ) x ( io.sort.mb io.sort. ). record.percent io.sort.spill.percent 0.99 This property's value sets the soft limit for either the buffer or record collection buffers. Threads that reach the soft limit begin to spill the contents to disk in the background. Note that this does not imply any chunking of data to the spill. Do not reduce this value below 0.5. mapred.reduce.slowstart.completed.maps 0.95 Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job. mapreduce.reduce.input.limit -1 The limit on the input size of the reduce. If the estimated input size of the reduce is greater than this value, job is failed. A value of -1 means that there is no limit set. mapred.reduce.parallel.copies 12 The default number of parallel transfers run by reduce during the copy(shuffle) phase. jobclient.completion.poll.interval 5000 This property's value specifies the JobClient's polling frequency in milliseconds to the JobTracker for updates about job status. Reduce this value for faster tests on single node systems. Adjusting this value on production clusters may result in undesired client-server traffic. jobclient.output.filter FAILED This property's value specifies the filter that controls the output of the task's userlogs that are sent to the JobClient's console. Legal values are: NONE KILLED FAILED SUCCEEDED ALL jobclient.progress.monitor.poll.interval 1000 This property's value specifies the JobClient's status reporting frequency in milliseconds to the console and checking for job completion. 
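The sort-related job parameters above (io.sort.mb, io.sort.factor, io.sort.record.percent, io.sort.spill.percent) can be overridden per job. The following fragment is a sketch that simply enlarges the sort buffer; the 480 MB figure is an arbitrary example, not a recommendation, and the other values repeat the documented defaults.
<property>
  <name>io.sort.mb</name>
  <value>480</value>
</property>
<property>
  <name>io.sort.factor</name>
  <value>256</value>
</property>
<property>
  <name>io.sort.spill.percent</name>
  <value>0.99</value>
</property>
The same properties can also be passed on the command line with -D, in the style of the hadoop jar terasort example at the start of this Job Configuration section.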
job.end.notification.url http://localhost:8080/jobstatus.php?jobId=$jo bId&jobStatus=$jobStatus This property's value specifies the URL to call at job completion to report the job's end status. Only two variables are legal in the URL, and . When $jobId $jobStatus present, these variables are replaced by their respective values. job.end.retry.attempts 0 This property's value specifies the maximum number of times that Hadoop attempts to contact the notification URL. job.end.retry.interval 30000 This property's value specifies the interval in milliseconds between attempts to contact the notification URL. keep.failed.task.files False Set this property's value to True to keep files for failed tasks. Because this storage is not automatically reclaimed by the system, keep files only for jobs that are failing. Setting this property's value to True also keeps map outputs in the reduce directory as the map outputs are consumed instead of deleting the map outputs on consumption. local.cache.size 10737418240 This property's value specifies the number of bytes allocated to each local TaskTracker directory to store Distributed Cache data. mapr.centrallog.dir logs This property's value specifies the relative path under a local volume path that points to the central log location, ${mapr.localvol umes.path}/ /${mapr.cent <hostname> }. rallog.dir mapr.localvolumes.path /var/mapr/local The path for local volumes. map.sort.class org.apache.hadoop.util.QuickSort The default sort class for sorting keys. tasktracker.http.threads 2 The number of worker threads that for the HTTP server. topology.node.switch.mapping.impl org.apache.hadoop.net.ScriptBasedMapping The default implementation of the DNSToSwitchMapping. It invokes a script specified in the topology.script.file. property to resolve node names. If no name value is set for the topology.script.fil property, the default value of e.name DEFAULT_RACK is returned for all node names. topology.script.number.args 100 The max number of arguments that the script configured with the topology.script.fil runs with. Each argument is an IP e.name address. mapr.task.diagnostics.enabled False Set this property's value to True to run the MapR diagnostics script before killing an unresponsive task attempt. mapred.acls.enabled False This property's value specifies whether or not to check ACLs for user authorization during various queue and job level operations. Set this property's value to True to enable access control checks made by the JobTracker and TaskTracker when users request queue and job operations using Map/Reduce APIs, RPCs, the console, or the web user interfaces. mapred.child.oom_adj 10 This property's value specifies the adjustment to the out-of-memory value for the Linux-specific out-of-memory killer. Legal values are 0-15. mapred.child.renice 10 This property's value specifies an integer from 0 to 19 for use by the Linux nice}} utility. mapred.child.taskset True Set this property's value to False to prevent running the job in a taskset. See the manual page for for more information. taskset(1) mapred.child.tmp ./tmp This property's value sets the location of the temporary directory for map and reduce tasks. Set this value to an absolute path to directly assign the directory. Relative paths are located under the task's working directory. Java tasks execute with the option -Djava.io.tmpdir=absolute path of . Pipes and streaming are set the tmp dir with environment variable TMPDIR=absolut . 
e path of the tmp dir mapred.cluster.ephemeral.tasks.memory.limi t.mb 200 This property's value specifies the maximum size in megabytes for small jobs. This value is reserved in memory for an ephemeral slot. JobTracker and TaskTracker nodes must set this property to the same value. mapred.cluster.map.memory.mb -1 This property's value sets the virtual memory size of a single map slot in the Map-Reduce framework used by the scheduler. If the scheduler supports this feature, a job can ask for multiple slots for a single map task via , to the mapred.job.map.memory.mb limit specified by the value of mapred.clus . The default ter.max.map.memory.mb value of -1 disables the feature. Set this value to a useful memory size to enable the feature. mapred.cluster.max.map.memory.mb -1 This property's value sets the virtual memory size of a single map task launched by the Map-Reduce framework used by the scheduler. If the scheduler supports this feature, a job can ask for multiple slots for a single map task via mapred.job.map.mem , to the limit specified by the value of ory.mb . mapred.cluster.max.map.memory.mb The default value of -1 disables the feature. Set this value to a useful memory size to enable the feature. mapred.cluster.max.reduce.memory.mb -1 This property's value sets the virtual memory size of a single reduce task launched by the Map-Reduce framework used by the scheduler. If the scheduler supports this feature, a job can ask for multiple slots for a single map task via mapred.job.reduce. , to the limit specified by the memory.mb value of mapred.cluster.max.reduce.m . The default value of -1 disables emory.mb the feature. Set this value to a useful memory size to enable the feature. mapred.cluster.reduce.memory.mb -1 This property's value sets the virtual memory size of a single reduce slot in the Map-Reduce framework used by the scheduler. If the scheduler supports this feature, a job can ask for multiple slots for a single map task via mapred.job.reduce. , to the limit specified by the memory.mb value of mapred.cluster.max.reduce.m . The default value of -1 disables emory.mb the feature. Set this value to a useful memory size to enable the feature. mapred.compress.map.output False Set this property's value to True to compress map outputs with SequenceFile compresison before sending the outputs over the network. mapred.fairscheduler.assignmultiple True Set this property's value to False to prevent the FairScheduler from assigning multiple tasks. mapred.fairscheduler.eventlog.enabled False Set this property's value to True to enable scheduler logging in {{${HADOOP_LOG_DIR /fairscheduler/ } mapred.fairscheduler.smalljob.max.inputsize 10737418240 This property's value specifies the maximum size, in bytes, that defines a small job. mapred.fairscheduler.smalljob.max.maps 10 This property's value specifies the maximum number of maps allowed in a small job. mapred.fairscheduler.smalljob.max.reducer.i nputsize 1073741824 This property's value specifies the maximum estimated input size, in bytes, for a reducer in a small job. mapred.fairscheduler.smalljob.max.reducers 10 This property's value specifies the maximum number of reducers allowed in a small job. mapred.healthChecker.interval 60000 This property's value sets the frequency, in milliseconds, that the node health script runs. mapred.healthChecker.script.timeout 600000 This property's value sets the frequency, in milliseconds, after which the node script is killed for being unresponsive and reported as failed. 
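As a sketch of how the small-job (ephemeral slot) parameters above fit together, the fragment below tightens the definition of a small job. The specific numbers are examples only; remember that mapred.cluster.ephemeral.tasks.memory.limit.mb must be set to the same value on JobTracker and TaskTracker nodes.
<property>
  <name>mapred.fairscheduler.smalljob.max.maps</name>
  <value>5</value>
</property>
<property>
  <name>mapred.fairscheduler.smalljob.max.reducers</name>
  <value>5</value>
</property>
<property>
  <name>mapred.fairscheduler.smalljob.max.inputsize</name>
  <value>5368709120</value> <!-- 5 GB, example only -->
</property>
<property>
  <name>mapred.cluster.ephemeral.tasks.memory.limit.mb</name>
  <value>200</value>
</property>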
mapred.inmem.merge.threshold 1000 When a number of files equal to this property's value accumulate, the in-memory merge triggers and spills to disk. Set this property's value to zero or less to force merges and spills to trigger solely on RAMFS memory consumption. mapred.job.map.memory.mb -1 This property's value sets the virtual memory size of a single map task for the job. If the scheduler supports this feature, a job can ask for multiple slots for a single map task via , to mapred.cluster.map.memory.mb the limit specified by the value of mapred.c . The default luster.max.map.memory.mb value of -1 disables the feature if the value of the p mapred.cluster.map.memory.mgb roperty is also -1. Set this value to a useful memory size to enable the feature. mapred.job.queue.name default This property's value specifies the queue a job is submitted to. This property's value must match the name of a queue defined in for the system. The mapred.queue.names ACL setup for the queue must allow the current user to submit a job to the queue. mapred.job.reduce.input.buffer.percent 0 This property's value specifies the percentage of memory relative to the maximum heap size. After the shuffle, remaining map outputs in memory must occupy less memory than this threshold value before reduce begins. mapred.job.reduce.memory.mb -1 This property's value sets the virtual memory size of a single reduce task for the job. If the scheduler supports this feature, a job can ask for multiple slots for a single map task via mapred.cluster.reduce.memory.m , to the limit specified by the value of b mapre . d.cluster.max.reduce.memory.mb The default value of -1 disables the feature if the value of the mapred.cluster.map.me property is also -1. Set this value mory.mgb to a useful memory size to enable the feature. mapred.job.reuse.jvm.num.tasks -1 This property's value sets the number of tasks to run on each JVM. The default of -1 sets no limit. mapred.job.shuffle.input.buffer.percent 0.7 This property's value sets the percentage of memory allocated from the maximum heap size to storing map outputs during the shuffle. mapred.job.shuffle.merge.percent 0.66 This property's value sets a percentage of the total memory allocated to storing map outputs in mapred.job.shuffle.input. . When memory storage buffer.percent for map outputs reaches this percentage, an in-memory merge triggers. mapred.job.tracker.handler.count 10 This property's value sets the number of server threads for the JobTracker. As a best practice, set this value to approximately 4% of the number of TaskTracker nodes. mapred.job.tracker.history.completed.locatio n /var/mapr/cluster/mapred/jobTracker/history/ done This property's value sets a location to store completed job history files. When this property has no value specified, completed job files are stored at ${hadoop.job.history.lo /done in the local filesystem. cation} mapred.job.tracker.http.address 0.0.0.0:50030 This property's value specifies the HTTP server address and port for the JobTracker. Specify 0 as the port to make the server start on a free port. mapred.jobtracker.instrumentation org.apache.hadoop.mapred.JobTrackerMetri csInst Expert: The instrumentation class to associate with each JobTracker. mapred.jobtracker.job.history.block.size 3145728 This property's value sets the block size of the job history file. Dumping job history to disk is important because job recovery uses the job history. 
mapred.jobtracker.jobhistory.lru.cache.size 5 This property's value specifies the number of job history files to load in memory. The jobs are loaded when they are first accessed. The cache is cleared based on LRU. mapred.job.tracker maprfs:/// JobTracker address ip:port or use uri maprfs:/// for default cluster or maprfs:///mapr/san_jose_cluster1 to connect 'san_jose_cluster1' cluster. ""local"" for standalone mode. mapred.jobtracker.maxtasks.per.job -1 Set this property's value to any positive integer to set the maximum number of tasks for a single job. The default value of -1 indicates that there is no maximum. mapred.job.tracker.persist.jobstatus.active False Set this property's value to True to enable persistence of job status information. mapred.job.tracker.persist.jobstatus.dir /var/mapr/cluster/mapred/jobTracker/jobsInfo This property's value specifies the directory where job status information persists after dropping out of the memory queue between JobTracker restarts. mapred.job.tracker.persist.jobstatus.hours 0 This property's value specifies job status information persistence time in hours. Persistent job status information is available after the information drops out of the memory queue and between JobTracker restarts. The default value of zero disables job status information persistence. mapred.jobtracker.port 9001 The IPC port on which the JobTracker listens. mapred.jobtracker.restart.recover True Set this property's value to False to disable job recovery on restart. mapred.jobtracker.retiredjobs.cache.size 1000 This property's value specifies the number of retired job statuses kept in the cache. mapred.jobtracker.retirejob.check 30000 This property's value specifies the frequency interval used by the retire job thread to check for completed jobs. mapred.line.input.format.linespermap 1 Number of lines per split in NLineInputFormat. mapred.local.dir.minspacekill 0 This property's value specifies a threshold of free space in the directory specified by the m property. When free apred.local.dir space drops below this threshold, no more tasks are requested until all current tasks finish and clean up. When free space is below this threshold, running tasks are killed in the following order until free space is above the threshold: Reduce tasks All other tasks in reverse percent-completed order. mapred.local.dir.minspacestart 0 This property's value specifies a free space threshold for the directory specified by mapr . No tasks are requested ed.local.dir while free space is below this threshold. mapred.local.dir /tmp/mapr-hadoop/mapred/local This property's value specifies the directory where MapReduce localized job files. Localized job files are the job-related files downloaded by the TaskTracker and include the job configuration, job JAR file, and files added to the DistributedCache. Each task attempt has a dedicated subdirectory under the directory. Shared mapred.local.dir files are symbolically linked to those subdirectories. mapred.map.child.java.opts -XX:ErrorFile=/opt/cores/mapreduce_java_er ror%p.log This property stores Java options for map tasks. When present, the symbol @taskid@ is replaced with the current TaskID. As an example, to enable verbose garbage collection logging to a file named for the taskid in and to set the heap maximum /tmp to 1GB, set this property to the value -Xmx1 024m -verbose:gc . -Xloggc:/tmp/@[email protected] The configuration variable mapred.{map/r controls the .child.ulimit educe} maximum virtual memory of the child processes. 
In the MapR distribution for Hadoop, the default is determined by memory -Xmx reserved for mapreduce by the TaskTracker. Reduce tasks use memory than map tasks. The default memory for a map task follows the formula (Total Memory reserved for mapreduce) * (#mapslots/ (#mapslots + 1.3*#reduceslots)). mapred.map.child.log.level INFO This property's value sets the logging level for the map task. The allowed levels are: OFF FATAL ERROR WARN INFO DEBUG TRACE ALL mapred.map.max.attempts 4 Expert: This property's value sets the maximum number of attempts per map task. mapred.map.output.compression.codec org.apache.hadoop.io.compress.DefaultCod ec Specifies the compression codec to use to compress map outputs if compression of map outputs is enabled. mapred.maptask.memory.default 800 When the value of the mapred.tasktrack parameter is -1, er.map.tasks.maximum this parameter specifies a size in MB that is used to determine the default total number of map task slots on this node. mapred.map.tasks 2 The default number of map tasks per job. Ignored when the value of the mapred.job. property is . tracker local mapred.maxthreads.generate.mapoutput 1 Expert: Number of intra-map-task threads to sort and write the map output partitions. mapred.maxthreads.partition.closer 1 Expert: Number of threads that asynchronously close or flush map output partitions. mapred.merge.recordsBeforeProgress 10000 The number of records to process during a merge before sending a progress notification to the TaskTracker. mapred.min.split.size 0 The minimum size chunk that map input should be split into. File formats with minimum split sizes take priority over this setting. mapred.output.compress False Set this property's value to True to compress job outputs. mapred.output.compression.codec org.apache.hadoop.io.compress.DefaultCod ec When job output compression is enabled, this property's value specifies the compression codec. mapred.output.compression.type RECORD When job outputs are compressed as SequenceFiles, this value's property specifies how to compress the job outputs. Legal values are: NONE RECORD BLOCK mapred.queue.default.state RUNNING This property's value defines the state of the default queue, which can be either STOPPED or RUNNING. This value can be changed at runtime. mapred.queue.names default This property's value specifies a comma-separated list of the queues configured for this JobTracker. Jobs are added to queues and schedulers can configure different scheduling properties for the various queues. To configure a property for a queue, the name of the queue must match the name specified in this value. Queue properties that are common to all schedulers are configured here with the naming convention mapred.queue.$QUEUE . -NAME.$PROPERTY-NAME The number of queues configured in this parameter can depend on the type of scheduler being used, as specified in mapred.jobtracker.taskScheduler. For example, the JobQueueTaskScheduler supports only a single queue, which is the default configured here. Verify that the schedule supports multiple queues before adding queues. mapred.reduce.child.log.level INFO The logging level for the reduce task. The allowed levels are: OFF FATAL ERROR WARN INFO DEBUG TRACE ALL mapred.reduce.copy.backoff 300 This property's value specifies the maximum amount of time in seconds a reducer spends on fetching one map output before declaring the fetch failed. mapred.reduce.max.attempts 4 Expert: The maximum number of attempts per reduce task. 
mapred.reducetask.memory.default 1500 When the value of the mapred.tasktrack parameter is er.reduce.tasks.maximum -1, this parameter specifies a size in MB that is used to determine the default total number of reduce task slots on this node. mapred.skip.attempts.to.start.skipping 2 This property's value specifies a number of task attempts. After that many task attempts, skip mode is active. While skip mode is active, the task reports the range of records which it will process next to the TaskTracker. With this record range, the TaskTracker is aware of which records are dubious and skips dubious records on further executions. mapred.skip.map.auto.incr.proc.count True SkipBadRecords.COUNTER_MAP_PROCE SSED_RECORDS increments after MapRunner invokes the map function. Set this property's value to False for applications that process records asynchronously or buffer input records. Such applications must increment this counter directly. mapred.skip.map.max.skip.records 0 The number of acceptable skip records around the bad record, per bad record in the mapper. The number includes the bad record. The default value of 0 disables detection and skipping of bad records. The framework tries to narrow down the skipped range by retrying until this threshold is met OR all attempts get exhausted for this task. Set the value to to Long.MAX_VALUE prevent the framework from narrowing down the skipped range. mapred.skip.reduce.auto.incr.proc.count True SkipBadRecords.COUNTER_MAP_PROCE SSED_RECORDS increments after MapRunner invokes the reduce function. Set this property's value to False for applications that process records asynchronously or buffer input records. Such applications must increment this counter directly. mapred.skip.reduce.max.skip.groups 0 The number of acceptable skip records around the bad record, per bad record in the reducer. The number includes the bad record. The default value of 0 disables detection and skipping of bad records. The framework tries to narrow down the skipped range by retrying until this threshold is met OR all attempts get exhausted for this task. Set the value to to Long.MAX_VALUE prevent the framework from narrowing down the skipped range. mapred.submit.replication 10 This property's value specifies the replication level for submitted job files. As a best practice, set this value to approximately the square root of the number of nodes. mapred.task.cache.levels 2 This property's value specifies the maximum level of the task cache. For example, if the level is 2, the tasks cached are at the host level and at the rack level. mapred.task.calculate.resource.usage True Set this property's value to False to prevent the use of the ${mapreduce.tasktracke parame r.resourcecalculatorplugin} ter. mapred.task.profile False Set this property's value to True to enable task profiling and the collection of profiler information by the system. mapred.task.profile.maps 0-2 This property's value sets the ranges of map tasks to profile. This property is ignored when the value of the mapred.task.profi property is set to False. le mapred.task.profile.reduces 0-2 This property's value sets the ranges of reduce tasks to profile. This property is ignored when the value of the mapred.task property is set to False. 
.profile mapred.task.timeout 600000 This property's value specifies a time in milliseconds after which a task terminates if the task does not perform any of the following: reads an input writes an output updates its status string mapred.tasktracker.dns.interface default This property's value specifies the name of the network interface that the TaskTracker reports its IP address from. mapred.tasktracker.dns.nameserver default This property's value specifies the host name or IP address of the name server (DNS) that the TaskTracker uses to determine the JobTracker's hostname. Oozie Parameter Value Description hadoop.proxyuser.root.hosts * Specifies the hosts that the superuser must connect from in order to act as another user. Specify the hosts as a comma-separatedlist of IP addresses or hostnames that are running Oozie servers. hadoop.proxyuser.mapr.groups mapr,staff   hadoop.proxyuser.root.groups root The superuser can act as any member of the listed groups. mfs.conf The configuration file specifies the following parameters about the MapR-FS server on each node: /opt/mapr/conf/mfs.conf Parameter Value Description mfs.server.ip 192.168.10.10 IP address of the FileServer mfs.server.port 5660 Port used for communication with the server mfs.cache.lru.sizes inode:6:log:6:meta:10:dir:40:small:15 LRU cache configuration mfs.on.virtual.machine 0 Specifies whether MapR-FS is running on a virtual machine mfs.io.disk.timeout 60 Timeout, in seconds, after which a disk is considered failed and taken offline. This parameter can be increased to tolerate slow disks. mfs.max.disks 48 Maximum number of disks supported on a single node. mfs.subnets.whitelist   A list of subnets that are allowed to make requests to the FileServer service and access data on the cluster. mfs.disk.resynciothrottle.factor   The amount of time a resync disk read work area waits between disk read operations. This time is based on the following calculation: WaitTime = C * N * RTT / throttleFactor C A constant with a value of 10. N The number of outstanding resync disk read work areas. RTT Time to complete the current disk read operation. throttleFactor The value of the mfs .disk.resynciothrottle.factor parameter. mfs.network.resynciothrottle.factor   Controls the amount of time a resync network send work area waits between network send operations. Example mfs.server.ip=192.168.10.10 mfs.server.port=5660 mfs.cache.lru.sizes=inode:6:log:6:meta:10:dir:40:small:15 mfs.on.virtual.machine=0 mfs.io.disk.timeout=60 mfs.max.disks=48 taskcontroller.cfg The file specifies TaskTracker configuration parameters. The /opt/mapr/hadoop/hadoop-<version>/conf/taskcontroller.cfg parameters should be set the same on all TaskTracker nodes. See also . Secured TaskTracker Parameter Value Description mapred.local.dir /tmp/mapr-hadoop/mapred/local The local MapReduce directory. hadoop.log.dir /opt/mapr/hadoop/hadoop-0.20.2/bin/../logs The Hadoop log directory. mapreduce.tasktracker.group root The group that is allowed to submit jobs. min.user.id -1 The minimum user ID for submitting jobs: Set to to disallow from 0 root submitting jobs Set to to disallow all superusers 1000 from submitting jobs banned.users (not present by default) Add this parameter with a comma-separated list of usernames to ban certain users from submitting jobs warden.conf The file controls parameters related to MapR services and the warden. Most of the parameters are not /opt/mapr/conf/warden.conf intended to be edited directly by users. 
The following table shows the parameters of interest: Parameter Sample Value Description service.command.hbmaster.heapsize.percen t 4 The percentage of heap space reserved for the HBase Master. service.command.hbmaster.heapsize.max 512 The maximum heap space that can be used by the HBase Master. service.command.hbmaster.heapsize.min 64 The minimum heap space for use by the HBase Master. service.command.hbregion.heapsize.percent 25 The percentage of heap space reserved for the HBase Region Server. service.command.hbregion.heapsize.max 4000 The maximum heap space that can be used by the HBase Region Server. service.command.hbregion.heapsize.min 1000 The minimum heap space for use by the HBase Region Server. service.command.cldb.heapsize.percent 8 The percentage of heap space reserved for the CLDB. service.command.cldb.heapsize.max 4000 The maximum heap space that can be used by the CLDB. service.command.cldb.heapsize.min 256 The minimum heap space for use by the CLDB. service.command.cldb.retryinterval.time.sec 600 Specifies an interval in seconds. The warden attempts to restart a failed CLDB service when this interval expires. service.command.jt.heapsize.percent 10 The percentage of heap space reserved for the JobTracker. Memory allocation for JobTracker is only used to calculate total memory required for all services to run. The -Xmx JobTracker value is not set, allowing memory on JobTracker to grow as needed. If an upper limit on memory is strongly desired, set the HADOOP_HEAPSIZE variable in /op t/mapr/hadoop/hadoop-0.20.2/conf/ . hadoop-env.sh service.command.jt.heapsize.max 5000 The maximum heap space that can be used by the JobTracker. Memory allocation for JobTracker is only used to calculate total memory required for all services to run. The -Xmx JobTracker value is not set, allowing memory on JobTracker to grow as needed. If an upper limit on memory is strongly desired, set the HADOOP_HEAPSIZE variable in /op t/mapr/hadoop/hadoop-0.20.2/conf/ . hadoop-env.sh service.command.jt.heapsize.min 256 The minimum heap space for use by the JobTracker. Memory allocation for JobTracker is only used to calculate total memory required for all services to run. The -Xmx JobTracker value is not set, allowing memory on JobTracker to grow as needed. If an upper limit on memory is strongly desired, set the HADOOP_HEAPSIZE variable in /op t/mapr/hadoop/hadoop-0.20.2/conf/ . hadoop-env.sh service.command.mfs.heapsize.percent 35 The percentage of heap space reserved for the MapR-FS FileServer. Restart the Warden after modifying this setting. service.command.mfs.heapsize.min 512 The minimum heap space that can be used by the MapR-FS FileServer. Restart the Warden after modifying this setting. service.command.tt.heapsize.percent 2 The percentage of heap space reserved for the TaskTracker. Memory allocation for TaskTracker is only used to calculate total memory required for all services to run. The -Xmx TaskTracker value is not set, allowing memory on TaskTracker to grow as needed. If an upper limit on memory is strongly desired, set the HADOOP_HEAPSIZE variable in /opt/mapr/hadoop/hadoop-0 . .20.2/conf/hadoop-env.sh service.command.tt.heapsize.max 325 The maximum heap space that can be used by the TaskTracker. Memory allocation for TaskTracker is only used to calculate total memory required for all services to run. The -Xmx TaskTracker value is not set, allowing memory on TaskTracker to grow as needed. 
If an upper limit on memory is strongly desired, set the HADOOP_HEAPSIZE variable in /opt/mapr/hadoop/hadoop-0 . .20.2/conf/hadoop-env.sh service.command.tt.heapsize.min 64 The minimum heap space for use by the TaskTracker. Memory allocation for TaskTracker is only used to calculate total memory required for all services to run. The -Xmx TaskTracker value is not set, allowing memory on TaskTracker to grow as needed. If an upper limit on memory is strongly desired, set the HADOOP_HEAPSIZE variable in /opt/mapr/hadoop/hadoop-0 . .20.2/conf/hadoop-env.sh service.command.webserver.heapsize.perce nt 3 The percentage of heap space reserved for the MapR Control System. service.command.webserver.heapsize.max 750 The maximum heap space that can be used by the MapR Control System. service.command.webserver.heapsize.min 512 The minimum heap space for use by the MapR Control System. service.command.os.heapsize.percent 3 The percentage of heap space reserved for the operating system. service.command.os.heapsize.max 750 The maximum heap space that can be used by the operating system. service.command.os.heapsize.min 256 The minimum heap space for use by the operating system. service.nice.value -10 The priority under which all services nice will run. zookeeper.servers 10.250.1.61:5181 The list of ZooKeeper servers. services.retries 3 The number of times the Warden tries to restart a service that fails. services.resetretries.time.sec 3600 Specifies a time interval in seconds. The ser parameter sets the number vices.retries of times that the warden attempts to restart failing services within this interval. services.retryinterval.time.sec 1800 The number of seconds after which the warden will again attempt several times to start a failed service. The number of attempts after each interval is specified by the parameter . services.retries cldb.port 7222 The port for communicating with the CLDB. mfs.port 5660 The port for communicating with the FileServer. hbmaster.port 60000 The port for communicating with the HBase Master. hoststats.port 5660 The port for communicating with the HostStats service. jt.port 9001 The port for communicating with the JobTracker. jt.response.timeout.minutes 10 Specifies an interval in minutes. The warden kills JobTracker services that do not respond within the specified interval and restarts them as normal for failed services. kvstore.port 5660 The port for communicating with the Key/Value Store. mapr.home.dir /opt/mapr The directory where MapR is installed. centralconfig.enabled true Specifies whether to enable central configuration. pollcentralconfig.interval.seconds 300 How often to check for configuration updates, in seconds. rpc.drop false Drop outstanding metrics when the queue to send to hoststats is too large. hs.rpcon true Whether or not to configure Job Management. hs.port 1111 Hoststats listening port for Metrics RPC activity. hs.host localhost Hoststats hostname for RPC activity. log.retention.time 864000000 All and files in the cluster are .log .out kept for a time period defined by the value of the parameter in log.retention.time milliseconds. The default value is ten days. Restart the Warden after changing this value. log.retention.exceptions   You can specify a comma-separated list of exceptions that are not removed during regular log file cleanup. The file names can be partial. Any filename that matches the specified string is not removed during regular log file cleanup. 
enable.overcommit false Set this value to true to allow services to start up even if their memory demands exceed the memory provided by the node. warden.conf services=webserver:all:cldb;jobtracker:1:cldb;tasktracker:all:jobtracker;nfs:all:cldb; kvstore:all;cldb:all:kvstore;hoststats:all:kvstore service.command.jt.start=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh start jobtracker service.command.tt.start=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh start tasktracker service.command.hbmaster.start=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh start master service.command.hbregion.start=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh start regionserver service.command.cldb.start=/etc/init.d/mapr-cldb start service.command.kvstore.start=/etc/init.d/mapr-mfs start service.command.mfs.start=/etc/init.d/mapr-mfs start service.command.nfs.start=/etc/init.d/mapr-nfsserver start service.command.hoststats.start=/etc/init.d/mapr-hoststats start service.command.webserver.start=/opt/mapr/adminuiapp/webserver start service.command.jt.stop=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh stop jobtracker service.command.tt.stop=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh stop tasktracker service.command.hbmaster.stop=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh stop master service.command.hbregion.stop=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh stop regionserver service.command.cldb.stop=/etc/init.d/mapr-cldb stop service.command.kvstore.stop=/etc/init.d/mapr-mfs stop service.command.mfs.stop=/etc/init.d/mapr-mfs stop service.command.nfs.stop=/etc/init.d/mapr-nfsserver stop service.command.hoststats.stop=/etc/init.d/mapr-hoststats stop service.command.webserver.stop=/opt/mapr/adminuiapp/webserver stop service.command.jt.type=BACKGROUND service.command.tt.type=BACKGROUND service.command.hbmaster.type=BACKGROUND service.command.hbregion.type=BACKGROUND service.command.cldb.type=BACKGROUND service.command.kvstore.type=BACKGROUND service.command.mfs.type=BACKGROUND service.command.nfs.type=BACKGROUND service.command.hoststats.type=BACKGROUND service.command.webserver.type=BACKGROUND service.command.jt.monitor=org.apache.hadoop.mapred.JobTracker service.command.tt.monitor=org.apache.hadoop.mapred.TaskTracker service.command.hbmaster.monitor=org.apache.hadoop.hbase.master.HMaster start service.command.hbregion.monitor=org.apache.hadoop.hbase.regionserver.HRegionServer start service.command.cldb.monitor=com.mapr.fs.cldb.CLDB service.command.kvstore.monitor=server/mfs service.command.mfs.monitor=server/mfs service.command.nfs.monitor=server/nfsserver service.command.jt.monitorcommand=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh status jobtracker service.command.tt.monitorcommand=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh status tasktracker service.command.hbmaster.monitorcommand=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon. sh status master service.command.hbregion.monitorcommand=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon. 
sh status regionserver service.command.cldb.monitorcommand=/etc/init.d/mapr-cldb status service.command.kvstore.monitorcommand=/etc/init.d/mapr-mfs status service.command.mfs.monitorcommand=/etc/init.d/mapr-mfs status service.command.nfs.monitorcommand=/etc/init.d/mapr-nfsserver status service.command.hoststats.monitorcommand=/etc/init.d/mapr-hoststats status service.command.webserver.monitorcommand=/opt/mapr/adminuiapp/webserver status # Memory allocation for JobTracker is only used # to calculate total memory required for all services to run # but -Xmx JobTracker itself is not set allowing memory # on JobTracker to grow as needed # if upper limit on memory is strongly desired # set HADOOP_HEAPSIZE env. variable in /opt/mapr/hadoop/hadoop-0.20.2/conf/hadoop-env.sh service.command.jt.heapsize.percent=10 service.command.jt.heapsize.max=5000 service.command.jt.heapsize.min=256 # Memory allocation for TaskTracker is only used # to calculate total memory required for all services to run # but -Xmx TaskTracker itself is not set allowing memory # on TaskTracker to grow as needed # if upper limit on memory is strongly desired # set HADOOP_HEAPSIZE env. variable in /opt/mapr/hadoop/hadoop-0.20.2/conf/hadoop-env.sh service.command.tt.heapsize.percent=2 service.command.tt.heapsize.max=325 service.command.tt.heapsize.min=64 service.command.hbmaster.heapsize.percent=4 service.command.hbmaster.heapsize.max=512 service.command.hbmaster.heapsize.min=64 service.command.hbregion.heapsize.percent=25 service.command.hbregion.heapsize.max=4000 service.command.hbregion.heapsize.min=1000 service.command.cldb.heapsize.percent=8 service.command.cldb.heapsize.max=4000 service.command.cldb.heapsize.min=256 service.command.mfs.heapsize.percent=20 service.command.mfs.heapsize.min=512 service.command.webserver.heapsize.percent=3 service.command.webserver.heapsize.max=750 service.command.webserver.heapsize.min=512 service.command.os.heapsize.percent=3 service.command.os.heapsize.max=750 service.command.os.heapsize.min=256 service.nice.value=-10 zookeeper.servers=10.250.1.61:5181 nodes.mincount=1 services.retries=3 cldb.port=7222 mfs.port=5660 hbmaster.port=60000 hoststats.port=5660 jt.port=9001 kvstore.port=5660 mapr.home.dir=/opt/mapr centralconfig.enabled=true pullcentralconfig.relativepath= pullcentralconfig.freq.millis=300000 rpc.drop=false hs.rpcon=true hs.port=1111 hs.host=localhost log.retention.time=864000000 log.retention.exceptions= enable.overcommit=false exports On each node, the file lists the clusters and mount points available to mount with NFS. /opt/mapr/conf/exports Access control for hosts To specify access control for hosts, list the hosts in comma-separated sets, followed by for read-write or for read-only access. You (rw) (ro) can separate multiple sets with a space. To specify a default access for all hosts not otherwise specified, add or after a space at the (rw) (ro) end of a line. The file follows the same semantics as a standard UNIX exports table. exports Restricting clusters to specific hosts To restrict access to a specific export path to particular hosts, use the following format: <Path> <comma separated list of host(access) sets> For example, the line restricts read-write access to the cluster in to host /mapr/cluster1 a.b.c.d(rw),e.f.g.h(ro) /mapr/cluster1 a , and restricts read-only access to host . No other hosts have access. 
(That is, host a.b.c.d is granted read-write access to /mapr/cluster1 and host e.f.g.h is granted read-only access; no other hosts have access.)
Enabling Central Configuration
To enable Central Configuration for exports, specify a value for the AutoRefreshExportsTimeInterval parameter in the /opt/mapr/conf/nfsserver.conf file. The value of AutoRefreshExportsTimeInterval determines the number of minutes after which the NFS server refreshes the exports file. The default value of 0 disables central configuration for NFS exports.
Sample exports file
After making changes to this file, restart the NFS server.
# Sample Exports file
# for /mapr exports
# <Path> <exports_control>
#access_control -> order is specific to default
# list the hosts before specifying a default for all
# a.b.c.d,1.2.3.4(ro) d.e.f.g(ro) (rw)
# enforces ro for a.b.c.d & 1.2.3.4 and everybody else is rw
# special path to export clusters in mapr-clusters.conf. To disable exporting,
# comment it out. to restrict access use the exports_control
/mapr (rw)
#to export only certain clusters, comment out the /mapr & uncomment.
# Note: this will cause /mapr to be unexported
#/mapr/clustername (rw)
#to export /mapr only to certain hosts (using exports_control)
#/mapr a.b.c.d(rw),e.f.g.h(ro)
# export /mapr/cluster1 rw to a.b.c.d & ro to e.f.g.h (denied for others)
#/mapr/cluster1 a.b.c.d(rw),e.f.g.h(ro)
# export /mapr/cluster2 only to e.f.g.h (denied for others)
#/mapr/cluster2 e.f.g.h(rw)
# export /mapr/cluster3 rw to e.f.g.h & ro to others
#/mapr/cluster3 e.f.g.h(rw) (ro)
zoo.cfg
The /opt/mapr/zookeeper/zookeeper-3.3.2/conf/zoo.cfg file specifies ZooKeeper configuration parameters.
Example
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=20
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=10
# the directory where the snapshot is stored.
dataDir=/var/mapr-zookeeper-data
# the port at which the clients will connect
clientPort=5181
# max number of client connections
maxClientCnxns=100
db.conf
The /opt/mapr/conf/db.conf file specifies configuration parameters for the MapR Metrics database.
Field Default Description
db.url localhost:3306 The URL and port for the MySQL server that stores Metrics data. This machine does not need to be a node in the cluster.
db.user root The MySQL user name.
db.passwd mapr The MySQL password.
db.schema metrics The name of the MySQL schema.
db.mode mysql Reserved for future use.
db.driverclass com.mysql.jdbc.Driver Reserved for future use.
db.joblastaccessed.limit.hours 48 Task and task attempt data for a job are purged for jobs that have not been accessed in a number of hours equal to this parameter's value.
db.partition.finest.count.days 3 Integer number of days for which the finest data granularity is kept. Finest granularity is a ten-second resolution.
db.partition.fine.count.days 15 Integer number of days for which fine data granularity is kept. Fine granularity is a five-minute average of the finest resolution.
db.partition.coarse.count.years 100 Integer number of years for which the coarse data granularity is kept. Coarse granularity is a 24-hour average of the fine resolution.
metric.file.rotate 365 Integer number of days for which metrics files are kept in the local volume for each node.
Example db.conf file db.url=localhost:3306 db.user=root db.passwd=mapr db.schema=metrics db.mode=mysql db.driverclass=com.mysql.jdbc.Driver db.joblastacessed.limit.hours=48 db.partition.finest.count.days=3 db.partition.fine.count.days=15 db.partition.coarse.count.years=100 ### How many files with raw node metrics data to keep metric.file.rotate=365 Any time you make changes to the file, you must restart the service and Warden for those changes to take db.conf hoststats effect. MapR Environment The following topics contain information about the MapR environment: Environment Variables Changes to the sudoers File Ports Used by MapR MapR Parameters The following table lists user-configurable parameters and their defaults. These defaults reflect the values in the default configuration files, plus any overrides shipped out-of-the-box in core-site.xml, mapred-site.xml, or other configuration files. You can override these values by editing or adding them in mapred-site.xml or core-site.xml, using the option to the command when submitting a job, or by setting them -D hadoop jar explicitly in your code. Parameter Default fs.mapr.working.dir   fs.maprfs.impl   fs.ramfs.impl   fs.s3.block.size 33554432 fs.s3.blockSize 33554432 fs.s3.buffer.dir   fs.s3.impl   fs.s3.maxRetries 4 fs.s3.sleepTimeSeconds 10 fs.s3n.block.size 33554432 fs.s3n.blockSize 33554432 fs.s3n.impl   fs.trash.interval 0 hadoop.logfile.count 10 hadoop.logfile.size 10000000 hadoop.native.lib TRUE hadoop.proxyuser.root.groups root hadoop.proxyuser.root.hosts hadoop.rpc.socket.factory.class.default hadoop.security.authentication simple hadoop.security.authorization FALSE hadoop.security.group.mapping   hadoop.security.uid.cache.secs 14400 hadoop.tmp.dir   hadoop.util.hash.type murmur hadoop.workaround.non.threadsafe.getpwuid FALSE io.bytes.per.checksum 512 io.compression.codecs   io.file.buffer.size 8192 io.mapfile.bloom.error.rate 0.005 io.mapfile.bloom.size 1048576 io.serializations   io.skip.checksum.errors FALSE io.sort.factor 256 io.sort.mb 380 io.sort.record.percent 0.17 io.sort.spill.percent 0.99 ipc.client.connect.max.retries 10 ipc.client.connection.maxidletime 10000 ipc.client.idlethreshold 4000 ipc.client.kill.max 10 ipc.client.max.connection.setup.timeout 20 ipc.client.tcpnodelay TRUE ipc.server.listen.queue.size 128 ipc.server.tcpnodelay TRUE job.end.retry.interval 30000 jobclient.completion.poll.interval 5000 jobclient.output.filter FAILED jobclient.progress.monitor.poll.interval 1000 keep.failed.task.files FALSE local.cache.size 1.07E+10 map.sort.class   mapr.centrallog.dir logs mapr.localoutput.dir output mapr.localspill.dir spill mapr.localvolumes.path   mapr.map.keyprefix.ints 1 mapr.task.diagnostics.enabled FALSE mapred.acls.enabled FALSE mapred.child.oom_adj 10 mapred.child.renice 10 mapred.child.taskset TRUE mapred.child.tmp ./tmp mapred.cluster.ephemeral.tasks.memory.limit.mb 200 mapred.cluster.map.memory.mb -1 mapred.cluster.max.map.memory.mb -1 mapred.cluster.max.reduce.memory.mb -1 mapred.cluster.reduce.memory.mb -1 mapred.compress.map.output FALSE mapred.fairscheduler.assignmultiple TRUE mapred.fairscheduler.eventlog.enabled FALSE mapred.fairscheduler.smalljob.max.inputsize 1.07E+10 mapred.fairscheduler.smalljob.max.maps 10 mapred.fairscheduler.smalljob.max.reducer.inputsize 1.07E+09 mapred.fairscheduler.smalljob.max.reducers 10 mapred.fairscheduler.smalljob.schedule.enable TRUE mapred.healthChecker.interval 60000 mapred.healthChecker.script.timeout 600000 mapred.inmem.merge.threshold 1000 
mapred.job.reuse.jvm.num.tasks -1 mapred.job.shuffle.input.buffer.percent 0.7 mapred.job.shuffle.merge.percent 0.66 mapred.job.tracker.handler.count 10 mapred.job.tracker.history.completed.location mapred.job.tracker.http.address   mapred.job.tracker.persist.jobstatus.active FALSE mapred.job.tracker.persist.jobstatus.dir mapred.job.tracker   mapred.jobtracker.instrumentation mapred.jobtracker.job.history.block.size 3145728 mapred.jobtracker.jobhistory.lru.cache.size 5 mapred.jobtracker.maxtasks.per.job -1 mapred.jobtracker.port 9001 mapred.jobtracker.restart.recover TRUE mapred.jobtracker.retiredjobs.cache.size 1000 mapred.jobtracker.retirejob.check 30000 mapred.jobtracker.taskScheduler   mapred.line.input.format.linespermap 1 mapred.local.dir.minspacekill 0 mapred.local.dir.minspacestart 0 mapred.local.dir   mapred.map.child.java.opts -XX:ErrorFile=/opt/cores/mapreduce_java_error%p.log mapred.map.child.log.level INFO mapred.map.max.attempts 4 mapred.map.output.compression.codec mapred.map.tasks.speculative.execution TRUE mapred.map.tasks 2 mapred.maptask.memory.default 800 mapred.max.tracker.blacklists 4 mapred.max.tracker.failures 4 mapred.maxthreads.generate.mapoutput 1 mapred.maxthreads.partition.closer 1 mapred.merge.recordsBeforeProgress 10000 mapred.output.compression.type RECORD mapred.queue.default.state   mapred.queue.names default mapred.reduce.child.java.opts -XX:ErrorFile=/opt/cores/mapreduce_java_error%p.log mapred.reduce.child.log.level INFO mapred.reduce.copy.backoff 300 mapred.reduce.max.attempts 4 mapred.reduce.parallel.copies 12 mapred.reduce.slowstart.completed.maps 0.95 mapred.reduce.tasks.speculative.execution TRUE mapred.reduce.tasks 1 mapred.reducetask.memory.default 1500 mapred.skip.attempts.to.start.skipping 2 mapred.submit.replication 10 mapred.system.dir   mapred.task.cache.levels 2 mapred.task.calculate.resource.usage TRUE mapred.task.profile.maps 0-2 mapred.task.profile.reduces 0-2 mapred.task.profile FALSE mapred.task.timeout 600000 mapred.task.tracker.http.address   mapred.task.tracker.report.address   mapred.task.tracker.task-controller   mapred.tasktracker.dns.interface default mapred.tasktracker.dns.nameserver default mapred.tasktracker.ephemeral.tasks.maximum 1 mapred.tasktracker.ephemeral.tasks.timeout 10000 mapred.tasktracker.ephemeral.tasks.ulimit   mapred.tasktracker.expiry.interval 600000 mapred.tasktracker.indexcache.mb 10 mapred.tasktracker.instrumentation   mapred.tasktracker.map.tasks.maximum -1 mapred.tasktracker.reduce.tasks.maximum -1 mapred.tasktracker.task-controller.config.overwrite TRUE mapred.tasktracker.taskmemorymanager.killtask.maxRSS FALSE mapred.tasktracker.taskmemorymanager.monitoring-interval 3000 mapred.tasktracker.tasks.sleeptime-before-sigkill 5000 mapred.user.jobconf.limit 5242880 mapred.userlog.limit.kb 0 mapred.userlog.retain.hours 24 mapreduce.heartbeat.10 300 mapreduce.heartbeat.100 1000 mapreduce.heartbeat.1000 10000 mapreduce.job.complete.cancel.delegation.tokens TRUE mapreduce.jobtracker.inline.setup.cleanup FALSE mapreduce.jobtracker.node.labels.monitor.interval 120000 mapreduce.jobtracker.recovery.dir   mapreduce.jobtracker.recovery.job.initialization.maxtime 480 mapreduce.jobtracker.recovery.maxtime 480 mapreduce.jobtracker.split.metainfo.maxsize 10000000 mapreduce.jobtracker.staging.root.dir   mapreduce.maprfs.use.compression TRUE mapreduce.reduce.input.limit -1 mapreduce.task.classpath.user.precedence FALSE mapreduce.tasktracker.cache.local.numberdirectories 10000 mapreduce.tasktracker.group mapr 
mapreduce.tasktracker.heapbased.memory.management FALSE mapreduce.tasktracker.jvm.idle.time 10000 mapreduce.tasktracker.outofband.heartbeat TRUE mapreduce.tasktracker.prefetch.maptasks 0 mapreduce.tasktracker.reserved.physicalmemory.mb.low 0.8 mapreduce.tasktracker.task.slowlaunch FALSE mapreduce.tasktracker.volume.healthcheck.interval 60000 maprfs.openfid2.prefetch.bytes 0 tasktracker.http.threads 2 topology.node.switch.mapping.impl   Ports Used by MapR Services and Ports Quick Reference The table below defines the ports used by a MapR cluster, along with the default port numbers. Service Port CLDB 7222 CLDB JMX monitor port 7220 CLDB web port 7221 DNS 53 HBase Master 60000 HBase Master (for GUI) 60010 HBase RegionServer 60020 Hive Metastore 9083 JobTracker 9001 JobTracker web 50030 LDAP 389 LDAPS 636 MFS server 5660 MySQL 3306 NFS 2049 NFS monitor (for HA) 9997 NFS management 9998 NFS VIP service 9997 and 9998 NTP 123 Oozie 11000 Port mapper 111 SMTP 25 SSH 22 TaskTracker web 50060 Web UI HTTPS 8443 Web UI HTTP (off by default) 8080 ZooKeeper 5181 ZooKeeper follower-to-leader communication 2888 ZooKeeper leader election 3888 Port Details The following table shows source and destination nodes for a given port, the purpose of the port, and the file where the port number is set. Destination Port Destination Source Purpose Set In File 22 Nodes running any MapR services Nodes/client running mapr-support-collect.sh or "maprcli disk" API calls mapr-support-collect.sh leverages SSH over port 22 to connect to a shell environment on cluster nodes in which the mapr-support-dump.sh script will be run N/A 53     Domain Name Service N/A 111 Nodes running MapR NFS Services Nodes/clients accessing MapRFS via the NFS protocol RPC Portmap services used to connect to MapRFS via NFSv3 N/A 123     Network Time Protocol N/A 389     Lightweight Directory Access Protocol N/A 636     Lightweight Directory Access Protocol over SSL N/A 2049 Nodes running MapR NFS Services Nodes/clients accessing MapRFS via the NFS protocol NFSv3 access to MapRFS N/A 2888 Nodes running ZooKeeper services Nodes running ZooKeeper services ZooKeeper Server > Server Communication /opt/mapr/zookeeper/zoo keeper-3.3.2/conf/zoo.cfg 3306 Nodes running the MySQL database for system metrics and jobs display Nodes running the mapr-metrics package Used for mySQL traffic between the web services client and its mySQL backend server N/A (system default) 3888 Nodes running ZooKeeper services Nodes running ZooKeeper services ZooKeeper Server > Server Communication /opt/mapr/zookeeper/zoo keeper-3.3.2/conf/zoo.cfg 5181 Nodes running ZooKeeper services Nodes running ZooKeeper services, clients executing ZooKeeper API calls ZooKeeper API calls /opt/mapr/zookeeper/zoo keeper-3.3.2/conf/zoo.cfg , /opt/mapr/conf/warden.co nf, /opt/mapr/conf/cldb.conf, /opt/mapr/hbase/hbase-0. 
90.4/conf/hbase-site.xml, /opt/mapr/hive/hive-0.7.1/conf/hive-site.xml
5660 Nodes running FileServer services Nodes running any MapR services, clients interacting with MapRFS MapRFS API calls /opt/mapr/conf/mfs.conf, /opt/mapr/conf/warden.conf
7220 Nodes running CLDB services CLDB JMX monitor port
7221 Nodes running CLDB services Nodes/clients connecting to the CLDB GUI CLDB GUI /opt/mapr/conf/cldb.conf
7222 Nodes running CLDB services Nodes running any MapR services, clients interacting with MapRFS MapRFS API calls /opt/mapr/conf/cldb.conf, /opt/mapr/conf/warden.conf, /opt/mapr/conf/mapr-clusters.conf
8443 Nodes running MapR GUI services Nodes/clients connecting to the MapR GUI MapR HTTPS GUI /opt/mapr/conf/web.conf
9001 Nodes running JobTracker services Nodes running TaskTracker services, clients submitting/interacting with Map/Reduce jobs JobTracker <--> TaskTracker communication, Hadoop API calls that require interaction with JobTracker services /opt/mapr/conf/warden.conf, /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml
9083 Nodes running the Hive metastore services Nodes/clients performing Hive queries/operations Used by Hive clients to query/access the Hive metastore /opt/mapr/hive/hive-0.7.1/conf/hive-site.xml
9997 Nodes running NFS services Nodes running NFS services NFS VIP management /opt/mapr/conf/nfsserver.conf
9998 Nodes running NFS services Nodes running NFS services NFS VIP management /opt/mapr/conf/nfsserver.conf
11000 Nodes running Oozie services Nodes/clients accessing Oozie services Used by Oozie clients to access the Oozie server /opt/mapr/oozie/oozie-3.0.0/conf/oozie-env.sh
50030 Nodes running JobTracker services Nodes/clients connecting to the JobTracker GUI JobTracker HTTP GUI /opt/mapr/conf/warden.conf, /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml
50060 Nodes running TaskTracker services Nodes/clients connecting to the TaskTracker GUI TaskTracker HTTP GUI /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml
60000 Nodes running HBase Master services Nodes running HBase RegionServer services, clients executing HBase API calls HBase server-to-server communication, HBase API calls /opt/mapr/hbase/hbase-0.90.4/conf/hbase-site.xml, /opt/mapr/conf/warden.conf
60010 Nodes running HBase Master services Nodes/clients connecting to the HBase GUI HBase Master HTTP GUI /opt/mapr/hbase/hbase-0.90.4/conf/hbase-site.xml
60020 Nodes running HBase RegionServer services Nodes running HBase RegionServer services, clients executing HBase API calls HBase server-to-server communication, HBase API calls /opt/mapr/hbase/hbase-0.90.4/conf/hbase-site.xml
Avoiding Port Conflicts
To avoid eventual trouble with port conflicts on your MapR clusters, do one of the following:
Remap the ports for the JobTracker, TaskTracker, HBase Master, and HBase RegionServer services to ports below 32768.
Set the ephemeral port range to stop at 50029 by changing the value in the /proc/sys/net/ipv4/ip_local_port_range file. Note that this setting reduces the number of available ephemeral ports from the default of 28,233 to 17,233.
Best Practices
Disk Setup
It is not necessary to set up RAID on disks used by MapR-FS. MapR uses a script called disksetup to set up storage pools. In most cases, you should let MapR calculate storage pools using the default stripe width of two or three disks. If you anticipate a high volume of random-access I/O, you can use the -W option with disksetup to specify larger storage pools of up to 8 disks each, as in the sketch below.
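The exact invocation depends on your disk layout. As a hedged sketch only, with placeholder disk names and stripe width, and with the -F and -W flags to be confirmed against the disksetup documentation for your release, formatting four data disks into one larger storage pool might look like this:
# List the raw disks to give to MapR-FS (placeholders; do not copy literally)
$ echo -e "/dev/sdb\n/dev/sdc\n/dev/sdd\n/dev/sde" > /tmp/disks.txt
# Format them, using a stripe width of 4 disks per storage pool
$ sudo /opt/mapr/server/disksetup -W 4 -F /tmp/disks.txt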
Setting Up MapR NFS
MapR uses version 3 of the NFS protocol. NFS version 4 bypasses the port mapper and attempts to connect to the default port only. If you are running NFS on a non-standard port, mounts from NFS version 4 clients time out. Use the -o nfsvers=3 mount option to specify NFS version 3, as in the example mount command below.
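For example, a Linux client might mount the cluster over NFSv3 as follows. This is a sketch only: the NFS node hostname (nfsnode01) and the local mount point are placeholders, not values taken from this documentation.
# Create a local mount point and mount the cluster via the MapR NFS gateway
$ sudo mkdir -p /mapr
$ sudo mount -o nfsvers=3 nfsnode01:/mapr /mapr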
mapr-zookeeper mapr-zk-internal mapr-core Setting Up RAID on the Operating System Partition You can set up RAID on the operating system partition(s) or drive(s) at installation time, to provide higher operating system performance (RAID 0), disk mirroring for failover (RAID 1), or both (RAID 10), for example. See the following instructions from the operating system websites: CentOS Red Hat Ubuntu Tuning MapReduce The memory allocated to each MapR service is specified in the file, which MapR automatically configures /opt/mapr/conf/warden.conf based on the physical memory available on the node. For example, you can adjust the minimum and maximum memory used for the To prevent subsequently created volumes from encroaching on the CLDB-only nodes, set a default topology that excludes the CLDB-only topology. Do not install the FileServer package on an isolated ZooKeeper node in order to prevent MapR from using this node for data storage. 1. 2. 3. TaskTracker, as well as the percentage of the heap that the TaskTracker tries to use, by setting the appropriate , , and paramet percent max min ers in the file: warden.conf ... service.command.tt.heapsize.percent=2 service.command.tt.heapsize.max=325 service.command.tt.heapsize.min=64 ... The percentages of memory used by the services need not add up to 100; in fact, you can use less than the full heap by setting the heapsize.p parameters for all services to add up to less than 100% of the heap size. In general, you should not need to adjust the memory settings ercent for individual services, unless you see specific memory-related problems occurring. MapReduce Memory The memory allocated for MapReduce tasks normally equals the total system memory minus the total memory allocated for MapR services. If necessary, you can use the parameter to set the maximum physical memory reserved by mapreduce.tasktracker.reserved.physicalmemory.mb MapReduce tasks, or you can set it to to disable physical memory accounting and task management. -1 If the node runs out of memory, MapReduce tasks are killed by the to free memory. You can use (copy OOM-killer mapred.child.oom_adj from to adjust the parameter for MapReduce tasks. The possible values of range from -17 to +15. mapred-default.xml oom_adj oom_adj The higher the score, more likely the associated process is to be killed by the OOM-killer. Troubleshooting Out-of-Memory Errors When the aggregated memory used by MapReduce tasks exceeds the memory reserve on a TaskTracker node, tasks can fail or be killed. MapR attempts to prevent out-of-memory exceptions by killing MapReduce tasks when memory becomes scarce. If you allocate too little Java heap for the expected memory requirements of your tasks, an exception can occur. The following steps can help configure MapR to avoid these problems: If a particular job encounters out-of-memory conditions, the simplest way to solve the problem might be to reduce the memory footprint of the map and reduce functions, and to ensure that the partitioner distributes map output to reducers evenly. If it is not possible to reduce the memory footprint of the application, try increasing the Java heap size (-Xmx) in the client-side MapReduce configuration. If many jobs encounter out-of-memory conditions, or if jobs tend to fail on specific nodes, it may be that those nodes are advertising too many TaskTracker slots. In this case, the cluster administrator should reduce the number of slots on the affected nodes. 
To reduce the number of slots on a node: Stop the TaskTracker service on the node: $ sudo maprcli node services -nodes <node name> -tasktracker stop Edit the file : /opt/mapr/hadoop/hadoop-<version>/conf/mapred-site.xml Reduce the number of map slots by lowering mapred.tasktracker.map.tasks.maximum Reduce the number of reduce slots by lowering mapred.tasktracker.reduce.tasks.maximum Start the TaskTracker on the node: $ sudo maprcli node services -nodes <node name> -tasktracker start ExpressLane MapR provides an express path (called ExpressLane) that works in conjunction with . ExpressLane is for small MapReduce The Fair Scheduler jobs to run when all slots are occupied by long tasks. Small jobs are only given this special treatment when the cluster is busy, and only if they meet the criteria specified by the following parameters in : mapred-site.xml Parameter Value Description mapred.fairscheduler.smalljob.schedule.ena ble true Enable small job fast scheduling inside fair scheduler. TaskTrackers should reserve a slot called ephemeral slot which is used for smalljob if cluster is busy. mapred.fairscheduler.smalljob.max.maps 10 Small job definition. Max number of maps allowed in small job. mapred.fairscheduler.smalljob.max.reducers 10 Small job definition. Max number of reducers allowed in small job. mapred.fairscheduler.smalljob.max.inputsize 10737418240 Small job definition. Max input size in bytes allowed for a small job. Default is 10GB. mapred.fairscheduler.smalljob.max.reducer.i nputsize 1073741824 Small job definition. Max estimated input size for a reducer allowed in small job. Default is 1GB per reducer. mapred.cluster.ephemeral.tasks.memory.limi t.mb 200 Small job definition. Max memory in mbytes reserved for an ephermal slot. Default is 200mb. This value must be same on JobTracker and TaskTracker nodes. MapReduce jobs that appear to fit the small job definition but are in fact larger than anticipated are killed and re-queued for normal execution. HBase The HBase write-ahead log (WAL) writes many tiny records, and compressing it would cause massive CPU load. Before using HBase, turn off MapR compression for directories in the HBase volume (normally mounted at . Example: /hbase hadoop mfs -setcompression off /hbase You can check whether compression is turned off in a directory or mounted volume by using to list the file contents. hadoop mfs Example: hadoop mfs -ls /hbase The letter in the output indicates compression is turned on; the letter indicates compression is turned off. See for more Z U hadoop mfs information. On any node where you plan to run both HBase and MapReduce, give more memory to the FileServer than to the RegionServer so that the node can handle high throughput. For example, on a node with 24 GB of physical memory, it might be desirable to limit the RegionServer to 4 GB, give 10 GB to MapR-FS, and give the remainder to TaskTracker. To change the memory allocated to each service, edit the file. See for more information. /opt/mapr/conf/warden.conf Tuning Your MapR Install Glossary Term Definition .dfs_attributes A special file in every directory, for controlling the compression and chunk size used for the directory and its subdirectories. .rw A special mount point in the root-level volume (or read-only mirror) that points to the writable original copy of the volume. .snapshot A special directory in the top level of each volume, containing all the snapshots for that volume. access control list A list of permissions attached to an object. 
    An access control list (ACL) specifies users or system processes that can perform specific actions on an object.
accounting entity
    A clearly defined economic unit that is accounted for separately.
ACL
    See access control list.
advisory quota
    An advisory disk capacity limit that can be set for a volume, user, or group. When disk usage exceeds the advisory quota, an alert is sent.
AE
    See accounting entity.
bitmask
    A binary number in which each bit controls a single toggle.
chunk
    Files in MapR-FS are split into chunks (similar to Hadoop blocks) that are 256 MB by default. Any multiple of 65,536 bytes is a valid chunk size, but tuning the size correctly is important. Files inherit the chunk size settings of the directory that contains them, as do subdirectories on which chunk size has not been explicitly set. Any files written by a Hadoop application, whether via the file APIs or over NFS, use the chunk size specified by the settings for the directory where the file is written.
CLDB
    See container location database.
container
    The unit of sharded storage in a MapR cluster. Every container is either a name container or a data container.
container location database
    A service, running on one or more MapR nodes, that maintains the locations of services, containers, and other cluster information.
data container
    One of the two types of containers in a cluster. Data containers typically have a cascaded configuration (master replicates to replica1, replica1 replicates to replica2, and so on). Every data container is either a master container, an intermediate container, or a tail container, depending on its replication role.
desired replication factor
    The number of copies of a volume that should be maintained by the MapR cluster for normal operation. When the number of copies falls below the desired replication factor, but remains equal to or above the minimum replication factor, re-replication occurs after the timeout specified in the cldb.fs.mark.rereplicate.sec parameter.
disk space balancer
    A tool that balances disk space usage on a cluster by moving containers between storage pools. Whenever a storage pool is over 70% full (or over the threshold defined by the cldb.balancer.disk.threshold.percentage parameter), the disk space balancer distributes containers to other storage pools that have lower utilization than the average for that cluster. The disk space balancer aims to ensure that the percentage of space used on all of the disks in the node is similar.
disktab
    A file on each node, containing a list of the node's disks that have been configured for use by MapR-FS.
dump file
    A file containing data from a volume for distribution or restoration. There are two types of dump files: full dump files, which contain all data in a volume, and incremental dump files, which contain the changes to a volume between two points in time.
entity
    A user or group. Users and groups can represent accounting entities.
full dump file
    See dump file.
epoch
    A sequence number that identifies all copies that have the latest updates for a container. The larger the number, the more up-to-date the copy of the container. The CLDB uses the epoch to ensure that an out-of-date copy cannot become the master for the container.
HBase
    A distributed storage system, designed to scale to a very large size, for managing massive amounts of structured data.
heartbeat
    A signal sent by each FileServer and NFS node every second to provide information to the CLDB about the node's health and resource usage.
incremental dump file
    See dump file.
JobTracker
    The process responsible for submitting and tracking MapReduce jobs. The JobTracker sends individual tasks to TaskTrackers on nodes in the cluster.
MapR-FS
    The NFS-mountable, distributed, high-performance MapR data storage system.
minimum replication factor
    The minimum number of copies of a volume that should be maintained by the MapR cluster for normal operation. When the replication factor falls below this minimum, re-replication occurs as aggressively as possible to restore the replication level. If any containers in the CLDB volume fall below the minimum replication factor, writes are disabled until aggressive re-replication restores the minimum level of replication.
mirror
    A read-only physical copy of a volume.
name container
    A container that holds a volume's namespace information and file chunk locations, and the first 64 KB of each file in the volume.
Network File System
    A protocol that allows a user on a client computer to access files over a network as though they were stored locally.
NFS
    See Network File System.
node
    An individual server (physical or virtual machine) in a cluster.
quota
    A disk capacity limit that can be set for a volume, user, or group. When disk usage exceeds the quota, no more data can be written.
recovery point objective
    The maximum allowable data loss, expressed as a point in time. If the recovery point objective is 2 hours, then the maximum acceptable data loss is 2 hours of work.
recovery time objective
    The maximum allowable time to recover after data loss. If the recovery time objective is 5 hours, then it must be possible to restore data up to the recovery point objective within 5 hours. See also recovery point objective.
replication factor
    The number of copies of a volume.
replication role
    The replication role of a container determines how that container is replicated to other storage pools in the cluster. A name container may have one of two replication roles: master or replica. A data container may have one of three replication roles: master, intermediate, or tail.
replication role balancer
    A tool that switches the replication roles of containers to ensure that every node has an equal share of master and replica containers (for name containers) and an equal share of master, intermediate, and tail containers (for data containers).
re-replication
    Re-replication occurs whenever the number of available replica containers drops below the number prescribed by that volume's replication factor. Re-replication may occur for a variety of reasons, including replica container corruption, node unavailability, hard disk failure, or an increase in replication factor.
RPO
    See recovery point objective.
RTO
    See recovery time objective.
schedule
    A group of rules that specify recurring points in time at which certain actions are to occur.
snapshot
    A read-only logical image of a volume at a specific point in time.
storage pool
    A unit of storage made up of one or more disks. By default, MapR storage pools contain two or three disks. For high-volume reads and writes, you can create larger storage pools when initially formatting storage during cluster creation.
stripe width
    The number of disks in a storage pool.
super group
    The group that has administrative access to the MapR cluster.
super user
    The user that has administrative access to the MapR cluster.
TaskTracker
    The process that starts and tracks MapReduce tasks on a node.
    The TaskTracker receives task assignments from the JobTracker and reports the results of each task back to the JobTracker on completion.
volume
    A tree of files, directories, and other volumes, grouped for the purpose of applying a policy or set of policies to all of them at once.
warden
    A MapR process that coordinates the starting and stopping of configured services on a node.
ZooKeeper
    A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Source Code for MapR Software

MapR releases source code to the open-source community for enhancements that MapR has made to the Apache Hadoop project and other ecosystem components. MapR regularly releases updates to Apache Hadoop ecosystem projects as the projects are released by Apache, after MapR verifies that the changes do not impact product stability. Releases of ecosystem components are independent of the release cycle for the core MapR distribution for Hadoop, so that new updates can be released quickly and efficiently.

Source code developed by MapR can be found on GitHub at http://github.com/mapr as of March 2013, coincident with version 2.1.2 of the MapR distribution. MapR may also release source code for other MapR projects at github.com/mapr. For each release that MapR includes in its distribution, MapR branches and tags the release on GitHub using the underlying project release number with -mapr appended.

Component Repositories on GitHub

The following repositories are available on GitHub for components that MapR has enhanced, patched, or created:

oozie
hcatalog
pig
hive
mahout
hbase
flume
whirr
opentsdb
sqoop
scribe

Finding Source Changes Prior to February 2013

GitHub is the single, central location for tracking changes that MapR applies to components in releases of the MapR distribution. Prior to February 2013, MapR included a list of patches in each component directory, as shown below. This information is no longer stored in the installation directory for recent releases, and is instead available on GitHub.

Example: Location of Information about MapR Patches to HBase Prior to February 2013

$ ls /opt/mapr/hbase/hbase-0.92.1/
bin          hbase-0.92.1.jar        LICENSE.txt         pom.xml
CHANGES.txt  hbase-0.92.1-tests.jar  logs                README.txt
conf         hbase-webapps           mapr-hbase-patches  security
conf.new     lib                     NOTICE.txt          src
$ ls /opt/mapr/hbase/hbase-0.92.1/mapr-hbase-patches/
0000-hbase-with-mapr.patch                  0006-hbase-6285-fix.patch
0001-hbase-wait-for-fs+set-chunksize.patch  0007-hbase-6375-fix.patch
0002-hbase-source-env-vars.patch            0008-hbase-6455-fix.patch
0003-hbase-6158-fix.patch                   0009-bug-7745-fix.patch
0004-hbase-6018-fix.patch                   Readme.txt
0005-hbase-6236-fix.patch
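As an illustration only, the following is a minimal sketch of cloning one of the MapR component repositories and listing its release tags. The repository name (hive) is taken from the list above; the exact repository URL and the '*-mapr*' tag pattern are assumptions based on the naming convention described in this section.

# Clone the MapR fork of Hive and list tags that follow the
# <project release number>-mapr convention described above (pattern assumed).
$ git clone https://github.com/mapr/hive.git
$ cd hive
$ git tag -l '*-mapr*'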